On the Generalization Power of
Overfitted Two-Layer Neural Tangent Kernel Models
Abstract
In this paper, we study the generalization performance of min $\ell_2$-norm overfitting solutions for the neural tangent kernel (NTK) model of a two-layer neural network with ReLU activation that has no bias term. We show that, depending on the ground-truth function, the test error of overfitted NTK models exhibits characteristics that are different from the “double descent” of other overparameterized linear models with simple Fourier or Gaussian features. Specifically, for a class of learnable functions, we provide a new upper bound on the generalization error that approaches a small limiting value, even when the number of neurons approaches infinity. This limiting value further decreases with the number of training samples. For functions outside this class, we provide a lower bound on the generalization error that does not diminish to zero even when both the number of neurons and the number of training samples are large.
1 Introduction
Recently, there has been significant interest in understanding why overparameterized deep neural networks (DNNs) can still generalize well (Zhang et al., 2017; Advani et al., 2020), which seems to defy the classical understanding of the bias-variance tradeoff in statistical learning (Bishop, 2006; Hastie et al., 2009; Stein, 1956; James & Stein, 1992; LeCun et al., 1991; Tikhonov, 1943). Towards this direction, a recent line of study has focused on overparameterized linear models (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Ju et al., 2020; Mei & Montanari, 2019). For linear models with simple features (e.g., Gaussian features and Fourier features) (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Ju et al., 2020), an interesting “double-descent” phenomenon has been observed. That is, there is a regime where the number of model parameters (or linear features) is larger than the number of samples (and thus overfitting occurs), but the generalization error actually decreases with the number of features. However, linear models with these simple features are still quite different from nonlinear neural networks. Thus, although such results provide some hints as to why overparameterization and overfitting may be harmless, it is still unclear whether similar conclusions apply to neural networks.
In this paper, we are interested in linear models based on the neural tangent kernel (NTK) (Jacot et al., 2018), which can be viewed as a useful intermediate step towards modeling nonlinear neural networks. Essentially, the NTK model is a linear approximation of a neural network when the weights of the neurons do not change much. Indeed, Li & Liang (2018) and Du et al. (2018) have shown that, for a wide and fully-connected two-layer neural network, both the neuron weights and their activation patterns do not change much after gradient descent (GD) training with a sufficiently small step size. As a result, such a shallow and wide neural network is approximately linear in the weights when there are sufficiently many neurons, which suggests the utility of the NTK model.
Despite its linearity, however, characterizing the double descent of such an NTK model remains elusive. The work in Mei & Montanari (2019) also studies the double descent of a linear version of a two-layer neural network. It uses the so-called “random-feature” model, where the bottom-layer weights are random and fixed, and only the top-layer weights are trained. (In comparison, the NTK model for such a two-layer neural network corresponds to training only the bottom-layer weights.) However, the setting there requires the number of neurons, the number of samples, and the data dimension to all grow proportionally to infinity. In contrast, we are interested in the setting where the number of samples is given, and the number of neurons is allowed to be much larger than the number of samples. As a consequence of this different setting, in Mei & Montanari (2019) eventually only linear ground-truth functions can be learned. (Similar settings are also studied in d’Ascoli et al. (2020).) In contrast, we will show that far more complex functions can be learned in our setting. In a related work, Ghorbani et al. (2019) shows that both the random-feature model and the NTK model can approximate highly nonlinear ground-truth functions with a sufficient number of neurons. However, Ghorbani et al. (2019) mainly studies the expressiveness of the models, and therefore does not explain why overfitting solutions can still generalize well. To the best of our knowledge, our work is the first to characterize the double descent of overfitting solutions based on the NTK model.
Specifically, in this paper we study the generalization error of the min $\ell_2$-norm overfitting solution for a linear model based on the NTK of a two-layer neural network with ReLU activation that has no bias. Only the bottom-layer weights are trained. We are interested in min $\ell_2$-norm overfitting solutions because gradient descent (GD) can be shown to converge to such solutions while driving the training error to zero (Zhang et al., 2017) (see also Section 2). Given a class of ground-truth functions (see details in Section 3), which we refer to as “learnable functions,” our main result (Theorem 1) provides an upper bound on the generalization error of the min $\ell_2$-norm overfitting solution for the two-layer NTK model, for a given number of training samples and any finite number of neurons larger than a polynomial function of the number of samples. This upper bound confirms that the generalization error of the overfitting solution indeed exhibits descent in the overparameterized regime as the number of neurons increases. Further, our upper bound can also account for the noise in the training samples.
Our results reveal several important insights. First, we find that the (double) descent of the overfitted two-layer NTK model is drastically different from that of linear models with simple Gaussian or Fourier features (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019). Specifically, for linear models with simple features, when the number of features increases, the generalization error will eventually grow again and approach the so-called “null risk” (Hastie et al., 2019), which is the error of a trivial model that always predicts zero. In contrast, for the class of learnable functions described earlier, the generalization error of the overfitted NTK model continues to descend as the number of neurons grows to infinity, and approaches a limiting value that depends on the number of samples. Further, when there is no noise, this limiting value decreases to zero as the number of samples increases. This difference is shown in Fig. 1(a). As the number of features increases, the test mean-square-error (MSE) of the min-norm overfitting solutions for Fourier features (blue and red curves) eventually grows back to the null risk (the black dashed line), even though it exhibits a descent at smaller numbers of features. In contrast, the error of the overfitted NTK model continues to descend to a much lower level.
The second important insight is that the aforementioned behavior critically depends on the ground-truth function belonging to the class of “learnable functions.” Further, this class of learnable functions depends on the specific network architecture. For our NTK model (with ReLU activation that has no bias), we precisely characterize this class of learnable functions. Specifically, for ground-truth functions that are outside the class of learnable functions, we show a lower bound on the generalization error that does not diminish to zero for any number of samples and neurons (see Proposition 2 and Section 4). This difference is shown in Fig. 1(b), where we use an almost identical setting as Fig. 1(a), except a different ground-truth function. We can see in Fig. 1(b) that the test error of the overfitted NTK model is always above the null risk and looks very different from that in Fig. 1(a). We note that whether certain functions are learnable or not critically depends on the specific structure of the NTK model, such as the choice of the activation unit. Recently, Satpathi & Srikant (2021) showed that all polynomials can be learned by a two-layer NTK model with ReLU activation that has a bias term, provided that the number of neurons is sufficiently large. (See further discussions in Remark 2. However, Satpathi & Srikant (2021) does not characterize the descent of the generalization error as the number of neurons increases.) This difference in the class of learnable functions between the two settings (ReLU with or without bias) also turns out to be consistent with the difference in the expressiveness of the neural networks. That is, shallow networks with biased ReLU are known to be universal function approximators (Ji et al., 2019), while those without bias can only approximate the sum of linear functions and even functions (Ghorbani et al., 2019).

A closely related result to ours is the work in Arora et al. (2019), which characterizes the generalization performance of wide two-layer neural networks whose bottom-layer weights are trained by gradient descent (GD) to overfit the training samples. In particular, our class of learnable functions almost coincides with that of Arora et al. (2019). This is not surprising because, when the number of neurons is large, NTK becomes a close approximation of such two-layer neural networks. In that sense, the results in Arora et al. (2019) are even more faithful in following the GD dynamics of the original two-layer network. However, the advantage of the NTK model is that it is easier to analyze. In particular, the results in this paper can quantify how the generalization error descends with the number of neurons. In contrast, the results in Arora et al. (2019) provide only a generalization bound that is independent of the number of neurons (provided that it is sufficiently large), but do not quantify the descent behavior as the number of neurons increases. Our numerical results in Fig. 1(a) suggest that, over a wide range of network widths, the descent behavior of the NTK model (the green curve) matches well with that of two-layer neural networks trained by gradient descent (the cyan curve). Thus, we believe that our results also provide guidance for the latter model. The work in Fiat et al. (2019) studies a different neural network architecture with gated ReLU, whose NTK model turns out to be the same as ours. However, similar to Arora et al. (2019), the result in Fiat et al. (2019) does not capture the speed of descent with respect to the number of neurons either. Second, Arora et al. (2019) only provides upper bounds on the generalization error. There is no corresponding lower bound to explain whether ground-truth functions outside a certain class are not learnable. Our result in Proposition 2 provides such a lower bound, and therefore more completely characterizes the class of learnable functions. (See further comparison in Remark 1 of Section 3 and Remark 3 of Section 5.) Another related work, Allen-Zhu et al. (2019), also characterizes the class of learnable functions for two-layer and three-layer networks. However, Allen-Zhu et al. (2019) studies a training method that takes a new sample in every iteration, and thus does not overfit all training data. Finally, our paper studies the generalization of NTK models in the regression setting, which is different from the classification setting that assumes a separability condition, e.g., in Ji & Telgarsky (2019).
2 Problem Setup
We assume a standard data model in which each output equals the ground-truth function evaluated at the input, plus an independent noise term. We are given a set of training samples generated from this model, and we collect the training inputs and outputs into a matrix and a vector, respectively. After training (to be described below), we denote the trained model by the learned function. Then, for any new test input, the test error is the difference between the learned function and the ground-truth function at that input, and the mean squared error (MSE) is the expectation of the squared test error over the input distribution.
For training, consider a fully-connected two-layer neural network with a given number of neurons. Each neuron has a scalar top-layer weight and a vector of bottom-layer weights (see Fig. 2). We collect the bottom-layer weights of all neurons into a single long column vector, and use subscript ranges to denote the corresponding sub-vectors of its elements. We choose ReLU as the activation function for all neurons, and there is no bias term in the ReLU activation function.
Now we are ready to introduce the NTK model (Jacot et al., 2018). We fix the top-layer weights and let the initial bottom-layer weights be randomly chosen. We then train only the bottom-layer weights, and denote the bottom-layer weights after training accordingly. The change of the output after training is then
In the NTK model, one assumes that the change of the bottom-layer weights is very small. As a result, the ReLU activation pattern of most neurons remains the same as at initialization. Thus, the change of the output can be approximated by
where is given by , and is given by
(1)
In the NTK model, we assume that the output of the trained model is exactly given by Eq. (1), i.e.,
(2) |
In other words, the NTK model can be viewed as a linear approximation of the two-layer network when the change of the bottom-layer weights is small.
Define the design matrix whose rows are the NTK feature vectors of the training inputs. Throughout the paper, we will focus on the following min $\ell_2$-norm overfitting solution
Whenever it exists, this solution can be written in closed form as
(3) |
The reason that we are interested in this solution is that gradient descent (GD) or stochastic gradient descent (SGD) for the NTK model in Eq. (2) is known to converge to it (proven in Supplementary Material, Appendix B).
Using Eq. (2) and Eq. (3), the trained model is then
(4) |
In the rest of the paper, we will study the generalization error of Eq. (4).
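To make Eqs. (2)–(4) concrete, the following is a minimal numerical sketch of the NTK feature map and the min $\ell_2$-norm overfitting solution. Everything here is an illustrative assumption rather than the paper's exact setup: the sizes, the $1/\sqrt{p}$ feature scaling, the omission of the $\pm 1$ top-layer weights (which only flip the signs of feature blocks and do not change the predictions), and the choice of ground-truth function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 1000, 3        # illustrative sizes: samples, neurons, input dimension

def sample_sphere(k, dim, rng):
    """k i.i.d. points uniformly distributed on the unit sphere."""
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

X = sample_sphere(n, d, rng)            # training inputs
W0 = sample_sphere(p, d, rng)           # initial bottom-layer weight directions
f_true = lambda Z: Z[:, 0] ** 2         # an even ground-truth function (taken to be learnable; cf. Section 4)
y = f_true(X)                           # noiseless training outputs

def ntk_features(Z, W0):
    """Bias-free ReLU NTK feature map: for each neuron j, the feature block of input x
    is x * 1{x^T w_j0 > 0}; the 1/sqrt(p) scaling is an illustrative simplification."""
    act = (Z @ W0.T > 0).astype(float)                 # activation patterns
    feats = act[:, :, None] * Z[:, None, :]            # shape (num_points, p, d)
    return feats.reshape(Z.shape[0], -1) / np.sqrt(W0.shape[0])

H = ntk_features(X, W0)                                # n x (p*d) design matrix
delta_w = H.T @ np.linalg.solve(H @ H.T, y)            # min l2-norm overfitting solution, Eq. (3)

X_test = sample_sphere(2000, d, rng)
test_mse = np.mean((ntk_features(X_test, W0) @ delta_w - f_true(X_test)) ** 2)
print("training residual:", np.linalg.norm(H @ delta_w - y))
print("test MSE:", test_mse)
```

Increasing the number of samples and neurons in this sketch mimics the descent behavior discussed in the introduction.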
We collect some assumptions. The inputs are assumed to lie on the unit sphere. Without loss of generality, we make the following assumptions: (i) the inputs are i.i.d. uniformly distributed on the unit sphere, and the initial weights are i.i.d. uniformly distributed in all directions; (ii) the model is overparameterized, i.e., the number of parameters is at least the number of training samples; (iii) no two training inputs are parallel to each other. We provide detailed justification of these assumptions in Supplementary Material, Appendix C.
3 Learnable Functions and Generalization Performance
We now show that the generalization performance of the overfitted NTK model in Eq. (4) crucially depends on the ground-truth function: good generalization occurs only when the ground-truth function is “learnable.” Below, we first describe a candidate class of ground-truth functions and explain why they may correspond to the class of “learnable functions.” Then, we give an upper bound on the generalization performance for this class of ground-truth functions. Finally, we give a lower bound on the generalization performance when the ground-truth function is outside of this class.
We first define a set of ground-truth functions.
Definition 1.
.
Note that in Definition 1, the equality means that the two functions are equal almost everywhere. The weight function may be any finite-valued function in the corresponding function space. Further, we also allow it to contain (as components) Dirac δ-functions on the unit sphere. Note that a δ-function has zero value everywhere except at a single point, yet integrates to one. Thus, the weight function may contain any sum of δ-functions and finite-valued functions. (Alternatively, we can also interpret the weight function as a signed measure (Rao & Rao, 1983) on the unit sphere. Then, δ-functions correspond to point masses, and the boundedness condition implies that the corresponding unsigned version of the measure is bounded.)
To see why may correspond to the class of learnable functions, we can first examine what the learned function in Eq. (4) should look like. Recall that . Thus, , where denotes the -th standard basis. Combining Eq. (3) and Eq. (4), we can see that the learned function in Eq. (4) is of the form
(5) |
For all , define , and its cardinality is given by
(6) |
Then, using Eq. (1), we can show . It is not hard to show that
(7) |
where the arrow denotes convergence in probability (see Supplementary Material, Appendix D.5). Thus, if we let
(8) |
then, as the number of neurons tends to infinity, Eq. (3) should approach a function in this class. This explains why it is a candidate class of “learnable functions.” However, note that the above discussion only addresses the expressiveness of the model. It is still unclear whether any function in this class can be learned with low generalization error. Theorem 1 below provides the answer; the short numerical sketch after this paragraph illustrates the convergence in Eq. (7).
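The following minimal sketch illustrates the convergence in Eq. (7): the fraction of neurons whose ReLU is activated by both a fixed test input and a fixed training input stabilizes as the number of neurons grows. The dimension, sizes, and sampling scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

x, x_i = sample_sphere(2, d, rng)        # a test input and a training input
for p in [100, 1_000, 10_000, 100_000]:
    W0 = sample_sphere(p, d, rng)                   # random initial weight directions
    both_active = (W0 @ x > 0) & (W0 @ x_i > 0)     # neurons activated by both inputs
    print(p, both_active.mean())                    # fraction |C|/p stabilizes as p grows
```

The value at which the fraction stabilizes depends only on the angle between the two inputs (see Appendices D.4 and D.5).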
For some , define (recall that is the dimension of )
(9) |
Theorem 1.
Assume a ground-truth function in the class of Definition 1 whose weight function is square-integrable. (This requirement can be relaxed: we show in Supplementary Material, Appendix L that, even when the weight function is a δ-function, we can still obtain a result similar to Eq. (10), but Term 1 will have a slower speed of decay in the number of samples than shown in Eq. (10); Term 4 of Eq. (10) will also be different in that case, but it still goes to zero when the number of samples and the number of neurons are large.) Then, for any admissible choice of the parameters in Eq. (9) and for almost every realization of the training inputs, we must have the following bound, where the probability is over the random initial weights:
(10)
To interpret Theorem 1, we can first focus on the noiseless case, where the noise variance and Term 3 are zero. If we fix the number of samples and let the number of neurons tend to infinity, then Terms 2, 5, and 6 all approach zero. We can then conclude that, in the noiseless and heavily overparameterized setting, the generalization error converges to a small limiting value (Term 1) that depends only on the number of samples. Further, this limiting value (Term 1) converges to zero (as do Terms 4 and 7) as the number of samples grows, i.e., when there are sufficiently many training samples. Finally, Theorem 1 holds even when there is noise.
The parameters of and can be tuned to make Eq. (10) sharper when and are large. For example, as we increase , Term 1 will approach . Although a larger makes Terms 4, 5, and 6 bigger, as long as and are sufficiently large, those terms will still be close to 0. Similarly, if we increase , then will approach the order of . As a result, Term 3 approaches the order of times and the requirement approaches the order of .
Remark 1.
We note that Arora et al. (2019) shows that, for two-layer neural networks whose bottom-layer weights are trained by gradient descent, the generalization error for a sufficiently large number of neurons has the following upper bound, which holds with high probability:
(11) |
where the remaining quantities are as defined in Arora et al. (2019). For a certain class of learnable functions (we will compare them with ours in Section 4), the corresponding quantity is bounded. Thus, the bound in Eq. (11) also decreases as the number of samples grows. The second term in Eq. (11) contains the minimum eigenvalue of the corresponding Gram matrix, which decreases with the number of samples. (Indeed, we show an upper bound on this minimum eigenvalue in Supplementary Material, Appendix G.) Thus, Eq. (11) may decrease a little more slowly, which is consistent with Term 1 in Eq. (10) (when the number of samples is large). Note that one term in Eq. (11) captures how the complexity of the ground-truth function affects the generalization error. Similarly, the norm of the weight function also captures the impact of the complexity of the ground-truth function in Eq. (10). (Although Term 1 in Eq. (10) in its current form does not depend on this norm, it is possible to modify our proof so that the norm of the weight function also enters Term 1.) However, we caution that the GD solution in Arora et al. (2019) is based on the original neural network, which is usually different from our min $\ell_2$-norm solution based on the NTK model (even though they are close for a very large number of neurons). Thus, the two results may not be directly comparable.
Theorem 1 reveals several important insights on the generalization performance when the ground-truth function belongs to the learnable class.
(i) Descent in the overparameterized region: When the number of neurons increases, both sides of Eq. (10) decrease, suggesting that the test error of the overfitted NTK model decreases with the number of neurons. In Fig. 1(a), we choose a ground-truth function in the learnable class (we will explain why this function is learnable later in Section 4). The test MSE of the aforementioned NTK model (green curve) confirms the overall trend of descent in the overparameterized region. (This curve oscillates at the early stage when the number of neurons is small. We suspect this is because, at small widths, the convergence in Eq. (7) has not yet taken effect, and thus the randomness in the initial weights makes the simulation results more volatile.) We note that while Arora et al. (2019) provides a generalization error upper bound for a large number of neurons (i.e., Eq. (11)), the upper bound there does not capture the dependency on the number of neurons and thus does not predict this descent.
More importantly, we note a significant difference between the descent in Theorem 1 and that of min $\ell_2$-norm overfitting solutions for linear models with simple features (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Liao et al., 2020; Jacot et al., 2020). For example, for linear models with Gaussian features, we can obtain (see, e.g., Theorem 2 of Belkin et al. (2019)):
(12) |
where the last quantity denotes the variance of the noise. If we let the number of features go to infinity in Eq. (12), we can see that the MSE quickly approaches the so-called “null risk” (Hastie et al., 2019), i.e., the MSE of a model that always predicts zero. Note that the null risk is at the level of the signal, and thus is quite large. In contrast, as the number of neurons grows, the test error of the NTK model converges to a value determined by the number of samples and the noise level (and is independent of the null risk). This difference is confirmed in Fig. 1(a), where the test MSE for the NTK model (green curve) is much lower than the null risk (the dashed line) at large widths, while both min-norm solutions with Fourier features (the red and blue curves; Ju et al. (2020)) rise to the null risk as the number of features grows. Finally, note that the descent in Theorem 1 requires the number of neurons to increase much faster than the number of samples. Specifically, to keep Term 2 in Eq. (10) small, it suffices to let the number of neurons increase a little faster than a certain polynomial of the number of samples. This is again quite different from the descent shown in Eq. (12) and in other related work using Fourier and Gaussian features (Liao et al., 2020; Jacot et al., 2020), where the number of features only needs to grow proportionally with the number of samples.
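For comparison, the sketch below evaluates a closed-form risk expression of the type in Eq. (12). The specific formula is an assumed stand-in (the standard finite-sample expression for the min $\ell_2$-norm interpolator with isotropic Gaussian features, in the spirit of Belkin et al. (2019) and Hastie et al. (2019)), not a restatement of Eq. (12) itself.

```python
import numpy as np

# Assumed stand-in for Eq. (12): expected risk of the min l2-norm interpolator with
# p isotropic Gaussian features, n samples, signal energy ||beta||^2 and noise
# variance sigma^2, in the overparameterized regime p > n + 1:
#   risk(p) = ||beta||^2 * (1 - n / p) + sigma^2 * n / (p - n - 1)
def gaussian_feature_risk(p, n, signal_energy, sigma2):
    assert p > n + 1
    return signal_energy * (1.0 - n / p) + sigma2 * n / (p - n - 1)

n, signal_energy, sigma2 = 100, 1.0, 0.01
for p in [102, 150, 300, 1_000, 10_000, 1_000_000]:
    print(p, round(gaussian_feature_risk(p, n, signal_energy, sigma2), 4))
# After an initial descent, the risk climbs back toward the null risk ||beta||^2 = 1
# as p grows, unlike the NTK behavior described around Theorem 1.
```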
(ii) Speed of the descent: Since Theorem 1 holds for a finite number of neurons, it also characterizes the speed of descent. In particular, Term 2 depends explicitly on the number of neurons and shrinks as the number of neurons becomes large. Again, such a speed of descent is not captured in Arora et al. (2019). As we show in Fig. 1(a), the test error of the gradient descent solution under the original neural network (cyan curve) is usually quite close to that of the NTK model (green curve). Thus, our result provides useful guidance on how fast the generalization error descends with the number of neurons for such neural networks.

(iii) The effect of noise: Term 3 in Eq. (10) characterizes the impact of the noise, and it neither decreases nor increases with the number of neurons. Notice that this is again very different from Eq. (12), i.e., the result for min $\ell_2$-norm overfitting solutions with simple features, where the noise term vanishes as the number of features goes to infinity. We use Fig. 3(a) to validate this insight. In Fig. 3(a), we fix the number of samples and plot the test MSE of the NTK overfitting solution as the number of neurons increases. We let the noise in each training sample be i.i.d. Gaussian with zero mean, and consider three different variances. The green, red, and blue curves in Fig. 3(a) correspond to these three noise levels, respectively. We can see that all three curves become flat when the number of neurons is very large, which implies that the gap across different noise levels does not shrink as the number of neurons grows, in contrast to Eq. (12).
In Fig. 3(b), we instead fix the number of neurons and increase the number of samples. We plot the test MSE both for the noiseless setting (green curve) and for a noisy setting (red curve). The difference between the two curves (dashed blue curve) then captures the impact of noise, which is related to Term 3 in Eq. (10). Somewhat surprisingly, we find that the dashed blue curve is insensitive to the number of samples, which suggests that Term 3 in Eq. (10) may have room for improvement.
In summary, we have shown that any ground-truth function in the learnable class leads to low generalization error for overfitted NTK models. It is then natural to ask what happens if the ground-truth function is not in this class. To answer this, we consider the closure of the learnable class, and measure the distance from a ground-truth function to this class by the infimum of its distance to every function in the class. (Here we work in the corresponding normed space of functions on the unit sphere. Notice that, although the weight function in Definition 1 may not belong to this space, the resulting learnable function always does; in particular, it is bounded everywhere whenever the weight function has bounded total mass.)
Proposition 2.
(i) For any given number of samples, there exists a function in the closure of the learnable class such that, uniformly over all test inputs, the learned model converges to it as the number of neurons goes to infinity. (ii) Consequently, if the ground-truth function is outside the closure of the learnable class (or, equivalently, has positive distance to it), then the MSE of the learned model (with respect to the ground-truth function) is at least a positive value determined by this distance.
Intuitively, Proposition 2 (proven in Supplementary Material, Appendix J) suggests that, if a ground-truth function is outside the closure of the learnable class, then no matter how large the number of samples is, the test error of an NTK model with infinitely many neurons cannot be small (regardless of whether or not the training samples contain noise). We validate this in Fig. 1(b), where a ground-truth function is chosen outside the learnable class. The test MSE of the NTK overfitting solution (green curve) is above the null risk (dashed black line) and thus is much higher than in Fig. 1(a). We also plot the test MSE of the GD solution of the real neural network (cyan curve), which seems to show the same trend.
4 What Exactly are the Functions in the Learnable Class?
Our expression for the learnable functions in Definition 1 is still in an indirect form, i.e., through an unknown weight function. In Arora et al. (2019), the authors show that, for a similar two-layer network with ReLU activation that has no bias, all finite linear combinations of powers of inner products between the input and fixed directions are learnable by GD (assuming a large number of neurons and a small step size), provided that every power is either one or an even number. In the following, we will show that our learnable functions in Definition 1 have a similar form. Further, we can show that functions given by odd powers of degree at least three of such inner products are not learnable. Our characterization uses an interesting connection to harmonics and filtering on the sphere, which may be of independent interest.
Towards this end, we first note that the integral form in Definition 1 can be viewed as a convolution on the unit sphere. Specifically, for any function in the learnable class, we can rewrite it as
(13)
(14) |
where the filter depends only on the inner product between the input and the rotated direction, and the rotation is an orthogonal matrix chosen from the set of all rotations of the sphere. An important property of the convolution in Eq. (13) is that it corresponds to multiplication in the frequency domain, similar to Fourier coefficients. To define such a transformation to the frequency domain, we use a set of hyper-spherical harmonics (Vilenkin, 1968; Dokmanic & Petrinovic, 2009), which form an orthonormal basis for functions on the sphere. These harmonics are indexed by a degree and an order (all non-negative integers, with finitely many orders for each degree). Any function on the sphere (including even δ-functions (Li & Wong, 2013)) can be decomposed uniquely into these harmonics, with coefficients given by the projections onto the corresponding basis functions. In Eq. (13), let the harmonic coefficients of the filter and of the weight function be given by their respective decompositions. Then, we must have (Dokmanic & Petrinovic, 2009)
(15) |
where the right-hand side involves a normalization constant. Notice that in Eq. (15), the coefficient of the filter enters only through its degree (not its order), which is due to the intrinsic rotational symmetry of such a convolution (Dokmanic & Petrinovic, 2009).
The above decomposition has an interesting “filtering” interpretation as follows. We can regard the kernel as a “filter” or “channel,” and the weight function as a transmitted “signal.” Then, the learnable function in Eq. (13) and Eq. (15) can be regarded as the received signal after the transmitted signal goes through the channel/filter. Therefore, when a coefficient of the filter is non-zero, the corresponding coefficient of the received signal can take any value (because we can arbitrarily choose the transmitted signal). In contrast, if a coefficient of the filter is zero, then the corresponding coefficients of the received signal must also be zero for all orders.
Ideally, if the filter contained all “frequencies,” i.e., all of its coefficients were non-zero, then the received signal could also contain all “frequencies,” which means that the learnable class would contain almost all functions. Unfortunately, this is not true for the filter given in Eq. (14). Specifically, using the harmonics defined in Dokmanic & Petrinovic (2009), the relevant basis functions turn out to have the form
(16) |
where the leading coefficients are positive constants. Note that the expression in Eq. (16) contains either only even powers (if the degree is even) or only odd powers (if the degree is odd). Then, for the filter in Eq. (14), we have the following proposition (proven in Supplementary Material, Appendix K.4). We note that Basri et al. (2019) contains a similar harmonic analysis, where an expression for the filter coefficients is given. However, it is not obvious from the expression in Basri et al. (2019) that the coefficients must be non-zero for all the relevant degrees, which is made clear by Proposition 3 as follows.
Proposition 3.
The harmonic coefficient of the filter is zero for every odd degree greater than or equal to three, and is non-zero for degree one and for every even degree.
We are now ready to characterize which functions are learnable. By the form of Eq. (16), any even power of an inner product between the input and a fixed direction is a linear combination of harmonics of even degree, and any odd power is a linear combination of harmonics of odd degree. By Proposition 3, we thus conclude that any such power whose exponent is one or an even number can be written in the form of Eq. (15) in the frequency domain, and is therefore learnable. In contrast, any power whose exponent is odd and at least three cannot be written in the form of Eq. (15), and is therefore not learnable. Further, the distance from such a function to the learnable class is positive (it equals the norm of the corresponding non-learnable component), so the generalization-error lower bound in Proposition 2 applies. Finally, by Eq. (13), the learnable class is invariant under rotations and finite linear combinations. Therefore, any finite sum of the learnable powers above (with arbitrary directions and coefficients) also belongs to the learnable class.
For the special case of dimension two, the input corresponds to an angle on the circle, and the above-mentioned harmonics become the Fourier series of sines and cosines. We can then obtain similar results on which frequencies are learnable (while the others are not), which explains the learnable and non-learnable ground-truth functions used in Fig. 1. Details can be found in Supplementary Material, Appendix K.5.
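As a numerical sanity check of this picture in dimension two, the sketch below computes the Fourier cosine coefficients of the filter, where the filter is taken (as an assumption consistent with the convolution form above) to be $K(\theta) = \cos\theta \cdot 1\{\cos\theta > 0\}$, i.e., the bias-free ReLU kernel on the circle.

```python
import numpy as np

# Assumed d = 2 filter: K(theta) = cos(theta) * 1{cos(theta) > 0}.
# Its Fourier cosine coefficients indicate which frequencies can pass through
# the convolution in Eq. (13) on the circle.
theta = np.linspace(-np.pi, np.pi, 400_001)
dt = theta[1] - theta[0]
K = np.maximum(np.cos(theta), 0.0)

for l in range(9):
    coeff = np.sum(K * np.cos(l * theta)) * dt / np.pi    # simple numerical quadrature
    print(l, round(coeff, 6))
# Numerically, the coefficients are non-zero for l = 0, l = 1 and even l,
# and vanish for odd l >= 3, consistent with Proposition 3.
```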
Remark 2.
We caution that the above claim on non-learnable functions critically depends on the network architecture. That is, we assume throughout this paper that the ReLU activation has no bias. It is known from an expressiveness point of view that, using ReLU without bias, a shallow network can only approximate the sum of linear functions and even functions (Ghorbani et al., 2019). Thus, it is not surprising that other odd-power (but non-linear) polynomials cannot be learned. In contrast, by adding a bias, a shallow network using ReLU becomes a universal approximator (Ji et al., 2019). The recent work of Satpathi & Srikant (2021) shows that polynomials with all powers can be learned by the corresponding two-layer NTK model. These results are consistent with ours because a ReLU activation with a bias can be equivalently viewed as a bias-free ReLU operating on an input with one extra dimension whose last coordinate is fixed. Even though only a subset of functions is learnable in the higher-dimensional space, when projected onto the lower-dimensional subspace, these functions may already span all functions. For example, an odd power of an inner product in the original dimension can be written as a linear combination of functions that are learnable in the augmented dimension. (See Supplementary Material, Appendix K.6 for details.) It remains an interesting question whether similar differences arise for other network architectures (e.g., with more than two layers).
5 Proof Sketch of Theorem 1
In this section, we sketch the key steps to prove Theorem 1. Starting from Eq. (3), we have
(17) |
For the learned model given in Eq. (4), the error for any test input is then
(18) |
In the classical “bias-variance” analysis of the MSE (Belkin et al., 2018a), the first term on the right-hand side of Eq. (18) contributes to the bias and the second term contributes to the variance. We first quantify the second term (i.e., the variance) in the following proposition.
Proposition 4.
For any , , , if , we must have .
The proof is in Supplementary Material, Appendix F. Proposition 4 implies that, for a fixed number of samples and fixed training inputs, when the number of neurons is large, with high probability the variance will not exceed a certain factor of the noise. In other words, the variance will not go to infinity as the number of neurons grows. The main step in the proof is to lower bound the minimum eigenvalue of the Gram matrix. Note that this is the main place where we use the assumption that the input is uniformly distributed. We expect that our main proof techniques can be generalized to other distributions (with a different eigenvalue estimate), which we leave for future work.
Remark 3.
In the upper bound in Arora et al. (2019) (i.e., Eq. (11)), any noise added to the training outputs contributes to the generalization upper bound in Eq. (11) by a positive term that involves the minimum eigenvalue of the corresponding Gram matrix. Thus, their upper bound may also grow as this minimum eigenvalue decreases. One of the contributions of Proposition 4 is to characterize this minimum eigenvalue.
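To make the role of this minimum eigenvalue concrete, the sketch below estimates the smallest eigenvalue of the (scaled) Gram matrix of bias-free ReLU NTK features for random inputs and random initial weights. The $1/p$ scaling and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

X = sample_sphere(n, d, rng)          # training inputs on the unit sphere

def min_eig_gram(X, p, rng):
    """lambda_min of the NTK Gram matrix, scaled by 1/p so that its entries
    (x_i . x_k) * |C_{x_i, x_k}| / p converge as p grows."""
    W0 = sample_sphere(p, X.shape[1], rng)
    act = (X @ W0.T > 0).astype(float)           # n x p activation patterns
    G = (X @ X.T) * (act @ act.T) / p
    return np.linalg.eigvalsh(G)[0]

for p in [100, 1_000, 10_000, 100_000]:
    print(p, min_eig_gram(X, p, rng))
# The smallest eigenvalue stabilizes at a positive value as p grows; up to the
# illustrative scaling, this is the quantity whose lower bound drives Proposition 4.
```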
We now bound the bias part. We first study the class of ground-truth functions that can be learned for a fixed realization of the initial weights. We refer to them as pseudo ground-truth functions, to differentiate them from the set of learnable functions under random initial weights. They are defined with respect to the same weight function, so that we can later extend the analysis to the “real” ground-truth functions in the learnable class when considering the randomness of the initial weights.
Definition 2.
Given the initial weights, for any learnable ground-truth function with its corresponding weight function, define the corresponding pseudo ground-truth as
The reason that this class of functions may be the learnable functions for a fixed initialization is similar to the discussion around Eq. (3) and Eq. (6). Indeed, using the same choice as in Eq. (8), the learned function in Eq. (3) for a fixed initialization is always of the form in Definition 2.
The following proposition gives an upper bound on the generalization performance when the data model is based on the pseudo ground-truth and the NTK model uses exactly the same initial weights.
Proposition 5.
Assume fixed (thus and are also fixed), there is no noise. If the ground-truth function is in Definition 2 and , then for any and , we have .
The proof is in Supplementary Material, Appendix H. Note that the threshold of the probability event and the upper bound coincide with Term 1 and Term 4, respectively, in Eq. (10). Here we sketch the proof of Proposition 5. Based on the definition of the pseudo ground-truth, we can rewrite the vector of training outputs in terms of the underlying weight function evaluated at the initial weights. From Eq. (3) and Eq. (4), we can see that the learned model applies an orthogonal projection onto the row space of the design matrix. Thus, the bias can be written in terms of the distance from a certain vector to the row space of the design matrix. Note that this distance is no greater than the distance between that vector and any particular point in the row space. Thus, in order to obtain an upper bound, we only need to find one point in the row space that makes this distance as small as possible, especially when the number of samples is large. Our proof uses a specific candidate vector constructed from the training inputs. See Supplementary Material, Appendix H for the rest of the details.
The final step is to allow the initial weights to be random. Given any realization of the random initial weights, any learnable ground-truth function can be viewed as the sum of a pseudo ground-truth function (defined with the same weight function) and a difference term. This difference can be viewed as a special form of “noise,” and thus we can use Proposition 4 to quantify its impact. Further, the magnitude of this “noise” decreases with the number of neurons (because of Eq. (7)). Combining this argument with Proposition 5, we can then prove Theorem 1. See Supplementary Material, Appendix I for details. The linear-algebra fact underlying the bias bound is illustrated by the short sketch below.
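The fact used above (the projection residual is bounded by the distance to any particular point of the row space) can be checked with a tiny, purely hypothetical example; the random matrices and the candidate coefficient vector below are arbitrary stand-ins, not the objects used in the actual proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 20, 500                         # q plays the role of the large feature dimension
H = rng.standard_normal((n, q))        # stand-in design matrix
v = rng.standard_normal(q)             # stand-in for the vector projected in the proof

P = H.T @ np.linalg.solve(H @ H.T, H)  # orthogonal projection onto the row space of H
residual = np.linalg.norm(v - P @ v)   # distance from v to the row space

a = H @ v / q                          # an arbitrary (hypothetical) candidate coefficient vector
upper = np.linalg.norm(v - H.T @ a)    # distance from v to one particular point H^T a
print(residual, "<=", upper)           # always holds; a good choice of `a` makes the bound tight
```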
6 Conclusions
In this paper, we studied the generalization performance of the min $\ell_2$-norm overfitting solution for a two-layer NTK model. We provide a precise characterization of the learnable ground-truth functions for such models, by providing a generalization upper bound for all functions in the learnable class, and a generalization lower bound for all functions outside (the closure of) this class. We show that, while the test error of the overfitted NTK model also exhibits descent in the overparameterized regime, the descent behavior can be quite different from the double descent of linear models with simple features.
There are several interesting directions for future work. First, based on Fig. 3(b), our estimate of the effect of noise could be further improved. Second, it would be interesting to explore whether the methodology can be extended to NTK models for other neural networks, e.g., with different activation functions and with more than two layers.
Acknowledgements
This work is partially supported by an NSF sub-award via Duke University (IIS-1932630), by NSF grants CNS-1717493, CNS-1901057, and CNS-2007231, and by Office of Naval Research under Grant N00014-17-1-241. The authors would like to thank Professor R. Srikant at the University of Illinois at Urbana-Champaign and anonymous reviewers for their valuable comments and suggestions.
References
- Advani et al. (2020) Advani, M. S., Saxe, A. M., and Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.
- Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pp. 6158–6169, 2019.
- Arora et al. (2019) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332, 2019.
- Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
- Basri et al. (2019) Basri, R., Jacobs, D., Kasten, Y., and Kritchman, S. The convergence rate of neural networks for learned functions of different frequencies. arXiv preprint arXiv:1906.00425, 2019.
- Belkin et al. (2018a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018a.
- Belkin et al. (2018b) Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549, 2018b.
- Belkin et al. (2019) Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
- Bishop (2006) Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
- Chaudhry et al. (1997) Chaudhry, M. A., Qadir, A., Rafique, M., and Zubair, S. Extension of euler’s beta function. Journal of computational and applied mathematics, 78(1):19–32, 1997.
- Dokmanic & Petrinovic (2009) Dokmanic, I. and Petrinovic, D. Convolution on the -sphere with application to pdf modeling. IEEE transactions on signal processing, 58(3):1157–1170, 2009.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
- Dutka (1981) Dutka, J. The incomplete beta function—a historical profile. Archive for history of exact sciences, pp. 11–29, 1981.
- d’Ascoli et al. (2020) d’Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. Double trouble in double descent: Bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pp. 2280–2290. PMLR, 2020.
- Fiat et al. (2019) Fiat, J., Malach, E., and Shalev-Shwartz, S. Decoupling gating from linearity. arXiv preprint arXiv:1906.05032, 2019.
- Ghorbani et al. (2019) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
- Goemans (2015) Goemans, M. Chernoff bounds, and some applications. URL http://math.mit.edu/goemans/18310S15/chernoff-notes.pdf, 2015.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
- Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- Hayes (2005) Hayes, T. P. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2005.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
- Jacot et al. (2020) Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. Implicit regularization of random feature models. In International Conference on Machine Learning, pp. 4631–4640. PMLR, 2020.
- James & Stein (1992) James, W. and Stein, C. Estimation with quadratic loss. In Breakthroughs in Statistics, pp. 443–460. Springer, 1992.
- Ji & Telgarsky (2019) Ji, Z. and Telgarsky, M. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- Ji et al. (2019) Ji, Z., Telgarsky, M., and Xian, R. Neural tangent kernels, transportation mappings, and universal approximation. arXiv preprint arXiv:1910.06956, 2019.
- Ju et al. (2020) Ju, P., Lin, X., and Liu, J. Overfitting can be harmless for basis pursuit, but only to a degree. Advances in Neural Information Processing Systems, 33, 2020.
- LeCun et al. (1991) LeCun, Y., Kanter, I., and Solla, S. A. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pp. 918–924, 1991.
- Li (2011) Li, S. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.
- Li & Liang (2018) Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31:8157–8166, 2018.
- Li & Wong (2013) Li, Y. and Wong, R. Integral and series representations of the dirac delta function. arXiv preprint arXiv:1303.1943, 2013.
- Liao et al. (2020) Liao, Z., Couillet, R., and Mahoney, M. W. A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. arXiv preprint arXiv:2006.05013, 2020.
- Mei & Montanari (2019) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 2299–2303. IEEE, 2019.
- Rao & Rao (1983) Rao, K. B. and Rao, M. B. Theory of charges: a study of finitely additive measures. Academic Press, 1983.
- Satpathi & Srikant (2021) Satpathi, S. and Srikant, R. The dynamics of gradient descent for overparametrized neural networks. In 3rd Annual Learning for Dynamics and Control Conference (L4DC), 2021.
- Stein (1956) Stein, C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical report, Stanford University Stanford United States, 1956.
- Tikhonov (1943) Tikhonov, A. N. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pp. 195–198, 1943.
- Vilenkin (1968) Vilenkin, N. Y. Special functions and the theory of group representations. Providence: American Mathematical Society, 1968.
- Wainwright (2015) Wainwright, M. Uniform laws of large numbers, 2015. https://www.stat.berkeley.edu/~mjwain/stat210b/Chap4_Uniform_Feb4_2015.pdf, Accessed: Feb. 7, 2021.
- Wendel (1962) Wendel, J. G. A problem in geometric probability. Math. Scand, 11:109–111, 1962.
- Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, 2017.
Appendix A Extra Notations
In addition to the notation introduced in the main body of this paper, we need some extra notation in the following appendices. The distribution of the initial weights is denoted by a probability density on the weight space, and the directions of the initial weights (i.e., the normalized initial weights) follow a probability density on the unit sphere. We also use the Lebesgue measure on Euclidean spaces of the appropriate dimension.
Let denote the binomial distribution, where is the number of trials and is the success probability. Let denote the regularized incomplete beta function Dutka (1981). Let denote the beta function Chaudhry et al. (1997). Specifically,
(19)
(20) |
Define a cap on a unit hyper-sphere as the intersection of with an open ball in centered at with radius , i.e.,
(21) |
Remark 4.
For ease of exposition, we will sometimes neglect the subscript of and use instead, when the quantity that we are estimating only depends on but not . For example, where we are interested in the area of , it only depends on but not . Thus, we write instead.
For any such that , define two halves of the cap as
(22) |
Define the set of directions of the initial weights ’s as
(23) |
Appendix B GD (gradient descent) Converges to Min -Norm Solutions
We assume that the GD algorithm for minimizing the training MSE is given by
(24) |
where denotes the solution in the -th GD iteration (), and denotes the step size of the -th iteration.
Lemma 6.
If exists and GD in Eq. (24) converges to zero-training loss (i.e., ), then .
Proof.
Because the iterates start from zero and each update in Eq. (24) adds a vector in the row space of the design matrix, every iterate remains in that row space. Thus, we can write the limit as a linear combination of the rows. When GD converges to zero training loss, the limit also interpolates the training data. A vector that both lies in the row space and interpolates the training data must equal the min $\ell_2$-norm solution in Eq. (3). Therefore, the GD limit is the min $\ell_2$-norm solution. ∎
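A quick numerical check of Lemma 6 under illustrative choices (a random Gaussian stand-in for the design matrix and a constant step size): gradient descent initialized at zero drives the training loss to zero and converges to the min $\ell_2$-norm solution of Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 10, 200                           # underdetermined: more parameters than samples
H = rng.standard_normal((n, q))          # stand-in design matrix
y = rng.standard_normal(n)               # stand-in training targets

w = np.zeros(q)                          # zero initialization, as in the NTK model
lr = 0.5 / np.linalg.norm(H, 2) ** 2     # a safe constant step size
for _ in range(5_000):
    w -= lr * H.T @ (H @ w - y)          # gradient step on (1/2) * ||H w - y||^2

w_min_norm = H.T @ np.linalg.solve(H @ H.T, y)     # closed-form min l2-norm solution
print(np.linalg.norm(H @ w - y))                   # training residual: essentially zero
print(np.linalg.norm(w - w_min_norm))              # distance to the min-norm solution: essentially zero
```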
Appendix C Assumptions and Justifications
Because the inputs can always be normalized by preprocessing, we assume without loss of generality that they lie on the unit sphere. For simplicity, we focus on the simplest situation where the randomness of the inputs and of the initial weights is uniform. Nonetheless, the methods and results of this paper can be readily generalized to other continuous distributions, which we leave for future work. We thus make the following Assumption 1.
Assumption 1.
The inputs are uniformly distributed on the unit sphere. The initial weights are uniform in all directions. In other words, both of the corresponding densities are uniform.
We study the overparameterized and overfitted setting, so in this paper we always assume , i.e., the number of parameters is larger than or equal to the number of training samples . The situation of is relatively trivial, so we only consider the case . We then make Assumption 2.
Assumption 2.
and .
If the inputs are continuous random vectors, then almost surely no two training inputs are equal or antipodal (because the probability that a continuous random variable equals a given value is zero), i.e., no two training inputs are parallel; similar non-degeneracy conditions also hold almost surely. We thus make Assumption 3.
Assumption 3.
for any , and for any .
With these assumptions, the following lemma says that when the number of neurons is large enough, with high probability the design matrix has full row rank (and thus the closed-form solution in Eq. (3) exists).
Lemma 7.
.
Proof.
See Appendix E. ∎
Appendix D Some Useful Supporting Results
Here we collect some useful lemmas that are needed for proofs in other appendices, many of which are estimations of certain quantities that we will use later.
D.1 Quantities related to the area of a cap on a hyper-sphere
The following lemma is introduced by Li (2011), which gives the area of a cap on a hyper-sphere with respect to the colatitude angle.
Lemma 8.
Let denote the colatitude angle of the smaller cap on , then the area (in the measure of ) of this hyper-spherical cap is
The following lemma is another representation of the area of the cap with respect to the radius (recall the definition of in Eq. (21) and Remark 4).
Lemma 9.
If , then we have
Proof.
The area of a cap can be interpreted as the probability of the event that a uniformly-distributed random vector falls into that cap. We have the following lemma.
Lemma 10.
Suppose that a random vector follows uniform distribution in all directions. Given any and for any , we have
Proof.
Notice that is a hyper-spherical cap. Define its colatitude angle as . We have . Thus, we have . By Lemma 8, we then have
Further, by symmetry, we have
Because follows uniform distribution in all directions, we have
∎
D.2 Estimation of certain norms
In this subsection, we bound a certain norm in Lemma 11 and upper bound the norm of the product of two matrices by the product of their norms in Lemma 12. Finally, Lemma 13 states that if two vectors differ a lot, then the sum of their norms cannot be too small.
Lemma 11.
for any .
Proof.
Lemma 12.
If , then . Here , , and could be scalars, vectors, or matrices.
Proof.
This lemma follows directly from the definition of the matrix norm. ∎
Remark 5.
Note that the matrix norm (i.e., spectral norm) of a vector is exactly its vector norm (i.e., Euclidean norm). (To see this, consider a row or column vector; in both cases, the value of the matrix norm equals the Euclidean norm of the vector.) Therefore, when applying Lemma 12, we do not need to worry about whether the objects involved are matrices or vectors.
Lemma 13.
For any , we have
Proof.
It is easy to prove that is convex. Thus, we have
∎
D.3 Estimates of certain tail probabilities
The following is the (restated) Corollary 5 of Goemans (2015).
Lemma 14.
If the random variable follows , then for all , we have
The following lemma is the (restated) Theorem 1.8 of Hayes (2005).
Lemma 15 (Azuma–Hoeffding inequality for random vectors).
Let be i.i.d. random vectors with zero mean (of the same dimension) in a real Euclidean space such that for all . Then, for every ,
In the following lemma, we use Azuma–Hoeffding inequality to upper bound the deviation of the empirical mean value of a bounded random vector from its expectation.
Lemma 16.
Let be i.i.d. random vectors (of the same dimension) in a real Euclidean space such that for all . Then, for any ,
Proof.
Because , we have . By triangle inequality, we have , i.e.,
(26) |
We also have
(27) |
We then have
∎
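A small simulation in the spirit of Lemma 16, with an illustrative choice of bounded i.i.d. random vectors (uniform directions, zero mean, unit norm): the norm of their empirical mean shrinks at roughly the $1/\sqrt{m}$ rate.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for m in [100, 10_000, 1_000_000]:
    Z = sample_sphere(m, d, rng)          # i.i.d., zero mean, norm exactly 1
    print(m, np.linalg.norm(Z.mean(axis=0)), 1 / np.sqrt(m))   # deviation vs 1/sqrt(m) scale
```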
D.4 Calculation of certain integrals
The following lemma calculates the ratio between the intersection area of two hyper-hemispheres and the area of the whole hyper-sphere.
Lemma 17.
(28) |
(Recall that denotes the distribution of the normalized version of on and is assumed to be uniform in all directions.)
Before we give the proof of Lemma 17, we give its geometric explanation.
Geometric explanation of Eq. (28): Since the direction of the random weight is uniform on the sphere, the integral on the left-hand side of Eq. (28) represents the probability that a random point falls into the intersection of the two hyper-hemispheres determined by the two given directions. We can calculate that probability by
(29) |
where the angle between the two vectors is measured in radians; this leads to Eq. (28).


To help readers understand Eq. (29), we give examples for 2D and 3D in Fig. 4 and Fig. 5, respectively. In the 2D case depicted in Fig. 4, the two given directions determine two semi-circles, and their intersection is an arc. Notice that the angle subtended by this arc equals π minus the angle θ between the two directions. Therefore, the ratio of the length of the arc to the perimeter of the circle equals (π − θ)/(2π). Similarly, in the 3D case depicted in Fig. 5, the spherical lune is the intersection of the hemisphere in the first direction and the hemisphere in the second direction. We can see that the area of the spherical lune is still proportional to the angle of the lune. Thus, we still have the result that the area of the spherical lune is a fraction (π − θ)/(2π) of the area of the whole sphere. The proof below, on the other hand, applies to arbitrary dimensions.
Proof.
Due to symmetry, we know that the integral of Eq. (28) only depends on the angle between and . Thus, without loss of generality, we let
where
(30) |
Thus, for any that makes and , it only needs to satisfy
(31) |
We compute the spherical coordinates where and with the convention that
Thus, we have . Similarly, the spherical coordinates for is . Let the spherical coordinates for be . Thus, Eq. (31) is equivalent to
(32)
(33) |
Because (by the convention of spherical coordinates), we have
Thus, for Eq. (32) and Eq. (33), we have
i.e., . We have
(by defining )
The result of this lemma thus follows. ∎
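A Monte Carlo check of the geometric explanation above, under illustrative dimensions: the probability that a uniformly random point on the sphere has positive inner products with both of two fixed directions is $(\pi - \theta)/(2\pi)$, where $\theta$ is the angle between the two directions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 2_000_000                     # illustrative dimension and Monte Carlo sample size

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

u, w = sample_sphere(2, d, rng)                          # two fixed directions
theta = np.arccos(np.clip(u @ w, -1.0, 1.0))             # angle between them

Z = sample_sphere(m, d, rng)                             # uniform points on the sphere
empirical = np.mean((Z @ u > 0) & (Z @ w > 0))           # both inner products positive
print(empirical, "vs", (np.pi - theta) / (2 * np.pi))    # matches up to Monte Carlo error
```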
D.5 Limits of when
We introduce some notions given by Wainwright (2015).
Glivenko-Cantelli class. Let be a class of integrable real-valued functions with domain , and let be a collection of i.i.d. samples from some distribution over . Consider the random variable
which measures the maximum deviation (over the class ) between the sample average and the population average . We say that is a Glivenko-Cantelli class for if converges to zero in probability as .
Polynomial discrimination. A class of functions with domain has polynomial discrimination of order if for each positive integer and collection of points in , the set has cardinality upper bounded by
The following lemma is shown in Page 108 of Wainwright (2015).
Lemma 18.
Any bounded function class with polynomial discrimination is Glivenko-Cantelli.
For our case, we care about the following value.
In the language of the Glivenko-Cantelli class, the function class here consists of functions that map each initial weight direction to 0 or 1, where every pair of inputs corresponds to a distinct function in the class. According to Lemma 18, we need to calculate the order of the polynomial discrimination for this class. Towards this end, we need the following lemma, which can be derived from the counting quantity in Wendel (1962) (which is the quantity appearing in the following lemma).
Lemma 19.
Given , the number of different values (i.e., the cardinality) of the set is at most , where
Intuitively, Lemma 19 counts the number of different regions into which hyper-planes through the origin (i.e., the kernels of the inner products with each training input) can cut the sphere, because all directions within one region correspond to the same value of the tuple of activation indicators. For example, in the 2D case, diameters of a circle can cut the whole circle into at most twice as many arcs as there are diameters. Notice that if some training inputs are parallel (so that some diameters overlap), then the total number of different parts can only be smaller. That is why Lemma 19 states that the cardinality is “at most” the given quantity.
The following lemma shows that the cardinality in Lemma 19 is polynomial in .
Lemma 20.
Recall the definition in Lemma 19. For any integer and , we must have .
Proof.
When , because when , we have (the last step uses and ). When , because , we have . In summary, for any integer and , the result always holds. ∎
We can now calculate the order of the polynomial discrimination for the function class . Because
by Lemma 19 and Lemma 20, we know that
(Here means .)
Thus, the function class has polynomial discrimination of the order given above. Notice that all functions in the class are bounded because their outputs can only be 0 or 1. Therefore, by Lemma 18 (i.e., any bounded function class with polynomial discrimination is Glivenko-Cantelli), we know that the class is Glivenko-Cantelli. In other words, we have shown the following lemma.
Lemma 21.
(34) |
Appendix E Proof of Lemma 7 ( has full row-rank with high probability as )
In this section, we prove Lemma 7, i.e., the matrix has full row-rank with high probability when . We first introduce two useful lemmas as follows.
The following lemma states that, given training inputs satisfying Assumption 3, for every index there always exists a vector that is orthogonal to that training input but not orthogonal to any other training input. An intuitive explanation is that, because no two training inputs are parallel (as stated in Assumption 3), the set of vectors that are orthogonal to at least two training inputs is too small. That gives us many options to pick a vector that is orthogonal to one input but not to the others.
Lemma 22.
For all we have
Proof.

The following lemma plays an important role in answering whether has full row-rank. Further, it is also closely related to our estimation on the later in Appendix F.
Lemma 23.
Consider any . For any satisfying , we define
(37) |
If there exist such that
(38) |
then we must have
(39)
(40)
(41) |
(Notice that Eq. (38) implies .)
Remark 6.
We first give an intuitive geometric interpretation of Lemma 23. In Fig. 6, the sphere centered at denotes , the vector denotes , the vector denotes one of other ’s, the vector denotes , which is perpendicular to (i.e., ). The upper half of the cap denotes , the lower half of the cap denotes . The great circle cuts the sphere into two semi-spheres. The semi-sphere in the direction of corresponds to all vectors on the sphere that have positive inner product with (i.e., ), and the semi-sphere in the opposite direction of corresponds to all vectors on the sphere that have negative inner product with (i.e., ). The great circle is similar to the great circle , but is perpendicular to the direction (i.e., ). By choosing the radius of the cap in Eq. (37), we can ensure that all great circles that are perpendicular to other ’s do not pass the cap . In other words, for the two semi-spheres cut by the great circle perpendicular to , , the cap must be contained in one of them. Therefore, vectors on the upper half of the cap and the vectors on the lower half of the cap must have the same sign when calculating the inner product with all ’s, for all .
Now, let us consider the meaning of Eq. (38) in this geometric setup depicted in Fig. 6. The expression means that the direction of is in the upper half of the cap . By the definition of in Eq. (1), we must then have . Similarly, the expression means that the direction of is in the lower half of the cap , and thus . Then, based on the discussions in the previous paragraph, we know that and has the same activation pattern under ReLU for all ’s that , which implies that . These are precisely the conclusions in Eqs. (39)(40)(41).
Later in Appendix F, Lemma 23 plays an important role in estimating . To see this, let denote the -th element of . By Eq. (39), we have . By Eq. (40) and Eq. (41), we have . Combining them, we have . As long as is not zero, then regardless of the values of the other elements of , is a non-zero vector. This implies , which will be useful for estimating in Appendix F.
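The following small simulation sketches the geometry described in Remark 6, under assumed illustrative parameters (the dimension, inputs, and cap radius below are placeholders, not the paper's exact quantities): all directions inside a sufficiently small cap around a direction orthogonal to one input share the same ReLU activation pattern with respect to all other inputs, while the sign of the inner product with that one input can go either way.

```python
# A sketch of the cap geometry in Remark 6, with assumed parameters.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 8
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

i = 0
z = rng.standard_normal(d)
z -= (z @ X[i]) * X[i]
z /= np.linalg.norm(z)                               # z is orthogonal to x_i but (generically) to no other x_j

# Cap radius chosen smaller than the smallest margin |x_j^T z| over j != i.
eps = 0.5 * np.min(np.abs(np.delete(X @ z, i)))

patterns, signs_i = set(), set()
for _ in range(2000):
    g = rng.standard_normal(d)
    delta = eps * rng.uniform() * g / np.linalg.norm(g)   # perturbation with norm at most eps
    v = z + delta
    patterns.add(tuple(np.sign(np.delete(X @ v, i)) > 0))
    signs_i.add(bool(X[i] @ v > 0))

print("distinct activation patterns over j != i:", len(patterns))   # exactly 1
print("signs of x_i^T v observed inside the cap:", signs_i)         # typically both True and False
```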
Proof.
Now, we are ready to prove Lemma 7.
Proof.
We prove by contradiction. Suppose on the contrary that with some nonzero probability, the design matrix is not full row-rank as . Note that when the design matrix is not full row-rank, there exists a set of indices such that
(44) |
The proof consists of two steps: 1) find an event that happens almost surely when ; 2) prove that this event contradicts Eq. (44).
Step 1:
For all , we define several events as follows.
(Recall the geometric interpretation in Remark 6. The events and mean that there exists in the upper half and in the lower half of the cap , respectively. The event means that there exist in both halves of the cap . Finally, the event occurs when occurs for all , although the vector that falls into the two halves may differ across . As we will show later, whenever the event occurs, the matrix has full row-rank, which is why we are interested in the probability of the event .)
These definitions imply that
(47) | |||
(48) |
Thus, we have
(49) |
For a fixed , recall that by Eq. (46), we have . Because and are two halves of , we have
(50) |
Therefore, we have
(all ’s are independent and Assumption 1)
Notice that is determined only by , and is independent of and . Therefore, we have
(51) |
Plugging Eq. (51) into Eq. (49), we have
Step 2:
To complete the proof, it remains to show that the event contradicts Eq. (44). Towards this end, we assume the event happens. By Eq. (44), we can pick one . Further, by the definition of , there exists such that and . In other words, there must exist such that
By Lemma 23, we have
(52) | |||
(53) |
We now show that restricted to the columns corresponding to and cannot be linearly dependent. Specifically, we have
This contradicts the assumption Eq. (44) that
The result thus follows. ∎
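As a Monte-Carlo sanity check on Step 1 of the proof above (with an assumed dimension and cap angle, since the paper's exact quantities are elided here), the following sketch estimates the probability that m i.i.d. uniformly distributed directions land in both halves of a fixed spherical cap around a direction orthogonal to a given input; since each half is hit by a single uniform direction with a fixed positive probability, this probability approaches one as m grows, mirroring the almost-sure event used in the proof.

```python
# Monte-Carlo sketch with assumed, illustrative parameters.
import numpy as np

rng = np.random.default_rng(2)
d = 4
x_i = np.array([1.0, 0.0, 0.0, 0.0])
z = np.array([0.0, 1.0, 0.0, 0.0])                   # z is orthogonal to x_i
cos_cap = np.cos(0.4)                                # cap of colatitude 0.4 around z

def both_halves_hit(m):
    V = rng.standard_normal((m, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    in_cap = V @ z >= cos_cap
    return np.any(in_cap & (V @ x_i > 0)) and np.any(in_cap & (V @ x_i < 0))

for m in (10, 100, 1000, 10000):
    prob = np.mean([both_halves_hit(m) for _ in range(200)])
    print(f"m = {m:6d}:  estimated P(both halves of the cap are hit) = {prob:.2f}")
```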
Appendix F Proof of Proposition 4 (the upper bound of the variance)
The following lemma shows the relationship between the variance term and .
Lemma 24.
Proof.
We have
(54) |
Thus, we have
∎
The following lemma shows our estimation on .
Lemma 25.
For any , , , if , we must have
Proposition 4 directly follows from Lemma 25 and Lemma 24. (We can see that the key step in the proof of Proposition 4 is to estimate . Lemma 25 shows a lower bound of , which is almost when is large. However, our estimate of this value may be loose. We will show an upper bound, which is ; see Appendix G.)
In the rest of this section, we show how to prove Lemma 25. The following lemma shows that, to estimate , it is equivalent to estimate .
Lemma 26.
Proof.
Perform the singular value decomposition (SVD) of as , where
By properties of singular values, we have
We also have
This equation is indeed the eigenvalue decomposition of , which implies that its eigenvalues are . Thus, we have
∎
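The linear-algebra fact behind Lemma 26 is standard; the following quick check (stated in generic form, since the paper's matrix is not reproduced in this excerpt) verifies that for a full-row-rank matrix the smallest eigenvalue of H Hᵀ equals the square of the smallest singular value of H.

```python
# Generic numerical check of the eigenvalue/singular-value relation used in Lemma 26.
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 50                                  # n rows (samples), p >> n columns (features)
H = rng.standard_normal((n, p))

sigma_min = np.linalg.svd(H, compute_uv=False).min()
lambda_min = np.linalg.eigvalsh(H @ H.T).min()
print(sigma_min**2, lambda_min)               # the two values agree up to numerical error
```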
Therefore, to finish the proof of Proposition 4, it only remains to estimate .
By Lemma 7 and its proof in Appendix E, we have already shown that is unlikely to be zero (i.e., ) when . Here, we use essentially the same method as in Appendix E, but with more precise quantification.
Recall the definitions in Eqs. (21)(22)(23). For any , we choose one
(55) |
(Note that here, unlike in Eq. (45), we do not require for all . This is important as we would like to be independent of for all .) Further, for any , we define
(56) |
Then, we define
(57) | |||
(58) |
(Note that here or may be zero. Later we will show that they can be lower bounded with high probability.) Define
(59) |
Similar to Remark 6, these definitions have their geometric interpretation in Fig. 6. The value denotes the number of distinct pairs (here, “distinct” means that any normalized version of can appear in at most one pair) such that is in the upper half of the cap and is in the lower half of the cap . The quantities , , and can all be used as the radius of the cap . The ratio is proportional to the area of the cap with radius (or, equivalently, to the probability that the normalized falls in the cap ).
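The cap-area statement above can be checked numerically. The sketch below (with an assumed dimension and cap angle) compares the Monte-Carlo probability that a uniformly distributed direction falls in a cap of a given colatitude against the standard closed-form area fraction, expressed through the regularized incomplete beta function; the two quantities agree, which is why the ratio above is proportional to the cap area.

```python
# Monte-Carlo vs. closed-form cap-area fraction, with assumed parameters.
# The closed form used is the standard 0.5 * I_{sin^2(theta)}((d-1)/2, 1/2), for theta <= pi/2.
import numpy as np
from scipy.special import betainc

rng = np.random.default_rng(4)
d, theta, trials = 5, 0.7, 200_000
z = np.zeros(d)
z[0] = 1.0

V = rng.standard_normal((trials, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
empirical = np.mean(V @ z >= np.cos(theta))
exact = 0.5 * betainc((d - 1) / 2, 0.5, np.sin(theta) ** 2)
print(f"empirical cap probability {empirical:.4f}  vs  exact area fraction {exact:.4f}")
```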
The following lemma gives an estimate of when is given. We put its proof in Appendix F.1.
Lemma 27.
Given , we have
Notice that only depends on and it may even be zero if is zero. However, after we introduce the randomness of , we can show that is lower bounded with high probability. We can then obtain the following lemma. We put its proof in Appendix F.2.
Define
(60) | |||
(61) |
Lemma 28.
For any , we have
Notice that Lemma 28 is already very close to Lemma 25, and we put the final steps of the proof of Lemma 25 in Appendix F.3.
F.1 Proof of Lemma 27
Proof.
Define events as follows.
Those definitions directly imply that
(62) |
Step 1: prove
To show , we only need to prove that implies . To that end, it suffices to show for the vector defined in . Because and , we have
(63) |
By Eq. (56), we can construct pairs for (all ’s are different and all ’s are different), such that
Thus, we have
We then have
Further, we have
(64) |
Clearly, if the event occurs, then . Combining with Eq. (64), we then have . In other words, the event must occur. Hence, we have shown that .
Step 2: estimate the probability of
For all , because is uniformly distributed in all directions, for any fixed , we have
Thus, follows the distribution given . By Lemma 14 (with ), we have
(65) |
Similarly, we have
(66) |
By plugging Eq. (65) and Eq. (66) into Eq. (56) and applying the union bound, we have
By letting and by Eq. (59), we thus have
i.e.,
(67) |
Step 3: estimate the probability of
We have
Thus, we have
The result of this lemma thus follows. ∎
F.2 Proof of Lemma 28
Based on Lemma 27, it remains to estimate , which will then allow us to bound . Towards this end, we need a few lemmas to estimate and .
Lemma 29.
For any , we must have .
Proof.
Consider the function . It suffices to show that for all . We have . In other words, when , and when . Thus, is monotone decreasing when , and is monotone increasing when . Hence, achieves its minimum value at , i.e., for any . The conclusion of this lemma thus follows. ∎
Lemma 30.
For any , we have
Proof.
Lemma 31.
For any , we must have
Proof.
The result directly follows from when . ∎
Lemma 32.
Further, if , we have
Proof.
When , we have . When , we have . When , we have . It is easy to verify that the statement of the lemma holds for , , and . It remains to validate the case of . We first prove the lower bound. For any , we have
(because is monotone increasing with respect to when )
By letting , we thus have
Then, applying Lemma 30, we have
Thus, by Lemma 31, we have
Now we prove the upper bound. For any , we have
By letting , we thus have
Notice that . The result of this lemma thus follows.
∎
Lemma 33.
Recall is defined in Eq. (60). If and , then
Proof.
We have
∎
Lemma 34.
For any , we must have .
Lemma 35.
For any , we must have
and
Proof.
We have
(because )
Thus, we have
which implies
∎
Lemma 36.
For any and for any , we have
We also have
Proof.
Now we are ready to prove Lemma 28.
Recall defined in Eq. (55). For any , we have, for independent of and with distribution ,
(71) |
Since each of the , , is independent of , we have
Or, equivalently,
(72) |
Recall the definition of and in Eqs. (57)(58). Thus, we then have
(73) |
By Lemma 9 and Eq. (61), we have
Thus, we have
(74) |
By Eq. (59) and Eq. (74), we have
Notice that only depends on and is independent of . By Lemma 27, for any that makes , we must have
In other words,
We thus have
(75) |
Thus, we have
The result of this lemma thus follows.
F.3 Proof of Lemma 25
Based on Lemma 28, it only remains to estimate . We start with a lemma.
Lemma 37.
If and , we must have
(76) |
For any given and , we must have
Proof.
We start from
(77) |
Thus, we have
For any given and , we have
∎
Now we are ready to finish our proof of Lemma 25.
We have
Thus, when , we have
Then, we have
Appendix G Upper bound of
By Lemma 26, to get an upper bound of , it is equivalent to get an upper bound of . To that end, we only need to construct a vector and calculate the value of , which automatically becomes an upper bound .
The following lemma shows that, for a given , if two training inputs and are close to each other, then is unlikely to be large.
Lemma 38.
If there exist and such that and , then
Intuitively, Lemma 38 holds because, when and are similar, and (the -th and -th rows of , respectively) will also likely be similar, i.e., is not likely to be large. Thus, we can construct such that is proportional to , which leads to the result of Lemma 38. We put the proof of Lemma 38 in Appendix G.1.
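The following sketch illustrates this intuition numerically. It assumes the rows of H take the standard two-layer ReLU NTK feature form h_j = (1/sqrt(m)) [x_j 1{v_kᵀ x_j > 0}]_{k=1,...,m}; this form is not spelled out in this excerpt, so it is an assumption of the sketch. When two inputs are nearly parallel, the corresponding rows nearly coincide, so the quadratic form along e_i - e_j is small and the smallest eigenvalue of H Hᵀ is small as well.

```python
# Sketch under an ASSUMED NTK feature form (see the lead-in above).
import numpy as np

rng = np.random.default_rng(5)
d, n, m = 4, 8, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X[1] = X[0] + 1e-2 * rng.standard_normal(d)            # make x_1 nearly parallel to x_0
X[1] /= np.linalg.norm(X[1])

V = rng.standard_normal((m, d))                        # random (fixed) bottom-layer weights
act = (X @ V.T > 0).astype(float)                      # n-by-m activation patterns
H = (act[:, :, None] * X[:, None, :]).reshape(n, m * d) / np.sqrt(m)

G = H @ H.T
a = np.zeros(n)
a[0], a[1] = 1.0, -1.0
print("quadratic form along e_0 - e_1:", a @ G @ a)                    # small, since h_0 ~ h_1
print("lambda_min(H H^T)             :", np.linalg.eigvalsh(G).min())  # at most half of the value above
```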
The next step is to estimate this difference between and (or, equivalently, the angle between them). We have the following lemma.
Lemma 39.
When , there must exist two different ’s such that the angle between them is at most
Lemma 39 is intuitive because has finite area: when there are many ’s on , at least two of them must be relatively close. We put the proof of Lemma 39 in Appendix G.2.
Finally, we have the following lemma.
Lemma 40.
When , we have
By Lemma 40, we can conclude that when is much larger than , with high probability.
G.1 Proof of Lemma 38
We first prove a useful lemma.
Lemma 41.
For any , we must have . For any , we must have .
Proof.
To prove the first part of the lemma, note that
Thus, the function is monotone increasing with respect to . Therefore, we have
In other words, we have for any .
To prove the second part of the lemma, note that when , we have
Thus, the function is convex with respect to . Because the maximum of a convex function must be attained at an endpoint of the domain interval, we have
Thus, we have for any . ∎
Now we are ready to prove Lemma 38.
Proof.
Throughout the proof, we fix and , and only consider the randomness of . Because is the angle between and , and because of Assumption 1, we have
(78) |
Let , where denotes the -th standard basis vector, . Then, we have
(79) |
Since is fixed and the direction of is uniformly distributed, we have and
Thus, based on the randomness of , when are given, we have
By letting , , in Lemma 14, we then have
(80) | |||
(81) |
Thus, we have
(by Eq. (79))
(by the union bound)
The result of Lemma 38 thus follows. ∎
G.2 Proof of Lemma 39
We first prove a useful lemma. Recall the definition of in Eq. (60).
Lemma 42.
We have
Now we are ready to prove Lemma 39.
Proof.
Recall the definition of in Lemma 39. Draw caps on centered at with the colatitude angle where
(82) |
By Lemma 42 and , we have . Thus, by Lemma 41, we have
(83) |
By Lemma 8, the area of each cap is
Applying Lemma 35 and Eq. (83), we thus have
In other words, we have
By the pigeonhole principle, there exist at least two different caps that overlap, i.e., the angle between their centers is at most . The result of this lemma thus follows by Eq. (82). ∎
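A small simulation (with an assumed dimension) makes the pigeonhole argument concrete: as the number of unit vectors on the sphere grows, the smallest pairwise angle among them must shrink, because many disjoint caps of a fixed angular radius cannot all fit on a sphere of finite area.

```python
# Sketch of the pigeonhole/cap-packing intuition behind Lemma 39, with assumed parameters.
import numpy as np

rng = np.random.default_rng(6)
d = 4
for n in (10, 100, 1000, 2000):
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    cosines = X @ X.T
    np.fill_diagonal(cosines, -1.0)                    # exclude self-pairs
    min_angle = np.arccos(np.clip(cosines.max(), -1.0, 1.0))
    print(f"n = {n:5d}:  smallest pairwise angle ~ {min_angle:.4f} rad")
```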
Appendix H Proof of Proposition 5
We follow the proof sketch in Section 5. Recall the definition of the pseudo ground-truth function in Definition 2, and the corresponding that
(84) |
We first show that the pseudo ground-truth can be written in a linear form.
Lemma 43.
for all .
Proof.
For all , we have
∎
Let . Since and , we know that is an orthogonal projection onto the row-space of . Next, we give an expression for the test error. Note that even though Proposition 4 assumes no noise, we state a more general version with noise below (which will be useful later).
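A quick generic check of the projection property used here (the exact matrix is not reproduced in this excerpt, so the standard formula P = Hᵀ(H Hᵀ)⁻¹H for a full-row-rank H is assumed): P is symmetric, idempotent, and fixes every vector in the row-space of H, so it is the orthogonal projection onto that row-space.

```python
# Generic check of the orthogonal-projection properties, under the assumed standard formula.
import numpy as np

rng = np.random.default_rng(7)
n, p = 5, 40
H = rng.standard_normal((n, p))
P = H.T @ np.linalg.inv(H @ H.T) @ H

print("symmetric:      ", np.allclose(P, P.T))
print("idempotent:     ", np.allclose(P @ P, P))
row_vec = H.T @ rng.standard_normal(n)                 # an arbitrary vector in the row-space
print("fixes row-space:", np.allclose(P @ row_vec, row_vec))
```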
Lemma 44.
If the ground-truth is for all , then we have
Proof.
When there is no noise, Lemma 44 reduces to . As we described in Section 5, has the interpretation of the distance from to the row-space of . We then have the following.
Lemma 45.
For all , we have
Proof.
Recall that . Thus, we have
(85) |
We then have
Therefore, we have
∎
Now we are ready to prove Proposition 5.
Proof.
Because there is no noise, we have . Thus, by Lemma 44, we have
(86) |
We then have, for all ,
(87) |
It only remains to find the vector . Define for as
By Eq. (84), for all , we have
(88) |
Because , we have
Thus, we have
i.e.,
(89) |
We now construct the vector . Define whose -th element is , . Notice that is well-defined because . Then, for all , we have
i.e.,
(90) |
Thus, by Eq. (89) and Lemma 16 (with , , and ), we have
Further, by Eq. (90) and Eq. (88), we have
(91) |
Plugging Eq. (91) into Eq. (87), we thus have
∎
Appendix I Proof of Theorem 1
We first prove a useful lemma.
Lemma 46.
If , then for any , we must have
Proof.
This follows from Fubini’s theorem by changing the order of integration. Specifically, because , we have
Thus, we have
Because when and , we have
Thus, by Fubini’s theorem, we can exchange the order of integration, i.e., we have
∎
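The precise statement of Lemma 46 is elided in this excerpt; the order-of-integration exchange in its proof is of the same flavor as the standard layer-cake identity E[X] = ∫₀^∞ P(X > t) dt for a nonnegative random variable, which the following assumed example checks numerically.

```python
# Numerical check of the layer-cake identity on an assumed example distribution.
import numpy as np

rng = np.random.default_rng(8)
X = rng.exponential(scale=2.0, size=200_000)           # nonnegative samples with mean 2

t_grid = np.arange(0.0, 40.0, 0.1)
tail = np.array([np.mean(X > t) for t in t_grid])
integral_of_tail = np.sum(tail) * 0.1                  # Riemann sum of the tail probability

print("E[X]                    :", X.mean())
print("integral of P(X > t) dt :", integral_of_tail)   # agrees up to Monte-Carlo/discretization error
```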
The following proposition characterizes generalization performance when , i.e., the bias term in Eq. (5).
Proposition 47.
Assume no noise (), a ground truth where , , , , and . Then, for any and for almost every , we must have
Proof.
We split the proof into five steps as follows.
Step 1: use the pseudo ground-truth as an “intermediary”
Recall Definition 2 where we define the pseudo ground-truth . We then define the output of the pseudo ground-truth for training input as
The rest of the proof uses the pseudo ground-truth as an “intermediary” to connect the ground-truth and the model output . Specifically, we have
(92) |
Thus, we have
(93) |
In Eq. (93), we can see that the term corresponds to the test error of the pseudo ground-truth, the term corresponds to the impact of the difference between the pseudo ground-truth and the real ground-truth in the training data, and the term corresponds to the impact of the difference between the pseudo ground-truth and the real ground-truth in the test data. Using the terminology of the bias-variance decomposition, we refer to the term as the “pseudo bias” and the term as the “pseudo variance”.
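The exact form of Eq. (93) is not reproduced here, but the “intermediary” idea it rests on can be illustrated generically with a triangle-inequality-style bound: the error of the learned model against the true ground-truth is controlled by its error against the pseudo ground-truth plus the gap between the pseudo and true ground-truths. All values in the sketch below are made-up stand-ins.

```python
# Generic illustration of bounding a test error through an intermediary function.
import numpy as np

rng = np.random.default_rng(9)
n_test = 1000
f_true = rng.standard_normal(n_test)                       # stand-in for ground-truth values at test points
f_pseudo = f_true + 0.1 * rng.standard_normal(n_test)      # stand-in for the pseudo ground-truth
f_model = f_pseudo + 0.05 * rng.standard_normal(n_test)    # stand-in for the learned model output

direct = np.linalg.norm(f_true - f_model)
via_intermediary = np.linalg.norm(f_true - f_pseudo) + np.linalg.norm(f_pseudo - f_model)
print(f"direct error {direct:.3f}  <=  intermediary bound {via_intermediary:.3f}")
```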
Step 2: estimate term
We have
(94) |
Step 3: estimate term
For all , define
We now show that is bounded with mean equal to , where is defined in Definition 1. Specifically, we have
(95) |
From Definition 2, we have
(96) |
Because the ’s are i.i.d., the ’s are also i.i.d. Thus, we have
(97) |
Further, for any , we have
(98) |
Thus, by Lemma 16, we have
Further, by Eq. (I) and Eq. (I), we have
Because , we have
Because does not change with , we thus have
(99) |
Step 4: estimate term
Our idea is to treat as a special form of noise, and then apply Proposition 4. We first bound the magnitude of this special noise. For , we define
Then, we have
Similar to how we obtained Eq. (99) in Step 3, we have
(100) |
Thus, we have
(101) |
Step 5: estimate
We have
The last step exactly gives the conclusion of this proposition.
∎
Appendix J Proof of Proposition 2 (lower bound for ground-truth functions outside )
We first show what looks like. Define where its -th element is
Notice that
By Lemma 21, we have that converges in probability to as uniformly in . In other words,
(102) |
Let denote the standard basis in . For , define
(103) |
which is a number. Further, define
Further, define the number
and
By Eq. (102), we have
(104) |
For any given , we define as
(105) |
By the definition of the Dirac delta function with peak position at , we can write as an integral
Notice that only depends on the training data and does not change with (and thus is finite). Therefore, we have . It remains to show why converges to in probability. The following lemma shows what looks like.
Lemma 48.
.
Proof.
For any , we have
By Eq. (6), we thus have
(106) |
By the definition of the Dirac delta function, we have
∎
Now we are ready to prove the statement of Proposition 2, i.e., uniformly over all , as (notice that we have already shown that ). To be more specific, we restate that uniform convergence as the following lemma.
Lemma 49.
For any given , as .
Proof.
For any , define two events:
By Lemma 21, there exists a threshold such that for any ,
By Eq. (104), there exists a threshold such that for any ,
Thus, by the union bound, when , we have
(107) |
When happens, we have
(by Lemma 48 and Eq. (105))
Because is fixed when is given, can be arbitrarily small as long as is small enough. The conclusion of this lemma thus follows by Eq. (107). ∎
If the ground-truth function (or equivalently, ), then the MSE of (with respect to the ground-truth function ) is at least (because ). Therefore, we have proved Proposition 2. Below we state an even stronger result than part (ii) of Proposition 2, i.e., it captures not only the MSE of , but also that of for sufficiently large .
Lemma 50.
For any given and , there exists a threshold such that for all , .
Proof.
By Lemma 49, for any , there must exist a threshold such that for all ,
When , we have
Because , we have . Thus, by the triangle inequality, we have . Putting these together, we have
Notice that . The result of this lemma thus follows. ∎
Appendix K Details for Section 4 (hyper-spherical harmonics decomposition on )
K.1 Convolution on
First, we introduce the definition of the convolution on . In Dokmanic & Petrinovic (2009), the convolution on is defined as follows.
where is an orthogonal matrix that denotes a rotation in , chosen from the set of all rotations. In the following, we show Eq. (13). To that end, we have
(108) |
Now, we replace by . Thus, we have
Because is an orthogonal matrix, we have . Therefore, we have . Thus, by Eq. (14), we have
(109) |
By plugging Eq. (109) into Eq. (108), we have
Eq. (13) thus follows.
The following lemma shows the intrinsic symmetry of such a convolution.
Lemma 51.
Let denote any rotation in . If , then .
Proof.
Because , we can find such that
Thus, we have
(because is a rotation, we have ) | |||
The result of this lemma thus follows. ∎
K.2 Hyper-spherical harmonics
We follow the conventions of hyper-spherical harmonics in Dokmanic & Petrinovic (2009). We express in a set of hyper-spherical polar coordinates as follows.
Notice that and . Let . In such coordinates, hyper-spherical harmonics are given by Dokmanic & Petrinovic (2009)
(110) |
where the normalization factor is
and are the Gegenbauer polynomials of degree . These Gegenbauer polynomials can be defined as the coefficients of in the power-series expansion of the following function,
Further, the Gegenbauer polynomials can be computed by a three-term recursive relation,
(111) |
with and .
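The concrete constants in Eq. (111) are not reproduced in this excerpt, so the sketch below uses the textbook three-term recursion for Gegenbauer polynomials, C_0^a(x) = 1, C_1^a(x) = 2ax, and n C_n^a(x) = 2(n + a - 1) x C_{n-1}^a(x) - (n + 2a - 2) C_{n-2}^a(x), and checks it against scipy's implementation; the value a = 1.5 corresponds to (d - 2)/2 for an assumed d = 5.

```python
# Check of the standard Gegenbauer three-term recursion against scipy.
import numpy as np
from scipy.special import eval_gegenbauer

def gegenbauer_recursive(n, a, x):
    c_prev, c_curr = np.ones_like(x), 2.0 * a * x
    if n == 0:
        return c_prev
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * (k + a - 1) * x * c_curr
                                  - (k + 2.0 * a - 2) * c_prev) / k
    return c_curr

x = np.linspace(-1.0, 1.0, 7)
a = 1.5
for n in range(6):
    assert np.allclose(gegenbauer_recursive(n, a, x), eval_gegenbauer(n, a, x))
print("recursion matches scipy.special.eval_gegenbauer for n = 0, ..., 5")
```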
K.3 Calculate where
Recall that and . By plugging into Eq. (110), we have
(112) |
The following lemma gives an explicit form of Gegenbauer polynomials.
Lemma 52.
(113) |
Proof.
We use mathematical induction. We already know that and , which both satisfy Eq. (113). Suppose that and satisfy Eq. (113), i.e.,
It remains to show that also satisfies Eq. (113). By Eq. (111), it suffices to show that
(114) |
To that end, it suffices to show that the coefficients of are the same for both sides of Eq. (114), for . For the first step, we verify the coefficients of for . We have
coefficients of on the right-hand-side of Eq. (114)
For the second step, we verify the coefficient of for , i.e., the coefficient of . We have
coefficients of on the right-hand-side of Eq. (114)
For the third step, we verify the coefficient of for . We consider two cases: 1) is even, and 2) is odd. When is even, we have , i.e., . Thus, we have
coefficients of on the right-hand-side of Eq. (114)
When is odd, we have and this case has already been verified in the first step.
In conclusion, the coefficients of are the same for both sides of Eq. (114), for . Thus, by mathematical induction, the result of this lemma follows. ∎
K.4 Proof of Proposition 3
Recall that
Notice that . Thus, we have
The function has the following Taylor series expansion:
which converges when . Thus, we have
(116) |
By comparing the terms with even and odd powers of in Eq. (115) and Eq. (116), we immediately see that when , and when . It remains to examine whether or for . We first introduce the following lemma.
Lemma 53.
Let and be two non-negative integers. Define the function
We must have
(117) |
Proof.
We have
Thus, to finish the proof, we only need to consider the case of in Eq. (117). We then prove by mathematical induction on the first parameter of , i.e., in Eq. (117). When , we have
Thus, Eq. (117) holds for all when . Suppose that Eq. (117) holds when . To complete the mathematical induction, it only remains to show that Eq. (117) also holds for all when . By Eq. (111) and Eq. (112), for any , we have
Thus, we have
(118) |
where
It is obvious that and . Applying Eq. (118) multiple times, we have
(119) | |||
(120) | |||
(121) |
(Notice that we have already let , so all in those equations are well-defined.) By plugging Eq. (120) and Eq. (121) into Eq. (119), we have
(122) |
To prove that Eq. (117) holds when for all , we consider two cases, Case 1: , and Case 2: . Notice that by the induction hypothesis, we already know that Eq. (117) holds when for all .
Case 1. When , we have . Thus, by the induction hypothesis for , we have (by ), which implies that the third term of the right-hand-side of Eq. (122) is positive. Further, by the induction hypothesis for , we also know that and (regardless of the value of ), which means that the first and second terms of Eq. (122) are non-negative. Thus, by considering all three terms in Eq. (122) together, we have when .
Case 2. When , we have , , and . Thus, by the induction hypothesis for , we have . Therefore, by Eq. (122), we have .
In summary, Eq. (117) holds when for all . The mathematical induction is completed and the result of this lemma follows. ∎
K.5 A special case: when
When , denotes the unit circle. Therefore, every corresponds to an angle such that . In this situation, the hyper-spherical harmonics reduce to the well-known Fourier basis, i.e., . Thus, we can explicitly calculate all Fourier coefficients of more easily.
Similar to Appendix K.1, we first write down the convolution for , which also takes a simpler form. For any function , we have
Define . We then have
where denotes (continuous) circular convolution. Let and (where ) denote the (complex) Fourier series coefficients for , , and , respectively. Specifically, we have
Thus, we have
(123) |
Now we calculate , i.e., the Fourier decomposition of . We have
It is easy to verify that
Thus, we have
Similarly, we have
Now we consider the situation of . We have
Notice that . Therefore, we have
Similarly, we have
In summary, we have
By Eq. (123), we thus have
In other words, when , functions in can only contain frequencies , and cannot contain other frequencies .
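The frequency statement above can be checked numerically. The sketch below verifies the circular-convolution property behind Eq. (123), namely that the Fourier coefficients of a circular convolution are the (scaled) products of the individual coefficients, and then, using the illustrative assumed kernel slice g(θ) = max(0, cos θ) (an assumption of this sketch, not taken from the excerpt), shows that only a restricted set of low frequencies survives.

```python
# Check of the convolution theorem on the circle, plus the surviving frequencies of an assumed kernel slice.
import numpy as np

N = 512
theta = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
g = np.maximum(0.0, np.cos(theta))                 # assumed illustrative kernel slice
f = np.cos(3 * theta) + 0.5 * np.sin(theta)        # assumed illustrative function

# Discrete circular convolution approximating the integral (f * g)(theta).
conv = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g))) * (2.0 * np.pi / N)

lhs = np.fft.fft(conv) / N                                        # Fourier coefficients of f * g
rhs = (np.fft.fft(f) / N) * (np.fft.fft(g) / N) * 2.0 * np.pi     # scaled product of coefficients
print("max coefficient mismatch:", np.max(np.abs(lhs - rhs)))

g_hat = np.fft.fft(g) / N
print("frequencies k <= 8 with a nonzero coefficient of g:",
      [k for k in range(9) if abs(g_hat[k]) > 1e-6])              # only k = 0, 1 and even k survive
```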
K.6 Details of Remark 2
As we discussed in Remark 2, a ReLU activation function with bias that operates on can be equivalently viewed as one without bias that operates on , but with the last element of fixed at . Note that by fixing the last element of at a constant , we essentially consider ground-truth functions with a much smaller domain . Correspondingly, define a vector and such that . We claim that for any and for all non-negative integers , a ground-truth function must be learnable. In other words, all polynomials can be learned in the constrained domain . Towards this end, recall that we have already shown that polynomials (of ) to the power of are learnable. Thus, it suffices to prove that polynomials of to the power of can be represented by a finite sum of those to the power of . The idea is to utilize the fact that the binomial expansion of contains for all . Here we give an example of writing as a linear combination of learnable components; other values of can be handled in a similar way. Notice that
(124) |
Thus, for all and , we have
which is a sum of learnable components (corresponding to the polynomials with power of , , , , and , respectively).
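A small symbolic check of the binomial-expansion idea used above (the symbols t and c below are placeholders for the relevant inner product and the fixed constant): expanding (t + c)^l produces every power t^k for k = 0, ..., l with a nonzero coefficient, so any single power can be isolated as a linear combination of such expansions with l = 0, 1, 2, ....

```python
# Symbolic check of the binomial-expansion idea with placeholder symbols.
import sympy as sp

t, c = sp.symbols("t c", positive=True)
l = 4
expansion = sp.expand((t + c) ** l)
print(expansion)
print("coefficients of t^0, ..., t^4:", [expansion.coeff(t, k) for k in range(l + 1)])
```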
Appendix L Discussion when is a -function ()
We now discuss what happens to the conclusion of Theorem 1 if contains a -function, in which case . In Eq. (10) of Theorem 1, only Term 1 and Term 4 (which come from Proposition 5) will be affected when . This is because only Proposition 5 requires in the proof of Theorem 1. To accommodate the situation where contains a -function (), we need a new version of Proposition 5. In other words, we need to know the performance of the overfitted NTK solution in learning the pseudo ground-truth when .
Without loss of generality, we consider the situation that . We have the following proposition.
Proposition 54.
Proposition 54 implies that when is large and is much larger than , the test error between the pseudo ground-truth and the learned result decreases with at the speed . Further, if we let be large, then the decreasing speed with respect to is almost . When , this speed is slower than the one described in Proposition 5 (i.e., Term 1 in Eq. (10) of Theorem 1). When , the decreasing speed with respect to is for both Proposition 5 and Proposition 54. Nonetheless, Proposition 54 implies that the ground-truth function is still learnable even when is a -function (i.e., ), although the test error potentially suffers a slower convergence speed with respect to when is large.

In Fig. 7, we plot the curves of the model error for learning the pseudo ground-truth with respect to when (two blue curves) and when is constant (two red curves). We plot both the case of (two solid curves) and the case of (two dashed curves). By Lemma 44, the model error represents the generalization performance for learning the pseudo ground-truth when there is no noise. In Fig. 7, we can see that the two curves corresponding to have different slopes, while the other two curves corresponding to have similar slopes, which confirms our prediction in the previous paragraph (i.e., when , the test error decays at the same speed regardless of whether contains a -function, but when , the test error decays more slowly when contains a -function).
L.1 Proof of Proposition 54
We first show two useful lemmas.
Lemma 55.
For any , if , then
Proof.
Lemma 56.
Consider where . For any , there must exist such that and
(126) |
We will explain the intuition of Lemma 56 in Remark 8, right after we use the lemma. We put the proof of Lemma 56 in Appendix L.2.
Now we are ready to prove Proposition 54. Recall defined in Eq. (84). By Eq. (1) and , we have
Define
Thus, we have
(127) |
(Graphically, Eq. (L.1) means that a chord is not longer than the corresponding arc.)
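The parenthetical fact above (a chord is never longer than the corresponding arc) can be checked directly: for unit vectors u and v with angle θ between them, the chord length ‖u - v‖ = 2 sin(θ/2) never exceeds the arc length θ. A quick numerical confirmation:

```python
# Generic check that chord length never exceeds arc length for unit vectors.
import numpy as np

rng = np.random.default_rng(10)
d, pairs = 6, 1000
U = rng.standard_normal((pairs, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
W = rng.standard_normal((pairs, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

chords = np.linalg.norm(U - W, axis=1)
arcs = np.arccos(np.clip(np.sum(U * W, axis=1), -1.0, 1.0))
print("chord <= arc for all sampled pairs:", bool(np.all(chords <= arcs + 1e-12)))
```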
As we discussed in the proof sketch of Proposition 5, we now construct the vector such that is close to . Define whose -th element is
Thus, we have . Therefore, we have
Thus, we have
(128) |
Remark 7.
We give a geometric interpretation of Eq. (128) when using Fig. 4, where denotes and denotes . Then, corresponds to the number of ’s whose direction is in the arc or the arc , and corresponds to the angle . Intuitively, when increases, and get closer, so becomes smaller. At the same time, both the arc and the arc become shorter. Consequently, the value of Eq. (128) decreases as increases. In the rest of the proof, we quantitatively estimate this relationship.
Recall in Eq. (60). Define
(129) |
For any , we define two events:
If both and happen, by Eq. (128), we must then have
Thus, by Lemma 44 and Lemma 45, if and both and happen, then for any , we must have
(130) |
It then only remains to estimate the probability of .
Step 1: Estimate the probability of conditional on .
When happens, we have . By Lemma 56, we can find such that the angle between and is exactly and
(131) |
Remark 8.
We give a geometric interpretation of Eq. (131) (i.e., Lemma 56) when using Fig. 4. Recall from Remark 7 that, if we take as and as , then corresponds to the number of ’s whose direction is in the arc or the arc . If we fix (i.e., ) and increase the angle (corresponding to ), then both the arc and the arc become longer. In other words, if we replace by such that the angle (between and ) increases to the angle (between and ), then and , and thus Eq. (131) follows.
We next estimate the probability that the right-hand-side of Eq. (131) is greater than . By Eq. (6), we have
(132) |
Notice that the angle between and is , and the angle between and is also . By Lemma 17 and Assumption 1, we know that Term A in Eq. (132) follows a Bernoulli distribution with probability . By letting , , in Lemma 14, we have
By Eq. (131), we then have
Step 2: Estimate the probability of .
L.2 Proof of Lemma 56
Proof.
When , the conclusion of this lemma trivially holds because ( and cannot both be positive or both be negative at the same time when ). It remains to consider . Define
Thus, we have and , i.e., and are orthonormal basis vectors on the 2D plane spanned by and . Thus, we can represent as
For any , we construct as
In order to show , we only need to prove that any must be in . For any , , define the angle as the angle between and ’s projected component on (such an angle is well defined as long as is not perpendicular to ; we do not need to worry about those ’s with , because when , we have , and thus those ’s do not belong to any of the sets , , , or in Eq. (126)), i.e.,
By the proof of Lemma 17, we know that if and only if (mod ). Similarly, if and only if (mod ). Because and , we have
Thus, whenever , we must have . Therefore, we conclude that . Using a similar method, we can also show that . The result of this lemma thus follows. ∎