Efficient SGD Neural Network Training via Sublinear Activated Neuron Identification
Deep learning has been widely used in many fields, but the model training process usually consumes massive computational resources and time. Therefore, designing an efficient neural network training method with a provable convergence guarantee is a fundamental and important research question. In this paper, we present a static half-space report data structure that consists of a fully connected two-layer neural network for shifted ReLU activation, enabling activated neuron identification in sublinear time via geometric search. We also prove that our algorithm converges in time with network size quadratic in the coefficient norm upper bound and the error term.
1 Introduction
Deep learning is widely used in computer vision [50, 51, 69, 48, 43], natural language processing [19, 49], game playing [68, 71] and beyond. The training of deep learning models often takes an enormous amount of computational resources. A fundamental challenge in this line of research is, therefore, designing an efficient neural network training method that provably converges. Existing work that provably converges suffers from an over-parameterized network structure [29, 52, 2, 11, 9, 24, 17, 79, 90, 93, 61, 94, 13, 38].
A prior study by [21] established that SGD can learn polynomials with restricted weights, as well as particular kernel spaces, using a depth-two neural network of near-optimal size. In particular, for the set of even polynomials of bounded degree whose coefficient vector norm does not exceed a given bound, and for input distributions on the unit sphere, a sufficient number of neurons and iterations allows a depth-two neural network to output a predictor with small error. However, their algorithm still suffers a per-iteration cost that is linear in the network width, in addition to its dependence on the batch size and the dimension of the input data points. This linear dependency on the width comes from computing inner products between the gradient of the loss function with respect to the weights and the gradient of the weights with respect to the points being queried. This seems to be a natural barrier. One natural question to ask is,
Is there an algorithm whose cost per iteration is sublinear in the network width?
We answer this question in the affirmative and summarize our contributions as follows:
• We propose a static half-space report data structure that consists of a fully connected two-layer neural network with neural tangent kernel for shifted ReLU activation. Specifically, we build a half-space report data structure over the weights and update it with batched SGD. At every iteration, our algorithm identifies the weights that are fired by the current data points and propagates them efficiently. Additionally, we show that at any given iteration, the number of activated neurons for each input data point is sublinear in the number of neurons. Thus, via geometric search, our algorithm identifies those activated neurons in sublinear time.
• We show that our algorithm can converge in time with network size quadratic in the coefficient norm upper bound and the error term.
1.1 Our Results
To formally introduce our main results, we first present the definitions regarding the input data distribution. In this paper, we assume that the input data distribution is supported on the unit sphere . On top of that, we present the definition of a -bounded distribution.
Definition 1.1 (-bounded distribution).
A distribution on is said to be -bounded if, for every , the expectation is less than or equal to .
In practice, any given distribution is bounded by . Additionally, many commonly used distributions are bounded by , or even . Examples of such distributions include uniform distribution on a unit sphere , on a discrete cube , and on randomly selected points.
Then, we present the formal definition of the neural network structure that we consider in this paper. Our study focuses on fully connected, depth-2 neural networks with hidden neurons, using a shifted ReLU activation function with a -loss function. More formally, we present the definitions of the prediction and loss functions as follows:
Definition 1.2 (Prediction and Loss function).
Given and , we say a prediction function is if:
where denotes the shifted ReLU function .
The shifted ReLU function is frequently employed in the literature, as well as in theoretical investigations such as [94, 82]. Regarding the neural network weights, we leverage the Xavier initialization with zero outputs, as described in [32]. This involves organizing the neurons into pairs, with each pair composed of two neurons initialized identically, differing only by a factor of . We denote , and as the aggregation of all weights. More formally, we present our weight initialization as follows:
Definition 1.3 (Weights at Initialization).
Given input dimension , number of neurons , constant , we say the weight , is initialized according to distribution if:
• For each , we sample , and .
• For each , we sample from uniformly at random, and .
Note that if , then with probability , .
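To make Definitions 1.2 and 1.3 concrete, the following is a minimal Python sketch of the prediction function and the paired (zero-output) initialization. The exact scaling constants, the variance `sigma_w`, and the shift threshold `b` are not reproduced in the text above, so the values below are placeholders rather than the paper's choices.

```python
import numpy as np

def init_paired_weights(d, m, sigma_w=1.0, seed=0):
    """Paired ("zero-output") initialization: neurons come in pairs that share the
    same hidden weight vector but carry opposite output signs, so the network
    outputs 0 at initialization.  Assumes m is even."""
    rng = np.random.default_rng(seed)
    W_half = rng.normal(0.0, sigma_w / np.sqrt(d), size=(m // 2, d))
    a_half = rng.choice([-1.0, 1.0], size=m // 2)
    W = np.vstack([W_half, W_half])        # duplicated hidden weights
    a = np.concatenate([a_half, -a_half])  # opposite output signs within each pair
    return W, a

def shifted_relu(z, b):
    """sigma_b(z) = max(z - b, 0)."""
    return np.maximum(z - b, 0.0)

def predict(W, a, x, b):
    """Two-layer prediction f(x) = sum_i a_i * sigma_b(<w_i, x>)."""
    return a @ shifted_relu(W @ x, b)

# At initialization the paired signs cancel, so predict(W, a, x, b) == 0 for every x.
```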
Next, we present a benchmark of our algorithm:
Definition 1.4 (Even Polynomial with Bounded Coefficient Norm).
We denote the class of even polynomials with coefficient norm bound as . More formally,
We are now ready to state our main theorem (a combination of Theorem 5.1 and Lemma 6.9), which shows that Algorithm 1 is capable of learning even polynomials of bounded norm with near-optimal sample complexity and network size, while achieving a per-iteration time sublinear in the number of neurons:
Theorem 1.5 (Main theorem).
Given the following:
• a constant , accuracy parameter , along with positive constants and ,
• a selection of parameters ,
• sample access to a -bounded distribution (Definition 1.1),
• input dimension and coefficient norm bound (Definition 1.4),
• is the expected loss of a predictor on the input distribution ,
there exists an algorithm (Algorithm 1) such that, when running Stochastic Gradient Descent
• with a batch size of ,
• on (Definition 1.2),
• it returns a function that satisfies:
• with expected per-iteration running time in
Roadmap.
We first present a technical overview of our paper in Section 2. We then introduce some notation and preliminaries in Section 3 and discuss related work in Section 4. We present our main algorithm and prove its correctness in Section 5. We give the proof of the running time of our algorithm in Section 6. We conclude in Section 7.
2 Technical Overview
Our work consists of two results. The first result proves the learnability of shifted ReLU activation on a two-layer neural network trained with SGD updates. The second result designs an efficient half-space reporting data structure based on the weight sparsity induced by the shifted ReLU activation function, which gives a per-iteration time sublinear in the network size.
We prove the first result via reduction-based techniques. In order to prove the learnability of shifted ReLU activation on a two-layer neural network (Theorem 5.1), we actually prove a more general statement in Theorem 5.2: for a general activation function , the expected loss of the two-layer neural network is optimal up to an additive error with neurons and updates. Next, we prove that training the two-layer network with a general activation is equivalent to neural tangent kernel training (Lemma 5.3) when the weight vectors of the network are initialized at a large enough scale (Definition 1.3). Specifically, when the initialization scale is large enough, SGD with a general activation function on the two-layer neural network follows a lazy update, where each weight vector only moves within a small ball. In this case, the first-order approximation of the network function around the initial weights is approximately equivalent to the original network function. In this light, neural tangent kernel learning suffices to approximate general two-layer neural network training with SGD updates. Then, we analyze the neural tangent kernel training in the language of the (vector) random feature scheme, which completes the proof of the first result.
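The lazy-update argument can be summarized by the standard first-order (NTK) approximation; the display below is a hedged reconstruction using generic notation ($f_W$ for the network function and $W^{(0)}$ for the initialization), since the paper's exact symbols are not reproduced above.

```latex
f_{W}(x) \;\approx\; f_{W^{(0)}}(x)
  \;+\; \big\langle \nabla_{W} f_{W^{(0)}}(x),\; W - W^{(0)} \big\rangle,
\qquad \text{valid while } \|W - W^{(0)}\| \text{ stays small (lazy regime).}
```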
The second result is based on the observation that, for the shifted ReLU activation function on a two-layer neural network with SGD updates, the number of fired neurons for each data point is sublinear in the network size . In this light, we adapt the half-space reporting data structure hsr to speed up the per-iteration running time by preprocessing the network weights. More precisely, the algorithm initializes the half-space reporting data structure hsr with the initialized weight vectors; then, at every batched SGD iteration, the algorithm queries hsr for the weight vectors that fire for the points in the batch and updates the fired neuron set in time.
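The following Python sketch illustrates this per-iteration pattern. It assumes a half-space reporting object `hsr` exposing `query` (report the neurons fired by a point) and `update` (refresh moved weight vectors); the squared loss and gradient formulas are generic placeholders rather than the paper's exact update, and the function names are ours. A naive stand-in for `hsr` is sketched after Corollary 6.8 in Section 6.2.

```python
import numpy as np

def sgd_step(W, a, b, hsr, batch_X, batch_y, lr):
    """One batched SGD step over a two-layer shifted-ReLU network that only
    touches the neurons fired by each example.  `hsr` is assumed to support
    hsr.query(x) -> indices i with <w_i, x> > b, and hsr.update(idx, rows)."""
    for x, y in zip(batch_X, batch_y):
        S = hsr.query(x)                          # fired neurons; sublinear in m w.h.p.
        if len(S) == 0:
            continue
        pred = a[S] @ (W[S] @ x - b)              # shifted ReLU is linear on fired neurons
        grad_out = pred - y                       # derivative of squared loss (placeholder)
        W_new = W[S] - lr * grad_out * np.outer(a[S], x)
        W[S] = W_new
        hsr.update(S, W_new)                      # keep the geometric index consistent
    return W
```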
3 Preliminaries
Notations
We use to denote the descent activation function. We use to denote the input distribution. In a matrix , the -th row is represented as . We denote the -th row in a matrix by . We use to represent the -norm of . Given a matrix , we use to denote its spectral norm, defined as . We adhere to the standard convention where . Regarding a distribution on a space , and , we use . For any function , we use to represent . For an integer , we use to denote the set . For a given function , we leverage and to represent its first-order and second-order derivative, respectively.
3.1 Definitions
In this section, we present some definitions of properties on a function . To begin with, we present the definition of a convex function.
Definition 3.1 (Convexity).
We say a function is convex if for any , we have:
Next, we present the definition of Lipschitzness of a function.
Definition 3.2 (-Lipschitz).
We say a function is -Lipschitz with respect to norm if for any , we have:
3.2 Neural Network Training
Then, we introduce some basic definitions regarding neural network training of supervised learning and some related notations:
Definition 3.3 (Supervised Learning).
The objective of supervised learning is to learn a mapping from an input space, denoted as , to an output space, denoted as , using a sample set . These samples are independently and identically drawn from a distribution , which spans across .
A special case of supervised learning is binary classification, where the prediction is a binary label.
Definition 3.4 (Binary Classification).
The binary classification problem is characterized by the label . Specifically, given a loss function , the aim is to identify a predictor whose loss is small.
Moreover, when a function is defined by a parameter vector , we denote , and . For a class of predictors from to , we denote
For classification problems, the properties of their loss function are defined as follows:
Definition 3.5 (Properties of Loss Function).
A loss function exhibits -Lipschitz characteristics if
• for all , the function adheres to -Lipschitz properties (Definition 3.2).
Similarly, it is considered convex if is convex (Definition 3.1) for each . The function is considered to have -descent properties if is convex, -Lipschitz, and twice differentiable except at a finite number of points for every .
Note that shifted ReLU activation is a descent activation function. Furthermore, we present the definition of empirical loss on samples:
Definition 3.6 (Empirical Loss).
The empirical loss for a set of points is:
Furthermore, when function is defined by a vector of parameters , we denote . For a class of predictors , we denote .
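Since the display in Definition 3.6 is not reproduced above, the block below gives the standard form of the empirical loss over a sample $S=\{(x_i,y_i)\}_{i=1}^{n}$; the notation is assumed, not taken from the paper.

```latex
\mathcal{L}_S(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big).
```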
In the remainder of our paper, we denote by NeuralNetworkTraining the neural network training procedure with activation , input dimension , weight dimension , loss , learning rate , SGD batch size , initialization parameter , and number of iterations . Additionally, this optimization process initializes the weight vector as (Definition 1.3).
3.3 Kernel Spaces
In this section, we provide some definitions regarding kernel and kernel spaces.
Definition 3.7 (Kernel).
Let be a given set. A kernel is defined as a function that guarantees, for all , the resulting matrix is positive semi-definite. A kernel space pertains to a Hilbert space in which the mapping is bounded.
The succeeding theorem details a one-to-one relationship between kernels and kernel spaces.
Theorem 3.8 (Kernel versus Kernel Spaces).
For each kernel , a unique kernel space exists such that for all . Similarly, for every kernel space , a kernel can be found such that .
Within the context of , the norm and inner product are denoted by and , respectively. The following theorem describes the close relationship between kernels and embeddings of into Hilbert spaces.
Theorem 3.9 (Kernel versus Embedding).
A function is recognized as a kernel if and only if a mapping exists to some Hilbert space where
In this situation, we have:
Furthermore, we denote , and the minimizer is unique.
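The displays associated with Theorem 3.9 are not reproduced above; the block below is a hedged reconstruction of the standard statements, writing the embedding as $\psi$ and the Hilbert space as $\mathcal{H}$ (notation assumed).

```latex
k(x, x') \;=\; \big\langle \psi(x),\, \psi(x') \big\rangle_{\mathcal{H}},
\qquad
\mathcal{H}_k \;=\; \big\{\, x \mapsto \langle v, \psi(x)\rangle_{\mathcal{H}} \;:\; v \in \mathcal{H} \,\big\},
\qquad
\|f\|_{\mathcal{H}_k} \;=\; \min\big\{ \|v\|_{\mathcal{H}} \;:\; f(\cdot) = \langle v, \psi(\cdot)\rangle_{\mathcal{H}} \big\}.
```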
We will leverage a certain kind of kernel known as inner product kernels. These are kernels given by , where are scalars satisfying . It is well known that for any such series, acts as a kernel. The upcoming lemma outlines a few properties of inner product kernels.
Lemma 3.10 (Characteristics of Inner Product Kernel [21]).
Let be the inner product kernel . Assuming that ,
• If , then , and .
• For every , the function resides in and .
For a kernel and , we represent . Additionally, the inner product kernel space is a natural benchmark for learning algorithms. Then, we present the definition of Hermite polynomials and dual activation functions.
Definition 3.11 (Hermite Polynomials and the Dual Activation).
Hermite polynomials correspond to the sequence of orthonormal polynomials associated with the standard Gaussian measure on . Given an activation , we define the dual activation of as follows:
• where represents -correlated standard Gaussians.
• Additionally, it holds that if , then
Specifically, forms an inner product kernel.
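As an illustration of Definition 3.11, the snippet below numerically estimates a dual activation by sampling correlated standard Gaussians. The shifted-ReLU threshold `b` is a placeholder, since the paper's exact threshold and normalization are not shown above.

```python
import numpy as np

def dual_activation_mc(sigma, rho, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the dual activation
    sigma_hat(rho) = E[sigma(X) sigma(Y)] for rho-correlated standard Gaussians X, Y."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    eps = rng.standard_normal(n_samples)
    y = rho * x + np.sqrt(1.0 - rho ** 2) * eps   # Corr(X, Y) = rho, both N(0, 1)
    return np.mean(sigma(x) * sigma(y))

# Example with a shifted ReLU, sigma(z) = max(z - b, 0); the threshold b is a placeholder.
b = 0.5
shifted_relu = lambda z: np.maximum(z - b, 0.0)
print(dual_activation_mc(shifted_relu, rho=0.3))
```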
4 Related Work
Sketching
Sketching is a well-known technique to improve performance or memory complexity [18]. It has wide applications in linear algebra, such as linear regression and low-rank approximation [18, 60, 58, 67, 75, 39, 6, 76, 77, 23], training over-parameterized neural network [82, 83, 91], empirical risk minimization [56, 65], linear programming [56, 47, 80], distributed problems [85, 15], clustering [31], generative adversarial networks [88], kernel density estimation [63], tensor decomposition [78], trace estimation [46], projected gradient descent [40, 87], matrix sensing [27, 64], John Ellipsoid computation [16, 81], semi-definite programming [35], kernel methods [1, 4, 20, 74], adversarial training [34], cutting plane method [45], discrepancy [92], federated learning [66], reinforcement learning [5, 86, 72], relational database [62].
Over-parameterization in Training Neural Networks.
The investigation of the geometry and convergence patterns of various optimization methods on over-parameterized neural networks has become a significant focus within the deep learning sphere [52, 44, 30, 10, 12, 25, 79, 89, 61, 53, 13, 82, 38, 83, 42, 55, 7, 59, 92]. The ground-breaking research by [44] introduced the concept of the neural tangent kernel (NTK), a critical analytical tool in deep learning theory. By expanding the neural network's size so that the network width becomes sufficiently large, it can be demonstrated that the training dynamics of a neural network closely mirror those of an NTK.
5 Proof of Correctness
First, we present our algorithm (Algorithm 1) for training a two-layer neural network with shifted ReLU activation via SGD updates.
Next, we deliver the proof of correctness showing that Algorithm 1 is capable of learning even polynomials of bounded norm (Definition 1.4) with nearly optimal sample complexity and network size.
Theorem 5.1 (Neural Network Learning with Shifted ReLU Activation).
Given the following conditions:
• a fixed constant and ,
• , ,
• the network function with initialization as indicated in Definition 1.3,
there is a choice of
and positive values of and , such that for every -bounded distribution (as defined in Definition 1.1) and a batch size , the function obtained by Algorithm 1 satisfies the condition that
From [22], for shifted ReLU activation , it holds that for every constant , . As a result, the following theorem implies the above theorem.
Theorem 5.2 (Neural Network Learning).
Given and , there exists a choice of
along with and , such that for any batch size and -bounded distribution (Definition 1.1), the function obtained by NeuralNetworkTraining gives us:
We can prove this theorem by a reduction to the neural tangent kernel space at the initialized weights. First, we use to denote the gradient of the function with respect to the hidden weights, i.e.,
where denotes the first-order derivative of the activation . Moreover, we denote .
Next, we show that NeuralNetworkTraining is equivalent to NeuralTangentKernelTraining, with large enough initialization of the weights on neurons. We defer the proof of this lemma to Section C.1.
Lemma 5.3 (Equivalence for NNT and NTKT).
If the following conditions hold
• Fix a descent activation as well as a convex descent loss (Definition 3.1).
There is a choice , such that for every input distribution the following holds: Let be the functions returned by NeuralNetworkTraining with parameters and NeuralTangentKernelTraining.
Then, we have
• .
By the above lemma, it’s enough to analyze NeuralTangentKernelTraining in order to prove Theorem 5.2. To be specific, Theorem 5.2 follows from the following theorem:
Theorem 5.4.
Given and , there is a choice of , , and which enable that for every -bounded distribution (Definition 1.1) and batch size , the function obtained by NeuralTangentKernelTraining satisfies
In order to prove the above theorem, we prove an equivalent theorem (Theorem 5.6) in the next section, where we rephrase everything in the language of the vector random feature scheme.
5.1 Vector Random Feature Schemes
We note that NeuralTangentKernelTraining is SGD on top of the random embedding that consists of i.i.d. random mappings:
where follows a standard Gaussian distribution. For simplification, we adjust the training process to SGD on independent and identically distributed random mappings . After the application of this mapping, inner products between different examples are preserved up to a multiplicative factor. As the SGD update depends solely on these inner products, it suffices to analyze the learning process within the corresponding random feature scheme framework. To begin with, we use the random mapping to denote the random -embedding generated from :
where are i.i.d. Next, we consider SGDRFS for learning the class by running the SGD algorithm on these random features. For the remainder of this section, we fix a -bounded Random Feature Space (RFS) for a kernel and a randomly selected embedding , and denote the RFS as . The Neural Tangent Kernel (NTK) RFS is given by the mapping defined by
Definition 5.5 (-Kernel and -Kernel Space).
For a Random Feature Space (RFS) of a kernel , and a randomly chosen embedding , the random -kernel with regard to is
Similarly, the random -kernel space corresponding to is . For every , we define
as the average of independent random variables whose expectation is
With all the definitions, now we are ready to state the proof of correctness on the training of random feature scheme by the SGD update. Moreover, we defer the proof of Theorem 5.6 to Section C.2.
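For intuition, the snippet below sketches one plausible instantiation of the NTK random feature map discussed above for a shifted-ReLU network, namely psi(w, x) = 1[&lt;w, x&gt; > b] * x with w drawn from a standard Gaussian. The exact map, normalization, and threshold used in the paper are not reproduced above, so this is an assumption-laden illustration rather than the paper's definition.

```python
import numpy as np

def ntk_random_features(X, k, b, seed=0):
    """Vector random features for a two-layer shifted-ReLU NTK (hidden weights only):
    psi(w, x) = 1[<w, x> > b] * x with w ~ N(0, I_d).
    Returns an array of shape (n, k, d); the kernel is approximated by
    (1/k) * sum_j <psi(w_j, x), psi(w_j, x')>."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k, d))
    fired = (X @ W.T > b).astype(float)          # shape (n, k): indicator features
    return fired[:, :, None] * X[:, None, :]     # shape (n, k, d)

def approx_ntk_kernel(X, k=2000, b=0.5):
    """Approximate kernel matrix from the averaged random features."""
    Phi = ntk_random_features(X, k, b)
    Phi = Phi.reshape(X.shape[0], -1)            # flatten the (k, d) feature blocks
    return Phi @ Phi.T / k
```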
6 Proof of Running Time
In this section, we first present some definitions and properties regarding the activated neurons at each iteration. Then, we present the problem definition, algorithm, and runtime guarantee of half-space reporting. Finally, we present the runtime analysis showing that our algorithm has a cost per iteration sublinear in the number of neurons .
6.1 Sparsity Characterization
In this section, we show that, with high probability, at every time the number of neurons activated by the shifted ReLU for each data point is sublinear in the network size. To begin with, we first present a definition of the set of neurons that fire at time .
Definition 6.1 (Fire Set).
For each index within the range , and for every timestep within the range , we define as a subset of . This set corresponds to the neurons that "activate" or "fire" at time . Formally, it is defined as follows:
We also define to be the size of the aforementioned set, i.e., for all .
We subsequently introduce a "sparsity lemma" which demonstrates that the activation function results in the required sparsity.
Lemma 6.2 (Sparsity After Initialization, [82]).
Given parameter , and network structure as (Definition 1.2), then after weight initialization, with probability
for every input , the number of fired neurons is upper bounded by: .
Next, we present a choice of threshold that gives us a set of fired neurons sublinear in the network size :
Remark 6.3.
Let , then . For , Lemma 6.2 implies that:
In the forthcoming discussions, our aim is to establish that at any time instance , the number of activated neurons is sublinear with respect to the network size. First, we define the set of neurons that flip at time :
Definition 6.4 (Flip Set).
For each index within the range , and for each time instance , we define as a subset of . This set corresponds to the neurons that switch their state at time . More formally, it is defined as follows:
In contrast, there exist neurons that remain in the same state throughout the entire training process:
Definition 6.5 (Nonflip Set).
For each index in the range , we define as a subset of . This set includes the neurons that never switch their state during the whole training process. Specifically, it is defined as follows:
Then, we introduce a lemma showing that the number of fired neurons is small for all with high probability.
Lemma 6.6 (Upper Bound of Fired Neuron per Iteration, [82]).
Let
• be a parameter,
• be the activation function.
For each , is the number of activated neurons at the -th iteration. For , with probability at least
is at most for all .
6.2 Data Structure for Half-Space Reporting
In this section, we introduce the problem formulation, the data structure, and the time efficiency guarantees for the half-space reporting data structure. The primary objective of half-space reporting is to construct a data structure for a set in such a way that, for any given half-space , the data structure can quickly identify and output the points that fall within this half-space:
Definition 6.7 (Half-space Range Reporting).
Given a set of points in , we define a half-space range reporting data structure that supports two fundamental operations:
• Query: Provided a half-space , this operation returns all points within that are also contained within . In other words, it outputs the intersection of and .
• Update: This operation pertains to modifying the set by either inserting a new point into it, or removing an existing point from it:
– Insert: This operation inserts a point into the set .
– Delete: This operation deletes the point from the set .
Moreover, we denote , , and as the pre-processing time, per round query time, and per round update time for the data structure.
Next, we present the formal data structure for half-space reporting.
From [3], this problem can be solved with sublinear time complexity:
Corollary 6.8 ([3]).
Given a set of points in , the half-space reporting problem can be solved with:
• A query time denoted by , where is the number of points in the intersection of the set and half-space .
• An amortized update time denoted by .
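To make the interface of Definition 6.7 concrete, here is a naive brute-force stand-in in Python. It matches the Query/Insert/Delete interface but scans all points per query, whereas the structure from [3] replaces this scan with a geometric search achieving the sublinear bounds stated in Corollary 6.8. The class name and threshold handling are ours, not the paper's.

```python
import numpy as np

class BruteForceHSR:
    """Naive stand-in for the half-space reporting structure of Definition 6.7.
    query() scans all stored points in O(m * d) time."""
    def __init__(self, points, threshold):
        self.P = np.asarray(points, dtype=float)   # one point (weight vector) per row
        self.b = threshold
    def query(self, x):
        """Report indices i with <p_i, x> > b, i.e. points inside the half-space."""
        return np.flatnonzero(self.P @ x > self.b)
    def update(self, idx, new_rows):
        """Delete the old points at `idx` and insert their updated versions."""
        self.P[idx] = new_rows
```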
6.3 Cost per iteration
In this section, we analyze the time complexity per iteration of Algorithm 1.
Lemma 6.9 (Running time of Algorithm 1).
7 Conclusion
Deep learning is widely employed in many domains, but its model training procedure often takes an enormous amount of computational resources and time. In this paper, we design an efficient neural network training method with SGD updates that has a provable convergence guarantee. By incorporating a static half-space report data structure into the optimization process of a fully connected two-layer neural network with neural tangent kernel for shifted ReLU activation, our algorithm supports sublinear-time activated neuron identification via geometric search. In addition, we prove that our algorithm can converge in time with network size quadratic in the coefficient norm upper bound and the error term. As far as we are aware, our work does not have negative societal impacts. One limitation of our work is that it only considers the shifted ReLU activation; studying other activation functions is left for future work.
References
- ACW [17] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017.
- ADH+ [19] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
- AEM [92] Pankaj K Agarwal, David Eppstein, and Jiri Matousek. Dynamic half-space reporting, geometric optimization, and minimum spanning trees. In Annual Symposium on Foundations of Computer Science, volume 33, pages 80–80. IEEE Computer Society Press, 1992.
- AKK+ [20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.
- AKL [17] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pages 166–175. PMLR, 2017.
- ALS+ [18] Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with orlicz norm. In International Conference on Machine Learning (ICML), pages 224–233. PMLR, 2018.
- ALS+ [22] Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, and Danyang Zhuo. Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing. arXiv preprint arXiv:2211.14227, 2022.
- AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
- [9] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
- [10] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
- [11] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. Advances in neural information processing systems, 32, 2019.
- [12] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, 2019.
- BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In 12th Innovations in Theoretical Computer Science Conference, ITCS 2021, January 6-8, 2021, Virtual Conference, volume 185 of LIPIcs, pages 63:1–63:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- BSZ [23] Jan van den Brand, Zhao Song, and Tianyi Zhou. Algorithm and hardness for dynamic attention maintenance in large language models. arXiv e-prints, pages arXiv–2304, 2023.
- BWZ [16] Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 236–249, 2016.
- CCLY [19] Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the john ellipsoid. In Conference on Learning Theory, pages 849–873. PMLR, 2019.
- CG [19] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.
- CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC), 2013.
- CWB+ [11] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of machine learning research, 12(ARTICLE):2493–2537, 2011.
- CY [21] Yifan Chen and Yun Yang. Accumulations of projections—a unified framework for random sketches in kernel ridge regression. In International Conference on Artificial Intelligence and Statistics, pages 2953–2961. PMLR, 2021.
- Dan [20] Amit Daniely. Neural networks learning and memorization with (almost) no over-parameterization. Advances in Neural Information Processing Systems, 33:9007–9016, 2020.
- DFS [16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. Advances In Neural Information Processing Systems, 29, 2016.
- DJS+ [19] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. Advances in neural information processing systems, 32, 2019.
- [24] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pages 1675–1685. PMLR, 2019.
- [25] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), 2019.
- [26] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.
- [27] Yichuan Deng, Zhihang Li, and Zhao Song. An improved sample complexity for rank-1 matrix sensing. arXiv preprint arXiv:2303.06895, 2023.
- DMS [23] Yichuan Deng, Sridhar Mahadevan, and Zhao Song. Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension. arXiv preprint arXiv:2304.04397, 2023.
- DZPS [18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
- DZPS [19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. arXiv preprint arXiv:1810.02054, 2019.
- EMZ [21] Hossein Esfandiari, Vahab Mirrokni, and Peilin Zhong. Almost linear time density level set estimation via dbscan. In AAAI, 2021.
- GB [10] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
- GMS [23] Yeqi Gao, Sridhar Mahadevan, and Zhao Song. An over-parameterized exponential regression. arXiv preprint arXiv:2303.16504, 2023.
- GQSW [22] Yeqi Gao, Lianke Qin, Zhao Song, and Yitan Wang. A sublinear adversarial training algorithm. arXiv preprint arXiv:2208.05395, 2022.
- GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
- GSX [23] Yeqi Gao, Zhao Song, and Shenghao Xie. In-context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. arXiv preprint arXiv:2307.02419, 2023.
- GSY [23] Yeqi Gao, Zhao Song, and Junze Yin. An iterative algorithm for rescaled hyperbolic functions regression. arXiv preprint arXiv:2305.00660, 2023.
- HLSY [21] Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. Fl-ntk: A neural tangent kernel-based framework for federated learning analysis. In International Conference on Machine Learning, pages 4423–4434. PMLR, 2021.
- HLW [17] Jarvis Haupt, Xingguo Li, and David P Woodruff. Near optimal sketching of low-rank tensor regression. In ICML, 2017.
- HMR [18] Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. Sega: Variance reduction via gradient sketching. Advances in Neural Information Processing Systems, 31, 2018.
- Hoe [63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
- HSWZ [22] Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.
- HZRS [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- JGH [18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, 2020.
- JPWZ [21] Shuli Jiang, Hai Pham, David Woodruff, and Richard Zhang. Optimal sketching for trace estimation. Advances in Neural Information Processing Systems, 34, 2021.
- JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC. arXiv preprint arXiv:2004.07470, 2021.
- KSH [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- KT [19] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
- LBBH [98] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- LHBB [99] Yann LeCun, Patrick Haffner, Léon Bottou, and Yoshua Bengio. Object recognition with gradient-based learning. In Shape, contour and grouping in computer vision, pages 319–345. Springer, 1999.
- LL [18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in neural information processing systems, 31, 2018.
- LSS+ [20] Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. Generalized leverage score sampling for neural networks. In NeurIPS, 2020.
- LSX+ [23] Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning and weight shifting for softmax regression. arXiv preprint arXiv:2304.13276, 2023.
- LSY [23] Xiaoxiao Li, Zhao Song, and Jiaming Yang. Federated adversarial learning: A framework with convergence analysis. In International Conference on Machine Learning, pages 19932–19959. PMLR, 2023.
- LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019.
- LSZ [23] Zhihang Li, Zhao Song, and Tianyi Zhou. Solving regularized exp, cosh and sinh regression problems. arXiv preprint arXiv:2303.15725, 2023.
- MM [13] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 91–100, 2013.
- MOSW [22] Alexander Munteanu, Simon Omlor, Zhao Song, and David Woodruff. Bounding the width of neural networks via coupled initialization a worst case analysis. In International Conference on Machine Learning, pages 16083–16122. PMLR, 2022.
- NN [13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2013.
- OS [20] Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
- QJS+ [22] Lianke Qin, Rajesh Jayaram, Elaine Shi, Zhao Song, Danyang Zhuo, and Shumo Chu. Adore: Differentially oblivious relational database operators. In VLDB, 2022.
- QRS+ [22] Lianke Qin, Aravind Reddy, Zhao Song, Zhaozhuo Xu, and Danyang Zhuo. Adaptive and dynamic multi-resolution hashing for pairwise summations. In BigData, 2022.
- QSZ [23] Lianke Qin, Zhao Song, and Ruizhe Zhang. A general algorithm for solving rank-one matrix sensing. arXiv preprint arXiv:2303.12298, 2023.
- QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In International Conference on Artificial Intelligence and Statistics, pages 101–156. PMLR, 2023.
- RPU+ [20] Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pages 8253–8265. PMLR, 2020.
- RSW [16] Ilya Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’16, page 250–263, 2016.
- SHM+ [16] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
- SLJ+ [15] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- SSBD [14] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- SSS+ [17] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
- SSX [23] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. A tale of two efficient value iteration algorithms for solving linear mdps with large action space. In AISTATS, 2023.
- SSZ [23] Ritwik Sinha, Zhao Song, and Tianyi Zhou. A mathematical abstraction for balancing the trade-off between creativity and reality in large language models. arXiv preprint arXiv:2306.02295, 2023.
- SWYZ [21] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
- SWZ [17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise -norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC), 2017.
- [76] Zhao Song, David Woodruff, and Peilin Zhong. Average case column subset selection for entrywise -norm loss. Advances in Neural Information Processing Systems (NeurIPS), 32:10111–10121, 2019.
- [77] Zhao Song, David Woodruff, and Peilin Zhong. Towards a zero-one law for column subset selection. Advances in Neural Information Processing Systems, 32:6123–6134, 2019.
- [78] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. arXiv preprint arXiv:1704.08246, 2019.
- SY [19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
- SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
- SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
- SYZ [21] Zhao Song, Shuo Yang, and Ruizhe Zhang. Does preprocessing help training over-parameterized neural networks? Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
- SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
- WYW+ [23] Junda Wu, Tong Yu, Rui Wang, Zhao Song, Ruiyi Zhang, Handong Zhao, Chaochao Lu, Shuai Li, and Ricardo Henao. Infoprompt: Information-theoretic soft prompt tuning for natural language understanding. arXiv preprint arXiv:2306.04933, 2023.
- WZ [16] David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847–858. IEEE, 2016.
- WZD+ [20] Ruosong Wang, Peilin Zhong, Simon S Du, Russ R Salakhutdinov, and Lin F Yang. Planning with general objective functions: Going beyond total rewards. In Annual Conference on Neural Information Processing Systems (NeurIPS), 2020.
- XSS [21] Zhaozhuo Xu, Zhao Song, and Anshumali Shrivastava. Breaking the linear iteration cost barrier for some well-known conditional gradient methods using maxip data-structures. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
- XZZ [18] Chang Xiao, Peilin Zhong, and Changxi Zheng. Bourgan: generative networks with metric embeddings. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 2275–2286, 2018.
- ZCZG [20] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks. In Mach. Learn., 2020.
- ZG [19] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. Advances in neural information processing systems, 32, 2019.
- ZHA+ [21] Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin. Scaling neural tangent kernels via sketching and random features. Advances in Neural Information Processing Systems, 34, 2021.
- Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
- ZMG [19] Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- ZPD+ [20] Yi Zhang, Orestis Plevrakis, Simon S Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Over-parameterized adversarial training: An analysis overcoming the curse of dimensionality. Advances in Neural Information Processing Systems, 33:679–688, 2020.
- ZSZ+ [23] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv preprint arXiv:2306.14048, 2023.
Appendix
Roadmap.
Appendix A Probabilistic Inequalities
Here we present the Hoeffding bound, which characterizes the probability that the sum of independent bounded random variables deviates from its mean by a certain amount.
Lemma A.1 (Hoeffding bound ([41])).
Let denote independent bounded variables in . Let , then we have
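Since the displayed inequality is not reproduced above, the block below gives the standard form of Hoeffding's bound, a hedged reconstruction with generic notation.

```latex
% Hoeffding's inequality for independent X_1,\dots,X_n with X_i \in [a_i, b_i]
% and S_n = \sum_{i=1}^n X_i:
\Pr\big[\, |S_n - \mathbb{E}[S_n]| \ge t \,\big]
\;\le\; 2\exp\!\left( -\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).
```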
We present Jensen's Inequality, which relates the value of a convex function at a convex combination of inputs to the same convex combination of the function values at those inputs.
Lemma A.2 (Jensen’s Inequality).
Let be a convex function, and , such that . Then for all , we have:
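Since the display is not reproduced above, the block below gives the standard form of Jensen's inequality for a finite convex combination; the notation is assumed.

```latex
% Jensen's inequality for a convex function f and weights \alpha_i \ge 0 with \sum_i \alpha_i = 1:
f\!\left( \sum_{i=1}^{n} \alpha_i x_i \right) \;\le\; \sum_{i=1}^{n} \alpha_i\, f(x_i).
```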
Appendix B More preliminaries
In Section B.1, we introduce the necessary preliminaries related to neural tangent kernel training. In Section B.2, we discuss some fundamental aspects of the random feature scheme.
B.1 The Neural Tangent Kernel Training
Given fixed network parameters , and , the neural tangent kernel (as defined in [44]) for weights is defined as follows:
The Neural Tangent Kernel space, denoted as , is a linear approximation of the updates in the function value as a result of small alterations in the network weights . More formally, iff there is an such that:
(1)
Note that is the minimal Euclidean norm of that satisfies Eq. (1). For simplicity, we use to denote , which is the expected initial neural tangent kernel:
In the rest of the paper, we will abbreviate NeuralTangentKernelTraining as neural tangent kernel training with parameters: activation function , input dimension , weight dimension , loss function , learning rate , SGD batch size , and iteration number . Moreover, in this optimization process, we will initialize the weight vector following and set the initial kernel weight as .
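Since the displayed formulas in this subsection are not reproduced above, the block below gives a hedged reconstruction of the standard two-layer NTK expressions, using generic notation ($f_W$ for the network, $W^{(0)}$ for the initialization, $\sigma$ for the activation) rather than the paper's exact symbols.

```latex
% Empirical NTK at weights W (hidden layer only), for a two-layer network
% f_W(x) = \sum_{i=1}^m a_i\,\sigma(\langle w_i, x\rangle - b) with a_i \in \{\pm 1\}:
k_W(x, x') \;=\; \big\langle \nabla_W f_W(x),\, \nabla_W f_W(x') \big\rangle
          \;=\; \sum_{i=1}^{m} \sigma'(\langle w_i, x\rangle - b)\,
                               \sigma'(\langle w_i, x'\rangle - b)\,
                               \langle x, x' \rangle .

% The NTK space consists of linear approximations around the initialization:
g_U(x) \;=\; \big\langle \nabla_W f_{W^{(0)}}(x),\, U \big\rangle ,
\qquad U \in \mathbb{R}^{m \times d}.
```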
B.2 Random Feature Scheme
In this section, we present some essential background on random feature schemes. We begin with the definition of a random feature scheme with respect to a kernel.
Definition B.1 (Random Feature Scheme).
Let be a measurable space and let be a kernel. A random feature scheme (RFS) for is a pair where is a probability measure on a measurable space , and is a measurable function, such that:
(2)
where is the standard Gaussian measure on , which is an RFS for the kernel .
Next, we present the definition of -bounded RFS. For activation function , the NTK RFS is -bounded for .
Definition B.2 (-bounded RFS).
We say that is -bounded if .
Furthermore, we present the definition of factorized RFS. Additionally, the NTK RFS can be factorized.
Definition B.3 (Factorized RFS).
We say that an RFS is factorized if there exists a function such that .
For the remainder of this paper, we use SGDRFS as a shorthand to denote the Stochastic Gradient Descent in the Random Feature Space. This encapsulates the following parameters: the Random Feature Space (RFS) , input dimension , the count of random features , the loss function , learning rate , SGD batch size , and the number of iterations . Additionally, the optimization process will initialize the weight vector as .
B.3 Several Instances of
Appendix C Missing Proofs
In Section C.1 we present the proof of the Equivalence for NNT and NTKT. In Section C.2, we present the proof of Theorem 5.6. In Section C.3, we present the proof of running time for our algorithm.
C.1 Proof of Equivalence for NNT and NTKT
Lemma C.1 (Equivalence for NNT and NTKT, restatement of Lemma 5.3).
If the following conditions hold
• Fix a descent activation as well as a convex descent loss (Definition 3.1).
There is a choice , such that for every input distribution the following holds: Let be the functions returned by NeuralNetworkTraining with parameters and NeuralTangentKernelTraining.
Then, we have
• .
Proof.
For simplicity, we first prove the lemma under the assumption that the activation function is twice differentiable and satisfies . Then, at the end of this proof, we show how this implies the result for the case where the activation function is -descent.
We analyze two different implementations of the NeuralNetworkTraining algorithm: The first implementation initiates with weights sampled from distribution and adopts learning rate . The second implementation utilizes the exact same mini-batches and hyperparameters as the first one, with the exception that the output weights are scaled by a factor and the learning rate is divided by , i.e., and . Essentially, this transforms the network function from to .
Next, we show that the second implementation of NeuralNetworkTraining approximates NeuralTangentKernelTraining. Compared to the first implementation, the gradient of the hidden layer becomes times larger, while the gradient of the output layer remains unchanged, i.e.,
Consequently, the overall shift is scaled down by a factor of . Thus, the optimization process operates within a sphere of radius around , where is a polynomial in .
Next, we examine the first-order approximation of around the initial weight, specifically,
The first step is derived from for the initial weight and signifies a uniform bound on the Hessian of , which is obtained from the fact that . Since doesn’t depend on , for a sufficiently large , the quadratic part . Thus, we only need to consider the scenario where the optimization is conducted over the linear function with a learning rate of and starting at 0. This is equivalent to NeuralTangentKernelTraining that optimizes over the linear function with a learning rate of and starting at 0.
It’s important to note that any -descent activation function locally ensures . Additionally, if is sufficiently large, the output of the hidden layer before the activation remains largely stable throughout the entire optimization process. Given this, we don’t transition into different regions that comply with for every sample in the mini-batches.
∎
C.2 Proof of Theorem 5.6
In this section, we first present the correctness theorem for SGDRFS and prove it using Lemma C.5. Then, we present some definitions and lemmas needed to prove Lemma C.5. Finally, we present the proof of Lemma C.5.
Theorem C.2 (Restatement of Theorem 5.6).
Proof.
First, by Hoeffding's bound (Lemma A.1), for any and every , we have:
(3)
For the ease of proof, we introduce the definition of :
Definition C.3 ().
Given and function , we denote as follows:
Corollary C.4 (Function Approximation).
For the following conditions:
• for all , ,
• if represents a distribution on ,
we establish that:
Proof.
From Equation (6), we find that . Additionally, for every , the variance of can be computed as follows:
The initial step is derived from the fact that is -bounded (see Definition B.2), while the concluding step is drawn from Eq. (5). Consequently, it directly leads us to:
(7)
Additionally, when represents a distribution on the set , we can derive that:
where the first step follows from Jensen's inequality (Lemma A.2), the second step follows from plugging in the definition of , the third step follows from exchanging the order of expectation, and the last step follows from using Eq. (7). ∎
Thus, random features are adequate to ensure an expected distance of no more than . Next, we present a situation where a -dimensional random feature is as effective as one-dimensional random features. Specifically, random features are sufficient to guarantee an expected distance of at most .
Lemma C.5 (Closeness of and ).
Proof.
Let us have and . We have:
where the first step comes from Jensen's inequality (Lemma A.2), the second step follows from plugging in the definition of , the third step follows from changing the order of expectations, the fourth step follows from Theorem 3.9 and Eq. (6), the fifth step is due to the fact that the variance is bounded by the squared -norm, the sixth step follows from the fact that the considered RFS is factorized (Definition B.3), the seventh step is because and are -bounded (Definition B.2), the eighth step comes from the fact that is -bounded (Definition 1.1), and the final step is due to Eq. (5). Taking the square root of both sides yields:
C.3 Proof of Running Time
Proof.
The per-iteration time complexity can be decomposed as follows:
• Querying the active neuron set for takes time,
where the first step follows from Corollary 6.8, and the final step follows from calculation.
• Backward computation takes time.
– Computing takes time .
• Updating the weight vectors in the HalfSpaceReport data structure takes time.
Summing over all the above terms gives us the per iteration running time . ∎