Solving Regularized Exp, Cosh and Sinh Regression Problems
In modern machine learning, attention computation is a fundamental task for training large language models such as Transformer, GPT-4 and ChatGPT. In this work, we study the exponential regression problem which is inspired by the softmax/exp unit in the attention mechanism in large language models. The standard exponential regression is non-convex. We study the regularized version of the exponential regression problem, which is a convex problem, and use the approximate Newton method to solve it in input sparsity time.
Formally, in this problem, one is given a matrix $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, and any of the functions $\exp$, $\cosh$ and $\sinh$, denoted as $f$. The goal is to find the optimal $x$ that minimizes $0.5 \| f(Ax) - b \|_2^2 + 0.5 \| \mathrm{diag}(w) A x \|_2^2$. The straightforward method is to use the naive Newton's method. Let $\mathrm{nnz}(A)$ denote the number of non-zero entries in matrix $A$. Let $\omega$ denote the exponent of matrix multiplication. Currently, $\omega \approx 2.373$. Let $\epsilon$ denote the accuracy error. In this paper, we make use of the input sparsity and propose an algorithm that uses $\log(\|x_0 - x^*\|_2/\epsilon)$ iterations and $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$ time per iteration to solve the problem.
1 Introduction
State-of-the-art language models like Transformer [69], BERT [21], GPT-3 [6], PaLM [18], and OPT [80] exhibit greater proficiency in natural language processing when compared to smaller models or traditional techniques. These models have the capacity to understand and generate intricate language, proving beneficial in various applications such as language translation, sentiment analysis, and question answering. LLMs can be customized for multiple purposes without necessitating their reconstruction from scratch. An instance of this is ChatGPT, an OpenAI-developed chat software that employs GPT-3’s full potential. The latest iteration, GPT-4 [53], has the potential to surpass GPT-3 in its impressive capabilities, including text generation, question answering, and language translation. This development could lead to significant implications in the field of NLP, with new applications potentially emerging in areas such as virtual assistants, chatbots, and automatic content creation. However, even though deep learning has seen a swift rise in popularity, we believe that there remain gaps in our understanding of the attention mechanism and the reasoning behind its effectiveness.
The primary technical foundation behind LLMs is the attention matrix [69, 55, 21, 6, 3, 77]. An attention matrix is a matrix whose rows and columns align with individual words or “tokens” and their relationships within a given text. Its purpose is to measure how critical each token in a sequence is to the intended output. The attention matrix is learned during training. These parameters are optimized to maximize the model’s accuracy in predicting the desired output. Through the attention mechanism, each input token is evaluated based on its importance or relevance to the desired output. This is achieved by weighing the token score, which is based on a similarity function comparing the current output state and input states.
More formally, the attention matrix can be expressed by considering two matrices, $Q$ and $K$, containing query and key tokens, respectively. Both $Q$ and $K$ hold values in $d$-dimensional space. The attention matrix can be denoted by a square matrix of size $n \times n$. This matrix establishes a relationship between the input tokens in the sequence, where every entry represents the attention weight or score between a particular input token (query token) and an output token (key token). It is essential to mention that the diagonal entries of this matrix display self-attention scores, signifying the importance of each token with respect to itself. A majority of the methods used for effective computation of attention matrices are divided into two primary categories based on their approach. One approach involves leveraging sparsity, as seen in Reformer [45], while the other involves utilizing the low-rank attributes of the attention matrices, as observed in Linformer [73] and Performer [16].
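To make this object concrete, here is a minimal numerical sketch of a softmax-style attention matrix built from query and key matrices. The shapes ($n$ tokens of width $d$) and the row-wise normalization $D^{-1}\exp(QK^\top)$ follow the standard formulation; they are our illustrative choices rather than definitions taken verbatim from this paper.

```python
import numpy as np

def attention_matrix(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Row-normalized exp(QK^T): a standard softmax-style attention matrix.

    Q, K: (n, d) query/key matrices. Returns an (n, n) matrix whose
    (i, j) entry scores query token i against key token j; each row
    sums to one because of the diagonal normalization D^{-1}.
    """
    scores = np.exp(Q @ K.T)            # entrywise exp of QK^T
    D = np.diag(scores.sum(axis=1))     # diagonal normalization factors
    return np.linalg.inv(D) @ scores    # D^{-1} exp(QK^T)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 5, 3
    Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    A = attention_matrix(Q, K)
    print(A.shape, A.sum(axis=1))       # (5, 5); every row sums to 1
```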
During training, our primary goal is to tackle the issue of multiple attention regression by utilizing the exponential function and its corresponding equation: , where  denotes the normalization factor. However, upon further investigation, we note that the single regression scenario has not been thoroughly studied. As a result, this study is centered on the single regression setting.
In our setting, the presence of the normalization factor is unnecessary, due to the fact that $b$ is a column vector (but not a matrix). Thus, we have opted to address the optimization problem without the normalization factor. It is important to note that our approach is not exclusive to the exponential function, and can also be extended to other hyperbolic functions such as $\cosh$ and $\sinh$.¹

¹ Here $\cosh(x) = \frac{\exp(x)+\exp(-x)}{2}$ and $\sinh(x) = \frac{\exp(x)-\exp(-x)}{2}$.
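As a concrete reference point, the following sketch evaluates the single-regression objective studied in this paper for each of the three units, assuming the regularized form $0.5\|f(Ax)-b\|_2^2 + 0.5\|\mathrm{diag}(w)Ax\|_2^2$ stated in the abstract; the dimensions and data below are arbitrary illustrative choices.

```python
import numpy as np

def regularized_loss(x, A, b, w, f=np.exp):
    """0.5*||f(Ax) - b||_2^2 + 0.5*||diag(w) A x||_2^2 for f in {exp, cosh, sinh}."""
    u = A @ x
    fit = 0.5 * np.sum((f(u) - b) ** 2)   # data-fitting term
    reg = 0.5 * np.sum((w * u) ** 2)      # regularization term diag(w) A x
    return fit + reg

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 8, 3
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    w = np.ones(n)
    x = rng.standard_normal(d)
    for f in (np.exp, np.cosh, np.sinh):
        print(f.__name__, regularized_loss(x, A, b, w, f))
```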
1.1 Our Results
We state our result as follows.
Theorem 1.1 (Main result, Informal version of Theorem 6.11).
Given matrix , , and .
Let $f$ be any of the functions $\exp$, $\cosh$ and $\sinh$. Let $g$ denote the gradient of the loss function.
Let denote the optimal solution of
that and .
Let . Let for all .
Let .
Let denote an initial point such that .
For any accuracy parameter $\epsilon$ and failure probability $\delta$, there is a randomized algorithm (Algorithm 1) that runs in $\log(\|x_0 - x^*\|_2/\epsilon)$ iterations and spends $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$
time per iteration, and finally outputs a vector such that
holds with probability at least .
1.2 Related Work
Input sparsity time algorithms
Input sparsity is a term used to describe datasets in which a majority of elements are either zero or negligible. Utilizing algorithms optimized for sparse input data enables faster processing times than dense data algorithms. This is because these algorithms can focus solely on non-zero elements, thereby minimizing computation and memory usage. A sparse algorithm’s time complexity depends only on the number of non-zero elements rather than the total number of elements. Input sparsity algorithms are highly applicable within fields such as solving the John Ellipsoid problem [65], discrepancy minimization [26], low-rank approximation [19, 56, 60, 63], subspace embeddings [51], column subset selection [61, 62] and least squares regression [27].
Algorithmic Regularization
The standard exponential regression is non-convex, so we study the regularized version of the exponential regression problem, which is convex. In the context of deep learning, the non-convexity of the objective function necessitates the use of regularization techniques. This becomes essential because the objective function may have multiple global minima that are widely scattered and vary significantly in their generalization capabilities. Algorithmic regularization can be observed in various machine learning applications such as binary classification [58, 48, 10], matrix factorization [30, 1], convolutional neural networks [30, 42], generative adversarial networks [4], contrastive learning [72] and mixture of experts [13]. There are numerous factors that can induce algorithmic regularization. [29, 33, 46, 59, 50] introduce how learning rate and batch size help regularize the training process. [50] explains that while a small initial learning rate may result in prompt training and better performance initially, a larger learning rate tends to yield improved generalization shortly after the annealing of the learning rate. The regularizer’s role in the GAN model is expounded in [4]. In addition, [38] has conducted a theoretical analysis of the regularization generated by momentum. Furthermore, the employment of an adaptive step-size optimizer, such as the Adam optimizer, can also function as a form of regularization during the training process [44, 52, 20, 74, 75, 40]. Batch normalization is analyzed in a theoretical capacity in [2, 32, 36], which delve into its impact on regularization. Separate studies [57, 71] explain how introducing dropout into the training process can prevent overfitting. In [25], the authors apply TensorSketch to the Kronecker product of multiple matrices efficiently without explicitly computing the tensor product, and also apply regularization to solve Kronecker product regression.
Attention Theory
The topic of attention’s expressivity has been a focus of early theoretical works. In the case of self-attention blocks, [68] interpret self-attention as a system of self-interacting particles and give a theoretical explanation of attention. [28] explain it from an inductive-bias and variable-creation perspective. The role of attention in Transformers was studied by [70, 22]. In terms of optimization, [78] examined the impact of adaptive approaches on attention models, while [66] analyzed the dynamics of single-head attention to approximate the Seq2Seq architecture’s learning process. For most LLMs, it generally suffices to conduct attention computations in an approximate manner during the inference process, provided that there are adequate assurances of accuracy. Research conducted by various sources such as [15, 45, 73, 23, 47, 12, 11] has underscored this perspective. Motivated by this, [77, 3] study the computation of the attention matrix from the hardness perspective and propose faster algorithms.
Newton Method and Hessian Computation
Computing the Hessian, or approximately computing the Hessian, is a standard task in convex optimization. Many previous works have followed this direction and used it to improve several optimization problems, such as linear programming [17, 49, 8, 5, 64, 43, 24, 9, 31], empirical risk minimization [49, 54], the cutting plane method [39], semi-definite programming [37, 34, 31], sum of squares [41], and training over-parameterized neural networks [79, 14, 7, 67, 35, 76].
Roadmap.
We organize the rest of the paper as follows. In Section 2 we provide some tools for basic algebra and the analysis of a regularization term. In Section 3 we provide a detailed analysis of the loss function based on $\exp$ (denoted as $L_{\exp}$) and the loss function with a regularization term. In Section 4 we provide a detailed analysis of $\cosh$ and $L_{\cosh}$. In Section 5 we provide a detailed analysis of $\sinh$ and $L_{\sinh}$. In Section 6 we provide an approximate version of the Newton method for solving convex optimization problems, which is more efficient under certain assumptions.
2 Preliminary
In this section, we provide preliminaries to be used in our paper. In Section 2.1 we introduce the notation we use. In Section 2.2 we provide some facts about exact and approximate computations. In Section 2.3, we provide some basic facts regarding the gradient and Hessian. In Section 2.4, we compute the gradient and Hessian of a regularization term $L_{\mathrm{reg}}$.
2.1 Notations
We use $\mathbb{R}$ to denote the set of real numbers. We use $\mathbb{R}_{\geq 0}$ to denote the set of non-negative real numbers.
For any vector $x \in \mathbb{R}^n$, we use $\exp(x) \in \mathbb{R}^n$ to denote the vector whose $i$-th entry is $\exp(x_i)$.
For a vector $x \in \mathbb{R}^n$, we use $\|x\|_2$ to denote its $\ell_2$ norm, i.e., $\|x\|_2 := (\sum_{i=1}^n x_i^2)^{1/2}$.
For any matrix $A$, we denote the spectral norm of $A$ by $\|A\|$, i.e., $\|A\| := \max_{x \neq 0} \|Ax\|_2 / \|x\|_2$.
For a matrix $A$, we use $\sigma_{\max}(A)$ to denote the largest singular value of $A$. We use $\sigma_{\min}(A)$ to denote the smallest singular value of $A$.
For a vector , we use to denote .
For each , we use to denote a vector that has length and the -th entry is for all .
For a vector $w \in \mathbb{R}^n$, we use $\mathrm{diag}(w)$ to generate a diagonal matrix where the $i$-th entry on the diagonal is $w_i$ for all $i \in [n]$.
Given two column vectors $x, y \in \mathbb{R}^n$, we use $x \circ y$ to denote the column vector whose $i$-th entry is $x_i y_i$.
We use $\mathbf{1}_n$ to denote a length-$n$ vector where all the entries are ones.
We say $A \succeq B$ if $x^\top A x \geq x^\top B x$ for all vectors $x$.
We define and .
For any given matrix $A$, we use $\mathrm{nnz}(A)$ to denote the number of non-zero entries of $A$, i.e., $\mathrm{nnz}(A) := |\{(i,j) \mid A_{i,j} \neq 0\}|$.
For a diagonal matrix , we say is a -sparse diagonal matrix, i.e., .
For any function $f$, we use $\widetilde{O}(f)$ to denote $f \cdot \mathrm{poly}(\log f)$.
2.2 Basic Algebras
We state some facts which can give exact computation.
Fact 2.1 (, , exact and approximate for general range).
We have
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
We state some facts which can give a reasonable approximate computation.
Fact 2.2 (, , approximate computation in small range).
We have
-
•
Part 1. For any  satisfying , we have
-
•
Part 2. For any  satisfying , we have
-
•
Part 3. For any  satisfying , we have
-
•
Part 4. For any  satisfying , we have
-
•
Part 5. For any  satisfying , we have
-
•
Part 6. For any  satisfying , we have
Proof.
Most of the proofs are standard; we only provide proofs for some of them.
Proof of Part 5.
We have
where the first step follows from the definition of , the second step follows from triangle inequality, the third step follows from simple algebra, the fourth step follows from the fact that for all and the last step follows from the definition of .
Proof of Part 6.
We have
where the first step follows from the definition of , the second step follows from triangle inequality, the third step follows from simple algebra, the fourth step follows from the fact that for all and the last step follows from the definition of . ∎
Fact 2.3.
For any vectors and , we have
-
•
-
•
-
–
-
–
-
–
-
•
-
•
Fact 2.4.
For vectors , we have
-
•
Part 1.
-
•
Part 2.
-
•
Part 3.
-
•
Part 4.
-
•
Part 5.
-
•
Part 6.
-
•
Part 7. For any , we have
-
•
Part 8. For any , we have
-
•
Part 9. For any , we have
Proof.
Most of the proofs are standard; we only provide proofs for some of them.
Proof of Part 8.
We have
where the first step follows from Part 5 in Fact 2.2 and the last step follows from the definition of norm. By summing up the square of both sides, we have
Proof of Part 9.
We have
where the first step follows from Part 6 in Fact 2.2 and the last step follows from the definition of norm. By summing up the square of both sides, we have
∎
Fact 2.5.
For matrices , we have
-
•
-
•
-
•
-
•
-
•
If , then
Fact 2.6.
Let . If
Then
-
•
-
•
-
•
-
•
-
•
-
•
-
•
-
•
2.3 Standard Gradient and Hessian Computation
Lemma 2.7 (Standard Gradient and Hessian Computation).
We have
-
•
Part 1. .
-
•
Part 2.
-
•
Part 3.
Proof.
Proof of Part 1.
It trivially follows from chain rule.
Proof of Part 2.
Taking the derivative of the vector with respect to each of its own entries directly gives the stated result.
Proof of Part 3.
where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, and the last step is due to the constant entries of the matrix .
∎
2.4 Regularization term
Lemma 2.8.
Let $w \in \mathbb{R}^n$ denote a weight vector. Let $L_{\mathrm{reg}}(x) := 0.5 \| W A x \|_2^2$. Let $W = \mathrm{diag}(w) \in \mathbb{R}^{n \times n}$ denote a diagonal matrix.
Then we have
-
•
Part 1. Gradient: $\frac{\mathrm{d} L_{\mathrm{reg}}(x)}{\mathrm{d} x} = A^\top W^2 A x$
-
•
Part 2. Hessian: $\frac{\mathrm{d}^2 L_{\mathrm{reg}}(x)}{\mathrm{d} x^2} = A^\top W^2 A$
Proof.
Proof of Part 1.
where the first step follows from chain rule.
Proof of Part 2.
where the first step follows from the expansion of the Hessian, the second step follows by applying the arguments in Part 1, and the third step follows from simple algebra. ∎
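The closed forms above are easy to sanity-check numerically. The sketch below compares the gradient $A^\top W^2 A x$ and Hessian $A^\top W^2 A$ of $L_{\mathrm{reg}}(x) = 0.5\|WAx\|_2^2$ against central finite differences; the function names and test dimensions are illustrative choices of ours.

```python
import numpy as np

def L_reg(x, A, W):
    return 0.5 * np.linalg.norm(W @ A @ x) ** 2

def grad_reg(x, A, W):
    return A.T @ W @ W @ A @ x      # A^T W^2 A x

def hess_reg(A, W):
    return A.T @ W @ W @ A          # A^T W^2 A (independent of x)

rng = np.random.default_rng(2)
n, d, eps = 6, 4, 1e-5
A = rng.standard_normal((n, d))
W = np.diag(rng.uniform(0.5, 2.0, size=n))
x = rng.standard_normal(d)

# central finite-difference gradient as a check
fd_grad = np.array([(L_reg(x + eps * e, A, W) - L_reg(x - eps * e, A, W)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(fd_grad, grad_reg(x, A, W), atol=1e-6))   # expected: True
```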
Lemma 2.9.
Let .
Let denote a diagonal matrix where all diagonal entries are positive. Then we have
Proof.
For any , we have
where the first step follows from simple algebra, the second step follows from the definition of , the third step follows from simple algebra, the fourth step follows from the definition of , the fifth step follows from definition of , the last step follows from simple algebra . ∎
3 Exponential Regression
In this section, we provide a detailed analysis of $L_{\exp}$. In Section 3.1 we define the loss function based on $\exp$. In Section 3.2 we compute the gradient of $L_{\exp}$ in detail. In Section 3.3 we compute the Hessian of $L_{\exp}$ in detail. In Section 3.4, we summarize the results of Section 3.2 and Section 3.3 and obtain the gradient and Hessian of $L_{\exp}$. In Section 3.5 we define $L$ by adding the regularization term from Section 2.4 to $L_{\exp}$ and compute the gradient and Hessian of $L$. In Section 3.6 we prove that the Hessian is positive definite and thus show that $L$ is convex. In Section 3.7 we provide an upper bound showing that the Hessian of $L$ is Lipschitz.
3.1 Definitions
Definition 3.1 (Loss function for Exp Regression).
Given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$. For a vector $x \in \mathbb{R}^d$, we define the loss function $L_{\exp}$ as follows: $L_{\exp}(x) := 0.5 \cdot \| \exp(Ax) - b \|_2^2$.
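Applying the chain rule to this definition gives the gradient $A^\top(\exp(Ax)\circ(\exp(Ax)-b))$ and the Hessian $A^\top \mathrm{diag}(\exp(Ax)\circ(2\exp(Ax)-b))A$. The sketch below records these closed forms in code, as a hedged numerical companion to Sections 3.2 and 3.3, and checks the gradient against finite differences; all variable names are ours.

```python
import numpy as np

def loss_exp(x, A, b):
    return 0.5 * np.linalg.norm(np.exp(A @ x) - b) ** 2

def grad_exp(x, A, b):
    u = np.exp(A @ x)
    return A.T @ (u * (u - b))                  # A^T (exp(Ax) o (exp(Ax) - b))

def hess_exp(x, A, b):
    u = np.exp(A @ x)
    return A.T @ np.diag(u * (2 * u - b)) @ A   # A^T diag(exp(Ax) o (2 exp(Ax) - b)) A

rng = np.random.default_rng(3)
n, d, eps = 7, 3, 1e-5
A, b = rng.standard_normal((n, d)) * 0.3, rng.standard_normal(n)
x = rng.standard_normal(d) * 0.3

fd = np.array([(loss_exp(x + eps * e, A, b) - loss_exp(x - eps * e, A, b)) / (2 * eps)
               for e in np.eye(d)])
print(np.allclose(fd, grad_exp(x, A, b), atol=1e-5))   # gradient matches finite differences
```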
3.2 Gradient
Lemma 3.2 (Gradient for Exp).
We have
-
•
Part 1.
-
•
Part 2.
Further, we have for each
-
•
Part 3.
-
•
Part 4.
-
•
Part 5.
Proof.
Proof of Part 1.
For each , we have
where the first and second step follow from the differential chain rule.
Proof of Part 2.
We have
where the first step follows from the differential chain rule, and the second step follows from in Part 1.
Proof of Part 3.
We have
where the first step follows from the property of the gradient, the second step follows from simple algebra, and the last step directly follows from Lemma 2.7.
Proof of Part 4.
By substituting the result of Part 2 into Part 3, we get
where the first step follows from the result of Part 2 and , the second step follows from the result of Lemma 2.7, the last step follows from Fact 2.3.
Proof of Part 5.
We have
where this step follows from the result of Part 4 directly. ∎
3.3 Hessian
Lemma 3.3.
-
•
Part 1.
-
•
Part 2.
-
•
Part 3.
-
•
Part 4.
Proof.
Proof of Part 1.
where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, the third step extracts the matrix with constant entries out of the derivative, and the last step also follows from the chain rule.
Proof of Part 2.
where the first step is an expansion of the Hessian, the second step follows from Part 3 of Lemma 3.2, the third step follows from Part 3 of Lemma 3.2.
Proof of Part 3.
where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 3.2, the third step follows from differential chain rule and Part 1 of Lemma 3.2, the last step follows from Fact 2.3.
Proof of Part 4.
where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 3.2, the third step follows from differential chain rule and Part 1 of Lemma 3.2, the last step follows from Fact 2.3.
∎
3.4 Gradient and Hessian of the Loss function for Exp Function
Lemma 3.4.
Let be defined in Definition 3.1. Then for any , we have
-
•
Part 1. Gradient
-
•
Part 2. Hessian
3.5 Loss Function with a Regularization Term
Definition 3.5.
Given matrix $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and $w \in \mathbb{R}^n$. For a vector $x \in \mathbb{R}^d$, we define the loss function $L$ as follows
$L(x) := L_{\exp}(x) + L_{\mathrm{reg}}(x)$,
where $L_{\mathrm{reg}}(x) := 0.5 \| \mathrm{diag}(w) A x \|_2^2$.
Lemma 3.6.
Let be defined as Definition 3.5, then we have
-
•
Part 1. Gradient
-
•
Part 2. Hessian
3.6 Hessian is Positive Definite
Lemma 3.7 (Hessian is positive definite).
Let denote a parameter.
If the following conditions hold
-
•
Let
-
•
for all
Then, we have
Proof.
We define
Then we can rewrite Hessian as
We define
Then we have
where the first step follows from simple algebra, the second step follows from replacing with , the third step follows from , the fourth step follows from simple algebra, the fifth step follows from .
Since we know for all and Lemma 2.9, we have
Thus, the Hessian is always positive definite, and therefore the function is convex.
∎
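A quick numerical illustration of the role of the regularizer in Lemma 3.7: the data term $A^\top\mathrm{diag}(\exp(Ax)\circ(2\exp(Ax)-b))A$ alone can fail to be positive semidefinite when $b$ is large, while adding the regularization Hessian $A^\top W^2 A$ with sufficiently large weights pushes the smallest eigenvalue above zero. The weight values below are arbitrary demonstration choices, not the threshold required by the lemma.

```python
import numpy as np

def hess_exp(x, A, b):
    u = np.exp(A @ x)
    return A.T @ np.diag(u * (2 * u - b)) @ A      # Hessian of 0.5*||exp(Ax)-b||^2

rng = np.random.default_rng(4)
n, d = 7, 3
A = rng.standard_normal((n, d)) * 0.3
b = np.abs(rng.standard_normal(n)) * 5.0           # large b tends to make the data term indefinite
x = rng.standard_normal(d) * 0.3

H_data = hess_exp(x, A, b)
for w_val in (0.0, 10.0):                          # no regularization vs. heavy regularization
    H = H_data + (w_val ** 2) * (A.T @ A)          # add A^T W^2 A with w_i = w_val for all i
    print(w_val, np.linalg.eigvalsh(H).min())      # smallest eigenvalue of the full Hessian
```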
3.7 Hessian is Lipschitz
Lemma 3.8 (Hessian is Lipschitz).
If the following condition holds
-
•
Let
-
•
Let
-
•
-
•
-
•
-
•
Then we have
Proof.
We have
(1)
where the first step follows from  and simple algebra, the second step follows from Fact 2.5, the third step follows from simple algebra, the fourth step follows from simple algebra, the last step follows from Fact 2.4.
For the first term in Eq. (1), we have
(2)
For the second term in Eq. (1), we have
(3)
where the first step follows from Fact 2.4, the second step follows from Fact 2.4, the third step follows from , the fourth step follows from , the last step follows from .
For the third term in Eq. (1), we have
(4)
where the first step follows from  and Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from Fact 2.4, the fifth step follows from Fact 2.5 and , the last step follows from .
Putting it all together, we have
where the first step follows by applying Eq. (2), Eq. (3), and Eq. (4), the second step follows from simple algebra, the third step follows from , the last step follows from simple algebra.
∎
4 Cosh Regression
In this section, we provide a detailed analysis of $L_{\cosh}$. In Section 4.1 we define the loss function based on $\cosh$. In Section 4.2 we compute the gradient of $L_{\cosh}$ in detail. In Section 4.3 we compute the Hessian of $L_{\cosh}$ in detail. In Section 4.4, we summarize the results of Section 4.2 and Section 4.3 and obtain the gradient and Hessian of $L_{\cosh}$. In Section 4.5 we define $L$ by adding the regularization term from Section 2.4 to $L_{\cosh}$ and compute the gradient and Hessian of $L$. In Section 4.6 we prove that the Hessian is positive definite and thus show that $L$ is convex. In Section 4.7 we provide an upper bound showing that the Hessian of $L$ is Lipschitz.
4.1 Definition
Definition 4.1.
Given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$. For a vector $x \in \mathbb{R}^d$, we define the loss function $L_{\cosh}$ as follows: $L_{\cosh}(x) := 0.5 \cdot \| \cosh(Ax) - b \|_2^2$.
4.2 Gradient
Lemma 4.2 (Gradient for Cosh).
We have
-
•
Part 1.
-
•
Part 2.
Further, we have for each
-
•
Part 3.
-
•
Part 4.
-
•
Part 5.
Proof.
Proof of Part 1.
For each , we have
where the first and second step follow from the differential chain rule.
Thus, we complete the proof.
Proof of Part 2.
We have
where the first step follows from the differential chain rule, and the second step follows from in Part 1.
Proof of Part 3.
We have
where the first step follows from the property of the gradient, the second step follows from simple algebra, and the last step directly follows from Lemma 2.7.
Proof of Part 4.
By substituting the result of Part 2 into Part 3, we get
where the first step follows from the result of Part 2, the second step follows from the result of Lemma 2.7, the last step follows from Fact 2.3.
Proof of Part 5.
We have
this follows from the result of Part 4 directly. ∎
4.3 Hessian
Lemma 4.3.
-
•
Part 1.
-
•
Part 2.
-
•
Part 3.
-
•
Part 4.
Proof.
Proof of Part 1.
where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, the third step extracts the matrix with constant entries out of the derivative, and the last step also follows from the chain rule.
Proof of Part 2.
where the first step is an expansion of the Hessian, the second and third steps follow from the differential chain rule.
Proof of Part 3.
Here in the proof, for simplicity, we use  to denote .
We have
where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 4.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, the last step follows from Fact 2.3.
Proof of Part 4.
where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 4.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, the last step follows from Fact 2.3.
∎
4.4 Gradient and Hessian of the Loss function for Cosh Function
Lemma 4.4.
Let $L_{\cosh}$ be defined as in Definition 4.1. Then for any $x \in \mathbb{R}^d$, we have
-
•
Part 1. Gradient
-
•
Part 2. Hessian
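For $L_{\cosh}(x) = 0.5\|\cosh(Ax)-b\|_2^2$ (Definition 4.1), the same chain-rule computation yields the gradient $A^\top(\sinh(Ax)\circ(\cosh(Ax)-b))$ and the Hessian $A^\top\mathrm{diag}(\sinh(Ax)\circ\sinh(Ax)+\cosh(Ax)\circ(\cosh(Ax)-b))A$. The short sketch below records these forms and checks the gradient numerically; it is an illustration under that definition, not code from the paper.

```python
import numpy as np

def grad_cosh(x, A, b):
    u = A @ x
    return A.T @ (np.sinh(u) * (np.cosh(u) - b))       # A^T (sinh(Ax) o (cosh(Ax) - b))

def hess_cosh(x, A, b):
    u = A @ x
    diag = np.sinh(u) ** 2 + np.cosh(u) * (np.cosh(u) - b)
    return A.T @ np.diag(diag) @ A                     # A^T diag(.) A

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3)) * 0.3
b = rng.standard_normal(6)
x = rng.standard_normal(3) * 0.3
eps = 1e-5
loss = lambda z: 0.5 * np.linalg.norm(np.cosh(A @ z) - b) ** 2
fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(fd, grad_cosh(x, A, b), atol=1e-5))  # gradient matches finite differences
```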
4.5 Loss Function with a Regularization Term
Definition 4.5.
Given matrix $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and $w \in \mathbb{R}^n$. For a vector $x \in \mathbb{R}^d$, we define the loss function $L$ as follows
$L(x) := L_{\cosh}(x) + L_{\mathrm{reg}}(x)$,
where $L_{\mathrm{reg}}(x) := 0.5 \| \mathrm{diag}(w) A x \|_2^2$.
Lemma 4.6.
Let be defined as Definition 4.5, then we have
-
•
Part 1. Gradient
-
•
Part 2. Hessian
4.6 Hessian is Positive Definite
Lemma 4.7.
If for all , then
Proof.
We define diagonal matrix
Then we can rewrite Hessian as
Then we have
where the first step follows from simple algebra, the second step follows from replacing with and (Fact 2.1), the third step follows from , the fourth step follows from simple algebra, the fifth step follows from .
Since we know for all and Lemma 2.9, we have
Thus, the Hessian is always positive definite, and therefore the function is convex. ∎
4.7 Hessian is Lipschitz
Lemma 4.8 (Hessian is Lipschitz).
If the following condition holds
-
•
Let
-
•
Let
-
•
-
•
-
•
-
•
Then we have
Proof.
We have
(5)
where the first step follows from  and simple algebra, the second step follows from Fact 2.5, the third step follows from simple algebra, the fourth step follows from simple algebra, the last step follows from Fact 2.4.
For the first term in Eq. (5), we have
(6)
For the second term in Eq. (5), we have
(7)
where the first step follows from Fact 2.4, the second step follows from Fact 2.4, the third step follows from , , the fourth step follows from  and the last step follows from .
For the third term in Eq. (5), we have
(8)
where the first step follows from  and Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from , the fifth step follows from Fact 2.5, the last step follows from .
Putting it all together, we have
where the first step follows by applying Eq. (6), Eq. (7), and Eq. (8), the second step follows from simple algebra, the third step follows from , the last step follows from simple algebra.
∎
5 Sinh Regression
In this section, we provide a detailed analysis of $L_{\sinh}$. In Section 5.1 we define the loss function based on $\sinh$. In Section 5.2 we compute the gradient of $L_{\sinh}$ in detail. In Section 5.3 we compute the Hessian of $L_{\sinh}$ in detail. In Section 5.4, we summarize the results of Section 5.2 and Section 5.3 and obtain the gradient and Hessian of $L_{\sinh}$. In Section 5.5 we define $L$ by adding the regularization term from Section 2.4 to $L_{\sinh}$ and compute the gradient and Hessian of $L$. In Section 5.6 we prove that the Hessian is positive definite and thus show that $L$ is convex. In Section 5.7 we provide an upper bound showing that the Hessian of $L$ is Lipschitz.
5.1 Definition
Definition 5.1.
Given $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$. For a vector $x \in \mathbb{R}^d$, we define the loss function $L_{\sinh}$ as follows: $L_{\sinh}(x) := 0.5 \cdot \| \sinh(Ax) - b \|_2^2$.
5.2 Gradient
Lemma 5.2 (Gradient for Sinh).
We have
-
•
Part 1.
-
•
Part 2.
Further, we have for each
-
•
Part 3.
-
•
Part 4.
-
•
Part 5.
Proof.
Proof of Part 1.
For each , we have
where the first and second step follow from the differential chain rule.
Thus, we complete the proof.
Proof of Part 2.
We have
where the first step follows from the differential chain rule, and the second step follows from in Part 1.
Proof of Part 3.
We have
where the first step follows from the property of the gradient, the second step follows from the differential chain rule, and the last step directly follows from Lemma 2.7.
Proof of Part 4.
By substituting the result of Part 2 into Part 3, we get
where the first step follows from the result of Part 2 and the second step follows from the result of Lemma 2.7, the last step follows from Fact 2.3.
Proof of Part 5.
We have
where this step follows from the result of Part 4 directly. ∎
5.3 Hessian
Lemma 5.3.
-
•
Part 1.
-
•
Part 2.
-
•
Part 3.
-
•
Part 4.
Proof.
Proof of Part 1.
where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, the third step extracts the matrix with constant entries out of the derivative, and the last step also follows from the chain rule.
Proof of Part 2.
where the first step is an expansion of the Hessian, the second and third steps follow from the differential chain rule.
Proof of Part 3.
where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 5.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, the last step follows from Fact 2.3.
Proof of Part 4.
where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 5.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, the last step follows from Fact 2.3.
∎
5.4 Gradient and Hessian of the Loss function for Sinh Function
Lemma 5.4.
Let $L_{\sinh}$ be defined as in Definition 5.1. Then for any $x \in \mathbb{R}^d$, we have
-
•
Part 1. Gradient
-
•
Part 2. Hessian
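Analogously, for $L_{\sinh}(x) = 0.5\|\sinh(Ax)-b\|_2^2$ (Definition 5.1) the chain rule gives the gradient $A^\top(\cosh(Ax)\circ(\sinh(Ax)-b))$ and the Hessian $A^\top\mathrm{diag}(\cosh(Ax)\circ\cosh(Ax)+\sinh(Ax)\circ(\sinh(Ax)-b))A$. The sketch below checks the Hessian form against finite differences of the gradient; again, this is only an illustrative companion to the lemma above.

```python
import numpy as np

def grad_sinh(x, A, b):
    u = A @ x
    return A.T @ (np.cosh(u) * (np.sinh(u) - b))       # A^T (cosh(Ax) o (sinh(Ax) - b))

def hess_sinh(x, A, b):
    u = A @ x
    diag = np.cosh(u) ** 2 + np.sinh(u) * (np.sinh(u) - b)
    return A.T @ np.diag(diag) @ A                     # A^T diag(.) A

rng = np.random.default_rng(6)
A = rng.standard_normal((6, 3)) * 0.3
b = rng.standard_normal(6)
x = rng.standard_normal(3) * 0.3
eps = 1e-5
# compare the Hessian with central finite differences of the gradient
fd = np.column_stack([(grad_sinh(x + eps * e, A, b) - grad_sinh(x - eps * e, A, b)) / (2 * eps)
                      for e in np.eye(3)])
print(np.allclose(fd, hess_sinh(x, A, b), atol=1e-4))  # Hessian matches finite differences
```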
5.5 Loss Function with a Regularization Term
Definition 5.5.
Given matrix $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and $w \in \mathbb{R}^n$. For a vector $x \in \mathbb{R}^d$, we define the loss function $L$ as follows
$L(x) := L_{\sinh}(x) + L_{\mathrm{reg}}(x)$,
where $L_{\mathrm{reg}}(x) := 0.5 \| \mathrm{diag}(w) A x \|_2^2$.
Lemma 5.6.
Let be defined as Definition 5.5, then we have
-
•
Part 1. Gradient
-
•
Part 2. Hessian
5.6 Hessian is Positive Definite
Lemma 5.7.
Let denote a parameter. If for all , then
Proof.
We define
Then we can rewrite Hessian as
We define
Then we have
where the first step follows from simple algebra, the second step follows from replacing with and (Fact 2.1), the third step follows from , the fourth step follows from simple algebra, the fifth step follows from .
Since we know for all and Lemma 2.9, we have
Thus, the Hessian is always positive definite, and therefore the function is convex. ∎
5.7 Hessian is Lipschitz
Lemma 5.8 (Hessian is Lipschitz).
If the following condition holds
-
•
Let
-
•
Let
-
•
-
•
-
•
-
•
Then we have
Proof.
We have
(9)
where the first step follows from  and simple algebra, the second step follows from Fact 2.5, the third step follows from simple algebra, the fourth step follows from simple algebra, the last step follows from Fact 2.4.
For the first term in Eq. (9), we have
(10)
For the second term in Eq. (9), we have
(11)
where the first step follows from Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from , the fifth step follows from , the last step follows from .
For the third term in Eq. (9), we have
(12)
where the first step follows from  and Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from , the fifth step follows from Fact 2.5, the last step follows from .
Putting it all together, we have
where the first step follows by applying Eq. (10), Eq. (11), and Eq. (12), the second step follows from simple algebra, the third step follows from , the last step follows from simple algebra.
∎
6 Newton Method
In this section, we provide an approximate version of the Newton method for solving convex optimization problems and give a detailed analysis of this method. In Section 6.1 we define some assumptions under which we can tackle the optimization problem efficiently. In Section 6.2 we state a simple lemma which is useful in Section 6.5. In Section 6.3 we provide an approximate variant of the update step of the Newton method for convex optimization. In Section 6.4 we provide an upper bound on . In Section 6.5 we provide an upper bound on  and thus show that our approximate update step is effective in solving the optimization problem. In Section 6.6 we provide a lemma that shows our update step is effective. In Section 6.7, we prove our main result.
6.1 Definition and Update Rule
Let us study the local convergence of the Newton method. Consider the problem
under the following assumptions:
Definition 6.1.
We have
-
•
-local Minimum. Let denote a parameter. There is a vector such that
-
–
.
-
–
.
-
–
-
•
Hessian is -Lipschitz. Let denote a parameter that
-
•
Good Initialization Point. Let such that
We define gradient and Hessian as follows
Definition 6.2 (Gradient and Hessian).
We define the gradient function as
We define the Hessian function ,
Using the gradient and Hessian functions, we can rewrite the exact update process as follows:
Definition 6.3 (Exact update).
6.2 Connection between Gradient and Hessian
Lemma 6.4 (folklore).
Let denote the gradient function and let denote the hessian function, then for any , we have
Proof.
We have
∎
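For readability, the folklore identity behind Lemma 6.4 is, in its standard form (a reconstruction using $g$ for the gradient function and $H$ for the Hessian function of Definition 6.2):

```latex
g(y) - g(x) \;=\; \left( \int_0^1 H\big(x + \tau\,(y - x)\big)\, \mathrm{d}\tau \right) \cdot (y - x).
```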
6.3 Approximation of Hessian and Update Rule
In many optimization applications, computing the Hessian or its inverse exactly is quite expensive. Therefore, a natural idea is to approximate the Hessian, or the inverse of the Hessian.
Definition 6.5 (Approximate Hessian).
For any , we define to satisfy the following condition
To efficiently compute , we use a standard tool from the literature.
Lemma 6.6 ([65, 26]).
Let . Given a matrix , for any positive diagonal matrix , there is an algorithm that runs in time
outputs a sparse diagonal matrix  such that
Definition 6.7 (Approximate update).
We consider the following process: $x_{t+1} \leftarrow x_t - \widetilde{H}(x_t)^{-1} \cdot g(x_t)$, where $\widetilde{H}(x_t)$ denotes an approximate Hessian in the sense of Definition 6.5.
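To make the update rule concrete, the following is a simplified sketch of the approximate Newton iteration of Definitions 6.5–6.7, specialized to the regularized exp regression of Section 3. Instead of the leverage-score-based sparsification of Lemma 6.6, the sketch simply uses the exact diagonal, so it illustrates the update rule rather than the input-sparsity-time subroutine; all function and variable names are ours.

```python
import numpy as np

def grad_hess_exp_reg(x, A, b, w):
    """Gradient and Hessian of 0.5*||exp(Ax)-b||^2 + 0.5*||diag(w)Ax||^2."""
    u = np.exp(A @ x)
    g = A.T @ (u * (u - b)) + A.T @ ((w ** 2) * (A @ x))
    D = u * (2 * u - b) + w ** 2        # diagonal of the Hessian's middle factor
    H = A.T @ (D[:, None] * A)          # A^T diag(D) A
    return g, H

def approx_newton(A, b, w, x0, iters=20):
    """x_{t+1} = x_t - H~(x_t)^{-1} g(x_t); here H~ is the exact Hessian,
    a stand-in for the sketched Hessian of Lemma 6.6."""
    x = x0.copy()
    for _ in range(iters):
        g, H = grad_hess_exp_reg(x, A, b, w)
        x = x - np.linalg.solve(H, g)   # Newton step
    return x

rng = np.random.default_rng(7)
n, d = 20, 4
A = rng.standard_normal((n, d)) * 0.2
x_true = rng.standard_normal(d) * 0.2
b = np.exp(A @ x_true)
w = np.ones(n)                          # regularization weights
x_hat = approx_newton(A, b, w, np.zeros(d))
g, _ = grad_hess_exp_reg(x_hat, A, b, w)
print(np.linalg.norm(g))                # gradient norm at the output (near zero on this well-conditioned instance)
```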
6.4 Property of Hessian
Lemma 6.8.
If the following conditions hold
We have
Proof.
We can show that
where the first step follows from Fact 2.5, the second step follows from is a -bounded function, the third step follows from . ∎
6.5 One Step Shrinking Lemma
Lemma 6.9.
If the following conditions hold
-
•
Function follows from Definition 6.1.
-
•
Let denote the Hessian of
-
•
Let denote the gradient of
-
•
Let
Then we have
Proof.
We have
(13)
where the first step follows from Definition 6.7, the second step follows from , the third step follows from Lemma 6.4, the fourth step follows from , the fifth step follows from simple algebra, the sixth step follows from simple algebra, the last step follows from rewriting the equation using the quantity defined below:
Then we can bound  as follows
(14)
where the first step follows from the definition of , the second step follows from simple algebra, the third step follows from simple algebra, the last step follows from Fact 2.5.
For the first term, we have
(15)
where the first step follows from simple algebra, the second step follows from , the third step follows from Fact 2.6.
For the second term, we have
(16)
where the first step follows from Fact 2.5, the second step follows from  and Fact 2.5, the third step follows from , the fourth step follows from , the fifth step follows from Definition 6.1, the sixth step follows from  (see Lemma 6.8), the last step follows from .
Thus, we have
where the first step follows from Eq. (13) and the definition of , the second step follows from , the third step follows from , the last step follows from Eq. (14), Eq. (15), and Eq. (16).
∎
6.6 Induction
Lemma 6.10.
If the following conditions hold
-
•
-
•
, for all
-
•
, for all
Then we have
-
•
-
•
Proof.
Proof of Part 1.
Proof of Part 2.
We have
∎
6.7 Main Result
We state our main result as follows.
Theorem 6.11 (Formal version of Theorem 1.1).
Given matrix , , and .
Let $f$ be any of the functions $\exp$, $\cosh$ and $\sinh$.
Let denote the optimal solution of
that and .
Let .
Let .
Let denote an initial point such that .
For any accuracy parameter $\epsilon$ and failure probability $\delta$, there is a randomized algorithm (Algorithm 1) that runs in $\log(\|x_0 - x^*\|_2/\epsilon)$ iterations and spends $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$
time per iteration, and finally outputs a vector such that
holds with probability at least .
Proof.
Proof of function.
After iterations, we have
By the choice of , we get the desired bound. The failure probability follows from a union bound over the iterations.
Proof of function.
Proof of function.
∎
References
- ACHL [19] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.
- ALL [18] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.
- AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
- AZL [21] Zeyuan Allen-Zhu and Yuanzhi Li. Forward super-resolution: How can gans learn hierarchical generative models for real-world distributions. arXiv preprint arXiv:2106.02619, 2021.
- BLSS [20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 775–788, 2020.
- BMR+ [20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.
- Bra [20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259–278. SIAM, 2020.
- Bra [21] Jan van den Brand. Unifying matrix data structures: Simplifying and speeding up iterative algorithms. In Symposium on Simplicity in Algorithms (SOSA), pages 1–13. SIAM, 2021.
- CB [20] Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
- CDL+ [22] Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Re. Pixelated butterfly: Simple and efficient sparse training for neural network models. In International Conference on Learning Representations (ICLR), 2022.
- CDW+ [21] Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems (NeurIPS), 34:17413–17426, 2021.
- CDW+ [22] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813, 2022.
- CGH+ [19] Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.
- CGRS [19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- CLD+ [20] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- CLS [19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, 2019.
- CND+ [22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, pages 81–90, 2013.
- Dan [17] Amit Daniely. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
- DCLT [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- DGV+ [18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
- DKOD [20] Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering. Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020.
- DLY [21] Sally Dong, Yin Tat Lee, and Guanghao Ye. A nearly-linear time algorithm for linear programs with small treewidth: A multiscale representation of robust central path. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1784–1797, 2021.
- DSSW [18] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics, pages 1299–1308. PMLR, 2018.
- DSW [22] Yichuan Deng, Zhao Song, and Omri Weinstein. Discrepancy minimization in input-sparsity time. arXiv preprint arXiv:2210.12468, 2022.
- DSWY [19] Huaian Diao, Zhao Song, David Woodruff, and Xin Yang. Total least squares regression in input sparsity time. Advances in Neural Information Processing Systems, 32, 2019.
- EGKZ [21] Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. arXiv preprint arXiv:2110.10090, 2021.
- GDG+ [17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- GLSS [18] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 31, 2018.
- GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
- HBGS [19] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks, 2019.
- HHS [17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.
- HJS+ [22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 233–244. IEEE, 2022.
- HSWZ [22] Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.
- IS [15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- JKL+ [20] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 2020 IEEE 61st annual symposium on foundations of computer science (FOCS), pages 910–918. IEEE, 2020.
- JL [22] Samy Jelassi and Yuanzhi Li. Towards understanding how momentum improves generalization in deep learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9965–10040. PMLR, 17–23 Jul 2022.
- JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 944–953, 2020.
- JMGL [22] Samy Jelassi, Arthur Mensch, Gauthier Gidel, and Yuanzhi Li. Adam is no better than normalized SGD: Dissecting how adaptivity improves GAN performance, 2022.
- JNW [22] Shunhua Jiang, Bento Natura, and Omri Weinstein. A faster interior-point method for sum-of-squares optimization, 2022.
- JRG [22] Meena Jagadeesan, Ilya Razenshteyn, and Suriya Gunasekar. Inductive bias of multi-channel linear convolutional networks with bounded weight norm. In Conference on Learning Theory, pages 2276–2325. PMLR, 2022.
- JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC. arXiv preprint arXiv:2004.07470, 2021.
- KB [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- KKL [20] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
- KMN+ [16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
- KVPF [20] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
- LL [19] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
- LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019.
- LWM [19] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. arXiv preprint arXiv:1907.04595, 2019.
- NN [13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, pages 117–126. IEEE, 2013.
- NSS [15] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-sgd: Path-normalized optimization in deep neural networks. arXiv preprint arXiv:1506.02617, 2015.
- Ope [23] OpenAI. Gpt-4 technical report, 2023.
- QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
- RNS+ [18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. ., 2018.
- RSW [16] Ilya Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’16, page 250–263, 2016.
- SHK+ [14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- SHN+ [18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
- SKYL [18] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size, 2018.
- SWZ [17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise -norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC), 2017.
- [61] Zhao Song, David Woodruff, and Peilin Zhong. Average case column subset selection for entrywise -norm loss. Advances in Neural Information Processing Systems (NeurIPS), 32:10111–10121, 2019.
- [62] Zhao Song, David Woodruff, and Peilin Zhong. Towards a zero-one law for column subset selection. NeurIPS, 32:6123–6134, 2019.
- [63] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. arXiv preprint arXiv:1704.08246, 2019.
- SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
- SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
- SZKS [21] Charlie Snell, Ruiqi Zhong, Dan Klein, and Jacob Steinhardt. Approximating how single head attention learns. arXiv preprint arXiv:2103.07601, 2021.
- SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
- VBC [20] James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical theory of attention. arXiv preprint arXiv:2007.02876, 2020.
- VSP+ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- WCM [21] Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating turing machines with transformers. arXiv preprint arXiv:2107.13163, 2021.
- WKM [20] Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout, 2020.
- WL [21] Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11112–11122. PMLR, 18–24 Jul 2021.
- WLK+ [20] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- WRS+ [17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
- ZCLG [21] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. Understanding the generalization of adam in learning neural networks with proper regularization. arXiv preprint arXiv:2108.11371, 2021.
- Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
- ZHDK [23] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. arXiv preprint arXiv:2302.02451, 2023.
- ZKV+ [20] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.
- ZMG [19] Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- ZRG+ [22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.