An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come.
The exponential-function-based attention unit is a fundamental building block of LLMs. Several previous works have studied the convergence of exponential regression and softmax regression.
The exponential regression [Li, Song, Zhou 2023] and softmax regression [Deng, Li, Song 2023] can be formulated as follows. Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal of exponential regression is to solve
$$\min_x \| \exp(Ax) - b \|_2,$$
and the goal of softmax regression is to solve
$$\min_x \| \langle \exp(Ax), \mathbf{1}_n \rangle^{-1} \exp(Ax) - b \|_2.$$
In this work, we define a slightly different formulation than softmax regression, namely
$$\min_x \| u(x) - \langle u(x), \mathbf{1}_n \rangle \cdot b \|_2,$$
where $u(x) = \exp(Ax)$. We provide an input sparsity time algorithm for this problem. Our algorithm framework is very general and can be applied to functions such as $\cosh(Ax)$ and $\sinh(Ax)$ as well. Our technique is also general enough to be applied to in-context learning for rescaled softmax regression.
1 Introduction
The background of large language models (LLMs) can be traced back to a series of groundbreaking models, including the Transformer model [45], GPT-1 [35], BERT [12], GPT-2 [36], and GPT-3 [8]. These models are trained on massive amounts of textual data to generate natural language text and have shown their power on various real-world tasks, including natural language translation [21], sentiment analysis [44], language modeling [31], and even creative writing [33]. The success of the new version of LLM named GPT-4 [33] has exemplified the use of LLMs in human-interaction tasks and suggests that LLMs are likely to continue to be a key area of research in the years to come.
Large Language Models (LLMs) rely heavily on attention computation to improve their performance in natural language processing tasks. The attention mechanism enables the model to selectively focus on specific parts of the input text [45, 12, 36, 8, 35], enhancing its ability to identify and extract relevant information. The attention matrix, a crucial component of the attention mechanism, is a square matrix that represents the correlations between words or tokens in the input text. The entries in the matrix are computed using a soft attention mechanism that generates weights by applying a softmax function over the input sequence. Through this process, LLMs are able to identify and prioritize important parts of the input text, resulting in more accurate and efficient language processing.
Due to the importance of attention computation in natural language processing, there has been a growing interest in addressing both computation and regression issues that arise in attention calculations. A number of recent results study the computation of the attention matrix in LLMs, such as [48, 3, 10, 15]. In this work, we focus on the direction of regression tasks [17, 29, 13, 27]. To better illustrate the regression issues in the attention model, we will first introduce the traditional linear regression as follows.
Definition 1.1 (Linear regression).
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to solve $\min_{x \in \mathbb{R}^d} \| A x - b \|_2$.
Based on the $\exp(\cdot)$ activation function that is widely used in the attention model, several theoretical transformer works have studied exponential regression [17, 29], namely $\min_x \|\exp(Ax) - b\|_2$ (Definition 1.2), and softmax regression [13], namely $\min_x \|\langle \exp(Ax), \mathbf{1}_n \rangle^{-1} \exp(Ax) - b\|_2$ (Definition 1.3).
Based on the problems mentioned above, we conduct a more in-depth analysis of softmax regression and extend the original problem to a rescaled version. The functions we concentrate on are $\exp$, $\cosh$, and $\sinh$. Let $u(x)$ be one of the following:
• $u(x) = \exp(Ax)$,
• $u(x) = \cosh(Ax)$,
• $u(x) = \sinh(Ax)$.
The problem we address is presented as follows:
Definition 1.4 (Rescaled Softmax Regression).
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to solve
$$\min_{x \in \mathbb{R}^d} \| u(x) - \langle u(x), \mathbf{1}_n \rangle \cdot b \|_2.$$
The first contribution of this paper is defining the rescaled version of the softmax regression problem (Definition 1.4). We remark that the major difference between classical softmax regression (Definition 1.3) and our new rescaled softmax regression (Definition 1.4) is the location of the normalization factor $\langle u(x), \mathbf{1}_n \rangle$. Due to this difference, the analysis for rescaled softmax regression is quite different; this is the second contribution of this work. The third contribution of this paper is that our framework is very general and can handle several hyperbolic functions at the same time, including $\exp$, $\cosh$, and $\sinh$.
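To make the difference concrete, the following snippet compares the two residuals on toy data (a minimal sketch; the data $A$, $b$, $x$ are placeholders of our own, and $u$ is taken to be $\exp$):

```python
import numpy as np

np.random.seed(0)
n, d = 6, 3
A = np.random.randn(n, d)      # toy data matrix
x = np.random.randn(d)         # toy parameter vector
b = np.random.rand(n)
b /= b.sum()                   # a probability-like target vector

u = np.exp(A @ x)              # u(x) = exp(Ax); cosh/sinh are handled the same way
alpha = u.sum()                # the normalization factor <u(x), 1_n>

classical_residual = u / alpha - b   # classical softmax regression: normalize u(x)
rescaled_residual = u - alpha * b    # rescaled softmax regression: rescale b instead

print(np.linalg.norm(classical_residual), np.linalg.norm(rescaled_residual))
# At any fixed x the rescaled residual equals alpha times the classical one, but
# because alpha depends on x, the two objectives are genuinely different problems.
```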
1.1 Our Results
Note that we follow the same assumption as [29]. The reason why [13] can make a different assumption is that they solve a normalized version of the problem. Therefore, in our rescaled version, we only adopt the assumption of [29].
Theorem 1.5 (Informal version of Theorem 11.1).
Given a vector $b \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times d}$, let $u(x)$ be one of the following:
• $u(x) = \exp(Ax)$,
• $u(x) = \cosh(Ax)$,
• $u(x) = \sinh(Ax)$.
We define $x^*$ as the optimal solution of the following problem:
Suppose the following conditions hold:
• .
• .
• .
• .
• .
• .
• for all
• Let denote the accuracy parameter.
• Let denote the failure probability.
• Let denote an initial point for which it holds that .
Then there exists a randomized algorithm (Algorithm 1) such that, with probability at least ,
• it runs iterations,
• in each iteration, it spends time,
• it outputs a vector such that .
Roadmap.
Our paper is organized as follows. In Section 2, we review recent research papers that are relevant to our work. In Section 3, we present the techniques used to prove our results. We define the notation and present approximate algebra, differential computation, and math tools for exact algebra used in our paper in Section 4. In Section 5, we introduce the Newton step. In Section 6, we explain the important definitions, equations, and functions which are used in the later sections. In Section 7, we present the computation of the Hessian and the gradient. In Section 8, we prove that the loss function is convex. The Hessian is proved to be Lipschitz in Section 9. In Section 10, we analyze the corresponding Lipschitz property needed for the in-context learning application. In Section 11, we present our main result and algorithm.
2 Related Work
2.1 Transformer Theory
Optimization and Convergence
Studies in the field of optimization have investigated diverse facets of optimization methods and their applications. [42] investigated the behavior of the mechanism of single-head attention for Seq2Seq model learning, providing insights into how to choose parameters for better performance. [49] emphasized the importance of adaptive methods for attention models and proposed a new adaptive method for the attention mechanism. [17] studied the convergence of over-parameterized neural networks with exponential activation functions, addressing the over-parametrization problem. [29] proposed an algorithm for regularized exponential regression that runs in input sparsity time and demonstrated its effectiveness on various datasets. Finally, [26] provided a thorough clarification of how transformers can learn the “semantic structure” to detect the patterns of word co-occurrence, exploring the optimization techniques used in transformers and highlighting their strengths and weaknesses.
Learning in-context
Research on in-context learners based on transformers has explored various aspects of their abilities and mechanisms. As an example, [4] showed that these learners can implicitly perform traditional learning algorithms by updating them continuously with new examples and encoding smaller models within their activations. Another work, [19], focused on training a model under in-context conditions for learning a class of functions, such as linear functions, aiming to determine whether a model that has been given information obtained from specific functions within a class can learn the “majority” of functions in this class through training. In their research, [32] described how Transformers operate as in-context learners and revealed similarities between gradient-descent-based meta-learning formulations and the training process of Transformers on in-context tasks. In general, these studies provide valuable insights into the abilities and mechanisms of in-context learners based on transformers, which have huge potential to significantly improve machine learning applications. [27] proved a theoretical result about in-context learning under the softmax regression formulation of [13].
Fast Attention Computation
The computation of attention has been explored in several works, with a focus on both dynamic and static attention. [10] investigated the dynamic version of attention computation, showing both positive results and negative results. They utilized lazy update techniques in their algorithmic results while the hardness result was based on the Hinted MV conjecture. On the other hand, [48] and [3] focused on static attention computation. [3] proposed an algorithm for static attention and provided a hardness result based on the exponential time hypothesis. Meanwhile, [48] explored the efficiency of static attention algorithms in various applications. Finally, [15] investigated the sparsification of feature dimension in attention computation, providing both randomized and deterministic algorithms.
2.2 Fast Iterative Algorithm for Linear Algebra
One of the important ideas in this work is to use sketching to speed up the iterative algorithm in optimization. Sketching is a powerful technique and has been widely used in numerous important tasks such as linear programming [24, 39, 14, 30, 11, 18], empirical risk minimization [28, 34], John Ellipsoid algorithm [40], cutting plane methods [22, 23], semidefinite programming [18, 41], federated learning [38], discrepancy algorithm [16], training over-parametrized neural network [43, 1, 47].
3 Technique Overview
An overview of our techniques is proposed in this section.
General Functions
For the purpose of applying our theory to the $\exp$, $\cosh$, and $\sinh$ functions at the same time, we first introduce our generalized definition. A single function $u(x)$ is used to represent all three cases, so that the subsequent proofs only need to argue about $u$ through the common properties it satisfies. To elaborate further, $u(x)$ is one of the following:
• Case 1. $u(x) = \exp(Ax)$
• Case 2. $u(x) = \cosh(Ax)$
• Case 3. $u(x) = \sinh(Ax)$
With Fact 4.8 and the definition of $u(x)$, we have the following expression in each case:
• Case 1.
• Case 2.
• Case 3.
A unified version of the Hessian computation and the gradient computation can also be obtained as follows:
•
  – for each
  –
• for each
• for each
Hessian Computation. Taking the regularization term into account as well, the target function we focus on is as follows:
(1)
The computation of the Hessian for the problem directly is complex. We will introduce some techniques used in the computation of the Hessian function. To enhance the clarity of our expression, we draw a comparison between our Hessian Computation and the ones presented in [29, 13]. Specifically, we introduce the function and note that in [29], there is no need to compute , while is the focus of [13]. Our emphasis, however, is on the function .
Additionally, with this definition, the computation of the Hessian can be divided into several simpler terms.
The Hessian is Positive Definite
The first property we need to establish in order to apply the Newton optimization method is the positive definiteness of the Hessian. This is inspired by the semi-definite programming literature [2, 20]. We have defined . Given that
and the bound on
by choosing the regularization parameter appropriately, the Hessian function is now positive definite.
The Hessian is Lipschitz with respect to the variable $x$
To apply the Newton optimization method, it is also necessary to ensure the Lipschitz property of the Hessian. To finish the proof, we upper bound by , where is a scalar. The difference can be decomposed into , where .
The idea of how to bound each term is quite standard in the neural network literature (for example, see [7, 6]).
With the bounds above, we then get the following bound on :
Furthermore, we can prove that the Hessian is Lipschitz continuous as follows:
Approximated Newton Algorithm
Based on the properties of the Hessian function established above, we can apply the approximate Newton method to this regression problem. Building on the assumption of a good loss function (Definition 5.1), we can guarantee the correctness of our algorithm.
Given the optimum of Eq. (1) and an initialization point, we make a good-initialization assumption.
To expedite the computation, it is natural to introduce a method for approximating the Hessian and its inverse (for example, see [11, 28, 37, 9, 24, 39, 20, 18, 16, 40, 23]). Given that the middle matrix is a positive diagonal matrix $D$, the Hessian can be written in the form $A^\top D A$. A way to obtain a faster method is to substitute $D$ with a sparse diagonal matrix $\widetilde{D}$ such that $A^\top \widetilde{D} A$ spectrally approximates $A^\top D A$.
The running time of the Hessian computation can then be reduced to $\widetilde{O}(\mathrm{nnz}(A) + d^{\omega})$ per iteration, where, following [46, 25, 5], $\omega$ denotes the exponent of matrix multiplication. To ensure the convergence of our algorithm, the number of iterations is expected to be $O(\log(1/\epsilon))$ under the assumptions above, leading to a total running time of $\widetilde{O}((\mathrm{nnz}(A) + d^{\omega}) \log(1/\epsilon))$.
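The actual approximation tool used in this paper is the sparse-diagonal sampling lemma of [16, 40] (Lemma 5.3). The snippet below is only a minimal numerical sketch of the underlying idea that a sketched product $(S D^{1/2} A)^\top (S D^{1/2} A)$ can stand in for $A^\top D A$; it uses a dense Gaussian sketch of our own choosing rather than the input-sparsity-time construction of the paper.

```python
import numpy as np

np.random.seed(1)
n, d = 2000, 20
A = np.random.randn(n, d)
w = np.random.rand(n) + 0.1          # entries of a positive diagonal matrix D

H_exact = A.T @ (w[:, None] * A)     # A^T D A computed exactly, O(n d^2) time

# Sketch D^{1/2} A down to s rows; then (S D^{1/2} A)^T (S D^{1/2} A) ~= A^T D A.
s = 400
S = np.random.randn(s, n) / np.sqrt(s)   # dense Gaussian sketch, for illustration only
B = S @ (np.sqrt(w)[:, None] * A)
H_approx = B.T @ B

rel_err = np.linalg.norm(H_approx - H_exact, 2) / np.linalg.norm(H_exact, 2)
print(f"relative spectral-norm error: {rel_err:.3f}")
```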
Thus, we can derive our main result Theorem 1.5.
From Lipschitz with Respect to $x$ to Lipschitz with Respect to the Regression Parameters
In Section 9, we already proved a number of results for the Lipschitz property with respect to $x$. To present the application to in-context learning for rescaled softmax regression, we generalize these results from the Lipschitz property with respect to $x$ to the Lipschitz property with respect to the regression parameters (see Section 10). Finally, we present Corollary 11.2 as our in-context learning application.
4 Preliminary
In Section 4.1, we introduce the important notation and mathematical symbols used in this paper. In Section 4.2, we present basic algebraic properties. In Section 4.3, the properties of the inner product are explained. In Section 4.4, the properties of the positive semidefinite ordering are introduced. In Section 4.5, we display the fundamental derivative rules, both for scalar variables and for vector variables. In Section 4.6, we demonstrate the vector norm bounds, including the Cauchy-Schwarz inequality and other related inequalities. In Section 4.7, we illustrate the matrix norm bounds, namely inequalities for spectral norms. In Section 4.8, we introduce the properties of the hyperbolic functions that take a scalar argument. In Section 4.9, we elucidate the properties of the hyperbolic functions that take a vector argument. In Section 4.10, we define the important regularization function and analyze its basic properties. In Section 4.11, we define the gradient and Hessian of an arbitrary loss function and define the update of the Newton method.
4.1 Notations
In this section, we explain the fundamental notations.
We use $\mathbb{Z}_+$ to represent the set of all positive integers, and we use $n$ to denote an arbitrary element of $\mathbb{Z}_+$. We define $[n]$ to be the set $\{1, 2, \ldots, n\}$.
Let $x$ and $y$ be two vectors in $\mathbb{R}^n$. For any $i \in [n]$, we let $x_i$ denote the $i$-th entry of $x$. We use $x \circ y$ to represent the vector satisfying $(x \circ y)_i = x_i y_i$ for each $i \in [n]$. We use $\|x\|_p$ (where $p \in \{1, 2, \infty\}$) to represent the $\ell_p$ norm of $x$, where $\|x\|_1 = \sum_{i=1}^n |x_i|$ (the $\ell_1$ norm), $\|x\|_2 = (\sum_{i=1}^n x_i^2)^{1/2}$ (the $\ell_2$ norm), and $\|x\|_\infty = \max_{i \in [n]} |x_i|$ (the $\ell_\infty$ norm).
For a scalar $z \in \mathbb{R}$, we let $\exp(z)$ represent the standard exponential function.
We then define $\cosh(z) := \frac{1}{2}(\exp(z) + \exp(-z))$ and $\sinh(z) := \frac{1}{2}(\exp(z) - \exp(-z))$.
Note that
and
For an arbitrary vector $x \in \mathbb{R}^n$, we use $\exp(x)$ to denote the vector whose $i$-th entry is $\exp(x_i)$. We use to denote .
$\mathbf{1}_n$ represents an $n$-dimensional vector whose entries are all ones, and $\mathbf{0}_n$ represents an $n$-dimensional vector whose entries are all zeros.
For an arbitrary vector $u \in \mathbb{R}^n$, let $\operatorname{diag}(u)$ represent the diagonal matrix whose $i$-th diagonal entry is $u_i$.
For an arbitrary symmetric matrix $B \in \mathbb{R}^{n \times n}$, we say $B$ is positive definite ($B \succ 0$) if $x^\top B x > 0$ for all nonzero vectors $x \in \mathbb{R}^n$.
For a symmetric matrix $B \in \mathbb{R}^{n \times n}$, we say $B$ is positive semidefinite ($B \succeq 0$) if $x^\top B x \ge 0$ for all vectors $x \in \mathbb{R}^n$.
For symmetric matrices $B$ and $C$, we say $B \succeq C$ if $x^\top B x \ge x^\top C x$ for all $x$.
For any matrix $A$, we use $\|A\|$ to denote the spectral norm of $A$, i.e., $\|A\| = \max_{\|x\|_2 = 1} \|Ax\|_2$.
For each $i \in [d]$, we use $A_{*,i}$ to denote the $i$-th column of the matrix $A$.
We use $I_d$ to denote the identity matrix of size $d \times d$, whose diagonal entries are all ones.
4.2 Basic Algebra for and
In this section, we provide a fact that includes the basic algebraic properties of and .
Fact 4.1.
Given vectors , , and , we have
•
  –
  –
  –
  –
•
•
•
  –
  –
  –
•
4.3 Basic Inner Product
Now, we present the inner product properties.
Fact 4.2.
Given vectors , and , we have
•
•
•
•
•
•
4.4 Positive Semi-definite
In this section, we explain the properties of the mathematics operation .
Fact 4.3.
Let . We have:
• .
•
•
•
•
•
•
•
•
4.5 Basic Calculus and Chain Rule
In this section, we present the basic calculus rules, including the derivative rules for scalars and the derivative rules for vectors.
Fact 4.4.
We have
• Let be a fixed scalar and let denote a variable; then we have
• Let , we have
• Let be a fixed vector, we have
• Let denote a scalar variable; we have the following calculus rules
  –
  –
  –
  –
• Let denote a vector variable; we have the following calculus rules
  –
  –
  –
  –
4.6 Basic Vector Norm Bounds
Now, we analyze the norm bounds for vectors.
Fact 4.5.
Given vectors and , we have
• (Cauchy-Schwarz inequality)
•
•
• (-norm vs -norm)
• (-norm vs -norm)
4.7 Basic Matrix Norm Bound
Then, we analyze the norm bounds for matrices.
Fact 4.6.
For matrices and , we have
• Let denote two vectors; then we have .
• .
•
• Let be a scalar; then we have .
4.8 Basic Hyperbolic Functions: Scalar Version
In this section, we analyze the properties of the hyperbolic functions, including and , and exponential functions, , where the elements of the domains of these functions are all scalars.
Fact 4.7.
For a scalar , we have
• Part 1.
• Part 2.
• Part 3.
• Part 4.
• Part 5.
Taylor expansions
•
•
• Approximation in a small range,
• For all satisfying , we can get
• For all satisfying , we can get
• For all satisfying , we can get
• For all satisfying , we can get
• For all satisfying , we can get
• For all satisfying , we can get
Proof.
Most of the proofs are trivial or standard. We only provide some proofs.
Proof of Part 4.
We have
where the second step is true because it holds for and also for .
We have
Proof of Part 5.
We have
We have
∎
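As a quick numerical sanity check of the Taylor expansions above (illustrative only; the precise constants and ranges used in Fact 4.7 are the ones stated in the paper, not the ones implied by this snippet):

```python
import math

# Compare exp, cosh, sinh with their low-order Taylor polynomials on a small range.
for x in [0.01, 0.05, 0.1]:
    exp_err = abs(math.exp(x) - (1 + x + x * x / 2))    # remainder ~ x^3 / 6
    cosh_err = abs(math.cosh(x) - (1 + x * x / 2))      # remainder ~ x^4 / 24
    sinh_err = abs(math.sinh(x) - (x + x ** 3 / 6))     # remainder ~ x^5 / 120
    print(f"x={x:5.2f}  exp={exp_err:.2e}  cosh={cosh_err:.2e}  sinh={sinh_err:.2e}")
```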
4.9 Basic Hyperbolic Functions: Vector Version
In this section, we keep analyzing the properties of the hyperbolic functions, namely and , and exponential functions, , but the elements of the domains of these functions are all vectors.
Fact 4.8.
For vectors
•
•
•
• Approximation in a small range,
• For any , we have
• For any , we have
• For any , we have
4.10 Regularization
Now, we define the regularization function, , and analyze its properties.
Definition 4.9.
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $w \in \mathbb{R}^n$, we define
4.11 Gradient and Hessian
Finally, in this section, we define the gradient and Hessian of the loss function and present the definition of the update of the Newton method.
Definition 4.11 (Gradient and Hessian).
Let $L(x)$ be some loss function. The gradient $g : \mathbb{R}^d \to \mathbb{R}^d$ of the loss function is defined as
$$g(x) := \nabla L(x).$$
The Hessian $H : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ of the loss function is defined as follows:
$$H(x) := \nabla^2 L(x).$$
Definition 4.12 (Update of the Newton method).
Given the gradient function $g(x)$ and the Hessian matrix $H(x)$, the exact update of the Newton method is defined as follows:
$$x_{t+1} := x_t - H(x_t)^{-1} g(x_t).$$
5 Newton Step
In Section 5.1, we consider a certain loss function and call it -good if it satisfies certain conditions. In Section 5.2, we define the approximated Hessian and, based on it, the approximate update; we then analyze their properties.
5.1 -good Loss function
In this section, we explain the meaning of -good Loss function.
Given that , we consider the following optimization problem
Now, we provide the following assumptions
Definition 5.1 (-good Loss function).
For a function , we say it is -good if it satisfies the following conditions:
• -local Minimum. We define to be a positive scalar. If there exists a vector such that the following holds:
  – .
  – .
  –
• Hessian is -Lipschitz. If there exists a positive scalar such that
• Good Initialization Point. Let denote the initialization point. If it satisfies
5.2 Approximate of Hessian and Update Rule
In this section, we demonstrate the concept of approximate update and present an induction hypothesis.
Definition 5.2 (Approximated Hessian).
Let be a parameter.
For any Hessian , we define the approximated Hessian to be a matrix such that the following holds,
In order to get the approximated Hessian efficiently, here we state a standard tool (see Lemma 4.5 in [16]).
Lemma 5.3 ([16, 40]).
Let be a constant precision parameter and let denote the failure probability. Let be a real matrix; then for any positive diagonal (PD) matrix , there exists an algorithm which runs in time
and it outputs a sparse diagonal matrix for which
Note that $\omega$ denotes the exponent of matrix multiplication, currently $\omega \approx 2.373$ [46, 25, 5]. Here $\mathrm{nnz}(A)$ denotes the number of nonzero entries in $A$.
Definition 5.4 (Approximate update).
We consider the following process
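A minimal sketch of a single step of this process, assuming access to the exact gradient and to an approximated Hessian in the sense of Definition 5.2 (the helper names grad_fn and hess_approx_fn are ours):

```python
import numpy as np

def approximate_newton_step(x_t, grad_fn, hess_approx_fn):
    """One step x_{t+1} = x_t - H_tilde(x_t)^{-1} g(x_t) with an approximated Hessian."""
    g = grad_fn(x_t)                     # exact gradient at the current point
    H_tilde = hess_approx_fn(x_t)        # spectral approximation of the true Hessian
    return x_t - np.linalg.solve(H_tilde, g)
```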
We state a tool from prior work,
Lemma 5.5 (Iterative shrinking, Lemma 6.9 on page 32 of [29]).
6 General Functions: Definitions
In this section, we present the definitions of the important functions which we use in the later sections.
Definition 6.1.
Let $u(x)$ be one of the following
• Case 1. $u(x) = \exp(Ax)$
• Case 2. $u(x) = \cosh(Ax)$
• Case 3. $u(x) = \sinh(Ax)$
We define a helpful function
Definition 6.2.
Let be one of the following
• Case 1. (when $u(x) = \exp(Ax)$)
• Case 2. (when $u(x) = \cosh(Ax)$)
• Case 3. (when $u(x) = \sinh(Ax)$)
Definition 6.3 (Loss function ).
Given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, we define the loss function as follows
For convenience, we define two helpful notations and
Definition 6.4 (Rescaled coefficients).
Given a matrix .
7 General Function: Gradient and Hessian Computations
In Section 7.1, we compute the gradients of , , , and . In Section 7.2, we present the second-order derivatives of with respect to and , where and are two arbitrary entries of the vector . In Section 7.3, we present the second-order derivatives of with respect to and , where and are two arbitrary entries of the vector . In Section 7.4, we compute the second-order derivatives of with respect to and . Finally, in Section 7.5, we compute the second-order derivatives of with respect to and .
7.1 Gradient Computations
In this section, we compute the gradients of , , , and , namely their first-order derivative with respect to .
Lemma 7.1 (Gradient).
Proof.
Proof of Part 1. For each , we have
where the first step follows from chain rule (Fact 4.4), the second step follows from basic chain rule (Fact 4.4), the third step follows from basic calculus rule (Fact 4.4).
Since the above equation is true for all , we have
Proof of Part 2. It trivially follows from arguments in Part 1.
Proof of Part 3.
where the first step is due to the definition of (see Definition 6.5), the second step follows from Part 1 and Part 2.
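The gradient formulas can be sanity-checked against finite differences. The sketch below assumes the squared-$\ell_2$ form of the rescaled loss, $L(x) = 0.5\,\|u(x) - \langle u(x), \mathbf{1}_n\rangle\, b\|_2^2$ with $u = \exp$; both this assumed loss form and the helper names are ours, not a transcription of the paper's definitions.

```python
import numpy as np

def loss_and_grad(A, b, x):
    """L(x) = 0.5 * ||u(x) - <u(x), 1_n> * b||_2^2 with u(x) = exp(Ax) (assumed form)."""
    u = np.exp(A @ x)
    c = u - u.sum() * b                        # rescaled residual
    J = u[:, None] * A - np.outer(b, A.T @ u)  # Jacobian of the residual c(x)
    return 0.5 * c @ c, J.T @ c

np.random.seed(2)
n, d = 5, 3
A = 0.3 * np.random.randn(n, d)
b = np.random.rand(n)
x = 0.3 * np.random.randn(d)
_, g = loss_and_grad(A, b, x)

eps, g_fd = 1e-6, np.zeros(d)                  # central finite differences
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    g_fd[i] = (loss_and_grad(A, b, x + e)[0] - loss_and_grad(A, b, x - e)[0]) / (2 * eps)
print(np.max(np.abs(g - g_fd)))                # agreement up to roughly 1e-8
```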
7.2 Hessian Computation Step 1. Vector Function
7.3 Hessian Computation Step 2. Scalar Function
7.4 Hessian Computation Step 3. Vector Function
Now, we compute the second-order derivative of with respect to and .
Lemma 7.4.
If the following conditions hold
• Let be defined as in Definition 6.5.
• Let .
• Let .
• Let .
Then, we have
• Part 1.
• Part 2.
7.5 Hessian Computation Step 4. Scalar Function
Then, we compute the second-order derivative of with respect to and , by first introducing some functions (see Definition 7.5) to simplify the computation.
Definition 7.5.
Given the following objects
• Let .
• Let .
• Let .
Then, we define the functions as
We define
We define as follows:
Lemma 7.6.
Proof.
Proof of Part 1.
where the first step follows from simple algebra, the second step follows from basic chain rule (see Fact 4.4), and the last step follows from basic calculus.
For the first term, in the above equation, we have
where the first step is due to Part 3 of Lemma 7.1, the second step follows from Fact 4.2, and the last step follows from the definition of for each (see Definition 7.5).
For the second term, we have
where the first step is due to Part 1 of Lemma 7.4, the second step follows from Fact 4.2, and the last step follows from for all (see Definition 7.5).
Thus, we finally have
Proof of Part 2.
The proof is similar, and we omit the details here. ∎
8 General Function: PSD Lower Bound
In Section 8.1, we provide the upper bound for the norms of and for the absolute value of . In Section 8.2, we compute both the upper bound and the lower bound of in terms of . In Section 8.3, we analyze the lower bound of Hessian.
8.1 Upper Bound for Several Basic Quantities
In this section, we compute the bounds for the norms of the vectors and compute the bound for the absolute value of .
Claim 8.1.
Proof.
Proof of Part 1. The proof is standard in the literature, and we omit the details here.
Proof of Part 2. We can show
where the first step follows from definition of (see Definition 6.4), the second step follows from Fact 4.5, the third step follows from Part 1.
Proof of Part 3. We can show
where the first step comes from the definition of (see Definition 6.5), the second step follows from the triangle inequality, the third step is because of Part 1, the fourth step follows from the assumption on , the fifth step follows from Part 2, and the last step follows from simple algebra.
∎
8.2 PSD Bounds for Several Basic Matrix Functions
In this section, we first define the matrices and find the -bound for them. Then, we combine them together to form the bound for
Definition 8.2.
Given the following objects
We define as follows:
We define
We define
•
•
•
•
  – (when )
  – (when )
  – (when )
  –
•
Lemma 8.3.
Proof.
Proof of Part 1. First, we focus on the lower bound of . We have
where the first step follows from the definition of (see Definition 8.2) and the second step follows from Fact 4.3.
Similarly, we have
where the first step follows from the definition of (see Definition 8.2) and the second step follows from Fact 4.3
Proof of Part 2. According to what we obtained in the Part 1, we can also have
Proof of Part 3.
The proof is trivially following from definition of . We have
Proof of Part 4. For , , we have
where the first step is due to the definition of (see Definition 8.2) and the second step follows from Fact 4.3.
On the other hand, we have
Proof of Part 5.
where the first step follows from the definition of (see Definition 8.2), the second step follows from Fact 4.3, and the third step follows from Fact 4.3.
Similarly, we have
where the first step comes from the definition of (see Definition 8.2), the second step follows from Fact 4.3, and the third step follows from Fact 4.3.
Proof of Part 6. Using Fact 4.3
We also have
∎
8.3 Lower bound on Hessian
In this section, we compute the lower bound for Hessian.
Lemma 8.4.
If the following conditions are satisfied
• Let .
• Let be defined as in Definition 6.1.
• Let be defined as in Definition 6.2.
• is defined in Definition 6.3.
• is defined in Definition 4.9.
• .
• Given , let denote the diagonal matrix with as its -th diagonal entry.
• We use to denote the minimum singular value of .
• We let be a scalar.
• Let .
Then we have
• Part 1. If for all , then we have
• Part 2. If for all , then we have
Proof.
By Lemma 7.6, we have
(2) |
By Lemma 4.10, we have
(3) |
By what we have in the Lemma statement, we also have
(4) |
By combining Eq. (2), Eq. (3), and Eq. (4), we can rewrite the equation above as follows:
where the second step follows from simple algebra.
Now we define
Now we get the bound of
where the first step follows from Part 6 of Lemma 8.3, the second step follows from simple algebra, and the third step is because of the assumption of this part.
Since is positive definite, then we have
Proof of Part 2.
Using Part 6 of Lemma 8.3, we have
From assumption on , we also have
Combining the above three equations,
Thus,
which implies that
∎
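To illustrate the role of the regularization in this lemma, the following sketch (under the same assumed squared-$\ell_2$ loss as before, with a plain $\frac{l}{2}\|x\|_2^2$ regularizer as our stand-in for the paper's regularization function of Definition 4.9) numerically checks that the smallest eigenvalue of the regularized Hessian becomes positive once the regularization parameter is large enough relative to the data:

```python
import numpy as np

def grad(A, b, x, l):
    """Gradient of 0.5||exp(Ax) - <exp(Ax),1> b||_2^2 + 0.5 * l * ||x||_2^2 (assumed form)."""
    u = np.exp(A @ x)
    c = u - u.sum() * b
    J = u[:, None] * A - np.outer(b, A.T @ u)
    return J.T @ c + l * x

def hessian_fd(A, b, x, l, eps=1e-5):
    """Hessian via central finite differences of the gradient."""
    d = x.size
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (grad(A, b, x + e, l) - grad(A, b, x - e, l)) / (2 * eps)
    return (H + H.T) / 2

np.random.seed(3)
n, d = 6, 4
A = 0.3 * np.random.randn(n, d)
b = np.random.rand(n)
x = 0.3 * np.random.randn(d)
for l in [0.0, 1.0, 10.0]:
    # The minimum eigenvalue increases with l; a large enough l makes the Hessian PD.
    print(l, np.linalg.eigvalsh(hessian_fd(A, b, x, l)).min())
```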
9 General Function: Hessian is Lipschitz with Respect to
In Section 9.1, we summarize all of the important properties derived in the following subsections to form an upper bound for . In Section 9.2, we analyze the upper bound for . In Section 9.3, we analyze the upper bound for . In Section 9.4, we investigate the upper bound for . In Section 9.5, we evaluate the upper bound of the sum of the spectral norms of the matrices , for all , where the spectral norm of each matrix is bounded in one of the following subsections. In Section 9.6, we analyze the upper bound of the spectral norm of . In Section 9.7, we find the upper bound of the spectral norm of . In Section 9.8, we study the upper bound of the spectral norm of . In Section 9.9, we assess the upper bound of the spectral norm of . In Section 9.10, we investigate the upper bound of the spectral norm of .
9.1 Main Result
In this section, we introduce our main result, which is the combination of all the important concepts in Section 9.
Lemma 9.1.
If the following conditions hold
• Let
• Let
• , , where
• , where
•
• , where
• Let
  – where
  – this is proved by Part 1 and Part 3 in Claim 8.1
  –
Then we have
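For numerical intuition about the Hessian-Lipschitz property (again under our assumed squared-$\ell_2$ rescaled loss with $u = \exp$, which is only a stand-in for the paper's exact definitions), one can estimate the ratio $\|\nabla^2 L(x) - \nabla^2 L(y)\| / \|x - y\|_2$ over random nearby pairs in a bounded region; the ratio stays bounded, which is what this section establishes formally.

```python
import numpy as np

def hessian(A, b, x):
    """Exact Hessian of 0.5 * ||exp(Ax) - <exp(Ax), 1_n> * b||_2^2 (assumed loss form)."""
    u = np.exp(A @ x)
    c = u - u.sum() * b
    J = u[:, None] * A - np.outer(b, A.T @ u)   # Jacobian of the residual
    S = A.T @ (u[:, None] * A)                  # sum_k u_k a_k a_k^T
    H = J.T @ J
    for i in range(A.shape[0]):                 # curvature term: sum_i c_i * Hess(c_i)
        H += c[i] * (u[i] * np.outer(A[i], A[i]) - b[i] * S)
    return H

np.random.seed(4)
n, d = 6, 4
A = 0.3 * np.random.randn(n, d)
b = np.random.rand(n)
ratios = []
for _ in range(200):
    x = 0.5 * np.random.randn(d)
    y = x + 1e-3 * np.random.randn(d)
    num = np.linalg.norm(hessian(A, b, x) - hessian(A, b, y), 2)
    ratios.append(num / np.linalg.norm(x - y))
print(max(ratios))   # empirical local Lipschitz constant of the Hessian on this region
```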
9.2 Lipschitz for
We use a tool from [13].
9.3 Lipschitz for
We use a tool from previous work, namely [13].
9.4 Lipschitz for
We find the upper bound of .
Lemma 9.4.
Proof.
We have
where the first step is from how we defined (Definition 6.5), the second step is due to the triangle inequality, and the last step follows from simple algebra. ∎
9.5 Summary of Five Steps
In this section, we analyze the upper bound of the sum of , for all .
Lemma 9.5.
If the following conditions hold
•
•
•
•
•
• Let
Then, we have
• Part 1.
• Part 2. Let
9.6 Lipschitz Calculations: Step 1. Lipschitz for Matrix Function
We find the upper bound of .
Lemma 9.6.
If the following conditions hold
•
Then, we have
9.7 Lipschitz Calculations: Step 2. Lipschitz for Matrix Function
We look for the upper bound of .
Lemma 9.7.
If the following conditions hold
•
Then, we have
Proof.
The proof is very similar to that of the previous lemma, so we omit the details here. ∎
9.8 Lipschitz Calculations: Step 3. Lipschitz for Matrix Function
We analyze the upper bound of .
Lemma 9.8.
If the following conditions hold
•
Then, we have
Proof.
We define
We have
We can show that
where the first step comes from the definition of , the second step is due to simple algebra, and the third step follows from Fact 4.6.
Similarly, we can show that
Thus, we complete the proof. ∎
9.9 Lipschitz Calculations: Step 4. Lipschitz for Matrix Function
We investigate the upper bound of .
Since we only need to prove the Lipschitz property, this factor makes no difference: it will be canceled. Thus, we define the terms without it.
Lemma 9.9.
If the following conditions hold
•
Then, we have
9.10 Lipschitz Calculations: Step 5. Lipschitz for Matrix Function
We compute the upper bound of .
Lemma 9.10.
If the following conditions hold
•
Then, we have
10 Lipschitz with respect to
In Section 10.1, we consider the case, which is to upper bound . In Section 10.2, we consider the case, namely computing the upper bound of . In Section 10.3, we analyze the bound for . In Section 10.4, we investigate the bound for . In Section 10.5, we analyze the bound for .
10.1 For the case
In this section, we bound .
Definition 10.1.
We define to be the vector that satisfies
Lemma 10.2.
We have
Proof.
As described in [30], there could be multiple solutions, e.g., the following possible solutions:
The norms of all the solutions are the same. Therefore, we can just consider one solution, namely the following one:
Thus,
∎
We use a tool, which is from [13].
Lemma 10.3 (Lemma 8.9 in [13]).
If the following conditions hold
• Let
• Let
We have
The proof is standard, so we omit the details here.
10.2 For the case
Here, we bound .
Definition 10.4.
We define to be the vector that satisfies
Lemma 10.5.
We have
Lemma 10.6 (Lemma 8.9 in [13]).
If the following conditions hold
• Let
• Let
We have
10.3 Lipschitz for
We state a tool that follows directly from previous work. The proof is very standard, so we omit the details here.
10.4 Lipschitz for
We state a tool that follows directly from previous work. The proof is very standard, so we omit the details here.
10.5 Lipschitz for
In this section, we bound .
Lemma 10.9 (A variation of Lemma 9.4).
Proof.
We have
where the first step comes from how we defined (see Definition 6.5), the second step is because of the triangle inequality, and the last step follows from simple algebra. ∎
11 Main Result
In Section 11.1, we introduce our algorithm (see Algorithm 1) and use our main Theorem (see Theorem 11.1) to analyze its properties, including running time and the output . In Section 11.2, we introduce a corollary which is the application of in-context learning.
11.1 Convergence
Now, we introduce our main algorithm and Theorem.
Theorem 11.1 (Formal version of Theorem 1.5).
Given a vector $b \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times d}$, we define $x^*$ as the optimal solution of the following problem
If the following conditions hold:
• .
• .
• .
• .
• .
• for all
• .
• Let the accuracy parameter be .
• Let the failure probability be .
• Let denote an initial point for which it holds that .
Then there exists a randomized algorithm (Algorithm 1) such that, with probability at least ,
• it runs iterations,
• it spends time per iteration,
• it outputs a vector such that .
Here $\omega$ denotes the exponent of matrix multiplication; currently $\omega \approx 2.373$ [46, 25, 5].
Proof.
Proof of Hessian is PD.
We can obtain this conclusion from Lemma 8.4.
Proof of Hessian is Lipschitz.
The proof is due to Lemma 9.1.
Proof of Cost per iteration.
This follows from Lemma 5.3.
Proof of Convergence per Iteration.
By Lemma 5.5, we have
Proof of Number of Iterations.
After iterations, we have
By the choice of , we get the desired bound. The failure probability follows from a union bound over the iterations.
∎
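For concreteness, here is a high-level sketch of the kind of procedure Theorem 11.1 analyzes: a regularized Newton loop in which the exact Hessian is replaced by a cheaper surrogate. This is our own illustrative rendering rather than a transcription of Algorithm 1; in particular, the Gauss-Newton-plus-ridge matrix below stands in for the sparsified Hessian approximation of Lemma 5.3, and the loss is again our assumed squared-$\ell_2$ rescaled objective.

```python
import numpy as np

def rescaled_newton(A, b, l=1e-6, T=30):
    """Approximate Newton loop for min_x 0.5||exp(Ax) - <exp(Ax),1> b||_2^2 + 0.5*l*||x||_2^2."""
    n, d = A.shape
    x = np.zeros(d)                                   # initial point x_0
    for _ in range(T):
        u = np.exp(A @ x)
        c = u - u.sum() * b
        J = u[:, None] * A - np.outer(b, A.T @ u)
        g = J.T @ c + l * x                           # gradient of the regularized loss
        H_tilde = J.T @ J + l * np.eye(d)             # Gauss-Newton-plus-ridge Hessian surrogate
        x = x - np.linalg.solve(H_tilde, g)           # approximate Newton update
    return x

np.random.seed(5)
n, d = 40, 5
A = 0.2 * np.random.randn(n, d)
x_star = 0.2 * np.random.randn(d)
u_star = np.exp(A @ x_star)
b = u_star / u_star.sum()          # consistent target, so the residual vanishes near x_star
print(np.linalg.norm(rescaled_newton(A, b) - x_star))   # small recovery error
```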
11.2 Application to In-context Learning
In this section, we introduce the application to in-context learning.
Corollary 11.2 (Bounded shift for Learning in-context).
If the following conditions hold
• Let .
• Let .
• .
• Let .
• .
• .
• Let .
• Let .
• Let .
We consider the rescaled softmax regression (Definition 1.4) problem
• Part 1. If we move to , then we are solving a new rescaled softmax regression problem with
where
• Part 2. If we move to , then we are solving a new rescaled softmax regression problem with
where
References
- ALS+ [22] Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, and Danyang Zhuo. Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing. arXiv preprint arXiv:2211.14227, 2022.
- Ans [00] Kurt M Anstreicher. The volumetric barrier for semidefinite programming. Mathematics of Operations Research, 2000.
- AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
- ASA+ [22] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
- AW [21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522–539. SIAM, 2021.
- [6] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
- [7] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. Advances in neural information processing systems, 32, 2019.
- BMR+ [20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Bra [20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259–278. SIAM, 2020.
- BSZ [23] Jan van den Brand, Zhao Song, and Tianyi Zhou. Algorithm and hardness for dynamic attention maintenance in large language models. arXiv preprint arXiv:2304.02207, 2023.
- CLS [19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, 2019.
- DCLT [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- DLS [23] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.
- DLY [21] Sally Dong, Yin Tat Lee, and Guanghao Ye. A nearly-linear time algorithm for linear programs with small treewidth: a multiscale representation of robust central path. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1784–1797, 2021.
- DMS [23] Yichuan Deng, Sridhar Mahadevan, and Zhao Song. Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension. arxiv preprint: arxiv 2304.03426, 2023.
- DSW [22] Yichuan Deng, Zhao Song, and Omri Weinstein. Discrepancy minimization in input-sparsity time. arXiv preprint arXiv:2210.12468, 2022.
- GMS [23] Yeqi Gao, Sridhar Mahadevan, and Zhao Song. An over-parameterized exponential regression. arXiv preprint arXiv:2303.16504, 2023.
- GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
- GTLV [22] Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. arXiv preprint arXiv:2208.01066, 2022.
- HJS+ [22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 233–244. IEEE, 2022.
- HWL [21] Weihua He, Yongyun Wu, and Xiaohua Li. Attention mechanism for neural machine translation: A survey. In 2021 IEEE 5th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), volume 5, pages 1485–1489. IEEE, 2021.
- JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, 2020.
- JLSZ [23] Haotian Jiang, Yin Tat Lee, Zhao Song, and Lichen Zhang. Convex minimization with integer minima in time. arXiv preprint arXiv:2304.03426, 2023.
- JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC, 2021.
- LG [14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation, pages 296–303, 2014.
- LLR [23] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245, 2023.
- LSX+ [23] Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning and weight shifting for softmax regression. arXiv preprint arXiv:2304.13276, 2023.
- LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory (COLT), pages 2140–2157. PMLR, 2019.
- [29] Zhihang Li, Zhao Song, and Tianyi Zhou. Solving regularized exp, cosh and sinh regression problems. arXiv preprint, 2303.15725, 2023.
- [30] S Cliff Liu, Zhao Song, Hengjie Zhang, Lichen Zhang, and Tianyi Zhou. Space-efficient interior point method, with applications to linear programming and maximum weight bipartite matching. In ICALP, 2023.
- MMS+ [19] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de La Clergerie, Djame Seddah, and Benoit Sagot. Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894, 2019.
- ONR+ [22] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677, 2022.
- Ope [23] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
- RNS+ [18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- RWC+ [19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Son [19] Zhao Song. Matrix theory: optimization, concentration, and algorithms. The University of Texas at Austin, 2019.
- SWYZ [23] Zhao Song, Yitan Wang, Zheng Yu, and Lichen Zhang. Sketching for first order method: Efficient algorithm for low-bandwidth channel and vulnerability. In ICML, 2023.
- SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for linear programming. In International Conference on Machine Learning, pages 9835–9847. PMLR, 2021.
- SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
- SYYZ [23] Zhao Song, Xin Yang, Yuanyuan Yang, and Lichen Zhang. Sketching meets differential privacy: Fast algorithm for dynamic kronecker projection maintenance. In ICML, 2023.
- SZKS [21] Charlie Snell, Ruiqi Zhong, Dan Klein, and Jacob Steinhardt. Approximating how single head attention learns. arXiv preprint arXiv:2103.07601, 2021.
- SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
- UAS+ [20] Mohd Usama, Belal Ahmad, Enmin Song, M Shamim Hossain, Mubarak Alrashoud, and Ghulam Muhammad. Attention-based sentiment analysis using convolutional and recurrent neural network. Future Generation Computer Systems, 113:571–578, 2020.
- VSP+ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Wil [12] Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 887–898, 2012.
- Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
- ZHDK [23] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. arXiv preprint arXiv:2302.02451, 2023.
- ZKV+ [20] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.