
An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

Yeqi Gao [email protected]. The University of Washington.    Zhao Song [email protected]. Adobe Research.    Junze Yin [email protected]. Boston University.

Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text classification, summarization, and generation. LLMs have shown great promise in improving the accuracy and efficiency of these tasks, and have the potential to revolutionize the field of natural language processing (NLP) in the years to come.

The exponential-function-based attention unit is a fundamental building block of LLMs. Several previous works have studied the convergence of exponential regression and softmax regression.

The exponential regression [Li, Song, Zhou 2023] and softmax regression [Deng, Li, Song 2023] can be formulated as follows. Given a matrix A\in\mathbb{R}^{n\times d} and a vector b\in\mathbb{R}^{n}, the goal of exponential regression is to solve

\min_{x}\|\exp(Ax)-b\|_{2}

and the goal of softmax regression is to solve

\min_{x}\|\langle\exp(Ax),{\bf 1}_{n}\rangle^{-1}\exp(Ax)-b\|_{2}.

In this work, we define a formulation slightly different from softmax regression:

\min_{x\in\mathbb{R}^{d}}\|u(x)-\langle u(x),{\bf 1}_{n}\rangle\cdot b\|_{2}

where u(x)\in\{\exp(Ax),\cosh(Ax),\sinh(Ax)\}. We provide an input sparsity time algorithm for this problem. Our algorithm framework is very general and can be applied to functions such as \cosh(\cdot) and \sinh(\cdot) as well. Our technique is also general enough to be applied to in-context learning for rescaled softmax regression.

1 Introduction

The background of large language models (LLMs) can be traced back to a series of groundbreaking models, including the Transformer model [45], GPT-1 [35], BERT [12], GPT-2 [36], and GPT-3 [8]. These models are trained on massive amounts of textual data to generate natural language text and have shown their power on various real-world tasks, including natural language translation [21], sentiment analysis [44], language modeling [31], and even creative writing [33]. The success of the new version of LLM named GPT-4 [33] has exemplified the use of LLMs in human-interaction tasks and suggests that LLMs are likely to continue to be a key area of research in the years to come.

Large Language Models (LLMs) rely heavily on attention computation to improve their performance in natural language processing tasks. The attention mechanism enables the model to selectively focus on specific parts of the input text [45, 12, 36, 8, 35], enhancing its ability to identify and extract relevant information. The attention matrix, a crucial component of the attention mechanism, is a squared matrix that represents the correlations between words or tokens in the input text. The entries in the matrix are computed using a soft attention mechanism that generates weights by applying a softmax function over the input sequence. Through this process, LLMs are able to identify and prioritize important parts of the input text, resulting in more accurate and efficient language processing.

Due to the importance of attention computation in natural language processing, there has been a growing interest in addressing both computation and regression issues that arise in attention calculations. A number of recent results study the computation of the attention matrix in LLMs, such as [48, 3, 10, 15]. In this work, we focus on the direction of regression tasks [17, 29, 13, 27]. To better illustrate the regression issues in the attention model, we will first introduce the traditional linear regression as follows.

Definition 1.1 (Linear regression).

Given a matrix A\in\mathbb{R}^{n\times d} and a vector b\in\mathbb{R}^{n}, the goal is to solve

\min_{x\in\mathbb{R}^{d}}\|Ax-b\|_{2}.

Based on the \exp activation function, which is widely used in the attention model, several theoretical transformer works have studied either the exponential regression problem [17, 29], for example

Definition 1.2 ([17, 29]).

Given a matrix A\in\mathbb{R}^{n\times d} and a vector b\in\mathbb{R}^{n}, the goal is to solve

\min_{x\in\mathbb{R}^{d}}\|\exp(Ax)-b\|_{2}

or the softmax regression problem [13, 27].

Definition 1.3 (Softmax Regression, [13, 27]).

Given a matrix A\in\mathbb{R}^{n\times d} and a vector b\in\mathbb{R}^{n}, the goal is to solve

\min_{x\in\mathbb{R}^{d}}\|\langle\exp(Ax),{\bf 1}_{n}\rangle^{-1}\exp(Ax)-b\|_{2}

Building on the problems above, we conduct a more in-depth analysis of softmax regression and extend the original problem to a rescaled version. Furthermore, the functions we concentrate on include \exp, \sinh, and \cosh. Let u(x) be one of the following:

  • u(x)=\exp(Ax)

  • u(x)=\cosh(Ax)

  • u(x)=\sinh(Ax)

The problem we address is presented as follows:

Definition 1.4 (Rescaled Softmax Regression).

Given a matrix A\in\mathbb{R}^{n\times d} and a vector b\in\mathbb{R}^{n}, the goal is to solve

\min_{x\in\mathbb{R}^{d}}\|u(x)-\langle u(x),{\bf 1}_{n}\rangle\cdot b\|_{2}
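To make the objective concrete, the following minimal numpy sketch (our own illustrative code; the function and variable names are ours, not from the paper) evaluates the rescaled residual \|u(x)-\langle u(x),{\bf 1}_{n}\rangle\cdot b\|_{2} for the three choices of u on a toy instance.

```python
import numpy as np

def rescaled_residual_norm(A, b, x, kind="exp"):
    """|| u(Ax) - <u(Ax), 1_n> * b ||_2 for u in {exp, cosh, sinh}."""
    z = A @ x
    u = {"exp": np.exp, "cosh": np.cosh, "sinh": np.sinh}[kind](z)
    alpha = u.sum()                      # <u(x), 1_n>
    return np.linalg.norm(u - alpha * b)

# toy instance
rng = np.random.default_rng(0)
n, d = 6, 3
A = 0.1 * rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)
print({k: rescaled_residual_norm(A, b, x, k) for k in ("exp", "cosh", "sinh")})
```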

The first contribution of this paper is defining the rescaled version of the softmax regression problem (Definition 1.4). We remark that the major difference between classical softmax regression (Definition 1.3) and our new rescaled softmax regression (Definition 1.4) is the location of the normalization factor \langle u(x),{\bf 1}_{n}\rangle. Due to this difference, the analysis of rescaled softmax regression is quite different; this analysis is the second contribution of this work. The third contribution of this paper is that our framework is very general and can handle several hyperbolic functions at the same time, including \exp, \cosh, and \sinh.

1.1 Our Results

Note that we follow the same assumption \|b\|_{2}\leq R as in [29]. The reason why [13] can assume \|b\|_{2}\leq 1 is that they solve a normalized version. Therefore, in our rescaled version, we only assume \|b\|_{2}\leq R.

Theorem 1.5 (Informal version of Theorem 11.1).

Suppose we are given vectors b,w\in\mathbb{R}^{n} and a matrix A\in\mathbb{R}^{n\times d}.

Let u(x) be one of the following:

  • u(x)=\exp(Ax)

  • u(x)=\cosh(Ax)

  • u(x)=\sinh(Ax)

We define x^{*} as the optimal solution of the following problem:

\min_{x\in\mathbb{R}^{d}}0.5\cdot\|u(x)-\langle u(x),{\bf 1}_{n}\rangle\cdot b\|_{2}^{2}+0.5\cdot\|\operatorname{diag}(w)Ax\|_{2}^{2}

Suppose the following conditions hold:

  • R\geq 4.

  • g(x^{*})={\bf 0}_{d}.

  • \|x^{*}\|_{2}\leq R.

  • \|A\|\leq R.

  • \|b\|_{2}\leq R.

  • M=\exp(O(R^{2}+\log n)).

  • w_{i}^{2}\geq 100M+l/\sigma_{\min}(A)^{2} for all i\in[n].

  • Let \epsilon\in(0,0.1) denote the accuracy parameter.

  • Let \delta\in(0,0.1) denote the failure probability.

  • Let x_{0} denote an initial point for which it holds that M\|x_{0}-x^{*}\|_{2}\leq 0.1l.

Then there exists a randomized algorithm (Algorithm 1) such that, with probability at least 1-\delta,

  • it runs T=\log(\|x_{0}-x^{*}\|_{2}/\epsilon) iterations,

  • in each iteration, it spends time

    O((\operatorname{nnz}(A)+d^{\omega})\cdot\operatorname{poly}(\log(n/\delta))),

  • and it outputs a vector \widetilde{x}\in\mathbb{R}^{d} such that

    \|\widetilde{x}-x^{*}\|_{2}\leq\epsilon.

Roadmap.

Our paper is organized as follows. In Section 2, we review recent research related to our work. In Section 3, we present the techniques used to prove our results. In Section 4, we define the notation and present the tools for approximate algebra, differential computation, and exact algebra used in this paper. In Section 5, we introduce the Newton step. In Section 6, we explain the important definitions, equations, and functions used in the later sections. In Section 7, we present the computation of the Hessian and the gradient. In Section 8, we prove that L=L_{u}+L_{\operatorname{reg}} is a convex function. The Hessian of L=L_{u}+L_{\operatorname{reg}} is proved to be Lipschitz in Section 9. In Section 10, we analyze the Lipschitz property with respect to A, where A\in\mathbb{R}^{n\times d}. In Section 11, we present our main result and algorithm.

2 Related Work

2.1 Transformer Theory

Optimization and Convergence

Studies in the field of optimization have investigated diverse facets of optimization methods and their applications. [42] investigated the behavior of the mechanism of single-head attention for Seq2Seq model learning, providing insights into how to choose parameters for better performance. [49] emphasized the importance of adaptive methods for attention models and proposed a new adaptive method for the attention mechanism. [17] studied the convergence of over-parameterized neural networks with exponential activation functions, addressing the over-parametrization problem. [29] proposed an algorithm for regularized exponential regression that runs in input sparsity time and demonstrated its effectiveness on various datasets. Finally, [26] provided a thorough clarification of how transformers can learn the ”semantic structure” to detect the patterns of word co-occurrence, exploring the optimization techniques used in transformers and highlighting their strengths and weaknesses.

Learning in-context

Research on in-context learners based on transformers has explored various aspects of their abilities and mechanisms. For example, [4] showed that these learners can implicitly perform traditional learning algorithms by continuously updating as new examples arrive and by encoding smaller models within their activations. Another work, [19], focused on training a model under in-context conditions to learn a class of functions, such as linear functions, aiming to determine whether a model given data derived from specific functions within a class can learn the “majority” of functions in that class through training. In their research, [32] described how Transformers operate as in-context learners and revealed similarities between several gradient-descent-based meta-learning formulations and the training process of Transformers on in-context tasks. In general, these studies provide valuable insights into the abilities and mechanisms of transformer-based in-context learners, which have great potential to improve machine learning applications. [27] proved a theoretical result about in-context learning under the softmax regression formulation of [13].

Fast Attention Computation

The computation of attention has been explored in several works, with a focus on both dynamic and static attention. [10] investigated the dynamic version of attention computation, showing both positive results and negative results. They utilized lazy update techniques in their algorithmic results while the hardness result was based on the Hinted MV conjecture. On the other hand, [48] and [3] focused on static attention computation. [3] proposed an algorithm for static attention and provided a hardness result based on the exponential time hypothesis. Meanwhile, [48] explored the efficiency of static attention algorithms in various applications. Finally, [15] investigated the sparsification of feature dimension in attention computation, providing both randomized and deterministic algorithms.

2.2 Fast Iterative Algorithm for Linear Algebra

One of the important ideas in this work is to use sketching to speed up the iterative algorithm in optimization. Sketching is a powerful technique and has been widely used in numerous important tasks such as linear programming [24, 39, 14, 30, 11, 18], empirical risk minimization [28, 34], John Ellipsoid algorithm [40], cutting plane methods [22, 23], semidefinite programming [18, 41], federated learning [38], discrepancy algorithm [16], training over-parametrized neural network [43, 1, 47].

3 Technique Overview

This section provides an overview of our techniques.

General Functions

To apply our theory to the \exp, \sinh, and \cosh functions simultaneously, we first introduce a generalized definition. We use u(x) to represent any one of the functions \exp, \sinh, and \cosh applied to Ax, and we isolate the common properties of u(x) that our proofs rely on, so that the subsequent arguments only need to refer to u(x). Concretely, u(x) is one of the following:

  • Case 1. u(x)=\exp(Ax)

  • Case 2. u(x)=\cosh(Ax)

  • Case 3. u(x)=\sinh(Ax)

With Fact 4.8, \|A\|\leq R, and \|x\|_{2}\leq R, we have

\|u(x)\|_{2}\leq\sqrt{n}\exp(R^{2})

The corresponding function v(x), which collects the entrywise derivatives of u, is as follows:

  • Case 1. v(x)=\exp(Ax)

  • Case 2. v(x)=\sinh(Ax)

  • Case 3. v(x)=\cosh(Ax)

A unified version of the gradient and Hessian computations can then be obtained as follows (a numerical sanity check of the gradient formula is sketched after the list):

  • \frac{\mathrm{d}u(x)}{\mathrm{d}x}=(v(x){\bf 1}_{d}^{\top})\circ A

    • \frac{\mathrm{d}u(x)}{\mathrm{d}x_{i}}=v(x)\circ A_{*,i} for each i\in[d]

  • \frac{\mathrm{d}^{2}u(x)}{\mathrm{d}x_{i}^{2}}=A_{*,i}\circ u(x)\circ A_{*,i} for each i\in[d]

  • \frac{\mathrm{d}^{2}u(x)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,i}\circ u(x)\circ A_{*,j} for each (i,j)\in[d]\times[d]
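The following short numpy sketch (purely illustrative; the helper names are ours) checks the unified gradient formula \frac{\mathrm{d}u(x)}{\mathrm{d}x_{i}}=v(x)\circ A_{*,i} against finite differences for all three choices of u.

```python
import numpy as np

def u_and_v(A, x, kind):
    """Return (u(x), v(x)) for u in {exp, cosh, sinh}; v is the entrywise derivative."""
    z = A @ x
    if kind == "exp":
        return np.exp(z), np.exp(z)
    if kind == "cosh":
        return np.cosh(z), np.sinh(z)
    return np.sinh(z), np.cosh(z)     # kind == "sinh"

rng = np.random.default_rng(1)
n, d, i, eps = 5, 3, 1, 1e-6
A = 0.1 * rng.standard_normal((n, d))
x = rng.standard_normal(d)

for kind in ("exp", "cosh", "sinh"):
    u0, v0 = u_and_v(A, x, kind)
    e_i = np.zeros(d); e_i[i] = eps
    u1, _ = u_and_v(A, x + e_i, kind)
    analytic = v0 * A[:, i]                  # v(x) o A_{*,i}
    numeric = (u1 - u0) / eps                # finite-difference approximation
    assert np.allclose(analytic, numeric, atol=1e-4)
```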

Hessian Computation

Taking the regularization weight w\in\mathbb{R}^{n} into account as well, the target function we focus on is the following:

\min_{x\in\mathbb{R}^{d}}L(x)=\min_{x\in\mathbb{R}^{d}}\underbrace{0.5\cdot\|u(x)-\langle u(x),{\bf 1}_{n}\rangle\cdot b\|_{2}^{2}}_{:=L_{u}}+\underbrace{0.5\cdot\|\operatorname{diag}(w)Ax\|_{2}^{2}}_{:=L_{\operatorname{reg}}} \qquad (1)

Computing the Hessian of this problem directly is complex, so we introduce some techniques for the computation of the Hessian function. To clarify the exposition, we compare our Hessian computation with the ones presented in [29, 13]. Specifically, we introduce the function \alpha(x):=\langle u(x),{\bf 1}_{n}\rangle and note that in [29] there is no need to compute \alpha(x), while \alpha^{-1}(x) is the focus of [13]. Our emphasis, however, is on the function \alpha(x) itself.

Additionally, with the definition c(x):=u(x)-b\cdot\alpha(x), the computation of the Hessian can be divided into the computation of \frac{\mathrm{d}^{2}u(x)}{\mathrm{d}x^{2}}, \frac{\mathrm{d}^{2}\alpha(x)}{\mathrm{d}x^{2}}, and \frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x^{2}}.

\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}} is Positive Definite

The first property we need to establish in order to apply the Newton optimization method is the positive definiteness of the Hessian. This is inspired by the semidefinite programming literature [2, 20]. We define R_{0}:=\max\{\|v(x)\|_{2},\|b\|_{2},\|c(x)\|_{2},\|u(x)\|_{2},1\}. Given that

\frac{\mathrm{d}^{2}L(x)}{\mathrm{d}x_{i}^{2}}=\underbrace{A_{*,i}^{\top}B(x)A_{*,i}}_{\frac{\mathrm{d}^{2}L_{u}(x)}{\mathrm{d}x_{i}^{2}}}+\underbrace{A_{*,i}^{\top}W^{2}A_{*,i}}_{\frac{\mathrm{d}^{2}L_{\operatorname{reg}}(x)}{\mathrm{d}x_{i}^{2}}}

and that B(x) satisfies the bound

-10R_{0}^{4}\cdot I_{n}\preceq B(x)\preceq 10R_{0}^{4}\cdot I_{n},

choosing w_{i}\geq 10R_{0}^{4}+l/\sigma_{\min}(A)^{2} makes the Hessian positive definite.

\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}} is Lipschitz with respect to the variable x

To apply the Newton optimization method, it is also necessary to establish the Lipschitz property of the Hessian. To this end, we upper bound \|H(x)-H(y)\| by c\cdot\|x-y\|_{2}, where c is a scalar. Decomposing the difference into matrices G_{1},\ldots,G_{5}, we have

\|H(x)-H(y)\|\leq\|A\|\cdot\Big(\sum_{i=1}^{5}\|G_{i}\|\Big)\cdot\|A\|

The idea of how to bound each term G_{i} is quite standard in the neural network literature (for example, see [7, 6]).

With R_{\infty}:=\max\{\|u(x)\|_{2},\|u(y)\|_{2},\|v(x)\|_{2},\|v(y)\|_{2},\|c(x)\|_{2},\|c(y)\|_{2},\|b\|_{2},1\}, we obtain

\sum_{i=1}^{5}\|G_{i}\|\leq 20R_{\infty}^{3}\cdot\max\{\|u(x)-u(y)\|_{2},\|c(x)-c(y)\|_{2}\}.

Furthermore, we can prove that the Hessian is Lipschitz continuous as follows:

\|H(x)-H(y)\|\leq n^{4}\exp(20R^{2})\cdot\|x-y\|_{2}

Approximated Newton Algorithm

Based on the properties of the Hessian established above, we can apply the approximate Newton method to this regression problem. Under the assumption that the loss function is (l,M)-good, we can guarantee the correctness of our algorithm.

Given M=n^{4}\exp(20R^{2}), letting x^{*} be the minimizer of Eq. (1) and x_{0} the initialization, we make the good-initialization assumption

\underbrace{\|x_{0}-x^{*}\|_{2}}_{:=r_{0}}\cdot M\leq 0.1l

To speed up the computation, it is natural to approximate the Hessian and its inverse (for example, see [11, 28, 37, 9, 24, 39, 20, 18, 16, 40, 23]). The Hessian \frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}} can be written in the form A^{\top}H(x_{t})A, where H(x_{t}) is a diagonal matrix. With \epsilon_{0}\in(0,0.1), a way to obtain a sparse approximation is to substitute H(x_{t}) with a sparse matrix \widetilde{H}(x_{t}) such that

(1-\epsilon_{0})\cdot H(x_{t})\preceq\widetilde{H}(x_{t})\preceq(1+\epsilon_{0})\cdot H(x_{t})

The running time of the Hessian computation can then be reduced to \widetilde{O}(\operatorname{nnz}(A)+d^{\omega}), where \omega\approx 2.373 denotes the exponent of matrix multiplication [46, 25, 5]. To ensure the convergence of our algorithm, the number of iterations is \log(1/\epsilon) under the assumptions above, leading to a total running time of

\widetilde{O}((\operatorname{nnz}(A)+d^{\omega})\cdot\log(1/\epsilon)).

Thus, we can derive our main result Theorem 1.5.
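For intuition, here is a minimal sketch (ours, not the paper's Algorithm 1) of the approximate Newton loop x_{t+1}=x_{t}-\widetilde{H}(x_{t})^{-1}g(x_{t}); the sketching step that produces \widetilde{H} is abstracted behind a callable, and the toy usage plugs in the exact Hessian of a ridge-regression objective purely to show the shape of the loop.

```python
import numpy as np

def approx_newton(x0, grad, hess_approx, T):
    """Approximate Newton iteration x_{t+1} = x_t - H~(x_t)^{-1} g(x_t).

    grad and hess_approx are user-supplied callables; hess_approx may return a
    sketched/sparsified Hessian, as long as it is a (1 +/- eps_0) spectral
    approximation of the true Hessian."""
    x = x0.copy()
    for _ in range(T):
        x = x - np.linalg.solve(hess_approx(x), grad(x))
    return x

# toy usage on a ridge objective 0.5*||Ax - b||^2 + 0.5*lam*||x||^2,
# where the exact Hessian stands in for the sketched one
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
g = lambda x: A.T @ (A @ x - b) + lam * x
H = lambda x: A.T @ A + lam * np.eye(5)
x_hat = approx_newton(np.zeros(5), g, H, T=3)
print(np.linalg.norm(g(x_hat)))      # close to zero after a few iterations
```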

From Lipschitz with respect to x to Lipschitz with respect to A

In Section 9, we prove a number of results on Lipschitz continuity with respect to x. To present the application to in-context learning for rescaled softmax regression, we generalize Lipschitz continuity with respect to x to Lipschitz continuity with respect to A (see Section 10). Finally, we present Corollary 11.2 as our in-context learning application.

4 Preliminary

In Section 4.1, we introduce the important notations and mathematics symbols, which are used in this paper. In Section 4.2, we present the algebraic properties for \circ and diag\operatorname{diag}. In Section 4.3, the properties of the inner product are explained. In Section 4.4, the properties of the \preceq and its relationship with diag\operatorname{diag} and \circ are introduced. In Section 4.5, we display the fundamental derivative rules, both for the scalar variables and for the vector variables. In Section 4.6, we demonstrate the properties of the vector norm bound, including the Cauchy-Schwarz inequality and other inequalities of the bound containing \circ and diag\operatorname{diag}. In Section 4.7, we illustrate the properties of the matrix norm bound, namely the inequalities of the spectral norms. In Section 4.8, we introduce the properties of the hyperbolic functions, which take the scalar as an element of their domains. On the other hand, in Section 4.9, we elucidate the properties of the hyperbolic functions, which take the vector as an element of their domains. In Section 4.10, we define the important regularization function, Lreg:dL_{\operatorname{reg}}:\mathbb{R}^{d}\to\mathbb{R}, and analyze its basic properties. In Section 4.11, we define the gradient and Hessian of an arbitrary loss function LL and define the update of the Newton method.

4.1 Notations

In this section, we explain the fundamental notations.

We use \mathbb{Z}_{+} to denote the set of all positive integers, and we let n be an arbitrary element of \mathbb{Z}_{+}. We define [n]:=\{1,2,\ldots,n\}.

Let x\in\mathbb{R}^{n} and y\in\mathbb{R}^{n} be two vectors. For any i\in[n], we let x_{i}\in\mathbb{R} denote the i-th entry of x. We use x\circ y\in\mathbb{R}^{n} to denote the vector satisfying (x\circ y)_{i}=x_{i}y_{i} for each i\in[n]. We use \|x\|_{p} (where p\in\{1,2,\infty\}) to denote the \ell_{p} norm of x, where \|x\|_{1}:=\sum_{i=1}^{n}|x_{i}| (\ell_{1} norm), \|x\|_{2}:=(\sum_{i=1}^{n}x_{i}^{2})^{1/2} (\ell_{2} norm), and \|x\|_{\infty}:=\max_{i\in[n]}|x_{i}| (\ell_{\infty} norm).

For a scalar zz\in\mathbb{R}, we let exp(z)\exp(z) represent the standard exponential function.

We then define cosh(z):=12(exp(z)+exp(z))\cosh(z):=\frac{1}{2}(\exp(z)+\exp(-z)) and sinh(z):=12(exp(z)exp(z))\sinh(z):=\frac{1}{2}(\exp(z)-\exp(-z)).

Note that

exp(z)=exp(z),cosh(z)=sinh(z),sinh(z)=cosh(z)\displaystyle\exp(z)^{\prime}=\exp(z),\cosh(z)^{\prime}=\sinh(z),\sinh(z)^{\prime}=\cosh(z)

and

exp(z)′′=exp(z),cosh(z)′′=cosh(z),sinh(z)′′=sinh(z).\displaystyle\exp(z)^{\prime\prime}=\exp(z),\cosh(z)^{\prime\prime}=\cosh(z),\sinh(z)^{\prime\prime}=\sinh(z).

For an arbitrary vector xnx\in\mathbb{R}^{n}, we use exp(x)n\exp(x)\in\mathbb{R}^{n} to denote a vector whose ii-th entry exp(x)i\exp(x)_{i} is exp(xi)\exp(x_{i}). We use x,y\langle x,y\rangle to denote i=1nxiyi\sum_{i=1}^{n}x_{i}y_{i}.

𝟏n{\bf 1}_{n} represents a nn-dimensional vector whose entries are all 11, and 𝟎n{\bf 0}_{n} represents a nn-dimensional vector whose entries are all 0.

For an arbitrary vector unu\in\mathbb{R}^{n}, let diag(u)n×n\operatorname{diag}(u)\in\mathbb{R}^{n\times n} represent a diagonal matrix whose ii-th entry on diagonal is uiu_{i}.

For an arbitrary symmetric matrix Bn×nB\in\mathbb{R}^{n\times n}, we say BB is positive definite (B0B\succ 0) if for all vectors xn\{𝟎n}x\in\mathbb{R}^{n}\backslash\{{\bf 0}_{n}\}, xBx>0x^{\top}Bx>0.

For a symmetric matrix Bn×nB\in\mathbb{R}^{n\times n}, we say BB is positive semidefinite (B0B\succeq 0) if for all vectors xnx\in\mathbb{R}^{n}, xBx0x^{\top}Bx\geq 0.

For symmetric matrices BB and CC, we say BCB\succeq C if for all xx, xBxxCxx^{\top}Bx\geq x^{\top}Cx.

For any matrix AA, we use A\|A\| to denote the spectral norm of AA, i.e., A=maxx2=1Ax2\|A\|=\max_{\|x\|_{2}=1}\|Ax\|_{2}.

For each i[d]i\in[d], we use A,inA_{*,i}\in\mathbb{R}^{n} to denote the ii-th column of matrix An×dA\in\mathbb{R}^{n\times d}.

We use InI_{n} to denote an identity matrix that has size n×nn\times n and all the diagonal are ones.

4.2 Basic Algebra for \circ and diag\operatorname{diag}

In this section, we provide a fact that collects the basic algebraic properties of \circ and \operatorname{diag}; a short numerical sanity check follows the fact.

Fact 4.1.

Given vectors ana\in\mathbb{R}^{n}, bnb\in\mathbb{R}^{n}, cnc\in\mathbb{R}^{n} and dnd\in\mathbb{R}^{n}, we have

  • ab=diag(a)b=diag(a)diag(b)𝟏na\circ b=\operatorname{diag}(a)\cdot b=\operatorname{diag}(a)\cdot\operatorname{diag}(b)\cdot{\bf 1}_{n}

    • ab=baa\circ b=b\circ a

    • diag(a)b=diag(b)a\operatorname{diag}(a)b=\operatorname{diag}(b)a

    • diag(a)diag(b)𝟏n=diag(b)diag(a)𝟏n\operatorname{diag}(a)\cdot\operatorname{diag}(b)\cdot{\bf 1}_{n}=\operatorname{diag}(b)\cdot\operatorname{diag}(a)\cdot{\bf 1}_{n}

  • diag(ab)=diag(a)diag(b)\operatorname{diag}(a\circ b)=\operatorname{diag}(a)\operatorname{diag}(b)

  • diag(a)+diag(b)=diag(a+b)\operatorname{diag}(a)+\operatorname{diag}(b)=\operatorname{diag}(a+b)

  • a(bc)=adiag(b)ca^{\top}(b\circ c)=a^{\top}\operatorname{diag}(b)c

    • a(bc)=b(ac)=c(ab)a^{\top}(b\circ c)=b^{\top}(a\circ c)=c^{\top}(a\circ b)

    • adiag(b)c=bdiag(a)c=adiag(c)ba^{\top}\operatorname{diag}(b)c=b^{\top}\operatorname{diag}(a)c=a^{\top}\operatorname{diag}(c)b

  • a,bc=adiag(b)c\langle a,b\circ c\rangle=a^{\top}\operatorname{diag}(b)c
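As a quick numerical sanity check of Fact 4.1 (a throwaway numpy script of ours, not part of the formal development):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
a, b, c = rng.standard_normal((3, n))

assert np.allclose(a * b, np.diag(a) @ b)                    # a o b = diag(a) b
assert np.allclose(np.diag(a * b), np.diag(a) @ np.diag(b))  # diag(a o b) = diag(a) diag(b)
assert np.isclose(a @ (b * c), a @ np.diag(b) @ c)           # a^T (b o c) = a^T diag(b) c
assert np.isclose(a @ (b * c), b @ (a * c))                  # a^T (b o c) = b^T (a o c)
```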

4.3 Basic Inner Product

Now, we present the inner product properties.

Fact 4.2.

Given vectors ana\in\mathbb{R}^{n}, bnb\in\mathbb{R}^{n} and cnc\in\mathbb{R}^{n}, we have

  • a,b=b,a\langle a,b\rangle=\langle b,a\rangle

  • ab,c=abc,𝟏n\langle a\circ b,c\rangle=\langle a\circ b\circ c,{\bf 1}_{n}\rangle

  • a,b=ab=ba\langle a,b\rangle=a^{\top}b=b^{\top}a

  • a,b=ab,𝟏n\langle a,b\rangle=\langle a\circ b,{\bf 1}_{n}\rangle

  • ab22=a22+b222a,b\|a-b\|_{2}^{2}=\|a\|_{2}^{2}+\|b\|_{2}^{2}-2\langle a,b\rangle

  • \langle a,b\rangle\langle c,d\rangle=a^{\top}b\,c^{\top}d=b^{\top}a\,c^{\top}d

4.4 Positive Semi-definite

In this section, we explain the properties of the mathematics operation \preceq.

Fact 4.3.

Let u,vnu,v\in\mathbb{R}^{n}. We have:

  • uuu22Inuu^{\top}\preceq\|u\|_{2}^{2}\cdot I_{n}.

  • diag(u)u2In\operatorname{diag}(u)\preceq\|u\|_{2}\cdot I_{n}

  • diag(uu)u22In\operatorname{diag}(u\circ u)\preceq\|u\|_{2}^{2}\cdot I_{n}

  • diag(uv)u2v2In\operatorname{diag}(u\circ v)\preceq\|u\|_{2}\cdot\|v\|_{2}\cdot I_{n}

  • uv+vuuu+vvuv^{\top}+vu^{\top}\preceq uu^{\top}+vv^{\top}

  • uv+vu(uu+vv)uv^{\top}+vu^{\top}\succeq-(uu^{\top}+vv^{\top})

  • (vu)(vu)v2uu(v\circ u)(v\circ u)^{\top}\preceq\|v\|_{\infty}^{2}uu^{\top}

  • (vu)uvuu(v\circ u)u^{\top}\preceq\|v\|_{\infty}uu^{\top}

  • (vu)uvuu(v\circ u)u^{\top}\succeq-\|v\|_{\infty}uu^{\top}

4.5 Basic Calculus and Chain Rule

In this section, we present the basic calculus rules, including the derivative rules for scalars and the derivative rules for vectors.

Fact 4.4.

We have

  • Let α\alpha\in\mathbb{R} be a fixed scalar, let xdx\in\mathbb{R}^{d} denote variable, then we have dαxdt=αdxdt\frac{\mathrm{d}\alpha x}{\mathrm{d}t}=\alpha\frac{\mathrm{d}x}{\mathrm{d}t}

  • Let f(x)nf(x)\in\mathbb{R}^{n}, we have d0.5f(x)22dt=f(x),df(x)dt\frac{\mathrm{d}0.5\|f(x)\|_{2}^{2}}{\mathrm{d}t}=\langle f(x),\frac{\mathrm{d}f(x)}{\mathrm{d}t}\rangle

  • Let bnb\in\mathbb{R}^{n} be a fixed vector, we have d(bf(x))dt=bdf(x)dt\frac{\mathrm{d}(b\circ f(x))}{\mathrm{d}t}=b\circ\frac{\mathrm{d}f(x)}{\mathrm{d}t}

  • Let zz\in\mathbb{R} denote a scalar variable, we have the following calculus rules

    • dexp(z)dt=exp(z)dzdt\frac{\mathrm{d}\exp(z)}{\mathrm{d}t}=\exp(z)\frac{\mathrm{d}z}{\mathrm{d}t}

    • dcosh(z)dt=sinh(z)dzdt\frac{\mathrm{d}\cosh(z)}{\mathrm{d}t}=\sinh(z)\frac{\mathrm{d}z}{\mathrm{d}t}

    • dsinh(z)dt=cosh(z)dzdt\frac{\mathrm{d}\sinh(z)}{\mathrm{d}t}=\cosh(z)\frac{\mathrm{d}z}{\mathrm{d}t}

  • Let xnx\in\mathbb{R}^{n} denote a vector variable, we have the following calculus rules

    • dexp(x)dt=exp(x)dxdt\frac{\mathrm{d}\exp(x)}{\mathrm{d}t}=\exp(x)\circ\frac{\mathrm{d}x}{\mathrm{d}t}

    • dcosh(x)dt=sinh(x)dxdt\frac{\mathrm{d}\cosh(x)}{\mathrm{d}t}=\sinh(x)\circ\frac{\mathrm{d}x}{\mathrm{d}t}

    • dsinh(x)dt=cosh(x)dxdt\frac{\mathrm{d}\sinh(x)}{\mathrm{d}t}=\cosh(x)\circ\frac{\mathrm{d}x}{\mathrm{d}t}

4.6 Basic Vector Norm Bounds

Now, we analyze the norm bounds for vectors.

Fact 4.5.

Given vectors ana\in\mathbb{R}^{n} and bnb\in\mathbb{R}^{n}, we have

  • a,ba2b2\langle a,b\rangle\leq\|a\|_{2}\cdot\|b\|_{2} (Cauchy-Schwarz inequality)

  • diag(a)aa2\|\operatorname{diag}(a)\|\leq\|a\|_{\infty}\leq\|a\|_{2}

  • ab2ab2a2b2\|a\circ b\|_{2}\leq\|a\|_{\infty}\cdot\|b\|_{2}\leq\|a\|_{2}\cdot\|b\|_{2}

  • aa2na\|a\|_{\infty}\leq\|a\|_{2}\leq\sqrt{n}\cdot\|a\|_{\infty} (\ell_{\infty}-norm vs 2\ell_{2}-norm)

  • a2a1na2\|a\|_{2}\leq\|a\|_{1}\leq\sqrt{n}\cdot\|a\|_{2} (2\ell_{2}-norm vs 1\ell_{1}-norm)

4.7 Basic Matrix Norm Bound

Then, we analyze the norm bounds for matrices.

Fact 4.6.

For matrices AA and BB, we have

  • Let a,bda,b\in\mathbb{R}^{d} denote two vectors, then we have aba2b2\|ab^{\top}\|\leq\|a\|_{2}\cdot\|b\|_{2}.

  • AxAx2\|Ax\|\leq\|A\|\cdot\|x\|_{2}.

  • ABAB\|AB\|\leq\|A\|\cdot\|B\|

  • Let α\alpha\in\mathbb{R} be a scalar, then we have αA|α|A\|\alpha\cdot A\|\leq|\alpha|\cdot\|A\|.

4.8 Basic Hyperbolic Functions: Scalar Version

In this section, we analyze the properties of the hyperbolic functions, including sinh\sinh and cosh\cosh, and exponential functions, exp\exp, where the elements of the domains of these functions are all scalars.

Fact 4.7.

For a scalar zz\in\mathbb{R}, we have

  • Part 1. cosh2(z)sinh2(z)=1\cosh^{2}(z)-\sinh^{2}(z)=1

  • Part 2. |exp(z)|exp(|z|)|\exp(z)|\leq\exp(|z|)

  • Part 3. |cosh(z)|=cosh(z)=cosh(|z|)exp(|z|)|\cosh(z)|=\cosh(z)=\cosh(|z|)\leq\exp(|z|)

  • Part 4. |sinh(z)|=sinh(|z|)exp(|z|)|\sinh(z)|=\sinh(|z|)\leq\exp(|z|)

  • Part 5. \sinh(|z|)\leq\cosh(|z|)\leq\exp(|z|)

Taylor expansions

  • exp(z)=i=01i!zi\exp(z)=\sum_{i=0}^{\infty}\frac{1}{i!}z^{i}

  • cosh(z)=i=01(2i)!z2i\cosh(z)=\sum_{i=0}^{\infty}\frac{1}{(2i)!}z^{2i}

  • sinh(z)=i=01(2i+1)!z2i+1\sinh(z)=\sum_{i=0}^{\infty}\frac{1}{(2i+1)!}z^{2i+1}

Approximation in a small range (numerically checked in the sketch after the proof below),

  • For all x\in\mathbb{R} with |x|\leq 0.1, we have |\exp(x)-1|\leq 2|x|

  • For all x\in\mathbb{R} with |x|\leq 0.1, we have |\cosh(x)-1|\leq x^{2}

  • For all x\in\mathbb{R} with |x|\leq 0.1, we have |\sinh(x)|\leq 2|x|

  • For all x,y\in\mathbb{R} with |x-y|\leq 0.1, we have |\exp(x)-\exp(y)|\leq\exp(x)\cdot 2|x-y|

  • For all x,y\in\mathbb{R} with |x-y|\leq 0.1, we have |\cosh(x)-\cosh(y)|\leq\cosh(x)\cdot 2|x-y|

  • For all x,y\in\mathbb{R} with |x-y|\leq 0.1, we have |\sinh(x)-\sinh(y)|\leq\cosh(x)\cdot 2|x-y|

Proof.

Most of the proofs are trivial or standard. We only provide some proofs.

Proof of Part 4.

We have

|\sinh(z)|= |0.5\exp(z)-0.5\exp(-z)|
= 0.5\exp(|z|)-0.5\exp(-|z|)
= \sinh(|z|)

where the second step holds because |0.5\exp(z)-0.5\exp(-z)| is unchanged when z is replaced by -z, so we may assume z\geq 0.

We have

|\sinh(z)|= |0.5\exp(z)-0.5\exp(-z)|
\leq 0.5\exp(|z|)+0.5\exp(|z|)
= \exp(|z|)

Proof of Part 5.

We have

\sinh(|z|)= 0.5\exp(|z|)-0.5\exp(-|z|)
\leq 0.5\exp(|z|)+0.5\exp(-|z|)
= \cosh(|z|)

We have

\cosh(|z|)= 0.5\exp(|z|)+0.5\exp(-|z|)
\leq 0.5\exp(|z|)+0.5\exp(|z|)
= \exp(|z|)
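As a quick numerical sanity check of the small-range approximations stated in Fact 4.7 (an illustrative script of ours, not a proof):

```python
import numpy as np

xs = np.linspace(-0.1, 0.1, 2001)
assert np.all(np.abs(np.exp(xs) - 1) <= 2 * np.abs(xs))   # |exp(x)-1| <= 2|x|
assert np.all(np.abs(np.cosh(xs) - 1) <= xs ** 2)         # |cosh(x)-1| <= x^2
assert np.all(np.abs(np.sinh(xs)) <= 2 * np.abs(xs))      # |sinh(x)| <= 2|x|

x = 0.7
ys = x + np.linspace(-0.1, 0.1, 2001)                     # so that |x - y| <= 0.1
assert np.all(np.abs(np.exp(x) - np.exp(ys)) <= np.exp(x) * 2 * np.abs(x - ys))
assert np.all(np.abs(np.cosh(x) - np.cosh(ys)) <= np.cosh(x) * 2 * np.abs(x - ys))
assert np.all(np.abs(np.sinh(x) - np.sinh(ys)) <= np.cosh(x) * 2 * np.abs(x - ys))
```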

4.9 Basic Hyperbolic Functions: Vector Version

In this section, we keep analyzing the properties of the hyperbolic functions, namely sinh\sinh and cosh\cosh, and exponential functions, exp\exp, but the elements of the domains of these functions are all vectors.

Fact 4.8.

For vectors a,bna,b\in\mathbb{R}^{n}

  • exp(a)exp(a2)\|\exp(a)\|_{\infty}\leq\exp(\|a\|_{2})

  • cosh(a)cosh(a2)exp(a2)\|\cosh(a)\|_{\infty}\leq\cosh(\|a\|_{2})\leq\exp(\|a\|_{2})

  • sinh(a)sinh(a2)cosh(a2)exp(a2)\|\sinh(a)\|_{\infty}\leq\sinh(\|a\|_{2})\leq\cosh(\|a\|_{2})\leq\exp(\|a\|_{2})

  • cosh(a)cosh(a)sinh(a)sinh(a)=𝟏n\cosh(a)\circ\cosh(a)-\sinh(a)\circ\sinh(a)={\bf 1}_{n}

Approximation in a small range (a numerical check follows this fact),

  • For any ab0.01\|a-b\|_{\infty}\leq 0.01, we have exp(a)exp(b)2exp(a)22ab\|\exp(a)-\exp(b)\|_{2}\leq\|\exp(a)\|_{2}\cdot 2\|a-b\|_{\infty}

  • For any ab0.01\|a-b\|_{\infty}\leq 0.01, we have cosh(a)cosh(b)2cosh(a)22ab\|\cosh(a)-\cosh(b)\|_{2}\leq\|\cosh(a)\|_{2}\cdot 2\|a-b\|_{\infty}

  • For any ab0.01\|a-b\|_{\infty}\leq 0.01, we have sinh(a)sinh(b)2cosh(a)22ab\|\sinh(a)-\sinh(b)\|_{2}\leq\|\cosh(a)\|_{2}\cdot 2\|a-b\|_{\infty}
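A similar throwaway check (ours, purely illustrative) for the vector-version bounds of Fact 4.8:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(5)
b = a + 0.01 * rng.uniform(-1.0, 1.0, 5)        # ensures ||a - b||_inf <= 0.01
gap = np.max(np.abs(a - b))

assert np.linalg.norm(np.exp(a) - np.exp(b)) <= np.linalg.norm(np.exp(a)) * 2 * gap
assert np.linalg.norm(np.cosh(a) - np.cosh(b)) <= np.linalg.norm(np.cosh(a)) * 2 * gap
assert np.linalg.norm(np.sinh(a) - np.sinh(b)) <= np.linalg.norm(np.cosh(a)) * 2 * gap
assert np.allclose(np.cosh(a) ** 2 - np.sinh(a) ** 2, 1.0)   # cosh o cosh - sinh o sinh = 1_n
```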

4.10 Regularization

Now, we define the regularization function L_{\operatorname{reg}} and analyze its properties; a numerical sanity check of Lemma 4.10 is sketched after the lemma.

Definition 4.9.

Given a matrix An×dA\in\mathbb{R}^{n\times d} and W=diag(w)n×nW=\operatorname{diag}(w)\in\mathbb{R}^{n\times n} where wnw\in\mathbb{R}^{n} is a vector, we define Lreg:dL_{\operatorname{reg}}:\mathbb{R}^{d}\rightarrow\mathbb{R}

Lreg(x):=0.5WAx22\displaystyle L_{\operatorname{reg}}(x):=0.5\|WAx\|_{2}^{2}
Lemma 4.10 (Folklore, see [29, 13] as an example).

If the following condition holds

  • Let Lreg(x)L_{\operatorname{reg}}(x) be defined as Definition 4.9.

Then, we have

  • The gradient is

    dLregdx=AW2Ax\displaystyle\frac{\mathrm{d}L_{\operatorname{reg}}}{\mathrm{d}x}=A^{\top}W^{2}Ax
  • The Hessian is

    d2Lregdx2=AW2A\displaystyle\frac{\mathrm{d}^{2}L_{\operatorname{reg}}}{\mathrm{d}x^{2}}=A^{\top}W^{2}A
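A short finite-difference sanity check of Lemma 4.10 (purely illustrative; not part of the formal proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 7, 3
A = rng.standard_normal((n, d))
w = rng.uniform(0.5, 1.5, n)
W = np.diag(w)
L_reg = lambda x: 0.5 * np.linalg.norm(W @ A @ x) ** 2

x = rng.standard_normal(d)
grad = A.T @ W @ W @ A @ x                       # A^T W^2 A x
eps = 1e-6
num_grad = np.array([(L_reg(x + eps * e) - L_reg(x - eps * e)) / (2 * eps)
                     for e in np.eye(d)])
assert np.allclose(grad, num_grad, atol=1e-4)

hess = A.T @ W @ W @ A                           # A^T W^2 A, which is also PSD
assert np.all(np.linalg.eigvalsh(hess) >= -1e-10)
```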

4.11 Gradient and Hessian

Finally, in this section, we define the gradient and Hessian of the loss function and present the definition of the update of the Newton method.

Definition 4.11 (Gradient and Hessian).

Let L(x)L(x) be some loss function. The gradient g:ddg:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} of the loss function is defined as

g(x):=L(x)=dLdx\displaystyle g(x):=\nabla L(x)=\frac{\mathrm{d}L}{\mathrm{d}x}

The Hessian H:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d} of the loss function is defined as follows:

H(x):=2L(x)=d2Ldx2\displaystyle H(x):=\nabla^{2}L(x)=\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}
Definition 4.12 (Update of the Newton method).

Given the gradient function g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} and the Hessian matrix H:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d}, the exact update of the Newton method is defined as follows:

xt+1=xtH(xt)1g(xt)\displaystyle x_{t+1}=x_{t}-H(x_{t})^{-1}\cdot g(x_{t})

5 Newton Step

In Section 5.1, we consider a loss function L(x) and call it (l,M)-good if it satisfies certain conditions. In Section 5.2, we define the approximated Hessian and, based on it, the approximate update, and we analyze their properties.

5.1 (l,M)(l,M)-good Loss function

In this section, we explain the meaning of (l,M)(l,M)-good Loss function.

Given that L(x)=Lu(x)+Lreg(x)L(x)=L_{u}(x)+L_{\operatorname{reg}}(x), we consider the following optimization problem

minxdL(x).\displaystyle\min_{x\in\mathbb{R}^{d}}L(x).

Now, we provide the following assumptions

Definition 5.1 ((l,M)(l,M)-good Loss function).

For a function L:\mathbb{R}^{d}\rightarrow\mathbb{R}, we say L is (l,M)-good if it satisfies the following conditions:

  • ll-local Minimum. We define l>0l>0 to be a positive scalar. If there exists a vector xdx^{*}\in\mathbb{R}^{d} such that the following holds

    • L(x)=𝟎d\nabla L(x^{*})={\bf 0}_{d}.

    • 2L(x)lId\nabla^{2}L(x^{*})\succeq l\cdot I_{d}.

  • Hessian is MM-Lipschitz. If there exists a positive scalar M>0M>0 such that

    2L(y)2L(x)Myx2\displaystyle\|\nabla^{2}L(y)-\nabla^{2}L(x)\|\leq M\cdot\|y-x\|_{2}
  • Good Initialization Point. Let x0x_{0} denote the initialization point. If r0:=x0x2r_{0}:=\|x_{0}-x_{*}\|_{2} satisfies

    r0M0.1l.\displaystyle r_{0}M\leq 0.1l.

5.2 Approximate of Hessian and Update Rule

In this section, we demonstrate the concept of approximate update and present an induction hypothesis.

Definition 5.2 (Approximated Hessian).

Let ϵ0(0,0.1)\epsilon_{0}\in(0,0.1) be a parameter.

For any Hessian H(xt)d×dH(x_{t})\in\mathbb{R}^{d\times d}, we define the approximated Hessian H~(xt)d×d\widetilde{H}(x_{t})\in\mathbb{R}^{d\times d} to be a matrix such that the following holds,

(1ϵ0)H(xt)H~(xt)(1+ϵ0)H(xt).\displaystyle(1-\epsilon_{0})\cdot H(x_{t})\preceq\widetilde{H}(x_{t})\preceq(1+\epsilon_{0})\cdot H(x_{t}).

In order to get the approximated Hessian H~(xt)\widetilde{H}(x_{t}) efficiently, here we state a standard tool (see Lemma 4.5 in [16]).

Lemma 5.3 ([16, 40]).

Let ϵ0=0.01\epsilon_{0}=0.01 be a constant precision parameter. Let δ(0,1)\delta\in(0,1) denote failure probability. Let An×dA\in\mathbb{R}^{n\times d} be a real matrix, then for any positive diagonal (PD) matrix Dn×nD\in\mathbb{R}^{n\times n}, there exists an algorithm which runs in time

O((nnz(A)+dω)poly(log(n/δ)))\displaystyle O((\operatorname{nnz}(A)+d^{\omega})\operatorname{poly}(\log(n/\delta)))

and it outputs a diagonal matrix \widetilde{D}\in\mathbb{R}^{n\times n} with O(d\log(n/\delta)) nonzero entries for which

(1ϵ0)ADAAD~A(1+ϵ0)ADA.\displaystyle(1-\epsilon_{0})A^{\top}DA\preceq A^{\top}\widetilde{D}A\preceq(1+\epsilon_{0})A^{\top}DA.

Note that \omega denotes the exponent of matrix multiplication; currently \omega\approx 2.373 [46, 25, 5]. Here \operatorname{nnz}(A) denotes the number of nonzero entries in A.
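For intuition, the following simplified sketch (ours; it is not the algorithm of [16, 40]) produces a sparse diagonal \widetilde{D} by sampling rows of D^{1/2}A proportionally to their exact leverage scores and reweighting. Empirically this yields the same kind of (1\pm\epsilon_{0}) spectral approximation, whereas the cited algorithm only approximates the scores and attains the stated running time.

```python
import numpy as np

def sparsify_diagonal(A, D_diag, m, rng):
    """Return a sparse nonnegative diagonal D~ (as a 1-D array) such that
    A^T D~ A is close to A^T D A, via leverage-score row sampling."""
    B = np.sqrt(D_diag)[:, None] * A                  # B = D^{1/2} A
    G_inv = np.linalg.pinv(B.T @ B)
    tau = np.einsum("ij,jk,ik->i", B, G_inv, B)       # exact leverage scores of B
    p = tau / tau.sum()
    counts = rng.multinomial(m, p)                    # m i.i.d. draws from p
    D_tilde = np.zeros_like(D_diag)
    nz = counts > 0
    D_tilde[nz] = D_diag[nz] * counts[nz] / (m * p[nz])
    return D_tilde

rng = np.random.default_rng(5)
n, d = 5000, 10
A = rng.standard_normal((n, d))
D = rng.uniform(0.5, 2.0, n)                          # positive diagonal
D_t = sparsify_diagonal(A, D, m=1000, rng=rng)

H = A.T @ (D[:, None] * A)
H_t = A.T @ (D_t[:, None] * A)
vals, vecs = np.linalg.eigh(H)
H_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
rel = np.linalg.eigvalsh(H_inv_half @ H_t @ H_inv_half)
print(np.count_nonzero(D_t), rel.min(), rel.max())    # few nonzeros, eigenvalues near 1
```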

Definition 5.4 (Approximate update).

We consider the following process

xt+1=xtH~(xt)1g(xt).\displaystyle x_{t+1}=x_{t}-\widetilde{H}(x_{t})^{-1}\cdot g(x_{t}).

We state a tool from prior work,

Lemma 5.5 (Iterative shrinking, Lemma 6.9 on page 32 of [29]).

If the following conditions hold

  • Loss Function LL is (l,M)(l,M)-good (see Definition 5.1).

  • Let ϵ0(0,0.1)\epsilon_{0}\in(0,0.1) (see Definition 5.2).

  • Let xx^{*} be defined in Definition 5.1.

  • Let xtx_{t} be defined in Definition 5.4.

  • Let rt:=xtx2r_{t}:=\|x_{t}-x^{*}\|_{2}.

  • Let r¯t:=Mrt\overline{r}_{t}:=M\cdot r_{t}

Then we have

rt+12(ϵ0+r¯t/(lr¯t))rt.\displaystyle r_{t+1}\leq 2\cdot(\epsilon_{0}+\overline{r}_{t}/(l-\overline{r}_{t}))\cdot r_{t}.
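As a quick sanity check of the constants: with \epsilon_{0}=0.01 and M\cdot r_{t}\leq 0.1l (so that \overline{r}_{t}\leq 0.1l), the shrinking factor in Lemma 5.5 is at most 2\cdot(0.01+0.1l/(l-0.1l))=2\cdot(0.01+1/9)\approx 0.25\leq 0.4, which matches the contraction rate used in the induction hypothesis (Lemma 5.6) below.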

Here, T represents the total number of iterations of the algorithm. To apply Lemma 5.5, we will need the following induction hypothesis lemma, which is standard in the literature; see [29].

Lemma 5.6 (Induction hypothesis, Lemma 6.10 on page 34 of [29]).

If the following conditions hold

  • For each i[t]i\in[t], we define ri:=xix2r_{i}:=\|x_{i}-x^{*}\|_{2}

  • Let xix_{i} be defined in Definition 5.4.

  • Let xx^{*} be defined in Definition 5.1.

  • ϵ0=0.01\epsilon_{0}=0.01 (see Definition 5.2 for ϵ0\epsilon_{0})

  • ri0.4ri1r_{i}\leq 0.4\cdot r_{i-1}, for all i[t]i\in[t]

  • Mri0.1lM\cdot r_{i}\leq 0.1l, for all i[t]i\in[t] (see Definition 5.1 for MM and ll)

Then we have

  • rt+10.4rtr_{t+1}\leq 0.4r_{t}

  • Mrt+10.1lM\cdot r_{t+1}\leq 0.1l

6 General Functions: Definitions

In this section, we present the definitions of the important functions which we use in the later sections.

Definition 6.1.

Let u(x)u(x) be one of the following

  • Case 1. u(x)=exp(Ax)u(x)=\exp(Ax)

  • Case 2. u(x)=cosh(Ax)u(x)=\cosh(Ax)

  • Case 3. u(x)=sinh(Ax)u(x)=\sinh(Ax)

We define a helpful function

Definition 6.2.

Let v(x)v(x) be one of the following

  • Case 1. v(x)=exp(Ax)v(x)=\exp(Ax) (when u(x)=exp(Ax)u(x)=\exp(Ax))

  • Case 2. v(x)=sinh(Ax)v(x)=\sinh(Ax) (when u(x)=cosh(Ax)u(x)=\cosh(Ax))

  • Case 3. v(x)=cosh(Ax)v(x)=\cosh(Ax) (when u(x)=sinh(Ax)u(x)=\sinh(Ax))

Definition 6.3 (Loss function LuL_{u}).

Given a matrix An×dA\in\mathbb{R}^{n\times d} and a vector bnb\in\mathbb{R}^{n}. We define loss function Lu:dL_{u}:\mathbb{R}^{d}\rightarrow\mathbb{R} as follows

Lu(x):=0.5u(x)u(x),𝟏nb22.\displaystyle L_{u}(x):=0.5\cdot\|u(x)-\langle u(x),{\bf 1}_{n}\rangle\cdot b\|_{2}^{2}.

For convenience, we define two helpful notations, \alpha and c.

Definition 6.4 (Rescaled coefficients).

Given a matrix An×dA\in\mathbb{R}^{n\times d}.

We define α:d\alpha:\mathbb{R}^{d}\rightarrow\mathbb{R} as follows

α(x):=u(x),𝟏n.\displaystyle\alpha(x):=\langle u(x),{\bf 1}_{n}\rangle.

Then, the Lu(x)L_{u}(x) (see Definition 6.3) can be rewritten as follows

  • Lu(x)=0.5u(x)bα(x)22L_{u}(x)=0.5\cdot\|u(x)-b\cdot\alpha(x)\|_{2}^{2}.

Definition 6.5.

Given a matrix An×dA\in\mathbb{R}^{n\times d} and a vector bnb\in\mathbb{R}^{n}.

Let α(x)\alpha(x) be defined as in Definition 6.4.

We define function c:dnc:\mathbb{R}^{d}\rightarrow\mathbb{R}^{n} as follows

c(x):=u(x)bα(x).\displaystyle c(x):=u(x)-b\cdot\alpha(x).

Then, we can rewrite Lu(x)L_{u}(x) (see Definition 6.3) as follows

  • Lu(x)=0.5c(x)22L_{u}(x)=0.5\cdot\|c(x)\|_{2}^{2}.

7 General Function: Gradient and Hessian Computations

In Section 7.1, we compute the gradients of u(x)u(x), α(x)\alpha(x), c(x)c(x), and Lu(x)L_{u}(x). In Section 7.2, we present the second-order derivatives of u(x)u(x) with respect to xi2x_{i}^{2} and xixjx_{i}x_{j}, where xix_{i} and xjx_{j} are two arbitrary entries of the vector xdx\in\mathbb{R}^{d}. In Section 7.3, we present the second-order derivatives of α(x)\alpha(x) with respect to xi2x_{i}^{2} and xixjx_{i}x_{j}, where xix_{i} and xjx_{j} are two arbitrary entries of the vector xdx\in\mathbb{R}^{d}. In Section 7.4, we compute the second-order derivatives of c(x)c(x) with respect to xi2x_{i}^{2} and xixjx_{i}x_{j}. Finally, in Section 7.5, we compute the second-order derivatives of Lu(x)L_{u}(x) with respect to xi2x_{i}^{2} and xixjx_{i}x_{j}.

7.1 Gradient Computations

In this section, we compute the gradients of u(x)u(x), α(x)\alpha(x), c(x)c(x), and Lu(x)L_{u}(x), namely their first-order derivative with respect to xix_{i}.

Lemma 7.1 (Gradient).

If the following conditions hold

  • Let An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}.

  • For all i[d]i\in[d], A,inA_{*,i}\in\mathbb{R}^{n} denotes the ii-th column of matrix An×dA\in\mathbb{R}^{n\times d}.

  • Let u(x)u(x) be defined in Definition 6.1.

  • Let v(x)v(x) be defined in Definition 6.2.

  • Let α(x)\alpha(x) be defined in Definition 6.4.

  • Let c(x)c(x) be defined in Definition 6.5.

  • Let Lu(x)L_{u}(x) be defined in Definition 6.3

Then, for each i[d]i\in[d], we have

  • Part 1. (see Part 1 in Lemma 5.6 in page 11 in [13])

    du(x)dxi=v(x)A,i\displaystyle\frac{\mathrm{d}u(x)}{\mathrm{d}x_{i}}=v(x)\circ A_{*,i}
  • Part 2. (see Part 2 in Lemma 5.6 in page 11 in [13])

    dα(x)dxi=v(x),A,i\displaystyle\frac{\mathrm{d}\alpha(x)}{\mathrm{d}x_{i}}=\langle v(x),A_{*,i}\rangle
  • Part 3.

    dc(x)dxi=v(x)A,ibv(x),A,i\displaystyle\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}}=v(x)\circ A_{*,i}-b\cdot\langle v(x),A_{*,i}\rangle
  • Part 4.

    dLu(x)dxi=A,i(c(x)v(x)v(x)b,c(x))\displaystyle\frac{\mathrm{d}L_{u}(x)}{\mathrm{d}x_{i}}=A_{*,i}^{\top}(c(x)\circ v(x)-v(x)\langle b,c(x)\rangle)
Proof.

Proof of Part 1. For each j[n]j\in[n], we have

d(u(x))jdxi=\displaystyle\frac{\mathrm{d}(u(x))_{j}}{\mathrm{d}x_{i}}= v(x)jd(Ax)jdxi\displaystyle~{}v(x)_{j}\cdot\frac{\mathrm{d}(Ax)_{j}}{\mathrm{d}x_{i}}
=\displaystyle= v(x)j(Adx)jdxi\displaystyle~{}v(x)_{j}\cdot\frac{(A\mathrm{d}x)_{j}}{\mathrm{d}x_{i}}
=\displaystyle= v(x)jAj,i\displaystyle~{}v(x)_{j}\cdot{A_{j,i}}

where the first step follows from chain rule (Fact 4.4), the second step follows from basic chain rule (Fact 4.4), the third step follows from basic calculus rule (Fact 4.4).

Since the above equation is true for all j[n]j\in[n], we have

du(x)dxin×1=[du(x)1dxidu(x)2dxidu(x)ndxi]=v(x)n×1A,in×1\displaystyle\underbrace{\frac{\mathrm{d}u(x)}{\mathrm{d}x_{i}}}_{n\times 1}=\begin{bmatrix}\frac{\mathrm{d}u(x)_{1}}{\mathrm{d}x_{i}}\\ \frac{\mathrm{d}u(x)_{2}}{\mathrm{d}x_{i}}\\ \vdots\\ \frac{\mathrm{d}u(x)_{n}}{\mathrm{d}x_{i}}\end{bmatrix}=\underbrace{v(x)}_{n\times 1}\circ\underbrace{A_{*,i}}_{n\times 1}

Proof of Part 2. It trivially follows from arguments in Part 1.

Proof of Part 3.

dc(x)dxi=\displaystyle\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}}= ddxi(u(x)bα(x))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}(u(x)-b\cdot\alpha(x))
=\displaystyle= v(x)A,ibv(x),A,i\displaystyle~{}v(x)\circ A_{*,i}-b\cdot\langle v(x),A_{*,i}\rangle

where the first step is due to the definition of c(x)c(x) (see Definition 6.5), the second step follows from Part 1 and Part 2.

Proof of Part 4.

dLu(x)dxi=\displaystyle\frac{\mathrm{d}L_{u}(x)}{\mathrm{d}x_{i}}= ddxi(0.5c(x)22)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}(0.5\cdot\|c(x)\|_{2}^{2})
=\displaystyle= c(x)ddxic(x)\displaystyle~{}c(x)^{\top}\frac{\mathrm{d}}{\mathrm{d}x_{i}}c(x)
=\displaystyle= c(x)(v(x)A,ibv(x),A,i)\displaystyle~{}c(x)^{\top}(v(x)\circ A_{*,i}-b\cdot\langle v(x),A_{*,i}\rangle)
=\displaystyle= A,i(c(x)v(x))A,iv(x)b,c(x)\displaystyle~{}A_{*,i}^{\top}(c(x)\circ v(x))-A_{*,i}^{\top}v(x)\langle b,c(x)\rangle
=\displaystyle= A,i(c(x)v(x)v(x)b,c(x))\displaystyle~{}A_{*,i}^{\top}(c(x)\circ v(x)-v(x)\langle b,c(x)\rangle)

where the first step is due to the definition of Lu(x)L_{u}(x) (see Definition 6.3), the second step follows from chain rule (Fact 4.4), the third step is due to Part 3, the fourth step is obtained by using Fact 4.1, and the fifth step follows from simple algebra. ∎
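The following finite-difference sketch (purely illustrative; the helper names are ours) numerically verifies the gradient formula in Part 4 of Lemma 7.1 for all three choices of u.

```python
import numpy as np

def parts(A, b, x, kind):
    """Return (u(x), v(x), c(x)) for the chosen u in {exp, cosh, sinh}."""
    z = A @ x
    u = {"exp": np.exp, "cosh": np.cosh, "sinh": np.sinh}[kind](z)
    v = {"exp": np.exp, "cosh": np.sinh, "sinh": np.cosh}[kind](z)
    c = u - b * u.sum()                         # c(x) = u(x) - b * alpha(x)
    return u, v, c

def L_u(A, b, x, kind):
    _, _, c = parts(A, b, x, kind)
    return 0.5 * np.dot(c, c)

rng = np.random.default_rng(6)
n, d = 6, 3
A = 0.2 * rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)
eps = 1e-6

for kind in ("exp", "cosh", "sinh"):
    _, v, c = parts(A, b, x, kind)
    # Part 4: dL_u/dx_i = A_{*,i}^T ( c(x) o v(x) - v(x) <b, c(x)> )
    analytic = np.array([A[:, i] @ (c * v - v * np.dot(b, c)) for i in range(d)])
    numeric = np.array([(L_u(A, b, x + eps * e, kind) - L_u(A, b, x - eps * e, kind)) / (2 * eps)
                        for e in np.eye(d)])
    assert np.allclose(analytic, numeric, atol=1e-4)
```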

7.2 Hessian Computation Step 1. Vector Function u(x)

We state a tool from previous work [29, 13].

Lemma 7.2 (Lemma 5.9 in [13] and implicitly in [29]).

If the following conditions hold

  • Let An×dA\in\mathbb{R}^{n\times d}

  • Let xdx\in\mathbb{R}^{d}.

We have

  • Part 1.

    d2u(x)dxi2=A,iu(x)A,i\displaystyle\frac{\mathrm{d}^{2}u(x)}{\mathrm{d}x_{i}^{2}}=A_{*,i}\circ u(x)\circ A_{*,i}
  • Part 2.

    d2u(x)dxidxj=A,ju(x)A,i.\displaystyle\frac{\mathrm{d}^{2}u(x)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,j}\circ u(x)\circ A_{*,i}.

7.3 Hessian Computation Step 2. Scalar Function α(x)\alpha(x)

We state a tool from previous work [29, 13].

Lemma 7.3 (Lemma 5.10 in [13], implicitly in [29]).

If the following conditions hold

  • Let α(x)\alpha(x) be defined as Definition 6.4.

  • Let u(x)u(x) be defined as in Definition 6.1.

  • Let An×dA\in\mathbb{R}^{n\times d}.

  • Let xdx\in\mathbb{R}^{d}.

Then, we have

  • Part 1.

    d2α(x)dxi2=u(x),A,iA,i\displaystyle\frac{\mathrm{d}^{2}\alpha(x)}{\mathrm{d}x_{i}^{2}}=\langle u(x),A_{*,i}\circ A_{*,i}\rangle
  • Part 2.

    d2α(x)dxidxj=u(x),A,iA,j\displaystyle\frac{\mathrm{d}^{2}\alpha(x)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=\langle u(x),A_{*,i}\circ A_{*,j}\rangle

7.4 Hessian Computation Step 3. Vector Function c(x)c(x)

Now, we compute the second-order derivative of c(x)c(x) with respect to xi2x_{i}^{2} and xixjx_{i}x_{j}.

Lemma 7.4.

If the following conditions hold

  • Let c(x)c(x) be defined as Definition 6.5.

  • Let An×dA\in\mathbb{R}^{n\times d}.

  • Let xdx\in\mathbb{R}^{d}.

  • Let bnb\in\mathbb{R}^{n}.

Then, we have

  • Part 1.

    d2c(x)dxi2=A,iu(x)A,ibu(x),A,iA,i\displaystyle\frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x_{i}^{2}}=A_{*,i}\circ u(x)\circ A_{*,i}-b\cdot\langle u(x),A_{*,i}\circ A_{*,i}\rangle
  • Part 2.

    d2c(x)dxidxj=A,iu(x)A,jbu(x),A,iA,j\displaystyle\frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,i}\circ u(x)\circ A_{*,j}-b\cdot\langle u(x),A_{*,i}\circ A_{*,j}\rangle
Proof.

Proof of Part 1.

d2c(x)dxi2=\displaystyle\frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x_{i}^{2}}= d2dxi2(u(x)bα(x))\displaystyle~{}\frac{\mathrm{d}^{2}}{\mathrm{d}x_{i}^{2}}(u(x)-b\cdot\alpha(x))
=\displaystyle= A,iu(x)A,ibu(x),A,iA,i\displaystyle~{}A_{*,i}\circ u(x)\circ A_{*,i}-b\cdot\langle u(x),A_{*,i}\circ A_{*,i}\rangle

where the first step follows from definition of c(x)c(x) (see Definition 6.5), the second step follows from Lemma 7.2 and Lemma 7.3.

Proof of Part 2.

d2c(x)dxidxj=\displaystyle\frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= d2dxidxj(u(x)bα(x))\displaystyle~{}\frac{\mathrm{d}^{2}}{\mathrm{d}x_{i}\mathrm{d}x_{j}}(u(x)-b\cdot\alpha(x))
=\displaystyle= A,iu(x)A,jbu(x),A,iA,j\displaystyle~{}A_{*,i}\circ u(x)\circ A_{*,j}-b\cdot\langle u(x),A_{*,i}\circ A_{*,j}\rangle

where the first step follows from definition of c(x)c(x) (see Definition 6.5), the second step follows from Lemma 7.2 and Lemma 7.3. ∎

7.5 Hessian Computation Step 4. Scalar Function Lu(x)L_{u}(x)

Then, we compute the second-order derivative of Lu(x)L_{u}(x) with respect to xi2x_{i}^{2} and xixjx_{i}x_{j}, by first introducing some functions, B1,1,B1,2,B1,3,B1,4,B2,1,B2,2B_{1,1},B_{1,2},B_{1,3},B_{1,4},B_{2,1},B_{2,2} (see Definition 7.5), to simplify the process of computation.

Definition 7.5.

Given the following objects

  • Let An×dA\in\mathbb{R}^{n\times d}.

  • Let xdx\in\mathbb{R}^{d}.

  • Let bnb\in\mathbb{R}^{n}.

Then, we define the functions B1,1,B1,2,B1,3,B1,4,B2,1,B2,2:dn×nB_{1,1},B_{1,2},B_{1,3},B_{1,4},B_{2,1},B_{2,2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{n\times n} as

B1,1(x):=\displaystyle B_{1,1}(x):= diag(v(x)v(x))\displaystyle~{}\operatorname{diag}(v(x)\circ v(x))
B1,2(x):=\displaystyle B_{1,2}(x):= (v(x)b)v(x)\displaystyle~{}-(v(x)\circ b)\cdot v(x)^{\top}
B1,3(x):=\displaystyle B_{1,3}(x):= v(x)(v(x)b)\displaystyle~{}-v(x)\cdot(v(x)\circ b)^{\top}
B1,4(x):=\displaystyle B_{1,4}(x):= b22v(x)v(x)\displaystyle~{}\|b\|_{2}^{2}\cdot v(x)v(x)^{\top}

We define

B2,1(x):=\displaystyle B_{2,1}(x):= diag(c(x)u(x))\displaystyle~{}\operatorname{diag}(c(x)\circ u(x))
B2,2(x):=\displaystyle B_{2,2}(x):= c(x),bdiag(u(x))\displaystyle~{}-\langle c(x),b\rangle\operatorname{diag}(u(x))

We define B:dn×nB:\mathbb{R}^{d}\rightarrow\mathbb{R}^{n\times n} as follows:

B(x):=\displaystyle B(x):= B1,1(x)+B1,2(x)+B1,3(x)+B1,4(x)\displaystyle~{}B_{1,1}(x)+B_{1,2}(x)+B_{1,3}(x)+B_{1,4}(x)
+B2,1(x)+B2,2(x)\displaystyle~{}+B_{2,1}(x)+B_{2,2}(x)
Lemma 7.6.

If the following conditions hold

  • Let B(x)B(x) be defined as in Definition 7.5.

  • Let An×dA\in\mathbb{R}^{n\times d}.

  • Let Lu(x)L_{u}(x) be defined as in Definition 6.3.

Then, we have

  • Part 1.

    d2Lu(x)dxi2=A,iBA,i\displaystyle\frac{\mathrm{d}^{2}L_{u}(x)}{\mathrm{d}x_{i}^{2}}=A_{*,i}^{\top}BA_{*,i}
  • Part 2.

    d2Lu(x)dxidxj=A,iBA,j\displaystyle\frac{\mathrm{d}^{2}L_{u}(x)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,i}^{\top}BA_{*,j}
Proof.

Proof of Part 1.

d2Lu(x)dxi2=\displaystyle\frac{\mathrm{d}^{2}L_{u}(x)}{\mathrm{d}x_{i}^{2}}= ddxi(dLu(x)dxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}(\frac{\mathrm{d}L_{u}(x)}{\mathrm{d}x_{i}})
=\displaystyle= ddxi(c(x)dc(x)dxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}(c(x)^{\top}\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}})
=\displaystyle= dc(x)dxi,dc(x)dxi+c(x),d2c(x)dxi2\displaystyle~{}\langle\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}},\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}}\rangle+\langle c(x),\frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x_{i}^{2}}\rangle

where the first step follows from simple algebra, the second step follows from basic chain rule (see Fact 4.4), and the last step follows from basic calculus.

For the first term, in the above equation, we have

dc(x)dxi,dc(x)dxi=\displaystyle\langle\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}},\frac{\mathrm{d}c(x)}{\mathrm{d}x_{i}}\rangle= v(x)A,ibv(x),A,i22\displaystyle~{}\|v(x)\circ A_{*,i}-b\cdot\langle v(x),A_{*,i}\rangle\|_{2}^{2}
=\displaystyle= A,idiag(v(x)v(x))A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(v(x)\circ v(x))A_{*,i}
A,i(v(x)b)v(x)A,i\displaystyle~{}-A_{*,i}^{\top}(v(x)\circ b)v(x)^{\top}A_{*,i}
A,i(v(x))(v(x)b)A,i\displaystyle~{}-A_{*,i}^{\top}(v(x))(v(x)\circ b)^{\top}A_{*,i}
+A,ib22v(x)v(x)A,i\displaystyle~{}+A_{*,i}^{\top}\|b\|_{2}^{2}v(x)v(x)^{\top}A_{*,i}
=\displaystyle= A,i(B1,1(x)+B1,2(x)+B1,3(x)+B1,4(x))A,i\displaystyle~{}A_{*,i}^{\top}(B_{1,1}(x)+B_{1,2}(x)+B_{1,3}(x)+B_{1,4}(x))A_{*,i}

where the first step is due to Part 3 of Lemma 7.1, the second step follows from Fact 4.2, and the last step follows from the definition of B1,i(x)B_{1,i}(x) for each i[4]i\in[4] (see Definition 7.5).

For the second term, we have

c(x),d2c(x)dxi2=\displaystyle\langle c(x),\frac{\mathrm{d}^{2}c(x)}{\mathrm{d}x_{i}^{2}}\rangle= c(x),A,iu(x)A,ibu(x),A,iA,i\displaystyle~{}\langle c(x),A_{*,i}\circ u(x)\circ A_{*,i}-b\cdot\langle u(x),A_{*,i}\circ A_{*,i}\rangle\rangle
=\displaystyle= A,idiag(c(x)u(x))A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(c(x)\circ u(x))A_{*,i}
A,ic(x),bdiag(u(x))A,i\displaystyle~{}-A_{*,i}^{\top}\langle c(x),b\rangle\operatorname{diag}(u(x))A_{*,i}
=\displaystyle= A,i(B2,1(x)+B2,2(x))A,i\displaystyle~{}A_{*,i}^{\top}(B_{2,1}(x)+B_{2,2}(x))A_{*,i}

where the first step is due to Part 1 of Lemma 7.4, the second step follows from Fact 4.2, and the last step follows from B2,iB_{2,i} for all i[2]i\in[2] (see Definition 7.5).

Thus, we finally have

d2Lu(x)dxi2=A,iB(x)A,i\displaystyle\frac{\mathrm{d}^{2}L_{u}(x)}{\mathrm{d}x_{i}^{2}}=A_{*,i}^{\top}B(x)A_{*,i}

Proof of Part 2.

The proof is similar, and we omit the details here. ∎
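Analogously, the following sketch (purely illustrative; helper names are ours) assembles B(x) from Definition 7.5 and checks the identity of Lemma 7.6, namely that the Hessian of L_{u} equals A^{\top}B(x)A, against second-order finite differences.

```python
import numpy as np

def parts(A, b, x, kind):
    z = A @ x
    u = {"exp": np.exp, "cosh": np.cosh, "sinh": np.sinh}[kind](z)
    v = {"exp": np.exp, "cosh": np.sinh, "sinh": np.cosh}[kind](z)
    c = u - b * u.sum()
    return u, v, c

def L_u(A, b, x, kind):
    _, _, c = parts(A, b, x, kind)
    return 0.5 * np.dot(c, c)

rng = np.random.default_rng(7)
n, d = 5, 3
A = 0.2 * rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)
eps = 1e-4

for kind in ("exp", "cosh", "sinh"):
    u, v, c = parts(A, b, x, kind)
    # B(x) = B_{1,1} + B_{1,2} + B_{1,3} + B_{1,4} + B_{2,1} + B_{2,2} (Definition 7.5)
    B = (np.diag(v * v)
         - np.outer(v * b, v) - np.outer(v, v * b) + np.dot(b, b) * np.outer(v, v)
         + np.diag(c * u) - np.dot(c, b) * np.diag(u))
    analytic = A.T @ B @ A
    numeric = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * eps, np.eye(d)[j] * eps
            numeric[i, j] = (L_u(A, b, x + ei + ej, kind) - L_u(A, b, x + ei - ej, kind)
                             - L_u(A, b, x - ei + ej, kind) + L_u(A, b, x - ei - ej, kind)) / (4 * eps ** 2)
    assert np.allclose(analytic, numeric, atol=1e-3)
```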

8 General Function: PSD Lower Bound

In Section 8.1, we provide the upper bound for the 2\ell_{2} norms of u(x),v(x),c(x)nu(x),v(x),c(x)\in\mathbb{R}^{n} and for the absolute value of α(x)\alpha(x)\in\mathbb{R}. In Section 8.2, we compute both the upper bound and the lower bound of B(x)B(x) in terms of \preceq. In Section 8.3, we analyze the lower bound of Hessian.

8.1 Upper Bound for Several Basic Quantities

In this section, we compute the bounds for the 2\ell_{2} norms of the vectors u(x),v(x),c(x)nu(x),v(x),c(x)\in\mathbb{R}^{n} and compute the bound for the absolute value of α(x)\alpha(x).

Claim 8.1.

If the following conditions hold

  • Let R2R\geq 2.

  • AR\|A\|\leq R

  • x2R\|x\|_{2}\leq R

  • b2R\|b\|_{2}\leq R

  • Let u(x)nu(x)\in\mathbb{R}^{n} be defined as Definition 6.1.

  • Let v(x)nv(x)\in\mathbb{R}^{n} be defined as Definition 6.2.

  • Let α(x)\alpha(x)\in\mathbb{R} be defined as Definition 6.4.

  • Let c(x)nc(x)\in\mathbb{R}^{n} be defined as in Definition 6.5.

Then, we have

  • Part 1. (see [29, 13, 27])

    u(x)2\displaystyle\|u(x)\|_{2}\leq nexp(R2)\displaystyle~{}\sqrt{n}\exp(R^{2})
    v(x)2\displaystyle\|v(x)\|_{2}\leq nexp(R2)\displaystyle~{}\sqrt{n}\exp(R^{2})
  • Part 2.

    |α(x)|nexp(R2)\displaystyle|\alpha(x)|\leq n\exp(R^{2})
  • Part 3.

    \|c(x)\|_{2}\leq 2nR\exp(R^{2})
Proof.

Proof of Part 1. The proof is standard in the literature, and we omit the details here.

Proof of Part 2. We can show

|α(x)|=\displaystyle|\alpha(x)|= |u(x),𝟏n|\displaystyle~{}|\langle u(x),{\bf 1}_{n}\rangle|
\displaystyle\leq nu(x)2\displaystyle~{}\sqrt{n}\cdot\|u(x)\|_{2}
\displaystyle\leq nexp(R2)\displaystyle~{}n\cdot\exp(R^{2})

where the first step follows from definition of α(x)\alpha(x) (see Definition 6.4), the second step follows from Fact 4.5, the third step follows from Part 1.

Proof of Part 3. We can show

c(x)2=\displaystyle\|c(x)\|_{2}= u(x)α(x)b2\displaystyle~{}\|u(x)-\alpha(x)b\|_{2}
\displaystyle\leq u(x)2+α(x)b2\displaystyle~{}\|u(x)\|_{2}+\|\alpha(x)b\|_{2}
=\displaystyle= nexp(R2)+|α(x)|b2\displaystyle~{}\sqrt{n}\exp(R^{2})+|\alpha(x)|\cdot\|b\|_{2}
\displaystyle\leq nexp(R2)+|α(x)|R\displaystyle~{}\sqrt{n}\exp(R^{2})+|\alpha(x)|\cdot R
\displaystyle\leq nexp(R2)+nRexp(R2)\displaystyle~{}\sqrt{n}\exp(R^{2})+nR\exp(R^{2})
\displaystyle\leq 2nRexp(R2),\displaystyle~{}2nR\exp(R^{2}),

where the first step comes from the definition of c(x)c(x) (see Definition 6.5), the second step follows from the triangle inequality, the third step is because of Part 1, the fourth step follows from the assumption on bb, the fifth step follows from Part 2, and the last step follows from simple algebra.

8.2 PSD Bounds for Several Basic Matrix Functions

In this section, we first define the matrices B_{\operatorname{rank}}^{1},B_{\operatorname{rank}}^{2},B_{\operatorname{rank}}^{3},B_{\operatorname{diag}}^{1},B_{\operatorname{diag}}^{2}\in\mathbb{R}^{n\times n} and establish \preceq-bounds for them. Then, we combine these bounds to bound B(x)\in\mathbb{R}^{n\times n}.

Definition 8.2.

Given the following objects

  • Let u(x) be defined as in Definition 6.1 and v(x) as in Definition 6.2.

  • Let c(x) be defined as in Definition 6.5.

  • Let bnb\in\mathbb{R}^{n}.

We define B:dn×nB:\mathbb{R}^{d}\rightarrow\mathbb{R}^{n\times n} as follows:

B(x):=Brank+Bdiag\displaystyle B(x):=B_{\operatorname{rank}}+B_{\operatorname{diag}}

We define

Brank:=\displaystyle B_{\operatorname{rank}}:= Brank1+Brank2+Brank3\displaystyle~{}B_{\operatorname{rank}}^{1}+B_{\operatorname{rank}}^{2}+B_{\operatorname{rank}}^{3}
Bdiag:=\displaystyle B_{\operatorname{diag}}:= Bdiag1+Bdiag2\displaystyle~{}B_{\operatorname{diag}}^{1}+B_{\operatorname{diag}}^{2}

We define

  • Brank1:=u(x)(u(x)b)B_{\operatorname{rank}}^{1}:=-u(x)(u(x)\circ b)^{\top}

  • Brank2:=(u(x)b)u(x)B_{\operatorname{rank}}^{2}:=-(u(x)\circ b)u(x)^{\top}

  • Brank3:=b22u(x)u(x)B_{\operatorname{rank}}^{3}:=\|b\|_{2}^{2}u(x)u(x)^{\top}

  • Bdiag1:=diag((u(x)+c(x))u(x)+q)B_{\operatorname{diag}}^{1}:=\operatorname{diag}((u(x)+c(x))\circ u(x)+q)

    • q=𝟎nq={\bf 0}_{n} (when u(x)=exp(Ax)u(x)=\exp(Ax))

    • q=𝟏nq=-{\bf 1}_{n} (when u(x)=cosh(Ax)u(x)=\cosh(Ax))

    • q=𝟏nq={\bf 1}_{n} (when u(x)=sinh(Ax)u(x)=\sinh(Ax))

  • Bdiag2:=c(x),bdiag(u(x))B_{\operatorname{diag}}^{2}:=-\langle c(x),b\rangle\operatorname{diag}(u(x))

Lemma 8.3.

If the following situations hold

  • B(x)B(x) is a n×nn\times n dimension matrix (See Definition 8.2).

  • Brank1B_{\operatorname{rank}}^{1}, Brank2B_{\operatorname{rank}}^{2}, Brank3B_{\operatorname{rank}}^{3} are defined in Definition 8.2.

  • Bdiag1B_{\operatorname{diag}}^{1}, Bdiag2B_{\operatorname{diag}}^{2} are defined in Definition 8.2.

Then, we have

  • Part 1.

    b2v(x)v(x)Brank1b2v(x)v(x)\displaystyle-\|b\|_{2}v(x)v(x)^{\top}\preceq B_{\operatorname{rank}}^{1}\preceq\|b\|_{2}v(x)v(x)^{\top}
  • Part 2.

    b2v(x)v(x)Brank2b2v(x)v(x)\displaystyle-\|b\|_{2}v(x)v(x)^{\top}\preceq B_{\operatorname{rank}}^{2}\preceq\|b\|_{2}v(x)v(x)^{\top}
  • Part 3.

    Brank3=b22v(x)v(x)\displaystyle B_{\operatorname{rank}}^{3}=\|b\|_{2}^{2}v(x)v(x)^{\top}
  • Part 4.

    (1+(u(x)+c(x))u(x))InBdiag1(1+(u(x)+c(x))u(x))In\displaystyle-(1+(\|u(x)\|_{\infty}+\|c(x)\|_{\infty})\cdot\|u(x)\|_{\infty})\cdot I_{n}\preceq B_{\operatorname{diag}}^{1}\preceq(1+(\|u(x)\|_{\infty}+\|c(x)\|_{\infty})\cdot\|u(x)\|_{\infty})\cdot I_{n}
  • Part 5.

    b2c(x)2u(x)InBdiag2b2c(x)2u(x)In\displaystyle-\|b\|_{2}\|c(x)\|_{2}\|u(x)\|_{\infty}I_{n}\preceq B_{\operatorname{diag}}^{2}\preceq\|b\|_{2}\|c(x)\|_{2}\|u(x)\|_{\infty}I_{n}
  • Part 6.

    • Let R0=max{u(x)2,v(x)2,b2,c(x)2,1}R_{0}=\max\{\|u(x)\|_{2},\|v(x)\|_{2},\|b\|_{2},\|c(x)\|_{2},1\}

    • Then, we have

      10R04InB(x)10R04In\displaystyle-10R_{0}^{4}\cdot I_{n}\preceq B(x)\preceq 10R_{0}^{4}\cdot I_{n}
Proof.

Proof of Part 1. First, we focus on the lower bound of Brank1B_{\operatorname{rank}}^{1}. We have

Brank1=\displaystyle B_{\operatorname{rank}}^{1}= v(x)(v(x)b)\displaystyle~{}-v(x)(v(x)\circ b)^{\top}
\displaystyle\succeq b2v(x)v(x),\displaystyle~{}-\|b\|_{2}\cdot v(x)v(x)^{\top},

where the first step follows from the definition of Brank1B_{\operatorname{rank}}^{1} (see Definition 8.2) and the second step follows from Fact 4.3.

Similarly, we have

Brank1=\displaystyle B_{\operatorname{rank}}^{1}= v(x)(v(x)b)\displaystyle~{}-v(x)(v(x)\circ b)^{\top}
\displaystyle\preceq b2v(x)v(x),\displaystyle~{}\|b\|_{2}\cdot v(x)v(x)^{\top},

where the first step follows from the definition of Brank1B_{\operatorname{rank}}^{1} (see Definition 8.2) and the second step follows from Fact 4.3

Proof of Part 2. By the same argument as in Part 1, we also have

b2v(x)v(x)Brank2b2v(x)v(x)\displaystyle-\|b\|_{2}v(x)v(x)^{\top}\preceq B_{\operatorname{rank}}^{2}\preceq\|b\|_{2}v(x)v(x)^{\top}

Proof of Part 3.

The proof follows directly from the definition of B_{\operatorname{rank}}^{3} (see Definition 8.2). We have

Brank3=b22v(x)v(x)\displaystyle B_{\operatorname{rank}}^{3}=\|b\|_{2}^{2}\cdot v(x)v(x)^{\top}

Proof of Part 4. For each i\in[n], the i-th diagonal entry of B_{\operatorname{diag}}^{1} equals (u(x)_{i}+c(x)_{i})\cdot u(x)_{i}+q_{i} with |q_{i}|\leq 1. Hence, we have

Bdiag1=\displaystyle B_{\operatorname{diag}}^{1}= diag((u(x)+c(x))u(x)+q)\displaystyle~{}\operatorname{diag}((u(x)+c(x))\circ u(x)+q)
\displaystyle\preceq (1+(u(x)+c(x))u(x))In,\displaystyle~{}(1+(\|u(x)\|_{\infty}+\|c(x)\|_{\infty})\|u(x)\|_{\infty})\cdot I_{n},

where the first step is due to the definition of Bdiag1B_{\operatorname{diag}}^{1} (see Definition 8.2) and the second step follows from Fact 4.3.

On the other hand, we have

Bdiag1(1+(u(x)+c(x))u(x))In\displaystyle B_{\operatorname{diag}}^{1}\succeq-(1+(\|u(x)\|_{\infty}+\|c(x)\|_{\infty})\|u(x)\|_{\infty})\cdot I_{n}

Proof of Part 5.

Bdiag2=\displaystyle B_{\operatorname{diag}}^{2}= c(x),bdiag(u(x))\displaystyle~{}-\langle c(x),b\rangle\operatorname{diag}(u(x))
\displaystyle\preceq b2c(x)2diag(u(x))\displaystyle~{}\|b\|_{2}\cdot\|c(x)\|_{2}\cdot\operatorname{diag}(u(x))
\displaystyle\preceq b2c(x)2u(x)In,\displaystyle~{}\|b\|_{2}\cdot\|c(x)\|_{2}\cdot\|u(x)\|_{\infty}\cdot I_{n},

where the first step follows from the definition of Bdiag2B_{\operatorname{diag}}^{2} (see Definition 8.2), the second step follows from Fact 4.3, and the third step follows from Fact 4.3.

Similarly, we have

Bdiag2=\displaystyle B_{\operatorname{diag}}^{2}= c(x),bdiag(u(x))\displaystyle~{}-\langle c(x),b\rangle\operatorname{diag}(u(x))
\displaystyle\succeq b2c(x)2diag(u(x))\displaystyle~{}-\|b\|_{2}\cdot\|c(x)\|_{2}\cdot\operatorname{diag}(u(x))
\displaystyle\succeq b2c(x)2u(x)In,\displaystyle~{}-\|b\|_{2}\cdot\|c(x)\|_{2}\cdot\|u(x)\|_{\infty}\cdot I_{n},

where the first step comes from the definition of Bdiag2B_{\operatorname{diag}}^{2} (see Definition 8.2), the second step follows from Fact 4.3, and the third step follows from Fact 4.3.

Proof of Part 6. Using Fact 4.3, we have

u(x)u(x)^{\top}\preceq\|u(x)\|_{2}^{2}\cdot I_{n}

Combining this with Parts 1-5 and the definition of R_{0}, each of B^{1}_{\operatorname{rank}}, B^{2}_{\operatorname{rank}}, B^{3}_{\operatorname{rank}}, B^{1}_{\operatorname{diag}}, B^{2}_{\operatorname{diag}} lies (in the \preceq sense) between -2R_{0}^{4}\cdot I_{n} and 2R_{0}^{4}\cdot I_{n}. Since B(x) is the sum of these five matrices, adding up the five bounds yields

-10R_{0}^{4}\cdot I_{n}\preceq B(x)\preceq 10R_{0}^{4}\cdot I_{n}

∎
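Since Definition 8.2 builds B(x) only from u(x), c(x), b and q, the decomposition and the bound in Part 6 can be checked numerically. The sketch below is our own illustration for the \exp(Ax) case (so q=0): it compares A^{\top}B(x)A against a finite-difference Hessian of L_{u}(x)=0.5\|u(x)-\langle u(x),{\bf 1}_{n}\rangle b\|_{2}^{2} and verifies the spectral bound. In the exp case we take v(x)=u(x) when forming R_{0}; this identification is an assumption of the sketch, since Definition 6.2 is not restated here.

```python
import numpy as np

# Check numerically (exp case, q = 0) that A^T B(x) A equals the Hessian of
# L_u(x) = 0.5 * ||u(x) - <u(x), 1_n> b||_2^2 and that
# -10 R_0^4 I_n <= B(x) <= 10 R_0^4 I_n (Lemma 8.3, Part 6).
rng = np.random.default_rng(1)
n, d = 5, 3
A = 0.3 * rng.standard_normal((n, d))
b = 0.3 * rng.standard_normal(n)
x = 0.3 * rng.standard_normal(d)

def L_u(z):
    u = np.exp(A @ z)
    return 0.5 * np.linalg.norm(u - u.sum() * b) ** 2

u = np.exp(A @ x)                      # u(x)
c = u - u.sum() * b                    # c(x) = u(x) - alpha(x) b
B = (
    - np.outer(u, u * b)               # B_rank^1
    - np.outer(u * b, u)               # B_rank^2
    + np.dot(b, b) * np.outer(u, u)    # B_rank^3
    + np.diag((u + c) * u)             # B_diag^1 (q = 0 for exp)
    - np.dot(c, b) * np.diag(u)        # B_diag^2
)

# Finite-difference Hessian of L_u.
eps = 1e-4
H = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        ei, ej = np.eye(d)[i], np.eye(d)[j]
        H[i, j] = (L_u(x + eps * ei + eps * ej) - L_u(x + eps * ei - eps * ej)
                   - L_u(x - eps * ei + eps * ej) + L_u(x - eps * ei - eps * ej)
                   ) / (4 * eps ** 2)

assert np.allclose(H, A.T @ B @ A, atol=1e-4)

# R_0 as in Part 6 (with v(x) = u(x) in the exp case).
R0 = max(np.linalg.norm(u), np.linalg.norm(b), np.linalg.norm(c), 1.0)
eigs = np.linalg.eigvalsh((B + B.T) / 2)
assert np.all(np.abs(eigs) <= 10 * R0 ** 4)
print("Hessian decomposition and PSD bound verified on this instance.")
```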

8.3 Lower Bound on Hessian

In this section, we compute the lower bound for the Hessian.

Lemma 8.4.

If conditions as follows are satisfied

  • Let An×dA\in\mathbb{R}^{n\times d}.

  • Let u(x)u(x) be defined as Definition 6.1.

  • Let v(x)v(x) be defined as Definition 6.2.

  • Lu(x)L_{u}(x) is defined in Definition 6.3.

  • Lreg(x)L_{\operatorname{reg}}(x) is defined in Definition 4.9.

  • L(x)=Lreg(x)+Lu(x)L(x)=L_{\operatorname{reg}}(x)+L_{u}(x).

  • Given w\in\mathbb{R}^{n}, let W=\operatorname{diag}(w)\in\mathbb{R}^{n\times n} and let W^{2} denote the diagonal matrix with w_{i}^{2} as its i-th diagonal entry.

  • We use σmin(A)\sigma_{\min}(A) as the minimum singular value of AA.

  • We let l>0 be a scalar.

  • Let R_{0}=\max\{\|u(x)\|_{2},\|v(x)\|_{2},\|b\|_{2},\|c(x)\|_{2},1\}.

Then we have

  • Part 1. If for all i\in[n], w_{i}^{2}\geq 10R_{0}^{4}+l/\sigma_{\min}(A)^{2}, then we have

    d2Ldx2lId\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}\succeq l\cdot I_{d}
  • Part 2. If for all i\in[n], w_{i}^{2}\geq 200R_{0}^{4}+l/\sigma_{\min}(A)^{2}, then we have

    (11/10)(B(x)+W2)W2(1+1/10)(B(x)+W2).\displaystyle(1-1/10)\cdot(B(x)+W^{2})\preceq W^{2}\preceq(1+1/10)\cdot(B(x)+W^{2}).
Proof.

Proof of Part 1. By Lemma 7.6, we have

d2Ludx2=AB(x)A,\displaystyle\frac{\mathrm{d}^{2}L_{u}}{\mathrm{d}x^{2}}=A^{\top}B(x)A, (2)

By Lemma 4.10, we have

d2Lregdx2=AW2A.\displaystyle\frac{\mathrm{d}^{2}L_{\operatorname{reg}}}{\mathrm{d}x^{2}}=A^{\top}W^{2}A. (3)

By what we have in the Lemma statement, we also have

d2Ldx2=d2Ludx2+d2Lregdx2\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}=\frac{\mathrm{d}^{2}L_{u}}{\mathrm{d}x^{2}}+\frac{\mathrm{d}^{2}L_{\operatorname{reg}}}{\mathrm{d}x^{2}} (4)

By combining Eq. (2), Eq. (3), and Eq. (4), we can rewrite the equation above as follows:

d2Ldx2=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}= AB(x)A+AW2A\displaystyle~{}A^{\top}B(x)A+A^{\top}W^{2}A
=\displaystyle= A(B(x)+W2)A,\displaystyle~{}A^{\top}(B(x)+W^{2})A,

where the second step follows from simple algebra.

Now we define

D:=B(x)+W2\displaystyle D:=B(x)+W^{2}

Now we get the bound of DD

D\displaystyle D\succeq 10R04In+wmin2In\displaystyle~{}-10R_{0}^{4}I_{n}+w_{\min}^{2}I_{n}
=\displaystyle= (wmin210R04)In\displaystyle~{}(w^{2}_{\min}-10R_{0}^{4})I_{n}
\displaystyle\succeq lσmin(A)2In,\displaystyle~{}\frac{l}{\sigma_{\min}(A)^{2}}I_{n},

where the first step follows from Part 6 of Lemma 8.3, the second step follows from simple algebra, and the third step is because of the assumption of this part.

Since D is positive definite, we have

ADAσmin(D)σmin(A)2IdlId\displaystyle A^{\top}DA\succeq\sigma_{\min}(D)\cdot\sigma_{\min}(A)^{2}\cdot I_{d}\succeq l\cdot I_{d}

Proof of Part 2.

Using Part 6 of Lemma 8.3, we have

10R04InB(x)10R04In.\displaystyle-10R_{0}^{4}I_{n}\preceq B(x)\preceq 10R_{0}^{4}I_{n}.

From assumption on WW, we also have

W2\displaystyle W^{2}\succeq 200R04In\displaystyle~{}200R_{0}^{4}I_{n}
W2\displaystyle-W^{2}\preceq 200R04In\displaystyle~{}-200R_{0}^{4}I_{n}

Combining the above three equations,

120W2B(x)120W2\displaystyle-\frac{1}{20}W^{2}\preceq B(x)\preceq\frac{1}{20}W^{2}

Thus,

(1120)W2B(x)+W2(1+120)W2\displaystyle(1-\frac{1}{20})W^{2}\preceq B(x)+W^{2}\preceq(1+\frac{1}{20})W^{2}

which implies that

(1-\frac{1}{10})\cdot(B(x)+W^{2})\preceq W^{2}\preceq(1+\frac{1}{10})\cdot(B(x)+W^{2})

Indeed, dividing the left and right inequalities of the previous display by the positive scalars 1-\frac{1}{20} and 1+\frac{1}{20}, respectively, gives \frac{20}{21}\cdot(B(x)+W^{2})\preceq W^{2}\preceq\frac{20}{19}\cdot(B(x)+W^{2}); since B(x)+W^{2}\succeq 0, the claim follows from \frac{20}{21}\geq 1-\frac{1}{10} and \frac{20}{19}\leq 1+\frac{1}{10}. ∎
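The conclusion of Part 1 can also be checked numerically (again an illustration for the exp case only, with the same caveats as before): choose the smallest weights allowed by the hypothesis and inspect the smallest eigenvalue of A^{\top}(B(x)+W^{2})A.

```python
import numpy as np

# Check Lemma 8.4, Part 1, on a random instance (exp case):
# if w_i^2 >= 10 R_0^4 + l / sigma_min(A)^2, then A^T (B(x) + W^2) A >= l I_d.
rng = np.random.default_rng(2)
n, d, l = 5, 3, 0.5
A = 0.3 * rng.standard_normal((n, d))
b = 0.3 * rng.standard_normal(n)
x = 0.3 * rng.standard_normal(d)

u = np.exp(A @ x)
c = u - u.sum() * b
B = (- np.outer(u, u * b) - np.outer(u * b, u)
     + np.dot(b, b) * np.outer(u, u)
     + np.diag((u + c) * u) - np.dot(c, b) * np.diag(u))

R0 = max(np.linalg.norm(u), np.linalg.norm(b), np.linalg.norm(c), 1.0)
sigma_min = np.linalg.svd(A, compute_uv=False).min()
w = np.full(n, np.sqrt(10 * R0 ** 4 + l / sigma_min ** 2))   # smallest admissible w
W2 = np.diag(w ** 2)

H = A.T @ (B + W2) @ A                  # Hessian of L = L_u + L_reg
lam_min = np.linalg.eigvalsh((H + H.T) / 2).min()
assert lam_min >= l - 1e-8
print(f"lambda_min(Hessian) = {lam_min:.3f} >= l = {l}")
```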

9 General Function: Hessian is Lipschitz with Respect to xx

In Section 9.1, we summarize all of the important properties that we derive in the following subsections to form an upper bound for \|H(x)-H(y)\|. In Section 9.2, we analyze the upper bound for \|u(x)-u(y)\|_{2}. In Section 9.3, we analyze the upper bound for |\alpha(x)-\alpha(y)|. In Section 9.4, we investigate the upper bound for \|c(x)-c(y)\|_{2}. In Section 9.5, we bound the sum of the spectral norms of the matrices G_{i}\in\mathbb{R}^{n\times n}, for all i\in[5], where the spectral norm of each individual G_{i} is bounded in one of the subsequent subsections. In Section 9.6, we analyze the upper bound of the spectral norm of G_{1}\in\mathbb{R}^{n\times n}. In Section 9.7, we find the upper bound of the spectral norm of G_{2}\in\mathbb{R}^{n\times n}. In Section 9.8, we study the upper bound of the spectral norm of G_{3}\in\mathbb{R}^{n\times n}. In Section 9.9, we assess the upper bound of the spectral norm of G_{4}\in\mathbb{R}^{n\times n}. In Section 9.10, we investigate the upper bound of the spectral norm of G_{5}\in\mathbb{R}^{n\times n}.

9.1 Main Result

In this section, we introduce our main result, which is the combination of all the important concepts in Section 9.

Lemma 9.1.

If the following condition holds

  • Let H(x)=d2Ldx2H(x)=\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}

  • Let R>4R>4

  • x2R\|x\|_{2}\leq R, y2R\|y\|_{2}\leq R, where x,ydx,y\in\mathbb{R}^{d}

  • A(xy)<0.01\|A(x-y)\|_{\infty}<0.01, where An×dA\in\mathbb{R}^{n\times d}

  • AR\|A\|\leq R

  • b2R\|b\|_{2}\leq R, where bnb\in\mathbb{R}^{n}

  • Let R:=max{u(x)2,u(y)2,c(x)2,c(y)2,1}R_{\infty}:=\max\{\|u(x)\|_{2},\|u(y)\|_{2},\|c(x)\|_{2},\|c(y)\|_{2},1\}

    • where R2nRexp(R2)R_{\infty}\leq 2nR\exp(R^{2})

    • this is proved by Part 1 and Part 3 in Claim 8.1

Then we have

H(x)H(y)n4exp(20R2)xy2\displaystyle\|H(x)-H(y)\|\leq n^{4}\exp(20R^{2})\cdot\|x-y\|_{2}
Proof.
H(x)H(y)\displaystyle~{}\|H(x)-H(y)\|
\displaystyle\leq A(G1+G2++G5)A\displaystyle~{}\|A\|\cdot(\|G_{1}\|+\|G_{2}\|+\cdots+\|G_{5}\|)\|A\|
\displaystyle\leq R2(G1+G2++G5)\displaystyle~{}R^{2}\cdot(\|G_{1}\|+\|G_{2}\|+\cdots+\|G_{5}\|)
\displaystyle\leq R25R3b2nu(x)u(y)2\displaystyle~{}R^{2}\cdot 5\cdot R_{\infty}^{3}\|b\|_{2}\sqrt{n}\cdot\|u(x)-u(y)\|_{2}
\displaystyle\leq R25R3b2n2nRexp(R2)xy2\displaystyle~{}R^{2}\cdot 5\cdot R_{\infty}^{3}\|b\|_{2}\sqrt{n}\cdot 2\sqrt{n}R\exp(R^{2})\cdot\|x-y\|_{2}
\leq~80n^{4}R^{7}\exp(4R^{2})\cdot\|x-y\|_{2}
\displaystyle\leq n4exp(20R2)xy2,\displaystyle~{}n^{4}\exp(20R^{2})\cdot\|x-y\|_{2},

where the first step is due to the definition of G_{i} (see Lemma 9.5) and Fact 4.6, the second step follows from \|A\|\leq R, the third step follows from Lemma 9.5, the fourth step is because of Lemma 9.2, the fifth step is due to the bound on R_{\infty} and \|b\|_{2}\leq R, and the last step follows from R>4 and simple algebra. ∎
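The chain of inequalities above can be instantiated numerically for the exp case, in which the paper's v(x) reduces to u(x)=\exp(Ax) (we treat this identification as an assumption of the sketch, since Definition 6.2 is not restated here). The snippet below, our own illustration, builds B(x), B(y) and the matrices G_{1},\dots,G_{5} of Lemma 9.5 and verifies the first and last inequalities of the chain on a random instance satisfying the hypotheses.

```python
import numpy as np

# Illustrative check of the first and last steps of Lemma 9.1 for the exp
# case (so v(x) = u(x) and the shift q cancels):
#   ||H(x) - H(y)|| <= ||A||^2 * sum_i ||G_i|| <= n^4 exp(20 R^2) ||x - y||_2.
rng = np.random.default_rng(3)
n, d, R = 5, 3, 4.5

A = rng.standard_normal((n, d))
A *= R / np.linalg.norm(A, 2)
b = rng.standard_normal(n)
b *= R / np.linalg.norm(b)
x = rng.standard_normal(d)
x *= 0.99 * R / np.linalg.norm(x)
y = x + 1e-4 * rng.standard_normal(d)          # small perturbation
assert np.max(np.abs(A @ (x - y))) < 0.01      # hypothesis of the lemma

def u_c_B(z):
    u = np.exp(A @ z)
    c = u - u.sum() * b
    B = (- np.outer(u, u * b) - np.outer(u * b, u)
         + np.dot(b, b) * np.outer(u, u)
         + np.diag((u + c) * u) - np.dot(c, b) * np.diag(u))
    return u, c, B

ux, cx, Bx = u_c_B(x)
uy, cy, By = u_c_B(y)

# G_1, ..., G_5 as in Lemma 9.5, with v = u for the exp case.
G = [np.outer(ux, ux * b) - np.outer(uy, uy * b),
     np.outer(ux * b, ux) - np.outer(uy * b, uy),
     np.dot(b, b) * (np.outer(ux, ux) - np.outer(uy, uy)),
     np.diag((ux + cx) * ux) - np.diag((uy + cy) * uy),
     np.dot(cx, b) * np.diag(ux) - np.dot(cy, b) * np.diag(uy)]

lhs = np.linalg.norm(A.T @ (Bx - By) @ A, 2)   # ||H(x) - H(y)||
mid = np.linalg.norm(A, 2) ** 2 * sum(np.linalg.norm(Gi, 2) for Gi in G)
rhs = n ** 4 * np.exp(20 * R ** 2) * np.linalg.norm(x - y)
assert lhs <= mid <= rhs
print(f"{lhs:.3e} <= {mid:.3e} <= {rhs:.3e}")
```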

9.2 Lipschitz for u(x)u(x)

We use a tool from [13].

Lemma 9.2 (Part 1 of Lemma 7.2 in [13]).

If the following conditions hold

  • Let u(x) be defined in Definition 6.1.

  • Let A(xy)<0.01\|A(x-y)\|_{\infty}<0.01

  • Let AR\|A\|\leq R, where An×dA\in\mathbb{R}^{n\times d}

  • Let x2,y2R\|x\|_{2},\|y\|_{2}\leq R , where x,ydx,y\in\mathbb{R}^{d}

then, we have

u(x)u(y)22nRexp(R2)xy2\displaystyle\|u(x)-u(y)\|_{2}\leq 2\sqrt{n}R\exp(R^{2})\|x-y\|_{2}
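The bound can be sanity-checked numerically for the exp case; the snippet below (our own illustration) draws a small instance satisfying the hypotheses and verifies the inequality.

```python
import numpy as np

# Sanity check of Lemma 9.2 for u(x) = exp(Ax) on a random instance.
rng = np.random.default_rng(6)
n, d, R = 6, 3, 2.0
A = rng.standard_normal((n, d))
A *= R / np.linalg.norm(A, 2)
x = rng.standard_normal(d)
x *= 0.99 * R / np.linalg.norm(x)
y = x + 1e-3 * rng.standard_normal(d)
assert np.max(np.abs(A @ (x - y))) < 0.01       # hypothesis ||A(x-y)||_inf < 0.01

lhs = np.linalg.norm(np.exp(A @ x) - np.exp(A @ y))
rhs = 2 * np.sqrt(n) * R * np.exp(R ** 2) * np.linalg.norm(x - y)
assert lhs <= rhs
```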

9.3 Lipschitz for α(x)\alpha(x)

We use a tool from previous work, namely [13].

Lemma 9.3 (Part 2 of Lemma 7.2 in [13]).

If the following conditions hold

  • Let α(x)\alpha(x) be defined as Definition 6.4.

  • Let u(x)u(x) be defined as Definition 6.1.

then, we have

|α(x)α(y)|nu(x)u(y)2\displaystyle|\alpha(x)-\alpha(y)|\leq\sqrt{n}\cdot\|u(x)-u(y)\|_{2}

9.4 Lipschitz for c(x)c(x)

We find the upper bound of c(x)c(y)2\|c(x)-c(y)\|_{2}.

Lemma 9.4.

If the following situations hold

  • Let c(x)c(x) be defined in Definition 6.5.

  • Let α(x)\alpha(x) be defined as Definition 6.4.

  • Let u(x)u(x) be defined as Definition 6.1.

  • Let bnb\in\mathbb{R}^{n}.

Then, we have

c(x)c(y)2u(x)u(y)2+|α(x)α(y)|b2\displaystyle\|c(x)-c(y)\|_{2}\leq\|u(x)-u(y)\|_{2}+|\alpha(x)-\alpha(y)|\cdot\|b\|_{2}
Proof.

We have

c(x)c(y)2=\displaystyle\|c(x)-c(y)\|_{2}= (u(x)α(x)b)(u(y)α(y)b)2\displaystyle~{}\|(u(x)-\alpha(x)\cdot b)-(u(y)-\alpha(y)\cdot b)\|_{2}
\displaystyle\leq u(x)u(y)2+(α(x)α(y))b2\displaystyle~{}\|u(x)-u(y)\|_{2}+\|(\alpha(x)-\alpha(y))\cdot b\|_{2}
=\displaystyle= u(x)u(y)2+|α(x)α(y)|b2\displaystyle~{}\|u(x)-u(y)\|_{2}+|\alpha(x)-\alpha(y)|\cdot\|b\|_{2}

where the first step is from how we defined cc (Definition 6.5), the second step is due to the triangle inequality, and the last step follows from simple algebra. ∎

9.5 Summary of Five Steps

In this section, we analyze the upper bound of the sum of Gi\|G_{i}\|, for all i[5]i\in[5].

Lemma 9.5.

If the following conditions hold

  • G1=v(x)(v(x)b)v(y)(v(y)b)G_{1}=v(x)(v(x)\circ b)^{\top}-v(y)(v(y)\circ b)^{\top}

  • G2=(v(x)b)v(x)(v(y)b)v(y)G_{2}=(v(x)\circ b)v(x)^{\top}-(v(y)\circ b)v(y)^{\top}

  • G3=b22v(x)v(x)b22v(y)v(y)G_{3}=\|b\|_{2}^{2}v(x)v(x)^{\top}-\|b\|_{2}^{2}v(y)v(y)^{\top}

  • G4=diag((u(x)+c(x))u(x))diag((u(y)+c(y))u(y))G_{4}=\operatorname{diag}((u(x)+c(x))\circ u(x))-\operatorname{diag}((u(y)+c(y))\circ u(y))

  • G5=c(x),bdiag(u(x))c(y),bdiag(u(y))G_{5}=\langle c(x),b\rangle\operatorname{diag}(u(x))-\langle c(y),b\rangle\operatorname{diag}(u(y))

  • Let R:=max{u(x)2,u(y)2,v(x)2,v(y)2,c(x)2,c(y)2,b2,1}R_{\infty}:=\max\{\|u(x)\|_{2},\|u(y)\|_{2},\|v(x)\|_{2},\|v(y)\|_{2},\|c(x)\|_{2},\|c(y)\|_{2},\|b\|_{2},1\}

Then, we have

  • Part 1.

    i=15Gi20R3max{u(x)u(y)2,c(x)c(y)2}.\displaystyle\sum_{i=1}^{5}\|G_{i}\|\leq 20R_{\infty}^{3}\cdot\max\{\|u(x)-u(y)\|_{2},\|c(x)-c(y)\|_{2}\}.
  • Part 2. Let b2R\|b\|_{2}\leq R

    i=15Gi100R3Rnu(x)u(y)2\displaystyle\sum_{i=1}^{5}\|G_{i}\|\leq 100R_{\infty}^{3}R\sqrt{n}\|u(x)-u(y)\|_{2}
Proof.

Proof of Part 1.

Using Lemma 9.6, Lemma 9.7, Lemma 9.8, Lemma 9.9 and Lemma 9.10, together with the definition of R_{\infty}, we can bound each of the five spectral norms and sum them up to obtain

\sum_{i=1}^{5}\|G_{i}\|\leq 20R_{\infty}^{3}\cdot\max\{\|u(x)-u(y)\|_{2},\|c(x)-c(y)\|_{2}\}.

Proof of Part 2.

Note that

c(x)c(y)2\displaystyle\|c(x)-c(y)\|_{2}\leq u(x)u(y)2+|α(x)α(y)|b2\displaystyle~{}\|u(x)-u(y)\|_{2}+|\alpha(x)-\alpha(y)|\cdot\|b\|_{2}
\displaystyle\leq u(x)u(y)2+u(x)u(y)2nb2\displaystyle~{}\|u(x)-u(y)\|_{2}+\|u(x)-u(y)\|_{2}\sqrt{n}\|b\|_{2}
\displaystyle\leq u(x)u(y)2+u(x)u(y)2nR\displaystyle~{}\|u(x)-u(y)\|_{2}+\|u(x)-u(y)\|_{2}\sqrt{n}R
\displaystyle\leq 2nRu(x)u(y)2,\displaystyle~{}2\sqrt{n}R\|u(x)-u(y)\|_{2},

where the first step follows from Lemma 9.4, the second step follows from Lemma 9.3, the third step follows from the assumption \|b\|_{2}\leq R, and the last step follows from simple algebra.

Combining this with Part 1, we obtain

\sum_{i=1}^{5}\|G_{i}\|\leq 20R_{\infty}^{3}\cdot 2\sqrt{n}R\cdot\|u(x)-u(y)\|_{2}\leq 100R_{\infty}^{3}R\sqrt{n}\cdot\|u(x)-u(y)\|_{2}

which completes the proof. ∎

9.6 Lipschitz Calculations: Step 1. Lipschitz for Matrix Function v(x)(v(x)b)v(x)(v(x)\circ b)^{\top}

We find the upper bound of G1\|G_{1}\|.

Lemma 9.6.

If the following conditions hold

  • G1=v(x)(v(x)b)v(y)(v(y)b)G_{1}=v(x)(v(x)\circ b)^{\top}-v(y)(v(y)\circ b)^{\top}

Then, we have

G12max{v(x)2,v(y)2}b2v(x)v(y)2.\displaystyle\|G_{1}\|\leq 2\max\{\|v(x)\|_{2},\|v(y)\|_{2}\}\cdot\|b\|_{2}\cdot\|v(x)-v(y)\|_{2}.
Proof.

We define

G1,1:=\displaystyle G_{1,1}:= v(x)(v(x)b)v(y)(v(x)b)\displaystyle~{}v(x)(v(x)\circ b)^{\top}-v(y)(v(x)\circ b)^{\top}
G1,2:=\displaystyle G_{1,2}:= v(y)(v(x)b)v(y)(v(y)b)\displaystyle~{}v(y)(v(x)\circ b)^{\top}-v(y)(v(y)\circ b)^{\top}

We have

G1=G1,1+G1,2\displaystyle G_{1}=G_{1,1}+G_{1,2}

We can show

G1,1=\displaystyle\|G_{1,1}\|= (v(x)v(y))(v(x)b)\displaystyle~{}\|(v(x)-v(y))\cdot(v(x)\circ b)^{\top}\|
\displaystyle\leq v(x)v(y)2v(x)b2\displaystyle~{}\|v(x)-v(y)\|_{2}\cdot\|v(x)\circ b\|_{2}
\displaystyle\leq v(x)v(y)2v(x)2b2\displaystyle~{}\|v(x)-v(y)\|_{2}\cdot\|v(x)\|_{2}\cdot\|b\|_{2}

where the first step is due to the definition of G1,1G_{1,1}, the second step follows from Fact 4.6, and the last step follows from Fact 4.5.

Similarly, we can also show

G1,2=\displaystyle\|G_{1,2}\|= v(y)((v(x)v(y))b)\displaystyle~{}\|v(y)\cdot((v(x)-v(y))\circ b)^{\top}\|
\displaystyle\leq v(y)2(v(x)v(y))b2\displaystyle~{}\|v(y)\|_{2}\cdot\|(v(x)-v(y))\circ b\|_{2}
\displaystyle\leq v(y)2v(x)v(y)2b2\displaystyle~{}\|v(y)\|_{2}\cdot\|v(x)-v(y)\|_{2}\cdot\|b\|_{2}

where the first step is due to the definition of G1,2G_{1,2}, the second step follows from Fact 4.6, and the last step follows from Fact 4.5. Thus, we complete the proof. ∎

9.7 Lipschitz Calculations: Step 2. Lipschitz for Matrix Function (v(x)b)v(x)(v(x)\circ b)v(x)^{\top}

We look for the upper bound of G2\|G_{2}\|.

Lemma 9.7.

If the following conditions hold

  • G2=(v(x)b)(v(x))(v(y)b)v(y)G_{2}=(v(x)\circ b)(v(x))^{\top}-(v(y)\circ b)v(y)^{\top}

Then, we have

G22max{v(x)2,v(y)2}b2v(x)v(y)2.\displaystyle\|G_{2}\|\leq 2\max\{\|v(x)\|_{2},\|v(y)\|_{2}\}\cdot\|b\|_{2}\cdot\|v(x)-v(y)\|_{2}.
Proof.

The proof is very similar to that of the previous lemma, so we omit the details here. ∎

9.8 Lipschitz Calculations: Step 3. Lipschitz for Matrix Function b22v(x)v(x)\|b\|_{2}^{2}v(x)v(x)^{\top}

We analyze the upper bound of G3\|G_{3}\|.

Lemma 9.8.

If the following conditions hold

  • G3=b22v(x)v(x)b22v(y)v(y)G_{3}=\|b\|_{2}^{2}v(x)v(x)^{\top}-\|b\|_{2}^{2}v(y)v(y)^{\top}

Then, we have

G32max{v(x)2,v(y)2}b22v(x)v(y)2.\displaystyle\|G_{3}\|\leq 2\max\{\|v(x)\|_{2},\|v(y)\|_{2}\}\cdot\|b\|_{2}^{2}\cdot\|v(x)-v(y)\|_{2}.
Proof.

We define

G3,1:=\displaystyle G_{3,1}:= b22v(x)v(x)b22v(y)v(x)\displaystyle~{}\|b\|_{2}^{2}v(x)v(x)^{\top}-\|b\|_{2}^{2}v(y)v(x)^{\top}
G3,2:=\displaystyle G_{3,2}:= b22v(y)v(x)b22v(y)v(y)\displaystyle~{}\|b\|_{2}^{2}v(y)v(x)^{\top}-\|b\|_{2}^{2}v(y)v(y)^{\top}

We have

G3=G3,1+G3,2.\displaystyle G_{3}=G_{3,1}+G_{3,2}.

We can show that

G3,1=\displaystyle\|G_{3,1}\|= b22v(x)v(x)v(y)v(x)\displaystyle~{}\|b\|_{2}^{2}\cdot\|v(x)v(x)^{\top}-v(y)v(x)^{\top}\|
=\displaystyle= b22(v(x)v(y))v(x)\displaystyle~{}\|b\|_{2}^{2}\cdot\|(v(x)-v(y))v(x)^{\top}\|
\displaystyle\leq b22v(x)v(y)2v(x)2,\displaystyle~{}\|b\|_{2}^{2}\cdot\|v(x)-v(y)\|_{2}\cdot\|v(x)\|_{2},

where the first step comes from the definition of G3,1G_{3,1}, the second step is due to simple algebra, and the third step follows from Fact 4.6.

Similarly, we can show that

\|G_{3,2}\|\leq\|b\|_{2}^{2}\cdot\|v(x)-v(y)\|_{2}\cdot\|v(y)\|_{2}.

Thus, we complete the proof. ∎

9.9 Lipschitz Calculations: Step 4. Lipschitz for Matrix Function diag((u(x)+c(x))u(x))\operatorname{diag}((u(x)+c(x))\circ u(x))

We investigate the upper bound of G4\|G_{4}\|.

Since we are proving a Lipschitz bound, the constant vector q makes no difference: it cancels when we take the difference. Thus, we define the terms below without q.

Lemma 9.9.

If the following conditions hold

  • G4=diag((u(x)+c(x))u(x))diag((u(y)+c(y))u(y))G_{4}=\operatorname{diag}((u(x)+c(x))\circ u(x))-\operatorname{diag}((u(y)+c(y))\circ u(y))

Then, we have

G44max{u(x)2,u(y)2,c(x)2,c(y)2}(u(x)u(y)2+c(x)c(y)2)\displaystyle\|G_{4}\|\leq 4\max\{\|u(x)\|_{2},\|u(y)\|_{2},\|c(x)\|_{2},\|c(y)\|_{2}\}\cdot(\|u(x)-u(y)\|_{2}+\|c(x)-c(y)\|_{2})
Proof.

We define

G4,1:=\displaystyle G_{4,1}:= diag((u(x)+c(x))u(x))diag((u(y)+c(y))u(x))\displaystyle~{}\operatorname{diag}((u(x)+c(x))\circ u(x))-\operatorname{diag}((u(y)+c(y))\circ u(x))
G4,2:=\displaystyle G_{4,2}:= diag((u(y)+c(y))u(x))diag((u(y)+c(y))u(y))\displaystyle\operatorname{diag}((u(y)+c(y))\circ u(x))-\operatorname{diag}((u(y)+c(y))\circ u(y))

Then we have

G4,1=\displaystyle\|G_{4,1}\|= diag((u(x)+c(x))u(x))diag((u(y)+c(y))u(x))\displaystyle~{}\|\operatorname{diag}((u(x)+c(x))\circ u(x))-\operatorname{diag}((u(y)+c(y))\circ u(x))\|
\displaystyle\leq (u(x)+c(x)u(y)c(y))u(x)2\displaystyle~{}\|(u(x)+c(x)-u(y)-c(y))\circ u(x)\|_{2}
\displaystyle\leq u(x)+c(x)u(y)c(y)2u(x)2\displaystyle~{}\|u(x)+c(x)-u(y)-c(y)\|_{2}\cdot\|u(x)\|_{2}
\leq~(\|u(x)-u(y)\|_{2}+\|c(x)-c(y)\|_{2})\cdot\|u(x)\|_{2}

where the first step is due to the definition of G_{4,1}, the second step is due to Fact 4.5, the third step is due to Fact 4.5, and the last step follows from the triangle inequality.

Similarly, we have

G4,2=\displaystyle\|G_{4,2}\|= diag((u(y)+c(y))u(x))diag((u(y)+c(y))u(y))\displaystyle~{}\|\operatorname{diag}((u(y)+c(y))\circ u(x))-\operatorname{diag}((u(y)+c(y))\circ u(y))\|
\displaystyle\leq (u(y)+c(y))u(x)(u(y)+c(y))u(y)2\displaystyle~{}\|(u(y)+c(y))\circ u(x)-(u(y)+c(y))\circ u(y)\|_{2}
\displaystyle\leq (u(y)2+c(y)2)u(x)u(y)2\displaystyle~{}(\|u(y)\|_{2}+\|c(y)\|_{2})\cdot\|u(x)-u(y)\|_{2}

where the first step is due to the definition of G4,2G_{4,2}, the second step is due to Fact 4.5, and the third step is due to Fact 4.5. ∎

9.10 Lipschitz Calculations: Step 5. Lipschitz for Matrix Function c(x),bdiag(u(x))\langle c(x),b\rangle\operatorname{diag}(u(x))

We compute the upper bound of G5\|G_{5}\|.

Lemma 9.10.

If the following conditions hold

  • G5=c(x),bdiag(u(x))c(y),bdiag(u(y))G_{5}=\langle c(x),b\rangle\operatorname{diag}(u(x))-\langle c(y),b\rangle\operatorname{diag}(u(y))

Then, we have

G54max{u(x)2,u(y)2,c(x)2,c(y)2}b2(u(x)u(y)2+c(x)c(y)2)\displaystyle\|G_{5}\|\leq 4\max\{\|u(x)\|_{2},\|u(y)\|_{2},\|c(x)\|_{2},\|c(y)\|_{2}\}\cdot\|b\|_{2}(\|u(x)-u(y)\|_{2}+\|c(x)-c(y)\|_{2})
Proof.

We define

G5,1:=\displaystyle G_{5,1}:= c(x),bdiag(u(x))c(x),bdiag(u(y))\displaystyle~{}\langle c(x),b\rangle\operatorname{diag}(u(x))-\langle c(x),b\rangle\operatorname{diag}(u(y))
G5,2:=\displaystyle G_{5,2}:= c(x),bdiag(u(y))c(y),bdiag(u(y))\displaystyle~{}\langle c(x),b\rangle\operatorname{diag}(u(y))-\langle c(y),b\rangle\operatorname{diag}(u(y))

We can show

G5,1=\displaystyle\|G_{5,1}\|= c(x),b(diag(u(x))diag(u(y)))\displaystyle~{}\|\langle c(x),b\rangle\cdot(\operatorname{diag}(u(x))-\operatorname{diag}(u(y)))\|
=\displaystyle= |c(x),b|diag(u(x))diag(u(y))\displaystyle~{}|\langle c(x),b\rangle|\cdot\|\operatorname{diag}(u(x))-\operatorname{diag}(u(y))\|
\displaystyle\leq c(x)2b2diag(u(x))diag(u(y))\displaystyle~{}\|c(x)\|_{2}\cdot\|b\|_{2}\cdot\|\operatorname{diag}(u(x))-\operatorname{diag}(u(y))\|
\displaystyle\leq c(x)2b2u(x)u(y)2\displaystyle~{}\|c(x)\|_{2}\cdot\|b\|_{2}\cdot\|u(x)-u(y)\|_{2}

where the first step is due to the definition of G_{5,1}, the second step follows from Fact 4.6, the third step follows from Fact 4.5, and the last step follows from Fact 4.5.

Similarly, we have

G5,2=\displaystyle\|G_{5,2}\|= |c(x)c(y),b|diag(u(y))\displaystyle~{}|\langle c(x)-c(y),b\rangle|\cdot\|\operatorname{diag}(u(y))\|
\displaystyle\leq c(x)c(y)2b2u(y)2\displaystyle~{}\|c(x)-c(y)\|_{2}\cdot\|b\|_{2}\cdot\|u(y)\|_{2}

where the first step is due to Fact 4.5, the definition of G_{5,2} and simple algebra, and the second step follows from Fact 4.5 and Fact 4.3. ∎

10 Lipschitz with respect to AA

In Section 10.1, we consider the xx case, which is to upper bound |α(x)1||\alpha(x)^{-1}|. In Section 10.2, we consider the AA case, namely computing the upper bound of |α(A)1||\alpha(A)^{-1}|. In Section 10.3, we analyze the bound for u(A)u(B)2\|u(A)-u(B)\|_{2}. In Section 10.4, we investigate the bound for |α(A)α(B)||\alpha(A)-\alpha(B)|. In Section 10.5, we analyze the bound for c(A)c(B)2\|c(A)-c(B)\|_{2}.

10.1 For the xx case

In this section, we bound |α(x)1||\alpha(x)^{-1}|.

Definition 10.1.

We define \delta_{b} to be the vector that satisfies

u(xt+1)α(xt+1)b22=u(xt)α(xt)(bδb)22\displaystyle\|u(x_{t+1})-\alpha(x_{t+1})b\|_{2}^{2}=\|u(x_{t})-\alpha(x_{t})(b-\delta_{b})\|_{2}^{2}
Lemma 10.2.

We have

δb2|α(xt)1|c(xt+1)c(xt)2\displaystyle\|\delta_{b}\|_{2}\leq|\alpha(x_{t})^{-1}|\cdot\|c(x_{t+1})-c(x_{t})\|_{2}
Proof.

As described in [30], there could be multiple solutions, e.g., 2^{n} possible solutions:

u(xt+1)α(xt+1)b=(u(xt)α(xt)(bδb)){1,+1}n\displaystyle u(x_{t+1})-\alpha(x_{t+1})b=(u(x_{t})-\alpha(x_{t})(b-\delta_{b}))\circ\{-1,+1\}^{n}

The norms of all the solutions are the same. Therefore, it suffices to consider one solution, namely

u(xt+1)α(xt+1)b=u(xt)α(xt)(bδb)\displaystyle u(x_{t+1})-\alpha(x_{t+1})b=u(x_{t})-\alpha(x_{t})(b-\delta_{b})

Thus,

δb=\displaystyle\delta_{b}= α(xt)1(u(xt+1)u(xt)b(α(xt+1)α(xt)))\displaystyle~{}\alpha(x_{t})^{-1}(u(x_{t+1})-u(x_{t})-b(\alpha(x_{t+1})-\alpha(x_{t})))
=~\alpha(x_{t})^{-1}(c(x_{t+1})-c(x_{t}))

Taking the \ell_{2} norm on both sides gives \|\delta_{b}\|_{2}\leq|\alpha(x_{t})^{-1}|\cdot\|c(x_{t+1})-c(x_{t})\|_{2}, which completes the proof. ∎

We use a tool, which is from [13].

Lemma 10.3 (Lemma 8.9 in [13]).

If the following conditions hold

  • Let AR\|A\|\leq R

  • Let x2R\|x\|_{2}\leq R

We have

|α(x)1|exp(R2)\displaystyle|\alpha(x)^{-1}|\leq\exp(R^{2})

The proof is standard, so we omit the details here.
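For the exp case the bound has a one-line justification: every entry of u(x)=\exp(Ax) is at least \exp(-\|Ax\|_{\infty})\geq\exp(-R^{2}), so \alpha(x)=\langle u(x),{\bf 1}_{n}\rangle\geq n\exp(-R^{2})\geq\exp(-R^{2}), and hence |\alpha(x)^{-1}|\leq\exp(R^{2}). A minimal numerical illustration (our own, exp case only):

```python
import numpy as np

# |alpha(x)^{-1}| <= exp(R^2) for u(x) = exp(Ax) (illustrative check).
rng = np.random.default_rng(4)
n, d, R = 8, 4, 3.0
A = rng.standard_normal((n, d))
A *= R / np.linalg.norm(A, 2)
x = rng.standard_normal(d)
x *= R / np.linalg.norm(x)

alpha = np.exp(A @ x).sum()             # alpha(x) = <u(x), 1_n> > 0
assert abs(1.0 / alpha) <= np.exp(R ** 2)
```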

10.2 For the AA case

Here, we bound |α(A)1||\alpha(A)^{-1}|.

Definition 10.4.

We define \delta_{b} to be the vector that satisfies

\|u(A_{t+1})-\alpha(A_{t+1})b\|_{2}^{2}=\|u(A_{t})-\alpha(A_{t})(b-\delta_{b})\|_{2}^{2}
Lemma 10.5.

We have

\|\delta_{b}\|_{2}\leq|\alpha(A_{t})^{-1}|\cdot\|c(A_{t+1})-c(A_{t})\|_{2}

The proof is analogous to that of Lemma 10.2, so we omit the details here.
Lemma 10.6 (Lemma 8.9 in [13]).

If the following points hold

  • Let AR\|A\|\leq R

  • Let x2R\|x\|_{2}\leq R

We have

|α(A)1|exp(R2)\displaystyle|\alpha(A)^{-1}|\leq\exp(R^{2})

10.3 Lipschitz for u(A)u(A)

We state a tool that follows directly from previous work. The proof is very standard, so we omit the details here.

Lemma 10.7 (A variation of Part 1 of Lemma 7.2 in [13]).

If the following conditions hold

  • Let u(A) be defined as Definition 6.1 with reparameterization by A instead of x; that is, instead of writing u(x)=\exp(Ax), we write u(A)=\exp(Ax).

  • Let (AB)x<0.01\|(A-B)x\|_{\infty}<0.01

  • Let A,BR\|A\|,\|B\|\leq R, where A,Bn×dA,B\in\mathbb{R}^{n\times d}

  • Let x2R\|x\|_{2}\leq R , where xdx\in\mathbb{R}^{d}

then, we have

u(A)u(B)22nRexp(R2)AB\displaystyle\|u(A)-u(B)\|_{2}\leq 2\sqrt{n}R\exp(R^{2})\|A-B\|

10.4 Lipschitz for α(A)\alpha(A)

We state a tool that follows directly from previous work. The proof is very standard, so we omit the details here.

Lemma 10.8 (A variation of Part 2 of Lemma 7.2 in [13]).

If the following conditions hold

  • Let α(A)\alpha(A) be defined in Definition 6.4 with reparameterization by AA instead of xx.

  • Let u(A)u(A) be defined as Definition 6.1 with reparameterization by AA instead of xx.

then, we have

|α(A)α(B)|nu(A)u(B)2\displaystyle|\alpha(A)-\alpha(B)|\leq\sqrt{n}\cdot\|u(A)-u(B)\|_{2}

10.5 Lipschitz for c(A)

In this section, we bound c(A)c(B)2\|c(A)-c(B)\|_{2}.

Lemma 10.9 (A variation of Lemma 9.4).

If the following conditions hold

  • Let c(A)c(A) be defined as Definition 6.5 with reparametrization by AA.

  • Let α(A)\alpha(A) be defined as Definition 6.4 with reparameterization by AA.

  • Let u(A)u(A) be defined as Definition 6.1 with reparameterization by AA.

  • Let bnb\in\mathbb{R}^{n}.

Then, we have

c(A)c(B)2u(A)u(B)2+|α(A)α(B)|b2\displaystyle\|c(A)-c(B)\|_{2}\leq\|u(A)-u(B)\|_{2}+|\alpha(A)-\alpha(B)|\cdot\|b\|_{2}
Proof.

We have

\|c(A)-c(B)\|_{2}=~\|(u(A)-\alpha(A)\cdot b)-(u(B)-\alpha(B)\cdot b)\|_{2}
\displaystyle\leq u(A)u(B)2+(α(A)α(B))b2\displaystyle~{}\|u(A)-u(B)\|_{2}+\|(\alpha(A)-\alpha(B))\cdot b\|_{2}
=\displaystyle= u(A)u(B)2+|α(A)α(B)|b2\displaystyle~{}\|u(A)-u(B)\|_{2}+|\alpha(A)-\alpha(B)|\cdot\|b\|_{2}

where the first step comes from how we defined cc (see Definition 6.5), the second step is because of the triangle inequality, and the last step follows from simple algebra. ∎

11 Main Result

In Section 11.1, we introduce our algorithm (see Algorithm 1) and use our main theorem (see Theorem 11.1) to analyze its properties, including the running time and the output \widetilde{x}. In Section 11.2, we introduce a corollary which gives an application to in-context learning.

11.1 Convergence

Now, we introduce our main algorithm and Theorem.

Algorithm 1 Rescaled Hyperbolic Functions Regression.
1:procedure RescaledHyperbolicFunctionsRegression(An×d,bn,wn,ϵ,δA\in\mathbb{R}^{n\times d},b\in\mathbb{R}^{n},w\in\mathbb{R}^{n},\epsilon,\delta) \triangleright Theorem 11.1
2:     We choose x0x_{0} (suppose it satisfies Definition 5.1)
3:     We use Tlog(x0x2/ϵ)T\leftarrow\log(\|x_{0}-x^{*}\|_{2}/\epsilon) to denote the number of iterations.
4:     for t=0Tt=0\to T do
5:         DBdiag(xt)+diag(ww)D\leftarrow B_{\operatorname{diag}}(x_{t})+\operatorname{diag}(w\circ w)
6:         D~SubSample(D,A,ϵ1=Θ(1),δ1=δ/T)\widetilde{D}\leftarrow\textsc{SubSample}(D,A,\epsilon_{1}=\Theta(1),\delta_{1}=\delta/T) \triangleright Lemma 5.3
7:         gA(c(xt)v(xt)v(xt)b,c(xt))+AW2Axtg\leftarrow A^{\top}(c(x_{t})\circ v(x_{t})-v(x_{t})\langle b,c(x_{t})\rangle)+A^{\top}W^{2}Ax_{t} \triangleright Definition 6.2 and Definition 6.5
8:         H~AD~A\widetilde{H}\leftarrow A^{\top}\widetilde{D}A
9:         xt+1xt+H~1gx_{t+1}\leftarrow x_{t}+\widetilde{H}^{-1}g \triangleright Definition 5.4
10:     end for
11:     x~xT+1\widetilde{x}\leftarrow x_{T+1}
12:     return x~\widetilde{x}
13:end procedure
Theorem 11.1.

Given vectors b,w\in\mathbb{R}^{n} and a matrix A\in\mathbb{R}^{n\times d}, we define x^{*} as the optimal solution of the following problem

minxd0.5exp(Ax)exp(Ax),𝟏nb22+0.5diag(w)Ax22\displaystyle\min_{x\in\mathbb{R}^{d}}0.5\cdot\|\exp(Ax)-\langle\exp(Ax),{\bf 1}_{n}\rangle\cdot b\|_{2}^{2}+0.5\|\operatorname{diag}(w)Ax\|_{2}^{2}

And then if the conditions as follows hold:

  • R4R\geq 4.

  • g(x)=𝟎dg(x^{*})={\bf 0}_{d}.

  • x2R\|x^{*}\|_{2}\leq R.

  • AR\|A\|\leq R.

  • b2R\|b\|_{2}\leq R.

  • wi2100+l/σmin(A)2w_{i}^{2}\geq 100+l/\sigma_{\min}(A)^{2} for all i[n]i\in[n]

  • M=exp(O(R2+logn))M=\exp(O(R^{2}+\log n)).

  • Let accuracy ϵ(0,0.1)\epsilon\in(0,0.1)

  • Let failure probability δ(0,0.1)\delta\in(0,0.1)

  • Let x0x_{0} denote an initial point for which it holds that Mx0x20.1lM\|x_{0}-x^{*}\|_{2}\leq 0.1l.

Then there exists a randomized algorithm (Algorithm 1) such that, with probability at least 1δ1-\delta,

  • it runs T=log(x0x2/ϵ)T=\log(\|x_{0}-x^{*}\|_{2}/\epsilon) iterations

  • spends

    O((\operatorname{nnz}(A)+d^{\omega})\cdot\operatorname{poly}(\log(n/\delta)))

    time per iteration,
  • outputs a vector x~d\widetilde{x}\in\mathbb{R}^{d} such that

    x~x2ϵ\displaystyle\|\widetilde{x}-x^{*}\|_{2}\leq\epsilon

Here \omega denotes the exponent of matrix multiplication. Currently \omega\approx 2.373 [46, 25, 5].

Proof.

Proof of Hessian is PD.

We can obtain this conclusion from Lemma 8.4.

Proof of Hessian is Lipschitz.

The proof is due to Lemma 9.1.

Proof of Cost per iteration.

This follows from Lemma 5.3.

Proof of Convergence per Iteration.

By Lemma 5.5, we have

xkx20.4xk1x2.\displaystyle\|x_{k}-x^{*}\|_{2}\leq 0.4\cdot\|x_{k-1}-x^{*}\|_{2}.

Proof of Number of Iterations.

After TT iterations, we have

xTx20.4Tx0x2\displaystyle\|x_{T}-x^{*}\|_{2}\leq 0.4^{T}\cdot\|x_{0}-x^{*}\|_{2}

By the choice of T, we get the desired bound. The failure probability follows from a union bound over the T iterations. ∎
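To make the iteration of Algorithm 1 concrete, here is a minimal dense sketch for the exp case. It is our own illustration under two stated assumptions: we take v(x)=u(x) (the exp case; Definition 6.2 is not restated here), and the SubSample step of line 6 is replaced by the exact matrix D, so the sketch mirrors only the per-iteration linear algebra and does not reproduce the input-sparsity running time of Theorem 11.1. All variable and function names are ours.

```python
import numpy as np

# A dense, illustrative sketch of the iteration in Algorithm 1 for the exp
# case u(x) = exp(Ax) (so v(x) = u(x) and q = 0). The SubSample step of
# line 6 (Lemma 5.3) is replaced by the exact matrix D.
rng = np.random.default_rng(5)
n, d = 8, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
A = 0.5 * Q                                  # well-conditioned A with ||A|| = 0.5
b = rng.standard_normal(n)
b *= 0.5 / np.linalg.norm(b)
w = np.full(n, 20.0)                         # w_i^2 = 400 >= 100 + l / sigma_min(A)^2 (e.g. l = 1)
W2 = np.diag(w ** 2)

def u_and_c(x):
    u = np.exp(A @ x)                        # u(x), Definition 6.1
    c = u - u.sum() * b                      # c(x) = u(x) - <u(x), 1_n> b, Definition 6.5
    return u, c

def gradient(x):
    u, c = u_and_c(x)
    # Line 7 of Algorithm 1 with v(x) = u(x):
    # g = A^T (c(x) o v(x) - v(x) <b, c(x)>) + A^T W^2 A x
    return A.T @ (c * u - u * np.dot(b, c)) + A.T @ W2 @ A @ x

x = np.zeros(d)                              # initial point x_0 (a good initialization is assumed)
for _ in range(30):
    u, c = u_and_c(x)
    # Line 5: D = B_diag(x_t) + diag(w o w), with B_diag = B_diag^1 + B_diag^2.
    D = np.diag((u + c) * u - np.dot(c, b) * u) + W2
    H = A.T @ D @ A                          # line 8: approximate Hessian H~
    g = gradient(x)
    # Approximate Newton step. We use the standard descent direction
    # -H~^{-1} g for the gradient g above; the sign convention of line 9
    # in Algorithm 1 is fixed by Definition 5.4, which is not restated here.
    x = x - np.linalg.solve(H, g)

print("final gradient norm:", np.linalg.norm(gradient(x)))
assert np.linalg.norm(gradient(x)) < 1e-6
```

In the actual algorithm, \widetilde{D} is a sparsified spectral approximation of D (Lemma 5.3), which is what yields the \operatorname{nnz}(A)+d^{\omega} per-iteration cost; the dense solve above is only for readability.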

11.2 Application to In-context Learning

In this section, we introduce the application to in-context learning.

Corollary 11.2 (Bounded shift for in-context learning).

If the following conditions hold

  • Let An×dA\in\mathbb{R}^{n\times d}.

  • Let bnb\in\mathbb{R}^{n}.

  • AR\|A\|\leq R.

  • Let x2R\|x\|_{2}\leq R.

  • A(xt+1xt)<0.01\|A(x_{t+1}-x_{t})\|_{\infty}<0.01.

  • (At+1At)x<0.01\|(A_{t+1}-A_{t})x\|_{\infty}<0.01.

  • Let R4R\geq 4.

  • Let M:=exp(O(R2+logn))M:=\exp(O(R^{2}+\log n)).

  • Let u(x){exp(Ax),cosh(Ax),sinh(Ax)}u(x)\in\{\exp(Ax),\cosh(Ax),\sinh(Ax)\}.

We consider the rescaled softmax regression (Definition 1.4) problem

minxdu(x)α(x)b2.\displaystyle\min_{x\in\mathbb{R}^{d}}\|u(x)-\alpha(x)b\|_{2}.
  • Part 1. If we move the xtx_{t} to xt+1x_{t+1}, then we’re solving a new rescaled softmax regression problem with

    minxdu(x)α(x)b~2\displaystyle\min_{x\in\mathbb{R}^{d}}\|u(x)-\alpha(x)\widetilde{b}\|_{2}

    where

    b~b2Mxt+1xt2\displaystyle\|\widetilde{b}-b\|_{2}\leq M\cdot\|x_{t+1}-x_{t}\|_{2}
  • Part 2. If we move the AtA_{t} to At+1A_{t+1}, then we’re solving a new rescaled softmax regression with

    minxu(x)α(x)b^2\displaystyle\min_{x}\|u(x)-\alpha(x)\widehat{b}\|_{2}

    where

    b^b2MAt+1At\displaystyle\|\widehat{b}-b\|_{2}\leq M\cdot\|A_{t+1}-A_{t}\|
Proof.

Proof of Part 1. The proof follows by combining Lemma 10.2, Lemma 10.3, Lemma 9.2, Lemma 9.3 and Lemma 9.4.

Proof of Part 2. The proof follows by combining Lemma 10.5, Lemma 10.6, Lemma 10.7, Lemma 10.8 and Lemma 10.9. ∎

References

  • ALS+ [22] Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, and Danyang Zhuo. Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing. arXiv preprint arXiv:2211.14227, 2022.
  • Ans [00] Kurt M Anstreicher. The volumetric barrier for semidefinite programming. Mathematics of Operations Research, 2000.
  • AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
  • ASA+ [22] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
  • AW [21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522–539. SIAM, 2021.
  • [6] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
  • [7] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. Advances in neural information processing systems, 32, 2019.
  • BMR+ [20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Bra [20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259–278. SIAM, 2020.
  • BSZ [23] Jan van den Brand, Zhao Song, and Tianyi Zhou. Algorithm and hardness for dynamic attention maintenance in large language models. arXiv preprint arXiv:2304.02207, 2023.
  • CLS [19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, 2019.
  • DCLT [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • DLS [23] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.
  • DLY [21] Sally Dong, Yin Tat Lee, and Guanghao Ye. A nearly-linear time algorithm for linear programs with small treewidth: a multiscale representation of robust central path. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1784–1797, 2021.
  • DMS [23] Yichuan Deng, Sridhar Mahadevan, and Zhao Song. Randomized and deterministic attention sparsification algorithms for over-parameterized feature dimension. arXiv preprint arXiv:2304.03426, 2023.
  • DSW [22] Yichuan Deng, Zhao Song, and Omri Weinstein. Discrepancy minimization in input-sparsity time. arXiv preprint arXiv:2210.12468, 2022.
  • GMS [23] Yeqi Gao, Sridhar Mahadevan, and Zhao Song. An over-parameterized exponential regression. arXiv preprint arXiv:2303.16504, 2023.
  • GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
  • GTLV [22] Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. arXiv preprint arXiv:2208.01066, 2022.
  • HJS+ [22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 233–244. IEEE, 2022.
  • HWL [21] Weihua He, Yongyun Wu, and Xiaohua Li. Attention mechanism for neural machine translation: A survey. In 2021 IEEE 5th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), volume 5, pages 1485–1489. IEEE, 2021.
  • JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, 2020.
  • JLSZ [23] Haotian Jiang, Yin Tat Lee, Zhao Song, and Lichen Zhang. Convex minimization with integer minima in O~(n4)\widetilde{O}(n^{4}) time. arXiv preprint arXiv:2304.03426, 2023.
  • JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC, 2021.
  • LG [14] François Le Gall. Powers of tensors and fast matrix multiplication. In Proceedings of the 39th international symposium on symbolic and algebraic computation, pages 296–303, 2014.
  • LLR [23] Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245, 2023.
  • LSX+ [23] Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning and weight shifting for softmax regression. arXiv preprint arXiv:2304.13276, 2023.
  • LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory (COLT), pages 2140–2157. PMLR, 2019.
  • [29] Zhihang Li, Zhao Song, and Tianyi Zhou. Solving regularized exp, cosh and sinh regression problems. arXiv preprint arXiv:2303.15725, 2023.
  • [30] S Cliff Liu, Zhao Song, Hengjie Zhang, Lichen Zhang, and Tianyi Zhou. Space-efficient interior point method, with applications to linear programming and maximum weight bipartite matching. In ICALP, 2023.
  • MMS+ [19] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de La Clergerie, Djame Seddah, and Benoit Sagot. Camembert: a tasty french language model. arXiv preprint arXiv:1911.03894, 2019.
  • ONR+ [22] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677, 2022.
  • Ope [23] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
  • RNS+ [18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • RWC+ [19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Son [19] Zhao Song. Matrix theory: optimization, concentration, and algorithms. The University of Texas at Austin, 2019.
  • SWYZ [23] Zhao Song, Yitan Wang, Zheng Yu, and Lichen Zhang. Sketching for first order method: Efficient algorithm for low-bandwidth channel and vulnerability. In ICML, 2023.
  • SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for linear programming. In International Conference on Machine Learning, pages 9835–9847. PMLR, 2021.
  • SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
  • SYYZ [23] Zhao Song, Xin Yang, Yuanyuan Yang, and Lichen Zhang. Sketching meets differential privacy: Fast algorithm for dynamic kronecker projection maintenance. In ICML, 2023.
  • SZKS [21] Charlie Snell, Ruiqi Zhong, Dan Klein, and Jacob Steinhardt. Approximating how single head attention learns. arXiv preprint arXiv:2103.07601, 2021.
  • SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • UAS+ [20] Mohd Usama, Belal Ahmad, Enmin Song, M Shamim Hossain, Mubarak Alrashoud, and Ghulam Muhammad. Attention-based sentiment analysis using convolutional and recurrent neural network. Future Generation Computer Systems, 113:571–578, 2020.
  • VSP+ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wil [12] Virginia Vassilevska Williams. Multiplying matrices faster than coppersmith-winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 887–898, 2012.
  • Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
  • ZHDK [23] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. arXiv preprint arXiv:2302.02451, 2023.
  • ZKV+ [20] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.