
Solving Regularized Exp, Cosh and Sinh Regression Problems

Zhihang Li [email protected]. Huazhong Agricultural University.    Zhao Song [email protected]. Adobe Research.    Tianyi Zhou [email protected]. University of California San Diego.

In modern machine learning, attention computation is a fundamental task for training large language models such as Transformer, GPT-4 and ChatGPT. In this work, we study the exponential regression problem which is inspired by the softmax/exp unit in the attention mechanism in large language models. The standard exponential regression is non-convex. We study the regularized version of the exponential regression problem, which is convex. We use an approximate Newton method to solve it in input sparsity time.

Formally, in this problem, one is given a matrix A\in\mathbb{R}^{n\times d}, b\in\mathbb{R}^{n}, w\in\mathbb{R}^{n} and any of the functions \exp,\cosh and \sinh, denoted as f. The goal is to find the optimal x that minimizes 0.5\|f(Ax)-b\|_{2}^{2}+0.5\|\operatorname{diag}(w)Ax\|_{2}^{2}. The straightforward method is to use the naive Newton's method. Let \mathrm{nnz}(A) denote the number of non-zero entries in the matrix A. Let \omega denote the exponent of matrix multiplication. Currently, \omega\approx 2.373. Let \epsilon denote the accuracy error. In this paper, we make use of the input sparsity and propose an algorithm that uses \log(\|x_{0}-x^{*}\|_{2}/\epsilon) iterations and \widetilde{O}(\mathrm{nnz}(A)+d^{\omega}) time per iteration to solve the problem.

1 Introduction

State-of-the-art language models like Transformer [69], BERT [21], GPT-3 [6], PaLM [18], and OPT [80] exhibit greater proficiency in natural language processing when compared to smaller models or traditional techniques. These models have the capacity to understand and generate intricate language, proving beneficial in various applications such as language translation, sentiment analysis, and question answering. LLMs can be customized for multiple purposes without necessitating their reconstruction from scratch. An instance of this is ChatGPT, a chat application developed by OpenAI that employs GPT-3's full potential. The latest iteration, GPT-4 [53], has the potential to surpass GPT-3 in its impressive capabilities, including text generation, question answering, and language translation. This development could lead to significant implications in the field of NLP, with new applications potentially emerging in areas such as virtual assistants, chatbots, and automatic content creation. However, even though deep learning has seen a swift rise in popularity, we believe there are still gaps in our understanding of attention and the reasoning behind its effectiveness.

The primary technical foundation behind LLMs is the attention matrix [69, 55, 21, 6, 3, 77]. An attention matrix is a matrix that features rows and columns aligning with individual words or ”tokens” and their relationships within a given text. Its purpose is to measure the critical nature of each token in a sequence in relation to the intended output. The attention matrix is learned during training. These parameters are optimized to maximize the model’s accuracy in predicting the desired output. Through the attention mechanism, each input token is evaluated based on its importance or relevance to the desired output. This is achieved by weighing the token score, which is based on a similarity function comparing the current output state and input states.

More formally, the attention matrix can be expressed by considering two matrices, Q and K, containing query and key tokens, respectively. Both Q and K hold values in the n\times d dimensional space. The attention matrix can be denoted by the square matrix A which is of size n\times n. This matrix establishes a relationship between the input tokens in the sequence where every entry represents the attention weight or score between a particular input token (query token Q) and an output token (key token K). It is essential to mention that diagonal entries of this matrix display self-attention scores, signifying the importance of each token with respect to itself. A majority of the methods used for effective computation of attention matrices are divided into two primary categories based on their approach. One approach involves leveraging sparsity, as seen in Reformer [45], while the other involves utilizing the low-rank attributes of the attention matrices, as observed in Linformer [73] and Performer [16].

During training, our primary goal is to tackle the multiple attention regression problem by utilizing the exponential function and its corresponding equation: \min_{X}\|D^{-1}\exp(AX)-B\|_{2}, where D^{-1} denotes the normalization factor. However, the single regression scenario has not been thoroughly studied. As a result, this study centers on the single regression setting.

In our setting, the presence of D is unnecessary due to the fact that Ax is a column vector (not a matrix). Thus, we have opted to address the optimization problem \min_{x}\|\exp(Ax)-b\|_{2}. It is important to note that our approach is not exclusive to the exponential function, and can also be extended to other hyperbolic functions such as \cosh and \sinh, where \cosh(x):=\frac{1}{2}(e^{x}+e^{-x}) and \sinh(x):=\frac{1}{2}(e^{x}-e^{-x}).

1.1 Our Results

We state our result as follows.

Theorem 1.1 (Main result, Informal version of Theorem 6.11).

Given a matrix A\in\mathbb{R}^{n\times d}, b\in\mathbb{R}^{n}, and w\in\mathbb{R}^{n}.

Let f be any of the functions \exp,\cosh and \sinh. Let g denote the gradient of the objective function below.

Let x^{*} denote the optimal solution of

\min_{x\in\mathbb{R}^{d}}0.5\|f(Ax)-b\|_{2}^{2}+0.5\|\operatorname{diag}(w)Ax\|_{2}^{2}

such that g(x^{*})={\bf 0}_{d} and \|x^{*}\|_{2}\leq R.

Let \|A\|\leq R and \|b\|_{2}\leq R. Let w_{i}^{2}>0.5b_{i}^{2}+1+l/\sigma_{\min}(A)^{2} for all i\in[n].

Let M=\exp(6(R^{2}+\log n)).

Let x_{0} denote an initial point such that M\|x_{0}-x^{*}\|_{2}\leq 0.1l.

For any accuracy parameter \epsilon\in(0,0.1) and failure probability \delta\in(0,0.1), there is a randomized algorithm (Algorithm 1) that runs in \log(\|x_{0}-x^{*}\|_{2}/\epsilon) iterations, spends

O((\operatorname{nnz}(A)+d^{\omega})\cdot\operatorname{poly}(\log(n/\delta)))

time per iteration, and finally outputs a vector \widetilde{x}\in\mathbb{R}^{d} such that

\|\widetilde{x}-x^{*}\|_{2}\leq\epsilon

holds with probability at least 1-\delta.

Algorithm 1
1:procedure FastAlgorithm(A\in\mathbb{R}^{n\times d},b\in\mathbb{R}^{n},w\in\mathbb{R}^{n},\epsilon\in(0,0.1),\delta\in(0,0.1)) \triangleright Theorem 1.1
2:     Pick an initialization point x_{0}
3:     T\leftarrow\log(\|x_{0}-x^{*}\|_{2}/\epsilon)
4:     for t=0\to T do
5:         D\leftarrow\operatorname{diag}((2\exp(Ax_{t})-b)\circ\exp(Ax_{t})+w\circ w)
6:         \widetilde{D}\leftarrow\textsc{SubSample}(D,A,\epsilon_{1}=\Theta(1),\delta_{1}=\delta/T) \triangleright Lemma 6.6
7:         g\leftarrow A^{\top}\operatorname{diag}(\exp(Ax_{t})\circ(\exp(Ax_{t})-b)){\bf 1}_{n}+A^{\top}\operatorname{diag}(w\circ w)Ax_{t}
8:         \triangleright H=A^{\top}DA is the exact Hessian
9:         \triangleright x_{t+1}\leftarrow x_{t}-H^{-1}g is the exact update
10:         \widetilde{H}\leftarrow A^{\top}\widetilde{D}A \triangleright \widetilde{H} is an approximation of H
11:         x_{t+1}\leftarrow x_{t}-\widetilde{H}^{-1}g \triangleright This is an approximate update step
12:     end for
13:     \widetilde{x}\leftarrow x_{T+1}
14:     return \widetilde{x}
15:end procedure
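
To make the procedure above concrete, the following NumPy sketch implements one possible instantiation of the approximate Newton iteration for the regularized exp regression. It is illustrative only: it replaces the SubSample routine (a sketching-based sampling of the rows, see Lemma 6.6) with plain uniform row sampling, uses the full gradient of L_{\exp,\operatorname{reg}} from Lemma 3.6, and applies the standard Newton descent step x_{t+1}=x_{t}-\widetilde{H}^{-1}g; the function names, sampling fraction, and synthetic data are assumptions made for the example, not the paper's exact procedure.

```python
import numpy as np

def regularized_exp_newton(A, b, w, x0, T=20, sample_frac=0.5, seed=0):
    """Approximate Newton sketch for min_x 0.5*||exp(Ax)-b||_2^2 + 0.5*||diag(w)Ax||_2^2.

    Illustrative stand-in: the diagonal D is subsampled uniformly instead of
    with the sketching-based SubSample routine used in the paper.
    """
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    x = x0.copy()
    for _ in range(T):
        u = np.exp(A @ x)                                        # exp(Ax)
        grad = A.T @ (u * (u - b)) + A.T @ ((w**2) * (A @ x))    # Lemma 3.6, Part 1
        D = (2.0 * u - b) * u + w**2                             # diagonal of D (Lemma 3.6, Part 2)
        m = max(1, int(sample_frac * n))
        idx = rng.choice(n, size=m, replace=False)               # crude uniform subsample
        H_tilde = (n / m) * (A[idx].T * D[idx]) @ A[idx]         # approximates A^T diag(D) A
        x = x - np.linalg.solve(H_tilde, grad)                   # approximate Newton step
    return x

# tiny synthetic usage example
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d = 200, 5
    A = 0.1 * rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    w = 3.0 * np.ones(n)            # weights large enough to keep the Hessian positive definite
    x = regularized_exp_newton(A, b, w, x0=np.zeros(d))
    u = np.exp(A @ x)
    print("final gradient norm:", np.linalg.norm(A.T @ (u * (u - b)) + A.T @ ((w**2) * (A @ x))))
```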

1.2 Related Work

Input sparsity time algorithms

Input sparsity is a term used to describe datasets in which the majority of elements are either zero or negligible. Utilizing algorithms optimized for sparse input data enables faster processing times than those of dense data algorithms. This is because these algorithms can focus solely on non-zero elements, thereby minimizing computational and memory usage. A sparse algorithm's time complexity depends only on the number of non-zero elements rather than the total number of elements. Input sparsity algorithms are highly applicable within fields such as solving the John Ellipsoid problem [65], discrepancy minimization [26], low-rank approximation [19, 56, 60, 63], subspace embeddings [51], column subset selection [61, 62] and least squares regression [27].

Algorithmic Regularization

The standard exponential regression is non-convex, so we study the regularized version of the exponential regression problem, which is convex. In the context of deep learning, the non-convexity of the objective function necessitates the use of regularization techniques. This is essential because the objective function may admit multiple global minima that are widely scattered and vary significantly in their generalization capabilities. Algorithmic regularization can be observed in various machine learning applications such as binary classification [58, 48, 10], matrix factorization [30, 1], convolutional neural networks [30, 42], generative adversarial networks [4], contrastive learning [72] and mixture of experts [13]. There are numerous factors that can induce algorithmic regularization. [29, 33, 46, 59, 50] introduce how the learning rate and batch size help regularize the training process. [50] explains that while a small initial learning rate may result in prompt training and better performance initially, a larger learning rate tends to yield improved generalization shortly after the annealing of the learning rate. The regularizer's role in the GAN model is expounded in [4]. In addition, [38] has conducted a theoretical analysis of the regularization generated by momentum. Furthermore, the employment of an adaptive step-size optimizer, such as the Adam optimizer, can also function as a form of regularization during the training process [44, 52, 20, 74, 75, 40]. Batch normalization is analyzed in a theoretical capacity in [2, 32, 36], which delves into its impact on regularization. A separate study conducted by [57, 71] explains how introducing dropout into the training process can prevent overfitting. In [25], the authors apply TensorSketch to the Kronecker product of multiple matrices efficiently without explicitly computing the tensor product, and also apply regularization to solve Kronecker product regression.

Attention Theory

The topic of attention's expressivity has been a focus of early theoretical works. In the case of self-attention blocks, [68] interpret self-attention as a system of self-interacting particles and theoretically explain the attention mechanism. [28] explain it from the inductive bias and variable creation perspective. The role of attention in Transformers was studied by [70, 22]. In terms of optimization, [78] examined the impact of adaptive approaches on attention models, while [66] analyzed the dynamics of single-head attention to approximate the learning process of a Seq2Seq architecture. For most LLMs, it generally suffices to conduct attention computations in an approximate manner during the inference process, provided that there are adequate assurances of accuracy. Research conducted by various sources such as [15, 45, 73, 23, 47, 12, 11] has underscored this perspective. Due to that motivation, [77, 3] study the computation of the attention matrix from the hardness perspective and propose faster algorithms.

Newton Method and Hessian Computation

Computing the Hessian, or approximately computing the Hessian, is a standard task in convex optimization. Many previous works have studied this direction and used it to improve several optimization problems such as linear programming [17, 49, 8, 5, 64, 43, 24, 9, 31], empirical risk minimization [49, 54], the cutting plane method [39], semi-definite programming [37, 34, 31], sum of squares [41], and training over-parameterized neural networks [79, 14, 7, 67, 35, 76].

Roadmap.

We organize the rest of the paper as follows. In Section 2 we provide some tools for basic algebra and the analysis of a regularization term L_{\operatorname{reg}}. In Section 3 we provide a detailed analysis of the loss function based on \exp(x) (denoted as L_{\exp}) and the loss function with a regularization term, L_{\exp,\operatorname{reg}}. In Section 4 we provide a detailed analysis of L_{\cosh} and L_{\cosh,\operatorname{reg}}. In Section 5 we provide a detailed analysis of L_{\sinh} and L_{\sinh,\operatorname{reg}}. In Section 6 we provide an approximate version of Newton's method for solving convex optimization problems, which is more efficient under certain assumptions.

2 Preliminary

In this section, we provide preliminaries to be used in our paper. In Section 2.1 we introduce the notations we use. In Section 2.2 we provide some facts about exact and approximate computations. In Section 2.3, we provide some basic facts regarding the gradient and Hessian. In Section 2.4, we compute the gradient and Hessian of a regularization term L_{\operatorname{reg}}.

2.1 Notations

We use \mathbb{R} to denote the set of real numbers. We use \mathbb{R}_{\geq 0} to denote the set of non-negative real numbers.

For any vector x\in\mathbb{R}^{n}, we use \exp(x)\in\mathbb{R}^{n} to denote the vector whose i-th entry is \exp(x_{i}).

For a vector x\in\mathbb{R}^{n}, we use \|x\|_{2} to denote its \ell_{2} norm, i.e., \|x\|_{2}:=(\sum_{i=1}^{n}x_{i}^{2})^{1/2}.

For any matrix A\in\mathbb{R}^{n\times k}, we denote the spectral norm of A by \|A\|, i.e.,

\|A\|:=\sup_{x\in\mathbb{R}^{k}}\|Ax\|_{2}/\|x\|_{2}.

For a matrix A, we use \sigma_{\max}(A) to denote the largest singular value of A. We use \sigma_{\min}(A) to denote the smallest singular value of A.

For a vector x\in\mathbb{R}^{n}, we use \|x\|_{\infty} to denote \max_{i\in[n]}|x_{i}|.

For each x\in\mathbb{R}^{n}, y\in\mathbb{R}^{n}, we use x\circ y to denote the length-n vector whose i-th entry is x_{i}y_{i} for all i\in[n].

For a vector x\in\mathbb{R}^{n} with entries x_{i}, we use \operatorname{diag}(x)=A to denote the diagonal matrix A\in\mathbb{R}^{n\times n} whose diagonal entries are A_{i,i}=x_{i} for all i\in[n].

Given two column vectors a,b\in\mathbb{R}^{n}, we use a\circ b to denote the column vector whose i-th entry (a\circ b)_{i} is a_{i}b_{i}.

We use {\bf 1}_{n} to denote the length-n vector whose entries are all ones.

We say A\succeq B if x^{\top}Ax\geq x^{\top}Bx for all vectors x.

We define \cosh(x)=\frac{1}{2}(\exp(x)+\exp(-x)) and \sinh(x)=\frac{1}{2}(\exp(x)-\exp(-x)).

For any given matrix A\in\mathbb{R}^{n\times d}, we use \operatorname{nnz}(A) to denote the number of non-zero entries of A, i.e., \operatorname{nnz}(A):=|\{(i,j)\in[n]\times[d]~{}|~{}A_{i,j}\neq 0\}|.

For a diagonal matrix D\in\mathbb{R}^{n\times n}, we say D is a k-sparse diagonal matrix if k=|\{i\in[n]~{}|~{}D_{i,i}\neq 0\}|.

For any function f, we use \widetilde{O}(f) to denote f\cdot\operatorname{poly}(\log f).

2.2 Basic Algebras

We first state some standard facts that give exact identities.

Fact 2.1 (exp\exp, cosh\cosh, sinh\sinh exact and approximate for general range).

We have

  • dexp(x)dx=exp(x)\frac{\mathrm{d}\exp(x)}{\mathrm{d}x}=\exp(x)

  • dcosh(x)dx=sinh(x)\frac{\mathrm{d}\cosh(x)}{\mathrm{d}x}=\sinh(x)

  • dsinh(x)dx=cosh(x)\frac{\mathrm{d}\sinh(x)}{\mathrm{d}x}=\cosh(x)

  • cosh2(x)sinh2(x)=1\cosh^{2}(x)-\sinh^{2}(x)=1

  • |exp(x)|exp(|x|)|\exp(x)|\leq\exp(|x|)

  • |cosh(x)|=cosh(|x|)exp(|x|)|\cosh(x)|=\cosh(|x|)\leq\exp(|x|)

  • |sinh(x)|=sinh(|x|)|\sinh(x)|=\sinh(|x|)

  • exp(x)=i=01i!xi\exp(x)=\sum_{i=0}^{\infty}\frac{1}{i!}x^{i}

  • cosh(x)=i=01(2i)!x2i\cosh(x)=\sum_{i=0}^{\infty}\frac{1}{(2i)!}x^{2i}

  • sinh(x)=i=01(2i+1)!x2i+1\sinh(x)=\sum_{i=0}^{\infty}\frac{1}{(2i+1)!}x^{2i+1}

Next, we state some facts that give reasonable approximations in a small range.

Fact 2.2 (exp\exp, cosh\cosh, sinh\sinh approximate computation in small range).

We have

  • Part 1. For any xx satisfy that |x|0.1|x|\leq 0.1, we have |exp(x)1|2|x||\exp(x)-1|\leq 2|x|

  • Part 2. For any xx satisfy that |x|0.1|x|\leq 0.1, we have |cosh(x)1|x2|\cosh(x)-1|\leq x^{2}

  • Part 3. For any xx satisfy that |x|0.1|x|\leq 0.1, we have |sinh(x)|2|x||\sinh(x)|\leq 2|x|

  • Part 4. For any x,yx,y satisfy that |xy|0.1|x-y|\leq 0.1, we have |exp(x)exp(y)|exp(x)2|xy||\exp(x)-\exp(y)|\leq\exp(x)\cdot 2|x-y|

  • Part 5. For any x,yx,y satisfy that |xy|0.1|x-y|\leq 0.1, we have |cosh(x)cosh(y)|cosh(x)2|xy||\cosh(x)-\cosh(y)|\leq\cosh(x)\cdot 2|x-y|

  • Part 6. For any x,yx,y satisfy that |xy|0.1|x-y|\leq 0.1, we have |sinh(x)sinh(y)|cosh(x)2|xy||\sinh(x)-\sinh(y)|\leq\cosh(x)\cdot 2|x-y|

Proof.

Most of the proofs are standard, we only provide proofs for some of them.

Proof of Part 5.

We have

|cosh(x)cosh(y)|=\displaystyle|\cosh(x)-\cosh(y)|= |0.5exp(x)+0.5exp(x)0.5exp(y)0.5exp(y)|\displaystyle~{}|0.5\exp(x)+0.5\exp(-x)-0.5\exp(y)-0.5\exp(-y)|
\displaystyle\leq |0.5exp(x)0.5exp(y)|+|0.5exp(x)0.5exp(y)|\displaystyle~{}|0.5\exp(x)-0.5\exp(y)|+|0.5\exp(-x)-0.5\exp(-y)|
\displaystyle\leq 0.5exp(x)|1exp(yx)|+0.5exp(x)|1exp(y+x)|\displaystyle~{}0.5\exp(x)|1-\exp(y-x)|+0.5\exp(-x)|1-\exp(-y+x)|
\displaystyle\leq 0.5exp(x)2|yx|+0.5exp(x)2|yx|\displaystyle~{}0.5\exp(x)\cdot 2|y-x|+0.5\exp(-x)\cdot 2|y-x|
=\displaystyle= cosh(x)2|xy|\displaystyle~{}\cosh(x)\cdot 2|x-y|

where the first step follows from the definition of \cosh, the second step follows from the triangle inequality, the third step follows from simple algebra, the fourth step follows from the fact that |\exp(t)-1|\leq 2|t| for all |t|\leq 0.1, and the last step follows from the definition of \cosh.

Proof of Part 6.

We have

|sinh(x)sinh(y)|=\displaystyle|\sinh(x)-\sinh(y)|= |0.5exp(x)0.5exp(x)0.5exp(y)+0.5exp(y)|\displaystyle~{}|0.5\exp(x)-0.5\exp(-x)-0.5\exp(y)+0.5\exp(-y)|
\displaystyle\leq |0.5exp(x)0.5exp(y)|+|0.5exp(y)0.5exp(x)|\displaystyle~{}|0.5\exp(x)-0.5\exp(y)|+|0.5\exp(-y)-0.5\exp(-x)|
=\displaystyle= 0.5exp(x)|1exp(yx)|+0.5exp(x)|exp(xy)1|\displaystyle~{}0.5\exp(x)|1-\exp(y-x)|+0.5\exp(-x)|\exp(x-y)-1|
\displaystyle\leq 0.5exp(x)2|yx|+0.5exp(x)2|yx|\displaystyle~{}0.5\exp(x)\cdot 2|y-x|+0.5\exp(-x)\cdot 2|y-x|
=\displaystyle= cosh(x)2|xy|\displaystyle~{}\cosh(x)\cdot 2|x-y|

where the first step follows from the definition of \sinh, the second step follows from the triangle inequality, the third step follows from simple algebra, the fourth step follows from the fact that |\exp(t)-1|\leq 2|t| for all |t|\leq 0.1, and the last step follows from the definition of \cosh. ∎
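
As a quick numerical sanity check of Parts 4 through 6 of Fact 2.2 (an illustration, not part of the proof, assuming only NumPy), one can sample random pairs with |x-y|\leq 0.1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=10_000)
y = x + rng.uniform(-0.1, 0.1, size=10_000)   # enforce |x - y| <= 0.1

# Part 4: |exp(x) - exp(y)| <= exp(x) * 2|x - y|
assert np.all(np.abs(np.exp(x) - np.exp(y)) <= np.exp(x) * 2 * np.abs(x - y))
# Part 5: |cosh(x) - cosh(y)| <= cosh(x) * 2|x - y|
assert np.all(np.abs(np.cosh(x) - np.cosh(y)) <= np.cosh(x) * 2 * np.abs(x - y))
# Part 6: |sinh(x) - sinh(y)| <= cosh(x) * 2|x - y|
assert np.all(np.abs(np.sinh(x) - np.sinh(y)) <= np.cosh(x) * 2 * np.abs(x - y))
print("Fact 2.2, Parts 4-6: all sampled checks passed")
```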

Fact 2.3.

For any vectors an,bna\in\mathbb{R}^{n},b\in\mathbb{R}^{n} and cnc\in\mathbb{R}^{n}, we have

  • ab=ba=diag(a)b=diag(b)a=diag(a)diag(b)𝟏na\circ b=b\circ a=\operatorname{diag}(a)\cdot b=\operatorname{diag}(b)\cdot a=\operatorname{diag}(a)\cdot\operatorname{diag}(b)\cdot{\bf 1}_{n}

  • a(bc)=adiag(b)ca^{\top}(b\circ c)=a^{\top}\operatorname{diag}(b)c

    • a(bc)=b(ac)=c(ab)a^{\top}(b\circ c)=b^{\top}(a\circ c)=c^{\top}(a\circ b)

    • adiag(b)c=bdiag(a)c=adiag(c)ba^{\top}\operatorname{diag}(b)c=b^{\top}\operatorname{diag}(a)c=a^{\top}\operatorname{diag}(c)b

  • diag(ab)=diag(a)diag(b)\operatorname{diag}(a\circ b)=\operatorname{diag}(a)\operatorname{diag}(b)

  • diag(a)+diag(b)=diag(a+b)\operatorname{diag}(a)+\operatorname{diag}(b)=\operatorname{diag}(a+b)
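
The identities in Fact 2.3 are easy to confirm numerically; the following short NumPy check (an illustration, not a proof) verifies them on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
a, b, c = rng.standard_normal((3, n))
ones = np.ones(n)

# a o b = diag(a) b = diag(b) a = diag(a) diag(b) 1_n
assert np.allclose(a * b, np.diag(a) @ b)
assert np.allclose(a * b, np.diag(b) @ a)
assert np.allclose(a * b, np.diag(a) @ np.diag(b) @ ones)
# a^T (b o c) = a^T diag(b) c = b^T (a o c) = c^T (a o b)
assert np.allclose(a @ (b * c), a @ np.diag(b) @ c)
assert np.allclose(a @ (b * c), b @ (a * c))
assert np.allclose(a @ (b * c), c @ (a * b))
# diag(a o b) = diag(a) diag(b)  and  diag(a) + diag(b) = diag(a + b)
assert np.allclose(np.diag(a * b), np.diag(a) @ np.diag(b))
assert np.allclose(np.diag(a) + np.diag(b), np.diag(a + b))
print("Fact 2.3: all identities hold on this sample")
```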

Fact 2.4.

For vectors a,bna,b\in\mathbb{R}^{n}, we have

  • Part 1. ab2ab2\|a\circ b\|_{2}\leq\|a\|_{\infty}\cdot\|b\|_{2}

  • Part 2. aa2na\|a\|_{\infty}\leq\|a\|_{2}\leq\sqrt{n}\cdot\|a\|_{\infty}

  • Part 3. exp(a)exp(a2)\|\exp(a)\|_{\infty}\leq\exp(\|a\|_{2})

  • Part 4. cosh(a)cosh(a2)exp(a2)\|\cosh(a)\|_{\infty}\leq\cosh(\|a\|_{2})\leq\exp(\|a\|_{2})

  • Part 5. sinh(a)sinh(a2)cosh(a2)exp(a2)\|\sinh(a)\|_{\infty}\leq\sinh(\|a\|_{2})\leq\cosh(\|a\|_{2})\leq\exp(\|a\|_{2})

  • Part 6. cosh(a)cosh(a)sinh(a)sinh(a)=𝟏n\cosh(a)\circ\cosh(a)-\sinh(a)\circ\sinh(a)={\bf 1}_{n}

  • Part 7. For any ab0.01\|a-b\|_{\infty}\leq 0.01, we have exp(a)exp(b)2exp(a)22ab\|\exp(a)-\exp(b)\|_{2}\leq\|\exp(a)\|_{2}\cdot 2\|a-b\|_{\infty}

  • Part 8. For any ab0.01\|a-b\|_{\infty}\leq 0.01, we have cosh(a)cosh(b)2cosh(a)22ab\|\cosh(a)-\cosh(b)\|_{2}\leq\|\cosh(a)\|_{2}\cdot 2\|a-b\|_{\infty}

  • Part 9. For any ab0.01\|a-b\|_{\infty}\leq 0.01, we have sinh(a)sinh(b)2cosh(a)22ab\|\sinh(a)-\sinh(b)\|_{2}\leq\|\cosh(a)\|_{2}\cdot 2\|a-b\|_{\infty}

Proof.

Most of the proofs are standard, we only provide proofs for some of them.

Proof of Part 8.

We have

|cosh(ai)cosh(bi)|\displaystyle|\cosh(a_{i})-\cosh(b_{i})|\leq |cosh(ai)|2|aibi|\displaystyle~{}|\cosh(a_{i})|\cdot 2|a_{i}-b_{i}|
\displaystyle\leq |cosh(ai)|2ab,\displaystyle~{}|\cosh(a_{i})|\cdot 2\|a-b\|_{\infty},

where the first step follows from Part 5 in Fact 2.2 and the last step follows from the definition of norm. By summing up the square of both sides, we have

cosh(a)cosh(b)2cosh(a)22ab\displaystyle\|\cosh(a)-\cosh(b)\|_{2}\leq\|\cosh(a)\|_{2}\cdot 2\|a-b\|_{\infty}

Proof of Part 9.

We have

|sinh(ai)sinh(bi)|\displaystyle|\sinh(a_{i})-\sinh(b_{i})|\leq |cosh(ai)|2|aibi|\displaystyle~{}|\cosh(a_{i})|\cdot 2|a_{i}-b_{i}|
\displaystyle\leq |cosh(ai)|2ab,\displaystyle~{}|\cosh(a_{i})|\cdot 2\|a-b\|_{\infty},

where the first step follows from Part 6 in Fact 2.2 and the last step follows from the definition of norm. By summing up the square of both sides, we have

sinh(a)sinh(b)2cosh(a)22ab\displaystyle\|\sinh(a)-\sinh(b)\|_{2}\leq\|\cosh(a)\|_{2}\cdot 2\|a-b\|_{\infty}

Fact 2.5.

For matrices A,BA,B, we have

  • A=A\|A^{\top}\|=\|A\|

  • ABAB\|A\|\geq\|B\|-\|A-B\|

  • A+BA+B\|A+B\|\leq\|A\|+\|B\|

  • ABAB\|A\cdot B\|\leq\|A\|\cdot\|B\|

  • If AαBA\preceq\alpha\cdot B, then AαB\|A\|\leq\alpha\cdot\|B\|

Fact 2.6.

Let ϵ(0,0.1)\epsilon\in(0,0.1). If

(1ϵ)AB(1+ϵ)A\displaystyle(1-\epsilon)A\preceq B\preceq(1+\epsilon)A

Then

  • ϵABAϵA-\epsilon A\preceq B-A\preceq\epsilon A

  • ϵA1/2(BA)A1/2ϵ-\epsilon\preceq A^{-1/2}(B-A)A^{-1/2}\preceq\epsilon

  • A1/2(BA)A1/2ϵ\|A^{-1/2}(B-A)A^{-1/2}\|\leq\epsilon

  • A1(BA)ϵ\|A^{-1}(B-A)\|\leq\epsilon

  • (1+\epsilon)^{-1}B\preceq A\preceq(1-\epsilon)^{-1}B

  • (12ϵ)BA(1+2ϵ)B(1-2\epsilon)B\preceq A\preceq(1+2\epsilon)B

  • B1/2(BA)B1/22ϵ\|B^{-1/2}(B-A)B^{-1/2}\|\leq 2\epsilon

  • B1(BA)2ϵ\|B^{-1}(B-A)\|\leq 2\epsilon

2.3 Standard Gradient and Hessian Computation

Lemma 2.7 (Standard Gradient and Hessian Computation).

We have

  • Part 1. dAxdt=Adxdt\frac{\mathrm{d}Ax}{\mathrm{d}t}=A\frac{\mathrm{d}x}{\mathrm{d}t}.

  • Part 2. dAxdxi=A,in\frac{\mathrm{d}Ax}{\mathrm{d}x_{i}}=A_{*,i}\in\mathbb{R}^{n}

  • Part 3. d2Axdxi2=0\frac{\mathrm{d}^{2}Ax}{\mathrm{d}x_{i}^{2}}=0

Proof.

Proof of Part 1.

It trivially follows from chain rule.

Proof of Part 2.

Taking the derivative of Ax with respect to the entry x_{i} directly gives the i-th column A_{*,i}.

Proof of Part 3.

d2Axdxi2=\displaystyle\frac{\mathrm{d}^{2}Ax}{\mathrm{d}x_{i}^{2}}= ddxi(dAxdxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}Ax}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= dA,idxi\displaystyle~{}\frac{\mathrm{d}A_{*,i}}{\mathrm{d}x_{i}}
=\displaystyle= 0\displaystyle~{}0

where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, and the last step is due to the constant entries of the matrix A,iA_{*,i}.

2.4 Regularization term

Lemma 2.8.

Let w\in\mathbb{R}^{n} denote a weight vector. Let W=\operatorname{diag}(w)\in\mathbb{R}^{n\times n} denote the corresponding diagonal matrix. Let L_{\operatorname{reg}}(x):=0.5\|WAx\|_{2}^{2}.

Then we have

  • Part 1. Gradient

    dLregdx=AW2Ax\displaystyle\frac{\mathrm{d}L_{\operatorname{reg}}}{\mathrm{d}x}=A^{\top}W^{2}Ax
  • Part 2. Hessian

    d2Lregdx2=AW2A\displaystyle\frac{\mathrm{d}^{2}L_{\operatorname{reg}}}{\mathrm{d}x^{2}}=A^{\top}W^{2}A
Proof.

Proof of Part 1.

dLregdx=\displaystyle\frac{\mathrm{d}L_{\operatorname{reg}}}{\mathrm{d}x}= (ddx(WAx))(WAx)\displaystyle~{}(\frac{\mathrm{d}}{\mathrm{d}x}(WAx))^{\top}\cdot(WAx)
=\displaystyle= AWWAx\displaystyle~{}A^{\top}W^{\top}\cdot WAx
=\displaystyle= AW2Ax\displaystyle~{}A^{\top}W^{2}Ax

where the first step follows from chain rule.

Proof of Part 2.

d2Lregdx2=\displaystyle\frac{\mathrm{d}^{2}L_{\operatorname{reg}}}{\mathrm{d}x^{2}}= ddx(dLregdx)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}\Big{(}\frac{\mathrm{d}L_{\operatorname{reg}}}{\mathrm{d}x}\Big{)}
=\displaystyle= ddx(AW2Ax)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x}\Big{(}A^{\top}W^{2}Ax\Big{)}
=\displaystyle= AW2A\displaystyle~{}A^{\top}W^{2}A

where the first step follows from the expansion of the Hessian, the second step follows from applying the arguments in Part 1, and the third step follows from simple algebra. ∎
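
The two formulas in Lemma 2.8 can be sanity-checked against central finite differences; the snippet below is an illustrative NumPy check on a small random instance (the dimensions and step sizes are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
A = rng.standard_normal((n, d))
w = rng.standard_normal(n)
x = rng.standard_normal(d)

L_reg = lambda v: 0.5 * np.linalg.norm(w * (A @ v))**2      # 0.5 * ||W A v||_2^2
I = np.eye(d)

# Part 1: gradient A^T W^2 A x, checked with central differences
grad = A.T @ (w**2 * (A @ x))
h = 1e-6
num_grad = np.array([(L_reg(x + h*e) - L_reg(x - h*e)) / (2*h) for e in I])
assert np.allclose(grad, num_grad, atol=1e-6)

# Part 2: Hessian A^T W^2 A, checked with second-order central differences
hess = A.T @ (A * (w**2)[:, None])
h = 1e-3
num_hess = np.array([[(L_reg(x + h*(ei + ej)) - L_reg(x + h*(ei - ej))
                       - L_reg(x - h*(ei - ej)) + L_reg(x - h*(ei + ej))) / (4*h**2)
                      for ej in I] for ei in I])
assert np.allclose(hess, num_hess, atol=1e-6)
print("Lemma 2.8: both formulas match finite differences")
```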

Lemma 2.9.

Let An×dA\in\mathbb{R}^{n\times d}.

Let Dn×nD\in\mathbb{R}^{n\times n} denote a diagonal matrix where all diagonal entries are positive. Then we have

ADAσmin(A)2mini[n]Di,iId\displaystyle A^{\top}DA\succeq\sigma_{\min}(A)^{2}\cdot\min_{i\in[n]}D_{i,i}\cdot I_{d}
Proof.

For any xdx\in\mathbb{R}^{d}, we have

xADAx=\displaystyle x^{\top}A^{\top}DAx= DAx22\displaystyle~{}\|\sqrt{D}Ax\|_{2}^{2}
=\displaystyle= i=1nDi,i(Ax)i2\displaystyle~{}\sum_{i=1}^{n}D_{i,i}(Ax)_{i}^{2}
\displaystyle\geq (mini[n]Di,i)i=1n(Ax)i2\displaystyle~{}(\min_{i\in[n]}D_{i,i})\cdot\sum_{i=1}^{n}(Ax)_{i}^{2}
\displaystyle\geq mini[n]Di,iAx22\displaystyle~{}\min_{i\in[n]}D_{i,i}\cdot\|Ax\|_{2}^{2}
\displaystyle\geq mini[n]Di,iσmin(A)2x22\displaystyle~{}\min_{i\in[n]}D_{i,i}\cdot\sigma_{\min}(A)^{2}\cdot\|x\|_{2}^{2}
=\displaystyle= x(mini[n]Di,iσmin(A)2Id)x\displaystyle~{}x^{\top}\cdot(\min_{i\in[n]}D_{i,i}\cdot\sigma_{\min}(A)^{2}\cdot I_{d})x

where the first step follows from simple algebra, the second step follows from the definition of x2\|x\|_{2}, the third step follows from simple algebra, the fourth step follows from the definition of x2\|x\|_{2}, the fifth step follows from definition of σmin(A)\sigma_{\min}(A) , the last step follows from simple algebra . ∎
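
A quick numerical illustration of the bound in Lemma 2.9 (the random instance is an assumption made for demonstration) compares the smallest eigenvalue of A^{\top}DA with \sigma_{\min}(A)^{2}\cdot\min_{i}D_{i,i}:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 4
A = rng.standard_normal((n, d))
D = np.diag(rng.uniform(0.5, 2.0, size=n))        # positive diagonal entries

lhs = A.T @ D @ A
sigma_min = np.linalg.svd(A, compute_uv=False).min()
bound = sigma_min**2 * D.diagonal().min()

# A^T D A >= sigma_min(A)^2 * min_i D_ii * I_d  iff  lambda_min(A^T D A) >= bound
assert np.linalg.eigvalsh(lhs).min() >= bound - 1e-10
print("Lemma 2.9: lambda_min(A^T D A) =", np.linalg.eigvalsh(lhs).min(), ">=", bound)
```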

3 Exponential Regression

In this section, we provide a detailed analysis of L_{\exp}. In Section 3.1 we define the loss function L_{\exp} based on \exp(x). In Section 3.2 we compute the gradient of L_{\exp} in detail. In Section 3.3 we compute the Hessian of L_{\exp} in detail. In Section 3.4, we summarize the results of Section 3.2 and Section 3.3 and obtain the gradient \nabla L_{\exp} and Hessian \nabla^{2}L_{\exp} of L_{\exp}. In Section 3.5 we define L_{\exp,\operatorname{reg}} by adding the regularization term L_{\operatorname{reg}} from Section 2.4 to L_{\exp} and compute the gradient \nabla L_{\exp,\operatorname{reg}} and Hessian \nabla^{2}L_{\exp,\operatorname{reg}} of L_{\exp,\operatorname{reg}}. In Section 3.6 we prove that \nabla^{2}L_{\exp,\operatorname{reg}}\succ 0 and thus show that L_{\exp,\operatorname{reg}} is convex. In Section 3.7 we provide an upper bound for \|\nabla^{2}L_{\exp,\operatorname{reg}}(x)-\nabla^{2}L_{\exp,\operatorname{reg}}(y)\| and thus prove that \nabla^{2}L_{\exp,\operatorname{reg}} is Lipschitz.

3.1 Definitions

Definition 3.1 (Loss function for Exp Regression).

Given An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}. For a vector xdx\in\mathbb{R}^{d}, we define loss function Lexp(x)L_{\exp}(x) as follows

Lexp(x):=0.5exp(Ax)b22\displaystyle L_{\exp}(x):=0.5\cdot\|\exp(Ax)-b\|_{2}^{2}

3.2 Gradient

Lemma 3.2 (Gradient for Exp).

We have

  • Part 1.

    d(exp(Ax)b)dt=exp(Ax)Adxdt\displaystyle\frac{\mathrm{d}(\exp(Ax)-b)}{\mathrm{d}t}=\exp(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t}
  • Part 2.

    dLexpdt=(exp(Ax)b)(exp(Ax)Adxdt)\displaystyle\frac{\mathrm{d}L_{\exp}}{\mathrm{d}t}=(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t})

Further, we have for each i[d]i\in[d]

  • Part 3.

    d(exp(Ax)b)dxi=exp(Ax)A,i\displaystyle\frac{\mathrm{d}(\exp(Ax)-b)}{\mathrm{d}x_{i}}=\exp(Ax)\circ A_{*,i}
  • Part 4.

    dLexpdxi=\displaystyle\frac{\mathrm{d}L_{\exp}}{\mathrm{d}x_{i}}= (exp(Ax)b)(exp(Ax)A,i)\displaystyle~{}(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ A_{*,i})
  • Part 5.

    dLexpdx=Adiag(exp(Ax))(exp(Ax)b)\displaystyle\frac{\mathrm{d}L_{\exp}}{\mathrm{d}x}=A^{\top}\operatorname{diag}(\exp(Ax))(\exp(Ax)-b)
Proof.

Proof of Part 1.

For each i[n]i\in[n], we have

d(exp(Ax)b)idt=\displaystyle\frac{\mathrm{d}(\exp(Ax)-b)_{i}}{\mathrm{d}t}= exp(Ax)id(Ax)idt\displaystyle~{}\exp(Ax)_{i}\cdot\frac{\mathrm{d}(Ax)_{i}}{\mathrm{d}t}
=\displaystyle= exp(Ax)i(Adx)idt\displaystyle~{}\exp(Ax)_{i}\cdot\frac{(A\mathrm{d}x)_{i}}{\mathrm{d}t}

where the first and second step follow from the differential chain rule.

Proof of Part 2.

We have

dLexpdt=\displaystyle\frac{\mathrm{d}L_{\exp}}{\mathrm{d}t}= (exp(Ax)b)d(exp(Ax)b)dt\displaystyle~{}(\exp(Ax)-b)^{\top}\cdot\frac{\mathrm{d}(\exp(Ax)-b)}{\mathrm{d}t}
=\displaystyle= (exp(Ax)b)(exp(Ax)Adxdt)\displaystyle~{}(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t})

where the first step follows from the differential chain rule, and the second step follows from d(exp(Ax)b)dt=exp(Ax)Adxdt\frac{\mathrm{d}(\exp(Ax)-b)}{\mathrm{d}t}=\exp(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t} in Part 1.

Proof of Part 3.

We have

d(exp(Ax)b)dxi=\displaystyle\frac{\mathrm{d}(\exp(Ax)-b)}{\mathrm{d}x_{i}}= d(exp(Ax))dxidbdxi\displaystyle~{}\frac{\mathrm{d}(\exp(Ax))}{\mathrm{d}x_{i}}-\frac{\mathrm{d}b}{\mathrm{d}x_{i}}
=\displaystyle= exp(Ax)dAxdxi0\displaystyle~{}\exp(Ax)\circ\frac{\mathrm{d}Ax}{\mathrm{d}x_{i}}-0
=\displaystyle= exp(Ax)A,i\displaystyle~{}\exp(Ax)\circ A_{*,i}

where the first step follows from the property of the gradient, the second step follows from simple algebra, and the last step directly follows from Lemma 2.7.

Proof of Part 4.

Substituting x_{i} for t in Part 2, we get

dLdxi=\displaystyle\frac{\mathrm{d}L}{\mathrm{d}x_{i}}= (exp(Ax)b)(exp(Ax)Adxdxi)\displaystyle~{}(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}x_{i}})
=\displaystyle= (exp(Ax)b)(exp(Ax)A,i)\displaystyle~{}(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(exp(Ax))(exp(Ax)b)\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\exp(Ax))(\exp(Ax)-b)

where the first step follows from the result of Part 2 and dy2dx=2ydydx\frac{\mathrm{d}y^{2}}{\mathrm{d}x}=2y^{\top}\frac{\mathrm{d}y}{\mathrm{d}x}, the second step follows from the result of Lemma 2.7, the last step follows from Fact 2.3.

Proof of Part 5.

We have

dLdx=\displaystyle\frac{\mathrm{d}L}{\mathrm{d}x}= Adiag(exp(Ax))(exp(Ax)b)\displaystyle~{}A^{\top}\operatorname{diag}(\exp(Ax))(\exp(Ax)-b)

where this step follows from the result of Part 4 directly. ∎

3.3 Hessian

Lemma 3.3.
  • Part 1.

    d2(exp(Ax)b)dxi2=\displaystyle\frac{\mathrm{d}^{2}(\exp(Ax)-b)}{\mathrm{d}x_{i}^{2}}= A,iexp(Ax)A,i\displaystyle~{}A_{*,i}\circ\exp(Ax)\circ A_{*,i}
  • Part 2.

    d2(exp(Ax)b)dxidxj=\displaystyle\frac{\mathrm{d}^{2}(\exp(Ax)-b)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= A,jexp(Ax)A,i\displaystyle~{}A_{*,j}\circ\exp(Ax)\circ A_{*,i}
  • Part 3.

    d2Lexpdxi2=\displaystyle\frac{\mathrm{d}^{2}L_{\exp}}{\mathrm{d}x_{i}^{2}}= A,idiag(2exp(Ax)b)diag(exp(Ax))A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A_{*,i}
  • Part 4.

    d2Lexpdxidxj=A,idiag(2exp(Ax)b)diag(exp(Ax))A,j\displaystyle\frac{\mathrm{d}^{2}L_{\exp}}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,i}^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A_{*,j}
Proof.

Proof of Part 1.

d2(exp(Ax)b)dxi2=\displaystyle\frac{\mathrm{d}^{2}(\exp(Ax)-b)}{\mathrm{d}x_{i}^{2}}= ddxi(d(exp(Ax)b)dxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}(\exp(Ax)-b)}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= d(exp(Ax)A,i)dxi\displaystyle~{}\frac{\mathrm{d}(\exp(Ax)\circ A_{*,i})}{\mathrm{d}x_{i}}
=\displaystyle= A,idexp(Ax)dxi\displaystyle~{}A_{*,i}\circ\frac{\mathrm{d}\exp(Ax)}{\mathrm{d}x_{i}}
=\displaystyle= A,iexp(Ax)A,i\displaystyle~{}A_{*,i}\circ\exp(Ax)\circ A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, the third step extracts the matrix A,iA_{*,i} with constant entries out of the derivative, and the last step also follows from the chain rule.

Proof of Part 2.

d2(exp(Ax)b)dxidxj=\displaystyle\frac{\mathrm{d}^{2}(\exp(Ax)-b)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= ddxi(ddxj(exp(Ax)b))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}}{\mathrm{d}x_{j}}\bigg{(}\exp(Ax)-b\bigg{)}\bigg{)}
=\displaystyle= ddxi(exp(Ax)A,j)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\exp(Ax)\circ A_{*,j}\bigg{)}
=\displaystyle= A,jexp(Ax)A,i\displaystyle~{}A_{*,j}\circ\exp(Ax)\circ A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from Part 3 of Lemma 3.2, the third step follows from Part 3 of Lemma 3.2.

Proof of Part 3.

d2Ldxi2=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x_{i}^{2}}= ddxi(dLdxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}L}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= ddxi((exp(Ax)b)(exp(Ax)A,i))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ A_{*,i})\bigg{)}
=\displaystyle= (exp(Ax)A,i)(exp(Ax)A,i)+(exp(Ax)b)(A,iexp(Ax)A,i)\displaystyle~{}(\exp(Ax)\circ A_{*,i})^{\top}\cdot(\exp(Ax)\circ A_{*,i})+(\exp(Ax)-b)^{\top}\cdot(A_{*,i}\circ\exp(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(2exp(Ax)b)diag(exp(Ax))A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 3.2, the third step follows from differential chain rule and Part 1 of Lemma 3.2, the last step follows from Fact 2.3.

Proof of Part 4.

d2Ldxidxj=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= ddxi(dLdxj)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}L}{\mathrm{d}x_{j}}\bigg{)}
=\displaystyle= ddxi((exp(Ax)b)(exp(Ax)A,j))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}(\exp(Ax)-b)^{\top}\cdot(\exp(Ax)\circ A_{*,j})\bigg{)}
=\displaystyle= (exp(Ax)A,i)(exp(Ax)A,j)+(exp(Ax)b)(A,jexp(Ax)A,i)\displaystyle~{}(\exp(Ax)\circ A_{*,i})^{\top}\cdot(\exp(Ax)\circ A_{*,j})+(\exp(Ax)-b)^{\top}\cdot(A_{*,j}\circ\exp(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(2exp(Ax)b)diag(exp(Ax))A,j\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A_{*,j}

where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 3.2, the third step follows from differential chain rule and Part 1 of Lemma 3.2, the last step follows from Fact 2.3.

3.4 Gradient and Hessian of the Loss function for Exp Function

Lemma 3.4.

Let Lexp:d0L_{\exp}:\mathbb{R}^{d}\to\mathbb{R}_{\geq 0} be defined in Definition 3.1. Then for any i,j[d]i,j\in[d], we have

  • Part 1. Gradient

    Lexp=Adiag(exp(Ax))diag(exp(Ax)b)𝟏n\displaystyle\nabla L_{\exp}=A^{\top}\operatorname{diag}(\exp(Ax))\operatorname{diag}(\exp(Ax)-b){\bf 1}_{n}
  • Part 2. Hessian

    2Lexp=Adiag(2exp(Ax)b)diag(exp(Ax))A\displaystyle\nabla^{2}L_{\exp}=A^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A
Proof.

Part 1. We run Lemma 3.2 and Fact 2.3 directly.

Part 2. It follows from Part 3 and 4 of Lemma 3.3. ∎
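
These closed forms can likewise be checked against finite differences; the snippet below is an illustrative NumPy check on a small random instance, with A scaled down so that \exp(Ax) stays moderate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
A = 0.3 * rng.standard_normal((n, d))       # keep exp(Ax) moderate
b = rng.standard_normal(n)
x = rng.standard_normal(d)
I = np.eye(d)

L_exp = lambda v: 0.5 * np.linalg.norm(np.exp(A @ v) - b)**2
u = np.exp(A @ x)

# Part 1: gradient A^T diag(exp(Ax)) diag(exp(Ax) - b) 1_n
grad = A.T @ (u * (u - b))
h = 1e-6
num_grad = np.array([(L_exp(x + h*e) - L_exp(x - h*e)) / (2*h) for e in I])
assert np.allclose(grad, num_grad, atol=1e-5)

# Part 2: Hessian A^T diag(2 exp(Ax) - b) diag(exp(Ax)) A
hess = A.T @ (A * ((2*u - b) * u)[:, None])
h = 1e-3
num_hess = np.array([[(L_exp(x + h*(ei + ej)) - L_exp(x + h*(ei - ej))
                       - L_exp(x - h*(ei - ej)) + L_exp(x - h*(ei + ej))) / (4*h**2)
                      for ej in I] for ei in I])
assert np.allclose(hess, num_hess, atol=1e-4)
print("Lemma 3.4: gradient and Hessian of L_exp match finite differences")
```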

3.5 Loss Function with a Regularization Term

Definition 3.5.

Given matrix An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}, wnw\in\mathbb{R}^{n}. For a vector xdx\in\mathbb{R}^{d}, we define loss function Lexp,reg(x)L_{\exp,\operatorname{reg}}(x) as follows

Lexp,reg(x):=0.5exp(Ax)b22+0.5WAx22\displaystyle L_{\exp,\operatorname{reg}}(x):=0.5\cdot\|\exp(Ax)-b\|_{2}^{2}+0.5\cdot\|WAx\|_{2}^{2}

where W=diag(w)W=\operatorname{diag}(w).

Lemma 3.6.

Let Lexp,regL_{\exp,\operatorname{reg}} be defined as Definition 3.5, then we have

  • Part 1. Gradient

    dLexp,regdx=Adiag(exp(Ax))diag(exp(Ax)b)𝟏n+AW2Ax\displaystyle\frac{\mathrm{d}L_{\exp,\operatorname{reg}}}{\mathrm{d}x}=A^{\top}\operatorname{diag}(\exp(Ax))\operatorname{diag}(\exp(Ax)-b){\bf 1}_{n}+A^{\top}W^{2}Ax
  • Part 2. Hessian

    d2Lexp,regdx2=Adiag(2exp(Ax)b)diag(exp(Ax))A+AW2A\displaystyle\frac{\mathrm{d}^{2}L_{\exp,\operatorname{reg}}}{\mathrm{d}x^{2}}=A^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A+A^{\top}W^{2}A
Proof.

Proof of Part 1. We run Lemma 3.4 and Lemma 2.8 directly.

Proof of Part 2. We run Lemma 3.4 and Lemma 2.8 directly.

3.6 Hessian is Positive Definite

Lemma 3.7 (Hessian is positive definite).

Let l>0l>0 denote a parameter.

If the following condition hold

  • Let H(x)=d2Lexp,regdx2H(x)=\frac{\mathrm{d}^{2}L_{\exp,\operatorname{reg}}}{\mathrm{d}x^{2}}

  • wi2>0.5bi2+l/σmin(A)2w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2} for all i[n]i\in[n]

Then, we have

H(x)lId\displaystyle H(x)\succeq l\cdot I_{d}
Proof.

We define DD

D=diag(2exp(Ax)b)diag(exp(Ax))+W2\displaystyle D=\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))+W^{2}

Then we can rewrite Hessian as

d2Ldx2=ADA.\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}=A^{\top}DA.

We define

z=exp(Ax).\displaystyle z=\exp(Ax).

Then we have

D_{i,i} = (2\exp((Ax)_{i})-b_{i})\exp((Ax)_{i})+w_{i}^{2}
= (2z_{i}-b_{i})z_{i}+w_{i}^{2}
> (2z_{i}-b_{i})z_{i}+0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}
= 1.5z_{i}^{2}+0.5(z_{i}-b_{i})^{2}+l/\sigma_{\min}(A)^{2}
\geq l/\sigma_{\min}(A)^{2}

where the first step follows from the definition of D, the second step follows from replacing \exp(Ax) with z=\exp(Ax), the third step follows from w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}, the fourth step follows from 2z_{i}^{2}-b_{i}z_{i}+0.5b_{i}^{2}=1.5z_{i}^{2}+0.5(z_{i}-b_{i})^{2}, and the last step follows from x^{2}\geq 0 for all x.

Since we know Di,i>l/σmin(A)2D_{i,i}>l/\sigma_{\min}(A)^{2} for all i[n]i\in[n] and Lemma 2.9, we have

ADA(mini[n]Di,i)σmin(A)2IdlId\displaystyle A^{\top}DA\succeq(\min_{i\in[n]}D_{i,i})\cdot\sigma_{\min}(A)^{2}I_{d}\succeq l\cdot I_{d}

Thus, the Hessian is positive definite for all x and therefore the function L_{\exp,\operatorname{reg}} is convex. ∎
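
The following short NumPy check illustrates the conclusion H(x)\succeq l\cdot I_{d}; the random instance and the particular choice of weights are assumptions made so that the condition w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2} holds:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 10, 3, 0.5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = 0.1 * rng.standard_normal(d)
sigma_min = np.linalg.svd(A, compute_uv=False).min()

# choose weights satisfying w_i^2 > 0.5 b_i^2 + l / sigma_min(A)^2
w = np.sqrt(0.5 * b**2 + l / sigma_min**2 + 0.1)

u = np.exp(A @ x)
D = (2*u - b) * u + w**2                       # diagonal of the regularized Hessian
H = A.T @ (A * D[:, None])                     # A^T D A
assert np.linalg.eigvalsh(H).min() >= l
print("Lemma 3.7: lambda_min(H) =", np.linalg.eigvalsh(H).min(), ">= l =", l)
```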

3.7 Hessian is Lipschitz

Lemma 3.8 (Hessian is Lipschitz).

If the following condition holds

  • Let H(x)=d2Lexp,regdx2H(x)=\frac{\mathrm{d}^{2}L_{\exp,\operatorname{reg}}}{\mathrm{d}x^{2}}

  • Let R>2R>2

  • x2R,y2R\|x\|_{2}\leq R,\|y\|_{2}\leq R

  • A(xy)<0.01\|A(x-y)\|_{\infty}<0.01

  • AR\|A\|\leq R

  • b2R\|b\|_{2}\leq R

Then we have

H(x)H(y)nexp(6R2)xy2\displaystyle\|H(x)-H(y)\|\leq\sqrt{n}\cdot\exp(6R^{2})\cdot\|x-y\|_{2}
Proof.

We have

H(x)H(y)\displaystyle~{}\|H(x)-H(y)\|
=\displaystyle= Adiag(2exp(Ax)b)diag(exp(Ax))AAdiag(2exp(Ay)b)diag(exp(Ay))A\displaystyle~{}\|A^{\top}\operatorname{diag}(2\exp(Ax)-b)\operatorname{diag}(\exp(Ax))A-A^{\top}\operatorname{diag}(2\exp(Ay)-b)\operatorname{diag}(\exp(Ay))A\|
\displaystyle\leq A2(2exp(Ax)b)exp(Ax)(2exp(Ay)b)exp(Ay)2\displaystyle~{}\|A\|^{2}\cdot\|(2\exp(Ax)-b)\circ\exp(Ax)-(2\exp(Ay)-b)\circ\exp(Ay)\|_{2}
=\displaystyle= A22(exp(Ax)+exp(Ay))(exp(Ax)exp(Ay))b(exp(Ax)exp(Ay))2\displaystyle~{}\|A\|^{2}\cdot\|2(\exp(Ax)+\exp(Ay))\circ(\exp(Ax)-\exp(Ay))-b\circ(\exp(Ax)-\exp(Ay))\|_{2}
=\displaystyle= A2(2exp(Ax)+2exp(Ay)b)(exp(Ax)exp(Ay))2\displaystyle~{}\|A\|^{2}\cdot\|(2\exp(Ax)+2\exp(Ay)-b)\circ(\exp(Ax)-\exp(Ay))\|_{2}
\displaystyle\leq A2(2exp(Ax)+2exp(Ay)b)exp(Ax)exp(Ay)2\displaystyle~{}\|A\|^{2}\cdot\|(2\exp(Ax)+2\exp(Ay)-b)\|_{\infty}\cdot\|\exp(Ax)-\exp(Ay)\|_{2} (1)

where the first step follows from H(x)=2LH(x)=\nabla^{2}L and simple algebra, the second step follows from Fact 2.5, the third step follows from simple algebra, the fourth step follows from simple algebra, the last step follows from Fact 2.4.

For the first term in Eq. (1), we have

A2R2\displaystyle\|A\|^{2}\leq R^{2} (2)

For the second term in Eq. (1), we have

(2exp(Ax)+2exp(Ay)b)\displaystyle\|(2\exp(Ax)+2\exp(Ay)-b)\|_{\infty}\leq 2exp(Ax)+2exp(Ay)+b\displaystyle~{}\|2\exp(Ax)\|_{\infty}+\|2\exp(Ay)\|_{\infty}+\|b\|_{\infty}
\displaystyle\leq 2exp(Ax2)+2exp(Ay2)+b\displaystyle~{}2\exp(\|Ax\|_{2})+2\exp(\|Ay\|_{2})+\|b\|_{\infty}
\displaystyle\leq 4exp(R2)+b\displaystyle~{}4\exp(R^{2})+\|b\|_{\infty}
\displaystyle\leq 4exp(R2)+R\displaystyle~{}4\exp(R^{2})+R
\displaystyle\leq 5exp(R2)\displaystyle~{}5\exp(R^{2}) (3)

where the first step follows from Fact 2.4 , the second step follows from Fact 2.4, the third step follows from Ax2R2,Ay2R2\|Ax\|_{2}\leq R^{2},\|Ay\|_{2}\leq R^{2}, the fourth step follows from bb2R\|b\|_{\infty}\leq\|b\|_{2}\leq R, the last step follows from R2R\geq 2.

For the third term in Eq. (1), we have

exp(Ax)exp(Ay)2\displaystyle\|\exp(Ax)-\exp(Ay)\|_{2}\leq exp(Ax)22A(yx)\displaystyle~{}\|\exp(Ax)\|_{2}\cdot 2\|A(y-x)\|_{\infty}
\displaystyle\leq nexp(Ax)2A(yx)\displaystyle~{}\sqrt{n}\cdot\|\exp(Ax)\|_{\infty}\cdot 2\|A(y-x)\|_{\infty}
\displaystyle\leq nexp(Ax)2A(yx)2\displaystyle~{}\sqrt{n}\cdot\|\exp(Ax)\|_{\infty}\cdot 2\|A(y-x)\|_{2}
\displaystyle\leq nexp(Ax2)2A(yx)2\displaystyle~{}\sqrt{n}\cdot\exp(\|Ax\|_{2})\cdot 2\|A(y-x)\|_{2}
\displaystyle\leq nexp(R2)2Ayx2\displaystyle~{}\sqrt{n}\exp(R^{2})\cdot 2\|A\|\cdot\|y-x\|_{2}
\displaystyle\leq 2nRexp(R2)yx2\displaystyle~{}2\sqrt{n}R\exp(R^{2})\cdot\|y-x\|_{2} (4)

where the first step follows from A(yx)<0.01\|A(y-x)\|_{\infty}<0.01 and Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4 , the fourth step follows from Fact 2.4, the fifth step follows from Fact 2.5 and Ax2R2\|Ax\|_{2}\leq R^{2}, the last step follows from AR\|A\|\leq R.

Putting it all together, we have

H(x)H(y)\displaystyle\|H(x)-H(y)\|\leq R25exp(R2)2nRexp(R2)yx2\displaystyle~{}R^{2}\cdot 5\exp(R^{2})\cdot 2\sqrt{n}R\exp(R^{2})\|y-x\|_{2}
=\displaystyle= 10nR3exp(2R2)yx2\displaystyle~{}10\sqrt{n}R^{3}\exp(2R^{2})\cdot\|y-x\|_{2}
\displaystyle\leq nexp(4R2)exp(2R2)yx2\displaystyle~{}\sqrt{n}\exp(4R^{2})\cdot\exp(2R^{2})\cdot\|y-x\|_{2}
=\displaystyle= nexp(6R2)yx2\displaystyle~{}\sqrt{n}\exp(6R^{2})\cdot\|y-x\|_{2}

where the first step follows from applying Eq. (2), Eq. (3), and Eq. (4), the second step follows from simple algebra, the third step follows from R\geq 2, and the last step follows from simple algebra.

4 Cosh Regression

In this section, we provide a detailed analysis of L_{\cosh}. In Section 4.1 we define the loss function L_{\cosh} based on \cosh(x). In Section 4.2 we compute the gradient of L_{\cosh} in detail. In Section 4.3 we compute the Hessian of L_{\cosh} in detail. In Section 4.4, we summarize the results of Section 4.2 and Section 4.3 and obtain the gradient \nabla L_{\cosh} and Hessian \nabla^{2}L_{\cosh} of L_{\cosh}. In Section 4.5 we define L_{\cosh,\operatorname{reg}} by adding the regularization term L_{\operatorname{reg}} from Section 2.4 to L_{\cosh} and compute the gradient \nabla L_{\cosh,\operatorname{reg}} and Hessian \nabla^{2}L_{\cosh,\operatorname{reg}} of L_{\cosh,\operatorname{reg}}. In Section 4.6 we prove that \nabla^{2}L_{\cosh,\operatorname{reg}}\succ 0 and thus show that L_{\cosh,\operatorname{reg}} is convex. In Section 4.7 we provide an upper bound for \|\nabla^{2}L_{\cosh,\operatorname{reg}}(x)-\nabla^{2}L_{\cosh,\operatorname{reg}}(y)\| and thus prove that \nabla^{2}L_{\cosh,\operatorname{reg}} is Lipschitz.

4.1 Definition

Definition 4.1.

Given An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}. For a vector xdx\in\mathbb{R}^{d}, we define loss function Lcosh(x)L_{\cosh}(x) as follows:

Lcosh(x):=0.5cosh(Ax)b22\displaystyle L_{\cosh}(x):=0.5\cdot\|\cosh(Ax)-b\|_{2}^{2}

4.2 Gradient

Lemma 4.2 (Gradient for Cosh).

We have

  • Part 1.

    d(cosh(Ax)b)dt=sinh(Ax)Adxdt\displaystyle\frac{\mathrm{d}(\cosh(Ax)-b)}{\mathrm{d}t}=\sinh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t}
  • Part 2.

    dLcoshdt=(cosh(Ax)b)(sinh(Ax)Adxdt)\displaystyle\frac{\mathrm{d}L_{\cosh}}{\mathrm{d}t}=(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t})

Further, we have for each i[d]i\in[d]

  • Part 3.

    d(cosh(Ax)b)dxi=sinh(Ax)A,i\displaystyle\frac{\mathrm{d}(\cosh(Ax)-b)}{\mathrm{d}x_{i}}=\sinh(Ax)\circ A_{*,i}
  • Part 4.

    dLcoshdxi=\displaystyle\frac{\mathrm{d}L_{\cosh}}{\mathrm{d}x_{i}}= (cosh(Ax)b)(sinh(Ax)A,i)\displaystyle~{}(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ A_{*,i})
  • Part 5.

    dLcoshdx=Adiag(sinh(Ax))(cosh(Ax)b)\displaystyle\frac{\mathrm{d}L_{\cosh}}{\mathrm{d}x}=A^{\top}\operatorname{diag}(\sinh(Ax))(\cosh(Ax)-b)
Proof.

Proof of Part 1.

For each i[n]i\in[n], we have

d(cosh(Ax)b)idt=\displaystyle\frac{\mathrm{d}(\cosh(Ax)-b)_{i}}{\mathrm{d}t}= sinh(Ax)id(Ax)idt\displaystyle~{}\sinh(Ax)_{i}\cdot\frac{\mathrm{d}(Ax)_{i}}{\mathrm{d}t}
=\displaystyle= sinh(Ax)i(Adx)idt\displaystyle~{}\sinh(Ax)_{i}\cdot\frac{(A\mathrm{d}x)_{i}}{\mathrm{d}t}

where the first and second step follow from the differential chain rule.

Thus, we complete the proof.

Proof of Part 2.

We have

dLcoshdt=\displaystyle\frac{\mathrm{d}L_{\cosh}}{\mathrm{d}t}= (cosh(Ax)b)d(cosh(Ax)b)dt\displaystyle~{}(\cosh(Ax)-b)^{\top}\cdot\frac{\mathrm{d}(\cosh(Ax)-b)}{\mathrm{d}t}
=\displaystyle= (cosh(Ax)b)(sinh(Ax)Adxdt)\displaystyle~{}(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t})

where the first step follows from the differential chain rule, and the second step follows from d(cosh(Ax)b)dt=sinh(Ax)Adxdt\frac{\mathrm{d}(\cosh(Ax)-b)}{\mathrm{d}t}=\sinh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t} in Part 1.

Proof of Part 3.

We have

d(cosh(Ax)b)dxi=\displaystyle\frac{\mathrm{d}(\cosh(Ax)-b)}{\mathrm{d}x_{i}}= d(cosh(Ax))dxidbdxi\displaystyle~{}\frac{\mathrm{d}(\cosh(Ax))}{\mathrm{d}x_{i}}-\frac{\mathrm{d}b}{\mathrm{d}x_{i}}
=\displaystyle= sinh(Ax)dAxdxi0\displaystyle~{}\sinh(Ax)\circ\frac{\mathrm{d}Ax}{\mathrm{d}x_{i}}-0
=\displaystyle= sinh(Ax)A,i\displaystyle~{}\sinh(Ax)\circ A_{*,i}

where the first step follows from the property of the gradient, the second step follows from simple algebra, and the last step directly follows from Lemma 2.7.

Proof of Part 4.

Substituting x_{i} for t in Part 2, we get

dLdxi=\displaystyle\frac{\mathrm{d}L}{\mathrm{d}x_{i}}= (cosh(Ax)b)(sinh(Ax)Adxdxi)\displaystyle~{}(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}x_{i}})
=\displaystyle= (cosh(Ax)b)(sinh(Ax)A,i)\displaystyle~{}(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(sinh(Ax))(cosh(Ax)b)\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\sinh(Ax))(\cosh(Ax)-b)

where the first step follows from the result of Part 2, the second step follows from the result of Lemma 2.7, the last step follows from Fact 2.3.

Proof of Part 5.

We have

dLdx=\displaystyle\frac{\mathrm{d}L}{\mathrm{d}x}= Adiag(sinh(Ax))(cosh(Ax)b)\displaystyle~{}A^{\top}\operatorname{diag}(\sinh(Ax))(\cosh(Ax)-b)

this follows from the result of Part 4 directly. ∎

4.3 Hessian

Lemma 4.3.
  • Part 1.

    d2(cosh(Ax)b)dxi2=\displaystyle\frac{\mathrm{d}^{2}(\cosh(Ax)-b)}{\mathrm{d}x_{i}^{2}}= A,icosh(Ax)A,i\displaystyle~{}A_{*,i}\circ\cosh(Ax)\circ A_{*,i}
  • Part 2.

    d2(cosh(Ax)b)dxidxj=\displaystyle\frac{\mathrm{d}^{2}(\cosh(Ax)-b)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= A,jcosh(Ax)A,i\displaystyle~{}A_{*,j}\circ\cosh(Ax)\circ A_{*,i}
  • Part 3.

    d2Lcoshdxi2=\displaystyle\frac{\mathrm{d}^{2}L_{\cosh}}{\mathrm{d}x_{i}^{2}}= A,idiag(2cosh(Ax)cosh(Ax)bcosh(Ax)𝟏n)A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax)-{\bf 1}_{n})A_{*,i}
  • Part 4.

    d2Lcoshdxidxj=A,idiag(sinh2(Ax)+cosh2(Ax)bcosh(Ax))A,j\displaystyle\frac{\mathrm{d}^{2}L_{\cosh}}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,i}^{\top}\operatorname{diag}(\sinh^{2}(Ax)+\cosh^{2}(Ax)-b\circ\cosh(Ax))A_{*,j}
Proof.

Proof of Part 1.

d2(cosh(Ax)b)dxi2=\displaystyle\frac{\mathrm{d}^{2}(\cosh(Ax)-b)}{\mathrm{d}x_{i}^{2}}= ddxi(d(cosh(Ax)b)dxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}(\cosh(Ax)-b)}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= d(sinh(Ax)A,i)dxi\displaystyle~{}\frac{\mathrm{d}(\sinh(Ax)\circ A_{*,i})}{\mathrm{d}x_{i}}
=\displaystyle= A,idsinh(Ax)dxi\displaystyle~{}A_{*,i}\circ\frac{\mathrm{d}\sinh(Ax)}{\mathrm{d}x_{i}}
=\displaystyle= A,icosh(Ax)A,i\displaystyle~{}A_{*,i}\circ\cosh(Ax)\circ A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, the third step extracts the matrix A,iA_{*,i} with constant entries out of the derivative, and the last step also follows from the chain rule.

Proof of Part 2.

d2(cosh(Ax)b)dxidxj=\displaystyle\frac{\mathrm{d}^{2}(\cosh(Ax)-b)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= ddxi(ddxj(cosh(Ax)b))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}}{\mathrm{d}x_{j}}\bigg{(}\cosh(Ax)-b\bigg{)}\bigg{)}
=\displaystyle= ddxi(sinh(Ax)A,j)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\sinh(Ax)\circ A_{*,j}\bigg{)}
=\displaystyle= A,jcosh(Ax)A,i\displaystyle~{}A_{*,j}\circ\cosh(Ax)\circ A_{*,i}

where the first step is an expansion of the Hessian, the second and third steps follow from the differential chain rule.

Proof of Part 3.

Here in the proof, for simplicity, we write \cosh^{2}(Ax) to denote \cosh(Ax)\circ\cosh(Ax).

We have

d2Ldxi2=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x_{i}^{2}}= ddxi(dLdxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}L}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= ddxi((cosh(Ax)b)(sinh(Ax)A,i))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ A_{*,i})\bigg{)}
=\displaystyle= (sinh(Ax)A,i)(sinh(Ax)A,i)+(cosh(Ax)b)(A,icosh(Ax)A,i)\displaystyle~{}(\sinh(Ax)\circ A_{*,i})^{\top}\cdot(\sinh(Ax)\circ A_{*,i})+(\cosh(Ax)-b)^{\top}\cdot(A_{*,i}\circ\cosh(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(sinh(Ax)sinh(Ax)+cosh(Ax)cosh(Ax)bcosh(Ax))A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\sinh(Ax)\circ\sinh(Ax)+\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax))A_{*,i}
=\displaystyle= A,idiag(2cosh(Ax)cosh(Ax)bcosh(Ax)𝟏n)A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax)-{\bf 1}_{n})A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 4.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, the last step follows from Fact 2.3.

Proof of Part 4.

d2Ldxidxj=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= ddxi(dLdxj)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}L}{\mathrm{d}x_{j}}\bigg{)}
=\displaystyle= ddxi((cosh(Ax)b)(sinh(Ax)A,j))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}(\cosh(Ax)-b)^{\top}\cdot(\sinh(Ax)\circ A_{*,j})\bigg{)}
=\displaystyle= (sinh(Ax)A,i)(sinh(Ax)A,j)+(cosh(Ax)b)(A,jcosh(Ax)A,i)\displaystyle~{}(\sinh(Ax)\circ A_{*,i})^{\top}\cdot(\sinh(Ax)\circ A_{*,j})+(\cosh(Ax)-b)^{\top}\cdot(A_{*,j}\circ\cosh(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(sinh(Ax)sinh(Ax)+cosh(Ax)cosh(Ax)bcosh(Ax))A,j\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\sinh(Ax)\circ\sinh(Ax)+\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax))A_{*,j}
=\displaystyle= A,idiag(2cosh(Ax)cosh(Ax)bcosh(Ax)𝟏n)A,j\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax)-{\bf 1}_{n})A_{*,j}

where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 4.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, the last step follows from Fact 2.3.

4.4 Gradient and Hessian of the Loss function for Cosh Function

Lemma 4.4.

Let L_{\cosh}:\mathbb{R}^{d}\to\mathbb{R}_{\geq 0} be defined in Definition 4.1. Then for any i,j\in[d], we have

  • Part 1. Gradient

    Lcosh=Adiag(sinh(Ax))diag(cosh(Ax)b)𝟏n\displaystyle\nabla L_{\cosh}=A^{\top}\operatorname{diag}(\sinh(Ax))\operatorname{diag}(\cosh(Ax)-b){\bf 1}_{n}
  • Part 2. Hessian

    \nabla^{2}L_{\cosh}=A^{\top}\operatorname{diag}(2\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax)-{\bf 1}_{n})A
Proof.

Part 1. We run Lemma 4.2 and Fact 2.3 directly.

Part 2. It follows from Parts 3 and 4 of Lemma 4.3. ∎
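
As in the exp case, these formulas can be sanity-checked numerically; the following illustrative NumPy snippet compares the stated gradient and Hessian of L_{\cosh} against finite differences on a small random instance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
A = 0.3 * rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = rng.standard_normal(d)
I = np.eye(d)

L_cosh = lambda v: 0.5 * np.linalg.norm(np.cosh(A @ v) - b)**2
c, s = np.cosh(A @ x), np.sinh(A @ x)

# Part 1: gradient A^T diag(sinh(Ax)) diag(cosh(Ax) - b) 1_n
grad = A.T @ (s * (c - b))
h = 1e-6
num_grad = np.array([(L_cosh(x + h*e) - L_cosh(x - h*e)) / (2*h) for e in I])
assert np.allclose(grad, num_grad, atol=1e-5)

# Part 2: Hessian A^T diag(2 cosh(Ax) o cosh(Ax) - b o cosh(Ax) - 1_n) A
hess = A.T @ (A * (2*c*c - b*c - 1.0)[:, None])
h = 1e-3
num_hess = np.array([[(L_cosh(x + h*(ei + ej)) - L_cosh(x + h*(ei - ej))
                       - L_cosh(x - h*(ei - ej)) + L_cosh(x - h*(ei + ej))) / (4*h**2)
                      for ej in I] for ei in I])
assert np.allclose(hess, num_hess, atol=1e-4)
print("Lemma 4.4: gradient and Hessian of L_cosh match finite differences")
```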

4.5 Loss Function with a Regularization Term

Definition 4.5.

Given matrix An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}, wnw\in\mathbb{R}^{n}. For a vector xdx\in\mathbb{R}^{d}, we define loss function L(x)L(x) as follows

Lcosh,reg(x):=0.5cosh(Ax)b22+0.5WAx22\displaystyle L_{\cosh,\operatorname{reg}}(x):=0.5\cdot\|\cosh(Ax)-b\|_{2}^{2}+0.5\cdot\|WAx\|_{2}^{2}

where W=diag(w)W=\operatorname{diag}(w).

Lemma 4.6.

Let LL be defined as Definition 4.5, then we have

  • Part 1. Gradient

    dLcosh,regdx=Adiag(sinh(Ax))(diag(cosh(Ax)b))𝟏n+AW2Ax\displaystyle\frac{\mathrm{d}L_{\cosh,\operatorname{reg}}}{\mathrm{d}x}=A^{\top}\operatorname{diag}(\sinh(Ax))(\operatorname{diag}(\cosh(Ax)-b)){\bf 1}_{n}+A^{\top}W^{2}Ax
  • Part 2. Hessian

    \frac{\mathrm{d}^{2}L_{\cosh,\operatorname{reg}}}{\mathrm{d}x^{2}}=A^{\top}\operatorname{diag}(2\cosh(Ax)\circ\cosh(Ax)-b\circ\cosh(Ax)-{\bf 1}_{n})A+A^{\top}W^{2}A
Proof.

Proof of Part 1. We run Lemma 4.4 and Lemma 2.8 directly.

Proof of Part 2. We run Lemma 4.4 and Lemma 2.8 directly.

4.6 Hessian is Positive Definite

Lemma 4.7.

If wi2>0.5bi2+l/σmin(A)2+1w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}+1 for all i[n]i\in[n], then

d2Ldx2lId\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}\succeq l\cdot I_{d}
Proof.

We define diagonal matrix Dn×nD\in\mathbb{R}^{n\times n}

D=diag(sinh2(Ax)+cosh2(Ax)bcosh(Ax))+W2\displaystyle D=\operatorname{diag}(\sinh^{2}(Ax)+\cosh^{2}(Ax)-b\circ\cosh(Ax))+W^{2}

Then we can rewrite Hessian as

d2Ldx2=ADA.\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}=A^{\top}DA.

Then we have

D_{i,i} = \sinh^{2}((Ax)_{i})+\cosh^{2}((Ax)_{i})-b_{i}\cosh((Ax)_{i})+w_{i}^{2}
= (z_{i}^{2}-1+z_{i}^{2}-b_{i}z_{i})+w_{i}^{2}
= 2z_{i}^{2}-b_{i}z_{i}+w_{i}^{2}-1
> 2z_{i}^{2}-b_{i}z_{i}+0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}
= 1.5z_{i}^{2}+0.5(z_{i}-b_{i})^{2}+l/\sigma_{\min}(A)^{2}
\geq l/\sigma_{\min}(A)^{2}

where the first step follows from the definition of D, the second step follows from replacing \cosh(Ax) with z=\cosh(Ax) and \sinh^{2}(\cdot)=\cosh^{2}(\cdot)-1 (Fact 2.1), the third step follows from simple algebra, the fourth step follows from w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}+1, the fifth step follows from 2z_{i}^{2}-b_{i}z_{i}+0.5b_{i}^{2}=1.5z_{i}^{2}+0.5(z_{i}-b_{i})^{2}, and the last step follows from x^{2}\geq 0 for all x.

Since we know D_{i,i}>l/\sigma_{\min}(A)^{2} for all i\in[n] and Lemma 2.9, we have

ADA(mini[n]Di,i)σmin(A)2IdlId\displaystyle A^{\top}DA\succeq(\min_{i\in[n]}D_{i,i})\cdot\sigma_{\min}(A)^{2}I_{d}\succeq l\cdot I_{d}

Thus, the Hessian is positive definite for all x and therefore the function L_{\cosh,\operatorname{reg}} is convex. ∎

4.7 Hessian is Lipschitz

Lemma 4.8 (Hessian is Lipschitz).

If the following condition holds

  • Let H(x)=d2Lcosh,regdx2H(x)=\frac{\mathrm{d}^{2}L_{\cosh,\operatorname{reg}}}{\mathrm{d}x^{2}}

  • Let R>2R>2

  • x2R,y2R\|x\|_{2}\leq R,\|y\|_{2}\leq R

  • A(xy)<0.01\|A(x-y)\|_{\infty}<0.01

  • AR\|A\|\leq R

  • b2R\|b\|_{2}\leq R

Then we have

H(x)H(y)nexp(6R2)xy2\displaystyle\|H(x)-H(y)\|\leq\sqrt{n}\exp(6R^{2})\cdot\|x-y\|_{2}
Proof.

We have

H(x)H(y)\displaystyle~{}\|H(x)-H(y)\|
=\displaystyle= Adiag(2cosh(Ax)b)diag(cosh(Ax))AAdiag(2cosh(Ay)b)diag(cosh(Ay))A\displaystyle~{}\|A^{\top}\operatorname{diag}(2\cosh(Ax)-b)\operatorname{diag}(\cosh(Ax))A-A^{\top}\operatorname{diag}(2\cosh(Ay)-b)\operatorname{diag}(\cosh(Ay))A\|
\displaystyle\leq A2(2cosh(Ax)b)cosh(Ax)(2cosh(Ay)b)cosh(Ay)2\displaystyle~{}\|A\|^{2}\cdot\|(2\cosh(Ax)-b)\circ\cosh(Ax)-(2\cosh(Ay)-b)\circ\cosh(Ay)\|_{2}
=\displaystyle= A22(cosh(Ax)+cosh(Ay))(cosh(Ax)cosh(Ay))b(cosh(Ax)cosh(Ay))2\displaystyle~{}\|A\|^{2}\cdot\|2(\cosh(Ax)+\cosh(Ay))\circ(\cosh(Ax)-\cosh(Ay))-b\circ(\cosh(Ax)-\cosh(Ay))\|_{2}
=\displaystyle= A2(2cosh(Ax)+2cosh(Ay)b)(cosh(Ax)cosh(Ay))2\displaystyle~{}\|A\|^{2}\cdot\|(2\cosh(Ax)+2\cosh(Ay)-b)\circ(\cosh(Ax)-\cosh(Ay))\|_{2}
\displaystyle\leq A2(2cosh(Ax)+2cosh(Ay)b)cosh(Ax)cosh(Ay)2\displaystyle~{}\|A\|^{2}\cdot\|(2\cosh(Ax)+2\cosh(Ay)-b)\|_{\infty}\cdot\|\cosh(Ax)-\cosh(Ay)\|_{2} (5)

where the first step follows from H(x)=2LH(x)=\nabla^{2}L and simple algebra, the second step follows from Fact 2.5, the third step follows from simple algebra, the fourth step follows from simple algebra, the last step follows from Fact 2.4.

For the first term in Eq. (5), we have

A2R2\displaystyle\|A\|^{2}\leq R^{2} (6)

For the second term in Eq. (5), we have

(2cosh(Ax)+2cosh(Ay)b)\displaystyle\|(2\cosh(Ax)+2\cosh(Ay)-b)\|_{\infty}\leq 2cosh(Ax)+2cosh(Ay)+b\displaystyle~{}\|2\cosh(Ax)\|_{\infty}+\|2\cosh(Ay)\|_{\infty}+\|b\|_{\infty}
\displaystyle\leq 2exp(Ax2)+2exp(Ay2)+b\displaystyle~{}2\exp(\|Ax\|_{2})+2\exp(\|Ay\|_{2})+\|b\|_{\infty}
\displaystyle\leq 4exp(R2)+b\displaystyle~{}4\exp(R^{2})+\|b\|_{\infty}
\displaystyle\leq 4exp(R2)+R\displaystyle~{}4\exp(R^{2})+R
\displaystyle\leq 5exp(R2)\displaystyle~{}5\exp(R^{2}) (7)

where the first step follows from Fact 2.4 , the second step follows from Fact 2.4 , the third step follows from Ax2R2\|Ax\|_{2}\leq R^{2}, Ay2R2\|Ay\|_{2}\leq R^{2}, the fourth step follows from bR\|b\|_{\infty}\leq R and the last step follows from R2R\geq 2.

For the third term in Eq. (5), we have

cosh(Ax)cosh(Ay)2\displaystyle\|\cosh(Ax)-\cosh(Ay)\|_{2}\leq cosh(Ax)22A(yx)\displaystyle~{}\|\cosh(Ax)\|_{2}\cdot 2\|A(y-x)\|_{\infty}
\displaystyle\leq ncosh(Ax)2A(yx)\displaystyle~{}\sqrt{n}\|\cosh(Ax)\|_{\infty}\cdot 2\|A(y-x)\|_{\infty}
\displaystyle\leq nexp(Ax2)2A(yx)2\displaystyle~{}\sqrt{n}\exp(\|Ax\|_{2})\cdot 2\|A(y-x)\|_{2}
\displaystyle\leq nexp(R2)2A(yx)2\displaystyle~{}\sqrt{n}\exp(R^{2})\cdot 2\|A(y-x)\|_{2}
\displaystyle\leq nexp(R2)2Ayx2\displaystyle~{}\sqrt{n}\exp(R^{2})\cdot 2\|A\|\cdot\|y-x\|_{2}
\displaystyle\leq 2nRexp(R2)yx2\displaystyle~{}2\sqrt{n}R\exp(R^{2})\cdot\|y-x\|_{2} (8)

where the first step follows from A(yx)<0.01\|A(y-x)\|_{\infty}<0.01 and Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from Ax2R2\|Ax\|_{2}\leq R^{2}, the fifth step follows from Fact 2.5, the last step follows from AR\|A\|\leq R.

Putting it all together, we have

H(x)H(y)\displaystyle\|H(x)-H(y)\|\leq R25exp(R2)2nRexp(R2)yx2\displaystyle~{}R^{2}\cdot 5\exp(R^{2})\cdot 2\sqrt{n}R\exp(R^{2})\|y-x\|_{2}
=\displaystyle= 10nR3exp(2R2)yx2\displaystyle~{}10\sqrt{n}R^{3}\exp(2R^{2})\cdot\|y-x\|_{2}
\displaystyle\leq nexp(4R2)exp(2R2)yx2\displaystyle~{}\sqrt{n}\exp(4R^{2})\cdot\exp(2R^{2})\cdot\|y-x\|_{2}
=\displaystyle= nexp(6R2)yx2\displaystyle~{}\sqrt{n}\exp(6R^{2})\cdot\|y-x\|_{2}

where the first step follows from applying Eq. (6), Eq. (7), and Eq. (8), the second step follows from simple algebra, the third step follows from R2R\geq 2, and the last step follows from simple algebra.
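The constant nexp(6R2)\sqrt{n}\exp(6R^{2}) is far from tight. The sketch below (an illustration only; the instance, radius, and perturbation size are arbitrary choices satisfying the hypotheses) measures the observed ratio between the Hessian difference and the distance of nearby points and compares it with the bound.

import numpy as np

rng = np.random.default_rng(2)
n, d, R = 6, 3, 2.0
A = rng.standard_normal((n, d))
A *= R / np.linalg.norm(A, 2)            # enforce ||A|| <= R
b = rng.standard_normal(n)
b *= R / np.linalg.norm(b)               # enforce ||b||_2 <= R
w = np.ones(n)                           # the regularizer cancels in H(x) - H(y)

def hess(x):
    u = A @ x
    D = np.sinh(u) ** 2 + np.cosh(u) ** 2 - b * np.cosh(u) + w ** 2
    return A.T @ (D[:, None] * A)

x = rng.standard_normal(d)
x *= min(1.0, R / np.linalg.norm(x))     # enforce ||x||_2 <= R
y = x + 1e-3 * rng.standard_normal(d)    # keeps ||A(x - y)||_inf well below 0.01
ratio = np.linalg.norm(hess(x) - hess(y), 2) / np.linalg.norm(x - y)
print(ratio, "<=", np.sqrt(n) * np.exp(6 * R ** 2))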

5 Sinh Regression

In this section, we provide a detailed analysis of LsinhL_{\sinh}. In Section 5.1 we define the loss function LsinhL_{\sinh} based on sinh(x)\sinh(x). In Section 5.2 we compute the gradient of LsinhL_{\sinh} in detail. In Section 5.3 we compute the Hessian of LsinhL_{\sinh} in detail. In Section 5.4, we summarize the results of Section 5.2 and Section 5.3 and obtain the gradient Lsinh\nabla L_{\sinh} and Hessian 2Lsinh\nabla^{2}L_{\sinh} of LsinhL_{\sinh}. In Section 5.5 we define Lsinh,regL_{\sinh,\operatorname{reg}} by adding the regularization term LregL_{\operatorname{reg}} from Section 2.4 to LsinhL_{\sinh}, and we compute the gradient Lsinh,reg\nabla L_{\sinh,\operatorname{reg}} and Hessian 2Lsinh,reg\nabla^{2}L_{\sinh,\operatorname{reg}} of Lsinh,regL_{\sinh,\operatorname{reg}}. In Section 5.6 we prove that 2Lsinh,reg0\nabla^{2}L_{\sinh,\operatorname{reg}}\succ 0 and thus show that Lsinh,regL_{\sinh,\operatorname{reg}} is convex. In Section 5.7 we provide an upper bound on 2Lsinh,reg(x)2Lsinh,reg(y)\|\nabla^{2}L_{\sinh,\operatorname{reg}}(x)-\nabla^{2}L_{\sinh,\operatorname{reg}}(y)\| and thus prove that 2Lsinh,reg\nabla^{2}L_{\sinh,\operatorname{reg}} is Lipschitz.

5.1 Definition

Definition 5.1.

Given An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}. For a vector xdx\in\mathbb{R}^{d}, we define loss function L(x)L(x) as follows:

L(x):=0.5sinh(Ax)b22\displaystyle L(x):=0.5\cdot\|\sinh(Ax)-b\|_{2}^{2}
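For concreteness, this loss can be written in a few lines (a minimal sketch; the random instance below only fixes notation and is not part of any result).

import numpy as np

def L_sinh(A, b, x):
    # 0.5 * || sinh(Ax) - b ||_2^2
    r = np.sinh(A @ x) - b
    return 0.5 * float(r @ r)

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
x = 0.1 * rng.standard_normal(3)
print(L_sinh(A, b, x))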

5.2 Gradient

Lemma 5.2 (Gradient for Sinh).

We have

  • Part 1.

    d(sinh(Ax)b)dt=cosh(Ax)Adxdt\displaystyle\frac{\mathrm{d}(\sinh(Ax)-b)}{\mathrm{d}t}=\cosh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t}
  • Part 2.

    dLsinhdt=(sinh(Ax)b)(cosh(Ax)Adxdt)\displaystyle\frac{\mathrm{d}L_{\sinh}}{\mathrm{d}t}=(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t})

Further, we have for each i[d]i\in[d]

  • Part 3.

    d(sinh(Ax)b)dxi=cosh(Ax)A,i\displaystyle\frac{\mathrm{d}(\sinh(Ax)-b)}{\mathrm{d}x_{i}}=\cosh(Ax)\circ A_{*,i}
  • Part 4.

    dLsinhdxi=\displaystyle\frac{\mathrm{d}L_{\sinh}}{\mathrm{d}x_{i}}= (sinh(Ax)b)(cosh(Ax)A,i)\displaystyle~{}(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ A_{*,i})
  • Part 5.

    dLsinhdx=Adiag(cosh(Ax))(sinh(Ax)b)\displaystyle\frac{\mathrm{d}L_{\sinh}}{\mathrm{d}x}=A^{\top}\operatorname{diag}(\cosh(Ax))(\sinh(Ax)-b)
Proof.

Proof of Part 1.

For each i[n]i\in[n], we have

d(sinh(Ax)b)idt=\displaystyle\frac{\mathrm{d}(\sinh(Ax)-b)_{i}}{\mathrm{d}t}= cosh(Ax)id(Ax)idt\displaystyle~{}\cosh(Ax)_{i}\cdot\frac{\mathrm{d}(Ax)_{i}}{\mathrm{d}t}
=\displaystyle= cosh(Ax)i(Adx)idt\displaystyle~{}\cosh(Ax)_{i}\cdot\frac{(A\mathrm{d}x)_{i}}{\mathrm{d}t}

where the first step follows from the chain rule and the second step follows from the linearity of the map xAxx\mapsto Ax.

Thus, we complete the proof.

Proof of Part 2.

We have

dLsinhdt=\displaystyle\frac{\mathrm{d}L_{\sinh}}{\mathrm{d}t}= (sinh(Ax)b)d(sinh(Ax)b)dt\displaystyle~{}(\sinh(Ax)-b)^{\top}\cdot\frac{\mathrm{d}(\sinh(Ax)-b)}{\mathrm{d}t}
=\displaystyle= (sinh(Ax)b)(cosh(Ax)Adxdt)\displaystyle~{}(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t})

where the first step follows from the differential chain rule, and the second step follows from d(sinh(Ax)b)dt=cosh(Ax)Adxdt\frac{\mathrm{d}(\sinh(Ax)-b)}{\mathrm{d}t}=\cosh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}t} in Part 1.

Proof of Part 3.

We have

d(sinh(Ax)b)dxi=\displaystyle\frac{\mathrm{d}(\sinh(Ax)-b)}{\mathrm{d}x_{i}}= d(sinh(Ax))dxidbdxi\displaystyle~{}\frac{\mathrm{d}(\sinh(Ax))}{\mathrm{d}x_{i}}-\frac{\mathrm{d}b}{\mathrm{d}x_{i}}
=\displaystyle= cosh(Ax)dAxdxi0\displaystyle~{}\cosh(Ax)\circ\frac{\mathrm{d}Ax}{\mathrm{d}x_{i}}-0
=\displaystyle= cosh(Ax)A,i\displaystyle~{}\cosh(Ax)\circ A_{*,i}

where the first step follows from the property of the gradient, the second step follows from the differential chain rule, and the last step directly follows from Lemma 2.7.

Proof of Part 4.

By substituting xix_{i} for tt in Part 2, we get

dLdxi=\displaystyle\frac{\mathrm{d}L}{\mathrm{d}x_{i}}= (sinh(Ax)b)(cosh(Ax)Adxdxi)\displaystyle~{}(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ\frac{A\mathrm{d}x}{\mathrm{d}x_{i}})
=\displaystyle= (sinh(Ax)b)(cosh(Ax)A,i)\displaystyle~{}(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(cosh(Ax))(sinh(Ax)b)\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\cosh(Ax))(\sinh(Ax)-b)

where the first step follows from the result of Part 2 and the second step follows from the result of Lemma 2.7, the last step follows from Fact 2.3.

Proof of Part 5.

We have

dLdx=\displaystyle\frac{\mathrm{d}L}{\mathrm{d}x}= Adiag(cosh(Ax))(sinh(Ax)b)\displaystyle~{}A^{\top}\operatorname{diag}(\cosh(Ax))(\sinh(Ax)-b)

where this step follows from the result of Part 4 directly. ∎
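Part 5 can be checked against finite differences (a minimal sketch with an arbitrary random instance; the magnitude mentioned in the comment is only indicative).

import numpy as np

rng = np.random.default_rng(4)
n, d = 8, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = 0.1 * rng.standard_normal(d)

def L(x):
    r = np.sinh(A @ x) - b
    return 0.5 * float(r @ r)

def grad(x):
    # A^T diag(cosh(Ax)) (sinh(Ax) - b), as in Part 5
    u = A @ x
    return A.T @ (np.cosh(u) * (np.sinh(u) - b))

eps = 1e-6
g_fd = np.array([(L(x + eps * e) - L(x - eps * e)) / (2 * eps) for e in np.eye(d)])
print(np.max(np.abs(grad(x) - g_fd)))   # should be on the order of 1e-8 or smaller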

5.3 Hessian

Lemma 5.3.
  • Part 1.

    d2(sinh(Ax)b)dxi2=\displaystyle\frac{\mathrm{d}^{2}(\sinh(Ax)-b)}{\mathrm{d}x_{i}^{2}}= A,isinh(Ax)A,i\displaystyle~{}A_{*,i}\circ\sinh(Ax)\circ A_{*,i}
  • Part 2.

    d2(sinh(Ax)b)dxidxj=\displaystyle\frac{\mathrm{d}^{2}(\sinh(Ax)-b)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= A,jsinh(Ax)A,i\displaystyle~{}A_{*,j}\circ\sinh(Ax)\circ A_{*,i}
  • Part 3.

    d2Lsinhdxi2=\displaystyle\frac{\mathrm{d}^{2}L_{\sinh}}{\mathrm{d}x_{i}^{2}}= A,idiag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})A_{*,i}
  • Part 4.

    d2Lsinhdxidxj=A,idiag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)A,j\displaystyle\frac{\mathrm{d}^{2}L_{\sinh}}{\mathrm{d}x_{i}\mathrm{d}x_{j}}=A_{*,i}^{\top}\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})A_{*,j}
Proof.

Proof of Part 1.

d2(sinh(Ax)b)dxi2=\displaystyle\frac{\mathrm{d}^{2}(\sinh(Ax)-b)}{\mathrm{d}x_{i}^{2}}= ddxi(d(sinh(Ax)b)dxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}(\sinh(Ax)-b)}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= d(cosh(Ax)A,i)dxi\displaystyle~{}\frac{\mathrm{d}(\cosh(Ax)\circ A_{*,i})}{\mathrm{d}x_{i}}
=\displaystyle= A,idcosh(Ax)dxi\displaystyle~{}A_{*,i}\circ\frac{\mathrm{d}\cosh(Ax)}{\mathrm{d}x_{i}}
=\displaystyle= A,isinh(Ax)A,i\displaystyle~{}A_{*,i}\circ\sinh(Ax)\circ A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from the differential chain rule, the third step extracts the vector A,iA_{*,i}, whose entries are constants, out of the derivative, and the last step also follows from the chain rule.

Proof of Part 2.

d2(sinh(Ax)b)dxidxj=\displaystyle\frac{\mathrm{d}^{2}(\sinh(Ax)-b)}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= ddxi(ddxj(sinh(Ax)b))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}}{\mathrm{d}x_{j}}\bigg{(}\sinh(Ax)-b\bigg{)}\bigg{)}
=\displaystyle= ddxi(cosh(Ax)A,j)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\cosh(Ax)\circ A_{*,j}\bigg{)}
=\displaystyle= A,jsinh(Ax)A,i\displaystyle~{}A_{*,j}\circ\sinh(Ax)\circ A_{*,i}

where the first step is an expansion of the Hessian, the second and third steps follow from the differential chain rule.

Proof of Part 3.

d2Ldxi2=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x_{i}^{2}}= ddxi(dLdxi)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}L}{\mathrm{d}x_{i}}\bigg{)}
=\displaystyle= ddxi((sinh(Ax)b)(cosh(Ax)A,i))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ A_{*,i})\bigg{)}
=\displaystyle= (cosh(Ax)A,i)(cosh(Ax)A,i)+(sinh(Ax)b)(A,isinh(Ax)A,i)\displaystyle~{}(\cosh(Ax)\circ A_{*,i})^{\top}\cdot(\cosh(Ax)\circ A_{*,i})+(\sinh(Ax)-b)^{\top}\cdot(A_{*,i}\circ\sinh(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(cosh2(Ax)+sinh2(Ax)bsinh(Ax))A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\cosh^{2}(Ax)+\sinh^{2}(Ax)-b\circ\sinh(Ax))A_{*,i}
=\displaystyle= A,idiag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)A,i\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})A_{*,i}

where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 5.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, and the last step follows from Fact 2.1.

Proof of Part 4.

d2Ldxidxj=\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x_{i}\mathrm{d}x_{j}}= ddxi(dLdxj)\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}\frac{\mathrm{d}L}{\mathrm{d}x_{j}}\bigg{)}
=\displaystyle= ddxi((sinh(Ax)b)(cosh(Ax)A,j))\displaystyle~{}\frac{\mathrm{d}}{\mathrm{d}x_{i}}\bigg{(}(\sinh(Ax)-b)^{\top}\cdot(\cosh(Ax)\circ A_{*,j})\bigg{)}
=\displaystyle= (cosh(Ax)A,i)(cosh(Ax)A,j)+(sinh(Ax)b)(A,jsinh(Ax)A,i)\displaystyle~{}(\cosh(Ax)\circ A_{*,i})^{\top}\cdot(\cosh(Ax)\circ A_{*,j})+(\sinh(Ax)-b)^{\top}\cdot(A_{*,j}\circ\sinh(Ax)\circ A_{*,i})
=\displaystyle= A,idiag(cosh2(Ax)+sinh2(Ax)bsinh(Ax))A,j\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(\cosh^{2}(Ax)+\sinh^{2}(Ax)-b\circ\sinh(Ax))A_{*,j}
=\displaystyle= A,idiag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)A,j\displaystyle~{}A_{*,i}^{\top}\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})A_{*,j}

where the first step is an expansion of the Hessian, the second step follows from Part 4 of Lemma 5.2, the third step follows from the product rule of calculus, the fourth step follows from Fact 2.3, and the last step follows from Fact 2.1.

5.4 Gradient and Hessian of the Loss function for Sinh Function

Lemma 5.4.

Let L:d0L:\mathbb{R}^{d}\to\mathbb{R}_{\geq 0} be defined in Definition 5.1. Then for any i,j[d]i,j\in[d], we have

  • Part 1. Gradient

    Lsinh=Adiag(cosh(Ax))diag(sinh(Ax)b)𝟏n\displaystyle\nabla L_{\sinh}=A^{\top}\operatorname{diag}(\cosh(Ax))\operatorname{diag}(\sinh(Ax)-b){\bf 1}_{n}
  • Part 2. Hessian

    2Lsinh=Adiag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)A\displaystyle\nabla^{2}L_{\sinh}=A^{\top}\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})A
Proof.

Part 1. We run Lemma 5.2 and Fact 2.3 directly.

Part 2. It follows from Parts 3 and 4 of Lemma 5.3. ∎
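The closed-form Hessian in Part 2 can likewise be compared against a finite-difference Hessian of LsinhL_{\sinh} (a sketch under the same arbitrary random-instance conventions as the earlier checks).

import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = 0.1 * rng.standard_normal(d)

def grad(x):
    u = A @ x
    return A.T @ (np.cosh(u) * (np.sinh(u) - b))

def hess(x):
    # A^T diag(2 sinh(Ax) o sinh(Ax) - b o sinh(Ax) + 1_n) A
    u = A @ x
    D = 2 * np.sinh(u) ** 2 - b * np.sinh(u) + 1.0
    return A.T @ (D[:, None] * A)

eps = 1e-6
H_fd = np.array([(grad(x + eps * e) - grad(x - eps * e)) / (2 * eps) for e in np.eye(d)]).T
print(np.max(np.abs(hess(x) - H_fd)))   # finite-difference error only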

5.5 Loss Function with a Regularization Term

Definition 5.5.

Given matrix An×dA\in\mathbb{R}^{n\times d} and bnb\in\mathbb{R}^{n}, wnw\in\mathbb{R}^{n}. For a vector xdx\in\mathbb{R}^{d}, we define loss function L(x)L(x) as follows

Lsinh,reg(x):=0.5sinh(Ax)b22+0.5WAx22\displaystyle L_{\sinh,\operatorname{reg}}(x):=0.5\cdot\|\sinh(Ax)-b\|_{2}^{2}+0.5\cdot\|WAx\|_{2}^{2}

where W=diag(w)W=\operatorname{diag}(w).

Lemma 5.6.

Let LL be defined as Definition 5.5, then we have

  • Part 1. Gradient

    dLsinh,regdx=Adiag(cosh(Ax))(diag(sinh(Ax)b))𝟏n+AW2Ax\displaystyle\frac{\mathrm{d}L_{\sinh,\operatorname{reg}}}{\mathrm{d}x}=A^{\top}\operatorname{diag}(\cosh(Ax))(\operatorname{diag}(\sinh(Ax)-b)){\bf 1}_{n}+A^{\top}W^{2}Ax
  • Part 2. Hessian

    d2Lsinh,regdx2=Adiag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)A+AW2A\displaystyle\frac{\mathrm{d}^{2}L_{\sinh,\operatorname{reg}}}{\mathrm{d}x^{2}}=A^{\top}\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})A+A^{\top}W^{2}A
Proof.

Proof of Part 1. We run Lemma 5.4 and Lemma 2.8 directly.

Proof of Part 2. We run Lemma 5.4 and Lemma 2.8 directly. ∎

5.6 Hessian is Positive Definite

Lemma 5.7.

Let l>0l>0 denote a parameter. If wi2>0.5bi2+l/σmin(A)21w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}-1 for all i[n]i\in[n], then

d2Ldx2lId\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}\succeq l\cdot I_{d}
Proof.

We define DD

D=diag(2sinh(Ax)sinh(Ax)bsinh(Ax)+𝟏n)+W2\displaystyle D=\operatorname{diag}(2\sinh(Ax)\circ\sinh(Ax)-b\circ\sinh(Ax)+{\bf 1}_{n})+W^{2}

Then we can rewrite Hessian as

d2Ldx2=ADA.\displaystyle\frac{\mathrm{d}^{2}L}{\mathrm{d}x^{2}}=A^{\top}DA.

We define

zi=sinh((Ax)i)\displaystyle z_{i}=\sinh((Ax)_{i})

Then we have

Di,i=\displaystyle D_{i,i}= (2sinh2((Ax)i)+1bisinh((Ax)i))+wi2\displaystyle~{}(2\sinh^{2}((Ax)_{i})+1-b_{i}\sinh((Ax)_{i}))+w_{i}^{2}
=\displaystyle= (2zi2+1bizi)+wi2\displaystyle~{}(2z_{i}^{2}+1-b_{i}z_{i})+w_{i}^{2}
=\displaystyle= 2zi2bizi+wi2+1\displaystyle~{}2z_{i}^{2}-b_{i}z_{i}+w_{i}^{2}+1
>\displaystyle> 2zi2bizi+0.5bi2+l/σmin(A)2\displaystyle~{}2z_{i}^{2}-b_{i}z_{i}+0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}
=\displaystyle= 0.5(zibi)2+1.5zi2+l/σmin(A)2\displaystyle~{}0.5(z_{i}-b_{i})^{2}+1.5z_{i}^{2}+l/\sigma_{\min}(A)^{2}
\displaystyle\geq l/σmin(A)2\displaystyle~{}l/\sigma_{\min}(A)^{2}

where the first step follows from simple algebra, the second step follows from replacing sinh(Ax)\sinh(Ax) with z=sinh(Ax)z=\sinh(Ax) and cosh2()=sinh2()+1\cosh^{2}()=\sinh^{2}()+1 (Fact 2.1), the third step follows from wi2>0.5bi2+l/σmin(A)21w_{i}^{2}>0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}-1, the fourth step follows from simple algebra, the fifth step follows from x20,xx^{2}\geq 0,\forall x.

Since Di,i>l/σmin(A)2D_{i,i}>l/\sigma_{\min}(A)^{2} for all i[n]i\in[n], by Lemma 2.9 we have

ADA(mini[n]Di,i)σmin(A)2IdlId\displaystyle A^{\top}DA\succeq(\min_{i\in[n]}D_{i,i})\cdot\sigma_{\min}(A)^{2}I_{d}\succeq l\cdot I_{d}

Thus, the Hessian is positive definite for all xdx\in\mathbb{R}^{d}, and therefore the function is convex. ∎
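Analogously to the cosh case, Lemma 5.7 can be verified numerically (an illustrative sketch; the value of l, the slack, and the instance are arbitrary, and the clipping at zero only makes the weights larger).

import numpy as np

rng = np.random.default_rng(6)
n, d = 10, 4
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = 0.1 * rng.standard_normal(d)
l = 0.5
sigma_min = np.linalg.svd(A, compute_uv=False).min()

# w_i^2 slightly above 0.5 b_i^2 + l / sigma_min(A)^2 - 1, clipped at zero
w2 = np.maximum(0.5 * b ** 2 + l / sigma_min ** 2 - 1.0 + 1e-3, 0.0)

u = A @ x
D = 2 * np.sinh(u) ** 2 - b * np.sinh(u) + 1.0 + w2
H = A.T @ (D[:, None] * A)
print(np.linalg.eigvalsh(H).min(), ">=", l)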

5.7 Hessian is Lipschitz

Lemma 5.8 (Hessian is Lipschitz).

If the following condition holds

  • Let H(x)=d2Lsinh,regdx2H(x)=\frac{\mathrm{d}^{2}L_{\sinh,\operatorname{reg}}}{\mathrm{d}x^{2}}

  • Let R>2R>2

  • x2R,y2R\|x\|_{2}\leq R,\|y\|_{2}\leq R

  • A(xy)<0.01\|A(x-y)\|_{\infty}<0.01

  • AR\|A\|\leq R

  • b2R\|b\|_{2}\leq R

Then we have

H(x)H(y)nexp(6R2)xy2\displaystyle\|H(x)-H(y)\|\leq\sqrt{n}\exp(6R^{2})\cdot\|x-y\|_{2}
Proof.

We have

H(x)H(y)\displaystyle~{}\|H(x)-H(y)\|
=\displaystyle= Adiag(2sinh(Ax)b)diag(sinh(Ax))AAdiag(2sinh(Ay)b)diag(sinh(Ay))A\displaystyle~{}\|A^{\top}\operatorname{diag}(2\sinh(Ax)-b)\operatorname{diag}(\sinh(Ax))A-A^{\top}\operatorname{diag}(2\sinh(Ay)-b)\operatorname{diag}(\sinh(Ay))A\|
\displaystyle\leq A2(2sinh(Ax)b)sinh(Ax)(2sinh(Ay)b)sinh(Ay)2\displaystyle~{}\|A\|^{2}\cdot\|(2\sinh(Ax)-b)\circ\sinh(Ax)-(2\sinh(Ay)-b)\circ\sinh(Ay)\|_{2}
=\displaystyle= A22(sinh(Ax)+sinh(Ay))(sinh(Ax)sinh(Ay))b(sinh(Ax)sinh(Ay))2\displaystyle~{}\|A\|^{2}\cdot\|2(\sinh(Ax)+\sinh(Ay))\circ(\sinh(Ax)-\sinh(Ay))-b\circ(\sinh(Ax)-\sinh(Ay))\|_{2}
=\displaystyle= A2(2sinh(Ax)+2sinh(Ay)b)(sinh(Ax)sinh(Ay))2\displaystyle~{}\|A\|^{2}\cdot\|(2\sinh(Ax)+2\sinh(Ay)-b)\circ(\sinh(Ax)-\sinh(Ay))\|_{2}
\displaystyle\leq A2(2sinh(Ax)+2sinh(Ay)b)sinh(Ax)sinh(Ay)2\displaystyle~{}\|A\|^{2}\cdot\|(2\sinh(Ax)+2\sinh(Ay)-b)\|_{\infty}\cdot\|\sinh(Ax)-\sinh(Ay)\|_{2} (9)

where the first step follows from H(x)=2LH(x)=\nabla^{2}L and simple algebra, the second step follows from Fact 2.5, the third step follows from simple algebra, the fourth step follows from simple algebra, the last step follows from Fact 2.4.

For the first term in Eq. (9), we have

A2R2\displaystyle\|A\|^{2}\leq R^{2} (10)

For the second term in Eq. (9), we have

(2sinh(Ax)+2sinh(Ay)b)\displaystyle\|(2\sinh(Ax)+2\sinh(Ay)-b)\|_{\infty}\leq 2sinh(Ax)+2sinh(Ay)+b\displaystyle~{}\|2\sinh(Ax)\|_{\infty}+\|2\sinh(Ay)\|_{\infty}+\|b\|_{\infty}
\displaystyle\leq 2cosh(Ax)+2cosh(Ay)+b\displaystyle~{}\|2\cosh(Ax)\|_{\infty}+\|2\cosh(Ay)\|_{\infty}+\|b\|_{\infty}
\displaystyle\leq 2exp(Ax2)+2exp(Ay2)+b\displaystyle~{}2\exp(\|Ax\|_{2})+2\exp(\|Ay\|_{2})+\|b\|_{\infty}
\displaystyle\leq 4exp(R2)+b\displaystyle~{}4\exp(R^{2})+\|b\|_{\infty}
\displaystyle\leq 4exp(R2)+R\displaystyle~{}4\exp(R^{2})+R
\displaystyle\leq 5exp(R2)\displaystyle~{}5\exp(R^{2}) (11)

where the first step follows from Fact 2.4 , the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from Ax2R2,Ay2R2\|Ax\|_{2}\leq R^{2},\|Ay\|_{2}\leq R^{2}, the fifth step follows from bR\|b\|_{\infty}\leq R, the last step follows from R2R\geq 2.

For the third term in Eq. (9), we have

sinh(Ax)sinh(Ay)2\displaystyle\|\sinh(Ax)-\sinh(Ay)\|_{2}\leq cosh(Ax)22A(yx)\displaystyle~{}\|\cosh(Ax)\|_{2}\cdot 2\|A(y-x)\|_{\infty}
\displaystyle\leq ncosh(Ax)2A(yx)\displaystyle~{}\sqrt{n}\|\cosh(Ax)\|_{\infty}\cdot 2\|A(y-x)\|_{\infty}
\displaystyle\leq nexp(Ax2)2A(yx)2\displaystyle~{}\sqrt{n}\exp(\|Ax\|_{2})\cdot 2\|A(y-x)\|_{2}
\displaystyle\leq nexp(R2)2A(yx)2\displaystyle~{}\sqrt{n}\exp(R^{2})\cdot 2\|A(y-x)\|_{2}
\displaystyle\leq nexp(R2)2Ayx2\displaystyle~{}\sqrt{n}\exp(R^{2})\cdot 2\|A\|\cdot\|y-x\|_{2}
\displaystyle\leq 2nRexp(R2)yx2\displaystyle~{}2\sqrt{n}R\exp(R^{2})\cdot\|y-x\|_{2} (12)

where the first step follows from A(yx)<0.01\|A(y-x)\|_{\infty}<0.01 and Fact 2.4, the second step follows from Fact 2.4, the third step follows from Fact 2.4, the fourth step follows from Ax2R2\|Ax\|_{2}\leq R^{2}, the fifth step follows from Fact 2.5, the last step follows from AR\|A\|\leq R.

Putting it all together, we have

H(x)H(y)\displaystyle\|H(x)-H(y)\|\leq R25exp(R2)2nRexp(R2)yx2\displaystyle~{}R^{2}\cdot 5\exp(R^{2})\cdot 2\sqrt{n}R\exp(R^{2})\|y-x\|_{2}
=\displaystyle= 10nR3exp(2R2)yx2\displaystyle~{}10\sqrt{n}R^{3}\exp(2R^{2})\cdot\|y-x\|_{2}
\displaystyle\leq nexp(4R2)exp(2R2)yx2\displaystyle~{}\sqrt{n}\exp(4R^{2})\cdot\exp(2R^{2})\cdot\|y-x\|_{2}
=\displaystyle= nexp(6R2)yx2\displaystyle~{}\sqrt{n}\exp(6R^{2})\cdot\|y-x\|_{2}

where the first step follows from applying Eq. (10), Eq. (11), and Eq. (12), the second step follows from simple algebra, the third step follows from R2R\geq 2, and the last step follows from simple algebra.

6 Newton Method

In this section, we provide an approximate version of Newton's method for solving convex optimization problems, together with a detailed analysis. In Section 6.1 we state the assumptions under which the optimization problem can be tackled efficiently. In Section 6.2 we state a simple lemma which is useful in Section 6.5. In Section 6.3 we provide an approximate variant of the update step of Newton's method for convex optimization. In Section 6.4 we provide a lower bound on H(xk)\|H(x_{k})\|. In Section 6.5 we provide an upper bound on rk+1r_{k+1} and thus show that our approximate update step makes progress toward the optimum. In Section 6.6 we provide an induction lemma showing that this progress is maintained across all iterations. In Section 6.7, we prove our main result.

6.1 Definition and Update Rule

Let us study the local convergence of the Newton method. Consider the problem

minxdf(x)\displaystyle\min_{x\in\mathbb{R}^{d}}f(x)

under the following assumptions:

Definition 6.1.

We have

  • ll-local Minimum. Let l>0l>0 denote a parameter. There is a vector xdx^{*}\in\mathbb{R}^{d} such that

    • f(x)=𝟎d\nabla f(x^{*})={\bf 0}_{d}.

    • 2f(x)lId\nabla^{2}f(x^{*})\succeq l\cdot I_{d}.

  • Hessian is MM-Lipschitz. Let M>0M>0 denote a parameter that

    2f(y)2f(x)Myx2\displaystyle\|\nabla^{2}f(y)-\nabla^{2}f(x)\|\leq M\cdot\|y-x\|_{2}
  • Good Initialization Point. Let r0:=x0x2r_{0}:=\|x_{0}-x_{*}\|_{2} such that

    r0M0.1l\displaystyle r_{0}M\leq 0.1l

We define gradient and Hessian as follows

Definition 6.2 (Gradient and Hessian).

We define the gradient function g:ddg:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as

g(x):=f(x)\displaystyle g(x):=\nabla f(x)

We define the Hessian function H:dd×dH:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d} ,

H(x):=2f(x)\displaystyle H(x):=\nabla^{2}f(x)

Using g:ddg:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} and H:dd×dH:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d}, we can rewrite the exact update process as follows:

Definition 6.3 (Exact update).
xk+1=xkH(xk)1g(xk)\displaystyle x_{k+1}=x_{k}-H(x_{k})^{-1}\cdot g(x_{k})
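As a concrete instance of the exact update (a sketch only: the choice of the regularized sinh loss, the zero initialization, the weight vector, and the stopping rule are arbitrary illustration choices), the iteration can be written as follows.

import numpy as np

rng = np.random.default_rng(7)
n, d = 20, 5
A = rng.standard_normal((n, d))
b = 0.3 * rng.standard_normal(n)
w = np.sqrt(0.5 * b ** 2 + 1.0)          # keeps the Hessian positive definite

def g(x):
    u = A @ x
    return A.T @ (np.cosh(u) * (np.sinh(u) - b)) + A.T @ (w ** 2 * u)

def H(x):
    u = A @ x
    D = 2 * np.sinh(u) ** 2 - b * np.sinh(u) + 1.0 + w ** 2
    return A.T @ (D[:, None] * A)

x = np.zeros(d)
for k in range(20):
    x = x - np.linalg.solve(H(x), g(x))  # exact Newton step
    if np.linalg.norm(g(x)) < 1e-10:
        break
print(k, np.linalg.norm(g(x)))           # iteration count and final gradient norm

The remainder of this section replaces the exact Hessian solve with a cheaper approximation.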

6.2 Connection between Gradient and Hessian

Lemma 6.4 (folklore).

Let gg denote the gradient function and let H:dd×dH:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d} denote the hessian function, then for any x,yx,y, we have

g(y)g(x)=01H(x+τ(yx))(yx)dτ\displaystyle g(y)-g(x)=\int_{0}^{1}H(x+\tau(y-x))\cdot(y-x)\mathrm{d}\tau
Proof.

We have

01H(x+τ(yx))(yx)dτ=\displaystyle\int_{0}^{1}H(x+\tau(y-x))\cdot(y-x)\mathrm{d}\tau= g(x+τ(yx))|01\displaystyle~{}g(x+\tau(y-x))\big{|}_{0}^{1}
=\displaystyle= g(x+1(yx))g(x+0(yx))\displaystyle~{}g(x+1\cdot(y-x))-g(x+0\cdot(y-x))
=\displaystyle= g(y)g(x)\displaystyle~{}g(y)-g(x)

6.3 Approximation of Hessian and Update Rule

In many optimization applications, computing 2f(xk)\nabla^{2}f(x_{k}) or (2f(xk))1(\nabla^{2}f(x_{k}))^{-1} is quite expensive. Therefore, a natural motivation is to approximately formulate its Hessian or inverse of Hessian.

Definition 6.5 (Approximate Hessian).

For any H(xk)H(x_{k}), we define H~(xk)\widetilde{H}(x_{k}) to satisfy the following condition

(1ϵH)H(xk)H~(xk)(1+ϵH)H(xk).\displaystyle(1-\epsilon_{H})\cdot H(x_{k})\preceq\widetilde{H}(x_{k})\preceq(1+\epsilon_{H})\cdot H(x_{k}).

To efficiently compute H~(xk)\widetilde{H}(x_{k}), we use a standard tool from the literature

Lemma 6.6 ([65, 26]).

Let ϵH=0.01\epsilon_{H}=0.01. Given a matrix An×dA\in\mathbb{R}^{n\times d}, for any positive diagonal matrix Dn×nD\in\mathbb{R}^{n\times n}, there is an algorithm that runs in time

O((nnz(A)+dω)poly(log(n/δ)))\displaystyle O((\operatorname{nnz}(A)+d^{\omega})\operatorname{poly}(\log(n/\delta)))

and outputs an O(dlog(n/δ))O(d\log(n/\delta))-sparse diagonal matrix D~n×n\widetilde{D}\in\mathbb{R}^{n\times n} such that

(1ϵH)ADAAD~A(1+ϵH)ADA.\displaystyle(1-\epsilon_{H})A^{\top}DA\preceq A^{\top}\widetilde{D}A\preceq(1+\epsilon_{H})A^{\top}DA.
Definition 6.7 (Approximate update).

We consider the following process

xk+1d×1=xkd×1H~(xk)1d×dg(xk)d×1\displaystyle\underbrace{x_{k+1}}_{d\times 1}=\underbrace{x_{k}}_{d\times 1}-\underbrace{\widetilde{H}(x_{k})^{-1}}_{d\times d}\cdot\underbrace{g(x_{k})}_{d\times 1}
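The sketch below illustrates the approximate update. In place of the sparsification routine behind Lemma 6.6 (which we do not reproduce), it uses plain uniform row sampling of D^{1/2}A as a stand-in spectral approximation; this stand-in is an assumption made purely for illustration and does not carry the guarantees of [65, 26].

import numpy as np

rng = np.random.default_rng(8)
n, d = 2000, 10
A = rng.standard_normal((n, d)) / np.sqrt(n)
b = 0.3 * rng.standard_normal(n)
w = np.sqrt(0.5 * b ** 2 + 1.0)

def g(x):
    u = A @ x
    return A.T @ (np.cosh(u) * (np.sinh(u) - b)) + A.T @ (w ** 2 * u)

def D_diag(x):
    u = A @ x
    return 2 * np.sinh(u) ** 2 - b * np.sinh(u) + 1.0 + w ** 2

def H_tilde(x, m=1000):
    # stand-in for Lemma 6.6: uniformly subsample m rows of D^{1/2} A and rescale,
    # so that E[S^T S] = A^T D A (not the leverage-score routine of [65, 26])
    idx = rng.choice(n, size=m, replace=True)
    S = np.sqrt(D_diag(x)[idx] * (n / m))[:, None] * A[idx]
    return S.T @ S

x = np.zeros(d)
for k in range(30):
    x = x - np.linalg.solve(H_tilde(x), g(x))   # approximate update of Definition 6.7
    if np.linalg.norm(g(x)) < 1e-8:
        break
print(k, np.linalg.norm(g(x)))

Since g(x)=𝟎dg(x^{*})={\bf 0}_{d}, the fixed point of this update is the optimum regardless of the approximation; the quality of the approximation only affects the contraction rate, in line with Lemma 6.9.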

6.4 Property of Hessian

Lemma 6.8.

If the following conditions hold

  • Let ff be function that Hessian is MM-Lipschitz (see Definition 6.1)

  • Suppose the optimal solution xx^{*} satisfies H(x)l\|H(x^{*})\|\geq l (see Definition 6.1)

  • Let rk:=xkx2r_{k}:=\|x_{k}-x^{*}\|_{2}

We have

H(xk)lMrk\displaystyle\|H(x_{k})\|\geq l-M\cdot r_{k}
Proof.

We can show that

H(xk)\displaystyle\|H(x_{k})\|\geq H(x)H(x)H(xk)\displaystyle~{}\|H(x_{*})\|-\|H(x_{*})-H(x_{k})\|
\displaystyle\geq H(x)Mxxk2\displaystyle~{}\|H(x_{*})\|-M\cdot\|x_{*}-x_{k}\|_{2}
\displaystyle\geq lMrk\displaystyle~{}l-Mr_{k}

where the first step follows from Fact 2.5, the second step follows from the Hessian of ff being MM-Lipschitz (see Definition 6.1), and the third step follows from H(x)l\|H(x^{*})\|\geq l. ∎

6.5 One Step Shrinking Lemma

Lemma 6.9.

If the following conditions hold

  • Function ff follows from Definition 6.1.

  • Let HH denote the Hessian of ff

  • Let gg denote the gradient of ff

  • Let rk:=xkx2r_{k}:=\|x_{k}-x^{*}\|_{2}

Then we have

rk+12(ϵH+MrklMrk)rk\displaystyle r_{k+1}\leq 2(\epsilon_{H}+\frac{Mr_{k}}{l-Mr_{k}})\cdot r_{k}
Proof.

We have

xk+1x=\displaystyle x_{k+1}-x^{*}= xkxH~(xk)1g(xk)\displaystyle~{}x_{k}-x^{*}-\widetilde{H}(x_{k})^{-1}\cdot g(x_{k})
=\displaystyle= xkxH~(xk)1(g(xk)g(x))\displaystyle~{}x_{k}-x^{*}-\widetilde{H}(x_{k})^{-1}\cdot(g(x_{k})-g(x^{*}))
=\displaystyle= xkxH~(xk)101H(x+τ(xkx))(xkx)dτ\displaystyle~{}x_{k}-x^{*}-\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}H(x^{*}+\tau(x_{k}-x^{*}))(x_{k}-x^{*})\mathrm{d}\tau
=\displaystyle= H~(xk)1(H~(xk)(xkx))H~(xk)101H(x+τ(xkx))(xkx)dτ\displaystyle~{}\widetilde{H}(x_{k})^{-1}(\widetilde{H}(x_{k})(x_{k}-x^{*}))-\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}H(x^{*}+\tau(x_{k}-x^{*}))(x_{k}-x^{*})\mathrm{d}\tau
=\displaystyle= H~(xk)1(H~(xk)01H(x+τ(xkx))dτ)(xkx)\displaystyle~{}\widetilde{H}(x_{k})^{-1}\bigg{(}\widetilde{H}(x_{k})-\int_{0}^{1}H(x^{*}+\tau(x_{k}-x^{*}))\mathrm{d}\tau\bigg{)}\cdot(x_{k}-x^{*})
=\displaystyle= H~(xk)1(01H~(xk)dτ01H(x+τ(xkx))dτ)(xkx)\displaystyle~{}\widetilde{H}(x_{k})^{-1}\bigg{(}\int_{0}^{1}\widetilde{H}(x_{k})\mathrm{d}\tau-\int_{0}^{1}H(x^{*}+\tau(x_{k}-x^{*}))\mathrm{d}\tau\bigg{)}\cdot(x_{k}-x^{*})
=\displaystyle= (H~(xk)101(H~(xk)H(x+τ(xkx))dτ)(xkx)\displaystyle~{}\bigg{(}\widetilde{H}(x_{k})^{-1}\int_{0}^{1}(\widetilde{H}(x_{k})-H(x^{*}+\tau(x_{k}-x^{*}))\mathrm{d}\tau\bigg{)}\cdot(x_{k}-x^{*})
=\displaystyle= Gk(xkx)\displaystyle~{}G_{k}\cdot(x_{k}-x^{*}) (13)

where the first step follows from Definition 6.7, the second step follows from g(x)=𝟎dg(x^{*})={\bf 0}_{d}, the third step follows from Lemma 6.4, the fourth step follows from H1H=IH^{-1}H=I, the fifth step follows from simple algebra, the sixth step follows from simple algebra, and the last step follows from rewriting the expression using the matrix GkG_{k} defined below:

Gk:=H~(xk)101H~(xk)H(x+τ(xkx))dτ\displaystyle G_{k}:=\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}\widetilde{H}(x_{k})-H(x^{*}+\tau(x_{k}-x^{*}))\mathrm{d}\tau

Then we can bound Gk\|G_{k}\| as follows

Gk=\displaystyle\|G_{k}\|= H~(xk)101(H~(xk)H(x+τ(xkx)))dτ\displaystyle~{}\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}(\widetilde{H}(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{\|}
=\displaystyle= H~(xk)101(H~(xk)H(xk)+H(xk)H(x+τ(xkx)))dτ\displaystyle~{}\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}(\widetilde{H}(x_{k})-H(x_{k})+H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{\|}
=\displaystyle= H~(xk)1(01(H~(xk)H(xk))dτ+01(H(xk)H(x+τ(xkx)))dτ)\displaystyle~{}\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\Big{(}\int_{0}^{1}(\widetilde{H}(x_{k})-H(x_{k}))\mathrm{d}\tau+\int_{0}^{1}(H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{)}\Big{\|}
\displaystyle\leq H~(xk)101(H~(xk)H(xk))dτ+H~(xk)101(H(xk)H(x+τ(xkx)))dτ\displaystyle~{}\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}(\widetilde{H}(x_{k})-H(x_{k}))\mathrm{d}\tau\Big{\|}+\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}(H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{\|} (14)

where the first step follows from the definition of GkG_{k}, the second step follows from simple algebra, the third step follows from simple algebra, the last step follows from Fact 2.5.

For the first term, we have

H~(xk)101(H~(xk)H(xk))dτ=\displaystyle\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}(\widetilde{H}(x_{k})-H(x_{k}))\mathrm{d}\tau\Big{\|}= H~(xk)1(H~(xk)H(xk))01dτ\displaystyle~{}\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot(\widetilde{H}(x_{k})-H(x_{k}))\int_{0}^{1}\mathrm{d}\tau\Big{\|}
=\displaystyle= H~(xk)1(H~(xk)H(xk))\displaystyle~{}\|\widetilde{H}(x_{k})^{-1}(\widetilde{H}(x_{k})-H(x_{k}))\|
\displaystyle\leq 2ϵH\displaystyle~{}2\epsilon_{H} (15)

where the first step follows from simple algebra, the second step follows from 01dτ=1\int_{0}^{1}\mathrm{d}\tau=1, the third step follows from Fact 2.6.

For the second term, we have

H~(xk)101(H(xk)H(x+τ(xkx)))dτ\displaystyle~{}\Big{\|}\widetilde{H}(x_{k})^{-1}\cdot\int_{0}^{1}(H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{\|}
\displaystyle\leq H~(xk)101(H(xk)H(x+τ(xkx)))dτ\displaystyle~{}\|\widetilde{H}(x_{k})^{-1}\|\cdot\Big{\|}\int_{0}^{1}(H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{\|}
\displaystyle\leq (1+ϵH)H(xk)101(H(xk)H(x+τ(xkx)))dτ\displaystyle~{}(1+\epsilon_{H})\cdot\|H(x_{k})^{-1}\|\cdot\Big{\|}\int_{0}^{1}(H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*})))\mathrm{d}\tau\Big{\|}
\displaystyle\leq (1+ϵH)H(xk)101H(xk)H(x+τ(xkx))dτ\displaystyle~{}(1+\epsilon_{H})\cdot\|H(x_{k})^{-1}\|\cdot\int_{0}^{1}\Big{\|}H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*}))\Big{\|}\mathrm{d}\tau
\displaystyle\leq (1+ϵH)H(xk)1maxτ[0,1]H(xk)H(x+τ(xkx))\displaystyle~{}(1+\epsilon_{H})\cdot\|H(x_{k})^{-1}\|\cdot\max_{\tau\in[0,1]}\Big{\|}H(x_{k})-H(x^{*}+\tau(x_{k}-x^{*}))\Big{\|}
\displaystyle\leq (1+ϵH)H(xk)1rkM\displaystyle~{}(1+\epsilon_{H})\cdot\|H(x_{k})^{-1}\|\cdot r_{k}M
\displaystyle\leq (1+ϵH)(lMrk)1rkM\displaystyle~{}(1+\epsilon_{H})\cdot(l-Mr_{k})^{-1}\cdot r_{k}M
\displaystyle\leq 2MrklMrk\displaystyle~{}2\frac{Mr_{k}}{l-Mr_{k}} (16)

where the first step follows from Fact 2.5, the second step follows from the spectral approximation guarantee in Definition 6.5, the third step follows from the triangle inequality for integrals, the fourth step follows from 01f(τ)dτmaxτ[0,1]f(τ)\int_{0}^{1}f(\tau)\mathrm{d}\tau\leq\max_{\tau\in[0,1]}f(\tau), the fifth step follows from the MM-Lipschitz Hessian in Definition 6.1, the sixth step follows from H(xk)lMrk\|H(x_{k})\|\geq l-Mr_{k} (see Lemma 6.8), and the last step follows from ϵH(0,1)\epsilon_{H}\in(0,1).

Thus, we have,

rk+1=\displaystyle r_{k+1}= Gk(xkx)\displaystyle~{}\|G_{k}\cdot(x_{k}-x^{*})\|
\displaystyle\leq Gk(xkx)\displaystyle~{}\|G_{k}\|\cdot\|(x_{k}-x^{*})\|
=\displaystyle= Gkrk\displaystyle~{}\|G_{k}\|\cdot r_{k}
\displaystyle\leq 2(ϵH+MrklMrk)rk\displaystyle~{}2(\epsilon_{H}+\frac{Mr_{k}}{l-Mr_{k}})\cdot r_{k}

where the first step follows from Eq. (13) and the definition of rk+1r_{k+1}, the second step follows from abab,a,b\|ab\|\leq\|a\|\|b\|,\forall a,b, the third step follows from rk=xkx2r_{k}=\|x_{k}-x^{*}\|_{2}, and the last step follows from Eq. (14), Eq. (15), and Eq. (16).

6.6 Induction

Lemma 6.10.

If the following conditions hold

  • ϵH=0.01\epsilon_{H}=0.01

  • ri0.4ri1r_{i}\leq 0.4r_{i-1}, for all i[k]i\in[k]

  • Mri0.1lM\cdot r_{i}\leq 0.1l, for all i[k]i\in[k]

Then we have

  • rk+10.4rkr_{k+1}\leq 0.4r_{k}

  • Mrk+10.1lM\cdot r_{k+1}\leq 0.1l

Proof.

Proof of Part 1.

We have

rk+1\displaystyle r_{k+1}\leq 2(ϵH+MrklMrk)rk\displaystyle 2(\epsilon_{H}+\frac{Mr_{k}}{l-Mr_{k}})\cdot r_{k}
\displaystyle\leq 2(0.01+MrklMrk)rk\displaystyle~{}2(0.01+\frac{Mr_{k}}{l-Mr_{k}})r_{k}
\displaystyle\leq 2(0.01+0.1ll0.1l)rk\displaystyle~{}2(0.01+\frac{0.1l}{l-0.1l})r_{k}
\displaystyle\leq 0.4rk\displaystyle~{}0.4r_{k}

where the first step follows from Lemma 6.9, the second step follows from ϵH=0.01\epsilon_{H}=0.01, the third step follows from Mrk0.1lM\cdot r_{k}\leq 0.1l, and the last step follows from 2(0.01+1/9)0.42\cdot(0.01+1/9)\leq 0.4.

Proof of Part 2.

We have

Mrk+1\displaystyle M\cdot r_{k+1}\leq M0.4rk\displaystyle~{}M\cdot 0.4r_{k}
\displaystyle\leq 0.40.1l\displaystyle~{}0.4\cdot 0.1l
\displaystyle\leq 0.1l\displaystyle~{}0.1l

where the first step follows from Part 1, the second step follows from Mrk0.1lM\cdot r_{k}\leq 0.1l, and the last step follows from simple algebra. ∎
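The induction can also be seen by iterating the scalar recurrence of Lemma 6.9 directly (a toy illustration; the constants below are arbitrary but satisfy the hypotheses Mr_0 <= 0.1l and eps_H = 0.01).

# iterate r_{k+1} <= 2 (eps_H + M r_k / (l - M r_k)) r_k and watch it contract
eps_H, M, l = 0.01, 1.0, 10.0
r = 0.1 * l / M                      # initial point satisfies M r_0 <= 0.1 l
for k in range(10):
    r = 2 * (eps_H + M * r / (l - M * r)) * r
    print(k, r)                      # each value is at most 0.4 times the previous one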

6.7 Main Result

We state our main result as follows.

Theorem 6.11 (Formal version of Theorem 1.1).

Given matrix An×dA\in\mathbb{R}^{n\times d}, bnb\in\mathbb{R}^{n}, and wnw\in\mathbb{R}^{n}.

Let ff be any of functions exp,cosh\exp,\cosh and sinh\sinh.

Let xx^{*} denote the optimal solution of

minxd0.5f(Ax)b22+0.5diag(w)Ax22\displaystyle\min_{x\in\mathbb{R}^{d}}0.5\|f(Ax)-b\|_{2}^{2}+0.5\|\operatorname{diag}(w)Ax\|_{2}^{2}

such that g(x)=𝟎dg(x^{*})={\bf 0}_{d} and x2R\|x^{*}\|_{2}\leq R.

Let AR,b2R\|A\|\leq R,\|b\|_{2}\leq R.

  • Let wi20.5bi2+lw_{i}^{2}\geq 0.5b_{i}^{2}+l for all i[n]i\in[n]. (If f=expf=\exp, see Lemma 3.7)

  • Let wi20.5bi2+l/σmin(A)2+1w_{i}^{2}\geq 0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}+1 for all i[n]i\in[n]. (If f=coshf=\cosh, see Lemma 4.7)

  • Let wi20.5bi2+l/σmin(A)21w_{i}^{2}\geq 0.5b_{i}^{2}+l/\sigma_{\min}(A)^{2}-1 for all i[n]i\in[n]. (If f=sinhf=\sinh, see Lemma 5.7)

Let M=nexp(6R2)M=\sqrt{n}\cdot\exp(6R^{2}).

Let x0x_{0} denote an initial point such that Mx0x20.1lM\|x_{0}-x^{*}\|_{2}\leq 0.1l.

For any accuracy parameter ϵ(0,0.1)\epsilon\in(0,0.1) and failure probability δ(0,0.1)\delta\in(0,0.1), there is a randomized algorithm (Algorithm 1) that runs in log(x0x2/ϵ)\log(\|x_{0}-x^{*}\|_{2}/\epsilon) iterations and spends

O((nnz(A)+dω)poly(log(n/δ)))\displaystyle O((\operatorname{nnz}(A)+d^{\omega})\cdot\operatorname{poly}(\log(n/\delta)))

time per iteration, and finally outputs a vector x~d\widetilde{x}\in\mathbb{R}^{d} such that

x~x2ϵ\displaystyle\|\widetilde{x}-x^{*}\|_{2}\leq\epsilon

holds with probability at least 1δ1-\delta.

Proof.

Proof of exp\exp function.

It follows from combining Lemma 3.7, Lemma 6.10, Lemma 6.6, and Lemma 6.9.

After TT iterations, we have

xTx20.4Tx0x2\displaystyle\|x_{T}-x^{*}\|_{2}\leq 0.4^{T}\cdot\|x_{0}-x^{*}\|_{2}

By the choice of TT, we get the desired bound. The failure probability follows from a union bound over the TT iterations.

Proof of cosh\cosh function.

It follows from combining Lemma 4.7, Lemma 6.10, Lemma 6.6, and Lemma 6.9.

Proof of sinh\sinh function.

It follows from combining Lemma 5.7, Lemma 6.10, Lemma 6.6, and Lemma 6.9. ∎
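Putting the pieces together, the following sketch outlines the overall procedure for any f in {exp, cosh, sinh}. It is an illustrative reimplementation, not the pseudocode of Algorithm 1: it uses an exact Hessian solve in place of the sparsification step of Lemma 6.6, and the instance, weight choice, and stopping rule are arbitrary.

import numpy as np

def grad_hess(f_name, A, b, w, x):
    # gradient and Hessian of 0.5 ||f(Ax) - b||_2^2 + 0.5 ||diag(w) A x||_2^2
    u = A @ x
    if f_name == "exp":
        fu, dfu, d2fu = np.exp(u), np.exp(u), np.exp(u)
    elif f_name == "cosh":
        fu, dfu, d2fu = np.cosh(u), np.sinh(u), np.cosh(u)
    else:  # "sinh"
        fu, dfu, d2fu = np.sinh(u), np.cosh(u), np.sinh(u)
    g = A.T @ (dfu * (fu - b)) + A.T @ (w ** 2 * u)
    D = dfu ** 2 + (fu - b) * d2fu + w ** 2     # f'(u)^2 + (f(u) - b) f''(u) + w^2
    H = A.T @ (D[:, None] * A)
    return g, H

def solve(f_name, A, b, w, iters=30):
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g, H = grad_hess(f_name, A, b, w, x)
        if np.linalg.norm(g) < 1e-10:
            break
        x = x - np.linalg.solve(H, g)           # Newton step
    return x

rng = np.random.default_rng(9)
A = rng.standard_normal((50, 5)) / 5.0
b = 0.2 * rng.standard_normal(50)
w = np.sqrt(0.5 * b ** 2 + 1.0)                 # a simple weight choice keeping the Hessian PD
for f_name in ["exp", "cosh", "sinh"]:
    x_hat = solve(f_name, A, b, w)
    print(f_name, np.linalg.norm(grad_hess(f_name, A, b, w, x_hat)[0]))   # final gradient norms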

References

  • ACHL [19] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.
  • ALL [18] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. arXiv preprint arXiv:1812.03981, 2018.
  • AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
  • AZL [21] Zeyuan Allen-Zhu and Yuanzhi Li. Forward super-resolution: How can gans learn hierarchical generative models for real-world distributions. arXiv preprint arXiv:2106.02619, 2021.
  • BLSS [20] Jan van den Brand, Yin Tat Lee, Aaron Sidford, and Zhao Song. Solving tall dense linear programs in nearly linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 775–788, 2020.
  • BMR+ [20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.
  • Bra [20] Jan van den Brand. A deterministic linear program solver in current matrix multiplication time. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 259–278. SIAM, 2020.
  • Bra [21] Jan van den Brand. Unifying matrix data structures: Simplifying and speeding up iterative algorithms. In Symposium on Simplicity in Algorithms (SOSA), pages 1–13. SIAM, 2021.
  • CB [20] Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
  • CDL+ [22] Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Re. Pixelated butterfly: Simple and efficient sparse training for neural network models. In International Conference on Learning Representations (ICLR), 2022.
  • CDW+ [21] Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems (NeurIPS), 34:17413–17426, 2021.
  • CDW+ [22] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding mixture of experts in deep learning. arXiv preprint arXiv:2208.02813, 2022.
  • CGH+ [19] Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.
  • CGRS [19] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  • CLD+ [20] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
  • CLS [19] Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. In STOC, 2019.
  • CND+ [22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  • CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, pages 81–90, 2013.
  • Dan [17] Amit Daniely. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
  • DCLT [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • DGV+ [18] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
  • DKOD [20] Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. Smyrf-efficient attention using asymmetric clustering. Advances in Neural Information Processing Systems (NeurIPS), 33:6476–6489, 2020.
  • DLY [21] Sally Dong, Yin Tat Lee, and Guanghao Ye. A nearly-linear time algorithm for linear programs with small treewidth: A multiscale representation of robust central path. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1784–1797, 2021.
  • DSSW [18] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kronecker product regression and p-splines. In International Conference on Artificial Intelligence and Statistics, pages 1299–1308. PMLR, 2018.
  • DSW [22] Yichuan Deng, Zhao Song, and Omri Weinstein. Discrepancy minimization in input-sparsity time. arXiv preprint arXiv:2210.12468, 2022.
  • DSWY [19] Huaian Diao, Zhao Song, David Woodruff, and Xin Yang. Total least squares regression in input sparsity time. Advances in Neural Information Processing Systems, 32, 2019.
  • EGKZ [21] Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. arXiv preprint arXiv:2110.10090, 2021.
  • GDG+ [17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • GLSS [18] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems, 31, 2018.
  • GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
  • HBGS [19] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks, 2019.
  • HHS [17] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.
  • HJS+ [22] Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 233–244. IEEE, 2022.
  • HSWZ [22] Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.
  • IS [15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • JKL+ [20] Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 2020 IEEE 61st annual symposium on foundations of computer science (FOCS), pages 910–918. IEEE, 2020.
  • JL [22] Samy Jelassi and Yuanzhi Li. Towards understanding how momentum improves generalization in deep learning. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9965–10040. PMLR, 17–23 Jul 2022.
  • JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games, and its applications. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 944–953, 2020.
  • JMGL [22] Samy Jelassi, Arthur Mensch, Gauthier Gidel, and Yuanzhi Li. Adam is no better than normalized SGD: Dissecting how adaptivity improves GAN performance, 2022.
  • JNW [22] Shunhua Jiang, Bento Natura, and Omri Weinstein. A faster interior-point method for sum-of-squares optimization, 2022.
  • JRG [22] Meena Jagadeesan, Ilya Razenshteyn, and Suriya Gunasekar. Inductive bias of multi-channel linear convolutional networks with bounded weight norm. In Conference on Learning Theory, pages 2276–2325. PMLR, 2022.
  • JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC. arXiv preprint arXiv:2004.07470, 2021.
  • KB [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • KKL [20] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
  • KMN+ [16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
  • KVPF [20] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
  • LL [19] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
  • LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT, 2019.
  • LWM [19] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. arXiv preprint arXiv:1907.04595, 2019.
  • NN [13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, pages 117–126. IEEE, 2013.
  • NSS [15] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-sgd: Path-normalized optimization in deep neural networks. arXiv preprint arXiv:1506.02617, 2015.
  • Ope [23] OpenAI. Gpt-4 technical report, 2023.
  • QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In AISTATS, 2023.
  • RNS+ [18] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. ., 2018.
  • RSW [16] Ilya Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC ’16, page 250–263, 2016.
  • SHK+ [14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • SHN+ [18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • SKYL [18] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t decay the learning rate, increase the batch size, 2018.
  • SWZ [17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise 1\ell_{1}-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC), 2017.
  • [61] Zhao Song, David Woodruff, and Peilin Zhong. Average case column subset selection for entrywise 1\ell_{1}-norm loss. Advances in Neural Information Processing Systems (NeurIPS), 32:10111–10121, 2019.
  • [62] Zhao Song, David Woodruff, and Peilin Zhong. Towards a zero-one law for column subset selection. NeurIPS, 32:6123–6134, 2019.
  • [63] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. arXiv preprint arXiv:1704.08246, 2019.
  • SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
  • SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
  • SZKS [21] Charlie Snell, Ruiqi Zhong, Dan Klein, and Jacob Steinhardt. Approximating how single head attention learns. arXiv preprint arXiv:2103.07601, 2021.
  • SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • VBC [20] James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical theory of attention. arXiv preprint arXiv:2007.02876, 2020.
  • VSP+ [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • WCM [21] Colin Wei, Yining Chen, and Tengyu Ma. Statistically meaningful approximation: a case study on approximating turing machines with transformers. arXiv preprint arXiv:2107.13163, 2021.
  • WKM [20] Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout, 2020.
  • WL [21] Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11112–11122. PMLR, 18–24 Jul 2021.
  • WLK+ [20] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
  • WRS+ [17] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292, 2017.
  • ZCLG [21] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. Understanding the generalization of adam in learning neural networks with proper regularization. arXiv preprint arXiv:2108.11371, 2021.
  • Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
  • ZHDK [23] Amir Zandieh, Insu Han, Majid Daliri, and Amin Karbasi. Kdeformer: Accelerating transformers via kernel density estimation. arXiv preprint arXiv:2302.02451, 2023.
  • ZKV+ [20] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.
  • ZMG [19] Guodong Zhang, James Martens, and Roger B Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • ZRG+ [22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.