
Research on the Damped Newton Stochastic Gradient Descent Method for Neural Network Training

Jingcheng Zhou, Wei Wei, and Zhiming Zheng
School of Mathematical Sciences, Beihang University
Abstract

First-order methods such as stochastic gradient descent (SGD) are currently the most popular optimization methods for training deep neural networks (DNNs), whereas second-order methods are rarely used because of the high cost of computing second-order information. In this paper, we propose the Damped Newton Stochastic Gradient Descent (DN-SGD) and Stochastic Gradient Descent Damped Newton (SGD-DN) methods to train DNNs for regression problems with the Mean Square Error (MSE) loss and classification problems with the Cross-Entropy Loss (CEL); the methods are motivated by a proof that the Hessian matrix of the last layer of a DNN is always positive semi-definite. Unlike other second-order methods, which estimate the Hessian matrix of all parameters, our methods compute exactly the Hessian of only a small part of the parameters, which greatly reduces the computational cost and makes the learning process converge faster and more accurately than SGD. Several numerical experiments on real datasets are performed to verify the effectiveness of our methods for regression and classification problems.

keywords:
stochastic gradient descent, damped Newton

1 Introduction

First-order methods such as stochastic gradient descent (SGD) and its variants, which use momentum, acceleration, and adaptive learning rates[2], are popularly used to train deep neural networks (DNNs). At each iteration, SGD computes the gradient on a small mini-batch instead of the whole training set; the randomness introduced by sampling the mini-batch can lead to better generalization of the DNNs[3]. Recently, much work has attempted to develop more efficient first-order methods beyond SGD[4]. First-order methods are easy to implement and require only moderate computational cost per iteration; however, tuning their hyper-parameters (such as the learning rate) is a complex task, and they are often slow to escape from regions where the Hessian matrix of the loss function is ill-conditioned.

Second-order stochastic methods have also been proposed for training DNNs because they take far fewer iterations. In particular, second-order stochastic methods can escape from regions where the Hessian matrix of the loss function is ill-conditioned and provide adaptive learning rates. However, they have a major drawback: it is practically impossible to compute and invert the full Hessian matrix because of the massive number of parameters in DNNs. Efforts to overcome this problem include Hessian-free inexact Newton methods, stochastic L-BFGS methods, Gauss-Newton and natural gradient methods, and diagonal scaling methods[5]. Building on these methods, further algorithms have been proposed, such as SMW-GN and SMW-NG, which use the Sherman-Morrison-Woodbury formula to reduce the cost of computing the inverse matrix[8], and GGN, which combines GN and NTK[1]. There are also several variants of quasi-Newton methods for approximating the Hessian matrix[6][7].

2 Main results

2.1 Feed-forward neural networks

Consider a simple fully connected neural network with $L+1$ layers. The number of nodes in layer $l$ is $n_{l}$, $0\leq l\leq L$. In the following expressions, we fold the bias into the weights for simplicity. Given an input $x$, prepend a component to form the first-layer input $x^{(0)}=(1,x^{T})^{T}$. For the $l$-th layer, a component is likewise prepended to $x^{(l-1)}$, so that $x^{(l)}=\sigma^{(l)}(W^{(l)}x^{(l-1)})$, where $W^{(l)}=(w^{(l)}_{1},w^{(l)}_{2},\cdots,w^{(l)}_{n_{l}})^{T}\in R^{n_{l}\times(n_{l-1}+1)}$ and $w^{(l)}_{j}=(w^{(l)}_{0j},w^{(l)}_{1j},\cdots,w^{(l)}_{n_{l-1}j})^{T}\in R^{n_{l-1}+1}$ denotes the parameters between the $(l-1)$-th layer and the $j$-th node of the $l$-th layer. Here $\sigma^{(l)}:R\to R$ is a nonlinear activation function, and when $\sigma^{(l)}$ is applied to a vector it operates on every component of the vector. The output is $f(\theta,x)=\sigma^{(L)}(W^{(L)}x^{(L-1)})$, where $\theta=\left({\rm vec}((W^{(1)})^{T}),{\rm vec}((W^{(2)})^{T}),\cdots,{\rm vec}((W^{(L)})^{T})\right)^{T}$ and ${\rm vec}(W)$ vectorizes the matrix $W$ by concatenating its columns.

Given the training data $(\boldsymbol{X},\boldsymbol{Y})=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{m},y_{m})\}$, the loss function for minimization in training the neural network is generally defined as

L(\theta)=\dfrac{1}{m}\sum_{i=1}^{m}L_{i}(\theta,\boldsymbol{X},\boldsymbol{Y})=\dfrac{1}{m}\sum_{i=1}^{m}l(f(\theta,x_{i}),y_{i}) \qquad (1)

where $l(f(\theta,x_{i}),y_{i})$ is a per-sample loss function. For the learning tasks considered here, the standard Mean Square Error (MSE) is used for regression problems and the Cross-Entropy Loss (CEL) is used for classification problems. Note that $\theta\in R^{n}$, where $n=\sum_{l=1}^{L}n_{l}(n_{l-1}+1)$.
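To make the notation concrete, the following minimal NumPy sketch (our illustration, not the authors' code; the layer sizes, random weights, and toy data are arbitrary assumptions) implements the forward pass $x^{(l)}=\sigma^{(l)}(W^{(l)}x^{(l-1)})$ with the bias folded into the weights, and evaluates the MSE loss (1) on a small batch.

```python
import numpy as np

def forward(weights, x, activations):
    """Forward pass x^{(l)} = sigma^{(l)}(W^{(l)} x^{(l-1)}) with the bias
    folded into the weights: a constant 1 is prepended to each layer input."""
    a = np.asarray(x, dtype=float)
    for W, sigma in zip(weights, activations):
        a = np.concatenate(([1.0], a))   # x^{(l-1)} with the bias component
        a = sigma(W @ a)
    return a

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
identity = lambda z: z

# Toy regression network with (n_0, n_1, n_2) = (3, 5, 1) and the MSE loss (1).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 3 + 1)), rng.standard_normal((1, 5 + 1))]
X, Y = rng.standard_normal((8, 3)), rng.standard_normal(8)
preds = np.array([forward(weights, x, [sigmoid, identity])[0] for x in X])
mse = np.mean((preds - Y) ** 2)          # L(theta) = (1/m) sum_i (f(theta, x_i) - y_i)^2
print(mse)
```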

2.2 Properties of the loss function

Theorem 1.

Consider a neural network as above with the MSE loss for a regression problem, so that the $L$-th layer has a single node and the identity function is used as the activation of the last layer. Then the Hessian matrix of $L(\theta)$ with respect to the parameters of the last layer is positive semi-definite. Moreover, when $\sigma^{(L-1)}(x)={\rm relu}(x)=\max(0,x)$, the Hessian matrix of $L(\theta)$ with respect to the parameters of the second-to-last layer is also positive semi-definite.

Proof.

For simplicity, consider a single input first. Given the input $x_{i}$, the loss function is

L(W^{(L)})=\left(W^{(L)}x_{i}^{(L-1)}-y_{i}\right)^{2}

and its gradient vector and Hessian matrix are

\dfrac{\partial L(W^{(L)})}{\partial W^{(L)}}=2\left(W^{(L)}x_{i}^{(L-1)}-y_{i}\right)x_{i}^{(L-1)}, \qquad (2)
\dfrac{\partial^{2}L(W^{(L)})}{\partial(W^{(L)})^{2}}=2x_{i}^{(L-1)}(x_{i}^{(L-1)})^{T}=A. \qquad (3)

By standard matrix theory[9], if $x_{i}^{(L-1)}$ is not a null vector, $A$ is a positive semi-definite matrix of rank one. For multiple inputs, the Hessian is a sum of positive semi-definite matrices and is therefore also positive semi-definite.

If the activation function is $\sigma^{(L-1)}(x)=\max(0,x)$, the loss function becomes

L(W^{(L-1)})=\left(W^{(L)}\sigma^{(L-1)}(W^{(L-1)}x_{i}^{(L-2)})-y_{i}\right)^{2}

and the gradient vector and Hessian matrix are

\dfrac{\partial L}{\partial w_{j}^{(L-1)}}=2\left(W^{(L)}\sigma^{(L-1)}(W^{(L-1)}x_{i}^{(L-2)})-y_{i}\right)W^{(L)}_{j}\,\sigma^{\prime}((w_{j}^{(L-1)})^{T}x_{i}^{(L-2)})\,x_{i}^{(L-2)}, \qquad (4)
\dfrac{\partial^{2}L}{\partial w_{j}^{(L-1)}\partial w_{k}^{(L-1)}}=2W^{(L)}_{j}W^{(L)}_{k}\,\sigma^{\prime}((w_{j}^{(L-1)})^{T}x_{i}^{(L-2)})\,\sigma^{\prime}((w_{k}^{(L-1)})^{T}x_{i}^{(L-2)})\,x_{i}^{(L-2)}(x_{i}^{(L-2)})^{T}, \qquad (5)
\dfrac{\partial^{2}L}{\partial(v^{(L-1)})^{2}}=B\otimes C, \qquad (6)

where $\otimes$ denotes the Kronecker product and

v^{(L-1)}=((w_{1}^{(L-1)})^{T},(w_{2}^{(L-1)})^{T},\cdots,(w_{n_{L-1}}^{(L-1)})^{T})^{T},
B\in R^{n_{L-1}\times n_{L-1}},\quad B_{jk}=2W^{(L)}_{j}W^{(L)}_{k}\,\sigma^{\prime}((w_{j}^{(L-1)})^{T}x_{i}^{(L-2)})\,\sigma^{\prime}((w_{k}^{(L-1)})^{T}x_{i}^{(L-2)}),
C=x_{i}^{(L-2)}(x_{i}^{(L-2)})^{T}.

Let $u_{j}=W_{j}^{(L)}\sigma^{\prime}((w_{j}^{(L-1)})^{T}x_{i}^{(L-2)})$ and $u=(u_{1},u_{2},\cdots,u_{n_{L-1}})^{T}$; then $B=2uu^{T}$. Thus it is easy to see that $B$ and $C$ are positive semi-definite matrices. Because the Kronecker product of positive semi-definite matrices is also positive semi-definite[10], the Hessian matrix $B\otimes C$ is positive semi-definite. Hence, the Hessian for multiple inputs is also positive semi-definite. ∎
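As a sanity check, the following short NumPy sketch (our illustration, not part of the paper; the batch size, layer width, and random inputs are arbitrary assumptions) builds the batch version of the last-layer Hessian from Theorem 1 for the MSE case and verifies numerically that it is positive semi-definite.

```python
import numpy as np

# Numerical check of Theorem 1 (MSE case): the last-layer Hessian over a batch
# is (2/m) * sum_i x_i^{(L-1)} (x_i^{(L-1)})^T, a sum of rank-one PSD matrices.
rng = np.random.default_rng(1)
m, n_prev = 32, 6                                   # assumed batch size and width of layer L-1
X_last = np.hstack([np.ones((m, 1)),                # bias component of each x_i^{(L-1)}
                    rng.standard_normal((m, n_prev))])

H = (2.0 / m) * X_last.T @ X_last                   # last-layer Hessian for the MSE loss
print(np.linalg.eigvalsh(H).min() >= -1e-10)        # True: H is positive semi-definite
```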

Theorem 2.

Consider a neural network as above with the CEL loss for a classification problem, where softmax is used as the activation function of the last layer. Then the Hessian matrix of $L(\theta)$ with respect to the parameters of the last layer is positive semi-definite.

Proof.

For simplicity, consider a single input first. Given the input $x_{i}$, the loss function is

L=-\sum_{j=1}^{n_{L}}Y_{j}\ln x^{(L)}_{ij},

where $Y_{j}$ is 1 when $x_{i}$ belongs to the $j$-th class and 0 otherwise, and $x^{(L)}_{ij}=\dfrac{e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}}{\sum_{k=1}^{n_{L}}e^{(w^{(L)}_{k})^{T}x_{i}^{(L-1)}}}$, so

L=-\sum_{j=1}^{n_{L}}Y_{j}(w^{(L)}_{j})^{T}x_{i}^{(L-1)}+\ln D,

where $D=\sum_{j=1}^{n_{L}}e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}$.

The gradient vector and Hessian matrix are

\dfrac{\partial L}{\partial w_{j}^{(L)}}=\left(\dfrac{e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}}{D}-Y_{j}\right)x_{i}^{(L-1)}, \qquad (7)
\dfrac{\partial^{2}L}{\partial w_{j}^{(L)}\partial w_{k}^{(L)}}=\begin{cases}\dfrac{e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}\left(D-e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}\right)}{D^{2}}\,x_{i}^{(L-1)}(x_{i}^{(L-1)})^{T}, & j=k,\\[2ex] -\dfrac{e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}e^{(w^{(L)}_{k})^{T}x_{i}^{(L-1)}}}{D^{2}}\,x_{i}^{(L-1)}(x_{i}^{(L-1)})^{T}, & j\neq k.\end{cases} \qquad (8)

Let $v^{(L)}=((w_{1}^{(L)})^{T},(w_{2}^{(L)})^{T},\cdots,(w_{n_{L}}^{(L)})^{T})^{T}$; then we have

\dfrac{\partial^{2}L}{\partial(v^{(L)})^{2}}=E\otimes F, \qquad (9)

where

F=\dfrac{1}{D^{2}}\,x_{i}^{(L-1)}(x_{i}^{(L-1)})^{T},
E\in R^{n_{L}\times n_{L}},\quad E_{jk}=\begin{cases}e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}\left(D-e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}\right), & j=k,\\ -e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}e^{(w^{(L)}_{k})^{T}x_{i}^{(L-1)}}, & j\neq k.\end{cases} \qquad (10)

For any vector $z=(z_{1},z_{2},\cdots,z_{n_{L}})^{T}\in R^{n_{L}}$,

z^{T}Ez=\sum_{j,k=1}^{n_{L}}z_{j}z_{k}E_{jk}=\dfrac{1}{2}\sum_{j,k=1}^{n_{L}}e^{(w^{(L)}_{j})^{T}x_{i}^{(L-1)}}e^{(w^{(L)}_{k})^{T}x_{i}^{(L-1)}}(z_{j}-z_{k})^{2}\geq 0. \qquad (11)

Thus, $E$ and $F$ are positive semi-definite matrices. Because the Kronecker product of positive semi-definite matrices is also positive semi-definite[10], the Hessian matrix $E\otimes F$ is positive semi-definite, and the Hessian in the case of multiple inputs is also positive semi-definite. ∎
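A corresponding numerical check (again our illustration, with an arbitrary number of classes and random pre-activations) builds $E$ from equation (10) and confirms that its smallest eigenvalue is non-negative, in line with equation (11).

```python
import numpy as np

# Numerical check of Theorem 2: build E from equation (10) for arbitrary logits
# a_j = (w_j^{(L)})^T x_i^{(L-1)} and verify that it is positive semi-definite.
rng = np.random.default_rng(2)
n_L = 5                                      # assumed number of classes
a = rng.standard_normal(n_L)                 # arbitrary last-layer pre-activations
q = np.exp(a)
D = q.sum()

E = D * np.diag(q) - np.outer(q, q)          # E_jj = q_j (D - q_j), E_jk = -q_j q_k
print(np.linalg.eigvalsh(E).min() >= -1e-8)  # True, consistent with equation (11)
# F = (1/D^2) x x^T is a rank-one PSD matrix, so E kron F is PSD as well.
```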

3 Our innovation: Damped Newton Stochastic Gradient Descent

By the property of the loss function established above, the loss is convex with respect to the parameters of the last layer of a DNN. Thus, we can use the damped Newton method for the parameters of the last layer and SGD for the other layers. If, in each iteration, the damped Newton step is applied first and then the SGD step, the method is called Damped Newton Stochastic Gradient Descent (DN-SGD); if the SGD step is applied first and then the damped Newton step, it is called Stochastic Gradient Descent Damped Newton (SGD-DN).

In the traditional damped Newton method, a scalar matrix is added to make the Hessian matrix positive definite, giving the iteration $\theta^{(t+1)}=\theta^{(t)}-(H^{(t)}+\lambda I)^{-1}g^{(t)}$. This method is hard to put into practice because of the uncertainty in choosing $\lambda$. Quasi-Newton methods and their relatives approximate the Hessian matrix to cut the computational cost, but all of these methods consider the whole Hessian matrix and lose much information in the approximation.

Our optimization method is based on a more careful analysis of the Hessian matrix, in which the full Hessian is not needed to accelerate training: we apply a damped Newton step using only the last layer's Hessian. The iteration is $\theta^{(t+1)}=\theta^{(t)}-(H^{(t)}+(H^{(t)}_{\max}+\alpha)I)^{-1}g^{(t)}$, where $H^{(t)}_{\max}$ is the largest entry of $H^{(t)}$ and $\alpha$ is the damping coefficient. Adding $H^{(t)}_{\max}I$ makes the matrix positive definite and lowers its condition number at the same time, while $\alpha$ guards against the degenerate case in which $H^{(t)}$ is a null matrix or very close to one, as in the Shot Selection results (Figure 4).
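A minimal sketch of this last-layer update is given below (our illustration, not the authors' implementation); it solves the regularized Newton system with a linear solve rather than forming the inverse explicitly, which is mathematically equivalent but cheaper.

```python
import numpy as np

def damped_newton_step(theta_last, g_last, H_last, alpha):
    """One last-layer update theta <- theta - (H + (H_max + alpha) I)^{-1} g.

    H_max is the largest entry of H_last; a linear solve replaces the explicit
    inverse. Illustrative sketch of the update described in the text.
    """
    damping = H_last.max() + alpha
    A = H_last + damping * np.eye(H_last.shape[0])
    return theta_last - np.linalg.solve(A, g_last)
```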

Our method combines the advantages of stochastic gradient descent and the damped Newton method. On one hand, training the main part of the DNN parameters with SGD keeps the computational cost essentially unchanged. On the other hand, training the parameters of the last layer with DN-SGD or SGD-DN makes the learning process converge more quickly at only a small additional computational cost. When the training set is large, the smaller number of iterations can save much training time and outweigh the cost of computing the second-order information of the last layer.

4 Algorithm

In this section, the DN-SGD and SGD-DN methods are summarized as follows.

Input: Training set $S_{M}=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{M},y_{M})\}$, hyper-parameter $\alpha$, learning rate $\lambda$, batch size $N$
1 Initialize the network parameters $\theta^{(0)}$, $t=0$
2 for $t=0,1,2,\cdots$ do
3    Randomly select a mini-batch $S_{N}\subset S_{M}$ of size $N$
4    Compute $H_{last}^{(t)}$ and $g_{last}^{(t)}$ for the last layer, and $H_{max}^{(t)}=\max(H_{last}^{(t)})$
5    Compute $p^{(t)}=(H_{last}^{(t)}+(\alpha+H_{max}^{(t)})I)^{-1}$
6    Set $\theta_{last}^{(t+1)}=\theta_{last}^{(t)}-p^{(t)}g_{last}^{(t)}$
7    Compute $g_{front}^{(t)}$ for the other layers
8    Set $\theta^{(t+1)}_{front}=\theta^{(t)}_{front}-\lambda\,g_{front}^{(t)}$
end for
Algorithm 1 DN-SGD

Input: Training set $S_{M}=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{M},y_{M})\}$, hyper-parameter $\alpha$, learning rate $\lambda$, batch size $N$
1 Initialize the network parameters $\theta^{(0)}$, $t=0$
2 for $t=0,1,2,\cdots$ do
3    Randomly select a mini-batch $S_{N}\subset S_{M}$ of size $N$
4    Compute $g_{front}^{(t)}$ for the other layers
5    Set $\theta^{(t+1)}_{front}=\theta^{(t)}_{front}-\lambda\,g_{front}^{(t)}$
6    Compute $H_{last}^{(t)}$ and $g_{last}^{(t)}$ for the last layer, and $H_{max}^{(t)}=\max(H_{last}^{(t)})$
7    Compute $p^{(t)}=(H_{last}^{(t)}+(\alpha+H_{max}^{(t)})I)^{-1}$
8    Set $\theta_{last}^{(t+1)}=\theta_{last}^{(t)}-p^{(t)}g_{last}^{(t)}$
end for
Algorithm 2 SGD-DN

The only difference between the DN-SGD and SGD-DN methods is whether, within each iteration, the damped Newton step or the stochastic gradient descent step is applied first. A minimal sketch of one DN-SGD iteration is given below.
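The following NumPy sketch implements one DN-SGD iteration for a one-hidden-layer regression network under the setting of Theorem 1 (sigmoid hidden layer, linear output, MSE loss). It is our illustration under stated assumptions, not the authors' implementation; in particular, recomputing the residuals with the updated last-layer weights before the SGD step is an assumption not specified in Algorithm 1.

```python
import numpy as np

def dn_sgd_step(W1, w2, X, y, alpha, lr):
    """One DN-SGD iteration for a one-hidden-layer regression network
    (sigmoid hidden layer, linear output, MSE loss), as a rough sketch of
    Algorithm 1. W1 has shape (n1, n0+1), w2 has shape (n1+1,); the biases
    are folded into the weights as in Section 2.1."""
    N = X.shape[0]
    X0 = np.hstack([np.ones((N, 1)), X])          # inputs with bias component
    h = 1.0 / (1.0 + np.exp(-X0 @ W1.T))          # hidden activations, shape (N, n1)
    H1 = np.hstack([np.ones((N, 1)), h])          # x^{(1)} with bias, shape (N, n1+1)

    # Damped Newton step on the last layer (steps 4-6 of Algorithm 1).
    r = H1 @ w2 - y                               # residuals
    g_last = (2.0 / N) * H1.T @ r                 # gradient w.r.t. w2
    H_last = (2.0 / N) * H1.T @ H1                # exact last-layer Hessian (PSD)
    damping = H_last.max() + alpha
    w2 = w2 - np.linalg.solve(H_last + damping * np.eye(H_last.shape[0]), g_last)

    # SGD step on the remaining parameters (steps 7-8), using the updated w2.
    r = H1 @ w2 - y
    dh = np.outer(r, w2[1:]) * h * (1.0 - h)      # backprop through the sigmoid
    g_front = (2.0 / N) * dh.T @ X0               # gradient w.r.t. W1
    W1 = W1 - lr * g_front
    return W1, w2
```

Inside the mini-batch loop this would be called as, e.g., `W1, w2 = dn_sgd_step(W1, w2, X_batch, y_batch, alpha=0.0, lr=0.01)`, mirroring the hyper-parameters used in the regression experiments below.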

5 Numerical examples

5.1 Regression Problem

Figure 1: Results on the House Prices regression problem with $N=50$, $\alpha=0$, $\lambda=0.01$.

Figure 2: Results on the Housing Market regression problem with $N=200$, $\alpha=0$, $\lambda=0.001$.

Figure 3: Results on the Cabbage Price regression problem with $N=100$, $\alpha=0$, $\lambda=0.01$.

The performance of the DN-SGD and SGD-DN algorithms is compared with SGD on several regression problems, and both training loss and testing loss are reported. The datasets are standardized before training.

House Prices: The training set has 1361 samples and the test set has 100. A neural network with one hidden layer of size 5 and sigmoid activation is used, i.e., $(n_{0},n_{1},n_{2})=(288,5,1)$, where the first and last layers are the sizes of the input and output. The output layer has no activation function and uses the MSE loss; the learning rate for SGD was set to 0.01, the same as for the other methods, for the purposes of comparison. Each algorithm is run for 20 epochs. The dataset is available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. The results are presented in Figure 1.

Housing Market: The training set has 20473 samples and the test set has 10000. A neural network with one hidden layer of size 4 and sigmoid activation is used, i.e., $(n_{0},n_{1},n_{2})=(451,4,1)$, where the first and last layers are the sizes of the input and output. The output layer has no activation function and uses the MSE loss; the learning rate for SGD was set to 0.001, the same as for the other methods, for the purposes of comparison. Each algorithm is run for 5 epochs. The dataset is available at https://www.kaggle.com/c/sberbank-russian-housing-market/data. The results are presented in Figure 2.

Cabbage Price: The training set has 2423 samples and the test set has 500. A neural network with one hidden layer of size 6 and sigmoid activation is used, i.e., $(n_{0},n_{1},n_{2})=(4,6,1)$, where the first and last layers are the sizes of the input and output. The output layer has no activation function and uses the MSE loss; the learning rate for SGD was set to 0.01, the same as for the other methods, for the purposes of comparison. Each algorithm is run for 5 epochs. The dataset is available at https://www.kaggle.com/c/regression-cabbage-price/data. The results are presented in Figure 3.

5.2 Classification Problem

Figure 4: Results on the Shot Selection classification problem with $N=200$, $\alpha=0.01$, $\lambda=0.01$.

Figure 5: Results on the Categorical Feature Encoding classification problem with $N=200$, $\alpha=0$, $\lambda=0.01$.

Figure 6: Results on the Santander Customer Satisfaction classification problem with $N=200$, $\alpha=0$, $\lambda=0.01$.

The performance of the DN-SGD and SGD-DN algorithms is compared with SGD on several classification problems, and both training loss and testing loss are reported. The datasets are standardized before training.

Shot Selection: The training set has 20698 samples and the test set has 10000. A neural network with one hidden layer of size 6 and sigmoid activation is used, i.e., $(n_{0},n_{1},n_{2})=(228,6,2)$, where the first and last layers are the sizes of the input and output. The output layer uses softmax with the cross-entropy loss; the learning rate for SGD was set to 0.01, the same as for the other methods, for the purposes of comparison. Each algorithm is run for 25 epochs. The dataset is available at https://www.kaggle.com/c/kobe-bryant-shot-selection/data. The results are presented in Figure 4.

Categorical Feature Encoding: The training set has 30000 samples and the test set has 10000. A neural network with one hidden layer of size 4 and sigmoid activation is used, i.e., $(n_{0},n_{1},n_{2})=(96,4,2)$, where the first and last layers are the sizes of the input and output. The output layer uses softmax with the cross-entropy loss; the learning rate for SGD was set to 0.01, the same as for the other methods, for the purposes of comparison. Each algorithm is run for 20 epochs. The dataset is available at https://www.kaggle.com/c/cat-in-the-dat/data. The results are presented in Figure 5.

Santander Customer Satisfaction: The training set has 30000 samples and the test set has 10000. A neural network with one hidden layer of size 4 and sigmoid activation is used, i.e., $(n_{0},n_{1},n_{2})=(369,4,2)$, where the first and last layers are the sizes of the input and output. The output layer uses softmax with the cross-entropy loss; the learning rate for SGD was set to 0.01, the same as for the other methods, for the purposes of comparison. Each algorithm is run for 20 epochs. The dataset is available at https://www.kaggle.com/c/santander-customer-satisfaction/data. The results are presented in Figure 6.

6 Discussion of Results

From the experimental results, it can be seen that DN-SGD and SGD-DN are consistently faster than SGD in terms of both iterations and wall-clock time, which accords with the analysis above. Moreover, when the dataset is large, DN-SGD and SGD-DN can be even more efficient, because the benefit of faster descent outweighs the cost of computing the second-order information.

7 Conclusion and Future Research Directions

In this paper, we propose the DN-SGD and SGD-DN methods, which combine stochastic gradient descent with the damped Newton method and perform better than SGD on several datasets for training neural networks. These methods also have the potential to be applied to more sophisticated models such as CNNs, ResNets, and even GNNs. A promising future research direction is to study the properties of the Hessian matrix of each hidden layer in order to derive even faster methods.

References

  • [1] T. Cai, R. Gao, J. Hou, S. Chen, D. Wang, D. He, Z. Zhang, and L. Wang. A gram-gauss-newton method learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.
  • [2] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.
  • [3] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
  • [4] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
  • [5] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018.
  • [6] Donald Goldfarb, Yi Ren, and Achraf Bahamou. Practical quasi-Newton methods for training deep neural networks. arXiv preprint arXiv:2006.08877, 2020.
  • [7] Yi Ren and Donald Goldfarb. Kronecker-factored quasi-Newton methods for convolutional neural networks. arXiv preprint arXiv:2102.06737, 2021.
  • [8] Yi Ren and Donald Goldfarb. Efficient subsampled Gauss-Newton and natural gradient methods for training neural networks. arXiv preprint arXiv:1906.02353, 2019.
  • [9] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2nd edition, 2013.
  • [10] Willi-Hans Steeb, in collaboration with Tan Kiat Shi. Matrix Calculus and Kronecker Product with Applications and C++ Programs. World Scientific, 1997.