
Minimization of Stochastic First-order Oracle Complexity of Adaptive Methods for Nonconvex Optimization

Hideaki Iiduka Department of Computer Science Meiji University 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571 Japan [email protected]
Abstract.

Numerical evaluations have definitively shown that, for deep learning optimizers such as stochastic gradient descent, momentum, and adaptive methods, the number of steps needed to train a deep neural network halves for each doubling of the batch size and that there is a region of diminishing returns beyond the critical batch size. In this paper, we determine the actual critical batch size by using the global minimizer of the stochastic first-order oracle (SFO) complexity of the optimizer. To prove the existence of the actual critical batch size, we derive lower and upper bounds on the SFO complexity and prove that there exist critical batch sizes in the sense of minimizing these bounds. This proof implies that, if the SFO complexity fits the lower and upper bounds, then the existence of these critical batch sizes demonstrates the existence of the actual critical batch size. We also discuss the conditions needed for the SFO complexity to fit the lower and upper bounds and provide numerical results that support our theoretical results.

Key words and phrases:
adaptive method, batch size, nonconvex optimization, stochastic first-order oracle complexity

1. Introduction

1.1. Background

Adaptive methods are powerful deep learning optimizers used to find the model parameters of deep neural networks that minimize the expected risk and empirical risk loss functions [2, Section 4]. Specific adaptive methods are, for example, adaptive gradient (AdaGrad) [5], root mean square propagation (RMSProp) [26], adaptive moment estimation (Adam) [11], and adaptive mean square gradient (AMSGrad) [19] (Table 2 in [22] lists the main deep learning optimizers).

Deep learning optimizers have been widely studied from the viewpoints of both theory and practice. While the convergence and convergence rate of deep learning optimizers have been theoretically studied for convex optimization [31, 11, 19, 13, 15], theoretical investigation of deep learning optimizers for nonconvex optimization is needed so that such optimizers can be put into practical use in deep learning [28, 1, 27].

The convergence of adaptive methods has been studied for nonconvex optimization [7, 4, 30, 10], and the convergence of stochastic gradient descent (SGD) methods has been studied for nonconvex optimization [8, 3, 21, 12]. Chen et al. showed that generalized Adam, which includes the heavy-ball method, AdaGrad, RMSProp, AMSGrad, and AdaGrad with first-order momentum (AdaFom), has an \mathcal{O}(\log K/\sqrt{K}) convergence rate when a decaying learning rate (\alpha_k = 1/\sqrt{k}) is used [4]. AdaBelief (which adapts the step size in accordance with the belief in the observed gradients) using \alpha_k = 1/\sqrt{k} has an \mathcal{O}(\log K/\sqrt{K}) convergence rate [30]. A method that unifies adaptive methods such as AMSGrad and AdaBelief has been shown to have a convergence rate of \mathcal{O}(1/\sqrt{K}) when \alpha_k = 1/\sqrt{k} [10], which improves on the results of [4, 30].

Figure 1. Relationship between lower bound \underline{K}(b) and upper bound \overline{K}(b) (b^{\star} denotes critical batch size).
Figure 2. Relationship between \underline{K}(b)b and \overline{K}(b)b (b_{*} and b^{*} are global minimizers of \underline{K}(b)b and \overline{K}(b)b, which always exist).

There have been several practical studies of deep learning optimizers. Shallue et al. [23] studied how increasing the batch size affects the performances of SGD, SGD with momentum [18, 20], and Nesterov momentum [17, 25]. Zhang et al. [29] studied the relationship between batch size and performance for Adam and K-FAC (Kronecker-factored approximate curvature [14]). In both studies, it was numerically shown that increasing the batch size tends to decrease the number of steps needed for training deep neural networks, but with diminishing returns. Moreover, it was shown that SGD with momentum and Nesterov momentum can exploit larger batches than SGD [23] and that K-FAC and Adam can exploit larger batches than SGD with momentum [29]. Thus, it has been shown that momentum and adaptive methods can greatly reduce the number of steps needed for training deep neural networks [23, Figure 4], [29, Figure 5]. Smith et al. [24] numerically showed that using enormous batches leads to reductions in the number of parameter updates and model training time.

1.2. Motivation

As described in Section 1.1, there have been theoretical studies of the relationship between the learning rate and convergence of deep learning optimizers. Furthermore, as also described in Section 1.1, the practical performance of a deep learning optimizer strongly depends on the batch size. Our goal in this paper is to clarify theoretically the relationship between batch size and the performance of deep learning optimizers. We focus on the relationship between the batch size and the number of steps needed for nonconvex optimization of several adaptive methods.

To explicitly show our contribution, we characterize the diminishing returns reported in [23, 29] by using the stochastic first-order oracle (SFO) complexities of deep learning optimizers. We define the SFO complexity of a deep learning optimizer as Kb on the basis of the number of steps K needed for training the deep neural network and the batch size b used in the optimizer. Let b^{\star} be a critical batch size such that there are diminishing returns for all batch sizes beyond b^{\star}, as asserted in [23, 29]. Thanks to the numerical evaluations in [23, 29], we know that there is a positive number C such that

(1.1) Kb \approx 2^{C} \text{ for } b \leq b^{\star} \quad \text{and} \quad Kb \geq 2^{C} \text{ for } b \geq b^{\star},

where K and b are defined for i, j \in \mathbb{N} by K = 2^{i} and b = 2^{j}. For example, Figure 5(a) in [23] shows C \approx 16 and b^{\star} \approx 2^{8} for the momentum method used to train a convolutional neural network on the MNIST dataset. This implies that, while the SFO complexity Kb initially remains almost unchanged (i.e., K halves for each doubling of b), Kb is minimized at the critical batch size b^{\star}, and there are diminishing returns once the batch size exceeds b^{\star}. Therefore, there are diminishing returns beyond the critical batch size b^{\star} that minimizes the SFO complexity.

1.3. Goal

The goal of this paper is to develop a theory demonstrating the existence of critical batch sizes, which was shown numerically by Shallue et al. and Zhang et al. [23, 29].

To reach this goal, we first examine the relationship between the number of steps K needed for nonconvex optimization (Section 2) and the batch size b used for each adaptive method. We show that, under certain assumptions, the lower bound \underline{K} and the upper bound \overline{K} of K are rational functions of the batch size b; i.e., \underline{K} and \overline{K} have the following forms (see Section 3 for details):

(1.2) \underline{K}(b) := \frac{Ab}{\{\epsilon^{2}-(C+D)\}b-B} \leq K \leq \overline{K}(b) := \frac{Eb}{(\delta\epsilon^{2}+G)b-F},

where A, B, C, D, E, F, and G are positive constants depending on the parameters (e.g., the constant learning rate \alpha and the momentum coefficient \beta) used in each adaptive method, and \epsilon, \delta > 0 are the precisions for nonconvex optimization. \underline{K} and \overline{K} defined by (1.2) are monotone decreasing and convex with respect to the batch size b. Accordingly, \underline{K} and \overline{K} decrease as the batch size b increases. For minimizing the empirical average loss function for a given number of samples n (see (2.2)), \underline{K} and \overline{K} are minimized at b = n. Moreover, \underline{K} and \overline{K} have the asymptotic values \underline{K}^{\star} = A/\{\epsilon^{2}-(C+D)\} and \overline{K}^{\star} = E/(\delta\epsilon^{2}+G).

Figure 1 visualizes the relationship between \underline{K}(b) and \overline{K}(b). The number of steps K needed for nonconvex optimization is inversely proportional to the batch size for batch sizes smaller than the critical batch size b^{\star}, and there are diminishing returns beyond the critical batch size b^{\star}, as shown by the numerical results of Shallue et al. and Zhang et al. [23, 29]. Moreover, the critical batch size b^{\star} minimizes the SFO complexity Kb, as seen in (1.1). We thus define the critical batch size by using the global minimizer of the SFO complexity K(b)b.

From the forms of \underline{K}(b) and \overline{K}(b) in (1.2), we have the forms of \underline{K}(b)b and \overline{K}(b)b:

(1.3) \underline{K}(b)b = \frac{Ab^{2}}{\{\epsilon^{2}-(C+D)\}b-B} \leq Kb \leq \overline{K}(b)b = \frac{Eb^{2}}{(\delta\epsilon^{2}+G)b-F}.

We can show that \underline{K}(b)b and \overline{K}(b)b are convex; i.e., there are global minimizers, denoted by b_{*} and b^{*}, of \underline{K}(b)b and \overline{K}(b)b, as seen in Figure 2 (see Section 3 for details). Therefore, we can regard the batch sizes b_{*} and b^{*} as the critical batch sizes in the sense of minimizing \underline{K}(b)b and \overline{K}(b)b defined by (1.3). Suppose that \underline{K}(b)b and \overline{K}(b)b defined by (1.3) approximate the actual SFO complexity Kb; i.e.,

\frac{2B}{\epsilon^{2}-(C+D)} = b_{*} \approx b^{*} = \frac{2F}{\delta\epsilon^{2}+G} \quad \text{and} \quad \frac{4AB}{\{\epsilon^{2}-(C+D)\}^{2}} = \underline{K}(b_{*})b_{*} \approx \overline{K}(b^{*})b^{*} = \frac{4EF}{(\delta\epsilon^{2}+G)^{2}},

which implies that

(1.4) \underbrace{(C+D)F}_{\approx(M_{1}\alpha+M_{2})(M_{3}\alpha-M_{4})} + \underbrace{BG}_{\approx M_{5}\alpha} \approx (F-\delta B)\epsilon^{2} \quad \text{and} \quad \underbrace{(C+D)}_{\approx M_{1}\alpha+M_{2}}\sqrt{EF} + G\sqrt{AB} \approx (\sqrt{EF}-\delta\sqrt{AB})\epsilon^{2},

where M_{i} (i = 1, 2, 3, 4, 5) are positive constants that do not depend on the constant learning rate \alpha (see Section 3 for details). Hence, if the precisions \epsilon and \delta are small, a sufficiently small learning rate \alpha can be used so that (1.3) approximates K(b)b (indeed, small learning rates were used in practice by Shallue et al. [23]). Then, demonstrating the existence of the critical batch sizes b_{*} and b^{*} establishes the existence of the actual critical batch size b^{\star} \approx b_{*} \approx b^{*}.
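As a quick illustration of this shape, the following minimal Python sketch evaluates the lower bound in (1.2) and the corresponding SFO complexity in (1.3) for placeholder constants (A, B, C, D, and the precision eps below are illustrative values, not constants derived in this paper) and checks numerically that \underline{K}(b)b is minimized at b_{*} = 2B/\{\epsilon^{2}-(C+D)\}.

import numpy as np

# Illustrative placeholder constants; they are not values derived in the paper.
A, B, C, D, eps = 1.0e3, 5.0e2, 1.0e-3, 1.0e-3, 0.5

def K_lower(b):
    # Lower bound (1.2): K_lower(b) = A b / ({eps^2 - (C + D)} b - B).
    return A * b / ((eps**2 - (C + D)) * b - B)

# Batch sizes above the pole, i.e., b > B / {eps^2 - (C + D)}.
b = np.arange(np.ceil(B / (eps**2 - (C + D))) + 1, 2**16, dtype=float)
sfo = K_lower(b) * b                      # lower bound of the SFO complexity K(b) b

b_star = 2 * B / (eps**2 - (C + D))       # closed-form minimizer (Theorem 3.2)
print(b_star, b[np.argmin(sfo)])          # the numerical minimizer agrees with b_star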

2. Nonconvex Optimization and Deep Learning Optimizers

2.1. Nonconvex optimization

Let \mathbb{R}^{d} be a d-dimensional Euclidean space with inner product \langle\bm{x},\bm{y}\rangle := \bm{x}^{\top}\bm{y} inducing the norm \|\bm{x}\|, and let \mathbb{N} be the set of nonnegative integers. Let \mathbb{S}_{++}^{d} be the set of d \times d symmetric positive-definite matrices, and let \mathbb{D}^{d} be the set of d \times d diagonal matrices: \mathbb{D}^{d} = \{M \in \mathbb{R}^{d \times d} \colon M = \mathsf{diag}(x_{i}),\ x_{i} \in \mathbb{R}\ (i \in [d] := \{1, 2, \ldots, d\})\}.

The mathematical model used here is based on that of Shallue et al. [23]. Given a parameter \bm{\theta} in a set \mathcal{S} \subset \mathbb{R}^{d} and a data point z in a data domain Z, a machine learning model provides a prediction whose quality is estimated using a differentiable nonconvex loss function \ell(\bm{\theta}; z). We minimize the expected loss defined for all \bm{\theta} \in \mathcal{S} by

(2.1) L(\bm{\theta}) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(\bm{\theta}; z)] = \mathbb{E}[\ell_{\xi}(\bm{\theta})],

where \mathcal{D} is the probability distribution over Z, \xi denotes a random variable with distribution function \mathrm{P}, and \mathbb{E}[\cdot] denotes the expectation taken with respect to \xi. A particularly interesting example of (2.1) is the empirical average loss defined for all \bm{\theta} \in \mathcal{S}:

(2.2) L(\bm{\theta}; S) = \frac{1}{n}\sum_{i \in [n]} \ell(\bm{\theta}; z_{i}) = \frac{1}{n}\sum_{i \in [n]} \ell_{i}(\bm{\theta}),

where S = (z_{1}, z_{2}, \ldots, z_{n}) denotes the training set, \ell_{i}(\cdot) := \ell(\cdot; z_{i}) denotes the loss function corresponding to the i-th training data instance z_{i}, and [n] := \{1, 2, \ldots, n\}. Our main objective is to find a local minimizer of L over \mathcal{S}, i.e., a point \bm{\theta}^{\star} \in \mathcal{S} that belongs to the set

(2.3) \mathrm{VI}(\mathcal{S}, \nabla L) := \{\bm{\theta}^{\star} \in \mathcal{S} \colon (\bm{\theta}^{\star} - \bm{\theta})^{\top}\nabla L(\bm{\theta}^{\star}) \leq 0\ (\bm{\theta} \in \mathcal{S})\},

where \nabla L \colon \mathbb{R}^{d} \to \mathbb{R}^{d} denotes the gradient of L. The inequality (\bm{\theta}^{\star} - \bm{\theta})^{\top}\nabla L(\bm{\theta}^{\star}) \leq 0 is called the variational inequality of \nabla L over \mathcal{S} [6].

2.2. Deep learning optimizers

We assume that there exists an SFO such that, for a given \bm{\theta} \in \mathcal{S}, it returns a stochastic gradient \mathsf{G}_{\xi}(\bm{\theta}) of the function L defined by (2.1), where the random variable \xi is supported on \Xi and does not depend on \bm{\theta}. Throughout this paper, we assume three standard conditions:

  1. (S1) L \colon \mathbb{R}^{d} \to \mathbb{R} defined by (2.1) is continuously differentiable.

  2. (S2) Let (\bm{\theta}_{k})_{k \in \mathbb{N}} \subset \mathcal{S} be the sequence generated by a deep learning optimizer. For each iteration k,

    (2.4) \mathbb{E}_{\xi_{k}}[\mathsf{G}_{\xi_{k}}(\bm{\theta}_{k})] = \nabla L(\bm{\theta}_{k}),

    where \xi_{0}, \xi_{1}, \ldots are independent samples, and the random variable \xi_{k} is independent of (\bm{\theta}_{l})_{l=0}^{k}. There exists a nonnegative constant \sigma^{2} such that

    (2.5) \mathbb{E}_{\xi_{k}}\left[\left\|\mathsf{G}_{\xi_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\right\|^{2}\right] \leq \sigma^{2}.

  3. (S3) For each iteration k, the deep learning optimizer samples a batch B_{k} of size b and estimates the full gradient \nabla L as

    \nabla L_{B_{k}}(\bm{\theta}_{k}) := \frac{1}{b}\sum_{i \in [b]} \mathsf{G}_{\xi_{k,i}}(\bm{\theta}_{k}),

    where \xi_{k,i} is the random variable generated by the i-th sampling in the k-th iteration.

In the case where L is defined by (2.2), we have that, for each k, B_{k} \subset [n] and

\nabla L_{B_{k}}(\bm{\theta}_{k}) = \frac{1}{b}\sum_{i \in B_{k}} \nabla\ell_{i}(\bm{\theta}_{k}).
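As a minimal illustration of this estimator (assuming, for concreteness, the quadratic per-example loss \ell_{i}(\bm{\theta}) = \frac{1}{2}\|\bm{\theta}-z_{i}\|^{2}, which is not the paper's model), the mini-batch gradient can be computed as follows.

import numpy as np

rng = np.random.default_rng(0)

def grad_ell_i(theta, z_i):
    # Gradient of the illustrative per-example loss ell_i(theta) = (1/2)||theta - z_i||^2.
    return theta - z_i

def minibatch_gradient(theta, S, b):
    # nabla L_{B_k}(theta) = (1/b) * sum_{i in B_k} nabla ell_i(theta),
    # where B_k is a batch of b indices sampled from [n].
    n = len(S)
    B_k = rng.choice(n, size=b, replace=False)
    return sum(grad_ell_i(theta, S[i]) for i in B_k) / b

# Example: n = 8 two-dimensional training points and batch size b = 4.
S = rng.normal(size=(8, 2))
print(minibatch_gradient(np.zeros(2), S, b=4))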

We consider the following algorithm (Algorithm 1), which is a unified algorithm for most deep learning optimizers, including Momentum [18], AMSGrad [19], AMSBound [13], Adam [11], and AdaBelief [30], which are listed in Table 1. (Footnote 1: We define \bm{x} \odot \bm{x} for \bm{x} := (x_{i})_{i=1}^{d} \in \mathbb{R}^{d} by \bm{x} \odot \bm{x} := (x_{i}^{2})_{i=1}^{d} \in \mathbb{R}^{d}. \mathrm{C}(\cdot, l, u) \colon \mathbb{R} \to \mathbb{R} in AMSBound (l, u \in \mathbb{R} with l \leq u are given) is defined for all x \in \mathbb{R} by \mathrm{C}(x, l, u) := l if x < l, x if l \leq x \leq u, and u if x > u.)

Algorithm 1 Deep learning optimizer
Require: \alpha \in (0, +\infty), \beta \in [0, 1), \gamma \in [0, 1)
1:  k \leftarrow 0, \bm{\theta}_{0} \in \mathcal{S}, \bm{m}_{-1} := \bm{0}, \mathsf{H}_{0} \in \mathbb{S}_{++}^{d} \cap \mathbb{D}^{d}
2:  loop
3:     \bm{m}_{k} := \beta\bm{m}_{k-1} + (1-\beta)\nabla L_{B_{k}}(\bm{\theta}_{k})
4:     \hat{\bm{m}}_{k} := (1-\gamma^{k+1})^{-1}\bm{m}_{k}
5:     \mathsf{H}_{k} \in \mathbb{S}_{++}^{d} \cap \mathbb{D}^{d} (see Table 1 for examples of \mathsf{H}_{k})
6:     Find \bm{\mathsf{d}}_{k} \in \mathbb{R}^{d} that solves \mathsf{H}_{k}\bm{\mathsf{d}} = -\hat{\bm{m}}_{k}
7:     \bm{\theta}_{k+1} := \bm{\theta}_{k} + \alpha\bm{\mathsf{d}}_{k}
8:     k \leftarrow k+1
9:  end loop
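A minimal Python sketch of Algorithm 1 is given below (this is an illustration, not the code used in Section 4). The mini-batch gradient oracle and the choice of \mathsf{H}_{k} are supplied as callables; returning a vector of ones for the diagonal of \mathsf{H}_{k} recovers SGD (\beta = \gamma = 0) and Momentum (\gamma = 0), while the diagonal choices of Table 1 give the adaptive methods.

import numpy as np

def algorithm1(theta0, minibatch_grad, Hk_diag, alpha, beta, gamma, num_steps):
    # Minimal sketch of Algorithm 1. minibatch_grad(theta, k) should return
    # nabla L_{B_k}(theta); Hk_diag(g, m, k) should return the diagonal entries
    # h_{k,i} > 0 of H_k (see Table 1 for concrete choices).
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                       # m_{-1} := 0
    for k in range(num_steps):
        g = minibatch_grad(theta, k)               # mini-batch gradient
        m = beta * m + (1.0 - beta) * g            # step 3: m_k
        m_hat = m / (1.0 - gamma ** (k + 1))       # step 4: m_hat_k
        h = Hk_diag(g, m, k)                       # step 5: diagonal of H_k
        d = -m_hat / h                             # step 6: solve H_k d = -m_hat_k
        theta = theta + alpha * d                  # step 7: theta_{k+1}
    return theta

# Momentum (identity H_k, gamma = 0), with the parameters used in Section 4:
# algorithm1(theta0, minibatch_grad, lambda g, m, k: np.ones_like(g),
#            alpha=1e-3, beta=0.9, gamma=0.0, num_steps=1000)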

Let \xi_{k} = (\xi_{k,1}, \xi_{k,2}, \ldots, \xi_{k,b}) be the random samplings in the k-th iteration, and let \xi_{[k]} = (\xi_{0}, \xi_{1}, \ldots, \xi_{k}) be the history of the process \xi_{0}, \xi_{1}, \ldots up to time step k. The adaptive methods listed in Table 1 all satisfy the following conditions:

  1. (A1) \mathsf{H}_{k} := \mathsf{diag}(h_{k,i}) depends on \xi_{[k]}, and h_{k+1,i} \geq h_{k,i} for all k \in \mathbb{N} and all i \in [d];

  2. (A2) For all i \in [d], a positive number H_{i} exists such that \sup_{k \in \mathbb{N}}\mathbb{E}[h_{k,i}] \leq H_{i}.

We define H := \max_{i \in [d]} H_{i}. The optimizers in Table 1 obviously satisfy (A1). Previously reported results [4, p.29], [30, p.18], and [10] show that (\mathsf{H}_{k})_{k \in \mathbb{N}} in Table 1 satisfies (A2). For example, AMSGrad (resp. Adam) satisfies (A2) with H = \sqrt{M} (resp. H = \sqrt{M/(1-\zeta)}), where M := \sup_{k \in \mathbb{N}}\|\nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})\| < +\infty.

Table 1. Examples of \mathsf{H}_{k} \in \mathbb{S}_{++}^{d} \cap \mathbb{D}^{d} (step 5) in Algorithm 1 (\eta, \zeta \in [0, 1))

SGD (\beta = \gamma = 0): \mathsf{H}_{k} is the identity matrix.

Momentum [18] (\gamma = 0): \mathsf{H}_{k} is the identity matrix.

AMSGrad [19] (\gamma = 0):
  \bm{p}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})
  \bm{v}_{k} = \eta\bm{v}_{k-1} + (1-\eta)\bm{p}_{k}
  \hat{\bm{v}}_{k} = (\max\{\hat{v}_{k-1,i}, v_{k,i}\})_{i=1}^{d}
  \mathsf{H}_{k} = \mathsf{diag}(\sqrt{\hat{v}_{k,i}})

AMSBound [13] (\gamma = 0):
  \bm{p}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})
  \bm{v}_{k} = \eta\bm{v}_{k-1} + (1-\eta)\bm{p}_{k}
  \hat{\bm{v}}_{k} = (\max\{\hat{v}_{k-1,i}, v_{k,i}\})_{i=1}^{d}
  \tilde{v}_{k,i} = \mathrm{C}\left(\frac{1}{\sqrt{\hat{v}_{k,i}}}, l_{k}, u_{k}\right)^{-1}
  \mathsf{H}_{k} = \mathsf{diag}(\tilde{v}_{k,i})

Adam [11] (v_{k,i} \leq v_{k+1,i}):
  \bm{p}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})
  \bm{v}_{k} = \eta\bm{v}_{k-1} + (1-\eta)\bm{p}_{k}
  \bar{\bm{v}}_{k} = \bm{v}_{k}/(1-\zeta^{k})
  \mathsf{H}_{k} = \mathsf{diag}(\sqrt{\bar{v}_{k,i}})

AdaBelief [30] (s_{k,i} \leq s_{k+1,i}):
  \tilde{\bm{p}}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) - \bm{m}_{k}
  \tilde{\bm{s}}_{k} = \tilde{\bm{p}}_{k} \odot \tilde{\bm{p}}_{k}
  \bm{s}_{k} = \eta\bm{s}_{k-1} + (1-\eta)\tilde{\bm{s}}_{k}
  \hat{\bm{s}}_{k} = \bm{s}_{k}/(1-\zeta^{k})
  \mathsf{H}_{k} = \mathsf{diag}(\sqrt{\hat{s}_{k,i}})
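For concreteness, minimal sketches of the AMSGrad and Adam rows of Table 1 are given below; they can be passed as the hypothetical Hk_diag callable of the Algorithm 1 sketch above. The small constant eps is an implementation detail added here to keep \mathsf{H}_{k} \in \mathbb{S}_{++}^{d}; it is not part of the table, and the bias-correction index assumes the iteration counter starts at k = 0.

import numpy as np

def make_amsgrad_Hk(dim, eta=0.999, eps=1e-8):
    # AMSGrad: v_k = eta v_{k-1} + (1 - eta) g (.) g,  v_hat_k = max(v_hat_{k-1}, v_k),
    # H_k = diag(sqrt(v_hat_{k,i})).
    v = np.zeros(dim)
    v_hat = np.zeros(dim)
    def Hk_diag(g, m, k):
        nonlocal v, v_hat
        v = eta * v + (1.0 - eta) * g * g
        v_hat = np.maximum(v_hat, v)
        return np.sqrt(v_hat) + eps
    return Hk_diag

def make_adam_Hk(dim, eta=0.999, zeta=0.999, eps=1e-8):
    # Adam: v_k = eta v_{k-1} + (1 - eta) g (.) g,  v_bar_k = v_k / (1 - zeta^{k+1}),
    # H_k = diag(sqrt(v_bar_{k,i})).
    v = np.zeros(dim)
    def Hk_diag(g, m, k):
        nonlocal v
        v = eta * v + (1.0 - eta) * g * g
        v_bar = v / (1.0 - zeta ** (k + 1))
        return np.sqrt(v_bar) + eps
    return Hk_diag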

The theoretical performance metric used to approximate a local minimizer in \mathrm{VI}(\mathcal{S}, \nabla L) (see (2.3)) is an \epsilon-approximation of the sequence (\bm{\theta}_{k})_{k \in \mathbb{N}} generated by Algorithm 1 in the sense of the mean value of (\bm{\theta}_{k} - \bm{\theta})^{\top}\nabla L(\bm{\theta}_{k}) (k \in [K]); that is,

(2.6) \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}) := \frac{1}{K}\sum_{k \in [K]} \mathbb{E}\left[(\bm{\theta}_{k} - \bm{\theta})^{\top}\nabla L(\bm{\theta}_{k})\right] \leq \epsilon^{2},

where \epsilon > 0 and \delta \in (0, 1] are precisions and \bm{\theta} \in \mathcal{S}. It suffices to regard the right-hand side of (2.6), \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2}, as the condition under which Algorithm 1 approximates a local minimizer of L. Since the condition \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2} leads to the lower bound of K satisfying \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2}, we also consider \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}), which leads to the upper bound of K.

3. Main Results

Let us suppose that (S1), (S2)(2.4), and (S3) hold. Then, Algorithm 1 satisfies the following equation (see Lemma A.2 for details): for all \bm{\theta} \in \mathcal{S} and all K \geq 1,

(3.1) \mathrm{V}(K, \bm{\theta}) = \frac{1}{2\alpha\tilde{\beta}K}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\left\{\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]-\mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]\right\}}_{\Gamma_{K}} + \frac{\alpha}{2\tilde{\beta}K}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right]}_{A_{K}} + \frac{\beta}{\tilde{\beta}K}\underbrace{\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]}_{B_{K}},

where \tilde{\beta} := 1-\beta, \tilde{\gamma}_{k} := 1-\gamma^{k+1}, and \|\bm{x}\|_{S}^{2} := \bm{x}^{\top}S\bm{x} (S \in \mathbb{S}_{++}^{d}, \bm{x} \in \mathbb{R}^{d}).

3.1. Lower bound of K

We provide the lower bound of K satisfying \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2} under the following conditions:

  1. (L1) There exists a positive number P such that \mathbb{E}[\|\nabla L(\bm{\theta}_{k})\|^{2}] \leq P^{2}, where \bm{\theta}_{k} = (\theta_{k,i}) is the sequence generated by Algorithm 1.

  2. (L2) For all \bm{\theta} = (\theta_{i}) \in \mathcal{S}, there exists a positive number \mathrm{Dist}(\bm{\theta}) such that \max_{i \in [d]}\sup\{(\theta_{k,i} - \theta_{i})^{2} \colon k \in \mathbb{N}\} \leq \mathrm{Dist}(\bm{\theta}).

Condition (L1) was used in previous studies [4, A2] and [30, Theorem 2.2] to analyze deep learning optimizers for nonconvex optimization. Here, we use (L1) to provide the upper bounds of \mathbb{E}[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}] and \mathbb{E}[\|\bm{m}_{k}\|^{2}] (see Lemma A.3 for details); i.e.,

(3.2) \mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \frac{\sigma^{2}}{b} + P^{2} \quad \text{and} \quad \mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right] \leq \frac{\sigma^{2}b^{-1} + P^{2}}{\tilde{\gamma}^{2}h_{0}^{*}},

where (S2)(2.5) and (A1) are assumed, \tilde{\gamma} := 1-\gamma, and h_{0}^{*} := \min_{i \in [d]} h_{0,i}. This implies that A_{K} in (3.1) is bounded above (see Theorems A.1 and 3.1 for details).

Condition (L2) is assumed for both convex and nonconvex optimization, e.g., [16, p.1574], [11, Theorem 4.1], [19, p.2], [13, Theorem 4], and [30, Theorem 2.1]. Here, we use (L2) and (A2) to provide the upper bounds of \Gamma_{K} and B_{K} in (3.1) (see Theorem A.1 for details); i.e.,

\Gamma_{K} \leq d\mathrm{Dist}(\bm{\theta})H \quad \text{and} \quad B_{K} \leq \sqrt{d\mathrm{Dist}(\bm{\theta})(\sigma^{2} + P^{2})}\,K,

where (S2)(2.5) and (L1) are assumed and H := \max_{i \in [d]} H_{i}. Therefore, we have the following theorem (see the Appendix for the detailed proof):

Theorem 3.1.

Suppose that (S1)–(S3), (A1)–(A2), and (L1)–(L2) hold. Then, Algorithm 1 satisfies that, for all \bm{\theta} \in \mathcal{S} and all K \geq 1,

\mathrm{V}(K, \bm{\theta}) \leq \frac{A}{K} + \frac{B}{b} + C + D,

where \tilde{\beta} := 1-\beta, \tilde{\gamma} := 1-\gamma, h_{0}^{*} := \min_{i \in [d]} h_{0,i},

A = A(\alpha, \beta, \bm{\theta}, d, H) = \frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}},
B = B(\alpha, \beta, \gamma, \sigma^{2}, h_{0}^{*}) = \frac{\sigma^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}},
C = C(\alpha, \beta, \gamma, P^{2}, h_{0}^{*}) = \frac{P^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}},
D = D(\beta, \bm{\theta}, d, \sigma^{2}, P^{2}) = \frac{\sqrt{d\mathrm{Dist}(\bm{\theta})(\sigma^{2}+P^{2})}\,\beta}{\tilde{\beta}}.

Moreover, the lower bound \underline{K}_{\epsilon} of the number of steps K_{\epsilon} satisfying \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2} for Algorithm 1 is such that, for all b > B/\{\epsilon^{2}-(C+D)\} \geq 0,

\underline{K}_{\epsilon}(b) := \frac{Ab}{\{\epsilon^{2}-(C+D)\}b-B} \leq K_{\epsilon}.

The lower bound \underline{K}_{\epsilon} is monotone decreasing and convex:

\underline{K}_{\epsilon}^{\star} := \frac{A}{\epsilon^{2}-(C+D)} < \underline{K}_{\epsilon}(b)

for all b > B/\{\epsilon^{2}-(C+D)\}.
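The expression for \underline{K}_{\epsilon}(b) follows from a direct rearrangement of the bound in Theorem 3.1: whenever \{\epsilon^{2}-(C+D)\}b - B > 0,

\frac{A}{K} + \frac{B}{b} + C + D \leq \epsilon^{2}
\;\Longleftrightarrow\;
\frac{A}{K} \leq \epsilon^{2} - (C+D) - \frac{B}{b} = \frac{\{\epsilon^{2}-(C+D)\}b - B}{b}
\;\Longleftrightarrow\;
K \geq \frac{Ab}{\{\epsilon^{2}-(C+D)\}b - B} = \underline{K}_{\epsilon}(b).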

The minimization of \underline{K}_{\epsilon}(b)b is as follows (the proof of Theorem 3.2 is given in the Appendix):

Theorem 3.2.

Suppose that the assumptions in Theorem 3.1 hold and consider Algorithm 1. Then, there exists

b_{*} := \frac{2B}{\epsilon^{2}-(C+D)}

such that b_{*} minimizes the convex function \underline{K}_{\epsilon}(b)b; i.e., for all b > B/\{\epsilon^{2}-(C+D)\},

\frac{4AB}{\{\epsilon^{2}-(C+D)\}^{2}} = \underline{K}_{\epsilon}(b_{*})b_{*} \leq \underline{K}_{\epsilon}(b)b = \frac{Ab^{2}}{\{\epsilon^{2}-(C+D)\}b-B}.
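For completeness (the full proof is given in the Appendix), the calculus behind Theorem 3.2 amounts to one computation: writing c := \epsilon^{2}-(C+D) > 0 and f(b) := \underline{K}_{\epsilon}(b)b = Ab^{2}/(cb-B),

f'(b) = \frac{2Ab(cb-B) - Acb^{2}}{(cb-B)^{2}} = \frac{Ab(cb-2B)}{(cb-B)^{2}},

so f'(b) < 0 for B/c < b < 2B/c and f'(b) > 0 for b > 2B/c; hence f attains its minimum at b_{*} = 2B/c with f(b_{*}) = 4AB/c^{2}. The same argument with E, F, and \delta\epsilon^{2}+G in place of A, B, and c yields the minimizer b^{*} in Theorem 3.4 below.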

3.2. Upper bound of K

We provide the upper bound of K satisfying \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}) under the following conditions:

  1. (U1) There exists c_{1} \in [0, 1] such that \mathbb{E}[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}] \geq c_{1}\sigma^{2}/b.

  2. (U2) There exist \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L) and c_{2} \geq 0 such that \mathbb{E}[(\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}] \geq -c_{2}(\sigma^{2}/b + P^{2}).

  3. (U3) For \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L), there exists a positive number X(\bm{\theta}^{*}) such that, for all k \geq 1, \tilde{\gamma}\mathbb{E}[\|\bm{\theta}_{1}-\bm{\theta}^{*}\|_{\mathsf{H}_{1}}^{2}] - \mathbb{E}[\|\bm{\theta}_{k+1}-\bm{\theta}^{*}\|_{\mathsf{H}_{k}}^{2}] \geq X(\bm{\theta}^{*}), where \tilde{\gamma} := 1-\gamma.

To justify (U1), we consider the case where \mathsf{H}_{k} is the identity matrix. The definition of \bm{m}_{k} := \beta\bm{m}_{k-1} + (1-\beta)\nabla L_{B_{k}}(\bm{\theta}_{k}) implies that \bm{m}_{k} depends on \nabla L_{B_{k}}(\bm{\theta}_{k}). Here, we assume the following condition, which is stronger than (S2)(2.5): there exists c_{1} \in [0, 1] such that

(3.3) c_{1}\sigma^{2} \leq \mathbb{E}_{\xi_{k}}\left[\left\|\mathsf{G}_{\xi_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\right\|^{2}\right] \leq \sigma^{2}.

Then, (S2)(2.4) and (3.3) ensure that

\mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] \geq \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] \geq \frac{c_{1}\sigma^{2}}{b}.

From \tilde{\gamma}_{k} = 1-\gamma^{k+1} \leq 1, we have that \|\bm{\mathsf{d}}_{k}\|^{2} = \tilde{\gamma}_{k}^{-2}\|\bm{m}_{k}\|^{2} \geq \|\bm{m}_{k}\|^{2} (k \in \mathbb{N}). Accordingly, the lower bound of \|\bm{\mathsf{d}}_{k}\|^{2} depends on c_{1}\sigma^{2}/b.

Algorithm 1 for nonconvex optimization satisfies that there exist positive numbers C_{1} and C_{2} such that, for all \bm{\theta} \in \mathcal{S},

(3.4) \liminf_{k \to +\infty}\mathbb{E}\left[(\bm{\theta}_{k}-\bm{\theta})^{\top}\nabla L(\bm{\theta}_{k})\right] \leq C_{1}\alpha + C_{2}\beta, \quad \mathrm{V}(K, \bm{\theta}) \leq \mathcal{O}\left(\frac{1}{K}\right) + C_{1}\alpha + C_{2}\beta

(see [10, Theorem 1]). The sequence (\bm{\theta}_{k})_{k \in \mathbb{N}} generated by Algorithm 1 can thus approximate a local minimizer \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L). Meanwhile, we have that

(\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} = \frac{1}{2}\left\{\|\bm{\theta}^{*}-\bm{\theta}_{k}\|^{2} + \|\bm{m}_{k-1}\|^{2} - \|(\bm{\theta}^{*}-\bm{\theta}_{k}) - \bm{m}_{k-1}\|^{2}\right\} \geq -\frac{1}{2}\|(\bm{\theta}^{*}-\bm{\theta}_{k}) - \bm{m}_{k-1}\|^{2}.

Accordingly, for a sufficiently large k, the lower bound of (\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} depends on -\mathbb{E}[\|\bm{m}_{k-1}\|^{2}] \geq -(\sigma^{2}/b + P^{2}), where (S2)(2.5) and (L1) are assumed (see (3.2)). Hence, we assume (U2) for Algorithm 1.

Condition (U3) implies that \mathbb{E}[\|\bm{\theta}_{k+1}-\bm{\theta}^{*}\|_{\mathsf{H}_{k}}^{2}] < \mathbb{E}[\|\bm{\theta}_{1}-\bm{\theta}^{*}\|_{\mathsf{H}_{1}}^{2}] < +\infty, which is stronger than (L2), and that \bm{\theta}_{k+1} approximates \bm{\theta}^{*} more appropriately than \bm{\theta}_{1} does. Since the sequence (\bm{\theta}_{k})_{k \in \mathbb{N}} generated by Algorithm 1 can approximate a local minimizer \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L) (see (3.4)), (U3) is also adequate for consideration of the upper bound of K.

Theorem 3.3.

Suppose that (S1)–(S3), (A1), (L1), and (U1)–(U3) hold. Then, Algorithm 1 satisfies that, for all K \geq 1,

\mathrm{V}(K, \bm{\theta}^{*}) \geq \frac{E}{K} + \frac{F}{b} - G,

where \tilde{\beta} := 1-\beta, \tilde{\gamma} := 1-\gamma,

E = E(\alpha, \beta, \bm{\theta}^{*}) = \frac{X(\bm{\theta}^{*})}{2\alpha\tilde{\beta}},
F = F(\alpha, \beta, \gamma, \sigma^{2}, c_{1}, c_{2}) = \frac{\sigma^{2}(c_{1}\alpha\tilde{\gamma} - 2c_{2}\beta)}{2\tilde{\beta}},
G = G(\beta, c_{2}, P^{2}) = \frac{c_{2}\beta P^{2}}{\tilde{\beta}}.

Moreover, the upper bound \overline{K}_{\epsilon,\delta} of the number of steps K_{\epsilon,\delta} satisfying \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}^{*}) for Algorithm 1 is such that, for all b > F/(\delta\epsilon^{2}+G) \geq 0,

\overline{K}_{\epsilon,\delta}(b) := \frac{Eb}{(\delta\epsilon^{2}+G)b-F} \geq K_{\epsilon,\delta}.

The upper bound \overline{K}_{\epsilon,\delta} is monotone decreasing and convex:

\overline{K}_{\epsilon,\delta}^{\star} := \frac{E}{\delta\epsilon^{2}+G} < \overline{K}_{\epsilon,\delta}(b)

for all b > F/(\delta\epsilon^{2}+G).

The minimization of \overline{K}_{\epsilon,\delta}(b)b is as follows (the proof of Theorem 3.4 is given in the Appendix):

Theorem 3.4.

Suppose that the assumptions in Theorem 3.3 hold and consider Algorithm 1. Then, there exists

b^{*} := \frac{2F}{\delta\epsilon^{2}+G}

such that b^{*} minimizes the convex function \overline{K}_{\epsilon,\delta}(b)b; i.e., for all b > F/(\delta\epsilon^{2}+G),

\frac{4EF}{(\delta\epsilon^{2}+G)^{2}} = \overline{K}_{\epsilon,\delta}(b^{*})b^{*} \leq \overline{K}_{\epsilon,\delta}(b)b = \frac{Eb^{2}}{(\delta\epsilon^{2}+G)b-F}.
Figure 3. Number of steps for SGD, Momentum, and Adam versus batch size needed to train ResNet-20 on CIFAR-10. There is an initial period of perfect scaling (indicated by the dashed line) such that the number of steps K needed for L(\bm{\theta}_{K}) \leq 10^{-1} is inversely proportional to the batch size b. However, Adam has critical batch size b^{\star} = 2^{11}.
Figure 4. SFO complexities for SGD, Momentum, and Adam versus batch size needed to train ResNet-20 on CIFAR-10. The SFO complexity of Adam is minimized at the critical batch size b^{\star} = 2^{11}, whereas the SFO complexities for SGD and Momentum tend to increase with the batch size.
Table 2. Elapsed time and training accuracy of SGD when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10 (no entry is given for batch sizes at which SGD did not reach L(\bm{\theta}_{K}) \leq 10^{-1} before the stopping condition; see Section 4)

SGD
Batch Size    2^2       2^3      2^4      2^5      2^6  2^7  2^8  2^9
Time (s)      16983.64  9103.76  6176.19  3759.25  --   --   --   --
Accuracy (%)  96.75     96.69    96.66    96.88    --   --   --   --

Table 3. Elapsed time and training accuracy of Momentum when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10

Momentum
Batch Size    2^2      2^3      2^4      2^5      2^6     2^7     2^8     2^9
Time (s)      7978.90  3837.72  2520.82  1458.70  887.01  678.66  625.10  866.65
Accuracy (%)  96.49    96.79    96.51    96.72    96.70   96.94   96.94   98.34

Table 4. Elapsed time and training accuracy of Adam when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10

Adam
Batch Size    2^2       2^3      2^4      2^5      2^6     2^7     2^8     2^9
Time (s)      10601.78  4405.73  2410.28  1314.01  617.14  487.75  281.74  225.03
Accuracy (%)  96.46     96.38    96.65    96.53    96.43   96.68   96.58   96.72

Table 5. Elapsed time and training accuracy of Adam when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10

Adam
Batch Size    2^10    2^11    2^12    2^13    2^14    2^15    2^16
Time (s)      197.78  195.40  233.70  349.81  691.04  644.19  1148.68
Accuracy (%)  96.74   97.21   97.54   97.75   97.51   99.05   99.03

3.3. Discussion

To guarantee that \underline{K}_{\epsilon}(b)b in Theorem 3.2 and \overline{K}_{\epsilon,\delta}(b)b in Theorem 3.4 fit the actual SFO complexity Kb, we need to satisfy (1.4); that is,

\underbrace{(C+D)F}_{\approx(M_{1}\alpha+M_{2})(M_{3}\alpha-M_{4})} + \underbrace{BG}_{\approx M_{5}\alpha} \approx (F-\delta B)\epsilon^{2} \quad \text{and} \quad \underbrace{(C+D)}_{\approx M_{1}\alpha+M_{2}}\sqrt{EF} + G\sqrt{AB} \approx (\sqrt{EF}-\delta\sqrt{AB})\epsilon^{2}.

Hence, if the precisions \epsilon and \delta are small, a sufficiently small learning rate \alpha can be used to approximate K(b)b. Indeed, small learning rates such as 10^{-2} and 10^{-3} have been used in practice in previous numerical validations [23]. The demonstration of the existence of b_{*} and b^{*} thus leads to the existence of the actual critical batch size b^{\star} \approx b_{*} \approx b^{*}.

4. Numerical Examples

We evaluated the performances of SGD, Momentum, and Adam with different batch sizes for training ResNet-20 on the CIFAR-10 dataset with n = 50000. The metrics were the number of steps K and the SFO complexity Kb satisfying L(\bm{\theta}_{K}) \leq 10^{-1}, where \bm{\theta}_{K} is generated for each of SGD, Momentum, and Adam using batch size b. The stopping condition was 200 epochs. The experimental environment consisted of two Intel(R) Xeon(R) Gold 6148 2.4-GHz CPUs with 20 cores each, a 16-GB NVIDIA Tesla V100 900-Gbps GPU, and the Red Hat Enterprise Linux 7.6 OS. The code was written in Python 3.8.2 using the NumPy 1.17.3 and PyTorch 1.3.0 packages. A constant learning rate (\alpha = 10^{-3}) was commonly used. SGD used the identity matrix \mathsf{H}_{k} and \beta = \gamma = 0, and Momentum used the identity matrix \mathsf{H}_{k}, \beta = 0.9, and \gamma = 0. Adam used \mathsf{H}_{k} defined by Table 1, \beta = \gamma = 0.9, and \eta = \zeta = 0.999 [11].

Figure 3 shows the number of steps for SGD, Momentum, and Adam versus the batch size. For SGD and Momentum, the number of steps K needed for L(\bm{\theta}_{K}) \leq 10^{-1} initially decreased. However, SGD with b \geq 2^{6} and Momentum with b \geq 2^{10} did not satisfy L(\bm{\theta}_{K}) \leq 10^{-1} before the stopping condition was reached. Adam had an initial period of perfect scaling (indicated by the dashed line) such that the number of steps K needed for L(\bm{\theta}_{K}) \leq 10^{-1} was inversely proportional to the batch size b, and a critical batch size b^{\star} = 2^{11} such that K was not inversely proportional to the batch size beyond b^{\star}; i.e., there were diminishing returns.

Figure 4 plots the SFO complexities for SGD, Momentum, and Adam versus the batch size. For SGD, the SFO complexity was minimized at b = 2^{2}; for Momentum, it was minimized at b = 2^{3}. For Adam, the SFO complexity was minimized at the critical batch size b^{\star} = 2^{11}. In particular, we determined that the SFO complexity of Adam with b = 2^{10} was about 1404928, that of Adam with b^{\star} = 2^{11} was about 1382400, and that of Adam with b = 2^{12} was about 1650688.
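The step counts implied by these figures follow directly from K = (Kb)/b; the short check below reproduces them from the reported SFO complexities and confirms that b^{\star} = 2^{11} is the minimizer among these measured batch sizes.

# Reported SFO complexities Kb of Adam around the critical batch size.
sfo = {2**10: 1404928, 2**11: 1382400, 2**12: 1650688}

for b, kb in sfo.items():
    # K = (Kb)/b: 1372 steps at 2^10, 675 at 2^11, and 403 at 2^12.
    print(f"b = 2^{b.bit_length() - 1}: K = {kb // b}, Kb = {kb}")

print("critical batch size:", min(sfo, key=sfo.get))   # 2048 = 2^11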

Tables 2, 3, 4, and 5 show the elapsed time and the training accuracy for SGD, Momentum, and Adam when L(\bm{\theta}_{K}) \leq 10^{-1} was achieved. Table 2 shows that the elapsed time for SGD decreased with an increase in batch size. Table 3 shows that the elapsed time for Momentum monotonically decreased for b \leq 2^{8} and increased for b = 2^{9} compared with that for b = 2^{8}. This is because the SFO complexity of Momentum for b = 2^{9} was substantially higher than that for b = 2^{8}, as shown in Figure 4. Tables 4 and 5 show that the elapsed time for Adam monotonically decreased for b \leq 2^{11} and that the elapsed time for the critical batch size b^{\star} = 2^{11} was the shortest. The elapsed time for b \geq 2^{12} increased with the SFO complexity, as shown in Figure 4.

We also investigated the behaviors of SGD, Momentum, and Adam for a decaying learning rate \alpha_{k} = 10^{-3}/\sqrt{k} and for a linearly decaying learning rate defined by \alpha_{k} = 10^{-2} - 10^{-2}(1-\alpha)kT^{-1} (k \leq T) or \alpha_{k} = 10^{-2}\alpha (k > T), where the decay factor was \alpha = 10^{-3} and the number of training steps was T = 10^{4} (\alpha and T were set on the basis of the results of a previous study [23, Figure 14]). None of the optimizers achieved L(\bm{\theta}_{K}) \leq 10^{-1} before the stopping condition (200 epochs) was reached.

5. Conclusion

We have clarified the lower and upper bounds of the number of steps needed for nonconvex optimization of adaptive methods and have shown that there exist critical batch sizes b_{*} and b^{*} in the sense of minimizing the lower and upper bounds of the stochastic first-order oracle (SFO) complexities of adaptive methods. Previous numerical evaluations [23, 29] showed that the actual critical batch size b^{\star} minimizes the SFO complexity. Hence, if the lower and upper bounds of the SFO complexity fit the actual SFO complexity, then the existence of b_{*} and b^{*} guarantees the existence of the actual critical batch size b^{\star}. We also demonstrated that it would be sufficient to use small learning rates for an adaptive method so that the lower and upper bounds of the SFO complexity fit the actual SFO complexity. Furthermore, we provided numerical examples supporting our theoretical results. They showed that Adam with a frequently used small learning rate (10^{-3}) has a critical batch size that minimizes the SFO complexity.

References

  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. https://arxiv.org/pdf/1701.07875.pdf, 2017.
  • [2] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60:223–311, 2018.
  • [3] Hao Chen, Lili Zheng, Raed AL Kontar, and Garvesh Raskutti. Stochastic gradient descent in correlated settings: A study on Gaussian processes. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [4] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In Proceedings of The International Conference on Learning Representations, 2019.
  • [5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • [6] F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems I. Springer, New York, 2003.
  • [7] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • [8] Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions. Journal of Machine Learning Research, 21:1–48, 2020.
  • [9] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
  • [10] Hideaki Iiduka. Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2021.3107415, 2021.
  • [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of The International Conference on Learning Representations, 2015.
  • [12] Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for SGD: An adaptive learning rate for fast convergence. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130, 2021.
  • [13] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In Proceedings of The International Conference on Learning Representations, 2019.
  • [14] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of Machine Learning Research, volume 37, pages 2408–2417, 2015.
  • [15] Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [16] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
  • [17] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2){O}(1/k^{2}). Doklady AN USSR, 269:543–547, 1983.
  • [18] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.
  • [19] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In Proceedings of The International Conference on Learning Representations, 2018.
  • [20] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  • [21] Kevin Scaman and Cédric Malherbe. Robustness analysis of non-convex stochastic gradient descent using biased expectations. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [22] Robin M. Schmidt, Frank Schneider, and Philipp Hennig. Descending through a crowded valley–Benchmarking deep learning optimizers. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9367–9376, 2021.
  • [23] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20:1–49, 2019.
  • [24] Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In Proceedings of The International Conference on Learning Representations, 2018.
  • [25] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, 2013.
  • [26] Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4:26–31, 2012.
  • [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [28] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2048–2057, 2015.
  • [29] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [30] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James S. Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [31] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 23, 2010.

Appendix A Appendix

Unless stated otherwise, all relationships between random variables are assumed to hold almost surely. Let S \in \mathbb{S}_{++}^{d}. The S-inner product of \mathbb{R}^{d} is defined for all \bm{x}, \bm{y} \in \mathbb{R}^{d} by \langle\bm{x},\bm{y}\rangle_{S} := \langle\bm{x}, S\bm{y}\rangle = \bm{x}^{\top}(S\bm{y}), and the S-norm is defined by \|\bm{x}\|_{S} := \sqrt{\langle\bm{x}, S\bm{x}\rangle}.

A.1. Lemmas and Theorems

Lemma A.1.

Suppose that (S1), (S2)(2.4), and (S3) hold. Then, Algorithm 1 satisfies the following: for all k \in \mathbb{N} and all \bm{\theta} \in \mathcal{S},

\mathbb{E}\left[\|\bm{\theta}_{k+1}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2}\right] = \mathbb{E}\left[\|\bm{\theta}_{k}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2}\right] + \alpha^{2}\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right] + 2\alpha\left\{\frac{\beta}{\tilde{\gamma}_{k}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right] + \frac{\tilde{\beta}}{\tilde{\gamma}_{k}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L(\bm{\theta}_{k})\right]\right\},

where \tilde{\beta} := 1-\beta and \tilde{\gamma}_{k} := 1-\gamma^{k+1}.

Proof.

Let \bm{\theta} \in \mathcal{S} and k \in \mathbb{N}. The definition of \bm{\theta}_{k+1} implies that

\|\bm{\theta}_{k+1}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2} = \|\bm{\theta}_{k}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2} + 2\alpha\langle\bm{\theta}_{k}-\bm{\theta}, \bm{\mathsf{d}}_{k}\rangle_{\mathsf{H}_{k}} + \alpha^{2}\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}.

Moreover, the definitions of \bm{\mathsf{d}}_{k}, \bm{m}_{k}, and \hat{\bm{m}}_{k} ensure that

\left\langle\bm{\theta}_{k}-\bm{\theta}, \bm{\mathsf{d}}_{k}\right\rangle_{\mathsf{H}_{k}} = \frac{1}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k} = \frac{\beta}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} + \frac{\tilde{\beta}}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L_{B_{k}}(\bm{\theta}_{k}).

Hence,

(A.1) \left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2} = \left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2} + \alpha^{2}\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2} + 2\alpha\left\{\frac{\beta}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} + \frac{\tilde{\beta}}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L_{B_{k}}(\bm{\theta}_{k})\right\}.

Conditions (2.4) and (S3) guarantee that

\mathbb{E}\left[\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L_{B_{k}}(\bm{\theta}_{k})\,\big|\,\bm{\theta}_{k}\right]\right] = \mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\mathbb{E}\left[\nabla L_{B_{k}}(\bm{\theta}_{k})\,\big|\,\bm{\theta}_{k}\right]\right] = \mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L(\bm{\theta}_{k})\right].

Therefore, the lemma follows by taking the expectation with respect to \xi_{k} on both sides of (A.1). This completes the proof. ∎

Lemma A.1 leads to the following:

Lemma A.2.

Suppose that the assumptions in Lemma A.1 hold and consider Algorithm 1. Let V_{k}(\bm{\theta}) := \mathbb{E}[(\bm{\theta}_{k}-\bm{\theta})^{\top}\nabla L(\bm{\theta}_{k})] for all \bm{\theta} \in \mathcal{S} and all k \in \mathbb{N}. Then, for all \bm{\theta} \in \mathcal{S} and all K \geq 1,

(A.2) \sum_{k=1}^{K}V_{k}(\bm{\theta}) = \frac{1}{2\alpha\tilde{\beta}}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\left\{\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]-\mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]\right\}}_{\Gamma_{K}} + \frac{\alpha}{2\tilde{\beta}}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right]}_{A_{K}} + \frac{\beta}{\tilde{\beta}}\underbrace{\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]}_{B_{K}}.
Proof.

Let \bm{\theta} \in \mathcal{S}. Lemma A.1 guarantees that, for all k \in \mathbb{N},

\frac{2\alpha\tilde{\beta}}{\tilde{\gamma}_{k}}V_{k}(\bm{\theta}) = \mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right] - \mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right] + \alpha^{2}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right] + \frac{2\alpha\beta}{\tilde{\gamma}_{k}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right].

Hence, we have that

V_{k}(\bm{\theta}) = \frac{\tilde{\gamma}_{k}}{2\alpha\tilde{\beta}}\left\{\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right] - \mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]\right\} + \frac{\alpha\tilde{\gamma}_{k}}{2\tilde{\beta}}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right] + \frac{\beta}{\tilde{\beta}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right].

Summing the above equality from k = 1 to K \geq 1 implies the assertion in Lemma A.2. This completes the proof. ∎

Lemma A.3.

Algorithm 1 satisfies that, under (S2)(2.4), (2.5), and (L1), for all k \in \mathbb{N},

\mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \frac{\sigma^{2}}{b} + P^{2}.

Under (A1) and (L1), for all k \in \mathbb{N},

\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right] \leq \frac{1}{(1-\gamma)^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b} + P^{2}\right),

where h_{0}^{*} := \min_{i \in [d]} h_{0,i}.

Proof.

Let k \in \mathbb{N}. From (S2)(2.4), we have that

\mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] = \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k}) + \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right]
= \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] + \mathbb{E}\left[\|\nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] + 2\mathbb{E}\left[(\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k}))^{\top}\nabla L(\bm{\theta}_{k})\,\big|\,\bm{\theta}_{k}\right]
= \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] + \|\nabla L(\bm{\theta}_{k})\|^{2},

which, together with (S2)(2.5) and (L1), implies that

(A.3) \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\right] \leq \frac{\sigma^{2}}{b} + P^{2}.

The convexity of \|\cdot\|^{2}, together with the definition of \bm{m}_{k} and (A.3), guarantees that, for all k \in \mathbb{N},

\mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \beta\mathbb{E}\left[\|\bm{m}_{k-1}\|^{2}\right] + (1-\beta)\mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\right] \leq \beta\mathbb{E}\left[\|\bm{m}_{k-1}\|^{2}\right] + (1-\beta)\left(\frac{\sigma^{2}}{b} + P^{2}\right).

Induction thus ensures that, for all k \in \mathbb{N},

(A.4) \mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \max\left\{\|\bm{m}_{-1}\|^{2}, \frac{\sigma^{2}}{b} + P^{2}\right\} = \frac{\sigma^{2}}{b} + P^{2},

where 𝒎1=𝟎\bm{m}_{-1}=\bm{0}. For kk\in\mathbb{N}, 𝖧k𝕊++d\mathsf{H}_{k}\in\mathbb{S}_{++}^{d} guarantees the existence of a unique matrix 𝖧¯k𝕊++d\overline{\mathsf{H}}_{k}\in\mathbb{S}_{++}^{d} such that 𝖧k=𝖧¯k2\mathsf{H}_{k}=\overline{\mathsf{H}}_{k}^{2} [9, Theorem 7.2.6]. We have that, for all 𝒙d\bm{x}\in\mathbb{R}^{d}, 𝒙𝖧k2=𝖧¯k𝒙2\|\bm{x}\|_{\mathsf{H}_{k}}^{2}=\|\overline{\mathsf{H}}_{k}\bm{x}\|^{2}. Accordingly, the definitions of 𝗱k\bm{\mathsf{d}}_{k} and 𝒎^k\hat{\bm{m}}_{k} imply that, for all kk\in\mathbb{N},

𝔼[𝗱k𝖧k2]=𝔼[𝖧¯k1𝖧k𝗱k2]1γ~k2𝔼[𝖧¯k12𝒎k2]1(1γ)2𝔼[𝖧¯k12𝒎k2],\displaystyle\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right]=\mathbb{E}\left[\left\|\overline{\mathsf{H}}_{k}^{-1}\mathsf{H}_{k}\bm{\mathsf{d}}_{k}\right\|^{2}\right]\leq\frac{1}{{\tilde{\gamma}}_{k}^{2}}\mathbb{E}\left[\left\|\overline{\mathsf{H}}_{k}^{-1}\right\|^{2}\|\bm{m}_{k}\|^{2}\right]\leq\frac{1}{(1-\gamma)^{2}}\mathbb{E}\left[\left\|\overline{\mathsf{H}}_{k}^{-1}\right\|^{2}\|\bm{m}_{k}\|^{2}\right],

where

𝖧¯k1=𝖽𝗂𝖺𝗀(hk,i12)=maxi[d]hk,i12\displaystyle\left\|\overline{\mathsf{H}}_{k}^{-1}\right\|=\left\|\mathsf{diag}\left(h_{k,i}^{-\frac{1}{2}}\right)\right\|={\max_{i\in[d]}h_{k,i}^{-\frac{1}{2}}}

and γ~k:=1γk+11γ{\tilde{\gamma}}_{k}:=1-{\gamma}^{k+1}\geq 1-{\gamma}. Moreover, (A1) ensures that, for all kk\in\mathbb{N},

hk,ih0,ih0:=mini[d]h0,i.\displaystyle h_{k,i}\geq h_{0,i}\geq h_{0}^{*}:=\min_{i\in[d]}h_{0,i}.

Hence, (A.4) implies that, for all kk\in\mathbb{N},

𝔼[𝗱k𝖧k2]1(1γ)2h0(σ2b+P2),\displaystyle\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right]\leq\frac{1}{(1-{\gamma})^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right),

completing the proof. ∎
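As a numerical illustration of the variance bound (A.3) underlying this lemma, the following Python sketch draws mini-batch gradients for a synthetic problem in which the per-sample gradient noise has second moment $\sigma^{2}$ and the full gradient norm plays the role of $P$; the constants and the problem itself are hypothetical placeholders, not the paper's experimental setting. The sample mean of $\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}$ should track $\sigma^{2}/b+\|\nabla L(\bm{\theta}_{k})\|^{2}\leq\sigma^{2}/b+P^{2}$, halving the noise term for each doubling of $b$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d = 10                                         # parameter dimension (hypothetical)
grad_full = rng.normal(size=d)
grad_full *= 2.0 / np.linalg.norm(grad_full)   # ||grad L(theta)|| = 2, playing the role of P
sigma = 1.5                                    # per-sample noise level (hypothetical)

def minibatch_grad(b):
    # average of b i.i.d. noisy per-sample gradients; each noise vector has
    # E||noise||^2 = sigma^2, so the averaged noise has E||.||^2 = sigma^2 / b
    noise = rng.normal(scale=sigma / np.sqrt(d), size=(b, d)).mean(axis=0)
    return grad_full + noise

for b in (1, 8, 64):
    sq_norms = [np.sum(minibatch_grad(b) ** 2) for _ in range(20000)]
    rhs = sigma ** 2 / b + np.sum(grad_full ** 2)   # sigma^2/b + ||grad L||^2
    print(f"b={b:3d}  mean ||grad L_B||^2 = {np.mean(sq_norms):.3f}  "
          f"sigma^2/b + ||grad L||^2 = {rhs:.3f}")
\end{verbatim}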

We are now in a position to prove the following theorems.

Theorem A.1.

Suppose that (S1)–(S3), (A1)–(A2), and (L1)–(L2) hold and consider Algorithm 1. Then, for all $\bm{\theta}\in\mathcal{S}$ and all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta})\leq\frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}K}+\frac{\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right)+\frac{\beta\sqrt{d\mathrm{Dist}(\bm{\theta})}}{\tilde{\beta}}\sqrt{\frac{\sigma^{2}}{b}+P^{2}},
\]

where $\tilde{\gamma}:=1-\gamma$, $\mathrm{Dist}(\bm{\theta})$ and $H_{i}$ are defined as in (L2) and (A2), and $H:=\max_{i\in[d]}H_{i}$.

Proof.

Let $\bm{\theta}\in\mathcal{S}$. From the definition of $\Gamma_{K}$ in Lemma A.2, we have that

\begin{equation}\tag{A.5}
\begin{split}
\Gamma_{K}&=\tilde{\gamma}_{1}\mathbb{E}\left[\left\|\bm{\theta}_{1}-\bm{\theta}\right\|_{\mathsf{H}_{1}}^{2}\right]+\underbrace{\sum_{k=2}^{K}\left\{\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]-\tilde{\gamma}_{k-1}\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k-1}}^{2}\right]\right\}}_{\tilde{\Gamma}_{K}}\\
&\quad-\tilde{\gamma}_{K}\mathbb{E}\left[\left\|\bm{\theta}_{K+1}-\bm{\theta}\right\|_{\mathsf{H}_{K}}^{2}\right].
\end{split}
\end{equation}

Since there exists $\overline{\mathsf{H}}_{k}\in\mathbb{S}_{++}^{d}$ such that $\mathsf{H}_{k}=\overline{\mathsf{H}}_{k}^{2}$, we have $\|\bm{x}\|_{\mathsf{H}_{k}}^{2}=\|\overline{\mathsf{H}}_{k}\bm{x}\|^{2}$ for all $\bm{x}\in\mathbb{R}^{d}$. Accordingly, we have

\[
\tilde{\Gamma}_{K}=\mathbb{E}\left[\sum_{k=2}^{K}\left\{\tilde{\gamma}_{k}\left\|\overline{\mathsf{H}}_{k}(\bm{\theta}_{k}-\bm{\theta})\right\|^{2}-\tilde{\gamma}_{k-1}\left\|\overline{\mathsf{H}}_{k-1}(\bm{\theta}_{k}-\bm{\theta})\right\|^{2}\right\}\right].
\]

From $\overline{\mathsf{H}}_{k}=\mathsf{diag}(\sqrt{h_{k,i}})$, we have that, for all $\bm{x}=(x_{i})_{i=1}^{d}\in\mathbb{R}^{d}$, $\|\overline{\mathsf{H}}_{k}\bm{x}\|^{2}=\sum_{i=1}^{d}h_{k,i}x_{i}^{2}$. Hence, for all $K\geq 2$,

\begin{equation}\tag{A.6}
\tilde{\Gamma}_{K}=\mathbb{E}\left[\sum_{k=2}^{K}\sum_{i=1}^{d}\left(\tilde{\gamma}_{k}h_{k,i}-\tilde{\gamma}_{k-1}h_{k-1,i}\right)(\theta_{k,i}-\theta_{i})^{2}\right].
\end{equation}

From (A1), we have that, for all $k\geq 1$ and all $i\in[d]$,

\[
\tilde{\gamma}_{k}h_{k,i}-\tilde{\gamma}_{k-1}h_{k-1,i}\geq 0.
\]

Moreover, from (L2), $\max_{i\in[d]}\sup_{k\in\mathbb{N}}(\theta_{k,i}-\theta_{i})^{2}\leq\mathrm{Dist}(\bm{\theta})$. Accordingly, for all $K\geq 2$,

\[
\tilde{\Gamma}_{K}\leq\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{k=2}^{K}\sum_{i=1}^{d}\left(\tilde{\gamma}_{k}h_{k,i}-\tilde{\gamma}_{k-1}h_{k-1,i}\right)\right]=\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}\left(\tilde{\gamma}_{K}h_{K,i}-\tilde{\gamma}_{1}h_{1,i}\right)\right].
\]

Therefore, (A.5), $\tilde{\gamma}_{1}\mathbb{E}[\|\bm{\theta}_{1}-\bm{\theta}\|_{\mathsf{H}_{1}}^{2}]\leq\mathrm{Dist}(\bm{\theta})\tilde{\gamma}_{1}\mathbb{E}[\sum_{i=1}^{d}h_{1,i}]$, and (A2) imply that, for all $K\geq 1$,

\begin{align*}
\Gamma_{K}
&\leq\tilde{\gamma}_{1}\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}h_{1,i}\right]+\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}\left(\tilde{\gamma}_{K}h_{K,i}-\tilde{\gamma}_{1}h_{1,i}\right)\right]\\
&=\tilde{\gamma}_{K}\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}h_{K,i}\right]\\
&\leq\tilde{\gamma}_{K}\mathrm{Dist}(\bm{\theta})\sum_{i=1}^{d}H_{i}\\
&\leq\tilde{\gamma}_{K}\mathrm{Dist}(\bm{\theta})dH,
\end{align*}

where $H=\max_{i\in[d]}H_{i}$. From $\tilde{\gamma}_{K}=1-\gamma^{K+1}\leq 1$, we have that

\begin{equation}\tag{A.7}
\frac{1}{2\alpha\tilde{\beta}}\Gamma_{K}\leq\frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}}.
\end{equation}

Lemma A.3 implies that, for all $K\geq 1$,

\[
A_{K}:=\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right]\leq\sum_{k=1}^{K}\frac{\tilde{\gamma}_{k}}{(1-\gamma)^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right),
\]

which, together with $\tilde{\gamma}_{k}\leq 1$, implies that

\begin{equation}\tag{A.8}
\frac{\alpha}{2\tilde{\beta}}A_{K}\leq\frac{\alpha K}{2\tilde{\beta}(1-\gamma)^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right).
\end{equation}

Inequality (A.4) in the proof of Lemma A.3, together with Jensen's inequality, ensures that, for all $k\in\mathbb{N}$,

\[
\mathbb{E}\left[\|\bm{m}_{k}\|\right]\leq\sqrt{\frac{\sigma^{2}}{b}+P^{2}}.
\]

The Cauchy–Schwarz inequality and (L2) guarantee that, for all $K\geq 1$,

\begin{equation}\tag{A.9}
\begin{split}
\frac{\beta}{\tilde{\beta}}B_{K}&:=\frac{\beta}{\tilde{\beta}}\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]\leq\frac{\beta}{\tilde{\beta}}\sqrt{d\mathrm{Dist}(\bm{\theta})}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\bm{m}_{k-1}\right\|\right]\\
&\leq\frac{\beta\sqrt{d\mathrm{Dist}(\bm{\theta})}}{\tilde{\beta}}K\sqrt{\frac{\sigma^{2}}{b}+P^{2}}.
\end{split}
\end{equation}

Therefore, (A.2), (A.7), (A.8), and (A.9) lead to the assertion in Theorem A.1. This completes the proof. ∎
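To see how the bound of Theorem A.1 behaves as a function of $K$ and $b$, the following Python sketch evaluates its right-hand side for hypothetical values of $\alpha$, $\beta$, $\gamma$, $d$, $\mathrm{Dist}(\bm{\theta})$, $H$, $h_{0}^{*}$, $\sigma^{2}$, and $P^{2}$ (placeholders only, not values from the paper): the first term decays with $K$, the $\sigma^{2}/b$ contributions decay with $b$, and the remaining parts form the $K$- and $b$-independent floor that reappears as $C+D$ in the proof of Theorem 3.1.

\begin{verbatim}
import numpy as np

# hypothetical constants (placeholders for illustration only)
alpha, beta, gamma = 1e-3, 0.9, 0.9
d, dist, H, h0, sigma2, P2 = 1000, 1.0, 10.0, 1e-3, 4.0, 1.0
beta_t, gamma_t = 1.0 - beta, 1.0 - gamma       # beta~ and gamma~

def rhs_theorem_a1(K, b):
    """Right-hand side of the bound in Theorem A.1."""
    t1 = d * dist * H / (2.0 * alpha * beta_t * K)
    t2 = alpha / (2.0 * beta_t * gamma_t**2 * h0) * (sigma2 / b + P2)
    t3 = beta * np.sqrt(d * dist) / beta_t * np.sqrt(sigma2 / b + P2)
    return t1 + t2 + t3

for b in (1, 16, 256, 4096):
    vals = ", ".join(f"{rhs_theorem_a1(K, b):.1f}" for K in (1e2, 1e4, 1e6))
    print(f"b = {b:5d}: bound at K = 1e2, 1e4, 1e6 -> {vals}")
\end{verbatim}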

Theorem A.2.

Suppose that (S1)–(S3), (A1), (L1), and (U1)–(U3) hold and consider Algorithm 1. Then, for all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta}^{*})\geq\frac{X(\bm{\theta}^{*})}{2\alpha\tilde{\beta}K}+\frac{\sigma^{2}(c_{1}\alpha\tilde{\gamma}-2c_{2}\beta)}{2\tilde{\beta}b}-\frac{c_{2}\beta P^{2}}{\tilde{\beta}},
\]

where $X(\bm{\theta}^{*})$, $c_{1}$, and $c_{2}$ are defined by (U1)–(U3), and $\sigma^{2}$ is defined by (S2)(2.5).

Proof.

Condition (A1), together with (A.5) and (A.6), implies that $\tilde{\Gamma}_{K}\geq 0$ for all $K\geq 1$. Hence, from (A.5) and $1\geq\tilde{\gamma}_{k}\geq\tilde{\gamma}_{k-1}$ ($k\geq 1$),

\begin{align*}
\Gamma_{K}
&\geq\tilde{\gamma}_{1}\mathbb{E}\left[\left\|\bm{\theta}_{1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{1}}^{2}\right]-\tilde{\gamma}_{K}\mathbb{E}\left[\left\|\bm{\theta}_{K+1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{K}}^{2}\right]\\
&\geq(1-\gamma)\mathbb{E}\left[\left\|\bm{\theta}_{1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{1}}^{2}\right]-\mathbb{E}\left[\left\|\bm{\theta}_{K+1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{K}}^{2}\right],
\end{align*}

which, together with (U3), implies that, for all $K\geq 1$,

\[
\Gamma_{K}\geq X(\bm{\theta}^{*}).
\]

Moreover, (U1) and (U2) imply that, for all $K\geq 1$,

\begin{align*}
A_{K}&=\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right]\geq\frac{c_{1}\tilde{\gamma}\sigma^{2}}{b}K\quad\text{and}\\
B_{K}&=\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]\geq-c_{2}\left(\frac{\sigma^{2}}{b}+P^{2}\right)K.
\end{align*}

Accordingly, Lemma A.2 leads to the assertion in Theorem A.2. This completes the proof. ∎

A.2. Proof of Theorem 3.1

Proof.

Theorem A.1, together with $\sqrt{\sigma^{2}/b+P^{2}}\leq\sqrt{\sigma^{2}+P^{2}}$ (which holds since $b\geq 1$), guarantees that, for all $\bm{\theta}\in\mathcal{S}$ and all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta})\leq\underbrace{\frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}}}_{A(\alpha,\beta,\bm{\theta},d,H)}\frac{1}{K}+\underbrace{\frac{\sigma^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}}}_{B(\alpha,\beta,\gamma,\sigma^{2},h_{0}^{*})}\frac{1}{b}+\underbrace{\frac{P^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}}}_{C(\alpha,\beta,\gamma,P^{2},h_{0}^{*})}+\underbrace{\frac{\sqrt{d\mathrm{Dist}(\bm{\theta})(\sigma^{2}+P^{2})}\beta}{\tilde{\beta}}}_{D(\beta,\bm{\theta},d,\sigma^{2},P^{2})}.
\]

If the right-hand side of the above inequality is less than or equal to $\epsilon^{2}$, then we have that

\[
\frac{A}{K}+\frac{B}{b}+C+D\leq\epsilon^{2},
\]

which implies that

\[
\underline{K}(b):=\frac{Ab}{\{\epsilon^{2}-(C+D)\}b-B}\leq K,
\]

where $b>B/\{\epsilon^{2}-(C+D)\}\geq 0$. We have that, for $b>B/\{\epsilon^{2}-(C+D)\}$,

\begin{align*}
&\underline{K}^{\star}:=\frac{A}{\epsilon^{2}-(C+D)}<\underline{K}(b),\\
&\frac{\mathrm{d}\underline{K}(b)}{\mathrm{d}b}=\frac{-AB}{[\{\epsilon^{2}-(C+D)\}b-B]^{2}}\leq 0,\\
&\frac{\mathrm{d}^{2}\underline{K}(b)}{\mathrm{d}b^{2}}=\frac{2AB\{\epsilon^{2}-(C+D)\}}{[\{\epsilon^{2}-(C+D)\}b-B]^{3}}\geq 0.
\end{align*}

Hence, $\underline{K}$ is monotone decreasing and convex for $b>B/\{\epsilon^{2}-(C+D)\}$. This completes the proof. ∎
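As a sanity check on the monotonicity and convexity just derived, the following Python sketch evaluates $\underline{K}(b)$ on a grid of batch sizes for hypothetical $A$, $B$, $C$, $D$, and $\epsilon$ (placeholders, not values estimated in the paper) and verifies that finite-difference approximations of its first and second derivatives have the predicted signs.

\begin{verbatim}
import numpy as np

# hypothetical constants (placeholders, not taken from the paper)
A, B, C, D, eps = 1.0e4, 2.0e2, 1.0e-3, 2.0e-3, 0.5
c = eps**2 - (C + D)          # must be positive for K_lower to be defined
b_min = B / c                 # K_lower(b) is defined for b > B / c

def K_lower(b):
    return A * b / (c * b - B)

b = np.linspace(1.05 * b_min, 50 * b_min, 2000)
K = K_lower(b)
dK = np.gradient(K, b)        # finite-difference first derivative
d2K = np.gradient(dK, b)      # finite-difference second derivative

print("monotone decreasing:", np.all(dK <= 0))
print("convex (d2K >= 0)  :", np.all(d2K[1:-1] >= 0))
print("K_lower_star = A/(eps^2-(C+D)):", A / c)
print("K_lower at largest grid point :", K[-1])
\end{verbatim}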

A.3. Proof of Theorem 3.2

Proof.

We have that, for $b>B/\{\epsilon^{2}-(C+D)\}$,

\[
\underline{K}(b)b=\frac{Ab^{2}}{\{\epsilon^{2}-(C+D)\}b-B}.
\]

Hence,

\begin{align*}
&\frac{\mathrm{d}\underline{K}(b)b}{\mathrm{d}b}=\frac{Ab[\{\epsilon^{2}-(C+D)\}b-2B]}{[\{\epsilon^{2}-(C+D)\}b-B]^{2}},\\
&\frac{\mathrm{d}^{2}\underline{K}(b)b}{\mathrm{d}b^{2}}=\frac{2AB^{2}}{[\{\epsilon^{2}-(C+D)\}b-B]^{3}}\geq 0,
\end{align*}

which implies that $\underline{K}(b)b$ is convex for $b>B/\{\epsilon^{2}-(C+D)\}$ and

\[
\frac{\mathrm{d}\underline{K}(b)b}{\mathrm{d}b}\begin{cases}<0&\text{ if }b<b_{*},\\ =0&\text{ if }b=b_{*}=\dfrac{2B}{\epsilon^{2}-(C+D)},\\ >0&\text{ if }b>b_{*}.\end{cases}
\]

The point $b_{*}$ attains the minimum value $\underline{K}(b_{*})b_{*}$ of $\underline{K}(b)b$. This completes the proof. ∎
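The closed-form critical batch size $b_{*}=2B/\{\epsilon^{2}-(C+D)\}$ can be confirmed numerically: using the same hypothetical constants as in the previous sketch, the grid minimizer of the lower-bound SFO complexity $\underline{K}(b)b$ agrees with $b_{*}$.

\begin{verbatim}
import numpy as np

# hypothetical constants (same placeholders as in the previous sketch)
A, B, C, D, eps = 1.0e4, 2.0e2, 1.0e-3, 2.0e-3, 0.5
c = eps**2 - (C + D)

def sfo_lower(b):
    # K_lower(b) * b, the lower-bound SFO complexity
    return A * b**2 / (c * b - B)

b_star = 2 * B / c                               # closed-form b_* (Theorem 3.2)
b = np.linspace(1.05 * B / c, 20 * B / c, 200001)
b_grid = b[np.argmin(sfo_lower(b))]

print("closed-form b_* :", b_star)
print("grid minimizer  :", b_grid)
print("relative error  :", abs(b_grid - b_star) / b_star)
\end{verbatim}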

A.4. Proof of Theorem 3.3

Proof.

Theorem A.2 guarantees that, for all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta}^{*})\geq\underbrace{\frac{X(\bm{\theta}^{*})}{2\alpha\tilde{\beta}}}_{E(\alpha,\beta,\bm{\theta}^{*})}\frac{1}{K}+\underbrace{\frac{\sigma^{2}(c_{1}\alpha\tilde{\gamma}-2c_{2}\beta)}{2\tilde{\beta}}}_{F(\alpha,\beta,\gamma,\sigma^{2},c_{1},c_{2})}\frac{1}{b}-\underbrace{\frac{c_{2}\beta P^{2}}{\tilde{\beta}}}_{G(\beta,c_{2},P^{2})}.
\]

If the right-hand side of the above inequality is greater than or equal to $\delta\epsilon^{2}$, then we have that

\[
\frac{E}{K}+\frac{F}{b}-G\geq\delta\epsilon^{2},
\]

which implies that

\[
\overline{K}(b):=\frac{Eb}{(\delta\epsilon^{2}+G)b-F}\geq K,
\]

where $b>F/(\delta\epsilon^{2}+G)\geq 0$. We have that, for $b>F/(\delta\epsilon^{2}+G)$,

\begin{align*}
&\overline{K}^{\star}:=\frac{E}{\delta\epsilon^{2}+G}<\overline{K}(b),\\
&\frac{\mathrm{d}\overline{K}(b)}{\mathrm{d}b}=\frac{-EF}{\{(\delta\epsilon^{2}+G)b-F\}^{2}}\leq 0,\\
&\frac{\mathrm{d}^{2}\overline{K}(b)}{\mathrm{d}b^{2}}=\frac{2EF(\delta\epsilon^{2}+G)}{\{(\delta\epsilon^{2}+G)b-F\}^{3}}\geq 0.
\end{align*}

Hence, $\overline{K}$ is monotone decreasing and convex for $b>F/(\delta\epsilon^{2}+G)$. This completes the proof. ∎

A.5. Proof of Theorem 3.4

Proof.

We have that, for $b>F/(\delta\epsilon^{2}+G)$,

\[
\overline{K}(b)b=\frac{Eb^{2}}{(\delta\epsilon^{2}+G)b-F}.
\]

Hence,

\begin{align*}
&\frac{\mathrm{d}\overline{K}(b)b}{\mathrm{d}b}=\frac{Eb\{(\delta\epsilon^{2}+G)b-2F\}}{\{(\delta\epsilon^{2}+G)b-F\}^{2}},\\
&\frac{\mathrm{d}^{2}\overline{K}(b)b}{\mathrm{d}b^{2}}=\frac{2EF^{2}}{\{(\delta\epsilon^{2}+G)b-F\}^{3}}\geq 0,
\end{align*}

which implies that $\overline{K}(b)b$ is convex for $b>F/(\delta\epsilon^{2}+G)$ and

\[
\frac{\mathrm{d}\overline{K}(b)b}{\mathrm{d}b}\begin{cases}<0&\text{ if }b<b^{*},\\ =0&\text{ if }b=b^{*}=\dfrac{2F}{\delta\epsilon^{2}+G},\\ >0&\text{ if }b>b^{*}.\end{cases}
\]

The point $b^{*}$ attains the minimum value $\overline{K}(b^{*})b^{*}$ of $\overline{K}(b)b$. This completes the proof. ∎
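The same check applies to the upper bound: for hypothetical $E$, $F$, $G$, $\delta$, and $\epsilon$ (placeholders, not values from the paper), the grid minimizer of the upper-bound SFO complexity $\overline{K}(b)b$ agrees with the closed form $b^{*}=2F/(\delta\epsilon^{2}+G)$ obtained above.

\begin{verbatim}
import numpy as np

# hypothetical constants (placeholders, not values from the paper)
E, F, G, delta, eps = 5.0e3, 1.0e2, 1.0e-3, 0.5, 0.5
c = delta * eps**2 + G

def sfo_upper(b):
    # K_upper(b) * b, the upper-bound SFO complexity
    return E * b**2 / (c * b - F)

b_star = 2 * F / c                               # closed-form b^* (Theorem 3.4)
b = np.linspace(1.05 * F / c, 20 * F / c, 200001)

print("closed-form b^* :", b_star)
print("grid minimizer  :", b[np.argmin(sfo_upper(b))])
\end{verbatim}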