
Minimization of Stochastic First-order Oracle Complexity of Adaptive Methods for Nonconvex Optimization

Hideaki Iiduka Department of Computer Science Meiji University 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571 Japan [email protected]
Abstract.

Numerical evaluations have definitively shown that, for deep learning optimizers such as stochastic gradient descent, momentum, and adaptive methods, the number of steps needed to train a deep neural network halves for each doubling of the batch size and that there is a region of diminishing returns beyond the critical batch size. In this paper, we determine the actual critical batch size by using the global minimizer of the stochastic first-order oracle (SFO) complexity of the optimizer. To prove the existence of the actual critical batch size, we derive lower and upper bounds on the SFO complexity and prove that there exist critical batch sizes in the sense of minimizing these bounds. This proof implies that, if the SFO complexity fits the lower and upper bounds, then the existence of these critical batch sizes demonstrates the existence of the actual critical batch size. We also discuss the conditions needed for the SFO complexity to fit the lower and upper bounds and provide numerical results that support our theoretical results.

Key words and phrases:
adaptive method, batch size, nonconvex optimization, stochastic first-order oracle complexity

1. Introduction

1.1. Background

Adaptive methods are powerful deep learning optimizers used to find the model parameters of deep neural networks that minimize the expected risk and empirical risk loss functions [2, Section 4]. Specific adaptive methods are, for example, adaptive gradient (AdaGrad) [5], root mean square propagation (RMSProp) [26], adaptive moment estimation (Adam) [11], and adaptive mean square gradient (AMSGrad) [19] (Table 2 in [22] lists the main deep learning optimizers).

Deep learning optimizers have been widely studied from the viewpoints of both theory and practice. While the convergence and convergence rate of deep learning optimizers have been theoretically studied for convex optimization [31, 11, 19, 13, 15], theoretical investigation of deep learning optimizers for nonconvex optimization is needed so that such optimizers can be put into practical use in deep learning [28, 1, 27].

The convergence of adaptive methods has been studied for nonconvex optimization [7, 4, 30, 10], and the convergence of stochastic gradient descent (SGD) methods has been studied for nonconvex optimization [8, 3, 21, 12]. Chen et al. showed that generalized Adam, which includes the heavy-ball method, AdaGrad, RMSProp, AMSGrad, and AdaGrad with first-order momentum (AdaFom), has an \mathcal{O}(\log K/\sqrt{K}) convergence rate when a decaying learning rate (\alpha_k = 1/\sqrt{k}) is used [4]. AdaBelief (which adapts the step size in accordance with the belief in the observed gradients) using \alpha_k = 1/\sqrt{k} has an \mathcal{O}(\log K/\sqrt{K}) convergence rate [30]. A method that unifies adaptive methods such as AMSGrad and AdaBelief has been shown to have a convergence rate of \mathcal{O}(1/\sqrt{K}) when \alpha_k = 1/\sqrt{k} [10], which improves on the results of [4, 30].

Figure 1. Relationship between lower bound \underline{K}(b) and upper bound \overline{K}(b) (b^{\star} denotes critical batch size).
Figure 2. Relationship between \underline{K}(b)b and \overline{K}(b)b (b_{*} and b^{*} are global minimizers of \underline{K}(b)b and \overline{K}(b)b, which always exist).

There have been several practical studies of deep learning optimizers. Shallue et al. [23] studied how increasing the batch size affects the performances of SGD, SGD with momentum [18, 20], and Nesterov momentum [17, 25]. Zhang et al. [29] studied the relationship between batch size and performance for Adam and K-FAC (Kronecker-factored approximate curvature [14]). In both studies, it was numerically shown that increasing the batch size tends to decrease the number of steps needed for training deep neural networks, but with diminishing returns. Moreover, it was shown that SGD with momentum and Nesterov momentum can exploit larger batches than SGD [23] and that K-FAC and Adam can exploit larger batches than SGD with momentum [29]. Thus, it has been shown that momentum and adaptive methods can greatly reduce the number of steps needed for training deep neural networks [23, Figure 4], [29, Figure 5]. Smith et al. [24] numerically showed that using enormous batches leads to reductions in the number of parameter updates and model training time.

1.2. Motivation

As described in Section 1.1, there have been theoretical studies of the relationship between the learning rate and convergence of deep learning optimizers. Furthermore, as also described in Section 1.1, the practical performance of a deep learning optimizer strongly depends on the batch size. Our goal in this paper is to clarify theoretically the relationship between batch size and the performance of deep learning optimizers. We focus on the relationship between the batch size and the number of steps needed for nonconvex optimization of several adaptive methods.

To explicitly show our contribution, we characterize the diminishing returns reported in [23, 29] by using the stochastic first-order oracle (SFO) complexities of deep learning optimizers. We define the SFO complexity of a deep learning optimizer as Kb on the basis of the number of steps K needed for training the deep neural network and the batch size b used in the optimizer. Let b^{\star} be a critical batch size such that there are diminishing returns for all batch sizes beyond b^{\star}, as asserted in [23, 29]. Thanks to the numerical evaluations in [23, 29], we know that there is a positive number C such that

(1.1) Kb \approx 2^{C} \text{ for } b \leq b^{\star} \quad \text{and} \quad Kb \geq 2^{C} \text{ for } b \geq b^{\star},

where K and b are defined for i, j \in \mathbb{N} by K = 2^{i} and b = 2^{j}. For example, Figure 5(a) in [23] shows C \approx 16 and b^{\star} \approx 2^{8} for the momentum method used to train a convolutional neural network on the MNIST dataset. This implies that, while the SFO complexity Kb initially remains almost unchanged (i.e., K halves for each doubling of b), Kb is minimized at the critical batch size b^{\star}, and there are diminishing returns once the batch size exceeds b^{\star}. Therefore, there are diminishing returns beyond the critical batch size b^{\star} that minimizes the SFO complexity.

1.3. Goal

The goal of this paper is to develop a theory demonstrating the existence of critical batch sizes, which was shown numerically by Shallue et al. and Zhang et al. [23, 29].

To reach this goal, we first examine the relationship between the number of steps K needed for nonconvex optimization (Section 2) and the batch size b used for each adaptive method. We show that, under certain assumptions, the lower bound \underline{K} and the upper bound \overline{K} of K are rational functions of the batch size b; i.e., \underline{K} and \overline{K} have the following forms (see Section 3 for details):

(1.2) \underline{K}(b) := \frac{Ab}{\{\epsilon^{2}-(C+D)\}b-B} \leq K \leq \overline{K}(b) := \frac{Eb}{(\delta\epsilon^{2}+G)b-F},

where A, B, C, D, E, F, and G are positive constants depending on the parameters (e.g., the constant learning rate \alpha and the momentum coefficient \beta) used in each adaptive method, and \epsilon, \delta > 0 are the precisions for nonconvex optimization. \underline{K} and \overline{K} defined by (1.2) are monotone decreasing and convex with respect to the batch size b. Accordingly, \underline{K} and \overline{K} decrease as the batch size b increases. For minimizing the empirical average loss function for a given number of samples n (see (2.2)), \underline{K} and \overline{K} are minimized at b = n. Moreover, \underline{K} and \overline{K} have the asymptotic values \underline{K}^{\star} = A/\{\epsilon^{2}-(C+D)\} and \overline{K}^{\star} = E/(\delta\epsilon^{2}+G).

Figure 1 visualizes the relationship between \underline{K}(b) and \overline{K}(b). The number of steps K needed for nonconvex optimization is inversely proportional to the batch size for batch sizes smaller than the critical batch size b^{\star}, and there are diminishing returns beyond the critical batch size b^{\star}, as shown by the numerical results of Shallue et al. and Zhang et al. [23, 29]. Moreover, the critical batch size b^{\star} minimizes the SFO complexity Kb, as seen in (1.1). We thus define the critical batch size by using the global minimizer of the SFO complexity K(b)b.

From the forms of \underline{K}(b) and \overline{K}(b) in (1.2), we have the forms of \underline{K}(b)b and \overline{K}(b)b:

(1.3) \underline{K}(b)b = \frac{Ab^{2}}{\{\epsilon^{2}-(C+D)\}b-B} \leq Kb \leq \overline{K}(b)b = \frac{Eb^{2}}{(\delta\epsilon^{2}+G)b-F}.

We can show that \underline{K}(b)b and \overline{K}(b)b are convex; i.e., there are global minimizers, denoted by b_{*} and b^{*}, of \underline{K}(b)b and \overline{K}(b)b, as seen in Figure 2 (see Section 3 for details). Therefore, we can regard the batch sizes b_{*} and b^{*} as the critical batch sizes in the sense of minimizing \underline{K}(b)b and \overline{K}(b)b defined by (1.3). Suppose that \underline{K}(b)b and \overline{K}(b)b defined by (1.3) approximate the actual SFO complexity Kb; i.e.,

\frac{2B}{\epsilon^{2}-(C+D)} = b_{*} \approx b^{*} = \frac{2F}{\delta\epsilon^{2}+G} \quad \text{and} \quad \frac{4AB}{\{\epsilon^{2}-(C+D)\}^{2}} = \underline{K}(b_{*})b_{*} \approx \overline{K}(b^{*})b^{*} = \frac{4EF}{(\delta\epsilon^{2}+G)^{2}},

which implies that

(1.4) \underbrace{(C+D)F}_{\approx(M_{1}\alpha+M_{2})(M_{3}\alpha-M_{4})} + \underbrace{BG}_{\approx M_{5}\alpha} \approx (F-\delta B)\epsilon^{2} \quad \text{and} \quad \underbrace{(C+D)}_{\approx M_{1}\alpha+M_{2}}\sqrt{EF} + G\sqrt{AB} \approx (\sqrt{EF}-\delta\sqrt{AB})\epsilon^{2},

where M_{i} (i = 1, 2, 3, 4, 5) are positive constants that do not depend on the constant learning rate \alpha (see Section 3 for details). Hence, if the precisions \epsilon and \delta are small, a sufficiently small learning rate \alpha can be used so that (1.3) approximates K(b)b (indeed, small learning rates were used in practice by Shallue et al. [23]). Then, demonstrating the existence of the critical batch sizes b_{*} and b^{*} establishes the existence of the actual critical batch size b^{\star} \approx b_{*} \approx b^{*}.
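As a quick illustration of this shape, the following minimal Python sketch evaluates the lower bound in (1.2) and the corresponding SFO complexity in (1.3) for placeholder constants (A, B, C, D, and the precision eps below are illustrative values, not constants derived in this paper) and checks numerically that \underline{K}(b)b is minimized at b_{*} = 2B/\{\epsilon^{2}-(C+D)\}.

import numpy as np

# Illustrative placeholder constants; they are not values derived in the paper.
A, B, C, D, eps = 1.0e3, 5.0e2, 1.0e-3, 1.0e-3, 0.5

def K_lower(b):
    # Lower bound (1.2): K_lower(b) = A b / ({eps^2 - (C + D)} b - B).
    return A * b / ((eps**2 - (C + D)) * b - B)

# Batch sizes above the pole, i.e., b > B / {eps^2 - (C + D)}.
b = np.arange(np.ceil(B / (eps**2 - (C + D))) + 1, 2**16, dtype=float)
sfo = K_lower(b) * b                      # lower bound of the SFO complexity K(b) b

b_star = 2 * B / (eps**2 - (C + D))       # closed-form minimizer (Theorem 3.2)
print(b_star, b[np.argmin(sfo)])          # the numerical minimizer agrees with b_star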

2. Nonconvex Optimization and Deep Learning Optimizers

2.1. Nonconvex optimization

Let \mathbb{R}^{d} be a d-dimensional Euclidean space with inner product \langle\bm{x},\bm{y}\rangle := \bm{x}^{\top}\bm{y} inducing the norm \|\bm{x}\|, and let \mathbb{N} be the set of nonnegative integers. Let \mathbb{S}_{++}^{d} be the set of d \times d symmetric positive-definite matrices, and let \mathbb{D}^{d} be the set of d \times d diagonal matrices: \mathbb{D}^{d} = \{M \in \mathbb{R}^{d \times d} \colon M = \mathsf{diag}(x_{i}),\ x_{i} \in \mathbb{R}\ (i \in [d] := \{1, 2, \ldots, d\})\}.

The mathematical model used here is based on that of Shallue et al. [23]. Given a parameter \bm{\theta} in a set \mathcal{S} \subset \mathbb{R}^{d} and a data point z in a data domain Z, a machine learning model provides a prediction whose quality is estimated using a differentiable nonconvex loss function \ell(\bm{\theta}; z). We minimize the expected loss defined for all \bm{\theta} \in \mathcal{S} by

(2.1) L(\bm{\theta}) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(\bm{\theta}; z)] = \mathbb{E}[\ell_{\xi}(\bm{\theta})],

where \mathcal{D} is the probability distribution over Z, \xi denotes a random variable with distribution function \mathrm{P}, and \mathbb{E}[\cdot] denotes the expectation taken with respect to \xi. A particularly interesting example of (2.1) is the empirical average loss defined for all \bm{\theta} \in \mathcal{S}:

(2.2) L(\bm{\theta}; S) = \frac{1}{n}\sum_{i \in [n]} \ell(\bm{\theta}; z_{i}) = \frac{1}{n}\sum_{i \in [n]} \ell_{i}(\bm{\theta}),

where S = (z_{1}, z_{2}, \ldots, z_{n}) denotes the training set, \ell_{i}(\cdot) := \ell(\cdot; z_{i}) denotes the loss function corresponding to the i-th training data instance z_{i}, and [n] := \{1, 2, \ldots, n\}. Our main objective is to find a local minimizer of L over \mathcal{S}, i.e., a point \bm{\theta}^{\star} \in \mathcal{S} that belongs to the set

(2.3) \mathrm{VI}(\mathcal{S}, \nabla L) := \{\bm{\theta}^{\star} \in \mathcal{S} \colon (\bm{\theta}^{\star} - \bm{\theta})^{\top}\nabla L(\bm{\theta}^{\star}) \leq 0\ (\bm{\theta} \in \mathcal{S})\},

where \nabla L \colon \mathbb{R}^{d} \to \mathbb{R}^{d} denotes the gradient of L. The inequality (\bm{\theta}^{\star} - \bm{\theta})^{\top}\nabla L(\bm{\theta}^{\star}) \leq 0 is called the variational inequality of \nabla L over \mathcal{S} [6].

2.2. Deep learning optimizers

We assume that there exists an SFO such that, for a given \bm{\theta} \in \mathcal{S}, it returns a stochastic gradient \mathsf{G}_{\xi}(\bm{\theta}) of the function L defined by (2.1), where the random variable \xi is supported on \Xi and does not depend on \bm{\theta}. Throughout this paper, we assume three standard conditions:

  1. (S1) L \colon \mathbb{R}^{d} \to \mathbb{R} defined by (2.1) is continuously differentiable.

  2. (S2) Let (\bm{\theta}_{k})_{k \in \mathbb{N}} \subset \mathcal{S} be the sequence generated by a deep learning optimizer. For each iteration k,

    (2.4) \mathbb{E}_{\xi_{k}}[\mathsf{G}_{\xi_{k}}(\bm{\theta}_{k})] = \nabla L(\bm{\theta}_{k}),

    where \xi_{0}, \xi_{1}, \ldots are independent samples, and the random variable \xi_{k} is independent of (\bm{\theta}_{l})_{l=0}^{k}. There exists a nonnegative constant \sigma^{2} such that

    (2.5) \mathbb{E}_{\xi_{k}}\left[\left\|\mathsf{G}_{\xi_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\right\|^{2}\right] \leq \sigma^{2}.

  3. (S3) For each iteration k, the deep learning optimizer samples a batch B_{k} of size b and estimates the full gradient \nabla L as

    \nabla L_{B_{k}}(\bm{\theta}_{k}) := \frac{1}{b}\sum_{i \in [b]} \mathsf{G}_{\xi_{k,i}}(\bm{\theta}_{k}),

    where \xi_{k,i} is the random variable generated by the i-th sampling in the k-th iteration.

In the case where L is defined by (2.2), we have that, for each k, B_{k} \subset [n] and

\nabla L_{B_{k}}(\bm{\theta}_{k}) = \frac{1}{b}\sum_{i \in B_{k}} \nabla\ell_{i}(\bm{\theta}_{k}).
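As a minimal illustration of this estimator (assuming, for concreteness, the quadratic per-example loss \ell_{i}(\bm{\theta}) = \frac{1}{2}\|\bm{\theta}-z_{i}\|^{2}, which is not the paper's model), the mini-batch gradient can be computed as follows.

import numpy as np

rng = np.random.default_rng(0)

def grad_ell_i(theta, z_i):
    # Gradient of the illustrative per-example loss ell_i(theta) = (1/2)||theta - z_i||^2.
    return theta - z_i

def minibatch_gradient(theta, S, b):
    # nabla L_{B_k}(theta) = (1/b) * sum_{i in B_k} nabla ell_i(theta),
    # where B_k is a batch of b indices sampled from [n].
    n = len(S)
    B_k = rng.choice(n, size=b, replace=False)
    return sum(grad_ell_i(theta, S[i]) for i in B_k) / b

# Example: n = 8 two-dimensional training points and batch size b = 4.
S = rng.normal(size=(8, 2))
print(minibatch_gradient(np.zeros(2), S, b=4))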

We consider the following algorithm (Algorithm 1), which is a unified algorithm for most deep learning optimizers, including Momentum [18], AMSGrad [19], AMSBound [13], Adam [11], and AdaBelief [30], which are listed in Table 1. (Footnote 1: We define \bm{x} \odot \bm{x} for \bm{x} := (x_{i})_{i=1}^{d} \in \mathbb{R}^{d} by \bm{x} \odot \bm{x} := (x_{i}^{2})_{i=1}^{d} \in \mathbb{R}^{d}. \mathrm{C}(\cdot, l, u) \colon \mathbb{R} \to \mathbb{R} in AMSBound (l, u \in \mathbb{R} with l \leq u are given) is defined for all x \in \mathbb{R} by \mathrm{C}(x, l, u) := l if x < l, x if l \leq x \leq u, and u if x > u.)

Algorithm 1 Deep learning optimizer
Require: \alpha \in (0, +\infty), \beta \in [0, 1), \gamma \in [0, 1)
1:  k \leftarrow 0, \bm{\theta}_{0} \in \mathcal{S}, \bm{m}_{-1} := \bm{0}, \mathsf{H}_{0} \in \mathbb{S}_{++}^{d} \cap \mathbb{D}^{d}
2:  loop
3:     \bm{m}_{k} := \beta\bm{m}_{k-1} + (1-\beta)\nabla L_{B_{k}}(\bm{\theta}_{k})
4:     \hat{\bm{m}}_{k} := (1-\gamma^{k+1})^{-1}\bm{m}_{k}
5:     \mathsf{H}_{k} \in \mathbb{S}_{++}^{d} \cap \mathbb{D}^{d} (see Table 1 for examples of \mathsf{H}_{k})
6:     Find \bm{\mathsf{d}}_{k} \in \mathbb{R}^{d} that solves \mathsf{H}_{k}\bm{\mathsf{d}} = -\hat{\bm{m}}_{k}
7:     \bm{\theta}_{k+1} := \bm{\theta}_{k} + \alpha\bm{\mathsf{d}}_{k}
8:     k \leftarrow k+1
9:  end loop
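A minimal Python sketch of Algorithm 1 is given below (this is an illustration, not the code used in Section 4). The mini-batch gradient oracle and the choice of \mathsf{H}_{k} are supplied as callables; returning a vector of ones for the diagonal of \mathsf{H}_{k} recovers SGD (\beta = \gamma = 0) and Momentum (\gamma = 0), while the diagonal choices of Table 1 give the adaptive methods.

import numpy as np

def algorithm1(theta0, minibatch_grad, Hk_diag, alpha, beta, gamma, num_steps):
    # Minimal sketch of Algorithm 1. minibatch_grad(theta, k) should return
    # nabla L_{B_k}(theta); Hk_diag(g, m, k) should return the diagonal entries
    # h_{k,i} > 0 of H_k (see Table 1 for concrete choices).
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)                       # m_{-1} := 0
    for k in range(num_steps):
        g = minibatch_grad(theta, k)               # mini-batch gradient
        m = beta * m + (1.0 - beta) * g            # step 3: m_k
        m_hat = m / (1.0 - gamma ** (k + 1))       # step 4: m_hat_k
        h = Hk_diag(g, m, k)                       # step 5: diagonal of H_k
        d = -m_hat / h                             # step 6: solve H_k d = -m_hat_k
        theta = theta + alpha * d                  # step 7: theta_{k+1}
    return theta

# Momentum (identity H_k, gamma = 0), with the parameters used in Section 4:
# algorithm1(theta0, minibatch_grad, lambda g, m, k: np.ones_like(g),
#            alpha=1e-3, beta=0.9, gamma=0.0, num_steps=1000)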

Let \xi_{k} = (\xi_{k,1}, \xi_{k,2}, \ldots, \xi_{k,b}) be the random samplings in the k-th iteration, and let \xi_{[k]} = (\xi_{0}, \xi_{1}, \ldots, \xi_{k}) be the history of the process \xi_{0}, \xi_{1}, \ldots up to time step k. The adaptive methods listed in Table 1 all satisfy the following conditions:

  1. (A1) \mathsf{H}_{k} := \mathsf{diag}(h_{k,i}) depends on \xi_{[k]}, and h_{k+1,i} \geq h_{k,i} for all k \in \mathbb{N} and all i \in [d];

  2. (A2) For all i \in [d], a positive number H_{i} exists such that \sup_{k \in \mathbb{N}}\mathbb{E}[h_{k,i}] \leq H_{i}.

We define H := \max_{i \in [d]} H_{i}. The optimizers in Table 1 obviously satisfy (A1). Previously reported results [4, p.29], [30, p.18], and [10] show that (\mathsf{H}_{k})_{k \in \mathbb{N}} in Table 1 satisfies (A2). For example, AMSGrad (resp. Adam) satisfies (A2) with H = \sqrt{M} (resp. H = \sqrt{M/(1-\zeta)}), where M := \sup_{k \in \mathbb{N}}\|\nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})\| < +\infty.

Table 1. Examples of \mathsf{H}_{k} \in \mathbb{S}_{++}^{d} \cap \mathbb{D}^{d} (step 5) in Algorithm 1 (\eta, \zeta \in [0, 1))

SGD (\beta = \gamma = 0): \mathsf{H}_{k} is the identity matrix.

Momentum [18] (\gamma = 0): \mathsf{H}_{k} is the identity matrix.

AMSGrad [19] (\gamma = 0):
  \bm{p}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})
  \bm{v}_{k} = \eta\bm{v}_{k-1} + (1-\eta)\bm{p}_{k}
  \hat{\bm{v}}_{k} = (\max\{\hat{v}_{k-1,i}, v_{k,i}\})_{i=1}^{d}
  \mathsf{H}_{k} = \mathsf{diag}(\sqrt{\hat{v}_{k,i}})

AMSBound [13] (\gamma = 0):
  \bm{p}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})
  \bm{v}_{k} = \eta\bm{v}_{k-1} + (1-\eta)\bm{p}_{k}
  \hat{\bm{v}}_{k} = (\max\{\hat{v}_{k-1,i}, v_{k,i}\})_{i=1}^{d}
  \tilde{v}_{k,i} = \mathrm{C}\left(\frac{1}{\sqrt{\hat{v}_{k,i}}}, l_{k}, u_{k}\right)^{-1}
  \mathsf{H}_{k} = \mathsf{diag}(\tilde{v}_{k,i})

Adam [11] (v_{k,i} \leq v_{k+1,i}):
  \bm{p}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) \odot \nabla L_{B_{k}}(\bm{\theta}_{k})
  \bm{v}_{k} = \eta\bm{v}_{k-1} + (1-\eta)\bm{p}_{k}
  \bar{\bm{v}}_{k} = \bm{v}_{k}/(1-\zeta^{k})
  \mathsf{H}_{k} = \mathsf{diag}(\sqrt{\bar{v}_{k,i}})

AdaBelief [30] (s_{k,i} \leq s_{k+1,i}):
  \tilde{\bm{p}}_{k} = \nabla L_{B_{k}}(\bm{\theta}_{k}) - \bm{m}_{k}
  \tilde{\bm{s}}_{k} = \tilde{\bm{p}}_{k} \odot \tilde{\bm{p}}_{k}
  \bm{s}_{k} = \eta\bm{s}_{k-1} + (1-\eta)\tilde{\bm{s}}_{k}
  \hat{\bm{s}}_{k} = \bm{s}_{k}/(1-\zeta^{k})
  \mathsf{H}_{k} = \mathsf{diag}(\sqrt{\hat{s}_{k,i}})
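For concreteness, minimal sketches of the AMSGrad and Adam rows of Table 1 are given below; they can be passed as the hypothetical Hk_diag callable of the Algorithm 1 sketch above. The small constant eps is an implementation detail added here to keep \mathsf{H}_{k} \in \mathbb{S}_{++}^{d}; it is not part of the table, and the bias-correction index assumes the iteration counter starts at k = 0.

import numpy as np

def make_amsgrad_Hk(dim, eta=0.999, eps=1e-8):
    # AMSGrad: v_k = eta v_{k-1} + (1 - eta) g (.) g,  v_hat_k = max(v_hat_{k-1}, v_k),
    # H_k = diag(sqrt(v_hat_{k,i})).
    v = np.zeros(dim)
    v_hat = np.zeros(dim)
    def Hk_diag(g, m, k):
        nonlocal v, v_hat
        v = eta * v + (1.0 - eta) * g * g
        v_hat = np.maximum(v_hat, v)
        return np.sqrt(v_hat) + eps
    return Hk_diag

def make_adam_Hk(dim, eta=0.999, zeta=0.999, eps=1e-8):
    # Adam: v_k = eta v_{k-1} + (1 - eta) g (.) g,  v_bar_k = v_k / (1 - zeta^{k+1}),
    # H_k = diag(sqrt(v_bar_{k,i})).
    v = np.zeros(dim)
    def Hk_diag(g, m, k):
        nonlocal v
        v = eta * v + (1.0 - eta) * g * g
        v_bar = v / (1.0 - zeta ** (k + 1))
        return np.sqrt(v_bar) + eps
    return Hk_diag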

The theoretical performance metric used to approximate a local minimizer in \mathrm{VI}(\mathcal{S}, \nabla L) (see (2.3)) is an \epsilon-approximation of the sequence (\bm{\theta}_{k})_{k \in \mathbb{N}} generated by Algorithm 1 in the sense of the mean value of (\bm{\theta}_{k} - \bm{\theta})^{\top}\nabla L(\bm{\theta}_{k}) (k \in [K]); that is,

(2.6) \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}) := \frac{1}{K}\sum_{k \in [K]} \mathbb{E}\left[(\bm{\theta}_{k} - \bm{\theta})^{\top}\nabla L(\bm{\theta}_{k})\right] \leq \epsilon^{2},

where \epsilon > 0 and \delta \in (0, 1] are precisions and \bm{\theta} \in \mathcal{S}. It suffices to regard the right-hand side of (2.6), \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2}, as the condition under which Algorithm 1 approximates a local minimizer of L. Since the condition \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2} leads to the lower bound of K satisfying \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2}, we also consider \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}), which leads to the upper bound of K.

3. Main Results

Let us suppose that (S1), (S2)(2.4), and (S3) hold. Then, Algorithm 1 satisfies the following equation (see Lemma A.2 for details): for all \bm{\theta} \in \mathcal{S} and all K \geq 1,

(3.1) \mathrm{V}(K, \bm{\theta}) = \frac{1}{2\alpha\tilde{\beta}K}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\left\{\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]-\mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]\right\}}_{\Gamma_{K}} + \frac{\alpha}{2\tilde{\beta}K}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right]}_{A_{K}} + \frac{\beta}{\tilde{\beta}K}\underbrace{\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]}_{B_{K}},

where \tilde{\beta} := 1-\beta, \tilde{\gamma}_{k} := 1-\gamma^{k+1}, and \|\bm{x}\|_{S}^{2} := \bm{x}^{\top}S\bm{x} (S \in \mathbb{S}_{++}^{d}, \bm{x} \in \mathbb{R}^{d}).

3.1. Lower bound of K

We provide the lower bound of K satisfying \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2} under the following conditions:

  1. (L1) There exists a positive number P such that \mathbb{E}[\|\nabla L(\bm{\theta}_{k})\|^{2}] \leq P^{2}, where \bm{\theta}_{k} = (\theta_{k,i}) is the sequence generated by Algorithm 1.

  2. (L2) For all \bm{\theta} = (\theta_{i}) \in \mathcal{S}, there exists a positive number \mathrm{Dist}(\bm{\theta}) such that \max_{i \in [d]}\sup\{(\theta_{k,i} - \theta_{i})^{2} \colon k \in \mathbb{N}\} \leq \mathrm{Dist}(\bm{\theta}).

Condition (L1) was used in previous studies [4, A2] and [30, Theorem 2.2] to analyze deep learning optimizers for nonconvex optimization. Here, we use (L1) to provide the upper bounds of \mathbb{E}[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}] and \mathbb{E}[\|\bm{m}_{k}\|^{2}] (see Lemma A.3 for details); i.e.,

(3.2) \mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \frac{\sigma^{2}}{b} + P^{2} \quad \text{and} \quad \mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right] \leq \frac{\sigma^{2}b^{-1} + P^{2}}{\tilde{\gamma}^{2}h_{0}^{*}},

where (S2)(2.5) and (A1) are assumed, \tilde{\gamma} := 1-\gamma, and h_{0}^{*} := \min_{i \in [d]} h_{0,i}. This implies that A_{K} in (3.1) is bounded above (see Theorems A.1 and 3.1 for details).

Condition (L2) is assumed for both convex and nonconvex optimization, e.g., [16, p.1574], [11, Theorem 4.1], [19, p.2], [13, Theorem 4], and [30, Theorem 2.1]. Here, we use (L2) and (A2) to provide the upper bounds of \Gamma_{K} and B_{K} in (3.1) (see Theorem A.1 for details); i.e.,

\Gamma_{K} \leq d\mathrm{Dist}(\bm{\theta})H \quad \text{and} \quad B_{K} \leq \sqrt{d\mathrm{Dist}(\bm{\theta})(\sigma^{2} + P^{2})}\,K,

where (S2)(2.5) and (L1) are assumed and H := \max_{i \in [d]} H_{i}. Therefore, we have the following theorem (see the Appendix for the detailed proof):

Theorem 3.1.

Suppose that (S1)–(S3), (A1)–(A2), and (L1)–(L2) hold. Then, Algorithm 1 satisfies that, for all \bm{\theta} \in \mathcal{S} and all K \geq 1,

\mathrm{V}(K, \bm{\theta}) \leq \frac{A}{K} + \frac{B}{b} + C + D,

where \tilde{\beta} := 1-\beta, \tilde{\gamma} := 1-\gamma, h_{0}^{*} := \min_{i \in [d]} h_{0,i},

A = A(\alpha, \beta, \bm{\theta}, d, H) = \frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}},
B = B(\alpha, \beta, \gamma, \sigma^{2}, h_{0}^{*}) = \frac{\sigma^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}},
C = C(\alpha, \beta, \gamma, P^{2}, h_{0}^{*}) = \frac{P^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}},
D = D(\beta, \bm{\theta}, d, \sigma^{2}, P^{2}) = \frac{\sqrt{d\mathrm{Dist}(\bm{\theta})(\sigma^{2}+P^{2})}\,\beta}{\tilde{\beta}}.

Moreover, the lower bound \underline{K}_{\epsilon} of the number of steps K_{\epsilon} satisfying \mathrm{V}(K, \bm{\theta}) \leq \epsilon^{2} for Algorithm 1 is such that, for all b > B/\{\epsilon^{2}-(C+D)\} \geq 0,

\underline{K}_{\epsilon}(b) := \frac{Ab}{\{\epsilon^{2}-(C+D)\}b-B} \leq K_{\epsilon}.

The lower bound \underline{K}_{\epsilon} is monotone decreasing and convex:

\underline{K}_{\epsilon}^{\star} := \frac{A}{\epsilon^{2}-(C+D)} < \underline{K}_{\epsilon}(b)

for all b > B/\{\epsilon^{2}-(C+D)\}.
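The expression for \underline{K}_{\epsilon}(b) follows from a direct rearrangement of the bound in Theorem 3.1: whenever \{\epsilon^{2}-(C+D)\}b - B > 0,

\frac{A}{K} + \frac{B}{b} + C + D \leq \epsilon^{2}
\;\Longleftrightarrow\;
\frac{A}{K} \leq \epsilon^{2} - (C+D) - \frac{B}{b} = \frac{\{\epsilon^{2}-(C+D)\}b - B}{b}
\;\Longleftrightarrow\;
K \geq \frac{Ab}{\{\epsilon^{2}-(C+D)\}b - B} = \underline{K}_{\epsilon}(b).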

The minimization of \underline{K}_{\epsilon}(b)b is as follows (the proof of Theorem 3.2 is given in the Appendix):

Theorem 3.2.

Suppose that the assumptions in Theorem 3.1 hold and consider Algorithm 1. Then, there exists

b_{*} := \frac{2B}{\epsilon^{2}-(C+D)}

such that b_{*} minimizes the convex function \underline{K}_{\epsilon}(b)b; i.e., for all b > B/\{\epsilon^{2}-(C+D)\},

\frac{4AB}{\{\epsilon^{2}-(C+D)\}^{2}} = \underline{K}_{\epsilon}(b_{*})b_{*} \leq \underline{K}_{\epsilon}(b)b = \frac{Ab^{2}}{\{\epsilon^{2}-(C+D)\}b-B}.
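For completeness (the full proof is given in the Appendix), the calculus behind Theorem 3.2 amounts to one computation: writing c := \epsilon^{2}-(C+D) > 0 and f(b) := \underline{K}_{\epsilon}(b)b = Ab^{2}/(cb-B),

f'(b) = \frac{2Ab(cb-B) - Acb^{2}}{(cb-B)^{2}} = \frac{Ab(cb-2B)}{(cb-B)^{2}},

so f'(b) < 0 for B/c < b < 2B/c and f'(b) > 0 for b > 2B/c; hence f attains its minimum at b_{*} = 2B/c with f(b_{*}) = 4AB/c^{2}. The same argument with E, F, and \delta\epsilon^{2}+G in place of A, B, and c yields the minimizer b^{*} in Theorem 3.4 below.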

3.2. Upper bound of K

We provide the upper bound of K satisfying \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}) under the following conditions:

  1. (U1) There exists c_{1} \in [0, 1] such that \mathbb{E}[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}] \geq c_{1}\sigma^{2}/b.

  2. (U2) There exist \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L) and c_{2} \geq 0 such that \mathbb{E}[(\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}] \geq -c_{2}(\sigma^{2}/b + P^{2}).

  3. (U3) For \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L), there exists a positive number X(\bm{\theta}^{*}) such that, for all k \geq 1, \tilde{\gamma}\mathbb{E}[\|\bm{\theta}_{1}-\bm{\theta}^{*}\|_{\mathsf{H}_{1}}^{2}] - \mathbb{E}[\|\bm{\theta}_{k+1}-\bm{\theta}^{*}\|_{\mathsf{H}_{k}}^{2}] \geq X(\bm{\theta}^{*}), where \tilde{\gamma} := 1-\gamma.

To justify (U1), we consider the case where \mathsf{H}_{k} is the identity matrix. The definition of \bm{m}_{k} := \beta\bm{m}_{k-1} + (1-\beta)\nabla L_{B_{k}}(\bm{\theta}_{k}) implies that \bm{m}_{k} depends on \nabla L_{B_{k}}(\bm{\theta}_{k}). Here, we assume the following condition, which is stronger than (S2)(2.5): there exists c_{1} \in [0, 1] such that

(3.3) c_{1}\sigma^{2} \leq \mathbb{E}_{\xi_{k}}\left[\left\|\mathsf{G}_{\xi_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\right\|^{2}\right] \leq \sigma^{2}.

Then, (S2)(2.4) and (3.3) ensure that

\mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] \geq \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] \geq \frac{c_{1}\sigma^{2}}{b}.

From \tilde{\gamma}_{k} = 1-\gamma^{k+1} \leq 1, we have that \|\bm{\mathsf{d}}_{k}\|^{2} = \tilde{\gamma}_{k}^{-2}\|\bm{m}_{k}\|^{2} \geq \|\bm{m}_{k}\|^{2} (k \in \mathbb{N}). Accordingly, the lower bound of \|\bm{\mathsf{d}}_{k}\|^{2} depends on c_{1}\sigma^{2}/b.

Algorithm 1 for nonconvex optimization satisfies that there exist positive numbers C_{1} and C_{2} such that, for all \bm{\theta} \in \mathcal{S},

(3.4) \liminf_{k \to +\infty}\mathbb{E}\left[(\bm{\theta}_{k}-\bm{\theta})^{\top}\nabla L(\bm{\theta}_{k})\right] \leq C_{1}\alpha + C_{2}\beta, \quad \mathrm{V}(K, \bm{\theta}) \leq \mathcal{O}\left(\frac{1}{K}\right) + C_{1}\alpha + C_{2}\beta

(see [10, Theorem 1]). The sequence (\bm{\theta}_{k})_{k \in \mathbb{N}} generated by Algorithm 1 can thus approximate a local minimizer \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L). Meanwhile, we have that

(\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} = \frac{1}{2}\left\{\|\bm{\theta}^{*}-\bm{\theta}_{k}\|^{2} + \|\bm{m}_{k-1}\|^{2} - \|(\bm{\theta}^{*}-\bm{\theta}_{k}) - \bm{m}_{k-1}\|^{2}\right\} \geq -\frac{1}{2}\|(\bm{\theta}^{*}-\bm{\theta}_{k}) - \bm{m}_{k-1}\|^{2}.

Accordingly, for a sufficiently large k, the lower bound of (\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} depends on -\mathbb{E}[\|\bm{m}_{k-1}\|^{2}] \geq -(\sigma^{2}/b + P^{2}), where (S2)(2.5) and (L1) are assumed (see (3.2)). Hence, we assume (U2) for Algorithm 1.

Condition (U3) implies that \mathbb{E}[\|\bm{\theta}_{k+1}-\bm{\theta}^{*}\|_{\mathsf{H}_{k}}^{2}] < \mathbb{E}[\|\bm{\theta}_{1}-\bm{\theta}^{*}\|_{\mathsf{H}_{1}}^{2}] < +\infty, which is stronger than (L2), and that \bm{\theta}_{k+1} approximates \bm{\theta}^{*} more appropriately than \bm{\theta}_{1} does. Since the sequence (\bm{\theta}_{k})_{k \in \mathbb{N}} generated by Algorithm 1 can approximate a local minimizer \bm{\theta}^{*} \in \mathrm{VI}(\mathcal{S}, \nabla L) (see (3.4)), (U3) is also adequate for consideration of the upper bound of K.

Theorem 3.3.

Suppose that (S1)–(S3), (A1), (L1), and (U1)–(U3) hold. Then, Algorithm 1 satisfies that, for all K \geq 1,

\mathrm{V}(K, \bm{\theta}^{*}) \geq \frac{E}{K} + \frac{F}{b} - G,

where \tilde{\beta} := 1-\beta, \tilde{\gamma} := 1-\gamma,

E = E(\alpha, \beta, \bm{\theta}^{*}) = \frac{X(\bm{\theta}^{*})}{2\alpha\tilde{\beta}},
F = F(\alpha, \beta, \gamma, \sigma^{2}, c_{1}, c_{2}) = \frac{\sigma^{2}(c_{1}\alpha\tilde{\gamma} - 2c_{2}\beta)}{2\tilde{\beta}},
G = G(\beta, c_{2}, P^{2}) = \frac{c_{2}\beta P^{2}}{\tilde{\beta}}.

Moreover, the upper bound \overline{K}_{\epsilon,\delta} of the number of steps K_{\epsilon,\delta} satisfying \delta\epsilon^{2} \leq \mathrm{V}(K, \bm{\theta}^{*}) for Algorithm 1 is such that, for all b > F/(\delta\epsilon^{2}+G) \geq 0,

\overline{K}_{\epsilon,\delta}(b) := \frac{Eb}{(\delta\epsilon^{2}+G)b-F} \geq K_{\epsilon,\delta}.

The upper bound \overline{K}_{\epsilon,\delta} is monotone decreasing and convex:

\overline{K}_{\epsilon,\delta}^{\star} := \frac{E}{\delta\epsilon^{2}+G} < \overline{K}_{\epsilon,\delta}(b)

for all b > F/(\delta\epsilon^{2}+G).

The minimization of \overline{K}_{\epsilon,\delta}(b)b is as follows (the proof of Theorem 3.4 is given in the Appendix):

Theorem 3.4.

Suppose that the assumptions in Theorem 3.3 hold and consider Algorithm 1. Then, there exists

b^{*} := \frac{2F}{\delta\epsilon^{2}+G}

such that b^{*} minimizes the convex function \overline{K}_{\epsilon,\delta}(b)b; i.e., for all b > F/(\delta\epsilon^{2}+G),

\frac{4EF}{(\delta\epsilon^{2}+G)^{2}} = \overline{K}_{\epsilon,\delta}(b^{*})b^{*} \leq \overline{K}_{\epsilon,\delta}(b)b = \frac{Eb^{2}}{(\delta\epsilon^{2}+G)b-F}.
Figure 3. Number of steps for SGD, Momentum, and Adam versus batch size needed to train ResNet-20 on CIFAR-10. There is an initial period of perfect scaling (indicated by the dashed line) such that the number of steps K needed for L(\bm{\theta}_{K}) \leq 10^{-1} is inversely proportional to the batch size b. However, Adam has critical batch size b^{\star} = 2^{11}.
Figure 4. SFO complexities for SGD, Momentum, and Adam versus batch size needed to train ResNet-20 on CIFAR-10. The SFO complexity of Adam is minimized at the critical batch size b^{\star} = 2^{11}, whereas the SFO complexities for SGD and Momentum tend to increase with the batch size.
Table 2. Elapsed time and training accuracy of SGD when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10 (no entry is given for batch sizes at which SGD did not reach L(\bm{\theta}_{K}) \leq 10^{-1} before the stopping condition; see Section 4)

SGD
Batch Size    2^2       2^3      2^4      2^5      2^6  2^7  2^8  2^9
Time (s)      16983.64  9103.76  6176.19  3759.25  --   --   --   --
Accuracy (%)  96.75     96.69    96.66    96.88    --   --   --   --

Table 3. Elapsed time and training accuracy of Momentum when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10

Momentum
Batch Size    2^2      2^3      2^4      2^5      2^6     2^7     2^8     2^9
Time (s)      7978.90  3837.72  2520.82  1458.70  887.01  678.66  625.10  866.65
Accuracy (%)  96.49    96.79    96.51    96.72    96.70   96.94   96.94   98.34

Table 4. Elapsed time and training accuracy of Adam when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10

Adam
Batch Size    2^2       2^3      2^4      2^5      2^6     2^7     2^8     2^9
Time (s)      10601.78  4405.73  2410.28  1314.01  617.14  487.75  281.74  225.03
Accuracy (%)  96.46     96.38    96.65    96.53    96.43   96.68   96.58   96.72

Table 5. Elapsed time and training accuracy of Adam when L(\bm{\theta}_{K}) \leq 10^{-1} for training ResNet-20 on CIFAR-10

Adam
Batch Size    2^10    2^11    2^12    2^13    2^14    2^15    2^16
Time (s)      197.78  195.40  233.70  349.81  691.04  644.19  1148.68
Accuracy (%)  96.74   97.21   97.54   97.75   97.51   99.05   99.03

3.3. Discussion

To guarantee that \underline{K}_{\epsilon}(b)b in Theorem 3.2 and \overline{K}_{\epsilon,\delta}(b)b in Theorem 3.4 fit the actual SFO complexity Kb, we need to satisfy (1.4); that is,

\underbrace{(C+D)F}_{\approx(M_{1}\alpha+M_{2})(M_{3}\alpha-M_{4})} + \underbrace{BG}_{\approx M_{5}\alpha} \approx (F-\delta B)\epsilon^{2} \quad \text{and} \quad \underbrace{(C+D)}_{\approx M_{1}\alpha+M_{2}}\sqrt{EF} + G\sqrt{AB} \approx (\sqrt{EF}-\delta\sqrt{AB})\epsilon^{2}.

Hence, if the precisions \epsilon and \delta are small, a sufficiently small learning rate \alpha can be used to approximate K(b)b. Indeed, small learning rates such as 10^{-2} and 10^{-3} have been used in practice in previous numerical validations [23]. The demonstration of the existence of b_{*} and b^{*} thus leads to the existence of the actual critical batch size b^{\star} \approx b_{*} \approx b^{*}.

4. Numerical Examples

We evaluated the performances of SGD, Momentum, and Adam with different batch sizes for training ResNet-20 on the CIFAR-10 dataset with n = 50000. The metrics were the number of steps K and the SFO complexity Kb satisfying L(\bm{\theta}_{K}) \leq 10^{-1}, where \bm{\theta}_{K} is generated for each of SGD, Momentum, and Adam using batch size b. The stopping condition was 200 epochs. The experimental environment consisted of two Intel(R) Xeon(R) Gold 6148 2.4-GHz CPUs with 20 cores each, a 16-GB NVIDIA Tesla V100 900-Gbps GPU, and the Red Hat Enterprise Linux 7.6 OS. The code was written in Python 3.8.2 using the NumPy 1.17.3 and PyTorch 1.3.0 packages. A constant learning rate (\alpha = 10^{-3}) was commonly used. SGD used the identity matrix \mathsf{H}_{k} and \beta = \gamma = 0, and Momentum used the identity matrix \mathsf{H}_{k}, \beta = 0.9, and \gamma = 0. Adam used \mathsf{H}_{k} defined by Table 1, \beta = \gamma = 0.9, and \eta = \zeta = 0.999 [11].

Figure 3 shows the number of steps for SGD, Momentum, and Adam versus the batch size. For SGD and Momentum, the number of steps K needed for L(\bm{\theta}_{K}) \leq 10^{-1} initially decreased. However, SGD with b \geq 2^{6} and Momentum with b \geq 2^{10} did not satisfy L(\bm{\theta}_{K}) \leq 10^{-1} before the stopping condition was reached. Adam had an initial period of perfect scaling (indicated by the dashed line) such that the number of steps K needed for L(\bm{\theta}_{K}) \leq 10^{-1} was inversely proportional to the batch size b, and a critical batch size b^{\star} = 2^{11} such that K was not inversely proportional to the batch size beyond b^{\star}; i.e., there were diminishing returns.

Figure 4 plots the SFO complexities for SGD, Momentum, and Adam versus the batch size. For SGD, the SFO complexity was minimized at b = 2^{2}; for Momentum, it was minimized at b = 2^{3}. For Adam, the SFO complexity was minimized at the critical batch size b^{\star} = 2^{11}. In particular, we determined that the SFO complexity of Adam with b = 2^{10} was about 1404928, that of Adam with b^{\star} = 2^{11} was about 1382400, and that of Adam with b = 2^{12} was about 1650688.
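The step counts implied by these figures follow directly from K = (Kb)/b; the short check below reproduces them from the reported SFO complexities and confirms that b^{\star} = 2^{11} is the minimizer among these measured batch sizes.

# Reported SFO complexities Kb of Adam around the critical batch size.
sfo = {2**10: 1404928, 2**11: 1382400, 2**12: 1650688}

for b, kb in sfo.items():
    # K = (Kb)/b: 1372 steps at 2^10, 675 at 2^11, and 403 at 2^12.
    print(f"b = 2^{b.bit_length() - 1}: K = {kb // b}, Kb = {kb}")

print("critical batch size:", min(sfo, key=sfo.get))   # 2048 = 2^11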

Tables 2, 3, 4, and 5 show the elapsed time and the training accuracy for SGD, Momentum, and Adam when L(\bm{\theta}_{K}) \leq 10^{-1} was achieved. Table 2 shows that the elapsed time for SGD decreased with an increase in batch size. Table 3 shows that the elapsed time for Momentum monotonically decreased for b \leq 2^{8} and increased for b = 2^{9} compared with that for b = 2^{8}. This is because the SFO complexity of Momentum for b = 2^{9} was substantially higher than that for b = 2^{8}, as shown in Figure 4. Tables 4 and 5 show that the elapsed time for Adam monotonically decreased for b \leq 2^{11} and that the elapsed time for the critical batch size b^{\star} = 2^{11} was the shortest. The elapsed time for b \geq 2^{12} increased with the SFO complexity, as shown in Figure 4.

We also investigated the behaviors of SGD, Momentum, and Adam for a decaying learning rate \alpha_{k} = 10^{-3}/\sqrt{k} and for a linearly decaying learning rate defined by \alpha_{k} = 10^{-2} - 10^{-2}(1-\alpha)kT^{-1} (k \leq T) or \alpha_{k} = 10^{-2}\alpha (k > T), where the decay factor was \alpha = 10^{-3} and the number of training steps was T = 10^{4} (\alpha and T were set on the basis of the results of a previous study [23, Figure 14]). None of the optimizers achieved L(\bm{\theta}_{K}) \leq 10^{-1} before the stopping condition (200 epochs) was reached.

5. Conclusion

We have clarified the lower and upper bounds of the number of steps needed for nonconvex optimization of adaptive methods and have shown that there exist critical batch sizes b_{*} and b^{*} in the sense of minimizing the lower and upper bounds of the stochastic first-order oracle (SFO) complexities of adaptive methods. Previous numerical evaluations [23, 29] showed that the actual critical batch size b^{\star} minimizes the SFO complexity. Hence, if the lower and upper bounds of the SFO complexity fit the actual SFO complexity, then the existence of b_{*} and b^{*} guarantees the existence of the actual critical batch size b^{\star}. We also demonstrated that it would be sufficient to use small learning rates for an adaptive method so that the lower and upper bounds of the SFO complexity fit the actual SFO complexity. Furthermore, we provided numerical examples supporting our theoretical results. They showed that Adam with a frequently used small learning rate (10^{-3}) has a critical batch size that minimizes the SFO complexity.

References

  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. https://arxiv.org/pdf/1701.07875.pdf, 2017.
  • [2] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60:223–311, 2018.
  • [3] Hao Chen, Lili Zheng, Raed AL Kontar, and Garvesh Raskutti. Stochastic gradient descent in correlated settings: A study on Gaussian processes. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [4] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In Proceedings of The International Conference on Learning Representations, 2019.
  • [5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • [6] F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems I. Springer, New York, 2003.
  • [7] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • [8] Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions. Journal of Machine Learning Research, 21:1–48, 2020.
  • [9] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
  • [10] Hideaki Iiduka. Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2021.3107415, 2021.
  • [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of The International Conference on Learning Representations, 2015.
  • [12] Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for SGD: An adaptive learning rate for fast convergence. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130, 2021.
  • [13] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In Proceedings of The International Conference on Learning Representations, 2019.
  • [14] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of Machine Learning Research, volume 37, pages 2408–2417, 2015.
  • [15] Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [16] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
  • [17] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2){O}(1/k^{2}). Doklady AN USSR, 269:543–547, 1983.
  • [18] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.
  • [19] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In Proceedings of The International Conference on Learning Representations, 2018.
  • [20] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  • [21] Kevin Scaman and Cédric Malherbe. Robustness analysis of non-convex stochastic gradient descent using biased expectations. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [22] Robin M. Schmidt, Frank Schneider, and Philipp Hennig. Descending through a crowded valley–Benchmarking deep learning optimizers. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9367–9376, 2021.
  • [23] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20:1–49, 2019.
  • [24] Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In Proceedings of The International Conference on Learning Representations, 2018.
  • [25] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, 2013.
  • [26] Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4:26–31, 2012.
  • [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [28] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2048–2057, 2015.
  • [29] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [30] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James S. Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems, volume 33, 2020.
  • [31] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 23, 2010.

Appendix A Appendix

Unless stated otherwise, all relationships between random variables are assumed to hold almost surely. Let S \in \mathbb{S}_{++}^{d}. The S-inner product of \mathbb{R}^{d} is defined for all \bm{x}, \bm{y} \in \mathbb{R}^{d} by \langle\bm{x},\bm{y}\rangle_{S} := \langle\bm{x}, S\bm{y}\rangle = \bm{x}^{\top}(S\bm{y}), and the S-norm is defined by \|\bm{x}\|_{S} := \sqrt{\langle\bm{x}, S\bm{x}\rangle}.

A.1. Lemmas and Theorems

Lemma A.1.

Suppose that (S1), (S2)(2.4), and (S3) hold. Then, Algorithm 1 satisfies the following: for all k \in \mathbb{N} and all \bm{\theta} \in \mathcal{S},

\mathbb{E}\left[\|\bm{\theta}_{k+1}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2}\right] = \mathbb{E}\left[\|\bm{\theta}_{k}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2}\right] + \alpha^{2}\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right] + 2\alpha\left\{\frac{\beta}{\tilde{\gamma}_{k}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right] + \frac{\tilde{\beta}}{\tilde{\gamma}_{k}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L(\bm{\theta}_{k})\right]\right\},

where \tilde{\beta} := 1-\beta and \tilde{\gamma}_{k} := 1-\gamma^{k+1}.

Proof.

Let \bm{\theta} \in \mathcal{S} and k \in \mathbb{N}. The definition of \bm{\theta}_{k+1} implies that

\|\bm{\theta}_{k+1}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2} = \|\bm{\theta}_{k}-\bm{\theta}\|_{\mathsf{H}_{k}}^{2} + 2\alpha\langle\bm{\theta}_{k}-\bm{\theta}, \bm{\mathsf{d}}_{k}\rangle_{\mathsf{H}_{k}} + \alpha^{2}\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}.

Moreover, the definitions of \bm{\mathsf{d}}_{k}, \bm{m}_{k}, and \hat{\bm{m}}_{k} ensure that

\left\langle\bm{\theta}_{k}-\bm{\theta}, \bm{\mathsf{d}}_{k}\right\rangle_{\mathsf{H}_{k}} = \frac{1}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k} = \frac{\beta}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} + \frac{\tilde{\beta}}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L_{B_{k}}(\bm{\theta}_{k}).

Hence,

(A.1) \left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2} = \left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2} + \alpha^{2}\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2} + 2\alpha\left\{\frac{\beta}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1} + \frac{\tilde{\beta}}{\tilde{\gamma}_{k}}(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L_{B_{k}}(\bm{\theta}_{k})\right\}.

Conditions (2.4) and (S3) guarantee that

\mathbb{E}\left[\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L_{B_{k}}(\bm{\theta}_{k})\,\big|\,\bm{\theta}_{k}\right]\right] = \mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\mathbb{E}\left[\nabla L_{B_{k}}(\bm{\theta}_{k})\,\big|\,\bm{\theta}_{k}\right]\right] = \mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\nabla L(\bm{\theta}_{k})\right].

Therefore, the lemma follows by taking the expectation with respect to \xi_{k} on both sides of (A.1). This completes the proof. ∎

Lemma A.1 leads to the following:

Lemma A.2.

Suppose that the assumptions in Lemma A.1 hold and consider Algorithm 1. Let V_{k}(\bm{\theta}) := \mathbb{E}[(\bm{\theta}_{k}-\bm{\theta})^{\top}\nabla L(\bm{\theta}_{k})] for all \bm{\theta} \in \mathcal{S} and all k \in \mathbb{N}. Then, for all \bm{\theta} \in \mathcal{S} and all K \geq 1,

(A.2) \sum_{k=1}^{K}V_{k}(\bm{\theta}) = \frac{1}{2\alpha\tilde{\beta}}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\left\{\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]-\mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]\right\}}_{\Gamma_{K}} + \frac{\alpha}{2\tilde{\beta}}\underbrace{\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right]}_{A_{K}} + \frac{\beta}{\tilde{\beta}}\underbrace{\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]}_{B_{K}}.
Proof.

Let \bm{\theta} \in \mathcal{S}. Lemma A.1 guarantees that, for all k \in \mathbb{N},

\frac{2\alpha\tilde{\beta}}{\tilde{\gamma}_{k}}V_{k}(\bm{\theta}) = \mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right] - \mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right] + \alpha^{2}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right] + \frac{2\alpha\beta}{\tilde{\gamma}_{k}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right].

Hence, we have that

V_{k}(\bm{\theta}) = \frac{\tilde{\gamma}_{k}}{2\alpha\tilde{\beta}}\left\{\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right] - \mathbb{E}\left[\left\|\bm{\theta}_{k+1}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]\right\} + \frac{\alpha\tilde{\gamma}_{k}}{2\tilde{\beta}}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right] + \frac{\beta}{\tilde{\beta}}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right].

Summing the above equality from k = 1 to K \geq 1 implies the assertion in Lemma A.2. This completes the proof. ∎

Lemma A.3.

Algorithm 1 satisfies that, under (S2)(2.4), (2.5), and (L1), for all k \in \mathbb{N},

\mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \frac{\sigma^{2}}{b} + P^{2}.

Under (A1) and (L1), for all k \in \mathbb{N},

\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right] \leq \frac{1}{(1-\gamma)^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b} + P^{2}\right),

where h_{0}^{*} := \min_{i \in [d]} h_{0,i}.

Proof.

Let k \in \mathbb{N}. From (S2)(2.4), we have that

\mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] = \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k}) + \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right]
= \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] + \mathbb{E}\left[\|\nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] + 2\mathbb{E}\left[(\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k}))^{\top}\nabla L(\bm{\theta}_{k})\,\big|\,\bm{\theta}_{k}\right]
= \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k}) - \nabla L(\bm{\theta}_{k})\|^{2}\,\big|\,\bm{\theta}_{k}\right] + \|\nabla L(\bm{\theta}_{k})\|^{2},

which, together with (S2)(2.5) and (L1), implies that

(A.3) \mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\right] \leq \frac{\sigma^{2}}{b} + P^{2}.

The convexity of \|\cdot\|^{2}, together with the definition of \bm{m}_{k} and (A.3), guarantees that, for all k \in \mathbb{N},

\mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \beta\mathbb{E}\left[\|\bm{m}_{k-1}\|^{2}\right] + (1-\beta)\mathbb{E}\left[\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}\right] \leq \beta\mathbb{E}\left[\|\bm{m}_{k-1}\|^{2}\right] + (1-\beta)\left(\frac{\sigma^{2}}{b} + P^{2}\right).

Induction thus ensures that, for all k \in \mathbb{N},

(A.4) \mathbb{E}\left[\|\bm{m}_{k}\|^{2}\right] \leq \max\left\{\|\bm{m}_{-1}\|^{2}, \frac{\sigma^{2}}{b} + P^{2}\right\} = \frac{\sigma^{2}}{b} + P^{2},

where 𝒎1=𝟎\bm{m}_{-1}=\bm{0}. For kk\in\mathbb{N}, 𝖧k𝕊++d\mathsf{H}_{k}\in\mathbb{S}_{++}^{d} guarantees the existence of a unique matrix 𝖧¯k𝕊++d\overline{\mathsf{H}}_{k}\in\mathbb{S}_{++}^{d} such that 𝖧k=𝖧¯k2\mathsf{H}_{k}=\overline{\mathsf{H}}_{k}^{2} [9, Theorem 7.2.6]. We have that, for all 𝒙d\bm{x}\in\mathbb{R}^{d}, 𝒙𝖧k2=𝖧¯k𝒙2\|\bm{x}\|_{\mathsf{H}_{k}}^{2}=\|\overline{\mathsf{H}}_{k}\bm{x}\|^{2}. Accordingly, the definitions of 𝗱k\bm{\mathsf{d}}_{k} and 𝒎^k\hat{\bm{m}}_{k} imply that, for all kk\in\mathbb{N},

𝔼[𝗱k𝖧k2]=𝔼[𝖧¯k1𝖧k𝗱k2]1γ~k2𝔼[𝖧¯k12𝒎k2]1(1γ)2𝔼[𝖧¯k12𝒎k2],\displaystyle\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right]=\mathbb{E}\left[\left\|\overline{\mathsf{H}}_{k}^{-1}\mathsf{H}_{k}\bm{\mathsf{d}}_{k}\right\|^{2}\right]\leq\frac{1}{{\tilde{\gamma}}_{k}^{2}}\mathbb{E}\left[\left\|\overline{\mathsf{H}}_{k}^{-1}\right\|^{2}\|\bm{m}_{k}\|^{2}\right]\leq\frac{1}{(1-\gamma)^{2}}\mathbb{E}\left[\left\|\overline{\mathsf{H}}_{k}^{-1}\right\|^{2}\|\bm{m}_{k}\|^{2}\right],

where

𝖧¯k1=𝖽𝗂𝖺𝗀(hk,i12)=maxi[d]hk,i12\displaystyle\left\|\overline{\mathsf{H}}_{k}^{-1}\right\|=\left\|\mathsf{diag}\left(h_{k,i}^{-\frac{1}{2}}\right)\right\|={\max_{i\in[d]}h_{k,i}^{-\frac{1}{2}}}

and γ~k:=1γk+11γ{\tilde{\gamma}}_{k}:=1-{\gamma}^{k+1}\geq 1-{\gamma}. Moreover, (A1) ensures that, for all kk\in\mathbb{N},

hk,ih0,ih0:=mini[d]h0,i.\displaystyle h_{k,i}\geq h_{0,i}\geq h_{0}^{*}:=\min_{i\in[d]}h_{0,i}.

Hence, (A.4) implies that, for all kk\in\mathbb{N},

𝔼[𝗱k𝖧k2]1(1γ)2h0(σ2b+P2),\displaystyle\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right]\leq\frac{1}{(1-{\gamma})^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right),

completing the proof. ∎
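As a numerical illustration of the variance bound (A.3) underlying this lemma, the following Python sketch draws mini-batch gradients for a synthetic problem in which the per-sample gradient noise has second moment $\sigma^{2}$ and the full gradient norm plays the role of $P$; the constants and the problem itself are hypothetical placeholders, not the paper's experimental setting. The sample mean of $\|\nabla L_{B_{k}}(\bm{\theta}_{k})\|^{2}$ should track $\sigma^{2}/b+\|\nabla L(\bm{\theta}_{k})\|^{2}\leq\sigma^{2}/b+P^{2}$, halving the noise term for each doubling of $b$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d = 10                                         # parameter dimension (hypothetical)
grad_full = rng.normal(size=d)
grad_full *= 2.0 / np.linalg.norm(grad_full)   # ||grad L(theta)|| = 2, playing the role of P
sigma = 1.5                                    # per-sample noise level (hypothetical)

def minibatch_grad(b):
    # average of b i.i.d. noisy per-sample gradients; each noise vector has
    # E||noise||^2 = sigma^2, so the averaged noise has E||.||^2 = sigma^2 / b
    noise = rng.normal(scale=sigma / np.sqrt(d), size=(b, d)).mean(axis=0)
    return grad_full + noise

for b in (1, 8, 64):
    sq_norms = [np.sum(minibatch_grad(b) ** 2) for _ in range(20000)]
    rhs = sigma ** 2 / b + np.sum(grad_full ** 2)   # sigma^2/b + ||grad L||^2
    print(f"b={b:3d}  mean ||grad L_B||^2 = {np.mean(sq_norms):.3f}  "
          f"sigma^2/b + ||grad L||^2 = {rhs:.3f}")
\end{verbatim}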

We are now in a position to prove the following theorems.

Theorem A.1.

Suppose that (S1)–(S3), (A1)–(A2), and (L1)–(L2) hold and consider Algorithm 1. Then, for all $\bm{\theta}\in\mathcal{S}$ and all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta})\leq\frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}K}+\frac{\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right)+\frac{\beta\sqrt{d\mathrm{Dist}(\bm{\theta})}}{\tilde{\beta}}\sqrt{\frac{\sigma^{2}}{b}+P^{2}},
\]

where $\tilde{\gamma}:=1-\gamma$, $\mathrm{Dist}(\bm{\theta})$ and $H_{i}$ are defined as in (L2) and (A2), and $H:=\max_{i\in[d]}H_{i}$.

Proof.

Let $\bm{\theta}\in\mathcal{S}$. From the definition of $\Gamma_{K}$ in Lemma A.2, we have that

\begin{equation}\tag{A.5}
\begin{split}
\Gamma_{K}&=\tilde{\gamma}_{1}\mathbb{E}\left[\left\|\bm{\theta}_{1}-\bm{\theta}\right\|_{\mathsf{H}_{1}}^{2}\right]+\underbrace{\sum_{k=2}^{K}\left\{\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k}}^{2}\right]-\tilde{\gamma}_{k-1}\mathbb{E}\left[\left\|\bm{\theta}_{k}-\bm{\theta}\right\|_{\mathsf{H}_{k-1}}^{2}\right]\right\}}_{\tilde{\Gamma}_{K}}\\
&\quad-\tilde{\gamma}_{K}\mathbb{E}\left[\left\|\bm{\theta}_{K+1}-\bm{\theta}\right\|_{\mathsf{H}_{K}}^{2}\right].
\end{split}
\end{equation}

Since there exists $\overline{\mathsf{H}}_{k}\in\mathbb{S}_{++}^{d}$ such that $\mathsf{H}_{k}=\overline{\mathsf{H}}_{k}^{2}$, we have $\|\bm{x}\|_{\mathsf{H}_{k}}^{2}=\|\overline{\mathsf{H}}_{k}\bm{x}\|^{2}$ for all $\bm{x}\in\mathbb{R}^{d}$. Accordingly, we have

\[
\tilde{\Gamma}_{K}=\mathbb{E}\left[\sum_{k=2}^{K}\left\{\tilde{\gamma}_{k}\left\|\overline{\mathsf{H}}_{k}(\bm{\theta}_{k}-\bm{\theta})\right\|^{2}-\tilde{\gamma}_{k-1}\left\|\overline{\mathsf{H}}_{k-1}(\bm{\theta}_{k}-\bm{\theta})\right\|^{2}\right\}\right].
\]

From $\overline{\mathsf{H}}_{k}=\mathsf{diag}(\sqrt{h_{k,i}})$, we have that, for all $\bm{x}=(x_{i})_{i=1}^{d}\in\mathbb{R}^{d}$, $\|\overline{\mathsf{H}}_{k}\bm{x}\|^{2}=\sum_{i=1}^{d}h_{k,i}x_{i}^{2}$. Hence, for all $K\geq 2$,

\begin{equation}\tag{A.6}
\tilde{\Gamma}_{K}=\mathbb{E}\left[\sum_{k=2}^{K}\sum_{i=1}^{d}\left(\tilde{\gamma}_{k}h_{k,i}-\tilde{\gamma}_{k-1}h_{k-1,i}\right)(\theta_{k,i}-\theta_{i})^{2}\right].
\end{equation}

From (A1), we have that, for all $k\geq 1$ and all $i\in[d]$,

\[
\tilde{\gamma}_{k}h_{k,i}-\tilde{\gamma}_{k-1}h_{k-1,i}\geq 0.
\]

Moreover, from (L2), $\max_{i\in[d]}\sup_{k\in\mathbb{N}}(\theta_{k,i}-\theta_{i})^{2}\leq\mathrm{Dist}(\bm{\theta})$. Accordingly, for all $K\geq 2$,

\[
\tilde{\Gamma}_{K}\leq\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{k=2}^{K}\sum_{i=1}^{d}\left(\tilde{\gamma}_{k}h_{k,i}-\tilde{\gamma}_{k-1}h_{k-1,i}\right)\right]=\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}\left(\tilde{\gamma}_{K}h_{K,i}-\tilde{\gamma}_{1}h_{1,i}\right)\right].
\]

Therefore, (A.5), $\tilde{\gamma}_{1}\mathbb{E}[\|\bm{\theta}_{1}-\bm{\theta}\|_{\mathsf{H}_{1}}^{2}]\leq\mathrm{Dist}(\bm{\theta})\tilde{\gamma}_{1}\mathbb{E}[\sum_{i=1}^{d}h_{1,i}]$, and (A2) imply that, for all $K\geq 1$,

\begin{align*}
\Gamma_{K}
&\leq\tilde{\gamma}_{1}\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}h_{1,i}\right]+\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}\left(\tilde{\gamma}_{K}h_{K,i}-\tilde{\gamma}_{1}h_{1,i}\right)\right]\\
&=\tilde{\gamma}_{K}\mathrm{Dist}(\bm{\theta})\mathbb{E}\left[\sum_{i=1}^{d}h_{K,i}\right]\\
&\leq\tilde{\gamma}_{K}\mathrm{Dist}(\bm{\theta})\sum_{i=1}^{d}H_{i}\\
&\leq\tilde{\gamma}_{K}\mathrm{Dist}(\bm{\theta})dH,
\end{align*}

where $H=\max_{i\in[d]}H_{i}$. From $\tilde{\gamma}_{K}=1-\gamma^{K+1}\leq 1$, we have that

\begin{equation}\tag{A.7}
\frac{1}{2\alpha\tilde{\beta}}\Gamma_{K}\leq\frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}}.
\end{equation}

Lemma A.3 implies that, for all $K\geq 1$,

\[
A_{K}:=\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\left\|\bm{\mathsf{d}}_{k}\right\|_{\mathsf{H}_{k}}^{2}\right]\leq\sum_{k=1}^{K}\frac{\tilde{\gamma}_{k}}{(1-\gamma)^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right),
\]

which, together with $\tilde{\gamma}_{k}\leq 1$, implies that

\begin{equation}\tag{A.8}
\frac{\alpha}{2\tilde{\beta}}A_{K}\leq\frac{\alpha K}{2\tilde{\beta}(1-\gamma)^{2}h_{0}^{*}}\left(\frac{\sigma^{2}}{b}+P^{2}\right).
\end{equation}

Inequality (A.4) in the proof of Lemma A.3, together with Jensen's inequality, ensures that, for all $k\in\mathbb{N}$,

\[
\mathbb{E}\left[\|\bm{m}_{k}\|\right]\leq\sqrt{\frac{\sigma^{2}}{b}+P^{2}}.
\]

The Cauchy–Schwarz inequality and (L2) guarantee that, for all $K\geq 1$,

\begin{equation}\tag{A.9}
\begin{split}
\frac{\beta}{\tilde{\beta}}B_{K}&:=\frac{\beta}{\tilde{\beta}}\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]\leq\frac{\beta}{\tilde{\beta}}\sqrt{d\mathrm{Dist}(\bm{\theta})}\sum_{k=1}^{K}\mathbb{E}\left[\left\|\bm{m}_{k-1}\right\|\right]\\
&\leq\frac{\beta\sqrt{d\mathrm{Dist}(\bm{\theta})}}{\tilde{\beta}}K\sqrt{\frac{\sigma^{2}}{b}+P^{2}}.
\end{split}
\end{equation}

Therefore, (A.2), (A.7), (A.8), and (A.9) lead to the assertion in Theorem A.1. This completes the proof. ∎
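To see how the bound of Theorem A.1 behaves as a function of $K$ and $b$, the following Python sketch evaluates its right-hand side for hypothetical values of $\alpha$, $\beta$, $\gamma$, $d$, $\mathrm{Dist}(\bm{\theta})$, $H$, $h_{0}^{*}$, $\sigma^{2}$, and $P^{2}$ (placeholders only, not values from the paper): the first term decays with $K$, the $\sigma^{2}/b$ contributions decay with $b$, and the remaining parts form the $K$- and $b$-independent floor that reappears as $C+D$ in the proof of Theorem 3.1.

\begin{verbatim}
import numpy as np

# hypothetical constants (placeholders for illustration only)
alpha, beta, gamma = 1e-3, 0.9, 0.9
d, dist, H, h0, sigma2, P2 = 1000, 1.0, 10.0, 1e-3, 4.0, 1.0
beta_t, gamma_t = 1.0 - beta, 1.0 - gamma       # beta~ and gamma~

def rhs_theorem_a1(K, b):
    """Right-hand side of the bound in Theorem A.1."""
    t1 = d * dist * H / (2.0 * alpha * beta_t * K)
    t2 = alpha / (2.0 * beta_t * gamma_t**2 * h0) * (sigma2 / b + P2)
    t3 = beta * np.sqrt(d * dist) / beta_t * np.sqrt(sigma2 / b + P2)
    return t1 + t2 + t3

for b in (1, 16, 256, 4096):
    vals = ", ".join(f"{rhs_theorem_a1(K, b):.1f}" for K in (1e2, 1e4, 1e6))
    print(f"b = {b:5d}: bound at K = 1e2, 1e4, 1e6 -> {vals}")
\end{verbatim}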

Theorem A.2.

Suppose that (S1)–(S3), (A1), (L1), and (U1)–(U3) hold and consider Algorithm 1. Then, for all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta}^{*})\geq\frac{X(\bm{\theta}^{*})}{2\alpha\tilde{\beta}K}+\frac{\sigma^{2}(c_{1}\alpha\tilde{\gamma}-2c_{2}\beta)}{2\tilde{\beta}b}-\frac{c_{2}\beta P^{2}}{\tilde{\beta}},
\]

where $X(\bm{\theta}^{*})$, $c_{1}$, and $c_{2}$ are defined by (U1)–(U3), and $\sigma^{2}$ is defined by (S2)(2.5).

Proof.

Condition (A1), together with (A.5) and (A.6), implies that $\tilde{\Gamma}_{K}\geq 0$ for all $K\geq 1$. Hence, from (A.5) and $1\geq\tilde{\gamma}_{k}\geq\tilde{\gamma}_{k-1}$ ($k\geq 1$),

\begin{align*}
\Gamma_{K}
&\geq\tilde{\gamma}_{1}\mathbb{E}\left[\left\|\bm{\theta}_{1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{1}}^{2}\right]-\tilde{\gamma}_{K}\mathbb{E}\left[\left\|\bm{\theta}_{K+1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{K}}^{2}\right]\\
&\geq(1-\gamma)\mathbb{E}\left[\left\|\bm{\theta}_{1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{1}}^{2}\right]-\mathbb{E}\left[\left\|\bm{\theta}_{K+1}-\bm{\theta}^{*}\right\|_{\mathsf{H}_{K}}^{2}\right],
\end{align*}

which, together with (U3), implies that, for all $K\geq 1$,

\[
\Gamma_{K}\geq X(\bm{\theta}^{*}).
\]

Moreover, (U1) and (U2) imply that, for all $K\geq 1$,

\begin{align*}
A_{K}&=\sum_{k=1}^{K}\tilde{\gamma}_{k}\mathbb{E}\left[\|\bm{\mathsf{d}}_{k}\|_{\mathsf{H}_{k}}^{2}\right]\geq\frac{c_{1}\tilde{\gamma}\sigma^{2}}{b}K\quad\text{and}\\
B_{K}&=\sum_{k=1}^{K}\mathbb{E}\left[(\bm{\theta}^{*}-\bm{\theta}_{k})^{\top}\bm{m}_{k-1}\right]\geq-c_{2}\left(\frac{\sigma^{2}}{b}+P^{2}\right)K.
\end{align*}

Accordingly, Lemma A.2 leads to the assertion in Theorem A.2. This completes the proof. ∎

A.2. Proof of Theorem 3.1

Proof.

Theorem A.1, together with $\sqrt{\sigma^{2}/b+P^{2}}\leq\sqrt{\sigma^{2}+P^{2}}$ (which holds since $b\geq 1$), guarantees that, for all $\bm{\theta}\in\mathcal{S}$ and all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta})\leq\underbrace{\frac{d\mathrm{Dist}(\bm{\theta})H}{2\alpha\tilde{\beta}}}_{A(\alpha,\beta,\bm{\theta},d,H)}\frac{1}{K}+\underbrace{\frac{\sigma^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}}}_{B(\alpha,\beta,\gamma,\sigma^{2},h_{0}^{*})}\frac{1}{b}+\underbrace{\frac{P^{2}\alpha}{2\tilde{\beta}\tilde{\gamma}^{2}h_{0}^{*}}}_{C(\alpha,\beta,\gamma,P^{2},h_{0}^{*})}+\underbrace{\frac{\sqrt{d\mathrm{Dist}(\bm{\theta})(\sigma^{2}+P^{2})}\beta}{\tilde{\beta}}}_{D(\beta,\bm{\theta},d,\sigma^{2},P^{2})}.
\]

If the right-hand side of the above inequality is less than or equal to $\epsilon^{2}$, then we have that

\[
\frac{A}{K}+\frac{B}{b}+C+D\leq\epsilon^{2},
\]

which implies that

\[
\underline{K}(b):=\frac{Ab}{\{\epsilon^{2}-(C+D)\}b-B}\leq K,
\]

where $b>B/\{\epsilon^{2}-(C+D)\}\geq 0$. We have that, for $b>B/\{\epsilon^{2}-(C+D)\}$,

\begin{align*}
&\underline{K}^{\star}:=\frac{A}{\epsilon^{2}-(C+D)}<\underline{K}(b),\\
&\frac{\mathrm{d}\underline{K}(b)}{\mathrm{d}b}=\frac{-AB}{[\{\epsilon^{2}-(C+D)\}b-B]^{2}}\leq 0,\\
&\frac{\mathrm{d}^{2}\underline{K}(b)}{\mathrm{d}b^{2}}=\frac{2AB\{\epsilon^{2}-(C+D)\}}{[\{\epsilon^{2}-(C+D)\}b-B]^{3}}\geq 0.
\end{align*}

Hence, $\underline{K}$ is monotone decreasing and convex for $b>B/\{\epsilon^{2}-(C+D)\}$. This completes the proof. ∎
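As a sanity check on the monotonicity and convexity just derived, the following Python sketch evaluates $\underline{K}(b)$ on a grid of batch sizes for hypothetical $A$, $B$, $C$, $D$, and $\epsilon$ (placeholders, not values estimated in the paper) and verifies that finite-difference approximations of its first and second derivatives have the predicted signs.

\begin{verbatim}
import numpy as np

# hypothetical constants (placeholders, not taken from the paper)
A, B, C, D, eps = 1.0e4, 2.0e2, 1.0e-3, 2.0e-3, 0.5
c = eps**2 - (C + D)          # must be positive for K_lower to be defined
b_min = B / c                 # K_lower(b) is defined for b > B / c

def K_lower(b):
    return A * b / (c * b - B)

b = np.linspace(1.05 * b_min, 50 * b_min, 2000)
K = K_lower(b)
dK = np.gradient(K, b)        # finite-difference first derivative
d2K = np.gradient(dK, b)      # finite-difference second derivative

print("monotone decreasing:", np.all(dK <= 0))
print("convex (d2K >= 0)  :", np.all(d2K[1:-1] >= 0))
print("K_lower_star = A/(eps^2-(C+D)):", A / c)
print("K_lower at largest grid point :", K[-1])
\end{verbatim}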

A.3. Proof of Theorem 3.2

Proof.

We have that, for $b>B/\{\epsilon^{2}-(C+D)\}$,

\[
\underline{K}(b)b=\frac{Ab^{2}}{\{\epsilon^{2}-(C+D)\}b-B}.
\]

Hence,

\begin{align*}
&\frac{\mathrm{d}\underline{K}(b)b}{\mathrm{d}b}=\frac{Ab[\{\epsilon^{2}-(C+D)\}b-2B]}{[\{\epsilon^{2}-(C+D)\}b-B]^{2}},\\
&\frac{\mathrm{d}^{2}\underline{K}(b)b}{\mathrm{d}b^{2}}=\frac{2AB^{2}}{[\{\epsilon^{2}-(C+D)\}b-B]^{3}}\geq 0,
\end{align*}

which implies that $\underline{K}(b)b$ is convex for $b>B/\{\epsilon^{2}-(C+D)\}$ and

\[
\frac{\mathrm{d}\underline{K}(b)b}{\mathrm{d}b}\begin{cases}<0&\text{ if }b<b_{*},\\ =0&\text{ if }b=b_{*}=\dfrac{2B}{\epsilon^{2}-(C+D)},\\ >0&\text{ if }b>b_{*}.\end{cases}
\]

The point $b_{*}$ attains the minimum value $\underline{K}(b_{*})b_{*}$ of $\underline{K}(b)b$. This completes the proof. ∎
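The closed-form critical batch size $b_{*}=2B/\{\epsilon^{2}-(C+D)\}$ can be confirmed numerically: using the same hypothetical constants as in the previous sketch, the grid minimizer of the lower-bound SFO complexity $\underline{K}(b)b$ agrees with $b_{*}$.

\begin{verbatim}
import numpy as np

# hypothetical constants (same placeholders as in the previous sketch)
A, B, C, D, eps = 1.0e4, 2.0e2, 1.0e-3, 2.0e-3, 0.5
c = eps**2 - (C + D)

def sfo_lower(b):
    # K_lower(b) * b, the lower-bound SFO complexity
    return A * b**2 / (c * b - B)

b_star = 2 * B / c                               # closed-form b_* (Theorem 3.2)
b = np.linspace(1.05 * B / c, 20 * B / c, 200001)
b_grid = b[np.argmin(sfo_lower(b))]

print("closed-form b_* :", b_star)
print("grid minimizer  :", b_grid)
print("relative error  :", abs(b_grid - b_star) / b_star)
\end{verbatim}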

A.4. Proof of Theorem 3.3

Proof.

Theorem A.2 guarantees that, for all $K\geq 1$,

\[
\mathrm{V}(K,\bm{\theta}^{*})\geq\underbrace{\frac{X(\bm{\theta}^{*})}{2\alpha\tilde{\beta}}}_{E(\alpha,\beta,\bm{\theta}^{*})}\frac{1}{K}+\underbrace{\frac{\sigma^{2}(c_{1}\alpha\tilde{\gamma}-2c_{2}\beta)}{2\tilde{\beta}}}_{F(\alpha,\beta,\gamma,\sigma^{2},c_{1},c_{2})}\frac{1}{b}-\underbrace{\frac{c_{2}\beta P^{2}}{\tilde{\beta}}}_{G(\beta,c_{2},P^{2})}.
\]

If the right-hand side of the above inequality is greater than or equal to $\delta\epsilon^{2}$, then we have that

\[
\frac{E}{K}+\frac{F}{b}-G\geq\delta\epsilon^{2},
\]

which implies that

\[
\overline{K}(b):=\frac{Eb}{(\delta\epsilon^{2}+G)b-F}\geq K,
\]

where $b>F/(\delta\epsilon^{2}+G)\geq 0$. We have that, for $b>F/(\delta\epsilon^{2}+G)$,

\begin{align*}
&\overline{K}^{\star}:=\frac{E}{\delta\epsilon^{2}+G}<\overline{K}(b),\\
&\frac{\mathrm{d}\overline{K}(b)}{\mathrm{d}b}=\frac{-EF}{\{(\delta\epsilon^{2}+G)b-F\}^{2}}\leq 0,\\
&\frac{\mathrm{d}^{2}\overline{K}(b)}{\mathrm{d}b^{2}}=\frac{2EF(\delta\epsilon^{2}+G)}{\{(\delta\epsilon^{2}+G)b-F\}^{3}}\geq 0.
\end{align*}

Hence, $\overline{K}$ is monotone decreasing and convex for $b>F/(\delta\epsilon^{2}+G)$. This completes the proof. ∎

A.5. Proof of Theorem 3.4

Proof.

We have that, for $b>F/(\delta\epsilon^{2}+G)$,

\[
\overline{K}(b)b=\frac{Eb^{2}}{(\delta\epsilon^{2}+G)b-F}.
\]

Hence,

\begin{align*}
&\frac{\mathrm{d}\overline{K}(b)b}{\mathrm{d}b}=\frac{Eb\{(\delta\epsilon^{2}+G)b-2F\}}{\{(\delta\epsilon^{2}+G)b-F\}^{2}},\\
&\frac{\mathrm{d}^{2}\overline{K}(b)b}{\mathrm{d}b^{2}}=\frac{2EF^{2}}{\{(\delta\epsilon^{2}+G)b-F\}^{3}}\geq 0,
\end{align*}

which implies that $\overline{K}(b)b$ is convex for $b>F/(\delta\epsilon^{2}+G)$ and

\[
\frac{\mathrm{d}\overline{K}(b)b}{\mathrm{d}b}\begin{cases}<0&\text{ if }b<b^{*},\\ =0&\text{ if }b=b^{*}=\dfrac{2F}{\delta\epsilon^{2}+G},\\ >0&\text{ if }b>b^{*}.\end{cases}
\]

The point $b^{*}$ attains the minimum value $\overline{K}(b^{*})b^{*}$ of $\overline{K}(b)b$. This completes the proof. ∎
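The same check applies to the upper bound: for hypothetical $E$, $F$, $G$, $\delta$, and $\epsilon$ (placeholders, not values from the paper), the grid minimizer of the upper-bound SFO complexity $\overline{K}(b)b$ agrees with the closed form $b^{*}=2F/(\delta\epsilon^{2}+G)$ obtained above.

\begin{verbatim}
import numpy as np

# hypothetical constants (placeholders, not values from the paper)
E, F, G, delta, eps = 5.0e3, 1.0e2, 1.0e-3, 0.5, 0.5
c = delta * eps**2 + G

def sfo_upper(b):
    # K_upper(b) * b, the upper-bound SFO complexity
    return E * b**2 / (c * b - F)

b_star = 2 * F / c                               # closed-form b^* (Theorem 3.4)
b = np.linspace(1.05 * F / c, 20 * F / c, 200001)

print("closed-form b^* :", b_star)
print("grid minimizer  :", b[np.argmin(sfo_upper(b))])
\end{verbatim}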