Minimization of Stochastic First-order Oracle Complexity of Adaptive Methods for Nonconvex Optimization
Abstract.
Numerical evaluations have definitively shown that, for deep learning optimizers such as stochastic gradient descent, momentum, and adaptive methods, the number of steps needed to train a deep neural network halves for each doubling of the batch size and that there is a region of diminishing returns beyond the critical batch size. In this paper, we determine the actual critical batch size by using the global minimizer of the stochastic first-order oracle (SFO) complexity of the optimizer. To prove the existence of the actual critical batch size, we set the lower and upper bounds of the SFO complexity and prove that there exist critical batch sizes in the sense of minimizing the lower and upper bounds. This proof implies that, if the SFO complexity fits the lower and upper bounds, then the existence of these critical batch sizes demonstrates the existence of the actual critical batch size. We also discuss the conditions needed for the SFO complexity to fit the lower and upper bounds and provide numerical results that support our theoretical results.
Key words and phrases:
adaptive method, batch size, nonconvex optimization, stochastic first-order oracle complexity
1. Introduction
1.1. Background
Adaptive methods are powerful deep learning optimizers used to find the model parameters of deep neural networks that minimize the expected risk and empirical risk loss functions [2, Section 4]. Specific adaptive methods are, for example, adaptive gradient (AdaGrad) [5], root mean square propagation (RMSProp) [26], adaptive moment estimation (Adam) [11], and adaptive mean square gradient (AMSGrad) [19] (Table 2 in [22] lists the main deep learning optimizers).
Deep learning optimizers have been widely studied from the viewpoints of both theory and practice. While the convergence and convergence rate of deep learning optimizers have been theoretically studied for convex optimization [31, 11, 19, 13, 15], theoretical investigation of deep learning optimizers for nonconvex optimization is needed so that such optimizers can be put into practical use in deep learning [28, 1, 27].
The convergence of adaptive methods has been studied for nonconvex optimization [7, 4, 30, 10], and the convergence of stochastic gradient descent (SGD) methods has been studied for nonconvex optimization [8, 3, 21, 12]. Chen et al. showed that generalized Adam, which includes the heavy-ball method, AdaGrad, RMSProp, AMSGrad, and AdaGrad with first-order momentum (AdaFom), has an $\mathcal{O}(\log T/\sqrt{T})$ convergence rate when a decaying learning rate ($\alpha_t = 1/\sqrt{t}$) is used [4]. AdaBelief (which adapts the step size in accordance with the belief in the observed gradients) using $\alpha_t = 1/\sqrt{t}$ has an $\mathcal{O}(\log T/\sqrt{T})$ convergence rate [30]. A method that unifies adaptive methods such as AMSGrad and AdaBelief has been shown to have a convergence rate of $\mathcal{O}(1/\sqrt{T})$ when $\alpha_t = 1/\sqrt{t}$ is used [10], which improves on the results of [4, 30].
There have been several practical studies of deep learning optimizers. Shallue et al. [23] studied how increasing the batch size affects the performances of SGD, SGD with momentum [18, 20], and Nesterov momentum [17, 25]. Zhang et al. [29] studied the relationship between batch size and performance for Adam and K-FAC (Kronecker-factored approximate curvature [14]). In both studies, it was numerically shown that increasing the batch size tends to decrease the number of steps needed for training deep neural networks, but with diminishing returns. Moreover, it was shown that SGD with momentum and Nesterov momentum can exploit larger batches than SGD [23] and that K-FAC and Adam can exploit larger batches than SGD with momentum [29]. Thus, it has been shown that momentum and adaptive methods can greatly reduce the number of steps needed for training deep neural networks [23, Figure 4], [29, Figure 5]. Smith et al. [24] numerically showed that using enormous batches leads to reductions in the number of parameter updates and model training time.
1.2. Motivation
As described in Section 1.1, there have been theoretical studies of the relationship between the learning rate and convergence of deep learning optimizers. Furthermore, as also described in Section 1.1, the practical performance of a deep learning optimizer strongly depends on the batch size. Our goal in this paper is to clarify theoretically the relationship between batch size and the performance of deep learning optimizers. We focus on the relationship between the batch size and the number of steps needed for nonconvex optimization of several adaptive methods.
To explicitly show our contribution, we characterize the diminishing returns reported in [23, 29] by using the stochastic first-order oracle (SFO) complexities of deep learning optimizers. We define the SFO complexity of a deep learning optimizer as $N := Kb$ on the basis of the number of steps $K$ needed for training the deep neural network and the batch size $b$ used in the optimizer. Let $b^\star$ be a critical batch size such that there are diminishing returns for all batch sizes beyond $b^\star$, as asserted in [23, 29]. Thanks to the numerical evaluations in [23, 29], we know that
(1.1) $N(b^\star) = \min_{b > 0} N(b)$,
where $K(b)$ and $N(b)$ are defined for $b > 0$ by $K(b) :=$ (the number of steps needed to train the network with batch size $b$) and $N(b) := K(b)\,b$. For example, Figure 5(a) in [23] shows this behavior for the momentum method used to train a convolutional neural network on the MNIST dataset. This implies that, while the SFO complexity $N(b)$ is at first almost unchanged as $b$ grows (i.e., $K(b)$ halves for each doubling of $b$), $N(b)$ is minimized at the critical batch size $b^\star$, and there are diminishing returns once the batch size exceeds $b^\star$. Therefore, we can see that there are diminishing returns beyond the critical batch size that minimizes the SFO complexity.
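As a concrete, purely illustrative example of this bookkeeping (the measurements below are made-up placeholders, not values from [23, 29]), the following Python sketch computes the SFO complexity $N(b) = K(b)b$ from measured (batch size, steps) pairs and reads off the critical batch size as its minimizer.

```python
# Hypothetical (batch size, steps-to-target) measurements illustrating
# "perfect scaling" up to a critical batch size and diminishing returns beyond it.
measurements = {
    16: 8192, 32: 4096, 64: 2048, 128: 1024,   # K roughly halves as b doubles
    256: 640, 512: 480, 1024: 420,             # diminishing returns
}

# SFO complexity: number of stochastic gradient evaluations N(b) = K(b) * b.
sfo = {b: k * b for b, k in measurements.items()}

# The critical batch size is the global minimizer of N(b) over the measured grid.
b_star = min(sfo, key=sfo.get)
for b in sorted(sfo):
    print(f"b = {b:5d}  K(b) = {measurements[b]:6d}  N(b) = Kb = {sfo[b]:8d}")
print(f"critical batch size (minimizer of SFO complexity): b* = {b_star}")
```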
1.3. Goal
The goal of this paper is to develop a theory demonstrating the existence of critical batch sizes, which was shown numerically by Shallue et al. and Zhang et al. [23, 29].
To reach this goal, we first examine the relationship between the number of steps $K$ needed for nonconvex optimization (Section 2) and the batch size $b$ used in each adaptive method. We show that, under certain assumptions, the lower bound $K_{\mathrm{lower}}(b)$ and the upper bound $K_{\mathrm{upper}}(b)$ of $K$ are rational functions of the batch size $b$; i.e., $K_{\mathrm{lower}}$ and $K_{\mathrm{upper}}$ have the following forms (see Section 3 for details):
(1.2)
where the constants appearing in (1.2) are positive and depend on the parameters (e.g., the constant learning rate $\alpha$ and the momentum coefficient $\beta$) used in each adaptive method and on the precisions for nonconvex optimization. $K_{\mathrm{lower}}(b)$ and $K_{\mathrm{upper}}(b)$ defined by (1.2) are monotone decreasing and convex with respect to the batch size $b$. Accordingly, $K_{\mathrm{lower}}(b)$ and $K_{\mathrm{upper}}(b)$ decrease with an increase in the batch size $b$. For minimization of the empirical average loss function with a given number of samples $n$ (see (2.2)), $K_{\mathrm{lower}}(b)$ and $K_{\mathrm{upper}}(b)$ are minimized at $b = n$. We can also see that $K_{\mathrm{lower}}(b)$ and $K_{\mathrm{upper}}(b)$ have asymptotic values as $b \to +\infty$.
Figure 2 visualizes the relationships among the batch size $b$, the number of steps $K(b)$, and the SFO complexity $N(b) = K(b)b$. The number of steps needed for nonconvex optimization is inversely proportional to the batch size when the batch size is smaller than the critical batch size $b^\star$, and there are diminishing returns beyond the critical batch size $b^\star$, as shown by the numerical results of Shallue et al. and Zhang et al. [23, 29]. Moreover, the critical batch size $b^\star$ minimizes the SFO complexity $N(b)$, as seen in (1.1). We thus define the critical batch size $b^\star$ as the global minimizer of the SFO complexity $N(b)$.
From the forms of $K_{\mathrm{lower}}(b)$ and $K_{\mathrm{upper}}(b)$ in (1.2), we have the forms of the corresponding lower and upper bounds on the SFO complexity:
(1.3) $N_{\mathrm{lower}}(b) := K_{\mathrm{lower}}(b)\,b$ and $N_{\mathrm{upper}}(b) := K_{\mathrm{upper}}(b)\,b$.
We can show that $N_{\mathrm{lower}}$ and $N_{\mathrm{upper}}$ are convex; i.e., there are global minimizers, denoted by $b^\star_{\mathrm{lower}}$ and $b^\star_{\mathrm{upper}}$, of $N_{\mathrm{lower}}$ and $N_{\mathrm{upper}}$, as seen in Figure 2 (see Section 3 for details). Therefore, we can regard the batch sizes $b^\star_{\mathrm{lower}}$ and $b^\star_{\mathrm{upper}}$ as critical batch sizes in the sense of minimizing $N_{\mathrm{lower}}$ and $N_{\mathrm{upper}}$ defined by (1.3). Suppose that $N_{\mathrm{lower}}$ and $N_{\mathrm{upper}}$ defined by (1.3) approximate the actual SFO complexity $N$, which implies that
(1.4)
where the constants appearing in (1.4) are positive and do not depend on the constant learning rate $\alpha$ (see Section 3 for details). Hence, if the precisions are small, then a sufficiently small learning rate $\alpha$ can be used so that $N_{\mathrm{lower}}$ and $N_{\mathrm{upper}}$ approximate the actual SFO complexity $N$ (indeed, small learning rates were used in practice by Shallue et al. [23]). Then, the demonstration of the existence of the critical batch sizes $b^\star_{\mathrm{lower}}$ and $b^\star_{\mathrm{upper}}$ leads to the existence of the actual critical batch size $b^\star$.
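As a purely illustrative sketch of why a decreasing, convex $K(b)$ with a positive asymptote yields a convex SFO complexity $N(b) = K(b)b$ with an interior minimizer, the code below assumes a simple hyperbolic form $K(b) = C_1 b/(\epsilon^2 b - C_2)$; this form and the constants $C_1$, $C_2$, $\epsilon$ are assumptions made only for the illustration, not the bounds (1.2) derived in Section 3.

```python
# A minimal sketch, assuming (purely for illustration) a hyperbolic lower bound
#   K(b) = C1 * b / (eps**2 * b - C2),  defined for b > C2 / eps**2.
# C1, C2, and eps are hypothetical placeholders, not constants from the paper.
import numpy as np

C1, C2, eps = 1.0e3, 2.0, 0.1

def K(b):               # assumed rational bound on the number of steps
    return C1 * b / (eps**2 * b - C2)

def N(b):               # corresponding SFO complexity N(b) = K(b) * b
    return K(b) * b

b = np.linspace(1.1 * C2 / eps**2, 50 * C2 / eps**2, 10_000)
b_star_numeric = b[np.argmin(N(b))]
b_star_closed = 2 * C2 / eps**2          # stationary point of N under this assumed form

print(f"K(b) is monotone decreasing: {np.all(np.diff(K(b)) < 0)}")
print(f"numeric minimizer of N(b): {b_star_numeric:.1f}, closed form 2*C2/eps^2 = {b_star_closed:.1f}")
```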
2. Nonconvex Optimization and Deep Learning Optimizers
2.1. Nonconvex optimization
Let $\mathbb{R}^d$ be a $d$-dimensional Euclidean space with inner product $\langle \cdot, \cdot \rangle$ inducing the norm $\|\cdot\|$, and let $\mathbb{N}$ be the set of nonnegative integers. Let $\mathbb{S}^d_{++}$ be the set of $d \times d$ symmetric positive-definite matrices, and let $\mathbb{D}^d$ be the set of $d \times d$ diagonal matrices: $\mathbb{D}^d := \{ M \in \mathbb{R}^{d \times d} \colon M = \mathrm{diag}(x_1, x_2, \ldots, x_d),\ x_i \in \mathbb{R}\ (i = 1, 2, \ldots, d) \}$.
The mathematical model used here is based on that of Shallue et al. [23]. Given a parameter $x \in \mathbb{R}^d$ and a data point $z$ in a data domain $Z$, a machine learning model provides a prediction whose quality is measured by a differentiable nonconvex loss function $\ell(x; z)$. We minimize the expected loss defined for all $x \in \mathbb{R}^d$ by
(2.1) $f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[\ell(x; \xi)] = \mathbb{E}[\ell_\xi(x)]$,
where $\mathcal{D}$ is the probability distribution over $Z$, $\xi$ denotes a random variable with distribution $\mathcal{D}$, and $\mathbb{E}[\cdot]$ denotes the expectation taken with respect to $\xi$. A particularly interesting example of (2.1) is the empirical average loss defined for all $x \in \mathbb{R}^d$ by
(2.2) $f(x) := \frac{1}{n} \sum_{i \in [n]} \ell(x; z_i) = \frac{1}{n} \sum_{i \in [n]} \ell_i(x)$,
where $S := (z_1, z_2, \ldots, z_n)$ denotes the training set, $\ell_i(\cdot) := \ell(\cdot; z_i)$ denotes the loss function corresponding to the $i$-th training data instance $z_i$, and $[n] := \{1, 2, \ldots, n\}$. Our main objective is to find a local minimizer of $f$ over $\mathbb{R}^d$, i.e., a point $x^\star$ that belongs to the set
(2.3) $X^\star := \{ x^\star \in \mathbb{R}^d \colon \langle x - x^\star, \nabla f(x^\star) \rangle \geq 0 \text{ for all } x \in \mathbb{R}^d \}$,
where $\nabla f$ denotes the gradient of $f$. The inequality $\langle x - x^\star, \nabla f(x^\star) \rangle \geq 0$ is called the variational inequality of $\nabla f$ over $\mathbb{R}^d$ [6].
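As a toy illustration of (2.1)–(2.3) (the quadratic per-sample losses below are hypothetical and not from the paper), the empirical average loss and its gradient can be evaluated as follows; a point in $X^\star$ would make the printed gradient norm zero.

```python
# Toy instance of the empirical average loss (2.2): f(x) = (1/n) * sum_i f_i(x),
# with hypothetical least-squares per-sample losses f_i(x) = 0.5 * (a_i^T x - y_i)^2.
import torch

torch.manual_seed(0)
n, d = 100, 5
A, y = torch.randn(n, d), torch.randn(n)
x = torch.zeros(d, requires_grad=True)

f = 0.5 * ((A @ x - y) ** 2).mean()    # empirical average of the n per-sample losses
f.backward()                           # grad f(x) = (1/n) * sum_i grad f_i(x)
print(f.item(), x.grad.norm().item())  # a local minimizer would have zero gradient norm
```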
2.2. Deep learning optimizers
We assume the existence of a stochastic first-order oracle (SFO) such that, for a given $x \in \mathbb{R}^d$, it returns a stochastic gradient $\mathsf{G}_\xi(x)$ of the function $f$ defined by (2.1), where the random variable $\xi$ is supported on $Z$ and does not depend on $x$. Throughout this paper, we assume three standard conditions:
- (S1) $f \colon \mathbb{R}^d \to \mathbb{R}$ defined by (2.1) is continuously differentiable.
- (S2) Let $(x_k)_{k \in \mathbb{N}} \subset \mathbb{R}^d$ be the sequence generated by a deep learning optimizer. For each iteration $k$,
  (2.4) $\mathbb{E}_{\xi_k}[\mathsf{G}_{\xi_k}(x_k)] = \nabla f(x_k)$,
  where $\xi_0, \xi_1, \ldots$ are independent samples and the random variable $\xi_k$ is independent of $(x_0, x_1, \ldots, x_k)$. There exists a nonnegative constant $\sigma^2$ such that
  (2.5) $\mathbb{E}_{\xi_k}\big[ \| \mathsf{G}_{\xi_k}(x_k) - \nabla f(x_k) \|^2 \big] \leq \sigma^2$.
- (S3) For each iteration $k$, the deep learning optimizer samples a batch $B_k$ of size $b$ and estimates the full gradient $\nabla f$ as
  $\nabla f_{B_k}(x_k) := \frac{1}{b} \sum_{i=1}^{b} \mathsf{G}_{\xi_{k,i}}(x_k)$,
  where $\xi_{k,i}$ is the random variable generated by the $i$-th sampling in the $k$-th iteration.
In the case where $f$ is defined by (2.2), the stochastic gradient reduces to the gradient of a randomly chosen per-sample loss, and, for each $k \in \mathbb{N}$, $\nabla f_{B_k}(x_k) = \frac{1}{b} \sum_{i \in B_k} \nabla \ell_i(x_k)$.
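A minimal sketch of the mini-batch estimator in (S3), using hypothetical least-squares per-sample losses as in the previous sketch: the estimator averages $b$ sampled per-example gradients and is an unbiased estimate of the full gradient, consistent with (S2)(2.4).

```python
# Sketch of the mini-batch estimate in (S3): sample a batch B_k of size b and average
# the per-sample stochastic gradients, grad f_{B_k}(x) = (1/b) * sum_i G_i(x).
import torch

torch.manual_seed(0)
n, d, b = 100, 5, 8
A, y = torch.randn(n, d), torch.randn(n)
x = torch.zeros(d)

idx = torch.randint(0, n, (b,))                          # b independent samples
minibatch_grad = (A[idx].T @ (A[idx] @ x - y[idx])) / b  # (1/b) * sum of sampled gradients
full_grad = (A.T @ (A @ x - y)) / n                      # full gradient it estimates
print(minibatch_grad.norm().item(), full_grad.norm().item())
```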
We consider the following algorithm (Algorithm 1), which is a unified algorithm for most deep learning optimizers, including Momentum [18], AMSGrad [19], AMSBound [13], Adam [11], and AdaBelief [30], which are listed in Table 1 (the matrix $H_k$ used by AMSBound is defined as in [13]).
Let $\xi_k := (\xi_{k,1}, \xi_{k,2}, \ldots, \xi_{k,b})$ be the random samplings in the $k$-th iteration, and let $\xi_{[k]} := (\xi_0, \xi_1, \ldots, \xi_k)$ be the history of the process up to time step $k$. The adaptive methods listed in Table 1 all satisfy the following conditions:
- (A1) $H_k$ depends on $\xi_{[k]}$, and $H_k \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$ for all $k \in \mathbb{N}$;
- (A2) For all $i \in \{1, 2, \ldots, d\}$, there exists a positive number $B_i$ such that $\sup_{k \in \mathbb{N}} \mathbb{E}[h_{k,i}] \leq B_i$, where $h_{k,i}$ denotes the $i$-th diagonal entry of $H_k$.
We define $B := \max_{i \in \{1, 2, \ldots, d\}} B_i$. The optimizers in Table 1 obviously satisfy (A1). Previously reported results [4, p. 29], [30, p. 18], and [10] show that $H_k$ in Table 1 satisfies (A2). For example, AMSGrad and Adam satisfy (A2) with bounds determined by the upper bound on the stochastic gradients (see [4, 10]).
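The following rough numerical check (with synthetic bounded gradients as placeholders) illustrates the flavor of (A1)–(A2) for the AMSGrad-type preconditioner: the diagonal matrix built from the running maximum of the second-moment estimate stays diagonal, positive definite, and uniformly bounded when the stochastic gradients are bounded. This is only a sanity check of the kind of bound that (A2) asserts for the preconditioners in Table 1 below, not a proof.

```python
# Rough check: with bounded stochastic gradients, the AMSGrad-style diagonal
# preconditioner H_k = diag(sqrt(hat_v_k)) + eps*I stays uniformly bounded.
# The gradients here are synthetic placeholders drawn from a bounded distribution.
import numpy as np

rng = np.random.default_rng(0)
d, beta2, eps = 4, 0.999, 1e-8
v, v_hat, max_diag = np.zeros(d), np.zeros(d), 0.0
for k in range(10_000):
    g = rng.uniform(-1.0, 1.0, size=d)        # bounded stochastic gradient (placeholder)
    v = beta2 * v + (1 - beta2) * g**2        # second-moment estimate
    v_hat = np.maximum(v_hat, v)              # AMSGrad max step keeps the estimate monotone
    H_diag = np.sqrt(v_hat) + eps             # diagonal entries of H_k (all positive)
    max_diag = max(max_diag, H_diag.max())
print(f"largest diagonal entry of H_k over 10^4 steps: {max_diag:.4f} (<= 1 + eps here)")
```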
Table 1. The matrix $H_k$ used by each deep learning optimizer covered by Algorithm 1.
| Optimizer | $H_k$ |
|---|---|
| SGD | $H_k$ is the identity matrix. |
| Momentum [18] | $H_k$ is the identity matrix. |
| AMSGrad [19] | Diagonal matrix built from the element-wise maximum of exponential moving averages of squared stochastic gradients. |
| AMSBound [13] | Diagonal matrix obtained by clipping the AMSGrad-type preconditioner with given lower and upper bounds. |
| Adam [11] | Diagonal matrix built from an exponential moving average of squared stochastic gradients. |
| AdaBelief [30] | Diagonal matrix built from an exponential moving average of the squared deviations of the stochastic gradients from their first-moment estimates. |
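To make the unified structure concrete, the following schematic sketch implements an Algorithm-1-style update with an Adam-like diagonal $H_k$; it omits bias correction and the exact Table 1 definitions, so it should be read as an illustration of the general form rather than the paper's exact Algorithm 1.

```python
# Schematic sketch (not the paper's exact Algorithm 1) of the unified update
#   m_k = beta1 * m_{k-1} + (1 - beta1) * grad_k,   x_{k+1} = x_k - alpha * H_k^{-1} m_k,
# instantiated with an Adam-like diagonal H_k; bias corrections are omitted for brevity.
import numpy as np

def adaptive_step(x, grad, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = state
    m = beta1 * m + (1 - beta1) * grad          # first-order momentum
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate (Adam-like H_k)
    H_diag = np.sqrt(v) + eps                   # diagonal of H_k (symmetric positive definite)
    x = x - alpha * m / H_diag                  # x_{k+1} = x_k - alpha * H_k^{-1} m_k
    return x, (m, v)

# Usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x, state = np.ones(3), (np.zeros(3), np.zeros(3))
for _ in range(2_000):
    x, state = adaptive_step(x, grad=x, state=state)
print(np.linalg.norm(x))   # the iterate (and hence the gradient) is driven toward zero
```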
The theoretical performance metric used to approximate a local minimizer $x^\star \in X^\star$ (see (2.3)) is an $\epsilon$-approximation of the sequence $(x_k)_{k \in \mathbb{N}}$ generated by Algorithm 1 in the sense of the mean value of $\mathbb{E}[\|\nabla f(x_k)\|^2]$ ($k = 1, 2, \ldots, K$); that is,
(2.6)
where the two parameters in (2.6) are precisions. It would be sufficient for the quantity bounded in (2.6) to approximate the squared gradient norm at a local minimizer of $f$. Since consideration of the upper bound in (2.6) leads to the lower bound of $K$ satisfying (2.6), we also consider the lower bound in (2.6), which leads to the upper bound of $K$.
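One way to operationalize such a stopping criterion in code is sketched below; it assumes the criterion is a threshold $\epsilon^2$ on the running minimum of the squared gradient norms, which is only one concrete instance of a condition of the form (2.6), and the optimizer and the value of $\epsilon$ are placeholders.

```python
# Sketch of a stopping rule in the spirit of (2.6): monitor the running minimum of the
# squared gradient norms and stop once it falls below the target precision eps**2.
import numpy as np

def steps_to_epsilon(step_fn, x0, grad_fn, eps, max_steps=100_000):
    """Return the first K with min_{k <= K} ||grad f(x_k)||^2 <= eps**2."""
    x, best = x0, np.inf
    for k in range(1, max_steps + 1):
        g = grad_fn(x)
        best = min(best, float(np.dot(g, g)))
        if best <= eps**2:
            return k
        x = step_fn(x, g)
    return max_steps

# Toy usage: plain gradient descent on f(x) = 0.5 * ||x||^2.
K = steps_to_epsilon(lambda x, g: x - 0.1 * g, np.ones(3), lambda x: x, eps=1e-2)
print(K)
```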
3. Main Results
Let us suppose that (S1), (S2)(2.4), and (S3) hold. Then, Algorithm 1 satisfies the following equation (see Lemma A.2 for details): for all $k \in \mathbb{N}$ and all $x \in \mathbb{R}^d$,
(3.1)
where the terms appearing in (3.1) are as defined in Lemma A.2.
3.1. Lower bound of $K$
We provide the lower bound of the number of steps $K$ satisfying the $\epsilon$-approximation (2.6) under the following conditions:
- (L1) There exists a positive number $G$ such that $\mathbb{E}[\|\mathsf{G}_{\xi_k}(x_k)\|^2] \leq G^2$ for all $k \in \mathbb{N}$, where $(x_k)_{k \in \mathbb{N}}$ is the sequence generated by Algorithm 1.
- (L2) For all $x^\star \in X^\star$, there exists a positive number $D(x^\star)$ such that $\sup_{k \in \mathbb{N}} \|x_k - x^\star\| \leq D(x^\star)$.
Condition (L1) was used in previous studies [4, A2] and [30, Theorem 2.2] to analyze deep learning optimizers for nonconvex optimization. Here, we use (L1) to provide upper bounds on the momentum term and the search direction of Algorithm 1 (see Lemma A.3 for details); i.e.,
(3.2)
where (S2)(2.5) and (A1) are assumed. This implies that the corresponding term in (3.1) is bounded above (see Theorems A.1 and 3.1 for details).
Condition (L2) is assumed for both convex and nonconvex optimization, e.g., [16, p. 1574], [11, Theorem 4.1], [19, p. 2], [13, Theorem 4], and [30, Theorem 2.1]. Here, we use (L2) and (A2) to provide upper bounds on the remaining terms in (3.1) (see Theorem A.1 for details), where (S2)(2.5) and (L1) are assumed. Therefore, we have the following theorem (see the Appendix for a detailed proof):
Theorem 3.1.
The minimization of the SFO complexity $N_{\mathrm{lower}}(b) = K_{\mathrm{lower}}(b)\,b$ is as follows (the proof of Theorem 3.2 is given in the Appendix):
3.2. Upper bound of $K$
We provide the upper bound of the number of steps $K$ satisfying the $\epsilon$-approximation (2.6) under the following conditions:
- (U1) There exists a positive constant for which the corresponding lower-bound condition holds.
- (U2) There exist positive constants for which the corresponding lower-bound condition holds.
- (U3) For each $x^\star \in X^\star$, there exists a positive number for which the corresponding condition, which is stronger than (L2), holds for all $k \in \mathbb{N}$.
We consider the case where $H_k$ is the identity matrix in order to justify (U1). The definition of the search direction implies that it depends on $H_k$. Here, we assume the following condition, which is stronger than (S2)(2.5): there exists a positive constant such that
(3.3)
Then, (S2)(2.4) and (3.3) ensure a lower bound on the quantity appearing in (U1). Accordingly, the lower bound in (U1) depends on the gradient of $f$.
Algorithm 1 for nonconvex optimization satisfies that there exist positive numbers such that, for all $K \geq 1$,
(3.4)
(see [10, Theorem 1]). The sequence $(x_k)_{k \in \mathbb{N}}$ generated by Algorithm 1 can thus approximate a local minimizer $x^\star \in X^\star$. Accordingly, for a sufficiently large $k$, the lower bound in (U2) is consistent with the upper bounds in (3.2), where (S2)(2.5) and (L1) are assumed. Hence, we assume (U2) for Algorithm 1.
Condition (U3) is stronger than (L2) and expresses that $x_k$ approximates $x^\star$ more tightly than (L2) requires. Since the sequence $(x_k)_{k \in \mathbb{N}}$ generated by Algorithm 1 can approximate a local minimizer $x^\star$ (see (3.4)), (U3) is also adequate for consideration of the upper bound of $K$.
Theorem 3.3.
The minimization of the SFO complexity $N_{\mathrm{upper}}(b) = K_{\mathrm{upper}}(b)\,b$ is as follows (the proof of Theorem 3.4 is given in the Appendix):
Theorem 3.4.
Table 2. SGD
| Batch Size | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 16983.64 | 9103.76 | 6176.19 | 3759.25 | — | — | — | — |
| Accuracy (%) | 96.75 | 96.69 | 96.66 | 96.88 | — | — | — | — |

Table 3. Momentum
| Batch Size | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 7978.90 | 3837.72 | 2520.82 | 1458.70 | 887.01 | 678.66 | 625.10 | 866.65 |
| Accuracy (%) | 96.49 | 96.79 | 96.51 | 96.72 | 96.70 | 96.94 | 96.94 | 98.34 |

Table 4. Adam
| Batch Size | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 10601.78 | 4405.73 | 2410.28 | 1314.01 | 617.14 | 487.75 | 281.74 | 225.03 |
| Accuracy (%) | 96.46 | 96.38 | 96.65 | 96.53 | 96.43 | 96.68 | 96.58 | 96.72 |

Table 5. Adam (continued)
| Batch Size | | | | | | | |
|---|---|---|---|---|---|---|---|
| Time (s) | 197.78 | 195.40 | 233.70 | 349.81 | 691.04 | 644.19 | 1148.68 |
| Accuracy (%) | 96.74 | 97.21 | 97.54 | 97.75 | 97.51 | 99.05 | 99.03 |
3.3. Discussion
To guarantee that $N_{\mathrm{lower}}$ in Theorem 3.2 and $N_{\mathrm{upper}}$ in Theorem 3.4 fit the actual SFO complexity $N$, we need (1.4) to be satisfied.
Hence, if the precisions are small, then using a sufficiently small learning rate $\alpha$ allows $N_{\mathrm{lower}}$ and $N_{\mathrm{upper}}$ to approximate $N$. Indeed, small learning rates have been used in previous numerical validations [23]. The demonstration of the existence of $b^\star_{\mathrm{lower}}$ and $b^\star_{\mathrm{upper}}$ thus leads to the existence of the actual critical batch size $b^\star$.
4. Numerical Examples
We evaluated the performance of SGD, Momentum, and Adam with different batch sizes for training ResNet-20 on the CIFAR-10 dataset. The metrics were the number of steps $K$ and the SFO complexity $Kb$ needed to achieve the target precision, where the sequence $(x_k)_{k \in \mathbb{N}}$ was generated by each of SGD, Momentum, and Adam using batch size $b$. The stopping condition was a prescribed number of epochs. The experimental environment consisted of two Intel(R) Xeon(R) Gold 6148 2.4-GHz CPUs with 20 cores each, a 16-GB NVIDIA Tesla V100 900-Gbps GPU, and the Red Hat Enterprise Linux 7.6 OS. The code was written in Python 3.8.2 using the NumPy 1.17.3 and PyTorch 1.3.0 packages. A constant learning rate was commonly used. SGD and Momentum used the identity matrix for $H_k$ (Momentum additionally used a momentum coefficient), and Adam used $H_k$ as defined in Table 1 with the hyperparameter settings recommended in [11].
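For readers who want to reproduce the bookkeeping, the sketch below is a drastically scaled-down stand-in: a synthetic least-squares problem and a full-gradient-norm threshold replace ResNet-20, CIFAR-10, and the exact criterion used in the experiments, so its outputs are not comparable to the reported results and need not reproduce the critical-batch-size shape; it only shows how $K(b)$ and $N(b) = K(b)b$ are tallied per batch size.

```python
# Scaled-down stand-in for the experiment: for each batch size b, train with Adam
# until the full gradient norm drops below eps, record the number of steps K(b),
# and report the SFO complexity N(b) = K(b) * b. The synthetic data and tiny linear
# model replace ResNet-20 / CIFAR-10 for brevity.
import torch

torch.manual_seed(0)
n, d, eps = 2048, 20, 1e-1
A = torch.randn(n, d)
y = A @ torch.ones(d)                     # realizable targets so the toy problem converges

def steps_to_target(b, max_steps=20_000):
    x = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([x], lr=1e-3)
    for k in range(1, max_steps + 1):
        idx = torch.randint(0, n, (b,))   # mini-batch of size b, as in (S3)
        loss = 0.5 * ((A[idx] @ x - y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            full_grad = A.T @ (A @ x - y) / n
        if full_grad.norm() <= eps:
            return k
    return max_steps

for b in (8, 16, 32, 64, 128, 256):
    K = steps_to_target(b)
    print(f"b = {b:4d}  K(b) = {K:6d}  SFO N(b) = {K * b:8d}")
```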
Figure 4 shows the number of steps for SGD, Momentum, and Adam versus the batch size. For SGD and Momentum, the number of steps needed to reach the target initially decreased as the batch size increased. However, for sufficiently large batch sizes, SGD and Momentum did not reach the target before the stopping condition was reached. Adam had an initial period of perfect scaling (indicated by the dashed line) in which the number of steps was inversely proportional to the batch size $b$, and a critical batch size $b^\star$ beyond which the number of steps was no longer inversely proportional to the batch size; i.e., there were diminishing returns.
Figure 4 also plots the SFO complexities for SGD, Momentum, and Adam versus the batch size. The SFO complexities of SGD and Momentum were each minimized at a particular batch size, and the SFO complexity of Adam was minimized at the critical batch size $b^\star$; in particular, the SFO complexity of Adam at $b^\star$ was smaller than at the other batch sizes tested.
Tables 2, 3, 4, and 5 show the elapsed time and the training accuracy for SGD, Momentum, and Adam when the target was achieved. Table 2 shows that the elapsed time for SGD decreased with an increase in batch size. Table 3 shows that the elapsed time for Momentum monotonically decreased up to the second-largest batch size and increased at the largest batch size. This is because the SFO complexity for Momentum at the largest batch size was substantially higher than at the second-largest batch size, as shown in Figure 4. Tables 4 and 5 show that the elapsed time for Adam monotonically decreased up to the critical batch size $b^\star$ and that the elapsed time at $b^\star$ was the shortest. Beyond $b^\star$, the elapsed time increased with the SFO complexity, as shown in Figure 4.
We also investigated the behaviors of SGD, Momentum, and Adam for decaying learning rates, where the decay factor and the number of training steps were set on the basis of the results of a previous study [23, Figure 14]. None of the optimizers achieved the target before the stopping condition was reached.
5. Conclusion
We have clarified the lower and upper bounds of the number of steps needed for nonconvex optimization of adaptive methods and have shown that there exist critical batch sizes $b^\star_{\mathrm{lower}}$ and $b^\star_{\mathrm{upper}}$ in the sense of minimizing the lower and upper bounds of the stochastic first-order oracle (SFO) complexities of adaptive methods. Previous numerical evaluations [23, 29] showed that the actual critical batch size minimizes the SFO complexity. Hence, if the lower and upper bounds of the SFO complexity fit the actual SFO complexity, then the existence of $b^\star_{\mathrm{lower}}$ and $b^\star_{\mathrm{upper}}$ guarantees the existence of the actual critical batch size $b^\star$. We also demonstrated that it would be sufficient to use small learning rates for an adaptive method so that the lower and upper bounds of the SFO complexity fit the actual SFO complexity. Furthermore, we provided numerical examples supporting our theoretical results. They showed that Adam with a frequently used small constant learning rate had a critical batch size that minimizes the SFO complexity.
References
- [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. https://arxiv.org/pdf/1701.07875.pdf, 2017.
- [2] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60:223–311, 2018.
- [3] Hao Chen, Lili Zheng, Raed AL Kontar, and Garvesh Raskutti. Stochastic gradient descent in correlated settings: A study on Gaussian processes. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [4] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of Adam-type algorithms for non-convex optimization. In Proceedings of The International Conference on Learning Representations, 2019.
- [5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- [6] F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems I. Springer, New York, 2003.
- [7] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [8] Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gradient descent method for non-convex objective functions. Journal of Machine Learning Research, 21:1–48, 2020.
- [9] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
- [10] Hideaki Iiduka. Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2021.3107415, 2021.
- [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of The International Conference on Learning Representations, 2015.
- [12] Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for SGD: An adaptive learning rate for fast convergence. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, volume 130, 2021.
- [13] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. In Proceedings of The International Conference on Learning Representations, 2019.
- [14] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of Machine Learning Research, volume 37, pages 2408–2417, 2015.
- [15] Celestine Mendler-Dünner, Juan C. Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [16] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.
- [17] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^2)$. Doklady AN USSR, 269:543–547, 1983.
- [18] Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.
- [19] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In Proceedings of The International Conference on Learning Representations, 2018.
- [20] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
- [21] Kevin Scaman and Cédric Malherbe. Robustness analysis of non-convex stochastic gradient descent using biased expectations. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [22] Robin M. Schmidt, Frank Schneider, and Philipp Hennig. Descending through a crowded valley–Benchmarking deep learning optimizers. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 9367–9376, 2021.
- [23] Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research, 20:1–49, 2019.
- [24] Samuel L. Smith, Pieter-Jan Kindermans, and Quoc V. Le. Don’t decay the learning rate, increase the batch size. In Proceedings of The International Conference on Learning Representations, 2018.
- [25] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, 2013.
- [26] Tijmen Tieleman and Geoffrey Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4:26–31, 2012.
- [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [28] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2048–2057, 2015.
- [29] Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [30] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James S. Duncan. AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems, volume 33, 2020.
- [31] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, volume 23, 2010.
Appendix A Appendix
Unless stated otherwise, all relationships between random variables are supposed to hold almost surely. Let $H \in \mathbb{S}^d_{++}$. The $H$-inner product of $\mathbb{R}^d$ is defined for all $x, y \in \mathbb{R}^d$ by $\langle x, y \rangle_H := \langle x, Hy \rangle$, and the $H$-norm is defined by $\|x\|_H := \sqrt{\langle x, Hx \rangle}$.
A.1. Lemmas and Theorems
Lemma A.1.
Proof.
Lemma A.1 leads to the following:
Lemma A.2.
Proof.
Lemma A.3.
Proof.
Let . From (S2)(2.4), we have that
which, together with (S2)(2.5) and (L1), implies that
(A.3)
The convexity of , together with the definition of and (A.3), guarantees that, for all ,
Induction thus ensures that, for all ,
(A.4)
For $H_k \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$, [9, Theorem 7.2.6] guarantees the existence of a unique matrix $\overline{H}_k \in \mathbb{S}^d_{++}$ such that $H_k = \overline{H}_k^2$. We have that, for all $x \in \mathbb{R}^d$, $\|x\|_{H_k} = \|\overline{H}_k x\|$. Accordingly, the definitions of the $H_k$-norm and the search direction imply that, for all $k \in \mathbb{N}$,
where
and . Moreover, (A1) ensures that, for all ,
Hence, (A.4) implies that, for all ,
completing the proof. ∎
We are now in a position to prove the following theorems.
Theorem A.1.
Suppose that (S1)–(S3), (A1)–(A2), and (L1)–(L2) hold and consider Algorithm 1. Then, for all $K \geq 1$ and all $x^\star \in X^\star$,
where $D(x^\star)$ and $B_i$ are defined as in (L2) and (A2).
Proof.
Let . From the definition of in Lemma A.2, we have that
(A.5)
Since exists such that , we have for all . Accordingly, we have
From , we have that, for all , . Hence, for all ,
(A.6)
From (A1), we have that, for all and all ,
Moreover, from (L2), . Accordingly, for all ,
Therefore, (A.5), , and (A2) imply, for all ,
where . From , we have that
(A.7)
Lemma A.3 implies that, for all ,
which, together with , implies that
(A.8)
Lemma A.3 and Jensen’s inequality ensure that, for all ,
The Cauchy-Schwarz inequality and (L2) guarantee that, for all ,
(A.9)
Therefore, (A.2), (A.7), (A.8), and (A.9) lead to the assertion in Theorem A.1. This completes the proof. ∎
Theorem A.2.
A.2. Proof of Theorem 3.1
Proof.
Theorem A.1 guarantees that, for all and all ,
If the right-hand side of the above inequality is less than or equal to , then we have that
which implies that
where . We have that, for ,
Hence, $K_{\mathrm{lower}}(b)$ is monotone decreasing and convex with respect to the batch size $b$. This completes the proof. ∎
A.3. Proof of Theorem 3.2
Proof.
We have that, for ,
Hence,
which implies that $N_{\mathrm{lower}}(b) = K_{\mathrm{lower}}(b)\,b$ is convex with respect to $b$.
The point $b^\star_{\mathrm{lower}}$ attains the minimum value of $N_{\mathrm{lower}}$. This completes the proof. ∎
A.4. Proof of Theorem 3.3
Proof.
Theorem A.2 guarantees that, for all ,
If the right-hand side of the above inequality is more than or equal to , then we have that
which implies that
where . We have that, for ,
Hence, $K_{\mathrm{upper}}(b)$ is monotone decreasing and convex with respect to the batch size $b$. This completes the proof. ∎
A.5. Proof of Theorem 3.4
Proof.
We have that, for ,
Hence,
which implies that $N_{\mathrm{upper}}(b) = K_{\mathrm{upper}}(b)\,b$ is convex with respect to $b$.
The point $b^\star_{\mathrm{upper}}$ attains the minimum value of $N_{\mathrm{upper}}$. This completes the proof. ∎