Superlinear Optimization Algorithms
Abstract
This paper proposes several novel optimization algorithms for minimizing a nonlinear objective function. The algorithms are enlightened by the optimal state trajectory of an optimal control problem closely related to the minimized objective function. They are superlinear convergent when appropriate parameters are selected as required. Unlike Newton’s method, all of them can be also applied in the case of a singular Hessian matrix. More importantly, by reduction, some of them avoid calculating the inverse of the Hessian matrix or an identical dimension matrix and some of them need only the diagonal elements of the Hessian matrix. In these cases, these algorithms still outperform the gradient descent method. The merits of the proposed optimization algorithm are illustrated by numerical experiments.
I Introduction
The history of optimization problems is long and spans multiple disciplines, including mathematics, operations research, computer science, and engineering [1, 2, 3]. The roots of optimization can be traced back to the 18th century when mathematicians like Leonhard Euler and Joseph-Louis Lagrange began formulating and solving problems related to finding extrema (maxima or minima) of functions. Several important achievements include: the calculus of variations became a formal branch of mathematics [4]; the simplex method for solving linear programming problems led to the development of operations research [5]; the global optimization, dealing with finding the global optimum rather than the local optima of a nonlinear function, gained attention [6]; evolutionary and metaheuristic optimization algorithms gained popularity [7]; optimization techniques found extensive applications in operations research, engineering, finance, and various other fields. Integer programming, mixed-integer programming, and other discrete optimization techniques were developed to address real-world problems; optimization has been becoming more and more important for training models in machine learning and in various data science applications [8]. Gradient descent and its variants, such as stochastic gradient descent, have become essential optimization techniques in the context of neural networks [9].
The gradient descent [10] and Newton’s methods [11] are predominant optimization methods. The choice of step size is crucial for the convergence and stability of these methods. If the step size is too large, the methods may not converge, and if it is too small, it may converge very slowly [12, 13, 14]. To control the size of each iteration and improve the convergence properties of the algorithms, the line search or backtracking techniques are utilized to determine an appropriate step size at each iteration [15]. It means additional calculation costs for acquiring a step size. In practical implementations, the step size can be adjusted dynamically based on the behavior of the function. A body of variations and enhancements, such as stochastic gradient descent [16], momentum [17], adaptive learning rate methods [18, 19, 20], are thus proposed.
Besides, Newton’s method may diverge if the Hessian matrix is singular or the algorithm encounters numerical instability (when the Hessian matrix is ill-conditioned). To mitigate the aforementioned issues, quasi-Newton method [21], inexact Newton method [22], and modified Newton method [23] have been developed.
The paper attempts to solve optimization problems (OPs) by proposing novel algorithms, which are enlightened by the optimal state trajectory of the closely related optimal control problems (OCPs). Because our original algorithm is proposed almost along the optimal state trajectory of OCP, it holds some remarkable features. It converges more rapidly than gradient descent because it superlinearly converges; meanwhile, it is superior to Newton’s method because it can be applied in the case of a singular Hessian matrix. To reduce calculation costs, we also offer several variants of the original algorithm. It is worth pointing out that the variants of the original algorithm also inherit the above merits. All of them are superlinearly convergent.
The remainder is arranged as follows. In Section II, we propose the novel algorithm almost along an optimal state trajectory. Section III is devoted to the convergence analysis of the proposed algorithm and its several variants. The effectiveness of algorithms is illustrated in Section IV. Some conclusions are achieved in Section V.
Notation: Throughout the paper, the superscript stands for matrix transposition; denotes the -dimensional real vector space; is -dimensional unit matrix. For any square matrix , means that it is positive definite and represents the spectral radius of . and denote appropriate vector norm and matrix norm, respectively.
II Optimization algorithms enlightened by Optimal control
Assume that is a twice continuously differentiable nonlinear function. The optimization problem(OP)
(1) |
is generally solved by numerical iteration
(2) |
with being a designed function.
We will design in (2) with the aid of the following optimal control problem (OPC):
(3) | |||
(4) |
where and are the state and control of system (4), respectively, integer is the control terminal time, matrix is the control weight matrix, and (3) is the cost functional of OCP, closely related to OP (1).
By solving OCP (3)-(4), one can obtain the optimal control law and optimal state trajectory as below.
Lemma 1.
Proof.
The conclusion is direct by using the maximum principle and solving the Hamiltonian systems
(7) | ||||
(8) |
Please refer to [24] for the detail. ∎
Along the optimal state trajectory (6), we will explore several algorithms for solving OP (1). It is feasible only if OCP (3)-(4) is solvable. The fact that the global optimality means local optimality indicates that the optimal solution is in charge of regulating to minimize . Given that , the optimal can minimize by regulating the state . One can thereby assert that the optimal minimizes .
It seems impossible to exactly achieve the optimal state trajectory of (3)-(4) because of the nonlinearity of . Consequently, we attempt to approximate the optimal state trajectory for solving (1) by
(9) | |||
(10) |
where denotes the Hessian matrix of .
Lemma 2.
Proof.
Given that relies on for in (9)-(10), it can not iterate forward, we propose the following algorithm
(11) |
Lemma 3.
Proof.
Replace all for in the right-hand side of (9) with and get the iterative relation
(12) | |||
(13) |
According to (13), can be directly calculated as
(14) |
By combining (12) and (14), it is immediate to obtain (11). Observe the structure of (11), it is indeed different from any of the existing algorithms. Let . It is immediate to see that (11) is reduced to Newton’s method. ∎
III Superlinear algorithms
Lemma 3 encourages us to explore more algorithms.
III-A A superlinear algorithm
Note that in (11) is required to be less than . We thus modify (11) as
Algorithm I:
(15) |
Only if is strictly convex, there still hold and . Matrix slightly alters the step size and direction of Newton’s method, we thus guess that (15) has some advantages of Newton’s method, which will be proved later.
Theorem 1.
Proof.
The proof will be divided into two parts. Prove that the algorithm (15) is convergent and then show that it converges superlinearly.
III-B Superlinear algorithm without involving the inverse of Hessian matrix
It cannot be denied that in (15), the calculation cost focuses on the inverse of . The good news is that in (15) is an adjustable matrix, so is matrix . To save calculation,
we replace with an adjustable matrix . Note that . can thus be substituted by .
Then algorithm (15) can be modified as
Algorithm II:
(22) |
Algorithm (22) inherits the superlinear convergence of algorithm (15), which will be shown in the sequel.
Theorem 2.
Proof.
Denote
(24) |
Direct derivation shows
(25) | ||||
(26) |
From (22),
(27) |
In view of when , the proof is completed. ∎
Remark 1.
Remark 2.
When and , only if the spectral radius of is small enough, always holds by appropriately selecting .
III-C Superlinear algorithms without involving Hessian matrix
In addition, to alleviate the requirement for the second-order information, the Hessian matrix in algorithm (28)-(29) can be approximated by the first-order difference. As a result, algorithm (28)-(29) is replaced by
Algorithm III:
(30) | ||||
(31) |
where is an appropriate diagonal matrix and is the first-order backward difference of at point . Denote as the element of and as the element of . Then the element of is given by .
To reduce calculation cost further, we replace the Hessian matrix with a diagonal matrix . It has the identical diagonal elements with the Hessian matrix . That is to say, . As a result, algorithm (28)-(29) is modified as
Algorithm IV:
(32) | ||||
(33) |
where is an appropriate diagonal matrix.
Remark 3.
Only if we select such that and , then Algorithm IV is superlinearly convergent.
From the above discussion, the computational complexity of Algorithm I, Algorithm II, Algorithm III, and Algorithm IV has been reduced sequentially.
IV Conclusion
In the paper, several optimization algorithms have been developed. They are enlightened by the optimal trajectory of the related OCP. The algorithms converge more rapidly than the gradient descent. Meanwhile, unlike Newton’s method, they still work even if the Hessian matrix in iterate is singular or ill-conditioned. The convergence rate of the algorithms will be superlinear if the adjustable matrices are selected as required.
References
- [1] S. S. Rao, Engineering optimization: theory and practice. John Wiley & Sons, 2019.
- [2] A. D. Belegundu and T. R. Chandrupatla, Optimization concepts and applications in engineering. Cambridge University Press, 2019.
- [3] N. V. Banichuk, Introduction to optimization of structures. Springer Science & Business Media, 2013.
- [4] I. M. Gelfand, R. A. Silverman, et al., Calculus of variations. Courier Corporation, 2000.
- [5] P. Wolfe, “The simplex method for quadratic programming,” Econometrica: Journal of the Econometric Society, pp. 382–398, 1959.
- [6] A. Törn and A. Žilinskas, Global optimization, vol. 350. Springer, 1989.
- [7] M. Abdel-Basset, L. Abdel-Fatah, and A. K. Sangaiah, “Metaheuristic algorithms: A comprehensive review,” Computational intelligence for multimedia big data on the cloud with engineering applications, pp. 185–231, 2018.
- [8] E. K. Chong, W.-S. Lu, and S. H. Żak, An Introduction to Optimization: With Applications to Machine Learning. John Wiley & Sons, 2023.
- [9] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International conference on machine learning, pp. 1675–1685, PMLR, 2019.
- [10] J. D. Faires and R. L. Burden, Numerical methods. Thomson, 2003.
- [11] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
- [12] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood, “Online learning rate adaptation with hypergradient descent,” arXiv:1703.04782, 2017.
- [13] Z. Fei, Z. Wu, Y. Xiao, J. Ma, and W. He, “A new short-arc fitting method with high precision using adam optimization algorithm,” Optik, vol. 212, p. 164788, 2020.
- [14] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv:1609.04747, 2016.
- [15] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. MIT press, 2022.
- [16] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers, pp. 177–186, Springer, 2010.
- [17] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
- [18] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.,” Journal of machine learning research, vol. 12, no. 7, 2011.
- [19] B. O’donoghue and E. Candes, “Adaptive restart for accelerated gradient schemes,” Foundations of computational mathematics, vol. 15, pp. 715–732, 2015.
- [20] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv:1212.5701, 2012.
- [21] C. G. Broyden, “Quasi-newton methods and their application to function minimisation,” Mathematics of Computation, vol. 21, no. 99, pp. 368–381, 1967.
- [22] R. S. Dembo, S. C. Eisenstat, and T. Steihaug, “Inexact newton methods,” SIAM Journal on Numerical analysis, vol. 19, no. 2, pp. 400–408, 1982.
- [23] R. Fletcher and T. L. Freeman, “A modified newton method for minimization,” Journal of Optimization Theory and Applications, vol. 23, pp. 357–372, 1977.
- [24] Y. Xu, Z. Guo, H. Wang, and H. Zhang, “Optimization method based on optimal control,” arXiv preprint arXiv:2309.05280, 2023.
- [25] H. Zhang, H. Wang, Y. Xu, and Z. Guo, “Optimization methods rooted in optimal control,” Science China Information Sciences, to appear.
- [26] A. W. Marshall, I. Olkin, and B. C. Arnold, “Inequalities: theory of majorization and its applications,” 1979.