Comparisons Are All You Need for
Optimizing Smooth Functions
Abstract
When optimizing machine learning models, there are various scenarios where gradient computations are challenging or even infeasible. Furthermore, in reinforcement learning (RL), preference-based RL, which only compares between options, has wide applications, including reinforcement learning with human feedback in large language models. In this paper, we systematically study optimization of a smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ assuming only an oracle that compares function values at two points and tells which is larger. When $f$ is convex, we give two algorithms using $\tilde{O}(n/\epsilon)$ and $\tilde{O}(n^{2})$ comparison queries, respectively, to find an $\epsilon$-optimal solution. When $f$ is nonconvex, our algorithm uses $\tilde{O}(n/\epsilon^{2})$ comparison queries to find an $\epsilon$-approximate stationary point. All these results match the best-known zeroth-order algorithms with function evaluation queries in $n$ dependence, thus suggesting that comparisons are all you need for optimizing smooth functions using derivative-free methods. In addition, we also give an algorithm for escaping saddle points and reaching an $\epsilon$-second-order stationary point of a nonconvex $f$, using $\tilde{O}(n^{1.5}/\epsilon^{2.5})$ comparison queries.
1 Introduction
Optimization is pivotal in the realm of machine learning. For instance, advancements in stochastic gradient descent (SGD) such as ADAM [25], Adagrad [13], etc., serve as foundational methods for training deep neural networks. However, there exist scenarios where gradient computations are challenging or even infeasible, such as black-box adversarial attacks on neural networks [40, 33, 8] and policy search in reinforcement learning [42, 10]. Consequently, zeroth-order optimization methods with function evaluations have gained prominence, with provable guarantees for convex optimization [14, 37] and nonconvex optimization [16, 15, 22, 20, 51, 45, 4].
Furthermore, optimization for machine learning has recently been moving toward even less information. For instance, it is known that gradient descent using only the signs of gradient coordinates still enjoys good performance [32, 31, 6]. Moreover, in the breakthrough of large language models (LLMs), reinforcement learning from human feedback (RLHF) played an important role in training these LLMs, especially GPTs by OpenAI [39]. Compared to standard RL, which applies function evaluation for rewards, RLHF is preference-based RL that only compares between options and tells which is better. There is emerging research interest in preference-based RL. Refs. [9, 41, 38, 48, 52, 44] established provable guarantees for learning a near-optimal policy from preference feedback. Ref. [46] proved that for a wide range of preference models, preference-based RL can be solved with small or no extra cost compared to standard reward-based RL.
In this paper, we systematically study optimization of smooth functions using comparisons. Specifically, for a function $f\colon\mathbb{R}^{n}\to\mathbb{R}$, we define the comparison oracle of $f$ as $O_{f}^{\mathrm{Comp}}\colon\mathbb{R}^{n}\times\mathbb{R}^{n}\to\{1,-1\}$ such that

$$O_{f}^{\mathrm{Comp}}(\mathbf{x},\mathbf{y})=\begin{cases}1 & \text{if } f(\mathbf{x})\ge f(\mathbf{y})\\ -1 & \text{if } f(\mathbf{x})\le f(\mathbf{y}).\end{cases}\qquad(1)$$
(When $f(\mathbf{x})=f(\mathbf{y})$, outputting either $1$ or $-1$ is okay.) We consider an $L$-smooth function $f$, defined as a differentiable function satisfying

$$\|\nabla f(\mathbf{x}_{1})-\nabla f(\mathbf{x}_{2})\|\le L\|\mathbf{x}_{1}-\mathbf{x}_{2}\|\quad\forall\,\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}.$$
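For concreteness, the oracle in Eq. (1) is easy to mock up in code against a black-box function; the helper below is an illustrative sketch, not part of the paper's algorithms.

```python
import numpy as np

def make_comparison_oracle(f):
    """Wrap a black-box function f into the comparison oracle of Eq. (1):
    returns 1 if f(x) >= f(y), and -1 otherwise (ties may return either)."""
    def oracle(x, y):
        return 1 if f(x) >= f(y) else -1
    return oracle

f = lambda x: np.sum(x ** 2)   # an illustrative 2-smooth convex function
comp = make_comparison_oracle(f)
```

For example, `comp(np.array([1.0, 0.0]), np.array([0.0, 0.0]))` returns `1`, since the first point has the larger function value.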
Furthermore, we say $f$ is $\rho$-Hessian Lipschitz if

$$\|\nabla^{2}f(\mathbf{x}_{1})-\nabla^{2}f(\mathbf{x}_{2})\|\le\rho\|\mathbf{x}_{1}-\mathbf{x}_{2}\|\quad\forall\,\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{n}.$$
In terms of the goal of optimization, we define:

• $\mathbf{x}$ is an $\epsilon$-optimal point if $f(\mathbf{x})\le f(\mathbf{x}^{*})+\epsilon$, where $\mathbf{x}^{*}=\arg\min_{\mathbf{x}}f(\mathbf{x})$.

• $\mathbf{x}$ is an $\epsilon$-first-order stationary point ($\epsilon$-FOSP) if $\|\nabla f(\mathbf{x})\|\le\epsilon$.

• $\mathbf{x}$ is an $\epsilon$-second-order stationary point ($\epsilon$-SOSP) if $\|\nabla f(\mathbf{x})\|\le\epsilon$ and $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\ge-\sqrt{\rho\epsilon}$.
Our main results can be listed as follows:
Intuitively, our results can be described as comparisons are all you need for derivative-free methods: For finding an approximate minimum of a convex function, the state-of-the-art zeroth-order methods with full function evaluations have query complexities $O(n/\epsilon)$ [37] or $\tilde{O}(n^{2})$ [28], which are matched up to logarithmic factors by our Theorem 2 and Theorem 3 using comparisons, respectively. For finding an approximate stationary point of a nonconvex function, the state-of-the-art zeroth-order result has query complexity $O(n/\epsilon^{2})$ [15], which is matched by our Theorem 4 up to a logarithmic factor. In other words, in derivative-free scenarios for optimizing smooth functions, what matters is not function values per se but their comparisons, which indicate the direction in which the function decreases.
Among the literature on derivative-free optimization methods [27], direct search methods [26] proceed by comparing function values, with the directional direct search method [3] and the Nelder-Mead method [34] as examples. However, the directional direct search method does not have a known rate of convergence, while the Nelder-Mead method may fail to converge to a stationary point for smooth functions [12]. As far as we know, the most relevant result is by Bergou et al. [5], who proposed the stochastic three points (STP) method that finds an $\epsilon$-optimal point of a convex function and an $\epsilon$-FOSP of a nonconvex function in $O(n/\epsilon)$ and $O(n/\epsilon^{2})$ comparisons, respectively. STP also has a version with momentum [18]. Our Theorem 2 and Theorem 4 can be seen as rediscoveries of these results using different methods. However, for comparison-based convex optimization with $\log(1/\epsilon)$ dependence, Ref. [19] achieved this for strongly convex functions, and the state-of-the-art result for general convex optimization by Karabag et al. [24] takes $\tilde{O}(n^{4})$ comparison queries. Their algorithm applies the ellipsoid method, which has $\tilde{O}(n^{2})$ iterations, and each iteration takes $\tilde{O}(n^{2})$ comparisons to construct the ellipsoid. This bound is noticeably worse than our Theorem 3. As far as we know, our Theorem 5 is the first provable guarantee for finding an $\epsilon$-SOSP of a nonconvex function by comparisons.
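To illustrate, the STP iteration of [5] can be driven by comparisons alone, since it only keeps the best of three candidate points; below is a minimal sketch with an illustrative step-size schedule and test function (not the tuned parameters of [5]).

```python
import numpy as np

np.random.seed(0)  # for a reproducible run of this sketch

def stp(comp, x0, T, alpha0=1.0):
    """Stochastic three points: keep the best of x, x + a*s, x - a*s,
    chosen via pairwise comparisons only (comp(x, y) = 1 iff f(x) >= f(y))."""
    x = x0
    for t in range(1, T + 1):
        alpha = alpha0 / np.sqrt(t)        # illustrative step-size schedule
        s = np.random.randn(x.size)
        s /= np.linalg.norm(s)             # random unit direction
        for cand in (x + alpha * s, x - alpha * s):
            if comp(x, cand) == 1:         # f(cand) <= f(x): accept the move
                x = cand
    return x

f = lambda x: np.sum(x ** 2)               # illustrative smooth convex objective
comp = lambda x, y: 1 if f(x) >= f(y) else -1
x = stp(comp, np.full(5, 3.0), T=2000)     # function value decreases monotonically
```

Note that the acceptance test uses only the comparison oracle, never a function value, which is exactly why STP fits the model studied here.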
Techniques.
Our first technical contribution is Theorem 1, which for a point $\mathbf{x}$ estimates the direction of $\nabla f(\mathbf{x})$ within precision $\delta$. This is achieved by Algorithm 2, named Comparison-GDE (GDE is the acronym for gradient direction estimation). It is built upon a directional preference subroutine (Algorithm 1), which takes as input a unit vector $\mathbf{v}$ and a precision parameter $\Delta$, and outputs whether $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle$ is positive or negative (up to $\Delta$) using one query to the comparison oracle along $\mathbf{v}$. Comparison-GDE then has three phases:
• First, it sets $\mathbf{v}$ to be each of the standard basis directions $\mathbf{e}_{i}$ to determine the signs of all partial derivatives $\partial f/\partial x_{i}$ (up to precision).

• It then sets $\mathbf{v}\propto\mathbf{e}_{i}-\mathbf{e}_{j}$ (with signs adjusted), which can determine whether $|\partial f/\partial x_{i}|$ or $|\partial f/\partial x_{j}|$ is larger (up to precision). Starting with the first two coordinates and iterating, it finds the coordinate $i^{*}$ with the largest partial derivative in absolute value (up to precision).

• Finally, for each coordinate $i$, it sets $\mathbf{v}$ to have the form $\mathbf{e}_{i}-\alpha\,\mathbf{e}_{i^{*}}$ (suitably signed and normalized) and applies binary search to find the value $\alpha_{i}$ such that $\alpha_{i}\,|\partial f/\partial x_{i^{*}}|$ equals $|\partial f/\partial x_{i}|$ up to enough precision.
Comparison-GDE then outputs the normalized vector proportional to $\sum_{i}s_{i}\alpha_{i}\mathbf{e}_{i}$, where $s_{i}$ is the sign found in the first phase. In total it uses $O(n\log(n/\delta))$ comparison queries, with the main cost coming from the binary searches in the last step (the first two steps both take $O(n)$ comparisons).
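The three phases above can be sketched as follows, under idealized assumptions: the directional preference oracle `dp` returns exact signs (no precision parameter), the gradient is nonzero, and all names and tolerances are illustrative rather than the paper's Algorithm 2.

```python
import numpy as np

def comparison_gde(dp, n, tol=1e-4):
    """Simplified gradient direction estimation from directional preferences.
    dp(v) returns +1 if <g, v> >= 0 and -1 otherwise, for an unknown
    nonzero gradient g (idealized: exact signs, no precision parameter)."""
    e = np.eye(n)
    # Phase 1: signs s_i of all coordinates g_i
    s = np.array([dp(e[i]) for i in range(n)], dtype=float)
    # Phase 2: coordinate with the largest |g_i|, by pairwise tests
    imax = 0
    for i in range(1, n):
        v = (s[i] * e[i] - s[imax] * e[imax]) / np.sqrt(2.0)
        if dp(v) == 1:                       # |g_i| >= |g_imax|
            imax = i
    # Phase 3: binary-search each ratio |g_i| / |g_imax| in [0, 1]
    alpha = np.zeros(n)
    alpha[imax] = 1.0
    for i in range(n):
        if i == imax:
            continue
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            m = (lo + hi) / 2.0
            v = s[i] * e[i] - m * s[imax] * e[imax]
            if dp(v / np.linalg.norm(v)) == 1:   # |g_i| >= m * |g_imax|
                lo = m
            else:
                hi = m
        alpha[i] = (lo + hi) / 2.0
    g_hat = s * alpha
    return g_hat / np.linalg.norm(g_hat)

# idealized oracle for a fixed gradient g (stands in for comparison queries)
g = np.array([3.0, -1.0, 2.0])
dp = lambda v: 1 if np.dot(g, v) >= 0 else -1
est = comparison_gde(dp, 3)                  # close to g / ||g||
```

The query count of the sketch mirrors the analysis: the two scans are linear in $n$, and each of the $n-1$ binary searches contributes a logarithmic factor.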
We then leverage Comparison-GDE for solving various optimization problems. In convex optimization, we develop two algorithms that find an $\epsilon$-optimal point, separately in Section 3.1 and Section 3.2. Our first algorithm is a specialization of the adaptive version of normalized gradient descent (NGD) introduced in [30], where we replace the normalized gradient query in their algorithm by Comparison-GDE. It is natural to apply gradient direction estimation to normalized gradient descent, given that the comparison model only allows us to estimate the direction of the gradient without providing information about its norm. Note that Ref. [5] also discussed NGD, but their NGD-based algorithm still needs the full gradient and cannot be directly implemented by comparisons. Our second algorithm builds upon the framework of cutting plane methods, where we show that the output of Comparison-GDE is a valid separation oracle as long as it is accurate enough.
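A bare-bones normalized gradient descent loop makes clear why only the gradient direction is needed; here a finite-difference direction estimator stands in for Comparison-GDE, and the objective is illustrative.

```python
import numpy as np

def ngd(grad_dir, x0, eta, T):
    """Normalized gradient descent: each step only needs the gradient
    *direction*, which is exactly what a comparison oracle can provide."""
    x = np.array(x0, dtype=float)
    for _ in range(T):
        x = x - eta * grad_dir(x)
    return x

# Stand-in direction estimator (finite differences) for an illustrative
# objective f(z) = ||z - 1||^2; in the paper this role is played by
# Comparison-GDE, which uses comparisons only.
def grad_dir(x, h=1e-6):
    f = lambda z: np.sum((z - 1.0) ** 2)
    g = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(x.size)])
    return g / np.linalg.norm(g)

x = ngd(grad_dir, np.zeros(4), eta=0.07, T=100)   # ends within ~eta of the optimum
```

With a fixed step size the iterate can only reach an $O(\eta)$-neighborhood of the minimizer, which is why the adaptive step sizes of AdaNGD matter in the actual algorithm.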
In nonconvex optimization, we develop two algorithms that find an $\epsilon$-FOSP and an $\epsilon$-SOSP, respectively, in Section 4.1 and Section 4.2. Our algorithm for finding an $\epsilon$-FOSP is a specialization of the NGD algorithm, where the normalized gradient is given by Comparison-GDE. Our algorithm for finding an $\epsilon$-SOSP uses a similar approach as corresponding first-order methods by [2, 47] and proceeds in rounds, where we alternately apply NGD and negative curvature descent to ensure that the function value has a large decrease if a large fraction of the iterations in a round are not $\epsilon$-SOSPs. The normalized gradient descent part is essentially the same as our algorithm for finding an $\epsilon$-FOSP in Section 4.1. The negative curvature descent part with comparison information, however, is much more technically involved. In particular, previous first-order methods [2, 47, 49] all contain a subroutine that finds a negative curvature direction near a saddle point $\mathbf{x}$ with $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\le-\sqrt{\rho\epsilon}$. One crucial step in this subroutine is to approximate the Hessian-vector product $\nabla^{2}f(\mathbf{x})\cdot\mathbf{v}$ for some unit vector $\mathbf{v}$ by taking the difference between $\nabla f(\mathbf{x}+r\mathbf{v})$ and $\nabla f(\mathbf{x})$, where $r$ is a very small parameter. However, this is infeasible in the comparison model, which only allows us to estimate the gradient direction without providing information about its norm. Instead, we find the directions of $\nabla f(\mathbf{x})$, $\nabla f(\mathbf{x}+r\mathbf{v})$, and $\nabla f(\mathbf{x}-r\mathbf{v})$ by Comparison-GDE, and we determine the direction of $\nabla^{2}f(\mathbf{x})\cdot\mathbf{v}$ using the fact that its intersections with these gradient directions give two segments of the same length (see Figure 1).
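For reference, the Hessian-vector approximation step used by first-order methods [2, 47], which is unavailable in the comparison model because gradient norms cannot be observed, is the following finite difference (the quadratic test function is illustrative):

```python
import numpy as np

def hvp_fd(grad, x, v, r=1e-5):
    """Approximate the Hessian-vector product H(x) v by a finite difference
    of gradients; this needs gradient *values*, not just directions, which
    is exactly what the comparison model cannot supply."""
    return (grad(x + r * v) - grad(x)) / r

A = np.array([[2.0, 1.0], [1.0, 3.0]])
grad = lambda x: A @ x                # f(x) = x^T A x / 2, so the Hessian is A
v = np.array([1.0, -1.0])
hv = hvp_fd(grad, np.zeros(2), v)     # approximately A @ v
```

For a quadratic the difference is exact up to floating point, which makes the contrast clear: the comparison model must recover the same direction from three gradient *directions* via the geometric argument of Figure 1.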
Open questions.
Our work leaves several natural directions for future investigation:
• Can we give comparison-based optimization algorithms based on accelerated gradient descent (AGD) methods? This is challenging because AGD requires carefully chosen step sizes, but with comparisons we can only learn gradient directions and not the norms of gradients. This is also the main reason why the $\epsilon$-dependence in our Theorem 2 and Theorem 5 is worse than that of [37] and [50], which use evaluations in the respective settings.
• Can we improve our result for finding second-order stationary points in nonconvex optimization? Compared to gradient-based methods that choose the step size adaptively in negative curvature finding [2, 47], our comparison-based perturbed normalized gradient descent (Algorithm 5) can only use gradient directions and has no information about gradient norms, resulting in a fixed and conservative step size and more iterations in total.
• Can we apply our algorithms to machine learning? Ref. [44] made attempts on preference-based RL, and it is worth further exploring whether we can prove more theoretical results for preference-based RL and other machine learning settings. It would also be of general interest to see whether our results can provide theoretical justification for quantization in neural networks [17].
Notations.
We use bold letters, e.g., $\mathbf{x}$, $\mathbf{v}$, to denote vectors and capital letters, e.g., $A$, $B$, to denote matrices. We use $\|\cdot\|$ to denote the Euclidean norm ($\ell_{2}$-norm) and denote $\mathbb{S}^{n-1}$ to be the unit sphere in $\mathbb{R}^{n}$, i.e., $\mathbb{S}^{n-1}=\{\mathbf{x}\in\mathbb{R}^{n}:\|\mathbf{x}\|=1\}$. We denote $[n]:=\{1,2,\ldots,n\}$. For a convex set $\mathcal{K}\subseteq\mathbb{R}^{n}$, its diameter is defined as $D:=\sup_{\mathbf{x},\mathbf{y}\in\mathcal{K}}\|\mathbf{x}-\mathbf{y}\|$ and its projection operator $\Pi_{\mathcal{K}}$ is defined as

$$\Pi_{\mathcal{K}}(\mathbf{x}):=\arg\min_{\mathbf{y}\in\mathcal{K}}\|\mathbf{x}-\mathbf{y}\|.$$
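As a small concrete instance of the projection operator, projection onto a Euclidean ball has a closed form (an illustrative helper, not from the paper):

```python
import numpy as np

def project_ball(x, R):
    """Projection onto the Euclidean ball of radius R centered at the origin:
    points inside are unchanged, points outside are rescaled to the boundary."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x

p = project_ball(np.array([3.0, 4.0]), 1.0)   # rescaled to the unit sphere
q = project_ball(np.array([0.1, 0.0]), 1.0)   # already inside, unchanged
```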
2 Estimation of Gradient Direction by Comparisons
First, we show that given a point $\mathbf{x}$ and a unit direction $\mathbf{v}$, we can use one comparison query to determine whether the inner product $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle$ is roughly positive or negative. Intuitively, this inner product determines whether $\mathbf{v}$ points along or against the direction of $\nabla f(\mathbf{x})$, also known as directional preference (DP) in [24].
Lemma 1.
Given a point $\mathbf{x}$, a unit vector $\mathbf{v}$, and a precision $\Delta>0$ for directional preference, Algorithm 1 is correct:

• If $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\ge\Delta$, then it outputs $1$.

• If $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\le-\Delta$, then it outputs $-1$.
Proof.

Since $f$ is an $L$-smooth differentiable function,

$$|f(\mathbf{x}+\delta\mathbf{v})-f(\mathbf{x})-\delta\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle|\le\frac{L\delta^{2}}{2}$$

for any $\delta>0$. Taking $\delta=\Delta/L$, this gives

$$\Big|f(\mathbf{x}+\delta\mathbf{v})-f(\mathbf{x})-\frac{\Delta}{L}\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\Big|\le\frac{\Delta^{2}}{2L}.$$

Therefore, if $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\ge\Delta$,

$$f(\mathbf{x}+\delta\mathbf{v})-f(\mathbf{x})\ge\frac{\Delta^{2}}{L}-\frac{\Delta^{2}}{2L}=\frac{\Delta^{2}}{2L}>0,$$

and hence $O_{f}^{\mathrm{Comp}}(\mathbf{x}+\delta\mathbf{v},\mathbf{x})=1$. On the other hand, if $\langle\nabla f(\mathbf{x}),\mathbf{v}\rangle\le-\Delta$,

$$f(\mathbf{x}+\delta\mathbf{v})-f(\mathbf{x})\le-\frac{\Delta^{2}}{L}+\frac{\Delta^{2}}{2L}=-\frac{\Delta^{2}}{2L}<0,$$

and hence $O_{f}^{\mathrm{Comp}}(\mathbf{x}+\delta\mathbf{v},\mathbf{x})=-1$. ∎
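The two cases of Lemma 1 can be checked numerically; the sketch below assumes the step length $\Delta/L$, a standard choice consistent with the analysis, and uses the illustrative function $f(\mathbf{x})=\|\mathbf{x}\|^{2}$, which is $2$-smooth.

```python
import numpy as np

L_SMOOTH = 2.0                      # f(x) = ||x||^2 is 2-smooth
f = lambda x: np.sum(x ** 2)
comp = lambda x, y: 1 if f(x) >= f(y) else -1

def dp(x, v, Delta):
    """Directional preference via one comparison, with step Delta / L
    (an assumed step choice consistent with the smoothness argument)."""
    delta = Delta / L_SMOOTH
    return comp(x + delta * v, x)

x = np.array([1.0, 0.0])            # grad f(x) = (2, 0)
Delta = 0.5
out_pos = dp(x, np.array([1.0, 0.0]), Delta)    # <grad, v> = 2 >= Delta
out_neg = dp(x, np.array([-1.0, 0.0]), Delta)   # <grad, v> = -2 <= -Delta
```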
Now, we prove that we can use comparison queries to approximate the direction of the gradient at a point, which is one of our main technical contributions.
Theorem 1.

For an $L$-smooth function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ and a point $\mathbf{x}\in\mathbb{R}^{n}$, given a parameter $\gamma>0$ such that $\|\nabla f(\mathbf{x})\|\ge\gamma$, Algorithm 2 outputs an estimate $\hat{\mathbf{g}}$ of the direction of $\nabla f(\mathbf{x})$ using $O(n\log(n/\delta))$ queries to the comparison oracle of $f$ (Eq. (1)) that satisfies

$$\Big\|\hat{\mathbf{g}}-\frac{\nabla f(\mathbf{x})}{\|\nabla f(\mathbf{x})\|}\Big\|\le\delta.$$
(2)
(3)
(4)
Proof.
The correctness of (2) and (3) follows directly from the arguments in Line 3 and Line 5, respectively. For Line 8, since for any , the binary search can be regarded as having bins with interval lengths , and when the binary search ends Eq. (4) is satisfied. Furthermore, Eq. (4) can be written as
This is because implies , and together with (3) we have because .
We now estimate . Note and . Moreover
By Lemma 5 (for bounding the distance between normalized vectors) and the bound above,
Thus the correctness has been established. As for the query complexity, Line 3 takes $n$ queries, Line 5 takes $n-1$ queries, and Line 8 throughout the for loop takes $O(n\log(n/\delta))$ queries to the comparison oracle, given that each ratio lies in a bounded range and we approximate it to the required accuracy. This finishes the proof. ∎
3 Convex Optimization by Comparisons
In this section, we study convex optimization with function value comparisons:
Problem 1 (Comparison-based convex optimization).
In the comparison-based convex optimization (CCO) problem we are given query access to a comparison oracle (1) for an $L$-smooth convex function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ whose minimum is achieved at some $\mathbf{x}^{*}$ with $\|\mathbf{x}^{*}\|\le R$. The goal is to output a point $\mathbf{x}$ such that $\|\mathbf{x}\|\le R$ and $f(\mathbf{x})-f(\mathbf{x}^{*})\le\epsilon$, i.e., $\mathbf{x}$ is an $\epsilon$-optimal point.
We provide two algorithms that solve Problem 1. In Section 3.1, we use normalized gradient descent to achieve linear dependence in $1/\epsilon$ (up to a log factor) in terms of comparison queries. In Section 3.2, we use the cutting plane method to achieve $\log(1/\epsilon)$ dependence in terms of comparison queries.
3.1 Comparison-based adaptive normalized gradient descent
In this subsection, we present our first algorithm for Problem 1, Algorithm 3, which plugs the gradient direction estimated by Comparison-GDE (Algorithm 2) into each iteration of adaptive normalized gradient descent (AdaNGD), originally introduced by [30].
Theorem 2.
Algorithm 3 solves Problem 1 using $\tilde{O}(n/\epsilon)$ queries.
The following result bounds the rate at which Algorithm 3 decreases the function value of .
Lemma 2.
The proof of Lemma 2 is deferred to Appendix B. We now prove Theorem 2 using Lemma 2.
Proof of Theorem 2.
We show that Algorithm 3 solves Problem 1 by contradiction. Assume that the output of Algorithm 3 is not an $\epsilon$-optimal point of $f$, or equivalently, $f(\mathbf{x}_{t})-f(\mathbf{x}^{*})>\epsilon$ for every iterate $\mathbf{x}_{t}$. This leads to
given that $f$ is convex. Hence, Theorem 1 promises that
With these approximate gradient directions, by Lemma 2 we can derive that
a contradiction. This proves the correctness of Algorithm 3. The query complexity of Algorithm 3 comes only from the gradient direction estimation step in Line 3, which equals
∎
3.2 Comparison-based cutting plane method
In this subsection, we provide a comparison-based cutting plane method that solves Problem 1. We begin by introducing the basic concepts of cutting plane methods, which are algorithms that solve the feasibility problem defined as follows.
Problem 2 (Feasibility Problem, [21, 43]).
We are given query access to a separation oracle for a set $K\subseteq\mathbb{R}^{n}$ such that on query $\mathbf{x}\in\mathbb{R}^{n}$ the oracle outputs a vector $\mathbf{c}$ and either $\mathbf{c}=\mathbf{0}$, in which case $\mathbf{x}\in K$, or $\mathbf{c}\neq\mathbf{0}$, in which case the halfspace $\{\mathbf{z}:\langle\mathbf{c},\mathbf{z}\rangle\le\langle\mathbf{c},\mathbf{x}\rangle\}$ contains $K$. The goal is to query a point $\mathbf{x}\in K$.
[21] developed a cutting plane method that solves Problem 2 using $O(n\log(nR/r))$ queries to a separation oracle, where $R$ and $r$ are parameters related to the convex set $K$.
Lemma 3 (Theorem 1.1, [21]).
There is a cutting plane method which solves Problem 2 using at most $Cn\log(nR/r)$ queries for some constant $C$, given that the set $K$ is contained in the ball of radius $R$ centered at the origin and contains a ball of radius $r$.
Refs. [35, 29] showed that running a cutting plane method on a Lipschitz convex function $f$, with the separation oracle being the gradient of $f$, yields a sequence of points of which at least one is $\epsilon$-optimal. Furthermore, Ref. [43] showed that even if we cannot access the exact gradient of $f$, it suffices to use an approximate gradient estimate with small enough absolute error.
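In one dimension this reduces to bisection driven purely by comparisons: comparing $f(m-h)$ and $f(m+h)$ reveals the sign of the slope at the midpoint, which acts as the separation oracle (a toy sketch; the probe width `h` is illustrative).

```python
def comparison_bisect(comp, lo, hi, iters=60, h=1e-6):
    """Minimize a smooth unimodal f on [lo, hi] using only comparisons:
    comp(a, b) = 1 iff f(a) >= f(b). The sign of the slope at the midpoint,
    read off by comparing f(m - h) and f(m + h), cuts the interval in half."""
    for _ in range(iters):
        m = (lo + hi) / 2.0
        if comp(m - h, m + h) == 1:   # f decreasing at m: minimum lies right
            lo = m
        else:                         # f increasing at m: minimum lies left
            hi = m
    return (lo + hi) / 2.0

f = lambda x: (x - 0.3) ** 2          # illustrative smooth unimodal function
comp = lambda a, b: 1 if f(a) >= f(b) else -1
xmin = comparison_bisect(comp, -1.0, 1.0)
```

Each halving of the interval costs one comparison, giving the $\log(1/\epsilon)$ dependence that the cutting plane method generalizes to higher dimensions.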
In this work, we show that this result can be extended to the case where we have an estimate of the gradient direction instead of the gradient itself. Specifically, we prove the following result.
Theorem 3.
There exists an algorithm based on the cutting plane method that solves Problem 1 using $\tilde{O}(n^{2})$ queries.
Proof of Theorem 3.
The proof follows a similar intuition as the proof of Proposition 1 in [43]. Define $S_{\epsilon}$ to be the set of $\epsilon$-optimal points of $f$, and $S_{\epsilon/2}$ to be the set of $\epsilon/2$-optimal points of $f$. Given that $f$ is $L$-smooth, $S_{\epsilon/2}$ must contain a ball of radius at least $\sqrt{\epsilon/L}$, since for any $\mathbf{x}$ with $\|\mathbf{x}-\mathbf{x}^{*}\|\le\sqrt{\epsilon/L}$ we have

$$f(\mathbf{x})-f(\mathbf{x}^{*})\le\frac{L}{2}\|\mathbf{x}-\mathbf{x}^{*}\|^{2}\le\frac{\epsilon}{2}.$$
We apply the cutting plane method described in Lemma 3 to query a point in the set of $\epsilon/2$-optimal points, which is a subset of the ball of radius $R$ centered at the origin. To achieve this, at each query of the cutting plane method, we use Comparison-GDE, our comparison-based gradient direction estimation algorithm (Algorithm 2), as the separation oracle, where we set
We show that any query outside this set to Comparison-GDE serves as a valid separation oracle. In particular, if we ever query Comparison-GDE at any such point with output $\hat{\mathbf{g}}$, for any point in the set we have
where
given that is convex. Combined with Theorem 1, it guarantees that
Hence,
indicating that $\hat{\mathbf{g}}$ is a valid separation oracle for the set. Consequently, by Lemma 3, after the stated number of iterations at least one of the queries must lie within the target set, and we can output the query with minimum function value, which can be found by making pairwise comparisons. ∎
4 Nonconvex Optimization by Comparisons
In this section, we study nonconvex optimization with function value comparisons. We first develop an algorithm that finds an -FOSP of a smooth nonconvex function in Section 4.1. Then in Section 4.2, we further develop an algorithm that finds an -SOSP of a nonconvex function that is smooth and Hessian-Lipschitz.
4.1 First-order stationary point computation by comparisons
In this subsection, we focus on the problem of finding an -FOSP of a smooth nonconvex function by making function value comparisons.
Problem 3 (Comparison-based first-order stationary point computation).
In the Comparison-based first-order stationary point computation (Comparison-FOSP) problem we are given query access to a comparison oracle (1) for an $L$-smooth (possibly) nonconvex function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ satisfying $f(\mathbf{0})-\inf_{\mathbf{x}}f(\mathbf{x})\le\Delta$. The goal is to output an $\epsilon$-FOSP of $f$.
We develop a comparison-based normalized gradient descent algorithm that solves Problem 3.
Theorem 4.
With constant success probability, Algorithm 4 solves Problem 3 using $\tilde{O}(n/\epsilon^{2})$ queries.
The proof of Theorem 4 is deferred to Appendix C.1.
4.2 Escaping saddle points of nonconvex functions by comparisons
In this subsection, we focus on the problem of escaping from saddle points, i.e., finding an -SOSP of a nonconvex function that is smooth and Hessian-Lipschitz, by making function value comparisons.
Problem 4 (Comparison-based escaping from saddle point).
In the Comparison-based escaping from saddle point (Comparison-SOSP) problem we are given query access to a comparison oracle (1) for a (possibly) nonconvex function $f\colon\mathbb{R}^{n}\to\mathbb{R}$ that is $L$-smooth and $\rho$-Hessian Lipschitz and satisfies $f(\mathbf{0})-\inf_{\mathbf{x}}f(\mathbf{x})\le\Delta$. The goal is to output an $\epsilon$-SOSP of $f$.
Our algorithm for Problem 4, given in Algorithm 5, is a combination of comparison-based normalized gradient descent and comparison-based negative curvature descent (Comparison-NCD). Specifically, Comparison-NCD is built upon our comparison-based negative curvature finding algorithms, Comparison-NCF1 (Algorithm 8) and Comparison-NCF2 (Algorithm 9), which work when the gradient is small or large, respectively, and can decrease the function value efficiently when applied at a point with large negative curvature.
Lemma 4.
In the setting of Problem 4, for any $\mathbf{x}$ satisfying $\lambda_{\min}(\nabla^{2}f(\mathbf{x}))\le-\sqrt{\rho\epsilon}$, Algorithm 6 outputs a point $\mathbf{x}'$ satisfying
with success probability at least using queries.
The proof of Lemma 4 is deferred to Appendix C.3. Next, we present the main result of this subsection, which describes the complexity of solving Problem 4 using Algorithm 5.
Theorem 5.
With constant success probability, Algorithm 5 solves Problem 4 using an expected $\tilde{O}(n^{1.5}/\epsilon^{2.5})$ queries.
The proof of Theorem 5 is deferred to Appendix C.4.
Acknowledgements
We thank Yexin Zhang for helpful discussions. TL was supported by the National Natural Science Foundation of China (Grant Numbers 62372006 and 92365117), and The Fundamental Research Funds for the Central Universities, Peking University.
References
- [1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma, Finding approximate local minima faster than gradient descent, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195--1199, 2017, arXiv:1611.01146
- [2] Zeyuan Allen-Zhu and Yuanzhi Li, Neon2: Finding local minima via first-order oracles, Advances in Neural Information Processing Systems, pp. 3716--3726, 2018, arXiv:1711.06673
- [3] Charles Audet and John E. Dennis Jr, Mesh adaptive direct search algorithms for constrained optimization, SIAM Journal on Optimization 17 (2006), no. 1, 188--217.
- [4] Krishnakumar Balasubramanian and Saeed Ghadimi, Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points, Foundations of Computational Mathematics (2022), 1--42, arXiv:1809.06474
- [5] El Houcine Bergou, Eduard Gorbunov, and Peter Richtárik, Stochastic three points method for unconstrained smooth minimization, SIAM Journal on Optimization 30 (2020), no. 4, 2726--2749, arXiv:1902.03591
- [6] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar, signSGD: Compressed optimisation for non-convex problems, International Conference on Machine Learning, pp. 560--569, PMLR, 2018, arXiv:1802.04434
- [7] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford, Accelerated methods for nonconvex optimization, SIAM Journal on Optimization 28 (2018), no. 2, 1751--1772, arXiv:1611.00756
- [8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh, Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15--26, 2017, arXiv:1708.03999
- [9] Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, and Liwei Wang, Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation, International Conference on Machine Learning, pp. 3773--3793, PMLR, 2022, arXiv:2205.11140
- [10] Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard Turner, and Adrian Weller, Structured evolution with compact architectures for scalable policy optimization, International Conference on Machine Learning, pp. 970--978, PMLR, 2018, arXiv:1804.02395
- [11] Frank E. Curtis, Daniel P. Robinson, and Mohammadreza Samadi, A trust region algorithm with a worst-case iteration complexity of $\mathcal{O}(\epsilon^{-3/2})$ for nonconvex optimization, Mathematical Programming 162 (2017), no. 1-2, 1--32.
- [12] John E. Dennis, Jr and Virginia Torczon, Direct search methods on parallel machines, SIAM Journal on Optimization 1 (1991), no. 4, 448--474.
- [13] John Duchi, Elad Hazan, and Yoram Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011), no. 7, 2121--2159.
- [14] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono, Optimal rates for zero-order convex optimization: The power of two function evaluations, IEEE Transactions on Information Theory 61 (2015), no. 5, 2788--2806, arXiv:1312.2139
- [15] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang, SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator, Advances in Neural Information Processing Systems 31 (2018), arXiv:1807.01695
- [16] Saeed Ghadimi and Guanghui Lan, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization 23 (2013), no. 4, 2341--2368, arXiv:1309.5549
- [17] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer, A survey of quantization methods for efficient neural network inference, Low-Power Computer Vision, Chapman and Hall/CRC, 2022, arXiv:2103.13630, pp. 291--326.
- [18] Eduard Gorbunov, Adel Bibi, Ozan Sener, El Houcine Bergou, and Peter Richtarik, A stochastic derivative free optimization method with momentum, International Conference on Learning Representations, 2020, arXiv:1905.13278
- [19] Kevin G. Jamieson, Robert Nowak, and Ben Recht, Query complexity of derivative-free optimization, Advances in Neural Information Processing Systems 25 (2012), arXiv:1209.2434
- [20] Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang, Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization, International Conference on Machine Learning, pp. 3100--3109, PMLR, 2019, arXiv:1910.12166
- [21] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong, An improved cutting plane method for convex optimization, convex-concave games, and its applications, Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 944--953, 2020, arXiv:2004.04250
- [22] Chi Jin, Lydia T. Liu, Rong Ge, and Michael I. Jordan, On the local minima of the empirical risk, Advances in Neural Information Processing Systems 31 (2018), arXiv:1803.09357
- [23] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan, Accelerated gradient descent escapes saddle points faster than gradient descent, Conference on Learning Theory, pp. 1042--1085, 2018, arXiv:1711.10456
- [24] Mustafa O. Karabag, Cyrus Neary, and Ufuk Topcu, Smooth convex optimization using sub-zeroth-order oracles, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021), no. 5, 3815--3822, arXiv:2103.00667
- [25] Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimization, International Conference on Learning Representations, 2015, arXiv:1412.6980
- [26] Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon, Optimization by direct search: New perspectives on some classical and modern methods, SIAM review 45 (2003), no. 3, 385--482.
- [27] Jeffrey Larson, Matt Menickelly, and Stefan M. Wild, Derivative-free optimization methods, Acta Numerica 28 (2019), 287--404, arXiv:1904.11585
- [28] Yin Tat Lee, Aaron Sidford, and Santosh S. Vempala, Efficient convex optimization with membership oracles, Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 75, pp. 1292--1294, 2018, arXiv:1706.07357
- [29] Yin Tat Lee, Aaron Sidford, and Sam Chiu-wai Wong, A faster cutting plane method and its implications for combinatorial and convex optimization, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 1049--1065, IEEE, 2015, arXiv:1508.04874
- [30] Kfir Levy, Online to offline conversions, universality and adaptive minibatch sizes, Advances in Neural Information Processing Systems 30 (2017), arXiv:1705.10499
- [31] Xiuxian Li, Kuo-Yi Lin, Li Li, Yiguang Hong, and Jie Chen, On faster convergence of scaled sign gradient descent, IEEE Transactions on Industrial Informatics (2023), arXiv:2109.01806
- [32] Sijia Liu, Pin-Yu Chen, Xiangyi Chen, and Mingyi Hong, signSGD via zeroth-order oracle, International Conference on Learning Representations, 2019.
- [33] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu, Towards deep learning models resistant to adversarial attacks, International Conference on Learning Representations, 2018, arXiv:1706.06083
- [34] John A. Nelder and Roger Mead, A simplex method for function minimization, The Computer Journal 7 (1965), no. 4, 308--313.
- [35] Arkadi Nemirovski, Efficient methods in convex programming, Lecture notes (1994).
- [36] Yurii Nesterov and Boris T. Polyak, Cubic regularization of Newton method and its global performance, Mathematical Programming 108 (2006), no. 1, 177--205.
- [37] Yurii Nesterov and Vladimir Spokoiny, Random gradient-free minimization of convex functions, Foundations of Computational Mathematics 17 (2017), 527--566.
- [38] Ellen Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel Burdick, Dueling posterior sampling for preference-based reinforcement learning, Conference on Uncertainty in Artificial Intelligence, pp. 1029--1038, PMLR, 2020, arXiv:1908.01289
- [39] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022), 27730--27744, arXiv:2203.02155
- [40] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami, Practical black-box attacks against machine learning, Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506--519, 2017, arXiv:1602.02697
- [41] Aadirupa Saha, Aldo Pacchiano, and Jonathan Lee, Dueling RL: Reinforcement learning with trajectory preferences, International Conference on Artificial Intelligence and Statistics, pp. 6263--6289, PMLR, 2023, arXiv:2111.04850
- [42] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever, Evolution strategies as a scalable alternative to reinforcement learning, 2017, arXiv:1703.03864
- [43] Aaron Sidford and Chenyi Zhang, Quantum speedups for stochastic optimization, Advances in Neural Information Processing Systems 37 (2023), arXiv:2308.01582
- [44] Zhiwei Tang, Dmitry Rybin, and Tsung-Hui Chang, Zeroth-order optimization meets human feedback: Provable learning via ranking oracles, 2023, arXiv:2303.03751
- [45] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, and Georgios Piliouras, Efficiently avoiding saddle points with zero order methods: No gradients required, Advances in Neural Information Processing Systems 32 (2019), arXiv:1910.13021
- [46] Yuanhao Wang, Qinghua Liu, and Chi Jin, Is RLHF more difficult than standard RL? a theoretical perspective, Thirty-seventh Conference on Neural Information Processing Systems, 2023, arXiv:2306.14111
- [47] Yi Xu, Rong Jin, and Tianbao Yang, First-order stochastic algorithms for escaping from saddle points in almost linear time, Advances in Neural Information Processing Systems, pp. 5530--5540, 2018, arXiv:1711.01944
- [48] Yichong Xu, Ruosong Wang, Lin Yang, Aarti Singh, and Artur Dubrawski, Preference-based reinforcement learning with finite-time guarantees, Advances in Neural Information Processing Systems 33 (2020), 18784--18794, arXiv:2006.08910
- [49] Chenyi Zhang and Tongyang Li, Escape saddle points by a simple gradient-descent based algorithm, Advances in Neural Information Processing Systems 34 (2021), 8545--8556, arXiv:2111.14069
- [50] Hualin Zhang and Bin Gu, Faster gradient-free methods for escaping saddle points, The Eleventh International Conference on Learning Representations, 2023.
- [51] Hualin Zhang, Huan Xiong, and Bin Gu, Zeroth-order negative curvature finding: Escaping saddle points without gradients, Advances in Neural Information Processing Systems 35 (2022), 38332--38344, arXiv:2210.01496
- [52] Banghua Zhu, Jiantao Jiao, and Michael Jordan, Principled reinforcement learning with human feedback from pairwise or $K$-wise comparisons, ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023, arXiv:2301.11270
Appendix A Auxiliary Lemmas
A.1 Distance between normalized vectors
Lemma 5.
If are two vectors such that and , we have
Proof.
By the triangle inequality, we have
∎
Lemma 6.
If are two vectors such that , and are another two vectors such that where , we have
Proof.
By the triangle inequality, we have
On the one hand, by the triangle inequality and the Cauchy-Schwarz inequality,
On the other hand, by the Cauchy-Schwarz inequality, , and hence
In all, due to ,
∎
A.2 A fact for vector norms
Lemma 7.
For any nonzero vectors ,
Proof.
We have
∎
A.3 Gradient upper bound of smooth convex functions
Lemma 8 (Lemma A.2, [30]).
For any $L$-smooth convex function $f$ and any $\mathbf{x}\in\mathbb{R}^{n}$, we have $\|\nabla f(\mathbf{x})\|\le\sqrt{2L\big(f(\mathbf{x})-\min_{\mathbf{y}}f(\mathbf{y})\big)}$.
Appendix B Approximate adaptive normalized gradient descent (Approx-AdaNGD)
In this section, we prove technical details of the normalized gradient descent we use for convex optimization. Inspired by [30], which conducted a detailed analysis of the normalized gradient descent method, we first introduce the Approximate Adaptive Gradient Descent (Approx-AdaGrad) algorithm below:
Lemma 9.
Proof.
The proof follows the flow of the proof of Theorem 1.1 in [30]. For any and we have
and
Since is convex for each , we have
which leads to
where we denote . Further we can derive that
Moreover, we have
which leads to
∎
Proof of Lemma 2.
The proof follows the flow of the proof of Theorem 2.1 in [30]. In particular, observe that Algorithm 3 is equivalent to applying Approx-AdaGrad (Algorithm 7) to the following sequence of functions
Then by Lemma 9, for any we have
where
given that is convex, and is the diameter of . Hence,
Next, we proceed to bound the term in the denominator. By the Cauchy-Schwarz inequality,
which leads to
where
where the first inequality is by Lemma 8, the second inequality is by the convexity of , and the third inequality is due to Lemma 9. Further we can derive that
∎
Appendix C Proof details of nonconvex optimization by comparisons
C.1 Proof of Theorem 4
Proof of Theorem 4.
We prove the correctness of Theorem 4 by contradiction. For any iteration with , by Theorem 1 we have
indicating
That is to say, for any iteration such that is not an -FOSP, the function value will decrease by at least in this iteration. Furthermore, for any iteration with , by Theorem 1 we have
indicating
(5)
For any iteration with , we have
Combining (C.1) and the above inequality, we know that for any iteration such that is an -FOSP, the function value increases by at most in this iteration. Moreover, since
we can conclude that at least of the iterations have being an -FOSP, and randomly outputting one of them solves Problem 3 with success probability at least .
The query complexity of Algorithm 4 only comes from the gradient direction estimation step in Line 4, which equals
∎
C.2 Negative curvature finding by comparisons
In this subsection, we show how to find a negative curvature direction of a point satisfying . Observe that the Hessian matrix admits the following eigen-decomposition:
(6)
where the vectors form an orthonormal basis of . Without loss of generality we assume the eigenvalues corresponding to satisfy
(7)
where . Throughout this subsection, for any vector , we denote
to be the component of that is orthogonal to .
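The eigen-decomposition setup above can be made concrete with a small sketch: for a symmetric Hessian H with spectral norm at most L, power iteration on the shifted matrix L·I − H converges to the eigenvector with the most negative eigenvalue, which is the negative curvature direction sought in this subsection. The function name and the 2×2 example are illustrative, not from the paper.

```python
import math
import random

def power_iteration_neg_curvature(H, L, iters=200, seed=1):
    """Approximate the eigenvector of symmetric H with the most negative
    eigenvalue via power iteration on M = L*I - H, where L bounds the
    spectral norm of H (so M's top eigenvector is H's bottom one)."""
    n = len(H)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # w = (L*I - H) v
        w = [L * v[i] - sum(H[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Illustrative 2x2 Hessian with eigenvalues 1 and -2: the negative
# curvature direction is the second coordinate axis.
H = [[1.0, 0.0], [0.0, -2.0]]
v = power_iteration_neg_curvature(H, L=2.0)
# Rayleigh quotient v^T H v recovers the smallest eigenvalue.
rayleigh = sum(v[i] * sum(H[i][j] * v[j] for j in range(2))
               for i in range(2))
print(round(rayleigh, 6))  # -2.0
```

The comparison-based algorithms below implement this same power-iteration idea implicitly, using gradient-direction estimates in place of exact matrix-vector products.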
C.2.1 Negative curvature finding when the gradient is relatively small
In this part, we present our negative curvature finding algorithm that finds the negative curvature of a point with when the norm of the gradient is relatively small.
Lemma 10.
In the setting of Problem 4, for any satisfying
Algorithm 8 outputs a unit vector satisfying
with success probability at least using
queries.
To prove Lemma 10, without loss of generality we assume by shifting such that is mapped to . We denote for each iteration of Algorithm 8.
Lemma 11.
In the setting of Problem 4, for any iteration of Algorithm 8 with , we have
Proof.
Observe that
Given that is -Hessian Lipschitz, we have
Moreover, we have
which leads to
where the last inequality is due to the fact that
∎
Lemma 12.
Proof.
We prove this lemma by induction. In particular, assume
(9)
is true for all for some , which guarantees that
Then for , we have
and
(10)
Since by Lemma 11, we have
by Theorem 1. Moreover, observe that
(11)
where the norm of
is upper bounded by
given that is -Hessian Lipschitz and . Next, we proceed to bound the first term on the RHS of (10), where
where
and
Given that
is always true, we have
and
Combined with (10), we can derive that
(12)
(13) |
Similarly, we have
(14)
where the second term on the RHS of (14) satisfies
by Theorem 1, whereas the first term on the RHS of (14) satisfies
where the absolute value of
is upper bounded by
given that is -Hessian Lipschitz and
Combined with (14), we can derive that
Combined with (12), we have
Hence, if , (9) is also true for . Otherwise, we have and
Thus, we can conclude that (9) is true for all . This completes the proof. ∎
Lemma 13.
In the setting of Problem 4, for any with , the -th iteration of Algorithm 8 satisfies
(15)
if and .
Proof.
For any , similar to (14) in the proof of Lemma 12, we have
and
(16)
By Lemma 12 we have for each , which combined with Lemma 11 leads to . Thus, the second term on the RHS of (16) satisfies
by Theorem 1. Moreover, the first term on the RHS of (16) satisfies
where the absolute value of
is upper bounded by
given that is -Hessian Lipschitz and
Combined with (16), we can derive that
Considering that ,
where the last inequality is due to the fact that by Lemma 12. Hence, for any we have
Since is -smooth, we have
which leads to
Thus,
∎
Proof of Lemma 10.
We consider the case where , which happens with probability
In this case, by Lemma 13 we have
and
Let be the smallest integer such that . Then the output of Algorithm 8 satisfies
The query complexity of Algorithm 8 only comes from the gradient direction estimation step in Line 8, which equals
∎
C.2.2 Negative curvature finding when the gradient is relatively large
In this part, we present our negative curvature finding algorithm that finds the negative curvature of a point with when the norm of the gradient is relatively large.
The subroutine Comparison-Hessian-Vector in Line 9 of Algorithm 9 is given as Algorithm 10, whose output approximates the Hessian-vector product .
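Comparison-Hessian-Vector itself is built from comparison queries, but the quantity it approximates has a classical finite-difference characterization: H(x)v ≈ (∇f(x + δv) − ∇f(x − δv)) / (2δ). The sketch below uses an exact gradient oracle as a stand-in for the comparison-based estimator; all names are illustrative.

```python
def hessian_vector_fd(grad, x, v, delta=1e-5):
    """Central finite-difference approximation of the Hessian-vector
    product H(x) v ~= (grad(x + delta*v) - grad(x - delta*v)) / (2*delta),
    accurate to O(delta^2) when the Hessian is Lipschitz."""
    xp = [xi + delta * vi for xi, vi in zip(x, v)]
    xm = [xi - delta * vi for xi, vi in zip(x, v)]
    gp, gm = grad(xp), grad(xm)
    return [(a - b) / (2.0 * delta) for a, b in zip(gp, gm)]

# Usage on f(x) = x0^2 + 3*x0*x1, whose Hessian is [[2, 3], [3, 0]].
def grad_f(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

hv = hessian_vector_fd(grad_f, x=[0.5, -1.0], v=[1.0, 2.0])
print([round(c, 4) for c in hv])  # [8.0, 3.0], matching H @ v = [2+6, 3]
```

The error analysis in Lemma 14 plays the same role as the O(δ²) bound here, with additional terms accounting for the comparison-based gradient-direction estimates.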
Lemma 14.
In the setting of Problem 4, for any satisfying
Algorithm 10 outputs a vector satisfying
using queries.
Proof of Lemma 14.
Since is a -Hessian Lipschitz function,
(17)
(18) |
Therefore,
(19)
(20) |
Furthermore, because and is -smooth,
We first show how to approximate by the normalized vectors , and then analyze the approximation error incurred by using , respectively. By Lemma 7, we have
(21)
i.e., we denote the value above as . Because is -Hessian Lipschitz, . Since , . Also note that by Lemma 6 we have
This promises that
(22)
In the arguments that follow, we say a vector is -close to a vector if . We prove that the vector
(23)
is -close to a vector proportional to . This is because (17), (18), and Lemma 5 imply that
are -close to each other,
(24)
is proportional to , and the definition of implies
(25)
The above vector is -close to by (17), and the error in the above steps accumulates to at most by Lemma 6. In total .
Furthermore, this vector proportional to that is -close to (23) has norm at least because the coefficient in (24) is positive, while in the equality above we have . Therefore, applying Lemma 5, the vector in (23) satisfies
(26)
Following the same proof, we can prove that the vector
(27)
satisfies
(28)
Furthermore, (25) implies that is -close to
(29)
Because and , . Therefore, the RHS of (29) has norm at least , and by Lemma 5 we have
(30)
Finally, by Theorem 1 and our choice of the precision parameter, the error coming from running Comparison-GDE is:
(31)
Combined with (26) and (28), we know that the vector we obtained in Algorithm 10 is
(32)
close to . Since by (22), by Lemma 5 we have
(33)
In total, all the errors we have accumulated are (30) and (33):
(34)
Our selection of can guarantee that (34) is at most .
In terms of query complexity, we make 3 calls to Comparison-GDE. By Theorem 1 and the fact that our precision is
the total query complexity is . ∎
Based on Lemma 14, we obtain the following result.
Lemma 15.
In the setting of Problem 4, for any satisfying
Algorithm 9 outputs a unit vector satisfying
with success probability at least using
queries.
The proof of Lemma 15 is similar to the proof of Lemma 10. Without loss of generality we assume by shifting such that is mapped to . We denote for each iteration of Algorithm 9.
Lemma 16.
Proof.
We prove this lemma by induction. In particular, assume
(36)
is true for all for some , which guarantees that
Then for , we have
and
(37)
where
by Lemma 14. Next, we proceed to bound the first term on the RHS of (37). Note that
and
where
Consequently, we have
which leads to
Combined with (37), we can derive that
(38)
(39) |
Similarly, we have
(40)
where the second term on the RHS of (40) satisfies
by Lemma 14. Combined with (40), we can derive that
Consequently,
Thus, if , (36) is also true for . Otherwise, we have and
Thus, we can conclude that (36) is true for all . This completes the proof. ∎
Lemma 17.
Proof.
For any , similar to (40) in the proof of Lemma 16, we have
and
(42)
where the second term on the RHS of (42) satisfies
by Lemma 14. Moreover, the first term on the RHS of (42) satisfies
Consequently, we have
Meanwhile,
where the last inequality is due to the fact that by Lemma 16. Hence, for any we have
Since is -smooth, we have
which leads to
Thus,
∎
Proof of Lemma 15.
We consider the case where , which happens with probability
In this case, by Lemma 17 we have
and
Let be the smallest integer such that . Then the output of Algorithm 9 satisfies
The query complexity of Algorithm 9 only comes from the Hessian-vector product estimation step in Line 9, which equals
∎
C.3 Proof of Lemma 4
Proof.
By Lemma 10 and Lemma 15, at least one of the two unit vectors is a negative curvature direction. Quantitatively, with probability at least , at least one of the following two inequalities is true:
WLOG we assume the first inequality is true. Denote . Given that is -Hessian Lipschitz, we have
and
Hence,
which leads to
By Lemma 10 and Lemma 15, the query complexity of Algorithm 6 equals
∎
C.4 Escape from saddle point via negative curvature finding
Lemma 18.
In the setting of Problem 4, if the iterations of Algorithm 5 satisfy
then the number of -FOSP among is at least .
Proof.
For any iteration with , by Theorem 1 we have
indicating
That is to say, for any iteration such that is not an -FOSP, the function value will decrease by at least in this iteration. Moreover, given that
and
we can conclude that the number of -FOSP among is at least
∎
Lemma 19.
In the setting of Problem 4, if there are fewer than -SOSP among the iterations of Algorithm 5, with probability at least we have
Proof.
If , we directly have
Hence, we only need to prove the case with , where by Lemma 18 the number of -FOSP among is at least . Since there are fewer than -SOSP among the iterations , there exist
different values of such that
For each such , with probability the subroutine Comparison-NCD (Algorithm 6) is executed in this iteration. Conditioned on that, with probability at least its output satisfies
by Lemma 4. Hence, with probability at least
there exists a with
which leads to
where the second inequality is due to the fact that for any possible value of in . ∎
Proof of Theorem 5.
We assume that for any with containing fewer than -SOSP, we have
Given that there are at most different values of , by Lemma 19, the probability of this assumption being true is at least
(43)
Moreover, given that
there are at least different values of with
as we have for any . Hence, in this case the proportion of -SOSP among all the iterations is at least
Combined with (43), the overall success probability of outputting an -SOSP is at least .
The query complexity of Algorithm 5 comes from both the gradient estimation step in Line 5 and the negative curvature descent step in Line 5. By Theorem 1, the query complexity of the first part equals
whereas the expected query complexity of the second part equals
Hence, the overall query complexity of Algorithm 5 equals
∎