A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization
Abstract
We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of $T$ functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second-moment assumptions. We show that the number of calls to the stochastic first-order oracle and the linear-minimization oracle required by the proposed algorithm, to obtain an $\epsilon$-stationary solution, are of order $\mathcal{O}_T(\epsilon^{-2})$ and $\mathcal{O}_T(\epsilon^{-3})$, respectively, where $\mathcal{O}_T$ hides constants in $T$. Notably, the dependence of these complexity bounds on $\epsilon$ and $T$ is separate, in the sense that changing one does not impact the dependence of the bounds on the other. For the case of $T = 1$, we also provide a high-probability convergence result that depends poly-logarithmically on the inverse confidence level. Moreover, our algorithm is parameter-free and does not require any (increasing) order of mini-batches to converge, unlike the common practice in the analysis of stochastic conditional gradient-type algorithms.
1 Introduction
We study projection-free algorithms for solving the following stochastic multi-level composition optimization problem
$$\min_{x \in \mathcal{X}} \; F(x) := f_1\big(f_2\big(\cdots f_T(x)\big)\big), \qquad (1)$$
where the functions $f_i : \mathbb{R}^{d_{i+1}} \to \mathbb{R}^{d_i}$ (with $d_1 = 1$ and $d_{T+1} = d$) are continuously differentiable, the composite function $F := f_1 \circ \cdots \circ f_T$ is bounded below by $F^{*} > -\infty$, and $\mathcal{X} \subseteq \mathbb{R}^{d}$ is a closed convex set. We assume that the exact function values and derivatives of the $f_i$'s are not available. In particular, we assume that $f_i(u) = \mathbb{E}[G_i(u, \zeta_i)]$ for some random variable $\zeta_i$. Our goal is to solve the above optimization problem, given access to noisy evaluations of the $f_i$'s and their Jacobians.
There are two main challenges to address in developing efficient projection-free algorithms for solving (1). First, note that denoting the transpose of the Jacobian of $f_i$ by $\nabla f_i$, the gradient of the objective function in (1) is given by $\nabla F(x) = \nabla f_T(y_T)\, \nabla f_{T-1}(y_{T-1}) \cdots \nabla f_1(y_1)$, where $y_i := f_{i+1}(f_{i+2}(\cdots f_T(x)))$ for $1 \le i \le T-1$ and $y_T := x$. Because of the nested nature of the gradient $\nabla F$, obtaining an unbiased gradient estimator in the stochastic first-order setting, with controlled moments, becomes non-trivial. Using naive stochastic gradient estimators leads to oracle complexities that depend exponentially on $T$ (in terms of the accuracy parameter). Next, even when $T = 1$, in the stochastic setting, projection-free algorithms like the conditional gradient method or its sliding variants invariably require mini-batches of increasing order [LZ16, RSPS16, HL16, QLX18, YSC19] (we discuss in detail some recent works that avoid increasing mini-batches, albeit under stronger assumptions, in Section 1.2), which makes their practical implementation infeasible.
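As a concrete instance of the nested structure (a worked example, assuming the transposed-Jacobian convention above), for $T = 3$ the chain rule gives
$$\nabla F(x) = \nabla f_3(x)\, \nabla f_2\big(f_3(x)\big)\, \nabla f_1\big(f_2(f_3(x))\big).$$
Replacing each Jacobian factor by an independent unbiased estimate keeps the product unbiased, but the inner arguments $f_3(x)$ and $f_2(f_3(x))$ are themselves expectations and must also be estimated; plugging naive sample estimates into these arguments is precisely what introduces the bias discussed above.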
In this work, we propose a projection-free conditional gradient-type algorithm that achieves level-independent oracle complexities (i.e., the dependence of the complexities on the target accuracy $\epsilon$ is independent of $T$) using only one sample of $(\zeta_1, \ldots, \zeta_T)$ in each iteration, thereby addressing both of the above challenges. Our algorithm uses moving-average based stochastic estimators of the gradient and function values, also used recently by [GRW20] and [BGN22], along with an inexact conditional gradient (ICG) method used by [BG22] (which in turn is inspired by the sliding method of [LZ16]). In order to establish our oracle complexity results, we use a novel merit-function based convergence analysis. To the best of our knowledge, such an analysis technique is used for the first time in the context of analyzing stochastic conditional gradient-type algorithms.
1.1 Preliminaries and Main Contributions
We now introduce the technical preliminaries required to state and highlight the main contributions of this work. Throughout this work, $\|\cdot\|$ denotes the Euclidean norm for vectors and the Frobenius norm for matrices. We first describe the set of assumptions on the objective functions and the constraint set.
Assumption 1 (Constraint set).
The set $\mathcal{X} \subseteq \mathbb{R}^{d}$ is convex and closed with $\max_{x, y \in \mathcal{X}} \|x - y\| \le D_{\mathcal{X}} < \infty$.
Assumption 2 (Smoothness).
All functions $f_1, \ldots, f_T$ and their derivatives are Lipschitz continuous with Lipschitz constants $L_{f_i}$ and $L_{\nabla f_i}$, respectively.
The above assumptions on the constraint set and the objective function are standard in the literature on stochastic optimization, and in particular in the analysis of conditional gradient algorithms and multi-level optimization; see, for example, [LZ16], [YWF19] and [BGN22]. We emphasize here that the above smoothness assumptions are made only on the functions $f_i$ and not on the stochastic functions (which would be a stronger assumption). Moreover, the Lipschitz continuity of the $f_i$'s is implied by Assumption 1 together with the assumption that the functions are continuously differentiable. However, for the sake of illustration, we state both assumptions separately. In addition to these assumptions, we also make unbiasedness and bounded-variance assumptions on the stochastic first-order oracle. Due to their technical nature, we state the precise details later in Section 3 (see Assumption 3).
We next turn our attention to the convergence criterion that we use in this work to evaluate the performance of the proposed algorithm. Gradient-based algorithms iteratively solve sub-problems of the form
$$\min_{y \in \mathcal{X}} \Big\{ \langle g, y - x \rangle + \frac{1}{2\gamma}\|y - x\|^{2} \Big\} \qquad (2)$$
for some $x \in \mathcal{X}$, $g \in \mathbb{R}^{d}$, and $\gamma > 0$. Denoting the optimal solution of the above problem by $x^{+}$ and noting its optimality condition, one can easily show that
$$g \in -N_{\mathcal{X}}(x^{+}) + \mathcal{B}\big(\|x - x^{+}\|/\gamma\big),$$
where $N_{\mathcal{X}}(x^{+})$ is the normal cone to $\mathcal{X}$ at $x^{+}$ and $\mathcal{B}(r)$ denotes a ball centered at the origin with radius $r$. Thus, reducing the radius of the ball in the above relation will result in finding an approximate first-order stationary point for non-convex constrained minimization problems. Motivated by this fact, we can define the gradient mapping at a point $x \in \mathcal{X}$ as
$$G_{\mathcal{X}}(x, g, \gamma) := \frac{1}{\gamma}\Big(x - \Pi_{\mathcal{X}}\big(x - \gamma g\big)\Big), \qquad (3)$$
where $\Pi_{\mathcal{X}}$ denotes the Euclidean projection of a vector onto the set $\mathcal{X}$. The gradient mapping is a classical measure that has been widely used in the literature as a convergence criterion when solving nonconvex constrained problems [Nes18]. It plays a role analogous to that of the gradient in unconstrained optimization; in fact, when $\mathcal{X} = \mathbb{R}^{d}$, the gradient mapping $G_{\mathcal{X}}(x, \nabla F(x), \gamma)$ reduces to $\nabla F(x)$. It should be emphasized that while the gradient mapping cannot be computed in the stochastic setting, it still serves as a measure of convergence. Our main goal in this work is to find an $\epsilon$-stationary solution to (1), in the sense described below.
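To make the criterion concrete, here is a minimal sketch that evaluates the gradient mapping when $\mathcal{X}$ is a Euclidean ball, where the projection has a closed form; the ball constraint, the step size, and the toy quadratic objective are illustrative assumptions, not choices made in this paper.

```python
import numpy as np

def project_ball(v, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def gradient_mapping(x, g, gamma, radius=1.0):
    """G_X(x, g, gamma) = (x - Pi_X(x - gamma * g)) / gamma, as in (3)."""
    return (x - project_ball(x - gamma * g, radius)) / gamma

# Sanity check: for F(x) = 0.5 * ||x - b||^2 with b outside the unit ball,
# the constrained minimizer is b / ||b||, where the mapping vanishes.
b = np.array([2.0, 0.0])
x_star = b / np.linalg.norm(b)
print(np.linalg.norm(gradient_mapping(x_star, x_star - b, gamma=0.5)))  # ~0.0
```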
Definition 1.
A point $x \in \mathcal{X}$ generated by an algorithm for solving (1) is called an $\epsilon$-stationary point if $\mathbb{E}\big[\|G_{\mathcal{X}}(x, \nabla F(x), \gamma)\|^{2}\big] \le \epsilon$, where the expectation is taken over all the randomness involved in the problem.
In the literature on conditional gradient methods for the nonconvex setting, the so-called Frank-Wolfe gap is also widely used in convergence analyses. The Frank-Wolfe gap is defined formally as
$$\mathrm{gap}(x) := \max_{y \in \mathcal{X}} \; \langle \nabla F(x), x - y \rangle. \qquad (4)$$
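For comparison, the Frank-Wolfe gap requires only one call to a linear minimization oracle; the sketch below computes it over the $\ell_1$ ball (an assumed constraint set, chosen for illustration), whose oracle returns a signed coordinate vertex.

```python
import numpy as np

def lmo_l1(g, radius=1.0):
    """LMO for the l1 ball: argmin_{||v||_1 <= radius} <g, v> is a signed vertex."""
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[i] = -radius * np.sign(g[i])
    return v

def fw_gap(x, g, radius=1.0):
    """Frank-Wolfe gap (4): max_{y in X} <g, x - y> = <g, x - lmo(g)>."""
    return float(g @ (x - lmo_l1(g, radius)))
```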
As pointed out by [BG22], the gradient mapping criterion and the Frank-Wolfe gap are related to each other in the following sense.
Proposition 1.
For stochastic conditional gradient-type algorithms, the oracle complexity is measured in terms of the number of calls to the Stochastic First-order Oracle (SFO) and the Linear Minimization Oracle (LMO) used to solve the sub-problems (of the form of minimizing a linear function over the convex feasible set) arising in the algorithm. In this work, we hence measure the number of calls to the SFO and LMO required by the proposed algorithm to obtain an $\epsilon$-stationary solution in the sense of Definition 1. We now highlight our main contributions:
- We propose a one-sample, projection-free linearized NASA method with an inexact conditional gradient sub-routine (LiNASA+ICG, Algorithm 1) for solving (1), and show that it requires $\mathcal{O}_T(\epsilon^{-2})$ calls to the SFO and $\mathcal{O}_T(\epsilon^{-3})$ calls to the LMO to obtain an $\epsilon$-stationary solution (Theorem 2).
- The above SFO and LMO complexities are in particular level-independent (i.e., the dependence of the complexities on the target accuracy $\epsilon$ is independent of $T$). The proposed algorithm is parameter-free and does not require any mini-batches, making it applicable in the online setting.
- When considering the cases of $T = 1$ and $T = 2$, we present simplified methods (Algorithms 3 and 4) that obtain the same oracle complexities. Intriguingly, while this simplified approach is still parameter-free for $T = 1$, it is not when $T = 2$ (see Theorem 3 and the remark following it). Furthermore, for the case of $T = 1$, we also establish high-probability bounds (see Theorem 5).
A summary of oracle complexities for stochastic conditional gradient-type algorithms is in Table 1.
| Algorithm | Criterion | # of levels | Batch size | SFO | LMO |
|---|---|---|---|---|---|
| SPIDER-SFW [YSC19] | FW-gap | 1 | $\mathcal{O}(\epsilon^{-1})$ | $\mathcal{O}(\epsilon^{-3})$ | $\mathcal{O}(\epsilon^{-2})$ |
| 1-SFW [ZSM+20] | FW-gap | 1 | 1 | $\mathcal{O}(\epsilon^{-3})$ | $\mathcal{O}(\epsilon^{-3})$ |
| SCFW [ABTR21] | FW-gap | 2 | 1 | $\mathcal{O}(\epsilon^{-3})$ | $\mathcal{O}(\epsilon^{-3})$ |
| SCGS [QLX18] | GM | 1 | $\mathcal{O}(\epsilon^{-1})$ | $\mathcal{O}(\epsilon^{-2})$ | $\mathcal{O}(\epsilon^{-2})$ |
| SGD+ICG [BG22] | GM | 1 | $\mathcal{O}(\epsilon^{-1})$ | $\mathcal{O}(\epsilon^{-2})$ | $\mathcal{O}(\epsilon^{-3})$ |
| LiNASA+ICG (Algorithm 1) | GM | $T$ | 1 | $\mathcal{O}_T(\epsilon^{-2})$ | $\mathcal{O}_T(\epsilon^{-3})$ |
1.2 Related Work
Conditional Gradient-Type Method. The conditional gradient algorithm [FW56, LP66] has had a renewed interest in the machine learning and optimization communities in the past decade; see [Mig94, Jag13, HJN15, LJJ15, BS17, GKS21] for a partial list of recent works. Considering the stochastic convex setup, [HK12, HL16] provided expected oracle complexity results for the stochastic conditional gradient algorithm. The complexities were further improved by a sliding procedure in [LZ16], based on Nesterov's acceleration method. [RSPS16, YSC19, HL16] considered variance-reduced stochastic conditional gradient algorithms and provided expected oracle complexities in the non-convex setting. [QLX18] analyzed the sliding algorithm in the non-convex setting and provided results for the gradient mapping criterion. All of the above works require increasing orders of mini-batches to obtain their oracle complexity results.
[MHK20] and [ZSM+20] proposed stochastic conditional gradient-type algorithms with moving-average gradient estimators for the convex and non-convex settings that use only one sample in each iteration. However, [MHK20] and [ZSM+20] require several restrictive assumptions, which we explain next (focusing on [ZSM+20], which considers the nonconvex case). Specifically, [ZSM+20] requires that the stochastic function has uniformly bounded function value, gradient norm, and Hessian spectral norm, and that the distribution of the random vector has an absolutely continuous density such that the norm of the gradient and the spectral norm of the Hessian of the density have finite fourth and second moments, respectively. In contrast, we do not require such stringent assumptions. Next, all of the above works focus only on the case of $T = 1$. [ABTR21] considered a stochastic conditional gradient algorithm for solving (1) with $T = 2$. However, [ABTR21] also makes stringent assumptions: (i) the stochastic functions at both levels themselves have Lipschitz gradients almost surely, and (ii) for a given instance of the random vectors, one can query the oracle at both the current and previous iterates, which prevents the algorithm from being truly online. See Table 1 for a summary.
Stochastic Multi-level Composition Optimization. Compositional optimization problems of the form in (1) have been considered as early as the 1970s by [Erm76]. Recently, there has been a renewed interest in this problem. [EN13] and [DPR17] considered a sample-average approximation approach for solving (1) and established several asymptotic results. For the case of $T = 2$, [WFL17], [WLF16] and [BGI+17] proposed and analyzed stochastic gradient descent-type algorithms in the smooth setting. [DD19] and [DR18] considered the non-smooth setting and established oracle complexity results. Furthermore, [HZCH20] proposed algorithms for the case when the randomness between the two levels is not necessarily independent. For the general case of $T$ levels, [YWF19] proposed stochastic gradient descent-type algorithms and established oracle complexities that depend exponentially on $T$ and are hence sub-optimal. Indeed, the large deviation and Central Limit Theorem results established in [EN13] and [DPR17], respectively, show that in the sample-average approximation setting, the empirical minimizer of the problem in (1) based on $n$ samples converges at a level-independent rate (i.e., the dependence of the convergence rate on the target accuracy is independent of $T$) to the true minimizer, under suitable regularity conditions.
[GRW20] proposed a single time-scale Nested Averaged Stochastic Approximation (NASA) algorithm and established optimal rates for the cases of $T = 1, 2$. For the general case of $T$ levels, [BGN22] proposed a linearized NASA algorithm and established level-independent and optimal convergence rates. Concurrently, [Rus21] considered the case when the functions are non-smooth and established asymptotic convergence results. [ZX21] also established non-asymptotic level-independent oracle complexities, however under stronger assumptions than those in [BGN22]. Firstly, they assumed that for a fixed batch of samples, one can query the oracle at different points, which is not suited for the general online stochastic optimization setup. Next, they assume a much stronger mean-square Lipschitz smoothness assumption on the individual functions and their gradients. Finally, they require mini-batch sizes that depend exponentially on $T$, which makes their method impractical. Concurrent to [BGN22], level-independent rates were also obtained for unconstrained problems by [CSY21], albeit under the stronger assumption that the stochastic functions are Lipschitz almost surely. It is also worth mentioning that while some of the above papers considered constrained problems, the algorithms proposed and analyzed in these works are not projection-free, which is the main focus of this work.
2 Methodology
In this section, we present our projection-free algorithm for solving problem (1). The method generates three random sequences, namely, approximate solutions $\{x_k\}$, average gradients $\{z_k\}$, and average function values $\{u_k\}$, defined on a certain probability space. We let $\mathcal{F}_k$ denote the $\sigma$-algebra generated by these sequences up to iteration $k$. The overall method is given in Algorithm 1. In (7), the estimate $z_{k+1}$ is formed from the product of the stochastic (transposed) Jacobians across all levels. In (8), the same notation is used to represent both matrix-vector multiplication and vector-vector inner products. There are two aspects of the algorithm that we highlight specifically: (i) In addition to estimating the gradient of $F$, we also estimate a stochastic linear approximation of the inner functions by a moving-average technique. In the multi-level setting we consider, this helps us avoid the accumulation of bias that arises when estimating the inner function values directly. Linearization techniques have been used in stochastic optimization since the work of [Rus87]; a similar approach was used in [BGN22] in the context of projection-based methods for solving (1). It is also worth mentioning that other linearization techniques have been used in [DD19] and [DR18] for estimating the stochastic inner function values in weakly convex two-level composition problems. (ii) The ICG method given in Algorithm 2 essentially applies the deterministic conditional gradient method with exact line search to the quadratic minimization subproblem in (2), with the estimated gradient in (7). It was also used in [BG22] as a sub-routine and is motivated by the sliding approach of [LZ16].
$$x_{k+1} = x_k + \tau_k\,(y_k - x_k), \qquad (5)$$
$$y_k = \mathrm{ICG}(x_k, z_k, \beta, t_k), \qquad (6)$$
$$z_{k+1} = (1 - \tau_k)\, z_k + \tau_k\, \big(J_T^{k+1}\big)^{\!\top} \big(J_{T-1}^{k+1}\big)^{\!\top} \cdots \big(J_1^{k+1}\big)^{\!\top}, \qquad (7)$$
$$u^{(i)}_{k+1} = (1 - \tau_k)\, u^{(i)}_k + \tau_k\, G_i^{k+1} + J_i^{k+1}\big(u^{(i+1)}_{k+1} - u^{(i+1)}_k\big), \quad 1 \le i \le T, \qquad (8)$$
where $G_i^{k+1}$ and $J_i^{k+1}$ denote the stochastic oracle outputs at level $i$ evaluated at $u^{(i+1)}_k$, and $u^{(T+1)}_k := x_k$.
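The following sketch illustrates, for $T = 2$ and an $\ell_1$-ball constraint, how the pieces fit together: an ICG inner loop (Frank-Wolfe steps with exact line search on the quadratic subproblem) and one-sample moving-average updates with the linearization correction. The step-size rule, the inner-iteration schedule, and the oracle interface here are our own assumptions for illustration; the precise choices appear in Theorem 2.

```python
import numpy as np

def lmo_l1(g, radius=1.0):
    """LMO for the l1 ball: a signed coordinate vertex minimizes <g, .>."""
    i = np.argmax(np.abs(g))
    v = np.zeros_like(g)
    v[i] = -radius * np.sign(g[i])
    return v

def icg(x, z, beta, num_iters, radius=1.0):
    """Inexact conditional gradient on the subproblem (2):
    minimize <z, y - x> + (beta / 2) * ||y - x||^2 over the l1 ball,
    via Frank-Wolfe steps with exact line search."""
    y = x.copy()
    for _ in range(num_iters):
        grad_q = z + beta * (y - x)          # gradient of the quadratic model
        d = lmo_l1(grad_q, radius) - y       # Frank-Wolfe direction
        denom = beta * (d @ d)
        step = 1.0 if denom == 0.0 else min(1.0, -(grad_q @ d) / denom)
        y += step * d
    return y

def linasa_icg_two_level(sfo_inner, sfo_outer, x0, num_iters,
                         beta=1.0, radius=1.0):
    """One-sample sketch of Algorithm 1 for T = 2, F(x) = f1(f2(x)).
    sfo_inner(x) -> (noisy f2(x), noisy Jacobian of f2 at x);
    sfo_outer(u) -> noisy gradient of f1 at u."""
    x = x0.copy()
    G2, J2 = sfo_inner(x)
    u = G2.copy()                            # inner-function value estimate
    z = J2.T @ sfo_outer(u)                  # gradient estimate via chain rule
    for k in range(num_iters):
        tau = 1.0 / np.sqrt(num_iters)       # assumed O(1/sqrt(N)) step size
        t_k = int(np.sqrt(k + 1)) + 1        # assumed growing ICG schedule
        y = icg(x, z, beta, t_k, radius)
        x_new = x + tau * (y - x)
        G2, J2 = sfo_inner(x_new)
        g1 = sfo_outer(u)
        # moving averages; J2 @ (x_new - x) is the linearization term of (8)
        u = (1.0 - tau) * u + tau * G2 + J2 @ (x_new - x)
        z = (1.0 - tau) * z + tau * (J2.T @ g1)
        x = x_new
    return x
```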
3 Main Results
In this section, we present our main result on the oracle complexity of Algorithm 1. Before we proceed, we present our assumptions on the stochastic first-order oracle.
Assumption 3 (Stochastic First-Order Oracle).
Denote $u^{(T+1)} := x$. For each $1 \le i \le T$, with $u^{(i+1)}$ being the input, the stochastic oracle outputs a function-value estimate $G_i$ and a Jacobian estimate $J_i$ such that, given $\mathcal{F}_k$:
- (a) Unbiasedness: $\mathbb{E}\big[G_i(u^{(i+1)}, \zeta_i)\big] = f_i(u^{(i+1)})$ and $\mathbb{E}\big[J_i(u^{(i+1)}, \zeta_i)\big] = \nabla f_i(u^{(i+1)})$.
- (b) Bounded second moments: $\mathbb{E}\big[\|G_i(u^{(i+1)}, \zeta_i) - f_i(u^{(i+1)})\|^{2}\big] \le \sigma_{G_i}^{2}$ and $\mathbb{E}\big[\|J_i(u^{(i+1)}, \zeta_i)\|^{2}\big] \le \sigma_{J_i}^{2}$.
- (c) The outputs of the stochastic oracle at level $i$, $G_i$ and $J_i$, are independent. The outputs of the stochastic oracle are independent between levels, i.e., $G_1, \ldots, G_T$ are independent and so are $J_1, \ldots, J_T$.
Parts (a) and (b) in Assumption 3 are standard unbiasedness and bounded-variance assumptions on the stochastic gradient, common in the literature. Part (c) is essential to establish the convergence results in the multi-level case. Similar assumptions have been made, for example, in [YWF19] and [BGN22]. We also emphasize that, unlike some prior works (see, e.g., [ZSM+20]), Assumption 3 allows the case of endogenous uncertainty: we do not require the distribution of the random variables $\zeta_i$ to be independent of the decision variable $x$.
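As an illustration of an oracle satisfying Assumption 3, the stub below implements noisy evaluations for a toy two-level problem $f_2(x) = Bx$, $f_1(u) = \tfrac{1}{2}\|u\|^2$ (all specifics here, including the Gaussian noise model, are our own assumptions); it plugs directly into the two-level sketch from Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 8))      # toy inner map f2(x) = Bx (assumed)
sigma = 0.1                      # noise level (assumed)

def sfo_inner(x):
    """Unbiased, bounded-variance estimates of f2(x) and its Jacobian;
    the two noise draws are independent, as in Assumption 3(c)."""
    G2 = B @ x + sigma * rng.normal(size=5)
    J2 = B + sigma * rng.normal(size=B.shape)
    return G2, J2

def sfo_outer(u):
    """Unbiased estimate of grad f1(u) = u, independent of the inner level."""
    return u + sigma * rng.normal(size=u.shape)

# Usage with the earlier sketch:
# x = linasa_icg_two_level(sfo_inner, sfo_outer, np.zeros(8), num_iters=500)
```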
We start with the merit function used in this work and its connection to the gradient mapping criterion. Our proof leverages the following merit function:
$$W(x, z, u) := a\big(F(x) - F^{*}\big) - b\,\eta(x, z) + c\,\big\|\nabla F(x) - z\big\|^{2} + \sum_{i=1}^{T} d_i\,\big\|u^{(i)} - f_i(u^{(i+1)})\big\|^{2}, \qquad (9)$$
where $a, b, c, d_1, \ldots, d_T$ are positive constants and
$$\eta(x, z) := \min_{y \in \mathcal{X}} \Big\{ \langle z, y - x \rangle + \frac{\beta}{2}\|y - x\|^{2} \Big\} \qquad (10)$$
is the optimal value of the subproblem (2) with the gradient estimate $z$ in place of the gradient.
Compared to [BGN22], we require an additional term in the merit function, which turns out to be essential in our proof due to the ICG routine. The following proposition relates the merit function above to the gradient mapping.
Proposition 2.
Proof.
By expanding the square and using the properties of the projection operator, the squared norm of the gradient mapping can be bounded by a constant multiple of the terms appearing in the merit function. The proof is completed immediately by noting the nonexpansiveness of the Euclidean projection. ∎
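For intuition, a minimal sketch of the standard argument, assuming only the nonexpansiveness of $\Pi_{\mathcal{X}}$ and the elementary inequality $(a + b)^{2} \le 2a^{2} + 2b^{2}$: since
$$\big\|G_{\mathcal{X}}(x, \nabla F(x), \gamma) - G_{\mathcal{X}}(x, z, \gamma)\big\| = \frac{1}{\gamma}\big\|\Pi_{\mathcal{X}}\big(x - \gamma \nabla F(x)\big) - \Pi_{\mathcal{X}}\big(x - \gamma z\big)\big\| \le \|\nabla F(x) - z\|,$$
we obtain
$$\big\|G_{\mathcal{X}}(x, \nabla F(x), \gamma)\big\|^{2} \le 2\,\big\|G_{\mathcal{X}}(x, z, \gamma)\big\|^{2} + 2\,\|\nabla F(x) - z\|^{2},$$
so the gradient mapping at the true gradient is controlled by computable quantities tracked by the merit function.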
We now present our main result on the oracle complexity of Algorithm 1.
Theorem 2.
Under Assumptions 1, 2, and 3, let $\{x_k\}_{k=1}^{N}$ be the sequence generated by Algorithm 1 with
$$\tau_k = \frac{c}{\sqrt{N}} \ \ \text{for all } k \ge 1, \qquad (11)$$
where $c$ is an arbitrary positive constant, and suppose the merit function is defined as in (9) with the constants chosen as in (12). Then,
$$\mathbb{E}\Big[\big\|G_{\mathcal{X}}\big(x_R, \nabla F(x_R), \beta\big)\big\|^{2}\Big] = \mathcal{O}\Big(\frac{1}{\sqrt{N}}\Big), \qquad (13)$$
$$\mathbb{E}\Big[\|\nabla F(x_R) - z_R\|^{2} + \sum_{i=1}^{T}\big\|u^{(i)}_R - f_i(u^{(i+1)}_R)\big\|^{2}\Big] = \mathcal{O}\Big(\frac{1}{\sqrt{N}}\Big), \qquad (14)$$
where the hidden constant depends on the problem parameters and is given in (42). The expectation is taken with respect to all random sequences generated by the method and an independent random integer $R$ uniformly distributed over $\{1, \ldots, N\}$. That is to say, the number of calls to the SFO and LMO needed to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}_T(\epsilon^{-2})$ and $\mathcal{O}_T(\epsilon^{-3})$, respectively.
Remark.
The hidden constant is determined by the definition of the merit function and the choice of constants in (12), which further implies that the total numbers of calls to the SFO and LMO of Algorithm 1 for finding an $\epsilon$-stationary point of (1) are bounded by $\mathcal{O}_T(\epsilon^{-2})$ and $\mathcal{O}_T(\epsilon^{-3})$, respectively. Furthermore, it is worth noting that this complexity bound for Algorithm 1 is obtained without any dependence of the algorithm parameters on the Lipschitz constants, due to the arbitrary positive constant $c$ in (11); the parameters depend only on the number of iterations $N$. This makes Algorithm 1 parameter-free and easy to implement.
Remark.
As discussed in Section 2, the ICG routine given in Algorithm 2 is a deterministic method applied with the estimated gradient in (7). The number of iterations required to run Algorithm 2 at outer step $k$ grows with $k$; that is, we require more precise solutions from the ICG routine only at later outer iterations. Furthermore, due to the deterministic nature of the ICG routine, further advances in the analysis of deterministic conditional gradient methods under additional assumptions on the constraint set (see, for example, [GH15, GW21]) could be leveraged to improve the overall LMO complexity.
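To see how such a growing schedule drives the oracle counts, suppose, for illustration, that the ICG routine performs on the order of $\sqrt{k}$ inner iterations at outer step $k$ (an assumed schedule consistent with the stated totals, not necessarily the exact one from Theorem 2). Then the total LMO complexity over $N$ outer iterations is
$$\sum_{k=1}^{N} \mathcal{O}\big(\sqrt{k}\big) = \mathcal{O}\big(N^{3/2}\big),$$
which, with $N = \mathcal{O}(\epsilon^{-2})$ outer iterations and $T$ SFO calls per iteration, recovers the $\mathcal{O}_T(\epsilon^{-2})$ SFO and $\mathcal{O}_T(\epsilon^{-3})$ LMO complexities.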
3.1 The special cases of $T = 1$ and $T = 2$
We now discuss several intriguing points regarding the choice of the tuning parameter $\tau_k$ for the case of $T = 2$, and the more standard case of $T = 1$. Specifically, the linearization technique used in Algorithm 1 turns out not to be necessary for the cases $T \in \{1, 2\}$ to obtain similar rates. However, without linearization, the choice of $\tau_k$ depends on the problem parameters for $T = 2$, whereas it turns out to be independent of the problem parameters (similar to Algorithm 1 and Theorem 2, which hold for all $T$) for $T = 1$. As the estimates of the outer function values are not required for the convergence analysis, we remove them in Algorithms 3 and 4.
Theorem 3.
Let Assumptions 1, 2, and 3 be satisfied by the optimization problem (1). Let the constants appearing below, which depend on the problem parameters, be as defined in (54) and (62), and let $N$ denote the total number of iterations.
- (a) Let $T = 2$ and let $\{x_k\}$ be the sequence generated by Algorithm 3 with the parameter choice in (15). Then, we have $\mathbb{E}\big[\|G_{\mathcal{X}}(x_R, \nabla F(x_R), \beta)\|^{2}\big] = \mathcal{O}(1/\sqrt{N})$.
- (b) Let $T = 1$ and let $\{x_k\}$ be the sequence generated by Algorithm 4 with $\tau_k = 1/\sqrt{N}$. Then, we have $\mathbb{E}\big[\|G_{\mathcal{X}}(x_R, \nabla F(x_R), \beta)\|^{2}\big] = \mathcal{O}(1/\sqrt{N})$.
All expectations are taken with respect to all random sequences generated by the respective algorithms and an independent random integer $R$ uniformly distributed over $\{1, \ldots, N\}$. In both cases, the number of calls to the SFO and LMO needed to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-3})$, respectively.
Remark.
While we can obtain the same complexities without using the linear approximation of the inner function for $T = 2$, the linearization appears necessary for obtaining a parameter-free algorithm, as the choice of $\tau_k$ in (15) depends on knowledge of the problem parameters. Indeed, the linearization term in (8) helps us better exploit the Lipschitz smoothness of the gradients and obtain a tighter bound on the error of estimating the inner function values. Without this term, we are only able to use the Lipschitz continuity of the inner functions, and so the error estimate increases by an order. Hence, we need to choose a larger $\tau_k$ (as in (15)) to reduce and handle the error term without compromising the complexities. However, this is not the case for $T = 1$, as it can be seen as a two-level problem whose inner function is exactly known (the identity map). In this case, the choice of $\tau_k$ is independent of the problem parameters, with or without the linearization term.
3.2 High-Probability Convergence for $T = 1$
In this subsection, we establish an oracle complexity result holding with high probability for the case of $T = 1$. We first provide a notion of an $(\epsilon, \delta)$-stationary point and a related tail assumption on the stochastic first-order oracle below.
Definition 4.
A point $x \in \mathcal{X}$ generated by an algorithm for solving (1) is called an $(\epsilon, \delta)$-stationary point if $\|G_{\mathcal{X}}(x, \nabla F(x), \gamma)\|^{2} \le \epsilon$ holds with probability at least $1 - \delta$.
Assumption 4.
For each $k \ge 1$, given $\mathcal{F}_k$, the stochastic gradient error is mean-zero and its norm is $\sigma$-sub-Gaussian.
The above assumption is commonly used in the literature; see [HK14, HLPR19, LO20, ZCC+18]. We also refer to [Ver18] and Appendix E for additional details. The high-probability bound for solving non-convex constrained problems by Algorithm 4 is given below.
Theorem 5.
Let Assumptions 1, 2, and 4 be satisfied by the optimization problem (1) with $T = 1$. Let $\tau_k = 1/\sqrt{N}$ for all $k$, where $N$ is the total number of iterations, and let $\{x_k\}$ be the sequence generated by Algorithm 4. Then, with probability at least $1 - \delta$,
$$\min_{1 \le k \le N} \big\|G_{\mathcal{X}}\big(x_k, \nabla F(x_k), \beta\big)\big\|^{2} = \mathcal{O}\Big(\frac{\log(1/\delta)}{\sqrt{N}}\Big).$$
Therefore, the number of calls to the SFO and LMO needed to obtain an $(\epsilon, \delta)$-stationary point is upper bounded by $\tilde{\mathcal{O}}(\epsilon^{-2})$ and $\tilde{\mathcal{O}}(\epsilon^{-3})$, respectively, where $\tilde{\mathcal{O}}$ hides poly-logarithmic factors in $1/\delta$.
Remark.
To the best of our knowledge, the above result is (i) the first high-probability bound for a one-sample stochastic conditional gradient-type algorithm for the case of $T = 1$, and (ii) the first high-probability bound for constrained stochastic optimization algorithms in the non-convex setting; see Appendix J of [MDB21].
4 Proof Sketch of Main Results
In this section, we only present the proof sketch; the complete proofs are provided in the appendix. For convenience, we denote by $\eta_k$ the optimal value of the subproblem at step $k$ and by $y^{*}_k$ its optimal solution, i.e.,
$$y^{*}_k := \arg\min_{y \in \mathcal{X}} \Big\{ \langle z_k, y - x_k \rangle + \frac{\beta}{2}\|y - x_k\|^{2} \Big\}, \qquad \eta_k := \min_{y \in \mathcal{X}} \Big\{ \langle z_k, y - x_k \rangle + \frac{\beta}{2}\|y - x_k\|^{2} \Big\}. \qquad (16)$$
Then, the proof of Theorem 2 proceeds via the following steps:
1. We first establish a one-step inequality that bounds the decrease of the merit function $W$ along the iterates in terms of the estimation errors and the inexactness of the ICG routine.
2. Telescoping the above inequality, in Lemma 11 we obtain a bound on the accumulated stationarity measure in terms of the initial merit-function value, the accumulated estimation errors, and the accumulated ICG errors.
3. To further control the error term introduced by the ICG method, we choose the number of iterations of the ICG method at step $k$ to grow with $k$. By Lemma 8, the resulting ICG errors are then summable at the required rate. Also, with the choice of $\tau_k$, we can conclude that the accumulated estimation errors grow at most at the order of $\sqrt{N}$.
4. Then, taking expectations of both sides and using the definition of the random integer $R$, we have $\mathbb{E}\big[\|G_{\mathcal{X}}(x_R, \nabla F(x_R), \beta)\|^{2}\big] = \mathcal{O}(1/\sqrt{N})$, where the hidden constant depends on the problem parameters.
5. Finally, the bounds on the gradient and function-value estimation errors (Proposition 3) yield (14) and complete the proof.
The proofs of Theorems 3 and 5 follow the same argument with appropriate modifications. The high-probability convergence proof of Theorem 5 mainly consists of controlling the tail probability of the residual terms being large.
5 Discussion
In this work, we propose and analyze projection-free conditional gradient-type algorithms for constrained stochastic multi-level composition optimization of the form in (1). We show that the oracle complexity of the proposed algorithms is level-independent in terms of the target accuracy. Furthermore, our algorithm does not require any increasing order of mini-batches under standard unbiasedness and bounded second-moment assumptions on the stochastic first-order oracle, and is parameter-free. Some open questions for future research: (i) considering the one-sample setting, either improving the LMO complexity from $\mathcal{O}(\epsilon^{-3})$ to $\mathcal{O}(\epsilon^{-2})$ for general closed convex constraint sets, or establishing lower bounds showing that $\mathcal{O}(\epsilon^{-3})$ is necessary while keeping the SFO complexity of order $\mathcal{O}(\epsilon^{-2})$, is extremely interesting; and (ii) providing high-probability bounds for stochastic multi-level composition problems ($T \ge 2$) under sub-Gaussian or heavy-tailed assumptions (as in [MDB21, LZW22]) is interesting to explore.
Acknowledgment
TX was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, UC Davis and National Science Foundation (NSF) grant CCF-1934568. KB was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, UC Davis and NSF grant DMS-2053918. SG was partially supported by an NSERC Discovery Grant.
References
- [ABTR21] Zeeshan Akhtar, Amrit Singh Bedi, Srujan Teja Thomdapu, and Ketan Rajawat. Projection-Free Algorithm for Stochastic Bi-level Optimization. arXiv preprint arXiv:2110.11721, 2021.
- [AF21] Raul Astudillo and Peter Frazier. Bayesian optimization of function networks. Advances in Neural Information Processing Systems, 34, 2021.
- [BG22] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, 22(1):35–76, 2022.
- [BGI+17] Jose Blanchet, Donald Goldfarb, Garud Iyengar, Fengpei Li, and Chaoxu Zhou. Unbiased simulation for optimizing stochastic function compositions. arXiv preprint arXiv:1711.07564, 2017.
- [BGN22] Krishnakumar Balasubramanian, Saeed Ghadimi, and Anthony Nguyen. Stochastic multilevel composition optimization algorithms with level-independent convergence rates. SIAM Journal on Optimization, 32(2):519–544, 2022.
- [BS17] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164(1-2):1–27, 2017.
- [CFKM20] Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1393–1403, 2020.
- [CSY21] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Transactions on Signal Processing, 69:4937–4948, 2021.
- [DD19] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
- [DPR17] Darinka Dentcheva, Spiridon Penev, and Andrzej Ruszczyński. Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760, 2017.
- [DR18] John Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.
- [EN13] Yuri Ermoliev and Vladimir Norkin. Sample average approximation method for compound stochastic optimization problems. SIAM Journal on Optimization, 23(4):2231–2263, 2013.
- [Erm76] Yuri Ermoliev. Methods of stochastic programming. Nauka, Moscow, 1976.
- [FMO21] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34, 2021.
- [FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
- [GH15] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In International Conference on Machine Learning, pages 541–549. PMLR, 2015.
- [GKS21] Dan Garber, Atara Kaplan, and Shoham Sabach. Improved complexities of conditional gradient-type methods with applications to robust matrix recovery problems. Mathematical Programming, 186(1):185–208, 2021.
- [GRW20] Saeed Ghadimi, Andrzej Ruszczynski, and Mengdi Wang. A single timescale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 30(1):960–979, 2020.
- [GW21] Dan Garber and Noam Wolf. Frank-Wolfe with a nearest extreme point oracle. In Conference on Learning Theory, pages 2103–2132. PMLR, 2021.
- [HJN15] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.
- [HK12] Elad Hazan and Satyen Kale. Projection-free online learning. In 29th International Conference on Machine Learning, ICML 2012, pages 521–528, 2012.
- [HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
- [HL16] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
- [HLPR19] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
- [HZCH20] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. Advances in Neural Information Processing Systems, 33, 2020.
- [Jag13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435. PMLR, 2013.
- [LJJ15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.
- [LO20] Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive sgd with momentum. arXiv preprint arXiv:2007.14294, 2020.
- [LP66] Evgeny Levitin and Boris Polyak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1–50, 1966.
- [LZ16] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
- [LZW22] Zhipeng Lou, Wanrong Zhu, and Wei Biao Wu. Beyond sub-gaussian noises: Sharp concentration analysis for stochastic gradient descent. Journal of Machine Learning Research, 23:1–22, 2022.
- [MDB21] Liam Madden, Emiliano Dall’Anese, and Stephen Becker. High-probability convergence bounds for non-convex stochastic gradient descent. arXiv preprint arXiv:2006.05610, 2021.
- [MHK20] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. Journal of machine learning research, 2020.
- [Mig94] Athanasios Migdalas. A regularization of the Frank-Wolfe method and unification of certain nonlinear programming methods. Mathematical Programming, 65(1):331–345, 1994.
- [Nes18] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
- [QGX+21] Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An online method for a class of distributionally robust optimization with non-convex objectives. Advances in Neural Information Processing Systems, 34, 2021.
- [QHZ+22] Zi-Hao Qiu, Quanqi Hu, Yongjian Zhong, Lijun Zhang, and Tianbao Yang. Large-scale stochastic optimization of ndcg surrogates for deep learning with provable convergence. arXiv preprint arXiv:2202.12183, 2022.
- [QLX18] Chao Qu, Yan Li, and Huan Xu. Non-convex conditional gradient sliding. In International Conference on Machine Learning, pages 4208–4217. PMLR, 2018.
- [QLX+21] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of areas under precision-recall curves with provable convergence. Advances in Neural Information Processing Systems, 34, 2021.
- [RS06] Andrzej Ruszczyński and Alexander Shapiro. Optimization of convex risk functions. Mathematics of operations research, 31(3):433–452, 2006.
- [RSPS16] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic frank-wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251. IEEE, 2016.
- [Rus87] Andrzej Ruszczyński. A linearization method for nonsmooth stochastic programming problems. Mathematics of Operations Research, 12(1):32–49, 1987.
- [Rus21] Andrzej Ruszczynski. A stochastic subgradient method for nonsmooth nonconvex multilevel composition optimization. SIAM Journal on Control and Optimization, 59(3):2301–2320, 2021.
- [RWC03] David Ruppert, Matt P Wand, and Raymond J Carroll. Semiparametric regression. Number 12. Cambridge university press, 2003.
- [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [WFL17] Mengdi Wang, Ethan Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
- [WGE17] Gang Wang, Georgios B Giannakis, and Yonina C Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 64(2):773–794, 2017.
- [WLF16] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, 2016.
- [WYZY22] Guanghui Wang, Ming Yang, Lijun Zhang, and Tianbao Yang. Momentum accelerates the convergence of stochastic AUPRC maximization. In International Conference on Artificial Intelligence and Statistics, pages 3753–3771. PMLR, 2022.
- [YSC19] Alp Yurtsever, Suvrit Sra, and Volkan Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In International Conference on Machine Learning, pages 7282–7291. PMLR, 2019.
- [YWF19] Shuoguang Yang, Mengdi Wang, and Ethan Fang. Multilevel stochastic gradient methods for nested composition optimization. SIAM Journal on Optimization, 29(1):616–659, 2019.
- [ZCC+18] Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
- [ZSM+20] Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One-sample Stochastic Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, pages 4012–4023. PMLR, 2020.
- [ZX21] Junyu Zhang and Lin Xiao. Multilevel composite stochastic optimization via nested variance reduction. SIAM Journal on Optimization, 31(2):1131–1157, 2021.
Supplementary Materials
The supplementary materials are organized as follows. Appendix A provides motivating examples for stochastic multi-level optimization. Appendix B introduces the technical lemmas needed to complete the proofs. We present the complete proofs of Theorems 2 and 3 in Appendices C and D. Finally, we present the high-probability convergence analysis for the case $T = 1$ in Appendix E.
Appendix A Motivating Examples
Problems of the form in (1) are generalizations of the standard constrained stochastic optimization problem, which is obtained when $T = 1$, and arise in several machine learning applications. Some examples include sparse additive modeling in non-parametric statistics [WFL17, Section 4.1], Bayesian optimization [AF21], model-agnostic meta-learning [CSY21, FMO21], distributionally robust optimization [QGX+21], training graph neural networks [CFKM20], reinforcement learning [WLF16, Section 1.1] and AUPRC maximization [QLX+21, WYZY22, QHZ+22]. Below, we provide a concrete motivating example from the field of risk-averse stochastic optimization [RS06].
The mean-deviation risk-averse optimization problem is given by
$$\min_{x \in \mathcal{X}} \; \mathbb{E}\big[U(x, \xi)\big] + \lambda \sqrt{\mathbb{E}\Big[\big(U(x, \xi) - \mathbb{E}[U(x, \xi)]\big)^{2}\Big] + \delta},$$
where $U(x, \xi)$ is a stochastic loss and $\lambda > 0$ trades off mean performance against deviation. As noted by [YWF19], [Rus21] and [BGN22], the above problem is a stochastic 3-level composition optimization problem with
$$f_3(x) = \big(x, \; \mathbb{E}[U(x, \xi)]\big), \qquad f_2(x, u) = \big(u, \; \mathbb{E}\big[(U(x, \xi) - u)^{2}\big]\big), \qquad f_1(u, v) = u + \lambda\sqrt{v + \delta}.$$
Here, $\delta > 0$ is added to make the square root function smooth. In particular, we consider a semi-parametric data generating process given by a sparse single-index model of the form $y = g(\langle w^{*}, a \rangle) + \varepsilon$, where $g$ is called the link function, $w^{*}$ is assumed to be a sparse vector, and $\langle \cdot, \cdot \rangle$ represents the Euclidean inner product between two vectors. Such single-index models are widely used in statistics, machine learning and economics [RWC03]. A standard choice of the link function is the square function, in which case the model is also called the sparse phase retrieval model [WGE17]. Here, $a$ is the input data, which is assumed to be independent of the noise $\varepsilon$. If we consider the squared loss, the resulting per-sample loss $U$ is non-convex in the index vector. The goal is to estimate the sparse index vector in a risk-averse manner, as risk-averse formulations are well known to provide stable solutions [YWF19]. To encourage sparsity, the set $\mathcal{X}$ is taken to be the $\ell_1$ ball [Jag13].
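To make the three-level structure concrete, the sketch below instantiates this decomposition for a synthetic sparse phase-retrieval dataset; the data model, the penalty weight $\lambda$, the smoothing constant $\delta$, and the use of mini-batch means in place of the expectations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 2000
w_true = np.zeros(d)
w_true[:3] = [1.0, -0.5, 0.25]                     # sparse index vector
A = rng.normal(size=(n, d))                        # inputs a_i
y = (A @ w_true) ** 2 + 0.1 * rng.normal(size=n)   # square link plus noise

def U(x, idx):
    """Per-sample squared loss U(x, xi) for the phase-retrieval model."""
    return ((A[idx] @ x) ** 2 - y[idx]) ** 2

lam, delta = 1.0, 1e-3                             # assumed risk/smoothing levels

def f3(x, idx):   # innermost level: x -> (x, E[U(x, xi)])
    return x, np.mean(U(x, idx))

def f2(xu, idx):  # middle level: (x, u) -> (u, E[(U(x, xi) - u)^2])
    x, u = xu
    return u, np.mean((U(x, idx) - u) ** 2)

def f1(uv):       # outermost level: (u, v) -> u + lam * sqrt(v + delta)
    u, v = uv
    return u + lam * np.sqrt(v + delta)

# Independent batches across levels mimic the independence in Assumption 3(c).
b1, b2 = rng.integers(0, n, size=64), rng.integers(0, n, size=64)
print(f1(f2(f3(w_true, b1), b2)))   # risk-averse objective near the true vector
```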
Appendix B Technical Lemmas
Lemma 6.
Lemma 7.
Lemma 8.
Appendix C Proof of Theorem 2
To establish the rate of convergence for Algorithm 1 in Theorem 2, we first present Lemma 9 and Lemma 10, regarding the basic recursion on the errors in estimating the inner function values and the order of the resulting estimation errors. The proofs follow [BGN22] with minor modifications. We present the complete proofs below for the reader's convenience.
Lemma 9.
Proof.
Lemma 10.
Proof.
By the update rule given in (8) and the definitions in (19), for and , we have
where . With the convexity of , we can further obtain
(27) |
Moreover, under Assumption 3, we have, for and ,
(28) |
where the second inequality follows from (23). Setting in the inequality above and noting that , we have
Thus, with the choice of , we obtain
Taking expectation of both sides of (21) conditioning on , and under Assumption 3, we obtain
(29) |
Setting in the inequality above, we have
This completes the proof of (24) and (25) in the base case. We now use backward induction to complete the proof. By the above result, the base case holds. Assume that (25) holds for some level, i.e., . Then, setting in (28), we obtain
Furthermore, with (27) and the choice of , we have
which, together with (29), implies that
∎
We now leverage the merit function defined in (9) and provide a basic inequality for establishing the convergence analysis of Algorithm 1 in Lemma 11. In Proposition 3, we show the boundedness, in expectation, of the term appearing on the right-hand side of (30). These two results form the crucial steps in establishing the convergence analysis of Algorithm 1.
Lemma 11.
Proof.
We first bound . By the Lipschitzness of (Lemma 6), we have
(32) |
We then provide a bound for . By the Lipschitzness of (Lemma 7) with the partial gradients of given by
we have
(33) |
where the second equality comes from (6) and (7). Due to the optimality condition of in the definition of , we have for all , which together with the choice of implies that
(34) |
Thus, combining (33) with (34), we obtain
(35) |
In addition, by Lemma 18, we have
(36) |
Then, combining (32), (35), (36), we have
(37) |
Furthermore, defining
and by the update rule given by (7), we have
(38) |
where the last inequality follows from two elementary facts.
The above upper bound for the term is obtained by leveraging Lemma 18 and a standard inequality for non-negative sequences.
Moreover, by Lemma 9, we have, for ,
(39) |
Finally, multiplying both sides of (39) by for and both sides of (38) by , adding them to (37), rearranging the terms, and noting that due to the quadratic structure of and , we obtain
(40) |
where is defined in (31) and
We can further provide a simplified upper bound for . By Young’s inequality, we have
Thus,
For any , let
Then, we have
(41) |
As a result of (40) and (41), we can further obtain
which immediately implies (30) by telescoping. ∎
Proof.
Proof of Theorem 2.
We now present the proof of Theorem 2. Note that by Lemma 11 and the given values of the constants in (12), we obtain
Taking expectation of both sides and noting that by Proposition 3, we have
(43) |
Then, setting the parameters to the values in (11) and noting the bound from Lemma 8, we have
Also, with the choice of , we have . Thus, we can conclude that
which together with (43) immediately imply that ,
where
and is given in (42). As a result, we can obtain (13) and (14) by the definition of random integer and
∎
Appendix D Proofs for Section 3.1
D.1 Proof of Theorem 3 for
To show the rate of convergence for Algorithm 3, we simplify the merit function used in the analysis of the multi-level problems and leverage the following function:
$$W(x, z, u) := a\big(F(x) - F^{*}\big) - b\,\eta(x, z) + c\,\big\|u - f_2(x)\big\|^{2}, \qquad (44)$$
where $a, b, c$ are positive constants and $\eta(x, z)$ is defined in (10). We now present the analogue of Lemma 11 for Algorithm 3. The proof follows steps similar to those in the proof of Lemma 11, with slight modifications; hence we skip some arguments already presented before.
Lemma 12.
Proof of Lemma 12.
1. By the Lipschitzness of (Lemma 6), we have
(46) |
2. Also, by the Lipschitzness of (Lemma 7) and the optimality condition of in the definition of , we have
(47) |
3. In addition, by the Lipschitzness of and , we have
(48) |
4. Moreover, by the update rule, we have
(49) |
where the last inequality follows from Jensen's inequality for the convex function as well as
5. Defining
and by the update rule, we have
(50) |
where . We can further upper bound the term by
(51) |
6. By combining (46), (47), (48), (49), (50), (51), rearranging the terms, and noting that and , we obtain
(52) |
where is defined in (45) and
We then provide a simplified upper bound for . By Young's inequality, we have
In addition, we reparametrize . Noting that by Lemma 6 with
we therefore have
Then, setting and , we can obtain
Also, we have . Therefore, we can further bound by
(53) |
Telescoping (52) together with (53), we get
∎
Proof of Theorem 3, part (a).
D.2 Proof of Theorem 3 for
To show the rate of convergence for Algorithm 4, we leverage the following merit function:
$$W(x, z) := a\big(F(x) - F^{*}\big) - b\,\eta(x, z), \qquad (55)$$
where $a, b > 0$ and $\eta(x, z)$ is defined in (10).
Lemma 13.
Proof.
The proof is essentially a simplified version of the proof of Lemma 12. Hence, we skip some arguments already presented earlier.
1. By the Lipschitzness of , we have
(57) |
2. Also, by the Lipschitzness of (Lemma 7) and the optimality condition of in the definition of , we have
(58) |
3. By the update rule, we have
(59) |
where .
Appendix E High-Probability Convergence for
E.1 Preliminaries
We provide a short review of sub-Gaussian and sub-exponential random variables for completeness.
Definition 14 (Sub-Gaussian and sub-exponential random variables).
- (a) A random variable $X$ is $K$-sub-Gaussian if there exists $K > 0$ such that $\mathbb{E}\big[\exp(X^{2}/K^{2})\big] \le 2$. The sub-Gaussian norm of $X$, denoted $\|X\|_{\psi_2}$, is defined to be the smallest such $K$. That is to say, $\|X\|_{\psi_2} := \inf\big\{ t > 0 : \mathbb{E}\big[\exp(X^{2}/t^{2})\big] \le 2 \big\}$.
- (b) A random variable $X$ is $K$-sub-exponential if there exists $K > 0$ such that $\mathbb{E}\big[\exp(|X|/K)\big] \le 2$. The sub-exponential norm of $X$, denoted $\|X\|_{\psi_1}$, is defined to be the smallest such $K$. That is to say, $\|X\|_{\psi_1} := \inf\big\{ t > 0 : \mathbb{E}\big[\exp(|X|/t)\big] \le 2 \big\}$.
The above characterization is based on the so-called Orlicz norm of a random variable. There are equivalent definitions of sub-Gaussian and sub-exponential random variables; we refer readers to Propositions 2.5.2 and 2.7.1 in [Ver18]. In particular, we will also use another characterization of sub-Gaussian random variables, based on the moment generating function, given below.
Lemma 15.
(Sub-Gaussian m.g.f. [Ver18]) If a random variable $X$ is $K$-sub-Gaussian with $\mathbb{E}[X] = 0$, then $\mathbb{E}\big[\exp(\lambda X)\big] \le \exp(C\lambda^{2}K^{2})$ for all $\lambda \in \mathbb{R}$, where $C$ is an absolute constant.
In the high-probability results we show for the special case of $T = 1$, we handle the tail probabilities of two terms involving the mean-zero noise with bounded sub-Gaussian norm, where the corresponding sequences are adapted to the filtration. Our proof leverages the following two lemmas to control the probability of these two terms being too large.
Lemma 16.
(Sub-exponential is sub-Gaussian squared [Ver18]) A random variable $X$ is sub-Gaussian if and only if $X^{2}$ is sub-exponential. Moreover, $\|X^{2}\|_{\psi_1} = \|X\|_{\psi_2}^{2}$.
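A quick Monte-Carlo illustration of Lemma 16 under the Orlicz-norm convention of Definition 14, using standard Gaussian samples (the bisection tolerance and sample size are arbitrary choices):

```python
import numpy as np

def orlicz_norm(samples, psi, lo=1e-2, hi=1e3, tol=1e-4):
    """Empirical Orlicz norm: smallest t with E[psi(|X| / t)] <= 1,
    found by bisection on a log scale."""
    with np.errstate(over="ignore"):
        while hi / lo > 1.0 + tol:
            t = np.sqrt(lo * hi)
            if np.mean(psi(np.abs(samples) / t)) <= 1.0:
                hi = t
            else:
                lo = t
    return hi

rng = np.random.default_rng(1)
x = rng.normal(size=200_000)
psi2 = lambda u: np.expm1(u ** 2)     # psi_2(u) = exp(u^2) - 1
psi1 = lambda u: np.expm1(u)          # psi_1(u) = exp(u) - 1
print(orlicz_norm(x, psi2) ** 2)      # ||X||_{psi_2}^2, roughly 8/3
print(orlicz_norm(x ** 2, psi1))      # ||X^2||_{psi_1}, matching Lemma 16
```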
Lemma 17.
(Generalized Freedman-type Inequality [HLPR19]) Let be a filtered probability space, and be adapted to , and . Suppose for all , , , and . Then for any ,
(63) |
E.2 Proof of Theorem 5
We start by presenting the lemma below, which leverages the inequalities introduced above to show high-probability upper bounds for the noise terms appearing in the previous analysis.
Lemma 18.
Proof of Lemma 18.
We first show (a). Using the law of total expectation, we have , which implies that is -sub-exponential. Thus, we have with probability at least ,
(64) |
We then show (b). Let . Note that for all , is -sub-exponential, which further implies that the sub-exponential norm of satisfies . Therefore, we have for any , with probability at least ,
(65) |
To prove (c), we apply Lemma 15 and Lemma 17 with
Noting that , we obtain that for all with probability at least , and
where the third inequality comes from the convexity of and the Lipschitzness of . ∎
Proof of Theorem 5.
Given the update rule of and the fact that , we can obtain
where and . Thus,
where the second inequality comes from the convexity of . Therefore, we have
Applying Lemma 18 with and together with Lemma 13, we have with probability at least ,
Thus, noting that , we have with probability at least ,
Following the same arguments as in the proof of Theorem 2, we have with probability at least ,
∎