Projection-free Online Learning over Strongly Convex Sets
Abstract
To efficiently solve online problems with complicated constraints, projection-free algorithms, including online Frank-Wolfe (OFW) and its variants, have received significant interest recently. However, in the general case, existing efficient projection-free algorithms only achieve a regret bound of $O(T^{3/4})$, which is worse than the $O(\sqrt{T})$ regret of projection-based algorithms, where $T$ is the number of decision rounds. In this paper, we study the special case of online learning over strongly convex sets, for which we first prove that OFW can enjoy a better regret bound of $O(T^{2/3})$ for general convex losses. The key idea is to refine the decaying step-size in the original OFW by a simple line search rule. Furthermore, for strongly convex losses, we propose a strongly convex variant of OFW by redefining the surrogate loss function in OFW. We show that it achieves a regret bound of $O(T^{2/3})$ over general convex sets and an even better regret bound over strongly convex sets.
Introduction
Online convex optimization (OCO) is a powerful framework that has been used to model and solve problems from diverse domains such as online routing (Awerbuch and Kleinberg 2004, 2008) and online portfolio selection (Blum and Kalai 1999; Agarwal et al. 2006; Luo, Wei, and Zheng 2018). In OCO, an online player plays a repeated game of $T$ rounds against an adversary. At each round $t$, the player chooses a decision $\mathbf{x}_t$ from a convex set $\mathcal{K}$. After that, a convex function $f_t(\mathbf{x}): \mathcal{K} \to \mathbb{R}$ chosen by the adversary is revealed, and incurs a loss $f_t(\mathbf{x}_t)$ to the player. The goal of the player is to choose decisions so that the regret defined as
$$R_T = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{K}} \sum_{t=1}^{T} f_t(\mathbf{x})$$
is minimized. Over the past decades, various algorithms such as online gradient descent (OGD) (Zinkevich 2003), online Newton step (Hazan, Agarwal, and Kale 2007) and follow-the-regularized-leader (Shalev-Shwartz 2007; Shalev-Shwartz and Singer 2007) have been proposed to yield optimal regret bounds under different scenarios.
To ensure the feasibility of each decision, a common way in these algorithms is to apply a projection operation to any infeasible decision. However, in many high-dimensional problems with complicated decision sets, the projection step becomes a computational bottleneck (Zhang et al. 2013; Chen et al. 2016; Yang, Lin, and Zhang 2017). For example, in semi-definite programs (Hazan 2008) and multiclass classification (Hazan and Luo 2016), it amounts to computing the singular value decomposition (SVD) of a matrix when the decision set consists of all matrices with bounded trace norm. To tackle this computational bottleneck, Hazan and Kale (2012) proposed a projection-free algorithm called online Frank-Wolfe (OFW) that replaces the time-consuming projection step with a more efficient linear optimization step. For the above trace norm ball, the linear optimization step can be carried out by computing only the top singular vector pair of a matrix, which is significantly faster than computing the full SVD. Although OFW can efficiently handle complicated decision sets, it only attains a regret bound of $O(T^{3/4})$ for the general OCO, which is worse than the optimal $O(\sqrt{T})$ regret bound achieved by projection-based algorithms.
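As a concrete illustration of this efficiency gap, the following sketch (ours, not from the paper) shows the linear optimization step over a trace norm ball of radius `tau`: it only needs the top singular vector pair of the gradient matrix, which can be obtained with a rank-1 SVD. The function name and the use of SciPy are our own choices.

```python
import numpy as np
from scipy.sparse.linalg import svds

def linear_opt_trace_norm_ball(G, tau):
    """Solve min_{||X||_* <= tau} <G, X> using only the top singular pair of G.

    A Euclidean projection onto the same ball would instead require a full SVD
    of the iterate, followed by a projection of its singular values.
    """
    # svds with k=1 computes only the leading singular triplet of G.
    u, _, vt = svds(np.asarray(G, dtype=float), k=1)
    return -tau * np.outer(u[:, 0], vt[0, :])
```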
Table 1: Comparison of efficient projection-free algorithms for OCO, where $T$ denotes the number of rounds.

| Algorithm | Extra Condition on Loss | Extra Condition on $\mathcal{K}$ | Regret Bound |
|---|---|---|---|
| OFW | | | $O(T^{3/4})$ |
| LLOO-OCO | | polyhedral | $O(\sqrt{T})$ |
| LLOO-OCO | strongly convex | polyhedral | $O(\log T)$ |
| Fast OGD | | smooth | $O(\sqrt{T})$ |
| Fast OGD | strongly convex | smooth | $O(\log T)$ |
| OSPF | smooth | | $O(T^{2/3})$ |
| OFW with Line Search (this work) | | strongly convex | $O(T^{2/3})$ |
| SC-OFW (this work) | strongly convex | | $O(T^{2/3})$ |
| SC-OFW (this work) | strongly convex | strongly convex | see Theorem 2 |
More specifically, OFW is an online extension of an offline algorithm called Frank-Wolfe (FW) (Frank and Wolfe 1956; Jaggi 2013), which iteratively performs linear optimization steps to minimize a convex and smooth function. In each round, OFW updates the decision by utilizing a single step of FW to minimize a surrogate loss function, which implies that the approximation error caused by the single step of FW could be the main reason for the regret gap between projection-based algorithms and OFW. Recently, Garber and Hazan (2015) made a quadratic improvement in the convergence rate of FW for strongly convex and smooth offline optimization over strongly convex sets, compared with the general case. They used a simple line search rule to choose the step-size of FW, which allows FW to converge faster even if the strong convexity of the decision set is unknown. It is therefore natural to ask whether the faster convergence of FW can be utilized to improve the regret of OFW. In this paper, we give an affirmative answer by improving OFW to achieve a regret bound of $O(T^{2/3})$ over strongly convex sets, which is better than the original $O(T^{3/4})$ regret bound. Inspired by Garber and Hazan (2015), the key idea is to refine the decaying step-size in the original OFW by a simple line search rule.
Furthermore, we study OCO with strongly convex losses, and propose a strongly convex variant of OFW (SC-OFW). Different from OFW, we introduce a surrogate loss function from Garber and Hazan (2016) to utilize the strong convexity of losses, which has been used to achieve an $O(\log T)$ regret bound for strongly convex OCO over polyhedral sets. Theoretical analysis reveals that SC-OFW for strongly convex OCO attains a regret bound of $O(T^{2/3})$ over general convex sets (when this paper was under review, we noticed that a concurrent work (Garber and Kretzu 2020b) also proposed an algorithm similar to SC-OFW and established the same regret bound) and an even better regret bound over strongly convex sets.
Related Work
In this section, we briefly review related work on efficient projection-free algorithms for OCO, whose time complexity in each round is constant.
OFW (Hazan and Kale 2012; Hazan 2016) is the first projection-free algorithm for OCO, which attains a regret bound of $O(T^{3/4})$ by performing only one linear optimization step at each round. Recently, some studies have proposed projection-free online algorithms which are as efficient as OFW and attain better regret bounds for special cases of OCO. If the decision set is a polytope, Garber and Hazan (2016) proposed a variant of OFW named LLOO-based online convex optimization (LLOO-OCO), which enjoys an $O(\sqrt{T})$ regret bound for convex losses and an $O(\log T)$ regret bound for strongly convex losses. For OCO over smooth sets, Levy and Krause (2019) proposed a projection-free variant of OGD by devising a fast approximate projection for such sets, and established $O(\sqrt{T})$ and $O(\log T)$ regret bounds for convex and strongly convex losses, respectively. Besides these improvements for OCO over special decision sets, Hazan and Minasyan (2020) proposed the online smooth projection-free algorithm (OSPF) for OCO with smooth losses, which is a randomized method and achieves an expected regret bound of $O(T^{2/3})$.
Furthermore, OFW has been extended to two practical scenarios. The first scenario is the bandit setting (Flaxman, Kalai, and McMahan 2005; Bubeck et al. 2015), where only the loss value is available to the player. Chen, Zhang, and Karbasi (2019) proposed the first bandit variant of OFW, and established an expected regret bound of $O(T^{4/5})$. Later, two improved bandit variants of OFW were proposed, which enjoy better expected regret bounds of $O(T^{3/4})$ for convex losses (Garber and Kretzu 2020a) and $O(T^{2/3})$ for strongly convex losses (Garber and Kretzu 2020b). The second scenario is the distributed setting (Duchi, Agarwal, and Wainwright 2011; Hosseini, Chapman, and Mesbahi 2013), where many players are distributed over a network and each player can only access its local loss functions. The first projection-free algorithm for distributed OCO was proposed by Zhang et al. (2017), which communicates in every round. Recently, Wan, Tu, and Zhang (2020) further reduced the communication complexity to $O(\sqrt{T})$.
Despite this great progress on projection-free online learning, for the general OCO, OFW is still the best known efficient projection-free algorithm, and its $O(T^{3/4})$ regret bound has remained unchanged. In this paper, we study OCO over strongly convex sets, and improve the regret bound to $O(T^{2/3})$ for convex losses and to an even better bound for strongly convex losses. The detailed comparisons between efficient projection-free algorithms are summarized in Table 1.
Main Results
In this section, we first introduce necessary preliminaries including common notations, definitions and assumptions. Then, we present an improved regret bound for OFW over strongly convex sets by utilizing the line search. Finally, we introduce our SC-OFW algorithm for strongly convex OCO as well as its theoretical guarantees.
Preliminaries
In this paper, the convex decision set $\mathcal{K}$ belongs to a finite-dimensional vector space $\mathbb{E}$. For any $\mathbf{x}, \mathbf{y} \in \mathbb{E}$, the inner product is denoted by $\langle \mathbf{x}, \mathbf{y} \rangle$. We recall two standard definitions for smooth and strongly convex functions (Boyd and Vandenberghe 2004), respectively.
Definition 1
Let $f(\mathbf{x}): \mathcal{K} \to \mathbb{R}$ be a function over $\mathcal{K}$. It is called $\beta$-smooth over $\mathcal{K}$ if for all $\mathbf{x}, \mathbf{y} \in \mathcal{K}$
$$f(\mathbf{y}) \leq f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{\beta}{2} \|\mathbf{y} - \mathbf{x}\|^2.$$
Definition 2
Let $f(\mathbf{x}): \mathcal{K} \to \mathbb{R}$ be a function over $\mathcal{K}$. It is called $\lambda$-strongly convex over $\mathcal{K}$ if for all $\mathbf{x}, \mathbf{y} \in \mathcal{K}$
$$f(\mathbf{y}) \geq f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle + \frac{\lambda}{2} \|\mathbf{y} - \mathbf{x}\|^2.$$
When $f(\mathbf{x})$ is a $\lambda$-strongly convex function and $\mathbf{x}^* = \operatorname{argmin}_{\mathbf{x} \in \mathcal{K}} f(\mathbf{x})$, Garber and Hazan (2015) have proved that for any $\mathbf{x} \in \mathcal{K}$
$$\frac{\lambda}{2} \|\mathbf{x} - \mathbf{x}^*\|^2 \leq f(\mathbf{x}) - f(\mathbf{x}^*) \qquad (1)$$
and
$$\|\nabla f(\mathbf{x})\| \geq \sqrt{\frac{\lambda}{2} \big( f(\mathbf{x}) - f(\mathbf{x}^*) \big)}. \qquad (2)$$
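For completeness, here is a short sketch (ours) of how (1) and (2) follow from Definition 2, the first-order optimality of $\mathbf{x}^*$ over $\mathcal{K}$, and the Cauchy-Schwarz inequality:

```latex
% (1): strong convexity at x^* plus <\nabla f(x^*), x - x^*> \geq 0 for all x in K
f(\mathbf{x}) \geq f(\mathbf{x}^*) + \langle \nabla f(\mathbf{x}^*), \mathbf{x} - \mathbf{x}^* \rangle
              + \tfrac{\lambda}{2}\|\mathbf{x} - \mathbf{x}^*\|^2
            \geq f(\mathbf{x}^*) + \tfrac{\lambda}{2}\|\mathbf{x} - \mathbf{x}^*\|^2 .
% (2): convexity, the Cauchy-Schwarz inequality, and then (1)
f(\mathbf{x}) - f(\mathbf{x}^*) \leq \langle \nabla f(\mathbf{x}), \mathbf{x} - \mathbf{x}^* \rangle
  \leq \|\nabla f(\mathbf{x})\| \, \|\mathbf{x} - \mathbf{x}^*\|
  \leq \|\nabla f(\mathbf{x})\| \sqrt{\tfrac{2}{\lambda}\big(f(\mathbf{x}) - f(\mathbf{x}^*)\big)},
\quad \text{which rearranges to (2).}
```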
Then, we introduce the definition of the strongly convex set, which has been well studied in offline optimization (Levitin and Polyak 1966; Demyanov and Rubinov 1970; Dunn 1979; Garber and Hazan 2015; Rector-Brooks, Wang, and Mozafari 2019).
Definition 3
A convex set $\mathcal{K}$ is called $\alpha$-strongly convex with respect to a norm $\|\cdot\|$ if for any $\mathbf{x}, \mathbf{y} \in \mathcal{K}$, $\gamma \in [0, 1]$ and $\mathbf{z}$ such that $\|\mathbf{z}\| = 1$, it holds that
$$\gamma \mathbf{x} + (1 - \gamma) \mathbf{y} + \gamma (1 - \gamma) \frac{\alpha}{2} \|\mathbf{x} - \mathbf{y}\|^2 \mathbf{z} \in \mathcal{K}.$$
As shown by Garber and Hazan (2015), various balls induced by $\ell_p$ norms, Schatten norms and group norms are strongly convex, and such balls are commonly used to constrain the decision. For example, the $\ell_p$ norm ball defined as
$$\mathcal{K} = \{\mathbf{x} \in \mathbb{R}^n \mid \|\mathbf{x}\|_p \leq r\}$$
is $\frac{p-1}{r}$-strongly convex with respect to the norm $\|\cdot\|_p$ for any $p \in (1, 2]$ (Garber and Hazan, 2015, Corollary 1).
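For instance, over the Euclidean ball of radius $r$ (the case $p = 2$ above), the linear optimization step used by projection-free methods has a simple closed form. The following is a minimal sketch of ours, not code from the paper.

```python
import numpy as np

def linear_opt_l2_ball(g, r):
    """Solve min_{||x||_2 <= r} <g, x>; the minimizer is -r * g / ||g||_2."""
    norm = np.linalg.norm(g)
    # If g = 0, every point of the ball is a minimizer; return the center.
    return np.zeros_like(g) if norm == 0.0 else -(r / norm) * g
```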
Assumption 1
The diameter of the convex decision set $\mathcal{K}$ is bounded by $D$, i.e.,
$$\|\mathbf{x} - \mathbf{y}\| \leq D$$
for any $\mathbf{x}, \mathbf{y} \in \mathcal{K}$.
Assumption 2
At each round $t$, the loss function $f_t(\mathbf{x})$ is $G$-Lipschitz over $\mathcal{K}$, i.e.,
$$|f_t(\mathbf{x}) - f_t(\mathbf{y})| \leq G \|\mathbf{x} - \mathbf{y}\|$$
for any $\mathbf{x}, \mathbf{y} \in \mathcal{K}$.
OFW with Line Search
For the general OCO, OFW (Hazan and Kale 2012; Hazan 2016) arbitrarily chooses $\mathbf{x}_1$ from $\mathcal{K}$, and then iteratively updates its decision with the following linear optimization step
$$\mathbf{v}_t \in \operatorname{argmin}_{\mathbf{x} \in \mathcal{K}} \langle \nabla F_t(\mathbf{x}_t), \mathbf{x} \rangle, \qquad \mathbf{x}_{t+1} = \mathbf{x}_t + \sigma_t (\mathbf{v}_t - \mathbf{x}_t),$$
where
$$F_t(\mathbf{x}) = \eta \Big\langle \sum_{\tau=1}^{t} \nabla f_\tau(\mathbf{x}_\tau), \mathbf{x} \Big\rangle + \|\mathbf{x} - \mathbf{x}_1\|^2 \qquad (3)$$
is the surrogate loss function, and $\eta, \sigma_t$ are two parameters. According to Hazan (2016), OFW with an appropriately chosen $\eta$ and a decaying step-size $\sigma_t$ attains the regret bound of $O(T^{3/4})$. However, the decaying step-size cannot utilize the property of strongly convex sets. Inspired by recent progress on FW over strongly convex sets (Garber and Hazan 2015), we utilize a line search rule to choose the parameter $\sigma_t$ as
$$\sigma_t \in \operatorname{argmin}_{\sigma \in [0, 1]} F_t\big(\mathbf{x}_t + \sigma (\mathbf{v}_t - \mathbf{x}_t)\big).$$
The detailed procedures are summarized in Algorithm 1.
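Since we cannot reproduce Algorithm 1 verbatim here, the following Python sketch illustrates one natural implementation of OFW with the line search, assuming the surrogate (3) and generic gradient/linear-optimization oracles (the oracle interfaces and names are ours). Because $F_t$ is a quadratic function, the line search over $\sigma \in [0, 1]$ can be solved in closed form.

```python
import numpy as np

def ofw_line_search(grad_oracle, linear_oracle, x1, T, eta):
    """Sketch of OFW with line search (cf. Algorithm 1).

    grad_oracle(t, x): gradient of f_t at x, revealed after playing x.
    linear_oracle(g):  argmin_{v in K} <g, v>, the only access to the set K.
    """
    x, g_sum, decisions = x1.copy(), np.zeros_like(x1), []
    for t in range(1, T + 1):
        decisions.append(x.copy())                 # play x_t
        g_sum += grad_oracle(t, x)                 # observe f_t and accumulate gradients
        grad_F = eta * g_sum + 2.0 * (x - x1)      # gradient of the surrogate F_t at x_t
        v = linear_oracle(grad_F)                  # linear optimization step over K
        d = v - x
        # F_t(x + sigma * d) is a quadratic in sigma, so the exact line search
        # over [0, 1] reduces to clipping the unconstrained minimizer.
        a = np.dot(d, d)
        b = eta * np.dot(g_sum, d) + 2.0 * np.dot(x - x1, d)
        sigma = 0.0 if a == 0.0 else float(np.clip(-b / (2.0 * a), 0.0, 1.0))
        x = x + sigma * d
    return decisions
```

For example, running the method over a Euclidean ball of radius $r$ (a strongly convex set) only requires passing `linear_oracle=lambda g: linear_opt_l2_ball(g, r)` from the earlier sketch.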
As discussed above, the approximation error of minimizing the surrogate loss function with a single step of FW has a significant impact on the regret of OFW. Therefore, we first present the following lemma, which bounds this approximation error for Algorithm 1 over strongly convex sets.
Lemma 1
We find that Lemma 1 upper bounds the approximation error incurred by a single step of FW for Algorithm 1 over strongly convex sets. For a clear comparison, we note that this bound decays faster than the corresponding bound for the original OFW over general convex sets (Hazan, 2016, Lemma 7.3).
By applying Lemma 1 and further analyzing the property of decisions defined in Lemma 1, we prove that OFW with line search enjoys the following regret bound over strongly convex sets.
Theorem 1
Let $\mathcal{K}$ be an $\alpha$-strongly convex set with respect to the norm $\|\cdot\|$, and suppose Assumptions 1 and 2 hold. For any $T$, Algorithm 1 with a suitable choice of $\eta$ ensures a regret bound of $R_T = O(T^{2/3})$.
We find that the $O(T^{2/3})$ regret bound in Theorem 1 is better than the $O(T^{3/4})$ bound achieved by previous studies for the original OFW over general convex sets (Hazan and Kale 2012; Hazan 2016).
SC-OFW for Strongly Convex Losses
Moreover, we propose a strongly convex variant of OFW (SC-OFW), which is inspired by Garber and Hazan (2016). To be precise, Garber and Hazan (2016) have proposed a variant of OFW for strongly convex OCO over polyhedral sets, which enjoys an $O(\log T)$ regret bound. Compared with the original OFW, their algorithm has two main differences. First, a local linear optimization step for polyhedral sets is developed to replace the linear optimization step utilized in OFW. Second, to handle $\lambda$-strongly convex losses, $F_t(\mathbf{x})$ is redefined as
$$F_t(\mathbf{x}) = \sum_{\tau=1}^{t} \left( \langle \nabla f_\tau(\mathbf{x}_\tau), \mathbf{x} - \mathbf{x}_\tau \rangle + \frac{c\lambda}{2} \|\mathbf{x} - \mathbf{x}_\tau\|^2 \right), \qquad (4)$$
where $c \in (0, 1]$ is a parameter that depends on properties of polyhedral sets.
Since this paper considers OCO over strongly convex sets, our SC-OFW adopts the linear optimization step utilized in the original OFW, and simplifies $F_t(\mathbf{x})$ in (4) as
$$F_t(\mathbf{x}) = \sum_{\tau=1}^{t} \left( \langle \nabla f_\tau(\mathbf{x}_\tau), \mathbf{x} - \mathbf{x}_\tau \rangle + \frac{\lambda}{2} \|\mathbf{x} - \mathbf{x}_\tau\|^2 \right) \qquad (5)$$
by simply setting $c = 1$.
by simply setting . The detailed procedures are summarized in Algorithm 2, where the parameter is selected by the line search as
To the best of our knowledge, SC-OFW is the first projection-free algorithm for strongly convex OCO over any decision set.
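Analogously, the following sketch (again ours, with hypothetical oracle names) illustrates SC-OFW with the surrogate (5): the gradient of $F_t$ only requires the running sums of past gradients and past decisions, and the line search again has a closed form because $F_t$ is quadratic with curvature $\lambda t$.

```python
import numpy as np

def sc_ofw(grad_oracle, linear_oracle, x1, T, lam):
    """Sketch of SC-OFW (cf. Algorithm 2) for lam-strongly convex losses."""
    x, g_sum, x_sum, decisions = x1.copy(), np.zeros_like(x1), np.zeros_like(x1), []
    for t in range(1, T + 1):
        decisions.append(x.copy())                  # play x_t
        g_sum += grad_oracle(t, x)                  # sum of observed gradients
        x_sum += x                                  # sum of past decisions
        grad_F = g_sum + lam * (t * x - x_sum)      # gradient of the surrogate (5) at x_t
        v = linear_oracle(grad_F)                   # linear optimization step over K
        d = v - x
        # F_t(x + sigma * d) has derivative <grad_F, d> + sigma * lam * t * ||d||^2,
        # so the line search over [0, 1] is a simple clip.
        a = lam * t * np.dot(d, d)
        sigma = 0.0 if a == 0.0 else float(np.clip(-np.dot(grad_F, d) / a, 0.0, 1.0))
        x = x + sigma * d
    return decisions
```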
In the following lemma, we upper bound the approximation error of minimizing the surrogate loss function for SC-OFW over strongly convex sets.
Lemma 2
Furthermore, based on Lemma 2, we provide the following theorem with respect to the regret bound of SC-OFW over strongly convex sets.
Theorem 2
Let $\mathcal{K}$ be an $\alpha$-strongly convex set with respect to the norm $\|\cdot\|$. If $f_t(\mathbf{x})$ is $\lambda$-strongly convex for any $t \in [T]$, then for any $T$, Algorithm 2 ensures
where
From Theorem 2, we achieve a regret bound for strongly convex losses that is better than the above $O(T^{2/3})$ bound for general convex losses.
Moreover, we show the regret bound of Algorithm 2 over any general convex set in the following theorem.
Theorem 3
Let $\mathcal{K}$ be a general convex set. If $f_t(\mathbf{x})$ is $\lambda$-strongly convex for any $t \in [T]$, then for any $T$, Algorithm 2 ensures a regret bound of $O(T^{2/3})$.
Theoretical Analysis
Proof of Theorem 1
In the beginning, we define $\mathbf{x}_t^* = \operatorname{argmin}_{\mathbf{x} \in \mathcal{K}} F_t(\mathbf{x})$ for any $t \in [T]$, where $F_t(\mathbf{x})$ is defined in (3).
Since each loss function $f_t(\mathbf{x})$ is convex, we have
(6)
So, we will upper bound the regret by analyzing the two terms on the right-hand side of (6).
By applying Lemma 1, we have
(7)
where the second inequality is due to (1) and the fact that is -strongly convex, and the last inequality is due to .
To bound , we introduce the following lemma.
Lemma 3
(Lemma 6.6 of Garber and Hazan (2016)) Let be a sequence of loss functions and let for any . Then, it holds that
To apply Lemma 3, we define for and for any . We recall that and for any . Then, by applying Lemma 3 to , we have
which implies that
Since and , we have
(8)
Moreover, by combining the fact that is -strongly convex with (1), we have
which implies that
(9)
By combining (8), (9) and , we have
(10)
where the last inequality is due to and for any .
Proof of Theorem 2
Let for any and for any .
Since each loss function $f_t(\mathbf{x})$ is $\lambda$-strongly convex, we have
So, we will derive a regret bound by analyzing the two terms on the right-hand side.
To bound , we introduce the following lemma.
Lemma 4
(Lemma 6.7 of Garber and Hazan (2016)) For any , the function is -Lipschitz over .
Proof of Lemma 1
For brevity, we define for and for .
For , since , we have
(14)
For any , assuming , we first note that
(15)
where the first inequality is due to and the last inequality is due to and (9).
Then, to bound by , we further introduce the following lemma.
Lemma 5
(Derived from Lemma 1 of Garber and Hazan (2015)) Let be a convex and -smooth function, where is -strongly convex with respect to the norm. Moreover, let and , where and . For any , we have
We note that is -smooth for any . By applying Lemma 5 with and , for any , we have and
(17)
Because of (16), (17) and , if , it is easy to verify that
(18)
where the last inequality is due to for any .
Then, if , there exist two cases. First, if , it is easy to verify that
(19)
where the last inequality has been proved in (18).
Second, if , we have
where the second inequality is due to (16) and the third inequality is due to (2).
Since for any , it is easy to verify that
which further implies that
(20)
where the second inequality is due to .
Conclusion and Future Work
In this paper, we first prove that the classical OFW algorithm with line search attains an $O(T^{2/3})$ regret bound for OCO over strongly convex sets, which is better than the $O(T^{3/4})$ regret bound for the general OCO. Furthermore, for strongly convex losses, we introduce a strongly convex variant of OFW, and prove that it achieves a regret bound of $O(T^{2/3})$ over general convex sets and an even better regret bound over strongly convex sets.
An open question is whether the regret of OFW and its strongly convex variant over strongly convex sets can be further improved if the losses are smooth. We note that Hazan and Minasyan (2020) have proposed a projection-free algorithm for OCO over general convex sets, and established an improved expected regret bound of $O(T^{2/3})$ by taking advantage of the smoothness.
Acknowledgments
This work was partially supported by the NSFC (61921006), NSFC-NRF Joint Research Project (61861146001), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
References
- Agarwal et al. (2006) Agarwal, A.; Hazan, E.; Kale, S.; and Schapire, R. E. 2006. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning, 9–16.
- Awerbuch and Kleinberg (2008) Awerbuch, B.; and Kleinberg, R. 2008. Online linear optimization and adaptive routing. Journal of Computer and System Sciences 74(1): 97–114.
- Awerbuch and Kleinberg (2004) Awerbuch, B.; and Kleinberg, R. D. 2004. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 45–53.
- Blum and Kalai (1999) Blum, A.; and Kalai, A. 1999. Universal portfolios with and without transaction costs. Machine Learning 35(3): 193–205.
- Boyd and Vandenberghe (2004) Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
- Bubeck et al. (2015) Bubeck, S.; Dekel, O.; Koren, T.; and Peres, Y. 2015. Bandit convex optimization: regret in one dimension. In Proceedings of the 28th Conference on Learning Theory, 266–278.
- Chen et al. (2016) Chen, J.; Yang, T.; Lin, Q.; Zhang, L.; and Chang, Y. 2016. Optimal stochastic strongly convex optimization with a logarithmic number of projections. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 122–131.
- Chen, Zhang, and Karbasi (2019) Chen, L.; Zhang, M.; and Karbasi, A. 2019. Projection-free bandit convex optimization. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2047–2056.
- Demyanov and Rubinov (1970) Demyanov, V. F.; and Rubinov, A. M. 1970. Approximate Methods in Optimization Problems. Elsevier Publishing Company.
- Duchi, Agarwal, and Wainwright (2011) Duchi, J. C.; Agarwal, A.; and Wainwright, M. J. 2011. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control 57(3): 592–606.
- Dunn (1979) Dunn, J. C. 1979. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17(2): 187–211.
- Flaxman, Kalai, and McMahan (2005) Flaxman, A. D.; Kalai, A. T.; and McMahan, H. B. 2005. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 385–394.
- Frank and Wolfe (1956) Frank, M.; and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1–2): 95–110.
- Garber and Hazan (2015) Garber, D.; and Hazan, E. 2015. Faster rates for the frank-wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, 541–549.
- Garber and Hazan (2016) Garber, D.; and Hazan, E. 2016. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3): 1493–1528.
- Garber and Kretzu (2020a) Garber, D.; and Kretzu, B. 2020a. Improved regret bounds for projection-free bandit convex optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2196–2206.
- Garber and Kretzu (2020b) Garber, D.; and Kretzu, B. 2020b. Revisiting projection-free online learning: the strongly convex case. ArXiv e-prints arXiv: 2010.07572.
- Hazan (2008) Hazan, E. 2008. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, 306–316.
- Hazan (2016) Hazan, E. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3–4): 157–325.
- Hazan, Agarwal, and Kale (2007) Hazan, E.; Agarwal, A.; and Kale, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2): 169–192.
- Hazan and Kale (2012) Hazan, E.; and Kale, S. 2012. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 1843–1850.
- Hazan and Luo (2016) Hazan, E.; and Luo, H. 2016. Variance-reduced and projection-free stochastic optimization. In Proceedings of the 33rd International Conference on Machine Learning, 1263–1271.
- Hazan and Minasyan (2020) Hazan, E.; and Minasyan, E. 2020. Faster projection-free online learning. In Proceedings of the 33rd Annual Conference on Learning Theory, 1877–1893.
- Hosseini, Chapman, and Mesbahi (2013) Hosseini, S.; Chapman, A.; and Mesbahi, M. 2013. Online distributed optimization via dual averaging. In 52nd IEEE Conference on Decision and Control, 1484–1489.
- Jaggi (2013) Jaggi, M. 2013. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, 427–435.
- Levitin and Polyak (1966) Levitin, E. S.; and Polyak, B. T. 1966. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics 6: 1–50.
- Levy and Krause (2019) Levy, K. Y.; and Krause, A. 2019. Projection free online learning over smooth sets. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 1458–1466.
- Luo, Wei, and Zheng (2018) Luo, H.; Wei, C.-Y.; and Zheng, K. 2018. Efficient online portfolio with logarithmic regret. In Advances in Neural Information Processing Systems 31, 8235–8245.
- Rector-Brooks, Wang, and Mozafari (2019) Rector-Brooks, J.; Wang, J.-K.; and Mozafari, B. 2019. Revisiting projection-free optimization for strongly convex constraint sets. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 1576–1583.
- Shalev-Shwartz (2007) Shalev-Shwartz, S. 2007. Online Learning: Theory, Algorithms, and Applications. Ph.D. thesis, The Hebrew University of Jerusalem.
- Shalev-Shwartz (2011) Shalev-Shwartz, S. 2011. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2): 107–194.
- Shalev-Shwartz and Singer (2007) Shalev-Shwartz, S.; and Singer, Y. 2007. A primal-dual perspective of online learning algorithms. Machine Learning 69(2–3): 115–142.
- Wan, Tu, and Zhang (2020) Wan, Y.; Tu, W.-W.; and Zhang, L. 2020. Projection-free distributed online convex optimization with $O(\sqrt{T})$ communication complexity. In Proceedings of the 37th International Conference on Machine Learning, 9818–9828.
- Yang, Lin, and Zhang (2017) Yang, T.; Lin, Q.; and Zhang, L. 2017. A richer theory of convex constrained optimization with reduced projections and improved rates. In Proceedings of the 34th International Conference on Machine Learning, 3901–3910.
- Zhang et al. (2013) Zhang, L.; Yang, T.; Jin, R.; and He, X. 2013. $O(\log T)$ projections for stochastic optimization of smooth and strongly convex functions. In Proceedings of the 30th International Conference on Machine Learning, 1121–1129.
- Zhang et al. (2017) Zhang, W.; Zhao, P.; Zhu, W.; Hoi, S. C. H.; and Zhang, T. 2017. Projection-free distributed online learning in networks. In Proceedings of the 34th International Conference on Machine Learning, 4054–4062.
- Zinkevich (2003) Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 928–936.
Proof of Lemma 2
Let for . For brevity, we define for and for . For , we have
(21)
For any , assuming , we have
where the first inequality is due to Lemma 4 and the last inequality is due to (11) and
(22)
Let . Because of , we further have
where the second inequality is due to for any .
We note that is -smooth for any . By applying Lemma 5 with and , we have and
(23)
Because of (23), if , it is easy to verify that
(24)
where the last inequality is due to for any .
Otherwise, if , there exist two cases. First, if , it is easy to verify that
(25)
Second, if , we have
where the third inequality is due to (2).
Proof of Lemma 4
This lemma has been proved by Garber and Hazan (2016); we include the proof for completeness. For any $\mathbf{x}, \mathbf{y} \in \mathcal{K}$ and $t \in [T]$, we have
Following the above inequality, we also have
which completes the proof.
Proof of Theorem 3
This proof is similar to that of Theorem 2. Let for any . Let for any . From the proof of Theorem 2, we have
where and .
Moreover, as in (13), we have proved that
Therefore, we only need to bound . For SC-OFW over general convex sets, we introduce a new lemma.
By applying Lemmas 4 and 6, we have
(27)
where the third inequality is due to for and (11), and the last equality is due to
Because of and , we further have
which completes this proof.
Proof of Lemma 6
For brevity, we define and for any . For , following (21), we have
(28)
For any , assuming , we have
where the first inequality is due to Lemma 4 and the last inequality is due to (11) and (22).
Because of and for any , we further have
(29)
Then, for any , we have
(30)
where the first inequality is due to the smoothness of , the second inequality is due to the line search and the third inequality is due to .