
Projection-free Online Learning over Strongly Convex Sets

Yuanyu Wan1,  Lijun Zhang1,2, Lijun Zhang is the corresponding author.
Abstract

To efficiently solve online problems with complicated constraints, projection-free algorithms including online Frank-Wolfe (OFW) and its variants have received significant interest recently. However, in the general case, existing efficient projection-free algorithms only achieve a regret bound of O(T^{3/4}), which is worse than the regret of projection-based algorithms, where T is the number of decision rounds. In this paper, we study the special case of online learning over strongly convex sets, for which we first prove that OFW can enjoy a better regret bound of O(T^{2/3}) for general convex losses. The key idea is to refine the decaying step-size in the original OFW by a simple line search rule. Furthermore, for strongly convex losses, we propose a strongly convex variant of OFW by redefining the surrogate loss function in OFW. We show that it achieves a regret bound of O(T^{2/3}) over general convex sets and a better regret bound of O(\sqrt{T}) over strongly convex sets.

Introduction

Online convex optimization (OCO) is a powerful framework that has been used to model and solve problems from diverse domains such as online routing (Awerbuch and Kleinberg 2004, 2008) and online portfolio selection (Blum and Kalai 1999; Agarwal et al. 2006; Luo, Wei, and Zheng 2018). In OCO, an online player plays a repeated game of T rounds against an adversary. At each round t, the player chooses a decision \mathbf{x}_{t} from a convex set \mathcal{K}. After that, a convex function f_{t}(\mathbf{x}):\mathcal{K}\to\mathbb{R} chosen by the adversary is revealed, and incurs a loss f_{t}(\mathbf{x}_{t}) to the player. The goal of the player is to choose decisions so that the regret, defined as

R(T)=\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\min_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x})

is minimized. Over the past decades, various algorithms such as online gradient descent (OGD) (Zinkevich 2003), online Newton step (Hazan, Agarwal, and Kale 2007) and follow-the-regularized-leader (Shalev-Shwartz 2007; Shalev-Shwartz and Singer 2007) have been proposed to yield optimal regret bounds under different scenarios.

To ensure the feasibility of each decision, a common way in these algorithms is to apply a projection operation to any infeasible decision. However, in many high-dimensional problems with complicated decision sets, the projection step becomes a computational bottleneck (Zhang et al. 2013; Chen et al. 2016; Yang, Lin, and Zhang 2017). For example, in semi-definite programs (Hazan 2008) and multiclass classification (Hazan and Luo 2016), it amounts to computing the singular value decomposition (SVD) of a matrix when the decision set consists of all matrices with bounded trace norm. To tackle the computational bottleneck, Hazan and Kale (2012) proposed a projection-free algorithm called online Frank-Wolfe (OFW) that replaces the time-consuming projection step with a more efficient linear optimization step. Taking the trace norm ball as an example again, the linear optimization step can be carried out by computing the top singular vector pair of a matrix, which is significantly faster than computing the full SVD. Although OFW can efficiently handle complicated decision sets, it only attains a regret bound of O(T^{3/4}) for the general OCO, which is worse than the optimal O(\sqrt{T}) regret bound achieved by projection-based algorithms.
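
As a concrete illustration of this claim (not part of the original text), the following minimal Python sketch shows a linear optimization oracle over the trace norm ball; the helper name trace_norm_ball_lmo is ours, and it assumes SciPy is available for computing a single singular triplet.

import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_ball_lmo(G, r):
    # argmin_{||X||_tr <= r} <G, X> is attained at -r * u1 v1^T, where (u1, v1) is
    # the top singular vector pair of G, so only one singular triplet is needed
    # instead of the full SVD required by a projection onto the trace norm ball.
    u, s, vt = svds(np.asarray(G, dtype=float), k=1)
    return -r * np.outer(u[:, 0], vt[0, :])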

Algorithm | Extra Condition on Loss | Extra Condition on \mathcal{K} | Regret Bound
OFW | none | none | O(T^{3/4})
LLOO-OCO | none | polyhedral | O(\sqrt{T})
LLOO-OCO | strongly convex | polyhedral | O(\log T)
Fast OGD | none | smooth | O(\sqrt{T})
Fast OGD | strongly convex | smooth | O(\log T)
OSPF | smooth | none | O(T^{2/3})
OFW with Line Search (this work) | none | strongly convex | O(T^{2/3})
SC-OFW (this work) | strongly convex | none | O(T^{2/3})
SC-OFW (this work) | strongly convex | strongly convex | O(\sqrt{T})
Table 1: Comparisons of regret bounds for efficient projection-free online algorithms, including OFW (Hazan and Kale 2012; Hazan 2016), LLOO-OCO (Garber and Hazan 2016), Fast OGD (Levy and Krause 2019), OSPF (Hazan and Minasyan 2020), and our algorithms. "None" means no extra condition beyond convexity.

More specifically, OFW is an online extension of an offline algorithm called Frank-Wolfe (FW) (Frank and Wolfe 1956; Jaggi 2013) that iteratively performs linear optimization steps to minimize a convex and smooth function. In each round, OFW updates the decision by utilizing a single step of FW to minimize a surrogate loss function, which implies that the approximation error caused by the single step of FW could be the main reason for the regret gap between projection-based algorithms and OFW. Recently, Garber and Hazan (2015) made a quadratic improvement in the convergence rate of FW for strongly convex and smooth offline optimization over strongly convex sets compared to the general case. They used a simple line search rule to choose the step-size of FW, which allows FW to converge faster even if the strong convexity of the decision set is unknown. It is therefore natural to ask whether the faster convergence of FW can be utilized to improve the regret of OFW. In this paper, we give an affirmative answer by improving OFW to achieve a regret bound of O(T^{2/3}) over strongly convex sets, which is better than the original O(T^{3/4}) regret bound. Inspired by Garber and Hazan (2015), the key idea is to refine the decaying step-size in the original OFW by a simple line search rule.

Furthermore, we study OCO with strongly convex losses, and propose a strongly convex variant of OFW (SC-OFW). Different from OFW, we introduce a new surrogate loss function from Garber and Hazan (2016) to utilize the strong convexity of losses, which has been used to achieve an O(\log T) regret bound for strongly convex OCO over polyhedral sets. Theoretical analysis reveals that SC-OFW for strongly convex OCO attains a regret bound of O(T^{2/3}) over general convex sets and a better regret bound of O(\sqrt{T}) over strongly convex sets. (While this paper was under review, we noticed that a concurrent work (Garber and Kretzu 2020b) also proposed an algorithm similar to SC-OFW and established the O(T^{2/3}) regret bound.)

Related Work

In this section, we briefly review the related work on efficient projection-free algorithms for OCO, whose per-round time complexity is a constant.

OFW (Hazan and Kale 2012; Hazan 2016) is the first projection-free algorithm for OCO, which attains a regret bound of O(T^{3/4}) by performing only 1 linear optimization step at each round. Recently, some studies have proposed projection-free online algorithms that are as efficient as OFW and attain better regret bounds for special cases of OCO. If the decision set is a polytope, Garber and Hazan (2016) proposed a variant of OFW named LLOO-based online convex optimization (LLOO-OCO), which enjoys an O(\sqrt{T}) regret bound for convex losses and an O(\log T) regret bound for strongly convex losses. For OCO over smooth sets, Levy and Krause (2019) proposed a projection-free variant of OGD by devising a fast approximate projection for such sets, and established O(\sqrt{T}) and O(\log T) regret bounds for convex and strongly convex losses, respectively. Besides these improvements for OCO over special decision sets, Hazan and Minasyan (2020) proposed the online smooth projection free algorithm (OSPF) for OCO with smooth losses, which is a randomized method and achieves an expected regret bound of O(T^{2/3}).

Furthermore, OFW has been extended to two practical scenarios. The first scenario is the bandit setting (Flaxman, Kalai, and McMahan 2005; Bubeck et al. 2015), where only the loss value is available to the player. Chen, Zhang, and Karbasi (2019) proposed the first bandit variant of OFW, and established an expected regret bound of O(T^{4/5}). Later, two improved bandit variants of OFW were proposed that enjoy better expected regret bounds on the order of O(T^{3/4}) for convex losses (Garber and Kretzu 2020a) and \widetilde{O}(T^{2/3}) for strongly convex losses (Garber and Kretzu 2020b). The second scenario is the distributed setting (Duchi, Agarwal, and Wainwright 2011; Hosseini, Chapman, and Mesbahi 2013), where many players are distributed over a network and each player can only access its local loss function. The first projection-free algorithm for distributed OCO was proposed by Zhang et al. (2017), which requires a communication complexity of O(T). Recently, Wan, Tu, and Zhang (2020) further reduced the communication complexity from O(T) to O(\sqrt{T}).

Despite this great progress in projection-free online learning, for the general OCO, OFW is still the best known efficient projection-free algorithm and its regret bound of O(T^{3/4}) has remained unchanged. In this paper, we study OCO over strongly convex sets, and improve the regret bound to O(T^{2/3}) and O(\sqrt{T}) for convex and strongly convex losses, respectively. The detailed comparisons between efficient projection-free algorithms are summarized in Table 1.

Main Results

In this section, we first introduce necessary preliminaries including common notations, definitions and assumptions. Then, we present an improved regret bound for OFW over strongly convex sets by utilizing the line search. Finally, we introduce our SC-OFW algorithm for strongly convex OCO as well as its theoretical guarantees.

Preliminaries

In this paper, the convex set \mathcal{K} belongs to a finite-dimensional vector space \mathbb{E}. For any \mathbf{x},\mathbf{y}\in\mathcal{K}, the inner product is denoted by \langle\mathbf{x},\mathbf{y}\rangle. We recall two standard definitions for smooth and strongly convex functions (Boyd and Vandenberghe 2004), respectively.

Definition 1

Let f(\mathbf{x}):\mathcal{K}\to\mathbb{R} be a function over \mathcal{K}. It is called \beta-smooth over \mathcal{K} if for all \mathbf{x},\mathbf{y}\in\mathcal{K}

f(\mathbf{y})\leq f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle+\frac{\beta}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2}.
Definition 2

Let f(\mathbf{x}):\mathcal{K}\to\mathbb{R} be a function over \mathcal{K}. It is called \alpha-strongly convex over \mathcal{K} if for all \mathbf{x},\mathbf{y}\in\mathcal{K}

f(\mathbf{y})\geq f(\mathbf{x})+\langle\nabla f(\mathbf{x}),\mathbf{y}-\mathbf{x}\rangle+\frac{\alpha}{2}\|\mathbf{y}-\mathbf{x}\|_{2}^{2}.

When f(\mathbf{x}):\mathcal{K}\to\mathbb{R} is an \alpha-strongly convex function and \mathbf{x}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}f(\mathbf{x}), Garber and Hazan (2015) have proved that for any \mathbf{x}\in\mathcal{K}

\frac{\alpha}{2}\|\mathbf{x}-\mathbf{x}^{\ast}\|_{2}^{2}\leq f(\mathbf{x})-f(\mathbf{x}^{\ast}) (1)

and

\|\nabla f(\mathbf{x})\|_{2}\geq\sqrt{\frac{\alpha}{2}}\sqrt{f(\mathbf{x})-f(\mathbf{x}^{\ast})}. (2)
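
For completeness, (2) can be derived from (1) and the convexity of f(\mathbf{x}): when f(\mathbf{x})>f(\mathbf{x}^{\ast}),

f(\mathbf{x})-f(\mathbf{x}^{\ast})\leq\langle\nabla f(\mathbf{x}),\mathbf{x}-\mathbf{x}^{\ast}\rangle\leq\|\nabla f(\mathbf{x})\|_{2}\|\mathbf{x}-\mathbf{x}^{\ast}\|_{2}\leq\|\nabla f(\mathbf{x})\|_{2}\sqrt{\frac{2(f(\mathbf{x})-f(\mathbf{x}^{\ast}))}{\alpha}}

and dividing both sides by \sqrt{f(\mathbf{x})-f(\mathbf{x}^{\ast})} yields (2).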

Then, we introduce the definition of the strongly convex set, which has been well studied in offline optimization (Levitin and Polyak 1966; Demyanov and Rubinov 1970; Dunn 1979; Garber and Hazan 2015; Rector-Brooks, Wang, and Mozafari 2019).

Definition 3

A convex set \mathcal{K}\subseteq\mathbb{E} is called \alpha-strongly convex with respect to a norm \|\cdot\| if for any \mathbf{x},\mathbf{y}\in\mathcal{K}, \gamma\in[0,1] and \mathbf{z}\in\mathbb{E} such that \|\mathbf{z}\|=1, it holds that

\gamma\mathbf{x}+(1-\gamma)\mathbf{y}+\gamma(1-\gamma)\frac{\alpha}{2}\|\mathbf{x}-\mathbf{y}\|^{2}\mathbf{z}\in\mathcal{K}.

As shown by Garber and Hazan (2015), various balls induced by \ell_{p} norms, Schatten norms and group norms are strongly convex, and such balls are commonly used to constrain the decision. For example, the \ell_{p} norm ball defined as

\mathcal{K}=\{\mathbf{x}\in\mathbb{R}^{d}\,|\,\|\mathbf{x}\|_{p}\leq r\}

is \frac{(p-1)d^{\frac{1}{2}-\frac{1}{p}}}{r}-strongly convex with respect to the \ell_{2} norm for any p\in(1,2] (Garber and Hazan, 2015, Corollary 1).
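
For such \ell_{p} balls, the linear optimization step required by projection-free methods has a simple closed form that follows from Holder's inequality. The sketch below (not from the original paper; the function name lp_ball_lmo is ours) illustrates it; for p=2 it reduces to -r\mathbf{g}/\|\mathbf{g}\|_{2}.

import numpy as np

def lp_ball_lmo(g, r, p):
    # Linear minimization oracle over K = {x : ||x||_p <= r} with p in (1, 2]:
    # argmin_{x in K} <g, x> has coordinates
    #   x_i = -r * sign(g_i) * |g_i|^(q-1) / ||g||_q^(q-1),  where 1/p + 1/q = 1,
    # which attains the Holder lower bound <g, x> >= -r * ||g||_q.
    q = p / (p - 1.0)
    norm_q = np.linalg.norm(g, ord=q)
    if norm_q == 0.0:
        return np.zeros_like(g)
    return -r * np.sign(g) * np.abs(g) ** (q - 1) / norm_q ** (q - 1)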

We also introduce two common assumptions in OCO (Shalev-Shwartz 2011; Hazan 2016).

Assumption 1

The diameter of the convex decision set \mathcal{K} is bounded by D, i.e.,

\|\mathbf{x}-\mathbf{y}\|_{2}\leq D

for any \mathbf{x},\mathbf{y}\in\mathcal{K}.

Assumption 2

At each round t, the loss function f_{t}(\mathbf{x}) is G-Lipschitz over \mathcal{K}, i.e.,

|f_{t}(\mathbf{x})-f_{t}(\mathbf{y})|\leq G\|\mathbf{x}-\mathbf{y}\|_{2}

for any \mathbf{x},\mathbf{y}\in\mathcal{K}.
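
Note that Assumption 2 implies \|\nabla f_{t}(\mathbf{x})\|_{2}\leq G for any \mathbf{x}\in\mathcal{K}, a fact that is used repeatedly in the analysis (e.g., in (7) and (12)).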

Algorithm 1 OFW with Line Search
1:  Input: feasible set \mathcal{K}, \eta
2:  Initialization: choose \mathbf{x}_{1}\in\mathcal{K}
3:  for t=1,\cdots,T do
4:     Define F_{t}(\mathbf{x})=\eta\sum_{\tau=1}^{t}\langle\nabla f_{\tau}(\mathbf{x}_{\tau}),\mathbf{x}\rangle+\|\mathbf{x}-\mathbf{x}_{1}\|_{2}^{2}
5:     \mathbf{v}_{t}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\langle\nabla F_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle
6:     \sigma_{t}=\operatorname*{argmin}_{\sigma\in[0,1]}\langle\sigma(\mathbf{v}_{t}-\mathbf{x}_{t}),\nabla F_{t}(\mathbf{x}_{t})\rangle+\sigma^{2}\|\mathbf{v}_{t}-\mathbf{x}_{t}\|_{2}^{2}
7:     \mathbf{x}_{t+1}=\mathbf{x}_{t}+\sigma_{t}(\mathbf{v}_{t}-\mathbf{x}_{t})
8:  end for

OFW with Line Search

For the general OCO, OFW (Hazan and Kale 2012; Hazan 2016) arbitrarily chooses \mathbf{x}_{1} from \mathcal{K}, and then iteratively updates its decision by the following linear optimization step

\begin{split}&\mathbf{v}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\langle\nabla F_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle\\ &\mathbf{x}_{t+1}=\mathbf{x}_{t}+\sigma_{t}(\mathbf{v}-\mathbf{x}_{t})\end{split}

where

F_{t}(\mathbf{x})=\eta\sum_{\tau=1}^{t}\langle\nabla f_{\tau}(\mathbf{x}_{\tau}),\mathbf{x}\rangle+\|\mathbf{x}-\mathbf{x}_{1}\|_{2}^{2} (3)

is the surrogate loss function, and \sigma_{t} and \eta are two parameters. According to Hazan (2016), OFW with \eta=O(T^{-3/4}) and \sigma_{t}=O(t^{-1/2}) attains the regret bound of O(T^{3/4}). However, this decaying step-size \sigma_{t}=O(t^{-1/2}) cannot utilize the property of strongly convex sets. Inspired by a recent progress on FW over strongly convex sets (Garber and Hazan 2015), we utilize a line search rule to choose the parameter \sigma_{t} as

\sigma_{t}=\operatorname*{argmin}_{\sigma\in[0,1]}\langle\sigma(\mathbf{v}_{t}-\mathbf{x}_{t}),\nabla F_{t}(\mathbf{x}_{t})\rangle+\sigma^{2}\|\mathbf{v}_{t}-\mathbf{x}_{t}\|_{2}^{2}.
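
Note that the objective above is a convex quadratic in \sigma, so this line search admits the closed-form solution

\sigma_{t}=\min\left\{\max\left\{-\frac{\langle\nabla F_{t}(\mathbf{x}_{t}),\mathbf{v}_{t}-\mathbf{x}_{t}\rangle}{2\|\mathbf{v}_{t}-\mathbf{x}_{t}\|_{2}^{2}},0\right\},1\right\}

and thus can be computed exactly without any iterative search.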

The detailed procedures are summarized in Algorithm 1.
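
For concreteness, the following minimal Python sketch (not part of the original paper; the names ofw_line_search, lmo and grad_fs are ours) simulates Algorithm 1, assuming access to a linear optimization oracle lmo over \mathcal{K} (such as the \ell_{p}-ball oracle above) and an oracle grad_fs(t, x) returning the gradient of f_{t} at x.

import numpy as np

def ofw_line_search(x1, lmo, grad_fs, eta, T):
    # Sketch of Algorithm 1 (OFW with line search); x1 must lie in K.
    x = np.asarray(x1, dtype=float).copy()
    grad_sum = np.zeros_like(x)            # stores eta * sum_{tau <= t} grad f_tau(x_tau)
    decisions = []
    for t in range(1, T + 1):
        decisions.append(x.copy())         # play x_t, then observe f_t
        grad_sum += eta * grad_fs(t, x)
        grad_F = grad_sum + 2.0 * (x - x1) # gradient of F_t at x_t
        v = lmo(grad_F)                    # step 5: linear optimization
        d = v - x
        dd = float(np.dot(d, d))
        # step 6: closed-form line search, clip(-<grad_F, d> / (2 ||d||^2), 0, 1)
        sigma = 0.0 if dd == 0.0 else min(max(-float(np.dot(grad_F, d)) / (2.0 * dd), 0.0), 1.0)
        x = x + sigma * d                  # step 7
    return decisions

The analysis below sets \eta=\frac{D}{2G(T+2)^{2/3}}, which can be passed as the eta argument in this sketch.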

From previous discussions, the approximation error of minimizing F_{t}(\mathbf{x}) by a single step of FW has a significant impact on the regret of OFW. Therefore, we first present the following lemma with respect to the approximation error of Algorithm 1 over strongly convex sets.

Lemma 1

Let \mathcal{K} be an \alpha_{K}-strongly convex set with respect to the \ell_{2} norm. Let \mathbf{x}_{t}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-1}(\mathbf{x}) for any t\in[T+1], where F_{t}(\mathbf{x}) is defined in (3). Then, for any t\in[T+1], Algorithm 1 with \eta=\frac{D}{2G(T+2)^{2/3}} has

F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast})\leq\epsilon_{t}=\frac{C}{(t+2)^{2/3}}

where C=\max\left(4D^{2},\frac{4096}{3\alpha_{K}^{2}}\right).

We find that the approximation error incurred by a single step of FW is upper bounded by O(t^{-2/3}) for Algorithm 1 over strongly convex sets. For a clear comparison, we note that the approximation error for the original OFW over general convex sets has a worse bound of O(1/\sqrt{t}) (Hazan, 2016, Lemma 7.3).

By applying Lemma 1 and further analyzing the property of the decisions \mathbf{x}_{1}^{\ast},\cdots,\mathbf{x}_{T}^{\ast} defined in Lemma 1, we prove that OFW with line search enjoys the following regret bound over strongly convex sets.

Theorem 1

Let \mathcal{K} be an \alpha_{K}-strongly convex set with respect to the \ell_{2} norm and C=\max\left(4D^{2},\frac{4096}{3\alpha_{K}^{2}}\right). For any \mathbf{x}^{\ast}\in\mathcal{K}, Algorithm 1 with \eta=\frac{D}{2G(T+2)^{2/3}} ensures

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\ast})\leq\frac{11}{4}G\sqrt{C}(T+2)^{2/3}.

We find that the O(T^{2/3}) regret bound in Theorem 1 is better than the O(T^{3/4}) bound achieved by previous studies for the original OFW over general convex sets (Hazan and Kale 2012; Hazan 2016).

Algorithm 2 Strongly Convex Variant of OFW (SC-OFW)
1:  Input: feasible set \mathcal{K}, \lambda
2:  Initialization: choose \mathbf{x}_{1}\in\mathcal{K}
3:  for t=1,\cdots,T do
4:     Define F_{t}(\mathbf{x})=\sum_{\tau=1}^{t}(\langle\nabla f_{\tau}(\mathbf{x}_{\tau}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{\tau}\|_{2}^{2})
5:     \mathbf{v}_{t}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\langle\nabla F_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle
6:     \sigma_{t}=\operatorname*{argmin}_{\sigma\in[0,1]}\langle\sigma(\mathbf{v}_{t}-\mathbf{x}_{t}),\nabla F_{t}(\mathbf{x}_{t})\rangle+\frac{\sigma^{2}\lambda t}{2}\|\mathbf{v}_{t}-\mathbf{x}_{t}\|_{2}^{2}
7:     \mathbf{x}_{t+1}=\mathbf{x}_{t}+\sigma_{t}(\mathbf{v}_{t}-\mathbf{x}_{t})
8:  end for

SC-OFW for Strongly Convex Losses

Moreover, we propose a strongly convex variant of OFW (SC-OFW), which is inspired by Garber and Hazan (2016). To be precise, Garber and Hazan (2016) have proposed a variant of OFW for strongly convex OCO over polyhedral sets, which enjoys the O(\log T) regret bound. Compared with the original OFW, their algorithm has two main differences. First, a local linear optimization step for polyhedral sets is developed to replace the linear optimization step utilized in OFW. Second, to handle \lambda-strongly convex losses, F_{t}(\mathbf{x}) is redefined as

\begin{split}F_{t}(\mathbf{x})=&\sum_{\tau=1}^{t}\left(\langle\nabla f_{\tau}(\mathbf{x}_{\tau}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{\tau}\|_{2}^{2}\right)\\ &+\frac{c\lambda}{2}\|\mathbf{x}-\mathbf{x}_{1}\|_{2}^{2}\end{split} (4)

where c is a parameter that depends on properties of polyhedral sets.

Since this paper considers OCO over strongly convex sets, our SC-OFW adopts the linear optimization step utilized in the original OFW, and simplifies F_{t}(\mathbf{x}) in (4) to

F_{t}(\mathbf{x})=\sum_{\tau=1}^{t}\left(\langle\nabla f_{\tau}(\mathbf{x}_{\tau}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{\tau}\|_{2}^{2}\right) (5)

by simply setting c=0. The detailed procedures are summarized in Algorithm 2, where the parameter \sigma_{t} is selected by the line search as

\sigma_{t}=\operatorname*{argmin}_{\sigma\in[0,1]}\langle\sigma(\mathbf{v}_{t}-\mathbf{x}_{t}),\nabla F_{t}(\mathbf{x}_{t})\rangle+\frac{\sigma^{2}\lambda t}{2}\|\mathbf{v}_{t}-\mathbf{x}_{t}\|_{2}^{2}.
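
As in Algorithm 1, this line search is a convex quadratic in \sigma and admits the closed-form solution

\sigma_{t}=\min\left\{\max\left\{-\frac{\langle\nabla F_{t}(\mathbf{x}_{t}),\mathbf{v}_{t}-\mathbf{x}_{t}\rangle}{\lambda t\|\mathbf{v}_{t}-\mathbf{x}_{t}\|_{2}^{2}},0\right\},1\right\}.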

To the best of our knowledge, SC-OFW is the first projection-free algorithm for strongly convex OCO over general decision sets. A minimal sketch of its update is given below.
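
The sketch (not part of the original paper; the names sc_ofw, lmo and grad_fs are ours) assumes the same linear optimization oracle lmo over \mathcal{K} as before and an oracle grad_fs(t, x) returning the gradient of the \lambda-strongly convex loss f_{t} at x.

import numpy as np

def sc_ofw(x1, lmo, grad_fs, lam, T):
    # Sketch of Algorithm 2 (SC-OFW); x1 must lie in K.
    x = np.asarray(x1, dtype=float).copy()
    grad_sum = np.zeros_like(x)   # sum_{tau <= t} grad f_tau(x_tau)
    xsum = np.zeros_like(x)       # sum_{tau <= t} x_tau
    decisions = []
    for t in range(1, T + 1):
        decisions.append(x.copy())            # play x_t, then observe f_t
        grad_sum += grad_fs(t, x)
        xsum += x
        grad_F = grad_sum + lam * (t * x - xsum)   # gradient of F_t at x_t
        v = lmo(grad_F)                            # step 5: linear optimization
        d = v - x
        dd = float(np.dot(d, d))
        # step 6: closed-form line search, clip(-<grad_F, d> / (lam * t * ||d||^2), 0, 1)
        sigma = 0.0 if dd == 0.0 else min(max(-float(np.dot(grad_F, d)) / (lam * t * dd), 0.0), 1.0)
        x = x + sigma * d                          # step 7
    return decisions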

In the following lemma, we upper bound the approximation error of minimizing the surrogate loss function for SC-OFW over strongly convex sets.

Lemma 2

Let \mathcal{K} be an \alpha_{K}-strongly convex set with respect to the \ell_{2} norm. Let \mathbf{x}_{t}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-1}(\mathbf{x}) for any t=2,\cdots,T+1, where F_{t}(\mathbf{x}) is defined in (5). Then, for any t=2,\cdots,T+1, Algorithm 2 has

F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast})\leq C

where C=\max\left(\frac{4(G+\lambda D)^{2}}{\lambda},\frac{288\lambda}{\alpha_{K}^{2}}\right).

Furthermore, based on Lemma 2, we provide the following theorem with respect to the regret bound of SC-OFW over strongly convex sets.

Theorem 2

Let \mathcal{K} be an \alpha_{K}-strongly convex set with respect to the \ell_{2} norm. If f_{t}(\mathbf{x}) is \lambda-strongly convex for any t\in[T], then for any \mathbf{x}^{\ast}\in\mathcal{K}, Algorithm 2 ensures

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\ast})\leq C\sqrt{2T}+\frac{C\ln T}{2}+GD

where C=\max\left(\frac{4(G+\lambda D)^{2}}{\lambda},\frac{288\lambda}{\alpha_{K}^{2}}\right).

From Theorem 2, we achieve a regret bound of O(\sqrt{T}) for strongly convex losses, which is better than the above O(T^{2/3}) bound for general convex losses.

Moreover, we show the regret bound of Algorithm 2 over any general convex set \mathcal{K} in the following theorem.

Theorem 3

If f_{t}(\mathbf{x}) is \lambda-strongly convex for any t\in[T], then for any \mathbf{x}^{\ast}\in\mathcal{K}, Algorithm 2 ensures

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\ast})\leq\frac{3\sqrt{2}CT^{2/3}}{8}+\frac{C\ln T}{8}+GD

where C=\frac{16(G+\lambda D)^{2}}{\lambda}.

By comparing Theorem 3 with Theorem 2, we can find that our Algorithm 2 enjoys a better regret bound when the decision set is strongly convex.

Theoretical Analysis

In this section, we prove Theorems 1 and 2. The omitted proofs can be found in the appendix.

Proof of Theorem 1

In the beginning, we define \mathbf{x}_{t}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-1}(\mathbf{x}) for any t\in[T+1], where F_{t}(\mathbf{x}) is defined in (3).

Since each function f_{t}(\mathbf{x}) is convex, we have

\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\ast})\\ \leq&\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}^{\ast}\rangle\\ =&\underbrace{\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\rangle}_{:=A}+\underbrace{\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}^{\ast}-\mathbf{x}^{\ast}\rangle}_{:=B}.\end{split} (6)

So, we will upper bound the regret by analyzing A and B.

By applying Lemma 1, we have

\begin{split}A=&\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\rangle\\ \leq&\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{x}_{t})\|_{2}\|\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&\sum_{t=1}^{T}G\sqrt{F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast})}\\ \leq&\sum_{t=1}^{T}\frac{G\sqrt{C}}{(t+2)^{1/3}}\\ \leq&\frac{3G\sqrt{C}(T+2)^{2/3}}{2}\end{split} (7)

where the second inequality is due to (1) and the fact that F_{t-1}(\mathbf{x}) is 2-strongly convex, and the last inequality is due to \sum_{t=1}^{T}(t+2)^{-1/3}\leq 3(T+2)^{2/3}/2.
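
The last bound can be verified by comparing the sum with an integral:

\sum_{t=1}^{T}(t+2)^{-1/3}\leq\int_{2}^{T+2}x^{-1/3}\mathrm{d}x=\frac{3}{2}\left((T+2)^{2/3}-2^{2/3}\right)\leq\frac{3}{2}(T+2)^{2/3}.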

To bound B, we introduce the following lemma.

Lemma 3

(Lemma 6.6 of Garber and Hazan (2016)) Let \{f_{t}(\mathbf{x})\}_{t=1}^{T} be a sequence of loss functions and let \mathbf{x}_{t}^{\ast}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\sum_{\tau=1}^{t}f_{\tau}(\mathbf{x}) for any t\in[T]. Then, it holds that

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t}^{\ast})-\min_{\mathbf{x}\in\mathcal{K}}\sum_{t=1}^{T}f_{t}(\mathbf{x})\leq 0.

To apply Lemma 3, we define \tilde{f}_{1}(\mathbf{x})=\langle\eta\nabla f_{1}(\mathbf{x}_{1}),\mathbf{x}\rangle+\|\mathbf{x}-\mathbf{x}_{1}\|_{2}^{2} for t=1 and \tilde{f}_{t}(\mathbf{x})=\langle\eta\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle for any t\geq 2. We recall that F_{t}(\mathbf{x})=\sum_{\tau=1}^{t}\tilde{f}_{\tau}(\mathbf{x}) and \mathbf{x}_{t+1}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t}(\mathbf{x}) for any t=1,\cdots,T. Then, by applying Lemma 3 to \{\tilde{f}_{t}(\mathbf{x})\}_{t=1}^{T}, we have

\sum_{t=1}^{T}\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast})-\sum_{t=1}^{T}\tilde{f}_{t}(\mathbf{x}^{\ast})\leq 0

which implies that

\begin{split}&\eta\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t+1}^{\ast}-\mathbf{x}^{\ast}\rangle\\ \leq&\|\mathbf{x}^{\ast}-\mathbf{x}_{1}\|_{2}^{2}-\|\mathbf{x}^{\ast}_{2}-\mathbf{x}_{1}\|_{2}^{2}.\end{split}

Since \|\mathbf{x}^{\ast}-\mathbf{x}_{1}\|_{2}^{2}\leq D^{2} and \|\mathbf{x}^{\ast}_{2}-\mathbf{x}_{1}\|_{2}^{2}\geq 0, we have

\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t+1}^{\ast}-\mathbf{x}^{\ast}\rangle\leq\frac{1}{\eta}\|\mathbf{x}^{\ast}-\mathbf{x}_{1}\|_{2}^{2}\leq\frac{D^{2}}{\eta}. (8)

Moreover, by combining the fact that F_{t}(\mathbf{x}) is 2-strongly convex with (1), we have

\begin{split}\|\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\|_{2}^{2}\leq&F_{t}(\mathbf{x}_{t}^{\ast})-F_{t}(\mathbf{x}_{t+1}^{\ast})\\ =&F_{t-1}(\mathbf{x}_{t}^{\ast})-F_{t-1}(\mathbf{x}_{t+1}^{\ast})+\eta\nabla f_{t}(\mathbf{x}_{t})^{\top}(\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast})\\ \leq&\eta\|\nabla f_{t}(\mathbf{x}_{t})\|_{2}\|\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\|_{2}\end{split}

which implies that

\|\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\|_{2}\leq\eta\|\nabla f_{t}(\mathbf{x}_{t})\|_{2}\leq\eta G. (9)

By combining (8), (9) and \eta=\frac{D}{2G(T+2)^{2/3}}, we have

\begin{split}B=&\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}^{\ast}-\mathbf{x}^{\ast}\rangle\\ =&\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t+1}^{\ast}-\mathbf{x}^{\ast}\rangle+\sum_{t=1}^{T}\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\rangle\\ \leq&\frac{D^{2}}{\eta}+\sum_{t=1}^{T}\|\nabla f_{t}(\mathbf{x}_{t})\|_{2}\|\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\|_{2}\\ \leq&\frac{D^{2}}{\eta}+\eta TG^{2}\\ \leq&2DG(T+2)^{2/3}+\frac{DG(T+2)^{1/3}}{2}\\ \leq&G\sqrt{C}(T+2)^{2/3}+\frac{G\sqrt{C}(T+2)^{2/3}}{4}\end{split} (10)

where the last inequality is due to D\leq\sqrt{C}/2 (which follows from C\geq 4D^{2}) and (T+2)^{1/3}\leq(T+2)^{2/3} for any T\geq 1.

By combining (7) and (10), we complete the proof.

Proof of Theorem 2

Let \tilde{f}_{t}(\mathbf{x})=\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{t}\|_{2}^{2} for any t\in[T] and \mathbf{x}_{t}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-1}(\mathbf{x}) for any t=2,\cdots,T+1.

Since each function f_{t}(\mathbf{x}) is \lambda-strongly convex, we have

\begin{split}&\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\ast})\\ \leq&\sum_{t=1}^{T}\left(\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}_{t}-\mathbf{x}^{\ast}\rangle-\frac{\lambda}{2}\|\mathbf{x}_{t}-\mathbf{x}^{\ast}\|_{2}^{2}\right)\\ =&\sum_{t=1}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}^{\ast}))\\ =&\underbrace{\sum_{t=1}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))}_{:=A}+\underbrace{\sum_{t=1}^{T}(\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast})-\tilde{f}_{t}(\mathbf{x}^{\ast}))}_{:=B}.\end{split}

So, we will derive a regret bound by analyzing A and B.

To bound A, we introduce the following lemma.

Lemma 4

(Lemma 6.7 of Garber and Hazan (2016)) For any t\in[T], the function \tilde{f}_{t}(\mathbf{x})=\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{t}\|_{2}^{2} is (G+\lambda D)-Lipschitz over \mathcal{K}.

By applying Lemma 4, for any t=3,\cdots,T+1, we have

\begin{split}&F_{t-1}(\mathbf{x}_{t-1}^{\ast})-F_{t-1}(\mathbf{x}_{t}^{\ast})\\ =&F_{t-2}(\mathbf{x}_{t-1}^{\ast})-F_{t-2}(\mathbf{x}_{t}^{\ast})+\tilde{f}_{t-1}(\mathbf{x}_{t-1}^{\ast})-\tilde{f}_{t-1}(\mathbf{x}_{t}^{\ast})\\ \leq&(G+\lambda D)\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}.\end{split}

Moreover, since each F_{t}(\mathbf{x}) is t\lambda-strongly convex, for any t=3,\cdots,T+1, we have

\begin{split}\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}^{2}\leq&\frac{2(F_{t-1}(\mathbf{x}_{t-1}^{\ast})-F_{t-1}(\mathbf{x}_{t}^{\ast}))}{(t-1)\lambda}\\ \leq&\frac{2(G+\lambda D)\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}}{(t-1)\lambda}.\end{split}

Therefore, for any t=3,\cdots,T+1, we have

\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}\leq\frac{2(G+\lambda D)}{(t-1)\lambda}. (11)

By applying Lemmas 2 and 4, we have

\begin{split}&\sum_{t=2}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))\\ \leq&\sum_{t=2}^{T}(G+\lambda D)\|\mathbf{x}_{t}-\mathbf{x}_{t+1}^{\ast}\|_{2}\\ \leq&(G+\lambda D)\sum_{t=2}^{T}\|\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\|_{2}+(G+\lambda D)\sum_{t=2}^{T}\|\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\|_{2}\\ \leq&(G+\lambda D)\sum_{t=2}^{T}\sqrt{\frac{2(F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}))}{(t-1)\lambda}}+(G+\lambda D)\sum_{t=2}^{T}\frac{2(G+\lambda D)}{t\lambda}\\ \leq&(G+\lambda D)\sum_{t=2}^{T}\sqrt{\frac{2C}{(t-1)\lambda}}+(G+\lambda D)\sum_{t=2}^{T}\frac{2(G+\lambda D)}{t\lambda}\\ \leq&2(G+\lambda D)\sqrt{\frac{2TC}{\lambda}}+2(G+\lambda D)^{2}\frac{\ln T}{\lambda}\\ \leq&C\sqrt{2T}+\frac{C\ln T}{2}\end{split}

where the third inequality is due to \|\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\|_{2}\leq\sqrt{\frac{2(F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}))}{(t-1)\lambda}} for t\geq 2 and (11), and the last inequality is due to \frac{2(G+\lambda D)}{\sqrt{\lambda}}\leq\sqrt{C}.
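
The penultimate inequality uses the elementary bounds

\sum_{t=2}^{T}\frac{1}{\sqrt{t-1}}\leq 2\sqrt{T}\quad\text{and}\quad\sum_{t=2}^{T}\frac{1}{t}\leq\ln T.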

Due to \|\nabla f_{1}(\mathbf{x}_{1})\|_{2}\leq G and \|\mathbf{x}_{1}-\mathbf{x}_{2}^{\ast}\|_{2}\leq D, we have

\begin{split}A=&\tilde{f}_{1}(\mathbf{x}_{1})-\tilde{f}_{1}(\mathbf{x}_{2}^{\ast})+\sum_{t=2}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))\\ =&\langle\nabla f_{1}(\mathbf{x}_{1}),\mathbf{x}_{1}-\mathbf{x}_{2}^{\ast}\rangle-\frac{\lambda}{2}\|\mathbf{x}_{2}^{\ast}-\mathbf{x}_{1}\|_{2}^{2}+\sum_{t=2}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))\\ \leq&\|\nabla f_{1}(\mathbf{x}_{1})\|_{2}\|\mathbf{x}_{1}-\mathbf{x}_{2}^{\ast}\|_{2}+C\sqrt{2T}+\frac{C\ln T}{2}\\ \leq&GD+C\sqrt{2T}+\frac{C\ln T}{2}.\end{split} (12)

Moreover, we note that F_{t}(\mathbf{x})=\sum_{\tau=1}^{t}\tilde{f}_{\tau}(\mathbf{x}) and \mathbf{x}_{t+1}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t}(\mathbf{x}) for any t\in[T]. By applying Lemma 3 to \{\tilde{f}_{t}(\mathbf{x})\}_{t=1}^{T}, we have

B=\sum_{t=1}^{T}\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast})-\sum_{t=1}^{T}\tilde{f}_{t}(\mathbf{x}^{\ast})\leq 0. (13)

By combining (12) and (13), we complete the proof.

Proof of Lemma 1

For brevity, we define h_{t}=F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}) for t=1,\cdots,T and h_{t}(\mathbf{x}_{t-1})=F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast}) for t=2,\cdots,T.

For t=1, since \mathbf{x}_{1}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\|\mathbf{x}-\mathbf{x}_{1}\|_{2}^{2}, we have

h_{1}=F_{0}(\mathbf{x}_{1})-F_{0}(\mathbf{x}_{1}^{\ast})=0\leq\frac{C}{(1+2)^{2/3}}=\epsilon_{1}. (14)

For any T+1\geq t\geq 2, assuming h_{t-1}\leq\epsilon_{t-1}, we first note that

\begin{split}&h_{t}(\mathbf{x}_{t-1})\\ =&F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})\\ =&F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t}^{\ast})+\langle\eta\nabla f_{t-1}(\mathbf{x}_{t-1}),\mathbf{x}_{t-1}-\mathbf{x}_{t}^{\ast}\rangle\\ \leq&F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t-1}^{\ast})+\langle\eta\nabla f_{t-1}(\mathbf{x}_{t-1}),\mathbf{x}_{t-1}-\mathbf{x}_{t}^{\ast}\rangle\\ \leq&\epsilon_{t-1}+\eta\|\nabla f_{t-1}(\mathbf{x}_{t-1})\|_{2}\|\mathbf{x}_{t-1}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&\epsilon_{t-1}+\eta\|\nabla f_{t-1}(\mathbf{x}_{t-1})\|_{2}\|\mathbf{x}_{t-1}-\mathbf{x}_{t-1}^{\ast}\|_{2}+\eta\|\nabla f_{t-1}(\mathbf{x}_{t-1})\|_{2}\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&\epsilon_{t-1}+\eta G\sqrt{\epsilon_{t-1}}+\eta^{2}G^{2}\end{split} (15)

where the first inequality is due to \mathbf{x}_{t-1}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-2}(\mathbf{x}), and the last inequality is due to \|\mathbf{x}_{t-1}-\mathbf{x}_{t-1}^{\ast}\|_{2}\leq\sqrt{F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t-1}^{\ast})} and (9).

Then, by substituting \eta=\frac{D}{2G(T+2)^{2/3}} into (15), we have

\begin{split}h_{t}(\mathbf{x}_{t-1})\leq&\epsilon_{t-1}+\frac{D\sqrt{C}}{2(T+2)^{2/3}(t+1)^{1/3}}+\frac{D^{2}}{4(T+2)^{4/3}}\\ \leq&\epsilon_{t-1}+\frac{\epsilon_{t-1}}{4(t+1)^{1/3}}+\frac{\epsilon_{t-1}}{16(t+1)^{2/3}}\\ \leq&\left(1+\frac{1}{2(t+1)^{1/3}}\right)\epsilon_{t-1}\end{split} (16)

where the second inequality is due to T\geq t-1 and D\leq\frac{\sqrt{C}}{2}.

Then, to bound h_{t}=F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}) by \epsilon_{t}, we further introduce the following lemma.

Lemma 5

(Derived from Lemma 1 of Garber and Hazan (2015)) Let f(\mathbf{x}):\mathcal{K}\to\mathbb{R} be a convex and \beta_{f}-smooth function, where \mathcal{K} is \alpha_{K}-strongly convex with respect to the \ell_{2} norm. Moreover, let \mathbf{x}_{\operatorname*{in}}\in\mathcal{K} and \mathbf{x}_{\operatorname*{out}}=\mathbf{x}_{\operatorname*{in}}+\sigma^{\prime}(\mathbf{v}-\mathbf{x}_{\operatorname*{in}}), where \mathbf{v}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\langle\nabla f(\mathbf{x}_{\operatorname*{in}}),\mathbf{x}\rangle and \sigma^{\prime}=\operatorname*{argmin}_{\sigma\in[0,1]}\langle\sigma(\mathbf{v}-\mathbf{x}_{\operatorname*{in}}),\nabla f(\mathbf{x}_{\operatorname*{in}})\rangle+\frac{\sigma^{2}\beta_{f}}{2}\|\mathbf{v}-\mathbf{x}_{\operatorname*{in}}\|_{2}^{2}. For any \mathbf{x}^{\ast}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}f(\mathbf{x}), we have

\begin{split}&f(\mathbf{x}_{\operatorname*{out}})-f(\mathbf{x}^{\ast})\\ \leq&(f(\mathbf{x}_{\operatorname*{in}})-f(\mathbf{x}^{\ast}))\max\left(\frac{1}{2},1-\frac{\alpha_{K}\|\nabla f(\mathbf{x}_{\operatorname*{in}})\|_{2}}{8\beta_{f}}\right).\end{split}

We note that F_{t-1}(\mathbf{x}) is 2-smooth for any t\in[T+1]. By applying Lemma 5 with f(\mathbf{x})=F_{t-1}(\mathbf{x}) and \mathbf{x}_{\operatorname*{in}}=\mathbf{x}_{t-1}, for any t\in[T+1], we have \mathbf{x}_{\operatorname*{out}}=\mathbf{x}_{t} and

h_{t}\leq h_{t}(\mathbf{x}_{t-1})\max\left(\frac{1}{2},1-\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{16}\right). (17)

Because of (16), (17) and 1+\frac{1}{2(t+1)^{1/3}}\leq\frac{3}{2}, if \frac{1}{2}\leq\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{16}, it is easy to verify that

\begin{split}h_{t}&\leq\frac{3}{4}\epsilon_{t-1}=\frac{3}{4}\frac{C}{(t+1)^{2/3}}\\ &=\frac{C}{(t+2)^{2/3}}\frac{3(t+2)^{2/3}}{4(t+1)^{2/3}}\\ &\leq\frac{C}{(t+2)^{2/3}}=\epsilon_{t}\end{split} (18)

where the last inequality is due to \frac{3(t+2)^{2/3}}{4(t+1)^{2/3}}\leq 1 for any t\geq 2.

Then, if \frac{1}{2}>\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{16}, there exist two cases. First, if h_{t}(\mathbf{x}_{t-1})\leq\frac{3C}{4(t+1)^{2/3}}, it is easy to verify that

h_{t}\leq h_{t}(\mathbf{x}_{t-1})\leq\frac{3C}{4(t+1)^{2/3}}\leq\epsilon_{t} (19)

where the last inequality has been proved in (18).

Second, if h_{t}(\mathbf{x}_{t-1})>\frac{3C}{4(t+1)^{2/3}}, we have

\begin{split}h_{t}\leq&h_{t}(\mathbf{x}_{t-1})\left(1-\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{16}\right)\\ \leq&\epsilon_{t-1}\left(1+\frac{1}{2(t+1)^{1/3}}\right)\left(1-\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{16}\right)\\ \leq&\epsilon_{t-1}\left(1+\frac{1}{2(t+1)^{1/3}}\right)\left(1-\frac{\alpha_{K}\sqrt{h_{t}(\mathbf{x}_{t-1})}}{16}\right)\\ \leq&\epsilon_{t}\frac{(t+2)^{2/3}}{(t+1)^{2/3}}\left(1+\frac{1}{2(t+1)^{1/3}}\right)\left(1-\frac{\alpha_{K}\sqrt{3C}}{32(t+1)^{1/3}}\right)\end{split}

where the second inequality is due to (16) and the third inequality is due to (2).

Since \frac{(t+2)^{2/3}}{(t+1)^{2/3}}\leq 1+\frac{1}{(t+1)^{2/3}} for any t\geq 0, it is easy to verify that

\frac{(t+2)^{2/3}}{(t+1)^{2/3}}\left(1+\frac{1}{2(t+1)^{1/3}}\right)\leq 1+\frac{2}{(t+1)^{1/3}}

which further implies that

\begin{split}h_{t}\leq&\epsilon_{t}\left(1+\frac{2}{(t+1)^{1/3}}\right)\left(1-\frac{\alpha_{K}\sqrt{3C}}{32(t+1)^{1/3}}\right)\\ \leq&\epsilon_{t}\left(1+\frac{2}{(t+1)^{1/3}}\right)\left(1-\frac{2}{(t+1)^{1/3}}\right)\\ \leq&\epsilon_{t}\end{split} (20)

where the second inequality is due to \alpha_{K}\sqrt{3C}\geq 64.

By combining (14), (18), (19) and (20), we complete the proof.

Conclusion and Future Work

In this paper, we first prove that the classical OFW algorithm with line search attains an O(T^{2/3}) regret bound for OCO over strongly convex sets, which is better than the O(T^{3/4}) regret bound for the general OCO. Furthermore, for strongly convex losses, we introduce a strongly convex variant of OFW, and prove that it achieves a regret bound of O(T^{2/3}) over general convex sets and a better regret bound of O(\sqrt{T}) over strongly convex sets.

An open question is whether the regret of OFW and its strongly convex variant over strongly convex sets can be further improved if the losses are smooth. We note that Hazan and Minasyan (2020) have proposed a projection-free algorithm for OCO with smooth losses over general convex sets, and established an improved regret bound of O(T^{2/3}) by taking advantage of the smoothness.

Acknowledgments

This work was partially supported by the NSFC (61921006), NSFC-NRF Joint Research Project (61861146001), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

  • Agarwal et al. (2006) Agarwal, A.; Hazan, E.; Kale, S.; and Schapire, R. E. 2006. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning, 9–16.
  • Awerbuch and Kleinberg (2008) Awerbuch, B.; and Kleinberg, R. 2008. Online linear optimization and adaptive routing. Journal of Computer and System Sciences 74(1): 97–114.
  • Awerbuch and Kleinberg (2004) Awerbuch, B.; and Kleinberg, R. D. 2004. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, 45–53.
  • Blum and Kalai (1999) Blum, A.; and Kalai, A. 1999. Universal portfolios with and without transaction costs. Machine Learning 35(3): 193–205.
  • Boyd and Vandenberghe (2004) Boyd, S.; and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
  • Bubeck et al. (2015) Bubeck, S.; Dekel, O.; Koren, T.; and Peres, Y. 2015. Bandit convex optimization: \sqrt{T} regret in one dimension. In Proceedings of the 28th Conference on Learning Theory, 266–278.
  • Chen et al. (2016) Chen, J.; Yang, T.; Lin, Q.; Zhang, L.; and Chang, Y. 2016. Optimal stochastic strongly convex optimization with a logarithmic number of projections. In Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence, 122–131.
  • Chen, Zhang, and Karbasi (2019) Chen, L.; Zhang, M.; and Karbasi, A. 2019. Projection-free bandit convex optimization. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2047–2056.
  • Demyanov and Rubinov (1970) Demyanov, V. F.; and Rubinov, A. M. 1970. Approximate Methods in Optimization Problems. Elsevier Publishing Company.
  • Duchi, Agarwal, and Wainwright (2011) Duchi, J. C.; Agarwal, A.; and Wainwright, M. J. 2011. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control 57(3): 592–606.
  • Dunn (1979) Dunn, J. C. 1979. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization 17(2): 187–211.
  • Flaxman, Kalai, and McMahan (2005) Flaxman, A. D.; Kalai, A. T.; and McMahan, H. B. 2005. Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, 385–394.
  • Frank and Wolfe (1956) Frank, M.; and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1–2): 95–110.
  • Garber and Hazan (2015) Garber, D.; and Hazan, E. 2015. Faster rates for the frank-wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, 541–549.
  • Garber and Hazan (2016) Garber, D.; and Hazan, E. 2016. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. SIAM Journal on Optimization 26(3): 1493–1528.
  • Garber and Kretzu (2020a) Garber, D.; and Kretzu, B. 2020a. Improved regret bounds for projection-free bandit convex optimization. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2196–2206.
  • Garber and Kretzu (2020b) Garber, D.; and Kretzu, B. 2020b. Revisiting projection-free online learning: the strongly convex case. ArXiv e-prints arXiv: 2010.07572.
  • Hazan (2008) Hazan, E. 2008. Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, 306–316.
  • Hazan (2016) Hazan, E. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3–4): 157–325.
  • Hazan, Agarwal, and Kale (2007) Hazan, E.; Agarwal, A.; and Kale, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2): 169–192.
  • Hazan and Kale (2012) Hazan, E.; and Kale, S. 2012. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 1843–1850.
  • Hazan and Luo (2016) Hazan, E.; and Luo, H. 2016. Variance-reduced and projection-free stochastic optimization. In Proceedings of the 33rd International Conference on Machine Learning, 1263–1271.
  • Hazan and Minasyan (2020) Hazan, E.; and Minasyan, E. 2020. Faster projection-free online learning. In Proceedings of the 33rd Annual Conference on Learning Theory, 1877–1893.
  • Hosseini, Chapman, and Mesbahi (2013) Hosseini, S.; Chapman, A.; and Mesbahi, M. 2013. Online distributed optimization via dual averaging. In 52nd IEEE Conference on Decision and Control, 1484–1489.
  • Jaggi (2013) Jaggi, M. 2013. Revisiting frank-wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, 427–435.
  • Levitin and Polyak (1966) Levitin, E. S.; and Polyak, B. T. 1966. Constrained minimization methods. USSR Computational mathematics and mathematical physics 6: 1–50.
  • Levy and Krause (2019) Levy, K. Y.; and Krause, A. 2019. Projection free online learning over smooth sets. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 1458–1466.
  • Luo, Wei, and Zheng (2018) Luo, H.; Wei, C.-Y.; and Zheng, K. 2018. Efficient online portfolio with logarithmic regret. In Advances in Neural Information Processing Systems 31, 8235–8245.
  • Rector-Brooks, Wang, and Mozafari (2019) Rector-Brooks, J.; Wang, J.-K.; and Mozafari, B. 2019. Revisiting projection-free optimization for strongly convex constraint sets. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 1576–1583.
  • Shalev-Shwartz (2007) Shalev-Shwartz, S. 2007. Online Learning: Theory, Algorithms, and Applications. Ph.D. thesis, The Hebrew University of Jerusalem.
  • Shalev-Shwartz (2011) Shalev-Shwartz, S. 2011. Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2): 107–194.
  • Shalev-Shwartz and Singer (2007) Shalev-Shwartz, S.; and Singer, Y. 2007. A primal-dual perspective of online learning algorithm. Machine Learning 69(2–3): 115–142.
  • Wan, Tu, and Zhang (2020) Wan, Y.; Tu, W.-W.; and Zhang, L. 2020. Projection-free distributed online convex optimization with O(\sqrt{T}) communication complexity. In Proceedings of the 37th International Conference on Machine Learning, 9818–9828.
  • Yang, Lin, and Zhang (2017) Yang, T.; Lin, Q.; and Zhang, L. 2017. A richer theory of convex constrained optimization with reduced projections and improved rates. In Proceedings of the 34th International Conference on Machine Learning, 3901–3910.
  • Zhang et al. (2013) Zhang, L.; Yang, T.; Jin, R.; and He, X. 2013. O(\log T) projections for stochastic optimization of smooth and strongly convex functions. In Proceedings of the 30th International Conference on Machine Learning, 1121–1129.
  • Zhang et al. (2017) Zhang, W.; Zhao, P.; Zhu, W.; Hoi, S. C. H.; and Zhang, T. 2017. Projection-free distributed online learning in networks. In Proceedings of the 34th International Conference on Machine Learning, 4054–4062.
  • Zinkevich (2003) Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, 928–936.

Proof of Lemma 2

Let \tilde{f}_{t}(\mathbf{x})=\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{t}\|_{2}^{2} for t=1,\cdots,T. For brevity, we define h_{t}=F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}) for t=2,\cdots,T and h_{t}(\mathbf{x}_{t-1})=F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast}) for t=3,\cdots,T. For t=2, we have

\begin{split}h_{2}=&F_{1}(\mathbf{x}_{2})-F_{1}(\mathbf{x}_{2}^{\ast})\\ =&\langle\nabla f_{1}(\mathbf{x}_{1}),\mathbf{x}_{2}\rangle+\frac{\lambda}{2}\|\mathbf{x}_{2}-\mathbf{x}_{1}\|_{2}^{2}-\langle\nabla f_{1}(\mathbf{x}_{1}),\mathbf{x}_{2}^{\ast}\rangle-\frac{\lambda}{2}\|\mathbf{x}_{2}^{\ast}-\mathbf{x}_{1}\|_{2}^{2}\\ \leq&\langle\nabla f_{1}(\mathbf{x}_{1}),\mathbf{x}_{2}-\mathbf{x}_{2}^{\ast}\rangle+\frac{\lambda}{2}\|\mathbf{x}_{2}-\mathbf{x}_{1}\|_{2}^{2}\\ \leq&\|\nabla f_{1}(\mathbf{x}_{1})\|_{2}\|\mathbf{x}_{2}-\mathbf{x}_{2}^{\ast}\|_{2}+\frac{\lambda}{2}\|\mathbf{x}_{2}-\mathbf{x}_{1}\|_{2}^{2}\\ \leq&GD+\frac{\lambda D^{2}}{2}\\ \leq&C.\end{split} (21)

For any T+1\geq t\geq 3, assuming h_{t-1}\leq C, we have

\begin{split}&h_{t}(\mathbf{x}_{t-1})\\ =&F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})\\ =&F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t}^{\ast})+\tilde{f}_{t-1}(\mathbf{x}_{t-1})-\tilde{f}_{t-1}(\mathbf{x}_{t}^{\ast})\\ \leq&F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t-1}^{\ast})+(G+\lambda D)\|\mathbf{x}_{t-1}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&C+(G+\lambda D)\|\mathbf{x}_{t-1}-\mathbf{x}_{t-1}^{\ast}\|_{2}+(G+\lambda D)\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&C+(G+\lambda D)\sqrt{\frac{2C}{(t-2)\lambda}}+\frac{2(G+\lambda D)^{2}}{(t-1)\lambda}\end{split}

where the first inequality is due to Lemma 4 and the last inequality is due to (11) and

\|\mathbf{x}_{t-1}-\mathbf{x}_{t-1}^{\ast}\|_{2}\leq\sqrt{\frac{2(F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t-1}^{\ast}))}{(t-2)\lambda}}. (22)

Let c_{t}=1+\frac{1}{\sqrt{t-1}}+\frac{1}{2(t-1)}. Because of \frac{2(G+\lambda D)}{\sqrt{\lambda}}\leq\sqrt{C}, we further have

\begin{split}h_{t}(\mathbf{x}_{t-1})\leq&C+\frac{C}{2}\sqrt{\frac{2}{(t-2)}}+\frac{C}{2(t-1)}\\ \leq&C+\frac{C}{2}\sqrt{\frac{4}{t-1}}+\frac{C}{2(t-1)}\\ \leq&C\left(1+\frac{1}{\sqrt{t-1}}+\frac{1}{2(t-1)}\right)\\ =&c_{t}C\end{split}

where the second inequality is due to \frac{1}{t-2}\leq\frac{2}{t-1} for any t\geq 3.

We note that F_{t-1}(\mathbf{x}) is (t-1)\lambda-smooth for any t=2,\cdots,T+1. By applying Lemma 5 with f(\mathbf{x})=F_{t-1}(\mathbf{x}) and \mathbf{x}_{\operatorname*{in}}=\mathbf{x}_{t-1}, we have \mathbf{x}_{\operatorname*{out}}=\mathbf{x}_{t} and

h_{t}\leq h_{t}(\mathbf{x}_{t-1})\max\left(\frac{1}{2},1-\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{8(t-1)\lambda}\right). (23)

Because of (23), if \frac{1}{2}\leq\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{8(t-1)\lambda}, it is easy to verify that

h_{t}\leq\frac{h_{t}(\mathbf{x}_{t-1})}{2}\leq\frac{c_{t}}{2}C\leq C (24)

where the last inequality is due to c_{t}\leq 2 for any t\geq 3.

Otherwise, if \frac{1}{2}>\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{8(t-1)\lambda}, there exist two cases. First, if h_{t}(\mathbf{x}_{t-1})\leq C, it is easy to verify that

h_{t}\leq h_{t}(\mathbf{x}_{t-1})\leq C. (25)

Second, if c_{t}C\geq h_{t}(\mathbf{x}_{t-1})>C, we have

\begin{split}h_{t}\leq&h_{t}(\mathbf{x}_{t-1})\left(1-\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{8(t-1)\lambda}\right)\\ \leq&c_{t}C\left(1-\frac{\alpha_{K}\|\nabla F_{t-1}(\mathbf{x}_{t-1})\|_{2}}{8(t-1)\lambda}\right)\\ \leq&c_{t}C\left(1-\frac{\alpha_{K}\sqrt{(t-1)\lambda}\sqrt{h_{t}(\mathbf{x}_{t-1})}}{8\sqrt{2}(t-1)\lambda}\right)\\ \leq&c_{t}C\left(1-\frac{\alpha_{K}\sqrt{C}}{8\sqrt{2}\sqrt{(t-1)\lambda}}\right)\end{split}

where the third inequality is due to (2).

Because of c_{t}=1+\frac{1}{\sqrt{t-1}}+\frac{1}{2(t-1)}\leq 1+\frac{3}{2\sqrt{t-1}} for any t\geq 3 and \alpha_{K}\sqrt{C}\geq 12\sqrt{2\lambda}, we further have

h_{t}\leq C\left(1+\frac{3}{2\sqrt{t-1}}\right)\left(1-\frac{3}{2\sqrt{t-1}}\right)\leq C. (26)

By combining (21), (24), (25) and (26), we complete the proof.

Proof of Lemma 4

This lemma has been proved by Garber and Hazan (2016); we include the proof for completeness. For any \mathbf{x},\mathbf{y}\in\mathcal{K} and t\in[T], we have

\begin{split}\tilde{f}_{t}(\mathbf{x})-\tilde{f}_{t}(\mathbf{y})\leq&\langle\nabla\tilde{f}_{t}(\mathbf{x}),\mathbf{x}-\mathbf{y}\rangle=\langle\nabla f_{t}(\mathbf{x}_{t})+\lambda(\mathbf{x}-\mathbf{x}_{t}),\mathbf{x}-\mathbf{y}\rangle\\ \leq&\|\nabla f_{t}(\mathbf{x}_{t})+\lambda(\mathbf{x}-\mathbf{x}_{t})\|_{2}\|\mathbf{x}-\mathbf{y}\|_{2}\leq(G+\lambda D)\|\mathbf{x}-\mathbf{y}\|_{2}.\end{split}

By the same argument with the roles of \mathbf{x} and \mathbf{y} exchanged, we also have

\tilde{f}_{t}(\mathbf{y})-\tilde{f}_{t}(\mathbf{x})\leq(G+\lambda D)\|\mathbf{x}-\mathbf{y}\|_{2}

which completes the proof.

Proof of Theorem 3

This proof is similar to that of Theorem 2. Let \tilde{f}_{t}(\mathbf{x})=\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{t}\|_{2}^{2} for any t\in[T]. Let \mathbf{x}_{t}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-1}(\mathbf{x}) for any t=2,\cdots,T+1. From the proof of Theorem 2, we have

\sum_{t=1}^{T}f_{t}(\mathbf{x}_{t})-\sum_{t=1}^{T}f_{t}(\mathbf{x}^{\ast})\leq A+B

where A=\sum_{t=1}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast})) and B=\sum_{t=1}^{T}(\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast})-\tilde{f}_{t}(\mathbf{x}^{\ast})).

Moreover, as in (13), we have proved that

B=\sum_{t=1}^{T}(\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast})-\tilde{f}_{t}(\mathbf{x}^{\ast}))\leq 0.

Therefore, we only need to bound A. For SC-OFW over general convex sets, we introduce a new lemma.

Lemma 6

Let \mathbf{x}_{t}^{\ast}=\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}F_{t-1}(\mathbf{x}) for any t=2,\cdots,T+1, where F_{t}(\mathbf{x}) is defined in (5). Then, for any t=2,\cdots,T+1, Algorithm 2 has

F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast})\leq\epsilon_{t}=C(t-1)^{1/3}

where C=\frac{16(G+\lambda D)^{2}}{\lambda}.

By applying Lemmas 4 and 6, we have

\begin{split}\sum_{t=2}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))\leq&\sum_{t=2}^{T}(G+\lambda D)\|\mathbf{x}_{t}-\mathbf{x}_{t+1}^{\ast}\|_{2}\\ \leq&(G+\lambda D)\sum_{t=2}^{T}\|\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\|_{2}+(G+\lambda D)\sum_{t=2}^{T}\|\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t+1}^{\ast}\|_{2}\\ \leq&(G+\lambda D)\sum_{t=2}^{T}\sqrt{\frac{2(F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}))}{(t-1)\lambda}}+(G+\lambda D)\sum_{t=2}^{T}\frac{2(G+\lambda D)}{t\lambda}\\ \leq&(G+\lambda D)\sum_{t=2}^{T}\sqrt{\frac{2C(t-1)^{1/3}}{(t-1)\lambda}}+(G+\lambda D)\sum_{t=2}^{T}\frac{2(G+\lambda D)}{t\lambda}\\ \leq&(G+\lambda D)\sqrt{\frac{2C}{\lambda}}\frac{3}{2}T^{2/3}+2(G+\lambda D)^{2}\frac{\ln T}{\lambda}\\ =&\frac{3\sqrt{2}CT^{2/3}}{8}+\frac{C\ln T}{8}\end{split} (27)

where the third inequality is due to \|\mathbf{x}_{t}-\mathbf{x}_{t}^{\ast}\|_{2}\leq\sqrt{\frac{2(F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}))}{(t-1)\lambda}} for t\geq 2 and (11), and the last equality is due to C=\frac{16(G+\lambda D)^{2}}{\lambda}.

Because of \|\nabla f_{1}(\mathbf{x}_{1})\|_{2}\leq G and \|\mathbf{x}_{1}-\mathbf{x}_{2}^{\ast}\|_{2}\leq D, we further have

\begin{split}A=&\tilde{f}_{1}(\mathbf{x}_{1})-\tilde{f}_{1}(\mathbf{x}_{2}^{\ast})+\sum_{t=2}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))\\ =&\langle\nabla f_{1}(\mathbf{x}_{1}),\mathbf{x}_{1}-\mathbf{x}_{2}^{\ast}\rangle-\frac{\lambda}{2}\|\mathbf{x}_{2}^{\ast}-\mathbf{x}_{1}\|_{2}^{2}+\sum_{t=2}^{T}(\tilde{f}_{t}(\mathbf{x}_{t})-\tilde{f}_{t}(\mathbf{x}_{t+1}^{\ast}))\\ \leq&\|\nabla f_{1}(\mathbf{x}_{1})\|_{2}\|\mathbf{x}_{1}-\mathbf{x}_{2}^{\ast}\|_{2}+\frac{3\sqrt{2}CT^{2/3}}{8}+\frac{C\ln T}{8}\\ \leq&GD+\frac{3\sqrt{2}CT^{2/3}}{8}+\frac{C\ln T}{8}\end{split}

which completes this proof.

Proof of Lemma 6

For brevity, we define h_{t}=F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast}) and \tilde{f}_{t}(\mathbf{x})=\langle\nabla f_{t}(\mathbf{x}_{t}),\mathbf{x}\rangle+\frac{\lambda}{2}\|\mathbf{x}-\mathbf{x}_{t}\|_{2}^{2} for any t\in[T]. For t=2, following (21), we have

h_{2}=F_{1}(\mathbf{x}_{2})-F_{1}(\mathbf{x}_{2}^{\ast})\leq GD+\frac{\lambda D^{2}}{2}\leq C=\epsilon_{2}. (28)

For any T+1\geq t\geq 3, assuming h_{t-1}\leq\epsilon_{t-1}, we have

\begin{split}&F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})\\ =&F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t}^{\ast})+\tilde{f}_{t-1}(\mathbf{x}_{t-1})-\tilde{f}_{t-1}(\mathbf{x}_{t}^{\ast})\\ \leq&F_{t-2}(\mathbf{x}_{t-1})-F_{t-2}(\mathbf{x}_{t-1}^{\ast})+(G+\lambda D)\|\mathbf{x}_{t-1}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&\epsilon_{t-1}+(G+\lambda D)\|\mathbf{x}_{t-1}-\mathbf{x}_{t-1}^{\ast}\|_{2}+(G+\lambda D)\|\mathbf{x}_{t-1}^{\ast}-\mathbf{x}_{t}^{\ast}\|_{2}\\ \leq&\epsilon_{t-1}+(G+\lambda D)\sqrt{\frac{2\epsilon_{t-1}}{(t-2)\lambda}}+\frac{2(G+\lambda D)^{2}}{(t-1)\lambda}\end{split}

where the first inequality is due to Lemma 4 and the last inequality is due to (11) and (22).

Because of C=\frac{16(G+\lambda D)^{2}}{\lambda} and \frac{1}{t-2}\leq\frac{2}{t-1} for any t\geq 3, we further have

\begin{split}F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})\leq&\epsilon_{t-1}+2(G+\lambda D)\sqrt{\frac{\epsilon_{t-1}}{(t-1)\lambda}}+\frac{2(G+\lambda D)^{2}}{(t-1)\lambda}\\ \leq&C(t-2)^{1/3}+\frac{\sqrt{C\lambda}}{2}\sqrt{\frac{C(t-2)^{1/3}}{(t-1)\lambda}}+\frac{C}{8(t-1)}\\ \leq&C(t-1)^{1/3}+\frac{C}{2(t-1)^{1/3}}+\frac{C}{8(t-1)}.\end{split} (29)

Then, for any \sigma\in[0,1], we have

\begin{split}h_{t}&=F_{t-1}(\mathbf{x}_{t})-F_{t-1}(\mathbf{x}_{t}^{\ast})=F_{t-1}(\mathbf{x}_{t-1}+\sigma_{t-1}(\mathbf{v}_{t-1}-\mathbf{x}_{t-1}))-F_{t-1}(\mathbf{x}_{t}^{\ast})\\ &\leq F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})+\sigma_{t-1}\langle\nabla F_{t-1}(\mathbf{x}_{t-1}),\mathbf{v}_{t-1}-\mathbf{x}_{t-1}\rangle+\frac{(t-1)\lambda\sigma_{t-1}^{2}}{2}\|\mathbf{v}_{t-1}-\mathbf{x}_{t-1}\|_{2}^{2}\\ &\leq F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})+\sigma\langle\nabla F_{t-1}(\mathbf{x}_{t-1}),\mathbf{v}_{t-1}-\mathbf{x}_{t-1}\rangle+\frac{(t-1)\lambda\sigma^{2}}{2}\|\mathbf{v}_{t-1}-\mathbf{x}_{t-1}\|_{2}^{2}\\ &\leq F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})+\sigma\langle\nabla F_{t-1}(\mathbf{x}_{t-1}),\mathbf{x}_{t}^{\ast}-\mathbf{x}_{t-1}\rangle+\frac{(t-1)\lambda\sigma^{2}}{2}\|\mathbf{v}_{t-1}-\mathbf{x}_{t-1}\|_{2}^{2}\\ &\leq F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast})+\sigma(F_{t-1}(\mathbf{x}_{t}^{\ast})-F_{t-1}(\mathbf{x}_{t-1}))+\frac{(t-1)\lambda\sigma^{2}}{2}\|\mathbf{v}_{t-1}-\mathbf{x}_{t-1}\|_{2}^{2}\\ &\leq(1-\sigma)(F_{t-1}(\mathbf{x}_{t-1})-F_{t-1}(\mathbf{x}_{t}^{\ast}))+\frac{(t-1)\lambda\sigma^{2}D^{2}}{2}\end{split} (30)

where the first inequality is due to the smoothness of F_{t-1}(\mathbf{x}), the second inequality is due to the line search, and the third inequality is due to \mathbf{v}_{t-1}\in\operatorname*{argmin}_{\mathbf{x}\in\mathcal{K}}\langle\nabla F_{t-1}(\mathbf{x}_{t-1}),\mathbf{x}\rangle.

Because of 0\leq(t-1)^{-2/3}\leq 1 for any t\geq 3, we substitute \sigma=(t-1)^{-2/3} into (30) and combine it with (29), which implies that

h_{t}\leq\left(1-\frac{1}{(t-1)^{2/3}}\right)\left(C(t-1)^{1/3}+\frac{C}{2(t-1)^{1/3}}+\frac{C}{8(t-1)}\right)+\frac{(t-1)\lambda D^{2}}{2(t-1)^{4/3}}.

Then, due to 16\lambda D^{2}\leq C, we have

h_{t}\leq\left(1-\frac{1}{(t-1)^{2/3}}\right)\left(C(t-1)^{1/3}+\frac{C}{2(t-1)^{1/3}}+\frac{C}{8(t-1)}\right)+\frac{C}{32(t-1)^{1/3}}.

Moreover, due to \frac{1}{4}\leq 1-\frac{1}{(t-1)^{2/3}} for any t\geq 3, we have

\begin{split}h_{t}&\leq\left(1-\frac{1}{(t-1)^{2/3}}\right)\left(C(t-1)^{1/3}+\frac{C}{2(t-1)^{1/3}}+\frac{C}{8(t-1)^{1/3}}+\frac{C}{8(t-1)}\right)\\ &=\left(1-\frac{1}{(t-1)^{2/3}}\right)\left(1+\frac{5}{8(t-1)^{2/3}}+\frac{1}{8(t-1)^{4/3}}\right)C(t-1)^{1/3}\\ &\leq\left(1-\frac{1}{(t-1)^{2/3}}\right)\left(1+\frac{1}{(t-1)^{2/3}}\right)C(t-1)^{1/3}\\ &\leq\epsilon_{t}.\end{split} (31)

By combining (28) and (31), we complete the proof.