
Faster Rates for the Frank-Wolfe Algorithm Using Jacobi Polynomials

Abstract

The Frank-Wolfe (FW) algorithm is a popular projection-free alternative for solving large-scale constrained optimization problems. However, the FW algorithm suffers from a sublinear convergence rate when minimizing a smooth convex function over a compact convex set. Thus, exploring techniques that yield a faster convergence rate becomes crucial. A classic approach to obtain faster rates is to combine previous iterates to obtain the next iterate. In this work, we extend this approach to the FW setting and show that the optimal way to combine the past iterates is using a set of orthogonal Jacobi polynomials. We also propose a polynomial-based acceleration technique, referred to as Jacobi polynomial accelerated FW, which combines the current iterate with the past iterate using combining weights related to the Jacobi recursion. By carefully choosing the parameters of the Jacobi polynomials, we obtain a faster sublinear convergence rate. We provide numerical experiments on real datasets to demonstrate the efficacy of the proposed algorithm.

Index Terms—  Acceleration methods, constrained optimization, Frank-Wolfe algorithm, Jacobi polynomials.

1 Introduction

Consider constrained optimization problems of the form

$$\operatorname*{minimize}_{\mathbf{x}} \;\; f(\mathbf{x}) \qquad \text{subject to} \qquad \mathbf{x}\in\mathcal{C}, \qquad (1)$$

where $f:\textbf{dom}(f)\rightarrow\mathbb{R}$ is a smooth convex function and the constraint set $\mathcal{C}\subseteq\textbf{dom}(f)$ is compact and convex. Projected gradient descent [1] is a popular technique for solving constrained optimization problems, wherein gradient descent iterates are projected onto the constraint set. The constraint set in many large-scale problems commonly encountered in machine learning and signal processing, such as lasso, matrix completion, and regularized logistic or robust regression, has a nice structure, but the projection operation onto the constraint set can be computationally expensive.

The Frank-Wolfe (FW) algorithm (also popularly known as the conditional gradient method) [2] is a projection-free alternative to the projected gradient descent algorithm for solving constrained optimization problems. The FW algorithm avoids the projection step and instead has a direction-finding step that minimizes a local linear approximation of the objective function over the constraint set. The update is then computed by taking a convex combination of the previous iterate with the solution from the linear minimization step. The linear minimization step in FW is often easy to compute for $\ell_p$-norm ball or polytope constraint sets that frequently appear in machine learning and signal processing problems, thereby making FW an attractive choice.

Convergence results for the FW algorithm are provided in [3]. The FW algorithm has iteration complexity bounds that are independent of the problem size but is limited by a sublinear convergence rate of $\mathcal{O}(1/k)$. With a carefully selected step size, for minimizing a strongly convex function over a strongly convex constraint set, the FW algorithm can achieve a convergence rate of $\mathcal{O}(1/k^2)$, and a linear convergence rate when the solution lies in the interior of the convex constraint set [4]. The FW algorithm can even achieve linear rates under conditions weaker than strong convexity over polytopes by removing carefully chosen directions (referred to as away steps) that hinder the convergence rate [5]. By finding an optimal step size at each iteration, faster convergence of the FW algorithm may be obtained [4, 6].

For unconstrained optimization problems, Nesterov’s acceleration technique [7] and Polyak’s heavy-ball momentum technique [8] provide faster rates of $\mathcal{O}(1/k^2)$. Central to Polyak’s heavy-ball method [8] is the so-called multistep method, which keeps track of all the previous iterates and computes successive updates by taking a weighted combination of the past iterates. Doing so yields a smoother trajectory and helps attain faster convergence rates. In a similar vein, several variants of the multistep method have been proposed for FW. In [9], a momentum-guided accelerated FW (AFW) algorithm has been proposed that achieves a convergence rate of $\mathcal{O}(1/k^2)$ for smooth convex minimization problems over $\ell_p$-norm balls, but only $\mathcal{O}(1/k)$ in the general setting. The FW iterates tend to zig-zag as they approach the optimal solution, which might affect the convergence rate. In [6], the zig-zag trajectory is alleviated by using the concept of parallel tangents for FW, wherein a significant improvement in convergence has been reported, but without any theoretical guarantees. In [10], the zig-zag phenomenon is controlled by keeping track of the past gradients and taking their weighted average with different heuristic choices for the weights. When using multistep-based methods, it is important to find optimal updates and combining weights that achieve faster convergence without much computational overhead.

The main aim of this work is, therefore, to determine an optimal technique, independent of the problem size, to combine the previous iterates as in the multistep method and obtain faster convergence rates. To this end, we draw inspiration from polynomial-based iteration methods for unconstrained problems [11] and extend them to the FW setting. The main contributions of the paper are as follows.

  • We express the linear combination of the past FW iterates as a polynomial and show that the optimal polynomial that yields the maximum error reduction is given by a set of orthogonal Jacobi polynomials.

  • We then modify the FW algorithm to incorporate a Jacobi recursion, wherein the weights from the recurrence relation of the Jacobi polynomials are used to combine the past iterate with the current iterate.

  • We show that for carefully selected parameters of the Jacobi polynomials, we obtain a faster convergence rate of $\mathcal{O}(1/k^2)$ when minimizing a smooth convex function over a compact convex constraint set.

Through numerical experiments on real datasets, we demonstrate the efficacy of the proposed algorithm.

2 Preliminaries

We use boldface lowercase (uppercase) letters to represent vectors (respectively, matrices). We use $\langle\mathbf{x},\mathbf{y}\rangle$ to denote the inner product between vectors $\mathbf{x}$ and $\mathbf{y}$. We denote an arbitrary norm by $\|\cdot\|$, the $\ell_2$-norm by $\|\cdot\|_2$, and the nuclear norm by $\|\cdot\|_\star$. We say that a function $f:\textbf{dom}(f)\rightarrow\mathbb{R}$ is $L$-smooth over a convex set $\mathcal{C}\subseteq\textbf{dom}(f)$ if for all $\mathbf{x},\mathbf{y}\in\mathcal{C}$ it holds that

$$f(\mathbf{y}) \leq f(\mathbf{x}) + \langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle + \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^2.$$
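For instance (an illustrative example, not from the original text), the least-squares objective

$$f(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2, \qquad \nabla f(\mathbf{x}) = \mathbf{A}^\top(\mathbf{A}\mathbf{x}-\mathbf{b}), \qquad \nabla^2 f(\mathbf{x}) = \mathbf{A}^\top\mathbf{A},$$

is $L$-smooth with $L = \lambda_{\max}(\mathbf{A}^\top\mathbf{A}) = \|\mathbf{A}\|_2^2$, since the quadratic upper bound above then holds with this constant.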

2.1 The FW method

The FW algorithm [2] finds an update direction based on the local linear approximation of $f$ around the current iterate $\mathbf{x}_k$ by solving the linear minimization problem:

$$\mathbf{s}_k = \operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{C}} \; f(\mathbf{x}_k) + \langle\nabla f(\mathbf{x}_k), \mathbf{s}-\mathbf{x}_k\rangle. \qquad (2)$$

Then the update is computed by forming a convex combination of $\mathbf{s}_k$ and $\mathbf{x}_k$ as

$$\mathbf{x}_{k+1} = (1-\gamma_k)\mathbf{x}_k + \gamma_k\mathbf{s}_k$$

with $\gamma_k\in[0,1]$. This ensures $\mathbf{x}_{k+1}\in\mathcal{C}$. Typically, a diminishing step size $\gamma_k$ is used [3], but line search techniques to compute the optimal step size may also be used [2]. The FW method is summarized as Algorithm 1.

Algorithm 1 The vanilla Frank-Wolfe method (FW)
1: Initialize $\mathbf{x}_0\in\mathcal{C}$
2: for $k=0,1,\ldots$ do
3:    $\mathbf{s}_k \leftarrow \operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{C}} \langle\nabla f(\mathbf{x}_k), \mathbf{s}\rangle$
4:    $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \gamma_k(\mathbf{s}_k - \mathbf{x}_k)$
5:    $\gamma_k \leftarrow \frac{2}{k+2}$

The FW algorithm with the diminishing step size $\gamma_k = \frac{2}{k+2}$ yields a sublinear convergence rate of $\mathcal{O}(1/k)$ for minimizing $L$-smooth convex functions over a compact convex constraint set. Specifically, the suboptimality gap at the $k$th iterate of FW satisfies [3]

$$f(\mathbf{x}_k) - f(\mathbf{x}^\star) \leq \frac{2LD^2}{k+2},$$

where $D = \sup_{\mathbf{x},\mathbf{y}\in\mathcal{C}}\|\mathbf{x}-\mathbf{y}\|_2$ is the diameter of the constraint set and $\mathbf{x}^\star$ is a minimizer of $f$.
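For concreteness, the following is a minimal Python sketch of Algorithm 1 for an $\ell_2$-norm ball constraint, for which the linear minimization step has the closed form $\mathbf{s}_k = -t\,\nabla f(\mathbf{x}_k)/\|\nabla f(\mathbf{x}_k)\|_2$. The function names and the quadratic test objective are our illustrative choices, not part of the original paper.

```python
import numpy as np

def lmo_l2_ball(grad, t):
    """Linear minimization oracle over {s : ||s||_2 <= t}:
    argmin_{||s||_2 <= t} <grad, s> = -t * grad / ||grad||_2."""
    return -t * grad / (np.linalg.norm(grad) + 1e-12)

def frank_wolfe(grad_f, lmo, x0, num_iters=200):
    """Vanilla FW (Algorithm 1) with the diminishing step size 2/(k+2)."""
    x = x0.copy()
    for k in range(num_iters):
        s = lmo(grad_f(x))              # FW direction-finding step
        gamma_k = 2.0 / (k + 2.0)       # diminishing step size
        x = x + gamma_k * (s - x)       # convex combination keeps x feasible
    return x

# Illustrative use: minimize 0.5*||Ax - b||^2 over an l2-ball of radius t.
rng = np.random.default_rng(0)
A, b, t = rng.standard_normal((50, 10)), rng.standard_normal(50), 5.0
x_hat = frank_wolfe(lambda x: A.T @ (A @ x - b),
                    lambda g: lmo_l2_ball(g, t), np.zeros(10))
```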

2.2 The Jacobi polynomials

The Jacobi polynomials are a set of orthogonal polynomials, orthogonal with respect to (w.r.t.) the weight $(1-x)^\alpha(1+x)^\beta$ on the interval $[-1,1]$, where $\alpha,\beta>-1$ are tunable parameters. The Jacobi polynomials are defined by the second-order recurrence relation

$$J_{k+1}(\lambda) = (a_k\lambda + b_k)J_k(\lambda) - c_kJ_{k-1}(\lambda), \quad k=1,2,\ldots \qquad (3)$$

with $J_0(\lambda)=1$ and $J_1(\lambda)=a_0\lambda+b_0$. The recurrence weights are given by

$$a_k = \frac{(\tau_k+k)(\tau_k+k+1)}{2\tau_k(\tau_k-\beta)}, \qquad b_k = \frac{(\tau_k+k)(\alpha^2-\beta^2)}{2\tau_k(\tau_k-\beta)(\tau_k+k-1)}, \qquad c_k = \frac{k(k+\beta)(\tau_k+k+1)}{\tau_k(\tau_k-\beta)(\tau_k+k-1)}, \qquad (4)$$

where $\tau_k = k+\alpha+\beta+1$, $a_0 = \frac{\alpha+\beta+2}{2(1+\alpha)}$, and $b_0 = \frac{\alpha-\beta}{2(1+\alpha)}$. Furthermore, we have $a_k+b_k-c_k=1$.
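Since the recurrence weights in (4) are used repeatedly in what follows, a small helper for computing them is convenient; the sketch below (the function name is ours) also checks the identity $a_k+b_k-c_k=1$ stated above numerically.

```python
def jacobi_weights(k, alpha, beta):
    """Recurrence weights (a_k, b_k, c_k) of the Jacobi recursion (3), cf. (4)."""
    if k == 0:
        # J_1 = a_0*lambda + b_0; there is no J_{-1} term, so we take c_0 = 0 (our convention).
        a0 = (alpha + beta + 2) / (2 * (1 + alpha))
        b0 = (alpha - beta) / (2 * (1 + alpha))
        return a0, b0, 0.0
    tau = k + alpha + beta + 1
    a = (tau + k) * (tau + k + 1) / (2 * tau * (tau - beta))
    b = (tau + k) * (alpha**2 - beta**2) / (2 * tau * (tau - beta) * (tau + k - 1))
    c = k * (k + beta) * (tau + k + 1) / (tau * (tau - beta) * (tau + k - 1))
    return a, b, c

# Numerical check of a_k + b_k - c_k = 1 for a sample (alpha, beta).
for k in range(1, 6):
    a, b, c = jacobi_weights(k, alpha=1.2, beta=0.5)
    assert abs(a + b - c - 1.0) < 1e-10
```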

3 The optimal polynomial

A classic technique to accelerate iterative methods is the so-called multistep method, which uses past iterates to construct the successive iterate [8]. In this section, we apply this technique to FW and find an optimal procedure to use the past iterates.

Suppose we take a linear combination of all the past iterates from Step 4 of Algorithm 1 with a fixed step size $\gamma$; then we get

$$\mathbf{z}_{k+1} = \sum_{i=0}^{k}\theta_i\mathbf{x}_i = \sum_{i=0}^{k}\theta_i\left[(1-\gamma)\mathbf{x}_{i-1}+\gamma\mathbf{s}_{i-1}\right], \qquad (5)$$

where we constrain the combining coefficients such that $\sum_{i=0}^{k}\theta_i=1$ with $\theta_i\in[0,1]$ for $i=0,1,\ldots,k$. This constraint ensures that the weighted average is a feasible point, i.e., $\mathbf{z}_{k+1}\in\mathcal{C}$.

Now an important question to ask is: can we find optimal coefficients that can possibly accelerate the FW algorithm? To answer this question, we express (5) as a polynomial and then select the best polynomial that speeds up the convergence rate. To do so, let us first unroll (5) and express it in terms of $\mathbf{x}_0$ as

$$\mathbf{z}_{k+1} = \sum_{i=0}^{k}\theta_i\left[(1-\gamma)^i\mathbf{x}_0 + \gamma\sum_{j=0}^{i-1}(1-\gamma)^{i-j}\mathbf{s}_j\right].$$

Let us define the polynomial $p_k(1-\gamma)$ of degree $k$ as

$$p_k(1-\gamma) = \theta_k(1-\gamma)^k + \theta_{k-1}(1-\gamma)^{k-1} + \cdots + \theta_0,$$

with $p_0(1-\gamma)=1$. Then we can express $\mathbf{z}_{k+1}$ as

$$\mathbf{z}_{k+1} = p_k(1-\gamma)\mathbf{x}_0 + \gamma p_{k-1}(1-\gamma)\mathbf{s}_0 + \cdots + \gamma p_0(1-\gamma)\mathbf{s}_{k-1},$$

where $p_k(1-\gamma) + \gamma\sum_{i=1}^{k}p_{k-i}(1-\gamma) = 1$ and thus $\mathbf{z}_{k+1}\in\textbf{conv}\{\mathbf{x}_0,\mathbf{s}_0,\ldots,\mathbf{s}_{k-1}\}$. The best polynomial is the one that minimizes the distance of the $(k+1)$st iterate to $\mathbf{x}^\star$, i.e.,

$$\mathbf{z}_{k+1}-\mathbf{x}^\star = p_k(1-\gamma)\mathbf{x}_0 + \gamma p_{k-1}(1-\gamma)\mathbf{s}_0 + \cdots + \gamma p_0(1-\gamma)\mathbf{s}_{k-1} - \mathbf{x}^\star$$
$$= p_k(1-\gamma)(\mathbf{x}_0-\mathbf{x}^\star) + \gamma p_{k-1}(1-\gamma)(\mathbf{s}_0-\mathbf{x}^\star) + \cdots + \gamma p_0(1-\gamma)(\mathbf{s}_{k-1}-\mathbf{x}^\star)$$

for each $k$, and is bounded from above as

$$\|\mathbf{z}_{k+1}-\mathbf{x}^\star\| \leq |p_k(1-\gamma)|\,\|\mathbf{x}_0-\mathbf{x}^\star\| + |\gamma p_{k-1}(1-\gamma)|\,\|\mathbf{s}_0-\mathbf{x}^\star\| + \cdots + |\gamma p_0(1-\gamma)|\,\|\mathbf{s}_{k-1}-\mathbf{x}^\star\| \qquad (6)$$

due to the triangle inequality. Thus minimizing this upper bound would provide updates that are as close as possible to the optimal path. Next, we find the best polynomial that minimizes the above bound in the following theorem.

Theorem 1.

A $k$th-degree orthogonal Jacobi polynomial is a minimizer of the upper bound in (6).

Proof.

Recall that the Jacobi polynomials are orthogonal on the interval $[-1,1]$. With the change of variable $\gamma=\frac{x+1}{2}$, the Jacobi polynomials are orthogonal on the interval $[0,1]$ w.r.t. the weight $w(\gamma)=(2-2\gamma)^\alpha(2\gamma)^\beta$. The polynomial $p_k^\star$ that minimizes its norm w.r.t. the weight function $w(\gamma)$, i.e.,

$$p_k^\star \in \operatorname*{arg\,min}_{p_k:\,p_0(1-\gamma)=1} \int_0^1 |p_k(1-\gamma)|\,(2-2\gamma)^\alpha(2\gamma)^\beta\,d\gamma, \qquad (7)$$

is given by the set of orthogonal polynomials [12].

Since a $k$th-degree Jacobi polynomial can be expressed as a linear combination of lower-degree Jacobi polynomials [13], we further have $p_k(1-\gamma)=\sum_{i=0}^{k-1}c_ip_i(1-\gamma)$ for some combining weights $\{c_i\}$. Thus, for Jacobi polynomials, we have

$$\operatorname*{minimize}_{p_\kappa:\,p_0(1-\gamma)=1,\,\kappa\leq k-1} \int_0^1\sum_{i=0}^{k-1}|c_ip_i(1-\gamma)|\,(2-2\gamma)^\alpha(2\gamma)^\beta\,d\gamma \;\leq\; \operatorname*{minimize}_{p_\kappa:\,p_0(1-\gamma)=1,\,\kappa\leq k-1} \sum_{i=0}^{k-1}\int_0^1|c_i|\,|p_i(1-\gamma)|\,(2-2\gamma)^\alpha(2\gamma)^\beta\,d\gamma.$$

Therefore, selecting a $k$th-degree Jacobi polynomial ensures that the norms of the lower-degree polynomials are also minimized, thereby minimizing the upper bound in (6) and the error $\|\mathbf{z}_{k+1}-\mathbf{x}^\star\|$ at the $(k+1)$st iteration.        $\Box$

The above theorem suggests that combining all the past iterates using orthogonal Jacobi polynomials yields the best successive update. Based on this insight, we modify the vanilla Frank-Wolfe method in the next section.

4 Jacobi acceleration for Frank-Wolfe

In this section, we present the proposed Jacobi accelerated Frank-Wolfe method (JFW), which is a polynomial-based technique for using the past iterates. For a convex and $L$-smooth function $f$, we show that, for carefully chosen Jacobi polynomial parameters $(\alpha,\beta)$, JFW needs only $\mathcal{O}(1/\sqrt{\varepsilon})$ iterations to reach an error of at most $\varepsilon$, instead of the $\mathcal{O}(1/\varepsilon)$ iterations needed by FW.

4.1 The JFW algorithm

We have seen in Section 2.2 that orthogonal Jacobi polynomials follow the second-order recursion (3) with three sequences of coefficients $\{a_k,b_k,c_k\}$ computed as in (4). This means that we do not have to store all the past iterates to compute (5), but only the previous iterate. Specifically, the proposed JFW algorithm computes an intermediate update

$$\mathbf{y}_{k+1} = \mathbf{x}_k + \gamma_k(\mathbf{s}_k-\mathbf{x}_k),$$

where $\mathbf{s}_k$ is the FW direction computed as in Step 3 of Algorithm 1 with $\gamma_k=2/(k+2)$. Then we combine the intermediate update with the past iterate using the second-order Jacobi recursion [cf. (3)]:

$$\mathbf{z}_{k+1} = (a_k(1-\gamma)+b_k)\mathbf{x}_k - c_k\mathbf{x}_{k-1} \qquad (8)$$

with $\gamma\in[0,1]$. Since $a_k(1-\gamma)+b_k-c_k = 1-\gamma a_k$ for the Jacobi polynomials, the combining weights do not sum to one, and hence $\mathbf{z}_{k+1}\notin\mathcal{C}$ in general. Therefore, we make a final correction step

$$\mathbf{x}_{k+1} = \mathbf{z}_{k+1} + \gamma a_k\mathbf{x}_k$$

so that $\mathbf{x}_{k+1}\in\mathcal{C}$ is a feasible point. The proposed JFW is summarized as Algorithm 2.

Algorithm 2 Jacobi accelerated Frank-Wolfe (JFW)
1: Initialize $\mathbf{x}_0\in\mathcal{C}$, $\alpha\geq\beta>-1$, and $\gamma$
2: for $k=0,1,\ldots$ do
3:    $\mathbf{s}_k \leftarrow \operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{C}} \langle\nabla f(\mathbf{x}_k), \mathbf{s}\rangle$ ▷ FW direction finding
4:    $\mathbf{y}_{k+1} \leftarrow \mathbf{x}_k + \gamma_k(\mathbf{s}_k - \mathbf{x}_k)$ ▷ FW update
5:    $\mathbf{z}_{k+1} \leftarrow (a_k(1-\gamma)+b_k)\mathbf{y}_{k+1} - c_k\mathbf{x}_k$ ▷ Jacobi recursion
6:    $\mathbf{x}_{k+1} \leftarrow \mathbf{z}_{k+1} + \gamma a_k\mathbf{x}_k$ ▷ Correction step
7:    $\gamma_k \leftarrow \frac{2}{k+2}$
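The following is a minimal Python sketch of Algorithm 2 under our reading of Steps 3–7. It reuses the illustrative `lmo_l2_ball` and `jacobi_weights` helpers sketched earlier and is not the authors' reference implementation; see the software link in Section 5 for that.

```python
def jacobi_fw(grad_f, lmo, x0, alpha, beta, gamma, num_iters=200):
    """Jacobi accelerated Frank-Wolfe (Algorithm 2), illustrative sketch."""
    x = x0.copy()
    for k in range(num_iters):
        a_k, b_k, c_k = jacobi_weights(k, alpha, beta)  # recurrence weights (4)
        s = lmo(grad_f(x))                              # Step 3: FW direction finding
        gamma_k = 2.0 / (k + 2.0)
        y = x + gamma_k * (s - x)                       # Step 4: FW update
        z = (a_k * (1 - gamma) + b_k) * y - c_k * x     # Step 5: Jacobi recursion
        x = z + gamma * a_k * x                         # Step 6: correction step
    return x
```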
Fig. 1: Performance of the FW, AFW and JFW algorithms on (a) the breast cancer dataset for $\ell_2$-constrained logistic regression, (b) the Pima Indian diabetes dataset for robust ridge regression, and (c) the Movielens 100k dataset for the robust matrix completion task.

4.2 Convergence analysis

To demonstrate the convergence of JFW, we consider an $L$-smooth convex function over a compact convex constraint set. The next theorem states the improvement in convergence of the proposed algorithm.

Theorem 2.

Let $f:\textbf{dom}(f)\rightarrow\mathbb{R}$ be an $L$-smooth convex function, $\mathcal{C}\subseteq\textbf{dom}(f)$ be compact and convex, and $\mathbf{x}^\star$ be a minimizer of $f$ over $\mathcal{C}$. For appropriately chosen parameters $(\alpha,\beta,\gamma)$ with $\alpha\geq\beta>-1$, JFW in Algorithm 2 satisfies

$$f(\mathbf{x}_k)-f(\mathbf{x}^\star) \leq \left|\frac{\alpha}{\beta}\right|\frac{4LD^2}{(k+1)(k+2)}, \qquad (9)$$

where $D = \sup_{\mathbf{x},\mathbf{y}\in\mathcal{C}}\|\mathbf{x}-\mathbf{y}\|_2$ is the diameter of the constraint set.

The complete proof is available online (https://drive.google.com/file/d/1BfP-iDiAZ0cKZ1OzJA5X1_6MGeDvq7wE/view) and is omitted due to space constraints. Here, we give a brief sketch of the proof, which is based on mathematical induction. For the base case $k=0$, the bound follows directly from the definition of the diameter of the set and the $L$-smoothness of the objective function. Then, by the induction hypothesis, we assume that the bound holds for $k$ iterations and show that it also holds at the $(k+1)$st iterate for appropriately chosen tuning parameters $(\alpha,\beta)$. To do so, we show that JFW satisfies the descent property

$$f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_k) - \frac{6\omega_kLD^2}{(k+2)^2},$$

with $\omega_k = a_k(1-\gamma)+b_k\in[0,1]$ and $L\geq 0$.

Theorem 2 affirms that we can achieve a faster sublinear convergence rate of $\mathcal{O}(1/k^2)$ using the Jacobi polynomial-based acceleration technique, as compared to the $\mathcal{O}(1/k)$ rate of FW. However, achieving this faster rate requires careful tuning of the parameters of the Jacobi polynomials, $(\alpha,\beta,\gamma)$. Further, the bound on the iteration complexity of JFW matches the bounds in [9], but for more general convex and compact constraint sets.
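As a quick consequence of (9) (a brief check, not spelled out in the original text), to guarantee $f(\mathbf{x}_k)-f(\mathbf{x}^\star)\leq\varepsilon$ it suffices that

$$\left|\frac{\alpha}{\beta}\right|\frac{4LD^2}{(k+1)(k+2)} \leq \varepsilon, \quad \text{which holds whenever} \quad k \geq 2D\sqrt{\frac{L}{\varepsilon}\left|\frac{\alpha}{\beta}\right|},$$

since then $(k+1)(k+2)\geq k^2\geq 4LD^2|\alpha/\beta|/\varepsilon$. This is the $\mathcal{O}(1/\sqrt{\varepsilon})$ iteration complexity claimed in Section 4, compared with $k\gtrsim 2LD^2/\varepsilon$ for the vanilla FW bound.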

5 Numerical experiments

We validate the JFW algorithm on real datasets for three tasks, namely, logistic regression and robust regression with an $\ell_2$-norm constraint, and robust matrix completion. We compare JFW with the vanilla FW in Algorithm 1 and the momentum-guided AFW [9] algorithm. Software to reproduce the plots in the paper is available at https://colab.research.google.com/drive/1P1_DsP4nN2YYykWanNuasdbfQ4j6EAhB?usp=sharing.

5.1 Logistic regression

Consider the logistic regression problem with an $\ell_2$-norm constraint for binary classification:

$$\operatorname*{minimize}_{\mathbf{x}} \quad \frac{1}{m}\sum_{i=1}^{m}\log\left(1+\exp(-b_i\langle\mathbf{a}_i,\mathbf{x}\rangle)\right) \qquad \text{s. to} \quad \|\mathbf{x}\|_2\leq t,$$

where $\mathbf{a}_i\in\mathbb{R}^d$ is the feature vector and $b_i$ is the binary class label. We apply FW, AFW and JFW to the breast cancer [14] dataset with $m=698$ and $d=9$. We use $\alpha=\beta=1.2$, $t=50$, and $\gamma=2/3$. The suboptimality gap is shown in Fig. 1(a), where the optimal value used to compute the gap is obtained using cvxpy [15]. We can see from Fig. 1(a) that JFW has a faster rate.
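For this experiment, the quantities needed per iteration by Algorithms 1 and 2 take a simple form; the sketch below (with our own function name) computes the logistic gradient, and the $\ell_2$-ball direction-finding step is again $\mathbf{s}_k = -t\,\nabla f(\mathbf{x}_k)/\|\nabla f(\mathbf{x}_k)\|_2$, as in the earlier `lmo_l2_ball` sketch.

```python
import numpy as np

def logistic_grad(x, A, b):
    """Gradient of (1/m) * sum_i log(1 + exp(-b_i * <a_i, x>)),
    i.e., -(1/m) * sum_i b_i * sigmoid(-b_i * <a_i, x>) * a_i."""
    m = A.shape[0]
    margins = -b * (A @ x)                  # -b_i <a_i, x>
    sigma = 1.0 / (1.0 + np.exp(-margins))  # sigmoid of the margins
    return -(A.T @ (b * sigma)) / m
```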

5.2 Robust regression

Next, we consider robust ridge regression using a smooth Huber loss function with an $\ell_2$-norm constraint [16]:

$$\operatorname*{minimize}_{\mathbf{x}} \quad \frac{1}{m}\sum_{i=1}^{m}H_\delta(y_i-\langle\mathbf{a}_i,\mathbf{x}\rangle) \qquad \text{s. to} \quad \|\mathbf{x}\|_2\leq t,$$

where $\mathbf{a}_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$. The Huber loss function is given by

$$H_\delta(c) = \begin{cases} c^2, & \text{if } |c|\leq\delta,\\ 2\delta|c|-\delta^2, & \text{otherwise.}\end{cases}$$
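A small sketch of this (scaled) Huber loss and its derivative, which a gradient-based implementation of FW/JFW for this experiment needs; the function names are ours, and by the chain rule the objective gradient is $-\frac{1}{m}\sum_i H_\delta'(y_i-\langle\mathbf{a}_i,\mathbf{x}\rangle)\,\mathbf{a}_i$.

```python
import numpy as np

def huber(c, delta):
    """H_delta(c) = c^2 if |c| <= delta, else 2*delta*|c| - delta^2."""
    return np.where(np.abs(c) <= delta, c**2, 2 * delta * np.abs(c) - delta**2)

def huber_grad(c, delta):
    """Derivative of H_delta: 2c in the quadratic region, 2*delta*sign(c) outside."""
    return np.where(np.abs(c) <= delta, 2 * c, 2 * delta * np.sign(c))
```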

We apply robust ridge regression to the Pima Indian diabetes [17] dataset with $m=767$ and $d=8$; the optimal function value is computed using scikit-learn [18]. We use $t=35$, $\delta=0.5$, $\alpha=\beta=1450$, and $\gamma=0.65$. Fig. 1(b) shows the suboptimality gap, where we see that JFW converges faster than FW and AFW, as before.

5.3 Robust matrix completion

Finally, we consider the robust matrix completion problem to predict unobserved missing entries from observations corrupted with outliers. We use the Huber loss function as before along with a low-rank promoting nuclear norm constraint for the robust matrix completion task, i.e., we solve

$$\operatorname*{minimize}_{\mathbf{X}} \quad \sum_{(i,j)\in\Omega}H_\delta(A_{ij}-X_{ij}) \qquad \text{s. to} \quad \|\mathbf{X}\|_\star\leq R,$$

where $\mathbf{A}\in\mathbb{R}^{m\times n}$ is the partially observed matrix with the entries $A_{ij}$ available for $(i,j)\in\Omega$. The sampling mask $\Omega$ is known.
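For the nuclear-norm ball, the FW direction-finding step has the well-known closed form $\mathbf{S}_k = -R\,\mathbf{u}_1\mathbf{v}_1^\top$, where $\mathbf{u}_1,\mathbf{v}_1$ are the leading singular vectors of the gradient matrix, so only one extreme singular pair is needed per iteration. A minimal sketch (our illustration, using SciPy's truncated SVD) is:

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(grad, R):
    """argmin_{||S||_* <= R} <grad, S> = -R * u1 v1^T,
    with (u1, v1) the leading singular vector pair of grad."""
    u, _, vt = svds(grad, k=1)          # top singular triplet of the gradient
    return -R * np.outer(u[:, 0], vt[0, :])
```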

We apply robust matrix completion to the Movielens 100K [19] dataset with $m=1682$ movies rated by $n=943$ users, with $6.30\%$ of the ratings observed. Further, we introduce outliers by setting $4\%$ of the total ratings (i.e., of the $mn$ entries) to the maximum rating value. We use 50% of the available ratings for training and the remaining observations for testing. We use $R=5$, $\alpha=\beta=4.5$, $\gamma=2/3$, and $\delta=4$. Fig. 1(c) shows the normalized error, which is defined as

$$\epsilon_k = \frac{\sum_{(i,j)\in\Omega_{\rm test}}H_\delta(A_{ij}-X_{ij})}{\sum_{(i,j)\in\Omega_{\rm test}}H_\delta(A_{ij})}$$

with $\Omega_{\rm test}$ being the index set of the observations in the test set. We again observe that JFW performs slightly better than FW.

6 Conclusions

The FW algorithm is a viable alternative for solving constrained optimization problems for which projection onto the constraint set is a computationally expensive operation. Motivated by polynomial-based acceleration techniques for unconstrained problems, we showed that a set of orthogonal Jacobi polynomials with carefully chosen parameters can yield faster convergence rates. We proposed the Jacobi accelerated FW algorithm, which provably has a faster sublinear convergence rate for minimizing a smooth convex function over a compact convex constraint set.

References

  • [1] Y. Nesterov, Introductory lectures on convex optimization: A basic course, volume 87.   Springer Science and Business Media, 2004.
  • [2] M. Frank and P. Wolfe, “An algorithm for quadratic programming,” Naval Research Logistics Quarterly, vol. 3, no. 1‐2, pp. 95–110, 1956.
  • [3] M. Jaggi, “Revisiting frank-wolfe: Projection-free sparse convex optimization,” in Proc. of the 30th International Conference on International Conference on Machine Learning, 2013.
  • [4] D. Garber and E. Hazan, “Faster rates for the frank-wolfe method over strongly-convex sets,” in Proc. of the 32nd International Conference on Machine Learning, ICML, 2015.
  • [5] S. Lacoste-Julien and M. Jaggi, “On the global linear convergence of frank-wolfe optimization variants,” in Proc. of Advances in Neural Information Processing Systems, 2015.
  • [6] E. Frandi, R. Ñanculef, and J. A. K. Suykens, “A partan-accelerated frank-wolfe algorithm for large-scale svm classification,” in Proc. of International Joint Conference on Neural Networks (IJCNN), 2015.
  • [7] Y. Nesterov, “A method for solving the convex programming problem with convergence rate $\mathcal{O}(1/k^2)$,” Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
  • [8] B. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
  • [9] B. Li, M. Coutiño, G. B. Giannakis, and G. Leus, “A momentum-guided frank-wolfe algorithm,” IEEE Trans. Signal Process., vol. 69, pp. 3597–3611, June 2021.
  • [10] Y. Zhang, B. Li, and G. B. Giannakis, “Accelerating frank-wolfe with weighted average gradients,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), 2021.
  • [11] A. d’Aspremont, D. Scieur, and A. Taylor, Acceleration Methods, 2021.
  • [12] P. Nevai, “Géza Freud, orthogonal polynomials and Christoffel functions. A case study,” Journal of Approximation Theory, vol. 48, no. 1, pp. 3–167, 1986.
  • [13] J. Shen, T. Tang, and L.-L. Wang, Orthogonal Polynomials and Related Approximation Results.   Springer, Berlin, Heidelberg, 2011.
  • [14] “Diagnostic wisconsin breast cancer database,” UCI repository, https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data, accessed on 09/10/2021.
  • [15] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
  • [16] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, no. 7, pp. 59–72, 2007.
  • [17] “Pima Indian diabetes dataset,” National Institute of Diabetes and Digestive and Kidney Diseases, https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.csv, accessed on 09/10/2021.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [19] “Movielens 100k dataset,” GroupLens Research, https://grouplens.org/datasets/movielens/100k/, accessed on 09/10/2021.