
Faster Rates for the Frank-Wolfe Algorithm Using Jacobi Polynomials

Abstract

The Frank-Wolfe (FW) algorithm is a popular projection-free alternative for solving large-scale constrained optimization problems. However, the FW algorithm suffers from a sublinear convergence rate when minimizing a smooth convex function over a compact convex set. Thus, exploring techniques that yield a faster convergence rate becomes crucial. A classic approach to obtain faster rates is to combine previous iterates to obtain the next iterate. In this work, we extend this approach to the FW setting and show that the optimal way to combine the past iterates is using a set of orthogonal Jacobi polynomials. We also propose a polynomial-based acceleration technique, referred to as Jacobi polynomial accelerated FW, which combines the current iterate with the past iterate using combining weights related to the Jacobi recursion. By carefully choosing the parameters of the Jacobi polynomials, we obtain a faster sublinear convergence rate. We provide numerical experiments on real datasets to demonstrate the efficacy of the proposed algorithm.

Index Terms—  Acceleration methods, constrained optimization, Frank-Wolfe algorithm, Jacobi polynomials.

1 Introduction

Consider constrained optimization problems of the form

$$\operatorname*{minimize}_{\mathbf{x}} \;\; f(\mathbf{x}) \qquad \text{subject to} \qquad \mathbf{x}\in\mathcal{C}, \qquad (1)$$

where $f:\textbf{dom}(f)\rightarrow\mathbb{R}$ is a smooth convex function and the constraint set $\mathcal{C}\subseteq\textbf{dom}(f)$ is compact and convex. Projected gradient descent [1] is a popular technique for solving constrained optimization problems, wherein gradient descent iterates are projected onto the constraint set. The constraint set in many large-scale problems commonly encountered in machine learning and signal processing, such as lasso, matrix completion, and regularized logistic or robust regression, has a nice structure, but the projection operation onto the constraint set can be computationally expensive.

The Frank-Wolfe (FW) algorithm (also popularly known as the conditional gradient method) [2] is a projection-free alternative to the projected gradient descent algorithm for solving constrained optimization problems. The FW algorithm avoids the projection step and instead has a direction-finding step that minimizes a local linear approximation of the objective function over the constraint set. The update is then computed by taking a convex combination of the previous iterate with the solution from the linear minimization step. The linear minimization step in FW is often easy to compute for $\ell_p$-norm ball or polytope constraint sets that frequently appear in machine learning and signal processing problems, thereby making FW an attractive choice.

Convergence results for the FW algorithm are provided in [3]. The FW algorithm has iteration complexity bounds that are independent of the problem size but is limited by a sublinear convergence rate of $\mathcal{O}(1/k)$. With a carefully selected step size, for minimizing a strongly convex function over a strongly convex constraint set, the FW algorithm can achieve a convergence rate of $\mathcal{O}(1/k^2)$, and a linear convergence rate when the solution lies in the interior of the convex constraint set [4]. The FW algorithm can even achieve linear rates under conditions weaker than strong convexity over polytopes by removing carefully chosen directions (referred to as away steps) that hinder the convergence rate [5]. By finding an optimal step size at each iteration, faster convergence of the FW algorithm may be obtained [4, 6].

For unconstrained optimization problems, Nesterov’s acceleration technique [7] and Polyak’s heavy-ball momentum technique [8] provide faster rates of $\mathcal{O}(1/k^2)$. Central to Polyak’s heavy-ball method [8] is the so-called multistep method, which keeps track of all the previous iterates and computes successive updates by taking a weighted combination of the past iterates. Doing so yields a smoother trajectory and helps attain faster convergence rates. In a similar vein, several variants of the multistep method have been proposed for FW. In [9], a momentum-guided accelerated FW (AFW) algorithm has been proposed that achieves a convergence rate of $\mathcal{O}(1/k^2)$ for smooth convex minimization problems over $\ell_p$-norm balls, but only $\mathcal{O}(1/k)$ in the general setting. The FW iterates tend to zig-zag as they approach the optimal solution, which might affect the convergence rate. In [6], the zig-zag trajectory is alleviated by using the concept of parallel tangents for FW, wherein a significant improvement in convergence has been reported, but without any theoretical guarantees. In [10], the zig-zag phenomenon is controlled by keeping track of the past gradients and taking their weighted average with different heuristic choices for the weights. When using multistep-based methods, it is important to find optimal updates and combining weights that achieve faster convergence without much computational overhead.

The main aim of this work is, therefore, to determine an optimal technique, independent of the problem size, to combine the previous iterates as in the multistep method and obtain faster convergence rates. To this end, we draw inspiration from polynomial-based iteration methods for unconstrained problems [11] and extend them to the FW setting. The main contributions of the paper are as follows.

  • We express the linear combination of the past FW iterates as a polynomial and show that the optimal polynomial that yields the maximum error reduction is given by a set of orthogonal Jacobi polynomials.

  • We then modify the FW algorithm to incorporate a Jacobi recursion, wherein the weights from the recurrence relation of the Jacobi polynomials are used to combine the past iterate with the current iterate.

  • We show that for carefully selected parameters of the Jacobi polynomials, we obtain a faster convergence rate of $\mathcal{O}(1/k^2)$ when minimizing a smooth convex function over a compact convex constraint set.

Through numerical experiments on real datasets, we demonstrate the efficacy of the proposed algorithm.

2 Preliminaries

We use boldface lowercase (uppercase) letters to represent vectors (respectively, matrices). We use $\langle\mathbf{x},\mathbf{y}\rangle$ to denote the inner product between vectors $\mathbf{x}$ and $\mathbf{y}$. We denote an arbitrary norm by $\|\cdot\|$, the $\ell_2$-norm by $\|\cdot\|_2$, and the nuclear norm by $\|\cdot\|_\star$. We say that a function $f:\textbf{dom}(f)\rightarrow\mathbb{R}$ is $L$-smooth over a convex set $\mathcal{C}\subseteq\textbf{dom}(f)$ if for all $\mathbf{x},\mathbf{y}\in\mathcal{C}$ it holds that

$$f(\mathbf{y}) \leq f(\mathbf{x}) + \langle\nabla f(\mathbf{x}), \mathbf{y}-\mathbf{x}\rangle + \frac{L}{2}\|\mathbf{y}-\mathbf{x}\|^2.$$
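For instance (an illustrative example, not from the original text), the least-squares objective

$$f(\mathbf{x}) = \tfrac{1}{2}\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_2^2, \qquad \nabla f(\mathbf{x}) = \mathbf{A}^\top(\mathbf{A}\mathbf{x}-\mathbf{b}), \qquad \nabla^2 f(\mathbf{x}) = \mathbf{A}^\top\mathbf{A},$$

is $L$-smooth with $L = \lambda_{\max}(\mathbf{A}^\top\mathbf{A}) = \|\mathbf{A}\|_2^2$, since the quadratic upper bound above then holds with this constant.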

2.1 The FW method

The FW algorithm [2] finds an update direction based on the local linear approximation of $f$ around the current iterate $\mathbf{x}_k$ by solving the linear minimization problem:

$$\mathbf{s}_k = \operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{C}} \; f(\mathbf{x}_k) + \langle\nabla f(\mathbf{x}_k), \mathbf{s}-\mathbf{x}_k\rangle. \qquad (2)$$

Then the update is computed by forming a convex combination of $\mathbf{s}_k$ and $\mathbf{x}_k$ as

$$\mathbf{x}_{k+1} = (1-\gamma_k)\mathbf{x}_k + \gamma_k\mathbf{s}_k$$

with $\gamma_k\in[0,1]$. This ensures $\mathbf{x}_{k+1}\in\mathcal{C}$. Typically, a diminishing step size $\gamma_k$ is used [3], but line search techniques to compute the optimal step size may also be used [2]. The FW method is summarized as Algorithm 1.

Algorithm 1 The vanilla Frank-Wolfe method (FW)
1: Initialize $\mathbf{x}_0\in\mathcal{C}$
2: for $k=0,1,\ldots$ do
3:    $\mathbf{s}_k \leftarrow \operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{C}} \langle\nabla f(\mathbf{x}_k), \mathbf{s}\rangle$
4:    $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \gamma_k(\mathbf{s}_k - \mathbf{x}_k)$
5:    $\gamma_k \leftarrow \frac{2}{k+2}$

The FW algorithm with the diminishing step size $\gamma_k = \frac{2}{k+2}$ yields a sublinear convergence rate of $\mathcal{O}(1/k)$ for minimizing $L$-smooth convex functions over a compact convex constraint set. Specifically, the suboptimality gap at the $k$th iterate of FW satisfies [3]

$$f(\mathbf{x}_k) - f(\mathbf{x}^\star) \leq \frac{2LD^2}{k+2},$$

where $D = \sup_{\mathbf{x},\mathbf{y}\in\mathcal{C}}\|\mathbf{x}-\mathbf{y}\|_2$ is the diameter of the constraint set and $\mathbf{x}^\star$ is a minimizer of $f$.
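For concreteness, the following is a minimal Python sketch of Algorithm 1 for an $\ell_2$-norm ball constraint, for which the linear minimization step has the closed form $\mathbf{s}_k = -t\,\nabla f(\mathbf{x}_k)/\|\nabla f(\mathbf{x}_k)\|_2$. The function names and the quadratic test objective are our illustrative choices, not part of the original paper.

```python
import numpy as np

def lmo_l2_ball(grad, t):
    """Linear minimization oracle over {s : ||s||_2 <= t}:
    argmin_{||s||_2 <= t} <grad, s> = -t * grad / ||grad||_2."""
    return -t * grad / (np.linalg.norm(grad) + 1e-12)

def frank_wolfe(grad_f, lmo, x0, num_iters=200):
    """Vanilla FW (Algorithm 1) with the diminishing step size 2/(k+2)."""
    x = x0.copy()
    for k in range(num_iters):
        s = lmo(grad_f(x))              # FW direction-finding step
        gamma_k = 2.0 / (k + 2.0)       # diminishing step size
        x = x + gamma_k * (s - x)       # convex combination keeps x feasible
    return x

# Illustrative use: minimize 0.5*||Ax - b||^2 over an l2-ball of radius t.
rng = np.random.default_rng(0)
A, b, t = rng.standard_normal((50, 10)), rng.standard_normal(50), 5.0
x_hat = frank_wolfe(lambda x: A.T @ (A @ x - b),
                    lambda g: lmo_l2_ball(g, t), np.zeros(10))
```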

2.2 The Jacobi polynomials

The Jacobi polynomials are a set of orthogonal polynomials, orthogonal with respect to (w.r.t.) the weight $(1-x)^\alpha(1+x)^\beta$ on the interval $[-1,1]$, where $\alpha,\beta>-1$ are tunable parameters. The Jacobi polynomials are defined by the second-order recurrence relation

$$J_{k+1}(\lambda) = (a_k\lambda + b_k)J_k(\lambda) - c_kJ_{k-1}(\lambda), \quad k=1,2,\ldots \qquad (3)$$

with $J_0(\lambda)=1$ and $J_1(\lambda)=a_0\lambda+b_0$. The recurrence weights are given by

$$a_k = \frac{(\tau_k+k)(\tau_k+k+1)}{2\tau_k(\tau_k-\beta)}, \qquad b_k = \frac{(\tau_k+k)(\alpha^2-\beta^2)}{2\tau_k(\tau_k-\beta)(\tau_k+k-1)}, \qquad c_k = \frac{k(k+\beta)(\tau_k+k+1)}{\tau_k(\tau_k-\beta)(\tau_k+k-1)}, \qquad (4)$$

where $\tau_k = k+\alpha+\beta+1$, $a_0 = \frac{\alpha+\beta+2}{2(1+\alpha)}$, and $b_0 = \frac{\alpha-\beta}{2(1+\alpha)}$. Furthermore, we have $a_k+b_k-c_k=1$.
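Since the recurrence weights in (4) are used repeatedly in what follows, a small helper for computing them is convenient; the sketch below (the function name is ours) also checks the identity $a_k+b_k-c_k=1$ stated above numerically.

```python
def jacobi_weights(k, alpha, beta):
    """Recurrence weights (a_k, b_k, c_k) of the Jacobi recursion (3), cf. (4)."""
    if k == 0:
        # J_1 = a_0*lambda + b_0; there is no J_{-1} term, so we take c_0 = 0 (our convention).
        a0 = (alpha + beta + 2) / (2 * (1 + alpha))
        b0 = (alpha - beta) / (2 * (1 + alpha))
        return a0, b0, 0.0
    tau = k + alpha + beta + 1
    a = (tau + k) * (tau + k + 1) / (2 * tau * (tau - beta))
    b = (tau + k) * (alpha**2 - beta**2) / (2 * tau * (tau - beta) * (tau + k - 1))
    c = k * (k + beta) * (tau + k + 1) / (tau * (tau - beta) * (tau + k - 1))
    return a, b, c

# Numerical check of a_k + b_k - c_k = 1 for a sample (alpha, beta).
for k in range(1, 6):
    a, b, c = jacobi_weights(k, alpha=1.2, beta=0.5)
    assert abs(a + b - c - 1.0) < 1e-10
```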

3 The optimal polynomial

A classic technique to accelerate iterative methods is the so-called multistep method, which uses past iterates to construct the successive iterate [8]. In this section, we apply this technique to FW and find an optimal procedure to use the past iterates.

Suppose we take a linear combination of all the past iterates from Step 4 of Algorithm 1 with a fixed step size $\gamma$; then we get

$$\mathbf{z}_{k+1} = \sum_{i=0}^{k}\theta_i\mathbf{x}_i = \sum_{i=0}^{k}\theta_i\left[(1-\gamma)\mathbf{x}_{i-1}+\gamma\mathbf{s}_{i-1}\right], \qquad (5)$$

where we constrain the combining coefficients such that $\sum_{i=0}^{k}\theta_i=1$ with $\theta_i\in[0,1]$ for $i=0,1,\ldots,k$. This constraint ensures that the weighted average is a feasible point, i.e., $\mathbf{z}_{k+1}\in\mathcal{C}$.

Now an important question to ask is: can we find optimal coefficients that can possibly accelerate the FW algorithm? To answer this question, we express (5) as a polynomial and then select the best polynomial that speeds up the convergence rate. To do so, let us first unroll (5) and express it in terms of $\mathbf{x}_0$ as

$$\mathbf{z}_{k+1} = \sum_{i=0}^{k}\theta_i\left[(1-\gamma)^i\mathbf{x}_0 + \gamma\sum_{j=0}^{i-1}(1-\gamma)^{i-j}\mathbf{s}_j\right].$$

Let us define the polynomial $p_k(1-\gamma)$ of degree $k$ as

$$p_k(1-\gamma) = \theta_k(1-\gamma)^k + \theta_{k-1}(1-\gamma)^{k-1} + \cdots + \theta_0,$$

with $p_0(1-\gamma)=1$. Then we can express $\mathbf{z}_{k+1}$ as

$$\mathbf{z}_{k+1} = p_k(1-\gamma)\mathbf{x}_0 + \gamma p_{k-1}(1-\gamma)\mathbf{s}_0 + \cdots + \gamma p_0(1-\gamma)\mathbf{s}_{k-1},$$

where $p_k(1-\gamma) + \gamma\sum_{i=1}^{k}p_{k-i}(1-\gamma) = 1$ and thus $\mathbf{z}_{k+1}\in\textbf{conv}\{\mathbf{x}_0,\mathbf{s}_0,\ldots,\mathbf{s}_{k-1}\}$. The best polynomial is the one that minimizes the distance of the $(k+1)$st iterate to $\mathbf{x}^\star$, i.e.,

$$\mathbf{z}_{k+1}-\mathbf{x}^\star = p_k(1-\gamma)\mathbf{x}_0 + \gamma p_{k-1}(1-\gamma)\mathbf{s}_0 + \cdots + \gamma p_0(1-\gamma)\mathbf{s}_{k-1} - \mathbf{x}^\star$$
$$= p_k(1-\gamma)(\mathbf{x}_0-\mathbf{x}^\star) + \gamma p_{k-1}(1-\gamma)(\mathbf{s}_0-\mathbf{x}^\star) + \cdots + \gamma p_0(1-\gamma)(\mathbf{s}_{k-1}-\mathbf{x}^\star)$$

for each $k$, and is bounded from above as

$$\|\mathbf{z}_{k+1}-\mathbf{x}^\star\| \leq |p_k(1-\gamma)|\,\|\mathbf{x}_0-\mathbf{x}^\star\| + |\gamma p_{k-1}(1-\gamma)|\,\|\mathbf{s}_0-\mathbf{x}^\star\| + \cdots + |\gamma p_0(1-\gamma)|\,\|\mathbf{s}_{k-1}-\mathbf{x}^\star\| \qquad (6)$$

due to the triangle inequality. Thus minimizing this upper bound would provide updates that are as close as possible to the optimal path. Next, we find the best polynomial that minimizes the above bound in the following theorem.

Theorem 1.

A $k$th-degree orthogonal Jacobi polynomial is a minimizer of the upper bound in (6).

Proof.

Recall that the Jacobi polynomials are orthogonal on the interval $[-1,1]$. With the change of variable $\gamma=\frac{x+1}{2}$, the Jacobi polynomials are orthogonal on the interval $[0,1]$ w.r.t. the weight $w(\gamma)=(2-2\gamma)^\alpha(2\gamma)^\beta$. The polynomial $p_k^\star$ that minimizes its norm w.r.t. the weight function $w(\gamma)$, i.e.,

$$p_k^\star \in \operatorname*{arg\,min}_{p_k:\,p_0(1-\gamma)=1} \int_0^1 |p_k(1-\gamma)|\,(2-2\gamma)^\alpha(2\gamma)^\beta\,d\gamma, \qquad (7)$$

is given by the set of orthogonal polynomials [12].

Since a $k$th-degree Jacobi polynomial can be expressed as a linear combination of lower-degree Jacobi polynomials [13], we further have $p_k(1-\gamma)=\sum_{i=0}^{k-1}c_ip_i(1-\gamma)$ for some combining weights $\{c_i\}$. Thus, for Jacobi polynomials, we have

$$\operatorname*{minimize}_{p_\kappa:\,p_0(1-\gamma)=1,\,\kappa\leq k-1} \int_0^1\sum_{i=0}^{k-1}|c_ip_i(1-\gamma)|\,(2-2\gamma)^\alpha(2\gamma)^\beta\,d\gamma \;\leq\; \operatorname*{minimize}_{p_\kappa:\,p_0(1-\gamma)=1,\,\kappa\leq k-1} \sum_{i=0}^{k-1}\int_0^1|c_i|\,|p_i(1-\gamma)|\,(2-2\gamma)^\alpha(2\gamma)^\beta\,d\gamma.$$

Therefore, selecting a $k$th-degree Jacobi polynomial ensures that the norms of the lower-degree polynomials are also minimized, thereby minimizing the upper bound in (6) and the error $\|\mathbf{z}_{k+1}-\mathbf{x}^\star\|$ at the $(k+1)$st iteration.        $\Box$

The above theorem suggests that combining all the past iterates using orthogonal Jacobi polynomials yields the best successive update. Based on this insight, we modify the vanilla Frank-Wolfe method in the next section.

4 Jacobi acceleration for Frank-Wolfe

In this section, we present the proposed Jacobi accelerated Frank-Wolfe method (JFW), which is a polynomial-based technique for using the past iterates. For a convex and $L$-smooth function $f$, we show that, for carefully chosen Jacobi polynomial parameters $(\alpha,\beta)$, JFW needs only $\mathcal{O}(1/\sqrt{\varepsilon})$ iterations to reach an error of at most $\varepsilon$, instead of the $\mathcal{O}(1/\varepsilon)$ iterations needed by FW.

4.1 The JFW algorithm

We have seen in Section 2.2 that orthogonal Jacobi polynomials follow the second-order recursion (3) with three sequences of coefficients $\{a_k,b_k,c_k\}$ computed as in (4). This means that we do not have to store all the past iterates to compute (5), but only the previous iterate. Specifically, the proposed JFW algorithm computes an intermediate update

$$\mathbf{y}_{k+1} = \mathbf{x}_k + \gamma_k(\mathbf{s}_k-\mathbf{x}_k),$$

where $\mathbf{s}_k$ is the FW direction computed as in Step 3 of Algorithm 1 with $\gamma_k=2/(k+2)$. Then we combine the intermediate update with the past iterate using the second-order Jacobi recursion [cf. (3)]:

$$\mathbf{z}_{k+1} = (a_k(1-\gamma)+b_k)\mathbf{x}_k - c_k\mathbf{x}_{k-1} \qquad (8)$$

with $\gamma\in[0,1]$. Since $a_k(1-\gamma)+b_k-c_k = 1-\gamma a_k$ for the Jacobi polynomials, the combining weights do not sum to one, and hence $\mathbf{z}_{k+1}\notin\mathcal{C}$ in general. Therefore, we make a final correction step

$$\mathbf{x}_{k+1} = \mathbf{z}_{k+1} + \gamma a_k\mathbf{x}_k$$

so that $\mathbf{x}_{k+1}\in\mathcal{C}$ is a feasible point. The proposed JFW is summarized as Algorithm 2.

Algorithm 2 Jacobi accelerated Frank-Wolfe (JFW)
1: Initialize $\mathbf{x}_0\in\mathcal{C}$, $\alpha\geq\beta>-1$, and $\gamma$
2: for $k=0,1,\ldots$ do
3:    $\mathbf{s}_k \leftarrow \operatorname*{arg\,min}_{\mathbf{s}\in\mathcal{C}} \langle\nabla f(\mathbf{x}_k), \mathbf{s}\rangle$ ▷ FW direction finding
4:    $\mathbf{y}_{k+1} \leftarrow \mathbf{x}_k + \gamma_k(\mathbf{s}_k - \mathbf{x}_k)$ ▷ FW update
5:    $\mathbf{z}_{k+1} \leftarrow (a_k(1-\gamma)+b_k)\mathbf{y}_{k+1} - c_k\mathbf{x}_k$ ▷ Jacobi recursion
6:    $\mathbf{x}_{k+1} \leftarrow \mathbf{z}_{k+1} + \gamma a_k\mathbf{x}_k$ ▷ Correction step
7:    $\gamma_k \leftarrow \frac{2}{k+2}$
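The following is a minimal Python sketch of Algorithm 2 under our reading of Steps 3–7. It reuses the illustrative `lmo_l2_ball` and `jacobi_weights` helpers sketched earlier and is not the authors' reference implementation; see the software link in Section 5 for that.

```python
def jacobi_fw(grad_f, lmo, x0, alpha, beta, gamma, num_iters=200):
    """Jacobi accelerated Frank-Wolfe (Algorithm 2), illustrative sketch."""
    x = x0.copy()
    for k in range(num_iters):
        a_k, b_k, c_k = jacobi_weights(k, alpha, beta)  # recurrence weights (4)
        s = lmo(grad_f(x))                              # Step 3: FW direction finding
        gamma_k = 2.0 / (k + 2.0)
        y = x + gamma_k * (s - x)                       # Step 4: FW update
        z = (a_k * (1 - gamma) + b_k) * y - c_k * x     # Step 5: Jacobi recursion
        x = z + gamma * a_k * x                         # Step 6: correction step
    return x
```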
Fig. 1: Performance of the FW, AFW and JFW algorithms on (a) the breast cancer dataset for $\ell_2$-constrained logistic regression, (b) the Pima Indian diabetes dataset for robust ridge regression, and (c) the Movielens 100k dataset for the robust matrix completion task.

4.2 Convergence analysis

To demonstrate the convergence of JFW, we consider an $L$-smooth convex function over a compact convex constraint set. The next theorem states the improvement in convergence of the proposed algorithm.

Theorem 2.

Let $f:\textbf{dom}(f)\rightarrow\mathbb{R}$ be an $L$-smooth convex function, $\mathcal{C}\subseteq\textbf{dom}(f)$ be compact and convex, and $\mathbf{x}^\star$ be a minimizer of $f$ over $\mathcal{C}$. For appropriately chosen parameters $(\alpha,\beta,\gamma)$ with $\alpha\geq\beta>-1$, JFW in Algorithm 2 satisfies

$$f(\mathbf{x}_k)-f(\mathbf{x}^\star) \leq \left|\frac{\alpha}{\beta}\right|\frac{4LD^2}{(k+1)(k+2)}, \qquad (9)$$

where $D = \sup_{\mathbf{x},\mathbf{y}\in\mathcal{C}}\|\mathbf{x}-\mathbf{y}\|_2$ is the diameter of the constraint set.

The complete proof is available online (https://drive.google.com/file/d/1BfP-iDiAZ0cKZ1OzJA5X1_6MGeDvq7wE/view) and is omitted due to space constraints. Here, we give a brief sketch of the proof, which is based on mathematical induction. For the base case $k=0$, the bound follows directly from the definition of the diameter of the set and the $L$-smoothness of the objective function. Then, by the induction hypothesis, we assume that the bound holds for $k$ iterations and show that it also holds at the $(k+1)$st iterate for appropriately chosen tuning parameters $(\alpha,\beta)$. To do so, we show that JFW satisfies the descent property

$$f(\mathbf{x}_{k+1}) \leq f(\mathbf{x}_k) - \frac{6\omega_kLD^2}{(k+2)^2},$$

with $\omega_k = a_k(1-\gamma)+b_k\in[0,1]$ and $L\geq 0$.

Theorem 2 affirms that we can achieve a faster sublinear convergence rate of $\mathcal{O}(1/k^2)$ using the Jacobi polynomial-based acceleration technique, as compared to the $\mathcal{O}(1/k)$ rate of FW. However, achieving this faster rate requires careful tuning of the parameters of the Jacobi polynomials, $(\alpha,\beta,\gamma)$. Further, the bound on the iteration complexity of JFW matches the bounds in [9], but for more general convex and compact constraint sets.
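As a quick consequence of (9) (a brief check, not spelled out in the original text), to guarantee $f(\mathbf{x}_k)-f(\mathbf{x}^\star)\leq\varepsilon$ it suffices that

$$\left|\frac{\alpha}{\beta}\right|\frac{4LD^2}{(k+1)(k+2)} \leq \varepsilon, \quad \text{which holds whenever} \quad k \geq 2D\sqrt{\frac{L}{\varepsilon}\left|\frac{\alpha}{\beta}\right|},$$

since then $(k+1)(k+2)\geq k^2\geq 4LD^2|\alpha/\beta|/\varepsilon$. This is the $\mathcal{O}(1/\sqrt{\varepsilon})$ iteration complexity claimed in Section 4, compared with $k\gtrsim 2LD^2/\varepsilon$ for the vanilla FW bound.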

5 Numerical experiments

We validate the JFW algorithm on real datasets for three tasks, namely, logistic regression and robust regression with an $\ell_2$-norm constraint, and robust matrix completion. We compare JFW with the vanilla FW in Algorithm 1 and the momentum-guided AFW [9] algorithm. Software to reproduce the plots in the paper is available at https://colab.research.google.com/drive/1P1_DsP4nN2YYykWanNuasdbfQ4j6EAhB?usp=sharing.

5.1 Logistic regression

Consider the logistic regression problem with an $\ell_2$-norm constraint for binary classification:

$$\operatorname*{minimize}_{\mathbf{x}} \quad \frac{1}{m}\sum_{i=1}^{m}\log\left(1+\exp(-b_i\langle\mathbf{a}_i,\mathbf{x}\rangle)\right) \qquad \text{s. to} \quad \|\mathbf{x}\|_2\leq t,$$

where $\mathbf{a}_i\in\mathbb{R}^d$ is the feature vector and $b_i$ is the binary class label. We apply FW, AFW and JFW to the breast cancer [14] dataset with $m=698$ and $d=9$. We use $\alpha=\beta=1.2$, $t=50$, and $\gamma=2/3$. The suboptimality gap is shown in Fig. 1(a), where the optimal value used to compute the gap is obtained using cvxpy [15]. We can see from Fig. 1(a) that JFW has a faster rate.
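For this experiment, the quantities needed per iteration by Algorithms 1 and 2 take a simple form; the sketch below (with our own function name) computes the logistic gradient, and the $\ell_2$-ball direction-finding step is again $\mathbf{s}_k = -t\,\nabla f(\mathbf{x}_k)/\|\nabla f(\mathbf{x}_k)\|_2$, as in the earlier `lmo_l2_ball` sketch.

```python
import numpy as np

def logistic_grad(x, A, b):
    """Gradient of (1/m) * sum_i log(1 + exp(-b_i * <a_i, x>)),
    i.e., -(1/m) * sum_i b_i * sigmoid(-b_i * <a_i, x>) * a_i."""
    m = A.shape[0]
    margins = -b * (A @ x)                  # -b_i <a_i, x>
    sigma = 1.0 / (1.0 + np.exp(-margins))  # sigmoid of the margins
    return -(A.T @ (b * sigma)) / m
```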

5.2 Robust regression

Next, we consider robust ridge regression using a smooth Huber loss function with an $\ell_2$-norm constraint [16]:

$$\operatorname*{minimize}_{\mathbf{x}} \quad \frac{1}{m}\sum_{i=1}^{m}H_\delta(y_i-\langle\mathbf{a}_i,\mathbf{x}\rangle) \qquad \text{s. to} \quad \|\mathbf{x}\|_2\leq t,$$

where $\mathbf{a}_i\in\mathbb{R}^d$ and $y_i\in\mathbb{R}$. The Huber loss function is given by

$$H_\delta(c) = \begin{cases} c^2, & \text{if } |c|\leq\delta,\\ 2\delta|c|-\delta^2, & \text{otherwise.}\end{cases}$$
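A small sketch of this (scaled) Huber loss and its derivative, which a gradient-based implementation of FW/JFW for this experiment needs; the function names are ours, and by the chain rule the objective gradient is $-\frac{1}{m}\sum_i H_\delta'(y_i-\langle\mathbf{a}_i,\mathbf{x}\rangle)\,\mathbf{a}_i$.

```python
import numpy as np

def huber(c, delta):
    """H_delta(c) = c^2 if |c| <= delta, else 2*delta*|c| - delta^2."""
    return np.where(np.abs(c) <= delta, c**2, 2 * delta * np.abs(c) - delta**2)

def huber_grad(c, delta):
    """Derivative of H_delta: 2c in the quadratic region, 2*delta*sign(c) outside."""
    return np.where(np.abs(c) <= delta, 2 * c, 2 * delta * np.sign(c))
```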

We apply robust ridge regression to the Pima Indian diabetes [17] dataset with $m=767$ and $d=8$; the optimal function value is computed using scikit-learn [18]. We use $t=35$, $\delta=0.5$, $\alpha=\beta=1450$, and $\gamma=0.65$. Fig. 1(b) shows the suboptimality gap, where we see that JFW converges faster than FW and AFW, as before.

5.3 Robust matrix completion

Finally, we consider the robust matrix completion problem to predict unobserved missing entries from observations corrupted with outliers. We use the Huber loss function as before along with a low-rank promoting nuclear norm constraint for the robust matrix completion task, i.e., we solve

$$\operatorname*{minimize}_{\mathbf{X}} \quad \sum_{(i,j)\in\Omega}H_\delta(A_{ij}-X_{ij}) \qquad \text{s. to} \quad \|\mathbf{X}\|_\star\leq R,$$

where $\mathbf{A}\in\mathbb{R}^{m\times n}$ is the partially observed matrix with the entries $A_{ij}$ available for $(i,j)\in\Omega$. The sampling mask $\Omega$ is known.
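For the nuclear-norm ball, the FW direction-finding step has the well-known closed form $\mathbf{S}_k = -R\,\mathbf{u}_1\mathbf{v}_1^\top$, where $\mathbf{u}_1,\mathbf{v}_1$ are the leading singular vectors of the gradient matrix, so only one extreme singular pair is needed per iteration. A minimal sketch (our illustration, using SciPy's truncated SVD) is:

```python
import numpy as np
from scipy.sparse.linalg import svds

def lmo_nuclear_ball(grad, R):
    """argmin_{||S||_* <= R} <grad, S> = -R * u1 v1^T,
    with (u1, v1) the leading singular vector pair of grad."""
    u, _, vt = svds(grad, k=1)          # top singular triplet of the gradient
    return -R * np.outer(u[:, 0], vt[0, :])
```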

We apply robust matrix completion to the Movielens 100K [19] dataset with $m=1682$ movies rated by $n=943$ users, with $6.30\%$ of the ratings observed. Further, we introduce outliers by setting $4\%$ of the total ratings (i.e., of the $mn$ entries) to the maximum rating value. We use 50% of the available ratings for training and the remaining observations for testing. We use $R=5$, $\alpha=\beta=4.5$, $\gamma=2/3$, and $\delta=4$. Fig. 1(c) shows the normalized error, which is defined as

$$\epsilon_k = \frac{\sum_{(i,j)\in\Omega_{\rm test}}H_\delta(A_{ij}-X_{ij})}{\sum_{(i,j)\in\Omega_{\rm test}}H_\delta(A_{ij})}$$

with $\Omega_{\rm test}$ being the index set of the observations in the test set. We again observe that JFW performs slightly better than FW.

6 Conclusions

The FW algorithm is a viable alternative for solving constrained optimization problems for which projection onto the constraint set is a computationally expensive operation. Motivated by polynomial-based acceleration techniques for unconstrained problems, we showed that a set of orthogonal Jacobi polynomials with carefully chosen parameters can yield faster convergence rates. We proposed the Jacobi accelerated FW algorithm, which provably has a faster sublinear convergence rate for minimizing a smooth convex function over a compact convex constraint set.

References

  • [1] Y. Nesterov, Introductory lectures on convex optimization: A basic course, volume 87.   Springer Science and Business Media, 2004.
  • [2] M. Frank and P. Wolfe, “An algorithm for quadratic programming,” Naval Research Logistics Quarterly, vol. 3, no. 1‐2, pp. 95–110, 1956.
  • [3] M. Jaggi, “Revisiting frank-wolfe: Projection-free sparse convex optimization,” in Proc. of the 30th International Conference on International Conference on Machine Learning, 2013.
  • [4] D. Garber and E. Hazan, “Faster rates for the frank-wolfe method over strongly-convex sets,” in Proc. of the 32nd International Conference on Machine Learning, ICML, 2015.
  • [5] S. Lacoste-Julien and M. Jaggi, “On the global linear convergence of frank-wolfe optimization variants,” in Proc. of Advances in Neural Information Processing Systems, 2015.
  • [6] E. Frandi, R. Ñanculef, and J. A. K. Suykens, “A partan-accelerated frank-wolfe algorithm for large-scale svm classification,” in Proc. of International Joint Conference on Neural Networks (IJCNN), 2015.
  • [7] Y. Nesterov, “A method for solving the convex programming problem with convergence rate $\mathcal{O}(1/k^2)$,” Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.
  • [8] B. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
  • [9] B. Li, M. Coutiño, G. B. Giannakis, and G. Leus, “A momentum-guided frank-wolfe algorithm,” IEEE Trans. Signal Process., vol. 69, pp. 3597–3611, June 2021.
  • [10] Y. Zhang, B. Li, and G. B. Giannakis, “Accelerating frank-wolfe with weighted average gradients,” in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), 2021.
  • [11] A. d’Aspremont, D. Scieur, and A. Taylor, Acceleration Methods, 2021.
  • [12] P. Nevai, “Géza Freud, orthogonal polynomials and Christoffel functions. A case study,” Journal of Approximation Theory, vol. 48, no. 1, pp. 3–167, 1986.
  • [13] J. Shen, T. Tang, and L.-L. Wang, Orthogonal Polynomials and Related Approximation Results.   Springer, Berlin, Heidelberg, 2011.
  • [14] “Diagnostic wisconsin breast cancer database,” UCI repository, https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data, accessed on 09/10/2021.
  • [15] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
  • [16] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, no. 7, pp. 59–72, 2007.
  • [17] “Pima Indian diabetes dataset,” National Institute of Diabetes and Digestive and Kidney Diseases, https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.csv, accessed on 09/10/2021.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [19] “Movielens 100k dataset,” GroupLens Research, https://grouplens.org/datasets/movielens/100k/, accessed on 09/10/2021.