
Yunier Bello-Cruz, Cassandra Mohr
Department of Mathematical Sciences, Northern Illinois University. Watson Hall 330, DeKalb, IL, USA - 60115.
[email protected] and [email protected].

Max L.N. Gonçalves, Jefferson G. Melo
Institute of Mathematics and Statistics, Federal University of Goias, Campus II - Caixa Postal 131, CEP 74001-970, Goiânia-GO, Brazil.
[email protected] and [email protected]

A Relative Inexact Proximal Gradient Method With an Explicit Linesearch

Yunier Bello-Cruz    Max L.N. Gonçalves    Jefferson G. Melo    Cassandra Mohr
(Received: date / Accepted: date)
Abstract

This paper presents and investigates an inexact proximal gradient method for solving composite convex optimization problems whose objective function is the sum of a differentiable convex function with full domain and a non-differentiable convex function. We introduce an explicit linesearch strategy that requires only a relative inexact solution of the proximal subproblem per iteration. We prove the convergence of the sequence generated by our scheme and establish its iteration complexity, considering both the functional values and a residual associated with first-order stationary solutions. Additionally, we provide numerical experiments to illustrate the practical efficacy of our method.

Keywords:
Linesearch · Iteration complexity · Nonsmooth and convex optimization problem · Proximal gradient method · Relative error rule
MSC:
65K05 90C25 90C30

1 Introduction

In this paper, our focus is on addressing nonsmooth convex optimization problems characterized by the following formulation:

minF(x):=f(x)+g(x)subject tox𝔼,\min\,F(x):=f(x)+g(x)\quad\text{subject to}\quad x\in\mathbbm{E}, (1)

where 𝔼\mathbbm{E} represents a finite-dimensional Euclidean space. The function g:𝔼¯:={+}g:\mathbbm{E}\to\overline{\mathbbm{R}}:=\mathbbm{R}\cup\{+\infty\} is a nonsmooth, proper, and lower semicontinuous convex function. The function f:𝔼f:\mathbbm{E}\to\mathbbm{R} is a continuously differentiable and convex function. Throughout the paper, we denote the optimal value and solution set of problem (1) by FF_{*} and SS_{*}, respectively. From now on, we assume that SS_{*}\neq\emptyset.

Proximal gradient methods effectively solve optimization problems such as in (1). The main step in the proximal gradient method involves evaluating the proximal operator 𝐩𝐫𝐨𝐱g:𝔼𝐝𝐨𝐦g:={x𝔼g(x)<+}{\bf prox}_{g}:\mathbbm{E}\to{\bf dom\,}g:=\{x\in\mathbbm{E}\mid g(x)<+\infty\} defined by:

𝐩𝐫𝐨𝐱g(y):=argminx𝔼{g(x)+12xy2},{\bf prox}_{g}(y):=\underset{x\in\mathbbm{E}}{\arg\min}\;\left\{g(x)+\frac{1}{2}\left\|x-y\right\|^{2}\right\}, (2)

where the norm, \|\cdot\|, is induced by the inner product of 𝔼\mathbbm{E}, ,\langle\cdot,\cdot\rangle, as :=,\|\cdot\|:=\sqrt{\langle\cdot,\cdot\rangle}. The proximal operator is well known to be a full-domain, firmly nonexpansive operator. These useful properties, together with the descent property of the gradient step, establish the foundation for the convergence and complexity analyses of proximal gradient iterations.
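For instance, when g=\lambda\|\cdot\|_{1} with \lambda>0, the proximal operator reduces to the componentwise soft-thresholding map. The following minimal NumPy sketch (illustrative only; the function name and test values are ours) evaluates it:

```python
import numpy as np

def prox_l1(y, lam):
    """Proximal operator of lam*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# Small usage example with lam = 0.5
y = np.array([1.2, -0.3, 0.7])
print(prox_l1(y, 0.5))  # approximately [0.7, 0., 0.2]
```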

Proximal Gradient Method (PGM) Let x0𝐝𝐨𝐦gx_{0}\in{\bf dom\,}g. Compute γk>0\gamma_{k}>0 and x¯k:=𝐩𝐫𝐨𝐱γkg(xkγkf(xk)).\bar{x}_{k}:={\bf prox}_{\gamma_{k}g}\left(x_{k}-\gamma_{k}\nabla f(x_{k})\right). (3) Choose βk(0,1]\beta_{k}\in(0,1] and compute xk+1:=xk+βk(x¯kxk).x_{k+1}:=x_{k}+\beta_{k}(\bar{x}_{k}-x_{k}). (4)
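For concreteness, the following Python sketch implements PGM as stated above for the case where the proximal operator of γg\gamma g is available in closed form; the constant stepsizes gamma and beta, the iteration count, and the small ISTA-style usage example are illustrative assumptions (the paper instead discusses linesearch strategies for choosing the stepsizes).

```python
import numpy as np

def proximal_gradient(x0, grad_f, prox_g, gamma=1.0, beta=1.0, max_iter=100):
    """Minimal PGM sketch: x_bar = prox_{gamma g}(x - gamma*grad_f(x)); x <- x + beta*(x_bar - x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_bar = prox_g(x - gamma * grad_f(x), gamma)  # proximal step (3)
        x = x + beta * (x_bar - x)                    # relaxation step (4)
    return x

# Illustrative use with f(x) = 0.5*||A x - b||^2 and g(x) = lam*||x||_1 (an ISTA-type iteration)
A = np.array([[1.0, 0.0], [0.0, 2.0]]); b = np.array([1.0, 1.0]); lam = 0.1
grad_f = lambda x: A.T @ (A @ x - b)
prox_g = lambda z, gamma: np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)
x_approx = proximal_gradient(np.zeros(2), grad_f, prox_g, gamma=0.2)
```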

The coefficients γk\gamma_{k} and βk\beta_{k}, referred to as stepsizes, can be determined based on backtracking linesearch procedures. Such strategies are essential whenever the global LL-Lipschitz continuity of f\nabla f fails or even when computing an acceptable upper bound for LL is challenging. This situation is often encountered in numerous applications; for instance, in inverse problems based on Banach norms Bredies:2009 ; Schuster:2012 or Bregman distances such as the Kullback-Leibler divergence Salzo:2014 ; Bonettini:2012 ; BelloCruz:2016a . Moreover, even when LL is known, linesearches may allow for longer steps toward the solution by using local information at every iteration.

There are several possible choices for these stepsizes, each impacting the algorithm’s performance in different ways; see, for instance, Salzo:2017a ; BelloCruz:2016a ; BelloCruz:2016c ; Bello-Cruz:2021 . It is important to note that in order to compute the stepsize γk\gamma_{k} using a backtracking linesearch at each iteration kk, the proximal operator may need to be evaluated multiple times within the procedure. Conversely, the stepsize βk\beta_{k} can be selected by evaluating the proximal operator only once per iteration. In this context, we use the term explicit linesearch for a backtracking procedure that determines βk\beta_{k} after setting γk\gamma_{k} as a constant for all kk. These explicit strategies are particularly advantageous in cases where evaluating the proximal operator is challenging. The function gg is often complex enough that the corresponding proximal operator lacks an analytical solution. In such cases, an ad-hoc algorithm must be employed to evaluate the proximal operator inexactly. For instance, consider 𝔼=n\mathbb{E}=\mathbbm{R}^{n}, and let g:ng:\mathbbm{R}^{n}\to\mathbbm{R} be defined as

g(x)=x1+λi<jmax{|xi|,|xj|},g(x)=\|x\|_{1}+\lambda\sum_{i<j}\max\{|x_{i}|,|x_{j}|\},

with λ>0\lambda>0, a form encountered in sparse regression problems with grouping, as discussed in Zhong:2012 . Similarly, in the context of matrix factorization, consider 𝔼=n×m\mathbb{E}=\mathbbm{R}^{n\times m} for the CUR-like factorization optimization problem Barre:2022 ; JMLR:v12:mairal11a , where the goal is to approximate a matrix Wm×nW\in\mathbbm{R}^{m\times n} with Xn×mX\in\mathbbm{R}^{n\times m} having sparse rows and columns. In this case, g:n×mg:\mathbbm{R}^{n\times m}\to\mathbbm{R} is given by

g(X)=λrowi=1nX(i)2+λcolj=1mX(j)2,g(X)=\lambda_{\text{row}}\sum_{i=1}^{n}\|X^{(i)}\|_{2}+\lambda_{\text{col}}\sum_{j=1}^{m}\|X_{(j)}\|_{2},

which will be considered in the numerical illustrations of this paper. For further examples and discussions, see BelloCruz:2017a ; Bello-Cruz:2023 ; Salzo:2012 ; Villa:2013 ; Schmidt:2011 ; Jiang:2012 ; Millan:2019 ; Barre:2022 . Of course, there are rare cases when the exact analytical solution of the proximal operator is available, such as when g:ng:\mathbbm{R}^{n}\to\mathbbm{R} is given by g(x)=x1g(x)=\|x\|_{1} or when gg is the indicator function of a simple closed convex set; see, for instance, Beck:2009 ; Combettes:2005 .

Consequently, in practice, the evaluation of the proximal operator is often done inexactly. A basic inexactness criterion is to approximately evaluate the proximal operator using an exogenous sequence of error tolerances which, in general, must be summable in order to guarantee the convergence of the proximal iterations. Such a diminishing sequence is a priori chosen without using any information that may be available along the iterations. Usually, the restrictive summability condition forces the solution of the proximal subproblem to be increasingly accurate, often more than necessary; see, for instance, Jiang:2012 ; Salzo:2012 ; Aujol:2015 ; Villa:2013 ; Schmidt:2011 . In the past two decades, relative error criteria have been considered as an effective and practical way of controlling the inexactness in solving proximal subproblems of several algorithms, including FISTA, ADMM, augmented Lagrangian, and Douglas-Rachford methods; refer to Eckstein:2013 ; Eckstein:2018 ; Alves:2020 ; Adona:2019 ; Adona:2020 for examples. Relative error criteria are often easy to verify in practice and have the advantage of exploiting the information available at a specific iteration, being, therefore, an interesting alternative to the aforementioned exogenous sequences.

In this paper, we propose and analyze an inexact proximal gradient method for solving problem (1). We propose a novel relative inexactness criterion for solving the proximal subproblems, which in some sense resembles the ideas of relative error criteria introduced in Monteiro:2010 ; Solodov:1999a , but adds some new elements in order to control the objective function on the inexact solution. The proposed scheme requires only one inexact solution of the proximal subproblem per iteration, and the stepsizes are computed through a relaxed explicit linesearch procedure that takes into account the residuals obtained in the proximal subproblem and enables the iteration to address non-Lipschitz optimization problems effectively. We show that the sequence generated by our method converges to a solution of problem (1). Moreover, we establish its iteration-complexity in terms of both the function values and the residuals associated with an approximate stationary solution. We also present some numerical experiments to illustrate the performance of the proposed method.

The paper is structured as follows: Section 2 presents definitions, basic facts, and auxiliary results. In Section 3, we introduce the inexact proximal gradient method with an explicit linesearch (IPG-ELS) and establish some of its fundamental properties. Section 4 establishes the iteration-complexity bounds of the IPG-ELS method in terms of functional values and a residual associated with the stationary condition for problem (1). Additionally, the full convergence of the sequence generated by the IPG-ELS method is discussed in this section. Some numerical experiments illustrating the behavior of the proposed scheme are reported in Section 5. Finally, concluding remarks are provided in Section 6.

2 Preliminary

In this section, we present some preliminary material and notations that will be used throughout this paper.

Let g:𝔼¯g:\mathbbm{E}\to\overline{\mathbbm{R}} be a proper, lower semicontinuous, and convex function. For a given ε0\varepsilon\geq 0, the ε\varepsilon-subdifferential of gg at x𝐝𝐨𝐦g={x𝔼g(x)<+}x\in{\bf dom\,}g=\{x\in\mathbbm{E}\mid g(x)<+\infty\}, denoted by εg(x)\partial_{\varepsilon}g(x), is defined as

εg(x):={v𝔼g(y)g(x)+v,yxε,y𝔼}.\partial_{\varepsilon}g(x):=\{v\in\mathbbm{E}\mid g(y)\geq g(x)+\langle v,y-x\rangle-\varepsilon,\ \forall y\in\mathbbm{E}\}. (5)

When x𝐝𝐨𝐦gx\notin{\bf dom\,}g, we define εg(x)=\partial_{\varepsilon}g(x)=\emptyset. Any element vεg(x)v\in\partial_{\varepsilon}g(x) is called an ε\varepsilon-subgradient. If ε=0\varepsilon=0, then 0g(x)\partial_{0}g(x) becomes the classical subdifferential of gg at xx, denoted by g(x)\partial g(x). It follows immediately from (5) that

vεg(y),ug(x)impliesvu,yxε.v\in\partial_{\varepsilon}g(y),\;u\in\partial g(x)\quad\text{implies}\quad\langle v-u,y-x\rangle\geq-\varepsilon. (6)

We present two useful auxiliary results for εg\partial_{\varepsilon}g. The first one is the closedness of the graph of εg\partial_{\varepsilon}g and the second is the so-called transportation formula; see Propositions 4.1.1 and 4.2.2 of Hiriart-Urruty:1993 .

Proposition 1 (Closed Graph Property)

Let (εk,xk,vk)k+×𝔼×𝔼(\varepsilon_{k},x_{k},v_{k})_{k\in\mathbbm{N}}\subseteq\mathbbm{R}_{+}\times\mathbbm{E}\times\mathbbm{E} be a sequence converging to (ε,x,v)(\varepsilon,x,v) with vkεkg(xk)v_{k}\in\partial_{\varepsilon_{k}}g(x_{k}) for all kk\in\mathbbm{N}. Then, vεg(x)v\in\partial_{\varepsilon}g(x).

Proposition 2 (Transportation Formula)

With xx and yy in 𝐝𝐨𝐦g,{\bf dom\,}g, let vg(y)v\in\partial g(y). Then v\in\partial_{\epsilon}g(x), where ϵ=g(x)g(y)v,xy0\epsilon=g(x)-g(y)-\langle{v},{x-y}\rangle\geq 0.

We now introduce a concept of approximate stationary solution to problem (1). First, note that x¯\bar{x} is a solution to problem (1) if and only if 0f(x¯)+g(x¯)0\in\nabla f(\bar{x})+\partial g(\bar{x}). The concept of approximate stationary solution relaxes this inclusion by introducing a residual vv and enlarging g\partial g using εg\partial_{\varepsilon}g.

Definition 1 (η\eta-Approximate Stationary Solution)

Given η>0\eta>0, an element x~𝐝𝐨𝐦g\tilde{x}\in{\bf dom\,}g is said to be an η\eta-approximate stationary solution to problem (1) with a residual pair (v,ε)(v,\varepsilon) if

vf(x~)+εg(x~),max{v,ε}η.v\in\nabla f(\tilde{x})+\partial_{\varepsilon}g(\tilde{x}),\quad\max\{\|v\|,\varepsilon\}\leq\eta. (7)

Next, we recall the definition of quasi-Fejér convergence, which is an important and well-known tool for establishing full convergence of gradient and proximal point type methods.

Definition 2 (Quasi-Fejér Convergence)

Let SS be a nonempty subset of 𝔼\mathbbm{E}. A sequence (xk)k𝔼(x_{k})_{k\in\mathbbm{N}}\subseteq\mathbbm{E} is said to be quasi-Fejér convergent to SS if and only if, for every xSx\in S, there exists a summable sequence (δk)k+(\delta_{k})_{k\in\mathbbm{N}}\subseteq\mathbbm{R}_{+} such that

xk+1x2xkx2+δk,k.\|x_{k+1}-x\|^{2}\leq\|x_{k}-x\|^{2}+\delta_{k},\qquad\forall k\in\mathbbm{N}. (8)

The following result presents the main properties of quasi-Fejér convergent sequences; see Theorem 2.6 of Bauschke:1996 .

Lemma 1 (Quasi-Fejér Convergence Properties)

If (xk)k(x_{k})_{k\in\mathbbm{N}} is quasi-Fejér convergent to SS, then the following statements hold:

(a) The sequence (xk)k(x_{k})_{k\in\mathbbm{N}} is bounded;

(b) If a cluster point x¯\bar{x} of (xk)k(x_{k})_{k\in\mathbbm{N}} belongs to SS, then (xk)k(x_{k})_{k\in\mathbbm{N}} is convergent to x¯\bar{x}.

We conclude the section with a basic inequality that will be used in the next sections.

Lemma 2

For any x,y𝔼x,y\in\mathbbm{E}, we have x+y22x2+2y2\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}.

3 Inexact Proximal Gradient Method

In this section, we introduce the inexact proximal gradient method with an explicit linesearch and establish some basic properties.

Recall that, given xk𝔼x_{k}\in\mathbbm{E}, the exact solution of the proximal gradient subproblem (3) with γk=1\gamma_{k}=1 consists of finding x¯k\bar{x}_{k} such that

x¯k=argminx𝔼{f(xk),xxk+g(x)+12xxk2}.\bar{x}_{k}={\rm arg}\!\min_{x\in\mathbbm{E}}\left\{\left\langle\nabla f(x_{k}),x-x_{k}\right\rangle+g(x)+\frac{1}{2}\|x-x_{k}\|^{2}\right\}.

This proximal subproblem is equivalent to the following monotone inclusion problem:

0f(xk)+g(x¯k)+x¯kxk.0\in\nabla f(x_{k})+\partial g(\bar{x}_{k})+\bar{x}_{k}-x_{k}. (9)

The kk-th iteration of the inexact proximal gradient method, which incorporates an explicit linesearch as proposed below, begins by first relaxing the above inclusion. This is achieved by finding an approximate solution x~k\tilde{x}_{k} along with a residual pair (vk,εk)(v_{k},\varepsilon_{k}) satisfying

vkf(xk)+εkg(x~k)+x~kxk,v_{k}\in\nabla f(x_{k})+\partial_{\varepsilon_{k}}g(\tilde{x}_{k})+\tilde{x}_{k}-x_{k},

and adheres to a specific mixed relative error criterion. This criterion is designed to ensure that the residual pair remains small, while also controlling the functional value of gg and the angle between f(xk)\nabla f(x_{k}) and vkv_{k}, in addition to guiding a search direction that involves the intermediate iteration x~k\tilde{x}_{k} and the residual vkv_{k}. If the stopping criterion x~k=xk\tilde{x}_{k}=x_{k} is not satisfied, we use a novel linesearch procedure, designed with the triple (x~k,vk,εk)(\tilde{x}_{k},v_{k},\varepsilon_{k}) in mind, to find a suitable stepsize βk\beta_{k}. This stepsize is used to move from the current point xkx_{k} along the direction ykxky_{k}-x_{k}, where yk=x~kvky_{k}=\tilde{x}_{k}-v_{k}, generating the next point xk+1x_{k+1}.

In the following, we formally describe the proposed method, which combines a relative inexact computation of the proximal subproblem with an explicit linesearch.

Inexact Proximal Gradient Method with an Explicit Linesearch (IPG-ELS)

Step 1 (Initialization). Let x0𝐝𝐨𝐦gx_{0}\in{\bf dom\,}g, τ(0,1]\tau\in(0,1], θ(0,1)\theta\in(0,1), γ1>1\gamma_{1}>1, γ21\gamma_{2}\geq 1, and α[0,1τ]\alpha\in[0,1-\tau]. Set k=0k=0.

Step 2 (Inexact Proximal Subproblem). Compute a triple (x~k,vk,εk)(\tilde{x}_{k},v_{k},\varepsilon_{k}) satisfying

vkf(xk)+εkg(x~k)+x~kxk,\displaystyle v_{k}\in\nabla f(x_{k})+\partial_{\varepsilon_{k}}g(\tilde{x}_{k})+\tilde{x}_{k}-x_{k}, (10)

g(x~kvk)g(x~k)f(xk),vk+(1+γ1)2vk2+(1+γ2)εk(1τα)2xkx~k2;\displaystyle g(\tilde{x}_{k}-v_{k})-g(\tilde{x}_{k})-\langle\nabla f(x_{k}),v_{k}\rangle+\frac{(1+\gamma_{1})}{2}\|v_{k}\|^{2}+(1+\gamma_{2})\varepsilon_{k}\leq\frac{(1-\tau-\alpha)}{2}\|x_{k}-\tilde{x}_{k}\|^{2}; (11)

Step 3 (Stopping Criterion). If x~k=xk\tilde{x}_{k}=x_{k}, then stop.

Step 4 (Linesearch Procedure). Set β=1\beta=1 and yk:=x~kvky_{k}:=\tilde{x}_{k}-v_{k}. If

\dstyf(xk+β(ykxk))f(xk)+βf(xk),ykxk+βτ2xkx~k2+γ12βvk2+βγ2εk,\dsty f\left(x_{k}+\beta(y_{k}-x_{k})\right)\leq f(x_{k})+\beta\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\frac{\beta\tau}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\gamma_{1}}{2}\beta\|v_{k}\|^{2}+\beta\gamma_{2}\varepsilon_{k}, (12)

then set βk=β\beta_{k}=\beta, xk+1=xk+βk(ykxk)x_{k+1}=x_{k}+\beta_{k}(y_{k}-x_{k}) and k:=k+1k:=k+1, and go to Step 2. Otherwise, set β=θβ\beta=\theta\beta, and verify (12).
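To fix ideas, the following Python sketch mirrors Steps 2-4 of the scheme above. The routine inexact_prox_subproblem is an assumed problem-dependent oracle returning a triple (x~k,vk,εk)(\tilde{x}_{k},v_{k},\varepsilon_{k}) satisfying the relative criterion (10)-(11) (for instance, via the Dykstra-like procedure of Section 5); the parameter defaults and iteration caps are illustrative.

```python
import numpy as np

def ipg_els(x0, f, grad_f, inexact_prox_subproblem,
            tau=0.8, theta=0.5, gamma1=1.1, gamma2=1.1, max_iter=100, tol=1e-10):
    """Schematic sketch of the IPG-ELS iteration (Steps 2-4).

    inexact_prox_subproblem(x, grad_fx) is an assumed oracle returning
    (x_tilde, v, eps) that satisfies the relative error criterion (10)-(11)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        gfx = grad_f(x)
        x_tilde, v, eps = inexact_prox_subproblem(x, gfx)       # Step 2
        if np.linalg.norm(x - x_tilde) <= tol:                  # Step 3: stop near a solution
            break
        y = x_tilde - v                                         # y_k = x_tilde_k - v_k
        d = y - x
        fx = f(x)
        slope = float(np.sum(gfx * d))                          # <grad f(x_k), y_k - x_k>
        extra = (0.5 * tau * np.linalg.norm(x - x_tilde) ** 2
                 + 0.5 * gamma1 * np.linalg.norm(v) ** 2 + gamma2 * eps)
        beta = 1.0
        for _ in range(100):                                    # Step 4: backtracking (finite by Lemma 5)
            if f(x + beta * d) <= fx + beta * (slope + extra):  # acceptance test (12)
                break
            beta *= theta
        x = x + beta * d                                        # x_{k+1} = x_k + beta_k (y_k - x_k)
    return x
```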

Remark 1 (The Inexact Proximal Subproblem)

Note that the inexact proximal gradient subproblem (10) is equivalent to the exact proximal gradient subproblem (9) whenever vk=0v_{k}=0 and εk=0\varepsilon_{k}=0. Moreover, it is possible to always set vkv_{k} to 0 and still introduce a novel scheme that differs from previous approaches based on summable error tolerances. Indeed, under this assumption, the inexact proximal subproblem (10)-(11) can be rewritten as:

0f(xk)+εkg(x~k)+x~kxk,\displaystyle 0\in\nabla f(x_{k})+\partial_{\varepsilon_{k}}g(\tilde{x}_{k})+\tilde{x}_{k}-x_{k}, (13)
εk(1τα)2(1+γ2)xkx~k2.\displaystyle\varepsilon_{k}\leq\frac{(1-\tau-\alpha)}{2(1+\gamma_{2})}\|x_{k}-\tilde{x}_{k}\|^{2}. (14)

It is worth noting that for the above subproblem, no summability assumption is made on the tolerance sequence (εk)k(\varepsilon_{k})_{k\in\mathbbm{N}} to achieve the convergence and complexity of the proposed method, as will be shown later. Furthermore, inclusion (10) and inequality (11) guarantee that x~k\tilde{x}_{k} and x~kvk\tilde{x}_{k}-v_{k} are in the 𝐝𝐨𝐦g{\bf dom\,}g. Additionally, if gg is the indicator function of a nonempty closed convex set CC, then (13)–(14) can be fulfilled by finding a point x~kC\tilde{x}_{k}\in C such that

xkf(xk)x~k,yx~kεk:=(1τα)2(1+γ2)xkx~k2,yC.\langle x_{k}-\nabla f(x_{k})-\tilde{x}_{k},y-\tilde{x}_{k}\rangle\leq\varepsilon_{k}:=\frac{(1-\tau-\alpha)}{2(1+\gamma_{2})}\|x_{k}-\tilde{x}_{k}\|^{2},\quad\forall y\in C. (15)

For example, if CC is bounded, one can use the conditional gradient (CondG) method, a.k.a. the Frank-Wolfe method Frank:1956 ; Jaggi:2013 , to obtain such a point x~kC\tilde{x}_{k}\in C. Indeed, given ztCz_{t}\in C, the tt-th step of the CondG method, applied to solve the projection problem minyCyxk+f(xk)2/2\min_{y\in C}\|y-x_{k}+\nabla f(x_{k})\|^{2}/2, first finds wtw_{t} as a minimizer of the linear function ztxk+f(xk),zt\langle{z_{t}-x_{k}+\nabla f(x_{k})},{\cdot-z_{t}}\rangle over CC and then sets zt+1=(1αt)zt+αtwtz_{t+1}=(1-\alpha_{t})z_{t}+\alpha_{t}w_{t} for some αt(0,1)\alpha_{t}\in(0,1). If, at an iteration \ell, the computed point zz_{\ell} satisfies the stopping criterion zxk+f(xk),wzεk\langle{z_{\ell}-x_{k}+\nabla f(x_{k})},{w_{\ell}-z_{\ell}}\rangle\geq-\varepsilon_{k} for the CondG method, then x~k:=z\tilde{x}_{k}:=z_{\ell} and (15) will hold.
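As an illustration of this remark, the sketch below runs CondG steps for the projection subproblem until the inner stopping test associated with (15) holds. The linear minimization oracle lmo over CC and the classical stepsize rule \alpha_{t}=2/(t+2) are assumptions made for the example (the text only requires \alpha_{t}\in(0,1)).

```python
import numpy as np

def condg_inexact_projection(x_k, grad_fx, lmo, c_tol, z0, max_iter=1000):
    """CondG (Frank-Wolfe) sketch for min_{y in C} 0.5*||y - (x_k - grad_fx)||^2.

    lmo(c) is an assumed linear minimization oracle returning argmin_{w in C} <c, w>;
    c_tol stands for the constant (1 - tau - alpha)/(2*(1 + gamma2)) appearing in (15)."""
    z = np.asarray(z0, dtype=float)
    for t in range(max_iter):
        c = z - x_k + grad_fx                         # gradient of the projection objective at z
        w = lmo(c)                                    # linear minimization step over C
        eps_k = c_tol * np.linalg.norm(x_k - z) ** 2  # adaptive tolerance from (15)
        if float(np.sum(c * (w - z))) >= -eps_k:      # inner stopping test of the remark
            break                                     # z plays the role of x_tilde_k
        z = z + (2.0 / (t + 2.0)) * (w - z)           # z_{t+1} = (1 - alpha_t) z_t + alpha_t w_t
    return z, eps_k
```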

Remark 2 (The Explicit Linesearch)

Note that the explicit linesearch proposed in Step 4 of the IPG-ELS is related to the one proposed in BelloCruz:2016a when vk=0v_{k}=0 and εk=0\varepsilon_{k}=0. Moreover, note that the linesearch does not require solving the inexact proximal subproblem inside the inner loop of Step 4.

We introduce a useful lemma that plays a crucial role in analyzing the stopping criterion of the IPG-ELS method.

Lemma 3 (Iteration Inequality Condition)

The following inequality holds for every iteration kk.

x~kxk,vk+(γ11)2vk2+γ2εk(1τα)2xkx~k2.\left\langle\tilde{x}_{k}-x_{k},v_{k}\right\rangle+\frac{(\gamma_{1}-1)}{2}\|v_{k}\|^{2}+\gamma_{2}\varepsilon_{k}\leq\frac{(1-\tau-\alpha)}{2}\|x_{k}-\tilde{x}_{k}\|^{2}. (16)

Proof It follows from (10) that

vk+xkx~kf(xk)εkg(x~k).v_{k}+{x}_{k}-\tilde{x}_{k}-\nabla f({x}_{k})\in\partial_{\varepsilon_{k}}g(\tilde{x}_{k}).

Now using (5) together with the fact that yk=x~kvky_{k}=\tilde{x}_{k}-v_{k}, we have

g(yk)g(x~k)\displaystyle g(y_{k})-g(\tilde{x}_{k}) vk+xkx~kf(xk),ykx~kεk\displaystyle\geq\left\langle v_{k}+{x}_{k}-\tilde{x}_{k}-\nabla f({x}_{k}),y_{k}-\tilde{x}_{k}\right\rangle-\varepsilon_{k}
=vk+xkx~kf(xk),vkεk,\displaystyle=\left\langle v_{k}+{x}_{k}-\tilde{x}_{k}-\nabla f({x}_{k}),-v_{k}\right\rangle-\varepsilon_{k},

which is equivalent to

Γk:=g(yk)g(x~k)f(xk),vk\displaystyle\Gamma_{k}:=g(y_{k})-g(\tilde{x}_{k})-\langle\nabla f({x}_{k}),v_{k}\rangle vk2+x~kxk,vkεk.\displaystyle\geq-\|v_{k}\|^{2}+\langle\tilde{x}_{k}-x_{k},v_{k}\rangle-\varepsilon_{k}. (17)

On the other hand, considering the definitions of Γk\Gamma_{k} and yky_{k}, we observe that (11) is equivalent to

Γk+(1+γ1)2vk2+(1+γ2)εk(1τα)2xkx~k2,\Gamma_{k}+\frac{(1+\gamma_{1})}{2}\|v_{k}\|^{2}+(1+\gamma_{2})\varepsilon_{k}\leq\frac{(1-\tau-\alpha)}{2}\|x_{k}-\tilde{x}_{k}\|^{2},

which, combined with (17), yields the desired inequality (16). ∎

The following result demonstrates that the termination criterion of the IPG-ELS method, as specified in Step 3, is satisfied only when a solution to problem (1) is identified.

Lemma 4 (Stopping at a Solution)

The IPG-ELS method terminates at the kk-th iteration if and only if xkx_{k} is a solution to problem (1).

Proof Assume that the IPG-ELS method stops at the kk-th iteration. In view of Step 3, we have x~k=xk\tilde{x}_{k}=x_{k}. Hence, it follows from (16) that

(γ11)2vk2+γ2εk0.\frac{(\gamma_{1}-1)}{2}\|v_{k}\|^{2}+\gamma_{2}\varepsilon_{k}\leq 0.

Since γ1>1\gamma_{1}>1 and γ21\gamma_{2}\geq 1, we obtain vk=0v_{k}=0 and εk=0\varepsilon_{k}=0. Hence, in view of (10), we get that 0f(xk)+g(xk),0\in\nabla f(x_{k})+\partial g(x_{k}), concluding that xkx_{k} is a solution of problem (1).

Assume now that xkx_{k} is a solution of problem (1). Thus, f(xk)g(xk).-\nabla f(x_{k})\in\partial g(x_{k}). It follows from (10) that vk+xkx~kf(xk)εkg(x~k)v_{k}+x_{k}-\tilde{x}_{k}-\nabla f(x_{k})\in\partial_{\varepsilon_{k}}g(\tilde{x}_{k}). So, the ε\varepsilon-monotonicity of εg\partial_{\varepsilon}g in (6) implies

vk+xkx~k,x~kxkεk,\langle v_{k}+x_{k}-\tilde{x}_{k},\tilde{x}_{k}-x_{k}\rangle\geq-\varepsilon_{k},

which is equivalent to

vk,x~kxkxkx~k2εk.\langle v_{k},\tilde{x}_{k}-x_{k}\rangle\geq\|x_{k}-\tilde{x}_{k}\|^{2}-\varepsilon_{k}. (18)

Hence, it follows from (16) that

(1+τ+α)2xkx~k2+(γ11)2vk2+(γ21)εk0.\frac{(1+\tau+\alpha)}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{(\gamma_{1}-1)}{2}\|v_{k}\|^{2}+(\gamma_{2}-1)\varepsilon_{k}\leq 0.

Since γ1>1\gamma_{1}>1, γ21\gamma_{2}\geq 1, α0\alpha\geq 0, and τ>0\tau>0, we conclude that xk=x~kx_{k}=\tilde{x}_{k} and vk=0v_{k}=0. Hence, from (11), we also have εk=0\varepsilon_{k}=0. Therefore, the IPG-ELS method stops at the kk-th iteration. ∎

In the following, we establish the well-definedness of the linesearch procedure in Step 4 of the IPG-ELS method.

Lemma 5 (Finite Linesearch Termination)

The linesearch procedure in Step 4 stops after a finite number of steps.

Proof In view of Step 3, we have xkx~kx_{k}\neq\tilde{x}_{k}. Assume, for the sake of contradiction, that the linesearch procedure does not terminate after a finite number of steps. Thus, for all β{1,θ,θ2,}\beta\in\{1,\theta,\theta^{2},\ldots\}, we have

f(xk+β(ykxk))>f(xk)+βf(xk),ykxk+βτ2xkx~k2+γ12βvk2+βγ2εk,f\left(x_{k}+\beta(y_{k}-x_{k})\right)>f(x_{k})+\beta\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\frac{\beta\tau}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\gamma_{1}}{2}\beta\|v_{k}\|^{2}+\beta\gamma_{2}\varepsilon_{k},

or, equivalently,

f(xk+β(ykxk))f(xk)βf(xk),ykxk>τ2xkx~k2+γ12vk2+γ2εk.\frac{f\left(x_{k}+\beta(y_{k}-x_{k})\right)-f(x_{k})}{\beta}-\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle>\frac{\tau}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\gamma_{1}}{2}\|v_{k}\|^{2}+\gamma_{2}\varepsilon_{k}.

Given that ff is differentiable, the left-hand side of the above inequality approaches zero as β0\beta\downarrow 0, leading to the conclusion that

0τ2xkx~k2+γ12vk2+γ2εk.\displaystyle 0\geq\frac{\tau}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\gamma_{1}}{2}\|v_{k}\|^{2}+\gamma_{2}\varepsilon_{k}.

This implies xk=x~kx_{k}=\tilde{x}_{k} contradicting the assumption that xkx~kx_{k}\neq\tilde{x}_{k}. ∎

The subsequent analysis investigates the linesearch procedure introduced in Step 4 of the IPG-ELS method, assuming that the gradient of the function ff, f\nabla f, is Lipschitz continuous. We aim to establish an upper bound for the number of iterations required by the linesearch procedure in Step 4.

Lemma 6 (Lipschitz Condition and Linesearch Complexity)

Assume that ff has an LL-Lipschitz continuous gradient and that xkx_{k} is not a solution to problem (1). Then, any βτ/(2L)\beta\leq\tau/(2L) satisfies (12). As a consequence, the linesearch procedure in Step 4 of the IPG-ELS method stops in at most

:=ln(min{τ/(2L),1})ln(θ)\ell:=\left\lceil\frac{\ln(\min\{\tau/(2L),1\})}{\ln(\theta)}\right\rceil (19)

iterations.

Proof Since f\nabla f is LL-Lipschitz continuous, for any β>0\beta>0, we have

f(xk+β(ykxk))f(xk)βf(xk),ykxk\displaystyle f(x_{k}+\beta(y_{k}-x_{k}))-f(x_{k})-\beta\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle Lβ22ykxk2.\displaystyle\leq\frac{L\beta^{2}}{2}\|y_{k}-x_{k}\|^{2}.

Hence, if βτ/(2L)\beta\leq\tau/(2L), we conclude that

f(xk+β(ykxk))f(xk)βf(xk),ykxk\displaystyle f(x_{k}+\beta(y_{k}-x_{k}))-f(x_{k})-\beta\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle τβ4ykxk2=τβ4xkx~k+vk2\displaystyle\leq\frac{\tau\beta}{4}\|y_{k}-x_{k}\|^{2}=\frac{\tau\beta}{4}\|x_{k}-\tilde{x}_{k}+v_{k}\|^{2}
βτ2xkx~k2+τ2βvk2,\displaystyle\leq\frac{\beta\tau}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\tau}{2}\beta\|v_{k}\|^{2},

using Lemma 2 in the last inequality. Since γ1>1>τ\gamma_{1}>1>\tau and γ20\gamma_{2}\geq 0, we have that (12) holds, thereby proving the first statement of the lemma. The last statement follows from the first one, given that the natural number \ell, defined in (19), satisfies β:=θmin{τ/(2L),1}\beta_{\ell}:=\theta^{\ell}\leq\min\{\tau/(2L),1\}. ∎
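As a worked instance of (19), take the parameters used later in the numerical experiments, \tau=0.8 and \theta=0.5, together with the smallest Lipschitz constant reported in Table 1, L\approx 41.58. Then

\ell=\left\lceil\frac{\ln\left(\min\{0.8/83.16,\,1\}\right)}{\ln(0.5)}\right\rceil=\left\lceil\frac{\ln(0.00962)}{\ln(0.5)}\right\rceil=\lceil 6.70\rceil=7,

so the linesearch procedure stops within at most seven trials at any iteration of that instance.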

This lemma provides a sufficient condition ensuring that the stepsizes generated by our linesearch are bounded away from zero. Specifically, if f\nabla f is LL-Lipschitz, then the stepsizes βk\beta_{k}, produced through the linesearch (12), are guaranteed to be bounded below by a positive constant β>0\beta>0, i.e., βkβ>0\beta_{k}\geq\beta>0 for all kk\in\mathbb{N}. Moreover, it is possible to relax the global Lipschitz condition to a local one, such as f\nabla f being locally Lipschitz continuous around any solution of problem (1), as was done in Proposition 5.4(ii) of BelloCruz:2016a .

4 Convergence and Complexity Analyses of the IPG-ELS Method

In this section, we focus on analyzing the convergence properties of the IPG-ELS method. We establish the convergence and its iteration complexity in terms of functional values and a residual associated with an approximate stationary solution, as defined in Definition 1.

We begin this section by presenting a result that is fundamental for establishing the convergence and the iteration complexity of the IPG-ELS method.

Proposition 3 (Key Inequality for the IPG-ELS Method)

For every x𝐝𝐨𝐦gx\in{\bf dom\,}\,g and kk\in\mathbbm{N}, we have

2βk[F(xk)F(x)]+2(F(xk+1)F(xk))xkx2xk+1x2αβkxkx~k2.2\beta_{k}[F(x_{k})-F(x)]+2({F(x_{k+1})-F(x_{k})})\leq\|x_{k}-x\|^{2}-\|x_{k+1}-x\|^{2}-{\alpha\beta_{k}}\|x_{k}-\tilde{x}_{k}\|^{2}. (20)

Additionally, the functional value sequence (F(xk))k(F(x_{k}))_{k\in\mathbbm{N}} is decreasing and convergent, and \sum_{k=0}^{\infty}\|x_{k}-x_{k+1}\|^{2}<+\infty.

Proof Let x𝐝𝐨𝐦gx\in{\bf dom\,}\,g and kk\in\mathbb{N}. In view of the inexact relative inclusion (10), we have vk+xkx~kf(xk)εkg(x~k)v_{k}+{x}_{k}-\tilde{x}_{k}-\nabla f({x}_{k})\in\partial_{\varepsilon_{k}}g(\tilde{x}_{k}), and hence the definition of εg\partial_{\varepsilon}g in (5) implies that

g(x)g(x~k)+vk+xkx~kf(xk),xx~kεk.g(x)\geq g(\tilde{x}_{k})+\langle v_{k}+{x}_{k}-\tilde{x}_{k}-\nabla f({x}_{k}),x-\tilde{x}_{k}\rangle-\varepsilon_{k}. (21)

Since ff is convex, we have f(x)f(xk)f(xk),xxkf(x)-f(x_{k})\geq\langle\nabla f(x_{k}),x-x_{k}\rangle. Adding the above two inequalities, using f+g=Ff+g=F, and simplifying the resulting expression, we obtain

F(x)F(xk)g(x~k)g(xk)+f(xk),x~kxkxkx~k,x~kx+vk,xx~kεk.F(x)-F(x_{k})\geq g(\tilde{x}_{k})-g(x_{k})+\langle\nabla f(x_{k}),\tilde{x}_{k}-x_{k}\rangle-\langle{{x}_{k}-\tilde{x}_{k}},{\tilde{x}_{k}-x}\rangle+\langle{v_{k}},{x-\tilde{x}_{k}}\rangle-\varepsilon_{k}.

Combining the above inequality with the identity

xkx~k,x~kx=12[xkx~k2+x~kx2xkx2],-\langle{{x}_{k}-\tilde{x}_{k}},{\tilde{x}_{k}-x}\rangle=\frac{1}{2}\left[\|x_{k}-\tilde{x}_{k}\|^{2}+\|\tilde{x}_{k}-x\|^{2}-\|x_{k}-x\|^{2}\right],

we have

F(x)F(xk)\displaystyle F(x)-F(x_{k})\geq g(x~k)g(xk)+f(xk),x~kxk+12xkx~k2\displaystyle\;g(\tilde{x}_{k})-g(x_{k})+\langle\nabla f(x_{k}),\tilde{x}_{k}-x_{k}\rangle+\frac{1}{2}\|x_{k}-\tilde{x}_{k}\|^{2}
+12[x~kx2xkx2]+vk,xx~kεk,\displaystyle+\frac{1}{2}\left[\|\tilde{x}_{k}-x\|^{2}-\|x_{k}-x\|^{2}\right]+\langle{v_{k}},{x-\tilde{x}_{k}}\rangle-\varepsilon_{k},

or, equivalently,

F(x)F(xk)+F(xk)F(xk+1)βk\displaystyle F(x)-F(x_{k})+\frac{F(x_{k})-F(x_{k+1})}{\beta_{k}}\geq g(xk)g(xk+1)βk+g(x~k)g(xk)+f(xk),x~kyk\displaystyle\frac{g(x_{k})-g(x_{k+1})}{\beta_{k}}+g(\tilde{x}_{k})-g(x_{k})+\langle\nabla f(x_{k}),\tilde{x}_{k}-y_{k}\rangle
+f(xk)f(xk+1)βk+f(xk),ykxk+12xkx~k2\displaystyle+\frac{f(x_{k})-f(x_{k+1})}{\beta_{k}}+\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\frac{1}{2}\|x_{k}-\tilde{x}_{k}\|^{2}
+12[x(x~kvk)2xkx2]12vk2εk.\displaystyle+\frac{1}{2}\left[\|x-(\tilde{x}_{k}-v_{k})\|^{2}-\|x_{k}-x\|^{2}\right]-\frac{1}{2}\|v_{k}\|^{2}-\varepsilon_{k}.

Since xk+1=xk+βk(ykxk)x_{k+1}=x_{k}+\beta_{k}(y_{k}-x_{k}) and gg is convex, we have g(xk+1)g(xk)βk(g(yk)g(xk))g(x_{k+1})-g(x_{k})\leq\beta_{k}(g(y_{k})-g(x_{k})). Hence, combining the last two inequalities and the fact that yk=x~kvky_{k}=\tilde{x}_{k}-v_{k}, we get

F(x)F(xk)+F(xk)F(xk+1)βk\displaystyle F(x)-F(x_{k})+\frac{F(x_{k})-F(x_{k+1})}{\beta_{k}}\geq 12[xyk2xkx2]\displaystyle\frac{1}{2}\left[\|x-y_{k}\|^{2}-\|x_{k}-x\|^{2}\right]
+f(xk)f(xk+1)βk+f(xk),ykxk+12xkx~k2\displaystyle+\frac{f(x_{k})-f(x_{k+1})}{\beta_{k}}+\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\frac{1}{2}\|x_{k}-\tilde{x}_{k}\|^{2}
+g(x~k)g(yk)+f(xk),x~kyk12vk2εk.\displaystyle+g(\tilde{x}_{k})-g(y_{k})+\langle\nabla f(x_{k}),\tilde{x}_{k}-y_{k}\rangle-\frac{1}{2}\|v_{k}\|^{2}-\varepsilon_{k}.

or, equivalently,

F(x)F(xk)\displaystyle F(x)-F(x_{k}) +F(xk)F(xk+1)βk12[xyk2xkx2]+α2xkx~k2\displaystyle+\frac{F(x_{k})-F(x_{k+1})}{\beta_{k}}\geq\frac{1}{2}\left[\|x-y_{k}\|^{2}-\|x_{k}-x\|^{2}\right]+\frac{\alpha}{2}\|x_{k}-\tilde{x}_{k}\|^{2}
+f(xk)f(xk+1)βk+f(xk),ykxk+τ2xkx~k2+γ12vk2+γ2εk\displaystyle+\frac{f(x_{k})-f(x_{k+1})}{\beta_{k}}+\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\frac{\tau}{2}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\gamma_{1}}{2}\|v_{k}\|^{2}+\gamma_{2}\varepsilon_{k}
+g(x~k)g(yk)+f(xk),x~kyk(1+γ1)2vk2(1+γ2)εk\displaystyle+g(\tilde{x}_{k})-g(y_{k})+\langle\nabla f(x_{k}),\tilde{x}_{k}-y_{k}\rangle-\frac{(1+\gamma_{1})}{2}\|v_{k}\|^{2}-(1+\gamma_{2})\varepsilon_{k}
+(1τα)2xkx~k2.\displaystyle+\frac{(1-\tau-\alpha)}{2}\|x_{k}-\tilde{x}_{k}\|^{2}.

Now, using the linesearch procedure of Step 4, we obtain

F(x)F(xk)+F(xk)F(xk+1)βk\displaystyle F(x)-F(x_{k})+\frac{F(x_{k})-F(x_{k+1})}{\beta_{k}}\geq 12[xyk2xkx2]+α2xkx~k2\displaystyle\frac{1}{2}\left[\|x-y_{k}\|^{2}-\|x_{k}-x\|^{2}\right]+\frac{\alpha}{2}\|x_{k}-\tilde{x}_{k}\|^{2}
+g(x~k)g(yk)+f(xk),x~kyk\displaystyle+g(\tilde{x}_{k})-g(y_{k})+\langle\nabla f(x_{k}),\tilde{x}_{k}-y_{k}\rangle
(1+γ1)2vk2(1+γ2)εk+(1τα)2xkx~k2.\displaystyle-\frac{(1+\gamma_{1})}{2}\|v_{k}\|^{2}-(1+\gamma_{2})\varepsilon_{k}+\frac{(1-\tau-\alpha)}{2}\|x_{k}-\tilde{x}_{k}\|^{2}.

It follows from the above inequality and (11) that

βk[F(x)F(xk)]+F(xk)F(xk+1)βk2[xyk2xkx2]+αβk2xkx~k2.\beta_{k}[F(x)-F(x_{k})]+{F(x_{k})-F(x_{k+1})}\geq\frac{\beta_{k}}{2}\left[\|x-y_{k}\|^{2}-\|x_{k}-x\|^{2}\right]+\frac{\alpha\beta_{k}}{2}\|x_{k}-\tilde{x}_{k}\|^{2}. (22)

On the other hand, using the identity xk+1x=(1βk)(xkx)+βk(ykx)x_{k+1}-x=(1-\beta_{k})(x_{k}-x)+\beta_{k}(y_{k}-x) and the strong convexity of 2\|\cdot\|^{2}, we have

xk+1x2(1βk)xkx2+βkykx2(1βk)βkxkyk2,\|x_{k+1}-x\|^{2}\leq(1-\beta_{k})\|x_{k}-x\|^{2}+\beta_{k}\|y_{k}-x\|^{2}-(1-\beta_{k})\beta_{k}\|x_{k}-y_{k}\|^{2},

which implies

βk(ykx2xkx2)xk+1x2xkx2.\beta_{k}\left(\|y_{k}-x\|^{2}-\|x_{k}-x\|^{2}\right)\geq\|x_{k+1}-x\|^{2}-\|x_{k}-x\|^{2}.

Therefore, the proof of (20) follows by combining the latter inequality with (22). For the last statement of the proposition, note that (20) with x=xkx=x_{k} immediately yields

2(F(xk+1)F(xk))xk+1xk2<0.2({F(x_{k+1})-F(x_{k})})\leq-\|x_{k+1}-x_{k}\|^{2}<0. (23)

So, (F(xk))k(F(x_{k}))_{k\in\mathbbm{N}} is decreasing and convergent because it is bounded from below by F(x)F(x_{*}). Moreover, the last inequality implies that \sum_{k=0}^{\infty}\|x_{k}-x_{k+1}\|^{2}<+\infty. ∎

Next, we establish the convergence rate of the functional values sequence (F(xk))k(F(x_{k}))_{k\in\mathbbm{N}}.

Theorem 4.1 (Convergence Rate of the IPG-ELS Method)

Let (xk)k(x_{k})_{k\in\mathbbm{N}} and (βk)k(\beta_{k})_{k\in\mathbbm{N}} be generated by the IPG-ELS method. Assume that there exists β>0\beta>0 such that βkβ\beta_{k}\geq\beta for all kk\in\mathbbm{N}. Then, for all kk\in\mathbbm{N}, we have

F(xk)Fdist(x0,S)2+2(F(x0)F)2β(k+1).F(x_{k})-F_{*}\leq\frac{{\rm dist}(x_{0},S_{*})^{2}+2\left(F(x_{0})-F_{*}\right)}{2\beta(k+1)}. (24)

Proof For any \ell\in\mathbbm{N} and xSx_{*}\in S_{*}, it follows from Proposition 3 with k=k=\ell and x=xx=x_{*} that

F(x)F(x)12β[xx2x+1x2+2(F(x)F(x+1))]F(x_{\ell})-F(x_{*})\leq\frac{1}{2\beta_{\ell}}\left[\|x_{\ell}-x_{*}\|^{2}-\|x_{\ell+1}-x_{*}\|^{2}+2\left(F(x_{\ell})-F(x_{\ell+1})\right)\right]

Summing the above inequality over =0,1,,k\ell=0,1,\ldots,k, we have

=0k(F(x)F(x))\displaystyle\sum_{\ell=0}^{k}\left(F(x_{\ell})-F(x_{*})\right) 12β[x0x2xk+1x2+2(F(x0)F(xk+1))]\displaystyle\leq\frac{1}{2\beta}\left[\|x_{0}-x_{*}\|^{2}-\|x_{k+1}-x_{*}\|^{2}+2\left(F(x_{0})-F(x_{k+1})\right)\right]
12β[x0x2+2(F(x0)F(x))].\displaystyle\leq\frac{1}{2\beta}\left[\|x_{0}-x_{*}\|^{2}+2\left(F(x_{0})-F(x_{*})\right)\right]. (25)

Since, in view of the last statement of Proposition 3, we have F(x+1)F(x)F(x_{\ell+1})\leq F(x_{\ell}) for any \ell\in\mathbbm{N}, it follows from (25) that

(k+1)(F(xk)F(x))12β(x0x2+2[F(x0)F(x)]).(k+1)\left(F(x_{k})-F(x_{*})\right)\leq\frac{1}{2\beta}\left(\|x_{0}-x_{*}\|^{2}+2\left[F(x_{0})-F(x_{*})\right]\right).

Since xSx_{*}\in S_{*} is arbitrary, the proof of (24) follows. ∎

Remark 3 (Complexity of η\eta-Approximate Solution)

It follows from Theorem 4.1 that every accumulation point of the sequence (xk)k(x_{k})_{k\in\mathbbm{N}} is a solution to problem (1). Moreover, given any η>0\eta>0, in at most k=\mathcal{O}(1/\eta) iterations, the IPG-ELS method generates an η\eta-approximate solution xkx_{k} to problem (1), in the sense that F(xk)FηF(x_{k})-F_{*}\leq\eta.

Theorem 4.2 (Complexity of η\eta-Approximate Stationary Solution)

Consider (xk)k(x_{k})_{k\in\mathbbm{N}}, (x~k)k(\tilde{x}_{k})_{k\in\mathbbm{N}}, (vk)k(v_{k})_{k\in\mathbbm{N}}, and (εk)k(\varepsilon_{k})_{k\in\mathbbm{N}} generated by the IPG-ELS method and define wk:=vk+xkx~k+f(x~k)f(xk)w_{k}:=v_{k}+x_{k}-\tilde{x}_{k}+\nabla f(\tilde{x}_{k})-\nabla f(x_{k}), for every kk\in\mathbbm{N}. Then, we have

wkf(x~k)+εkg(x~k)εkF(x~k),k.w_{k}\in\nabla f(\tilde{x}_{k})+\partial_{\varepsilon_{k}}g(\tilde{x}_{k})\subseteq\partial_{\varepsilon_{k}}F(\tilde{x}_{k}),\qquad\forall k\in\mathbbm{N}. (26)

Additionally, if α(0,1τ)\alpha\in(0,1-\tau), and f\nabla f is Lipschitz continuous on 𝐝𝐨𝐦g{\bf dom\,}g, then, given a tolerance η>0\eta>0, the IPG-ELS method generates an η\eta-approximate stationary solution x~k\tilde{x}_{k} to problem (1) with residues (wk,εk)(w_{k},\varepsilon_{k}), in the sense of Definition 1, in at most k=O(1/η2)k=O\left(1/\eta^{2}\right) iterations.

Proof The first inclusion in (26) follows immediately from (10) and the definition of wkw_{k}, whereas the second inclusion in (26) follows from the definitions of FF and εF\partial_{\varepsilon}F.

Now let xx_{*} be the projection of x0x_{0} onto SS_{*} and let d0:=x0xd_{0}:=\|x_{0}-x_{*}\|. As observed in Lemma 6, the Lipschitz continuity of f\nabla f implies that there exists β>0\beta>0 such that βkβ\beta_{k}\geq\beta for all kk. It follows from Proposition 3 with k=k=\ell\in\mathbbm{N} and x=xSx=x_{*}\in S_{*} that

αβxx~2(xx2x+1x2+2[F(x)F(x+1)])\alpha\beta\|x_{\ell}-\tilde{x}_{\ell}\|^{2}\leq\left(\|x_{\ell}-x_{*}\|^{2}-\|x_{\ell+1}-x_{*}\|^{2}+2\left[F(x_{\ell})-F(x_{\ell+1})\right]\right) (27)

for all \ell\in\mathbbm{N}. Summing the above inequality over =0,1,,k\ell=0,1,\ldots,k, and using that xSx_{*}\in S_{*}, we have

αβ=0kxx~2x0x2+2[F(x0)F(x)].\alpha\beta\sum_{\ell=0}^{k}\|x_{\ell}-\tilde{x}_{\ell}\|^{2}\leq\|x_{0}-x_{*}\|^{2}+2\left[F(x_{0})-F(x_{*})\right].

Hence, since d0=x0xd_{0}=\|x_{0}-x_{*}\|, we see that there exists kk{\ell_{k}}\leq k such that

x~kxk2d02+2[F(x0)F(x)]αβ(k+1).\|\tilde{x}_{{\ell_{k}}}-x_{{\ell_{k}}}\|^{2}\leq\frac{d_{0}^{2}+2[F(x_{0})-F(x_{*})]}{\alpha\beta(k+1)}. (28)

On the other hand, since τ,α>0\tau,\alpha>0, it follows from (16) that, for every \ell\in\mathbbm{N},

(γ11)2v2+γ2ε\displaystyle\frac{(\gamma_{1}-1)}{2}\|v_{\ell}\|^{2}+\gamma_{2}\varepsilon_{\ell} xx~22+xx~,v\displaystyle\leq\frac{\|x_{\ell}-\tilde{x}_{\ell}\|^{2}}{2}+\langle x_{\ell}-\tilde{x}_{\ell},v_{\ell}\rangle
xx~22+xx~2γ11+(γ11)4v2,\displaystyle\leq\frac{\|x_{\ell}-\tilde{x}_{\ell}\|^{2}}{2}+\frac{\|x_{\ell}-\tilde{x}_{\ell}\|^{2}}{\gamma_{1}-1}+\frac{(\gamma_{1}-1)}{4}\|v_{\ell}\|^{2},

where the last inequality is due to Cauchy-Schwarz inequality and the fact that absa2/2+b2/(2s)ab\leq sa^{2}/2+b^{2}/(2s) for any a,ba,b\in\mathbbm{R} and s>0s>0, in particular, with a=va=\|v_{\ell}\|, b=x~xb=\|\tilde{x}_{\ell}-x_{\ell}\|, and s=(γ11)/2s=(\gamma_{1}-1)/2. Hence, we have

(γ11)4v2+γ2ε(γ1+12(γ11))x~x2,.\frac{(\gamma_{1}-1)}{4}\|v_{\ell}\|^{2}+\gamma_{2}\varepsilon_{\ell}\leq\left(\frac{\gamma_{1}+1}{2(\gamma_{1}-1)}\right)\|\tilde{x}_{\ell}-x_{\ell}\|^{2},\qquad\forall\ell\in\mathbbm{N}. (29)

Now, from the definition of ww_{\ell}, (29), the triangle inequality, and the Lipschitz continuity of f\nabla f, we have, for every \ell\in\mathbbm{N},

w\displaystyle\|w_{\ell}\| v+xx~+f(x)f(x~)\displaystyle\leq\|v_{\ell}\|+\|{x}_{\ell}-\tilde{x}_{\ell}\|+\|\nabla f(x_{\ell})-\nabla f(\tilde{x}_{\ell})\|
[2(γ1+1)γ11+1+L]xx~.\displaystyle\leq\left[\frac{\sqrt{2(\gamma_{1}+1)}}{\gamma_{1}-1}+1+L\right]\|x_{\ell}-\tilde{x}_{\ell}\|. (30)

Moreover, it follows from (29) that

ε(γ1+12γ2(γ11))xx~2.\varepsilon_{\ell}\leq\left(\frac{\gamma_{1}+1}{2\gamma_{2}(\gamma_{1}-1)}\right)\|x_{\ell}-\tilde{x}_{\ell}\|^{2}. (31)

Hence, it follows from (28), (30), and (31) with =k\ell=\ell_{k} and m0:=d02+2(F(x0)F(x))m_{0}:=d_{0}^{2}+2(F(x_{0})-F(x_{*})) that

\|w_{\ell_{k}}\|\leq\left[\frac{\sqrt{2(\gamma_{1}+1)}}{\gamma_{1}-1}+1+L\right]\frac{\sqrt{m_{0}}}{\sqrt{\alpha\beta(k+1)}},\qquad\varepsilon_{\ell_{k}}\leq\left(\frac{\gamma_{1}+1}{2\gamma_{2}(\gamma_{1}-1)}\right)\frac{m_{0}}{\alpha\beta(k+1)},

which in turn implies that

\|w_{\ell_{k}}\|=\mathcal{O}(1/\sqrt{k}),\quad\varepsilon_{\ell_{k}}=\mathcal{O}(1/k). (32)

Thus, the last statement of the theorem follows from (32) and the first inclusion in (26). ∎

We end this section by establishing the full convergence of the sequence (xk)k(x_{k})_{k\in\mathbbm{N}} to a solution of problem (1). The proof is based on the quasi-Fejér convergence of the sequence (xk)k(x_{k})_{k\in\mathbbm{N}} to the set SS_{*}, as defined in Definition 2.

Theorem 4.3 (Convergence for the IPG-ELS Method)

The sequence (xk)k(x_{k})_{k\in\mathbbm{N}} generated by the IPG-ELS method converges to a point in SS_{*}.

Proof By employing Proposition 3 at x=xS𝐝𝐨𝐦gx=x_{*}\in S_{*}\subseteq{\bf dom\,}g, we have

xk+1x2xkx2+2[F(xk)F(xk+1)]for allk.\|x_{k+1}-x_{*}\|^{2}\leq\|x_{k}-x_{*}\|^{2}+2\left[F(x_{k})-F(x_{k+1})\right]\quad\mbox{for all}\quad k\in\mathbbm{N}. (33)

We set δk:=2[F(xk)F(xk+1)]0\delta_{k}:=2\left[F(x_{k})-F(x_{k+1})\right]\geq 0, and we will prove that (δk)k(\delta_{k})_{k\in\mathbbm{N}} is a summable sequence. In fact,

k=0δk=\displaystyle\sum_{k=0}^{\infty}\delta_{k}= 2k=0[F(xk)F(xk+1)]2[F(x0)limkF(xk+1)]\displaystyle 2\sum_{k=0}^{\infty}\Big{[}F(x_{k})-F(x_{k+1})\Big{]}\leq 2\Big{[}F(x_{0})-\lim_{k\to\infty}F(x_{k+1})\Big{]}
\displaystyle\leq 2[F(x0)F(x)]<+.\displaystyle 2\Big{[}F(x_{0})-F(x_{*})\Big{]}<+\infty.

This together with (33) tells us that the sequence (xk)k(x_{k})_{k\in\mathbbm{N}} is quasi-Fejér convergent to SS_{*} via Definition 2. By Lemma 1(a), the sequence (xk)k(x_{k})_{k\in\mathbbm{N}} is bounded. Let x¯\bar{x} be a cluster point of (xk)k(x_{k})_{k\in\mathbbm{N}}. Hence there exists a subsequence (xk)k(x_{\ell_{k}})_{k\in\mathbbm{N}} of (xk)k(x_{k})_{k\in\mathbbm{N}} converging to x¯\bar{x}.

Now we proceed by considering the two possible cases:

Case 1. The sequence (βk)k\left(\beta_{\ell_{k}}\right)_{k\in\mathbbm{N}} does not converge to 0, i.e., there exist some β>0\beta>0 and a subsequence of (βk)k\left(\beta_{\ell_{k}}\right)_{k\in\mathbbm{N}} (without relabeling) such that

βkβ,k.\beta_{\ell_{k}}\geq\beta,\quad\forall\,k\in\mathbbm{N}. (34)

By using Proposition 3 with x=xSx=x_{*}\in S_{*}, we get

βk[F(xk)F(x)]\displaystyle\beta_{k}\left[F(x_{k})-F(x_{*})\right]\leq 12(xkx2xk+1x2)+F(xk)F(xk+1).\displaystyle\frac{1}{2}(\|x_{k}-x_{*}\|^{2}-\|x_{k+1}-x_{*}\|^{2})+F(x_{k})-F(x_{k+1}).

Summing for k=0,1,,mk=0,1,\ldots,m, the above inequality implies that

k=0mβk[F(xk)F(x)]\displaystyle\sum_{k=0}^{m}\beta_{k}\left[F(x_{k})-F(x_{*})\right]\leq 12(x0x2xm+1x2)+F(x0)F(xm+1)\displaystyle\frac{1}{2}(\|x_{0}-x_{*}\|^{2}-\|x_{m+1}-x_{*}\|^{2})+F(x_{0})-F(x_{m+1})
\displaystyle\leq 12x0x2+F(x0)F(x).\displaystyle\frac{1}{2}\|x_{0}-x_{*}\|^{2}+F(x_{0})-F(x_{*}).

By taking m+m\to+\infty and using the fact that F(xk)F(x)F(x_{k})\geq F(x_{*}) and (34), we obtain that

βk=0[F(xk)F(x)]k=0βk[F(xk)F(x)]k=0βk[F(xk)F(x)]<+,\beta\sum_{k=0}^{\infty}\left[F(x_{\ell_{k}})-F(x_{*})\right]\leq\sum_{k=0}^{\infty}\beta_{\ell_{k}}\left[F(x_{\ell_{k}})-F(x_{*})\right]\leq\sum_{k=0}^{\infty}\beta_{k}\left[F(x_{k})-F(x_{*})\right]<+\infty,

which together with (34) establishes that \dstylimkF(xk)=F(x).\dsty\lim_{k\rightarrow\infty}\,F(x_{\ell_{k}})=F(x_{*}). Since FF is lower semicontinuous on 𝐝𝐨𝐦g{\bf dom\,}g, it follows from the last equality that

F(x)F(x¯)lim infkF(xk)=limkF(xk)=F(x),F(x_{*})\leq F(\bar{x})\leq\liminf_{k\rightarrow\infty}F(x_{\ell_{k}})=\lim_{k\rightarrow\infty}F(x_{\ell_{k}})=F(x_{*}),

which yields F(x¯)=F(x)F(\bar{x})=F(x_{*}) and thus x¯S\bar{x}\in S_{*}.

Case 2. \dstylimkβk=0\dsty\lim_{k\rightarrow\infty}\beta_{\ell_{k}}=0. Define \dstyβ^k:=βkθ>0\dsty\hat{\beta}_{k}:=\frac{\beta_{k}}{\theta}>0 and

x^k+1:=xk+β^k(ykxk)=(1β^k)xk+β^kyk,\hat{x}_{k+1}:=x_{k}+\hat{\beta}_{k}(y_{k}-x_{k})=(1-\hat{\beta}_{k})x_{k}+\hat{\beta}_{k}y_{k}, (35)

where yk=x~kvky_{k}=\tilde{x}_{k}-v_{k}. It follows from the definition of the linesearch that

f(x^k+1)>f(xk)+β^kf(xk),ykxk+τ2β^kxkx~k2+γ12β^kvk2+β^kγ2εk.f(\hat{x}_{k+1})>f(x_{k})+\hat{\beta}_{k}\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\frac{\tau}{2}\hat{\beta}_{k}\|x_{k}-\tilde{x}_{k}\|^{2}+\frac{\gamma_{1}}{2}\hat{\beta}_{k}\|v_{k}\|^{2}+\hat{\beta}_{k}\gamma_{2}\varepsilon_{k}. (36)

It follows from the convexity of ff, the fact that γ1>1>τ\gamma_{1}>1>\tau, and the positiveness of the term β^kγ2εk\hat{\beta}_{k}\gamma_{2}\varepsilon_{k} that

f(x^k+1),x^k+1xkf(x^k+1)f(xk)>β^kf(xk),ykxk+β^kτ2(xkx~k2+vk2)\displaystyle\langle\nabla f(\hat{x}_{k+1}),\hat{x}_{k+1}-x_{k}\rangle\geq f(\hat{x}_{k+1})-f(x_{k})>\hat{\beta}_{k}\langle\nabla f(x_{k}),y_{k}-x_{k}\rangle+\hat{\beta}_{k}\frac{\tau}{2}\left(\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}\right)

which, together with (35), yields

β^kτ2(xkx~k2+vk2)\displaystyle\hat{\beta}_{k}\frac{\tau}{2}\left(\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}\right) <β^kf(x^k+1)f(xk),ykxk\displaystyle<\hat{\beta}_{k}\langle\nabla f(\hat{x}_{k+1})-\nabla f(x_{k}),y_{k}-x_{k}\rangle
β^kf(x^k+1)f(xk)ykxk\displaystyle\leq\hat{\beta}_{k}\|\nabla f(\hat{x}_{k+1})-\nabla f(x_{k})\|\|y_{k}-x_{k}\|
=β^kf(x^k+1)f(xk)xkx~k+vk.\displaystyle=\hat{\beta}_{k}\|\nabla f(\hat{x}_{k+1})-\nabla f(x_{k})\|\|x_{k}-\tilde{x}_{k}+v_{k}\|.

Now it follows from Lemma 2 that

xkx~k+vk22(xkx~k2+vk2).\|x_{k}-\tilde{x}_{k}+v_{k}\|^{2}\leq 2\left(\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}\right).

Hence,

β^kτ2(xkx~k2+vk2)<β^kf(x^k+1)f(xk)2(xkx~k2+vk2)12,\hat{\beta}_{k}\frac{\tau}{2}\left(\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}\right)<\hat{\beta}_{k}\|\nabla f(\hat{x}_{k+1})-\nabla f(x_{k})\|\cdot\sqrt{2}\left(\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}\right)^{\frac{1}{2}},

which, due to the positiveness of xkx~k2+vk2\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}, yields

τ24(xkx~k2+vk2)12f(x^k+1)f(xk).\frac{\tau\sqrt{2}}{4}\left(\|x_{k}-\tilde{x}_{k}\|^{2}+\|v_{k}\|^{2}\right)^{\frac{1}{2}}\leq\|\nabla f(\hat{x}_{k+1})-\nabla f(x_{k})\|. (37)

Note that x^k+1xk=β^k(ykxk)=β^kβkxk+1xk=1θxk+1xk,\|\hat{x}_{k+1}-x_{k}\|=\|\hat{\beta}_{k}(y_{k}-x_{k})\|=\frac{\hat{\beta}_{k}}{\beta_{k}}\|x_{k+1}-x_{k}\|=\frac{1}{\theta}\|x_{k+1}-x_{k}\|, which, combined with the last statement of Proposition 3, gives us that x^k+1xk0\|\hat{x}_{\ell_{k}+1}-x_{\ell_{k}}\|\to 0 as k+k\to+\infty. Since f\nabla f is continuous, we have f(x^k+1)f(xk)0\|\nabla f(\hat{x}_{\ell_{k}+1})-\nabla f(x_{\ell_{k}})\|\to 0 as k+k\to+\infty. Hence, from (37), we derive that

limk+xkx~k2+vk20,\lim_{k\rightarrow+\infty}\,\|x_{\ell_{k}}-\tilde{x}_{\ell_{k}}\|^{2}+\|v_{\ell_{k}}\|^{2}\leq 0, (38)

therefore, limk+xkx~k=0\lim_{k\rightarrow+\infty}\,\|x_{\ell_{k}}-\tilde{x}_{\ell_{k}}\|=0 and limk+vk=0\lim_{k\rightarrow+\infty}\|v_{\ell_{k}}\|=0. Additionally, we can use (11) to show that limk+εk=0\lim_{k\rightarrow+\infty}\varepsilon_{\ell_{k}}=0. Thus, x¯\bar{x} is also an accumulation point of the sequence (x~k)k(\tilde{x}_{k})_{k\in\mathbb{N}}, and without loss of generality, we can assume that x~kx¯\tilde{x}_{\ell_{k}}\to\bar{x} as k+k\to+\infty. Moreover, we have that

limk+f(xk)f(x~k)=0.\lim_{k\rightarrow+\infty}\|\nabla f(x_{\ell_{k}})-\nabla f(\tilde{x}_{\ell_{k}})\|=0. (39)

Now, using (10), we obtain

wk:=xkx~k+vk+f(x~k)f(xk)f(x~k)+εkg(x~k)εkF(x~k).w_{k}:=x_{k}-\tilde{x}_{k}+v_{k}+\nabla f(\tilde{x}_{k})-\nabla f(x_{k})\in\nabla f(\tilde{x}_{k})+\partial_{\varepsilon_{k}}g(\tilde{x}_{k})\subseteq\partial_{\varepsilon_{k}}F(\tilde{x}_{k}).

Furthermore, since vkv_{\ell_{k}} converges to 0 as indicated by (38), and by applying the triangular inequality, we obtain

wk=xkx~k+vk+f(x~k)f(xk)xkx~k+vk+f(x~k)f(xk),\|w_{k}\|=\|x_{k}-\tilde{x}_{k}+v_{k}+\nabla f(\tilde{x}_{k})-\nabla f(x_{k})\|\leq\|x_{k}-\tilde{x}_{k}\|+\|v_{k}\|+\|\nabla f(\tilde{x}_{k})-\nabla f(x_{k})\|,

which implies, via (38) and (39), that wkεkF(x~k)w_{\ell_{k}}\in\partial_{\varepsilon_{\ell_{k}}}F(\tilde{x}_{\ell_{k}}) also converges to 0. Consequently, the convergence of x~k\tilde{x}_{\ell_{k}} to x¯\bar{x} and εk\varepsilon_{\ell_{k}} to 0, combined with the closedness of the graph of F\partial F in Proposition 1, gives us that 0F(x¯)0\in\partial F(\bar{x}). This is equivalent to stating that x¯S\bar{x}\in S_{*}.

In all the cases considered above, we have shown that x¯\bar{x}, an accumulation point of the sequence (xk)k(x_{k})_{k\in\mathbb{N}}, belongs to SS_{*}. Hence, Lemma 1(b) implies that (xk)k(x_{k})_{k\in\mathbb{N}} converges to an optimal solution in SS_{*}. ∎

5 Numerical Experiments

In this section, we investigate the numerical behavior of the IPG-ELS method in solving the CUR-like factorization optimization problem JMLR:v12:mairal11a . Consider 𝔼=n×m\mathbbm{E}=\mathbbm{R}^{n\times m}. Given a matrix Wm×nW\in\mathbbm{R}^{m\times n}, the objective is to find a matrix Xn×mX\in\mathbbm{R}^{n\times m} with sparse rows and columns such that WXWWXW approximates WW. This problem can be formulated as the following splitting optimization problem:

minXn×m12WWXWF2+λrowi=1nX(i)2+λcolj=1mX(j)2,\min_{X\in\mathbbm{R}^{n\times m}}\frac{1}{2}\|W-WXW\|_{F}^{2}+\lambda_{\text{row}}\sum_{i=1}^{n}\|X^{(i)}\|_{2}+\lambda_{\text{col}}\sum_{j=1}^{m}\|X_{(j)}\|_{2},

where F\|\cdot\|_{F} denotes the Frobenius norm, and X(i)X^{(i)} and X(j)X_{(j)} denote the ii-th row and jj-th column of XX, respectively. This problem is a special case of problem (1) with

f(X):=12WWXWF2,g(X):=λrowi=1nX(i)2+λcolj=1mX(j)2.f(X):=\frac{1}{2}\|W-WXW\|_{F}^{2},\quad g(X):=\lambda_{\text{row}}\sum_{i=1}^{n}\|X^{(i)}\|_{2}+\lambda_{\text{col}}\sum_{j=1}^{m}\|X_{(j)}\|_{2}.

In this case, the gradient of ff is given by \nabla f(X)=-W^{T}(W-WXW)W^{T} and is Lipschitz continuous with constant L=WTW2L=\|W^{T}W\|^{2}. However, the proximal operator of gg does not have a closed-form solution. Following the approach in Schmidt:2011 (see also Barre:2022 ; Zhou:2022 ), a triple (X~k,Vk,εk)(\tilde{X}_{k},V_{k},\varepsilon_{k}) satisfying (10)-(11) can be obtained using the Dykstra-like algorithm (Bauschke:2008c, Theorem 3.3) applied to the proximal subproblem

minXn×m12XZF2+λrowi=1nX(i)2+λcolj=1mX(j)2,\min_{X\in\mathbbm{R}^{n\times m}}\frac{1}{2}\|X-Z\|^{2}_{F}+{\lambda_{\rm row}}\sum_{i=1}^{n}\|X^{(i)}\|_{2}+{\lambda_{\rm col}}\sum_{j=1}^{m}\|X_{(j)}\|_{2}, (40)

where Z:=X_{k}-\nabla f(X_{k}). Starting from the initial points Z_{0}=Z, P_{0}=0, Q_{0}=0, the Dykstra-like algorithm generates the sequences

{Y=𝐩𝐫𝐨𝐱g1(Z+P)P+1=Z+PYand{Z+1=𝐩𝐫𝐨𝐱g2(Y+Q)Q+1=Y+QZ+1,\left\{\begin{array}[]{l}Y_{\ell}={\bf prox}_{g_{1}}(Z_{\ell}+P_{\ell})\\ P_{\ell+1}=Z_{\ell}+P_{\ell}-Y_{\ell}\end{array}\right.\quad\mbox{and}\quad\left\{\begin{array}[]{l}Z_{\ell+1}={\bf prox}_{g_{2}}(Y_{\ell}+Q_{\ell})\\ Q_{\ell+1}=Y_{\ell}+Q_{\ell}-Z_{\ell+1},\end{array}\right. (41)

where g1(X)=λcolj=1mX(j)2g_{1}(X)=\lambda_{\rm col}\sum_{j=1}^{m}\|X_{(j)}\|_{2} and g2(X)=λrowi=1nX(i)2g_{2}(X)=\lambda_{\rm row}\sum_{i=1}^{n}\|X^{(i)}\|_{2}. Recall that:

𝐩𝐫𝐨𝐱g1(X)(j)=max{1λcolX(j)2,0}X(j),for j=1,,m{\bf prox}_{g_{1}}(X)_{(j)}=\max\left\{1-\frac{\lambda_{\text{col}}}{\|X_{(j)}\|_{2}},0\right\}X_{(j)},\;\text{for }\quad j=1,\ldots,m

and

𝐩𝐫𝐨𝐱g2(X)(i)=max{1λrowX(i)2,0}X(i),for i=1,,n.{\bf prox}_{g_{2}}(X)^{(i)}=\max\left\{1-\frac{\lambda_{\text{row}}}{\|X^{(i)}\|_{2}},0\right\}X^{(i)},\;\text{for }\quad i=1,\ldots,n.
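A direct NumPy transcription of these two formulas (a minimal sketch, with a small guard against division by zero for all-zero rows or columns; the function names are ours) is:

```python
import numpy as np

def prox_col_group(X, lam_col):
    """Prox of g1(X) = lam_col * sum_j ||X_(j)||_2: shrink each column of X."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)                  # column norms, shape (1, m)
    scale = np.maximum(1.0 - lam_col / np.maximum(norms, 1e-16), 0.0)
    return scale * X

def prox_row_group(X, lam_row):
    """Prox of g2(X) = lam_row * sum_i ||X^(i)||_2: shrink each row of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)                  # row norms, shape (n, 1)
    scale = np.maximum(1.0 - lam_row / np.maximum(norms, 1e-16), 0.0)
    return scale * X
```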

Using now the definition of YY_{\ell}, as presented in (41), we have that Z+PYg1(Y)Z_{\ell}+P_{\ell}-Y_{\ell}\in\partial g_{1}(Y_{\ell}). This observation, combined with Proposition 2, yields

Z+PYϵg1(Z+1),whereϵ=g1(Z+1)g1(Y)Z+PY,Z+1Y0.Z_{\ell}+P_{\ell}-Y_{\ell}\in\partial_{\epsilon_{\ell}}g_{1}(Z_{\ell+1}),\quad\mbox{where}\quad\epsilon_{\ell}=g_{1}(Z_{\ell+1})-g_{1}(Y_{\ell})-\langle{Z_{\ell}+P_{\ell}-Y_{\ell}},{Z_{\ell+1}-Y_{\ell}}\rangle\geq 0.

On the other hand, it follows from the definition of Z+1Z_{\ell+1} in (41) that

Y+QZ+1g2(Z+1).Y_{\ell}+Q_{\ell}-Z_{\ell+1}\in\partial g_{2}(Z_{\ell+1}).

Thus, it follows from the last two inclusions and the definition of ZZ that

V_{\ell}:=Z_{\ell}+P_{\ell}+Q_{\ell}-Z\in\partial_{\epsilon_{\ell}}g(Z_{\ell+1})+Z_{\ell+1}-X_{k}+\nabla f(X_{k}).

Therefore, if the Dykstra-like algorithm is stopped when

g(Z_{\ell}-V_{\ell})-g(Z_{\ell})-\langle\nabla f(X_{k}),V_{\ell}\rangle+\frac{(1+\gamma_{1})}{2}\|V_{\ell}\|^{2}+(1+\gamma_{2})\varepsilon_{\ell}\leq\frac{(1-\tau-\alpha)}{2}\|X_{k}-Z_{\ell}\|^{2},

then the triple (X~k,Vk,εk):=(Z,V,ϵ)(\tilde{X}_{k},V_{k},\varepsilon_{k}):=(Z_{\ell},V_{\ell},\epsilon_{\ell}) satisfies (10)-(11).
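Building on the prox_col_group and prox_row_group sketches above, a minimal sketch of this inner solver could read as follows. For simplicity, the relative-error test and the returned point are evaluated at the updated iterate Z_{\ell+1}, the point at which the inclusion above is stated; this indexing choice, the gradient formula, and the iteration cap are ours.

```python
import numpy as np

def frob_inner(A, B):
    """Frobenius inner product <A, B>_F."""
    return float(np.sum(A * B))

def inexact_prox_dykstra(X_k, W, lam_row, lam_col,
                         tau=0.8, alpha=0.01, gamma1=1.1, gamma2=1.1, max_inner=5000):
    """Dykstra-like inner solver sketch for subproblem (40), returning a candidate
    triple (X_tilde, V, eps) for the relative criterion (10)-(11)."""
    g1 = lambda M: lam_col * np.linalg.norm(M, axis=0).sum()        # column-group term of g
    g_val = lambda M: lam_row * np.linalg.norm(M, axis=1).sum() + g1(M)
    grad_fX = W.T @ (W @ X_k @ W - W) @ W.T                         # gradient of f at X_k
    Z_target = X_k - grad_fX                                        # the point Z in (40)
    Z, P, Q = Z_target.copy(), np.zeros_like(X_k), np.zeros_like(X_k)
    X_t, V, eps = Z, np.zeros_like(X_k), 0.0
    for _ in range(max_inner):
        Y = prox_col_group(Z + P, lam_col)                          # Y_l = prox_{g1}(Z_l + P_l)
        P_new = Z + P - Y                                           # P_{l+1}
        Z_new = prox_row_group(Y + Q, lam_row)                      # Z_{l+1} = prox_{g2}(Y_l + Q_l)
        Q_new = Y + Q - Z_new                                       # Q_{l+1}
        eps = g1(Z_new) - g1(Y) - frob_inner(Z + P - Y, Z_new - Y)  # transportation residual (>= 0)
        V = Z + P + Q - Z_target                                    # candidate residual V_l
        X_t = Z_new                                                 # candidate x_tilde_k
        lhs = (g_val(X_t - V) - g_val(X_t) - frob_inner(grad_fX, V)
               + 0.5 * (1 + gamma1) * frob_inner(V, V) + (1 + gamma2) * eps)
        if lhs <= 0.5 * (1 - tau - alpha) * frob_inner(X_k - X_t, X_k - X_t):
            break                                                   # relative test of type (11) holds
        Z, P, Q = Z_new, P_new, Q_new
    return X_t, V, eps
```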

Considering that the IPG-ELS method integrates two effective strategies: (i) permitting the subproblem to be solved inexactly to meet a relative approximation criterion, and (ii) employing an explicit linesearch procedure that computes the proximal operator only once per iteration, our primary goal is to demonstrate that, in certain cases, the proposed method surpasses the proximal gradient method that employs only one of these strategies. Consequently, we compare the new algorithm with two alternative schemes: an exact version of the IPG-ELS method, denoted by PG-ELS, which corresponds to IPG-ELS with γ1=γ2=0\gamma_{1}=\gamma_{2}=0, θ=0.5\theta=0.5, τ=1\tau=1, and max{Vk,ϵk}1012\max\{\|V_{k}\|,\epsilon_{k}\}\leq 10^{-12}, replacing the inexact criterion in (11); and an IPG method with a fixed stepsize, corresponding to (Millan:2019, Algorithm 2) with αk=1/L\alpha_{k}=1/L and σ=0.9\sigma=0.9, where LL is the Lipschitz constant of f\nabla f. This algorithm is denoted by IPG-FixStep and is defined as

Xk+1=X~kVk,k0,X_{k+1}=\tilde{X}_{k}-V_{k},\quad\forall k\geq 0,

where X~k\tilde{X}_{k} and VkV_{k} satisfy

LVkεkg(X~k)+L(X~kXk)+f(Xk),Vk2+2εkL0.9XkX~k2.LV_{k}\in\partial_{\varepsilon_{k}}g(\tilde{X}_{k})+L(\tilde{X}_{k}-X_{k})+\nabla f(X_{k}),\quad\|V_{k}\|^{2}+\frac{2\varepsilon_{k}}{L}\leq 0.9\|X_{k}-\tilde{X}_{k}\|^{2}.

The initialization parameters for the IPG-ELS method were set as τ=0.8\tau=0.8, θ=0.5\theta=0.5, γ1=γ2=1.1\gamma_{1}=\gamma_{2}=1.1, and α=0.01\alpha=0.01. For all tests, we initialized X0=0n×mX_{0}=0\in\mathbbm{R}^{n\times m}, and set λrow=λcol=0.01\lambda_{\text{row}}=\lambda_{\text{col}}=0.01. The IPG-ELS method was executed for 101101 outer iterations to establish a baseline performance metric, F:=F(X101)F_{*}:=F(X_{101}). The other two algorithms were terminated as soon as F(xk)FF(x_{k})\leq F_{*}. The algorithms were evaluated on six datasets from Cano:2005 ; Guyon:2004 ; Team:2008 : Colon tumor (62×200062\times 2000), heart disease (303×14303\times 14), central nervous system (CNS) (60×712960\times 7129), lung cancer-Michigan (96×712996\times 7129), Secom (1567×5901567\times 590), and Cina0 (132×16033132\times 16033).

Each matrix WW was normalized to have a unit Frobenius norm, with an additional step of centering each column. Subsequently, the resulting matrices were multiplied by a constant, which plays a crucial role in controlling the Lipschitz constant of the gradient of ff. The experiments were conducted in Python on a machine equipped with a 3.5 GHz Dual-Core Intel Core i5 processor and 16 GB of 2400 MHz DDR4 memory.
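A possible preprocessing sketch for this setup is given below; the ordering of the centering and normalization steps and the scaling knob scale are assumptions made for illustration, and LL is computed as \|W^{T}W\|^{2} assuming the spectral norm.

```python
import numpy as np

def preprocess(W_raw, scale=1.0):
    """Center each column, normalize to unit Frobenius norm, then rescale by `scale`."""
    W = W_raw - W_raw.mean(axis=0, keepdims=True)  # center each column
    W = W / np.linalg.norm(W, 'fro')               # unit Frobenius norm
    W = scale * W                                  # larger scale -> larger Lipschitz constant
    L = np.linalg.norm(W.T @ W, 2) ** 2            # L = ||W^T W||^2 (spectral norm)
    return W, L
```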

In Tables 1 and 2, we report the Lipschitz constant of the gradient of ff (denoted as Lips), the final functional value F(Xk)F(X_{k}), the number of outer iterations (O-IT), the number of inner iterations (I-IT), the number of linesearch iterations (LS-IT), and the total running time in seconds (Time). The results indicate that, in terms of CPU times, the IPG-ELS method outperforms the other two methods. This efficiency can be attributed to two main factors: (i) the PG-ELS method requires a significantly higher number of inner iterations to solve the proximal subproblem “exactly”, and (ii) the IPG-FixStep method employs a small stepsize of 1/L1/L in the gradient step.

Problem Lips Method F(X_{k}) O-IT I-IT LS-IT Time
Colon Tumor 41.58 IPG-ELS 1.1056 101 195 318 28.88
PG-ELS 1.1056 104 751 324 36.19
IPG-FixStep 1.1056 1450 1451 - 171.72
Colon Tumor 665.32 IPG-ELS 2.3647 101 101 823 32.74
PG-ELS 2.3623 128 455 1038 44.02
IPG-FixStep 2.3647 1103 1104 - 97.72
Colon Tumor 5133.69 IPG-ELS 5.8989 101 101 1134 110.86
PG-ELS 5.8832 119 328 1324 116.26
IPG-FixStep 5.8981 586 587 - 138.92
Heart Disease 77.12 IPG-ELS 0.1732 101 178 539 1.46
PG-ELS 0.1732 119 876 603 4.28
IPG-FixStep 0.1732 487 488 - 3.52
Heart Disease 1233.99 IPG-ELS 0.3129 101 101 903 0.85
PG-ELS 0.3119 89 368 788 1.71
IPG-FixStep 0.3520 2001 2001 - 11.85
Heart Disease 9521.59 IPG-ELS 0.4995 101 101 1222 0.99
PG-ELS 0.4992 227 622 2690 5.46
IPG-FixStep 0.7242 2001 2001 - 43.92
CNS 41.95 IPG-ELS 0.9519 101 182 341 364.44
PG-ELS 0.9518 119 714 396 501.59
IPG-FixStep 0.9519 1217 1218 - 1675.05
CNS 671.17 IPG-ELS 2.1153 101 101 768 544.91
PG-ELS 2.1119 148 725 1110 878.39
IPG-FixStep 2.2193 2001 2001 - 2749.17
CNS 5178.80 IPG-ELS 6.0665 101 101 1130 1089.70
PG-ELS 6.0613 102 343 1138 1169.25
IPG-FixStep 6.0655 775 776 - 1677.93
Table 1: Performance of the IPG-ELS, PG-ELS and IPG-FixStep methods on 3 data sets.
Problem Lips Method F(X_{k}) O-IT I-IT LS-IT Time
Lung cancer 52.58 IPG-ELS 0.8985 101 179 443 422.07
PG-ELS 0.8984 105 837 457 521.46
IPG-FixStep 0.8985 678 679 - 962.80
Lung cancer 841.22 IPG-ELS 2.7632 101 101 845 591.92
PG-ELS 2.7623 131 768 1085 851.75
IPG-FixStep 2.7631 588 589 - 838.30
Lung cancer 2658.70 IPG-ELS 3.6391 101 101 992 696.61
PG-ELS 3.6378 161 740 1574 1173.05
IPG-FixStep 3.8819 2001 2001 - 2969.58
Secom 45.78 IPG-ELS 0.6438 101 175 373 66.64
PG-ELS 0.6438 99 9247 360 694.81
IPG-FixStep 0.6438 857 858 - 225.29
Secom 732.50 IPG-ELS 0.8587 101 101 779 82.81
PG-ELS 0.8586 102 4931 795 423.80
IPG-FixStep 0.8586 1662 1663 - 440.13
Secom 5652.06 IPG-ELS 1.6981 101 101 1138 104.54
PG-ELS 1.6899 125 1346 1388 215.08
IPG-FixStep 1.6977 801 802 - 211.93
Cina0 68.39 IPG-ELS 0.7972 101 245 487 3041.60
PG-ELS 0.7972 104 1490 483 5005.84
IPG-FixStep 0.7972 608 790 - 5530.44
Cina0 527.70 IPG-ELS 1.2817 101 250 838 3925.72
PG-ELS 1.2817 94 1789 693 5712.33
IPG-FixStep 1.2817 1010 1011 - 8741.06
Cina0 8443.20 IPG-ELS 3.6531 101 104 1126 3791.25
PG-ELS 3.6530 284 3083 3061 15338.18
IPG-FixStep 3.8493 2001 2001 - 17104.23
Table 2: Performance of the IPG-ELS, PG-ELS and IPG-FixStep methods on 3 data sets.

6 Conclusions

In this work, we present an inexact proximal gradient method for solving composite convex optimization problems. The method features a novel explicit linesearch that accepts a relative-error inexact solution of the proximal subproblem at each iteration. Our approach is primarily designed for splitting problems in which the objective function is the sum of a differentiable and a nondifferentiable convex function and the proximal operator cannot be computed analytically. Notably, the convergence of the proposed method is established without assuming Lipschitz continuity of the gradient of the smooth function. The method thus balances computational efficiency against the accuracy with which the proximal subproblem is solved, a common challenge in practice.

We established the convergence and iteration complexity of our method, supporting both its theoretical soundness and its practical utility. Numerical experiments demonstrate its applicability and efficiency: the method retains the expected convergence rates while efficiently handling relative-error inexact evaluations of the proximal operator. The numerical results indicate that the proposed method competes effectively with both the exact proximal gradient method and the inexact proximal gradient method with a fixed stepsize.

Acknowledgements.
YBC was partially supported by the NSF Grant DMS-2307328, and by an internal grant from NIU. MLNG was partially supported by CNPq Grants 405349/2021-1 and 304133/2021-3. JGM was partially supported by CNPq Grant 312223/2022-6.

References

  • (1) Adona, V. A., Gonçalves, M. L. N., and Melo, J. G. A Partially Inexact Proximal Alternating Direction Method of Multipliers and Its Iteration-Complexity Analysis. Journal of Optimization Theory and Applications 182, 2 (Aug. 2019), 640–666.
  • (2) Adona, V. A., Gonçalves, M. L. N., and Melo, J. G. An inexact proximal generalized alternating direction method of multipliers. Computational Optimization and Applications 76, 3 (July 2020), 621–647.
  • (3) Alves, M. M., Eckstein, J., Geremia, M., and Melo, J. G. Relative-error inertial-relaxed inexact versions of Douglas-Rachford and ADMM splitting algorithms. Computational Optimization and Applications 75, 2 (Mar. 2020), 389–422.
  • (4) Aujol, J.-F., and Dossal, Ch. Stability of over-relaxations for the forward-backward algorithm, application to FISTA. SIAM Journal on Optimization 25, 4 (2015), 2408–2433.
  • (5) Barré, M., Taylor, A., and Bach, F. A note on approximate accelerated forward-backward methods with absolute and relative errors, and possibly strongly convex objectives. Open Journal of Mathematical Optimization 3 (2022), 1–15.
  • (6) Bauschke, H. H., and Borwein, J. M. On Projection Algorithms for Solving Convex Feasibility Problems. SIAM Review 38, 3 (1996), 367–426.
  • (7) Bauschke, H. H., and Combettes, P. L. A Dykstra-like algorithm for two monotone operators. Pacific J. Optim. 4, 3 (2008), 382–391.
  • (8) Beck, A., and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2, 1 (2009), 183–202.
  • (9) Bello Cruz, J. Y. On Proximal Subgradient Splitting Method for Minimizing the sum of two Nonsmooth Convex Functions. Set-Valued and Variational Analysis 25, 2 (June 2017), 245–263.
  • (10) Bello Cruz, J. Y., and de Oliveira, W. On weak and strong convergence of the projected gradient method for convex optimization in real Hilbert spaces. Numerical Functional Analysis and Optimization 37, 2 (2016), 129–144.
  • (11) Bello Cruz, J. Y., and Nghia, T. T. A. On the convergence of the forward–backward splitting method with linesearches. Optimization Methods & Software 31, 6 (2016), 1209–1238.
  • (12) Bello-Cruz, Y., Gonçalves, M. L. N., and Krislock, N. On FISTA with a relative error rule. Computational Optimization and Applications 84, 2 (Mar. 2023), 295–318.
  • (13) Bello-Cruz, Y., Li, G., and Nghia, T. T. A. On the Linear Convergence of Forward–Backward Splitting Method: Part I—Convergence Analysis. Journal of Optimization Theory and Applications 188, 2 (Feb. 2021), 378–401.
  • (14) Bonettini, S., and Ruggiero, V. On the convergence of primal-dual hybrid gradient algorithms for total variation image restoration. Journal of Mathematical Imaging and Vision 44, 3 (2012), 236–253.
  • (15) Bredies, K. A forward-backward splitting algorithm for the minimization of non-smooth convex functionals in Banach space. Inverse Problems 25, 1 (2009), 015005, 20.
  • (16) Cano, A., Masegosa, A., and Moral, S. ELVIRA biomedical data set repository. https://leo.ugr.es/elvira/DBCRepository/, 2005.
  • (17) Combettes, P. L., and Wajs, V. R. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation 4, 4 (2005), 1168–1200.
  • (18) Eckstein, J., and Silva, P. J. S. A practical relative error criterion for augmented Lagrangians. Mathematical Programming 141, 1 (Oct. 2013), 319–348.
  • (19) Eckstein, J., and Yao, W. Approximate ADMM algorithms derived from Lagrangian splitting. Computational Optimization and Applications 68, 2 (Nov. 2017), 363–405.
  • (20) Eckstein, J., and Yao, W. Relative-error approximate versions of Douglas–Rachford splitting and special cases of the ADMM. Mathematical Programming 170, 2 (Aug. 2018), 417–444.
  • (21) Frank, M., and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3, 1-2 (1956), 95–110.
  • (22) Guyon, I. UCI machine learning repository, 2004.
  • (23) Hiriart-Urruty, J.-B., and Lemaréchal, C. Convex Analysis and Minimization Algorithms II, vol. 306 of Grundlehren Der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 1993.
  • (24) Jaggi, M. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Proceedings of the 30th International Conference on Machine Learning (Feb. 2013), PMLR, pp. 427–435.
  • (25) Jiang, K., Sun, D., and Toh, K.-c. An Inexact Accelerated Proximal Gradient Method for Large Scale Linearly Constrained Convex SDP. SIAM Journal on Optimization 22, 3 (Jan. 2012), 1042–1064.
  • (26) Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research 12, 81 (2011), 2681–2720.
  • (27) Millán, R. D., and Machado, M. P. Inexact proximal ε-subgradient methods for composite convex optimization problems. Journal of Global Optimization 75, 4 (Dec. 2019), 1029–1060.
  • (28) Monteiro, R. D. C., and Svaiter, B. F. On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean. SIAM Journal on Optimization 20, 6 (2010), 2755–2787.
  • (29) Salzo, S. The variable metric forward-backward splitting algorithm under mild differentiability assumptions. SIAM Journal on Optimization 27, 4 (2017), 2153–2181.
  • (30) Salzo, S., Masecchia, S., Verri, A., and Barla, A. Alternating proximal regularized dictionary learning. Neural Computation 26, 12 (Dec. 2014), 2855–2895.
  • (31) Salzo, S., and Villa, S. Inexact and accelerated proximal point algorithms. J. Convex Anal. 19, 4 (2012), 1167–1192.
  • (32) Schmidt, M., Roux, N. L., and Bach, F. Convergence rates of inexact proximal-gradient methods for convex optimization. In Proceedings of the 24th International Conference on Neural Information Processing Systems (Red Hook, NY, USA, Dec. 2011), NIPS’11, Curran Associates Inc., pp. 1458–1466.
  • (33) Schuster, T., Kaltenbacher, B., Hofmann, B., and Kazimierski, K. S. Regularization Methods in Banach Spaces. De Gruyter, July 2012.
  • (34) Solodov, M. V., and Svaiter, B. F. A hybrid approximate extragradient-proximal point algorithm using the enlargement of a maximal monotone operator. Set-Valued Analysis 7, 4 (1999), 323–345.
  • (35) Team, C. A marketing dataset. http://www.causality.inf.ethz.ch/data/CINA.html, 2008.
  • (36) Villa, S., Salzo, S., Baldassarre, L., and Verri, A. Accelerated and inexact forward-backward algorithms. SIAM Journal on Optimization 23, 3 (2013), 1607–1633.
  • (37) Zhong, L. W., and Kwok, J. T. Efficient Sparse Modeling With Automatic Feature Grouping. IEEE Transactions on Neural Networks and Learning Systems 23, 9 (Sept. 2012), 1436–1447.
  • (38) Zhou, Q., and Pan, S. J. Convergence Analysis of Linear Coupling with Inexact Proximal Operator. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence (Aug. 2022), PMLR, pp. 2394–2403.