
On the inexact scaled gradient projection method

O. P. Ferreira (This author was partially supported by FAPEG, CNPq Grants 302473/2017-3 and 408151/2016-1, and FAPEG/PRONEM-201710267000532; e-mail: [email protected]).    M. V. Lemes (e-mail: [email protected]).    L. F. Prudente (This author was partially supported by FAPEG/PRONEM-201710267000532; e-mail: [email protected]).
Abstract

The purpose of this paper is to present an inexact version of the scaled gradient projection method on a convex set, which is inexact in two senses. First, an inexact projection onto the feasible set is computed, allowing for an appropriate relative error tolerance. Second, an inexact non-monotone line search scheme is employed to compute a step size which defines the next iterate. It is shown that the proposed method has similar asymptotic convergence properties and iteration-complexity bounds as the usual scaled gradient projection method employing monotone line searches.

Keywords: Scaled gradient projection method, feasible inexact projection, constrained convex optimization.

AMS subject classification: 49J52, 49M15, 65H10, 90C30.

1 Introduction

This paper is devoted to the study of the scaled gradient projection (SGP) method with non-monotone line search to solve the general constrained convex optimization problem

\min\{f(x):~x\in C\}, (1)

where C is a closed and convex subset of \mathbb{R}^{n} and f:\mathbb{R}^{n}\to\mathbb{R} is a continuously differentiable function. Denote by f^{*}:=\inf_{x\in C}f(x) the optimal value of (1) and by \Omega^{*} its solution set, which we will assume to be nonempty unless the contrary is explicitly stated. Problem (1) is a basic problem of constrained optimization, which appears very often in various areas, including finance, machine learning, control theory, and signal processing; see for example [20, 21, 35, 46, 50, 61]. In recent problems considered in most of these areas, the datasets are large or high-dimensional and their solutions need to be approximated quickly with reasonable accuracy. It is well known that the SGP method with non-monotone line search is among the methods suitable for this task, as will be explained below.

The gradient projection method is what first comes to mind when we start from the ideas of the classic optimization methods in an attempt to deal with problem (1). In fact, this method is one of the oldest known optimization methods to solve (1); the study of its convergence properties goes back to the works of Goldstein [39] and Levitin and Polyak [49]. After these works, several variants of it have appeared over the years, resulting in a vast literature on the subject, including [10, 11, 12, 33, 35, 40, 47, 56, 67]. Additional references on this subject can be found in the recent review [17] and references therein. Among all the variants of the gradient projection method, the scaled version has been especially considered due to the flexibility provided in efficient implementations of the method; see [13, 5, 16, 18, 19]. In addition, its simplicity and easy implementation have attracted the attention of the scientific community working on optimization over the years. This method usually uses only first-order derivatives, which makes it very stable from a numerical point of view and therefore quite suitable for solving large-scale optimization problems, see [52, 53, 61, 62]. At each iteration, the SGP method moves along the direction of the negative scaled gradient, and then projects the obtained point onto the constraint set. The current iterate and such projection define a feasible descent direction, and a line search in this direction is performed to define the next iterate. It is worth mentioning that the performance of the SGP method is strongly related to each of the steps we have just mentioned. In fact, the scale matrix and the step size towards the negative scaled gradient are freely selected in order to improve the performance of the SGP method, but without increasing the cost of each iteration. Strategies for choosing both have their origin in the study of the gradient method for unconstrained optimization; papers addressing these issues include, but are not limited to, [7, 18, 26, 27, 29, 36, 69, 25, 49]. More details about selecting step sizes and scale matrices can be found in the recent review [17] and references therein.

In this paper, we are particularly interested in the main stages that make up the SGP method, namely, the projection calculation and the line search employed. It is well known that most of the computational burden of each iteration of the SGP method lies in the calculation of the projection. Indeed, the projection calculation requires, at each iteration, the solution of a quadratic problem restricted to the feasible set, which can lead to a substantial increase in the cost per iteration if the number of unknowns is large. For this reason, it may not be justified to carry out exact projections when the iterates are far from the solution of the problem. In order to reduce the computational effort spent on projections, inexact procedures that become more and more accurate when approaching the solution have been proposed, resulting in more efficient methods; see for example [13, 16, 38, 42, 60, 64, 57]. On the other hand, non-monotone line searches can improve the probability of finding a global optimal solution, in addition to potentially improving the speed of convergence of the method as a whole, see for example [24, 55, 63]. The concept of non-monotone line search, which we will use here as a synonym for inexact line search, was first proposed in [45], and later a new non-monotone search was proposed in [68]. After these papers, other non-monotone searches appeared, see for example [3, 51]. In [59], an interesting general framework for non-monotone line searches was proposed, and more recently modifications of it have been presented in [43, 44].

The purpose of the present paper is to present an inexact version of the SGP method, which is inexact in two senses. First, using a version of the scheme introduced in [13] and also a variation of the one that appeared in [64, Example 1], the inexact projection onto the feasible set is computed allowing an appropriate relative error tolerance. Second, using the inexact conceptual scheme for the line search introduced in [44, 59], a step size is computed to define the next iterate. More specifically, we initially show that the feasible inexact projection of [13] provides greater latitude than the projection of [64, Example 1]. In the first convergence result presented, we show that the SGP method using the projection proposed in [13] preserves the same partial convergence result as the classic method, that is, we prove that every accumulation point of the sequence generated by the SGP method is stationary for problem (1). Then, considering the inexact projection of [64, Example 1], and under mild assumptions, we establish full asymptotic convergence results and some complexity bounds. The presented analysis of the method is done using the general non-monotone line search scheme introduced in [44]. In this way, the proposed method can be adapted to several line searches and, in particular, allows obtaining several known versions of the SGP method as particular instances, including [10, 13, 47, 66]. Except for the particular case in which we assume that the SGP method employs the non-monotone line search introduced by [45], all other asymptotic convergence and complexity results are obtained without any assumption of compactness of the sub-level sets of the objective function. Finally, it is worth mentioning that the complexity results obtained for the SGP method with a general non-monotone line search are the same as in the classic case when the usual Armijo search is employed, namely, the complexity bound \mathcal{O}(1/\sqrt{k}) is unveiled for finding \epsilon-stationary points for problem (1) and, under convexity of f, the rate to find an \epsilon-optimal functional value is \mathcal{O}(1/k).

In Section 2, some notation and basic results used throughout the paper are presented. In particular, Section 2.1 is devoted to recalling the concept of relative feasible inexact projection, and some new properties of this concept are presented. Section 3 describes the SGP method with a general non-monotone line search, and some particular instances of it are presented. Partial asymptotic convergence results are presented in Section 4. Section 5 presents a full convergence result and iteration-complexity bounds. Some numerical experiments are provided in Section 6. Finally, some concluding remarks are made in Section 7.

2 Preliminaries and basic results

In this section, we introduce some notation and results used throughout our presentation. First we consider the index set \mathbb{N}:=\{0,1,2,\ldots\}, the usual inner product \langle\cdot,\cdot\rangle in \mathbb{R}^{n}, and the associated Euclidean norm \|\cdot\|. Let f:\mathbb{R}^{n}\to\mathbb{R} be a differentiable function and C\subseteq\mathbb{R}^{n}. The gradient \nabla f of f is said to be Lipschitz continuous in C with constant L>0 if \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|, for all x,y\in C. Combining this definition with the fundamental theorem of calculus, we obtain the following result, whose proof can be found in [12, Proposition A.24].

Lemma 1.

Let f:\mathbb{R}^{n}\to\mathbb{R} be a differentiable function and C\subseteq\mathbb{R}^{n}. Assume that \nabla f is Lipschitz continuous in C with constant L>0. Then, f(y)-f(x)-\langle\nabla f(x),y-x\rangle\leq(L/2)\|x-y\|^{2}, for all x,y\in C.

Assume that C is a convex set. The function f is said to be convex on C if f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle, for all x,y\in C. We recall that a point \bar{x}\in C is a stationary point for problem (1) if

\langle\nabla f(\bar{x}),x-\bar{x}\rangle\geq 0,\qquad\forall~x\in C. (2)

Consequently, if f is a convex function on C, then (2) implies that \bar{x}\in\Omega^{*}. We end this section with some useful concepts for the analysis of the sequence generated by the scaled gradient method; for more details, see [23]. For that, let D be an n\times n positive definite matrix and \|\cdot\|_{D}:\mathbb{R}^{n}\rightarrow\mathbb{R} be the norm defined by

\|d\|_{D}:=\sqrt{\left\langle Dd,d\right\rangle},\quad\forall d\in\mathbb{R}^{n}. (3)

For a fixed constant \mu\geq 1, denote by {\cal D}_{\mu} the set of n\times n symmetric positive definite matrices with all eigenvalues contained in the interval [\frac{1}{\mu},\mu]. The set {\cal D}_{\mu} is compact. Moreover, for each D\in{\cal D}_{\mu}, it follows that D^{-1} also belongs to {\cal D}_{\mu}. Furthermore, for D\in{\cal D}_{\mu}, by (3), we obtain

\frac{1}{\mu}\|d\|^{2}\leq\|d\|^{2}_{D}\leq\mu\|d\|^{2},\qquad\forall d\in\mathbb{R}^{n}. (4)
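To make inequality (4) concrete, the snippet below builds a random matrix in {\cal D}_{\mu} and verifies the two-sided bound numerically. It is a minimal illustration in Python; the construction of D and all names are our own assumptions, not part of the paper.

```python
# Numerical check of (4): for D in D_mu (symmetric positive definite,
# eigenvalues in [1/mu, mu]), the D-norm squared is sandwiched between
# (1/mu)*||d||^2 and mu*||d||^2.
import numpy as np

rng = np.random.default_rng(0)
mu, n = 4.0, 5

# Build D in D_mu: random orthogonal Q and eigenvalues drawn from [1/mu, mu].
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigvals = rng.uniform(1.0 / mu, mu, size=n)
D = Q @ np.diag(eigvals) @ Q.T

d = rng.standard_normal(n)
norm_D_sq = d @ D @ d   # ||d||_D^2 = <Dd, d>, cf. (3)
norm_sq = d @ d         # Euclidean ||d||^2

assert norm_sq / mu <= norm_D_sq <= mu * norm_sq
```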
Definition 1.

Let (y^{k})_{k\in\mathbb{N}} be a sequence in \mathbb{R}^{n} and (D_{k})_{k\in\mathbb{N}} be a sequence in {\cal D}_{\mu}. The sequence (y^{k})_{k\in\mathbb{N}} is said to be quasi-Fejér convergent to a set W\subset\mathbb{R}^{n} with respect to (D_{k})_{k\in\mathbb{N}} if, for all w\in W, there exists a sequence (\epsilon_{k})_{k\in\mathbb{N}}\subset\mathbb{R} such that \epsilon_{k}\geq 0, \sum_{k\in\mathbb{N}}\epsilon_{k}<\infty, and \|y^{k+1}-w\|_{D_{k+1}}^{2}\leq\|y^{k}-w\|_{D_{k}}^{2}+\epsilon_{k}, for all k\in\mathbb{N}.

The main property of a quasi-Fejér convergent sequence is stated in the next result. Its proof can be found in [23] but, for the sake of completeness, we include it here.

Theorem 2.

Let (y^{k})_{k\in\mathbb{N}} be a sequence in \mathbb{R}^{n} and (D_{k})_{k\in\mathbb{N}} be a sequence in {\cal D}_{\mu}. If (y^{k})_{k\in\mathbb{N}} is quasi-Fejér convergent to a nonempty set W\subset\mathbb{R}^{n} with respect to (D_{k})_{k\in\mathbb{N}}, then (y^{k})_{k\in\mathbb{N}} is bounded. Furthermore, if a cluster point \bar{y} of (y^{k})_{k\in\mathbb{N}} belongs to W, then \lim_{k\rightarrow\infty}y^{k}=\bar{y}.

Proof.

Take w\in W. Definition 1 implies that \|y^{k}-w\|_{D_{k}}^{2}\leq\|y^{0}-w\|_{D_{0}}^{2}+\sum_{k\in\mathbb{N}}\epsilon_{k}<+\infty, for all k\in\mathbb{N}. Thus, using the first inequality in (4), we conclude that \|y^{k}-w\|\leq\sqrt{\mu}\|y^{k}-w\|_{D_{k}}, for all k\in\mathbb{N}. Therefore, combining the two previous inequalities, we conclude that (y^{k})_{k\in\mathbb{N}} is bounded. Let \bar{y}\in W be a cluster point of (y^{k})_{k\in\mathbb{N}} and (y^{k_{j}})_{j\in\mathbb{N}} be a subsequence of (y^{k})_{k\in\mathbb{N}} such that \lim_{j\to+\infty}y^{k_{j}}=\bar{y}. Take \delta>0. Since \lim_{j\to+\infty}y^{k_{j}}=\bar{y} and \sum_{k\in\mathbb{N}}\epsilon_{k}<\infty, there exists j_{0} such that \sum_{\ell\geq k_{j_{0}}}\epsilon_{\ell}<\delta/(2\mu), and j_{1}>j_{0} such that \|y^{k_{j}}-\bar{y}\|\leq\sqrt{\delta/(2\mu^{2})}, for all j\geq j_{1}. Hence, using the first inequality in (4) and taking into account that \|y^{k+1}-\bar{y}\|_{D_{k+1}}^{2}\leq\|y^{k}-\bar{y}\|_{D_{k}}^{2}+\epsilon_{k}, for all k\in\mathbb{N}, we have \|y^{k}-\bar{y}\|^{2}\leq\mu\|y^{k}-\bar{y}\|_{D_{k}}^{2}\leq\mu(\|y^{k_{j}}-\bar{y}\|_{D_{k_{j}}}^{2}+\sum_{\ell=k_{j}}^{k-1}\epsilon_{\ell}), for all k\geq k_{j} with j\geq j_{1}. Hence, using the second inequality in (4), we conclude that \|y^{k}-\bar{y}\|^{2}\leq\mu(\mu\|y^{k_{j}}-\bar{y}\|^{2}+\sum_{\ell=k_{j}}^{k-1}\epsilon_{\ell})<\mu(\frac{\delta}{2\mu}+\frac{\delta}{2\mu})=\delta, for all k\geq k_{j_{1}}. Therefore, \lim_{k\rightarrow\infty}y^{k}=\bar{y}. ∎

2.1 Relative feasible inexact projections

In this section, we recall two concepts of relative feasible inexact projection onto a closed and convex set, and also present some new properties of them which will be used throughout the paper. These concepts of inexact projection were introduced seeking to make the subproblem of computing the projection onto the feasible set more efficient; see for example [13, 60, 64]. Before presenting the inexact projection concepts that we will use, let us first recall the concept of exact projection with respect to a given norm. For that, throughout this section D\in{\cal D}_{\mu}. The exact projection of the point v\in\mathbb{R}^{n} onto C with respect to the norm \|\cdot\|_{D}, denoted by {\cal P}_{C}^{D}(v), is defined by

{\cal P}_{C}^{D}(v):=\arg\min_{z\in C}\|z-v\|^{2}_{D}. (5)
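Computationally, the exact projection (5) is a small quadratic program. The sketch below, which is only an illustration and not part of the paper, computes {\cal P}_{C}^{D}(v) when C is the unit simplex (an assumed choice of C) using SciPy's general-purpose SLSQP solver; in practice one would exploit the structure of C.

```python
# Minimal sketch: exact projection (5) onto the unit simplex in the D-norm,
# solved as a QP with SciPy's SLSQP.  The simplex is an assumed example of C.
import numpy as np
from scipy.optimize import minimize

def exact_projection_simplex(v, D):
    """Return argmin_{z >= 0, sum(z) = 1} ||z - v||_D^2."""
    n = v.size
    obj = lambda z: (z - v) @ D @ (z - v)          # ||z - v||_D^2
    jac = lambda z: 2.0 * D @ (z - v)
    cons = ({'type': 'eq', 'fun': lambda z: z.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n, constraints=cons)
    return res.x
```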

The next result characterizes the exact projection; its proof can be found in [8, Theorem 3.14].

Lemma 3.

Let v,w\in\mathbb{R}^{n}. Then, w={\cal P}_{C}^{D}(v) if and only if w\in C and \left\langle D(v-w),y-w\right\rangle\leq 0, for all y\in C.

Remark 1.

It follows from Lemma 3 that \|{\cal P}_{C}^{D}(v)-{\cal P}_{C}^{D}(u)\|_{D}\leq\|v-u\|_{D}. Moreover, since D\in{\cal D}_{\mu}, by (4), we conclude that {\cal P}_{C}^{D}(\cdot) is Lipschitz continuous with constant L=\mu. Furthermore, if (D_{k})_{k\in\mathbb{N}}\subset{\cal D}_{\mu}, \lim_{k\to+\infty}z^{k}=\bar{z}, and \lim_{k\to+\infty}D_{k}=\bar{D}, then \lim_{k\to+\infty}{\cal P}_{C}^{D_{k}}(z^{k})={\cal P}_{C}^{\bar{D}}(\bar{z}); see [23, Proposition 4.2].

In the following, we recall the concept of a feasible inexact projection with respect to D\|\cdot\|_{D} relative to a fixed point.

Definition 2.

The feasible inexact projection mapping, with respect to the norm \|\cdot\|_{D}, onto C relative to a point u\in C and forcing parameter \zeta\in(0,1], denoted by {\cal P}_{C,\zeta}^{D}(u,\cdot):\mathbb{R}^{n}\rightrightarrows C, is the set-valued mapping defined as follows:

{\cal P}_{C,\zeta}^{D}(u,v):=\left\{w\in C:~\|w-v\|_{D}^{2}\leq\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}\right\}. (6)

Each point w\in{\cal P}_{C,\zeta}^{D}(u,v) is called a feasible inexact projection, with respect to the norm \|\cdot\|_{D}, of v onto C relative to u with forcing parameter \zeta\in(0,1].
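Condition (6) is straightforward to test once the exact projection is available. The following check is a sketch with illustrative names only; it verifies the inclusion w\in{\cal P}_{C,\zeta}^{D}(u,v) of Definition 2, while feasibility of w in C must be verified separately.

```python
import numpy as np

def norm_D_sq(x, D):
    """||x||_D^2 = <Dx, x>, cf. (3)."""
    return x @ D @ x

def is_feasible_inexact_projection(w, u, v, D, zeta, exact_proj):
    """Test condition (6):
    ||w - v||_D^2 <= zeta*||P_C^D(v) - v||_D^2 + (1 - zeta)*||u - v||_D^2.
    `exact_proj(v, D)` must return the exact projection P_C^D(v)."""
    p = exact_proj(v, D)
    lhs = norm_D_sq(w - v, D)
    rhs = zeta * norm_D_sq(p - v, D) + (1.0 - zeta) * norm_D_sq(u - v, D)
    return lhs <= rhs
```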

In the following, we show that the definition given above is nothing more than a reformulation of the concept of relative feasible inexact projection with respect to \|\cdot\|_{D} introduced in [13].

Remark 2.

Let u\in C, v\in\mathbb{R}^{n} and D be an n\times n positive definite matrix. Consider the quadratic function Q:\mathbb{R}^{n}\to\mathbb{R} defined by Q(z):=(1/2)\left\langle D(z-u),z-u\right\rangle+\left\langle D(u-v),z-u\right\rangle. Thus, letting \|\cdot\|_{D} be the norm defined by (3), some algebraic manipulations show that

\|z-v\|^{2}_{D}=2Q(z)+\|u-v\|^{2}_{D}. (7)

Hence, (7) and (5) imply that {\cal P}_{C}^{D}(v)=\arg\min_{z\in C}Q(z). Let \zeta\in(0,1]. Thus, by using (7), after some calculations, we can see that the following inexactness condition introduced in [13],

w\in C,\qquad Q(w)\leq\zeta Q({\cal P}_{C}^{D}(v)),

is equivalent to finding w\in C such that \|w-v\|_{D}^{2}\leq\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}, which corresponds to condition (6) in Definition 2.

The concept of feasible inexact projection in Definition 2 provides more latitude than the usual concept of exact projection (5). The next remark makes this more precise.

Remark 3.

Let \zeta\in(0,1] be a forcing parameter, and let C\subset\mathbb{R}^{n} and u\in C be as in Definition 2. First of all, note that {\cal P}_{C}^{D}(v)\in{\cal P}_{C,\zeta}^{D}(u,v). Therefore, {\cal P}_{C,\zeta}^{D}(u,v)\neq\varnothing, for all u\in C and v\in\mathbb{R}^{n}. Consequently, the set-valued mapping {\cal P}_{C,\zeta}^{D}(u,\cdot) as stated in (6) is well-defined. Moreover, for \zeta=1, we have {\cal P}_{C,1}^{D}(u,v)=\{{\cal P}_{C}^{D}(v)\}. In addition, if \underline{\zeta} and \bar{\zeta} are forcing parameters such that 0<\underline{\zeta}\leq\bar{\zeta}\leq 1, then {\cal P}_{C,\bar{\zeta}}^{D}(u,v)\subset{\cal P}_{C,\underline{\zeta}}^{D}(u,v).

Lemma 4.

Let v\in\mathbb{R}^{n}, u\in C and w\in{\cal P}_{C,\zeta}^{D}(u,v). Then, there holds

\left\langle D(v-w),y-w\right\rangle\leq\frac{1}{2}\|w-y\|_{D}^{2}+\frac{1}{2}\left[\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}-\|y-v\|_{D}^{2}\right],\qquad\forall y\in C.
Proof.

Let y\in C. Since 2\langle D(v-w),y-w\rangle=\|w-y\|_{D}^{2}+\|w-v\|_{D}^{2}-\|v-y\|_{D}^{2}, using (6) we have 2\langle D(v-w),y-w\rangle\leq\|w-y\|_{D}^{2}+\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}-\|v-y\|_{D}^{2}, which is equivalent to the desired inequality. ∎

Next, we recall a second concept of relative feasible inexact projection onto a closed convex set; see [2, 28]. The definition is as follows.

Definition 3.

The feasible inexact projection mapping, with respect to the norm \|\cdot\|_{D}, onto C relative to u\in C and forcing parameter \gamma\geq 0, denoted by {\cal R}_{C,\gamma}^{D}(u,\cdot):\mathbb{R}^{n}\rightrightarrows C, is the set-valued mapping defined as follows:

{\cal R}_{C,\gamma}^{D}(u,v):=\left\{w\in C:~\left\langle D(v-w),y-w\right\rangle\leq\gamma\|w-u\|_{D}^{2},\quad\forall~y\in C\right\}. (8)

Each point w\in{\cal R}_{C,\gamma}^{D}(u,v) is called a feasible inexact projection, with respect to the norm \|\cdot\|_{D}, of v onto C relative to u with forcing parameter \gamma\geq 0.
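Unlike (6), condition (8) quantifies over all y\in C; however, when C is a polytope, the linear function y\mapsto\langle D(v-w),y-w\rangle attains its maximum at a vertex, so the condition can be checked by enumerating vertices. The sketch below assumes C=\mbox{conv}\{y_{1},\dots,y_{m}\} is given by its vertex list; this representation and all names are our own illustrative assumptions.

```python
import numpy as np

def is_in_R(w, u, v, D, gamma, vertices):
    """Test condition (8) for a polytope C = conv(vertices):
    max_{y in C} <D(v - w), y - w> <= gamma * ||w - u||_D^2."""
    g = D @ (v - w)
    # The maximum of a linear function over conv(vertices) is attained at a vertex.
    max_linear = max(g @ (y - w) for y in vertices)
    return max_linear <= gamma * ((w - u) @ D @ (w - u))
```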

The concept of feasible inexact projection in Definition 3 also provides more latitude than the usual concept of exact projection. Next, we present some remarks about this concept.

Remark 4.

Let \gamma\geq 0 be a forcing parameter, and let C\subset\mathbb{R}^{n} and u\in C be as in Definition 3. For all v\in\mathbb{R}^{n}, it follows from (8) and Lemma 3 that {\cal R}_{C,0}^{D}(u,v)=\{{\cal P}_{C}^{D}(v)\} is the exact projection of v onto C. Moreover, {\cal P}_{C}^{D}(v)\in{\cal R}_{C,\gamma}^{D}(u,v), from which we conclude that {\cal R}_{C,\gamma}^{D}(u,v)\neq\varnothing, for all u\in C and v\in\mathbb{R}^{n}. Consequently, the set-valued mapping {\cal R}_{C,\gamma}^{D}(u,\cdot) as stated in (8) is well-defined.

The next lemma is a variation of [30, Lemma 6]. It will allow us to relate Definitions 2 and 3.

Lemma 5.

Let v\in\mathbb{R}^{n}, u\in C, 0\leq\gamma<1/2 and w\in{\cal R}_{C,\gamma}^{D}(u,v). Then, there holds

\|w-x\|_{D}^{2}\leq\|x-v\|_{D}^{2}+\frac{2\gamma}{1-2\gamma}\|u-v\|_{D}^{2}-\frac{1}{1-2\gamma}\|w-v\|_{D}^{2},

for all x\in C.

Proof.

First note that \|w-x\|_{D}^{2}=\|x-v\|_{D}^{2}-\|w-v\|_{D}^{2}+2\langle D(v-w),x-w\rangle. Since w\in{\cal R}_{C,\gamma}^{D}(u,v) and x\in C, combining the last equality with (8), we obtain

\|w-x\|_{D}^{2}\leq\|x-v\|_{D}^{2}-\|w-v\|_{D}^{2}+2\gamma\|w-u\|_{D}^{2}. (9)

On the other hand, we also have \|w-u\|_{D}^{2}=\|u-v\|_{D}^{2}-\|w-v\|_{D}^{2}+2\langle D(v-w),u-w\rangle. Due to w\in{\cal R}_{C,\gamma}^{D}(u,v) and u\in C, using (8) and considering that 0\leq\gamma<1/2, we have

\|w-u\|_{D}^{2}\leq\frac{1}{1-2\gamma}\|u-v\|_{D}^{2}-\frac{1}{1-2\gamma}\|w-v\|_{D}^{2}.

Therefore, substituting the last inequality into (9), we obtain the desired inequality. ∎

In the following lemma, we present a relationship between Definitions 2 and 3.

Lemma 6.

Let v\in\mathbb{R}^{n}, u\in C, \gamma\geq 0 and \zeta\in(0,1]. If 0\leq\gamma<1/2 and \zeta=1-2\gamma, then

{\cal R}_{C,\gamma}^{D}(u,v)\subset{\cal P}_{C,\zeta}^{D}(u,v).
Proof.

Let w\in{\cal R}_{C,\gamma}^{D}(u,v). Applying Lemma 5 with x={\cal P}_{C}^{D}(v), we have

\|w-{\cal P}_{C}^{D}(v)\|_{D}^{2}\leq\|v-{\cal P}_{C}^{D}(v)\|_{D}^{2}+\frac{2\gamma}{1-2\gamma}\|u-v\|_{D}^{2}-\frac{1}{1-2\gamma}\|w-v\|_{D}^{2}.

After some algebraic manipulations in the last inequality, we obtain

\|w-v\|_{D}^{2}\leq(1-2\gamma)\|v-{\cal P}_{C}^{D}(v)\|_{D}^{2}+2\gamma\|u-v\|_{D}^{2}-(1-2\gamma)\|w-{\cal P}_{C}^{D}(v)\|_{D}^{2}.

Therefore, considering that 0γ<1/20\leq\gamma<1/2 and ζ=12γ\zeta=1-2\gamma, the result follows from Definition 2. ∎

Remark 5.

Under the conditions of Lemma 6, there exist 0\leq\gamma<1/2 and \zeta=1-2\gamma such that {\cal P}_{C,\zeta}^{D}(u,v)\nsubseteq{\cal R}_{C,\gamma}^{D}(u,v). Indeed, let \gamma=3/8, \zeta=1/4, and \bar{w}=\frac{1}{2}({\cal P}_{C}^{D}(v)+u); then

\|\bar{w}-v\|_{D}^{2}=\frac{1}{4}\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+\frac{1}{4}\|u-v\|_{D}^{2}+\frac{1}{2}\langle D({\cal P}_{C}^{D}(v)-v),u-v\rangle.

Since {\cal P}_{C}^{D}(v) is the exact projection of v, we have \langle D({\cal P}_{C}^{D}(v)-v),u-v\rangle\leq\|u-v\|_{D}^{2}. Combining this inequality with the last equality and Definition 2, we conclude that \bar{w}\in{\cal P}_{C,\zeta}^{D}(u,v). Now, letting w_{t}=t{\cal P}_{C}^{D}(v)+(1-t)\bar{w} with 0<t<1, after some algebraic manipulations we have

\langle D(v-\bar{w}),w_{t}-\bar{w}\rangle=t\|\bar{w}-u\|_{D}^{2}-\frac{t}{2}\langle D(v-{\cal P}_{C}^{D}(v)),u-{\cal P}_{C}^{D}(v)\rangle.

Thus, it follows from Lemma 3 that \langle D(v-\bar{w}),w_{t}-\bar{w}\rangle\geq t\|\bar{w}-u\|_{D}^{2}. Hence, taking t>3/8, we conclude that \bar{w}\not\in{\cal R}_{C,\gamma}^{D}(u,v). Therefore, considering that \bar{w}\in{\cal P}_{C,\zeta}^{D}(u,v), the statement follows.

It follows from Remark 5 that, in general, {\cal P}_{C,\zeta}^{D}(u,v)\nsubseteq{\cal R}_{C,\gamma}^{D}(u,v). However, whenever C is a bounded set, we will show that for each fixed 0<\gamma<1/2 there exists 0<\zeta<1 such that {\cal P}_{C,\zeta}^{D}(u,v)\subseteq{\cal R}_{C,\gamma}^{D}(u,v). For that, we first need the next lemma.

Lemma 7.

Let v\in\mathbb{R}^{n}, u\in C and 0<\gamma<1/2. Assume that C is a bounded set and take

0<\varepsilon<\frac{\gamma\|u-\mathcal{P}_{C}^{D}(v)\|_{D}^{2}}{1-\gamma+\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+2\gamma\|u-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C}, (10)

where \mbox{diam}\,C denotes the diameter of C. Then, \{w\in C:~\|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon\}\subset\mathcal{R}_{C,\gamma}^{D}(u,v).

Proof.

Take \varepsilon satisfying (10) and w\in C such that \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon. For all z\in C, we have

\langle D(v-w),z-w\rangle=\langle D(v-\mathcal{P}_{C}^{D}(v)),z-\mathcal{P}_{C}^{D}(v)\rangle+\langle D(v-\mathcal{P}_{C}^{D}(v)),\mathcal{P}_{C}^{D}(v)-w\rangle+\langle D(\mathcal{P}_{C}^{D}(v)-w),z-\mathcal{P}_{C}^{D}(v)\rangle+\|\mathcal{P}_{C}^{D}(v)-w\|_{D}^{2}.

Using Lemma 3, we have \langle D(v-\mathcal{P}_{C}^{D}(v)),z-\mathcal{P}_{C}^{D}(v)\rangle\leq 0. Thus, the last equality becomes

\langle D(v-w),z-w\rangle\leq\langle D(v-\mathcal{P}_{C}^{D}(v)),\mathcal{P}_{C}^{D}(v)-w\rangle+\langle D(\mathcal{P}_{C}^{D}(v)-w),z-\mathcal{P}_{C}^{D}(v)\rangle+\|\mathcal{P}_{C}^{D}(v)-w\|_{D}^{2}.

By using the Cauchy-Schwarz inequality, we conclude from the last inequality that

\langle D(v-w),z-w\rangle\leq\|w-\mathcal{P}_{C}^{D}(v)\|_{D}\left(\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\|z-\mathcal{P}_{C}^{D}(v)\|_{D}\right)+\|w-\mathcal{P}_{C}^{D}(v)\|_{D}^{2}.

Since \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon and \|z-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\mbox{diam}\,C, the last inequality implies that

\langle D(v-w),z-w\rangle\leq\varepsilon\left(\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\varepsilon^{2}. (11)

On the other hand, if \varepsilon satisfies (10), then

\varepsilon\left(1-\gamma+\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\gamma\varepsilon^{2}<\gamma\|u-\mathcal{P}_{C}^{D}(v)\|_{D}^{2}-2\gamma\varepsilon\|u-\mathcal{P}_{C}^{D}(v)\|_{D}+\gamma\varepsilon^{2},

hence

\varepsilon\left(1-\gamma+\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\gamma\varepsilon^{2}<\gamma\left(\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\right)^{2}.

Since \gamma,\varepsilon<1, we have \varepsilon^{2}<\varepsilon(1-\gamma)+\gamma\varepsilon^{2}, and we can conclude that

\varepsilon\left(\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\varepsilon^{2}<\gamma\left(\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\right)^{2}.

It follows from (11) that

\langle D(v-w),z-w\rangle\leq\gamma\left(\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\right)^{2}. (12)

Using again that \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon and the triangle inequality, we have

0<\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\leq\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\|u-w\|_{D}.

Hence, taking into account (12), we conclude that \langle D(v-w),z-w\rangle\leq\gamma\|u-w\|_{D}^{2}. Therefore, it follows from Definition 3 that w\in\mathcal{R}_{C,\gamma}^{D}(u,v). ∎

Proposition 8.

Let v\in\mathbb{R}^{n}, u\in C and assume that C is a bounded set. Then, for each 0<\gamma<1/2, there exists 0<\zeta<1 such that {\cal P}_{C,\zeta}^{D}(u,v)\subseteq{\cal R}_{C,\gamma}^{D}(u,v).

Proof.

It follows from Lemma 7 that, given 0<\gamma<1/2, there exists \varepsilon>0 such that, for all w\in C with \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon, we have w\in\mathcal{R}_{C,\gamma}^{D}(u,v). On the other hand, we can see from (6) that, as \zeta\to 1, the diameter of {\cal P}_{C,\zeta}^{D}(u,v) tends to zero and this set shrinks to \{\mathcal{P}_{C}^{D}(v)\}. Hence, there exists \zeta close to 1 such that \mbox{diam}({\cal P}_{C,\zeta}^{D}(u,v))<\varepsilon/2; in particular, \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon for all w\in{\cal P}_{C,\zeta}^{D}(u,v), and therefore {\cal P}_{C,\zeta}^{D}(u,v)\subset{\cal R}_{C,\gamma}^{D}(u,v). ∎

Next, we present some important properties of inexact projections, which will be useful in the sequel.

Lemma 9.

Let x\in C, \alpha>0 and z(\alpha)=x-\alpha D^{-1}\nabla f(x). Take w(\alpha)\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)) with \zeta\in(0,1]. Then, the following hold:

  • (i)

    \langle\nabla f(x),w(\alpha)-x\rangle\leq-\frac{1}{2\alpha}\|w(\alpha)-x\|_{D}^{2}+\frac{\zeta}{2\alpha}\left[\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-\|x-z(\alpha)\|_{D}^{2}\right];

  • (ii)

    the point x is stationary for problem (1) if and only if x\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha));

  • (iii)

    if x\in C is a nonstationary point for problem (1), then \langle\nabla f(x),w(\alpha)-x\rangle<0. Equivalently, if there exists \bar{\alpha}>0 such that \langle\nabla f(x),w(\bar{\alpha})-x\rangle\geq 0, then x is stationary for problem (1).

Proof.

Since w(α)𝒫C,ζD(x,z(α))w(\alpha)\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)), applying Lemma 4 with w=w(α)w=w(\alpha), v=z(α)v=z(\alpha), y=xy=x, and u=xu=x, we conclude, after some algebraic manipulations, that

D(z(α)w(α)),xw(α)12w(α)xD2+ζ2[𝒫CD(z(α))z(α)D2xz(α)D2].\left\langle D(z(\alpha)-w(\alpha)),x-w(\alpha)\right\rangle\leq\frac{1}{2}\|w(\alpha)-x\|_{D}^{2}+\frac{\zeta}{2}\left[\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-\|x-z(\alpha)\|_{D}^{2}\right].

Substituting z(α)=xαf(x)z(\alpha)=x-\alpha\nabla f(x) in the left hand side of the last inequality, some manipulations yield the inequality of item (i)(i). For proving item (ii)(ii), we first assume that xx is stationary for problem (1). In this case, (2) implies that f(x),w(α)x0\langle\nabla f(x),w(\alpha)-x\rangle\geq 0. Hence, due to 𝒫CD(z(α))z(α)Dxz(α)D\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}\leq\|x-z(\alpha)\|_{D}, item (i)(i) implies

12αw(α)xD2ζ2α[𝒫CD(z(α))z(α)D2xz(α)D2]0.\frac{1}{2\alpha}\|w(\alpha)-x\|_{D}^{2}\leq\frac{\zeta}{2\alpha}\left[\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-\|x-z(\alpha)\|_{D}^{2}\right]\leq 0.

Since α>0\alpha>0 and ζ(0,1]\zeta\in(0,1], the last inequality yields w(α)=xw(\alpha)=x. Therefore, x𝒫C,ζD(x,z(α))x\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)). Reciprocally, if x𝒫C,ζD(x,z(α))x\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)), then Definition 2 implies that

xz(α)D2ζ𝒫CD(z(α))z(α)D2+(1ζ)xz(α)D2.\|x-z(\alpha)\|_{D}^{2}\leq\zeta\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}+(1-\zeta)\|x-z(\alpha)\|_{D}^{2}.

Hence, 0ζ(𝒫CD(z(α))z(α)D2(xz(α)D2)0\leq\zeta\left(\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-(\|x-z(\alpha)\|_{D}^{2}\right). Considering that ζ(0,1]\zeta\in(0,1] we have

xz(α)D𝒫CD(z(α))z(α)D.\|x-z(\alpha)\|_{D}\leq\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}.

Thus, due to exact projection with respect to the norm D\|\cdot\|_{D} be unique and z(α)=xD1αf(x)z(\alpha)=x-D^{-1}\alpha\nabla f(x), we have 𝒫CD(xαD1f(x))=x{\cal P}_{C}^{D}(x-\alpha D^{-1}\nabla f(x))=x. Hence, xx is the solution of the constrained optimization problem minyCyz(α)D2\min_{y\in C}\|y-z(\alpha)\|^{2}_{D}, which taking into account that α>0\alpha>0 implies (2). Therefore, xx is stationary point for problem (1). Finally, to prove item (iii)(iii), take xx a nonstationary point for problem (1). Thus, by item (ii)(ii), x𝒫C,ζD(x,z(α))x\notin{\cal P}_{C,\zeta}^{D}(x,z(\alpha)) and taking into account that w(α)𝒫C,ζD(x,z(α))w(\alpha)\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)), we conclude that xw(α)x\neq w(\alpha). Since 𝒫CD(z(α))z(α)Dxz(α)D\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}\leq\|x-z(\alpha)\|_{D}, α>0\alpha>0 and ζ(0,1]\zeta\in(0,1], it follows from item (i)(i) that f(x),w(α)x<0\Big{\langle}\nabla f(x),w(\alpha)-x\Big{\rangle}<0 and the first sentence is proved. Finally, note that the second sentence is the contrapositive of the first sentence. ∎

Finally, it is worth mentioning that Definitions 2 and 3, introduced respectively in [13] and [28], are relative inexact concepts, while the concept introduced in [60, 64] is absolute.

2.1.1 Practical computation of inexact projections

In this section, for a given v\in\mathbb{R}^{n} and u\in C, we discuss how to compute a point w\in C belonging to {\cal P}_{C,\zeta}^{D}(u,v) or {\cal R}_{C,\gamma}^{D}(u,v). We recall that Lemma 6 implies that {\cal P}_{C,\zeta}^{D}(u,v) has more latitude than {\cal R}_{C,\gamma}^{D}(u,v), i.e., {\cal R}_{C,\gamma}^{D}(u,v)\subset{\cal P}_{C,\zeta}^{D}(u,v).

We begin our discussion by showing how a point w\in{\cal P}_{C,\zeta}^{D}(u,v) can be computed without knowing the point {\cal P}_{C}^{D}(v). Considering that this discussion has already been covered in [13, Section 3, Algorithm 3.1], we will limit ourselves to giving a general idea of how this task is carried out; see also [16, Section 5.1]. The idea is to use an external procedure capable of computing two sequences (c_{\ell})_{\ell\in\mathbb{N}}\subset\mathbb{R} and (w^{\ell})_{\ell\in\mathbb{N}}\subset C satisfying the following conditions:

c_{\ell}\leq\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2},\quad\forall\ell\in\mathbb{N},\qquad\lim_{\ell\to+\infty}c_{\ell}=\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2},\qquad\lim_{\ell\to+\infty}w^{\ell}={\cal P}_{C}^{D}(v). (13)

In this case, if v\notin C, then we have \|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}-\|u-v\|_{D}^{2}<0. Hence, given an arbitrary \zeta\in(0,1), the second condition in (13) implies that there exists \hat{\ell} such that

\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}-\|u-v\|_{D}^{2}<\zeta(c_{\hat{\ell}}-\|u-v\|_{D}^{2}).

Moreover, by using the last condition in (13), we conclude that there exists \bar{\ell}>\hat{\ell} such that

\|w^{\bar{\ell}}-v\|_{D}^{2}-\|u-v\|_{D}^{2}<\zeta(c_{\bar{\ell}}-\|u-v\|_{D}^{2}), (14)

which, using the inequality in (13), yields \|w^{\bar{\ell}}-v\|_{D}^{2}<\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}. Hence, Definition 2 implies that w^{\bar{\ell}}\in{\cal P}_{C,\zeta}^{D}(u,v). Therefore, (14) can be used as a stopping criterion to compute a feasible inexact projection, with respect to the norm \|\cdot\|_{D}, of v onto C relative to u with forcing parameter \zeta\in(0,1]. For instance, it follows from [13, Theorem 3.2, Lemma 3.1] that sequences (c_{\ell})_{\ell\in\mathbb{N}}\subset\mathbb{R} and (w^{\ell})_{\ell\in\mathbb{N}}\subset C satisfying (13) can be computed by using Dykstra's algorithm [22, 32], whenever D is the identity matrix and C=\cap_{i=1}^{p}C_{i}, where the C_{i} are closed and convex sets for which the exact projection {\cal P}_{C_{i}}^{D}(v) is easy to obtain, for all i=1,\dots,p.
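For concreteness, the following is a minimal sketch of Dykstra's alternating projection scheme for D equal to the identity and C=\cap_{i=1}^{p}C_{i}, which produces the iterates w^{\ell} used above. The computable lower bounds c_{\ell} of [13] are problem-specific, so this sketch simply runs a fixed number of rounds instead of testing (14); all names are illustrative.

```python
# Dykstra's algorithm (D = identity): w converges to P_C(v) for
# C = C_1 ∩ ... ∩ C_p, given exact Euclidean projectors onto each C_i.
import numpy as np

def dykstra(v, projectors, n_rounds=100):
    """projectors: list of callables, each the exact projection onto one C_i."""
    w = v.copy()
    corrections = [np.zeros_like(v) for _ in projectors]
    for _ in range(n_rounds):
        for i, proj in enumerate(projectors):
            y = w + corrections[i]   # re-add the correction for set C_i
            w = proj(y)              # project onto C_i
            corrections[i] = y - w   # update the correction term
    return w
```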

We end this section by discussing how to compute a point w\in{\cal R}_{C,\gamma}^{D}(u,v). For that, we apply the classical Frank-Wolfe method, also known as the conditional gradient method, to minimize the function \psi(z):=\|z-v\|^{2}/2 over the constraint set C, with a suitable stopping criterion depending on u\in C and \gamma\in(0,1]; see [9, 48]. To state the method, we assume the existence of a linear optimization oracle (or simply LO oracle) capable of minimizing linear functions over the constraint set C, which is assumed to be compact. The Frank-Wolfe method is formally stated as follows.

Input:

D\in{\cal D}_{\mu}, \gamma\in(0,1], v\in\mathbb{R}^{n} and u\in C.

Step 0.

Let w^{0}\in C and set \ell\leftarrow 0.

Step 1.

Use the LO oracle to compute an optimal solution z^{\ell} and the optimal value s_{\ell}^{*} as

z^{\ell}\in\arg\min_{z\in C}\,\langle w^{\ell}-v,~z-w^{\ell}\rangle,\qquad s_{\ell}^{*}:=\langle w^{\ell}-v,~z^{\ell}-w^{\ell}\rangle. (15)

If -s^{*}_{\ell}\leq\gamma\|w^{\ell}-u\|_{D}^{2}, then define w:=w^{\ell} and stop.

Step 2.

Compute \alpha_{\ell} and w^{\ell+1} as

w^{\ell+1}:=w^{\ell}+\alpha_{\ell}(z^{\ell}-w^{\ell}),\qquad\alpha_{\ell}:=\min\left\{1,-s^{*}_{\ell}/\|z^{\ell}-w^{\ell}\|^{2}\right\}. (16)

Set \ell\leftarrow\ell+1, and go to Step 1.

Output:

w:=w^{\ell}.

Algorithm 1: Frank-Wolfe method to compute w\in{\cal R}_{C,\gamma}^{D}(u,v)

Let us describe the main features of Algorithm 1, i.e., the Frank-Wolfe method applied to the problem \min_{z\in C}\psi(z). In this case, (15) is equivalent to s_{\ell}^{*}:=\min_{z\in C}\langle\psi^{\prime}(w^{\ell}),~z-w^{\ell}\rangle. Since \psi is convex, we have \psi(z)\geq\psi(w^{\ell})+\langle\psi^{\prime}(w^{\ell}),~z-w^{\ell}\rangle\geq\psi(w^{\ell})+s_{\ell}^{*}, for all z\in C. Define w_{*}:=\arg\min_{z\in C}\psi(z) and \psi^{*}:=\min_{z\in C}\psi(z). Letting z=w_{*} in the last inequality, we obtain \psi(w^{\ell})\geq\psi^{*}\geq\psi(w^{\ell})+s_{\ell}^{*}, which implies that s_{\ell}^{*}<0 whenever \psi(w^{\ell})\neq\psi^{*}. Thus, we conclude that -s_{\ell}^{*}=\langle v-w^{\ell},~z^{\ell}-w^{\ell}\rangle>0\geq\langle v-w_{*},~z-w_{*}\rangle, for all z\in C. Therefore, if Algorithm 1 computes w^{\ell}\in C satisfying -s_{\ell}^{*}\leq\gamma\|w^{\ell}-u\|_{D}^{2}, then the method terminates. Otherwise, it computes the step size \alpha_{\ell}=\arg\min_{\alpha\in[0,1]}\psi(w^{\ell}+\alpha(z^{\ell}-w^{\ell})) using exact minimization. Since z^{\ell},w^{\ell}\in C and C is convex, we conclude from (16) that w^{\ell+1}\in C; thus Algorithm 1 generates a sequence in C. Finally, (15) implies that \langle v-w^{\ell},~z-w^{\ell}\rangle\leq-s_{\ell}^{*}, for all z\in C. Considering that [9, Proposition A.2] implies that \lim_{\ell\to+\infty}s_{\ell}^{*}=0, and taking into account the stopping criterion -s_{\ell}^{*}\leq\gamma\|w^{\ell}-u\|_{D}^{2}, we conclude that the output of Algorithm 1 is a feasible inexact projection w\in{\cal R}_{C,\gamma}^{D}(u,v), i.e., \langle v-w,~z-w\rangle\leq\gamma\|w-u\|_{D}^{2}, for all z\in C.
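As an illustration, here is a direct transcription of Algorithm 1 for the case in which C is the unit simplex, whose LO oracle just picks the vertex with the smallest gradient component. The choice of C and all names are assumptions made for the sketch, not part of the paper.

```python
import numpy as np

def frank_wolfe_inexact_projection(v, u, D, gamma, w0, max_iter=1000):
    """Algorithm 1 for C = unit simplex: returns w in R_{C,gamma}^D(u, v)."""
    w = w0.copy()
    for _ in range(max_iter):
        g = w - v                                   # psi'(w), psi(z) = ||z - v||^2 / 2
        i = int(np.argmin(g))                       # LO oracle: best simplex vertex
        z = np.zeros_like(w)
        z[i] = 1.0
        s = g @ (z - w)                             # s_l^* in (15)
        if -s <= gamma * ((w - u) @ D @ (w - u)):   # stopping test of Step 1
            return w
        alpha = min(1.0, -s / ((z - w) @ (z - w)))  # exact line search (16)
        w = w + alpha * (z - w)                     # update of Step 2
    return w
```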

3 Inexact scaled gradient method

The aim of this section is to present an inexact version of the scaled gradient method (SGM), which is inexact in two distinct senses. First, we use a version of the inexactness scheme introduced in [13], and also a variation of the one that appeared in [64], to compute an inexact projection onto the feasible set allowing an appropriate relative error tolerance. Second, using the conceptual scheme for non-monotone line searches introduced in [43, 59], a step size is computed to define the next iterate. The statement of the conceptual algorithm is as follows.

Step 0.

Choose \sigma,\zeta_{\min}\in(0,1), \delta_{\min}\in[0,1), 0<\underline{\omega}<\bar{\omega}<1, 0<\alpha_{\min}\leq\alpha_{\max} and \mu\geq 1. Let x^{0}\in C, \nu_{0}\geq 0 and set k\leftarrow 0.

Step 1.

Choose positive real numbers \alpha_{k} and \zeta_{k}, and a positive definite matrix D_{k}, such that

\alpha_{\min}\leq\alpha_{k}\leq\alpha_{\max},\qquad 0<\zeta_{\min}<\zeta_{k}\leq 1,\qquad D_{k}\in{\cal D}_{\mu}. (17)

Compute w^{k}\in C as any feasible inexact projection, with respect to the norm \|\cdot\|_{D_{k}}, of z^{k}:=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}) onto C relative to x^{k} with forcing parameter \zeta_{k}, i.e.,

w^{k}\in{\cal P}_{C,\zeta_{k}}^{D_{k}}(x^{k},z^{k}). (18)

If w^{k}=x^{k}, then stop declaring convergence.

Step 2.

Set \tau_{\textrm{trial}}\leftarrow 1. If

f\big(x^{k}+\tau_{\textrm{trial}}(w^{k}-x^{k})\big)\leq f(x^{k})+\sigma\tau_{\textrm{trial}}\big\langle\nabla f(x^{k}),w^{k}-x^{k}\big\rangle+\nu_{k}, (19)

then \tau_{k}\leftarrow\tau_{\textrm{trial}}, define the next iterate x^{k+1} as

x^{k+1}=x^{k}+\tau_{k}(w^{k}-x^{k}), (20)

and go to Step 3. Otherwise, choose \tau_{\textrm{new}}\in[\underline{\omega}\tau_{\textrm{trial}},\bar{\omega}\tau_{\textrm{trial}}], set \tau_{\textrm{trial}}\leftarrow\tau_{\textrm{new}}, and repeat test (19).

Step 3.

Take \delta_{k+1}\in[\delta_{\min},1] and choose \nu_{k+1}\in\mathbb{R} satisfying

0\leq\nu_{k+1}\leq(1-\delta_{k+1})\big[f(x^{k})+\nu_{k}-f(x^{k+1})\big]. (21)

Set k\leftarrow k+1 and go to Step 1.

Algorithm 2: InexProj-SGM employing non-monotone line search

Let us describe the main features of Algorithm 2. In Step 1, we first choose \alpha_{\min}\leq\alpha_{k}\leq\alpha_{\max}, 0<\zeta_{\min}<\zeta_{k}\leq 1, and D_{k}\in{\cal D}_{\mu}. Then, by using some (inner) procedure, such as those specified in Section 2.1, we compute w^{k} as any feasible inexact projection of z^{k}=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}) onto the feasible set C relative to the previous iterate x^{k} with forcing parameter \zeta_{k}. If w^{k}=x^{k}, then Lemma 9(ii) implies that x^{k} is a stationary point of problem (1). Otherwise, w^{k}\neq x^{k} and Lemma 9(i) implies that w^{k}-x^{k} is a descent direction of f at x^{k}, i.e., \langle\nabla f(x^{k}),w^{k}-x^{k}\rangle<0. Hence, in Step 2, we employ a non-monotone line search with tolerance parameter \nu_{k}\geq 0 to compute a step size \tau_{k}\in(0,1], and the next iterate is computed as in (20). Finally, due to (19) and \delta_{k+1}\in[\delta_{\min},1], we have 0\leq(1-\delta_{k+1})\big[f(x^{k})+\nu_{k}-f(x^{k+1})\big]. Therefore, the next tolerance parameter \nu_{k+1}\in\mathbb{R} can be chosen satisfying (21) in Step 3, completing the iteration.
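To fix ideas, the sketch below assembles one possible instance of Algorithm 2 in Python with D_k equal to the identity, a constant \alpha_k, the monotone choice \nu_k\equiv 0 (Armijo), and a user-supplied callback standing in for the inexact projection (18). Parameter values and names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def inexproj_sgm(f, grad, inexact_proj, x0, alpha=1.0, sigma=1e-4,
                 omega=0.5, tol=1e-8, max_iter=500):
    """Instance of Algorithm 2 with D_k = I, fixed alpha_k, nu_k = 0."""
    x = x0.copy()
    for _ in range(max_iter):
        z = x - alpha * grad(x)          # z^k = x^k - alpha_k D_k^{-1} grad f(x^k)
        w = inexact_proj(x, z)           # w^k in P_{C,zeta_k}^{D_k}(x^k, z^k), cf. (18)
        d = w - x
        if np.linalg.norm(d) <= tol:     # Step 1: w^k = x^k, declare convergence
            return x
        slope = grad(x) @ d              # negative by Lemma 9(i)
        tau = 1.0
        while f(x + tau * d) > f(x) + sigma * tau * slope:  # test (19) with nu_k = 0
            tau *= omega                 # backtracking: tau_new = omega * tau_trial
        x = x + tau * d                  # update (20)
    return x
```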

It is worth mentioning that the conditions in (17) allow combining several strategies for choosing the step sizes \alpha_{k} and the matrices D_{k} to accelerate the performance of the classical gradient method. Such strategies have their origin in the study of the gradient method for unconstrained optimization; papers dealing with this issue include, but are not limited to, [7, 27, 29, 36, 69], see also [18, 25, 26, 49]. More details about selecting the step sizes \alpha_{k} and matrices D_{k} can be found in the recent review [17] and references therein.

Below, we present some particular instances of the parameter \delta_{k}\geq 0 and the non-monotonicity tolerance parameter \nu_{k}\geq 0 in Step 3.

  1.

    Armijo line search

    Taking \nu_{k}\equiv 0, the line search (19) is the well-known (monotone) Armijo line search, see [12, Section 2.3]. In this case, we can take \delta_{k}\equiv 1 in Step 3.

  2.

    Max-type line search

    The earliest non-monotone line search strategy was proposed in [45]. Let M>0 be an integer parameter. At iteration k, this strategy requires a step size \tau_{k}>0 satisfying

    f\big(x^{k}+\tau_{k}(w^{k}-x^{k})\big)\leq\max_{0\leq j\leq m_{k}}f(x^{k-j})+\sigma\tau_{k}\big\langle\nabla f(x^{k}),w^{k}-x^{k}\big\rangle, (22)

    where m_{0}=0 and 0\leq m_{k}\leq\min\{m_{k-1}+1,M\}. To simplify the notation, we define f(x^{\ell(k)}):=\max_{0\leq j\leq m_{k}}f(x^{k-j}). In order to identify (22) as a particular instance of (19), we set

    \nu_{k}=f(x^{\ell(k)})-f(x^{k}),\quad 0=\delta_{\min}\leq\delta_{k+1}\leq[f(x^{\ell(k)})-f(x^{\ell(k+1)})]/[f(x^{\ell(k)})-f(x^{k+1})]. (23)

    The parameters \nu_{k} and \delta_{k+1} in (23) satisfy the corresponding conditions in Algorithm 2, i.e., \nu_{k}\geq 0 and \delta_{k+1}\in[\delta_{\min},1] (with \delta_{\min}=0) satisfy (21). In fact, the definition of f(x^{\ell(k)}) implies that f(x^{k})\leq f(x^{\ell(k)}), and hence \nu_{k}\geq 0. Due to \langle\nabla f(x^{k}),w^{k}-x^{k}\rangle<0, it follows from (19) that f(x^{\ell(k)})-f(x^{k+1})>0. Since m_{k+1}\leq m_{k}+1, we conclude that f(x^{\ell(k)})-f(x^{\ell(k+1)})\geq 0. Hence, owing to f(x^{k+1})\leq f(x^{\ell(k+1)}), we obtain \delta_{k+1}\in[0,1]. Moreover, (21) is equivalent to \delta_{k+1}[f(x^{k})+\nu_{k}-f(x^{k+1})]\leq(f(x^{k})+\nu_{k})-(f(x^{k+1})+\nu_{k+1}), which in turn, taking into account that \nu_{k}=f(x^{\ell(k)})-f(x^{k}), is equivalent to the second inequality in (23). Thus, (22) is a particular instance of (19) with \nu_{k} and \delta_{k+1} defined in (23). Therefore, Algorithm 2 has as a particular instance the inexact projected version of the scaled gradient method employing the non-monotone line search (22). This version has been considered in [13]; see also [19, 65].

  3.

    Average-type line search

    Let us first recall the definition of the sequence of “cost updates” (c_{k})_{k\in\mathbb{N}} that characterizes the non-monotone line search proposed in [68]. Let 0\leq\eta_{\min}\leq\eta_{\max}<1, c_{0}=f(x^{0}) and q_{0}=1. Choose \eta_{k}\in[\eta_{\min},\eta_{\max}] and set

    q_{k+1}=\eta_{k}q_{k}+1,\qquad c_{k+1}=[\eta_{k}q_{k}c_{k}+f(x^{k+1})]/q_{k+1},\qquad\forall k\in\mathbb{N}. (24)

    Some algebraic manipulations show that the sequence defined in (24) is equivalent to

    c_{k+1}=(1-1/q_{k+1})c_{k}+f(x^{k+1})/q_{k+1},\qquad\forall k\in\mathbb{N}. (25)

    Since (21) is equivalent to f(x^{k+1})+\nu_{k+1}\leq(1-\delta_{k+1})(f(x^{k})+\nu_{k})+\delta_{k+1}f(x^{k+1}), it follows from (25) that, letting \nu_{k}=c_{k}-f(x^{k}) and \delta_{k+1}=1/q_{k+1}, Algorithm 2 becomes the inexact projected version of the scaled gradient method employing the non-monotone line search proposed in [68]; these updates are written out in the sketch after this list. Finally, considering that q_{0}=1 and \eta_{\max}<1, the first equality in (24) implies that q_{k+1}=1+\sum_{j=0}^{k}\prod_{i=0}^{j}\eta_{k-i}\leq\sum_{j=0}^{+\infty}\eta_{\max}^{j}=1/(1-\eta_{\max}). In this case, due to \delta_{k+1}=1/q_{k+1}, we can take \delta_{\min}=1-\eta_{\max}>0 in Step 3. For gradient projection methods employing the non-monotone Average-type line search see, for example, [6, 34, 66].
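The recursion (24)-(25), together with the choices \nu_{k}=c_{k}-f(x^{k}) and \delta_{k+1}=1/q_{k+1}, translates into a few lines of code. A sketch with illustrative names; `f_next` stands for f(x^{k+1}).

```python
def average_type_update(q, c, f_next, eta):
    """One Average-type update: given q_k, c_k, f(x^{k+1}) and eta_k,
    return q_{k+1}, c_{k+1}, nu_{k+1}, delta_{k+1} as in (24)."""
    q_next = eta * q + 1.0                    # q_{k+1} = eta_k q_k + 1
    c_next = (eta * q * c + f_next) / q_next  # c_{k+1}, cf. (24)
    nu_next = c_next - f_next                 # nu_{k+1} = c_{k+1} - f(x^{k+1})
    delta_next = 1.0 / q_next                 # delta_{k+1} = 1 / q_{k+1}
    return q_next, c_next, nu_next, delta_next
```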

Remark 6.

The general line search in Step 2 of Algorithm 2, with parameters \delta_{k+1} and \nu_{k} properly chosen in Step 3, also contains as particular cases the non-monotone line searches that appeared in [3, 51]; see also [43].

4 Partial asymptotic convergence analysis

The goal of this section is to present a partial convergence result for the sequence (x^{k})_{k\in\mathbb{N}} generated by Algorithm 2; namely, we will prove that every cluster point of (x^{k})_{k\in\mathbb{N}} is stationary for problem (1). For that, we state a result that is contained in the proof of [43, Theorem 4].

Lemma 10.

There holds 0\leq\delta_{k+1}\big[f(x^{k})+\nu_{k}-f(x^{k+1})\big]\leq\big(f(x^{k})+\nu_{k}\big)-\big(f(x^{k+1})+\nu_{k+1}\big), for all k\in\mathbb{N}. As a consequence, the sequence \left(f(x^{k})+\nu_{k}\right)_{k\in\mathbb{N}} is non-increasing.

Next, we present our first convergence result. It is worth noting that, just as in the classical projected gradient method, we do not need to assume that f has bounded sub-level sets.

Proposition 11.

Assume that \lim_{k\to+\infty}\nu_{k}=0. Then, Algorithm 2 either stops in a finite number of iterations at a stationary point of problem (1), or generates an infinite sequence (x^{k})_{k\in\mathbb{N}} for which every cluster point is stationary for problem (1).

Proof.

First, assume that (x^{k})_{k\in\mathbb{N}} is finite. In this case, according to Step 1, there exists k\in\mathbb{N} such that x^{k}=w^{k}\in{\cal P}_{C,\zeta_{k}}^{D_{k}}(x^{k},z^{k}), where z^{k}=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}), 0<\zeta_{\min}<\zeta_{k}\leq 1 and \alpha_{k}>0. Therefore, applying Lemma 9(ii) with x=x^{k}, \alpha=\alpha_{k} and \zeta=\zeta_{k}, we conclude that x^{k} is stationary for problem (1). Now, assume that (x^{k})_{k\in\mathbb{N}} is infinite. Let \bar{x} be a cluster point of (x^{k})_{k\in\mathbb{N}} and (x^{k_{j}})_{j\in\mathbb{N}} be a subsequence of (x^{k})_{k\in\mathbb{N}} such that \lim_{j\to+\infty}x^{k_{j}}=\bar{x}. Since C is closed and (x^{k})_{k\in\mathbb{N}}\subset C, we have \bar{x}\in C. Moreover, owing to \lim_{k\to+\infty}\nu_{k}=0, we have \lim_{j\to+\infty}\left(f(x^{k_{j}})+\nu_{k_{j}}\right)=f(\bar{x}). Hence, considering that \lim_{k\to+\infty}\nu_{k}=0 and that Lemma 10 implies that \left(f(x^{k})+\nu_{k}\right)_{k\in\mathbb{N}} is non-increasing, we conclude that \lim_{k\to+\infty}f(x^{k})=\lim_{k\to+\infty}\left(f(x^{k})+\nu_{k}\right)=f(\bar{x}). On the other hand, due to w^{k}\in{\cal P}_{C,\zeta_{k}}^{D_{k}}(x^{k},z^{k}), where z^{k}=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}), Definition 2 implies

\|w^{k_{j}}-z^{k_{j}}\|_{D_{k_{j}}}^{2}\leq\zeta_{k_{j}}\|{\cal P}_{C}^{D_{k_{j}}}(z^{k_{j}})-z^{k_{j}}\|_{D_{k_{j}}}^{2}+(1-\zeta_{k_{j}})\|x^{k_{j}}-z^{k_{j}}\|_{D_{k_{j}}}^{2}. (26)

Considering that (\alpha_{k})_{k\in\mathbb{N}} and (\zeta_{k})_{k\in\mathbb{N}} are bounded, (D_{k})_{k\in\mathbb{N}}\subset{\cal D}_{\mu}, (x^{k_{j}})_{j\in\mathbb{N}} converges to \bar{x} and \nabla f is continuous, the last inequality together with Remark 1 and (4) implies that (w^{k_{j}})_{j\in\mathbb{N}}\subset C is also bounded. Thus, we can assume without loss of generality that \lim_{j\to+\infty}w^{k_{j}}=\bar{w}\in C. In addition, taking into account that x^{k}\neq w^{k} for all k\in\mathbb{N}, applying Lemma 9(i) with x=x^{k}, \alpha=\alpha_{k}, z(\alpha)=z^{k} and \zeta=\zeta_{k}, we obtain that \langle\nabla f(x^{k}),w^{k}-x^{k}\rangle<0, for all k\in\mathbb{N}. Therefore, (19) and (20) imply that

0<-\sigma\tau_{k}\big\langle\nabla f(x^{k}),w^{k}-x^{k}\big\rangle\leq f(x^{k})+\nu_{k}-f(x^{k+1}),\qquad\forall~k\in\mathbb{N}. (27)

Now, due to \tau_{k}\in(0,1] for all k\in\mathbb{N}, we can also assume without loss of generality that \lim_{j\to+\infty}\tau_{k_{j}}=\bar{\tau}\in[0,1]. Therefore, owing to \lim_{k\to+\infty}f(x^{k})=f(\bar{x}) and \lim_{k\to+\infty}\nu_{k}=0, taking the limit in (27) along the subsequences (x^{k_{j}})_{j\in\mathbb{N}}, (w^{k_{j}})_{j\in\mathbb{N}} and (\tau_{k_{j}})_{j\in\mathbb{N}} yields \bar{\tau}\big\langle\nabla f(\bar{x}),\bar{w}-\bar{x}\big\rangle=0. We have two possibilities: \bar{\tau}>0 or \bar{\tau}=0. If \bar{\tau}>0, then \big\langle\nabla f(\bar{x}),\bar{w}-\bar{x}\big\rangle=0. Now, we assume that \bar{\tau}=0. In this case, for all j large enough, there exists 0<\hat{\tau}_{k_{j}}\leq\min\{1,\tau_{k_{j}}/\underline{\omega}\} such that

f\big(x^{k_{j}}+\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big)>f(x^{k_{j}})+\sigma\hat{\tau}_{k_{j}}\big\langle\nabla f(x^{k_{j}}),w^{k_{j}}-x^{k_{j}}\big\rangle+\nu_{k_{j}}. (28)

On the other hand, by the mean value theorem, there exists \xi_{k_{j}}\in(0,1) such that

\langle\nabla f\big(x^{k_{j}}+\xi_{k_{j}}\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big),\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\rangle=f\big(x^{k_{j}}+\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big)-f(x^{k_{j}}).

Combining this equality with (28), and taking into account that \nu_{k_{j}}\geq 0, we have

\langle\nabla f\big(x^{k_{j}}+\xi_{k_{j}}\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big),\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\rangle>\sigma\hat{\tau}_{k_{j}}\big\langle\nabla f(x^{k_{j}}),w^{k_{j}}-x^{k_{j}}\big\rangle,

for j large enough. Since 0<\hat{\tau}_{k_{j}}\leq\min\{1,\tau_{k_{j}}/\underline{\omega}\}, it follows that \lim_{j\to\infty}\hat{\tau}_{k_{j}}\|w^{k_{j}}-x^{k_{j}}\|=0. Then, dividing both sides of the above inequality by \hat{\tau}_{k_{j}}>0 and taking limits as j goes to +\infty, we conclude that \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle\geq\sigma\langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle. Hence, due to \sigma\in(0,1), we obtain \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle\geq 0. We recall that \langle\nabla f(x^{k_{j}}),w^{k_{j}}-x^{k_{j}}\rangle<0, for all j\in\mathbb{N}, which, taking the limit as j goes to +\infty, yields \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle\leq 0. Hence, we also have \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle=0. Therefore, for either of the two possibilities, \bar{\tau}>0 or \bar{\tau}=0, we have \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle=0. On the other hand, since (\alpha_{k})_{k\in\mathbb{N}} and (\zeta_{k})_{k\in\mathbb{N}} are bounded, we can also assume without loss of generality that \lim_{j\to+\infty}\alpha_{k_{j}}=\bar{\alpha}\in[\alpha_{\min},\alpha_{\max}], \lim_{j\to+\infty}\zeta_{k_{j}}=\bar{\zeta}\in[\zeta_{\min},1] and \lim_{j\to+\infty}D_{k_{j}}=\bar{D}\in{\cal D}_{\mu}. Thus, since Remark 1 implies that

\lim_{j\to+\infty}{\cal P}_{C}^{D_{k_{j}}}(z^{k_{j}})={\cal P}_{C}^{\bar{D}}(\bar{z}),

and considering that \lim_{j\to+\infty}x^{k_{j}}=\bar{x}\in C, \lim_{j\to+\infty}w^{k_{j}}=\bar{w}\in C and \lim_{j\to+\infty}\tau_{k_{j}}=\bar{\tau}\in[0,1], taking the limit in (26), we conclude that

\|\bar{w}-\bar{z}\|_{\bar{D}}^{2}\leq\bar{\zeta}\|{\cal P}_{C}^{\bar{D}}(\bar{z})-\bar{z}\|_{\bar{D}}^{2}+(1-\bar{\zeta})\|\bar{x}-\bar{z}\|_{\bar{D}}^{2},

where \bar{z}=\bar{x}-\bar{\alpha}\bar{D}^{-1}\nabla f(\bar{x}). Hence, Definition 2 implies that \bar{w}\in{\cal P}_{C,\bar{\zeta}}^{\bar{D}}(\bar{x},\bar{z}). Therefore, due to \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle=0, we can apply the second sentence in Lemma 9(iii) with x=\bar{x}, z(\bar{\alpha})=\bar{z} and w(\bar{\alpha})=\bar{w}, to conclude that \bar{x} is stationary for problem (1). ∎

The tolerance parameter $\nu_k$ that controls the non-monotonicity of the line search must become smaller and smaller as the sequence $(x^k)_{k\in\mathbb{N}}$ tends to a stationary point. The next corollary presents a general condition ensuring this property; its proof can be found in [43, Theorem 4].

Corollary 12.

If $\delta_{\min}>0$, then $\sum_{k=0}^{+\infty}\nu_k<+\infty$. Consequently, $\lim_{k\to+\infty}\nu_k=0$.

The Armijo and the non-monotone Average-type line searches discussed in Section 3 satisfy the assumption of Corollary 12, i.e., $\delta_{\min}>0$. However, for the non-monotone Max-type line search, we can only guarantee that $\delta_{\min}\geq 0$. Hence, we cannot apply Corollary 12 to conclude that $\lim_{k\to+\infty}\nu_k=0$. In the next proposition, we deal with this case separately.

Proposition 13.

Assume that the sequence $(x^k)_{k\in\mathbb{N}}$ is generated by Algorithm 2 with the non-monotone line search (22), i.e., $\nu_k=f(x^{\ell(k)})-f(x^k)$ for all $k\in\mathbb{N}$. In addition, assume that the level set $C_0:=\{x\in C:~f(x)\leq f(x^0)\}$ is bounded and $\nu_0=0$. Then, $\lim_{k\to+\infty}\nu_k=0$.

Proof.

First of all, note that $w^k\in{\cal P}_{C,\zeta_k}^{D_k}(x^k,z^k)$, where $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$ and $D_k\in{\cal D}_\mu$. Thus, applying Lemma 9(i) with $x=x^k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, we obtain

\|w^k-x^k\|^2 \leq -2\mu\alpha_{\max}\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle, \qquad \forall~k\in\mathbb{N}. \qquad (29)

On the other hand, due to $f(x^{\ell(k)})=f(x^k)+\nu_k$, Lemma 10 implies that $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ is non-increasing and $f(x^{k+1})\leq f(x^{k+1})+\nu_{k+1}\leq f(x^k)+\nu_k\leq f(x^0)$. Hence, we have $(x^k)_{k\in\mathbb{N}}\subset C_0$ and, as a consequence, $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ converges. Note that $\ell(k)$ is an integer such that

k - m_k \leq \ell(k) \leq k. \qquad (30)

Since $x^{\ell(k)}=x^{\ell(k)-1}+\tau_{\ell(k)-1}(w^{\ell(k)-1}-x^{\ell(k)-1})$, (22) implies that

f\big(x^{\ell(k)}\big) \leq f\big(x^{\ell(\ell(k)-1)}\big) + \sigma\tau_{\ell(k)-1}\big\langle\nabla f(x^{\ell(k)-1}),\, w^{\ell(k)-1}-x^{\ell(k)-1}\big\rangle,

for all $k>M$. In view of $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ being convergent and $\langle\nabla f(x^k),w^k-x^k\rangle<0$ for all $k\in\mathbb{N}$, and taking into account that $\tau_k\in(0,1]$, the last inequality together with (29) implies that

\lim_{k\to+\infty}\tau_{\ell(k)-1}\|w^{\ell(k)-1}-x^{\ell(k)-1}\| = 0. \qquad (31)

We proceed to prove that $\lim_{k\to+\infty}f(x^k)=\lim_{k\to+\infty}f(x^{\ell(k)})$. For that, set $\hat{\ell}(k):=\ell(k+M+2)$. First, we prove by induction that, for all $j\geq 1$, the following two equalities hold:

\lim_{k\to+\infty}\tau_{\hat{\ell}(k)-j}\|w^{\hat{\ell}(k)-j}-x^{\hat{\ell}(k)-j}\| = 0, \qquad \lim_{k\to+\infty}f(x^{\hat{\ell}(k)-j}) = \lim_{k\to+\infty}f(x^{\ell(k)}), \qquad (32)

where we are considering $k\geq j-1$. Assume that $j=1$. Since $\{\hat{\ell}(k):~k\in\mathbb{N}\}\subset\{\ell(k):~k\in\mathbb{N}\}$, the first equality in (32) follows from (31). Hence, $\lim_{k\to+\infty}\|x^{\hat{\ell}(k)}-x^{\hat{\ell}(k)-1}\|=0$. Since $C_0$ is compact and $f$ is uniformly continuous on $C_0$, we have $\lim_{k\to+\infty}f(x^{\hat{\ell}(k)-1})=\lim_{k\to+\infty}f(x^{\hat{\ell}(k)})$, which, again using that $\{\hat{\ell}(k):~k\in\mathbb{N}\}\subset\{\ell(k):~k\in\mathbb{N}\}$, implies the second equality in (32). Assume now that (32) holds for $j$. Again, due to $x^{\hat{\ell}(k)-j}=x^{\hat{\ell}(k)-j-1}+\tau_{\hat{\ell}(k)-j-1}(w^{\hat{\ell}(k)-j-1}-x^{\hat{\ell}(k)-j-1})$, (22) implies that

f\big(x^{\hat{\ell}(k)-j}\big) \leq f\big(x^{\ell(\hat{\ell}(k)-(j+1))}\big) + \sigma\tau_{\hat{\ell}(k)-(j+1)}\big\langle\nabla f(x^{\hat{\ell}(k)-(j+1)}),\, w^{\hat{\ell}(k)-(j+1)}-x^{\hat{\ell}(k)-(j+1)}\big\rangle.

A similar argument to the one used to obtain (31) yields $\lim_{k\to+\infty}\tau_{\hat{\ell}(k)-(j+1)}\|w^{\hat{\ell}(k)-(j+1)}-x^{\hat{\ell}(k)-(j+1)}\|=0$. Thus, the first equality in (32) holds for $j+1$, which implies $\lim_{k\to+\infty}\|x^{\hat{\ell}(k)-j}-x^{\hat{\ell}(k)-(j+1)}\|=0$. Again, the uniform continuity of $f$ on $C_0$ gives

\lim_{k\to+\infty}f(x^{\hat{\ell}(k)-(j+1)}) = \lim_{k\to+\infty}f(x^{\hat{\ell}(k)-j}),

which shows that the second equality in (32) holds for $j+1$. From (30) and $\hat{\ell}(k):=\ell(k+M+2)$, we obtain $\hat{\ell}(k)-k-1\leq M+1$. Thus, taking into account that

x^{k+1} = x^{\hat{\ell}(k)} - \sum_{j=1}^{\hat{\ell}(k)-k-1}\tau_{\hat{\ell}(k)-j}\big(w^{\hat{\ell}(k)-j}-x^{\hat{\ell}(k)-j}\big),

it follows from the first equality in (32) that $\lim_{k\to+\infty}\|x^{k+1}-x^{\hat{\ell}(k)}\|=0$. Hence, due to $f$ being uniformly continuous on $C_0$ and $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ being convergent, we conclude that

\lim_{k\to+\infty}f(x^k) = \lim_{k\to+\infty}f(x^{\hat{\ell}(k)}) = \lim_{k\to+\infty}f(x^{\ell(k)}),

and, considering that $\nu_k=f(x^{\ell(k)})-f(x^k)$, the desired result follows. ∎

Remark 7.

Let $C_0:=\{x\in C:~f(x)\leq f(x^0)\}$ be bounded and let $(x^k)_{k\in\mathbb{N}}$ be generated by Algorithm 2 with the non-monotone line search (22) and $\nu_0=0$. Then, combining Propositions 11 and 13, we conclude that $(x^k)_{k\in\mathbb{N}}$ either is finite, terminating at a stationary point of problem (1), or is infinite, and every cluster point of $(x^k)_{k\in\mathbb{N}}$ is stationary for problem (1). Therefore, we have an alternative proof of the result obtained in [13, Theorem 2.1].

Due to Proposition 11, from now on we assume that the sequence $(x^k)_{k\in\mathbb{N}}$ generated by Algorithm 2 is infinite.

5 Full asymptotic convergence and complexity analysis

The purpose of this section is twofold. We first prove, under suitable assumptions, the full convergence of the sequence $(x^k)_{k\in\mathbb{N}}$, and then we present iteration-complexity bounds for it. To this end, we need to be more restrictive both with respect to the inexact projection in (18) and with respect to the tolerance parameter that controls the non-monotonicity of the line search used in (19). More precisely, we assume that in Step 1 of Algorithm 2:

  • A1.

    For all $k\in\mathbb{N}$, we take $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $\gamma_k=(1-\zeta_k)/2$.

It is worth recalling that, taking the parameter $\gamma_k=(1-\zeta_k)/2$, it follows from Lemma 6 that ${\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)\subset{\cal P}_{C,\zeta_k}^{D_k}(x^k,z^k)$. In addition, we also assume that in Step 2 of Algorithm 2:

  • A2.

    For all $k\in\mathbb{N}$, we take $\nu_k\geq 0$ such that $\sum_{k=0}^{+\infty}\nu_k<+\infty$.

It follows from Corollary 12 that the Armijo and the non-monotone Average-type line searches discussed in Section 3 satisfy Assumption A2.

5.1 Full asymptotic convergence analysis

In this section, we prove the full convergence of the sequence $(x^k)_{k\in\mathbb{N}}$ satisfying A1 and A2. We begin by establishing a basic inequality for $(x^k)_{k\in\mathbb{N}}$. To simplify notation, we define the constant

\xi := \frac{2\alpha_{\max}}{\sigma} > 0. \qquad (33)
Lemma 14.

For each $x\in C$, there holds

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\alpha_k\tau_k\big\langle\nabla f(x^k),\, x-x^k\big\rangle + \xi\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad \forall~k\in\mathbb{N}. \qquad (34)
Proof.

We know that $\|x^{k+1}-x\|_{D_k}^2=\|x^k-x\|_{D_k}^2+\|x^{k+1}-x^k\|_{D_k}^2-2\langle D_k(x^{k+1}-x^k),\,x-x^k\rangle$, for all $x\in C$ and $k\in\mathbb{N}$. Thus, using (20), we have

\|x^{k+1}-x\|_{D_k}^2 = \|x^k-x\|_{D_k}^2 + \tau_k^2\|w^k-x^k\|_{D_k}^2 - 2\tau_k\big\langle D_k(w^k-x^k),\, x-x^k\big\rangle, \qquad \forall~k\in\mathbb{N}. \qquad (35)

On the other hand, since $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$, it follows from Definition 3, with $y=x$, $D=D_k$, $u=x^k$, $v=z^k$, $w=w^k$, and $\gamma=\gamma_k$, that

\big\langle D_k(x^k-\alpha_k D_k^{-1}\nabla f(x^k)-w^k),\, x-w^k\big\rangle \leq \gamma_k\|w^k-x^k\|_{D_k}^2, \qquad \forall~k\in\mathbb{N}.

Hence, writing $x-w^k=(x-x^k)-(w^k-x^k)$ and rearranging terms in the last inequality, we have

-\big\langle D_k(w^k-x^k),\, x-x^k\big\rangle \leq \alpha_k\big\langle\nabla f(x^k),\, x-w^k\big\rangle - (1-\gamma_k)\|w^k-x^k\|_{D_k}^2.

Combining the last inequality with (35), we conclude that

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 - \tau_k\big[2(1-\gamma_k)-\tau_k\big]\|w^k-x^k\|_{D_k}^2 + 2\tau_k\alpha_k\big\langle\nabla f(x^k),\, x-w^k\big\rangle. \qquad (36)

Since $0\leq\gamma_k<(1-\zeta_{\min})/2<1/2$ and $\tau_k\in(0,1]$, we have $2(1-\gamma_k)-\tau_k>\zeta_{\min}>0$. Thus, it follows from (36) that

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\tau_k\alpha_k\big\langle\nabla f(x^k),\, x-w^k\big\rangle, \qquad \forall~k\in\mathbb{N}.

Thus, considering that $\langle\nabla f(x^k),x-w^k\rangle = \langle\nabla f(x^k),x-x^k\rangle + \langle\nabla f(x^k),x^k-w^k\rangle$ and taking into account (19), we conclude that

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\tau_k\alpha_k\big\langle\nabla f(x^k),\, x-x^k\big\rangle + \frac{2\alpha_k}{\sigma}\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad (37)

for all $k\in\mathbb{N}$. On the other hand, applying Lemma 9(iii) with $x=x^k$, $\alpha=\alpha_k$, $D=D_k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, we obtain $\langle\nabla f(x^k),w^k-x^k\rangle<0$, for all $k\in\mathbb{N}$. Therefore, it follows from (19) and (20) that $0<-\sigma\tau_k\langle\nabla f(x^k),w^k-x^k\rangle \leq f(x^k)-f(x^{k+1})+\nu_k$, for all $k\in\mathbb{N}$. Hence, due to $0<\alpha_k\leq\alpha_{\max}$, we have

\alpha_k\big[f(x^k)-f(x^{k+1})+\nu_k\big] \leq \alpha_{\max}\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad \forall k\in\mathbb{N}.

Therefore, (34) follows from the combination of the last inequality with (33) and (37). ∎

To proceed with the analysis of the behavior of the sequence $(x^k)_{k\in\mathbb{N}}$, we define the following auxiliary set

U := \left\{x\in C:~ f(x) \leq \inf_{k\in\mathbb{N}}\big(f(x^k)+\nu_k\big)\right\}.
Corollary 15.

Assume that $f$ is a convex function. If $U\neq\varnothing$, then $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1).

Proof.

Let $x\in U$. Since $f$ is convex, we have $0\geq f(x)-(f(x^k)+\nu_k)\geq\langle\nabla f(x^k),x-x^k\rangle-\nu_k$, for all $k\in\mathbb{N}$. Thus, $\langle\nabla f(x^k),x-x^k\rangle\leq\nu_k$, for all $k\in\mathbb{N}$. Using Lemma 14 and taking into account that $\tau_k\in(0,1]$ and $0<\alpha_{\min}\leq\alpha_k\leq\alpha_{\max}$, we obtain

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\alpha_{\max}\nu_k + \xi\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad \forall~k\in\mathbb{N}.

Defining $\epsilon_k = 2\alpha_{\max}\nu_k + \xi\big[f(x^k)-f(x^{k+1})+\nu_k\big]$, we have $\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + \epsilon_k$, for all $k\in\mathbb{N}$. On the other hand, summing $\epsilon_k$ for $k=0,1,\ldots,N$ and using Corollary 12, we have

\sum_{k=0}^{N}\epsilon_k \leq 2\alpha_{\max}\sum_{k=0}^{N}\nu_k + \xi\left(f(x^0)-f(x)+\sum_{k=0}^{N+1}\nu_k\right) < +\infty, \qquad \forall N\in\mathbb{N}.

Hence, $\sum_{k=0}^{+\infty}\epsilon_k<+\infty$. Thus, it follows from Definition 1 that $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $U$ with respect to the sequence $(D_k)_{k\in\mathbb{N}}$. Since $U$ is nonempty, it follows from Theorem 2 that $(x^k)_{k\in\mathbb{N}}$ is bounded, and therefore it has cluster points. Let $\bar{x}$ be a cluster point of $(x^k)_{k\in\mathbb{N}}$ and let $(x^{k_j})_{j\in\mathbb{N}}$ be a subsequence of $(x^k)_{k\in\mathbb{N}}$ such that $\lim_{j\to\infty}x^{k_j}=\bar{x}$. Considering that $f$ is continuous and $\lim_{k\to+\infty}\nu_k=0$, we have $\lim_{j\to\infty}(f(x^{k_j})+\nu_{k_j})=f(\bar{x})$. On the other hand, Lemma 10 implies that $(f(x^k)+\nu_k)_{k\in\mathbb{N}}$ is non-increasing. Thus, $\inf_{k\in\mathbb{N}}(f(x^k)+\nu_k)=\lim_{k\to\infty}(f(x^k)+\nu_k)=f(\bar{x})$. Hence, $\bar{x}\in U$, and Theorem 2 implies that $(x^k)_{k\in\mathbb{N}}$ converges to $\bar{x}$. The conclusion is obtained by using Proposition 11. ∎

Theorem 16.

If $f$ is a convex function and $(x^k)_{k\in\mathbb{N}}$ has no cluster points, then $\Omega^*=\varnothing$, $\lim_{k\to\infty}\|x^k\|=+\infty$, and $\inf_{k\in\mathbb{N}}f(x^k)=\inf\{f(x):x\in C\}$.

Proof.

Since $(x^k)_{k\in\mathbb{N}}$ has no cluster points, we have $\lim_{k\to\infty}\|x^k\|=+\infty$. Assume by contradiction that $\Omega^*\neq\varnothing$. Thus, there exists $\tilde{x}\in C$ such that $f(\tilde{x})\leq f(x^k)$ for all $k\in\mathbb{N}$. Therefore, $\tilde{x}\in U$. Using Corollary 15, we obtain that $(x^k)_{k\in\mathbb{N}}$ is convergent, contradicting $\lim_{k\to\infty}\|x^k\|=+\infty$. Therefore, $\Omega^*=\varnothing$. Now, we claim that $\inf_{k\in\mathbb{N}}f(x^k)=\inf\{f(x):x\in C\}$. If $\inf_{k\in\mathbb{N}}f(x^k)=-\infty$, the claim holds. Assume by contradiction that $\inf_{k\in\mathbb{N}}f(x^k)>\inf_{x\in C}f(x)$. Thus, there exists $\tilde{x}\in C$ such that $f(\tilde{x})\leq f(x^k)\leq f(x^k)+\nu_k$, for all $k\in\mathbb{N}$. Hence, $U\neq\varnothing$. Using Corollary 15, we have that $(x^k)_{k\in\mathbb{N}}$ is convergent, contradicting again $\lim_{k\to\infty}\|x^k\|=+\infty$ and concluding the proof. ∎

Corollary 17.

If $f$ is a convex function and $(x^k)_{k\in\mathbb{N}}$ has at least one cluster point, then $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1).

Proof.

Let $\bar{x}$ be a cluster point of the sequence $(x^k)_{k\in\mathbb{N}}$ and let $(x^{k_j})_{j\in\mathbb{N}}$ be a subsequence of $(x^k)_{k\in\mathbb{N}}$ such that $\lim_{j\to+\infty}x^{k_j}=\bar{x}$. Considering that $f$ is continuous and $\lim_{k\to+\infty}\nu_k=0$, we have $\lim_{j\to\infty}(f(x^{k_j})+\nu_{k_j})=f(\bar{x})$. On the other hand, Lemma 10 implies that $(f(x^k)+\nu_k)_{k\in\mathbb{N}}$ is non-increasing. Hence, we have $\inf_{k\in\mathbb{N}}(f(x^k)+\nu_k)=\lim_{k\to\infty}(f(x^k)+\nu_k)=f(\bar{x})$. Therefore, $\bar{x}\in U$. Using Corollary 15, we obtain that $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1). ∎

Theorem 18.

Assume that $f$ is a convex function and $\Omega^*\neq\varnothing$. Then, $(x^k)_{k\in\mathbb{N}}$ converges to an optimal solution of problem (1).

Proof.

If $\Omega^*\neq\varnothing$, then $U\neq\varnothing$. Therefore, Corollary 15 implies that $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1) and, due to $f$ being convex, this point is also an optimal solution. ∎

5.2 Iteration-complexity bound

In this section, we present some iteration-complexity bounds related to the sequence $(x^k)_{k\in\mathbb{N}}$ generated by Algorithm 2. For that, besides assuming A1 and A2, we also need the following assumption.

  • A3.

    The gradient $\nabla f$ of $f$ is Lipschitz continuous with constant $L>0$.

To simplify notation, we define the following positive constant

\tau_{\min} := \min\left\{1,\ \frac{\underline{\omega}(1-\sigma)}{\alpha_{\max}\mu L}\right\}. \qquad (38)
Lemma 19.

The step size $\tau_k$ in Algorithm 2 satisfies $\tau_k\geq\tau_{\min}$.

Proof.

First, we assume that $\tau_k=1$. In this case, we have $\tau_k\geq\tau_{\min}$ and the required inequality holds. Now, we assume that $\tau_k<1$. Thus, it follows from (19) that there exists $0<\hat{\tau}_k\leq\min\{1,\tau_k/\underline{\omega}\}$ such that

f\big(x^k+\hat{\tau}_k(w^k-x^k)\big) > f(x^k) + \sigma\hat{\tau}_k\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \nu_k. \qquad (39)

Considering that we are under assumption A3, we apply Lemma 1 to obtain

f\big(x^k+\hat{\tau}_k(w^k-x^k)\big) \leq f(x^k) + \hat{\tau}_k\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \frac{L}{2}\hat{\tau}_k^2\|w^k-x^k\|^2. \qquad (40)

Hence, the combination of (39) with (40) yields

(1-\sigma)\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \frac{L}{2}\hat{\tau}_k\|w^k-x^k\|^2 > \frac{\nu_k}{\hat{\tau}_k}. \qquad (41)

On the other hand, $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $\gamma_k=(1-\zeta_k)/2$, where $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$. Thus, applying Lemma 9(i) with $x=x^k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, we obtain

\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle \leq -\frac{1}{2\alpha_k}\|w^k-x^k\|_{D_k}^2.

Hence, considering that $\frac{1}{\mu}\|w^k-x^k\|^2 \leq \|w^k-x^k\|_{D_k}^2$ and $0<\alpha_k\leq\alpha_{\max}$, the last inequality implies

\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle \leq -\frac{1}{2\alpha_{\max}\mu}\|w^k-x^k\|^2.

The combination of the last inequality with (41) yields

\left(-\frac{1-\sigma}{2\alpha_{\max}\mu} + \frac{L}{2}\hat{\tau}_k\right)\|w^k-x^k\|^2 > \frac{\nu_k}{\hat{\tau}_k} \geq 0.

Thus, since $\hat{\tau}_k\leq\tau_k/\underline{\omega}$, we obtain $\tau_k\geq\underline{\omega}\hat{\tau}_k>\underline{\omega}(1-\sigma)/(\alpha_{\max}\mu L)\geq\tau_{\min}$, and the proof is concluded. ∎

Considering that ${\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)\subset{\cal P}_{C,\zeta_k}^{D_k}(x^k,z^k)$, it follows from Lemma 9(ii) that if $x^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$, then the point $x^k$ is stationary for problem (1). Since $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$, the quantity $\|w^k-x^k\|$ can be seen as a measure of the stationarity of the point $x^k$. In the next theorem, we present an iteration-complexity bound for this quantity, which is a constrained inexact version of [43, Theorem 1].

Theorem 20.

Let $\tau_{\min}$ be as defined in (38). Then, for every $N\in\mathbb{N}$, the following inequality holds

\min\left\{\|w^k-x^k\|:~k=0,1,\ldots,N-1\right\} \leq \sqrt{\frac{2\alpha_{\max}\mu\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{\sigma\tau_{\min}}}\,\frac{1}{\sqrt{N}}.
Proof.

Since $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $\gamma_k=(1-\zeta_k)/2$, where $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$, applying Lemma 9(i) with $x=x^k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, and taking into account that $(1/\mu)\|w^k-x^k\|^2\leq\|w^k-x^k\|_{D_k}^2$ and $0<\alpha_k\leq\alpha_{\max}$, we obtain

\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle \leq -\frac{1}{2\alpha_k}\|w^k-x^k\|_{D_k}^2 \leq -\frac{1}{2\alpha_{\max}\mu}\|w^k-x^k\|^2.

The definition of $\tau_k$ and (19) imply $f(x^{k+1})-f(x^k) \leq \sigma\tau_k\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \nu_k$. The combination of the last two inequalities together with Lemma 19 yields

f(x^k)-f(x^{k+1})+\nu_k \geq \frac{\sigma\tau_k}{2\alpha_{\max}\mu}\|w^k-x^k\|^2 \geq \frac{\sigma\tau_{\min}}{2\alpha_{\max}\mu}\|w^k-x^k\|^2.

Hence, summing the above inequality over $k=0,1,\ldots,N-1$, we conclude that

\sum_{k=0}^{N-1}\|w^k-x^k\|^2 \leq \frac{2\alpha_{\max}\mu\big[f(x^0)-f(x^{N+1})+\sum_{k=0}^{N}\nu_k\big]}{\sigma\tau_{\min}} \leq \frac{2\alpha_{\max}\mu\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{\sigma\tau_{\min}},

which implies the desired result. ∎

The next theorem presents an iteration-complexity bound for the sequence $(f(x^k))_{k\in\mathbb{N}}$ when $f$ is convex.

Theorem 21.

Let $f$ be a convex function on $C$. Then, for every $N\in\mathbb{N}$, there holds

\min\left\{f(x^k)-f^*:~k=0,1,\ldots,N-1\right\} \leq \frac{\|x^0-x^*\|_{D_0}^2 + \xi\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{2\alpha_{\min}\tau_{\min}}\,\frac{1}{N}.
Proof.

Using the first inequality in (17) and Lemma 19, we have $2\alpha_{\min}\tau_{\min}\leq 2\alpha_k\tau_k$, for all $k\in\mathbb{N}$. We also know from the convexity of $f$ that $\langle\nabla f(x^k),x^*-x^k\rangle \leq f^*-f(x^k)$, for all $k\in\mathbb{N}$. Thus, applying Lemma 14 with $x=x^*$, after some algebraic manipulations, we conclude that

2\alpha_{\min}\tau_{\min}\left[f(x^k)-f^*\right] \leq \|x^k-x^*\|_{D_k}^2 - \|x^{k+1}-x^*\|_{D_{k+1}}^2 + \xi\left[f(x^k)-f(x^{k+1})+\nu_k\right], \qquad k=0,1,\ldots.

Hence, summing the above inequality over $k=0,1,\ldots,N-1$, we obtain

2\alpha_{\min}\tau_{\min}\sum_{k=0}^{N-1}\left[f(x^k)-f^*\right] \leq \|x^0-x^*\|_{D_0}^2 - \|x^N-x^*\|_{D_N}^2 + \xi\Big[f(x^0)-f(x^N)+\sum_{k=0}^{N-1}\nu_k\Big].

Thus, $2\alpha_{\min}\tau_{\min} N \min\{f(x^k)-f^*:~k=0,1,\ldots,N-1\} \leq \|x^0-x^*\|_{D_0}^2 + \xi\big[f(x^0)-f^*+\sum_{k=0}^{N-1}\nu_k\big]$, which implies the desired inequality. ∎

We end this section with some results regarding the number of function evaluations performed by Algorithm 2. Note that the computational cost associated with each (outer) iteration involves a gradient evaluation, the computation of an (inexact) projection, and evaluations of $f$ at different trial points. Thus, we must account for the function evaluations at the rejected trial points.

Lemma 22.

Let $N_k$ be the number of function evaluations after $k\geq 0$ iterations of Algorithm 2. Then, $N_k \leq 1 + (k+1)\big[\log(\tau_{\min})/\log(\bar{\omega}) + 1\big]$.

Proof.

Let $j(k)\geq 0$ be the number of inner iterations in Step 2 of Algorithm 2 needed to compute the step size $\tau_k$. Thus, $\tau_k\leq\bar{\omega}^{j(k)}$. Using Lemma 19, we have $0<\tau_{\min}\leq\tau_k$ for all $k\in\mathbb{N}$, which implies that $\log(\tau_{\min})\leq\log(\tau_k)\leq j(k)\log(\bar{\omega})$, for all $k\in\mathbb{N}$. Hence, due to $\log(\bar{\omega})<0$, we have $j(k)\leq\log(\tau_{\min})/\log(\bar{\omega})$. Therefore,

N_k = 1 + \sum_{\ell=0}^{k}\big(j(\ell)+1\big) \leq 1 + \sum_{\ell=0}^{k}\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big) = 1 + (k+1)\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big),

where the first equality follows from the definition of $N_k$. ∎

Theorem 23.

For a given $\epsilon>0$, Algorithm 2 computes $x^k$ and $w^k$ such that $\|w^k-x^k\|\leq\epsilon$ using, at most,

1 + \left(\frac{2\alpha_{\max}\mu\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{\sigma\tau_{\min}}\,\frac{1}{\epsilon^2} + 1\right)\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big)

function evaluations.

Proof.

The proof follows straightforwardly from Theorem 20 and Lemma 22. ∎

Theorem 24.

Let $f$ be a convex function on $C$. For a given $\epsilon>0$, the number of function evaluations performed by Algorithm 2 is, at most,

1 + \left(\frac{\|x^0-x^*\|_{D_0}^2 + \xi\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{2\alpha_{\min}\tau_{\min}}\,\frac{1}{\epsilon} + 1\right)\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big),

to compute $x^k$ such that $f(x^k)-f^*\leq\epsilon$.

Proof.

The proof follows straightforwardly from Theorem 21 and Lemma 22. ∎

6 Numerical experiments

This section presents some numerical experiments that illustrate the potential advantages of considering inexact schemes in the SPG method. We discuss inexactness associated with both the projection onto the feasible set and the line search procedure.

Given two $m\times n$ matrices $A$ and $B$, with $m\geq n$, and $c\in\mathbb{R}$, we consider the matrix function $f:\mathbb{R}^{n\times n}\to\mathbb{R}$ given by:

f(X) := \frac{1}{2}\|AX-B\|_F^2 + \sum_{i=1}^{n-1}\left[c\left(X_{i+1,i+1}-X_{i,i}^2\right)^2 + (1-X_{i,i})^2\right], \qquad (42)

which combines a least squares term with a Rosenbrock-type function. Throughout this section, $X_{i,j}$ stands for the $ij$-element of the matrix $X$ and $\|\cdot\|_F$ denotes the Frobenius matrix norm, i.e., $\|A\|_F:=\sqrt{\langle A,A\rangle}$, where the inner product is given by $\langle A,B\rangle=\mathrm{tr}(A^TB)$. The test problems consist of minimizing $f$ in (42) subject to two different feasible sets, as described below. We point out that interesting applications in many areas emerge as constrained least squares matrix problems, see [13] and references therein. In turn, the Rosenbrock term was added in order to make the problems more challenging.
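As a reading aid, a direct transcription of (42) into code might look as follows; this is only an illustration of the formula (our actual experiments use the Matlab code mentioned later in this section), and the function name is ours.

```python
import numpy as np

def objective(X, A, B, c):
    """Evaluate (42): a least squares term plus a Rosenbrock-type
    penalty on the diagonal of X. A and B are m x n, X is n x n."""
    d = np.diag(X)  # diagonal entries X_{1,1}, ..., X_{n,n}
    least_squares = 0.5 * np.linalg.norm(A @ X - B, 'fro') ** 2
    rosenbrock = np.sum(c * (d[1:] - d[:-1] ** 2) ** 2 + (1.0 - d[:-1]) ** 2)
    return least_squares + rosenbrock
```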

Problem I:
\min f(X) \quad \mbox{s.t.} \quad X \in SDD^{+}, \quad L \leq X \leq U,

where $SDD^+$ is the cone of symmetric and diagonally dominant real matrices with positive diagonal, i.e.,

SDD^{+} := \{X\in\mathbb{R}^{n\times n} \mid X=X^T,\ X_{i,i}\geq\sum_{j\neq i}|X_{i,j}|\ \forall i\},

$L$ and $U$ are given $n\times n$ matrices, and $L\leq X\leq U$ means that $L_{i,j}\leq X_{i,j}\leq U_{i,j}$ for all $i,j$. The feasible set of Problem I was considered, for example, in the numerical tests of [13].

Problem II:
\min f(X) \quad \mbox{s.t.} \quad X \in \mathbb{S}^{n}_{+}, \quad \mathrm{tr}(X)=1,

where $\mathbb{S}^n_+$ is the cone of symmetric and positive semidefinite real matrices and $\mathrm{tr}(X)$ denotes the trace of $X$. The feasible set of Problem II is known as the spectrahedron and appears in several interesting applications; see, for example, [4, 41] and references therein.

It is easy to see that the feasible set of Problem I is a closed and convex set and that the feasible set of Problem II is a compact and convex set. As discussed in Section 2.1.1, Dykstra's alternating projection algorithm and the Frank-Wolfe algorithm can be used to compute inexact projections. The choice of the most appropriate method depends on the structure of the feasible set under consideration. For Problem I, we used the Dykstra's algorithm described in [13], see also [58]. In this case, $SDD^+=\cap_{i=1}^n SDD^+_i$, where

SDD^{+}_i := \{X\in\mathbb{R}^{n\times n} \mid X=X^T,\ X_{i,i}\geq\sum_{j\neq i}|X_{i,j}|\} \quad \mbox{for all } i=1,\ldots,n,

and the projection of a given $Z\in\mathbb{R}^{n\times n}$ onto $SDD^+$ consists of cycles of projections onto the convex sets $SDD^+_i$. Here, an iteration of Dykstra's algorithm should be understood as a complete cycle of projections onto all the sets $SDD^+_i$ and onto the box $\{X\in\mathbb{R}^{n\times n}\mid L\leq X\leq U\}$. Recall that this scheme provides an inexact projection as in Definition 2. Now consider Problem II. It is well known that computing an exact projection onto the spectrahedron (i.e., onto the feasible set of Problem II) requires a complete spectral decomposition, which can be prohibitive, especially in the large-scale case. In contrast, the computational cost of an iteration of the Frank-Wolfe algorithm described in Algorithm 1 is associated with an extreme eigenpair computation, see, for example, [48]. Unfortunately, despite its low cost per iteration, the Frank-Wolfe algorithm suffers from a slow convergence rate. Thus, we considered a variant of the Frank-Wolfe algorithm proposed in [4], which improves the convergence rate and the total time complexity of the classical Frank-Wolfe method. This algorithm, specialized for the projection problem over the spectrahedron, is carefully described in [1]. Without going into details, it replaces the top eigenpair computation in Frank-Wolfe with a top-$p$ (with $p\ll n$) eigenpair computation, where $p$ is an algorithmic parameter automatically selected. The total number of computed eigenpairs can be used to measure the computational effort to calculate projections. We recall that a Frank-Wolfe type scheme provides an inexact projection as in Definition 3.
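To make the cyclic scheme above concrete, the following is a minimal sketch of a generic Dykstra iteration for an intersection of convex sets, each represented by an exact projection operator; it is an illustration of the idea, with our own names, a simplified stopping test, and a hypothetical example, not the implementation of [13].

```python
import numpy as np

def dykstra(z, projections, max_cycles=100, tol=1e-8):
    """Approximately project z onto the intersection of convex sets,
    each given by its exact projection operator in `projections`.
    Stopped after finitely many cycles, the iterate serves as an
    inexact projection in the sense of Definition 2 (cf. the text)."""
    x = z.copy()
    increments = [np.zeros_like(z) for _ in projections]  # one per set
    for _ in range(max_cycles):
        x_old = x.copy()
        for i, proj in enumerate(projections):
            y = proj(x + increments[i])            # project corrected point
            increments[i] = x + increments[i] - y  # update correction term
            x = y
        if np.linalg.norm(x - x_old) <= tol:  # simple cycle-to-cycle test
            break
    return x

# Hypothetical example: nonnegative orthant intersected with {x : sum(x) = 1}
p1 = lambda v: np.maximum(v, 0.0)
p2 = lambda v: v - (np.sum(v) - 1.0) / v.size
w = dykstra(np.array([0.9, -0.3, 0.7]), [p1, p2])
```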

We note that Problems I and II can be seen as particular instances of problem (1) in which the number of variables is $(n^2+n)/2$. This means that they can be solved using Algorithm 2. We are especially interested in the spectral gradient version [14] of the SPG method, which is often associated with large-scale problems [15]. For this, we implemented Algorithm 2 considering $D_k:=I$ for all $k$, $\alpha_0:=\min(\alpha_{\max},\max(\alpha_{\min},1/\|\nabla f(x^0)\|))$ and, for $k>0$,

\alpha_k := \begin{cases} \min\big(\alpha_{\max},\max(\alpha_{\min},\ \langle s^k,s^k\rangle/\langle s^k,y^k\rangle)\big), & \mbox{if } \langle s^k,y^k\rangle>0,\\ \alpha_{\max}, & \mbox{otherwise}, \end{cases}

where $s^k:=X^k-X^{k-1}$, $y^k:=\nabla f(X^k)-\nabla f(X^{k-1})$, $\alpha_{\min}=10^{-10}$, and $\alpha_{\max}=10^{10}$. We set $\sigma=10^{-4}$, $\underline{\tau}=0.1$, $\bar{\tau}=0.9$, $\mu=1$ and $\nu_0=0$. The parameter $\delta_{\min}$ was chosen according to the line search used (see Section 3), while the parameter $\zeta_{\min}$ depends on the inexact projection scheme considered.
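For illustration, the spectral step-size rule above, with its safeguards, can be sketched as follows; the function name is ours, and the trace inner product plays the role of $\langle\cdot,\cdot\rangle$.

```python
import numpy as np

ALPHA_MIN, ALPHA_MAX = 1e-10, 1e10  # safeguards used in the experiments

def spectral_stepsize(X, X_prev, G, G_prev):
    """Barzilai-Borwein step <s,s>/<s,y> clipped to [ALPHA_MIN, ALPHA_MAX];
    falls back to ALPHA_MAX when the curvature <s,y> is not positive."""
    s = X - X_prev          # s^k
    y = G - G_prev          # y^k
    sy = np.sum(s * y)      # trace inner product <s^k, y^k>
    if sy > 0.0:
        return min(ALPHA_MAX, max(ALPHA_MIN, np.sum(s * s) / sy))
    return ALPHA_MAX
```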

In the line search scheme (Step 2 of Algorithm 2), if a step size $\tau_{\mathrm{trial}}$ is not accepted, then $\tau_{\mathrm{new}}$ is calculated using one-dimensional quadratic interpolation, employing the safeguard $\tau_{\mathrm{new}}\leftarrow\tau_{\mathrm{trial}}/2$ when the minimizer of the one-dimensional quadratic lies outside $[\underline{\omega}\tau_{\mathrm{trial}},\bar{\omega}\tau_{\mathrm{trial}}]$, see, for example, [54, Section 3.5]. Concerning the stopping criterion, all runs were stopped at an iterate $X^k$, declaring convergence, if

\max_{i,j}\big(|X^k_{i,j}-W^k_{i,j}|\big) \leq 10^{-6},

where $W^k$ is as in (18). Our codes are written in Matlab and are freely available at https://github.com/maxlemes/InexProj-SGM. All experiments were run on macOS 10.15.7 with a 3.7 GHz Intel Core i5 processor and 8 GB of RAM.
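A minimal sketch of the safeguarded backtracking just described might read as follows; here $\varphi$ denotes the one-dimensional function $\tau\mapsto f(x+\tau(w-x))$, $\nu$ is the non-monotone tolerance of (19), and the names and defaults are ours, following our reading of [54, Section 3.5].

```python
def backtrack(phi, phi0, dphi0, sigma=1e-4, nu=0.0,
              omega_lo=0.1, omega_hi=0.9):
    """Backtracking for the non-monotone Armijo-type condition
    phi(t) <= phi0 + sigma*t*dphi0 + nu, with phi0 = f(x) and
    dphi0 = <grad f(x), w - x> < 0. New trials come from quadratic
    interpolation, safeguarded to [omega_lo*t, omega_hi*t]."""
    t, ft = 1.0, phi(1.0)
    while ft > phi0 + sigma * t * dphi0 + nu:
        # minimizer of the quadratic model through phi(0), phi'(0), phi(t)
        denom = 2.0 * (ft - phi0 - dphi0 * t)
        t_new = -dphi0 * t * t / denom if denom > 0.0 else 0.5 * t
        if not (omega_lo * t <= t_new <= omega_hi * t):
            t_new = 0.5 * t  # safeguard fallback from the text
        t, ft = t_new, phi(t_new)
    return t

# usage (illustrative): t = backtrack(lambda t: f(x + t*(w - x)), f(x), g)
```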

6.1 Influence of the inexact projection

We begin the numerical experiments by checking the influence of the forcing parameters that control the degree of inexactness of the projections on the performance of the SPG method. In this first battery of tests, we used Armijo line searches, see Section 3.

We generated 10 instances of Problem I using $n=100$, $m=200$, and $c=10$. The matrices $A$ and $B$ were randomly generated with elements belonging to $[-1,1]$. We set $L\equiv 0$ and $U\equiv\infty$ as in [13]. For each instance, the starting point $X^0$ was randomly generated with elements belonging to $[0,1]$; it was then redefined as $(X^0+(X^0)^T)/2$, and its diagonal elements were in turn redefined as $2\sum_{j\neq i}^n X_{i,j}$, ensuring a feasible starting point. Figure 1 shows the average number of iterations, the average number of Dykstra's iterations, and the average CPU time in seconds needed to reach the solution for different choices of $\zeta_k$, namely, $\zeta_k=0.99$, $0.9$, $0.8$, $0.7$, $0.6$, $0.5$, $0.4$, $0.3$, $0.2$, and $0.1$ for all $k$. Remember that smaller values of $\zeta_k$ imply more inexact projections. As expected, the number of iterations of the SPG method tended to increase as $\zeta_k$ decreased, see Figure 1(a). On the other hand, the computational cost of an outer iteration (which can be measured by the number of Dykstra's iterations) tends to decrease for smaller values of $\zeta_k$. This suggests a trade-off, controlled by the parameter $\zeta_k$, between the number of iterations and the cost per iteration. Figure 1(b) shows that values of $\zeta_k$ close to 0.8 gave the best results, which is in line with the experiments reported in [13]. Finally, as can be seen in Figure 1(c), the CPU time was directly proportional to the number of Dykstra's iterations.

Figure 1: Results for 10 instances of Problem I using $n=100$, $m=200$, and $c=10$. Average number of (a) iterations and (b) Dykstra's iterations, and (c) average CPU time in seconds, needed to reach the solution for different choices of $\zeta_k$.

Although Algorithm 2 is stated only in terms of the parameter $\zeta_k$, we will directly consider the parameter $\gamma_k$ for Problem II, in which inexact projections are computed according to Definition 3. We randomly generated 10 instances of Problem II with $n=800$, $m=1000$, and $c=100$. Matrices $A$ and $B$ were obtained as for Problem I. In turn, the starting point $X^0$ was randomly generated with elements in the interval $[-1,1]$ and then redefined as $X^0(X^0)^T/\mathrm{tr}(X^0(X^0)^T)$, resulting in a feasible initial guess. Figure 2 shows the average number of iterations, the average number of computed eigenpairs, and the average CPU time in seconds needed to reach the solution for different constant choices of $\gamma_k$ ranging from $10^{-8}$ to $0.4999$. Now, higher values of $\gamma_k$ imply more inexact projections. Note that, for appropriate choices of $\zeta_k$, the adopted values of $\gamma_k$ fulfill Assumption A1 of Section 5. Concerning the number of iterations, as can be seen in Figure 2(a), the SPG method was not very sensitive to the choice of the parameter $\gamma_k$. Hence, since higher values of $\gamma_k$ imply cheaper iterations, the number of computed eigenpairs and the CPU time turned out to be inversely proportional to $\gamma_k$, see Figures 2(b)-(c). Thus, our experiments suggest that the best value for $\gamma_k$ is $0.4999$.

Figure 2: Results for 10 instances of Problem II using $n=800$, $m=1000$, and $c=100$. Average number of (a) iterations and (b) computed eigenpairs, and (c) average CPU time in seconds, needed to reach the solution for different choices of $\gamma_k$.

6.2 Influence of the line search scheme

The following experiments compare the performance of the SPG method with different strategies for computing the step sizes. We considered the Armijo, the Average-type, and the Max-type line searches discussed in Section 3. Based on our numerical experience, we employed the fixed value $\eta_k=0.85$ for the Average-type line search and $M=5$ for the Max-type line search. According to the results of the previous section, we used the fixed forcing parameters $\zeta_k=0.8$ and $\gamma_k=0.4999$ to compute inexact projections for Problems I and II, respectively. A sketch of how the two non-monotone tolerances can be maintained is given below.
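In the Max-type case, the reference value $f(x^{\ell(k)})$ is the maximum of the last $M$ objective values, so that $\nu_k=f(x^{\ell(k)})-f(x^k)$; in the Average-type case, a weighted average $C_k$ driven by $\eta_k$ is kept, in the spirit of [68]. The class names below are ours.

```python
from collections import deque

class MaxTypeReference:
    """Max-type: reference value is the max of the last M objective
    values, giving nu_k = reference - f(x^k) in the search condition."""
    def __init__(self, f0, M=5):
        self.hist = deque([f0], maxlen=M)
    def nu(self, fk):
        return max(self.hist) - fk
    def update(self, f_new):        # call once x^{k+1} is accepted
        self.hist.append(f_new)

class AverageTypeReference:
    """Average-type: weighted average C_k of past objective values
    with parameter eta (here 0.85), giving nu_k = C_k - f(x^k)."""
    def __init__(self, f0, eta=0.85):
        self.C, self.Q, self.eta = f0, 1.0, eta
    def nu(self, fk):
        return self.C - fk
    def update(self, f_new):        # Zhang-Hager-style recursion
        Q_new = self.eta * self.Q + 1.0
        self.C = (self.eta * self.Q * self.C + f_new) / Q_new
        self.Q = Q_new
```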

We randomly generated 100 instances of each problem as described in Section 6.1. The dimensions of the problems and the parameter $c$ in (42) were also chosen arbitrarily. For Problem I, we chose $100\leq n\leq 800$ and $10\leq c\leq 50$, whereas for Problem II, we chose $10\leq n\leq 200$ and $100\leq c\leq 1000$. In both cases, we set $m=2n$. We compare the strategies with respect to the number of function evaluations, the number of (outer) iterations, the total computational effort to calculate projections (measured by the number of Dykstra's iterations and of computed eigenpairs for Problems I and II, respectively), and the CPU time. The results are shown in Figures 3 and 4 for Problems I and II, respectively, using performance profiles [31].
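Recall that a performance profile [31] plots, for each solver, the fraction of problems on which its cost is within a factor $\tau$ of the best cost over all solvers; a minimal sketch, assuming the costs are collected in a problems-by-solvers array, is:

```python
import numpy as np

def performance_profile(T, taus):
    """T[p, s] = cost (e.g., CPU time) of solver s on problem p.
    Returns rho with rho[t, s] = fraction of problems whose ratio
    T[p, s] / min_s T[p, s] is at most taus[t] (Dolan-More [31])."""
    ratios = T / T.min(axis=1, keepdims=True)  # r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= tau)
                      for s in range(T.shape[1])] for tau in taus])
```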

For Problem I, with regard to the number of function evaluations, the SPG method with the Average-type line search was the most efficient among the tested strategies. Somewhat surprisingly, on this set of test problems, the Armijo strategy was better than the Max-type line search, see Figure 3(a). On the other hand, as can be seen in Figure 3(b), the Armijo strategy required fewer iterations than the non-monotone strategies. As expected, this was reflected in the number of Dykstra's iterations and in the CPU time, see Figures 3(c)-(d). We can conclude that, with respect to the last two criteria, the Armijo and Average-type strategies had similar performances, both superior to the Max-type strategy.

Now, concerning Problem II, Figure 4 shows that the non-monotone strategies outperformed the Armijo strategy in all the comparison criteria considered. Again, the Average-type strategy seems to be superior to the Max-type strategy.

Figure 3: Performance profiles for Problem I considering the SPG method with the Armijo, the Average-type, and the Max-type line search strategies, using as performance measurement: (a) number of function evaluations; (b) number of (outer) iterations; (c) number of Dykstra's iterations; (d) CPU time.
Figure 4: Performance profiles for Problem II considering the SPG method with the Armijo, the Average-type, and the Max-type line search strategies, using as performance measurement: (a) number of function evaluations; (b) number of (outer) iterations; (c) number of computed eigenpairs; (d) CPU time.

From all the above experiments, we conclude that the non-monotone line searches tend to require fewer objective function evaluations. However, this does not necessarily translate into computational savings, since there may be an increase in the number of iterations. In this case, the optimal efficiency of the algorithm comes from a compromise between these two conflicting tendencies. Overall, the use of non-monotone line search techniques is mainly justified when the computational effort of an iteration is dominated by the cost of evaluating the objective function.

7 Conclusions

In this paper, we studied the SGP method for solving constrained convex optimization problems, employing inexact projections onto the feasible set and a general non-monotone line search. We expect that this paper will contribute to the development of research in this field, mainly for solving large-scale problems in which the computational effort of an iteration is dominated by the projection onto the feasible set and the cost of evaluating the objective function. Indeed, the idea of using inexactness in the projection, as well as in the line search, instead of exact schemes is particularly interesting from a computational point of view. In particular, it is noteworthy that the Frank-Wolfe method has a low computational cost per iteration, resulting in high computational performance on different classes of compact sets, see [37, 48]. An issue that deserves attention is the search for new efficient methods, such as Frank-Wolfe's and Dykstra's methods, that generate inexact projections.

References

  • [1] A. A. Aguiar, O. P. Ferreira, and L. F. Prudente. Inexact gradient projection method with relative error tolerance. arXiv preprint arXiv:2101.11146, 2021.
  • [2] A. A. Aguiar, O. P. Ferreira, and L. F. Prudente. Subgradient method with feasible inexact projections for constrained convex optimization problems. Optimization, 0(0):1–23, 2021, https://doi.org/10.1080/02331934.2021.1902520.
  • [3] M. Ahookhosh, K. Amini, and S. Bahrami. A class of nonmonotone Armijo-type line search method for unconstrained optimization. Optimization, 61(4):387–404, 2012.
  • [4] Z. Allen-Zhu, E. Hazan, W. Hu, and Y. Li. Linear convergence of a Frank-Wolfe type algorithm over trace-norm balls. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6192–6201, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [5] R. Andreani, E. G. Birgin, J. M. Martínez, and J. Yuan. Spectral projected gradient and variable metric methods for optimization with linear inequalities. IMA J. Numer. Anal., 25(2):221–252, 04 2005, https://academic.oup.com/imajna/article-pdf/25/2/221/2090233/drh020.pdf.
  • [6] A. Auslender, P. J. S. Silva, and M. Teboulle. Nonmonotone projected gradient methods based on barrier and Euclidean distances. Comput. Optim. Appl., 38(3):305–327, 2007.
  • [7] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8(1):141–148, 1988.
  • [8] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.
  • [9] A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res., 59(2):235–247, 2004.
  • [10] J. Y. Bello Cruz and L. R. Lucambio Pérez. Convergence of a projected gradient method variant for quasiconvex objectives. Nonlinear Anal., 73(9):2917–2922, 2010.
  • [11] D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Control, 21(2):174–184, 1976.
  • [12] D. P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA, second edition, 1999.
  • [13] E. G. Birgin, J. M. Martínez, and M. Raydan. Inexact spectral projected gradient methods on convex sets. IMA J. Numer. Anal., 23(4):539–559, 2003.
  • [14] E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim., 10(4):1196–1211, 2000, https://doi.org/10.1137/S1052623497330963.
  • [15] E. G. Birgin, J. M. Martínez, and M. Raydan. Spectral projected gradient methods: Review and perspectives. J. Stat. Softw., 60(3):1–21, 2014.
  • [16] S. Bonettini, I. Loris, F. Porta, and M. Prato. Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM J. Optim., 26(2):891–921, 2016.
  • [17] S. Bonettini, F. Porta, M. Prato, S. Rebegoldi, V. Ruggiero, and L. Zanni. Recent advances in variable metric first-order methods. In Computational Methods for Inverse Problems in Imaging, pages 1–31. Springer, 2019.
  • [18] S. Bonettini and M. Prato. New convergence results for the scaled gradient projection method. Inverse Problems, 31(9):095008, 20, 2015.
  • [19] S. Bonettini, R. Zanella, and L. Zanni. A scaled gradient projection method for constrained image deblurring. Inverse Problems, 25(1):015002, 23, 2009.
  • [20] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
  • [21] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Society for Industrial and Applied Mathematics, 1994, https://epubs.siam.org/doi/pdf/10.1137/1.9781611970777.
  • [22] J. P. Boyle and R. L. Dykstra. A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in order restricted statistical inference (Iowa City, Iowa, 1985), volume 37 of Lect. Notes Stat., pages 28–47. Springer, Berlin, 1986.
  • [23] P. L. Combettes and B. C. Vũ. Variable metric quasi-Fejér monotonicity. Nonlinear Anal., 78:17–31, 2013.
  • [24] Y. H. Dai. On the nonmonotone line search. J. Optim. Theory Appl., 112(2):315–330, 2002.
  • [25] Y.-H. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math., 100(1):21–47, 2005.
  • [26] Y.-H. Dai and R. Fletcher. New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Math. Program., 106(3, Ser. A):403–421, 2006.
  • [27] Y.-H. Dai, W. W. Hager, K. Schittkowski, and H. Zhang. The cyclic Barzilai-Borwein method for unconstrained optimization. IMA J. Numer. Anal., 26(3):604–627, 2006.
  • [28] F. R. de Oliveira, O. P. Ferreira, and G. N. Silva. Newton’s method with feasible inexact projections for solving constrained generalized equations. Comput. Optim. Appl., 72(1):159–177, 2019.
  • [29] D. di Serafino, V. Ruggiero, G. Toraldo, and L. Zanni. On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput., 318:176–195, 2018.
  • [30] R. Díaz Millán, O. P. Ferreira, and L. F. Prudente. Alternating conditional gradient method for convex feasibility problems. arXiv preprint arXiv:1912.04247, 2019.
  • [31] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Math. Program., 91(2):201–213, 2002.
  • [32] R. L. Dykstra. An algorithm for restricted least squares regression. J. Amer. Statist. Assoc., 78(384):837–842, 1983.
  • [33] J. Fan, L. Wang, and A. Yan. An inexact projected gradient method for sparsity-constrained quadratic measurements regression. Asia-Pac. J. Oper. Res., 36(2):1940008, 21, 2019.
  • [34] N. S. Fazzio and M. L. Schuverdt. Convergence analysis of a nonmonotone projected gradient method for multiobjective optimization problems. Optim. Lett., 13(6):1365–1379, 2019.
  • [35] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Topics Signal Process., 1(4):586–597, Dec 2007.
  • [36] A. Friedlander, J. M. Martínez, B. Molina, and M. Raydan. Gradient method with retards and generalizations. SIAM J. Numer. Anal., 36(1):275–289, 1999.
  • [37] D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pages 541–549, 2015.
  • [38] M. Golbabaee and M. E. Davies. Inexact gradient projection and fast data driven compressed sensing. IEEE Trans. Inform. Theory, 64(10):6707–6721, 2018.
  • [39] A. A. Goldstein. Convex programming in Hilbert space. Bull. Amer. Math. Soc., 70:709–710, 1964.
  • [40] P. Gong, K. Gai, and C. Zhang. Efficient Euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing, 74(17):2754–2766, 2011.
  • [41] D. S. Gonçalves, M. A. Gomes-Ruggiero, and C. Lavor. A projected gradient method for optimization over density matrices. Optim. Methods Softw., 31(2):328–341, 2016, https://doi.org/10.1080/10556788.2015.1082105.
  • [42] D. S. Gonçalves, M. L. N. Gonçalves, and T. C. Menezes. Inexact variable metric method for convex-constrained optimization problems. Optimization, 0(0):1–19, 2021, https://doi.org/10.1080/02331934.2021.1887181.
  • [43] G. N. Grapiglia and E. W. Sachs. On the worst-case evaluation complexity of non-monotone line search algorithms. Comput. Optim. Appl., 68(3):555–577, 2017.
  • [44] G. N. Grapiglia and E. W. Sachs. A generalized worst-case complexity analysis for non-monotone line searches. Numer. Algorithms, 87(2):779–796, Jun 2021.
  • [45] L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton’s method. SIAM J. Numer. Anal., 23(4):707–716, 1986.
  • [46] N. J. Higham. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal., 22(3):329–343, 2002.
  • [47] A. N. Iusem. On the convergence properties of the projected gradient method for convex optimization. Comput. Appl. Math., 22(1):37–52, 2003.
  • [48] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 427–435, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • [49] E. Levitin and B. Polyak. Constrained minimization methods. USSR Comput. Math. Math. Phys., 6(5):1–50, 1966.
  • [50] G. Ma, Y. Hu, and H. Gao. An accelerated momentum based gradient projection method for image deblurring. In 2015 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pages 1–4, 2015.
  • [51] J. Mo, C. Liu, and S. Yan. A nonmonotone trust region method based on nonincreasing technique of weighted average of the successive function values. J. Comput. Appl. Math., 209(1):97–108, 2007.
  • [52] J. J. Moré. On the performance of algorithms for large-scale bound constrained problems. In Large-scale numerical optimization (Ithaca, NY, 1989), pages 32–45. SIAM, Philadelphia, PA, 1990.
  • [53] Y. Nesterov and A. Nemirovski. On first-order algorithms for $\ell_1$/nuclear norm minimization. Acta Numer., 22:509–575, 2013.
  • [54] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • [55] E. R. Panier and A. L. Tits. Avoiding the Maratos effect by means of a nonmonotone line search. I. General constrained problems. SIAM J. Numer. Anal., 28(4):1183–1195, 1991.
  • [56] A. Patrascu and I. Necoara. On the convergence of inexact projection primal first-order methods for convex minimization. IEEE Trans. Automat. Control, 63(10):3317–3329, 2018.
  • [57] J. Rasch and A. Chambolle. Inexact first-order primal-dual algorithms. Comput. Optim. Appl., 76(2):381–430, 2020.
  • [58] M. Raydan and P. Tarazaga. Primal and polar approach for computing the symmetric diagonally dominant projection. Numer. Linear Algebra Appl., 9(5):333–345, 2002, https://onlinelibrary.wiley.com/doi/pdf/10.1002/nla.277.
  • [59] E. W. Sachs and S. M. Sachs. Nonmonotone line searches for optimization algorithms. Control Cybernet., 40(4):1059–1075, 2011.
  • [60] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. J. Convex Anal., 19(4):1167–1192, 2012.
  • [61] S. Sra, S. Nowozin, and S. Wright. Optimization for Machine Learning. Neural information processing series. MIT Press, 2012.
  • [62] J. Tang, M. Golbabaee, and M. E. Davies. Gradient projection iterative sketch for large-scale constrained least-squares. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3377–3386, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [63] P. L. Toint. An assessment of nonmonotone linesearch techniques for unconstrained optimization. SIAM J. Sci. Comput., 17(3):725–739, 1996.
  • [64] S. Villa, S. Salzo, L. Baldassarre, and A. Verri. Accelerated and inexact forward-backward algorithms. SIAM J. Optim., 23(3):1607–1633, 2013.
  • [65] C. Wang, Q. Liu, and X. Yang. Convergence properties of nonmonotone spectral projected gradient methods. J. Comput. Appl. Math., 182(1):51–66, 2005.
  • [66] X. Yan, K. Wang, and H. He. On the convergence rate of scaled gradient projection method. Optimization, 67(9):1365–1376, 2018.
  • [67] F. Zhang, H. Wang, J. Wang, and K. Yang. Inexact primal–dual gradient projection methods for nonlinear optimization on convex set. Optimization, 69(10):2339–2365, 2020, https://doi.org/10.1080/02331934.2019.1696338.
  • [68] H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM J. Optim., 14(4):1043–1056, 2004.
  • [69] B. Zhou, L. Gao, and Y.-H. Dai. Gradient methods with adaptive step-sizes. Comput. Optim. Appl., 35(1):69–86, 2006.