
On the inexact scaled gradient projection method

O. P. Ferreira (This author was partially supported by FAPEG, CNPq Grants 302473/2017-3 and 408151/2016-1, and FAPEG/PRONEM-201710267000532; e-mail: [email protected]).    M. V. Lemes (e-mail: [email protected]).    L. F. Prudente (This author was partially supported by FAPEG/PRONEM-201710267000532; e-mail: [email protected]).
Abstract

The purpose of this paper is to present an inexact version of the scaled gradient projection method on a convex set, which is inexact in two senses. First, an inexact projection onto the feasible set is computed, allowing for an appropriate relative error tolerance. Second, an inexact non-monotone line search scheme is employed to compute a step size which defines the next iterate. It is shown that the proposed method has similar asymptotic convergence properties and iteration-complexity bounds as the usual scaled gradient projection method employing monotone line searches.

Keywords: Scaled gradient projection method, feasible inexact projection, constrained convex optimization.

AMS subject classification: 49J52, 49M15, 65H10, 90C30.

1 Introduction

This paper is devoted to the study of the scaled gradient projection (SGP) method with non-monotone line search to solve the general constrained convex optimization problem

\min\{f(x):~x\in C\}, (1)

where C is a closed and convex subset of \mathbb{R}^{n} and f:\mathbb{R}^{n}\to\mathbb{R} is a continuously differentiable function. Denote by f^{*}:=\inf_{x\in C}f(x) the optimal value of (1) and by \Omega^{*} its solution set, which we will assume to be nonempty unless the contrary is explicitly stated. Problem (1) is a basic problem of constrained optimization, which appears very often in various areas, including finance, machine learning, control theory, and signal processing; see for example [20, 21, 35, 46, 50, 61]. In recent problems considered in most of these areas, the datasets are large or high-dimensional and their solutions need to be approximated quickly with reasonable accuracy. It is well known that the SGP method with non-monotone line search is among the methods suitable for this task, as will be explained below.

The gradient projection method is what first comes to mind when we start from the ideas of the classic optimization methods in an attempt to deal with problem (1). In fact, this method is one of the oldest known optimization methods to solve (1); the study of its convergence properties goes back to the works of Goldstein [39] and Levitin and Polyak [49]. After these works, several variants of it have appeared over the years, resulting in a vast literature on the subject, including [10, 11, 12, 33, 35, 40, 47, 56, 67]. Additional references on this subject can be found in the recent review [17] and references therein. Among all the variants of the gradient projection method, the scaled version has been especially considered due to the flexibility provided in efficient implementations of the method; see [13, 5, 16, 18, 19]. In addition, its simplicity and easy implementation have attracted the attention of the scientific community working on optimization over the years. This method usually uses only first-order derivatives, which makes it very stable from a numerical point of view and therefore quite suitable for solving large-scale optimization problems, see [52, 53, 61, 62]. At each iteration, the SGP method moves along the direction of the negative scaled gradient, and then projects the obtained point onto the constraint set. The current iterate and such projection define a feasible descent direction, and a line search in this direction is performed to define the next iterate. It is worth mentioning that the performance of the SGP method is strongly related to each of the steps we have just mentioned. In fact, the scale matrix and the step size towards the negative scaled gradient are freely selected in order to improve the performance of the SGP method, but without increasing the cost of each iteration. Strategies for choosing both have their origin in the study of the gradient method for unconstrained optimization; papers addressing these issues include, but are not limited to, [7, 18, 26, 27, 29, 36, 69, 25, 49]. More details about selecting step sizes and scale matrices can be found in the recent review [17] and references therein.

In this paper, we are particularly interested in the main stages that make up the SGP method, namely, the projection calculation and the line search employed. It is well known that most of the computational burden of each iteration of the SGP method lies in the calculation of the projection. Indeed, the projection calculation requires, at each iteration, the solution of a quadratic problem restricted to the feasible set, which can lead to a substantial increase in the cost per iteration if the number of unknowns is large. For this reason, it may not be justified to carry out exact projections when the iterates are far from the solution of the problem. In order to reduce the computational effort spent on projections, inexact procedures that become more and more accurate when approaching the solution have been proposed, resulting in more efficient methods; see for example [13, 16, 38, 42, 60, 64, 57]. On the other hand, non-monotone line searches can improve the probability of finding a global optimal solution, in addition to potentially improving the speed of convergence of the method as a whole, see for example [24, 55, 63]. The concept of non-monotone line search, which we will use here as a synonym for inexact line search, was first proposed in [45], and later a new non-monotone search was proposed in [68]. After these papers, other non-monotone searches appeared, see for example [3, 51]. In [59], an interesting general framework for non-monotone line searches was proposed, and more recently modifications of it have been presented in [43, 44].

The purpose of the present paper is to present an inexact version of the SGP method, which is inexact in two senses. First, using a version of the scheme introduced in [13] and also a variation of the one that appeared in [64, Example 1], the inexact projection onto the feasible set is computed allowing an appropriate relative error tolerance. Second, using the inexact conceptual scheme for the line search introduced in [44, 59], a step size is computed to define the next iterate. More specifically, we initially show that the feasible inexact projection of [13] provides greater latitude than the projection of [64, Example 1]. In the first convergence result presented, we show that the SGP method using the projection proposed in [13] preserves the same partial convergence result as the classic method, that is, we prove that every accumulation point of the sequence generated by the SGP method is stationary for problem (1). Then, considering the inexact projection of [64, Example 1], and under mild assumptions, we establish full asymptotic convergence results and some complexity bounds. The presented analysis of the method is done using the general non-monotone line search scheme introduced in [44]. In this way, the proposed method can be adapted to several line searches and, in particular, allows obtaining several known versions of the SGP method as particular instances, including [10, 13, 47, 66]. Except for the particular case in which we assume that the SGP method employs the non-monotone line search introduced by [45], all other asymptotic convergence and complexity results are obtained without any assumption of compactness of the sub-level sets of the objective function. Finally, it is worth mentioning that the complexity results obtained for the SGP method with a general non-monotone line search are the same as in the classic case when the usual Armijo search is employed, namely, the complexity bound \mathcal{O}(1/\sqrt{k}) is unveiled for finding \epsilon-stationary points for problem (1) and, under convexity of f, the rate to find an \epsilon-optimal functional value is \mathcal{O}(1/k).

In Section 2, some notation and basic results used throughout the paper are presented. In particular, Section 2.1 is devoted to recalling the concept of relative feasible inexact projection, and some new properties of this concept are presented. Section 3 describes the SGP method with a general non-monotone line search, and some particular instances of it are presented. Partial asymptotic convergence results are presented in Section 4. Section 5 presents a full convergence result and iteration-complexity bounds. Some numerical experiments are provided in Section 6. Finally, some concluding remarks are made in Section 7.

2 Preliminaries and basic results

In this section, we introduce some notation and results used throughout our presentation. First we consider the index set \mathbb{N}:=\{0,1,2,\ldots\}, the usual inner product \langle\cdot,\cdot\rangle in \mathbb{R}^{n}, and the associated Euclidean norm \|\cdot\|. Let f:\mathbb{R}^{n}\to\mathbb{R} be a differentiable function and C\subseteq\mathbb{R}^{n}. The gradient \nabla f of f is said to be Lipschitz continuous in C with constant L>0 if \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|, for all x,y\in C. Combining this definition with the fundamental theorem of calculus, we obtain the following result, whose proof can be found in [12, Proposition A.24].

Lemma 1.

Let f:\mathbb{R}^{n}\to\mathbb{R} be a differentiable function and C\subseteq\mathbb{R}^{n}. Assume that \nabla f is Lipschitz continuous in C with constant L>0. Then, f(y)-f(x)-\langle\nabla f(x),y-x\rangle\leq(L/2)\|x-y\|^{2}, for all x,y\in C.

Assume that C is a convex set. The function f is said to be convex on C if f(y)\geq f(x)+\langle\nabla f(x),y-x\rangle, for all x,y\in C. We recall that a point \bar{x}\in C is a stationary point for problem (1) if

\langle\nabla f(\bar{x}),x-\bar{x}\rangle\geq 0,\qquad\forall~x\in C. (2)

Consequently, if f is a convex function on C, then (2) implies that \bar{x}\in\Omega^{*}. We end this section with some useful concepts for the analysis of the sequence generated by the scaled gradient method; for more details, see [23]. For that, let D be an n\times n positive definite matrix and \|\cdot\|_{D}:\mathbb{R}^{n}\rightarrow\mathbb{R} be the norm defined by

\|d\|_{D}:=\sqrt{\left\langle Dd,d\right\rangle},\quad\forall d\in\mathbb{R}^{n}. (3)

For a fixed constant \mu\geq 1, denote by {\cal D}_{\mu} the set of n\times n symmetric positive definite matrices with all eigenvalues contained in the interval [\frac{1}{\mu},\mu]. The set {\cal D}_{\mu} is compact. Moreover, for each D\in{\cal D}_{\mu}, it follows that D^{-1} also belongs to {\cal D}_{\mu}. Furthermore, for D\in{\cal D}_{\mu}, by (3), we obtain

\frac{1}{\mu}\|d\|^{2}\leq\|d\|^{2}_{D}\leq\mu\|d\|^{2},\qquad\forall d\in\mathbb{R}^{n}. (4)
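To make inequality (4) concrete, the snippet below builds a random matrix in {\cal D}_{\mu} and verifies the two-sided bound numerically. It is a minimal illustration in Python; the construction of D and all names are our own assumptions, not part of the paper.

```python
# Numerical check of (4): for D in D_mu (symmetric positive definite,
# eigenvalues in [1/mu, mu]), the D-norm squared is sandwiched between
# (1/mu)*||d||^2 and mu*||d||^2.
import numpy as np

rng = np.random.default_rng(0)
mu, n = 4.0, 5

# Build D in D_mu: random orthogonal Q and eigenvalues drawn from [1/mu, mu].
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigvals = rng.uniform(1.0 / mu, mu, size=n)
D = Q @ np.diag(eigvals) @ Q.T

d = rng.standard_normal(n)
norm_D_sq = d @ D @ d   # ||d||_D^2 = <Dd, d>, cf. (3)
norm_sq = d @ d         # Euclidean ||d||^2

assert norm_sq / mu <= norm_D_sq <= mu * norm_sq
```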
Definition 1.

Let (y^{k})_{k\in\mathbb{N}} be a sequence in \mathbb{R}^{n} and (D_{k})_{k\in\mathbb{N}} be a sequence in {\cal D}_{\mu}. The sequence (y^{k})_{k\in\mathbb{N}} is said to be quasi-Fejér convergent to a set W\subset\mathbb{R}^{n} with respect to (D_{k})_{k\in\mathbb{N}} if, for all w\in W, there exists a sequence (\epsilon_{k})_{k\in\mathbb{N}}\subset\mathbb{R} such that \epsilon_{k}\geq 0, \sum_{k\in\mathbb{N}}\epsilon_{k}<\infty, and \|y^{k+1}-w\|_{D_{k+1}}^{2}\leq\|y^{k}-w\|_{D_{k}}^{2}+\epsilon_{k}, for all k\in\mathbb{N}.

The main property of a quasi-Fejér convergent sequence is stated in the next result. Its proof can be found in [23] but, for the sake of completeness, we include it here.

Theorem 2.

Let (y^{k})_{k\in\mathbb{N}} be a sequence in \mathbb{R}^{n} and (D_{k})_{k\in\mathbb{N}} be a sequence in {\cal D}_{\mu}. If (y^{k})_{k\in\mathbb{N}} is quasi-Fejér convergent to a nonempty set W\subset\mathbb{R}^{n} with respect to (D_{k})_{k\in\mathbb{N}}, then (y^{k})_{k\in\mathbb{N}} is bounded. Furthermore, if a cluster point \bar{y} of (y^{k})_{k\in\mathbb{N}} belongs to W, then \lim_{k\rightarrow\infty}y^{k}=\bar{y}.

Proof.

Take w\in W. Definition 1 implies that \|y^{k}-w\|_{D_{k}}^{2}\leq\|y^{0}-w\|_{D_{0}}^{2}+\sum_{k\in\mathbb{N}}\epsilon_{k}<+\infty, for all k\in\mathbb{N}. Thus, using the first inequality in (4), we conclude that \|y^{k}-w\|\leq\sqrt{\mu}\|y^{k}-w\|_{D_{k}}, for all k\in\mathbb{N}. Therefore, combining the two previous inequalities, we conclude that (y^{k})_{k\in\mathbb{N}} is bounded. Let \bar{y}\in W be a cluster point of (y^{k})_{k\in\mathbb{N}} and (y^{k_{j}})_{j\in\mathbb{N}} be a subsequence of (y^{k})_{k\in\mathbb{N}} such that \lim_{j\to+\infty}y^{k_{j}}=\bar{y}. Take \delta>0. Since \lim_{j\to+\infty}y^{k_{j}}=\bar{y} and \sum_{k\in\mathbb{N}}\epsilon_{k}<\infty, there exists j_{0} such that \sum_{\ell\geq k_{j_{0}}}\epsilon_{\ell}<\delta/(2\mu), and j_{1}>j_{0} such that \|y^{k_{j}}-\bar{y}\|\leq\sqrt{\delta/(2\mu^{2})}, for all j\geq j_{1}. Hence, using the first inequality in (4) and taking into account that \|y^{k+1}-\bar{y}\|_{D_{k+1}}^{2}\leq\|y^{k}-\bar{y}\|_{D_{k}}^{2}+\epsilon_{k}, for all k\in\mathbb{N}, we have \|y^{k}-\bar{y}\|^{2}\leq\mu\|y^{k}-\bar{y}\|_{D_{k}}^{2}\leq\mu(\|y^{k_{j}}-\bar{y}\|_{D_{k_{j}}}^{2}+\sum_{\ell=k_{j}}^{k-1}\epsilon_{\ell}), for all k\geq k_{j} with j\geq j_{1}. Hence, using the second inequality in (4), we conclude that \|y^{k}-\bar{y}\|^{2}\leq\mu(\mu\|y^{k_{j}}-\bar{y}\|^{2}+\sum_{\ell=k_{j}}^{k-1}\epsilon_{\ell})<\mu(\frac{\delta}{2\mu}+\frac{\delta}{2\mu})=\delta, for all k\geq k_{j_{1}}. Therefore, \lim_{k\rightarrow\infty}y^{k}=\bar{y}. ∎

2.1 Relative feasible inexact projections

In this section, we recall two concepts of relative feasible inexact projection onto a closed and convex set, and also present some new properties of them which will be used throughout the paper. These concepts of inexact projection were introduced seeking to make the subproblem of computing the projection onto the feasible set more efficient; see for example [13, 60, 64]. Before presenting the inexact projection concepts that we will use, let us first recall the concept of exact projection with respect to a given norm. For that, throughout this section D\in{\cal D}_{\mu}. The exact projection of the point v\in\mathbb{R}^{n} onto C with respect to the norm \|\cdot\|_{D}, denoted by {\cal P}_{C}^{D}(v), is defined by

{\cal P}_{C}^{D}(v):=\arg\min_{z\in C}\|z-v\|^{2}_{D}. (5)
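Computationally, the exact projection (5) is a small quadratic program. The sketch below, which is only an illustration and not part of the paper, computes {\cal P}_{C}^{D}(v) when C is the unit simplex (an assumed choice of C) using SciPy's general-purpose SLSQP solver; in practice one would exploit the structure of C.

```python
# Minimal sketch: exact projection (5) onto the unit simplex in the D-norm,
# solved as a QP with SciPy's SLSQP.  The simplex is an assumed example of C.
import numpy as np
from scipy.optimize import minimize

def exact_projection_simplex(v, D):
    """Return argmin_{z >= 0, sum(z) = 1} ||z - v||_D^2."""
    n = v.size
    obj = lambda z: (z - v) @ D @ (z - v)          # ||z - v||_D^2
    jac = lambda z: 2.0 * D @ (z - v)
    cons = ({'type': 'eq', 'fun': lambda z: z.sum() - 1.0},)
    res = minimize(obj, np.full(n, 1.0 / n), jac=jac, method='SLSQP',
                   bounds=[(0.0, None)] * n, constraints=cons)
    return res.x
```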

The next result characterizes the exact projection; its proof can be found in [8, Theorem 3.14].

Lemma 3.

Let v,w\in\mathbb{R}^{n}. Then, w={\cal P}_{C}^{D}(v) if and only if w\in C and \left\langle D(v-w),y-w\right\rangle\leq 0, for all y\in C.

Remark 1.

It follows from Lemma 3 that \|{\cal P}_{C}^{D}(v)-{\cal P}_{C}^{D}(u)\|_{D}\leq\|v-u\|_{D}. Moreover, since D\in{\cal D}_{\mu}, by (4), we conclude that {\cal P}_{C}^{D}(\cdot) is Lipschitz continuous with constant L=\mu. Furthermore, if (D_{k})_{k\in\mathbb{N}}\subset{\cal D}_{\mu}, \lim_{k\to+\infty}z^{k}=\bar{z}, and \lim_{k\to+\infty}D_{k}=\bar{D}, then \lim_{k\to+\infty}{\cal P}_{C}^{D_{k}}(z^{k})={\cal P}_{C}^{\bar{D}}(\bar{z}); see [23, Proposition 4.2].

In the following, we recall the concept of a feasible inexact projection with respect to D\|\cdot\|_{D} relative to a fixed point.

Definition 2.

The feasible inexact projection mapping, with respect to the norm \|\cdot\|_{D}, onto C relative to a point u\in C and forcing parameter \zeta\in(0,1], denoted by {\cal P}_{C,\zeta}^{D}(u,\cdot):\mathbb{R}^{n}\rightrightarrows C, is the set-valued mapping defined as follows:

{\cal P}_{C,\zeta}^{D}(u,v):=\left\{w\in C:~\|w-v\|_{D}^{2}\leq\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}\right\}. (6)

Each point w\in{\cal P}_{C,\zeta}^{D}(u,v) is called a feasible inexact projection, with respect to the norm \|\cdot\|_{D}, of v onto C relative to u with forcing parameter \zeta\in(0,1].
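Condition (6) is straightforward to test once the exact projection is available. The following check is a sketch with illustrative names only; it verifies the inclusion w\in{\cal P}_{C,\zeta}^{D}(u,v) of Definition 2, while feasibility of w in C must be verified separately.

```python
import numpy as np

def norm_D_sq(x, D):
    """||x||_D^2 = <Dx, x>, cf. (3)."""
    return x @ D @ x

def is_feasible_inexact_projection(w, u, v, D, zeta, exact_proj):
    """Test condition (6):
    ||w - v||_D^2 <= zeta*||P_C^D(v) - v||_D^2 + (1 - zeta)*||u - v||_D^2.
    `exact_proj(v, D)` must return the exact projection P_C^D(v)."""
    p = exact_proj(v, D)
    lhs = norm_D_sq(w - v, D)
    rhs = zeta * norm_D_sq(p - v, D) + (1.0 - zeta) * norm_D_sq(u - v, D)
    return lhs <= rhs
```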

In the following, we show that the definition given above is nothing more than a reformulation of the concept of relative feasible inexact projection with respect to \|\cdot\|_{D} introduced in [13].

Remark 2.

Let u\in C, v\in\mathbb{R}^{n} and D be an n\times n positive definite matrix. Consider the quadratic function Q:\mathbb{R}^{n}\to\mathbb{R} defined by Q(z):=(1/2)\left\langle D(z-u),z-u\right\rangle+\left\langle D(u-v),z-u\right\rangle. Thus, letting \|\cdot\|_{D} be the norm defined by (3), some algebraic manipulations show that

\|z-v\|^{2}_{D}=2Q(z)+\|u-v\|^{2}_{D}. (7)

Hence, (7) and (5) imply that {\cal P}_{C}^{D}(v)=\arg\min_{z\in C}Q(z). Let \zeta\in(0,1]. Thus, by using (7), after some calculations, we can see that the following inexactness condition introduced in [13],

w\in C,\qquad Q(w)\leq\zeta Q({\cal P}_{C}^{D}(v)),

is equivalent to finding w\in C such that \|w-v\|_{D}^{2}\leq\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}, which corresponds to condition (6) in Definition 2.

The concept of feasible inexact projection in Definition 2 provides more latitude than the usual concept of exact projection (5). The next remark makes this more precise.

Remark 3.

Let \zeta\in(0,1] be a forcing parameter, and let C\subset\mathbb{R}^{n} and u\in C be as in Definition 2. First of all, note that {\cal P}_{C}^{D}(v)\in{\cal P}_{C,\zeta}^{D}(u,v). Therefore, {\cal P}_{C,\zeta}^{D}(u,v)\neq\varnothing, for all u\in C and v\in\mathbb{R}^{n}. Consequently, the set-valued mapping {\cal P}_{C,\zeta}^{D}(u,\cdot) as stated in (6) is well-defined. Moreover, for \zeta=1, we have {\cal P}_{C,1}^{D}(u,v)=\{{\cal P}_{C}^{D}(v)\}. In addition, if \underline{\zeta} and \bar{\zeta} are forcing parameters such that 0<\underline{\zeta}\leq\bar{\zeta}\leq 1, then {\cal P}_{C,\bar{\zeta}}^{D}(u,v)\subset{\cal P}_{C,\underline{\zeta}}^{D}(u,v).

Lemma 4.

Let v\in\mathbb{R}^{n}, u\in C and w\in{\cal P}_{C,\zeta}^{D}(u,v). Then, there holds

\left\langle D(v-w),y-w\right\rangle\leq\frac{1}{2}\|w-y\|_{D}^{2}+\frac{1}{2}\left[\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}-\|y-v\|_{D}^{2}\right],\qquad\forall y\in C.
Proof.

Let y\in C. Since 2\langle D(v-w),y-w\rangle=\|w-y\|_{D}^{2}+\|w-v\|_{D}^{2}-\|v-y\|_{D}^{2}, using (6) we have 2\langle D(v-w),y-w\rangle\leq\|w-y\|_{D}^{2}+\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}-\|v-y\|_{D}^{2}, which is equivalent to the desired inequality. ∎

Next, we recall a second concept of relative feasible inexact projection onto a closed convex set; see [2, 28]. The definition is as follows.

Definition 3.

The feasible inexact projection mapping, with respect to the norm \|\cdot\|_{D}, onto C relative to u\in C and forcing parameter \gamma\geq 0, denoted by {\cal R}_{C,\gamma}^{D}(u,\cdot):\mathbb{R}^{n}\rightrightarrows C, is the set-valued mapping defined as follows:

{\cal R}_{C,\gamma}^{D}(u,v):=\left\{w\in C:~\left\langle D(v-w),y-w\right\rangle\leq\gamma\|w-u\|_{D}^{2},\quad\forall~y\in C\right\}. (8)

Each point w\in{\cal R}_{C,\gamma}^{D}(u,v) is called a feasible inexact projection, with respect to the norm \|\cdot\|_{D}, of v onto C relative to u with forcing parameter \gamma\geq 0.
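Unlike (6), condition (8) quantifies over all y\in C; however, when C is a polytope, the linear function y\mapsto\langle D(v-w),y-w\rangle attains its maximum at a vertex, so the condition can be checked by enumerating vertices. The sketch below assumes C=\mbox{conv}\{y_{1},\dots,y_{m}\} is given by its vertex list; this representation and all names are our own illustrative assumptions.

```python
import numpy as np

def is_in_R(w, u, v, D, gamma, vertices):
    """Test condition (8) for a polytope C = conv(vertices):
    max_{y in C} <D(v - w), y - w> <= gamma * ||w - u||_D^2."""
    g = D @ (v - w)
    # The maximum of a linear function over conv(vertices) is attained at a vertex.
    max_linear = max(g @ (y - w) for y in vertices)
    return max_linear <= gamma * ((w - u) @ D @ (w - u))
```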

The concept of feasible inexact projection in Definition 3 also provides more latitude than the usual concept of exact projection. Next, we present some remarks about this concept.

Remark 4.

Let \gamma\geq 0 be a forcing parameter, and let C\subset\mathbb{R}^{n} and u\in C be as in Definition 3. For all v\in\mathbb{R}^{n}, it follows from (8) and Lemma 3 that {\cal R}_{C,0}^{D}(u,v)=\{{\cal P}_{C}^{D}(v)\} is the exact projection of v onto C. Moreover, {\cal P}_{C}^{D}(v)\in{\cal R}_{C,\gamma}^{D}(u,v), from which we conclude that {\cal R}_{C,\gamma}^{D}(u,v)\neq\varnothing, for all u\in C and v\in\mathbb{R}^{n}. Consequently, the set-valued mapping {\cal R}_{C,\gamma}^{D}(u,\cdot) as stated in (8) is well-defined.

The next lemma is a variation of [30, Lemma 6]. It will allow us to relate Definitions 2 and 3.

Lemma 5.

Let v\in\mathbb{R}^{n}, u\in C, 0\leq\gamma<1/2 and w\in{\cal R}_{C,\gamma}^{D}(u,v). Then, there holds

\|w-x\|_{D}^{2}\leq\|x-v\|_{D}^{2}+\frac{2\gamma}{1-2\gamma}\|u-v\|_{D}^{2}-\frac{1}{1-2\gamma}\|w-v\|_{D}^{2},

for all x\in C.

Proof.

First note that \|w-x\|_{D}^{2}=\|x-v\|_{D}^{2}-\|w-v\|_{D}^{2}+2\langle D(v-w),x-w\rangle. Since w\in{\cal R}_{C,\gamma}^{D}(u,v) and x\in C, combining the last equality with (8), we obtain

\|w-x\|_{D}^{2}\leq\|x-v\|_{D}^{2}-\|w-v\|_{D}^{2}+2\gamma\|w-u\|_{D}^{2}. (9)

On the other hand, we also have \|w-u\|_{D}^{2}=\|u-v\|_{D}^{2}-\|w-v\|_{D}^{2}+2\langle D(v-w),u-w\rangle. Due to w\in{\cal R}_{C,\gamma}^{D}(u,v) and u\in C, using (8) and considering that 0\leq\gamma<1/2, we have

\|w-u\|_{D}^{2}\leq\frac{1}{1-2\gamma}\|u-v\|_{D}^{2}-\frac{1}{1-2\gamma}\|w-v\|_{D}^{2}.

Therefore, substituting the last inequality into (9), we obtain the desired inequality. ∎

In the following lemma, we present a relationship between Definitions 2 and 3.

Lemma 6.

Let v\in\mathbb{R}^{n}, u\in C, \gamma\geq 0 and \zeta\in(0,1]. If 0\leq\gamma<1/2 and \zeta=1-2\gamma, then

{\cal R}_{C,\gamma}^{D}(u,v)\subset{\cal P}_{C,\zeta}^{D}(u,v).
Proof.

Let w\in{\cal R}_{C,\gamma}^{D}(u,v). Applying Lemma 5 with x={\cal P}_{C}^{D}(v), we have

\|w-{\cal P}_{C}^{D}(v)\|_{D}^{2}\leq\|v-{\cal P}_{C}^{D}(v)\|_{D}^{2}+\frac{2\gamma}{1-2\gamma}\|u-v\|_{D}^{2}-\frac{1}{1-2\gamma}\|w-v\|_{D}^{2}.

After some algebraic manipulations in the last inequality, we obtain

\|w-v\|_{D}^{2}\leq(1-2\gamma)\|v-{\cal P}_{C}^{D}(v)\|_{D}^{2}+2\gamma\|u-v\|_{D}^{2}-(1-2\gamma)\|w-{\cal P}_{C}^{D}(v)\|_{D}^{2}.

Therefore, considering that 0γ<1/20\leq\gamma<1/2 and ζ=12γ\zeta=1-2\gamma, the result follows from Definition 2. ∎

Remark 5.

Under the conditions of Lemma 6, there exist 0\leq\gamma<1/2 and \zeta=1-2\gamma such that {\cal P}_{C,\zeta}^{D}(u,v)\nsubseteq{\cal R}_{C,\gamma}^{D}(u,v). Indeed, let \gamma=3/8, \zeta=1/4, and \bar{w}=\frac{1}{2}({\cal P}_{C}^{D}(v)+u); then

\|\bar{w}-v\|_{D}^{2}=\frac{1}{4}\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+\frac{1}{4}\|u-v\|_{D}^{2}+\frac{1}{2}\langle D({\cal P}_{C}^{D}(v)-v),u-v\rangle.

Since {\cal P}_{C}^{D}(v) is the exact projection of v, we have \langle D({\cal P}_{C}^{D}(v)-v),u-v\rangle\leq\|u-v\|_{D}^{2}. Combining this inequality with the last equality and Definition 2, we conclude that \bar{w}\in{\cal P}_{C,\zeta}^{D}(u,v). Now, letting w_{t}=t{\cal P}_{C}^{D}(v)+(1-t)\bar{w} with 0<t<1, after some algebraic manipulations we have

\langle D(v-\bar{w}),w_{t}-\bar{w}\rangle=t\|\bar{w}-u\|_{D}^{2}-\frac{t}{2}\langle D(v-{\cal P}_{C}^{D}(v)),u-{\cal P}_{C}^{D}(v)\rangle.

Thus, it follows from Lemma 3 that \langle D(v-\bar{w}),w_{t}-\bar{w}\rangle\geq t\|\bar{w}-u\|_{D}^{2}. Hence, taking t>3/8, we conclude that \bar{w}\not\in{\cal R}_{C,\gamma}^{D}(u,v). Therefore, considering that \bar{w}\in{\cal P}_{C,\zeta}^{D}(u,v), the statement follows.

It follows from Remark 5 that, in general, {\cal P}_{C,\zeta}^{D}(u,v)\nsubseteq{\cal R}_{C,\gamma}^{D}(u,v). However, whenever C is a bounded set, we will show that for each fixed 0<\gamma<1/2 there exists 0<\zeta<1 such that {\cal P}_{C,\zeta}^{D}(u,v)\subseteq{\cal R}_{C,\gamma}^{D}(u,v). For that, we first need the next lemma.

Lemma 7.

Let v\in\mathbb{R}^{n}, u\in C and 0<\gamma<1/2. Assume that C is a bounded set and take

0<\varepsilon<\frac{\gamma\|u-\mathcal{P}_{C}^{D}(v)\|_{D}^{2}}{1-\gamma+\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+2\gamma\|u-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C}, (10)

where \mbox{diam}\,C denotes the diameter of C. Then, \{w\in C:~\|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon\}\subset\mathcal{R}_{C,\gamma}^{D}(u,v).

Proof.

Take \varepsilon satisfying (10) and w\in C such that \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon. For all z\in C, we have

\langle D(v-w),z-w\rangle=\langle D(v-\mathcal{P}_{C}^{D}(v)),z-\mathcal{P}_{C}^{D}(v)\rangle+\langle D(v-\mathcal{P}_{C}^{D}(v)),\mathcal{P}_{C}^{D}(v)-w\rangle+\langle D(\mathcal{P}_{C}^{D}(v)-w),z-\mathcal{P}_{C}^{D}(v)\rangle+\|\mathcal{P}_{C}^{D}(v)-w\|_{D}^{2}.

Using Lemma 3, we have \langle D(v-\mathcal{P}_{C}^{D}(v)),z-\mathcal{P}_{C}^{D}(v)\rangle\leq 0. Thus, the last equality becomes

\langle D(v-w),z-w\rangle\leq\langle D(v-\mathcal{P}_{C}^{D}(v)),\mathcal{P}_{C}^{D}(v)-w\rangle+\langle D(\mathcal{P}_{C}^{D}(v)-w),z-\mathcal{P}_{C}^{D}(v)\rangle+\|\mathcal{P}_{C}^{D}(v)-w\|_{D}^{2}.

By using the Cauchy-Schwarz inequality, we conclude from the last inequality that

\langle D(v-w),z-w\rangle\leq\|w-\mathcal{P}_{C}^{D}(v)\|_{D}\left(\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\|z-\mathcal{P}_{C}^{D}(v)\|_{D}\right)+\|w-\mathcal{P}_{C}^{D}(v)\|_{D}^{2}.

Since \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon and \|z-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\mbox{diam}\,C, the last inequality implies that

\langle D(v-w),z-w\rangle\leq\varepsilon\left(\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\varepsilon^{2}. (11)

On the other hand, if \varepsilon satisfies (10), then

\varepsilon\left(1-\gamma+\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\gamma\varepsilon^{2}<\gamma\|u-\mathcal{P}_{C}^{D}(v)\|_{D}^{2}-2\gamma\varepsilon\|u-\mathcal{P}_{C}^{D}(v)\|_{D}+\gamma\varepsilon^{2},

hence

\varepsilon\left(1-\gamma+\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\gamma\varepsilon^{2}<\gamma\left(\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\right)^{2}.

Since \gamma,\varepsilon<1, we have \varepsilon^{2}<\varepsilon(1-\gamma)+\gamma\varepsilon^{2}, and we can conclude that

\varepsilon\left(\|v-\mathcal{P}_{C}^{D}(v)\|_{D}+\mbox{diam}\,C\right)+\varepsilon^{2}<\gamma\left(\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\right)^{2}.

It follows from (11) that

\langle D(v-w),z-w\rangle\leq\gamma\left(\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\right)^{2}. (12)

Using again that \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon and the triangle inequality, we have

0<\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\varepsilon\leq\|u-\mathcal{P}_{C}^{D}(v)\|_{D}-\|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\|u-w\|_{D}.

Hence, taking into account (12), we conclude that \langle D(v-w),z-w\rangle\leq\gamma\|u-w\|_{D}^{2}. Therefore, it follows from Definition 3 that w\in\mathcal{R}_{C,\gamma}^{D}(u,v). ∎

Proposition 8.

Let v\in\mathbb{R}^{n}, u\in C and assume that C is a bounded set. Then, for each 0<\gamma<1/2, there exists 0<\zeta<1 such that {\cal P}_{C,\zeta}^{D}(u,v)\subseteq{\cal R}_{C,\gamma}^{D}(u,v).

Proof.

It follows from Lemma 7 that, given 0<\gamma<1/2, there exists \varepsilon>0 such that, for all w\in C with \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon, we have w\in\mathcal{R}_{C,\gamma}^{D}(u,v). On the other hand, we can see from (6) that, as \zeta\to 1, the diameter of {\cal P}_{C,\zeta}^{D}(u,v) tends to zero and this set shrinks to \{\mathcal{P}_{C}^{D}(v)\}. Hence, there exists \zeta close to 1 such that \mbox{diam}({\cal P}_{C,\zeta}^{D}(u,v))<\varepsilon/2; in particular, \|w-\mathcal{P}_{C}^{D}(v)\|_{D}\leq\varepsilon for all w\in{\cal P}_{C,\zeta}^{D}(u,v), and therefore {\cal P}_{C,\zeta}^{D}(u,v)\subset{\cal R}_{C,\gamma}^{D}(u,v). ∎

Next, we present some important properties of inexact projections, which will be useful in the sequel.

Lemma 9.

Let x\in C, \alpha>0 and z(\alpha)=x-\alpha D^{-1}\nabla f(x). Take w(\alpha)\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)) with \zeta\in(0,1]. Then, the following hold:

  • (i)

    \langle\nabla f(x),w(\alpha)-x\rangle\leq-\frac{1}{2\alpha}\|w(\alpha)-x\|_{D}^{2}+\frac{\zeta}{2\alpha}\left[\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-\|x-z(\alpha)\|_{D}^{2}\right];

  • (ii)

    the point x is stationary for problem (1) if and only if x\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha));

  • (iii)

    if x\in C is a nonstationary point for problem (1), then \langle\nabla f(x),w(\alpha)-x\rangle<0. Equivalently, if there exists \bar{\alpha}>0 such that \langle\nabla f(x),w(\bar{\alpha})-x\rangle\geq 0, then x is stationary for problem (1).

Proof.

Since w(α)𝒫C,ζD(x,z(α))w(\alpha)\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)), applying Lemma 4 with w=w(α)w=w(\alpha), v=z(α)v=z(\alpha), y=xy=x, and u=xu=x, we conclude, after some algebraic manipulations, that

D(z(α)w(α)),xw(α)12w(α)xD2+ζ2[𝒫CD(z(α))z(α)D2xz(α)D2].\left\langle D(z(\alpha)-w(\alpha)),x-w(\alpha)\right\rangle\leq\frac{1}{2}\|w(\alpha)-x\|_{D}^{2}+\frac{\zeta}{2}\left[\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-\|x-z(\alpha)\|_{D}^{2}\right].

Substituting z(α)=xαf(x)z(\alpha)=x-\alpha\nabla f(x) in the left hand side of the last inequality, some manipulations yield the inequality of item (i)(i). For proving item (ii)(ii), we first assume that xx is stationary for problem (1). In this case, (2) implies that f(x),w(α)x0\langle\nabla f(x),w(\alpha)-x\rangle\geq 0. Hence, due to 𝒫CD(z(α))z(α)Dxz(α)D\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}\leq\|x-z(\alpha)\|_{D}, item (i)(i) implies

12αw(α)xD2ζ2α[𝒫CD(z(α))z(α)D2xz(α)D2]0.\frac{1}{2\alpha}\|w(\alpha)-x\|_{D}^{2}\leq\frac{\zeta}{2\alpha}\left[\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-\|x-z(\alpha)\|_{D}^{2}\right]\leq 0.

Since α>0\alpha>0 and ζ(0,1]\zeta\in(0,1], the last inequality yields w(α)=xw(\alpha)=x. Therefore, x𝒫C,ζD(x,z(α))x\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)). Reciprocally, if x𝒫C,ζD(x,z(α))x\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)), then Definition 2 implies that

xz(α)D2ζ𝒫CD(z(α))z(α)D2+(1ζ)xz(α)D2.\|x-z(\alpha)\|_{D}^{2}\leq\zeta\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}+(1-\zeta)\|x-z(\alpha)\|_{D}^{2}.

Hence, 0ζ(𝒫CD(z(α))z(α)D2(xz(α)D2)0\leq\zeta\left(\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}^{2}-(\|x-z(\alpha)\|_{D}^{2}\right). Considering that ζ(0,1]\zeta\in(0,1] we have

xz(α)D𝒫CD(z(α))z(α)D.\|x-z(\alpha)\|_{D}\leq\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}.

Thus, due to exact projection with respect to the norm D\|\cdot\|_{D} be unique and z(α)=xD1αf(x)z(\alpha)=x-D^{-1}\alpha\nabla f(x), we have 𝒫CD(xαD1f(x))=x{\cal P}_{C}^{D}(x-\alpha D^{-1}\nabla f(x))=x. Hence, xx is the solution of the constrained optimization problem minyCyz(α)D2\min_{y\in C}\|y-z(\alpha)\|^{2}_{D}, which taking into account that α>0\alpha>0 implies (2). Therefore, xx is stationary point for problem (1). Finally, to prove item (iii)(iii), take xx a nonstationary point for problem (1). Thus, by item (ii)(ii), x𝒫C,ζD(x,z(α))x\notin{\cal P}_{C,\zeta}^{D}(x,z(\alpha)) and taking into account that w(α)𝒫C,ζD(x,z(α))w(\alpha)\in{\cal P}_{C,\zeta}^{D}(x,z(\alpha)), we conclude that xw(α)x\neq w(\alpha). Since 𝒫CD(z(α))z(α)Dxz(α)D\|{\cal P}_{C}^{D}(z(\alpha))-z(\alpha)\|_{D}\leq\|x-z(\alpha)\|_{D}, α>0\alpha>0 and ζ(0,1]\zeta\in(0,1], it follows from item (i)(i) that f(x),w(α)x<0\Big{\langle}\nabla f(x),w(\alpha)-x\Big{\rangle}<0 and the first sentence is proved. Finally, note that the second sentence is the contrapositive of the first sentence. ∎

Finally, it is worth mentioning that Definitions 2 and 3, introduced respectively in [13] and [28], are relative inexact concepts, while the concept introduced in [60, 64] is absolute.

2.1.1 Practical computation of inexact projections

In this section, for a given v\in\mathbb{R}^{n} and u\in C, we discuss how to compute a point w\in C belonging to {\cal P}_{C,\zeta}^{D}(u,v) or {\cal R}_{C,\gamma}^{D}(u,v). We recall that Lemma 6 implies that {\cal P}_{C,\zeta}^{D}(u,v) has more latitude than {\cal R}_{C,\gamma}^{D}(u,v), i.e., {\cal R}_{C,\gamma}^{D}(u,v)\subset{\cal P}_{C,\zeta}^{D}(u,v).

We begin our discussion by showing how a point w\in{\cal P}_{C,\zeta}^{D}(u,v) can be computed without knowing the point {\cal P}_{C}^{D}(v). Considering that this discussion has already been covered in [13, Section 3, Algorithm 3.1], we will limit ourselves to giving a general idea of how this task is carried out; see also [16, Section 5.1]. The idea is to use an external procedure capable of computing two sequences (c_{\ell})_{\ell\in\mathbb{N}}\subset\mathbb{R} and (w^{\ell})_{\ell\in\mathbb{N}}\subset C satisfying the following conditions:

c_{\ell}\leq\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2},\quad\forall\ell\in\mathbb{N},\qquad\lim_{\ell\to+\infty}c_{\ell}=\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2},\qquad\lim_{\ell\to+\infty}w^{\ell}={\cal P}_{C}^{D}(v). (13)

In this case, if v\notin C, then we have \|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}-\|u-v\|_{D}^{2}<0. Hence, given an arbitrary \zeta\in(0,1), the second condition in (13) implies that there exists \hat{\ell} such that

\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}-\|u-v\|_{D}^{2}<\zeta(c_{\hat{\ell}}-\|u-v\|_{D}^{2}).

Moreover, by using the last condition in (13), we conclude that there exists \bar{\ell}>\hat{\ell} such that

\|w^{\bar{\ell}}-v\|_{D}^{2}-\|u-v\|_{D}^{2}<\zeta(c_{\bar{\ell}}-\|u-v\|_{D}^{2}), (14)

which, using the inequality in (13), yields \|w^{\bar{\ell}}-v\|_{D}^{2}<\zeta\|{\cal P}_{C}^{D}(v)-v\|_{D}^{2}+(1-\zeta)\|u-v\|_{D}^{2}. Hence, Definition 2 implies that w^{\bar{\ell}}\in{\cal P}_{C,\zeta}^{D}(u,v). Therefore, (14) can be used as a stopping criterion to compute a feasible inexact projection, with respect to the norm \|\cdot\|_{D}, of v onto C relative to u with forcing parameter \zeta\in(0,1]. For instance, it follows from [13, Theorem 3.2, Lemma 3.1] that sequences (c_{\ell})_{\ell\in\mathbb{N}}\subset\mathbb{R} and (w^{\ell})_{\ell\in\mathbb{N}}\subset C satisfying (13) can be computed by using Dykstra's algorithm [22, 32], whenever D is the identity matrix and C=\cap_{i=1}^{p}C_{i}, where the C_{i} are closed and convex sets for which the exact projection {\cal P}_{C_{i}}^{D}(v) is easy to obtain, for all i=1,\dots,p.
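For concreteness, the following is a minimal sketch of Dykstra's alternating projection scheme for D equal to the identity and C=\cap_{i=1}^{p}C_{i}, which produces the iterates w^{\ell} used above. The computable lower bounds c_{\ell} of [13] are problem-specific, so this sketch simply runs a fixed number of rounds instead of testing (14); all names are illustrative.

```python
# Dykstra's algorithm (D = identity): w converges to P_C(v) for
# C = C_1 ∩ ... ∩ C_p, given exact Euclidean projectors onto each C_i.
import numpy as np

def dykstra(v, projectors, n_rounds=100):
    """projectors: list of callables, each the exact projection onto one C_i."""
    w = v.copy()
    corrections = [np.zeros_like(v) for _ in projectors]
    for _ in range(n_rounds):
        for i, proj in enumerate(projectors):
            y = w + corrections[i]   # re-add the correction for set C_i
            w = proj(y)              # project onto C_i
            corrections[i] = y - w   # update the correction term
    return w
```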

We end this section by discussing how to compute a point w\in{\cal R}_{C,\gamma}^{D}(u,v). For that, we apply the classical Frank-Wolfe method, also known as the conditional gradient method, to minimize the function \psi(z):=\|z-v\|^{2}/2 over the constraint set C, with a suitable stopping criterion depending on u\in C and \gamma\in(0,1]; see [9, 48]. To state the method, we assume the existence of a linear optimization oracle (or simply LO oracle) capable of minimizing linear functions over the constraint set C, which is assumed to be compact. The Frank-Wolfe method is formally stated as follows.

Input:

D\in{\cal D}_{\mu}, \gamma\in(0,1], v\in\mathbb{R}^{n} and u\in C.

Step 0.

Let w^{0}\in C and set \ell\leftarrow 0.

Step 1.

Use the LO oracle to compute an optimal solution z^{\ell} and the optimal value s_{\ell}^{*} as

z^{\ell}\in\arg\min_{z\in C}\,\langle w^{\ell}-v,~z-w^{\ell}\rangle,\qquad s_{\ell}^{*}:=\langle w^{\ell}-v,~z^{\ell}-w^{\ell}\rangle. (15)

If -s^{*}_{\ell}\leq\gamma\|w^{\ell}-u\|_{D}^{2}, then define w:=w^{\ell} and stop.

Step 2.

Compute \alpha_{\ell} and w^{\ell+1} as

w^{\ell+1}:=w^{\ell}+\alpha_{\ell}(z^{\ell}-w^{\ell}),\qquad\alpha_{\ell}:=\min\left\{1,-s^{*}_{\ell}/\|z^{\ell}-w^{\ell}\|^{2}\right\}. (16)

Set \ell\leftarrow\ell+1, and go to Step 1.

Output:

w:=w^{\ell}.

Algorithm 1: Frank-Wolfe method to compute w\in{\cal R}_{C,\gamma}^{D}(u,v)

Let us describe the main features of Algorithm 1, i.e., the Frank-Wolfe method applied to the problem \min_{z\in C}\psi(z). In this case, (15) is equivalent to s_{\ell}^{*}:=\min_{z\in C}\langle\psi^{\prime}(w^{\ell}),~z-w^{\ell}\rangle. Since \psi is convex, we have \psi(z)\geq\psi(w^{\ell})+\langle\psi^{\prime}(w^{\ell}),~z-w^{\ell}\rangle\geq\psi(w^{\ell})+s_{\ell}^{*}, for all z\in C. Define w_{*}:=\arg\min_{z\in C}\psi(z) and \psi^{*}:=\min_{z\in C}\psi(z). Letting z=w_{*} in the last inequality, we obtain \psi(w^{\ell})\geq\psi^{*}\geq\psi(w^{\ell})+s_{\ell}^{*}, which implies that s_{\ell}^{*}<0 whenever \psi(w^{\ell})\neq\psi^{*}. Thus, we conclude that -s_{\ell}^{*}=\langle v-w^{\ell},~z^{\ell}-w^{\ell}\rangle>0\geq\langle v-w_{*},~z-w_{*}\rangle, for all z\in C. Therefore, if Algorithm 1 computes w^{\ell}\in C satisfying -s_{\ell}^{*}\leq\gamma\|w^{\ell}-u\|_{D}^{2}, then the method terminates. Otherwise, it computes the step size \alpha_{\ell}=\arg\min_{\alpha\in[0,1]}\psi(w^{\ell}+\alpha(z^{\ell}-w^{\ell})) using exact minimization. Since z^{\ell},w^{\ell}\in C and C is convex, we conclude from (16) that w^{\ell+1}\in C; thus Algorithm 1 generates a sequence in C. Finally, (15) implies that \langle v-w^{\ell},~z-w^{\ell}\rangle\leq-s_{\ell}^{*}, for all z\in C. Considering that [9, Proposition A.2] implies that \lim_{\ell\to+\infty}s_{\ell}^{*}=0, and taking into account the stopping criterion -s_{\ell}^{*}\leq\gamma\|w^{\ell}-u\|_{D}^{2}, we conclude that the output of Algorithm 1 is a feasible inexact projection w\in{\cal R}_{C,\gamma}^{D}(u,v), i.e., \langle v-w,~z-w\rangle\leq\gamma\|w-u\|_{D}^{2}, for all z\in C.
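As an illustration, here is a direct transcription of Algorithm 1 for the case in which C is the unit simplex, whose LO oracle just picks the vertex with the smallest gradient component. The choice of C and all names are assumptions made for the sketch, not part of the paper.

```python
import numpy as np

def frank_wolfe_inexact_projection(v, u, D, gamma, w0, max_iter=1000):
    """Algorithm 1 for C = unit simplex: returns w in R_{C,gamma}^D(u, v)."""
    w = w0.copy()
    for _ in range(max_iter):
        g = w - v                                   # psi'(w), psi(z) = ||z - v||^2 / 2
        i = int(np.argmin(g))                       # LO oracle: best simplex vertex
        z = np.zeros_like(w)
        z[i] = 1.0
        s = g @ (z - w)                             # s_l^* in (15)
        if -s <= gamma * ((w - u) @ D @ (w - u)):   # stopping test of Step 1
            return w
        alpha = min(1.0, -s / ((z - w) @ (z - w)))  # exact line search (16)
        w = w + alpha * (z - w)                     # update of Step 2
    return w
```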

3 Inexact scaled gradient method

The aim of this section is to present an inexact version of the scaled gradient method (SGM), which is inexact in two distinct senses. First, we use a version of the inexactness scheme introduced in [13], and also a variation of the one that appeared in [64], to compute an inexact projection onto the feasible set allowing an appropriate relative error tolerance. Second, using the conceptual scheme for non-monotone line searches introduced in [43, 59], a step size is computed to define the next iterate. The statement of the conceptual algorithm is as follows.

Step 0.

Choose \sigma,\zeta_{\min}\in(0,1), \delta_{\min}\in[0,1), 0<\underline{\omega}<\bar{\omega}<1, 0<\alpha_{\min}\leq\alpha_{\max} and \mu\geq 1. Let x^{0}\in C, \nu_{0}\geq 0 and set k\leftarrow 0.

Step 1.

Choose positive real numbers \alpha_{k} and \zeta_{k}, and a positive definite matrix D_{k}, such that

\alpha_{\min}\leq\alpha_{k}\leq\alpha_{\max},\qquad 0<\zeta_{\min}<\zeta_{k}\leq 1,\qquad D_{k}\in{\cal D}_{\mu}. (17)

Compute w^{k}\in C as any feasible inexact projection, with respect to the norm \|\cdot\|_{D_{k}}, of z^{k}:=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}) onto C relative to x^{k} with forcing parameter \zeta_{k}, i.e.,

w^{k}\in{\cal P}_{C,\zeta_{k}}^{D_{k}}(x^{k},z^{k}). (18)

If w^{k}=x^{k}, then stop declaring convergence.

Step 2.

Set \tau_{\textrm{trial}}\leftarrow 1. If

f\big(x^{k}+\tau_{\textrm{trial}}(w^{k}-x^{k})\big)\leq f(x^{k})+\sigma\tau_{\textrm{trial}}\big\langle\nabla f(x^{k}),w^{k}-x^{k}\big\rangle+\nu_{k}, (19)

then \tau_{k}\leftarrow\tau_{\textrm{trial}}, define the next iterate x^{k+1} as

x^{k+1}=x^{k}+\tau_{k}(w^{k}-x^{k}), (20)

and go to Step 3. Otherwise, choose \tau_{\textrm{new}}\in[\underline{\omega}\tau_{\textrm{trial}},\bar{\omega}\tau_{\textrm{trial}}], set \tau_{\textrm{trial}}\leftarrow\tau_{\textrm{new}}, and repeat test (19).

Step 3.

Take \delta_{k+1}\in[\delta_{\min},1] and choose \nu_{k+1}\in\mathbb{R} satisfying

0\leq\nu_{k+1}\leq(1-\delta_{k+1})\big[f(x^{k})+\nu_{k}-f(x^{k+1})\big]. (21)

Set k\leftarrow k+1 and go to Step 1.

Algorithm 2: InexProj-SGM employing non-monotone line search

Let us describe the main features of Algorithm 2. In Step 1, we first choose \alpha_{\min}\leq\alpha_{k}\leq\alpha_{\max}, 0<\zeta_{\min}<\zeta_{k}\leq 1, and D_{k}\in{\cal D}_{\mu}. Then, by using some (inner) procedure, such as those specified in Section 2.1, we compute w^{k} as any feasible inexact projection of z^{k}=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}) onto the feasible set C relative to the previous iterate x^{k} with forcing parameter \zeta_{k}. If w^{k}=x^{k}, then Lemma 9(ii) implies that x^{k} is a stationary point of problem (1). Otherwise, w^{k}\neq x^{k} and Lemma 9(i) implies that w^{k}-x^{k} is a descent direction of f at x^{k}, i.e., \langle\nabla f(x^{k}),w^{k}-x^{k}\rangle<0. Hence, in Step 2, we employ a non-monotone line search with tolerance parameter \nu_{k}\geq 0 to compute a step size \tau_{k}\in(0,1], and the next iterate is computed as in (20). Finally, due to (19) and \delta_{k+1}\in[\delta_{\min},1], we have 0\leq(1-\delta_{k+1})\big[f(x^{k})+\nu_{k}-f(x^{k+1})\big]. Therefore, the next tolerance parameter \nu_{k+1}\in\mathbb{R} can be chosen satisfying (21) in Step 3, completing the iteration.
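To fix ideas, the sketch below assembles one possible instance of Algorithm 2 in Python with D_k equal to the identity, a constant \alpha_k, the monotone choice \nu_k\equiv 0 (Armijo), and a user-supplied callback standing in for the inexact projection (18). Parameter values and names are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def inexproj_sgm(f, grad, inexact_proj, x0, alpha=1.0, sigma=1e-4,
                 omega=0.5, tol=1e-8, max_iter=500):
    """Instance of Algorithm 2 with D_k = I, fixed alpha_k, nu_k = 0."""
    x = x0.copy()
    for _ in range(max_iter):
        z = x - alpha * grad(x)          # z^k = x^k - alpha_k D_k^{-1} grad f(x^k)
        w = inexact_proj(x, z)           # w^k in P_{C,zeta_k}^{D_k}(x^k, z^k), cf. (18)
        d = w - x
        if np.linalg.norm(d) <= tol:     # Step 1: w^k = x^k, declare convergence
            return x
        slope = grad(x) @ d              # negative by Lemma 9(i)
        tau = 1.0
        while f(x + tau * d) > f(x) + sigma * tau * slope:  # test (19) with nu_k = 0
            tau *= omega                 # backtracking: tau_new = omega * tau_trial
        x = x + tau * d                  # update (20)
    return x
```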

It is worth mentioning that the conditions in (17) allow combining several strategies for choosing the step sizes \alpha_{k} and the matrices D_{k} to accelerate the performance of the classical gradient method. Such strategies have their origin in the study of the gradient method for unconstrained optimization; papers dealing with this issue include, but are not limited to, [7, 27, 29, 36, 69], see also [18, 25, 26, 49]. More details about selecting the step sizes \alpha_{k} and matrices D_{k} can be found in the recent review [17] and references therein.

Below, we present some particular instances of the parameter \delta_{k}\geq 0 and the non-monotonicity tolerance parameter \nu_{k}\geq 0 in Step 3.

  1.

    Armijo line search

    Taking \nu_{k}\equiv 0, the line search (19) is the well-known (monotone) Armijo line search, see [12, Section 2.3]. In this case, we can take \delta_{k}\equiv 1 in Step 3.

  2.

    Max-type line search

    The earliest non-monotone line search strategy was proposed in [45]. Let M>0 be an integer parameter. At iteration k, this strategy requires a step size \tau_{k}>0 satisfying

    f\big(x^{k}+\tau_{k}(w^{k}-x^{k})\big)\leq\max_{0\leq j\leq m_{k}}f(x^{k-j})+\sigma\tau_{k}\big\langle\nabla f(x^{k}),w^{k}-x^{k}\big\rangle, (22)

    where m_{0}=0 and 0\leq m_{k}\leq\min\{m_{k-1}+1,M\}. To simplify the notation, we define f(x^{\ell(k)}):=\max_{0\leq j\leq m_{k}}f(x^{k-j}). In order to identify (22) as a particular instance of (19), we set

    \nu_{k}=f(x^{\ell(k)})-f(x^{k}),\quad 0=\delta_{\min}\leq\delta_{k+1}\leq[f(x^{\ell(k)})-f(x^{\ell(k+1)})]/[f(x^{\ell(k)})-f(x^{k+1})]. (23)

    The parameters \nu_{k} and \delta_{k+1} in (23) satisfy the corresponding conditions in Algorithm 2, i.e., \nu_{k}\geq 0 and \delta_{k+1}\in[\delta_{\min},1] (with \delta_{\min}=0) satisfy (21). In fact, the definition of f(x^{\ell(k)}) implies that f(x^{k})\leq f(x^{\ell(k)}), and hence \nu_{k}\geq 0. Due to \langle\nabla f(x^{k}),w^{k}-x^{k}\rangle<0, it follows from (19) that f(x^{\ell(k)})-f(x^{k+1})>0. Since m_{k+1}\leq m_{k}+1, we conclude that f(x^{\ell(k)})-f(x^{\ell(k+1)})\geq 0. Hence, owing to f(x^{k+1})\leq f(x^{\ell(k+1)}), we obtain \delta_{k+1}\in[0,1]. Moreover, (21) is equivalent to \delta_{k+1}[f(x^{k})+\nu_{k}-f(x^{k+1})]\leq(f(x^{k})+\nu_{k})-(f(x^{k+1})+\nu_{k+1}), which in turn, taking into account that \nu_{k}=f(x^{\ell(k)})-f(x^{k}), is equivalent to the second inequality in (23). Thus, (22) is a particular instance of (19) with \nu_{k} and \delta_{k+1} defined in (23). Therefore, Algorithm 2 has as a particular instance the inexact projected version of the scaled gradient method employing the non-monotone line search (22). This version has been considered in [13]; see also [19, 65].

  3.

    Average-type line search

    Let us first recall the definition of the sequence of “cost updates” (c_{k})_{k\in\mathbb{N}} that characterizes the non-monotone line search proposed in [68]. Let 0\leq\eta_{\min}\leq\eta_{\max}<1, c_{0}=f(x^{0}) and q_{0}=1. Choose \eta_{k}\in[\eta_{\min},\eta_{\max}] and set

    q_{k+1}=\eta_{k}q_{k}+1,\qquad c_{k+1}=[\eta_{k}q_{k}c_{k}+f(x^{k+1})]/q_{k+1},\qquad\forall k\in\mathbb{N}. (24)

    Some algebraic manipulations show that the sequence defined in (24) is equivalent to

    c_{k+1}=(1-1/q_{k+1})c_{k}+f(x^{k+1})/q_{k+1},\qquad\forall k\in\mathbb{N}. (25)

    Since (21) is equivalent to f(x^{k+1})+\nu_{k+1}\leq(1-\delta_{k+1})(f(x^{k})+\nu_{k})+\delta_{k+1}f(x^{k+1}), it follows from (25) that, letting \nu_{k}=c_{k}-f(x^{k}) and \delta_{k+1}=1/q_{k+1}, Algorithm 2 becomes the inexact projected version of the scaled gradient method employing the non-monotone line search proposed in [68]; these updates are written out in the sketch after this list. Finally, considering that q_{0}=1 and \eta_{\max}<1, the first equality in (24) implies that q_{k+1}=1+\sum_{j=0}^{k}\prod_{i=0}^{j}\eta_{k-i}\leq\sum_{j=0}^{+\infty}\eta_{\max}^{j}=1/(1-\eta_{\max}). In this case, due to \delta_{k+1}=1/q_{k+1}, we can take \delta_{\min}=1-\eta_{\max}>0 in Step 3. For gradient projection methods employing the non-monotone Average-type line search see, for example, [6, 34, 66].
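The recursion (24)-(25), together with the choices \nu_{k}=c_{k}-f(x^{k}) and \delta_{k+1}=1/q_{k+1}, translates into a few lines of code. A sketch with illustrative names; `f_next` stands for f(x^{k+1}).

```python
def average_type_update(q, c, f_next, eta):
    """One Average-type update: given q_k, c_k, f(x^{k+1}) and eta_k,
    return q_{k+1}, c_{k+1}, nu_{k+1}, delta_{k+1} as in (24)."""
    q_next = eta * q + 1.0                    # q_{k+1} = eta_k q_k + 1
    c_next = (eta * q * c + f_next) / q_next  # c_{k+1}, cf. (24)
    nu_next = c_next - f_next                 # nu_{k+1} = c_{k+1} - f(x^{k+1})
    delta_next = 1.0 / q_next                 # delta_{k+1} = 1 / q_{k+1}
    return q_next, c_next, nu_next, delta_next
```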

Remark 6.

The general line search in Step 2 of Algorithm 2, with parameters \delta_{k+1} and \nu_{k} properly chosen in Step 3, also contains as particular cases the non-monotone line searches that appeared in [3, 51]; see also [43].

4 Partial asymptotic convergence analysis

The goal of this section is to present a partial convergence result for the sequence (x^{k})_{k\in\mathbb{N}} generated by Algorithm 2; namely, we will prove that every cluster point of (x^{k})_{k\in\mathbb{N}} is stationary for problem (1). For that, we state a result that is contained in the proof of [43, Theorem 4].

Lemma 10.

There holds 0\leq\delta_{k+1}\big[f(x^{k})+\nu_{k}-f(x^{k+1})\big]\leq\big(f(x^{k})+\nu_{k}\big)-\big(f(x^{k+1})+\nu_{k+1}\big), for all k\in\mathbb{N}. As a consequence, the sequence \left(f(x^{k})+\nu_{k}\right)_{k\in\mathbb{N}} is non-increasing.

Next, we present our first convergence result. It is worth noting that, just as in the classical projected gradient method, we do not need to assume that f has bounded sub-level sets.

Proposition 11.

Assume that \lim_{k\to+\infty}\nu_{k}=0. Then, Algorithm 2 either stops in a finite number of iterations at a stationary point of problem (1), or generates an infinite sequence (x^{k})_{k\in\mathbb{N}} for which every cluster point is stationary for problem (1).

Proof.

First, assume that (x^{k})_{k\in\mathbb{N}} is finite. In this case, according to Step 1, there exists k\in\mathbb{N} such that x^{k}=w^{k}\in{\cal P}_{C,\zeta_{k}}^{D_{k}}(x^{k},z^{k}), where z^{k}=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}), 0<\zeta_{\min}<\zeta_{k}\leq 1 and \alpha_{k}>0. Therefore, applying Lemma 9(ii) with x=x^{k}, \alpha=\alpha_{k} and \zeta=\zeta_{k}, we conclude that x^{k} is stationary for problem (1). Now, assume that (x^{k})_{k\in\mathbb{N}} is infinite. Let \bar{x} be a cluster point of (x^{k})_{k\in\mathbb{N}} and (x^{k_{j}})_{j\in\mathbb{N}} be a subsequence of (x^{k})_{k\in\mathbb{N}} such that \lim_{j\to+\infty}x^{k_{j}}=\bar{x}. Since C is closed and (x^{k})_{k\in\mathbb{N}}\subset C, we have \bar{x}\in C. Moreover, owing to \lim_{k\to+\infty}\nu_{k}=0, we have \lim_{j\to+\infty}\left(f(x^{k_{j}})+\nu_{k_{j}}\right)=f(\bar{x}). Hence, considering that \lim_{k\to+\infty}\nu_{k}=0 and that Lemma 10 implies that \left(f(x^{k})+\nu_{k}\right)_{k\in\mathbb{N}} is non-increasing, we conclude that \lim_{k\to+\infty}f(x^{k})=\lim_{k\to+\infty}\left(f(x^{k})+\nu_{k}\right)=f(\bar{x}). On the other hand, due to w^{k}\in{\cal P}_{C,\zeta_{k}}^{D_{k}}(x^{k},z^{k}), where z^{k}=x^{k}-\alpha_{k}D_{k}^{-1}\nabla f(x^{k}), Definition 2 implies

\|w^{k_{j}}-z^{k_{j}}\|_{D_{k_{j}}}^{2}\leq\zeta_{k_{j}}\|{\cal P}_{C}^{D_{k_{j}}}(z^{k_{j}})-z^{k_{j}}\|_{D_{k_{j}}}^{2}+(1-\zeta_{k_{j}})\|x^{k_{j}}-z^{k_{j}}\|_{D_{k_{j}}}^{2}. (26)

Considering that (\alpha_{k})_{k\in\mathbb{N}} and (\zeta_{k})_{k\in\mathbb{N}} are bounded, (D_{k})_{k\in\mathbb{N}}\subset{\cal D}_{\mu}, (x^{k_{j}})_{j\in\mathbb{N}} converges to \bar{x} and \nabla f is continuous, the last inequality together with Remark 1 and (4) implies that (w^{k_{j}})_{j\in\mathbb{N}}\subset C is also bounded. Thus, we can assume without loss of generality that \lim_{j\to+\infty}w^{k_{j}}=\bar{w}\in C. In addition, taking into account that x^{k}\neq w^{k} for all k\in\mathbb{N}, applying Lemma 9(i) with x=x^{k}, \alpha=\alpha_{k}, z(\alpha)=z^{k} and \zeta=\zeta_{k}, we obtain that \langle\nabla f(x^{k}),w^{k}-x^{k}\rangle<0, for all k\in\mathbb{N}. Therefore, (19) and (20) imply that

0<-\sigma\tau_{k}\big\langle\nabla f(x^{k}),w^{k}-x^{k}\big\rangle\leq f(x^{k})+\nu_{k}-f(x^{k+1}),\qquad\forall~k\in\mathbb{N}. (27)

Now, due to \tau_{k}\in(0,1] for all k\in\mathbb{N}, we can also assume without loss of generality that \lim_{j\to+\infty}\tau_{k_{j}}=\bar{\tau}\in[0,1]. Therefore, owing to \lim_{k\to+\infty}f(x^{k})=f(\bar{x}) and \lim_{k\to+\infty}\nu_{k}=0, taking the limit in (27) along the subsequences (x^{k_{j}})_{j\in\mathbb{N}}, (w^{k_{j}})_{j\in\mathbb{N}} and (\tau_{k_{j}})_{j\in\mathbb{N}} yields \bar{\tau}\big\langle\nabla f(\bar{x}),\bar{w}-\bar{x}\big\rangle=0. We have two possibilities: \bar{\tau}>0 or \bar{\tau}=0. If \bar{\tau}>0, then \big\langle\nabla f(\bar{x}),\bar{w}-\bar{x}\big\rangle=0. Now, we assume that \bar{\tau}=0. In this case, for all j large enough, there exists 0<\hat{\tau}_{k_{j}}\leq\min\{1,\tau_{k_{j}}/\underline{\omega}\} such that

f\big(x^{k_{j}}+\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big)>f(x^{k_{j}})+\sigma\hat{\tau}_{k_{j}}\big\langle\nabla f(x^{k_{j}}),w^{k_{j}}-x^{k_{j}}\big\rangle+\nu_{k_{j}}. (28)

On the other hand, by the mean value theorem, there exists \xi_{k_{j}}\in(0,1) such that

\langle\nabla f\big(x^{k_{j}}+\xi_{k_{j}}\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big),\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\rangle=f\big(x^{k_{j}}+\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big)-f(x^{k_{j}}).

Combining this equality with (28), and taking into account that \nu_{k_{j}}\geq 0, we have

\langle\nabla f\big(x^{k_{j}}+\xi_{k_{j}}\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\big),\hat{\tau}_{k_{j}}(w^{k_{j}}-x^{k_{j}})\rangle>\sigma\hat{\tau}_{k_{j}}\big\langle\nabla f(x^{k_{j}}),w^{k_{j}}-x^{k_{j}}\big\rangle,

for j large enough. Since 0<\hat{\tau}_{k_{j}}\leq\min\{1,\tau_{k_{j}}/\underline{\omega}\}, it follows that \lim_{j\to\infty}\hat{\tau}_{k_{j}}\|w^{k_{j}}-x^{k_{j}}\|=0. Then, dividing both sides of the above inequality by \hat{\tau}_{k_{j}}>0 and taking limits as j goes to +\infty, we conclude that \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle\geq\sigma\langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle. Hence, due to \sigma\in(0,1), we obtain \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle\geq 0. We recall that \langle\nabla f(x^{k_{j}}),w^{k_{j}}-x^{k_{j}}\rangle<0, for all j\in\mathbb{N}, which, taking the limit as j goes to +\infty, yields \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle\leq 0. Hence, we also have \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle=0. Therefore, for either of the two possibilities, \bar{\tau}>0 or \bar{\tau}=0, we have \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle=0. On the other hand, since (\alpha_{k})_{k\in\mathbb{N}} and (\zeta_{k})_{k\in\mathbb{N}} are bounded, we can also assume without loss of generality that \lim_{j\to+\infty}\alpha_{k_{j}}=\bar{\alpha}\in[\alpha_{\min},\alpha_{\max}], \lim_{j\to+\infty}\zeta_{k_{j}}=\bar{\zeta}\in[\zeta_{\min},1] and \lim_{j\to+\infty}D_{k_{j}}=\bar{D}\in{\cal D}_{\mu}. Thus, since Remark 1 implies that

\lim_{j\to+\infty}{\cal P}_{C}^{D_{k_{j}}}(z^{k_{j}})={\cal P}_{C}^{\bar{D}}(\bar{z}),

and considering that \lim_{j\to+\infty}x^{k_{j}}=\bar{x}\in C, \lim_{j\to+\infty}w^{k_{j}}=\bar{w}\in C and \lim_{j\to+\infty}\tau_{k_{j}}=\bar{\tau}\in[0,1], taking the limit in (26), we conclude that

\|\bar{w}-\bar{z}\|_{\bar{D}}^{2}\leq\bar{\zeta}\|{\cal P}_{C}^{\bar{D}}(\bar{z})-\bar{z}\|_{\bar{D}}^{2}+(1-\bar{\zeta})\|\bar{x}-\bar{z}\|_{\bar{D}}^{2},

where \bar{z}=\bar{x}-\bar{\alpha}\bar{D}^{-1}\nabla f(\bar{x}). Hence, Definition 2 implies that \bar{w}\in{\cal P}_{C,\bar{\zeta}}^{\bar{D}}(\bar{x},\bar{z}). Therefore, due to \langle\nabla f(\bar{x}),\bar{w}-\bar{x}\rangle=0, we can apply the second sentence in Lemma 9(iii) with x=\bar{x}, z(\bar{\alpha})=\bar{z} and w(\bar{\alpha})=\bar{w}, to conclude that \bar{x} is stationary for problem (1). ∎

The tolerance parameter $\nu_k$ that controls the non-monotonicity of the line search must become smaller and smaller as the sequence $(x^k)_{k\in\mathbb{N}}$ tends to a stationary point. The next corollary presents a general condition ensuring this property; its proof can be found in [43, Theorem 4].

Corollary 12.

If $\delta_{\min}>0$, then $\sum_{k=0}^{+\infty}\nu_k<+\infty$. Consequently, $\lim_{k\to+\infty}\nu_k=0$.

The Armijo and the non-monotone Average-type line searches discussed in Section 3 satisfy the assumption of Corollary 12, i.e., $\delta_{\min}>0$. However, for the non-monotone Max-type line search, we can only guarantee that $\delta_{\min}\geq 0$. Hence, we cannot apply Corollary 12 to conclude that $\lim_{k\to+\infty}\nu_k=0$. In the next proposition, we deal with this case separately.

Proposition 13.

Assume that the sequence $(x^k)_{k\in\mathbb{N}}$ is generated by Algorithm 2 with the non-monotone line search (22), i.e., $\nu_k=f(x^{\ell(k)})-f(x^k)$ for all $k\in\mathbb{N}$. In addition, assume that the level set $C_0:=\{x\in C:~f(x)\leq f(x^0)\}$ is bounded and $\nu_0=0$. Then, $\lim_{k\to+\infty}\nu_k=0$.

Proof.

First of all, note that $w^k\in{\cal P}_{C,\zeta_k}^{D_k}(x^k,z^k)$, where $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$ and $D_k\in{\cal D}_\mu$. Thus, applying Lemma 9(i) with $x=x^k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, we obtain

\|w^k-x^k\|^2 \leq -2\mu\alpha_{\max}\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle, \qquad \forall~k\in\mathbb{N}. \qquad (29)

On the other hand, due to $f(x^{\ell(k)})=f(x^k)+\nu_k$, Lemma 10 implies that $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ is non-increasing and $f(x^{k+1})\leq f(x^{k+1})+\nu_{k+1}\leq f(x^k)+\nu_k\leq f(x^0)$. Hence, we have $(x^k)_{k\in\mathbb{N}}\subset C_0$ and, as a consequence, $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ converges. Note that $\ell(k)$ is an integer such that

k - m_k \leq \ell(k) \leq k. \qquad (30)

Since $x^{\ell(k)}=x^{\ell(k)-1}+\tau_{\ell(k)-1}(w^{\ell(k)-1}-x^{\ell(k)-1})$, (22) implies that

f\big(x^{\ell(k)}\big) \leq f\big(x^{\ell(\ell(k)-1)}\big) + \sigma\tau_{\ell(k)-1}\big\langle\nabla f(x^{\ell(k)-1}),\, w^{\ell(k)-1}-x^{\ell(k)-1}\big\rangle,

for all $k>M$. In view of $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ being convergent and $\langle\nabla f(x^k),w^k-x^k\rangle<0$ for all $k\in\mathbb{N}$, and taking into account that $\tau_k\in(0,1]$, the last inequality together with (29) implies that

\lim_{k\to+\infty}\tau_{\ell(k)-1}\|w^{\ell(k)-1}-x^{\ell(k)-1}\| = 0. \qquad (31)

We proceed to prove that $\lim_{k\to+\infty}f(x^k)=\lim_{k\to+\infty}f(x^{\ell(k)})$. For that, set $\hat{\ell}(k):=\ell(k+M+2)$. First, we prove by induction that, for all $j\geq 1$, the following two equalities hold:

\lim_{k\to+\infty}\tau_{\hat{\ell}(k)-j}\|w^{\hat{\ell}(k)-j}-x^{\hat{\ell}(k)-j}\| = 0, \qquad \lim_{k\to+\infty}f(x^{\hat{\ell}(k)-j}) = \lim_{k\to+\infty}f(x^{\ell(k)}), \qquad (32)

where we are considering $k\geq j-1$. Assume that $j=1$. Since $\{\hat{\ell}(k):~k\in\mathbb{N}\}\subset\{\ell(k):~k\in\mathbb{N}\}$, the first equality in (32) follows from (31). Hence, $\lim_{k\to+\infty}\|x^{\hat{\ell}(k)}-x^{\hat{\ell}(k)-1}\|=0$. Since $C_0$ is compact and $f$ is uniformly continuous on $C_0$, we have $\lim_{k\to+\infty}f(x^{\hat{\ell}(k)-1})=\lim_{k\to+\infty}f(x^{\hat{\ell}(k)})$, which, again using that $\{\hat{\ell}(k):~k\in\mathbb{N}\}\subset\{\ell(k):~k\in\mathbb{N}\}$, implies the second equality in (32). Assume now that (32) holds for $j$. Again, due to $x^{\hat{\ell}(k)-j}=x^{\hat{\ell}(k)-j-1}+\tau_{\hat{\ell}(k)-j-1}(w^{\hat{\ell}(k)-j-1}-x^{\hat{\ell}(k)-j-1})$, (22) implies that

f\big(x^{\hat{\ell}(k)-j}\big) \leq f\big(x^{\ell(\hat{\ell}(k)-(j+1))}\big) + \sigma\tau_{\hat{\ell}(k)-(j+1)}\big\langle\nabla f(x^{\hat{\ell}(k)-(j+1)}),\, w^{\hat{\ell}(k)-(j+1)}-x^{\hat{\ell}(k)-(j+1)}\big\rangle.

A similar argument to the one used to obtain (31) yields $\lim_{k\to+\infty}\tau_{\hat{\ell}(k)-(j+1)}\|w^{\hat{\ell}(k)-(j+1)}-x^{\hat{\ell}(k)-(j+1)}\|=0$. Thus, the first equality in (32) holds for $j+1$, which implies $\lim_{k\to+\infty}\|x^{\hat{\ell}(k)-j}-x^{\hat{\ell}(k)-(j+1)}\|=0$. Again, the uniform continuity of $f$ on $C_0$ gives

\lim_{k\to+\infty}f(x^{\hat{\ell}(k)-(j+1)}) = \lim_{k\to+\infty}f(x^{\hat{\ell}(k)-j}),

which shows that the second equality in (32) holds for $j+1$. From (30) and $\hat{\ell}(k):=\ell(k+M+2)$, we obtain $\hat{\ell}(k)-k-1\leq M+1$. Thus, taking into account that

x^{k+1} = x^{\hat{\ell}(k)} - \sum_{j=1}^{\hat{\ell}(k)-k-1}\tau_{\hat{\ell}(k)-j}\big(w^{\hat{\ell}(k)-j}-x^{\hat{\ell}(k)-j}\big),

it follows from the first equality in (32) that $\lim_{k\to+\infty}\|x^{k+1}-x^{\hat{\ell}(k)}\|=0$. Hence, due to $f$ being uniformly continuous on $C_0$ and $(f(x^{\ell(k)}))_{k\in\mathbb{N}}$ being convergent, we conclude that

\lim_{k\to+\infty}f(x^k) = \lim_{k\to+\infty}f(x^{\hat{\ell}(k)}) = \lim_{k\to+\infty}f(x^{\ell(k)}),

and, considering that $\nu_k=f(x^{\ell(k)})-f(x^k)$, the desired result follows. ∎

Remark 7.

Let $C_0:=\{x\in C:~f(x)\leq f(x^0)\}$ be bounded and let $(x^k)_{k\in\mathbb{N}}$ be generated by Algorithm 2 with the non-monotone line search (22) and $\nu_0=0$. Then, combining Propositions 11 and 13, we conclude that $(x^k)_{k\in\mathbb{N}}$ either is finite, terminating at a stationary point of problem (1), or is infinite, and every cluster point of $(x^k)_{k\in\mathbb{N}}$ is stationary for problem (1). Therefore, we have an alternative proof of the result obtained in [13, Theorem 2.1].

Due to Proposition 11, from now on we assume that the sequence $(x^k)_{k\in\mathbb{N}}$ generated by Algorithm 2 is infinite.

5 Full asymptotic convergence and complexity analysis

The purpose of this section is twofold. We first prove, under suitable assumptions, the full convergence of the sequence $(x^k)_{k\in\mathbb{N}}$, and then we present iteration-complexity bounds for it. To this end, we need to be more restrictive both with respect to the inexact projection in (18) and with respect to the tolerance parameter that controls the non-monotonicity of the line search used in (19). More precisely, we assume that in Step 1 of Algorithm 2:

  • A1.

    For all $k\in\mathbb{N}$, we take $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $\gamma_k=(1-\zeta_k)/2$.

It is worth recalling that, taking the parameter $\gamma_k=(1-\zeta_k)/2$, it follows from Lemma 6 that ${\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)\subset{\cal P}_{C,\zeta_k}^{D_k}(x^k,z^k)$. In addition, we also assume that in Step 2 of Algorithm 2:

  • A2.

    For all $k\in\mathbb{N}$, we take $\nu_k\geq 0$ such that $\sum_{k=0}^{+\infty}\nu_k<+\infty$.

It follows from Corollary 12 that the Armijo and the non-monotone Average-type line searches discussed in Section 3 satisfy Assumption A2.

5.1 Full asymptotic convergence analysis

In this section, we prove the full convergence of the sequence $(x^k)_{k\in\mathbb{N}}$ satisfying A1 and A2. We begin by establishing a basic inequality for $(x^k)_{k\in\mathbb{N}}$. To simplify notation, we define the constant

\xi := \frac{2\alpha_{\max}}{\sigma} > 0. \qquad (33)
Lemma 14.

For each $x\in C$, there holds

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\alpha_k\tau_k\big\langle\nabla f(x^k),\, x-x^k\big\rangle + \xi\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad \forall~k\in\mathbb{N}. \qquad (34)
Proof.

We know that $\|x^{k+1}-x\|_{D_k}^2=\|x^k-x\|_{D_k}^2+\|x^{k+1}-x^k\|_{D_k}^2-2\langle D_k(x^{k+1}-x^k),\,x-x^k\rangle$, for all $x\in C$ and $k\in\mathbb{N}$. Thus, using (20), we have

\|x^{k+1}-x\|_{D_k}^2 = \|x^k-x\|_{D_k}^2 + \tau_k^2\|w^k-x^k\|_{D_k}^2 - 2\tau_k\big\langle D_k(w^k-x^k),\, x-x^k\big\rangle, \qquad \forall~k\in\mathbb{N}. \qquad (35)

On the other hand, since $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$, it follows from Definition 3, with $y=x$, $D=D_k$, $u=x^k$, $v=z^k$, $w=w^k$, and $\gamma=\gamma_k$, that

\big\langle D_k(x^k-\alpha_k D_k^{-1}\nabla f(x^k)-w^k),\, x-w^k\big\rangle \leq \gamma_k\|w^k-x^k\|_{D_k}^2, \qquad \forall~k\in\mathbb{N}.

Hence, writing $x-w^k=(x-x^k)-(w^k-x^k)$ and rearranging terms in the last inequality, we have

-\big\langle D_k(w^k-x^k),\, x-x^k\big\rangle \leq \alpha_k\big\langle\nabla f(x^k),\, x-w^k\big\rangle - (1-\gamma_k)\|w^k-x^k\|_{D_k}^2.

Combining the last inequality with (35), we conclude that

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 - \tau_k\big[2(1-\gamma_k)-\tau_k\big]\|w^k-x^k\|_{D_k}^2 + 2\tau_k\alpha_k\big\langle\nabla f(x^k),\, x-w^k\big\rangle. \qquad (36)

Since $0\leq\gamma_k<(1-\zeta_{\min})/2<1/2$ and $\tau_k\in(0,1]$, we have $2(1-\gamma_k)-\tau_k>\zeta_{\min}>0$. Thus, it follows from (36) that

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\tau_k\alpha_k\big\langle\nabla f(x^k),\, x-w^k\big\rangle, \qquad \forall~k\in\mathbb{N}.

Thus, considering that $\langle\nabla f(x^k),x-w^k\rangle = \langle\nabla f(x^k),x-x^k\rangle + \langle\nabla f(x^k),x^k-w^k\rangle$ and taking into account (19), we conclude that

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\tau_k\alpha_k\big\langle\nabla f(x^k),\, x-x^k\big\rangle + \frac{2\alpha_k}{\sigma}\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad (37)

for all $k\in\mathbb{N}$. On the other hand, applying Lemma 9(iii) with $x=x^k$, $\alpha=\alpha_k$, $D=D_k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, we obtain $\langle\nabla f(x^k),w^k-x^k\rangle<0$, for all $k\in\mathbb{N}$. Therefore, it follows from (19) and (20) that $0<-\sigma\tau_k\langle\nabla f(x^k),w^k-x^k\rangle \leq f(x^k)-f(x^{k+1})+\nu_k$, for all $k\in\mathbb{N}$. Hence, due to $0<\alpha_k\leq\alpha_{\max}$, we have

\alpha_k\big[f(x^k)-f(x^{k+1})+\nu_k\big] \leq \alpha_{\max}\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad \forall k\in\mathbb{N}.

Therefore, (34) follows from the combination of the last inequality with (33) and (37). ∎

To proceed with the analysis of the behavior of the sequence $(x^k)_{k\in\mathbb{N}}$, we define the following auxiliary set

U := \left\{x\in C:~ f(x) \leq \inf_{k\in\mathbb{N}}\big(f(x^k)+\nu_k\big)\right\}.
Corollary 15.

Assume that $f$ is a convex function. If $U\neq\varnothing$, then $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1).

Proof.

Let $x\in U$. Since $f$ is convex, we have $0\geq f(x)-(f(x^k)+\nu_k)\geq\langle\nabla f(x^k),x-x^k\rangle-\nu_k$, for all $k\in\mathbb{N}$. Thus, $\langle\nabla f(x^k),x-x^k\rangle\leq\nu_k$, for all $k\in\mathbb{N}$. Using Lemma 14 and taking into account that $\tau_k\in(0,1]$ and $0<\alpha_{\min}\leq\alpha_k\leq\alpha_{\max}$, we obtain

\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + 2\alpha_{\max}\nu_k + \xi\big[f(x^k)-f(x^{k+1})+\nu_k\big], \qquad \forall~k\in\mathbb{N}.

Defining $\epsilon_k = 2\alpha_{\max}\nu_k + \xi\big[f(x^k)-f(x^{k+1})+\nu_k\big]$, we have $\|x^{k+1}-x\|_{D_k}^2 \leq \|x^k-x\|_{D_k}^2 + \epsilon_k$, for all $k\in\mathbb{N}$. On the other hand, summing $\epsilon_k$ for $k=0,1,\ldots,N$ and using Corollary 12, we have

\sum_{k=0}^{N}\epsilon_k \leq 2\alpha_{\max}\sum_{k=0}^{N}\nu_k + \xi\left(f(x^0)-f(x)+\sum_{k=0}^{N+1}\nu_k\right) < +\infty, \qquad \forall N\in\mathbb{N}.

Hence, $\sum_{k=0}^{+\infty}\epsilon_k<+\infty$. Thus, it follows from Definition 1 that $(x^k)_{k\in\mathbb{N}}$ is quasi-Fejér convergent to $U$ with respect to the sequence $(D_k)_{k\in\mathbb{N}}$. Since $U$ is nonempty, it follows from Theorem 2 that $(x^k)_{k\in\mathbb{N}}$ is bounded, and therefore it has cluster points. Let $\bar{x}$ be a cluster point of $(x^k)_{k\in\mathbb{N}}$ and let $(x^{k_j})_{j\in\mathbb{N}}$ be a subsequence of $(x^k)_{k\in\mathbb{N}}$ such that $\lim_{j\to\infty}x^{k_j}=\bar{x}$. Considering that $f$ is continuous and $\lim_{k\to+\infty}\nu_k=0$, we have $\lim_{j\to\infty}(f(x^{k_j})+\nu_{k_j})=f(\bar{x})$. On the other hand, Lemma 10 implies that $(f(x^k)+\nu_k)_{k\in\mathbb{N}}$ is non-increasing. Thus, $\inf_{k\in\mathbb{N}}(f(x^k)+\nu_k)=\lim_{k\to\infty}(f(x^k)+\nu_k)=f(\bar{x})$. Hence, $\bar{x}\in U$, and Theorem 2 implies that $(x^k)_{k\in\mathbb{N}}$ converges to $\bar{x}$. The conclusion is obtained by using Proposition 11. ∎

Theorem 16.

If $f$ is a convex function and $(x^k)_{k\in\mathbb{N}}$ has no cluster points, then $\Omega^*=\varnothing$, $\lim_{k\to\infty}\|x^k\|=+\infty$, and $\inf_{k\in\mathbb{N}}f(x^k)=\inf\{f(x):x\in C\}$.

Proof.

Since $(x^k)_{k\in\mathbb{N}}$ has no cluster points, we have $\lim_{k\to\infty}\|x^k\|=+\infty$. Assume by contradiction that $\Omega^*\neq\varnothing$. Thus, there exists $\tilde{x}\in C$ such that $f(\tilde{x})\leq f(x^k)$ for all $k\in\mathbb{N}$. Therefore, $\tilde{x}\in U$. Using Corollary 15, we obtain that $(x^k)_{k\in\mathbb{N}}$ is convergent, contradicting $\lim_{k\to\infty}\|x^k\|=+\infty$. Therefore, $\Omega^*=\varnothing$. Now, we claim that $\inf_{k\in\mathbb{N}}f(x^k)=\inf\{f(x):x\in C\}$. If $\inf_{k\in\mathbb{N}}f(x^k)=-\infty$, the claim holds. Assume by contradiction that $\inf_{k\in\mathbb{N}}f(x^k)>\inf_{x\in C}f(x)$. Thus, there exists $\tilde{x}\in C$ such that $f(\tilde{x})\leq f(x^k)\leq f(x^k)+\nu_k$, for all $k\in\mathbb{N}$. Hence, $U\neq\varnothing$. Using Corollary 15, we have that $(x^k)_{k\in\mathbb{N}}$ is convergent, contradicting again $\lim_{k\to\infty}\|x^k\|=+\infty$ and concluding the proof. ∎

Corollary 17.

If $f$ is a convex function and $(x^k)_{k\in\mathbb{N}}$ has at least one cluster point, then $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1).

Proof.

Let $\bar{x}$ be a cluster point of the sequence $(x^k)_{k\in\mathbb{N}}$ and let $(x^{k_j})_{j\in\mathbb{N}}$ be a subsequence of $(x^k)_{k\in\mathbb{N}}$ such that $\lim_{j\to+\infty}x^{k_j}=\bar{x}$. Considering that $f$ is continuous and $\lim_{k\to+\infty}\nu_k=0$, we have $\lim_{j\to\infty}(f(x^{k_j})+\nu_{k_j})=f(\bar{x})$. On the other hand, Lemma 10 implies that $(f(x^k)+\nu_k)_{k\in\mathbb{N}}$ is non-increasing. Hence, we have $\inf_{k\in\mathbb{N}}(f(x^k)+\nu_k)=\lim_{k\to\infty}(f(x^k)+\nu_k)=f(\bar{x})$. Therefore, $\bar{x}\in U$. Using Corollary 15, we obtain that $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1). ∎

Theorem 18.

Assume that $f$ is a convex function and $\Omega^*\neq\varnothing$. Then, $(x^k)_{k\in\mathbb{N}}$ converges to an optimal solution of problem (1).

Proof.

If $\Omega^*\neq\varnothing$, then $U\neq\varnothing$. Therefore, Corollary 15 implies that $(x^k)_{k\in\mathbb{N}}$ converges to a stationary point of problem (1) and, due to $f$ being convex, this point is also an optimal solution. ∎

5.2 Iteration-complexity bound

In this section, we present some iteration-complexity bounds related to the sequence $(x^k)_{k\in\mathbb{N}}$ generated by Algorithm 2. For that, besides assuming A1 and A2, we also need the following assumption.

  • A3.

    The gradient $\nabla f$ of $f$ is Lipschitz continuous with constant $L>0$.

To simplify notation, we define the following positive constant

\tau_{\min} := \min\left\{1,\ \frac{\underline{\omega}(1-\sigma)}{\alpha_{\max}\mu L}\right\}. \qquad (38)
Lemma 19.

The step size $\tau_k$ in Algorithm 2 satisfies $\tau_k\geq\tau_{\min}$.

Proof.

First, we assume that $\tau_k=1$. In this case, we have $\tau_k\geq\tau_{\min}$ and the required inequality holds. Now, we assume that $\tau_k<1$. Thus, it follows from (19) that there exists $0<\hat{\tau}_k\leq\min\{1,\tau_k/\underline{\omega}\}$ such that

f\big(x^k+\hat{\tau}_k(w^k-x^k)\big) > f(x^k) + \sigma\hat{\tau}_k\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \nu_k. \qquad (39)

Considering that we are under assumption A3, we apply Lemma 1 to obtain

f\big(x^k+\hat{\tau}_k(w^k-x^k)\big) \leq f(x^k) + \hat{\tau}_k\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \frac{L}{2}\hat{\tau}_k^2\|w^k-x^k\|^2. \qquad (40)

Hence, the combination of (39) with (40) yields

(1-\sigma)\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \frac{L}{2}\hat{\tau}_k\|w^k-x^k\|^2 > \frac{\nu_k}{\hat{\tau}_k}. \qquad (41)

On the other hand, $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $\gamma_k=(1-\zeta_k)/2$, where $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$. Thus, applying Lemma 9(i) with $x=x^k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, we obtain

\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle \leq -\frac{1}{2\alpha_k}\|w^k-x^k\|_{D_k}^2.

Hence, considering that $\frac{1}{\mu}\|w^k-x^k\|^2 \leq \|w^k-x^k\|_{D_k}^2$ and $0<\alpha_k\leq\alpha_{\max}$, the last inequality implies

\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle \leq -\frac{1}{2\alpha_{\max}\mu}\|w^k-x^k\|^2.

The combination of the last inequality with (41) yields

\left(-\frac{1-\sigma}{2\alpha_{\max}\mu} + \frac{L}{2}\hat{\tau}_k\right)\|w^k-x^k\|^2 > \frac{\nu_k}{\hat{\tau}_k} \geq 0.

Thus, since $\hat{\tau}_k\leq\tau_k/\underline{\omega}$, we obtain $\tau_k\geq\underline{\omega}\hat{\tau}_k>\underline{\omega}(1-\sigma)/(\alpha_{\max}\mu L)\geq\tau_{\min}$, and the proof is concluded. ∎

Considering that ${\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)\subset{\cal P}_{C,\zeta_k}^{D_k}(x^k,z^k)$, it follows from Lemma 9(ii) that if $x^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$, then the point $x^k$ is stationary for problem (1). Since $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$, the quantity $\|w^k-x^k\|$ can be seen as a measure of the stationarity of the point $x^k$. In the next theorem, we present an iteration-complexity bound for this quantity, which is a constrained inexact version of [43, Theorem 1].

Theorem 20.

Let $\tau_{\min}$ be as defined in (38). Then, for every $N\in\mathbb{N}$, the following inequality holds

\min\left\{\|w^k-x^k\|:~k=0,1,\ldots,N-1\right\} \leq \sqrt{\frac{2\alpha_{\max}\mu\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{\sigma\tau_{\min}}}\,\frac{1}{\sqrt{N}}.
Proof.

Since $w^k\in{\cal R}_{C,\gamma_k}^{D_k}(x^k,z^k)$ with $\gamma_k=(1-\zeta_k)/2$, where $z^k=x^k-\alpha_k D_k^{-1}\nabla f(x^k)$, applying Lemma 9(i) with $x=x^k$, $w(\alpha)=w^k$, $z=z^k$ and $\zeta=\zeta_k$, and taking into account that $(1/\mu)\|w^k-x^k\|^2\leq\|w^k-x^k\|_{D_k}^2$ and $0<\alpha_k\leq\alpha_{\max}$, we obtain

\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle \leq -\frac{1}{2\alpha_k}\|w^k-x^k\|_{D_k}^2 \leq -\frac{1}{2\alpha_{\max}\mu}\|w^k-x^k\|^2.

The definition of $\tau_k$ and (19) imply $f(x^{k+1})-f(x^k) \leq \sigma\tau_k\big\langle\nabla f(x^k),\, w^k-x^k\big\rangle + \nu_k$. The combination of the last two inequalities together with Lemma 19 yields

f(x^k)-f(x^{k+1})+\nu_k \geq \frac{\sigma\tau_k}{2\alpha_{\max}\mu}\|w^k-x^k\|^2 \geq \frac{\sigma\tau_{\min}}{2\alpha_{\max}\mu}\|w^k-x^k\|^2.

Hence, summing the above inequality over $k=0,1,\ldots,N-1$, we conclude that

\sum_{k=0}^{N-1}\|w^k-x^k\|^2 \leq \frac{2\alpha_{\max}\mu\big[f(x^0)-f(x^{N+1})+\sum_{k=0}^{N}\nu_k\big]}{\sigma\tau_{\min}} \leq \frac{2\alpha_{\max}\mu\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{\sigma\tau_{\min}},

which implies the desired result. ∎

The next theorem presents an iteration-complexity bound for the sequence $(f(x^k))_{k\in\mathbb{N}}$ when $f$ is convex.

Theorem 21.

Let $f$ be a convex function on $C$. Then, for every $N\in\mathbb{N}$, there holds

\min\left\{f(x^k)-f^*:~k=0,1,\ldots,N-1\right\} \leq \frac{\|x^0-x^*\|_{D_0}^2 + \xi\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{2\alpha_{\min}\tau_{\min}}\,\frac{1}{N}.
Proof.

Using the first inequality in (17) and Lemma 19, we have $2\alpha_{\min}\tau_{\min}\leq 2\alpha_k\tau_k$, for all $k\in\mathbb{N}$. We also know from the convexity of $f$ that $\langle\nabla f(x^k),x^*-x^k\rangle \leq f^*-f(x^k)$, for all $k\in\mathbb{N}$. Thus, applying Lemma 14 with $x=x^*$, after some algebraic manipulations, we conclude that

2\alpha_{\min}\tau_{\min}\left[f(x^k)-f^*\right] \leq \|x^k-x^*\|_{D_k}^2 - \|x^{k+1}-x^*\|_{D_{k+1}}^2 + \xi\left[f(x^k)-f(x^{k+1})+\nu_k\right], \qquad k=0,1,\ldots.

Hence, summing the above inequality over $k=0,1,\ldots,N-1$, we obtain

2\alpha_{\min}\tau_{\min}\sum_{k=0}^{N-1}\left[f(x^k)-f^*\right] \leq \|x^0-x^*\|_{D_0}^2 - \|x^N-x^*\|_{D_N}^2 + \xi\Big[f(x^0)-f(x^N)+\sum_{k=0}^{N-1}\nu_k\Big].

Thus, $2\alpha_{\min}\tau_{\min} N \min\{f(x^k)-f^*:~k=0,1,\ldots,N-1\} \leq \|x^0-x^*\|_{D_0}^2 + \xi\big[f(x^0)-f^*+\sum_{k=0}^{N-1}\nu_k\big]$, which implies the desired inequality. ∎

We end this section with some results regarding the number of function evaluations performed by Algorithm 2. Note that the computational cost associated with each (outer) iteration involves a gradient evaluation, the computation of an (inexact) projection, and evaluations of $f$ at different trial points. Thus, we must account for the function evaluations at the rejected trial points.

Lemma 22.

Let $N_k$ be the number of function evaluations after $k\geq 0$ iterations of Algorithm 2. Then, $N_k \leq 1 + (k+1)\big[\log(\tau_{\min})/\log(\bar{\omega}) + 1\big]$.

Proof.

Let $j(k)\geq 0$ be the number of inner iterations in Step 2 of Algorithm 2 needed to compute the step size $\tau_k$. Thus, $\tau_k\leq\bar{\omega}^{j(k)}$. Using Lemma 19, we have $0<\tau_{\min}\leq\tau_k$ for all $k\in\mathbb{N}$, which implies that $\log(\tau_{\min})\leq\log(\tau_k)\leq j(k)\log(\bar{\omega})$, for all $k\in\mathbb{N}$. Hence, due to $\log(\bar{\omega})<0$, we have $j(k)\leq\log(\tau_{\min})/\log(\bar{\omega})$. Therefore,

N_k = 1 + \sum_{\ell=0}^{k}\big(j(\ell)+1\big) \leq 1 + \sum_{\ell=0}^{k}\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big) = 1 + (k+1)\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big),

where the first equality follows from the definition of $N_k$. ∎

Theorem 23.

For a given $\epsilon>0$, Algorithm 2 computes $x^k$ and $w^k$ such that $\|w^k-x^k\|\leq\epsilon$ using, at most,

1 + \left(\frac{2\alpha_{\max}\mu\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{\sigma\tau_{\min}}\,\frac{1}{\epsilon^2} + 1\right)\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big)

function evaluations.

Proof.

The proof follows straightforwardly from Theorem 20 and Lemma 22. ∎

Theorem 24.

Let $f$ be a convex function on $C$. For a given $\epsilon>0$, the number of function evaluations performed by Algorithm 2 is, at most,

1 + \left(\frac{\|x^0-x^*\|_{D_0}^2 + \xi\left[f(x^0)-f^*+\sum_{k=0}^{\infty}\nu_k\right]}{2\alpha_{\min}\tau_{\min}}\,\frac{1}{\epsilon} + 1\right)\Big(\frac{\log(\tau_{\min})}{\log(\bar{\omega})}+1\Big),

to compute $x^k$ such that $f(x^k)-f^*\leq\epsilon$.

Proof.

The proof follows straightforwardly from Theorem 21 and Lemma 22. ∎

6 Numerical experiments

This section presents some numerical experiments that illustrate the potential advantages of considering inexact schemes in the SPG method. We discuss inexactness associated with both the projection onto the feasible set and the line search procedure.

Given two $m\times n$ matrices $A$ and $B$, with $m\geq n$, and $c\in\mathbb{R}$, we consider the matrix function $f:\mathbb{R}^{n\times n}\to\mathbb{R}$ given by:

f(X) := \frac{1}{2}\|AX-B\|_F^2 + \sum_{i=1}^{n-1}\left[c\left(X_{i+1,i+1}-X_{i,i}^2\right)^2 + (1-X_{i,i})^2\right], \qquad (42)

which combines a least squares term with a Rosenbrock-type function. Throughout this section, $X_{i,j}$ stands for the $ij$-element of the matrix $X$ and $\|\cdot\|_F$ denotes the Frobenius matrix norm, i.e., $\|A\|_F:=\sqrt{\langle A,A\rangle}$, where the inner product is given by $\langle A,B\rangle=\mathrm{tr}(A^TB)$. The test problems consist of minimizing $f$ in (42) subject to two different feasible sets, as described below. We point out that interesting applications in many areas emerge as constrained least squares matrix problems, see [13] and references therein. In turn, the Rosenbrock term was added in order to make the problems more challenging.
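As a reading aid, a direct transcription of (42) into code might look as follows; this is only an illustration of the formula (our actual experiments use the Matlab code mentioned later in this section), and the function name is ours.

```python
import numpy as np

def objective(X, A, B, c):
    """Evaluate (42): a least squares term plus a Rosenbrock-type
    penalty on the diagonal of X. A and B are m x n, X is n x n."""
    d = np.diag(X)  # diagonal entries X_{1,1}, ..., X_{n,n}
    least_squares = 0.5 * np.linalg.norm(A @ X - B, 'fro') ** 2
    rosenbrock = np.sum(c * (d[1:] - d[:-1] ** 2) ** 2 + (1.0 - d[:-1]) ** 2)
    return least_squares + rosenbrock
```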

Problem I:
\min f(X) \quad \mbox{s.t.} \quad X \in SDD^{+}, \quad L \leq X \leq U,

where $SDD^+$ is the cone of symmetric and diagonally dominant real matrices with positive diagonal, i.e.,

SDD^{+} := \{X\in\mathbb{R}^{n\times n} \mid X=X^T,\ X_{i,i}\geq\sum_{j\neq i}|X_{i,j}|\ \forall i\},

$L$ and $U$ are given $n\times n$ matrices, and $L\leq X\leq U$ means that $L_{i,j}\leq X_{i,j}\leq U_{i,j}$ for all $i,j$. The feasible set of Problem I was considered, for example, in the numerical tests of [13].

Problem II:
\min f(X) \quad \mbox{s.t.} \quad X \in \mathbb{S}^{n}_{+}, \quad \mathrm{tr}(X)=1,

where $\mathbb{S}^n_+$ is the cone of symmetric and positive semidefinite real matrices and $\mathrm{tr}(X)$ denotes the trace of $X$. The feasible set of Problem II is known as the spectrahedron and appears in several interesting applications; see, for example, [4, 41] and references therein.

It is easy to see that the feasible set of Problem I is a closed and convex set and that the feasible set of Problem II is a compact and convex set. As discussed in Section 2.1.1, Dykstra's alternating projection algorithm and the Frank-Wolfe algorithm can be used to compute inexact projections. The choice of the most appropriate method depends on the structure of the feasible set under consideration. For Problem I, we used the Dykstra's algorithm described in [13], see also [58]. In this case, $SDD^+=\cap_{i=1}^n SDD^+_i$, where

SDD^{+}_i := \{X\in\mathbb{R}^{n\times n} \mid X=X^T,\ X_{i,i}\geq\sum_{j\neq i}|X_{i,j}|\} \quad \mbox{for all } i=1,\ldots,n,

and the projection of a given $Z\in\mathbb{R}^{n\times n}$ onto $SDD^+$ consists of cycles of projections onto the convex sets $SDD^+_i$. Here, an iteration of Dykstra's algorithm should be understood as a complete cycle of projections onto all the sets $SDD^+_i$ and onto the box $\{X\in\mathbb{R}^{n\times n}\mid L\leq X\leq U\}$. Recall that this scheme provides an inexact projection as in Definition 2. Now consider Problem II. It is well known that computing an exact projection onto the spectrahedron (i.e., onto the feasible set of Problem II) requires a complete spectral decomposition, which can be prohibitive, especially in the large-scale case. In contrast, the computational cost of an iteration of the Frank-Wolfe algorithm described in Algorithm 1 is associated with an extreme eigenpair computation, see, for example, [48]. Unfortunately, despite its low cost per iteration, the Frank-Wolfe algorithm suffers from a slow convergence rate. Thus, we considered a variant of the Frank-Wolfe algorithm proposed in [4], which improves the convergence rate and the total time complexity of the classical Frank-Wolfe method. This algorithm, specialized for the projection problem over the spectrahedron, is carefully described in [1]. Without going into details, it replaces the top eigenpair computation in Frank-Wolfe with a top-$p$ (with $p\ll n$) eigenpair computation, where $p$ is an algorithmic parameter automatically selected. The total number of computed eigenpairs can be used to measure the computational effort to calculate projections. We recall that a Frank-Wolfe type scheme provides an inexact projection as in Definition 3.
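To make the cyclic scheme above concrete, the following is a minimal sketch of a generic Dykstra iteration for an intersection of convex sets, each represented by an exact projection operator; it is an illustration of the idea, with our own names, a simplified stopping test, and a hypothetical example, not the implementation of [13].

```python
import numpy as np

def dykstra(z, projections, max_cycles=100, tol=1e-8):
    """Approximately project z onto the intersection of convex sets,
    each given by its exact projection operator in `projections`.
    Stopped after finitely many cycles, the iterate serves as an
    inexact projection in the sense of Definition 2 (cf. the text)."""
    x = z.copy()
    increments = [np.zeros_like(z) for _ in projections]  # one per set
    for _ in range(max_cycles):
        x_old = x.copy()
        for i, proj in enumerate(projections):
            y = proj(x + increments[i])            # project corrected point
            increments[i] = x + increments[i] - y  # update correction term
            x = y
        if np.linalg.norm(x - x_old) <= tol:  # simple cycle-to-cycle test
            break
    return x

# Hypothetical example: nonnegative orthant intersected with {x : sum(x) = 1}
p1 = lambda v: np.maximum(v, 0.0)
p2 = lambda v: v - (np.sum(v) - 1.0) / v.size
w = dykstra(np.array([0.9, -0.3, 0.7]), [p1, p2])
```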

We note that Problems I and II can be seen as particular instances of problem (1) in which the number of variables is $(n^2+n)/2$. This means that they can be solved using Algorithm 2. We are especially interested in the spectral gradient version [14] of the SPG method, which is often associated with large-scale problems [15]. For this, we implemented Algorithm 2 considering $D_k:=I$ for all $k$, $\alpha_0:=\min(\alpha_{\max},\max(\alpha_{\min},1/\|\nabla f(x^0)\|))$ and, for $k>0$,

\alpha_k := \begin{cases} \min\big(\alpha_{\max},\max(\alpha_{\min},\ \langle s^k,s^k\rangle/\langle s^k,y^k\rangle)\big), & \mbox{if } \langle s^k,y^k\rangle>0,\\ \alpha_{\max}, & \mbox{otherwise}, \end{cases}

where $s^k:=X^k-X^{k-1}$, $y^k:=\nabla f(X^k)-\nabla f(X^{k-1})$, $\alpha_{\min}=10^{-10}$, and $\alpha_{\max}=10^{10}$. We set $\sigma=10^{-4}$, $\underline{\tau}=0.1$, $\bar{\tau}=0.9$, $\mu=1$ and $\nu_0=0$. The parameter $\delta_{\min}$ was chosen according to the line search used (see Section 3), while the parameter $\zeta_{\min}$ depends on the inexact projection scheme considered.
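For illustration, the spectral step-size rule above, with its safeguards, can be sketched as follows; the function name is ours, and the trace inner product plays the role of $\langle\cdot,\cdot\rangle$.

```python
import numpy as np

ALPHA_MIN, ALPHA_MAX = 1e-10, 1e10  # safeguards used in the experiments

def spectral_stepsize(X, X_prev, G, G_prev):
    """Barzilai-Borwein step <s,s>/<s,y> clipped to [ALPHA_MIN, ALPHA_MAX];
    falls back to ALPHA_MAX when the curvature <s,y> is not positive."""
    s = X - X_prev          # s^k
    y = G - G_prev          # y^k
    sy = np.sum(s * y)      # trace inner product <s^k, y^k>
    if sy > 0.0:
        return min(ALPHA_MAX, max(ALPHA_MIN, np.sum(s * s) / sy))
    return ALPHA_MAX
```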

In the line search scheme (Step 2 of Algorithm 2), if a step size $\tau_{\mathrm{trial}}$ is not accepted, then $\tau_{\mathrm{new}}$ is calculated using one-dimensional quadratic interpolation, employing the safeguard $\tau_{\mathrm{new}}\leftarrow\tau_{\mathrm{trial}}/2$ when the minimizer of the one-dimensional quadratic lies outside $[\underline{\omega}\tau_{\mathrm{trial}},\bar{\omega}\tau_{\mathrm{trial}}]$, see, for example, [54, Section 3.5]. Concerning the stopping criterion, all runs were stopped at an iterate $X^k$, declaring convergence, if

\max_{i,j}\big(|X^k_{i,j}-W^k_{i,j}|\big) \leq 10^{-6},

where $W^k$ is as in (18). Our codes are written in Matlab and are freely available at https://github.com/maxlemes/InexProj-SGM. All experiments were run on macOS 10.15.7 with a 3.7 GHz Intel Core i5 processor and 8 GB of RAM.
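A minimal sketch of the safeguarded backtracking just described might read as follows; here $\varphi$ denotes the one-dimensional function $\tau\mapsto f(x+\tau(w-x))$, $\nu$ is the non-monotone tolerance of (19), and the names and defaults are ours, following our reading of [54, Section 3.5].

```python
def backtrack(phi, phi0, dphi0, sigma=1e-4, nu=0.0,
              omega_lo=0.1, omega_hi=0.9):
    """Backtracking for the non-monotone Armijo-type condition
    phi(t) <= phi0 + sigma*t*dphi0 + nu, with phi0 = f(x) and
    dphi0 = <grad f(x), w - x> < 0. New trials come from quadratic
    interpolation, safeguarded to [omega_lo*t, omega_hi*t]."""
    t, ft = 1.0, phi(1.0)
    while ft > phi0 + sigma * t * dphi0 + nu:
        # minimizer of the quadratic model through phi(0), phi'(0), phi(t)
        denom = 2.0 * (ft - phi0 - dphi0 * t)
        t_new = -dphi0 * t * t / denom if denom > 0.0 else 0.5 * t
        if not (omega_lo * t <= t_new <= omega_hi * t):
            t_new = 0.5 * t  # safeguard fallback from the text
        t, ft = t_new, phi(t_new)
    return t

# usage (illustrative): t = backtrack(lambda t: f(x + t*(w - x)), f(x), g)
```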

6.1 Influence of the inexact projection

We begin the numerical experiments by checking the influence of the forcing parameters that control the degree of inexactness of the projections on the performance of the SPG method. In this first battery of tests, we used Armijo line searches, see Section 3.

We generated 10 instances of Problem I using $n=100$, $m=200$, and $c=10$. The matrices $A$ and $B$ were randomly generated with elements belonging to $[-1,1]$. We set $L\equiv 0$ and $U\equiv\infty$ as in [13]. For each instance, the starting point $X^0$ was randomly generated with elements belonging to $[0,1]$; it was then redefined as $(X^0+(X^0)^T)/2$, and its diagonal elements were in turn redefined as $2\sum_{j\neq i}^n X_{i,j}$, ensuring a feasible starting point. Figure 1 shows the average number of iterations, the average number of Dykstra's iterations, and the average CPU time in seconds needed to reach the solution for different choices of $\zeta_k$, namely, $\zeta_k=0.99$, $0.9$, $0.8$, $0.7$, $0.6$, $0.5$, $0.4$, $0.3$, $0.2$, and $0.1$ for all $k$. Remember that smaller values of $\zeta_k$ imply more inexact projections. As expected, the number of iterations of the SPG method tended to increase as $\zeta_k$ decreased, see Figure 1(a). On the other hand, the computational cost of an outer iteration (which can be measured by the number of Dykstra's iterations) tends to decrease for smaller values of $\zeta_k$. This suggests a trade-off, controlled by the parameter $\zeta_k$, between the number of iterations and the cost per iteration. Figure 1(b) shows that values of $\zeta_k$ close to 0.8 gave the best results, which is in line with the experiments reported in [13]. Finally, as can be seen in Figure 1(c), the CPU time was directly proportional to the number of Dykstra's iterations.

Figure 1: Results for 10 instances of Problem I using $n=100$, $m=200$, and $c=10$. Average number of (a) iterations and (b) Dykstra's iterations, and (c) average CPU time in seconds, needed to reach the solution for different choices of $\zeta_k$.

Although Algorithm 2 is stated only in terms of the parameter $\zeta_k$, we will directly consider the parameter $\gamma_k$ for Problem II, in which inexact projections are computed according to Definition 3. We randomly generated 10 instances of Problem II with $n=800$, $m=1000$, and $c=100$. Matrices $A$ and $B$ were obtained as for Problem I. In turn, the starting point $X^0$ was randomly generated with elements in the interval $[-1,1]$ and then redefined as $X^0(X^0)^T/\mathrm{tr}(X^0(X^0)^T)$, resulting in a feasible initial guess. Figure 2 shows the average number of iterations, the average number of computed eigenpairs, and the average CPU time in seconds needed to reach the solution for different constant choices of $\gamma_k$ ranging from $10^{-8}$ to $0.4999$. Now, higher values of $\gamma_k$ imply more inexact projections. Note that, for appropriate choices of $\zeta_k$, the adopted values of $\gamma_k$ fulfill Assumption A1 of Section 5. Concerning the number of iterations, as can be seen in Figure 2(a), the SPG method was not very sensitive to the choice of the parameter $\gamma_k$. Hence, since higher values of $\gamma_k$ imply cheaper iterations, the number of computed eigenpairs and the CPU time turned out to be inversely proportional to $\gamma_k$, see Figures 2(b)-(c). Thus, our experiments suggest that the best value for $\gamma_k$ is $0.4999$.

Figure 2: Results for 10 instances of Problem II using $n=800$, $m=1000$, and $c=100$. Average number of (a) iterations and (b) computed eigenpairs, and (c) average CPU time in seconds, needed to reach the solution for different choices of $\gamma_k$.

6.2 Influence of the line search scheme

The following experiments compare the performance of the SPG method with different strategies for computing the step sizes. We considered the Armijo, the Average-type, and the Max-type line searches discussed in Section 3. Based on our numerical experience, we employed the fixed value $\eta_k=0.85$ for the Average-type line search and $M=5$ for the Max-type line search. According to the results of the previous section, we used the fixed forcing parameters $\zeta_k=0.8$ and $\gamma_k=0.4999$ to compute inexact projections for Problems I and II, respectively. A sketch of how the two non-monotone tolerances can be maintained is given below.
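In the Max-type case, the reference value $f(x^{\ell(k)})$ is the maximum of the last $M$ objective values, so that $\nu_k=f(x^{\ell(k)})-f(x^k)$; in the Average-type case, a weighted average $C_k$ driven by $\eta_k$ is kept, in the spirit of [68]. The class names below are ours.

```python
from collections import deque

class MaxTypeReference:
    """Max-type: reference value is the max of the last M objective
    values, giving nu_k = reference - f(x^k) in the search condition."""
    def __init__(self, f0, M=5):
        self.hist = deque([f0], maxlen=M)
    def nu(self, fk):
        return max(self.hist) - fk
    def update(self, f_new):        # call once x^{k+1} is accepted
        self.hist.append(f_new)

class AverageTypeReference:
    """Average-type: weighted average C_k of past objective values
    with parameter eta (here 0.85), giving nu_k = C_k - f(x^k)."""
    def __init__(self, f0, eta=0.85):
        self.C, self.Q, self.eta = f0, 1.0, eta
    def nu(self, fk):
        return self.C - fk
    def update(self, f_new):        # Zhang-Hager-style recursion
        Q_new = self.eta * self.Q + 1.0
        self.C = (self.eta * self.Q * self.C + f_new) / Q_new
        self.Q = Q_new
```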

We randomly generated 100 instances of each problem as described in Section 6.1. The dimensions of the problems and the parameter $c$ in (42) were also chosen arbitrarily. For Problem I, we chose $100\leq n\leq 800$ and $10\leq c\leq 50$, whereas for Problem II, we chose $10\leq n\leq 200$ and $100\leq c\leq 1000$. In both cases, we set $m=2n$. We compare the strategies with respect to the number of function evaluations, the number of (outer) iterations, the total computational effort to calculate projections (measured by the number of Dykstra's iterations and of computed eigenpairs for Problems I and II, respectively), and the CPU time. The results are shown in Figures 3 and 4 for Problems I and II, respectively, using performance profiles [31].
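Recall that a performance profile [31] plots, for each solver, the fraction of problems on which its cost is within a factor $\tau$ of the best cost over all solvers; a minimal sketch, assuming the costs are collected in a problems-by-solvers array, is:

```python
import numpy as np

def performance_profile(T, taus):
    """T[p, s] = cost (e.g., CPU time) of solver s on problem p.
    Returns rho with rho[t, s] = fraction of problems whose ratio
    T[p, s] / min_s T[p, s] is at most taus[t] (Dolan-More [31])."""
    ratios = T / T.min(axis=1, keepdims=True)  # r_{p,s}
    return np.array([[np.mean(ratios[:, s] <= tau)
                      for s in range(T.shape[1])] for tau in taus])
```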

For Problem I, with regard to the number of function evaluations, the SPG method with the Average-type line search was the most efficient among the tested strategies. Somewhat surprisingly, on this set of test problems, the Armijo strategy was better than the Max-type line search, see Figure 3(a). On the other hand, as can be seen in Figure 3(b), the Armijo strategy required fewer iterations than the non-monotone strategies. As expected, this was reflected in the number of Dykstra's iterations and in the CPU time, see Figures 3(c)-(d). We can conclude that, with respect to the last two criteria, the Armijo and Average-type strategies had similar performances, both superior to the Max-type strategy.

Now, concerning Problem II, Figure 4 shows that the non-monotone strategies outperformed the Armijo strategy in all the comparison criteria considered. Again, the Average-type strategy seems to be superior to the Max-type strategy.

Figure 3: Performance profiles for Problem I considering the SPG method with the Armijo, the Average-type, and the Max-type line search strategies, using as performance measurement: (a) number of function evaluations; (b) number of (outer) iterations; (c) number of Dykstra's iterations; (d) CPU time.
Figure 4: Performance profiles for Problem II considering the SPG method with the Armijo, the Average-type, and the Max-type line search strategies, using as performance measurement: (a) number of function evaluations; (b) number of (outer) iterations; (c) number of computed eigenpairs; (d) CPU time.

From all the above experiments, we conclude that the non-monotone line searches tend to require fewer objective function evaluations. However, this does not necessarily translate into computational savings, since there may be an increase in the number of iterations. In this case, the optimal efficiency of the algorithm comes from a compromise between these two conflicting tendencies. Overall, the use of non-monotone line search techniques is mainly justified when the computational effort of an iteration is dominated by the cost of evaluating the objective function.

7 Conclusions

In this paper, we studied the SGP method for solving constrained convex optimization problems, employing inexact projections onto the feasible set and a general non-monotone line search. We expect that this paper will contribute to the development of research in this field, mainly for solving large-scale problems in which the computational effort of an iteration is dominated by the projection onto the feasible set and the cost of evaluating the objective function. Indeed, the idea of using inexactness in the projection, as well as in the line search, instead of exact schemes is particularly interesting from a computational point of view. In particular, it is noteworthy that the Frank-Wolfe method has a low computational cost per iteration, resulting in high computational performance on different classes of compact sets, see [37, 48]. An issue that deserves attention is the search for new efficient methods, such as Frank-Wolfe's and Dykstra's methods, that generate inexact projections.

References

  • [1] A. A. Aguiar, O. P. Ferreira, and L. F. Prudente. Inexact gradient projection method with relative error tolerance. arXiv preprint arXiv:2101.11146, 2021.
  • [2] A. A. Aguiar, O. P. Ferreira, and L. F. Prudente. Subgradient method with feasible inexact projections for constrained convex optimization problems. Optimization, 0(0):1–23, 2021, https://doi.org/10.1080/02331934.2021.1902520.
  • [3] M. Ahookhosh, K. Amini, and S. Bahrami. A class of nonmonotone Armijo-type line search method for unconstrained optimization. Optimization, 61(4):387–404, 2012.
  • [4] Z. Allen-Zhu, E. Hazan, W. Hu, and Y. Li. Linear convergence of a Frank-Wolfe type algorithm over trace-norm balls. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6192–6201, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [5] R. Andreani, E. G. Birgin, J. M. Martínez, and J. Yuan. Spectral projected gradient and variable metric methods for optimization with linear inequalities. IMA J. Numer. Anal., 25(2):221–252, 04 2005, https://academic.oup.com/imajna/article-pdf/25/2/221/2090233/drh020.pdf.
  • [6] A. Auslender, P. J. S. Silva, and M. Teboulle. Nonmonotone projected gradient methods based on barrier and Euclidean distances. Comput. Optim. Appl., 38(3):305–327, 2007.
  • [7] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8(1):141–148, 1988.
  • [8] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Publishing Company, Incorporated, 1st edition, 2011.
  • [9] A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Methods Oper. Res., 59(2):235–247, 2004.
  • [10] J. Y. Bello Cruz and L. R. Lucambio Pérez. Convergence of a projected gradient method variant for quasiconvex objectives. Nonlinear Anal., 73(9):2917–2922, 2010.
  • [11] D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Control, 21(2):174–184, 1976.
  • [12] D. P. Bertsekas. Nonlinear programming. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA, second edition, 1999.
  • [13] E. G. Birgin, J. M. Martínez, and M. Raydan. Inexact spectral projected gradient methods on convex sets. IMA J. Numer. Anal., 23(4):539–559, 2003.
  • [14] E. G. Birgin, J. M. Martínez, and M. Raydan. Nonmonotone spectral projected gradient methods on convex sets. SIAM J. Optim., 10(4):1196–1211, 2000, https://doi.org/10.1137/S1052623497330963.
  • [15] E. G. Birgin, J. M. Martínez, and M. Raydan. Spectral projected gradient methods: Review and perspectives. J. Stat. Softw., 60(3):1–21, 2014.
  • [16] S. Bonettini, I. Loris, F. Porta, and M. Prato. Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM J. Optim., 26(2):891–921, 2016.
  • [17] S. Bonettini, F. Porta, M. Prato, S. Rebegoldi, V. Ruggiero, and L. Zanni. Recent advances in variable metric first-order methods. In Computational Methods for Inverse Problems in Imaging, pages 1–31. Springer, 2019.
  • [18] S. Bonettini and M. Prato. New convergence results for the scaled gradient projection method. Inverse Problems, 31(9):095008, 20, 2015.
  • [19] S. Bonettini, R. Zanella, and L. Zanni. A scaled gradient projection method for constrained image deblurring. Inverse Problems, 25(1):015002, 23, 2009.
  • [20] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
  • [21] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. Society for Industrial and Applied Mathematics, 1994, https://epubs.siam.org/doi/pdf/10.1137/1.9781611970777.
  • [22] J. P. Boyle and R. L. Dykstra. A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in order restricted statistical inference (Iowa City, Iowa, 1985), volume 37 of Lect. Notes Stat., pages 28–47. Springer, Berlin, 1986.
  • [23] P. L. Combettes and B. C. Vũ. Variable metric quasi-Fejér monotonicity. Nonlinear Anal., 78:17–31, 2013.
  • [24] Y. H. Dai. On the nonmonotone line search. J. Optim. Theory Appl., 112(2):315–330, 2002.
  • [25] Y.-H. Dai and R. Fletcher. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math., 100(1):21–47, 2005.
  • [26] Y.-H. Dai and R. Fletcher. New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Math. Program., 106(3, Ser. A):403–421, 2006.
  • [27] Y.-H. Dai, W. W. Hager, K. Schittkowski, and H. Zhang. The cyclic Barzilai-Borwein method for unconstrained optimization. IMA J. Numer. Anal., 26(3):604–627, 2006.
  • [28] F. R. de Oliveira, O. P. Ferreira, and G. N. Silva. Newton’s method with feasible inexact projections for solving constrained generalized equations. Comput. Optim. Appl., 72(1):159–177, 2019.
  • [29] D. di Serafino, V. Ruggiero, G. Toraldo, and L. Zanni. On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput., 318:176–195, 2018.
  • [30] R. Díaz Millán, O. P. Ferreira, and L. F. Prudente. Alternating conditional gradient method for convex feasibility problems. arXiv preprint arXiv:1912.04247, 2019.
  • [31] E. D. Dolan and J. J. Moré. Benchmarking optimization software with performance profiles. Math. Program., 91(2):201–213, 2002.
  • [32] R. L. Dykstra. An algorithm for restricted least squares regression. J. Amer. Statist. Assoc., 78(384):837–842, 1983.
  • [33] J. Fan, L. Wang, and A. Yan. An inexact projected gradient method for sparsity-constrained quadratic measurements regression. Asia-Pac. J. Oper. Res., 36(2):1940008, 21, 2019.
  • [34] N. S. Fazzio and M. L. Schuverdt. Convergence analysis of a nonmonotone projected gradient method for multiobjective optimization problems. Optim. Lett., 13(6):1365–1379, 2019.
  • [35] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Sel. Topics Signal Process., 1(4):586–597, Dec 2007.
  • [36] A. Friedlander, J. M. Martínez, B. Molina, and M. Raydan. Gradient method with retards and generalizations. SIAM J. Numer. Anal., 36(1):275–289, 1999.
  • [37] D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pages 541–549, 2015.
  • [38] M. Golbabaee and M. E. Davies. Inexact gradient projection and fast data driven compressed sensing. IEEE Trans. Inform. Theory, 64(10):6707–6721, 2018.
  • [39] A. A. Goldstein. Convex programming in Hilbert space. Bull. Amer. Math. Soc., 70:709–710, 1964.
  • [40] P. Gong, K. Gai, and C. Zhang. Efficient Euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing, 74(17):2754–2766, 2011.
  • [41] D. S. Gonçalves, M. A. Gomes-Ruggiero, and C. Lavor. A projected gradient method for optimization over density matrices. Optim. Methods Softw., 31(2):328–341, 2016, https://doi.org/10.1080/10556788.2015.1082105.
  • [42] D. S. Gonçalves, M. L. N. Gonçalves, and T. C. Menezes. Inexact variable metric method for convex-constrained optimization problems. Optimization, 0(0):1–19, 2021, https://doi.org/10.1080/02331934.2021.1887181.
  • [43] G. N. Grapiglia and E. W. Sachs. On the worst-case evaluation complexity of non-monotone line search algorithms. Comput. Optim. Appl., 68(3):555–577, 2017.
  • [44] G. N. Grapiglia and E. W. Sachs. A generalized worst-case complexity analysis for non-monotone line searches. Numer. Algorithms, 87(2):779–796, Jun 2021.
  • [45] L. Grippo, F. Lampariello, and S. Lucidi. A nonmonotone line search technique for Newton’s method. SIAM J. Numer. Anal., 23(4):707–716, 1986.
  • [46] N. J. Higham. Computing the nearest correlation matrix—a problem from finance. IMA J. Numer. Anal., 22(3):329–343, 2002.
  • [47] A. N. Iusem. On the convergence properties of the projected gradient method for convex optimization. Comput. Appl. Math., 22(1):37–52, 2003.
  • [48] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 427–435, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.
  • [49] E. Levitin and B. Polyak. Constrained minimization methods. USSR Comput. Math. Math. Phys., 6(5):1–50, 1966.
  • [50] G. Ma, Y. Hu, and H. Gao. An accelerated momentum based gradient projection method for image deblurring. In 2015 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), pages 1–4, 2015.
  • [51] J. Mo, C. Liu, and S. Yan. A nonmonotone trust region method based on nonincreasing technique of weighted average of the successive function values. J. Comput. Appl. Math., 209(1):97–108, 2007.
  • [52] J. J. Moré. On the performance of algorithms for large-scale bound constrained problems. In Large-scale numerical optimization (Ithaca, NY, 1989), pages 32–45. SIAM, Philadelphia, PA, 1990.
  • [53] Y. Nesterov and A. Nemirovski. On first-order algorithms for $\ell_1$/nuclear norm minimization. Acta Numer., 22:509–575, 2013.
  • [54] J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • [55] E. R. Panier and A. L. Tits. Avoiding the Maratos effect by means of a nonmonotone line search. I. General constrained problems. SIAM J. Numer. Anal., 28(4):1183–1195, 1991.
  • [56] A. Patrascu and I. Necoara. On the convergence of inexact projection primal first-order methods for convex minimization. IEEE Trans. Automat. Control, 63(10):3317–3329, 2018.
  • [57] J. Rasch and A. Chambolle. Inexact first-order primal-dual algorithms. Comput. Optim. Appl., 76(2):381–430, 2020.
  • [58] M. Raydan and P. Tarazaga. Primal and polar approach for computing the symmetric diagonally dominant projection. Numer. Linear Algebra Appl., 9(5):333–345, 2002, https://onlinelibrary.wiley.com/doi/pdf/10.1002/nla.277.
  • [59] E. W. Sachs and S. M. Sachs. Nonmonotone line searches for optimization algorithms. Control Cybernet., 40(4):1059–1075, 2011.
  • [60] S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. J. Convex Anal., 19(4):1167–1192, 2012.
  • [61] S. Sra, S. Nowozin, and S. Wright. Optimization for Machine Learning. Neural information processing series. MIT Press, 2012.
  • [62] J. Tang, M. Golbabaee, and M. E. Davies. Gradient projection iterative sketch for large-scale constrained least-squares. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3377–3386, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [63] P. L. Toint. An assessment of nonmonotone linesearch techniques for unconstrained optimization. SIAM J. Sci. Comput., 17(3):725–739, 1996.
  • [64] S. Villa, S. Salzo, L. Baldassarre, and A. Verri. Accelerated and inexact forward-backward algorithms. SIAM J. Optim., 23(3):1607–1633, 2013.
  • [65] C. Wang, Q. Liu, and X. Yang. Convergence properties of nonmonotone spectral projected gradient methods. J. Comput. Appl. Math., 182(1):51–66, 2005.
  • [66] X. Yan, K. Wang, and H. He. On the convergence rate of scaled gradient projection method. Optimization, 67(9):1365–1376, 2018.
  • [67] F. Zhang, H. Wang, J. Wang, and K. Yang. Inexact primal–dual gradient projection methods for nonlinear optimization on convex set. Optimization, 69(10):2339–2365, 2020, https://doi.org/10.1080/02331934.2019.1696338.
  • [68] H. Zhang and W. W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM J. Optim., 14(4):1043–1056, 2004.
  • [69] B. Zhou, L. Gao, and Y.-H. Dai. Gradient methods with adaptive step-sizes. Comput. Optim. Appl., 35(1):69–86, 2006.