
A Bregman inertial forward-reflected-backward method for nonconvex minimization

Xianfu Wang and Ziyuan Wang. Department of Mathematics, Irving K. Barber Faculty of Science, University of British Columbia, Kelowna, B.C. V1V 1V7, Canada. E-mail: [email protected].
Abstract

We propose a Bregman inertial forward-reflected-backward (BiFRB) method for nonconvex composite problems. Our analysis relies on a novel approach that imposes general conditions on implicit merit function parameters, which yields a stepsize condition that is independent of inertial parameters. In turn, a question of Malitsky and Tam regarding whether FRB can be equipped with a Nesterov-type acceleration is resolved. Assuming the generalized concave Kurdyka-Łojasiewicz property of a quadratic regularization of the objective, we obtain sequential convergence of BiFRB, as well as convergence rates on both the function value and actual sequence. We also present formulae for the Bregman subproblem, supplementing not only BiFRB but also the work of Boţ-Csetnek-László and Boţ-Csetnek. Numerical simulations are conducted to evaluate the performance of our proposed algorithm.

2010 Mathematics Subject Classification: Primary 90C26, 49J52; Secondary 26D10.

Keywords: Generalized concave Kurdyka-Łojasiewicz property, Bregman proximal mapping, forward-reflected-backward splitting, implicit merit function, nonconvex optimization, inertial effect.

1 Introduction

Consider the inclusion problem of maximally monotone operators:

find xn such that 0(A+B)(x),\text{find }x\in\mathbb{R}^{n}\text{ such that }0\in(A+B)(x), (1)

where A:n2nA:\mathbb{R}^{n}\to 2^{\mathbb{R}^{n}} and B:nnB:\mathbb{R}^{n}\to\mathbb{R}^{n} are (maximally) monotone operators with BB being Lipschitz continuous. The behavior of splitting methods for solving (1) is well understood; see, e.g., the recent forward-reflected-backward method by Malitsky and Tam [22], and [3, Chapters 26, 28], which covers classical algorithms such as the forward-backward method [14] with BB being cocoercive, the Douglas-Rachford method [19, 15], and Tseng’s method [29].

The goal of this paper is to propose a Bregman inertial forward-reflected-backward method (Algorithm 3.2) for solving the nonconvex composite problem

minxnF(x)=f(x)+g(x),\min_{x\in\mathbb{R}^{n}}F(x)=f(x)+g(x), (2)

where f:n¯=(,]f:\mathbb{R}^{n}\to\overline{\mathbb{R}}=(-\infty,\infty] is proper lower semicontinuous (lsc), and g:ng:\mathbb{R}^{n}\to\mathbb{R} has a Lipschitz continuous gradient, which corresponds to (1) with A=fA=\partial f and B=gB=\nabla g when ff and gg are both convex. Informally, the proposed algorithm computes

xk+1(h+λkf)1(xk2λkg(xk)+λkg(xk1)+αk(xkxk1)),x_{k+1}\in(\nabla h+\lambda_{k}\partial f)^{-1}(x_{k}-2\lambda_{k}\nabla g(x_{k})+\lambda_{k}\nabla g(x_{k-1})+\alpha_{k}(x_{k}-x_{k-1})), (3)

for some stepsize λk>0\lambda_{k}>0, inertial parameter αk[0,1)\alpha_{k}\in[0,1) and kernel h:nh:\mathbb{R}^{n}\to\mathbb{R}. In particular, (3) reduces to the inertial forward-reflected-backward method studied in [22, Corollary 4.4] when h=2/2h=\left\lVert\cdot\right\rVert^{2}/2, f,gf,g are convex and the inertial parameter is fixed. Before stating our main contribution, let us note that several splitting algorithms have been extended to the nonconvex setting for solving (2); see, e.g., the Douglas-Rachford method [18], forward-backward method [2, 8], inertial forward-backward methods [27, 10], the Peaceman-Rachford method [17], inertial Tseng’s method [9], and forward-reflected-backward method [31].

Let us summarize the main contributions of this paper. Our convergence analysis relies on the generalized concave Kurdyka-Łojasiewicz property (Definition 2.2) and a novel framework for analyzing a sequence of implicit merit functions (Definition 3.10) through a general condition (Assumption 3.12) on merit function parameters. In turn, we derive global sequential convergence to a stationary point of FF with an explicit stepsize condition that is independent of inertial parameters, whereas stepsize assumptions in the current literature are tangled with inertial parameters or even implicit; see Remark 4.3(i). Notably, such an independence result resolves a question of Malitsky and Tam [22, Section 7] regarding whether FRB can be adapted to incorporate a Nesterov-type acceleration; see Remark 4.3(ii). Convergence rate analysis on both the function value and the actual sequence is also carried out. Moreover, we provide several formulae for the associated Bregman subproblem, which to the best of our knowledge are the first of their kind among existing results devoted to Bregman extensions of splitting algorithms with the same kernel; see, e.g., [9, 10].

The paper is organized as follows. We begin with notation and background in Section 2. The proposed algorithm and its abstract convergence are studied in Section 3. Then, in Section 4, we present suitable stepsize rules under which the abstract convergence holds. Function value and sequential convergence rate results are discussed in Section 5, followed by Bregman subproblem formulae in Section 6. Numerical simulations are conducted in Section 7. We end this paper in Section 8 with remarks on our contribution and possible directions for future research.

2 Notation and preliminaries

Throughout this paper, n\mathbb{R}^{n} is the standard Euclidean space equipped with inner product x,y=xTy\langle x,y\rangle=x^{T}y and the Euclidean norm x=x,x\left\lVert x\right\rVert=\sqrt{\langle x,x\rangle} for x,ynx,y\in\mathbb{R}^{n}. Let ={0,1,2,}\mathbb{N}=\{0,1,2,\ldots\} and ={1}\mathbb{N}^{*}=\{-1\}\cup\mathbb{N}. The open ball centered at x¯\bar{x} with radius rr is denoted by 𝔹(x¯;r)\mathbb{B}(\bar{x};r). The distance function of a subset KnK\subseteq\mathbb{R}^{n} is dist(,K):n¯=(,]\operatorname{dist}(\cdot,K):\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}}=(-\infty,\infty],

xdist(x,K)=inf{xy:yK},x\mapsto\operatorname{dist}(x,K)=\inf\{\left\lVert x-y\right\rVert:y\in K\},

where dist(x,K)\operatorname{dist}(x,K)\equiv\infty if K=K=\emptyset. For f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} and r1,r2[,]r_{1},r_{2}\in[-\infty,\infty], we set [r1<f<r2]={xn:r1<f(x)<r2}[r_{1}<f<r_{2}]=\{x\in\mathbb{R}^{n}:r_{1}<f(x)<r_{2}\}. We say a function f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is coercive if limxf(x)=\lim_{\left\lVert x\right\rVert\to\infty}f(x)=\infty. A proper, lsc function f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is prox-bounded if there exists λ>0\lambda>0 such that f+2/(2λ)f+\left\lVert\cdot\right\rVert^{2}/(2\lambda) is bounded below; see, e.g., [28, Exercise 1.24]. The supremum of the set of all such λ\lambda is the threshold λf\lambda_{f} of prox-boundedness for ff. The indicator function of a set KnK\subseteq\mathbb{R}^{n} is δK(x)=0\delta_{K}(x)=0 if xKx\in K and δK(x)=\delta_{K}(x)=\infty otherwise.
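For intuition, consider the nonconvex function f(x) = −‖x‖² (our illustrative example, not from the text): f + ‖·‖²/(2λ) = (1/(2λ) − 1)‖x‖² is bounded below precisely when λ ≤ 1/2, so the threshold of prox-boundedness is λ_f = 1/2. A minimal numerical sketch:

```python
import numpy as np

def quad_reg(x, lam):
    """f(x) + ||x||^2 / (2*lam) for the nonconvex test function f(x) = -||x||^2."""
    return -np.dot(x, x) + np.dot(x, x) / (2.0 * lam)

xs = [np.array([t]) for t in np.linspace(0.0, 100.0, 11)]
# lam = 0.4 < 1/2: coefficient 1/(2*0.4) - 1 = 0.25 > 0, so bounded below by 0.
vals_ok = [quad_reg(x, 0.4) for x in xs]
# lam = 1.0 > 1/2: coefficient 1/2 - 1 = -0.5 < 0, so unbounded below.
vals_bad = [quad_reg(x, 1.0) for x in xs]
assert min(vals_ok) >= 0.0 and vals_bad[-1] < -1000.0
```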

We will use frequently the following concepts from variational analysis; see, e.g., [28, 23].

Definition 2.1

Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be a proper function and KnK\subseteq\mathbb{R}^{n} a nonempty closed set. We say that

(i) vnv\in\mathbb{R}^{n} is a Fréchet subgradient of ff at x¯domf\bar{x}\in\operatorname{dom}f, denoted by v^f(x¯)v\in\hat{\partial}f(\bar{x}), if

f(x)f(x¯)+v,xx¯+o(xx¯).f(x)\geq f(\bar{x})+\langle v,x-\bar{x}\rangle+o(\left\lVert x-\bar{x}\right\rVert). (4)

(ii) vnv\in\mathbb{R}^{n} is a limiting subgradient of ff at x¯domf\bar{x}\in\operatorname{dom}f, denoted by vf(x¯)v\in\partial f(\bar{x}), if

v{vn:xk𝑓x¯,vk^f(xk),vkv},v\in\{v\in\mathbb{R}^{n}:\exists x_{k}\xrightarrow[]{f}\bar{x},\exists v_{k}\in\hat{\partial}f(x_{k}),v_{k}\rightarrow v\}, (5)

where xk𝑓x¯xkx¯ and f(xk)f(x¯)x_{k}\xrightarrow[]{f}\bar{x}\Leftrightarrow x_{k}\rightarrow\bar{x}\text{ and }f(x_{k})\rightarrow f(\bar{x}). Moreover, we set domf={xn:f(x)}\operatorname{dom}\partial f=\{x\in\mathbb{R}^{n}:\partial f(x)\neq\emptyset\}. We say that x¯domf\bar{x}\in\operatorname{dom}\partial f is a stationary point, if 0f(x¯)0\in\partial f(\bar{x}).

(iii) The Fréchet and limiting normal cones to KK at x¯\bar{x} are N^K(x¯)=^δK(x¯)\hat{N}_{K}(\bar{x})=\hat{\partial}\delta_{K}(\bar{x}) and NK(x¯)=δK(x¯)N_{K}(\bar{x})=\partial\delta_{K}(\bar{x}), respectively, if x¯K\bar{x}\in K; otherwise N^K(x¯)=NK(x¯)=\hat{N}_{K}(\bar{x})=N_{K}(\bar{x})=\emptyset.

If KnK\subseteq\mathbb{R}^{n} is a nonempty convex set, then NK(x¯)=N^K(x¯)={vn:(xK)v,xx¯0}N_{K}(\bar{x})=\hat{N}_{K}(\bar{x})=\{v\in\mathbb{R}^{n}:(\forall x\in K)~{}\langle v,x-\bar{x}\rangle\leq 0\}; see, e.g., [28].

The proximal mapping of a proper and lsc f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} with parameter λ>0\lambda>0 is

(xn)Proxλf(x)=argminyn{f(y)+12λxy2},(\forall x\in\mathbb{R}^{n})\ \operatorname{Prox}_{\lambda f}(x)=\mathop{\rm argmin}\limits_{y\in\mathbb{R}^{n}}\Big{\{}f(y)+\frac{1}{2\lambda}\left\lVert x-y\right\rVert^{2}\Big{\}},

which is the resolvent JλAJ_{\lambda A} with A=fA=\partial f when ff is convex; see, e.g., [3]. The Bregman distance induced by a differentiable h:nh:\mathbb{R}^{n}\to\mathbb{R} is

Dh:n×n:(x,y)h(x)h(y)xy,h(y).D_{h}:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}:(x,y)\mapsto h(x)-h(y)-\langle x-y,\nabla h(y)\rangle.
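Two small illustrations of these definitions (ours, not from the text): the proximal mapping of the indicator of a box is the projection onto it, and with the Euclidean kernel h = ‖·‖²/2 the Bregman distance reduces to half the squared distance:

```python
import numpy as np

def prox_box(x, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n: the projection onto it
    (the parameter lambda plays no role for indicator functions)."""
    return np.clip(x, lo, hi)

def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <x - y, grad h(y)>."""
    return h(x) - h(y) - np.dot(x - y, grad_h(y))

assert np.allclose(prox_box(np.array([-2.0, 0.3, 5.0]), 0.0, 1.0), [0.0, 0.3, 1.0])

# Euclidean kernel: D_h(x, y) = ||x - y||^2 / 2.
h = lambda z: 0.5 * np.dot(z, z)
grad_h = lambda z: z
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(bregman(h, grad_h, x, y), 0.5 * np.dot(x - y, x - y))
```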

For η(0,]\eta\in(0,\infty], denote by Φη\Phi_{\eta} the class of functions φ:[0,η)+\varphi:[0,\eta)\rightarrow\mathbb{R}_{+} satisfying the following conditions: (i) φ(t)\varphi(t) is right-continuous at t=0t=0 with φ(0)=0\varphi(0)=0; (ii) φ\varphi is strictly increasing on [0,η)[0,\eta). The following concept will be the key in our convergence analysis.

Definition 2.2

[30] Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be proper and lsc. Let x¯domf\bar{x}\in\operatorname{dom}\partial f and μ\mu\in\mathbb{R}, and let VdomfV\subseteq\operatorname{dom}\partial f be a nonempty subset.

(i) We say that ff has the pointwise generalized concave Kurdyka-Łojasiewicz (KL) property at x¯domf\bar{x}\in\operatorname{dom}\partial f, if there exist neighborhood Ux¯U\ni\bar{x}, η(0,]\eta\in(0,\infty] and concave φΦη\varphi\in\Phi_{\eta}, such that for all xU[0<ff(x¯)<η]x\in U\cap[0<f-f(\bar{x})<\eta],

φ(f(x)f(x¯))dist(0,f(x))1,\varphi^{\prime}_{-}\big{(}f(x)-f(\bar{x})\big{)}\cdot\operatorname{dist}\big{(}0,\partial f(x)\big{)}\geq 1, (6)

where φ\varphi_{-}^{\prime} denotes the left derivative. Moreover, ff is a generalized concave KL function if it has the generalized concave KL property at every xdomfx\in\operatorname{dom}\partial f.

(ii) Suppose that f(x)=μf(x)=\mu on VV. We say ff has the setwise (we shall omit the adjectives “pointwise” and “setwise” whenever there is no ambiguity) generalized concave Kurdyka-Łojasiewicz property on VV, if there exist UVU\supset V, η(0,]\eta\in(0,\infty] and concave φΦη\varphi\in\Phi_{\eta} such that for every xU[0<fμ<η]x\in U\cap[0<f-\mu<\eta],

φ(f(x)μ)dist(0,f(x))1.\varphi^{\prime}_{-}\big{(}f(x)-\mu\big{)}\cdot\operatorname{dist}\big{(}0,\partial f(x)\big{)}\geq 1. (7)

Generalized concave KL functions are ubiquitous. The class of semialgebraic functions is one particularly rich resource. We refer the interested reader to [7, 1, 6] for more examples; see also the fundamental work of Łojasiewicz [20] and Kurdyka [16].

Definition 2.3

(i) A set EnE\subseteq\mathbb{R}^{n} is called semialgebraic if there exist finitely many polynomials gij,hij:ng_{ij},h_{ij}:\mathbb{R}^{n}\rightarrow\mathbb{R} such that E=j=1pi=1q{xn:gij(x)=0 and hij(x)<0}.E=\bigcup_{j=1}^{p}\bigcap_{i=1}^{q}\{x\in\mathbb{R}^{n}:g_{ij}(x)=0\text{ and }h_{ij}(x)<0\}.

(ii) A function f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} is called semialgebraic if its graph gphf={(x,y)n+1:f(x)=y}\operatorname{gph}f=\{(x,y)\in\mathbb{R}^{n+1}:f(x)=y\} is semialgebraic.

Fact 2.4

[6, Corollary 16] Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be a proper and lsc function and let x¯domf\bar{x}\in\operatorname{dom}\partial f. If ff is semialgebraic, then it has the concave KL property at x¯\bar{x} with φ(t)=ct1θ\varphi(t)=c\cdot t^{1-\theta} for some c>0c>0 and θ(0,1)\theta\in(0,1).
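To make Fact 2.4 concrete (our example): for the semialgebraic function f(x) = x² at x̄ = 0, the choice φ(t) = 2√t (i.e. θ = 1/2, c = 2) gives φ′(f(x) − f(x̄)) · dist(0, ∂f(x)) = (1/|x|) · |2x| = 2 ≥ 1 for all x ≠ 0. A numerical check:

```python
import numpy as np

# KL inequality phi'(f(x) - f(0)) * dist(0, grad f(x)) >= 1
# for f(x) = x^2 with phi(t) = 2*sqrt(t), so phi'(t) = 1/sqrt(t).
f = lambda x: x**2
grad_f = lambda x: 2.0 * x
phi_prime = lambda t: 1.0 / np.sqrt(t)

for x in np.linspace(-1.0, 1.0, 21):
    if x != 0.0:
        # the product equals 2 (up to rounding), hence is >= 1
        assert phi_prime(f(x) - f(0.0)) * abs(grad_f(x)) >= 1.0
```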

3 The BiFRB method and its abstract convergence

3.1 The BiFRB method

In the remainder of this paper, suppose that the following hold:

Assumption 3.1

(i) f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is proper, lsc and prox-bounded with threshold λf>0\lambda_{f}>0.

(ii) g:ng:\mathbb{R}^{n}\to\mathbb{R} has Lipschitz continuous gradient with constant LgL_{\nabla g}.

(iii) h:nh:\mathbb{R}^{n}\to\mathbb{R} is σ\sigma-strongly convex and has Lipschitz continuous gradient with constant LhL_{\nabla h}.

(iv) The objective F=f+gF=f+g is bounded below.

We now propose the Bregman inertial forward-reflected-backward (BiFRB) algorithm.

Algorithm 3.2 (BiFRB)

1. Initialization: Pick x1,x0nx_{-1},x_{0}\in\mathbb{R}^{n}. Let 0<λ¯λ¯λf0<\underline{\lambda}\leq\overline{\lambda}\leq\lambda_{f}. Choose λ¯λ1λ¯\underline{\lambda}\leq\lambda_{-1}\leq\overline{\lambda}.

2. For kk\in\mathbb{N}, choose λ¯λkλ¯\underline{\lambda}\leq\lambda_{k}\leq\overline{\lambda} and 0αk<10\leq\alpha_{k}<1. Compute

yk=xk+λk1(g(xk1)g(xk)),\displaystyle y_{k}=x_{k}+\lambda_{k-1}\big{(}\nabla g(x_{k-1})-\nabla g(x_{k})\big{)}, (8)
xk+1argminxn{f(x)+xyk,g(xk)+αkλk(xk1xk)+1λkDh(x,yk)}.\displaystyle x_{k+1}\in\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-y_{k},\nabla g(x_{k})+\frac{\alpha_{k}}{\lambda_{k}}(x_{k-1}-x_{k})\rangle+\frac{1}{\lambda_{k}}D_{h}(x,y_{k})\bigg{\}}. (9)

The Bregman distance DhD_{h} is bounded above and below by multiples of the squared Euclidean distance.

Lemma 3.3

[3, Exercise 17.5, Theorem 18.15] Let h:nh:\mathbb{R}^{n}\to\mathbb{R} be σ\sigma-strongly convex with Lipschitz continuous gradient with constant LhL_{\nabla h}. Then

(x,yn)σ2xy2Dh(x,y)Lh2xy2.(\forall x,y\in\mathbb{R}^{n})~{}\frac{\sigma}{2}\left\lVert x-y\right\rVert^{2}\leq D_{h}(x,y)\leq\frac{L_{\nabla h}}{2}\left\lVert x-y\right\rVert^{2}.

Assumption 3.1(iii) has been used in the literature to obtain non-Euclidean extension of splitting algorithms; see, e.g., [9, 10]. Recall that the Fenchel conjugate of f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is f=supyn{,yf(y)}f^{*}=\sup_{y\in\mathbb{R}^{n}}\{\langle\cdot,y\rangle-f(y)\} and the Moreau envelope of f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} with parameter λ>0\lambda>0 is

(xn)Mfλ(x)=infyn{f(y)+12λxy2}.(\forall x\in\mathbb{R}^{n})~{}M_{f}^{\lambda}(x)=\inf_{y\in\mathbb{R}^{n}}\bigg{\{}f(y)+\frac{1}{2\lambda}\left\lVert x-y\right\rVert^{2}\bigg{\}}.

Below is a systematic way to construct the kernel hh via the Fenchel conjugate and Moreau envelope.

Proposition 3.4

Let L,σ>0L,\sigma>0 and let p:n¯p:\mathbb{R}^{n}\to\overline{\mathbb{R}} be proper, lsc and convex. Define q=2/2q=\left\lVert\cdot\right\rVert^{2}/2. Then the following hold:

(i) [3, Corollary 18.18] pp has LL-Lipschitz gradient if and only if pp is the Moreau envelope with parameter λ=1/L\lambda=1/L of a proper lsc and convex function (pλq)(p^{*}-\lambda q)^{*}. Therefore, in particular, if pp^{*} is 1/L1/L-strongly convex then pp has LL-Lipschitz gradient.

(ii) If pp has LL-Lipschitz gradient, then h=p+σqh=p+\sigma q is σ\sigma-strongly convex with (L+σ)(L+\sigma)-Lipschitz gradient.

Example 3.5

Let h(x)=1+x2+x2/2h(x)=\sqrt{1+\left\lVert x\right\rVert^{2}}+\left\lVert x\right\rVert^{2}/2. Then hh satisfies Assumption 3.1(iii) with σ=1\sigma=1 and Lh=2L_{\nabla h}=2.

Proof. Define (xn)p(x)=1+x2(\forall x\in\mathbb{R}^{n})~{}p(x)=\sqrt{1+\left\lVert x\right\rVert^{2}}. Then pp^{*} is 11-strongly convex; see, e.g., [5, Section 4.4, Example 5.29]. Applying Proposition 3.4 completes the proof.\quad\hfill\blacksquare
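A numerical sanity check (ours) of Lemma 3.3 for this kernel: with σ = 1 and L_∇h = 2, one should observe ‖x − y‖²/2 ≤ D_h(x, y) ≤ ‖x − y‖² at random points:

```python
import numpy as np

# Kernel of Example 3.5 and its gradient.
h = lambda z: np.sqrt(1.0 + np.dot(z, z)) + 0.5 * np.dot(z, z)
grad_h = lambda z: z / np.sqrt(1.0 + np.dot(z, z)) + z

def D_h(x, y):
    """Bregman distance induced by the kernel of Example 3.5."""
    return h(x) - h(y) - np.dot(x - y, grad_h(y))

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    d2 = np.dot(x - y, x - y)
    # sigma = 1, L_h = 2: (1/2)||x-y||^2 <= D_h(x,y) <= ||x-y||^2
    assert 0.5 * d2 - 1e-12 <= D_h(x, y) <= d2 + 1e-12
```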

We also assume the following throughout the paper, under which the proposed algorithm is well-defined.

Assumption 3.6

Suppose that one of the following holds:

(i) The kernel hh has strong convexity modulus σ1\sigma\geq 1.

(ii) Function ff has prox-threshold λf=\lambda_{f}=\infty.

Remark 3.7

The threshold λf=\lambda_{f}=\infty when ff is convex or bounded below.

Proposition 3.8

Let u,vnu,v\in\mathbb{R}^{n} and suppose that Assumption 3.6 holds. Then for 0<λ<λf0<\lambda<\lambda_{f}, the function xf(x)+xu,v+Dh(x,u)/λx\mapsto f(x)+\langle x-u,v\rangle+D_{h}(x,u)/\lambda is coercive. Consequently, the set

argminxn{f(x)+xu,v+1λDh(x,u)}.\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-u,v\rangle+\frac{1}{\lambda}D_{h}(x,u)\bigg{\}}\neq\emptyset.

Proof. We claim that xf(x)+x,ωx\mapsto f(x)+\langle x,\omega\rangle is prox-bounded with threshold λf\lambda_{f} for arbitrary ωn\omega\in\mathbb{R}^{n}. Indeed,

lim infxf(x)+x,ωx2=lim infxf(x)x2+limxx,ωx2=lim infxf(x)x2.\liminf_{\left\lVert x\right\rVert\to\infty}\frac{f(x)+\langle x,\omega\rangle}{\left\lVert x\right\rVert^{2}}=\liminf_{\left\lVert x\right\rVert\to\infty}\frac{f(x)}{\left\lVert x\right\rVert^{2}}+\lim_{\left\lVert x\right\rVert\to\infty}\frac{\langle x,\omega\rangle}{\left\lVert x\right\rVert^{2}}=\liminf_{\left\lVert x\right\rVert\to\infty}\frac{f(x)}{\left\lVert x\right\rVert^{2}}.

Thus invoking [28, Exercise 1.24] proves the claim.

By our claim, xf(x)+x,vσu/λx\mapsto f(x)+\langle x,v-\sigma u/\lambda\rangle is prox-bounded with threshold λf\lambda_{f}. Let λ\lambda^{\prime} be such that λ<λ<λf\lambda<\lambda^{\prime}<\lambda_{f}. Then λ/σ<λf\lambda^{\prime}/\sigma<\lambda_{f} under Assumption 3.6. Therefore

f(x)+xu,v+1λDh(x,u)f(x)+xu,v+σ2λxu2σ2λx2+σ2λx2\displaystyle f(x)+\langle x-u,v\rangle+\frac{1}{\lambda}D_{h}(x,u)\geq f(x)+\langle x-u,v\rangle+\frac{\sigma}{2\lambda}\left\lVert x-u\right\rVert^{2}-\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}+\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}
σ2(1λ1λ)x2+f(x)+x,vσλu+σ2λx2u,v\displaystyle\geq\frac{\sigma}{2}\left(\frac{1}{\lambda}-\frac{1}{\lambda^{\prime}}\right)\left\lVert x\right\rVert^{2}+f(x)+\langle x,v-\frac{\sigma}{\lambda}u\rangle+\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}-\langle u,v\rangle
σ2(1λ1λ)x2+infxn{f(x)+x,vσλu+σ2λx2}u,v,\displaystyle\geq\frac{\sigma}{2}\left(\frac{1}{\lambda}-\frac{1}{\lambda^{\prime}}\right)\left\lVert x\right\rVert^{2}+\inf_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x,v-\frac{\sigma}{\lambda}u\rangle+\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}\bigg{\}}-\langle u,v\rangle\to\infty,

as x\left\lVert x\right\rVert\to\infty, where the infimum in the last inequality is finite thanks to prox-boundedness of xf(x)+x,vσu/λx\mapsto f(x)+\langle x,v-\sigma u/\lambda\rangle, which completes the proof.\quad\hfill\blacksquare

Setting the kernel h=2/2h=\left\lVert\cdot\right\rVert^{2}/2 gives Algorithm 3.9, which is an inertial forward-reflected-backward method (iFRB).

Algorithm 3.9 (iFRB)

1. Initialization: Pick x1,x0nx_{-1},x_{0}\in\mathbb{R}^{n}. Let 0<λ¯λ¯0<\underline{\lambda}\leq\overline{\lambda}. Choose λ¯λ1λ¯\underline{\lambda}\leq\lambda_{-1}\leq\overline{\lambda}.

2. For kk\in\mathbb{N}, choose λ¯λkλ¯\underline{\lambda}\leq\lambda_{k}\leq\overline{\lambda} and 0αk<10\leq\alpha_{k}<1. Compute

yk=xk+λk1(g(xk1)g(xk)),\displaystyle y_{k}=x_{k}+\lambda_{k-1}\big{(}\nabla g(x_{k-1})-\nabla g(x_{k})\big{)},
xk+1argminxn{f(x)+xyk,g(xk)+αkλk(xk1xk)+12λkxyk2}.\displaystyle x_{k+1}\in\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-y_{k},\nabla g(x_{k})+\frac{\alpha_{k}}{\lambda_{k}}(x_{k-1}-x_{k})\rangle+\frac{1}{2\lambda_{k}}\left\lVert x-y_{k}\right\rVert^{2}\bigg{\}}.

Convergence of iFRB with fixed stepsize and inertial parameter was studied in [22] in the presence of convexity. However, its behavior in the nonconvex setting is not yet understood; our analysis covers this case.
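For illustration only (our instance, not prescribed by the paper), take f = μ‖·‖₁ and g(x) = ½‖Ax − b‖². The iFRB subproblem then has the closed form x_{k+1} = prox_{λ_k μ‖·‖₁}(y_k − λ_k ∇g(x_k) + α_k(x_k − x_{k−1})), i.e. soft thresholding, which yields the runnable sketch below; the constant stepsize is a conservative guess, not a tuned choice:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ifrb_l1(A, b, mu, lam, alpha, iters=1000):
    """iFRB sketch for min_x 0.5*||Ax - b||^2 + mu*||x||_1
    with constant stepsize lam and inertia alpha (illustrative)."""
    grad = lambda z: A.T @ (A @ z - b)
    x_prev = x = np.zeros(A.shape[1])
    for _ in range(iters):
        y = x + lam * (grad(x_prev) - grad(x))            # reflection step (8)
        x_prev, x = x, soft(y - lam * grad(x) + alpha * (x - x_prev), lam * mu)
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
lam = 0.5 / (2.0 * np.linalg.norm(A.T @ A, 2))            # lam < 1/(2*L_g), a safe guess
x = ifrb_l1(A, b, mu=0.1, lam=lam, alpha=0.1)
```

At a fixed point this recovers x = prox_{λμ‖·‖₁}(x − λ∇g(x)), a stationarity condition for (2).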

3.2 BiFRB merit function and its parameter rules

In the remainder of this paper, unless specified otherwise, let p1p_{-1}\in\mathbb{R} and define for kk\in\mathbb{N}

pk\displaystyle p_{k} =((Lhσ)Lg22)λk12λk+12(σλkLg)pk1,\displaystyle=\left(\frac{(L_{\nabla h}-\sigma)L^{2}_{\nabla g}}{2}\right)\frac{\lambda^{2}_{k-1}}{\lambda_{k}}+\frac{1}{2}\left(\frac{\sigma}{\lambda_{k}}-L_{\nabla g}\right)-p_{k-1}, (10)
M1,k\displaystyle M_{1,k} =pk1αk+σLgλk12λk(Lhσ)Lg2λk122λk,\displaystyle=p_{k-1}-\frac{\alpha_{k}+\sigma L_{\nabla g}\lambda_{k-1}}{2\lambda_{k}}-\frac{(L_{\nabla h}-\sigma)L^{2}_{\nabla g}\lambda_{k-1}^{2}}{2\lambda_{k}}, (11)

where (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} is a stepsize sequence of BiFRB; (αk)k(\alpha_{k})_{k\in\mathbb{N}} is a sequence of inertial parameters; and Lh,Lg,σL_{\nabla h},L_{\nabla g},\sigma are parameters specified in Assumption 3.1. The following concept plays a central role in our analysis.
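Under constant parameters λ_k ≡ λ and α_k ≡ α (an illustrative regime we choose here), the recursion (10) becomes p_k = C − p_{k−1} with C = (L_∇h − σ)L²_∇g λ/2 + (σ/λ − L_∇g)/2, so initializing p₋₁ at the fixed point C/2 keeps the sequence constant. A numerical sketch with the constants of Example 3.5:

```python
# Illustrative constants: sigma = 1, L_h = 2 (Example 3.5 kernel), L_g = 1,
# constant stepsize lam = 0.2 and inertia alpha = 0.1 (our choices).
sigma, L_h, L_g = 1.0, 2.0, 1.0
lam, alpha = 0.2, 0.1

C = (L_h - sigma) * L_g**2 * lam / 2.0 + 0.5 * (sigma / lam - L_g)

def M1(p_prev):
    """M_{1,k} from (11) with constant parameters."""
    return p_prev - (alpha + sigma * L_g * lam) / (2.0 * lam) \
                  - (L_h - sigma) * L_g**2 * lam / 2.0

p = C / 2.0                      # choose p_{-1} at the fixed point of p_k = C - p_{k-1}
ps, Ms = [], []
for _ in range(50):
    Ms.append(M1(p))
    p = C - p                    # recursion (10) with constant stepsize
    ps.append(p)

# The parameter rules of Assumption 3.12 hold in this regime:
# p_k >= 0, bounded above, and M_{1,k} bounded away from 0.
assert min(ps) >= 0.0 and max(ps) <= C and min(Ms) > 0.0
```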

Definition 3.10

Let pp\in\mathbb{R}. Define Hp:n×n¯H_{p}:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\overline{\mathbb{R}} by

((x,y)n×n)Hp(x,y)=F(x)+pxy2.(\forall(x,y)\in\mathbb{R}^{n}\times\mathbb{R}^{n})~{}H_{p}(x,y)=F(x)+p\left\lVert x-y\right\rVert^{2}.

Then we say that HpH_{p} is a quadratic merit function with parameter pp. Let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} be given by (10). Then HpkH_{p_{k}} is the kthk^{th} BiFRB merit function. If pkp¯p_{k}\leq\bar{p} for some p¯>0\bar{p}>0, then Hp¯H_{\bar{p}} is a dominating BiFRB merit function.

Remark 3.11

Quadratic merit functions of various forms appear frequently in the literature devoted to extending splitting methods to the nonconvex setting; see, e.g., [9, 10, 31, 27, 26] and the references therein.

Despite the recursion (10), we shall soon see that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} are in fact implicit in our analysis. Imposing the following novel condition on (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}}, one obtains abstract convergence of BiFRB, which in turn yields less restrictive stepsize rules; see Section 4 and in particular Remark 4.3.

Assumption 3.12 (general rules of BiFRB merit function parameter)

Let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} be given as in (10) and (11). Suppose that the following hold:

(i) lim infkpk0\liminf_{k\to\infty}p_{k}\geq 0 and lim infkM1,k>0\liminf_{k\to\infty}M_{1,k}>0.

(ii) There exists p¯\bar{p} such that pkp¯p_{k}\leq\bar{p} for all kk.

Remark 3.13

Assumption 3.12 generalizes quadratic merit function properties in the present literature; see, e.g., [9, Lemma 3.2], [10, Lemma 6], [26, Assumption (H1)] and [27, Algorithm 5, Lemma 4.6, Proposition 4.7].

3.3 Abstract function value convergence

In this section, we study function value convergence of BiFRB under Assumption 3.12; see Section 4 for concrete stepsize rules under which Assumption 3.12 holds.

Lemma 3.14

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Then the following hold:

(i) For all kk\in\mathbb{N},

Hpk(zk)Hpk1(zk1)M1,kzkzk12.H_{p_{k}}(z_{k})\leq H_{p_{k-1}}(z_{k-1})-M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}. (12)

(ii) Suppose that Assumption 3.12(i) holds. Then the sequence (Hpk(zk))k\left(H_{p_{k}}(z_{k})\right)_{k\in\mathbb{N}} is decreasing and k=0M1,kzkzk12<\sum_{k=0}^{\infty}M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}<\infty. Consequently, (Hpk(zk))k\left(H_{p_{k}}(z_{k})\right)_{k\in\mathbb{N}} is convergent and limkzkzk1=0\lim_{k\to\infty}\left\lVert z_{k}-z_{k-1}\right\rVert=0.

Proof. (i) By using the iterative scheme (9),

f(xk+1)\displaystyle f(x_{k+1}) f(xk)+xkxk+1,g(xk)+αkλkxkxk+1,xk1xk\displaystyle\leq f(x_{k})+\langle x_{k}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x_{k}-x_{k+1},x_{k-1}-x_{k}\rangle
+1λk(Dh(xk,yk)Dh(xk+1,yk))\displaystyle+\frac{1}{\lambda_{k}}\left(D_{h}(x_{k},y_{k})-D_{h}(x_{k+1},y_{k})\right)
f(xk)+xkxk+1,g(xk)+αkλkxkxk+1,xk1xk\displaystyle\leq f(x_{k})+\langle x_{k}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x_{k}-x_{k+1},x_{k-1}-x_{k}\rangle
+Lh2λkxkyk2σ2λkxk+1yk2,\displaystyle+\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}, (13)

where the last inequality is implied by Lemma 3.3. Invoking the descent lemma [3, Lemma 2.64] to gg yields

g(xk+1)g(xk)+xk+1xk,g(xk)+Lg2xk+1xk2.g(x_{k+1})\leq g(x_{k})+\langle x_{k+1}-x_{k},\nabla g(x_{k})\rangle+\frac{L_{\nabla g}}{2}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}. (14)

Adding (13) and (14), one has

F(xk+1)\displaystyle F(x_{k+1}) F(xk)+αkλkxkxk+1,xk1xk+Lh2λkxkyk2\displaystyle\leq F(x_{k})+\frac{\alpha_{k}}{\lambda_{k}}\langle x_{k}-x_{k+1},x_{k-1}-x_{k}\rangle+\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}
σ2λkxk+1yk2+Lg2xk+1xk2\displaystyle-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}+\frac{L_{\nabla g}}{2}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
F(xk)+(αk2λk+Lg2)xk+1xk2+αk2λkxk1xk2\displaystyle\leq F(x_{k})+\left(\frac{\alpha_{k}}{2\lambda_{k}}+\frac{L_{\nabla g}}{2}\right)\left\lVert x_{k+1}-x_{k}\right\rVert^{2}+\frac{\alpha_{k}}{2\lambda_{k}}\left\lVert x_{k-1}-x_{k}\right\rVert^{2}
+Lh2λkxkyk2σ2λkxk+1yk2.\displaystyle+\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}. (15)

Next we estimate

Lh2λkxkyk2σ2λkxk+1yk2=Lhσ2λkxkyk2σ2λk(xk+1yk2xkyk2)\displaystyle\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}=\frac{L_{\nabla h}-\sigma}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left(\left\lVert x_{k+1}-y_{k}\right\rVert^{2}-\left\lVert x_{k}-y_{k}\right\rVert^{2}\right)
=(Lhσ)λk122λkg(xk)g(xk1)2σ2λkxk+1xk2\displaystyle=\frac{(L_{\nabla h}-\sigma)\lambda_{k-1}^{2}}{2\lambda_{k}}\left\lVert\nabla g(x_{k})-\nabla g(x_{k-1})\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
σλk1λkxk+1xk,g(xk)g(xk1)\displaystyle-\frac{\sigma\lambda_{k-1}}{\lambda_{k}}\langle x_{k+1}-x_{k},\nabla g(x_{k})-\nabla g(x_{k-1})\rangle
(Lhσ)Lg2λk122λkxkxk12+σLgλk1λkxk+1xkxkxk1σ2λkxk+1xk2\displaystyle\leq\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}+\frac{\sigma L_{\nabla g}\lambda_{k-1}}{\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert\left\lVert x_{k}-x_{k-1}\right\rVert-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
(Lhσ)Lg2λk12+σLgλk12λkxkxk12+σ(Lgλk11)2λkxk+1xk2.\displaystyle\leq\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}+\sigma L_{\nabla g}\lambda_{k-1}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}+\frac{\sigma(L_{\nabla g}\lambda_{k-1}-1)}{2\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}.

Thus, inequality (15) further implies that

F(xk+1)\displaystyle F(x_{k+1}) F(xk)+(σ(Lgλk11)+αk2λk+Lg2)xk+1xk2\displaystyle\leq F(x_{k})+\left(\frac{\sigma(L_{\nabla g}\lambda_{k-1}-1)+\alpha_{k}}{2\lambda_{k}}+\frac{L_{\nabla g}}{2}\right)\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
+(Lhσ)Lg2λk12+σLgλk1+αk2λkxkxk12.\displaystyle+\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}+\sigma L_{\nabla g}\lambda_{k-1}+\alpha_{k}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}. (16)

Adding pkxk+1xk2p_{k}\left\lVert x_{k+1}-x_{k}\right\rVert^{2} to the above inequality gives

Hpk(zk)\displaystyle H_{p_{k}}(z_{k}) =F(xk+1)+pkxk+1xk2\displaystyle=F(x_{k+1})+p_{k}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
F(xk)+(σ(Lgλk11)+αk2λk+Lg2+pk)(zkzk12xkxk12)\displaystyle\leq F(x_{k})+\left(\frac{\sigma(L_{\nabla g}\lambda_{k-1}-1)+\alpha_{k}}{2\lambda_{k}}+\frac{L_{\nabla g}}{2}+p_{k}\right)(\left\lVert z_{k}-z_{k-1}\right\rVert^{2}-\left\lVert x_{k}-x_{k-1}\right\rVert^{2})
+(Lhσ)Lg2λk12+σLgλk1+αk2λkxkxk12\displaystyle+\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}+\sigma L_{\nabla g}\lambda_{k-1}+\alpha_{k}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}
=F(xk)+pk1xkxk12M1,kzkzk12\displaystyle=F(x_{k})+p_{k-1}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}-M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}
=Hpk1(zk1)M1,kzkzk12,\displaystyle=H_{p_{k-1}}(z_{k-1})-M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2},

where the first inequality follows from the identity zkzk12=xk+1xk2+xkxk12\left\lVert z_{k}-z_{k-1}\right\rVert^{2}=\left\lVert x_{k+1}-x_{k}\right\rVert^{2}+\left\lVert x_{k}-x_{k-1}\right\rVert^{2}.

(ii) Discarding finitely many terms if necessary, assume without loss of generality that pk0p_{k}\geq 0 and M1,k>0M_{1,k}>0 for all kk\in\mathbb{N}. Thus, by statement (i) and Assumption 3.1, the sequence (Hpk(zk))k\left(H_{p_{k}}(z_{k})\right)_{k\in\mathbb{N}} is decreasing and bounded below, and therefore convergent. Summing inequality (12) from k=0k=0 to k=n1k=n-1, we have

k=0n1M1,kzkzk12\displaystyle\sum_{k=0}^{n-1}M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2} k=0n1(Hpk(zk)Hpk+1(zk+1))=Hp0(z0)Hpn(zn)\displaystyle\leq\sum_{k=0}^{n-1}\big{(}H_{p_{k}}(z_{k})-H_{p_{k+1}}(z_{k+1})\big{)}=H_{p_{0}}(z_{0})-H_{p_{n}}(z_{n})
Hp0(z0)F(xn+1)Hp0(z0)infxnF(x)<,\displaystyle\leq H_{p_{0}}(z_{0})-F(x_{n+1})\leq H_{p_{0}}(z_{0})-\inf_{x\in\mathbb{R}^{n}}F(x)<\infty,

which together with the assumption that lim infkM1,k>0\liminf_{k\to\infty}M_{1,k}>0 implies that k=0zkzk12<\sum_{k=0}^{\infty}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}<\infty.\quad\hfill\blacksquare

The following lemma is an immediate consequence of subdifferential sum rules [28, Exercise 8.8, Proposition 10.5].

Lemma 3.15

Let pp\in\mathbb{R}. Then for every (x,y)n×n(x,y)\in\mathbb{R}^{n}\times\mathbb{R}^{n},

Hp(x,y)={f(x)+g(x)+2p(xy)}×{2p(yx)}.\partial H_{p}(x,y)=\{\partial f(x)+\nabla g(x)+2p(x-y)\}\times\{2p(y-x)\}.
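A finite-difference check (ours) of Lemma 3.15 in the smooth case, where ∂f(x) + ∇g(x) reduces to ∇F(x); here F(x) = ‖x‖² is an assumed toy objective:

```python
import numpy as np

p = 0.7
F = lambda z: np.dot(z, z)                    # smooth toy objective, grad F(z) = 2z
H = lambda x, y: F(x) + p * np.dot(x - y, x - y)

x, y = np.array([1.0, -2.0]), np.array([0.5, 0.0])
gx = 2.0 * x + 2.0 * p * (x - y)              # first block of Lemma 3.15
gy = 2.0 * p * (y - x)                        # second block of Lemma 3.15

# Central differences in the first coordinate of each block.
eps = 1e-6
e0 = np.array([eps, 0.0])
num_gx0 = (H(x + e0, y) - H(x - e0, y)) / (2.0 * eps)
num_gy0 = (H(x, y + e0) - H(x, y - e0)) / (2.0 * eps)
assert abs(num_gx0 - gx[0]) < 1e-5 and abs(num_gy0 - gy[0]) < 1e-5
```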

The lemma below provides a lower bound on the gap between consecutive iterates in terms of a subgradient of the merit function.

Lemma 3.16

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB, zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}, and p0p\geq 0. For kk\in\mathbb{N}, define uk=(h(yk)h(xk+1))/λkg(xk)+αk(xkxk1)/λku_{k}=(\nabla h(y_{k})-\nabla h(x_{k+1}))/\lambda_{k}-\nabla g(x_{k})+\alpha_{k}(x_{k}-x_{k-1})/\lambda_{k},

Ak=uk+g(xk+1)+2p(xk+1xk),Bk=2p(xkxk+1).A_{k}=u_{k}+\nabla g(x_{k+1})+2p(x_{k+1}-x_{k}),B_{k}=2p(x_{k}-x_{k+1}).

Set

M2,k=2max{Lhλk+Lg+6p,LhLgλk1+1λk},M_{2,k}=\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\lambda_{k}}+L_{\nabla g}+6p,\frac{L_{\nabla h}L_{\nabla g}\lambda_{k-1}+1}{\lambda_{k}}\bigg{\}},

and

M2=2max{Lhλ¯+Lg+6p,LhLgλ¯+1λ¯}.M_{2}=\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6p,\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\bigg{\}}.

Then (Ak,Bk)Hp(zk)\left(A_{k},B_{k}\right)\in\partial H_{p}(z_{k}) and (Ak,Bk)M2,kzkzk1M2zkzk1\left\lVert\left(A_{k},B_{k}\right)\right\rVert\leq M_{2,k}\left\lVert z_{k}-z_{k-1}\right\rVert\leq M_{2}\left\lVert z_{k}-z_{k-1}\right\rVert.

Proof. Invoking (9), one has

0f(xk+1)+g(xk)+αk(xk1xk)λk+h(xk+1)h(yk)λk,0\in\partial f(x_{k+1})+\nabla g(x_{k})+\frac{\alpha_{k}(x_{k-1}-x_{k})}{\lambda_{k}}+\frac{\nabla h(x_{k+1})-\nabla h(y_{k})}{\lambda_{k}},

which implies that ukf(xk+1)u_{k}\in\partial f(x_{k+1}). It then follows from Lemma 3.15 that (Ak,Bk)Hp(zk)\left(A_{k},B_{k}\right)\in\partial H_{p}(z_{k}). Moreover,

uk+g(xk+1)\displaystyle\left\lVert u_{k}+\nabla g(x_{k+1})\right\rVert 1λk(h(yk)h(xk+1))+g(xk+1)g(xk)\displaystyle\leq\left\lVert\frac{1}{\lambda_{k}}\left(\nabla h(y_{k})-\nabla h(x_{k+1})\right)+\nabla g(x_{k+1})-\nabla g(x_{k})\right\rVert
+2pxk+1xk+αkλkxkxk1\displaystyle+2p\left\lVert x_{k+1}-x_{k}\right\rVert+\frac{\alpha_{k}}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert
Lhλkykxk+xkxk+1+(Lg+2p)xk+1xk+1λkxkxk1\displaystyle\leq\frac{L_{\nabla h}}{\lambda_{k}}\left\lVert y_{k}-x_{k}+x_{k}-x_{k+1}\right\rVert+\left(L_{\nabla g}+2p\right)\left\lVert x_{k+1}-x_{k}\right\rVert+\frac{1}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert
LhLgλk1+1λkxkxk1+(Lhλk+Lg+2p)xk+1xk.\displaystyle\leq\frac{L_{\nabla h}L_{\nabla g}\lambda_{k-1}+1}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert+\left(\frac{L_{\nabla h}}{\lambda_{k}}+L_{\nabla g}+2p\right)\left\lVert x_{k+1}-x_{k}\right\rVert.

Then

(Ak,Bk)\displaystyle\left\lVert\left(A_{k},B_{k}\right)\right\rVert Ak+Bkuk+g(xk+1)+4p(xk+1xk)\displaystyle\leq\left\lVert A_{k}\right\rVert+\left\lVert B_{k}\right\rVert\leq\left\lVert u_{k}+\nabla g(x_{k+1})\right\rVert+4p\left\lVert(x_{k+1}-x_{k})\right\rVert
LhLgλk1+1λkxkxk1+(Lhλk+Lg+6p)xk+1xk,\displaystyle\leq\frac{L_{\nabla h}L_{\nabla g}\lambda_{k-1}+1}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert+\left(\frac{L_{\nabla h}}{\lambda_{k}}+L_{\nabla g}+6p\right)\left\lVert x_{k+1}-x_{k}\right\rVert,
M2,kzkzk1M2zkzk1,\displaystyle\leq M_{2,k}\left\lVert z_{k}-z_{k-1}\right\rVert\leq M_{2}\left\lVert z_{k}-z_{k-1}\right\rVert,

as desired.\quad\hfill\blacksquare

Having established basic properties of (Hpk(xk+1,xk))\big{(}H_{p_{k}}(x_{k+1},x_{k})\big{)}, we now show its convergence. Denote by ω(z1)\omega(z_{-1}) the set of all limit points of (zk)k(z_{k})_{k\in\mathbb{N}^{*}}.

Theorem 3.17

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that Assumption 3.12 holds and (zk)k(z_{k})_{k\in\mathbb{N}^{*}} is bounded. Let (zkl)l(z_{k_{l}})_{l\in\mathbb{N}} be a subsequence such that zklzz_{k_{l}}\to z^{*} for some z=(x,y)z^{*}=(x^{*},y^{*}) as ll\to\infty. Then the following hold:

(i) We have limlHpkl(zkl)=F(x)\lim_{l\to\infty}H_{p_{k_{l}}}(z_{k_{l}})=F(x^{*}). In fact, limkHpk(zk)=F(x)\lim_{k\to\infty}H_{p_{k}}(z_{k})=F(x^{*}). Consequently limkHp¯(zk)=F(x)\lim_{k\to\infty}H_{\bar{p}}(z_{k})=F(x^{*}), where p¯\bar{p} is given in Assumption 3.12(ii).

(ii) The limit point z=(x,y)z^{*}=(x^{*},y^{*}) satisfies x=yx^{*}=y^{*} and 0Hp¯(x,y)0\in\partial H_{\bar{p}}(x^{*},y^{*}), which implies that 0F(x)0\in\partial F(x^{*}).

(iii) The set ω(z1)\omega(z_{-1}) is nonempty, connected, and compact, and the function Hp¯H_{\bar{p}} is constant on it with value F(x)F(x^{*}). Moreover, limkdist(zk,ω(z1))=0\lim_{k\to\infty}\operatorname{dist}(z_{k},\omega(z_{-1}))=0.

Proof. (i) For every kk\in\mathbb{N}, appealing to the iterative scheme (9), we have

f(xk+1)\displaystyle f(x_{k+1}) f(x)+xxk+1,g(xk)+αkλkxxk+1,xk1xk\displaystyle\leq f(x^{*})+\langle x^{*}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x^{*}-x_{k+1},x_{k-1}-x_{k}\rangle
+1λkDh(x,yk)1λkDh(xk+1,yk)\displaystyle+\frac{1}{\lambda_{k}}D_{h}(x^{*},y_{k})-\frac{1}{\lambda_{k}}D_{h}(x_{k+1},y_{k})
f(x)+xxk+1,g(xk)+αkλkxxk+1,xk1xk\displaystyle\leq f(x^{*})+\langle x^{*}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x^{*}-x_{k+1},x_{k-1}-x_{k}\rangle
+Lhσ2λkxyk2σ2λk(xk+1yk2xyk2)\displaystyle+\frac{L_{\nabla h}-\sigma}{2\lambda_{k}}\left\lVert x^{*}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\big{(}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}-\left\lVert x^{*}-y_{k}\right\rVert^{2}\big{)}
=f(x)+xxk+1,g(xk)+αkλkxxk+1,xk1xk\displaystyle=f(x^{*})+\langle x^{*}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x^{*}-x_{k+1},x_{k-1}-x_{k}\rangle
+Lhσ2λkxyk2σ2λk(xk+1x2+2xk+1x,xyk).\displaystyle+\frac{L_{\nabla h}-\sigma}{2\lambda_{k}}\left\lVert x^{*}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\big{(}\left\lVert x_{k+1}-x^{*}\right\rVert^{2}+2\langle x_{k+1}-x^{*},x^{*}-y_{k}\rangle\big{)}. (17)

Recall that xklxx_{k_{l}}\to x^{*} as ll\to\infty. Then Lemma 3.14(ii) and the triangle inequality imply that xkl1xx_{k_{l}-1}\to x^{*} and

xykl1xxkl1+λkl2Lgxkl2xkl10, as l,\left\lVert x^{*}-y_{k_{l}-1}\right\rVert\leq\left\lVert x^{*}-x_{k_{l}-1}\right\rVert+\lambda_{k_{l}-2}L_{\nabla g}\left\lVert x_{k_{l}-2}-x_{k_{l}-1}\right\rVert\to 0,\text{ as }l\to\infty,

where the first inequality holds due to (8). Taking the above limits into account, replacing the index k+1k+1 by klk_{l} in (17) and passing to the limit, one gets

lim suplf(xkl)f(x)lim inflf(xkl),\limsup_{l\to\infty}f(x_{k_{l}})\leq f(x^{*})\leq\liminf_{l\to\infty}f(x_{k_{l}}),

where the last inequality holds thanks to the lower semicontinuity of ff, which implies that limlF(xkl)=F(x)\lim_{l\to\infty}F(x_{k_{l}})=F(x^{*}). Note that the sequence (Hpk(zk))k(H_{p_{k}}(z_{k}))_{k\in\mathbb{N}} is convergent. Then one concludes that

limkHpk(zk)=limlHpkl(zkl)=liml(F(xkl)+pklxklxkl12)=F(x),\lim_{k\to\infty}H_{p_{k}}(z_{k})=\lim_{l\to\infty}H_{p_{k_{l}}}(z_{k_{l}})=\lim_{l\to\infty}\big{(}F(x_{k_{l}})+p_{k_{l}}\left\lVert x_{k_{l}}-x_{k_{l}-1}\right\rVert^{2}\big{)}=F(x^{*}),

where limlpklxklxkl12=0\lim_{l\to\infty}p_{k_{l}}\left\lVert x_{k_{l}}-x_{k_{l}-1}\right\rVert^{2}=0 due to Assumption 3.12(ii) and Lemma 3.14(ii). Finally, it is easy to see that 0(p¯pk)xkxk12p¯xkxk1200\leq(\bar{p}-p_{k})\left\lVert x_{k}-x_{k-1}\right\rVert^{2}\leq\bar{p}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}\to 0 as kk\to\infty under Assumption 3.12(ii). Thus limkHp¯(zk)=limk(Hpk(zk)+(p¯pk)xkxk12)=F(x)\lim_{k\to\infty}H_{\bar{p}}(z_{k})=\lim_{k\to\infty}\big{(}H_{p_{k}}(z_{k})+(\bar{p}-p_{k})\left\lVert x_{k}-x_{k-1}\right\rVert^{2}\big{)}=F(x^{*}).

(ii) By Lemma 3.14(ii) and the triangle inequality, the subsequences (xkl)l(x_{k_{l}})_{l\in\mathbb{N}} and (xkl+1)l(x_{k_{l}+1})_{l\in\mathbb{N}} have the same limit. Therefore x=yx^{*}=y^{*}. Combining Lemma 3.16 with p=p¯p=\bar{p}, Lemma 3.14(ii), statement (i), and the outer semicontinuity of zHp¯(z)z\mapsto\partial H_{\bar{p}}(z), we conclude that 0Hp¯(z)0\in\partial H_{\bar{p}}(z^{*}), which further implies 0F(x)0\in\partial F(x^{*}).

(iii) Apply a similar proof as in [31, Theorem 3.6(iii)].\quad\hfill\blacksquare

The result below provides a sufficient condition for the critical boundedness assumption in Theorem 3.17.

Theorem 3.18

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that Assumption 3.12(i) holds. If f+gf+g is coercive (or level-bounded), then the sequence (xk)k(x_{k})_{k\in\mathbb{N}^{*}} is bounded, and so is (zk)k(z_{k})_{k\in\mathbb{N}^{*}}.

Proof. Assume without loss of generality that M1,k>0M_{1,k}>0 and pk0p_{k}\geq 0 for kk\in\mathbb{N}. Then, appealing to Lemma 3.14, one has Hpk(zk)Hp1(z1)H_{p_{k}}(z_{k})\leq H_{p_{-1}}(z_{-1}) for kk\in\mathbb{N}. Therefore

F(xk+1)=Hpk(zk)pkxk+1xk2Hpk(zk)Hp1(z1).F(x_{k+1})=H_{p_{k}}(z_{k})-p_{k}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}\leq H_{p_{k}}(z_{k})\leq H_{p_{-1}}(z_{-1}).

If (xk)k(x_{k})_{k\in\mathbb{N}} were unbounded, then the above inequality would contradict the coercivity (or level-boundedness) of f+gf+g. \quad\hfill\blacksquare

3.4 Abstract sequential convergence

In this section, we establish global sequential convergence of BiFRB to a stationary point by using the generalized concave KL property (recall Definition 2.2). For ε>0\varepsilon>0 and nonempty set Ωn\Omega\subseteq\mathbb{R}^{n}, we define Ωε={xn:dist(x,Ω)<ε}\Omega_{\varepsilon}=\{x\in\mathbb{R}^{n}:\operatorname{dist}(x,\Omega)<\varepsilon\}.

Lemma 3.19

[30, Lemma 4.4] Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be proper lsc and let μ\mu\in\mathbb{R}. Let Ωdomf\Omega\subseteq\operatorname{dom}\partial f be a nonempty compact set on which f(x)=μf(x)=\mu for all xΩx\in\Omega. The following statements hold:

(i) Suppose that ff satisfies the pointwise generalized concave KL property at each xΩx\in\Omega. Then there exist ε>0,η(0,]\varepsilon>0,\eta\in(0,\infty] and φΦη\varphi\in\Phi_{\eta} such that ff has the setwise generalized concave KL property on Ω\Omega with respect to U=ΩεU=\Omega_{\varepsilon}, η\eta and φ\varphi.

(ii) Set U=ΩεU=\Omega_{\varepsilon} and define h:(0,η)+h:(0,\eta)\rightarrow\mathbb{R}_{+} by

h(s)=sup{dist1(0,f(x)):xU[0<fμ<η],sf(x)μ}.h(s)=\sup\big{\{}\operatorname{dist}^{-1}\big{(}0,\partial f(x)\big{)}:x\in U\cap[0<f-\mu<\eta],s\leq f(x)-\mu\big{\}}.

Then the function φ~:[0,η)+\tilde{\varphi}:[0,\eta)\rightarrow\mathbb{R}_{+},

t0th(s)𝑑s,t(0,η),t\mapsto\int_{0}^{t}h(s)ds,~{}\forall t\in(0,\eta),

and φ~(0)=0\tilde{\varphi}(0)=0, is well-defined and belongs to Φη\Phi_{\eta}. The function ff has the setwise generalized concave KL property on Ω\Omega with respect to UU, η\eta and φ~\tilde{\varphi}. Moreover,

φ~=inf{φΦη:φ is a concave desingularizing function of f on Ω with respect to U and η}.\tilde{\varphi}=\inf\big{\{}\varphi\in\Phi_{\eta}:\text{$\varphi$ is a concave desingularizing function of $f$ on $\Omega$ with respect to $U$ and $\eta$}\big{\}}.

We say φ~\tilde{\varphi} is the exact modulus of the setwise generalized concave KL property of ff on Ω\Omega with respect to UU and η\eta.

Theorem 3.20 (global convergence of BiFRB)

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that Assumption 3.12 holds and (zk)k(z_{k})_{k\in\mathbb{N}^{*}} is bounded. Suppose further that the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) has the generalized concave KL property on ω(z1)\omega(z_{-1}). Then the following hold:

(i) The sequence (zk)k(z_{k})_{k\in\mathbb{N}^{*}} is Cauchy and has finite length. To be specific, there exist an index k0k_{0}\in\mathbb{N}, ε>0\varepsilon>0 and η(0,]\eta\in(0,\infty] such that for ik0+1i\geq k_{0}+1,

k=izk+1zkzizi1+Cφ~(Hpi(zi)F(x)),\sum_{k=i}^{\infty}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right), (18)

where

C=22max{Lhλ¯+Lg+6p¯,LhLgλ¯+1λ¯}lim infkM1,k,C=\frac{2\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6\bar{p},\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\bigg{\}}}{\liminf_{k\to\infty}M_{1,k}},

and φ~\tilde{\varphi} is the exact modulus associated with the setwise generalized concave KL property of Hp¯H_{\bar{p}} on ω(z1)\omega(z_{-1}) with respect to ε\varepsilon and η\eta.

(ii) The sequence (xk)k(x_{k})_{k\in\mathbb{N}^{*}} has finite length and converges to some xx^{*} with 0F(x)0\in\partial F(x^{*}).

Proof. By the boundedness assumption, assume without loss of generality that zkz=(x,y)n×nz_{k}\to z^{*}=(x^{*},y^{*})\in\mathbb{R}^{n}\times\mathbb{R}^{n}. Then Theorem 3.17 implies that x=yx^{*}=y^{*} and Hpk(zk)F(x)H_{p_{k}}(z_{k})\to F(x^{*}) as kk\to\infty. By Lemma 3.14, assume without loss of generality that (Hpk(zk))k\big{(}H_{p_{k}}(z_{k})\big{)}_{k\in\mathbb{N}} is decreasing. Consider two cases:

Case 1: Suppose that there exists k0k_{0} such that F(x)=Hpk0(zk0)F(x^{*})=H_{p_{k_{0}}}(z_{k_{0}}). Then Lemma 3.14(i) implies that zk0+1=zk0z_{k_{0}+1}=z_{k_{0}} and xk0+1=xk0x_{k_{0}+1}=x_{k_{0}}. The desired result then follows from a simple induction.

Case 2: Assume that F(x)<Hpk(zk)F(x^{*})<H_{p_{k}}(z_{k}) for all kk. By Theorem 3.17(iii), the function Hp¯H_{\bar{p}} is constant on ω(z1)\omega(z_{-1}) with value F(x)F(x^{*}). Then Lemma 3.19 yields ε>0\varepsilon>0 and η>0\eta>0 such that Hp¯H_{\bar{p}} has the setwise generalized concave KL property on ω(z1)\omega(z_{-1}) with respect to ε\varepsilon and η\eta, with associated exact modulus φ~\tilde{\varphi}. Appealing to Theorem 3.17 again, there exists k1k_{1} such that zk[0<Hp¯Hp¯(z)<η]z_{k}\in[0<H_{\bar{p}}-H_{\bar{p}}(z^{*})<\eta] for k>k1k>k_{1} and k2k_{2} such that dist(zk,ω(z1))<ε\operatorname{dist}(z_{k},\omega(z_{-1}))<\varepsilon for all k>k2k>k_{2}. Put k0=max(k1,k2)k_{0}=\max(k_{1},k_{2}). Then for k>k0k>k_{0},

(φ~)(Hpk(zk)F(x))dist(0,Hp¯(zk))(φ~)(Hp¯(zk)F(x))dist(0,Hp¯(zk))1,\displaystyle(\tilde{\varphi})_{-}^{\prime}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)\cdot\operatorname{dist}(0,\partial H_{\bar{p}}(z_{k}))\geq(\tilde{\varphi})_{-}^{\prime}\left(H_{\bar{p}}(z_{k})-F(x^{*})\right)\cdot\operatorname{dist}(0,\partial H_{\bar{p}}(z_{k}))\geq 1,

where the first inequality holds because Hpk(zk)Hp¯(zk)H_{p_{k}}(z_{k})\leq H_{\bar{p}}(z_{k}) and the last one follows from the generalized concave KL inequality (recall Definition 2.2). Invoking Lemma 3.16 yields

M2,k(φ~)(Hpk(zk)F(x))zkzk11,M_{2,k}(\tilde{\varphi})_{-}^{\prime}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)\cdot\left\lVert z_{k}-z_{k-1}\right\rVert\geq 1, (19)

where M2,k=M2,k(p¯)M_{2,k}=M_{2,k}(\bar{p}). For simplicity, define

Δk,k+1=φ~(Hpk(zk)F(x))φ~(Hpk+1(zk+1)F(x)).\Delta_{k,k+1}=\tilde{\varphi}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)-\tilde{\varphi}\left(H_{p_{k+1}}(z_{k+1})-F(x^{*})\right).

By the concavity of φ~\tilde{\varphi} and (19), one has

Δk,k+1\displaystyle\Delta_{k,k+1} =φ~(Hpk(zk)F(x))φ~(Hpk+1(zk+1)F(x))\displaystyle=\tilde{\varphi}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)-\tilde{\varphi}\left(H_{p_{k+1}}(z_{k+1})-F(x^{*})\right)
(φ~)(Hpk(zk)F(x))[Hpk(zk)Hpk+1(zk+1)]\displaystyle\geq(\tilde{\varphi})_{-}^{\prime}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)\cdot[H_{p_{k}}(z_{k})-H_{p_{k+1}}(z_{k+1})]
Hpk(zk)Hpk+1(zk+1)M2,kzkzk1.\displaystyle\geq\frac{H_{p_{k}}(z_{k})-H_{p_{k+1}}(z_{k+1})}{M_{2,k}\left\lVert z_{k}-z_{k-1}\right\rVert}. (20)

Applying Lemma 3.14 to (20) yields

Δk,k+1zk+1zk2Ckzkzk1,\Delta_{k,k+1}\geq\frac{\left\lVert z_{k+1}-z_{k}\right\rVert^{2}}{C_{k}\left\lVert z_{k}-z_{k-1}\right\rVert},

where Ck=M2,k/M1,kC_{k}=M_{2,k}/M_{1,k}. On one hand, by Assumption 3.12(i), assume without loss of generality that M1,klim infkM1,k/2M_{1,k}\geq\liminf_{k\to\infty}M_{1,k}/2 for all kk. On the other hand, M2,kM_{2,k} satisfies

M2,k2max{Lhλ¯+Lg+6p¯,LhLgλ¯+1λ¯}.M_{2,k}\leq\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6\bar{p},\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\bigg{\}}.

Then CkCC_{k}\leq C and

2zk+1zk2CkΔk,k+1zkzk1CΔk,k+1+zkzk1.\displaystyle 2\left\lVert z_{k+1}-z_{k}\right\rVert\leq 2\sqrt{C_{k}\Delta_{k,k+1}\left\lVert z_{k}-z_{k-1}\right\rVert}\leq C\Delta_{k,k+1}+\left\lVert z_{k}-z_{k-1}\right\rVert. (21)

Pick ik0+1i\geq k_{0}+1. Summing (21) from ii to an arbitrary j>ij>i yields

2k=ijzk+1zkk=ijzkzk1+Ck=ijΔk,k+1\displaystyle~{}~{}~{}~{}2\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\sum_{k=i}^{j}\left\lVert z_{k}-z_{k-1}\right\rVert+C\sum_{k=i}^{j}\Delta_{k,k+1}
k=ijzk+1zk+zizi1+Cφ~(Hpi(zi)F(x))Cφ~(Hpj+1(zj+1)F(x))\displaystyle\leq\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert+\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right)-C\tilde{\varphi}\left(H_{p_{j+1}}(z_{j+1})-F(x^{*})\right)
k=ijzk+1zk+zizi1+Cφ~(Hpi(zi)F(x)),\displaystyle\leq\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert+\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right),

implying that

k=ijzk+1zkzizi1+Cφ~(Hpi(zi)F(x)),\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right),

from which (18) readily follows. Finally, notice that

zjzik=ijzk+1zkzizi1+Cφ~(Hpi(zi)F(x)).\left\lVert z_{j}-z_{i}\right\rVert\leq\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right). (22)

Recall from Theorem 3.17(i) and Lemma 3.14(ii) that Hpi(zi)F(x)H_{p_{i}}(z_{i})\to F(x^{*}) and zizi10\left\lVert z_{i}-z_{i-1}\right\rVert\to 0 as ii\to\infty. Hence the right-hand side of (22) tends to 0 as ii\to\infty, and we conclude from (22) that (zk)k(z_{k})_{k\in\mathbb{N}} is Cauchy. Invoking Theorem 3.17(ii) completes the proof of (i).

(ii) The statement follows from the definition of (zk)k(z_{k})_{k\in\mathbb{N}^{*}} and Theorem 3.17(ii).\quad\hfill\blacksquare

The generalized concave KL assumption in Theorem 3.20 can be easily satisfied by semialgebraic functions.

Corollary 3.21

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB. Assume that (xk)k(x_{k})_{k\in\mathbb{N}^{*}} is bounded and Assumption 3.12 holds. Suppose further that both ff and gg are semialgebraic. Then (xk)k(x_{k})_{k\in\mathbb{N}^{*}} converges to some xx^{*} with 0F(x)0\in\partial F(x^{*}) and has the finite length property.

Proof. Let p¯\bar{p} be as in Assumption 3.12. Then (x,y)p¯xy2(x,y)\mapsto\bar{p}\left\lVert x-y\right\rVert^{2} is semialgebraic. In turn, the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) is semialgebraic by the sum rule for semialgebraic functions; see, e.g., [1, Section 4.3]. The desired result then follows immediately from Theorem 3.20.\quad\hfill\blacksquare

4 Stepsize strategies

Having established abstract convergence properties of BiFRB, we now develop conditions under which the key Assumption 3.12 holds.

To furnish Assumption 3.12, it suffices to establish the existence of a suitable p1p_{-1} under desirable stepsize rules. Unlike most of the literature, which emphasizes explicit merit function parameters [9, 10, 31, 27], our (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} are implicit. This turns out to be instrumental in designing less restrictive stepsize rules. In the remainder of this section, unless specified otherwise,

a=(Lhσ)Lg2,b=σ, and c=Lg.a=(L_{\nabla h}-\sigma)L_{\nabla g}^{2},b=\sigma,\text{ and }c=L_{\nabla g}. (23)

Consequently, the sequences (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} given by (10) and (11) satisfy

pk=a2(λk12λk)+12(bλkc)pk1,M1,k=pk1αk+bcλk12λkaλk122λk.\displaystyle p_{k}=\frac{a}{2}\left(\frac{\lambda_{k-1}^{2}}{\lambda_{k}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{k}}-c\right)-p_{k-1},~{}M_{1,k}=p_{k-1}-\frac{\alpha_{k}+bc\lambda_{k-1}}{2\lambda_{k}}-\frac{a\lambda_{k-1}^{2}}{2\lambda_{k}}. (24)

The lemma below shows that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} are determined by the stepsizes and the initial input p1p_{-1}, a fact that will be useful shortly.

Lemma 4.1

Let (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} be a sequence of nonzero real numbers and let a,b,ca,b,c\in\mathbb{R}. Set p1p_{-1}\in\mathbb{R} and define (pk)k(p_{k})_{k\in\mathbb{N}^{*}} as in (24). Then for kk\in\mathbb{N},

p2k=a2i=1k(λ2i12λ2iλ2i22λ2i1)+b2i=1k(1λ2i1λ2i1)+aλ12+b2λ0c2p1,\displaystyle p_{2k}=\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)+\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+\frac{a\lambda^{2}_{-1}+b}{2\lambda_{0}}-\frac{c}{2}-p_{-1},
p2k+1=a2i=0k(λ2i2λ2i+1λ2i12λ2i)+b2i=0k(1λ2i+11λ2i)+p1,\displaystyle p_{2k+1}=\frac{a}{2}\sum_{i=0}^{k}\left(\frac{\lambda_{2i}^{2}}{\lambda_{2i+1}}-\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}\right)+\frac{b}{2}\sum_{i=0}^{k}\left(\frac{1}{\lambda_{2i+1}}-\frac{1}{\lambda_{2i}}\right)+p_{-1},

where we adopt the convention i=10ai=0\sum_{i=1}^{0}a_{i}=0 for any sequence (ak)k(a_{k})_{k\in\mathbb{N}}.

Proof. Let k\{0}k\in\mathbb{N}\backslash\{0\}. Then

p2k\displaystyle p_{2k} =a2(λ2k12λ2k)+12(bλ2kc)p2k1=a2(λ2k12λ2kλ2k22λ2k1)+b2(1λ2k1λ2k1)+p2k2\displaystyle=\frac{a}{2}\left(\frac{\lambda_{2k-1}^{2}}{\lambda_{2k}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{2k}}-c\right)-p_{2k-1}=\frac{a}{2}\left(\frac{\lambda_{2k-1}^{2}}{\lambda_{2k}}-\frac{\lambda_{2k-2}^{2}}{\lambda_{2k-1}}\right)+\frac{b}{2}\left(\frac{1}{\lambda_{2k}}-\frac{1}{\lambda_{2k-1}}\right)+p_{2k-2}
=a2i=1k(λ2i12λ2iλ2i22λ2i1)+b2i=1k(1λ2i1λ2i1)+p0\displaystyle=\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)+\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+p_{0}
=a2i=1k(λ2i12λ2iλ2i22λ2i1)+b2i=1k(1λ2i1λ2i1)+aλ12+b2λ0c2p1,\displaystyle=\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)+\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+\frac{a\lambda^{2}_{-1}+b}{2\lambda_{0}}-\frac{c}{2}-p_{-1},

and

p2k+1\displaystyle p_{2k+1} =a2(λ2k2λ2k+1)+12(bλ2k+1c)p2k\displaystyle=\frac{a}{2}\left(\frac{\lambda_{2k}^{2}}{\lambda_{2k+1}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{2k+1}}-c\right)-p_{2k}
=a2(λ2k2λ2k+1)+12(bλ2k+1c)a2i=1k(λ2i12λ2iλ2i22λ2i1)b2i=1k(1λ2i1λ2i1)\displaystyle=\frac{a}{2}\left(\frac{\lambda_{2k}^{2}}{\lambda_{2k+1}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{2k+1}}-c\right)-\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)-\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)
aλ12+b2λ0+c2+p1\displaystyle-\frac{a\lambda^{2}_{-1}+b}{2\lambda_{0}}+\frac{c}{2}+p_{-1}
=a2i=0k(λ2i2λ2i+1λ2i12λ2i)+b2i=0k(1λ2i+11λ2i)+p1,\displaystyle=\frac{a}{2}\sum_{i=0}^{k}\left(\frac{\lambda_{2i}^{2}}{\lambda_{2i+1}}-\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}\right)+\frac{b}{2}\sum_{i=0}^{k}\left(\frac{1}{\lambda_{2i+1}}-\frac{1}{\lambda_{2i}}\right)+p_{-1},

from which the desired identities hold. \quad\hfill\blacksquare
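The closed forms above can be sanity-checked numerically against the recursion (24). The sketch below, in Python with illustrative values of a, b, c, the stepsize and p_{-1} (assumptions chosen only for illustration, not tied to any particular problem), confirms the pattern that appears again in Proposition 4.2: under a constant stepsize, the odd-indexed iterates return exactly p_{-1} while the even-indexed ones stay constant.

```python
# Numerical sanity check of the recursion (24): with a constant stepsize
# lambda_k = lam, the odd-indexed iterates satisfy p_{2k+1} = p_{-1}, and the
# even-indexed ones equal a*lam/2 + (b/lam - c)/2 - p_{-1}.
# All constants below are illustrative assumptions.
def p_sequence(lams, a, b, c, p_init, n):
    """Iterate p_k = (a/2) lam_{k-1}^2 / lam_k + (b/lam_k - c)/2 - p_{k-1}."""
    p_prev = p_init          # p_{-1}
    lam_prev = lams(-1)      # lambda_{-1}
    out = []
    for k in range(n):
        lam = lams(k)
        p = 0.5 * a * lam_prev ** 2 / lam + 0.5 * (b / lam - c) - p_prev
        out.append(p)        # out[k] stores p_k
        p_prev, lam_prev = p, lam
    return out

lam = 0.1
ps = p_sequence(lambda k: lam, a=2.0, b=3.0, c=1.0, p_init=0.4, n=8)
# odd k: p_k == p_{-1};  even k: p_k == a*lam/2 + (b/lam - c)/2 - p_{-1}
```

The check matches the fixed-stepsize formulas obtained in the proof of Proposition 4.2 below.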

4.1 Non-Euclidean case

Now we provide a fixed stepsize rule for Algorithm 3.2. In contrast to similar stepsize results in the literature, we shall see that Assumption 3.12 furnishes explicit stepsize bounds that are independent of the inertial parameters. In turn, we answer a question of Malitsky and Tam [22, Section 7]; see Remark 4.3.

Proposition 4.2 (fixed stepsize)

Let (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} be a constant BiFRB sequence with λk=λ>0\lambda_{k}=\lambda>0 for all kk. Suppose that σ>2\sigma>2, (Lhσ)σ>1/4(L_{\nabla h}-\sigma)\sigma>1/4 and 0αk<min(1,σ/2)=10\leq\alpha_{k}<\min(1,\sigma/2)=1. Define

λ=(2bc+c)2+4a(b2)2bcc2a>0,\lambda^{*}=\frac{\sqrt{(2bc+c)^{2}+4a(b-2)}-2bc-c}{2a}>0,

where a,b,ca,b,c are specified by (23). If λ<min{λ,(σ1)/[(σ+1)Lg]}\lambda<\min\big{\{}\lambda^{*},(\sigma-1)/[(\sigma+1)L_{\nabla g}]\big{\}}, then there exists p1p_{-1} such that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} generated by (10) and (11) satisfy Assumption 3.12.

Proof. Applying Lemma 4.1 to (24) yields that pk=p1p_{k}=p_{-1} if kk is odd; pk=aλ/2+(b/λc)/2p1p_{k}=a\lambda/2+\left(b/\lambda-c\right)/2-p_{-1} if kk is even; and

M1,k={p1αk2λbc2aλ2,if k is even,bαk2λ(b+1)c2p1,if k is odd.\displaystyle M_{1,k}=\begin{cases*}p_{-1}-\frac{\alpha_{k}}{2\lambda}-\frac{bc}{2}-\frac{a\lambda}{2},&\text{if $k$ is even,}\\ \frac{b-\alpha_{k}}{2\lambda}-\frac{(b+1)c}{2}-p_{-1},&\text{if $k$ is odd.}\end{cases*}

Then

min{p1,a2λ+12(bλc)p1}pkmax{p1,a2λ+12(bλc)p1},\displaystyle\min\bigg{\{}p_{-1},\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)-p_{-1}\bigg{\}}\leq p_{k}\leq\max\bigg{\{}p_{-1},\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)-p_{-1}\bigg{\}}, (25)
lim infkM1,kmin{p112λbc2aλ2,b12λ(b+1)c2p1}.\liminf_{k\to\infty}M_{1,k}\geq\min\bigg{\{}p_{-1}-\frac{1}{2\lambda}-\frac{bc}{2}-\frac{a\lambda}{2},\frac{b-1}{2\lambda}-\frac{(b+1)c}{2}-p_{-1}\bigg{\}}. (26)

To furnish Assumption 3.12, we will prove that

a2λ+12(bλc)>0,\displaystyle\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)>0, (27)
12λ+bc2+aλ2<b12λ(b+1)c2,\displaystyle\frac{1}{2\lambda}+\frac{bc}{2}+\frac{a\lambda}{2}<\frac{b-1}{2\lambda}-\frac{(b+1)c}{2}, (28)
12λ+bc2+aλ2<aλ2+b2λc2.\displaystyle\frac{1}{2\lambda}+\frac{bc}{2}+\frac{a\lambda}{2}<\frac{a\lambda}{2}+\frac{b}{2\lambda}-\frac{c}{2}. (29)

Combining inequalities (27)–(29) yields the existence of some p1p_{-1} such that

0<p1<a2λ+12(bλc) and 12λ+bc2+aλ2<p1<b12λ(b+1)c2,0<p_{-1}<\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)\text{ and }\frac{1}{2\lambda}+\frac{bc}{2}+\frac{a\lambda}{2}<p_{-1}<\frac{b-1}{2\lambda}-\frac{(b+1)c}{2},

which together with (25) and (26) entails Assumption 3.12. First, we establish (27). Observe that

a2λ+12(bλc)>0aλ2cλ+b>0.\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)>0\Leftrightarrow a\lambda^{2}-c\lambda+b>0.

By assumption, a>0a>0 and c24ab=Lg2(14(Lhσ)σ)<0c^{2}-4ab=L^{2}_{\nabla g}\big{(}1-4(L_{\nabla h}-\sigma)\sigma\big{)}<0. Then the quadratic function λaλ2cλ+b\lambda\mapsto a\lambda^{2}-c\lambda+b has no root and is therefore strictly positive, implying (27). Next we justify (28), which amounts to aλ2(2bc+c)λ+b2>0-a\lambda^{2}-(2bc+c)\lambda+b-2>0. Notice that the discriminant (2bc+c)2+4a(b2)>0(2bc+c)^{2}+4a(b-2)>0 and product of roots (b2)/(a)<0(b-2)/(-a)<0. Then the quadratic function λaλ2(2bc+c)λ+b2\lambda\mapsto-a\lambda^{2}-(2bc+c)\lambda+b-2 has two real roots with opposite signs. In turn,

(0<λ<λ)aλ2(2bc+c)λ+b2>0,(\forall~{}0<\lambda<\lambda^{*})-a\lambda^{2}-(2bc+c)\lambda+b-2>0,

furnishing inequality (28). Finally, the assumption λ<(σ1)/[(σ+1)Lg]=(b1)/[(b+1)c]\lambda<(\sigma-1)/[(\sigma+1)L_{\nabla g}]=(b-1)/[(b+1)c] implies (29) after rearranging.\quad\hfill\blacksquare

Remark 4.3

(i) (Comparison to known results) Compared to a result of Boţ and Csetnek [9, Lemma 3.3] with explicit merit function parameters but implicit stepsize bounds, our stepsize is explicit and independent of the inertial parameters; see [10, Remark 7] for a similarly restrictive result and [22, Corollary 4.4] for a result, in the presence of convexity, in which the stepsize bound and the inertial parameter depend on each other; see also [25, Table 1].

(ii) (Regarding a question of Malitsky and Tam) In [22, Section 7], Malitsky and Tam posed a question regarding whether FRB can be adapted to incorporate a Nesterov-type acceleration. With inertial parameters αk[0,1)\alpha_{k}\in[0,1) independent of the stepsize λ\lambda in Proposition 4.2, as opposed to a fixed inertial parameter α[0,1/3)\alpha\in[0,1/3) dependent on the stepsize λ\lambda in [22, Corollary 4.4], this question is resolved. Indeed, the Nesterov acceleration scheme corresponds to Proposition 4.2 with

(k)αk=tk1tk+1,(\forall k\in\mathbb{N}^{*})~{}\alpha_{k}=\frac{t_{k}-1}{t_{k+1}},

where (k)tk+1=(1+1+4tk2)/2(\forall k\in\mathbb{N}^{*})~{}t_{k+1}=(1+\sqrt{1+4t_{k}^{2}})/2 and t1=1t_{-1}=1; see, e.g., [24].
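For illustration, the recursion for (t_k) and the resulting inertial parameters can be computed directly; the short sketch below (the helper nesterov_alphas is ours, not from [24]) verifies that every α_k lies in [0, 1), as Proposition 4.2 permits.

```python
import math

def nesterov_alphas(n):
    """Compute alpha_k = (t_k - 1)/t_{k+1} for the Nesterov sequence
    t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2 started from t_{-1} = 1."""
    t = 1.0                   # t_{-1}
    alphas = []
    for _ in range(n):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas

alphas = nesterov_alphas(50)
# alpha starts at 0 and increases toward 1 while staying strictly below it
```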

(iii) The assumption that (Lhσ)σ>1/4(L_{\nabla h}-\sigma)\sigma>1/4 is not demanding. For instance, let α,β>0\alpha,\beta>0 be such that αβ>1/4\alpha\beta>1/4 and let u:nu:\mathbb{R}^{n}\to\mathbb{R} be a function with 11-Lipschitz gradient. Then h=αu+β2/2h=\alpha u+\beta\left\lVert\cdot\right\rVert^{2}/2 satisfies (Lhσ)σ>1/4(L_{\nabla h}-\sigma)\sigma>1/4.

4.2 Euclidean case

Considering kernel h=2/2h=\left\lVert\cdot\right\rVert^{2}/2, we now turn to stepsize strategies of Algorithm 3.9.

Proposition 4.4 (dynamic stepsize)

Let (λk)k[λ¯,λ¯](\lambda_{k})_{k\in\mathbb{N}^{*}}\subseteq[\underline{\lambda},\overline{\lambda}] be an iFRB stepsize sequence and let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} be the sequence generated by (24) with a=0a=0, b=1b=1 and 0αk<α¯0\leq\alpha_{k}<\bar{\alpha} for some α¯<1/2\bar{\alpha}<1/2. Let (ak)k(a_{k})_{k\in\mathbb{N}} be a sequence of positive real numbers with k=0ak<\sum_{k=0}^{\infty}a_{k}<\infty and let 0<ε<(12α¯)/(3Lg)0<\varepsilon<(1-2\bar{\alpha})/(3L_{\nabla g}). Suppose that the following hold:

(i) The stepsize lower bound λ¯=ε\underline{\lambda}=\varepsilon and upper bound λ¯<ε/(2α¯+3εLg)\overline{\lambda}<\varepsilon/(2\bar{\alpha}+3\varepsilon L_{\nabla g}).

(ii) For all kk\in\mathbb{N}, 0(1/λk)(1/λk1)ak0\leq(1/\lambda_{k})-(1/\lambda_{k-1})\leq a_{k}.

Then there exists p1p_{-1} such that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} generated by (10) and (11) satisfy Assumption 3.12.

Proof. First notice that assumption (i) is consistent, because ε<(12α¯)/(3Lg)2α¯+3εLg<1ε<ε/(2α¯+3εLg)\varepsilon<(1-2\bar{\alpha})/(3L_{\nabla g})\Leftrightarrow 2\bar{\alpha}+3\varepsilon L_{\nabla g}<1\Leftrightarrow\varepsilon<\varepsilon/(2\bar{\alpha}+3\varepsilon L_{\nabla g}). Then, by assumption (i), we have

λ0<ε2α¯+3εLgLg2+α¯2ε<12λ0Lgα¯2ε,\displaystyle\lambda_{0}<\frac{\varepsilon}{2\bar{\alpha}+3\varepsilon L_{\nabla g}}\Leftrightarrow\frac{L_{\nabla g}}{2}+\frac{\bar{\alpha}}{2\varepsilon}<\frac{1}{2\lambda_{0}}-L_{\nabla g}-\frac{\bar{\alpha}}{2\varepsilon},

which implies that there exists p1p_{-1} such that

Lg2+α¯2ε<p1<12λ0Lgα¯2ε.\displaystyle\frac{L_{\nabla g}}{2}+\frac{\bar{\alpha}}{2\varepsilon}<p_{-1}<\frac{1}{2\lambda_{0}}-L_{\nabla g}-\frac{\bar{\alpha}}{2\varepsilon}. (30)

According to Lemma 4.1 and assumption (ii),

p1p2k+1\displaystyle p_{-1}\leq p_{2k+1} =12i=0k(1λ2i+11λ2i)+p112j=0aj+p1,=\frac{1}{2}\sum_{i=0}^{k}\left(\frac{1}{\lambda_{2i+1}}-\frac{1}{\lambda_{2i}}\right)+p_{-1}\leq\frac{1}{2}\sum_{j=0}^{\infty}a_{j}+p_{-1},
12λ0c2p1p2k\displaystyle\frac{1}{2\lambda_{0}}-\frac{c}{2}-p_{-1}\leq p_{2k} =12i=1k(1λ2i1λ2i1)+12λ0c2p112j=0aj+12λ0c2p1,=\frac{1}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+\frac{1}{2\lambda_{0}}-\frac{c}{2}-p_{-1}\leq\frac{1}{2}\sum_{j=0}^{\infty}a_{j}+\frac{1}{2\lambda_{0}}-\frac{c}{2}-p_{-1},

which together with (30) shows that pkp_{k} is positive and bounded above. By assumption (ii), λkλk1\lambda_{k}\leq\lambda_{k-1} for all kk. Consequently, the sequence (λk)(\lambda_{k}) is convergent and limkλk1/λk=1\lim_{k\to\infty}\lambda_{k-1}/\lambda_{k}=1. It follows from the above inequalities and (30) that

M1,2k+1\displaystyle M_{1,2k+1} =p2k+1α2k+1+Lgλ2k2λ2k+1p1Lg2(λ2kλ2k+1)α¯2εp1Lg2α¯2ε>0,\displaystyle=p_{2k+1}-\frac{\alpha_{2k+1}+L_{\nabla g}\lambda_{2k}}{2\lambda_{2k+1}}\geq p_{-1}-\frac{L_{\nabla g}}{2}\left(\frac{\lambda_{2k}}{\lambda_{2k+1}}\right)-\frac{\bar{\alpha}}{2\varepsilon}\to p_{-1}-\frac{L_{\nabla g}}{2}-\frac{\bar{\alpha}}{2\varepsilon}>0,
M1,2k\displaystyle M_{1,2k} =p2kα2k+Lgλ2k12λ2k\displaystyle=p_{2k}-\frac{\alpha_{2k}+L_{\nabla g}\lambda_{2k-1}}{2\lambda_{2k}}
12λ0Lg2(λ2k1λ2k+1)α¯2εp112λ0Lgα¯2εp1>0.\displaystyle\geq\frac{1}{2\lambda_{0}}-\frac{L_{\nabla g}}{2}\left(\frac{\lambda_{2k-1}}{\lambda_{2k}}+1\right)-\frac{\bar{\alpha}}{2\varepsilon}-p_{-1}\to\frac{1}{2\lambda_{0}}-L_{\nabla g}-\frac{\bar{\alpha}}{2\varepsilon}-p_{-1}>0.

Hence lim infkM1,k>0\liminf_{k\to\infty}M_{1,k}>0.\quad\hfill\blacksquare
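A concrete stepsize sequence satisfying assumptions (i)-(ii) of Proposition 4.4 can be generated as in the sketch below. The constants L∇g, ᾱ and ε are illustrative assumptions, and the increments a_k are scaled geometrically so that 1/λ_k never exceeds 1/ε.

```python
# A dynamic stepsize sequence satisfying (i)-(ii) of Proposition 4.4.
# All constants are illustrative assumptions.
L_g, alpha_bar = 1.0, 0.2
eps = 0.9 * (1 - 2 * alpha_bar) / (3 * L_g)            # 0 < eps < (1-2a)/(3Lg)
lam_up = 0.99 * eps / (2 * alpha_bar + 3 * eps * L_g)  # upper bound, condition (i)

# Summable increments a_k, scaled so the total increase of 1/lambda_k
# stays within the budget 1/eps - 1/lam_up.
budget = 1.0 / eps - 1.0 / lam_up
a_seq = [budget * 2.0 ** (-(k + 1)) for k in range(200)]   # sum < budget

lams, inv = [], 1.0 / lam_up
for a_k in a_seq:
    inv += a_k              # 0 <= 1/lam_k - 1/lam_{k-1} <= a_k, condition (ii)
    lams.append(1.0 / inv)
# every lam_k stays in [eps, lam_up] and the sequence is nonincreasing
```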

The following corollary, which together with Lemma 3.14 provides an inertial variant of [31, Lemma 3.2], follows immediately.

Corollary 4.5 (fixed stepsize)

Let (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} be a constant iFRB stepsize sequence with λk=λ>0\lambda_{k}=\lambda>0 for all kk and let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} be the sequence generated by (24) with a=0a=0, b=1b=1 and 0αk<α¯0\leq\alpha_{k}<\bar{\alpha} for some α¯<1/2\bar{\alpha}<1/2. If λ<(12α¯)/(3Lg)\lambda<(1-2\bar{\alpha})/(3L_{\nabla g}), then there exists p1p_{-1} such that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} generated by (10) and (11) satisfy Assumption 3.12.

Remark 4.6

One recovers the stepsize requirement in [31, Lemma 3.2] by setting α¯=0\bar{\alpha}=0.

5 Convergence rates

Having obtained global convergence results, we now turn to convergence rates of both the function value and the actual sequence. To this end, the following lemma will be needed.

Lemma 5.1

[11, Lemma 10] Let (ek)k+(e_{k})_{k\in\mathbb{N}}\subseteq\mathbb{R}_{+} be a decreasing sequence converging to 0. Assume further that there exist natural numbers k0l01k_{0}\geq l_{0}\geq 1 such that for every kk0k\geq k_{0},

ekl0ekCeek2θ,e_{k-l_{0}}-e_{k}\geq C_{e}e_{k}^{2\theta}, (31)

where Ce>0C_{e}>0 is some constant and θ[0,1)\theta\in[0,1). Then the following hold:

(i) If θ=0\theta=0, then (ek)k(e_{k})_{k\in\mathbb{N}} converges in finite steps.

(ii) If θ(0,12]\theta\in\left(0,\frac{1}{2}\right], then there exist Ce,0>0C_{e,0}>0 and Q[0,1)Q\in[0,1) such that for every kk0k\geq k_{0},

0ekCe,0Qk.0\leq e_{k}\leq C_{e,0}Q^{k}.

(iii) If θ(12,1)\theta\in\left(\frac{1}{2},1\right), then there exists Ce,1>0C_{e,1}>0 such that for every kk0+l0k\geq k_{0}+l_{0},

0ekCe,1(kl0+1)12θ1.0\leq e_{k}\leq C_{e,1}(k-l_{0}+1)^{-\frac{1}{2\theta-1}}.
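To see how (31) produces the linear rate in case (ii), consider the model case l_0 = 1 and θ = 1/2 with equality in (31): then e_{k-1} - e_k = C_e e_k gives e_k = e_0/(1 + C_e)^k, i.e. geometric decay with Q = 1/(1 + C_e). A minimal sketch with illustrative constants:

```python
# Model case of Lemma 5.1 with l_0 = 1, theta = 1/2 and equality in (31):
# e_{k-1} - e_k = C_e * e_k, so e_k = e_{k-1} / (1 + C_e) = e_0 * Q**k.
# C_e and e_0 are illustrative constants.
C_e, e0, n = 0.5, 1.0, 30
Q = 1.0 / (1.0 + C_e)

es = [e0]
for _ in range(n):
    es.append(es[-1] / (1.0 + C_e))   # solves e_{k-1} - e_k = C_e e_k exactly
# es[k] agrees with the closed form e0 * Q**k
```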

Recall that a generalized concave KL function has KL exponent θ[0,1)\theta\in[0,1) if an associated desingularizing function can be chosen as φ(t)=ct1θ\varphi(t)=c\cdot t^{1-\theta} for some c>0c>0. We now provide the function value convergence rate of BiFRB under a KL exponent assumption.

Theorem 5.2 (function value convergence rate)

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that all assumptions in Theorem 3.20 are satisfied and let xx^{*} be the limit given in Theorem 3.20(ii). Suppose further that the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) has the generalized concave KL property at z=(x,x)z^{*}=(x^{*},x^{*}) with KL exponent θ[0,1)\theta\in[0,1). Define ek=Hpk(zk)F(x)e_{k}=H_{p_{k}}(z_{k})-F(x^{*}) for kk\in\mathbb{N}. Then (ek)k(e_{k})_{k\in\mathbb{N}} converges to 0 and the following hold:

(i) If $\theta=0$, then $(e_{k})_{k\in\mathbb{N}}$ converges in finitely many steps.

(ii) If θ(0,1/2]\theta\in\left(0,1/2\right], then there exist c^1>0\hat{c}_{1}>0 and Q^1[0,1)\hat{Q}_{1}\in[0,1) such that for kk sufficiently large,

ekc^1Q^1k.e_{k}\leq\hat{c}_{1}\hat{Q}_{1}^{k}.

(iii) If θ(1/2,1)\theta\in\left(1/2,1\right), then there exists c^2>0\hat{c}_{2}>0 such that for kk sufficiently large,

ekc^2k12θ1.e_{k}\leq\hat{c}_{2}k^{-\frac{1}{2\theta-1}}.

Proof. Recall from Lemma 3.14(ii) and Theorem 3.17(i) that the sequence $(e_{k})_{k\in\mathbb{N}}$ is decreasing and converges to 0. By the KL exponent assumption, there exist $c>0$ and $k_{0}\in\mathbb{N}$ such that for every $k\geq k_{0}$,

dist(0,Hp¯(zk))(Hpk(zk)Hp¯(z))θc(1θ)=ekθc(1θ).\operatorname{dist}\big{(}0,\partial H_{\bar{p}}(z_{k})\big{)}\geq\frac{\left(H_{p_{k}}(z_{k})-H_{\bar{p}}(z^{*})\right)^{\theta}}{c(1-\theta)}=\frac{e_{k}^{\theta}}{c(1-\theta)}.

Assume without loss of generality that $M_{1,k}\geq\liminf_{k\to\infty}M_{1,k}/2$ for every $k$. Then Lemmas 3.14 and 3.16 imply

$e_{k-1}-e_{k}=H_{p_{k-1}}(z_{k-1})-H_{p_{k}}(z_{k})\geq M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}\geq\frac{M_{1,k}}{M_{2,k}^{2}}\operatorname{dist}(0,\partial H_{\bar{p}}(z_{k}))^{2}\geq C_{e}e_{k}^{2\theta},$

where

Ce=lim infkM1,k4c2(1θ)2max{(Lhλ¯+Lg+6p¯)2,(LhLgλ¯+1λ¯)2}.C_{e}=\frac{\liminf_{k\to\infty}M_{1,k}}{4c^{2}(1-\theta)^{2}\max\bigg{\{}\left(\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6\bar{p}\right)^{2},\left(\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\right)^{2}\bigg{\}}}.

Applying Lemma 5.1 completes the proof.\quad\hfill\blacksquare

We now develop a sequential convergence rate based on Theorem 5.2.

Theorem 5.3 (sequential convergence rate)

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that all assumptions in Theorem 3.20 are satisfied and let xx^{*} be the limit given in Theorem 3.20(ii). Suppose further that the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) has the generalized concave KL property at z=(x,x)z^{*}=(x^{*},x^{*}) with KL exponent θ[0,1)\theta\in[0,1). Then the following hold:

(i) If $\theta=0$, then $z_{k}$ converges to $z^{*}$ in finitely many steps.

(ii) If θ(0,1/2]\theta\in\left(0,1/2\right], then there exist c1>0c_{1}>0 and Q1[0,1)Q_{1}\in[0,1) such that for kk sufficiently large

zkzc1Q1k.\left\lVert z_{k}-z^{*}\right\rVert\leq c_{1}Q_{1}^{k}.

(iii) If $\theta\in\left(1/2,1\right)$, then there exists $c_{2}>0$ such that for $k$ sufficiently large,

zkzc2k1θ2θ1.\left\lVert z_{k}-z^{*}\right\rVert\leq c_{2}k^{-\frac{1-\theta}{2\theta-1}}.

Proof. For simplicity, define for kk\in\mathbb{N}^{*}

ek=Hpk(zk)F(x),δk=i=kzi+1zi.e_{k}=H_{p_{k}}(z_{k})-F(x^{*}),\delta_{k}=\sum_{i=k}^{\infty}\left\lVert z_{i+1}-z_{i}\right\rVert.

Then ek0e_{k}\to 0 and zkzδk\left\lVert z_{k}-z^{*}\right\rVert\leq\delta_{k}. Without loss of generality (recall Lemma 3.14), assume that ek[0,1)e_{k}\in[0,1) and ek+1eke_{k+1}\leq e_{k} for kk\in\mathbb{N}^{*}. To obtain the desired results, it suffices to estimate δk\delta_{k}.

Assume without loss of generality that $M_{1,k}\geq\liminf_{k\to\infty}M_{1,k}/2$. Invoking Lemma 3.14(i) and Theorem 3.17(i) yields that $\left(\liminf_{k\to\infty}M_{1,k}/2\right)\left\lVert z_{k}-z_{k-1}\right\rVert^{2}\leq M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}\leq H_{p_{k-1}}(z_{k-1})-H_{p_{k}}(z_{k})\leq H_{p_{k-1}}(z_{k-1})-F(x^{*})=e_{k-1}$, which in turn implies that

zkzk1dek1,\left\lVert z_{k}-z_{k-1}\right\rVert\leq d\sqrt{e_{k-1}}, (32)

where d=2/lim infkM1,kd=\sqrt{2/\liminf_{k\to\infty}M_{1,k}}. Moreover, recall from Theorem 3.20(i) that there exists k0k_{0} such that for kk0+1k\geq k_{0}+1

δkzkzk1+Cφ~(ek)zkzk1+Cφ~(ek1),\delta_{k}\leq\left\lVert z_{k}-z_{k-1}\right\rVert+C\tilde{\varphi}(e_{k})\leq\left\lVert z_{k}-z_{k-1}\right\rVert+C\tilde{\varphi}(e_{k-1}), (33)

where the last inequality holds because $e_{k}\leq e_{k-1}$. By the KL exponent assumption, assume without loss of generality that $H_{\bar{p}}$ is associated with the desingularizing function $\varphi(t)=C^{-1}t^{1-\theta}$. Thus, combined with (32), inequality (33) yields

δkdek1+ek11θ.\delta_{k}\leq d\sqrt{e_{k-1}}+e_{k-1}^{1-\theta}. (34)

Case 1 ($\theta=0$): Appealing to Theorem 5.2, the sequence $(e_{k})_{k\in\mathbb{N}}$ converges to 0 in finitely many steps. Thus (34) implies that $\delta_{k}\to 0$ in finitely many steps, and hence $(z_{k})_{k\in\mathbb{N}^{*}}$ converges to $z^{*}$ in finitely many steps.

Case 2 (0<θ120<\theta\leq\frac{1}{2}): Clearly 121θ<1\frac{1}{2}\leq 1-\theta<1, therefore ek1ek11θ\sqrt{e_{k-1}}\geq e_{k-1}^{1-\theta}. Inequality (34) and Theorem 5.2 imply that for kk sufficiently large

δkdek1+ek1=(1+d)ek1(1+d)c^1Q^1k1.\delta_{k}\leq d\sqrt{e_{k-1}}+\sqrt{e_{k-1}}=(1+d)\sqrt{e_{k-1}}\leq(1+d)\sqrt{\hat{c}_{1}\hat{Q}_{1}^{k-1}}.

Case 3 (12<θ<1\frac{1}{2}<\theta<1): Note that ek1ek11θ\sqrt{e_{k-1}}\leq e_{k-1}^{1-\theta}. Then appealing to (34) and Theorem 5.2 yields for kk sufficiently large

$\delta_{k}\leq de_{k-1}^{1-\theta}+e_{k-1}^{1-\theta}=(1+d)e_{k-1}^{1-\theta}\leq(1+d)\hat{c}_{2}^{1-\theta}(k-1)^{-\frac{1-\theta}{2\theta-1}},$

from which the desired result readily follows.\quad\hfill\blacksquare

Example 5.4 (nonconvex feasibility problem)

Let CnC\subseteq\mathbb{R}^{n} be a nonempty, closed and convex set and let DnD\subseteq\mathbb{R}^{n} be nonempty and closed. Suppose that CDC\cap D\neq\emptyset and either CC or DD is compact. Assume further that both CC and DD are semialgebraic. Consider the minimization problem (2) with f=δDf=\delta_{D} and g=dist2(,C)/2g=\operatorname{dist}^{2}(\cdot,C)/2. Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB. Suppose that Assumption 3.12 holds. Then the following hold:

(i) There exists xnx^{*}\in\mathbb{R}^{n} such that xkxx_{k}\to x^{*} and 0F(x)0\in\partial F(x^{*}).

(ii) Suppose additionally that

NC(ProjC(x))(ND(x))={0}.N_{C}(\operatorname{Proj}_{C}(x^{*}))\cap\big{(}-N_{D}(x^{*})\big{)}=\{0\}. (35)

Then xCDx^{*}\in C\cap D. Moreover, there exist Q1(0,1)Q_{1}\in(0,1) and c1>0c_{1}>0 such that

xkxc1Q1k,\left\lVert x_{k}-x^{*}\right\rVert\leq c_{1}Q_{1}^{k},

for kk sufficiently large.

Proof. (i) By the compactness assumption, the function $f+g$ is coercive. Hence Theorem 3.18 implies that $(x_{k})_{k\in\mathbb{N}^{*}}$ is bounded. Since the sets $C$ and $D$ are semialgebraic, so are the functions $f$ and $g$; see [1, Section 4.3] and [2, Lemma 2.3]. The claim then follows from Corollary 3.21.

(ii) Taking the fact that 0F(x)0\in\partial F(x^{*}) into account and applying the subdifferential sum rule [28, Exercise 8.8], one gets

xProjC(x)ND(x).x^{*}-\operatorname{Proj}_{C}(x^{*})\in-N_{D}(x^{*}).

The constraint qualification (35) then implies that x=ProjC(x)x^{*}=\operatorname{Proj}_{C}(x^{*}) and consequently xCx^{*}\in C. The set DD is assumed to be closed, thus xDx^{*}\in D. Then a direct application of [13, Theorem 5] guarantees that FF has KL exponent θ=1/2\theta=1/2 at xx^{*}. The desired result follows immediately from Theorem 5.3. \quad\hfill\blacksquare

6 Bregman proximal subproblem formulae

Let $\alpha\geq 0$ and $\beta>0$. Fix $\omega\in\mathbb{R}^{n}$ and $\lambda>0$. In the remainder of this section, let

h(x)=α1+x2+β2x2.h(x)=\alpha\sqrt{1+\left\lVert x\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x\right\rVert^{2}.

Define for unu\in\mathbb{R}^{n}, pλ(u)=λωh(u)p_{\lambda}(u)=\lambda\omega-\nabla h(u) and

Tλ(u)\displaystyle T_{\lambda}(u) =argminxn{f(x)+xu,ω+1λDh(x,u)}\displaystyle=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-u,\omega\rangle+\frac{1}{\lambda}D_{h}(x,u)\bigg{\}} (36)
=argminxn{f(x)+x,ω+1λh(x)1λh(u),x}\displaystyle=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x,\omega\rangle+\frac{1}{\lambda}h(x)-\frac{1}{\lambda}\langle\nabla h(u),x\rangle\bigg{\}}
=argminxn{λf(x)+x,pλ(u)+h(x)}.\displaystyle=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}\lambda f(x)+\langle x,p_{\lambda}(u)\rangle+h(x)\bigg{\}}. (37)

Clearly Assumption 3.6(ii) holds with $\sigma=\beta$ and $L_{\nabla h}=\alpha+\beta$, and the Bregman proximal subproblem (9) corresponds to (36) with $\omega=\alpha_{k}(x_{k-1}-x_{k})/\lambda_{k}+\nabla g(x_{k})$, $u=y_{k}$ and $\lambda=\lambda_{k}$. Note that (36) covers the subproblems in [9, Algorithm 3.1] and [10, Algorithm 1] as well.

In this section, we develop formulae of Tλ(u)T_{\lambda}(u) with the kernel hh chosen as above, supplementing not only Algorithm 3.2 but also [9, Algorithm 3.1] and [10, Algorithm 1].

6.1 Formula of l0l_{0}-constrained problems

Let $D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}$ for an integer $r\leq n$ and a real number $R>0$, where $(\forall x\in\mathbb{R}^{n})~\left\lVert x\right\rVert_{0}=\sum_{i=1}^{n}|\operatorname{sgn}(x_{i})|$ with $\operatorname{sgn}(0)=0$. In this subsection, we consider the minimization problem

minxDg(x),\min_{x\in D}g(x),

which is (2) with f=δDf=\delta_{D}. Recall that the Hard-threshold of xnx\in\mathbb{R}^{n} with parameter rr, denoted by Hr(x)H_{r}(x), is given by

(i{1,,n})(Hr(x))i={xi, if |xi| belongs to the r largest among |x1|,,|xn|,0, otherwise.(\forall i\in\{1,\ldots,n\})~{}\big{(}H_{r}(x)\big{)}_{i}=\begin{cases}x_{i},&\text{ if $|x_{i}|$ belongs to the $r$ largest among $|x_{1}|,\ldots,|x_{n}|$},\\ 0,&\text{ otherwise}.\end{cases} (38)

Immediately, Hr(cx)=cHr(x)H_{r}(cx)=cH_{r}(x) for c0c\neq 0. The following lemma will be instrumental.
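For concreteness, the hard-threshold operator (38) and the scaling identity $H_{r}(cx)=cH_{r}(x)$ for $c\neq 0$ can be sketched as follows; this is a minimal NumPy illustration, with ties among the $r$ largest magnitudes broken arbitrarily, as (38) permits.

```python
import numpy as np

def hard_threshold(x, r):
    """Keep the r entries of largest magnitude and zero out the rest (cf. (38))."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-r:]   # indices of the r largest |x_i|
    out[keep] = x[keep]
    return out

x = np.array([3.0, -1.0, 2.0, 0.0])
assert np.allclose(hard_threshold(x, 2), [3.0, 0.0, 2.0, 0.0])
# Scaling identity H_r(c x) = c H_r(x) for c != 0:
assert np.allclose(hard_threshold(-2 * x, 2), -2 * hard_threshold(x, 2))
```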

Lemma 6.1

[21, Proposition 4.3] Given 0pn0\neq p\in\mathbb{R}^{n} and a positive integer r<nr<n, we have

maxxn{p,x:x=1,x0r}=Hr(p),\max_{x\in\mathbb{R}^{n}}\big{\{}\langle p,x\rangle:\left\lVert x\right\rVert=1,~{}\left\lVert x\right\rVert_{0}\leq r\big{\}}=\left\lVert H_{r}(p)\right\rVert,

with optimal value attained at x=Hr(p)1Hr(p)x^{*}=\left\lVert H_{r}(p)\right\rVert^{-1}H_{r}(p).
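Lemma 6.1 can be checked by brute force in small dimensions: for each support of size $r$, the maximum of $\langle p,x\rangle$ over unit vectors supported there is the norm of $p$ restricted to that support, so the overall maximum is the norm of the $r$ largest-magnitude entries of $p$. A hedged sketch with random data:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(6)
r = 2

# Brute force: for a fixed support S with |S| = r, the best unit vector is
# p_S / ||p_S|| with value ||p_S||; enlarging the support only helps, so
# maximizing over supports of size exactly r suffices.
brute = max(np.linalg.norm(p[list(S)]) for S in itertools.combinations(range(6), r))

# Closed form from Lemma 6.1: ||H_r(p)||, the norm of the r largest-magnitude entries.
keep = np.argsort(np.abs(p))[-r:]
closed = np.linalg.norm(p[keep])

assert abs(brute - closed) < 1e-12
```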

Following a route similar to that of [8, Proposition 5.2], we now prove the desired formula.

Proposition 6.2 (subproblem formula)

Let $f=\delta_{D}$, where $D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}$ for an integer $r\leq n$ and a real number $R>0$. Define

x={0,if h(u)=λω,tHr(pλ(u))1Hr(pλ(u)),if h(u)λω,x^{*}=\begin{cases}0,&\text{if }\nabla h(u)=\lambda\omega,\\ -t^{*}\left\lVert H_{r}\big{(}p_{\lambda}(u)\big{)}\right\rVert^{-1}H_{r}\big{(}p_{\lambda}(u)\big{)},&\text{if }\nabla h(u)\neq\lambda\omega,\end{cases} (39)

where tt^{*} is the unique solution of

$\min_{0\leq t\leq R}\bigg\{\varphi(t)=\alpha\sqrt{1+t^{2}}+\frac{\beta}{2}t^{2}-\left\lVert H_{r}(p_{\lambda}(u))\right\rVert t\bigg\},$ (40)

which satisfies $t^{*}=R$ if $\left\lVert H_{r}(p_{\lambda}(u))\right\rVert\geq\alpha R(1+R^{2})^{-1/2}+\beta R$; otherwise $t^{*}\in(0,R)$ is the unique solution of

α(1+t2)1/2t+βtHr(pλ(u))=0.\alpha(1+t^{2})^{-1/2}t+\beta t-\left\lVert H_{r}(p_{\lambda}(u))\right\rVert=0.

Then xTλ(u)x^{*}\in T_{\lambda}(u).

Proof. Set p=pλ(u)p=p_{\lambda}(u) for simplicity. If h(u)=λω\nabla h(u)=\lambda\omega, then p=0p=0 and clearly Tλ(u)=argminxn{h(x):x0r,xR}=0T_{\lambda}(u)=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\{h(x):\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}=0. Now we consider h(u)λω\nabla h(u)\neq\lambda\omega, in which case p0p\neq 0. Invoking Lemma 6.1, one has for t>0t>0

$\min_{x\in\mathbb{R}^{n}}\{\langle x,p\rangle:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert=t\}=-\max_{x\in\mathbb{R}^{n}}\{\langle x/t,-tp\rangle:\left\lVert x/t\right\rVert_{0}\leq r,\left\lVert x/t\right\rVert=1\}=-\left\lVert H_{r}(-tp)\right\rVert=-t\left\lVert H_{r}(p)\right\rVert,$ (41)

attained at

x^=x^(t)=tHr(tp)1Hr(tp)=tHr(p)1Hr(p).\hat{x}=\hat{x}(t)=t\left\lVert H_{r}(-tp)\right\rVert^{-1}H_{r}(-tp)=-t\left\lVert H_{r}(p)\right\rVert^{-1}H_{r}(p). (42)

It is easy to see that the optimal value (41) and the solution (42) still hold when $t=0$.

Next we show that xTλ(u)x^{*}\in T_{\lambda}(u). By (37), Tλ(u)T_{\lambda}(u) is the solution set to

minxn{x,p+α1+x2+β2x2:x0r,xR}.\min_{x\in\mathbb{R}^{n}}\bigg{\{}\langle x,p\rangle+\alpha\sqrt{1+\left\lVert x\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x\right\rVert^{2}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\bigg{\}}. (43)

Consider an arbitrary feasible xx of (43) and let tx=xt_{x}=\left\lVert x\right\rVert. Then 0txR0\leq t_{x}\leq R and

p,x+α1+x2+β2x2\displaystyle\langle p,x^{*}\rangle+\alpha\sqrt{1+\left\lVert x^{*}\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x^{*}\right\rVert^{2} =tHr(p)+α1+(t)2+β2(t)2\displaystyle=-t^{*}\left\lVert H_{r}(p)\right\rVert+\alpha\sqrt{1+(t^{*})^{2}}+\frac{\beta}{2}(t^{*})^{2}
=φ(t)φ(tx)=Hr(p)tx+α1+tx2+β2tx2\displaystyle=\varphi(t^{*})\leq\varphi(t_{x})=-\left\lVert H_{r}(p)\right\rVert t_{x}+\alpha\sqrt{1+t_{x}^{2}}+\frac{\beta}{2}t_{x}^{2}
p,x+α1+x2+β2x2,\displaystyle\leq\langle p,x\rangle+\alpha\sqrt{1+\left\lVert x\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x\right\rVert^{2},

where the first and last inequalities hold due to (40) and (41), respectively.

Finally, we turn to (40). Noticing the strong convexity of $\varphi$ and applying the optimality condition, one gets

$-\alpha(1+(t^{*})^{2})^{-1/2}t^{*}-\beta t^{*}+\left\lVert H_{r}(p)\right\rVert=-\varphi^{\prime}(t^{*})\in N_{[0,R]}(t^{*}),$

so

$t^{*}=R\Leftrightarrow-\varphi^{\prime}(R)\in N_{[0,R]}(R)=[0,\infty)\Leftrightarrow\varphi^{\prime}(R)\leq 0\Leftrightarrow\left\lVert H_{r}(p)\right\rVert\geq\alpha R(1+R^{2})^{-1/2}+\beta R,$

and

t=0φ(0)N[0,R](0)=(,0]Hr(p)0Hr(p)=0,t^{*}=0\Leftrightarrow-\varphi^{\prime}(0)\in N_{[0,R]}(0)=(-\infty,0]\Leftrightarrow\left\lVert H_{r}(p)\right\rVert\leq 0\Leftrightarrow H_{r}(p)=0,

which never occurs when $p\neq 0$. If $0<\left\lVert H_{r}(p)\right\rVert<\alpha R(1+R^{2})^{-1/2}+\beta R$, then $\varphi^{\prime}(0)<0$ and $\varphi^{\prime}(R)>0$, meaning that there exists a unique $t^{*}\in(0,R)$ such that $\varphi^{\prime}(t^{*})=0$, due to the strong convexity of $\varphi$. $\quad\hfill\blacksquare$

Remark 6.3

When $0<\left\lVert H_{r}(p_{\lambda}(u))\right\rVert<\alpha R(1+R^{2})^{-1/2}+\beta R$, $t^{*}$ in Proposition 6.2 is indeed one of the four roots of the quartic equation

β2t42βHr(pλ(u))t3+(Hr(pλ(u))2+β2α2)t22βHr(pλ(u))t+Hr(pλ(u))2=0.\beta^{2}t^{4}-2\beta\left\lVert H_{r}(p_{\lambda}(u))\right\rVert t^{3}+\left(\left\lVert H_{r}(p_{\lambda}(u))\right\rVert^{2}+\beta^{2}-\alpha^{2}\right)t^{2}-2\beta\left\lVert H_{r}(p_{\lambda}(u))\right\rVert t+\left\lVert H_{r}(p_{\lambda}(u))\right\rVert^{2}=0.

Although quartic equations admit closed-form solutions, it is virtually impossible to determine beforehand the branch of solutions to which $t^{*}$ belongs, due to the parameters $\alpha$, $\beta$ and $\left\lVert H_{r}(p_{\lambda}(u))\right\rVert$. To find $t^{*}$ efficiently, one may employ numerical root-finding algorithms: for instance, apply the bisection method to identify a subinterval of $[0,R]$ in which the desired root $t^{*}$ lies, then appeal to Newton's method; see, e.g., [12, Chapter 2].
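The bisection-then-Newton strategy just described can be sketched as follows; this is a hedged illustration with arbitrarily chosen sample parameters $\alpha$, $\beta$, $h=\left\lVert H_{r}(p_{\lambda}(u))\right\rVert$ and $R$ for which the root lies in the interior of $[0,R]$.

```python
import math

def solve_t_star(alpha, beta, h, R, tol=1e-12):
    """Find t* in (0, R) with phi'(t) = alpha*t/sqrt(1+t^2) + beta*t - h = 0,
    assuming phi'(0) = -h < 0 < phi'(R), so that the root is interior:
    bisection to localize the root, then Newton refinement."""
    dphi = lambda t: alpha * t / math.sqrt(1 + t * t) + beta * t - h
    lo, hi = 0.0, R
    for _ in range(20):                      # bisection: shrink the bracket
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if dphi(mid) < 0 else (lo, mid)
    t = 0.5 * (lo + hi)
    d2phi = lambda t: alpha / (1 + t * t) ** 1.5 + beta   # phi'' >= beta > 0
    for _ in range(50):                      # Newton's method
        step = dphi(t) / d2phi(t)
        t -= step
        if abs(step) < tol:
            break
    return t

t_star = solve_t_star(alpha=1.0, beta=2.0, h=1.0, R=5.0)
assert 0 < t_star < 5.0
assert abs(1.0 * t_star / math.sqrt(1 + t_star**2) + 2.0 * t_star - 1.0) < 1e-10
```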

6.2 Formula of problems penalized by a positively homogeneous convex function

Recall that $f:\mathbb{R}^{n}\to\overline{\mathbb{R}}$ is positively homogeneous if $f(tx)=tf(x)$ for every $x\in\mathbb{R}^{n}$ and $t>0$. In this subsection, we provide a subproblem formula for problems of the form

minxnf(x)+g(x),\min_{x\in\mathbb{R}^{n}}f(x)+g(x),

where $f:\mathbb{R}^{n}\to\overline{\mathbb{R}}$ is convex and positively homogeneous with a known proximal mapping; this is a special case of problem (2).

The result below generalizes the technique of Bolte-Sabach-Teboulle-Vaisbourd [8, Proposition 5.1]; however, one cannot recover their result, as our kernel is different.

Proposition 6.4

Let f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} be proper, lsc and convex. Suppose that ff is positively homogeneous. Then the following hold:

(i) (xdomf)(\forall x\in\operatorname{dom}\partial f) (t>0)(\forall t>0) f(x)=f(tx)\partial f(x)=\partial f(tx).

(ii) (subproblem formula) (un)(\forall u\in\mathbb{R}^{n}) Tλ(u)=tProxλf(pλ(u))T_{\lambda}(u)=t^{*}\operatorname{Prox}_{\lambda f}(-p_{\lambda}(u)), where tt^{*} is the unique root of the equation

$1-\alpha t\big(1+t^{2}\left\lVert\operatorname{Prox}_{\lambda f}(-p_{\lambda}(u))\right\rVert^{2}\big)^{-1/2}-\beta t=0.$

Proof. (i) vf(x)(yn)f(y)f(x)+v,yx(yn)(t>0)f(ty)f(tx)+v,tytxvf(tx)v\in\partial f(x)\Leftrightarrow(\forall y\in\mathbb{R}^{n})~{}f(y)\geq f(x)+\langle v,y-x\rangle\Leftrightarrow(\forall y\in\mathbb{R}^{n})~{}(\forall t>0)~{}f(ty)\geq f(tx)+\langle v,ty-tx\rangle\Leftrightarrow v\in\partial f(tx).

(ii) For simplicity, let $p=p_{\lambda}(u)$. Note that $T_{\lambda}(u)$ is single-valued by strong convexity. Then invoking the optimality condition and the subdifferential sum rule yields

$x^{*}=T_{\lambda}(u)\Leftrightarrow 0\in\lambda\partial f(x^{*})+p+\nabla h(x^{*})=\lambda\partial f(x^{*})+p+[\alpha(1+\left\lVert x^{*}\right\rVert^{2})^{-1/2}+\beta]x^{*}.$

Define $t^{*}=[\alpha(1+\left\lVert x^{*}\right\rVert^{2})^{-1/2}+\beta]^{-1}$ and $v=x^{*}/t^{*}$. Then $t^{*}>0$ and the above inclusion together with statement (i) implies that

0\displaystyle 0 λf(x)+p+v=λf(tv)+p+v=λf(v)+v+p\displaystyle\in\lambda\partial f(x^{*})+p+v=\lambda\partial f(t^{*}v)+p+v=\lambda\partial f(v)+v+p
p\displaystyle\Leftrightarrow-p (Id+λf)(v)v=(Id+λf)1(p)=Proxλf(p).\displaystyle\in(\operatorname{Id}+\lambda\partial f)(v)\Leftrightarrow v=(\operatorname{Id}+\lambda\partial f)^{-1}(-p)=\operatorname{Prox}_{\lambda f}(-p).

In turn, tt^{*} satisfies

v[1αt(1+(t)2v2)1/2βt]=v[α(1+(t)2v2)1/2+β]tv=vxt=vv=0.v[1-\alpha t^{*}(1+(t^{*})^{2}\left\lVert v\right\rVert^{2})^{-1/2}-\beta t^{*}]=v-[\alpha(1+(t^{*})^{2}\left\lVert v\right\rVert^{2})^{-1/2}+\beta]t^{*}v=v-\frac{x^{*}}{t^{*}}=v-v=0.

If $v\neq 0$, then clearly $1-\alpha t^{*}(1+(t^{*})^{2}\left\lVert v\right\rVert^{2})^{-1/2}-\beta t^{*}=0$ as claimed; if $v=0$, then $x^{*}=0$ and $t^{*}=1/(\alpha+\beta)$, which is consistent with statement (ii). To see the uniqueness, let $\varphi(t)=\alpha(1+\left\lVert v\right\rVert^{2}t^{2})^{1/2}/\left\lVert v\right\rVert^{2}+\beta t^{2}/2-t$ for $t\in\mathbb{R}$, which is strongly convex. Simple calculus shows that $t^{*}$ is a stationary point of $\varphi$ and is therefore unique. $\quad\hfill\blacksquare$

Remark 6.5

Similarly to Proposition 6.2 and Remark 6.3, one may find tt^{*} efficiently by appealing to root-finding algorithms.

Equipped with Proposition 6.4, we now turn to concrete problems. Recall that the Soft-threshold of xnx\in\mathbb{R}^{n} with parameter λ\lambda is Sλ(x)=Proxλ1(x)S_{\lambda}(x)=\operatorname{Prox}_{\lambda\left\lVert\cdot\right\rVert_{1}}(x), which satisfies

(i{1,,n})(Sλ(x))i=max(|xi|λ,0)sgn(xi),(\forall i\in\{1,\ldots,n\})~{}(S_{\lambda}(x))_{i}=\max(|x_{i}|-\lambda,0)\operatorname{sgn}(x_{i}),

see, e.g., [5, Example 6.8].

Corollary 6.6 (l1l_{1}-penalized problem)

Let f=1f=\left\lVert\cdot\right\rVert_{1}. Then Tλ(u)=tSλ(pλ(u))T_{\lambda}(u)=-t^{*}S_{\lambda}\left(p_{\lambda}(u)\right), where tt^{*} is the unique solution of 1αt(1+t2Sλ(pλ(u))2)1/2βt=01-\alpha t\left(1+t^{2}\left\lVert S_{\lambda}\left(p_{\lambda}(u)\right)\right\rVert^{2}\right)^{-1/2}-\beta t=0.
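The formula of Corollary 6.6 can be checked numerically: compute $v=\operatorname{Prox}_{\lambda\left\lVert\cdot\right\rVert_{1}}(-p_{\lambda}(u))=S_{\lambda}(-p_{\lambda}(u))$, solve the scalar equation for $t^{*}$ by bisection, and verify the first-order optimality condition $0\in\lambda\partial\left\lVert\cdot\right\rVert_{1}(x^{*})+p_{\lambda}(u)+\nabla h(x^{*})$. The sketch below uses arbitrarily chosen sample parameters standing in for $p_{\lambda}(u)$, $\alpha$, $\beta$ and $\lambda$.

```python
import numpy as np

alpha, beta, lam = 0.5, 1.0, 0.3
p = np.array([1.0, -0.1, 0.8, 0.05])          # plays the role of p_lambda(u)

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
v = soft(-p, lam)                              # Prox_{lam*||.||_1}(-p)
s2 = np.dot(v, v)

# Scalar equation from Corollary 6.6: 1 - alpha*t*(1+t^2*||v||^2)^(-1/2) - beta*t = 0;
# the left-hand side is strictly decreasing in t, so bisection applies.
g = lambda t: 1 - alpha * t / np.sqrt(1 + t * t * s2) - beta * t
lo, hi = 0.0, 1.0 / beta + 1.0                 # g(lo) = 1 > 0 > g(hi)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
t_star = 0.5 * (lo + hi)
x_star = t_star * v                            # = -t* S_lam(p), the claimed minimizer

# First-order optimality: lam*sign(x*_i) + p_i + c*x*_i = 0 where x*_i != 0,
# and |p_i| <= lam where x*_i = 0, with c = alpha*(1+||x*||^2)^(-1/2) + beta.
c = alpha / np.sqrt(1 + np.dot(x_star, x_star)) + beta
res = lam * np.sign(x_star) + p + c * x_star
assert np.all(np.abs(res[x_star != 0]) < 1e-8)
assert np.all(np.abs(p[x_star == 0]) <= lam)
```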

Corollary 6.7 (ll_{\infty}-penalized problem)

Let $f=\left\lVert\cdot\right\rVert_{\infty}$. Then $T_{\lambda}(u)=-t^{*}v$, where

v=pλ(u)λProj(pλ(u)/λ;𝔹1),v=p_{\lambda}(u)-\lambda\operatorname{Proj}\left(p_{\lambda}(u)/\lambda;\mathbb{B}_{\left\lVert\cdot\right\rVert_{1}}\right),

and tt^{*} is the unique solution of 1αt(1+t2v2)1/2βt=01-\alpha t(1+t^{2}\left\lVert v\right\rVert^{2})^{-1/2}-\beta t=0.

Remark 6.8

The projection Proj(;𝔹1)\operatorname{Proj}(\cdot;\mathbb{B}_{\left\lVert\cdot\right\rVert_{1}}) admits a closed-form formula; see, e.g., [5, Example 6.33].
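The closed-form projection mentioned in Remark 6.8 is typically computed by the standard sort-based scheme (see, e.g., [5, Example 6.33]); a minimal sketch for the ball $\{z:\left\lVert z\right\rVert_{1}\leq\rho\}$ follows, hedged as one common implementation rather than the formula from [5] verbatim.

```python
import numpy as np

def proj_l1_ball(x, radius=1.0):
    """Euclidean projection onto {z : ||z||_1 <= radius} via sorting."""
    x = np.asarray(x, dtype=float)
    if np.abs(x).sum() <= radius:
        return x.copy()                             # already inside the ball
    u = np.sort(np.abs(x))[::-1]                    # magnitudes, descending
    css = np.cumsum(u)
    j = np.arange(1, len(x) + 1)
    rho = np.max(j[u - (css - radius) / j > 0])     # largest feasible index
    tau = (css[rho - 1] - radius) / rho             # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

assert np.allclose(proj_l1_ball([3.0, 0.0, 0.0]), [1.0, 0.0, 0.0])
assert np.allclose(proj_l1_ball([1.0, 1.0]), [0.5, 0.5])
assert np.allclose(proj_l1_ball([0.2, -0.3]), [0.2, -0.3])  # interior point is fixed
```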

7 Numerical experiments

In the remainder, let Am×nA\in\mathbb{R}^{m\times n}, bmb\in\mathbb{R}^{m}, r\{0}r\in\mathbb{N}\backslash\{0\}, and R>0R>0. Define C={xn:Ax=b}C=\{x\in\mathbb{R}^{n}:Ax=b\} and D={xn:x0r,xR}D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}. Clearly CC is semialgebraic. The set DD can be represented as

D={xn:x0r}{xn:x2R2},D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r\}\cap\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert^{2}\leq R^{2}\},

which is an intersection of semialgebraic sets (see, e.g., [4, Formula 27(d)]) and is thus semialgebraic. Consider

minxD12dist2(x,C),\min_{x\in D}\frac{1}{2}\operatorname{dist}^{2}(x,C), (44)

which corresponds to (2) with f=δDf=\delta_{D} and g=dist2(,C)/2g=\operatorname{dist}^{2}(\cdot,C)/2. In turn, Example 5.4 ensures that BiFRB and its Euclidean variant iFRB converge to a stationary point of (44).
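For the smooth part of (44), recall that $g=\operatorname{dist}^{2}(\cdot,C)/2$ with affine $C=\{x:Ax=b\}$ has gradient $\nabla g(x)=x-\operatorname{Proj}_{C}(x)$, and the projection is available in closed form. A hedged sketch with random data, assuming $A$ has full row rank (which holds almost surely here):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 20
A = rng.standard_normal((m, n))                 # full row rank almost surely
b = A @ rng.standard_normal(n)                  # guarantees C is nonempty

def proj_C(x):
    """Projection onto C = {x : Ax = b}: x - A^T (A A^T)^{-1} (Ax - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def grad_g(x):
    """Gradient of g = dist^2(., C)/2, namely x - Proj_C(x) (1-Lipschitz)."""
    return x - proj_C(x)

x = rng.standard_normal(n)
assert np.allclose(A @ proj_C(x), b)            # Proj_C(x) is feasible
assert np.allclose(grad_g(proj_C(x)), 0.0)      # gradient vanishes on C
```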

We shall benchmark BiFRB and iFRB against the Douglas-Rachford (DR) [18], forward-reflected-backward (FRB) [31, 22] and inertial Tseng's (iTseng) [9] methods with $R\in\{1,1000\}$ and $r=\lceil m/5\rceil$. These splitting methods are known to converge globally to a stationary point of (2) under the same assumptions on $F$; see [18, Theorems 1–2, Remark 4, Corollary 1], [31, Theorem 3.9] and [9, Theorem 3.1], respectively.

BiFRB, iFRB and FRB are implemented with stepsize rules given by Proposition 4.2, Corollary 4.5 and [31, Theorem 3.9] respectively, and terminated when

max{xk+1xk,xkxk1}max{1,xk,xk1}<1010.\frac{\max\big{\{}\left\lVert x_{k+1}-x_{k}\right\rVert,\left\lVert x_{k}-x_{k-1}\right\rVert\big{\}}}{\max\big{\{}1,\left\lVert x_{k}\right\rVert,\left\lVert x_{k-1}\right\rVert\big{\}}}<10^{-10}. (45)

Moreover, BiFRB uses the kernel $h(x)=0.1\cdot\sqrt{1+\left\lVert x\right\rVert^{2}}+2.51\cdot\left\lVert x\right\rVert^{2}/2$. We apply DR with the stepsize and termination condition given by [18, Section 5] with the same tolerance $10^{-10}$, while iTseng employs a stepsize given by [9, Lemma 3.3] and termination condition (45). As for inertial parameters, we use $\alpha_{1}=0.9$ for BiFRB and $\alpha_{2}=0.49$ for both iFRB and iTseng. Finally, each algorithm is initialized at the origin and is equipped with the stepsize heuristic described in [18, Remark 4].

Simulation results are presented in Table 1 below. Our problem data are generated by creating random matrices $A\in\mathbb{R}^{m\times n}$ with entries following the standard Gaussian distribution. For each problem size $(m,n)$, we randomly generate 50 instances and report the ceiling of the average number of iterations (iter) and the minimal objective value at termination ($\text{fval}_{\text{min}}$). As problem (44) has optimal value zero, $\text{fval}_{\text{min}}$ indicates whether an algorithm actually hits a global minimizer rather than merely converging to a stationary point. We observed the following:

• BiFRB is the most advantageous method on “bad” problems (small $m$), but it is not robust.

• DR tends to have the smallest function value at termination; however, it can converge very slowly.

• In most cases, iFRB has the smallest number of iterations with fair function values at termination.

8 Conclusions

We proved convergence of BiFRB (Algorithm 3.2) by using the generalized concave KL property and a general framework for the decreasing property. The resulting stepsize condition is independent of the inertial parameter, which is less restrictive compared to other algorithms in the same setting. In turn, a question of Malitsky and Tam [22, Section 7] is answered. We believe that our approach could be useful for other splitting algorithms in designing more implementable stepsize rules in the absence of convexity. We also conducted convergence rate analysis on both the function value and the actual sequence. Formulae for our Bregman subproblems are provided as well, supplementing not only BiFRB but also [9, 10]. To end this paper, we present several directions for future work:

• It is tempting to see whether the global Lipschitz assumption on the gradients $\nabla g$ and $\nabla h$ in Assumption 3.1 can be relaxed in a fashion similar to [8].

• Another interesting question is to explore why BiFRB demonstrates inconsistent performance as the problem size varies; see Section 7.

• Now that the range of admissible inertial parameters is extended (recall Proposition 4.2), it would be interesting to see whether there exists an optimal accelerating scheme.

Table 1: Comparing BiFRB and iFRB to FRB, DR and iTseng.
Size (R=1)(R=1) BiFRB iFRB FRB DR iTseng
m n iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}}
100 4000 50 0.03251 93 0.03251 631 0.03929 860 0.02819 1367 0.03149
100 5000 52 0.02413 115 0.02413 733 0.02875 684 0.02192 1632 0.02268
100 6000 57 0.01369 145 0.01369 911 0.01794 683 0.01253 1970 0.0133
200 4000 870 0.2923 39 0.2745 190 0.2769 5466 0.2711 448 0.2746
200 5000 1198 0.2367 41 0.2095 231 0.2157 5864 0.2073 546 0.2097
200 6000 81 0.2306 43 0.2259 257 0.2292 4199 0.2236 620 0.2264
300 4000 885 1.051 32 1.036 105 1.038 5497 1.036 253 1.048
300 5000 925 0.7481 34 0.7044 124 0.7071 5673 0.7032 302 0.7051
300 6000 1407 0.583 36 0.5546 144 0.555 6173 0.5541 352 0.5566
Size (R=1000)(R=1000) BiFRB iFRB FRB DR iTseng
m n iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}}
100 4000 1873 0.006609 1210 0.00365 7376 0.00816 2194 4e-21 10001 0.01671
100 5000 574 0.00278 1427 0.002244 8675 1.098e-05 2919 1.626e-20 10001 0.0158
100 6000 790 0.003827 1696 0.001261 9394 0.0002952 2344 3.438e-20 10001 0.01066
200 4000 5842 2.992 750 2.442e-18 8223 0.007688 1038 1.106e-20 9797 0.1094
200 5000 5938 0.6547 859 2.876e-05 7257 0.02581 1242 2.082e-20 9997 0.07125
200 6000 5666 0.917 1086 0.0001211 7312 0.006726 1487 9.129e-21 10001 0.07524
300 4000 7164 3.322 619 2.196e-18 4620 1.462e-05 753 9.795e-21 7722 0.2417
300 5000 5267 2.299 693 2.175e-18 6352 0.02936 880 5.446e-20 9194 0.1702
300 6000 5724 5.06 750 4.71e-18 8644 0.1109 1027 2.175e-20 9823 0.1423

Acknowledgments

XW and ZW were partially supported by NSERC Discovery Grants.

References

  • [1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research, 35 (2010), pp. 438–457.
  • [2] H. Attouch, J. Bolte, and B. F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods, Mathematical Programming, 137 (2013), pp. 91–129.
  • [3] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, Cham, 2017.
  • [4] H. H. Bauschke, D. R. Luke, H. M. Phan, and X. Wang, Restricted normal cones and sparsity optimization with affine constraints, Foundations of Computational Mathematics, 14 (2014), pp. 63–83.
  • [5] A. Beck, First-order Methods in Optimization, SIAM, 2017.
  • [6] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota, Clarke subgradients of stratifiable functions, SIAM Journal on Optimization, 18 (2007), pp. 556–572.
  • [7] J. Bolte, A. Daniilidis, O. Ley, and L. Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society, 362 (2010), pp. 3319–3363.
  • [8] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd, First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems, SIAM Journal on Optimization, 28 (2018), pp. 2131–2151.
  • [9] R. I. Boţ and E. R. Csetnek, An inertial Tseng’s type proximal algorithm for nonsmooth and nonconvex optimization problems, Journal of Optimization Theory and Applications, 171 (2016), pp. 600–616.
  • [10] R. I. Boţ, E. R. Csetnek, and S. C. László, An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions, EURO Journal on Computational Optimization, 4 (2016), pp. 3–25.
  • [11] R. I. Boţ and D.-K. Nguyen, The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates, Mathematics of Operations Research, 45 (2020), pp. 682–712.
  • [12] R. Burden, D. Faires, and A. Burden, Numerical Analysis, Cengage Learning, 2014.
  • [13] C. Chen, T. K. Pong, L. Tan, and L. Zeng, A difference-of-convex approach for split feasibility with applications to matrix factorizations and outlier detection, Journal of Global Optimization, 78 (2020), pp. 107–136.
  • [14] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation, 4 (2005), pp. 1168–1200.
  • [15] J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems in two and three space variables, Transactions of the American Mathematical Society, 82 (1956), pp. 421–439.
  • [16] K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l’institut Fourier, 48 (1998), pp. 769–783.
  • [17] G. Li, T. Liu, and T. K. Pong, Peaceman–Rachford splitting for a class of nonconvex optimization problems, Computational Optimization and Applications, 68 (2017), pp. 407–436.
  • [18] G. Li and T. K. Pong, Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems, Mathematical Programming, 159 (2016), pp. 371–401.
  • [19] P.-L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, 16 (1979), pp. 964–979.
  • [20] S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, Les équations aux dérivées partielles, 117 (1963), pp. 87–89.
  • [21] R. Luss and M. Teboulle, Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint, SIAM Review, 55 (2013), pp. 65–98.
  • [22] Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity, SIAM Journal on Optimization, 30 (2020), pp. 1451–1472.
  • [23] B. S. Mordukhovich, Variational Analysis and Generalized Differentiation I: Basic Theory, Springer-Verlag, Berlin, 2006.
  • [24] Y. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k21/k^{2}), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543–547.
  • [25] P. Ochs, Local convergence of the heavy-ball method and iPiano for non-convex optimization, Journal of Optimization Theory and Applications, 177 (2018), pp. 153–180.
  • [26] P. Ochs, Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano, SIAM Journal on Optimization, 29 (2019), pp. 541–570.
  • [27] P. Ochs, Y. Chen, T. Brox, and T. Pock, iPiano: Inertial proximal algorithm for nonconvex optimization, SIAM Journal on Imaging Sciences, 7 (2014), pp. 1388–1419.
  • [28] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer-Verlag, Berlin, 1998.
  • [29] P. Tseng, A modified forward-backward splitting method for maximal monotone mappings, SIAM Journal on Control and Optimization, 38 (2000), pp. 431–446.
  • [30] X. Wang and Z. Wang, The exact modulus of the generalized Kurdyka-Łojasiewicz property, Mathematics of Operations Research (2022), https://doi.org/10.1287/moor.2021.1227.
  • [31] X. Wang and Z. Wang, Malitsky-Tam forward-reflected-backward splitting method for nonconvex minimization problems, Computational Optimization and Applications, 82 (2022), pp. 441–463.