
A Bregman inertial forward-reflected-backward method for nonconvex minimization

Xianfu Wang and Ziyuan Wang. Department of Mathematics, Irving K. Barber Faculty of Science, University of British Columbia, Kelowna, B.C. V1V 1V7, Canada. E-mail: [email protected].
Abstract

We propose a Bregman inertial forward-reflected-backward (BiFRB) method for nonconvex composite problems. Our analysis relies on a novel approach that imposes general conditions on implicit merit function parameters, which yields a stepsize condition that is independent of inertial parameters. In turn, a question of Malitsky and Tam regarding whether FRB can be equipped with a Nesterov-type acceleration is resolved. Assuming the generalized concave Kurdyka-Łojasiewicz property of a quadratic regularization of the objective, we obtain sequential convergence of BiFRB, as well as convergence rates on both the function value and actual sequence. We also present formulae for the Bregman subproblem, supplementing not only BiFRB but also the work of Boţ-Csetnek-László and Boţ-Csetnek. Numerical simulations are conducted to evaluate the performance of our proposed algorithm.

2010 Mathematics Subject Classification: Primary 90C26, 49J52; Secondary 26D10.

Keywords: Generalized concave Kurdyka-Łojasiewicz property, Bregman proximal mapping, forward-reflected-backward splitting, implicit merit function, nonconvex optimization, inertial effect.

1 Introduction

Consider the inclusion problem of maximally monotone operators:

find xn such that 0(A+B)(x),\text{find }x\in\mathbb{R}^{n}\text{ such that }0\in(A+B)(x), (1)

where A:n2nA:\mathbb{R}^{n}\to 2^{\mathbb{R}^{n}} and B:nnB:\mathbb{R}^{n}\to\mathbb{R}^{n} are (maximally) monotone operators with BB being Lipschitz continuous. The behavior of splitting methods for solving (1) is well understood; see, e.g., the recent forward-reflected-backward method by Malitsky and Tam [22], and [3, Chapters 26, 28], which covers classical algorithms such as the forward-backward method [14] with BB being cocoercive, the Douglas-Rachford method [19, 15], and Tseng’s method [29].

The goal of this paper is to propose a Bregman inertial forward-reflected-backward method (Algorithm 3.2) for solving the nonconvex composite problem

minxnF(x)=f(x)+g(x),\min_{x\in\mathbb{R}^{n}}F(x)=f(x)+g(x), (2)

where f:n¯=(,]f:\mathbb{R}^{n}\to\overline{\mathbb{R}}=(-\infty,\infty] is proper lower semicontinuous (lsc), and g:ng:\mathbb{R}^{n}\to\mathbb{R} has a Lipschitz continuous gradient, which corresponds to (1) with A=fA=\partial f and B=gB=\nabla g when ff and gg are both convex. Informally, the proposed algorithm computes

xk+1(h+λkf)1(xk2λkg(xk)+λkg(xk1)+αk(xkxk1)),x_{k+1}\in(\nabla h+\lambda_{k}\partial f)^{-1}(x_{k}-2\lambda_{k}\nabla g(x_{k})+\lambda_{k}\nabla g(x_{k-1})+\alpha_{k}(x_{k}-x_{k-1})), (3)

for some stepsize λk>0\lambda_{k}>0, inertial parameter αk[0,1)\alpha_{k}\in[0,1) and kernel h:nh:\mathbb{R}^{n}\to\mathbb{R}. In particular, (3) reduces to the inertial forward-reflected-backward method studied in [22, Corollary 4.4] when h=2/2h=\left\lVert\cdot\right\rVert^{2}/2, f,gf,g are convex and the inertial parameter is fixed. Before stating our main contribution, let us note that several splitting algorithms have been extended to the nonconvex setting for solving (2); see, e.g., the Douglas-Rachford method [18], forward-backward method [2, 8], inertial forward-backward methods [27, 10], the Peaceman-Rachford method [17], inertial Tseng’s method [9], and forward-reflected-backward method [31].

Let us summarize the main contributions of this paper. Our convergence analysis relies on the generalized concave Kurdyka-Łojasiewicz property (Definition 2.2) and a novel framework for analyzing a sequence of implicit merit functions (Definition 3.10) through a general condition (Assumption 3.12) on merit function parameters. In turn, we derive global sequential convergence to a stationary point of FF with an explicit stepsize condition that is independent of inertial parameters, whereas stepsize assumptions in the current literature are tangled with inertial parameters or even implicit; see Remark 4.3(i). Notably, such an independence result resolves a question of Malitsky and Tam [22, Section 7] regarding whether FRB can be adapted to incorporate a Nesterov-type acceleration; see Remark 4.3(ii). Convergence rate analysis on both the function value and the actual sequence is also carried out. Moreover, we provide several formulae for the associated Bregman subproblem, which to the best of our knowledge are the first of their kind among existing results devoted to Bregman extensions of splitting algorithms with the same kernel; see, e.g., [9, 10].

The paper is organized as follows. We begin with notation and background in Section 2. The proposed algorithm and its abstract convergence are studied in Section 3. Then, in Section 4, we present suitable stepsize rules under which the abstract convergence holds. Function value and sequential convergence rate results are discussed in Section 5, followed by Bregman subproblem formulae in Section 6. Numerical simulations are conducted in Section 7. We end this paper in Section 8 with remarks on our contribution and possible directions for future research.

2 Notation and preliminaries

Throughout this paper, n\mathbb{R}^{n} is the standard Euclidean space equipped with inner product x,y=xTy\langle x,y\rangle=x^{T}y and the Euclidean norm x=x,x\left\lVert x\right\rVert=\sqrt{\langle x,x\rangle} for x,ynx,y\in\mathbb{R}^{n}. Let ={0,1,2,}\mathbb{N}=\{0,1,2,\ldots\} and ={1}\mathbb{N}^{*}=\{-1\}\cup\mathbb{N}. The open ball centered at x¯\bar{x} with radius rr is denoted by 𝔹(x¯;r)\mathbb{B}(\bar{x};r). The distance function of a subset KnK\subseteq\mathbb{R}^{n} is dist(,K):n¯=(,]\operatorname{dist}(\cdot,K):\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}}=(-\infty,\infty],

xdist(x,K)=inf{xy:yK},x\mapsto\operatorname{dist}(x,K)=\inf\{\left\lVert x-y\right\rVert:y\in K\},

where dist(x,K)\operatorname{dist}(x,K)\equiv\infty if K=K=\emptyset. For f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} and r1,r2[,]r_{1},r_{2}\in[-\infty,\infty], we set [r1<f<r2]={xn:r1<f(x)<r2}[r_{1}<f<r_{2}]=\{x\in\mathbb{R}^{n}:r_{1}<f(x)<r_{2}\}. We say a function f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is coercive if limxf(x)=\lim_{\left\lVert x\right\rVert\to\infty}f(x)=\infty. A proper, lsc function f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is prox-bounded if there exists λ>0\lambda>0 such that f+2/(2λ)f+\left\lVert\cdot\right\rVert^{2}/(2\lambda) is bounded below; see, e.g., [28, Exercise 1.24]. The supremum of the set of all such λ\lambda is the threshold λf\lambda_{f} of prox-boundedness for ff. The indicator function of a set KnK\subseteq\mathbb{R}^{n} is δK(x)=0\delta_{K}(x)=0 if xKx\in K and δK(x)=\delta_{K}(x)=\infty otherwise.
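For intuition, consider the nonconvex function f(x) = −‖x‖² (our illustrative example, not from the text): f + ‖·‖²/(2λ) = (1/(2λ) − 1)‖x‖² is bounded below precisely when λ ≤ 1/2, so the threshold of prox-boundedness is λ_f = 1/2. A minimal numerical sketch:

```python
import numpy as np

def quad_reg(x, lam):
    """f(x) + ||x||^2 / (2*lam) for the nonconvex test function f(x) = -||x||^2."""
    return -np.dot(x, x) + np.dot(x, x) / (2.0 * lam)

xs = [np.array([t]) for t in np.linspace(0.0, 100.0, 11)]
# lam = 0.4 < 1/2: coefficient 1/(2*0.4) - 1 = 0.25 > 0, so bounded below by 0.
vals_ok = [quad_reg(x, 0.4) for x in xs]
# lam = 1.0 > 1/2: coefficient 1/2 - 1 = -0.5 < 0, so unbounded below.
vals_bad = [quad_reg(x, 1.0) for x in xs]
assert min(vals_ok) >= 0.0 and vals_bad[-1] < -1000.0
```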

We will use frequently the following concepts from variational analysis; see, e.g., [28, 23].

Definition 2.1

Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be a proper function and KnK\subseteq\mathbb{R}^{n} a nonempty closed set. We say that

(i) vnv\in\mathbb{R}^{n} is a Fréchet subgradient of ff at x¯domf\bar{x}\in\operatorname{dom}f, denoted by v^f(x¯)v\in\hat{\partial}f(\bar{x}), if

f(x)f(x¯)+v,xx¯+o(xx¯).f(x)\geq f(\bar{x})+\langle v,x-\bar{x}\rangle+o(\left\lVert x-\bar{x}\right\rVert). (4)

(ii) vnv\in\mathbb{R}^{n} is a limiting subgradient of ff at x¯domf\bar{x}\in\operatorname{dom}f, denoted by vf(x¯)v\in\partial f(\bar{x}), if

v{vn:xk𝑓x¯,vk^f(xk),vkv},v\in\{v\in\mathbb{R}^{n}:\exists x_{k}\xrightarrow[]{f}\bar{x},\exists v_{k}\in\hat{\partial}f(x_{k}),v_{k}\rightarrow v\}, (5)

where xk𝑓x¯xkx¯ and f(xk)f(x¯)x_{k}\xrightarrow[]{f}\bar{x}\Leftrightarrow x_{k}\rightarrow\bar{x}\text{ and }f(x_{k})\rightarrow f(\bar{x}). Moreover, we set domf={xn:f(x)}\operatorname{dom}\partial f=\{x\in\mathbb{R}^{n}:\partial f(x)\neq\emptyset\}. We say that x¯domf\bar{x}\in\operatorname{dom}\partial f is a stationary point, if 0f(x¯)0\in\partial f(\bar{x}).

(iii) The Fréchet and limiting normal cones to KK at x¯\bar{x} are N^K(x¯)=^δK(x¯)\hat{N}_{K}(\bar{x})=\hat{\partial}\delta_{K}(\bar{x}) and NK(x¯)=δK(x¯)N_{K}(\bar{x})=\partial\delta_{K}(\bar{x}), respectively, if x¯K\bar{x}\in K; otherwise N^K(x¯)=NK(x¯)=\hat{N}_{K}(\bar{x})=N_{K}(\bar{x})=\emptyset.

If KnK\subseteq\mathbb{R}^{n} is a nonempty convex set, then NK(x¯)=N^K(x¯)={vn:(xK)v,xx¯0}N_{K}(\bar{x})=\hat{N}_{K}(\bar{x})=\{v\in\mathbb{R}^{n}:(\forall x\in K)~{}\langle v,x-\bar{x}\rangle\leq 0\}; see, e.g., [28].

The proximal mapping of a proper and lsc f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} with parameter λ>0\lambda>0 is

(xn)Proxλf(x)=argminyn{f(y)+12λxy2},(\forall x\in\mathbb{R}^{n})\ \operatorname{Prox}_{\lambda f}(x)=\mathop{\rm argmin}\limits_{y\in\mathbb{R}^{n}}\Big{\{}f(y)+\frac{1}{2\lambda}\left\lVert x-y\right\rVert^{2}\Big{\}},

which is the resolvent JλAJ_{\lambda A} with A=fA=\partial f when ff is convex; see, e.g., [3]. The Bregman distance induced by a differentiable h:nh:\mathbb{R}^{n}\to\mathbb{R} is

Dh:n×n:(x,y)h(x)h(y)xy,h(y).D_{h}:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}:(x,y)\mapsto h(x)-h(y)-\langle x-y,\nabla h(y)\rangle.
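Two small illustrations of these definitions (ours, not from the text): the proximal mapping of the indicator of a box is the projection onto it, and with the Euclidean kernel h = ‖·‖²/2 the Bregman distance reduces to half the squared distance:

```python
import numpy as np

def prox_box(x, lo, hi):
    """Prox of the indicator of the box [lo, hi]^n: the projection onto it
    (the parameter lambda plays no role for indicator functions)."""
    return np.clip(x, lo, hi)

def bregman(h, grad_h, x, y):
    """Bregman distance D_h(x, y) = h(x) - h(y) - <x - y, grad h(y)>."""
    return h(x) - h(y) - np.dot(x - y, grad_h(y))

assert np.allclose(prox_box(np.array([-2.0, 0.3, 5.0]), 0.0, 1.0), [0.0, 0.3, 1.0])

# Euclidean kernel: D_h(x, y) = ||x - y||^2 / 2.
h = lambda z: 0.5 * np.dot(z, z)
grad_h = lambda z: z
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(bregman(h, grad_h, x, y), 0.5 * np.dot(x - y, x - y))
```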

For η(0,]\eta\in(0,\infty], denote by Φη\Phi_{\eta} the class of functions φ:[0,η)+\varphi:[0,\eta)\rightarrow\mathbb{R}_{+} satisfying the following conditions: (i) φ(t)\varphi(t) is right-continuous at t=0t=0 with φ(0)=0\varphi(0)=0; (ii) φ\varphi is strictly increasing on [0,η)[0,\eta). The following concept will be the key in our convergence analysis.

Definition 2.2

[30] Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be proper and lsc. Let x¯domf\bar{x}\in\operatorname{dom}\partial f and μ\mu\in\mathbb{R}, and let VdomfV\subseteq\operatorname{dom}\partial f be a nonempty subset.

(i) We say that ff has the pointwise generalized concave Kurdyka-Łojasiewicz (KL) property at x¯domf\bar{x}\in\operatorname{dom}\partial f, if there exist neighborhood Ux¯U\ni\bar{x}, η(0,]\eta\in(0,\infty] and concave φΦη\varphi\in\Phi_{\eta}, such that for all xU[0<ff(x¯)<η]x\in U\cap[0<f-f(\bar{x})<\eta],

φ(f(x)f(x¯))dist(0,f(x))1,\varphi^{\prime}_{-}\big{(}f(x)-f(\bar{x})\big{)}\cdot\operatorname{dist}\big{(}0,\partial f(x)\big{)}\geq 1, (6)

where φ\varphi_{-}^{\prime} denotes the left derivative. Moreover, ff is a generalized concave KL function if it has the generalized concave KL property at every xdomfx\in\operatorname{dom}\partial f.

(ii) Suppose that f(x)=μf(x)=\mu on VV. We say ff has the setwise (we shall omit the adjectives “pointwise” and “setwise” whenever there is no ambiguity) generalized concave Kurdyka-Łojasiewicz property on VV, if there exist UVU\supset V, η(0,]\eta\in(0,\infty] and concave φΦη\varphi\in\Phi_{\eta} such that for every xU[0<fμ<η]x\in U\cap[0<f-\mu<\eta],

φ(f(x)μ)dist(0,f(x))1.\varphi^{\prime}_{-}\big{(}f(x)-\mu\big{)}\cdot\operatorname{dist}\big{(}0,\partial f(x)\big{)}\geq 1. (7)

Generalized concave KL functions are ubiquitous. The class of semialgebraic functions is one particularly rich resource. We refer the interested reader to [7, 1, 6] for more examples; see also the fundamental work of Łojasiewicz [20] and Kurdyka [16].

Definition 2.3

(i) A set EnE\subseteq\mathbb{R}^{n} is called semialgebraic if there exist finitely many polynomials gij,hij:ng_{ij},h_{ij}:\mathbb{R}^{n}\rightarrow\mathbb{R} such that E=j=1pi=1q{xn:gij(x)=0 and hij(x)<0}.E=\bigcup_{j=1}^{p}\bigcap_{i=1}^{q}\{x\in\mathbb{R}^{n}:g_{ij}(x)=0\text{ and }h_{ij}(x)<0\}.

(ii) A function f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} is called semialgebraic if its graph gphf={(x,y)n+1:f(x)=y}\operatorname{gph}f=\{(x,y)\in\mathbb{R}^{n+1}:f(x)=y\} is semialgebraic.

Fact 2.4

[6, Corollary 16] Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be a proper and lsc function and let x¯domf\bar{x}\in\operatorname{dom}\partial f. If ff is semialgebraic, then it has the concave KL property at x¯\bar{x} with φ(t)=ct1θ\varphi(t)=c\cdot t^{1-\theta} for some c>0c>0 and θ(0,1)\theta\in(0,1).
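To make Fact 2.4 concrete (our example): for the semialgebraic function f(x) = x² at x̄ = 0, the choice φ(t) = 2√t (i.e. θ = 1/2, c = 2) gives φ′(f(x) − f(x̄)) · dist(0, ∂f(x)) = (1/|x|) · |2x| = 2 ≥ 1 for all x ≠ 0. A numerical check:

```python
import numpy as np

# KL inequality phi'(f(x) - f(0)) * dist(0, grad f(x)) >= 1
# for f(x) = x^2 with phi(t) = 2*sqrt(t), so phi'(t) = 1/sqrt(t).
f = lambda x: x**2
grad_f = lambda x: 2.0 * x
phi_prime = lambda t: 1.0 / np.sqrt(t)

for x in np.linspace(-1.0, 1.0, 21):
    if x != 0.0:
        # the product equals 2 (up to rounding), hence is >= 1
        assert phi_prime(f(x) - f(0.0)) * abs(grad_f(x)) >= 1.0
```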

3 The BiFRB method and its abstract convergence

3.1 The BiFRB method

In the remainder of this paper, suppose that the following hold:

Assumption 3.1

(i) f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is proper, lsc and prox-bounded with threshold λf>0\lambda_{f}>0.

(ii) g:ng:\mathbb{R}^{n}\to\mathbb{R} has Lipschitz continuous gradient with constant LgL_{\nabla g}.

(iii) h:nh:\mathbb{R}^{n}\to\mathbb{R} is σ\sigma-strongly convex and has Lipschitz continuous gradient with constant LhL_{\nabla h}.

(iv) The objective F=f+gF=f+g is bounded below.

We now propose the Bregman inertial forward-reflected-backward (BiFRB) algorithm.

Algorithm 3.2 (BiFRB)

1. Initialization: Pick x1,x0nx_{-1},x_{0}\in\mathbb{R}^{n}. Let 0<λ¯λ¯λf0<\underline{\lambda}\leq\overline{\lambda}\leq\lambda_{f}. Choose λ¯λ1λ¯\underline{\lambda}\leq\lambda_{-1}\leq\overline{\lambda}.

2. For kk\in\mathbb{N}, choose λ¯λkλ¯\underline{\lambda}\leq\lambda_{k}\leq\overline{\lambda} and 0αk<10\leq\alpha_{k}<1. Compute

yk=xk+λk1(g(xk1)g(xk)),\displaystyle y_{k}=x_{k}+\lambda_{k-1}\big{(}\nabla g(x_{k-1})-\nabla g(x_{k})\big{)}, (8)
xk+1argminxn{f(x)+xyk,g(xk)+αkλk(xk1xk)+1λkDh(x,yk)}.\displaystyle x_{k+1}\in\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-y_{k},\nabla g(x_{k})+\frac{\alpha_{k}}{\lambda_{k}}(x_{k-1}-x_{k})\rangle+\frac{1}{\lambda_{k}}D_{h}(x,y_{k})\bigg{\}}. (9)

The Bregman distance DhD_{h} is bounded above and below by multiples of the squared Euclidean distance.

Lemma 3.3

[3, Exercise 17.5, Theorem 18.15] Let h:nh:\mathbb{R}^{n}\to\mathbb{R} be σ\sigma-strongly convex with Lipschitz continuous gradient with constant LhL_{\nabla h}. Then

(x,yn)σ2xy2Dh(x,y)Lh2xy2.(\forall x,y\in\mathbb{R}^{n})~{}\frac{\sigma}{2}\left\lVert x-y\right\rVert^{2}\leq D_{h}(x,y)\leq\frac{L_{\nabla h}}{2}\left\lVert x-y\right\rVert^{2}.

Assumption 3.1(iii) has been used in the literature to obtain non-Euclidean extension of splitting algorithms; see, e.g., [9, 10]. Recall that the Fenchel conjugate of f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} is f=supyn{,yf(y)}f^{*}=\sup_{y\in\mathbb{R}^{n}}\{\langle\cdot,y\rangle-f(y)\} and the Moreau envelope of f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} with parameter λ>0\lambda>0 is

(xn)Mfλ(x)=infyn{f(y)+12λxy2}.(\forall x\in\mathbb{R}^{n})~{}M_{f}^{\lambda}(x)=\inf_{y\in\mathbb{R}^{n}}\bigg{\{}f(y)+\frac{1}{2\lambda}\left\lVert x-y\right\rVert^{2}\bigg{\}}.

Below is a systematic way to construct the kernel hh via the Fenchel conjugate and Moreau envelope.

Proposition 3.4

Let L,σ>0L,\sigma>0 and let p:n¯p:\mathbb{R}^{n}\to\overline{\mathbb{R}} be proper, lsc and convex. Define q=2/2q=\left\lVert\cdot\right\rVert^{2}/2. Then the following hold:

(i) [3, Corollary 18.18] pp has LL-Lipschitz gradient if and only if pp is the Moreau envelope with parameter λ=1/L\lambda=1/L of a proper lsc and convex function (pλq)(p^{*}-\lambda q)^{*}. Therefore, in particular, if pp^{*} is 1/L1/L-strongly convex then pp has LL-Lipschitz gradient.

(ii) If pp has LL-Lipschitz gradient, then h=p+σqh=p+\sigma q is σ\sigma-strongly convex with (L+σ)(L+\sigma)-Lipschitz gradient.

Example 3.5

Let h(x)=1+x2+x2/2h(x)=\sqrt{1+\left\lVert x\right\rVert^{2}}+\left\lVert x\right\rVert^{2}/2. Then hh satisfies Assumption 3.1(iii) with σ=1\sigma=1 and Lh=2L_{\nabla h}=2.

Proof. Define (xn)p(x)=1+x2(\forall x\in\mathbb{R}^{n})~{}p(x)=\sqrt{1+\left\lVert x\right\rVert^{2}}. Then pp^{*} is 11-strongly convex; see, e.g., [5, Section 4.4, Example 5.29]. Applying Proposition 3.4 completes the proof.\quad\hfill\blacksquare
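A numerical sanity check (ours) of Lemma 3.3 for this kernel: with σ = 1 and L_∇h = 2, one should observe ‖x − y‖²/2 ≤ D_h(x, y) ≤ ‖x − y‖² at random points:

```python
import numpy as np

# Kernel of Example 3.5 and its gradient.
h = lambda z: np.sqrt(1.0 + np.dot(z, z)) + 0.5 * np.dot(z, z)
grad_h = lambda z: z / np.sqrt(1.0 + np.dot(z, z)) + z

def D_h(x, y):
    """Bregman distance induced by the kernel of Example 3.5."""
    return h(x) - h(y) - np.dot(x - y, grad_h(y))

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    d2 = np.dot(x - y, x - y)
    # sigma = 1, L_h = 2: (1/2)||x-y||^2 <= D_h(x,y) <= ||x-y||^2
    assert 0.5 * d2 - 1e-12 <= D_h(x, y) <= d2 + 1e-12
```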

We also assume the following throughout the paper, under which the proposed algorithm is well-defined.

Assumption 3.6

Suppose that one of the following holds:

(i) The kernel hh has strong convexity modulus σ1\sigma\geq 1.

(ii) Function ff has prox-threshold λf=\lambda_{f}=\infty.

Remark 3.7

The threshold λf=\lambda_{f}=\infty when ff is convex or bounded below.

Proposition 3.8

Let u,vnu,v\in\mathbb{R}^{n} and suppose that Assumption 3.6 holds. Then for 0<λ<λf0<\lambda<\lambda_{f}, the function xf(x)+xu,v+Dh(x,u)/λx\mapsto f(x)+\langle x-u,v\rangle+D_{h}(x,u)/\lambda is coercive. Consequently, the set

argminxn{f(x)+xu,v+1λDh(x,u)}.\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-u,v\rangle+\frac{1}{\lambda}D_{h}(x,u)\bigg{\}}\neq\emptyset.

Proof. We claim that xf(x)+x,ωx\mapsto f(x)+\langle x,\omega\rangle is prox-bounded with threshold λf\lambda_{f} for arbitrary ωn\omega\in\mathbb{R}^{n}. Indeed,

lim infxf(x)+x,ωx2=lim infxf(x)x2+limxx,ωx2=lim infxf(x)x2.\liminf_{\left\lVert x\right\rVert\to\infty}\frac{f(x)+\langle x,\omega\rangle}{\left\lVert x\right\rVert^{2}}=\liminf_{\left\lVert x\right\rVert\to\infty}\frac{f(x)}{\left\lVert x\right\rVert^{2}}+\lim_{\left\lVert x\right\rVert\to\infty}\frac{\langle x,\omega\rangle}{\left\lVert x\right\rVert^{2}}=\liminf_{\left\lVert x\right\rVert\to\infty}\frac{f(x)}{\left\lVert x\right\rVert^{2}}.

Thus invoking [28, Exercise 1.24] proves the claim.

By our claim, xf(x)+x,vσu/λx\mapsto f(x)+\langle x,v-\sigma u/\lambda\rangle is prox-bounded with threshold λf\lambda_{f}. Let λ\lambda^{\prime} be such that λ<λ<λf\lambda<\lambda^{\prime}<\lambda_{f}. Then λ/σ<λf\lambda^{\prime}/\sigma<\lambda_{f} under Assumption 3.6. Therefore

f(x)+xu,v+1λDh(x,u)f(x)+xu,v+σ2λxu2σ2λx2+σ2λx2\displaystyle f(x)+\langle x-u,v\rangle+\frac{1}{\lambda}D_{h}(x,u)\geq f(x)+\langle x-u,v\rangle+\frac{\sigma}{2\lambda}\left\lVert x-u\right\rVert^{2}-\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}+\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}
σ2(1λ1λ)x2+f(x)+x,vσλu+σ2λx2u,v\displaystyle\geq\frac{\sigma}{2}\left(\frac{1}{\lambda}-\frac{1}{\lambda^{\prime}}\right)\left\lVert x\right\rVert^{2}+f(x)+\langle x,v-\frac{\sigma}{\lambda}u\rangle+\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}-\langle u,v\rangle
σ2(1λ1λ)x2+infxn{f(x)+x,vσλu+σ2λx2}u,v,\displaystyle\geq\frac{\sigma}{2}\left(\frac{1}{\lambda}-\frac{1}{\lambda^{\prime}}\right)\left\lVert x\right\rVert^{2}+\inf_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x,v-\frac{\sigma}{\lambda}u\rangle+\frac{\sigma}{2\lambda^{\prime}}\left\lVert x\right\rVert^{2}\bigg{\}}-\langle u,v\rangle\to\infty,

as x\left\lVert x\right\rVert\to\infty, where the infimum in the last inequality is finite thanks to prox-boundedness of xf(x)+x,vσu/λx\mapsto f(x)+\langle x,v-\sigma u/\lambda\rangle, which completes the proof.\quad\hfill\blacksquare

Setting the kernel h=2/2h=\left\lVert\cdot\right\rVert^{2}/2 gives Algorithm 3.9, which is an inertial forward-reflected-backward method (iFRB).

Algorithm 3.9 (iFRB)

1. Initialization: Pick x1,x0nx_{-1},x_{0}\in\mathbb{R}^{n}. Let 0<λ¯λ¯0<\underline{\lambda}\leq\overline{\lambda}. Choose λ¯λ1λ¯\underline{\lambda}\leq\lambda_{-1}\leq\overline{\lambda}.

2. For kk\in\mathbb{N}, choose λ¯λkλ¯\underline{\lambda}\leq\lambda_{k}\leq\overline{\lambda} and 0αk<10\leq\alpha_{k}<1. Compute

yk=xk+λk1(g(xk1)g(xk)),\displaystyle y_{k}=x_{k}+\lambda_{k-1}\big{(}\nabla g(x_{k-1})-\nabla g(x_{k})\big{)},
xk+1argminxn{f(x)+xyk,g(xk)+αkλk(xk1xk)+12λkxyk2}.\displaystyle x_{k+1}\in\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-y_{k},\nabla g(x_{k})+\frac{\alpha_{k}}{\lambda_{k}}(x_{k-1}-x_{k})\rangle+\frac{1}{2\lambda_{k}}\left\lVert x-y_{k}\right\rVert^{2}\bigg{\}}.

Convergence of iFRB with fixed stepsize and inertial parameter was studied in [22] in the presence of convexity. However, its behavior in the nonconvex setting is not yet understood; our analysis covers this case.
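For illustration only (our instance, not prescribed by the paper), take f = μ‖·‖₁ and g(x) = ½‖Ax − b‖². The iFRB subproblem then has the closed form x_{k+1} = prox_{λ_k μ‖·‖₁}(y_k − λ_k ∇g(x_k) + α_k(x_k − x_{k−1})), i.e. soft thresholding, which yields the runnable sketch below; the constant stepsize is a conservative guess, not a tuned choice:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the prox of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ifrb_l1(A, b, mu, lam, alpha, iters=1000):
    """iFRB sketch for min_x 0.5*||Ax - b||^2 + mu*||x||_1
    with constant stepsize lam and inertia alpha (illustrative)."""
    grad = lambda z: A.T @ (A @ z - b)
    x_prev = x = np.zeros(A.shape[1])
    for _ in range(iters):
        y = x + lam * (grad(x_prev) - grad(x))            # reflection step (8)
        x_prev, x = x, soft(y - lam * grad(x) + alpha * (x - x_prev), lam * mu)
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
lam = 0.5 / (2.0 * np.linalg.norm(A.T @ A, 2))            # lam < 1/(2*L_g), a safe guess
x = ifrb_l1(A, b, mu=0.1, lam=lam, alpha=0.1)
```

At a fixed point this recovers x = prox_{λμ‖·‖₁}(x − λ∇g(x)), a stationarity condition for (2).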

3.2 BiFRB merit function and its parameter rules

In the remainder of this paper, unless specified otherwise, let p1p_{-1}\in\mathbb{R} and define for kk\in\mathbb{N}

pk\displaystyle p_{k} =((Lhσ)Lg22)λk12λk+12(σλkLg)pk1,\displaystyle=\left(\frac{(L_{\nabla h}-\sigma)L^{2}_{\nabla g}}{2}\right)\frac{\lambda^{2}_{k-1}}{\lambda_{k}}+\frac{1}{2}\left(\frac{\sigma}{\lambda_{k}}-L_{\nabla g}\right)-p_{k-1}, (10)
M1,k\displaystyle M_{1,k} =pk1αk+σLgλk12λk(Lhσ)Lg2λk122λk,\displaystyle=p_{k-1}-\frac{\alpha_{k}+\sigma L_{\nabla g}\lambda_{k-1}}{2\lambda_{k}}-\frac{(L_{\nabla h}-\sigma)L^{2}_{\nabla g}\lambda_{k-1}^{2}}{2\lambda_{k}}, (11)

where (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} is a stepsize sequence of BiFRB; (αk)k(\alpha_{k})_{k\in\mathbb{N}} is a sequence of inertial parameters; and Lh,Lg,σL_{\nabla h},L_{\nabla g},\sigma are parameters specified in Assumption 3.1. The following concept plays a central role in our analysis.
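Under constant parameters λ_k ≡ λ and α_k ≡ α (an illustrative regime we choose here), the recursion (10) becomes p_k = C − p_{k−1} with C = (L_∇h − σ)L²_∇g λ/2 + (σ/λ − L_∇g)/2, so initializing p₋₁ at the fixed point C/2 keeps the sequence constant. A numerical sketch with the constants of Example 3.5:

```python
# Illustrative constants: sigma = 1, L_h = 2 (Example 3.5 kernel), L_g = 1,
# constant stepsize lam = 0.2 and inertia alpha = 0.1 (our choices).
sigma, L_h, L_g = 1.0, 2.0, 1.0
lam, alpha = 0.2, 0.1

C = (L_h - sigma) * L_g**2 * lam / 2.0 + 0.5 * (sigma / lam - L_g)

def M1(p_prev):
    """M_{1,k} from (11) with constant parameters."""
    return p_prev - (alpha + sigma * L_g * lam) / (2.0 * lam) \
                  - (L_h - sigma) * L_g**2 * lam / 2.0

p = C / 2.0                      # choose p_{-1} at the fixed point of p_k = C - p_{k-1}
ps, Ms = [], []
for _ in range(50):
    Ms.append(M1(p))
    p = C - p                    # recursion (10) with constant stepsize
    ps.append(p)

# The parameter rules of Assumption 3.12 hold in this regime:
# p_k >= 0, bounded above, and M_{1,k} bounded away from 0.
assert min(ps) >= 0.0 and max(ps) <= C and min(Ms) > 0.0
```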

Definition 3.10

Let pp\in\mathbb{R}. Define Hp:n×n¯H_{p}:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\overline{\mathbb{R}} by

((x,y)n×n)Hp(x,y)=F(x)+pxy2.(\forall(x,y)\in\mathbb{R}^{n}\times\mathbb{R}^{n})~{}H_{p}(x,y)=F(x)+p\left\lVert x-y\right\rVert^{2}.

Then we say that HpH_{p} is a quadratic merit function with parameter pp. Let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} be given by (10). Then HpkH_{p_{k}} is the kthk^{th} BiFRB merit function. If pkp¯p_{k}\leq\bar{p} for some p¯>0\bar{p}>0, then Hp¯H_{\bar{p}} is a dominating BiFRB merit function.

Remark 3.11

Quadratic merit functions of various forms appear frequently in the literature devoted to extending splitting methods to the nonconvex setting; see, e.g., [9, 10, 31, 27, 26] and the references therein.

Despite the recursion (10), we shall soon see that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} are in fact implicit in our analysis. Imposing the following novel condition on (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}}, one obtains abstract convergence of BiFRB, which in turn yields less restrictive stepsize rules; see Section 4 and in particular Remark 4.3.

Assumption 3.12 (general rules of BiFRB merit function parameter)

Let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} be given as in (10) and (11). Suppose that the following hold:

(i) lim infkpk0\liminf_{k\to\infty}p_{k}\geq 0 and lim infkM1,k>0\liminf_{k\to\infty}M_{1,k}>0.

(ii) There exists p¯\bar{p} such that pkp¯p_{k}\leq\bar{p} for all kk.

Remark 3.13

Assumption 3.12 generalizes quadratic merit function properties in the present literature; see, e.g., [9, Lemma 3.2], [10, Lemma 6], [26, Assumption (H1)] and [27, Algorithm 5, Lemma 4.6, Proposition 4.7].

3.3 Abstract function value convergence

In this section, we study function value convergence of BiFRB under Assumption 3.12; see Section 4 for concrete stepsize rules under which Assumption 3.12 holds.

Lemma 3.14

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Then the following hold:

(i) For all kk\in\mathbb{N},

Hpk(zk)Hpk1(zk1)M1,kzkzk12.H_{p_{k}}(z_{k})\leq H_{p_{k-1}}(z_{k-1})-M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}. (12)

(ii) Suppose that Assumption 3.12(i) holds. Then the sequence (Hpk(zk))k\left(H_{p_{k}}(z_{k})\right)_{k\in\mathbb{N}} is decreasing and k=0M1,kzkzk12<\sum_{k=0}^{\infty}M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}<\infty. Consequently, (Hpk(zk))k\left(H_{p_{k}}(z_{k})\right)_{k\in\mathbb{N}} is convergent and limkzkzk1=0\lim_{k\to\infty}\left\lVert z_{k}-z_{k-1}\right\rVert=0.

Proof. (i) By using the iterative scheme (9),

f(xk+1)\displaystyle f(x_{k+1}) f(xk)+xkxk+1,g(xk)+αkλkxkxk+1,xk1xk\displaystyle\leq f(x_{k})+\langle x_{k}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x_{k}-x_{k+1},x_{k-1}-x_{k}\rangle
+1λk(Dh(xk,yk)Dh(xk+1,yk))\displaystyle+\frac{1}{\lambda_{k}}\left(D_{h}(x_{k},y_{k})-D_{h}(x_{k+1},y_{k})\right)
f(xk)+xkxk+1,g(xk)+αkλkxkxk+1,xk1xk\displaystyle\leq f(x_{k})+\langle x_{k}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x_{k}-x_{k+1},x_{k-1}-x_{k}\rangle
+Lh2λkxkyk2σ2λkxk+1yk2,\displaystyle+\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}, (13)

where the last inequality is implied by Lemma 3.3. Invoking the descent lemma [3, Lemma 2.64] to gg yields

g(xk+1)g(xk)+xk+1xk,g(xk)+Lg2xk+1xk2.g(x_{k+1})\leq g(x_{k})+\langle x_{k+1}-x_{k},\nabla g(x_{k})\rangle+\frac{L_{\nabla g}}{2}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}. (14)

Adding (13) and (14), one has

F(xk+1)\displaystyle F(x_{k+1}) F(xk)+αkλkxkxk+1,xk1xk+Lh2λkxkyk2\displaystyle\leq F(x_{k})+\frac{\alpha_{k}}{\lambda_{k}}\langle x_{k}-x_{k+1},x_{k-1}-x_{k}\rangle+\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}
σ2λkxk+1yk2+Lg2xk+1xk2\displaystyle-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}+\frac{L_{\nabla g}}{2}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
F(xk)+(αk2λk+Lg2)xk+1xk2+αk2λkxk1xk2\displaystyle\leq F(x_{k})+\left(\frac{\alpha_{k}}{2\lambda_{k}}+\frac{L_{\nabla g}}{2}\right)\left\lVert x_{k+1}-x_{k}\right\rVert^{2}+\frac{\alpha_{k}}{2\lambda_{k}}\left\lVert x_{k-1}-x_{k}\right\rVert^{2}
+Lh2λkxkyk2σ2λkxk+1yk2.\displaystyle+\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}. (15)

Next we estimate

Lh2λkxkyk2σ2λkxk+1yk2=Lhσ2λkxkyk2σ2λk(xk+1yk2xkyk2)\displaystyle\frac{L_{\nabla h}}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}=\frac{L_{\nabla h}-\sigma}{2\lambda_{k}}\left\lVert x_{k}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left(\left\lVert x_{k+1}-y_{k}\right\rVert^{2}-\left\lVert x_{k}-y_{k}\right\rVert^{2}\right)
=(Lhσ)λk122λkg(xk)g(xk1)2σ2λkxk+1xk2\displaystyle=\frac{(L_{\nabla h}-\sigma)\lambda_{k-1}^{2}}{2\lambda_{k}}\left\lVert\nabla g(x_{k})-\nabla g(x_{k-1})\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
σλk1λkxk+1xk,g(xk)g(xk1)\displaystyle-\frac{\sigma\lambda_{k-1}}{\lambda_{k}}\langle x_{k+1}-x_{k},\nabla g(x_{k})-\nabla g(x_{k-1})\rangle
(Lhσ)Lg2λk122λkxkxk12+σLgλk1λkxk+1xkxkxk1σ2λkxk+1xk2\displaystyle\leq\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}+\frac{\sigma L_{\nabla g}\lambda_{k-1}}{\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert\left\lVert x_{k}-x_{k-1}\right\rVert-\frac{\sigma}{2\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
(Lhσ)Lg2λk12+σLgλk12λkxkxk12+σ(Lgλk11)2λkxk+1xk2.\displaystyle\leq\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}+\sigma L_{\nabla g}\lambda_{k-1}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}+\frac{\sigma(L_{\nabla g}\lambda_{k-1}-1)}{2\lambda_{k}}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}.

Thus, inequality (15) further implies that

F(xk+1)\displaystyle F(x_{k+1}) F(xk)+(σ(Lgλk11)+αk2λk+Lg2)xk+1xk2\displaystyle\leq F(x_{k})+\left(\frac{\sigma(L_{\nabla g}\lambda_{k-1}-1)+\alpha_{k}}{2\lambda_{k}}+\frac{L_{\nabla g}}{2}\right)\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
+(Lhσ)Lg2λk12+σLgλk1+αk2λkxkxk12.\displaystyle+\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}+\sigma L_{\nabla g}\lambda_{k-1}+\alpha_{k}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}. (16)

Adding pkxk+1xk2p_{k}\left\lVert x_{k+1}-x_{k}\right\rVert^{2} to the above inequality gives

Hpk(zk)\displaystyle H_{p_{k}}(z_{k}) =F(xk+1)+pkxk+1xk2\displaystyle=F(x_{k+1})+p_{k}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}
F(xk)+(σ(Lgλk11)+αk2λk+Lg2+pk)(zkzk12xkxk12)\displaystyle\leq F(x_{k})+\left(\frac{\sigma(L_{\nabla g}\lambda_{k-1}-1)+\alpha_{k}}{2\lambda_{k}}+\frac{L_{\nabla g}}{2}+p_{k}\right)(\left\lVert z_{k}-z_{k-1}\right\rVert^{2}-\left\lVert x_{k}-x_{k-1}\right\rVert^{2})
+(Lhσ)Lg2λk12+σLgλk1+αk2λkxkxk12\displaystyle+\frac{(L_{\nabla h}-\sigma)L_{\nabla g}^{2}\lambda_{k-1}^{2}+\sigma L_{\nabla g}\lambda_{k-1}+\alpha_{k}}{2\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}
=F(xk)+pk1xkxk12M1,kzkzk12\displaystyle=F(x_{k})+p_{k-1}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}-M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}
=Hpk1(zk1)M1,kzkzk12,\displaystyle=H_{p_{k-1}}(z_{k-1})-M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2},

where the first inequality follows from the identity zkzk12=xk+1xk2+xkxk12\left\lVert z_{k}-z_{k-1}\right\rVert^{2}=\left\lVert x_{k+1}-x_{k}\right\rVert^{2}+\left\lVert x_{k}-x_{k-1}\right\rVert^{2}.

(ii) Discarding finitely many terms if necessary, assume without loss of generality that pk0p_{k}\geq 0 and M1,k>0M_{1,k}>0 for all kk\in\mathbb{N}. Thus, by statement (i) and Assumption 3.1, the sequence (Hpk(zk))k\left(H_{p_{k}}(z_{k})\right)_{k\in\mathbb{N}} is decreasing and bounded below, and therefore convergent. Summing inequality (12) from k=0k=0 to k=n1k=n-1, we have

k=0n1M1,kzkzk12\displaystyle\sum_{k=0}^{n-1}M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2} k=0n1(Hpk(zk)Hpk+1(zk+1))=Hp0(z0)Hpn(zn)\displaystyle\leq\sum_{k=0}^{n-1}\big{(}H_{p_{k}}(z_{k})-H_{p_{k+1}}(z_{k+1})\big{)}=H_{p_{0}}(z_{0})-H_{p_{n}}(z_{n})
Hp0(z0)F(xn+1)Hp0(z0)infxnF(x)<,\displaystyle\leq H_{p_{0}}(z_{0})-F(x_{n+1})\leq H_{p_{0}}(z_{0})-\inf_{x\in\mathbb{R}^{n}}F(x)<\infty,

which together with the assumption that lim infkM1,k>0\liminf_{k\to\infty}M_{1,k}>0 implies that k=0zkzk12<\sum_{k=0}^{\infty}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}<\infty.\quad\hfill\blacksquare

The following lemma is an immediate consequence of subdifferential sum rules [28, Exercise 8.8, Proposition 10.5].

Lemma 3.15

Let pp\in\mathbb{R}. Then for every (x,y)n×n(x,y)\in\mathbb{R}^{n}\times\mathbb{R}^{n},

Hp(x,y)={f(x)+g(x)+2p(xy)}×{2p(yx)}.\partial H_{p}(x,y)=\{\partial f(x)+\nabla g(x)+2p(x-y)\}\times\{2p(y-x)\}.
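A finite-difference check (ours) of Lemma 3.15 in the smooth case, where ∂f(x) + ∇g(x) reduces to ∇F(x); here F(x) = ‖x‖² is an assumed toy objective:

```python
import numpy as np

p = 0.7
F = lambda z: np.dot(z, z)                    # smooth toy objective, grad F(z) = 2z
H = lambda x, y: F(x) + p * np.dot(x - y, x - y)

x, y = np.array([1.0, -2.0]), np.array([0.5, 0.0])
gx = 2.0 * x + 2.0 * p * (x - y)              # first block of Lemma 3.15
gy = 2.0 * p * (y - x)                        # second block of Lemma 3.15

# Central differences in the first coordinate of each block.
eps = 1e-6
e0 = np.array([eps, 0.0])
num_gx0 = (H(x + e0, y) - H(x - e0, y)) / (2.0 * eps)
num_gy0 = (H(x, y + e0) - H(x, y - e0)) / (2.0 * eps)
assert abs(num_gx0 - gx[0]) < 1e-5 and abs(num_gy0 - gy[0]) < 1e-5
```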

The lemma below provides a lower bound on the gap between consecutive iterates in terms of a subgradient of the merit function.

Lemma 3.16

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB, zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}, and p0p\geq 0. For kk\in\mathbb{N}, define uk=(h(yk)h(xk+1))/λkg(xk)+αk(xkxk1)/λku_{k}=(\nabla h(y_{k})-\nabla h(x_{k+1}))/\lambda_{k}-\nabla g(x_{k})+\alpha_{k}(x_{k}-x_{k-1})/\lambda_{k},

Ak=uk+g(xk+1)+2p(xk+1xk),Bk=2p(xkxk+1).A_{k}=u_{k}+\nabla g(x_{k+1})+2p(x_{k+1}-x_{k}),B_{k}=2p(x_{k}-x_{k+1}).

Set

M2,k=2max{Lhλk+Lg+6p,LhLgλk1+1λk},M_{2,k}=\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\lambda_{k}}+L_{\nabla g}+6p,\frac{L_{\nabla h}L_{\nabla g}\lambda_{k-1}+1}{\lambda_{k}}\bigg{\}},

and

M2=2max{Lhλ¯+Lg+6p,LhLgλ¯+1λ¯}.M_{2}=\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6p,\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\bigg{\}}.

Then (Ak,Bk)Hp(zk)\left(A_{k},B_{k}\right)\in\partial H_{p}(z_{k}) and (Ak,Bk)M2,kzkzk1M2zkzk1\left\lVert\left(A_{k},B_{k}\right)\right\rVert\leq M_{2,k}\left\lVert z_{k}-z_{k-1}\right\rVert\leq M_{2}\left\lVert z_{k}-z_{k-1}\right\rVert.

Proof. Invoking (9), one has

0f(xk+1)+g(xk)+αk(xk1xk)λk+h(xk+1)h(yk)λk,0\in\partial f(x_{k+1})+\nabla g(x_{k})+\frac{\alpha_{k}(x_{k-1}-x_{k})}{\lambda_{k}}+\frac{\nabla h(x_{k+1})-\nabla h(y_{k})}{\lambda_{k}},

which implies that ukf(xk+1)u_{k}\in\partial f(x_{k+1}). It then follows from Lemma 3.15 that (Ak,Bk)Hp(zk)\left(A_{k},B_{k}\right)\in\partial H_{p}(z_{k}). Moreover,

uk+g(xk+1)\displaystyle\left\lVert u_{k}+\nabla g(x_{k+1})\right\rVert 1λk(h(yk)h(xk+1))+g(xk+1)g(xk)\displaystyle\leq\left\lVert\frac{1}{\lambda_{k}}\left(\nabla h(y_{k})-\nabla h(x_{k+1})\right)+\nabla g(x_{k+1})-\nabla g(x_{k})\right\rVert
+2pxk+1xk+αkλkxkxk1\displaystyle+2p\left\lVert x_{k+1}-x_{k}\right\rVert+\frac{\alpha_{k}}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert
Lhλkykxk+xkxk+1+(Lg+2p)xk+1xk+1λkxkxk1\displaystyle\leq\frac{L_{\nabla h}}{\lambda_{k}}\left\lVert y_{k}-x_{k}+x_{k}-x_{k+1}\right\rVert+\left(L_{\nabla g}+2p\right)\left\lVert x_{k+1}-x_{k}\right\rVert+\frac{1}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert
LhLgλk1+1λkxkxk1+(Lhλk+Lg+2p)xk+1xk.\displaystyle\leq\frac{L_{\nabla h}L_{\nabla g}\lambda_{k-1}+1}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert+\left(\frac{L_{\nabla h}}{\lambda_{k}}+L_{\nabla g}+2p\right)\left\lVert x_{k+1}-x_{k}\right\rVert.

Then

(Ak,Bk)\displaystyle\left\lVert\left(A_{k},B_{k}\right)\right\rVert Ak+Bkuk+g(xk+1)+4p(xk+1xk)\displaystyle\leq\left\lVert A_{k}\right\rVert+\left\lVert B_{k}\right\rVert\leq\left\lVert u_{k}+\nabla g(x_{k+1})\right\rVert+4p\left\lVert(x_{k+1}-x_{k})\right\rVert
LhLgλk1+1λkxkxk1+(Lhλk+Lg+6p)xk+1xk,\displaystyle\leq\frac{L_{\nabla h}L_{\nabla g}\lambda_{k-1}+1}{\lambda_{k}}\left\lVert x_{k}-x_{k-1}\right\rVert+\left(\frac{L_{\nabla h}}{\lambda_{k}}+L_{\nabla g}+6p\right)\left\lVert x_{k+1}-x_{k}\right\rVert,
M2,kzkzk1M2zkzk1,\displaystyle\leq M_{2,k}\left\lVert z_{k}-z_{k-1}\right\rVert\leq M_{2}\left\lVert z_{k}-z_{k-1}\right\rVert,

as desired.\quad\hfill\blacksquare

Having established basic properties of (Hpk(xk+1,xk))\big{(}H_{p_{k}}(x_{k+1},x_{k})\big{)}, we now show its convergence. Denote by ω(z1)\omega(z_{-1}) the set of all limit points of (zk)k(z_{k})_{k\in\mathbb{N}^{*}}.

Theorem 3.17

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that Assumption 3.12 holds and (zk)k(z_{k})_{k\in\mathbb{N}^{*}} is bounded. Let (zkl)l(z_{k_{l}})_{l\in\mathbb{N}} be a subsequence such that zklzz_{k_{l}}\to z^{*} for some z=(x,y)z^{*}=(x^{*},y^{*}) as ll\to\infty. Then the following hold:

(i) We have limlHpkl(zkl)=F(x)\lim_{l\to\infty}H_{p_{k_{l}}}(z_{k_{l}})=F(x^{*}). In fact, limkHpk(zk)=F(x)\lim_{k\to\infty}H_{p_{k}}(z_{k})=F(x^{*}). Consequently limkHp¯(zk)=F(x)\lim_{k\to\infty}H_{\bar{p}}(z_{k})=F(x^{*}), where p¯\bar{p} is given in Assumption 3.12(ii).

(ii) The limit point z=(x,y)z^{*}=(x^{*},y^{*}) satisfies x=yx^{*}=y^{*} and 0Hp¯(x,y)0\in\partial H_{\bar{p}}(x^{*},y^{*}), which implies that 0F(x)0\in\partial F(x^{*}).

(iii) The set ω(z1)\omega(z_{-1}) is nonempty, connected, and compact, and the function Hp¯H_{\bar{p}} is constant on it with value F(x)F(x^{*}). Moreover, limkdist(zk,ω(z1))=0\lim_{k\to\infty}\operatorname{dist}(z_{k},\omega(z_{-1}))=0.

Proof. (i) For every kk\in\mathbb{N}, appealing to the iterative scheme (9), we have

f(xk+1)\displaystyle f(x_{k+1}) f(x)+xxk+1,g(xk)+αkλkxxk+1,xk1xk\displaystyle\leq f(x^{*})+\langle x^{*}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x^{*}-x_{k+1},x_{k-1}-x_{k}\rangle
+1λkDh(x,yk)1λkDh(xk+1,yk)\displaystyle+\frac{1}{\lambda_{k}}D_{h}(x^{*},y_{k})-\frac{1}{\lambda_{k}}D_{h}(x_{k+1},y_{k})
f(x)+xxk+1,g(xk)+αkλkxxk+1,xk1xk\displaystyle\leq f(x^{*})+\langle x^{*}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x^{*}-x_{k+1},x_{k-1}-x_{k}\rangle
+Lhσ2λkxyk2σ2λk(xk+1yk2xyk2)\displaystyle+\frac{L_{\nabla h}-\sigma}{2\lambda_{k}}\left\lVert x^{*}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\big{(}\left\lVert x_{k+1}-y_{k}\right\rVert^{2}-\left\lVert x^{*}-y_{k}\right\rVert^{2}\big{)}
=f(x)+xxk+1,g(xk)+αkλkxxk+1,xk1xk\displaystyle=f(x^{*})+\langle x^{*}-x_{k+1},\nabla g(x_{k})\rangle+\frac{\alpha_{k}}{\lambda_{k}}\langle x^{*}-x_{k+1},x_{k-1}-x_{k}\rangle
+Lhσ2λkxyk2σ2λk(xk+1x2+2xk+1x,xyk).\displaystyle+\frac{L_{\nabla h}-\sigma}{2\lambda_{k}}\left\lVert x^{*}-y_{k}\right\rVert^{2}-\frac{\sigma}{2\lambda_{k}}\big{(}\left\lVert x_{k+1}-x^{*}\right\rVert^{2}+2\langle x_{k+1}-x^{*},x^{*}-y_{k}\rangle\big{)}. (17)

Recall that xklxx_{k_{l}}\to x^{*} as ll\to\infty. Then Lemma 3.14(ii) and the triangle inequality imply that xkl1xx_{k_{l}-1}\to x^{*} and

xykl1xxkl1+λkl2Lgxkl2xkl10, as l,\left\lVert x^{*}-y_{k_{l}-1}\right\rVert\leq\left\lVert x^{*}-x_{k_{l}-1}\right\rVert+\lambda_{k_{l}-2}L_{\nabla g}\left\lVert x_{k_{l}-2}-x_{k_{l}-1}\right\rVert\to 0,\text{ as }l\to\infty,

where the first inequality holds due to (8). Taking the above limits into account, replacing the index k+1k+1 by klk_{l} in (17) and passing to the limit, one gets

lim suplf(xkl)f(x)lim inflf(xkl),\limsup_{l\to\infty}f(x_{k_{l}})\leq f(x^{*})\leq\liminf_{l\to\infty}f(x_{k_{l}}),

where the last inequality holds thanks to the lower semicontinuity of ff, which implies that limlF(xkl)=F(x)\lim_{l\to\infty}F(x_{k_{l}})=F(x^{*}). Note that the sequence (Hpk(zk))k(H_{p_{k}}(z_{k}))_{k\in\mathbb{N}} is convergent. Then one concludes that

limkHpk(zk)=limlHpkl(zkl)=liml(F(xkl)+pklxklxkl12)=F(x),\lim_{k\to\infty}H_{p_{k}}(z_{k})=\lim_{l\to\infty}H_{p_{k_{l}}}(z_{k_{l}})=\lim_{l\to\infty}\big{(}F(x_{k_{l}})+p_{k_{l}}\left\lVert x_{k_{l}}-x_{k_{l}-1}\right\rVert^{2}\big{)}=F(x^{*}),

where limlpklxklxkl12=0\lim_{l\to\infty}p_{k_{l}}\left\lVert x_{k_{l}}-x_{k_{l}-1}\right\rVert^{2}=0 due to Assumption 3.12(ii) and Lemma 3.14(ii). Finally, it is easy to see that 0(p¯pk)xkxk12p¯xkxk1200\leq(\bar{p}-p_{k})\left\lVert x_{k}-x_{k-1}\right\rVert^{2}\leq\bar{p}\left\lVert x_{k}-x_{k-1}\right\rVert^{2}\to 0 as kk\to\infty under Assumption 3.12(ii). Thus limkHp¯(zk)=limk(Hpk(zk)+(p¯pk)xkxk12)=F(x)\lim_{k\to\infty}H_{\bar{p}}(z_{k})=\lim_{k\to\infty}\big{(}H_{p_{k}}(z_{k})+(\bar{p}-p_{k})\left\lVert x_{k}-x_{k-1}\right\rVert^{2}\big{)}=F(x^{*}).

(ii) By Lemma 3.14(ii) and the triangle inequality, the subsequences (xkl)l(x_{k_{l}})_{l\in\mathbb{N}} and (xkl+1)l(x_{k_{l}+1})_{l\in\mathbb{N}} have the same limit. Therefore x=yx^{*}=y^{*}. Combining Lemma 3.16 with p=p¯p=\bar{p}, Lemma 3.14(ii), statement (i), and the outer semicontinuity of zHp¯(z)z\mapsto\partial H_{\bar{p}}(z), we conclude that 0Hp¯(z)0\in\partial H_{\bar{p}}(z^{*}), which further implies 0F(x)0\in\partial F(x^{*}).

(iii) Apply a similar proof as in [31, Theorem 3.6(iii)].\quad\hfill\blacksquare

The result below provides a sufficient condition for the critical boundedness assumption in Theorem 3.17.

Theorem 3.18

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that Assumption 3.12(i) holds. If f+gf+g is coercive (or level-bounded), then the sequence (xk)k(x_{k})_{k\in\mathbb{N}^{*}} is bounded, and so is (zk)k(z_{k})_{k\in\mathbb{N}^{*}}.

Proof. Assume without loss of generality that M1,k>0M_{1,k}>0 and pk0p_{k}\geq 0 for kk\in\mathbb{N}. Then, appealing to Lemma 3.14, one has Hpk(zk)Hp1(z1)H_{p_{k}}(z_{k})\leq H_{p_{-1}}(z_{-1}) for kk\in\mathbb{N}. Therefore

F(xk+1)=Hpk(zk)pkxk+1xk2Hpk(zk)Hp1(z1).F(x_{k+1})=H_{p_{k}}(z_{k})-p_{k}\left\lVert x_{k+1}-x_{k}\right\rVert^{2}\leq H_{p_{k}}(z_{k})\leq H_{p_{-1}}(z_{-1}).

If (xk)k(x_{k})_{k\in\mathbb{N}} were unbounded, then the above inequality would contradict the coercivity (or level-boundedness) of f+gf+g. \quad\hfill\blacksquare

3.4 Abstract sequential convergence

In this section, we establish global sequential convergence of BiFRB to a stationary point by using the generalized concave KL property (recall Definition 2.2). For ε>0\varepsilon>0 and nonempty set Ωn\Omega\subseteq\mathbb{R}^{n}, we define Ωε={xn:dist(x,Ω)<ε}\Omega_{\varepsilon}=\{x\in\mathbb{R}^{n}:\operatorname{dist}(x,\Omega)<\varepsilon\}.

Lemma 3.19

[30, Lemma 4.4] Let f:n¯f:\mathbb{R}^{n}\rightarrow\overline{\mathbb{R}} be proper lsc and let μ\mu\in\mathbb{R}. Let Ωdomf\Omega\subseteq\operatorname{dom}\partial f be a nonempty compact set on which f(x)=μf(x)=\mu for all xΩx\in\Omega. The following statements hold:

(i) Suppose that ff satisfies the pointwise generalized concave KL property at each xΩx\in\Omega. Then there exist ε>0,η(0,]\varepsilon>0,\eta\in(0,\infty] and φΦη\varphi\in\Phi_{\eta} such that ff has the setwise generalized concave KL property on Ω\Omega with respect to U=ΩεU=\Omega_{\varepsilon}, η\eta and φ\varphi.

(ii) Set U=ΩεU=\Omega_{\varepsilon} and define h:(0,η)+h:(0,\eta)\rightarrow\mathbb{R}_{+} by

h(s)=sup{dist1(0,f(x)):xU[0<fμ<η],sf(x)μ}.h(s)=\sup\big{\{}\operatorname{dist}^{-1}\big{(}0,\partial f(x)\big{)}:x\in U\cap[0<f-\mu<\eta],s\leq f(x)-\mu\big{\}}.

Then the function φ~:[0,η)+\tilde{\varphi}:[0,\eta)\rightarrow\mathbb{R}_{+},

t0th(s)𝑑s,t(0,η),t\mapsto\int_{0}^{t}h(s)ds,~{}\forall t\in(0,\eta),

and φ~(0)=0\tilde{\varphi}(0)=0, is well-defined and belongs to Φη\Phi_{\eta}. The function ff has the setwise generalized concave KL property on Ω\Omega with respect to UU, η\eta and φ~\tilde{\varphi}. Moreover,

φ~=inf{φΦη:φ is a concave desingularizing function of f on Ω with respect to U and η}.\tilde{\varphi}=\inf\big{\{}\varphi\in\Phi_{\eta}:\text{$\varphi$ is a concave desingularizing function of $f$ on $\Omega$ with respect to $U$ and $\eta$}\big{\}}.

We say φ~\tilde{\varphi} is the exact modulus of the setwise generalized concave KL property of ff on Ω\Omega with respect to UU and η\eta.

Theorem 3.20 (global convergence of BiFRB)

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that Assumption 3.12 holds and (zk)k(z_{k})_{k\in\mathbb{N}^{*}} is bounded. Suppose further that the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) has the generalized concave KL property on ω(z1)\omega(z_{-1}). Then the following hold:

(i) The sequence (zk)k(z_{k})_{k\in\mathbb{N}^{*}} is Cauchy and has finite length. To be specific, there exist an index k0k_{0}\in\mathbb{N}, ε>0\varepsilon>0 and η(0,]\eta\in(0,\infty] such that for ik0+1i\geq k_{0}+1,

k=izk+1zkzizi1+Cφ~(Hpi(zi)F(x)),\sum_{k=i}^{\infty}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right), (18)

where

C=22max{Lhλ¯+Lg+6p¯,LhLgλ¯+1λ¯}lim infkM1,k,C=\frac{2\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6\bar{p},\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\bigg{\}}}{\liminf_{k\to\infty}M_{1,k}},

and φ~\tilde{\varphi} is the exact modulus associated with the setwise generalized concave KL property of Hp¯H_{\bar{p}} on ω(z1)\omega(z_{-1}) with respect to ε\varepsilon and η\eta.

(ii) The sequence (xk)k(x_{k})_{k\in\mathbb{N}^{*}} has finite length and converges to some xx^{*} with 0F(x)0\in\partial F(x^{*}).

Proof. By the boundedness assumption, assume without loss of generality that zkz=(x,y)n×nz_{k}\to z^{*}=(x^{*},y^{*})\in\mathbb{R}^{n}\times\mathbb{R}^{n}. Then Theorem 3.17 implies that x=yx^{*}=y^{*} and Hpk(zk)F(x)H_{p_{k}}(z_{k})\to F(x^{*}) as kk\to\infty. By Lemma 3.14, assume without loss of generality that (Hpk(zk))k\big{(}H_{p_{k}}(z_{k})\big{)}_{k\in\mathbb{N}} is decreasing. Consider two cases:

Case 1: Suppose that there exists k0k_{0} such that F(x)=Hpk0(zk0)F(x^{*})=H_{p_{k_{0}}}(z_{k_{0}}). Then Lemma 3.14(i) implies that zk0+1=zk0z_{k_{0}+1}=z_{k_{0}} and xk0+1=xk0x_{k_{0}+1}=x_{k_{0}}. The desired result then follows from a simple induction.

Case 2: Assume that F(x)<Hpk(zk)F(x^{*})<H_{p_{k}}(z_{k}) for all kk. By Theorem 3.17(iii), the function Hp¯H_{\bar{p}} is constant on ω(z1)\omega(z_{-1}) with value F(x)F(x^{*}). Then Lemma 3.19 yields ε>0\varepsilon>0 and η>0\eta>0 such that Hp¯H_{\bar{p}} has the setwise generalized concave KL property on ω(z1)\omega(z_{-1}) with respect to ε\varepsilon and η\eta, with associated exact modulus φ~\tilde{\varphi}. Appealing to Theorem 3.17 again, there exists k1k_{1} such that zk[0<Hp¯Hp¯(z)<η]z_{k}\in[0<H_{\bar{p}}-H_{\bar{p}}(z^{*})<\eta] for k>k1k>k_{1} and k2k_{2} such that dist(zk,ω(z1))<ε\operatorname{dist}(z_{k},\omega(z_{-1}))<\varepsilon for all k>k2k>k_{2}. Put k0=max(k1,k2)k_{0}=\max(k_{1},k_{2}). Then for k>k0k>k_{0},

(φ~)(Hpk(zk)F(x))dist(0,Hp¯(zk))(φ~)(Hp¯(zk)F(x))dist(0,Hp¯(zk))1,\displaystyle(\tilde{\varphi})_{-}^{\prime}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)\cdot\operatorname{dist}(0,\partial H_{\bar{p}}(z_{k}))\geq(\tilde{\varphi})_{-}^{\prime}\left(H_{\bar{p}}(z_{k})-F(x^{*})\right)\cdot\operatorname{dist}(0,\partial H_{\bar{p}}(z_{k}))\geq 1,

where the first inequality holds because Hpk(zk)Hp¯(zk)H_{p_{k}}(z_{k})\leq H_{\bar{p}}(z_{k}) and the last one follows from the generalized concave KL inequality (recall Definition 2.2). Invoking Lemma 3.16 yields

M2,k(φ~)(Hpk(zk)F(x))zkzk11,M_{2,k}(\tilde{\varphi})_{-}^{\prime}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)\cdot\left\lVert z_{k}-z_{k-1}\right\rVert\geq 1, (19)

where M2,k=M2,k(p¯)M_{2,k}=M_{2,k}(\bar{p}). For simplicity, define

Δk,k+1=φ~(Hpk(zk)F(x))φ~(Hpk+1(zk+1)F(x)).\Delta_{k,k+1}=\tilde{\varphi}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)-\tilde{\varphi}\left(H_{p_{k+1}}(z_{k+1})-F(x^{*})\right).

By the concavity of φ~\tilde{\varphi} and (19), one has

Δk,k+1\displaystyle\Delta_{k,k+1} =φ~(Hpk(zk)F(x))φ~(Hpk+1(zk+1)F(x))\displaystyle=\tilde{\varphi}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)-\tilde{\varphi}\left(H_{p_{k+1}}(z_{k+1})-F(x^{*})\right)
(φ~)(Hpk(zk)F(x))[Hpk(zk)Hpk+1(zk+1)]\displaystyle\geq(\tilde{\varphi})_{-}^{\prime}\left(H_{p_{k}}(z_{k})-F(x^{*})\right)\cdot[H_{p_{k}}(z_{k})-H_{p_{k+1}}(z_{k+1})]
Hpk(zk)Hpk+1(zk+1)M2,kzkzk1.\displaystyle\geq\frac{H_{p_{k}}(z_{k})-H_{p_{k+1}}(z_{k+1})}{M_{2,k}\left\lVert z_{k}-z_{k-1}\right\rVert}. (20)

Applying Lemma 3.14 to (20) yields

Δk,k+1zk+1zk2Ckzkzk1,\Delta_{k,k+1}\geq\frac{\left\lVert z_{k+1}-z_{k}\right\rVert^{2}}{C_{k}\left\lVert z_{k}-z_{k-1}\right\rVert},

where Ck=M2,k/M1,kC_{k}=M_{2,k}/M_{1,k}. On one hand, by Assumption 3.12(i), assume without loss of generality that M1,klim infkM1,k/2M_{1,k}\geq\liminf_{k\to\infty}M_{1,k}/2 for all kk. On the other hand, M2,kM_{2,k} satisfies

M2,k2max{Lhλ¯+Lg+6p¯,LhLgλ¯+1λ¯}.M_{2,k}\leq\sqrt{2}\max\bigg{\{}\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6\bar{p},\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\bigg{\}}.

Then CkCC_{k}\leq C and

2zk+1zk2CkΔk,k+1zkzk1CΔk,k+1+zkzk1.\displaystyle 2\left\lVert z_{k+1}-z_{k}\right\rVert\leq 2\sqrt{C_{k}\Delta_{k,k+1}\left\lVert z_{k}-z_{k-1}\right\rVert}\leq C\Delta_{k,k+1}+\left\lVert z_{k}-z_{k-1}\right\rVert. (21)

Pick ik0+1i\geq k_{0}+1. Summing (21) from ii to an arbitrary j>ij>i yields

2k=ijzk+1zkk=ijzkzk1+Ck=ijΔk,k+1\displaystyle~{}~{}~{}~{}2\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\sum_{k=i}^{j}\left\lVert z_{k}-z_{k-1}\right\rVert+C\sum_{k=i}^{j}\Delta_{k,k+1}
k=ijzk+1zk+zizi1+Cφ~(Hpi(zi)F(x))Cφ~(Hpj+1(zj+1)F(x))\displaystyle\leq\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert+\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right)-C\tilde{\varphi}\left(H_{p_{j+1}}(z_{j+1})-F(x^{*})\right)
k=ijzk+1zk+zizi1+Cφ~(Hpi(zi)F(x)),\displaystyle\leq\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert+\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right),

implying that

k=ijzk+1zkzizi1+Cφ~(Hpi(zi)F(x)),\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right),

from which (18) readily follows. Finally, notice that

zjzik=ijzk+1zkzizi1+Cφ~(Hpi(zi)F(x)).\left\lVert z_{j}-z_{i}\right\rVert\leq\sum_{k=i}^{j}\left\lVert z_{k+1}-z_{k}\right\rVert\leq\left\lVert z_{i}-z_{i-1}\right\rVert+C\tilde{\varphi}\left(H_{p_{i}}(z_{i})-F(x^{*})\right). (22)

Recall from Theorem 3.17(i) and Lemma 3.14(ii) that Hpi(zi)F(x)H_{p_{i}}(z_{i})\to F(x^{*}) and zizi10\left\lVert z_{i}-z_{i-1}\right\rVert\to 0 as ii\to\infty. Hence the right-hand side of (22) tends to 0 as ii\to\infty, and we conclude from (22) that (zk)k(z_{k})_{k\in\mathbb{N}} is Cauchy. Invoking Theorem 3.17(ii) completes the proof of (i).

(ii) The statement follows from the definition of (zk)k(z_{k})_{k\in\mathbb{N}^{*}} and Theorem 3.17(ii).\quad\hfill\blacksquare

The generalized concave KL assumption in Theorem 3.20 can be easily satisfied by semialgebraic functions.

Corollary 3.21

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB. Assume that (xk)k(x_{k})_{k\in\mathbb{N}^{*}} is bounded and Assumption 3.12 holds. Suppose further that both ff and gg are semialgebraic. Then (xk)k(x_{k})_{k\in\mathbb{N}^{*}} converges to some xx^{*} with 0F(x)0\in\partial F(x^{*}) and has the finite length property.

Proof. Let p¯\bar{p} be as in Assumption 3.12. Then (x,y)p¯xy2(x,y)\mapsto\bar{p}\left\lVert x-y\right\rVert^{2} is semialgebraic. In turn, the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) is semialgebraic by the sum rule for semialgebraic functions; see, e.g., [1, Section 4.3]. The desired result then follows immediately from Theorem 3.20.\quad\hfill\blacksquare

4 Stepsize strategies

Having established abstract convergence properties of BiFRB, we now develop conditions under which the key Assumption 3.12 holds.

To furnish Assumption 3.12, it suffices to establish the existence of a suitable p1p_{-1} under desirable stepsize rules. Unlike most of the literature, which emphasizes explicit merit function parameters [9, 10, 31, 27], our (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} are implicit. This turns out to be instrumental in designing less restrictive stepsize rules. In the remainder of this section, unless specified otherwise,

a=(Lhσ)Lg2,b=σ, and c=Lg.a=(L_{\nabla h}-\sigma)L_{\nabla g}^{2},b=\sigma,\text{ and }c=L_{\nabla g}. (23)

Consequently, the sequences (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} given by (10) and (11) satisfy

pk=a2(λk12λk)+12(bλkc)pk1,M1,k=pk1αk+bcλk12λkaλk122λk.\displaystyle p_{k}=\frac{a}{2}\left(\frac{\lambda_{k-1}^{2}}{\lambda_{k}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{k}}-c\right)-p_{k-1},~{}M_{1,k}=p_{k-1}-\frac{\alpha_{k}+bc\lambda_{k-1}}{2\lambda_{k}}-\frac{a\lambda_{k-1}^{2}}{2\lambda_{k}}. (24)

The lemma below shows that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} are determined by the stepsizes and the initial input p1p_{-1}, a fact that will be useful shortly.

Lemma 4.1

Let (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} be a sequence of nonzero real numbers and let a,b,ca,b,c\in\mathbb{R}. Set p1p_{-1}\in\mathbb{R} and define (pk)k(p_{k})_{k\in\mathbb{N}^{*}} as in (24). Then for kk\in\mathbb{N},

p2k=a2i=1k(λ2i12λ2iλ2i22λ2i1)+b2i=1k(1λ2i1λ2i1)+aλ12+b2λ0c2p1,\displaystyle p_{2k}=\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)+\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+\frac{a\lambda^{2}_{-1}+b}{2\lambda_{0}}-\frac{c}{2}-p_{-1},
p2k+1=a2i=0k(λ2i2λ2i+1λ2i12λ2i)+b2i=0k(1λ2i+11λ2i)+p1,\displaystyle p_{2k+1}=\frac{a}{2}\sum_{i=0}^{k}\left(\frac{\lambda_{2i}^{2}}{\lambda_{2i+1}}-\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}\right)+\frac{b}{2}\sum_{i=0}^{k}\left(\frac{1}{\lambda_{2i+1}}-\frac{1}{\lambda_{2i}}\right)+p_{-1},

where we adopt the convention i=10ai=0\sum_{i=1}^{0}a_{i}=0 for any sequence (ak)k(a_{k})_{k\in\mathbb{N}}.

Proof. Let k\{0}k\in\mathbb{N}\backslash\{0\}. Then

p2k\displaystyle p_{2k} =a2(λ2k12λ2k)+12(bλ2kc)p2k1=a2(λ2k12λ2kλ2k22λ2k1)+b2(1λ2k1λ2k1)+p2k2\displaystyle=\frac{a}{2}\left(\frac{\lambda_{2k-1}^{2}}{\lambda_{2k}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{2k}}-c\right)-p_{2k-1}=\frac{a}{2}\left(\frac{\lambda_{2k-1}^{2}}{\lambda_{2k}}-\frac{\lambda_{2k-2}^{2}}{\lambda_{2k-1}}\right)+\frac{b}{2}\left(\frac{1}{\lambda_{2k}}-\frac{1}{\lambda_{2k-1}}\right)+p_{2k-2}
=a2i=1k(λ2i12λ2iλ2i22λ2i1)+b2i=1k(1λ2i1λ2i1)+p0\displaystyle=\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)+\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+p_{0}
=a2i=1k(λ2i12λ2iλ2i22λ2i1)+b2i=1k(1λ2i1λ2i1)+aλ12+b2λ0c2p1,\displaystyle=\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)+\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+\frac{a\lambda^{2}_{-1}+b}{2\lambda_{0}}-\frac{c}{2}-p_{-1},

and

p2k+1\displaystyle p_{2k+1} =a2(λ2k2λ2k+1)+12(bλ2k+1c)p2k\displaystyle=\frac{a}{2}\left(\frac{\lambda_{2k}^{2}}{\lambda_{2k+1}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{2k+1}}-c\right)-p_{2k}
=a2(λ2k2λ2k+1)+12(bλ2k+1c)a2i=1k(λ2i12λ2iλ2i22λ2i1)b2i=1k(1λ2i1λ2i1)\displaystyle=\frac{a}{2}\left(\frac{\lambda_{2k}^{2}}{\lambda_{2k+1}}\right)+\frac{1}{2}\left(\frac{b}{\lambda_{2k+1}}-c\right)-\frac{a}{2}\sum_{i=1}^{k}\left(\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}-\frac{\lambda_{2i-2}^{2}}{\lambda_{2i-1}}\right)-\frac{b}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)
aλ12+b2λ0+c2+p1\displaystyle-\frac{a\lambda^{2}_{-1}+b}{2\lambda_{0}}+\frac{c}{2}+p_{-1}
=a2i=0k(λ2i2λ2i+1λ2i12λ2i)+b2i=0k(1λ2i+11λ2i)+p1,\displaystyle=\frac{a}{2}\sum_{i=0}^{k}\left(\frac{\lambda_{2i}^{2}}{\lambda_{2i+1}}-\frac{\lambda_{2i-1}^{2}}{\lambda_{2i}}\right)+\frac{b}{2}\sum_{i=0}^{k}\left(\frac{1}{\lambda_{2i+1}}-\frac{1}{\lambda_{2i}}\right)+p_{-1},

from which the desired identities hold. \quad\hfill\blacksquare
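The closed forms above can be sanity-checked numerically against the recursion (24). The sketch below, in Python with illustrative values of a, b, c, the stepsize and p_{-1} (assumptions chosen only for illustration, not tied to any particular problem), confirms the pattern that appears again in Proposition 4.2: under a constant stepsize, the odd-indexed iterates return exactly p_{-1} while the even-indexed ones stay constant.

```python
# Numerical sanity check of the recursion (24): with a constant stepsize
# lambda_k = lam, the odd-indexed iterates satisfy p_{2k+1} = p_{-1}, and the
# even-indexed ones equal a*lam/2 + (b/lam - c)/2 - p_{-1}.
# All constants below are illustrative assumptions.
def p_sequence(lams, a, b, c, p_init, n):
    """Iterate p_k = (a/2) lam_{k-1}^2 / lam_k + (b/lam_k - c)/2 - p_{k-1}."""
    p_prev = p_init          # p_{-1}
    lam_prev = lams(-1)      # lambda_{-1}
    out = []
    for k in range(n):
        lam = lams(k)
        p = 0.5 * a * lam_prev ** 2 / lam + 0.5 * (b / lam - c) - p_prev
        out.append(p)        # out[k] stores p_k
        p_prev, lam_prev = p, lam
    return out

lam = 0.1
ps = p_sequence(lambda k: lam, a=2.0, b=3.0, c=1.0, p_init=0.4, n=8)
# odd k: p_k == p_{-1};  even k: p_k == a*lam/2 + (b/lam - c)/2 - p_{-1}
```

The check matches the fixed-stepsize formulas obtained in the proof of Proposition 4.2 below.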

4.1 Non-Euclidean case

Now we provide a fixed stepsize rule for Algorithm 3.2. In contrast to similar stepsize results in the literature, we shall see that Assumption 3.12 furnishes explicit stepsize bounds that are independent of the inertial parameters. In turn, we answer a question of Malitsky and Tam [22, Section 7]; see Remark 4.3.

Proposition 4.2 (fixed stepsize)

Let (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} be a constant BiFRB sequence with λk=λ>0\lambda_{k}=\lambda>0 for all kk. Suppose that σ>2\sigma>2, (Lhσ)σ>1/4(L_{\nabla h}-\sigma)\sigma>1/4 and 0αk<min(1,σ/2)=10\leq\alpha_{k}<\min(1,\sigma/2)=1. Define

λ=(2bc+c)2+4a(b2)2bcc2a>0,\lambda^{*}=\frac{\sqrt{(2bc+c)^{2}+4a(b-2)}-2bc-c}{2a}>0,

where a,b,ca,b,c are specified by (23). If λ<min{λ,(σ1)/[(σ+1)Lg]}\lambda<\min\big{\{}\lambda^{*},(\sigma-1)/[(\sigma+1)L_{\nabla g}]\big{\}}, then there exists p1p_{-1} such that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} generated by (10) and (11) satisfy Assumption 3.12.

Proof. Applying Lemma 4.1 to (24) yields that pk=p1p_{k}=p_{-1} if kk is odd; pk=aλ/2+(b/λc)/2p1p_{k}=a\lambda/2+\left(b/\lambda-c\right)/2-p_{-1} if kk is even; and

M1,k={p1αk2λbc2aλ2,if k is even,bαk2λ(b+1)c2p1,if k is odd.\displaystyle M_{1,k}=\begin{cases*}p_{-1}-\frac{\alpha_{k}}{2\lambda}-\frac{bc}{2}-\frac{a\lambda}{2},&\text{if $k$ is even,}\\ \frac{b-\alpha_{k}}{2\lambda}-\frac{(b+1)c}{2}-p_{-1},&\text{if $k$ is odd.}\end{cases*}

Then

min{p1,a2λ+12(bλc)p1}pkmax{p1,a2λ+12(bλc)p1},\displaystyle\min\bigg{\{}p_{-1},\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)-p_{-1}\bigg{\}}\leq p_{k}\leq\max\bigg{\{}p_{-1},\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)-p_{-1}\bigg{\}}, (25)
lim infkM1,kmin{p112λbc2aλ2,b12λ(b+1)c2p1}.\liminf_{k\to\infty}M_{1,k}\geq\min\bigg{\{}p_{-1}-\frac{1}{2\lambda}-\frac{bc}{2}-\frac{a\lambda}{2},\frac{b-1}{2\lambda}-\frac{(b+1)c}{2}-p_{-1}\bigg{\}}. (26)

To furnish Assumption 3.12, we will prove that

a2λ+12(bλc)>0,\displaystyle\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)>0, (27)
12λ+bc2+aλ2<b12λ(b+1)c2,\displaystyle\frac{1}{2\lambda}+\frac{bc}{2}+\frac{a\lambda}{2}<\frac{b-1}{2\lambda}-\frac{(b+1)c}{2}, (28)
12λ+bc2+aλ2<aλ2+b2λc2.\displaystyle\frac{1}{2\lambda}+\frac{bc}{2}+\frac{a\lambda}{2}<\frac{a\lambda}{2}+\frac{b}{2\lambda}-\frac{c}{2}. (29)

Combining inequalities (27)–(29) yields the existence of some p1p_{-1} such that

0<p1<a2λ+12(bλc) and 12λ+bc2+aλ2<p1<b12λ(b+1)c2,0<p_{-1}<\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)\text{ and }\frac{1}{2\lambda}+\frac{bc}{2}+\frac{a\lambda}{2}<p_{-1}<\frac{b-1}{2\lambda}-\frac{(b+1)c}{2},

which together with (25) and (26) entails Assumption 3.12. First, we establish (27). Observe that

a2λ+12(bλc)>0aλ2cλ+b>0.\frac{a}{2}\lambda+\frac{1}{2}\left(\frac{b}{\lambda}-c\right)>0\Leftrightarrow a\lambda^{2}-c\lambda+b>0.

By assumption, a>0a>0 and c24ab=Lg2(14(Lhσ)σ)<0c^{2}-4ab=L^{2}_{\nabla g}\big{(}1-4(L_{\nabla h}-\sigma)\sigma\big{)}<0. Then the quadratic function λaλ2cλ+b\lambda\mapsto a\lambda^{2}-c\lambda+b has no root and is therefore strictly positive, implying (27). Next we justify (28), which amounts to aλ2(2bc+c)λ+b2>0-a\lambda^{2}-(2bc+c)\lambda+b-2>0. Notice that the discriminant (2bc+c)2+4a(b2)>0(2bc+c)^{2}+4a(b-2)>0 and product of roots (b2)/(a)<0(b-2)/(-a)<0. Then the quadratic function λaλ2(2bc+c)λ+b2\lambda\mapsto-a\lambda^{2}-(2bc+c)\lambda+b-2 has two real roots with opposite signs. In turn,

(0<λ<λ)aλ2(2bc+c)λ+b2>0,(\forall~{}0<\lambda<\lambda^{*})-a\lambda^{2}-(2bc+c)\lambda+b-2>0,

furnishing inequality (28). Finally, the assumption λ<(σ1)/[(σ+1)Lg]=(b1)/[(b+1)c]\lambda<(\sigma-1)/[(\sigma+1)L_{\nabla g}]=(b-1)/[(b+1)c] implies (29) after rearranging.\quad\hfill\blacksquare

Remark 4.3

(i) (Comparison to known results) Compared to a result of Boţ and Csetnek [9, Lemma 3.3] with explicit merit function parameters but implicit stepsize bounds, our stepsize is explicit and independent of the inertial parameters; see [10, Remark 7] for a similarly restrictive result and [22, Corollary 4.4] for a result, in the presence of convexity, in which the stepsize bound and the inertial parameter depend on each other; see also [25, Table 1].

(ii) (Regarding a question of Malitsky and Tam) In [22, Section 7], Malitsky and Tam posed a question regarding whether FRB can be adapted to incorporate a Nesterov-type acceleration. With inertial parameters αk[0,1)\alpha_{k}\in[0,1) independent of the stepsize λ\lambda in Proposition 4.2, as opposed to a fixed inertial parameter α[0,1/3)\alpha\in[0,1/3) dependent on the stepsize λ\lambda in [22, Corollary 4.4], this question is resolved. Indeed, the Nesterov acceleration scheme corresponds to Proposition 4.2 with

(k)αk=tk1tk+1,(\forall k\in\mathbb{N}^{*})~{}\alpha_{k}=\frac{t_{k}-1}{t_{k+1}},

where (k)tk+1=(1+1+4tk2)/2(\forall k\in\mathbb{N}^{*})~{}t_{k+1}=(1+\sqrt{1+4t_{k}^{2}})/2 and t1=1t_{-1}=1; see, e.g., [24].
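For illustration, the recursion for (t_k) and the resulting inertial parameters can be computed directly; the short sketch below (the helper nesterov_alphas is ours, not from [24]) verifies that every α_k lies in [0, 1), as Proposition 4.2 permits.

```python
import math

def nesterov_alphas(n):
    """Compute alpha_k = (t_k - 1)/t_{k+1} for the Nesterov sequence
    t_{k+1} = (1 + sqrt(1 + 4 t_k^2))/2 started from t_{-1} = 1."""
    t = 1.0                   # t_{-1}
    alphas = []
    for _ in range(n):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        alphas.append((t - 1.0) / t_next)
        t = t_next
    return alphas

alphas = nesterov_alphas(50)
# alpha starts at 0 and increases toward 1 while staying strictly below it
```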

(iii) The assumption that (Lhσ)σ>1/4(L_{\nabla h}-\sigma)\sigma>1/4 is not demanding. For instance, let α,β>0\alpha,\beta>0 be such that αβ>1/4\alpha\beta>1/4 and let u:nu:\mathbb{R}^{n}\to\mathbb{R} be a function with 11-Lipschitz gradient. Then h=αu+β2/2h=\alpha u+\beta\left\lVert\cdot\right\rVert^{2}/2 satisfies (Lhσ)σ>1/4(L_{\nabla h}-\sigma)\sigma>1/4.

4.2 Euclidean case

Considering kernel h=2/2h=\left\lVert\cdot\right\rVert^{2}/2, we now turn to stepsize strategies of Algorithm 3.9.

Proposition 4.4 (dynamic stepsize)

Let (λk)k[λ¯,λ¯](\lambda_{k})_{k\in\mathbb{N}^{*}}\subseteq[\underline{\lambda},\overline{\lambda}] be an iFRB stepsize sequence and let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} be the sequence generated by (24) with a=0a=0, b=1b=1 and 0αk<α¯0\leq\alpha_{k}<\bar{\alpha} for some α¯<1/2\bar{\alpha}<1/2. Let (ak)k(a_{k})_{k\in\mathbb{N}} be a sequence of positive real numbers with k=0ak<\sum_{k=0}^{\infty}a_{k}<\infty and let 0<ε<(12α¯)/(3Lg)0<\varepsilon<(1-2\bar{\alpha})/(3L_{\nabla g}). Suppose that the following hold:

(i) The stepsize lower bound λ¯=ε\underline{\lambda}=\varepsilon and upper bound λ¯<ε/(2α¯+3εLg)\overline{\lambda}<\varepsilon/(2\bar{\alpha}+3\varepsilon L_{\nabla g}).

(ii) For all kk\in\mathbb{N}, 0(1/λk)(1/λk1)ak0\leq(1/\lambda_{k})-(1/\lambda_{k-1})\leq a_{k}.

Then there exists p1p_{-1} such that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} generated by (10) and (11) satisfy Assumption 3.12.

Proof. First notice that assumption (i) is consistent, because ε<(12α¯)/(3Lg)2α¯+3εLg<1ε<ε/(2α¯+3εLg)\varepsilon<(1-2\bar{\alpha})/(3L_{\nabla g})\Leftrightarrow 2\bar{\alpha}+3\varepsilon L_{\nabla g}<1\Leftrightarrow\varepsilon<\varepsilon/(2\bar{\alpha}+3\varepsilon L_{\nabla g}). Then, by assumption (i), we have

λ0<ε2α¯+3εLgLg2+α¯2ε<12λ0Lgα¯2ε,\displaystyle\lambda_{0}<\frac{\varepsilon}{2\bar{\alpha}+3\varepsilon L_{\nabla g}}\Leftrightarrow\frac{L_{\nabla g}}{2}+\frac{\bar{\alpha}}{2\varepsilon}<\frac{1}{2\lambda_{0}}-L_{\nabla g}-\frac{\bar{\alpha}}{2\varepsilon},

which implies that there exists p1p_{-1} such that

Lg2+α¯2ε<p1<12λ0Lgα¯2ε.\displaystyle\frac{L_{\nabla g}}{2}+\frac{\bar{\alpha}}{2\varepsilon}<p_{-1}<\frac{1}{2\lambda_{0}}-L_{\nabla g}-\frac{\bar{\alpha}}{2\varepsilon}. (30)

According to Lemma 4.1 and assumption (ii),

p1p2k+1\displaystyle p_{-1}\leq p_{2k+1} =12i=0k(1λ2i+11λ2i)+p112j=0aj+p1,=\frac{1}{2}\sum_{i=0}^{k}\left(\frac{1}{\lambda_{2i+1}}-\frac{1}{\lambda_{2i}}\right)+p_{-1}\leq\frac{1}{2}\sum_{j=0}^{\infty}a_{j}+p_{-1},
12λ0c2p1p2k\displaystyle\frac{1}{2\lambda_{0}}-\frac{c}{2}-p_{-1}\leq p_{2k} =12i=1k(1λ2i1λ2i1)+12λ0c2p112j=0aj+12λ0c2p1,=\frac{1}{2}\sum_{i=1}^{k}\left(\frac{1}{\lambda_{2i}}-\frac{1}{\lambda_{2i-1}}\right)+\frac{1}{2\lambda_{0}}-\frac{c}{2}-p_{-1}\leq\frac{1}{2}\sum_{j=0}^{\infty}a_{j}+\frac{1}{2\lambda_{0}}-\frac{c}{2}-p_{-1},

which together with (30) shows that pkp_{k} is positive and bounded above. By assumption (ii), λkλk1\lambda_{k}\leq\lambda_{k-1} for all kk. Consequently, the sequence (λk)(\lambda_{k}) is convergent and limkλk1/λk=1\lim_{k\to\infty}\lambda_{k-1}/\lambda_{k}=1. It follows from the above inequalities and (30) that

M1,2k+1\displaystyle M_{1,2k+1} =p2k+1α2k+1+Lgλ2k2λ2k+1p1Lg2(λ2kλ2k+1)α¯2εp1Lg2α¯2ε>0,\displaystyle=p_{2k+1}-\frac{\alpha_{2k+1}+L_{\nabla g}\lambda_{2k}}{2\lambda_{2k+1}}\geq p_{-1}-\frac{L_{\nabla g}}{2}\left(\frac{\lambda_{2k}}{\lambda_{2k+1}}\right)-\frac{\bar{\alpha}}{2\varepsilon}\to p_{-1}-\frac{L_{\nabla g}}{2}-\frac{\bar{\alpha}}{2\varepsilon}>0,
M1,2k\displaystyle M_{1,2k} =p2kα2k+Lgλ2k12λ2k\displaystyle=p_{2k}-\frac{\alpha_{2k}+L_{\nabla g}\lambda_{2k-1}}{2\lambda_{2k}}
12λ0Lg2(λ2k1λ2k+1)α¯2εp112λ0Lgα¯2εp1>0.\displaystyle\geq\frac{1}{2\lambda_{0}}-\frac{L_{\nabla g}}{2}\left(\frac{\lambda_{2k-1}}{\lambda_{2k}}+1\right)-\frac{\bar{\alpha}}{2\varepsilon}-p_{-1}\to\frac{1}{2\lambda_{0}}-L_{\nabla g}-\frac{\bar{\alpha}}{2\varepsilon}-p_{-1}>0.

Hence lim infkM1,k>0\liminf_{k\to\infty}M_{1,k}>0.\quad\hfill\blacksquare
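A concrete stepsize sequence satisfying assumptions (i)-(ii) of Proposition 4.4 can be generated as in the sketch below. The constants L∇g, ᾱ and ε are illustrative assumptions, and the increments a_k are scaled geometrically so that 1/λ_k never exceeds 1/ε.

```python
# A dynamic stepsize sequence satisfying (i)-(ii) of Proposition 4.4.
# All constants are illustrative assumptions.
L_g, alpha_bar = 1.0, 0.2
eps = 0.9 * (1 - 2 * alpha_bar) / (3 * L_g)            # 0 < eps < (1-2a)/(3Lg)
lam_up = 0.99 * eps / (2 * alpha_bar + 3 * eps * L_g)  # upper bound, condition (i)

# Summable increments a_k, scaled so the total increase of 1/lambda_k
# stays within the budget 1/eps - 1/lam_up.
budget = 1.0 / eps - 1.0 / lam_up
a_seq = [budget * 2.0 ** (-(k + 1)) for k in range(200)]   # sum < budget

lams, inv = [], 1.0 / lam_up
for a_k in a_seq:
    inv += a_k              # 0 <= 1/lam_k - 1/lam_{k-1} <= a_k, condition (ii)
    lams.append(1.0 / inv)
# every lam_k stays in [eps, lam_up] and the sequence is nonincreasing
```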

The following corollary, which together with Lemma 3.14 provides an inertial variant of [31, Lemma 3.2], follows immediately.

Corollary 4.5 (fixed stepsize)

Let (λk)k(\lambda_{k})_{k\in\mathbb{N}^{*}} be a constant iFRB stepsize sequence with λk=λ>0\lambda_{k}=\lambda>0 for all kk and let (pk)k(p_{k})_{k\in\mathbb{N}^{*}} be the sequence generated by (24) with a=0a=0, b=1b=1 and 0αk<α¯0\leq\alpha_{k}<\bar{\alpha} for some α¯<1/2\bar{\alpha}<1/2. If λ<(12α¯)/(3Lg)\lambda<(1-2\bar{\alpha})/(3L_{\nabla g}), then there exists p1p_{-1} such that (pk)k(p_{k})_{k\in\mathbb{N}^{*}} and (M1,k)k(M_{1,k})_{k\in\mathbb{N}} generated by (10) and (11) satisfy Assumption 3.12.

Remark 4.6

One recovers the stepsize requirement in [31, Lemma 3.2] by setting α¯=0\bar{\alpha}=0.

5 Convergence rates

Having obtained global convergence results, we now turn to convergence rates of both the function value and the actual sequence. To this end, the following lemma will be needed.

Lemma 5.1

[11, Lemma 10] Let (ek)k+(e_{k})_{k\in\mathbb{N}}\subseteq\mathbb{R}_{+} be a decreasing sequence converging to 0. Assume further that there exist natural numbers k0l01k_{0}\geq l_{0}\geq 1 such that for every kk0k\geq k_{0},

ekl0ekCeek2θ,e_{k-l_{0}}-e_{k}\geq C_{e}e_{k}^{2\theta}, (31)

where Ce>0C_{e}>0 is some constant and θ[0,1)\theta\in[0,1). Then the following hold:

(i) If θ=0\theta=0, then (ek)k(e_{k})_{k\in\mathbb{N}} converges in finite steps.

(ii) If θ(0,12]\theta\in\left(0,\frac{1}{2}\right], then there exist Ce,0>0C_{e,0}>0 and Q[0,1)Q\in[0,1) such that for every kk0k\geq k_{0},

0ekCe,0Qk.0\leq e_{k}\leq C_{e,0}Q^{k}.

(iii) If θ(12,1)\theta\in\left(\frac{1}{2},1\right), then there exists Ce,1>0C_{e,1}>0 such that for every kk0+l0k\geq k_{0}+l_{0},

0ekCe,1(kl0+1)12θ1.0\leq e_{k}\leq C_{e,1}(k-l_{0}+1)^{-\frac{1}{2\theta-1}}.
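To see how (31) produces the linear rate in case (ii), consider the model case l_0 = 1 and θ = 1/2 with equality in (31): then e_{k-1} - e_k = C_e e_k gives e_k = e_0/(1 + C_e)^k, i.e. geometric decay with Q = 1/(1 + C_e). A minimal sketch with illustrative constants:

```python
# Model case of Lemma 5.1 with l_0 = 1, theta = 1/2 and equality in (31):
# e_{k-1} - e_k = C_e * e_k, so e_k = e_{k-1} / (1 + C_e) = e_0 * Q**k.
# C_e and e_0 are illustrative constants.
C_e, e0, n = 0.5, 1.0, 30
Q = 1.0 / (1.0 + C_e)

es = [e0]
for _ in range(n):
    es.append(es[-1] / (1.0 + C_e))   # solves e_{k-1} - e_k = C_e e_k exactly
# es[k] agrees with the closed form e0 * Q**k
```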

Recall that a generalized concave KL function has KL exponent θ[0,1)\theta\in[0,1) if an associated desingularizing function can be chosen as φ(t)=ct1θ\varphi(t)=c\cdot t^{1-\theta} for some c>0c>0. We now provide the function value convergence rate of BiFRB under a KL exponent assumption.

Theorem 5.2 (function value convergence rate)

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that all assumptions in Theorem 3.20 are satisfied and let xx^{*} be the limit given in Theorem 3.20(ii). Suppose further that the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) has the generalized concave KL property at z=(x,x)z^{*}=(x^{*},x^{*}) with KL exponent θ[0,1)\theta\in[0,1). Define ek=Hpk(zk)F(x)e_{k}=H_{p_{k}}(z_{k})-F(x^{*}) for kk\in\mathbb{N}. Then (ek)k(e_{k})_{k\in\mathbb{N}} converges to 0 and the following hold:

(i) If $\theta=0$, then $(e_{k})_{k\in\mathbb{N}}$ converges in finitely many steps.

(ii) If θ(0,1/2]\theta\in\left(0,1/2\right], then there exist c^1>0\hat{c}_{1}>0 and Q^1[0,1)\hat{Q}_{1}\in[0,1) such that for kk sufficiently large,

ekc^1Q^1k.e_{k}\leq\hat{c}_{1}\hat{Q}_{1}^{k}.

(iii) If θ(1/2,1)\theta\in\left(1/2,1\right), then there exists c^2>0\hat{c}_{2}>0 such that for kk sufficiently large,

ekc^2k12θ1.e_{k}\leq\hat{c}_{2}k^{-\frac{1}{2\theta-1}}.

Proof. Recall from Lemma 3.14(ii) and Theorem 3.17(i) that the sequence $(e_{k})_{k\in\mathbb{N}}$ is decreasing and converges to 0. By the KL exponent assumption, there exist $c>0$ and $k_{0}\in\mathbb{N}$ such that for every $k\geq k_{0}$,

dist(0,Hp¯(zk))(Hpk(zk)Hp¯(z))θc(1θ)=ekθc(1θ).\operatorname{dist}\big{(}0,\partial H_{\bar{p}}(z_{k})\big{)}\geq\frac{\left(H_{p_{k}}(z_{k})-H_{\bar{p}}(z^{*})\right)^{\theta}}{c(1-\theta)}=\frac{e_{k}^{\theta}}{c(1-\theta)}.

Assume without loss of generality that $M_{1,k}\geq\liminf_{k\to\infty}M_{1,k}/2$ for every $k$. Then Lemmas 3.14 and 3.16 imply

$e_{k-1}-e_{k}=H_{p_{k-1}}(z_{k-1})-H_{p_{k}}(z_{k})\geq M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}\geq\frac{M_{1,k}}{M_{2,k}^{2}}\operatorname{dist}(0,\partial H_{\bar{p}}(z_{k}))^{2}\geq C_{e}e_{k}^{2\theta},$

where

Ce=lim infkM1,k4c2(1θ)2max{(Lhλ¯+Lg+6p¯)2,(LhLgλ¯+1λ¯)2}.C_{e}=\frac{\liminf_{k\to\infty}M_{1,k}}{4c^{2}(1-\theta)^{2}\max\bigg{\{}\left(\frac{L_{\nabla h}}{\underline{\lambda}}+L_{\nabla g}+6\bar{p}\right)^{2},\left(\frac{L_{\nabla h}L_{\nabla g}\overline{\lambda}+1}{\underline{\lambda}}\right)^{2}\bigg{\}}}.

Applying Lemma 5.1 completes the proof.\quad\hfill\blacksquare

We now develop a sequential convergence rate based on Theorem 5.2.

Theorem 5.3 (sequential convergence rate)

Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB and define zk=(xk+1,xk)z_{k}=(x_{k+1},x_{k}) for kk\in\mathbb{N}^{*}. Suppose that all assumptions in Theorem 3.20 are satisfied and let xx^{*} be the limit given in Theorem 3.20(ii). Suppose further that the dominating BiFRB merit function Hp¯H_{\bar{p}} (recall Definition 3.10) has the generalized concave KL property at z=(x,x)z^{*}=(x^{*},x^{*}) with KL exponent θ[0,1)\theta\in[0,1). Then the following hold:

(i) If $\theta=0$, then $z_{k}$ converges to $z^{*}$ in finitely many steps.

(ii) If θ(0,1/2]\theta\in\left(0,1/2\right], then there exist c1>0c_{1}>0 and Q1[0,1)Q_{1}\in[0,1) such that for kk sufficiently large

zkzc1Q1k.\left\lVert z_{k}-z^{*}\right\rVert\leq c_{1}Q_{1}^{k}.

(iii) If $\theta\in\left(1/2,1\right)$, then there exists $c_{2}>0$ such that for $k$ sufficiently large,

zkzc2k1θ2θ1.\left\lVert z_{k}-z^{*}\right\rVert\leq c_{2}k^{-\frac{1-\theta}{2\theta-1}}.

Proof. For simplicity, define for kk\in\mathbb{N}^{*}

ek=Hpk(zk)F(x),δk=i=kzi+1zi.e_{k}=H_{p_{k}}(z_{k})-F(x^{*}),\delta_{k}=\sum_{i=k}^{\infty}\left\lVert z_{i+1}-z_{i}\right\rVert.

Then ek0e_{k}\to 0 and zkzδk\left\lVert z_{k}-z^{*}\right\rVert\leq\delta_{k}. Without loss of generality (recall Lemma 3.14), assume that ek[0,1)e_{k}\in[0,1) and ek+1eke_{k+1}\leq e_{k} for kk\in\mathbb{N}^{*}. To obtain the desired results, it suffices to estimate δk\delta_{k}.

Assume without loss of generality that $M_{1,k}\geq\liminf_{k\to\infty}M_{1,k}/2$. Invoking Lemma 3.14(i) and Theorem 3.17(i) yields that $\left(\liminf_{k\to\infty}M_{1,k}/2\right)\left\lVert z_{k}-z_{k-1}\right\rVert^{2}\leq M_{1,k}\left\lVert z_{k}-z_{k-1}\right\rVert^{2}\leq H_{p_{k-1}}(z_{k-1})-H_{p_{k}}(z_{k})\leq H_{p_{k-1}}(z_{k-1})-F(x^{*})=e_{k-1}$, which in turn implies that

zkzk1dek1,\left\lVert z_{k}-z_{k-1}\right\rVert\leq d\sqrt{e_{k-1}}, (32)

where d=2/lim infkM1,kd=\sqrt{2/\liminf_{k\to\infty}M_{1,k}}. Moreover, recall from Theorem 3.20(i) that there exists k0k_{0} such that for kk0+1k\geq k_{0}+1

δkzkzk1+Cφ~(ek)zkzk1+Cφ~(ek1),\delta_{k}\leq\left\lVert z_{k}-z_{k-1}\right\rVert+C\tilde{\varphi}(e_{k})\leq\left\lVert z_{k}-z_{k-1}\right\rVert+C\tilde{\varphi}(e_{k-1}), (33)

where the last inequality holds because $e_{k}\leq e_{k-1}$. By the KL exponent assumption, assume without loss of generality that $H_{\bar{p}}$ is associated with the desingularizing function $\varphi(t)=C^{-1}t^{1-\theta}$. Thus, combined with (32), inequality (33) yields

δkdek1+ek11θ.\delta_{k}\leq d\sqrt{e_{k-1}}+e_{k-1}^{1-\theta}. (34)

Case 1 ($\theta=0$): Appealing to Theorem 5.2, the sequence $(e_{k})_{k\in\mathbb{N}}$ converges to 0 in finitely many steps. Thus (34) implies that $\delta_{k}\to 0$ in finitely many steps, and hence $(z_{k})_{k\in\mathbb{N}^{*}}$ converges to $z^{*}$ in finitely many steps.

Case 2 (0<θ120<\theta\leq\frac{1}{2}): Clearly 121θ<1\frac{1}{2}\leq 1-\theta<1, therefore ek1ek11θ\sqrt{e_{k-1}}\geq e_{k-1}^{1-\theta}. Inequality (34) and Theorem 5.2 imply that for kk sufficiently large

δkdek1+ek1=(1+d)ek1(1+d)c^1Q^1k1.\delta_{k}\leq d\sqrt{e_{k-1}}+\sqrt{e_{k-1}}=(1+d)\sqrt{e_{k-1}}\leq(1+d)\sqrt{\hat{c}_{1}\hat{Q}_{1}^{k-1}}.

Case 3 (12<θ<1\frac{1}{2}<\theta<1): Note that ek1ek11θ\sqrt{e_{k-1}}\leq e_{k-1}^{1-\theta}. Then appealing to (34) and Theorem 5.2 yields for kk sufficiently large

$\delta_{k}\leq de_{k-1}^{1-\theta}+e_{k-1}^{1-\theta}=(1+d)e_{k-1}^{1-\theta}\leq(1+d)\hat{c}_{2}^{1-\theta}(k-1)^{-\frac{1-\theta}{2\theta-1}},$

from which the desired result readily follows.\quad\hfill\blacksquare

Example 5.4 (nonconvex feasibility problem)

Let CnC\subseteq\mathbb{R}^{n} be a nonempty, closed and convex set and let DnD\subseteq\mathbb{R}^{n} be nonempty and closed. Suppose that CDC\cap D\neq\emptyset and either CC or DD is compact. Assume further that both CC and DD are semialgebraic. Consider the minimization problem (2) with f=δDf=\delta_{D} and g=dist2(,C)/2g=\operatorname{dist}^{2}(\cdot,C)/2. Let (xk)k(x_{k})_{k\in\mathbb{N}^{*}} be a sequence generated by BiFRB. Suppose that Assumption 3.12 holds. Then the following hold:

(i) There exists xnx^{*}\in\mathbb{R}^{n} such that xkxx_{k}\to x^{*} and 0F(x)0\in\partial F(x^{*}).

(ii) Suppose additionally that

NC(ProjC(x))(ND(x))={0}.N_{C}(\operatorname{Proj}_{C}(x^{*}))\cap\big{(}-N_{D}(x^{*})\big{)}=\{0\}. (35)

Then xCDx^{*}\in C\cap D. Moreover, there exist Q1(0,1)Q_{1}\in(0,1) and c1>0c_{1}>0 such that

xkxc1Q1k,\left\lVert x_{k}-x^{*}\right\rVert\leq c_{1}Q_{1}^{k},

for kk sufficiently large.

Proof. (i) By the compactness assumption, the function $f+g$ is coercive. Hence Theorem 3.18 implies that $(x_{k})_{k\in\mathbb{N}^{*}}$ is bounded. Since the sets $C$ and $D$ are semialgebraic, so are the functions $f$ and $g$; see [1, Section 4.3] and [2, Lemma 2.3]. The claim then follows from Corollary 3.21.

(ii) Taking the fact that 0F(x)0\in\partial F(x^{*}) into account and applying the subdifferential sum rule [28, Exercise 8.8], one gets

xProjC(x)ND(x).x^{*}-\operatorname{Proj}_{C}(x^{*})\in-N_{D}(x^{*}).

The constraint qualification (35) then implies that x=ProjC(x)x^{*}=\operatorname{Proj}_{C}(x^{*}) and consequently xCx^{*}\in C. The set DD is assumed to be closed, thus xDx^{*}\in D. Then a direct application of [13, Theorem 5] guarantees that FF has KL exponent θ=1/2\theta=1/2 at xx^{*}. The desired result follows immediately from Theorem 5.3. \quad\hfill\blacksquare

6 Bregman proximal subproblem formulae

Let $\alpha\geq 0$ and $\beta>0$. Fix $\omega\in\mathbb{R}^{n}$ and $\lambda>0$. In the remainder of this section, let

h(x)=α1+x2+β2x2.h(x)=\alpha\sqrt{1+\left\lVert x\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x\right\rVert^{2}.

Define for unu\in\mathbb{R}^{n}, pλ(u)=λωh(u)p_{\lambda}(u)=\lambda\omega-\nabla h(u) and

Tλ(u)\displaystyle T_{\lambda}(u) =argminxn{f(x)+xu,ω+1λDh(x,u)}\displaystyle=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x-u,\omega\rangle+\frac{1}{\lambda}D_{h}(x,u)\bigg{\}} (36)
=argminxn{f(x)+x,ω+1λh(x)1λh(u),x}\displaystyle=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}f(x)+\langle x,\omega\rangle+\frac{1}{\lambda}h(x)-\frac{1}{\lambda}\langle\nabla h(u),x\rangle\bigg{\}}
=argminxn{λf(x)+x,pλ(u)+h(x)}.\displaystyle=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\bigg{\{}\lambda f(x)+\langle x,p_{\lambda}(u)\rangle+h(x)\bigg{\}}. (37)

Clearly Assumption 3.6(ii) holds with $\sigma=\beta$ and $L_{\nabla h}=\alpha+\beta$, and the Bregman proximal subproblem (9) corresponds to (36) with $\omega=\alpha_{k}(x_{k-1}-x_{k})/\lambda_{k}+\nabla g(x_{k})$, $u=y_{k}$ and $\lambda=\lambda_{k}$. Note that (36) covers the subproblems in [9, Algorithm 3.1] and [10, Algorithm 1] as well.

In this section, we develop formulae of Tλ(u)T_{\lambda}(u) with the kernel hh chosen as above, supplementing not only Algorithm 3.2 but also [9, Algorithm 3.1] and [10, Algorithm 1].

6.1 Formula of l0l_{0}-constrained problems

Let $D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}$ for an integer $r\leq n$ and a real number $R>0$, where $(\forall x\in\mathbb{R}^{n})~\left\lVert x\right\rVert_{0}=\sum_{i=1}^{n}|\operatorname{sgn}(x_{i})|$ with $\operatorname{sgn}(0)=0$. In this subsection, we consider the minimization problem

minxDg(x),\min_{x\in D}g(x),

which is (2) with f=δDf=\delta_{D}. Recall that the Hard-threshold of xnx\in\mathbb{R}^{n} with parameter rr, denoted by Hr(x)H_{r}(x), is given by

(i{1,,n})(Hr(x))i={xi, if |xi| belongs to the r largest among |x1|,,|xn|,0, otherwise.(\forall i\in\{1,\ldots,n\})~{}\big{(}H_{r}(x)\big{)}_{i}=\begin{cases}x_{i},&\text{ if $|x_{i}|$ belongs to the $r$ largest among $|x_{1}|,\ldots,|x_{n}|$},\\ 0,&\text{ otherwise}.\end{cases} (38)

Immediately, Hr(cx)=cHr(x)H_{r}(cx)=cH_{r}(x) for c0c\neq 0. The following lemma will be instrumental.
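For concreteness, the hard-threshold operator (38) and the scaling identity $H_{r}(cx)=cH_{r}(x)$ for $c\neq 0$ can be sketched as follows; this is a minimal NumPy illustration, with ties among the $r$ largest magnitudes broken arbitrarily, as (38) permits.

```python
import numpy as np

def hard_threshold(x, r):
    """Keep the r entries of largest magnitude and zero out the rest (cf. (38))."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-r:]   # indices of the r largest |x_i|
    out[keep] = x[keep]
    return out

x = np.array([3.0, -1.0, 2.0, 0.0])
assert np.allclose(hard_threshold(x, 2), [3.0, 0.0, 2.0, 0.0])
# Scaling identity H_r(c x) = c H_r(x) for c != 0:
assert np.allclose(hard_threshold(-2 * x, 2), -2 * hard_threshold(x, 2))
```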

Lemma 6.1

[21, Proposition 4.3] Given 0pn0\neq p\in\mathbb{R}^{n} and a positive integer r<nr<n, we have

maxxn{p,x:x=1,x0r}=Hr(p),\max_{x\in\mathbb{R}^{n}}\big{\{}\langle p,x\rangle:\left\lVert x\right\rVert=1,~{}\left\lVert x\right\rVert_{0}\leq r\big{\}}=\left\lVert H_{r}(p)\right\rVert,

with optimal value attained at x=Hr(p)1Hr(p)x^{*}=\left\lVert H_{r}(p)\right\rVert^{-1}H_{r}(p).
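Lemma 6.1 can be checked by brute force in small dimensions: for each support of size $r$, the maximum of $\langle p,x\rangle$ over unit vectors supported there is the norm of $p$ restricted to that support, so the overall maximum is the norm of the $r$ largest-magnitude entries of $p$. A hedged sketch with random data:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(6)
r = 2

# Brute force: for a fixed support S with |S| = r, the best unit vector is
# p_S / ||p_S|| with value ||p_S||; enlarging the support only helps, so
# maximizing over supports of size exactly r suffices.
brute = max(np.linalg.norm(p[list(S)]) for S in itertools.combinations(range(6), r))

# Closed form from Lemma 6.1: ||H_r(p)||, the norm of the r largest-magnitude entries.
keep = np.argsort(np.abs(p))[-r:]
closed = np.linalg.norm(p[keep])

assert abs(brute - closed) < 1e-12
```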

Following a route similar to that of [8, Proposition 5.2], we now prove the desired formula.

Proposition 6.2 (subproblem formula)

Let $f=\delta_{D}$, where $D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}$ for an integer $r\leq n$ and a real number $R>0$. Define

x={0,if h(u)=λω,tHr(pλ(u))1Hr(pλ(u)),if h(u)λω,x^{*}=\begin{cases}0,&\text{if }\nabla h(u)=\lambda\omega,\\ -t^{*}\left\lVert H_{r}\big{(}p_{\lambda}(u)\big{)}\right\rVert^{-1}H_{r}\big{(}p_{\lambda}(u)\big{)},&\text{if }\nabla h(u)\neq\lambda\omega,\end{cases} (39)

where tt^{*} is the unique solution of

$\min_{0\leq t\leq R}\bigg\{\varphi(t)=\alpha\sqrt{1+t^{2}}+\frac{\beta}{2}t^{2}-\left\lVert H_{r}(p_{\lambda}(u))\right\rVert t\bigg\},$ (40)

which satisfies $t^{*}=R$ if $\left\lVert H_{r}(p_{\lambda}(u))\right\rVert\geq\alpha R(1+R^{2})^{-1/2}+\beta R$; otherwise $t^{*}\in(0,R)$ is the unique solution of

α(1+t2)1/2t+βtHr(pλ(u))=0.\alpha(1+t^{2})^{-1/2}t+\beta t-\left\lVert H_{r}(p_{\lambda}(u))\right\rVert=0.

Then xTλ(u)x^{*}\in T_{\lambda}(u).

Proof. Set p=pλ(u)p=p_{\lambda}(u) for simplicity. If h(u)=λω\nabla h(u)=\lambda\omega, then p=0p=0 and clearly Tλ(u)=argminxn{h(x):x0r,xR}=0T_{\lambda}(u)=\mathop{\rm argmin}\limits_{x\in\mathbb{R}^{n}}\{h(x):\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}=0. Now we consider h(u)λω\nabla h(u)\neq\lambda\omega, in which case p0p\neq 0. Invoking Lemma 6.1, one has for t>0t>0

$\min_{x\in\mathbb{R}^{n}}\{\langle x,p\rangle:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert=t\}=-\max_{x\in\mathbb{R}^{n}}\{\langle x/t,-tp\rangle:\left\lVert x/t\right\rVert_{0}\leq r,\left\lVert x/t\right\rVert=1\}=-\left\lVert H_{r}(-tp)\right\rVert=-t\left\lVert H_{r}(p)\right\rVert,$ (41)

attained at

x^=x^(t)=tHr(tp)1Hr(tp)=tHr(p)1Hr(p).\hat{x}=\hat{x}(t)=t\left\lVert H_{r}(-tp)\right\rVert^{-1}H_{r}(-tp)=-t\left\lVert H_{r}(p)\right\rVert^{-1}H_{r}(p). (42)

It is easy to see that the optimal value (41) and the solution (42) still hold when $t=0$.

Next we show that xTλ(u)x^{*}\in T_{\lambda}(u). By (37), Tλ(u)T_{\lambda}(u) is the solution set to

minxn{x,p+α1+x2+β2x2:x0r,xR}.\min_{x\in\mathbb{R}^{n}}\bigg{\{}\langle x,p\rangle+\alpha\sqrt{1+\left\lVert x\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x\right\rVert^{2}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\bigg{\}}. (43)

Consider an arbitrary feasible xx of (43) and let tx=xt_{x}=\left\lVert x\right\rVert. Then 0txR0\leq t_{x}\leq R and

p,x+α1+x2+β2x2\displaystyle\langle p,x^{*}\rangle+\alpha\sqrt{1+\left\lVert x^{*}\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x^{*}\right\rVert^{2} =tHr(p)+α1+(t)2+β2(t)2\displaystyle=-t^{*}\left\lVert H_{r}(p)\right\rVert+\alpha\sqrt{1+(t^{*})^{2}}+\frac{\beta}{2}(t^{*})^{2}
=φ(t)φ(tx)=Hr(p)tx+α1+tx2+β2tx2\displaystyle=\varphi(t^{*})\leq\varphi(t_{x})=-\left\lVert H_{r}(p)\right\rVert t_{x}+\alpha\sqrt{1+t_{x}^{2}}+\frac{\beta}{2}t_{x}^{2}
p,x+α1+x2+β2x2,\displaystyle\leq\langle p,x\rangle+\alpha\sqrt{1+\left\lVert x\right\rVert^{2}}+\frac{\beta}{2}\left\lVert x\right\rVert^{2},

where the first and last inequalities hold due to (40) and (41), respectively.

Finally, we turn to (40). Noticing the strong convexity of $\varphi$ and applying the optimality condition, one gets

$-\alpha(1+(t^{*})^{2})^{-1/2}t^{*}-\beta t^{*}+\left\lVert H_{r}(p)\right\rVert=-\varphi^{\prime}(t^{*})\in N_{[0,R]}(t^{*}),$

so

$t^{*}=R\Leftrightarrow-\varphi^{\prime}(R)\in N_{[0,R]}(R)=[0,\infty)\Leftrightarrow\varphi^{\prime}(R)\leq 0\Leftrightarrow\left\lVert H_{r}(p)\right\rVert\geq\alpha R(1+R^{2})^{-1/2}+\beta R,$

and

t=0φ(0)N[0,R](0)=(,0]Hr(p)0Hr(p)=0,t^{*}=0\Leftrightarrow-\varphi^{\prime}(0)\in N_{[0,R]}(0)=(-\infty,0]\Leftrightarrow\left\lVert H_{r}(p)\right\rVert\leq 0\Leftrightarrow H_{r}(p)=0,

which never occurs when $p\neq 0$. If $0<\left\lVert H_{r}(p)\right\rVert<\alpha R(1+R^{2})^{-1/2}+\beta R$, then $\varphi^{\prime}(0)<0$ and $\varphi^{\prime}(R)>0$, meaning that there exists a unique $t^{*}\in(0,R)$ such that $\varphi^{\prime}(t^{*})=0$, due to the strong convexity of $\varphi$. $\quad\hfill\blacksquare$

Remark 6.3

When $0<\left\lVert H_{r}(p_{\lambda}(u))\right\rVert<\alpha R(1+R^{2})^{-1/2}+\beta R$, $t^{*}$ in Proposition 6.2 is indeed one of the four roots of the quartic equation

β2t42βHr(pλ(u))t3+(Hr(pλ(u))2+β2α2)t22βHr(pλ(u))t+Hr(pλ(u))2=0.\beta^{2}t^{4}-2\beta\left\lVert H_{r}(p_{\lambda}(u))\right\rVert t^{3}+\left(\left\lVert H_{r}(p_{\lambda}(u))\right\rVert^{2}+\beta^{2}-\alpha^{2}\right)t^{2}-2\beta\left\lVert H_{r}(p_{\lambda}(u))\right\rVert t+\left\lVert H_{r}(p_{\lambda}(u))\right\rVert^{2}=0.

Although quartic equations admit closed-form solutions, it is virtually impossible to determine beforehand the branch of solutions to which $t^{*}$ belongs, due to the parameters $\alpha$, $\beta$ and $\left\lVert H_{r}(p_{\lambda}(u))\right\rVert$. To find $t^{*}$ efficiently, one may employ numerical root-finding algorithms: for instance, apply the bisection method to identify a subinterval of $[0,R]$ in which the desired root $t^{*}$ lies, then appeal to Newton's method; see, e.g., [12, Chapter 2].
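The bisection-then-Newton strategy just described can be sketched as follows; this is a hedged illustration with arbitrarily chosen sample parameters $\alpha$, $\beta$, $h=\left\lVert H_{r}(p_{\lambda}(u))\right\rVert$ and $R$ for which the root lies in the interior of $[0,R]$.

```python
import math

def solve_t_star(alpha, beta, h, R, tol=1e-12):
    """Find t* in (0, R) with phi'(t) = alpha*t/sqrt(1+t^2) + beta*t - h = 0,
    assuming phi'(0) = -h < 0 < phi'(R), so that the root is interior:
    bisection to localize the root, then Newton refinement."""
    dphi = lambda t: alpha * t / math.sqrt(1 + t * t) + beta * t - h
    lo, hi = 0.0, R
    for _ in range(20):                      # bisection: shrink the bracket
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if dphi(mid) < 0 else (lo, mid)
    t = 0.5 * (lo + hi)
    d2phi = lambda t: alpha / (1 + t * t) ** 1.5 + beta   # phi'' >= beta > 0
    for _ in range(50):                      # Newton's method
        step = dphi(t) / d2phi(t)
        t -= step
        if abs(step) < tol:
            break
    return t

t_star = solve_t_star(alpha=1.0, beta=2.0, h=1.0, R=5.0)
assert 0 < t_star < 5.0
assert abs(1.0 * t_star / math.sqrt(1 + t_star**2) + 2.0 * t_star - 1.0) < 1e-10
```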

6.2 Formula of problems penalized by a positively homogeneous convex function

Recall that $f:\mathbb{R}^{n}\to\overline{\mathbb{R}}$ is positively homogeneous if $f(tx)=tf(x)$ for every $x\in\mathbb{R}^{n}$ and $t>0$. In this subsection, we provide a subproblem formula for problems of the form

minxnf(x)+g(x),\min_{x\in\mathbb{R}^{n}}f(x)+g(x),

where $f:\mathbb{R}^{n}\to\overline{\mathbb{R}}$ is convex and positively homogeneous with a known proximal mapping; this is a special case of problem (2).

The result below generalizes the technique of Bolte-Sabach-Teboulle-Vaisbourd [8, Proposition 5.1]; however, one cannot recover their result, as our kernel is different.

Proposition 6.4

Let f:n¯f:\mathbb{R}^{n}\to\overline{\mathbb{R}} be proper, lsc and convex. Suppose that ff is positively homogeneous. Then the following hold:

(i) (xdomf)(\forall x\in\operatorname{dom}\partial f) (t>0)(\forall t>0) f(x)=f(tx)\partial f(x)=\partial f(tx).

(ii) (subproblem formula) (un)(\forall u\in\mathbb{R}^{n}) Tλ(u)=tProxλf(pλ(u))T_{\lambda}(u)=t^{*}\operatorname{Prox}_{\lambda f}(-p_{\lambda}(u)), where tt^{*} is the unique root of the equation

$1-\alpha t\big(1+t^{2}\left\lVert\operatorname{Prox}_{\lambda f}(-p_{\lambda}(u))\right\rVert^{2}\big)^{-1/2}-\beta t=0.$

Proof. (i) vf(x)(yn)f(y)f(x)+v,yx(yn)(t>0)f(ty)f(tx)+v,tytxvf(tx)v\in\partial f(x)\Leftrightarrow(\forall y\in\mathbb{R}^{n})~{}f(y)\geq f(x)+\langle v,y-x\rangle\Leftrightarrow(\forall y\in\mathbb{R}^{n})~{}(\forall t>0)~{}f(ty)\geq f(tx)+\langle v,ty-tx\rangle\Leftrightarrow v\in\partial f(tx).

(ii) For simplicity, let $p=p_{\lambda}(u)$. Note that $T_{\lambda}(u)$ is single-valued by strong convexity. Then invoking the optimality condition and the subdifferential sum rule yields

$x^{*}=T_{\lambda}(u)\Leftrightarrow 0\in\lambda\partial f(x^{*})+p+\nabla h(x^{*})=\lambda\partial f(x^{*})+p+[\alpha(1+\left\lVert x^{*}\right\rVert^{2})^{-1/2}+\beta]x^{*}.$

Define $t^{*}=[\alpha(1+\left\lVert x^{*}\right\rVert^{2})^{-1/2}+\beta]^{-1}$ and $v=x^{*}/t^{*}$. Then $t^{*}>0$ and the above inclusion together with statement (i) implies that

0\displaystyle 0 λf(x)+p+v=λf(tv)+p+v=λf(v)+v+p\displaystyle\in\lambda\partial f(x^{*})+p+v=\lambda\partial f(t^{*}v)+p+v=\lambda\partial f(v)+v+p
p\displaystyle\Leftrightarrow-p (Id+λf)(v)v=(Id+λf)1(p)=Proxλf(p).\displaystyle\in(\operatorname{Id}+\lambda\partial f)(v)\Leftrightarrow v=(\operatorname{Id}+\lambda\partial f)^{-1}(-p)=\operatorname{Prox}_{\lambda f}(-p).

In turn, tt^{*} satisfies

v[1αt(1+(t)2v2)1/2βt]=v[α(1+(t)2v2)1/2+β]tv=vxt=vv=0.v[1-\alpha t^{*}(1+(t^{*})^{2}\left\lVert v\right\rVert^{2})^{-1/2}-\beta t^{*}]=v-[\alpha(1+(t^{*})^{2}\left\lVert v\right\rVert^{2})^{-1/2}+\beta]t^{*}v=v-\frac{x^{*}}{t^{*}}=v-v=0.

If $v\neq 0$, then clearly $1-\alpha t^{*}(1+(t^{*})^{2}\left\lVert v\right\rVert^{2})^{-1/2}-\beta t^{*}=0$ as claimed; if $v=0$, then $x^{*}=0$ and $t^{*}=1/(\alpha+\beta)$, which is consistent with statement (ii). To see the uniqueness, let $\varphi(t)=\alpha(1+\left\lVert v\right\rVert^{2}t^{2})^{1/2}/\left\lVert v\right\rVert^{2}+\beta t^{2}/2-t$ for $t\in\mathbb{R}$, which is strongly convex. Simple calculus shows that $t^{*}$ is a stationary point of $\varphi$ and is therefore unique. $\quad\hfill\blacksquare$

Remark 6.5

Similarly to Proposition 6.2 and Remark 6.3, one may find tt^{*} efficiently by appealing to root-finding algorithms.

Equipped with Proposition 6.4, we now turn to concrete problems. Recall that the Soft-threshold of xnx\in\mathbb{R}^{n} with parameter λ\lambda is Sλ(x)=Proxλ1(x)S_{\lambda}(x)=\operatorname{Prox}_{\lambda\left\lVert\cdot\right\rVert_{1}}(x), which satisfies

(i{1,,n})(Sλ(x))i=max(|xi|λ,0)sgn(xi),(\forall i\in\{1,\ldots,n\})~{}(S_{\lambda}(x))_{i}=\max(|x_{i}|-\lambda,0)\operatorname{sgn}(x_{i}),

see, e.g., [5, Example 6.8].

Corollary 6.6 (l1l_{1}-penalized problem)

Let f=1f=\left\lVert\cdot\right\rVert_{1}. Then Tλ(u)=tSλ(pλ(u))T_{\lambda}(u)=-t^{*}S_{\lambda}\left(p_{\lambda}(u)\right), where tt^{*} is the unique solution of 1αt(1+t2Sλ(pλ(u))2)1/2βt=01-\alpha t\left(1+t^{2}\left\lVert S_{\lambda}\left(p_{\lambda}(u)\right)\right\rVert^{2}\right)^{-1/2}-\beta t=0.
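The formula of Corollary 6.6 can be checked numerically: compute $v=\operatorname{Prox}_{\lambda\left\lVert\cdot\right\rVert_{1}}(-p_{\lambda}(u))=S_{\lambda}(-p_{\lambda}(u))$, solve the scalar equation for $t^{*}$ by bisection, and verify the first-order optimality condition $0\in\lambda\partial\left\lVert\cdot\right\rVert_{1}(x^{*})+p_{\lambda}(u)+\nabla h(x^{*})$. The sketch below uses arbitrarily chosen sample parameters standing in for $p_{\lambda}(u)$, $\alpha$, $\beta$ and $\lambda$.

```python
import numpy as np

alpha, beta, lam = 0.5, 1.0, 0.3
p = np.array([1.0, -0.1, 0.8, 0.05])          # plays the role of p_lambda(u)

soft = lambda x, t: np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
v = soft(-p, lam)                              # Prox_{lam*||.||_1}(-p)
s2 = np.dot(v, v)

# Scalar equation from Corollary 6.6: 1 - alpha*t*(1+t^2*||v||^2)^(-1/2) - beta*t = 0;
# the left-hand side is strictly decreasing in t, so bisection applies.
g = lambda t: 1 - alpha * t / np.sqrt(1 + t * t * s2) - beta * t
lo, hi = 0.0, 1.0 / beta + 1.0                 # g(lo) = 1 > 0 > g(hi)
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if g(mid) > 0 else (lo, mid)
t_star = 0.5 * (lo + hi)
x_star = t_star * v                            # = -t* S_lam(p), the claimed minimizer

# First-order optimality: lam*sign(x*_i) + p_i + c*x*_i = 0 where x*_i != 0,
# and |p_i| <= lam where x*_i = 0, with c = alpha*(1+||x*||^2)^(-1/2) + beta.
c = alpha / np.sqrt(1 + np.dot(x_star, x_star)) + beta
res = lam * np.sign(x_star) + p + c * x_star
assert np.all(np.abs(res[x_star != 0]) < 1e-8)
assert np.all(np.abs(p[x_star == 0]) <= lam)
```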

Corollary 6.7 (ll_{\infty}-penalized problem)

Let $f=\left\lVert\cdot\right\rVert_{\infty}$. Then $T_{\lambda}(u)=-t^{*}v$, where

v=pλ(u)λProj(pλ(u)/λ;𝔹1),v=p_{\lambda}(u)-\lambda\operatorname{Proj}\left(p_{\lambda}(u)/\lambda;\mathbb{B}_{\left\lVert\cdot\right\rVert_{1}}\right),

and tt^{*} is the unique solution of 1αt(1+t2v2)1/2βt=01-\alpha t(1+t^{2}\left\lVert v\right\rVert^{2})^{-1/2}-\beta t=0.

Remark 6.8

The projection Proj(;𝔹1)\operatorname{Proj}(\cdot;\mathbb{B}_{\left\lVert\cdot\right\rVert_{1}}) admits a closed-form formula; see, e.g., [5, Example 6.33].
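The closed-form projection mentioned in Remark 6.8 is typically computed by the standard sort-based scheme (see, e.g., [5, Example 6.33]); a minimal sketch for the ball $\{z:\left\lVert z\right\rVert_{1}\leq\rho\}$ follows, hedged as one common implementation rather than the formula from [5] verbatim.

```python
import numpy as np

def proj_l1_ball(x, radius=1.0):
    """Euclidean projection onto {z : ||z||_1 <= radius} via sorting."""
    x = np.asarray(x, dtype=float)
    if np.abs(x).sum() <= radius:
        return x.copy()                             # already inside the ball
    u = np.sort(np.abs(x))[::-1]                    # magnitudes, descending
    css = np.cumsum(u)
    j = np.arange(1, len(x) + 1)
    rho = np.max(j[u - (css - radius) / j > 0])     # largest feasible index
    tau = (css[rho - 1] - radius) / rho             # soft-threshold level
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

assert np.allclose(proj_l1_ball([3.0, 0.0, 0.0]), [1.0, 0.0, 0.0])
assert np.allclose(proj_l1_ball([1.0, 1.0]), [0.5, 0.5])
assert np.allclose(proj_l1_ball([0.2, -0.3]), [0.2, -0.3])  # interior point is fixed
```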

7 Numerical experiments

In the remainder, let Am×nA\in\mathbb{R}^{m\times n}, bmb\in\mathbb{R}^{m}, r\{0}r\in\mathbb{N}\backslash\{0\}, and R>0R>0. Define C={xn:Ax=b}C=\{x\in\mathbb{R}^{n}:Ax=b\} and D={xn:x0r,xR}D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r,\left\lVert x\right\rVert\leq R\}. Clearly CC is semialgebraic. The set DD can be represented as

D={xn:x0r}{xn:x2R2},D=\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert_{0}\leq r\}\cap\{x\in\mathbb{R}^{n}:\left\lVert x\right\rVert^{2}\leq R^{2}\},

which is an intersection of semialgebraic sets (see, e.g., [4, Formula 27(d)]) and is thus semialgebraic. Consider

minxD12dist2(x,C),\min_{x\in D}\frac{1}{2}\operatorname{dist}^{2}(x,C), (44)

which corresponds to (2) with f=δDf=\delta_{D} and g=dist2(,C)/2g=\operatorname{dist}^{2}(\cdot,C)/2. In turn, Example 5.4 ensures that BiFRB and its Euclidean variant iFRB converge to a stationary point of (44).
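For the smooth part of (44), recall that $g=\operatorname{dist}^{2}(\cdot,C)/2$ with affine $C=\{x:Ax=b\}$ has gradient $\nabla g(x)=x-\operatorname{Proj}_{C}(x)$, and the projection is available in closed form. A hedged sketch with random data, assuming $A$ has full row rank (which holds almost surely here):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 20
A = rng.standard_normal((m, n))                 # full row rank almost surely
b = A @ rng.standard_normal(n)                  # guarantees C is nonempty

def proj_C(x):
    """Projection onto C = {x : Ax = b}: x - A^T (A A^T)^{-1} (Ax - b)."""
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def grad_g(x):
    """Gradient of g = dist^2(., C)/2, namely x - Proj_C(x) (1-Lipschitz)."""
    return x - proj_C(x)

x = rng.standard_normal(n)
assert np.allclose(A @ proj_C(x), b)            # Proj_C(x) is feasible
assert np.allclose(grad_g(proj_C(x)), 0.0)      # gradient vanishes on C
```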

We shall benchmark BiFRB and iFRB against the Douglas-Rachford (DR) [18], forward-reflected-backward (FRB) [31, 22] and inertial Tseng's (iTseng) [9] methods with $R\in\{1,1000\}$ and $r=\lceil m/5\rceil$. These splitting methods are known to converge globally to a stationary point of (2) under the same assumptions on $F$; see [18, Theorems 1–2, Remark 4, Corollary 1], [31, Theorem 3.9] and [9, Theorem 3.1], respectively.

BiFRB, iFRB and FRB are implemented with stepsize rules given by Proposition 4.2, Corollary 4.5 and [31, Theorem 3.9] respectively, and terminated when

max{xk+1xk,xkxk1}max{1,xk,xk1}<1010.\frac{\max\big{\{}\left\lVert x_{k+1}-x_{k}\right\rVert,\left\lVert x_{k}-x_{k-1}\right\rVert\big{\}}}{\max\big{\{}1,\left\lVert x_{k}\right\rVert,\left\lVert x_{k-1}\right\rVert\big{\}}}<10^{-10}. (45)

Moreover, BiFRB uses the kernel $h(x)=0.1\cdot\sqrt{1+\left\lVert x\right\rVert^{2}}+2.51\cdot\left\lVert x\right\rVert^{2}/2$. We apply DR with the stepsize and termination condition given by [18, Section 5] with the same tolerance $10^{-10}$, while iTseng employs a stepsize given by [9, Lemma 3.3] and termination condition (45). As for inertial parameters, we use $\alpha_{1}=0.9$ for BiFRB and $\alpha_{2}=0.49$ for both iFRB and iTseng. Finally, each algorithm is initialized at the origin and is equipped with the stepsize heuristic described in [18, Remark 4].

Simulation results are presented in Table 1 below. Our problem data are generated by creating random matrices $A\in\mathbb{R}^{m\times n}$ with entries following the standard Gaussian distribution. For each problem size $(m,n)$, we randomly generate 50 instances and report the ceiling of the average number of iterations (iter) and the minimal objective value at termination ($\text{fval}_{\text{min}}$). As problem (44) has optimal value zero, $\text{fval}_{\text{min}}$ indicates whether an algorithm actually hits a global minimizer rather than merely converging to a stationary point. We observed the following:

• BiFRB is the most advantageous method on “bad” problems (small $m$), but it is not robust.

• DR tends to have the smallest function value at termination; however, it can converge very slowly.

• In most cases, iFRB has the smallest number of iterations with fair function values at termination.

8 Conclusions

We proved convergence of BiFRB (Algorithm 3.2) by using the generalized concave KL property and a general framework for the decreasing property. The resulting stepsize condition is independent of the inertial parameter, which is less restrictive compared to other algorithms in the same setting. In turn, a question of Malitsky and Tam [22, Section 7] is answered. We believe that our approach could be useful for other splitting algorithms in designing more implementable stepsize rules in the absence of convexity. We also conducted convergence rate analysis on both the function value and the actual sequence. Formulae for our Bregman subproblems are provided as well, supplementing not only BiFRB but also [9, 10]. To end this paper, we present several directions for future work:

• It is tempting to see whether the global Lipschitz assumption on the gradients $\nabla g$ and $\nabla h$ in Assumption 3.1 can be relaxed in a fashion similar to [8].

• Another interesting question is to explore why BiFRB demonstrates inconsistent performance as the problem size varies; see Section 7.

• Now that the range of admissible inertial parameters is extended (recall Proposition 4.2), it would be interesting to see whether there exists an optimal accelerating scheme.

Table 1: Comparing BiFRB and iFRB to FRB, DR and iTseng.
Size (R=1)(R=1) BiFRB iFRB FRB DR iTseng
m n iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}}
100 4000 50 0.03251 93 0.03251 631 0.03929 860 0.02819 1367 0.03149
100 5000 52 0.02413 115 0.02413 733 0.02875 684 0.02192 1632 0.02268
100 6000 57 0.01369 145 0.01369 911 0.01794 683 0.01253 1970 0.0133
200 4000 870 0.2923 39 0.2745 190 0.2769 5466 0.2711 448 0.2746
200 5000 1198 0.2367 41 0.2095 231 0.2157 5864 0.2073 546 0.2097
200 6000 81 0.2306 43 0.2259 257 0.2292 4199 0.2236 620 0.2264
300 4000 885 1.051 32 1.036 105 1.038 5497 1.036 253 1.048
300 5000 925 0.7481 34 0.7044 124 0.7071 5673 0.7032 302 0.7051
300 6000 1407 0.583 36 0.5546 144 0.555 6173 0.5541 352 0.5566
Size (R=1000)(R=1000) BiFRB iFRB FRB DR iTseng
m n iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}} iter fvalmin\text{fval}_{\text{min}}
100 4000 1873 0.006609 1210 0.00365 7376 0.00816 2194 4e-21 10001 0.01671
100 5000 574 0.00278 1427 0.002244 8675 1.098e-05 2919 1.626e-20 10001 0.0158
100 6000 790 0.003827 1696 0.001261 9394 0.0002952 2344 3.438e-20 10001 0.01066
200 4000 5842 2.992 750 2.442e-18 8223 0.007688 1038 1.106e-20 9797 0.1094
200 5000 5938 0.6547 859 2.876e-05 7257 0.02581 1242 2.082e-20 9997 0.07125
200 6000 5666 0.917 1086 0.0001211 7312 0.006726 1487 9.129e-21 10001 0.07524
300 4000 7164 3.322 619 2.196e-18 4620 1.462e-05 753 9.795e-21 7722 0.2417
300 5000 5267 2.299 693 2.175e-18 6352 0.02936 880 5.446e-20 9194 0.1702
300 6000 5724 5.06 750 4.71e-18 8644 0.1109 1027 2.175e-20 9823 0.1423

Acknowledgments

XW and ZW were partially supported by NSERC Discovery Grants.

References

  • [1] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research, 35 (2010), pp. 438–457.
  • [2] H. Attouch, J. Bolte, and B. F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss–Seidel methods, Mathematical Programming, 137 (2013), pp. 91–129.
  • [3] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, Cham, 2017.
  • [4] H. H. Bauschke, D. R. Luke, H. M. Phan, and X. Wang, Restricted normal cones and sparsity optimization with affine constraints, Foundations of Computational Mathematics, 14 (2014), pp. 63–83.
  • [5] A. Beck, First-order Methods in Optimization, SIAM, 2017.
  • [6] J. Bolte, A. Daniilidis, A. Lewis, and M. Shiota, Clarke subgradients of stratifiable functions, SIAM Journal on Optimization, 18 (2007), pp. 556–572.
  • [7] J. Bolte, A. Daniilidis, O. Ley, and L. Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society, 362 (2010), pp. 3319–3363.
  • [8] J. Bolte, S. Sabach, M. Teboulle, and Y. Vaisbourd, First order methods beyond convexity and Lipschitz gradient continuity with applications to quadratic inverse problems, SIAM Journal on Optimization, 28 (2018), pp. 2131–2151.
  • [9] R. I. Boţ and E. R. Csetnek, An inertial Tseng’s type proximal algorithm for nonsmooth and nonconvex optimization problems, Journal of Optimization Theory and Applications, 171 (2016), pp. 600–616.
  • [10] R. I. Boţ, E. R. Csetnek, and S. C. László, An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions, EURO Journal on Computational Optimization, 4 (2016), pp. 3–25.
  • [11] R. I. Boţ and D.-K. Nguyen, The proximal alternating direction method of multipliers in the nonconvex setting: convergence analysis and rates, Mathematics of Operations Research, 45 (2020), pp. 682–712.
  • [12] R. Burden, D. Faires, and A. Burden, Numerical Analysis, Cengage Learning, 2014.
  • [13] C. Chen, T. K. Pong, L. Tan, and L. Zeng, A difference-of-convex approach for split feasibility with applications to matrix factorizations and outlier detection, Journal of Global Optimization, 78 (2020), pp. 107–136.
  • [14] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Modeling & Simulation, 4 (2005), pp. 1168–1200.
  • [15] J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems in two and three space variables, Transactions of the American Mathematical Society, 82 (1956), pp. 421–439.
  • [16] K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l’institut Fourier, 48 (1998), pp. 769–783.
  • [17] G. Li, T. Liu, and T. K. Pong, Peaceman–Rachford splitting for a class of nonconvex optimization problems, Computational Optimization and Applications, 68 (2017), pp. 407–436.
  • [18] G. Li and T. K. Pong, Douglas-Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems, Mathematical Programming, 159 (2016), pp. 371–401.
  • [19] P.-L. Lions and B. Mercier, Splitting algorithms for the sum of two nonlinear operators, SIAM Journal on Numerical Analysis, 16 (1979), pp. 964–979.
  • [20] S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, Les équations aux dérivées partielles, 117 (1963), pp. 87–89.
  • [21] R. Luss and M. Teboulle, Conditional gradient algorithms for rank-one matrix approximations with a sparsity constraint, SIAM Review, 55 (2013), pp. 65–98.
  • [22] Y. Malitsky and M. K. Tam, A forward-backward splitting method for monotone inclusions without cocoercivity, SIAM Journal on Optimization, 30 (2020), pp. 1451–1472.
  • [23] B. S. Mordukhovich, Variational Analysis and Generalized Differentiation I: Basic Theory, Springer-Verlag, Berlin, 2006.
  • [24] Y. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k21/k^{2}), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543–547.
  • [25] P. Ochs, Local convergence of the heavy-ball method and iPiano for non-convex optimization, Journal of Optimization Theory and Applications, 177 (2018), pp. 153–180.
  • [26] P. Ochs, Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano, SIAM Journal on Optimization, 29 (2019), pp. 541–570.
  • [27] P. Ochs, Y. Chen, T. Brox, and T. Pock, iPiano: Inertial proximal algorithm for nonconvex optimization, SIAM Journal on Imaging Sciences, 7 (2014), pp. 1388–1419.
  • [28] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis, Springer-Verlag, Berlin, 1998.
  • [29] P. Tseng, A modified forward-backward splitting method for maximal monotone mappings, SIAM Journal on Control and Optimization, 38 (2000), pp. 431–446.
  • [30] X. Wang and Z. Wang, The exact modulus of the generalized Kurdyka-Łojasiewicz property, Mathematics of Operations Research (2022), https://doi.org/10.1287/moor.2021.1227.
  • [31] X. Wang and Z. Wang, Malitsky-Tam forward-reflected-backward splitting method for nonconvex minimization problems, Computational Optimization and Applications, 82 (2022), pp. 441–463.