
An adaptive heavy ball method for ill-posed inverse problems

Qinian Jin Mathematical Sciences Institute, Australian National University, Canberra, ACT 2601, Australia [email protected]  and  Qin Huang School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, 100081, China [email protected]
Abstract.

In this paper we consider ill-posed inverse problems, both linear and nonlinear, by a heavy ball method in which a strongly convex regularization function is incorporated to capture the features of the sought solution. We develop ideas on how to adaptively choose the step-sizes and the momentum coefficients to achieve acceleration over the Landweber-type method. We then analyze the method and establish its regularization property when it is terminated by the discrepancy principle. Various numerical results are reported which demonstrate the superior performance of our method over the Landweber-type method, substantially reducing the required number of iterations and the computational time.

Key words and phrases:
Ill-posed inverse problems, adaptive heavy ball method, step-sizes, momentum coefficients, convergence
2010 Mathematics Subject Classification:
65J20, 65J22, 65J15, 47J06

1. Introduction

Consider an operator equation of the form

F(x) = y, \qquad (1)

where $F: \mathrm{dom}(F)\subset X\to Y$ is an operator between two real Hilbert spaces $X$ and $Y$ with domain $\mathrm{dom}(F)$. The operator $F$ could be either linear or nonlinear. Throughout, we always assume that (1) has a solution, i.e. $y\in\mathrm{Ran}(F)$, the range of $F$. We are interested in the situation where (1) is ill-posed in the sense that its solution does not depend continuously on the right-hand side data. Such situations arise in a broad range of applications of inverse problems. In practical scenarios, data are acquired through experiments and thus the exact data may not be available; instead we only have measurement data corrupted by noise. Due to the ill-posed nature of the underlying problem, it is therefore important to develop algorithms that produce stable approximate solutions of (1) from noisy data.

In practical applications, some a priori information on the sought solution, such as non-negativity, sparsity or piecewise constancy, is often available. Utilizing such feature information in algorithm design can significantly improve the reconstruction accuracy. By incorporating the available a priori information, one may pick a proper, lower semi-continuous, strongly convex function $\mathcal{R}: X\to(-\infty,\infty]$ as a regularization function to determine a solution of (1) with the desired feature. Let $y^{\delta}$ be noisy data satisfying

\|y^{\delta} - y\| \leq \delta

with a known noise level δ>0\delta>0. The Landweber-type method

\xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta} L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta}), \qquad x_{n+1}^{\delta} = \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\xi_{n+1}^{\delta}, x\rangle\right\}, \qquad (2)

has been considered in [18] for solving (1), where $\alpha_{n}^{\delta} > 0$ is the step-size and $\{L(x): x\in\mathrm{dom}(F)\}$ is a family of properly chosen bounded linear operators. In case $F$ is Fréchet differentiable at $x$, we may take $L(x) = F'(x)$, the Fréchet derivative of $F$ at $x$; otherwise, $L(x)$ should be appropriately chosen as a substitute for the non-existent Fréchet derivative. This method has been thoroughly analyzed in [15, 18] under the constant step-size

\alpha_{n}^{\delta} = \frac{\mu_{0}}{L^{2}}, \qquad (3)

when an upper bound $L$ of $\|L(x)\|$ around the sought solution is available, and the adaptive step-size

\alpha_{n}^{\delta} = \min\left\{\frac{\mu_{0}\|F(x_{n}^{\delta}) - y^{\delta}\|^{2}}{\|L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta})\|^{2}},\ \mu_{1}\right\} \qquad (4)

with suitable choices of the parameters $\mu_{0}$ and $\mu_{1}$; the regularization property has been established when the iteration is terminated by the discrepancy principle. Note that, when $\mathcal{R}(x) = \|x\|^{2}/2$, the corresponding method with $\alpha_{n}^{\delta}$ chosen by (3) is the classical Landweber method, which has been analyzed in [6, 10], and the corresponding method with $\alpha_{n}^{\delta}$ chosen by (4) is the minimal error method ([19]). Extensive numerical simulations have demonstrated that the Landweber-type method (2) converges slowly, often necessitating a large number of iteration steps before termination by the discrepancy principle. It is therefore important to develop strategies for accelerating the Landweber-type method (2) while retaining its simple implementation.
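To make the two step-size rules concrete, the following sketch computes the constant rule (3) and the adaptive rule (4) for a small linear problem; here a matrix `A` stands in for $L(x_n^{\delta})$, and all names and parameter values are our illustrative choices rather than anything fixed by the paper.

```python
import numpy as np

def adaptive_step_size(A, r, mu0=0.99, mu1=1e6):
    """Adaptive rule (4) for a linear map A standing in for L(x_n):
    compare mu0 * ||r||^2 / ||A^T r||^2 with the cap mu1."""
    g = A.T @ r                      # g_n = L(x_n)^* r_n
    gn2 = float(g @ g)
    if gn2 == 0.0:                   # residual orthogonal to the range of A
        return mu1
    return min(mu0 * float(r @ r) / gn2, mu1)

A = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.1]])
L = np.linalg.norm(A, 2)             # operator norm, an upper bound for ||L(x)||
alpha_const = 0.99 / L**2            # constant rule (3) with mu0 = 0.99
r = A @ np.array([1.0, -1.0]) - np.array([0.1, 0.0, 0.05])
alpha_adapt = adaptive_step_size(A, r)
```

The adaptive rule typically yields a larger step than the conservative constant rule whenever the residual is poorly aligned with the dominant singular directions of the operator.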

For linear ill-posed problems in Hilbert spaces, Brakhage used in [4] the Jacobi polynomials to construct a method, known as the $\nu$-method, to accelerate the classical Landweber iteration. Additionally, in [9], a family of accelerated Landweber iterations was developed by employing orthogonal polynomials and the spectral theory of self-adjoint operators. However, the acceleration strategy using orthogonal polynomials is scarcely applicable to the Landweber-type method (2) when the forward operator $F$ is nonlinear or the regularization function $\mathcal{R}$ is non-quadratic. In [12, 28] sequential subspace optimization strategies were employed to accelerate the Landweber iteration. The implementation of this strategy, however, requires solving an additional optimization problem at each iteration step to determine the weighted coefficients of the search directions, which unfortunately slows down the method.

On the other hand, to determine a solution of the well-posed minimization problem

\min_{x\in X} f(x) \qquad (5)

with a continuously differentiable objective function $f$, tremendous progress has been made toward accelerating the gradient method

x_{n+1} = x_{n} - \alpha_{n}\nabla f(x_{n})

in the optimization community. In particular, Nesterov's accelerated gradient method ([22]) and Polyak's heavy ball method ([27]) stand out and have had a far-reaching impact on the development of fast first-order optimization methods. In Nesterov's accelerated gradient method, the next iterate is defined not at the current iterate but at a carefully chosen extrapolation point built from the previous two iterates ([1, 2, 5, 23]). Based on Nesterov's acceleration strategy, an accelerated version of the Landweber-type method has been proposed in [15], and a refined version of the method takes the form (see also [16, 31])

\begin{aligned}
\hat{\xi}_{n}^{\delta} &= \xi_{n}^{\delta} + \frac{n-1}{n+\gamma}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta}), & \hat{x}_{n}^{\delta} &= \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\hat{\xi}_{n}^{\delta}, x\rangle\right\}, \\
\xi_{n+1}^{\delta} &= \hat{\xi}_{n}^{\delta} - \alpha_{n}^{\delta} L(\hat{x}_{n}^{\delta})^{*}(F(\hat{x}_{n}^{\delta}) - y^{\delta}), & x_{n+1}^{\delta} &= \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\xi_{n+1}^{\delta}, x\rangle\right\}, \qquad (6)
\end{aligned}

where $\gamma\geq 2$. The numerical results presented in [15] demonstrate its striking acceleration effect. Inspired by the numerical observations in [15], some efforts have been devoted to analyzing the regularization property of (6) in the context of ill-posed problems. When $X$ is a Hilbert space, $F$ is a bounded linear operator and $\mathcal{R}(x) = \|x\|^{2}/2$, the regularization property of the corresponding method, terminated by either a priori stopping rules or the discrepancy principle, has been established in [20, 24] based on a general acceleration framework in [9] using orthogonal polynomials. When $F$ is nonlinear or $\mathcal{R}$ is non-quadratic, although some partial results are available in [14, 16], the analysis of (6) is underdeveloped and the understanding of its regularization property is largely open. By modifying the first equation in (6) to $\hat{\xi}_{n}^{\delta} = \xi_{n}^{\delta} + \lambda_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta})$ with a connection coefficient $\lambda_{n}^{\delta}$ to be determined, a two-point gradient method has been considered in [13, 31]. This method, however, requires a discrete backtracking search algorithm to determine $\lambda_{n}^{\delta}$ at each iteration step, which can incur additional computational time.

Different from Nesterov's method, Polyak's heavy ball method accelerates the gradient method by adding a momentum term to the definition of the iterates, i.e.

x_{n+1} = x_{n} - \alpha_{n}\nabla f(x_{n}) + \beta_{n}(x_{n} - x_{n-1}). \qquad (7)

The performance of (7) depends crucially on the properties of $f$ and the choices of the step-size $\alpha_{n}$ and the momentum coefficient $\beta_{n}$, and its analysis turns out to be very challenging. When $f$ is twice continuously differentiable, strongly convex and has a Lipschitz continuous gradient, it has been demonstrated in [27], under suitable constant choices of $\alpha_{n}$ and $\beta_{n}$, that the iterative sequence $\{x_{n}\}$ enjoys a provable linear convergence faster than gradient descent. Due to the recent development of machine learning and the appearance of large-scale problems, the heavy ball method has gained renewed interest and much attention has been paid to understanding its convergence behavior ([7, 23, 25, 26, 30]). Although many strategies have been proposed for selecting $\alpha_{n}$ and $\beta_{n}$, how to choose these parameters to achieve acceleration for general convex and non-convex problems remains an active research topic in the optimization community.
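As a minimal illustration of (7) in the well-posed setting, the sketch below runs the heavy ball iteration on the strongly convex quadratic $f(x) = \frac{1}{2}x^{T}Qx$ with Polyak's classical constant choices of $\alpha_n$ and $\beta_n$; the matrix and constants are our toy assumptions, not the paper's setting.

```python
import numpy as np

# Heavy ball iteration (7) on f(x) = 0.5 * x^T Q x with Q = diag(1, 10),
# using Polyak's classical constants built from the strong convexity
# constant m = 1 and the gradient Lipschitz constant Lf = 10.
Q = np.diag([1.0, 10.0])
m, Lf = 1.0, 10.0
alpha = 4.0 / (np.sqrt(Lf) + np.sqrt(m)) ** 2
beta = ((np.sqrt(Lf) - np.sqrt(m)) / (np.sqrt(Lf) + np.sqrt(m))) ** 2

x_prev = x = np.array([5.0, 5.0])
for _ in range(200):
    grad = Q @ x                                   # nabla f(x_n)
    x, x_prev = x - alpha * grad + beta * (x - x_prev), x
# x is now very close to the unique minimizer 0
```

With these constants the iterates contract at a rate governed by $(\sqrt{L_f}-\sqrt{m})/(\sqrt{L_f}+\sqrt{m})$, which is faster than the corresponding gradient descent rate on the same quadratic.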

In this paper we will consider the iterative method

\xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta} L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta}) + \beta_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta}), \qquad x_{n+1}^{\delta} = \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\xi_{n+1}^{\delta}, x\rangle\right\} \qquad (8)

for solving the ill-posed inverse problem (1), where $\alpha_{n}^{\delta}$ denotes the step-size and $\beta_{n}^{\delta}$ denotes the momentum coefficient. This method differs from (2) in that a momentum term $\beta_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta})$ is added to define the iterates. The method (8) has an intimate connection with the heavy ball method. Indeed, when $\mathcal{R}(x) = \|x\|^{2}/2$ and $F$ is Fréchet differentiable, the method (8) with $L(x) = F'(x)$ becomes

x_{n+1}^{\delta} = x_{n}^{\delta} - \alpha_{n}^{\delta} F'(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta}) + \beta_{n}^{\delta}(x_{n}^{\delta} - x_{n-1}^{\delta}),

which can be obtained by applying the heavy ball method (7) to the minimization problem (5) with

f(x) := \frac{1}{2}\|F(x) - y^{\delta}\|^{2}. \qquad (9)

We will further demonstrate that, when $F$ is a bounded linear operator, the method (8) with a general strongly convex $\mathcal{R}$ can be derived by applying the heavy ball method to a dual problem. Based on the successful experience with the acceleration effect of the method (7) for well-posed optimization problems, it is natural to expect that, with suitable choices of the parameters $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$, the method (8) can outperform the method (2) for ill-posed problems. In this paper we will provide a criterion for choosing $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ and investigate the convergence property of the corresponding method. Our choices of $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ are adaptive in the sense that only quantities arising from the computation are used. The formulae for calculating $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ are explicit, thereby avoiding the need for a backtracking search strategy and consequently saving computational time. The per-iteration complexity of our proposed method is comparable to one step of the Landweber-type method (2). However, extensive computational results demonstrate that our method significantly accelerates the Landweber-type method in terms of both the number of iterations and the computational time.

This paper is organized as follows. In Section 2 we first formulate the conditions on the regularization function $\mathcal{R}$ and the forward operator $F$. We then propose our adaptive heavy ball method, elucidating how to choose the step-sizes and the momentum coefficients, and show that our method is well-defined and that the iteration is terminated after finitely many steps by the discrepancy principle. In Section 3 we focus on the convergence analysis and establish the regularization property of our method. In Section 4 we perform various numerical simulations to demonstrate the acceleration effect and the superior performance of our method over the Landweber-type method.

2. The adaptive heavy-ball method

In this section we will consider the iterative method (8) and provide a motivation for choosing the step-size $\alpha_{n}^{\delta}$ and the momentum coefficient $\beta_{n}^{\delta}$ adaptively. We will then show that our proposed method, i.e. Algorithm 1 below, is well-defined.

2.1. Assumptions

Note that the method (8) involves the regularization function $\mathcal{R}$ and the forward operator $F$. In order to carry out the analysis, certain conditions should be placed on $\mathcal{R}$ and $F$. For the regularization function $\mathcal{R}$, we assume the following standard condition.

Assumption 1.

$\mathcal{R}: X\to(-\infty,\infty]$ is a proper, lower semi-continuous, strongly convex function in the sense that there is a constant $\sigma > 0$ such that

\mathcal{R}(t\bar{x} + (1-t)x) + \sigma t(1-t)\|\bar{x} - x\|^{2} \leq t\mathcal{R}(\bar{x}) + (1-t)\mathcal{R}(x)

for all $\bar{x}, x\in\mathrm{dom}(\mathcal{R})$ and $0\leq t\leq 1$.
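As a quick sanity check, for $\mathcal{R}(x) = \|x\|^{2}/2$ the inequality in Assumption 1 holds with $\sigma = 1/2$ (in fact with equality, by expanding the squares); the snippet below verifies this numerically at random points, with our own choice of test data.

```python
import numpy as np

# Numerical spot-check of the strong convexity inequality in Assumption 1
# for R(x) = ||x||^2 / 2, which satisfies it with sigma = 1/2.
rng = np.random.default_rng(1)
sigma = 0.5
R = lambda x: 0.5 * float(x @ x)

worst_gap = -np.inf
for _ in range(100):
    xb, x = rng.standard_normal(4), rng.standard_normal(4)
    t = rng.uniform()
    lhs = R(t * xb + (1 - t) * x) + sigma * t * (1 - t) * float((xb - x) @ (xb - x))
    rhs = t * R(xb) + (1 - t) * R(x)
    worst_gap = max(worst_gap, lhs - rhs)   # should never exceed 0
```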

To make use of Assumption 1, let us recall a little convex analysis ([29]). For any proper convex function $\mathcal{R}: X\to(-\infty,\infty]$ we use $\partial\mathcal{R}$ to denote its subdifferential, i.e.

\partial\mathcal{R}(x) = \{\xi\in X: \mathcal{R}(z)\geq\mathcal{R}(x) + \langle\xi, z - x\rangle \mbox{ for all } z\in X\}.

For any $x\in X$ with $\partial\mathcal{R}(x)\neq\emptyset$, let $\xi\in\partial\mathcal{R}(x)$ and define

D_{\mathcal{R}}^{\xi}(z,x) := \mathcal{R}(z) - \mathcal{R}(x) - \langle\xi, z - x\rangle, \quad z\in X,

which is called the Bregman distance induced by $\mathcal{R}$ at $x$ in the direction $\xi$. If $\mathcal{R}$ is strongly convex in the sense of Assumption 1, it is easy to derive that

D_{\mathcal{R}}^{\xi}(z,x) \geq \sigma\|z - x\|^{2} \qquad (10)

for all $x\in X$ with $\partial\mathcal{R}(x)\neq\emptyset$, $\xi\in\partial\mathcal{R}(x)$ and $z\in X$. For a proper, lower semi-continuous, convex function $\mathcal{R}: X\to(-\infty,\infty]$, its convex conjugate

\mathcal{R}^{*}(\xi) := \sup_{x\in X}\left\{\langle\xi, x\rangle - \mathcal{R}(x)\right\}, \quad \forall\xi\in X,

is also a proper, lower semi-continuous, convex function from $X$ to $(-\infty,\infty]$. Moreover,

\xi\in\partial\mathcal{R}(x) \iff x\in\partial\mathcal{R}^{*}(\xi) \iff \mathcal{R}(x) + \mathcal{R}^{*}(\xi) = \langle\xi, x\rangle.

If $\mathcal{R}$ satisfies Assumption 1, then $\mathcal{R}^{*}$ is continuously differentiable and its gradient $\nabla\mathcal{R}^{*}$ is Lipschitz continuous with

\|\nabla\mathcal{R}^{*}(\bar{\xi}) - \nabla\mathcal{R}^{*}(\xi)\| \leq \frac{\|\bar{\xi} - \xi\|}{2\sigma}, \quad \forall\bar{\xi}, \xi\in X; \qquad (11)

see [29, Corollary 3.5.11]. Thus, under Assumption 1, for any $\xi\in X$ the minimization problem

x = \arg\min_{z\in X}\left\{\mathcal{R}(z) - \langle\xi, z\rangle\right\}

has a unique solution given by $x = \nabla\mathcal{R}^{*}(\xi)$. Clearly $\xi\in\partial\mathcal{R}(x)$.
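For a concrete instance, take $\mathcal{R}(x) = \frac{1}{2}\|x\|^{2} + \lambda\|x\|_{1}$, which satisfies Assumption 1 with $\sigma = 1/2$ and promotes sparsity; then $x = \nabla\mathcal{R}^{*}(\xi)$ is given componentwise by soft-thresholding, a standard closed form. This particular $\mathcal{R}$ and the value of $\lambda$ are our illustrative choices, not ones fixed by the paper.

```python
import numpy as np

def grad_R_star(xi, lam):
    """x = nabla R^*(xi) for R(x) = 0.5*||x||^2 + lam*||x||_1 (sigma = 1/2):
    the minimizer of R(x) - <xi, x> is componentwise soft-thresholding of xi.
    Note it is 1-Lipschitz, consistent with the bound 1/(2*sigma) in (11)."""
    return np.sign(xi) * np.maximum(np.abs(xi) - lam, 0.0)

xi = np.array([2.0, -0.3, 1.2, 0.0])
lam = 0.5
x = grad_R_star(xi, lam)
# optimality check: xi in dR(x) means xi - x must lie in lam * d||x||_1,
# i.e. every component of w is bounded by lam in absolute value
w = xi - x
```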

For the forward operator $F$, we assume the following condition, which has been widely used in dealing with ill-posed problems, where $B_{\rho}(x_{0}) := \{x\in X: \|x - x_{0}\|\leq\rho\}$ for any $\rho > 0$.

Assumption 2.
  • (a)

    $X$ and $Y$ are Hilbert spaces.

  • (b)

    There exist $\rho > 0$, $x_{0}\in X$ and $\xi_{0}\in\partial\mathcal{R}(x_{0})$ such that $B_{2\rho}(x_{0})\subset\mathrm{dom}(F)$ and (1) has a solution $\bar{x}$ such that $D_{\mathcal{R}}^{\xi_{0}}(\bar{x}, x_{0})\leq\sigma\rho^{2}$.

  • (c)

    $F$ is weakly closed on $\mathrm{dom}(F)$, i.e. for any sequence $\{x_{n}\}\subset\mathrm{dom}(F)$ satisfying $x_{n}\rightharpoonup x\in X$ and $F(x_{n})\to v\in Y$ there hold $x\in\mathrm{dom}(F)$ and $F(x) = v$. Throughout the paper "$\to$" and "$\rightharpoonup$" denote strong and weak convergence respectively.

  • (d)

    There exists a family of bounded linear operators $\{L(x): X\to Y\}_{x\in B_{2\rho}(x_{0})}$ such that $x\to L(x)$ is continuous on $B_{2\rho}(x_{0})$ and there is $0\leq\eta < 1$ such that

    \|F(x) - F(\bar{x}) - L(\bar{x})(x - \bar{x})\| \leq \eta\|F(x) - F(\bar{x})\|

    for all $x, \bar{x}\in B_{2\rho}(x_{0})$. Moreover, there is a constant $L > 0$ such that $\|L(x)\|\leq L$ for all $x\in B_{2\rho}(x_{0})$.

We remark that, when $F$ is a bounded linear operator, Assumption 2 holds automatically with $\rho = \infty$ and $\eta = 0$. When $F$ is nonlinear, Assumption 2 (d) is known as the tangential cone condition, which imposes a certain restriction on the nonlinearity of $F$. Assumption 2 (d) clearly implies

\|F(x) - F(\bar{x})\| \leq \frac{1}{1-\eta}\|L(\bar{x})(x - \bar{x})\|, \quad \forall x, \bar{x}\in B_{2\rho}(x_{0}),

and thus the continuity of the mapping $x\to F(x)$ on $B_{2\rho}(x_{0})$. Based on Assumption 1 and Assumption 2, it has been shown in [18, Lemma 3.2] that (1) has a unique solution $x^{\dagger}$ satisfying

D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger}, x_{0}) = \min_{x\in\mathrm{dom}(F)}\left\{D_{\mathcal{R}}^{\xi_{0}}(x, x_{0}): F(x) = y\right\}. \qquad (12)

By using (10) and Assumption 2 (b) it is easy to see that $\|x^{\dagger} - x_{0}\|\leq\rho$, i.e. $x^{\dagger}\in B_{\rho}(x_{0})$.

2.2. The method

Let us now turn to the method (8). We first demonstrate that, when $F$ is a bounded linear operator, the method (8) can be derived by applying the heavy ball method to the dual problem of

\min\{\mathcal{R}(x): x\in X \mbox{ and } Fx = y\}. \qquad (13)

To see this, we recall that the associated dual function is given by

d(\lambda) := \inf_{x\in X}\left\{\mathcal{R}(x) - \langle\lambda, Fx - y\rangle\right\} = -\mathcal{R}^{*}(F^{*}\lambda) + \langle\lambda, y\rangle,

and hence the corresponding dual problem is $\max_{\lambda\in Y} d(\lambda)$, i.e.

\min_{\lambda\in Y}\left\{\mathcal{R}^{*}(F^{*}\lambda) - \langle\lambda, y\rangle\right\}. \qquad (14)

Since $\mathcal{R}$ satisfies Assumption 1, the objective function in (14) is continuously differentiable. Applying the heavy ball method to solve (14) gives

\lambda_{n+1} = \lambda_{n} - \alpha_{n}(F\nabla\mathcal{R}^{*}(F^{*}\lambda_{n}) - y) + \beta_{n}(\lambda_{n} - \lambda_{n-1}).

By setting $\xi_{n} := F^{*}\lambda_{n}$ and $x_{n} := \nabla\mathcal{R}^{*}(F^{*}\lambda_{n})$, we then obtain

x_{n} = \nabla\mathcal{R}^{*}(\xi_{n}), \qquad \xi_{n+1} = \xi_{n} - \alpha_{n}F^{*}(Fx_{n} - y) + \beta_{n}(\xi_{n} - \xi_{n-1}).

In case only noisy data $y^{\delta}$ are available, replacing $y$ by $y^{\delta}$ and allowing $\alpha_{n}$, $\beta_{n}$ to depend on $y^{\delta}$ in the above equation yields the method

x_{n}^{\delta} = \nabla\mathcal{R}^{*}(\xi_{n}^{\delta}), \qquad \xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta}F^{*}(Fx_{n}^{\delta} - y^{\delta}) + \beta_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta}),

which is equivalent to (8) when $F$ is bounded linear.

Note that the term $F^{*}(Fx_{n}^{\delta} - y^{\delta})$ in the above equation is the gradient of the function $x\to\frac{1}{2}\|Fx - y^{\delta}\|^{2}$ at $x_{n}^{\delta}$. When $F$ is a nonlinear Fréchet differentiable operator, we may use the gradient $F'(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta})$ of the function (9) at $x_{n}^{\delta}$ in place of $F^{*}(Fx_{n}^{\delta} - y^{\delta})$; in case $F$ is not differentiable at $x_{n}^{\delta}$, we may substitute $L(x_{n}^{\delta})$ for $F'(x_{n}^{\delta})$. This leads to the method (8) for solving (1).
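When $F$ is a bounded linear operator and $\mathcal{R}(x) = \|x\|^{2}/2$, the iteration above reduces to a heavy-ball Landweber update on the iterates themselves. The sketch below implements this special case with a fixed momentum coefficient $\beta$ and the constant step-size (3), stopping by the discrepancy principle; the fixed parameter values and the test problem are our illustrative assumptions (the adaptive choice of $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ is the subject of Algorithm 1 below).

```python
import numpy as np

def heavy_ball_landweber(A, y_delta, delta, tau=1.1, beta=0.5, max_iter=5000):
    """Method (8) with R(x) = ||x||^2/2 for a linear map A, using a fixed
    momentum beta and step-size alpha = 0.99 / ||A||^2 (rule (3)).
    Stops by the discrepancy principle ||A x - y_delta|| <= tau * delta."""
    alpha = 0.99 / np.linalg.norm(A, 2) ** 2
    x_prev = x = np.zeros(A.shape[1])
    for n in range(max_iter):
        r = A @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:
            break
        x, x_prev = x - alpha * A.T @ r + beta * (x - x_prev), x
    return x, n

# small consistent test problem with mildly noisy data
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = rng.standard_normal(10)
noise = 1e-3 * rng.standard_normal(20)
y_delta = A @ x_true + noise
delta = np.linalg.norm(noise)
x_rec, n_stop = heavy_ball_landweber(A, y_delta, delta)
```

On this toy problem the momentum term typically reduces the number of iterations needed to reach the discrepancy level compared with the plain Landweber iteration (beta = 0).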

In order for the method (8) to converge fast, it is natural to choose $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ at each iteration step such that $x_{n+1}^{\delta}$ is as close to a solution of (1) as possible. Let $\hat{x}$ denote any solution of (1) in $B_{2\rho}(x_{0})$. We may measure the closeness of $x_{n+1}^{\delta}$ to $\hat{x}$ by the Bregman distance

D_{\mathcal{R}}^{\xi_{n+1}^{\delta}}(\hat{x}, x_{n+1}^{\delta}) := \mathcal{R}(\hat{x}) - \mathcal{R}(x_{n+1}^{\delta}) - \langle\xi_{n+1}^{\delta}, \hat{x} - x_{n+1}^{\delta}\rangle

induced by $\mathcal{R}$ at $x_{n+1}^{\delta}$ in the direction $\xi_{n+1}^{\delta}$. Directly minimizing this quantity is infeasible because it involves the unknown solution $\hat{x}$ and the strongly convex function $\mathcal{R}$, which may be non-quadratic. As a compromise, we may first derive a suitable upper bound for this quantity and then take minimization steps.

We will elucidate the idea for choosing $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ below. For simplicity of exposition, we introduce the notation

r_{n}^{\delta} := F(x_{n}^{\delta}) - y^{\delta}, \qquad g_{n}^{\delta} := L(x_{n}^{\delta})^{*}r_{n}^{\delta} \qquad\mbox{and}\qquad m_{n}^{\delta} := \xi_{n}^{\delta} - \xi_{n-1}^{\delta}

for all integers $n\geq 0$. Then (8) can be written as

\xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta}g_{n}^{\delta} + \beta_{n}^{\delta}m_{n}^{\delta}, \qquad x_{n+1}^{\delta} = \nabla\mathcal{R}^{*}(\xi_{n+1}^{\delta}). \qquad (15)

Let $\Delta_{n}^{\delta} := D_{\mathcal{R}}^{\xi_{n}^{\delta}}(\hat{x}, x_{n}^{\delta})$ for $n\geq 0$. Since $\xi_{n}^{\delta}\in\partial\mathcal{R}(x_{n}^{\delta})$, we have $\mathcal{R}(x_{n}^{\delta}) + \mathcal{R}^{*}(\xi_{n}^{\delta}) = \langle\xi_{n}^{\delta}, x_{n}^{\delta}\rangle$ and thus

\Delta_{n}^{\delta} = \mathcal{R}(\hat{x}) + \mathcal{R}^{*}(\xi_{n}^{\delta}) - \langle\xi_{n}^{\delta}, \hat{x}\rangle.

Therefore, by using $x_{n}^{\delta} = \nabla\mathcal{R}^{*}(\xi_{n}^{\delta})$, (11) and (15), we can obtain

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &= \left(\mathcal{R}(\hat{x}) + \mathcal{R}^{*}(\xi_{n+1}^{\delta}) - \langle\xi_{n+1}^{\delta}, \hat{x}\rangle\right) - \left(\mathcal{R}(\hat{x}) + \mathcal{R}^{*}(\xi_{n}^{\delta}) - \langle\xi_{n}^{\delta}, \hat{x}\rangle\right) \\
&= \mathcal{R}^{*}(\xi_{n+1}^{\delta}) - \mathcal{R}^{*}(\xi_{n}^{\delta}) - \langle\xi_{n+1}^{\delta} - \xi_{n}^{\delta}, \nabla\mathcal{R}^{*}(\xi_{n}^{\delta})\rangle + \langle\xi_{n+1}^{\delta} - \xi_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&\leq \frac{1}{4\sigma}\|\xi_{n+1}^{\delta} - \xi_{n}^{\delta}\|^{2} + \langle\xi_{n+1}^{\delta} - \xi_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&= \frac{1}{4\sigma}\|\alpha_{n}^{\delta}g_{n}^{\delta} - \beta_{n}^{\delta}m_{n}^{\delta}\|^{2} + \langle -\alpha_{n}^{\delta}g_{n}^{\delta} + \beta_{n}^{\delta}m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle.
\end{aligned}

By the polarization identity in Hilbert spaces, we then have

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &\leq \frac{1}{4\sigma}\left((\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} + (\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - 2\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle\right) \\
&\quad - \alpha_{n}^{\delta}\langle r_{n}^{\delta}, L(x_{n}^{\delta})(x_{n}^{\delta} - \hat{x})\rangle + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&= \frac{1}{4\sigma}(\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle \\
&\quad - \alpha_{n}^{\delta}\langle r_{n}^{\delta}, r_{n}^{\delta} + y^{\delta} - y + y - F(x_{n}^{\delta}) - L(x_{n}^{\delta})(\hat{x} - x_{n}^{\delta})\rangle \\
&\quad + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle.
\end{aligned}

Assume $x_{n}^{\delta}\in B_{2\rho}(x_{0})$. By using $\|y^{\delta} - y\|\leq\delta$, the Cauchy-Schwarz inequality and Assumption 2 (d), we further obtain

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &\leq \frac{1}{4\sigma}(\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle - \alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} \\
&\quad + \delta\alpha_{n}^{\delta}\|r_{n}^{\delta}\| + \eta\alpha_{n}^{\delta}\|r_{n}^{\delta}\|\|y - F(x_{n}^{\delta})\| + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&\leq \frac{1}{4\sigma}(\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} - \alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} + (1+\eta)\delta\alpha_{n}^{\delta}\|r_{n}^{\delta}\| + \eta\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} \\
&\quad + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle. \qquad (16)
\end{aligned}

Note that when $\beta_{n}^{\delta} = 0$, the method (8) becomes the Landweber-type method (2), and extensive analysis has been done on (2) with suitable choices of the step-size $\alpha_{n}^{\delta}$, including (3) and (4). For the method (8) we choose $\alpha_{n}^{\delta}$ in the same way, i.e.

\alpha_{n}^{\delta} := \frac{\mu_{0}}{L^{2}} \qquad\mbox{or}\qquad \alpha_{n}^{\delta} := \min\left\{\frac{\mu_{0}\|r_{n}^{\delta}\|^{2}}{\|g_{n}^{\delta}\|^{2}},\ \mu_{1}\right\} \qquad (17)

when $\|r_{n}^{\delta}\| > \tau\delta$, where $\mu_{0}$ and $\mu_{1}$ are two positive constants. Then, by plugging the choice of $\alpha_{n}^{\delta}$ from (17) into (16) and using $\delta\leq\|r_{n}^{\delta}\|/\tau$, we have

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &\leq -\left(1 - \frac{1+\eta}{\tau} - \eta - \frac{\mu_{0}}{4\sigma}\right)\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} \\
&\quad + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle. \qquad (18)
\end{aligned}

We next consider the choice of $\beta_{n}^{\delta}$. It seems natural to obtain it by minimizing the right-hand side of (18) with respect to $\beta_{n}^{\delta}$. Let us fix a number $\beta\in(0,\infty]$ and minimize the right-hand side of (18) over $[0,\beta]$ to obtain

\beta_{n}^{\delta} = \min\left\{\max\left\{0,\ \frac{\alpha_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle - 2\sigma\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle}{\|m_{n}^{\delta}\|^{2}}\right\},\ \beta\right\}

whenever $m_{n}^{\delta}\neq 0$. However, this formula for $\beta_{n}^{\delta}$ is not computable because it involves the unknown solution $\hat{x}$. We need to treat the term

\gamma_{n}^{\delta} := \langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \qquad (19)

by using a suitable computable surrogate. Note that $\gamma_{0}^{\delta} = 0$ and, for $n\geq 1$,

\begin{aligned}
\gamma_{n}^{\delta} &= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle + \langle m_{n}^{\delta}, x_{n-1}^{\delta} - \hat{x}\rangle \\
&= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle + \langle -\alpha_{n-1}^{\delta}g_{n-1}^{\delta} + \beta_{n-1}^{\delta}m_{n-1}^{\delta}, x_{n-1}^{\delta} - \hat{x}\rangle \\
&= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\langle r_{n-1}^{\delta}, L(x_{n-1}^{\delta})(x_{n-1}^{\delta} - \hat{x})\rangle + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta} \\
&= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} \\
&\quad - \alpha_{n-1}^{\delta}\langle r_{n-1}^{\delta}, y^{\delta} - F(x_{n-1}^{\delta}) + L(x_{n-1}^{\delta})(x_{n-1}^{\delta} - \hat{x})\rangle + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta} \\
&\leq \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} \\
&\quad + \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|\left(\delta + \|y - F(x_{n-1}^{\delta}) - L(x_{n-1}^{\delta})(\hat{x} - x_{n-1}^{\delta})\|\right) + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta}.
\end{aligned}

Assuming $x_{n-1}^{\delta}\in B_{2\rho}(x_{0})$, we may use Assumption 2 (d) to further obtain

\begin{aligned}
\gamma_{n}^{\delta} &\leq \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} + \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|\left(\delta + \eta\|y - F(x_{n-1}^{\delta})\|\right) + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta} \\
&\leq \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - (1-\eta)\alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} + (1+\eta)\alpha_{n-1}^{\delta}\delta\|r_{n-1}^{\delta}\| + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta}. \qquad (20)
\end{aligned}

This motivates us to introduce {γ~nδ}\{\tilde{\gamma}_{n}^{\delta}\} such that

γ~nδ={0 if n=0,mnδ,xnδxn1δ(1η)αn1δrn1δ2+(1+η)αn1δδrn1δ+βn1δγ~n1δ if n1\displaystyle\tilde{\gamma}_{n}^{\delta}=\left\{\begin{array}[]{lll}0&\mbox{ if }n=0,\\ \langle m_{n}^{\delta},x_{n}^{\delta}-x_{n-1}^{\delta}\rangle-(1-\eta)\alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2}&\\ \qquad\qquad+(1+\eta)\alpha_{n-1}^{\delta}\delta\|r_{n-1}^{\delta}\|+\beta_{n-1}^{\delta}\tilde{\gamma}_{n-1}^{\delta}&\mbox{ if }n\geq 1\end{array}\right. (24)

once xkδx_{k}^{\delta} for 0kn0\leq k\leq n and αkδ,βkδ\alpha_{k}^{\delta},\beta_{k}^{\delta} for 0kn10\leq k\leq n-1 are defined. This leads us to propose Algorithm 1 below, which will be studied in this paper.

Algorithm 1 (AHB: Adaptive heavy ball method).

Take an initial guess ξ0\xi_{0}, calculate x0:=argminxX{(x)ξ0,x}x_{0}:=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{0},x\rangle\}, and set ξ1δ=ξ0δ=ξ0\xi_{-1}^{\delta}=\xi_{0}^{\delta}=\xi_{0}. Pick the numbers τ>1\tau>1, 0<β0<\beta\leq\infty, μ0>0\mu_{0}>0 and μ1>0\mu_{1}>0. For n0n\geq 0 do the following:

  1. (i)

    Calculate rnδ:=F(xnδ)yδr_{n}^{\delta}:=F(x_{n}^{\delta})-y^{\delta}. If rnδτδ\|r_{n}^{\delta}\|\leq\tau\delta, stop and output xnδx_{n}^{\delta};

  2. (ii)

    Calculate gnδ:=L(xnδ)(F(xnδ)yδ)g_{n}^{\delta}:=L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta})-y^{\delta}) and determine αnδ\alpha_{n}^{\delta} according to (17);

  3. (iii)

    Calculate mnδ:=ξnδξn1δm_{n}^{\delta}:=\xi_{n}^{\delta}-\xi_{n-1}^{\delta} and determine γ~nδ\tilde{\gamma}_{n}^{\delta} by the formula (24);

  4. (iv)

    Calculate βnδ\beta_{n}^{\delta} by

    βnδ={min{max{0,αnδgnδ,mnδ2σγ~nδmnδ2},β} if mnδ0,0 if mnδ=0;\displaystyle\beta_{n}^{\delta}=\left\{\begin{array}[]{lll}\min\left\{\max\left\{0,\frac{\alpha_{n}^{\delta}\langle g_{n}^{\delta},m_{n}^{\delta}\rangle-2\sigma\tilde{\gamma}_{n}^{\delta}}{\|m_{n}^{\delta}\|^{2}}\right\},\beta\right\}&\emph{ if }m_{n}^{\delta}\neq 0,\\ 0&\emph{ if }m_{n}^{\delta}=0;\end{array}\right.
  5. (v)

    Update ξn+1δ\xi_{n+1}^{\delta} and xn+1δx_{n+1}^{\delta} by

    ξn+1δ=ξnδαnδgnδ+βnδmnδ,xn+1δ=argminxX{(x)ξn+1δ,x}.\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-\alpha_{n}^{\delta}g_{n}^{\delta}+\beta_{n}^{\delta}m_{n}^{\delta},\quad x_{n+1}^{\delta}=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{n+1}^{\delta},x\rangle\}.

Note that, in each iteration of Algorithm 1, the most expensive parts are the calculations of rnδr_{n}^{\delta}, gnδg_{n}^{\delta} and xn+1δx_{n+1}^{\delta}, which are also required by the Landweber-type method (2); the computational overhead of determining αnδ\alpha_{n}^{\delta}, γ~nδ\tilde{\gamma}_{n}^{\delta} and βnδ\beta_{n}^{\delta} is negligible. Therefore, the computational complexity per iteration of Algorithm 1 is marginally higher than, but very close to, that of the Landweber-type method (2).
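To make the iteration concrete, here is a minimal Python sketch of Algorithm 1 under illustrative assumptions not taken from the paper: a linear forward map F(x)=Ax (so L(x)=A and η=0) with (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 (so σ=1/2 and xn=ξn), a randomly generated matrix and data, and the adaptive step-size rule αn=min{μ0∥rn∥²/∥gn∥², μ1}; the parameters τ=2 and μ0=0.5 are chosen so that (25) holds with c0=1/4.

```python
import numpy as np

def ahb_linear(A, y_delta, delta, tau=2.0, mu0=0.5, mu1=1e6,
               beta_max=0.5, sigma=0.5, eta=0.0, max_iter=20000):
    """Sketch of Algorithm 1 for F(x) = A x and R(x) = ||x||^2 / 2,
    so that x_n = xi_n, sigma = 1/2 and eta = 0 (linear case)."""
    xi = np.zeros(A.shape[1])          # xi_0 = 0, hence x_0 = 0
    xi_prev = xi.copy()                # xi_{-1} = xi_0
    x_prev = xi.copy()
    gamma = 0.0                        # gamma_tilde_0 = 0
    alpha_prev, beta_prev, r_prev = 0.0, 0.0, None
    for n in range(max_iter):
        x = xi                                      # x_n = grad R*(xi_n)
        r = A @ x - y_delta                         # (i) residual r_n
        if np.linalg.norm(r) <= tau * delta:        # discrepancy principle
            return x, n
        g = A.T @ r                                 # (ii) gradient g_n = A^T r_n
        alpha = min(mu0 * (r @ r) / (g @ g), mu1)   # adaptive step size
        m = xi - xi_prev                            # (iii) momentum direction m_n
        if n > 0:                                   # gamma_tilde_n via (24)
            gamma = (m @ (x - x_prev)
                     - (1 - eta) * alpha_prev * (r_prev @ r_prev)
                     + (1 + eta) * alpha_prev * delta * np.linalg.norm(r_prev)
                     + beta_prev * gamma)
        mm = m @ m                                  # (iv) momentum coefficient
        beta = 0.0 if mm == 0 else min(
            max(0.0, (alpha * (g @ m) - 2 * sigma * gamma) / mm), beta_max)
        xi_prev, x_prev, r_prev = xi, x, r          # (v) update xi_{n+1}
        alpha_prev, beta_prev = alpha, beta
        xi = xi - alpha * g + beta * m
    return xi, max_iter

# toy test problem (illustrative): well-conditioned random A, 0.1% noise
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = rng.standard_normal(10)
y = A @ x_true
noise = rng.standard_normal(30)
delta = 1e-3 * np.linalg.norm(y)
y_delta = y + delta * noise / np.linalg.norm(noise)
x_rec, n_stop = ahb_linear(A, y_delta, delta)
```

Choosing beta_max < 1 matches the assumption under which strong convergence is established later; on this toy problem the method stops once the discrepancy principle is satisfied.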

In the design of Algorithm 1, the discrepancy principle has been incorporated, i.e. the iteration stops as soon as rnδτδ\|r_{n}^{\delta}\|\leq\tau\delta is satisfied for the first time. In the following we will prove that Algorithm 1 is well-defined by showing that the iteration indeed terminates within finitely many steps.

Lemma 2.1.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 and let k0k\geq 0 be an integer such that xnδB2ρ(x0)x_{n}^{\delta}\in B_{2\rho}(x_{0}) for all 0n<k0\leq n<k. Define γnδ\gamma_{n}^{\delta} by (19) for 0nk0\leq n\leq k. Then γnδγ~nδ\gamma_{n}^{\delta}\leq\tilde{\gamma}_{n}^{\delta} for all 0nk0\leq n\leq k.

Proof.

Since γ0δ=γ~0δ=0\gamma_{0}^{\delta}=\tilde{\gamma}_{0}^{\delta}=0, the result is true for n=0n=0. Assume next that γnδγ~nδ\gamma_{n}^{\delta}\leq\tilde{\gamma}_{n}^{\delta} for some 0n<k0\leq n<k. By noting that βnδ0\beta_{n}^{\delta}\geq 0, we may use (2.2) and the definition of γ~n+1δ\tilde{\gamma}_{n+1}^{\delta} to obtain

γn+1δ\displaystyle\gamma_{n+1}^{\delta} mn+1δ,xn+1δxnδ(1η)αnδrnδ2+(1+η)αnδδrnδ+βnδγnδ\displaystyle\leq\langle m_{n+1}^{\delta},x_{n+1}^{\delta}-x_{n}^{\delta}\rangle-(1-\eta)\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2}+(1+\eta)\alpha_{n}^{\delta}\delta\|r_{n}^{\delta}\|+\beta_{n}^{\delta}\gamma_{n}^{\delta}
mn+1δ,xn+1δxnδ(1η)αnδrnδ2+(1+η)αnδδrnδ+βnδγ~nδ\displaystyle\leq\langle m_{n+1}^{\delta},x_{n+1}^{\delta}-x_{n}^{\delta}\rangle-(1-\eta)\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2}+(1+\eta)\alpha_{n}^{\delta}\delta\|r_{n}^{\delta}\|+\beta_{n}^{\delta}\tilde{\gamma}_{n}^{\delta}
=γ~n+1δ.\displaystyle=\tilde{\gamma}_{n+1}^{\delta}.

By the induction principle, this shows the result. ∎

Lemma 2.2.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 with μ0>0\mu_{0}>0 and τ>1\tau>1 being chosen such that

c0:=11+ητημ04σ>0.\displaystyle c_{0}:=1-\frac{1+\eta}{\tau}-\eta-\frac{\mu_{0}}{4\sigma}>0. (25)

Let k0k\geq 0 be an integer such that rnδ>τδ\|r_{n}^{\delta}\|>\tau\delta for all 0n<k0\leq n<k. Then xnδB2ρ(x0)x_{n}^{\delta}\in B_{2\rho}(x_{0}) for all 0nk0\leq n\leq k and

Δn+1δΔnδc0αnδrnδ2\displaystyle\Delta_{n+1}^{\delta}\leq\Delta_{n}^{\delta}-c_{0}\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2}

for all 0n<k0\leq n<k, where Δnδ:=Dξnδ(x^,xnδ)\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(\hat{x},x_{n}^{\delta}) and x^\hat{x} denotes any solution of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}).

Proof.

We first show by induction that

xnδB2ρ(x0)andDξnδ(x,xnδ)Dξ0(x,x0)\displaystyle x_{n}^{\delta}\in B_{2\rho}(x_{0})\quad\mbox{and}\quad D_{\mathcal{R}}^{\xi_{n}^{\delta}}(x^{\dagger},x_{n}^{\delta})\leq D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger},x_{0}) (26)

for all integers 0nk0\leq n\leq k. It is trivial for n=0n=0 as ξ0δ=ξ0\xi_{0}^{\delta}=\xi_{0} and x0δ=x0x_{0}^{\delta}=x_{0}. Next we assume that (26) holds for all 0nl0\leq n\leq l for some l<kl<k. According to (2.2) and Lemma 2.1 we have for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) that

Δl+1δΔlδ\displaystyle\Delta_{l+1}^{\delta}-\Delta_{l}^{\delta} c0αlδrlδ2+14σ(βlδ)2mlδ2+βlδγlδ12σαlδβlδglδ,mlδ\displaystyle\leq-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}+\frac{1}{4\sigma}(\beta_{l}^{\delta})^{2}\|m_{l}^{\delta}\|^{2}+\beta_{l}^{\delta}\gamma_{l}^{\delta}-\frac{1}{2\sigma}\alpha_{l}^{\delta}\beta_{l}^{\delta}\langle g_{l}^{\delta},m_{l}^{\delta}\rangle
c0αlδrlδ2+14σ(βlδ)2mlδ2+βlδγ~lδ12σαlδβlδglδ,mlδ.\displaystyle\leq-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}+\frac{1}{4\sigma}(\beta_{l}^{\delta})^{2}\|m_{l}^{\delta}\|^{2}+\beta_{l}^{\delta}\tilde{\gamma}_{l}^{\delta}-\frac{1}{2\sigma}\alpha_{l}^{\delta}\beta_{l}^{\delta}\langle g_{l}^{\delta},m_{l}^{\delta}\rangle.

Note that βlδ\beta_{l}^{\delta} is the minimizer of the function thl(t)t\to h_{l}(t) over [0,β][0,\beta], where

hl(t):=c0αlδrlδ2+14σt2mlδ2+tγ~lδ12σαlδtglδ,mlδ,h_{l}(t):=-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}+\frac{1}{4\sigma}t^{2}\|m_{l}^{\delta}\|^{2}+t\tilde{\gamma}_{l}^{\delta}-\frac{1}{2\sigma}\alpha_{l}^{\delta}t\langle g_{l}^{\delta},m_{l}^{\delta}\rangle,

we can conclude that

Δl+1δΔlδhl(βlδ)hl(0)=c0αlδrlδ2.\displaystyle\Delta_{l+1}^{\delta}-\Delta_{l}^{\delta}\leq h_{l}(\beta_{l}^{\delta})\leq h_{l}(0)=-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}. (27)

By virtue of this inequality with x^=x\hat{x}=x^{\dagger}, the induction hypothesis, and Assumption 2 (b) we have

Dξl+1δ(x,xl+1δ)Dξlδ(x,xlδ)Dξ0(x,x0)σρ2D_{\mathcal{R}}^{\xi_{l+1}^{\delta}}(x^{\dagger},x_{l+1}^{\delta})\leq D_{\mathcal{R}}^{\xi_{l}^{\delta}}(x^{\dagger},x_{l}^{\delta})\leq D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger},x_{0})\leq\sigma\rho^{2}

which together with (10) implies that σxl+1δx2σρ2\sigma\|x_{l+1}^{\delta}-x^{\dagger}\|^{2}\leq\sigma\rho^{2} and hence xl+1δxρ\|x_{l+1}^{\delta}-x^{\dagger}\|\leq\rho. Since xx0ρ\|x^{\dagger}-x_{0}\|\leq\rho, we thus have xl+1δx02ρ\|x_{l+1}^{\delta}-x_{0}\|\leq 2\rho, i.e. xl+1δB2ρ(x0)x_{l+1}^{\delta}\in B_{2\rho}(x_{0}). We therefore complete the proof of (26). As a direct consequence, we can see that (27) holds for all 0l<k0\leq l<k which shows the desired result. ∎

Based on Lemma 2.2 we can now prove that Algorithm 1 is well-defined, i.e. the iteration must terminate within finitely many steps.

Theorem 2.3.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 with noisy data yδy^{\delta} satisfying yδyδ\|y^{\delta}-y\|\leq\delta with noise level δ>0\delta>0. Assume that τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen such that (25) holds. Then the algorithm must terminate within finitely many steps, i.e. there exists a finite integer nδn_{\delta} such that

F(xnδδ)yδτδ<F(xnδ)yδ,0n<nδ.\displaystyle\|F(x_{n_{\delta}}^{\delta})-y^{\delta}\|\leq\tau\delta<\|F(x_{n}^{\delta})-y^{\delta}\|,\quad 0\leq n<n_{\delta}. (28)
Proof.

Let l0l\geq 0 be an integer such that F(xnδ)yδ>τδ\|F(x_{n}^{\delta})-y^{\delta}\|>\tau\delta for all 0nl0\leq n\leq l. It then follows from Lemma 2.2 that

c0n=0lαnδF(xnδ)yδ2\displaystyle c_{0}\sum_{n=0}^{l}\alpha_{n}^{\delta}\|F(x_{n}^{\delta})-y^{\delta}\|^{2} n=0l(ΔnδΔn+1δ)\displaystyle\leq\sum_{n=0}^{l}\left(\Delta_{n}^{\delta}-\Delta_{n+1}^{\delta}\right)
=Δ0δΔl+1δΔ0δ\displaystyle=\Delta_{0}^{\delta}-\Delta_{l+1}^{\delta}\leq\Delta_{0}^{\delta}
=Dξ0(x^,x0).\displaystyle=D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0}).

According to the choice of αnδ\alpha_{n}^{\delta} and L(x)L\|L(x)\|\leq L for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), we can see αnδmin{μ0/L2,μ1}\alpha_{n}^{\delta}\geq\min\{\mu_{0}/L^{2},\mu_{1}\} for all 0nl0\leq n\leq l. Thus

(l+1)c0min{μ0/L2,μ1}τ2δ2c0n=0lαnδF(xnδ)yδ2Dξ0(x^,x0)<.\displaystyle(l+1)c_{0}\min\{\mu_{0}/L^{2},\mu_{1}\}\tau^{2}\delta^{2}\leq c_{0}\sum_{n=0}^{l}\alpha_{n}^{\delta}\|F(x_{n}^{\delta})-y^{\delta}\|^{2}\leq D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0})<\infty.

If there were no finite integer nδn_{\delta} such that (28) holds, then ll could be taken to be any positive integer. Letting ll\to\infty in the above inequality gives a contradiction. Thus, the algorithm must terminate within finitely many steps. ∎
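In particular, applying the last displayed inequality in the proof with l=nδ1l=n_{\delta}-1 yields the explicit a priori bound

n_{\delta}\leq\frac{D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0})}{c_{0}\min\{\mu_{0}/L^{2},\mu_{1}\}\tau^{2}\delta^{2}},

so the stopping index grows at most like O(δ2)O(\delta^{-2}) as δ0\delta\to 0.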

3. Convergence analysis

In this section we provide a convergence analysis of Algorithm 1. Let nδn_{\delta} be the integer output by the discrepancy principle; we will analyze the behavior of xnδδx_{n_{\delta}}^{\delta} as δ0\delta\to 0. We first prove a weak convergence result as stated below.

Theorem 3.1.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 using a sequence of noisy data {yδk}\{y^{\delta_{k}}\} satisfying yδkyδk\|y^{\delta_{k}}-y\|\leq\delta_{k} with 0<δk00<\delta_{k}\to 0 as kk\to\infty. Assume that τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen such that (25) holds. Let nk:=nδkn_{k}:=n_{\delta_{k}} be the output integer. Then there is a subsequence {yδkl}\{y^{\delta_{k_{l}}}\} of {yδk}\{y^{\delta_{k}}\} such that

xnklδklx as lx_{n_{k_{l}}}^{\delta_{k_{l}}}\rightharpoonup x^{*}\quad\mbox{ as }l\to\infty

for some solution xx^{*} of (1) in B2ρ(x0)B_{2\rho}(x_{0}).

If in addition (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 and Ker(L(x))Ker(L(x))\emph{Ker}(L(x^{\dagger}))\subset\emph{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then xnkδkxx_{n_{k}}^{\delta_{k}}\rightharpoonup x^{\dagger} as kk\to\infty, where xx^{\dagger} is the unique x0x_{0}-minimum norm solution of (1).

Proof.

By using Lemma 2.2 and (10), we have for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) that

σxnkδkx^2Dξnkδk(x^,xnkδk)Dξ0(x^,x0)\sigma\|x_{n_{k}}^{\delta_{k}}-\hat{x}\|^{2}\leq D_{\mathcal{R}}^{\xi_{n_{k}}^{\delta_{k}}}(\hat{x},x_{n_{k}}^{\delta_{k}})\leq D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0})

which implies that {xnkδk}\{x_{n_{k}}^{\delta_{k}}\} is a bounded sequence. Thus, we can find a subsequence {yδkl}\{y^{\delta_{k_{l}}}\} of {yδk}\{y^{\delta_{k}}\} such that xnklδklxx_{n_{k_{l}}}^{\delta_{k_{l}}}\rightharpoonup x^{*} as ll\to\infty for some xXx^{*}\in X. Because F(xnklδkl)yδklτδkl\|F(x_{n_{k_{l}}}^{\delta_{k_{l}}})-y^{\delta_{k_{l}}}\|\leq\tau\delta_{k_{l}}, we have F(xnklδkl)yF(x_{n_{k_{l}}}^{\delta_{k_{l}}})\to y as ll\to\infty. Thus, by the weak closedness of FF given in Assumption 2 (c), we have xdom(F)x^{*}\in\mbox{dom}(F) and F(x)=yF(x^{*})=y. Furthermore, by the weak lower semi-continuity of norms we have

xx2lim inflxnklδklx21σDξ0(x,x0)ρ2\displaystyle\|x^{*}-x^{\dagger}\|^{2}\leq\liminf_{l\to\infty}\|x_{n_{k_{l}}}^{\delta_{k_{l}}}-x^{\dagger}\|^{2}\leq\frac{1}{\sigma}D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger},x_{0})\leq\rho^{2}

which together with xBρ(x0)x^{\dagger}\in B_{\rho}(x_{0}) implies that xB2ρ(x0)x^{*}\in B_{2\rho}(x_{0}).

When (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 and Ker(L(x))Ker(L(x))\mbox{Ker}(L(x^{\dagger}))\subset\mbox{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), we show that xnkδkxx_{n_{k}}^{\delta_{k}}\rightharpoonup x^{\dagger} as kk\to\infty. It suffices to show that xx^{\dagger} is the unique weak cluster point of {xnkδk}\{x_{n_{k}}^{\delta_{k}}\}. For the given {\mathcal{R}} we have ξnδk=xnδk\xi_{n}^{\delta_{k}}=x_{n}^{\delta_{k}} for all nn and by the definition of ξnδk\xi_{n}^{\delta_{k}} we can see that

xnδkx0\displaystyle x_{n}^{\delta_{k}}-x_{0} Ran(L(xn1δk))Ran(L(x0δk))\displaystyle\in\mbox{Ran}(L(x_{n-1}^{\delta_{k}})^{*})\oplus\cdots\oplus\mbox{Ran}(L(x_{0}^{\delta_{k}})^{*})
Ker(L(xn1δk))Ker(L(x0δk))\displaystyle\subset\mbox{Ker}(L(x_{n-1}^{\delta_{k}}))^{\perp}\oplus\cdots\oplus\mbox{Ker}(L(x_{0}^{\delta_{k}}))^{\perp}
Ker(L(x))\displaystyle\subset\mbox{Ker}(L(x^{\dagger}))^{\perp}

for all nn which in particular implies that xnkδkx0Ker(L(x))x_{n_{k}}^{\delta_{k}}-x_{0}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}. Consequently, for any weak cluster point x¯\bar{x} of {xnkδk}\{x_{n_{k}}^{\delta_{k}}\} we have x¯x0Ker(L(x))\bar{x}-x_{0}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}. Since xx^{\dagger} is the x0x_{0}-minimum norm solution of (1), we must have xx0Ker(L(x))x^{\dagger}-x_{0}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}. Indeed, for any wKer(L(x))w\in\mbox{Ker}(L(x^{\dagger})) we have x+twB2ρ(x0)x^{\dagger}+tw\in B_{2\rho}(x_{0}) for small |t||t|. Thus, it follows from Assumption 2 (d) that

F(x+tw)F(x)\displaystyle\|F(x^{\dagger}+tw)-F(x^{\dagger})\| =F(x+tw)F(x)L(x)(tw)\displaystyle=\|F(x^{\dagger}+tw)-F(x^{\dagger})-L(x^{\dagger})(tw)\|
ηF(x+tw)F(x)\displaystyle\leq\eta\|F(x^{\dagger}+tw)-F(x^{\dagger})\|

which implies F(x+tw)=F(x)=yF(x^{\dagger}+tw)=F(x^{\dagger})=y for small |t||t| as 0η<10\leq\eta<1. Since xx^{\dagger} is the x0x_{0}-minimum norm solution of (1), we have x+twx02xx02\|x^{\dagger}+tw-x_{0}\|^{2}\geq\|x^{\dagger}-x_{0}\|^{2} for small |t||t| and thus xx0,w=0\langle x^{\dagger}-x_{0},w\rangle=0 for all wKer(L(x))w\in\mbox{Ker}(L(x^{\dagger})). Therefore x¯x=(x¯x0)+(x0x)Ker(L(x))\bar{x}-x^{\dagger}=(\bar{x}-x_{0})+(x_{0}-x^{\dagger})\in\mbox{Ker}(L(x^{\dagger}))^{\perp}.

On the other hand, since both x¯\bar{x} and xx^{\dagger} are solutions of F(x)=yF(x)=y in B2ρ(x0)B_{2\rho}(x_{0}), we may use Assumption 2 (d) to obtain

L(x)(x¯x)(1+η)F(x¯)F(x)=0\displaystyle\|L(x^{\dagger})(\bar{x}-x^{\dagger})\|\leq(1+\eta)\|F(\bar{x})-F(x^{\dagger})\|=0

which means x¯xKer(L(x))\bar{x}-x^{\dagger}\in\mbox{Ker}(L(x^{\dagger})). Therefore

x¯xKer(L(x))Ker(L(x))={0}\bar{x}-x^{\dagger}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}\cap\mbox{Ker}(L(x^{\dagger}))=\{0\}

and thus x¯=x\bar{x}=x^{\dagger}. The proof is complete. ∎

The above theorem gives a weak convergence result on Algorithm 1 for any β(0,]\beta\in(0,\infty]. However, it remains unclear whether strong convergence can be guaranteed for general β\beta. Nevertheless, by restricting β\beta to be less than 1, we can establish a strong convergence result for Algorithm 1 in the remainder of this section. The proof of strong convergence is rather involved; we first need to investigate the counterpart of Algorithm 1 using the exact data yy, which is formulated as follows.

Algorithm 2.

Take an initial guess ξ0\xi_{0}, calculate x0:=argminxX{(x)ξ0,x}x_{0}:=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{0},x\rangle\}, and set ξ1=ξ0\xi_{-1}=\xi_{0}. Pick the numbers 0<β0<\beta\leq\infty, μ0>0\mu_{0}>0 and μ1>0\mu_{1}>0. For n0n\geq 0 do the following:

  1. (i)

    Calculate rn:=F(xn)yr_{n}:=F(x_{n})-y and gn:=L(xn)rng_{n}:=L(x_{n})^{*}r_{n};

  2. (ii)

    Determine αn\alpha_{n} according to

    αn:=μ0L2or αn:={min{μ0rn2gn2,μ1}if rn0,0if rn=0;\displaystyle\alpha_{n}:=\frac{\mu_{0}}{L^{2}}\quad\mbox{or }\quad\alpha_{n}:=\left\{\begin{array}[]{lll}\min\left\{\frac{\mu_{0}\|r_{n}\|^{2}}{\|g_{n}\|^{2}},\mu_{1}\right\}&\emph{if }r_{n}\neq 0,\\ 0&\emph{if }r_{n}=0;\end{array}\right.
  3. (iii)

    Calculate mn:=ξnξn1m_{n}:=\xi_{n}-\xi_{n-1} and determine γ~n\tilde{\gamma}_{n} by

    γ~n={0if n=0,mn,xnxn1(1η)αn1rn12+βn1γ~n1if n1;\displaystyle\tilde{\gamma}_{n}=\left\{\begin{array}[]{lll}0&\emph{if }n=0,\\ \langle m_{n},x_{n}-x_{n-1}\rangle-(1-\eta)\alpha_{n-1}\|r_{n-1}\|^{2}+\beta_{n-1}\tilde{\gamma}_{n-1}&\emph{if }n\geq 1;\end{array}\right.
  4. (iv)

    Calculate βn\beta_{n} by

    βn={min{max{0,αngn,mn2σγ~nmn2},β} if mn0,0 if mn=0;\displaystyle\beta_{n}=\left\{\begin{array}[]{lll}\min\left\{\max\left\{0,\frac{\alpha_{n}\langle g_{n},m_{n}\rangle-2\sigma\tilde{\gamma}_{n}}{\|m_{n}\|^{2}}\right\},\beta\right\}&\emph{ if }m_{n}\neq 0,\\ 0&\emph{ if }m_{n}=0;\end{array}\right.
  5. (v)

    Update ξn+1\xi_{n+1} and xn+1x_{n+1} by

    ξn+1=ξnαngn+βnmn,xn+1=argminxX{(x)ξn+1,x}.\xi_{n+1}=\xi_{n}-\alpha_{n}g_{n}+\beta_{n}m_{n},\quad x_{n+1}=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{n+1},x\rangle\}.

For the iterative sequence {xn}\{x_{n}\} obtained by Algorithm 2, we first have the following result.

Lemma 3.2.

Let Assumption 1 and Assumption 2 hold and consider Algorithm 2 with 0<μ0<4σ(1η)0<\mu_{0}<4\sigma(1-\eta). Then xnB2ρ(x0)x_{n}\in B_{2\rho}(x_{0}) for all integers n0n\geq 0 and for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) there holds

Δn+1Δnc1αnF(xn)y2,n0,\displaystyle\Delta_{n+1}\leq\Delta_{n}-c_{1}\alpha_{n}\|F(x_{n})-y\|^{2},\quad\forall n\geq 0,

where Δn:=Dξn(x^,xn)\Delta_{n}:=D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n}) and c1:=1ημ0/(4σ)>0c_{1}:=1-\eta-\mu_{0}/(4\sigma)>0. Consequently {Δn}\{\Delta_{n}\} is monotonically decreasing and

n=0αnF(xn)y2<.\sum_{n=0}^{\infty}\alpha_{n}\|F(x_{n})-y\|^{2}<\infty.
Proof.

By an argument similar to the one in the proof of Lemma 2.2 we obtain the result immediately. ∎

Lemma 3.2 has the following consequences which will be used later.

Lemma 3.3.

Let Assumption 1 and Assumption 2 hold and consider Algorithm 2 with 0<μ0<4σ(1η)0<\mu_{0}<4\sigma(1-\eta).

  1. (i)

    If F(xn)y=0F(x_{n})-y=0 for some n0n\geq 0, then xm=xnx_{m}=x_{n} for all m>nm>n.

  2. (ii)

    If ξn+1=ξn\xi_{n+1}=\xi_{n} for some n0n\geq 0, then F(xn)y=0F(x_{n})-y=0, xm=xnx_{m}=x_{n} and ξm=ξn\xi_{m}=\xi_{n} for all m>nm>n.

Proof.

(i) If F(xn)y=0F(x_{n})-y=0 for some nn, then xnx_{n} is a solution of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}). Thus we may use Lemma 3.2 with x^=xn\hat{x}=x_{n} to conclude Dξm(xn,xm)Dξn(xn,xn)=0D_{\mathcal{R}}^{\xi_{m}}(x_{n},x_{m})\leq D_{\mathcal{R}}^{\xi_{n}}(x_{n},x_{n})=0 for all m>nm>n. Consequently, it follows from (10) that xm=xnx_{m}=x_{n} for all m>nm>n.

(ii) If ξn+1=ξn\xi_{n+1}=\xi_{n} for some nn, then xn+1=(ξn+1)=(ξn)=xnx_{n+1}=\nabla{\mathcal{R}}^{*}(\xi_{n+1})=\nabla{\mathcal{R}}^{*}(\xi_{n})=x_{n}. We thus have from Lemma 3.2, for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}), that

Dξn(x^,xn)=Dξn+1(x^,xn+1)Dξn(x^,xn)c1αnF(xn)y2\displaystyle D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n})=D_{\mathcal{R}}^{\xi_{n+1}}(\hat{x},x_{n+1})\leq D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n})-c_{1}\alpha_{n}\|F(x_{n})-y\|^{2}

which implies αnF(xn)y2=0\alpha_{n}\|F(x_{n})-y\|^{2}=0. If F(xn)yF(x_{n})\neq y, then the choice of αn\alpha_{n} implies αn>0\alpha_{n}>0 and thus αnF(xn)y2>0\alpha_{n}\|F(x_{n})-y\|^{2}>0 which is a contradiction. Therefore we must have F(xn)=yF(x_{n})=y. By using (i) we then have xm=xnx_{m}=x_{n} for all mnm\geq n. Using these facts and an induction argument we can conclude that ξm=ξn\xi_{m}=\xi_{n} for all m>nm>n. Indeed, this is trivial for m=n+1m=n+1 by the given condition. Assume next ξm=ξn\xi_{m}=\xi_{n} for all n<mkn<m\leq k for some k>nk>n. Then, by the definition of ξk+1\xi_{k+1}, we have

ξk+1\displaystyle\xi_{k+1} =ξkαkL(xk)(F(xk)y)+βk(ξkξk1)\displaystyle=\xi_{k}-\alpha_{k}L(x_{k})^{*}(F(x_{k})-y)+\beta_{k}(\xi_{k}-\xi_{k-1})
=ξnαkL(xn)(F(xn)y)=ξn.\displaystyle=\xi_{n}-\alpha_{k}L(x_{n})^{*}(F(x_{n})-y)=\xi_{n}.

By the induction principle, we thus complete the proof. ∎

In order to show the strong convergence of the iterative sequence {xn}\{x_{n}\} generated by Algorithm 2, we need the following result; see [17, Proposition 2.3] or [18, Proposition 3.6].

Proposition 3.4.

Let Assumption 1 and Assumption 2 hold and consider the equation (1). Let {xn}B2ρ(x0)dom()\{x_{n}\}\subset B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) and {ξn}X\{\xi_{n}\}\subset X be such that

  1. (i)

    ξn(xn)\xi_{n}\in\partial{\mathcal{R}}(x_{n}) for all nn;

  2. (ii)

    for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) the sequence {Dξn(x^,xn)}\{D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n})\} is convergent;

  3. (iii)

    limnF(xn)y=0\lim_{n\to\infty}\|F(x_{n})-y\|=0.

  4. (iv)

    there is a subsequence {nk}\{n_{k}\} of integers with nkn_{k}\rightarrow\infty such that for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) there holds

    limlsupkl|ξnkξnl,xnkx^|=0.\lim_{l\rightarrow\infty}\sup_{k\geq l}|\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle|=0. (29)

Then there exists a solution xx^{*} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) such that

limnDξn(x,xn)=0.\lim_{n\rightarrow\infty}D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n})=0.

If, in addition, ξn+1ξnKer(L(x))\xi_{n+1}-\xi_{n}\in\emph{Ker}(L(x^{\dagger}))^{\perp} for all nn, then x=xx^{*}=x^{\dagger}.

Based on Lemma 3.2, Lemma 3.3 and Proposition 3.4, we are now ready to show the strong convergence of Algorithm 2.

Theorem 3.5.

Let Assumption 1 and Assumption 2 hold and let {xn}\{x_{n}\} be defined by Algorithm 2 using the exact data yy, where 0<μ0<4σ(1η)0<\mu_{0}<4\sigma(1-\eta) and 0<β<10<\beta<1. Then there exists a solution xx^{*} of (1) in B2ρ(x0)B_{2\rho}(x_{0}) such that

limnDξn(x,xn)=0 and limnxnx=0.\displaystyle\lim_{n\to\infty}D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n})=0\quad\mbox{ and }\quad\lim_{n\to\infty}\|x_{n}-x^{*}\|=0. (30)

If, in addition, Ker(L(x))Ker(L(x))\emph{Ker}(L(x^{\dagger}))\subset\emph{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then x=xx^{*}=x^{\dagger}.

Proof.

We will use Proposition 3.4 to prove the conclusion. Item (i) in Proposition 3.4 holds automatically by the definition of xnx_{n}. According to Lemma 3.2 we immediately obtain item (ii) in Proposition 3.4. To show item (iii), we note from the choice of {αn}\{\alpha_{n}\} that αnc3:=min{μ0/L2,μ1}\alpha_{n}\geq c_{3}:=\min\{\mu_{0}/L^{2},\mu_{1}\} whenever rn0r_{n}\neq 0; since both sides vanish when rn=0r_{n}=0, we have αnrn2c3rn2\alpha_{n}\|r_{n}\|^{2}\geq c_{3}\|r_{n}\|^{2} for all n0n\geq 0. From Lemma 3.2 it then follows that

c3n=0rn2n=0αnrn2<.c_{3}\sum_{n=0}^{\infty}\|r_{n}\|^{2}\leq\sum_{n=0}^{\infty}\alpha_{n}\|r_{n}\|^{2}<\infty.

This in particular implies (iii) in Proposition 3.4, i.e.

limnrn=0.\displaystyle\lim_{n\to\infty}\|r_{n}\|=0. (31)

We now verify (iv) in Proposition 3.4. By using Lemma 3.3 (i) we can conclude that

rn=0 for some nrm=0 for all mn.\displaystyle\|r_{n}\|=0\mbox{ for some }n\Longrightarrow\|r_{m}\|=0\mbox{ for all }m\geq n. (32)

In view of (31) and (32), we can pick a sequence {nk}\{n_{k}\} of integers by setting n0=0n_{0}=0 and letting nkn_{k}, for each k1k\geq 1, be the first integer satisfying

nknk1+1 and rnkrnk1.n_{k}\geq n_{k-1}+1\quad\text{ and }\quad\|r_{n_{k}}\|\leq\|r_{n_{k-1}}\|.

It is easy to see that

rnkrn,0nnk.\displaystyle\|r_{n_{k}}\|\leq\|r_{n}\|,\qquad 0\leq n\leq n_{k}. (33)
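This construction of {nk} is a running-minimum selection over the residual norms; a small sketch with made-up residual values (the list below is purely hypothetical) illustrates the rule and the resulting property (33):

```python
def select_indices(res_norms):
    """n_0 = 0; for k >= 1, n_k is the first index n > n_{k-1}
    with ||r_n|| <= ||r_{n_{k-1}}||."""
    idx = [0]
    for n in range(1, len(res_norms)):
        if res_norms[n] <= res_norms[idx[-1]]:
            idx.append(n)
    return idx

res = [5.0, 6.0, 4.0, 4.5, 4.0, 1.0, 2.0]   # hypothetical ||r_n|| values
indices = select_indices(res)                # -> [0, 2, 4, 5]
```

By construction, each selected residual norm is less than or equal to every earlier one, which is exactly (33).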

For the above chosen {nk}\{n_{k}\}, we now show (29) for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}). For any integers l<kl<k we write

ξnkξnl,xnkx^=n=nlnk1ξn+1ξn,xnkx^.\displaystyle\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle=\sum_{n=n_{l}}^{n_{k}-1}\langle\xi_{n+1}-\xi_{n},x_{n_{k}}-\hat{x}\rangle. (34)

From the definition of {ξn}\{\xi_{n}\} and an induction argument it follows readily that

ξn+1ξn=i=0n(j=i+1nβj)αiL(xi)ri.\displaystyle\xi_{n+1}-\xi_{n}=-\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}L(x_{i})^{*}r_{i}.

Thus

ξn+1ξn,xnkx^\displaystyle\langle\xi_{n+1}-\xi_{n},x_{n_{k}}-\hat{x}\rangle =i=0n(j=i+1nβj)αiL(xi)ri,xnkx^\displaystyle=-\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\langle L(x_{i})^{*}r_{i},x_{n_{k}}-\hat{x}\rangle
=i=0n(j=i+1nβj)αiri,L(xi)(xnkx^).\displaystyle=-\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\langle r_{i},L(x_{i})(x_{n_{k}}-\hat{x})\rangle.

Using Assumption 2 (d) it is easy to obtain

L(xi)(xnkx^)\displaystyle\|L(x_{i})(x_{n_{k}}-\hat{x})\| L(xi)(xix^)+L(xi)(xnkxi)\displaystyle\leq\|L(x_{i})(x_{i}-\hat{x})\|+\|L(x_{i})(x_{n_{k}}-x_{i})\|
(1+η)(ri+F(xi)F(xnk))\displaystyle\leq(1+\eta)\left(\|r_{i}\|+\|F(x_{i})-F(x_{n_{k}})\|\right)
(1+η)(2ri+rnk).\displaystyle\leq(1+\eta)\left(2\|r_{i}\|+\|r_{n_{k}}\|\right).

Therefore, by using the Cauchy-Schwarz inequality, (33), and 0βnβ0\leq\beta_{n}\leq\beta for any n0n\geq 0, we can obtain

|ξn+1ξn,xnkx^|\displaystyle|\langle\xi_{n+1}-\xi_{n},x_{n_{k}}-\hat{x}\rangle| i=0n(j=i+1nβj)αiriL(xi)(xnkx^)\displaystyle\leq\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\|r_{i}\|\|L(x_{i})(x_{n_{k}}-\hat{x})\|
(1+η)i=0n(j=i+1nβj)αiri(2ri+rnk)\displaystyle\leq(1+\eta)\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\|r_{i}\|\left(2\|r_{i}\|+\|r_{n_{k}}\|\right)
3(1+η)i=0n(j=i+1nβj)αiri2\displaystyle\leq 3(1+\eta)\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\|r_{i}\|^{2}
3(1+η)i=0nβniαiri2.\displaystyle\leq 3(1+\eta)\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}.

This together with (34) then gives

|ξnkξnl,xnkx^|3(1+η)n=nlnk1i=0nβniαiri2.\displaystyle|\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle|\leq 3(1+\eta)\sum_{n=n_{l}}^{n_{k}-1}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}. (35)

To proceed further, let

SN:=n=0Ni=0nβniαiri2,N0.S_{N}:=\sum_{n=0}^{N}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2},\quad\forall N\geq 0.

Then

SN+1SN\displaystyle S_{N+1}-S_{N} =n=0N+1i=0nβniαiri2n=0Ni=0nβniαiri2\displaystyle=\sum_{n=0}^{N+1}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}-\sum_{n=0}^{N}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}
=i=0N+1βN+1iαiri20.\displaystyle=\sum_{i=0}^{N+1}\beta^{N+1-i}\alpha_{i}\|r_{i}\|^{2}\geq 0.

Moreover, by rearranging the terms in SNS_{N}, utilizing the property 0<β<10<\beta<1, and invoking Lemma 3.2, we can obtain

SN=i=0N(n=iNβni)αiri211βi=0Nαiri211βi=0αiri2<.S_{N}=\sum_{i=0}^{N}\left(\sum_{n=i}^{N}\beta^{n-i}\right)\alpha_{i}\|r_{i}\|^{2}\leq\frac{1}{1-\beta}\sum_{i=0}^{N}\alpha_{i}\|r_{i}\|^{2}\leq\frac{1}{1-\beta}\sum_{i=0}^{\infty}\alpha_{i}\|r_{i}\|^{2}<\infty.
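As a sanity check, the double sum SN and the geometric-series bound can be compared numerically on a made-up sequence, with the values ai standing in for αi∥ri∥²:

```python
def s_partial(a, beta, N):
    """S_N = sum_{n=0}^{N} sum_{i=0}^{n} beta^{n-i} * a_i."""
    return sum(beta ** (n - i) * a[i]
               for n in range(N + 1) for i in range(n + 1))

a = [1.0, 0.5, 0.25, 0.125, 0.0625]   # stand-ins for alpha_i * ||r_i||^2
beta = 0.5                            # any 0 < beta < 1 works here
S = s_partial(a, beta, len(a) - 1)
bound = sum(a) / (1 - beta)           # the bound (1 - beta)^{-1} sum_i a_i
```

The partial sums are nondecreasing in N and stay below the bound, as the rearrangement argument predicts.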

Therefore, {SN}\{S_{N}\} is a monotonically increasing sequence with a finite upper bound, and thus it converges by the monotone convergence theorem. In particular, {SN}\{S_{N}\} is a Cauchy sequence and hence

limlsupkl|Snk1Snl1|=0.\lim_{l\to\infty}\sup_{k\geq l}|S_{n_{k}-1}-S_{n_{l}-1}|=0.

Note that

|Snk1Snl1|=n=nlnk1i=0nβniαiri2.|S_{n_{k}-1}-S_{n_{l}-1}|=\sum_{n=n_{l}}^{n_{k}-1}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}.

We thus have from (35) that

supkl|ξnkξnl,xnkx^|3(1+η)supkl|Snk1Snl1|0\displaystyle\sup_{k\geq l}|\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle|\leq 3(1+\eta)\sup_{k\geq l}|S_{n_{k}-1}-S_{n_{l}-1}|\to 0

as ll\to\infty. This shows (iv) in Proposition 3.4. Thus, we may use Proposition 3.4 to conclude the existence of a solution xx^{*} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) such that (30) holds.

If Ker(L(x))Ker(L(x))\mbox{Ker}(L(x^{\dagger}))\subset\mbox{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then we may use the definition of ξn+1\xi_{n+1} to obtain

ξn+1ξn\displaystyle\xi_{n+1}-\xi_{n} Ran(L(xn))Ran(L(x0))\displaystyle\in\mbox{Ran}(L(x_{n})^{*})\oplus\cdots\oplus\mbox{Ran}(L(x_{0})^{*})
Ker(L(xn))Ker(L(x0))\displaystyle\subset\mbox{Ker}(L(x_{n}))^{\perp}\oplus\cdots\oplus\mbox{Ker}(L(x_{0}))^{\perp}
Ker(L(x)).\displaystyle\subset\mbox{Ker}(L(x^{\dagger}))^{\perp}.

Therefore, we may use the last part of Proposition 3.4 to conclude x=xx^{*}=x^{\dagger}. ∎

In order to establish the regularization property of Algorithm 1, we need a stability result connecting Algorithm 1 with Algorithm 2. The following result is sufficient for our purpose.

Lemma 3.6.

Let Assumption 1 and Assumption 2 hold, let τ>1\tau>1 and μ0>0\mu_{0}>0 be chosen such that (25) holds, and let β<\beta<\infty. Let {yδl}\{y^{\delta_{l}}\} be a sequence of noisy data satisfying yδlyδl\|y^{\delta_{l}}-y\|\leq\delta_{l} with 0<δl00<\delta_{l}\to 0 as ll\to\infty, and let ξnδl,xnδl\xi_{n}^{\delta_{l}},x_{n}^{\delta_{l}}, 0nnδl0\leq n\leq n_{\delta_{l}}, be defined by Algorithm 1 using noisy data yδly^{\delta_{l}}, where nδln_{\delta_{l}} denotes the output integer. Let {ξn,xn}\{\xi_{n},x_{n}\} be defined by Algorithm 2, the counterpart of Algorithm 1 using the exact data yy. Then, for any finite integer n^lim inflnδl\hat{n}\leq\liminf_{l\to\infty}n_{\delta_{l}} there hold

ξnδlξn and xnδlxn as l\xi_{n}^{\delta_{l}}\to\xi_{n}\quad\mbox{ and }\quad x_{n}^{\delta_{l}}\to x_{n}\quad\text{ as }l\to\infty

for all 0nn^0\leq n\leq\hat{n}.

Proof.

Since n^lim inflnδl\hat{n}\leq\liminf_{l\to\infty}n_{\delta_{l}}, we can conclude that nδln^n_{\delta_{l}}\geq\hat{n} for large ll and hence ξnδl,xnδl\xi_{n}^{\delta_{l}},x_{n}^{\delta_{l}} are well-defined for all 0nn^0\leq n\leq\hat{n}. Let

n:=inf{n:n and ξn=ξn1},\displaystyle n_{*}:=\inf\{n:n\in{\mathbb{N}}\mbox{ and }\xi_{n}=\xi_{n-1}\},

where :={1,2,}{\mathbb{N}}:=\{1,2,\cdots\} denotes the set of natural numbers. Then 1n1\leq n_{*}\leq\infty. By the definition of nn_{*} we have mn0m_{n}\neq 0 for 1n<n1\leq n<n_{*} and by using Lemma 3.3 we also have mn=0m_{n}=0 for nn<n_{*}\leq n<\infty.

We first use an induction argument to show that

ξnδlξn,xnδlxn,αnδlrnδlαnrn,βnδlβn,γ~nδlγ~n\displaystyle\xi_{n}^{\delta_{l}}\to\xi_{n},\quad x_{n}^{\delta_{l}}\to x_{n},\quad\alpha_{n}^{\delta_{l}}r_{n}^{\delta_{l}}\to\alpha_{n}r_{n},\quad\beta_{n}^{\delta_{l}}\to\beta_{n},\quad\tilde{\gamma}_{n}^{\delta_{l}}\to\tilde{\gamma}_{n} (36)

as ll\to\infty for every 0n<min{n,n^}0\leq n<\min\{n_{*},\hat{n}\}. Note that ξ0δl=ξ0\xi_{0}^{\delta_{l}}=\xi_{0}, x0δl=x0x_{0}^{\delta_{l}}=x_{0} and γ~0δl=γ~0=0\tilde{\gamma}_{0}^{\delta_{l}}=\tilde{\gamma}_{0}=0; since m0δl=m0=0m_{0}^{\delta_{l}}=m_{0}=0, we have β0δl=β0=0\beta_{0}^{\delta_{l}}=\beta_{0}=0. Furthermore, since r0δlr0r_{0}^{\delta_{l}}\to r_{0} as ll\to\infty, we also have α0δlr0δlα0r0\alpha_{0}^{\delta_{l}}r_{0}^{\delta_{l}}\to\alpha_{0}r_{0} no matter whether r0=0r_{0}=0 or not. Consequently (36) holds for n=0n=0. Now we assume that (36) holds for 0n<k0\leq n<k for some integer 1k<min{n,n^}1\leq k<\min\{n_{*},\hat{n}\}. By the induction hypothesis, the definition of ξkδl\xi_{k}^{\delta_{l}}, xkδlx_{k}^{\delta_{l}}, and the continuity of FF, LL and \nabla{\mathcal{R}}^{*}, we can easily derive that

ξkδl\displaystyle\xi_{k}^{\delta_{l}} =ξk1δlαk1δlL(xk1δl)rk1δl+βk1δlmk1δl\displaystyle=\xi_{k-1}^{\delta_{l}}-\alpha_{k-1}^{\delta_{l}}L(x_{k-1}^{\delta_{l}})^{*}r_{k-1}^{\delta_{l}}+\beta_{k-1}^{\delta_{l}}m_{k-1}^{\delta_{l}}
ξk1αk1L(xk1)rk1+βk1mk1\displaystyle\to\xi_{k-1}-\alpha_{k-1}L(x_{k-1})^{*}r_{k-1}+\beta_{k-1}m_{k-1}
=ξk,\displaystyle=\xi_{k},
xkδl\displaystyle x_{k}^{\delta_{l}} =(ξkδl)(ξk)=xk\displaystyle=\nabla{\mathcal{R}}^{*}(\xi_{k}^{\delta_{l}})\to\nabla{\mathcal{R}}^{*}(\xi_{k})=x_{k}

and

γ~kδl\displaystyle\tilde{\gamma}_{k}^{\delta_{l}} =mkδl,xkδlxk1δl(1η)αk1δlrk1δl2+(1+η)δlαk1δlrk1δl+βk1δlγ~k1δl\displaystyle=\langle m_{k}^{\delta_{l}},x_{k}^{\delta_{l}}-x_{k-1}^{\delta_{l}}\rangle-(1-\eta)\alpha_{k-1}^{\delta_{l}}\|r_{k-1}^{\delta_{l}}\|^{2}+(1+\eta)\delta_{l}\alpha_{k-1}^{\delta_{l}}\|r_{k-1}^{\delta_{l}}\|+\beta_{k-1}^{\delta_{l}}\tilde{\gamma}_{k-1}^{\delta_{l}}
mk,xkxk1(1η)αk1rk12+βk1γ~k1=γ~k\displaystyle\to\langle m_{k},x_{k}-x_{k-1}\rangle-(1-\eta)\alpha_{k-1}\|r_{k-1}\|^{2}+\beta_{k-1}\tilde{\gamma}_{k-1}=\tilde{\gamma}_{k}

as ll\to\infty. We now show αkδlrkδlαkrk\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}\to\alpha_{k}r_{k} as ll\to\infty. When rk=0r_{k}=0, by using 0αkδlμ10\leq\alpha_{k}^{\delta_{l}}\leq\mu_{1} we have

αkδlrkδlαkrk=αkδlrkδlμ1rkδlμ1rk=0 as l.\|\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}-\alpha_{k}r_{k}\|=\|\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}\|\leq\mu_{1}\|r_{k}^{\delta_{l}}\|\to\mu_{1}\|r_{k}\|=0\quad\mbox{ as }l\to\infty.

When rk0r_{k}\neq 0, for gk:=L(xk)rkg_{k}:=L(x_{k})^{*}r_{k} we may use Assumption 2 (d) to obtain

gk,xkx\displaystyle\langle g_{k},x_{k}-x^{\dagger}\rangle =rk,L(xk)(xkx)\displaystyle=\langle r_{k},L(x_{k})(x_{k}-x^{\dagger})\rangle
=rk2+rk,yF(xk)L(xk)(xxk)\displaystyle=\|r_{k}\|^{2}+\langle r_{k},y-F(x_{k})-L(x_{k})(x^{\dagger}-x_{k})\rangle
rk2rkyF(xk)L(xk)(xxk)\displaystyle\geq\|r_{k}\|^{2}-\|r_{k}\|\|y-F(x_{k})-L(x_{k})(x^{\dagger}-x_{k})\|
(1η)rk2>0\displaystyle\geq(1-\eta)\|r_{k}\|^{2}>0

which implies gk0g_{k}\neq 0. Thus, by using xkδlxkx_{k}^{\delta_{l}}\to x_{k} and the continuity of FF and LL, we have rkδl>τδl\|r_{k}^{\delta_{l}}\|>\tau\delta_{l} and gkδl0g_{k}^{\delta_{l}}\neq 0 for sufficiently large ll. Consequently, by the definition of αkδl\alpha_{k}^{\delta_{l}} and αk\alpha_{k} we can conclude that αkδlαk\alpha_{k}^{\delta_{l}}\to\alpha_{k} as ll\to\infty. Therefore

αkδlrkδlαkrkαkδlrkδlrk+|αkδlαk|rk0 as l.\|\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}-\alpha_{k}r_{k}\|\leq\alpha_{k}^{\delta_{l}}\|r_{k}^{\delta_{l}}-r_{k}\|+|\alpha_{k}^{\delta_{l}}-\alpha_{k}|\|r_{k}\|\to 0\quad\mbox{ as }l\to\infty.

Since mk0m_{k}\neq 0, we have mkδl0m_{k}^{\delta_{l}}\neq 0 for large ll. Therefore, by the definitions of βkδl\beta_{k}^{\delta_{l}} and βk\beta_{k}, the facts established above, and the induction hypothesis, we obtain

βkδl\displaystyle\beta_{k}^{\delta_{l}} =min{max{0,αkδlgkδl,mkδl2σγ~kδlmkδl2},β}\displaystyle=\min\left\{\max\left\{0,\frac{\alpha_{k}^{\delta_{l}}\langle g_{k}^{\delta_{l}},m_{k}^{\delta_{l}}\rangle-2\sigma\tilde{\gamma}_{k}^{\delta_{l}}}{\|m_{k}^{\delta_{l}}\|^{2}}\right\},\beta\right\}
min{max{0,αkgk,mk2σγ~kmk2},β}=βk\displaystyle\to\min\left\{\max\left\{0,\frac{\alpha_{k}\langle g_{k},m_{k}\rangle-2\sigma\tilde{\gamma}_{k}}{\|m_{k}\|^{2}}\right\},\beta\right\}=\beta_{k}

as ll\to\infty, where we used αkδlgkδlαkgk\alpha_{k}^{\delta_{l}}g_{k}^{\delta_{l}}\to\alpha_{k}g_{k} as ll\to\infty which follows easily from the established facts. By the induction principle, we thus obtain (36) for 0n<min{n,n^}0\leq n<\min\{n_{*},\hat{n}\}.

When n^n\hat{n}\geq n_{*}, we next use an induction argument to show that

ξnδlξn and xnδlxn as l\displaystyle\xi_{n}^{\delta_{l}}\to\xi_{n}\quad\mbox{ and }\quad x_{n}^{\delta_{l}}\to x_{n}\quad\mbox{ as }l\to\infty (37)

for nnn^n_{*}\leq n\leq\hat{n}. Since mn=0m_{n_{*}}=0, we may use Lemma 3.3 to conclude that rk=0r_{k}=0 and mk+1=0m_{k+1}=0 for all kn1k\geq n_{*}-1. We first show (37) for n=nn=n_{*}. Note that

ξn=ξn1+βn1mn1.\xi_{n_{*}}=\xi_{n_{*}-1}+\beta_{n_{*}-1}m_{n_{*}-1}.

Therefore, by the definition of ξnδl\xi_{n_{*}}^{\delta_{l}}, (36), 0αn1δlμ10\leq\alpha_{n_{*}-1}^{\delta_{l}}\leq\mu_{1} we have

ξnδlξn\displaystyle\|\xi_{n_{*}}^{\delta_{l}}-\xi_{n_{*}}\| ξn1δlξn1+μ1L(xn1δl)rn1δl\displaystyle\leq\|\xi_{n_{*}-1}^{\delta_{l}}-\xi_{n_{*}-1}\|+\mu_{1}\|L(x_{n_{*}-1}^{\delta_{l}})^{*}r_{n_{*}-1}^{\delta_{l}}\|
+βn1δlmn1δlβn1mn1\displaystyle\quad\,+\|\beta_{n_{*}-1}^{\delta_{l}}m_{n_{*}-1}^{\delta_{l}}-\beta_{n_{*}-1}m_{n_{*}-1}\|
μ1L(xn1)rn1=0\displaystyle\to\mu_{1}\|L(x_{n_{*}-1})^{*}r_{n_{*}-1}\|=0

and consequently, by the continuity of \nabla{\mathcal{R}}^{*}, we have xnδlxnx_{n_{*}}^{\delta_{l}}\to x_{n_{*}} as ll\to\infty. Now assume that (37) holds for nnkn_{*}\leq n\leq k for some nk<n^n_{*}\leq k<\hat{n}. By using rk=0r_{k}=0 and mk=0m_{k}=0 we have ξk+1=ξk\xi_{k+1}=\xi_{k}. Thus, by using 0αkδlμ10\leq\alpha_{k}^{\delta_{l}}\leq\mu_{1} and 0βkδlβ0\leq\beta_{k}^{\delta_{l}}\leq\beta, we have

ξk+1δlξk+1\displaystyle\|\xi_{k+1}^{\delta_{l}}-\xi_{k+1}\| =ξkδlξkαkδlL(xkδl)rkδl+βkδlmkδl\displaystyle=\|\xi_{k}^{\delta_{l}}-\xi_{k}-\alpha_{k}^{\delta_{l}}L(x_{k}^{\delta_{l}})^{*}r_{k}^{\delta_{l}}+\beta_{k}^{\delta_{l}}m_{k}^{\delta_{l}}\|
ξkδlξk+μ1L(xkδl)rkδl+βmkδl\displaystyle\leq\|\xi_{k}^{\delta_{l}}-\xi_{k}\|+\mu_{1}\|L(x_{k}^{\delta_{l}})^{*}r_{k}^{\delta_{l}}\|+\beta\|m_{k}^{\delta_{l}}\|
μ1L(xk)rk+βmk=0\displaystyle\to\mu_{1}\|L(x_{k})^{*}r_{k}\|+\beta\|m_{k}\|=0

as ll\to\infty. This shows that (37) is also true for n=k+1n=k+1.

Putting the facts established in (36) and (37) together, the proof is therefore complete. ∎

Finally we are ready to prove the main strong convergence result for Algorithm 1 when solving (1) with noisy data.

Theorem 3.7.

Let Assumption 1 and Assumption 2 hold and consider Algorithm 1, where 0β<10\leq\beta<1, and τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen such that (25) holds. Let nδn_{\delta} be the output integer. Then there exists a solution xx^{*} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) such that

limδ0Dξnδδ(x,xnδδ)=0 and limδ0xnδδx=0.\lim_{\delta\to 0}D_{\mathcal{R}}^{\xi_{n_{\delta}}^{\delta}}(x^{*},x_{n_{\delta}}^{\delta})=0\quad\mbox{ and }\quad\lim_{\delta\to 0}\|x_{n_{\delta}}^{\delta}-x^{*}\|=0.

If, in addition, Ker(L(x))Ker(L(x))\emph{Ker}(L(x^{\dagger}))\subset\emph{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then x=xx^{*}=x^{\dagger}.

Proof.

Let xx^{*} be the solution of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) determined in Theorem 3.5 such that Dξn(x,xn)0D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n})\to 0 as nn\to\infty, where {xn}\{x_{n}\} denotes the sequence defined by the counterpart of Algorithm 1 using the exact data. Let Δnδ:=Dξnδ(x,xnδ)\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(x^{*},x_{n}^{\delta}) for n0n\geq 0. We will show that Δnδδ0\Delta_{n_{\delta}}^{\delta}\to 0 as δ0\delta\to 0 by considering two cases.

Assume first that there is a sequence {yδl}\{y^{\delta_{l}}\} of noisy data satisfying yδlyδl\|y^{\delta_{l}}-y\|\leq\delta_{l} with δl0\delta_{l}\to 0 such that nl:=nδln^n_{l}:=n_{\delta_{l}}\to\hat{n} as ll\to\infty for some finite integer n^\hat{n}. Then nl=n^n_{l}=\hat{n} for all large ll. According to the definition of nl:=nδln_{l}:=n_{\delta_{l}} we have

F(xn^δl)yδlτδl.\|F(x_{\hat{n}}^{\delta_{l}})-y^{\delta_{l}}\|\leq\tau\delta_{l}.

By taking ll\to\infty and using Lemma 3.6, we can obtain F(xn^)=yF(x_{\hat{n}})=y. Thus, we may use Lemma 3.3 (i) to obtain xn=xn^x_{n}=x_{\hat{n}} for all nn^n\geq\hat{n}. Since xnxx_{n}\to x^{*} as nn\to\infty, we must have xn^=xx_{\hat{n}}=x^{*} and thus, by Lemma 3.6 and the lower semi-continuity of {\mathcal{R}}, we can obtain

lim suplΔnlδl\displaystyle\limsup_{l\to\infty}\Delta_{n_{l}}^{\delta_{l}} =lim suplΔn^δl=lim supl((xn^)(xn^δl)ξn^δl,xn^xn^δl)\displaystyle=\limsup_{l\to\infty}\Delta_{\hat{n}}^{\delta_{l}}=\limsup_{l\to\infty}\left({\mathcal{R}}(x_{\hat{n}})-{\mathcal{R}}(x_{\hat{n}}^{\delta_{l}})-\langle\xi_{\hat{n}}^{\delta_{l}},x_{\hat{n}}-x_{\hat{n}}^{\delta_{l}}\rangle\right)
=(xn^)lim infl(xn^δl)limlξn^δl,xn^xn^δl\displaystyle={\mathcal{R}}(x_{\hat{n}})-\liminf_{l\to\infty}{\mathcal{R}}(x_{\hat{n}}^{\delta_{l}})-\lim_{l\to\infty}\langle\xi_{\hat{n}}^{\delta_{l}},x_{\hat{n}}-x_{\hat{n}}^{\delta_{l}}\rangle
(xn^)(xn^)=0\displaystyle\leq{\mathcal{R}}(x_{\hat{n}})-{\mathcal{R}}(x_{\hat{n}})=0

which shows that Δnlδl0\Delta_{n_{l}}^{\delta_{l}}\to 0 as ll\to\infty.

Assume next that there is a sequence {yδl}\{y^{\delta_{l}}\} of noisy data satisfying yδlyδl\|y^{\delta_{l}}-y\|\leq\delta_{l} with δl0\delta_{l}\to 0 such that nl:=nδln_{l}:=n_{\delta_{l}}\to\infty as ll\to\infty. Let nn be any fixed integer. Then nl>nn_{l}>n for large ll. It then follows from Lemma 2.2 that ΔnlδlΔnδl.\Delta_{n_{l}}^{\delta_{l}}\leq\Delta_{n}^{\delta_{l}}. By using Lemma 3.6 and the lower semi-continuity of {\mathcal{R}}, we thus obtain

lim suplΔnlδl\displaystyle\limsup_{l\to\infty}\Delta_{n_{l}}^{\delta_{l}} lim suplΔnδl=lim supl((x)(xnδl)ξnδl,xxnδl)\displaystyle\leq\limsup_{l\to\infty}\Delta_{n}^{\delta_{l}}=\limsup_{l\to\infty}\left({\mathcal{R}}(x^{*})-{\mathcal{R}}(x_{n}^{\delta_{l}})-\langle\xi_{n}^{\delta_{l}},x^{*}-x_{n}^{\delta_{l}}\rangle\right)
(x)(xn)ξn,xxn=Dξn(x,xn).\displaystyle\leq{\mathcal{R}}(x^{*})-{\mathcal{R}}(x_{n})-\langle\xi_{n},x^{*}-x_{n}\rangle=D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n}).

Letting nn\to\infty in the above inequality gives lim suplΔnlδl0\limsup_{l\to\infty}\Delta_{n_{l}}^{\delta_{l}}\leq 0 which again shows Δnlδl0\Delta_{n_{l}}^{\delta_{l}}\to 0 as ll\to\infty. ∎

4. Numerical results

In this section we provide numerical simulations to test the performance of our AHB method, i.e. Algorithm 1. The computational results demonstrate that our AHB method indeed has superior performance over the Landweber iteration in terms of both the number of iterations and the CPU running time. Our simulations are performed in MATLAB R2023b on a Dell laptop with an 11th Gen Intel(R) Core(TM) i7-1185G7 CPU @ 3.00GHz and 32GB memory.

Example 4.1.

We first test the performance of Algorithm 1 by considering the following linear integral equation of the first kind

(Fx)(s):=01κ(s,t)x(t)𝑑t=y(s),s[0,1],\displaystyle(Fx)(s):=\int_{0}^{1}\kappa(s,t)x(t)dt=y(s),\quad s\in[0,1], (38)

where the kernel κ(s,t)\kappa(s,t) is a continuous function defined on [0,1]×[0,1][0,1]\times[0,1]. It is easy to see that FF is a compact linear map from L2[0,1]L^{2}[0,1] to itself. Our goal is to determine the minimal-norm solution of (38) by using some noisy data yδL2[0,1]y^{\delta}\in L^{2}[0,1] satisfying yδyL2[0,1]δ\|y^{\delta}-y\|_{L^{2}[0,1]}\leq\delta with a given noise level δ>0\delta>0. When applying Algorithm 1 to find the minimal-norm solution, we use x0=0x_{0}=0, (x)=12xL2[0,1]2{\mathcal{R}}(x)=\frac{1}{2}\|x\|_{L^{2}[0,1]}^{2}, τ=1.01\tau=1.01, β=\beta=\infty and

αnμ0F2 with μ0=0.99(22/τ),\alpha_{n}\equiv\frac{\mu_{0}}{\|F\|^{2}}\quad\mbox{ with }\mu_{0}=0.99(2-2/\tau),

where μ0\mu_{0} is chosen to fulfill the theoretical requirement. For comparison, we also consider the following algorithms:

  1. \bullet

    Landweber iteration

    xn+1δ=xnδαF(Fxnδyδ)x_{n+1}^{\delta}=x_{n}^{\delta}-\alpha F^{*}(Fx_{n}^{\delta}-y^{\delta})

    with the initial guess x0δ=0x_{0}^{\delta}=0 terminated by the discrepancy principle

    Fxnδδyδτδ<Fxnδyδ,0n<nδ.\displaystyle\|Fx_{n_{\delta}}^{\delta}-y^{\delta}\|\leq\tau\delta<\|Fx_{n}^{\delta}-y^{\delta}\|,\quad 0\leq n<n_{\delta}. (39)

This method is convergent for 0<α2/F20<\alpha\leq 2/\|F\|^{2} and τ>1\tau>1; see [6]. In our computation we use α=1/F2\alpha=1/\|F\|^{2} and τ=1.01\tau=1.01.

  2. \bullet

    The ν\nu-method of Brakhage: This method was introduced in [4] to accelerate the Landweber iteration. It preassigns a number ν>1/2\nu>1/2 and defines the iterates by

    xn+1δ=xnδαnγF(Fxnδyδ)+βn(xnδxn1δ)x_{n+1}^{\delta}=x_{n}^{\delta}-\alpha_{n}\gamma F^{*}(Fx_{n}^{\delta}-y^{\delta})+\beta_{n}(x_{n}^{\delta}-x_{n-1}^{\delta})

    with x1δ=x0δ=0x_{-1}^{\delta}=x_{0}^{\delta}=0 and

    αn=4(2n+2ν+1)(n+ν)(n+2ν)(2n+4ν+1),βn=n(2n1)(2n+2ν+1)(n+2ν)(2n+4ν+1)(2n+2ν1).\displaystyle\quad\alpha_{n}=4\frac{(2n+2\nu+1)(n+\nu)}{(n+2\nu)(2n+4\nu+1)},\,\,\,\beta_{n}=\frac{n(2n-1)(2n+2\nu+1)}{(n+2\nu)(2n+4\nu+1)(2n+2\nu-1)}.

    It is known that it is a regularization method for 0<γ<1/F20<\gamma<1/\|F\|^{2} when terminated by the discrepancy principle (39) with τ>1\tau>1; see [4, 6, 9]. We use ν=3\nu=3, γ=0.99/F2\gamma=0.99/\|F\|^{2} and τ=1.01\tau=1.01.

  3. \bullet

    Nesterov acceleration: This method was proposed in [22] to accelerate the gradient method for convex optimization problems. This strategy was then suggested in [15] to accelerate the Landweber iteration for ill-posed inverse problems. The corresponding method for solving (38) takes the form

    znδ=xnδ+n1n+α(xnδxn1δ),xn+1δ=znδγF(Fznδyδ)\displaystyle z_{n}^{\delta}=x_{n}^{\delta}+\frac{n-1}{n+\alpha}(x_{n}^{\delta}-x_{n-1}^{\delta}),\quad x_{n+1}^{\delta}=z_{n}^{\delta}-\gamma F^{*}(Fz_{n}^{\delta}-y^{\delta})

    with x1δ=x0δ=0x_{-1}^{\delta}=x_{0}^{\delta}=0 and α2\alpha\geq 2. When the method is terminated by the discrepancy principle (39) with τ>1\tau>1, the corresponding regularization property has been proved in [20, 24] for 0<γ<1/F20<\gamma<1/\|F\|^{2}. We use α=3\alpha=3, γ=0.99/F2\gamma=0.99/\|F\|^{2} and τ=1.01\tau=1.01.
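As a quick sanity check on the ν\nu-method coefficients, the displayed formulas for αn\alpha_{n} and βn\beta_{n} can be evaluated directly; the limiting values noted in the comments follow from the formulas themselves and are stated here only for illustration.

```python
# Evaluate the nu-method coefficients alpha_n and beta_n displayed above.
def nu_coefficients(n, nu):
    alpha = (4 * (2 * n + 2 * nu + 1) * (n + nu)
             / ((n + 2 * nu) * (2 * n + 4 * nu + 1)))
    beta = (n * (2 * n - 1) * (2 * n + 2 * nu + 1)
            / ((n + 2 * nu) * (2 * n + 4 * nu + 1) * (2 * n + 2 * nu - 1)))
    return alpha, beta

# With nu = 3 (as used in the text): beta_0 = 0 and, as n grows,
# alpha_n approaches 4 while beta_n approaches 1.
a0, b0 = nu_coefficients(0, 3)
aN, bN = nu_coefficients(10 ** 6, 3)
```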

Table 1. Numerical results for Example 4.1.
δ\delta method iterations time (s) relative error
0.1 Landweber 62 0.0153 1.9024e-2
ν\nu-method 15 0.0083 1.6657e-2
Nesterov 16 0.0091 1.7684e-2
AHB 20 0.0087 1.7774e-2
0.01 Landweber 190 0.0359 5.0988e-3
ν\nu-method 30 0.0102 5.1081e-3
Nesterov 41 0.0157 4.1816e-3
AHB 30 0.0108 4.6982e-3
0.001 Landweber 1256 0.1934 1.8305e-3
ν\nu-method 82 0.0189 1.8587e-3
Nesterov 102 0.0285 1.7776e-3
AHB 80 0.0187 1.7220e-3
0.0001 Landweber 8144 1.2922 6.2937e-4
ν\nu-method 215 0.0414 6.4159e-4
Nesterov 324 0.0831 5.1961e-4
AHB 268 0.0513 6.1130e-4
0.00001 Landweber 55145 8.2924 1.9023e-4
ν\nu-method 565 0.0918 1.9305e-4
Nesterov 817 0.2003 1.7054e-4
AHB 896 0.1520 1.8902e-4

In our numerical computation, we consider the equation (38) with the kernel

κ(s,t)={40s(1t) if st,40t(1s) if ts\displaystyle\kappa(s,t)=\left\{\begin{array}[]{lll}40s(1-t)&\mbox{ if }s\leq t,\\ 40t(1-s)&\mbox{ if }t\leq s\end{array}\right.

and assume that the sought solution is x(t)=4t(1t)+sin(2πt)x^{\dagger}(t)=4t(1-t)+\sin(2\pi t). Let y:=Fxy:=Fx^{\dagger} be the exact data. We add random noise to yy to produce noisy data yδy^{\delta} satisfying yδyL2[0,1]=δ\|y^{\delta}-y\|_{L^{2}[0,1]}=\delta for various noise levels δ>0\delta>0. In our implementation, all integrals over [0,1][0,1] are approximated by the trapezoidal rule based on the 10001000 nodes partitioning [0,1][0,1] into subintervals of equal length.
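The setup above can be sketched in a few lines of code. The following is a small-scale illustration, not the authors' implementation: the grid is coarser than the 1000 nodes used in the text, the noise realization and seed are arbitrary, plain Euclidean norms replace the L2L^{2} norms, and the same step size is used for both methods.

```python
import numpy as np

# Trapezoidal-rule discretization of (38) with the kernel above
# (coarser grid than in the text, to keep the sketch fast).
m = 200
t = np.linspace(0.0, 1.0, m)
h = t[1] - t[0]
S, T = np.meshgrid(t, t, indexing="ij")
K = np.where(S <= T, 40 * S * (1 - T), 40 * T * (1 - S))   # symmetric kernel
w = np.full(m, h)
w[0] = w[-1] = h / 2                                       # trapezoidal weights
F = K * w                                                  # (F x)_i ~ integral of k(s_i,t)x(t)dt

x_true = 4 * t * (1 - t) + np.sin(2 * np.pi * t)
y = F @ x_true

rng = np.random.default_rng(0)
delta = 1e-2
e = rng.standard_normal(m)
y_delta = y + delta * e / np.linalg.norm(e)                # ||y_delta - y|| = delta

tau = 1.01
gamma = 0.99 / np.linalg.norm(F, 2) ** 2                   # step size < 1/||F||^2

def landweber(max_iter=50000):
    x = np.zeros(m)
    for n in range(max_iter):
        r = F @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:               # discrepancy principle (39)
            return x, n
        x = x - gamma * F.T @ r
    return x, max_iter

def nesterov(a=3, max_iter=50000):
    x_old = x = np.zeros(m)
    for n in range(max_iter):
        r = F @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:
            return x, n
        z = x + (n - 1) / (n + a) * (x - x_old)
        x_old, x = x, z - gamma * F.T @ (F @ z - y_delta)
    return x, max_iter

x_lw, n_lw = landweber()
x_ne, n_ne = nesterov()
```

On this toy problem the Nesterov iteration reaches the discrepancy stopping index in far fewer steps than the Landweber iteration, mirroring the behavior reported in Table 1.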

For this model problem, we execute the Landweber iteration, the ν\nu-method, the Nesterov acceleration and our AHB method with the above setup using noisy data with various noise levels; the computational results are reported in Table 1, which includes the required number of iterations and the consumed CPU time together with the corresponding relative errors. These results indicate that all four methods can produce comparable approximate solutions in terms of accuracy. However, the ν\nu-method, the Nesterov acceleration and our AHB method clearly demonstrate superior performance over the Landweber method by significantly reducing the number of iterations and the CPU running time. In order to visualize how the iterates approach the sought solution, in Figure 1 we plot the relative error xnδxL2/xL2\|x_{n}^{\delta}-x^{\dagger}\|_{L^{2}}/\|x^{\dagger}\|_{L^{2}} versus nn for these four methods: the left figure plots the relative error, using noisy data with δ=0.001\delta=0.001, until the iteration is terminated by the discrepancy principle; the right figure plots the relative error when all computations are performed using the exact data. These plots further indicate the acceleration effect of the ν\nu-method, the Nesterov acceleration and our AHB method. Moreover, when using the exact data, our AHB method can outperform both the ν\nu-method and the Nesterov acceleration.

Refer to caption
Refer to caption
Figure 1. Relative error versus the number of iterations for Example 4.1.

When using noisy data, among the three fast methods the ν\nu-method usually requires the fewest iterations. However, it is unclear how to extend this method to solve (12) with nonlinear FF or non-quadratic {\mathcal{R}}. The Nesterov acceleration in general takes more time than the other two methods because, at each iteration step, it needs to calculate FznδyδFz_{n}^{\delta}-y^{\delta} to update xn+1δx_{n+1}^{\delta} and FxnδyδFx_{n}^{\delta}-y^{\delta} to check the discrepancy principle, as required by the theory ([20]), and thus the computational complexity per iteration increases. Although we have formulated a possible extension (6) of the Nesterov acceleration to solve (12) with general FF and {\mathcal{R}}, and numerical simulations demonstrate its striking performance, a solid theoretical justification is still unavailable. Note that our AHB method uses a very small step-size compared with the ones used by the other three methods; with an adaptive choice of the momentum coefficient, it still achieves an excellent acceleration effect over the Landweber iteration. Moreover, our theory provides a convergence guarantee for using our AHB method to solve (12) with general FF and {\mathcal{R}}. In the following numerical examples, we will focus on testing the acceleration effect of our AHB method over the Landweber iteration.
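To make the adaptive choices concrete, the following sketch spells out our reading of the AHB update in the quadratic setting of this example (linear FF, (x)=12x2{\mathcal{R}}(x)=\frac{1}{2}\|x\|^{2} so that \nabla{\mathcal{R}}^{*} is the identity, σ=1/2\sigma=1/2 and η=0\eta=0), reconstructed from the update formulas appearing in the convergence analysis. The grid, noise realization, the finite cap β=0.99\beta=0.99, and the parameters τ\tau, μ0\mu_{0}, μ1\mu_{1} are choices made for this illustration, not prescriptions from the paper.

```python
import numpy as np

# Same toy trapezoidal discretization of (38) as in the sketch above.
m = 200
t = np.linspace(0.0, 1.0, m)
h = t[1] - t[0]
S, T = np.meshgrid(t, t, indexing="ij")
K = np.where(S <= T, 40 * S * (1 - T), 40 * T * (1 - S))
w = np.full(m, h)
w[0] = w[-1] = h / 2
F = K * w

x_true = 4 * t * (1 - t) + np.sin(2 * np.pi * t)
y = F @ x_true
rng = np.random.default_rng(1)
delta = 1e-2
e = rng.standard_normal(m)
y_delta = y + delta * e / np.linalg.norm(e)

# Parameters: tau > 1 and 0 < mu0 < 2 - 2/tau (linear case, eta = 0).
tau, mu1, beta_cap = 1.05, 100.0, 0.99
mu0 = 0.99 * (2 - 2 / tau)
sigma = 0.5                       # R(x) = ||x||^2/2 is 1/2-strongly convex

def ahb(max_iter=50000):
    xi = np.zeros(m)              # with R quadratic, x_n = xi_n
    m_mom = np.zeros(m)           # momentum m_n = xi_n - xi_{n-1}
    gamma_t = 0.0                 # the quantity written gamma-tilde_n above
    for n in range(max_iter):
        r = F @ xi - y_delta
        if np.linalg.norm(r) <= tau * delta:       # discrepancy principle
            return xi, n
        g = F.T @ r                                # g_n = L(x_n)^* r_n
        alpha = min(mu0 * np.linalg.norm(r) ** 2 / np.linalg.norm(g) ** 2, mu1)
        if np.linalg.norm(m_mom) > 0:
            beta = min(max(0.0, (alpha * (g @ m_mom) - 2 * sigma * gamma_t)
                           / (m_mom @ m_mom)), beta_cap)
        else:
            beta = 0.0
        xi_new = xi - alpha * g + beta * m_mom
        m_new = xi_new - xi
        # gamma-tilde recursion with eta = 0; here <m_{n+1}, x_{n+1} - x_n>
        # reduces to ||m_{n+1}||^2 since x = xi.
        gamma_t = (m_new @ m_new - alpha * np.linalg.norm(r) ** 2
                   + delta * alpha * np.linalg.norm(r) + beta * gamma_t)
        xi, m_mom = xi_new, m_new
    return xi, max_iter

x_ahb, n_ahb = ahb()
```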

Example 4.2 (Computed tomography).

In this example, we consider applying Algorithm 1 to computed tomography which consists in determining the density of cross sections of a human body by measuring the attenuation of X-rays as they propagate through the biological tissues ([21]).

Assume the image is supported on a rectangular domain in 2\mathbb{R}^{2} and is discretized on a I×JI\times J pixel grid. We will identify any image of size I×JI\times J by a long vector in N{\mathbb{R}}^{N} with N=I×JN=I\times J by stacking all its columns. We consider the standard 2D fan-beam tomography using nθn_{\theta} projection angles evenly distributed between 11 and 360360 degrees with pp lines of X-rays emitted through each angle. We use the function “fanbeamtomo” in the MATLAB package AIR TOOLS [11] to discretize the problem, leading to an ill-conditioned linear algebraic system Fx=yFx=y, where FM×NF\in\mathbb{R}^{M\times N} with M=nθpM=n_{\theta}p and N=I×JN=I\times J.

Table 2. Numerical results for computed tomography.
δrel\delta_{rel}\quad method iterations time (s) relative error
0.05 Landweber 101 9.9940 1.8538e-01
AHB: β=0.99\beta=0.99 63 5.5243 1.9299e-01
AHB: β=\beta=\infty 48 4.2141 1.9060e-01
0.01 Landweber 444 44.0053 5.9528e-02
AHB: β=0.99\beta=0.99 230 19.7274 5.9464e-02
AHB: β=\beta=\infty 238 20.2110 5.9760e-02
0.005 Landweber 778 78.4099 3.1650e-02
AHB: β=0.99\beta=0.99 335 27.8164 3.1561e-02
AHB: β=\beta=\infty 374 31.1740 3.1859e-02
0.001 Landweber 3254 366.628 6.2999e-03
AHB: β=0.99\beta=0.99 1398 124.763 5.7077e-03
AHB: β=\beta=\infty 1495 132.288 5.9425e-03

In our numerical simulations, we consider the modified Shepp-Logan phantom of size 256×256256\times 256 generated by MATLAB, and we use nθ=60n_{\theta}=60 projection angles with p=367p=367 lines of X-rays per projection. Correspondingly FF has the size M×NM\times N with M=22020M=22020 and N=65536N=65536. Instead of the exact data yy, we add Gaussian noise to yy to generate noisy data yδy^{\delta} with relative noise level δrel=yδy2/y2\delta_{rel}=\|y^{\delta}-y\|_{2}/\|y\|_{2}, so that the noise level is δ=δrely2\delta=\delta_{rel}\|y\|_{2}. In order to capture the feature of the sought image, we take

(x)=12κxF2+|x|TV,xI×J{\mathcal{R}}(x)=\frac{1}{2\kappa}\|x\|_{F}^{2}+|x|_{TV},\quad\forall x\in{\mathbb{R}}^{I\times J}

with κ=1\kappa=1, where xF\|x\|_{F} is the Frobenius norm and |x|TV|x|_{TV} denotes the isotropic total variation of xx, i.e. |x|TV=f(x)|x|_{TV}=f(\nabla x) with \nabla being the discrete gradient operator defined by x=(1x,2x)\nabla x=(\nabla_{1}x,\nabla_{2}x) for x=(xi,j)I×Jx=(x_{i,j})\in{\mathbb{R}}^{I\times J}, where

(1x)i,j\displaystyle(\nabla_{1}x)_{i,j} ={xi+1,jxi,j,i=1,,I1;j=1,,Jx1,jxI,j,i=I;j=1,,J;\displaystyle=\left\{\begin{array}[]{ll}x_{i+1,j}-x_{i,j},&i=1,\cdots,I-1;\,j=1,\cdots,J\\ x_{1,j}-x_{I,j},&i=I;\,j=1,\cdots,J;\\ \end{array}\right.
(2x)i,j\displaystyle(\nabla_{2}x)_{i,j} ={xi,j+1xi,j,i=1,,I;j=1,,J1,xi,1xi,J,i=1,,I;j=J\displaystyle=\left\{\begin{array}[]{ll}x_{i,j+1}-x_{i,j},&i=1,\cdots,I;\,j=1,\cdots,J-1,\\ x_{i,1}-x_{i,J},&i=1,\cdots,I;\,j=J\end{array}\right.

and

f(u,v)=i=1Ij=1Jui,j2+vi,j2,u=(ui,j),v=(vi,j)I×J.\displaystyle f(u,v)=\sum_{i=1}^{I}\sum_{j=1}^{J}\sqrt{u_{i,j}^{2}+v_{i,j}^{2}},\quad\forall u=(u_{i,j}),v=(v_{i,j})\in{\mathbb{R}}^{I\times J}.
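In code, the discrete gradient with this periodic closure and the isotropic total variation can be written compactly. The following is a straightforward transcription of the displays above (NumPy's `roll` implements the periodic wrap-around), not taken from the authors' implementation; the `div` helper is the negative adjoint of the gradient, which is what TV solvers such as PDHG need.

```python
import numpy as np

def grad(x):
    """Discrete gradient with the periodic closure defined above."""
    d1 = np.roll(x, -1, axis=0) - x     # (grad_1 x)_{i,j} = x_{i+1,j} - x_{i,j}
    d2 = np.roll(x, -1, axis=1) - x     # (grad_2 x)_{i,j} = x_{i,j+1} - x_{i,j}
    return d1, d2

def div(p1, p2):
    """Negative adjoint of grad, so <grad(x), p> = -<x, div(p1, p2)>."""
    return (p1 - np.roll(p1, 1, axis=0)) + (p2 - np.roll(p2, 1, axis=1))

def tv_iso(x):
    """Isotropic total variation |x|_TV = f(grad x)."""
    d1, d2 = grad(x)
    return np.sqrt(d1 ** 2 + d2 ** 2).sum()
```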

Clearly {\mathcal{R}} satisfies Assumption 1 with σ=1/(2κ)\sigma=1/(2\kappa). When using Algorithm 1 to reconstruct the image, we use the initial guess ξ0=0\xi_{0}=0 and the step-size

αnδ=min{μ0Fxnδyδ2F(Fxnδyδ)2,μ1}\alpha_{n}^{\delta}=\min\left\{\frac{\mu_{0}\|Fx_{n}^{\delta}-y^{\delta}\|^{2}}{\|F^{*}(Fx_{n}^{\delta}-y^{\delta})\|^{2}},\mu_{1}\right\}

Our convergence theory requires τ>1\tau>1, 0<μ0<(22/τ)/κ0<\mu_{0}<(2-2/\tau)/\kappa and μ1>0\mu_{1}>0. Therefore we take τ=1.05\tau=1.05, μ0=0.99(22/τ)/κ\mu_{0}=0.99(2-2/\tau)/\kappa and μ1=100\mu_{1}=100. In order to determine the momentum coefficient βnδ\beta_{n}^{\delta}, we consider two values of β\beta: β=0.99\beta=0.99 and β=\beta=\infty. For comparison, we also carry out the computation by the Landweber method (2) using the same step-size αnδ\alpha_{n}^{\delta} and the regularization function {\mathcal{R}}. During the computation, updating xnδx_{n}^{\delta} from ξnδ\xi_{n}^{\delta} requires solving the total variation denoising problem

xnδ=argminxI×J{12κxκξnδF2+|x|TV}x_{n}^{\delta}=\arg\min_{x\in{\mathbb{R}}^{I\times J}}\left\{\frac{1}{2\kappa}\|x-\kappa\xi_{n}^{\delta}\|_{F}^{2}+|x|_{TV}\right\}

which is solved approximately by running 7070 iterations of the primal-dual hybrid gradient (PDHG) method ([3, 15, 32]).
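A minimal sketch of such a PDHG solver for this denoising subproblem is given below; the step sizes σ=τ=1/8\sigma=\tau=1/\sqrt{8} (so that στ28\sigma\tau\|\nabla\|^{2}\leq 8\sigma\tau\leq 1), the default iteration count, and the initialization at the unregularized minimizer are our own choices for illustration, not taken from the cited papers.

```python
import numpy as np

def grad(x):
    # Periodic forward differences, as in the displays above.
    return np.roll(x, -1, 0) - x, np.roll(x, -1, 1) - x

def div(p1, p2):
    # Negative adjoint of grad, i.e. <grad(x), p> = -<x, div(p1, p2)>.
    return (p1 - np.roll(p1, 1, 0)) + (p2 - np.roll(p2, 1, 1))

def tv_denoise(xi, kappa, iters=300):
    """Approximately solve  min_x ||x - kappa*xi||_F^2/(2*kappa) + |x|_TV
    by the primal-dual hybrid gradient method (Chambolle-Pock form)."""
    x = kappa * xi                       # start from the unregularized minimizer
    xbar = x.copy()
    p1 = np.zeros_like(x)
    p2 = np.zeros_like(x)
    sig = tau = 1.0 / np.sqrt(8.0)       # sig * tau * ||grad||^2 <= 1
    for _ in range(iters):
        # Dual ascent step followed by projection onto the pointwise unit ball.
        g1, g2 = grad(xbar)
        p1, p2 = p1 + sig * g1, p2 + sig * g2
        nrm = np.maximum(1.0, np.sqrt(p1 ** 2 + p2 ** 2))
        p1, p2 = p1 / nrm, p2 / nrm
        # Primal proximal step for the quadratic data term.
        z = x + tau * div(p1, p2)
        x_new = (kappa * z + tau * kappa * xi) / (kappa + tau)
        xbar = 2 * x_new - x
        x = x_new
    return x
```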

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2. The reconstruction results for computed tomography. (a) True phantom; (b) Landweber; (c) AHB with β=0.99\beta=0.99; (d) AHB with β=\beta=\infty.

The computational results by the AHB and Landweber methods are reported in Table 2, including the number of iterations nδn_{\delta}, the CPU running time and the relative errors xnδδx2/x2\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|_{2}/\|x^{\dagger}\|_{2}, using noisy data with various relative noise levels δrel>0\delta_{rel}>0. Table 2 shows that AHB with both β=0.99\beta=0.99 and β=\beta=\infty leads to a considerable decrease in the number of iterations and the amount of computational time, which demonstrates the striking acceleration effect of our AHB method.

Refer to caption
Figure 3. Computed tomography: relative errors versus the number of iterations.

In order to visualize the reconstruction accuracy of our AHB method, we plot in Figure 2 the true image and the reconstruction results by the Landweber method and the AHB method with β=0.99\beta=0.99 and \infty using noisy data with relative noise level δrel=0.01\delta_{rel}=0.01. In Figure 3 we also plot the curves of the relative errors xnδx2/x2\|x_{n}^{\delta}-x^{\dagger}\|_{2}/\|x^{\dagger}\|_{2} versus nn for the Landweber and AHB methods. These plots further demonstrate that, compared with the Landweber iteration, our AHB method can produce comparable reconstruction results with significantly fewer iterations.

Example 4.3 (Elliptic parameter identification).

In this example we consider the identification of the parameter cc in the elliptic boundary value problem

u+cu=f in Ω,u=g on Ω\displaystyle-\triangle u+cu=f\hbox{ in }\Omega,\qquad u=g\hbox{ on }\partial\Omega (40)

from an L2(Ω)L^{2}(\Omega)-measurement of the state uu, where Ωd\Omega\subset\mathbb{R}^{d} with d3d\leq 3 is a bounded domain with Lipschitz boundary Ω\partial\Omega, fH1(Ω)f\in H^{-1}(\Omega) and gH1/2(Ω)g\in H^{1/2}(\Omega). We assume that the sought parameter cc^{{\dagger}} is in L2(Ω)L^{2}(\Omega). This problem reduces to solving F(c)=uF(c)=u if we define the nonlinear operator F:L2(Ω)L2(Ω)F:L^{2}(\Omega)\rightarrow L^{2}(\Omega) by F(c):=u(c)F(c):=u(c), where u(c)H1(Ω)L2(Ω)u(c)\in H^{1}(\Omega)\subset L^{2}(\Omega) is the unique solution of (40). This operator FF is well defined on

𝒟:={cL2(Ω):cc^L2(Ω)ε0 for some c^0,a.e.}{\mathcal{D}}:=\left\{c\in L^{2}(\Omega):\|c-\hat{c}\|_{L^{2}(\Omega)}\leq\varepsilon_{0}\mbox{ for some }\hat{c}\geq 0,\ \textrm{a.e.}\right\}

for some positive constant ε0>0\varepsilon_{0}>0. It is known that the operator FF is weakly closed and Fréchet differentiable; the Fréchet derivative of FF and its adjoint are given by

F(c)h=A(c)1(hu(c)) and F(c)w=u(c)A(c)1wF^{\prime}(c)h=-A(c)^{-1}(hu(c))\quad\mbox{ and }\quad F^{\prime}(c)^{*}w=-u(c)A(c)^{-1}w

for c𝒟c\in{\mathcal{D}} and h,wL2(Ω)h,w\in L^{2}(\Omega), where A(c):H01(Ω)H1(Ω)A(c):H_{0}^{1}(\Omega)\to H^{-1}(\Omega) is defined by A(c)u=u+cuA(c)u=-\triangle u+cu which is an isomorphism uniformly in the ball B2ρ(c)B_{2\rho}(c^{\dagger}) for small ρ>0\rho>0. Moreover, FF satisfies Assumption 2 (d) with a number η>0\eta>0 which can be very small if ρ>0\rho>0 is sufficiently small (see [6]).
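For intuition, the forward map cu(c)c\mapsto u(c) can be sketched with a standard 5-point finite-difference discretization and a dense solve. This is a small-scale stand-in chosen for illustration (the experiments below use a 128×128128\times 128 grid and a multigrid solver); the grid size and the dense linear solve are our simplifications.

```python
import numpy as np

def solve_state(c, f, g_full, n):
    """Dense 5-point finite-difference solve of -Laplace(u) + c*u = f on the
    unit square with Dirichlet data taken from the boundary of g_full, an
    (n+1) x (n+1) grid array; c and f are (n-1) x (n-1) arrays of values
    at the interior nodes."""
    h = 1.0 / n
    m = n - 1
    # 1D second-difference matrix; 2D Laplacian via Kronecker products.
    T = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h ** 2
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m)) + np.diag(c.ravel())
    # Move the known boundary values to the right-hand side.
    u_b = np.zeros((n + 1, n + 1))
    u_b[0, :], u_b[-1, :] = g_full[0, :], g_full[-1, :]
    u_b[:, 0], u_b[:, -1] = g_full[:, 0], g_full[:, -1]
    rhs = f + (u_b[:m, 1:-1] + u_b[2:, 1:-1]
               + u_b[1:-1, :m] + u_b[1:-1, 2:]) / h ** 2
    return np.linalg.solve(A, rhs.ravel()).reshape(m, m)
```

Since the discrete Laplacian annihilates linear functions exactly, the choice g=x+yg=x+y with c=0c=0 and f=0f=0 reproduces u=x+yu=x+y up to roundoff, which gives a convenient correctness check.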

Table 3. Numerical results for elliptic parameter identification.
δ\delta method iterations time (s) xnδδxL2\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|_{L^{2}}
0.005 Landweber 36 6.2252 3.0388e-01
AHB: β=0.99\beta=0.99 15 2.6239 3.0577e-01
AHB: β=\beta=\infty 12 2.1608 3.0328e-01
0.001 Landweber 248 39.967 1.3998e-01
AHB: β=0.99\beta=0.99 64 10.509 1.3772e-01
AHB: β=\beta=\infty 79 12.530 1.3790e-01
0.0005 Landweber 410 66.806 1.1268e-01
AHB: β=0.99\beta=0.99 108 17.023 1.1298e-01
AHB: β=\beta=\infty 165 26.356 1.1221e-01
0.0001 Landweber 2014 341.75 8.1312e-02
AHB: β=0.99\beta=0.99 236 38.464 8.1573e-02
AHB: β=\beta=\infty 375 60.491 8.1716e-02
0.00005 Landweber 4999 987.75 7.2499e-02
AHB: β=0.99\beta=0.99 472 74.040 7.1581e-02
AHB: β=\beta=\infty 821 133.82 7.2092e-02

In our numerical simulation, we consider the two-dimensional problem with Ω=[0,1]×[0,1]\Omega=[0,1]\times[0,1], where the sought parameter cc^{\dagger} is assumed to be a piecewise constant function as shown in Figure 4 (a). Assuming u(c)=x+yu(c^{\dagger})=x+y, we add random noise to produce noisy data uδu^{\delta} satisfying uδu(c)L2(Ω)=δ\|u^{\delta}-u(c^{\dagger})\|_{L^{2}(\Omega)}=\delta with various noise levels δ>0\delta>0. We will use uδu^{\delta} to reconstruct cc^{\dagger}. In order to capture the feature of the sought parameter, we take

(c)=12κcL2(Ω)2+|c|TV\displaystyle{\mathcal{R}}(c)=\frac{1}{2\kappa}\|c\|_{L^{2}(\Omega)}^{2}+|c|_{TV} (41)

with the constant κ=10\kappa=10, where |c|TV|c|_{TV} denotes the total variation of cc on Ω\Omega, i.e.

|c|TV:=sup{Ωcdivφdxdy:φC01(Ω,2) and φL(Ω)1}.|c|_{TV}:=\sup\left\{\int_{\Omega}c\,\mbox{div}\varphi\,\mbox{d}x\mbox{d}y:\varphi\in C_{0}^{1}(\Omega,{\mathbb{R}}^{2})\mbox{ and }\|\varphi\|_{L^{\infty}(\Omega)}\leq 1\right\}.

It is easy to see that this {\mathcal{R}} satisfies Assumption 1 with σ=1/(2κ)\sigma=1/(2\kappa).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4. The reconstruction results for parameter identification in elliptic equation. (a) True parameter; (b) Landweber; (c) AHB with β=0.99\beta=0.99; (d) AHB with β=\beta=\infty.

We first carry out the computation by the Landweber-type method (2) using the initial guess ξ0δ=0\xi_{0}^{\delta}=0 and the adaptive step-size αnδ\alpha_{n}^{\delta} given by (4) with μ0=1.96(1η(1+η)/τ)/κ\mu_{0}=1.96(1-\eta-(1+\eta)/\tau)/\kappa and μ1=80\mu_{1}=80, where we take η=0.01\eta=0.01 and τ=1.05\tau=1.05. Next we execute our AHB method, i.e. Algorithm 1, with the same initial guess ξ0δ\xi_{0}^{\delta} and step-size αnδ\alpha_{n}^{\delta}; the momentum coefficient βnδ\beta_{n}^{\delta} is determined using two values of β\beta: β=0.99\beta=0.99 and β=\beta=\infty. In order to carry out the computation, we divide Ω\Omega into 128×128128\times 128 small squares of equal size and solve all partial differential equations involved approximately by a multigrid method ([8]) via finite difference discretization. Furthermore, updating cnδc_{n}^{\delta} from ξnδ\xi_{n}^{\delta} at each iteration step is equivalent to a total variation denoising problem, which is solved by running 200200 iterations of the PDHG method.

Refer to caption
Figure 5. Relative errors versus the number of iterations.

In Table 3 we report the computational results by the Landweber iteration (2) and our AHB method with β=0.99\beta=0.99 and β=\beta=\infty, including the required number of iterations nδn_{\delta}, the CPU running time and the absolute errors cnδδcL2(Ω)\|c_{n_{\delta}}^{\delta}-c^{\dagger}\|_{L^{2}(\Omega)}, using noisy data with various noise levels δ>0\delta>0. These results clearly demonstrate the acceleration effect of our AHB method and its superior performance over the Landweber iteration. In Figure 4 and Figure 5 we plot the respective reconstruction results and the errors versus the number of iterations for the Landweber iteration and our AHB method using noisy data with noise level δ=0.0001\delta=0.0001, which shows that these methods produce comparable approximate solutions and that our AHB method is much faster than the Landweber iteration.

5. Conclusion

With the surge in machine learning advancements and the emergence of large-scale problems across diverse domains, Polyak’s heavy ball method has experienced a resurgence of interest within the optimization community in recent years. Much effort has been dedicated to comprehending its acceleration effects. In this paper we proposed an adaptive heavy ball method for solving ill-posed inverse problems, both linear and nonlinear. This method differs from the Landweber-type method in that a momentum term is added to define the iterates. Our method integrates a strongly convex function into the design of the algorithm as a regularization function to detect the desired feature of the sought solution. Moreover, we introduced novel techniques for adaptively selecting step-sizes and momentum coefficients, aiming to achieve potential acceleration over the Landweber-type method. Notably, the updating formulae for these parameters are explicit, obviating the need for a backtracking line search procedure and thereby saving computational time. Under the discrepancy principle, we showed that our method can be terminated after a finite number of iterations and established the corresponding regularization property. We reported diverse numerical results which indicate the significant reduction in both the required number of iterations and computational time, and thus demonstrate the superior performance of our method over the Landweber-type method.

Acknowledgement

The work of Q. Jin is partially supported by the Future Fellowship of the Australian Research Council (FT170100231). The work of Q. Huang is supported by the China Scholarship Council program.

References

  • [1] H. Attouch and J. Peypouquet, The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than o(1/k2)o(1/k^{2}), SIAM J. Optim., 26 (2016), no. 3, 1824–1834.
  • [2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), 183–202.
  • [3] S. Bonettini and V. Ruggiero, On the convergence of primal-dual hybrid gradient algorithms for total variation image restoration, J. Math. Imaging Vis., 44 (2012), 236–253.
  • [4] H. Brakhage, On ill-posed problems and the method of conjugate gradients, Inverse and Ill-Posed Problems (Sankt Wolfgang, 1986) (Notes Rep. Math. Sci. Engrg. vol 4), (New York: Academic), 165–175, 1987.
  • [5] A. Chambolle and Ch. Dossal, On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm, J. Optim. Theor. Appl., 166 (2015), 968–982.
  • [6] H. W. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems, Dordrecht, Kluwer, 1996.
  • [7] E. Ghadimi, H. Feyzmahdavian and M. Johansson, Global convergence of the heavy-ball method for convex optimization, 2015 European Control Conference, pages 310–315, 2015.
  • [8] W. Hackbusch, Iterative Solution of Large Sparse Systems of Equations, Second edition, Applied Mathematical Sciences, 95. Springer, 2016.
  • [9] M. Hanke, Accelerated Landweber iterations for the solution of ill-posed equations, Numer. Math., 60 (1991), 341–373.
  • [10] M. Hanke, A. Neubauer and O. Scherzer, A convergence analysis of the Landweber iteration for nonlinear ill-posed problems, Numer. Math., 72 (1995), no. 1, 21–37.
  • [11] P. C. Hansen and M. Saxild-Hansen, AIR Tools — a MATLAB package of algebraic iterative reconstruction methods, J. Comput. Appl. Math., 236 (2012), 2167–2178.
  • [12] M. Hegland, Q. Jin and W. Wang, Accelerated Landweber iteration with convex penalty for linear inverse problems in Banach spaces, Appl. Anal., 94 (2015), 524–547.
  • [13] S. Hubmer and R. Ramlau, Convergence analysis of a two-point gradient method for nonlinear ill-posed problems, Inverse Problems, 33 (2017), 095004.
  • [14] S. Hubmer and R. Ramlau, Nesterov’s accelerated gradient method for nonlinear ill-posed problems with a locally convex residual functional, Inverse Problems, 34 (2018), 095003.
  • [15] Q. Jin, Landweber-Kaczmarz method in Banach spaces with inexact inner solvers, Inverse Problems, 32 (2016), 104005.
  • [16] Q. Jin, Convergence rates of a dual gradient method for constrained linear ill-posed problems, Numer. Math. 151 (2022), no. 4, 841–871.
  • [17] Q. Jin and X. Lu, A fast nonstationary iterative method with convex penalty for inverse problems in Hilbert spaces, Inverse Problems 30 (2014), no. 4, 045012, 21 pp.
  • [18] Q. Jin and W. Wang, Landweber iteration of Kaczmarz type with general non-smooth convex penalty functionals, Inverse Problems, 29 (2013), 085011.
  • [19] B. Kaltenbacher, A. Neubauer and O. Scherzer, Iterative Regularization Methods for Nonlinear Ill-Posed Problems, Radon Ser. Comput. Appl. Math. 6, De Gruyter, Berlin, 2008.
  • [20] S. Kindermann, Optimal-order convergence of Nesterov acceleration for linear ill-posed problems, Inverse Problems, 37 (2021), no. 6, 065002.
  • [21] F. Natterer, The Mathematics of Computerized Tomography, SIAM, Philadelphia, 2001.
  • [22] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k2)O(1/k^{2}), Soviet Mathematics Doklady, 27 (1983), 372–376.
  • [23] Y. Nesterov, Lectures on Convex Optimization, Springer Science & Business Media, New York, 2018.
  • [24] A. Neubauer, On Nesterov acceleration for Landweber iteration of linear ill-posed problems, J. Inverse Ill-Posed Probl., 25 (2017), no. 3, 381–390.
  • [25] P. Ochs, Local convergence of the heavy-ball method and iPiano for non-convex optimization, J. Optim. Theory Appl., 177 (2018), no. 1, 153–180.
  • [26] P. Ochs, Y. Chen, T. Brox and T. Pock, iPiano: inertial proximal algorithm for nonconvex optimization, SIAM J. Imaging Sci. 7 (2014), no. 2, 1388–1419.
  • [27] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, 4 (1964), no. 5, 1–17.
  • [28] F. Schöpfer and T. Schuster, Fast regularizing sequential subspace optimization in Banach spaces, Inverse Problems, 25 (2009), 015013.
  • [29] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing Co., Inc., River Edge, New Jersey, 2002.
  • [30] S. K. Zavriev and F. V. Kostyuk, Heavy-ball method in nonconvex optimization problems, Comput. Math. Model., 4 (1993), 336–341.
  • [31] M. Zhong, W. Wang and Q. Jin, Regularization of inverse problems by two-point gradient methods in Banach spaces, Numer. Math., 143 (2019), no. 3, 713–747.
  • [32] M. Zhu and T. F. Chan, An efficient primal-dual hybrid gradient algorithm for total variation image restoration, CAM Report 08–34 UCLA, 2008.