
An adaptive heavy ball method for ill-posed inverse problems

Qinian Jin Mathematical Sciences Institute, Australian National University, Canberra, ACT 2601, Australia [email protected]  and  Qin Huang School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, 100081, China [email protected]
Abstract.

In this paper we consider ill-posed inverse problems, both linear and nonlinear, by a heavy ball method in which a strongly convex regularization function is incorporated to capture the features of the sought solution. We develop ideas on how to adaptively choose the step-sizes and the momentum coefficients to achieve acceleration over the Landweber-type method. We then analyze the method and establish its regularization property when it is terminated by the discrepancy principle. Various numerical results are reported which demonstrate the superior performance of our method over the Landweber-type method, substantially reducing the required number of iterations and the computational time.

Key words and phrases:
Ill-posed inverse problems, adaptive heavy ball method, step-sizes, momentum coefficients, convergence
2010 Mathematics Subject Classification:
65J20, 65J22, 65J15, 47J06

1. Introduction

Consider an operator equation of the form

F(x) = y, \qquad (1)

where $F: \mathrm{dom}(F)\subset X\to Y$ is an operator between two real Hilbert spaces $X$ and $Y$ with domain $\mathrm{dom}(F)$. The operator $F$ could be either linear or nonlinear. Throughout, we always assume that (1) has a solution, i.e. $y\in\mathrm{Ran}(F)$, the range of $F$. We are interested in the situation where (1) is ill-posed in the sense that its solution does not depend continuously on the right-hand side data. Such situations arise in a broad range of applications of inverse problems. In practical scenarios, data are acquired through experiments and thus the exact data may not be available; instead we only have measurement data corrupted by noise. Due to the ill-posed nature of the underlying problem, it is therefore important to develop algorithms that produce stable approximate solutions of (1) from noisy data.

In practical applications, some a priori information on the sought solution, such as non-negativity, sparsity or piecewise constancy, is often available. Utilizing such feature information in algorithm design can significantly improve the reconstruction accuracy. By incorporating the available a priori information, one may pick a proper, lower semi-continuous, strongly convex function $\mathcal{R}: X\to(-\infty,\infty]$ as a regularization function to determine a solution of (1) with the desired feature. Let $y^{\delta}$ be noisy data satisfying

\|y^{\delta} - y\| \leq \delta

with a known noise level δ>0\delta>0. The Landweber-type method

\xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta} L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta}), \qquad x_{n+1}^{\delta} = \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\xi_{n+1}^{\delta}, x\rangle\right\}, \qquad (2)

has been considered in [18] for solving (1), where $\alpha_{n}^{\delta} > 0$ is the step-size and $\{L(x): x\in\mathrm{dom}(F)\}$ is a family of properly chosen bounded linear operators. In case $F$ is Fréchet differentiable at $x$, we may take $L(x) = F'(x)$, the Fréchet derivative of $F$ at $x$; otherwise, $L(x)$ should be appropriately chosen as a substitute for the non-existent Fréchet derivative. This method has been thoroughly analyzed in [15, 18] under the constant step-size

\alpha_{n}^{\delta} = \frac{\mu_{0}}{L^{2}}, \qquad (3)

when an upper bound $L$ of $\|L(x)\|$ around the sought solution is available, and the adaptive step-size

\alpha_{n}^{\delta} = \min\left\{\frac{\mu_{0}\|F(x_{n}^{\delta}) - y^{\delta}\|^{2}}{\|L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta})\|^{2}},\ \mu_{1}\right\} \qquad (4)

with suitable choices of the parameters $\mu_{0}$ and $\mu_{1}$; the regularization property has been established when the iteration is terminated by the discrepancy principle. Note that, when $\mathcal{R}(x) = \|x\|^{2}/2$, the corresponding method with $\alpha_{n}^{\delta}$ chosen by (3) is the classical Landweber method, which has been analyzed in [6, 10], and the corresponding method with $\alpha_{n}^{\delta}$ chosen by (4) is the minimal error method ([19]). Extensive numerical simulations have demonstrated that the Landweber-type method (2) converges slowly, often necessitating a large number of iteration steps before termination by the discrepancy principle. It is therefore important to develop strategies for accelerating the Landweber-type method (2) while retaining its simple implementation.
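To make the two step-size rules concrete, the following sketch computes the constant rule (3) and the adaptive rule (4) for a small linear problem; here a matrix `A` stands in for $L(x_n^{\delta})$, and all names and parameter values are our illustrative choices rather than anything fixed by the paper.

```python
import numpy as np

def adaptive_step_size(A, r, mu0=0.99, mu1=1e6):
    """Adaptive rule (4) for a linear map A standing in for L(x_n):
    compare mu0 * ||r||^2 / ||A^T r||^2 with the cap mu1."""
    g = A.T @ r                      # g_n = L(x_n)^* r_n
    gn2 = float(g @ g)
    if gn2 == 0.0:                   # residual orthogonal to the range of A
        return mu1
    return min(mu0 * float(r @ r) / gn2, mu1)

A = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.1]])
L = np.linalg.norm(A, 2)             # operator norm, an upper bound for ||L(x)||
alpha_const = 0.99 / L**2            # constant rule (3) with mu0 = 0.99
r = A @ np.array([1.0, -1.0]) - np.array([0.1, 0.0, 0.05])
alpha_adapt = adaptive_step_size(A, r)
```

The adaptive rule typically yields a larger step than the conservative constant rule whenever the residual is poorly aligned with the dominant singular directions of the operator.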

For linear ill-posed problems in Hilbert spaces, Brakhage used in [4] the Jacobi polynomials to construct a method, known as the $\nu$-method, to accelerate the classical Landweber iteration. Additionally, in [9], a family of accelerated Landweber iterations was developed by employing orthogonal polynomials and the spectral theory of self-adjoint operators. However, the acceleration strategy using orthogonal polynomials is scarcely applicable to the Landweber-type method (2) when the forward operator $F$ is nonlinear or the regularization function $\mathcal{R}$ is non-quadratic. In [12, 28] sequential subspace optimization strategies were employed to accelerate the Landweber iteration. The implementation of this strategy, however, requires solving an additional optimization problem at each iteration step to determine the weighted coefficients of the search directions, which unfortunately slows down the method.

On the other hand, to determine a solution of the well-posed minimization problem

\min_{x\in X} f(x) \qquad (5)

with a continuously differentiable objective function $f$, tremendous progress has been made toward accelerating the gradient method

x_{n+1} = x_{n} - \alpha_{n}\nabla f(x_{n})

in the optimization community. In particular, Nesterov's accelerated gradient method ([22]) and Polyak's heavy ball method ([27]) stand out and have had a far-reaching impact on the development of fast first-order optimization methods. In Nesterov's accelerated gradient method, the next iterate is defined not at the current iterate but at a carefully chosen extrapolation point built from the previous two iterates ([1, 2, 5, 23]). Based on Nesterov's acceleration strategy, an accelerated version of the Landweber-type method has been proposed in [15], and a refined version of the method takes the form (see also [16, 31])

\begin{aligned}
\hat{\xi}_{n}^{\delta} &= \xi_{n}^{\delta} + \frac{n-1}{n+\gamma}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta}), & \hat{x}_{n}^{\delta} &= \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\hat{\xi}_{n}^{\delta}, x\rangle\right\}, \\
\xi_{n+1}^{\delta} &= \hat{\xi}_{n}^{\delta} - \alpha_{n}^{\delta} L(\hat{x}_{n}^{\delta})^{*}(F(\hat{x}_{n}^{\delta}) - y^{\delta}), & x_{n+1}^{\delta} &= \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\xi_{n+1}^{\delta}, x\rangle\right\}, \qquad (6)
\end{aligned}

where $\gamma\geq 2$. The numerical results presented in [15] demonstrate its striking acceleration effect. Inspired by the numerical observations in [15], some efforts have been devoted to analyzing the regularization property of (6) in the context of ill-posed problems. When $X$ is a Hilbert space, $F$ is a bounded linear operator and $\mathcal{R}(x) = \|x\|^{2}/2$, the regularization property of the corresponding method, terminated by either a priori stopping rules or the discrepancy principle, has been established in [20, 24] based on a general acceleration framework in [9] using orthogonal polynomials. When $F$ is nonlinear or $\mathcal{R}$ is non-quadratic, although some partial results are available in [14, 16], the analysis of (6) is underdeveloped and the understanding of its regularization property is largely open. By modifying the first equation in (6) to $\hat{\xi}_{n}^{\delta} = \xi_{n}^{\delta} + \lambda_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta})$ with a connection coefficient $\lambda_{n}^{\delta}$ to be determined, a two-point gradient method has been considered in [13, 31]. This method, however, requires a discrete backtracking search algorithm to determine $\lambda_{n}^{\delta}$ at each iteration step, which can incur additional computational time.

Different from Nesterov's method, Polyak's heavy ball method accelerates the gradient method by adding a momentum term to the definition of the iterates, i.e.

x_{n+1} = x_{n} - \alpha_{n}\nabla f(x_{n}) + \beta_{n}(x_{n} - x_{n-1}). \qquad (7)

The performance of (7) depends crucially on the properties of $f$ and the choices of the step-size $\alpha_{n}$ and the momentum coefficient $\beta_{n}$, and its analysis turns out to be very challenging. When $f$ is twice continuously differentiable, strongly convex and has a Lipschitz continuous gradient, it has been demonstrated in [27], under suitable constant choices of $\alpha_{n}$ and $\beta_{n}$, that the iterative sequence $\{x_{n}\}$ enjoys a provable linear convergence faster than gradient descent. Due to the recent development of machine learning and the appearance of large-scale problems, the heavy ball method has gained renewed interest and much attention has been paid to understanding its convergence behavior ([7, 23, 25, 26, 30]). Although many strategies have been proposed for selecting $\alpha_{n}$ and $\beta_{n}$, how to choose these parameters to achieve acceleration for general convex and non-convex problems remains an active research topic in the optimization community.
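As a minimal illustration of (7) in the well-posed setting, the sketch below runs the heavy ball iteration on the strongly convex quadratic $f(x) = \frac{1}{2}x^{T}Qx$ with Polyak's classical constant choices of $\alpha_n$ and $\beta_n$; the matrix and constants are our toy assumptions, not the paper's setting.

```python
import numpy as np

# Heavy ball iteration (7) on f(x) = 0.5 * x^T Q x with Q = diag(1, 10),
# using Polyak's classical constants built from the strong convexity
# constant m = 1 and the gradient Lipschitz constant Lf = 10.
Q = np.diag([1.0, 10.0])
m, Lf = 1.0, 10.0
alpha = 4.0 / (np.sqrt(Lf) + np.sqrt(m)) ** 2
beta = ((np.sqrt(Lf) - np.sqrt(m)) / (np.sqrt(Lf) + np.sqrt(m))) ** 2

x_prev = x = np.array([5.0, 5.0])
for _ in range(200):
    grad = Q @ x                                   # nabla f(x_n)
    x, x_prev = x - alpha * grad + beta * (x - x_prev), x
# x is now very close to the unique minimizer 0
```

With these constants the iterates contract at a rate governed by $(\sqrt{L_f}-\sqrt{m})/(\sqrt{L_f}+\sqrt{m})$, which is faster than the corresponding gradient descent rate on the same quadratic.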

In this paper we will consider the iterative method

\xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta} L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta}) + \beta_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta}), \qquad x_{n+1}^{\delta} = \arg\min_{x\in X}\left\{\mathcal{R}(x) - \langle\xi_{n+1}^{\delta}, x\rangle\right\} \qquad (8)

for solving the ill-posed inverse problem (1), where $\alpha_{n}^{\delta}$ denotes the step-size and $\beta_{n}^{\delta}$ denotes the momentum coefficient. This method differs from (2) in that a momentum term $\beta_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta})$ is added to define the iterates. The method (8) has an intimate connection with the heavy ball method. Indeed, when $\mathcal{R}(x) = \|x\|^{2}/2$ and $F$ is Fréchet differentiable, the method (8) with $L(x) = F'(x)$ becomes

x_{n+1}^{\delta} = x_{n}^{\delta} - \alpha_{n}^{\delta} F'(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta}) + \beta_{n}^{\delta}(x_{n}^{\delta} - x_{n-1}^{\delta}),

which can be obtained by applying the heavy ball method (7) to the minimization problem (5) with

f(x) := \frac{1}{2}\|F(x) - y^{\delta}\|^{2}. \qquad (9)

We will further demonstrate that, when $F$ is a bounded linear operator, the method (8) with a general strongly convex $\mathcal{R}$ can be derived by applying the heavy ball method to a dual problem. Based on the successful experience with the acceleration effect of the method (7) for well-posed optimization problems, it is natural to expect that, with suitable choices of the parameters $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$, the method (8) can outperform the method (2) for ill-posed problems. In this paper we will provide a criterion for choosing $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ and investigate the convergence property of the corresponding method. Our choices of $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ are adaptive in the sense that only quantities arising from the computation are used. The formulae for calculating $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ are explicit, thereby avoiding the need for a backtracking search strategy and consequently saving computational time. The per-iteration complexity of our proposed method is comparable to one step of the Landweber-type method (2). However, extensive computational results demonstrate that our method significantly accelerates the Landweber-type method in terms of both the number of iterations and the computational time.

This paper is organized as follows. In Section 2 we first formulate the conditions on the regularization function $\mathcal{R}$ and the forward operator $F$. We then propose our adaptive heavy ball method, elucidating how to choose the step-sizes and the momentum coefficients, and show that our method is well-defined and that the iteration is terminated after finitely many steps by the discrepancy principle. In Section 3 we focus on the convergence analysis and establish the regularization property of our method. In Section 4 we perform various numerical simulations to demonstrate the acceleration effect and the superior performance of our method over the Landweber-type method.

2. The adaptive heavy-ball method

In this section we will consider the iterative method (8) and provide a motivation for choosing the step-size $\alpha_{n}^{\delta}$ and the momentum coefficient $\beta_{n}^{\delta}$ adaptively. We will then show that our proposed method, i.e. Algorithm 1 below, is well-defined.

2.1. Assumptions

Note that the method (8) involves the regularization function $\mathcal{R}$ and the forward operator $F$. In order to carry out the analysis, certain conditions should be placed on $\mathcal{R}$ and $F$. For the regularization function $\mathcal{R}$, we assume the following standard condition.

Assumption 1.

$\mathcal{R}: X\to(-\infty,\infty]$ is a proper, lower semi-continuous, strongly convex function in the sense that there is a constant $\sigma > 0$ such that

\mathcal{R}(t\bar{x} + (1-t)x) + \sigma t(1-t)\|\bar{x} - x\|^{2} \leq t\mathcal{R}(\bar{x}) + (1-t)\mathcal{R}(x)

for all $\bar{x}, x\in\mathrm{dom}(\mathcal{R})$ and $0\leq t\leq 1$.
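As a quick sanity check, for $\mathcal{R}(x) = \|x\|^{2}/2$ the inequality in Assumption 1 holds with $\sigma = 1/2$ (in fact with equality, by expanding the squares); the snippet below verifies this numerically at random points, with our own choice of test data.

```python
import numpy as np

# Numerical spot-check of the strong convexity inequality in Assumption 1
# for R(x) = ||x||^2 / 2, which satisfies it with sigma = 1/2.
rng = np.random.default_rng(1)
sigma = 0.5
R = lambda x: 0.5 * float(x @ x)

worst_gap = -np.inf
for _ in range(100):
    xb, x = rng.standard_normal(4), rng.standard_normal(4)
    t = rng.uniform()
    lhs = R(t * xb + (1 - t) * x) + sigma * t * (1 - t) * float((xb - x) @ (xb - x))
    rhs = t * R(xb) + (1 - t) * R(x)
    worst_gap = max(worst_gap, lhs - rhs)   # should never exceed 0
```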

To make use of Assumption 1, let us recall a little convex analysis ([29]). For any proper convex function $\mathcal{R}: X\to(-\infty,\infty]$ we use $\partial\mathcal{R}$ to denote its subdifferential, i.e.

\partial\mathcal{R}(x) = \{\xi\in X: \mathcal{R}(z)\geq\mathcal{R}(x) + \langle\xi, z - x\rangle \mbox{ for all } z\in X\}.

For any $x\in X$ with $\partial\mathcal{R}(x)\neq\emptyset$, let $\xi\in\partial\mathcal{R}(x)$ and define

D_{\mathcal{R}}^{\xi}(z,x) := \mathcal{R}(z) - \mathcal{R}(x) - \langle\xi, z - x\rangle, \quad z\in X,

which is called the Bregman distance induced by $\mathcal{R}$ at $x$ in the direction $\xi$. If $\mathcal{R}$ is strongly convex in the sense of Assumption 1, it is easy to derive that

D_{\mathcal{R}}^{\xi}(z,x) \geq \sigma\|z - x\|^{2} \qquad (10)

for all $x\in X$ with $\partial\mathcal{R}(x)\neq\emptyset$, $\xi\in\partial\mathcal{R}(x)$ and $z\in X$. For a proper, lower semi-continuous, convex function $\mathcal{R}: X\to(-\infty,\infty]$, its convex conjugate

\mathcal{R}^{*}(\xi) := \sup_{x\in X}\left\{\langle\xi, x\rangle - \mathcal{R}(x)\right\}, \quad \forall\xi\in X,

is also a proper, lower semi-continuous, convex function from $X$ to $(-\infty,\infty]$. Moreover,

\xi\in\partial\mathcal{R}(x) \iff x\in\partial\mathcal{R}^{*}(\xi) \iff \mathcal{R}(x) + \mathcal{R}^{*}(\xi) = \langle\xi, x\rangle.

If $\mathcal{R}$ satisfies Assumption 1, then $\mathcal{R}^{*}$ is continuously differentiable and its gradient $\nabla\mathcal{R}^{*}$ is Lipschitz continuous with

\|\nabla\mathcal{R}^{*}(\bar{\xi}) - \nabla\mathcal{R}^{*}(\xi)\| \leq \frac{\|\bar{\xi} - \xi\|}{2\sigma}, \quad \forall\bar{\xi}, \xi\in X; \qquad (11)

see [29, Corollary 3.5.11]. Thus, under Assumption 1, for any $\xi\in X$ the minimization problem

x = \arg\min_{z\in X}\left\{\mathcal{R}(z) - \langle\xi, z\rangle\right\}

has a unique solution given by $x = \nabla\mathcal{R}^{*}(\xi)$. Clearly $\xi\in\partial\mathcal{R}(x)$.
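For a concrete instance, take $\mathcal{R}(x) = \frac{1}{2}\|x\|^{2} + \lambda\|x\|_{1}$, which satisfies Assumption 1 with $\sigma = 1/2$ and promotes sparsity; then $x = \nabla\mathcal{R}^{*}(\xi)$ is given componentwise by soft-thresholding, a standard closed form. This particular $\mathcal{R}$ and the value of $\lambda$ are our illustrative choices, not ones fixed by the paper.

```python
import numpy as np

def grad_R_star(xi, lam):
    """x = nabla R^*(xi) for R(x) = 0.5*||x||^2 + lam*||x||_1 (sigma = 1/2):
    the minimizer of R(x) - <xi, x> is componentwise soft-thresholding of xi.
    Note it is 1-Lipschitz, consistent with the bound 1/(2*sigma) in (11)."""
    return np.sign(xi) * np.maximum(np.abs(xi) - lam, 0.0)

xi = np.array([2.0, -0.3, 1.2, 0.0])
lam = 0.5
x = grad_R_star(xi, lam)
# optimality check: xi in dR(x) means xi - x must lie in lam * d||x||_1,
# i.e. every component of w is bounded by lam in absolute value
w = xi - x
```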

For the forward operator $F$, we assume the following condition, which has been widely used in dealing with ill-posed problems, where $B_{\rho}(x_{0}) := \{x\in X: \|x - x_{0}\|\leq\rho\}$ for any $\rho > 0$.

Assumption 2.
  • (a)

    $X$ and $Y$ are Hilbert spaces.

  • (b)

    There exist $\rho > 0$, $x_{0}\in X$ and $\xi_{0}\in\partial\mathcal{R}(x_{0})$ such that $B_{2\rho}(x_{0})\subset\mathrm{dom}(F)$ and (1) has a solution $\bar{x}$ such that $D_{\mathcal{R}}^{\xi_{0}}(\bar{x}, x_{0})\leq\sigma\rho^{2}$.

  • (c)

    $F$ is weakly closed on $\mathrm{dom}(F)$, i.e. for any sequence $\{x_{n}\}\subset\mathrm{dom}(F)$ satisfying $x_{n}\rightharpoonup x\in X$ and $F(x_{n})\to v\in Y$ there hold $x\in\mathrm{dom}(F)$ and $F(x) = v$. Throughout the paper "$\to$" and "$\rightharpoonup$" denote strong and weak convergence respectively.

  • (d)

    There exists a family of bounded linear operators $\{L(x): X\to Y\}_{x\in B_{2\rho}(x_{0})}$ such that $x\to L(x)$ is continuous on $B_{2\rho}(x_{0})$ and there is $0\leq\eta < 1$ such that

    \|F(x) - F(\bar{x}) - L(\bar{x})(x - \bar{x})\| \leq \eta\|F(x) - F(\bar{x})\|

    for all $x, \bar{x}\in B_{2\rho}(x_{0})$. Moreover, there is a constant $L > 0$ such that $\|L(x)\|\leq L$ for all $x\in B_{2\rho}(x_{0})$.

We remark that, when $F$ is a bounded linear operator, Assumption 2 holds automatically with $\rho = \infty$ and $\eta = 0$. When $F$ is nonlinear, Assumption 2 (d) is known as the tangential cone condition, which imposes a certain restriction on the nonlinearity of $F$. Assumption 2 (d) clearly implies

\|F(x) - F(\bar{x})\| \leq \frac{1}{1-\eta}\|L(\bar{x})(x - \bar{x})\|, \quad \forall x, \bar{x}\in B_{2\rho}(x_{0}),

and thus the continuity of the mapping $x\to F(x)$ on $B_{2\rho}(x_{0})$. Based on Assumption 1 and Assumption 2, it has been shown in [18, Lemma 3.2] that (1) has a unique solution $x^{\dagger}$ satisfying

D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger}, x_{0}) = \min_{x\in\mathrm{dom}(F)}\left\{D_{\mathcal{R}}^{\xi_{0}}(x, x_{0}): F(x) = y\right\}. \qquad (12)

By using (10) and Assumption 2 (b) it is easy to see that $\|x^{\dagger} - x_{0}\|\leq\rho$, i.e. $x^{\dagger}\in B_{\rho}(x_{0})$.

2.2. The method

Let us now turn to the method (8). We first demonstrate that, when $F$ is a bounded linear operator, the method (8) can be derived by applying the heavy ball method to the dual problem of

\min\{\mathcal{R}(x): x\in X \mbox{ and } Fx = y\}. \qquad (13)

To see this, we recall that the associated dual function is given by

d(\lambda) := \inf_{x\in X}\left\{\mathcal{R}(x) - \langle\lambda, Fx - y\rangle\right\} = -\mathcal{R}^{*}(F^{*}\lambda) + \langle\lambda, y\rangle,

and hence the corresponding dual problem is $\max_{\lambda\in Y} d(\lambda)$, i.e.

\min_{\lambda\in Y}\left\{\mathcal{R}^{*}(F^{*}\lambda) - \langle\lambda, y\rangle\right\}. \qquad (14)

Since $\mathcal{R}$ satisfies Assumption 1, the objective function in (14) is continuously differentiable. Applying the heavy ball method to solve (14) gives

\lambda_{n+1} = \lambda_{n} - \alpha_{n}(F\nabla\mathcal{R}^{*}(F^{*}\lambda_{n}) - y) + \beta_{n}(\lambda_{n} - \lambda_{n-1}).

By setting $\xi_{n} := F^{*}\lambda_{n}$ and $x_{n} := \nabla\mathcal{R}^{*}(F^{*}\lambda_{n})$, we then obtain

x_{n} = \nabla\mathcal{R}^{*}(\xi_{n}), \qquad \xi_{n+1} = \xi_{n} - \alpha_{n}F^{*}(Fx_{n} - y) + \beta_{n}(\xi_{n} - \xi_{n-1}).

In case only noisy data $y^{\delta}$ are available, replacing $y$ by $y^{\delta}$ and allowing $\alpha_{n}$, $\beta_{n}$ to depend on $y^{\delta}$ in the above equation yields the method

x_{n}^{\delta} = \nabla\mathcal{R}^{*}(\xi_{n}^{\delta}), \qquad \xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta}F^{*}(Fx_{n}^{\delta} - y^{\delta}) + \beta_{n}^{\delta}(\xi_{n}^{\delta} - \xi_{n-1}^{\delta}),

which is equivalent to (8) when $F$ is bounded linear.

Note that the term $F^{*}(Fx_{n}^{\delta} - y^{\delta})$ in the above equation is the gradient of the function $x\to\frac{1}{2}\|Fx - y^{\delta}\|^{2}$ at $x_{n}^{\delta}$. When $F$ is a nonlinear Fréchet differentiable operator, we may use the gradient $F'(x_{n}^{\delta})^{*}(F(x_{n}^{\delta}) - y^{\delta})$ of the function (9) at $x_{n}^{\delta}$ in place of $F^{*}(Fx_{n}^{\delta} - y^{\delta})$; in case $F$ is not differentiable at $x_{n}^{\delta}$, we may substitute $L(x_{n}^{\delta})$ for $F'(x_{n}^{\delta})$. This leads to the method (8) for solving (1).
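When $F$ is a bounded linear operator and $\mathcal{R}(x) = \|x\|^{2}/2$, the iteration above reduces to a heavy-ball Landweber update on the iterates themselves. The sketch below implements this special case with a fixed momentum coefficient $\beta$ and the constant step-size (3), stopping by the discrepancy principle; the fixed parameter values and the test problem are our illustrative assumptions (the adaptive choice of $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ is the subject of Algorithm 1 below).

```python
import numpy as np

def heavy_ball_landweber(A, y_delta, delta, tau=1.1, beta=0.5, max_iter=5000):
    """Method (8) with R(x) = ||x||^2/2 for a linear map A, using a fixed
    momentum beta and step-size alpha = 0.99 / ||A||^2 (rule (3)).
    Stops by the discrepancy principle ||A x - y_delta|| <= tau * delta."""
    alpha = 0.99 / np.linalg.norm(A, 2) ** 2
    x_prev = x = np.zeros(A.shape[1])
    for n in range(max_iter):
        r = A @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:
            break
        x, x_prev = x - alpha * A.T @ r + beta * (x - x_prev), x
    return x, n

# small consistent test problem with mildly noisy data
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
x_true = rng.standard_normal(10)
noise = 1e-3 * rng.standard_normal(20)
y_delta = A @ x_true + noise
delta = np.linalg.norm(noise)
x_rec, n_stop = heavy_ball_landweber(A, y_delta, delta)
```

On this toy problem the momentum term typically reduces the number of iterations needed to reach the discrepancy level compared with the plain Landweber iteration (beta = 0).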

In order for the method (8) to converge fast, it is natural to choose $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ at each iteration step such that $x_{n+1}^{\delta}$ is as close to a solution of (1) as possible. Let $\hat{x}$ denote any solution of (1) in $B_{2\rho}(x_{0})$. We may measure the closeness of $x_{n+1}^{\delta}$ to $\hat{x}$ by the Bregman distance

D_{\mathcal{R}}^{\xi_{n+1}^{\delta}}(\hat{x}, x_{n+1}^{\delta}) := \mathcal{R}(\hat{x}) - \mathcal{R}(x_{n+1}^{\delta}) - \langle\xi_{n+1}^{\delta}, \hat{x} - x_{n+1}^{\delta}\rangle

induced by $\mathcal{R}$ at $x_{n+1}^{\delta}$ in the direction $\xi_{n+1}^{\delta}$. Directly minimizing this quantity is infeasible because it involves the unknown solution $\hat{x}$ and the strongly convex function $\mathcal{R}$, which may be non-quadratic. As a compromise, we may first derive a suitable upper bound for this quantity and then take minimization steps.

We will elucidate the idea for choosing $\alpha_{n}^{\delta}$ and $\beta_{n}^{\delta}$ below. For simplicity of exposition, we introduce the notation

r_{n}^{\delta} := F(x_{n}^{\delta}) - y^{\delta}, \qquad g_{n}^{\delta} := L(x_{n}^{\delta})^{*}r_{n}^{\delta} \qquad\mbox{and}\qquad m_{n}^{\delta} := \xi_{n}^{\delta} - \xi_{n-1}^{\delta}

for all integers $n\geq 0$. Then (8) can be written as

\xi_{n+1}^{\delta} = \xi_{n}^{\delta} - \alpha_{n}^{\delta}g_{n}^{\delta} + \beta_{n}^{\delta}m_{n}^{\delta}, \qquad x_{n+1}^{\delta} = \nabla\mathcal{R}^{*}(\xi_{n+1}^{\delta}). \qquad (15)

Let $\Delta_{n}^{\delta} := D_{\mathcal{R}}^{\xi_{n}^{\delta}}(\hat{x}, x_{n}^{\delta})$ for $n\geq 0$. Since $\xi_{n}^{\delta}\in\partial\mathcal{R}(x_{n}^{\delta})$, we have $\mathcal{R}(x_{n}^{\delta}) + \mathcal{R}^{*}(\xi_{n}^{\delta}) = \langle\xi_{n}^{\delta}, x_{n}^{\delta}\rangle$ and thus

\Delta_{n}^{\delta} = \mathcal{R}(\hat{x}) + \mathcal{R}^{*}(\xi_{n}^{\delta}) - \langle\xi_{n}^{\delta}, \hat{x}\rangle.

Therefore, by using $x_{n}^{\delta} = \nabla\mathcal{R}^{*}(\xi_{n}^{\delta})$, (11) and (15), we can obtain

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &= \left(\mathcal{R}(\hat{x}) + \mathcal{R}^{*}(\xi_{n+1}^{\delta}) - \langle\xi_{n+1}^{\delta}, \hat{x}\rangle\right) - \left(\mathcal{R}(\hat{x}) + \mathcal{R}^{*}(\xi_{n}^{\delta}) - \langle\xi_{n}^{\delta}, \hat{x}\rangle\right) \\
&= \mathcal{R}^{*}(\xi_{n+1}^{\delta}) - \mathcal{R}^{*}(\xi_{n}^{\delta}) - \langle\xi_{n+1}^{\delta} - \xi_{n}^{\delta}, \nabla\mathcal{R}^{*}(\xi_{n}^{\delta})\rangle + \langle\xi_{n+1}^{\delta} - \xi_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&\leq \frac{1}{4\sigma}\|\xi_{n+1}^{\delta} - \xi_{n}^{\delta}\|^{2} + \langle\xi_{n+1}^{\delta} - \xi_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&= \frac{1}{4\sigma}\|\alpha_{n}^{\delta}g_{n}^{\delta} - \beta_{n}^{\delta}m_{n}^{\delta}\|^{2} + \langle -\alpha_{n}^{\delta}g_{n}^{\delta} + \beta_{n}^{\delta}m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle.
\end{aligned}

By the polarization identity in Hilbert spaces, we then have

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &\leq \frac{1}{4\sigma}\left((\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} + (\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - 2\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle\right) \\
&\quad - \alpha_{n}^{\delta}\langle r_{n}^{\delta}, L(x_{n}^{\delta})(x_{n}^{\delta} - \hat{x})\rangle + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&= \frac{1}{4\sigma}(\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle \\
&\quad - \alpha_{n}^{\delta}\langle r_{n}^{\delta}, r_{n}^{\delta} + y^{\delta} - y + y - F(x_{n}^{\delta}) - L(x_{n}^{\delta})(\hat{x} - x_{n}^{\delta})\rangle \\
&\quad + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle.
\end{aligned}

Assume $x_{n}^{\delta}\in B_{2\rho}(x_{0})$. By using $\|y^{\delta} - y\|\leq\delta$, the Cauchy-Schwarz inequality and Assumption 2 (d), we further obtain

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &\leq \frac{1}{4\sigma}(\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle - \alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} \\
&\quad + \delta\alpha_{n}^{\delta}\|r_{n}^{\delta}\| + \eta\alpha_{n}^{\delta}\|r_{n}^{\delta}\|\|y - F(x_{n}^{\delta})\| + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \\
&\leq \frac{1}{4\sigma}(\alpha_{n}^{\delta})^{2}\|g_{n}^{\delta}\|^{2} - \alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} + (1+\eta)\delta\alpha_{n}^{\delta}\|r_{n}^{\delta}\| + \eta\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} \\
&\quad + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle. \qquad (16)
\end{aligned}

Note that when $\beta_{n}^{\delta} = 0$, the method (8) becomes the Landweber-type method (2), and extensive analysis has been done on (2) with suitable choices of the step-size $\alpha_{n}^{\delta}$, including (3) and (4). For the method (8) we choose $\alpha_{n}^{\delta}$ in the same way, i.e.

\alpha_{n}^{\delta} := \frac{\mu_{0}}{L^{2}} \qquad\mbox{or}\qquad \alpha_{n}^{\delta} := \min\left\{\frac{\mu_{0}\|r_{n}^{\delta}\|^{2}}{\|g_{n}^{\delta}\|^{2}},\ \mu_{1}\right\} \qquad (17)

when $\|r_{n}^{\delta}\| > \tau\delta$, where $\mu_{0}$ and $\mu_{1}$ are two positive constants. Then, by plugging the choice of $\alpha_{n}^{\delta}$ from (17) into (16) and using $\delta\leq\|r_{n}^{\delta}\|/\tau$, we have

\begin{aligned}
\Delta_{n+1}^{\delta} - \Delta_{n}^{\delta} &\leq -\left(1 - \frac{1+\eta}{\tau} - \eta - \frac{\mu_{0}}{4\sigma}\right)\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2} + \frac{1}{4\sigma}(\beta_{n}^{\delta})^{2}\|m_{n}^{\delta}\|^{2} \\
&\quad + \beta_{n}^{\delta}\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle - \frac{1}{2\sigma}\alpha_{n}^{\delta}\beta_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle. \qquad (18)
\end{aligned}

We next consider the choice of $\beta_{n}^{\delta}$. It seems natural to obtain it by minimizing the right-hand side of (18) with respect to $\beta_{n}^{\delta}$. Let us fix a number $\beta\in(0,\infty]$ and minimize the right-hand side of (18) over $[0,\beta]$ to obtain

\beta_{n}^{\delta} = \min\left\{\max\left\{0,\ \frac{\alpha_{n}^{\delta}\langle g_{n}^{\delta}, m_{n}^{\delta}\rangle - 2\sigma\langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle}{\|m_{n}^{\delta}\|^{2}}\right\},\ \beta\right\}

whenever $m_{n}^{\delta}\neq 0$. However, this formula for $\beta_{n}^{\delta}$ is not computable because it involves the unknown solution $\hat{x}$. We need to treat the term

\gamma_{n}^{\delta} := \langle m_{n}^{\delta}, x_{n}^{\delta} - \hat{x}\rangle \qquad (19)

by using a suitable computable surrogate. Note that $\gamma_{0}^{\delta} = 0$ and, for $n\geq 1$,

\begin{aligned}
\gamma_{n}^{\delta} &= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle + \langle m_{n}^{\delta}, x_{n-1}^{\delta} - \hat{x}\rangle \\
&= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle + \langle -\alpha_{n-1}^{\delta}g_{n-1}^{\delta} + \beta_{n-1}^{\delta}m_{n-1}^{\delta}, x_{n-1}^{\delta} - \hat{x}\rangle \\
&= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\langle r_{n-1}^{\delta}, L(x_{n-1}^{\delta})(x_{n-1}^{\delta} - \hat{x})\rangle + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta} \\
&= \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} \\
&\quad - \alpha_{n-1}^{\delta}\langle r_{n-1}^{\delta}, y^{\delta} - F(x_{n-1}^{\delta}) + L(x_{n-1}^{\delta})(x_{n-1}^{\delta} - \hat{x})\rangle + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta} \\
&\leq \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} \\
&\quad + \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|\left(\delta + \|y - F(x_{n-1}^{\delta}) - L(x_{n-1}^{\delta})(\hat{x} - x_{n-1}^{\delta})\|\right) + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta}.
\end{aligned}

Assuming $x_{n-1}^{\delta}\in B_{2\rho}(x_{0})$, we may use Assumption 2 (d) to further obtain

\begin{aligned}
\gamma_{n}^{\delta} &\leq \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} + \alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|\left(\delta + \eta\|y - F(x_{n-1}^{\delta})\|\right) + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta} \\
&\leq \langle m_{n}^{\delta}, x_{n}^{\delta} - x_{n-1}^{\delta}\rangle - (1-\eta)\alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2} + (1+\eta)\alpha_{n-1}^{\delta}\delta\|r_{n-1}^{\delta}\| + \beta_{n-1}^{\delta}\gamma_{n-1}^{\delta}. \qquad (20)
\end{aligned}

This motivates us to introduce {γ~nδ}\{\tilde{\gamma}_{n}^{\delta}\} such that

γ~nδ={0 if n=0,mnδ,xnδxn1δ(1η)αn1δrn1δ2+(1+η)αn1δδrn1δ+βn1δγ~n1δ if n1\displaystyle\tilde{\gamma}_{n}^{\delta}=\left\{\begin{array}[]{lll}0&\mbox{ if }n=0,\\ \langle m_{n}^{\delta},x_{n}^{\delta}-x_{n-1}^{\delta}\rangle-(1-\eta)\alpha_{n-1}^{\delta}\|r_{n-1}^{\delta}\|^{2}&\\ \qquad\qquad+(1+\eta)\alpha_{n-1}^{\delta}\delta\|r_{n-1}^{\delta}\|+\beta_{n-1}^{\delta}\tilde{\gamma}_{n-1}^{\delta}&\mbox{ if }n\geq 1\end{array}\right. (24)

once xkδx_{k}^{\delta} for 0kn0\leq k\leq n and αkδ,βkδ\alpha_{k}^{\delta},\beta_{k}^{\delta} for 0kn10\leq k\leq n-1 are defined. This leads us to propose Algorithm 1 below, which will be studied in this paper.

Algorithm 1 (AHB: Adaptive heavy ball method).

Take an initial guess ξ0\xi_{0}, calculate x0:=argminxX{(x)ξ0,x}x_{0}:=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{0},x\rangle\}, and set ξ1δ=ξ0δ=ξ0\xi_{-1}^{\delta}=\xi_{0}^{\delta}=\xi_{0}. Pick the numbers τ>1\tau>1, 0<β0<\beta\leq\infty, μ0>0\mu_{0}>0 and μ1>0\mu_{1}>0. For n0n\geq 0 do the following:

  1. (i)

    Calculate rnδ:=F(xnδ)yδr_{n}^{\delta}:=F(x_{n}^{\delta})-y^{\delta}. If rnδτδ\|r_{n}^{\delta}\|\leq\tau\delta, stop and output xnδx_{n}^{\delta};

  2. (ii)

    Calculate gnδ:=L(xnδ)(F(xnδ)yδ)g_{n}^{\delta}:=L(x_{n}^{\delta})^{*}(F(x_{n}^{\delta})-y^{\delta}) and determine αnδ\alpha_{n}^{\delta} according to (17);

  3. (iii)

    Calculate mnδ:=ξnδξn1δm_{n}^{\delta}:=\xi_{n}^{\delta}-\xi_{n-1}^{\delta} and determine γ~nδ\tilde{\gamma}_{n}^{\delta} by the formula (24);

  4. (iv)

    Calculate βnδ\beta_{n}^{\delta} by

    βnδ={min{max{0,αnδgnδ,mnδ2σγ~nδmnδ2},β} if mnδ0,0 if mnδ=0;\displaystyle\beta_{n}^{\delta}=\left\{\begin{array}[]{lll}\min\left\{\max\left\{0,\frac{\alpha_{n}^{\delta}\langle g_{n}^{\delta},m_{n}^{\delta}\rangle-2\sigma\tilde{\gamma}_{n}^{\delta}}{\|m_{n}^{\delta}\|^{2}}\right\},\beta\right\}&\emph{ if }m_{n}^{\delta}\neq 0,\\ 0&\emph{ if }m_{n}^{\delta}=0;\end{array}\right.
  5. (v)

    Update ξn+1δ\xi_{n+1}^{\delta} and xn+1δx_{n+1}^{\delta} by

    ξn+1δ=ξnδαnδgnδ+βnδmnδ,xn+1δ=argminxX{(x)ξn+1δ,x}.\xi_{n+1}^{\delta}=\xi_{n}^{\delta}-\alpha_{n}^{\delta}g_{n}^{\delta}+\beta_{n}^{\delta}m_{n}^{\delta},\quad x_{n+1}^{\delta}=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{n+1}^{\delta},x\rangle\}.

Note that, in each iteration of Algorithm 1, the most expensive parts are the calculations of rnδr_{n}^{\delta}, gnδg_{n}^{\delta} and xn+1δx_{n+1}^{\delta}, which are also required by the Landweber-type method (2); the computational overhead of determining αnδ\alpha_{n}^{\delta}, γ~nδ\tilde{\gamma}_{n}^{\delta} and βnδ\beta_{n}^{\delta} is negligible. Therefore, the computational complexity per iteration of Algorithm 1 is marginally higher than, but very close to, that of the Landweber-type method (2).
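To make the iteration concrete, here is a minimal Python sketch of Algorithm 1 under illustrative assumptions not taken from the paper: a linear forward map F(x)=Ax (so L(x)=A and η=0) with (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 (so σ=1/2 and xn=ξn), a randomly generated matrix and data, and the adaptive step-size rule αn=min{μ0∥rn∥²/∥gn∥², μ1}; the parameters τ=2 and μ0=0.5 are chosen so that (25) holds with c0=1/4.

```python
import numpy as np

def ahb_linear(A, y_delta, delta, tau=2.0, mu0=0.5, mu1=1e6,
               beta_max=0.5, sigma=0.5, eta=0.0, max_iter=20000):
    """Sketch of Algorithm 1 for F(x) = A x and R(x) = ||x||^2 / 2,
    so that x_n = xi_n, sigma = 1/2 and eta = 0 (linear case)."""
    xi = np.zeros(A.shape[1])          # xi_0 = 0, hence x_0 = 0
    xi_prev = xi.copy()                # xi_{-1} = xi_0
    x_prev = xi.copy()
    gamma = 0.0                        # gamma_tilde_0 = 0
    alpha_prev, beta_prev, r_prev = 0.0, 0.0, None
    for n in range(max_iter):
        x = xi                                      # x_n = grad R*(xi_n)
        r = A @ x - y_delta                         # (i) residual r_n
        if np.linalg.norm(r) <= tau * delta:        # discrepancy principle
            return x, n
        g = A.T @ r                                 # (ii) gradient g_n = A^T r_n
        alpha = min(mu0 * (r @ r) / (g @ g), mu1)   # adaptive step size
        m = xi - xi_prev                            # (iii) momentum direction m_n
        if n > 0:                                   # gamma_tilde_n via (24)
            gamma = (m @ (x - x_prev)
                     - (1 - eta) * alpha_prev * (r_prev @ r_prev)
                     + (1 + eta) * alpha_prev * delta * np.linalg.norm(r_prev)
                     + beta_prev * gamma)
        mm = m @ m                                  # (iv) momentum coefficient
        beta = 0.0 if mm == 0 else min(
            max(0.0, (alpha * (g @ m) - 2 * sigma * gamma) / mm), beta_max)
        xi_prev, x_prev, r_prev = xi, x, r          # (v) update xi_{n+1}
        alpha_prev, beta_prev = alpha, beta
        xi = xi - alpha * g + beta * m
    return xi, max_iter

# toy test problem (illustrative): well-conditioned random A, 0.1% noise
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = rng.standard_normal(10)
y = A @ x_true
noise = rng.standard_normal(30)
delta = 1e-3 * np.linalg.norm(y)
y_delta = y + delta * noise / np.linalg.norm(noise)
x_rec, n_stop = ahb_linear(A, y_delta, delta)
```

Choosing beta_max < 1 matches the assumption under which strong convergence is established later; on this toy problem the method stops once the discrepancy principle is satisfied.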

In the design of Algorithm 1, the discrepancy principle has been incorporated, i.e. the iteration stops as soon as rnδτδ\|r_{n}^{\delta}\|\leq\tau\delta is satisfied for the first time. In the following we will prove that Algorithm 1 is well-defined by showing that the iteration indeed terminates within finitely many steps.

Lemma 2.1.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 and let k0k\geq 0 be an integer such that xnδB2ρ(x0)x_{n}^{\delta}\in B_{2\rho}(x_{0}) for all 0n<k0\leq n<k. Define γnδ\gamma_{n}^{\delta} by (19) for 0nk0\leq n\leq k. Then γnδγ~nδ\gamma_{n}^{\delta}\leq\tilde{\gamma}_{n}^{\delta} for all 0nk0\leq n\leq k.

Proof.

Since γ0δ=γ~0δ=0\gamma_{0}^{\delta}=\tilde{\gamma}_{0}^{\delta}=0, the result is true for n=0n=0. Assume next that γnδγ~nδ\gamma_{n}^{\delta}\leq\tilde{\gamma}_{n}^{\delta} for some 0n<k0\leq n<k. By noting that βnδ0\beta_{n}^{\delta}\geq 0, we may use (2.2) and the definition of γ~n+1δ\tilde{\gamma}_{n+1}^{\delta} to obtain

γn+1δ\displaystyle\gamma_{n+1}^{\delta} mn+1δ,xn+1δxnδ(1η)αnδrnδ2+(1+η)αnδδrnδ+βnδγnδ\displaystyle\leq\langle m_{n+1}^{\delta},x_{n+1}^{\delta}-x_{n}^{\delta}\rangle-(1-\eta)\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2}+(1+\eta)\alpha_{n}^{\delta}\delta\|r_{n}^{\delta}\|+\beta_{n}^{\delta}\gamma_{n}^{\delta}
mn+1δ,xn+1δxnδ(1η)αnδrnδ2+(1+η)αnδδrnδ+βnδγ~nδ\displaystyle\leq\langle m_{n+1}^{\delta},x_{n+1}^{\delta}-x_{n}^{\delta}\rangle-(1-\eta)\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2}+(1+\eta)\alpha_{n}^{\delta}\delta\|r_{n}^{\delta}\|+\beta_{n}^{\delta}\tilde{\gamma}_{n}^{\delta}
=γ~n+1δ.\displaystyle=\tilde{\gamma}_{n+1}^{\delta}.

By the induction principle, this shows the result. ∎

Lemma 2.2.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 with μ0>0\mu_{0}>0 and τ>1\tau>1 being chosen such that

c0:=11+ητημ04σ>0.\displaystyle c_{0}:=1-\frac{1+\eta}{\tau}-\eta-\frac{\mu_{0}}{4\sigma}>0. (25)

Let k0k\geq 0 be an integer such that rnδ>τδ\|r_{n}^{\delta}\|>\tau\delta for all 0n<k0\leq n<k. Then xnδB2ρ(x0)x_{n}^{\delta}\in B_{2\rho}(x_{0}) for all 0nk0\leq n\leq k and

Δn+1δΔnδc0αnδrnδ2\displaystyle\Delta_{n+1}^{\delta}\leq\Delta_{n}^{\delta}-c_{0}\alpha_{n}^{\delta}\|r_{n}^{\delta}\|^{2}

for all 0n<k0\leq n<k, where Δnδ:=Dξnδ(x^,xnδ)\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(\hat{x},x_{n}^{\delta}) and x^\hat{x} denotes any solution of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}).

Proof.

We first show by induction that

xnδB2ρ(x0)andDξnδ(x,xnδ)Dξ0(x,x0)\displaystyle x_{n}^{\delta}\in B_{2\rho}(x_{0})\quad\mbox{and}\quad D_{\mathcal{R}}^{\xi_{n}^{\delta}}(x^{\dagger},x_{n}^{\delta})\leq D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger},x_{0}) (26)

for all integers 0nk0\leq n\leq k. It is trivial for n=0n=0 as ξ0δ=ξ0\xi_{0}^{\delta}=\xi_{0} and x0δ=x0x_{0}^{\delta}=x_{0}. Next we assume that (26) holds for all 0nl0\leq n\leq l for some l<kl<k. According to (2.2) and Lemma 2.1 we have for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) that

Δl+1δΔlδ\displaystyle\Delta_{l+1}^{\delta}-\Delta_{l}^{\delta} c0αlδrlδ2+14σ(βlδ)2mlδ2+βlδγlδ12σαlδβlδglδ,mlδ\displaystyle\leq-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}+\frac{1}{4\sigma}(\beta_{l}^{\delta})^{2}\|m_{l}^{\delta}\|^{2}+\beta_{l}^{\delta}\gamma_{l}^{\delta}-\frac{1}{2\sigma}\alpha_{l}^{\delta}\beta_{l}^{\delta}\langle g_{l}^{\delta},m_{l}^{\delta}\rangle
c0αlδrlδ2+14σ(βlδ)2mlδ2+βlδγ~lδ12σαlδβlδglδ,mlδ.\displaystyle\leq-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}+\frac{1}{4\sigma}(\beta_{l}^{\delta})^{2}\|m_{l}^{\delta}\|^{2}+\beta_{l}^{\delta}\tilde{\gamma}_{l}^{\delta}-\frac{1}{2\sigma}\alpha_{l}^{\delta}\beta_{l}^{\delta}\langle g_{l}^{\delta},m_{l}^{\delta}\rangle.

Note that βlδ\beta_{l}^{\delta} is the minimizer of the function thl(t)t\to h_{l}(t) over [0,β][0,\beta], where

hl(t):=c0αlδrlδ2+14σt2mlδ2+tγ~lδ12σαlδtglδ,mlδ,h_{l}(t):=-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}+\frac{1}{4\sigma}t^{2}\|m_{l}^{\delta}\|^{2}+t\tilde{\gamma}_{l}^{\delta}-\frac{1}{2\sigma}\alpha_{l}^{\delta}t\langle g_{l}^{\delta},m_{l}^{\delta}\rangle,

we can conclude that

Δl+1δΔlδhl(βlδ)hl(0)=c0αlδrlδ2.\displaystyle\Delta_{l+1}^{\delta}-\Delta_{l}^{\delta}\leq h_{l}(\beta_{l}^{\delta})\leq h_{l}(0)=-c_{0}\alpha_{l}^{\delta}\|r_{l}^{\delta}\|^{2}. (27)

By virtue of this inequality with x^=x\hat{x}=x^{\dagger}, the induction hypothesis, and Assumption 2 (b) we have

Dξl+1δ(x,xl+1δ)Dξlδ(x,xlδ)Dξ0(x,x0)σρ2D_{\mathcal{R}}^{\xi_{l+1}^{\delta}}(x^{\dagger},x_{l+1}^{\delta})\leq D_{\mathcal{R}}^{\xi_{l}^{\delta}}(x^{\dagger},x_{l}^{\delta})\leq D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger},x_{0})\leq\sigma\rho^{2}

which together with (10) implies that σxl+1δx2σρ2\sigma\|x_{l+1}^{\delta}-x^{\dagger}\|^{2}\leq\sigma\rho^{2} and hence xl+1δxρ\|x_{l+1}^{\delta}-x^{\dagger}\|\leq\rho. Since xx0ρ\|x^{\dagger}-x_{0}\|\leq\rho, we thus have xl+1δx02ρ\|x_{l+1}^{\delta}-x_{0}\|\leq 2\rho, i.e. xl+1δB2ρ(x0)x_{l+1}^{\delta}\in B_{2\rho}(x_{0}). We therefore complete the proof of (26). As a direct consequence, we can see that (27) holds for all 0l<k0\leq l<k which shows the desired result. ∎

Based on Lemma 2.2 we can now prove that Algorithm 1 is well-defined, i.e. the iteration must terminate within finitely many steps.

Theorem 2.3.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 with noisy data yδy^{\delta} satisfying yδyδ\|y^{\delta}-y\|\leq\delta with noise level δ>0\delta>0. Assume that τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen such that (25) holds. Then the algorithm must terminate within finitely many steps, i.e. there exists a finite integer nδn_{\delta} such that

F(xnδδ)yδτδ<F(xnδ)yδ,0n<nδ.\displaystyle\|F(x_{n_{\delta}}^{\delta})-y^{\delta}\|\leq\tau\delta<\|F(x_{n}^{\delta})-y^{\delta}\|,\quad 0\leq n<n_{\delta}. (28)
Proof.

Let l0l\geq 0 be an integer such that F(xnδ)yδ>τδ\|F(x_{n}^{\delta})-y^{\delta}\|>\tau\delta for all 0nl0\leq n\leq l. It then follows from Lemma 2.2 that

c0n=0lαnδF(xnδ)yδ2\displaystyle c_{0}\sum_{n=0}^{l}\alpha_{n}^{\delta}\|F(x_{n}^{\delta})-y^{\delta}\|^{2} n=0l(ΔnδΔn+1δ)\displaystyle\leq\sum_{n=0}^{l}\left(\Delta_{n}^{\delta}-\Delta_{n+1}^{\delta}\right)
=Δ0δΔl+1δΔ0δ\displaystyle=\Delta_{0}^{\delta}-\Delta_{l+1}^{\delta}\leq\Delta_{0}^{\delta}
=Dξ0(x^,x0).\displaystyle=D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0}).

According to the choice of αnδ\alpha_{n}^{\delta} and L(x)L\|L(x)\|\leq L for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), we can see αnδmin{μ0/L2,μ1}\alpha_{n}^{\delta}\geq\min\{\mu_{0}/L^{2},\mu_{1}\} for all 0nl0\leq n\leq l. Thus

(l+1)c0min{μ0/L2,μ1}τ2δ2c0n=0lαnδF(xnδ)yδ2Dξ0(x^,x0)<.\displaystyle(l+1)c_{0}\min\{\mu_{0}/L^{2},\mu_{1}\}\tau^{2}\delta^{2}\leq c_{0}\sum_{n=0}^{l}\alpha_{n}^{\delta}\|F(x_{n}^{\delta})-y^{\delta}\|^{2}\leq D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0})<\infty.

If there were no finite integer nδn_{\delta} such that (28) holds, then ll could be taken to be any positive integer. Letting ll\to\infty in the above inequality gives a contradiction. Thus, the algorithm must terminate within finitely many steps. ∎
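In particular, applying the last displayed inequality in the proof with l=nδ1l=n_{\delta}-1 yields the explicit a priori bound

n_{\delta}\leq\frac{D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0})}{c_{0}\min\{\mu_{0}/L^{2},\mu_{1}\}\tau^{2}\delta^{2}},

so the stopping index grows at most like O(δ2)O(\delta^{-2}) as δ0\delta\to 0.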

3. Convergence analysis

In this section we provide a convergence analysis of Algorithm 1. Let nδn_{\delta} be the integer output by the discrepancy principle; we will analyze the behavior of xnδδx_{n_{\delta}}^{\delta} as δ0\delta\to 0. We first prove a weak convergence result as stated below.

Theorem 3.1.

Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 using a sequence of noisy data {yδk}\{y^{\delta_{k}}\} satisfying yδkyδk\|y^{\delta_{k}}-y\|\leq\delta_{k} with 0<δk00<\delta_{k}\to 0 as kk\to\infty. Assume that τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen such that (25) holds. Let nk:=nδkn_{k}:=n_{\delta_{k}} be the output integer. Then there is a subsequence {yδkl}\{y^{\delta_{k_{l}}}\} of {yδk}\{y^{\delta_{k}}\} such that

xnklδklx as lx_{n_{k_{l}}}^{\delta_{k_{l}}}\rightharpoonup x^{*}\quad\mbox{ as }l\to\infty

for some solution xx^{*} of (1) in B2ρ(x0)B_{2\rho}(x_{0}).

If in addition (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 and Ker(L(x))Ker(L(x))\emph{Ker}(L(x^{\dagger}))\subset\emph{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then xnkδkxx_{n_{k}}^{\delta_{k}}\rightharpoonup x^{\dagger} as kk\to\infty, where xx^{\dagger} is the unique x0x_{0}-minimum norm solution of (1).

Proof.

By using Lemma 2.2 and (10), we have for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) that

σxnkδkx^2Dξnkδk(x^,xnkδk)Dξ0(x^,x0)\sigma\|x_{n_{k}}^{\delta_{k}}-\hat{x}\|^{2}\leq D_{\mathcal{R}}^{\xi_{n_{k}}^{\delta_{k}}}(\hat{x},x_{n_{k}}^{\delta_{k}})\leq D_{\mathcal{R}}^{\xi_{0}}(\hat{x},x_{0})

which implies that {xnkδk}\{x_{n_{k}}^{\delta_{k}}\} is a bounded sequence. Thus, we can find a subsequence {yδkl}\{y^{\delta_{k_{l}}}\} of {yδk}\{y^{\delta_{k}}\} such that xnklδklxx_{n_{k_{l}}}^{\delta_{k_{l}}}\rightharpoonup x^{*} as ll\to\infty for some xXx^{*}\in X. Because F(xnklδkl)yδklτδkl\|F(x_{n_{k_{l}}}^{\delta_{k_{l}}})-y^{\delta_{k_{l}}}\|\leq\tau\delta_{k_{l}}, we have F(xnklδkl)yF(x_{n_{k_{l}}}^{\delta_{k_{l}}})\to y as ll\to\infty. Thus, by the weak closedness of FF given in Assumption 2 (c), we have xdom(F)x^{*}\in\mbox{dom}(F) and F(x)=yF(x^{*})=y. Furthermore, by the weak lower semi-continuity of norms we have

xx2lim inflxnklδklx21σDξ0(x,x0)ρ2\displaystyle\|x^{*}-x^{\dagger}\|^{2}\leq\liminf_{l\to\infty}\|x_{n_{k_{l}}}^{\delta_{k_{l}}}-x^{\dagger}\|^{2}\leq\frac{1}{\sigma}D_{\mathcal{R}}^{\xi_{0}}(x^{\dagger},x_{0})\leq\rho^{2}

which together with xBρ(x0)x^{\dagger}\in B_{\rho}(x_{0}) implies that xB2ρ(x0)x^{*}\in B_{2\rho}(x_{0}).

When (x)=x2/2{\mathcal{R}}(x)=\|x\|^{2}/2 and Ker(L(x))Ker(L(x))\mbox{Ker}(L(x^{\dagger}))\subset\mbox{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), we show that xnkδkxx_{n_{k}}^{\delta_{k}}\rightharpoonup x^{\dagger} as kk\to\infty. It suffices to show that xx^{\dagger} is the unique weak cluster point of {xnkδk}\{x_{n_{k}}^{\delta_{k}}\}. For the given {\mathcal{R}} we have ξnδk=xnδk\xi_{n}^{\delta_{k}}=x_{n}^{\delta_{k}} for all nn and by the definition of ξnδk\xi_{n}^{\delta_{k}} we can see that

xnδkx0\displaystyle x_{n}^{\delta_{k}}-x_{0} Ran(L(xn1δk))Ran(L(x0δk))\displaystyle\in\mbox{Ran}(L(x_{n-1}^{\delta_{k}})^{*})\oplus\cdots\oplus\mbox{Ran}(L(x_{0}^{\delta_{k}})^{*})
Ker(L(xn1δk))Ker(L(x0δk))\displaystyle\subset\mbox{Ker}(L(x_{n-1}^{\delta_{k}}))^{\perp}\oplus\cdots\oplus\mbox{Ker}(L(x_{0}^{\delta_{k}}))^{\perp}
Ker(L(x))\displaystyle\subset\mbox{Ker}(L(x^{\dagger}))^{\perp}

for all nn which in particular implies that xnkδkx0Ker(L(x))x_{n_{k}}^{\delta_{k}}-x_{0}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}. Consequently, for any weak cluster point x¯\bar{x} of {xnkδk}\{x_{n_{k}}^{\delta_{k}}\} we have x¯x0Ker(L(x))\bar{x}-x_{0}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}. Since xx^{\dagger} is the x0x_{0}-minimum norm solution of (1), we must have xx0Ker(L(x))x^{\dagger}-x_{0}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}. Indeed, for any wKer(L(x))w\in\mbox{Ker}(L(x^{\dagger})) we have x+twB2ρ(x0)x^{\dagger}+tw\in B_{2\rho}(x_{0}) for small |t||t|. Thus, it follows from Assumption 2 (d) that

F(x+tw)F(x)\displaystyle\|F(x^{\dagger}+tw)-F(x^{\dagger})\| =F(x+tw)F(x)L(x)(tw)\displaystyle=\|F(x^{\dagger}+tw)-F(x^{\dagger})-L(x^{\dagger})(tw)\|
ηF(x+tw)F(x)\displaystyle\leq\eta\|F(x^{\dagger}+tw)-F(x^{\dagger})\|

which implies F(x+tw)=F(x)=yF(x^{\dagger}+tw)=F(x^{\dagger})=y for small |t||t| as 0η<10\leq\eta<1. Since xx^{\dagger} is the x0x_{0}-minimum norm solution of (1), we have x+twx02xx02\|x^{\dagger}+tw-x_{0}\|^{2}\geq\|x^{\dagger}-x_{0}\|^{2} for small |t||t| and thus xx0,w=0\langle x^{\dagger}-x_{0},w\rangle=0 for all wKer(L(x))w\in\mbox{Ker}(L(x^{\dagger})). Therefore x¯x=(x¯x0)+(x0x)Ker(L(x))\bar{x}-x^{\dagger}=(\bar{x}-x_{0})+(x_{0}-x^{\dagger})\in\mbox{Ker}(L(x^{\dagger}))^{\perp}.

On the other hand, since both x¯\bar{x} and xx^{\dagger} are solutions of F(x)=yF(x)=y in B2ρ(x0)B_{2\rho}(x_{0}), we may use Assumption 2 (d) to obtain

L(x)(x¯x)(1+η)F(x¯)F(x)=0\displaystyle\|L(x^{\dagger})(\bar{x}-x^{\dagger})\|\leq(1+\eta)\|F(\bar{x})-F(x^{\dagger})\|=0

which means x¯xKer(L(x))\bar{x}-x^{\dagger}\in\mbox{Ker}(L(x^{\dagger})). Therefore

x¯xKer(L(x))Ker(L(x))={0}\bar{x}-x^{\dagger}\in\mbox{Ker}(L(x^{\dagger}))^{\perp}\cap\mbox{Ker}(L(x^{\dagger}))=\{0\}

and thus x¯=x\bar{x}=x^{\dagger}. The proof is complete. ∎

The above theorem gives a weak convergence result on Algorithm 1 for any β(0,]\beta\in(0,\infty]. However, it remains unclear whether strong convergence can be guaranteed for general β\beta. Nevertheless, by restricting β\beta to be less than 1, we can establish a strong convergence result for Algorithm 1 in the remainder of this section. The proof of strong convergence is rather involved; we first need to investigate the counterpart of Algorithm 1 using the exact data yy, which is formulated as follows.

Algorithm 2.

Take an initial guess ξ0\xi_{0}, calculate x0:=argminxX{(x)ξ0,x}x_{0}:=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{0},x\rangle\}, and set ξ1=ξ0\xi_{-1}=\xi_{0}. Pick the numbers 0<β0<\beta\leq\infty, μ0>0\mu_{0}>0 and μ1>0\mu_{1}>0. For n0n\geq 0 do the following:

  1. (i)

    Calculate rn:=F(xn)yr_{n}:=F(x_{n})-y and gn:=L(xn)rng_{n}:=L(x_{n})^{*}r_{n};

  2. (ii)

    Determine αn\alpha_{n} according to

    αn:=μ0L2or αn:={min{μ0rn2gn2,μ1}if rn0,0if rn=0;\displaystyle\alpha_{n}:=\frac{\mu_{0}}{L^{2}}\quad\mbox{or }\quad\alpha_{n}:=\left\{\begin{array}[]{lll}\min\left\{\frac{\mu_{0}\|r_{n}\|^{2}}{\|g_{n}\|^{2}},\mu_{1}\right\}&\emph{if }r_{n}\neq 0,\\ 0&\emph{if }r_{n}=0;\end{array}\right.
  3. (iii)

    Calculate mn:=ξnξn1m_{n}:=\xi_{n}-\xi_{n-1} and determine γ~n\tilde{\gamma}_{n} by

    γ~n={0if n=0,mn,xnxn1(1η)αn1rn12+βn1γ~n1if n1;\displaystyle\tilde{\gamma}_{n}=\left\{\begin{array}[]{lll}0&\emph{if }n=0,\\ \langle m_{n},x_{n}-x_{n-1}\rangle-(1-\eta)\alpha_{n-1}\|r_{n-1}\|^{2}+\beta_{n-1}\tilde{\gamma}_{n-1}&\emph{if }n\geq 1;\end{array}\right.
  4. (iv)

    Calculate βn\beta_{n} by

    βn={min{max{0,αngn,mn2σγ~nmn2},β} if mn0,0 if mn=0;\displaystyle\beta_{n}=\left\{\begin{array}[]{lll}\min\left\{\max\left\{0,\frac{\alpha_{n}\langle g_{n},m_{n}\rangle-2\sigma\tilde{\gamma}_{n}}{\|m_{n}\|^{2}}\right\},\beta\right\}&\emph{ if }m_{n}\neq 0,\\ 0&\emph{ if }m_{n}=0;\end{array}\right.
  5. (v)

    Update ξn+1\xi_{n+1} and xn+1x_{n+1} by

    ξn+1=ξnαngn+βnmn,xn+1=argminxX{(x)ξn+1,x}.\xi_{n+1}=\xi_{n}-\alpha_{n}g_{n}+\beta_{n}m_{n},\quad x_{n+1}=\arg\min_{x\in X}\{{\mathcal{R}}(x)-\langle\xi_{n+1},x\rangle\}.

For the iterative sequence {xn}\{x_{n}\} obtained by Algorithm 2, we first have the following result.

Lemma 3.2.

Let Assumption 1 and Assumption 2 hold and consider Algorithm 2 with 0<μ0<4σ(1η)0<\mu_{0}<4\sigma(1-\eta). Then xnB2ρ(x0)x_{n}\in B_{2\rho}(x_{0}) for all integers n0n\geq 0 and for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) there holds

Δn+1Δnc1αnF(xn)y2,n0,\displaystyle\Delta_{n+1}\leq\Delta_{n}-c_{1}\alpha_{n}\|F(x_{n})-y\|^{2},\quad\forall n\geq 0,

where Δn:=Dξn(x^,xn)\Delta_{n}:=D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n}) and c1:=1ημ0/(4σ)>0c_{1}:=1-\eta-\mu_{0}/(4\sigma)>0. Consequently {Δn}\{\Delta_{n}\} is monotonically decreasing and

n=0αnF(xn)y2<.\sum_{n=0}^{\infty}\alpha_{n}\|F(x_{n})-y\|^{2}<\infty.
Proof.

By an argument similar to the one in the proof of Lemma 2.2 we obtain the result immediately. ∎

Lemma 3.2 has the following consequences which will be used later.

Lemma 3.3.

Let Assumption 1 and Assumption 2 hold and consider Algorithm 2 with 0<μ0<4σ(1η)0<\mu_{0}<4\sigma(1-\eta).

  1. (i)

    If F(xn)y=0F(x_{n})-y=0 for some n0n\geq 0, then xm=xnx_{m}=x_{n} for all m>nm>n.

  2. (ii)

    If ξn+1=ξn\xi_{n+1}=\xi_{n} for some n0n\geq 0, then F(xn)y=0F(x_{n})-y=0, xm=xnx_{m}=x_{n} and ξm=ξn\xi_{m}=\xi_{n} for all m>nm>n.

Proof.

(i) If F(xn)y=0F(x_{n})-y=0 for some nn, then xnx_{n} is a solution of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}). Thus we may use Lemma 3.2 with x^=xn\hat{x}=x_{n} to conclude Dξm(xn,xm)Dξn(xn,xn)=0D_{\mathcal{R}}^{\xi_{m}}(x_{n},x_{m})\leq D_{\mathcal{R}}^{\xi_{n}}(x_{n},x_{n})=0 for all m>nm>n. Consequently, it follows from (10) that xm=xnx_{m}=x_{n} for all m>nm>n.

(ii) If ξn+1=ξn\xi_{n+1}=\xi_{n} for some nn, then xn+1=(ξn+1)=(ξn)=xnx_{n+1}=\nabla{\mathcal{R}}^{*}(\xi_{n+1})=\nabla{\mathcal{R}}^{*}(\xi_{n})=x_{n}. We thus have from Lemma 3.2, for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}), that

Dξn(x^,xn)=Dξn+1(x^,xn+1)Dξn(x^,xn)c1αnF(xn)y2\displaystyle D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n})=D_{\mathcal{R}}^{\xi_{n+1}}(\hat{x},x_{n+1})\leq D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n})-c_{1}\alpha_{n}\|F(x_{n})-y\|^{2}

which implies αnF(xn)y2=0\alpha_{n}\|F(x_{n})-y\|^{2}=0. If F(xn)yF(x_{n})\neq y, then the choice of αn\alpha_{n} implies αn>0\alpha_{n}>0 and thus αnF(xn)y2>0\alpha_{n}\|F(x_{n})-y\|^{2}>0 which is a contradiction. Therefore we must have F(xn)=yF(x_{n})=y. By using (i) we then have xm=xnx_{m}=x_{n} for all mnm\geq n. Using these facts and an induction argument we can conclude that ξm=ξn\xi_{m}=\xi_{n} for all m>nm>n. Indeed, this is trivial for m=n+1m=n+1 by the given condition. Assume next ξm=ξn\xi_{m}=\xi_{n} for all n<mkn<m\leq k for some k>nk>n. Then, by the definition of ξk+1\xi_{k+1}, we have

ξk+1\displaystyle\xi_{k+1} =ξkαkL(xk)(F(xk)y)+βk(ξkξk1)\displaystyle=\xi_{k}-\alpha_{k}L(x_{k})^{*}(F(x_{k})-y)+\beta_{k}(\xi_{k}-\xi_{k-1})
=ξnαkL(xn)(F(xn)y)=ξn.\displaystyle=\xi_{n}-\alpha_{k}L(x_{n})^{*}(F(x_{n})-y)=\xi_{n}.

By the induction principle, we thus complete the proof. ∎

In order to show the strong convergence of the iterative sequence {xn}\{x_{n}\} generated by Algorithm 2, we need the following result; see [17, Proposition 2.3] or [18, Proposition 3.6].

Proposition 3.4.

Let Assumption 1 and Assumption 2 hold and consider the equation (1). Let {xn}B2ρ(x0)dom()\{x_{n}\}\subset B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) and {ξn}X\{\xi_{n}\}\subset X be such that

  1. (i)

    ξn(xn)\xi_{n}\in\partial{\mathcal{R}}(x_{n}) for all nn;

  2. (ii)

    for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) the sequence {Dξn(x^,xn)}\{D_{\mathcal{R}}^{\xi_{n}}(\hat{x},x_{n})\} is convergent;

  3. (iii)

    limnF(xn)y=0\lim_{n\to\infty}\|F(x_{n})-y\|=0.

  4. (iv)

    there is a subsequence {nk}\{n_{k}\} of integers with nkn_{k}\rightarrow\infty such that for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) there holds

    limlsupkl|ξnkξnl,xnkx^|=0.\lim_{l\rightarrow\infty}\sup_{k\geq l}|\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle|=0. (29)

Then there exists a solution xx^{*} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) such that

limnDξn(x,xn)=0.\lim_{n\rightarrow\infty}D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n})=0.

If, in addition, ξn+1ξnKer(L(x))\xi_{n+1}-\xi_{n}\in\emph{Ker}(L(x^{\dagger}))^{\perp} for all nn, then x=xx^{*}=x^{\dagger}.

Based on Lemma 3.2, Lemma 3.3 and Proposition 3.4, we are now ready to show the strong convergence of Algorithm 2.

Theorem 3.5.

Let Assumption 1 and Assumption 2 hold and let {xn}\{x_{n}\} be defined by Algorithm 2 using the exact data yy, where 0<μ0<4σ(1η)0<\mu_{0}<4\sigma(1-\eta) and 0<β<10<\beta<1. Then there exists a solution xx^{*} of (1) in B2ρ(x0)B_{2\rho}(x_{0}) such that

limnDξn(x,xn)=0 and limnxnx=0.\displaystyle\lim_{n\to\infty}D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n})=0\quad\mbox{ and }\quad\lim_{n\to\infty}\|x_{n}-x^{*}\|=0. (30)

If, in addition, Ker(L(x))Ker(L(x))\emph{Ker}(L(x^{\dagger}))\subset\emph{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then x=xx^{*}=x^{\dagger}.

Proof.

We will use Proposition 3.4 to prove the conclusion. Item (i) in Proposition 3.4 holds automatically by the definition of xnx_{n}. According to Lemma 3.2 we immediately obtain item (ii) in Proposition 3.4. To show item (iii), we note from the choice of {αn}\{\alpha_{n}\} that αnc3:=min{μ0/L2,μ1}\alpha_{n}\geq c_{3}:=\min\{\mu_{0}/L^{2},\mu_{1}\} whenever rn0r_{n}\neq 0; since both sides vanish when rn=0r_{n}=0, we have αnrn2c3rn2\alpha_{n}\|r_{n}\|^{2}\geq c_{3}\|r_{n}\|^{2} for all n0n\geq 0. From Lemma 3.2 it then follows that

c3n=0rn2n=0αnrn2<.c_{3}\sum_{n=0}^{\infty}\|r_{n}\|^{2}\leq\sum_{n=0}^{\infty}\alpha_{n}\|r_{n}\|^{2}<\infty.

This in particular implies (iii) in Proposition 3.4, i.e.

limnrn=0.\displaystyle\lim_{n\to\infty}\|r_{n}\|=0. (31)

We now verify (iv) in Proposition 3.4. By using Lemma 3.3 (i) we can conclude that

rn=0 for some nrm=0 for all mn.\displaystyle\|r_{n}\|=0\mbox{ for some }n\Longrightarrow\|r_{m}\|=0\mbox{ for all }m\geq n. (32)

In view of (31) and (32), we can pick a sequence {nk}\{n_{k}\} of integers by setting n0=0n_{0}=0 and letting nkn_{k}, for each k1k\geq 1, be the first integer satisfying

nknk1+1 and rnkrnk1.n_{k}\geq n_{k-1}+1\quad\text{ and }\quad\|r_{n_{k}}\|\leq\|r_{n_{k-1}}\|.

It is easy to see that

rnkrn,0nnk.\displaystyle\|r_{n_{k}}\|\leq\|r_{n}\|,\qquad 0\leq n\leq n_{k}. (33)
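This construction of {nk} is a running-minimum selection over the residual norms; a small sketch with made-up residual values (the list below is purely hypothetical) illustrates the rule and the resulting property (33):

```python
def select_indices(res_norms):
    """n_0 = 0; for k >= 1, n_k is the first index n > n_{k-1}
    with ||r_n|| <= ||r_{n_{k-1}}||."""
    idx = [0]
    for n in range(1, len(res_norms)):
        if res_norms[n] <= res_norms[idx[-1]]:
            idx.append(n)
    return idx

res = [5.0, 6.0, 4.0, 4.5, 4.0, 1.0, 2.0]   # hypothetical ||r_n|| values
indices = select_indices(res)                # -> [0, 2, 4, 5]
```

By construction, each selected residual norm is less than or equal to every earlier one, which is exactly (33).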

For the above chosen {nk}\{n_{k}\}, we now show (29) for any solution x^\hat{x} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}). For any integers l<kl<k we write

ξnkξnl,xnkx^=n=nlnk1ξn+1ξn,xnkx^.\displaystyle\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle=\sum_{n=n_{l}}^{n_{k}-1}\langle\xi_{n+1}-\xi_{n},x_{n_{k}}-\hat{x}\rangle. (34)

From the definition of {ξn}\{\xi_{n}\} and an induction argument it follows readily that

ξn+1ξn=i=0n(j=i+1nβj)αiL(xi)ri.\displaystyle\xi_{n+1}-\xi_{n}=-\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}L(x_{i})^{*}r_{i}.

Thus

ξn+1ξn,xnkx^\displaystyle\langle\xi_{n+1}-\xi_{n},x_{n_{k}}-\hat{x}\rangle =i=0n(j=i+1nβj)αiL(xi)ri,xnkx^\displaystyle=-\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\langle L(x_{i})^{*}r_{i},x_{n_{k}}-\hat{x}\rangle
=i=0n(j=i+1nβj)αiri,L(xi)(xnkx^).\displaystyle=-\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\langle r_{i},L(x_{i})(x_{n_{k}}-\hat{x})\rangle.

Using Assumption 2 (d) it is easy to obtain

L(xi)(xnkx^)\displaystyle\|L(x_{i})(x_{n_{k}}-\hat{x})\| L(xi)(xix^)+L(xi)(xnkxi)\displaystyle\leq\|L(x_{i})(x_{i}-\hat{x})\|+\|L(x_{i})(x_{n_{k}}-x_{i})\|
(1+η)(ri+F(xi)F(xnk))\displaystyle\leq(1+\eta)\left(\|r_{i}\|+\|F(x_{i})-F(x_{n_{k}})\|\right)
(1+η)(2ri+rnk).\displaystyle\leq(1+\eta)\left(2\|r_{i}\|+\|r_{n_{k}}\|\right).

Therefore, by using the Cauchy-Schwarz inequality, (33), and 0βnβ0\leq\beta_{n}\leq\beta for any n0n\geq 0, we can obtain

|ξn+1ξn,xnkx^|\displaystyle|\langle\xi_{n+1}-\xi_{n},x_{n_{k}}-\hat{x}\rangle| i=0n(j=i+1nβj)αiriL(xi)(xnkx^)\displaystyle\leq\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\|r_{i}\|\|L(x_{i})(x_{n_{k}}-\hat{x})\|
(1+η)i=0n(j=i+1nβj)αiri(2ri+rnk)\displaystyle\leq(1+\eta)\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\|r_{i}\|\left(2\|r_{i}\|+\|r_{n_{k}}\|\right)
3(1+η)i=0n(j=i+1nβj)αiri2\displaystyle\leq 3(1+\eta)\sum_{i=0}^{n}\left(\prod_{j=i+1}^{n}\beta_{j}\right)\alpha_{i}\|r_{i}\|^{2}
3(1+η)i=0nβniαiri2.\displaystyle\leq 3(1+\eta)\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}.

This together with (34) then gives

|ξnkξnl,xnkx^|3(1+η)n=nlnk1i=0nβniαiri2.\displaystyle|\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle|\leq 3(1+\eta)\sum_{n=n_{l}}^{n_{k}-1}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}. (35)

To proceed further, let

SN:=n=0Ni=0nβniαiri2,N0.S_{N}:=\sum_{n=0}^{N}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2},\quad\forall N\geq 0.

Then

SN+1SN\displaystyle S_{N+1}-S_{N} =n=0N+1i=0nβniαiri2n=0Ni=0nβniαiri2\displaystyle=\sum_{n=0}^{N+1}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}-\sum_{n=0}^{N}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}
=i=0N+1βN+1iαiri20.\displaystyle=\sum_{i=0}^{N+1}\beta^{N+1-i}\alpha_{i}\|r_{i}\|^{2}\geq 0.

Moreover, by rearranging the terms in SNS_{N}, utilizing the property 0<β<10<\beta<1, and invoking Lemma 3.2, we can obtain

SN=i=0N(n=iNβni)αiri211βi=0Nαiri211βi=0αiri2<.S_{N}=\sum_{i=0}^{N}\left(\sum_{n=i}^{N}\beta^{n-i}\right)\alpha_{i}\|r_{i}\|^{2}\leq\frac{1}{1-\beta}\sum_{i=0}^{N}\alpha_{i}\|r_{i}\|^{2}\leq\frac{1}{1-\beta}\sum_{i=0}^{\infty}\alpha_{i}\|r_{i}\|^{2}<\infty.
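As a sanity check, the double sum SN and the geometric-series bound can be compared numerically on a made-up sequence, with the values ai standing in for αi∥ri∥²:

```python
def s_partial(a, beta, N):
    """S_N = sum_{n=0}^{N} sum_{i=0}^{n} beta^{n-i} * a_i."""
    return sum(beta ** (n - i) * a[i]
               for n in range(N + 1) for i in range(n + 1))

a = [1.0, 0.5, 0.25, 0.125, 0.0625]   # stand-ins for alpha_i * ||r_i||^2
beta = 0.5                            # any 0 < beta < 1 works here
S = s_partial(a, beta, len(a) - 1)
bound = sum(a) / (1 - beta)           # the bound (1 - beta)^{-1} sum_i a_i
```

The partial sums are nondecreasing in N and stay below the bound, as the rearrangement argument predicts.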

Therefore, {SN}\{S_{N}\} is a monotonically increasing sequence with a finite upper bound, and thus it converges by the monotone convergence theorem. In particular, {SN}\{S_{N}\} is a Cauchy sequence and hence

limlsupkl|Snk1Snl1|=0.\lim_{l\to\infty}\sup_{k\geq l}|S_{n_{k}-1}-S_{n_{l}-1}|=0.

Note that

|Snk1Snl1|=n=nlnk1i=0nβniαiri2.|S_{n_{k}-1}-S_{n_{l}-1}|=\sum_{n=n_{l}}^{n_{k}-1}\sum_{i=0}^{n}\beta^{n-i}\alpha_{i}\|r_{i}\|^{2}.

We thus have from (35) that

supkl|ξnkξnl,xnkx^|3(1+η)supkl|Snk1Snl1|0\displaystyle\sup_{k\geq l}|\langle\xi_{n_{k}}-\xi_{n_{l}},x_{n_{k}}-\hat{x}\rangle|\leq 3(1+\eta)\sup_{k\geq l}|S_{n_{k}-1}-S_{n_{l}-1}|\to 0

as ll\to\infty. This shows (iv) in Proposition 3.4. Thus, we may use Proposition 3.4 to conclude the existence of a solution xx^{*} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) such that (30) holds.

If Ker(L(x))Ker(L(x))\mbox{Ker}(L(x^{\dagger}))\subset\mbox{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then we may use the definition of ξn+1\xi_{n+1} to obtain

ξn+1ξn\displaystyle\xi_{n+1}-\xi_{n} Ran(L(xn))Ran(L(x0))\displaystyle\in\mbox{Ran}(L(x_{n})^{*})\oplus\cdots\oplus\mbox{Ran}(L(x_{0})^{*})
Ker(L(xn))Ker(L(x0))\displaystyle\subset\mbox{Ker}(L(x_{n}))^{\perp}\oplus\cdots\oplus\mbox{Ker}(L(x_{0}))^{\perp}
Ker(L(x)).\displaystyle\subset\mbox{Ker}(L(x^{\dagger}))^{\perp}.

Therefore, we may use the last part of Proposition 3.4 to conclude x=xx^{*}=x^{\dagger}. ∎

In order to establish the regularization property of Algorithm 1, we need a stability result connecting Algorithm 1 with Algorithm 2. The following result is sufficient for our purpose.

Lemma 3.6.

Let Assumption 1 and Assumption 2 hold, let τ>1\tau>1 and μ0>0\mu_{0}>0 be chosen such that (25) holds, and let β<\beta<\infty. Let {yδl}\{y^{\delta_{l}}\} be a sequence of noisy data satisfying yδlyδl\|y^{\delta_{l}}-y\|\leq\delta_{l} with 0<δl00<\delta_{l}\to 0 as ll\to\infty, and let ξnδl,xnδl\xi_{n}^{\delta_{l}},x_{n}^{\delta_{l}}, 0nnδl0\leq n\leq n_{\delta_{l}}, be defined by Algorithm 1 using noisy data yδly^{\delta_{l}}, where nδln_{\delta_{l}} denotes the output integer. Let {ξn,xn}\{\xi_{n},x_{n}\} be defined by Algorithm 2, the counterpart of Algorithm 1 using the exact data yy. Then, for any finite integer n^lim inflnδl\hat{n}\leq\liminf_{l\to\infty}n_{\delta_{l}} there hold

ξnδlξn and xnδlxn as l\xi_{n}^{\delta_{l}}\to\xi_{n}\quad\mbox{ and }\quad x_{n}^{\delta_{l}}\to x_{n}\quad\text{ as }l\to\infty

for all 0nn^0\leq n\leq\hat{n}.

Proof.

Since n^lim inflnδl\hat{n}\leq\liminf_{l\to\infty}n_{\delta_{l}}, we can conclude that nδln^n_{\delta_{l}}\geq\hat{n} for large ll and hence ξnδl,xnδl\xi_{n}^{\delta_{l}},x_{n}^{\delta_{l}} are well-defined for all 0nn^0\leq n\leq\hat{n}. Let

n:=inf{n:n and ξn=ξn1},\displaystyle n_{*}:=\inf\{n:n\in{\mathbb{N}}\mbox{ and }\xi_{n}=\xi_{n-1}\},

where :={1,2,}{\mathbb{N}}:=\{1,2,\cdots\} denotes the set of natural numbers. Then 1n1\leq n_{*}\leq\infty. By the definition of nn_{*} we have mn0m_{n}\neq 0 for 1n<n1\leq n<n_{*} and by using Lemma 3.3 we also have mn=0m_{n}=0 for nn<n_{*}\leq n<\infty.

We first use an induction argument to show that

ξnδlξn,xnδlxn,αnδlrnδlαnrn,βnδlβn,γ~nδlγ~n\displaystyle\xi_{n}^{\delta_{l}}\to\xi_{n},\quad x_{n}^{\delta_{l}}\to x_{n},\quad\alpha_{n}^{\delta_{l}}r_{n}^{\delta_{l}}\to\alpha_{n}r_{n},\quad\beta_{n}^{\delta_{l}}\to\beta_{n},\quad\tilde{\gamma}_{n}^{\delta_{l}}\to\tilde{\gamma}_{n} (36)

as ll\to\infty for every 0n<min{n,n^}0\leq n<\min\{n_{*},\hat{n}\}. Note that ξ0δl=ξ0\xi_{0}^{\delta_{l}}=\xi_{0}, x0δl=x0x_{0}^{\delta_{l}}=x_{0} and γ~0δl=γ~0=0\tilde{\gamma}_{0}^{\delta_{l}}=\tilde{\gamma}_{0}=0; since m0δl=m0=0m_{0}^{\delta_{l}}=m_{0}=0, we have β0δl=β0=0\beta_{0}^{\delta_{l}}=\beta_{0}=0. Furthermore, since r0δlr0r_{0}^{\delta_{l}}\to r_{0} as ll\to\infty, we also have α0δlr0δlα0r0\alpha_{0}^{\delta_{l}}r_{0}^{\delta_{l}}\to\alpha_{0}r_{0} no matter whether r0=0r_{0}=0 or not. Consequently (36) holds for n=0n=0. Now we assume that (36) holds for 0n<k0\leq n<k for some integer 1k<min{n,n^}1\leq k<\min\{n_{*},\hat{n}\}. By the induction hypothesis, the definition of ξkδl\xi_{k}^{\delta_{l}}, xkδlx_{k}^{\delta_{l}}, and the continuity of FF, LL and \nabla{\mathcal{R}}^{*}, we can easily derive that

ξkδl\displaystyle\xi_{k}^{\delta_{l}} =ξk1δlαk1δlL(xk1δl)rk1δl+βk1δlmk1δl\displaystyle=\xi_{k-1}^{\delta_{l}}-\alpha_{k-1}^{\delta_{l}}L(x_{k-1}^{\delta_{l}})^{*}r_{k-1}^{\delta_{l}}+\beta_{k-1}^{\delta_{l}}m_{k-1}^{\delta_{l}}
ξk1αk1L(xk1)rk1+βk1mk1\displaystyle\to\xi_{k-1}-\alpha_{k-1}L(x_{k-1})^{*}r_{k-1}+\beta_{k-1}m_{k-1}
=ξk,\displaystyle=\xi_{k},
xkδl\displaystyle x_{k}^{\delta_{l}} =(ξkδl)(ξk)=xk\displaystyle=\nabla{\mathcal{R}}^{*}(\xi_{k}^{\delta_{l}})\to\nabla{\mathcal{R}}^{*}(\xi_{k})=x_{k}

and

γ~kδl\displaystyle\tilde{\gamma}_{k}^{\delta_{l}} =mkδl,xkδlxk1δl(1η)αk1δlrk1δl2+(1+η)δlαk1δlrk1δl+βk1δlγ~k1δl\displaystyle=\langle m_{k}^{\delta_{l}},x_{k}^{\delta_{l}}-x_{k-1}^{\delta_{l}}\rangle-(1-\eta)\alpha_{k-1}^{\delta_{l}}\|r_{k-1}^{\delta_{l}}\|^{2}+(1+\eta)\delta_{l}\alpha_{k-1}^{\delta_{l}}\|r_{k-1}^{\delta_{l}}\|+\beta_{k-1}^{\delta_{l}}\tilde{\gamma}_{k-1}^{\delta_{l}}
mk,xkxk1(1η)αk1rk12+βk1γ~k1=γ~k\displaystyle\to\langle m_{k},x_{k}-x_{k-1}\rangle-(1-\eta)\alpha_{k-1}\|r_{k-1}\|^{2}+\beta_{k-1}\tilde{\gamma}_{k-1}=\tilde{\gamma}_{k}

as ll\to\infty. We now show αkδlrkδlαkrk\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}\to\alpha_{k}r_{k} as ll\to\infty. When rk=0r_{k}=0, by using 0αkδlμ10\leq\alpha_{k}^{\delta_{l}}\leq\mu_{1} we have

αkδlrkδlαkrk=αkδlrkδlμ1rkδlμ1rk=0 as l.\|\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}-\alpha_{k}r_{k}\|=\|\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}\|\leq\mu_{1}\|r_{k}^{\delta_{l}}\|\to\mu_{1}\|r_{k}\|=0\quad\mbox{ as }l\to\infty.

When rk0r_{k}\neq 0, for gk:=L(xk)rkg_{k}:=L(x_{k})^{*}r_{k} we may use Assumption 2 (d) to obtain

gk,xkx\displaystyle\langle g_{k},x_{k}-x^{\dagger}\rangle =rk,L(xk)(xkx)\displaystyle=\langle r_{k},L(x_{k})(x_{k}-x^{\dagger})\rangle
=rk2+rk,yF(xk)L(xk)(xxk)\displaystyle=\|r_{k}\|^{2}+\langle r_{k},y-F(x_{k})-L(x_{k})(x^{\dagger}-x_{k})\rangle
rk2rkyF(xk)L(xk)(xxk)\displaystyle\geq\|r_{k}\|^{2}-\|r_{k}\|\|y-F(x_{k})-L(x_{k})(x^{\dagger}-x_{k})\|
(1η)rk2>0\displaystyle\geq(1-\eta)\|r_{k}\|^{2}>0

which implies gk0g_{k}\neq 0. Thus, by using xkδlxkx_{k}^{\delta_{l}}\to x_{k} and the continuity of FF and LL, we have rkδl>τδl\|r_{k}^{\delta_{l}}\|>\tau\delta_{l} and gkδl0g_{k}^{\delta_{l}}\neq 0 for sufficiently large ll. Consequently, by the definition of αkδl\alpha_{k}^{\delta_{l}} and αk\alpha_{k} we can conclude that αkδlαk\alpha_{k}^{\delta_{l}}\to\alpha_{k} as ll\to\infty. Therefore

αkδlrkδlαkrkαkδlrkδlrk+|αkδlαk|rk0 as l.\|\alpha_{k}^{\delta_{l}}r_{k}^{\delta_{l}}-\alpha_{k}r_{k}\|\leq\alpha_{k}^{\delta_{l}}\|r_{k}^{\delta_{l}}-r_{k}\|+|\alpha_{k}^{\delta_{l}}-\alpha_{k}|\|r_{k}\|\to 0\quad\mbox{ as }l\to\infty.

Since mk0m_{k}\neq 0, we have mkδl0m_{k}^{\delta_{l}}\neq 0 for large ll. Therefore, by the definitions of βkδl\beta_{k}^{\delta_{l}} and βk\beta_{k}, the facts established above, and the induction hypothesis, we obtain

βkδl\displaystyle\beta_{k}^{\delta_{l}} =min{max{0,αkδlgkδl,mkδl2σγ~kδlmkδl2},β}\displaystyle=\min\left\{\max\left\{0,\frac{\alpha_{k}^{\delta_{l}}\langle g_{k}^{\delta_{l}},m_{k}^{\delta_{l}}\rangle-2\sigma\tilde{\gamma}_{k}^{\delta_{l}}}{\|m_{k}^{\delta_{l}}\|^{2}}\right\},\beta\right\}
min{max{0,αkgk,mk2σγ~kmk2},β}=βk\displaystyle\to\min\left\{\max\left\{0,\frac{\alpha_{k}\langle g_{k},m_{k}\rangle-2\sigma\tilde{\gamma}_{k}}{\|m_{k}\|^{2}}\right\},\beta\right\}=\beta_{k}

as ll\to\infty, where we used αkδlgkδlαkgk\alpha_{k}^{\delta_{l}}g_{k}^{\delta_{l}}\to\alpha_{k}g_{k} as ll\to\infty which follows easily from the established facts. By the induction principle, we thus obtain (36) for 0n<min{n,n^}0\leq n<\min\{n_{*},\hat{n}\}.

When n^n\hat{n}\geq n_{*}, we next use an induction argument to show that

ξnδlξn and xnδlxn as l\displaystyle\xi_{n}^{\delta_{l}}\to\xi_{n}\quad\mbox{ and }\quad x_{n}^{\delta_{l}}\to x_{n}\quad\mbox{ as }l\to\infty (37)

for nnn^n_{*}\leq n\leq\hat{n}. Since mn=0m_{n_{*}}=0, we may use Lemma 3.3 to conclude that rk=0r_{k}=0 and mk+1=0m_{k+1}=0 for all kn1k\geq n_{*}-1. We first show (37) for n=nn=n_{*}. Note that

ξn=ξn1+βn1mn1.\xi_{n_{*}}=\xi_{n_{*}-1}+\beta_{n_{*}-1}m_{n_{*}-1}.

Therefore, by the definition of ξnδl\xi_{n_{*}}^{\delta_{l}}, (36), 0αn1δlμ10\leq\alpha_{n_{*}-1}^{\delta_{l}}\leq\mu_{1} we have

ξnδlξn\displaystyle\|\xi_{n_{*}}^{\delta_{l}}-\xi_{n_{*}}\| ξn1δlξn1+μ1L(xn1δl)rn1δl\displaystyle\leq\|\xi_{n_{*}-1}^{\delta_{l}}-\xi_{n_{*}-1}\|+\mu_{1}\|L(x_{n_{*}-1}^{\delta_{l}})^{*}r_{n_{*}-1}^{\delta_{l}}\|
+βn1δlmn1δlβn1mn1\displaystyle\quad\,+\|\beta_{n_{*}-1}^{\delta_{l}}m_{n_{*}-1}^{\delta_{l}}-\beta_{n_{*}-1}m_{n_{*}-1}\|
μ1L(xn1)rn1=0\displaystyle\to\mu_{1}\|L(x_{n_{*}-1})^{*}r_{n_{*}-1}\|=0

and consequently, by the continuity of \nabla{\mathcal{R}}^{*}, we have xnδlxnx_{n_{*}}^{\delta_{l}}\to x_{n_{*}} as ll\to\infty. Now assume that (37) holds for nnkn_{*}\leq n\leq k for some nk<n^n_{*}\leq k<\hat{n}. By using rk=0r_{k}=0 and mk=0m_{k}=0 we have ξk+1=ξk\xi_{k+1}=\xi_{k}. Thus, by using 0αkδlμ10\leq\alpha_{k}^{\delta_{l}}\leq\mu_{1} and 0βkδlβ0\leq\beta_{k}^{\delta_{l}}\leq\beta, we have

ξk+1δlξk+1\displaystyle\|\xi_{k+1}^{\delta_{l}}-\xi_{k+1}\| =ξkδlξkαkδlL(xkδl)rkδl+βkδlmkδl\displaystyle=\|\xi_{k}^{\delta_{l}}-\xi_{k}-\alpha_{k}^{\delta_{l}}L(x_{k}^{\delta_{l}})^{*}r_{k}^{\delta_{l}}+\beta_{k}^{\delta_{l}}m_{k}^{\delta_{l}}\|
ξkδlξk+μ1L(xkδl)rkδl+βmkδl\displaystyle\leq\|\xi_{k}^{\delta_{l}}-\xi_{k}\|+\mu_{1}\|L(x_{k}^{\delta_{l}})^{*}r_{k}^{\delta_{l}}\|+\beta\|m_{k}^{\delta_{l}}\|
μ1L(xk)rk+βmk=0\displaystyle\to\mu_{1}\|L(x_{k})^{*}r_{k}\|+\beta\|m_{k}\|=0

as ll\to\infty. This shows that (37) is also true for n=k+1n=k+1.

Putting the facts established in (36) and (37) together, the proof is therefore complete. ∎

Finally we are ready to prove the main strong convergence result for Algorithm 1 when solving (1) with noisy data.

Theorem 3.7.

Let Assumption 1 and Assumption 2 hold and consider Algorithm 1, where 0β<10\leq\beta<1, and τ>1\tau>1 and μ0>0\mu_{0}>0 are chosen such that (25) holds. Let nδn_{\delta} be the output integer. Then there exists a solution xx^{*} of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\emph{dom}({\mathcal{R}}) such that

limδ0Dξnδδ(x,xnδδ)=0 and limδ0xnδδx=0.\lim_{\delta\to 0}D_{\mathcal{R}}^{\xi_{n_{\delta}}^{\delta}}(x^{*},x_{n_{\delta}}^{\delta})=0\quad\mbox{ and }\quad\lim_{\delta\to 0}\|x_{n_{\delta}}^{\delta}-x^{*}\|=0.

If, in addition, Ker(L(x))Ker(L(x))\emph{Ker}(L(x^{\dagger}))\subset\emph{Ker}(L(x)) for all xB2ρ(x0)x\in B_{2\rho}(x_{0}), then x=xx^{*}=x^{\dagger}.

Proof.

Let xx^{*} be the solution of (1) in B2ρ(x0)dom()B_{2\rho}(x_{0})\cap\mbox{dom}({\mathcal{R}}) determined in Theorem 3.5 such that Dξn(x,xn)0D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n})\to 0 as nn\to\infty, where {xn}\{x_{n}\} denotes the sequence defined by the counterpart of Algorithm 1 using the exact data. Let Δnδ:=Dξnδ(x,xnδ)\Delta_{n}^{\delta}:=D_{\mathcal{R}}^{\xi_{n}^{\delta}}(x^{*},x_{n}^{\delta}) for n0n\geq 0. We will show that Δnδδ0\Delta_{n_{\delta}}^{\delta}\to 0 as δ0\delta\to 0 by considering two cases.

Assume first that there is a sequence {yδl}\{y^{\delta_{l}}\} of noisy data satisfying yδlyδl\|y^{\delta_{l}}-y\|\leq\delta_{l} with δl0\delta_{l}\to 0 such that nl:=nδln^n_{l}:=n_{\delta_{l}}\to\hat{n} as ll\to\infty for some finite integer n^\hat{n}. Then nl=n^n_{l}=\hat{n} for all large ll. According to the definition of nl:=nδln_{l}:=n_{\delta_{l}} we have

F(xn^δl)yδlτδl.\|F(x_{\hat{n}}^{\delta_{l}})-y^{\delta_{l}}\|\leq\tau\delta_{l}.

By taking ll\to\infty and using Lemma 3.6, we can obtain F(xn^)=yF(x_{\hat{n}})=y. Thus, we may use Lemma 3.3 (i) to obtain xn=xn^x_{n}=x_{\hat{n}} for all nn^n\geq\hat{n}. Since xnxx_{n}\to x^{*} as nn\to\infty, we must have xn^=xx_{\hat{n}}=x^{*} and thus, by Lemma 3.6 and the lower semi-continuity of {\mathcal{R}}, we can obtain

lim suplΔnlδl\displaystyle\limsup_{l\to\infty}\Delta_{n_{l}}^{\delta_{l}} =lim suplΔn^δl=lim supl((xn^)(xn^δl)ξn^δl,xn^xn^δl)\displaystyle=\limsup_{l\to\infty}\Delta_{\hat{n}}^{\delta_{l}}=\limsup_{l\to\infty}\left({\mathcal{R}}(x_{\hat{n}})-{\mathcal{R}}(x_{\hat{n}}^{\delta_{l}})-\langle\xi_{\hat{n}}^{\delta_{l}},x_{\hat{n}}-x_{\hat{n}}^{\delta_{l}}\rangle\right)
=(xn^)lim infl(xn^δl)limlξn^δl,xn^xn^δl\displaystyle={\mathcal{R}}(x_{\hat{n}})-\liminf_{l\to\infty}{\mathcal{R}}(x_{\hat{n}}^{\delta_{l}})-\lim_{l\to\infty}\langle\xi_{\hat{n}}^{\delta_{l}},x_{\hat{n}}-x_{\hat{n}}^{\delta_{l}}\rangle
(xn^)(xn^)=0\displaystyle\leq{\mathcal{R}}(x_{\hat{n}})-{\mathcal{R}}(x_{\hat{n}})=0

which shows that Δnlδl0\Delta_{n_{l}}^{\delta_{l}}\to 0 as ll\to\infty.

Assume next that there is a sequence {yδl}\{y^{\delta_{l}}\} of noisy data satisfying yδlyδl\|y^{\delta_{l}}-y\|\leq\delta_{l} with δl0\delta_{l}\to 0 such that nl:=nδln_{l}:=n_{\delta_{l}}\to\infty as ll\to\infty. Let nn be any fixed integer. Then nl>nn_{l}>n for large ll. It then follows from Lemma 2.2 that ΔnlδlΔnδl.\Delta_{n_{l}}^{\delta_{l}}\leq\Delta_{n}^{\delta_{l}}. By using Lemma 3.6 and the lower semi-continuity of {\mathcal{R}}, we thus obtain

lim suplΔnlδl\displaystyle\limsup_{l\to\infty}\Delta_{n_{l}}^{\delta_{l}} lim suplΔnδl=lim supl((x)(xnδl)ξnδl,xxnδl)\displaystyle\leq\limsup_{l\to\infty}\Delta_{n}^{\delta_{l}}=\limsup_{l\to\infty}\left({\mathcal{R}}(x^{*})-{\mathcal{R}}(x_{n}^{\delta_{l}})-\langle\xi_{n}^{\delta_{l}},x^{*}-x_{n}^{\delta_{l}}\rangle\right)
(x)(xn)ξn,xxn=Dξn(x,xn).\displaystyle\leq{\mathcal{R}}(x^{*})-{\mathcal{R}}(x_{n})-\langle\xi_{n},x^{*}-x_{n}\rangle=D_{\mathcal{R}}^{\xi_{n}}(x^{*},x_{n}).

Letting nn\to\infty in the above inequality gives lim suplΔnlδl0\limsup_{l\to\infty}\Delta_{n_{l}}^{\delta_{l}}\leq 0 which again shows Δnlδl0\Delta_{n_{l}}^{\delta_{l}}\to 0 as ll\to\infty. ∎

4. Numerical results

In this section we provide numerical simulations to test the performance of our AHB method, i.e. Algorithm 1. The computational results demonstrate that our AHB method indeed has superior performance over the Landweber iteration in terms of both the number of iterations and the CPU running time. Our simulations are performed in MATLAB R2023b on a Dell laptop with an 11th Gen Intel(R) Core(TM) i7-1185G7 CPU @ 3.00GHz and 32GB memory.

Example 4.1.

We first test the performance of Algorithm 1 by considering the following linear integral equation of the first kind

(Fx)(s):=01κ(s,t)x(t)𝑑t=y(s),s[0,1],\displaystyle(Fx)(s):=\int_{0}^{1}\kappa(s,t)x(t)dt=y(s),\quad s\in[0,1], (38)

where the kernel κ(s,t)\kappa(s,t) is a continuous function defined on [0,1]×[0,1][0,1]\times[0,1]. It is easy to see that FF is a compact linear map from L2[0,1]L^{2}[0,1] to itself. Our goal is to determine the minimal-norm solution of (38) by using some noisy data yδL2[0,1]y^{\delta}\in L^{2}[0,1] satisfying yδyL2[0,1]δ\|y^{\delta}-y\|_{L^{2}[0,1]}\leq\delta with a given noise level δ>0\delta>0. When applying Algorithm 1 to find the minimal-norm solution, we use x0=0x_{0}=0, (x)=12xL2[0,1]2{\mathcal{R}}(x)=\frac{1}{2}\|x\|_{L^{2}[0,1]}^{2}, τ=1.01\tau=1.01, β=\beta=\infty and

αnμ0F2 with μ0=0.99(22/τ),\alpha_{n}\equiv\frac{\mu_{0}}{\|F\|^{2}}\quad\mbox{ with }\mu_{0}=0.99(2-2/\tau),

where μ0\mu_{0} is chosen to fulfill the theoretical requirement. For comparison, we also consider the following algorithms:

  1. \bullet

    Landweber iteration

    xn+1δ=xnδαF(Fxnδyδ)x_{n+1}^{\delta}=x_{n}^{\delta}-\alpha F^{*}(Fx_{n}^{\delta}-y^{\delta})

    with the initial guess x0δ=0x_{0}^{\delta}=0 terminated by the discrepancy principle

    Fxnδδyδτδ<Fxnδyδ,0n<nδ.\displaystyle\|Fx_{n_{\delta}}^{\delta}-y^{\delta}\|\leq\tau\delta<\|Fx_{n}^{\delta}-y^{\delta}\|,\quad 0\leq n<n_{\delta}. (39)

This method is convergent for 0<α2/F20<\alpha\leq 2/\|F\|^{2} and τ>1\tau>1; see [6]. In our computation we use α=1/F2\alpha=1/\|F\|^{2} and τ=1.01\tau=1.01.

  2. \bullet

    The ν\nu-method of Brakhage: This method was introduced in [4] to accelerate the Landweber iteration. It preassigns a number ν>1/2\nu>1/2 and defines the iterates by

    xn+1δ=xnδαnγF(Fxnδyδ)+βn(xnδxn1δ)x_{n+1}^{\delta}=x_{n}^{\delta}-\alpha_{n}\gamma F^{*}(Fx_{n}^{\delta}-y^{\delta})+\beta_{n}(x_{n}^{\delta}-x_{n-1}^{\delta})

    with x1δ=x0δ=0x_{-1}^{\delta}=x_{0}^{\delta}=0 and

    αn=4(2n+2ν+1)(n+ν)(n+2ν)(2n+4ν+1),βn=n(2n1)(2n+2ν+1)(n+2ν)(2n+4ν+1)(2n+2ν1).\displaystyle\quad\alpha_{n}=4\frac{(2n+2\nu+1)(n+\nu)}{(n+2\nu)(2n+4\nu+1)},\,\,\,\beta_{n}=\frac{n(2n-1)(2n+2\nu+1)}{(n+2\nu)(2n+4\nu+1)(2n+2\nu-1)}.

    It is known that it is a regularization method for 0<γ<1/F20<\gamma<1/\|F\|^{2} when terminated by the discrepancy principle (39) with τ>1\tau>1; see [4, 6, 9]. We use ν=3\nu=3, γ=0.99/F2\gamma=0.99/\|F\|^{2} and τ=1.01\tau=1.01.

  3. \bullet

    Nesterov acceleration: This method was proposed in [22] to accelerate the gradient method for convex optimization problems. This strategy was then suggested in [15] to accelerate the Landweber iteration for ill-posed inverse problems. The corresponding method for solving (38) takes the form

    znδ=xnδ+n1n+α(xnδxn1δ),xn+1δ=znδγF(Fznδyδ)\displaystyle z_{n}^{\delta}=x_{n}^{\delta}+\frac{n-1}{n+\alpha}(x_{n}^{\delta}-x_{n-1}^{\delta}),\quad x_{n+1}^{\delta}=z_{n}^{\delta}-\gamma F^{*}(Fz_{n}^{\delta}-y^{\delta})

    with x1δ=x0δ=0x_{-1}^{\delta}=x_{0}^{\delta}=0 and α2\alpha\geq 2. When the method is terminated by the discrepancy principle (39) with τ>1\tau>1, the corresponding regularization property has been proved in [20, 24] for 0<γ<1/F20<\gamma<1/\|F\|^{2}. We use α=3\alpha=3, γ=0.99/F2\gamma=0.99/\|F\|^{2} and τ=1.01\tau=1.01.
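As a quick sanity check on the ν\nu-method coefficients, the displayed formulas for αn\alpha_{n} and βn\beta_{n} can be evaluated directly; the limiting values noted in the comments follow from the formulas themselves and are stated here only for illustration.

```python
# Evaluate the nu-method coefficients alpha_n and beta_n displayed above.
def nu_coefficients(n, nu):
    alpha = (4 * (2 * n + 2 * nu + 1) * (n + nu)
             / ((n + 2 * nu) * (2 * n + 4 * nu + 1)))
    beta = (n * (2 * n - 1) * (2 * n + 2 * nu + 1)
            / ((n + 2 * nu) * (2 * n + 4 * nu + 1) * (2 * n + 2 * nu - 1)))
    return alpha, beta

# With nu = 3 (as used in the text): beta_0 = 0 and, as n grows,
# alpha_n approaches 4 while beta_n approaches 1.
a0, b0 = nu_coefficients(0, 3)
aN, bN = nu_coefficients(10 ** 6, 3)
```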

Table 1. Numerical results for Example 4.1.
δ\delta method iterations time (s) relative error
0.1 Landweber 62 0.0153 1.9024e-2
ν\nu-method 15 0.0083 1.6657e-2
Nesterov 16 0.0091 1.7684e-2
AHB 20 0.0087 1.7774e-2
0.01 Landweber 190 0.0359 5.0988e-3
ν\nu-method 30 0.0102 5.1081e-3
Nesterov 41 0.0157 4.1816e-3
AHB 30 0.0108 4.6982e-3
0.001 Landweber 1256 0.1934 1.8305e-3
ν\nu-method 82 0.0189 1.8587e-3
Nesterov 102 0.0285 1.7776e-3
AHB 80 0.0187 1.7220e-3
0.0001 Landweber 8144 1.2922 6.2937e-4
ν\nu-method 215 0.0414 6.4159e-4
Nesterov 324 0.0831 5.1961e-4
AHB 268 0.0513 6.1130e-4
0.00001 Landweber 55145 8.2924 1.9023e-4
ν\nu-method 565 0.0918 1.9305e-4
Nesterov 817 0.2003 1.7054e-4
AHB 896 0.1520 1.8902e-4

In our numerical computation, we consider the equation (38) with the kernel

κ(s,t)={40s(1t) if st,40t(1s) if ts\displaystyle\kappa(s,t)=\left\{\begin{array}[]{lll}40s(1-t)&\mbox{ if }s\leq t,\\ 40t(1-s)&\mbox{ if }t\leq s\end{array}\right.

and assume that the sought solution is x(t)=4t(1t)+sin(2πt)x^{\dagger}(t)=4t(1-t)+\sin(2\pi t). Let y:=Fxy:=Fx^{\dagger} be the exact data. We add random noise to yy to produce noisy data yδy^{\delta} satisfying yδyL2[0,1]=δ\|y^{\delta}-y\|_{L^{2}[0,1]}=\delta for various noise levels δ>0\delta>0. In our implementation, all integrals over [0,1][0,1] are approximated by the trapezoidal rule based on the 10001000 nodes partitioning [0,1][0,1] into subintervals of equal length.
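The setup above can be sketched in a few lines of code. The following is a small-scale illustration, not the authors' implementation: the grid is coarser than the 1000 nodes used in the text, the noise realization and seed are arbitrary, plain Euclidean norms replace the L2L^{2} norms, and the same step size is used for both methods.

```python
import numpy as np

# Trapezoidal-rule discretization of (38) with the kernel above
# (coarser grid than in the text, to keep the sketch fast).
m = 200
t = np.linspace(0.0, 1.0, m)
h = t[1] - t[0]
S, T = np.meshgrid(t, t, indexing="ij")
K = np.where(S <= T, 40 * S * (1 - T), 40 * T * (1 - S))   # symmetric kernel
w = np.full(m, h)
w[0] = w[-1] = h / 2                                       # trapezoidal weights
F = K * w                                                  # (F x)_i ~ integral of k(s_i,t)x(t)dt

x_true = 4 * t * (1 - t) + np.sin(2 * np.pi * t)
y = F @ x_true

rng = np.random.default_rng(0)
delta = 1e-2
e = rng.standard_normal(m)
y_delta = y + delta * e / np.linalg.norm(e)                # ||y_delta - y|| = delta

tau = 1.01
gamma = 0.99 / np.linalg.norm(F, 2) ** 2                   # step size < 1/||F||^2

def landweber(max_iter=50000):
    x = np.zeros(m)
    for n in range(max_iter):
        r = F @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:               # discrepancy principle (39)
            return x, n
        x = x - gamma * F.T @ r
    return x, max_iter

def nesterov(a=3, max_iter=50000):
    x_old = x = np.zeros(m)
    for n in range(max_iter):
        r = F @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:
            return x, n
        z = x + (n - 1) / (n + a) * (x - x_old)
        x_old, x = x, z - gamma * F.T @ (F @ z - y_delta)
    return x, max_iter

x_lw, n_lw = landweber()
x_ne, n_ne = nesterov()
```

On this toy problem the Nesterov iteration reaches the discrepancy stopping index in far fewer steps than the Landweber iteration, mirroring the behavior reported in Table 1.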

For this model problem, we execute the Landweber iteration, the ν\nu-method, the Nesterov acceleration and our AHB method with the above setup using noisy data with various noise levels; the computational results are reported in Table 1, which includes the required number of iterations and the consumed CPU time together with the corresponding relative errors. These results indicate that all four methods can produce comparable approximate solutions in terms of accuracy. However, the ν\nu-method, the Nesterov acceleration and our AHB method clearly demonstrate superior performance over the Landweber method by significantly reducing the number of iterations and the CPU running time. In order to visualize how the iterates approach the sought solution, in Figure 1 we plot the relative error xnδxL2/xL2\|x_{n}^{\delta}-x^{\dagger}\|_{L^{2}}/\|x^{\dagger}\|_{L^{2}} versus nn for these four methods: the left figure plots the relative error, using noisy data with δ=0.001\delta=0.001, until the iteration is terminated by the discrepancy principle; the right figure plots the relative error when all computations are performed using the exact data. These plots further indicate the acceleration effect of the ν\nu-method, the Nesterov acceleration and our AHB method. Moreover, when using the exact data, our AHB method can outperform both the ν\nu-method and the Nesterov acceleration.

Refer to caption
Refer to caption
Figure 1. Relative error versus the number of iterations for Example 4.1.

When using noisy data, among the three fast methods the ν\nu-method usually requires the fewest iterations. However, it is unclear how to extend this method to solve (12) with nonlinear FF or non-quadratic {\mathcal{R}}. The Nesterov acceleration in general takes more time than the other two methods because, at each iteration step, it needs to calculate FznδyδFz_{n}^{\delta}-y^{\delta} to update xn+1δx_{n+1}^{\delta} and FxnδyδFx_{n}^{\delta}-y^{\delta} to check the discrepancy principle, as required by the theory ([20]), and thus the computational complexity per iteration increases. Although we have formulated a possible extension (6) of the Nesterov acceleration to solve (12) with general FF and {\mathcal{R}}, and numerical simulations demonstrate its striking performance, a solid theoretical justification is still unavailable. Note that our AHB method uses a very small step-size compared with the ones used by the other three methods; with an adaptive choice of the momentum coefficient, it still achieves an excellent acceleration effect over the Landweber iteration. Moreover, our theory provides a convergence guarantee for using our AHB method to solve (12) with general FF and {\mathcal{R}}. In the following numerical examples, we will focus on testing the acceleration effect of our AHB method over the Landweber iteration.
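To make the adaptive choices concrete, the following sketch spells out our reading of the AHB update in the quadratic setting of this example (linear FF, (x)=12x2{\mathcal{R}}(x)=\frac{1}{2}\|x\|^{2} so that \nabla{\mathcal{R}}^{*} is the identity, σ=1/2\sigma=1/2 and η=0\eta=0), reconstructed from the update formulas appearing in the convergence analysis. The grid, noise realization, the finite cap β=0.99\beta=0.99, and the parameters τ\tau, μ0\mu_{0}, μ1\mu_{1} are choices made for this illustration, not prescriptions from the paper.

```python
import numpy as np

# Same toy trapezoidal discretization of (38) as in the sketch above.
m = 200
t = np.linspace(0.0, 1.0, m)
h = t[1] - t[0]
S, T = np.meshgrid(t, t, indexing="ij")
K = np.where(S <= T, 40 * S * (1 - T), 40 * T * (1 - S))
w = np.full(m, h)
w[0] = w[-1] = h / 2
F = K * w

x_true = 4 * t * (1 - t) + np.sin(2 * np.pi * t)
y = F @ x_true
rng = np.random.default_rng(1)
delta = 1e-2
e = rng.standard_normal(m)
y_delta = y + delta * e / np.linalg.norm(e)

# Parameters: tau > 1 and 0 < mu0 < 2 - 2/tau (linear case, eta = 0).
tau, mu1, beta_cap = 1.05, 100.0, 0.99
mu0 = 0.99 * (2 - 2 / tau)
sigma = 0.5                       # R(x) = ||x||^2/2 is 1/2-strongly convex

def ahb(max_iter=50000):
    xi = np.zeros(m)              # with R quadratic, x_n = xi_n
    m_mom = np.zeros(m)           # momentum m_n = xi_n - xi_{n-1}
    gamma_t = 0.0                 # the quantity written gamma-tilde_n above
    for n in range(max_iter):
        r = F @ xi - y_delta
        if np.linalg.norm(r) <= tau * delta:       # discrepancy principle
            return xi, n
        g = F.T @ r                                # g_n = L(x_n)^* r_n
        alpha = min(mu0 * np.linalg.norm(r) ** 2 / np.linalg.norm(g) ** 2, mu1)
        if np.linalg.norm(m_mom) > 0:
            beta = min(max(0.0, (alpha * (g @ m_mom) - 2 * sigma * gamma_t)
                           / (m_mom @ m_mom)), beta_cap)
        else:
            beta = 0.0
        xi_new = xi - alpha * g + beta * m_mom
        m_new = xi_new - xi
        # gamma-tilde recursion with eta = 0; here <m_{n+1}, x_{n+1} - x_n>
        # reduces to ||m_{n+1}||^2 since x = xi.
        gamma_t = (m_new @ m_new - alpha * np.linalg.norm(r) ** 2
                   + delta * alpha * np.linalg.norm(r) + beta * gamma_t)
        xi, m_mom = xi_new, m_new
    return xi, max_iter

x_ahb, n_ahb = ahb()
```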

Example 4.2 (Computed tomography).

In this example, we consider applying Algorithm 1 to computed tomography which consists in determining the density of cross sections of a human body by measuring the attenuation of X-rays as they propagate through the biological tissues ([21]).

Assume the image is supported on a rectangular domain in 2\mathbb{R}^{2} and is discretized on a I×JI\times J pixel grid. We will identify any image of size I×JI\times J by a long vector in N{\mathbb{R}}^{N} with N=I×JN=I\times J by stacking all its columns. We consider the standard 2D fan-beam tomography using nθn_{\theta} projection angles evenly distributed between 11 and 360360 degrees with pp lines of X-rays emitted through each angle. We use the function “fanbeamtomo” in the MATLAB package AIR TOOLS [11] to discretize the problem, leading to an ill-conditioned linear algebraic system Fx=yFx=y, where FM×NF\in\mathbb{R}^{M\times N} with M=nθpM=n_{\theta}p and N=I×JN=I\times J.

Table 2. Numerical results for computed tomography.
δrel\delta_{rel}\quad method iterations time (s) relative error
0.05 Landweber 101 9.9940 1.8538e-01
AHB: β=0.99\beta=0.99 63 5.5243 1.9299e-01
AHB: β=\beta=\infty 48 4.2141 1.9060e-01
0.01 Landweber 444 44.0053 5.9528e-02
AHB: β=0.99\beta=0.99 230 19.7274 5.9464e-02
AHB: β=\beta=\infty 238 20.2110 5.9760e-02
0.005 Landweber 778 78.4099 3.1650e-02
AHB: β=0.99\beta=0.99 335 27.8164 3.1561e-02
AHB: β=\beta=\infty 374 31.1740 3.1859e-02
0.001 Landweber 3254 366.628 6.2999e-03
AHB: β=0.99\beta=0.99 1398 124.763 5.7077e-03
AHB: β=\beta=\infty 1495 132.288 5.9425e-03

In our numerical simulations, we consider the modified Shepp-Logan phantom of size 256×256256\times 256 generated by MATLAB, and we use nθ=60n_{\theta}=60 projection angles with p=367p=367 lines of X-rays per projection. Correspondingly FF has the size M×NM\times N with M=22020M=22020 and N=65536N=65536. Instead of the exact data yy, we add Gaussian noise to yy to generate noisy data yδy^{\delta} with relative noise level δrel=yδy2/y2\delta_{rel}=\|y^{\delta}-y\|_{2}/\|y\|_{2}, so that the noise level is δ=δrely2\delta=\delta_{rel}\|y\|_{2}. In order to capture the feature of the sought image, we take

(x)=12κxF2+|x|TV,xI×J{\mathcal{R}}(x)=\frac{1}{2\kappa}\|x\|_{F}^{2}+|x|_{TV},\quad\forall x\in{\mathbb{R}}^{I\times J}

with κ=1\kappa=1, where xF\|x\|_{F} is the Frobenius norm and |x|TV|x|_{TV} denotes the isotropic total variation of xx, i.e. |x|TV=f(x)|x|_{TV}=f(\nabla x) with \nabla being the discrete gradient operator defined by x=(1x,2x)\nabla x=(\nabla_{1}x,\nabla_{2}x) for x=(xi,j)I×Jx=(x_{i,j})\in{\mathbb{R}}^{I\times J}, where

(1x)i,j\displaystyle(\nabla_{1}x)_{i,j} ={xi+1,jxi,j,i=1,,I1;j=1,,Jx1,jxI,j,i=I;j=1,,J;\displaystyle=\left\{\begin{array}[]{ll}x_{i+1,j}-x_{i,j},&i=1,\cdots,I-1;\,j=1,\cdots,J\\ x_{1,j}-x_{I,j},&i=I;\,j=1,\cdots,J;\\ \end{array}\right.
(2x)i,j\displaystyle(\nabla_{2}x)_{i,j} ={xi,j+1xi,j,i=1,,I;j=1,,J1,xi,1xi,J,i=1,,I;j=J\displaystyle=\left\{\begin{array}[]{ll}x_{i,j+1}-x_{i,j},&i=1,\cdots,I;\,j=1,\cdots,J-1,\\ x_{i,1}-x_{i,J},&i=1,\cdots,I;\,j=J\end{array}\right.

and

f(u,v)=i=1Ij=1Jui,j2+vi,j2,u=(ui,j),v=(vi,j)I×J.\displaystyle f(u,v)=\sum_{i=1}^{I}\sum_{j=1}^{J}\sqrt{u_{i,j}^{2}+v_{i,j}^{2}},\quad\forall u=(u_{i,j}),v=(v_{i,j})\in{\mathbb{R}}^{I\times J}.
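In code, the discrete gradient with this periodic closure and the isotropic total variation can be written compactly. The following is a straightforward transcription of the displays above (NumPy's `roll` implements the periodic wrap-around), not taken from the authors' implementation; the `div` helper is the negative adjoint of the gradient, which is what TV solvers such as PDHG need.

```python
import numpy as np

def grad(x):
    """Discrete gradient with the periodic closure defined above."""
    d1 = np.roll(x, -1, axis=0) - x     # (grad_1 x)_{i,j} = x_{i+1,j} - x_{i,j}
    d2 = np.roll(x, -1, axis=1) - x     # (grad_2 x)_{i,j} = x_{i,j+1} - x_{i,j}
    return d1, d2

def div(p1, p2):
    """Negative adjoint of grad, so <grad(x), p> = -<x, div(p1, p2)>."""
    return (p1 - np.roll(p1, 1, axis=0)) + (p2 - np.roll(p2, 1, axis=1))

def tv_iso(x):
    """Isotropic total variation |x|_TV = f(grad x)."""
    d1, d2 = grad(x)
    return np.sqrt(d1 ** 2 + d2 ** 2).sum()
```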

Clearly {\mathcal{R}} satisfies Assumption 1 with σ=1/(2κ)\sigma=1/(2\kappa). When using Algorithm 1 to reconstruct the image, we use the initial guess ξ0=0\xi_{0}=0 and the step-size

αnδ=min{μ0Fxnδyδ2F(Fxnδyδ)2,μ1}\alpha_{n}^{\delta}=\min\left\{\frac{\mu_{0}\|Fx_{n}^{\delta}-y^{\delta}\|^{2}}{\|F^{*}(Fx_{n}^{\delta}-y^{\delta})\|^{2}},\mu_{1}\right\}

Our convergence theory requires τ>1\tau>1, 0<μ0<(22/τ)/κ0<\mu_{0}<(2-2/\tau)/\kappa and μ1>0\mu_{1}>0. Therefore we take τ=1.05\tau=1.05, μ0=0.99(22/τ)/κ\mu_{0}=0.99(2-2/\tau)/\kappa and μ1=100\mu_{1}=100. In order to determine the momentum coefficient βnδ\beta_{n}^{\delta}, we consider two values of β\beta: β=0.99\beta=0.99 and β=\beta=\infty. For comparison, we also carry out the computation by the Landweber method (2) using the same step-size αnδ\alpha_{n}^{\delta} and the regularization function {\mathcal{R}}. During the computation, updating xnδx_{n}^{\delta} from ξnδ\xi_{n}^{\delta} requires solving the total variation denoising problem

xnδ=argminxI×J{12κxκξnδF2+|x|TV}x_{n}^{\delta}=\arg\min_{x\in{\mathbb{R}}^{I\times J}}\left\{\frac{1}{2\kappa}\|x-\kappa\xi_{n}^{\delta}\|_{F}^{2}+|x|_{TV}\right\}

which is solved approximately by running 7070 iterations of the primal-dual hybrid gradient (PDHG) method ([3, 15, 32]).
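A minimal sketch of such a PDHG solver for this denoising subproblem is given below; the step sizes σ=τ=1/8\sigma=\tau=1/\sqrt{8} (so that στ28\sigma\tau\|\nabla\|^{2}\leq 8\sigma\tau\leq 1), the default iteration count, and the initialization at the unregularized minimizer are our own choices for illustration, not taken from the cited papers.

```python
import numpy as np

def grad(x):
    # Periodic forward differences, as in the displays above.
    return np.roll(x, -1, 0) - x, np.roll(x, -1, 1) - x

def div(p1, p2):
    # Negative adjoint of grad, i.e. <grad(x), p> = -<x, div(p1, p2)>.
    return (p1 - np.roll(p1, 1, 0)) + (p2 - np.roll(p2, 1, 1))

def tv_denoise(xi, kappa, iters=300):
    """Approximately solve  min_x ||x - kappa*xi||_F^2/(2*kappa) + |x|_TV
    by the primal-dual hybrid gradient method (Chambolle-Pock form)."""
    x = kappa * xi                       # start from the unregularized minimizer
    xbar = x.copy()
    p1 = np.zeros_like(x)
    p2 = np.zeros_like(x)
    sig = tau = 1.0 / np.sqrt(8.0)       # sig * tau * ||grad||^2 <= 1
    for _ in range(iters):
        # Dual ascent step followed by projection onto the pointwise unit ball.
        g1, g2 = grad(xbar)
        p1, p2 = p1 + sig * g1, p2 + sig * g2
        nrm = np.maximum(1.0, np.sqrt(p1 ** 2 + p2 ** 2))
        p1, p2 = p1 / nrm, p2 / nrm
        # Primal proximal step for the quadratic data term.
        z = x + tau * div(p1, p2)
        x_new = (kappa * z + tau * kappa * xi) / (kappa + tau)
        xbar = 2 * x_new - x
        x = x_new
    return x
```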

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2. The reconstruction results for computed tomography. (a) True phantom; (b) Landweber; (c) AHB with β=0.99\beta=0.99; (d) AHB with β=\beta=\infty.

The computational results by the AHB and Landweber methods are reported in Table 2, including the number of iterations nδn_{\delta}, the CPU running time and the relative errors xnδδx2/x2\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|_{2}/\|x^{\dagger}\|_{2}, using noisy data with various relative noise levels δrel>0\delta_{rel}>0. Table 2 shows that AHB with both β=0.99\beta=0.99 and β=\beta=\infty leads to a considerable decrease in the number of iterations and the amount of computational time, which demonstrates the striking acceleration effect of our AHB method.

Refer to caption
Figure 3. Computed tomography: relative errors versus the number of iterations.

In order to visualize the reconstruction accuracy of our AHB method, we plot in Figure 2 the true image and the reconstruction results by the Landweber method and the AHB method with β=0.99\beta=0.99 and \infty using noisy data with relative noise level δrel=0.01\delta_{rel}=0.01. In Figure 3 we also plot the curves of the relative errors xnδx2/x2\|x_{n}^{\delta}-x^{\dagger}\|_{2}/\|x^{\dagger}\|_{2} versus nn for the Landweber and AHB methods. These plots further demonstrate that, compared with the Landweber iteration, our AHB method can produce comparable reconstruction results with significantly fewer iterations.

Example 4.3 (Elliptic parameter identification).

In this example we consider the identification of the parameter cc in the elliptic boundary value problem

u+cu=f in Ω,u=g on Ω\displaystyle-\triangle u+cu=f\hbox{ in }\Omega,\qquad u=g\hbox{ on }\partial\Omega (40)

from an L2(Ω)L^{2}(\Omega)-measurement of the state uu, where Ωd\Omega\subset\mathbb{R}^{d} with d3d\leq 3 is a bounded domain with Lipschitz boundary Ω\partial\Omega, fH1(Ω)f\in H^{-1}(\Omega) and gH1/2(Ω)g\in H^{1/2}(\Omega). We assume that the sought parameter cc^{{\dagger}} is in L2(Ω)L^{2}(\Omega). This problem reduces to solving F(c)=uF(c)=u if we define the nonlinear operator F:L2(Ω)L2(Ω)F:L^{2}(\Omega)\rightarrow L^{2}(\Omega) by F(c):=u(c)F(c):=u(c), where u(c)H1(Ω)L2(Ω)u(c)\in H^{1}(\Omega)\subset L^{2}(\Omega) is the unique solution of (40). This operator FF is well defined on

𝒟:={cL2(Ω):cc^L2(Ω)ε0 for some c^0,a.e.}{\mathcal{D}}:=\left\{c\in L^{2}(\Omega):\|c-\hat{c}\|_{L^{2}(\Omega)}\leq\varepsilon_{0}\mbox{ for some }\hat{c}\geq 0,\ \textrm{a.e.}\right\}

for some positive constant ε0>0\varepsilon_{0}>0. It is known that the operator FF is weakly closed and Fréchet differentiable; the Fréchet derivative of FF and its adjoint are given by

F(c)h=A(c)1(hu(c)) and F(c)w=u(c)A(c)1wF^{\prime}(c)h=-A(c)^{-1}(hu(c))\quad\mbox{ and }\quad F^{\prime}(c)^{*}w=-u(c)A(c)^{-1}w

for c𝒟c\in{\mathcal{D}} and h,wL2(Ω)h,w\in L^{2}(\Omega), where A(c):H01(Ω)H1(Ω)A(c):H_{0}^{1}(\Omega)\to H^{-1}(\Omega) is defined by A(c)u=u+cuA(c)u=-\triangle u+cu which is an isomorphism uniformly in the ball B2ρ(c)B_{2\rho}(c^{\dagger}) for small ρ>0\rho>0. Moreover, FF satisfies Assumption 2 (d) with a number η>0\eta>0 which can be very small if ρ>0\rho>0 is sufficiently small (see [6]).
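For intuition, the forward map cu(c)c\mapsto u(c) can be sketched with a standard 5-point finite-difference discretization and a dense solve. This is a small-scale stand-in chosen for illustration (the experiments below use a 128×128128\times 128 grid and a multigrid solver); the grid size and the dense linear solve are our simplifications.

```python
import numpy as np

def solve_state(c, f, g_full, n):
    """Dense 5-point finite-difference solve of -Laplace(u) + c*u = f on the
    unit square with Dirichlet data taken from the boundary of g_full, an
    (n+1) x (n+1) grid array; c and f are (n-1) x (n-1) arrays of values
    at the interior nodes."""
    h = 1.0 / n
    m = n - 1
    # 1D second-difference matrix; 2D Laplacian via Kronecker products.
    T = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h ** 2
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m)) + np.diag(c.ravel())
    # Move the known boundary values to the right-hand side.
    u_b = np.zeros((n + 1, n + 1))
    u_b[0, :], u_b[-1, :] = g_full[0, :], g_full[-1, :]
    u_b[:, 0], u_b[:, -1] = g_full[:, 0], g_full[:, -1]
    rhs = f + (u_b[:m, 1:-1] + u_b[2:, 1:-1]
               + u_b[1:-1, :m] + u_b[1:-1, 2:]) / h ** 2
    return np.linalg.solve(A, rhs.ravel()).reshape(m, m)
```

Since the discrete Laplacian annihilates linear functions exactly, the choice g=x+yg=x+y with c=0c=0 and f=0f=0 reproduces u=x+yu=x+y up to roundoff, which gives a convenient correctness check.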

Table 3. Numerical results for elliptic parameter identification.
δ\delta method iterations time (s) xnδδxL2\|x_{n_{\delta}}^{\delta}-x^{\dagger}\|_{L^{2}}
0.005 Landweber 36 6.2252 3.0388e-01
AHB: β=0.99\beta=0.99 15 2.6239 3.0577e-01
AHB: β=\beta=\infty 12 2.1608 3.0328e-01
0.001 Landweber 248 39.967 1.3998e-01
AHB: β=0.99\beta=0.99 64 10.509 1.3772e-01
AHB: β=\beta=\infty 79 12.530 1.3790e-01
0.0005 Landweber 410 66.806 1.1268e-01
AHB: β=0.99\beta=0.99 108 17.023 1.1298e-01
AHB: β=\beta=\infty 165 26.356 1.1221e-01
0.0001 Landweber 2014 341.75 8.1312e-02
AHB: β=0.99\beta=0.99 236 38.464 8.1573e-02
AHB: β=\beta=\infty 375 60.491 8.1716e-02
0.00005 Landweber 4999 987.75 7.2499e-02
AHB: β=0.99\beta=0.99 472 74.040 7.1581e-02
AHB: β=\beta=\infty 821 133.82 7.2092e-02

In our numerical simulation, we consider the two-dimensional problem with Ω=[0,1]×[0,1]\Omega=[0,1]\times[0,1], where the sought parameter cc^{\dagger} is assumed to be a piecewise constant function as shown in Figure 4 (a). Assuming u(c)=x+yu(c^{\dagger})=x+y, we add random noise to produce noisy data uδu^{\delta} satisfying uδu(c)L2(Ω)=δ\|u^{\delta}-u(c^{\dagger})\|_{L^{2}(\Omega)}=\delta with various noise levels δ>0\delta>0. We will use uδu^{\delta} to reconstruct cc^{\dagger}. In order to capture the feature of the sought parameter, we take

(c)=12κcL2(Ω)2+|c|TV\displaystyle{\mathcal{R}}(c)=\frac{1}{2\kappa}\|c\|_{L^{2}(\Omega)}^{2}+|c|_{TV} (41)

with the constant κ=10\kappa=10, where |c|TV|c|_{TV} denotes the total variation of cc on Ω\Omega, i.e.

|c|TV:=sup{Ωcdivφdxdy:φC01(Ω,2) and φL(Ω)1}.|c|_{TV}:=\sup\left\{\int_{\Omega}c\,\mbox{div}\varphi\,\mbox{d}x\mbox{d}y:\varphi\in C_{0}^{1}(\Omega,{\mathbb{R}}^{2})\mbox{ and }\|\varphi\|_{L^{\infty}(\Omega)}\leq 1\right\}.

It is easy to see that this {\mathcal{R}} satisfies Assumption 1 with σ=1/(2κ)\sigma=1/(2\kappa).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4. The reconstruction results for parameter identification in elliptic equation. (a) True parameter; (b) Landweber; (c) AHB with β=0.99\beta=0.99; (d) AHB with β=\beta=\infty.

We first carry out the computation by the Landweber-type method (2) using the initial guess ξ0δ=0\xi_{0}^{\delta}=0 and the adaptive step-size αnδ\alpha_{n}^{\delta} given by (4) with μ0=1.96(1η(1+η)/τ)/κ\mu_{0}=1.96(1-\eta-(1+\eta)/\tau)/\kappa and μ1=80\mu_{1}=80, where we take η=0.01\eta=0.01 and τ=1.05\tau=1.05. Next we execute our AHB method, i.e. Algorithm 1, with the same initial guess ξ0δ\xi_{0}^{\delta} and step-size αnδ\alpha_{n}^{\delta}; the momentum coefficient βnδ\beta_{n}^{\delta} is determined using two values of β\beta: β=0.99\beta=0.99 and β=\beta=\infty. In order to carry out the computation, we divide Ω\Omega into 128×128128\times 128 small squares of equal size and solve all partial differential equations involved approximately by a multigrid method ([8]) via finite difference discretization. Furthermore, updating cnδc_{n}^{\delta} from ξnδ\xi_{n}^{\delta} at each iteration step is equivalent to a total variation denoising problem, which is solved by running 200200 iterations of the PDHG method.

Refer to caption
Figure 5. Relative errors versus the number of iterations.

In Table 3 we report the computational results by the Landweber iteration (2) and our AHB method with β=0.99\beta=0.99 and β=\beta=\infty, including the required number of iterations nδn_{\delta}, the CPU running time and the absolute errors cnδδcL2(Ω)\|c_{n_{\delta}}^{\delta}-c^{\dagger}\|_{L^{2}(\Omega)}, using noisy data with various noise levels δ>0\delta>0. These results clearly demonstrate the acceleration effect of our AHB method and its superior performance over the Landweber iteration. In Figure 4 and Figure 5 we plot the respective reconstruction results and the errors versus the number of iterations for the Landweber iteration and our AHB method using noisy data with noise level δ=0.0001\delta=0.0001, which shows that these methods produce comparable approximate solutions and that our AHB method is much faster than the Landweber iteration.

5. Conclusion

With the surge in machine learning advancements and the emergence of large-scale problems across diverse domains, Polyak’s heavy ball method has experienced a resurgence of interest within the optimization community in recent years. Much effort has been dedicated to comprehending its acceleration effects. In this paper we proposed an adaptive heavy ball method for solving ill-posed inverse problems, both linear and nonlinear. This method differs from the Landweber-type method in that a momentum term is added to define the iterates. Our method integrates a strongly convex function into the design of the algorithm as a regularization function to detect the desired feature of the sought solution. Moreover, we introduced novel techniques for adaptively selecting step-sizes and momentum coefficients, aiming to achieve potential acceleration over the Landweber-type method. Notably, the updating formulae for these parameters are explicit, obviating the need for a backtracking line search procedure and thereby saving computational time. Under the discrepancy principle, we showed that our method can be terminated after a finite number of iterations and established the corresponding regularization property. We reported diverse numerical results which indicate the significant reduction in both the required number of iterations and computational time, and thus demonstrate the superior performance of our method over the Landweber-type method.

Acknowledgement

The work of Q. Jin is partially supported by the Future Fellowship of the Australian Research Council (FT170100231). The work of Q. Huang is supported by the China Scholarship Council program.

References

  • [1] H. Attouch and J. Peypouquet, The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than o(1/k2)o(1/k^{2}), SIAM J. Optim., 26 (2016), no. 3, 1824–1834.
  • [2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), 183–202.
  • [3] S. Bonettini and V. Ruggiero, On the convergence of primal-dual hybrid gradient algorithms for total variation image restoration, J. Math. Imaging Vis., 44 (2012), 236–253.
  • [4] H. Brakhage, On ill-posed problems and the method of conjugate gradients, Inverse and Ill-Posed Problems (Sankt Wolfgang, 1986) (Notes Rep. Math. Sci. Engrg. vol 4), (New York: Academic), 165–175, 1987.
  • [5] A. Chambolle and Ch. Dossal, On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm, J. Optim. Theor. Appl., 166 (2015), 968–982.
  • [6] H. W. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems, Dordrecht, Kluwer, 1996.
  • [7] E. Ghadimi, H. Feyzmahdavian and M. Johansson, Global convergence of the heavy-ball method for convex optimization, 2015 European Control Conference, pages 310–315, 2015.
  • [8] W. Hackbusch, Iterative Solution of Large Sparse Systems of Equations, Second edition, Applied Mathematical Sciences, 95. Springer, 2016.
  • [9] M. Hanke, Accelerated Landweber iterations for the solution of ill-posed equations, Numer. Math., 60 (1991), 341–373.
  • [10] M. Hanke, A. Neubauer and O. Scherzer, A convergence analysis of the Landweber iteration for nonlinear ill-posed problems, Numer. Math., 72 (1995), no. 1, 21–37.
  • [11] P. C. Hansen and M. Saxild-Hansen, AIR Tools — a MATLAB package of algebraic iterative reconstruction methods, J. Comput. Appl. Math., 236 (2012), 2167–2178.
  • [12] M. Hegland, Q. Jin and W. Wang, Accelerated Landweber iteration with convex penalty for linear inverse problems in Banach spaces, Appl. Anal., 94 (2015), 524–547.
  • [13] S. Hubmer and R. Ramlau, Convergence analysis of a two-point gradient method for nonlinear ill-posed problems, Inverse Problems, 33 (2017), 095004.
  • [14] S. Hubmer and R. Ramlau, Nesterov’s accelerated gradient method for nonlinear ill-posed problems with a locally convex residual functional, Inverse Problems, 34 (2018), 095003.
  • [15] Q. Jin, Landweber-Kaczmarz method in Banach spaces with inexact inner solvers, Inverse Problems, 32 (2016), 104005.
  • [16] Q. Jin, Convergence rates of a dual gradient method for constrained linear ill-posed problems, Numer. Math. 151 (2022), no. 4, 841–871.
  • [17] Q. Jin and X. Lu, A fast nonstationary iterative method with convex penalty for inverse problems in Hilbert spaces, Inverse Problems 30 (2014), no. 4, 045012, 21 pp.
  • [18] Q. Jin and W. Wang, Landweber iteration of Kaczmarz type with general non-smooth convex penalty functionals, Inverse Problems, 29 (2013), 085011.
  • [19] B. Kaltenbacher, A. Neubauer and O. Scherzer, Iterative Regularization Methods for Nonlinear Ill-Posed Problems, Radon Ser. Comput. Appl. Math. 6, De Gruyter, Berlin, 2008.
  • [20] S. Kindermann, Optimal-order convergence of Nesterov acceleration for linear ill-posed problems, Inverse Problems, 37 (2021), no. 6, 065002.
  • [21] F. Natterer, The Mathematics of Computerized Tomography, SIAM, Philadelphia, 2001.
  • [22] Y. Nesterov, A method of solving a convex programming problem with convergence rate O(1/k2)O(1/k^{2}), Soviet Mathematics Doklady, 27 (1983), 372–376.
  • [23] Y. Nesterov, Lectures on Convex Optimization, Springer Science & Business Media, New York, 2018.
  • [24] A. Neubauer, On Nesterov acceleration for Landweber iteration of linear ill-posed problems, J. Inverse Ill-Posed Probl., 25 (2017), no. 3, 381–390.
  • [25] P. Ochs, Local convergence of the heavy-ball method and iPiano for non-convex optimization, J. Optim. Theory Appl., 177 (2018), no. 1, 153–180.
  • [26] P. Ochs, Y. Chen, T. Brox and T. Pock, iPiano: inertial proximal algorithm for nonconvex optimization, SIAM J. Imaging Sci. 7 (2014), no. 2, 1388–1419.
  • [27] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, 4 (1964), no. 5, 1–17.
  • [28] F. Schöpfer and T. Schuster, Fast regularizing sequential subspace optimization in Banach spaces, Inverse Problems, 25 (2009), 015013.
  • [29] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing Co., Inc., River Edge, New Jersey, 2002.
  • [30] S. K. Zavriev and F. V. Kostyuk, Heavy-ball method in nonconvex optimization problems, Comput. Math. Model., 4 (1993), 336–341.
  • [31] M. Zhong, W. Wang and Q. Jin, Regularization of inverse problems by two-point gradient methods in Banach spaces, Numer. Math., 143 (2019), no. 3, 713–747.
  • [32] M. Zhu and T. F. Chan, An efficient primal-dual hybrid gradient algorithm for total variation image restoration, CAM Report 08–34 UCLA, 2008.