An adaptive heavy ball method for ill-posed inverse problems
Abstract.
In this paper we consider solving ill-posed inverse problems, both linear and nonlinear, by a heavy ball method in which a strongly convex regularization function is incorporated to capture the features of the sought solution. We develop ideas on how to adaptively choose the step-sizes and the momentum coefficients to achieve acceleration over the Landweber-type method. We then analyze the method and establish its regularization property when it is terminated by the discrepancy principle. Various numerical results are reported which demonstrate the superior performance of our method over the Landweber-type method by substantially reducing the required number of iterations and the computational time.
Key words and phrases:
Ill-posed inverse problems, adaptive heavy ball method, step-sizes, momentum coefficients, convergence
2010 Mathematics Subject Classification:
65J20, 65J22, 65J15, 47J06
1. Introduction
Consider an operator equation of the form
(1)   F(x) = y,
where F : dom(F) ⊆ X → Y is an operator between two real Hilbert spaces X and Y with domain dom(F). The operator F could be either linear or nonlinear. Throughout, we always assume that (1) has a solution, i.e. y ∈ Ran(F), the range of F. We are interested in the situation where (1) is ill-posed in the sense that its solution does not depend continuously on the right-hand side data. Such situations arise in a broad range of applications of inverse problems. In practice, data are acquired through experiments and thus the exact data may not be available; instead we only have measurement data corrupted by noise. Due to the ill-posed nature of the underlying problem, it is therefore important to develop algorithms that produce stable approximate solutions of (1) from noisy data.
In practical applications, some a priori information on the sought solution, such as non-negativity, sparsity and piecewise constancy, is often available. Utilizing such feature information in algorithm design can significantly improve the reconstruction accuracy. By incorporating the available a priori information, one may pick a proper, lower semi-continuous, strongly convex function R : X → (-∞, ∞] as a regularization function to determine a solution of (1) with the desired feature. Let y^δ be noisy data satisfying
‖y^δ - y‖ ≤ δ
with a known noise level δ > 0. The Landweber-type method
(2)   ξ_{n+1}^δ = ξ_n^δ - γ_n^δ L(x_n^δ)^* (F(x_n^δ) - y^δ),   x_{n+1}^δ = arg min_{x ∈ X} { R(x) - ⟨ξ_{n+1}^δ, x⟩ }
has been considered in [18] for solving (1), where γ_n^δ is the step-size and {L(x)} is a family of properly chosen bounded linear operators. In case F is Fréchet differentiable at x, we may take L(x) = F'(x), the Fréchet derivative of F at x; otherwise, L(x) should be appropriately chosen as a substitute for the non-existent Fréchet derivative. This method has been thoroughly analyzed in [15, 18] under the constant step-size
(3)   γ_n^δ = μ_0 / B²
when an upper bound B of ‖L(x)‖ around the sought solution is available, and the adaptive step-size
(4)   γ_n^δ = min{ μ_0 ‖F(x_n^δ) - y^δ‖² / ‖L(x_n^δ)^*(F(x_n^δ) - y^δ)‖², μ_1 }
with suitable choices of the parameters μ_0 and μ_1, and the regularization property has been established when the iteration is terminated by the discrepancy principle. Note that, when R(x) = ½‖x‖², the corresponding method with the step-size chosen by (3) is the classical Landweber method which has been analyzed in [6, 10], and the corresponding method with the step-size chosen by (4) is the minimal error method ([19]). Extensive numerical simulations have demonstrated that the Landweber-type method (2) converges slowly, often necessitating a large number of iteration steps before termination by the discrepancy principle. It is therefore important to develop strategies for accelerating the Landweber-type method (2) while maintaining its simple implementation.
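To fix ideas, the following minimal Python sketch runs the classical Landweber iteration (the special case R(x) = ½‖x‖²) with a minimal-error-type adaptive step-size in the spirit of (4), stopped by the discrepancy principle. The toy forward operator, the noise model and the parameter values (μ_0 = 0.5, τ = 2) are our own illustrative choices, not the paper's setup.

```python
import numpy as np

# Toy discretized linear ill-posed problem: A has decaying singular values.
rng = np.random.default_rng(0)
m = 32
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = (U * (1.0 / np.arange(1, m + 1))) @ V.T      # singular values 1, 1/2, ..., 1/32
x_true = V[:, 0] + 0.5 * V[:, 1]                 # smooth in the right singular basis
y = A @ x_true
noise = rng.standard_normal(m)
delta = 0.01 * np.linalg.norm(y)                 # 1% noise level
y_delta = y + delta * noise / np.linalg.norm(noise)

def landweber(A, y_delta, delta, tau=2.0, mu0=0.5, max_iter=20000):
    """Landweber iteration for R(x) = ||x||^2/2, stopped by the discrepancy
    principle ||A x_n - y^delta|| <= tau * delta.  The step-size
    gamma_n = mu0 * ||r_n||^2 / ||A^T r_n||^2 is a minimal-error-type
    choice in the spirit of the adaptive rule (4)."""
    x = np.zeros(A.shape[1])
    for n in range(max_iter):
        r = A @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:     # discrepancy principle
            return x, n
        g = A.T @ r
        x = x - mu0 * (r @ r) / (g @ g) * g
    return x, max_iter

x_rec, iters = landweber(A, y_delta, delta)
err = np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true)
```

With this step-size choice the error to the least-squares solution decreases monotonically, so the discrepancy principle is reached after finitely many steps on this toy problem.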
For linear ill-posed problems in Hilbert spaces, Brakhage used in [4] the Jacobi polynomials to construct a method, known as the ν-method, to accelerate the classical Landweber iteration. Additionally, in [9], a family of accelerated Landweber iterations was developed by employing orthogonal polynomials and the spectral theory of self-adjoint operators. However, the acceleration strategy using orthogonal polynomials is scarcely applicable to the Landweber-type method (2) when the forward operator is nonlinear or the regularization function is non-quadratic. In [12, 28] sequential subspace optimization strategies were employed to accelerate the Landweber iteration. The implementation of this strategy, however, requires solving an additional optimization problem at each iteration step to determine the weighted coefficients of the search directions, which unfortunately slows down the method.
On the other hand, to determine a solution of the well-posed minimization problem
(5)   min_{x ∈ X} f(x)
with a continuously differentiable objective function f, tremendous progress has been made toward accelerating the gradient method
x_{n+1} = x_n - α ∇f(x_n)
in the optimization community. In particular, Nesterov's accelerated gradient method ([22]) and Polyak's heavy ball method ([27]) stand out and have had a far-reaching impact on the development of fast first-order optimization methods. In Nesterov's accelerated gradient method, to define the next iterate, instead of using the current iterate it uses a carefully chosen extrapolation point built from the previous two iterates ([1, 2, 5, 23]). Based on Nesterov's acceleration strategy, an accelerated version of the Landweber-type method has been proposed in [15] and a refined version of the method takes the form (see also [16, 31])
(6)   η_n^δ = ξ_n^δ + λ_n (ξ_n^δ - ξ_{n-1}^δ),   z_n^δ = arg min_{x ∈ X} { R(x) - ⟨η_n^δ, x⟩ },   ξ_{n+1}^δ = η_n^δ - γ L(z_n^δ)^* (F(z_n^δ) - y^δ),
where λ_n = (n-1)/(n+α) for some α ≥ 3. The numerical results presented in [15] demonstrate its striking acceleration effect. Inspired by the numerical observations in [15], some efforts have been devoted to analyzing the regularization property of (6) in the context of ill-posed problems. When X is a Hilbert space, F is a bounded linear operator and R(x) = ½‖x‖², the regularization property of the corresponding method, terminated by either a priori stopping rules or the discrepancy principle, has been established in [20, 24] based on a general acceleration framework in [9] using orthogonal polynomials. When F is nonlinear or R is non-quadratic, although some partial results are available in [14, 16], the analysis of (6) is underdeveloped and the understanding of its regularization property is largely open. By modifying the first equation in (6) with a connection coefficient to be determined, a two-point gradient method has been considered in [13, 31]. This method, however, requires a discrete backtracking search to determine the connection coefficient at each iteration step, which can incur additional computational time.
Different from Nesterov’s method, Polyak’s heavy ball method accelerates the gradient method by adding a momentum term to define the iterates, i.e.
(7)   x_{n+1} = x_n - α ∇f(x_n) + β (x_n - x_{n-1}).
The performance of (7) depends crucially on the properties of f and the choices of the step-size α and the momentum coefficient β, and its analysis turns out to be very challenging. When f is twice continuously differentiable, strongly convex and has a Lipschitz continuous gradient, it has been demonstrated in [27], under suitable constant choices of α and β, that the iterative sequence enjoys a provable linear convergence faster than gradient descent. Due to the recent development of machine learning and the appearance of large scale problems, the heavy ball method has gained renewed interest and much attention has been paid to understanding its convergence behavior ([7, 23, 25, 26, 30]). Although many strategies have been proposed to select α and β, how to choose these parameters to achieve an acceleration effect for general convex and non-convex problems remains an active research topic in the optimization community.
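Polyak's classical parameter choice for strongly convex quadratics can be checked numerically. The sketch below (a toy example of our own, not from the paper) compares heavy ball with Polyak's constants against gradient descent with its best constant step on a diagonal quadratic with μ = 1 and L = 100.

```python
import numpy as np

# f(x) = 0.5 * x^T H x with H = diag(1, ..., 100), so mu = 1 and L = 100.
eigs = np.linspace(1.0, 100.0, 20)
grad = lambda x: eigs * x                  # the minimizer is x* = 0

mu, L = 1.0, 100.0
# Polyak's classical parameters for strongly convex quadratics
alpha_hb = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
alpha_gd = 2.0 / (L + mu)                  # best constant step for gradient descent

x_gd = np.ones(20)
x_hb = np.ones(20)
x_prev = x_hb.copy()
for _ in range(200):
    x_gd = x_gd - alpha_gd * grad(x_gd)
    x_hb, x_prev = x_hb - alpha_hb * grad(x_hb) + beta * (x_hb - x_prev), x_hb

# per-step contraction factors: ~(sqrt(kappa)-1)/(sqrt(kappa)+1) = 9/11 for
# heavy ball versus ~(kappa-1)/(kappa+1) = 99/101 for gradient descent
```

After 200 iterations the heavy ball iterate is many orders of magnitude closer to the minimizer than the gradient descent iterate, reflecting the square-root dependence on the condition number.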
In this paper we will consider the iterative method
(8)   ξ_{n+1}^δ = ξ_n^δ - γ_n^δ L(x_n^δ)^* (F(x_n^δ) - y^δ) + λ_n^δ (ξ_n^δ - ξ_{n-1}^δ),   x_{n+1}^δ = arg min_{x ∈ X} { R(x) - ⟨ξ_{n+1}^δ, x⟩ }
for solving the ill-posed inverse problem (1), where γ_n^δ denotes the step-size and λ_n^δ denotes the momentum coefficient. This method differs from (2) in that a momentum term is added to define the iterates. The method (8) has an intimate connection with the heavy ball method. Indeed, when R(x) = ½‖x‖² and F is Fréchet differentiable, the method (8) with L(x) = F'(x) becomes
x_{n+1}^δ = x_n^δ - γ_n^δ F'(x_n^δ)^* (F(x_n^δ) - y^δ) + λ_n^δ (x_n^δ - x_{n-1}^δ),
which can be obtained by applying the heavy ball method (7) to the minimization problem (5) with
(9)   f(x) = ½ ‖F(x) - y^δ‖².
We will further demonstrate that, when F is a bounded linear operator, the method (8) with a general strongly convex R can be derived by applying the heavy ball method to a dual problem. Based on the successful experience with the acceleration effect of the method (7) for well-posed optimization problems, it is natural to expect that, with suitable choices of the parameters γ_n^δ and λ_n^δ, the method (8) can have superior performance over the method (2) for ill-posed problems. In this paper we will provide a criterion for choosing γ_n^δ and λ_n^δ and investigate the convergence property of the corresponding method. Our choices of γ_n^δ and λ_n^δ are adaptive in the sense that only quantities arising from the computation are used. The formulae for calculating γ_n^δ and λ_n^δ are explicit, thereby avoiding the need for a backtracking search strategy and consequently saving computational time. The complexity per iteration of our proposed method is comparable to that of one step of the Landweber-type method (2). However, extensive computational results demonstrate that our method significantly accelerates the Landweber-type method in terms of both the number of iterations and the computational time.
This paper is organized as follows. In Section 2 we first formulate the conditions on the regularization function R and the forward operator F. We then propose our adaptive heavy ball method by elucidating how to choose the step-sizes and the momentum coefficients; we also show that our method is well-defined and that the iteration terminates after finitely many steps under the discrepancy principle. In Section 3 we focus on the convergence analysis and establish the regularization property of our method. In Section 4 we perform various numerical simulations to demonstrate the acceleration effect and superior performance of our method over the Landweber-type method.
2. The adaptive heavy-ball method
In this section we will consider the iterative method (8) and motivate the adaptive choice of the step-size and the momentum coefficient. We will then show that our proposed method, i.e. Algorithm 1 below, is well-defined.
2.1. Assumptions
Note that the method (8) involves the regularization function R and the forward operator F. In order to carry out the analysis, certain conditions should be placed on R and F. For the regularization function R, we assume the following standard condition.
Assumption 1.
R : X → (-∞, ∞] is a proper, lower semi-continuous, strongly convex function in the sense that there is a constant c_0 > 0 such that
R(t x̄ + (1 - t) x) + c_0 t (1 - t) ‖x̄ - x‖² ≤ t R(x̄) + (1 - t) R(x)
for all x̄, x ∈ X and t ∈ [0, 1].
To make use of Assumption 1, let us recall a little convex analysis ([29]). For any proper convex function R : X → (-∞, ∞] we use ∂R to denote its subdifferential, i.e.
∂R(x) := { ξ ∈ X : R(x̄) ≥ R(x) + ⟨ξ, x̄ - x⟩ for all x̄ ∈ X }.
For any x ∈ X with ∂R(x) ≠ ∅, let ξ ∈ ∂R(x) and define
D_ξ R(x̄, x) := R(x̄) - R(x) - ⟨ξ, x̄ - x⟩,   x̄ ∈ X,
which is called the Bregman distance induced by R at x in the direction ξ. If R is strongly convex in the sense of Assumption 1, it is easy to derive that
(10)   D_ξ R(x̄, x) ≥ c_0 ‖x̄ - x‖²
for all x̄, x ∈ X with ∂R(x) ≠ ∅ and ξ ∈ ∂R(x). For a proper, lower semi-continuous, convex function R : X → (-∞, ∞], its convex conjugate
R^*(ξ) := sup_{x ∈ X} { ⟨ξ, x⟩ - R(x) },   ξ ∈ X,
is also a proper, lower semi-continuous, convex function that maps from X to (-∞, ∞]. Moreover
ξ ∈ ∂R(x) ⟺ x ∈ ∂R^*(ξ) ⟺ R(x) + R^*(ξ) = ⟨ξ, x⟩.
If R satisfies Assumption 1, then R^* is continuously differentiable and its gradient ∇R^* is Lipschitz continuous with
(11)   ‖∇R^*(ξ_1) - ∇R^*(ξ_2)‖ ≤ ‖ξ_1 - ξ_2‖ / (2 c_0)   for all ξ_1, ξ_2 ∈ X;
see [29, Corollary 3.5.11]. Thus, under Assumption 1, for any ξ ∈ X the minimization problem
min_{x ∈ X} { R(x) - ⟨ξ, x⟩ }
has a unique solution given by x = ∇R^*(ξ). Clearly ξ ∈ ∂R(x).
For the forward operator F, we assume the following condition, which has been widely used in dealing with ill-posed problems, where B_ρ(x_0) := { x ∈ X : ‖x - x_0‖ ≤ ρ } for any ρ > 0.
Assumption 2.
-
(a)
X and Y are Hilbert spaces.
-
(b)
There exist ρ > 0, x_0 ∈ X and ξ_0 ∈ ∂R(x_0) such that B_{2ρ}(x_0) ⊂ dom(F) and (1) has a solution x̄ such that D_{ξ_0} R(x̄, x_0) ≤ c_0 ρ².
-
(c)
F is weakly closed on dom(F), i.e. for any sequence {x_n} ⊂ dom(F) satisfying x_n ⇀ x ∈ X and F(x_n) → v ∈ Y there hold x ∈ dom(F) and F(x) = v. Throughout the paper "→" and "⇀" are used to denote the strong and weak convergence respectively.
-
(d)
There exists a family of bounded linear operators { L(x) : X → Y, x ∈ B_{2ρ}(x_0) } such that x ↦ L(x) is continuous on B_{2ρ}(x_0) and there is a constant 0 ≤ η < 1 such that
‖F(x) - F(x̄) - L(x̄)(x - x̄)‖ ≤ η ‖F(x) - F(x̄)‖
for all x, x̄ ∈ B_{2ρ}(x_0). Moreover, there is a constant B > 0 such that ‖L(x)‖ ≤ B for all x ∈ B_{2ρ}(x_0).
We remark that, when F is a bounded linear operator, Assumption 2 holds automatically with L(x) = F and η = 0. When F is nonlinear, Assumption 2 (d) is known as the tangential cone condition, which places a certain restriction on the nonlinearity of F. Assumption 2 (d) clearly implies
‖F(x) - F(x̄)‖ ≤ (1/(1 - η)) ‖L(x̄)(x - x̄)‖
and thus the continuity of the mapping x ↦ F(x). Based on Assumption 1 and Assumption 2, it has been shown in [18, Lemma 3.2] that (1) has a unique solution x† satisfying
(12)   D_{ξ_0} R(x†, x_0) = min { D_{ξ_0} R(x, x_0) : F(x) = y }.
By using (10) and Assumption 2 (b) it is easy to see that ‖x† - x_0‖ ≤ ρ, i.e. x† ∈ B_ρ(x_0).
2.2. The method
Let us now turn to the method (8). We first demonstrate that, when F is a bounded linear operator, the method (8) can be derived by applying the heavy ball method to the dual problem of
(13)   min { R(x) : x ∈ X, F x = y }.
To see this, we recall that the associated dual function is given by
and hence the corresponding dual problem is , i.e.
(14)   min_{λ ∈ Y} { R^*(F^* λ) - ⟨λ, y⟩ }.
Since R satisfies Assumption 1, the objective function in (14) is continuously differentiable. Applying the heavy ball method to solve (14) gives
By setting and , we then obtain
In case only noisy data y^δ is available, by replacing y with y^δ and allowing the step-size and the momentum coefficient to depend on n in the above equation, we may obtain the method
which is equivalent to (8) when F is a bounded linear operator.
Note that the term in the above equation is the gradient of the function at . When F is a nonlinear Fréchet differentiable operator, we may use the gradient of the function (9) to replace it in the above equation; in case F is not differentiable, we may use L(x_n^δ) as a substitute for the non-existent Fréchet derivative. Consequently, this leads to the method (8) for solving (1).
In order for the method (8) to have fast convergence, it is natural to choose and at each iteration step such that is as close to a solution of (1) as possible. Let denote any solution of (1) in . We may measure the closeness from to by the Bregman distance
induced by at in the direction . Directly minimizing this quantity becomes infeasible because it involves an unknown solution and a strongly convex function which may be non-quadratic. As a compromise, we may first derive a suitable upper bound for this quantity and then take minimization steps.
We will elucidate the idea on choosing and below. For simplicity of exposition, we introduce the notation
for all integers . Then (8) can be written as
(15) |
Let for . Since , we have and thus
Therefore, by using , (11) and (15), we can obtain
By the polarization identity in Hilbert spaces, we then have
Assume . By using , the Cauchy-Schwarz inequality and Assumption 2 (d), we further obtain
(16) |
Note that when the momentum coefficients vanish, the method (8) becomes the Landweber-type method (2), and extensive analysis has been done on (2) with suitable choices of the step-size, including (3) and (4). For the method (8) we choose the step-size in the same manner, i.e.
(17) |
when , where and are two positive constants. Then, by plugging the choice of from (17) into (2.2) and using we have
(18) |
We next consider the choice of . It seems natural to obtain it by minimizing the right-hand side of the above equation with respect to . Let us fix a number and minimize the right-hand side of (2.2) over to obtain
whenever . However, this formula for is not computable because it involves which is unknown. We need to treat the term
(19) |
by using a suitable computable surrogate. Note that and, for ,
Assuming , we may use Assumption 2 (d) to further obtain
(20) |
This motivates us to introduce such that
(24) |
once the step-sizes and the momentum coefficients up to the current step are defined. This leads us to propose Algorithm 1 below, which will be considered in this paper.
Algorithm 1 (AHB: Adaptive heavy ball method).
Take an initial guess , calculate , and set . Pick the numbers , , and . For do the following:
Note that, for the implementation of one step of Algorithm 1, the most expensive parts are the calculations of , and , which are also required by the Landweber-type method (2); the computational load of the remaining parts is negligible. Therefore, the computational complexity per iteration of Algorithm 1 is marginally higher than, but very close to, that of one step of the Landweber-type method (2).
In the design of Algorithm 1, the discrepancy principle has been incorporated, i.e. the iteration stops as soon as ‖F(x_n^δ) - y^δ‖ ≤ τδ is satisfied for the first time. In the following we will prove that Algorithm 1 is well-defined by showing that the iteration indeed terminates after finitely many steps.
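To make the overall structure of the iteration (8) with the discrepancy principle concrete, the sketch below runs the special case R(x) = ½‖x‖² (so x_n = ∇R^*(ξ_n) = ξ_n) on a toy linear problem. For simplicity it uses a constant step-size and a constant momentum coefficient; Algorithm 1 instead computes both adaptively from explicit formulas, which is the point of the method. The toy operator and parameter values are our own illustrative choices.

```python
import numpy as np

# toy linear ill-posed problem with decaying singular values
rng = np.random.default_rng(3)
m = 32
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = (U * (1.0 / np.arange(1, m + 1))) @ V.T
x_true = V[:, 0] + 0.5 * V[:, 1]
y = A @ x_true
noise = rng.standard_normal(m)
delta = 0.01 * np.linalg.norm(y)
y_delta = y + delta * noise / np.linalg.norm(noise)

def heavy_ball_dp(A, y_delta, delta, tau=2.0, gamma=1.0, lam=0.9,
                  max_iter=5000):
    """Gradient step plus momentum term, stopped by the discrepancy
    principle; constant gamma and lam are illustrative stand-ins for the
    adaptive choices of Algorithm 1."""
    xi = np.zeros(A.shape[1])
    xi_prev = xi.copy()
    for n in range(max_iter):
        x = xi                                  # x_n = grad R^*(xi_n) here
        r = A @ x - y_delta
        if np.linalg.norm(r) <= tau * delta:    # discrepancy principle
            return x, n
        step = gamma * (A.T @ r)                # gamma = 1 = 1/||A||^2 here
        xi, xi_prev = xi - step + lam * (xi - xi_prev), xi
    return xi, max_iter

x_rec, iters = heavy_ball_dp(A, y_delta, delta)
err = np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true)
```

With γ‖A‖² < 2(1 + λ) and 0 ≤ λ < 1 the constant-parameter recursion is stable for each singular direction, so the discrepancy principle is reached after finitely many steps on this toy problem.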
Lemma 2.1.
Proof.
Since , the result is true for . Assume next that for some . By noting that , we may use (2.2) and the definition of to obtain
By the induction principle, this shows the result. ∎
Lemma 2.2.
Proof.
We first show by induction that
(26) |
for all integers . It is trivial for as and . Next we assume that (26) holds for all for some . According to (2.2) and Lemma 2.1 we have for any solution of (1) in that
Note that is the minimizer of the function over , where
we can conclude that
(27) |
By virtue of this inequality with , the induction hypothesis, and Assumption 2 (b) we have
which together with (10) implies that and hence . Since , we thus have , i.e. . We therefore complete the proof of (26). As a direct consequence, we can see that (27) holds for all which shows the desired result. ∎
Based on Lemma 2.2 we can now prove that Algorithm 1 is well-defined, i.e. the iteration must terminate after finitely many steps.
Theorem 2.3.
Proof.
Let be an integer such that for all . It then follows from Lemma 2.2 that
According to the choice of and for all , we can see for all . Thus
If there is no finite integer such that (28) holds, then we can take to be any positive integer. Letting in the above equation gives a contradiction. Thus, the algorithm must terminate after finitely many steps. ∎
3. Convergence analysis
In this section we will provide a convergence analysis of Algorithm 1. Let n_δ be the integer determined by the discrepancy principle; we will analyze the behavior of the iterates as the noise level tends to 0. We first prove a weak convergence result as stated below.
Theorem 3.1.
Let Assumption 1 and Assumption 2 hold. Consider Algorithm 1 using a sequence of noisy data satisfying with as . Assume that and are chosen such that (25) holds. Let be the output integer. Then there is a subsequence of such that
for some solution of (1) in .
If in addition and for all , then as , where is the unique -minimum norm solution of (1).
Proof.
By using Lemma 2.2 with and (10), we have
which implies that is a bounded sequence. Thus, we can find a subsequence of such that as for some . Because , we have as . Thus, by the weak closedness of given in Assumption 2 (c), we have and . Furthermore, by the weak lower semi-continuity of norms we have
which together with implies that .
When and for all , we show that as . It suffices to show that is the unique weak cluster point of . For the given we have for all and by the definition of we can see that
for all which in particular implies that . Consequently, for any weak cluster point of we have . Since is the -minimum norm solution of (1), we must have . Indeed, for any we have for small . Thus, it follows from Assumption 2 (d) that
which implies for small as . Since is the -minimal norm solution of (1), we have for small and thus for all . Therefore
On the other hand, since both and are solutions of in , we may use Assumption 2 (d) to obtain
which means . Therefore
and thus . The proof is complete. ∎
The above theorem gives a weak convergence result on Algorithm 1 for any . However, it remains uncertain whether a strong convergence result can be assured for general . Nevertheless, by confining to be less than 1, we can establish a strong convergence result for Algorithm 1 in the remaining part of this section. The proof of strong convergence is rather challenging. We must explore the counterpart of Algorithm 1 using the exact data, which can be formulated as follows.
Algorithm 2.
Take an initial guess , calculate , and set . Pick the numbers , and . For do the following:
-
(i)
Calculate and ;
-
(ii)
Determine according to
-
(iii)
Calculate and determine by
-
(iv)
Calculate by
-
(v)
Update and by
For the iterative sequence obtained by Algorithm 2, we first have the following result.
Lemma 3.2.
Proof.
By an argument similar to the proof of Lemma 2.2 we can obtain the result immediately. ∎
Lemma 3.2 has the following consequences which will be used later.
Lemma 3.3.
-
(i)
If for some , then for all .
-
(ii)
If for some , then , and for all .
Proof.
(i) If for some , then is a solution of (1) in . Thus we may use Lemma 3.2 with to conclude for all . Consequently, it follows from (10) that for all .
(ii) If for some , then . We thus have from Lemma 3.2 that
which implies . If , then the choice of implies and thus which is a contradiction. Therefore we must have . By using (i) we then have for all . Using these facts and an induction argument we can conclude that for all . Indeed, this is trivial for by the given condition. Assume next for all for some . Then, by the definition of , we have
By the induction principle, we thus complete the proof. ∎
In order to show the strong convergence of the iterative sequence generated by Algorithm 2, we need the following result; see [17, Proposition 2.3] or [18, Proposition 3.6].
Proposition 3.4.
Based on Lemma 3.2, Lemma 3.3 and Proposition 3.4, we are now ready to show the strong convergence of Algorithm 2.
Theorem 3.5.
Proof.
We will use Proposition 3.4 to prove the conclusion. The item (i) in Proposition 3.4 holds automatically by the definition of . According to Lemma 3.2 we immediately obtain the item (ii) in Proposition 3.4. To show item (iii), we note from the choice of that when . Thus for all . From Lemma 3.2 it then follows that
This in particular implies (iii) in Proposition 3.4, i.e.
(31) |
We now verify (iv) in Proposition 3.4. By using Lemma 3.3 (i) we can conclude that
(32) |
In view of (31) and (32), we can pick a sequence of integers by setting and letting , for each , be the first integer satisfying
It is easy to see that
(33) |
For the above chosen , we now show (29) for any solution of (1) in . For any integers we write
(34) |
From the definition of and an induction argument it follows readily that
Thus
Using Assumption 2 (d) it is easy to obtain
Therefore, by using the Cauchy-Schwarz inequality, (33), and for any , we can obtain
This together with (34) then gives
(35) |
To proceed further, let
Then
Moreover, by rearranging the terms in , utilizing the property , and invoking Lemma 3.2, we can obtain
Therefore, is a monotonically increasing sequence with a finite upper bound, and thus it converges by the monotone convergence theorem. In particular, is a Cauchy sequence and hence
Note that
We thus have from (35) that
as . This shows (iv) in Proposition 3.4. Thus, we may use Proposition 3.4 to conclude the existence of a solution of (1) in such that (30) holds.
If for all , then we may use the definition of to obtain
Therefore, we may use the last part of Proposition 3.4 to conclude . ∎
In order to establish the regularization property of Algorithm 1, we need a stability result to connect Algorithm 1 with Algorithm 2. The following result is sufficient for our purpose.
Lemma 3.6.
Let Assumption 1 and Assumption 2 hold, let and be chosen such that (25) holds, and let . Let be a sequence of noisy data satisfying with as , and let , , be defined by Algorithm 1 using noisy data , where denotes the output integer. Let be defined by the counterpart of Algorithm 1 using the exact data . Then, for any finite integer there hold
for all .
Proof.
Since , we can conclude that for large and hence are well-defined for all . Let
where denotes the set of natural numbers. Then . By the definition of we have for and by using Lemma 3.3 we also have for .
We first use an induction argument to show that
(36) |
as for every . Noting that , and , and, as , we have . Furthermore, noting that , we also have no matter whether or not. Consequently (36) holds for . Now we assume that (36) holds for for some integer . By the induction hypothesis, the definition of , , and the continuity of , and , we can easily derive that
and
as . We now show as . When , by using we have
When , for we may use Assumption 2 (d) to obtain
which implies . Thus, by using and the continuity of and , we have and for sufficiently large . Consequently, by the definition of and we can conclude that as . Therefore
Recall that , we thus have for large . Therefore, by using the definition of and , the above established facts, and the induction hypothesis we can obtain
as , where we used as which follows easily from the established facts. By the induction principle, we thus obtain (36) for .
When , we next use an induction argument to show that
(37) |
for . Recall , we may use Lemma 3.3 to conclude that and for all . We first show (37) for . Note that
Therefore, by the definition of , (36), we have
and consequently, by the continuity of , we have as . Now assume that (37) holds for for some . By using and we have . Thus, by using and , we have
as . This shows that (37) is also true for . ∎
Finally we are ready to show the main strong convergence result on Algorithm 1 for solving (1) using noisy data.
Theorem 3.7.
Proof.
Let be the solution of (1) in determined in Theorem 3.5 such that as , where denotes the sequence defined by the counterpart of Algorithm 1 using the exact data. Let for . We will show that as by considering two cases.
Assume first that there is a sequence of noisy data satisfying with such that as for some finite integer . Then for all large . According to the definition of we have
By taking and using Lemma 3.6, we can obtain . Thus, we may use Lemma 3.3 (i) to obtain for all . Since as , we must have and thus, by Lemma 3.6 and the lower semi-continuity of , we can obtain
which shows that as .
4. Numerical results
In this section we will provide numerical simulations to test the performance of our AHB method, i.e. Algorithm 1. The computational results demonstrate that our AHB method indeed has superior performance over the Landweber iteration in terms of both the number of iterations and the CPU running time. Our simulations are performed via MATLAB R2023b on a Dell laptop with 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz 1.80 GHz and 32GB memory.
Example 4.1.
We first test the performance of Algorithm 1 by considering the following linear integral equation of the first kind
(38)   (A x)(s) := ∫₀¹ k(s, t) x(t) dt = y(s),   s ∈ [0, 1],
where the kernel k is a continuous function on [0, 1] × [0, 1]. It is easy to see that A is a compact linear map from L²[0, 1] to itself. Our goal is to determine the minimal-norm solution of (38) using noisy data y^δ satisfying ‖y^δ - y‖_{L²[0,1]} ≤ δ with a given noise level δ > 0. When applying Algorithm 1 to find the minimal-norm solution, we use , , , and
where is chosen to fulfill the theoretical requirement. As comparisons, we also consider the following algorithms:
-
Landweber iteration
with the initial guess , terminated by the discrepancy principle
(39)   ‖A x_{n_δ}^δ - y^δ‖ ≤ τδ < ‖A x_n^δ - y^δ‖,   0 ≤ n < n_δ.
This method is convergent for and ; see [6]. In our computation we use and .
-
Nesterov acceleration: This method was proposed in [22] to accelerate the gradient method for convex optimization problems. This strategy was then suggested in [15] to accelerate the Landweber iteration for ill-posed inverse problems. The corresponding method for solving (38) takes the form
z_n = x_n^δ + ((n - 1)/(n + α)) (x_n^δ - x_{n-1}^δ),   x_{n+1}^δ = z_n - γ A^* (A z_n - y^δ)
with and . When the method is terminated by the discrepancy principle (39) with , the corresponding regularization property has been proved in [20, 24] for . We use , and .
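The Nesterov-accelerated scheme can be sketched as follows on a discretized linear problem; the extrapolation weight (n-1)/(n+3), the constant step-size and the toy operator below are illustrative stand-ins of our own, not the exact setup of this example.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 32
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((m, m)))
A = (U * (1.0 / np.arange(1, m + 1))) @ V.T      # decaying singular values
x_true = V[:, 0] + 0.5 * V[:, 1]
y = A @ x_true
noise = rng.standard_normal(m)
delta = 0.001 * np.linalg.norm(y)
y_delta = y + delta * noise / np.linalg.norm(noise)

def nesterov_landweber(A, y_delta, delta, tau=2.0, gamma=1.0, max_iter=50000):
    """Nesterov-accelerated Landweber iteration, stopped by the
    discrepancy principle checked at the extrapolation point."""
    x = np.zeros(A.shape[1])
    x_prev = x.copy()
    for n in range(max_iter):
        z = x + (n - 1.0) / (n + 3.0) * (x - x_prev)   # extrapolation point
        r = A @ z - y_delta
        if np.linalg.norm(r) <= tau * delta:           # discrepancy principle
            return z, n
        x, x_prev = z - gamma * A.T @ r, x
    return x, max_iter

x_rec, iters = nesterov_landweber(A, y_delta, delta)
err = np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true)
```

The standard O(1/n²) objective decay of the accelerated gradient method guarantees that the discrepancy principle is reached on this toy problem, typically in far fewer iterations than the plain Landweber iteration.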
| δ | method | iterations | time (s) | relative error |
|---|---|---|---|---|
| 0.1 | Landweber | 62 | 0.0153 | 1.9024e-2 |
| | ν-method | 15 | 0.0083 | 1.6657e-2 |
| | Nesterov | 16 | 0.0091 | 1.7684e-2 |
| | AHB | 20 | 0.0087 | 1.7774e-2 |
| 0.01 | Landweber | 190 | 0.0359 | 5.0988e-3 |
| | ν-method | 30 | 0.0102 | 5.1081e-3 |
| | Nesterov | 41 | 0.0157 | 4.1816e-3 |
| | AHB | 30 | 0.0108 | 4.6982e-3 |
| 0.001 | Landweber | 1256 | 0.1934 | 1.8305e-3 |
| | ν-method | 82 | 0.0189 | 1.8587e-3 |
| | Nesterov | 102 | 0.0285 | 1.7776e-3 |
| | AHB | 80 | 0.0187 | 1.7220e-3 |
| 0.0001 | Landweber | 8144 | 1.2922 | 6.2937e-4 |
| | ν-method | 215 | 0.0414 | 6.4159e-4 |
| | Nesterov | 324 | 0.0831 | 5.1961e-4 |
| | AHB | 268 | 0.0513 | 6.1130e-4 |
| 0.00001 | Landweber | 55145 | 8.2924 | 1.9023e-4 |
| | ν-method | 565 | 0.0918 | 1.9305e-4 |
| | Nesterov | 817 | 0.2003 | 1.7054e-4 |
| | AHB | 896 | 0.1520 | 1.8902e-4 |
In our numerical computation, we consider the equation (38) with the kernel
and assume that the sought solution is . Let y be the exact data. We add random noise to y to produce noisy data y^δ satisfying ‖y^δ - y‖_{L²[0,1]} = δ for various noise levels δ > 0. In our implementation, all integrals over [0, 1] are approximated by the trapezoidal rule based on nodes partitioning [0, 1] into subintervals of equal length.
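A discretization along these lines can be sketched as follows; the kernel and exact solution below are illustrative stand-ins of our own (the Green's function of -u'' on [0, 1] and x(t) = sin(πt)), not necessarily those used in this example.

```python
import numpy as np

# discretize (38) with the composite trapezoidal rule on an equispaced grid
N = 200
s = np.linspace(0.0, 1.0, N + 1)
h = 1.0 / N
k = lambda u, v: np.minimum(u, v) * (1.0 - np.maximum(u, v))  # sample kernel
w = np.full(N + 1, h)
w[0] = w[-1] = h / 2.0                                # trapezoidal weights
A = k(s[:, None], s[None, :]) * w[None, :]            # (A x)_i ≈ ∫ k(s_i,t)x(t)dt
x_true = np.sin(np.pi * s)
y = A @ x_true                                        # ≈ sin(pi s)/pi^2 here

delta = 1e-3
rng = np.random.default_rng(4)
e = rng.standard_normal(N + 1)
e *= delta / (np.sqrt(h) * np.linalg.norm(e))         # scale in discrete L2 norm
y_delta = y + e                                       # ||y_delta - y|| = delta
```

For this particular kernel, (Ax)(s) solves -u'' = x with zero boundary values, so the quadrature can be checked against the closed form sin(πs)/π².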
For this model problem, we execute the Landweber iteration, the ν-method, the Nesterov acceleration and our AHB method with the above setup using noisy data with various noise levels; the computational results are reported in Table 1, which includes the required number of iterations and the consumed CPU time together with the corresponding relative errors. These results indicate that all four methods produce comparable approximate solutions in terms of accuracy. However, the ν-method, the Nesterov acceleration and our AHB method clearly demonstrate superior performance over the Landweber method by significantly reducing the number of iterations and the CPU running time. In order to visualize how the iterates approach the sought solution, in Figure 1 we plot the relative error versus the iteration number for these four methods: the left figure plots the relative error, using noisy data with , until the iteration is terminated by the discrepancy principle; the right figure plots the relative error when all computations are performed using the exact data. These plots further indicate the acceleration effect of the ν-method, the Nesterov acceleration and our AHB method. Moreover, when using the exact data, our AHB method can outperform both the ν-method and the Nesterov acceleration.


When using noisy data, among the three fast methods, the ν-method usually requires the fewest iterations. However, it is totally unclear how to extend this method to solve (12) with nonlinear or non-quadratic . The Nesterov acceleration in general takes longer than the other two methods because, at each iteration step, it needs to calculate to update and to check the discrepancy principle as required by the theory ([20]), and thus the computational complexity per iteration increases. Although we have formulated a possible extension (6) of the Nesterov acceleration to solve (12) with general and , and numerical simulations demonstrate its striking performance, a solid theoretical justification is still unavailable. Note that, in our AHB method, we used a very small step-size compared with the ones used by the other three methods. With an adaptive choice of the momentum coefficient, it still achieves an excellent acceleration effect over the Landweber iteration. Moreover, our theory provides a convergence guarantee for using our AHB method to solve (12) with general and . In the following numerical examples, we will focus on testing the acceleration effect of our AHB method over the Landweber iteration.
Example 4.2 (Computed tomography).
In this example, we consider applying Algorithm 1 to computed tomography which consists in determining the density of cross sections of a human body by measuring the attenuation of X-rays as they propagate through the biological tissues ([21]).
Assume the image is supported on a rectangular domain in and is discretized on a pixel grid. We will identify any image of size by a long vector in with by stacking all its columns. We consider the standard 2D fan-beam tomography using projection angles evenly distributed between and degrees with lines of X-rays emitted through each angle. We use the function “fanbeamtomo” in the MATLAB package AIR TOOLS [11] to discretize the problem, leading to an ill-conditioned linear algebraic system , where with and .
| δ | method | iterations | time (s) | relative error |
|---|---|---|---|---|
| 0.05 | Landweber | 101 | 9.9940 | 1.8538e-01 |
| | AHB: | 63 | 5.5243 | 1.9299e-01 |
| | AHB: | 48 | 4.2141 | 1.9060e-01 |
| 0.01 | Landweber | 444 | 44.0053 | 5.9528e-02 |
| | AHB: | 230 | 19.7274 | 5.9464e-02 |
| | AHB: | 238 | 20.2110 | 5.9760e-02 |
| 0.005 | Landweber | 778 | 78.4099 | 3.1650e-02 |
| | AHB: | 335 | 27.8164 | 3.1561e-02 |
| | AHB: | 374 | 31.1740 | 3.1859e-02 |
| 0.001 | Landweber | 3254 | 366.628 | 6.2999e-03 |
| | AHB: | 1398 | 124.763 | 5.7077e-03 |
| | AHB: | 1495 | 132.288 | 5.9425e-03 |
In our numerical simulations we consider the modified Shepp–Logan phantom of size generated by MATLAB, and we use projection angles with lines of X-rays per projection; correspondingly, has size with and . Instead of the exact data , we add Gaussian noise to to generate noisy data with relative noise level , so that the noise level is . In order to capture the feature of the sought image, we take
with , where is the Frobenius norm and denotes the isotropic total variation of , i.e. with being the discrete gradient operator assembled from the standard forward-difference matrices in the horizontal and vertical directions.
Clearly satisfies Assumption 1 with . When using Algorithm 1 to reconstruct the image, we use the initial guess and the step-size
Our convergence theory requires , and ; therefore we take , and . In order to determine the momentum coefficient , we consider two values of : and . For comparison, we also carry out the computation by the Landweber method (2) using the same step-size and the same regularization function . During the computation, updating from requires solving the total variation denoising problem
which is solved approximately by the primal-dual hybrid gradient (PDHG) method ([3, 15, 32]) after iterations.
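For illustration, this inner denoising step can be prototyped as follows: a minimal PDHG sketch for the model min_u ½‖u−f‖²_F + λ|u|_TV with an assumed forward-difference discretization of the gradient. The function names (`tv_denoise_pdhg`, `grad`, `div2d`, `tv`) and parameter choices are illustrative assumptions, not the exact inner solver used in our computations.

```python
import numpy as np

def grad(u):
    # forward differences with zero boundary at the last row/column
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div2d(px, py):
    # discrete divergence, the negative adjoint of grad
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[0, :] = px[0, :]
    dx[1:-1, :] = px[1:-1, :] - px[:-2, :]
    dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]
    dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]
    dy[:, -1] = -py[:, -2]
    return dx + dy

def tv(u):
    gx, gy = grad(u)
    return np.sqrt(gx**2 + gy**2).sum()   # isotropic total variation

def tv_denoise_pdhg(f, lam=0.1, n_iter=200):
    """Approximately solve min_u 0.5*||u - f||_F^2 + lam*|u|_TV by PDHG."""
    u = f.copy(); u_bar = f.copy()
    px = np.zeros_like(f); py = np.zeros_like(f)
    tau = sigma = 1.0 / 3.0           # tau*sigma*||grad||^2 <= 8/9 < 1
    for _ in range(n_iter):
        gx, gy = grad(u_bar)
        px, py = px + sigma * gx, py + sigma * gy
        scale = np.maximum(1.0, np.sqrt(px**2 + py**2) / lam)   # project |p| <= lam
        px, py = px / scale, py / scale
        u_old = u
        u = (u + tau * div2d(px, py) + tau * f) / (1.0 + tau)   # prox of fidelity term
        u_bar = 2.0 * u - u_old
    return u
```

Running a fixed, moderate number of these inner iterations per outer step, as done in our experiments, trades a small amount of inner accuracy for a substantially lower cost per outer iteration.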




The computational results by the AHB and Landweber methods are reported in Table 2, including the number of iterations , the CPU running time and the relative errors , using noisy data with various relative noise levels . Table 2 shows that AHB with both and leads to a considerable decrease in the number of iterations and in the amount of computational time, which demonstrates the striking acceleration effect of our AHB method.
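The noisy data at a prescribed relative noise level can be generated as in the following sketch (an illustrative helper with the hypothetical name `add_relative_noise`; the paper does not specify its noise-generation code):

```python
import numpy as np

def add_relative_noise(y, rel_level, rng=None):
    """Return noisy data y_delta with ||y_delta - y|| = rel_level * ||y||,
    using Gaussian noise, together with the noise level delta."""
    rng = np.random.default_rng() if rng is None else rng
    e = rng.standard_normal(y.shape)
    delta = rel_level * np.linalg.norm(y)           # absolute noise level
    return y + delta * e / np.linalg.norm(e), delta # rescale noise to norm delta
```

Rescaling the noise to have exactly norm delta makes the noise level exact, which is convenient when the discrepancy principle is used as the stopping rule.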

In order to visualize the reconstruction accuracy of our AHB method, we plot in Figure 2 the true image and the reconstruction results by the Landweber method and the AHB method with and using noisy data with relative noise level . In Figure 3 we also plot the curves of the relative errors versus for the Landweber and AHB methods. These plots further demonstrate that, compared with the Landweber iteration, our AHB method can produce comparable reconstruction results with significantly fewer iterations.
Example 4.3 (Elliptic parameter identification).
In this example we consider the identification of the parameter in the elliptic boundary value problem
(40) |
from an -measurement of the state , where with is a bounded domain with Lipschitz boundary , and . We assume that the sought parameter is in . This problem reduces to solving if we define the nonlinear operator by , where is the unique solution of (40). This operator is well defined on
for some positive constant . It is known that the operator is weakly closed and Fréchet differentiable; the Fréchet derivative of and its adjoint are given by
for and , where is defined by , which is an isomorphism uniformly in the ball for small . Moreover, satisfies Assumption 2 (d) with a number which can be very small if is sufficiently small (see [6]).
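Since equation (40) did not survive the conversion here, we only sketch a one-dimensional prototype of such a parameter-to-state map. The sketch below assumes the model problem −u″ + cu = f on (0,1) with homogeneous Dirichlet boundary conditions (the classical example discussed in [6]) and a central finite-difference discretization; both the concrete PDE form and all names are assumptions for illustration.

```python
import numpy as np

def forward(c, f):
    """Parameter-to-state map c -> u: solve -u'' + c*u = f on (0,1) with
    u(0) = u(1) = 0 by central finite differences on n interior points."""
    n = len(c)
    h = 1.0 / (n + 1)
    # tridiagonal discretization of -u'' plus the multiplication operator by c
    A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2 + np.diag(c)
    return np.linalg.solve(A, f)
```

In the two-dimensional experiments below, the analogous discrete systems are much larger, which is why a multigrid solver is used there instead of a direct solve.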
| noise level | method | iterations | time (s) | absolute error |
|---|---|---|---|---|
| 0.005 | Landweber | 36 | 6.2252 | 3.0388e-01 |
| | AHB: | 15 | 2.6239 | 3.0577e-01 |
| | AHB: | 12 | 2.1608 | 3.0328e-01 |
| 0.001 | Landweber | 248 | 39.967 | 1.3998e-01 |
| | AHB: | 64 | 10.509 | 1.3772e-01 |
| | AHB: | 79 | 12.530 | 1.3790e-01 |
| 0.0005 | Landweber | 410 | 66.806 | 1.1268e-01 |
| | AHB: | 108 | 17.023 | 1.1298e-01 |
| | AHB: | 165 | 26.356 | 1.1221e-01 |
| 0.0001 | Landweber | 2014 | 341.75 | 8.1312e-02 |
| | AHB: | 236 | 38.464 | 8.1573e-02 |
| | AHB: | 375 | 60.491 | 8.1716e-02 |
| 0.00005 | Landweber | 4999 | 987.75 | 7.2499e-02 |
| | AHB: | 472 | 74.040 | 7.1581e-02 |
| | AHB: | 821 | 133.82 | 7.2092e-02 |
In our numerical simulation we consider the two-dimensional problem with , and the sought parameter is assumed to be a piecewise constant function as shown in Figure 4 (a). Assuming , we add random noise to produce noisy data satisfying with various noise levels . We will use to reconstruct . In order to capture the feature of the sought parameter, we take
(41) |
with the constant , where denotes the total variation of on , i.e.
It is easy to see that this satisfies Assumption 1 with .




We first carry out the computation by the Landweber-type method (2) using the initial guess and the adaptive step-size given by (4) with and , where we take and . Next we execute our AHB method, i.e. Algorithm 1, with the same initial guess and step-size ; the momentum coefficient is determined by using two values of : and . In order to carry out the computation, we divide into small squares of equal size and solve all partial differential equations involved approximately by a multigrid method ([8]) via finite difference discretization. Furthermore, updating from at each iteration step is equivalent to a total variation denoising problem, which is solved by the PDHG method after iterations.

In Table 3 we report the computational results of the Landweber iteration (2) and our AHB method with and , including the required number of iterations , the CPU running time and the absolute errors, using noisy data with various noise levels . The results clearly demonstrate the acceleration effect of our AHB method and show its superior performance over the Landweber iteration. In Figure 4 and Figure 5 we plot the reconstructed results and the absolute errors versus the number of iterations for the Landweber iteration and our AHB method using noisy data with noise level ; these plots show that the methods produce comparable approximate solutions and indicate that our AHB method is much faster than the Landweber iteration.
5. Conclusion
With the surge in machine learning advancements and the emergence of large-scale problems across diverse domains, Polyak’s heavy ball method has experienced a resurgence of interest within the optimization community in recent years. Much effort has been dedicated to comprehending its acceleration effects. In this paper we proposed an adaptive heavy ball method for solving ill-posed inverse problems, both linear and nonlinear. This method differs from the Landweber-type method in that a momentum term is added to define the iterates. Our method integrates a strongly convex function into the design of the algorithm as a regularization function to detect the desired feature of the sought solution. Moreover, we introduced novel techniques for adaptively selecting step-sizes and momentum coefficients, aiming to achieve potential acceleration over the Landweber-type method. Notably, the updating formulae for these parameters are explicit, obviating the need for a backtracking line search procedure and thereby saving computational time. Under the discrepancy principle, we showed that our method can be terminated after a finite number of iterations and established the corresponding regularization property. We reported diverse numerical results which indicate the significant reduction in both the required number of iterations and computational time, and thus demonstrate the superior performance of our method over the Landweber-type method.
Acknowledgement
The work of Q. Jin is partially supported by the Future Fellowship of the Australian Research Council (FT170100231). The work of Q. Huang is supported by the China Scholarship Council program.
References
- [1] H. Attouch and J. Peypouquet, The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than , SIAM J. Optim., 26 (2016), no. 3, 1824–1834.
- [2] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci., 2 (2009), 183–202.
- [3] S. Bonettini and V. Ruggiero, On the convergence of primal-dual hybrid gradient algorithms for total variation image restoration, J. Math. Imaging Vis., 44 (2012), 236–253.
- [4] H. Brakhage, On ill-posed problems and the method of conjugate gradients, Inverse and Ill-Posed Problems (Sankt Wolfgang, 1986) (Notes Rep. Math. Sci. Engrg. vol 4), (New York: Academic), 165–175, 1987.
- [5] A. Chambolle and C. Dossal, On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm, J. Optim. Theor. Appl., 166 (2015), 968–982.
- [6] H. W. Engl, M. Hanke and A. Neubauer, Regularization of Inverse Problems, Dordrecht, Kluwer, 1996.
- [7] E. Ghadimi, H. Feyzmahdavian and M. Johansson, Global convergence of the heavy-ball method for convex optimization, 2015 European Control Conference, pages 310–315, 2015.
- [8] W. Hackbusch, Iterative Solution of Large Sparse Systems of Equations, Second edition, Applied Mathematical Sciences, 95. Springer, 2016.
- [9] M. Hanke, Accelerated Landweber iterations for the solution of ill-posed equations, Numer. Math., 60 (1991), 341–373.
- [10] M. Hanke, A. Neubauer and O. Scherzer, A convergence analysis of the Landweber iteration for nonlinear ill-posed problems, Numer. Math., 72 (1995), no. 1, 21–37.
- [11] P. C. Hansen and M. Saxild-Hansen, AIR Tools — a MATLAB package of algebraic iterative reconstruction methods, J. Comput. Appl. Math., 236 (2012), 2167–2178.
- [12] M. Hegland, Q. Jin and W. Wang, Accelerated Landweber iteration with convex penalty for linear inverse problems in Banach spaces, Appl. Anal., 94 (2015), 524–547.
- [13] S. Hubmer and R. Ramlau, Convergence analysis of a two-point gradient method for nonlinear ill-posed problems, Inverse Problems, 33 (2017), 095004.
- [14] S. Hubmer and R. Ramlau, Nesterov’s accelerated gradient method for nonlinear ill-posed problems with a locally convex residual functional, Inverse Problems, 34 (2018), 095003.
- [15] Q. Jin, Landweber-Kaczmarz method in Banach spaces with inexact inner solvers, Inverse Problems, 32 (2016), 104005.
- [16] Q. Jin, Convergence rates of a dual gradient method for constrained linear ill-posed problems, Numer. Math. 151 (2022), no. 4, 841–871.
- [17] Q. Jin and X. Lu, A fast nonstationary iterative method with convex penalty for inverse problems in Hilbert spaces, Inverse Problems 30 (2014), no. 4, 045012, 21 pp.
- [18] Q. Jin and W. Wang, Landweber iteration of Kaczmarz type with general non-smooth convex penalty functionals, Inverse Problems, 29 (2013), 085011.
- [19] B. Kaltenbacher, A. Neubauer and O. Scherzer, Iterative Regularization Methods for Nonlinear Ill-Posed Problems, Radon Ser. Comput. Appl. Math. 6, De Gruyter, Berlin, 2008.
- [20] S. Kindermann, Optimal-order convergence of Nesterov acceleration for linear ill-posed problems, Inverse Problems, 37 (2021), no. 6, 065002.
- [21] F. Natterer, The Mathematics of Computerized Tomography, SIAM, Philadelphia, 2001.
- [22] Y. Nesterov, A method of solving a convex programming problem with convergence rate , Soviet Mathematics Doklady, 27 (1983), 372–376.
- [23] Y. Nesterov, Lectures on Convex Optimization, Springer Science & Business Media, New York, 2018.
- [24] A. Neubauer, On Nesterov acceleration for Landweber iteration of linear ill-posed problems, J. Inverse Ill-Posed Probl., 25 (2017), no. 3, 381–390.
- [25] P. Ochs, Local convergence of the heavy-ball method and iPiano for non-convex optimization, J. Optim. Theory Appl., 177 (2018), no. 1, 153–180.
- [26] P. Ochs, Y. Chen, T. Brox and T. Pock, iPiano: inertial proximal algorithm for nonconvex optimization, SIAM J. Imaging Sci. 7 (2014), no. 2, 1388–1419.
- [27] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, USSR Computational Mathematics and Mathematical Physics, 4 (1964), no. 5, 1–17.
- [28] F. Schöpfer and T. Schuster, Fast regularizing sequential subspace optimization in Banach spaces, Inverse Problems, 25 (2009), 015013.
- [29] C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific Publishing Co., Inc., River Edge, New Jersey, 2002.
- [30] S. K. Zavriev and F. V. Kostyuk, Heavy-ball method in nonconvex optimization problems, Comput. Math. Model., 4 (1993), 336–341.
- [31] M. Zhong, W. Wang and Q. Jin, Regularization of inverse problems by two-point gradient methods in Banach spaces, Numer. Math., 143 (2019), no. 3, 713–747.
- [32] M. Zhu and T. F. Chan, An efficient primal-dual hybrid gradient algorithm for total variation image restoration, CAM Report 08–34 UCLA, 2008.