
Regularization Theory of the Analytic Deep Prior Approach

Clemens Arndt, Center for Industrial Mathematics, University of Bremen, Germany ([email protected])
(March 2022)
Abstract

The analytic deep prior (ADP) approach was recently introduced for the theoretical analysis of deep image prior (DIP) methods with special network architectures. In this paper, we prove that ADP is in fact equivalent to classical variational Ivanov methods for solving ill-posed inverse problems. In addition, we propose a new variant which incorporates the strategy of early stopping into the ADP model. For both variants, we show how classical regularization properties (existence, stability, convergence) can be obtained under common assumptions.

1 Introduction

The field of image processing (e.g. denoising, deblurring) in particular is a constant source of challenging inverse problems. The restoration of a corrupted image is typically ill-posed, so regularization techniques are needed to obtain a natural-looking result. In other words, the restoration method should incorporate some prior knowledge about the appearance of natural images. However, depending on the application it can be very difficult to give a mathematically exact definition of what natural-looking images are. This makes it hard to encode such prior knowledge in a penalty term for classical variational regularization approaches (e.g. TV regularization [30, 8]).

At the same time, deep learning methods with convolutional neural networks have proven to be quite successful in generating and restoring images [22, 16, 35]. One reason for this is the use of appropriate training data, but [25] shows that the architecture of an untrained network alone can already serve as an image prior. The so-called deep image prior (DIP) approach consists of optimizing the weights of a neural network $\varphi_{\theta}$ to minimize the loss function

\frac{1}{2}\|A\varphi_{\theta}(z)-y^{\delta}\|^{2} (1.1)

for some forward operator $A$ and noisy data $y^{\delta}$ (the network's input $z$ is randomly chosen and kept fixed). Although no training data and no penalty functional are used, DIP produces remarkable results in different image processing tasks, as can be seen in [25]. Even challenging problems like sparse-angle computed tomography [3] or compressive sensing [19] can be solved this way.
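To make the procedure concrete, the following minimal sketch shows how the DIP loss (1.1) is typically minimized by gradient descent on the network weights. It is only an illustration under simplifying assumptions (the operator is given as a matrix, the network net and all hyperparameters are placeholders) and not the implementation used in [25] or in this paper.

import torch

def dip_reconstruct(A, y_delta, net, z, n_iter=2000, lr=1e-3):
    # A: forward operator as an (m, n) tensor, y_delta: noisy data (m,),
    # net: any generator network mapping the fixed random input z to an n-vector
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        x = net(z)                                        # current reconstruction
        loss = 0.5 * torch.sum((A @ x - y_delta) ** 2)    # DIP loss (1.1)
        loss.backward()                                   # gradient w.r.t. the weights theta
        optimizer.step()                                  # early stopping would break here
    return net(z).detach()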

Developing regularization theory for deep learning methods is of high interest [2]. A very promising approach is to combine classical theory with deep learning (e.g. [26, 29]). The number of papers which analyze DIP from a theoretical point of view is also growing. In [18], a functional is constructed which measures the ability of the neural network $\varphi_{\theta}$ to approximate an arbitrary image. This functional can then be used as a penalty term in a classical variational method. The authors of [32] analyze how fast a DIP network approximates the low-frequency and high-frequency components of the target image. By controlling this so-called spectral bias, overfitting is avoided. In [20], the ability to denoise images is attributed to convolutional layers, which fit smooth images faster than noisy ones. The role and the choice of hyperparameters for DIP approaches is described in [34]. A Bayesian perspective is presented in [9], where DIP is interpreted as a Gaussian process.

The choice of architecture is crucial for applications of DIP. Generative neural networks are a natural choice due to their ability to reproduce natural-looking images. The authors of [13], however, used a LISTA-like network [17] instead to develop the so-called analytic deep prior (ADP) approach. This may not lead to better practical performance of DIP, but it is the foundation for an interesting theory. The main aspect consists of interpreting the training of a neural network as the optimization of a Tikhonov functional. There is an analogy to [1], where the penalty term of a Tikhonov functional is optimized. In contrast to that, the focus of [13] is on the forward operator inside the functional (see section 2).

This work summarizes deeper investigations of the ADP model. The main result (Theorem 3.3) is an equivalence between the ADP approach and classical Ivanov methods [36]. From this follows a complete analysis of the regularization properties of ADP, including the existence of solutions, stability of reconstructions and convergence towards the ground truth for vanishing noise.

In practical applications of DIP, gradient descent and early stopping are used to minimize the loss function (1.1). Thus, a global (or even a local) minimum is not reached in general. While this fact was not considered in the theoretical derivation of ADP, we propose a new variant (called ADP-$\beta$) which incorporates the effect of early stopping into the model (section 3.2). We also analyze the regularization properties of this new approach.

In section 4 we compare different numerical ways to compute ADP and DIP (with a LISTA-like architecture) solutions of simple inverse problems (code available at https://gitlab.informatik.uni-bremen.de/carndt/analytic_deep_prior). We find that the numerical solutions of both methods are mostly similar to each other, which is important for using the ADP theory to interpret DIP. However, some interesting disparities between the different numerical approaches can also be observed. This illustrates a crucial difference between the analytical definition of DIP as a minimization problem and its numerical implementation as a gradient descent iteration.

2 Preliminaries and methods

We consider an inverse problem based on the operator equation

Ax^{\dagger}=y^{\dagger} (2.1)

where we want to recover the unknown ground truth $x^{\dagger}$ as well as possible. The data $y^{\dagger}$ is typically not known exactly; instead, we only have access to noisy data $y^{\delta}$.

Assumption 2.1.

We make the following assumptions for the inverse problem (2.1).

  • Let $X,Y$ be Hilbert spaces and $A\in L(X,Y)$.

  • There exists $x^{\dagger}\in X$ and, for a given $\delta>0$, it holds that $\|y^{\delta}-y^{\dagger}\|\leqslant\delta$ for $y^{\delta}\in Y$.

  • Let $R\colon X\to[0,\infty]$ be a convex, coercive and weakly lower semicontinuous functional with $R\not\equiv\infty$.

We recall the definition of Bregman distances, which we will use in Theorem 3.12 for a convergence result similar to the ones in [6, 21].

Definition 2.2 (Bregman distance).

For a convex functional $R\colon X\to[0,\infty]$ with subdifferential $\partial R$ and $\tilde{x},x\in X$, the Bregman distance is defined as the set

D_{R}(\tilde{x},x)=\{R(\tilde{x})-R(x)-\langle p,\tilde{x}-x\rangle\,|\,p\in\partial R(x)\}. (2.2)
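As a simple illustration (not spelled out in the original text): for the standard penalty $R(x)=\frac{1}{2}\|x\|^{2}$, the subdifferential is the singleton $\partial R(x)=\{x\}$ and the Bregman distance reduces to the squared norm distance,

D_{R}(\tilde{x},x)=\left\{\tfrac{1}{2}\|\tilde{x}\|^{2}-\tfrac{1}{2}\|x\|^{2}-\langle x,\tilde{x}-x\rangle\right\}=\left\{\tfrac{1}{2}\|\tilde{x}-x\|^{2}\right\}.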

The DIP approach (introduced in [25]) for the inverse problem (2.1) consists of solving

\min_{\theta}\frac{1}{2}\|A\varphi_{\theta}(z)-y^{\delta}\|^{2} (2.3)

via gradient descent w.r.t. the parameters $\theta$ of a neural network $\varphi_{\theta}$, as already described in the introduction. Despite the use of a neural network, DIP is a model-based approach and not a data-based one. To derive the ADP approach, we have to make two assumptions (see [13] for details).

The first one is choosing $\varphi_{\theta}$ to be a LISTA-like network [17], which consists of several layers of the form

x^{l+1}=S_{\alpha\lambda}(x^{l}-\lambda B^{*}(Bx^{l}-y^{\delta})), (2.4)

where $B=\theta$ is the trainable parameter. Originally, this architecture is inspired by ISTA [12], an algorithm for finding sparse solutions of the inverse problem (2.1), in which case $S_{\alpha\lambda}$ is the shrinkage function. More generally, we can choose $S_{\alpha\lambda}$ to be the proximal mapping of a penalty functional $R$. Then, (2.4) equals a proximal forward-backward splitting algorithm [11, Theorem 3.4], which converges to the solution of the minimization problem

\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x). (2.5)
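As an illustration of the layer update (2.4) and its limit (2.5), the following NumPy sketch iterates ISTA for the sparsity penalty $R(x)=\|x\|_{1}$, for which $S_{\alpha\lambda}$ is the soft-thresholding (shrinkage) function. The matrix B, the data y_delta and the step size lam are placeholder inputs; this is not the code accompanying the paper.

import numpy as np

def soft_threshold(x, t):
    # proximal mapping of t*||.||_1, i.e. the shrinkage function S_t
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(B, y_delta, alpha, lam, n_iter=500):
    # iterates x^{l+1} = S_{alpha*lam}(x^l - lam * B^T (B x^l - y_delta));
    # for lam <= 1/||B||^2 this converges to a minimizer of
    # 0.5*||Bx - y_delta||^2 + alpha*||x||_1, cf. (2.5)
    x = np.zeros(B.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - lam * B.T @ (B @ x - y_delta), alpha * lam)
    return x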

The second assumption is letting the number of layers tend to infinity. This might be difficult to realize in practice (see section 4), but it causes the output $\varphi_{\theta}(z)$ of the network to be a solution of (2.5). Therefore, the ADP model (introduced in [13]) is defined as

\begin{split}&\min_{B\in L(X,Y)}\frac{1}{2}\|Ax(B)-y^{\delta}\|^{2}\\ \text{s.t.}\quad x(B)=\,&\mathrm{arg}\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x).\end{split} (2.6)

While DIP is about optimizing the weights of a neural network, ADP is about optimizing the forward operator in a Tikhonov functional. If we add an additional regularization term for the operator $B$, we get the (new) ADP-$\beta$ model

\begin{split}\min_{B\in L(X,Y)}\frac{1}{2}\|Ax(B)-y^{\delta}\|^{2}+\beta\|B-A\|^{2}\\ \text{s.t.}\quad x(B)=\mathrm{arg}\min_{x\in X}\,\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x).\end{split} (2.7)

The reason for this modification will be explained in section 3.2.

To guarantee uniqueness of $x(B)$, the functional $R$ should be strictly convex, but this is not always required. If we assume $R$ to be even strongly convex, then $x(B)$ depends continuously on $B$, as the following theorem states. This will be useful for proving existence and stability results for ADP-$\beta$.

Theorem 2.3.

Let $R\colon X\to[0,\infty]$ be a strongly convex, coercive and weakly lower semicontinuous functional. Then

x(B)=\mathrm{arg}\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x) (2.8)

depends continuously on $B\in L(X,Y)$.

The proof can be found in Appendix A.1.

3 Theoretical results

3.1 Equivalence to classical methods

DIP solutions of inverse problems are naturally restricted to be the output of a neural network. Analogously, only elements of the set

U_{\alpha R}=\left\{\hat{x}\in X\,\bigg{|}\,\exists B\in L(X,Y):\hat{x}=\mathrm{arg}\min_{x\in X}\,\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x)\right\} (3.1)

can be solutions of the ADP approach. By definition,

\min_{x\in U_{\alpha R}}\frac{1}{2}\|Ax-y^{\delta}\|^{2} (3.2)

is equivalent to the original ADP problem (2.6). To get a better understanding of this minimization problem, we investigate $U_{\alpha R}$. It will turn out that the set $U_{\alpha R}$ can be characterized in a much simpler way, even without using an operator $B\in L(X,Y)$. For this purpose, we formulate the following lemmas.

Lemma 3.1.

Let $R\colon X\to[0,\infty]$ be a convex, coercive and weakly lower semicontinuous functional and let $\hat{x}\in X$, $y^{\delta}\in Y$, $y^{\delta}\neq 0$ and $\alpha>0$ be arbitrary. If there exists $v\in\partial R(\hat{x})$ such that

\alpha\langle v,\hat{x}\rangle\leqslant\frac{\|y^{\delta}\|^{2}}{4} (3.3)

holds, then there exists a linear operator $B\in L(X,Y)$ which fulfills

\hat{x}=\mathrm{arg}\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x). (3.4)

The proof can be found in Appendix A.2.

Lemma 3.2.

Let $R\colon X\to[0,\infty]$ be a convex, coercive and weakly lower semicontinuous functional and let $\hat{x}\in X$, $y^{\delta}\in Y$, $\alpha>0$ be arbitrary. If for every $v\in\partial R(\hat{x})$

\alpha\langle v,\hat{x}\rangle>\frac{\|y^{\delta}\|^{2}}{4} (3.5)

holds, then there exists no linear operator $B\in L(X,Y)$ which fulfills

\hat{x}=\mathrm{arg}\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x). (3.6)

The proof can be found in Appendix A.3. For given $y^{\delta}\in Y$, $\hat{x}\in X$ and a penalty term $R$, these lemmas state whether there exists a linear forward operator $B\colon X\to Y$ such that $\hat{x}$ is the Tikhonov solution w.r.t. $R$ of the inverse problem w.r.t. $y^{\delta}$. As a consequence, we can write the ADP minimization problem with a much simpler side constraint.

Theorem 3.3.

Let Assumption 2.1 hold. Then, for all $y^{\delta}\in Y$, $y^{\delta}\neq 0$, $\alpha>0$, the formulation

\begin{split}&\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}\\ \text{s.t.}\quad&\exists v\in\partial R(x):\quad\alpha\langle v,x\rangle\leqslant\frac{\|y^{\delta}\|^{2}}{4}\end{split} (3.7)

is equivalent to the ADP problem (2.6).

Proof.

According to Lemma 3.1 and Lemma 3.2, there exists a linear operator $B\in L(X,Y)$ such that

\hat{x}=\mathrm{arg}\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x) (3.8)

if and only if $\hat{x}$ fulfills the side constraint of (3.7). ∎

Remark 3.4.

For the standard Tikhonov penalty term $R(x)=\frac{1}{2}\|x\|^{2}$, it holds that $\partial R(x)=\{x\}$. In this case we get

\begin{split}\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}\\ \text{s.t.}\quad\|x\|^{2}\leqslant\frac{\|y^{\delta}\|^{2}}{4\alpha}\end{split} (3.9)

as an equivalent formulation of the ADP problem (2.6). For $r=\|y^{\delta}\|^{2}/(4\alpha)$, this equals the Ivanov regularization method

\begin{split}\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}\\ \text{s.t.}\quad\|x\|^{2}\leqslant r.\end{split} (3.10)

As [36] shows, this method is in fact equivalent to the Tikhonov method

\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}+\frac{\tilde{\alpha}}{2}\|x\|^{2} (3.11)

for some $\tilde{\alpha}$ depending on $y^{\delta}$ and $r$. We note that the Tikhonov parameter $\tilde{\alpha}$ may be equal to zero and that it differs in particular from the parameter $\alpha$ of the ADP problem (see section 3.2).
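This equivalence can also be checked numerically in a discretized setting. The following sketch (with placeholder names, not code from the paper) solves the Ivanov problem (3.10) by searching for the Tikhonov parameter $\tilde{\alpha}$ whose solution meets the constraint; if the unconstrained least-squares solution already satisfies $\|x\|^{2}\leqslant r$, then $\tilde{\alpha}=0$.

import numpy as np

def tikhonov(A, y, alpha):
    # closed-form Tikhonov solution (A^T A + alpha*I)^{-1} A^T y
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

def ivanov_via_tikhonov(A, y, r, tol=1e-10):
    # min 0.5*||Ax - y||^2 s.t. ||x||^2 <= r, solved via the equivalent
    # Tikhonov problem; ||x_alpha||^2 decreases in alpha, so bisection works
    x_ls = np.linalg.lstsq(A, y, rcond=None)[0]
    if x_ls @ x_ls <= r:
        return x_ls, 0.0                 # constraint inactive, alpha_tilde = 0
    lo, hi = 0.0, 1.0
    while tikhonov(A, y, hi) @ tikhonov(A, y, hi) > r:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tikhonov(A, y, mid) @ tikhonov(A, y, mid) > r:
            lo = mid
        else:
            hi = mid
    return tikhonov(A, y, hi), hi        # reconstruction and alpha_tilde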

Remark 3.5.

There are also cases in which the side constraint of (3.7) defines a non-convex feasible set. Then the ADP problem is more difficult to solve. We give a simple two-dimensional example with the penalty term $R\colon\mathbb{R}^{2}\to[0,\infty)$,

R(x_{1},x_{2})=\begin{cases}3\cdot|x_{1}-5|&\text{for }3\cdot|x_{1}-5|\geqslant|x_{2}|,\\ |x_{2}|&\text{for }|x_{2}|>3\cdot|x_{1}-5|.\end{cases} (3.12)

This functional has a non-centered minimum at $(5,0)^{\mathrm{T}}$, and the magnitude of its subgradients $|\partial R(x)|$ depends strongly on the direction. Because of these properties, it is easy to show that the term $\langle v,x\rangle$, $v\in\partial R(x)$, in the side constraint of (3.7) is non-convex w.r.t. $x\in\mathbb{R}^{2}$.

3.2 Parameter choice and early stopping

By construction, we expect the ADP model to act like DIP in applications. But the previous section showed that ADP in fact behaves equivalently to classical methods like Tikhonov's. When we apply ADP to an inverse problem, the question arises whether ADP can also deliver something that is "new" and not equivalent to a Tikhonov solution. This section presents how the model has to be changed to produce ADP solutions that are more similar to DIP solutions. At the same time, we derive a strategy for choosing the parameter $\alpha$ of the ADP model.

When we compare the ADP method

\begin{split}&\min_{B\in L(X,Y)}\frac{1}{2}\|Ax(B)-y^{\delta}\|^{2}\\ \text{s.t.}\quad&x(B)=\mathrm{arg}\min_{x\in X}\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\frac{\alpha_{\text{ADP}}}{2}\|x\|^{2}\end{split} (3.13)

to the equivalent (see Remark 3.4) Tikhonov method

\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}+\frac{\tilde{\alpha}}{2}\|x\|^{2}, (3.14)

we have to be careful not to confuse the parameters $\alpha_{\text{ADP}}$ and $\tilde{\alpha}$ of the two models with each other. First, we state the following relation between these parameters.

Lemma 3.6.

If the solutions of (3.13) and (3.14) coincide, then $\tilde{\alpha}\leqslant\alpha_{\text{ADP}}$ holds. Equality of the parameters can only occur if $y^{\delta}$ lies in the kernel of $A^{*}$ or is a singular vector of $A$.

The proof can be found in Appendix A.4. In general, we can assume that $\tilde{\alpha}<\alpha_{\text{ADP}}$ holds. So in any application it makes sense to choose the ADP parameter larger than one would choose the parameter of a Tikhonov model. But regardless of the parameter choice, the ADP solution will always be equivalent to a Tikhonov solution (Remark 3.4). To make ADP more similar to DIP, we apply early stopping [15, section 7.8]. This strategy is often used in applications of DIP but has not been considered for the ADP model yet.

For a given inverse problem, we could solve the ADP problem (3.13) with a gradient descent algorithm w.r.t. the operator $B$ (see section 4 for details) and terminate this iteration early. Taking $B^{0}=A$ as the initial value leads by definition to $x(B^{0})$ being the Tikhonov solution w.r.t. the parameter $\alpha_{\text{ADP}}$. We assume that the iteration converges to the minimizer $\hat{x}$ of (3.13). Since $\hat{x}$ is also the minimizer of (3.14), the limit of the iteration is again a Tikhonov solution, but w.r.t. the parameter $\tilde{\alpha}$. Because of $\tilde{\alpha}<\alpha_{\text{ADP}}$, the starting point $x(B^{0})$ is a more strongly regularized Tikhonov solution than the limit $\hat{x}$ of the iteration (see figure 1).

Figure 1: Comparison of ADP solutions during gradient descent with the Tikhonov method (orange: at the start of the gradient descent, red: with early stopping, gray: limit of the gradient descent). The forward operator is an integration as in (4.1) and the data $y^{\delta}$ is contaminated with Gaussian noise (PSNR = 40). The Tikhonov parameter and the stopping criterion for ADP were chosen a posteriori to achieve the most accurate reconstructions.

If we apply early stopping, we take some $x(B^{k})$ in between, which in general does not equal a Tikhonov solution w.r.t. $A$ (see figure 1). This strategy makes sense if we expect $x(B^{k})$ to be a better solution than the Tikhonov solutions $x(B^{0})$ and $\hat{x}$. That can be the case if $x(B^{0})$ is slightly over-regularized and $\hat{x}$ is slightly under-regularized, because then the optimal regularization lies in between.

Now we come back to the parameter choice. If we have a criterion for estimating a suitable Tikhonov parameter $\alpha_{\text{Tik}}$ for a given inverse problem, we should try to choose $\alpha_{\text{ADP}}$ such that

\tilde{\alpha}<\alpha_{\text{Tik}}<\alpha_{\text{ADP}} (3.15)

holds, because then $x(B^{0})$ will be slightly over-regularized and $\hat{x}$ slightly under-regularized, as proposed.

In the example of figure 1, we see that the ADP solution $x(B^{k})$ obtained with early stopping is a better approximation of the ground truth $x^{\dagger}$ than the most accurate Tikhonov solution, which corresponds to $\alpha_{\text{Tik}}$. But this result depends strongly on the particular inverse problem. The Tikhonov method is optimal for normally distributed data. If the given distribution differs from that, it is theoretically possible that ADP with early stopping produces a better solution than the Tikhonov method.

Finally, we want to include the early stopping strategy directly in the ADP model to be able to investigate its effect on the regularization of inverse problems. Early stopping forces the iterated variable to stay close to its initial value. Because of $B^{0}=A$, we can expect $\|B^{k}-A\|$ to be small for small $k$. This leads to using $\|B-A\|$ as an additional penalty term in the ADP problem, which has an effect similar to early stopping [4, section 2.3], [33, section 4]. The result is the ADP-$\beta$ model

\begin{split}\min_{B\in L(X,Y)}\frac{1}{2}\|Ax(B)-y^{\delta}\|^{2}+\beta\|B-A\|^{2}\\ \text{s.t.}\quad x(B)=\mathrm{arg}\min_{x\in X}\,\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x).\end{split} (3.16)
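A minimal sketch of how (3.16) can be minimized by gradient descent on $B$ is given below for the quadratic penalty $R(x)=\frac{1}{2}\|x\|^{2}$, in which case the inner minimizer has the closed form $x(B)=(B^{*}B+\alpha\,\mathrm{Id})^{-1}B^{*}y^{\delta}$ and automatic differentiation can propagate through it. The Frobenius norm is used for the $\beta$-term, and all names and parameter values are illustrative assumptions rather than the paper's implementation.

import torch

def adp_beta_descent(A, y_delta, alpha, beta, n_iter=500, lr=1e-2):
    # gradient descent on B for 0.5*||A x(B) - y_delta||^2 + beta*||B - A||_F^2
    # with x(B) = (B^T B + alpha*I)^{-1} B^T y_delta, i.e. R(x) = 0.5*||x||^2
    n = A.shape[1]
    I = torch.eye(n)
    B = A.clone().requires_grad_(True)          # initial value B^0 = A
    optimizer = torch.optim.SGD([B], lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        x_B = torch.linalg.solve(B.T @ B + alpha * I, B.T @ y_delta)
        loss = 0.5 * torch.sum((A @ x_B - y_delta) ** 2) \
               + beta * torch.sum((B - A) ** 2)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return torch.linalg.solve(B.T @ B + alpha * I, B.T @ y_delta)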

3.3 Properties of ADP

The equivalence between ADP and the Ivanov method (with a general convex penalty term $R$), shown in section 3.1, allows us to obtain several regularization properties (existence, stability, convergence) for ADP. We suppose that Assumption 2.1 holds. In addition, the functional

\tilde{R}(x)=\min_{v\in\partial R(x)}\langle v,x\rangle (3.17)

is assumed to be well-defined, because then the side constraint of (3.7) can be formulated as

\tilde{R}(x)\leqslant\frac{\|y^{\delta}\|^{2}}{4\alpha}. (3.18)

Due to $\langle v,x\rangle=R(x)+R^{*}(v)$ for $v\in\partial R(x)$, where $R^{*}$ denotes the convex conjugate functional, coercivity of $R$ implies coercivity of $\tilde{R}$.
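For instance (a computation not carried out in the text), for the elastic net penalty $R(x)=\alpha_{1}\|x\|_{1}+\frac{\alpha_{2}}{2}\|x\|^{2}$ used later in section 4, every subgradient $v\in\partial R(x)$ satisfies $\langle v,x\rangle=\alpha_{1}\|x\|_{1}+\alpha_{2}\|x\|^{2}$, so that

\tilde{R}(x)=\alpha_{1}\|x\|_{1}+\alpha_{2}\|x\|^{2}

and the side constraint (3.18) becomes a constraint on this elastic-net-type functional.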

Remark 3.7 (Existence).

A solution of the ADP problem (2.6) exists if the functional $\tilde{R}$, defined in (3.17), is weakly lower semicontinuous. This follows from the equivalence Theorem 3.3 and [37, Theorem 2.1] about the existence of Ivanov solutions.

Uniqueness of solutions and stability w.r.t. the data $y^{\delta}$ are less trivial. First, the right-hand side of the side constraint (3.18) depends on $y^{\delta}$, which is not the case for ordinary Ivanov problems. Secondly, we know from Remark 3.5 that the constraint (3.18) does not always define a convex feasible set. Nevertheless, for the special case $R(x)=\frac{1}{2}\|x\|^{2}$ we can obtain a convenient stability result. In this case, $\tilde{R}(x)=\|x\|^{2}$ is a strictly convex functional. Additionally, if the given inverse problem is ill-posed, we can assume the ADP solutions to fulfill the constraint (3.18) with equality. Under these conditions, the following theorem provides stability of ADP.

Theorem 3.8 (Stability).

For $R(x)=\frac{1}{2}\|x\|^{2}$, let $(y_{k})\subset Y$ be a sequence with $y_{k}\to\hat{y}\in Y$ and assume that the corresponding ADP solutions $x_{k}$, $\hat{x}$ are unique and fulfill the side constraint in (3.9) with equality. Then ADP is stable, which means $x_{k}\to\hat{x}$.

The proof can be found in Appendix A.5.

To obtain a convergence result for ADP, it makes sense to use standard convergence theorems, either for the Tikhonov method [21, Theorem 4.4] or for the Ivanov method [23, Theorem 2.5], [31, Theorem 3]. They differ especially in the source conditions they require for the ground truth $x^{\dagger}$ and in the parameter choice rules. If we assume $\tilde{R}$ to be convex, then by [36, Theorem 2] and the equivalence Theorem 3.3, the Tikhonov problem

\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}+\tilde{\alpha}\tilde{R}(x) (3.19)

is equivalent to the ADP formulation (3.7) for a suitably chosen $\tilde{\alpha}\geqslant 0$.

Remark 3.9 (Convergence).

Because of the equivalence between (3.7) and (3.19), the convergence of ADP solutions $x_{\alpha}^{\delta}$ to $x^{\dagger}$ for vanishing $\delta$ w.r.t. the Bregman distance can be derived directly from Tikhonov convergence theorems. But the ADP parameter $\alpha$ does not coincide with the Tikhonov parameter $\tilde{\alpha}$. Therefore, for ADP we do not get an explicit parameter choice rule like $\alpha\sim\delta$. Moreover, the source condition for $x^{\dagger}$ has to be fulfilled by the functional $\tilde{R}$ (defined in (3.17)) and not by the penalty term $R$.

3.4 Properties of ADP-$\beta$

For proving the existence of solutions of variational regularization schemes, [21, Theorem 3.1] provides a useful framework. If we want to apply it to ADP-$\beta$, it has to be ensured that $B\mapsto x(B)$ is weak-weak continuous [21, Assumptions 2.1]. Unfortunately, in general this is not the case.

To obtain convenient regularization properties anyway, we restrict ourselves to $X=Y=L^{2}(\Omega)$ with $\Omega\subset\mathbb{R}^{n}$. In this setting, we consider a forward operator $A\colon X\to Y$ that can be parametrized by a function $f\in L^{p}(\Omega)$, $p\in[1,\infty)$. More precisely, we take a continuous bilinear operator $T\colon L^{p}(\Omega)\times X\to Y$ and define

Ax=T(f,x). (3.20)

The same parametrization of operators by functions is used in [5]. A typical example is a convolution operator $T(f,x)=f*x$.

The crucial idea is the additional restriction $f\in W^{1,p}(\Omega)$, which allows us to take advantage of the compact embedding of the Sobolev space $W^{1,p}(\Omega)\subset L^{p}(\Omega)$. A similar strategy is used in [24] for achieving weak-weak continuity of the forward operator.

We define the parametrized ADP-$\beta$ approach as

\begin{split}\min_{g\in W^{1,p}}\frac{1}{2}\|T(f,x_{g})-y^{\delta}\|_{L^{2}}^{2}&+\beta\|f-g\|_{W^{1,p}}^{2}\\ \text{s.t.}\quad x_{g}=\mathrm{arg}\min_{x\in L^{2}}\,\frac{1}{2}\|T(g,x)-y^{\delta}\|_{L^{2}}^{2}&+\alpha R(x).\end{split} (3.21)

In particular, this can be interpreted as a Tikhonov method for solving the nonlinear inverse problem $F(g^{\dagger})=y^{\dagger}$ with the forward operator $F\colon W^{1,p}(\Omega)\to Y$, $F(g)=T(f,x_{g})$.

Remark 3.10 (Existence).

The forward operator $F$ is weak-strong continuous if the penalty term $R$ is strongly convex. This holds because weak convergence $g_{k}\rightharpoonup g$ in $W^{1,p}(\Omega)$ implies convergence in norm in $L^{p}(\Omega)$, Theorem 2.3 then yields $x_{g_{k}}\to x_{g}$, and the bilinear operator $T$ is continuous. This is more than enough to fulfill the assumptions of [21, Theorem 3.1], which provides the existence of a solution of (3.21).

A weak stability result for the parametrized ADP-$\beta$ method could be obtained directly from [21, Theorem 3.2]. But this particular framework even allows us to prove strong stability.

Theorem 3.11 (Stability).

For $p=2$, let $R$ be a strongly convex penalty term, $(y_{k})\subset Y$ a convergent sequence with $y_{k}\to\hat{y}$ and $(g_{k})\subset W^{1,p}(\Omega)$ the corresponding solutions of the ADP-$\beta$ problem (3.21). Then $(g_{k})$ has a convergent subsequence and the limit of each convergent subsequence is an ADP-$\beta$ solution corresponding to $\hat{y}$.

The proof can be found in Appendix A.6.

While proving existence and stability of ADP-$\beta$ solutions required a suitable parametrization and the use of compact embeddings, a convergence theorem (w.r.t. the Bregman distance) can be proven for the general formulation (2.7). Similar to classical results like [21, Theorem 4.4] or [6, Theorem 2], we need to assume a source condition

\exists w\in Y:\quad A^{*}w\in\partial R(x^{\dagger}). (3.22)

The parameter $\beta$ turns out to be very helpful for obtaining a convergence result.

Theorem 3.12 (Convergence).

Let Assumption 2.1 hold, let $x^{\dagger}$ be an $R$-minimizing solution of (2.1) which fulfills the source condition (3.22), and assume that there exist ADP-$\beta$ solutions $\hat{x}_{\alpha}^{\delta}$ of (2.7). If $\alpha$ is chosen proportional to $\delta$, there exists $d\in D_{R}(\hat{x}_{\alpha}^{\delta},x^{\dagger})$ which fulfills $d=O(\delta)$.

The proof can be found in Appendix A.7.

4 Numerical Computations

Aim. We want to see whether the similarity between ADP and DIP also holds on the numerical side. The ADP approach is based on the idea of using a LISTA network in a DIP method. Usually LISTA architectures contain around ten layers, but ADP is motivated by a network of infinite depth (see section 2). To derive the ADP model, the output of this infinite network is then replaced by the solution of a minimization problem. So the question arises whether numerically computed ADP solutions of an inverse problem are indeed similar to solutions obtained via DIP with a LISTA architecture.

In this section, we present algorithms for the computation of ADP solutions and compare them with DIP solutions. In doing so, the focus is not on the performance of the methods (in comparison to other state-of-the-art reconstruction algorithms) but on the similarity of the different solutions.

Methods. From Theorem 3.3, we know that the ADP problem is equivalent to an Ivanov problem. This provides a way to compute ADP solutions easily, quickly and almost exactly (we call this method ADP Ivanov). In contrast, it is more difficult to realize a LISTA architecture with infinite depth. But there are at least two possibilities for simulating such a network.

The first idea (Algorithm 1: DIP LISTA $L=\infty$) is to begin with a network $\varphi_{B}$ of ten layers and to increase the network depth during the training process of the DIP. This is done implicitly with a simple trick. In each training step, the network's input is set to be the network's output of the previous step [13, Appendix 3]. So the original input passes through more and more layers, and in each step the last ten layers are optimized (via backpropagation).

initialize $B_{0}$, $z_{0}$ (e.g. $B_{0}=A$, $z_{0}=$ random noise);
for $k=0,1,\dots$ do
       $z_{k+1}=\varphi_{B_{k}}(z_{k})$;
       $\mathrm{loss}_{k}=\frac{1}{2}\|A\varphi_{B_{k}}(z_{k})-y^{\delta}\|^{2}$;
       $B_{k+1}=\mathrm{update}(\nabla_{B_{k}}\mathrm{loss}_{k})$;
end for
return $z_{k}$
Algorithm 1: DIP LISTA $L=\infty$
initialize $B_{0}$ (e.g. $B_{0}=A$);
for $k=0,1,\dots$ do
       1. calculate $x(B_{k})$ with a fixed point iteration;
       2. the IFT provides $\nabla_{B_{k}}x(B_{k})$;
       3. update:
       $\mathrm{loss}_{k}=\frac{1}{2}\|Ax(B_{k})-y^{\delta}\|^{2}$;
       $B_{k+1}=\mathrm{update}(\nabla_{B_{k}}\mathrm{loss}_{k})$;
end for
return $x(B_{k})$
Algorithm 2: ADP IFT

The second idea (Algorithm 2: ADP IFT) is to compute $x(B)$ from (2.6) with a classical algorithm like ISTA. After that, one can compute the gradient of $x(B)$ w.r.t. $B$ (see the proof of [13, Lemma 4.1]) via the implicit function theorem (IFT). Thus, backpropagation through a large number of layers is avoided.

For the standard DIP approach, we use a LISTA-like architecture of ten layers (DIP LISTA $L=10$) and optimize the weights via backpropagation. So, in total, we compare four different methods (ADP Ivanov, ADP IFT, DIP LISTA $L=\infty$, DIP LISTA $L=10$). Since solving the Ivanov problem yields the exact ADP solution, we use it as a reference for the other three methods (for which we have no convergence guarantees).

In all methods we use the elastic net functional [38] $R(x)=\alpha_{1}\|x\|_{1}+\frac{\alpha_{2}}{2}\|x\|^{2}$ as a penalty term. So there is one parameter for $\ell^{1}$-regularization (which promotes sparsity) and one parameter for $\ell^{2}$-regularization (which promotes stability and smoothness). In the LISTA architecture, this is realized by subtracting the gradient of the $\ell^{2}$-term before applying the activation function.
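A single layer of such a network could be sketched as follows; this is only an illustration of the construction just described (placeholder names and step size, not the paper's code), reusing the soft-thresholding function from section 2.

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lista_layer(x, B, y_delta, alpha1, alpha2, lam):
    # gradient step on 0.5*||Bx - y_delta||^2 + (alpha2/2)*||x||^2,
    # followed by soft-thresholding as the activation (the l^1 proximal mapping)
    grad = B.T @ (B @ x - y_delta) + alpha2 * x
    return soft_threshold(x - lam * grad, lam * alpha1)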

Setting. We consider two artificial inverse problems (inversion of the integration operator and a deconvolution) on $L^{2}(I)$ for an interval $I\subset\mathbb{R}$. The forward operators are

(A_{1}x)(t)=\int_{0}^{t}x(s)\,\mathrm{d}s\quad\text{ and }\quad A_{2}x=g*x, (4.1)

where $g$ is a Gaussian function. Both lead to ill-posed inverse problems. We chose three different ground truth functions and created data by applying the forward operators and adding normally distributed random noise. This leads to six examples in total, which is enough for some basic observations. Figure 2 shows the reconstructions corresponding to the integration operator $A_{1}$. The three rows contain the three different ground truth functions and each column contains a different method. For comparison, the actual ADP solution (ADP Ivanov) and the ground truth are displayed in every plot. Since we are only interested in finding similarities and disparities between the solutions of the different methods, the choice of the regularization parameters plays a minor role. We therefore took the same values $\alpha_{1}$, $\alpha_{2}$ for each method and simply chose them a posteriori for each example to minimize the $L^{2}$-error between reconstruction and ground truth. Figure 3 shows the analogous results for the deconvolution problem (forward operator $A_{2}$).
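A possible discretization of the two forward operators in (4.1) on a uniform grid of $[0,1]$ is sketched below; the grid size, the kernel width and the test function are arbitrary choices for illustration and are not taken from the paper's experiments.

import numpy as np

def integration_operator(n, h):
    # discretized A_1: cumulative integration as a lower triangular matrix
    return h * np.tril(np.ones((n, n)))

def convolution_operator(n, h, sigma=0.05):
    # discretized A_2: convolution with a Gaussian kernel of width sigma
    t = h * np.arange(n)
    K = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * sigma ** 2))
    return h * K / (sigma * np.sqrt(2.0 * np.pi))

# example: noisy data for a simple ground truth
n = 200
h = 1.0 / n
x_true = np.where(np.abs(h * np.arange(n) - 0.5) < 0.2, 1.0, 0.0)
A1 = integration_operator(n, h)
y_delta = A1 @ x_true + 0.01 * np.random.randn(n)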

Figure 2: Computation of ADP and DIP reconstructions via the IFT, via a DIP with LISTA network ($L=\infty$ and $L=10$) and via the equivalent Ivanov problem. The forward operator is $A_{1}$ (integration). The given data has PSNR = 40 due to additive Gaussian noise. The regularization parameters $\alpha_{1}$, $\alpha_{2}$ are chosen for each example (row) separately but are the same for each method (column).

Observations. From these experiments, we can make the following observations. There is a significant difference between using $L=10$ or $L=\infty$ layers in a LISTA network. With an infinite number of layers, the reconstructions look more realistic. The results of the DIP LISTA $L=\infty$ method (Algorithm 1) and of the IFT method (Algorithm 2) always look quite similar. This was expected because both of them simulate an infinitely deep LISTA network. The remaining differences are probably due to the different ways the gradients are computed or due to slow convergence of the methods.

Figure 3: Computation of ADP and DIP reconstructions via the IFT, via a DIP with LISTA network ($L=\infty$ and $L=10$) and via the equivalent Ivanov problem. The forward operator is $A_{2}$ (convolution). The given data has PSNR = 45 due to additive Gaussian noise. The regularization parameters $\alpha_{1}$, $\alpha_{2}$ are chosen for each example (row) separately but are the same for each method (column).

In most of the cases, the reconstructions of these two methods also look quite similar to the actual ADP solution. But sometimes they contain artifacts (e.g. the peaks in figure 3, third row). It seems that some spots are hard to reconstruct for the DIP methods while others are rather simple. Besides, the ADP problem (2.6) is not a convex minimization problem w.r.t. $B$. So there is no guarantee that the methods which perform gradient descent (DIP LISTA $L=\infty$ and the IFT method) converge towards the global minimizer. Figure 4 shows that the reconstructions of these methods indeed depend on the initial value $B_{0}$ of the algorithms. In contrast, the Ivanov problem from Theorem 3.3 is convex (with the elastic net penalty term $R$). That is probably why the actual ADP solutions are the only ones which never contain strange artifacts and the only ones that are always fairly good reconstructions of the ground truth.

Figure 4: The same setting and methods as in figure 3 but with different initial values $B_{0}$.

The easiest way to slightly improve the reconstruction quality is to apply early stopping. In doing so, the most severe artifacts in the reconstructions can be diminished. This corresponds to the ADP-$\beta$ approach (see section 3.2), whose additional convex term $\beta\|B-A\|^{2}$ is also a numerical advantage because it stabilizes the gradient descent for finding the minimizer. Indeed, adding the gradient of the $\beta$-term to the update in Algorithm 2 (ADP IFT) can also diminish the severe artifacts. But we do not include experimental results on this, since the most interesting part is the comparison with the equivalent Ivanov problem, which does not exist for ADP-$\beta$.

The main conclusion is the numerical verification of the derivation of the ADP problem from the DIP approach. Because of the similarity between the reconstructions from the different numerical methods, it is possible to use the theoretical analysis of the ADP problem for interpretations of the DIP approach. However, the examples from figure 4 illustrate that, although DIP can be formulated as a minimization problem (2.3), a numerically computed solution is not automatically a global minimizer of this problem. If early stopping is used, it is probably not even a local minimizer. Hence, there is a significant difference between the theoretical definition and the practical implementation of DIP.

5 Conclusion

ADP and ADP-$\beta$ were introduced as methods for solving ill-posed inverse problems in a typical Hilbert space setting (Assumption 2.1). Both of them are motivated by considering DIP with a LISTA-like architecture. The main result is the equivalence of ADP to the classical method of Ivanov regularization.

We have proven existence, stability and convergence results for both ADP and ADP-$\beta$. The obtained regularization properties are comparable to those of classical methods like Tikhonov's. In principle, these results can be transferred to DIP with LISTA-like networks. But due to the non-convexity of the DIP minimization problem, numerically computed DIP solutions can differ significantly from exact ADP solutions, although they are similar in many cases. We conclude that theoretical analyses of the DIP approach should consider the whole optimization process and not only the properties of the minimizer.

One very important aspect is the early stopping of the DIP optimization process. In the ADP setting, we incorporated this strategy with an additional penalty term, which resulted in the ADP-$\beta$ model. The effect of this regularization can be seen by comparing the convergence theorems of ADP and ADP-$\beta$. Theorem 3.12 provides a parameter choice rule ($\alpha\sim\delta$) for ADP-$\beta$, which is a big advantage over ADP.

A generalization of the ADP regularization results to DIP with general convolutional neural networks (CNNs) would be very desirable. The LISTA architecture was suitable because of its similarity to proximal splitting algorithms and the possibility to interpret its output as the solution of a variational problem. Finding similar connections for general CNNs is harder. However, in [7] CNNs are used to model proximal mappings and in [28] CNNs are interpreted as algorithms for sparse coding. Furthermore, [10] asserts that most common activation functions are in fact proximal mappings and establishes a theory for characterizing the fixed point sets of neural networks as solutions of variational inequalities. These directions could provide ideas for possible future extensions.

Acknowledgements

I want to thank Dr. Daniel Otero Baguer, Prof. Peter Maaß, Dr. Tobias Kluth and many other colleagues from the University of Bremen, as well as Dr. Yury Korolev from the University of Cambridge, for helpful advice and feedback.

Appendix A Proofs of theoretical results

A.1 Theorem 2.3

Continuity of $B\mapsto x(B)$.

Proof.

Let $(B_{k})\subset L(X,Y)$ be a sequence of operators with $B_{k}\to B\in L(X,Y)$. First, we note that the sequence $(x(B_{k}))$ is bounded because

\begin{split}\alpha R(x(B_{k}))\leqslant\frac{1}{2}\|B_{k}x(B_{k})-y^{\delta}\|^{2}+\alpha R(x(B_{k}))\leqslant\frac{1}{2}\|B_{k}x(B)-y^{\delta}\|^{2}+\alpha R(x(B))\end{split} (A.1)

holds. Further, we can estimate

\begin{split}&\quad\,\,\frac{1}{2}\|Bx(B_{k})-y^{\delta}\|^{2}+\alpha R(x(B_{k}))\\ &=\frac{1}{2}\|B_{k}x(B_{k})-y^{\delta}+(B-B_{k})\,x(B_{k})\|^{2}+\alpha R(x(B_{k}))\\ &\leqslant\frac{1}{2}(\|B_{k}x(B_{k})-y^{\delta}\|+\|(B-B_{k})\,x(B_{k})\|)^{2}+\alpha R(x(B_{k}))\\ &=\frac{1}{2}\|B_{k}x(B_{k})-y^{\delta}\|^{2}+\alpha R(x(B_{k}))\\ &\,+\|(B-B_{k})\,x(B_{k})\|\cdot\left(\|B_{k}x(B_{k})-y^{\delta}\|+\frac{1}{2}\|(B-B_{k})\,x(B_{k})\|\right).\end{split} (A.2)

Because of the boundedness of $(x(B_{k}))$ and the convergence $B_{k}\to B$, the term

\|(B-B_{k})\,x(B_{k})\|\cdot\left(\|B_{k}x(B_{k})-y^{\delta}\|+\frac{1}{2}\|(B-B_{k})\,x(B_{k})\|\right) (A.3)

converges to zero. For the remaining terms, we can estimate

\frac{1}{2}\|B_{k}x(B_{k})-y^{\delta}\|^{2}+\alpha R(x(B_{k}))\leqslant\frac{1}{2}\|B_{k}x(B)-y^{\delta}\|^{2}+\alpha R(x(B)), (A.4)

and $\|B_{k}x(B)-y^{\delta}\|^{2}$ converges to $\|Bx(B)-y^{\delta}\|^{2}$. Hence, $(x(B_{k}))$ is a minimizing sequence of the strongly convex functional $\frac{1}{2}\|Bx-y^{\delta}\|^{2}+\alpha R(x)$, whose minimizer is $x(B)$. By [27, Theorem 1], the minimizing sequence converges to the minimizer $x(B)$. ∎

A.2 Lemma 3.1

First part of the equivalence theorem for ADP to Ivanov problems.

Proof.

Let $\hat{x},v,y^{\delta},\alpha$ and $R$ be given according to the assumptions. We have to find a linear operator $B$ such that

-B^{*}(B\hat{x}-y^{\delta})\in\alpha\partial R(\hat{x}) (A.5)

holds. Because of $v\in\partial R(\hat{x})$, we simply try to solve the equation

-B^{*}(B\hat{x}-y^{\delta})=\alpha v (A.6)

for $B$. If $\hat{x}=0$, this is trivial. Otherwise, we can decompose $v$ into

v=\mu\hat{x}+v_{\bot}\qquad\text{s.t. }\langle v_{\bot},\hat{x}\rangle=0. (A.7)

Accordingly, $\mu={\langle v,\hat{x}\rangle}/{\|\hat{x}\|^{2}}$. With that, we can write the equation from above as

B^{*}B\hat{x}+\alpha\mu\hat{x}+\alpha v_{\bot}=B^{*}y^{\delta}. (A.8)

We consider a linear operator $B\colon X\to Y$ of the form

Bx=(\sigma_{1}\langle x,\hat{x}\rangle+\sigma_{2}\langle x,v_{\bot}\rangle)\cdot y^{\delta} (A.9)

with two coefficients $\sigma_{1}$ and $\sigma_{2}$ to be determined later. Then, the adjoint operator is given by

B^{*}y=\langle y,y^{\delta}\rangle(\sigma_{1}\hat{x}+\sigma_{2}v_{\bot}) (A.10)

and it holds that

B^{*}B\hat{x}=B^{*}((\sigma_{1}\|\hat{x}\|^{2})\cdot y^{\delta})=\sigma_{1}^{2}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}\hat{x}+\sigma_{1}\sigma_{2}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}v_{\bot}, (A.11)

B^{*}y^{\delta}=\sigma_{1}\|y^{\delta}\|^{2}\hat{x}+\sigma_{2}\|y^{\delta}\|^{2}v_{\bot}. (A.12)

To fulfill (A.8), we have to solve

\sigma_{1}^{2}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}\hat{x}+\sigma_{1}\sigma_{2}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}v_{\bot}+\alpha\mu\hat{x}+\alpha v_{\bot}=\sigma_{1}\|y^{\delta}\|^{2}\hat{x}+\sigma_{2}\|y^{\delta}\|^{2}v_{\bot}. (A.13)

Because $\hat{x}$ and $v_{\bot}$ are orthogonal to each other, we get the two equations

\sigma_{1}^{2}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}+\alpha\mu=\sigma_{1}\|y^{\delta}\|^{2}, (A.14)

\sigma_{1}\sigma_{2}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}+\alpha=\sigma_{2}\|y^{\delta}\|^{2}. (A.15)

Note that (A.15) and the coefficient $\sigma_{2}$ could be ignored if $v_{\bot}=0$ held.

Equation (A.14) can be solved for $\sigma_{1}$ with the quadratic formula, which leads to

\sigma_{1}=\frac{1}{2\|\hat{x}\|^{2}}\pm\sqrt{\frac{1}{4\|\hat{x}\|^{4}}-\frac{\alpha\mu}{\|\hat{x}\|^{2}\|y^{\delta}\|^{2}}}. (A.16)

Accordingly,

\frac{\alpha\mu}{\|y^{\delta}\|^{2}}\leqslant\frac{1}{4\|\hat{x}\|^{2}} (A.17)

must hold to obtain real solutions. We know from above that $\mu={\langle v,\hat{x}\rangle}/{\|\hat{x}\|^{2}}$. Inserting this, we see that (A.17) matches exactly the assumption of the lemma.

Now, equation (A.15) has to be solved for $\sigma_{2}$. Factoring out $\sigma_{2}$ leads to

\sigma_{2}(\sigma_{1}\|\hat{x}\|^{2}\|y^{\delta}\|^{2}-\|y^{\delta}\|^{2})+\alpha=0. (A.18)

If the term inside the parentheses does not equal zero, a solution $\sigma_{2}$ exists. If the term equaled zero, equation (A.14) would imply $\mu=0$. But in this case, we could choose $\sigma_{1}=0$ (the quadratic formula allows two solutions), and then it is no problem to find a solution for $\sigma_{2}$, too.

By finding solutions for $\sigma_{1}$ and $\sigma_{2}$, we have shown that the operator $B$ defined in (A.9) solves equation (A.6). This proves the lemma. ∎

A.3 Lemma 3.2

Second part of equivalence theorem for ADP to Ivanov problems.

Proof.

Let $\hat{x},y^{\delta},\alpha$ and $R$ be given according to the assumptions. Assume there exists a linear operator $B$ such that

0\in B^{*}(B\hat{x}-y^{\delta})+\alpha\partial R(\hat{x}) (A.19)

holds. It follows that

v:=-\frac{1}{\alpha}B^{*}(B\hat{x}-y^{\delta})\in\partial R(\hat{x}). (A.20)

We can calculate $\alpha\langle v,\hat{x}\rangle=-\|B\hat{x}\|^{2}+\langle y^{\delta},B\hat{x}\rangle$. So according to the assumptions,

-\|B\hat{x}\|^{2}+\langle y^{\delta},B\hat{x}\rangle>\frac{\|y^{\delta}\|^{2}}{4} (A.21)

must hold. But with Young's inequality, we get

-\|B\hat{x}\|^{2}+\langle y^{\delta},B\hat{x}\rangle\leqslant-\|B\hat{x}\|^{2}+\frac{1}{4}\|y^{\delta}\|^{2}+\|B\hat{x}\|^{2}=\frac{\|y^{\delta}\|^{2}}{4}. (A.22)

This is a contradiction, so such an operator $B$ cannot exist. ∎

A.4 Lemma 3.6

Relation between the ADP parameter and the Tikhonov parameter of the equivalent problem.

Proof.

Let $\hat{x}$ be the solution of the ADP problem (3.13). Because of the equivalence to the Tikhonov method, $\hat{x}$ is at the same time the solution of (3.14). Besides, $x(A)$ is the Tikhonov solution of the inverse problem w.r.t. the parameter $\alpha_{\text{ADP}}$. Because of the minimizing properties of $\hat{x}$ and $x(A)$,

\frac{1}{2}\|A\hat{x}-y^{\delta}\|^{2}\leqslant\frac{1}{2}\|Ax(A)-y^{\delta}\|^{2}, (A.23)

\frac{1}{2}\|Ax(A)-y^{\delta}\|^{2}+\frac{\alpha_{\text{ADP}}}{2}\|x(A)\|^{2}\leqslant\frac{1}{2}\|A\hat{x}-y^{\delta}\|^{2}+\frac{\alpha_{\text{ADP}}}{2}\|\hat{x}\|^{2} (A.24)

holds. It follows that $\|x(A)\|^{2}\leqslant\|\hat{x}\|^{2}$. Both $x(A)$ and $\hat{x}$ are Tikhonov solutions of the same problem (only with different parameters). So $\tilde{\alpha}\leqslant\alpha_{\text{ADP}}$ must hold because the norm of $\hat{x}$ is greater than or equal to the norm of $x(A)$.

Now, we assume $\tilde{\alpha}=\alpha_{\text{ADP}}>0$. By Remark 3.4, the problems

\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}+\frac{\alpha_{\text{ADP}}}{2}\|x\|^{2}, (A.25)

\min_{x\in X}\frac{1}{2}\|Ax-y^{\delta}\|^{2}\quad\text{s.t. }\|x\|^{2}\leqslant\frac{\|y^{\delta}\|^{2}}{4\alpha_{\text{ADP}}} (A.26)

are equivalent. The solution $\hat{x}$ fulfills

(A^{*}A+\alpha_{\text{ADP}}\cdot\mathrm{Id})\hat{x}=A^{*}y^{\delta} (A.27)

and we assume $A^{*}y^{\delta}\neq 0$. Then $\hat{x}$ must fulfill the side constraint of (A.26) with equality; otherwise $A^{*}A\hat{x}=A^{*}y^{\delta}$ would hold, which is a contradiction. Accordingly, we get $\alpha_{\text{ADP}}\|\hat{x}\|^{2}=\|y^{\delta}\|^{2}/4$ and, by taking the inner product of (A.27) with $\hat{x}$, it follows that

\|A\hat{x}\|^{2}+\frac{\|y^{\delta}\|^{2}}{4}=\|A\hat{x}\|^{2}+\alpha_{\text{ADP}}\|\hat{x}\|^{2}=\langle A\hat{x},y^{\delta}\rangle. (A.28)

Applying the Cauchy-Schwarz and Young's inequalities, we get

\langle A\hat{x},y^{\delta}\rangle\leqslant\|A\hat{x}\|\cdot\|y^{\delta}\|\leqslant\|A\hat{x}\|^{2}+\frac{\|y^{\delta}\|^{2}}{4}, (A.29)

which means these inequalities must in fact hold with equality. Therefore, $A\hat{x}$ and $y^{\delta}$ must be linearly dependent (Cauchy-Schwarz) and $2\|A\hat{x}\|=\|y^{\delta}\|$ must hold (Young). It follows that

A\hat{x}=\frac{1}{2}y^{\delta}. (A.30)

We can plug this into (A.27) and get

\alpha_{\text{ADP}}\hat{x}=\frac{1}{2}A^{*}y^{\delta}. (A.31)

Accordingly, $AA^{*}y^{\delta}=\alpha_{\text{ADP}}y^{\delta}$ holds, so $y^{\delta}$ is a singular vector of $A$. ∎

A.5 Theorem 3.8

Stability of the ADP approach.

Proof.

We follow some of the ideas of the proofs of [14, Theorem 2.1] and [31, Theorem 2].

Let $x_{k}$ and $\hat{x}$ be the unique solutions of (3.9) for $y^{\delta}=y_{k}$ and $y^{\delta}=\hat{y}$, respectively, with $y_{k}\to\hat{y}$. The sequence $(x_{k})$ is bounded, so there exists a weakly convergent subsequence $(x_{k_{l}})$, $x_{k_{l}}\rightharpoonup x_{\infty}$. For arbitrary $\varepsilon>0$ and $x\in X$ with $\|x\|^{2}\leqslant\|\hat{y}\|^{2}\cdot(4\alpha)^{-1}-\varepsilon$, it holds that

\|Ax_{\infty}-\hat{y}\|\leqslant\liminf_{l\to\infty}\|Ax_{k_{l}}-y_{k_{l}}\|\leqslant\lim_{l\to\infty}\|Ax-y_{k_{l}}\|=\|Ax-\hat{y}\| (A.32)

because $x_{k_{l}}$ minimizes the ADP problem w.r.t. $y_{k_{l}}$ and $x$ fulfills the side constraint for $l$ large enough. Letting $\varepsilon\to 0$ and using the uniqueness of the solutions, we obtain $x_{\infty}=\hat{x}$. Arguing with subsequences of subsequences yields the weak convergence $x_{k}\rightharpoonup\hat{x}$ of the whole sequence.

According to the assumptions, it holds that $\|x_{k}\|^{2}=\|y_{k}\|^{2}\cdot(4\alpha)^{-1}$. So $y_{k}\to\hat{y}$ implies $\|x_{k}\|\to\|\hat{x}\|$ and, together with the weak convergence, we finally obtain $x_{k}\to\hat{x}$. ∎

A.6 Theorem 3.11

Stability of the ADP-$\beta$ approach.

Proof.

First, we note that the sequence $(g_{k})$ is bounded in $W^{1,2}(\Omega)$. Hence, there exists at least one weakly convergent subsequence. For any subsequence with $g_{k}\rightharpoonup\hat{g}$, it holds that

T(f,x_{g_{k}})-y_{k}\,\to\,T(f,x_{\hat{g}})-\hat{y} (A.33)

because of the arguments from Remark 3.10. For arbitrary $g\in W^{1,2}(\Omega)$,

\begin{split}&\quad\,\frac{1}{2}\|T(f,x_{\hat{g}})-\hat{y}\|_{L^{2}}^{2}+\beta\|\hat{g}-f\|_{W^{1,2}}^{2}\\ &\leqslant\liminf_{k\to\infty}\frac{1}{2}\|T(f,x_{g_{k}})-y_{k}\|_{L^{2}}^{2}+\beta\|g_{k}-f\|_{W^{1,2}}^{2}\\ &\leqslant\lim_{k\to\infty}\frac{1}{2}\|T(f,x_{g})-y_{k}\|_{L^{2}}^{2}+\beta\|g-f\|_{W^{1,2}}^{2}\\ &=\frac{1}{2}\|T(f,x_{g})-\hat{y}\|_{L^{2}}^{2}+\beta\|g-f\|_{W^{1,2}}^{2}\end{split} (A.34)

holds because of the minimizing property of $g_{k}$ w.r.t. $y_{k}$. Hence, $\hat{g}$ is a minimizer of (3.21) w.r.t. $\hat{y}$. If we choose $g=\hat{g}$, the first and the last line in (A.34) coincide, and we get

\lim_{k\to\infty}\frac{1}{2}\|T(f,x_{g_{k}})-y_{k}\|_{L^{2}}^{2}+\beta\|g_{k}-f\|_{W^{1,2}}^{2}=\frac{1}{2}\|T(f,x_{\hat{g}})-\hat{y}\|_{L^{2}}^{2}+\beta\|\hat{g}-f\|_{W^{1,2}}^{2}. (A.35)

It follows that $\lim_{k\to\infty}\|g_{k}-f\|_{W^{1,2}}^{2}=\|\hat{g}-f\|_{W^{1,2}}^{2}$. Hence, $(g_{k})$ converges in norm to $\hat{g}$. ∎

A.7 Theorem 3.12

Convergence of the ADP-$\beta$ approach.

Proof.

According to (3.22), we can choose $d=R(\hat{x}_{\alpha}^{\delta})-R(x^{\dagger})-\langle A^{*}w,\hat{x}_{\alpha}^{\delta}-x^{\dagger}\rangle$, and there exists an operator $\hat{B}\in L(X,Y)$ which fulfills $\hat{x}_{\alpha}^{\delta}=x(\hat{B})$.

Because of the minimizing property of $\hat{x}_{\alpha}^{\delta}$,

\alpha R(\hat{x}_{\alpha}^{\delta})\leqslant\frac{1}{2}\|\hat{B}\hat{x}_{\alpha}^{\delta}-y^{\delta}\|^{2}+\alpha R(\hat{x}_{\alpha}^{\delta})\leqslant\frac{1}{2}\|\hat{B}x^{\dagger}-y^{\delta}\|^{2}+\alpha R(x^{\dagger}) (A.36)

holds. It follows that

\begin{split}d&=R(\hat{x}_{\alpha}^{\delta})-R(x^{\dagger})-\langle A^{*}w,\hat{x}_{\alpha}^{\delta}-x^{\dagger}\rangle\leqslant\frac{1}{2\alpha}\|\hat{B}x^{\dagger}-y^{\delta}\|^{2}-\langle w,A\hat{x}_{\alpha}^{\delta}-y^{\dagger}\rangle\\ &\leqslant\frac{1}{2\alpha}\left(\|\hat{B}x^{\dagger}-Ax^{\dagger}\|+\|y^{\dagger}-y^{\delta}\|\right)^{2}+\|w\|\|A\hat{x}_{\alpha}^{\delta}-y^{\dagger}\|\\ &\leqslant\frac{1}{2\alpha}\left(\|x^{\dagger}\|\|\hat{B}-A\|+\delta\right)^{2}+\|w\|\|A\hat{x}_{\alpha}^{\delta}-y^{\dagger}\|.\end{split} (A.37)

We will show $\|\hat{B}-A\|=O(\delta)$ and $\|A\hat{x}_{\alpha}^{\delta}-y^{\dagger}\|=O(\delta)$ to deduce $d=O(\delta)$ for $\alpha$ chosen proportional to $\delta$. Because of the minimizing property of $\hat{B}$, we get

\beta\cdot\|\hat{B}-A\|^{2}\leqslant\frac{1}{2}\|Ax(\hat{B})-y^{\delta}\|^{2}+\beta\cdot\|\hat{B}-A\|^{2}\leqslant\frac{1}{2}\|Ax(A)-y^{\delta}\|^{2}. (A.38)

From standard convergence results for the Tikhonov method [21, Theorem 4.4] or [6, Theorem 2], we get $\|Ax(A)-y^{\delta}\|=O(\delta)$. So $\|\hat{B}-A\|=O(\delta)$ holds.

Besides,

\|A\hat{x}_{\alpha}^{\delta}-y^{\dagger}\|\leqslant\|Ax(\hat{B})-y^{\delta}\|+\|y^{\delta}-y^{\dagger}\|\leqslant\|Ax(A)-y^{\delta}\|+\delta (A.39)

holds, and we can use $\|Ax(A)-y^{\delta}\|=O(\delta)$ again. So $d=O(\delta)$ follows. ∎

References

  • [1] G. S. Alberti, E. De Vito, M. Lassas, L. Ratti, and M. Santacesaria. Learning the optimal Tikhonov regularizer for inverse problems. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • [2] S. Arridge, P. Maass, O. Öktem, and C.-B. Schönlieb. Solving inverse problems using data-driven models. Acta Numerica, 28:1–174, 2019.
  • [3] D. O. Baguer, J. Leuschner, and M. Schmidt. Computed tomography reconstruction using deep image prior and learned reconstruction methods. Inverse Problems, 36(9):094004, 2020.
  • [4] C. Bishop. Regularization and complexity control in feed-forward networks, pages 141–148. EC2 et Cie, 1995. International Conference on Artificial Neural Networks ICANN’95.
  • [5] I. Bleyer and R. Ramlau. A double regularization approach for inverse problems with noisy data and inexact operator. Inverse Problems, 29:025004, 2013.
  • [6] M. Burger and S. Osher. Convergence rates of convex variational regularization. Inverse Problems, 20(5):1411–1421, 2004.
  • [7] E. Celledoni, M. J. Ehrhardt, C. Etmann, B. Owren, C.-B. Schönlieb, and F. Sherry. Equivariant neural networks for inverse problems. Inverse Problems, 37(8):085006, 2021.
  • [8] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40:120–145, 2011.
  • [9] Z. Cheng, M. Gadelha, S. Maji, and D. Sheldon. A Bayesian Perspective on the Deep Image Prior. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [10] P. L. Combettes and J.-C. Pesquet. Deep neural network structures solving variational inequalities. Set-Valued and Variational Analysis, 28:491–518, 2020.
  • [11] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
  • [12] I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
  • [13] S. Dittmer, T. Kluth, P. Maass, and D. Otero Baguer. Regularization by Architecture: A Deep Prior Approach for Inverse Problems. Journal of Mathematical Imaging and Vision, 62:456–470, 2020.
  • [14] H. W. Engl, K. Kunisch, and A. Neubauer. Convergence rates for Tikhonov regularisation of non-linear ill-posed problems. Inverse Problems, 5(4):523–540, 1989.
  • [15] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • [17] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 399–406, Madison, WI, USA, 2010. Omnipress.
  • [18] A. Habring and M. Holler. A generative variational model for inverse problems in imaging. SIAM Journal on Mathematics of Data Science, 4(1):306–335, 2022.
  • [19] R. Heckel and M. Soltanolkotabi. Compressive sensing with un-trained neural networks: Gradient descent finds a smooth approximation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4149–4158. PMLR, 2020.
  • [20] R. Heckel and M. Soltanolkotabi. Denoising and regularization via exploiting the structural bias of convolutional generators. In International Conference on Learning Representations, 2020.
  • [21] B. Hofmann, B. Kaltenbacher, C. Pöschl, and O. Scherzer. A convergence rates result for Tikhonov regularization in Banach spaces with non-smooth operators. Inverse Problems, 23(3):987–1010, 2007.
  • [22] V. Jain and S. Seung. Natural image denoising with convolutional networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
  • [23] B. Kaltenbacher and A. Klassen. On convergence and convergence rates for Ivanov and Morozov regularization and application to some parameter identification problems in elliptic PDEs. Inverse Problems, 34(5):055008, 2018.
  • [24] T. Kluth, C. Bathke, M. Jiang, and P. Maass. Joint super-resolution image reconstruction and parameter identification in imaging operator: analysis of bilinear operator equations, numerical solution, and application to magnetic particle imaging. Inverse Problems, 36(12):124006, 2020.
  • [25] V. Lempitsky, A. Vedaldi, and D. Ulyanov. Deep Image Prior. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
  • [26] H. Li, J. Schwab, S. Antholzer, and M. Haltmeier. NETT: solving inverse problems with deep neural networks. Inverse Problems, 36(6):065005, jun 2020.
  • [27] C. G. Looney. Convergence of minimizing sequences. Journal of Mathematical Analysis and Applications, 61(3):835–840, 1977.
  • [28] V. Papyan, Y. Romano, J. Sulam, and M. Elad. Theoretical foundations of deep learning via sparse representations: A multilayer sparse model and its connection to convolutional neural networks. IEEE Signal Processing Magazine, 35(4):72–89, 2018.
  • [29] Y. Romano, M. Elad, and P. Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
  • [30] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.
  • [31] T. I. Seidman and C. R. Vogel. Well posedness and convergence of some regularisation methods for non-linear ill posed problems. Inverse Problems, 5(2):227–238, 1989.
  • [32] Z. Shi, P. Mettes, S. Maji, and C. G. M. Snoek. On Measuring and Controlling the Spectral Bias of the Deep Image Prior. International Journal of Computer Vision, 2022.
  • [33] J. Sjoberg and L. Ljung. Overtraining, regularization, and searching for minimum with application to neural networks. International Journal of Control, 62, 1994.
  • [34] Y. Sun, H. Zhao, and J. Scarlett. On architecture selection for linear inverse problems with untrained neural networks. Entropy, 23:1481, 2021.
  • [35] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2798, 2017.
  • [36] V. Vasin. Relationship of several variational methods for the approximate solution of ill-posed problems. Mathematical notes of the Academy of Sciences of the USSR, 7:161–165, 1970.
  • [37] C. R. Vogel. A constrained least squares regularization method for nonlinear ill-posed problems. SIAM Journal on Control and Optimization, 28(1):34–49, 1990.
  • [38] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2):301–320, 2005.