
Averaged Deep Denoisers for Image Regularization

Pravin Nair ([email protected]) and Kunal N. Chaudhury ([email protected])

Department of Electrical Engineering, Indian Institute of Science, Bangalore, India
Abstract

Plug-and-Play (PnP) and Regularization-by-Denoising (RED) are recent paradigms for image reconstruction that leverage the power of modern denoisers for image regularization. In particular, they have been shown to deliver state-of-the-art reconstructions with CNN denoisers. Since the regularization is performed in an ad hoc manner, understanding the convergence of PnP and RED has been an active research area. It was shown in recent works that iterate convergence can be guaranteed if the denoiser is averaged or nonexpansive. However, integrating nonexpansivity with gradient-based learning is challenging, the core issue being that testing nonexpansivity is intractable. Using numerical examples, we show that existing CNN denoisers tend to violate the nonexpansive property, which can cause PnP or RED to diverge. In fact, algorithms for training nonexpansive denoisers either cannot guarantee nonexpansivity or are computationally intensive. In this work, we construct contractive and averaged image denoisers by unfolding splitting-based optimization algorithms applied to wavelet denoising and demonstrate that their regularization capacity for PnP and RED is comparable to that of CNN denoisers. To our knowledge, this is the first work to propose a simple framework for training contractive denoisers using network unfolding.

keywords:
plug-and-play (PnP), regularization by denoising (RED), convergence, contractive, nonexpansive, averaged, unfolding.

1 Introduction

The problem of recovering an unknown image from incomplete, corrupted measurements comes up in applications such as deblurring, superresolution, inpainting, demosaicking, despeckling, MRI, and tomography ribes2008linear ; engl1996regularization ; dong2011image . The standard optimization framework for such problems involves a loss function $f:\mathbb{R}^{n}\to\mathbb{R}$ derived from the measurement process and a regularizer $g:\mathbb{R}^{n}\to\mathbb{R}\cup\{\infty\}$ incorporating prior knowledge about the ground truth. The reconstruction is performed by solving the optimization problem

\underset{\boldsymbol{x}\in\mathbb{R}^{n}}{\min}\ f(\boldsymbol{x})+g(\boldsymbol{x}). \qquad (1)

The loss $f$ is typically differentiable, while the regularizer $g$ can be discontinuous engl1996regularization . For linear inverse problems such as deblurring, superresolution, inpainting, and compressed sensing dong2011image ; jagatap2019algorithmic , the loss is convex and quadratic, namely, $f(\boldsymbol{x})={\lVert\boldsymbol{y}-\mathbf{A}\boldsymbol{x}\rVert}^{2}$, where $\mathbf{A}\in\mathbb{R}^{m\times n}$ is the measurement matrix and $\boldsymbol{y}$ is the observed image. On the other hand, for applications such as single photon imaging, Poisson denoising, and despeckling chan2017plug ; rond2016poisson ; bioucas2010multiplicative , $f$ is non-quadratic.

For imaging applications, total-variation and other sparsity-promoting functionals rudin1992nonlinear ; candes2008enhancing ; dong2011sparsity are the most commonly used regularizers. For such regularizers, $g$ is typically convex but nondifferentiable engl1996regularization ; zhang2017image . If the loss $f$ is convex, we can solve (1) using iterative algorithms such as Forward-Backward Splitting (FBS), Douglas-Rachford Splitting (DRS), and Alternating Direction Method of Multipliers (ADMM) passty1979ergodic ; bauschke2017convex ; ryu2016primer . Model-based optimization is interpretable, computationally tractable, and can be applied to various problems. However, handcrafted regularizers are generally less powerful than modern learning-based models, and the optimization process can be time-consuming zhang2017image .

1.1 PnP and RED

Classical algorithms such as FBS and DRS perform repeated inversion of the measurement model, followed by a regularization step. More precisely, the regularization is performed using the proximal operator. For a convex regularizer $h:\mathbb{R}^{n}\to\mathbb{R}$, the proximal operator is defined as

\mathrm{Prox}_{h}(\boldsymbol{x})=\operatorname*{\arg\!\min}_{\boldsymbol{z}\in\mathbb{R}^{n}}\ \left\{\frac{1}{2}\left\lVert\boldsymbol{z}-\boldsymbol{x}\right\rVert^{2}+h(\boldsymbol{z})\right\}, \qquad (2)

where $\lVert\cdot\rVert$ is the Euclidean norm. The proximal operator can be seen as a denoiser: it takes a noisy image $\boldsymbol{x}$ and returns an image close to $\boldsymbol{x}$ for which the penalty $h$ is small beck2009fast . This operator is at the core of FBS, DRS, and ADMM. In particular, the FBS update is given by

\boldsymbol{z}_{k+1} = \boldsymbol{x}_{k}-\gamma\nabla\!f(\boldsymbol{x}_{k}),
\boldsymbol{x}_{k+1} = \mathrm{Prox}_{\gamma g}(\boldsymbol{z}_{k+1}), \qquad (3)

where $\gamma>0$ and $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ is an arbitrary initialization.

On the other hand, starting with $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ and $\gamma>0$, the DRS update is given by (e.g., see ryu2016primer )

\boldsymbol{z}_{k+1} = \mathrm{Prox}_{\gamma f}(\boldsymbol{x}_{k}),
\boldsymbol{x}_{k+1} = \boldsymbol{x}_{k}+\mathrm{Prox}_{\gamma g}(2\boldsymbol{z}_{k+1}-\boldsymbol{x}_{k})-\boldsymbol{z}_{k+1}. \qquad (4)

In Plug-and-Play (PnP) regularization sreehari2016plug , instead of having to handcraft the regularizer $g$ in (1) and apply it via its proximal operator, we directly “plug” a powerful denoising operator $\mathrm{D}:\mathbb{R}^{n}\to\mathbb{R}^{n}$, such as BM3D dabov2007image or DnCNN zhang2017beyond , into (3) and (4). Thus, in the PnP variant of FBS (referred to as PnP-FBS), the update is performed by replacing the proximal operator in (3) by an image denoiser (Algorithm 1).

Algorithm 1 PnP-FBS
1:input: $\boldsymbol{x}_{0}$, loss $f$, denoiser $\mathrm{D}$, and $\gamma>0$.
2:for $k=0,1,\ldots$ do
3:     $\boldsymbol{z}_{k+1}=\boldsymbol{x}_{k}-\gamma\nabla\!f(\boldsymbol{x}_{k})$.
4:     $\boldsymbol{x}_{k+1}=\mathrm{D}(\boldsymbol{z}_{k+1})$.
5:end for
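To make the update concrete, the following is a minimal NumPy sketch of Algorithm 1; the callables grad_f and denoiser are placeholders for the gradient of the loss and the plugged-in denoiser, and the fixed iteration budget is an assumption made here for illustration.

```python
import numpy as np

def pnp_fbs(x0, grad_f, denoiser, gamma, num_iters=100):
    """Sketch of PnP-FBS (Algorithm 1).

    grad_f   : callable returning the gradient of the loss f at x
    denoiser : callable D plugged in place of the proximal operator
    gamma    : step size (0 < gamma < 2/beta for a beta-smooth convex loss)
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        z = x - gamma * grad_f(x)   # forward (gradient) step on the loss
        x = denoiser(z)             # backward step replaced by the denoiser
    return x
```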

Remarkably, PnP algorithms yield state-of-the-art results across a wide range of applications, including superresolution, MRI, fusion, and tomography sreehari2016plug ; Ryu2019_PnP_trained_conv ; Sun2019_PnP_SGD ; zhang2017learning ; Tirer2019_iter_denoising ; zhang2021plug . The best performing PnP algorithms zhang2017learning ; Tirer2019_iter_denoising ; zhang2021plug use trained denoisers such as DnCNN zhang2017beyond , IRCNN zhang2017learning , and U-Net hurault2022gradient ; hurault2022proximal . This also opens up the exciting possibility of combining model-based reconstruction with deep learning zhang2017learning .

The idea of using denoisers for regularization has also been explored in Regularization-by-Denoising (RED). For example, the authors in romano2017little proposed to work with the regularizer

g(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^{\top}\big(\boldsymbol{x}-\mathrm{D}(\boldsymbol{x})\big), \qquad (5)

where $\mathrm{D}$ is some powerful denoiser as in PnP. This is solved iteratively using gradient-based methods, where the gradient of (5) is approximated by $\boldsymbol{x}-\mathrm{D}(\boldsymbol{x})$ in romano2017little . It was later shown in reehorst2018regularization that this approximation is exact if $\mathrm{D}$ is locally homogeneous and its Jacobian is symmetric.

Several variants of the original RED algorithm romano2017little have been proposed reehorst2018regularization ; Sun2019_PnP_SGD ; sun2019block ; sun2021scalable ; cohen2021regularization . One such variant called RED-PG reehorst2018regularization is described in Algorithm 2. RED algorithms have been shown to yield state-of-the-art results for phase retrieval, tomography, superresolution, deblurring, and compressed sensing metzler2018prdeep ; wu2020simba ; wu2019online ; mataev2019deepred ; sun2019block ; hu2022monotonically .

Algorithm 2 RED-PG
1:input: $\boldsymbol{z}_{0}$, loss $f$, denoiser $\mathrm{D}$, and $\gamma,L>0$.
2:for $k=0,1,\ldots$ do
3:     $\boldsymbol{x}_{k+1}=\mathrm{Prox}_{(\gamma/L)f}(\boldsymbol{z}_{k})$.
4:     $\boldsymbol{z}_{k+1}=\frac{1}{L}\big(\mathrm{D}(\boldsymbol{x}_{k+1})+(L-1)\boldsymbol{x}_{k+1}\big)$.
5:end for
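For completeness, here is an analogous sketch of Algorithm 2, again with placeholder callables prox_f (which is assumed to already encode the factor $\gamma/L$) and denoiser, and a fixed iteration budget.

```python
import numpy as np

def red_pg(z0, prox_f, denoiser, L, num_iters=100):
    """Sketch of RED-PG (Algorithm 2).

    prox_f   : callable computing Prox_{(gamma/L) f}
    denoiser : callable D
    L        : relaxation parameter (L > 1 in the convergence results)
    """
    z = np.array(z0, dtype=float)
    x = z
    for _ in range(num_iters):
        x = prox_f(z)                            # proximal step on the loss
        z = (denoiser(x) + (L - 1.0) * x) / L    # relaxed denoising step
    return x
```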

There is yet another paradigm for solving inverse problems called deep unfolding gregor2010learning ; sun2016deep ; zhang2018ista , where an iterative algorithm is unfolded into a network and trained in an end-to-end fashion. Deep unfolded networks can produce impressive results, but unlike PnP or RED (where just the denoiser needs to be trained), we have to train an unfolded network end-to-end. The present work is motivated by deep unfolding, but we propose to use them for a different purpose, namely, to construct nonexpansive denoisers by unfolding classical wavelet denoising. We then apply the trained denoiser within Algorithms 1 and 2 for image reconstruction.

We wish to point to a related work repetti2022dual , where the authors construct an unfolded denoiser by applying FBS to a sparsity-regularized image denoising problem (with box constraints on the reconstruction). However, there are important distinctions between our approach and theirs. First, splitting is applied to the Fenchel dual of the denoising problem in repetti2022dual ; the idea is to transform the nonsmooth primal (denoising) problem into a smooth, unconstrained dual problem. Second, since the sparsifying linear transform is optimized in repetti2022dual , there is no structural guarantee on the trained layers. As a result, it is difficult to come up with convergence guarantees; indeed, (repetti2022dual, Proposition 2) assumes that the same linear transform is used across all layers, and this would generally not hold if we optimize the transforms. We present comparisons with repetti2022dual in Section 5.

1.2 Convergence Results

The success of PnP and RED has sparked interest in the theoretical understanding of these algorithms. For example, it has been shown that for a class of linear denoisers, PnP is guaranteed to converge, even at a linear rate ACK2023-contractivity , and the limit point is the solution of a convex optimization problem sreehari2016plug ; nair2021fixed ; gavaskar2021plug . This is done by expressing the linear denoiser as the proximal operator of a convex function. Unfortunately, this does not hold for powerful nonlinear denoisers such as BM3D, DnCNN, or IRCNN. This is because, unlike the proximal operator that is nonexpansive, such denoisers have been shown to violate the nonexpansive property reehorst2018regularization .

Though developing a variational characterization for deep denoisers is challenging, it is nonetheless possible to establish iterate convergence for PnP and RED algorithms. For example, the convergence of Algorithm 1 was established for generative denoisers in jagatap2019algorithmic ; liu2021recovery ; raj2019gan . For strongly convex loss functions, PnP-FBS convergence can be guaranteed using a CNN denoiser whose residual is nonexpansive Ryu2019_PnP_trained_conv . Motivated by RED, convergence was recently established for PnP modeled on FBS, ADMM, half-quadratic splitting, and gradient descent hurault2022gradient ; cohen2021has ; hurault2022proximal . On the other hand, the convergence of RED algorithms can be guaranteed using a nonexpansive denoiser romano2017little ; reehorst2018regularization ; sun2019block ; cohen2021regularization ; liu2021recovery . For instance, the block-coordinate RED algorithm in sun2019block exhibits convergence with a nonexpansive denoiser. In liu2021recovery , the authors established convergence of RED for the compressed sensing application using a nonexpansive denoiser.

The focus of this work is on strong convergence of the PnP and RED algorithms, where the iterates converge in norm to a fixed point of a well-defined operator ryu2016primer . It is important to note that there exist other forms of iterate convergence. For example, for the denoisers in hurault2022gradient ; hurault2022proximal , the limit points of the PnP-FBS iterates are stationary points of a nonconvex objective function. In contrast, the iterates converge weakly (see (bauschke2017convex, Section 2.5) for a definition) to a zero of a combination of a gradient operator and a firmly nonexpansive operator in pesquet2021learning . On the other hand, the iterates of Plug-and-Play Stochastic Gradient Descent (PnP-SGD) in laumont2023maximum converge to a stationary point of a MAP problem, where the prior is derived from the denoiser. The summary of our understanding of the iterate convergence of PnP and RED is as follows:

  1. Convergence of PnP is guaranteed for averaged denoisers Sun2019_PnP_SGD ; sun2021scalable ; nair2021fixed .

  2. For nonexpansive denoisers, convergence is guaranteed for variants of RED reehorst2018regularization ; sun2019block ; cohen2021regularization , PnP-FBS applied to compressive sensing liu2021recovery , and a variant of PnP-FBS reehorst2018regularization .

For the above reasons, there has been significant research on training averaged and nonexpansive CNN denoisers. The challenge is that computing the Lipschitz constant of a neural network is NP-hard virmaux2018lipschitz . By Lipschitz constant, we mean the smallest $c$ such that $\lVert\mathrm{T}(\boldsymbol{u})-\mathrm{T}(\boldsymbol{v})\rVert\leqslant c\lVert\boldsymbol{u}-\boldsymbol{v}\rVert$ for all inputs $\boldsymbol{u}$ and $\boldsymbol{v}$, where we represent the CNN as an operator $\mathrm{T}$. The Lipschitz constant of a convolution layer is just the largest singular value of the matrix representation of the layer (which can be computed efficiently from the FFT of its filters sedghi2018singular ), and the Lipschitz constant of common nonlinear activations (ReLU and sigmoid) is $1$; however, computing the Lipschitz constant of their cascaded composition is nontrivial. Nonetheless, we can compute an upper bound for the Lipschitz constant using a direct approximation of the spectral norm of the network pesquet2021learning or the product of the spectral norms of the convolution layers Ryu2019_PnP_trained_conv . In particular, if we force the spectral norm of the convolution layers to be within $1$ at each epoch of the training process, then the final network is automatically nonexpansive. Several methods based on this idea have been proposed sedghi2018singular ; hertrich2021convolutional ; terris2020building . In particular, we will later compare with the nonexpansive CNN denoiser in Ryu2019_PnP_trained_conv , where the largest singular value of a convolutional layer (estimated using power iterations) is used to force the Lipschitz constant of the layer to be within $1$ after each gradient update.
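As an illustration of the power-iteration idea, the following is a minimal sketch (not the training code of Ryu2019_PnP_trained_conv ) that estimates the largest singular value of a single zero-padded convolution layer; the kernel and image size are placeholders.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def conv_spectral_norm_power(kernel, im_size, iters=50, seed=0):
    """Power-iteration estimate of the largest singular value (Lipschitz
    constant) of x -> convolve2d(x, kernel, 'same') on im_size x im_size
    images. For an odd kernel, the adjoint of this zero-padded convolution
    is correlation with the same kernel."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((im_size, im_size))
    for _ in range(iters):
        y = convolve2d(x, kernel, mode='same', boundary='fill')   # A x
        x = correlate2d(y, kernel, mode='same', boundary='fill')  # A^T A x
        x /= np.linalg.norm(x)
    return np.linalg.norm(convolve2d(x, kernel, mode='same', boundary='fill'))
```

Since such estimates are only approximate, a network rescaled using them at each epoch is not certified to be nonexpansive, which is precisely the issue raised above.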

1.3 Motivation

This work was motivated by the observation that existing CNN denoisers tend to violate the nonexpansive property (see Fig. 5). In fact, existing methods for training nonexpansive denoisers either cannot guarantee the final network to be nonexpansive or are computationally expensive. For example, power iterations are used for approximate computation of the largest singular value in Ryu2019_PnP_trained_conv ; pesquet2021learning ; hence, the network is not guaranteed to be nonexpansive at each epoch. The methods in sedghi2018singular ; hertrich2021convolutional ; terris2020building for constraining the convolutional layers are computationally expensive; e.g., the time taken in sedghi2018singular is about nine-fold that required to train the network without constraints. This is because sedghi2018singular requires the SVD computation of a large matrix derived from the Fourier transform of the convolutional filters. Similarly, terris2020building requires the Fourier transform of the filters in every convolutional layer. On the other hand, the training in hertrich2021convolutional requires us to solve an optimization problem that imposes orthogonality constraints on the convolutional layers.

We also highlight a technical issue that has been sidelined in existing works Ryu2019_PnP_trained_conv ; sedghi2018singular ; hertrich2021convolutional ; terris2020building ; pesquet2021learning , namely, that the CNN filters are trained on images of fixed size but deployed on images of arbitrary size. Since the filters must be zero-padded to match the image size, their Lipschitz constant can change with the image size. As a concrete example, consider the $3\times 3$ Sobel filter $[1,0,-1;\ 2,0,-2;\ 1,0,-1]$. The Lipschitz constant of this filter for an image of size $25\times 25$ is $7.9842$, whereas the Lipschitz constant for a $256\times 256$ image is $8$ (a small sketch reproducing this computation is given after the list below). Thus, a guarantee that the Lipschitz bound is independent of the image size is not available for existing CNN denoisers. To substantiate our point, we present numerical evidence (see Fig. 5 in Section 5) showing that BM3D dabov2007image , DnCNN zhang2017beyond , and N-CNN Ryu2019_PnP_trained_conv violate the nonexpansive property. Such counterexamples have also been reported in reehorst2018regularization . Our observations are summarized below:

  1. The Lipschitz constant of a CNN (trained with fixed-size filters) depends on the size of the input image and is difficult to predict.

  2. Existing CNN denoisers violate the nonexpansive property, which can result in the divergence of PnP and RED algorithms (see Tables 2 and 6 in Section 5).

  3. Training nonexpansive CNNs efficiently is challenging: existing algorithms either cannot guarantee nonexpansivity or are computationally expensive.
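The Sobel example above can be reproduced in a few lines; the sketch below assumes circular (FFT-based) convolution, in which case the Lipschitz constant of a convolution is the peak magnitude of the 2D DFT of the kernel zero-padded to the image size (the FFT-based route mentioned for sedghi2018singular ).

```python
import numpy as np

sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]], dtype=float)

def conv_lipschitz(kernel, size):
    """Largest singular value of convolution with `kernel` on a
    size x size image: peak magnitude of the zero-padded kernel's 2D DFT."""
    H = np.fft.fft2(kernel, s=(size, size))
    return np.abs(H).max()

print(conv_lipschitz(sobel, 25))    # approximately 7.9842
print(conv_lipschitz(sobel, 256))   # 8.0
```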

1.4 Contributions

Following the above observations, we moved away from CNNs and considered deep unfolded networks instead. In fact, from our experience, we noticed a tradeoff between nonexpansivity and the denoising performance of CNNs — a significant drop in denoising performance was observed if the denoiser was forced to be nonexpansive. Indeed, it was shown in hurault2022proximal ; hertrich2021convolutional that imposing hard Lipschitz constraints on the CNN can adversely affect its denoising performance.

In the rest of the paper, we demonstrate that it is possible to construct deep unfolded networks that come with the following desirable properties:

  1. Unlike Ryu2019_PnP_trained_conv ; sedghi2018singular , they are averaged (or contractive) by construction and hence automatically nonexpansive.

  2. The training is inexpensive compared to sedghi2018singular ; terris2020building ; pesquet2021learning . In particular, we only need to perform cheap projections of the learnable parameters onto intervals after every gradient step.

  3. When plugged into Algorithms 1 and 2, the reconstruction is comparable with the state-of-the-art.

Our construction is motivated by the averaged (or contractive) property of fixed-point operators arising in FBS and DRS; this property is classically used to establish iterate convergence bauschke2017convex . In fact, the building blocks of our denoiser are obtained by unfolding FBS and DRS applied to wavelet denoising. These blocks are provably averaged (resp. contractive), and the proposed denoiser, being a composition of these building blocks, is automatically averaged (resp. contractive). While our denoiser is trained on small patches, we give a formula for patch aggregation that guarantees the final end-to-end image denoiser to be averaged (or contractive). The denoising capacity of our denoisers falls short of BM3D and DnCNN; however, their regularization capacity for PnP and RED is comparable with BM3D and DnCNN. The source code is available at https://github.com/pravin1390/Averageddeepdenoiser.

2 Notations and Definitions

For integers $p\leqslant q$, we will use the notations $[p,q]=\{p,p+1,\ldots,q\}$ and $[p,q]^{2}=[p,q]\times[p,q]$. We also use $[p,q]_{s}$ to denote the integers $\{ps,(p+1)s,\ldots,qs\}$, which are multiples of $s$.

We will work in $\mathbb{R}^{n}$ and the norm $\lVert\,\cdot\,\rVert$ will be the Euclidean norm. We say that a sequence $\{\boldsymbol{x}_{k}\}$ in $\mathbb{R}^{n}$ converges to $\boldsymbol{x}^{*}\in\mathbb{R}^{n}$ if $\lVert\boldsymbol{x}_{k}-\boldsymbol{x}^{*}\rVert\to 0$ as $k\to\infty$. Moreover, the convergence is linear if there exist $C>0$ and $\mu\in(0,1)$ such that $\lVert\boldsymbol{x}_{k}-\boldsymbol{x}^{*}\rVert\leqslant C\mu^{k}$. A sequence is said to be convergent if it converges to some $\boldsymbol{x}^{*}\in\mathbb{R}^{n}$; it is said to converge to a set $S\subset\mathbb{R}^{n}$ if it converges to some $\boldsymbol{x}^{*}\in S$. A sequence $\{\boldsymbol{x}_{k}\}$ generated by an iterative algorithm is said to converge globally if it converges for any arbitrary initialization $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$.

We recall the definitions and properties of certain nonlinear operators that come up in the fixed-point convergence of PnP and RED algorithms. These will play an important role in the construction of the proposed denoiser.

An operator $\mathrm{T}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is said to be Lipschitz if there exists $c>0$ such that

\forall\,\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{n}:\quad\left\lVert\mathrm{T}(\boldsymbol{x})-\mathrm{T}(\boldsymbol{y})\right\rVert\leqslant c\left\lVert\boldsymbol{x}-\boldsymbol{y}\right\rVert. \qquad (6)

The smallest $c$ satisfying (6) is called the Lipschitz constant of $\mathrm{T}$. An operator $\mathrm{T}$ is nonexpansive if $c=1$ and contractive if $c\in[0,1)$. $\mathrm{T}$ is said to be $\theta$-averaged for some $\theta\in[0,1)$ if we can write $\mathrm{T}=(1-\theta)\,\mathrm{I}+\theta\,\mathrm{N}$, where $\mathrm{N}$ is a nonexpansive operator and $\mathrm{I}$ is the identity operator on $\mathbb{R}^{n}$. We call $\mathrm{T}$ firmly nonexpansive if $\mathrm{T}$ is $(1/2)$-averaged.

Remark 1.

It is evident from the above definitions that an averaged operator is nonexpansive. The converse is not true; e.g., the operator $-\mathrm{I}$ is nonexpansive but not averaged. A contractive operator is $(c+1)/2$-averaged, where $c$ is the contraction factor. However, not every averaged operator is contractive. Thus, contractive operators form a strict subset of averaged operators, which in turn form a strict subset of nonexpansive operators bauschke2017convex .

Two instances of averaged operators relevant to this work are the proximal and the gradient-step operators. Recall that for a convex function $h:\mathbb{R}^{n}\to\mathbb{R}$, its proximal operator $\mathrm{Prox}_{h}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is defined in (2). For a differentiable function $f:\mathbb{R}^{n}\to\mathbb{R}$, we define the gradient-step operator to be

\mathrm{G}_{\gamma,f}:\mathbb{R}^{n}\to\mathbb{R}^{n},\quad\mathrm{G}_{\gamma,f}=\mathrm{I}-\gamma\nabla\!f, \qquad (7)

where $\gamma>0$ is the step size. A differentiable function $f$ is said to be $\beta$-smooth if the Lipschitz constant of the gradient operator $\nabla\!f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is at most $\beta$.

Remark 2.

The proximal and the gradient-step operators in (2) and (7) are known to be firmly nonexpansive when $f$ is convex bauschke2017convex . However, these operators need not be contractive unless additional conditions are met, e.g., strong convexity of $f$. For linear inverse problems, the loss $f$ is typically not strongly convex, and it is easy to show that the proximal and the gradient-step operators are not contractive in this case. This will be an important consideration when we work with proximal and gradient-step operators.

3 Convergence of PnP and RED

The convergence analysis of operator-splitting methods is primarily based on the classical fixed-point theory of monotone operators bauschke2017convex ; ryu2016primer ; parikh2014proximal ; eckstein1992douglas . PnP and RED have their roots in convex optimization and operator-splitting algorithms in particular. It is thus not surprising that existing convergence analyses of PnP and RED rely heavily on the fixed-point theory of firmly nonexpansive and averaged operators nair2021fixed ; sun2021scalable ; Ryu2019_PnP_trained_conv ; cohen2021regularization ; reehorst2018regularization . In particular, this line of analysis can be used to establish convergence of Algorithms 1 and 2. This is done by expressing the updates in these algorithms as a mapping $\boldsymbol{x}\mapsto\mathrm{T}(\boldsymbol{x})$, where $\mathrm{T}$ is contractive or averaged depending on the loss $f$ and the denoiser $\mathrm{D}$. In particular, consider the operators

\mathrm{T}_{\mathrm{PnP}}=\mathrm{D}\circ\mathrm{G}_{\gamma,f}, \qquad (8)

and

\mathrm{T}_{\mathrm{RED}}=\mathrm{Prox}_{(\gamma/L)f}\circ\frac{1}{L}\big(\mathrm{D}+(L-1)\,\mathrm{I}\,\big). \qquad (9)

In terms of (8) and (9), we can express the updates in Algorithms 1 and 2 as $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$, where the operator $\mathrm{T}$ is either (8) or (9). Namely, we can view PnP-FBS and RED-PG as fixed-point algorithms.

The convergence analysis of fixed-point iterations is a classical topic and of continuing research interest in image regularization FPTbook . In particular, since the operators (8) and (9) are continuous, a limit point of the iterates (if one exists) must necessarily be a fixed point of $\mathrm{T}$; we call this fixed-point convergence. We will denote the set of fixed points of $\mathrm{T}$ by $\mathrm{Fix}\,\mathrm{T}$, i.e., $\mathrm{Fix}\,\mathrm{T}=\{\boldsymbol{x}\in\mathbb{R}^{n}:\ \boldsymbol{x}=\mathrm{T}(\boldsymbol{x})\}$.

We can establish various convergence results for the fixed-point iterations of $\mathrm{T}$. A classical result is the following.

Theorem 1.

Let $\mathrm{T}$ be a contractive operator on $\mathbb{R}^{n}$. Then the sequence $\{\boldsymbol{x}_{k}\}_{k\geqslant 0}$ generated as $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ converges globally and linearly to $\mathrm{Fix}\,\mathrm{T}$.

The iteration $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ is called the Picard iteration when $\mathrm{T}$ is contractive, and Theorem 1 is the contraction-mapping theorem FPTbook .

Remark 3.

$\mathrm{Fix}\,\mathrm{T}$ is nonempty if $\mathrm{T}$ is contractive. In fact, $\mathrm{Fix}\,\mathrm{T}$ is a singleton in this case, i.e., the fixed-point iterations always converge to the same fixed point regardless of the initialization $\boldsymbol{x}_{0}$. In general, however, verifying that $\mathrm{Fix}\,\mathrm{T}\neq\emptyset$ for some operator $\mathrm{T}$ is a nontrivial task; this is assumed in most practical applications of fixed-point analysis, including the analysis of operator-splitting algorithms bauschke2017convex ; ryu2016primer .

It follows from Theorem 1 that a sufficient condition for the convergence of Algorithms 1 and 2 is that the operators $\mathrm{T}_{\mathrm{PnP}}$ in (8) and $\mathrm{T}_{\mathrm{RED}}$ in (9) are contractive. In fact, this is the case if the denoiser $\mathrm{D}$ is contractive and the loss function $f$ is convex.

Proposition 2.

Let $f$ be convex and let $\mathrm{D}$ be contractive. Then the operator in (9) is contractive for all $\gamma>0$ and $L>1$. Moreover, if $f$ is $\beta$-smooth, then operator (8) is contractive for all $0<\gamma<2/\beta$.

Proof: If $\mathrm{D}$ is contractive and $L>1$, then it is not difficult to see that the second operator in (9) is contractive, being a convex combination of $\mathrm{I}$ and $\mathrm{D}$. On the other hand, the first operator in (9) is a proximal operator, which is nonexpansive bauschke2017convex . Being the composition of a nonexpansive and a contractive operator, (9) is contractive.

On the other hand, if $f$ is convex and $\beta$-smooth, then from the Baillon-Haddad theorem bauschke2009baillon we have that, for all $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{n}$,

\beta\,\big(\nabla\!f(\boldsymbol{x})-\nabla\!f(\boldsymbol{y})\big)^{\top}\!(\boldsymbol{x}-\boldsymbol{y})\geqslant\big\lVert\nabla\!f(\boldsymbol{x})-\nabla\!f(\boldsymbol{y})\big\rVert^{2}.

Using this, we can easily show that (7) is nonexpansive for all $0<\gamma<2/\beta$. Thus, (8) is the composition of a nonexpansive and a contractive operator and is hence contractive.

We arrive at the following important conclusion by combining Theorem 1 and Proposition 2.

Corollary 3.

Suppose $f$ is convex and $\mathrm{D}$ is contractive.

  1. Algorithm 2 converges globally and linearly if $\gamma>0$ and $L>1$.

  2. Additionally, if $f$ is $\beta$-smooth, then Algorithm 1 converges globally and linearly if $0<\gamma<2/\beta$.

Proof: If $f$ is convex, $\gamma>0$ and $L>1$, and $\mathrm{D}$ is contractive, we know from Proposition 2 that (9) is a contraction. Similarly, if $0<\gamma<2/\beta$, then it follows from Proposition 2 that (8) is a contraction. Hence, we have from Theorem 1 that Algorithms 1 and 2 converge globally and linearly.

The convergence results in Corollary 3 can be extended to averaged denoisers. This is important as we will later work with averaged denoisers that are not guaranteed to be contractive. We will rely on the following fixed-point convergence result for averaged operators.

Theorem 4.

Let $\mathrm{T}$ be an averaged operator on $\mathbb{R}^{n}$ and assume that $\mathrm{Fix}\,\mathrm{T}\neq\emptyset$. The sequence $\{\boldsymbol{x}_{k}\}_{k\geqslant 0}$ generated as $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ converges globally to $\mathrm{Fix}\,\mathrm{T}$.

The iteration $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ is called the Mann iteration when $\mathrm{T}$ is averaged, and Theorem 4 is the Krasnosel'skii-Mann theorem (bauschke2017convex, Prop. 5.16).

Remark 4.

Comparing Theorems 1 and 4, we note that while $\mathrm{Fix}\,\mathrm{T}$ is guaranteed to be nonempty when $\mathrm{T}$ is contractive, we need to assume this if $\mathrm{T}$ is averaged. Second, when $\mathrm{T}$ is a contraction, the sequence converges to the same fixed point regardless of the initialization $\boldsymbol{x}_{0}$, whereas the limit point will generally depend on $\boldsymbol{x}_{0}$ if $\mathrm{T}$ is averaged. Finally, the convergence in Theorem 4 cannot be guaranteed to be linear.

If the denoiser $\mathrm{D}$ is nonexpansive or averaged, then we can make the following conclusion regarding the PnP and RED operators.

Proposition 5.

Let $f$ be convex and let $\mathrm{D}$ be nonexpansive. Then the operator in (9) is an averaged operator for all $\gamma>0$ and $L>1$. Moreover, if $f$ is $\beta$-smooth and $\mathrm{D}$ is averaged, then operator (8) is an averaged operator for all $0<\gamma<1/\beta$.

Proof: The reasoning is similar to that of Proposition 2. The second operator in (9) is $1/L$-averaged if $L>1$. Also, it is well-known that the proximal operator of a convex function is firmly nonexpansive bauschke2017convex . Thus, being the composition of two averaged operators, (9) is averaged bauschke2017convex .

If $f$ is $\beta$-smooth and $0<\gamma<1/\beta$, we can show that (7) is firmly nonexpansive bauschke2017convex . Again, being the composition of two averaged operators, (8) must be averaged.

Combining Theorem 4 and Proposition 5, we can certify convergence of Algorithms 1 and 2.

Corollary 6.

Suppose $f$ is convex and $\mathrm{D}$ is averaged.

  1. Assuming that $\mathrm{Fix}\,\mathrm{T}_{\mathrm{RED}}\neq\emptyset$, Algorithm 2 converges globally for any $\gamma>0$ and $L>1$.

  2. Assuming that $\mathrm{Fix}\,\mathrm{T}_{\mathrm{PnP}}\neq\emptyset$ and $f$ is $\beta$-smooth, Algorithm 1 converges globally for any $0<\gamma<1/\beta$.

Proof: If $f$ is convex, $\gamma>0$, $L>1$, and $\mathrm{D}$ is averaged (and hence nonexpansive), it follows from Proposition 5 that (9) is an averaged operator. Thus, if $\mathrm{Fix}\,\mathrm{T}_{\mathrm{RED}}\neq\emptyset$, then we have from Theorem 4 that Algorithm 2 converges globally.

On the other hand, if $f$ is convex and $\beta$-smooth, and $0<\gamma<1/\beta$, then it follows from Proposition 5 that (8) is an averaged operator. Hence, if $\mathrm{Fix}\,\mathrm{T}_{\mathrm{PnP}}\neq\emptyset$, then we have from Theorem 4 that Algorithm 1 converges globally.

In summary, we were able to establish fixed-point convergence for Algorithms 1 and 2 for contractive (Corollary 3) and averaged denoisers (Corollary 6). We next show how such denoisers can be realized using deep unfolding.

Remark 5.

We can work out convergence results similar to Corollaries 3 and 6 for the PnP and RED algorithms in romano2017little ; reehorst2018regularization ; sun2019block ; cohen2021regularization .

4 Unfolded Denoisers

Motivated by the previous results, we wish to develop trained denoisers that are either contractive or averaged by construction and whose regularization capacity is comparable to state-of-the-art CNN denoisers. As explained in Section 1.3, learning a nonexpansive CNN denoiser presents several challenges. This prompted us to consider a different solution to the problem, namely, using unfolded networks. More specifically, we construct the denoiser in three steps.

  1. We consider the classical variational problem of wavelet denoising and unfold it using FBS and DRS. This is done at a patch level. The patch denoiser is contractive (for FBS) or averaged (for DRS) by construction.

  2. We train the patch denoiser using supervised learning, where the regularization parameters (thresholds) of wavelet denoising and the internal (step size and penalty) parameters of FBS and DRS are optimized.

  3. We aggregate the patch denoisers to derive an image denoiser, which is proven to inherit the contractive or averaged property of the original patch denoiser.

4.1 Wavelet denoising

Let $\boldsymbol{b}\in\mathbb{R}^{p}$ represent a noisy image patch, where we have stacked the pixels of a $(k\times k)$ image patch into a vector of length $p=k^{2}$. Let $\mathbf{W}$ be an orthogonal wavelet transform. Consider the wavelet denoising problem

\underset{\boldsymbol{x}\in\mathbb{R}^{p}}{\min}\ \ \frac{1}{2}\lVert\boldsymbol{x}-\boldsymbol{b}\rVert^{2}+\lambda\sum_{i=1}^{p}\,\lvert(\mathbf{W}\boldsymbol{x})_{i}\rvert\qquad(\lambda>0),

where $\lambda$ is the regularization parameter. The above problem has the closed-form solution $\mathbf{W}^{\top}\phi_{\lambda}(\mathbf{W}\boldsymbol{b})$, where the soft-threshold function

\phi_{\lambda}(\theta)=\mathrm{sign}(\theta)\,\max(0,\lvert\theta\rvert-\lambda)

is applied componentwise on $\mathbf{W}\boldsymbol{b}$ and $\lambda$ acts as a threshold chambolle1998nonlinear .

We consider a slightly different model with a different threshold for each wavelet coefficient. Namely, we consider the optimization problem

\underset{\boldsymbol{x}\in\mathbb{R}^{p}}{\min}\ \ \frac{1}{2}\lVert\boldsymbol{x}-\boldsymbol{b}\rVert^{2}+\sum_{i=1}^{p}\lambda_{i}\,\lvert(\mathbf{W}\boldsymbol{x})_{i}\rvert\quad(\lambda_{i}>0). \qquad (10)

This also has a closed-form solution given by

\phi_{\Lambda}(\boldsymbol{b})=\mathbf{W}^{\top}\mathbf{S}_{\Lambda}(\mathbf{W}\boldsymbol{b}), \qquad (11)

where $\Lambda=(\lambda_{1},\ldots,\lambda_{p})$ are the thresholds and the thresholding operator $\mathbf{S}_{\Lambda}:\mathbb{R}^{p}\to\mathbb{R}^{p}$ is defined as

\forall\,\boldsymbol{\theta}\in\mathbb{R}^{p},\,i\in[1,p]:\quad\big(\mathbf{S}_{\Lambda}(\boldsymbol{\theta})\big)_{i}=\phi_{\lambda_{i}}(\theta_{i}).
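A minimal sketch of the closed-form solution (11) is given below, assuming for illustration that the orthogonal transform $\mathbf{W}$ is available as an explicit p x p matrix (in practice, a fast wavelet transform would be used).

```python
import numpy as np

def soft_threshold(theta, lam):
    """Componentwise soft-thresholding S_Lambda; lam is a scalar or a
    length-p vector of positive thresholds."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def wavelet_denoise(b, W, lam):
    """Closed-form solution (11) of problem (10): W^T S_Lambda(W b)."""
    return W.T @ soft_threshold(W @ b, lam)
```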

4.2 FBS denoiser

The denoising operator $\phi_{\Lambda}$ is nonexpansive, being the composition of nonexpansive and unitary operators. However, it is easy to see that $\phi_{\Lambda}$ is not contractive. To identify a contractive (resp. averaged) operator, we propose to solve the denoising problem (10) in an iterative fashion using FBS (resp. DRS). The proposed denoising network is obtained by chaining the update operator in FBS and DRS.

As a first step, we apply FBS to the optimization problem (10). Note that, for a step size $\rho$, we can write the FBS update as

\boldsymbol{x}_{k+1}=\mathrm{U}(\boldsymbol{x}_{k};\rho,\Lambda),

where we define the FBS block $\mathrm{U}$ to be

\mathrm{U}(\boldsymbol{x};\rho,\Lambda)=\mathbf{W}^{\top}\mathbf{S}_{\Lambda}\left(\mathbf{W}\big((1-\rho)\boldsymbol{x}+\rho\boldsymbol{b}\big)\right). \qquad (12)

The FBS block is depicted in Fig. 1. Functionally, $\mathrm{U}$ is an inversion step (gradient descent on the loss) followed by wavelet regularization.

Although (10) has a closed-form solution, we are interested in the iterative solution of (10) because of the following property.

Proposition 7.

For all $\rho\in(0,1]$ and $\Lambda>0$, the map $\boldsymbol{x}\mapsto\mathrm{U}(\boldsymbol{x};\rho,\Lambda)$ is contractive.

Indeed, since $\mathbf{W}$ is an orthogonal transform and the operator $\mathbf{S}_{\Lambda}$ in (12) is nonexpansive (the condition $\Lambda>0$ is required to ensure that (11) is well-defined), we have, for any $\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}^{p}$,

\lVert\mathrm{U}(\boldsymbol{x};\rho,\Lambda)-\mathrm{U}(\boldsymbol{x}^{\prime};\rho,\Lambda)\rVert\leqslant(1-\rho)\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert.

Since $1-\rho\in[0,1)$, we can conclude that the map $\boldsymbol{x}\mapsto\mathrm{U}(\boldsymbol{x};\rho,\Lambda)$ is contractive.

Figure 1: Building blocks obtained by unfolding FBS and DRS for wavelet denoising: (a) FBS block; (b) DRS block. The proposed denoisers are obtained by chaining these blocks.

Following Proposition 7, we chain $\mathrm{U}$ into a network. The precise definition is as follows.

Definition 1 (FBS denoiser).

For a fixed number of layers $L\geqslant 1$, we define $\mathrm{D}_{\mathrm{FBS}}:\mathbb{R}^{p}\to\mathbb{R}^{p}$ to be

\mathrm{D}_{\mathrm{FBS}}=\mathrm{U}^{(L)}\circ\mathrm{U}^{(L-1)}\circ\cdots\circ\mathrm{U}^{(1)}, \qquad (13)

where each layer has the form $\mathrm{U}^{(\ell)}=\mathrm{U}(\cdot\,;\rho^{(\ell)},\Lambda^{(\ell)})$.

Since the composition of contractive operators is again contractive, the next result follows immediately from Proposition 7.

Proposition 8.

For all $\rho^{(\ell)}\in(0,1]$ and $\Lambda^{(\ell)}>0$, $\mathrm{D}_{\mathrm{FBS}}$ is contractive.

The point is that (13) is contractive as long as $\Lambda^{(\ell)}$ and $\rho^{(\ell)}$ lie in the intervals stipulated in Proposition 8. We can thus tune these parameters over the respective intervals to optimize the denoising capacity while preserving the contractive nature.
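The following sketch spells out the FBS block (12) and the chained denoiser (13), reusing the illustrative soft_threshold helper from Section 4.1; initializing the chain at the noisy patch $\boldsymbol{b}$ is an assumption made here for illustration.

```python
def fbs_block(x, b, W, rho, lam):
    """One FBS block U(x; rho, Lambda) from (12): a gradient step on the
    quadratic data term followed by wavelet soft-thresholding."""
    return W.T @ soft_threshold(W @ ((1.0 - rho) * x + rho * b), lam)

def d_fbs(b, W, rhos, lams):
    """Chained FBS denoiser D_FBS from (13); contractive whenever every
    rho lies in (0, 1] and every threshold is positive (Proposition 8)."""
    x = b.copy()                       # assumed initialization at the noisy patch
    for rho, lam in zip(rhos, lams):   # one (rho, Lambda) pair per layer
        x = fbs_block(x, b, W, rho, lam)
    return x
```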

Remark 6.

Unlike LISTA gregor2010learning or DualFB repetti2022dual , we use a fixed transform in each layer. This is because we wish to retain the proximal structure in (12), and orthogonality plays a crucial role in this regard.

4.3 DRS denoiser

Note that we can apply Peaceman-Rachford splitting bauschke2017convex or DRS to problem (10). In fact, we will show that the corresponding update operator is firmly nonexpansive for DRS. Thus, the resulting denoiser obtained by chaining this operator is averaged, and we can guarantee convergence of Algorithms 1 and 2 (Corollary 6).

Remark 7.

It is well-known that DRS converges in fewer iterations than FBS parikh2014proximal . Since each layer of the unfolded denoiser corresponds to an iteration, we expect to perform better denoising with DRS for the same number of layers. Indeed, we will show in Section 5 that the DRS denoiser gives better results than the FBS denoiser in some experiments.

To arrive at the DRS update for (10), we identify (10) with (1) by setting

f(\boldsymbol{x})=\frac{1}{2}\lVert\boldsymbol{x}-\boldsymbol{b}\rVert^{2},\quad g(\boldsymbol{x})=\sum_{i=1}^{p}\lambda_{i}\,\lvert(\mathbf{W}\boldsymbol{x})_{i}\rvert.

It follows from (4) that the DRS update can be written as

\boldsymbol{x}_{k+1}=\mathrm{V}(\boldsymbol{x}_{k};\rho,\Lambda),

where the DRS block $\mathrm{V}$ is defined as

\mathrm{V}(\boldsymbol{x};\rho,\Lambda)=\frac{1}{2}\boldsymbol{x}+\frac{1}{2}\big(2\mathrm{Prox}_{\rho g}-\mathrm{I}\big)\big(2\mathrm{Prox}_{\rho f}-\mathrm{I}\big)(\boldsymbol{x}), \qquad (14)

and $\rho>0$ is the penalty parameter of DRS.

Remark 8.

We can simplify the expression for the DRS block (14). Indeed, a direct calculation gives

\mathrm{Prox}_{\rho f}(\boldsymbol{x})=\frac{1}{1+\rho}(\boldsymbol{x}+\rho\boldsymbol{b}).

On the other hand, if we absorb $\rho$ in $\Lambda$, we get $\mathrm{Prox}_{\rho g}=\phi_{\Lambda}$. Thus, we can express (14) as

\mathrm{V}(\boldsymbol{x};\rho,\Lambda)=\frac{\rho}{1+\rho}(\boldsymbol{x}-\boldsymbol{b})+\phi_{\Lambda}\left(\frac{1-\rho}{1+\rho}\boldsymbol{x}+\frac{2\rho}{1+\rho}\boldsymbol{b}\right). \qquad (15)

Having defined the update operator $\mathrm{V}$, we can now establish its averaged property.

Proposition 9.

For all $\rho>0$ and $\Lambda>0$, the map $\boldsymbol{x}\mapsto\mathrm{V}(\boldsymbol{x};\rho,\Lambda)$ is firmly nonexpansive.

Indeed, we know that the proximal operator of a convex function is firmly nonexpansive bauschke2017convex . Thus, the operators $2\mathrm{Prox}_{\rho g}-\mathrm{I}$ and $2\mathrm{Prox}_{\rho f}-\mathrm{I}$ are nonexpansive, and so is their composition, which we denote by $\mathrm{N}$. Since $\mathrm{V}=(1/2)(\mathrm{I}+\mathrm{N})$, $\mathrm{V}$ must be firmly nonexpansive.

We will now use Proposition 9 to construct an averaged network. As with the FBS block, the trainable parameters in (15) are $\rho$ and $\Lambda$, while $\boldsymbol{b}$ and $\mathbf{W}$ are fixed. The DRS block is depicted in Fig. 1. The DRS denoiser is obtained by chaining $\mathrm{V}$ into a network.

Definition 2 (DRS denoiser).

For a fixed number of layers $L\geqslant 1$, we define $\mathrm{D}_{\mathrm{DRS}}:\mathbb{R}^{p}\to\mathbb{R}^{p}$ to be

\mathrm{D}_{\mathrm{DRS}}=\mathrm{V}^{(L)}\circ\mathrm{V}^{(L-1)}\circ\cdots\circ\mathrm{V}^{(1)}, \qquad (16)

where each layer is of the form $\mathrm{V}^{(\ell)}=\mathrm{V}(\cdot\,;\rho^{(\ell)},\Lambda^{(\ell)})$.

As a consequence of Proposition 9, we have the following structure on the denoiser.

Proposition 10.

For all $\rho^{(\ell)}>0$ and $\Lambda^{(\ell)}>0$, $\mathrm{D}_{\mathrm{DRS}}$ is $L/(L+1)$-averaged.

Indeed, notice from Proposition 9 that $\mathrm{D}_{\mathrm{DRS}}$ is the composition of $L$ firmly nonexpansive operators. Now, if we compose $L$ operators that are each $\theta$-averaged, then the resulting operator is $L\theta/(1+(L-1)\theta)$-averaged (e.g., see bauschke2017convex ). Setting $\theta=1/2$, we arrive at Proposition 10.
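Analogously, the following is a sketch of the DRS block in its simplified form (15) and the chained denoiser (16), with the same illustrative assumptions as before (explicit $\mathbf{W}$, the soft_threshold helper from Section 4.1, and initialization at the noisy patch).

```python
def drs_block(x, b, W, rho, lam):
    """One DRS block V(x; rho, Lambda) in the simplified form (15)."""
    u = ((1.0 - rho) * x + 2.0 * rho * b) / (1.0 + rho)
    return (rho / (1.0 + rho)) * (x - b) + W.T @ soft_threshold(W @ u, lam)

def d_drs(b, W, rhos, lams):
    """Chained DRS denoiser D_DRS from (16); L/(L+1)-averaged for positive
    rho and thresholds (Proposition 10)."""
    x = b.copy()                       # assumed initialization at the noisy patch
    for rho, lam in zip(rhos, lams):
        x = drs_block(x, b, W, rho, lam)
    return x
```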

Figure 2: Grayscale images from the Set12 database: (a) barbara, (b) ship, (c) cameraman, (d) couple, (e) fingerprint, (f) hill, (g) house, (h) lena, (i) man, (j) montage, (k) peppers, (l) starfish. We will identify these images by their corresponding names in Section 5.

4.4 Image denoiser

Recall that (13) and (16) are patch denoisers. We apply them independently on patches extracted from a noisy image and then average the denoised patches to obtain the final image zhang2017image ; dabov2007image . The overall operation defines an image-to-image denoising operator $\mathrm{D}$, which is used in Algorithms 1 and 2. In view of Corollaries 3 and 6, we wish to ensure that $\mathrm{D}$ is averaged or contractive. First, we give a precise definition of the denoiser and then prove that it has the desired property.

Let $\Omega=[0,q-1]^{2}$ denote the image domain and $\mathbf{X}:\Omega\to\mathbb{R}$ be the input (noisy) image; we also view $\mathbf{X}$ as a vector in $\mathbb{R}^{n}$, where $n=q^{2}$. We assume a periodic boundary condition, i.e., for $\boldsymbol{i}=(i_{1},i_{2})\in\Omega$ and $\boldsymbol{\tau}=(\tau_{1},\tau_{2})\in\mathbb{Z}^{2}$, we define

\boldsymbol{i}-\boldsymbol{\tau}=\big((i_{1}-\tau_{1})\ \mathrm{mod}\ q,\,(i_{2}-\tau_{2})\ \mathrm{mod}\ q\big).

Without loss of generality, we assume that $q$ is a multiple of the patch size $k$; we can simply pad the image to satisfy this condition.

Let the patch size be $k\times k$. For $\boldsymbol{i}\in\Omega$, we define the patch operator

\mathcal{P}_{\boldsymbol{i}}:\mathbb{R}^{n}\to\mathbb{R}^{k^{2}},\quad\mathcal{P}_{\boldsymbol{i}}(\mathbf{X})=\mathrm{vec}\Big(\big\{\mathbf{X}(\boldsymbol{i}+\boldsymbol{\tau}):\boldsymbol{\tau}\in[0,k-1]^{2}\big\}\Big), \qquad (17)

where $\mathrm{vec}(\cdot)$ denotes the vectorized representation of the pixels in a $(k\times k)$ window around $\boldsymbol{i}$ (in some fixed order).

We extract patches from $\mathbf{X}$ with stride $s$, where we assume that $s$ divides $k$. Thus, the starting coordinates of the patches are $\mathcal{J}=[0,(q_{s}-1)]_{s}^{2}$, where $q_{s}=q/s$. In other words, the patches are

\left\{\mathcal{P}_{\boldsymbol{j}}(\mathbf{X}):\,\boldsymbol{j}\in\mathcal{J}\right\}.
Definition 3.

Given a patch denoiser $\mathrm{D}_{\mathrm{patch}}$, we define the image denoiser $\mathrm{D}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ to be

\mathrm{D}(\mathbf{X})=\frac{1}{(k/s)^{2}}\sum_{\boldsymbol{j}\in\mathcal{J}}\mathcal{P}_{\boldsymbol{j}}^{*}\big(\mathrm{D}_{\mathrm{patch}}(\mathcal{P}_{\boldsymbol{j}}(\mathbf{X}))\big), \qquad (18)

where $\mathcal{P}_{\boldsymbol{j}}^{*}$ is the adjoint of $\mathcal{P}_{\boldsymbol{j}}$.

The factor $(k/s)^{2}$ is the number of patches containing a given pixel; the assumption that $s$ divides $k$ is essential to ensure that this count is the same for every pixel.

Remark 9.

The adjoint in (18) is with respect to the standard Euclidean inner product (see Appendix 8 for details). In particular, it is a linear map from $\mathbb{R}^{k^{2}}$ to $\mathbb{R}^{n}$ such that for any patch $\boldsymbol{x}\in\mathbb{R}^{k^{2}}$, $\mathbf{Y}=\mathcal{P}^{*}_{\boldsymbol{j}}(\boldsymbol{x})$ is an image, where

\mathcal{P}_{\boldsymbol{j}}(\mathbf{Y})=\boldsymbol{x},

and the rest of the pixels in $\mathbf{Y}$ are zero.
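A minimal NumPy sketch of the aggregation rule (18) for a square q x q image is given below; periodic indexing implements the boundary condition above, and the patch_denoiser argument stands for $\mathrm{D}_{\mathrm{FBS}}$ or $\mathrm{D}_{\mathrm{DRS}}$ acting on vectorized k x k patches.

```python
import numpy as np

def image_denoiser(X, patch_denoiser, k, s):
    """Patch-aggregation denoiser (18): extract k x k patches with stride s
    (periodic boundary), denoise each one, add it back via the adjoint of
    the patch-extraction operator, and normalize by (k/s)^2."""
    q = X.shape[0]                          # assumes s divides k and k divides q
    out = np.zeros_like(X, dtype=float)
    for i in range(0, q, s):
        for j in range(0, q, s):
            rows = np.arange(i, i + k) % q  # periodic indexing
            cols = np.arange(j, j + k) % q
            patch = X[np.ix_(rows, cols)].reshape(-1)
            den = patch_denoiser(patch).reshape(k, k)
            out[np.ix_(rows, cols)] += den  # adjoint: place the patch back
    return out / (k / s) ** 2
```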

Henceforth, we will assume that the patch denoiser is either $\mathrm{D}_{\mathrm{FBS}}$ in (13) or $\mathrm{D}_{\mathrm{DRS}}$ in (16). We now state the main result of this section.

Proposition 11.

The image denoiser (18) derived from (13) is contractive, while the denoiser derived from (16) is averaged.

The proof is given in Appendix 8. The core idea is to express the original system of overlapping patches in terms of non-overlapping patches. This allows us to use some form of orthogonality in the calculations.

4.5 Training

We trained (13) and (16) on patches extracted from images of the BSD500 dataset. Let $\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}\in\mathbb{R}^{p}$ be the clean patches (normalized to $[0,1]$ intensity), where $p=k^{2}$ and $N$ is the size of the training data. The noisy patches are generated as follows:

\boldsymbol{z}_{i}=\boldsymbol{x}_{i}+\sigma\boldsymbol{\eta},

where $\boldsymbol{\eta}\sim\mathcal{N}(\boldsymbol{0},\mathrm{I})$ and $\sigma$ is the noise level. Thus, the training examples are $(\boldsymbol{z}_{1},\boldsymbol{x}_{1}),\ldots,(\boldsymbol{z}_{N},\boldsymbol{x}_{N})$.

Let us use $\Theta$ to denote the parameters in (13) and (16). We will write the patch denoiser as $\mathrm{D}_{\text{patch}}(\cdot\,;\Theta)$ to account for the dependence on $\Theta$. The training problem is to optimize $\Theta$ such that $\mathrm{D}_{\text{patch}}(\boldsymbol{z}_{i};\Theta)\approx\boldsymbol{x}_{i}$ for all $i=1,\ldots,N$. Moreover, in view of Proposition 8, we consider the following optimization problem for training (13):

\underset{\Theta}{\min}\ \ \sum_{i=1}^{N}\ {\lVert\mathrm{D}_{\text{FBS}}(\boldsymbol{z}_{i};\Theta)-\boldsymbol{x}_{i}\rVert}^{2} \qquad (19)
\text{s.t.}\ \ \rho^{(1)},\ldots,\rho^{(L)}\in[\rho_{0},1],\quad \lambda^{(1)}_{j},\ldots,\lambda^{(L)}_{j}\in[\lambda_{0},\infty)\quad\big(j\in[1,p]\big).

The lower bounds $\rho_{0}\in(0,1)$ and $\lambda_{0}>0$ are used to ensure that the constraint set in (19) is closed; this will be useful when we project $\Theta$ onto the intervals.

On the other hand, in view of Proposition 10, we consider the following optimization for training (16):

\underset{\Theta}{\min}\ \ \sum_{i=1}^{N}\ {\lVert\mathrm{D}_{\text{DRS}}(\boldsymbol{z}_{i};\Theta)-\boldsymbol{x}_{i}\rVert}^{2} \qquad (20)
\text{s.t.}\ \ \rho^{(1)},\ldots,\rho^{(L)}\in[\rho_{0},\infty),\quad \lambda^{(1)}_{j},\ldots,\lambda^{(L)}_{j}\in[\lambda_{0},\infty)\quad\big(j\in[1,p]\big).
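The projection steps in (19) and (20) amount to clamping the unfolded parameters onto their intervals after every gradient update. The PyTorch sketch below illustrates this; the attribute names (model.rhos, model.lams, model.rho_max) are hypothetical and stand for the collections of $\rho^{(\ell)}$ and $\Lambda^{(\ell)}$.

```python
import torch

def train_unfolded_denoiser(model, loader, rho_min=1e-4, lam_min=1e-4,
                            epochs=5, lr=1e-3):
    """Projected training for (19)/(20): an Adam step on the squared loss,
    followed by projection of the parameters onto their intervals."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy, clean in loader:
            opt.zero_grad()
            loss = ((model(noisy) - clean) ** 2).sum()
            loss.backward()
            opt.step()
            with torch.no_grad():                          # projection step
                model.rhos.clamp_(rho_min, model.rho_max)  # rho_max = 1 for D_FBS
                model.lams.clamp_(min=lam_min)             # thresholds stay positive
    return model
```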
Table 1: Denoising performance (PSNR/SSIM, averaged) on the BSD68 dataset at different noise levels $\sigma$.

                BM3D            DnCNN           N-CNN           D_FBS           D_DRS
$\sigma=5$      37.52/0.9668    38.01/0.9696    33.35/0.9212    36.54/0.9480    36.62/0.9506
$\sigma=20$     29.65/0.8799    30.26/0.8948    27.99/0.8017    30.01/0.8344    29.99/0.8346
$\sigma=35$     27.10/0.7531    27.69/0.7809    25.17/0.6252    26.18/0.6988    26.17/0.6952

5 Experiments

For the unfolded network, we used a 10-fold cascade of Daubechies, Haar, and Symlet orthogonal wavelets in succession (the number of layers is $L=30$). We used a fixed patch size $k=64$. We solved (19) and (20) using the Adam optimizer kingma2014adam with mini-batches of size 128 and 5 training epochs. After the gradient step, performed using automatic differentiation, we enforced the constraints in (19) and (20) by projecting $\Theta$ onto the respective intervals, essentially implementing a form of projected gradient descent. The lower bounds for the trainable parameters in (19) and (20) were fixed to $\rho_{0}=\lambda_{0}=0.0001$. We trained separate denoisers over the noise range 5-50 in steps of 5. The trained models (500 KB for each noise level) are lightweight compared to DnCNN (2.3 MB).

The objective of the experiments in Sections 5.1, 5.2, and 5.3 is to analyze the denoising performance and the reconstruction accuracy of our denoisers (for deblurring and superresolution) and compare them with existing denoisers. Moreover, we present numerical evidence supporting the claimed contractive (averaged) property of the denoisers and the associated convergence guarantees for PnP and RED (Corollaries 3 and 6). We also present numerical evidence showing that state-of-the-art denoisers such as BM3D and DnCNN do not meet these guarantees.

5.1 Denoising

In Table 1, we compare the denoising performance of our denoisers with BM3D, DnCNN, and N-CNN on the Set12 dataset (see Fig. 2). Our denoisers trail the best denoiser, DnCNN, by 0.25-1.5 dB in PSNR and 0.02-0.08 in SSIM. However, they consistently outperform N-CNN across all noise levels. Recall that the purpose of our construction was to guarantee convergence of PnP and RED. In pursuit of this, we had to make design choices that resulted in a tradeoff in denoising performance. Importantly, we will see that the reconstruction capability of our denoisers is competitive with the above denoisers. This is not surprising since there is no concrete relation between the denoising performance and the restoration capacity of a denoiser. For instance, it has been previously shown that BM3D can yield superior restoration results compared to DnCNN in specific applications Ryu2019_PnP_trained_conv , even though DnCNN generally exhibits better denoising capabilities than BM3D. This suggests that the denoising gap is compensated in PnP and RED because the denoiser is applied in every iteration. However, this is just an intuitive guess, and we do not have a precise mathematical explanation.

Figure 3: Analysis of the proposed denoisers. Denoisers $\mathrm{D}_{\mathrm{FBS}}$ and $\mathrm{D}_{\mathrm{DRS}}$ are plugged into PnP-FBS, which is used for image deblurring. We plot the difference between successive iterates and the PSNR with iterations. We notice that the difference between iterates is strictly decreasing thanks to the averaged property, and the PSNR stabilizes in 50-100 iterations. In (c), we validate that PnP-FBS converges to a unique fixed point with $\mathrm{D}_{\mathrm{FBS}}$, thanks to the contraction property of the operator $\mathrm{T}_{\mathrm{PnP}}$. The experiment is performed with diverse initializations, namely, with $\boldsymbol{x}_{0}$ as all-zeros, all-ones, and random. The iterations stabilized in about 35 iterations. The MSE between the reconstructions obtained using different initializations is of the order 1e-8.
Table 2: Comparison of the residual $\|\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\|$ and PSNR with DnCNN for a deblurring experiment (see text for more details).

                                 number of iterations
denoiser                      10        25        50        75         100
DnCNN              residual   146.00    146.58    234.55    2300       2207248
                   PSNR       16.54     14.79     -4.09     -45.14     -84.68
Proposed (D_FBS)   residual   74.06     30.65     7.67      1.99       0.52
                   PSNR       20.42     22.96     23.69     23.74      23.74
Proposed (D_DRS)   residual   74.09     30.67     7.68      1.99       0.55
                   PSNR       20.43     23.00     23.74     23.79      23.79

We conducted a comprehensive assessment of various parameter configurations for the denoiser and selected the most suitable settings. Regarding the impact of $\mathbf{W}$, our results indicate that the choice of wavelet transform does not significantly affect the denoising performance. In contrast, we observed a significant enhancement in denoising performance by increasing the number of layers $L$ from 1 to 10. However, increasing $L$ beyond 10 yields marginal improvements in the generalization capability. Thus, we capped $L$ at 10 to keep the model compact without compromising the denoising performance. Regarding the stride $s$, our findings align with common trends in patch aggregation algorithms, namely, a higher stride is associated with decreased accuracy but reduced computational demands. We found that $s=8$ provides an optimal balance between accuracy and efficiency. Finally, the patch size plays a crucial role in the denoising performance. Keeping the stride constant, enlarging the patch size from 64 to 128 results in an accuracy improvement ranging from 0.10 to 0.25 dB and an enhancement in the SSIM score by 0.01 to 0.02. However, this adjustment also leads to an increase in the number of parameters of the denoiser.

5.2 Deblurring

The forward model for deblurring is \boldsymbol{y}=\mathbf{F}\boldsymbol{\xi}+\boldsymbol{\eta}, where \boldsymbol{y} is the blurred image, \mathbf{F} is a blur operator, and \boldsymbol{\eta} is white Gaussian noise dong2011image . For deblurring, the loss function is f(\boldsymbol{x})=\|\boldsymbol{y}-\mathbf{F}\boldsymbol{x}\|^{2}, where \mathbf{F} is the blur matrix. The gradient \nabla f is \beta-Lipschitz. Specifically, we use a normalized blur for all experiments; as a result, we have \beta=1 (\beta is the spectral norm of \mathbf{F}). For the deblurring experiments in Table 2 and Fig. 3, we took \mathbf{F} to be a Gaussian blur of size 25\times 25 and standard deviation 1.6, and the noise level is \sigma=0.04. The step size is set to \gamma=0.5 for PnP-FBS, which satisfies the condition \gamma\leqslant 2/\beta in Corollaries 3 and 6, and to \gamma=0.001 for RED-PG. The denoiser noise level is set to \sigma=5 for PnP-FBS (for both our denoisers) and to \sigma=10 for RED-PG.
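
To make the setup concrete, the following is a minimal sketch (not our exact experimental code) of PnP-FBS for deblurring. It assumes circular boundary conditions so that \mathbf{F} and \mathbf{F}^{T} can be applied via FFTs; the denoiser argument is a placeholder for the plugged denoiser (e.g., \mathrm{D}_{\mathrm{FBS}}), and for readability the gradient is written for the scaled loss \frac{1}{2}\|\boldsymbol{y}-\mathbf{F}\boldsymbol{x}\|^{2}.

    import numpy as np
    from numpy.fft import fft2, ifft2

    def gaussian_kernel(size=25, sigma=1.6):
        # 25 x 25 Gaussian blur kernel, as in the experiments above
        ax = np.arange(size) - (size - 1) / 2.0
        xx, yy = np.meshgrid(ax, ax)
        k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
        return k / k.sum()  # normalized blur

    def apply_blur(img, kernel):
        # circular convolution with the kernel via the FFT
        pad = np.zeros_like(img, dtype=float)
        kh, kw = kernel.shape
        pad[:kh, :kw] = kernel
        pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
        return np.real(ifft2(fft2(img) * fft2(pad)))

    def pnp_fbs_deblur(y, kernel, denoiser, gamma=0.5, iters=100):
        # PnP-FBS: x_{k+1} = D(x_k - gamma * grad f(x_k)), with grad f(x) = F^T (F x - y)
        x = y.astype(float).copy()
        flipped = kernel[::-1, ::-1]  # F^T is convolution with the flipped kernel
        residuals = []
        for _ in range(iters):
            grad = apply_blur(apply_blur(x, kernel) - y, flipped)
            x_new = denoiser(x - gamma * grad)
            residuals.append(np.linalg.norm(x_new - x))  # should decrease for an averaged denoiser
            x = x_new
        return x, residuals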

In Table 2, we present numerical evidence to show that, in the absence of a theoretical guarantee, simply plugging a denoiser (in this case DnCNN) into PnP-FBS can cause the iterations to diverge. Indeed, the distance between successive iterates \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert is seen to diverge for DnCNN, which means that the sequence (\boldsymbol{x}_{k}) does not converge. This results in spurious reconstructions, as evident from the PSNR values in Table 2. On the other hand, we notice that \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert is decreasing for the proposed denoisers, thanks to their nonexpansive property.

 

[Figure 4 panels, left to right: (a) Input, (b) BM3D, (c) DnCNN, (d) N-CNN, (e) \mathrm{D}_{\mathrm{FBS}}, (f) \mathrm{D}_{\mathrm{DRS}}. PSNR (dB)/SSIM for the first image: (14.12, 0.2940), (36.52, 0.9458), (36.21, 0.9316), (32.73, 0.8885), (\mathbf{36.79}, \mathbf{0.9511}), (36.43, 0.9465); for the second image: (18.51, 0.4142), (36.23, 0.9287), (35.27, 0.9143), (32.91, 0.8755), (\mathbf{36.50}, \mathbf{0.9422}), (36.13, 0.9365).]
Figure 4: Deblurring using RED-PG with different denoisers. The values are displayed in the order (PSNR (dB), SSIM). The performance of our denoisers is comparable to BM3D and DnCNN; in particular, notice the zoomed regions, where our denoisers recover finer details.
Table 3: Deblurring using RED-PG (PSNR/SSIM) for blur kernel 1 in levin2009understanding (see text for more details).

image | DnCNN | N-CNN | \mathrm{D}_{\mathrm{FBS}} | \mathrm{D}_{\mathrm{DRS}}
barbara | \mathbf{25.08}/0.7307 | 23.98/0.6892 | \mathbf{25.08}/\mathbf{0.7410} | 24.98/0.7371
ship | 27.39/0.7502 | 26.82/0.7145 | \mathbf{27.52}/\mathbf{0.7627} | 27.47/0.7600
cameraman | 24.92/0.7867 | \mathbf{25.01}/0.7700 | 24.96/\mathbf{0.7808} | 24.91/\mathbf{0.7808}
couple | 26.95/0.7346 | 26.50/0.7046 | \mathbf{27.15}/\mathbf{0.7541} | 27.09/0.7508
fingerprint | 25.86/0.8773 | 23.55/0.7817 | \mathbf{25.99}/\mathbf{0.8843} | 25.90/0.8809
hill | 28.41/0.7301 | 28.19/0.7125 | \mathbf{28.71}/\mathbf{0.7573} | 28.64/0.7532
house | \mathbf{30.29}/\mathbf{0.8310} | 29.85/0.8252 | 29.92/0.8269 | 29.90/0.8278
lena | \mathbf{30.78}/\mathbf{0.8547} | 30.36/0.8421 | 30.73/0.8540 | 30.68/0.8545
man | 28.20/0.7749 | 27.94/0.7550 | \mathbf{28.45}/\mathbf{0.7928} | 28.38/0.7902
montage | 24.56/\mathbf{0.8881} | 25.10/0.8786 | \mathbf{24.78}/0.8653 | 24.70/0.8683
peppers | \mathbf{24.96}/0.8227 | 24.95/0.8174 | 24.95/\mathbf{0.8255} | 24.90/0.8252

Next, we present numerical evidence of PnP-FBS convergence using the proposed denoisers. This is done using a deblurring experiment on the Set12 dataset (the settings are the same as in Table 2). The results are shown in Fig. 3 (a-b), where we plot \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert and the PSNR for different k. As expected, \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert is decreasing and the PSNR stabilizes in less than 50 iterations. We also demonstrate that, owing to the contractivity of the PnP-FBS operator \mathrm{T}_{\mathrm{PnP}} with our denoiser \mathrm{D}_{\mathrm{FBS}}, the iterates converge to a unique fixed point even if we start from diverse initializations; the final reconstructions are identical, as shown in Fig. 3(c), where the MSE between the different reconstructions is of the order 1\mathrm{e}\mbox{-}8.

We next compare the regularization capacity of our denoisers with BM3D, DnCNN, and N-CNN for deblurring, where \mathbf{F} is a horizontal motion blur. This is done for the RED-PG algorithm. The results are shown for a couple of images in Fig. 4. For the two images from Fig. 2 (top row), which have a mix of delicate textures, smooth regions, and broader features, the reconstructions from our denoisers exhibit finer details than those obtained with the other denoisers.

A detailed comparison for RED-PG is provided in Table 3, where \mathbf{F} is one of the blur kernels from levin2009understanding . We notice that our performance is comparable with DnCNN.

Table 4: Deblurring using PnP-FBS: comparison with the denoiser in repetti2022dual . See Table 1 in repetti2022dual for the notations A-E.

\sigma | method | A | B | C | D | E
0.03 | Noisy | 12.78 | 19.29 | 16.06 | 16.24 | 22.54
0.03 | DualFB | 16.33 | 24.81 | 22.20 | 20.41 | 28.34
0.03 | \mathrm{D}_{\mathrm{DRS}} | 15.83 | 23.97 | 21.35 | 20.28 | 28.28
0.05 | Noisy | 12.12 | 18.11 | 15.12 | 14.58 | 20.60
0.05 | DualFB | 14.97 | 23.54 | 20.65 | 19.31 | 27.47
0.05 | \mathrm{D}_{\mathrm{DRS}} | 14.42 | 21.85 | 18.42 | 18.55 | 25.60
0.08 | Noisy | 10.82 | 16.11 | 13.44 | 12.10 | 17.85
0.08 | DualFB | 14.49 | 22.75 | 19.68 | 18.71 | 26.93
0.08 | \mathrm{D}_{\mathrm{DRS}} | 13.74 | 21.00 | 17.83 | 17.89 | 24.97
Table 5: Deblurring using PnP-FBS: comparison with the denoiser in hurault2022proximal , averaged over the CBSD68 dataset for 10 different kernels.

Method | \sigma=0.01 | \sigma=0.03 | \sigma=0.05
Prox-PnP | 30.57 | 27.80 | 26.61
Proposed | 28.50 | 27.11 | 26.43

In Table 4, we compare \mathrm{D}_{\mathrm{DRS}} with the denoiser in repetti2022dual , which we refer to as “DualFB”. We replicated the experimental setup used for the deblurring experiment in (repetti2022dual, Table 1), namely, an asymmetric blur operation followed by the addition of noise at various levels. For a fair comparison, similar to the experiment in repetti2022dual , we use PnP-FBS as the reconstruction algorithm for our denoiser and the signal-to-noise ratio (SNR) as the performance metric. We see from Table 4 that our denoiser closely parallels DualFB. It is not surprising that DualFB yields somewhat better results, since the linear layers are optimized in repetti2022dual , whereas we use fixed handcrafted transforms.

We next compare \mathrm{D}_{\mathrm{DRS}} with the “Prox-PnP” denoiser from hurault2022proximal ; the comparison is performed by using them as regularizers within the PnP-FBS framework. We replicated the experimental setup of the deblurring experiment in (hurault2022proximal, Table 3), in which the PSNR is averaged over the CBSD68 dataset for 10 different kernels and different noise levels. As shown in Table 5, our denoiser exhibits comparable reconstruction capacity to Prox-PnP, even though our grayscale denoiser is applied channel-by-channel. It is worth noting that our denoiser is contractive by construction, whereas the denoiser in hurault2022proximal is “softly” enforced to be 1-Lipschitz using a loss-based penalization. We expect that training our denoiser specifically on color images would further boost the color deblurring performance.

In light of the above comparisons, we wish to emphasize that the key advantage of our denoiser is that it comes with convergence guarantees. On the other hand, the convergence result in (repetti2022dual, , Proposition 2) assumes that the same linear transform is used across all layers, and this is generally not true if we optimize the transforms. It remains to be seen if it is possible to optimize our construction to match the denoiser in repetti2022dual .

[Figure 5 panels: (a) BM3D, (b) DnCNN.]
Figure 5: Violation of the nonexpansive property for BM3D, DnCNN, and N-CNN. We perform image superresolution with different denoisers plugged into PnP-FBS. A lower bound for the Lipschitz constant of the respective denoisers is computed from the data encountered in the PnP pipeline. More precisely, since the Lipschitz constant is defined to be the supremum of \|\mathrm{D}(\boldsymbol{u})-\mathrm{D}(\boldsymbol{v})\|/\|\boldsymbol{u}-\boldsymbol{v}\| over all possible inputs \boldsymbol{u} and \boldsymbol{v}, setting \boldsymbol{u}=\boldsymbol{x}_{k} and \boldsymbol{v}=\boldsymbol{x}_{k+1} gives a lower bound for the Lipschitz constant at each iteration. Note that for BM3D, DnCNN, and N-CNN, the lower bound is >1 in many iterations, proving that these denoisers are not nonexpansive. As expected, the lower bound for the Lipschitz constant of the proposed denoisers remains at most 1.
[Figure 6 panels: (a) Bicubic, (b) BM3D (\mathbf{31.03}, \mathbf{0.9250}), (c) DnCNN (30.22, 0.9167), (d) N-CNN (28.92, 0.8925), (e) \mathrm{D}_{\mathrm{FBS}} (30.61, 0.9167), (f) \mathrm{D}_{\mathrm{DRS}} (30.48, \mathbf{0.9202}), (g) Convergence plot.]
Figure 6: Superresolution using PnP-FBS with different denoisers. The PSNR and SSIM after 30 iterations are displayed. Reconstructions obtained from our denoisers (\mathrm{D}_{\mathrm{FBS}} and \mathrm{D}_{\mathrm{DRS}}) are comparable with BM3D and DnCNN. The convergence plot suggests that the PnP iterates diverge for DnCNN and N-CNN.

5.3 Superresolution

A widely used model for superresolution is \boldsymbol{y}=\mathbf{S}\mathbf{B}\boldsymbol{\xi}+\boldsymbol{\eta}, where \mathbf{B}\in\mathbb{R}^{n\times n} and \mathbf{S}\in\mathbb{R}^{m\times n} are the blur and subsampling operators (m=n/k, k>1), and \boldsymbol{\eta} is Gaussian noise chan2017plug ; nair2021fixed . For superresolution, the loss function is f(\boldsymbol{x})=\|\boldsymbol{y}-\mathbf{S}\mathbf{B}\boldsymbol{x}\|^{2}, where \mathbf{S} is a decimation matrix and \mathbf{B} is a blur matrix. As in deblurring, we used a normalized blur. In this case, \beta is bounded by the product of the spectral norms of \mathbf{S} and \mathbf{B}, so we have \beta\leqslant 1. In PnP-FBS, we used the step size \gamma=1.0, which satisfies the condition \gamma\leqslant 2/\beta in Corollaries 3 and 6. We set the noise level to \sigma=5 for both our denoisers in all experiments.
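
The following is a minimal sketch of the superresolution forward operator \mathbf{S}\mathbf{B} and one PnP-FBS step. It assumes blur and blur_adj are callables implementing \mathbf{B} and \mathbf{B}^{T} (e.g., as in the deblurring sketch above), the decimation keeps every k-th pixel along each axis, and the gradient is written for the scaled loss \frac{1}{2}\|\boldsymbol{y}-\mathbf{S}\mathbf{B}\boldsymbol{x}\|^{2}; the function names are illustrative and not part of our implementation.

    import numpy as np

    def downsample(img, k):
        # S: keep every k-th pixel along each axis
        return img[::k, ::k]

    def upsample_adjoint(low, k, shape):
        # S^T: place the samples back on the coarse grid and zero-fill elsewhere
        up = np.zeros(shape)
        up[::k, ::k] = low
        return up

    def sr_gradient(x, y, blur, blur_adj, k):
        # gradient of f(x) = (1/2) ||y - S B x||^2, i.e., B^T S^T (S B x - y)
        residual = downsample(blur(x), k) - y
        return blur_adj(upsample_adjoint(residual, k, x.shape))

    def pnp_fbs_sr_step(x, y, blur, blur_adj, k, denoiser, gamma=1.0):
        # one PnP-FBS iteration: x <- D(x - gamma * grad f(x))
        return denoiser(x - gamma * sr_gradient(x, y, blur, blur_adj, k))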

Table 6: Residual \|\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\| for PnP-FBS with different denoisers.

denoiser | k=10 | k=50 | k=100 | k=500 | k=1000 | k=3000
N-CNN Ryu2019_PnP_trained_conv | 0.54 | 0.20 | 0.19 | 0.36 | 23.45 | 3\mathrm{e}+12
\mathrm{D}_{\mathrm{FBS}} | 0.31 | 0.15 | 0.11 | 0.0602 | 0.0513 | 2\mathrm{e}-10
\mathrm{D}_{\mathrm{DRS}} | 0.32 | 0.16 | 0.11 | 0.0658 | 0.0562 | 2\mathrm{e}-10
Table 7: Superresolution using PnP-FBS: comparison of the reconstruction capacity (PSNR/SSIM) of different denoisers (see text for details).

image | DnCNN | N-CNN | \mathrm{D}_{\mathrm{FBS}} | \mathrm{D}_{\mathrm{DRS}}
cameraman | \mathbf{27.12}/0.8519 | 26.89/0.8450 | 26.60/0.8616 | 26.63/\mathbf{0.8540}
house | 33.29/0.8695 | 32.84/0.8736 | 32.75/0.8531 | \mathbf{33.70}/\mathbf{0.8780}
peppers | \mathbf{28.85}/0.9071 | 28.43/0.9027 | 27.74/\mathbf{0.9079} | 27.67/0.9059
starfish | 29.87/0.9014 | 27.59/0.8609 | 28.80/0.9041 | \mathbf{28.87}/\mathbf{0.8979}
butterfly | \mathbf{30.28}/0.9403 | 30.36/0.9340 | 28.81/0.9341 | 28.95/\mathbf{0.9318}
plane | \mathbf{26.77}/0.8734 | 25.97/0.8520 | 26.07/0.8770 | 26.23/\mathbf{0.8735}
parrot | \mathbf{28.18}/0.8866 | 27.60/0.8865 | 27.62/0.8998 | 27.59/\mathbf{0.8934}
lena | 34.00/\mathbf{0.8980} | 33.21/0.8942 | 33.87/0.9118 | \mathbf{33.76}/\mathbf{0.9063}
barbara | 24.87/0.7623 | 24.52/0.7360 | 25.00/0.7775 | \mathbf{24.90}/\mathbf{0.7629}
ship | 30.29/0.8414 | 29.33/0.8065 | 30.02/0.8531 | \mathbf{29.93}/\mathbf{0.8430}
man | 30.79/0.8595 | 30.25/0.8409 | 30.83/0.8800 | \mathbf{30.67}/\mathbf{0.8678}
couple | 29.96/0.8436 | 28.98/0.8105 | 29.57/0.8522 | \mathbf{29.51}/\mathbf{0.8440}

For the results in Fig. 5 and Table 6, \mathbf{B} is a Gaussian blur of size 25\times 25 and standard deviation 1.6, and \sigma=0.04. We start by presenting numerical evidence that BM3D, DnCNN, and N-CNN are not nonexpansive, even if we restrict the input to images encountered in the PnP pipeline. More precisely, we apply PnP-FBS for image superresolution and compute the output-input ratio \|\boldsymbol{x}_{k+1}-\boldsymbol{x}_{k}\|/\|\boldsymbol{z}_{k+1}-\boldsymbol{z}_{k}\| for the denoising operation \boldsymbol{x}_{k+1}=\mathrm{D}(\boldsymbol{z}_{k+1}). This gives a lower bound for the Lipschitz constant of the plugged denoiser. In particular, if any of these ratios is >1, the Lipschitz constant must exceed 1, and the denoiser cannot be nonexpansive. The ratio for different denoisers across iterations is shown in Fig. 5. We notice that the ratio is >1 at many iterations for BM3D, DnCNN, and N-CNN, proving that these denoisers are not nonexpansive. For N-CNN, the residual is constrained to be nonexpansive, but this constraint is also violated, as shown in Fig. 5(c). On the other hand, the ratio is always <1 for our denoisers, which is consistent with the theoretical results in Section 4. Similar to the counterexample in Table 2, we provide a counterexample in Table 6 that suggests divergence of the iterates on plugging N-CNN into PnP-FBS. This is consistent with the observation in Fig. 5 that N-CNN violates nonexpansivity.
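
The check described above amounts to the following computation; this is a minimal sketch, where inputs stands for the sequence of denoiser inputs \boldsymbol{z}_{k} logged during the PnP-FBS iterations, and the function name is illustrative.

    import numpy as np

    def lipschitz_lower_bounds(denoiser, inputs):
        # ratios ||D(z_{k+1}) - D(z_k)|| / ||z_{k+1} - z_k|| over the logged denoiser inputs;
        # any ratio > 1 certifies that the denoiser is not nonexpansive
        outputs = [denoiser(z) for z in inputs]
        ratios = []
        for k in range(len(inputs) - 1):
            den = np.linalg.norm(inputs[k + 1] - inputs[k])
            if den > 0:
                ratios.append(np.linalg.norm(outputs[k + 1] - outputs[k]) / den)
        return ratios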

We next compare our denoisers with BM3D, DnCNN, and N-CNN for the PnP-FBS algorithm using just downsampling (\mathbf{B} is the identity operator). The results are shown in Fig. 6 for \sigma=5/255. It is clear from the visual results and the metrics that the restorations using our denoisers are comparable with BM3D and DnCNN. We also report the residual \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert for sufficiently large k; it goes to zero monotonically for our denoisers but is seen to diverge for DnCNN and N-CNN. A detailed comparison for PnP-FBS is shown in Table 7, where the blur and noise level are the same as in Fig. 5. Again, we notice that our performance is comparable with DnCNN.

In summary, the findings from our empirical study are as follows:

1. The numerical results in Fig. 5 and Tables 2 and 6 show that denoisers such as BM3D, DnCNN, and N-CNN are not guaranteed to be nonexpansive and can cause PnP algorithms to break down.

2. Fig. 3 demonstrates that the proposed denoisers exhibit empirical convergence in agreement with the theoretical predictions in Corollaries 3 and 6.

3. From the results in Figures 4 and 6, and the extensive comparisons in Tables 1, 3, and 7, we conclude that the regularization capacity of the proposed denoisers is comparable with BM3D and DnCNN for deblurring and superresolution.

6 Conclusion

We trained averaged and contractive patch denoisers by unfolding FBS and DRS applied to wavelet denoising. We extended them to full image denoising using patch aggregation while preserving their averaged (contractive) properties. Moreover, by using sufficiently many layers, we showed that their regularization capacity for PnP and RED can be brought on par with BM3D and DnCNN. Unlike existing CNN denoisers, we can guarantee convergence of PnP and RED using the proposed denoisers, and we presented numerical evidence to back these results. To our knowledge, this is the first work to exploit the properties of proximal operators in unfolded algorithms to develop contractive denoisers. In future work, we wish to go beyond classical wavelet denoising and develop unfolded denoisers with superior denoising and regularization capacity.

7 Funding

K. N. Chaudhury was supported by research grants CRG/2020/000527 and STR/2021/000011 from the Science and Engineering Research Board, Government of India.

8 Appendix

We prove Proposition 11 in this section. First, we recall some notations from Section 4.4.

1. For integers p\leqslant q, we denote [p,q]=\{p,p+1,\ldots,q\}, [p,q]_{s}=\{ps,(p+1)s,\ldots,qs\}, [p,q]^{2}=[p,q]\times[p,q], and [p,q]_{s}^{2}=[p,q]_{s}\times[p,q]_{s}.

2. The input to the denoiser is an image \mathbf{X}:\Omega\to\mathbb{R}, where \Omega=[0,q-1]^{2}. We also consider the image in matrix form as \mathbf{X}\in\mathbb{R}^{q\times q}.

3. We use circular shifts: for \boldsymbol{i}\in\Omega and \boldsymbol{\tau}\in\mathbb{Z}^{2}, \boldsymbol{i}-\boldsymbol{\tau} is defined as \left((i_{1}-\tau_{1})\ \mathrm{mod}\ q,\,(i_{2}-\tau_{2})\ \mathrm{mod}\ q\right).

4. We extract patches with stride s. More precisely, the starting coordinates of the patches are

\mathcal{J}=[0,(q_{s}-1)]_{s}^{2},

where q_{s}=q/s. The cardinality of \mathcal{J} is q_{s}^{2} (the total number of patches), and each pixel belongs to k_{s}^{2} patches, where k_{s}=k/s.

5. For \boldsymbol{i}\in\Omega, the patch operator \mathcal{P}_{\boldsymbol{i}}:\mathbb{R}^{q\times q}\to\mathbb{R}^{p} is defined in (4.4).

We now elaborate on the definition of the adjoint in (18). For a patch vector \boldsymbol{z}\in\mathbb{R}^{p}, if we let \mathbf{Y}=\mathcal{P}^{*}_{\boldsymbol{i}}(\boldsymbol{z}), then \mathcal{P}_{\boldsymbol{i}}(\mathbf{Y})=\boldsymbol{z} and \mathbf{Y}(\boldsymbol{j})=0 for all \boldsymbol{j}\notin\Omega_{\boldsymbol{i}}, where

\Omega_{\boldsymbol{i}}=\left\{\boldsymbol{i}+\boldsymbol{\tau}:\ \boldsymbol{\tau}\in[0,k-1]^{2}\right\}.

This completely specifies \mathbf{Y}. In particular, we have the following property of an adjoint: for all \mathbf{X}\in\mathbb{R}^{q\times q} and \boldsymbol{z}\in\mathbb{R}^{p},

\langle\boldsymbol{z},\mathcal{P}_{\boldsymbol{i}}(\mathbf{X})\rangle_{\mathbb{R}^{p}}=\langle\mathcal{P}_{\boldsymbol{i}}^{*}(\boldsymbol{z}),\mathbf{X}\rangle_{\mathbb{R}^{q\times q}}, \qquad (21)

where the inner product on the left (resp. right) is the standard Euclidean inner product on \mathbb{R}^{p} (resp. \mathbb{R}^{q\times q}).
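
For concreteness, the adjoint identity (21) can be verified numerically with the following sketch, which assumes the circular indexing of item 3 above; the function names are illustrative and not part of our implementation.

    import numpy as np

    def patch(X, i, k):
        # P_i: extract the k x k patch of X starting at i = (i1, i2), with circular indexing
        rows = (i[0] + np.arange(k)) % X.shape[0]
        cols = (i[1] + np.arange(k)) % X.shape[1]
        return X[np.ix_(rows, cols)].ravel()

    def patch_adjoint(z, i, k, q):
        # P_i^*: place the patch vector z at location i and zero-fill the rest of the image
        Y = np.zeros((q, q))
        rows = (i[0] + np.arange(k)) % q
        cols = (i[1] + np.arange(k)) % q
        Y[np.ix_(rows, cols)] = z.reshape(k, k)
        return Y

    # numerical check of the adjoint identity (21)
    rng = np.random.default_rng(0)
    q, k, i = 16, 4, (3, 7)
    X = rng.standard_normal((q, q))
    z = rng.standard_normal(k * k)
    lhs = z @ patch(X, i, k)                     # <z, P_i(X)>
    rhs = np.sum(patch_adjoint(z, i, k, q) * X)  # <P_i^*(z), X>
    assert np.isclose(lhs, rhs)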

The main idea behind Proposition 11 is to express the original system of overlapping patches in terms of non-overlapping patches. In particular, let q_{k}=q/k; since we assume q to be a multiple of k, q_{k} is an integer. Let

\mathcal{J}_{0}=[0,(q_{k}-1)]_{k}^{2}.

By construction, \mathcal{J}_{0}\subseteq\mathcal{J} and the points in \mathcal{J}_{0} are the starting coordinates of non-overlapping patches. It is not difficult to see that \mathcal{J} is the disjoint union of (circular) shifts of \mathcal{J}_{0}:

\mathcal{J}=\bigcup_{\boldsymbol{i}\in[0,k_{s}-1]^{2}}\,\big(\mathcal{J}_{0}+s\boldsymbol{i}\big), \qquad (22)

where [0,k_{s}-1]:=\{0,1,\ldots,k_{s}-1\}.

Using (22), we can decompose (18) as follows:

\mathrm{D}=\frac{1}{k_{s}^{2}}\sum_{\boldsymbol{i}\in[0,k_{s}-1]^{2}}\mathrm{D}_{\boldsymbol{i}}, \qquad (23)

where

\mathrm{D}_{\boldsymbol{i}}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \left(\mathcal{P}^{*}_{\boldsymbol{j}+s\boldsymbol{i}}\circ\mathrm{D}_{\mathrm{patch}}\circ\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}\right). \qquad (24)

The proof of Proposition 11 follows from a couple of observations. The first is an observation about operator averaging. We say that an operator \mathrm{T} is L-contractive if its Lipschitz constant is at most L.

Lemma 12.

Let \mathrm{T}_{1},\ldots,\mathrm{T}_{k} be L-contractive. Then, their average,

\frac{1}{k}(\mathrm{T}_{1}+\cdots+\mathrm{T}_{k}), \qquad (25)

is L-contractive. On the other hand, if \mathrm{T}_{1},\ldots,\mathrm{T}_{k} are \theta-averaged, then (25) is a \theta-averaged operator.

The second observation relates the contractive (averaged) property of the patch denoiser \mathrm{D}_{\mathrm{patch}} to that of the image denoisers \mathrm{D}_{\boldsymbol{i}} in (24).

Lemma 13.

For \boldsymbol{i}\in[0,k_{s}-1]^{2}, the following hold.

1. If \mathrm{D}_{\mathrm{patch}} is L-contractive, then \mathrm{D}_{\boldsymbol{i}} is L-contractive.

2. If \mathrm{D}_{\mathrm{patch}} is \theta-averaged, then \mathrm{D}_{\boldsymbol{i}} is \theta-averaged.

To establish Proposition 11, note from (23) that \mathrm{D} is the average of \{\mathrm{D}_{\boldsymbol{i}}\}. The desired conclusion follows immediately from Lemmas 12 and 13.

We now give the proofs of Lemmas 12 and 13.

For Lemma 12, the second part follows from (bauschke2017convex, Proposition 4.30); however, we give the proofs of both parts for completeness.

Proof: [Proof of Lemma 12] Let \mathrm{T}=(1/k)(\mathrm{T}_{1}+\cdots+\mathrm{T}_{k}), where \mathrm{T}_{1},\ldots,\mathrm{T}_{k} are L-contractive. Then, for i=1,\ldots,k and for all \boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}^{n},

\lVert\mathrm{T}_{i}(\boldsymbol{x})-\mathrm{T}_{i}(\boldsymbol{x}^{\prime})\rVert\leqslant L\,\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert. \qquad (26)

We can write

\mathrm{T}(\boldsymbol{x})-\mathrm{T}(\boldsymbol{x}^{\prime})=\frac{1}{k}\sum_{i=1}^{k}\,\left(\mathrm{T}_{i}(\boldsymbol{x})-\mathrm{T}_{i}(\boldsymbol{x}^{\prime})\right).

Using the triangle inequality and (26), we get

\lVert\mathrm{T}(\boldsymbol{x})-\mathrm{T}(\boldsymbol{x}^{\prime})\rVert\leqslant\frac{1}{k}\sum_{i=1}^{k}\,L\,\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert=L\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert.

This establishes the first part of Lemma 12.

Next, suppose that \mathrm{T}_{1},\ldots,\mathrm{T}_{k} are \theta-averaged, i.e., there exist nonexpansive operators \mathrm{N}_{1},\ldots,\mathrm{N}_{k} such that

\mathrm{T}_{i}=(1-\theta)\mathrm{I}+\theta\mathrm{N}_{i}\qquad(i=1,\ldots,k).

Then

\mathrm{T}=\frac{1}{k}(\mathrm{T}_{1}+\cdots+\mathrm{T}_{k})=(1-\theta)\mathrm{I}+\theta\mathrm{N},

where \mathrm{N}=(1/k)(\mathrm{N}_{1}+\cdots+\mathrm{N}_{k}).

We can again use the triangle inequality to show that \mathrm{N} is nonexpansive. Thus, we have shown that \mathrm{T} is \theta-averaged, which completes the proof.
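
Explicitly, for all \boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}^{n},

\lVert\mathrm{N}(\boldsymbol{x})-\mathrm{N}(\boldsymbol{x}^{\prime})\rVert\leqslant\frac{1}{k}\sum_{i=1}^{k}\lVert\mathrm{N}_{i}(\boldsymbol{x})-\mathrm{N}_{i}(\boldsymbol{x}^{\prime})\rVert\leqslant\frac{1}{k}\sum_{i=1}^{k}\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert=\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert.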

We need the following result to establish Lemma 13.

Lemma 14.

For any fixed s and \boldsymbol{i}\in[0,k_{s}-1]^{2}, we have the following orthogonality properties:

1. For all \mathbf{X}:\Omega\to\mathbb{R},

\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \lVert\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X})\rVert^{2}={\lVert\mathbf{X}\rVert}^{2}. \qquad (27)

We can equivalently write this as

\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}^{*}\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}=\mathrm{I}, \qquad (28)

where \mathrm{I} is the identity operator on \mathbb{R}^{q\times q}.

2. For all \{\boldsymbol{z}_{\boldsymbol{j}}\in\mathbb{R}^{p}:\,\boldsymbol{j}\in\mathcal{J}_{0}\},

\big\lVert\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}^{*}(\boldsymbol{z}_{\boldsymbol{j}})\big\rVert^{2}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \lVert\boldsymbol{z}_{\boldsymbol{j}}\rVert^{2}. \qquad (29)

It is understood that the norms on the left in (27) are Euclidean norms on \mathbb{R}^{p}, while the norm on the right is the Euclidean norm on \mathbb{R}^{q\times q}.
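
As a quick numerical sanity check (not part of the proof), the identities (27) and (29) can be verified for the zero-shift tiling \mathcal{J}_{0}; the general circular-shift case is analogous. The snippet below is a minimal sketch with illustrative variable names.

    import numpy as np

    # check (27) and (29) for the zero-shift non-overlapping tiling J_0
    rng = np.random.default_rng(1)
    q, k = 16, 4
    J0 = [(a * k, b * k) for a in range(q // k) for b in range(q // k)]

    X = rng.standard_normal((q, q))

    # (27): the non-overlapping patches carry exactly the energy of the image
    energy = sum(np.sum(X[r:r + k, c:c + k] ** 2) for (r, c) in J0)
    assert np.isclose(energy, np.sum(X ** 2))

    # (29): the adjoints of a non-overlapping tiling have orthogonal ranges
    Z = {j: rng.standard_normal((k, k)) for j in J0}
    Y = np.zeros((q, q))
    for (r, c), z in Z.items():
        Y[r:r + k, c:c + k] += z  # P^*_j(z_j) placed on its own tile
    assert np.isclose(np.sum(Y ** 2), sum(np.sum(z ** 2) for z in Z.values()))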

We can establish Lemma 14 using property (21) and the fact that the points in \mathcal{J}_{0} are the starting coordinates of non-overlapping patches. We will skip this straightforward calculation. We are ready to prove Lemma 13.

Proof: [Proof of Lemma 13] Let \mathrm{D}_{\mathrm{patch}} be L-contractive, i.e., for all \boldsymbol{z},\boldsymbol{z}^{\prime}\in\mathbb{R}^{p},

\lVert\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z})-\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}^{\prime})\rVert\leqslant L\lVert\boldsymbol{z}-\boldsymbol{z}^{\prime}\rVert. \qquad (30)

We have to show that, for all \mathbf{X},\mathbf{X}^{\prime}\in\mathbb{R}^{q\times q},

\lVert\mathrm{D}_{\boldsymbol{i}}(\mathbf{X})-\mathrm{D}_{\boldsymbol{i}}(\mathbf{X}^{\prime})\rVert^{2}\leqslant L^{2}\lVert\mathbf{X}-\mathbf{X}^{\prime}\rVert^{2}. \qquad (31)

From (27) and (29), we can write

\lVert\mathrm{D}_{\boldsymbol{i}}(\mathbf{X})-\mathrm{D}_{\boldsymbol{i}}(\mathbf{X}^{\prime})\rVert^{2}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}{\|\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}_{\boldsymbol{j}})-\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}^{\prime}_{\boldsymbol{j}})\|}^{2},

where \boldsymbol{z}_{\boldsymbol{j}}=\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}) and \boldsymbol{z}^{\prime}_{\boldsymbol{j}}=\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}^{\prime}). From (30),

\lVert\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}_{\boldsymbol{j}})-\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}^{\prime}_{\boldsymbol{j}})\rVert\leqslant L\,\lVert\boldsymbol{z}_{\boldsymbol{j}}-\boldsymbol{z}^{\prime}_{\boldsymbol{j}}\rVert=L\,\lVert\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}-\mathbf{X}^{\prime})\rVert.

Moreover, we have from (27) that

\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \lVert\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}-\mathbf{X}^{\prime})\rVert^{2}=\lVert\mathbf{X}-\mathbf{X}^{\prime}\rVert^{2}.

Combining the above, we get (31). This establishes part 1 of Lemma 13.

Now, let \mathrm{D}_{\mathrm{patch}} be \theta-averaged, i.e.,

\mathrm{D}_{\mathrm{patch}}=(1-\theta)\mathrm{I}+\theta\mathrm{N}, \qquad (32)

where \mathrm{N} is nonexpansive. Substituting (32) in (24), and using (28), we can write

\mathrm{D}_{\boldsymbol{i}}=(1-\theta)\mathrm{I}+\theta\mathrm{G}_{\boldsymbol{i}},

where

\mathrm{G}_{\boldsymbol{i}}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \left(\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}^{*}\circ\mathrm{N}\circ\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}\right). \qquad (33)

Using (27) and (29), and the fact that \mathrm{N} is nonexpansive, we can show (the calculation is similar to the one used earlier) that, for all \mathbf{X},\mathbf{X}^{\prime}\in\mathbb{R}^{q\times q},

\lVert\mathrm{G}_{\boldsymbol{i}}(\mathbf{X})-\mathrm{G}_{\boldsymbol{i}}(\mathbf{X}^{\prime})\rVert^{2}\leqslant\lVert\mathbf{X}-\mathbf{X}^{\prime}\rVert^{2}.

That is, \mathrm{G}_{\boldsymbol{i}} is nonexpansive. It follows that \mathrm{D}_{\boldsymbol{i}} is \theta-averaged. This establishes part 2 of Lemma 13.

References

  • (1) A. Ribes, F. Schmitt, Linear inverse problems in imaging. IEEE Signal Process. Mag. 25(4), 84–99 (2008)
  • (2) H.W. Engl, M. Hanke, A. Neubauer, Regularization of Inverse Problems (Kluwer Academic Publishers, Dordrecht, Netherlands, 1996)
  • (3) W. Dong, L. Zhang, G. Shi, X. Wu, Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Trans. Image Process. 20(7), 1838–1857 (2011)
  • (4) G. Jagatap, C. Hegde, Algorithmic guarantees for inverse imaging with untrained network priors. Proc. Adv. Neural Inf. Process. Syst. pp. 14,832–14,842 (2019)
  • (5) S.H. Chan, X. Wang, O.A. Elgendy, Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Trans. Comput. Imaging 3(1), 84–98 (2017)
  • (6) A. Rond, R. Giryes, M. Elad, Poisson inverse problems by the plug-and-play scheme. J. Vis. Commun. Image Represent. 41, 96–108 (2016)
  • (7) J.M. Bioucas-Dias, M. Figueiredo, Multiplicative noise removal using variable splitting and constrained optimization. IEEE Trans. Image Process. 19(7), 1720–1730 (2010)
  • (8) L.I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60(1-4), 259–268 (1992)
  • (9) E.J. Candes, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted \ell_{1} minimization. J. Fourier Anal. Appl. 14(5), 877–905 (2008)
  • (10) W. Dong, X. Li, L. Zhang, G. Shi, Sparsity-based image denoising via dictionary learning and structural clustering. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. pp. 457–464 (2011)
  • (11) L. Zhang, W. Zuo, Image restoration: From sparse and low-rank priors to deep priors [lecture notes]. IEEE Signal Process. Mag. 34(5), 172–179 (2017)
  • (12) G.B. Passty, Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. J. Math. Anal. Appl. 72(2), 383–390 (1979)
  • (13) H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. (Springer, New York, NY, USA, 2017)
  • (14) E. Ryu, S. Boyd, Primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)
  • (15) A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
  • (16) S. Sreehari, S.V. Venkatakrishnan, B. Wohlberg, G.T. Buzzard, L.F. Drummy, J.P. Simmons, C.A. Bouman, Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Trans. Comput. Imaging 2(4), 408–423 (2016)
  • (17) K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
  • (18) K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
  • (19) E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, W. Yin, Plug-and-play methods provably converge with properly trained denoisers. Proc. Intl. Conf. Mach. Learn. 97, 5546–5557 (2019)
  • (20) Y. Sun, B. Wohlberg, U.S. Kamilov, An online plug-and-play algorithm for regularized image reconstruction. IEEE Trans. Comput. Imaging 5(3), 395–408 (2019)
  • (21) K. Zhang, W. Zuo, S. Gu, L. Zhang, Learning deep CNN denoiser prior for image restoration. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. pp. 3929–3938 (2017)
  • (22) T. Tirer, R. Giryes, Image restoration by iterative denoising and backward projections. IEEE Trans. Image Process. 28(3), 1220–1234 (2019)
  • (23) K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, R. Timofte, Plug-and-play image restoration with deep denoiser prior. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
  • (24) S. Hurault, A. Leclaire, N. Papadakis, Gradient step denoiser for convergent plug-and-play. Proc. Int. Conf. Learn. Represent. (2022)
  • (25) S. Hurault, A. Leclaire, N. Papadakis, Proximal denoiser for convergent plug-and-play optimization with nonconvex regularization. arXiv:2201.13256 (2022)
  • (26) Y. Romano, M. Elad, P. Milanfar, The little engine that could: Regularization by denoising (RED). SIAM J. Imaging Sci. 10(4), 1804–1844 (2017)
  • (27) E.T. Reehorst, P. Schniter, Regularization by denoising: Clarifications and new interpretations. IEEE Trans. Comput. Imaging 5(1), 52–67 (2018)
  • (28) Y. Sun, J. Liu, U.S. Kamilov, Block coordinate regularization by denoising. Proc. Adv. Neural Inf. Process. Syst. pp. 380–390 (2019)
  • (29) Y. Sun, Z. Wu, X. Xu, B. Wohlberg, U.S. Kamilov, Scalable plug-and-play ADMM with convergence guarantees. IEEE Trans. Comput. Imaging 7, 849–863 (2021)
  • (30) R. Cohen, M. Elad, P. Milanfar, Regularization by denoising via fixed-point projection (RED-PRO). SIAM J. Imaging Sci. 14(3), 1374–1406 (2021)
  • (31) C. Metzler, P. Schniter, A. Veeraraghavan, et al., prdeep: robust phase retrieval with a flexible deep network. Proc. Intl. Conf. Mach. Learn. pp. 3501–3510 (2018)
  • (32) Z. Wu, Y. Sun, A. Matlock, J. Liu, L. Tian, U.S. Kamilov, SIMBA: Scalable inversion in optical tomography using deep denoising priors. IEEE J. Sel. Top. Signal Process. 14(6), 1163–1175 (2020)
  • (33) Z. Wu, Y. Sun, J. Liu, U. Kamilov, Online regularization by denoising with applications to phase retrieval. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. Wkshp. (2019)
  • (34) G. Mataev, P. Milanfar, M. Elad, DeepRED: Deep image prior powered by RED. Proc. IEEE Intl. Conf. Comp. Vis. Wksh. (2019)
  • (35) Y. Hu, J. Liu, X. Xu, U.S. Kamilov, Monotonically convergent regularization by denoising. arXiv:2202.04961 (2022)
  • (36) K. Gregor, Y. LeCun, Learning fast approximations of sparse coding. Proc. Intl. Conf. Mach. Learn. pp. 399–406 (2010)
  • (37) J. Sun, H. Li, Z. Xu, et al., Deep ADMM-Net for compressive sensing MRI. Proc. Adv. Neural Inf. Process. Syst. 29 (2016)
  • (38) J. Zhang, B. Ghanem, ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. Proc. IEEE Intl. Conf. Comp. Vis. pp. 1828–1837 (2018)
  • (39) A. Repetti, M. Terris, Y. Wiaux, J.C. Pesquet, Dual forward-backward unfolded network for flexible plug-and-play. Proc. Eur. Signal Process. Conf. pp. 957–961 (2022)
  • (40) C.D. Athalye, K.N. Chaudhury, B. Kumar, On the contractivity of plug-and-play operators. IEEE Signal Processing Letters 30, 1447–1451 (2023)
  • (41) P. Nair, R.G. Gavaskar, K.N. Chaudhury, Fixed-point and objective convergence of plug-and-play algorithms. IEEE Trans. Comput. Imaging 7, 337–348 (2021)
  • (42) R.G. Gavaskar, C.D. Athalye, K.N. Chaudhury, On plug-and-play regularization using linear denoisers. IEEE Trans. Image Process. 30, 4802–4813 (2021)
  • (43) J. Liu, S. Asif, B. Wohlberg, U. Kamilov, Recovery analysis for plug-and-play priors using the restricted eigenvalue condition. Proc. Adv. Neural Inf. Process. Syst. 34 (2021)
  • (44) A. Raj, Y. Li, Y. Bresler, GAN-based projector for faster recovery with convergence guarantees in linear inverse problems. Proc. IEEE Intl. Conf. Comp. Vis. pp. 5602–5611 (2019)
  • (45) R. Cohen, Y. Blau, D. Freedman, E. Rivlin, It has potential: Gradient-driven denoisers for convergent solutions to inverse problems. Proc. Adv. Neural Inf. Process. Syst. 34 (2021)
  • (46) J.C. Pesquet, A. Repetti, M. Terris, Y. Wiaux, Learning maximally monotone operators for image recovery. SIAM J. Imaging Sci. 14(3), 1206–1237 (2021)
  • (47) R. Laumont, V. De Bortoli, A. Almansa, J. Delon, A. Durmus, M. Pereyra, On maximum a posteriori estimation with plug & play priors and stochastic gradient descent. Journal of Mathematical Imaging and Vision 65(1), 140–163 (2023)
  • (48) A. Virmaux, K. Scaman, Lipschitz regularity of deep neural networks: Analysis and efficient estimation. Proc. Adv. Neural Inf. Process. Syst. 31 (2018)
  • (49) H. Sedghi, V. Gupta, P.M. Long, The singular values of convolutional layers. Proc. Int. Conf. Learn. Represent. (2019)
  • (50) J. Hertrich, S. Neumayer, G. Steidl, Convolutional proximal neural networks and plug-and-play algorithms. Linear Algebra Appl. 631, 203–234 (2021)
  • (51) M. Terris, A. Repetti, J.C. Pesquet, Y. Wiaux, Building firmly nonexpansive convolutional neural networks. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. pp. 8658–8662 (2020)
  • (52) N. Parikh, S. Boyd, Proximal algorithms. Found. Trends. Optim. 1(3), 127–239 (2014)
  • (53) J. Eckstein, D.P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1), 293–318 (1992)
  • (54) V. Pata, Fixed Point Theorem and Applications (Springer, 2019)
  • (55) J.B. Baillon, G. Haddad, Quelques propriétés des opérateurs angle-bornés et n-cycliquement monotones. Israel Journal of Mathematics 26, 137–150 (1977)
  • (56) A. Chambolle, R.A. De Vore, N.Y. Lee, B.J. Lucier, Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. Image Process. 7(3), 319–335 (1998)
  • (57) D.P. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
  • (58) A. Levin, Y. Weiss, F. Durand, W.T. Freeman, Understanding and evaluating blind deconvolution algorithms. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. pp. 1964–1971 (2009)