
Averaged Deep Denoisers for Image Regularization

Pravin Nair ([email protected]) and Kunal N. Chaudhury ([email protected])

Department of Electrical Engineering, Indian Institute of Science, Bangalore, India
Abstract

Plug-and-Play (PnP) and Regularization-by-Denoising (RED) are recent paradigms for image reconstruction that leverage the power of modern denoisers for image regularization. In particular, they have been shown to deliver state-of-the-art reconstructions with CNN denoisers. Since the regularization is performed in an ad hoc manner, understanding the convergence of PnP and RED has been an active research area. It was shown in recent works that iterate convergence can be guaranteed if the denoiser is averaged or nonexpansive. However, integrating nonexpansivity with gradient-based learning is challenging, the core issue being that testing nonexpansivity is intractable. Using numerical examples, we show that existing CNN denoisers tend to violate the nonexpansive property, which can cause PnP or RED to diverge. In fact, algorithms for training nonexpansive denoisers either cannot guarantee nonexpansivity or are computationally intensive. In this work, we construct contractive and averaged image denoisers by unfolding splitting-based optimization algorithms applied to wavelet denoising and demonstrate that their regularization capacity for PnP and RED is comparable to that of CNN denoisers. To our knowledge, this is the first work to propose a simple framework for training contractive denoisers using network unfolding.

keywords:
plug-and-play (PnP), regularization by denoising (RED), convergence, contractive, nonexpansive, averaged, unfolding.

1 Introduction

The problem of recovering an unknown image from incomplete, corrupted measurements comes up in applications such as deblurring, superresolution, inpainting, demosaicking, despeckling, MRI, and tomography ribes2008linear ; engl1996regularization ; dong2011image . The standard optimization framework for such problems involves a loss function $f:\mathbb{R}^{n}\to\mathbb{R}$ derived from the measurement process and a regularizer $g:\mathbb{R}^{n}\to\mathbb{R}\cup\{\infty\}$ incorporating prior knowledge about the ground truth. The reconstruction is performed by solving the optimization problem

\underset{\boldsymbol{x}\in\mathbb{R}^{n}}{\min}\ f(\boldsymbol{x})+g(\boldsymbol{x}). \qquad (1)

The loss $f$ is typically differentiable, while the regularizer $g$ can be discontinuous engl1996regularization . For linear inverse problems such as deblurring, superresolution, inpainting, and compressed sensing dong2011image ; jagatap2019algorithmic , the loss is convex and quadratic, namely, $f(\boldsymbol{x})={\lVert\boldsymbol{y}-\mathbf{A}\boldsymbol{x}\rVert}^{2}$, where $\mathbf{A}\in\mathbb{R}^{m\times n}$ is the measurement matrix and $\boldsymbol{y}$ is the observed image. On the other hand, for applications such as single photon imaging, Poisson denoising, and despeckling chan2017plug ; rond2016poisson ; bioucas2010multiplicative , $f$ is non-quadratic.

For imaging applications, total-variation and other sparsity-promoting functionals rudin1992nonlinear ; candes2008enhancing ; dong2011sparsity are the most commonly used regularizers. For such regularizers, $g$ is typically convex but nondifferentiable engl1996regularization ; zhang2017image . If the loss $f$ is convex, we can solve (1) using iterative algorithms such as Forward-Backward Splitting (FBS), Douglas-Rachford Splitting (DRS), and Alternating Direction Method of Multipliers (ADMM) passty1979ergodic ; bauschke2017convex ; ryu2016primer . Model-based optimization is interpretable, computationally tractable, and can be applied to various problems. However, handcrafted regularizers are generally less powerful than modern learning-based models, and the optimization process can be time-consuming zhang2017image .

1.1 PnP and RED

Classical algorithms such as FBS and DRS perform repeated inversion of the measurement model, followed by a regularization step. More precisely, the regularization is performed using the proximal operator. For a convex regularizer $h:\mathbb{R}^{n}\to\mathbb{R}$, the proximal operator is defined as

\mathrm{Prox}_{h}(\boldsymbol{x})=\operatorname*{\arg\!\min}_{\boldsymbol{z}\in\mathbb{R}^{n}}\ \left\{\frac{1}{2}\left\lVert\boldsymbol{z}-\boldsymbol{x}\right\rVert^{2}+h(\boldsymbol{z})\right\}, \qquad (2)

where $\lVert\cdot\rVert$ is the Euclidean norm. The proximal operator can be seen as a denoiser: it takes a noisy image $\boldsymbol{x}$ and returns an image close to $\boldsymbol{x}$ for which the penalty $h$ is small beck2009fast . This operator is at the core of FBS, DRS, and ADMM. In particular, the FBS update is given by

\boldsymbol{z}_{k+1} = \boldsymbol{x}_{k}-\gamma\nabla\!f(\boldsymbol{x}_{k}),
\boldsymbol{x}_{k+1} = \mathrm{Prox}_{\gamma g}(\boldsymbol{z}_{k+1}), \qquad (3)

where $\gamma>0$ and $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ is an arbitrary initialization.

On the other hand, starting with $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$ and $\gamma>0$, the DRS update is given by (e.g., see ryu2016primer )

\boldsymbol{z}_{k+1} = \mathrm{Prox}_{\gamma f}(\boldsymbol{x}_{k}),
\boldsymbol{x}_{k+1} = \boldsymbol{x}_{k}+\mathrm{Prox}_{\gamma g}(2\boldsymbol{z}_{k+1}-\boldsymbol{x}_{k})-\boldsymbol{z}_{k+1}. \qquad (4)

In Plug-and-Play (PnP) regularization sreehari2016plug , instead of having to handcraft the regularizer $g$ in (1) and apply it via its proximal operator, we directly “plug” a powerful denoising operator $\mathrm{D}:\mathbb{R}^{n}\to\mathbb{R}^{n}$, such as BM3D dabov2007image or DnCNN zhang2017beyond , into (3) and (4). Thus, in the PnP variant of FBS (referred to as PnP-FBS), the update is performed by replacing the proximal operator in (3) by an image denoiser (Algorithm 1).

Algorithm 1 PnP-FBS
1:input: $\boldsymbol{x}_{0}$, loss $f$, denoiser $\mathrm{D}$, and $\gamma>0$.
2:for $k=0,1,\ldots$ do
3:     $\boldsymbol{z}_{k+1}=\boldsymbol{x}_{k}-\gamma\nabla\!f(\boldsymbol{x}_{k})$.
4:     $\boldsymbol{x}_{k+1}=\mathrm{D}(\boldsymbol{z}_{k+1})$.
5:end for
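To make the update concrete, the following is a minimal NumPy sketch of Algorithm 1; the callables grad_f and denoiser are placeholders for the gradient of the loss and the plugged-in denoiser, and the fixed iteration budget is an assumption made here for illustration.

```python
import numpy as np

def pnp_fbs(x0, grad_f, denoiser, gamma, num_iters=100):
    """Sketch of PnP-FBS (Algorithm 1).

    grad_f   : callable returning the gradient of the loss f at x
    denoiser : callable D plugged in place of the proximal operator
    gamma    : step size (0 < gamma < 2/beta for a beta-smooth convex loss)
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        z = x - gamma * grad_f(x)   # forward (gradient) step on the loss
        x = denoiser(z)             # backward step replaced by the denoiser
    return x
```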

Remarkably, PnP algorithms yield state-of-the-art results across a wide range of applications, including superresolution, MRI, fusion, and tomography sreehari2016plug ; Ryu2019_PnP_trained_conv ; Sun2019_PnP_SGD ; zhang2017learning ; Tirer2019_iter_denoising ; zhang2021plug . The best performing PnP algorithms zhang2017learning ; Tirer2019_iter_denoising ; zhang2021plug use trained denoisers such as DnCNN zhang2017beyond , IRCNN zhang2017learning , and U-Net hurault2022gradient ; hurault2022proximal . This also opens up the exciting possibility of combining model-based reconstruction with deep learning zhang2017learning .

The idea of using denoisers for regularization has also been explored in Regularization-by-Denoising (RED). For example, the authors in romano2017little proposed to work with the regularizer

g(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^{\top}\big(\boldsymbol{x}-\mathrm{D}(\boldsymbol{x})\big), \qquad (5)

where $\mathrm{D}$ is some powerful denoiser as in PnP. This is solved iteratively using gradient-based methods, where the gradient of (5) is approximated by $\boldsymbol{x}-\mathrm{D}(\boldsymbol{x})$ in romano2017little . It was later shown in reehorst2018regularization that this approximation is exact if $\mathrm{D}$ is locally homogeneous and its Jacobian is symmetric.

Several variants of the original RED algorithm romano2017little have been proposed reehorst2018regularization ; Sun2019_PnP_SGD ; sun2019block ; sun2021scalable ; cohen2021regularization . One such variant called RED-PG reehorst2018regularization is described in Algorithm 2. RED algorithms have been shown to yield state-of-the-art results for phase retrieval, tomography, superresolution, deblurring, and compressed sensing metzler2018prdeep ; wu2020simba ; wu2019online ; mataev2019deepred ; sun2019block ; hu2022monotonically .

Algorithm 2 RED-PG
1:input: $\boldsymbol{z}_{0}$, loss $f$, denoiser $\mathrm{D}$, and $\gamma,L>0$.
2:for $k=0,1,\ldots$ do
3:     $\boldsymbol{x}_{k+1}=\mathrm{Prox}_{(\gamma/L)f}(\boldsymbol{z}_{k})$.
4:     $\boldsymbol{z}_{k+1}=\frac{1}{L}\big(\mathrm{D}(\boldsymbol{x}_{k+1})+(L-1)\boldsymbol{x}_{k+1}\big)$.
5:end for
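For completeness, here is an analogous sketch of Algorithm 2, again with placeholder callables prox_f (which is assumed to already encode the factor $\gamma/L$) and denoiser, and a fixed iteration budget.

```python
import numpy as np

def red_pg(z0, prox_f, denoiser, L, num_iters=100):
    """Sketch of RED-PG (Algorithm 2).

    prox_f   : callable computing Prox_{(gamma/L) f}
    denoiser : callable D
    L        : relaxation parameter (L > 1 in the convergence results)
    """
    z = np.array(z0, dtype=float)
    x = z
    for _ in range(num_iters):
        x = prox_f(z)                            # proximal step on the loss
        z = (denoiser(x) + (L - 1.0) * x) / L    # relaxed denoising step
    return x
```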

There is yet another paradigm for solving inverse problems called deep unfolding gregor2010learning ; sun2016deep ; zhang2018ista , where an iterative algorithm is unfolded into a network and trained in an end-to-end fashion. Deep unfolded networks can produce impressive results, but unlike PnP or RED (where just the denoiser needs to be trained), we have to train an unfolded network end-to-end. The present work is motivated by deep unfolding, but we propose to use them for a different purpose, namely, to construct nonexpansive denoisers by unfolding classical wavelet denoising. We then apply the trained denoiser within Algorithms 1 and 2 for image reconstruction.

We wish to point to a related work repetti2022dual , where the authors construct an unfolded denoiser by applying FBS to a sparsity-regularized image denoising problem (with box constraints on the reconstruction). However, there are important distinctions between our approach and theirs. First, splitting is applied to the Fenchel dual of the denoising problem in repetti2022dual ; the idea is to transform the nonsmooth primal (denoising) problem into a smooth, unconstrained dual problem. Second, since the sparsifying linear transform is optimized in repetti2022dual , there is no structural guarantee on the trained layers. As a result, it is difficult to come up with convergence guarantees; indeed, (repetti2022dual, Proposition 2) assumes that the same linear transform is used across all layers, and this would generally not hold if we optimize the transforms. We present comparisons with repetti2022dual in Section 5.

1.2 Convergence Results

The success of PnP and RED has sparked interest in the theoretical understanding of these algorithms. For example, it has been shown that for a class of linear denoisers, PnP is guaranteed to converge, even at a linear rate ACK2023-contractivity , and the limit point is the solution of a convex optimization problem sreehari2016plug ; nair2021fixed ; gavaskar2021plug . This is done by expressing the linear denoiser as the proximal operator of a convex function. Unfortunately, this does not hold for powerful nonlinear denoisers such as BM3D, DnCNN, or IRCNN. This is because, unlike the proximal operator that is nonexpansive, such denoisers have been shown to violate the nonexpansive property reehorst2018regularization .

Though developing a variational characterization for deep denoisers is challenging, it is nonetheless possible to establish iterate convergence for PnP and RED algorithms. For example, the convergence of Algorithm 1 was established for generative denoisers in jagatap2019algorithmic ; liu2021recovery ; raj2019gan . For strongly convex loss functions, PnP-FBS convergence can be guaranteed using a CNN denoiser whose residual is nonexpansive Ryu2019_PnP_trained_conv . Motivated by RED, convergence was recently established for PnP modeled on FBS, ADMM, half-quadratic splitting, and gradient descent hurault2022gradient ; cohen2021has ; hurault2022proximal . On the other hand, the convergence of RED algorithms can be guaranteed using a nonexpansive denoiser romano2017little ; reehorst2018regularization ; sun2019block ; cohen2021regularization ; liu2021recovery . For instance, the block-coordinate RED algorithm in sun2019block exhibits convergence with a nonexpansive denoiser. In liu2021recovery , the authors established convergence of RED for the compressed sensing application using a nonexpansive denoiser.

The focus of this work is on strong convergence of the PnP and RED algorithms, where the iterates converge in norm to a fixed point of a well-defined operator ryu2016primer . It is important to note that there exist other forms of iterate convergence. For example, for the denoisers in hurault2022gradient ; hurault2022proximal , the limit points of the PnP-FBS iterates are stationary points of a nonconvex objective function. In contrast, the iterates converge weakly (see (bauschke2017convex, Section 2.5) for a definition) to a zero of a combination of a gradient operator and a firmly nonexpansive operator in pesquet2021learning . On the other hand, the iterates of Plug-and-Play Stochastic Gradient Descent (PnP-SGD) in laumont2023maximum converge to a stationary point of a MAP problem, where the prior is derived from the denoiser. The summary of our understanding of the iterate convergence of PnP and RED is as follows:

  1. Convergence of PnP is guaranteed for averaged denoisers Sun2019_PnP_SGD ; sun2021scalable ; nair2021fixed .

  2. For nonexpansive denoisers, convergence is guaranteed for variants of RED reehorst2018regularization ; sun2019block ; cohen2021regularization , PnP-FBS applied to compressive sensing liu2021recovery , and a variant of PnP-FBS reehorst2018regularization .

For the above reasons, there has been significant research on training averaged and nonexpansive CNN denoisers. The challenge is that computing the Lipschitz constant of a neural network is NP-hard virmaux2018lipschitz . By Lipschitz constant, we mean the smallest $c$ such that $\lVert\mathrm{T}(\boldsymbol{u})-\mathrm{T}(\boldsymbol{v})\rVert\leqslant c\lVert\boldsymbol{u}-\boldsymbol{v}\rVert$ for all inputs $\boldsymbol{u}$ and $\boldsymbol{v}$, where we represent the CNN as an operator $\mathrm{T}$. The Lipschitz constant of a convolution layer is just the largest singular value of the matrix representation of the layer (which can be computed efficiently from the FFT of its filters sedghi2018singular ), and the Lipschitz constant of common nonlinear activations (ReLU and sigmoid) is $1$; however, computing the Lipschitz constant of their cascaded composition is nontrivial. Nonetheless, we can compute an upper bound for the Lipschitz constant using a direct approximation of the spectral norm of the network pesquet2021learning or the product of the spectral norms of the convolution layers Ryu2019_PnP_trained_conv . In particular, if we force the spectral norm of the convolution layers to be within $1$ at each epoch of the training process, then the final network is automatically nonexpansive. Several methods based on this idea have been proposed sedghi2018singular ; hertrich2021convolutional ; terris2020building . In particular, we will later compare with the nonexpansive CNN denoiser in Ryu2019_PnP_trained_conv , where the largest singular value of a convolutional layer (estimated using power iterations) is used to force the Lipschitz constant of the layer to be within $1$ after each gradient update.
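As an illustration of the power-iteration idea, the following is a minimal sketch (not the training code of Ryu2019_PnP_trained_conv ) that estimates the largest singular value of a single zero-padded convolution layer; the kernel and image size are placeholders.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def conv_spectral_norm_power(kernel, im_size, iters=50, seed=0):
    """Power-iteration estimate of the largest singular value (Lipschitz
    constant) of x -> convolve2d(x, kernel, 'same') on im_size x im_size
    images. For an odd kernel, the adjoint of this zero-padded convolution
    is correlation with the same kernel."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((im_size, im_size))
    for _ in range(iters):
        y = convolve2d(x, kernel, mode='same', boundary='fill')   # A x
        x = correlate2d(y, kernel, mode='same', boundary='fill')  # A^T A x
        x /= np.linalg.norm(x)
    return np.linalg.norm(convolve2d(x, kernel, mode='same', boundary='fill'))
```

Since such estimates are only approximate, a network rescaled using them at each epoch is not certified to be nonexpansive, which is precisely the issue raised above.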

1.3 Motivation

This work was motivated by the observation that existing CNN denoisers tend to violate the nonexpansive property (see Fig. 5). In fact, existing methods for training nonexpansive denoisers either cannot guarantee the final network to be nonexpansive or are computationally expensive. For example, power iterations are used for approximate computation of the largest singular value in Ryu2019_PnP_trained_conv ; pesquet2021learning ; hence, the network is not guaranteed to be nonexpansive at each epoch. The methods in sedghi2018singular ; hertrich2021convolutional ; terris2020building for constraining the convolutional layers are computationally expensive; e.g., the time taken in sedghi2018singular is about nine-fold that required to train the network without constraints. This is because sedghi2018singular requires the SVD computation of a large matrix derived from the Fourier transform of the convolutional filters. Similarly, terris2020building requires the Fourier transform of the filters in every convolutional layer. On the other hand, the training in hertrich2021convolutional requires us to solve an optimization problem that imposes orthogonality constraints on the convolutional layers.

We also highlight a technical issue that has been sidelined in existing works Ryu2019_PnP_trained_conv ; sedghi2018singular ; hertrich2021convolutional ; terris2020building ; pesquet2021learning , namely, that the CNN filters are trained on images of fixed size but deployed on images of arbitrary size. Since the filters must be zero-padded to match the image size, their Lipschitz constant can change with the image size. As a concrete example, consider the $3\times 3$ Sobel filter $[1,0,-1;\ 2,0,-2;\ 1,0,-1]$. The Lipschitz constant of this filter for an image of size $25\times 25$ is $7.9842$, whereas the Lipschitz constant for a $256\times 256$ image is $8$ (a small sketch reproducing this computation is given after the list below). Thus, a guarantee that the Lipschitz bound is independent of the image size is not available for existing CNN denoisers. To substantiate our point, we present numerical evidence (see Fig. 5 in Section 5) showing that BM3D dabov2007image , DnCNN zhang2017beyond , and N-CNN Ryu2019_PnP_trained_conv violate the nonexpansive property. Such counterexamples have also been reported in reehorst2018regularization . Our observations are summarized below:

  1. The Lipschitz constant of a CNN (trained with fixed-size filters) depends on the size of the input image and is difficult to predict.

  2. Existing CNN denoisers violate the nonexpansive property, which can result in the divergence of PnP and RED algorithms (see Tables 2 and 6 in Section 5).

  3. Training nonexpansive CNNs efficiently is challenging: existing algorithms either cannot guarantee nonexpansivity or are computationally expensive.
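The Sobel example above can be reproduced in a few lines; the sketch below assumes circular (FFT-based) convolution, in which case the Lipschitz constant of a convolution is the peak magnitude of the 2D DFT of the kernel zero-padded to the image size (the FFT-based route mentioned for sedghi2018singular ).

```python
import numpy as np

sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]], dtype=float)

def conv_lipschitz(kernel, size):
    """Largest singular value of convolution with `kernel` on a
    size x size image: peak magnitude of the zero-padded kernel's 2D DFT."""
    H = np.fft.fft2(kernel, s=(size, size))
    return np.abs(H).max()

print(conv_lipschitz(sobel, 25))    # approximately 7.9842
print(conv_lipschitz(sobel, 256))   # 8.0
```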

1.4 Contributions

Following the above observations, we moved away from CNNs and considered deep unfolded networks instead. In fact, from our experience, we noticed a tradeoff between nonexpansivity and the denoising performance of CNNs — a significant drop in denoising performance was observed if the denoiser was forced to be nonexpansive. Indeed, it was shown in hurault2022proximal ; hertrich2021convolutional that imposing hard Lipschitz constraints on the CNN can adversely affect its denoising performance.

In the rest of the paper, we demonstrate that it is possible to construct deep unfolded networks that come with the following desirable properties:

  1. Unlike Ryu2019_PnP_trained_conv ; sedghi2018singular , they are averaged (or contractive) by construction and hence automatically nonexpansive.

  2. The training is inexpensive compared to sedghi2018singular ; terris2020building ; pesquet2021learning . In particular, we only need to perform cheap projections of the learnable parameters onto intervals after every gradient step.

  3. When plugged into Algorithms 1 and 2, the reconstruction is comparable with the state-of-the-art.

Our construction is motivated by the averaged (or contractive) property of fixed-point operators arising in FBS and DRS; this property is classically used to establish iterate convergence bauschke2017convex . In fact, the building blocks of our denoiser are obtained by unfolding FBS and DRS applied to wavelet denoising. These blocks are provably averaged (resp. contractive), and the proposed denoiser, being a composition of these building blocks, is automatically averaged (resp. contractive). While our denoiser is trained on small patches, we give a formula for patch aggregation that guarantees the final end-to-end image denoiser to be averaged (or contractive). The denoising capacity of our denoisers falls short of BM3D and DnCNN; however, their regularization capacity for PnP and RED is comparable with BM3D and DnCNN. The source code is available at https://github.com/pravin1390/Averageddeepdenoiser.

2 Notations and Definitions

For integers $p\leqslant q$, we will use the notations $[p,q]=\{p,p+1,\ldots,q\}$ and $[p,q]^{2}=[p,q]\times[p,q]$. We also use $[p,q]_{s}$ to denote the integers $\{ps,(p+1)s,\ldots,qs\}$, which are multiples of $s$.

We will work in $\mathbb{R}^{n}$ and the norm $\lVert\,\cdot\,\rVert$ will be the Euclidean norm. We say that a sequence $\{\boldsymbol{x}_{k}\}$ in $\mathbb{R}^{n}$ converges to $\boldsymbol{x}^{*}\in\mathbb{R}^{n}$ if $\lVert\boldsymbol{x}_{k}-\boldsymbol{x}^{*}\rVert\to 0$ as $k\to\infty$. Moreover, the convergence is linear if there exist $C>0$ and $\mu\in(0,1)$ such that $\lVert\boldsymbol{x}_{k}-\boldsymbol{x}^{*}\rVert\leqslant C\mu^{k}$. A sequence is said to be convergent if it converges to some $\boldsymbol{x}^{*}\in\mathbb{R}^{n}$; it is said to converge to a set $S\subset\mathbb{R}^{n}$ if it converges to some $\boldsymbol{x}^{*}\in S$. A sequence $\{\boldsymbol{x}_{k}\}$ generated by an iterative algorithm is said to converge globally if it converges for any arbitrary initialization $\boldsymbol{x}_{0}\in\mathbb{R}^{n}$.

We recall the definitions and properties of certain nonlinear operators that come up in the fixed-point convergence of PnP and RED algorithms. These will play an important role in the construction of the proposed denoiser.

An operator $\mathrm{T}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is said to be Lipschitz if there exists $c>0$ such that

\forall\,\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{n}:\quad\left\lVert\mathrm{T}(\boldsymbol{x})-\mathrm{T}(\boldsymbol{y})\right\rVert\leqslant c\left\lVert\boldsymbol{x}-\boldsymbol{y}\right\rVert. \qquad (6)

The smallest $c$ satisfying (6) is called the Lipschitz constant of $\mathrm{T}$. An operator $\mathrm{T}$ is nonexpansive if $c=1$ and contractive if $c\in[0,1)$. $\mathrm{T}$ is said to be $\theta$-averaged for some $\theta\in[0,1)$ if we can write $\mathrm{T}=(1-\theta)\,\mathrm{I}+\theta\,\mathrm{N}$, where $\mathrm{N}$ is a nonexpansive operator and $\mathrm{I}$ is the identity operator on $\mathbb{R}^{n}$. We call $\mathrm{T}$ firmly nonexpansive if $\mathrm{T}$ is $(1/2)$-averaged.

Remark 1.

It is evident from the above definitions that an averaged operator is nonexpansive. The converse is not true; e.g., the operator $-\mathrm{I}$ is nonexpansive but not averaged. A contractive operator is $(c+1)/2$-averaged, where $c$ is the contraction factor. However, not every averaged operator is contractive. Thus, contractive operators form a strict subset of averaged operators, which in turn form a strict subset of nonexpansive operators bauschke2017convex .

Two instances of averaged operators relevant to this work are the proximal and the gradient-step operators. Recall that for a convex function $h:\mathbb{R}^{n}\to\mathbb{R}$, its proximal operator $\mathrm{Prox}_{h}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is defined in (2). For a differentiable function $f:\mathbb{R}^{n}\to\mathbb{R}$, we define the gradient-step operator to be

\mathrm{G}_{\gamma,f}:\mathbb{R}^{n}\to\mathbb{R}^{n},\quad\mathrm{G}_{\gamma,f}=\mathrm{I}-\gamma\nabla\!f, \qquad (7)

where $\gamma>0$ is the step size. A differentiable function $f$ is said to be $\beta$-smooth if the Lipschitz constant of the gradient operator $\nabla\!f:\mathbb{R}^{n}\to\mathbb{R}^{n}$ is at most $\beta$.

Remark 2.

The proximal and the gradient-step operators in (2) and (7) are known to be firmly nonexpansive when $f$ is convex bauschke2017convex . However, these operators need not be contractive unless additional conditions are met, e.g., strong convexity of $f$. For linear inverse problems, the loss $f$ is typically not strongly convex, and it is easy to show that the proximal and the gradient-step operators are not contractive in this case. This will be an important consideration when we work with proximal and gradient-step operators.

3 Convergence of PnP and RED

The convergence analysis of operator-splitting methods is primarily based on the classical fixed-point theory of monotone operators bauschke2017convex ; ryu2016primer ; parikh2014proximal ; eckstein1992douglas . PnP and RED have their roots in convex optimization and operator-splitting algorithms in particular. It is thus not surprising that existing convergence analyses of PnP and RED rely heavily on the fixed-point theory of firmly nonexpansive and averaged operators nair2021fixed ; sun2021scalable ; Ryu2019_PnP_trained_conv ; cohen2021regularization ; reehorst2018regularization . In particular, this line of analysis can be used to establish convergence of Algorithms 1 and 2. This is done by expressing the updates in these algorithms as a mapping $\boldsymbol{x}\mapsto\mathrm{T}(\boldsymbol{x})$, where $\mathrm{T}$ is contractive or averaged depending on the loss $f$ and the denoiser $\mathrm{D}$. In particular, consider the operators

\mathrm{T}_{\mathrm{PnP}}=\mathrm{D}\circ\mathrm{G}_{\gamma,f}, \qquad (8)

and

\mathrm{T}_{\mathrm{RED}}=\mathrm{Prox}_{(\gamma/L)f}\circ\frac{1}{L}\big(\mathrm{D}+(L-1)\,\mathrm{I}\,\big). \qquad (9)

In terms of (8) and (9), we can express the updates in Algorithms 1 and 2 as $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$, where the operator $\mathrm{T}$ is either (8) or (9). Namely, we can view PnP-FBS and RED-PG as fixed-point algorithms.

The convergence analysis of fixed-point iterations is a classical topic and of continuing research interest in image regularization FPTbook . In particular, since the operators (8) and (9) are continuous, a limit point of the iterates (if one exists) must necessarily be a fixed point of $\mathrm{T}$; we call this fixed-point convergence. We will denote the set of fixed points of $\mathrm{T}$ by $\mathrm{Fix}\,\mathrm{T}$, i.e., $\mathrm{Fix}\,\mathrm{T}=\{\boldsymbol{x}\in\mathbb{R}^{n}:\ \boldsymbol{x}=\mathrm{T}(\boldsymbol{x})\}$.

We can establish various convergence results for the fixed-point iterations of $\mathrm{T}$. A classical result is the following.

Theorem 1.

Let $\mathrm{T}$ be a contractive operator on $\mathbb{R}^{n}$. Then the sequence $\{\boldsymbol{x}_{k}\}_{k\geqslant 0}$ generated as $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ converges globally and linearly to $\mathrm{Fix}\,\mathrm{T}$.

The iteration $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ is called the Picard iteration when $\mathrm{T}$ is contractive, and Theorem 1 is the contraction-mapping theorem FPTbook .

Remark 3.

$\mathrm{Fix}\,\mathrm{T}$ is nonempty if $\mathrm{T}$ is contractive. In fact, $\mathrm{Fix}\,\mathrm{T}$ is a singleton in this case, i.e., the fixed-point iterations always converge to the same fixed point regardless of the initialization $\boldsymbol{x}_{0}$. In general, however, verifying that $\mathrm{Fix}\,\mathrm{T}\neq\emptyset$ for some operator $\mathrm{T}$ is a nontrivial task; this is assumed in most practical applications of fixed-point analysis, including the analysis of operator-splitting algorithms bauschke2017convex ; ryu2016primer .

It follows from Theorem 1 that a sufficient condition for the convergence of Algorithms 1 and 2 is that the operators $\mathrm{T}_{\mathrm{PnP}}$ in (8) and $\mathrm{T}_{\mathrm{RED}}$ in (9) are contractive. In fact, this is the case if the denoiser $\mathrm{D}$ is contractive and the loss function $f$ is convex.

Proposition 2.

Let $f$ be convex and let $\mathrm{D}$ be contractive. Then the operator in (9) is contractive for all $\gamma>0$ and $L>1$. Moreover, if $f$ is $\beta$-smooth, then operator (8) is contractive for all $0<\gamma<2/\beta$.

Proof: If $\mathrm{D}$ is contractive and $L>1$, then it is not difficult to see that the second operator in (9) is contractive, being a convex combination of $\mathrm{I}$ and $\mathrm{D}$. On the other hand, the first operator in (9) is a proximal operator, which is nonexpansive bauschke2017convex . Being the composition of a nonexpansive and a contractive operator, (9) is contractive.

On the other hand, if $f$ is convex and $\beta$-smooth, then from the Baillon-Haddad theorem bauschke2009baillon we have that, for all $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{n}$,

\beta\,\big(\nabla\!f(\boldsymbol{x})-\nabla\!f(\boldsymbol{y})\big)^{\top}\!(\boldsymbol{x}-\boldsymbol{y})\geqslant\big\lVert\nabla\!f(\boldsymbol{x})-\nabla\!f(\boldsymbol{y})\big\rVert^{2}.

Using this, we can easily show that (7) is nonexpansive for all $0<\gamma<2/\beta$. Thus, (8) is the composition of a nonexpansive and a contractive operator and is hence contractive.

We arrive at the following important conclusion by combining Theorem 1 and Proposition 2.

Corollary 3.

Suppose $f$ is convex and $\mathrm{D}$ is contractive.

  1. Algorithm 2 converges globally and linearly if $\gamma>0$ and $L>1$.

  2. Additionally, if $f$ is $\beta$-smooth, then Algorithm 1 converges globally and linearly if $0<\gamma<2/\beta$.

Proof: If $f$ is convex, $\gamma>0$ and $L>1$, and $\mathrm{D}$ is contractive, we know from Proposition 2 that (9) is a contraction. Similarly, if $0<\gamma<2/\beta$, then it follows from Proposition 2 that (8) is a contraction. Hence, we have from Theorem 1 that Algorithms 1 and 2 converge globally and linearly.

The convergence results in Corollary 3 can be extended to averaged denoisers. This is important as we will later work with averaged denoisers that are not guaranteed to be contractive. We will rely on the following fixed-point convergence result for averaged operators.

Theorem 4.

Let $\mathrm{T}$ be an averaged operator on $\mathbb{R}^{n}$ and assume that $\mathrm{Fix}\,\mathrm{T}\neq\emptyset$. The sequence $\{\boldsymbol{x}_{k}\}_{k\geqslant 0}$ generated as $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ converges globally to $\mathrm{Fix}\,\mathrm{T}$.

The iteration $\boldsymbol{x}_{k+1}=\mathrm{T}(\boldsymbol{x}_{k})$ is called the Mann iteration when $\mathrm{T}$ is averaged, and Theorem 4 is the Krasnosel'skii-Mann theorem (bauschke2017convex, Prop. 5.16).

Remark 4.

Comparing Theorems 1 and 4, we note that while $\mathrm{Fix}\,\mathrm{T}$ is guaranteed to be nonempty when $\mathrm{T}$ is contractive, we need to assume this if $\mathrm{T}$ is averaged. Second, when $\mathrm{T}$ is a contraction, the sequence converges to the same fixed point regardless of the initialization $\boldsymbol{x}_{0}$, whereas the limit point will generally depend on $\boldsymbol{x}_{0}$ if $\mathrm{T}$ is averaged. Finally, the convergence in Theorem 4 cannot be guaranteed to be linear.

If the denoiser $\mathrm{D}$ is nonexpansive or averaged, then we can make the following conclusion regarding the PnP and RED operators.

Proposition 5.

Let $f$ be convex and let $\mathrm{D}$ be nonexpansive. Then the operator in (9) is an averaged operator for all $\gamma>0$ and $L>1$. Moreover, if $f$ is $\beta$-smooth and $\mathrm{D}$ is averaged, then operator (8) is an averaged operator for all $0<\gamma<1/\beta$.

Proof: The reasoning is similar to that of Proposition 2. The second operator in (9) is $1/L$-averaged if $L>1$. Also, it is well-known that the proximal operator of a convex function is firmly nonexpansive bauschke2017convex . Thus, being the composition of two averaged operators, (9) is averaged bauschke2017convex .

If $f$ is $\beta$-smooth and $0<\gamma<1/\beta$, we can show that (7) is firmly nonexpansive bauschke2017convex . Again, being the composition of two averaged operators, (8) must be averaged.

Combining Theorem 4 and Proposition 5, we can certify convergence of Algorithms 1 and 2.

Corollary 6.

Suppose $f$ is convex and $\mathrm{D}$ is averaged.

  1. Assuming that $\mathrm{Fix}\,\mathrm{T}_{\mathrm{RED}}\neq\emptyset$, Algorithm 2 converges globally for any $\gamma>0$ and $L>1$.

  2. Assuming that $\mathrm{Fix}\,\mathrm{T}_{\mathrm{PnP}}\neq\emptyset$ and $f$ is $\beta$-smooth, Algorithm 1 converges globally for any $0<\gamma<1/\beta$.

Proof: If $f$ is convex, $\gamma>0$, $L>1$, and $\mathrm{D}$ is averaged (and hence nonexpansive), it follows from Proposition 5 that (9) is an averaged operator. Thus, if $\mathrm{Fix}\,\mathrm{T}_{\mathrm{RED}}\neq\emptyset$, then we have from Theorem 4 that Algorithm 2 converges globally.

On the other hand, if $f$ is convex and $\beta$-smooth, and $0<\gamma<1/\beta$, then it follows from Proposition 5 that (8) is an averaged operator. Hence, if $\mathrm{Fix}\,\mathrm{T}_{\mathrm{PnP}}\neq\emptyset$, then we have from Theorem 4 that Algorithm 1 converges globally.

In summary, we were able to establish fixed-point convergence for Algorithms 1 and 2 for contractive (Corollary 3) and averaged denoisers (Corollary 6). We next show how such denoisers can be realized using deep unfolding.

Remark 5.

We can work out convergence results similar to Corollaries 3 and 6 for the PnP and RED algorithms in romano2017little ; reehorst2018regularization ; sun2019block ; cohen2021regularization .

4 Unfolded Denoisers

Motivated by the previous results, we wish to develop trained denoisers that are either contractive or averaged by construction and whose regularization capacity is comparable to state-of-the-art CNN denoisers. As explained in Section 1.3, learning a nonexpansive CNN denoiser presents several challenges. This prompted us to consider a different solution to the problem, namely, using unfolded networks. More specifically, we construct the denoiser in three steps.

  1. We consider the classical variational problem of wavelet denoising and unfold it using FBS and DRS. This is done at a patch level. The patch denoiser is contractive (for FBS) or averaged (for DRS) by construction.

  2. We train the patch denoiser using supervised learning, where the regularization parameters (thresholds) of wavelet denoising and the internal (step size and penalty) parameters of FBS and DRS are optimized.

  3. We aggregate the patch denoisers to derive an image denoiser, which is proven to inherit the contractive or averaged property of the original patch denoiser.

4.1 Wavelet denoising

Let $\boldsymbol{b}\in\mathbb{R}^{p}$ represent a noisy image patch, where we have stacked the pixels of a $(k\times k)$ image patch into a vector of length $p=k^{2}$. Let $\mathbf{W}$ be an orthogonal wavelet transform. Consider the wavelet denoising problem

\underset{\boldsymbol{x}\in\mathbb{R}^{p}}{\min}\ \ \frac{1}{2}\lVert\boldsymbol{x}-\boldsymbol{b}\rVert^{2}+\lambda\sum_{i=1}^{p}\,\lvert(\mathbf{W}\boldsymbol{x})_{i}\rvert\qquad(\lambda>0),

where $\lambda$ is the regularization parameter. The above problem has the closed-form solution $\mathbf{W}^{\top}\phi_{\lambda}(\mathbf{W}\boldsymbol{b})$, where the soft-threshold function

\phi_{\lambda}(\theta)=\mathrm{sign}(\theta)\,\max(0,\lvert\theta\rvert-\lambda)

is applied componentwise on $\mathbf{W}\boldsymbol{b}$ and $\lambda$ acts as a threshold chambolle1998nonlinear .

We consider a slightly different model with a different threshold for each wavelet coefficient. Namely, we consider the optimization problem

\underset{\boldsymbol{x}\in\mathbb{R}^{p}}{\min}\ \ \frac{1}{2}\lVert\boldsymbol{x}-\boldsymbol{b}\rVert^{2}+\sum_{i=1}^{p}\lambda_{i}\,\lvert(\mathbf{W}\boldsymbol{x})_{i}\rvert\quad(\lambda_{i}>0). \qquad (10)

This also has a closed-form solution given by

\phi_{\Lambda}(\boldsymbol{b})=\mathbf{W}^{\top}\mathbf{S}_{\Lambda}(\mathbf{W}\boldsymbol{b}), \qquad (11)

where $\Lambda=(\lambda_{1},\ldots,\lambda_{p})$ are the thresholds and the thresholding operator $\mathbf{S}_{\Lambda}:\mathbb{R}^{p}\to\mathbb{R}^{p}$ is defined as

\forall\,\boldsymbol{\theta}\in\mathbb{R}^{p},\,i\in[1,p]:\quad\big(\mathbf{S}_{\Lambda}(\boldsymbol{\theta})\big)_{i}=\phi_{\lambda_{i}}(\theta_{i}).
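A minimal sketch of the closed-form solution (11) is given below, assuming for illustration that the orthogonal transform $\mathbf{W}$ is available as an explicit p x p matrix (in practice, a fast wavelet transform would be used).

```python
import numpy as np

def soft_threshold(theta, lam):
    """Componentwise soft-thresholding S_Lambda; lam is a scalar or a
    length-p vector of positive thresholds."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

def wavelet_denoise(b, W, lam):
    """Closed-form solution (11) of problem (10): W^T S_Lambda(W b)."""
    return W.T @ soft_threshold(W @ b, lam)
```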

4.2 FBS denoiser

The denoising operator $\phi_{\Lambda}$ is nonexpansive, being the composition of nonexpansive and unitary operators. However, it is easy to see that $\phi_{\Lambda}$ is not contractive. To identify a contractive (resp. averaged) operator, we propose to solve the denoising problem (10) in an iterative fashion using FBS (resp. DRS). The proposed denoising network is obtained by chaining the update operator in FBS and DRS.

As a first step, we apply FBS to the optimization problem (10). Note that, for a step size $\rho$, we can write the FBS update as

\boldsymbol{x}_{k+1}=\mathrm{U}(\boldsymbol{x}_{k};\rho,\Lambda),

where we define the FBS block $\mathrm{U}$ to be

\mathrm{U}(\boldsymbol{x};\rho,\Lambda)=\mathbf{W}^{\top}\mathbf{S}_{\Lambda}\left(\mathbf{W}\big((1-\rho)\boldsymbol{x}+\rho\boldsymbol{b}\big)\right). \qquad (12)

The FBS block is depicted in Fig. 1. Functionally, $\mathrm{U}$ is an inversion step (gradient descent on the loss) followed by wavelet regularization.

Although (10) has a closed-form solution, we are interested in the iterative solution of (10) because of the following property.

Proposition 7.

For all $\rho\in(0,1]$ and $\Lambda>0$, the map $\boldsymbol{x}\mapsto\mathrm{U}(\boldsymbol{x};\rho,\Lambda)$ is contractive.

Indeed, since $\mathbf{W}$ is an orthogonal transform and the operator $\mathbf{S}_{\Lambda}$ in (12) is nonexpansive (the condition $\Lambda>0$ is required to ensure that (11) is well-defined), we have, for any $\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}^{p}$,

\lVert\mathrm{U}(\boldsymbol{x};\rho,\Lambda)-\mathrm{U}(\boldsymbol{x}^{\prime};\rho,\Lambda)\rVert\leqslant(1-\rho)\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert.

Since $1-\rho\in[0,1)$, we can conclude that the map $\boldsymbol{x}\mapsto\mathrm{U}(\boldsymbol{x};\rho,\Lambda)$ is contractive.

Figure 1: Building blocks obtained by unfolding FBS and DRS for wavelet denoising: (a) FBS block; (b) DRS block. The proposed denoisers are obtained by chaining these blocks.

Following Proposition 7, we chain $\mathrm{U}$ into a network. The precise definition is as follows.

Definition 1 (FBS denoiser).

For a fixed number of layers $L\geqslant 1$, we define $\mathrm{D}_{\mathrm{FBS}}:\mathbb{R}^{p}\to\mathbb{R}^{p}$ to be

\mathrm{D}_{\mathrm{FBS}}=\mathrm{U}^{(L)}\circ\mathrm{U}^{(L-1)}\circ\cdots\circ\mathrm{U}^{(1)}, \qquad (13)

where each layer has the form $\mathrm{U}^{(\ell)}=\mathrm{U}(\cdot\,;\rho^{(\ell)},\Lambda^{(\ell)})$.

Since the composition of contractive operators is again contractive, the next result follows immediately from Proposition 7.

Proposition 8.

For all $\rho^{(\ell)}\in(0,1]$ and $\Lambda^{(\ell)}>0$, $\mathrm{D}_{\mathrm{FBS}}$ is contractive.

The point is that (13) is contractive as long as $\Lambda^{(\ell)}$ and $\rho^{(\ell)}$ lie in the intervals stipulated in Proposition 8. We can thus tune these parameters over the respective intervals to optimize the denoising capacity while preserving the contractive nature.
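The following sketch spells out the FBS block (12) and the chained denoiser (13), reusing the illustrative soft_threshold helper from Section 4.1; initializing the chain at the noisy patch $\boldsymbol{b}$ is an assumption made here for illustration.

```python
def fbs_block(x, b, W, rho, lam):
    """One FBS block U(x; rho, Lambda) from (12): a gradient step on the
    quadratic data term followed by wavelet soft-thresholding."""
    return W.T @ soft_threshold(W @ ((1.0 - rho) * x + rho * b), lam)

def d_fbs(b, W, rhos, lams):
    """Chained FBS denoiser D_FBS from (13); contractive whenever every
    rho lies in (0, 1] and every threshold is positive (Proposition 8)."""
    x = b.copy()                       # assumed initialization at the noisy patch
    for rho, lam in zip(rhos, lams):   # one (rho, Lambda) pair per layer
        x = fbs_block(x, b, W, rho, lam)
    return x
```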

Remark 6.

Unlike LISTA gregor2010learning or DualFB repetti2022dual , we use a fixed transform in each layer. This is because we wish to retain the proximal structure in (12), and orthogonality plays a crucial role in this regard.

4.3 DRS denoiser

Note that we can apply Peaceman-Rachford splitting bauschke2017convex or DRS to problem (10). In fact, we will show that the corresponding update operator is firmly nonexpansive for DRS. Thus, the resulting denoiser obtained by chaining this operator is averaged, and we can guarantee convergence of Algorithms 1 and 2 (Corollary 6).

Remark 7.

It is well-known that DRS converges in fewer iterations than FBS parikh2014proximal . Since each layer of the unfolded denoiser corresponds to an iteration, we expect to perform better denoising with DRS for the same number of layers. Indeed, we will show in Section 5 that the DRS denoiser gives better results than the FBS denoiser in some experiments.

To arrive at the DRS update for (10), we identify (10) with (1) by setting

f(\boldsymbol{x})=\frac{1}{2}\lVert\boldsymbol{x}-\boldsymbol{b}\rVert^{2},\quad g(\boldsymbol{x})=\sum_{i=1}^{p}\lambda_{i}\,\lvert(\mathbf{W}\boldsymbol{x})_{i}\rvert.

It follows from (4) that the DRS update can be written as

\boldsymbol{x}_{k+1}=\mathrm{V}(\boldsymbol{x}_{k};\rho,\Lambda),

where the DRS block $\mathrm{V}$ is defined as

\mathrm{V}(\boldsymbol{x};\rho,\Lambda)=\frac{1}{2}\boldsymbol{x}+\frac{1}{2}\big(2\mathrm{Prox}_{\rho g}-\mathrm{I}\big)\big(2\mathrm{Prox}_{\rho f}-\mathrm{I}\big)(\boldsymbol{x}), \qquad (14)

and $\rho>0$ is the penalty parameter of DRS.

Remark 8.

We can simplify the expression for the DRS block (14). Indeed, a direct calculation gives

\mathrm{Prox}_{\rho f}(\boldsymbol{x})=\frac{1}{1+\rho}(\boldsymbol{x}+\rho\boldsymbol{b}).

On the other hand, if we absorb $\rho$ in $\Lambda$, we get $\mathrm{Prox}_{\rho g}=\phi_{\Lambda}$. Thus, we can express (14) as

\mathrm{V}(\boldsymbol{x};\rho,\Lambda)=\frac{\rho}{1+\rho}(\boldsymbol{x}-\boldsymbol{b})+\phi_{\Lambda}\left(\frac{1-\rho}{1+\rho}\boldsymbol{x}+\frac{2\rho}{1+\rho}\boldsymbol{b}\right). \qquad (15)

Having defined the update operator $\mathrm{V}$, we can now establish its averaged property.

Proposition 9.

For all $\rho>0$ and $\Lambda>0$, the map $\boldsymbol{x}\mapsto\mathrm{V}(\boldsymbol{x};\rho,\Lambda)$ is firmly nonexpansive.

Indeed, we know that the proximal operator of a convex function is firmly nonexpansive bauschke2017convex . Thus, the operators $2\mathrm{Prox}_{\rho g}-\mathrm{I}$ and $2\mathrm{Prox}_{\rho f}-\mathrm{I}$ are nonexpansive, and so is their composition, which we denote by $\mathrm{N}$. Since $\mathrm{V}=(1/2)(\mathrm{I}+\mathrm{N})$, $\mathrm{V}$ must be firmly nonexpansive.

We will now use Proposition 9 to construct an averaged network. As with the FBS block, the trainable parameters in (15) are $\rho$ and $\Lambda$, while $\boldsymbol{b}$ and $\mathbf{W}$ are fixed. The DRS block is depicted in Fig. 1. The DRS denoiser is obtained by chaining $\mathrm{V}$ into a network.

Definition 2 (DRS denoiser).

For a fixed number of layers $L\geqslant 1$, we define $\mathrm{D}_{\mathrm{DRS}}:\mathbb{R}^{p}\to\mathbb{R}^{p}$ to be

\mathrm{D}_{\mathrm{DRS}}=\mathrm{V}^{(L)}\circ\mathrm{V}^{(L-1)}\circ\cdots\circ\mathrm{V}^{(1)}, \qquad (16)

where each layer is of the form $\mathrm{V}^{(\ell)}=\mathrm{V}(\cdot\,;\rho^{(\ell)},\Lambda^{(\ell)})$.

As a consequence of Proposition 9, we have the following structure on the denoiser.

Proposition 10.

For all $\rho^{(\ell)}>0$ and $\Lambda^{(\ell)}>0$, $\mathrm{D}_{\mathrm{DRS}}$ is $L/(L+1)$-averaged.

Indeed, notice from Proposition 9 that $\mathrm{D}_{\mathrm{DRS}}$ is the composition of $L$ firmly nonexpansive operators. Now, if we compose $L$ operators that are each $\theta$-averaged, then the resulting operator is $L\theta/(1+(L-1)\theta)$-averaged (e.g., see bauschke2017convex ). Setting $\theta=1/2$, we arrive at Proposition 10.
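Analogously, the following is a sketch of the DRS block in its simplified form (15) and the chained denoiser (16), with the same illustrative assumptions as before (explicit $\mathbf{W}$, the soft_threshold helper from Section 4.1, and initialization at the noisy patch).

```python
def drs_block(x, b, W, rho, lam):
    """One DRS block V(x; rho, Lambda) in the simplified form (15)."""
    u = ((1.0 - rho) * x + 2.0 * rho * b) / (1.0 + rho)
    return (rho / (1.0 + rho)) * (x - b) + W.T @ soft_threshold(W @ u, lam)

def d_drs(b, W, rhos, lams):
    """Chained DRS denoiser D_DRS from (16); L/(L+1)-averaged for positive
    rho and thresholds (Proposition 10)."""
    x = b.copy()                       # assumed initialization at the noisy patch
    for rho, lam in zip(rhos, lams):
        x = drs_block(x, b, W, rho, lam)
    return x
```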

Figure 2: Grayscale images from the Set12 database: (a) barbara, (b) ship, (c) cameraman, (d) couple, (e) fingerprint, (f) hill, (g) house, (h) lena, (i) man, (j) montage, (k) peppers, (l) starfish. We will identify these images by their corresponding names in Section 5.

4.4 Image denoiser

Recall that (13) and (16) are patch denoisers. We apply them independently on patches extracted from a noisy image and then average the denoised patches to obtain the final image zhang2017image ; dabov2007image . The overall operation defines an image-to-image denoising operator $\mathrm{D}$, which is used in Algorithms 1 and 2. In view of Corollaries 3 and 6, we wish to ensure that $\mathrm{D}$ is averaged or contractive. First, we give a precise definition of the denoiser and then prove that it has the desired property.

Let $\Omega=[0,q-1]^{2}$ denote the image domain and $\mathbf{X}:\Omega\to\mathbb{R}$ be the input (noisy) image; we also view $\mathbf{X}$ as a vector in $\mathbb{R}^{n}$, where $n=q^{2}$. We assume a periodic boundary condition, i.e., for $\boldsymbol{i}=(i_{1},i_{2})\in\Omega$ and $\boldsymbol{\tau}=(\tau_{1},\tau_{2})\in\mathbb{Z}^{2}$, we define

\boldsymbol{i}-\boldsymbol{\tau}=\big((i_{1}-\tau_{1})\ \mathrm{mod}\ q,\,(i_{2}-\tau_{2})\ \mathrm{mod}\ q\big).

Without loss of generality, we assume that $q$ is a multiple of the patch size $k$; we can simply pad the image to satisfy this condition.

Let the patch size be $k\times k$. For $\boldsymbol{i}\in\Omega$, we define the patch operator

\mathcal{P}_{\boldsymbol{i}}:\mathbb{R}^{n}\to\mathbb{R}^{k^{2}},\quad\mathcal{P}_{\boldsymbol{i}}(\mathbf{X})=\mathrm{vec}\Big(\big\{\mathbf{X}(\boldsymbol{i}+\boldsymbol{\tau}):\boldsymbol{\tau}\in[0,k-1]^{2}\big\}\Big), \qquad (17)

where $\mathrm{vec}(\cdot)$ denotes the vectorized representation of the pixels in a $(k\times k)$ window around $\boldsymbol{i}$ (in some fixed order).

We extract patches from $\mathbf{X}$ with stride $s$, where we assume that $s$ divides $k$. Thus, the starting coordinates of the patches are $\mathcal{J}=[0,(q_{s}-1)]_{s}^{2}$, where $q_{s}=q/s$. In other words, the patches are

\left\{\mathcal{P}_{\boldsymbol{j}}(\mathbf{X}):\,\boldsymbol{j}\in\mathcal{J}\right\}.
Definition 3.

Given a patch denoiser $\mathrm{D}_{\mathrm{patch}}$, we define the image denoiser $\mathrm{D}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ to be

\mathrm{D}(\mathbf{X})=\frac{1}{(k/s)^{2}}\sum_{\boldsymbol{j}\in\mathcal{J}}\mathcal{P}_{\boldsymbol{j}}^{*}\big(\mathrm{D}_{\mathrm{patch}}(\mathcal{P}_{\boldsymbol{j}}(\mathbf{X}))\big), \qquad (18)

where $\mathcal{P}_{\boldsymbol{j}}^{*}$ is the adjoint of $\mathcal{P}_{\boldsymbol{j}}$.

The factor $(k/s)^{2}$ is the number of patches containing a given pixel; the assumption that $s$ divides $k$ is essential to ensure that this count is the same for every pixel.

Remark 9.

The adjoint in (18) is with respect to the standard Euclidean inner product (see Appendix 8 for details). In particular, it is a linear map from $\mathbb{R}^{k^{2}}$ to $\mathbb{R}^{n}$ such that for any patch $\boldsymbol{x}\in\mathbb{R}^{k^{2}}$, $\mathbf{Y}=\mathcal{P}^{*}_{\boldsymbol{j}}(\boldsymbol{x})$ is an image, where

\mathcal{P}_{\boldsymbol{j}}(\mathbf{Y})=\boldsymbol{x},

and the rest of the pixels in $\mathbf{Y}$ are zero.
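A minimal NumPy sketch of the aggregation rule (18) for a square q x q image is given below; periodic indexing implements the boundary condition above, and the patch_denoiser argument stands for $\mathrm{D}_{\mathrm{FBS}}$ or $\mathrm{D}_{\mathrm{DRS}}$ acting on vectorized k x k patches.

```python
import numpy as np

def image_denoiser(X, patch_denoiser, k, s):
    """Patch-aggregation denoiser (18): extract k x k patches with stride s
    (periodic boundary), denoise each one, add it back via the adjoint of
    the patch-extraction operator, and normalize by (k/s)^2."""
    q = X.shape[0]                          # assumes s divides k and k divides q
    out = np.zeros_like(X, dtype=float)
    for i in range(0, q, s):
        for j in range(0, q, s):
            rows = np.arange(i, i + k) % q  # periodic indexing
            cols = np.arange(j, j + k) % q
            patch = X[np.ix_(rows, cols)].reshape(-1)
            den = patch_denoiser(patch).reshape(k, k)
            out[np.ix_(rows, cols)] += den  # adjoint: place the patch back
    return out / (k / s) ** 2
```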

Henceforth, we will assume that the patch denoiser is either $\mathrm{D}_{\mathrm{FBS}}$ in (13) or $\mathrm{D}_{\mathrm{DRS}}$ in (16). We now state the main result of this section.

Proposition 11.

The image denoiser (18) derived from (13) is contractive, while the denoiser derived from (16) is averaged.

The proof is given in Appendix 8. The core idea is to express the original system of overlapping patches in terms of non-overlapping patches. This allows us to use some form of orthogonality in the calculations.

4.5 Training

We trained (13) and (16) on patches extracted from images of the BSD500 dataset. Let $\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N}\in\mathbb{R}^{p}$ be the clean patches (normalized to $[0,1]$ intensity), where $p=k^{2}$ and $N$ is the size of the training data. The noisy patches are generated as follows:

\boldsymbol{z}_{i}=\boldsymbol{x}_{i}+\sigma\boldsymbol{\eta},

where $\boldsymbol{\eta}\sim\mathcal{N}(\boldsymbol{0},\mathrm{I})$ and $\sigma$ is the noise level. Thus, the training examples are $(\boldsymbol{z}_{1},\boldsymbol{x}_{1}),\ldots,(\boldsymbol{z}_{N},\boldsymbol{x}_{N})$.

Let us use $\Theta$ to denote the parameters in (13) and (16). We will write the patch denoiser as $\mathrm{D}_{\text{patch}}(\cdot\,;\Theta)$ to account for the dependence on $\Theta$. The training problem is to optimize $\Theta$ such that $\mathrm{D}_{\text{patch}}(\boldsymbol{z}_{i};\Theta)\approx\boldsymbol{x}_{i}$ for all $i=1,\ldots,N$. Moreover, in view of Proposition 8, we consider the following optimization problem for training (13):

\underset{\Theta}{\min}\ \ \sum_{i=1}^{N}\ {\lVert\mathrm{D}_{\text{FBS}}(\boldsymbol{z}_{i};\Theta)-\boldsymbol{x}_{i}\rVert}^{2} \qquad (19)
\text{s.t.}\ \ \rho^{(1)},\ldots,\rho^{(L)}\in[\rho_{0},1],\quad \lambda^{(1)}_{j},\ldots,\lambda^{(L)}_{j}\in[\lambda_{0},\infty)\quad\big(j\in[1,p]\big).

The lower bounds $\rho_{0}\in(0,1)$ and $\lambda_{0}>0$ are used to ensure that the constraint set in (19) is closed; this will be useful when we project $\Theta$ onto the intervals.

On the other hand, in view of Proposition 10, we consider the following optimization for training (16):

\underset{\Theta}{\min}\ \ \sum_{i=1}^{N}\ {\lVert\mathrm{D}_{\text{DRS}}(\boldsymbol{z}_{i};\Theta)-\boldsymbol{x}_{i}\rVert}^{2} \qquad (20)
\text{s.t.}\ \ \rho^{(1)},\ldots,\rho^{(L)}\in[\rho_{0},\infty),\quad \lambda^{(1)}_{j},\ldots,\lambda^{(L)}_{j}\in[\lambda_{0},\infty)\quad\big(j\in[1,p]\big).
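The projection steps in (19) and (20) amount to clamping the unfolded parameters onto their intervals after every gradient update. The PyTorch sketch below illustrates this; the attribute names (model.rhos, model.lams, model.rho_max) are hypothetical and stand for the collections of $\rho^{(\ell)}$ and $\Lambda^{(\ell)}$.

```python
import torch

def train_unfolded_denoiser(model, loader, rho_min=1e-4, lam_min=1e-4,
                            epochs=5, lr=1e-3):
    """Projected training for (19)/(20): an Adam step on the squared loss,
    followed by projection of the parameters onto their intervals."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy, clean in loader:
            opt.zero_grad()
            loss = ((model(noisy) - clean) ** 2).sum()
            loss.backward()
            opt.step()
            with torch.no_grad():                          # projection step
                model.rhos.clamp_(rho_min, model.rho_max)  # rho_max = 1 for D_FBS
                model.lams.clamp_(min=lam_min)             # thresholds stay positive
    return model
```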
Table 1: Denoising performance (PSNR/SSIM, averaged) on the BSD68 dataset at different noise levels $\sigma$.

                BM3D            DnCNN           N-CNN           D_FBS           D_DRS
$\sigma=5$      37.52/0.9668    38.01/0.9696    33.35/0.9212    36.54/0.9480    36.62/0.9506
$\sigma=20$     29.65/0.8799    30.26/0.8948    27.99/0.8017    30.01/0.8344    29.99/0.8346
$\sigma=35$     27.10/0.7531    27.69/0.7809    25.17/0.6252    26.18/0.6988    26.17/0.6952

5 Experiments

For the unfolded network, we used a 10-fold cascade of Daubechies, Haar, and Symlet orthogonal wavelets in succession (the number of layers is $L=30$). We used a fixed patch size $k=64$. We solved (19) and (20) using the Adam optimizer kingma2014adam with mini-batches of size 128 and 5 training epochs. After the gradient step, performed using automatic differentiation, we enforced the constraints in (19) and (20) by projecting $\Theta$ onto the respective intervals, essentially implementing a form of projected gradient descent. The lower bounds for the trainable parameters in (19) and (20) were fixed to $\rho_{0}=\lambda_{0}=0.0001$. We trained separate denoisers over the noise range 5-50 in steps of 5. The trained models (500 KB for each noise level) are lightweight compared to DnCNN (2.3 MB).

The objective of the experiments in Sections 5.1, 5.2, and 5.3 is to analyze the denoising performance and the reconstruction accuracy of our denoisers (for deblurring and superresolution) and compare them with existing denoisers. Moreover, we present numerical evidence supporting the claimed contractive (averaged) property of the denoisers and the associated convergence guarantees for PnP and RED (Corollaries 3 and 6). We also present numerical evidence showing that state-of-the-art denoisers such as BM3D and DnCNN do not meet these guarantees.

5.1 Denoising

In Table 1, we compare the denoising performance of our denoisers with BM3D, DnCNN, and N-CNN on the Set12 dataset (see Fig. 2). Our denoisers trail the best denoiser, DnCNN, by 0.25-1.5 dB in PSNR and 0.02-0.08 in SSIM. However, they consistently outperform N-CNN across all noise levels. Recall that the purpose of our construction was to guarantee convergence of PnP and RED. In pursuit of this, we had to make design choices that resulted in a tradeoff in denoising performance. Importantly, we will see that the reconstruction capability of our denoisers is competitive with the above denoisers. This is not surprising since there is no concrete relation between the denoising performance and the restoration capacity of a denoiser. For instance, it has been previously shown that BM3D can yield superior restoration results compared to DnCNN in specific applications Ryu2019_PnP_trained_conv , even though DnCNN generally exhibits better denoising capabilities than BM3D. This suggests that the denoising gap is compensated in PnP and RED because the denoiser is applied in every iteration. However, this is just an intuitive guess, and we do not have a precise mathematical explanation.

Figure 3: Analysis of the proposed denoisers. Denoisers $\mathrm{D}_{\mathrm{FBS}}$ and $\mathrm{D}_{\mathrm{DRS}}$ are plugged into PnP-FBS, which is used for image deblurring. We plot the difference between successive iterates and the PSNR with iterations. We notice that the difference between iterates is strictly decreasing thanks to the averaged property, and the PSNR stabilizes in 50-100 iterations. In (c), we validate that PnP-FBS converges to a unique fixed point with $\mathrm{D}_{\mathrm{FBS}}$, thanks to the contraction property of the operator $\mathrm{T}_{\mathrm{PnP}}$. The experiment is performed with diverse initializations, namely, with $\boldsymbol{x}_{0}$ as all-zeros, all-ones, and random. The iterations stabilized in about 35 iterations. The MSE between the reconstructions obtained using different initializations is of the order 1e-8.
Table 2: Comparison of the residual $\|\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\|$ and PSNR with DnCNN for a deblurring experiment (see text for more details).

                                 number of iterations
denoiser                      10        25        50        75         100
DnCNN              residual   146.00    146.58    234.55    2300       2207248
                   PSNR       16.54     14.79     -4.09     -45.14     -84.68
Proposed (D_FBS)   residual   74.06     30.65     7.67      1.99       0.52
                   PSNR       20.42     22.96     23.69     23.74      23.74
Proposed (D_DRS)   residual   74.09     30.67     7.68      1.99       0.55
                   PSNR       20.43     23.00     23.74     23.79      23.79

We conducted a comprehensive assessment of various parameter configurations for the denoiser and selected the most suitable settings. Regarding the impact of $\mathbf{W}$, our results indicate that the choice of wavelet transform does not significantly affect the denoising performance. In contrast, we observed a significant enhancement in denoising performance by increasing the number of layers $L$ from 1 to 10. However, increasing $L$ beyond 10 yields marginal improvements in the generalization capability. Thus, we capped $L$ at 10 to keep the model compact without compromising the denoising performance. Regarding the stride $s$, our findings align with common trends in patch aggregation algorithms, namely, a higher stride is associated with decreased accuracy but reduced computational demands. We found that $s=8$ provides an optimal balance between accuracy and efficiency. Finally, the patch size plays a crucial role in the denoising performance. Keeping the stride constant, enlarging the patch size from 64 to 128 results in an accuracy improvement ranging from 0.10 to 0.25 dB and an enhancement in the SSIM score by 0.01 to 0.02. However, this adjustment also leads to an increase in the number of parameters of the denoiser.

5.2 Deblurring

The forward model for deblurring is \boldsymbol{y}=\mathbf{F}\boldsymbol{\xi}+\boldsymbol{\eta}, where \boldsymbol{y} is the blurred image, \mathbf{F} is a blur operator, and \boldsymbol{\eta} is white Gaussian noise dong2011image . For deblurring, the loss function is f(\boldsymbol{x})=\|\boldsymbol{y}-\mathbf{F}\boldsymbol{x}\|^{2}, where \mathbf{F} is the blur matrix. The gradient \nabla f is \beta-Lipschitz. Specifically, we use a normalized blur for all experiments; as a result, we have \beta=1 (\beta is the spectral norm of \mathbf{F}). For the deblurring experiments in Table 2 and Fig. 3, we took \mathbf{F} to be a Gaussian blur of size 25\times 25 and standard deviation 1.6, and the noise level is \sigma=0.04. The step size is set to \gamma=0.5 for PnP-FBS, which satisfies the condition \gamma\leqslant 2/\beta in Corollaries 3 and 6, and to \gamma=0.001 for RED-PG. The denoiser noise level is set to \sigma=5 for PnP-FBS (for both our denoisers) and to \sigma=10 for RED-PG.
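
To make the setup concrete, the following is a minimal sketch (not our exact experimental code) of PnP-FBS for deblurring. It assumes circular boundary conditions so that \mathbf{F} and \mathbf{F}^{T} can be applied via FFTs; the denoiser argument is a placeholder for the plugged denoiser (e.g., \mathrm{D}_{\mathrm{FBS}}), and for readability the gradient is written for the scaled loss \frac{1}{2}\|\boldsymbol{y}-\mathbf{F}\boldsymbol{x}\|^{2}.

    import numpy as np
    from numpy.fft import fft2, ifft2

    def gaussian_kernel(size=25, sigma=1.6):
        # 25 x 25 Gaussian blur kernel, as in the experiments above
        ax = np.arange(size) - (size - 1) / 2.0
        xx, yy = np.meshgrid(ax, ax)
        k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
        return k / k.sum()  # normalized blur

    def apply_blur(img, kernel):
        # circular convolution with the kernel via the FFT
        pad = np.zeros_like(img, dtype=float)
        kh, kw = kernel.shape
        pad[:kh, :kw] = kernel
        pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
        return np.real(ifft2(fft2(img) * fft2(pad)))

    def pnp_fbs_deblur(y, kernel, denoiser, gamma=0.5, iters=100):
        # PnP-FBS: x_{k+1} = D(x_k - gamma * grad f(x_k)), with grad f(x) = F^T (F x - y)
        x = y.astype(float).copy()
        flipped = kernel[::-1, ::-1]  # F^T is convolution with the flipped kernel
        residuals = []
        for _ in range(iters):
            grad = apply_blur(apply_blur(x, kernel) - y, flipped)
            x_new = denoiser(x - gamma * grad)
            residuals.append(np.linalg.norm(x_new - x))  # should decrease for an averaged denoiser
            x = x_new
        return x, residuals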

In Table 2, we present numerical evidence to show that, in the absence of a theoretical guarantee, simply plugging a denoiser (in this case DnCNN) into PnP-FBS can cause the iterations to diverge. Indeed, the distance between successive iterates \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert is seen to diverge for DnCNN, which means that the sequence (\boldsymbol{x}_{k}) does not converge. This results in spurious reconstructions, as evident from the PSNR values in Table 2. On the other hand, we notice that \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert is decreasing for the proposed denoisers, thanks to their nonexpansive property.

 

[Figure 4 panels, left to right: (a) Input, (b) BM3D, (c) DnCNN, (d) N-CNN, (e) \mathrm{D}_{\mathrm{FBS}}, (f) \mathrm{D}_{\mathrm{DRS}}. PSNR (dB)/SSIM for the first image: (14.12, 0.2940), (36.52, 0.9458), (36.21, 0.9316), (32.73, 0.8885), (\mathbf{36.79}, \mathbf{0.9511}), (36.43, 0.9465); for the second image: (18.51, 0.4142), (36.23, 0.9287), (35.27, 0.9143), (32.91, 0.8755), (\mathbf{36.50}, \mathbf{0.9422}), (36.13, 0.9365).]
Figure 4: Deblurring using RED-PG with different denoisers. The values are displayed in the order (PSNR (dB), SSIM). The performance of our denoisers is comparable to BM3D and DnCNN; in particular, notice the zoomed regions, where our denoisers recover finer details.
Table 3: Deblurring using RED-PG (PSNR/SSIM) for blur kernel 1 in levin2009understanding (see text for more details).

image | DnCNN | N-CNN | \mathrm{D}_{\mathrm{FBS}} | \mathrm{D}_{\mathrm{DRS}}
barbara | \mathbf{25.08}/0.7307 | 23.98/0.6892 | \mathbf{25.08}/\mathbf{0.7410} | 24.98/0.7371
ship | 27.39/0.7502 | 26.82/0.7145 | \mathbf{27.52}/\mathbf{0.7627} | 27.47/0.7600
cameraman | 24.92/0.7867 | \mathbf{25.01}/0.7700 | 24.96/\mathbf{0.7808} | 24.91/\mathbf{0.7808}
couple | 26.95/0.7346 | 26.50/0.7046 | \mathbf{27.15}/\mathbf{0.7541} | 27.09/0.7508
fingerprint | 25.86/0.8773 | 23.55/0.7817 | \mathbf{25.99}/\mathbf{0.8843} | 25.90/0.8809
hill | 28.41/0.7301 | 28.19/0.7125 | \mathbf{28.71}/\mathbf{0.7573} | 28.64/0.7532
house | \mathbf{30.29}/\mathbf{0.8310} | 29.85/0.8252 | 29.92/0.8269 | 29.90/0.8278
lena | \mathbf{30.78}/\mathbf{0.8547} | 30.36/0.8421 | 30.73/0.8540 | 30.68/0.8545
man | 28.20/0.7749 | 27.94/0.7550 | \mathbf{28.45}/\mathbf{0.7928} | 28.38/0.7902
montage | 24.56/\mathbf{0.8881} | 25.10/0.8786 | \mathbf{24.78}/0.8653 | 24.70/0.8683
peppers | \mathbf{24.96}/0.8227 | 24.95/0.8174 | 24.95/\mathbf{0.8255} | 24.90/0.8252

Next, we present numerical evidence of PnP-FBS convergence using the proposed denoisers. This is done using a deblurring experiment on the Set12 dataset (the settings are the same as in Table 2). The results are shown in Fig. 3 (a-b), where we plot \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert and the PSNR for different k. As expected, \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert is decreasing and the PSNR stabilizes in less than 50 iterations. We also demonstrate that, owing to the contractivity of the PnP-FBS operator \mathrm{T}_{\mathrm{PnP}} with our denoiser \mathrm{D}_{\mathrm{FBS}}, the iterates converge to a unique fixed point even if we start from diverse initializations; the final reconstructions are identical, as shown in Fig. 3(c), where the MSE between the different reconstructions is of the order 1\mathrm{e}\mbox{-}8.

We next compare the regularization capacity of our denoisers with BM3D, DnCNN, and N-CNN for deblurring, where \mathbf{F} is a horizontal motion blur. This is done for the RED-PG algorithm. The results are shown for a couple of images in Fig. 4. For the two images from Fig. 2 (top row), which have a mix of delicate textures, smooth regions, and broader features, the reconstructions from our denoisers exhibit finer details than those obtained with the other denoisers.

A detailed comparison for RED-PG is provided in Table 3, where \mathbf{F} is one of the blur kernels from levin2009understanding . We notice that our performance is comparable with DnCNN.

Table 4: Deblurring using PnP-FBS: comparison with the denoiser in repetti2022dual . See Table 1 in repetti2022dual for the notations A-E.

\sigma | method | A | B | C | D | E
0.03 | Noisy | 12.78 | 19.29 | 16.06 | 16.24 | 22.54
0.03 | DualFB | 16.33 | 24.81 | 22.20 | 20.41 | 28.34
0.03 | \mathrm{D}_{\mathrm{DRS}} | 15.83 | 23.97 | 21.35 | 20.28 | 28.28
0.05 | Noisy | 12.12 | 18.11 | 15.12 | 14.58 | 20.60
0.05 | DualFB | 14.97 | 23.54 | 20.65 | 19.31 | 27.47
0.05 | \mathrm{D}_{\mathrm{DRS}} | 14.42 | 21.85 | 18.42 | 18.55 | 25.60
0.08 | Noisy | 10.82 | 16.11 | 13.44 | 12.10 | 17.85
0.08 | DualFB | 14.49 | 22.75 | 19.68 | 18.71 | 26.93
0.08 | \mathrm{D}_{\mathrm{DRS}} | 13.74 | 21.00 | 17.83 | 17.89 | 24.97
Table 5: Deblurring using PnP-FBS: comparison with the denoiser in hurault2022proximal , averaged over the CBSD68 dataset for 10 different kernels.

Method | \sigma=0.01 | \sigma=0.03 | \sigma=0.05
Prox-PnP | 30.57 | 27.80 | 26.61
Proposed | 28.50 | 27.11 | 26.43

In Table 4, we compare \mathrm{D}_{\mathrm{DRS}} with the denoiser in repetti2022dual , which we refer to as “DualFB”. We replicated the experimental setup used for the deblurring experiment in (repetti2022dual, Table 1), namely, an asymmetric blur operation followed by the addition of noise at various levels. For a fair comparison, similar to the experiment in repetti2022dual , we use PnP-FBS as the reconstruction algorithm for our denoiser and the signal-to-noise ratio (SNR) as the performance metric. We see from Table 4 that our denoiser closely parallels DualFB. It is not surprising that DualFB yields somewhat better results, since the linear layers are optimized in repetti2022dual , whereas we use fixed handcrafted transforms.

We next compare \mathrm{D}_{\mathrm{DRS}} with the “Prox-PnP” denoiser from hurault2022proximal ; the comparison is performed by using them as regularizers within the PnP-FBS framework. We replicated the experimental setup of the deblurring experiment in (hurault2022proximal, Table 3), in which the PSNR is averaged over the CBSD68 dataset for 10 different kernels and different noise levels. As shown in Table 5, our denoiser exhibits comparable reconstruction capacity to Prox-PnP, even though our grayscale denoiser is applied channel-by-channel. It is worth noting that our denoiser is contractive by construction, whereas the denoiser in hurault2022proximal is “softly” enforced to be 1-Lipschitz using a loss-based penalization. We expect that training our denoiser specifically on color images would further boost the color deblurring performance.

In light of the above comparisons, we wish to emphasize that the key advantage of our denoiser is that it comes with convergence guarantees. On the other hand, the convergence result in (repetti2022dual, , Proposition 2) assumes that the same linear transform is used across all layers, and this is generally not true if we optimize the transforms. It remains to be seen if it is possible to optimize our construction to match the denoiser in repetti2022dual .

[Figure 5 panels: (a) BM3D, (b) DnCNN.]
Figure 5: Violation of the nonexpansive property for BM3D, DnCNN, and N-CNN. We perform image superresolution with different denoisers plugged into PnP-FBS. A lower bound for the Lipschitz constant of the respective denoisers is computed from the data encountered in the PnP pipeline. More precisely, since the Lipschitz constant is defined to be the supremum of \|\mathrm{D}(\boldsymbol{u})-\mathrm{D}(\boldsymbol{v})\|/\|\boldsymbol{u}-\boldsymbol{v}\| over all possible inputs \boldsymbol{u} and \boldsymbol{v}, setting \boldsymbol{u}=\boldsymbol{x}_{k} and \boldsymbol{v}=\boldsymbol{x}_{k+1} gives a lower bound for the Lipschitz constant at each iteration. Note that for BM3D, DnCNN, and N-CNN, the lower bound is >1 in many iterations, proving that these denoisers are not nonexpansive. As expected, the lower bound for the Lipschitz constant of the proposed denoisers remains at most 1.
[Figure 6 panels: (a) Bicubic, (b) BM3D (\mathbf{31.03}, \mathbf{0.9250}), (c) DnCNN (30.22, 0.9167), (d) N-CNN (28.92, 0.8925), (e) \mathrm{D}_{\mathrm{FBS}} (30.61, 0.9167), (f) \mathrm{D}_{\mathrm{DRS}} (30.48, \mathbf{0.9202}), (g) Convergence plot.]
Figure 6: Superresolution using PnP-FBS with different denoisers. The PSNR and SSIM after 30 iterations are displayed. Reconstructions obtained from our denoisers (\mathrm{D}_{\mathrm{FBS}} and \mathrm{D}_{\mathrm{DRS}}) are comparable with BM3D and DnCNN. The convergence plot suggests that the PnP iterates diverge for DnCNN and N-CNN.

5.3 Superresolution

A widely used model for superresolution is \boldsymbol{y}=\mathbf{S}\mathbf{B}\boldsymbol{\xi}+\boldsymbol{\eta}, where \mathbf{B}\in\mathbb{R}^{n\times n} and \mathbf{S}\in\mathbb{R}^{m\times n} are the blur and subsampling operators (m=n/k, k>1), and \boldsymbol{\eta} is Gaussian noise chan2017plug ; nair2021fixed . For superresolution, the loss function is f(\boldsymbol{x})=\|\boldsymbol{y}-\mathbf{S}\mathbf{B}\boldsymbol{x}\|^{2}, where \mathbf{S} is a decimation matrix and \mathbf{B} is a blur matrix. As in deblurring, we used a normalized blur. In this case, \beta is bounded by the product of the spectral norms of \mathbf{S} and \mathbf{B}, so we have \beta\leqslant 1. In PnP-FBS, we used the step size \gamma=1.0, which satisfies the condition \gamma\leqslant 2/\beta in Corollaries 3 and 6. We set the noise level to \sigma=5 for both our denoisers in all experiments.
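
The following is a minimal sketch of the superresolution forward operator \mathbf{S}\mathbf{B} and one PnP-FBS step. It assumes blur and blur_adj are callables implementing \mathbf{B} and \mathbf{B}^{T} (e.g., as in the deblurring sketch above), the decimation keeps every k-th pixel along each axis, and the gradient is written for the scaled loss \frac{1}{2}\|\boldsymbol{y}-\mathbf{S}\mathbf{B}\boldsymbol{x}\|^{2}; the function names are illustrative and not part of our implementation.

    import numpy as np

    def downsample(img, k):
        # S: keep every k-th pixel along each axis
        return img[::k, ::k]

    def upsample_adjoint(low, k, shape):
        # S^T: place the samples back on the coarse grid and zero-fill elsewhere
        up = np.zeros(shape)
        up[::k, ::k] = low
        return up

    def sr_gradient(x, y, blur, blur_adj, k):
        # gradient of f(x) = (1/2) ||y - S B x||^2, i.e., B^T S^T (S B x - y)
        residual = downsample(blur(x), k) - y
        return blur_adj(upsample_adjoint(residual, k, x.shape))

    def pnp_fbs_sr_step(x, y, blur, blur_adj, k, denoiser, gamma=1.0):
        # one PnP-FBS iteration: x <- D(x - gamma * grad f(x))
        return denoiser(x - gamma * sr_gradient(x, y, blur, blur_adj, k))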

Table 6: Residual \|\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\| for PnP-FBS with different denoisers.

denoiser | k=10 | k=50 | k=100 | k=500 | k=1000 | k=3000
N-CNN Ryu2019_PnP_trained_conv | 0.54 | 0.20 | 0.19 | 0.36 | 23.45 | 3\mathrm{e}+12
\mathrm{D}_{\mathrm{FBS}} | 0.31 | 0.15 | 0.11 | 0.0602 | 0.0513 | 2\mathrm{e}-10
\mathrm{D}_{\mathrm{DRS}} | 0.32 | 0.16 | 0.11 | 0.0658 | 0.0562 | 2\mathrm{e}-10
Table 7: Superresolution using PnP-FBS: comparison of the reconstruction capacity (PSNR/SSIM) of different denoisers (see text for details).

image | DnCNN | N-CNN | \mathrm{D}_{\mathrm{FBS}} | \mathrm{D}_{\mathrm{DRS}}
cameraman | \mathbf{27.12}/0.8519 | 26.89/0.8450 | 26.60/0.8616 | 26.63/\mathbf{0.8540}
house | 33.29/0.8695 | 32.84/0.8736 | 32.75/0.8531 | \mathbf{33.70}/\mathbf{0.8780}
peppers | \mathbf{28.85}/0.9071 | 28.43/0.9027 | 27.74/\mathbf{0.9079} | 27.67/0.9059
starfish | 29.87/0.9014 | 27.59/0.8609 | 28.80/0.9041 | \mathbf{28.87}/\mathbf{0.8979}
butterfly | \mathbf{30.28}/0.9403 | 30.36/0.9340 | 28.81/0.9341 | 28.95/\mathbf{0.9318}
plane | \mathbf{26.77}/0.8734 | 25.97/0.8520 | 26.07/0.8770 | 26.23/\mathbf{0.8735}
parrot | \mathbf{28.18}/0.8866 | 27.60/0.8865 | 27.62/0.8998 | 27.59/\mathbf{0.8934}
lena | 34.00/\mathbf{0.8980} | 33.21/0.8942 | 33.87/0.9118 | \mathbf{33.76}/\mathbf{0.9063}
barbara | 24.87/0.7623 | 24.52/0.7360 | 25.00/0.7775 | \mathbf{24.90}/\mathbf{0.7629}
ship | 30.29/0.8414 | 29.33/0.8065 | 30.02/0.8531 | \mathbf{29.93}/\mathbf{0.8430}
man | 30.79/0.8595 | 30.25/0.8409 | 30.83/0.8800 | \mathbf{30.67}/\mathbf{0.8678}
couple | 29.96/0.8436 | 28.98/0.8105 | 29.57/0.8522 | \mathbf{29.51}/\mathbf{0.8440}

For the results in Fig. 5 and Table 6, \mathbf{B} is a Gaussian blur of size 25\times 25 and standard deviation 1.6, and \sigma=0.04. We start by presenting numerical evidence that BM3D, DnCNN, and N-CNN are not nonexpansive, even if we restrict the input to images encountered in the PnP pipeline. More precisely, we apply PnP-FBS for image superresolution and compute the output-input ratio \|\boldsymbol{x}_{k+1}-\boldsymbol{x}_{k}\|/\|\boldsymbol{z}_{k+1}-\boldsymbol{z}_{k}\| for the denoising operation \boldsymbol{x}_{k+1}=\mathrm{D}(\boldsymbol{z}_{k+1}). This gives a lower bound for the Lipschitz constant of the plugged denoiser. In particular, if any of these ratios is >1, the Lipschitz constant must exceed 1, and the denoiser cannot be nonexpansive. The ratio for different denoisers across iterations is shown in Fig. 5. We notice that the ratio is >1 at many iterations for BM3D, DnCNN, and N-CNN, proving that these denoisers are not nonexpansive. For N-CNN, the residual is constrained to be nonexpansive, but this constraint is also violated, as shown in Fig. 5(c). On the other hand, the ratio is always <1 for our denoisers, which is consistent with the theoretical results in Section 4. Similar to the counterexample in Table 2, we provide a counterexample in Table 6 that suggests divergence of the iterates on plugging N-CNN into PnP-FBS. This is consistent with the observation in Fig. 5 that N-CNN violates nonexpansivity.
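
The check described above amounts to the following computation; this is a minimal sketch, where inputs stands for the sequence of denoiser inputs \boldsymbol{z}_{k} logged during the PnP-FBS iterations, and the function name is illustrative.

    import numpy as np

    def lipschitz_lower_bounds(denoiser, inputs):
        # ratios ||D(z_{k+1}) - D(z_k)|| / ||z_{k+1} - z_k|| over the logged denoiser inputs;
        # any ratio > 1 certifies that the denoiser is not nonexpansive
        outputs = [denoiser(z) for z in inputs]
        ratios = []
        for k in range(len(inputs) - 1):
            den = np.linalg.norm(inputs[k + 1] - inputs[k])
            if den > 0:
                ratios.append(np.linalg.norm(outputs[k + 1] - outputs[k]) / den)
        return ratios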

We next compare our denoisers with BM3D, DnCNN, and N-CNN for the PnP-FBS algorithm using just downsampling (\mathbf{B} is the identity operator). The results are shown in Fig. 6 for \sigma=5/255. It is clear from the visual results and the metrics that the restorations using our denoisers are comparable with BM3D and DnCNN. We also report the residual \lVert\boldsymbol{x}_{k}-\boldsymbol{x}_{k-1}\rVert for sufficiently large k; it goes to zero monotonically for our denoisers but is seen to diverge for DnCNN and N-CNN. A detailed comparison for PnP-FBS is shown in Table 7, where the blur and noise level are the same as in Fig. 5. Again, we notice that our performance is comparable with DnCNN.

In summary, the findings from our empirical study are as follows:

1. The numerical results in Fig. 5 and Tables 2 and 6 show that denoisers such as BM3D, DnCNN, and N-CNN are not guaranteed to be nonexpansive and can cause PnP algorithms to break down.

2. Fig. 3 demonstrates that the proposed denoisers exhibit empirical convergence in agreement with the theoretical predictions in Corollaries 3 and 6.

3. From the results in Figures 4 and 6, and the extensive comparisons in Tables 1, 3, and 7, we conclude that the regularization capacity of the proposed denoisers is comparable with BM3D and DnCNN for deblurring and superresolution.

6 Conclusion

We trained averaged and contractive patch denoisers by unfolding FBS and DRS applied to wavelet denoising. We extended them to full image denoising using patch aggregation while preserving their averaged (contractive) properties. Moreover, by using sufficiently many layers, we showed that their regularization capacity for PnP and RED can be brought on par with BM3D and DnCNN. Unlike existing CNN denoisers, we can guarantee convergence of PnP and RED using the proposed denoisers, and we presented numerical evidence to back these results. To our knowledge, this is the first work to exploit the properties of proximal operators in unfolded algorithms to develop contractive denoisers. In future work, we wish to go beyond classical wavelet denoising and develop unfolded denoisers with superior denoising and regularization capacity.

7 Funding

K. N. Chaudhury was supported by research grants CRG/2020/000527 and STR/2021/000011 from the Science and Engineering Research Board, Government of India.

8 Appendix

We prove Proposition 11 in this section. First, we recall some notations from Section 4.4.

1. For integers p\leqslant q, we denote [p,q]=\{p,p+1,\ldots,q\}, [p,q]_{s}=\{ps,(p+1)s,\ldots,qs\}, [p,q]^{2}=[p,q]\times[p,q], and [p,q]_{s}^{2}=[p,q]_{s}\times[p,q]_{s}.

2. The input to the denoiser is an image \mathbf{X}:\Omega\to\mathbb{R}, where \Omega=[0,q-1]^{2}. We also consider the image in matrix form as \mathbf{X}\in\mathbb{R}^{q\times q}.

3. We use circular shifts: for \boldsymbol{i}\in\Omega and \boldsymbol{\tau}\in\mathbb{Z}^{2}, \boldsymbol{i}-\boldsymbol{\tau} is defined as \left((i_{1}-\tau_{1})\ \mathrm{mod}\ q,\,(i_{2}-\tau_{2})\ \mathrm{mod}\ q\right).

4. We extract patches with stride s. More precisely, the starting coordinates of the patches are

\mathcal{J}=[0,(q_{s}-1)]_{s}^{2},

where q_{s}=q/s. The cardinality of \mathcal{J} is q_{s}^{2} (the total number of patches), and each pixel belongs to k_{s}^{2} patches, where k_{s}=k/s.

5. For \boldsymbol{i}\in\Omega, the patch operator \mathcal{P}_{\boldsymbol{i}}:\mathbb{R}^{q\times q}\to\mathbb{R}^{p} is defined in (4.4).

We now elaborate on the definition of the adjoint in (18). For a patch vector \boldsymbol{z}\in\mathbb{R}^{p}, if we let \mathbf{Y}=\mathcal{P}^{*}_{\boldsymbol{i}}(\boldsymbol{z}), then \mathcal{P}_{\boldsymbol{i}}(\mathbf{Y})=\boldsymbol{z} and \mathbf{Y}(\boldsymbol{j})=0 for all \boldsymbol{j}\notin\Omega_{\boldsymbol{i}}, where

\Omega_{\boldsymbol{i}}=\left\{\boldsymbol{i}+\boldsymbol{\tau}:\ \boldsymbol{\tau}\in[0,k-1]^{2}\right\}.

This completely specifies \mathbf{Y}. In particular, we have the following property of an adjoint: for all \mathbf{X}\in\mathbb{R}^{q\times q} and \boldsymbol{z}\in\mathbb{R}^{p},

\langle\boldsymbol{z},\mathcal{P}_{\boldsymbol{i}}(\mathbf{X})\rangle_{\mathbb{R}^{p}}=\langle\mathcal{P}_{\boldsymbol{i}}^{*}(\boldsymbol{z}),\mathbf{X}\rangle_{\mathbb{R}^{q\times q}}, \qquad (21)

where the inner product on the left (resp. right) is the standard Euclidean inner product on \mathbb{R}^{p} (resp. \mathbb{R}^{q\times q}).
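
For concreteness, the adjoint identity (21) can be verified numerically with the following sketch, which assumes the circular indexing of item 3 above; the function names are illustrative and not part of our implementation.

    import numpy as np

    def patch(X, i, k):
        # P_i: extract the k x k patch of X starting at i = (i1, i2), with circular indexing
        rows = (i[0] + np.arange(k)) % X.shape[0]
        cols = (i[1] + np.arange(k)) % X.shape[1]
        return X[np.ix_(rows, cols)].ravel()

    def patch_adjoint(z, i, k, q):
        # P_i^*: place the patch vector z at location i and zero-fill the rest of the image
        Y = np.zeros((q, q))
        rows = (i[0] + np.arange(k)) % q
        cols = (i[1] + np.arange(k)) % q
        Y[np.ix_(rows, cols)] = z.reshape(k, k)
        return Y

    # numerical check of the adjoint identity (21)
    rng = np.random.default_rng(0)
    q, k, i = 16, 4, (3, 7)
    X = rng.standard_normal((q, q))
    z = rng.standard_normal(k * k)
    lhs = z @ patch(X, i, k)                     # <z, P_i(X)>
    rhs = np.sum(patch_adjoint(z, i, k, q) * X)  # <P_i^*(z), X>
    assert np.isclose(lhs, rhs)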

The main idea behind Proposition 11 is to express the original system of overlapping patches in terms of non-overlapping patches. In particular, let q_{k}=q/k; since we assume q to be a multiple of k, q_{k} is an integer. Let

\mathcal{J}_{0}=[0,(q_{k}-1)]_{k}^{2}.

By construction, \mathcal{J}_{0}\subseteq\mathcal{J} and the points in \mathcal{J}_{0} are the starting coordinates of non-overlapping patches. It is not difficult to see that \mathcal{J} is the disjoint union of (circular) shifts of \mathcal{J}_{0}:

\mathcal{J}=\bigcup_{\boldsymbol{i}\in[0,k_{s}-1]^{2}}\,\big(\mathcal{J}_{0}+s\boldsymbol{i}\big), \qquad (22)

where [0,k_{s}-1]:=\{0,1,\ldots,k_{s}-1\}.

Using (22), we can decompose (18) as follows:

\mathrm{D}=\frac{1}{k_{s}^{2}}\sum_{\boldsymbol{i}\in[0,k_{s}-1]^{2}}\mathrm{D}_{\boldsymbol{i}}, \qquad (23)

where

\mathrm{D}_{\boldsymbol{i}}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \left(\mathcal{P}^{*}_{\boldsymbol{j}+s\boldsymbol{i}}\circ\mathrm{D}_{\mathrm{patch}}\circ\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}\right). \qquad (24)

The proof of Proposition 11 follows from a couple of observations. The first is an observation about operator averaging. We say that an operator \mathrm{T} is L-contractive if its Lipschitz constant is at most L.

Lemma 12.

Let \mathrm{T}_{1},\ldots,\mathrm{T}_{k} be L-contractive. Then, their average,

\frac{1}{k}(\mathrm{T}_{1}+\cdots+\mathrm{T}_{k}), \qquad (25)

is L-contractive. On the other hand, if \mathrm{T}_{1},\ldots,\mathrm{T}_{k} are \theta-averaged, then (25) is a \theta-averaged operator.

The second observation relates the contractive (averaged) property of the patch denoiser \mathrm{D}_{\mathrm{patch}} to that of the image denoisers \mathrm{D}_{\boldsymbol{i}} in (24).

Lemma 13.

For \boldsymbol{i}\in[0,k_{s}-1]^{2}, the following hold.

1. If \mathrm{D}_{\mathrm{patch}} is L-contractive, then \mathrm{D}_{\boldsymbol{i}} is L-contractive.

2. If \mathrm{D}_{\mathrm{patch}} is \theta-averaged, then \mathrm{D}_{\boldsymbol{i}} is \theta-averaged.

To establish Proposition 11, note from (23) that \mathrm{D} is the average of \{\mathrm{D}_{\boldsymbol{i}}\}. The desired conclusion follows immediately from Lemmas 12 and 13.

We now give the proofs of Lemmas 12 and 13.

For Lemma 12, the second part follows from (bauschke2017convex, Proposition 4.30); however, we give the proofs of both parts for completeness.

Proof: [Proof of Lemma 12] Let \mathrm{T}=(1/k)(\mathrm{T}_{1}+\cdots+\mathrm{T}_{k}), where \mathrm{T}_{1},\ldots,\mathrm{T}_{k} are L-contractive. Then, for i=1,\ldots,k and for all \boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}^{n},

\lVert\mathrm{T}_{i}(\boldsymbol{x})-\mathrm{T}_{i}(\boldsymbol{x}^{\prime})\rVert\leqslant L\,\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert. \qquad (26)

We can write

\mathrm{T}(\boldsymbol{x})-\mathrm{T}(\boldsymbol{x}^{\prime})=\frac{1}{k}\sum_{i=1}^{k}\,\left(\mathrm{T}_{i}(\boldsymbol{x})-\mathrm{T}_{i}(\boldsymbol{x}^{\prime})\right).

Using the triangle inequality and (26), we get

\lVert\mathrm{T}(\boldsymbol{x})-\mathrm{T}(\boldsymbol{x}^{\prime})\rVert\leqslant\frac{1}{k}\sum_{i=1}^{k}\,L\,\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert=L\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert.

This establishes the first part of Lemma 12.

Next, suppose that \mathrm{T}_{1},\ldots,\mathrm{T}_{k} are \theta-averaged, i.e., there exist nonexpansive operators \mathrm{N}_{1},\ldots,\mathrm{N}_{k} such that

\mathrm{T}_{i}=(1-\theta)\mathrm{I}+\theta\mathrm{N}_{i}\qquad(i=1,\ldots,k).

Then

\mathrm{T}=\frac{1}{k}(\mathrm{T}_{1}+\cdots+\mathrm{T}_{k})=(1-\theta)\mathrm{I}+\theta\mathrm{N},

where \mathrm{N}=(1/k)(\mathrm{N}_{1}+\cdots+\mathrm{N}_{k}).

We can again use the triangle inequality to show that \mathrm{N} is nonexpansive. Thus, we have shown that \mathrm{T} is \theta-averaged, which completes the proof.
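
Explicitly, for all \boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathbb{R}^{n},

\lVert\mathrm{N}(\boldsymbol{x})-\mathrm{N}(\boldsymbol{x}^{\prime})\rVert\leqslant\frac{1}{k}\sum_{i=1}^{k}\lVert\mathrm{N}_{i}(\boldsymbol{x})-\mathrm{N}_{i}(\boldsymbol{x}^{\prime})\rVert\leqslant\frac{1}{k}\sum_{i=1}^{k}\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert=\lVert\boldsymbol{x}-\boldsymbol{x}^{\prime}\rVert.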

We need the following result to establish Lemma 13.

Lemma 14.

For any fixed s and \boldsymbol{i}\in[0,k_{s}-1]^{2}, we have the following orthogonality properties:

1. For all \mathbf{X}:\Omega\to\mathbb{R},

\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \lVert\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X})\rVert^{2}={\lVert\mathbf{X}\rVert}^{2}. \qquad (27)

We can equivalently write this as

\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}^{*}\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}=\mathrm{I}, \qquad (28)

where \mathrm{I} is the identity operator on \mathbb{R}^{q\times q}.

2. For all \{\boldsymbol{z}_{\boldsymbol{j}}\in\mathbb{R}^{p}:\,\boldsymbol{j}\in\mathcal{J}_{0}\},

\big\lVert\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}^{*}(\boldsymbol{z}_{\boldsymbol{j}})\big\rVert^{2}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \lVert\boldsymbol{z}_{\boldsymbol{j}}\rVert^{2}. \qquad (29)

It is understood that the norms on the left in (27) are Euclidean norms on \mathbb{R}^{p}, while the norm on the right is the Euclidean norm on \mathbb{R}^{q\times q}.
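
As a quick numerical sanity check (not part of the proof), the identities (27) and (29) can be verified for the zero-shift tiling \mathcal{J}_{0}; the general circular-shift case is analogous. The snippet below is a minimal sketch with illustrative variable names.

    import numpy as np

    # check (27) and (29) for the zero-shift non-overlapping tiling J_0
    rng = np.random.default_rng(1)
    q, k = 16, 4
    J0 = [(a * k, b * k) for a in range(q // k) for b in range(q // k)]

    X = rng.standard_normal((q, q))

    # (27): the non-overlapping patches carry exactly the energy of the image
    energy = sum(np.sum(X[r:r + k, c:c + k] ** 2) for (r, c) in J0)
    assert np.isclose(energy, np.sum(X ** 2))

    # (29): the adjoints of a non-overlapping tiling have orthogonal ranges
    Z = {j: rng.standard_normal((k, k)) for j in J0}
    Y = np.zeros((q, q))
    for (r, c), z in Z.items():
        Y[r:r + k, c:c + k] += z  # P^*_j(z_j) placed on its own tile
    assert np.isclose(np.sum(Y ** 2), sum(np.sum(z ** 2) for z in Z.values()))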

We can establish Lemma 14 using property (21) and the fact that the points in \mathcal{J}_{0} are the starting coordinates of non-overlapping patches. We will skip this straightforward calculation. We are ready to prove Lemma 13.

Proof: [Proof of Lemma 13] Let \mathrm{D}_{\mathrm{patch}} be L-contractive, i.e., for all \boldsymbol{z},\boldsymbol{z}^{\prime}\in\mathbb{R}^{p},

\lVert\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z})-\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}^{\prime})\rVert\leqslant L\lVert\boldsymbol{z}-\boldsymbol{z}^{\prime}\rVert. \qquad (30)

We have to show that, for all \mathbf{X},\mathbf{X}^{\prime}\in\mathbb{R}^{q\times q},

\lVert\mathrm{D}_{\boldsymbol{i}}(\mathbf{X})-\mathrm{D}_{\boldsymbol{i}}(\mathbf{X}^{\prime})\rVert^{2}\leqslant L^{2}\lVert\mathbf{X}-\mathbf{X}^{\prime}\rVert^{2}. \qquad (31)

From (27) and (29), we can write

\lVert\mathrm{D}_{\boldsymbol{i}}(\mathbf{X})-\mathrm{D}_{\boldsymbol{i}}(\mathbf{X}^{\prime})\rVert^{2}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}{\|\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}_{\boldsymbol{j}})-\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}^{\prime}_{\boldsymbol{j}})\|}^{2},

where \boldsymbol{z}_{\boldsymbol{j}}=\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}) and \boldsymbol{z}^{\prime}_{\boldsymbol{j}}=\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}^{\prime}). From (30),

\lVert\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}_{\boldsymbol{j}})-\mathrm{D}_{\mathrm{patch}}(\boldsymbol{z}^{\prime}_{\boldsymbol{j}})\rVert\leqslant L\,\lVert\boldsymbol{z}_{\boldsymbol{j}}-\boldsymbol{z}^{\prime}_{\boldsymbol{j}}\rVert=L\,\lVert\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}-\mathbf{X}^{\prime})\rVert.

Moreover, we have from (27) that

\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \lVert\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}(\mathbf{X}-\mathbf{X}^{\prime})\rVert^{2}=\lVert\mathbf{X}-\mathbf{X}^{\prime}\rVert^{2}.

Combining the above, we get (31). This establishes part 1 of Lemma 13.

Now, let \mathrm{D}_{\mathrm{patch}} be \theta-averaged, i.e.,

\mathrm{D}_{\mathrm{patch}}=(1-\theta)\mathrm{I}+\theta\mathrm{N}, \qquad (32)

where \mathrm{N} is nonexpansive. Substituting (32) in (24), and using (28), we can write

\mathrm{D}_{\boldsymbol{i}}=(1-\theta)\mathrm{I}+\theta\mathrm{G}_{\boldsymbol{i}},

where

\mathrm{G}_{\boldsymbol{i}}=\sum_{\boldsymbol{j}\in\mathcal{J}_{0}}\ \left(\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}^{*}\circ\mathrm{N}\circ\mathcal{P}_{\boldsymbol{j}+s\boldsymbol{i}}\right). \qquad (33)

Using (27) and (29), and the fact that \mathrm{N} is nonexpansive, we can show (the calculation is similar to the one used earlier) that, for all \mathbf{X},\mathbf{X}^{\prime}\in\mathbb{R}^{q\times q},

\lVert\mathrm{G}_{\boldsymbol{i}}(\mathbf{X})-\mathrm{G}_{\boldsymbol{i}}(\mathbf{X}^{\prime})\rVert^{2}\leqslant\lVert\mathbf{X}-\mathbf{X}^{\prime}\rVert^{2}.

That is, \mathrm{G}_{\boldsymbol{i}} is nonexpansive. It follows that \mathrm{D}_{\boldsymbol{i}} is \theta-averaged. This establishes part 2 of Lemma 13.

References

  • (1) A. Ribes, F. Schmitt, Linear inverse problems in imaging. IEEE Signal Process. Mag. 25(4), 84–99 (2008)
  • (2) H.W. Engl, M. Hanke, A. Neubauer, Regularization of Inverse Problems (Kluwer Academic Publishers, Dordrecht, Netherlands, 1996)
  • (3) W. Dong, L. Zhang, G. Shi, X. Wu, Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Trans. Image Process. 20(7), 1838–1857 (2011)
  • (4) G. Jagatap, C. Hegde, Algorithmic guarantees for inverse imaging with untrained network priors. Proc. Adv. Neural Inf. Process. Syst. pp. 14,832–14,842 (2019)
  • (5) S.H. Chan, X. Wang, O.A. Elgendy, Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Trans. Comput. Imaging 3(1), 84–98 (2017)
  • (6) A. Rond, R. Giryes, M. Elad, Poisson inverse problems by the plug-and-play scheme. J. Vis. Commun. Image Represent. 41, 96–108 (2016)
  • (7) J.M. Bioucas-Dias, M. Figueiredo, Multiplicative noise removal using variable splitting and constrained optimization. IEEE Trans. Image Process. 19(7), 1720–1730 (2010)
  • (8) L.I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60(1-4), 259–268 (1992)
  • (9) E.J. Candes, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted \ell_{1} minimization. J. Fourier Anal. Appl. 14(5), 877–905 (2008)
  • (10) W. Dong, X. Li, L. Zhang, G. Shi, Sparsity-based image denoising via dictionary learning and structural clustering. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. pp. 457–464 (2011)
  • (11) L. Zhang, W. Zuo, Image restoration: From sparse and low-rank priors to deep priors [lecture notes]. IEEE Signal Process. Mag. 34(5), 172–179 (2017)
  • (12) G.B. Passty, Ergodic convergence to a zero of the sum of monotone operators in Hilbert space. J. Math. Anal. Appl. 72(2), 383–390 (1979)
  • (13) H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. (Springer, New York, NY, USA, 2017)
  • (14) E. Ryu, S. Boyd, Primer on monotone operator methods. Appl. Comput. Math. 15(1), 3–43 (2016)
  • (15) A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
  • (16) S. Sreehari, S.V. Venkatakrishnan, B. Wohlberg, G.T. Buzzard, L.F. Drummy, J.P. Simmons, C.A. Bouman, Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Trans. Comput. Imaging 2(4), 408–423 (2016)
  • (17) K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
  • (18) K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
  • (19) E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, W. Yin, Plug-and-play methods provably converge with properly trained denoisers. Proc. Intl. Conf. Mach. Learn. 97, 5546–5557 (2019)
  • (20) Y. Sun, B. Wohlberg, U.S. Kamilov, An online plug-and-play algorithm for regularized image reconstruction. IEEE Trans. Comput. Imaging 5(3), 395–408 (2019)
  • (21) K. Zhang, W. Zuo, S. Gu, L. Zhang, Learning deep CNN denoiser prior for image restoration. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. pp. 3929–3938 (2017)
  • (22) T. Tirer, R. Giryes, Image restoration by iterative denoising and backward projections. IEEE Trans. Image Process. 28(3), 1220–1234 (2019)
  • (23) K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, R. Timofte, Plug-and-play image restoration with deep denoiser prior. IEEE Trans. Pattern Anal. Mach. Intell. (2021)
  • (24) S. Hurault, A. Leclaire, N. Papadakis, Gradient step denoiser for convergent plug-and-play. Proc. Int. Conf. Learn. Represent. (2022)
  • (25) S. Hurault, A. Leclaire, N. Papadakis, Proximal denoiser for convergent plug-and-play optimization with nonconvex regularization. arXiv:2201.13256 (2022)
  • (26) Y. Romano, M. Elad, P. Milanfar, The little engine that could: Regularization by denoising (RED). SIAM J. Imaging Sci. 10(4), 1804–1844 (2017)
  • (27) E.T. Reehorst, P. Schniter, Regularization by denoising: Clarifications and new interpretations. IEEE Trans. Comput. Imaging 5(1), 52–67 (2018)
  • (28) Y. Sun, J. Liu, U.S. Kamilov, Block coordinate regularization by denoising. Proc. Adv. Neural Inf. Process. Syst. pp. 380–390 (2019)
  • (29) Y. Sun, Z. Wu, X. Xu, B. Wohlberg, U.S. Kamilov, Scalable plug-and-play ADMM with convergence guarantees. IEEE Trans. Comput. Imaging 7, 849–863 (2021)
  • (30) R. Cohen, M. Elad, P. Milanfar, Regularization by denoising via fixed-point projection (RED-PRO). SIAM J. Imaging Sci. 14(3), 1374–1406 (2021)
  • (31) C. Metzler, P. Schniter, A. Veeraraghavan, et al., prdeep: robust phase retrieval with a flexible deep network. Proc. Intl. Conf. Mach. Learn. pp. 3501–3510 (2018)
  • (32) Z. Wu, Y. Sun, A. Matlock, J. Liu, L. Tian, U.S. Kamilov, SIMBA: Scalable inversion in optical tomography using deep denoising priors. IEEE J. Sel. Top. Signal Process. 14(6), 1163–1175 (2020)
  • (33) Z. Wu, Y. Sun, J. Liu, U. Kamilov, Online regularization by denoising with applications to phase retrieval. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. Wkshp. (2019)
  • (34) G. Mataev, P. Milanfar, M. Elad, DeepRED: Deep image prior powered by RED. Proc. IEEE Intl. Conf. Comp. Vis. Wksh. (2019)
  • (35) Y. Hu, J. Liu, X. Xu, U.S. Kamilov, Monotonically convergent regularization by denoising. arXiv:2202.04961 (2022)
  • (36) K. Gregor, Y. LeCun, Learning fast approximations of sparse coding. Proc. Intl. Conf. Mach. Learn. pp. 399–406 (2010)
  • (37) J. Sun, H. Li, Z. Xu, et al., Deep ADMM-Net for compressive sensing MRI. Proc. Adv. Neural Inf. Process. Syst. 29 (2016)
  • (38) J. Zhang, B. Ghanem, ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. Proc. IEEE Intl. Conf. Comp. Vis. pp. 1828–1837 (2018)
  • (39) A. Repetti, M. Terris, Y. Wiaux, J.C. Pesquet, Dual forward-backward unfolded network for flexible plug-and-play. Proc. Eur. Signal Process. Conf. pp. 957–961 (2022)
  • (40) C.D. Athalye, K.N. Chaudhury, B. Kumar, On the contractivity of plug-and-play operators. IEEE Signal Processing Letters 30, 1447–1451 (2023)
  • (41) P. Nair, R.G. Gavaskar, K.N. Chaudhury, Fixed-point and objective convergence of plug-and-play algorithms. IEEE Trans. Comput. Imaging 7, 337–348 (2021)
  • (42) R.G. Gavaskar, C.D. Athalye, K.N. Chaudhury, On plug-and-play regularization using linear denoisers. IEEE Trans. Image Process. 30, 4802–4813 (2021)
  • (43) J. Liu, S. Asif, B. Wohlberg, U. Kamilov, Recovery analysis for plug-and-play priors using the restricted eigenvalue condition. Proc. Adv. Neural Inf. Process. Syst. 34 (2021)
  • (44) A. Raj, Y. Li, Y. Bresler, GAN-based projector for faster recovery with convergence guarantees in linear inverse problems. Proc. IEEE Intl. Conf. Comp. Vis. pp. 5602–5611 (2019)
  • (45) R. Cohen, Y. Blau, D. Freedman, E. Rivlin, It has potential: Gradient-driven denoisers for convergent solutions to inverse problems. Proc. Adv. Neural Inf. Process. Syst. 34 (2021)
  • (46) J.C. Pesquet, A. Repetti, M. Terris, Y. Wiaux, Learning maximally monotone operators for image recovery. SIAM J. Imaging Sci. 14(3), 1206–1237 (2021)
  • (47) R. Laumont, V. De Bortoli, A. Almansa, J. Delon, A. Durmus, M. Pereyra, On maximum a posteriori estimation with plug & play priors and stochastic gradient descent. Journal of Mathematical Imaging and Vision 65(1), 140–163 (2023)
  • (48) A. Virmaux, K. Scaman, Lipschitz regularity of deep neural networks: Analysis and efficient estimation. Proc. Adv. Neural Inf. Process. Syst. 31 (2018)
  • (49) H. Sedghi, V. Gupta, P.M. Long, The singular values of convolutional layers. Proc. Int. Conf. Learn. Represent. (2019)
  • (50) J. Hertrich, S. Neumayer, G. Steidl, Convolutional proximal neural networks and plug-and-play algorithms. Linear Algebra Appl. 631, 203–234 (2021)
  • (51) M. Terris, A. Repetti, J.C. Pesquet, Y. Wiaux, Building firmly nonexpansive convolutional neural networks. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. pp. 8658–8662 (2020)
  • (52) N. Parikh, S. Boyd, Proximal algorithms. Found. Trends. Optim. 1(3), 127–239 (2014)
  • (53) J. Eckstein, D.P. Bertsekas, On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1), 293–318 (1992)
  • (54) V. Pata, Fixed Point Theorem and Applications (Springer, 2019)
  • (55) J.B. Baillon, G. Haddad, Quelques propriétés des opérateurs angle-bornés et n-cycliquement monotones. Israel Journal of Mathematics 26, 137–150 (1977)
  • (56) A. Chambolle, R.A. De Vore, N.Y. Lee, B.J. Lucier, Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. Image Process. 7(3), 319–335 (1998)
  • (57) D.P. Kingma, J. Ba, Adam: A method for stochastic optimization. arXiv:1412.6980 (2014)
  • (58) A. Levin, Y. Weiss, F. Durand, W.T. Freeman, Understanding and evaluating blind deconvolution algorithms. Proc. IEEE Conf. Comp. Vis. Pattern Recognit. pp. 1964–1971 (2009)