
Unsupervised Image Restoration Using Partially Linear Denoisers

Rihuan Ke and Carola-Bibiane Schönlieb R. Ke and C.-B. Schönlieb are with DAMTP, University of Cambridge, Cambridge CB3 0WA, UK.
Abstract

Deep neural network based methods are the state of the art in various image restoration problems. Standard supervised learning frameworks require a set of pairs of noisy measurements and clean images, for which a distance between the output of the restoration model and the ground truth clean images is minimized. The ground truth images, however, are often unavailable or very expensive to acquire in real-world applications. We circumvent this problem by proposing a class of structured denoisers that can be decomposed as the sum of a nonlinear image-dependent mapping, a linear noise-dependent term, and a small residual term. We show that these denoisers can be trained with only noisy images under the condition that the noise has zero mean and known variance. The exact distribution of the noise, however, is not assumed to be known. We show the superiority of our approach for image denoising, and demonstrate its extension to other restoration problems such as image deblurring, where the ground truth is not available. Our method outperforms some recent unsupervised and self-supervised deep denoising models that do not require clean images for their training. For deblurring problems, the method, using only one noisy and blurry observation per image, reaches a quality not far from that of its fully supervised counterparts on a benchmark dataset.

Index Terms:
Image denoising, deep learning, convolutional neural networks, unsupervised learning, partially linear denoiser

I Introduction

The acquisition of real-life images usually suffers from noise corruption due to various factors such as sensor sensitivity, variations in photon counts, and air turbulence, to name just a few. Image denoising [30, 25, 7, 38] aims to remove noise from corrupted images. It is one of the most fundamental and central problems tackled by the image processing community. A variety of image denoising approaches have been developed in the past decades, which can be broadly classified into model based denoisers (see e.g., [30, 4, 6, 3]) and data driven denoisers [29, 5, 41].

Recent data driven techniques outperform conventional model based methods and achieve state-of-the-art denoising quality [22, 41, 42, 19]. These methods take advantage of large sets of image data and use models, in particular deep convolutional neural networks (CNNs), to learn the image prior from the datasets rather than relying on predefined image features. Compared to many model based methods, which need to solve a difficult optimization problem for each test image, CNN based denoisers are computationally efficient once the CNN is trained. Nevertheless, CNN based denoising approaches usually require a massive amount of ground truth images in the training phase. Specifically, conventional training pipelines consist of a loss function or metric, which defines the distance between a clean ground truth image and its reconstructed version, and an optimization step in which the parameters of the model are adjusted so as to minimize the loss function. One of the most commonly used metrics is the mean squared error (MSE), the calculation of which depends explicitly on the ground truth image. While these learning processes can lead to high-quality denoisers, they are problematic for application domains where ground truth images are not accessible.

In the past years, several unsupervised deep learning techniques have been developed to overcome this difficulty. It has been shown that a deep neural network denoiser can be trained using only noisy data if multiple noisy versions of each unknown clean image are available [20], or if the noise is independent across different regions of the image [2, 15, 16]. Under the assumption of i.i.d. Gaussian noise, loss functions can also be adapted, based on Stein's unbiased risk estimate of the MSE [32], such that they are defined only on the noisy images while providing good approximations of the MSE.

Figure 1: Partially linear denoiser. The clean image $\bm{x}$ (top middle), the noise $\bm{n}$ (top right), and the noisy image $\bm{y}:=\bm{x}+\bm{n}$ (top left) are modeled as random variables. A denoiser $R$ is decomposed as $R(\bm{y})=g(\bm{x})+L\bm{n}+\bm{e}$ with a function $g(\bm{x})$ (possibly nonlinear), a linear mapping $L$ (which may depend on $\bm{x}$), and a residual term $\bm{e}$. If the random variable $\bm{e}$ has small variance, then $R$ is called a partially linear denoiser. Such denoisers can be learned from noisy images only.

In this work, we address the problem of learning efficient denoisers from a set of noisy images, without the need for precise modeling of the noise and without multiple noisy observations per image. We investigate a class of structured nonlinear denoisers and their applications to inverse imaging problems including denoising and deblurring. Specifically, we formulate a given denoiser $R(\cdot)$ as the sum of three terms (cf. Fig. 1): a (possibly nonlinear) function $g(\bm{x})$ of the ground truth signal, a linear mapping $L\bm{n}$ of the noise, and a residual term $\bm{e}$. If the linear factor $L\bm{n}$ is relatively small and the residual term $\bm{e}$, as a random variable, has small variance, then the denoised image mainly depends on the ground truth image and is not sensitive to changes in the noise. In this paper, we study denoisers with the property that $\bm{e}$ has small variance, and we call denoisers of this type partially linear denoisers (cf. Fig. 1). Such denoisers preserve nonlinear image structure (similar to many classic image denoisers), and additionally allow the nonlinearity to be learned from noisy data only.

We observe that some natural denoisers, including deep neural networks, exhibit a certain degree of partial linearity. By exploiting this property, we show that a partially linear denoiser can, in the learning setting, be trained on noisy images by only assuming that the noise has zero mean conditioned on the images and known variance. Specifically, we introduce an auxiliary random vector together with a partially linear constraint, which we show establishes a correspondence between the MSE and a loss function defined without clean images. In this way, one can approximate the best partially linear denoiser with a theoretically guaranteed approximation error bound. Moreover, we propose an image deblurring approach, based on partially linear denoisers, that learns from a single noisy observation of each image.

Different from other existing unsupervised methods for denoising, our approach requires neither a group (or a pair) of noisy observations for each image nor an estimate of the underlying noise distribution. Yet our approach achieves a performance close to its fully supervised counterparts on several denoising benchmark datasets. The proposed partially linear denoisers are learned in an end-to-end manner and can be built on top of any deep neural network architecture for denoising. Once they are trained, the denoised results are obtained without the auxiliary vectors or any post-processing. Furthermore, we demonstrate our denoiser's capacity for handling other image restoration tasks in the absence of ground truth or multiple noisy observations per image, by exploiting the direct approximation of the MSE and its end-to-end learning nature.

II Related work

Over the past decades, a wide variety of image denoising algorithms have been developed, ranging from analytic approaches such as filtering, variational methods, wavelet transforms, and Bayesian estimation, to data-driven approaches such as deep learning. In this section we review some major nonlinear denoisers as well as recent deep learning approaches that do not depend on ground truth images for learning.

II-A Nonlinear denoisers

Natural images are non-Gaussian signals with fine structures such as sharp edges and textures. One of the main challenges in the image denoising task is to preserve these structures while removing the noise. Traditional linear denoisers such as Gaussian filtering and Tikhonov regularization usually do not achieve satisfactory denoising quality, as they tend to smooth the edges. Most of the existing image denoisers in the literature are therefore nonlinear.

Total variation (TV) denoising [30] is one of the most fundamental image denoising algorithms. The TV denoising solution is a minimizer of the optimization problem

$\min_{x}\ \frac{\lambda}{2}\|x-y\|^{2}+\|x\|_{\rm TV}$

in which $y\in\mathbb{R}^{m}$ is the noisy image and $\|x\|_{\rm TV}:=\|\nabla x\|_{1}$ is the total variation of $x\in\mathbb{R}^{m}$. TV denoising is known to preserve sharp edges, thanks to the non-linearity introduced by the TV norm $\|\cdot\|_{\rm TV}$.
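For illustration, the problem above can be solved with off-the-shelf tools. Below is a minimal sketch using the Chambolle solver from scikit-image; note that its weight argument sets the regularization trade-off and plays roughly the role of $1/\lambda$ in the formulation above (an assumption about the parameter mapping, not a statement from the paper).

```python
# Minimal TV denoising sketch using scikit-image's Chambolle solver.
import numpy as np
from skimage.restoration import denoise_tv_chambolle

rng = np.random.default_rng(0)
x = np.zeros((64, 64))
x[:, 32:] = 1.0                               # piecewise-constant image
y = x + 0.1 * rng.standard_normal(x.shape)    # additive Gaussian noise
x_hat = denoise_tv_chambolle(y, weight=0.1)   # TV-regularized estimate
```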

Patch based methods have gained popularity for image denoising because of their capacity to capture the self-similarity of images. The non-local means (NL-means) algorithm proposed by Buades et al. [3] is among the most successful methods in this category. The NL-means algorithm removes noise by calculating a weighted average of the pixel intensities, with weights defined by a patch similarity measure that emphasizes the connection between pixels in similar patches. It is a nonlinear denoiser, unlike local mean filtering, as the weights are image-dependent. As another nonlinear patch based denoising method, the block-matching and 3D filtering (BM3D) algorithm [6] divides image patches into groups based on a similarity criterion and then performs collaborative filtering on each group to clean the patches. While patch based denoisers achieve promising denoising performance, they are often time-consuming due to the high computational complexity of calculating the weights or matching similar patches.

In recent years, convolutional neural network (CNN) based denoisers have become the state of the art for image denoising [41, 42], thanks to the rapid development of deep learning techniques. In particular, CNNs are known to be efficient in modeling image priors [35], which are crucial for the quality of the denoiser. A typical CNN denoiser can be formulated as a composition of mappings called layers, and a basic type of layer $y^{(k)}\rightarrow y^{(k+1)}$ has the form

$y^{(k+1)}=\sigma\left(W*y^{(k)}+b\right),$

where $*$ denotes the convolution operation, $(W,b)$ are parameters of the network that can be learned from the data, and $\sigma(\cdot)$ is an activation function which is often nonlinear. As such, CNN denoisers are in general nonlinear. There has been growing interest in developing CNN denoisers, with new network architectures and building blocks being proposed, such as batch normalization [41], residual connections [10], and residual dense blocks [42].
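As a concrete illustration of this layer structure, the following is a minimal PyTorch sketch of a small denoiser stacked from such convolution-plus-activation layers, with a DnCNN-style residual output; the depth, width, and other hyperparameters are illustrative placeholders, not the settings of [41].

```python
# A small CNN denoiser built from layers y^(k+1) = sigma(W * y^(k) + b).
import torch
import torch.nn as nn

class SmallCNNDenoiser(nn.Module):
    def __init__(self, channels=1, features=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(features, channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, y):
        return y - self.net(y)  # residual connection: the stack predicts the noise

y = torch.randn(1, 1, 40, 40)          # a random "noisy" input
x_hat = SmallCNNDenoiser()(y)          # denoised output, same shape as y
```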

II-B Unsupervised deep learning for denoising

While deep CNN based models have great advantages in image denoising, standard learning techniques are limited by the availability of sufficiently many noise-free images. Recently, CNN-based learning algorithms for image denoising that do not require clean images as training data have been developed. Soltanayev et al. [32] propose a loss function based on Stein's unbiased risk estimator (SURE), which is computed without knowing the ground truth images, as a replacement for the MSE loss function. In the setting of i.i.d. Gaussian noise, SURE provides an unbiased estimate of the MSE, and hence the minimization problems with respect to these two losses are equivalent. Consequently, CNN denoisers can be trained in an unsupervised manner, based on only one noisy realization of each training image. Nevertheless, SURE may not match the MSE in the case of non-Gaussian noise, e.g., shot noise. The SURE training scheme has been extended to the setting of multiple noise realizations per image as well as imperfect ground truths [43].

The Noise2Noise approach developed by Lehtinen et al. [20] offers a different ground-truth-free learning strategy for denoising, based on the assumption that at least two different noisy observations are available for each image. It is found that replacing the clean targets by their noisy observations in the MSE loss function leads to the same minimizers as the original supervised loss if the noise has zero mean and an infinite amount of training data is provided. Specifically, the Noise2Noise loss function is

$\sum_{(y,\hat{y})\in(\mathcal{Y},\hat{\mathcal{Y}})}\|f_{\theta}(\hat{y})-y\|^{2}$ (1)

where $y$ and $\hat{y}$ represent two independent noisy observations of the same image sample, and $f_{\theta}$ is the denoiser parameterized by $\theta$. In contrast to the MSE loss $\sum_{(y,x)\in(\mathcal{Y},\mathcal{X})}\|f_{\theta}(y)-x\|^{2}$, where $x$ denotes the ground truth image, the loss function (1) is defined on noisy pairs $(y,\hat{y})$ only. If the noise in $y$ and the noise in $\hat{y}$ are independent and have zero mean, it is shown that the minimization of the loss (1) with respect to $\theta$ is equivalent to minimizing the MSE [20]. This implies that the parameter $\theta$ can be computed from a training set of noisy pairs. Besides, the Noise2Noise approach relies neither on an explicit image prior nor on structural knowledge of the noise model.
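In code, the Noise2Noise loss (1) simply replaces the clean target of the MSE by the second noisy observation. A minimal PyTorch sketch (model is a placeholder for any denoising network $f_{\theta}$):

```python
import torch.nn.functional as F

def noise2noise_loss(model, y, y_hat):
    """y, y_hat: two independent noisy observations of the same images."""
    return F.mse_loss(model(y_hat), y)   # noisy target in place of clean x
```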

In certain denoising tasks, however, the acquisition of two or more noisy copies per image can be very expensive or impractical, in particular in medical imaging where patients move during the acquisition, or in videos with moving cars, etc. Several authors have investigated learning techniques that overcome this restriction [8, 36, 2, 15, 16]. The Frame2Frame approach [8] developed by Ehret et al. fine-tunes a denoiser for blind video denoising. The idea is to use optical flow to warp one video frame to match its neighboring frame, and then minimize the Noise2Noise loss (1) with $(y,\hat{y})$ being the pair of matched frames. The work [27] generalizes Noise2Noise to the setting of a single noisy realization of each image. Synthetic noise, drawn from the same distribution as the underlying noise, is added to the noisy image $y$, and the resulting noisier image is used to replace the second observation $\hat{y}$ in the training loss (1). At test time, the raw output of the network is post-processed, by computing a linear combination of the output and the input, to obtain the denoising results. In this approach, the synthetic noise has to be of the same type as the underlying noise, and the post-processing can magnify the errors in the output of the network. Batson et al. [2] show that a denoiser $f_{\theta}(\cdot)$ satisfying a so-called $\mathcal{J}$-invariance property for a partition $\mathcal{J}$ of the image pixels can be trained without access to a second observation of the image, provided that the noise is independent across the different regions in $\mathcal{J}$. A function $f_{\theta}(\cdot)$ is said to be $\mathcal{J}$-invariant if, for each subset of pixels $J\in\mathcal{J}$, the values of $f_{\theta}(y)$ on $J$ can be calculated without knowing the values of $y$ on $J$. By additionally assuming that the noise of $y$ on different elements of $\mathcal{J}$ is independent and has zero mean, one minimizes the loss

$\sum_{y\in\mathcal{Y}}\left\|f_{\theta}(y)-y\right\|^{2}=\sum_{y\in\mathcal{Y}}\sum_{J\in\mathcal{J}}\left\|[f_{\theta}(y)]_{J}-y_{J}\right\|^{2}=\sum_{J\in\mathcal{J}}\sum_{y\in\mathcal{Y}}\left\|[f_{\theta}(y)]_{J}-y_{J}\right\|^{2}$ (2)

where the subscript $J$ in $y_{J}$ denotes the restriction of the image $y$ to the pixel collection $J$. It should be noted that, for a $\mathcal{J}$-invariant function $f_{\theta}(\cdot)$ and any $J\in\mathcal{J}$, the loss $\sum_{y\in\mathcal{Y}}\|[f_{\theta}(y)]_{J}-y_{J}\|^{2}$ can be interpreted as a variant of (1), given that the noise contained in $y_{J}$ is independent of $[f_{\theta}(y)]_{J}$ and has zero mean. This enables the training of a model using a set of noisy images only. However, as the denoiser is a $\mathcal{J}$-invariant function, the images cannot be perfectly reconstructed as the noise level decreases to zero; i.e., the denoiser cannot learn the identity mapping, which is clearly not $\mathcal{J}$-invariant. The unused information from $y$ can be further leveraged if the noise model is known or can be estimated. Krull et al. [16] propose to combine $y$ with the network predictions, based on a probabilistic model for each pixel and a Bayesian formulation, to obtain the minimum MSE estimate. This integration of statistical inference effectively removes the noise remaining in the predictions. A downside of the method is that it requires an explicit expression of the posterior distribution of the noisy images beforehand, and it is not end-to-end, because the network outputs intermediate results that are refined in a post-processing step.
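A minimal sketch of a $\mathcal{J}$-invariant training loss in this spirit is given below. Masking the selected pixels to zero is an illustrative simplification (Noise2Self instead replaces them, e.g., by local averages), and model is a placeholder network:

```python
import torch
import torch.nn.functional as F

def masked_self_supervised_loss(model, y, mask):
    """mask: boolean tensor selecting a subset J of pixels."""
    y_masked = y.clone()
    y_masked[mask] = 0.0                     # prediction at J never sees y on J
    pred = model(y_masked)
    return F.mse_loss(pred[mask], y[mask])   # compare only on J
```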

III The proposed method

We consider image restoration problems of the form

$\bm{y}=A\bm{x}+\bm{n}$ (3)

in which $\bm{x}$ and $\bm{n}$ are random vectors of unknown distributions representing the ground truth images and the noise, respectively, $\bm{y}$ is the noisy image from which we want to restore $\bm{x}$, and $A$ is a linear forward operator determined by the data acquisition process. In this paper, random vectors are always denoted by boldface lower-case letters such as $\bm{x}$, $\bm{y}$, $\bm{z}$. For ease of presentation, in this section we first focus on denoising problems, i.e., $A:=I$ is the identity map. The more general case where $A$ is not the identity is discussed in Subsection III-E.

For Problem (3), the only assumption we make on the noise distribution is that it has zero mean conditioned on the image, i.e.,

• (A1) $\mathbb{E}(\bm{n}\mid\bm{x})=0$.

This assumption holds for various common types of noise, including Gaussian white noise, Poisson noise, as well as some mixed noise models. We underline, however, that the noise $\bm{n}$ is not assumed to be independent of the image or pixel-wise independent.

The central issue in image denoising is to find an operator, denoted by $R$, that takes the noisy image $\bm{y}$ to the clean image $\bm{x}$ or an approximation of it. In this work we are interested in a class of denoisers that can be decomposed as

$R(\bm{y})=g(\bm{x})+L\bm{n}+\bm{e},$ (4)

where $g$ is a (possibly nonlinear) function of the clean image $\bm{x}$, $L:=L_{\bm{x}}$ is a linear operator, and $\bm{e}:=\bm{e}_{\bm{x},\bm{n}}$ is a residual term with small variance. In the rest of the paper, we omit the subscripts of $L_{\bm{x}}$ and $\bm{e}_{\bm{x},\bm{n}}$ if there is no ambiguity.

Remark 1.

For a given denoiser, the decomposition (4) always exists, but in this work we only consider the setting in which $\bm{e}$ has small variance. Furthermore, the decomposition (4) is not unique, since $L$ can be an arbitrary linear operator. In order to characterize the structure of the denoiser, we therefore assume $\mathbb{E}(\bm{e}\mid\bm{x})=0$ and let $L$ be chosen such that $\mathbb{E}(\|\bm{e}\|^{2}\mid\bm{x})$ is minimized. As $\bm{n}$ has zero mean, the nonlinear term is then determined by $g(\bm{x})=\mathbb{E}(R(\bm{y})\mid\bm{x})$.

If $R$ satisfies (4) with $\bm{e}$ of small variance, then we call $R$ a partially linear denoiser. The first term $g(\bm{x})$, which can be nonlinear, does not depend on realizations of the noise $\bm{n}$. This formulation implies that the non-linearity of the denoisers in this class is mainly due to intrinsic image structures, and that the denoisers respond to the noise in a linear (encoded in $L\bm{n}$) or insensitive manner (encoded in the small variance of $\bm{e}$). In fact, for any denoiser that effectively removes noise from images, the output must depend only weakly on the realization of the noise; therefore, when written in the form (4), the noise dependent components on the right-hand side should be small in variance compared to the noise, which implies that $\mathbb{E}(\|\bm{e}\|^{2})$ is small.

The selection of a good denoiser $R$ requires knowledge of certain prior information about the target $\bm{x}$, especially when we want to recover fine details such as edges from the corrupted image. Many conventional analytical approaches aim to find the reconstruction $R(\bm{y})$ in a lower-dimensional space in which the images lie. This can be done, for example, by assuming some sparsity properties in the gradient field or the wavelet domain.

In a data-driven setting, given the clean image $\bm{x}$, one could instead minimize the mean squared error (MSE) defined as $\mathcal{J}_{0}(R):=\mathbb{E}(\|R(\bm{y})-\bm{x}\|^{2})$, where the expectation is taken over $\bm{x}$ and $\bm{n}$. The minimization of $\mathcal{J}_{0}$ leads to the conditional mean $R_{0}(\bm{y}):=\mathbb{E}(\bm{x}\mid\bm{y})$. Next, we discuss how to approximate $R_{0}$ in the absence of the clean image $\bm{x}$.

III-A Auxiliary random vectors

The motivation for this paper comes from the fact that, in practice, the distribution of $\bm{x}$ is often unknown, and samples of $\bm{x}$ (i.e., the ground truth images) or of the noise $\bm{n}$ are not readily available. What can easily be accessed are noisy observations $\bm{y}$. With these alone, however, a direct evaluation of the MSE is not possible. In this work, we circumvent the need for $\bm{x}$ by introducing an auxiliary random vector and replacing the MSE $\mathcal{J}_{0}$ by an approximation. First, let $\bm{z}$ be a random vector satisfying the following assumptions:

• (A2) the conditional mean $\mathbb{E}(\bm{z}\mid\bm{x})=0$;

• (A3) $\bm{z}$ is independent of $\bm{n}$;

• (A4) the conditional covariance ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})={\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$.

The auxiliary vector $\bm{z}$ does not necessarily need to follow the same distribution as $\bm{n}$. Samples of $\bm{z}$ can therefore be generated easily, e.g., from Gaussian distributions, once the variance of the noise is known. Then, associated with $\bm{z}$, we define

$\begin{cases}\bm{\hat{n}}:=\bm{n}+\alpha\bm{z},\\ \bm{\hat{y}}:=\bm{y}+\alpha\bm{z},\end{cases}$ (5)

where $\alpha$ is a real constant. The new random vector $\bm{\hat{y}}=\bm{x}+\bm{\hat{n}}$ can be regarded as a noisy version of the image $\bm{x}$ with noise vector $\bm{\hat{n}}$. In the following, the discussion focuses on denoising $\bm{\hat{y}}$ rather than $\bm{y}$, but the objective remains unchanged, namely recovering the same clean image $\bm{x}$. Specifically, if one can obtain a high-quality estimate of $\bm{x}$ from $\bm{\hat{y}}$, then there exists an algorithm taking $\bm{y}$ to the clean image, because $\bm{\hat{y}}$ can be computed from $\bm{y}$. Such an algorithm is given, for instance, by $R(V(\bm{y}))$, where $R$ is a denoiser for the $\bm{\hat{y}}$ problem and $V(\bm{y}):=\bm{\hat{y}}$. It should be noted that the denoising problem becomes harder because of the additional uncertainty, coming from the auxiliary random vector $\bm{z}$, encoded in the observed data $\bm{\hat{y}}$. However, one benefit of considering $\bm{\hat{y}}$ is that the quantity $\bm{z}$ is known and can be leveraged to construct approximations to the MSE without knowing $\bm{x}$ or $\bm{n}$, as we show in the following.
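A minimal sketch of the construction (5), assuming the conditional noise standard deviation sigma is known so that Gaussian samples of $\bm{z}$ satisfy (A2)-(A4):

```python
import torch

def make_auxiliary_pair(y, sigma, alpha=0.5):
    # z need not follow the noise distribution; only its first two
    # moments are constrained by (A2)-(A4), so Gaussian samples suffice.
    z = sigma * torch.randn_like(y)
    y_hat = y + alpha * z
    return y_hat, z
```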

For the $\bm{\hat{y}}$ denoising problem, the MSE associated with the denoiser $R$ is defined as

$\mathcal{J}_{\rm mse}(R):=\mathbb{E}\left(\|R(\bm{\hat{y}})-\bm{x}\|^{2}\right)$ (6)

where the expectation $\mathbb{E}$ is taken over the random variables $\bm{x}$, $\bm{n}$ and $\bm{z}$. This is connected to $\mathcal{J}_{0}$ via $\mathcal{J}_{\rm mse}(R)=\mathcal{J}_{0}(R\circ V)$. Since $V$ is known, it can be shown that $\min_{R}\mathcal{J}_{0}(R)\leq\min_{R}\mathcal{J}_{\rm mse}(R)$. Nevertheless, if $\alpha$ is close to zero, the noise distributions of $\bm{y}$ and $V(\bm{y})$ are close, so the minimum of $\mathcal{J}_{0}$ can be approximated by the minimum of $\mathcal{J}_{\rm mse}$, which promises similar denoising quality.

Now, using the auxiliary vector $\bm{z}$, we define the following objective function for our proposed partially linear denoiser:

$\mathcal{J}(R):=\mathbb{E}\left(\|R(\bm{\hat{y}})-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)$ (7)

for $\alpha\neq 0$. Indeed, as we will see in Subsections III-B and III-C, if we consider the partially linear denoising model, then $\mathcal{J}$ provides a good estimate of $\mathcal{J}_{\rm mse}$. More precisely, following the definition (4), we consider a set of denoisers for $\bm{\hat{y}}$ that have the form

$R(\bm{\hat{y}})=g(\bm{x})+L\bm{\hat{n}}+\bm{e}$ (8)

where $\bm{\hat{n}}$ is defined in (5), $L$ and $\bm{e}$ depend on $\bm{x}$ and $(\bm{x},\bm{\hat{n}})$, respectively, and the residual $\bm{e}$ has small variance.

In the rest of this section, we show that for $\alpha\neq 0$ and a partially linear denoiser $R$, the term $\mathcal{J}(R)$ in (7) approximates the MSE $\mathcal{J}_{\rm mse}(R)$ up to an additive constant. This implies that an optimal denoiser of this class can be computed even if the distributions of the ground truth images $\bm{x}$ and of the noise $\bm{n}$ are unknown. We then discuss unsupervised learning methods, based on our proposed objective $\mathcal{J}(R)$, for image restoration tasks such as denoising and deblurring.
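For concreteness, a minimal sketch of the objective (7) evaluated empirically on a batch of noisy images; model is a placeholder for $R$, and the unavailable clean target is replaced by $y-z/\alpha$:

```python
import torch
import torch.nn.functional as F

def partially_linear_objective(model, y, z, alpha):
    y_hat = y + alpha * z          # corrupted input as in (5)
    target = y - z / alpha         # surrogate target in (7)
    return F.mse_loss(model(y_hat), target)
```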

III-B Linear minimum mean square error estimator

To start with, we focus on the simplest case, in which the denoiser $R$ is linear. This is a subset of the set of partially linear denoisers (8), with $g=L$ and $\bm{e}=0$. The best estimator in this setting is the linear minimum mean square error (LMMSE) estimator, i.e., the minimizer of $\mathcal{J}_{\rm mse}(R)$ in (6) over all linear denoisers $R$. The following proposition establishes the equivalence between the MSE loss and $\mathcal{J}(R)$ in (7) over linear denoisers.

Proposition 1.

If $\bm{n}$ and $\bm{z}$ satisfy Assumptions (A1)-(A4) and $R$ is linear, then there is a constant $c$ not depending on $R$ such that

$\mathcal{J}(R)=\mathbb{E}\left(\|R\bm{\hat{y}}-\bm{x}\|^{2}\right)+c.$ (9)
Proof.

If $R$ is linear, then from the definitions of $\bm{y}$ and $\bm{\hat{y}}$,

$R\bm{\hat{y}}-(\bm{y}-\bm{z}/\alpha)=(R\bm{\hat{y}}-\bm{x})-(\bm{n}-\bm{z}/\alpha),$

and $R\bm{\hat{y}}-\bm{x}=(R\bm{x}-\bm{x})+(R\bm{n}+\alpha R\bm{z})$. Let $\langle\cdot,\cdot\rangle$ denote the inner product. Then, using the above, (7) can be rewritten as

$\begin{split}\mathcal{J}(R)=&\ \mathbb{E}\left(\|R\bm{\hat{y}}-\bm{x}\|^{2}\right)+\mathbb{E}\left(\|\bm{n}-\bm{z}/\alpha\|^{2}\right)\\&-2\mathbb{E}\left(\langle R\bm{x}-\bm{x},\bm{n}-\bm{z}/\alpha\rangle\right)\\&-2\mathbb{E}\left(\langle R\bm{n}+\alpha R\bm{z},\bm{n}-\bm{z}/\alpha\rangle\right).\end{split}$ (10)

Here the expectations are taken over $\bm{x}$, $\bm{n}$ and $\bm{z}$. We next show that the last two terms on the right-hand side of (10) vanish.

First, since $\bm{n}$ has zero mean conditioned on $\bm{x}$ by Assumption (A1), it holds that $\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{n}\rangle)=\mathbb{E}(\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{n}\rangle\mid\bm{x}))=0$. The same property applies to $\bm{z}$ by Assumption (A2), so $\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{z}\rangle)=0$. Taking a linear combination of these two equalities gives

$\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (11)

Second, given that $R$ is linear and both $\bm{n}$ and $R\bm{n}$ have zero mean conditioned on $\bm{x}$, we have

$\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})={\rm tr}\left({\rm Cov}(R\bm{n},\bm{n}\mid\bm{x})\right)={\rm tr}\left(R\,{\rm Cov}(\bm{n},\bm{n}\mid\bm{x})\right)$

in which ${\rm tr}(\cdot)$ denotes the trace of a matrix. The same equality holds for $\bm{z}$, i.e.,

$\mathbb{E}(\langle R\bm{z},\bm{z}\rangle\mid\bm{x})={\rm tr}\left(R\,{\rm Cov}(\bm{z},\bm{z}\mid\bm{x})\right).$

Therefore, it follows from Assumption (A4) that $\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})=\mathbb{E}(\langle R\bm{z},\bm{z}\rangle\mid\bm{x})$. This together with Assumption (A3) gives

$\mathbb{E}(\langle R\bm{n}+\alpha R\bm{z},\bm{n}-\bm{z}/\alpha\rangle\mid\bm{x})=\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})+\mathbb{E}(\langle\alpha R\bm{z},-\bm{z}/\alpha\rangle\mid\bm{x})=0.$ (12)

Taking the expectation with respect to $\bm{x}$, we obtain

$\mathbb{E}(\langle R\bm{n}+\alpha R\bm{z},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (13)

Finally, let $c:=\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})$; then Equations (10), (11) and (13) imply (9), which completes the proof. ∎

According to Proposition 1, for all linear denoisers, the term $\mathcal{J}(R)$ in (7) differs from the MSE by a constant $c$ not depending on $R$. Therefore, a minimization of $\mathcal{J}(R)$ over all linear $R$ also leads to the LMMSE estimator. We remark that the objective function (7) is not defined in terms of the ground truth image $\bm{x}$, and the distribution of the random vector $\bm{n}$ is not necessarily known or equal to that of $\bm{z}$. In fact, the denoising problem can be reformulated as follows. Given $R$, the quantity $R\bm{\hat{y}}$ is known, and we want to decouple the term $R\bm{\hat{y}}-\bm{x}$ from

$R\bm{\hat{y}}-\bm{y}=(R\bm{\hat{y}}-\bm{x})-\bm{n}.$ (14)

In the loss $\mathcal{J}(R)$, this is implemented by adding the scaled auxiliary vector $\bm{z}/\alpha$ to both sides of (14). The vector $\bm{z}/\alpha$ compensates the noise $\bm{n}$ in the sense that, when taking the expectation of $\|R\bm{\hat{y}}-\bm{y}+\bm{z}/\alpha\|^{2}$ over $\bm{n}$ and $\bm{z}$, the $\bm{n}$-related term $\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})$ is canceled by the $\bm{z}$-related term $\mathbb{E}(\langle R\bm{z},\bm{z}\rangle\mid\bm{x})$ (i.e., the last equality in (12)), which leads to (9).

III-C Optimal partially linear denoiser

Though linear denoisers have the nice property of linking (7) to the MSE, they are generally not the best denoisers for imaging data. Nonlinearity is unavoidable in order to achieve good denoising quality and preserve fine structures such as edges in the image. To this end, we consider a more general class of denoisers that are partially linear, as expressed in (8), where $g$ may be nonlinear and $\bm{e}$ is not necessarily zero.

In the special case where the denoiser outputs exactly the clean image, i.e., $R(\bm{\hat{y}}):=\bm{x}$, $R$ is also partially linear, with $L$ and $\bm{e}$ both zero. However, such denoisers do not exist in most settings. Nevertheless, for good denoisers one can still expect the residual term $\bm{e}$ to be small.

The equivalence of the two objective functions stated in Proposition 1 does not hold for nonzero $\bm{e}$. It is therefore of interest to know how well (7) approximates the MSE in this general case. The next proposition quantifies the approximation error in the presence of a small nonzero residual term $\bm{e}$.

Proposition 2.

If $\bm{n}$ and $\bm{z}$ satisfy Assumptions (A1)-(A4) and $R$ satisfies (8), then there is a constant $c$ not depending on $R$ such that

$\mathcal{J}(R)=\mathbb{E}\left(\|R(\bm{\hat{y}})-\bm{x}\|^{2}\right)-2\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)+c.$ (15)

Additionally, if the variance $\mathbb{E}(\|\bm{e}\|^{2})\leq\epsilon^{2}$ for some $\epsilon>0$, then the approximation error satisfies

$\begin{split}{\rm Err}:=&\left|\mathcal{J}(R)-\mathbb{E}(\|R(\bm{\hat{y}})-\bm{x}\|^{2})-c\right|\\ \leq&\ 2\epsilon\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}.\end{split}$ (16)

If, furthermore, the residual $\bm{e}(\bm{x},\bm{\hat{n}})$ is Lipschitz continuous with respect to $\bm{\hat{n}}$, i.e., $\|\bm{e}(\bm{x},\bm{\hat{n}}_{1})-\bm{e}(\bm{x},\bm{\hat{n}}_{2})\|\leq K\|\bm{\hat{n}}_{1}-\bm{\hat{n}}_{2}\|$ for some constant $K$, then

${\rm Err}\leq 2\epsilon\sqrt{\mathbb{E}(\|\bm{n}\|^{2})}+2K\mathbb{E}(\|\bm{z}\|^{2}).$ (17)
Proof.

By the decomposition of $R$ in (8),

$R(\bm{\hat{y}})-(\bm{y}-\bm{z}/\alpha)=\left(g(\bm{x})+L\bm{\hat{n}}+\bm{e}-\bm{x}\right)-\left(\bm{n}-\bm{z}/\alpha\right),$ (18)

where $\bm{\hat{n}}$ is defined in (5). For the function $g(\cdot)$, it is clear from Assumptions (A1) and (A2) that $\mathbb{E}(\langle g(\bm{x}),\bm{n}-\bm{z}/\alpha\rangle\mid\bm{x})=0$. Therefore,

$\mathbb{E}(\langle g(\bm{x})-\bm{x},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (19)

Moreover, as $L$ is linear, an argument similar to the proof of (13) shows that

$\mathbb{E}(\langle L\bm{\hat{n}},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (20)

By Equations (18), (19) and (20), letting $c:=\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})$, Equation (15) holds.

Next, we bound $\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)$. Assuming that $\mathbb{E}(\|\bm{e}\|^{2})\leq\epsilon^{2}$, a straightforward application of the Cauchy-Schwarz inequality leads to

$\begin{split}\left|\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)\right|&\leq\sqrt{\mathbb{E}(\|\bm{e}\|^{2})}\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}\\&\leq\epsilon\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})},\end{split}$ (21)

and therefore Equation (16) holds.

Furthermore, assume that $\bm{e}:=\bm{e}(\bm{x},\bm{n}+\alpha\bm{z})$ is Lipschitz continuous w.r.t. $\bm{\hat{n}}:=\bm{n}+\alpha\bm{z}$ with Lipschitz constant $K$. Since $\bm{z}$ has zero mean conditioned on $\bm{x}$ and $\bm{n}$, the conditional expectation

$\begin{split}\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle\mid\bm{x},\bm{n})=&\ \mathbb{E}(\langle\bm{e}_{0},\bm{z}/\alpha\rangle\mid\bm{x},\bm{n})+\mathbb{E}(\langle\bm{e}-\bm{e}_{0},\bm{z}/\alpha\rangle\mid\bm{x},\bm{n})\\ \leq&\ 0+\mathbb{E}(K\langle\bm{z},\bm{z}\rangle\mid\bm{x},\bm{n}),\end{split}$

where $\bm{e}_{0}:=\bm{e}(\bm{x},\bm{n}+\bm{0})$. So $\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle)\leq K\mathbb{E}(\|\bm{z}\|^{2})$, and by symmetry $\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle)\geq-K\mathbb{E}(\|\bm{z}\|^{2})$. Finally,

$\begin{split}\left|\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)\right|\leq&\left|\mathbb{E}(\langle\bm{e},\bm{n}\rangle)\right|+\left|\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle)\right|\\ \leq&\ \epsilon\sqrt{\mathbb{E}(\|\bm{n}\|^{2})}+K\mathbb{E}(\|\bm{z}\|^{2}),\end{split}$

and the inequality (17) follows. ∎

From the bounds given in Proposition 2, if the variance of $\bm{e}$ is small, then the loss (7) provides a good estimate of the MSE. Note that, under the Lipschitz continuity assumption on $\bm{e}$, the error bound in (16) can be further improved to (17), which is independent of $\alpha$ (in contrast to (16), where the bound contains a $1/\alpha$ factor). In practice, this Lipschitz continuity assumption is not restrictive, as we expect the denoiser $R$ to be stable with respect to small perturbations in $\bm{\hat{y}}$ and therefore to have a small Lipschitz constant $K$ for $\bm{e}$.

Remark 2 (Estimated noise variance).

If Assumption (A4) does not hold, then the bound (16) also depends on the linear operator $L$. Consider for instance ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})=(1+\beta){\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$ with $\beta>-1$. Then Equation (15) becomes $\mathcal{J}(R)=\mathbb{E}(\|R(\bm{\hat{y}})-\bm{x}\|^{2})-2\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)+c+2\beta\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$; the proof is given in Appendix A. The minimization of $\mathcal{J}(R)$ therefore favors denoisers with small $\beta\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$. Such a property can be used to construct criteria for whether the noise variance is underestimated ($\beta<0$) or overestimated ($\beta>0$), e.g., by checking the value of $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ for the minimizer of $\mathcal{J}$. Examples will be given in Section IV-B.

Next, we take a closer look at the errors of the proposed denoisers. We first introduce the set $\mathcal{R}_{\epsilon}$ of partially linear denoisers. Second, we estimate the distance between the minimizers of $\mathcal{J}$ and $\mathcal{J}_{\rm mse}$ over $\mathcal{R}_{\epsilon}$. Then we verify the convergence of the proposed denoisers to the best denoisers for corrupted images generated from a single ground truth image. Finally, focusing on a special subset of images consisting of constant patches, we demonstrate the partial linearity of the optimal denoisers with an example.

For $\epsilon>0$, let $\mathcal{R}_{\epsilon}$ be the set of denoisers satisfying (8) with the variance of $\bm{e}$ less than or equal to $\epsilon^{2}$, i.e.,

$\mathcal{R}_{\epsilon}:=\left\{R\mid\exists\,g,L,\bm{e}\ \text{such that}\ R(\bm{\hat{y}})=g(\bm{x})+L\bm{\hat{n}}+\bm{e},\ L\ \text{is linear, and}\ \mathbb{E}(\|\bm{e}\|^{2})\leq\epsilon^{2}\right\}.$
Lemma 3 (Convexity).

For any $\epsilon\geq 0$, $\mathcal{R}_{\epsilon}$ is a convex set.

Proof.

If $R_{1},R_{2}\in\mathcal{R}_{\epsilon}$, then there exist $g_{1},g_{2}$, linear operators $L_{1},L_{2}$, and residual terms $\bm{e}_{1},\bm{e}_{2}$ such that

$R_{i}(\bm{\hat{y}})=g_{i}(\bm{x})+L_{i}\bm{\hat{n}}+\bm{e}_{i},\quad\mathbb{E}(\|\bm{e}_{i}\|^{2})\leq\epsilon^{2},\quad i\in\{1,2\}.$

For any $\lambda_{1}\in[0,1]$ and $\lambda_{2}:=1-\lambda_{1}$,

$\lambda_{1}R_{1}(\bm{\hat{y}})+\lambda_{2}R_{2}(\bm{\hat{y}})=\left(\sum_{i=1}^{2}\lambda_{i}g_{i}\right)(\bm{x})+\left(\sum_{i=1}^{2}\lambda_{i}L_{i}\right)\bm{\hat{n}}+\sum_{i=1}^{2}\lambda_{i}\bm{e}_{i}.$

It is easy to see that $\sum_{i=1}^{2}\lambda_{i}L_{i}$ is also linear and, by the convexity of the squared norm, $\mathbb{E}(\|\sum_{i=1}^{2}\lambda_{i}\bm{e}_{i}\|^{2})\leq\epsilon^{2}$. Therefore $\lambda_{1}R_{1}+\lambda_{2}R_{2}\in\mathcal{R}_{\epsilon}$, which implies that $\mathcal{R}_{\epsilon}$ is convex. ∎

We are interested in the best denoisers in $\mathcal{R}_{\epsilon}$, i.e., those with MSE as small as possible. More precisely, for small $\delta\geq 0$, we define the MMSE denoiser, denoted by $R^{\epsilon,\delta}$, as a denoiser in $\mathcal{R}_{\epsilon}$ that satisfies

$\mathcal{J}_{\rm mse}(R^{\epsilon,\delta})\leq\inf_{R\in\mathcal{R}_{\epsilon}}\mathcal{J}_{\rm mse}(R)+\delta,$ (22)

where $\mathcal{J}_{\rm mse}$ is defined in (6). Note that $R^{\epsilon,\delta}$ exists for arbitrarily small positive $\delta$. Based on the approximation properties stated in Proposition 2, we propose a denoiser $\hat{R}^{\epsilon,\delta}\in\mathcal{R}_{\epsilon}$ satisfying

$\mathcal{J}(\hat{R}^{\epsilon,\delta})\leq\inf_{R\in\mathcal{R}_{\epsilon}}\mathcal{J}(R)+\delta$ (23)

as an alternative to $R^{\epsilon,\delta}$. In light of the theoretical bound (16), the distance between $R^{\epsilon,\delta}$ and $\hat{R}^{\epsilon,\delta}$ admits the upper bound described in the following proposition.

Proposition 3.

Let the denoisers $R^{\epsilon,\delta}$ and $\hat{R}^{\epsilon,\delta}$ be given by (22) and (23), respectively. If $\bm{n}$ and $\bm{z}$ satisfy Assumptions (A1)-(A4) and $0<\delta<1$, it holds that

$\mathbb{E}\left(\|R^{\epsilon,\delta}(\bm{\hat{y}})-\hat{R}^{\epsilon,\delta}(\bm{\hat{y}})\|^{2}\right)\leq\frac{2\epsilon}{1-\sqrt{\delta}}\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}+\frac{\sqrt{\delta}}{1-\sqrt{\delta}},$

where the expectation is taken over $\bm{x}$, $\bm{n}$ and $\bm{z}$. The proof is given in Appendix A. The bound in Proposition 3 converges to $2\epsilon\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}$ as $\delta$ tends to $0$, and it converges to $0$ as both $\epsilon$ and $\delta$ converge to zero.

Building on Proposition 3, we further analyze the convergence of the proposed solutions in the special case where all realizations of the corrupted images $\bm{\hat{y}}$ are generated from the same clean image. The next corollary shows that, in this setting, the approximation error is at most of the same order as $\epsilon^{2}$, regardless of the noise distribution or the noise variance. The proof of this corollary is given in Appendix A.

Corollary 3.

Assume that the conditions in Proposition 3 hold. If $\bm{x}$ follows a delta distribution, i.e., $P(\bm{x}=x)=1$ and $P(\bm{x}\neq x)=0$ for some image $x$, then

$\mathbb{E}\left(\|R^{\epsilon,\delta}(\bm{\hat{y}})-\hat{R}^{\epsilon,\delta}(\bm{\hat{y}})\|^{2}\right)\leq\frac{\epsilon^{2}+2\sqrt{\delta}}{1-\sqrt{\delta}}.$ (24)

Proposition 3 and Corollary 3 illustrate that the minimizers of $\mathcal{J}$ are good estimates of the best partially linear denoiser when $\epsilon$ is small. In principle, $\bm{e}$ should be much smaller than the noise itself, as this is a necessary condition for the denoiser to be robust to different realizations of $\bm{\hat{n}}$. We can provide some more insight into the size of the residual term for high-quality denoisers when applied to a special subset of images. Smooth regions are among the most important components of natural images, and they often account for a large fraction of the pixels. Here we study the basic case of constant images and pixelwise independent noise. We give an example to demonstrate the scale of the residual term of the minimum MSE denoiser for constant image patches.

Example (Constant patches).

Assume that $\bm{x}$ models constant patches of size $21\times 21$ pixels. The pixel values are uniformly distributed in $[0,\lambda]$, where $\lambda>0$ is a constant. The noise is pixelwise independent conditioned on $\bm{x}$. For pixel $i$, $\bm{\hat{y}}_{i}\sim{\rm Pois}(\bm{x}_{i})$, where ${\rm Pois}$ denotes the Poisson distribution. The optimal denoised images $R_{0}(\bm{\hat{y}})$, where $R_{0}$ minimizes the cost $\mathcal{J}_{\rm mse}(\cdot)$, are therefore constant images. In this example, given $R_{0}$ and $\bm{x}$, the quantities $g(\bm{x})$ and $L$ (defined in (8)) are computed using Remark 1. Consequently, $g(\bm{x})$, $L\bm{\hat{n}}$ and $\bm{e}$ are constant images. The values of $R_{0}(\bm{\hat{y}})$ are plotted against $g(\bm{x})+L\bm{\hat{n}}$ (for different ground truth values and different $\lambda$) in the last two rows of Fig. 2, where the results are computed with a deep CNN approximation of $R_{0}$. Each yellow dot in the plot represents a realization of the noise, and the blue line represents $\bm{e}=0$. The plots suggest that $\bm{e}$ has small variance and hence the set $\mathcal{R}_{\epsilon}$ contains good approximations of $R_{0}$ for small $\epsilon$.

Figure 2: Reconstructions for constant image patches corrupted by Poisson noise with parameter $\lambda=1$, $2$ and $4$ (columns (a)-(c), respectively). Top row: samples of corrupted patches (the ground truth is $\bm{x}=0.5\lambda$). Second and third rows: the optimal reconstructions $R_{0}(\bm{\hat{y}})$ plotted against $g(\bm{x})+L\bm{\hat{n}}$ (yellow dots, each generated with a realization of the noise $\bm{\hat{n}}$). The blue line represents $\bm{e}=0$, i.e., the line through the points $(g(\bm{x})+L\bm{\hat{n}},\,g(\bm{x})+L\bm{\hat{n}})$.
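As a numerical complement to this example, the optimal denoiser for a constant patch with a uniform prior can be evaluated directly: the posterior mean of the patch value depends only on the pixel sum (a sufficient statistic), so a one-dimensional quadrature suffices. The sketch below is an illustration of the example's setting, not the CNN approximation used to produce Fig. 2.

```python
import numpy as np

def mmse_constant_patch(y_patch, lam, grid=2000):
    # Posterior of the constant value u ~ U[0, lam] given y_i ~ Pois(u):
    # p(u | y) is proportional to u^S * exp(-m*u), with S the pixel sum
    # and m the number of pixels.
    m, S = y_patch.size, y_patch.sum()
    u = np.linspace(1e-6, lam, grid)
    log_post = S * np.log(u) - m * u
    w = np.exp(log_post - log_post.max())   # unnormalized posterior weights
    return (u * w).sum() / w.sum()          # posterior mean R_0(y)

rng = np.random.default_rng(0)
lam = 2.0
y = rng.poisson(0.5 * lam, size=(21, 21))   # ground truth x = 0.5 * lam
print(mmse_constant_patch(y, lam))          # close to 0.5 * lam
```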

III-D Learning a denoiser from noisy samples.

For learning the denoiser we consider parametrized denoising models $R_{\theta}$, e.g., deep CNNs, parameterized by $\theta$. Suppose that a set $\{y_{i}\}$ of realizations of $\bm{y}$ is given, e.g., the set of noisy images that one wants to denoise, where the noise satisfies Assumption (A1). Associated with each $y_{i}$, we define $z_{i}$ as a realization of the auxiliary random vector $\bm{z}$ satisfying (A2)-(A4). Since (A2) and (A4) specify only the first two moments of $\bm{z}$, rather than its distribution, one can without loss of generality generate $z_{i}$ randomly from normal distributions. With the auxiliary vector $z_{i}$, the samples of $\bm{\hat{y}}$, denoted by $\hat{y}_{i}$, are computed according to Equation (5). Based on the objective function (7), we minimize the empirical loss function

$\mathcal{L}(\theta):=\sum_{i}\left\|R_{\theta}(\hat{y}_{i})-(y_{i}-z_{i}/\alpha)\right\|^{2}$ (25)

under the condition that $R_{\theta}\in\mathcal{R}_{\epsilon}$. This condition can be implemented with the partial linearity constraint described below.

Partial linearity constraint. To preserve partial linearity during training, a restriction on the variance of the residual term $\bm{e}$ in (8) is required. Unfortunately, a direct evaluation of $\bm{e}$ is not feasible, because 1) the formulation of $\bm{e}$ depends on $g(\bm{x})$ and $L$, which are unknown, and 2) only one noisy observation of each image is given. However, since $g(\bm{x})$ and $L$ depend only on the image $\bm{x}$ and $L$ is linear, we can remove these two terms by perturbing $\bm{\hat{y}}$. One simple way of implementing this is to find $\bm{q}_{1}$ and $\bm{q}_{2}$, two perturbed versions of $\bm{\hat{y}}$, satisfying $\bm{\hat{y}}=\tau\bm{q}_{1}+(1-\tau)\bm{q}_{2}$ with $\tau\in[0,1]$. Let the residual terms associated with $R(\bm{q}_{1})$ and $R(\bm{q}_{2})$ be denoted by $\bm{e}_{1}$ and $\bm{e}_{2}$, respectively. Then, from Equation (8) we have

$R(\bm{\hat{y}})-\tau R(\bm{q}_{1})-(1-\tau)R(\bm{q}_{2})=\bm{e}-\tau\bm{e}_{1}-(1-\tau)\bm{e}_{2}.$ (26)

By the assumption that $\bm{e}$, $\bm{e}_{1}$ and $\bm{e}_{2}$ are small, based on the representation of the residual terms in (26), one can therefore penalize $\psi\left(R(\bm{\hat{y}})-\tau R(\bm{q}_{1})-(1-\tau)R(\bm{q}_{2})\right)$ for some metric $\psi$, as sketched below.
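Putting the pieces together, below is a minimal sketch of one training step combining the empirical loss (25) with a penalty built from (26). The concrete perturbations $\bm{q}_{1},\bm{q}_{2}$ (adding and subtracting a random image so that $\bm{\hat{y}}=\tau\bm{q}_{1}+(1-\tau)\bm{q}_{2}$ holds), the choice $\psi=\|\cdot\|^{2}$, and the penalty weight mu are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def training_step(model, y, sigma, alpha=0.5, tau=0.5, mu=1.0):
    z = sigma * torch.randn_like(y)            # auxiliary vector, (A2)-(A4)
    y_hat = y + alpha * z
    loss_fit = F.mse_loss(model(y_hat), y - z / alpha)   # empirical loss (25)

    # Perturbations satisfying y_hat = tau*q1 + (1-tau)*q2 for the same image.
    d = torch.randn_like(y)
    q1 = y_hat + (1 - tau) * d
    q2 = y_hat - tau * d
    residual = model(y_hat) - tau * model(q1) - (1 - tau) * model(q2)
    loss_pl = residual.pow(2).mean()           # psi(.) = squared L2 norm

    return loss_fit + mu * loss_pl
```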

III-E Learning image deblurring without ground truth images

The proposed partially linear denoisers can be generalized to learning image reconstructors for other linear inverse problems, such as image deblurring, in particular when ground truth images are missing. We repeat the observation model (3) here:

$\bm{y}=A\bm{x}+\bm{n}$ (27)

where $A$ is a linear forward operator (such as the Radon transform for CT reconstruction, or a convolution operator for image deblurring), and $\bm{n}$ models the zero-mean noise. For many inverse imaging problems, the naive reconstruction $A^{\dagger}\bm{y}$ is based on a direct inversion of the linear operator $A$ given by the pseudo-inverse $A^{\dagger}$. Due to the ill-conditioning of $A$, such an inversion is typically unstable and therefore inaccurate in the presence of noise. If $A$, however, has full column rank, then the resulting artifacts in the reconstruction can be remedied by denoising $A^{\dagger}\bm{y}$. In fact, it is easy to see that $\bm{n}^{\dagger}:=A^{\dagger}\bm{y}-\bm{x}=A^{\dagger}\bm{n}$ has zero mean, and the covariance of $\bm{n}^{\dagger}$ can be estimated as long as the covariance of $\bm{n}$ is known. Therefore, a straightforward application of the partially linear denoisers to the noisy reconstruction $A^{\dagger}\bm{y}$ leads to an estimate of $\bm{x}$. Various approaches have been proposed for denoising $A^{\dagger}\bm{y}$ in a data-driven post-processing setup (see e.g., [13, 11]); these learned post-processing techniques require noisy-clean image pairs as training data. Another class of methods, in contrast, leverages denoisers to construct regularization for (27). For instance, the Regularization by Denoising (RED) methods [28, 12] minimize a variational objective function with a regularization term derived from the denoiser. The Plug-and-Play Prior framework [37, 33] is based on variable splitting algorithms for the maximum a posteriori (MAP) optimization problem, with the proximal mapping associated with the regularization term replaced by the denoiser. These denoiser-based regularization approaches require predefined denoisers. Our framework, however, requires neither noisy-clean pairs nor a predefined denoiser. In particular, our proposed partially linear denoiser allows training an estimator from noisy data alone.

Image deblurring is a special case of (27), in which $A$ represents a convolution kernel. In practice, $A$ is governed by different imaging factors, including motion and camera focus. In the context of the proposed partially linear denoiser, one way to solve the deblurring problem is to train a single model $R_{\theta}$ that maps the measurement $\bm{y}$ directly to $\bm{x}$; knowledge of $A$ is then indirectly encoded in $R_{\theta}$. Assuming that the training data contain noisy samples of $\bm{\hat{y}}$ and the blurring kernel $A$, but no ground truth images, then similar to (7) one can minimize the loss function

$\mathbb{E}\left(\|AR_{\theta}(\bm{\hat{y}})-(\bm{y}-\bm{z}/\alpha)\|^{2}\right),$ (28)

and in this case $AR_{\theta}$ acts as a partially linear denoiser. At test time, the deblurred image can be computed directly as $R_{\theta}(\bm{\hat{y}})$, without knowledge of the operator $A$. These considerations also draw connections between the partially linear denoiser and the deep image prior approach [35]. There, an implicit regularizer is introduced, based on a convolutional neural network $R_{\theta}$, and a pure data-fitting loss function is minimized during training, with early stopping applied to prevent the estimator from over-fitting to the noise. With the partial linearity structure, however, we establish a connection between the cost (28) and the MSE of the noise-free measurement $A\bm{x}$, which aims to bring $AR_{\theta}(\bm{\hat{y}})$ close to $A\bm{x}$ rather than to its noisy versions.
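A minimal sketch of the loss (28): the network output is re-blurred by the known kernel before the comparison, so that $AR_{\theta}$ plays the role of the partially linear denoiser. The kernel handling (a per-channel 2-D convolution with an odd-sized kernel tensor of shape (1, 1, k, k)) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def deblur_loss(model, y, kernel, sigma, alpha=0.5):
    """kernel: blur kernel of shape (1, 1, k, k) with odd k (assumption)."""
    z = sigma * torch.randn_like(y)
    y_hat = y + alpha * z
    x_est = model(y_hat)                        # deblurred estimate R_theta(y_hat)
    c = y.shape[1]
    k = kernel.expand(c, 1, -1, -1)             # one copy of the kernel per channel
    Ax = F.conv2d(x_est, k, padding=kernel.shape[-1] // 2, groups=c)
    return F.mse_loss(Ax, y - z / alpha)        # loss (28)
```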

IV Experiments

In this section, we report experimental results that demonstrate the efficiency of the proposed approach for different denoising tasks and for deblurring. (The code will be made available at https://github.com/RK621/Unsupervised-Restoration-PLD.) First, we compare the partial linearity of classical denoisers, including total variation (TV) denoising [30] and BM3D [6], as well as CNN based denoisers (see Subsection IV-A for details). Second, we evaluate our method on denoising problems with synthetic noise (Subsections IV-B1 and IV-B3) and analyse its robustness towards errors in the estimate of the noise variance (Subsection IV-B2). Third, a numerical study of the stability of our approach with respect to varying noise levels is given, and the importance of the partially linear constraint is investigated (Subsection IV-C). Fourth, the proposed approach is used to denoise real microscopy images (Subsection IV-D). Finally, we apply our approach to learn an image deblurring model, using a set of single noisy and blurry observations of the images as training data. The details of the learning methods and the results of the deblurring experiment are presented in Subsection IV-E.

IV-A Partial linearity of denoisers

In this test, we investigate the partial linearity of some existing standard denoising approaches, namely the TV approach [30], BM3D [6] and DnCNN [41]. For the convenience of the reader, the decomposition (8) defining the partially linear denoiser is repeated here:

$R(\bm{\hat{y}})=g(\bm{x})+L\bm{\hat{n}}+\bm{e},$ (29)

where $R$ is the underlying denoiser. In particular, the DnCNN is trained by minimizing the standard empirical MSE loss (6) on a training set of 400 images with ground truth [41]. For a given image $\bm{x}$, we compute the decomposition (29) using Remark 1. Specifically, $g(\bm{x})=\mathbb{E}(R(\bm{\hat{y}})\mid\bm{x})$ and $L=\arg\min_{L_{0}\in\mathcal{L}}\mathbb{E}(\|R(\bm{\hat{y}})-g(\bm{x})-L_{0}\bm{\hat{n}}\|^{2})$, where the expectation is taken over $\bm{\hat{n}}$ and $\mathcal{L}$ denotes the set of linear operators. In this example, we use a standard test image called parrot (cf. top middle of Fig. 1) as the ground truth image, i.e., a realization of $\bm{x}$. The noise $\bm{\hat{n}}$ is i.i.d. Gaussian with zero mean and standard deviation 25 (corresponding to 256 gray levels). The expectations are evaluated using 20000 random realizations of the pair $(\bm{\hat{y}},R(\bm{\hat{y}}))$.
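The decomposition can be estimated by such Monte Carlo sampling. A minimal sketch for a single pixel $i$ is given below; denoise stands for any fixed denoiser (e.g., a wrapper around TV, BM3D, or DnCNN), and the sample count is an illustrative stand-in for the 20000 realizations used above.

```python
import numpy as np

def decompose_at_pixel(denoise, x, sigma, i, n_samples=2000, seed=0):
    """Estimate g(x)_i, the i-th row of L, and residual samples e_i."""
    rng = np.random.default_rng(seed)
    noises, outputs = [], []
    for _ in range(n_samples):
        n_hat = sigma * rng.standard_normal(x.shape)
        noises.append(n_hat.ravel())
        outputs.append(denoise(x + n_hat).ravel()[i])
    N, r = np.stack(noises), np.asarray(outputs)
    g_i = r.mean()                                      # g(x)_i = E(R(y_hat)_i | x)
    l_i, *_ = np.linalg.lstsq(N, r - g_i, rcond=None)   # least-squares row of L
    e_i = r - g_i - N @ l_i                             # residual samples
    return g_i, l_i, e_i
```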

TABLE I: The PSNR of $R$ (first row), the variance of $\bm{e}$ averaged over all pixels (second row), and the PSNR of the modified denoiser $\hat{R}:=g(\bm{x})+L\hat{\bm{n}}$ (third row) on the image parrot.
Denoiser TV BM3D DnCNN
PSNR (dB) 27.62 28.87 29.47
$\epsilon^{2}/m$ $8.751\times 10^{-5}$ $4.875\times 10^{-5}$ $4.281\times 10^{-5}$
Modified PSNR (dB) 27.82 29.00 29.60

For the three denoisers, we report the variance of $\bm{e}$ averaged over the pixels of the parrot image (i.e., $\epsilon^{2}/m$, where $m$ denotes the number of pixels in $\bm{e}$) in Table I. A comparison of the accuracy of the methods, measured in PSNR, is also given in the table. All figures reported in the table are averaged over 8000 independent runs with different realizations of $\hat{\bm{y}}$. Based on the table, the DnCNN achieves the best denoising quality, outperforming BM3D by around 0.5 dB and the TV approach by around 1.8 dB. All three methods have $\epsilon^{2}/m$ less than $10^{-4}$, and the value for the CNN denoiser is about half of that of the TV method.

Figure 3: The partial linearity of three denoisers: (a) TV, (b) BM3D, (c) DnCNN. Top row: denoised images produced by each method. Bottom row: the values of $[R(\hat{\bm{y}})-g(\bm{x})]_{i}$ plotted against $[L\hat{\bm{n}}]_{i}$, where $i$ is the pixel indicated by the red dot in the top row. Each plot contains 8000 points associated with different realizations of $\hat{\bm{n}}$.

Given a denoiser $R$, in order to study the partial linearity we consider the pair

\left([L\hat{\bm{n}}]_{i},\,[R(\hat{\bm{y}})-g(\bm{x})]_{i}\right) \qquad (30)

where $i$ denotes a fixed pixel (marked by the red dot in the first row of Fig. 3). To visualize the linearity of the three denoisers, the pair (30) is plotted in the second row of Fig. 3 for 8000 realizations of the noise $\hat{\bm{n}}$. Each point in the plot corresponds to one realization. As shown in the figure, all the denoisers exhibit a degree of partial linearity at the pixel $i$. For the TV denoiser, the residual $\bm{e}$ has a relatively larger variance (also illustrated by Table I), and there are some outliers for large $|[L\hat{\bm{n}}]_{i}|$.

IV-B Denoising experiments

We consider two types of synthetic noise (Gaussian and Poisson), and the performance of our method is compared with recent denoising methods that are not trained on ground truth images, namely SURE [32], Noise2Self [2], and Noisier2Noise [27], along with the classic BM3D approach [6]. Throughout the denoising experiments, we use the same network architecture, DnCNN [41], for the fully supervised baseline (which we call DnCNN in the following), SURE, Noise2Self, Noisier2Noise, and our approach. The experimental setups for the five methods are identical, except that the clean images are used to train DnCNN whereas they are unseen by the latter four methods in the learning phase. All models are trained on a benchmark denoising dataset [41] consisting of 400 training images of size 180×180. In the training phase, we feed the CNNs with image patches of size 40×40 and use a batch size of 128. Augmentations such as random flipping and random cropping are applied to the patches. In the inference phase, the inputs to the networks are the whole noisy images. In particular, our approach does not include the auxiliary vector $\bm{z}$ at this stage, and the denoised images are the outputs of the network without any post-processing. We evaluate the denoising quality on two test image sets, namely BSD68 (containing 68 images) [26] and the 12 widely used images (Set12) [6].

Training. To train our denoising model, we minimize the loss function (25) using the Adam optimizer [14]. The minimization process consists of two stages. In the first stage, we fix α = 1 and minimize the loss function (25) without constraints. In the second stage, we randomly choose α ∈ [0.1, 0.5] for each noisy sample, and additionally, in order to control the variance of $\bm{e}$ defined in (8), we impose the partially linear constraint (26); the implementation details are given in the next paragraph. Both stages contain $2\times 10^{5}$ optimization steps with an initial learning rate of 0.001, which drops to 0.0001 and 0.00005 at the $6\times 10^{4}$th and $1.2\times 10^{5}$th steps, respectively. Throughout the experiments, the samples $z_{i}$ of the auxiliary random vector $\bm{z}$ in (25) are generated from Gaussian distributions.
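For concreteness, the two-stage procedure can be sketched in PyTorch-style pseudocode as follows. This is a minimal sketch assuming that the empirical loss (25) takes the form $\sum_i\|R_\theta(y_i+\alpha z_i)-(y_i-z_i/\alpha)\|^2$ (our reading of (28) with $A$ being the identity); `model` and `make_z` are placeholders for the network and the auxiliary-vector generator.

import torch

def loss_25(model, y, z, alpha):
    # assumed form of the loss (25): ||R(y + alpha*z) - (y - z/alpha)||^2
    y_hat = y + alpha * z
    return ((model(y_hat) - (y - z / alpha)) ** 2).mean()

def training_step(model, optimizer, y, make_z, stage):
    z = make_z(y)  # zero-mean samples with (estimated) noise variance
    if stage == 1:
        alpha = 1.0
    else:  # stage 2: alpha drawn uniformly from [0.1, 0.5] per sample
        alpha = 0.1 + 0.4 * torch.rand(y.shape[0], 1, 1, 1, device=y.device)
    optimizer.zero_grad()
    loss = loss_25(model, y, z, alpha)  # plus the constraint term in stage 2
    loss.backward()
    optimizer.step()
    return loss.item()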

Partially linear constraint. We implement the partially linear constraint by perturbing the noisy images $\hat{y}_{i}$ (i.e., the samples of the noisy image $\hat{\bm{y}}$), following formula (26). Specifically, for each $\hat{y}_{i}$, let $\beta_{i}^{(1)}$ and $\beta_{i}^{(2)}$ be random numbers uniformly distributed in [1, 1.5], and let $q_{i}$ be a perturbation vector randomly generated from the same distribution as $z_{i}$. With $q_{i}$ we construct two perturbed versions of $\hat{y}_{i}$, namely $q^{(1)}_{i}:=\hat{y}_{i}-\beta_{i}^{(1)}q_{i}$ and $q^{(2)}_{i}:=\hat{y}_{i}+\beta_{i}^{(2)}q_{i}$. Then we have the linear relationship

\hat{y}_{i}=\tau_{1}q^{(1)}_{i}+\tau_{2}q^{(2)}_{i}

where $\tau_{1}:=\beta_{i}^{(2)}/(\beta_{i}^{(1)}+\beta_{i}^{(2)})$ and $\tau_{2}:=1-\tau_{1}$. In practice, we let $q_{i}$ be independent of $z_{i}$. Since the noise levels of $q^{(1)}_{i}$ and $q^{(2)}_{i}$ depend on the pixel values of $q_{i}$, we make $q_{i}$ sparse to avoid raising the noise level too much. To do this, we randomly select 1/25 of the pixels, with pairwise distances of at least 4 pixels; the remaining pixels of $q_{i}$ are set to zero, i.e., no perturbation is applied there. To avoid outliers at individual pixels caused by the perturbations, we also clip $q_{i}$ such that the pixel values of $q_{i}^{(1)}$ and $q_{i}^{(2)}$ fit in the range $[1.2a-0.2b,\,1.2b-0.2a]$, where $a:=\min_{j}[\hat{y}_{i}]_{j}$ and $b:=\max_{j}[\hat{y}_{i}]_{j}$. Having $q_{i}^{(1)}$ and $q_{i}^{(2)}$, based on Property (26) we add the following penalty term to the loss function

\mathcal{L}_{c}(\theta):=\sum_{i}\left\|M\left(R_{\theta}(\hat{y}_{i})-\sum_{k=1}^{2}\tau_{k}R_{\theta}(q_{i}^{(k)})\right)\right\|^{2}, \qquad (31)

where $\theta$ denotes the network parameters and $M$ is a diagonal matrix, whose diagonal entries are set as

M_{jj}:=1\,/\left[\,\left|q^{(1)}_{i}-q^{(2)}_{i}\right|+0.1\sigma\,\right]_{j},

if $[q_{i}]_{j}\neq 0$, and $M_{jj}:=0$ otherwise, where σ > 0 is the square root of the largest pixel-wise variance of the noise $\bm{n}$. Having $M$ in the loss (31) means that we penalize the nonlinearity at the perturbed pixels only, and the term 0.1σ prevents division by very small numbers. In summary, the loss function is $\mathcal{L}+\gamma\mathcal{L}_{c}$, where $\mathcal{L}$ is defined in (25) and γ is a hyperparameter that can be tuned based on the quality of the denoised images.
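A minimal PyTorch-style sketch of this penalty is given below, omitting the sparsity pattern and the clipping of $q_{i}$ described above; `model` is the denoising network and `q` a pre-generated sparse perturbation (zero at unperturbed pixels).

import torch

def plc_penalty(model, y_hat, q, sigma):
    # two perturbed copies of the noisy input, with beta in [1, 1.5]
    b1 = 1.0 + 0.5 * torch.rand(y_hat.shape[0], 1, 1, 1, device=y_hat.device)
    b2 = 1.0 + 0.5 * torch.rand_like(b1)
    q1, q2 = y_hat - b1 * q, y_hat + b2 * q
    tau1 = b2 / (b1 + b2)
    # if R were partially linear, R(y_hat) would match this combination
    combo = tau1 * model(q1) + (1.0 - tau1) * model(q2)
    mask = (q != 0).float()                     # penalize perturbed pixels only
    m = mask / ((q1 - q2).abs() + 0.1 * sigma)  # diagonal weights M_jj
    return ((m * (model(y_hat) - combo)) ** 2).sum()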

IV-B1 Gaussian noise

In this experiment, we test the denoisers on images corrupted by Gaussian white noise. The training sets for our unsupervised approach consist of the noisy images only. We consider two noise levels, with standard deviation σ = 25 and σ = 50 (with respect to 256 gray levels). For the two noise levels, two denoisers are trained using the proposed method, with the ground truth images unseen during training. In both cases, the parameter γ for the partially linear constraint term (31) is set to 4.

The test results on BSD68 [26] are reported in Table II, where our method is called DPLD (deep partially linear denoiser). The denoising quality is measured by the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index. We compare our denoiser with BM3D [6], the self-supervised method Noise2Self [2], SURE [32], Noisier2Noise [27], and the fully supervised DnCNN [41]. It is worth mentioning that the last denoiser, DnCNN, in contrast to the other five, requires noisy-clean image pairs for training.

TABLE II: Denoising quality, measured by PSNR (dB) and SSIM, for BSD68 [26] corrupted by Gaussian Noise.
Noise Level          σ = 25           σ = 50
Measure              PSNR    SSIM     PSNR    SSIM
BM3D [6]             28.58   0.8861   25.66   0.8041
Noise2Self [2]       27.48   0.8588   25.15   0.7818
SURE [32]            28.99   0.8961   25.88   0.8118
Noisier2Noise [27]   28.96   0.8951   25.96   0.8125
DPLD                 29.08   0.8961   26.13   0.8196
DnCNN [41]           29.22   0.9017   26.24   0.8265

As shown in Table II, the fully supervised denoiser DnCNN achieves the best accuracy. This is not surprising, as it learns from ground truth images that are not provided to the other methods. Our method is the best among the denoisers trained without ground truth: it outperforms Noise2Self, SURE, and Noisier2Noise by 1.6 dB, 0.09 dB, and 0.12 dB respectively for noise level σ = 25, and by 0.98 dB, 0.25 dB, and 0.17 dB respectively for σ = 50. Compared to DnCNN, the PSNR values of DPLD are lower by 0.14 dB and 0.11 dB for σ = 25 and σ = 50, respectively.

TABLE III: Denoising quality (in dB) for the 12 widely used images [6] and Gaussian noise

σ = 25
Denoiser       Average  C. man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena   Barbara  Boat   Man    Couple
BM3D           29.97    29.39   32.98  30.18    28.61     29.30    28.43     28.83   32.05  30.61    29.86  29.64  29.72
Noise2Self     28.81    27.95   32.22  29.51    28.12     28.64    26.70     27.58   31.52  26.41    29.06  29.04  28.94
SURE           30.12    29.75   32.65  30.51    29.15     30.08    28.97     29.32   32.08  29.28    29.93  29.94  29.82
Noisier2Noise  30.12    29.69   32.74  30.52    29.13     30.10    28.95     29.32   32.10  29.22    29.92  29.94  29.79
DPLD           30.28    29.84   33.04  30.69    29.26     30.21    28.97     29.30   32.33  29.66    30.10  30.02  29.97
DnCNN          30.44    30.08   33.13  30.80    29.44     30.39    29.12     29.48   32.43  29.96    30.21  30.12  30.12

σ = 50
Denoiser       Average  C. man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena   Barbara  Boat   Man    Couple
BM3D           26.71    26.36   29.75  26.69    24.99     25.90    25.22     25.74   28.84  26.98    26.76  26.84  26.49
Noise2Self     26.14    25.60   29.20  26.36    24.67     25.68    24.53     25.29   28.55  24.56    26.45  26.64  26.10
SURE           26.62    26.45   29.25  26.74    25.20     26.23    25.50     26.10   28.70  25.01    26.80  26.99  26.43
Noisier2Noise  26.79    26.61   29.65  26.98    25.25     26.49    25.62     26.21   28.94  25.13    26.92  27.08  26.56
DPLD           27.05    26.89   30.01  27.22    25.45     26.72    25.76     26.36   29.24  25.70    27.13  27.23  26.83
DnCNN          27.19    27.03   30.10  27.36    25.55     26.87    25.89     26.45   29.29  26.26    27.22  27.27  26.94

The comparison of the denoisers on the 12 widely used images [6] is given in Table III. For noise level σ = 25, DPLD reaches the best PSNR among the five denoisers that do not consume ground truth data on all images except Parrot and Barbara. It is interesting to note that BM3D performs better than the fully supervised DnCNN on the image Barbara. On average, for σ = 50, DPLD outperforms Noise2Self by 0.91 dB and SURE by 0.43 dB, and it falls behind DnCNN by 0.14 dB.

Figure 4: Denoised results for the image Boat (Gaussian noise, σ = 25); columns from left to right: Noisy, Noise2Self, SURE, Noisier2Noise, DPLD, DnCNN. The last two rows are enlarged views of the indicated regions.

Fig. 4 displays the denoising results for the image Boat. It can be seen that, though Noise2Self, SURE, Noisier2Noise, and DPLD neither see the clean images nor have any explicit smoothness constraints during training, they yield denoised images with smooth regions (cf. columns 2 to 5 of Fig. 4, respectively). The result of DPLD is visually close to that of DnCNN. The output of Noise2Self has relatively blurry edges compared to DPLD and DnCNN, e.g., at the letters displayed in the last row of Fig. 4.

Figure 5: The influence of an inaccurate noise variance during training; panels: (a) σ = 25, (b) σ = 50. Top row: PSNR on the test set BSD68 [26]; the dotted horizontal lines indicate the results of Noise2Self [2]. Bottom row: the mean of $\langle\bm{z},L\bm{z}\rangle$ of the learned denoisers.

IV-B2 Training with estimated noise variance

We report additional results for scenarios where only an inaccurate estimated noise variance (ENV) is available (for generating $\bm{z}$) during training. The results are obtained under the same settings as Subsection IV-B1, except that an ENV is used.

1). On the top row of Fig. 5, the PSNR values of the test results on BSD68 [26], for both noise levels σ = 25 and σ = 50, are plotted against the relative error β in the ENV $(1+\beta)\sigma^{2}$, where β = −16%, −8%, ⋯, 16%. For both noise levels, errors in the ENV lead to a decrease in PSNR. The reduction is less significant when the noise variance is underestimated than when it is overestimated. The drop in PSNR is small when the error in the ENV is small (< 0.1 dB for an ENV 4% below its true value), and our method remains significantly more accurate than Noise2Self [2] for small errors in the ENV.

2). We observe a correlation between the errors in the ENV and the term $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ (see Fig. 5, bottom row). Here $L$ is computed based on Remark 1, with a constant ground truth image $\bm{x}$ (pixel values 0.5, size 40×40) and $\bm{n}$ generated from the same distribution as $\bm{z}$ (variance $(1+\beta)\sigma^{2}$). In the plots, $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ is a small positive number when β = 0, and it increases (resp. decreases) as the ENV decreases (resp. increases). This is due to the extra term $2\beta\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ in the objective function $\mathcal{J}(\cdot)$, as shown in Remark 2.

Overall, the performance is stable under small errors in the ENV, and importantly, by analyzing the structure of the learned denoiser one can investigate the error in the noise variance. The quantity $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ is helpful for choosing a better denoising model when the noise variance is unknown.

IV-B3 Poisson noise

We evaluate our method on three levels of Poisson noise, with parameter λ = 60, λ = 30, and λ = 15, respectively. The training settings for the denoisers are the same as those for Gaussian noise, except that the variances of the auxiliary vector samples $z_{i}$ are computed differently. As the noise $\bm{n}$ is image dependent and not identically distributed across image pixels, its variance is not known without the ground truth image. Note, however, that our method does not require a precise model of the noise distribution, but only an estimate of the noise variance. At pixel $k$, the noisy value satisfies $\lambda[\bm{y}]_{k}\sim{\rm Pois}(\lambda[\bm{x}]_{k})$, where λ is a known constant. Hence the noise variance ${\rm var}([\bm{n}]_{k}\mid[\bm{x}]_{k})={\rm var}([\bm{y}]_{k}\mid[\bm{x}]_{k})=\frac{1}{\lambda}[\bm{x}]_{k}=\frac{1}{\lambda}\mathbb{E}([\bm{y}]_{k}\mid[\bm{x}]_{k})$. This implies that the sample value of $\frac{1}{\lambda}[\bm{y}]_{k}$ provides an unbiased estimate of the noise variance. The entries of the auxiliary vector are then generated as $[\bm{z}]_{k}=\bm{r}_{k}\sqrt{[\bm{y}]_{k}/\lambda}$, where $\bm{r}_{k}$ is a standard Gaussian random variable, hence ${\rm var}([\bm{z}]_{k}\mid[\bm{x}]_{k})=\frac{1}{\lambda}\mathbb{E}([\bm{y}]_{k}\mid[\bm{x}]_{k})={\rm var}([\bm{n}]_{k}\mid[\bm{x}]_{k})$. The partially linear constraint parameter γ for (31) is tuned manually for the three noise levels. In this subsection, unless specified otherwise, the results are obtained by setting γ = 4, 16, 64 for λ = 60, 30, 15, respectively.
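As a short illustration, the auxiliary vector for the Poisson case can be generated as in the NumPy sketch below, a direct transcription of the formula above (the clamping of negative pixel values is our own small safeguard):

import numpy as np

def make_z_poisson(y, lam):
    # per-pixel variance estimate: var(n_k | x_k) is estimated by y_k / lambda
    var_est = np.maximum(y, 0.0) / lam
    # [z]_k = r_k * sqrt(y_k / lambda), with r_k standard Gaussian
    return np.random.randn(*y.shape) * np.sqrt(var_est)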

TABLE IV: Denoising quality, measured by PSNR (dB) and SSIM, for BSD68 [26] corrupted by Poisson noise.
Noise Level Noise2Self[2] DPLD DnCNN[41]
PSNR SSIM PSNR SSIM PSNR SSIM
λ=60  27.78  0.8660  29.28  0.9018  29.43  0.9081
λ=30  26.56  0.8285  27.65  0.8625  27.86  0.8743
λ=15  25.42  0.7902  26.16  0.8210  26.39  0.8348

For comparison, we train denoisers with Noise2Self [2] and DnCNN [41] on the same training set (the ground truth images being available only to DnCNN). The denoising results on the test set BSD68 are reported in Table IV. Trained on the ground truth images, DnCNN has the highest average PSNR for all three noise levels. The proposed method DPLD outperforms Noise2Self by 1.5 dB, 1.09 dB, and 0.74 dB for λ = 60, 30, 15, respectively. On the other hand, it loses 0.15 dB, 0.21 dB, and 0.23 dB compared to DnCNN. It should be noted that as the noise level decreases, the PSNR gap between DPLD and DnCNN gets smaller. In contrast, the gap between Noise2Self and DnCNN becomes larger as the noise becomes weaker. This may be due to the fact that the Noise2Self approach cannot learn the identity mapping: for a given pixel, the denoiser cannot see the observed value and has to infer it from the information in neighboring pixels. If knowledge about the noise distribution is available, the results can be improved by reusing the noisy images in the inference phase [16]. Our denoiser uses all the information in the noisy image, and, as shown in Proposition 2, the gap between DPLD and the best denoiser in $\mathcal{R}_{\epsilon}$ tends to zero as the noise goes to zero.

TABLE V: Denoising quality (in dB) for the 12 widely used images [6] and Poisson noise

λ = 60
Denoiser    Average  C. man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena   Barbara  Boat   Man    Couple
Noise2Self  29.15    28.32   32.13  29.79    27.94     28.99    26.63     27.98   31.74  28.60    29.34  29.20  29.18
DPLD        30.34    30.05   32.89  30.82    28.93     30.54    28.64     29.68   32.56  29.79    30.09  30.02  30.07
DnCNN       30.49    30.29   33.02  30.95    29.07     30.69    28.79     29.82   32.67  30.03    30.22  30.12  30.21

λ = 30
Noise2Self  27.82    27.36   30.74  28.27    26.31     27.72    25.61     26.95   30.36  26.79    27.91  28.02  27.82
DPLD        28.68    28.43   31.44  29.06    27.08     28.74    26.96     28.14   31.00  27.74    28.48  28.59  28.47
DnCNN       28.87    28.79   31.67  29.23    27.16     28.89    27.11     28.30   31.14  28.20    28.60  28.71  28.67

λ = 15
Noise2Self  26.26    26.01   28.81  26.60    24.62     25.99    24.23     25.58   28.69  24.94    26.48  26.78  26.36
DPLD        26.99    27.00   29.76  27.27    25.22     26.88    25.26     26.63   29.39  25.31    27.02  27.25  26.87
DnCNN       27.28    27.32   30.13  27.51    25.42     27.10    25.47     26.85   29.62  26.30    27.20  27.37  27.11

Table V lists the PSNR of the denoising outputs for the 12 widely used images [6]. Similar to the Gaussian noise cases, the proposed DPLD has higher PSNR values than Noise2Self on all images. The DPLD outperforms Noise2Self by more than 0.7 dB in average PSNR, and falls behind DnCNN by less than 0.3 dB.

Figure 6: Quality comparison of different methods for Poisson noise (λ = 30); columns from left to right: Noisy, Noise2Self, DPLD, DnCNN, Ground truth. The last two rows are enlarged views of the indicated regions.

Fig. 6 displays an example of the denoised images. This example shows that Noise2Self, DPLD, and DnCNN are all capable of recovering the details of the image, though the first two are not exposed to the detailed structures of the images during training. Compared to DPLD and DnCNN, the denoised image of Noise2Self is less smooth. Also, Noise2Self tends to remove the sharp points of a jagged edge (cf. the third row of Fig. 6): since it relies on the data of surrounding pixels when restoring the pixels at a sharp point, it may encourage more regularized object shapes.

Figure 7: Residual terms of DnCNN, Noise2Self, and our method. Top row: the ground truth $\bm{x}$, followed by the error images of DnCNN, Noise2Self, and our method, computed by subtracting the ground truth from the denoised images. Bottom row: the noise $\bm{n}$, followed by the residuals $\bm{e}$ of the three methods. The number in the top right corner of each image is the mean square of its pixel values.

Finally, the residual terms $\bm{e}$ (defined in (4)) of three different methods on the test image Starfish are presented in Fig. 7. For each method, in order to compute $\bm{e}$ we first compute $g(\bm{x})$ and $L$ via Remark 1, using the denoised outputs for 1,600,000 realizations of the noisy image $\bm{y}$. This experiment is carried out under the Poisson noise setting with λ = 30. As can be seen from Fig. 7, the error (i.e., the difference between the ground truth and the denoised image) is an order of magnitude smaller than the noise, while the residual term is an order of magnitude smaller than the error. The variance of $\bm{e}$ for the proposed denoiser is slightly smaller than that of the fully supervised DnCNN, and Noise2Self has a variance of $\bm{e}$ about two times larger than the other two.

IV-C Stability with respect to the partial linear constraint and noise levels

In this experiment we study the stability of the proposed denoiser with respect to two factors: the constraint parameter γ and the noise level. Here we consider Poisson noise with fixed λ = 30 for training, under the same experimental setting as Subsection IV-B3.

First we study the stability with respect to γ. Fig. 8 shows the PSNR and SSIM, evaluated on the test set BSD68, for different choices of γ used during training. The horizontal lines in Fig. 8 show the PSNR and SSIM of the two reference methods DnCNN and Noise2Self (N2S), which do not depend on γ. According to the figure, the learned denoiser has very low PSNR values when γ is very small (e.g., $10^{-4}$, which means that little importance is placed on the constraint). This implies that partial linearity is crucial for finding a high quality denoiser when minimizing (25). As γ becomes larger, the constraint starts to take effect and substantially improves the denoising performance. The peak PSNR is achieved at around $2^{4}$. For γ in $[2^{2},2^{8}]$, we observe a variation in PSNR of less than 0.2 dB, which is relatively small compared to the gap between N2S and DnCNN, before the denoising quality degrades at very large values of γ (e.g., $2^{14}$).

Fig. 9 compares the robustness of the learned denoisers with respect to different noise levels. All denoisers are trained in the Poisson noise setting (again with λ = 30), and in our approach the optimal γ (i.e., $2^{4}$) is applied. It can be seen that all methods suffer from a degradation in denoising quality when λ at test time is below 30. This is reasonable, because smaller λ generally implies stronger noise and a harder problem. Besides, DPLD and Noise2Self, which use only noisy images in training, are more stable than DnCNN when λ is around 20. The PSNR of DnCNN decreases quickly as λ decreases from 30, and it is worse than that of DPLD when λ ≤ 24. It is also interesting to note that DPLD outperforms the (fully supervised) DnCNN in the lower-noise cases (corresponding to large λ), even though neither the ground truth nor less noisy samples are provided for training. This suggests that the auxiliary random vector approach, together with the partially linear constraint, brings extra robustness compared to its counterpart that learns from a set of noisy-clean image pairs with a fixed noise level.

Figure 8: PSNR (in blue) and SSIM (in red) plotted against the partially linear constraint parameter γ (better viewed in color). The horizontal lines represent the results of DnCNN and Noise2Self for comparison.
Figure 9: Robustness of the denoisers with respect to the noise level.

IV-D Denoising real microscopy images

We apply the proposed method to denoising real microscopy images, for which neither the ground truth images nor the noise variance is known. Specifically, we consider three datasets, namely N2DH-GOWT1, C2DL-MSC, and TOMO110, which are used in [15]. The first two consist of fluorescence microscopy images of GFP-transfected GOWT1 mouse stem cells and of rat mesenchymal stem cells, respectively. The training sets contain 92 images of size 1024×1024 (N2DH-GOWT1) and 48 images of size 992×832 (C2DL-MSC). The third dataset, TOMO110, was acquired with a cryo-transmission electron microscope, and for this one we use a single image (of size 7676×7420) for training.

For each of the datasets, we first estimate the noise variance.
1). For N2DH-GOWT1 and C2DL-MSC, we first obtain a rough estimate by computing the average squared difference of neighboring pixels in smooth regions. We then refine the estimated noise variance using the correlation between the linear component of the denoiser and the errors in the noise variance. This is motivated by the observation in Remark 2 and the numerical study in Subsection IV-B2.
2). For TOMO110, since multiple frames/copies of the image are available owing to the dose-fractionated acquisition, we estimate the variance by computing the variance across the frames, whose noise realizations are independent (see the sketch after this list). Note that though several frames of the image are available, our approach learns to denoise the sum of all frames. In contrast, the Noise2Noise method learns to denoise only one half of the frames, for which the noise level is therefore higher.
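A minimal NumPy sketch of the frame-based estimate follows, assuming the frames are stored as a stack with independent, zero-mean noise per frame; since our network denoises the sum of the frames, the per-frame variance is scaled by the number of frames.

import numpy as np

def sum_frame_noise_variance(frames):
    # frames: array of shape (n_frames, H, W) sharing the same underlying
    # signal with independent noise; the pixel-wise variance across frames
    # estimates the per-frame noise variance (the signal term cancels out)
    per_frame_var = frames.var(axis=0, ddof=1).mean()
    # independent noise adds in variance: var(sum) = n_frames * var(frame)
    return frames.shape[0] * per_frame_var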

The details of the noise estimation and of the training of the networks are given in Appendix B. The denoising results are displayed in Fig. 10 (with enlarged views of the regions indicated by the blue/red boxes). We compare our method with BM3D [6] and Noise2Void [15]. As shown in the figure, BM3D produces smooth denoised images but also introduces artifacts (e.g., in the cells on the second row of Fig. 10). Noise2Void yields outputs with a certain graininess in non-constant regions (cf. the third row, first column of the figure). The denoised images from our approach look smooth while showing slightly more detail of the object structures (e.g., in the regions highlighted by the red boxes).

Figure 10: Denoising results on three real microscopy datasets (columns: N2DH-GOWT1, C2DL-MSC, TOMO110; rows: noisy image, BM3D, Noise2Void, ours).

IV-E Image deblurring

Following [31] and [39], we consider the image deblurring problem (27), where the operator $A$ is a convolution with a blur kernel arising from random motions. The datasets for the deblurring experiments are face images from the Helen dataset [18] and the CelebA dataset [24], and we use the same training/validation split as in [39]. This results in a total of 161,800 images for training, 22,000 images for validation, and 16,000 images for testing. Throughout this experiment, the noise $\bm{n}$ is Gaussian white noise with standard deviation σ = 2. We compare our approach with a fully supervised deblurring method, as well as with an unsupervised method [39] that uses a pair of noisy and blurry observations per image. To facilitate the comparison, the same U-Net architecture as in [39] is used for the deblurring networks in all three methods.

We train the deblurring network $R_{\theta}$ using the empirical version of the loss (28), given the random noisy samples $\{y_{i}\}$ and the operator $A$,

\mathcal{L}_{\rm b}:=\sum_{i}\left\|AR_{\theta}(\hat{y}_{i})-(y_{i}-z_{i}/\alpha)\right\|^{2}

where $\{z_{i}\}$ are the auxiliary vectors. As suggested in [39], we also integrate a proxy loss defined as

\mathcal{L}_{\rm prox}:=\sum_{i}\left\|R_{\theta}(y^{\rm prox}_{i})-x^{\rm prox}_{i}\right\|^{2}

in which $x^{\rm prox}_{i}:=R_{\theta}(y_{i})$, $y^{\rm prox}_{i}:=A^{\rm prox}x^{\rm prox}_{i}+n^{\rm prox}_{i}$, and $A^{\rm prox}$, $n^{\rm prox}_{i}$ are randomly generated blur kernels and noise, respectively. In our setting, $n^{\rm prox}_{i}$ is generated from the same distribution as $\bm{z}$. This loss is therefore the MSE loss defined on the synthetic training pair $(x^{\rm prox}_{i},y^{\rm prox}_{i})$, where $x^{\rm prox}_{i}$, a deblurred version of $y_{i}$, is treated as the ground truth. The proxy loss has been found useful for improving the deblurring accuracy. Finally, we apply the partially linear constraint in (31), with the denoiser being $AR_{\theta}(y_{i})$ in this context. In summary, the training loss function is $\mathcal{L}_{\rm b}+\gamma_{\rm prox}\mathcal{L}_{\rm prox}+\gamma\mathcal{L}_{c}$, where $\gamma_{\rm prox}$ is a hyperparameter.
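A PyTorch-style sketch of the combined objective is given below; `blur_A` stands for the known operator $A$, `rand_kernel` for a random proxy-kernel generator, and the noisier input $\hat{y}_i = y_i + \alpha z_i$ is our reading of the loss (28), so these names and that form are assumptions rather than the exact implementation.

import torch

def deblur_loss(model, y, z, alpha, blur_A, rand_kernel,
                gamma_prox=1.0/16, sigma=2.0/255):
    # L_b: data term of Eq. (28) on the assumed noisier input y_hat = y + alpha*z
    y_hat = y + alpha * z
    l_b = ((blur_A(model(y_hat)) - (y - z / alpha)) ** 2).mean()
    # proxy loss: the current deblurred output serves as a pseudo ground truth
    x_prox = model(y).detach()
    # noise drawn to match the distribution of z (sigma = 2 on 256 gray levels)
    y_prox = rand_kernel(x_prox) + sigma * torch.randn_like(y)
    l_prox = ((model(y_prox) - x_prox) ** 2).mean()
    # the constraint gamma * L_c of (31), applied to A R_theta, is added on top
    return l_b + gamma_prox * l_prox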

The deblurring model is trained using the Adam optimizer [14] with a batch size of 128 and 360,000 optimization steps, starting from an initial learning rate of $10^{-3}$. The learning rate is then decreased to $3\times 10^{-4}$, $10^{-4}$, and $5\times 10^{-5}$ at the 310,000th, 340,000th, and 350,000th steps, respectively. In the first 100,000 optimization steps, we let $\gamma_{\rm prox}=0$ and γ = 0; after that, $\gamma_{\rm prox}=1/16$ and γ = 1/16. In this experiment, α is randomly selected from [0.1, 0.2] for each training sample.

TABLE VI: Deblurring results. For the data used in the training step, O and C stand for observations and clean images, respectively. The mark \dagger indicates training without the proxy loss.
Method Data Helen [18] CelebA [24]
PSNR SSIM PSNR SSIM
Xu et al. [40] — 20.11 0.711 18.93 0.685
Zuo et al. [44] O/C Pairs 22.24 0.763 20.53 0.750
Tao et al. [34] O/C Pairs 22.86 0.762 24.11 0.862
Shen et al. [31] O/C Pairs 25.99 0.871 25.05 0.879
Kupyn et al. [17] O/C Pairs 23.63 0.781 22.45 0.729
Supervised baseline O/C Pairs 26.13 0.886 25.20 0.892
Xia et al. [39]\dagger O/O Pairs 25.47 0.867 24.64 0.873
Ours\dagger Unpaired O 25.65 0.867 24.78 0.873
Xia et al. [39] O/O Pairs 25.95 0.878 25.09 0.885
Ours Unpaired O 26.00 0.879 25.09 0.886

Table VI shows the PSNR and SSIM of the deblurring results on the test sets of Helen [18] and CelebA [24]. In the table, the supervised baseline is based on the same U-Net architecture as our approach and is trained on (noisy) observation/clean image pairs. It achieves the highest PSNR and SSIM scores among the compared methods. Our method reaches competitive PSNR values of 25.65 dB (Helen) and 24.78 dB (CelebA) without the proxy loss, and adding the proxy loss yields a further improvement of 0.35 dB and 0.31 dB, respectively. The approach proposed by Xia et al. [39], an extension of Noise2Noise [20], relies instead on paired noisy observations for each image, though without the ground truth. For the results reported in the table, the operator $A$ is known during training for both our approach and the method of Xia et al. [39]. However, our method requires only unpaired observations, and it achieves a deblurring quality comparable to Xia et al. [39] and not far from that of the fully supervised baseline (with a PSNR gap smaller than 0.15 dB). Furthermore, our method outperforms the existing supervised methods of Zuo et al. [44], Tao et al. [34], Shen et al. [31], and Kupyn et al. [17], as well as the unsupervised approach of Xu et al. [40].

A comparison of the deblurred images from different methods is shown in Fig. 11, where the ground truth and blurry images are taken from the Helen test set. We underline that no blur kernels are provided to these methods at test time. Given no ground truth images for training, the deblurred results of Xia et al. [39] (cf. the 4th column in the figure) and of our approach (cf. the 5th column) capture most of the details recovered by the fully supervised baseline (cf. the 3rd column), though the latter gives slightly sharper images. This demonstrates the capacity of the proposed partially linear denoisers for solving the deblurring problem from only one corrupted observation per image. Though in this experiment the unsupervised methods require knowledge of the blur kernels during the training phase, they can be generalized to cases where the blur kernels are unknown at training time, e.g., by jointly learning the clean images and blur kernels [39].

Figure 11: Face deblurring results on images from the Helen test set [18]; columns from left to right: Ground truth, Blurry image, Fully supervised, Xia et al. [39], Ours.

V Conclusion

In this paper, we propose a class of structured nonlinear denoisers. We show that such denoisers, equipped with a partial linearity property, can be trained without access to any ground truth images or to the exact model of the noise. In practice, one only needs to know the noise variance conditioned on the images. Based on the partial linearity structure, we propose an auxiliary random vector approach, which establishes a direct connection between our loss function and the MSE and allows end-to-end training of the denoising models. The approach outperforms other ground-truth-free denoising approaches, such as SURE based learning, and achieves a denoising quality close to that of the fully supervised baseline. It also offers new opportunities for learning to solve other image restoration tasks from single corrupted observations of the images. The experimental results show that, when generalized to an image deblurring problem, our approach achieves the state of the art for unsupervised deblurring.

One disadvantage of our method is that it does not work for noise with non-zero mean, such as impulse noise, for which the corrupted pixels are replaced with random numbers. In such cases, the auxiliary random vector approach no longer provides good estimates of the MSE. Our method also requires estimates of the noise variance; various methods exist for estimating the noise variance from noisy images (see, e.g., [21, 9, 1, 23]). Alternatively, in future work, we are interested in learning-based or automated noise variance extraction methods with the potential of being integrated into the proposed denoiser within an end-to-end framework.

References

  • [1] Nicola Acito, Marco Diani, and Giovanni Corsini. Signal-dependent noise modeling and model parameter estimation in hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing, 49(8):2957–2971, 2011.
  • [2] Joshua Batson and Loic Royer. Noise2Self: Blind denoising by self-supervision. In International Conference on Machine Learning, pages 524–533, 2019.
  • [3] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 60–65. IEEE, 2005.
  • [4] Raymond H Chan, Chung-Wa Ho, and Mila Nikolova. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization. IEEE Transactions on Image Processing, 14(10):1479–1485, 2005.
  • [5] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2016.
  • [6] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
  • [7] David L Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613–627, 1995.
  • [8] Thibaud Ehret, Axel Davy, Jean-Michel Morel, Gabriele Facciolo, and Pablo Arias. Model-blind video denoising via frame-to-frame training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11369–11378, 2019.
  • [9] Alessandro Foi, Mejdi Trimeche, Vladimir Katkovnik, and Karen Egiazarian. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Transactions on Image Processing, 17(10):1737–1754, 2008.
  • [10] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855–3863, 2017.
  • [11] Yoseob Han and Jong Chul Ye. Framing U-Net via deep convolutional framelets: Application to sparse-view CT. IEEE Transactions on Medical Imaging, 37(6):1418–1429, 2018.
  • [12] Tao Hong, Yaniv Romano, and Michael Elad. Acceleration of red via vector extrapolation. Journal of Visual Communication and Image Representation, 63:102575, 2019.
  • [13] Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2Void: Learning denoising from single noisy images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2019.
  • [16] Alexander Krull, Tomas Vicar, and Florian Jug. Probabilistic Noise2Void: Unsupervised content-aware denoising. arXiv preprint arXiv:1906.00651, 2019.
  • [17] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8183–8192, 2018.
  • [18] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S Huang. Interactive facial feature localization. In European conference on computer vision, pages 679–692. Springer, 2012.
  • [19] Stamatios Lefkimmiatis. Universal denoising networks: A novel CNN architecture for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3204–3213, 2018.
  • [20] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2Noise: Learning image restoration without clean data. In International Conference on Machine Learning, pages 2965–2974, 2018.
  • [21] Ce Liu, William T Freeman, Richard Szeliski, and Sing Bing Kang. Noise estimation from a single image. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 901–908. IEEE, 2006.
  • [22] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. In Advances in Neural Information Processing Systems, pages 1673–1682, 2018.
  • [23] Xinhao Liu, Masayuki Tanaka, and Masatoshi Okutomi. Practical signal-dependent noise parameter estimation from a single noisy image. IEEE Transactions on Image Processing, 23(10):4361–4371, 2014.
  • [24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • [25] Stephane Mallat and Wen Liang Hwang. Singularity detection and processing with wavelets. IEEE transactions on information theory, 38(2):617–643, 1992.
  • [26] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
  • [27] Nick Moran, Dan Schmidt, Yu Zhong, and Patrick Coady. Noisier2noise: Learning to denoise from unpaired noisy data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12064–12072, 2020.
  • [28] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
  • [29] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 860–867. IEEE, 2005.
  • [30] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
  • [31] Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, and Ming-Hsuan Yang. Deep semantic face deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8260–8269, 2018.
  • [32] Shakarim Soltanayev and Se Young Chun. Training deep learning based denoisers without ground truth data. In Advances in Neural Information Processing Systems, pages 3257–3267, 2018.
  • [33] Suhas Sreehari, S Venkat Venkatakrishnan, Brendt Wohlberg, Gregery T Buzzard, Lawrence F Drummy, Jeffrey P Simmons, and Charles A Bouman. Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Transactions on Computational Imaging, 2(4):408–423, 2016.
  • [34] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
  • [35] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
  • [36] Valéry Dewil, Pablo Arias, Gabriele Facciolo, Jérémy Anger, Axel Davy, and Thibaud Ehret. Self-supervised training for blind multi-frame video denoising. arXiv preprint arXiv:2004.06957, 2020.
  • [37] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE Global Conference on Signal and Information Processing, pages 945–948. IEEE, 2013.
  • [38] Joachim Weickert. Anisotropic diffusion in image processing, volume 1. Teubner Stuttgart, 1998.
  • [39] Zhihao Xia and Ayan Chakrabarti. Training image estimators without image ground truth. In Advances in Neural Information Processing Systems, pages 2436–2446, 2019.
  • [40] Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural L0 sparse representation for natural image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.
  • [41] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  • [42] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [43] Magauiya Zhussip, Shakarim Soltanayev, and Se Young Chun. Extending Stein's unbiased risk estimator to train deep denoisers with correlated pairs of noisy images. In Advances in Neural Information Processing Systems, pages 1463–1473, 2019.
  • [44] Wangmeng Zuo, Dongwei Ren, David Zhang, Shuhang Gu, and Lei Zhang. Learning iteration-wise generalized shrinkage–thresholding operators for blind deconvolution. IEEE Transactions on Image Processing, 25(4):1751–1764, 2016.

Supplementary Materials:
Unsupervised Image Restoration Using Partially Linear Denoisers
Rihuan Ke and Carola-Bibiane Schönlieb

Appendix A Additional proofs for Section III

Remark 2. Assume that the conditions of Proposition 2 hold except that ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})=(1+\beta){\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$, where $\beta>-1$. Then $\mathcal{J}(R)=\mathbb{E}(\|R(\hat{\bm{y}})-\bm{x}\|^{2})-2\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)+c+2\beta\,\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$.

Proof.

The proof is similar to that of Proposition 2.

Firstly, we repeat Equation (18) here

R(\hat{\bm{y}})-(\bm{y}-\bm{z}/\alpha)=\left(g(\bm{x})+L\hat{\bm{n}}+\bm{e}-\bm{x}\right)-\left(\bm{n}-\bm{z}/\alpha\right).

Taking the expectation of both sides over $\bm{x}$, $\bm{n}$, and $\bm{z}$, one has

\begin{split}\mathcal{J}(R)=&\ \mathbb{E}\left(\|R(\hat{\bm{y}})-\bm{x}\|^{2}\right)+\mathbb{E}\left(\|\bm{n}-\bm{z}/\alpha\|^{2}\right)\\&-2\mathbb{E}\left(\langle g(\bm{x})-\bm{x},\,\bm{n}-\bm{z}/\alpha\rangle\right)-2\mathbb{E}\left(\langle\bm{e},\,\bm{n}-\bm{z}/\alpha\rangle\right)\\&-2\mathbb{E}\left(\langle L\bm{n}+\alpha L\bm{z},\,\bm{n}-\bm{z}/\alpha\rangle\right).\end{split} \qquad (32)

Secondly, by the assumption ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})=(1+\beta){\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$, we have the following equalities associated with the last term of Equation (32):

\begin{split}&\mathbb{E}\left(\langle L\bm{n}+\alpha L\bm{z},\,\bm{n}-\bm{z}/\alpha\rangle\mid\bm{x}\right)\\=&\ \mathbb{E}\left(\langle L\bm{n},\bm{n}\rangle\mid\bm{x}\right)+\mathbb{E}\left(\langle\alpha L\bm{z},-\bm{z}/\alpha\rangle\mid\bm{x}\right)\\=&\ -\beta\,\mathbb{E}\left(\langle L\bm{z},\bm{z}\rangle\mid\bm{x}\right).\end{split} \qquad (33)

Also, the term $\mathbb{E}(\langle g(\bm{x})-\bm{x},\bm{n}-\bm{z}/\alpha\rangle)=0$ according to the proof of Proposition 2.

Finally, let $c:=\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})$; then the desired equality follows from Equations (33) and (32). The proof is completed. ∎

Proposition (restated from Section III).

Proof.

To simplify the notation, let $\bm{p}:=R^{\epsilon,\delta}(\hat{\bm{y}})$ and $\hat{\bm{p}}:=\hat{R}^{\epsilon,\delta}(\hat{\bm{y}})$.

I). By Lemma 3, $R^{\epsilon,\delta}+t(\hat{R}^{\epsilon,\delta}-R^{\epsilon,\delta})\in\mathcal{R}_{\epsilon}$ for $t\in[0,1]$. It follows from the definition of $R^{\epsilon,\delta}$ in (22) that

\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)-\delta\leq\mathbb{E}\left(\|\bm{p}+t(\hat{\bm{p}}-\bm{p})-\bm{x}\|^{2}\right). \qquad (34)

Besides, the right hand side of the above inequality can be decomposed as

\begin{split}&\mathbb{E}\left(\|\bm{p}+t(\hat{\bm{p}}-\bm{p})-\bm{x}\|^{2}\right)\\=&\ \mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)+2t\,\mathbb{E}\left(\langle\bm{p}-\bm{x},\,\hat{\bm{p}}-\bm{p}\rangle\right)+t^{2}\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right).\end{split} \qquad (35)

Combining (34) and (35), one has

2\mathbb{E}\left(\langle\bm{p}-\bm{x},\,\hat{\bm{p}}-\bm{p}\rangle\right)\geq-s\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)-\delta/s,\quad\forall s\in(0,1].

Letting $t=1$, Equation (35) and the last inequality force

(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)\leq\mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)-\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)+\delta/s,\quad\forall s\in(0,1]. \qquad (36)

II). With the same argument as in I), for $\hat{R}^{\epsilon,\delta}$ one can show that, for any $s\in(0,1]$,

\begin{split}(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)&\leq\mathbb{E}\left(\|\bm{p}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)-\mathbb{E}\left(\|\hat{\bm{p}}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)+\delta/s\\&=\mathcal{J}\left(R^{\epsilon,\delta}\right)-\mathcal{J}\left(\hat{R}^{\epsilon,\delta}\right)+\delta/s.\end{split} \qquad (37)

Note that (36) and (37) give two upper bounds on $\mathbb{E}(\|\hat{\bm{p}}-\bm{p}\|^{2})$. Averaging both sides of the two inequalities,

\begin{split}(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)\leq&\ \mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)/2-\mathcal{J}\left(\hat{R}^{\epsilon,\delta}\right)/2\\&+\mathcal{J}\left(R^{\epsilon,\delta}\right)/2-\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)/2+\delta/s.\end{split}

If we set $s=\sqrt{\delta}$, the desired inequality then follows from Proposition 2. ∎

Proposition (restated from Section III).

Proof.

Let $\bm{p}:=R^{\epsilon,\delta}(\hat{\bm{y}})$ and $\hat{\bm{p}}:=\hat{R}^{\epsilon,\delta}(\hat{\bm{y}})$. According to Equation (8), we have the decomposition

\begin{split}\bm{p}&=g_{1}(\bm{x})+L_{1}\hat{\bm{n}}+\bm{e}_{1},\\ \hat{\bm{p}}&=g_{2}(\bm{x})+L_{2}\hat{\bm{n}}+\bm{e}_{2},\end{split} \qquad (38)

where $L_{i}$ is linear and depends on $\bm{x}$, and $\|\bm{e}_{i}\|\leq\epsilon$ for $i\in\{1,2\}$. Let $\bm{q}:=\bm{p}-\bm{e}_{1}+\bm{e}_{2}$; then

\bm{q}(\hat{\bm{y}})=g_{1}(\bm{x})+L_{1}\hat{\bm{n}}+\bm{e}_{2}, \qquad (39)

as a function of $\hat{\bm{y}}$, is well defined, because the distribution of $\bm{x}$ is a delta function and $\bm{q}(\hat{y}_{1})=\bm{q}(\hat{y}_{2})$ whenever $\hat{y}_{1}=\hat{y}_{2}$ for any realizations $\hat{y}_{1}$, $\hat{y}_{2}$ of $\hat{\bm{y}}$. Furthermore, it is easy to check that $\bm{q}\in\mathcal{R}_{\epsilon}$.

Analogous to the argument in (37), for $\bm{q}\in\mathcal{R}_{\epsilon}$ we have

(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{q}\|^{2}\right)\leq\mathbb{E}\left(\|\bm{q}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)-\mathbb{E}\left(\|\hat{\bm{p}}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)+\delta/s, \qquad (40)

for any $s\in(0,1]$. It follows from Equation (15) that

\begin{split}\mathbb{E}\left(\|\bm{q}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)&=\mathbb{E}\left(\|\bm{q}-\bm{x}\|^{2}\right)-2\mathbb{E}\left(\langle\bm{e}_{2},\bm{n}-\bm{z}/\alpha\rangle\right)+c,\\ \mathbb{E}\left(\|\hat{\bm{p}}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)&=\mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)-2\mathbb{E}\left(\langle\bm{e}_{2},\bm{n}-\bm{z}/\alpha\rangle\right)+c\end{split}

for some constant $c$. This together with (40) gives

(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{q}\|^{2}\right)\leq\mathbb{E}\left(\|\bm{q}-\bm{x}\|^{2}\right)-\mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)+\delta/s.

Combining the last inequality with (36), one has

\begin{split}&(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{q}\|^{2}\right)+(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)\\ \leq&\ \mathbb{E}\left(\|\bm{q}-\bm{x}\|^{2}\right)-\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)+2\delta/s.\end{split} \qquad (41)

Next, we show that $\mathbb{E}(\|\bm{q}-\bm{x}\|^{2})-\mathbb{E}(\|\bm{p}-\bm{x}\|^{2})\leq\epsilon^{2}$, which together with (41) gives the desired result (24) by setting $s=\sqrt{\delta}$.

From Remark 1, the definitions of $L_{1}$ and $\bm{e}_{1}$ in (38) imply

\begin{split}\mathbb{E}\left(\|\bm{e}_{1}\|^{2}\mid\bm{x}\right)&=\mathbb{E}\left(\|\bm{p}-g_{1}(\bm{x})-L_{1}\hat{\bm{n}}\|^{2}\mid\bm{x}\right)\\&\leq\mathbb{E}\left(\|\bm{p}-g_{1}(\bm{x})-L_{1}\hat{\bm{n}}+tL\hat{\bm{n}}\|^{2}\mid\bm{x}\right)\end{split}

for any $t\in\mathbb{R}$ and any linear $L$. This forces $\mathbb{E}(\langle L\bm{\hat{n}},\bm{e}_{1}\rangle\mid\bm{x})=0$. Furthermore, the conditional mean of $\bm{e}_{1}$ is zero, i.e., $\mathbb{E}(\bm{e}_{1}\mid\bm{x})=0$. Similarly, one can show that $\mathbb{E}(\langle L\bm{\hat{n}},\bm{e}_{2}\rangle\mid\bm{x})=0$ and $\mathbb{E}(\bm{e}_{2}\mid\bm{x})=0$. Therefore,

$$\mathbb{E}\!\left(\langle g_{1}(\bm{x})+L_{1}\bm{\hat{n}}-\bm{x},\,\bm{e}_{i}\rangle\mid\bm{x}\right)=0,\quad\text{for }i\in\{1,2\},$$

and consequently,

$$\begin{split}&\mathbb{E}\!\left(\|\bm{q}-\bm{x}\|^{2}\mid\bm{x}\right)\\ =\ &\mathbb{E}\!\left(\|g_{1}(\bm{x})+L_{1}\bm{\hat{n}}+\bm{e}_{2}-\bm{x}\|^{2}\mid\bm{x}\right)\\ \leq\ &\mathbb{E}\!\left(\|\bm{e}_{2}\|^{2}\mid\bm{x}\right)+\mathbb{E}\!\left(\|g_{1}(\bm{x})+L_{1}\bm{\hat{n}}-\bm{x}\|^{2}\mid\bm{x}\right)\\ \leq\ &\mathbb{E}\!\left(\|\bm{e}_{2}\|^{2}\mid\bm{x}\right)+\mathbb{E}\!\left(\|g_{1}(\bm{x})+L_{1}\bm{\hat{n}}+\bm{e}_{1}-\bm{x}\|^{2}\mid\bm{x}\right)\\ =\ &\mathbb{E}\!\left(\|\bm{e}_{2}\|^{2}\mid\bm{x}\right)+\mathbb{E}\!\left(\|\bm{p}-\bm{x}\|^{2}\mid\bm{x}\right).\end{split}$$

Taking expectation over $\bm{x}$, the last inequality leads to

$$\mathbb{E}\!\left(\|\bm{q}-\bm{x}\|^{2}\right)\leq\epsilon^{2}+\mathbb{E}\!\left(\|\bm{p}-\bm{x}\|^{2}\right),$$

which completes the proof. ∎

Appendix B Additional details for Subsection IV-D

In this section, we give further details of the experiments on real microscopy images. We first provide the training specification for the denoising networks, and then discuss the estimation of the noise variance required by the training process. Finally, we present a noise variance refinement method that helps us find a more accurate noise variance. The general experimental setting is similar to that of Subsection IV-B.

Training specification. We use the DnCNN network architecture [41] throughout the experiments. During training, the images are cropped into overlapping patches using a sliding window of size $40\times 40$ and a step size of $10$. Additionally, for the N2DH-GOWT1 and C2DL-MSC datasets, since the major regions of the images are low-intensity background, we order the patches by their average pixel intensity (from low to high) and randomly delete $80\%$ of the first $80\%$ of the patches, so that the network pays more attention to the foreground objects during training (see the sketch below). For all three datasets, we apply random flipping to augment the training images. The parameter $\gamma$ is set to $2^{2}$ for N2DH-GOWT1 and C2DL-MSC, and to $2^{6}$ for TOMO110. The optimization process follows that of Subsection IV-B, except that it uses the estimated noise variance introduced below.
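As an illustration of this patch pipeline, the following is a minimal Python sketch of the cropping and intensity-ordered subsampling; the function names and the NumPy implementation are ours, not the authors' released training code.

```python
import numpy as np

def extract_patches(img, size=40, step=10):
    """Crop a 2-D image into overlapping patches with a sliding window."""
    h, w = img.shape
    patches = [img[r:r + size, c:c + size]
               for r in range(0, h - size + 1, step)
               for c in range(0, w - size + 1, step)]
    return np.stack(patches)

def subsample_dark_patches(patches, head=0.8, drop=0.8, seed=0):
    """Order patches by average intensity (low to high) and randomly
    delete a fraction `drop` of the first `head` fraction, so that
    mostly background patches are discarded."""
    rng = np.random.default_rng(seed)
    order = np.argsort(patches.mean(axis=(1, 2)))  # dark -> bright
    n_head = int(head * len(order))
    n_keep = n_head - int(drop * n_head)           # survivors among the dark patches
    keep_head = rng.choice(order[:n_head], size=n_keep, replace=False)
    return patches[np.concatenate([keep_head, order[n_head:]])]
```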

Noise variance estimation. We estimate the noise variance as follows.

1). For the datasets N2DH-GOWT1 and C2DL-MSC, the variance is estimated via the mean squared difference of neighboring pixels in smooth regions of the images. Firstly, given a noisy image $y$, we define the average squared difference at pixel $i$ as

$$d_{i}^{2}=\sum_{j\in\mathcal{N}(i)}(y_{i}-y_{j})^{2}/(2|\mathcal{N}(i)|)$$

where $\mathcal{N}(i)$ denotes the 4-connected neighbors of $i$. Secondly, we compute a smooth version of $y$,

$$s=k*y$$

where $*$ denotes convolution and $k$ is a uniform $10\times 10$ kernel whose entries sum to $1$. Thirdly, to find the smooth regions of the images, we define

$$\mathcal{F}=\left\{i\mid f_{i}\leq 0.02\,(s_{i}-\min_{j}s_{j})\right\}$$

where $f=k*\left((s-y)\odot(s-y)\right)$ and $\odot$ denotes element-wise multiplication. Finally, we assume that the noise variance depends only on the pixel intensity $v$, and the estimated variance is computed as

$$V(v)={\rm mean}\left\{d_{i}^{2}\mid i\in\mathcal{F},\ |s_{i}-v|<0.5/255\right\}.$$

In our experiments, $V(v)$ is averaged over all images $y$ in the training set. The computed variance is plotted in Fig. 12, where the blue squares indicate $V(v)$. We observe that $V(\cdot)$ follows a linear pattern, so we approximate $V(v)$ by a linear function $V(v)\approx(v-\mu)/\lambda$ (see the red lines in Fig. 12), where $\mu$ and $\lambda$ are parameters. For N2DH-GOWT1, $\mu=0$ and $\lambda=69$; for C2DL-MSC, $\mu=0.09$ and $\lambda=76$, with $V(v)=0$ for $v\leq\mu$. A code sketch of this procedure is given after Fig. 12.

Figure 12: Initial noise variance estimates. (a) N2DH-GOWT1; (b) C2DL-MSC.
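For concreteness, the estimation procedure above can be sketched in NumPy/SciPy as follows. This is an illustrative reimplementation under our own assumptions (uniform filtering for the convolutions with $k$, and $d_i^2$ computed on interior pixels only); all function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def neighbour_sq_diff(y):
    """d_i^2: average squared difference over the 4-connected neighbours,
    computed on interior pixels and zero-padded back to the input shape."""
    c = y[1:-1, 1:-1]
    d2 = ((c - y[:-2, 1:-1]) ** 2 + (c - y[2:, 1:-1]) ** 2 +
          (c - y[1:-1, :-2]) ** 2 + (c - y[1:-1, 2:]) ** 2) / 8.0
    return np.pad(d2, 1)

def estimate_V(images, win=10, thresh=0.02, tol=0.5 / 255):
    """V(v) = mean{d_i^2 : i in F, |s_i - v| < tol}, on a grid of v."""
    d2_list, s_list = [], []
    for y in images:
        s = uniform_filter(y, size=win)             # s = k * y
        f = uniform_filter((s - y) ** 2, size=win)  # f = k * ((s - y) (x) (s - y))
        mask = f <= thresh * (s - s.min())          # smooth-region set F
        mask[0, :] = mask[-1, :] = False            # drop zero-padded borders
        mask[:, 0] = mask[:, -1] = False
        d2_list.append(neighbour_sq_diff(y)[mask])
        s_list.append(s[mask])
    d2, s = np.concatenate(d2_list), np.concatenate(s_list)
    grid = np.linspace(s.min(), s.max(), 128)
    V = np.array([d2[np.abs(s - v) < tol].mean()
                  if np.any(np.abs(s - v) < tol) else np.nan
                  for v in grid])
    return grid, V
```

The linear approximation $V(v)\approx(v-\mu)/\lambda$ can then be fitted to the valid bins of $(\mathrm{grid},V)$, e.g. with `np.polyfit`.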

2). For TOMO110, since the training image has multiple frames (a consequence of dose-fractionated acquisition), we estimate the variance from the differences among the frames. Specifically, the raw dose-fractionated data consists of frames $a^{(1)},a^{(2)},\dots,a^{(n)}$, where $n$ is the number of frames and the frames $a^{(j)}$ have independent noise; in the dataset considered here, $n=10$. The observed image is the sum of all frames, i.e., $y=\sum_{j=1}^{n}a^{(j)}$. From the variance among the $a^{(j)}$, the noise variance at pixel $i$ is estimated as

$$V(i)=\frac{n}{n-1}\sum_{j=1}^{n}\left(a_{i}^{(j)}-y_{i}/n\right)^{2}.$$

Note that $\frac{1}{n-1}\sum_{j=1}^{n}(a_{i}^{(j)}-y_{i}/n)^{2}$ is the unbiased sample variance of a single frame, and the variance of the sum $y_{i}$ is $n$ times the per-frame variance, which gives the factor $n/(n-1)$ above. We emphasize that $V(i)$ estimates the noise variance at pixel $i$, and that our method denoises the whole image $y$ rather than the individual frames $a^{(j)}$ (see the sketch below). In the Noise2Noise approach, in contrast, the network learns to denoise half of the frames and therefore does not make full use of $y$ during the denoising process.
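A per-pixel implementation of this estimator might look as follows; the function name and the (frames, height, width) array layout are our assumptions.

```python
import numpy as np

def framewise_variance(frames):
    """Per-pixel noise variance V(i) of y = sum_j a^(j), estimated from
    n independent frames stacked as an array of shape (n, H, W)."""
    a = np.asarray(frames, dtype=np.float64)
    n = a.shape[0]
    y = a.sum(axis=0)
    # unbiased per-frame sample variance, scaled by n because
    # Var(y_i) = n * Var(a_i^(j)) for independent frames
    return n / (n - 1) * ((a - y / n) ** 2).sum(axis=0)
```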

Noise variance refinement. The rough noise variance estimates for the N2DH-GOWT1 and C2DL-MSC datasets are refined using the trained models. Specifically, a denoising network is first trained on the rough noise estimates $V(v)$. The trained network is then fine-tuned for several values of $\lambda$ close to the initial estimates. For fine-tuning, we apply $50000$ optimization steps, leaving the other hyperparameters unchanged. This yields several denoising models, one per value of $\lambda$. For each model, we compute the term $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ for a constant image $\bm{x}$ by using Remark 1 (one possible computation is sketched after Fig. 13). Since the ground truth noise variance is unknown, we generate $\bm{n}$ using the estimated variance parameterized by $\lambda$. The results are shown in Fig. 13. Motivated by the experimental results in Subsection IV-B2 and Fig. 5, we select the $\lambda$ values that lead to a small positive $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$. Our final estimate of $\lambda$ is $69$ for N2DH-GOWT1 and $88$ for C2DL-MSC. No refinement is performed for TOMO110 because its noise variance is estimated per pixel.

Figure 13: Refining noise variance estimates using the learned denoisers. (a) N2DH-GOWT1; (b) C2DL-MSC.
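Since Remark 1 is not restated here, the sketch below shows one plausible Monte-Carlo estimate of $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$: for a partially linear denoiser $R(\bm{y})=g(\bm{x})+L\bm{n}+\bm{e}$, the difference $R(\bm{y}+\bm{z})-R(\bm{y})$ equals $L\bm{z}$ up to residual terms, so its inner product with $\bm{z}$ estimates the desired quantity. The PyTorch code and all names below are hypothetical, not the authors' exact computation.

```python
import torch

@torch.no_grad()
def probe_zLz(model, x_const, sigma, n_trials=100):
    """Monte-Carlo estimate of E<z, Lz> for a constant image x_const.

    sigma may be a scalar or a per-pixel tensor (e.g. derived from the
    lambda-parameterized variance estimate); it broadcasts over the image.
    """
    model.eval()
    total = 0.0
    for _ in range(n_trials):
        n = sigma * torch.randn_like(x_const)  # synthetic measurement noise
        z = sigma * torch.randn_like(x_const)  # auxiliary probe noise
        y = x_const + n
        # R(y + z) - R(y) is approximately L z for a partially linear denoiser
        total += torch.sum(z * (model(y + z) - model(y))).item()
    return total / n_trials
```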
Figure 14: Additional visual results for deblurring. Columns, left to right: ground truth, blurry image, fully supervised, Xia et al. [39], ours. The images are taken from the test sets of Helen [18] and CelebA [24].

Appendix C Additional results about deblurring

Fig. 14 contains additional visual outputs of our model, compared with the fully supervised baseline and the method of Xia et al. [39]. The experimental setting is described in Subsection IV-E. The results show that, although it learns without ground truth images, our model recovers the major structures of the faces, with a quality comparable to the Noise2Noise-based method of Xia et al. [39].