
Unsupervised Image Restoration Using Partially Linear Denoisers

Rihuan Ke and Carola-Bibiane Schönlieb R. Ke and C.-B. Schönlieb are with DAMTP, University of Cambridge, Cambridge CB3 0WA, UK.
Abstract

Deep neural network based methods are the state of the art in various image restoration problems. Standard supervised learning frameworks require a set of pairs of noisy measurements and clean images, for which a distance between the output of the restoration model and the ground truth clean images is minimized. The ground truth images, however, are often unavailable or very expensive to acquire in real-world applications. We circumvent this problem by proposing a class of structured denoisers that can be decomposed as the sum of a nonlinear image-dependent mapping, a linear noise-dependent term, and a small residual term. We show that these denoisers can be trained with only noisy images under the condition that the noise has zero mean and known variance. The exact distribution of the noise, however, is not assumed to be known. We show the superiority of our approach for image denoising, and demonstrate its extension to other restoration problems such as image deblurring, where the ground truth is not available. Our method outperforms some recent unsupervised and self-supervised deep denoising models that do not require clean images for their training. For deblurring problems, the method, using only one noisy and blurry observation per image, reaches a quality not far from that of its fully supervised counterparts on a benchmark dataset.

Index Terms:
Image denoising, deep learning, convolutional neural networks, unsupervised learning, partially linear denoiser

I Introduction

The acquisition of real-life images usually suffers from noise corruption due to various factors such as sensor sensitivity, variations in photon counts, and air turbulence, to name just a few. Image denoising [30, 25, 7, 38] aims to remove noise from corrupted images. It is one of the most fundamental and central problems tackled by the image processing community. A variety of image denoising approaches have been developed in the past decades, which can be broadly classified into model based denoisers (see e.g., [30, 4, 6, 3]) and data driven denoisers [29, 5, 41].

Recent data driven techniques outperform conventional model based methods and achieve state-of-the-art denoising quality [22, 41, 42, 19]. These methods take advantage of large sets of image data and use models, in particular deep convolutional neural networks (CNNs), to learn the image prior from the datasets rather than relying on predefined image features. Compared to many model based methods, which need to solve a difficult optimization problem for each test image, CNN based denoisers are computationally efficient once the CNN is trained. Nevertheless, CNN based denoising approaches usually require a massive amount of ground truth images in the training phase. Specifically, conventional training pipelines consist of a loss function or metric, which defines the distance between a clean ground truth image and its reconstructed version, and an optimization step in which the parameters of the model are adjusted so as to minimize the loss function. One of the most commonly used metrics is the mean squared error (MSE), the calculation of which depends explicitly on the ground truth image. While these learning processes can lead to high-quality denoisers, they are problematic for application domains where ground truth images are not accessible.

In the past years, several unsupervised deep learning techniques have been developed to overcome this difficulty. It has been shown that a deep neural network denoiser can be trained using only noisy data if multiple noisy versions of each unknown clean image are available [20], or if the noise is independent across different regions of the image [2, 15, 16]. Under the assumption of i.i.d. Gaussian noise, loss functions can also be adapted, based on Stein's unbiased risk estimate of the MSE [32], such that they are defined only on the noisy images while providing good approximations of the MSE.

Figure 1: Partially linear denoiser. The clean image $\bm{x}$ (top middle), the noise $\bm{n}$ (top right), and the noisy image $\bm{y}:=\bm{x}+\bm{n}$ (top left) are modeled as random variables. A denoiser $R$ is decomposed as $R(\bm{y})=g(\bm{x})+L\bm{n}+\bm{e}$ with a function $g(\bm{x})$ (possibly nonlinear), a linear mapping $L$ (which may depend on $\bm{x}$), and a residual term $\bm{e}$. If the random variable $\bm{e}$ has small variance, then $R$ is called a partially linear denoiser. Such denoisers can be learned from noisy images only.

In this work, we address the problem of learning efficient denoisers from a set of noisy images, without the need for precise modeling of the noise and without multiple noisy observations per image. We investigate a class of structured nonlinear denoisers and their applications to inverse imaging problems including denoising and deblurring. Specifically, we formulate a given denoiser $R(\cdot)$ as the sum of three terms (cf. Fig. 1): a (possibly nonlinear) function $g(\bm{x})$ of the ground truth signal, a linear mapping $L\bm{n}$ of the noise, and a residual term $\bm{e}$. If the linear factor $L\bm{n}$ is relatively small and the residual term $\bm{e}$, as a random variable, has small variance, then the denoised image mainly depends on the ground truth image and is not sensitive to changes in the noise. In this paper, we study denoisers with the property that $\bm{e}$ has small variance, and we call denoisers of this type partially linear denoisers (cf. Fig. 1). Such denoisers preserve nonlinear image structure (similar to many classic image denoisers), and additionally allow the nonlinearity to be learned from noisy data only.

We observe that some natural denoisers, including deep neural networks, exhibit a certain degree of partial linearity. By exploiting this property, we show that a partially linear denoiser can, in the learning setting, be trained on noisy images by only assuming that the noise has zero mean conditioned on the images and known variance. Specifically, we introduce an auxiliary random vector together with a partially linear constraint, which we show establishes a correspondence between the MSE and a loss function defined without clean images. In this way, one can approximate the best partially linear denoiser with a theoretically guaranteed approximation error bound. Moreover, we propose an image deblurring approach, based on partially linear denoisers, that learns from a single noisy observation of each image.

Different from other existing unsupervised methods for denoising, our approach requires neither a group (or a pair) of noisy observations for each image nor an estimate of the underlying noise distribution. Yet our approach achieves a performance close to its fully supervised counterparts on several denoising benchmark datasets. The proposed partially linear denoisers are learned in an end-to-end manner and can be built on top of any deep neural network architecture for denoising. Once they are trained, the denoised results are obtained without the auxiliary vectors or any post-processing. Furthermore, we demonstrate our denoiser's capacity for handling other image restoration tasks in the absence of ground truth or multiple noisy observations per image, by exploiting the direct approximation of the MSE and its end-to-end learning nature.

II Related work

Over the past decades, a wide variety of image denoising algorithms have been developed, ranging from analytic approaches such as filtering, variational methods, wavelet transforms, and Bayesian estimation, to data-driven approaches such as deep learning. In this section we review some major nonlinear denoisers as well as recent deep learning approaches that do not depend on ground truth images for learning.

II-A Nonlinear denoisers

Natural images are non-Gaussian signals with fine structures such as sharp edges and textures. One of the main challenges in the image denoising task is to preserve these structures while removing the noise. Traditional linear denoisers such as Gaussian filtering and Tikhonov regularization usually do not achieve satisfactory denoising quality, as they tend to smooth the edges. Most of the existing image denoisers in the literature are therefore nonlinear.

Total variation (TV) denoising [30] is one of the most fundamental image denoising algorithms. The TV denoising solution is a minimizer of the optimization problem

$\min_{x}\ \frac{\lambda}{2}\|x-y\|^{2}+\|x\|_{\rm TV}$

in which $y\in\mathbb{R}^{m}$ is the noisy image and $\|x\|_{\rm TV}:=\|\nabla x\|_{1}$ is the total variation of $x\in\mathbb{R}^{m}$. TV denoising is known to preserve sharp edges, thanks to the non-linearity introduced by the TV norm $\|\cdot\|_{\rm TV}$.
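For illustration, the problem above can be solved with off-the-shelf tools. Below is a minimal sketch using the Chambolle solver from scikit-image; note that its weight argument sets the regularization trade-off and plays roughly the role of $1/\lambda$ in the formulation above (an assumption about the parameter mapping, not a statement from the paper).

```python
# Minimal TV denoising sketch using scikit-image's Chambolle solver.
import numpy as np
from skimage.restoration import denoise_tv_chambolle

rng = np.random.default_rng(0)
x = np.zeros((64, 64))
x[:, 32:] = 1.0                               # piecewise-constant image
y = x + 0.1 * rng.standard_normal(x.shape)    # additive Gaussian noise
x_hat = denoise_tv_chambolle(y, weight=0.1)   # TV-regularized estimate
```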

Patch based methods have gained popularity for image denoising because of their capacity to capture the self-similarity of images. The non-local means (NL-means) algorithm proposed by Buades et al. [3] is among the most successful methods in this category. The NL-means algorithm removes noise by calculating a weighted average of the pixel intensities, with weights defined by a patch similarity measure that emphasizes the connection between pixels in similar patches. It is a nonlinear denoiser, unlike local mean filtering, as the weights are image-dependent. As another nonlinear patch based denoising method, the block-matching and 3D filtering (BM3D) algorithm [6] divides image patches into groups based on a similarity criterion and then performs collaborative filtering on each group to clean the patches. While patch based denoisers achieve promising denoising performance, they are often time-consuming due to the high computational complexity of calculating the weights or matching similar patches.

In recent years, convolutional neural network (CNN) based denoisers have become the state of the art for image denoising [41, 42], thanks to the rapid development of deep learning techniques. In particular, CNNs are known to be efficient in modeling image priors [35], which are crucial for the quality of the denoiser. A typical CNN denoiser can be formulated as a composition of mappings called layers, and a basic type of layer $y^{(k)}\rightarrow y^{(k+1)}$ has the form

$y^{(k+1)}=\sigma\left(W*y^{(k)}+b\right),$

where $*$ denotes the convolution operation, $(W,b)$ are parameters of the network that can be learned from the data, and $\sigma(\cdot)$ is an activation function which is often nonlinear. As such, CNN denoisers are in general nonlinear. There has been growing interest in developing CNN denoisers, with new network architectures and building blocks being proposed, such as batch normalization [41], residual connections [10], and residual dense blocks [42].
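As a concrete illustration of this layer structure, the following is a minimal PyTorch sketch of a small denoiser stacked from such convolution-plus-activation layers, with a DnCNN-style residual output; the depth, width, and other hyperparameters are illustrative placeholders, not the settings of [41].

```python
# A small CNN denoiser built from layers y^(k+1) = sigma(W * y^(k) + b).
import torch
import torch.nn as nn

class SmallCNNDenoiser(nn.Module):
    def __init__(self, channels=1, features=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(features, channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, y):
        return y - self.net(y)  # residual connection: the stack predicts the noise

y = torch.randn(1, 1, 40, 40)          # a random "noisy" input
x_hat = SmallCNNDenoiser()(y)          # denoised output, same shape as y
```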

II-B Unsupervised deep learning for denoising

While deep CNN based models have great advantages in image denoising, standard learning techniques are limited by the availability of sufficiently many noise-free images. Recently, CNN-based learning algorithms for image denoising that do not require clean images as training data have been developed. Soltanayev et al. [32] propose a loss function based on Stein's unbiased risk estimator (SURE), which is computed without knowing the ground truth images, as a replacement for the MSE loss function. In the setting of i.i.d. Gaussian noise, SURE provides an unbiased estimate of the MSE, and hence the minimization problems with respect to these two losses are equivalent. Consequently, CNN denoisers can be trained in an unsupervised manner, based on only one noisy realization of each training image. Nevertheless, SURE may not match the MSE in the case of non-Gaussian noise, e.g., shot noise. The SURE training scheme has been extended to the setting of multiple noise realizations per image as well as imperfect ground truths [43].

The Noise2Noise approach developed by Lehtinen et al. [20] offers a different ground-truth-free learning strategy for denoising, based on the assumption that at least two different noisy observations are available for each image. It is found that replacing the clean targets by their noisy observations in the MSE loss function leads to the same minimizers as the original supervised loss if the noise has zero mean and an infinite amount of training data is provided. Specifically, the Noise2Noise loss function is

$\sum_{(y,\hat{y})\in(\mathcal{Y},\hat{\mathcal{Y}})}\|f_{\theta}(\hat{y})-y\|^{2}$ (1)

where $y$ and $\hat{y}$ represent two independent noisy observations of the same image sample, and $f_{\theta}$ is the denoiser parameterized by $\theta$. In contrast to the MSE loss $\sum_{(y,x)\in(\mathcal{Y},\mathcal{X})}\|f_{\theta}(y)-x\|^{2}$, where $x$ denotes the ground truth image, the loss function (1) is defined on noisy pairs $(y,\hat{y})$ only. If the noise in $y$ and the noise in $\hat{y}$ are independent and have zero mean, it is shown that the minimization of the loss (1) with respect to $\theta$ is equivalent to minimizing the MSE [20]. This implies that the parameter $\theta$ can be computed from a training set of noisy pairs. Besides, the Noise2Noise approach relies neither on an explicit image prior nor on structural knowledge of the noise model.
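In code, the Noise2Noise loss (1) simply replaces the clean target of the MSE by the second noisy observation. A minimal PyTorch sketch (model is a placeholder for any denoising network $f_{\theta}$):

```python
import torch.nn.functional as F

def noise2noise_loss(model, y, y_hat):
    """y, y_hat: two independent noisy observations of the same images."""
    return F.mse_loss(model(y_hat), y)   # noisy target in place of clean x
```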

In certain denoising tasks, however, the acquisition of two or more noisy copies per image can be very expensive or impractical, in particular in medical imaging where patients move during the acquisition, or in videos with moving cars, etc. Several authors have investigated learning techniques that overcome this restriction [8, 36, 2, 15, 16]. The Frame2Frame approach [8] developed by Ehret et al. fine-tunes a denoiser for blind video denoising. The idea is to use optical flow to warp one video frame to match its neighboring frame, and then minimize the Noise2Noise loss (1) with $(y,\hat{y})$ being the pair of matched frames. The work [27] generalizes Noise2Noise to the setting of a single noisy realization of each image. Synthetic noise, drawn from the same distribution as the underlying noise, is added to the noisy image $y$, and the resulting noisier image is used to replace the second observation $\hat{y}$ in the training loss (1). At test time, the raw output of the network is post-processed, by computing a linear combination of the output and the input, to obtain the denoising results. In this approach, the synthetic noise has to be of the same type as the underlying noise, and the post-processing can magnify the errors in the output of the network. Batson et al. [2] show that a denoiser $f_{\theta}(\cdot)$ satisfying a so-called $\mathcal{J}$-invariance property for a partition $\mathcal{J}$ of the image pixels can be trained without access to a second observation of the image, provided that the noise is independent across the different regions in $\mathcal{J}$. A function $f_{\theta}(\cdot)$ is said to be $\mathcal{J}$-invariant if, for each subset of pixels $J\in\mathcal{J}$, the values of $f_{\theta}(y)$ on $J$ can be calculated without knowing the values of $y$ on $J$. By additionally assuming that the noise of $y$ on different elements of $\mathcal{J}$ is independent and has zero mean, one minimizes the loss

$\sum_{y\in\mathcal{Y}}\left\|f_{\theta}(y)-y\right\|^{2}=\sum_{y\in\mathcal{Y}}\sum_{J\in\mathcal{J}}\left\|[f_{\theta}(y)]_{J}-y_{J}\right\|^{2}=\sum_{J\in\mathcal{J}}\sum_{y\in\mathcal{Y}}\left\|[f_{\theta}(y)]_{J}-y_{J}\right\|^{2}$ (2)

where the subscript $J$ in $y_{J}$ denotes the restriction of the image $y$ to the pixel collection $J$. It should be noted that, for a $\mathcal{J}$-invariant function $f_{\theta}(\cdot)$ and any $J\in\mathcal{J}$, the loss $\sum_{y\in\mathcal{Y}}\|[f_{\theta}(y)]_{J}-y_{J}\|^{2}$ can be interpreted as a variant of (1), given that the noise contained in $y_{J}$ is independent of $[f_{\theta}(y)]_{J}$ and has zero mean. This enables the training of a model using a set of noisy images only. However, as the denoiser is a $\mathcal{J}$-invariant function, the images cannot be perfectly reconstructed as the noise level decreases to zero; i.e., the denoiser cannot learn the identity mapping, which is clearly not $\mathcal{J}$-invariant. The unused information from $y$ can be further leveraged if the noise model is known or can be estimated. Krull et al. [16] propose to combine $y$ with the network predictions, based on a probabilistic model for each pixel and a Bayesian formulation, to obtain the minimum MSE estimate. This integration of statistical inference effectively removes the noise remaining in the predictions. A downside of the method is that it requires an explicit expression of the posterior distribution of the noisy images beforehand, and it is not end-to-end, because the network outputs intermediate results that are refined in a post-processing step.
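A minimal sketch of a $\mathcal{J}$-invariant training loss in this spirit is given below. Masking the selected pixels to zero is an illustrative simplification (Noise2Self instead replaces them, e.g., by local averages), and model is a placeholder network:

```python
import torch
import torch.nn.functional as F

def masked_self_supervised_loss(model, y, mask):
    """mask: boolean tensor selecting a subset J of pixels."""
    y_masked = y.clone()
    y_masked[mask] = 0.0                     # prediction at J never sees y on J
    pred = model(y_masked)
    return F.mse_loss(pred[mask], y[mask])   # compare only on J
```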

III The proposed method

We consider image restoration problems of the form

$\bm{y}=A\bm{x}+\bm{n}$ (3)

in which $\bm{x}$ and $\bm{n}$ are random vectors of unknown distributions representing the ground truth images and the noise, respectively, $\bm{y}$ is the noisy image from which we want to restore $\bm{x}$, and $A$ is a linear forward operator determined by the data acquisition process. In this paper, random vectors are always denoted by boldface lower-case letters such as $\bm{x}$, $\bm{y}$, $\bm{z}$. For ease of presentation, in this section we first focus on denoising problems, i.e., $A:=I$ is the identity map. The more general case where $A$ is not the identity is discussed in Subsection III-E.

For Problem (3), the only assumption we make on the noise distribution is that it has zero mean conditioned on the image, i.e.,

• (A1) $\mathbb{E}(\bm{n}\mid\bm{x})=0$.

This assumption holds for various common types of noise, including Gaussian white noise, Poisson noise, as well as some mixed noise models. We underline, however, that the noise $\bm{n}$ is not assumed to be independent of the image or pixel-wise independent.

The central issue in image denoising is to find an operator, denoted by $R$, that takes the noisy image $\bm{y}$ to the clean image $\bm{x}$ or an approximation of it. In this work we are interested in a class of denoisers that can be decomposed as

$R(\bm{y})=g(\bm{x})+L\bm{n}+\bm{e},$ (4)

where $g$ is a (possibly nonlinear) function of the clean image $\bm{x}$, $L:=L_{\bm{x}}$ is a linear operator, and $\bm{e}:=\bm{e}_{\bm{x},\bm{n}}$ is a residual term with small variance. In the rest of the paper, we omit the subscripts of $L_{\bm{x}}$ and $\bm{e}_{\bm{x},\bm{n}}$ if there is no ambiguity.

Remark 1.

For a given denoiser, the decomposition (4) always exists, but in this work we only consider the setting in which $\bm{e}$ has small variance. Furthermore, the decomposition (4) is not unique, since $L$ can be an arbitrary linear operator. In order to characterize the structure of the denoiser, we therefore assume $\mathbb{E}(\bm{e}\mid\bm{x})=0$ and let $L$ be chosen such that $\mathbb{E}(\|\bm{e}\|^{2}\mid\bm{x})$ is minimized. As $\bm{n}$ has zero mean, the nonlinear term is then determined by $g(\bm{x})=\mathbb{E}(R(\bm{y})\mid\bm{x})$.

If $R$ satisfies (4) with $\bm{e}$ of small variance, then we call $R$ a partially linear denoiser. The first term $g(\bm{x})$, which can be nonlinear, does not depend on realizations of the noise $\bm{n}$. This formulation implies that the non-linearity of the denoisers in this class is mainly due to intrinsic image structures, and that the denoisers respond to the noise in a linear (encoded in $L\bm{n}$) or insensitive manner (encoded in the small variance of $\bm{e}$). In fact, for any denoiser that effectively removes noise from images, the output must depend only weakly on the realization of the noise; therefore, when written in the form (4), the noise dependent components on the right-hand side should be small in variance compared to the noise, which implies that $\mathbb{E}(\|\bm{e}\|^{2})$ is small.

The selection of a good denoiser $R$ requires knowledge of certain prior information about the target $\bm{x}$, especially when we want to recover fine details such as edges from the corrupted image. Many conventional analytical approaches aim to find the reconstruction $R(\bm{y})$ in a lower-dimensional space in which the images lie. This can be done, for example, by assuming some sparsity properties in the gradient field or the wavelet domain.

In a data-driven setting, given the clean image $\bm{x}$, one could instead minimize the mean squared error (MSE) defined as $\mathcal{J}_{0}(R):=\mathbb{E}(\|R(\bm{y})-\bm{x}\|^{2})$, where the expectation is taken over $\bm{x}$ and $\bm{n}$. The minimization of $\mathcal{J}_{0}$ leads to the conditional mean $R_{0}(\bm{y}):=\mathbb{E}(\bm{x}\mid\bm{y})$. Next, we discuss how to approximate $R_{0}$ in the absence of the clean image $\bm{x}$.

III-A Auxiliary random vectors

The motivation for this paper comes from the fact that, in practice, the distribution of $\bm{x}$ is often unknown, and samples of $\bm{x}$ (i.e., the ground truth images) or of the noise $\bm{n}$ are not readily available. What can easily be accessed are noisy observations $\bm{y}$. With these alone, however, a direct evaluation of the MSE is not possible. In this work, we circumvent the need for $\bm{x}$ by introducing an auxiliary random vector and replacing the MSE $\mathcal{J}_{0}$ by an approximation. First, let $\bm{z}$ be a random vector satisfying the following assumptions:

• (A2) the conditional mean $\mathbb{E}(\bm{z}\mid\bm{x})=0$;

• (A3) $\bm{z}$ is independent of $\bm{n}$;

• (A4) the conditional covariance ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})={\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$.

The auxiliary vector $\bm{z}$ does not necessarily need to follow the same distribution as $\bm{n}$. Samples of $\bm{z}$ can therefore be generated easily, e.g., from Gaussian distributions, once the variance of the noise is known. Then, associated with $\bm{z}$, we define

$\begin{cases}\bm{\hat{n}}:=\bm{n}+\alpha\bm{z},\\ \bm{\hat{y}}:=\bm{y}+\alpha\bm{z},\end{cases}$ (5)

where $\alpha$ is a real constant. The new random vector $\bm{\hat{y}}=\bm{x}+\bm{\hat{n}}$ can be regarded as a noisy version of the image $\bm{x}$ with noise vector $\bm{\hat{n}}$. In the following, the discussion focuses on denoising $\bm{\hat{y}}$ rather than $\bm{y}$, but the objective remains unchanged, namely recovering the same clean image $\bm{x}$. Specifically, if one can obtain a high-quality estimate of $\bm{x}$ from $\bm{\hat{y}}$, then there exists an algorithm taking $\bm{y}$ to the clean image, because $\bm{\hat{y}}$ can be computed from $\bm{y}$. Such an algorithm is given, for instance, by $R(V(\bm{y}))$, where $R$ is a denoiser for the $\bm{\hat{y}}$ problem and $V(\bm{y}):=\bm{\hat{y}}$. It should be noted that the denoising problem becomes harder because of the additional uncertainty, coming from the auxiliary random vector $\bm{z}$, encoded in the observed data $\bm{\hat{y}}$. However, one benefit of considering $\bm{\hat{y}}$ is that the quantity $\bm{z}$ is known and can be leveraged to construct approximations to the MSE without knowing $\bm{x}$ or $\bm{n}$, as we show in the following.
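A minimal sketch of the construction (5), assuming the conditional noise standard deviation sigma is known so that Gaussian samples of $\bm{z}$ satisfy (A2)-(A4):

```python
import torch

def make_auxiliary_pair(y, sigma, alpha=0.5):
    # z need not follow the noise distribution; only its first two
    # moments are constrained by (A2)-(A4), so Gaussian samples suffice.
    z = sigma * torch.randn_like(y)
    y_hat = y + alpha * z
    return y_hat, z
```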

For the $\bm{\hat{y}}$ denoising problem, the MSE associated with the denoiser $R$ is defined as

$\mathcal{J}_{\rm mse}(R):=\mathbb{E}\left(\|R(\bm{\hat{y}})-\bm{x}\|^{2}\right)$ (6)

where the expectation $\mathbb{E}$ is taken over the random variables $\bm{x}$, $\bm{n}$ and $\bm{z}$. This is connected to $\mathcal{J}_{0}$ via $\mathcal{J}_{\rm mse}(R)=\mathcal{J}_{0}(R\circ V)$. Since $V$ is known, it can be shown that $\min_{R}\mathcal{J}_{0}(R)\leq\min_{R}\mathcal{J}_{\rm mse}(R)$. Nevertheless, if $\alpha$ is close to zero, the noise distributions of $\bm{y}$ and $V(\bm{y})$ are close, so the minimum of $\mathcal{J}_{0}$ can be approximated by the minimum of $\mathcal{J}_{\rm mse}$, which promises similar denoising quality.

Now, using the auxiliary vector $\bm{z}$, we define the following objective function for our proposed partially linear denoiser:

$\mathcal{J}(R):=\mathbb{E}\left(\|R(\bm{\hat{y}})-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)$ (7)

for $\alpha\neq 0$. Indeed, as we will see in Subsections III-B and III-C, if we consider the partially linear denoising model, then $\mathcal{J}$ provides a good estimate of $\mathcal{J}_{\rm mse}$. More precisely, following the definition (4), we consider a set of denoisers for $\bm{\hat{y}}$ that have the form

$R(\bm{\hat{y}})=g(\bm{x})+L\bm{\hat{n}}+\bm{e}$ (8)

where $\bm{\hat{n}}$ is defined in (5), $L$ and $\bm{e}$ depend on $\bm{x}$ and $(\bm{x},\bm{\hat{n}})$, respectively, and the residual $\bm{e}$ has small variance.

In the rest of this section, we show that for $\alpha\neq 0$ and a partially linear denoiser $R$, the term $\mathcal{J}(R)$ in (7) approximates the MSE $\mathcal{J}_{\rm mse}(R)$ up to an additive constant. This implies that an optimal denoiser of this class can be computed even if the distributions of the ground truth images $\bm{x}$ and of the noise $\bm{n}$ are unknown. We then discuss unsupervised learning methods, based on our proposed objective $\mathcal{J}(R)$, for image restoration tasks such as denoising and deblurring.
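For concreteness, a minimal sketch of the objective (7) evaluated empirically on a batch of noisy images; model is a placeholder for $R$, and the unavailable clean target is replaced by $y-z/\alpha$:

```python
import torch
import torch.nn.functional as F

def partially_linear_objective(model, y, z, alpha):
    y_hat = y + alpha * z          # corrupted input as in (5)
    target = y - z / alpha         # surrogate target in (7)
    return F.mse_loss(model(y_hat), target)
```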

III-B Linear minimum mean square error estimator

To start with, we focus on the simplest case, in which the denoiser $R$ is linear. This is a subset of the set of partially linear denoisers (8), with $g=L$ and $\bm{e}=0$. The best estimator in this setting is the linear minimum mean square error (LMMSE) estimator, i.e., the minimizer of $\mathcal{J}_{\rm mse}(R)$ in (6) over all linear denoisers $R$. The following proposition establishes the equivalence between the MSE loss and $\mathcal{J}(R)$ in (7) over linear denoisers.

Proposition 1.

If $\bm{n}$ and $\bm{z}$ satisfy Assumptions (A1)-(A4) and $R$ is linear, then there is a constant $c$ not depending on $R$ such that

$\mathcal{J}(R)=\mathbb{E}\left(\|R\bm{\hat{y}}-\bm{x}\|^{2}\right)+c.$ (9)
Proof.

If $R$ is linear, then from the definitions of $\bm{y}$ and $\bm{\hat{y}}$,

$R\bm{\hat{y}}-(\bm{y}-\bm{z}/\alpha)=(R\bm{\hat{y}}-\bm{x})-(\bm{n}-\bm{z}/\alpha),$

and $R\bm{\hat{y}}-\bm{x}=(R\bm{x}-\bm{x})+(R\bm{n}+\alpha R\bm{z})$. Let $\langle\cdot,\cdot\rangle$ denote the inner product. Then, using the above, (7) can be rewritten as

$\begin{split}\mathcal{J}(R)=&\ \mathbb{E}\left(\|R\bm{\hat{y}}-\bm{x}\|^{2}\right)+\mathbb{E}\left(\|\bm{n}-\bm{z}/\alpha\|^{2}\right)\\&-2\mathbb{E}\left(\langle R\bm{x}-\bm{x},\bm{n}-\bm{z}/\alpha\rangle\right)\\&-2\mathbb{E}\left(\langle R\bm{n}+\alpha R\bm{z},\bm{n}-\bm{z}/\alpha\rangle\right).\end{split}$ (10)

Here the expectations are taken over $\bm{x}$, $\bm{n}$ and $\bm{z}$. We next show that the last two terms on the right-hand side of (10) vanish.

First, since $\bm{n}$ has zero mean conditioned on $\bm{x}$ by Assumption (A1), it holds that $\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{n}\rangle)=\mathbb{E}(\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{n}\rangle\mid\bm{x}))=0$. The same property applies to $\bm{z}$ by Assumption (A2), so $\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{z}\rangle)=0$. Taking a linear combination of these two equalities gives

$\mathbb{E}(\langle R\bm{x}-\bm{x},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (11)

Second, given that $R$ is linear and both $\bm{n}$ and $R\bm{n}$ have zero mean conditioned on $\bm{x}$, we have

$\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})={\rm tr}\left({\rm Cov}(R\bm{n},\bm{n}\mid\bm{x})\right)={\rm tr}\left(R\,{\rm Cov}(\bm{n},\bm{n}\mid\bm{x})\right)$

in which ${\rm tr}(\cdot)$ denotes the trace of a matrix. The same equality holds for $\bm{z}$, i.e.,

$\mathbb{E}(\langle R\bm{z},\bm{z}\rangle\mid\bm{x})={\rm tr}\left(R\,{\rm Cov}(\bm{z},\bm{z}\mid\bm{x})\right).$

Therefore, it follows from Assumption (A4) that $\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})=\mathbb{E}(\langle R\bm{z},\bm{z}\rangle\mid\bm{x})$. This together with Assumption (A3) gives

$\mathbb{E}(\langle R\bm{n}+\alpha R\bm{z},\bm{n}-\bm{z}/\alpha\rangle\mid\bm{x})=\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})+\mathbb{E}(\langle\alpha R\bm{z},-\bm{z}/\alpha\rangle\mid\bm{x})=0.$ (12)

Taking the expectation with respect to $\bm{x}$, we obtain

$\mathbb{E}(\langle R\bm{n}+\alpha R\bm{z},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (13)

Finally, let $c:=\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})$; then Equations (10), (11) and (13) imply (9), which completes the proof. ∎

According to Proposition 1, for all linear denoisers, the term $\mathcal{J}(R)$ in (7) differs from the MSE by a constant $c$ not depending on $R$. Therefore, a minimization of $\mathcal{J}(R)$ over all linear $R$ also leads to the LMMSE estimator. We remark that the objective function (7) is not defined in terms of the ground truth image $\bm{x}$, and the distribution of the random vector $\bm{n}$ is not necessarily known or equal to that of $\bm{z}$. In fact, the denoising problem can be reformulated as follows. Given $R$, the quantity $R\bm{\hat{y}}$ is known, and we want to decouple the term $R\bm{\hat{y}}-\bm{x}$ from

$R\bm{\hat{y}}-\bm{y}=(R\bm{\hat{y}}-\bm{x})-\bm{n}.$ (14)

In the loss $\mathcal{J}(R)$, this is implemented by adding the scaled auxiliary vector $\bm{z}/\alpha$ to both sides of (14). The vector $\bm{z}/\alpha$ compensates the noise $\bm{n}$ in the sense that, when taking the expectation of $\|R\bm{\hat{y}}-\bm{y}+\bm{z}/\alpha\|^{2}$ over $\bm{n}$ and $\bm{z}$, the $\bm{n}$-related term $\mathbb{E}(\langle R\bm{n},\bm{n}\rangle\mid\bm{x})$ is canceled by the $\bm{z}$-related term $\mathbb{E}(\langle R\bm{z},\bm{z}\rangle\mid\bm{x})$ (i.e., the last equality in (12)), which leads to (9).

III-C Optimal partially linear denoiser

Though linear denoisers have the nice property of linking (7) to the MSE, they are generally not the best denoisers for imaging data. Nonlinearity is unavoidable in order to achieve good denoising quality and preserve fine structures such as edges in the image. To this end, we consider a more general class of denoisers that are partially linear, as expressed in (8), where $g$ may be nonlinear and $\bm{e}$ is not necessarily zero.

In the special case where the denoiser outputs exactly the clean image, i.e., $R(\bm{\hat{y}}):=\bm{x}$, $R$ is also partially linear, with $L$ and $\bm{e}$ both zero. However, such denoisers do not exist in most settings. Nevertheless, for good denoisers one can still expect the residual term $\bm{e}$ to be small.

The equivalence of the two objective functions stated in Proposition 1 does not hold for nonzero $\bm{e}$. It is therefore of interest to know how well (7) approximates the MSE in this general case. The next proposition quantifies the approximation error in the presence of a small nonzero residual term $\bm{e}$.

Proposition 2.

If $\bm{n}$ and $\bm{z}$ satisfy Assumptions (A1)-(A4) and $R$ satisfies (8), then there is a constant $c$ not depending on $R$ such that

$\mathcal{J}(R)=\mathbb{E}\left(\|R(\bm{\hat{y}})-\bm{x}\|^{2}\right)-2\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)+c.$ (15)

Additionally, if the variance $\mathbb{E}(\|\bm{e}\|^{2})\leq\epsilon^{2}$ for some $\epsilon>0$, then the approximation error satisfies

$\begin{split}{\rm Err}:=&\left|\mathcal{J}(R)-\mathbb{E}(\|R(\bm{\hat{y}})-\bm{x}\|^{2})-c\right|\\ \leq&\ 2\epsilon\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}.\end{split}$ (16)

If, furthermore, the residual $\bm{e}(\bm{x},\bm{\hat{n}})$ is Lipschitz continuous with respect to $\bm{\hat{n}}$, i.e., $\|\bm{e}(\bm{x},\bm{\hat{n}}_{1})-\bm{e}(\bm{x},\bm{\hat{n}}_{2})\|\leq K\|\bm{\hat{n}}_{1}-\bm{\hat{n}}_{2}\|$ for some constant $K$, then

${\rm Err}\leq 2\epsilon\sqrt{\mathbb{E}(\|\bm{n}\|^{2})}+2K\mathbb{E}(\|\bm{z}\|^{2}).$ (17)
Proof.

By the decomposition of $R$ in (8),

$R(\bm{\hat{y}})-(\bm{y}-\bm{z}/\alpha)=\left(g(\bm{x})+L\bm{\hat{n}}+\bm{e}-\bm{x}\right)-\left(\bm{n}-\bm{z}/\alpha\right),$ (18)

where $\bm{\hat{n}}$ is defined in (5). For the function $g(\cdot)$, it is clear from Assumptions (A1) and (A2) that $\mathbb{E}(\langle g(\bm{x}),\bm{n}-\bm{z}/\alpha\rangle\mid\bm{x})=0$. Therefore,

$\mathbb{E}(\langle g(\bm{x})-\bm{x},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (19)

Moreover, as $L$ is linear, an argument similar to the proof of (13) shows that

$\mathbb{E}(\langle L\bm{\hat{n}},\bm{n}-\bm{z}/\alpha\rangle)=0.$ (20)

By Equations (18), (19) and (20), letting $c:=\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})$, Equation (15) holds.

Next, we bound $\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)$. Assuming that $\mathbb{E}(\|\bm{e}\|^{2})\leq\epsilon^{2}$, a straightforward application of the Cauchy-Schwarz inequality leads to

$\begin{split}\left|\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)\right|&\leq\sqrt{\mathbb{E}(\|\bm{e}\|^{2})}\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}\\&\leq\epsilon\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})},\end{split}$ (21)

and therefore Equation (16) holds.

Furthermore, assume that $\bm{e}:=\bm{e}(\bm{x},\bm{n}+\alpha\bm{z})$ is Lipschitz continuous w.r.t. $\bm{\hat{n}}:=\bm{n}+\alpha\bm{z}$ with Lipschitz constant $K$. Since $\bm{z}$ has zero mean conditioned on $\bm{x}$ and $\bm{n}$, the conditional expectation

$\begin{split}\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle\mid\bm{x},\bm{n})=&\ \mathbb{E}(\langle\bm{e}_{0},\bm{z}/\alpha\rangle\mid\bm{x},\bm{n})+\mathbb{E}(\langle\bm{e}-\bm{e}_{0},\bm{z}/\alpha\rangle\mid\bm{x},\bm{n})\\ \leq&\ 0+\mathbb{E}(K\langle\bm{z},\bm{z}\rangle\mid\bm{x},\bm{n}),\end{split}$

where $\bm{e}_{0}:=\bm{e}(\bm{x},\bm{n}+\bm{0})$. So $\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle)\leq K\mathbb{E}(\|\bm{z}\|^{2})$, and by symmetry $\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle)\geq-K\mathbb{E}(\|\bm{z}\|^{2})$. Finally,

$\begin{split}\left|\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)\right|\leq&\left|\mathbb{E}(\langle\bm{e},\bm{n}\rangle)\right|+\left|\mathbb{E}(\langle\bm{e},\bm{z}/\alpha\rangle)\right|\\ \leq&\ \epsilon\sqrt{\mathbb{E}(\|\bm{n}\|^{2})}+K\mathbb{E}(\|\bm{z}\|^{2}),\end{split}$

and the inequality (17) follows. ∎

From the bounds given in Proposition 2, if the variance of $\bm{e}$ is small, then the loss (7) provides a good estimate of the MSE. Note that, under the Lipschitz continuity assumption on $\bm{e}$, the error bound in (16) can be further improved to (17), which is independent of $\alpha$ (in contrast to (16), where the bound contains a $1/\alpha$ factor). In practice, this Lipschitz continuity assumption is not restrictive, as we expect the denoiser $R$ to be stable with respect to small perturbations in $\bm{\hat{y}}$ and therefore to have a small Lipschitz constant $K$ for $\bm{e}$.

Remark 2 (Estimated noise variance).

If Assumption (A4) does not hold, then the bound (16) also depends on the linear operator $L$. Consider for instance ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})=(1+\beta){\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$ with $\beta>-1$. Then Equation (15) becomes $\mathcal{J}(R)=\mathbb{E}(\|R(\bm{\hat{y}})-\bm{x}\|^{2})-2\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)+c+2\beta\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$; the proof is given in Appendix A. The minimization of $\mathcal{J}(R)$ therefore favors denoisers with small $\beta\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$. Such a property can be used to construct criteria for whether the noise variance is underestimated ($\beta<0$) or overestimated ($\beta>0$), e.g., by checking the value of $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ for the minimizer of $\mathcal{J}$. Examples will be given in Section IV-B.

Next, we take a closer look at the errors of the proposed denoisers. We first introduce the set $\mathcal{R}_{\epsilon}$ of partially linear denoisers. Second, we estimate the distance between the minimizers of $\mathcal{J}$ and $\mathcal{J}_{\rm mse}$ over $\mathcal{R}_{\epsilon}$. Then we verify the convergence of the proposed denoisers to the best denoisers for corrupted images generated from a single ground truth image. Finally, focusing on a special subset of images consisting of constant patches, we demonstrate the partial linearity of the optimal denoisers with an example.

For $\epsilon>0$, let $\mathcal{R}_{\epsilon}$ be the set of denoisers satisfying (8) with the variance of $\bm{e}$ less than or equal to $\epsilon^{2}$, i.e.,

$\mathcal{R}_{\epsilon}:=\left\{R\mid\exists\,g,L,\bm{e}\ \text{such that}\ R(\bm{\hat{y}})=g(\bm{x})+L\bm{\hat{n}}+\bm{e},\ L\ \text{is linear, and}\ \mathbb{E}(\|\bm{e}\|^{2})\leq\epsilon^{2}\right\}.$
Lemma 3 (Convexity).

For any $\epsilon\geq 0$, $\mathcal{R}_{\epsilon}$ is a convex set.

Proof.

If $R_{1},R_{2}\in\mathcal{R}_{\epsilon}$, then there exist $g_{1},g_{2}$, linear operators $L_{1},L_{2}$, and residual terms $\bm{e}_{1},\bm{e}_{2}$ such that

$R_{i}(\bm{\hat{y}})=g_{i}(\bm{x})+L_{i}\bm{\hat{n}}+\bm{e}_{i},\quad\mathbb{E}(\|\bm{e}_{i}\|^{2})\leq\epsilon^{2},\quad i\in\{1,2\}.$

For any $\lambda_{1}\in[0,1]$ and $\lambda_{2}:=1-\lambda_{1}$,

$\lambda_{1}R_{1}(\bm{\hat{y}})+\lambda_{2}R_{2}(\bm{\hat{y}})=\left(\sum_{i=1}^{2}\lambda_{i}g_{i}\right)(\bm{x})+\left(\sum_{i=1}^{2}\lambda_{i}L_{i}\right)\bm{\hat{n}}+\sum_{i=1}^{2}\lambda_{i}\bm{e}_{i}.$

It is easy to see that $\sum_{i=1}^{2}\lambda_{i}L_{i}$ is also linear and, by the convexity of the squared norm, $\mathbb{E}(\|\sum_{i=1}^{2}\lambda_{i}\bm{e}_{i}\|^{2})\leq\epsilon^{2}$. Therefore $\lambda_{1}R_{1}+\lambda_{2}R_{2}\in\mathcal{R}_{\epsilon}$, which implies that $\mathcal{R}_{\epsilon}$ is convex. ∎

We are interested in the best denoisers in $\mathcal{R}_{\epsilon}$, i.e., those with MSE as small as possible. More precisely, for small $\delta\geq 0$, we define the MMSE denoiser, denoted by $R^{\epsilon,\delta}$, as a denoiser in $\mathcal{R}_{\epsilon}$ that satisfies

$\mathcal{J}_{\rm mse}(R^{\epsilon,\delta})\leq\inf_{R\in\mathcal{R}_{\epsilon}}\mathcal{J}_{\rm mse}(R)+\delta,$ (22)

where $\mathcal{J}_{\rm mse}$ is defined in (6). Note that $R^{\epsilon,\delta}$ exists for arbitrarily small positive $\delta$. Based on the approximation properties stated in Proposition 2, we propose a denoiser $\hat{R}^{\epsilon,\delta}\in\mathcal{R}_{\epsilon}$ satisfying

$\mathcal{J}(\hat{R}^{\epsilon,\delta})\leq\inf_{R\in\mathcal{R}_{\epsilon}}\mathcal{J}(R)+\delta$ (23)

as an alternative to $R^{\epsilon,\delta}$. In light of the theoretical bound (16), the distance between $R^{\epsilon,\delta}$ and $\hat{R}^{\epsilon,\delta}$ admits the upper bound described in the following proposition.

Proposition 3.

Let the denoisers $R^{\epsilon,\delta}$ and $\hat{R}^{\epsilon,\delta}$ be given by (22) and (23), respectively. If $\bm{n}$ and $\bm{z}$ satisfy Assumptions (A1)-(A4) and $0<\delta<1$, it holds that

$\mathbb{E}\left(\|R^{\epsilon,\delta}(\bm{\hat{y}})-\hat{R}^{\epsilon,\delta}(\bm{\hat{y}})\|^{2}\right)\leq\frac{2\epsilon}{1-\sqrt{\delta}}\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}+\frac{\sqrt{\delta}}{1-\sqrt{\delta}},$

where the expectation is taken over $\bm{x}$, $\bm{n}$ and $\bm{z}$. The proof is given in Appendix A. The bound in Proposition 3 converges to $2\epsilon\sqrt{\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})}$ as $\delta$ tends to $0$, and it converges to $0$ as both $\epsilon$ and $\delta$ converge to zero.

Building on Proposition 3, we further analyze the convergence of the proposed solutions in the special case where all realizations of the corrupted images $\bm{\hat{y}}$ are generated from the same clean image. The next corollary shows that, in this setting, the approximation error is at most of the same order as $\epsilon^{2}$, regardless of the noise distribution or the noise variance. The proof of this corollary is given in Appendix A.

Corollary 3.

Assume that the conditions in Proposition 3 hold. If $\bm{x}$ follows a delta distribution, i.e., $P(\bm{x}=x)=1$ and $P(\bm{x}\neq x)=0$ for some image $x$, then

$\mathbb{E}\left(\|R^{\epsilon,\delta}(\bm{\hat{y}})-\hat{R}^{\epsilon,\delta}(\bm{\hat{y}})\|^{2}\right)\leq\frac{\epsilon^{2}+2\sqrt{\delta}}{1-\sqrt{\delta}}.$ (24)

Proposition 3 and Corollary 3 illustrate that the minimizers of $\mathcal{J}$ are good estimates of the best partially linear denoiser when $\epsilon$ is small. In principle, $\bm{e}$ should be much smaller than the noise itself, as this is a necessary condition for the denoiser to be robust to different realizations of $\bm{\hat{n}}$. We can provide some more insight into the size of the residual term for high-quality denoisers when applied to a special subset of images. Smooth regions are among the most important components of natural images, and they often account for a large fraction of the pixels. Here we study the basic case of constant images and pixelwise independent noise. We give an example to demonstrate the scale of the residual term of the minimum MSE denoiser for constant image patches.

Example (Constant patches).

Assume that $\bm{x}$ models constant patches of size $21\times 21$ pixels. The pixel values are uniformly distributed in $[0,\lambda]$, where $\lambda>0$ is a constant. The noise is pixelwise independent conditioned on $\bm{x}$. For pixel $i$, $\bm{\hat{y}}_{i}\sim{\rm Pois}(\bm{x}_{i})$, where ${\rm Pois}$ denotes the Poisson distribution. The optimal denoised images $R_{0}(\bm{\hat{y}})$, where $R_{0}$ minimizes the cost $\mathcal{J}_{\rm mse}(\cdot)$, are therefore constant images. In this example, given $R_{0}$ and $\bm{x}$, the quantities $g(\bm{x})$ and $L$ (defined in (8)) are computed using Remark 1. Consequently, $g(\bm{x})$, $L\bm{\hat{n}}$ and $\bm{e}$ are constant images. The values of $R_{0}(\bm{\hat{y}})$ are plotted against $g(\bm{x})+L\bm{\hat{n}}$ (for different ground truth values and different $\lambda$) in the last two rows of Fig. 2, where the results are computed with a deep CNN approximation of $R_{0}$. Each yellow dot in the plot represents a realization of the noise, and the blue line represents $\bm{e}=0$. The plots suggest that $\bm{e}$ has small variance and hence the set $\mathcal{R}_{\epsilon}$ contains good approximations of $R_{0}$ for small $\epsilon$.

Figure 2: Reconstructions for constant image patches corrupted by Poisson noise with parameter $\lambda=1$, $2$ and $4$ (columns (a)-(c), respectively). Top row: samples of corrupted patches (the ground truth is $\bm{x}=0.5\lambda$). Second and third rows: the optimal reconstructions $R_{0}(\bm{\hat{y}})$ plotted against $g(\bm{x})+L\bm{\hat{n}}$ (yellow dots, each generated with a realization of the noise $\bm{\hat{n}}$). The blue line represents $\bm{e}=0$, i.e., the line through the points $(g(\bm{x})+L\bm{\hat{n}},\,g(\bm{x})+L\bm{\hat{n}})$.
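As a numerical complement to this example, the optimal denoiser for a constant patch with a uniform prior can be evaluated directly: the posterior mean of the patch value depends only on the pixel sum (a sufficient statistic), so a one-dimensional quadrature suffices. The sketch below is an illustration of the example's setting, not the CNN approximation used to produce Fig. 2.

```python
import numpy as np

def mmse_constant_patch(y_patch, lam, grid=2000):
    # Posterior of the constant value u ~ U[0, lam] given y_i ~ Pois(u):
    # p(u | y) is proportional to u^S * exp(-m*u), with S the pixel sum
    # and m the number of pixels.
    m, S = y_patch.size, y_patch.sum()
    u = np.linspace(1e-6, lam, grid)
    log_post = S * np.log(u) - m * u
    w = np.exp(log_post - log_post.max())   # unnormalized posterior weights
    return (u * w).sum() / w.sum()          # posterior mean R_0(y)

rng = np.random.default_rng(0)
lam = 2.0
y = rng.poisson(0.5 * lam, size=(21, 21))   # ground truth x = 0.5 * lam
print(mmse_constant_patch(y, lam))          # close to 0.5 * lam
```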

III-D Learning a denoiser from noisy samples.

For learning the denoiser we consider parametrized denoising models $R_{\theta}$, e.g., deep CNNs, parameterized by $\theta$. Suppose that a set $\{y_{i}\}$ of realizations of $\bm{y}$ is given, e.g., the set of noisy images that one wants to denoise, where the noise satisfies Assumption (A1). Associated with each $y_{i}$, we define $z_{i}$ as a realization of the auxiliary random vector $\bm{z}$ satisfying (A2)-(A4). Since (A2) and (A4) specify only the first two moments of $\bm{z}$, rather than its distribution, one can without loss of generality generate $z_{i}$ randomly from normal distributions. With the auxiliary vector $z_{i}$, the samples of $\bm{\hat{y}}$, denoted by $\hat{y}_{i}$, are computed according to Equation (5). Based on the objective function (7), we minimize the empirical loss function

$\mathcal{L}(\theta):=\sum_{i}\left\|R_{\theta}(\hat{y}_{i})-(y_{i}-z_{i}/\alpha)\right\|^{2}$ (25)

under the condition that $R_{\theta}\in\mathcal{R}_{\epsilon}$. This condition can be implemented with the partial linearity constraint described below.

Partial linearity constraint. To preserve partial linearity during training, a restriction on the variance of the residual term $\bm{e}$ in (8) is required. Unfortunately, a direct evaluation of $\bm{e}$ is not feasible, because 1) the formulation of $\bm{e}$ depends on $g(\bm{x})$ and $L$, which are unknown, and 2) only one noisy observation of each image is given. However, since $g(\bm{x})$ and $L$ depend only on the image $\bm{x}$ and $L$ is linear, we can remove these two terms by perturbing $\bm{\hat{y}}$. One simple way of implementing this is to find $\bm{q}_{1}$ and $\bm{q}_{2}$, two perturbed versions of $\bm{\hat{y}}$, satisfying $\bm{\hat{y}}=\tau\bm{q}_{1}+(1-\tau)\bm{q}_{2}$ with $\tau\in[0,1]$. Let the residual terms associated with $R(\bm{q}_{1})$ and $R(\bm{q}_{2})$ be denoted by $\bm{e}_{1}$ and $\bm{e}_{2}$, respectively. Then, from Equation (8) we have

$R(\bm{\hat{y}})-\tau R(\bm{q}_{1})-(1-\tau)R(\bm{q}_{2})=\bm{e}-\tau\bm{e}_{1}-(1-\tau)\bm{e}_{2}.$ (26)

By the assumption that $\bm{e}$, $\bm{e}_{1}$ and $\bm{e}_{2}$ are small, based on the representation of the residual terms in (26), one can therefore penalize $\psi\left(R(\bm{\hat{y}})-\tau R(\bm{q}_{1})-(1-\tau)R(\bm{q}_{2})\right)$ for some metric $\psi$, as sketched below.
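Putting the pieces together, below is a minimal sketch of one training step combining the empirical loss (25) with a penalty built from (26). The concrete perturbations $\bm{q}_{1},\bm{q}_{2}$ (adding and subtracting a random image so that $\bm{\hat{y}}=\tau\bm{q}_{1}+(1-\tau)\bm{q}_{2}$ holds), the choice $\psi=\|\cdot\|^{2}$, and the penalty weight mu are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def training_step(model, y, sigma, alpha=0.5, tau=0.5, mu=1.0):
    z = sigma * torch.randn_like(y)            # auxiliary vector, (A2)-(A4)
    y_hat = y + alpha * z
    loss_fit = F.mse_loss(model(y_hat), y - z / alpha)   # empirical loss (25)

    # Perturbations satisfying y_hat = tau*q1 + (1-tau)*q2 for the same image.
    d = torch.randn_like(y)
    q1 = y_hat + (1 - tau) * d
    q2 = y_hat - tau * d
    residual = model(y_hat) - tau * model(q1) - (1 - tau) * model(q2)
    loss_pl = residual.pow(2).mean()           # psi(.) = squared L2 norm

    return loss_fit + mu * loss_pl
```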

III-E Learning image deblurring without ground truth images

The proposed partially linear denoisers can be generalized to learning image reconstructors for other linear inverse problems, such as image deblurring, in particular when ground truth images are missing. We repeat the observation model (3) here:

$\bm{y}=A\bm{x}+\bm{n}$ (27)

where $A$ is a linear forward operator (such as the Radon transform for CT reconstruction, or a convolution operator for image deblurring), and $\bm{n}$ models the zero-mean noise. For many inverse imaging problems, the naive reconstruction $A^{\dagger}\bm{y}$ is based on a direct inversion of the linear operator $A$ given by the pseudo-inverse $A^{\dagger}$. Due to the ill-conditioning of $A$, such an inversion is typically unstable and therefore inaccurate in the presence of noise. If $A$, however, has full column rank, then the resulting artifacts in the reconstruction can be remedied by denoising $A^{\dagger}\bm{y}$. In fact, it is easy to see that $\bm{n}^{\dagger}:=A^{\dagger}\bm{y}-\bm{x}=A^{\dagger}\bm{n}$ has zero mean, and the covariance of $\bm{n}^{\dagger}$ can be estimated as long as the covariance of $\bm{n}$ is known. Therefore, a straightforward application of the partially linear denoisers to the noisy reconstruction $A^{\dagger}\bm{y}$ leads to an estimate of $\bm{x}$. Various approaches have been proposed for denoising $A^{\dagger}\bm{y}$ in a data-driven post-processing setup (see e.g., [13, 11]); these learned post-processing techniques require noisy-clean image pairs as training data. Another class of methods, in contrast, leverages denoisers to construct regularization for (27). For instance, the Regularization by Denoising (RED) methods [28, 12] minimize a variational objective function with a regularization term derived from the denoiser. The Plug-and-Play Prior framework [37, 33] is based on variable splitting algorithms for the maximum a posteriori (MAP) optimization problem, with the proximal mapping associated with the regularization term replaced by the denoiser. These denoiser-based regularization approaches require predefined denoisers. Our framework, however, requires neither noisy-clean pairs nor a predefined denoiser. In particular, our proposed partially linear denoiser allows training an estimator from noisy data alone.

Image deblurring is a special case of (27), in which $A$ represents a convolution kernel. In practice, $A$ is governed by different imaging factors, including motion and camera focus. In the context of the proposed partially linear denoiser, one way to solve the deblurring problem is to train a single model $R_{\theta}$ that maps the measurement $\bm{y}$ directly to $\bm{x}$; knowledge of $A$ is then indirectly encoded in $R_{\theta}$. Assuming that the training data contain noisy samples of $\bm{\hat{y}}$ and the blurring kernel $A$, but no ground truth images, then similar to (7) one can minimize the loss function

$\mathbb{E}\left(\|AR_{\theta}(\bm{\hat{y}})-(\bm{y}-\bm{z}/\alpha)\|^{2}\right),$ (28)

and in this case $AR_{\theta}$ acts as a partially linear denoiser. At test time, the deblurred image can be computed directly as $R_{\theta}(\bm{\hat{y}})$, without knowledge of the operator $A$. These considerations also draw connections between the partially linear denoiser and the deep image prior approach [35]. There, an implicit regularizer is introduced, based on a convolutional neural network $R_{\theta}$, and a pure data-fitting loss function is minimized during training, with early stopping applied to prevent the estimator from over-fitting to the noise. With the partial linearity structure, however, we establish a connection between the cost (28) and the MSE of the noise-free measurement $A\bm{x}$, which aims to bring $AR_{\theta}(\bm{\hat{y}})$ close to $A\bm{x}$ rather than to its noisy versions.
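A minimal sketch of the loss (28): the network output is re-blurred by the known kernel before the comparison, so that $AR_{\theta}$ plays the role of the partially linear denoiser. The kernel handling (a per-channel 2-D convolution with an odd-sized kernel tensor of shape (1, 1, k, k)) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def deblur_loss(model, y, kernel, sigma, alpha=0.5):
    """kernel: blur kernel of shape (1, 1, k, k) with odd k (assumption)."""
    z = sigma * torch.randn_like(y)
    y_hat = y + alpha * z
    x_est = model(y_hat)                        # deblurred estimate R_theta(y_hat)
    c = y.shape[1]
    k = kernel.expand(c, 1, -1, -1)             # one copy of the kernel per channel
    Ax = F.conv2d(x_est, k, padding=kernel.shape[-1] // 2, groups=c)
    return F.mse_loss(Ax, y - z / alpha)        # loss (28)
```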

IV Experiments

In this section, we report experimental results that demonstrate the efficiency of the proposed approach for different denoising tasks and for deblurring. (The code will be made available at https://github.com/RK621/Unsupervised-Restoration-PLD.) First, we compare the partial linearity of classical denoisers, including total variation (TV) denoising [30] and BM3D [6], as well as CNN based denoisers (see Subsection IV-A for details). Second, we evaluate our method on denoising problems with synthetic noise (Subsections IV-B1 and IV-B3) and analyse its robustness towards errors in the estimate of the noise variance (Subsection IV-B2). Third, a numerical study of the stability of our approach with respect to varying noise levels is given, and the importance of the partially linear constraint is investigated (Subsection IV-C). Fourth, the proposed approach is used to denoise real microscopy images (Subsection IV-D). Finally, we apply our approach to learn an image deblurring model, using a set of single noisy and blurry observations of the images as training data. The details of the learning methods and the results of the deblurring experiment are presented in Subsection IV-E.

IV-A Partial linearity of denoisers

In this test, we investigate the partial linearity of some existing standard denoising approaches, namely the TV approach [30], BM3D [6] and DnCNN [41]. For the convenience of the reader, the decomposition (8) defining the partially linear denoiser is repeated here:

$R(\bm{\hat{y}})=g(\bm{x})+L\bm{\hat{n}}+\bm{e},$ (29)

where $R$ is the underlying denoiser. In particular, the DnCNN is trained by minimizing the standard empirical MSE loss (6) on a training set of 400 images with ground truth [41]. For a given image $\bm{x}$, we compute the decomposition (29) using Remark 1. Specifically, $g(\bm{x})=\mathbb{E}(R(\bm{\hat{y}})\mid\bm{x})$ and $L=\arg\min_{L_{0}\in\mathcal{L}}\mathbb{E}(\|R(\bm{\hat{y}})-g(\bm{x})-L_{0}\bm{\hat{n}}\|^{2})$, where the expectation is taken over $\bm{\hat{n}}$ and $\mathcal{L}$ denotes the set of linear operators. In this example, we use a standard test image called parrot (cf. top middle of Fig. 1) as the ground truth image, i.e., a realization of $\bm{x}$. The noise $\bm{\hat{n}}$ is i.i.d. Gaussian with zero mean and standard deviation 25 (corresponding to 256 gray levels). The expectations are evaluated using 20000 random realizations of the pair $(\bm{\hat{y}},R(\bm{\hat{y}}))$.
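The decomposition can be estimated by such Monte Carlo sampling. A minimal sketch for a single pixel $i$ is given below; denoise stands for any fixed denoiser (e.g., a wrapper around TV, BM3D, or DnCNN), and the sample count is an illustrative stand-in for the 20000 realizations used above.

```python
import numpy as np

def decompose_at_pixel(denoise, x, sigma, i, n_samples=2000, seed=0):
    """Estimate g(x)_i, the i-th row of L, and residual samples e_i."""
    rng = np.random.default_rng(seed)
    noises, outputs = [], []
    for _ in range(n_samples):
        n_hat = sigma * rng.standard_normal(x.shape)
        noises.append(n_hat.ravel())
        outputs.append(denoise(x + n_hat).ravel()[i])
    N, r = np.stack(noises), np.asarray(outputs)
    g_i = r.mean()                                      # g(x)_i = E(R(y_hat)_i | x)
    l_i, *_ = np.linalg.lstsq(N, r - g_i, rcond=None)   # least-squares row of L
    e_i = r - g_i - N @ l_i                             # residual samples
    return g_i, l_i, e_i
```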

TABLE I: The PSNR of $R$ (first row), the variance of $\bm{e}$ averaged over all pixels (second row), and the PSNR of the modified denoiser $\hat{R}:=g(\bm{x})+L\hat{\bm{n}}$ (third row) on the image parrot.
Denoiser TV BM3D DnCNN
PSNR (dB) 27.62 28.87 29.47
$\epsilon^{2}/m$ $8.751\times 10^{-5}$ $4.875\times 10^{-5}$ $4.281\times 10^{-5}$
Modified PSNR (dB) 27.82 29.00 29.60

For the three denoisers, we report the variance of $\bm{e}$ averaged over the pixels of the parrot image (i.e., $\epsilon^{2}/m$, where $m$ denotes the number of pixels in $\bm{e}$) in Table I. A comparison of the accuracy of the methods, measured in PSNR, is also given in the table. All figures reported in the table are averaged over 8000 independent runs with different realizations of $\hat{\bm{y}}$. Based on the table, the DnCNN achieves the best denoising quality, outperforming BM3D by around 0.5 dB and the TV approach by around 1.8 dB. All three methods have $\epsilon^{2}/m$ less than $10^{-4}$, and the value for the CNN denoiser is about half of that of the TV method.

Figure 3: The partial linearity of three denoisers: (a) TV, (b) BM3D, (c) DnCNN. Top row: denoised images produced by each method. Bottom row: the values of $[R(\hat{\bm{y}})-g(\bm{x})]_{i}$ plotted against $[L\hat{\bm{n}}]_{i}$, where $i$ is the pixel indicated by the red dot in the top row. Each plot contains 8000 points associated with different realizations of $\hat{\bm{n}}$.

Given a denoiser $R$, in order to study the partial linearity we consider the pair

\left([L\hat{\bm{n}}]_{i},\,[R(\hat{\bm{y}})-g(\bm{x})]_{i}\right) \qquad (30)

where $i$ denotes a fixed pixel (marked by the red dot in the first row of Fig. 3). To visualize the linearity of the three denoisers, the pair (30) is plotted in the second row of Fig. 3 for 8000 realizations of the noise $\hat{\bm{n}}$. Each point in the plot corresponds to one realization. As shown in the figure, all the denoisers exhibit a degree of partial linearity at the pixel $i$. For the TV denoiser, the residual $\bm{e}$ has a relatively larger variance (also illustrated by Table I), and there are some outliers for large $|[L\hat{\bm{n}}]_{i}|$.

IV-B Denoising experiments

We consider two types of synthetic noise (Gaussian and Poisson), and the performance of our method is compared with recent denoising methods that are not trained on ground truth images, namely SURE [32], Noise2Self [2], and Noisier2Noise [27], along with the classic BM3D approach [6]. Throughout the denoising experiments, we use the same network architecture, DnCNN [41], for the fully supervised baseline (which we call DnCNN in the following), SURE, Noise2Self, Noisier2Noise, and our approach. The experimental setups for the five methods are identical, except that the clean images are used to train DnCNN whereas they are unseen by the latter four methods in the learning phase. All models are trained on a benchmark denoising dataset [41] consisting of 400 training images of size 180×180. In the training phase, we feed the CNNs with image patches of size 40×40 and use a batch size of 128. Augmentations such as random flipping and random cropping are applied to the patches. In the inference phase, the inputs to the networks are the whole noisy images. In particular, our approach does not include the auxiliary vector $\bm{z}$ at this stage, and the denoised images are the outputs of the network without any post-processing. We evaluate the denoising quality on two test image sets, namely BSD68 (containing 68 images) [26] and the 12 widely used images (Set12) [6].

Training. To train our denoising model, we minimize the loss function (25) using the Adam optimizer [14]. The minimization process consists of two stages. In the first stage, we fix α = 1 and minimize the loss function (25) without constraints. In the second stage, we randomly choose α ∈ [0.1, 0.5] for each noisy sample, and additionally, in order to control the variance of $\bm{e}$ defined in (8), we impose the partially linear constraint (26); the implementation details are given in the next paragraph. Both stages contain $2\times 10^{5}$ optimization steps with an initial learning rate of 0.001, which drops to 0.0001 and 0.00005 at the $6\times 10^{4}$th and $1.2\times 10^{5}$th steps, respectively. Throughout the experiments, the samples $z_{i}$ of the auxiliary random vector $\bm{z}$ in (25) are generated from Gaussian distributions.
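For concreteness, the two-stage procedure can be sketched in PyTorch-style pseudocode as follows. This is a minimal sketch assuming that the empirical loss (25) takes the form $\sum_i\|R_\theta(y_i+\alpha z_i)-(y_i-z_i/\alpha)\|^2$ (our reading of (28) with $A$ being the identity); `model` and `make_z` are placeholders for the network and the auxiliary-vector generator.

import torch

def loss_25(model, y, z, alpha):
    # assumed form of the loss (25): ||R(y + alpha*z) - (y - z/alpha)||^2
    y_hat = y + alpha * z
    return ((model(y_hat) - (y - z / alpha)) ** 2).mean()

def training_step(model, optimizer, y, make_z, stage):
    z = make_z(y)  # zero-mean samples with (estimated) noise variance
    if stage == 1:
        alpha = 1.0
    else:  # stage 2: alpha drawn uniformly from [0.1, 0.5] per sample
        alpha = 0.1 + 0.4 * torch.rand(y.shape[0], 1, 1, 1, device=y.device)
    optimizer.zero_grad()
    loss = loss_25(model, y, z, alpha)  # plus the constraint term in stage 2
    loss.backward()
    optimizer.step()
    return loss.item()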

Partially linear constraint. We implement the partially linear constraint by perturbing the noisy images $\hat{y}_{i}$ (i.e., the samples of the noisy image $\hat{\bm{y}}$), following formula (26). Specifically, for each $\hat{y}_{i}$, let $\beta_{i}^{(1)}$ and $\beta_{i}^{(2)}$ be random numbers uniformly distributed in [1, 1.5], and let $q_{i}$ be a perturbation vector randomly generated from the same distribution as $z_{i}$. With $q_{i}$ we construct two perturbed versions of $\hat{y}_{i}$, namely $q^{(1)}_{i}:=\hat{y}_{i}-\beta_{i}^{(1)}q_{i}$ and $q^{(2)}_{i}:=\hat{y}_{i}+\beta_{i}^{(2)}q_{i}$. Then we have the linear relationship

\hat{y}_{i}=\tau_{1}q^{(1)}_{i}+\tau_{2}q^{(2)}_{i}

where $\tau_{1}:=\beta_{i}^{(2)}/(\beta_{i}^{(1)}+\beta_{i}^{(2)})$ and $\tau_{2}:=1-\tau_{1}$. In practice, we let $q_{i}$ be independent of $z_{i}$. Since the noise levels of $q^{(1)}_{i}$ and $q^{(2)}_{i}$ depend on the pixel values of $q_{i}$, we make $q_{i}$ sparse to avoid raising the noise level too much. To do this, we randomly select 1/25 of the pixels, with pairwise distances of at least 4 pixels; the remaining pixels of $q_{i}$ are set to zero, i.e., no perturbation is applied there. To avoid outliers at individual pixels caused by the perturbations, we also clip $q_{i}$ such that the pixel values of $q_{i}^{(1)}$ and $q_{i}^{(2)}$ fit in the range $[1.2a-0.2b,\,1.2b-0.2a]$, where $a:=\min_{j}[\hat{y}_{i}]_{j}$ and $b:=\max_{j}[\hat{y}_{i}]_{j}$. Having $q_{i}^{(1)}$ and $q_{i}^{(2)}$, based on Property (26) we add the following penalty term to the loss function

\mathcal{L}_{c}(\theta):=\sum_{i}\left\|M\left(R_{\theta}(\hat{y}_{i})-\sum_{k=1}^{2}\tau_{k}R_{\theta}(q_{i}^{(k)})\right)\right\|^{2}, \qquad (31)

where $\theta$ denotes the network parameters and $M$ is a diagonal matrix, whose diagonal entries are set as

M_{jj}:=1\,/\left[\,\left|q^{(1)}_{i}-q^{(2)}_{i}\right|+0.1\sigma\,\right]_{j},

if $[q_{i}]_{j}\neq 0$, and $M_{jj}:=0$ otherwise, where σ > 0 is the square root of the largest pixel-wise variance of the noise $\bm{n}$. Having $M$ in the loss (31) means that we penalize the nonlinearity at the perturbed pixels only, and the term 0.1σ prevents division by very small numbers. In summary, the loss function is $\mathcal{L}+\gamma\mathcal{L}_{c}$, where $\mathcal{L}$ is defined in (25) and γ is a hyperparameter that can be tuned based on the quality of the denoised images.
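A minimal PyTorch-style sketch of this penalty is given below, omitting the sparsity pattern and the clipping of $q_{i}$ described above; `model` is the denoising network and `q` a pre-generated sparse perturbation (zero at unperturbed pixels).

import torch

def plc_penalty(model, y_hat, q, sigma):
    # two perturbed copies of the noisy input, with beta in [1, 1.5]
    b1 = 1.0 + 0.5 * torch.rand(y_hat.shape[0], 1, 1, 1, device=y_hat.device)
    b2 = 1.0 + 0.5 * torch.rand_like(b1)
    q1, q2 = y_hat - b1 * q, y_hat + b2 * q
    tau1 = b2 / (b1 + b2)
    # if R were partially linear, R(y_hat) would match this combination
    combo = tau1 * model(q1) + (1.0 - tau1) * model(q2)
    mask = (q != 0).float()                     # penalize perturbed pixels only
    m = mask / ((q1 - q2).abs() + 0.1 * sigma)  # diagonal weights M_jj
    return ((m * (model(y_hat) - combo)) ** 2).sum()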

IV-B1 Gaussian noise

In this experiment, we test the denoisers on images corrupted by Gaussian white noise. The training sets for our unsupervised approach consist of the noisy images only. We consider two noise levels, with standard deviation σ = 25 and σ = 50 (with respect to 256 gray levels). For the two noise levels, two denoisers are trained using the proposed method, with the ground truth images unseen during training. In both cases, the parameter γ for the partially linear constraint term (31) is set to 4.

The test results on BSD68 [26] are reported in Table II, where our method is called DPLD (deep partially linear denoiser). The denoising quality is measured by the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index. We compare our denoiser with BM3D [6], the self-supervised method Noise2Self [2], SURE [32], Noisier2Noise [27], and the fully supervised DnCNN [41]. It is worth mentioning that the last denoiser, DnCNN, in contrast to the other five, requires noisy-clean image pairs for training.

TABLE II: Denoising quality, measured by PSNR (dB) and SSIM, for BSD68 [26] corrupted by Gaussian Noise.
Noise Level          σ = 25           σ = 50
Measure              PSNR    SSIM     PSNR    SSIM
BM3D [6]             28.58   0.8861   25.66   0.8041
Noise2Self [2]       27.48   0.8588   25.15   0.7818
SURE [32]            28.99   0.8961   25.88   0.8118
Noisier2Noise [27]   28.96   0.8951   25.96   0.8125
DPLD                 29.08   0.8961   26.13   0.8196
DnCNN [41]           29.22   0.9017   26.24   0.8265

As shown in Table II, the fully supervised denoiser DnCNN achieves the best accuracy. This is not surprising, as it learns from ground truth images that are not provided to the other methods. Our method is the best among the denoisers trained without ground truth: it outperforms Noise2Self, SURE, and Noisier2Noise by 1.6 dB, 0.09 dB, and 0.12 dB respectively for noise level σ = 25, and by 0.98 dB, 0.25 dB, and 0.17 dB respectively for σ = 50. Compared to DnCNN, the PSNR values of DPLD are lower by 0.14 dB and 0.11 dB for σ = 25 and σ = 50, respectively.

TABLE III: Denoising quality (in dB) for the 12 widely used images [6] and Gaussian noise

σ = 25
Denoiser       Average  C. man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena   Barbara  Boat   Man    Couple
BM3D           29.97    29.39   32.98  30.18    28.61     29.30    28.43     28.83   32.05  30.61    29.86  29.64  29.72
Noise2Self     28.81    27.95   32.22  29.51    28.12     28.64    26.70     27.58   31.52  26.41    29.06  29.04  28.94
SURE           30.12    29.75   32.65  30.51    29.15     30.08    28.97     29.32   32.08  29.28    29.93  29.94  29.82
Noisier2Noise  30.12    29.69   32.74  30.52    29.13     30.10    28.95     29.32   32.10  29.22    29.92  29.94  29.79
DPLD           30.28    29.84   33.04  30.69    29.26     30.21    28.97     29.30   32.33  29.66    30.10  30.02  29.97
DnCNN          30.44    30.08   33.13  30.80    29.44     30.39    29.12     29.48   32.43  29.96    30.21  30.12  30.12

σ = 50
Denoiser       Average  C. man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena   Barbara  Boat   Man    Couple
BM3D           26.71    26.36   29.75  26.69    24.99     25.90    25.22     25.74   28.84  26.98    26.76  26.84  26.49
Noise2Self     26.14    25.60   29.20  26.36    24.67     25.68    24.53     25.29   28.55  24.56    26.45  26.64  26.10
SURE           26.62    26.45   29.25  26.74    25.20     26.23    25.50     26.10   28.70  25.01    26.80  26.99  26.43
Noisier2Noise  26.79    26.61   29.65  26.98    25.25     26.49    25.62     26.21   28.94  25.13    26.92  27.08  26.56
DPLD           27.05    26.89   30.01  27.22    25.45     26.72    25.76     26.36   29.24  25.70    27.13  27.23  26.83
DnCNN          27.19    27.03   30.10  27.36    25.55     26.87    25.89     26.45   29.29  26.26    27.22  27.27  26.94

The comparison of the denoisers on the 12 widely used images [6] is given in Table III. For noise level σ = 25, DPLD reaches the best PSNR among the five denoisers that do not consume ground truth data on all images except Parrot and Barbara. It is interesting to note that BM3D performs better than the fully supervised DnCNN on the image Barbara. On average, for σ = 50, DPLD outperforms Noise2Self by 0.91 dB and SURE by 0.43 dB, and it falls behind DnCNN by 0.14 dB.

Figure 4: Denoised results for the image Boat (Gaussian noise, σ = 25); columns from left to right: Noisy, Noise2Self, SURE, Noisier2Noise, DPLD, DnCNN. The last two rows are enlarged views of the indicated regions.

Fig. 4 displays the denoising results for the image Boat. It can be seen that, though Noise2Self, SURE, Noisier2Noise, and DPLD neither see the clean images nor have any explicit smoothness constraints during training, they yield denoised images with smooth regions (cf. columns 2 to 5 of Fig. 4, respectively). The result of DPLD is visually close to that of DnCNN. The output of Noise2Self has relatively blurry edges compared to DPLD and DnCNN, e.g., at the letters displayed in the last row of Fig. 4.

Figure 5: The influence of an inaccurate noise variance during training; panels: (a) σ = 25, (b) σ = 50. Top row: PSNR on the test set BSD68 [26]; the dotted horizontal lines indicate the results of Noise2Self [2]. Bottom row: the mean of $\langle\bm{z},L\bm{z}\rangle$ of the learned denoisers.

IV-B2 Training with estimated noise variance

We report additional results for scenarios where only an inaccurate estimated noise variance (ENV) is available (for generating $\bm{z}$) during training. The results are obtained under the same settings as Subsection IV-B1, except that an ENV is used.

1). On the top row of Fig. 5, the PSNR values of the test results on BSD68 [26], for both noise levels σ = 25 and σ = 50, are plotted against the relative error β in the ENV $(1+\beta)\sigma^{2}$, where β = −16%, −8%, ⋯, 16%. For both noise levels, errors in the ENV lead to a decrease in PSNR. The reduction is less significant when the noise variance is underestimated than when it is overestimated. The drop in PSNR is small when the error in the ENV is small (< 0.1 dB for an ENV 4% below its true value), and our method remains significantly more accurate than Noise2Self [2] for small errors in the ENV.

2). We observe a correlation between the errors in the ENV and the term $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ (see Fig. 5, bottom row). Here $L$ is computed based on Remark 1, with a constant ground truth image $\bm{x}$ (pixel values 0.5, size 40×40) and $\bm{n}$ generated from the same distribution as $\bm{z}$ (variance $(1+\beta)\sigma^{2}$). In the plots, $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ is a small positive number when β = 0, and it increases (resp. decreases) as the ENV decreases (resp. increases). This is due to the extra term $2\beta\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ in the objective function $\mathcal{J}(\cdot)$, as shown in Remark 2.

Overall, the performance is stable under small errors in the ENV, and importantly, by analyzing the structure of the learned denoiser one can investigate the error in the noise variance. The quantity $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ is helpful for choosing a better denoising model when the noise variance is unknown.

IV-B3 Poisson noise

We evaluate our method on three levels of Poisson noise, with parameter λ = 60, λ = 30, and λ = 15, respectively. The training settings for the denoisers are the same as those for Gaussian noise, except that the variances of the auxiliary vector samples $z_{i}$ are computed differently. As the noise $\bm{n}$ is image dependent and not identically distributed across image pixels, its variance is not known without the ground truth image. Note, however, that our method does not require a precise model of the noise distribution, but only an estimate of the noise variance. At pixel $k$, the noisy value satisfies $\lambda[\bm{y}]_{k}\sim{\rm Pois}(\lambda[\bm{x}]_{k})$, where λ is a known constant. Hence the noise variance ${\rm var}([\bm{n}]_{k}\mid[\bm{x}]_{k})={\rm var}([\bm{y}]_{k}\mid[\bm{x}]_{k})=\frac{1}{\lambda}[\bm{x}]_{k}=\frac{1}{\lambda}\mathbb{E}([\bm{y}]_{k}\mid[\bm{x}]_{k})$. This implies that the sample value of $\frac{1}{\lambda}[\bm{y}]_{k}$ provides an unbiased estimate of the noise variance. The entries of the auxiliary vector are then generated as $[\bm{z}]_{k}=\bm{r}_{k}\sqrt{[\bm{y}]_{k}/\lambda}$, where $\bm{r}_{k}$ is a standard Gaussian random variable, hence ${\rm var}([\bm{z}]_{k}\mid[\bm{x}]_{k})=\frac{1}{\lambda}\mathbb{E}([\bm{y}]_{k}\mid[\bm{x}]_{k})={\rm var}([\bm{n}]_{k}\mid[\bm{x}]_{k})$. The partially linear constraint parameter γ for (31) is tuned manually for the three noise levels. In this subsection, unless specified otherwise, the results are obtained by setting γ = 4, 16, 64 for λ = 60, 30, 15, respectively.
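As a short illustration, the auxiliary vector for the Poisson case can be generated as in the NumPy sketch below, a direct transcription of the formula above (the clamping of negative pixel values is our own small safeguard):

import numpy as np

def make_z_poisson(y, lam):
    # per-pixel variance estimate: var(n_k | x_k) is estimated by y_k / lambda
    var_est = np.maximum(y, 0.0) / lam
    # [z]_k = r_k * sqrt(y_k / lambda), with r_k standard Gaussian
    return np.random.randn(*y.shape) * np.sqrt(var_est)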

TABLE IV: Denoising quality, measured by PSNR (dB) and SSIM, for BSD68 [26] corrupted by Poisson noise.
Noise Level Noise2Self[2] DPLD DnCNN[41]
PSNR SSIM PSNR SSIM PSNR SSIM
λ=60  27.78  0.8660  29.28  0.9018  29.43  0.9081
λ=30  26.56  0.8285  27.65  0.8625  27.86  0.8743
λ=15  25.42  0.7902  26.16  0.8210  26.39  0.8348

For comparison, we train denoisers with Noise2Self [2] and DnCNN [41] on the same training set (the ground truth images being available only to DnCNN). The denoising results on the test set BSD68 are reported in Table IV. Trained on the ground truth images, DnCNN has the highest average PSNR for all three noise levels. The proposed method DPLD outperforms Noise2Self by 1.5 dB, 1.09 dB, and 0.74 dB for λ = 60, 30, 15, respectively. On the other hand, it loses 0.15 dB, 0.21 dB, and 0.23 dB compared to DnCNN. It should be noted that as the noise level decreases, the PSNR gap between DPLD and DnCNN gets smaller. In contrast, the gap between Noise2Self and DnCNN becomes larger as the noise becomes weaker. This may be due to the fact that the Noise2Self approach cannot learn the identity mapping: for a given pixel, the denoiser cannot see the observed value and has to infer it from the information in neighboring pixels. If knowledge about the noise distribution is available, the results can be improved by reusing the noisy images in the inference phase [16]. Our denoiser uses all the information in the noisy image, and, as shown in Proposition 2, the gap between DPLD and the best denoiser in $\mathcal{R}_{\epsilon}$ tends to zero as the noise goes to zero.

TABLE V: Denoising quality (in dB) for the 12 widely used images [6] and Poisson noise

λ = 60
Denoiser    Average  C. man  House  Peppers  Starfish  Monarch  Airplane  Parrot  Lena   Barbara  Boat   Man    Couple
Noise2Self  29.15    28.32   32.13  29.79    27.94     28.99    26.63     27.98   31.74  28.60    29.34  29.20  29.18
DPLD        30.34    30.05   32.89  30.82    28.93     30.54    28.64     29.68   32.56  29.79    30.09  30.02  30.07
DnCNN       30.49    30.29   33.02  30.95    29.07     30.69    28.79     29.82   32.67  30.03    30.22  30.12  30.21

λ = 30
Noise2Self  27.82    27.36   30.74  28.27    26.31     27.72    25.61     26.95   30.36  26.79    27.91  28.02  27.82
DPLD        28.68    28.43   31.44  29.06    27.08     28.74    26.96     28.14   31.00  27.74    28.48  28.59  28.47
DnCNN       28.87    28.79   31.67  29.23    27.16     28.89    27.11     28.30   31.14  28.20    28.60  28.71  28.67

λ = 15
Noise2Self  26.26    26.01   28.81  26.60    24.62     25.99    24.23     25.58   28.69  24.94    26.48  26.78  26.36
DPLD        26.99    27.00   29.76  27.27    25.22     26.88    25.26     26.63   29.39  25.31    27.02  27.25  26.87
DnCNN       27.28    27.32   30.13  27.51    25.42     27.10    25.47     26.85   29.62  26.30    27.20  27.37  27.11

Table V lists the PSNR of the denoising outputs for the 12 widely used images [6]. Similar to the Gaussian noise cases, the proposed DPLD has higher PSNR values than Noise2Self on all images. The DPLD outperforms Noise2Self by more than 0.7 dB in average PSNR, and falls behind DnCNN by less than 0.3 dB.

Figure 6: Quality comparison of different methods for Poisson noise (λ = 30); columns from left to right: Noisy, Noise2Self, DPLD, DnCNN, Ground truth. The last two rows are enlarged views of the indicated regions.

Fig. 6 displays an example of the denoised images. This example shows that Noise2Self, DPLD, and DnCNN are all capable of recovering the details of the image, though the first two are not exposed to the detailed structures of the images during training. Compared to DPLD and DnCNN, the denoised image of Noise2Self is less smooth. Also, Noise2Self tends to remove the sharp points of a jagged edge (cf. the third row of Fig. 6): since it relies on the data of surrounding pixels when restoring the pixels at a sharp point, it may encourage more regularized object shapes.

Figure 7: Residual terms of DnCNN, Noise2Self, and our method. Top row: the ground truth $\bm{x}$, followed by the error images of DnCNN, Noise2Self, and our method, computed by subtracting the ground truth from the denoised images. Bottom row: the noise $\bm{n}$, followed by the residuals $\bm{e}$ of the three methods. The number in the top right corner of each image is the mean square of its pixel values.

Finally, the residual terms $\bm{e}$ (defined in (4)) of three different methods on the test image Starfish are presented in Fig. 7. For each method, in order to compute $\bm{e}$ we first compute $g(\bm{x})$ and $L$ via Remark 1, using the denoised outputs for 1,600,000 realizations of the noisy image $\bm{y}$. This experiment is carried out under the Poisson noise setting with λ = 30. As can be seen from Fig. 7, the error (i.e., the difference between the ground truth and the denoised image) is an order of magnitude smaller than the noise, while the residual term is an order of magnitude smaller than the error. The variance of $\bm{e}$ for the proposed denoiser is slightly smaller than that of the fully supervised DnCNN, and Noise2Self has a variance of $\bm{e}$ about two times larger than the other two.

IV-C Stability with respect to the partial linear constraint and noise levels

In this experiment we study the stability of the proposed denoiser with respect to two factors: the constraint parameter γ and the noise level. Here we consider Poisson noise with fixed λ = 30 for training, under the same experimental setting as Subsection IV-B3.

First we study the stability with respect to γ. Fig. 8 shows the PSNR and SSIM, evaluated on the test set BSD68, for different choices of γ used during training. The horizontal lines in Fig. 8 show the PSNR and SSIM of the two reference methods DnCNN and Noise2Self (N2S), which do not depend on γ. According to the figure, the learned denoiser has very low PSNR values when γ is very small (e.g., $10^{-4}$, which means that little importance is placed on the constraint). This implies that partial linearity is crucial for finding a high quality denoiser when minimizing (25). As γ becomes larger, the constraint starts to take effect and substantially improves the denoising performance. The peak PSNR is achieved at around $2^{4}$. For γ in $[2^{2},2^{8}]$, we observe a variation in PSNR of less than 0.2 dB, which is relatively small compared to the gap between N2S and DnCNN, before the denoising quality degrades at very large values of γ (e.g., $2^{14}$).

Fig. 9 compares the robustness of the learned denoisers with respect to different noise levels. All denoisers are trained in the Poisson noise setting (again with λ = 30), and in our approach the optimal γ (i.e., $2^{4}$) is applied. It can be seen that all methods suffer from a degradation in denoising quality when λ at test time is below 30. This is reasonable, because smaller λ generally implies stronger noise and a harder problem. Besides, DPLD and Noise2Self, which use only noisy images in training, are more stable than DnCNN when λ is around 20. The PSNR of DnCNN decreases quickly as λ decreases from 30, and it is worse than that of DPLD when λ ≤ 24. It is also interesting to note that DPLD outperforms the (fully supervised) DnCNN in the lower-noise cases (corresponding to large λ), even though neither the ground truth nor less noisy samples are provided for training. This suggests that the auxiliary random vector approach, together with the partially linear constraint, brings extra robustness compared to its counterpart that learns from a set of noisy-clean image pairs with a fixed noise level.

Figure 8: PSNR (in blue) and SSIM (in red) plotted against the partially linear constraint parameter γ (better viewed in color). The horizontal lines represent the results of DnCNN and Noise2Self for comparison.
Figure 9: Robustness of the denoisers with respect to the noise level.

IV-D Denoising real microscopy images

We apply the proposed method to denoising real microscopy images, for which neither the ground truth images nor the noise variance is known. Specifically, we consider three datasets, namely N2DH-GOWT1, C2DL-MSC, and TOMO110, which are used in [15]. The first two consist of fluorescence microscopy images of GFP-transfected GOWT1 mouse stem cells and of rat mesenchymal stem cells, respectively. The training sets contain 92 images of size 1024×1024 (N2DH-GOWT1) and 48 images of size 992×832 (C2DL-MSC). The third dataset, TOMO110, was acquired with a cryo-transmission electron microscope, and for this one we use a single image (of size 7676×7420) for training.

For each of the datasets, we first estimate the noise variance.
1). For N2DH-GOWT1 and C2DL-MSC, we first obtain a rough estimate by computing the average squared difference of neighboring pixels in smooth regions. We then refine the estimated noise variance using the correlation between the linear component of the denoiser and the errors in the noise variance. This is motivated by the observation in Remark 2 and the numerical study in Subsection IV-B2.
2). For TOMO110, since multiple frames/copies of the image are available owing to the dose-fractionated acquisition, we estimate the variance by computing the variance across the frames, whose noise realizations are independent (see the sketch after this list). Note that though several frames of the image are available, our approach learns to denoise the sum of all frames. In contrast, the Noise2Noise method learns to denoise only one half of the frames, for which the noise level is therefore higher.
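A minimal NumPy sketch of the frame-based estimate follows, assuming the frames are stored as a stack with independent, zero-mean noise per frame; since our network denoises the sum of the frames, the per-frame variance is scaled by the number of frames.

import numpy as np

def sum_frame_noise_variance(frames):
    # frames: array of shape (n_frames, H, W) sharing the same underlying
    # signal with independent noise; the pixel-wise variance across frames
    # estimates the per-frame noise variance (the signal term cancels out)
    per_frame_var = frames.var(axis=0, ddof=1).mean()
    # independent noise adds in variance: var(sum) = n_frames * var(frame)
    return frames.shape[0] * per_frame_var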

The details of the noise estimation and of the training of the networks are given in Appendix B. The denoising results are displayed in Fig. 10 (with enlarged views of the regions indicated by the blue/red boxes). We compare our method with BM3D [6] and Noise2Void [15]. As shown in the figure, BM3D produces smooth denoised images but also introduces artifacts (e.g., in the cells on the second row of Fig. 10). Noise2Void yields outputs with a certain graininess in non-constant regions (cf. the third row, first column of the figure). The denoised images from our approach look smooth while showing slightly more detail of the object structures (e.g., in the regions highlighted by the red boxes).

Figure 10: Denoising results on three real microscopy datasets (columns: N2DH-GOWT1, C2DL-MSC, TOMO110; rows: noisy image, BM3D, Noise2Void, ours).

IV-E Image deblurring

Following [31] and [39], we consider the image deblurring problem (27), where the operator $A$ is a convolution with a blur kernel arising from random motions. The datasets for the deblurring experiments are face images from the Helen dataset [18] and the CelebA dataset [24], and we use the same training/validation split as in [39]. This results in a total of 161,800 images for training, 22,000 images for validation, and 16,000 images for testing. Throughout this experiment, the noise $\bm{n}$ is Gaussian white noise with standard deviation σ = 2. We compare our approach with a fully supervised deblurring method, as well as with an unsupervised method [39] that uses a pair of noisy and blurry observations per image. To facilitate the comparison, the same U-Net architecture as in [39] is used for the deblurring networks in all three methods.

We train the deblurring network $R_{\theta}$ using the empirical version of the loss (28), given the random noisy samples $\{y_{i}\}$ and the operator $A$,

\mathcal{L}_{\rm b}:=\sum_{i}\left\|AR_{\theta}(\hat{y}_{i})-(y_{i}-z_{i}/\alpha)\right\|^{2}

where $\{z_{i}\}$ are the auxiliary vectors. As suggested in [39], we also integrate a proxy loss defined as

\mathcal{L}_{\rm prox}:=\sum_{i}\left\|R_{\theta}(y^{\rm prox}_{i})-x^{\rm prox}_{i}\right\|^{2}

in which $x^{\rm prox}_{i}:=R_{\theta}(y_{i})$, $y^{\rm prox}_{i}:=A^{\rm prox}x^{\rm prox}_{i}+n^{\rm prox}_{i}$, and $A^{\rm prox}$, $n^{\rm prox}_{i}$ are randomly generated blur kernels and noise, respectively. In our setting, $n^{\rm prox}_{i}$ is generated from the same distribution as $\bm{z}$. This loss is therefore the MSE loss defined on the synthetic training pair $(x^{\rm prox}_{i},y^{\rm prox}_{i})$, where $x^{\rm prox}_{i}$, a deblurred version of $y_{i}$, is treated as the ground truth. The proxy loss has been found useful for improving the deblurring accuracy. Finally, we apply the partially linear constraint in (31), with the denoiser being $AR_{\theta}(y_{i})$ in this context. In summary, the training loss function is $\mathcal{L}_{\rm b}+\gamma_{\rm prox}\mathcal{L}_{\rm prox}+\gamma\mathcal{L}_{c}$, where $\gamma_{\rm prox}$ is a hyperparameter.
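A PyTorch-style sketch of the combined objective is given below; `blur_A` stands for the known operator $A$, `rand_kernel` for a random proxy-kernel generator, and the noisier input $\hat{y}_i = y_i + \alpha z_i$ is our reading of the loss (28), so these names and that form are assumptions rather than the exact implementation.

import torch

def deblur_loss(model, y, z, alpha, blur_A, rand_kernel,
                gamma_prox=1.0/16, sigma=2.0/255):
    # L_b: data term of Eq. (28) on the assumed noisier input y_hat = y + alpha*z
    y_hat = y + alpha * z
    l_b = ((blur_A(model(y_hat)) - (y - z / alpha)) ** 2).mean()
    # proxy loss: the current deblurred output serves as a pseudo ground truth
    x_prox = model(y).detach()
    # noise drawn to match the distribution of z (sigma = 2 on 256 gray levels)
    y_prox = rand_kernel(x_prox) + sigma * torch.randn_like(y)
    l_prox = ((model(y_prox) - x_prox) ** 2).mean()
    # the constraint gamma * L_c of (31), applied to A R_theta, is added on top
    return l_b + gamma_prox * l_prox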

The deblurring model is trained using the Adam optimizer [14] with a batch size of 128 and 360,000 optimization steps, starting from an initial learning rate of $10^{-3}$. The learning rate is then decreased to $3\times 10^{-4}$, $10^{-4}$, and $5\times 10^{-5}$ at the 310,000th, 340,000th, and 350,000th steps, respectively. In the first 100,000 optimization steps, we let $\gamma_{\rm prox}=0$ and γ = 0; after that, $\gamma_{\rm prox}=1/16$ and γ = 1/16. In this experiment, α is randomly selected from [0.1, 0.2] for each training sample.

TABLE VI: Deblurring results. For the data used in the training step, O and C stand for observations and clean images, respectively. The mark \dagger indicates training without the proxy loss.
Method Data Helen [18] CelebA [24]
PSNR SSIM PSNR SSIM
Xu et al. [40] — 20.11 0.711 18.93 0.685
Zuo et al. [44] O/C Pairs 22.24 0.763 20.53 0.750
Tao et al. [34] O/C Pairs 22.86 0.762 24.11 0.862
Shen et al. [31] O/C Pairs 25.99 0.871 25.05 0.879
Kupyn et al. [17] O/C Pairs 23.63 0.781 22.45 0.729
Supervised baseline O/C Pairs 26.13 0.886 25.20 0.892
Xia et al. [39]\dagger O/O Pairs 25.47 0.867 24.64 0.873
Ours\dagger Unpaired O 25.65 0.867 24.78 0.873
Xia et al. [39] O/O Pairs 25.95 0.878 25.09 0.885
Ours Unpaired O 26.00 0.879 25.09 0.886

Table VI shows the PSNR and SSIM of the deblurring results on the test sets of Helen [18] and CelebA [24]. In the table, the supervised baseline is based on the same U-Net architecture as our approach and is trained on (noisy) observation/clean image pairs. It achieves the highest PSNR and SSIM scores among the compared methods. Our method reaches competitive PSNR values of 25.65 dB (Helen) and 24.78 dB (CelebA) without the proxy loss, and adding the proxy loss yields a further improvement of 0.35 dB and 0.31 dB, respectively. The approach proposed by Xia et al. [39], an extension of Noise2Noise [20], relies instead on paired noisy observations for each image, though without the ground truth. For the results reported in the table, the operator $A$ is known during training for both our approach and the method of Xia et al. [39]. However, our method requires only unpaired observations, and it achieves a deblurring quality comparable to Xia et al. [39] and not far from that of the fully supervised baseline (with a PSNR gap smaller than 0.15 dB). Furthermore, our method outperforms the existing supervised methods of Zuo et al. [44], Tao et al. [34], Shen et al. [31], and Kupyn et al. [17], as well as the unsupervised approach of Xu et al. [40].

A comparison of the deblurred images from different methods is shown in Fig. 11, where the ground truth and blurry images are taken from the Helen test set. We underline that no blur kernels are provided to these methods at test time. Given no ground truth images for training, the deblurred results of Xia et al. [39] (cf. the 4th column in the figure) and of our approach (cf. the 5th column) capture most of the details recovered by the fully supervised baseline (cf. the 3rd column), though the latter gives slightly sharper images. This demonstrates the capacity of the proposed partially linear denoisers for solving the deblurring problem from only one corrupted observation per image. Though in this experiment the unsupervised methods require knowledge of the blur kernels during the training phase, they can be generalized to cases where the blur kernels are unknown at training time, e.g., by jointly learning the clean images and blur kernels [39].

Figure 11: Face deblurring results on images from the Helen test set [18]; columns from left to right: Ground truth, Blurry image, Fully supervised, Xia et al. [39], Ours.

V Conclusion

In this paper, we propose a class of structured nonlinear denoisers. We show that such denoisers, equipped with a partial linearity property, can be trained without access to any ground truth images or to the exact model of the noise. In practice, one only needs to know the noise variance conditioned on the images. Based on the partial linearity structure, we propose an auxiliary random vector approach, which establishes a direct connection between our loss function and the MSE and allows end-to-end training of the denoising models. The approach outperforms other ground-truth-free denoising approaches, such as SURE based learning, and achieves a denoising quality close to that of the fully supervised baseline. It also offers new opportunities for learning to solve other image restoration tasks from single corrupted observations of the images. The experimental results show that, when generalized to an image deblurring problem, our approach achieves the state of the art for unsupervised deblurring.

One disadvantage of our method is that it does not work for noise with non-zero mean, such as impulse noise, for which the corrupted pixels are replaced with random numbers. In such cases, the auxiliary random vector approach no longer provides good estimates of the MSE. Our method also requires estimates of the noise variance; various methods exist for estimating the noise variance from noisy images (see, e.g., [21, 9, 1, 23]). Alternatively, in future work, we are interested in learning-based or automated noise variance extraction methods with the potential of being integrated into the proposed denoiser within an end-to-end framework.

References

  • [1] Nicola Acito, Marco Diani, and Giovanni Corsini. Signal-dependent noise modeling and model parameter estimation in hyperspectral images. IEEE Transactions on Geoscience and Remote Sensing, 49(8):2957–2971, 2011.
  • [2] Joshua Batson and Loic Royer. Noise2Self: Blind denoising by self-supervision. In International Conference on Machine Learning, pages 524–533, 2019.
  • [3] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 60–65. IEEE, 2005.
  • [4] Raymond H Chan, Chung-Wa Ho, and Mila Nikolova. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization. IEEE Transactions on Image Processing, 14(10):1479–1485, 2005.
  • [5] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2016.
  • [6] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
  • [7] David L Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613–627, 1995.
  • [8] Thibaud Ehret, Axel Davy, Jean-Michel Morel, Gabriele Facciolo, and Pablo Arias. Model-blind video denoising via frame-to-frame training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11369–11378, 2019.
  • [9] Alessandro Foi, Mejdi Trimeche, Vladimir Katkovnik, and Karen Egiazarian. Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Transactions on Image Processing, 17(10):1737–1754, 2008.
  • [10] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao Ding, and John Paisley. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3855–3863, 2017.
  • [11] Yoseob Han and Jong Chul Ye. Framing U-Net via deep convolutional framelets: Application to sparse-view CT. IEEE Transactions on Medical Imaging, 37(6):1418–1429, 2018.
  • [12] Tao Hong, Yaniv Romano, and Michael Elad. Acceleration of red via vector extrapolation. Journal of Visual Communication and Image Representation, 63:102575, 2019.
  • [13] Kyong Hwan Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2Void: Learning denoising from single noisy images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2019.
  • [16] Alexander Krull, Tomas Vicar, and Florian Jug. Probabilistic Noise2Void: Unsupervised content-aware denoising. arXiv preprint arXiv:1906.00651, 2019.
  • [17] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiří Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8183–8192, 2018.
  • [18] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S Huang. Interactive facial feature localization. In European conference on computer vision, pages 679–692. Springer, 2012.
  • [19] Stamatios Lefkimmiatis. Universal denoising networks: A novel CNN architecture for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3204–3213, 2018.
  • [20] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2Noise: Learning image restoration without clean data. In International Conference on Machine Learning, pages 2965–2974, 2018.
  • [21] Ce Liu, William T Freeman, Richard Szeliski, and Sing Bing Kang. Noise estimation from a single image. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 901–908. IEEE, 2006.
  • [22] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. In Advances in Neural Information Processing Systems, pages 1673–1682, 2018.
  • [23] Xinhao Liu, Masayuki Tanaka, and Masatoshi Okutomi. Practical signal-dependent noise parameter estimation from a single noisy image. IEEE Transactions on Image Processing, 23(10):4361–4371, 2014.
  • [24] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015.
  • [25] Stephane Mallat and Wen Liang Hwang. Singularity detection and processing with wavelets. IEEE transactions on information theory, 38(2):617–643, 1992.
  • [26] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
  • [27] Nick Moran, Dan Schmidt, Yu Zhong, and Patrick Coady. Noisier2noise: Learning to denoise from unpaired noisy data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12064–12072, 2020.
  • [28] Yaniv Romano, Michael Elad, and Peyman Milanfar. The little engine that could: Regularization by denoising (red). SIAM Journal on Imaging Sciences, 10(4):1804–1844, 2017.
  • [29] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 860–867. IEEE, 2005.
  • [30] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
  • [31] Ziyi Shen, Wei-Sheng Lai, Tingfa Xu, Jan Kautz, and Ming-Hsuan Yang. Deep semantic face deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8260–8269, 2018.
  • [32] Shakarim Soltanayev and Se Young Chun. Training deep learning based denoisers without ground truth data. In Advances in Neural Information Processing Systems, pages 3257–3267, 2018.
  • [33] Suhas Sreehari, S Venkat Venkatakrishnan, Brendt Wohlberg, Gregery T Buzzard, Lawrence F Drummy, Jeffrey P Simmons, and Charles A Bouman. Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Transactions on Computational Imaging, 2(4):408–423, 2016.
  • [34] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Jiaya Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
  • [35] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454, 2018.
  • [36] Valéry Dewil, Pablo Arias, Gabriele Facciolo, Jérémy Anger, Axel Davy, and Thibaud Ehret. Self-supervised training for blind multi-frame video denoising. arXiv preprint arXiv:2004.06957, 2020.
  • [37] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE Global Conference on Signal and Information Processing, pages 945–948. IEEE, 2013.
  • [38] Joachim Weickert. Anisotropic diffusion in image processing, volume 1. Teubner Stuttgart, 1998.
  • [39] Zhihao Xia and Ayan Chakrabarti. Training image estimators without image ground truth. In Advances in Neural Information Processing Systems, pages 2436–2446, 2019.
  • [40] Li Xu, Shicheng Zheng, and Jiaya Jia. Unnatural L0 sparse representation for natural image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.
  • [41] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  • [42] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [43] Magauiya Zhussip, Shakarim Soltanayev, and Se Young Chun. Extending Stein's unbiased risk estimator to train deep denoisers with correlated pairs of noisy images. In Advances in Neural Information Processing Systems, pages 1463–1473, 2019.
  • [44] Wangmeng Zuo, Dongwei Ren, David Zhang, Shuhang Gu, and Lei Zhang. Learning iteration-wise generalized shrinkage–thresholding operators for blind deconvolution. IEEE Transactions on Image Processing, 25(4):1751–1764, 2016.

Supplementary Materials:
Unsupervised Image Restoration Using Partially Linear Denoisers
Rihuan Ke and Carola-Bibiane Schönlieb

Appendix A Additional proofs for Section III

Remark 2. Assume that the conditions of Proposition 2 hold except that ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})=(1+\beta){\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$, where $\beta>-1$. Then $\mathcal{J}(R)=\mathbb{E}(\|R(\hat{\bm{y}})-\bm{x}\|^{2})-2\mathbb{E}(\langle\bm{e},\bm{n}-\bm{z}/\alpha\rangle)+c+2\beta\,\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$.

Proof.

The proof is similar to that of Proposition 2.

Firstly, we repeat Equation (18) here

R(\hat{\bm{y}})-(\bm{y}-\bm{z}/\alpha)=\left(g(\bm{x})+L\hat{\bm{n}}+\bm{e}-\bm{x}\right)-\left(\bm{n}-\bm{z}/\alpha\right).

Taking the expectation of both sides over $\bm{x}$, $\bm{n}$, and $\bm{z}$, one has

\begin{split}\mathcal{J}(R)=&\ \mathbb{E}\left(\|R(\hat{\bm{y}})-\bm{x}\|^{2}\right)+\mathbb{E}\left(\|\bm{n}-\bm{z}/\alpha\|^{2}\right)\\&-2\mathbb{E}\left(\langle g(\bm{x})-\bm{x},\,\bm{n}-\bm{z}/\alpha\rangle\right)-2\mathbb{E}\left(\langle\bm{e},\,\bm{n}-\bm{z}/\alpha\rangle\right)\\&-2\mathbb{E}\left(\langle L\bm{n}+\alpha L\bm{z},\,\bm{n}-\bm{z}/\alpha\rangle\right).\end{split} \qquad (32)

Secondly, by the assumption ${\rm Cov}(\bm{z},\bm{z}\mid\bm{x})=(1+\beta){\rm Cov}(\bm{n},\bm{n}\mid\bm{x})$, we have the following equalities associated with the last term of Equation (32):

\begin{split}&\mathbb{E}\left(\langle L\bm{n}+\alpha L\bm{z},\,\bm{n}-\bm{z}/\alpha\rangle\mid\bm{x}\right)\\=&\ \mathbb{E}\left(\langle L\bm{n},\bm{n}\rangle\mid\bm{x}\right)+\mathbb{E}\left(\langle\alpha L\bm{z},-\bm{z}/\alpha\rangle\mid\bm{x}\right)\\=&\ -\beta\,\mathbb{E}\left(\langle L\bm{z},\bm{z}\rangle\mid\bm{x}\right).\end{split} \qquad (33)

Also, the term $\mathbb{E}(\langle g(\bm{x})-\bm{x},\bm{n}-\bm{z}/\alpha\rangle)=0$ according to the proof of Proposition 2.

Finally, let $c:=\mathbb{E}(\|\bm{n}-\bm{z}/\alpha\|^{2})$; then the desired equality follows from Equations (33) and (32). The proof is completed. ∎

Proposition (restated from Section III).

Proof.

To simplify the notation, let $\bm{p}:=R^{\epsilon,\delta}(\hat{\bm{y}})$ and $\hat{\bm{p}}:=\hat{R}^{\epsilon,\delta}(\hat{\bm{y}})$.

I). By Lemma 3, $R^{\epsilon,\delta}+t(\hat{R}^{\epsilon,\delta}-R^{\epsilon,\delta})\in\mathcal{R}_{\epsilon}$ for $t\in[0,1]$. It follows from the definition of $R^{\epsilon,\delta}$ in (22) that

\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)-\delta\leq\mathbb{E}\left(\|\bm{p}+t(\hat{\bm{p}}-\bm{p})-\bm{x}\|^{2}\right). \qquad (34)

Besides, the right hand side of the above inequality can be decomposed as

\begin{split}&\mathbb{E}\left(\|\bm{p}+t(\hat{\bm{p}}-\bm{p})-\bm{x}\|^{2}\right)\\=&\ \mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)+2t\,\mathbb{E}\left(\langle\bm{p}-\bm{x},\,\hat{\bm{p}}-\bm{p}\rangle\right)+t^{2}\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right).\end{split} \qquad (35)

Combining (34) and (35), one has

2\mathbb{E}\left(\langle\bm{p}-\bm{x},\,\hat{\bm{p}}-\bm{p}\rangle\right)\geq-s\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)-\delta/s,\quad\forall s\in(0,1].

Letting $t=1$, Equation (35) and the last inequality force

(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)\leq\mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)-\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)+\delta/s,\quad\forall s\in(0,1]. \qquad (36)

II). With the same argument as in I), for $\hat{R}^{\epsilon,\delta}$ one can show that, for any $s\in(0,1]$,

\begin{split}(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)&\leq\mathbb{E}\left(\|\bm{p}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)-\mathbb{E}\left(\|\hat{\bm{p}}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)+\delta/s\\&=\mathcal{J}\left(R^{\epsilon,\delta}\right)-\mathcal{J}\left(\hat{R}^{\epsilon,\delta}\right)+\delta/s.\end{split} \qquad (37)

Note that (36) and (37) give two upper bounds on $\mathbb{E}(\|\hat{\bm{p}}-\bm{p}\|^{2})$. Averaging both sides of the two inequalities,

\begin{split}(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)\leq&\ \mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)/2-\mathcal{J}\left(\hat{R}^{\epsilon,\delta}\right)/2\\&+\mathcal{J}\left(R^{\epsilon,\delta}\right)/2-\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)/2+\delta/s.\end{split}

If we set $s=\sqrt{\delta}$, the desired inequality then follows from Proposition 2. ∎

Proposition (restated from Section III).

Proof.

Let $\bm{p}:=R^{\epsilon,\delta}(\hat{\bm{y}})$ and $\hat{\bm{p}}:=\hat{R}^{\epsilon,\delta}(\hat{\bm{y}})$. According to Equation (8), we have the decomposition

\begin{split}\bm{p}&=g_{1}(\bm{x})+L_{1}\hat{\bm{n}}+\bm{e}_{1},\\ \hat{\bm{p}}&=g_{2}(\bm{x})+L_{2}\hat{\bm{n}}+\bm{e}_{2},\end{split} \qquad (38)

where $L_{i}$ is linear and depends on $\bm{x}$, and $\|\bm{e}_{i}\|\leq\epsilon$ for $i\in\{1,2\}$. Let $\bm{q}:=\bm{p}-\bm{e}_{1}+\bm{e}_{2}$; then

\bm{q}(\hat{\bm{y}})=g_{1}(\bm{x})+L_{1}\hat{\bm{n}}+\bm{e}_{2}, \qquad (39)

as a function of $\hat{\bm{y}}$, is well defined, because the distribution of $\bm{x}$ is a delta function and $\bm{q}(\hat{y}_{1})=\bm{q}(\hat{y}_{2})$ whenever $\hat{y}_{1}=\hat{y}_{2}$ for any realizations $\hat{y}_{1}$, $\hat{y}_{2}$ of $\hat{\bm{y}}$. Furthermore, it is easy to check that $\bm{q}\in\mathcal{R}_{\epsilon}$.

Analogous to the argument in (37), for $\bm{q}\in\mathcal{R}_{\epsilon}$ we have

(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{q}\|^{2}\right)\leq\mathbb{E}\left(\|\bm{q}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)-\mathbb{E}\left(\|\hat{\bm{p}}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)+\delta/s, \qquad (40)

for any $s\in(0,1]$. It follows from Equation (15) that

\begin{split}\mathbb{E}\left(\|\bm{q}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)&=\mathbb{E}\left(\|\bm{q}-\bm{x}\|^{2}\right)-2\mathbb{E}\left(\langle\bm{e}_{2},\bm{n}-\bm{z}/\alpha\rangle\right)+c,\\ \mathbb{E}\left(\|\hat{\bm{p}}-(\bm{y}-\bm{z}/\alpha)\|^{2}\right)&=\mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)-2\mathbb{E}\left(\langle\bm{e}_{2},\bm{n}-\bm{z}/\alpha\rangle\right)+c\end{split}

for some constant $c$. This together with (40) gives

(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{q}\|^{2}\right)\leq\mathbb{E}\left(\|\bm{q}-\bm{x}\|^{2}\right)-\mathbb{E}\left(\|\hat{\bm{p}}-\bm{x}\|^{2}\right)+\delta/s.

Combining the last inequality with (36), one has

\begin{split}&(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{q}\|^{2}\right)+(1-s)\,\mathbb{E}\left(\|\hat{\bm{p}}-\bm{p}\|^{2}\right)\\ \leq&\ \mathbb{E}\left(\|\bm{q}-\bm{x}\|^{2}\right)-\mathbb{E}\left(\|\bm{p}-\bm{x}\|^{2}\right)+2\delta/s.\end{split} \qquad (41)

Next, we show that $\mathbb{E}(\|\bm{q}-\bm{x}\|^{2})-\mathbb{E}(\|\bm{p}-\bm{x}\|^{2})\leq\epsilon^{2}$, which together with (41) gives the desired result (24) by setting $s=\sqrt{\delta}$.

From Remark 1, the definitions of $L_{1}$ and $\bm{e}_{1}$ in (38) imply

\begin{split}\mathbb{E}\left(\|\bm{e}_{1}\|^{2}\mid\bm{x}\right)&=\mathbb{E}\left(\|\bm{p}-g_{1}(\bm{x})-L_{1}\hat{\bm{n}}\|^{2}\mid\bm{x}\right)\\&\leq\mathbb{E}\left(\|\bm{p}-g_{1}(\bm{x})-L_{1}\hat{\bm{n}}+tL\hat{\bm{n}}\|^{2}\mid\bm{x}\right)\end{split}

for any $t\in\mathbb{R}$ and any linear $L$. This forces $\mathbb{E}(\langle L\bm{\hat{n}},\bm{e}_{1}\rangle\mid\bm{x})=0$. Furthermore, the conditional mean of $\bm{e}_{1}$ is zero, i.e., $\mathbb{E}(\bm{e}_{1}\mid\bm{x})=0$. Similarly, one can show that $\mathbb{E}(\langle L\bm{\hat{n}},\bm{e}_{2}\rangle\mid\bm{x})=0$ and $\mathbb{E}(\bm{e}_{2}\mid\bm{x})=0$. Therefore,

$$\mathbb{E}\!\left(\langle g_{1}(\bm{x})+L_{1}\bm{\hat{n}}-\bm{x},\,\bm{e}_{i}\rangle\mid\bm{x}\right)=0,\quad\text{for }i\in\{1,2\},$$

and consequently,

$$\begin{split}&\mathbb{E}\!\left(\|\bm{q}-\bm{x}\|^{2}\mid\bm{x}\right)\\ =\ &\mathbb{E}\!\left(\|g_{1}(\bm{x})+L_{1}\bm{\hat{n}}+\bm{e}_{2}-\bm{x}\|^{2}\mid\bm{x}\right)\\ \leq\ &\mathbb{E}\!\left(\|\bm{e}_{2}\|^{2}\mid\bm{x}\right)+\mathbb{E}\!\left(\|g_{1}(\bm{x})+L_{1}\bm{\hat{n}}-\bm{x}\|^{2}\mid\bm{x}\right)\\ \leq\ &\mathbb{E}\!\left(\|\bm{e}_{2}\|^{2}\mid\bm{x}\right)+\mathbb{E}\!\left(\|g_{1}(\bm{x})+L_{1}\bm{\hat{n}}+\bm{e}_{1}-\bm{x}\|^{2}\mid\bm{x}\right)\\ =\ &\mathbb{E}\!\left(\|\bm{e}_{2}\|^{2}\mid\bm{x}\right)+\mathbb{E}\!\left(\|\bm{p}-\bm{x}\|^{2}\mid\bm{x}\right).\end{split}$$

Taking expectation over $\bm{x}$, the last inequality leads to

$$\mathbb{E}\!\left(\|\bm{q}-\bm{x}\|^{2}\right)\leq\epsilon^{2}+\mathbb{E}\!\left(\|\bm{p}-\bm{x}\|^{2}\right),$$

which completes the proof. ∎

Appendix B Additional details for Subsection IV-D

In this section, we give further details of the experiments on real microscopy images. We first provide the training specification for the denoising networks, and then discuss the estimation of the noise variance required by the training process. Finally, we present a noise variance refinement method that helps us find a more accurate noise variance. The general experimental setting is similar to that of Subsection IV-B.

Training specification. We use the DnCNN network architecture [41] throughout the experiments. During training, the images are cropped into overlapping patches using a sliding window of size $40\times 40$ and a step size of $10$. Additionally, for the N2DH-GOWT1 and C2DL-MSC datasets, since the major regions of the images are low-intensity background, we order the patches by their average pixel intensity (from low to high) and randomly delete $80\%$ of the first $80\%$ of the patches, so that the network pays more attention to the foreground objects during training (see the sketch below). For all three datasets, we apply random flipping to augment the training images. The parameter $\gamma$ is set to $2^{2}$ for N2DH-GOWT1 and C2DL-MSC, and to $2^{6}$ for TOMO110. The optimization process follows that of Subsection IV-B, except that it uses the estimated noise variance introduced below.
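As an illustration of this patch pipeline, the following is a minimal Python sketch of the cropping and intensity-ordered subsampling; the function names and the NumPy implementation are ours, not the authors' released training code.

```python
import numpy as np

def extract_patches(img, size=40, step=10):
    """Crop a 2-D image into overlapping patches with a sliding window."""
    h, w = img.shape
    patches = [img[r:r + size, c:c + size]
               for r in range(0, h - size + 1, step)
               for c in range(0, w - size + 1, step)]
    return np.stack(patches)

def subsample_dark_patches(patches, head=0.8, drop=0.8, seed=0):
    """Order patches by average intensity (low to high) and randomly
    delete a fraction `drop` of the first `head` fraction, so that
    mostly background patches are discarded."""
    rng = np.random.default_rng(seed)
    order = np.argsort(patches.mean(axis=(1, 2)))  # dark -> bright
    n_head = int(head * len(order))
    n_keep = n_head - int(drop * n_head)           # survivors among the dark patches
    keep_head = rng.choice(order[:n_head], size=n_keep, replace=False)
    return patches[np.concatenate([keep_head, order[n_head:]])]
```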

Noise variance estimation. We estimate the noise variance as follows.

1). For the datasets N2DH-GOWT1 and C2DL-MSC, the variance is estimated via the mean squared difference of neighboring pixels in smooth regions of the images. Firstly, given a noisy image $y$, we define the average squared difference at pixel $i$ as

$$d_{i}^{2}=\sum_{j\in\mathcal{N}(i)}(y_{i}-y_{j})^{2}/(2|\mathcal{N}(i)|)$$

where $\mathcal{N}(i)$ denotes the 4-connected neighbors of $i$. Secondly, we compute a smooth version of $y$,

$$s=k*y$$

where $*$ denotes convolution and $k$ is a uniform $10\times 10$ kernel whose entries sum to $1$. Thirdly, to find the smooth regions of the images, we define

$$\mathcal{F}=\left\{i\mid f_{i}\leq 0.02\,(s_{i}-\min_{j}s_{j})\right\}$$

where $f=k*\left((s-y)\odot(s-y)\right)$ and $\odot$ denotes element-wise multiplication. Finally, we assume that the noise variance depends only on the pixel intensity $v$, and the estimated variance is computed as

$$V(v)={\rm mean}\left\{d_{i}^{2}\mid i\in\mathcal{F},\ |s_{i}-v|<0.5/255\right\}.$$

In our experiments, $V(v)$ is averaged over all images $y$ in the training set. The computed variance is plotted in Fig. 12, where the blue squares indicate $V(v)$. We observe that $V(\cdot)$ follows a linear pattern, so we approximate $V(v)$ by a linear function $V(v)\approx(v-\mu)/\lambda$ (see the red lines in Fig. 12), where $\mu$ and $\lambda$ are parameters. For N2DH-GOWT1, $\mu=0$ and $\lambda=69$; for C2DL-MSC, $\mu=0.09$ and $\lambda=76$, with $V(v)=0$ for $v\leq\mu$. A code sketch of this procedure is given after Fig. 12.

Figure 12: Initial noise variance estimates. (a) N2DH-GOWT1; (b) C2DL-MSC.
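For concreteness, the estimation procedure above can be sketched in NumPy/SciPy as follows. This is an illustrative reimplementation under our own assumptions (uniform filtering for the convolutions with $k$, and $d_i^2$ computed on interior pixels only); all function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def neighbour_sq_diff(y):
    """d_i^2: average squared difference over the 4-connected neighbours,
    computed on interior pixels and zero-padded back to the input shape."""
    c = y[1:-1, 1:-1]
    d2 = ((c - y[:-2, 1:-1]) ** 2 + (c - y[2:, 1:-1]) ** 2 +
          (c - y[1:-1, :-2]) ** 2 + (c - y[1:-1, 2:]) ** 2) / 8.0
    return np.pad(d2, 1)

def estimate_V(images, win=10, thresh=0.02, tol=0.5 / 255):
    """V(v) = mean{d_i^2 : i in F, |s_i - v| < tol}, on a grid of v."""
    d2_list, s_list = [], []
    for y in images:
        s = uniform_filter(y, size=win)             # s = k * y
        f = uniform_filter((s - y) ** 2, size=win)  # f = k * ((s - y) (x) (s - y))
        mask = f <= thresh * (s - s.min())          # smooth-region set F
        mask[0, :] = mask[-1, :] = False            # drop zero-padded borders
        mask[:, 0] = mask[:, -1] = False
        d2_list.append(neighbour_sq_diff(y)[mask])
        s_list.append(s[mask])
    d2, s = np.concatenate(d2_list), np.concatenate(s_list)
    grid = np.linspace(s.min(), s.max(), 128)
    V = np.array([d2[np.abs(s - v) < tol].mean()
                  if np.any(np.abs(s - v) < tol) else np.nan
                  for v in grid])
    return grid, V
```

The linear approximation $V(v)\approx(v-\mu)/\lambda$ can then be fitted to the valid bins of $(\mathrm{grid},V)$, e.g. with `np.polyfit`.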

2). For TOMO110, since the training image has multiple frames (a consequence of dose-fractionated acquisition), we estimate the variance from the differences among the frames. Specifically, the raw dose-fractionated data consists of frames $a^{(1)},a^{(2)},\dots,a^{(n)}$, where $n$ is the number of frames and the frames $a^{(j)}$ have independent noise; in the dataset considered here, $n=10$. The observed image is the sum of all frames, i.e., $y=\sum_{j=1}^{n}a^{(j)}$. From the variance among the $a^{(j)}$, the noise variance at pixel $i$ is estimated as

$$V(i)=\frac{n}{n-1}\sum_{j=1}^{n}\left(a_{i}^{(j)}-y_{i}/n\right)^{2}.$$

Note that $\frac{1}{n-1}\sum_{j=1}^{n}(a_{i}^{(j)}-y_{i}/n)^{2}$ is the unbiased sample variance of a single frame, and the variance of the sum $y_{i}$ is $n$ times the per-frame variance, which gives the factor $n/(n-1)$ above. We emphasize that $V(i)$ estimates the noise variance at pixel $i$, and that our method denoises the whole image $y$ rather than the individual frames $a^{(j)}$ (see the sketch below). In the Noise2Noise approach, in contrast, the network learns to denoise half of the frames and therefore does not make full use of $y$ during the denoising process.
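A per-pixel implementation of this estimator might look as follows; the function name and the (frames, height, width) array layout are our assumptions.

```python
import numpy as np

def framewise_variance(frames):
    """Per-pixel noise variance V(i) of y = sum_j a^(j), estimated from
    n independent frames stacked as an array of shape (n, H, W)."""
    a = np.asarray(frames, dtype=np.float64)
    n = a.shape[0]
    y = a.sum(axis=0)
    # unbiased per-frame sample variance, scaled by n because
    # Var(y_i) = n * Var(a_i^(j)) for independent frames
    return n / (n - 1) * ((a - y / n) ** 2).sum(axis=0)
```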

Noise variance refinement. The rough noise variance estimates for the N2DH-GOWT1 and C2DL-MSC datasets are refined using the trained models. Specifically, a denoising network is first trained on the rough noise estimates $V(v)$. The trained network is then fine-tuned for several values of $\lambda$ close to the initial estimates. For fine-tuning, we apply $50000$ optimization steps, leaving the other hyperparameters unchanged. This yields several denoising models, one per value of $\lambda$. For each model, we compute the term $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$ for a constant image $\bm{x}$ by using Remark 1 (one possible computation is sketched after Fig. 13). Since the ground truth noise variance is unknown, we generate $\bm{n}$ using the estimated variance parameterized by $\lambda$. The results are shown in Fig. 13. Motivated by the experimental results in Subsection IV-B2 and Fig. 5, we select the $\lambda$ values that lead to a small positive $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$. Our final estimate of $\lambda$ is $69$ for N2DH-GOWT1 and $88$ for C2DL-MSC. No refinement is performed for TOMO110 because its noise variance is estimated per pixel.

Figure 13: Refining noise variance estimates using the learned denoisers. (a) N2DH-GOWT1; (b) C2DL-MSC.
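Since Remark 1 is not restated here, the sketch below shows one plausible Monte-Carlo estimate of $\mathbb{E}(\langle\bm{z},L\bm{z}\rangle)$: for a partially linear denoiser $R(\bm{y})=g(\bm{x})+L\bm{n}+\bm{e}$, the difference $R(\bm{y}+\bm{z})-R(\bm{y})$ equals $L\bm{z}$ up to residual terms, so its inner product with $\bm{z}$ estimates the desired quantity. The PyTorch code and all names below are hypothetical, not the authors' exact computation.

```python
import torch

@torch.no_grad()
def probe_zLz(model, x_const, sigma, n_trials=100):
    """Monte-Carlo estimate of E<z, Lz> for a constant image x_const.

    sigma may be a scalar or a per-pixel tensor (e.g. derived from the
    lambda-parameterized variance estimate); it broadcasts over the image.
    """
    model.eval()
    total = 0.0
    for _ in range(n_trials):
        n = sigma * torch.randn_like(x_const)  # synthetic measurement noise
        z = sigma * torch.randn_like(x_const)  # auxiliary probe noise
        y = x_const + n
        # R(y + z) - R(y) is approximately L z for a partially linear denoiser
        total += torch.sum(z * (model(y + z) - model(y))).item()
    return total / n_trials
```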
Figure 14: Additional visual results for deblurring. Columns, left to right: ground truth, blurry image, fully supervised, Xia et al. [39], ours. The images are taken from the test sets of Helen [18] and CelebA [24].

Appendix C Additional results about deblurring

Fig. 14 contains additional visual outputs of our model, compared with the fully supervised baseline and the method of Xia et al. [39]. The experimental setting is described in Subsection IV-E. The results show that, although it learns without ground truth images, our model recovers the major structures of the faces, with a quality comparable to the Noise2Noise-based method of Xia et al. [39].