
Self-Verification in Image Denoising

Huangxing Lin1,   Yihong Zhuang1,  Delu Zeng2,   Yue Huang1,  Xinghao Ding1*,   John Paisley3
1Xiamen University, 2South China University of Technology, China
3 Columbia University, USA
*Corresponding author: [email protected]
Abstract

We devise a new regularization for image denoising, called self-verification. This regularization is formulated using a deep image prior learned by the network, rather than a traditional predefined prior. Specifically, we treat the output of the network as a "prior" that we denoise again after "re-noising". The comparison between the re-denoised image and its prior can be interpreted as a self-verification of the network's denoising ability. We demonstrate that self-verification encourages the network to capture the low-level image statistics needed to restore the image. Based on this self-verification regularization, we further show that the network can learn to denoise even if it has never seen a clean image. This learning strategy is self-supervised, and we refer to it as Self-Verification Image Denoising (SVID). SVID can be seen as a mixture of learning-based and traditional model-based denoising methods, in which the regularization is adaptively formulated using the output of the network. We show the application of SVID to various denoising tasks using only observed corrupted data, where it achieves denoising performance close to that of supervised CNNs.

1 Introduction

Images captured by various devices are prone to noise and corruption due to limited imaging conditions such as low light and slow shutter speed. Denoising is the problem of recovering the underlying clean image from its noisy measurements. While sometimes the purpose may be aesthetic, downstream computer vision tasks, such as detection and segmentation, are also greatly facilitated by having minimally corrupted inputs.

A noisy image $y$ is usually modeled as

$y = x + n$, (1)

where $n$ represents the noise and $x$ is the clean image to be restored. This inverse problem is challenging because the statistics of $n$ are usually unknown and often complex.

In general, model-based image denoising algorithms aim to solve this problem through optimizing functions of the form

$x^{*} = \arg\min_{x} \|x - y\|_2^2 + \alpha R(x)$, (2)

where $\|x - y\|_2^2$ is a data fidelity term that ensures the solution agrees with the observation, $R(x)$ is a regularizer and $\alpha$ is a trade-off parameter. The first challenge of Eq. (2) is choosing $R(x)$, which encodes pre-known image properties and directs the solution toward a more plausible image. (Given this sense of $R(x)$, we refer to it as a "prior", although we are not taking a Bayesian perspective.) Some common priors for constructing $R(x)$ include total variation, non-local self-similarity, and others [5, 17]. However, these preselected priors often fail to model the finer properties of natural images, and so often lead to degraded results.
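For illustration, Eq. (2) with a total-variation prior can be solved with off-the-shelf tools; a minimal sketch using scikit-image, where the noise level and regularization weight are arbitrary demonstration values:

```python
import numpy as np
from skimage import data, img_as_float
from skimage.restoration import denoise_tv_chambolle

# Synthesize a noisy observation y = x + n as in Eq. (1).
x = img_as_float(data.camera())
rng = np.random.default_rng(0)
y = x + rng.normal(scale=25 / 255.0, size=x.shape)

# Solve Eq. (2) with R(x) = total variation; `weight` plays the role of alpha.
x_hat = denoise_tv_chambolle(y, weight=0.1)
```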

Recently, supervised deep networks have achieved unprecedented success in image denoising by learning image priors and noise statistics from pairs of noisy/clean images [32]. Most CNN denoisers, e.g., DnCNN [29] and VDN [26], outperform model-based denoisers. Unfortunately, in most cases collecting a large amount of realistic paired data is difficult, which limits the application of supervised CNNs. This leads to the topic of learning to denoise without paired data, the common approach of non-deep models but less addressed by neural networks. Some self-supervised methods [12] show that only noisy data is required to train a denoising network under certain mild assumptions (e.g., that the noise is zero-mean). One of these methods, the deep image prior (DIP) [23], has attracted much attention because it handles image denoising from a unique perspective. DIP found that the structure of a CNN is naturally an implicit regularizer for image denoising. Specifically, DIP uses a deep CNN to reparameterize a noisy image $y$, i.e.,

$\min_{\theta} \|F_{\theta}(z) - y\|_2^2$, (3)

where $z$ is a fixed random vector, $F_{\theta}(\cdot)$ represents the deep CNN and $\theta$ is its parameters. The optimization trajectory of Eq. (3) passes through a good local optimum (i.e., a clean image) before overfitting to $y$, so denoising can be completed by stopping training early. Moreover, we found that similar results (see Figure 1) can be observed even if $z$ in Eq. (3) is replaced by $y$,

$\min_{\theta} \|F_{\theta}(y) - y\|_2^2$. (4)

The effectiveness of DIP relies on the fact that the CNN naturally tends to capture low-level image statistics, while showing strong “impedance” to unstructured noise [23]. DIP demonstrates the potential of neural networks to learn image priors without clean labels.
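To make this behavior concrete, a minimal PyTorch sketch of Eq. (4) with early stopping; the small network, learning rate, and iteration budget are our own illustrative assumptions, not the configuration of [23]:

```python
import torch
import torch.nn as nn

# Small illustrative denoiser; DIP-style setups typically use a deeper encoder-decoder.
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

y = torch.rand(1, 3, 128, 128)  # stand-in for a noisy image
for it in range(3000):          # stop early, before the network overfits to y
    loss = ((net(y) - y) ** 2).mean()  # Eq. (4)
    opt.zero_grad()
    loss.backward()
    opt.step()
x_hat = net(y).detach()  # denoised estimate if halted at a good iteration
```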

Figure 1: Visual results obtained by minimizing Eq. (4). The superscript represents the number of training iterations.
Figure 2: Illustration of self-verification image denoising (SVID). Self-verification consists of two steps: adaptive noise degradation and similarity comparison; it refers to using the output of the network to verify the network's own denoising ability. The dimensions of $m$ and $\hat{n}$ are the same. For better visualization, only nine elements of $m$ are shown.

1.1 Motivation

While DIP is an effective unsupervised approach, a drawback for real-world applications is its time-consuming training, which requires human intervention to stop. Inspired by DIP, in this paper we explore a more flexible use of the neural-network image prior. Like DIP, we focus only on self-supervised cases; the difference is that our goal is to formulate an explicit regularization for denoising using the prior determined by the network.

We use a deep CNN to perform the denoising task in Eq. (2), so Eq. (2) can be rewritten as

$\theta^{*} = \arg\min_{\theta} \|F_{\theta}(y) - y\|_2^2 + \alpha R(F_{\theta}(y))$. (5)

Since DIP has confirmed the affinity of deep networks for learning low-level image statistics, we expect that when solving Eq. (5) this network will likewise capture low-level image statistics in early iterations before attempting to memorize the noise. Since the output of the network (i.e., $F_{\theta}(y)$) contains the low-level statistics required to restore the image, we consider using $F_{\theta}(y)$ as the prior for image denoising. Put another way: can the output of the denoising network be used to construct a regularization for itself? In this paper, we give a feasible solution to this question.

1.2 Our Proposed Contribution

Suppose there is a second noise-degraded image $y_2$ in addition to $y$, where both share the same latent clean image $x$ and carry independent and identically distributed (i.i.d.) noise. Based on $y_2$ and $y$, we devise a new regularization term for image denoising. We could then consider for Eq. (5) the function

$\theta^{*} = \arg\min_{\theta} \|F_{\theta}(y) - y\|_2^2 + \alpha \|F_{\theta}(y_2) - F_{\theta}(y)\|_2^2$. (6)

Since $y_2$ and $y$ have different noise, $F_{\theta^{*}}(y) = x$ is obviously a good candidate solution for minimizing Eq. (6). However, in most real-world applications $y$ is available but a second noisy image $y_2$ is not. Instead, we use the output of the network to synthesize a new $y_2$,

$y_2 \equiv D(F_{\theta}(y)) = F_{\theta}(y) + n_2$, (7)

where $D(\cdot)$ is a predefined adaptive noise degradation function and $n_2$ is newly generated noise. We discuss $D(\cdot)$ later. Combining Eqs. (6) and (7), we have

$\theta^{*} = \arg\min_{\theta} \|F_{\theta}(y) - y\|_2^2 + \alpha \|F_{\theta}(D(F_{\theta}(y))) - F_{\theta}(y)\|_2^2$. (8)

The first term of Eq. (8) implicitly constrains the removed noise to be zero-mean. In the second term, if the noise of $D(F_{\theta}(y))$ and $y$ share similar statistical characteristics but $n_2 \neq n$, then $F_{\theta^{*}}(y) = x$ is still a good solution for Eq. (8). We refer to the second term of Eq. (8) as a self-verification regularization. In $\|F_{\theta}(D(F_{\theta}(y))) - F_{\theta}(y)\|_2^2$, $F_{\theta}(y)$ can be regarded as a deep image prior learned by the network, and the similarity comparison between $F_{\theta}(D(F_{\theta}(y)))$ and $F_{\theta}(y)$ can be seen as using a prior generated by the network to verify the denoising behavior of the network itself. We update the network to minimize this term; when $\theta$ converges and the network becomes a good denoiser, the discrepancy between $F_{\theta}(D(F_{\theta}(y)))$ and $F_{\theta}(y)$ should be small.
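For concreteness, a schematic PyTorch rendering of the objective in Eq. (8); this is our own sketch, with the degradation $D(\cdot)$ passed in abstractly (its concrete form appears in Eq. (9) below, and in practice $D$ also depends on $y$ through the removed noise):

```python
import torch

def objective_eq8(net, y: torch.Tensor, D, alpha: float) -> torch.Tensor:
    """Schematic loss of Eq. (8): data fidelity plus self-verification."""
    fy = net(y)                                # F_theta(y), the deep image prior
    y2 = D(fy)                                 # re-noised image, Eq. (7); D may close over y
    fidelity = ((fy - y) ** 2).mean()          # first term of Eq. (8)
    self_verif = ((net(y2) - fy) ** 2).mean()  # self-verification regularization
    return fidelity + alpha * self_verif
```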

With self-verification regularization, we further show how a deep CNN can learn to denoise with self-supervision. We call this training method Self-Verification Image Denoising (SVID). SVID does not rely on any noise statistics beyond the assumption of zero-mean noise. To perform self-verification, SVID uses a Siamese network as the backbone. Although SVID is trained only on noisy data, its denoising results are impressive: we demonstrate its superior performance in various denoising experiments, where it significantly outperforms other state-of-the-art self-supervised methods.

2 Related Work

In this section, we review related topics most relevant to this work, including model-based denoising methods, learning-based denoising methods and Siamese networks.

Most model-based denoising methods focus on using handcrafted priors [24, 2, 17, 34] to construct regularization. Over the past decades, various ideas have been proposed, all aiming to identify sources of inner structure in visual data. For example, the non-local self-similarity (NSS) prior [25] is widely used in image denoising; some well-known NSS-based methods include BM3D [4] and WNNM [7]. Other prominent techniques, such as wavelet coring [22], total variation [20] and low-rank assumptions [5], are also standard approaches.

In recent years, supervised deep learning with CNNs has shown excellent denoising performance [31]. Many methods adopt sophisticated network structures to achieve better denoising results [30, 27, 14]. Their effectiveness is due to their ability to learn image priors from large amounts of paired data [33]. However, they require paired noisy/clean images. To mitigate this issue, Lehtinen et al. [13] demonstrate that the denoising CNN can be trained with pairs of independent noisy measurements of the same scene. This training strategy is called Noise2Noise, and achieves denoising performance on par with general supervised learning methods.

To relax the supervised requirement, networks trained with only noisy images have recently been considered [12]. For instance, Noise2Void [11] proposes a blind-spot strategy, which trains a network to predict masked pixels from their neighbors so as to achieve denoising. The blind-spot strategy has strong denoising ability and is adopted by many self-supervised methods, such as Self2Self [18] and Noise2Self [1]. Later, Neighbor2Neighbor [9] showed that two different subsamples of a noisy input can also serve as training data for denoising.

As for the choice of network, Siamese networks are natural and effective tools for modeling invariance, and have therefore become a common structure in self-supervised representation learning [3, 28, 6]. Generally, the inputs are distorted versions of the same sample, and the network maximizes their similarity subject to different conditions. In this paper, we also use a Siamese network, not to learn a robust representation, but to discover the clean target shared between the inputs $y$ and $y_2$.

3 Methodology

Given a noisy dataset $\Omega^{noisy} = \{y^i\}_{i=1}^{N}$, we aim to use these noisy images to learn to denoise without using any clean images. We propose a self-verification image denoising (SVID) approach, as shown in Figure 2. SVID is derived from Eq. (8), in which the self-verification regularization plays the major role. Unlike DIP [23], which trains a new network for each image, we use all noisy images to train a shared denoising network by mini-batch gradient descent. For simplicity, we set the batch size to 1 in the objective function of SVID, but it is easy to extend to batch sizes greater than 1. After training, the denoising CNN in SVID is applied directly to new noisy images without additional learning.

3.1 Overview of SVID

Self-verification consists of two steps: adaptive noise degradation and similarity comparison. We use a Siamese network, where the goal is first to input $y$ to a CNN to obtain a denoised image $F_{\theta}(y)$ and its removed noise $\hat{n} = y - F_{\theta}(y)$. Since low-level image statistics are more easily captured by deep CNNs than noise, we treat $F_{\theta}(y)$ as a clean image. We then synthesize $y_2$ by the following adaptive noise degradation function,

$y_2 \equiv D(F_{\theta}(y)) = F_{\theta}(y) + n_2 = F_{\theta}(y) + m \odot \hat{n} = F_{\theta}(y) + m \odot (y - F_{\theta}(y))$, (9)

where $n_2 = m \odot \hat{n}$ is a new noise adaptively synthesized from $\hat{n}$, $\odot$ represents element-wise multiplication, and $m$ is a random binary ($1$ or $-1$) mask that is independent of $\hat{n}$ but has the same dimensions as $\hat{n}$. A new $m$ is sampled in each iteration of learning; each element of $m$ is $1$ with probability $p = 0.5$, and $-1$ otherwise.
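A minimal PyTorch sketch of the degradation $D(\cdot)$ in Eq. (9); the function name and signature are ours:

```python
import torch

def degrade(fy: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Adaptive noise degradation D(.) of Eq. (9): re-noise the denoised
    estimate with a randomly sign-flipped copy of the removed noise."""
    n_hat = y - fy  # removed noise, n_hat = y - F_theta(y)
    # Random +/-1 mask m, resampled on every call; each element is +1 with p = 0.5.
    m = torch.where(torch.rand_like(n_hat) < 0.5, 1.0, -1.0)
    return fy + m * n_hat  # y2 = F_theta(y) + m ⊙ n_hat
```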

Next, the network learns to denoise by maximizing the similarity between $F_{\theta}(y)$ and $F_{\theta}(y_2)$. This is achieved by solving the following optimization problem,

$\theta^{*} = \arg\min_{\theta} \|F_{\theta}(y_2) - \mathrm{Stopgrad}(F_{\theta}(y))\|_2^2$. (10)

"Stopgrad" means that the gradient does not backpropagate to $F_{\theta}(y)$, which helps prevent trivial solutions (i.e., $F_{\theta}(\cdot) = c$ for a constant $c$). The stop-gradient procedure is a common strategy for training Siamese networks [3].
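In PyTorch, the stop-gradient of Eq. (10) corresponds to `detach()`. A minimal sketch of the resulting loss, reusing the hypothetical `degrade` function above:

```python
import torch

def svid_loss(net, y: torch.Tensor) -> torch.Tensor:
    """Self-verification loss of Eq. (10); detach() realizes the Stopgrad."""
    fy = net(y).detach()  # F_theta(y), treated as a fixed target
    y2 = degrade(fy, y)   # adaptive noise degradation, Eq. (9)
    return ((net(y2) - fy) ** 2).mean()  # gradient flows only through net(y2)
```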


3.2 Discussion of SVID

We motivate the SVID objective of Eq. (10) as a modification of Eq. (6). From Eq. (6), we observe that we can expand the self-verification regularization $R(y, y_2) \equiv \|F_{\theta}(y_2) - F_{\theta}(y)\|_2^2$ as

$R(y, y_2) = \|F_{\theta}(y_2) - y + y - F_{\theta}(y)\|_2^2 = \|F_{\theta}(y) - y\|_2^2 + \|y - F_{\theta}(y_2)\|_2^2 - 2\langle y - F_{\theta}(y),\, y - F_{\theta}(y_2)\rangle$. (11)

We observe that the squared error $\|F_{\theta}(y) - y\|_2^2$ is implicitly included in $R(y, y_2)$, and therefore, modulo a rescaling of the regularization parameter $\alpha$, the data fidelity term is not necessary in Eq. (6). We also observe that the squared error term $\|y - F_{\theta}(y_2)\|_2^2$ is between $y$ and the output on $y_2$. Since the noise in $y$ and $y_2$ is assumed zero-mean and independent, we can view this as an adversarial term: it is not possible to model the noise in $y$ from the noise in $y_2$, so the network defaults to their shared structure.
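As a quick numerical sanity check of the expansion in Eq. (11), a throwaway NumPy snippet with stand-in vectors (not part of the method):

```python
import numpy as np

rng = np.random.default_rng(0)
y, fy, fy2 = rng.normal(size=(3, 1000))  # stand-ins for y, F(y), F(y2)

lhs = np.sum((fy2 - fy) ** 2)
rhs = (np.sum((fy - y) ** 2) + np.sum((y - fy2) ** 2)
       - 2 * np.dot(y - fy, y - fy2))
assert np.isclose(lhs, rhs)  # the expansion in Eq. (11) holds exactly
```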

In Eq. (11), we need the second noisy image $y_2$. We observe that if $y_2$ is synthesized by Eq. (9), then $y_2$ naturally has some desirable properties. Specifically,

1. Some image details may remain in $\hat{n}$. These details can be destroyed by multiplying with the random mask $m$, which makes $n_2$ a more natural-looking noise.

2. $\mathbb{E}(m) = 0$ leads to $\mathbb{E}(n_2) = \mathbb{E}(m)\,\mathbb{E}(\hat{n}) = 0$. Therefore, the noise of $y_2$ meets the zero-mean assumption (verified numerically in the sketch after this list).

3. Since $n_2$ is composed of $\hat{n}$, the statistical characteristics of $n_2$ and $\hat{n}$ should be similar, as shown in Figure 4. In addition, $m$ is random, so $n_2 \neq \hat{n}$, which further implies $y_2 \neq y$. If $y_2 = y$, the self-verification of Eq. (10) is meaningless.
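A short NumPy check of properties 2 and 3; a minimal sketch assuming Gaussian removed noise (sign-flipping preserves any distribution symmetric about zero):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hat = rng.normal(scale=25.0, size=1_000_000)  # removed noise (Gaussian example)
m = rng.choice([-1.0, 1.0], size=n_hat.shape)   # random +/-1 mask, p = 0.5
n2 = m * n_hat                                  # Eq. (9): n2 = m ⊙ n_hat

print(n2.mean())              # ~0: property 2, E(n2) = 0
print(n_hat.std(), n2.std())  # nearly identical spread: property 3
print((n2 != n_hat).mean())   # ~0.5 of elements flipped, so n2 != n_hat
```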

Figure 4: Statistical histograms of 20,000 noise examples of size $128 \times 128$ at (a) 1k, (b) 20k, (c) 300k and (d) 1500k training iterations. $n$ is the real noise; $\hat{n}$ and $n_2$ are the noise produced by SVID. As training progresses, the statistical distributions of $\hat{n}$ and $n_2$ gradually converge to the distribution of $n$.
Figure 5: Example results for Gaussian denoising, $\sigma = 25$ (SSIM, PSNR): (a) Clean; (b) Input, 0.491, 24.25; (c) BM3D, 0.915, 32.70; (d) DIP, 0.900, 32.00; (e) N2V, 0.915, 32.62; (f) Nb2Nb, 0.918, 32.79; (g) U-Net, 0.934, 33.96; (h) Ours, 0.930, 33.61.
Table 1: PSNR results (dB) on BSD300 with Gaussian, speckle and Poisson noise. U-Net is fully supervised and can be considered an upper bound on performance, since SVID also employs the U-Net architecture; N2N can likewise be understood as a supervised model that SVID approaches. SVID outperforms all other self-supervised models and non-deep models.

Noise      Test noise level        BM3D    NLH     DIP     N2V     S2S     Nb2Nb   N2N     U-Net   SVID
Gaussian   $\sigma = 25$           30.90   30.31   29.99   30.56   29.63   31.07   31.62   31.81   31.50
           $\sigma \in (0, 50]$    31.69   31.77   29.66   31.88   27.76   32.25   33.27   33.44   32.97
Speckle    $v = 0.1$               26.64   25.18   26.73   28.40   27.60   28.89   31.13   31.49   29.73
           $v \in (0, 0.2]$        26.70   26.10   27.06   28.77   27.53   29.29   31.39   31.86   30.10
Poisson    $\lambda = 30$          27.70   28.49   28.77   29.78   29.23   29.97   30.91   31.09   30.41
           $\lambda \in [5, 50]$   27.23   26.12   27.98   28.94   28.01   29.03   30.25   30.44   29.49

Our method uses a stop-gradient operation to avoid trivial solutions. We next discuss one hypothesis for why SVID is effective in practice, as our experiments will show. In our method, we use the network's output $F_{\theta}(y)$ and the adaptively synthesized $y_2$ to learn denoising, hoping that the learned knowledge transfers to the real noisy data $y$. SVID is an EM-like algorithm: it implicitly involves two sets of variables (i.e., $\theta$ and $F_{\theta}(y)$) and solves two underlying sub-problems. The presence of the stop-gradient is the consequence of introducing the extra set of variables.

We re-express the objective function of SVID in the following form,

$L(\theta, \eta_y) = \mathbb{E}_{y, y_2}\!\left[\|F_{\theta}(y_2) - \eta_y\|_2^2\right]$. (12)

Note that Eq. (12) is consistent with the objective function in Eq. (10); for clarity, we use $\eta_y$ to denote $F_{\theta}(y)$. With this formulation, we consider solving

$\min_{\theta, \eta_y} L(\theta, \eta_y)$. (13)

The problem in Eq. (13) can be solved by an alternating algorithm, fixing one set of variables and solving for the other. At each training step $t$, we randomly sample a noisy image $y$ from the given noisy dataset $\Omega^{noisy}$ and then alternately solve the two sub-problems:

$\eta_y^{(t)} \leftarrow F_{\theta^{(t-1)}}(y)$, (14)

$\theta^{(t)} \leftarrow \arg\min_{\theta} \|F_{\theta}(y_2) - \eta_y^{(t)}\|_2^2$, (15)

where "$\leftarrow$" denotes assignment.
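Putting the alternation together, a condensed training-loop sketch under the same assumptions as the earlier snippets (`net`, `opt`, and `degrade` as above; `noisy_loader` is a hypothetical iterator over noisy patches):

```python
for y in noisy_loader:                      # sample a noisy image (batch size 1)
    eta_y = net(y).detach()                 # Eq. (14): eta_y <- F_{theta^(t-1)}(y), held fixed
    y2 = degrade(eta_y, y)                  # synthesize y2 from the fixed prior, Eq. (9)
    loss = ((net(y2) - eta_y) ** 2).mean()  # Eq. (15): one gradient step on theta
    opt.zero_grad()
    loss.backward()
    opt.step()
```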

Figure 6: Example results for speckle denoising, $v = 0.1$ (SSIM, PSNR): (a) Clean; (b) Input, 0.603, 23.99; (c) BM3D, 0.860, 28.96; (d) DIP, 0.868, 29.67; (e) N2V, 0.937, 32.28; (f) Nb2Nb, 0.952, 32.64; (g) U-Net, 0.966, 35.55; (h) Ours, 0.956, 33.80.
Figure 7: Example results for Poisson denoising, $\lambda = 30$ (SSIM, PSNR): (a) Clean; (b) Input, 0.486, 20.71; (c) BM3D, 0.723, 24.15; (d) DIP, 0.790, 24.80; (e) N2V, 0.821, 25.53; (f) Nb2Nb, 0.832, 25.47; (g) U-Net, 0.852, 28.15; (h) Ours, 0.847, 27.93.

Solving for $\eta_y$. There is no optimization in this step, simply a redefinition of $\eta_y$ as the output of the network on $y$. Moreover, previous work on DIP has revealed that the CNN naturally tends to capture low-level image statistics. Therefore, the similarity between $\eta_y^{(t)}$ and $x$ will gradually increase.

Solving for $\theta$. $y_2$ is constructed by Eq. (9) using $\eta_y^{(t)}$. The more similar $\eta_y^{(t)}$ is to $x$, the closer the noise statistics of $y_2$ and $y$ are (see Figure 4 for illustration). The denoising network is improved by solving Eq. (15). $\eta_y^{(t)}$ is fixed in this subproblem, so the gradient does not backpropagate to it.

In short, at training step $t$ the network produces an output that serves as a "prior" target for denoising. $\eta_y^{(t)}$ contains the low-level image statistics required to restore a clean image, so we use it as a label to improve the denoising ability of the network. This in turn leads the network to produce a better prior at step $t+1$. By solving the two sub-problems alternately at each training step, $F_{\theta}(y)$ approaches the true underlying $x$.

3.3 Training Details

The denoising CNN in SVID is a simple U-Net [19]. We use PyTorch and Adam [10] with a batch size of 1 to train the network. The training images are randomly cropped into $128 \times 128$ patches before being input to the network. The learning rate is fixed to 0.0002 for the first 1,000,000 iterations and then linearly decays to 0 over the next 1,000,000 iterations. (We will include a reference here to our code and datasets.) After training, the CNN with fixed parameters $\theta^{*}$ is directly applied to other noisy images.
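A hedged sketch of the optimizer and learning-rate schedule described above (the U-Net itself is replaced by a stand-in layer):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

net = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the U-Net denoiser [19]
opt = torch.optim.Adam(net.parameters(), lr=2e-4)

HOLD, TOTAL = 1_000_000, 2_000_000
# Learning rate fixed at 2e-4 for the first 1M iterations,
# then linearly decayed to 0 over the next 1M iterations.
sched = LambdaLR(opt, lambda it: 1.0 if it < HOLD
                 else max(0.0, (TOTAL - it) / (TOTAL - HOLD)))
# Call sched.step() once per iteration, after opt.step().
```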

4 Experiments

We evaluate SVID on various denoising tasks.

4.1 Synthetic Noise

We collected 4744 images from the Waterloo Exploration Database [15] to synthesize noisy images for training. Several state-of-the-art denoising methods are adopted for performance comparison: the model-based methods BM3D [4] and NLH [8]; the self-learning methods DIP [23], Noise2Void (N2V) [11], Self2Self (S2S) [18] and Neighbor2Neighbor (Nb2Nb) [9]; and other deep learning methods, namely Noise2Noise (N2N) [13] and a common fully supervised U-Net. The U-Net is trained with noisy/clean image pairs, while N2N uses paired noisy/noisy images. For fairness, N2N, U-Net and our SVID adopt the same network architecture to perform denoising. BSD300 [16] is the test set for the following experiments.

Figure 8: Low-dose CT denoising example (SSIM, PSNR): (a) Clean (normal-dose); (b) Noisy (low-dose), 0.724, 19.55; (c) U-Net, 0.788, 25.80; (d) SVID, 0.779, 25.01.

Gaussian noise. We first study the effectiveness of SVID on additive Gaussian noise. Each training example is corrupted by zero-mean Gaussian noise with a random standard deviation $\sigma \in (0, 50]$. For testing, we synthesize two sets of noisy images, using a fixed noise level $\sigma = 25$ and a variable $\sigma \in (0, 50]$. We present the quantitative results in Table 1. Visual comparisons can be found in Figure 5. In addition, Figure 4 shows the statistical histograms of the noise removed by SVID during training.

Speckle noise. Multiplicative speckle noise is signal-dependent and is often observed in medical ultrasound and radar images. The speckle noise is modeled as a random value multiplied by the pixel value of the latent signal $x$, which can be expressed as $y = x + x \cdot n$, where $n$ is uniform noise with mean 0 and variance $v$. We vary the noise variance $v \in (0, 0.2]$ during training. The results and comparisons for speckle denoising can be seen in Table 1 and Figure 6.

Poisson noise. In our third experiment we consider Poisson noise, which can be used to model photon noise in imaging sensors. Poisson noise is also signal-dependent because its expected magnitude depends on the pixel brightness. We randomize the noise magnitude $\lambda \in [5, 50]$ separately for each training example. Quantitative and subjective comparisons are reported in Table 1 and Figure 7, respectively.
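For reference, the three noise models above can be synthesized as follows; this is our own NumPy rendering for a clean image in $[0, 1]$ (the paper gives no code, and the 0-255 scale for $\sigma$ is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((128, 128))  # stand-in for a clean image in [0, 1]

# Gaussian: zero-mean additive noise, random sigma in (0, 50] (on a 0-255 scale).
sigma = rng.uniform(0, 50)
y_gauss = x + rng.normal(scale=sigma / 255.0, size=x.shape)

# Speckle: y = x + x * n, with n uniform, zero-mean, variance v in (0, 0.2].
v = rng.uniform(0, 0.2)
half_width = np.sqrt(3 * v)  # Uniform(-a, a) has variance a^2 / 3
y_speckle = x + x * rng.uniform(-half_width, half_width, size=x.shape)

# Poisson: signal-dependent photon noise with magnitude lambda in [5, 50].
lam = rng.uniform(5, 50)
y_poisson = rng.poisson(lam * x) / lam
```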

Discussion. In the above comparisons, SVID shows excellent denoising results both subjectively and quantitatively. These results demonstrate that the output of the denoising network can serve as a prior for denoising, as this paper conjectures. SVID outperforms the other self-supervised methods (i.e., DIP, N2V, S2S, Nb2Nb) by a large margin, which confirms the effectiveness of self-verification: the denoising ability of the network is improved by verifying itself. In addition, SVID is better than the model-based methods BM3D and NLH, because the deep image prior captured by the network models the properties of natural images better than the traditional non-local self-similarity prior. We also note that SVID is inferior to the supervised methods U-Net and N2N, which is reasonable considering that SVID has no access to paired data for supervision. In short, SVID consistently exhibits encouraging denoising performance for various types of noise; the denoised images are clean and sharp. More importantly, SVID does not rely on paired data or pre-known noise statistics, which shows its potential value for many practical applications.

Table 2: Quantitative results for low-dose CT denoising. SVID approaches the fully supervised U-Net.

       Noisy   U-Net   SVID
SSIM   0.766   0.827   0.813
PSNR   22.61   28.56   27.20

4.2 Low-Dose CT Denoising

Computed tomography (CT) is a popular imaging modality in clinical diagnosis. Despite its healthcare benefits, the extensive use of CT has raised public concern about the potential risks of X-ray radiation. Reducing the radiation dose is an effective way to address this problem. However, a decrease in radiation dose is associated with an increase in noise and artifacts in the reconstructed image, which may adversely affect subsequent diagnosis. Therefore, denoising is a critical post-processing step for low-dose CT images [21]. Since the noise statistics in CT images are very complicated and paired training data is difficult to obtain, self-supervised denoising methods are particularly well suited to low-dose CT images. We further evaluate SVID on the task of low-dose CT denoising.

We use a real clinical dataset authorized for the 2016 NIH-AAPM-Mayo Clinic LDCT Grand Challenge by Mayo Clinic (https://www.aapm.org/GrandChallenge/LowDoseCT/). This dataset contains 5946 pairs of normal-/low-dose images with a slice thickness of 1 mm. We randomly select 5000 pairs of images as the training set and use the rest as the test set. We compare SVID with a fully supervised U-Net. Note that SVID only uses the low-dose CT images to learn denoising, without the normal-dose images. The comparison results are reported in Figure 8 and Table 2. As can be seen, SVID is also effective for the noise in CT images: it removes the majority of the noise and restores high-quality image details, and its performance is close to that of the fully supervised U-Net.

Figure 9: SVID with and without stop-gradient: (a) Clean; (b) Input; (c) w/o stop-gradient; (d) w/ stop-gradient. The noise in (b) is Gaussian ($\sigma = 25$).

4.3 Stop-gradient

Our SVID employs a Siamese network to perform self-verification. An undesired trivial solution for Siamese networks is that all outputs collapse to a constant. There are several strategies to prevent Siamese networks from collapsing, such as contrastive learning, clustering and stop-gradients [3]. In SVID, the stop-gradient operation is a natural consequence of introducing an extra variable. Here, our aim is to show that the stop-gradient is critical to preventing SVID from collapsing.

Figure 9 presents a comparison with and without the stop-gradient. Without it, all outputs of the denoising network collapse to 0 and the objective function of Eq. (12) reaches its minimum of 0. With the stop-gradient, the objective of Eq. (12) implicitly constrains the removed noise to be zero-mean, which prevents the denoised image from collapsing to 0. Moreover, the deep CNN tends to capture low-level image statistics, which encourages the denoised image to look like a clean natural image.

5 Conclusion

We have demonstrated that the output of a neural network can serve as a prior for image denoising. Based on this prior, we developed a new self-verification regularization with which, even without any clean data, a network can learn to denoise in a self-supervised manner. We demonstrated the effectiveness and broad applicability of the proposed method on several denoising tasks involving different noise properties and data types. We believe this study hints at the advantage of using CNNs to generate deep image priors for various computer vision tasks.

References

  • [1] J. Batson and L. Royer. Noise2self: Blind denoising by self-supervision. In ICML, 2019.
  • [2] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In CVPR, 2005.
  • [3] X. Chen and K. He. Exploring simple siamese representation learning. In CVPR, 2021.
  • [4] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
  • [5] W. Dong, G. Li, G. Shi, X. Li, and Y. Ma. Low-rank tensor approximation with laplacian scale mixture modeling for multiframe image denoising. In ICCV, 2015.
  • [6] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
  • [7] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In CVPR, 2014.
  • [8] Y. Hou, J. Xu, M. Liu, G. Liu, L. Liu, F. Zhu, and L. Shao. Nlh: A blind pixel-level non-local method for real-world image denoising. IEEE Transactions on Image Processing, 29:5121–5135, 2020.
  • [9] T. Huang, S. Li, X. Jia, H. Lu, and J. Liu. Neighbor2neighbor: Self-supervised denoising from single noisy images. In CVPR, 2021.
  • [10] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [11] A. Krull, T.-O. Buchholz, and F. Jug. Noise2void-learning denoising from single noisy images. In CVPR, 2019.
  • [12] S. Laine, T. Karras, J. Lehtinen, and T. Aila. High-quality self-supervised deep image denoising. In NeurIPS, 2019.
  • [13] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2noise: Learning image restoration without clean data. In ICML, 2018.
  • [14] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang. Non-local recurrent network for image restoration. In NeurIPS, 2018.
  • [15] K. Ma, Z. Duanmu, Q. Wu, Z. Wang, H. Yong, H. Li, and L. Zhang. Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2):1004–1016, 2016.
  • [16] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • [17] D. Meng and F. De La Torre. Robust matrix factorization with unknown noise. In ICCV, 2013.
  • [18] Y. Quan, M. Chen, T. Pang, and H. Ji. Self2self with dropout: Learning self-supervised denoising from single image. In CVPR, 2020.
  • [19] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [20] I. Selesnick. Total variation denoising via the moreau envelope. IEEE Signal Processing Letters, 24(2):216–220, 2017.
  • [21] H. Shan, A. Padole, F. Homayounieh, U. Kruger, R. D. Khera, C. Nitiwarangkul, M. K. Kalra, and G. Wang. Competitive performance of a modularized deep neural network compared to commercial algorithms for low-dose ct image reconstruction. Nature Machine Intelligence, 1(6):269–276, 2019.
  • [22] E. P. Simoncelli and E. H. Adelson. Noise removal via bayesian wavelet coring. In ICIP, 1996.
  • [23] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In CVPR, 2018.
  • [24] J. Xu, L. Zhang, and D. Zhang. A trilateral weighted sparse coding scheme for real-world image denoising. In ECCV, 2018.
  • [25] J. Yao, D. Meng, Q. Zhao, W. Cao, and Z. Xu. Nonconvex-sparsity and nonlocal-smoothness-based blind hyperspectral unmixing. IEEE Transactions on Image Processing, 28(6):2991–3006, 2019.
  • [26] Z. Yue, H. Yong, Q. Zhao, D. Meng, and L. Zhang. Variational denoising network: Toward blind noise modeling and removal. In NeurIPS, 2019.
  • [27] Z. Yue, Q. Zhao, L. Zhang, and D. Meng. Dual adversarial network: Toward real-world noise removal and noise generation. In ECCV, 2020.
  • [28] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.
  • [29] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
  • [30] K. Zhang, W. Zuo, and L. Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
  • [31] Y. Zhang, K. Li, K. Li, G. Sun, Y. Kong, and Y. Fu. Accurate and fast image denoising via attention guided scaling. IEEE Transactions on Image Processing, 30:6255–6265, 2021.
  • [32] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu. Residual non-local attention networks for image restoration. In ICLR, 2019.
  • [33] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2480–2495, 2020.
  • [34] Q. Zhao, D. Meng, Z. Xu, W. Zuo, and L. Zhang. Robust principal component analysis with complex noise. In ICML, 2014.