
Nahyun Kim* [email protected]
Donggon Jang* [email protected]
Sunhyeok [email protected]
Bomi [email protected]
Dae-Shik [email protected]
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea

Unsupervised Image Denoising with Frequency Domain Knowledge

Abstract

Supervised learning-based methods yield robust denoising results, yet they are inherently limited by the need for large-scale clean/noisy paired datasets. The use of unsupervised denoisers, on the other hand, necessitates a more detailed understanding of the underlying image statistics. In particular, it is well known that apparent differences between clean and noisy images are most prominent on high-frequency bands, justifying the use of low-pass filters as part of conventional image preprocessing steps. However, most learning-based denoising methods utilize only one-sided information from the spatial domain without considering frequency domain information. To address this limitation, we propose a frequency-sensitive unsupervised denoising method. To this end, a generative adversarial network (GAN) is used as a base structure, and we introduce a spectral discriminator and a frequency reconstruction loss to transfer frequency knowledge into the generator. Results on synthetic and real-world noise datasets indicate that our unsupervised learning method augmented with frequency information achieves state-of-the-art denoising performance, suggesting that frequency domain information could be a viable factor in improving the overall performance of unsupervised learning-based methods.

1 Introduction

Based on clean and noisy image pairs, supervised learning-based image denoisers have shown impressive performance compared to prior-based approaches. A large number of high-quality image pairs plays an important role in the performance of supervised learning-based methods. However, constructing large-scale paired datasets may be infeasible or expensive in real-world situations. For this reason, image denoising methods that do not require clean and noisy image pairs have recently drawn attention.

A noisy image $x$ is usually modeled as the sum of a clean background $y$ and noise $n$: $x=y+n$. The noise corrupts benign pixels, which makes it hard to distinguish noise from content in the spatial domain. In the frequency domain, however, noise and content can be identified more easily. As shown in Figure 1 (a), we observe that the noise lies in the high-frequency bands while semantic information lies in the low-frequency bands. Furthermore, in Figure 1 (b), we note that apparent differences between clean and noisy images are most prominent on high-frequency bands. This suggests that the frequency domain provides useful evidence for noise removal. However, recent learning-based denoisers overlook frequency domain information and use only one-sided information from the spatial domain.

Motivated by these observations, we propose an unsupervised denoising method that reflects frequency domain information. Specifically, with a generative adversarial network as a base structure, we introduce a spectral discriminator and a frequency reconstruction loss to transfer frequency knowledge to the generator. The spectral discriminator distinguishes denoised from clean images on high-frequency bands. By propagating this knowledge to the generator, the generator takes the frequency domain into account and thus produces visually more plausible denoised images to fool the spectral discriminator. The frequency reconstruction loss, combined with the cycle consistency loss, improves image quality and preserves image content while narrowing the gap between clean and denoised images in the frequency domain.

The main contributions of our method are summarized as follows: 1) We propose a GAN-based unsupervised image denoising method that preserves semantic information and produces high-quality noise-free images. 2) To the best of our knowledge, this is the first approach to explore the potential of the frequency domain, via the Fourier transform, for the noise removal task. The proposed spectral discriminator and frequency reconstruction loss make the generator concentrate on the noise and produce satisfying results; denoised images recovered by our method are close to clean reference images in both the spatial and frequency domains. 3) The proposed method outperforms existing unsupervised image denoisers by a considerable margin. Moreover, its performance is even comparable with supervised learning-based approaches trained with paired datasets.

2 Related Work

2.1 Image Denoising

Figure 1: Spectrum analysis in the frequency domain. (a) Visualization of images in the spatial domain and the corresponding spectrum maps in the frequency domain. (b) Statistics (mean and variance) after the azimuthal integral over the power spectrum of clean and noisy images from CBSD68. We use AWGN with a noise level $\sigma=50$ to yield the noisy images.

Recently, with the advent of deep neural networks, supervised learning-based image denoisers [Zhang et al.(2017)Zhang, Zuo, Chen, Meng, and Zhang, Zhang et al.(2018)Zhang, Zuo, and Zhang, Mao et al.(2016)Mao, Shen, and Yang] show promising performance on a set of clean and noisy image pairs. However, it is challenging to construct clean and noisy image pairs in a real-world scenario. To address the above issues, denoisers that do not rely on clean and noisy image pairs have been proposed [Lehtinen et al.(2018)Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala, and Aila, Krull et al.(2019)Krull, Buchholz, and Jug, Chen et al.(2018)Chen, Chen, Chao, and Yang, Du et al.(2020)Du, Chen, and Yang, Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky]. N2N [Lehtinen et al.(2018)Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala, and Aila] learns reconstruction using only noisy image pairs without ground-truth clean images. N2V [Krull et al.(2019)Krull, Buchholz, and Jug] estimates a corrupted pixel from its neighboring pixels based on a blind-spot mechanism. GCBD [Chen et al.(2018)Chen, Chen, Chao, and Yang] generates the noisy images while modeling the real-world noise distribution through the GAN [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] and trains the denoiser with pseudo clean and noisy image pairs. LIR [Du et al.(2020)Du, Chen, and Yang] trains an image denoiser by disentangling invariant representations from noisy images with an unpaired dataset.

2.2 Frequency Domain in CNNs

In traditional image processing, analyzing images in the frequency domain, after transforming them from the spatial domain, is known to be effective. Inspired by this idea, several works attempt to utilize information from the frequency domain in deep neural networks. Xu et al. [Xu et al.(2020)Xu, Qin, Sun, Wang, Chen, and Ren] accelerate the training of neural networks utilizing the discrete cosine transform. Dzanic et al. [Dzanic et al.(2019)Dzanic, Shah, and Witherden] observe, through the analysis of high-frequency Fourier modes, that a discrepancy exists between images generated by a GAN [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] and real images. In addition, attempts to utilize frequency domain information in various fields, including image forensics [Durall et al.(2020)Durall, Keuper, and Keuper, Frank et al.(2020)Frank, Eisenhofer, Schönherr, Fischer, Kolossa, and Holz, Zhang et al.(2019)Zhang, Karaman, and Chang], image generation [Chen et al.(2020)Chen, Li, Jin, Liu, and Li, Cai et al.(2020)Cai, Zhang, Huang, Geng, and Huang, Jiang et al.(2020)Jiang, Dai, Wu, and Loy], and domain adaptation [Yang et al.(2020)Yang, Lao, Sundaramoorthi, and Soatto, Yang and Soatto(2020)], are gradually increasing. However, image denoising methods combining frequency domain analysis with DNNs remain much less explored.

Figure 2: An overview of the proposed framework. Given an unpaired clean image $y_{c}$ and noisy image $x_{n}$, the generator $G_{n2c}$ for image denoising takes the noisy image $x_{n}$ as input and learns the mapping for noise removal. An additional network $G_{c2n}$ is used to impose cycle consistency. Three discriminators $D_{C}$, $D_{T}$, and $D_{S}$ try to distinguish the denoised image $G_{n2c}(x_{n})$ from the real clean image $y_{c}$ in both the spatial domain and the frequency domain. The whole framework is end-to-end trainable.

3 Method

In this section, we first introduce the spectral discriminator and frequency reconstruction loss that use information from the frequency domain. Then, we present an unsupervised framework for image denoising, integrating the proposed discriminator and loss with the GAN. The proposed framework is illustrated in Figure 2.

3.1 Frequency Domain Constraints

Spectral Discriminator

A simple way for the generator to consider the frequency domain is for the discriminator to transfer frequency domain knowledge to the generator. To this end, we propose a spectral discriminator, similar to that introduced by [Chen et al.(2020)Chen, Li, Jin, Liu, and Li], to measure spectral realness. We compute the discrete Fourier transform of 2D image data $f(w,h)$ of size $W\times H$ to feed frequency representations to the discriminator.

F(k,l)=\sum^{W-1}_{w=0}\sum^{H-1}_{h=0}f(w,h)\,e^{-2\pi i\frac{kw}{W}}e^{-2\pi i\frac{lh}{H}} \qquad (1)

for spectral coordinates $k=0,\dots,W-1$ and $l=0,\dots,H-1$.

Recent studies [Chen et al.(2020)Chen, Li, Jin, Liu, and Li, Durall et al.(2020)Durall, Keuper, and Keuper] show that the 1D representation of the Fourier power spectrum is sufficient to highlight spectral differences. Following their work, we transform the result of the Fourier transform to polar coordinates and compute the azimuthal integral over $\theta$.

F(r,\theta)=F(k,l):\quad r=\sqrt{k^{2}+l^{2}},\quad\theta=\arctan{\frac{l}{k}},\qquad AI(r)=\frac{1}{2\pi}\int^{2\pi}_{0}|F(r,\theta)|\,d\theta \qquad (2)

where $AI(r)$ denotes the average intensity of the image signal at radial distance $r$.
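
For illustration, the azimuthal integral of Eqs. (1)-(2) can be computed directly with PyTorch's FFT routines. The following is a minimal sketch; the function name, the binning by rounded integer radius, and the use of the centered (fft-shifted) spectrum are implementation choices of ours rather than details specified in the paper.

```python
import torch

def azimuthal_integral(img: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. (1)-(2): 2D DFT followed by an azimuthal average.

    img: (H, W) grayscale tensor. Returns a 1D vector AI(r) indexed by the
    rounded radial distance r from the spectrum center.
    """
    H, W = img.shape
    # Eq. (1): 2D discrete Fourier transform, shifted so the zero frequency
    # sits at the center of the spectrum.
    spectrum = torch.fft.fftshift(torch.fft.fft2(img))
    magnitude = spectrum.abs().float()

    # Radial coordinate r = sqrt(k^2 + l^2), measured from the center.
    ky = (torch.arange(H) - H // 2).float()
    kx = (torch.arange(W) - W // 2).float()
    r = torch.sqrt(ky[:, None] ** 2 + kx[None, :] ** 2).round().long()

    # Eq. (2): average the magnitude over all angles for each radius, i.e. a
    # discrete azimuthal integral normalized by the number of samples per bin.
    n_bins = int(r.max().item()) + 1
    sums = torch.zeros(n_bins).scatter_add_(0, r.flatten(), magnitude.flatten())
    counts = torch.zeros(n_bins).scatter_add_(0, r.flatten(), torch.ones_like(magnitude.flatten()))
    return sums / counts.clamp(min=1.0)
```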

We design the spectral discriminator to make the generator focus on noise using high-frequency spectral information. To learn the differences on high-frequency bands, we pass the 1D spectral vector through a high-pass filter $F_{hp}$ and feed the result to the spectral discriminator.

v_{I}=F_{hp}(AI(r)),\qquad F_{hp}(x)=\begin{cases}x, & r>r_{\tau}\\ 0, & \text{otherwise}\end{cases} \qquad (3)

where $r_{\tau}$ is the threshold radius for high-pass filtering and $v_{I}$ is the high-pass filtered 1D spectral vector of an input $I$.
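
For illustration, the high-pass filtering of Eq. (3) amounts to masking the low-radius bins of this 1D vector; a minimal sketch, with the function name and masking style as our assumptions:

```python
import torch

def high_pass_1d(ai: torch.Tensor, r_tau: int) -> torch.Tensor:
    """Sketch of Eq. (3): keep only the radii r > r_tau of the 1D spectral
    vector AI(r) and zero out the rest."""
    r = torch.arange(ai.numel(), device=ai.device)
    return torch.where(r > r_tau, ai, torch.zeros_like(ai))
```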

Generally, the most distinct differences between clean and noisy images lie on high-frequency bands. Thus, if some noise remains in the denoised images, the spectral discriminator easily distinguishes the clean from the denoised images on these bands. By transferring this knowledge to the generator, the generator learns to yield visually more plausible images that fool the spectral discriminator.

Frequency Reconstruction Loss

Cai et al. [Cai et al.(2020)Cai, Zhang, Huang, Geng, and Huang] demonstrate the existence of a gap between real and generated images in the frequency domain, which leads to artifacts in the spatial domain. Motivated by this observation, we propose to use a frequency reconstruction loss together with the cycle consistency loss to improve the quality of denoised images while reducing this gap. The frequency reconstruction loss is complementary to the cycle consistency loss and enables the generator to consider the frequency domain; we also expect it to assist in generating high-quality denoised images. To compute the frequency reconstruction loss, we map an input $x_{n}$ and the reconstructed image $G_{c2n}(G_{n2c}(x_{n}))$ to the frequency domain using the Fourier transform. Then, we measure the difference between the two transforms and take a logarithm to normalize it. Finally, we minimize the following objective:

L_{Freq}=\log\left(1+\frac{1}{WH}\sum^{W-1}_{k=0}\sum^{H-1}_{l=0}\left|F_{x_{n}}(k,l)-F_{G_{c2n}(G_{n2c}(x_{n}))}(k,l)\right|\right) \qquad (4)
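
A possible PyTorch realization of Eq. (4) is sketched below; applying the loss per channel and averaging over the batch are our assumptions, since the paper only states the per-image form.

```python
import torch

def frequency_reconstruction_loss(x_n: torch.Tensor, x_rec: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (4). x_n is the noisy input and x_rec the reconstruction
    G_c2n(G_n2c(x_n)), both of shape (B, C, H, W)."""
    F_x = torch.fft.fft2(x_n)
    F_rec = torch.fft.fft2(x_rec)
    # Mean absolute spectral difference over the W*H frequency bins,
    # wrapped in log(1 + .) to normalize its magnitude.
    diff = (F_x - F_rec).abs().mean(dim=(-2, -1))
    return torch.log1p(diff).mean()
```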

3.2 Unsupervised Framework for Image Denoising

Our goal is to learn a mapping from a noise domain $X_{N}$ to a clean domain $Y_{C}$ given unpaired training images $x_{n}\in X_{N}$ and $y_{c}\in Y_{C}$. To learn this mapping, we use a CycleGAN-like framework consisting of two generators, $G_{n2c}$ and $G_{c2n}$, and three discriminators, $D_{C}$, $D_{T}$, and $D_{S}$. Given a noisy image $x_{n}$, the generator $G_{n2c}$ learns to generate a denoised image $G_{n2c}(x_{n})$. By distinguishing the denoised image $G_{n2c}(x_{n})$ from the real clean image $y_{c}$, the discriminator $D_{C}$ pushes the generator to produce denoised images closer to the real clean domain $Y_{C}$. To stabilize training, we use the Least Squares GAN (LSGAN) loss [Mao et al.(2017)Mao, Li, Xie, Lau, Wang, and Paul Smolley] as the adversarial loss. The LSGAN loss for $G_{n2c}$ and $D_{C}$ is:

L^{Clean}_{adv}=E_{y_{c}\sim P_{c}}[(D_{C}(y_{c}))^{2}]+E_{x_{n}\sim P_{n}}[(1-D_{C}(G_{n2c}(x_{n})))^{2}] \qquad (5)

where $P_{n}$ and $P_{c}$ are the data distributions of the domains $X_{N}$ and $Y_{C}$, respectively.
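
In practice the least-squares adversarial term is split into a discriminator update and a generator update. The sketch below follows the usual LSGAN convention (real label 1, fake label 0), which Eq. (5) compresses into a single min-max expression; the decomposition and the function name are ours.

```python
import torch

def lsgan_losses(D_C, y_c: torch.Tensor, denoised: torch.Tensor):
    """Sketch of the LSGAN objective behind Eq. (5). D_C is any
    clean-vs-denoised discriminator; returns (discriminator loss, generator loss)."""
    # Discriminator: push D_C(y_c) toward 1 and D_C(denoised) toward 0.
    d_loss = ((D_C(y_c) - 1) ** 2).mean() + (D_C(denoised.detach()) ** 2).mean()
    # Generator: fool the discriminator, i.e. push D_C(denoised) toward 1.
    g_loss = ((D_C(denoised) - 1) ** 2).mean()
    return d_loss, g_loss
```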

As introduced in [Wang and Yu(2020)], we adopt the texture discriminator $D_{T}$ to guide the generator to produce clean contours and preserve texture while removing the noise. Following the scheme of [Wang and Yu(2020)], a random color shift algorithm $M_{shift}$ is applied to the denoised image $G_{n2c}(x_{n})$. The texture loss for $G_{n2c}$ and $D_{T}$ is:

L^{Texture}_{adv}=E_{y_{c}\sim P_{c}}[(D_{T}(M_{shift}(y_{c})))^{2}]+E_{x_{n}\sim P_{n}}[(1-D_{T}(M_{shift}(G_{n2c}(x_{n}))))^{2}] \qquad (6)

As discussed in Section 3.1, we use the spectral discriminator $D_{S}$ to guide the generator toward more realistic images by reducing the gap between clean and denoised images in the frequency domain. The spectral loss for $G_{n2c}$ and $D_{S}$ is:

L^{Spectral}_{adv}=E_{y_{c}\sim P_{c}}[(D_{S}(v_{y_{c}}))^{2}]+E_{x_{n}\sim P_{n}}[(1-D_{S}(v_{G_{n2c}(x_{n})}))^{2}] \qquad (7)

where $v$ denotes the high-pass filtered 1D spectral vector from Eq. 3.

CycleGAN [Zhu et al.(2017)Zhu, Park, Isola, and Efros] imposes a two-sided cycle consistency constraint to learn one-to-one mappings between two domains. In contrast, we use only one-sided cycle consistency to maintain the content between noisy and denoised images. By incorporating a network $G_{c2n}$, we require $G_{c2n}(G_{n2c}(x_{n}))$ to be identical to the noisy image $x_{n}$. The cycle consistency loss is expressed as:

L_{CC}=\|x_{n}-G_{c2n}(G_{n2c}(x_{n}))\|_{1} \qquad (8)

where $\|\cdot\|_{1}$ is the L1 norm.

Furthermore, we add a reconstruction loss between $G_{c2n}(G_{n2c}(x_{n}))$ and $x_{n}$ to stabilize the training. We employ the negative SSIM loss [Wang et al.(2004)Wang, Bovik, Sheikh, and Simoncelli] and combine it with the frequency reconstruction loss $L_{Freq}$ in Eq. 4. The reconstruction loss is expressed as:

L_{Recon}=L_{Freq}(x_{n},G_{c2n}(G_{n2c}(x_{n})))+L_{SSIM}(x_{n},G_{c2n}(G_{n2c}(x_{n}))) \qquad (9)

where $L_{SSIM}(a,b)$ denotes the negative SSIM loss, $-SSIM(a,b)$.
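
Eqs. (8)-(9) could be assembled as in the sketch below, reusing the frequency reconstruction loss sketched in Section 3.1. The third-party pytorch_msssim package and the [0, 1] data range are assumptions of ours; any differentiable SSIM implementation would do.

```python
import torch
from pytorch_msssim import ssim  # assumed third-party SSIM helper

def cycle_and_reconstruction_losses(x_n: torch.Tensor, x_rec: torch.Tensor):
    """Sketch of Eqs. (8)-(9). x_rec stands for G_c2n(G_n2c(x_n)); returns
    (L_CC, L_Recon). Uses frequency_reconstruction_loss from the Eq. (4) sketch."""
    l_cc = torch.mean(torch.abs(x_n - x_rec))           # Eq. (8), L1 cycle consistency
    l_freq = frequency_reconstruction_loss(x_n, x_rec)  # Eq. (4)
    l_ssim = -ssim(x_rec, x_n, data_range=1.0)          # negative SSIM, inputs in [0, 1]
    return l_cc, l_freq + l_ssim                        # Eq. (9)
```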

To impose local smoothness and mitigate artifacts in the restored image, we adopt the total variation loss [Chambolle(2004)]. The total variation loss is expressed as:

L_{TV}=\sum_{w,h}\left(\|\nabla_{w}G_{n2c}(x_{n})\|_{2}+\|\nabla_{h}G_{n2c}(x_{n})\|_{2}\right) \qquad (10)

where $\|\cdot\|_{2}$ denotes the L2 norm, and $\nabla_{w}$ and $\nabla_{h}$ are the operators computing gradients in the horizontal and vertical directions, respectively.
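
A sketch of Eq. (10) is given below; the paper does not spell out the reduction, so reading the norm as a per-pixel L2 over channels and averaging over the batch are our assumptions.

```python
import torch

def total_variation_loss(img: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (10) for a (B, C, H, W) denoised image G_n2c(x_n)."""
    grad_w = img[..., :, 1:] - img[..., :, :-1]   # horizontal finite differences
    grad_h = img[..., 1:, :] - img[..., :-1, :]   # vertical finite differences
    tv_w = grad_w.pow(2).sum(dim=1).clamp(min=1e-12).sqrt().sum(dim=(-2, -1))
    tv_h = grad_h.pow(2).sum(dim=1).clamp(min=1e-12).sqrt().sum(dim=(-2, -1))
    return (tv_w + tv_h).mean()
```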

Inspired by [Du et al.(2020)Du, Chen, and Yang, Wang and Yu(2020)], we use the perceptual loss [Johnson et al.(2016)Johnson, Alahi, and Fei-Fei] to ensure that extracted features from the noisy and denoised image are semantically invariant. This allows the image to keep its semantics even after the noise has been removed. The perceptual loss is expressed as:

L_{VGG}=\|\phi_{l}(x_{n})-\phi_{l}(G_{n2c}(x_{n}))\|_{2} \qquad (11)

where $\phi_{l}(\cdot)$ denotes the $l$-th layer of a VGG-19 [Simonyan and Zisserman(2014)] pre-trained on ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei]; we use the Conv5-4 layer of the VGG-19 model in our experiments.
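
The perceptual loss of Eq. (11) might be implemented as below; keeping torchvision's vgg19().features up to the conv5_4 convolution (index 34) and using the recent weights API are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """Sketch of Eq. (11): L2 distance between frozen VGG-19 conv5_4 features
    of the noisy input and the denoised output."""
    def __init__(self):
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features[:35]  # up to conv5_4
        for p in features.parameters():
            p.requires_grad_(False)                               # frozen extractor
        self.features = features.eval()

    def forward(self, x_n: torch.Tensor, denoised: torch.Tensor) -> torch.Tensor:
        return torch.norm(self.features(x_n) - self.features(denoised), p=2)
```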

Moreover, we employ a background loss to preserve background consistency between the noisy and denoised images. The background loss constrains the L1 norm between blurred versions of the noisy and denoised images. As the blur operator, we adopt a guided filter [Wu et al.(2018)Wu, Zheng, Zhang, and Huang] that smooths the image while preserving sharp structures such as edges and details. The background loss is expressed as:

L_{BG}=\|GF(x_{n})-GF(G_{n2c}(x_{n}))\|_{1} \qquad (12)

where $GF(\cdot)$ denotes the guided filter.
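
Eq. (12) requires a differentiable smoothing operator; the paper uses the trainable guided filter of Wu et al., whereas the sketch below substitutes a minimal self-guided box-filter variant with guessed radius and epsilon, so it should be read as an approximation rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def box_filter(x: torch.Tensor, r: int) -> torch.Tensor:
    """Mean filter with a (2r+1) x (2r+1) window via average pooling."""
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1, padding=r)

def guided_filter(x: torch.Tensor, r: int = 8, eps: float = 1e-2) -> torch.Tensor:
    """Minimal self-guided filter (guide = input); r and eps are our guesses."""
    mean_x = box_filter(x, r)
    var_x = box_filter(x * x, r) - mean_x ** 2
    a = var_x / (var_x + eps)
    b = mean_x - a * mean_x
    return box_filter(a, r) * x + box_filter(b, r)

def background_loss(x_n: torch.Tensor, denoised: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (12): L1 distance between the blurred noisy and denoised images."""
    return torch.mean(torch.abs(guided_filter(x_n) - guided_filter(denoised)))
```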

Our full objective for the two generators and the three discriminators is expressed as:

\min_{G_{n2c},G_{c2n}}\;\max_{D_{C},D_{T},D_{S}}\; L^{Clean}_{adv}+L^{Texture}_{adv}+L^{Spectral}_{adv}+L_{CC}+\lambda_{VGG}L_{VGG}+\lambda_{BG}L_{BG}+\lambda_{TV}L_{TV}+\lambda_{Recon}L_{Recon} \qquad (13)

We empirically set the weights in the full objective as $\lambda_{VGG}=2$, $\lambda_{BG}=2$, $\lambda_{TV}=0.2$, and $\lambda_{Recon}=0.2$.
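
Putting the pieces together, the generator-side objective of Eq. (13) is a weighted sum of the individual terms; a minimal sketch with the weights above, where the three adversarial inputs are assumed to be the generator-side LSGAN terms and the helper names are ours.

```python
# Weights from Section 3.2.
LAMBDA_VGG, LAMBDA_BG, LAMBDA_TV, LAMBDA_RECON = 2.0, 2.0, 0.2, 0.2

def generator_objective(g_clean, g_texture, g_spectral,
                        l_cc, l_vgg, l_bg, l_tv, l_recon):
    """Sketch of Eq. (13) as seen by the generators; the discriminators are
    updated in alternation with their own least-squares terms."""
    return (g_clean + g_texture + g_spectral + l_cc
            + LAMBDA_VGG * l_vgg + LAMBDA_BG * l_bg
            + LAMBDA_TV * l_tv + LAMBDA_RECON * l_recon)
```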

4 Experiment

In this section, we provide the implementation details of the proposed method, present extensive experiments on synthetic and real-world noisy images, and conduct an ablation study to show the effectiveness of the proposed method. For synthetic noise, we use Additive White Gaussian Noise (AWGN) to synthesize noisy images and adopt CBSD68 [Martin et al.(2001)Martin, Fowlkes, Tal, and Malik] for evaluation. For real noise, we use the Low-Dose Computed Tomography dataset [Moen et al.(2021)Moen, Chen, Holmes III, Duan, Yu, Yu, Leng, Fletcher, and McCollough] and the SIDD dataset of real photographs [Abdelhamed et al.(2018)Abdelhamed, Lin, and Brown] to demonstrate the generalization capacity of the proposed method. We employ PSNR and SSIM [Wang et al.(2004)Wang, Bovik, Sheikh, and Simoncelli] to evaluate the results.

Figure 3: Qualitative results of our method and other baselines on CBSD68 corrupted by AWGN with a noise level $\sigma=25$. Panels: (a) Input, (b) LPF, (c) CBM3D, (d) DnCNN, (e) FFDNet, (f) RedNet-30, (g) N2N, (h) DIP, (i) N2V, (j) LIR, (k) Ours, (l) GT.

4.1 Implementation Details

We implement our method in PyTorch [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer]. The generator and discriminator architectures are detailed in the supplementary material. We train our method for up to 100 epochs on Nvidia TITAN RTX and RTX A6000 GPUs. We adopt ADAM [Kingma and Ba(2014)] for optimization. The initial learning rate is set to 0.0001; we keep this rate for the first 70 epochs and linearly decay it to zero over the last 30 epochs. We set the batch size to 16 in all experiments. We randomly crop $128\times 128$ patches for synthetic noise removal and use input patches of size $256\times 256$ for real-world noise removal. We randomly flip the images horizontally for data augmentation. For the high-pass filter of the spectral discriminator, $r_{\tau}$ is set to $\lfloor H/2\sqrt{2}\rfloor$, where $H$ is the height of an image and $\lfloor\cdot\rfloor$ is the floor operator. Loss weights are described in Section 3.2. Our model is evaluated with three random seeds, and we report the average values for rigorous evaluation.
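
Two of these details can be made concrete in a few lines. The sketch below reads $\lfloor H/2\sqrt{2}\rfloor$ as $\lfloor H/(2\sqrt{2})\rfloor$ and uses a placeholder parameter in place of the actual networks.

```python
import math
import torch

# Placeholder standing in for the generator/discriminator parameters.
params = [torch.nn.Parameter(torch.zeros(1))]

H = 128                                     # training patch height (synthetic setting)
r_tau = math.floor(H / (2 * math.sqrt(2)))  # threshold radius for Eq. (3)

optimizer = torch.optim.Adam(params, lr=1e-4)
# Constant learning rate for the first 70 epochs, then linear decay to zero
# over the last 30 epochs (100 epochs in total).
lr_lambda = lambda epoch: 1.0 if epoch < 70 else max(0.0, (100 - epoch) / 30.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```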

Settings: Traditional / Paired / Unpaired (left to right).
Methods: LPF | CBM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] | DnCNN [Zhang et al.(2017)Zhang, Zuo, Chen, Meng, and Zhang] | FFDNet [Zhang et al.(2018)Zhang, Zuo, and Zhang] | RedNet-30 [Mao et al.(2016)Mao, Shen, and Yang] | N2N [Lehtinen et al.(2018)Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala, and Aila] | DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky] | N2V [Krull et al.(2019)Krull, Buchholz, and Jug] | LIR [Du et al.(2020)Du, Chen, and Yang] | Ours
PSNR (dB)
$\sigma=15$: 25.93 | 33.55 | 33.72 | 29.68 | 33.60 | 33.92 | 28.51 | 28.66 | 30.44 | 32.21
$\sigma=25$: 24.61 | 30.91 | 30.85 | 28.71 | 30.68 | 31.31 | 27.26 | 27.20 | 29.08 | 29.37
$\sigma=50$: 21.49 | 27.47 | 27.19 | 26.79 | 26.42 | 28.10 | 23.66 | 24.52 | 25.69 | 26.03
SSIM
$\sigma=15$: 0.7079 | 0.9619 | 0.9254 | 0.8616 | 0.9620 | 0.9301 | 0.8851 | 0.9024 | 0.9414 | 0.9502
$\sigma=25$: 0.6102 | 0.9331 | 0.8724 | 0.8254 | 0.9308 | 0.8857 | 0.8613 | 0.8684 | 0.9126 | 0.9124
$\sigma=50$: 0.4266 | 0.8722 | 0.7490 | 0.7463 | 0.8502 | 0.7973 | 0.7510 | 0.7927 | 0.8435 | 0.8375
Table 1: The average PSNR and SSIM results of our method and other baselines on CBSD68 corrupted by AWGN with noise levels $\sigma=\{15,25,50\}$. Our results are marked in bold.

4.2 Synthetic Noise Removal

We train the model with DIV2K, which contains 800 images at 2K resolution. For unpaired training, we randomly divide the dataset into two parts without intersection. To construct a noise set, we add AWGN with noise levels $\sigma=\{15,25,50\}$ to the images in one part and use the other part as a clean set. For a fair comparison, we use only the noise set and the corresponding ground truth when training the supervised learning-based methods. We select unsupervised methods, i.e. DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky], N2N [Lehtinen et al.(2018)Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala, and Aila], N2V [Krull et al.(2019)Krull, Buchholz, and Jug], and LIR [Du et al.(2020)Du, Chen, and Yang], and supervised methods, i.e. DnCNN [Zhang et al.(2017)Zhang, Zuo, Chen, Meng, and Zhang], FFDNet [Zhang et al.(2018)Zhang, Zuo, and Zhang], and RedNet-30 [Mao et al.(2016)Mao, Shen, and Yang], to compare performance. Traditional Low-Pass Filtering (LPF) and BM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] are also evaluated. As shown in Figure 3, the unsupervised methods tend to shift the color and leave apparent visual artifacts in the sky. In particular, LIR removes the noise but fails to preserve the texture. With frequency domain information, our method successfully eliminates the noise and preserves the texture. The classical LPF using the Fourier transform alleviates the noise, but our framework, which reflects spatial domain knowledge as well as frequency domain knowledge, shows superior results. As shown in Table 1, our model outperforms the other unsupervised methods, i.e. DIP, N2V, and LIR, by at least +0.29 dB in PSNR. Although our model is trained on unpaired images, it achieves higher SSIM than DnCNN and FFDNet trained on paired datasets. We conjecture that the better noise removal comes from the extra domain information that previous methods do not consider.

Figure 4: Qualitative results of our method and other baselines on the Mayo Clinic Low Dose CT dataset. Panels: (a) LDCT (real low-dose), (b) BM3D, (c) RED-CNN, (d) DIP, (e) LIR, (f) Ours, (g) NDCT (real normal-dose). As shown in the highlighted red box, the image reconstructed by our method has little noise and few artifacts. The display window is $[160,240]$ HU.
Settings: Traditional / Paired / Unpaired (left to right).
Methods: BM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] | RED-CNN [Chen et al.(2017)Chen, Zhang, Kalra, Lin, Chen, Liao, Zhou, and Wang] | DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky] | LIR [Du et al.(2020)Du, Chen, and Yang] | Ours
PSNR (dB): 29.16 | 29.39 | 26.97 | 27.26 | 30.11
SSIM: 0.8514 | 0.9078 | 0.8267 | 0.8452 | 0.8728
Table 2: The average PSNR and SSIM results of different methods on the Mayo Clinic Low Dose CT dataset. Our results are marked in bold.

4.3 Real-World Noise Removal

In this section, we evaluate the generalization ability of the proposed method on real-world noise, i.e. Low-Dose Computed Tomography (CT) and real photographs. For the comparison of the Low-Dose CT, we adopt BM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian], DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky], RED-CNN [Chen et al.(2017)Chen, Zhang, Kalra, Lin, Chen, Liao, Zhou, and Wang], and LIR [Du et al.(2020)Du, Chen, and Yang] as baselines. For the comparison of the real photographs, BM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian], DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky], RedNet-30 [Mao et al.(2016)Mao, Shen, and Yang], and LIR [Du et al.(2020)Du, Chen, and Yang] are selected as baselines.

Denoising on Low-Dose CT

Computed Tomography (CT) is widely used in medical analysis because it helps to diagnose abnormalities of organs. Reducing the radiation dose to decrease health risks introduces noise and artifacts into the reconstructed images. As with other real-world noise, the noise distribution of the reconstructed images is difficult to model analytically. Therefore, we adopt a CT dataset authorized by Mayo Clinic [Moen et al.(2021)Moen, Chen, Holmes III, Duan, Yu, Yu, Leng, Fletcher, and McCollough] to evaluate the generalization ability of our method on real-world noise. The Mayo Clinic dataset consists of paired normal-dose and low-dose CT images for each patient; the Normal-Dose CT (NDCT) and Low-Dose CT (LDCT) images correspond to clean and noisy images, respectively. For training, we obtain 2,850 images at $512\times 512$ resolution from 20 different patients. We construct a noise set of 1,422 LDCT images from 10 randomly selected patients and a clean set of 1,428 NDCT images from the remaining patients for unpaired training. For the test, we obtain 865 images from 5 different patients. As shown in Table 2, our method achieves the best and second-best performance in PSNR and SSIM, respectively. Note that our model trained on the unpaired dataset outperforms RED-CNN trained on the paired dataset in PSNR, indicating that our method can be more practical in medical analysis, where obtaining paired datasets is challenging. We also compare the qualitative results with other baselines. As shown in Figure 4, other methods tend to generate artifacts or lose details, whereas our method shows a reasonable balance between noise removal and image quality. More qualitative results are provided in the supplementary material.

Denoising on Real Photographs

To demonstrate the effectiveness of our method on real noisy photographs, we evaluate it on SIDD [Abdelhamed et al.(2018)Abdelhamed, Lin, and Brown], which is captured with smartphone cameras. Because the images of SIDD cover various noise levels and brightness conditions, this dataset is well suited to validating the generalization capacity of denoisers. SIDD includes 320 pairs of noisy images and corresponding clean images at 4K or 5K resolution for training. For unpaired training, we divide the dataset into 160 clean and 160 noisy images without intersection. The other training settings are the same as in the implementation details. For evaluation, we use 1280 cropped patches of size $256\times 256$ from the SIDD validation set. As shown in Figure 5, other baselines tend to leave noise or fail to preserve the color of the images. In contrast, our method removes the intense noise while keeping the color. We also report the quantitative results in Table 3. More qualitative results are provided in the supplementary material.

Figure 5: Qualitative results of our method and other baselines on SIDD. Panels: (a) Input, (b) CBM3D, (c) RedNet-30, (d) DIP, (e) LIR, (f) Ours, (g) GT.
Settings: Traditional / Paired / Unpaired (left to right).
Methods: CBM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] | RedNet-30 [Mao et al.(2016)Mao, Shen, and Yang] | DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky] | LIR [Du et al.(2020)Du, Chen, and Yang] | Ours
PSNR (dB): 28.32 | 38.02 | 24.68 | 33.79 | 34.30
SSIM: 0.6784 | 0.9619 | 0.5901 | 0.9466 | 0.9334
Table 3: The average PSNR and SSIM results of different methods on SIDD. Our results are marked in bold.

4.4 Ablation Study

We conduct an ablation study to demonstrate the validity of our key components: the texture discriminator $D_{T}$, the spectral discriminator $D_{S}$, and the frequency reconstruction loss $L_{Freq}$. We employ an additional evaluation metric, LFD [Jiang et al.(2020)Jiang, Dai, Wu, and Loy], to measure the difference between denoised and reference images in the frequency domain; a small LFD value indicates that the denoised images are close to the reference images. First, to verify the effectiveness of $D_{S}$, we add only $D_{S}$ to the base structure. As shown in Table 4, when $D_{S}$ is integrated, PSNR and SSIM increase by 0.2 dB and 0.0014, respectively. This demonstrates that the spectral discriminator leads the generator to remove high-frequency noise effectively by transferring the difference between noisy and clean images on the high-frequency bands. We also see that the spectral discriminator brings the denoised images closer to the clean domain in the frequency domain, resulting in a decrease of LFD. Next, to verify the effectiveness of $D_{T}$, we integrate it with $D_{S}$. Distinguishing the texture representations helps restore clean contours and fine details related to image quality, which improves the SSIM metric. A curious phenomenon is that the texture discriminator increases the LFD. We conjecture that the introduction of $D_{T}$ biases the model toward the spatial domain when maintaining the balance between the spatial and frequency domains, thus increasing the distance in the frequency domain. Adding $L_{Freq}$ yields results that validate our hypothesis that narrowing the gap in the frequency domain is crucial to generating high-quality denoised images. In addition, as indicated by the decrease of LFD, the frequency reconstruction loss may help to maintain the balance between the spatial and frequency domains.

$D_{S}$ | $D_{T}$ | $L_{Freq}$ | PSNR (dB) | SSIM | LFD
✗ | ✗ | ✗ | 25.59 | 0.8290 | 6.5955
✓ | ✗ | ✗ | 25.79 | 0.8304 | 6.5649
✓ | ✓ | ✗ | 25.82 | 0.8334 | 6.5874
✓ | ✓ | ✓ | 26.03 | 0.8375 | 6.5795
Table 4: Ablation study. Quantitative results of our method with and without the texture discriminator $D_{T}$, spectral discriminator $D_{S}$, and frequency reconstruction loss $L_{Freq}$ on CBSD68 corrupted by AWGN with a noise level $\sigma=50$. We report PSNR, SSIM (higher is better), and LFD (lower is better). The best results are marked in bold.

5 Conclusion

In this paper, we propose an unsupervised learning-based image denoiser that enables image denoising without clean and noisy image pairs. To the best of our knowledge, it is the first approach that aims to recover a noise-free image from a corrupted image using frequency domain information. To this end, we introduce the spectral discriminator and frequency reconstruction loss, which propagate frequency knowledge to the generator. By reflecting information from the frequency domain, our method successfully focuses on high-frequency components to remove noise. Experiments on synthetic and real noise removal show that our method outperforms other unsupervised learning-based denoisers and generates more visually pleasing images with fewer artifacts. We believe that considering the frequency domain can be advantageous in other low-level vision tasks as well.

6 Acknowledgements

This work was supported by the Engineering Research Center of Excellence (ERC) Program supported by National Research Foundation (NRF), Korean Ministry of Science & ICT (MSIT) (Grant No. NRF-2017R1A5A1014708).

References

  • [Abdelhamed et al.(2018)Abdelhamed, Lin, and Brown] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1692–1700, 2018.
  • [Aharon et al.(2006)Aharon, Elad, and Bruckstein] Michal Aharon, Michael Elad, and Alfred Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing, 54(11):4311–4322, 2006.
  • [Ahn et al.(2018)Ahn, Kang, and Sohn] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–268, 2018.
  • [Buades et al.(2005)Buades, Coll, and Morel] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. IEEE, 2005.
  • [Cai et al.(2020)Cai, Zhang, Huang, Geng, and Huang] Mu Cai, Hong Zhang, Huijuan Huang, Qichuan Geng, and Gao Huang. Frequency domain image translation: More photo-realistic, better identity-preserving. arXiv preprint arXiv:2011.13611, 2020.
  • [Chambolle(2004)] Antonin Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical imaging and vision, 20(1):89–97, 2004.
  • [Chen et al.(2017)Chen, Zhang, Kalra, Lin, Chen, Liao, Zhou, and Wang] Hu Chen, Yi Zhang, Mannudeep K Kalra, Feng Lin, Yang Chen, Peixi Liao, Jiliu Zhou, and Ge Wang. Low-dose ct with a residual encoder-decoder convolutional neural network. IEEE transactions on medical imaging, 36(12):2524–2535, 2017.
  • [Chen et al.(2018)Chen, Chen, Chao, and Yang] Jingwen Chen, Jiawei Chen, Hongyang Chao, and Ming Yang. Image blind denoising with generative adversarial network based noise modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3155–3164, 2018.
  • [Chen et al.(2020)Chen, Li, Jin, Liu, and Li] Yuanqi Chen, Ge Li, Cece Jin, Shan Liu, and Thomas Li. Ssd-gan: Measuring the realness in the spatial and spectral domains. arXiv preprint arXiv:2012.05535, 2020.
  • [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing, 16(8):2080–2095, 2007.
  • [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [Du et al.(2020)Du, Chen, and Yang] Wenchao Du, Hu Chen, and Hongyu Yang. Learning invariant representation for unsupervised image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14483–14492, 2020.
  • [Durall et al.(2020)Durall, Keuper, and Keuper] Ricard Durall, Margret Keuper, and Janis Keuper. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7890–7899, 2020.
  • [Dzanic et al.(2019)Dzanic, Shah, and Witherden] Tarik Dzanic, Karan Shah, and Freddie Witherden. Fourier spectrum discrepancies in deep network generated images. arXiv preprint arXiv:1911.06465, 2019.
  • [Frank et al.(2020)Frank, Eisenhofer, Schönherr, Fischer, Kolossa, and Holz] Joel Frank, Thorsten Eisenhofer, Lea Schönherr, Asja Fischer, Dorothea Kolossa, and Thorsten Holz. Leveraging frequency analysis for deep fake image recognition. In International Conference on Machine Learning, pages 3247–3258. PMLR, 2020.
  • [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
  • [Gu et al.(2014)Gu, Zhang, Zuo, and Feng] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2862–2869, 2014.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [Isola et al.(2017)Isola, Zhu, Zhou, and Efros] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [Jiang et al.(2020)Jiang, Dai, Wu, and Loy] Liming Jiang, Bo Dai, Wayne Wu, and Chen Change Loy. Focal frequency loss for generative models. arXiv preprint arXiv:2012.12821, 2020.
  • [Johnson et al.(2016)Johnson, Alahi, and Fei-Fei] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [Kim et al.(2020)Kim, Soh, Park, and Cho] Yoonsik Kim, Jae Woong Soh, Gu Yong Park, and Nam Ik Cho. Transfer learning from synthetic to real-noise denoising with adaptive instance normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3482–3492, 2020.
  • [Kingma and Ba(2014)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Krull et al.(2019)Krull, Buchholz, and Jug] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2void-learning denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2019.
  • [Lehtinen et al.(2018)Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala, and Aila] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
  • [Mairal et al.(2009)Mairal, Bach, Ponce, Sapiro, and Zisserman] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Non-local sparse models for image restoration. In 2009 IEEE 12th international conference on computer vision, pages 2272–2279. IEEE, 2009.
  • [Mao et al.(2016)Mao, Shen, and Yang] Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. arXiv preprint arXiv:1603.09056, 2016.
  • [Mao et al.(2017)Mao, Li, Xie, Lau, Wang, and Paul Smolley] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2794–2802, 2017.
  • [Martin et al.(2001)Martin, Fowlkes, Tal, and Malik] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
  • [Miyato et al.(2018)Miyato, Kataoka, Koyama, and Yoshida] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • [Moen et al.(2021)Moen, Chen, Holmes III, Duan, Yu, Yu, Leng, Fletcher, and McCollough] Taylor R Moen, Baiyu Chen, David R Holmes III, Xinhui Duan, Zhicong Yu, Lifeng Yu, Shuai Leng, Joel G Fletcher, and Cynthia H McCollough. Low-dose ct image and projection dataset. Medical physics, 48(2):902–911, 2021.
  • [Osher et al.(2005)Osher, Burger, Goldfarb, Xu, and Yin] Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and Wotao Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 4(2):460–489, 2005.
  • [Paszke et al.(2017)Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
  • [Van der Walt et al.(2014)Van der Walt, Schönberger, Nunez-Iglesias, Boulogne, Warner, Yager, Gouillart, and Yu] Stefan Van der Walt, Johannes L Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua D Warner, Neil Yager, Emmanuelle Gouillart, and Tony Yu. scikit-image: image processing in python. PeerJ, 2:e453, 2014.
  • [Wang and Yu(2020)] Xinrui Wang and Jinze Yu. Learning to cartoonize using white-box cartoon representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8090–8099, 2020.
  • [Wang et al.(2004)Wang, Bovik, Sheikh, and Simoncelli] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [Wu et al.(2018)Wu, Zheng, Zhang, and Huang] Huikai Wu, Shuai Zheng, Junge Zhang, and Kaiqi Huang. Fast end-to-end trainable guided filter. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1838–1847, 2018.
  • [Xu and Osher(2007)] Jinjun Xu and Stanley Osher. Iterative regularization and nonlinear inverse scale space applied to wavelet-based denoising. IEEE Transactions on Image Processing, 16(2):534–544, 2007.
  • [Xu et al.(2017)Xu, Zhang, Zhang, and Feng] Jun Xu, Lei Zhang, David Zhang, and Xiangchu Feng. Multi-channel weighted nuclear norm minimization for real color image denoising. In Proceedings of the IEEE international conference on computer vision, pages 1096–1104, 2017.
  • [Xu et al.(2018)Xu, Zhang, and Zhang] Jun Xu, Lei Zhang, and David Zhang. A trilateral weighted sparse coding scheme for real-world image denoising. In Proceedings of the European conference on computer vision (ECCV), pages 20–36, 2018.
  • [Xu et al.(2020)Xu, Qin, Sun, Wang, Chen, and Ren] Kai Xu, Minghai Qin, Fei Sun, Yuhao Wang, Yen-Kuang Chen, and Fengbo Ren. Learning in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1740–1749, 2020.
  • [Yang and Soatto(2020)] Yanchao Yang and Stefano Soatto. Fda: Fourier domain adaptation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4085–4095, 2020.
  • [Yang et al.(2020)Yang, Lao, Sundaramoorthi, and Soatto] Yanchao Yang, Dong Lao, Ganesh Sundaramoorthi, and Stefano Soatto. Phase consistent ecological domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9011–9020, 2020.
  • [Zhang et al.(2017)Zhang, Zuo, Chen, Meng, and Zhang] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE transactions on image processing, 26(7):3142–3155, 2017.
  • [Zhang et al.(2018)Zhang, Zuo, and Zhang] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
  • [Zhang et al.(2019)Zhang, Karaman, and Chang] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detecting and simulating artifacts in gan fake images. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), pages 1–6. IEEE, 2019.
  • [Zhu et al.(2017)Zhu, Park, Isola, and Efros] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • [Zoran and Weiss(2011)] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486. IEEE, 2011.

Appendix A Supplementary Material

In this supplementary material, we describe the architecture details and show the additional experiments as follows:

  • In Section B, we describe the architectures of the two generators, i.e. $G_{n2c}$ and $G_{c2n}$, and the three discriminators, i.e. $D_{C}$, $D_{T}$, and $D_{S}$, in our framework.

  • In Section C, we show additional results on CBSD68 [Martin et al.(2001)Martin, Fowlkes, Tal, and Malik] corrupted by AWGN with a noise level $\sigma=25$.

  • In Section D, we show the additional qualitative results on real-world noise, i.e. Low-Dose CT authorized by Mayo Clinic [Moen et al.(2021)Moen, Chen, Holmes III, Duan, Yu, Yu, Leng, Fletcher, and McCollough] and SIDD [Abdelhamed et al.(2018)Abdelhamed, Lin, and Brown].

  • In Section E, we show the results of an additional ablation study to demonstrate the validity of the perceptual loss $L_{VGG}$, the cycle consistency loss $L_{CC}$, and the reconstruction loss $L_{Recon}$.

  • In Section F, we show the results on several noise types, such as structured noise and Poisson noise, to evaluate the generalization ability of our method.

Appendix B The Details of Architectures

Generator Gn2cG_{n2c}

For the noise removal generator $G_{n2c}$, we adopt the network introduced by [Ahn et al.(2018)Ahn, Kang, and Sohn]. The main idea of this architecture is multiple cascading connections at global and local levels, which help to propagate low-level information to later layers and remove noise. The details of $G_{n2c}$ are illustrated in Figures 6 and 7.

Generator Gc2nG_{c2n}

For the generator $G_{c2n}$, we adopt a U-Net based network similar to the architecture introduced by [Kim et al.(2020)Kim, Soh, Park, and Cho]. The role of this network is to translate images from the clean domain back to the noise domain. The details of $G_{c2n}$ are illustrated in Figures 8 and 9.

Discriminators DCD_{C} and DTD_{T}

For the discriminators $D_{C}$ and $D_{T}$, we employ the $70\times 70$ PatchGAN discriminator [Isola et al.(2017)Isola, Zhu, Zhou, and Efros], which classifies whether $70\times 70$ image patches are real or fake. The details of $D_{C}$ and $D_{T}$ are illustrated in Figure 10.

Discriminator DSD_{S}

For the spectral discriminator $D_{S}$, we employ a single linear unit. $D_{S}$ takes the high-pass filtered 1D spectral vector and classifies whether this vector is real or fake.
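
A minimal sketch of such a discriminator is shown below; the number of radial bins depends on the patch size, and the value used here is only a placeholder.

```python
import torch
import torch.nn as nn

# Single linear unit mapping the high-pass filtered 1D spectral vector v_I
# to one realness score; n_bins is a placeholder for the vector length.
n_bins = 91
D_S = nn.Linear(n_bins, 1)

v = torch.randn(16, n_bins)   # a batch of spectral vectors (stand-in for v_I)
realness = D_S(v)             # shape (16, 1)
```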

Figure 6: The architecture of the Cascading Block used as the basic component in $G_{n2c}$. We use the Residual Block proposed by [He et al.(2016)He, Zhang, Ren, and Sun] and the ReLU.
Figure 7: The architecture of the generator $G_{n2c}$ for noise removal. We use convolutions with kernel size 3, stride 1, and padding 1.
Figure 8: The architecture of the Instance Residual Block used as the basic component in $G_{c2n}$. We use convolutions with kernel size 3, stride 1, and padding 1, and the LeakyReLU with a slope of 0.2.
Figure 9: The architecture of the generator $G_{c2n}$. We use convolutions with kernel size 3, stride 1, and padding 1, and transposed convolutions with kernel size 3, stride 2, padding 1, and output padding 1.
Figure 10: The architecture of the discriminators $D_{C}$ and $D_{T}$. We use convolutions with kernel size 4 and padding 1, each followed by spectral normalization [Miyato et al.(2018)Miyato, Kataoka, Koyama, and Yoshida] and the LeakyReLU with a slope of 0.2.

Appendix C Additional Results on AWGN

We additionally visualize results for CBSD68 images corrupted by AWGN with a noise level $\sigma=25$ and report the PSNR and SSIM in Figures 11 and 12. In Figure 11, our method outperforms the other methods trained with unpaired data by at least +3.44 dB in PSNR and +0.08 in SSIM. LIR and N2V distort the color and lighting, whereas our method preserves both and successfully removes the noise. Figure 12 shows a challenging example with repetitive high-frequency patterns that are hard to distinguish from noise. Our approach removes the noise without artifacts and preserves the patterns of the zebra. Although our method is trained in the unpaired setting, it shows PSNR and SSIM comparable with the supervised models in Figure 12. Furthermore, compared to the methods trained with unpaired data, our approach achieves the best performance in both PSNR and SSIM.

Figure 11: Qualitative results of our method and other baselines on CBSD68 corrupted by AWGN with a noise level $\sigma=25$ (PSNR/SSIM): (a) Input (21.21/0.55), (b) LPF (21.43/0.60), (c) CBM3D (28.55/0.91), (d) DnCNN (28.64/0.92), (e) FFDNet (24.56/0.83), (f) RedNet-30 (27.46/0.91), (g) N2N (28.92/0.92), (h) DIP (22.90/0.74), (i) N2V (24.05/0.81), (j) LIR (19.48/0.82), (k) Ours (27.49/0.90), (l) GT.
Figure 12: Qualitative results of our method and other baselines on CBSD68 corrupted by AWGN with a noise level $\sigma=25$ (PSNR/SSIM): (a) Input (20.27/0.44), (b) LPF (21.29/0.58), (c) CBM3D (29.99/0.85), (d) DnCNN (30.32/0.87), (e) FFDNet (26.29/0.81), (f) RedNet-30 (30.50/0.88), (g) N2N (30.71/0.88), (h) DIP (27.40/0.77), (i) N2V (26.73/0.75), (j) LIR (26.04/0.83), (k) Ours (28.35/0.83), (l) GT.

Appendix D Additional Qualitative Results on Real-World Noise

D.1 Low-Dose CT

In this subsection, we show additional qualitative results on the Low-Dose CT dataset authorized by Mayo Clinic [Moen et al.(2021)Moen, Chen, Holmes III, Duan, Yu, Yu, Leng, Fletcher, and McCollough] in Figure 13. Previous methods tend to lose details and generate blurred results, whereas our method removes the noise while preserving the details of organs. This shows that our method is also practical for medical image denoising.

D.2 Real Photographs

In this subsection, we visualize the additional qualitative results on SIDD [Abdelhamed et al.(2018)Abdelhamed, Lin, and Brown] in Figure 14 and 15. As shown in Figure 14, previous methods tend to lose the texture and leave the noise. In contrast, our method removes the noise while preserving the texture compared to other baselines. In Figure 15, we observe that our method removes the intense noise while preserving the color of images compared to other baselines.

Figure 13: Qualitative results of our method and other baselines on the Mayo Clinic Low Dose CT dataset. Panels (two examples): (a)/(h) LDCT, (b)/(i) BM3D, (c)/(j) RED-CNN, (d)/(k) DIP, (e)/(l) LIR, (f)/(m) Ours, (g)/(n) NDCT. As shown in the highlighted red box, the images reconstructed by our method have little noise and preserve the details of organs. The display window is $[160,240]$ HU.
Figure 14: Qualitative results of our method and other baselines on real noisy data, SIDD. Panels (two examples): (a)/(h) Input, (b)/(i) CBM3D, (c)/(j) RedNet-30, (d)/(k) DIP, (e)/(l) LIR, (f)/(m) Ours, (g)/(n) GT.
Figure 15: Qualitative results of our method and other baselines on real noisy data, SIDD. Panels (two examples): (a)/(h) Input, (b)/(i) CBM3D, (c)/(j) RedNet-30, (d)/(k) DIP, (e)/(l) LIR, (f)/(m) Ours, (g)/(n) GT.

Appendix E Additional Ablation Study

We conduct an additional ablation study to demonstrate the validity of the perceptual loss $L_{VGG}$, the cycle consistency loss $L_{CC}$, and the reconstruction loss $L_{Recon}$. First, to verify the effectiveness of $L_{VGG}$, we add only $L_{VGG}$. As shown in Table 5, when $L_{VGG}$ is used, PSNR and SSIM increase by 0.07 dB and 0.0068, respectively. This demonstrates that the perceptual loss helps to improve performance by preserving the semantics even after the noise has been removed. Next, to verify the contribution of $L_{CC}$, we integrate it with $L_{VGG}$. We observe that $L_{CC}$, which enables the one-to-one mapping between noisy and denoised images, improves PSNR and SSIM by 0.08 dB and 0.004. Finally, when we integrate $L_{Recon}$ with $L_{VGG}$ and $L_{CC}$, PSNR and SSIM increase by a further 0.15 dB and 0.0063, giving the best results. Through this experiment, we validate that each of the losses contributes to the performance improvement.

$L_{VGG}$ | $L_{CC}$ | $L_{Recon}$ | PSNR (dB) | SSIM
✗ | ✗ | ✗ | 25.67 | 0.8204
✓ | ✗ | ✗ | 25.74 | 0.8272
✓ | ✓ | ✗ | 25.88 | 0.8312
✓ | ✓ | ✓ | 26.03 | 0.8375
Table 5: Ablation study. Quantitative results of our method with and without the perceptual loss $L_{VGG}$, the cycle consistency loss $L_{CC}$, and the reconstruction loss $L_{Recon}$ on CBSD68 corrupted by AWGN with a noise level $\sigma=50$. We report PSNR and SSIM (higher is better). The best results are marked in bold.

Appendix F Evaluation on Several Noise Types

F.1 Structured Noise

In this subsection, we show results on structured noise. To generate the structured noise, we sample pixel-wise i.i.d. white noise and convolve it with a 2D Gaussian filter with a kernel size of $21\times 21$ and a $\sigma$ of 3 pixels. For training and evaluation, we follow the same setting as for synthetic noise removal in the main paper. As shown in Figure 16, our method is able to remove this complex noise compared to BM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] and DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky]. Furthermore, while LIR [Du et al.(2020)Du, Chen, and Yang] distorts the lighting, our method successfully preserves both the color and the lighting. The quantitative results are summarized in Table 6. Our method outperforms the traditional and unsupervised methods, achieving the second-best performance in terms of PSNR and SSIM.

Figure 16: Qualitative results of our method and other baselines on CBSD68 corrupted by structured noise. Panels: (a) Input, (b) CBM3D, (c) RedNet-30, (d) DIP, (e) LIR, (f) Ours, (g) GT.
Settings: Traditional / Paired / Unpaired (left to right).
Methods: CBM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] | RedNet-30 [Mao et al.(2016)Mao, Shen, and Yang] | DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky] | LIR [Du et al.(2020)Du, Chen, and Yang] | Ours
PSNR (dB): 20.62 | 28.51 | 20.70 | 16.90 | 25.18
SSIM: 0.5650 | 0.9588 | 0.7239 | 0.3738 | 0.9026
Table 6: The average PSNR and SSIM results of different methods on CBSD68 corrupted by structured noise. Our results are marked in bold.

F.2 Poisson Noise

For the comparison on Poisson noisy images, we use Kodak24 as the test dataset. The images are corrupted by independent Poisson noise generated with the Scikit-image library [Van der Walt et al.(2014)Van der Walt, Schönberger, Nunez-Iglesias, Boulogne, Warner, Yager, Gouillart, and Yu]. We train the models following the settings in the main paper. The visual results of Poisson noise removal are given in Figures 17 and 18. Our approach shows impressive noise removal results: while LIR and DIP fail to remove the Poisson noise, our method successfully eliminates the noise and preserves the colors. In Table 7, our method achieves the best performance in PSNR and the second-best in SSIM even though it is trained on an unpaired dataset. This demonstrates that our method is robust and generalizes across various noise types. Note that we do not change any hyper-parameters when training on the different types of noise.

Figure 17: Qualitative results of our method and other baselines on Kodak24 corrupted by Poisson noise. Panels: (a) Input, (b) CBM3D, (c) RedNet-30, (d) DIP, (e) LIR, (f) Ours, (g) GT.
Settings: Traditional / Paired / Unpaired (left to right).
Methods: CBM3D [Dabov et al.(2007)Dabov, Foi, Katkovnik, and Egiazarian] | RedNet-30 [Mao et al.(2016)Mao, Shen, and Yang] | DIP [Ulyanov et al.(2018)Ulyanov, Vedaldi, and Lempitsky] | LIR [Du et al.(2020)Du, Chen, and Yang] | Ours
PSNR (dB): 32.36 | 29.59 | 29.59 | 26.20 | 34.93
SSIM: 0.8694 | 0.9778 | 0.8774 | 0.7741 | 0.9691
Table 7: The average PSNR and SSIM results of different methods on the Kodak24 dataset corrupted by Poisson noise. Our results are marked in bold.
Figure 18: Qualitative results of our method and other baselines on Kodak24 corrupted by Poisson noise. Panels: (a) Input, (b) CBM3D, (c) RedNet-30, (d) DIP, (e) LIR, (f) Ours, (g) GT.