Unleashing the Power of Self-Supervised Image Denoising: A Comprehensive Review

Dan Zhang*, , Fangfang Zhou*, Felix Albu ,
Yuanzhou Wei, Xiao Yang, Yuan Gu, Qiang Li *These authors contributed to the work equllly and should be regarded as co-first authors. Email: zhangdan [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected].

Abstract

The advent of deep learning has brought a revolutionary transformation to image denoising techniques. However, the persistent challenge of acquiring noisy-clean pairs for supervised methods in real-world scenarios remains formidable, necessitating the exploration of more practical self-supervised image denoising. This paper focuses on self-supervised image denoising methods that offer effective solutions to address this challenge. Our comprehensive review analyzes the latest advancements in self-supervised image denoising approaches, categorizing them into three distinct classes: General methods, Blind Spot Network (BSN)-based methods, and Transformer-based methods. For each class, we provide a concise theoretical analysis along with their practical applications. To assess the effectiveness of these methods, we present both quantitative and qualitative experimental results on various datasets. Additionally, we critically discuss the current limitations of these methods and propose promising directions for future research. By offering a detailed overview of recent developments in self-supervised image denoising, this review serves as an invaluable resource for researchers and practitioners in the field, facilitating a deeper understanding of this emerging domain and inspiring further advancements.

Index Terms:

Self-supervised image denoising, Survey.

I Introduction

Noise in images often manifests as isolated pixels or pixel blocks that have a significant visual impact, disrupting the actual information content of the image and making it unclear. Generally, noise signals are unrelated to the object being imaged and represent useless information. Noise arises from two main sources: during image acquisition and during image transmission and processing. Image sensors like Charge-coupled Device (CCD) and Complementary Metal Oxide Semiconductor (CMOS) arrays introduce noise due to sensor material properties, the environment, electronic components, and circuitry. This includes thermal noise from resistance, transistor channel noise, photon noise, dark current noise, and non-uniform light response noise [1, 2]. Digital images become contaminated with various types of noise [3, 4] during transmission and recording due to imperfect media and equipment. Even slight camera shake during shooting can generate noise [5, 6]. Noise can also be introduced at different image processing stages when inputs diverge from expectations. Image noise typically exhibits the following characteristics: It is irregular and random in distribution and magnitude, and generally correlated with the image [7, 8]. There are several types of noise based on its probability distributions that can affect images: Gaussian noise [9], Poisson noise [10], multiplicative noise [11], salt-and-pepper noise [12], Gamma noise [13], Rayleigh noise [14], Uniform noise [15] and Exponential noise [16].

Refer to caption — Figure 1: Outline of the survey.

Image noise is a common problem that can degrade the visual quality of images, making it difficult to analyze and negatively impacting subsequent tasks such as image detection [17, 18, 19], classification [20, 21, 22], segmentation [23, 24, 25, 26], tracking [27, 28], text generation [29, 30] and more. Therefore, image denoising methods for restoring a clean image $x$ from a noisy image $y$ = $x$ + $n$ is necessary for improving the visual quality of images and facilitating subsequent image analysis tasks.

Traditional image denoising methods [31], [32, 33, 34, 35, 36, 37, 38] work reasonably well for their designed/tuned scenarios but do not generalize or scale to achieve human-level denoising performance across the range of noise types, levels, complexities, data sizes, and quality metrics required for most real-world applications. Progress in denoising now depends on machine learning techniques that can overcome these challenges. Sparse-based methods is first applied to image denoising such as [31], non-locally centralized sparse representation (NCSR) method [39]. Then, in order to save computer costs and recover more image details, more efficient methods were proposed [40, 41, 42, 43, 44, 45]. Examples include Markov random field (MRF) [41], gradient models [44], and weighted nuclear norm minimization (WNNM) [45]. However, these models usually require manual hyperparameter setting, which is very unfriendly to different datasets that need to manually try to adjust parameters. Moreover, such traditional methods require a lot of computing resources in the denoising process.

As we enter the era of big data and experience significant improvements in computer computing power, various fields are leaning towards learning-based algorithms to meet future demands [46, 47, 23, 48, 49, 17, 50, 51, 52, 53, 54, 55, 20, 56, 57]. Deep learning, in particular, has demonstrated superior performance and computational cost compared to traditional algorithms in many fields [31, 58, 59, 31, 60, 61, 62, 45, 63, 64, 65, 66, 17, 67, 68, 69, 70, 71, 72, 49, 73]. In the area of image denoising, deep learning methods can be categorized as either supervised or self-supervised, depending on whether or not they require noisy-clean image pairs.

Supervised image denoising involves training a deep neural network (DNN) using pairs of noisy and corresponding clean images, and let the model learn the mapping of a noisy image to a clean image [74, 75, 76, 77, 78, 79, 80, 81] or let the residual of a clean image and a noisy image be used as a target, and learn to separate the noise from the noisy image [82, 83, 84]. Once trained, the model can take in a new noisy image and produce a denoised version of it. Jain et al. [85] was the first to use DNN for image denoising in 2008, and achieved denoising results comparable to the best traditional algorithms at the time, wavelet and MRF methods[86, 87], while having a smaller computational cost. Then more DNN methods were proposed [74, 88, 81, 89, 90]. Generally, the higher the depth and width of the network, the better the performance of the model, and correspondingly, the larger the number of parameters of the model. Due to the limitations of the computer memory at that time, these models require a trade-off between performance and the number of model parameters. Because actual noisy images can be readily acquired, whereas their corresponding clean images are less accessible, this leads to the challenge of obtaining perfect pairs of noisy-clean images. Therefore, in many studies, people use different noise adding strategies to obtain injected noise images based on limited clean datasets, and obtain artificially synthesized noisy-clean image pairs.

In addition to the commonly used additive white noisy images (AWNI), the noise with Poisson-Gaussian distribution can also be adopted as noise modelling. Zhang et al. [82] proposed the denoising convolutional neural network (DnCNN) in 2017 to train an image denoising model by manually adding Additive White Gaussian Noise (AWGN) on clean images to generate noisy-clean pairs. Through a residual learning, the model not only improves the model denoising performance, but also greatly reduces the amount of calculation. In [75], Guo et al. proposed a convolutional blind denoising network (CBDNet) designed specifically for real photographs. The CBDNet utilizes the Poisson-Gaussian noise distribution and in-camera processing. It features two subnetworks: the noise estimation subnetwork and the non-blind denoising subnetwork. To enhance the robustness of the denoiser, the noise estimation subnetwork applies an asymmetric loss, with greater penalties for underestimating the noise level. Following the CBDNet’s framework, a simpler yet highly effective blind denoising model called the SDNet was introduced in [91]. The SDNet employs the generalized signal-dependent noise model [92]. By incorporating a stage-wise procedure and lifted residual learning, the SDNet achieves competitive results on both synthesized and real noisy images.

Deep learning methods are data-driven and rely heavily on the quality and quantity of training data for achieving high-quality denoising results. Consequently, training datasets are critical for these models. However, acquiring absolutely clean images is often impractical in real-world applications, making methods that do not require clean images even more valuable for research.

Self-supervised image denoising is an image denoising algorithm that does not require clean images. It has attracted a large number of researchers to study this field [93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118]. Noise2Noise [107] was the first self-supervised algorithm that achieved performance comparable to supervised image denoising by training on only aligned noisy-noisy image pairs. However, obtaining a large number of two perfectly aligned noisy-noisy image pairs can be challenging. To address this issue, some researchers [111, 112, 113, 93, 97] use noisy images and noise models to artificially generate noisy-noisy pairs for training. However, in practical scenarios, noise level estimates and priors may not be readily available. Some researchers have developed denoising models using a single noisy image for training [108, 114, 96, 100]. One notable example is the Blind Spot Network (BSN) proposed in [108], and a large number of BSN-based methods [108, 109, 111, 113, 116, 119] have been proposed and have demonstrated superior denoising performance. In order to combine the advantages of convolutional neural network (CNN) in local feature extraction and Transformer in global feature extraction, some Transformer-based methods [101, 103, 120] have been proposed one after another, and have achieved excellent results in self supervised image denoising.

Although there have been several reviews on image denoising[121, 122, 123, 124], to the best of our knowledge, there is no dedicated survey of self-supervised image denoising algorithms, which makes it difficult for researchers to quickly understand the classic self-supervised image denoising algorithms and provide feasible directions for future research.This article aims to address this gap by investigating self-supervised image denoising algorithms in recent years. The main contributions of this paper are as following:

Firstly, we propose a survey focusing solely on self-supervised image denoising methods and divide the involved methods into three categories: General methods, BSN-based methods and Transformer-based methods as shown in Fig. 1.

Secondly, we briefly introduce the denoising principles of representative methods in each category.

Thirdly, through the statistical analysis of a large number of experimental results, we evaluate the performance of different methods on different denoising tasks, discuss the existing challenges in image denoising, and provide suggestions for the future development direction of this field.

The organizational framework of this review is as follows: Section II divides self-supervised image denoising algorithms into General method, BSN-based methods and Transformer-based methods, and provides a brief introduction to the principles of the classic algorithms in each category, as well as statistics on applicable scenarios, while Section III introduces common image denoising datasets and evaluation metrics. Section IV quantitatively and qualitatively analyzes the state-of-the-art self-supervised denoising methods in recent years. Section V discusses current challenges and future research directions for self-supervised denoising methods. Finally, Section VI concludes the survey.

II Self-supervised image denoising methods

Self-supervised image denoising algorithm is a deep learning image denoising algorithm that does not require paired noisy-clean images as training data.

BSN is a self-supervised image denoising algorithm based on the assumption that the noise is spatially independent and zero-mean [108], and uses the spatial correlation of the image signal to predict blind pixels through surrounding pixels.

In recent years, a large number of BSN-based image denoising algorithms have emerged [108, 109, 110, 111, 113, 116, 119], and Transformer has been proved to have irreplaceable advantages of CNN in the extraction of global information in image processing tasks [125, 126, 127, 128, 129, 130, 120, 131]. Therefore, this paper divides the self-supervised image denoising algorithm into: General methods, BSN-based methods, Transformer-based methods.

TABLE I: General self-supervised image denoising method.

Method	Other needs	Applications (denoising type)	Key words (remarks)
N2N[107]	Paired noise images	Gaussian, Poisson, Bernoulli noise denoising and random text overlays remove.	Real strictly aligned noiay-noisy image pairs.
GCBD [106]	Unpaired clean images	Real-world sRGB image noise, Gaussian and Mixture noise denoising.	GAN learned to generate noise distribution from noisy image and adding the noise to the clean image to get noisy-clean pairs.
SURE-based Method[132]	Noise Model	Gaussian noise denoising.	Stein’s unbiased risk estimator (SURE), a SURE-based refining method.
Noisier2noise [112]	Arbitrary noise model	Gaussian additive noise and multiplicative Bernoulli noise denoising.	Add synthetic noise signals to the given noise image as labels and continue add same type of noise to the labels as inputs.
NAC [93]	Noise model	AWGN and real-world sRGB image noise denoising.	Add synthetic noise signals to the given weak noise images as inputs, weak noise images as targets.
R2R [114]	Noisy level function(NLF) or ISP function	AWGN and real-world sRGB image noise denoising.	Use data augmentation technique to get noisy-noisy pairs.
NBR2NBR [96]	$\backslash$	Gaussian, Poisson noise and real-world rawRGB image noise denoising.	Two sub noise images that get from one noise image can be regard as a noisy-noisy pair.
Noise2Score [115]	Arbitrary noise model	Gaussian, Poisson, and Gamma noise denoising.	The first step is to train a Neural Network (NN) to estimate the score function and then obtain the final denoising result by Tweedie’s formula. It can be used to deal with any exponential family noises.
Kim et al. [98]	Arbitrary noise model	Gaussian, Poisson, Gamma noise denoising.	First to learn the score matching by a NN; second to obtain the denoised result via distribution-independent Tweedie’s formula, a new noise model and noise parameters estimation algorithm.
CVF-SID [100]	$\backslash$	Real-world sRGB image noise denoising.	Cyclic multi-Variate Function (CVF), uses the designed CNN model decompose the sRGB noise image to clean image, signal-independent and signal-dependent noise images.
IDR [97]	Noise model	Gaussian, binomial and impulse noise, real-world raw noise denoising.	Improving model denoising performance with an iterative method.
Vaksman et al. [104]	$\backslash$	Gaussian noise, real-world sRGB image noise denoising.	Capture bursts of noisy shots, and one is the input, and the rest are conducted by patch matching and stitching as targets, forming one-multiple training pairs.

II-A General self-supervised image denoising methods

Lehtinen et al. [107] proposed a self-supervised image denoising method called Noise2Noise (N2N)[107], based on the UNet [133] structure. N2N requires perfectly aligned noisy-noisy image pairs with the same image signal $c$ but different noises $n_{1}$ and $n_{2}$ . Paired noise images $x_{1}$ = $c+n_{1}$ , $x_{2}=c+n_{2}$ are used as input and target for training respectively, and the loss function can be expressed as:

L(\theta)=\|F(c+n_{1};\theta)-(c+n_{2})\|^{2}

(1)

where $F(\cdot)$ denotes the UNet model and $\theta$ denotes the model parameters. Through such a strategy, N2N achieves a denoising effect comparable to supervised image denoising. However, N2N requires noisy-noisy image pairs that are strictly aligned which can be difficult to obtain in practice due to environmental factors such as wind, sunlight, etc. To overcome this, scholars use noise models [112, 113, 93, 115, 97, 134] or noise level estimation [106, 97, 98] to artificially generate noisy-noisy pairs. Some use unpaired noisy-clean images to obtain simulated noise [106, 111, 112] and add simulated noise to the clean images for denoising model training, while others only require single noisy images for denoising model training[108, 109, 110, 94, 95, 96, 99, 116, 100, 102, 104, 105].

II-A1 Unpaired noisy-clean images methods

GCBD [106] is an image denoising method that uses unpaired noisy-clean images, and is divided into two stages: noise level estimation and noise removal. First, noise blocks are obtained from noise images, and a generative adversarial network (GAN) is trained to estimate the noise distribution and generate noise samples. Second, the generated noise samples are added to the clean image to obtain a synthetic noisy image to construct a paired training dataset, which is used to train a CNN to denoise a given noisy image. To obtain a set of approximate noise blocks $V$ , GCBD subtracts the average of relatively smooth blocks in the noisy image under the assumption that the noise distribution has zero-mean. GCBD uses a WGAN-GP [135] loss for the GAN and a CNN with a structure similar to DnCNN [82]. The loss for the GAN and CNN are described as following Eq. (2, 3):

	$\displaystyle L_{GAN}$	$\displaystyle=E_{\tilde{x}\sim P_{g}}[D(\tilde{x})]-E_{x\sim P_{r}}[D(x)]$		(2)
		$\displaystyle+\gamma E_{\hat{x}\sim P_{\hat{x}}}[(\\|\nabla_{\hat{x}}D(\hat{x})\\|_{2}-1)^{2}]$		(2)

where $P_{r}$ is the distribution of $V$ , $P_{g}$ is the generator distribution, $P_{\hat{x}}$ is defined as the distribution uniformly sampled along the line between pairs of points sampled from $P_{r}$ and $P_{g}$ [136], and $D(\cdot)$ denotes the output of the discriminative network regarding the corresponding noise block as input.

L_{CNN}(\theta)=\frac{1}{2N}\sum\limits_{i=1}^{N}\|R(y_{i};\theta)-(y_{i}-x_{i})\|_{F}^{2}

(3)

where $R(\cdot)$ denotes the output of the CNN, $x_{i}$ is the clean image, $y_{i}$ is the synthetic noisy image, $N$ is the batch size of the training and $\theta$ denotes the network parameters. GCBD does not require paired noisy-clean or noisy-noisy images, but only unpaired noisy-clean images for training. The noise synthesized by the GAN is closer to real noise than previous supervised methods that use specific noise models. However, GCBD assumes additive noise with zero-mean, which may not cover all types of noise, such as multiplicative noise.

Soltanayev et. al [132] proposed a self-supervised image denoising method based on Therorem of Monte-Carlo Stein’s unbiased risk estimator (MC-SURE [134]) as Eq. (4, 5):

\sum\limits_{i=1}^{K}\frac{\partial h_{i}(y)}{\partial y_{i}}=\lim\limits_{\epsilon\rightarrow 0}{E_{\tilde{n}}\{\tilde{n}^{t}(\frac{h(y+\epsilon\tilde{n})-h(y)}{\epsilon})\}}

(4)

where $\tilde{n}\sim\aleph_{0,1}\in\mathbb{R}^{K}$ be independent of $n$ and $y$ , and provided that $h(y)$ admits a well-defined sencond-order Taylor expansion. If not, this is still valid in the weak sense provided that $h(y)$ is tempered. So,

\frac{1}{K}\sum\limits_{i=1}^{K}\frac{\partial h_{i}(y)}{\partial y_{i}}\approx\frac{1}{\epsilon K}\tilde{n}^{t}(h(y+\epsilon\tilde{n})-h(y))

(5)

where $\epsilon$ is a fixed small positive value, $t$ is the transpose operator. The SURE-based method proposes to incorporate MC-SURE [134] into a stochastic gradient-based optimization algorithm for image denoising. An unbiased estimator is proposed to replace the unknown Bayesian risk for training, and with a carefully defined cost function, a learning-based image denoiser can be trained using this SURE-based method without requiring a noiseless ground truth. In addition, other deep learning-based denoisers can also utilize the MC-SURE-based training by modifying their cost function to a stochastic gradient-based optimizing format, as long as it satisfies the MC-SURE. This method assumes Gaussian noise with known variance in all simulations, but it can incorporate a variety of noise distributions like Poisson distribution and exponential families, potentially making it applicable to different applications in the measurement domain. However, the method is sensitive to hyper-parameters, and the neural network needs to be retrained if the underlying noise model changes.

C2N [137] framework trains a noise generator network $G$ to create a realistic noise map $\hat{n}$ for a given clean image $x$ , which produces pseudo-noise images $\hat{y}$ as follows:

\hat{y}=x+\hat{n}=x+G(x,r)

(6)

The 32-dim random vector $r$ is sampled from $\mathcal{N}(\bf{0},\bf{I^{2}})$ to reflect the stochastic behavior of noise according to the conditions of each scene, and authors use a new generator architecture to extract the diverse and complex noise distributions combining the Signal-Independent Pixel-Wise Transform, Signal-Dependent Pixel-Wise Transform and Spatially Correlated Transforms. A discriminator network $D$ distinguishes whether a given noisy image is synthesized from the generator $G$ or sampled from the real-world dataset. The two networks $G$ and $D$ are optimized adversarially with the Wasserstein distance [135] to learn the noise distribution as follows:

$\displaystyle L_{adv}(D,G)$	$\displaystyle=E_{y^{\prime}\sim P_{N}}[D(y^{\prime})]$	(7)
	$\displaystyle-E_{x\sim P_{C},r\sim P_{r}}[1-D(x+G(x,r))]$
	$\displaystyle+\lambda E_{x_{\delta}\sim P_{\delta}}[(\\|\nabla_{x_{\delta}}D(x_{\delta})\\|_{2}-1)^{2}]$

$P_{N}$ and $P_{C}$ denote the distribution of the real-world noisy and clean images, respectively. The real noisy image $y^{\prime}$ is sampled from $P_{N}$ . Prime notation on $y$ denotes that we use noisy image unpaired to clean image $x$ . $P_{r}$ is the distribution of random vector $r$ . The authors use a stabilizing loss term $L_{stb}$ to prevent the color-shifting problem, which is defined as follows:

L_{std}=\frac{1}{N}\sum\limits_{c}\|\sum\limits_{i\in B}{\hat{n}_{i,c}\|_{1}}

(8)

where $N$ denotes the number of pixel $i$ in mini-batch $B$ and $c$ is index of each color channels. They minimize this loss to make the channel-wise average of the generated noise approach zero. For denoising, the C2N framework uses pseudo-noise images $\hat{y}$ to train a denoising model in a supervised manner with $L_{1}$ loss. However, this denoising method may not perfectly reflect real-world noise, leading to potential deviations in practical applications.

II-A2 Methods relying on artificially generated noisy-noisy pairs

Noisier2Noise [112] was introduced in 2020 that was a variant of the N2N [107] method. It uses two fully aligned noise images for denoising model training, but these images are not collected naturally. Instead, the method adds the same noise type but different noises successively to a noise image. This process generates an image pair $(Z,Y)$ , with $Z$ as the input for the network $f(\cdot)$ and $Y$ as the training target. Then:

Y\triangleq X+N

(9)

Z\triangleq Y+M=X+N+M

(10)

The loss function used for training is $L_{2}$ loss:

L(\theta)=\min\limits_{\theta}{E_{Z}\|f(Z;\theta)-Y\|_{2}}

(11)

Since $M$ and $N$ belong to the same known noise model, they are approximately equal, it can be deduced that:

E[X|Z]=2E[Y|Z]-Z

(12)

After training, the original unknown noise $n$ can be obtained by subtracting $Z$ from twice the model output. The denoised image can be obtained by subtracting $n$ from the original noise image. Noisier2Noise only needs single noise images and a noise distribution statistical model to train the denoising model. It can also denoise spatially correlated noise. However, it requires a known noise model, which may not always be available or accurate in practice. Additionally, the noise model used for training may not be representative of the noise in the actual images, leading to suboptimal denoising results. Finally, Noisier2Noise may not perform well on images with complex noise patterns.

Noisy-As-Clean (NAC) [93] assumes the noise in the image is weak enough, so the noise image can be treated as a clean image for denoising model training. NAC uses paired noise images $(y,z)$ generated by the clean image $x$ and synthetic noise $n_{s}$ that has the same type of noise as the natural weak noise $n_{0}$ . The input $y$ and the target $z$ can be described as:

y=x+n_{0}

(13)

z=y+n_{s}

(14)

E[x]>>E[n_{0}],Var[x]>>Var[n_{0}]

(15)

$E(\cdot)$ denotes the expectation and $Var(\cdot)$ denotes the variance. Hence, the noisy image $y$ should have similar expectation with the clean image $x$ ,

E[x]>>E[n_{s}],Var[x]>>Var[n_{s}]

(16)

then,

E[y]=E[x+n_{0}]=E[x]+E[n_{0}]\approx E[x]

(17)

E[z]=E[y+n_{s}]\approx E[y]

(18)

so,

E_{y}[E_{z}[z|y]]=E[z]\approx E[y]=E_{x}[E_{y}[y|x]]

(19)

As can be seen from the above, in theory, when the noise in the image is weak enough, it can be treated as a clean image for denoising model training as the target. NAC can be combined with any CNN model, and it can achieve better performance than methods that use noisy-clean image pairs, especially when the noise is weak. However, NAC requires prior knowledge of the noise type and noise level estimation, which can be difficult to obtain in the real world. Although the article mentions using the Poisson-Gaussian mixture distribution when the noise type is unknown, many noises do not follow this distribution in practice.

Recorrupted-to-Recorrupted (R2R) [114] was introduced in 2021 that used a data augmentation method to train a denoising model with only single noisy images. Like N2N [107] that uses noisy-noisy image pairs to train the model, R2R generates these images from a single noisy image using different data augmentation methods for independent noise and noise correlated with the image signal, respectively. For independent noise, paired noise images $(\hat{y},\tilde{y})$ can be generated by the following:

\hat{y}=y+D^{T}z,\tilde{y}=y-D^{-1}z,z\sim\mathcal{N}(\bf{0},\bf{\sigma^{2}I})

(20)

where $y$ is the noisy image, $D$ can be any invertible matrix and $z$ is sampled from standard normal distribution $\mathcal{N}$ . For noise correlated with the image signal $x$ , paired noise images can be generated using a similar formula with the $x$ -dependent covariance matrix $\sqrt{\sum_{x}}$ by the Eq. (21):

\hat{y}=y+\sqrt{\sum_{x}}D^{T}z,\tilde{y}=y-\sqrt{\sum_{x}}D^{-1}z,z\sim\mathcal{N}(\bf{0,I})

(21)

R2R achieved competitive results for both AWGN removal and real-world image denoising compared to the state-of-the-art self-supervised learning methods. However, it requires a significant amount of computational resources due to the data augmentation method used for training, and R2R requires prior knowledge of the noise level, which can be difficult to obtain in the real world.

Zhang et al. [97] proposed the Iterative Data Refinement (IDR) method trained with noiser-noise image pairs. First, IDR proves that the lower the noise level on the target noisy image the better the denoising effect. Therefore, it is feasible to reduce the noise level on the original noisy image to improve the denoising performance of the model. Then, IDR is proposed to be divided into two steps: (1) train a model $M_{0}$ with noise reduction ability through the noisier-noise pairs $\{y_{0}+n$ , $y_{0}\}$ . (2) iteratively replace the noisy image $y_{0}$ with the preliminary denoising image from the previous round $M_{0}$ until the loss function converges. Through iteration in this way, the denoising performance of the model is continuously improved. However, this iterative method is easy to lose the detail information in the image signal, resulting in the final denoised image being too smooth.

Noise2Score [115] is a denoising method that estimates the score function of an image using Bayesian statistics, specifically Tweedie’s formula [138], which can be applied to any exponential family of noise. The method involves adding certain noise at different levels to noisy images to estimate the score function, rather than the noise itself. A post-processing step with Tweedie’s formula is then used, which is determined by the specific noise model, such as Poisson or Gamma, etc. Noise2Score is more flexible than some other denoising methods that assume a specific type of noise. Additionally, it estimates the score function directly rather than the noise itself, and it can achieve better denoising performance than methods that rely on estimating the noise level.

Kim et al. [98] developed a denoising method that uses a quadratic equation for noise model estimation, which can be applied to Gaussian, Poisson, and Gamma noise models. The method requires only one additional inference step for noise level estimation, which is more efficient than Noise2Score [115], which requires multiple inferences. The method involves adding unknown noise to noisy images and estimating the noise model using the quadratic equation. The estimated noise model is then used to estimate the noise level and denoise the image. However, it assumes that the unknown noise is limited to these three common noise models, so it may not perform well with other types of noise.

II-A3 Methods relying on only single noisy images

Neighbor2Neighbor (NBR2NBR) [96] is introduced in 2021 that uses a specific down-sampling strategy to divide the entire noise image into multiple $2\times 2$ cells, achieving noisy-noisy image pairs. The method randomly selects two pixels in each cell to get two sub-images and uses them as the input and target respectively for denoising model training. To prevent overfitting, a regularization term is added to supplement lost details. The loss function for NBR2NBR includes a reconstruction term and a regularization term as formulas in Eq. (22-24), which help to preserve image details:

L_{recon}=\|F(g_{1}(y))-g_{2}(y)\|_{2}^{2}

(22)

L_{regular}=\|F(g_{1}(y))-g_{2}(y)-F(g_{1}(F(y)))-g_{2}(F(y))\|_{2}^{2}

(23)

L=L_{recon}+\gamma L_{regular}

(24)

where $F(\cdot)$ represents the denoising model, $g_{1}(y)$ and $g_{2}(y)$ respectively represent the two sub-images obtained by applying the mask strategy on the noise image $y$ , $g_{1}(F(y))$ and $g_{2}(F(y))$ respectively represent two sub-images generated by using the down-sampling strategy on the output of the denoising model when the noise image $y$ is used as input with none gradient. Synthetic experiments with different noise distributions in sRGB space and real-world experiments on a denoising benchmark dataset in raw-RGB space have shown that NBR2NBR can achieve state-of-the-art denoising performance. However, approximating neighbor pixels can lead to over-smoothing, and down-sampling can destroy structural continuity.

CVF-SID[100] was introduced in 2022 by Reyhaneh Neshatavar et al. that proposes a Cyclic multi-Variate Function (CVF) module. The method only requires single noisy images. CVF-SID first decomposes the noisy imag $I_{n}$ into a clean image $I_{c}$ , signal-dependent noise $N_{d}$ and signal-independent noise $N_{i}$ :

I_{c},I_{d},N_{i}=F(I_{n})

(25)

where $F(\cdot)$ denotes a CNN model. And then it combines the three output parts arbitrarily as a new input to the previous model for decomposition training,

I_{n_{2}}=g(s_{1}I_{c},s_{2}N_{d},s_{3}N_{i})

(26)

I_{c_{2}},N_{d_{2}},N_{i_{2}}=F(I_{n_{2}})

(27)

where $g$ is a combination function, $s_{1}$ , $s_{2}$ , $s_{3}$ are constants which can be one value from [-1, 0, 1]. Finally, repeat this cycle until the loss converges. To allow the CNN model to decompose noisy images, the method uses various self-supervised loss terms based on statistical behaviors of general noise. CVF-SID directly decomposes the sRGB noise image, maintaining the texture structure of the image better than BSN-based methods. However, the method relies on the assumption that the noise in the image can be decomposed into signal-dependent and signal-independent components, which may not always be true in practice. Therefore, the method may not perform well on images with noise that is not well-modeled by this assumption. Additionally, the cyclic training process used by CVF-SID is computationally expensive and may take longer to converge than other denoising methods. This may make it difficult to use in real-time applications or on large datasets. Finally, while CVF-SID is able to maintain the texture structure of the image better than some other methods [108, 109, 114], it may still introduce some degree of smoothing or loss of detail due to the decomposition and recombination process.

Vaksman et al. [104] presented the Patch-Craft Self-Supervised Training for Correlated Image Denoising paper at CVPR 2023, to address the challenge of denoising images with an unknown noise model. The method uses bursts of noisy images for self-supervised denoiser training, where one image from the burst is used as input and the rest are used for constructing targets. The targets are constructed using the patch-craft frames concept introduced in PaCNet [138], where the input shot is split into fully overlapping patches and the nearest neighbor within the rest of the burst images is found for each patch. It is important to note that the input shot is strictly excluded from the neighbor search. The paper also proposes a novel method for performing the statistical analysis of the target noise, which involves cutting the proper left tail shown on the histogram of an empirical covariance for the input and the difference between target and input. This method allows trainers to exclude faulty image pairs from the training set, which can boost the performance of the denoising method. One potential limitation of the method is that it requires bursts of noisy images for training, which may not always be available in real-world scenarios. Additionally, the method may not perform well on images with highly complex noise patterns that cannot be effectively modelled by the patch-craft frames concept. Furthermore, the method may be computationally expensive due to the neighbor search process used for constructing targets.

II-B BSN-based self-supervised image denoising methods

BSN is a method of denoising images by predicting the noisy-free pixels of the masked pixels using the spatial continuity between the masked pixels and their surrounding pixels in the image signal. The effectiveness of BSN is based on the assumption that the noise in the image is spatially independent and zero-mean, while the image signal exhibits spatial correlation.

BSN-based methods can be divided into two categories based on the mask strategy: mask in input [108, 109, 113, 94, 95, 115, 98, 99, 103, 134] and mask in network [110, 111, 116, 101, 102, 105]. Mask in input uses a specific strategy to mask some pixels on the noisy image as input, and the complete noisy image as the target to perform supervised training on deep neural networks. In contrast, mask in network involves masking part of the receptive field, including the center pixel, during feature extraction in the network structure, so that the model uses some surrounding pixels of the feature maps to predict the target pixel. These two techniques are illustrated in the Fig. 2-3, respectively.

TABLE II: BSN-based self-supervised image denoising methods of mask in inputs.

Method	Other needs	Applications (denoising type)	Key words (remarks)
N2V[108]	$\backslash$	Gaussian noise and some biomedical image noise denoising.	Pixel-wise independent noise, randomly select several pixel to mask in the input images.
N2S[109]	$\backslash$	Blind Gaussian noise denoising.	$\mathcal{J}$ -invariant function determines the mask distribution, and replaces the pixels at $\mathcal{J}$ with random numbers.
PN2V [113]	An arbitrary noise model	Microscopy and low-light condition image noise denoising.	Mask input images, probabilistic model, predict per-pixel intensity distributions.
Noise2Same [94]	$\backslash$	Gaussian noise denoising.	$\mathcal{J}$ -invariant function determines the mask distribution, and replaces the pixels at $\mathcal{J}$ with local averages.
S2S [95]	$\backslash$	Blind Gaussian, salt-and-pepper and real-world sRGB image noise denoising.	Bernoulli-sampled instances of the input image results on noisy pairs.
B2UB[99]	$\backslash$	FMDD, Gaussian, Poisson and real-world rawRGB image noise denoising.	Gobal-aware mask mapper, re-visible loss.

II-B1 Mask in input methods

The first BSN-based method, Noise2Void (N2V), was proposed by Krull et al. [108] in 2018. N2V is a self-supervised image denoising algorithm that only requires a single noisy image to train the model. During training, N2V randomly selects multiple pixels on each $64\times 64$ patch in the noisy image to be replaced by any adjacent pixels, as shown in Fig. 2a. The resulting image with multiple pixels masked is regarded as input, and unmasked noisy image is regarded as the target for training. The corresponding loss function for N2V can be described as:

L(\theta)=\|F(y_{masked};\theta)-y_{original}\|^{2}

(28)

where $y_{original}$ represents the noisy image, $y_{masked}$ represents the masked version of $y_{original}$ , and $\theta$ represents the parameters of the denoising model $F(\cdot)$ . By randomly replacing some pixels in $y_{original}$ with adjacent pixels to create $y_{masked}$ , the model is trained to predict the missing pixels based on their spatial continuity with the surrounding pixels. This approach prevents the model from simply memorizing the noisy image and encourages it to learn the underlying structure of the image. While N2V can obtain a denoising model with only a single noisy image, it has some limitations. Only the masked pixels participate in model training each time, which wastes training resources and time. Additionally, the masked pixels no longer appear in the training process, resulting in the loss of information during model training.

Noise2Self (N2S) [109] is proposed based on the assumption of noise with statistical independence. The authors first demonstrate that the optimal denoising function for a given dataset can be found by minimizing the self-supervised loss using a class of $\mathcal{J}$ -invariant functions, which are functions that preserve the structure of the input image under arbitrary translations. They then show through examples that as the degree of correlation between features in the data increases, the optimal $\mathcal{J}$ -invariant denoising function approaches the optimal general denoising function. Based on this observation, the authors propose a deep neural network with millions of parameters that is modified to become $\mathcal{J}$ -invariant through the masking of pixels with random values. This involves adding a $\mathcal{J}$ -invariant mask $\mathcal{J}$ to the input, as shown in Fig. 2b, where the original values on $\mathcal{J}$ are masked and replaced by random values. The authors conduct experiments using two neural networks, UNet [133] and DnCNN [82] on three datasets including ImageNet [139] and CellNet [140] with different noise mixture strategies. The experiments show that N2S achieves competitive performance with the same neural network architectures trained with clean targets (called Noise2Truth) and independently noisy targets (N2N [107]). However, many open questions remain regarding the optimal choice of $\mathcal{J}$ in practice. The construction of $\mathcal{J}$ must reflect the signal-dependent patterns and noise independence relations, which involves making a bias-variance trade-off in the loss of the relative sizes of each subset $J$ and increasing information for prediction.

Several denoising methods [107, 108, 110], including N2V [108], assume that the noise present in an image is uniform, which is not always the case in reality. Probabilistic Noise2Void (PN2V) [113] extends the original N2V method and addresses this limitation by incorporating a probabilistic model of the noise. PN2V models the noise distribution using a probabilistic approach, which allows it to handle non-uniform noise and capture the variability in the noise distribution across different regions of the image. PN2V builds on N2V by introducing a probabilistic noise model that incorporates the noise distribution as a prior into the loss function. This allows the model to learn the noise distribution from the data and produce more accurate denoising results. In the PN2V model training followed by N2V, noisy images are first randomly masked as inputs and original noisy images as targets, shown in Fig. 2a. The network is then trained to predict probability distribution of the pixel’s signal in the masked regions, considering only its surroundings, while also learning the noise model parameters from the training data. The loss function of PN2V can be expressed as:

L=\mathop{\arg\min}\limits_{\theta}\sum\limits_{i=1}^{n}{-\ln(\frac{1}{K}\sum\limits_{k=1}^{K}){p(x_{i}|s_{i}^{k})}}

(29)

$p(\cdot)$ is described by an arbitrary noise model, which in the case of PN2V [113] is a histogram-based noise model. For each pixel $i$ , the network directly predicts $K$ output values ( $K=800$ , $s_{i}^{k}$ denotes each value), which are interpreted as independent samples, and $x_{i}$ denotes the observed noisy input value. Empirical results have shown that PN2V [113] outperforms N2V [108] in a variety of imaging applications, including fluorescence microscopy and electron microscopy. Furthermore, the probabilistic noise model allows PN2V to estimate the uncertainty of the denoised image, which is useful in downstream applications that require uncertainty quantification. However, in the real world, most noise sources do not strictly follow the Poisson-Gaussian distribution or histogram-based noise models used in PN2V [113]. Therefore, the PN2V model trained with a specific noise distribution will have limited performance when denoising real-world noise.

Noise2Same[94] proved through experiments that the denoising functions trained through mask-based blind-spot methods are not strictly $\mathcal{J}$ -invariant, meaning that they do not preserve the structure of the input image under arbitrary translations. This can lead to reduced denoising performance, particularly in cases where the noise is correlated or exhibits complex statistical properties. Furthermore, Noise2Same demonstrated that minimizing the Euclidean distance between the denoised image and the original image, expressed as $E_{x}\|f(x)-x\|^{2}$ , is not optimal for self-supervised image denoising. Noise2same masked the noise image same as N2S [109] to a $\mathcal{J}$ -invariant distribution, but replaced the masked pixel with local average values instead of random values. It proposed to take both the original noisy image and the $\mathcal{J}$ -invariant masked image as inputs and output two residual noise. The network would be trained from learning the invariance loss between the two outputs.

Self2Self With Dropout (S2S) [95] uses Bernoulli dropout to improve the performance of self-supervised denoising methods. S2S randomly masks instances of the noisy image using a Bernoulli process, which generates a set of image pairs for training the neural network. The Bernoulli dropout is applied using element-wise multiplication, and it is used in both the training and test phases to reduce prediction variance. To further improve performance, S2S uses partial convolution instead of standard convolution for re-normalization on sample pixels. Partial convolution is a convolutional operation that only convolves over the unmasked pixels, which helps to preserve the image structure and prevent artifacts. The self-prediction loss in S2S is defined on the pairs of Bernoulli sampled instances of the input image. The loss function encourages the network to predict the unmasked pixels of the noisy image from the masked pixels, while also taking into account the noise distribution and the image structure. The Bernoulli dropout and partial convolution help to regularize the denoising process and improve the performance of the network.

Blind2Unblind (B2UB) [99] is based on blind spots proposed by Wang et al. in 2022. B2UB uses a global-aware mask mapper to ensure that all the information in the image is fully utilized and uses re-visible loss to help restore more details in the image. B2UB’s mask strategy is called the global-aware mask mapper. Firstly, the noise image is divided into multiple $k\times k$ cells, and the pixels are marked with a number $i$ ( $i\in[1,k^{2}]$ , $k=4$ , in the experiments) from left to right and top to bottom in each cell. By masking the points with the same number in the original image, $k\times k$ masked images with the same size as the original image are obtained as inputs, as shown in Fig. 2c. Secondly, the denoising model produces $k\times k$ outputs. The pixels at all blind spots in the output images are collected and put back in the previously recorded positions, resulting in an image of the same size as the original noisy image with each pixel denoised. This image is combined with the original noisy image for supervised training. In addition to the supervised loss, B2UB uses a re-visible loss as a regularization loss item to make full use of the information in the noisy image for detail restoration. The whole loss can be described as:

L_{blind}=\|M^{-1}(F(M(y))-y\|_{2}^{2}

(30)

L_{re-visible}=\|M^{-1}(F(M(y))+\alpha\widetilde{F}(y)-(1+\alpha)y\|_{2}^{2}

(31)

L=L_{blind}+\beta L_{re-visible}

(32)

where $y$ is the original noise image, $F(\cdot)$ represents the denoising model, $\tilde{F}(\cdot)$ denotes the denoising model without gradient update during training, $M(\cdot)$ represents the first step of the global-aware mask mapper which creates blind spots, $M^{-1}(\cdot)$ represents the second step of the global-aware mask mapper which collects the denoised blind spots, $\beta$ is a fixed hyper-parameter (set as 1) that determines how much the blind term affects model denoising training, and $\alpha$ is a variable hyper-parameter (increase from 2 to 20 in training) that controls the strength of the visible part. B2UB uses the mask strategy of global-aware mask mapper to blind all points in the noise image once, so that all pixels participate in denoising training and make full use of the information of all pixels in the image. The re-visible loss is used to allow the model to see the complete noise image and repair the destroyed texture information in the process of creating blind spots. Extensive experiments on sRGB images with synthetic noise and rawRGB images with real noise demonstrate the superior performance of B2UB. However, B2UB cannot handle real-world sRGB images with spatially correlated noise that does not satisfy the prerequisites of BSN-based denoising methods.

TABLE III: BSN-based self-supervised image denoising methods of mask in networks.

Method	Other needs	Applications (denoising type)	Key words (remarks)
Laine et al.[110]	$\backslash$	Gaussian, Poisson and Impulse noise denoising.	Mask four direction half receptive field respectively to achieve blind spot network.
DBSN [111]	Unpaired clean images	AWGN, heteroscedastic Gaussian (HG) and multivariate Gaussian (MG) noise and real-world sRGB image noise denoising.	Dilated convolution, NLF, clean predict/noisy and clean/noisy synthetic pairs, knowledge distillation.
AP-BSN[116]	$\backslash$	Real-world sRGB image noise denoising.	Asymmetric $PD$ in the training and testing, random-replacing refinement post-processing.
MM-BSN[102]	$\backslash$	Real-world sRGB image noise denoising.	Multi-mask strategy for large area spatial correlated noise, multi-mask combined network.
Li et al.[105]	$\backslash$	Real-world sRGB image noise denoising.	Distinguish flat and textured regions in noisy images, and construct supervisions for them separately.

II-B2 Mask in network methods

Laine et al. [110] proposed a method for generating blind spots without using a mask on the input image for efficient training. The authors designed a novel architecture consisting of four parallel UNets[133] as the model framework. To achieve the effect of blind spots, four networks only look at the receptive fields of the upper, lower, left, and right half planes that do not include the central pixel, as shown in Fig. 3a. Only a series of $1\times 1$ convolutions are used to combine the output features of the four branches to ensure that the receptive field will not expand, thereby ensuring that the central pixel will never be seen in the entire network frame. The method of Laine et al. does not mask the input image, but the receptive field in the mask network framework, so that all pixels can participate in the training. Compared with N2V [108], this method greatly improves the training efficiency and fully utilizes the information of all pixels. However, the large number of model parameters may limit the scalability of the method to larger images or more complex noise patterns.

In 2020, Wu et al. [111] proposed a two-stage image denoising method using unpaired clean $x$ and noisy $y$ images. They trained a denoising model with Dilated Blind-Spot Network (DBSN), as shown in Fig. 3b and knowledge distillation. In the first stage, noisy images are input into the proposed DBSN and the noise estimation model $CNN_{est}$ respectively. Through maximizing the constrained log-likelihood as the loss function to conduct self-supervised image denoising training, the denoised image and the noise level estimation function can be obtained. The loss function for the first stage when learning DBSN and $CNN_{est}$ is the constrained negative log-likelihood is:

	$\displaystyle L_{self}$	$\displaystyle=\sum_{i}\{\frac{1}{2}(y_{i}-\hat{x_{i}})^{T}(\widehat{\sum}_{i}^{x}+\widehat{\sum}_{i}^{n})^{-1}(y_{i}-\hat{x_{i}})$		(33)
		$\displaystyle+\log\|\widehat{\sum}_{i}^{n}\|+tr((\widehat{\sum}_{i}^{n})^{-1}\widehat{\sum}_{i}^{x})\}$		(33)

where $tr(\cdot)$ denotes the trace of a matrix. $y_{i}$ is the real noisy image, $\hat{x}_{i}$ is the output of DBSN, $\widehat{\sum}_{i}^{n}$ and $\widehat{\sum}_{i}^{x}$ is a $C\times C$ covariance matrix output by $CNNest$ and D-BSN respectively for the position $i$ . In the second stage, the DBSN denoising model trained in the first stage is used to obtain the denoised image $\hat{x}_{i}$ of the original noise image, and the noise level function is applied to the original clean image to obtain a synthetic noise image $\hat{y}_{i}$ . Two pairs of image sets ( $y_{i}$ , $\hat{x}_{i}$ ) and ( $\hat{y}_{i}$ , $x_{i}$ ) are created, and they are input into any CNN model for joint training to obtain the final CNN denoiser. Its loss function is:

	$\displaystyle L_{distill}$	$\displaystyle=\sum_{i}\\|CDN(\hat{y_{i}})-x_{i}\\|^{2}$		(34)
		$\displaystyle+\beta\sum_{i}\\|CDN(y_{i})-\hat{x_{i}}\\|^{2}$		(34)

where $\beta=0.1$ is the trade-off parameter, convolutional denoising network ( $CDN$ ) can be any existing CNN denoisers. Wu et al. [111] utilizes unpaired noisy-clean image pairs for training the denoising model. The use of real clean images in the training allows the model to focus on more detail restoration, and the real noisy images make the model more adaptable to the removal of noise in the real world. The two branches are jointly trained, and the final denoising model has better denoising performance. Additionally, this model is based on the fact that the noise is spatially independent. Although, the pixel-shuffle downsampling (PD, stride=2 in DBSN) first introduced in Zhou et al. [141] are adopted to break the spatial correlation of the noise, the size of the spatial connection that PD (stride=2) can break is finite. For large-scale complex noise in real-world, PD as well as the noise level estimation may not perform well.

AP-BSN [116] is a denoising method for sRGB images with real-world noise, proposed by Wooseok Lee et al. in 2022. AP-BSN uses Asymmetric PD (AP) to break the spatial connection of noise, obtains blind spots by masking the central pixel of the convolution kernels, as shown in Fig. 3b, and adopts random-replacing refinement ( $R^{3}$ ) to restore image texture details and improve the denoising model’s performance. The stride of PD determines the down-sampling multiple, and different PD strides affect denoising performance during training and testing. The larger the stride, the larger the down-sampling multiple, the more thorough the spatial correlation of the noise is broken, but at the same time the texture information of the image signals is damaged more. AP means adopting PD with different strides during training and testing. AP-BSN uses PD with stride=5 during training and PD with stride=2 during testing for the best performance. The AP-BSN loss is the $L_{1}$ -norm distance between the output of the BSN model and the noisy image and can be described as:

I_{BSN}={\rm{PD^{-1}}}(F({\rm{PD}}(I_{N})))

(35)

L=\|I_{BSN}-I_{N}\|_{1}

(36)

where $\rm{PD^{-1}}$ means the inverse of PD, $F(\cdot)$ is the BSN model, and $I_{N}$ is the noise image. Taking the noisy image as both input and target. AP-BSN is the first denoising algorithm using self-supervised BSN on real-world sRGB images. However, the stride of PD synchronization affects the spatial correlation of noise and the degree of damage to the texture structure of the image signal. Therefore, the stride of PD cannot be too large. AP-BSN’s denoising performance may be greatly reduced when the noise in the image is spatially correlated in a large area.

MM-BSN [102] is designed for real-world sRGB images with large areas of spatially correlated noise. MM-BSN is also masked in the network. Unlike AP-BSN [116], which only shields the central pixel of the convolution kernel, MM-BSN uses Multi-mask to shield pixels at multiple positions of the convolution kernel to break the large-area correlation of noise, as shown in Fig. 3c. When the spatial connection area of the noise is large enough, only the central pixel of the convolution kernel is masked, and the surrounding pixels are likely still to be noisy-related. Then, when the feature extraction of the target point is performed through the surrounding pixels, the extracted features will still contain noise. Noise in the real world has different shapes due to different reasons of formation. Therefore, in order to break the spatial correlation of noise, MM-BSN uses masks of various shapes (such as ’/’, ’\’, ’—’, ’-’, ’ $\square$ ’ etc.) to mask the convolution kernel, the ’\’ shape sample is shown in Fig. 3c. A noisy image is likely to contain noise of multiple shapes, so it is inappropriate to use only one shape of mask in a model. A novel network structure is proposed to combine the features extracted by Multi-mask. In order for a model to handle noise of various shapes, multiple masks must be used at the same time. MM-BSN uses the same loss as AP-BSN [116]. The noisy image is input to the BSN model in which the pixels at multiple positions of the receptive field are masked for model training. MM-BSN is the first to propose the feature extraction of images by combining masks of various shapes. The Multi-mask strategy can break the structure of large-area spatial correlation noise and improve the denoising performance of the model. However, the shape of the mask is fixed and cannot be self-adjusted during training according to the shape of the noise.

Li et al. [105] proposed the Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising in CVPR2023. It researches how to do the supervision for flat areas and texture areas respectively. For flat areas, it proposed the blind-neighborhood network (BNN), which is deformed from BSN but have different receptive field. At the end of each branch in BNN, the blind spot is created by applying a single pixel shift to the features and a $(2k-1)\times(2k-1)$ blind-neighborhood can be created by increasing the pixel shift size from 1 to $k$ . The loss function for flat areas is:

L_{BNN}=\|x_{1}-y\|_{1}

(37)

$x_{1}$ is the ouput of BNN, and $y$ is the noisy image. In addition, to distinguish the texture areas from the flat areas, the paper first calculated the statand deviation map $\sigma$ :

	$\displaystyle\sigma(i,j)$	$\displaystyle=std(\tilde{x}_{1}(i-\frac{n-1}{2}:i+\frac{n-1}{2},$		(38)
		$\displaystyle j-\frac{n-1}{2}:j+\frac{n-1}{2}))$		(38)

where $std(\cdot)$ represents the standard deviation function, which is calculated on 1-channel patches by averaging the values of RGB channels. $n$ is the local window size (set as 7 in this paper). Then, it proposed the following expression to convert the binarization $\sigma$ to the normalized $\alpha$ with scope of [0, 1].

\alpha(i,j)=\left\{\begin{array}[]{ll}S(\sigma(i,j)-1),&\sigma(i,j)\leq l\\ 0.5,&l<\sigma(i,j)\leq u\\ S(\sigma(i,j)-5),&\sigma(i,j)>u\end{array}\right.

(39)

where $S(\cdot)$ represents the Sigmoid function, $l=1$ , and $\mu=5$ . The higher $\alpha(i,j)$ the more local area is textured. When the texture areas are defined, then a locally aware network (LAN) is presented by stacking $k$ $3\times 3$ convolution layers and several $1\times 1$ convolution blocks with channel attention mechanism [58]. The loss function for LAN training phase is as follows:

L_{LAN}=(1-\alpha)\cdot\|sg(\tilde{x}_{1})-\tilde{x}_{2}\|_{1}

(40)

where $\tilde{x}_{1}$ and $\tilde{x}_{2}$ denotes the output of the BNN and LAN, respectively. $sg(\cdot)$ represents stop gradient operation [142]. BNN creates blind spots in the receptive field by shifting the features, which can help to remove noise from flat areas. Although, LAN is specifically designed for texture areas and takes into account the local texture information, this approach may not work well on texture areas where the features are more complex and shifting them may cause loss of important information. Another potential limitation of BNN is that it requires careful tuning of the pixel shift size to create the blind spots, which may be sensitive to the noise characteristics of different images.

TABLE IV: Transformer-based self-supervised image denoising methods.

Method	Mask way	Applications (denoising type)	Key words (remarks)
DT [103]	Mask in inputs	FM dataset, Gaussian, Poisson and real-world rawRGB images noise denoising.	CNN and Transformer combined to denoising.
LG-BPN [101]	Mask in networks	Real-world sRGB image noise denoising.	Densely-Sampled Patch-Masked Convolution (DSPMC) for blind spot get and noise spatial correlated break, Dilated Transformer Block (DTB), combining local information with global information.
SwinIA [120]	$\backslash$	FM dataset, Gaussian, Poisson and real-world rawRGB images noise denoising.	Transformer.

II-C Transformer-based self-supervised image denoising methods

Zhang et al. [103] proposed a novel self-supervised image denoising method called Denoise Transformer (DT), which takes the masked noisy images as input (as shown in Fig. 2c) and is constructed using Context-aware Denoise Transformer (CADT) units and a Secondary Noise Extractor (SNE) block. CADT is designed as a dual branch, the local branch composed of several convolutional layers and deformable layers and the global branch composed by transformers. CADTs focus on the fusion and complementarity of local and global features, which boost the denoising performance by residual learning for the noise extraction. What’s more, SNE block is designed in low computational complexity for secondary global noise extraction, simple but effective. DT achieves a competitive performance comparing to the current state-of-the-art methods, especially on denoising images with blurred textures and dark areas. However, the computational complexity of the transformer components is high, which may limit its use in certain industrial applications. Further computational optimization may be needed to make the method more practical for real-world use.

LG-BPN [101] also aims to real-world sRGB images denoising based on the architecture combining BSN and Transformer. It fuses local and global information extracted by two branches: a local information extraction branch using multiple cascaded Dilated Convolution Blocks (DCB), and a global information extraction branch using Dilated Transformer Blocks (DTB). To break the noise spatial connection and generate blind spots, LG-BPN uses Densely-Sampled Patch-Masked Convolution (DSPMC) with a specific mask shape for convolution instead of the PD strategy used in AP-BSN [116]. Then DCB and DTB are used to extract local and global information. The mask shape in DSPMC is got from Algorithm 1, where $KH$ and $KW$ denotes the high and wide of convolution kernel, respectively, and mask has a same shape with the convolution kernel. The strategy of replacing the PD with a specific mask convolution kernel can prevent the texture structure of the original image signal from being irreversibly damaged, and allow the subsequent feature extraction to combine more surrounding pixel information, thereby retaining more high-frequency information.

Algorithm 1 Generate the mask

mask.fill(1)

dis=KH//2

for

i=0

KH

for

j=0

KW

abs(i-dis)+abs(j-dis)\leq dis

then

mask[:,:,i,j]=0

end if

end for

To maintain the effect of blind spots for DTB, LG-BPN modifies the self-attention calculation and feed-forward layer in the Transformer branch, replacing spatial attention with channel attention and using only $3\times 3$ dilated convolutions. This allows the Transformer branch to obtain global information while maintaining blind spots, improving denoising performance. Experimental results on real datasets demonstrate that LG-BPN achieves superior denoising performance, partially resolving the oppositional problem of preserving image texture information and destroying spatial phaseness of real noise through DSPMC and DTB. However, the specific mask shapes in DSPMC may not be optimal for all types of noise patterns, and may not work well on images with complex textures or patterns. Additionally, the modification of the Transformer branch in DTB may increase the computational complexity of the method, which may limit its practical use in certain applications.

Mikhail et al. [120] proposed Swin Transformer-based Image Autoencoder (SwinIA), which represents the first convolution-free transformer architecture for blind-spot self-supervised denoising. Unlike other models in its class, this autoencoder does not make any assumptions about the noise distribution and does not rely on masking during training. Additionally, the model can be trained in an autoencoder fashion with a single forward pass and an MSE loss. SwinIA achieves competitive results despite the inherent loss of information from the blind spot and the model’s ignorance of the noise nature. However, this method still has the problem of high computational complexity.

III Datasets and Evaluation Metrics

TABLE V: Datasets of image denoising.

Type	Dataset	Publish		Color	Greyscale	Set size	Resolution
Only clean images	BSD500[143]	2011	TPAMI	v	-	500	481 $\times$ 321
	DIV2K[144]	2017	CVPRW	v	-	1000	1972 $\times$ 1437
	BSD300[145]	2001	ICCV	v	-	500	481 $\times$ 321
	BSD68[146]	2009	IJCV	-	v	68	321 $\times$ 481 & 481 $\times$ 321
	CBSD68[146]	2005	-	v	-	68	321 $\times$ 481 & 481 $\times$ 321
	Urban100[147]	2015	CVPR	v	-	100	984 $\times$ 797
	Kodak24[148]	1999	-	v	-	24	768 $\times$ 512
	Set12[31]	2007	-	-	v	12	256 $\times$ 256
	Manga109[149]	2016	-	-	v	109	823 $\times$ 1169
	Set14[150]	2010	LNTCS	v	-	14	492 $\times$ 446
	McMaster[151]	2011	JEI	v	-	18	500 $\times$ 500
	Set5[152]	2012	BMVC	v	-	5	313 $\times$ 336
	Waterloo Exploration Database[153]	2017	TIP	v	-	4744	-
Only noisy images	NC12[154]	2015	-	v	-	-	-
	RNI15[154]	2015	IPOL	v	-	15	514 $\times$ 465
	RNI6[154]	2015	-	-	v	6	-
Clean with Real noisy images	RENOIR[155]	2018	VCIR	v	-	120	4364 $\times$ 3115
	NAM[156]	2016	CVPR	v	-	11	7360 $\times$ 4912
	CC[156]	2016	CVPR	v	-	15	-
	DND[157]	2017	CVPR	v	-	1000	512 $\times$ 512
	SIDD[158]	2018	CVPR	v	-	320	4586 $\times$ 3035
	PolyU[159]	2018	arxiv	v	-	100	2784 $\times$ 1856
	SID[160]	2018	CVPR	-	raw	5094	5078 $\times$ 3388
	NIND[161]	2019	CVPR	v	-	616	3083 $\times$ 3864
	FMDD[162]	2018	arxiv	v	v	12,000	256 $\times$ 256

III-A Datasets

In this section, we introduce the widely used public benchmarks for evaluating image denoising performance. These benchmarks include color maps, grayscale maps, and real noise maps, as shown in Table V, listed according to the published time and image types.

Kodak24[148]: This benchmark consists of 24 high-quality color images with a variety of content and is often used to compare the visual quality of different image processing methods.

Set5[152]: The Set5 dataset is a collection of 5 images (”baby”, ”bird”, ”butterfly”, ”head”, ”woman”) commonly used for testing the performance of image denoising and image super-resolution models.

Set12[31]: Set12 is a collection of 12 grayscale images of different scenes that are widely used for evaluating image denoising methods.

Set14[150]: The Set14 dataset is a collection of 14 images commonly used for testing visual performance.

Manga109[149]: This benchmark is composed of 109 manga volumes drawn by professional manga artists in Japan.

DIV2K[144]: This is a large dataset of RGB images with a diverse range of contents. The DIV2K dataset contains 800 high-resolution images in the train set, 100 high-resolution images in the validation set, and 100 diverse images in the test set.

BSD68[146]: This benchmark contains 68 grayscale images and is part of The Berkeley Segmentation Dataset and Benchmark. It is commonly used for measuring the performance of image denoising algorithms.

BSD300[145]: The public benchmark consists of all of the grayscale and color segmentations for 300 images. The images are divided into a training set of 200 images, and a test set of 100 images.

BSD500 [143]: The BSD500 is a dataset of 500 natural images that have been manually annotated with image segmentations. The dataset is commonly used by researchers in the field of computer vision and image processing for evaluating.

CBSD68[146]: This is the color version of BSD68 and is used for evaluating the performance of color image denoising methods.

Urban100[147]: This is a commonly used test set for evaluating the performance of super-resolution models.

McMaster [151]: This benchmark contains 18 clean images and is widely used for the test phase.

RNI6[154]: This benchmark contains 6 real noisy grayscale images without ground-truth and is usually used for visualization in the test phase.

RNI15[154]: It contains 15 real noisy color images without ground-truth, and is usually used for visualization in test phase.

RENOIR[155]: It is a dataset of color images corrupted by natural noise due to low-light conditions, together with spatially and intensity-aligned low noise images of the same scenes.

CC[156]: The cc contains 15 real noisy images with different ISOs, i.e., 1,600, 3,200 and 6,400.

DND[157]: The DND contained 50 real noisy images and the clean images were captured by low-ISO images.

Waterloo Exploration Database[153]: The Waterloo Exploration Database is a collection of images that have been intentionally degraded to simulate real-world conditions, such as noise, blur, and low resolution. It is used by researchers in the field of image processing to test and compare different algorithms for restoring degraded images to their original quality. The database is available for free and provides a standardized benchmark for evaluating the effectiveness of various image restoration techniques.

NC12[154]: The NC12 contained 12 noisy images and did not have ground-truth clean images.

SIDD[158]: The SIDD contained real noisy images from smart phones, and consisted of 320 image pairs of noisy and ground-truth images.

NAM[156]: The Nam includes 11 scenes, which are saved in JPGE format.

PolyU[159]: The PolyU consisted of 100 real noisy images with sizes of $2,784\times 1,856$ obtained by five cameras:a Nikon D800, Canon 5D Mark II, Sony A7 II, Canon 80D and Canon 600D.

SID[160]: The SID(see in the dark) dataset contains 5094 raw short-exposure images, each with a corresponding long-exposure reference image. Images were captured using two cameras: Sony $\alpha 7SII$ and Fujifilm X-T2.

NIND[161]: The NIND (Natural Image Noise Dataset) dataset is a set of real photographs with real noise, from identical scenes captured with varying ISO values. Most images are taken with a Fujifilm X-T1 and XF18-55mm, and other photographers are encouraged to contribute images for a more diverse crowdsourced effort.

FMDD[162]: The dataset consists of 12,000 real fluorescence microscopy images obtained with commercial confocal, two-photon, and wide-field microscopes and representative biological samples such as cells, zebrafish, and mouse brain tissues. We use image averaging to effectively obtain ground truth images and 60,000 noisy images with different noise levels.

III-B Evaluation Metrics

There are several metrics that can be used to evaluate the performance of different denoising algorithms. These include:

i. Peak signal-to-noise ratio (PSNR) [163]: PSNR is measured in decibels (dB) and is commonly used to measure the difference between two images, such as compressed images and original images. It can be used to evaluate the quality of compressed images, restoration of images and ground truth, and the performance of restoration algorithms. A higher PSNR value indicates better image quality.

ii. Mean Squared Error (MSE) [164]: MSE is a commonly used metric to evaluate the similarity between two images. It measures the average of the squared differences between the pixel values of the two images, with a lower MSE indicating a higher degree of similarity.

iii. Mean Absolute Error (MAE) [165]: MAE measures the average of the absolute differences between the pixel values of the two images, with a lower MAE indicating a higher degree of similarity. Unlike MSE, MAE gives equal weight to all pixel differences, regardless of their magnitude. The value ranges from 0 to infinity, with smaller values indicating better similarity.

iv. Root Mean Square Error (RMSE) [165]: RMSE measures the square root of the average of the squared differences between the pixel values of the two images, with a lower RMSE indicating a higher degree of similarity. RMSE is similar to MSE, but it is expressed in the same unit as the data being measured, making it easier to interpret. The value ranges from 0 to infinity, with smaller the values indicating better similarity.

v. Structural Similarity Index (SSIM) [163]: SSIM is a full-reference image quality assessment metric that compares the structural information, luminance, and contrast of distorted and reference images. It takes into account the human visual system’s sensitivity to these aspects of images and has been shown to provide better correlation with human perception of image quality than other metrics such as MSE. A higher SSIM value indicates better image quality.

vi. Figure of Merit (FOM) [166]: FOM is a metric used to evaluate the performance of image intensifier tubes used in night vision devices. It takes into account various factors such as resolution, signal-to-noise ratio, and distortion to provide an overall measure of the image quality produced by the image intensifier tube. A higher FOM value indicates better performance.

vii. Image Enhancement Factor (IEF) [167]: IEF is a measure used to evaluate the effectiveness of image enhancement methods. It is calculated as the ratio of the MSE between the enhanced image and the original image to the MSE between the input image and the original image. A higher IEF value indicates better effectiveness.

viii. Feature Similarity Index Mersure(FSIM) [168]: FSIM is a full-reference image quality assessment metric that compares the similarity of the structural information and luminance of distorted and reference images. It takes into account the human visual system’s sensitivity to contrast and spatial frequency and has been shown to correlate well with human perception of image quality.

In this paper, PSNR and SSIM are used for denoising performance comparison, as they are widely used in academic circles to evaluate denoising performance.

The expression of PSNR can be found in following:

MSE=\frac{1}{N}\|y-x\|^{2}

(41)

PSNR=10\times\lg(\frac{(2^{n}-1)^{2}}{MSE})

(42)

where $y$ is the denoised results, $x$ is the original image and $N$ is the total number of pixels in the image. $n$ is the number of bits per pixel.

For two images $x$ and $y$ , SSIM calculation can be performed by following:

SSIM(x,y)=[L(x,y)]^{\alpha}[C(x,y)]^{\beta}[S(x,y)]^{\gamma}

(43)

here,

L(x,y)=\frac{2\mu_{x}\mu_{y}+\epsilon_{1}}{\mu_{x}^{2}+\mu_{y}^{2}+\epsilon_{1}}

(44)

C(x,y)=\frac{2\sigma_{x}\sigma_{y}+\epsilon_{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}+\epsilon_{2}}

(45)

S(x,y)=\frac{\sigma_{xy}+\epsilon_{3}}{\sigma_{x}\sigma_{y}+\epsilon_{3}}

(46)

where, $\mu_{x}$ and $\mu_{y}$ are the mean values, $\sigma_{x}$ and $\sigma_{y}$ are the standard deviation, $\sigma_{xy}$ is the covariance of $x$ and $y$ , $\epsilon_{1}$ , $\epsilon_{2}$ and $\epsilon_{3}$ are three constants used separately to avoid system errors caused by a denominator of 0.

SSIM models distortion as a combination of three factors: brightness, contrast, and structure. The metric uses the mean as the estimate of brightness, the standard deviation as the estimate of contrast, and the covariance as a measure of structural similarity, satisfying $\alpha>0$ , $\beta>0$ , $\gamma>0$ . The parameters $\alpha$ , $\beta$ , and $\gamma$ can be used to adjust the relative weight of luminance, contrast, and structural components that contribute to SSIM, respectively.

In general, PSNR and SSIM can provide a quantitative perspective on the denoising quality of images, but it is also important to consider the visual effect comparison, as the human eye perception is sometimes more important.

IV Experiments

In this section, we evaluate the performance of recent denoising algorithms on various datasets. We quantitatively and qualitatively compare the denoising performance of the models through the metrics PSNR [163] and SSIM [163] and the visual effect of the denoised images. The noise types involved are: Gaussian noise [9], Poisson noise [10], Impulse noise [169] and Gamma noise [13]. We also evaluate the denoising performance on different types of images, including synthetic noise denoising on grayscale and color images, as well as real-world noise denoising on rawRGB and sRGB images.

IV-A Synthetic noise denoising on grayscale images

Table I-IV show that many models are capable of denoising synthetic noise images. To evaluate the denoising performance of synthetic noise on grayscale images, we take BSD68 [146], and Set12 [31] as benchmarks.

TABLE VI: The PSNR/SSIM results of Gaussian (

\sigma

=25) denoising on BSD68 [146] and Set12 [31]. The highest PSNR value is highlighted in bold, the second is underlined.

Method	BSD68	Set12
N2N [107]	28.86/0.823	30.72/0.845
DIP [170]	27.96/0.774	25.82/0.772
N2V [108]	27.72/0.794	25.01/0.656
GCBD [106]	29.15/-	-
DBSN [111]	29.12/0.804	30.32/0.835
DnCNN+NAC [93]	26.33/0.651	29.21/0.738
ResNet+NAC [93]	29.33/0.833	31.78/0.880
S2S [95]	28.70/0.803	-
N2S [109]	28.12/0.792	29.16/-
Noise2Score [115]	29.12/-	30.13/-
MC-SURE[134]	28.94/0.818	29.13/-
Noisier2Noise [112]	28.55/0.808	-
Laine19-pme [110]	28.84/0.814
R2R[114]	29.14/0.822	30.06/0.851
B2UB[99]	28.99/0.820	30.09/0.854
IDR [97]	29.20/0.835	30.40/0.863

TABLE VII: The PSNR/SSIM results of Gaussian (

\sigma

=50) denoising on BSD68 [146] and Set12 [31]. The highest PSNR value is highlighted in bold, the second is underlined.

Method	BSD68	Set12
N2N [107]	25.77/0.700	27.20/-
DIP [170]	25.04/0.645	-
N2V [108]	25.12/0.684	24.68/-
DBSN [111]	26.20/0.699	27.19/0.758
S2S [95]	25.92/0.699	-
N2S [109]	25.62/0.678	26.19/-
Noise2Score [115]	26.21/-	27.16/-
MC-SURE [134]	25.93/0.678	26.23/-
Noisier2Noise [112]	25.61/0.681	-
Laine19-pme [110]	25.78/0.698	-
R2R[114]	26.13/0.709	26.86/0.771
B2UB[99]	26.09/0.715	26.91/0.776
IDR [97]	26.25/0.726	27.29/0.789

Table VI and Table VII show PSNR/SSIM results of different models on BSD68[146] and Set12[31] datasets for removing Gaussian noise with levels of $\sigma$ =25 and $\sigma$ =50. The tables reveal that the model with the best denoising effect and the worst model are not always the same on different datasets. When $\sigma$ = 25, N2N [107], ResNet+NAC [93], and IDR [97] are among the best performing models, while when $\sigma$ = 50, IDR performs the best because its Iterative Data Refinement strategy is effective in removing high-level noise. However, these models require real noisy-noisy image pairs or noise models in addition to single noise images.

Since most models are evaluated on BSD68[146], we draw curves in Fig. 4 to visually compare the denoising performance of different models at different noise levels. As shown in Fig. 4, all models have much higher denoising performance with a noise level of 25 compared to the performance with a noise level of 50. This indicates that the larger the noise level, the more severe the damage to the details of the image, and the more challenging it becomes to strike a balance between restoring details and removing noise.

Fig. 5 presents the comparison of visual performance by self-supervised Laine et al. [110], DBSN [111] and IDR [97] denoising on the grayscale images BSD68 [146] and Set12 [31] datasets. Laine et al. achieves the best denoising performance on grayscale images, preserving more details during denoising. However, the image denoised by IDR appears too smooth, which is an inevitable result of the iterative method used for model optimization.

IV-B Synthetic noise denoising on color images

To evaluate the denoising performance of synthetic noise on grayscale images, we show the PSNR/SSIM results of different methods on several datasets such as Kodak [148], CBSD68 [146], BSD300 [145], Set9 [170], and Set14 [150]. The noise types covered in this section include Gaussian, Poisson, Impulse, Bernoulli and Gamma noise. While a large number of models use Gaussian and Poisson noise for experiments, only a few studies evaluate the effectiveness of removing other types of noise.

Table VIII displays the Impulse noise removal results on the Kodak[148], BSD300[145], and Set14[150] datasets. The results show that Laine19-pme [110] achieves a comparable effect to N2N [107] in the removal of Impulse noise. Additionally, the noise removal ability of models for fixed noise levels is higher than that of noise levels that vary within a certain range.

TABLE VIII: The PSNR results of Impulse noise denoising on Kodak [148], BSD300 [145] and Set14 [150].

Noisy type	Method	Kodak	BSD300	Set14
Impulse ( $\alpha$ =0.5)	N2N [107]	32.88	30.85	30.94
	Laine19-mu [108]	30.82	28.52	29.05
	Laine19-pme [110]	32.93	30.71	31.09
Impulse ( $\alpha\in$ [0,1])	N2N [107]	31.53	30.11	29.51
	Laine19-mu [110]	27.16	25.55	25.56
	Laine19-pme [110]	31.40	29.98	29.51

Table IX displays the denoising performance of N2N [107], N2V [108], N2S [109], Noise2Score [115], NBR2NBR [96] and Kim et al. [98] on Gamma noise. The table shows that N2N performs best to remove Gamma noise on all datasets. Additionally, the denoising performance of different methods at different noise levels is consistent.

TABLE IX: The PSNR results of Gamma noise denoising on Kodak [148], CBSD68 [146].

Noisy type	Method	Kodak	CBSD68
Gamma (K=100)	N2N [107]	36.26	35.45
	N2V [108]	31.96	31.83
	N2S [109]	32.83	31.71
	Noise2Score [115]	34.23	33.82
	NBR2NBR [96]	35.10	34.21
	Kim et al. [98]	35.42	34.52
Gamma (K=50)	N2N [107]	34.49	33.52
	N2V [108]	31.38	30.51
	N2S [109]	31.71	30.63
	Noise2Score [115]	31.34	31.05
	NBR2NBR [96]	32.38	32.11
	Kim et al. [98]	32.81	32.43

Table X and Table XI evaluate the denoising ability of more than a dozen self-supervised denoising methods considering Gaussian noise with a noise level of 25 and 50 on Kodak[148], CBSD68[146], BSD300[146], Set9[170], and Set14[150] datasets using PSNR/SSIM. The tables show that on Kodak and Set14, N2N is the best among all models, and it ahieves the second-best denoising on CBSD68, BSD300, and Set9. Overall, when $\sigma$ = 25, N2N[107] achieves the best denoising effect, while there exist methods that surpass N2N[107] in some datasets. This finding suggests that the denoising method using single noise images can achieve denoising performance comparable to or even better than that of N2N, which requires real noisy-noisy image pairs. When $\sigma$ = 50, IDR [97] achieved the best performance for its effectiveness in removing high-level noise, which is consistent with the performance on grayscale images denoising.

TABLE X: The PSNR/SSIM results of Gaussian (

\sigma

= 25) noise denoising on Kodak [148], CBSD68 [146], BSD300 [145], Set9 [170], and Set14 [150]. The highest PSNR value is highlighted in bold, the second is underlined.

Method	Kodak	CBSD68	BSD300	Set9	Set14
N2N [107]	32.41/0.884	31.10/-	31.39/0.889	31.33/0.957	31.37/0.868
MC-SURE [134]	30.75/-	30.23/-	-	-	-
DIP [170]	27.20/0.720	-	26.38/0.708	30.77/0.942	27.16/0.758
N2V [108]	31.63/0.869	29.22/-	30.72/0.874	30.66/0.947	28.84/0.802
N2S [109]	30.81/-	30.05/-	-	30.05/0.944	-
Laine19-mu [110]	30.62/0.840	28.61/-	28.62/0.803	-	29.93/0.830
Laine19-pme [110]	32.40/0.883	30.88/-	30.99/0.877	-	31.36/0.866
DBSN [111]	32.07/0.875	-	31.12/0.881	-	30.63/0.846
Noisier2Noise [112]	31.96/0.869	-	29.57/0.815	-	29.64/0.832
S2S [95]	31.28/0.864	-	29.86/0.849	31.74/0.956	30.08/0.839
Noise2Score [115]	31.89/-	30.85/-	-	-	-
R2R [114]	32.25/0.880	-	30.91/0.872	-	31.32/0.865
NBR2NBR [96]	32.08/0.879	30.56/-	30.79/0.873	-	31.09/0.864
IDR [97]	32.36/0.884	31.29/0.889	31.48/0.890	-	30.85/0.866
Kim et al. [98]	31.78/-	30.89/-	-	-	-
B2UB[99]	32.27/0.880	-	30.87/0.872	-	31.27/0.864
DT [103]	31.96/0.879	-	30.80/0.872	-	31.15/0.864

TABLE XI: The PSNR/SSIM results of Gaussian (

\sigma

= 50) noise denoising on Kodak [148], CBSD68 [146], BSD300 [145] and Set9 [170]. The highest PSNR value is highlighted in bold, the second is underlined.

Method	Kodak	CBSD68	BSD300	Set9
N2N [107]	29.23/0.803	27.94/-	28.17/0.799	28.94/0.929
MC-SURE[134]	26.93/-	26.24/-	-	-
DIP [170]	-	-	-	28.23/0.910
N2V [108]	28.57/0.776	25.13/-	27.60/0.775	27.81/0.912
N2S [109]	28.21/-	27.14/-	-	27.51/0.905
Laine19-mu [110]	27.78/-	26.42/-	-	-
Laine19-pme [110]	28.63/-	27.65/-	-	-
DBSN [111]	28.81/0.783	-	27.87/0.782	-
Noisier2Noise [112]	28.73/0.770	-	26.18/0.684	-
S2S [95]	-	-	-	29.25/0.928
Noise2Score [115]	28.83/-	27.75/-	-	-
NBR2NBR [96]	28.28/-	27.32/-	-	-
IDR [97]	29.27/0.803	28.09/0.800	28.25/0.802	-
Kim et al. [98]	28.64/-	27.56/-	-	-

Fig. 6 shows the denoising performance of 11 self-supervised image denoising models on the Kodak[148], BSD300[146] and Set14[150] datasets for Gaussian noise and Poisson noise. When the Gaussian noise level fluctuates from 5 to 50, N2N [107] and DIP [170] are the best and worst models for denoising, respectively, and the denoising performance of Laine19-pme [110], B2UB [99], DT [103] and NBR2NBR [96] are close to that of N2N. When the Poisson noise level is fixed at 30, N2N and DIP are also the methods with the best and worst denoising effects, respectively, and the denoising performance of S2S [95], B2UB, DT, and R2R [114] approach that of N2N. When the Poisson noise level fluctuates from 5 to 50, B2UB achieves the best denoising performance on BSD300 and Set14, except that N2N remains the best denoising effect on the Kodak dataset. In this case, the denoising performance of DT and R2R are closely followed by N2N and B2UB. It is worth noting that most of the methods that are close to or even exceed the denoising performance of N2N are BSN-based models. This observation suggests that BSN-based models have great potential for development.

We also show the visual performance of self-supervised models N2N [107], Laine et al. [110], NBR2NBR [96], IDR [97], B2UB [99] and DT [103] on color images from the BSD300 [146] and Kodak [148] datasets. As can be seen in Fig. 7-8, N2N achieves the best denoising performance on both datasets. The image from the BSD300 dataset contains more detailed textures, resulting in lower denoising performance for all models compared to the Kodak dataset. However, even the iterative-based method IDR achieves denoising performance similar to N2N on the Kodak dataset.

IV-C Real-world image denoising in sRGB images

TABLE XII: Quantitative comparison of real-world sRGB image denoising on SIDD [158] and DND [157] benchmark datasets. We get the official evaluation results from SIDD and DND benchmark websites.

\diamond

indicates that we have retrained the model, uploaded the test results and received the results. R indicates that the result is reported by R2R [114].

\ast

denotes the method with self-ensemble strategy [171]. The highest value is highlighted in bold, the second is underlined

Training data	Method	SIDD benchmark		DND
Training data	Method	PSNR	SSIM	PSNR	SSIM
Need noise images and other priors	GCBD [106]	-	-	35.58	0.922
	UIDNet [55]	32.48	0.897	-	-
	D-BSN [111] + MWCNN [172]	-	-	37.93	bf 0.937
	NAC [93]	-	-	36.20	0.925
	R2R [114]	34.78	0.898	-	-
	C2N [137] + DIDN^$\ast$ [173]	35.35	0.937	37.28	0.924
Need single noisy	Noise2Void [108]	27.68^R	0.668^R	-	-
	Noise2Self [109]	29.56^R	0.808^R	-	-
	CVF-SID (S2) [100]	34.71	0.917	36.50	0.924
	AP-BSN [116]	35.97	0.925	38.09	0.937
	Li et al. [104]	37.41	0.934	38.18	0.938
	LG-BPN [101]	37.28	0.936	38.43	0.942
	MM-BSN[102]	37.37	0.936	38.74	0.943

To compare the effectiveness of self-supervised denoising models on real-world sRGB images, we evaluated them on sRGB of SIDD benchmark [158], DND benchmark [157], CC [156] and PolyU [159] datasets using PSNR and SSIM as the metrics.

Table XII compares the performance of several models on real-world sRGB images by SIDD [158] and DND [157] benchmark datasets, where C2N [137], Li et al. [105], LG-BPN [101] and MM-BSN [102] achieving better performance. Notably, the latter three models are BSN-based methods that only require single noise images.

Table XIII compares the denoising performance of N2N [107], N2V [108], DIP [170], N2S [109], S2S [95], DBSN [111], NAC [93] and R2R [114] on the CC [156] dataset. Among them, S2S, DBSN, NAC, and R2R outperform N2N, demonstrating that it is possible to achieve better denoising results using single images and noise prior or unpaired noisy-clean images.

TABLE XIII: Quantitative comparison of real-world sRGB image denoising on CC [156] dataset.

Metrics	N2N [107]	N2V [108]	DIP[170]	N2S [109]	S2S [95]	DBSN [111]	NAC [93]	R2R [114]
PSNR	35.32	32.27	35.69	33.38	37.52	35.90	36.59	37.78
SSIM	0.916	0.862	0.926	0.846	0.947	0.937	0.950	0.945

Table XIV compares the denoising performance of DIP [170], N2V [108], N2S [109], S2S [95] and R2R [114] on the PolyU [159] dataset, with R2R achieving the best denoising performance, followed by S2S.

TABLE XIV: Quantitative comparison of real-world sRGB image denoising on PolyU [159] dataset.

Metrics	DIP [170]	N2V [108]	N2S [109]	S2S [95]	R2R [114]
PSNR	36.95	34.08	35.46	37.52	38.47
SSIM	0.975	0.954	0.965	0.983	0.965

The visual denoising performance of several self-supervised models of C2N [137], CVF-SID [100], AP-BSN [116], LG-BPN [101], and MM-BSN [102] are compared on the SIDD validation and benchmark, as well as the DND dataset. C2N requires unpaired noisy-clean image pairs, while the remaining methods, including BSN-based methods AP-BSN, LG-BPN, and MM-BSN, and the general method CVF-SID, only need single noise images.

Fig. 9 shows the visual comparison of five denoising methods on the SIDD[158] validation and benchmark datasets. From the SIDD validation image, it can be observed that the BSN-based methods produce smoother denoised results with better noise removal, but suffer from significant loss of texture information. C2N [137] and CVF-SID [100] preserve the detailed textures better, but CVF-SID has a poorer denoising performance. Therefore, C2N is more suitable for denoising noisy images with more detailed textures. As shown in Fig. 9 and Figure 10 for the SIDD benchmark images and the DND benchmark image, the BSN-based methods achieve better denoising results. For images with higher noise level and less fine textures in the image signal, the BSN-based methods are a recommended choice.

IV-D Real-world image denoising in rawRGB images

Table XV presents a quantitative comparison of self-supervised denoising mehods on real-world rawRGB images from the SIDD[158] benchmark and validation datasets. The table shows that B2UB [99] achieves the highest PSNR and SSIM on both datasets, followed by DT [103]. Notably, both B2UB and DT are also BSN-based methods that require only single noisy images for training. DT, which combines Transformer to supplement global information, is also among the top denoisers.

TABLE XV: Quantitative comparison of real-world rawRGB image denoising on SIDD [158] benchmark and validation datasets. The highest value is highlighted in bold, the second is underlined

Method	SIDD benchmark		SIDD validation
Method	PSNR	SSIM	PSNR	SSIM
N2V [108]	48.01	0.983	48.55	0.984
Laine19-mu (G) [110]	49.82	0.989	50.44	0.990
Laine19-pme (G)[110]	42.17	0.935	42.87	0.939
Laine19-mu (P)[110]	50.28	0.989	50.89	0.990
Laine19-pme (P)[110]	48.46	0.984	48.98	0.985
DBSN[111]	49.56	0.987	50.13	0.988
R2R[114]	46.70	0.978	47.20	0.980
NBR2NBR[96]	50.47	0.990	51.06	0.991
B2UB[99]	50.79	0.991	51.36	0.992
DT[103]	50.62	0.990	51.16	0.991

Fig. 11 depicts denoising results for various methods on the SIDD[158] validation and benchmark datasets. The first row shows images from the SIDD validation, while the second row shows images from the SIDD benchmark. As shown in the figure, the denoising performance of NBR2NBR [96] is noticeably worse than that of the other methods, especially on the SIDD validation image. The denoised image contains artifacts, as indicated by the red arrow in the figure. The BSN-based methods B2UB [99] and DT [103] that combines BSN and Transformer, have similar noise removal capabilities. However, as shown by the green arrow in the figure, DT preserves the details of the image signal better.

V Future direction

In recent years, self-supervised denoising methods based on deep neural networks have gained increasing attention. These methods have demonstrated comparable denoising performance to supervised methods and show great potential for further improvement. In this section, we summarize some of the challenges and directions in the self-supervised image denoising field based on survey results:

1. Most existing self-supervised denoising methods that require only single noisy images for model training are based on BSN and achieves the state-of-the-art performance. While these methods achieve good denoising results under the assumption of independent noise and zero-mean, real-world sRGB images often have spatially correlated noise that violates this assumption. Although some methods [116, 102, 101] have been proposed to destroy the spatial correlation of noise, they also destroy the texture information of the image signal. Therefore, developing methods that can preserve the detailed texture of the image signal while destroying the spatial connection of the noise would greatly improve the performance of all BSN-based models for denoising real sRGB images.

2. Existing self-supervised models are established based on small noise structures such as Gaussian noise, Poisson noise, or real noise in SIDD images. They cannot handle special noises like stripe noise or speckle noise. Therefore, image denoising with large areas of spatially correlated noise is a challenging direction.

3. CNNs have limitations in controlling global information due to the restricted receptive field. Transformers can help address this limitation by providing a global view of the image. However, attempts to combine CNN and Transformer for image denoising, such as in DT [103] and LG-BPN [101], did not achieve the expected improvement of ”one plus one bigger than two”. Given that the combination of CNN and Transformer is still in the early stages of research, there is still room for improvement in this area, making it a promising direction for future research. Developing more effective ways to integrate CNN and Transformer architectures for image denoising could lead to significant improvements in denoising performance.

4. Artificial Intelligence Generated Content (AIGC) is gaining more attention with the application of diffusion denoising methods[174, 175, 124], which have achived outstanding performance. We believe that diffusion models have great potential in self-supervised denoising, and it is our next reseach direction.

VI Conclusion

This paper provides a comprehensive survey of recent self-supervised image denoising models, dividing them into three categories: General methods, BSN-based methods, and Transformer-based methods. To the best of our knowledge, this is the first survey to focus solely on self-supervised image denoising methods. The paper provides a brief introduction to classical methods in each of the three categories and evaluates them quantitatively and qualitatively on various datasets.

From a large number of experimental results, the following findings were observed: (1) For the remaining general datasets, BSN-based methods perform better than other methods on most datasets. (2) BSN-based methods are prone to destroy texture information in image signals when breaking the noise spatial connection, leading to smoother denoised images with fine textures lost. Thus, for noisy images with more fine textures, general methods are more recommended. (3) The self-supervised Transformer-based image denoising method achieves the comparable denoising performance, but has not surpassed the performance of using CNN alone. Based on the above analyses, we tend to believe that future image noise reduction research should take the BSN-BASED method as the benchmark, and explore new ideas for large-area spatial contact noise reduction based on the Transformer and Diffusion models, in order to make the noise reduction theory more relevant to practical applications.

References

[1] Qing Xu, Hailin Jiang, Riccardo Scopigno, and Mateu Sbert. A new approach for very dark video denoising and enhancement. In 2010 IEEE International Conference on Image Processing, pages 1185–1188. IEEE, 2010.
[2] Lin Li, Ronggang Wang, Wenmin Wang, and Wen Gao. A low-light image enhancement method for both denoising and contrast enlarging. In 2015 IEEE International Conference on Image Processing (ICIP), pages 3730–3734. IEEE, 2015.
[3] R Mark Henkelman. Measurement of signal intensities in the presence of noise in mr images. Medical physics, 12(2):232–233, 1985.
[4] Yuli Wang, Ryan Herbst, and Shiva Abbaszadeh. Electronic noise characterization of a dedicated head-and-neck cancer pet based on czt, 2021.
[5] Zhiliang Liu and Liyuan Ren. Shaking noise exploration and elimination for detecting local flaws of steel wire ropes based on magnetic flux leakages. IEEE Transactions on Industrial Electronics, 70(4):4206–4216, 2022.
[6] Rina Komatsu and Tad Gonsalves. Effectiveness of u-net in denoising rgb images. Comput. Sci. Inf. Techn, pages 1–10, 2019.
[7] Charles Boncelet. Image noise models. In The essential guide to image processing, pages 143–167. Elsevier, 2009.
[8] Qian Zhao, Deyu Meng, Zongben Xu, Wangmeng Zuo, and Lei Zhang. Robust principal component analysis with complex noise. In International Conference on Machine Learning, pages 55–63. PMLR, 2014.
[9] David Slepian. The one-sided barrier problem for gaussian noise. Bell System Technical Journal, 41(2):463–501, 1962.
[10] David Middleton. On the theory of random noise. phenomenological models. i. Journal of Applied Physics, 22(9):1143–1152, 1951.
[11] RJ McIntyre. Multiplication noise in uniform avalanche diodes. IEEE Transactions on Electron Devices, (1):164–168, 1966.
[12] Edward A Cockayne. Dwarfism with retinal atrophy and deafness. Archives of disease in childhood, 11(61):1, 1936.
[13] MA Schultz. Shutdown reactivity measurements using noise techniques. Noise Analysis in Nuclear Systems, pages 135–154, 1964.
[14] DA Ronken. Intensity discrimination of rayleigh noise. The Journal of the Acoustical Society of America, 45(1):54–57, 1969.
[15] Carl Eckart. The theory of noise in continuous media. The Journal of the Acoustical Society of America, 25(2):195–199, 1953.
[16] Roderick L Jerde, Laurence E Peterson, and Wayne Stein. Effects of high energy radiations on noise pulses from photomultiplier tubes. Review of Scientific Instruments, 38(10):1387–1394, 1967.
[17] Xiao Yang, Lilong Chai, Ramesh Bahadur Bist, Sachin Subedi, and Zihao Wu. A deep learning model for detecting cage-free hens on the litter floor. Animals, 12(15):1983, 2022.
[18] Haonan Han, Rui Yang, Shuyan Li, Runze Hu, and Xiu Li. Ssgd: A smartphone screen glass dataset for defect detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
[19] Peng Zhou, Zheng Liu, Hemmings Wu, Yuli Wang, Yong Lei, and Shiva Abbaszadeh. Automatically detecting bregma and lambda points in rodent skull anatomy images. PloS one, 15(12):e0244378, 2020.
[20] Yingzhou Lu, Kosaku Sato, and Jialu Wang. Deep learning based multi-label image classification of protest activities. arXiv preprint arXiv:2301.04212, 2023.
[21] Jinlin Xiang and Eli Shlizerman. Tkil: tangent kernel approach for class balanced incremental learning. arXiv preprint arXiv:2206.08492, 2022.
[22] Yuanzhou Wei, Dan Zhang, Meiyan Gao, Yuanhao Tian, Ya He, Bolin Huang, and Changyang Zheng. Breast cancer prediction based on machine learning. Journal of Software Engineering and Applications, 16(8):348–360, 2023.
[23] Ziyang Wang, Chengkuan Zhao, and Zixuan Ni. Adversarial vision transformer for medical image semantic segmentation with limited annotations. In The 33rd British Machine Vision Conference, page 1002, 2022.
[24] Baoru Huang, Anh Nguyen, Siyao Wang, Ziyang Wang, Erik Mayer, David Tuch, Kunal Vyas, Stamatia Giannarou, and Daniel S Elson. Simultaneous depth estimation and surgical tool segmentation in laparoscopic images. IEEE Transactions on Medical Robotics and Bionics, 4(2):335–338, 2022.
[25] Weiwei Zhao, Yida Wang, Fangfang Zhou, Gaiying Li, Zhichao Wang, Haodong Zhong, Yang Song, Kelly M Gillen, Yi Wang, Guang Yang, et al. Automated segmentation of midbrain structures in high-resolution susceptibility maps based on convolutional neural network and transfer learning. Frontiers in neuroscience, 16:801618, 2022.
[26] Chenyu You, Jinlin Xiang, Kun Su, Xiaoran Zhang, Siyuan Dong, John Onofrey, Lawrence Staib, and James S Duncan. Incremental learning meets transfer learning: Application to multi-site prostate mri segmentation. In International Workshop on Distributed, Collaborative, and Federated Learning, pages 3–16. Springer, 2022.
[27] Young-Gun Lee, Zheng Tang, and Jenq-Neng Hwang. Online-learning-based human tracking across non-overlapping cameras. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2870–2883, 2017.
[28] Zheng Tang and Jenq-Neng Hwang. Moana: An online learned adaptive appearance model for robust multiple object tracking in 3d. IEEE Access, 7:31934–31945, 2019.
[29] Yubo Luo and Yongfeng Huang. Text steganography with high embedding rate: Using recurrent neural networks to generate chinese classic poetry. In Proceedings of the 5th ACM workshop on information hiding and multimedia security, pages 99–104, 2017.
[30] Yubo Luo, Yongfeng Huang, Fufang Li, and Chinchen Chang. Text steganography based on ci-poetry generation using markov chain model. KSII Transactions on Internet and Information Systems (TIIS), 10(9):4568–4584, 2016.
[31] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[32] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Non-local means denoising. Image Processing On Line, 1:208–212, 2011.
[33] Antoni Buades, Bartomeu Coll, and J-M Morel. A non-local algorithm for image denoising. In 2005 IEEE computer society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. Ieee, 2005.
[34] Jean-Luc Starck, Emmanuel J Candès, and David L Donoho. The curvelet transform for image denoising. IEEE Transactions on Image Processing, 11(6):670–684, 2002.
[35] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising with block-matching and 3d filtering. In Image processing: algorithms and systems, neural networks, and machine learning, volume 6064, pages 354–365. SPIE, 2006.
[36] M Kivanc Mihcak, Igor Kozintsev, Kannan Ramchandran, and Pierre Moulin. Low-complexity image denoising based on statistical modeling of wavelet coefficients. IEEE Signal Processing Letters, 6(12):300–303, 1999.
[37] Tao Chen, Kai-Kuang Ma, and Li-Hui Chen. Tri-state median filter for image denoising. IEEE Transactions on Image Processing, 8(12):1834–1838, 1999.
[38] Ming Zhang and Bahadir K Gunturk. Multiresolution bilateral filtering for image denoising. IEEE Transactions on Image Processing, 17(12):2324–2333, 2008.
[39] Weisheng Dong, Lei Zhang, Guangming Shi, and Xin Li. Nonlocally centralized sparse representation for image restoration. IEEE Transactions on Image Processing, 22(4):1620–1630, 2012.
[40] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.
[41] Uwe Schmidt and Stefan Roth. Shrinkage fields for effective image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2774–2781, 2014.
[42] Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and Wotao Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 4(2):460–489, 2005.
[43] Dongwei Ren, Wangmeng Zuo, David Zhang, Lei Zhang, and Ming-Hsuan Yang. Simultaneous fidelity and regularization learning for image restoration. IEEE Transactions on pattern analysis and machine intelligence, 43(1):284–299, 2019.
[44] Wangmeng Zuo, Lei Zhang, Chunwei Song, David Zhang, and Huijun Gao. Gradient histogram estimation and preservation for texture enhanced image denoising. IEEE Transactions on image Processing, 23(6):2459–2472, 2014.
[45] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
[46] Yuanzhou Wei, Meiyan Gao, Jun Xiao, Chixu Liu, Yuanhao Tian, and Ya He. Research and implementation of cancer gene data classification based on deep learning. Journal of Software Engineering and Applications, 16(6):155–169, 2023.
[47] Ziheng Chen, Fabrizio Silvestri, Jia Wang, He Zhu, Hongshik Ahn, and Gabriele Tolomei. Relax: Reinforcement learning agent explainer for arbitrary predictive models. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 252–261, 2022.
[48] Ashwini Pokle, Jinjin Tian, Yuchen Li, and Andrej Risteski. Contrasting the landscape of contrastive and non-contrastive learning. arXiv preprint arXiv:2203.15702, 2022.
[49] Ziyang Wang, Tianze Li, Jian-Qing Zheng, and Baoru Huang. When cnn meet with vit: Towards semi-supervised learning for multi-class medical image semantic segmentation. In European Conference on Computer Vision, pages 424–441. Springer, 2022.
[50] Yongsheng Mei, Hanhan Zhou, Tian Lan, Guru Venkataramani, and Peng Wei. Mac-po: Multi-agent experience replay via collective priority optimization. arXiv preprint arXiv:2302.10418, 2023.
[51] Ziheng Chen and Hongshik Ahn. Item response theory based ensemble in machine learning. International Journal of Automation and Computing, 17:621–636, 2020.
[52] Yishan Gong, Wei Zhang, Zhijia Zhang, and Yuanyuan Li. Research and implementation of traffic sign recognition system. In Wireless Communications, Networking and Applications: Proceedings of WCNA 2014, pages 553–560. Springer, 2016.
[53] Yunzhong He, Cong Zhang, Ruoyan Kong, Chaitanya Kulkarni, Qing Liu, Ashish Gandhe, Amit Nithianandan, and Arul Prakash. Hiercat: Hierarchical query categorization from weakly supervised data at facebook marketplace. In Companion Proceedings of the ACM Web Conference 2023, pages 331–335, 2023.
[54] Jialu Wang, Ping Li, and Feifang Hu. A/B testing in network data with covariate-adaptive randomization. In Proceedings of the 40th International Conference on Machine Learning, pages 35949–35969, 2023.
[55] Zhiwei Hong, Xiaocheng Fan, Tao Jiang, and Jianxing Feng. End-to-end unpaired image denoising with conditional adversarial networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4140–4149, 2020.
[56] Yubo Luo, Le Zhang, Zhenyu Wang, and Shahriar Nirjon. Efficient multitask learning on resource-constrained systems. arXiv preprint arXiv:2302.13155, 2023.
[57] Yuan Gu, Mingyue Wang, Yishu Gong, Song Jiang, Chen Li, and Dan Zhang. Unveiling breast cancer risk profiles: A comprehensive survival clustering analysis empowered by an interactive online tool for personalized medicine. medRxiv, pages 2023–05, 2023.
[58] Fei Wen, Mian Qin, Paul Gratz, and Narasimha Reddy. Openmem: Hardware/software cooperative management for mobile memory system. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 109–114. IEEE, 2021.
[59] Ping Wang, Fei Wen, Paul V Gratz, and Alex Sprintson. Simd-matcher: A simd-based arbitrary matching framework. ACM Transactions on Architecture and Code Optimization (TACO), 19(3):1–20, 2022.
[60] Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. Deep learning–based text classification: a comprehensive review. ACM computing surveys (CSUR), 54(3):1–40, 2021.
[61] Domen Tabernik and Danijel Skočaj. Deep learning for large-scale traffic-sign detection and recognition. IEEE Transactions on Intelligent Transportation Systems, 21(4):1427–1440, 2019.
[62] Enming Luo, Stanley H Chan, and Truong Q Nguyen. Adaptive image denoising by targeted databases. IEEE Transactions on Image Processing, 24(7):2167–2181, 2015.
[63] Bart Goossens, Aleksandra Pizurica, and Wilfried Philips. Removal of correlated noise by modeling the signal of interest in the wavelet domain. IEEE Transactions on Image Processing, 18(6):1153–1165, 2009.
[64] Ymir Mäkinen, Lucio Azzari, and Alessandro Foi. Exact transform-domain noise variance for collaborative filtering of stationary correlated noise. In 2019 IEEE International Conference on Image Processing (ICIP), pages 185–189. IEEE, 2019.
[65] Song Jiang, Yuan Gu, and Ela Kumar. Magnetic resonance imaging (mri) brain tumor image classification based on five machine learning algorithms. Cloud Computing and Data Science, pages 122–133, 2023.
[66] Vandana Sachdev, Yuan Gu, James Nichols, Wen Li, Stanislav Sidenko, Darlene Allen, Colin Wu, and Swee Lay Thein. A machine learning algorithm to improve risk assessment for patients with sickle cell disease. Blood, 134:893, 2019.
[67] Ramesh Bahadur Bist, Xiao Yang, Sachin Subedi, and Lilong Chai. Mislaying behavior detection in cage-free hens with deep learning technologies. Poultry Science, 102(7):102729, 2023.
[68] Wenqiang Li, Yuk Ming Tang, Ziyang Wang, Kai Ming Yu, and Suet To. Atrous residual interconnected encoder to attention decoder framework for vertebrae segmentation via 3d volumetric ct images. Engineering Applications of Artificial Intelligence, 114:105102, 2022.
[69] Xiaobo Ma, Abolfazl Karimpour, and Yao-Jan Wu. Statistical evaluation of data requirement for ramp metering performance assessment. Transportation Research Part A: Policy and Practice, 141:248–261, 2020.
[70] Xiaobo Ma. Traffic Performance Evaluation Using Statistical and Machine Learning Methods. PhD thesis, The University of Arizona, 2022.
[71] Yi He, Fudong Lin, Nian-Feng Tzeng, et al. Interpretable minority synthesis for imbalanced classification. In International Joint Conferences on Artificial Intelligence, 2021.
[72] Fudong Lin, Xu Yuan, Lu Peng, and Nian-Feng Tzeng. Cascade variational auto-encoder for hierarchical disentanglement. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1248–1257, 2022.
[73] Laith Alzubaidi, Jinshuai Bai, Aiman Al-Sabaawi, Jose Santamaría, AS Albahri, Bashar Sami Nayyef Al-dabbagh, Mohammed A Fadhel, Mohamed Manoufali, Jinglan Zhang, Ali H Al-Timemy, et al. A survey on deep learning tools dealing with data scarcity: definitions, challenges, solutions, tips, and applications. Journal of Big Data, 10(1):46, 2023.
[74] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
[75] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei Zhang. Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1712–1722, 2019.
[76] Yoonsik Kim, Jae Woong Soh, Gu Yong Park, and Nam Ik Cho. Transfer learning from synthetic to real-noise denoising with adaptive instance normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3482–3492, 2020.
[77] Pengju Liu, Hongzhi Zhang, Wei Lian, and Wangmeng Zuo. Multi-level wavelet convolutional neural networks. IEEE Access, 7:74973–74985, 2019.
[78] Bumjun Park, Songhyun Yu, and Jechang Jeong. Densely connected hierarchical network for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[79] Liangyu Chen, Xiaojie Chu, Xiangyu Zhang, and Jian Sun. Simple baselines for image restoration. In European Conference on Computer Vision, pages 17–33. Springer, 2022.
[80] Zongsheng Yue, Qian Zhao, Lei Zhang, and Deyu Meng. Dual adversarial network: Toward real-world noise removal and noise generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part X 16, pages 41–58. Springer, 2020.
[81] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. Advances in Neural Information Processing Systems, 25, 2012.
[82] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[83] Saeed Anwar and Nick Barnes. Real image denoising with feature attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3155–3164, 2019.
[84] Meng Chang, Qi Li, Huajun Feng, and Zhihai Xu. Spatial-adaptive network for single image denoising. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 171–187. Springer, 2020.
[85] Viren Jain and Sebastian Seung. Natural image denoising with convolutional networks. Advances in Neural Information Processing Systems, 21, 2008.
[86] Stan Z Li. Markov random field models in computer vision. In Computer Vision—ECCV’94: Third European Conference on Computer Vision Stockholm, Sweden, May 2–6 1994 Proceedings, Volume II 3, pages 361–370. Springer, 1994.
[87] Jia Chen and Chi-Keung Tang. Spatio-temporal markov random field for video denoising. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[88] Gregory Vaksman, Michael Elad, and Peyman Milanfar. Patch craft: Video denoising by deep modeling and patch matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2157–2166, 2021.
[89] Weisheng Dong, Peiyao Wang, Wotao Yin, Guangming Shi, Fangfang Wu, and Xiaotong Lu. Denoising prior driven deep neural network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(10):2305–2318, 2018.
[90] Forest Agostinelli, Michael R Anderson, and Honglak Lee. Adaptive multi-column deep neural networks with application to robust image denoising. Advances in Neural Information Processing Systems, 26, 2013.
[91] Hengyuan Zhao, Wenze Shao, Bingkun Bao, and Haibo Li. A simple and robust deep convolutional approach to blind image denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[92] Xinhao Liu, Masayuki Tanaka, and Masatoshi Okutomi. Practical signal-dependent noise parameter estimation from a single noisy image. IEEE Transactions on Image Processing, 23(10):4361–4371, 2014.
[93] Jun Xu, Yuan Huang, Ming-Ming Cheng, Li Liu, Fan Zhu, Zhou Xu, and Ling Shao. Noisy-as-clean: Learning self-supervised denoising from corrupted image. IEEE Transactions on Image Processing, 29:9316–9329, 2020.
[94] Yaochen Xie, Zhengyang Wang, and Shuiwang Ji. Noise2same: Optimizing a self-supervised bound for image denoising. Advances in Neural Information Processing Systems, 33:20320–20330, 2020.
[95] Yuhui Quan, Mingqin Chen, Tongyao Pang, and Hui Ji. Self2self with dropout: Learning self-supervised denoising from single image. In Proceedings of the IEEE/CVF Conference on cComputer Vision and Pattern Recognition, pages 1890–1898, 2020.
[96] Tao Huang, Songjiang Li, Xu Jia, Huchuan Lu, and Jianzhuang Liu. Neighbor2neighbor: Self-supervised denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14781–14790, 2021.
[97] Yi Zhang, Dasong Li, Ka Lung Law, Xiaogang Wang, Hongwei Qin, and Hongsheng Li. Idr: Self-supervised image denoising via iterative data refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2098–2107, 2022.
[98] Kwanyoung Kim, Taesung Kwon, and Jong Chul Ye. Noise distribution adaptive self-supervised image denoising using tweedie distribution and score matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2008–2016, 2022.
[99] Zejin Wang, Jiazheng Liu, Guoqing Li, and Hua Han. Blind2unblind: Self-supervised image denoising with visible blind spots. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2027–2036, 2022.
[100] Reyhaneh Neshatavar, Mohsen Yavartanoo, Sanghyun Son, and Kyoung Mu Lee. Cvf-sid: Cyclic multi-variate function for self-supervised image denoising by disentangling noise from image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17583–17591, 2022.
[101] Zichun Wang, Ying Fu, Ji Liu, and Yulun Zhang. Lg-bpn: Local and global blind-patch network for self-supervised real-world denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18156–18165, 2023.
[102] Dan Zhang, Fangfang Zhou, Yuwen Jiang, and Zhengming Fu. Mm-bsn: Self-supervised image denoising for real-world with multi-mask based on blind-spot network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4188–4197, 2023.
[103] Dan Zhang and Fangfang Zhou. Self-supervised image denoising for real-world images with context-aware transformer. IEEE Access, 11:14340–14349, 2023.
[104] Gregory Vaksman and Michael Elad. Patch-craft self-supervised training for correlated image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5795–5804, 2023.
[105] Junyi Li, Zhilu Zhang, Xiaoyu Liu, Chaoyu Feng, Xiaotao Wang, Lei Lei, and Wangmeng Zuo. Spatially adaptive self-supervised learning for real-world image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9914–9924, 2023.
[106] Jingwen Chen, Jiawei Chen, Hongyang Chao, and Ming Yang. Image blind denoising with generative adversarial network based noise modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3155–3164, 2018.
[107] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.
[108] Alexander Krull, Tim-Oliver Buchholz, and Florian Jug. Noise2void-learning denoising from single noisy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2019.
[109] Joshua Batson and Loic Royer. Noise2self: Blind denoising by self-supervision. In International Conference on Machine Learning, pages 524–533. PMLR, 2019.
[110] Samuli Laine, Tero Karras, Jaakko Lehtinen, and Timo Aila. High-quality self-supervised deep image denoising. Advances in Neural Information Processing Systems, 32, 2019.
[111] Xiaohe Wu, Ming Liu, Yue Cao, Dongwei Ren, and Wangmeng Zuo. Unpaired learning of deep image denoising. In European Conference on Computer Vision, pages 352–368. Springer, 2020.
[112] Nick Moran, Dan Schmidt, Yu Zhong, and Patrick Coady. Noisier2noise: Learning to denoise from unpaired noisy data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12064–12072, 2020.
[113] Alexander Krull, Tomáš Vičar, Mangal Prakash, Manan Lalit, and Florian Jug. Probabilistic noise2void: Unsupervised content-aware denoising. Frontiers in Computer Science, 2:5, 2020.
[114] Tongyao Pang, Huan Zheng, Yuhui Quan, and Hui Ji. Recorrupted-to-recorrupted: Unsupervised deep learning for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2043–2052, 2021.
[115] Kwanyoung Kim and Jong Chul Ye. Noise2score: tweedie’s approach to self-supervised image denoising without clean images. Advances in Neural Information Processing Systems, 34:864–874, 2021.
[116] Wooseok Lee, Sanghyun Son, and Kyoung Mu Lee. Ap-bsn: Self-supervised denoising for real-world images via asymmetric pd and blind-spot network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17725–17734, 2022.
[117] Majed El Helou and Sabine Süsstrunk. Blind universal bayesian image denoising with gaussian noise level learning. IEEE Transactions on Image Processing, 29:4885–4897, 2020.
[118] Jun Xu, Dongwei Ren, Lei Zhang, and David Zhang. Patch group based bayesian learning for blind image denoising. In Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, pages 79–95. Springer, 2017.
[119] David Honzátko, Siavash A Bigdeli, Engin Türetken, and L Andrea Dunbar. Efficient blind-spot neural network architecture for image denoising. In 2020 7th Swiss Conference on Data Science (SDS), pages 59–60. IEEE, 2020.
[120] Mikhail Papkov and Pavel Chizhov. Swinia: Self-supervised blind-spot image denoising with zero convolutions. arXiv preprint arXiv:2305.05651, 2023.
[121] Chunwei Tian, Lunke Fei, Wenxian Zheng, Yong Xu, Wangmeng Zuo, and Chia-Wen Lin. Deep learning on image denoising: An overview. Neural Networks, 131:251–275, 2020.
[122] Ademola E Ilesanmi and Taiwo O Ilesanmi. Methods for image denoising using convolutional neural network: a review. Complex & Intelligent Systems, 7(5):2179–2198, 2021.
[123] Saeed Izadi, Darren Sutton, and Ghassan Hamarneh. Image denoising in the deep learning era. Artificial Intelligence Review, pages 1–46, 2022.
[124] Zhaoming Kong, Fangxi Deng, Haomin Zhuang, Xiaowei Yang, Jun Yu, and Lifang He. A comparison of image denoising methods. arXiv preprint arXiv:2304.08990, 2023.
[125] Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):87–110, 2022.
[126] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[127] Mo Zhao, Gang Cao, Xianglin Huang, and Lifang Yang. Hybrid transformer-cnn for real image denoising. IEEE Signal Processing Letters, 29:1252–1256, 2022.
[128] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s):1–41, 2022.
[129] Fangfang Zhou, Dan Zhang, and Zhenming Fu. High dynamic range imaging with context-aware transformer. arXiv preprint arXiv:2304.04416, 2023.
[130] Ziyang Wang and Irina Voiculescu. Dealing with unreliable annotations: A noise-robust network for semantic segmentation through a transformer-improved encoder and convolution decoder. Applied Sciences, 13(13):7966, 2023.
[131] Shuyan Li, Xiu Li, Jiwen Lu, and Jie Zhou. Self-supervised video hashing via bidirectional transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13549–13558, 2021.
[132] Shakarim Soltanayev and Se Young Chun. Training deep learning based denoisers without ground truth data. Advances in Neural Information Processing Systems, 31, 2018.
[133] Keunsoo Ko, Jun-Tae Lee, and Chang-Su Kim. Pac-net: Pairwise aesthetic comparison network for image aesthetic assessment. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 2491–2495. IEEE, 2018.
[134] Sathish Ramani, Thierry Blu, and Michael Unser. Monte-carlo sure: A black-box optimization of regularization parameters for general denoising algorithms. IEEE Transactions on Image processing, 17(9):1540–1554, 2008.
[135] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. Advances in Neural Information Processing Systems, 30, 2017.
[136] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
[137] Geonwoon Jang, Wooseok Lee, Sanghyun Son, and Kyoung Mu Lee. C2n: Practical generative noise modeling for real-world denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2350–2359, 2021.
[138] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
[139] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[140] Patrick Cahan, Hu Li, Samantha A Morris, Edroaldo Lummertz Da Rocha, George Q Daley, and James J Collins. Cellnet: network biology applied to stem cell engineering. Cell, 158(4):903–915, 2014.
[141] Yuqian Zhou, Jianbo Jiao, Haibin Huang, Yang Wang, Jue Wang, Honghui Shi, and Thomas Huang. When awgn-based denoiser meets real noises. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13074–13081, 2020.
[142] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
[143] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2010.
[144] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017.
[145] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 2, pages 416–423. IEEE, 2001.
[146] Stefan Roth and Michael J Black. Fields of experts: A framework for learning image priors. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 860–867. IEEE, 2005.
[147] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
[148] Rich Franzen. Kodak lossless true color image suite. source: http://r0k. us/graphics/kodak, 4(2):9, 1999.
[149] Kiyoharu Aizawa, Azuma Fujimoto, Atsushi Otsubo, Toru Ogawa, Yusuke Matsui, Koki Tsubota, and Hikaru Ikuta. Building a manga dataset “manga109” with annotations for multimedia applications. IEEE MultiMedia, 27(2):8–18, 2020.
[150] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In Curves and Surfaces: 7th International Conference, Avignon, France, June 24-30, 2010, Revised Selected Papers 7, pages 711–730. Springer, 2012.
[151] Lei Zhang, Xiaolin Wu, Antoni Buades, and Xin Li. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. Journal of Electronic imaging, 20(2):023016–023016, 2011.
[152] Marco Bevilacqua, Aline Roumy, Christine M. Guillemot, and Marie-Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, 2012.
[153] Kede Ma, Zhengfang Duanmu, Qingbo Wu, Zhou Wang, Hongwei Yong, Hongliang Li, and Lei Zhang. Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing, 26(2):1004–1016, 2016.
[154] Marc Lebrun, Miguel Colom, and Jean-Michel Morel. The noise clinic: a blind image denoising algorithm. Image Processing On Line, 5:1–54, 2015.
[155] Josue Anaya and Adrian Barbu. Renoir–a dataset for real low-light image noise reduction. Journal of Visual Communication and Image Representation, 51:144–154, 2018.
[156] Seonghyeon Nam, Youngbae Hwang, Yasuyuki Matsushita, and Seon Joo Kim. A holistic approach to cross-channel image noise modeling and its application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1683–1691, 2016.
[157] Tobias Plotz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1586–1595, 2017.
[158] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1692–1700, 2018.
[159] Jun Xu, Hui Li, Zhetong Liang, David Zhang, and Lei Zhang. Real-world noisy image denoising: A new benchmark. arXiv preprint arXiv:1804.02603, 2018.
[160] Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3291–3300, 2018.
[161] Benoit Brummer and Christophe De Vleeschouwer. Natural image noise dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[162] Yide Zhang, Yinhao Zhu, Evan Nichols, Qingfei Wang, Siyuan Zhang, Cody Smith, and Scott Howard. A poisson-gaussian denoising dataset with real fluorescence microscopy images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11710–11718, 2019.
[163] Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.
[164] Daniel Wallach and Bruno Goffinet. Mean squared error of prediction as a criterion for evaluating and comparing system models. Ecological modelling, 44(3-4):299–306, 1989.
[165] Cort J Willmott and Kenji Matsuura. Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research, 30(1):79–82, 2005.
[166] H Garo Balian and Nelson W Eddy. Figure-of-merit (fom), an improved criterion over the normalized chi-squared test for assessing goodness-of-fit of gamma-ray spectral peaks. Nuclear Instruments and Methods, 145(2):389–395, 1977.
[167] Mohammad Ali Badamchizadeh and Ali Aghagolzadeh. Comparative study of unsharp masking methods for image enhancement. In Third International Conference on Image and Graphics (ICIG’04), pages 27–30. IEEE, 2004.
[168] Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE Transactions on Image Processing, 20(8):2378–2386, 2011.
[169] D Henderson and RP Hamernik. Impulse noise: critical review. The Journal of the Acoustical Society of America, 80(2):569–584, 1986.
[170] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9446–9454, 2018.
[171] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[172] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo. Multi-level wavelet-cnn for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 773–782, 2018.
[173] Songhyun Yu, Bumjun Park, and Jechang Jeong. Deep iterative down-up cnn for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
[174] Shreyas Fadnavis, Joshua Batson, and Eleftherios Garyfallidis. Patch2self: Denoising diffusion mri with self-supervised learning. Advances in Neural Information Processing Systems, 33:16293–16303, 2020.
[175] Cheng Yang, Lijing Liang, and Zhixun Su. Real-world denoising via diffusion model. arXiv preprint arXiv:2305.04457, 2023.