
Hierarchical Disentangled Representation for Invertible Image Denoising and Beyond

Wenchao Du, Hu Chen, Yi Zhang and Hongyu Yang W. Du, H. Chen, and H. Yang are with the College of Computer Science, Sichuan University, Chengdu 610065, China. Y. Zhang is with the College of Cyber Science and Engineering, Sichuan University, Chengdu, 610065, China. Email: [email protected]; [email protected]; [email protected]; [email protected].
Abstract

Image denoising is a typical ill-posed problem due to complex degradation. Leading methods based on normalizing flows have tried to solve this problem with an invertible transformation instead of a deterministic mapping. However, the implicit bijective mapping is not explored well. Inspired by the observation that noise tends to appear in the high-frequency part of the image, we propose a fully invertible denoising method that injects the idea of disentangled learning into a general invertible neural network to split noise from the high-frequency part. More specifically, we decompose the noisy image into clean low-frequency and hybrid high-frequency parts with an invertible transformation and then disentangle case-specific noise and high-frequency components in the latent space. In this way, denoising is made tractable by inversely merging the noiseless low and high-frequency parts. Furthermore, we construct a flexible hierarchical disentangling framework, which aims to decompose most of the low-frequency image information while disentangling noise from the high-frequency part in a coarse-to-fine manner. Extensive experiments on real image denoising, JPEG compressed artifact removal, and medical low-dose CT image restoration have demonstrated that the proposed method achieves competitive performance on both quantitative metrics and visual quality, with significantly less computational cost.

Index Terms:
Invertible image denoising, bijective transformation, disentangled learning, hierarchical representation.

I Introduction

Image denoising, aiming to recover a clean observation from its noisy measurement, is a classic inverse problem. From a Bayesian perspective, most traditional approaches view it as a general maximum a posteriori (MAP) optimization problem, with assumptions regarding both the image priors and the noise (usually Gaussian), which deviate from real cases and lead to critical limitations in practical scenarios. Deep learning-based methods [1, 2, 3, 4] have achieved superior denoising performance in recent years. Most of them view denoising as a nonlinear mapping between noisy and clean image pairs. However, real image noise is generally accumulated from multiple degradation sources, which results in a non-injective mapping due to various noise types and levels.

Bridging a bijective transformation between noisy and clean image pairs to solve such an ambiguous inverse problem in low-level vision has been explored in recent years. Previous methods [5, 6] have employed convolutional neural networks (CNNs) to model wavelet transforms for image restoration, in which the wavelet decomposition and reconstruction processes are modeled separately, leading to an injective mapping procedure. Generative adversarial network-based methods [7, 8] have also modeled the bijective transformation for noisy image generation and restoration in supervised and unsupervised manners. However, these methods need different models to simulate the two processes separately, which leads to complex training procedures.

[Figure 1 panels (PSNR/SSIM), two rows of crops: 17.55/0.09, 35.30/0.83, 35.56/0.84, 35.92/0.85; 18.47/0.14 Noisy, 31.62/0.71 InvDN, 31.91/0.72 DANet, 32.25/0.73 Ours]
Figure 1: Real image noise removal results on the SIDD validation set. Two representative methods are selected for comparison, i.e., flow-based InvDN [9] and GAN-based DANet [7]; PSNR and SSIM values are also reported. Zoom in for better visualization.

Normalizing flows, which allow efficient and exact likelihood calculation and sampling via invertible transformations, have been applied to solve ill-posed inverse problems in low-level vision [10, 11, 12, 9]. NoiseFlow [10] is a seminal flow-based model for camera noise generation. Unlike general CNN-based methods [13] that model camera imaging pipelines in the forward and reverse directions to simulate real noise generation, NoiseFlow learns the distribution of real noise and generates diverse noisy images by latent variable sampling to augment training data. However, it treats the clean image as a conditional prior and thus is not fully invertible between clean and noisy image pairs. Considering that noise tends to appear in the high-frequency part of the image, InvDn [9] discards the high-frequency content and then utilizes an invertible neural network (INN) [14] to capture the distribution of the high-frequency part of the clean image; image denoising is achieved by sampling a latent variable from a predefined distribution to approximate the lost high-frequency information. Since the model is bijective, the ill-posedness of the task is partly alleviated. However, InvDn assumes that the low and high-frequency contents of the image are independent of each other and thus lacks the ability to exploit their dependency for image denoising; as shown in Fig. 1, the recovered image loses part of its high-frequency details, e.g., over-smoothed image edges. Furthermore, the implicit bijective mapping is not fully guaranteed due to the extra latent variable sampling.

Instead, in this paper, we propose a fully invertible model for image denoising that injects the idea of disentangled learning into a general invertible architecture to explore feature-level splitting of noise and high-frequency signals. Specifically, we aim to remove noise from the hybrid high-frequency part of the noisy image while preserving as much case-specific high-frequency information as possible. To do so, we first decompose the image into low and high-frequency representations via an invertible transform during forward propagation. Assuming the decomposed low-frequency information is noiseless, the information-lossless characteristic of the normalizing flow implies that the noise appears only in the high-frequency part. Therefore, our goal is to split the noise component from it, so that the noise-free high-frequency signal is preserved well.

To this end, we introduce the concept of disentangled learning into the general invertible architecture, which transforms the hybrid high frequencies into compact and independent representations with internal characteristics. We model the high-frequency representation learning in the form of a distribution, i.e., the marginal distribution of the hybrid high-frequency representation is driven to obey a pre-specified distribution, e.g., an isotropic Gaussian. In this way, we directly disentangle the noise and clean high-frequency representations along specific dimensions without extra latent variable sampling, which leads to a more accurate and stable bijective transformation. Image denoising is achieved by inversely merging only the noise-free low and high-frequency representations. Furthermore, we hierarchically decompose the high-frequency representation into low and high-frequency parts and disentangle noise from the fine-grained high-frequency parts, which results in a flexible and efficient framework for generalized image denoising tasks.

In short, our contributions are summarized as follows:

  1.

    We propose a fully invertible model for image denoising, which introduces the idea of disentangled learning into a general invertible architecture and achieves a more accurate bijective mapping without latent variable sampling.

  2.

    We construct a hierarchical high-frequency decomposition framework, which can handle diverse noise with varying complexity and achieves a better trade-off between performance and computational cost.

  3.

    Extensive experiments on natural image denoising, JPEG compression artifact removal and medical low-dose CT image restoration demonstrate that the proposed method achieves competitive performance in terms of quantitative and qualitative evaluations. To the best of our knowledge, this is the first fully invertible method capable of solving multiple real image denoising tasks.

The remainder of the paper is organized as follows: Section II provides a brief review of related work. Section III presents our approach in detail. In Section IV, extensive experiments are conducted to evaluate the proposed method, and the conclusion is presented in Section V.

II Related Work

II-A Traditional Methods

Most traditional image denoising models construct a MAP optimization problem with a data fidelity term and an extra regularization term. Along this direction, making assumptions regarding the noise distribution is necessary for most methods [15, 16, 17] to build the model (e.g., a mixture of Gaussians). The regularization term is generally based on natural image priors. Classic total variation [18] uses the statistical characteristics of images to remove noise. Sparse dictionary learning and Field-of-Experts (FoE) also employ certain priors existing in image patches [19, 20, 21]. The non-local similarity method [22] employs non-locally similar patterns of the image. In addition, transform-domain techniques are also explored, e.g., wavelet-domain methods [23] and block-matching and 3D filtering (BM3D) [24].

II-B Deep Learning based Methods

Instead of preset image and noise distribution priors, DNN-based methods directly learn a denoiser in a data-driven manner. An early method [25] first explored the multi-layer perceptron (MLP). Chen et al. [4] further proposed a feed-forward deep network called the trainable non-linear reaction diffusion (TNRD) model. Furthermore, Mao et al. [3] utilized a fully convolutional encoder-decoder network with symmetric skip connections for image restoration. Zhang et al. [2] proposed the Gaussian denoising convolution network (DnCNN) and achieved superior performance. Considering the non-local self-similarity of images, non-local attention networks [26, 27] are also explored in image restoration.

However, most of these works focus on synthetic noisy images with specific noise levels; spatially variant noise limits the capacity of such models in real cases. To this end, some methods employed self-adaptive noise levels estimated from the inputs as extra priors (e.g., FFDNet [28] and CBDNet [29]). VDN [30] further integrated variational inference into noise estimation and image denoising within a unified Bayesian framework. AINDNet [31] introduced transfer learning to mitigate the domain gap between the real and synthetic noise distributions. In addition, simulating real noise with a generative model was also explored in recent works [7, 32]. Considering that DNN-based image denoising is non-injective in nature, a complex network architecture and powerful representation ability are always required. Thus, Transformer-based methods [33, 34, 35] have drawn more attention due to their powerful representation learning of global image features.

II-C Normalizing Flow based Methods

Invertible neural networks (INNs) have drawn increasing attention for solving ambiguous inverse problems [36]. Unlike DNNs, INNs focus on learning the forward process and use additional latent output variables to capture the information that would otherwise be lost. Due to invertibility, a model of the corresponding inverse process is learned implicitly. NoiseFlow [10] modeled the distribution of real noise in the ISP imaging process to augment the training data. InvDn [9] extended the idea of IRN [12] to real noise removal; it discards the hybrid high-frequency information of the image that contains the noise and exploits extra latent variable sampling to approximate the lost high-frequency information. However, a key factor is ignored: noise is case-specific, which implies the corresponding high-frequency information is also case-specific. More recently, FDN [37] directly disentangled the noise in the latent space with normalizing flows, which leads to a more complex network. FINO [38] introduced dual flow models to bridge a bijective mapping between noisy and clean image pairs.

In contrast, we aim to solve image denoising by modeling a reliable bijective mapping with a single model only. We empirically show that integrating the idea of disentangled learning into the flow-based framework results in fast and stable training as well as good performance on generalized image denoising tasks.

III Methodology

III-A Preliminaries

A typical INN models a generative process with a known distribution through a sequence of differentiable, invertible mappings. Formally, let $x_{0}\in\mathbb{R}^{D}$ be a random variable with a known and tractable probability density function $p_{\mathcal{X}_{0}}:\mathbb{R}^{D}\rightarrow\mathbb{R}$, and let $x_{1},\dots,x_{N}$ be a sequence of random variables such that $x_{i}=f_{i}(x_{i-1})$, where $f_{i}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{D}$ is a differentiable, bijective function. Then, if $y=f(x_{0})=f_{N}\circ f_{N-1}\circ\cdots\circ f_{1}(x_{0})$, the change of variables formula says that the probability density function of $y$ is

p(y)=p_{\mathcal{X}_{0}}(g(y))\prod_{j=1}^{N}|\det\mathbf{J}_{j}(g(y))|^{-1} \qquad (1)

where $g=g_{1}\circ\cdots\circ g_{N-1}\circ g_{N}$ is the inverse of $f$, and $\mathbf{J}_{j}=\partial f_{j}/\partial x_{j-1}$ is the Jacobian of the $j$-th transformation $f_{j}$ with respect to its input $x_{j-1}$ (i.e., the output of $f_{j-1}$).

Owing to such flexible access to the inverse mapping, the INN architecture can be used for variational inference [39, 40] and for representation learning without any information loss.
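To make Eq. (1) concrete, the minimal Python (PyTorch-style) sketch below evaluates $\log p(y)$ for a stack of invertible layers; the layer interface (an inverse method returning the pre-image and the forward log-determinant) and the base density function are illustrative assumptions, not part of the original formulation.

import torch

def log_prob_y(y, layers, base_log_prob):
    # layers: invertible mappings f_1..f_N, each exposing
    # inverse(x) -> (x_prev, log|det J|) of its forward transform
    log_det_sum = 0.0
    x = y
    for layer in reversed(layers):          # apply g = g_1 ∘ ... ∘ g_N to y
        x, log_det = layer.inverse(x)
        log_det_sum = log_det_sum + log_det
    # log p(y) = log p_X0(g(y)) - sum_j log|det J_j(g(y))|
    return base_log_prob(x) - log_det_sum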

An INN is composed of basic invertible blocks [41]. For the $l$-th block, the input $\textbf{u}$ is split into $\textbf{u}_{1}$ and $\textbf{u}_{2}$ along the channel axis, and the typical additive transformation and the corresponding inverse transformation are formulated as:

\left\{\begin{aligned} \textbf{v}_{1}&=\textbf{u}_{1}+\phi(\textbf{u}_{2})\\ \textbf{v}_{2}&=\textbf{u}_{2}+\eta(\textbf{v}_{1})\end{aligned}\right. \Leftrightarrow \left\{\begin{aligned} \textbf{u}_{2}&=\textbf{v}_{2}-\eta(\textbf{v}_{1})\\ \textbf{u}_{1}&=\textbf{v}_{1}-\phi(\textbf{u}_{2})\end{aligned}\right. \qquad (2)

where $\phi$ and $\eta$ are arbitrary neural networks. The output of a single block is $[\textbf{v}_{1},\textbf{v}_{2}]$.

To enhance the transformation ability of the identity branch, Eq. (2) is always augmented as:

\left\{\begin{aligned} \textbf{v}_{1}&=\textbf{u}_{1}\odot\exp(\psi(\textbf{u}_{2}))+\phi(\textbf{u}_{2})\\ \textbf{v}_{2}&=\textbf{u}_{2}\odot\exp(\rho(\textbf{v}_{1}))+\eta(\textbf{v}_{1})\\ \textbf{u}_{2}&=(\textbf{v}_{2}-\eta(\textbf{v}_{1}))\odot\exp(-\rho(\textbf{v}_{1}))\\ \textbf{u}_{1}&=(\textbf{v}_{1}-\phi(\textbf{u}_{2}))\odot\exp(-\psi(\textbf{u}_{2}))\end{aligned}\right. \qquad (3)

where $\psi(\cdot)$, $\phi(\cdot)$ and $\eta(\cdot)$ denote arbitrary transformation functions [41]. The function $\rho(\cdot)$ is further followed by a centered sigmoid function and a scale term to prevent numerical explosion caused by the $\exp(\cdot)$ function.
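As an illustration of how such a block can be implemented and inverted, the sketch below combines the additive transform of Eq. (2) for the first branch with the enhanced affine transform of Eq. (3) for the second branch, as used later in Sec. III-C1; the sub-network wiring, the clamp value and the class interface are assumptions for exposition rather than the released implementation.

import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    def __init__(self, phi, eta, rho, clamp=2.0):
        super().__init__()
        self.phi, self.eta, self.rho = phi, eta, rho   # arbitrary sub-networks
        self.clamp = clamp                              # scale term after the sigmoid

    def scale(self, t):
        # centered sigmoid keeps the exp() argument bounded, avoiding numerical explosion
        return torch.exp(self.clamp * (torch.sigmoid(self.rho(t)) - 0.5))

    def forward(self, u1, u2):
        v1 = u1 + self.phi(u2)                          # additive branch, Eq. (2)
        v2 = u2 * self.scale(v1) + self.eta(v1)         # enhanced affine branch, Eq. (3)
        return v1, v2

    def inverse(self, v1, v2):
        u2 = (v2 - self.eta(v1)) / self.scale(v1)
        u1 = v1 - self.phi(u2)
        return u1, u2

Note that dividing by the same clamped exponential in inverse() exactly undoes the forward scaling, so the block stays bijective regardless of how phi, eta and rho are parameterized.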

III-B Problem Specification

We rethink the image denoising task from the perspective of invertible transformation. Suppose the original noisy observation is $y$, its clean image is $x$, and the noise is $n$. We have

p(y)=p(x,n)=p(x)p(n|x) \qquad (4)

where the distribution of the noisy image $p(y)$ is a joint distribution over $x$ and $n$. Directly splitting the noise component from $y$ is difficult due to the unknown $p(n)$. In addition, $p(n|x)$ implies that the noise is specific to the image content. A general observation is that noise tends to appear in the high-frequency component of the image. Therefore, we employ a wavelet transformation to decompose the noisy image $y$ into low and high-frequency components, denoted as $y_{\text{LF}}$ and $y_{\text{HF}}$ respectively. Then, Eq. (4) is reformulated as

p(y)=p(y_{\text{LF}},y_{\text{HF}})=p(y_{\text{LF}})p(y_{\text{HF}}|y_{\text{LF}}) \qquad (5)

Ideally, the wavelet decomposition is orthogonal and yields a dense representation without redundancies, so Eq. (5) can be rewritten as $p(y)=p(y_{\text{LF}},y_{\text{HF}})=p(y_{\text{LF}})p(y_{\text{HF}})$.

Our goal is to disentangle the noise representation from the hybrid high-frequency component, which means that the low-frequency part $y_{\text{LF}}$ should be approximately noiseless and the noise should only be contained in $p(y_{\text{HF}})$. Thanks to the information-lossless characteristic of INNs, this can be enforced readily. Thus, we focus on how to split the noise from $p(y_{\text{HF}})$ such that

p(y_{\text{HF}})=p(y_{\text{hF}},y_{\text{n}})=p(y_{\text{hF}})p(y_{\text{n}}|y_{\text{hF}}) \qquad (6)

where $y_{\text{hF}}$ denotes the noise-free high-frequency part, and $y_{\text{n}}$ is the case-specific noise.

Figure 2: Illustration of invertible image denoising by disentangled representation. In the forward procedure, the noisy image $y$ is first decomposed into hybrid high frequencies $y_{\text{HF}}$ and noisy low frequencies $y_{\text{LF}}$ with the Discrete Wavelet Transform (DWT), which are fed into a parameterized invertible neural network $f_{\theta}(\cdot)$ and transformed into a clean low-frequency part $\hat{y}_{\text{LF}}$ and a high-frequency part $y_{\text{HF}}$ obeying a specific distribution in the latent space, where the case-specific high-frequency part $y_{\text{hF}}$ and the noise $y_{\text{n}}$ are disentangled. Image denoising is achieved through the inverse function $f^{-1}_{\theta}$ and the inverse DWT with the disentangled $y_{\text{hF}}$ and clean $\hat{y}_{\text{LF}}$.

In general flow-based models, e.g., IRN [12] and InvDn [9], a latent variable $z$ is introduced to approximate the case-specific $y_{\text{hF}}$; $z$ is transformed to be case-agnostic with a specified distribution and is used to model the high-frequency information lost during degradation. For real image denoising, however, $y_{\text{n}}$ and $y_{\text{hF}}$ are always case-specific and entangled. Therefore, it is difficult to model the lost $y_{\text{hF}}$ accurately with a case-agnostic variable without any conditional priors. Thus, we attempt to solve this problem in a simpler manner, i.e., by disentangling the noise part from $y_{\text{HF}}$ in the latent space. In this way, the case-specific $y_{\text{hF}}$ and $y_{\text{n}}$ can be split well.

Directly disentangling the hybrid $y_{\text{HF}}$ is difficult due to the unknown $p(y_{\text{n}})$. To this end, we assume that $y_{\text{hF}}$ and $y_{\text{n}}$ are independent, so that Eq. (6) is reformulated as

p(y_{\text{HF}})=p(y_{\text{hF}},y_{\text{n}})=p(y_{\text{hF}})p(y_{\text{n}}) \qquad (7)

To satisfy this assumption, we directly enforce $p(y_{\text{HF}})$ to obey a pre-specified distribution, e.g., an isotropic Gaussian, which makes $p(y_{\text{HF}})$ compact and explainable; a key characteristic is that the feature vectors in $y_{\text{HF}}$ are independent of each other. We then disentangle the high-frequency representation $y_{\text{HF}}$ along a specific dimension to split $y_{\text{hF}}$ and $y_{\text{n}}$. To reconstruct the clean observation, we explicitly remove $y_{\text{n}}$ and only preserve the case-specific $p(y_{\text{hF}})$, so that $p(y_{\text{HF}})=p(y_{\text{hF}})p(y_{\text{n}})=p(y_{\text{hF}})\cdot 1$. Benefiting from the invertible characteristic of INNs, image denoising can then be achieved by the inverse pass using Eq. (1).

III-C Model Architecture

The sketch of the disentangling framework is illustrated in Fig. 2. It contains two processes, i.e., forward decomposition and inverse reconstruction, and can easily be injected into a general invertible framework.

III-C1 Forward Decomposition

The key to our approach is to decompose the approximately noiseless low-frequency and case-specific high-frequency representations during the forward pass. To achieve this, we first exploit the Discrete Wavelet Transform (DWT) to decompose the image into low and high-frequency parts, where the Haar transform [42] is used for simplicity. The Haar transform explicitly decomposes the input (i.e., images or a group of feature maps) into an approximate low-pass representation and high-frequency coefficients in three directions. More concretely, it transforms an input of shape $(H\times W\times C)$ into a tensor of shape $(\frac{1}{2}H\times\frac{1}{2}W\times 4C)$. The first $C$ slices of the output tensor are effectively produced by average pooling, which is approximately a low-pass representation equivalent to bilinear interpolation down-sampling. The remaining three groups of $C$ slices contain residual components in the vertical, horizontal and diagonal directions, which constitute the high-frequency information of the original input. By such a transformation, the low and high-frequency information are explicitly separated.
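For concreteness, a single-level Haar transform of this kind can be sketched as follows in Python; the normalization (a factor of 1/2 per band) and the band ordering are one common convention and may differ slightly from the exact layer used in the implementation.

import torch

def haar_dwt(x):
    # x: (N, C, H, W) -> (N, 4C, H/2, W/2); first C channels are the low-pass band
    a = x[:, :, 0::2, 0::2]   # top-left pixel of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2  # low-pass band (average pooling up to a constant)
    hl = (a - b + c - d) / 2  # detail band, first orientation
    lh = (a + b - c - d) / 2  # detail band, second orientation
    hh = (a - b - c + d) / 2  # diagonal detail band
    return torch.cat([ll, hl, lh, hh], dim=1)

def haar_idwt(y):
    # exact inverse of haar_dwt, recovering the original resolution
    ll, hl, lh, hh = torch.chunk(y, 4, dim=1)
    a = (ll + hl + lh + hh) / 2
    b = (ll - hl + lh - hh) / 2
    c = (ll + hl - lh - hh) / 2
    d = (ll - hl - lh + hh) / 2
    out = torch.zeros(ll.shape[0], ll.shape[1], ll.shape[2] * 2, ll.shape[3] * 2,
                      device=y.device, dtype=y.dtype)
    out[:, :, 0::2, 0::2] = a
    out[:, :, 0::2, 1::2] = b
    out[:, :, 1::2, 0::2] = c
    out[:, :, 1::2, 1::2] = d
    return out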

Then, an invertible network module is used to further refine $y_{\text{LF}}$ and $y_{\text{HF}}$, where we leverage the coupling-layer architecture of [43, 41], i.e., Eqs. (2) and (3). Our goal is to polish the low and high-frequency inputs to obtain a suitable low-frequency representation and an independent, properly distributed high-frequency representation. Therefore, we match $y_{\text{LF}}$ and $y_{\text{HF}}$ respectively to the splits $\textbf{u}_{1}$ and $\textbf{u}_{2}$ in Eq. (2). Furthermore, to increase the model capacity, we employ the additive transformation (Eq. (2)) for the low-frequency part $\textbf{u}_{1}$, and the enhanced affine transformation (Eq. (3)) for the high-frequency part $\textbf{u}_{2}$.

After the transformation, a noise-free low-frequency representation $\hat{y}_{\text{LF}}$ is expected. However, in practice noise also appears in $\hat{y}_{\text{LF}}$ due to non-ideal decomposition. Inspired by the Nyquist-Shannon sampling theorem, which implies that the information lost when down-sampling a clean image amounts to its high-frequency content, we utilize the bicubic method [44] to guide the decomposition of $\hat{y}_{\text{LF}}$. Let $x_{guide}$ be the down-sampled clean image $x$ corresponding to $\hat{y}_{\text{LF}}$, produced by the bicubic method. To generate the clean low-frequency representation, we drive $\hat{y}_{\text{LF}}$ to resemble $x_{guide}$:

L_{guide}:=\ell_{\mathcal{X}}(x_{guide},\hat{y}_{\text{LF}}) \qquad (8)

where $\ell_{\mathcal{X}}$ is a difference metric on $\mathcal{X}$, i.e., the $L_{2}$ loss. Benefiting from the information-lossless characteristic of the invertible architecture, our low-frequency guidance loss drives the noise to appear only in $y_{\text{HF}}$.
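A minimal sketch of this guidance loss, assuming a bicubic interpolation utility and an $L_{2}$ metric, is given below; the function and argument names are illustrative.

import torch.nn.functional as F

def guidance_loss(y_lf_hat, x_clean, scale=2):
    # bicubic down-sampling of the clean image serves as x_guide (Eq. (8))
    x_guide = F.interpolate(x_clean, scale_factor=1.0 / scale,
                            mode='bicubic', align_corners=False)
    return F.mse_loss(y_lf_hat, x_guide)   # L2 difference metric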

Another goal during the forward pass is to disentangle noise from the high-frequency part $y_{\text{HF}}$. To this end, we enforce $y_{\text{HF}}$ to obey a specific distribution, i.e., $p_{y_{\text{HF}}}\sim\mathcal{N}(\textbf{0},I_{K})$, so that it can be disentangled along a specific dimension. A KL divergence loss is used as the distribution metric, where

D_{KL}=\int p(z)\log\left(\frac{p(z)}{q(z)}\right)\,\text{d}z \qquad (9)

which is referred to as our distribution guidance loss $L_{dist}$.
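Evaluating Eq. (9) exactly requires a density model of $y_{\text{HF}}$; a common practical surrogate, shown below as an assumption rather than the exact training objective, treats each activation as a sample from the target standard normal and penalizes its negative log-likelihood.

import math

def dist_loss(y_hf):
    # average negative log-likelihood under N(0, I); constant kept for clarity
    return 0.5 * (y_hf ** 2).mean() + 0.5 * math.log(2 * math.pi)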

III-C2 Inverse Reconstruction

Considering that the feature vectors in the transformed $y_{\text{HF}}$ are independent, we directly split it along the channel axis into a specific high-frequency part $y_{\text{hF}}$ and noise $y_{\text{n}}$, i.e., $y_{\text{HF}}=[y_{\text{hF}},y_{\text{n}}]$, where both $y_{\text{hF}}$ and $y_{\text{n}}$ are case-specific. To reconstruct the clean image $x$ in the flow-based architecture, we need to construct a dimension-consistent, noise-free high-frequency representation $\hat{y}_{\text{HF}}$. To do so, we simply replace $y_{\text{n}}$ with $\textbf{0}$ in the reconstructed high-frequency representation, i.e., $\hat{y}_{\text{HF}}=[y_{\text{hF}},\textbf{0}]$, which lies in a subspace of $y_{\text{HF}}$. Further, to guide the disentanglement of the noise component, we minimize the reconstruction loss $L_{recon}$ with the corresponding clean image $x$:

L_{recon}:=\ell_{\mathcal{X}}(x,\text{iDWT}(f^{-1}_{\theta}(\hat{y}_{\text{LF}},\hat{y}_{\text{HF}}))) \qquad (10)

where $\ell_{\mathcal{X}}$ measures the difference between the clean image and the reconstructed one, i.e., the $L_{1}$ loss, iDWT denotes the inverse DWT, and $f^{-1}_{\theta}$ denotes the inverse pass of the parameterized invertible neural network.
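The inverse reconstruction step can be sketched as follows, assuming an INN object exposing an inverse pass and the inverse Haar transform sketched earlier; the channel split ratio corresponds to DIM($y_{\text{n}}$) discussed later in Sec. IV-B3, and all names are illustrative.

import torch
import torch.nn.functional as F

def denoise(inn, y_lf_hat, y_hf, n_noise_channels):
    # split the latent high-frequency representation into [y_hF, y_n]
    y_h, y_n = torch.split(
        y_hf, [y_hf.shape[1] - n_noise_channels, n_noise_channels], dim=1)
    y_hf_clean = torch.cat([y_h, torch.zeros_like(y_n)], dim=1)   # [y_hF, 0]
    lf, hf = inn.inverse(y_lf_hat, y_hf_clean)    # inverse pass of f_theta
    return haar_idwt(torch.cat([lf, hf], dim=1))  # back to the image domain

def recon_loss(x_clean, x_denoised):
    return F.l1_loss(x_denoised, x_clean)         # L1 reconstruction loss, Eq. (10)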

In addition, implicit noisy image self-reconstruction can also be achieved by injecting the case-specific $y_{\text{n}}$ back into $y_{\text{HF}}$; it does not rely on any extra self-supervised constraints. This implies that there exists an implicit bijective mapping between the noisy and clean image pairs.

Figure 3: Illustration of our hierarchical disentangling framework. Our approach exploits three invertible modules to transform the noisy image $y$ into clean low frequencies and specific high frequencies at different scales, i.e., $L=3$. Each invertible module is composed of a DWT layer and stacked InvBlocks.

III-D Hierarchical Disentangled Representation

Furthermore, we are inspired by the observation that most image information is located in the low-frequency part and that high and low frequencies are not separated exactly due to the non-ideal transform. Thus, it is necessary to extract sufficient low-frequency information while disentangling fine-grained high-frequency signals. Based on this analysis, we construct the multi-level decomposition framework shown in Fig. 3.

During forward propagation, we stack multiple invertible modules to recursively decompose the high-frequency representation $y_{\text{HF}}$ into low and high-frequency parts, which leads to multiscale low-frequency outputs $\{\hat{y}^{l}_{\text{LF}}\}^{L}_{l=1}$. Therefore, the low-frequency guidance loss $L_{guide}$ is rewritten as

L_{guide}:=\sum_{l=1}^{L}\ell_{\mathcal{X}}(x^{l}_{guide},\hat{y}^{l}_{\text{LF}}) \qquad (11)

where $L$ is the number of decomposition levels, $\hat{y}^{l}_{\text{LF}}$ represents the decomposed low-frequency output at the $l$-th level, and the corresponding multiscale guidance image $x^{l}_{guide}$ is generated by down-sampling $x$ to the $1/2^{l}$ scale with the bicubic method.

For the high-frequency part, we only minimize the KL divergence loss for the last output $y^{L}_{\text{HF}}$ and disentangle the noise component from it. In our experiments, we set $L=3$ at most for generalized image denoising tasks.

During inverse propagation, we reconstruct the clean high frequencies $\hat{y}^{l}_{\text{HF}}$ level by level, from level $L$ down to level $1$. For the $l$-th level, this leads to an implicit conditional flow model, i.e., $p(\hat{y}^{l-1}_{\text{HF}}|y^{l}_{\text{HF}},\hat{y}^{l}_{\text{LF}})$, which can be implemented as $\hat{y}^{l-1}_{\text{HF}}=f^{-1}_{\theta}(y^{l}_{\text{HF}},\hat{y}^{l}_{\text{LF}})$. Therefore, the case-specific high-frequency signal is generated in a coarse-to-fine manner, which is efficient and stable.
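The hierarchical forward decomposition and coarse-to-fine inverse reconstruction can be summarized with the sketch below, where the per-level module interface and the zero_noise helper (which replaces the disentangled noise channels with zeros) are assumptions for illustration.

def forward_decompose(modules, y):
    # each module bundles a DWT layer and stacked InvBlocks (Fig. 3)
    lows, hf = [], y
    for m in modules:
        lf, hf = m.forward(hf)          # split into low/high-frequency parts
        lows.append(lf)
    return lows, hf                     # {y_LF^l}_{l=1..L} and y_HF^L

def inverse_reconstruct(modules, lows, hf_last, zero_noise):
    hf = zero_noise(hf_last)            # drop the disentangled noise part at level L
    for m, lf in zip(reversed(modules), reversed(lows)):
        hf = m.inverse(lf, hf)          # y_HF^{l-1} = f_theta^{-1}(y_HF^l, y_LF^l)
    return hf                           # reconstructed clean image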

We optimize the whole architecture by minimizing the compact loss $L_{total}$, a combination of the reconstruction loss $L_{recon}$, the low-frequency guidance loss $L_{guide}$ and the distribution loss $L_{dist}$:

L_{total}:=\lambda_{1}L_{recon}+\lambda_{2}L_{guide}+\lambda_{3}L_{dist} \qquad (12)

where $\lambda_{1},\lambda_{2}$ and $\lambda_{3}$ are coefficients for balancing the different loss terms.
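For reference, with the balancing weights reported later in Sec. IV-A2 ($\lambda_{1}=1$, $\lambda_{2}=4$, $\lambda_{3}=1$), the objective is simply a weighted sum:

def total_loss(l_recon, l_guide, l_dist, lam1=1.0, lam2=4.0, lam3=1.0):
    # Eq. (12); the default weights follow the experimental setting in Sec. IV-A2
    return lam1 * l_recon + lam2 * l_guide + lam3 * l_dist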

Figure 4: Online PSNR curves on the SIDD validation set over $300k$ iterations.
TABLE I: The architecture configurations.
Metric | Single-level, L=2 | Single-level, L=3 | Multi-level (Ours), L=2 | Multi-level (Ours), L=3
PSNR | 39.358 | 39.499 | 39.364 | 39.468
SSIM | 0.908 | 0.910 | 0.909 | 0.910
#Param (M) | 4.419 | 12.554 | 4.021 | 9.092
TABLE II: The disentangled dimension configurations for the model with B=8 and L=2.
DIM($y_{\text{n}}$) (fraction of DIM($y_{\text{HF}}$)) | 4/5 | 3/5 | 2/5 | 1/5
PSNR | 39.348 | 39.352 | 39.403 | 39.364
SSIM | 0.9086 | 0.9085 | 0.9086 | 0.9086

IV Experiment

IV-A Experimental Setting

IV-A1 Datasets

To validate the effectiveness of our method, two representative real image noise datasets, the Smartphone Image Denoising Dataset (SIDD) [45] and the Darmstadt Noise Dataset (DND) [46], are utilized. The SIDD is captured by five smartphone cameras with small apertures and sensor sizes from 10 scenes under varied lighting conditions. Ground truth images are generated through a systematic procedure. We use the medium version of SIDD as the training set, which contains 320 clean-noisy pairs for training and 1280 patches cropped from the other 40 pairs for validation. The reported test results are obtained via an online submission system. The DND is captured by four consumer-grade cameras of different sensor sizes. It contains 50 pairs of real-world noisy and approximately noise-free images. These images are cropped into 1000 patches of size $512\times 512$. Similarly, the performance is evaluated by submitting the results to the online system. Since DND does not provide any training data, we train on the combination of the SIDD training set and Renoir [47]. Results are submitted to the DND benchmark using the same model that provides the best validation performance on the SIDD benchmark.

Figure 5: Visualization analysis of the invertible bijective mapping. The top row shows the results of invertible image denoising: (a) is the input noisy image; (b), (c), and (d) are the decomposed low-frequency outputs from levels $L=1,2,3$, respectively; (e) is the inverse denoised output; and (f) is the self-reconstructed noisy image. The second row visualizes the high-frequency feature maps at $L=1$ during forward propagation. The bottom row visualizes the disentangled high-frequency feature maps at $L=1$ during inverse propagation.

IV-A2 Implementations

The proposed method cascades at most three invertible modules; each module contains a Haar transform layer and 8 basic invertible blocks, i.e., $L=3, B=8$. For the single InvBlock (as shown in Fig. 2), we implement the non-linear functions $\phi$, $\rho$, and $\eta$ with the DenseNet block (DB) [48]. All the models are trained with Adam as the optimizer, with momentum $\beta_{1}=0.9, \beta_{2}=0.999$. The batch size is set to 16 with $144\times 144$ patches, and the initial learning rate is fixed at $2\times 10^{-4}$, which decays by half every $100k$ iterations. The training is performed on a single Tesla P100 GPU. We augment the training data with horizontal and vertical flipping, as well as random rotations. For the loss functions, we set $\lambda_{1}=1, \lambda_{2}=4$ and $\lambda_{3}=1$ for the different loss terms in all experiments. In addition, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) are used to evaluate the performance of all methods in all experiments.
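A minimal training-loop sketch matching this setup is given below; model, train_loader and compute_losses are placeholders standing in for the actual network, data pipeline and the losses of Eqs. (8)-(12), and stepping the scheduler once per iteration is how the per-100k-iteration decay is expressed here.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)

for noisy, clean in train_loader:            # batches of 16 patches of size 144x144
    l_recon, l_guide, l_dist = compute_losses(model, noisy, clean)
    loss = total_loss(l_recon, l_guide, l_dist)   # Eq. (12) with the weights above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                          # halves the learning rate every 100k steps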

IV-B Ablation Study

We mainly explore three major determinants of our model: a) model capacity, which depends on the number of decomposition levels and the number of invertible blocks in a single invertible module; b) effectiveness of the architecture, including the image decomposition type and the disentangling strategy of the framework; and c) the disentangled representation. All the experiments are performed on the SIDD validation set with $300k$ iterations.

IV-B1 Model capability

We first study two key factors affecting the model capability, i.e., multi-level decomposition and the capacity of a single invertible module. As shown in Fig. 4-(a) and (b), we observe that increasing the number of decomposition levels leads to significant performance gains. In addition, with fixed decomposition levels (e.g., $L=2$), stacking more InvBlocks in a single invertible module further enhances the ability of the model, but it also brings more parameters. Therefore, we set $B=8$ and $L=3$ at most in our architecture configurations.

IV-B2 Architecture Designing

We also consider a different architecture, i.e., decomposing the low and high-frequency parts at the last level only instead of at each level, which is similar to InvDn [9]. The quantitative results are presented in Tab. I; our multi-level decomposition architecture with high-frequency disentangling achieves a better trade-off between the performance and the complexity of the model.

IV-B3 Disentangled Representation

We further explore the effect of the disentangled dimensions in $y_{\text{HF}}$. We split $y_{\text{n}}$ from $y_{\text{HF}}$ with different numbers of dimensions along the channel axis. All models are trained separately. Tab. II gives the detailed results; the best model is achieved by setting $\text{DIM}(y_{\text{n}})=2/5\cdot\text{DIM}(y_{\text{HF}})$. We use this setting in our final model configurations.

IV-B4 Analysis for Bijective Mapping

We further explore the bijective relationship in our architecture. As shown in Fig. 5, we first show the decomposed low-frequency outputs from the multi-level decomposition during forward propagation, the denoised output, and the self-reconstructed noisy input obtained by implicit inverse reconstruction. Our method only splits the case-specific noise from the hybrid high-frequency component in the latent space to achieve image denoising, which bridges the bijective transformation between noisy image generation and restoration. Moreover, we visualize the decomposed hybrid high-frequency components and the disentangled noise-free high-frequency components. It is obvious that our method can remove case-specific noise signals in the latent space while retaining fine high frequencies, which demonstrates its effectiveness.

[Figure 6 panels, SIDD crops (PSNR/SSIM): Noisy 17.56/0.1110 and 18.17/0.1205; DANet [7] 34.19/0.7965 and 35.08/0.8442; InvDn [9] 34.00/0.7857 and 34.74/0.8365; MAXIM [35] 34.80/0.8156 and 35.68/0.8564; Ours(L=2) 34.47/0.8067 and 35.14/0.8458; Reference]
[Figure 6 panels, DND crop (PSNR/SSIM): Noisy; CBDNet [29] 31.40/0.8364; VDN [30] 34.08/0.9166; DANet [7] 33.66/0.9148; InvDn [9] 34.22/0.9216; Ours(L=3) 34.28/0.9224]
Figure 6: Visualized denoising samples from the SIDD and DND datasets. The first and second rows are from the SIDD, and the bottom row is from the DND.
TABLE III: Comprehensive comparisons with other competing methods.
Method, #Param (M), MACs (G), SIDD [45] PSNR / SSIM, DND [46] PSNR / SSIM
DnCNN [2] 0.67 44.02 37.73 0.941 37.90 0.9430
TNRD [4] 24.73 0.643 33.65 0.8306
BM3D [24] 25.65 0.685 33.51 0.8507
CBDNet [29] 4.34 33.28 0.868 38.06 0.9421
RIDNet [49] 1.50 98.26 39.26 0.8306
GradNet [50] 1.60 38.34 0.946 39.44 0.9543
AINDNet [31] 13.76 39.08 0.953 39.53 0.9561
VDN [30] 7.82 49.5 39.28 0.909 39.38 0.9518
MIRNet [51] 31.78 816.0 39.72 0.959 39.88 0.9563
SADNet [52] 4.23 18.97 39.59 0.9523
MPRNet [53] 15.7 1394.0 39.71 0.958 39.80 0.9540
MAXIM [35] 22.20 339.0 39.96 0.960 39.84 0.9540
DANet [7] 63.01 14.85 39.25 0.955 39.47 0.9548
DANet++ [7] 63.01 14.85 39.43 0.956 39.58 0.9545
InvDn [9] 2.64 47.96 39.28 0.955 39.57 0.9522
FDN [37] 4.38 77.64 39.31 0.955
FINO [38] 39.40 0.957
Ours+RB+L2 2.47 46.54 39.31 0.956
Ours+RB+L3 5.26 52.26 39.40 0.957
Ours+DB+L2 4.02 74.02 39.39 0.956 39.54 0.9520
Ours+DB+L3 9.09 84.40 39.48 0.957 39.60 0.9536

IV-C Real Image Noise Removal

We perform comprehensive comparisons, including model parameters, computational complexity (MACs), and quantitative metrics, on two real denoising benchmarks, i.e., SIDD [45] and DND [46]. Note that "MACs" is counted on a single RGB image of $256\times 256$ size. Moreover, we select two kinds of representative methods for comparison: one is deterministic mapping-based methods, including CNN-based and Transformer-based methods; the other is bijective mapping-based methods, e.g., flow-based and GAN-based methods. In addition, to demonstrate the effectiveness of our method, we further explore lightweight architecture configurations, e.g., using lighter InvBlocks built with the Residual Block (RB) [54] and different decomposition levels.

Tab. III lists the detailed results. For the deterministic mapping-based methods, there is a clear tendency that larger models bring better performance, e.g., MIRNet and MPRNet, but they also lead to expensive computational costs. Moreover, MAXIM introduces an MLP-style Transformer and achieves significant performance gains on the SIDD test set, but it also incurs heavy computation. In contrast, bijective mapping-based methods, e.g., DANet and InvDn, reverse this tendency and achieve a better trade-off between performance and computational complexity.

Our method further balances performance and computational cost on the SIDD test set, where stacking only two lightweight invertible modules (denoted "RB+L2") already achieves better results than general CNN-based methods and bijective mapping-based methods. Meanwhile, the performance can be improved further by extending to a 3-level decomposition (denoted "RB+L3"). Compared with DANet and InvDn, the PSNRs of our lightweight models increase by $0.1\sim 0.2$ dB on the SIDD test set. Replacing the lightweight residual blocks with dense blocks (DB) in a single InvBlock, our method (denoted "DB+L2" and "DB+L3") achieves consistent performance gains on the two real denoising benchmarks. In addition, our method is more flexible and friendly to mobile applications, while achieving results comparable to state-of-the-art methods, e.g., MIRNet, MPRNet, and MAXIM. Note that the performance of our approach could be improved further with more effective invertible architectures, e.g., invertible attention networks [55], but this is beyond the scope of this paper.

TABLE IV: PSNR | SSIM | PSNR-B value comparisons. The best and the second best results are boldfaced and underlined.
Dataset QF SADCT [56] LD [57] PCA [58] ARCNN [59] TNRD [4] DnCNN [2]
Classic5 10 28.88 0.807 28.16 28.39 0.780 27.59 29.32 0.800 29.08 29.03 0.793 28.76 29.28 0.799 29.04 29.40 0.803 29.13
20 30.92 0.866 29.75 30.30 0.858 29.37 31.56 0.858 31.12 31.15 0.852 30.59 31.47 0.858 31.05 31.63 0.861 31.19
30 32.14 0.891 30.83 31.47 0.833 30.17 32.86 0.884 32.31 32.51 0.881 31.98 32.78 0.884 32.24 32.91 0.886 32.38
LIVE1 10 28.65 0.809 28.01 28.26 0.805 27.68 29.01 0.809 28.83 28.96 0.808 28.77 29.14 0.811 28.88 29.19 0.812 28.90
20 30.81 0.878 29.82 30.19 0.872 29.64 31.28 0.875 30.72 31.29 0.873 30.79 31.46 0.877 31.04 31.59 0.880 31.07
30 32.08 0.908 30.92 29.41 0.896 29.36 32.62 0.903 32.18 32.67 0.904 32.22 32.84 0.906 32.28 32.98 0.909 32.34
BSD500 10 28.23 0.778 27.38 28.03 0.782 27.29 28.64 0.779 28.01 28.56 0.791 27.87 28.60 0.793 27.95 28.84 0.801 28.44
20 30.09 0.851 28.61 29.82 0.851 28.43 30.73 0.851 29.42 30.43 0.859 29.10 30.51 0.861 29.34 31.05 0.874 30.29
30 31.21 0.884 29.34 30.87 0.872 29.15 31.99 0.884 30.84 31.52 0.890 29.92 31.58 0.890 30.02 32.36 0.905 31.43
Twitter 27.61 0.728 27.53 27.58 0.727 27.49 27.71 0.730 27.66 27.54 0.730 27.49 27.60 0.727 27.52 27.63 0.729 27.54
WeChat 29.60 0.780 29.59 29.48 0.796 29.47 29.63 0.799 29.60 29.30 0.799 29.29 26.64 0.791 26.63 29.57 0.798 29.57
Dataset QF LIPIO [60] M-Net [61] DCSC [62] RNAN [63] Ours(L=2) Ours(L=3)
Classic5 10 29.35 0.802 29.04 29.69 0.811 29.31 29.62 0.827 29.30 29.87 0.828 29.42 29.51 0.816 29.47 29.53 0.815 29.50
20 31.58 0.857 31.12 31.90 0.866 31.29 31.81 0.880 31.34 32.11 0.869 32.16 31.57 0.870 31.52 31.64 0.871 31.59
30 32.86 0.884 32.28 32.97 0.888 32.49 33.06 0.903 32.49 33.38 0.892 32.35 32.70 0.894 32.62 32.81 0.894 32.72
LIVE1 10 29.17 0.812 28.89 29.45 0.819 29.04 29.34 0.832 29.01 29.63 0.824 29.13 29.36 0.824 29.32 29.37 0.823 29.33
20 31.52 0.877 31.07 31.83 0.885 31.14 31.70 0.896 31.18 32.03 0.888 31.12 31.64 0.889 31.57 31.65 0.889 31.57
30 32.99 0.907 32.31 33.07 0.911 32.47 33.07 0.922 32.43 33.45 0.915 32.22 32.94 0.915 32.83 32.95 0.916 32.83
BSD500 10 28.81 0.782 28.39 28.96 0.804 28.56 28.95 0.805 28.55 29.08 0.805 28.48 28.97 0.796 28.92 28.97 0.795 28.93
20 30.92 0.855 30.07 31.05 0.874 30.36 31.13 0.876 30.41 31.25 0.875 30.27 31.10 0.868 31.01 31.11 0.868 31.02
30 32.31 0.887 31.27 32.61 0.907 31.15 32.42 0.906 31.52 32.70 0.907 31.33 32.34 0.899 32.22 32.35 0.899 32.22
Twitter 27.47 0.733 27.41 27.98 0.744 27.87 27.63 0.731 27.43 27.43 0.718 27.42 31.04 0.794 30.82 31.09 0.795 30.90
WeChat 28.90 0.800 28.90 29.82 0.807 29.82 29.58 0.800 29.58 29.56 0.800 29.56 32.32 0.823 32.07 32.30 0.823 32.07
[Figure 7 panels (PSNR-B/SSIM), example in "BSD500": JPEG 29.43/0.8595; LD [57] 31.34/0.8963; PCA [58] 32.90/0.9173; ARCNN [59] 33.12/0.9148; DnCNN [2] 33.67/0.9223; LIPIO [60] 32.80/0.9139; DCSC [62] 33.55/0.9314; Ours(L=2) 34.12/0.9327]
Figure 7: Visualized comparisons on synthetic datasets with JPEG QF=10. Red rectangle denotes the zoomed ROI area.

Visualized comparisons are shown in Fig. 6. It is clear that the other bijective mapping-based methods (e.g., DANet and InvDn) remove noise well but also bring over-smoothing effects, e.g., blurred edges and local structures. The Transformer-based MAXIM alleviates this problem by capturing non-local high-frequency patterns with global self-attention. Instead, our method only removes the noise appearing in the high-frequency textures of degraded images, without depending on any non-local patterns or image priors, while recovering finer details than the others.

[Figure 8 panels (PSNR-B/SSIM), example in "Twitter": Compressed 25.57/0.7280; PCA [58] 25.60/0.7319; LD [57] 25.59/0.7322; ARCNN [59] 26.24/0.7488; DnCNN [2] 26.35/0.7534; LIPIO [60] 25.71/0.7337; Ours(L=2) 31.27/0.8375; Reference]
Figure 8: Visualized comparisons on the real-world use case. Red rectangle denotes the zoomed ROI area.

IV-D JPEG Compression Artifact Removal

Beyond real image denoising, our method is further extended to JPEG compression artifact reduction. Following the same experimental setting as in [64], we first validate our approach on synthetic datasets, where we use both the training and testing sets of BSD500 [65] as training data. JPEG compressed images are generated by the Matlab JPEG encoder. To evaluate the performance of our method on blind image deblocking, the JPEG quality factors (QF) are randomly sampled in $[1,40]$. Note that we only train one single model (i.e., blind compression artifact removal) to handle all the JPEG compression factors, e.g., 10, 20 and 30. The whole training process is conducted on the Y channel of the YCbCr space.

In contrast to synthetic JPEG deblocking, real image compression artifact reduction generally involves two implicit tasks, i.e., up-scaling and artifact reduction, as the original images are usually compressed and rescaled for transmission and storage. Two real datasets, collected from popular social media platforms with different compression rates, are used to verify our method's effectiveness, i.e., Twitter [59] and WeChat [64]. Twitter contains 114 training images and an extra 10 images for validation. Each high-resolution image ($3264\times 2448$) results in a compressed and rescaled version with a fixed resolution of $600\times 450$. WeChat only provides 300 testing images, where each image ($3000\times 4000$ pixels) has a corresponding compressed version of $600\times 800$. Our approach is trained only on the Twitter training set, and tested on its validation set and on WeChat.

IV-D1 Comparisons on synthetic datasets

Tab. IV lists the detailed results on three widely used synthetic datasets, i.e., the 5 images in Classic5 [66], the 29 images in LIVE1 [67] and the 100 images in the validation set of BSD500, where we select representative traditional methods [56, 57, 58] and deep learning based methods [59, 2, 4, 61, 62, 63] for comparison. In addition, we use PSNR, SSIM and PSNR-B [68] for quantitative evaluation, where PSNR-B is more sensitive to blocking artifacts than PSNR.

As shown in Tab. IV, although the proposed method does not achieve the best PSNR and SSIM metrics, it attains the best PSNR-B values among the compared methods, which implies that the proposed method is more effective at recovering the local high frequencies of compressed images. Further, we observe that our method achieves approximately consistent results in terms of the PSNR and PSNR-B metrics, whereas all the other methods exhibit an obvious drop on the PSNR-B metric.

Visualized results are shown in Fig. 7. The obvious blocking effects in local high-frequency areas cannot be reduced well by general CNN-based methods. In contrast, our approach recovers consistent textures and smoother edges while significantly reducing blocking artifacts.

IV-D2 Comparisons on real cases

To avoid out-of-memory errors caused by excessive image resolution during inference, we first crop the image with a sliding window, which results in local image blocks without overlap, and then perform the artifact reduction and metric computation on each block.
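A simple sketch of this non-overlapping tiled inference is shown below; the tile size and the model interface are assumptions, and boundary tiles are simply processed at their remaining size.

import torch

@torch.no_grad()
def tiled_restore(model, img, tile=256):
    # img: (N, C, H, W); each non-overlapping tile is restored independently
    n, c, h, w = img.shape
    out = torch.zeros_like(img)
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            patch = img[:, :, top:top + tile, left:left + tile]
            out[:, :, top:top + tile, left:left + tile] = model(patch)
    return out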

Quantitative results are shown in Tab. IV. Traditional and CNN-based methods all suffer from heavy performance degradation due to the complex noise distribution and the loss of high-frequency information. Instead, our approach still achieves obvious performance gains on Twitter and WeChat, i.e., the PSNR and PSNR-B increase by $2\sim 3$ dB on average, which implies that our method is more effective for real compression artifact removal. Furthermore, visualized results are shown in Fig. 8; our method presents significant advantages in removing real artifacts while preserving fine high-frequency details.

[Figure 9 panels (PSNR/SSIM): LDCT 34.39/0.9318; CTformer [69] 37.80/0.9758; RedCNN [70] 38.84/0.9813; Eformer [71] 39.21/0.9799; WGAN-VGG [72] 35.98/0.9598; InvDn [9] 38.20/0.9791; Ours(L=2) 40.22/0.9832; NDCT reference]
Figure 9: Visualized comparisons with state-of-the-art methods. The display window is $[160, 240]$ HU. Red rectangles denote ROI areas; zoom in for better visualization.
TABLE V: Quantitative results of all competing methods on Mayo.
Methods #Param (M) MACs (G) PSNR SSIM
LDCT 38.13 0.961
BM3D [24] 42.10 0.983
RedCNN [70] 1.85 462.48 43.36 0.989
WGAN-VGG [72] 34.07 14.76 40.08 0.979
InvDn [9] 2.07 161.6 42.44 0.987
Eformer [71] 1.11 67.45 43.12 0.988
CTformer [69] 1.45 219.42 42.76 0.987
Ours(L=1) 1.44 189.34 43.44 0.989
Ours(L=2) 2.10 243.98 43.57 0.990

IV-E Low-Dose CT Image Restoration

We further validate our method on medical low-dose CT image restoration. Mayo (http://www.aapm.org/GrandChallenge/LowDoseCT/), a real clinical dataset [73] authorized by the Mayo Clinic for the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge, is used to evaluate low-dose CT (LDCT) reconstruction algorithms, i.e., recovering the normal-dose CT (NDCT) image from low-dose measurements. It contains 5,936 slices of size $512\times 512$ from 10 different subjects; each LDCT slice is simulated by inserting real noise into the NDCT to reach a noise level corresponding to $25\%$ of the full dose. In addition, due to metallic applicators implanted in some patients, heavy streak artifacts are introduced into the image domain, which leads to a more complex noise distribution. We shuffle the dataset and randomly select 4,000 slices as the training set; the rest are used for validation and testing to evaluate the performance of our approach. Representative methods, including BM3D, WGAN-VGG [72] and RedCNN [70], are used for comparison. Further, to evaluate the performance of bijective mapping-based methods, we select InvDn [9] for comparison, where we follow its best hyperparameter setting and retrain it for $600k$ iterations. In addition, we also select recent Transformer-based methods for comprehensive comparison, e.g., Eformer [71] and CTformer [69].

Quantitative results are presented in Tab. V. Compared to the general CNN-based and Transformer-based methods, our method exhibits strong denoising ability with only a single decomposition module, and significant gains are achieved with a 2-level decomposition, i.e., $L=2$. Our approach surpasses RedCNN by an average margin of $\sim$0.2 dB in PSNR with much lower computational cost. InvDn does not show obvious advantages despite its bijective characteristic, since the complex high-frequency distribution in the CT image domain is hard to approximate with a case-agnostic latent variable.

Visualized results are shown in Fig. 9. A representative LDCT slice with heavy streak artifacts is selected for comparison. All the methods remove noise to some extent. RedCNN reduces the noise well, but the streak artifacts are still preserved. WGAN-VGG generates visually pleasing results with adversarial training, but it introduces extra noise and artifacts into the results. InvDn over-smooths the local structures, and the artifacts are also retained. Eformer and CTformer restore global structures better, but noise and artifacts are not removed completely. Instead, our method presents significant advantages in removing noise and artifacts while recovering fine structural details.

V Conclusion

In this paper, we propose a flexible and efficient hierarchical disentangled representation architecture for invertible image denoising, which bridges a bijective transformation between noisy image self-reconstruction and restoration, and largely mitigates the ill-posedness of the task. Extensive experiments on real image denoising, JPEG compression artifact removal, and medical low-dose CT image restoration have demonstrated that the proposed method achieves competitive performance in terms of both quantitative and qualitative evaluations with flexible model complexity. Although many concrete implementations of the proposed idea are possible, we show that a simple design already achieves excellent results on generalized image denoising tasks, which provides a new perspective for solving real image restoration tasks.

References

  • [1] V. Jain and H. Seung, “Natural image denoising with convolutional networks,” in NIPS, 2008.
  • [2] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, pp. 3142–3155, 2017.
  • [3] X.-J. Mao, C. Shen, and Y.-B. Yang, “Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections,” in NIPS, 2016.
  • [4] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1256–1272, 2017.
  • [5] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-cnn for image restoration,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 886–88 609, 2018.
  • [6] W. Liu, Q. Yan, and Y. Zhao, “Densely self-guided wavelet network for image denoising,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1742–1750, 2020.
  • [7] Z. Yue, Q. Zhao, L. Zhang, and D. Meng, “Dual adversarial network: Toward real-world noise removal and noise generation,” in ECCV, 2020.
  • [8] W. Du, H. Chen, and H. Yang, “Learning invariant representation for unsupervised image restoration,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14 471–14 480, 2020.
  • [9] Y. Liu, Z. Qin, S. Anwar, P. Ji, D. Kim, S. Caldwell, and T. Gedeon, “Invertible denoising network: A light solution for real noise removal,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [10] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise modeling with conditional normalizing flows,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3165–3173, 2019.
  • [11] A. Lugmayr, M. Danelljan, L. V. Gool, and R. Timofte, “Srflow: Learning the super-resolution space with normalizing flow,” ECCV, 2020.
  • [12] M. Xiao, S. Zheng, C. Liu, Y. Wang, D. He, G. Ke, J. Bian, Z. Lin, and T.-Y. Liu, “Invertible image rescaling,” 2020.
  • [13] S. W. Zamir, A. Arora, S. H. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Cycleisp: Real image restoration via improved data synthesis,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2693–2702, 2020.
  • [14] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in NeurIPS, 2018.
  • [15] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” 2011 International Conference on Computer Vision, pp. 479–486, 2011.
  • [16] J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng, “Patch group based nonlocal self-similarity prior learning for image denoising,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 244–252, 2015.
  • [17] F. Chen, L. Zhang, and H. Yu, “External patch prior guided internal clustering for image denoising,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 603–611, 2015.
  • [18] L. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: Nonlinear Phenomena, vol. 60, pp. 259–268, 1992.
  • [19] W. Dong, X. Li, L. Zhang, and G. Shi, “Sparsity-based image denoising via dictionary learning and structural clustering,” CVPR 2011, pp. 457–464, 2011.
  • [20] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Non-local sparse models for image restoration,” 2009 IEEE 12th International Conference on Computer Vision, pp. 2272–2279, 2009.
  • [21] S. Roth and M. J. Black, “Fields of experts: a framework for learning image priors,” 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 860–867 vol. 2, 2005.
  • [22] A. Buades, B. Coll, and J. Morel, “A non-local algorithm for image denoising,” 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 60–65 vol. 2, 2005.
  • [23] L. Kaur, S. Gupta, and R. Chauhan, “Image denoising using wavelet thresholding,” in ICVGIP, 2002.
  • [24] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, pp. 2080–2095, 2007.
  • [25] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2392–2399, 2012.
  • [26] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, “Non-local recurrent network for image restoration,” NeurIPS, 2018.
  • [27] S. Lefkimmiatis, “Universal denoising networks: A novel CNN architecture for image denoising,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3204–3213, 2018.
  • [28] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Transactions on Image Processing, vol. 27, pp. 4608–4622, 2018.
  • [29] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1712–1722, 2019.
  • [30] Z. Yue, H. Yong, Q. Zhao, L. Zhang, and D. Meng, “Variational denoising network: Toward blind noise modeling and removal,” in NeurIPS, 2019.
  • [31] Y. Kim, J. W. Soh, G. Park, and N. Cho, “Transfer learning from synthetic to real-noise denoising with adaptive instance normalization,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3479–3489, 2020.
  • [32] X. Wu, M. Liu, Y. Cao, D. Ren, and W. Zuo, “Unpaired learning of deep image denoising,” in ECCV, 2020.
  • [33] J. Liang, J. Cao, G. Sun, K. Zhang, L. V. Gool, and R. Timofte, “SwinIR: Image restoration using Swin Transformer,” 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 1833–1844, 2021.
  • [34] Z. Wang, X. Cun, J. Bao, and J. Liu, “Uformer: A general U-shaped transformer for image restoration,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17662–17672, 2022.
  • [35] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. C. Bovik, and Y. Li, “MAXIM: Multi-axis MLP for image processing,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5759–5770, 2022.
  • [36] L. Ardizzone, J. Kruse, S. J. Wirkert, D. Rahner, E. Pellegrini, R. Klessen, L. Maier-Hein, C. Rother, and U. Köthe, “Analyzing inverse problems with invertible neural networks,” International Conference on Learning Representations, 2018.
  • [37] Y. Liu, S. Anwar, Z. Qin, P. Ji, S. Caldwell, and T. Gedeon, “Disentangling noise from images: A flow-based image denoising neural network,” arXiv preprint arXiv:2105.04746, 2021.
  • [38] L. Guo, S. Huang, H. Liu, and B. Wen, “FINO: Flow-based joint image and noise model,” arXiv preprint arXiv:2111.06031, 2021.
  • [39] D. P. Kingma, T. Salimans, and M. Welling, “Improved variational inference with inverse autoregressive flow,” NeurIPS, 2017.
  • [40] R. van den Berg, L. Hasenclever, J. M. Tomczak, and M. Welling, “Sylvester normalizing flows for variational inference,” arXiv preprint arXiv:1803.05649, 2018.
  • [41] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” International Conference on Learning Representations, 2017.
  • [42] R. Lienhart and J. Maydt, “An extended set of Haar-like features for rapid object detection,” Proceedings of the International Conference on Image Processing (ICIP), vol. 1, 2002.
  • [43] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” International Conference on Learning Representations, 2015.
  • [44] D. P. Mitchell and A. Netravali, “Reconstruction filters in computer-graphics,” Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), 1988.
  • [45] A. Abdelhamed, S. Lin, and M. S. Brown, “A high-quality denoising dataset for smartphone cameras,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1692–1700, 2018.
  • [46] T. Plötz and S. Roth, “Benchmarking denoising algorithms with real photographs,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2750–2759, 2017.
  • [47] J. Anaya and A. Barbu, “RENOIR - A dataset for real low-light image noise reduction,” J. Vis. Commun. Image Represent., vol. 51, pp. 144–154, 2018.
  • [48] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C. C. Loy, Y. Qiao, and X. Tang, “ESRGAN: Enhanced super-resolution generative adversarial networks,” in ECCV Workshops, 2018.
  • [49] S. Anwar and N. Barnes, “Real image denoising with feature attention,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3155–3164, 2019.
  • [50] Y. Liu, S. Anwar, L. Zheng, and Q. Tian, “GradNet image denoising,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2140–2149, 2020.
  • [51] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for real image restoration and enhancement,” in ECCV, 2020.
  • [52] M. Chang, Q. Li, H. Feng, and Z. Xu, “Spatial-adaptive network for single image denoising,” arXiv preprint arXiv:2001.10291, 2020.
  • [53] S. W. Zamir, A. Arora, S. H. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14816–14826, 2021.
  • [54] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1132–1140, 2017.
  • [55] R. S. Sukthanker, Z. Huang, S. Kumar, R. Timofte, and L. V. Gool, “Generative flows with invertible attentions,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11224–11233, 2022.
  • [56] A. Foi, V. Katkovnik, and K. O. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” IEEE Transactions on Image Processing, vol. 16, pp. 1395–1411, 2007.
  • [57] Y. Li, F. Guo, R. T. Tan, and M. S. Brown, “A contrast enhancement framework with JPEG artifacts suppression,” in ECCV, 2014.
  • [58] Q. Song, R. Xiong, X. Fan, D. Liu, F. Wu, T. Huang, and W. Gao, “Compressed image restoration via artifacts-free PCA basis learning and adaptive sparse modeling,” IEEE Transactions on Image Processing, vol. 29, pp. 7399–7413, 2020.
  • [59] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” 2015 IEEE International Conference on Computer Vision (ICCV), pp. 576–584, 2015.
  • [60] Q. Fan, D. Chen, L. Yuan, G. Hua, N. Yu, and B. Chen, “A general decoupled learning framework for parameterized image operators,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, pp. 33–47, 2021.
  • [61] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: A persistent memory network for image restoration,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4549–4557, 2017.
  • [62] X. Fu, Z. Zha, F. Wu, X. Ding, and J. W. Paisley, “JPEG artifacts reduction via deep convolutional sparse coding,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2501–2510, 2019.
  • [63] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. R. Fu, “Residual non-local attention networks for image restoration,” International Conference on Learning Representations, 2019.
  • [64] X. Fu, X. Wang, A. Liu, J. Han, and Z. Zha, “Learning dual priors for JPEG compression artifacts removal,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • [65] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik, “Contour detection and hierarchical image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, pp. 898–916, 2011.
  • [66] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces, 2010.
  • [67] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik, “LIVE image quality assessment database release 2,” 2005. [Online]. Available: http://live.ece.utexas.edu/research/quality
  • [68] C. Yim and A. C. Bovik, “Quality assessment of deblocked images,” IEEE Transactions on Image Processing, vol. 20, pp. 88–98, 2011.
  • [69] D. Wang, F. Fan, Z. Wu, R. Liu, F. Wang, and H. Yu, “CTformer: Convolution-free Token2Token dilated vision transformer for low-dose CT denoising,” arXiv preprint arXiv:2202.13517, 2022.
  • [70] H. Chen, Y. Zhang, M. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang, “Low-dose CT with a residual encoder-decoder convolutional neural network,” IEEE Transactions on Medical Imaging, vol. 36, pp. 2524–2535, 2017.
  • [71] A. Luthra, H. Sulakhe, T. Mittal, A. Iyer, and S. K. Yadav, “Eformer: Edge enhancement based transformer for medical image denoising,” arXiv preprint arXiv:2109.08044, 2021.
  • [72] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. Kalra, Y. Zhang, L. Sun, and G. Wang, “Low-dose CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss,” IEEE Transactions on Medical Imaging, vol. 37, pp. 1348–1357, 2018.
  • [73] C. H. McCollough, A. Bartley, R. E. Carter, B. Chen, T. A. Drees, P. Edwards, D. R. Holmes, A. E. Huang, F. Khan, S. Leng, K. McMillan, G. Michalak, K. M. Nunez, L. Yu, and J. G. Fletcher, “Low-dose CT for the detection and classification of metastatic liver lesions: Results of the 2016 Low Dose CT Grand Challenge,” Medical Physics, vol. 44, pp. e339–e352, 2017.