
T-Net: Deep Stacked Scale-Iteration Network for Image Dehazing

Lirong Zheng, Yanshan Li, Kaihao Zhang, Wenhan Luo Lirong Zheng and Yanshan Li (Corresponding author) are with the ATR National Key Lab. of Defense Technology and Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, 518060, China. E-mail: {[email protected], [email protected]}
Kaihao Zhang is with the College of Engineering and Computer Science, Australian National University, Canberra, ACT, Australia. E-mail: {[email protected]}
Wenhan Luo is with Tencent, Shenzhen, China. E-mail: {[email protected]}
Abstract

Hazy images reduce the visibility of image content, and haze causes failures in subsequent computer vision tasks. In this paper, we address the problem of image dehazing by proposing a dehazing network named T-Net, which consists of a backbone network based on the U-Net architecture and a dual attention module, and which achieves multi-scale feature fusion by using skip connections with a new fusion strategy. Furthermore, by repeatedly unfolding the plain T-Net, Stack T-Net is proposed to take advantage of the dependence of deep features across stages via a recursive strategy. To reduce the number of network parameters, the intra-stage recursive computation of ResNet is adopted in our Stack T-Net. Each stage takes both the stage-wise result and the original hazy image as input to its T-Net and finally outputs the prediction of the clean image. Experimental results on both synthetic and real-world images demonstrate that our plain T-Net and the advanced Stack T-Net perform favorably against the state-of-the-art dehazing algorithms, and show that Stack T-Net further improves the dehazing effect, demonstrating the effectiveness of the recursive strategy.

Index Terms:
Image dehazing, Multi-scale, Dual attention, Recurrent structure, Recursive strategy

1 Introduction

Haze is a common atmospheric phenomenon that not only has a serious adverse effect on human visual perception, but also seriously degrades the performance of modern computer vision systems on various visual tasks, such as image classification, object detection and video surveillance. Since hazy conditions like fog, aerosols, sand and mist can scatter and absorb light, images taken in hazy environments suffer from obscured visibility, reduced contrast, color cast and many other degradations. Therefore, it is crucial to develop effective dehazing solutions that reduce the image distortion caused by such environmental conditions.

(a) Hazy image (b) T-Net (c) Stack T-Net
Figure 1: Dehazed results on real-world hazy images. We show the results of T-Net and Stack T-Net, which both perform well on these images. Compared with T-Net, the results of Stack T-Net are cleaner and brighter and exhibit finer details.

In the past ten years, single image dehazing has attracted widespread attention in the computer vision community. Its goal is to restore clean scenes from hazy images. According to the atmosphere scattering model [1, 2, 3], the hazing process can be approximated as,

I(x) = J(x)t(x) + A(x)(1 - t(x)),   (1)

where I(x) and J(x) denote a hazy image and its clean scene respectively, A(x) represents the global atmospheric light, t(x) describes the transmission map and x is the pixel location. The transmission map t(x) is related to the depth d(x) of the image through t(x) = e^{-βd(x)}, where β is the atmospheric scattering coefficient and controls the scale of the haze. Many methods based on this physical model restore the clean scene J(x) from a hazy image I(x) by estimating the global atmospheric light and the transmission map of the hazy image.
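To make the role of each term concrete, the following minimal sketch applies Eq. (1) to synthesize a hazy image from a clean image and a depth map; the specific values of β and A, and the use of a scalar atmospheric light, are illustrative assumptions rather than settings prescribed by the paper.

```python
import numpy as np

def synthesize_haze(clean, depth, beta=1.0, atmosphere=0.8):
    """Apply the atmospheric scattering model I = J*t + A*(1 - t).

    clean:      H x W x 3 image J(x) with values in [0, 1]
    depth:      H x W scene depth d(x) in arbitrary units
    beta:       scattering coefficient controlling haze density (assumed value)
    atmosphere: global atmospheric light A, here a scalar (assumed value)
    """
    t = np.exp(-beta * depth)              # transmission map t(x) = exp(-beta * d(x))
    t = t[..., np.newaxis]                 # broadcast over the color channels
    hazy = clean * t + atmosphere * (1.0 - t)
    return hazy.clip(0.0, 1.0)

# Toy usage: a flat gray scene whose depth grows from left to right.
J = np.full((64, 64, 3), 0.4)
d = np.tile(np.linspace(0.0, 3.0, 64), (64, 1))
I = synthesize_haze(J, d)
```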

Early prior-based methods attempt to estimate the transmission map by using the difference in statistical properties between hazy and clear images, such as contrast prior [4], dark channel prior (DCP) [5], color-lines [6], color attenuation prior [7], haze-lines [8], and color ellipsoidal prior [9]. However, these hypothetical priors are not suitable for all real-world images, and they easily yield inaccurate approximations of the transmission map, which in turn degrade the quality of the restored image. To solve this problem, driven by the unprecedented success of deep learning in low-level vision tasks in recent years [10, 11, 12, 13, 14], Convolutional Neural Networks (CNNs) have been adopted for the task of image dehazing.

Earlier deep learning based methods, including DehazeNet [15], the multi-scale CNN (MSCNN) [16], and the residual learning technique [17], use deep networks to estimate the transmission map and exploit the bright-pixel method of the conventional DCP [5] to estimate the atmospheric light. However, since the estimation of the atmospheric light is coarse, the results of these methods are not sufficiently satisfactory. Therefore, some methods use deep networks to estimate the transmission map and the atmospheric light, either jointly or separately. For example, Li et al. [18] propose a network named AOD-Net to estimate a new variable which combines the transmission map with the atmospheric light, and Zhang et al. [19] use a two-stream network to estimate these two variables respectively.

In addition to the methods based on the atmospheric scattering model, there are some works that use end-to-end deep neural networks to directly estimate clean images instead of performing explicit estimation of the transmission map and atmospheric light. Considering that the estimation of the transmission map and atmospheric light sometimes deviates from the real hazy image, directly estimating the clean image is able to avoid sub-optimal restoration. Inspired by the successes of the generative adversarial networks (GANs) in synthesizing realistic images, many methods use GANs as the framework to transform hazy images to clean images. For instance, Qu et al. [20] propose a GAN based on pix2pix to realize image-to-image translation for dehazing.

Image restoration methods based on deep learning are driven by data, requiring large-scale synthetic or real-world datasets for training. These datasets are created by using the atmosphere scattering model to simulate the process of image degradation. Although proper guidance by the transmission map and atmospheric light is beneficial for effective dehazing under a wide range of haze density and for dealing with color distortions caused by haze, the algorithms based on the estimation of the transmission map and atmospheric light also rely on the same physical model. These algorithms may suffer from inherent performance loss on real-world images due to the over-fitting of the algorithms on synthetic datasets. In comparison, the methods of direct mapping exhibit higher robustness on real-world hazy images.

For this reason, we design an end-to-end network named T-Net for single image dehazing, which directly predicts clean images without using the atmospheric scattering model. Our T-Net is a symmetrical T-shaped architecture composed of two components, a backbone module and a dual attention module. The backbone module is a network based on the U-Net [21] architecture, in which residual dense blocks, upsampling blocks and downsampling blocks are used as the basic blocks. In order to obtain distinctive features conducive to dehazing, a dual attention module [22] is embedded in the backbone module, which can also enhance the robustness of the network. Meanwhile, in view of the effectiveness of multi-scale features for image restoration [16, 23, 24, 25, 26, 27, 28], we use skip connections with a new fusion strategy [29] in T-Net to realize adaptive multi-scale feature fusion. Moreover, inspired by the recurrent structures of [24] and [30], we propose Stack T-Net by repeatedly unfolding our T-Net to further improve the dehazing performance. Each stage of Stack T-Net uses the output of the previous stage and the original hazy image as input and outputs a clean image. Fig. 1 shows the dehazed results of our proposed T-Net and Stack T-Net on exemplar real-world images.

To summarize, our work has three-fold contributions as follows.

1) T-Net, a novel network with a dual attention module based on the U-Net architecture, is proposed to realize efficient information exchange across the multi-scale features from different levels to directly predict clean images.

2) Stack T-Net, which is created by repeatedly unfolding T-Net, can further improve the dehazing performance by taking both the stage-wise result and the original hazy image as input to each stage.

3) Extensive experiments show that our plain T-Net and Stack T-Net perform favorably against the state-of-the-art methods on both synthetic and real-world hazy images. Meanwhile, an ablation study is conducted to demonstrate the effects of different modules in the proposed network.

The rest of the paper is organized as follows. Section 2 discusses the related work of single image dehazing. Section 3 describes the structure of the plain network T-Net, including a backbone module and a dual attention module. Section 4 explains the details of our overall dehazing network Stack T-Net and defines the loss function for training. Section 5 introduces the datasets used in the experiments, and presents the results of the ablation study and the quantitative and qualitative comparisons between our approach and the state-of-the-art methods on synthetic and real-world images. Section 6 concludes the paper and discusses the direction of the future work.

2 Related Work

In this section, we briefly discuss the single image dehazing methods, which could be roughly divided into two categories, prior-based methods and learning-based methods as mentioned above.

2.1 Prior based methods

Single image dehazing is a highly ill-posed problem in computer vision, and it is difficult to dehaze images without additional information. To address this challenge, different priors or assumptions obtained through observations and statistics on a large amount of real data have been used in many dehazing methods. Except for the earliest methods, which are based on image enhancement algorithms [31, 32, 33, 34], most prior-based dehazing methods are built on the atmosphere scattering model, obtaining clean images by estimating the transmission map and the atmospheric light to invert Eq. (1). Representative works following this route include [35, 4, 5, 6, 7, 8, 9], etc.

Fattal [35] proposes a dehazing technique that estimates the albedo under the assumption that the transmission map and surface shading are locally uncorrelated. Tan [4] proposes a local contrast-maximization method based on a Markov Random Field (MRF), motivated by the observation that the contrast of clear images is higher than that of the corresponding hazy images. Observing that the intensity of at least one color channel in local regions of natural haze-free images is close to zero and that pixel intensity increases as haze increases, He et al. [5] propose the dark channel prior (DCP) as an approximation of the haze distribution to estimate the transmission map. This work also proposes a popular way to estimate the atmospheric light by averaging the top 0.1% brightest pixels in hazy images. Many subsequent works improve on the DCP to refine the estimation of the transmission map by using different edge-preserving smoothing filters [36, 37, 38, 39]. Discovering that haze causes color-lines [40] to deviate from the origin, Fattal [6] proposes a new method to recover the transmission map, where color-lines come from the observation that pixels of small image patches typically exhibit a one-dimensional distribution. The color attenuation prior adopted in the linear model of [7] is based on the assumption that, as haze increases, the brightness of images increases but saturation decreases. With the assumption that several hundred distinct colors can well represent the colors of a haze-free image, Berman et al. [8] observe that these colors in hazy images form clusters along lines in the RGB space and that these lines, named haze-lines, pass through the coordinate value corresponding to the atmospheric light. Based on this observation, they propose a method to estimate the transmission map by calculating the haze-lines with the predicted atmospheric light, which is computed with the brightest-pixel method of DCP. Bui and Kim [9] construct color ellipsoids by statistically fitting haze pixel clusters in the RGB space and then calculate a prior vector through color ellipsoid geometry to obtain the transmission map. Despite their varying degrees of success, prior-based methods are limited by the hypothetical priors themselves. There is a certain gap between these priors and reality, which can make the performance of these dehazing methods unsatisfactory.

2.2 Deep learning based methods

Recently, deep learning has achieved significant success in low-level vision tasks such as image super-resolution [41, 42, 43], deblurring [44, 45], deraining [46, 47], desnowing [48, 49], and dehazing [15, 16]. At present, there are two main ideas behind deep learning based dehazing methods. One is to estimate the transmission map and atmospheric light according to the atmospheric scattering model, and the other is to use a deep network to directly predict clean images.

The deep learning dehazing methods based on the physical model use the same strategy as the prior based methods to restore clean images, but generally estimate the transmission map and atmospheric light with specially designed CNNs instead of priors, so as to avoid the performance limitation caused by the gap between hand-crafted priors and reality. Cai et al. [15] propose a dehazing model, DehazeNet, to estimate the transmission map. Ren et al. [16] design a Multi-Scale CNN (MSCNN) to estimate the transmission map with a coarse-to-fine strategy. Zhang et al. [19] employ a densely connected pyramid dehazing network (DCPDN), a two-stream network that predicts the transmission map and the atmospheric light, respectively. Li et al. [18] reformulate the atmosphere scattering model with a new variable that integrates the transmission map and the atmospheric light and design an end-to-end neural network named AOD-Net to estimate this variable. Zhang et al. [50] introduce FAMED-Net to estimate the same variable as in [18]. Dudhane et al. [51] propose PYF-Net, which consists of a YNet for the estimation of the transmission map in the RGB and YCbCr spaces and an FNet to fuse the two transmission maps. Recently, some works combine the traditional DCP with deep learning. For example, Golts et al. [52] design a new DCP-based loss, which is used to train an unsupervised deep network to estimate the transmission map. Chen et al. [53] exploit a Patch Map Selection Net to adaptively set the dark-channel patch size for each pixel.

The deep learning dehazing methods of direct mapping regard dehazing as an image-to-image translation problem. In recent years, generative adversarial networks (GANs) have shown great success in image generation and translation [54] [55] [12], and are subsequently used in the field of dehazing. Qu et al. [20] propose a pix2pix GAN with two enhancing blocks to predict clean images and enhance detail and color. Raj et al. [56] design a conditional GAN based on the U-Net architecture for dehazing. Engin et al. [57] design an end-to-end dehazing network based on CycleGAN that needs no paired hazy and clean images for training. Du et al. [58] design a GAN with an adaptive loss to facilitate end-to-end perceptual optimization and propose a new post-processing method that removes halo artifacts using guided filters. Shao et al. [59] apply an end-to-end network made up of two sub-networks: a bidirectional translation network based on CycleGAN to bridge the gap between the synthetic and real domains, and a dehazing network to restore clean images from the hazy images before and after translation. In addition to GANs, some works are based on other architectures. Ren et al. [60] introduce an end-to-end network based on the encoder-decoder architecture, which adopts a novel fusion-based strategy that derives three inputs with three pre-processing methods. Liu et al. [27] propose a grid network, GridDehazeNet, for single image dehazing, which consists of three modules: pre-processing, backbone, and post-processing. Li et al. [61] introduce a level-aware progressive network (LAP-Net), in which each stage learns a different level of haze under different supervision and the final output is produced with an adaptive integration strategy. Dong et al. [62] propose a Multi-Scale Boosted Dehazing Network (MSBDN) based on the U-Net architecture with two principles, boosting and error feedback, to realize dense feature fusion.

3 T-Net

In this section, we describe the proposed T-Net which is a symmetrical T-shaped network consisting of two sub-modules, a backbone module based on the U-Net architecture and a dual attention module. Fig. 2 illustrates the architecture of T-Net, and the details are given in the following.

Figure 2: The architecture of T-Net. It is an end-to-end network to directly obtain the dehazed result, consisting of a backbone module and a dual attention module (dark yellow rectangular block). The red path is the trunk road of the network, and the gray paths are skip connections with weights.

3.1 Backbone module

The backbone module of T-Net is based on the U-Net architecture, and we adopt a new fusion strategy to make use of the skip connections. As shown in Fig. 2, the network mainly includes three kinds of basic blocks: the residual dense block (RDB) [63], the upsampling block and the downsampling block. The detailed structure of these three kinds of blocks is shown in Fig. 3.

Figure 3: The detailed structure of the three kinds of blocks in the backbone module of T-Net. The yellow, blue and green dotted boxes respectively represent the RDB block, the downsampling block and the upsampling block, where we illustrate the variation of the feature size and the number of channels in each layer of each kind of block.

Extensive research has demonstrated that the use of multi-scale features is beneficial to various image understanding tasks. High-level features have lower spatial resolution but encode more of the semantic contextual information that is necessary for understanding image scenes. In contrast, low-level features have higher resolution, which helps localize objects, but contain less semantic contextual information. Therefore, fusing features from different levels can simultaneously preserve spatial information from low-level features and exploit the semantic contextual information from high-level features for image understanding.

Actually, image restoration includes a process of image understanding to determine what should be kept in the images and what should be removed. In order to make full use of the information from different levels of features, multi-scale feature fusion has been applied to many image restoration tasks such as image deraining [24, 25], image deblurring [64, 65] and image dehazing [26, 27, 62], which significantly improves the performance of the algorithms. The U-Net [21] architecture is originally proposed for semantic segmentation, consisting of a contracting path to capture contextual cues, a symmetric expanding path for precise localization, and multiple lateral connections between the contracting path and the expanding path for multi-scale feature fusion. We design our T-Net based on this architecture and make several improvements for the needs of dehazing.

As shown in Fig. 2, the backbone is a symmetric network, which mainly includes three pairs of RDB blocks and four pairs of upsampling and downsampling blocks in the trunk road (the red path in Fig. 2), and three RDB blocks in the lateral connections. In addition, there is a convolutional layer without an activation function at the beginning of the network, which generates 16 linear feature maps as the learned input from the hazy image. There is another convolutional layer at the end, symmetric with the one at the beginning, which is used to generate the high-quality dehazed image.

The RDB block [63], which keeps the number of feature maps unchanged, can extract abundant local features via densely connected convolutional layers, so we choose it as the basic block for feature generation. As shown in Fig. 3, we use five convolutional layers in each RDB and set the growth rate to 16. Every convolutional layer takes the concatenation of the output features of all preceding convolutional layers as input. The first four convolutional layers are used to extract features, and the last layer (a 1x1 convolution) keeps the number of channels of the output feature the same as that of the input feature; the final output of each RDB block is the combination of the output of this last layer and the input of the block through channel-wise addition. In order to reduce information loss, we use convolutional layers to realize upsampling and downsampling, and the detailed structure is shown in Fig. 3. The upsampling block and the downsampling block both consist of two convolutional layers: the first layer adjusts the size of the feature maps by using different kernel sizes, and the second layer uses a 1x1 convolution to change the number of channels. In each upsampling block, the number of feature maps is halved as the size of the feature maps is doubled, and the reverse holds in each downsampling block.
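The following PyTorch sketch illustrates the RDB layout described above (four densely connected 3x3 layers with a growth rate of 16, a 1x1 fusion layer, and a local residual addition). The ReLU activations and padding choices are assumptions where the text does not specify them, so this is a reference sketch rather than the exact implementation.

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    """Residual dense block: densely connected convs + local residual addition."""
    def __init__(self, channels, growth=16, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(num_layers):                      # four 3x3 feature-extraction layers
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1),
                nn.ReLU(inplace=True)))
            in_ch += growth
        self.fuse = nn.Conv2d(in_ch, channels, 1)        # 1x1 conv restores the channel count

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1))) # each layer sees all previous outputs
        return self.fuse(torch.cat(feats, dim=1)) + x    # local residual connection

y = RDB(16)(torch.randn(1, 16, 64, 64))   # shape preserved: (1, 16, 64, 64)
```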

We use a new fusion strategy to realize the skip connections, since features from different scales may not be equally important. Motivated by [29], we set two trainable fusion parameters for each skip connection, where every fusion parameter is an n-dimensional vector (n is the number of channels of the features before fusion). Moreover, RDB blocks instead of 1x1 convolutions are used in the lateral connections to obtain more feature combinations, which increases the possibility of obtaining more effective information for dehazing. The backbone includes four pairs of upsampling and downsampling blocks, so the fused feature can be expressed as

\begin{split}
\breve{F}_{i}^{j} &= \alpha f_{r}(F_{i}^{j}) + \beta f_{u}(f_{x}(F_{i+1}^{j})),\\
F_{i+1}^{j} &= f_{d}(F_{i}^{j}),\\
f_{x}(\cdot) &= \begin{cases} f_{r}(\cdot), & i < m-1,\\ f_{a}(\cdot), & i = m-1, \end{cases}\\
i &= 0, 1, \cdots, m-1; \quad j = 0, 1, \cdots, n,
\end{split}   (2)

where α and β are the fusion parameters for the two features to be fused, f_r(·), f_u(·), f_d(·) and f_a(·) separately stand for the functions of RDB, upsampling, downsampling and dual attention, F_i^j represents the j-th feature channel after the i-th downsampling, \breve{F}_i^j represents the fused feature symmetric with F_i^j, m is the number of upsampling and downsampling block pairs, and n is the number of channels of the features.
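As a concrete illustration of the fusion strategy in Eq. (2), the sketch below fuses a lateral feature and an upsampled feature with two learnable per-channel weight vectors; the module name and the all-ones initialization are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuse a lateral feature and an upsampled feature with learnable
    per-channel weights, as in Eq. (2): F = alpha * F_lateral + beta * F_up."""
    def __init__(self, channels):
        super().__init__()
        # One n-dimensional weight vector per branch (n = number of channels).
        self.alpha = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, lateral, upsampled):
        return self.alpha * lateral + self.beta * upsampled

fused = WeightedFusion(32)(torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64))
```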

3.2 Dual attention module

As shown in Fig. 2, in addition to the backbone module, we use a dual attention module in the T-Net, which is embedded in the middle of the backbone module and includes two blocks, the position attention block and the channel attention block. Fig. 4 is an overview of the dual attention module.

Figure 4: An overview of the dual attention module. The final output of the dual attention module is the sum of the outputs of the position attention block and the channel attention block, and has the same feature size and number of channels as the input.

Discriminant feature representations are the key for scene understanding, which could be obtained by capturing long-range contextual information. The dual attention module is originally proposed in [22] for scene segmentation, which could adaptively integrate local features with their global dependencies. The position attention block encodes a wider range of contextual information into local features by a weighted sum of the features at all positions, thus enhancing their representation capability. The channel attention block emphasizes interdependent feature channels by integrating associated features among all channels, to further improve the feature representation of specific semantics. These two attention blocks could be simply regarded as two different dimension feature enhancers, and expand the range of semantic contextual information from two different dimensions.

As shown in Fig. 4, the position attention block generates feature maps with spatial long-range contextual information through three steps. First, a spatial attention matrix is generated by the softmaxed product of two new features, one obtained by reshaping the input feature to R^{C×(H×W)} and the other being the transposition of the first, where H×W is the number of pixels. Second, an attention increment is obtained by performing a matrix multiplication between the attention matrix and the second new feature. Third, the final representations reflecting long-range contexts are obtained through an element-wise sum of the attention increment and the original input feature. The channel attention block captures the channel relationship through similar steps, except that in the first step the attention matrix is calculated in the channel dimension instead of the position dimension.
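The three steps of the position attention block can be summarized by the following simplified sketch; note that the full module of [22] additionally applies convolutions to form the query, key and value features and a learnable scale before the residual sum, which are omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def position_attention(x):
    """Simplified position attention following the three steps above.

    x: feature map of shape (B, C, H, W).
    """
    b, c, h, w = x.shape
    n = h * w
    feat = x.view(b, c, n)                           # reshape to (B, C, N), N = H*W
    attn = torch.bmm(feat.transpose(1, 2), feat)     # (B, N, N) spatial similarity
    attn = F.softmax(attn, dim=-1)                   # step 1: softmaxed attention matrix
    out = torch.bmm(feat, attn.transpose(1, 2))      # step 2: attention increment, (B, C, N)
    return out.view(b, c, h, w) + x                  # step 3: element-wise sum with the input

y = position_attention(torch.randn(1, 16, 8, 8))
```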

As explained above, these two attention blocks focus on the content similarity of features, that is, the interdependence between features. They can adaptively emphasize the effective information in the features under the guidance of the loss function and obtain discriminant feature representations according to different input images, without being limited by the network structure itself. Therefore, we believe that combining the U-Net architecture with the dual attention module can further improve the robustness of the network, and that the discriminant features obtained by the dual attention module can guide the parameter learning of the network. As shown in Fig. 2, we embed the dual attention module in the middle of the last pair of upsampling and downsampling blocks to extract discriminant feature representations from high-level semantic features.

4 Stack T-Net

Figure 5: The architecture of Stack T-Net. (a) The detailed structure. (b) The information flow and the feature resolution change among different stages. We use T-Net as the sub-network to build Stack T-Net by repeatedly unfolding it several times. Each dotted line in (b) represents a different resolution, and the resolution of each dotted line is 1/4 of that of the line above it. Since we use a T-Net with four pairs of upsampling and downsampling blocks as the sub-network, there are five dotted lines in (b) (including the original resolution).

In this section, we introduce the overall dehazing network Stack T-Net, which is constructed by repeatedly unfolding the plain T-Net, and the loss function used in our method. The first subsection explains the details of the architecture of Stack T-Net; Fig. 5 illustrates a K-stage Stack T-Net, where sub-figure (a) exhibits the detailed structure and sub-figure (b) shows the information flow and the feature resolution change among different stages. The second subsection introduces the loss function used for training the full network.

4.1 The details of the architecture

Inspired by [24] and [30], we introduce a recurrent structure into our method. In [24], Yang et al. use a recurrent network for deraining but only feed the deraining result of the previous stage as the input of the next stage, so the performance is limited by the deraining result of each stage. We therefore adopt the strategy of [30], using the concatenation of the output of the previous stage and the original hazy image as input, where the original hazy image can compensate for the information loss caused by each dehazing stage. As shown in Fig. 5, we use a T-Net with the same structure in each stage. The inference of Stack T-Net at the k-th stage can be formulated as

\begin{split}
x^{k} &= f_{in}(x^{0}, y^{k-1}),\\
y^{k} &= f_{T}^{k}(x^{k}),
\end{split}   (3)

where x^k and y^k represent the input and the output of the k-th stage respectively, x^0 is the original hazy image and y^0 = x^0, f_in stands for the concatenation operation, f_T^k denotes the mapping of T-Net at the k-th stage, and k ∈ {1, ..., K}. The T-Net of each stage plays the role of a dehazing sub-network, which directly learns the mapping from the hazy image to the dehazed result. Therefore, the input of each stage is a six-channel image, the concatenation of two three-channel RGB images, and the output is a three-channel RGB image. To be consistent with the other stages, the first stage uses the concatenation of two identical original hazy images as input. We choose the output of the last stage as the final dehazing result.

However, a network with such a recurrent structure has a considerably large number of parameters to learn, which consumes a lot of memory and decreases the efficiency of dehazing. Moreover, the more complex the structure of the network is, the more likely it is to over-fit. Witnessing the success of recursive computation in image restoration [66, 67, 68], we apply it to our method. In practice, we utilize inter-stage recursive computation instead of learning independent parameters for each unfolded stage. Since inter-stage recursive computation requires each stage to share the same network parameters, each stage can be expressed as

f_{T}^{k}(\cdot) = f_{T}(\cdot).   (4)

Experimental results verify that this recursive strategy can improve the dehazing effectiveness without designing a deeper and more complex network.
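A minimal sketch of the shared-parameter unfolding in Eqs. (3) and (4) is given below; `subnet` is a stand-in for T-Net, and returning all stage outputs matches the use of per-stage losses in the next subsection.

```python
import torch
import torch.nn as nn

class StackTNet(nn.Module):
    """Unfold one shared dehazing sub-network K times (Eqs. (3)-(4)).

    `subnet` stands in for T-Net: it maps a 6-channel input (hazy image
    concatenated with the previous estimate) to a 3-channel RGB output.
    """
    def __init__(self, subnet, stages=3):
        super().__init__()
        self.subnet = subnet          # shared across stages: f_T^k = f_T
        self.stages = stages

    def forward(self, hazy):
        outputs = []
        y = hazy                              # y^0 = x^0
        for _ in range(self.stages):
            x = torch.cat([hazy, y], dim=1)   # x^k = concat(x^0, y^{k-1})
            y = self.subnet(x)                # y^k = f_T(x^k)
            outputs.append(y)
        return outputs                # per-stage results; the last one is the final output

# Toy stand-in for T-Net: a single 6->3 convolution.
toy_tnet = nn.Conv2d(6, 3, 3, padding=1)
results = StackTNet(toy_tnet, stages=3)(torch.randn(1, 3, 64, 64))
```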

4.2 Loss function

The loss function of our method measures the error of the dehazing results at each stage and is used to train the proposed network. The MSE loss has a smooth function curve, which is convenient for gradient descent, but it is more sensitive to outliers and more prone to gradient explosions than the L1 loss. Therefore, we employ the smooth L1 loss, which combines the MSE loss and the L1 loss, to enhance the robustness of the network.

Let I_c(x) denote the intensity of the c-th color channel at pixel x in the ground truth, and let J_c^k(x) denote the intensity at the same position of the dehazed image in the k-th stage. K and N denote the number of stages and the number of pixels, respectively. The smooth L1 loss of the k-th stage, L_{SL_1}^k, can be defined as

L_{SL_{1}}^{k} = \frac{1}{N}\sum_{x=1}^{N}\sum_{c=1}^{3} f_{SL_{1}}\left(\left|J_{c}^{k}(x) - I_{c}(x)\right|\right),   (5)

where

f_{SL_{1}}(e) = \begin{cases} 0.5e^{2}, & 0 \leq e < 1,\\ e - 0.5, & e \geq 1. \end{cases}   (6)

When the error between the dehazed result and the ground truth exceeds the threshold, the L1 norm is used instead of the L2 norm to reduce the influence of outliers on the network parameters.
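A direct translation of Eqs. (5) and (6) into PyTorch is sketched below; summing over the color channels and averaging over pixels (and the batch) follows the 1/N factor of Eq. (5), whereas PyTorch's built-in SmoothL1Loss would instead average over all elements, so the two differ by a constant factor.

```python
import torch

def smooth_l1_stage_loss(dehazed, target):
    """Smooth L1 loss of one stage, Eqs. (5)-(6): quadratic for small errors,
    linear for large ones. Inputs are (B, 3, H, W) tensors."""
    e = (dehazed - target).abs()
    per_pixel = torch.where(e < 1.0, 0.5 * e ** 2, e - 0.5)
    # Sum over the three color channels, average over pixels (and the batch).
    return per_pixel.sum(dim=1).mean()

loss = smooth_l1_stage_loss(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32))
```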

The perceptual loss [69] measures visual similarity between the dehazed image and the ground truth more effectively than a pixel-wise loss, so we adopt the perceptual loss to strengthen the finer details of the dehazed images. By leveraging multi-scale features extracted from VGG16 [70] pre-trained on ImageNet [71], the perceptual loss of the k-th stage can be defined as

L_{p}^{k} = \sum_{j=1}^{3}\frac{1}{C_{j}H_{j}W_{j}}\left\|F_{j}(J^{k}) - F_{j}(I)\right\|_{2}^{2},   (7)

where J^k represents the dehazed result of the k-th stage, I represents the ground truth, F_j(·) (j = 1, 2, 3) denotes the output feature of the relu1-2, relu2-2 and relu3-3 layers of VGG-16, respectively, and C_j, H_j and W_j specify the dimensions of these features.
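The sketch below computes the per-stage perceptual loss of Eq. (7) with a frozen VGG16; in torchvision's vgg16, the relu1-2, relu2-2 and relu3-3 outputs sit at feature indices 3, 8 and 15. Normalizing inputs to ImageNet statistics is omitted here and left as an implementation detail.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen VGG16 feature extractor; relu1_2, relu2_2 and relu3_3 sit at
# indices 3, 8 and 15 of torchvision's vgg16().features.
_vgg = models.vgg16(pretrained=True).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_stage_loss(dehazed, target, layer_ids=(3, 8, 15)):
    """Perceptual loss of one stage, Eq. (7): MSE between VGG16 features."""
    loss, x, y = 0.0, dehazed, target
    for i, layer in enumerate(_vgg):
        x, y = layer(x), layer(y)
        if i in layer_ids:
            loss = loss + F.mse_loss(x, y)   # mean over C*H*W matches the 1/(C_j H_j W_j) factor
    return loss
```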

By combining the smooth L1 loss and the perceptual loss, the total loss of all stages is defined as

L = \sum_{k=1}^{K} L_{SL_{1}}^{k} + \lambda\sum_{k=1}^{K} L_{p}^{k},   (8)

that is,

L = L_{SL_{1}} + \lambda L_{P},   (9)

where λ is set to 0.04 to control the relative weights of the two loss components.
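Building on the two loss sketches above, the total loss of Eq. (8) simply sums both terms over all stage outputs with λ = 0.04:

```python
def total_loss(stage_outputs, target, lam=0.04):
    """Total training loss of Eq. (8): sum the smooth L1 and perceptual
    losses over all K stage outputs, weighting the perceptual term by lambda."""
    l_sl1 = sum(smooth_l1_stage_loss(y, target) for y in stage_outputs)
    l_p = sum(perceptual_stage_loss(y, target) for y in stage_outputs)
    return l_sl1 + lam * l_p
```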

5 Experiments

In this section, quantitative and qualitative experimental results are shown to demonstrate the effectiveness of the proposed method. We conduct a series of experiments to compare the performance of our method with the state-of-the-art approaches on both synthetic and real-world datasets. Moreover, we carry out an ablation study to demonstrate the effectiveness of each module of our network.

5.1 Datasets

In the training phase, we use the dataset of [59] as the training set. This dataset contains 6000 synthetic hazy images from the RESIDE dataset [72]. RESIDE contains both synthesized and real-world hazy/clean image pairs of indoor and outdoor scenes and is split into five subsets, namely ITS (Indoor Training Set), OTS (Outdoor Training Set), SOTS (Synthetic Objective Testing Set), URHI (Unannotated Real Hazy Images), and RTTS (Real Task-driven Testing Set). Among the 6000 images, 3000 are chosen from ITS and the rest are from OTS. All images are randomly cropped to 256×256 and randomly flipped for data augmentation, and the pixel values are then normalized to [-1, 1]. Since our training set is a subset of ITS and OTS, we use SOTS as the testing set in the test phase to compare the performance of our proposed method with other methods. Furthermore, real-world hazy images are chosen from URHI to evaluate the generalization of our method to real-world scenes.
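A possible implementation of this augmentation pipeline is sketched below; applying the same crop and flip to the hazy/clean pair and using a horizontal flip are assumptions, since the paper only states that images are randomly cropped and flipped.

```python
import random
from torchvision import transforms
from torchvision.transforms import functional as TF

def paired_augment(hazy, clean, size=256):
    """Apply the same random 256x256 crop and horizontal flip to a hazy/clean
    PIL image pair, then scale pixel values to [-1, 1]."""
    i, j, h, w = transforms.RandomCrop.get_params(hazy, (size, size))
    hazy, clean = TF.crop(hazy, i, j, h, w), TF.crop(clean, i, j, h, w)
    if random.random() < 0.5:
        hazy, clean = TF.hflip(hazy), TF.hflip(clean)
    to_tensor = transforms.Compose([
        transforms.ToTensor(),                          # [0, 1]
        transforms.Normalize([0.5] * 3, [0.5] * 3)])    # -> [-1, 1]
    return to_tensor(hazy), to_tensor(clean)
```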

5.2 Network implementation

We implement our framework in PyTorch [73] and use the Adam optimizer [74] with a batch size of 14 to train the network, where the momentum parameters β1 and β2 adopt the default values of 0.9 and 0.999, respectively. We train every model for 2000 epochs in total. The learning rate is initially set to 0.001, reduced by half every 20 epochs, and kept fixed at 0.0001 from the 80-th epoch. Our proposed method is evaluated against the following state-of-the-art approaches: DCP [5], MSCNN [16], DehazeNet [15], NLD [8], AOD-Net [18], GFN [60], DCPDN [19], EPDN [20], and DA_dehazing [59]. We adopt peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as quantitative evaluation indexes, where PSNR measures the pixel-to-pixel difference and SSIM quantifies the structural difference between a dehazed output image and the corresponding ground truth. The results of all experiments are shown in the following.
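The optimizer and learning-rate schedule described above can be set up as follows; the stand-in model and the exact rate between epochs 60 and 80 are assumptions based on the stated rule (halve every 20 epochs, then fix at 0.0001 from epoch 80).

```python
import torch

def lr_at_epoch(epoch, base_lr=1e-3, floor=1e-4):
    """Halve the base rate every 20 epochs, fixed at 1e-4 from epoch 80 onward."""
    return floor if epoch >= 80 else base_lr * (0.5 ** (epoch // 20))

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # stand-in for Stack T-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: lr_at_epoch(e) / 1e-3)

print([lr_at_epoch(e) for e in (0, 20, 60, 80)])     # [0.001, 0.0005, 0.000125, 0.0001]
```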

5.3 Ablation study

To evaluate the effectiveness of several key modules in our network, we perform ablation studies with the following three strategies.

The first ablation study is conducted to determine the configuration of the backbone module of the proposed T-Net, that is, the number of upsampling and downsampling block pairs and the number of RDB block pairs in the trunk road. RDB blocks are used as the feature generator in our network. Simply put, the more RDB block pairs we use, the deeper the network is and the deeper the features we can extract. The position attention block contains a matrix multiplication of size (N, C) × (C, N), where N is the number of pixels in each channel of the feature, so it easily runs out of memory if the input feature of this block is too large. We set the initial number of upsampling and downsampling block pairs to 2 to avoid this problem. Table I shows the performance of T-Net with different configurations on the SOTS dataset, where m and n respectively represent the number of upsampling and downsampling block pairs and the number of RDB block pairs in the trunk road.

TABLE I: Comparison on the SOTS dataset for T-Net with different configurations.
  m   n   PSNR    SSIM
  2   1   22.90   0.8954
  2   2   26.86   0.9451
  2   3   27.43   0.9473
  3   1   22.97   0.8949
  3   2   27.49   0.9520
  3   3   28.13   0.9536
  4   1   22.86   0.8932
  4   2   28.30   0.9535
  4   3   28.55   0.9543

Table I shows a tendency that the average PSNR and SSIM values increase as m and n increase. The exception happens when n = 1, i.e., the performance of T-Net does not improve with the increase of m in this case. The reason is that the representation ability of the features is influenced by the number of RDB block pairs in the trunk road. When n is set to 1, the features extracted by our network contain insufficient deep discriminant information, which badly limits the fitting ability of the network. Moreover, when n > 1, adding a pair of upsampling and downsampling blocks performs better than adding a pair of RDB blocks (see, e.g., (3, 2) versus (2, 3) and (4, 2) versus (3, 3) in the form of (m, n)), which verifies the effectiveness of multi-scale features. Compared with single-scale features, multi-scale features provide more discriminant information that is helpful for image understanding. As shown in Table I, the average PSNR and SSIM values are the highest when m = 4 and n = 3. Therefore, we use a T-Net with four pairs of upsampling and downsampling blocks and three pairs of RDB blocks in the trunk road as our network in the following experiments.

(a) Hazy image (b) NLD (c) DehazeNet (d) AOD-Net (e) DCPDN (f) EPDN (g) DA_dehaze (h) T-Net (i) Stack T-Net (j) GT
Figure 6: Visual comparisons on the SOTS dataset. We show the visual results of eight methods on the synthetic data, including one prior-based method and seven deep-learning-based methods. The images of the first and the last columns are the hazy images and their corresponding ground truths, respectively. The results of our proposed T-Net and Stack T-Net are shown in the eighth and ninth columns, respectively.

The second ablation study is conducted to verify the effectiveness of each module by comparing the performance of several variants of T-Net. We set the trunk road without the dual attention module (where an RDB block replaces the dual attention module) as the basic model. Starting from this basic model, we create other variants by gradually injecting our modifications, which include usual skip connections realized by 1x1 convolutions, usual skip connections realized by RDB blocks, our skip connections, and finally our skip connections together with the dual attention module (i.e., the full T-Net). Table II shows the performance of each variant of T-Net on the SOTS dataset.

TABLE II: Comparison on the SOTS dataset of several variants of T-Net.
  Variant                                    PSNR    SSIM
  the basic model                            24.19   0.8228
  with usual skip connections (1x1 conv)     27.42   0.9504
  with usual skip connections (RDB block)    27.61   0.9520
  with our skip connections                  27.82   0.9530
  T-Net                                      28.55   0.9543

It is clear that the performance is further improved every time a new component is added to the basic model, which justifies the overall design. As shown in Table II, the comparison among these variants shows the effectiveness of our skip connections with the new fusion strategy. It demonstrates that RDB blocks used in the lateral connections instead of 1x1 convolutions can extract more complex semantic information for dehazing, and that features from different scales have different degrees of importance in feature fusion. Furthermore, the addition of the dual attention module has the greatest effect on the dehazing performance improvement, as it effectively enhances the robustness and generalization ability of the network. This ablation study demonstrates the contribution of each component of our T-Net.

The third ablation study is conducted to evaluate our proposed Stack T-Net with different recursive stage numbers K and to demonstrate the effectiveness of our recursive strategy. Limited by physical memory, we only set K = 1, 2, 3. Table III shows the performance on the SOTS dataset of Stack T-Net with different recursive stage numbers.

TABLE III: Comparison on the SOTS dataset of Stack T-Net with different recursive stage numbers K.
  Recursive stage number    PSNR    SSIM
  K = 1                     28.55   0.9543
  K = 2                     28.71   0.9556
  K = 3                     28.83   0.9551

(a) Hazy image (b) NLD (c) DehazeNet (d) AOD-Net (e) DCPDN (f) EPDN (g) DA_dehaze (h) T-Net (i) Stack T-Net
Figure 7: Visual comparisons on the URHI dataset. We show the visual results of different methods on the real-world data. The first column shows the hazy images, and other columns represent the dehazed results of different methods. The results of our proposed T-Net and Stack T-Net are separately shown in the last two columns.

As shown in Table III, Stack T-Net usually achieves higher average PSNR and SSIM values as the recursive stage number increases. However, the SSIM value of Stack T-Net with three stages is a bit lower than the value of Stack T-Net with two stages, which is because the best model is saved according to the highest PSNR value, instead of the SSIM value. Actually, the highest SSIM value of Stack T-Net with three stages is higher than that of Stack T-Net with two stages in the experiment. This ablation study demonstrates that our recursive strategy is effective.

5.4 Performance comparison on synthetic dataset

The proposed method is tested on the same synthetic dataset SOTS to qualitatively and quantitatively compare it with the state-of-the-art methods, which include DCP [5], NLD [8], MSCNN [16], DehazeNet [15], AOD-Net [18], GFN [60], DCPDN [19], EPDN [20], and DA_dehaze [59]. Apart from DCP and NLD, which are prior-based methods, the others are deep learning based methods. We test two models of our algorithm in the comparison experiment, namely T-Net and Stack T-Net with three stages.

Table IV presents the quantitative comparison of different dehazing methods on the SOTS dataset in terms of the average PSNR and SSIM values, where all data except the last two rows are quoted from the work of Shao et al. [59]. It is clear that our work outperforms the state-of-the-art methods by a wide margin. Not only Stack T-Net but also our plain T-Net outperforms the state-of-the-art DA_dehaze [59] in terms of both the average PSNR and SSIM values on the SOTS dataset.

TABLE IV: Quantitative comparison on the SOTS dataset between the state-of-the-art dehazing methods.
  Methods             PSNR    SSIM
  DCP [5]             15.49   0.64
  NLD [8]             17.27   0.75
  MSCNN [16]          17.57   0.81
  DehazeNet [15]      21.14   0.85
  AOD-Net [18]        19.06   0.85
  DCPDN [19]          19.39   0.65
  GFN [60]            22.30   0.88
  EPDN [20]           23.82   0.89
  DA_dehaze [59]      27.76   0.93
  T-Net               28.55   0.95
  Stack T-Net         28.83   0.96

Fig. 6 shows the qualitative comparison among different dehazing methods, where the synthetic images are chosen from the SOTS dataset. The haze density and scenes of the images shown in Fig. 6 vary; we choose four images each from the indoor and outdoor subsets of SOTS. We note that the first three methods [8, 15, 18] perform poorly on most images and tend to cause under-dehazing, color oversaturation and other distortions. The results of DCPDN [19] and EPDN [20] look better, but DCPDN [19] tends to overexpose the brighter areas in images while EPDN [20] tends to reduce the brightness of the darker areas, both of which cause the loss of many details (see, e.g., the first and third rows in Fig. 6 (e) (f)). DA_dehaze [59] performs well on most images, but it tends to cause color distortions (see, e.g., the first, second, fifth and sixth rows in Fig. 6 (g), whose colors differ slightly from the ground truth images) and sometimes causes halo artifacts (see, e.g., the sky area of the third row in Fig. 6 (g)); the other methods except ours share the same problem.

Compared with the state-of-the-art methods, our T-Net and Stack T-Net perform best in terms of haze removal and are effective in suppressing halo artifacts and color distortions. The dehazed images are visually the most similar to their ground truths (see, e.g., Fig. 6 (h) (i)). In addition, we note that Stack T-Net can further eliminate hazy areas on top of T-Net (see, e.g., the first row in Fig. 6 (h), in which the haze in the middle is eliminated in the same row of (i)), demonstrating that the recursive strategy is helpful for image dehazing.

5.5 Performance comparison on real-world hazy images

We further compare our method with the state-of-the-art approaches on real-world images to evaluate its generalization ability to real-world scenes, where the real-world images used in the experiment are chosen from the URHI dataset. There is no quantitative comparison since clean ground truth images are not available for the real-world hazy images; the qualitative comparison results are shown in Fig. 7.

We note that the results on real-world data are largely consistent with those on the synthetic data. The dehazed images of NLD suffer from severe color distortions and halo artifacts (see, e.g., Fig. 7 (b)), as well as other problems such as overexposure (see, e.g., the second and fifth rows in Fig. 7 (b)). DehazeNet and AOD-Net tend to under-dehaze images (see, e.g., the second, third, fifth and last rows in Fig. 7 (c) (d)). DCPDN, EPDN and DA_dehaze have better dehazing performance but tend to cause halo artifacts in the sky areas and at object edges with large color differences (see, e.g., the sky areas of the first, second, third, fourth and sixth rows and the edge of the airplane in the third row of Fig. 7 (e) (f) (g)). Compared with the state-of-the-art methods, our methods have the best visual effect. As shown in Fig. 7, in addition to effective haze removal, T-Net and Stack T-Net can also suppress color distortions and halo artifacts. There are almost no artifacts or distortions in our dehazed images, whose sky areas and object edges are clean and smooth (see, e.g., the first four rows in Fig. 7 (h) (i)). Moreover, our method restores the color of the image better than the other methods (see, e.g., the first row in Fig. 7, whose color is most similar to the hazy image). Although both T-Net and Stack T-Net perform well on real-world images, Stack T-Net can further improve image quality (see, e.g., the fifth row in Fig. 7 (h) (i), where the image in (i) has higher brightness and looks cleaner than the same image in (h)). In particular, even for images with severe haze and large scene depth (see, e.g., the second row in Fig. 7), our method can still remove a considerable amount of haze while maintaining the authenticity of the images, without causing the halo-artifact distortions of the other methods. Overall, our methods achieve a good balance between dehazing and maintaining image authenticity.

6 Conclusion

In this work, we have proposed a new end-to-end dehazing network named T-Net, which is built based on the U-Net architecture and contains a backbone module and a dual attention module. Inspired by the recursive strategy, we have further proposed Stack T-Net by repeatedly unfolding the plain T-Net. Through the ablation studies, we verify that the overall design of T-Net is effective and the recursive strategy is helpful for dehazing. Experimental results on both synthetic and real-world images demonstrate that our T-Net and Stack T-Net perform favorably against the state-of-the-art dehazing algorithms, and that our Stack T-Net could further improve the dehazing performance.

Real-world images captured under severe haze lose a large amount of texture detail and color information, which degrades the performance of state-of-the-art dehazing methods or makes the color and details of the dehazing results deviate from the original scenes. Our future work will focus on this problem and further study inherent latent features that are invariant to haze.

Acknowledgment

This work was partially supported by National Natural Science Foundation of China (Nos. 61771319, 62076165, 61871154), Natural Science Foundation of Guangdong Province (No. 2019A1515011307), Shenzhen Science and Technology Project (No. JCYJ20180507182259896) and the other project (Nos. 2020KCXTD004, WDZC20195500201).

References

  • [1] E. J. Mccartney, “Optics of the atmosphere: Scattering by molecules and particles,” Journal of Modern Optics, 1977.
  • [2] S. G. Narasimhan and S. K. Nayar, “Vision and the atmosphere,” International Journal of Computer Vision, vol. 48, no. 3, pp. 233–254, 2002.
  • [3] Y. Li, S. You, M. S. Brown, and R. T. Tan, “Haze visibility enhancement: A survey and quantitative benchmarking,” Computer Vision & Image Understanding, vol. 165, no. DEC., pp. 1–16, 2016.
  • [4] R. T. Tan, “Visibility in bad weather from a single image,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
  • [5] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, 2011.
  • [6] R. Fattal, “Dehazing using color-lines,” ACM Trans. Graph., vol. 34, no. 1, Dec. 2015.
  • [7] Q. Zhu, J. Mai, and L. Shao, “A fast single image haze removal algorithm using color attenuation prior,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3522–3533, 2015.
  • [8] D. Berman, T. treibitz, and S. Avidan, “Non-local image dehazing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [9] T. M. Bui and W. Kim, “Single image dehazing using color ellipsoid prior,” IEEE Transactions on Image Processing, vol. 27, no. 2, pp. 999–1009, 2018.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans Pattern Anal Mach Intell, vol. 38, no. 2, pp. 295–307, 2016.
  • [11] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [12] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [13] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in IEEE Conference on Computer Vision & Pattern Recognition, 2017.
  • [14] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2016.
  • [15] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “Dehazenet: An end-to-end system for single image haze removal,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5187–5198, 2016.
  • [16] W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M. H. Yang, “Single image dehazing via multi-scale convolutional neural networks,” in Computer Vision – ECCV 2016, 2016, pp. 154–169.
  • [17] Y. Huang, Y. Wang, and Z. Su, “Single image dehazing via a joint deep modeling,” in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 2840–2844.
  • [18] B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “Aod-net: All-in-one dehazing network,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4780–4788.
  • [19] H. Zhang and V. M. Patel, “Densely connected pyramid dehazing network,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 3194–3203.
  • [20] Y. Qu, Y. Chen, J. Huang, and Y. Xie, “Enhanced pix2pix dehazing network,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8152–8160.
  • [21] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241.
  • [22] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [23] V. Papyan and M. Elad, “Multi-scale patch-based image restoration,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 249–261, 2016.
  • [24] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, “Deep joint rain detection and removal from a single image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [25] H. Zhang and V. M. Patel, “Density-aware single image de-raining using a multi-stream dense network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [26] H. Zhang, V. Sindagi, and V. M. Patel, “Multi-scale single image dehazing using perceptual pyramid deep network,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1015–101 509.
  • [27] X. Liu, Y. Ma, Z. Shi, and J. Chen, “Griddehazenet: Attention-based multi-scale network for image dehazing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [28] T. D. Do, Q. K. Nguyen, and V. H. Nguyen, “A multi-scale context encoder for high quality restoration of facial images,” in 2020 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2020, pp. 1–6.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [30] D. Ren, W. Zuo, Q. Hu, P. Zhu, and D. Meng, “Progressive image deraining networks: A better and simpler baseline,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [31] D. J. Jobson, Z. Rahman, and G. A. Woodell, “Properties and performance of a center/surround retinex,” IEEE Transactions on Image Processing, vol. 6, no. 3, pp. 451–462, 1997.
  • [32] T. K. Kim, J. K. Paik, and B. S. Kang, “Contrast enhancement system using spatially adaptive histogram equalization with temporal filtering,” IEEE Transactions on Consumer Electronics, vol. 44, no. 1, pp. 82–87, 1998.
  • [33] C. Busch and E. Debes, “Wavelet transform for analyzing fog visibility,” IEEE Intelligent Systems and their Applications, vol. 13, no. 6, pp. 66–71, 1998.
  • [34] M. Verma, V. D. Kaushik, and V. K. Pathak, “An efficient deblurring algorithm on foggy images using curvelet transforms,” in Proceedings of the Third International Symposium on Women in Computing and Informatics, 2015, pp. 426–431.
  • [35] R. Fattal, “Single image dehazing,” ACM Trans. Graph., vol. 27, no. 3, pp. 1–9, Aug. 2008.
  • [36] Z. Li and J. Zheng, “Single image de-hazing using globally guided image filtering,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 442–450, 2018.
  • [37] Z. Li, J. Zheng, Z. Zhu, W. Yao, and S. Wu, “Weighted guided image filtering,” IEEE Transactions on Image Processing, vol. 24, no. 1, pp. 120–129, 2015.
  • [38] Z. Li and J. Zheng, “Edge-preserving decomposition-based single image haze removal,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5432–5441, 2015.
  • [39] K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 6, pp. 1397–1409, 2013.
  • [40] I. Omer and M. Werman, “Color lines: image specific color representation,” in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 2, 2004.
  • [41] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 4681–4690.
  • [42] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision – ECCV 2016. Springer, 2016, pp. 694–711.
  • [43] W. Niu, K. Zhang, W. Luo, Y. Zhong, X. Yu, and H. Li, “Blind motion deblurring super-resolution: When dynamic spatio-temporal learning meets static image understanding,” arXiv preprint arXiv:2105.13077, 2021.
  • [44] K. Zhang, W. Luo, Y. Zhong, L. Ma, B. Stenger, W. Liu, and H. Li, “Deblurring by realistic blurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2737–2746.
  • [45] K. Zhang, W. Luo, Y. Zhong, L. Ma, W. Liu, and H. Li, “Adversarial spatio-temporal learning for video deblurring,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 291–301, 2019.
  • [46] K. Zhang, W. Luo, W. Ren, J. Wang, F. Zhao, L. Ma, and H. Li, “Beyond monocular deraining: Stereo image deraining via semantic understanding,” in European Conference on Computer Vision. Springer, 2020, pp. 71–89.
  • [47] K. Zhang, W. Luo, Y. Yu, W. Ren, F. Zhao, C. Li, L. Ma, W. Liu, and H. Li, “Beyond monocular deraining: Parallel stereo deraining network via semantic prior,” arXiv preprint arXiv:2105.03830, 2021.
  • [48] Y.-F. Liu, D.-W. Jaw, S.-C. Huang, and J.-N. Hwang, “Desnownet: Context-aware deep network for snow removal,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3064–3073, 2018.
  • [49] K. Zhang, R. Li, Y. Yu, W. Luo, C. Li, and H. Li, “Deep dense multi-scale network for snow removal using semantic and geometric priors,” arXiv preprint arXiv:2103.11298, 2021.
  • [50] J. Zhang and D. Tao, “Famed-net: A fast and accurate multi-scale end-to-end dehazing network,” IEEE Transactions on Image Processing, vol. 29, pp. 72–84, 2020.
  • [51] A. Dudhane and S. Murala, “Ryf-net: Deep fusion network for single image haze removal,” IEEE Transactions on Image Processing, vol. 29, pp. 628–640, 2020.
  • [52] A. Golts, D. Freedman, and M. Elad, “Unsupervised single image dehazing using dark channel prior loss,” IEEE Transactions on Image Processing, vol. 29, pp. 2692–2701, 2020.
  • [53] W. Chen, J. Ding, and S. Kuo, “Pms-net: Robust haze removal based on patch map for single images,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11673–11681.
  • [54] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [55] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [56] N. Bharath Raj and N. Venketeswaran, “Single image haze removal using a generative adversarial network,” in 2020 International Conference on Wireless Communications Signal Processing and Networking (WiSPNET), 2020, pp. 37–42.
  • [57] D. Engin, A. Genc, and H. K. Ekenel, “Cycle-dehaze: Enhanced cyclegan for single image dehazing,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 938–9388.
  • [58] Y. Du and X. Li, “Perceptually optimized generative adversarial network for single image dehazing,” arXiv preprint arXiv:1805.01084, 2018.
  • [59] Y. Shao, L. Li, W. Ren, C. Gao, and N. Sang, “Domain adaptation for image dehazing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [60] W. Ren, L. Ma, J. Zhang, J. Pan, X. Cao, W. Liu, and M.-H. Yang, “Gated fusion network for single image dehazing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [61] Y. Li, Q. Miao, W. Ouyang, Z. Ma, H. Fang, C. Dong, and Y. Quan, “Lap-net: Level-aware progressive network for image dehazing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [62] H. Dong, J. Pan, L. Xiang, Z. Hu, X. Zhang, F. Wang, and M.-H. Yang, “Multi-scale boosted dehazing network with dense feature fusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [63] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [64] O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang, “Deblurgan-v2: Deblurring (orders-of-magnitude) faster and better,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [65] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [66] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [67] X. Li, J. Wu, Z. Lin, H. Liu, and H. Zha, “Recurrent squeeze-and-excitation context aggregation net for single image deraining,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [68] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [69] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision – ECCV 2016, 2016, pp. 694–711.
  • [70] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
  • [71] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [72] B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang, “Benchmarking single-image dehazing and beyond,” IEEE Transactions on Image Processing, vol. 28, no. 1, pp. 492–505, 2019.
  • [73] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.
  • [74] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.