Unpaired Overwater Image Defogging Using Prior Map Guided CycleGAN

Yaozong Mo, Chaofeng Li, , Wenqi Ren, , Shaopeng Shang, Wenwu Wang, , and Xiao-jun Wu This work was supported by the National Natural Science Foundation of China under Grant 62176150. (Corresponding author: Chaofeng Li and Shaopeng Shao.) Yaozong Mo and Chaofeng Li are with the Institute of Logistics Science and Engineering, Shanghai Maritime University, Shanghai, China, 201306 (e-mail: [email protected]; [email protected]). Wenqi Ren is with the School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen, China, 518000 (e-mail: [email protected]). Shaopeng Shang is with the Vocational College, Shanghai Jian Qiao University, Shanghai, China, 201306 (e-mail: [email protected]). Wenwu Wang is with the Center for Vision Speech and Signal Processing, Department of Electrical and Electronic Engineering, University of Surrey, Surrey GU2 7XH, U.K. (e-mail: [email protected]). Xiao-jun Wu is with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China, 214122 (e-mail: [email protected]).

Abstract

Deep learning-based methods have achieved significant performance for image defogging. However, existing methods are mainly developed for land scenes and perform poorly when dealing with overwater foggy images, since overwater scenes typically contain large expanses of sky and water. In this work, we propose a Prior map Guided CycleGAN (PG-CycleGAN) for defogging of images with overwater scenes. To promote the recovery of the objects on water in the image, two loss functions are exploited for the network where a prior map is designed to invert the dark channel and the min-max normalization is used to suppress the sky and emphasize objects. However, due to the unpaired training set, the network may learn an under-constrained domain mapping from foggy to fog-free image, leading to artifacts and loss of details. Thus, we propose an intuitive Upscaling Inception Module (UIM) and a Long-range Residual Coarse-to-fine framework (LRC) to mitigate this issue. Extensive experiments on qualitative and quantitative comparisons demonstrate that the proposed method outperforms the state-of-the-art supervised, semi-supervised, and unsupervised defogging approaches.

Index Terms:

Overwater image defogging, generative adversarial networks, unpaired defogging dataset.

I Introduction

Due to the presence of fog, images captured on lakes, rivers, and seas can be significantly degraded. This can adversely impact the performance of subsequent downstream computer vision tasks, such as object detection and tracking. Although many promising image defogging algorithms [1][2][3] have been developed, they rarely focus on the scene of overwater and are limited in defogging overwater image. Therefore, it is necessary and valuable to conduct research on the defogging of overwater images.

According to the atmospheric scattering model [4][5], a foggy image can be formulated as:

I(x_{i})=J(x_{i})t(x_{i})+A(1-t(x_{i})),

(1)

where $x_{i}$ refers to the position of a pixel, $I(x_{i})$ is the captured foggy image, and $J(x_{i})$ is the scene radiance. $A$ and $t(x_{i})$ are the global atmospheric light and the transmission map, respectively. Existing defogging methods can be classified into two categories, prior-based and learning-based. The prior-based methods [6][7][8] recover clean images by using physical priors through the model (1). However, these priors are not always reliable. For example, He et al. [1] proposed dark channel prior to obtain the transmission map which assumes that at least one channel in the RGB space is close to zero in a clean natural image. As a result, this method may fail when dealing with the sky or other scene objects which are similar to atmospheric light.

Unlike prior-based methods which perform well only for certain scenarios, the learning-based methods have better generalization ability. Conventional dehazing methods estimate clean image by using convolutional neural networks (CNNs) or generative adversarial networks (GANs) [9] via supervised learning. Due to the inherent domain shift problem (i.e., mismatch between the synthetic and real domains), these methods often struggle when dealing with real-world foggy images. Thus, some algorithms have been developed to bridge the gap between the synthetic and real domains. For example, Yang et al. [10] proposed an unsupervised Disentangled Dehazing Network which blends two recovered scene radiances from different aspects. However, due to the losses being calculated on these two intermediate outputs, the final results may be suboptimal. Li et al. [11] proposed a USID-Net that maps the unpaired images into a shared latent space and then reconstructs the corresponding images. Due to the under-constrained domain mapping from the foggy to fog-free image, the weights of the loss functions optimized for the USID-Net need to be carefully adjusted to guide the network training. Although many semi-supervised and unsupervised learning algorithms [12][13][14] have achieved significant performance, little attention has been paid to the task of overwater image defogging in the literature. In this work, we focus on the problem of overwater image defogging, and propose a new method based on the characteristics of the overwater scene.

Different from the images captured on land, images captured on water usually contain large sky and water areas. Although objects such as boats on the water often occupy only a small area, their recovery is more important than the recovery of sky, as it directly affects subsequent downstream tasks such as object detection and tracking. Taking this into account, we propose a prior map which facilitates the network to focus on the recovery of objects. Fig. 1 compares the dark channel (DC) and the prior map. In dark channels, the sky and water regions tend to be brighter than the objects on the water because their pixel values are generally close to the atmospheric light. Based on this feature, the objects will be highlighted when the dark channel is inverted, as shown in Fig. 1 (c). To emphasize this feature, we utilize the min-max normalization to enhance the contrast, which stretches the pixel values in the middle region while keeping the sky regions darker and the objects brighter, as illustrated in Fig. 1 (e), and we call it the prior map. By taking advantage of the prior map, we develop a prior map guided cycle-consistency loss and a prior map guided GAN loss, which are weighted by the prior map to impose a larger penalty on the objects than the sky. Further experiments show that the proposed prior map is not only applicable to overwater images, but also to land images, by facilitating the network to concentrate on foreground parts, such as buildings.

The original CycleGAN [15] introduced a cycle-consistency loss to promote the similarity between the reconstructed image and the original input image. However, due to the unpaired training set, the network learns a mapping from a source domain (foggy) to a target domain (fog-free) that is under-constrained and may lead to artifacts and loss of detailed information. To address this problem, we design a network architecture by using an Upscaling Inception Module (UIM) inspired by Inception Network [16] and a Long-range Residual Coarse-to-fine framework (LRC).

The motivation for using the UIM is based on the finding that deconvolution blocks tend to introduce checkerboard artifacts due to the uneven overlap when the kernel size is not divisible by the stride. On the contrary, bilinear interpolation could generate smooth results, but it restricts the representation power of the network since it is a fixed upscaling procedure. To mitigate the overlap issue, a sub-pixel convolution could be used as in Pixelshuffle [17], where the kernel size is divided by the stride. However, it is prone to introducing artifacts and difficult to achieve favorable defogging results when used alone, as observed empirically in our experiments. In contrast, the UIM can improve the results by leveraging the strengths of these three upscaling methods.

To preserve the detailed information from the input image, we build the LRC framework by establishing a long-range residual connection between the input and output layers. In addition, we extend the defogging process into three stages: coarse, finer, and finest. The latter two stages adopt the downscaled defogged image from the previous stage as additional input to enhance the defogging performance.

In summary, this paper presents the following contributions:

•

We propose an overwater scene-oriented unsupervised image defogging framework with a prior map that can highlight the objects on the water, based on the characteristic of the overwater images which often contain large sky and water areas.
•

We design a prior map guided GAN loss and a prior map guided cycle-consistency loss to facilitate the network in recovering the objects on the water. Experimental results show that the proposed prior map is not only applicable to overwater images, but also to land images.
•

We further propose an Upscaling Inception Module and a Long-range Residual Coarse-to-fine framework for mitigating the checkerboard artifacts and for recovering image details.

The organization of the paper is as follows. Section II reviews the relevant literature. Section III describes the details of the prior map guided CycleGAN. Section IV presents the experimental results and Section V concludes this paper.

II Related Work

In this section, we briefly review the prior-based and learning-based single image defogging methods, which are related to our work.

II-A Prior-Based Methods

The prior-based methods have been widely used in the past few years, via estimating the transmission maps and atmospheric light based on the statistics of natural images. Fattal et al. [18] formulated a refined image degradation model [4] that accounts for a shading factor and surface reflectance coefficients. Tan et al. [19] proposed an automated method based on two observations: clean images have a higher contrast than foggy images and the variation of airlight changes smoothly across small regions. He et al. [1] proposed a dark channel prior (DCP), which assumes that there is at least one channel in the RGB space that is close to zero in a clean natural image, to estimate transmission map and atmospheric light. Following He et al. [1], many works, such as [20][21], improved the dark channel prior for the image defogging problem. In addition, Zhu et al. [8] recover the depth information by building a linear model under the color attenuation prior, where the parameters of the linear model are learned with a supervised learning method. Fattal et al. [7] assert that pixels in local patches of images typically exhibit a one-dimensional distribution called color-lines. Similarly, Berman et al. [6] assume that a few hundred distinct colors can well approximate the colors of a clean image, and then perform image defogging based on this non-local prior. Based on the assumption that there is a linear relationship between the foggy and clean images in the minimum channel, Wang et al. [22] developed a computationally efficient defogging method. However, the accuracy of the estimated transmission map greatly affects the defogging effect. To address this problem, Liu et al. [23] reformulate the image defogging problem as a task of enhancing local visibility and global contrast that does not rely on atmospheric physical models. Although these methods show effectiveness in image defogging, they may fail when dealing with certain scenarios. For instance, dark channel prior [1] may fail for some scene objects (such as sky and white building) which are similar to the atmospheric light. Using the haze-line prior [6] for the dense fog may introduce color distortion. In comparison, we optimize the network by building the prior into the loss functions, and our data-driven model offers good generalization ability.

II-B Learning-Based Methods

With the advance in CNNs and the support of large-scale datasets, data-driven methods for image defogging have drawn significant attention in recent years. Cai et al. [24] exploited an end-to-end trainable CNN to estimate the transmission map in an atmospheric scattering model for recovering the clean image. Ren et al. [25] proposed a multi-scale neural network named MSCNN for image defogging, which consists of a coarse net and a fine net to predict the transmission map. Unlike the above methods using CNN for estimating the transmission map, Li et al. [2] reformulated the atmospheric scattering model to combine the two unknown variables, i.e., transmission map and atmospheric light, into one variable. Then, they leverage a lightweight CNN to estimate this variable. Instead of using physical models, Ren et al. [26] adopted a fusion-based strategy to obtain a confidence map by integrating white balance, contrast enhancing, and gamma correction, derived from the original foggy image, and then recovered the clean image by weighting these three maps with a confidence map. Liu et al. [27] remove fog by using a grid network constructed with three modules: pre-processing, backbone, and post-processing, which can effectively alleviate the bottleneck issue and help reduce the artifacts in the final output. Qu et al. [3] proposed an Enhanced Pix2pix Dehazing Network (EPDN) which contains two enhancing blocks in the tail of decoder to refine the intermediate output. Different from other GAN-based methods, the discriminator supervises only the intermediate result of EPDN. Dong et al. [28] proposed a multi-scale and dense connected network for image defogging, where the strengthen-operate-subtract boosting strategy is incorporated in the decoder, and the dense feature fusion module is used to extract the features from non-adjacent levels and to restore the spatial information. Qin et al. [29] also adopted a fusion strategy, where they applied channel attention and pixel attention after the high-level feature for giving more weight to important features. To efficiently extract the features from the input image, Yi et al. [30] proposed a Multi-scale Topological Network (MSTN) that uses a multi-scale feature fusion module and an adaptive feature selection module to achieve progressive image defogging. However, this multi-branch network architecture increases the complexity of the network and further increases the training time.

Although the supervised learning-based methods have achieved favorable performance, they are limited to synthetic images and perform poorly on real-world images because of the inherent domain shift introduced by the synthetic training set. Hence, some semi-supervised and unsupervised methods are integrated to enhance the real-world foggy images. Engin et al. [31] introduced an enhanced CycleGAN which adopt unpaired real-world images as training set. Li et al. [13] proposed a semi-supervised network by combining a supervised learning branch with an unsupervised learning branch. Shao et al. [14] proposed a domain adaptation paradigm by incorporating two image defogging modules and an image translation module to bridge the gap between the synthetic and real domains. Golts et al. [12] exploited a dark channel prior energy function, which no longer relies on paired images to calculate the loss for the optimization of the network. Zhao et al. [32] proposed a semi-supervised two stage network, which utilized a CNN to refine the defogged image of DCP [1]. Chen et al. [33] select three physical priors to constitute a loss committee, i.e., dark channel loss, bright channel loss, and contrast limited adaptive histogram equalization loss, that guides the unsupervised training.

The above mentioned semi-supervised and unsupervised methods offer promising results on real-world images, but little attention has been paid to the image defogging of overwater scenes. Recently, Zheng et al. [34] proposed an enhanced CycleGAN to tackle the overwater image defogging problem. Nevertheless, this method simply replaces the training set of land foggy images as used in other methods with overwater images. Its defogging performance is limited as this method does not leverage the properties of the overwater scenes. To this end, our method focuses on the characteristics of overwater foggy images, and improves the defogging performance and visual quality of the defogged images.

III Proposed Method

In this section, the proposed Prior map Guided CycleGAN (PG-CycleGAN) will be presented in detail. Fig. 2 illustrates the architecture of the PG-CycleGAN, it consists of two cycles: foggy-clean-foggy and clean-foggy-clean. In the foggy-clean-foggy cycle, the defogging network $G$ takes the foggy image $x$ as input and learns the residual from it to get the defogged image, where the network utilizes three upscaling methods: deconvolution, bilinear interpolation, and PixelShuffle [17] when decoding the intermediate features. Furthermore, the defogging procedure is split into three stages: the coarse, finer, and finest to enhance the defogging performance. Then, the discriminator $D_{Y}$ is used to distinguish between the clean image $y$ and the defogged image $G(x)$ to get the reconstructed foggy image $F(G(x))$ . Finally, the proposed prior map is used to weight the losses for updating the networks for the recovery of the objects on water. The clean-foggy-clean cycle is similar to the foggy-clean-foggy cycle.

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

(m)

(n)

(o)

(p)

(q)

(r)

(s)

(t)

(u)

(v)

(w)

(x)

(y)

(z)

(aa)

(ab)

(ac)

(ad)

(ae)

(af)

(ag)

(ah)

(ai)

(aj)

Figure 3: Difference on overwater images between the dark channel (DC), inverted dark channel (IDC), IDC after gamma correction (gamma=3), IDC after histogram equalization (HE), and prior map. The top three rows present the comparison on foggy images, and the bottom three rows present the comparison on clean images. Here the dark channels are without performing minimum filtering.

(a) Input (b) DC (c) IDC (d) Gamma (e) HE (f) Prior map

III-A Prior Map

As He et al. [1] claimed, the dark channel prior is invalid when the scene objects are similar to the atmospheric light. This is also reflected in Fig. 3 (b). It can be seen that the pixel value of the water and sky is not close to 0 in clean images.

A characteristic of the overwater scenes is that the sky and water usually occupy most of the area. In dark channels, they tend to be brighter than objects on the water because their pixel values are generally close to the atmospheric light. Based on this feature, the objects will be highlighted when the dark channel of the overwater image is inverted, as shown in Fig. 3 (c). To emphasize this feature, we try to enhance the contrast of the inverted dark channel. Fig. 3 shows the difference between the dark channel (DC), inverted dark channel (IDC), IDC after gamma correction, IDC after histogram equalization (HE), and IDC after min-max normalization (i.e. the prior map). It can be seen that though the contrast is enhanced, the gamma correction may lower the pixel value of the objects, and the histogram equalization may amplify background noise. In contrast, the min-max normalization maintains the highlight of the object while reducing noise. Fig. 4 is the intensity histogram over 4,531 inverted dark channels of an overwater image dataset [34] (details described in Sec. IV-A; the images are rescaled to 256 $\times$ 256 for computational efficiency). It can be seen that the min-max normalization stretches the pixel values in the middle region while keeping the sky regions darker and the objects brighter, this statistic provides strong support to our proposed prior map.

Specifically, given a foggy image $I$ whose pixel value is in the range of [0, 1], its corresponding dark channel can be expressed as:

I_{dc}=\mathop{min}\limits_{y_{i}\in N(x_{i})}(\mathop{min}\limits_{c\in\{r,g,b\}}I^{c}(y_{i})),

(2)

where $x_{i}$ and $y_{i}$ are pixel coordinates of $I$ , $I^{c}$ denotes the $c$ -th color channel of $I$ , and $N(x_{i})$ denotes the local neighborhood centered at $x_{i}$ . To obtain the prior map, we first remove the minimum filtering $\mathop{min}\limits_{y_{i}\in N(x_{i})}(\cdot)$ from the procedure for dark channel calculation, and invert the pixel value as follows:

I_{idc}=1-\mathop{min}\limits_{c\in\{r,g,b\}}I^{c},

(3)

where $I_{idc}$ represents the inverted dark channel. The reasons for removing the minimum filter are two. One is that it reduces the required computational resources, as performing the sliding window algorithm is time-consuming. The other is that the minimum filter may lead to unclear boundaries of the objects in the defogging results because it exploits all the surrounding pixels.

Then, the min-max normalization is performed on the $I_{idc}$ to get the prior map,

I_{p}=\frac{I_{idc}-min(I_{idc})}{max(I_{idc})-min(I_{idc})}.

(4)

After getting the prior map, we further define the prior map guided cycle-consistency loss:

L_{pgcyc}=||(F(G(x))-x)I_{p}^{foggy}||_{1}\\ +||(G(F(y))-y)I_{p}^{clean}||_{1},

(5)

and the prior map guided GAN loss:

L_{pgG}=\mathbb{E}_{y\sim p_{data}(y)}[(D_{Y}(y)-1)^{2}]\\ +\mathbb{E}_{x\sim p_{data}(x)}[D_{Y}(G(x)I_{p}^{foggy})^{2}],

(6)

where $x$ and $y$ are foggy image and clean image, $I_{p}^{foggy}$ and $I_{p}^{clean}$ are the corresponding prior maps, and $||\cdot||_{1}$ is an $L_{1}$ norm. In $L_{pgcyc}$ and $L_{pgG}$ , the loss of objects is increased and the loss of sky is reduced, as a result, the network is guided to recover the objects.

III-B Network Architecture

The details of the defogging network $G$ are shown in Fig. 5. We employ an encoder-decoder structure that includes an encoder for downscaling, a feature transformation module for enhancing the representation capability, and a decoder that consists of two UIMs for upscaling the feature maps while reducing the loss of details. In the forward propagation phase, the network learns a residual from the input image and enhance the defogging effect through the coarse-to-fine framework.

In the encoder, the first layer has 64 filters with a kernel size 7 $\times$ 7 and a stride of 1, the second and the third layers have 128 and 256 filters, respectively, with a kernel size of 3 $\times$ 3 and a stride of 2. The feature transformation module contains 9 residual blocks [35] to enhance the representation capability of the network. The decoder cascades two upscaling inception modules and two convolutional layers. The first convolutional layer in the decoder has 64 filters with a kernel size 3 $\times$ 3 and a stride of 1, the second convolutional layer has 3 filters, each with a kernel size of 7 $\times$ 7 and a stride of 1. It should be noted that the architecture of the fogging net $F$ is the same as $G$ .

1) Upscaling Inception Module: As the name suggests, the UIM is inspired by the Inception module [16], it is a combination of three parallel upscaling branches: deconvolution, bilinear interpolation, and PixelShuffle [17], with their output filter banks concatenated into a single output. As shown in Fig. 5, the 1 $\times$ 1 convolution is applied before the deconvolution and bilinear interpolation, which offers a channel-wise pooling to decrease the number of feature maps whilst retaining their salient features. Due to the bilinear interpolation being a fixed upscaling method, we cascade two 3 $\times$ 3 convolutional layers after it to enhance its representation ability, while the other two branches cascade one 3 $\times$ 3 convolutional layer after the upscaling layer. In the two UIMs in the decoder, the number of feature maps of the final 3 $\times$ 3 convolutional layer of the deconvolution, bilinear interpolation, and PixelShuffle [17] are (64, 32, 32) and (32, 16, 16) respectively. After obtaining all the upscaled feature maps, we adopt the Squeeze-and-Excitation (SE) layer [36] to selectively emphasize informative features and suppress less useful ones, which consists of a global average pooling layer to generate channel-wise statistics and a bottleneck layer with two fully connected (FC) layers.

2) Long-range Residual Coarse-to-fine framework: The two complementary mechanisms: namely, the long-range residual connection and coarse-to-fine framework are adopted in our network for enhancing the defogging performance. As shown in Fig. 5, we first add a long-range residual connection between the input foggy image and the final output layer, which enables the network to preserve detailed information by learning the residual from the input foggy image. Nonetheless, this will also limit the representation capability of the network. Thus, we extend the defogging procedure into three stages, namely, the coarse, finer, and finest stage. As shown in Fig. 6, except that the coarse stage takes the same two 1/4 downscaled foggy images as input, in the finer and finest stage, respectively, the 1/2 and full-size defogged images generated by the previous stage are integrated as additional information to enhance the defogging performance. As a consequence of these two mutual compensations, the final defogged image retains the detailed information of the input image while also having a desired defogging effect.

3) Discriminator: In our network, we have two discriminators. As shown in Fig. 2, $D_{X}$ is used to distinguish between the input foggy images and the generated foggy images from the generator $F$ ; in the same way, $D_{Y}$ is used to discriminate between the generated clean images from $G$ and the input clean images. We adopt PatchGAN [37] discriminator as $D_{X}$ which consists of 5 convolutional layers. Each layer has the same kernel size 4 $\times$ 4 with a stride of 2, and the number of filters is 64, 128, 256, 512, 1 from the lowest to highest.

For the discriminator $D_{Y}$ , we employ the pixel level rather than the patch level discriminator, since the size of the prior map is equal to the input image, and the former produces an array of the same size as the input image, while the latter produces an array with reduced size. The discriminator $D_{Y}$ consists of 3 convolutional layers, where each layer has the same kernel size 1 $\times$ 1 with a stride of 1, and the number of filters is 64, 128, 1 from the lowest to highest. After the output is obtained from $D_{Y}$ , it is further weighted by the prior map with a high penalty for the objects.

III-C Loss Function

The loss function of PG-CycleGAN includes 4 items, i.e., the prior map guided cycle-consistency loss $L_{pgcyc}$ , the prior map guided GAN loss $L_{pgG}$ , the cycle-consistency loss $L_{cyc}$ , and the perceptual loss $L_{VGG}$ .

Prior map guided cycle-consistency loss and prior map guided GAN loss are defined in Equations (5) and (6), they give higher penalties to objects on the water and lower penalties to the sky.

Cycle-consistency loss aims to minimize the objective between the foggy images $X$ and its reconstructed foggy images $F(G(x))$ , which can be expressed as:

L_{cyc}=||F(G(x))-x||_{1}+||G(F(y))-y||_{1}.

(7)

Perceptual loss is also used to enhance the textural information of the recovery image, which is based on VGG16 [38] to further constrain the generators, defined as:

L_{VGG}=||\phi(x)-\phi(F(G(x)))||_{2}^{2}\\ +||\phi(y)-\phi(G(F(y)))||_{2}^{2},

(8)

where $\phi$ represents the feature maps of the second and fifth pooling layer of the VGG16 network, and $||\cdot||_{2}$ is an $L_{2}$ norm.

Overall loss function. Combining all the above losses, the whole loss function is formulated as,

L=\lambda_{1}(L_{cyc}+L_{pgcyc}+L_{VGG})+\lambda_{2}L_{pgG}.

(9)

where $\lambda_{i}$ ( $i=1,2$ ) is a hyperparameter indicating the weight. In our empirical experiments, it is found that the network is sensitive to $L_{pgG}$ which can significantly improve the defogging performance after fine-tuning. Thus, we set $\lambda_{1}$ to 1 which can stabilize the training, and set $\lambda_{2}$ to a value that is slightly larger 2.

IV Experiments and Discussions

In this section, we evaluate the qualitative and quantitative results of our proposed method against six state-of-the-art methods, including the prior-based method DCP [1], the supervised method MSBDN [39], the semi-supervised and unsupervised methods, e.g. DAD [14], PSD [33], RefineDNet [32], and SLAD [40].

(a)

(b)

(c)

(d)

(e)

(f)

(g)

2.84 / 0.80 0.75 / 0.58 0.93 / 0.78 2.30 / 0.77 0.79 / 0.74 1.23 / 0.64 0.81 / 0.80 0.92 / 0.60

4.88 / 0.73 0.86 / 0.34 1.51 / 0.48 2.85 / 0.61 1.48 / 0.49 0.88 / 0.35 0.96 / 0.41 0.83 / 0.33

2.90 / 0.56 1.31 / 0.35 1.13 / 0.45 1.96 / 0.49 0.82 / 0.33 0.98 / 0.37 0.86 / 0.35 0.54 / 0.26

1.66 / 0.41 0.91 / 0.24 0.65 / 0.35 0.98 / 0.30 0.48 / 0.22 0.60 / 0.24 0.92 / 0.30 0.36 / 0.16

4.49 / 0.72 1.31 / 0.32 2.38 / 0.64 3.50 / 0.68 1.31 / 0.58 1.28 / 0.44 1.51 / 0.54 1.21 / 0.49

2.99 / 0.57 1.11 / 0.23 1.31 / 0.50 2.17 / 0.52 0.73 / 0.29 0.74 / 0.28 1.16 / 0.37 0.82 / 0.29
(a) (b) (c) (d) (e) (f) (g) (h)

TABLE I: Comparison of training set sizes for different defogging methods

	Samples
	Image pairs	Unpaired foggy	Unpaired clean
DAD	6000	1000	-
RefineDNet	13,990	2903	3557
MSBDN	16,000	-	-
Ours	-	2090	2441

IV-A Datasets

In this work, we use unpaired real-world images for training instead of the widely used paired synthetic images. Table I shows the comparison of training set sizes for different defogging methods. Unlike other methods that employ a large number of paired and unpaired images for training, our training set only contains 2,090 unpaired foggy and 2,441 clean overwater images. The used dataset is proposed by Zheng et al. [34], which also contains 188 real-world overwater foggy images for testing.

To demonstrate the applicability of PG-CycleGAN in wider contexts, two land image datasets O-HAZE [41] and NH-HAZE [42] are also employed as test sets. O-HAZE [41] contains 45 different outdoor scenes depicting the same visual content recorded in haze-free and hazy conditions, under the same illumination parameters. NH-HAZE [42] is a non-homogeneous real dataset with pairs of real hazy and corresponding haze-free images, from 55 outdoor scenes.

IV-B Implementation Details

In our network, all the training samples are resized to 512 $\times$ 512, and then randomly cropped to 256 $\times$ 256 for data augmentation. During training, ADAM optimizer [43] is employed for the generators and discriminators with a learning rate of 0.0001 and batch size of 1. We adopt cosine annealing schedule where the initial learning rate is decreased to 0 after 200 epochs according to the cosine function. The network is trained on Pytorch deep learning platform with the acceleration of an Nvidia GeForce RTX 3090 GPU, and it takes about 30 hours for training.

IV-C Comparison With State-of-the-Art Methods

We compare the performance of the PG-CycleGAN with several state-of-the-art defogging algorithms as mentioned earlier, and a set of experiments including qualitative and quantitative on overwater and land images is conducted.

1) Qualitative comparisons on overwater images: In this experiment, we evaluate the visual quality of the defogged images obtained by PG-CycleGAN on real-world overwater images.

Fig. 7 shows the results of different methods on overwater images. We can observe that, in general, the resulting images given by our method has higher brightness than those by DCP [1]. This is because of the inaccurate estimation of atmospheric light with the DCP method. The results of the DAD [14] often suffer from color distortion and unrealistic looking. Due to the domain gap between the synthetic and real-world foggy images, images restored by MSBDN [39] remain somewhat foggy, especially in distant areas. PSD [33] sometimes fails to handle the overwater scene, which leads to severe color shifting. Although RefineDNet [32] shows pretty good results, it tends to darken the images, and the sky regions in the 5th row of Fig. 7 are blurred. This may be due to the drawback that the RefineDNet [32] adopts the DCP [1] in the first stage to restore visibility, while DCP is difficult to restore sky regions of the foggy images. As can be seen in the 4th and 6th rows of Fig. 7, the defogged images of the self-supervised method SLAD [40] also remain somewhat foggy.

2) Qualitative comparisons on land images: In this experiment, we evaluate the visual quality of the defogged images by PG-CycleGAN on land images.

Figs. 8 and 9 show the results of different methods on real-world land images on the O-HAZE dataset. From the 4th row of Fig. 8, we can observe that with the DCP method, fog around the leaves still remains. The MSBDN, PSD, and SLAD degrade substantially when dealing with thick fog, as shown in Fig. 9, where certain amount of fog still remains. Although DAD can remove the fog well, the resulting images have severe color distortion. Due to the use of the DCP in the first defogging stage, the resulting images by RefineDNet are too dark as compared to the ground-truth.

In comparison, our network can generate cleaner images from the overwater and land foggy images. As demonstrated in the zoom-in regions in Fig. 7, e.g. the 1st and 2nd rows, the crowd and cars are well recovered due to the prior map guided losses. Similarly, as shown in the 3rd row of Fig. 7, other methods neglect the fine rope, yet our model can still effectively defog it. Furthermore, the 4th-6th rows in Fig. 7 show that our results have less remaining fog and artifacts. From Fig. 9, we can observe that the results of PG-CycleGAN are closer to the ground-truth than other methods. Considering those results together, we can see that PG-CycleGAN restores more details and obtains images that are visually clearer.

3) Quantitative comparisons on overwater images: Because the overwater images in the test set have no corresponding clean images, full-reference image quality assessment metrics are not suitable for our experiments. Recently, some non-reference image quality assessment metrics [44][45][46][47] have been proposed for quantitatively evaluating defogging performance. We adopt three non-reference evaluation criteria: FADE [44], HazDesNet [47], and BAVE [46], in which FADE and HazDesNet are designed to evaluate the defogging performance by predicting the density of the fog. Lower values of the FADE and HazDesNet denote better defogging performance. The BAVE is an evaluator of visibility enhancement that contains three indicators, which are the rate of visible edges $e$ , the quality value of the contrast restoration $r$ , and the normalized saturate value of pixels $\sigma$ . For a defogged image, the higher values of $e$ and $r$ , and the lower value of $\sigma$ denote the better quality of the recovered image.

A quantitative comparison of the test set which contains 188 real-world overwater images is reported in Table II. It can be seen that our method gives the best results in both FADE and the HazDesNet, which demonstrate that the PG-CycleGAN can remove the fog more effectively. Although PG-CycleGAN is ranked the second in terms of visible edges $e$ and the quality value of the contrast restoration $r$ , and the third in terms of the indicator $\sigma$ , it is not far from the best results. Combining all the performance indicators, it can be argued that our results have less fog remaining and better visual quality.

TABLE II: Quantitative comparison with state-of-the-art methods on 188 real-world overwater images.

	FADE $\downarrow$	HazDesNet $\downarrow$	$e$ $\uparrow$	$\sigma(\%)$ $\downarrow$	$r$ $\uparrow$
DCP	1.08	$0.33^{2}$	2.85	0.006	1.68
DAD	1.36	0.49	2.03	$0.000^{1}$	2.18
MSBDN	2.28	0.53	1.01	0.087	1.50
PSD	$1.01^{1}$	0.41	2.36	1.146	$3.57^{1}$
RefineDNet	1.06	0.36	$3.94^{1}$	$0.001^{1}$	2.16
SLAD	1.07	0.40	3.30	0.037	2.31
Ours	$0.75^{1}$	$0.32^{1}$	$3.81^{2}$	0.005	$2.94^{2}$

We further present the fog density map predicted by HazDesNet in Fig. 10. It can be seen that benefiting from the prior map guided losses, the fog density map of our result is closer to dark blue compared to other methods, which indicates that our method has better defogging performance.

TABLE III: Quantitative comparison with state-of-the-art methods on the O-HAZE and NH-HAZE datasets.

	O-HAZE		NH-HAZE
	SSIM	PSNR	SSIM	PSNR
DAD	$0.6948^{1}$	$18.21^{1}$	0.5603	$13.92^{2}$
MSBDN	0.6109	16.64	0.4976	12.60
PSD	0.5799	12.96	$0.5604^{2}$	11.61
RefineDNet	0.6749	17.06	0.5334	13.04
SLAD	0.6420	14.48	0.5422	11.79
Ours	$0.6755^{2}$	$17.14^{2}$	$0.5992^{1}$	$13.19^{2}$

4) Quantitative comparisons on land images: In this experiment, we adopt the full-reference image quality assessment metrics SSIM and PSNR for comprehensively evaluating the performance of our proposed PG-CycleGAN.

Table III gives the quantitative comparison on the O-HAZE and NH-HAZE datasets. We can observe that although the PG-CycleGAN is trained on overwater images, it achieves the best performance in SSIM on the NH-HAZE dataset, and ranked the second in terms of PSNR on both the O-HAZE and NH-HAZE datasets. Furthermore, as shown in Fig. 8, the values of FADE and HazDesNet of PG-CycleGAN are lower than other methods, which also demonstrates that our method leads to better quality images with less fog remaining than other methods.

Overall, the quantitative and qualitative comparisons show that our proposed method not only generates images with better visual effects on overwater images, but also offers excellent defogging performance for land images.

V Analysis and Discussion

V-A Effectiveness of the PG-CycleGAN

In order to verify the effectiveness of the LRC, UIM, and prior map guided losses, we conduct a series of ablation studies to analyze our method.

Fig. 11 illustrates the effectiveness of the LRC. After the coarse-to-fine framework is removed (as shown in Fig. 11 (b)), we can see that although learning the residual from the foggy input can retain detailed information, the results tend to be insufficient defogging. In addition, due to the limited representation power of the network, the artifacts are also introduced. When the long-range residual connection is removed (as shown in Fig. 11 (c)), the defogging results show color shift. After the long-range residual connection and coarse-to-fine framework are both removed, the results may include color shift and heavy fog, and the results of CycleGAN also show the clear color shifts. The above discussion shows that the combination of the long-range residual connection and coarse-to-fine framework can result in a good complementarity that facilitates the restoration of detailed information and provides a desirable defogging effect.

In Fig. 12 (b)-(e), the results are generated by using deconvolution, bilinear interpolation, PixelShuffle [17], and UIM as the upscaling method, respectively. It can be seen that the zoom-in regions in Fig. 12 (b), (d), and (e) contain more fog than Fig. 12 (f), and the images in Fig. 12 (c) contain color distortion whose color tends to be blue. This figure shows that PG-CycleGAN eliminates color distortion and offers better defogging effect as compared with the baseline methods.

For evaluating the effectiveness of the prior map guided losses, we conduct an ablation study involving the following three control groups: 1) not using the prior map guided cycle-consistency loss, 2) adopting PathGAN [37] discriminator which is widely used in image generation tasks instead of regular discriminator, 3) using regular discriminator but not imposing the prior map. The visual results are shown in Fig. 13, which shows that with the absence of the prior map guided cycle-consistency loss, the resulting color tends to be blue. In Fig. 13 (c), some regions are blurred and the defogging results are very artificial. Furthermore, with the absence of the prior map, the regular discriminator cannot distinguish between real and generated images, which leads to undesired green regions in the defogging results as shown in Fig. 13 (d). We can see that in Fig. 13 (e), the results of CycleGAN show insufficient defogging. On the contrary, as shown in Fig. 13 (f), the PG-CycleGAN optimized with the prior map offers better visual effects, e.g., the objects are finely recovered.

Overall, as illustrated in Figs. 11-13, the LRC restores detailed information while also providing a desirable defogging effect. The UIM integrates the strengths of three upscaling branches and generates cleaner and visually more pleasing results. The prior map guided losses enhance the recovery of the objects on the water. Table IV quantitatively compares the contribution of each part. When either part is removed, both FADE and HazDesNet have increased, i.e., the defogging performance deteriorates. This further demonstrates the effectiveness of our proposed method.

Table V gives the comparison of inference times on a 512 $\times$ 512 image between the CycleGAN [15], the PG-CycleGAN with deconvolution instead of UIM, and the PG-CycleGAN. It can be observed that UIM does not significantly slow down the inference speed. This is benefiting from the 1x1 convolution, which reduces the dimension before upscaling feature maps.

TABLE IV: Effectiveness of the PG-CycleGAN

	FADE $\downarrow$	HazDesNet $\downarrow$
w/ Deconv	1.0522	0.3836
w/ Bilinear	1.2644	0.3727
w/ PixelShuffle	0.7839	0.3743
w/o $L_{pgcyc}$	1.0411	0.3728
w/ PatchGAN D	0.9966	0.3972
w/ regular D	0.9472	0.3725
CycleGAN	1.5970	0.6032
PG-CycleGAN	0.7544	0.3240

TABLE V: Comparison of inference time

	Time(s)
CycleGAN	0.118
w/ Deconv	0.164
PG-CycleGAN	0.176

V-B Limitations

As with previous methods, there are limitations in the proposed method. As shown in Fig. 14, when the local area in the original input image is near to white, which usually happens on white hulls, the pixel in the corresponding prior map will tend to 0. The reason is that if all the pixels of the three channels in the input possess a large value, the local region will show brightly in the dark channel, and then appear dark after the dark channel is inverted. This may lead to suboptimal solutions in network optimization, as it cannot give the objects a high penalty. The following solutions could be used to overcome this problem, i.e., refining the weights of the prior map guided losses, or adding an extra PatchGAN [37] discriminator to optimize the model.

VI Concluding Remarks

In this paper, we have presented a prior map guided cycle-consistent adversarial network for overwater image defogging. We have proposed a method for estimating the prior map and incorporated it in two losses, imposing a greater penalty on objects on the water. Due to the fact that the unpaired training set may lead to color distortion, we have proposed two complementary methods to improve the defogging results: the long-range residual connection and coarse-to-fine framework. Moreover, an intuitive and effective upscaling inception module is exploited, by combining the strengths of three upscaling branches: deconvolution, bilinear interpolation, and PixelShuffle [17], to generate images of improved quality. Extensive experiments on the qualitative and quantitative comparisons demonstrate the advantages of PG-CycleGAN against several state-of-the-art methods. Furthermore, benefiting from the feature that the proposed prior map can highlight objects on water and distinguish them from the background, the prior map may help in other research directions, such as maritime object detection, tracking, and segmentation.

\AtNextBibliography

References

[1] Kaiming He, Jian Sun and Xiaoou Tang “Single image haze removal using dark channel prior” In IEEE transactions on pattern analysis and machine intelligence 33.12 IEEE, 2010, pp. 2341–2353
[2] Boyi Li et al. “AOD-Net: All-in-One Dehazing Network” In 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4780–4788
[3] Yanyun Qu, Yizi Chen, Jingying Huang and Yuan Xie “Enhanced Pix2pix Dehazing Network” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8152–8160
[4] A. Cantor “Optics of the atmosphere–Scattering by molecules and particles” In IEEE Journal of Quantum Electronics 14.9, 1978, pp. 698–699
[5] Shree K Nayar and Srinivasa G Narasimhan “Vision in bad weather” In Proceedings of the seventh IEEE international conference on computer vision 2, 1999, pp. 820–827 IEEE
[6] Dana Berman and Shai Avidan “Non-local image dehazing” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1674–1682
[7] Raanan Fattal “Dehazing using color-lines” In ACM transactions on graphics (TOG) 34.1 ACM New York, NY, USA, 2014, pp. 1–14
[8] Qingsong Zhu, Jiaming Mai and Ling Shao “A fast single image haze removal algorithm using color attenuation prior” In IEEE transactions on image processing 24.11 IEEE, 2015, pp. 3522–3533
[9] Ian Goodfellow et al. “Generative adversarial nets” In Advances in neural information processing systems 27, 2014
[10] Xitong Yang, Zheng Xu and Jiebo Luo “Towards perceptual image dehazing by physics-based disentanglement and adversarial training” In Proceedings of the AAAI Conference on Artificial Intelligence 32.1, 2018
[11] Jiafeng Li et al. “USID-Net: Unsupervised Single Image Dehazing Network via Disentangled Representations” In IEEE Transactions on Multimedia, 2022, pp. 1–1 DOI: 10.1109/TMM.2022.3163554
[12] Alona Golts, Daniel Freedman and Michael Elad “Unsupervised single image dehazing using dark channel prior loss” In IEEE Transactions on Image Processing 29 IEEE, 2019, pp. 2692–2701
[13] Lerenhan Li et al. “Semi-supervised image dehazing” In IEEE Transactions on Image Processing 29 IEEE, 2019, pp. 2766–2779
[14] Yuanjie Shao et al. “Domain adaptation for image dehazing” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2808–2817
[15] Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A Efros “Unpaired image-to-image translation using cycle-consistent adversarial networks” In Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232
[16] Christian Szegedy et al. “Going deeper with convolutions” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9
[17] Wenzhe Shi et al. “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883
[18] Raanan Fattal “Single image dehazing” In ACM transactions on graphics (TOG) 27.3 ACM New York, NY, USA, 2008, pp. 1–9
[19] Robby T Tan “Visibility in bad weather from a single image” In 2008 IEEE conference on computer vision and pattern recognition, 2008, pp. 1–8 IEEE
[20] Jean-Philippe Tarel and Nicolas Hautiere “Fast visibility restoration from a single color or gray level image” In 2009 IEEE 12th international conference on computer vision, 2009, pp. 2201–2208 IEEE
[21] Gaofeng Meng et al. “Efficient image dehazing with boundary constraint and contextual regularization” In Proceedings of the IEEE international conference on computer vision, 2013, pp. 617–624
[22] Wencheng Wang, Xiaohui Yuan, Xiaojin Wu and Yunlong Liu “Fast Image Dehazing Method Based on Linear Transformation” In IEEE Transactions on Multimedia 19.6, 2017, pp. 1142–1155 DOI: 10.1109/TMM.2017.2652069
[23] Xiaoning Liu, Hui Li and Ce Zhu “Joint Contrast Enhancement and Exposure Fusion for Real-World Image Dehazing” In IEEE Transactions on Multimedia, 2021, pp. 1–1 DOI: 10.1109/TMM.2021.3110483
[24] Bolun Cai et al. “Dehazenet: An end-to-end system for single image haze removal” In IEEE Transactions on Image Processing 25.11 IEEE, 2016, pp. 5187–5198
[25] Wenqi Ren et al. “Single image dehazing via multi-scale convolutional neural networks” In European conference on computer vision, 2016, pp. 154–169 Springer
[26] Wenqi Ren et al. “Gated fusion network for single image dehazing” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3253–3261
[27] Xiaohong Liu, Yongrui Ma, Zhihao Shi and Jun Chen “Griddehazenet: Attention-based multi-scale network for image dehazing” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7314–7323
[28] Hang Dong et al. “Multi-scale boosted dehazing network with dense feature fusion” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2157–2167
[29] Xu Qin et al. “FFA-Net: Feature fusion attention network for single image dehazing” In Proceedings of the AAAI Conference on Artificial Intelligence 34.07, 2020, pp. 11908–11915
[30] Qiaosi Yi et al. “Efficient and Accurate Multi-scale Topological Network for Single Image Dehazing” In IEEE Transactions on Multimedia, 2021, pp. 1–1 DOI: 10.1109/TMM.2021.3093724
[31] Deniz Engin, Anil Genç and Hazim Kemal Ekenel “Cycle-dehaze: Enhanced cyclegan for single image dehazing” In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 825–833
[32] Shiyu Zhao, Lin Zhang, Ying Shen and Yicong Zhou “RefineDNet: a weakly supervised refinement framework for single image dehazing” In IEEE Transactions on Image Processing 30 IEEE, 2021, pp. 3391–3404
[33] Zeyuan Chen, Yangchao Wang, Yang Yang and Dong Liu “PSD: Principled synthetic-to-real dehazing guided by physical priors” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7180–7189
[34] Shunyuan Zheng et al. “Overwater Image Dehazing via Cycle-Consistent Generative Adversarial Network” In Proceedings of the Asian Conference on Computer Vision, 2020
[35] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep residual learning for image recognition” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
[36] Jie Hu, Li Shen and Gang Sun “Squeeze-and-excitation networks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141
[37] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A Efros “Image-to-image translation with conditional adversarial networks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134
[38] Karen Simonyan and Andrew Zisserman “Very deep convolutional networks for large-scale image recognition” In arXiv preprint arXiv:1409.1556, 2014
[39] Hang Dong et al. “Multi-Scale Boosted Dehazing Network With Dense Feature Fusion” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
[40] Yudong Liang et al. “Self-supervised Learning and Adaptation for Single Image Dehazing” In Proceedings of the 31st International Joint Conference on Artificial Intelligence (IJCAI-22), 2022, pp. 1137–1143
[41] Codruta O. Ancuti, Cosmin Ancuti, Radu Timofte and Christophe De Vleeschouwer “O-HAZE: a dehazing benchmark with real hazy and haze-free outdoor images” In IEEE Conference on Computer Vision and Pattern Recognition, NTIRE Workshop, NTIRE CVPR’18, 2018
[42] Codruta O. Ancuti, Cosmin Ancuti and Radu Timofte “NH-HAZE: An Image Dehazing Benchmark with Non-Homogeneous Hazy and Haze-Free Images” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE CVPR 2020, 2020
[43] Diederik P Kingma and Jimmy Ba “Adam: A method for stochastic optimization” In arXiv preprint arXiv:1412.6980, 2014
[44] Lark Kwon Choi, Jaehee You and Alan Conrad Bovik “Referenceless prediction of perceptual fog density and perceptual image defogging” In IEEE Transactions on Image Processing 24.11 IEEE, 2015, pp. 3888–3901
[45] Tuxin Guan et al. “Visibility and Distortion Measurement for No-Reference Dehazed Image Quality Assessment via Complex Contourlet Transform” In IEEE Transactions on Multimedia, 2022, pp. 1–17 DOI: 10.1109/TMM.2022.3168438
[46] Nicolas Hautiere, Jean-Philippe Tarel, Didier Aubert and Eric Dumont “Blind contrast enhancement assessment by gradient ratioing at visible edges” In Image Analysis & Stereology 27.2, 2008, pp. 87–95
[47] Jiahe Zhang et al. “HazDesNet: An End-to-End Network for Haze Density Prediction” In IEEE Transactions on Intelligent Transportation Systems 23.4, 2022, pp. 3087–3102