
S3Net: A Single Stream Structure for Depth Guided Image Relighting

Hao-Hsiang Yang1∗    Wei-Ting Chen1,2∗    and Sy-Yen Kuo3
1 ASUS Intelligent Cloud Services
   Asustek Computer Inc    Taipei    Taiwan
2 Graduate Institute of Electronics Engineering
   National Taiwan University    Taipei    Taiwan
3 Department of Electrical Engineering
   National Taiwan University    Taipei    Taiwan.
(islike8399, jimmy3505090)@gmail.com, [email protected]
https://github.com/dectrfov/NTIRE-2021-Depth-Guided-Image-Any-to-Any-relighting
Abstract

Depth guided any-to-any image relighting aims to generate a relit image from an original image and the corresponding depth map to match the illumination setting of a given guided image and its depth map. To the best of our knowledge, this task is a new challenge that has not been addressed in the previous literature. To address this issue, we propose a deep learning-based Single Stream Structure network, called S3Net, for depth guided image relighting. This network is an encoder-decoder model. We concatenate all images and corresponding depth maps as the input and feed them into the model. The decoder part contains the attention module and the enhanced module to focus on the relighting-related regions in the guided images. Experiments performed on a challenging benchmark show that the proposed model achieves the $3^{rd}$ highest SSIM in the NTIRE 2021 Depth Guided Any-to-any Relighting Challenge.

*Equally-contributed first authors.

1 Introduction

The objective of this paper is to address the depth guided any-to-any relighting task. That is, an input image with a certain color temperature and light source position is relit to match the illumination setting of another guided image. An example is shown in Fig. 1: Fig. 1 (a) is the original source image, and Fig. 1 (c) is the guided image. The depth maps of the two images are also provided, as shown in Fig. 1 (b) and Fig. 1 (d), respectively. The goal of relighting is to generate a novel result, as shown in Fig. 1 (f), based on the content of the source image and the lighting condition of the guided image. Fig. 1 (e) is the result estimated by our proposed approach. To the best of our knowledge, this task is a new challenge that has not been addressed in the previous literature. Image relighting is an emerging and crucial technology owing to its applications in visualization, image editing and augmented reality (AR). For example, any-to-any relighting can be used to render images with various ambient lighting conditions for first- and third-person gaming.

Figure 1: An example of any-to-any relighting. (a) Original image. (b) Original depth map. (c) Guided image. (d) Guided depth map. (e) Relit image by our method. (f) Ground truth.

The input of any-to-any image relighting is an original image and a guided image, so the task can be seen as an application of style transfer [1]. That is, two images with different ambient conditions can be regarded as images of different styles. However, there are two inherent differences between style transfer and image relighting. First, to generate an image with different ambient conditions, it is necessary to synthesize shadows in the relit image. Second, the shadows in the original image need to be removed. Style transfer, on the other hand, focuses on texture rendering. To address this issue, many deep learning-based image relighting algorithms [2, 3, 4, 5, 6] have recently been proposed, since deep convolutional neural networks (CNNs) have achieved remarkable success in many computer vision tasks. These methods design neural networks trained in an end-to-end manner to directly generate relit images without assuming any physical priors. Inspired by these approaches, in this paper, we also propose a deep learning network to tackle depth guided any-to-any relighting.

Different from conventional image relighting tasks [7], the NTIRE 2021 Depth Guided Any-to-any Relighting Challenge provides additional depth maps, which help the model learn spatial representations. To extract RGB-image and depth features effectively, the most popular approach is to use a dual-stream backbone [8, 9]. However, it is difficult to design dual backbones to extract the image and depth features here because the amount of input information may cause a huge computational burden. Specifically, in the NTIRE 2021 Depth Guided Any-to-any Relighting Challenge [10], the sizes of both the images and the depth maps are 1024 × 1024, and the input contains two images and two depth maps. Therefore, in this paper, we design a single stream structure network (S3Net) for depth guided any-to-any relighting. This network is an encoder-decoder model. All images and their corresponding depth maps are concatenated as the input and passed through our network to render the relit image.

To design an efficient network for image relighting, several modules can be utilized. For example, multi-scale feature extractors [11, 12] can be used to increase the receptive field and integrate coarse-to-fine representations. This module is necessary because relit images contain objects at various scales. Additionally, attention mechanisms are widely used in various tasks such as image enhancement [13, 14] and machine translation [15]. An attention mechanism assigns weights to feature maps so that the features of important regions or locations are magnified. Because the relit image contains directional information (e.g., Fig. 1 (a)), the attention mechanism helps the model learn directional representations. Therefore, these two modules are selected and integrated into the decoder to focus on the relighting-related regions in the guided images. Besides the model structure, the objective functions also impact the overall performance. We incorporate the discrete wavelet transform (DWT) into a multi-scale loss function to optimize our model so that it can relight both the global ambient conditions and the detailed structure.

We summarize the contributions of this paper as follows.

  1.

    We propose a Single Stream Structure network (S3Net) for depth guided any-to-any relighting. We apply a single stream network to extract the image and depth features and to handle various ambient lighting conditions such as the direction of the illumination and the color temperature.

  2.

    During the training phase, we adopt a loss function that incorporates the discrete wavelet transform to train our model. This loss function effectively improves accuracy.

  3.

    We test our proposed method on the VIDIT dataset [16], and multiple experiments demonstrate that the proposed S3Net achieves the $3^{rd}$ highest SSIM and MPS in the NTIRE 2021 Depth Guided Any-to-any Relighting Challenge.

2 Related Works

2.1 RGB-D Fusion

Different from conventional RGB tasks, extra depth information can improve accuracy in many computer vision tasks such as semantic segmentation [17] and object detection [4]. Depth maps have been shown to be a useful cue that provides geometric and spatial information when combined with the RGB representation. In [4], a Faster R-CNN-based structure is proposed to tackle pedestrian detection. The authors show that depth maps can be utilized to refine the convolutional features extracted from RGB images; additionally, more accurate region proposals are achieved by exploring the perspective projection with the help of the depth information. In [17], a unified and efficient cross-modality guided encoder for semantic segmentation is proposed. This structure jointly filters and recalibrates both representations before cross-modality aggregation. Meanwhile, a bi-directional multi-step propagation strategy is introduced to effectively fuse information between the two modalities. The authors of [18] propose an approach to RGB-D salient object detection in which the depth and image features are extracted by a dual backbone. They also integrate both features through densely connected structures and apply the mixed features to generate dynamic filters with receptive fields of different sizes. These works tend to use a dual backbone to extract the features of different modalities. In practice, however, the relit images in VIDIT [16] and the corresponding depth maps are very large (e.g., 1024 × 1024), and since the relit images contain illuminant direction information, it is not appropriate to crop the large images into small patches during training. To address this issue, we design a single stream structure to jointly extract both the depth and the image features.

2.2 Deep Learning Based Image Relighting

Following the rules of the 2021 NTIRE Depth Guided Image Relighting Challenge, there are two kinds of settings. First, the illuminant direction and the color temperature are pre-defined [2, 6], which is known as one-to-one relighting. Second, the ambient condition is based on a guided image [3], which is known as any-to-any relighting. Both are very similar to other low-level vision tasks such as image dehazing [14], image deraining [13], image smoke removal [19], image desnowing [20], reflection removal [21], and underwater image enhancement [12]. All of them belong to the image-to-image translation problems, which can be tackled with encoder-decoder networks. Among encoder-decoder structures, U-Net [22, 23] is the most popular network for image-to-image translation tasks. This structure consists of not only the encoder-decoder architecture but also skip connections, which concatenate features of identical size from the encoder to the decoder. For example, Puthussery et al. [2] propose a wavelet decomposed relit network for image relighting. This structure is a novel encoder-decoder network employing wavelet-based decomposition followed by convolution layers under a multi-resolution framework, and it also contains skip connection operations. Different from those low-level vision tasks, the relit image carries information about the illuminant direction and color temperature, which may cause the shadows of objects to fall in different locations. In [3], a lighting-to-feature network is proposed to recover the corresponding implicit scene representation from the illumination settings, which is the inverse process of the lighting estimation network.

2.3 Wavelet Transform for Deep Learning

The DWT [24], with its orthogonal property, is widely adopted in many computer vision tasks [23, 2]. The DWT decomposes an image into several sub-band images in different frequency intervals, which can replace existing down-sampling operations such as max pooling or average pooling. Therefore, many tasks apply the DWT to shrink feature maps and obtain multi-scale features. Moreover, in [25], the DWT is leveraged to design the objective function that measures the similarity between the ground truth and predicted images. Motivated by these works, we also incorporate the DWT into the loss function so that our network learns multi-scale representations.

2.4 Attention Mechanism

Attention mechanisms play important roles in both human perception systems and deep learning tasks [13, 26]. They assign weights to feature maps or sequences so that the features of important regions or locations are magnified. Specifically, for computer vision tasks, attention mechanisms are categorized into spatial attention [3] and channel attention [13]. The former spatially weights the feature maps to refine them, and the latter computes global average-pooled features to implement channel-wise attention. In this paper, both attention mechanisms are leveraged in our model to further increase the performance of the network.

3 Proposed Methods

Figure 2: The proposed network for depth guided any-to-any relighting. This model applies the Res2Net as the encoder to extract both the image and depth map features. All images and depth maps are concatenated as the input. In the decoder, we use attention modules, including channel and spatial attention, and the enhanced module to refine features and relight the image.

3.1 Overall Neural Network

The architecture of the proposed S3Net is presented in Fig. 2. This network is based on [27] and contains an encoder and a decoder. In the encoder, we apply Res2Net-101 [28] as our backbone. The Res2Net can represent multi-scale features at a granular level and increases the range of receptive fields of each network layer. After the input is passed through the backbone, multi-scale feature extraction is achieved. Note that the Res2Net utilized in this work discards the fully connected layer, and the final feature maps output by the encoder are $\frac{1}{16}$ of the input size. The encoder is initialized with parameters pre-trained on ImageNet. We connect the bottom features to the decoder. The decoder consists of stacks of convolutions that refine the feature maps. Both the pixel shuffle [29] and the transposed convolution [30] are adopted to magnify the feature maps. Furthermore, we leverage attention modules to refine the intermediate features. Each attention module consists of a residual layer [31], spatial attention [26], and channel attention [32], as shown in Fig. 2 (orange rectangle).
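The paper does not spell out the internal layout of these attention blocks, so the following is only a minimal PyTorch sketch of the standard channel attention [32] and spatial attention [26] formulations they build on; the reduction ratio and kernel size are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention [32]."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global average pooling
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)                               # channel-wise re-weighting


class SpatialAttention(nn.Module):
    """Spatial attention computed from pooled channel statistics [26]."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)                   # average over channels
        mx, _ = x.max(dim=1, keepdim=True)                  # max over channels
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weight                                   # spatial re-weighting
```

In the decoder, such blocks would be applied to intermediate feature maps between the up-sampling stages, together with a residual layer [31].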

Furthermore, inspired by [11], we add the enhanced module to our S3Net. The enhanced module leverages average pooling with different strides to change the size of the feature maps and the receptive fields, which is effective for extracting multi-scale features. Finally, up-sampling is applied to restore the diminished feature maps, and all feature maps are concatenated. The enhanced module is illustrated in the light orange rectangle in Fig. 2. Moreover, it is known that a U-Net-like structure is beneficial in many tasks such as image dehazing [33, 14] and semantic segmentation [22], since its skip connections encourage feature re-use. Therefore, in our S3Net, we also adopt skip connections to merge the last three feature maps from the backbone with their corresponding decoder feature maps.
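As a rough sketch (not the released implementation), the enhanced module described above can be realized as pyramid-style average pooling followed by per-scale convolutions and bilinear up-sampling, in the spirit of [11]; the pooling strides and per-scale channel widths below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EnhancedModule(nn.Module):
    """Average pooling at several strides, a 1x1 convolution per scale,
    bilinear up-sampling back to the input resolution, and channel-wise
    concatenation with the input feature map."""

    def __init__(self, channels: int, scales=(4, 8, 16, 32)):
        super().__init__()
        self.scales = scales
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels // len(scales), kernel_size=1) for _ in scales]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        feats = [x]
        for s, conv in zip(self.scales, self.convs):
            y = F.avg_pool2d(x, kernel_size=s, stride=s)    # shrink the map (larger receptive field)
            y = conv(y)                                      # mix channels at this scale
            y = F.interpolate(y, size=(h, w), mode='bilinear', align_corners=False)
            feats.append(y)
        return torch.cat(feats, dim=1)                       # multi-scale fusion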

In practice, we directly concatenate the original image, the original depth map, the guided image and the guided depth map as the input. This input is an 8-channel tensor, and the output is the 3-channel relit image. In addition, we change the first convolution of the Res2Net backbone so that S3Net can accommodate 8-channel tensors as input.
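The following sketch shows one plausible way to assemble the 8-channel input and adapt the first Res2Net convolution; copying the ImageNet-pretrained RGB filters into the first three channels and zero-initializing the remaining ones is our assumption, since the paper only states that the first convolution is changed.

```python
import torch
import torch.nn as nn


def make_8ch_stem(pretrained_conv: nn.Conv2d) -> nn.Conv2d:
    """Replace a 3-channel first convolution with an 8-channel one,
    copying the pretrained RGB weights and zero-initializing the rest."""
    new_conv = nn.Conv2d(
        8, pretrained_conv.out_channels,
        kernel_size=pretrained_conv.kernel_size,
        stride=pretrained_conv.stride,
        padding=pretrained_conv.padding,
        bias=pretrained_conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        new_conv.weight[:, :3] = pretrained_conv.weight      # reuse ImageNet RGB filters
        if pretrained_conv.bias is not None:
            new_conv.bias.copy_(pretrained_conv.bias)
    return new_conv


# The 8-channel input is simply a channel-wise concatenation:
src_img   = torch.rand(1, 3, 1024, 1024)
src_depth = torch.rand(1, 1, 1024, 1024)
gud_img   = torch.rand(1, 3, 1024, 1024)
gud_depth = torch.rand(1, 1, 1024, 1024)
x = torch.cat([src_img, src_depth, gud_img, gud_depth], dim=1)  # shape (1, 8, H, W)
```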

3.2 Loss Functions

To train our S3Net, we apply three loss functions. The first is the Charbonnier loss [34], which can be regarded as a robust $L_{1}$ loss. The Charbonnier loss is expressed as:

L_{Cha}(I,\hat{I}) = \frac{1}{T}\sum_{i}^{T}\sqrt{(I_{i}-\hat{I}_{i})^{2}+\epsilon^{2}}    (1)

where $I$ and $\hat{I}$ represent the ground truth and the relit image from the proposed network, respectively, $T$ is the number of pixels, and $\epsilon$ is a tiny constant (e.g., $10^{-6}$) for stable and robust convergence. $L_{Cha}$ can restore the global structure [34] and is more robust to outliers.
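Eq. (1) translates directly into a few lines of PyTorch; the sketch below averages over all tensor elements, which matches the $\frac{1}{T}$ normalization.

```python
import torch


def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    """Eq. (1): a robust L1 loss averaged over all pixels."""
    return torch.sqrt((target - pred) ** 2 + eps ** 2).mean()
```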

Secondly, we apply the SSIM loss [35]. The SSIM loss is able to reconstruct local textures and details. It can be expressed as:

L_{SSIM}(I,\hat{I}) = -\frac{(2\mu_{I}\mu_{\hat{I}}+C_{1})(2\sigma_{I\hat{I}}+C_{2})}{(\mu_{I}^{2}+\mu_{\hat{I}}^{2}+C_{1})(\sigma_{I}^{2}+\sigma_{\hat{I}}^{2}+C_{2})}    (2)

where $\mu_{I}$ and $\mu_{\hat{I}}$ are the means of the images, $\sigma_{I}$ and $\sigma_{\hat{I}}$ are their standard deviations, $\sigma_{I\hat{I}}$ is their covariance, and $C_{1}$ and $C_{2}$ are small constants that stabilize the division. In the image relighting task, to remove shadows from the original image, we extend the SSIM loss function so that our network can restore more detailed parts. We follow the method in [25] and combine the DWT with the SSIM loss, because many works [13, 23] have demonstrated that the DWT captures high-frequency features, which is beneficial for reconstructing clear details in the relit images. First, the DWT decomposes the predicted image into four smaller sub-band images. The operation can be expressed as:

\hat{I}^{LL}, \hat{I}^{LH}, \hat{I}^{HL}, \hat{I}^{HH} = {\rm DWT}(\hat{I})    (3)

where the superscripts denote the outputs of the respective filters ($f_{LL}$, $f_{HL}$, $f_{LH}$ and $f_{HH}$).

$f_{HL}$, $f_{LH}$ and $f_{HH}$ are high-pass filters for the horizontal edges, the vertical edges and corner detection, respectively, while $f_{LL}$ can be seen as a down-sampling operation. Moreover, the DWT can keep decomposing $\hat{I}^{LL}$ to generate images at different scales and frequency bands. This step is written as:

\hat{I}^{LL}_{i+1}, \hat{I}^{LH}_{i+1}, \hat{I}^{HL}_{i+1}, \hat{I}^{HH}_{i+1} = {\rm DWT}(\hat{I}^{LL}_{i})    (4)

where the subscript $i$ denotes the output of the $i^{th}$ DWT iteration, and $\hat{I}^{LL}_{0}$ is the original predicted relit image. The SSIM loss described above is calculated on the original image pair and the various sub-band image pairs. The fusion of the SSIM loss and the DWT is integrated as:

L_{W\text{-}SSIM}(I,\hat{I}) = \sum_{i}\gamma_{i}\,L_{SSIM}(I^{w}_{i},\hat{I}^{w}_{i}), \quad w \in \{LL, HL, LH, HH\}    (5)

where $\gamma_{i}$ follows [25] and controls the importance of the different scales.
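A minimal sketch of the wavelet SSIM loss is given below. It assumes a Haar wavelet, two decomposition levels, illustrative $\gamma_{i}$ weights, and an external differentiable SSIM function passed in as ssim_fn; the actual weights used in the paper follow [25] and are not reproduced here.

```python
import torch


def haar_dwt(x: torch.Tensor):
    """One level of the 2-D Haar DWT on an NCHW tensor (even H and W assumed),
    returning the (LL, LH, HL, HH) sub-bands at half resolution."""
    a = x[:, :, 0::2, 0::2]   # top-left of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (c + d - a - b) / 2  # difference across rows
    hl = (b + d - a - c) / 2  # difference across columns
    hh = (a + d - b - c) / 2  # diagonal difference
    return ll, lh, hl, hh


def w_ssim_loss(pred: torch.Tensor, target: torch.Tensor, ssim_fn,
                levels: int = 2, gammas=(1.0, 0.5, 0.25)) -> torch.Tensor:
    """Eq. (5): accumulate SSIM losses over the original pair and the
    sub-band pairs obtained by repeatedly decomposing the LL band."""
    loss = gammas[0] * (-ssim_fn(pred, target))        # i = 0 term: original images
    p_ll, t_ll = pred, target
    for i in range(1, levels + 1):
        p_bands = haar_dwt(p_ll)
        t_bands = haar_dwt(t_ll)
        for p_b, t_b in zip(p_bands, t_bands):         # LL, LH, HL, HH
            loss = loss + gammas[i] * (-ssim_fn(p_b, t_b))
        p_ll, t_ll = p_bands[0], t_bands[0]            # keep decomposing the LL band
    return loss
```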

The third loss is the perceptual loss [36]. Different from the two aforementioned loss functions, the perceptual loss leverages multi-scale features obtained from a pre-trained deep neural network (e.g., VGG19 [37]) to measure the visual feature difference between the ground truth and the estimated image. In this task, a VGG19 pre-trained on ImageNet is utilized as the loss network. The perceptual loss is defined as:

L_{Per}(I,\hat{I}) = \left|VGG(I) - VGG(\hat{I})\right|    (6)

where $|\cdot|$ denotes the absolute value. The overall loss function is expressed as:

L_{Total} = \lambda_{1}L_{Cha} + \lambda_{2}L_{W\text{-}SSIM} + \lambda_{3}L_{Per}    (7)

where $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are scaling coefficients used to adjust the relative weights of the three components.
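The perceptual term and the final combination in Eq. (7) can be sketched as follows; using the relu3_3 feature map of VGG19 is our assumption, since the paper only specifies an ImageNet-pretrained VGG19 as the loss network, and the $\lambda$ weights are those reported in Section 4.2.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class PerceptualLoss(nn.Module):
    """Eq. (6): L1 distance between VGG19 features of the two images.
    The relu3_3 layer choice is an assumption; ImageNet normalization
    is omitted here for brevity."""

    def __init__(self):
        super().__init__()
        # Newer torchvision versions use the `weights=` argument instead.
        vgg = models.vgg19(pretrained=True).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad = False                      # frozen loss network
        self.vgg = vgg

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return torch.abs(self.vgg(pred) - self.vgg(target)).mean()


def total_loss(l_cha: torch.Tensor, l_wssim: torch.Tensor, l_per: torch.Tensor,
               lambdas=(1.0, 1.1, 0.1)) -> torch.Tensor:
    """Eq. (7): weighted sum of the three loss terms."""
    return lambdas[0] * l_cha + lambdas[1] * l_wssim + lambdas[2] * l_per
```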

4 Experiments

4.1 Dataset

The dataset used in the 2021 NTIRE depth guided image relighting challenge is the Virtual Image Dataset for Illumination Transfer (VIDIT) [16]. This dataset consists of 390 different scenes rendered under 40 different illumination settings (five color temperatures from 2500 K to 6500 K and 8 azimuthal angles), giving 15,600 images in total. Additionally, the corresponding 390 depth maps are provided. The sizes of the training images and depth maps are 1024 × 1024 × 3 and 1024 × 1024 × 1, respectively. 300 virtual scenes with different illumination settings are used for training, while the other 90 virtual scenes are used for validation and testing.

4.2 Experimental Setting

Figure 3: The visual comparison of the proposed method and other existing methods. We plot relit results using the proposed S3Net trained with and without the wavelet SSIM loss.

In the training phase, we first choose an arbitrary image and its corresponding depth map. Then, we choose a guided image with a random color temperature and azimuthal angle. With this information, the target output image is determined. Our network takes the original image, the original depth map, the guided image and the guided depth map as input to reconstruct this target image. We do not apply any data augmentation such as random flipping or random cropping during the training phase.
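A sketch of this pairing logic is given below; the scene and illumination identifiers are illustrative and do not follow the actual VIDIT file naming, and the guided scene is sampled independently of the source scene here, which is one possible reading of the description above.

```python
import random

# Illustrative illumination settings for VIDIT: 5 color temperatures (K)
# and 8 azimuthal light directions.
TEMPERATURES = [2500, 3500, 4500, 5500, 6500]
DIRECTIONS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']


def sample_training_pair(num_scenes: int = 300):
    """Pick a random source scene/illumination and a random guided
    illumination; the ground truth is the source scene rendered under
    the guided illumination setting."""
    src_scene = random.randrange(num_scenes)
    gud_scene = random.randrange(num_scenes)
    src_setting = (random.choice(TEMPERATURES), random.choice(DIRECTIONS))
    gud_setting = (random.choice(TEMPERATURES), random.choice(DIRECTIONS))
    target = (src_scene, gud_setting)      # image the network must reconstruct
    return (src_scene, src_setting), (gud_scene, gud_setting), target
```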

We apply AdamW [38] as the optimizer; the batch size is set to 3 and the network is trained for 100 epochs with momentum terms $\beta_{1}$ = 0.5 and $\beta_{2}$ = 0.999. The learning rate is set to $10^{-4}$ and is divided by ten after 20 epochs. The weights of the three loss terms (i.e., $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$) are set to 1, 1.1 and 0.1, respectively. All experiments are performed on the PyTorch platform with a single Nvidia V100 graphics card. We spend about 80 hours training the model. The source code will be released on our project page.
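For reference, a minimal optimizer and learning-rate schedule matching these settings could look as follows (whether the tenfold decay is applied once or every 20 epochs is our reading of the text).

```python
import torch


def build_optimizer(model: torch.nn.Module):
    """AdamW with betas (0.5, 0.999), initial learning rate 1e-4,
    and a tenfold decay every 20 epochs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                                  betas=(0.5, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20,
                                                gamma=0.1)
    return optimizer, scheduler
```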

4.3 Ablation Experiments

Figure 4: Visualization of failure cases produced by the proposed S3Net. One can see that, although the color temperature is transferred successfully, the detailed and structural information in shadowed regions cannot be recovered adequately.

To verify the effectiveness of the proposed S3Net, several ablation experiments are conducted in this section. The peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) are adopted as objective metrics for quantitative evaluation.

The ablation experiments consist of three settings: 1) the input contains only the source image and the guided image without depth maps, i.e., the input is a six-channel tensor (Image); 2) the input contains the source image, the guided image, and their corresponding depth maps (Both); for these two settings, $L_{Cha}$, $L_{SSIM}$ and $L_{Per}$ are adopted as objective functions; 3) the same setting as in 2), but $L_{SSIM}$ is replaced by $L_{W-SSIM}$ [25]. The results are reported in Table 1. One can see that the PSNR and SSIM scores of setting 2 are improved compared with setting 1, which shows that depth maps improve the relighting performance because they provide more spatial information for the network. Moreover, compared with setting 2, the performance of setting 3 is improved effectively, which indicates that $L_{W-SSIM}$ is more beneficial to the network than $L_{SSIM}$.

Table 1: Comparison of different input data and loss functions on the VIDIT dataset.
Index   Input   $L_{W-SSIM}$   PSNR      SSIM
1       Image   –              18.7611   0.6821
2       Both    –              18.8451   0.6913
3       Both    ✓              19.1281   0.6969
Table 2: The average SSIM, PSNR, MPS and LPIPS of the top five methods on the NTIRE 2021 depth guided image relighting validation and testing sets.
User name      Val. SSIM   Val. PSNR   Test MPS   Test SSIM   Test LPIPS   Test PSNR
DeepBlueAI     0.7196      20.0637     0.7675     0.7087      0.1737       20.7915
lifu           0.7107      19.7730     0.7671     0.6874      0.1532       19.8901
elientumba     0.6802      18.5570     0.7423     0.6508      0.1661       18.6039
auy200         0.6864      19.3552     0.7341     0.6711      0.2028       20.1478
HaoqiangYang   0.7022      19.2462     0.6452     0.6784      0.1566       19.2212

We also present some visual comparisons on the VIDIT validation set for the depth guided any-to-any relighting problem to show the effectiveness of the wavelet SSIM loss in the proposed S3Net. As shown in Fig. 3, the images relit by the model trained with the wavelet SSIM loss are closer to the ground truth. This model properly removes shadows and transfers the ambient conditions from the guided images to the source images.

4.4 Results of Challenge

We list the results of the proposed S3Net and the other competing entries in the depth guided any-to-any relighting challenge of the NTIRE 2021 workshop [10] in Table 2. Besides PSNR and SSIM, the Mean Perceptual Score (MPS), defined as the average of the normalized SSIM and LPIPS [39] scores, themselves averaged across the entire test set of each submission, is adopted to evaluate the performance of all submissions. As shown in Table 2, our results obtained the $3^{rd}$ place, the $4^{th}$ place, the $2^{nd}$ place and the $3^{rd}$ place in terms of SSIM, PSNR, LPIPS and MPS, respectively. Note that our model takes 2.042 seconds on average to generate a relit result of size 1024 × 1024 in the test phase.

4.5 Limitations and Discussion

The proposed method learns the mapping function of depth guided image relighting. Although our method achieves competitive performance in this competition, it may fail under certain conditions. As shown in Fig. 4, the relit images are very different from the ground truth. When the original images contain large regions of shadow, our model cannot distinguish their foreground from their background. Even though the depth maps are given, they only provide depth for the visible, front-facing surfaces rather than omnidirectional spatial information. Therefore, although the color temperature of the guided images is transferred to the relit images (e.g., Fig. 4 (e)), the model reconstructs poor structures.

5 Conclusion

In this paper, we propose a single stream structure network (S3Net) for depth guided any-to-any image relighting. We concatenate the source image and the guided image with their corresponding depth maps as the input of our model. The network is based on the Res2Net [28] in the encoder part, and attention modules and the enhanced module [11] are applied in the decoder part to refine the feature maps. We leverage the wavelet SSIM loss [25] to supervise the network training. In the NTIRE 2021 Depth Guided Any-to-any Image Relighting Challenge, the proposed S3Net achieves the $3^{rd}$ place in terms of MPS and SSIM. Since depth guided image relighting is a new challenge that has not been addressed in the previous literature, in future work we will design a novel backbone to extract and fuse the image and depth features more effectively.

6 Acknowledgement

We thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources.

References

  • [1] X. Gong, H. Huang, L. Ma, F. Shen, W. Liu, and T. Zhang, “Neural stereoscopic image style transfer,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [2] D. Puthussery, M. Kuriakose, J. C V et al., “WDRN: A wavelet decomposed relightnet for image relighting,” arXiv preprint arXiv:2009.06678, 2020.
  • [3] Z. Hu, X. Huang, Y. Li, and Q. Wang, “SA-AE for any-to-any relighting,” in European Conference on Computer Vision, 2020.
  • [4] Z. Guo, W. Liao, Y. Xiao, P. Veelaert, and W. Philips, “Deep learning fusion of rgb and depth images for pedestrian detection,” in British Machine Vision Conference, 2019.
  • [5] Z. Xu, K. Sunkavalli, S. Hadap, and R. Ramamoorthi, “Deep image-based relighting from optimal sparse samples,” ACM Transactions on Graphics (ToG), 2018.
  • [6] H.-H. Yang, W.-T. Chen, H.-L. Luo, and S.-Y. Kuo, “Multi-modal bifurcated network for depth guided image relighting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021.
  • [7] M. El Helou, R. Zhou, S. Süsstrunk, R. Timofte et al., “AIM 2020: Scene relighting and illumination estimation challenge,” in Proceedings of the European Conference on Computer Vision Workshops (ECCVW), 2020.
  • [8] Y. Pang, L. Zhang, X. Zhao, and H. Lu, “Hierarchical dynamic filtering network for rgb-d salient object detection,” arXiv preprint arXiv:2007.06227, 2020.
  • [9] S. Chen and Y. Fu, “Progressively guided alternate refinement network for rgb-d salient object detection,” in European Conference on Computer Vision, 2020.
  • [10] M. El Helou, R. Zhou, S. Süsstrunk, R. Timofte et al., “NTIRE 2021: Depth-guided image relighting challenge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021.
  • [11] Y. Qu, Y. Chen, J. Huang, and Y. Xie, “Enhanced pix2pix dehazing network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [12] H.-H. Yang, K.-C. Huang, and W.-T. Chen, “Laffnet: A lightweight adaptive feature fusion network for underwater image enhancement,” in IEEE International Conference on Robotics and Automation (ICRA), 2021.
  • [13] H.-H. Yang, C.-H. H. Yang, and Y.-C. F. Wang, “Wavelet channel attention module with a fusion network for single image deraining,” in IEEE International Conference on Image Processing (ICIP), 2020.
  • [14] W.-T. Chen, H.-Y. Fang, J.-J. Ding, and S.-Y. Kuo, “PMHLD: patch map-based hybrid learning dehazenet for single image haze removal,” IEEE Transactions on Image Processing, 2020.
  • [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [16] M. E. Helou, R. Zhou, J. Barthas, and S. Süsstrunk, “Vidit: Virtual image dataset for illumination transfer,” arXiv preprint arXiv:2005.05460, 2020.
  • [17] X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng, “Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation,” in European Conference on Computer Vision (ECCV), 2020.
  • [18] Y. Pang, L. Zhang, X. Zhao, and H. Lu, “Hierarchical dynamic filtering network for rgb-d salient object detection,” in Eur. Conf. Comput. Vis., 2020.
  • [19] W.-T. Chen, S.-Y. Yuan, G.-C. Tsai, H.-C. Wang, and S.-Y. Kuo, “Color channel-based smoke removal algorithm using machine learning for static images,” in IEEE International Conference on Image Processing (ICIP), 2018.
  • [20] W.-T. Chen, H.-Y. Fang, J.-J. Ding, C.-C. Tsai, and S.-Y. Kuo, “JSTASR: Joint size and transparency-aware snow removal algorithm based on modified partial convolution and veiling effect removal,” in European Conference on Computer Vision, 2020.
  • [21] G.-C. Tsai, W.-T. Chen, S.-Y. Yuan, and S.-Y. Kuo, “Efficient reflection removal algorithm for single image by pixel compensation and detail reconstruction,” in IEEE International Conference on Digital Signal Processing (DSP), 2018.
  • [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, 2015.
  • [23] H.-H. Yang and Y. Fu, “Wavelet U-net and the chromatic adaptation transform for single image dehazing,” in IEEE International Conference on Image Processing (ICIP), 2019.
  • [24] S. Mallat, A wavelet tour of signal processing.   Elsevier, 1999.
  • [25] H.-H. Yang, C.-H. H. Yang, and Y.-C. J. Tsai, “Y-net: Multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
  • [26] V. Mnih, N. Heess et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014.
  • [27] H. Wu, J. Liu, Y. Xie, Y. Qu, and L. Ma, “Knowledge transfer dehazing network for nonhomogeneous dehazing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
  • [28] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE transactions on pattern analysis and machine intelligence, 2019.
  • [29] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [30] H. Gao, H. Yuan, Z. Wang, and S. Ji, “Pixel transposed convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, 2019.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [32] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.
  • [33] W.-T. Chen, J.-J. Ding, and S.-Y. Kuo, “PMS-net: Robust haze removal based on patch map for single images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [34] J. T. Barron, “A general and adaptive robust loss function,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [35] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Computational Imaging, 2016.
  • [36] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision, 2016.
  • [37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [38] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [39] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.