
¹ Korea Advanced Institute of Science and Technology, Daejeon, South Korea
  {egyptdj,songjy18,jong.ye}@kaist.ac.kr
² Amazon Web Services, Seoul, South Korea
  [email protected]

PyNET-CA: Enhanced PyNET with Channel Attention for End-to-end Mobile Image Signal Processing

Byung-Hoon Kim¹, Joonyoung Song¹, Jong Chul Ye¹, JaeHyun Baek²
Abstract

Reconstructing an RGB image from RAW data obtained with a mobile device involves a number of image signal processing (ISP) tasks, such as demosaicing and denoising. Deep neural networks have shown promising results over hand-crafted ISP algorithms, both in solving these tasks separately and in replacing the whole reconstruction process with a single model. Here, we propose PyNET-CA, an end-to-end mobile ISP deep learning algorithm for RAW to RGB reconstruction. The model enhances PyNET, a recently proposed state-of-the-art model for mobile ISP, and improves its performance with channel attention and a subpixel reconstruction module. We demonstrate the performance of the proposed method with comparative experiments and results from the AIM 2020 learned smartphone ISP challenge. The source code of our implementation is available at https://github.com/egyptdj/skyb-aim2020-public

The final authenticated version is available online at https://doi.org/10.1007/978-3-030-67070-2_12

Keywords:
RAW to RGB, Mobile image signal processing, Image reconstruction, Deep learning
Figure 1: Reconstructed RGB image from RAW data with the proposed method. (a) Input RAW image (visualised). (b) Reconstructed RGB image with the proposed method. (c) Target image taken with a Canon 5D Mark IV.

1 Introduction

Reconstructing an RGB image from the RAW data obtained with a mobile device is a topic of growing interest. The data acquired from the image sensors of mobile devices require several image signal processing (ISP) steps that solve a number of low-level computer vision problems, such as demosaicing, denoising, and color correction. Hand-crafted ISP algorithms depend on prior knowledge about the data acquisition process or degradation principles. Software on the mobile device implements these algorithms to process the RAW data sequentially, solving each task step by step to reconstruct the RGB image presented to the user.

Deep neural networks, specifically convolutional neural networks (CNNs), have recently shown promising results over hand-crafted ISP algorithms. There have also been attempts not just to replace each ISP algorithm with a deep neural network, but to reconstruct RGB images from the RAW data by training a single end-to-end deep learning reconstruction model. However, one of the difficulties of training an end-to-end reconstruction model, compared with separately processing the RAW data, is that prior knowledge about the data is not directly incorporated into the model. For example, it is important to take both global (e.g. luminance, color balance) and local features (e.g. fine-grained textures, edge structures) of the image into account during the reconstruction process, information that is not explicitly present in the input RAW data.

We consider a recently proposed end-to-end RAW to RGB reconstruction model, PyNET [7], which is a CNN designed to exploit both global and local features of the input data. Although PyNET achieves state-of-the-art performance in RAW to RGB reconstruction, there exist some drawbacks in the model architecture and the training process that can be further improved. In this paper, we address these issues and propose PyNET-CA, an enhanced PyNET with channel attention that improves performance and reduces training time. We demonstrate the performance of the proposed model with a number of comparative experiments, and report the results of participating in the AIM 2020 learned smartphone ISP challenge.

2 Related Work

Since the introduction of SRCNN [3], which solves the single image super-resolution (SISR) problem with a CNN, a large variety of neural-network-based image reconstruction and enhancement methods have been proposed. Deep learning for SISR has rapidly grown with deeper network architectures [8] and better modules [10], [20]. Models that employ the channel attention mechanism have also been shown to improve performance on image enhancement tasks [19]. Although these developments in network architecture do improve the quantitative quality of the enhanced images, they do not necessarily mean that the enhanced images are perceptually of good quality. Generative adversarial network (GAN) based SISR models, such as SRGAN [9] and ESRGAN [17], have been introduced to address this issue and enhance images in a photo-realistic way. Enhancing an image to be both quantitatively accurate and perceptually realistic within the perception-distortion tradeoff [1] is now an important issue in image enhancement tasks [2].

Along with the development of neural networks for image enhancement tasks, there have also been attempts to train an end-to-end deep learning model for RAW to RGB reconstruction. One of the earliest models was proposed by [5] and is based on a CNN with a composite loss function and an adversarial training scheme. DeepISP [13] is also based on a CNN and addresses the issue of global and local feature priors with a two-stage approach. The W-Net [16] is another two-stage model, which stacks two U-Net [12] structures with a channel attention module. SalGAN [21] employs the U-Net structure as the generator of an adversarial training scheme and incorporates a spatial attention scheme into the loss function. HERN [11] modifies the channel attention module of the residual-in-residual module of [19] to construct a dual-path network and incorporates global feature information with a separate full-image encoder. One of the most recent models is PyNET [7], a CNN with an inverted pyramidal structure. PyNET achieves state-of-the-art results on the RAW to RGB reconstruction task, owing to a model architecture that accounts for both global and local features of the image [7].

3 Proposed Method

3.1 Network architecture

Figure 2: Basic modules of the proposed model

3.1.1 Basic modules

First, we define some of the modules that constitute the PyNET-CA model (Figure 2). The channel attention (CA) module of PyNET-CA follows [19]. Global average pooling is first applied over the height and width dimensions of the features to output a vector whose length equals the number of input channels. The pooled vector is linearly mapped, passed through the nonlinear ReLU activation, and linearly mapped again to match the number of input channels. The length of the features after the first linear mapping is determined by the reduction ratio $r$, which reduces the number of channels to $1/r$ of the input channels. The final features are squashed to the range $[0,1]$ by the sigmoid function and multiplied element-wise with the channels of the input to account for the level of attention given to each channel.
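Below is a minimal sketch of this channel attention module in PyTorch, in the squeeze-and-excitation style of [19]. The class and argument names are illustrative assumptions, and the two linear maps are realised as $1\times 1$ convolutions for convenience; the official implementation may differ in detail.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the CA module: global pooling, bottleneck, sigmoid gating."""
    def __init__(self, channels: int, reduction: int = 1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pool over H and W
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # first linear map (1/r channels)
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # map back to C channels
            nn.Sigmoid(),                                                # squash weights to [0, 1]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.pool(x))   # per-channel attention weights, shape (N, C, 1, 1)
        return x * w                # element-wise rescaling of each input channel
```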

The DoubleConv module is defined as two sequential operations of 2D convolution followed by a LeakyReLU activation. The kernels of the two convolution layers have the same shape, which is one of $3\times 3$, $5\times 5$, $7\times 7$, or $9\times 9$. The input image or feature to the DoubleConv module is reflect-padded before the convolution to match the size of the output feature. Unlike the original PyNET [7], we apply instance normalisation only after the second convolution layer, where needed.
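A minimal sketch of the DoubleConv module follows. The LeakyReLU negative slope and the exact placement of the optional instance normalisation (here after the second convolution, before its activation) are assumptions for illustration.

```python
import torch.nn as nn

class DoubleConv(nn.Module):
    """Sketch of DoubleConv: two reflect-padded convolutions with the same kernel size."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, use_norm: bool = False):
        super().__init__()
        pad = kernel_size // 2  # reflect padding keeps the spatial size unchanged
        layers = [
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad, padding_mode='reflect'),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size, padding=pad, padding_mode='reflect'),
        ]
        if use_norm:
            layers.append(nn.InstanceNorm2d(out_ch, affine=True))  # only after the second conv
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```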

The MultiConv channel attention (MCCA) module is the basic building block of our model. It comprises a concatenation of the features from DoubleConv modules, followed by a channel attention module. We specify four types of MCCA (MCCA3, MCCA5, MCCA7, MCCA9) based on the kernel sizes of the DoubleConv modules. The MCCA3 module has only one $3\times 3$ DoubleConv module, while the MCCA5 module concatenates the outputs of a $3\times 3$ and a $5\times 5$ DoubleConv module, and so forth. Lastly, channel attention is applied to the concatenated features from the DoubleConv modules. The reduction ratio $r$ of the channel attention module is set to 1, 2, 3, 4 for MCCA3, MCCA5, MCCA7, MCCA9, respectively.
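A minimal sketch of the MCCA module, built on the ChannelAttention and DoubleConv sketches above, is given below. The channel bookkeeping (each branch producing out_ch channels) is an assumption for illustration; the reduction ratio is set to the number of branches, matching the values 1, 2, 3, 4 quoted in the text.

```python
import torch
import torch.nn as nn

class MCCA(nn.Module):
    """Sketch of MCCA: parallel DoubleConv branches, concatenation, channel attention."""
    def __init__(self, in_ch: int, out_ch: int, max_kernel: int = 5):
        super().__init__()
        kernels = list(range(3, max_kernel + 1, 2))  # e.g. [3, 5] for MCCA5, [3, 5, 7, 9] for MCCA9
        self.branches = nn.ModuleList([DoubleConv(in_ch, out_ch, k) for k in kernels])
        # reduction ratio r equals the number of DoubleConv branches (1, 2, 3, or 4)
        self.ca = ChannelAttention(out_ch * len(kernels), reduction=len(kernels))

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)  # concatenate branch outputs
        return self.ca(feats)                                              # channel attention on the concat
```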

Figure 3: Schematic illustration of the PyNET-CA model.

3.1.2 Inverted pyramidal structure

To account for both the global and local features of the image, PyNET-CA has an inverted pyramidal structure as in Figure 3. Given a RAW image with Bayer pattern $\mathbf{I}_{raw}\in\mathbb{R}^{H\times W}$, the Bayer sampling function $f_{Bayer}$ is first applied to obtain

$$\hat{\mathbf{I}}_{raw} = f_{Bayer}(\mathbf{I}_{raw}) \in \mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times 4}. \tag{1}$$
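A minimal sketch of the Bayer sampling function $f_{Bayer}$ in (1): the single-channel RAW mosaic of size $H\times W$ is packed into four half-resolution colour planes. The RGGB site ordering below is an assumption for illustration; the actual pattern depends on the sensor.

```python
import torch

def bayer_sample(raw: torch.Tensor) -> torch.Tensor:
    """Pack an (H, W) Bayer mosaic into an (H/2, W/2, 4) tensor (assumed RGGB layout)."""
    r  = raw[0::2, 0::2]   # red sites
    g1 = raw[0::2, 1::2]   # green sites on red rows
    g2 = raw[1::2, 0::2]   # green sites on blue rows
    b  = raw[1::2, 1::2]   # blue sites
    return torch.stack([r, g1, g2, b], dim=-1)
```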

At each level, the features are downscaled by $2\times 2$ max pooling followed by the MCCA3 module to serve as the input feature $F_{in}^{k}$ at level $k$,

$$F_{in}^{1} = \text{MCCA3}(\hat{\mathbf{I}}_{raw}), \tag{2}$$
$$F_{in}^{k+1} = \text{MCCA3}(\text{MaxPool2}(F_{in}^{k})), \quad k\in\{1,2,3,4\}. \tag{3}$$

We denote the set of operations at each level $k$ that processes the input feature $F_{in}^{k}$ as the function $H^{k}$. The function $H^{k}$ is composed of a number of MCCA modules along with residual connections and within-level and between-level skip connections to compute the output feature $F_{out}^{k}$,

$$F_{out}^{5} = H^{5}(F_{in}^{5}), \tag{4}$$
$$F_{out}^{k} = H^{k}(F_{in}^{k}, F_{out}^{k+1}), \quad k\in\{1,2,3,4\}. \tag{5}$$

The composition of the operators in $H^{k}$ differs at each level (Figure 3).

Lastly, the reconstructed RGB image $\hat{\mathbf{I}}_{rgb}^{k}$ is obtained by a $3\times 3$ convolution layer and tanh activation of the output features,

$$\hat{\mathbf{I}}_{rgb}^{k} = \tanh(\text{Conv3}(F_{out}^{k})) \in \mathbb{R}^{\frac{H}{2^{k}}\times\frac{W}{2^{k}}\times 3}, \quad k\in\{1,2,3,4,5\}. \tag{6}$$

One important point regarding the inverted pyramidal structure lies in (5). The output feature at level $k$ is computed not only from the input feature at level $k$, but also from the output feature of the lower-resolution level $k+1$, which serves as a prior. Because the network is trained progressively, the prior $F_{out}^{k+1}$ contains meaningful information about the downsampled target image. The bilinear downscaling of the target image at each level corresponds to low-pass filtering, and explicitly emphasizes global feature information at the higher levels. This information can be passed on to the lower (higher-resolution) levels effectively and recursively by concatenation.
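A heavily simplified sketch of one level function $H^{k}$ in (5) is given below: the prior $F_{out}^{k+1}$ is upsampled and concatenated with the level-$k$ input feature before further MCCA processing. The upsampling operator and the number of MCCA blocks per level are assumptions for illustration; the actual composition differs at each level (Figure 3).

```python
import torch
import torch.nn as nn

class LevelBlock(nn.Module):
    """Sketch of a level function H^k: fuse the lower-resolution prior by concatenation."""
    def __init__(self, in_ch: int, prior_ch: int, out_ch: int, max_kernel: int = 5):
        super().__init__()
        self.up = nn.ConvTranspose2d(prior_ch, in_ch, kernel_size=2, stride=2)  # upsample prior 2x
        self.body = MCCA(2 * in_ch, out_ch, max_kernel=max_kernel)               # MCCA sketch from above

    def forward(self, f_in_k, f_out_k1):
        prior = self.up(f_out_k1)                         # match the spatial size of level k
        return self.body(torch.cat([f_in_k, prior], dim=1))
```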

3.1.3 Subpixel reconstruction module

Because of the Bayer sampling function in (1), the enhanced features need to be upsampled at the last level of PyNET-CA. This is an ill-posed problem, as in SISR. At the final level of the original PyNET structure, enhanced features are upsampled with bilinear interpolation or transposed convolution, and then convolved with a $3\times 3$ kernel convolution layer to output the final image. However, bilinear interpolation or transposed convolution can cause blurring or checkerboard artifacts in the upsampled image. Furthermore, reconstructing the final image with a convolution layer after upsampling is computationally inefficient.

For computational efficiency and better image quality, at the final level of the model the proposed PyNET-CA processes the features with the MCCA9 module followed by a $1\times 1$ convolution layer, and upsamples them by subpixel shuffling [14],

$$\hat{\mathbf{I}}_{rgb} = \tanh(\text{SubpixelShuffle}(\text{Conv1}(\text{MCCA9}(F_{out}^{1})))) \in \mathbb{R}^{H\times W\times 3}. \tag{7}$$

We denote the RGB reconstruction module (7) as the subpixel reconstruction module (SRM).
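A minimal sketch of the SRM in (7) follows, using the MCCA sketch from Section 3.1.1. The channel sizes are illustrative assumptions; the $1\times 1$ convolution expands the features to $3\times 2^{2}$ channels so that PixelShuffle produces a full-resolution 3-channel output.

```python
import torch.nn as nn

class SubpixelReconstruction(nn.Module):
    """Sketch of the SRM: MCCA9 -> 1x1 conv -> subpixel shuffle -> tanh."""
    def __init__(self, in_ch: int, scale: int = 2):
        super().__init__()
        self.mcca9 = MCCA(in_ch, in_ch, max_kernel=9)           # 4 branches -> 4 * in_ch channels
        self.conv1 = nn.Conv2d(4 * in_ch, 3 * scale ** 2, kernel_size=1)  # to 3 * scale^2 channels
        self.shuffle = nn.PixelShuffle(scale)                    # (N, 3*s^2, H/2, W/2) -> (N, 3, H, W)
        self.act = nn.Tanh()

    def forward(self, f_out_1):
        return self.act(self.shuffle(self.conv1(self.mcca9(f_out_1))))
```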

3.2 Network training

Figure 4: Example images from the ZRR dataset with pixel-level mismatch between the RAW and RGB image pairs. The mismatch is due to (a) angular distortion of a reflection, (b) a rapidly moving object, (c) both (reflection distortion of a rapidly moving object).

3.2.1 Dataset

For training the network, we used the Zurich RAW to RGB (ZRR) dataset. One important fact about the dataset is that it consists of RANSAC [4] aligned pairs of RAW and RGB images, taken separately with a smartphone (Huawei P20) and a DSLR camera (Canon 5D Mark IV), respectively. Considering the collection process, the dataset is prone to significant pixel-level mismatch between the RAW and RGB image pairs if the object is moving or if the object has a large area of reflection (see Figure 4). We screened out the images with a large area of reflection (e.g. on cars, on windows) or moving objects (e.g. people, animals, vehicles on the road). This screening excluded 1,679 of the 46,839 training image pairs (3.58%), leaving 45,160 image pairs for training the model. For testing, we used the 1,204 image pairs provided by the AIM 2020 challenge organizers without excluding any image pairs. All RAW and RGB images were $448\times 448$ cropped patches, and were cast to single-precision floating point, centered and scaled to the range $[-1,1]$ to stabilise the training.
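A small sketch of the preprocessing described above: casting to float32 and centring/scaling to $[-1,1]$. The function name and the per-format maximum value (e.g. 255 for 8-bit RGB, sensor bit depth for RAW) are assumptions for illustration.

```python
import numpy as np

def to_model_range(img: np.ndarray, max_value: float) -> np.ndarray:
    """Cast to float32 and map [0, max_value] to [-1, 1]."""
    x = img.astype(np.float32) / max_value  # scale to [0, 1]
    return 2.0 * x - 1.0                    # centre and scale to [-1, 1]
```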

3.2.2 Progressive training

The PyNET-CA model is progressively trained from the level with the lowest resolution. The target image $\mathbf{I}_{rgb}$ is downsampled with bilinear interpolation to match the resolution of the reconstructed image $\hat{\mathbf{I}}_{rgb}^{k}$ at each level. The loss function used for training PyNET-CA is a linear combination of the mean squared error (MSE) loss, the perceptual loss computed on the VGG relu5_4 layer, and the multi-scale structural similarity index measure (MS-SSIM) loss,

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{\text{MSE}} + \lambda_{2}\mathcal{L}_{\text{VGG}} + \lambda_{3}\mathcal{L}_{\text{MS-SSIM}}.$$

The MSE loss is minimised at all levels with $\lambda_{1}$ set to 1.0, and the other coefficients are adjusted relative to $\lambda_{1}$. For training level 5 and level 4, only the MSE loss is minimised, with $\lambda_{2}$ and $\lambda_{3}$ set to 0.0. From level 3, the perceptual loss is also minimised to account for perceptual similarity, with coefficients $\lambda_{2}=0.01$ and $\lambda_{3}=0.0$. At the final (0th) level, the MS-SSIM loss is additionally maximised (i.e., its negative is minimised). Different MS-SSIM scaling coefficients are used depending on the goal of the task. For the fidelity task, which aims at the highest scores in terms of the peak signal-to-noise ratio (PSNR) and the MS-SSIM, the MS-SSIM scaling coefficient $\lambda_{3}$ is set to 0.01. For the perceptual task, on the other hand, we set the MS-SSIM scaling coefficient to 0.1.
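A minimal sketch of this level-dependent composite loss is shown below. The helpers perceptual_loss (a pretrained VGG relu5_4 feature distance) and ms_ssim are assumed to be provided externally, and the mapping of coefficients to levels is our reading of the schedule described above.

```python
import torch.nn.functional as F

def pynet_ca_loss(pred, target, level, perceptual_loss, ms_ssim, lambda3_final=0.01):
    """Composite loss at a pyramid level; lambda3_final = 0.01 (fidelity) or 0.1 (perceptual)."""
    lambda2 = 0.01 if level <= 3 else 0.0             # VGG perceptual loss from level 3 downwards
    lambda3 = lambda3_final if level == 0 else 0.0    # MS-SSIM only at the final (0th) level
    loss = F.mse_loss(pred, target)                   # lambda1 = 1.0 at all levels
    if lambda2 > 0:
        loss = loss + lambda2 * perceptual_loss(pred, target)
    if lambda3 > 0:
        loss = loss + lambda3 * (1.0 - ms_ssim(pred, target))  # maximise MS-SSIM
    return loss
```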

3.2.3 One-cycle policy of the learning rate

Figure 5: One-cycle policy of the learning rate for training the PyNET-CA.

The PyNET requires a long training time, especially at the high-resolution levels. To address this issue and hasten convergence, we employ the one-cycle policy of the learning rate for training PyNET-CA [15]. At each level, the learning rate starts at $5.0\times 10^{-5}$, reaches the maximum learning rate of $1.0\times 10^{-4}$ after the first 20% of training, and gradually decays to $5.0\times 10^{-7}$ by the end of training, as in Figure 5. We train the model for 16 epochs per level, which is a roughly 68% reduction in training epochs at the last level compared to [7].
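A minimal sketch of this schedule using PyTorch's built-in OneCycleLR scheduler is given below. The optimiser, the placeholder module, the number of steps per epoch, and the mapping of the quoted learning rates onto the scheduler arguments are assumptions for illustration.

```python
import torch

model = torch.nn.Conv2d(4, 3, 3, padding=1)                  # placeholder module for the sketch
optimizer = torch.optim.Adam(model.parameters(), lr=5.0e-5)
steps_per_epoch, epochs = 1000, 16                            # illustrative values

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1.0e-4,                      # peak learning rate
    total_steps=steps_per_epoch * epochs,
    pct_start=0.2,                      # peak reached after the first 20% of training
    div_factor=2.0,                     # initial lr = max_lr / 2 = 5.0e-5
    final_div_factor=100.0,             # final lr = initial lr / 100 = 5.0e-7
)

for step in range(steps_per_epoch * epochs):
    optimizer.step()                    # forward/backward passes omitted in this sketch
    scheduler.step()
```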

4 Experiment

4.1 Comparative studies

Figure 6: Reconstruction result from the comparative experiments. (a) The target RGB image. (b) Reconstructed RGB image with the proposed PyNET-CA. (c)-(f) Reconstructed images from ablation studies and benchmark models.
Table 1: Comparative study of the proposed method. CA and SRM indicate whether the channel attention and the subpixel reconstruction module are used.

Model                 CA   SRM   PSNR      SSIM
PyNET-CA (proposed)   ✓    ✓     21.5022   0.7438
PyNET + SRM           -    ✓     21.4126   0.7375
PyNET                 -    -     21.2071   0.7367
U-Net                 -    -     20.5057   0.7297
Pix2Pix               -    -     20.4502   0.7196

We demonstrate the effectiveness of the proposed method with comparative studies (Table 1, Figure 6). First, we performed ablation studies to evaluate the performance with and without the modules of PyNET-CA. The performance of PyNET-CA degrades if the channel attention in the MCCA module is removed, and degrades further without the SRM (7). Without both modules, the model architecture corresponds to that of the original PyNET, with the network training scheme being the only difference from [7]. The ablation studies show that the channel attention module and the SRM upsampling each enhance the performance of the PyNET model.

Second, we compare the PyNET-CA model with two benchmark models, the U-Net [12] and Pix2Pix [22]. The U-Net is an encoder-decoder CNN with skip connections that has been applied to many image reconstruction tasks. Although the local features of its output images were relatively well recovered, global features such as luminance or color balance were easily lost. Pix2Pix is a conditional GAN for image-to-image translation tasks. The RGB image is reconstructed by the generator, taking the RAW image as the condition. Its output images were perceptually more realistic than those of the fully supervised methods, retaining sharp edges with high-frequency details. However, the model could not correctly reconstruct the colors of the RGB images in many cases.

4.2 AIM 2020 learned smartphone ISP challenge

Table 2: Results of the AIM 2020 learned smartphone ISP challenge.

Team           Fidelity PSNR   Fidelity SSIM   Perceptual PSNR   Perceptual SSIM   MOS
anonymized     22.257          0.7913          21.011            0.7729            4.0
skyb (ours)    21.926          0.7865          21.734            0.7891            3.8
anonymized     21.915          0.7842          21.574            0.7770            4.7
anonymized     21.909          0.7829          21.909            0.7829            4.0
anonymized     21.861          0.7807          21.861            0.7807            4.5
anonymized     21.569          0.7846          21.569            0.7846            3.5
anonymized     21.403          0.7834          21.403            0.7834            4.2
anonymized     21.179          0.7794          21.179            0.7794            4.1
anonymized     21.144          0.7729          21.144            0.7729            3.2
anonymized     20.192          0.7622          -                 -                 -
anonymized     20.138          0.7438          20.138            0.7438            2.2

We report the results of the AIM 2020 learned smartphone ISP challenge [6]. There were two separate tracks, and we participated in both. The goal of the first track was to achieve the highest fidelity in terms of the PSNR and the SSIM, while the goal of the second track was to reconstruct RGB images with the best perceptual quality. In the second track, the perceptual quality was evaluated by the mean opinion score (MOS). To find the best model with low generalisation error, we used the provided 1,204 images to validate the model during training. For the fidelity track, we stopped training early at the highest PSNR value on the validation dataset. For the perceptual track, we trained the last level for 32 epochs and stopped training early at the lowest learned perceptual image patch similarity (LPIPS) [18] value on the validation dataset. All models were implemented with PyTorch 1.5.0 and were trained on four NVIDIA V100 GPUs with 16 GB of memory each. We applied an 8x self-ensemble of 90-degree rotations and horizontal/vertical flips at test time (a sketch is given below). We ranked 2nd for both the PSNR and the SSIM on the fidelity track, and 1st for the SSIM on the perceptual track, as reported in Table 2. The challenge results demonstrate the competitive performance of the proposed method.
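A minimal sketch of the 8x self-ensemble used at test time: the input is flipped and rotated by multiples of 90 degrees, each variant is passed through the model, the inverse transform is applied to the outputs, and the results are averaged. The model is assumed to map an (N, 4, H/2, W/2) packed RAW tensor to an (N, 3, H, W) RGB tensor.

```python
import torch

@torch.no_grad()
def self_ensemble(model, x: torch.Tensor) -> torch.Tensor:
    """Average model outputs over 8 flip/rotation variants of the input."""
    outputs = []
    for flip in (False, True):
        x_f = torch.flip(x, dims=[-1]) if flip else x
        for k in range(4):                                   # 0, 90, 180, 270 degree rotations
            y = model(torch.rot90(x_f, k, dims=[-2, -1]))
            y = torch.rot90(y, -k, dims=[-2, -1])            # undo the rotation
            if flip:
                y = torch.flip(y, dims=[-1])                 # undo the flip
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```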

5 Conclusion

We propose PyNET-CA, an end-to-end deep learning model for RAW to RGB image reconstruction. The model enhances the PyNET structure with channel attention and a subpixel reconstruction module to improve its performance and reduce training time. Comparative experiments and results from the AIM 2020 learned smartphone ISP challenge demonstrate the competitive performance of the proposed model.

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [2016-0-00562(R0124-16-0002), Emotional Intelligence Technology to Infer Human Emotion and Carry on Dialogue Accordingly].

References

  • [1] Blau, Y., Michaeli, T.: The perception-distortion tradeoff. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6228–6237 (2018)
  • [2] Deng, X., Yang, R., Xu, M., Dragotti, P.L.: Wavelet domain style transfer for an effective perception-distortion tradeoff in single image super-resolution. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3076–3085 (2019)
  • [3] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2015)
  • [4] Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
  • [5] Ignatov, A., Kobyshev, N., Timofte, R., Vanhoey, K., Van Gool, L.: DSLR-quality photos on mobile devices with deep convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3277–3285 (2017)
  • [6] Ignatov, A., Timofte, R., et al.: AIM 2020 challenge on learned image signal processing pipeline. In: European Conference on Computer Vision Workshops (2020)
  • [7] Ignatov, A., Van Gool, L., Timofte, R.: Replacing mobile camera ISP with a single deep learning model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 536–537 (2020)
  • [8] Kim, J., Kwon Lee, J., Mu Lee, K.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1646–1654 (2016)
  • [9] Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4681–4690 (2017)
  • [10] Lim, B., Son, S., Kim, H., Nah, S., Mu Lee, K.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 136–144 (2017)
  • [11] Mei, K., Li, J., Zhang, J., Wu, H., Li, J., Huang, R.: Higher-resolution network for image demosaicing and enhancing. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 3441–3448. IEEE (2019)
  • [12] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
  • [13] Schwartz, E., Giryes, R., Bronstein, A.M.: DeepISP: Toward learning an end-to-end image processing pipeline. IEEE Transactions on Image Processing 28(2), 912–923 (2018)
  • [14] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1874–1883 (2016)
  • [15] Smith, L.N., Topin, N.: Super-convergence: Very fast training of neural networks using large learning rates. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications. vol. 11006, p. 1100612. International Society for Optics and Photonics (2019)
  • [16] Uhm, K.H., Kim, S.W., Ji, S.W., Cho, S.J., Hong, J.P., Ko, S.J.: W-Net: Two-stage U-Net with misaligned data for RAW-to-RGB mapping. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 3636–3642. IEEE (2019)
  • [17] Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: ESRGAN: Enhanced super-resolution generative adversarial networks. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops (2018)
  • [18] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
  • [19] Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 286–301 (2018)
  • [20] Zhang, Y., Tian, Y., Kong, Y., Zhong, B., Fu, Y.: Residual dense network for image super-resolution. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2472–2481 (2018)
  • [21] Zhao, Y., Po, L.M., Zhang, T., Liao, Z., Shi, X., Zhang, Y., Ou, W., Xian, P., Xiong, J., Zhou, C., et al.: Saliency map-aided generative adversarial network for RAW to RGB mapping. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). pp. 3449–3457. IEEE (2019)
  • [22] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)