
Real-World Image Super Resolution via Unsupervised Bi-directional Cycle Domain Transfer Learning based Generative Adversarial Network

Xiang Wang, Yimin Yang, Zhichang Guo, Zhili Zhou, Yu Liu, Qixiang Pang, Shan Du. X. Wang is with the Dept. of Computer Science, Lakehead University, Thunder Bay, ON, Canada, e-mail: [email protected]. Y. Yang is with the Dept. of Electrical and Computer Engineering, Western University, London, ON, Canada, e-mail: [email protected]. Z. Guo is with the School of Mathematics, Harbin Institute of Technology, Heilongjiang, China, e-mail: [email protected]. Z. Zhou is with the School of Computer and Software, Nanjing University of Information Science and Technology, Jiangsu, China, e-mail: [email protected]. Y. Liu is with the School of Microelectronics, Tianjin University, Tianjin, China, e-mail: [email protected]. Q. Pang is with Okanagan College, Kelowna, BC, Canada, e-mail: [email protected]. S. Du is with the Dept. of Computer Science, Mathematics, Physics & Statistics, The University of British Columbia Okanagan, BC, Canada, e-mail: [email protected]
Abstract

Deep Convolutional Neural Networks (DCNNs) have exhibited impressive performance on image super-resolution tasks. However, these deep learning-based super-resolution methods perform poorly in real-world super-resolution tasks, where paired high-resolution (HR) and low-resolution (LR) images are unavailable and the LR images are degraded by complicated and unknown kernels. To overcome these limitations, we propose the Unsupervised Bi-directional Cycle Domain Transfer Learning-based Generative Adversarial Network (UBCDTL-GAN), which consists of an Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN) and a Semantic Encoder guided Super Resolution Network (SESRN). First, the UBCDTN produces an approximated real-like LR image by transferring the LR image from an artificially degraded domain to the real-world LR image domain. Second, the SESRN super-resolves the approximated real-like LR image to a photo-realistic HR image. Extensive experiments on unpaired real-world image benchmark datasets demonstrate that the proposed method achieves superior performance compared to state-of-the-art methods.

Index Terms:
Deep convolutional neural networks, Image super-resolution, Unsupervised manner, Real-world scenes


Figure 1: The overview of the proposed UBCDTL-GAN. In the first stage (left), the green dotted rectangle represents the Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN). The red path indicates the forward cycle module. Given the input HR image $I^{HR}$, $I^{LR}_{degraded}$ is the artificially degraded LR image, and $\widehat{I}^{LR}_{real}$ is the real-like LR image generated by the generator $G_A$. $\widehat{I}^{LR}_{recon}$ represents the reconstruction of $I^{LR}_{degraded}$ produced by another generator $G_B$ from $\widehat{I}^{LR}_{real}$, and $\widehat{I}^{LR}_{idt}$ is produced by $G_B$ from the original $I^{LR}_{degraded}$. In addition, $L^{adv}_{D_B}$, $L^{idt}_{degraded}$, $L^{cyc}_{G_B}$ and $L^{percep}_{FE_A}$, depicted by red dotted lines, represent the adversarial loss, identity loss, cycle-consistency loss and cycle-perceptual loss of the forward cycle module; $D_B$ is its discriminator and $FE_A$ its feature extractor. Symmetrically, the blue path shows the backward cycle module, where $I^{LR}_{real}$ is provided by a real-world dataset and the synthesized LR image $I^{LR}_{syn}$ is generated by $G_B$. Moreover, $G_A$ translates $I^{LR}_{syn}$ back to the reconstructed real-world LR image $I^{LR}_{recon}$ and generates the identity real-world LR image $I^{LR}_{idt}$. The blue dotted lines represent the adversarial loss $L^{adv}_{D_A}$, identity loss $L^{idt}_{real}$, cycle-consistency loss $L^{cyc}_{G_A}$ and cycle-perceptual loss $L^{percep}_{FE_B}$ of the backward cycle module; $D_A$ is its discriminator and $FE_B$ its feature extractor. In the second stage (right), the framework of the Semantic Encoder guided Super Resolution Network (SESRN) is depicted in the yellow dotted rectangle; it consists of the semantic encoder $SE$, the generator $G_{SR}$, the joint discriminator $D_{SR}$ and the content extractor $\phi$. There are two paths in the SESRN: the red path indicates the processing of the real tuple and the blue path that of the fake tuple. $I^{SR}$ is the SR image from $G_{SR}$. Furthermore, $SE(\cdot)$ denotes the embedded semantics obtained from $SE$, $D(\cdot)$ represents the output probability of $D_{SR}$, and $\phi(\cdot)$ describes the features learned by $\phi$.

I Introduction

Single Image Super-Resolution (SISR) aims to reconstruct a High-Resolution (HR) image from a single Low-Resolution (LR) image. It has been widely applied in many computer vision applications, such as surveillance [1], image enhancement [2] and medical image processing [3]. In the SISR task, the general degradation formula is expressed as:

$y = (x \otimes k) \downarrow_s + n$ (1)

where $x$ represents the HR image, $y$ is the degraded LR image, $k$ denotes a blur kernel, and $\otimes$ is the convolution operation performed on $x$ and $k$. $\downarrow_s$ denotes a downsampling operation with scaling factor $s$, and $n$ is additive white Gaussian noise. However, in real-world settings, $k$ is unknown and $n$ must account for many possible conditions such as sensor noise, compression artifacts, and unpredictable noise introduced by physical devices. Owing to the uncertain kernel $k$ and noise $n$, SISR is a particularly ill-posed inverse problem: infinitely many HR images can be recovered from a given LR image, and the most plausible solution must be selected.
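For intuition, the degradation in Eq. (1) can be simulated in a few lines of PyTorch. This is a minimal sketch assuming an isotropic Gaussian blur, bicubic downsampling, and additive white Gaussian noise; the kernel size, sigma, and noise level are illustrative choices, whereas the real-world $k$ and $n$ are unknown, which is precisely the gap addressed in this paper.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize: int = 21, sigma: float = 2.0) -> torch.Tensor:
    # Isotropic Gaussian blur kernel (an assumed k; the real-world kernel is unknown).
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    k = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def degrade(x: torch.Tensor, scale: int = 4, sigma: float = 2.0,
            noise_std: float = 0.01) -> torch.Tensor:
    """y = (x conv k) downsampled by `scale` + n, following Eq. (1)."""
    c = x.shape[1]
    k = gaussian_kernel(sigma=sigma).to(x)
    k = k.view(1, 1, *k.shape).repeat(c, 1, 1, 1)                  # one kernel per channel
    blurred = F.conv2d(x, k, padding=k.shape[-1] // 2, groups=c)   # x ⊗ k
    lr = F.interpolate(blurred, scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)        # ↓ s
    return lr + noise_std * torch.randn_like(lr)                   # + n

hr = torch.rand(1, 3, 128, 128)   # toy HR image
lr = degrade(hr)                  # -> (1, 3, 32, 32) synthetic LR image
```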

Recently, a great number of deep learning-based SISR models have been proposed, such as RDN [4], EDSR [5], SRDenseNet [6], and ESRGAN [7]. Despite the progress achieved by these methods, issues remain. These models are trained in a supervised manner on a large number of paired images (synthetic LR images produced by a pre-determined degradation and their HR counterparts), which leads to deteriorated performance when they are applied to real-world scenarios. The reason is that paired LR-HR data is unavailable and the degradation of the input LR image is unknown in real-world scenes. Moreover, it is unreasonable to simply apply LR images downsampled by an ideally fixed kernel to real-world SR problems [8, 9]. There exists a large domain gap between real-world LR images and artificially synthesized LR images. In addition, the synthesized LR images may lack the diverse patterns and complicated characteristics of real-world LR images, such as sensor noise and natural characteristics. Thus, existing SR methods usually encounter a serious domain consistency problem and produce poor performance in practical scenarios [10, 11].

Therefore, it is imperative to explore an effective method that can exploit unpaired images to satisfy the needs of real-world SR scenarios. Such a method must differ from the aforementioned SR methods, which do not take into consideration the domain gap between LR images generated by a known degradation (e.g., bicubic downscaling) and real-world LR images. To address the above limitations, we propose an Unsupervised Bi-directional Cycle Domain Transfer Learning-based Generative Adversarial Network (UBCDTL-GAN) for real-world image super-resolution, which is composed of two networks: the Unsupervised Bi-directional Cycle Domain Transfer Network (UBCDTN) and the Semantic Encoder guided Super Resolution Network (SESRN). To simulate the real-world data distribution and reduce the domain gap between the generated LR image domain and the real-world LR image domain, we design the UBCDTN to estimate the inherent degradation kernel of the real-world LR distribution and translate images from the artificially degraded LR domain to the real-world domain. With the help of the cycle consistency mechanism [12], the proposed UBCDTN learns bi-directional inverse mappings in an unsupervised manner, which ensures that the generated real-like LR images preserve the desired characteristics of real-world patterns. Besides, we also enforce auxiliary constraints on the UBCDTN, namely an adversarial loss, an identity loss, and a cycle-perceptual loss. The designed domain transfer network provides an effective way to generate real-like LR images, which form paired real-world-like LR-HR data for the subsequent SESRN. In the second step, we design the Semantic Encoder guided Super Resolution Network (SESRN). The goal of the SESRN is to super-resolve the real-like LR images to photo-realistic HR images. We emphasize that we apply the architecture and optimization strategy of our previously proposed GAFH-RIDN [13] to the generator and discriminator of the SESRN. We evaluate our method on the NTIRE 2020 Super Resolution Challenge Track 1 validation dataset. The quantitative and qualitative comparisons demonstrate the superiority of the proposed UBCDTL-GAN over other state-of-the-art methods. A comparison of visual results with various recent methods is shown in Fig. 4, and the numeric results can be seen in Table I. The main contributions of our proposed method can be summarized as follows:

1) We propose a novel bi-directional cycle domain transfer network, UBCDTN. According to the domain transfer learning scheme, the designed bi-directional cycle architecture is able to eliminate the domain gap between the artificially degraded LR images and real-world LR images in an unsupervised manner.

2) We further impose the auxiliary constraints on UBCDTN by incorporating adversarial loss, identity loss, and cycle-perceptual loss, which can guarantee that the generated real-like LR images contain the same style as real-world images.

3) We design the SESRN as a deep super-resolution network to generate visually pleasant SR results under supervised learning settings.

4) Benefiting from the collaborative training strategy, the proposed UBCDTL-GAN is able to be trained in an end-to-end fashion, which can ease the entire training procedure and strengthen the robustness of the model.


Figure 2: The proposed SESRN and its components: the semantic encoder $SE$, the generator $G_{SR}$, the joint discriminator $D_{SR}$ and the content extractor $\phi$. For $D_{SR}$, ESLDSN represents the Embedded Semantics-Level Discriminative Sub-Net, ILDSN represents the Image-Level Discriminative Sub-Net, and FCM denotes the Fully Connected Module. The generator $G_{SR}$ has three stages: Shallow Feature Module (SFM), Multi-level Dense Block Module (MDBM), and Upsampling Module (UM). $I^{HR}$ and $\widehat{I}^{LR}_{real}$ denote HR images and real-like LR images respectively, and $I^{SR}$ denotes SR images from $G_{SR}$. Furthermore, $SE(\cdot)$ denotes the embedded semantics obtained from $SE$, $D_{SR}(\cdot)$ represents the output probability of $D_{SR}$, and $\phi(I^{HR})$ and $\phi(I^{SR})$ describe the features learned by the content extractor $\phi$.

II Related Work

In this section, we first review the paired image super-resolution methods. Then, we introduce blind and unsupervised learning methods for real-world scenarios.

II-A Paired Image Super Resolution

In recent years, deep learning-based methods have exhibited an exceptional ability to enhance SISR performance. However, most of these methods rely on supervised settings, where specific pre-defined, paired LR and HR images are required during training. The pioneering method, SRCNN, was introduced by Dong et al. [14]. It recovers HR images from bicubic-downsampled LR images in an end-to-end manner. Moreover, the enhanced deep SR method EDSR [5] was proposed by Lim et al., which makes the network considerably deeper and wider and thus learns richer feature representations. Later, generative adversarial learning was introduced into SISR. SRGAN [15], proposed by Ledig et al., incorporates an adversarial loss and a perceptual loss to update the generator and discriminator respectively, yielding photo-realistic SR images. Furthermore, ESRGAN [7] introduces a new generator architecture and adopts an enhanced adversarial loss, achieving promising visual quality.

II-B Blind Image Super Resolution

Blind SISR is defined as the task in which the LR image is assumed to be produced from its HR version by an unknown degradation kernel. The Iterative Kernel Correction (IKC) method [16] was proposed by Gu et al. to estimate the blur kernel and eliminate artifacts caused by kernel discrepancy. Zhang et al. [17] proposed IRCNN, which includes a set of CNN denoisers to estimate the blur model. However, the results of IRCNN indicate that it is difficult to estimate a comprehensive degradation kernel under real-world conditions. Blind SR methods have limited capability to approximate unknown degradations. Thus, there is still room for improvement in blind image super-resolution.

II-C Unpaired Super Resolution

Recently, many unsupervised methods have been proposed for real-world conditions, where paired LR-HR image data is unavailable and the input image is corrupted by unknown degradation kernels. By exploiting the internal recurrence of information inside an image, ZSSR [18] can use pseudo image pairs to recover LR images with diverse blur kernels. Inspired by CycleGAN, Yuan et al. [10] proposed CinCGAN, which first generates bicubic-downsampled LR images from the input and then super-resolves them to HR images. However, CinCGAN only considers a single bicubic degradation, resulting in poor generalization to complicated real-world SR tasks. By contrast, our method utilizes the UBCDTN to simulate diverse real-world degradation types, enabling it to perform well in the real world.

III Overview

We divide the unsupervised super-resolution problem into two stages while keeping the whole pipeline end-to-end. The proposed method comprises: 1) unsupervised image domain translation between real-world LR images and artificially degraded LR images; 2) semantic encoder guided generative adversarial super-resolution, in a supervised manner, from the generated real-like LR images to the final HR images. To obtain the artificially degraded LR image $I^{LR}_{degraded}$, we perform bicubic downscaling on the HR image $I^{HR}$ with an image degradation block.

In the first stage, the proposed UBCDTN generates real-like LR images belonging to the real-world domain from the artificially degraded LR images. It aims to transfer the artificially degraded images to the real-world domain, which ensures that the generated real-like LR images carry similar real-world LR patterns. After the domain translation performed by the UBCDTN, we obtain the real-like LR image $\widehat{I}^{LR}_{real}$, which is treated as the input of the subsequent SESRN. In the second stage, we utilize the SESRN to learn the super-resolution mapping from the real-like LR space to the HR space. We further train the SESRN with an adversarial loss, a content loss, and a pixel-wise loss to generate photo-realistic SR images.
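At a high level, inference with the full pipeline reduces to chaining the two stages. The sketch below is illustrative only: `g_a` and `g_sr` are placeholders for the trained UBCDTN generator $G_A$ and SESRN generator $G_{SR}$, whose actual architectures are described in the following sections.

```python
import torch.nn as nn

class UBCDTLGANPipeline(nn.Module):
    """Two-stage inference: degraded LR -> real-like LR (G_A) -> SR image (G_SR)."""

    def __init__(self, g_a: nn.Module, g_sr: nn.Module):
        super().__init__()
        self.g_a = g_a      # stage 1: domain-transfer generator of the UBCDTN
        self.g_sr = g_sr    # stage 2: super-resolution generator of the SESRN

    def forward(self, lr_degraded):
        lr_real_like = self.g_a(lr_degraded)   # approximated real-like LR image
        sr = self.g_sr(lr_real_like)           # photo-realistic SR image
        return sr, lr_real_like
```

During training, the two stages are first pre-trained separately and then optimized jointly, as detailed in Section IV-B.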

III-A Unsupervised Bi-directional Cycle Domain Transfer Network

The designed UBCDTN translates images from the source domain $I^{LR}_{degraded}$ to the target domain $I^{LR}_{real}$. It provides an effective bi-directional cycle solution to reduce the domain gap between $I^{LR}_{degraded}$ and $I^{LR}_{real}$. The forward-cycle module contains $G_A$, which aims to learn the mapping $I^{LR}_{degraded} \rightarrow \widehat{I}^{LR}_{real}$. The goal of $G_B$ in the backward-cycle module is to learn another mapping, $I^{LR}_{real} \rightarrow I^{LR}_{syn}$.

As shown in Fig. 1, the forward-cycle module comprises the generator $G_A$, the discriminator $D_B$ and the feature extractor $FE_A$. The backward-cycle module consists of $G_B$, $D_A$, and $FE_B$; these networks have the same architectures as the corresponding components of the forward-cycle module but serve different purposes. We utilize U-Net as the basic architecture for $G_A$ and $G_B$. $D_A$ and $D_B$ are standard convolutional neural networks, each composed of nine convolutional layers followed by Batch Normalization (BN) layers and Leaky ReLU layers. A pre-trained VGG19 is exploited as the feature extractors $FE_A$ and $FE_B$.

The UBCDTN trains the two generators simultaneously, and the two generators are expected to realize mutually inverse translations in both directions. We involve adversarial learning in both modules, where $G_A$ and $G_B$ are trained against the discriminators $D_B$ and $D_A$ respectively. However, as mentioned before, the SISR task is an ill-posed problem in which many possible SR images can be reconstructed from one given LR input. Thus, applying the adversarial loss alone to train the UBCDTN leads to mode collapse and fails to produce generated images with the same characteristics and patterns as the target domain images. It is necessary to impose regularization on the UBCDTN to improve the quality of the translated images. Thus, we employ a cycle-consistency constraint to guarantee the domain correlation between ($\widehat{I}^{LR}_{real}$, $I^{LR}_{real}$) and ($I^{LR}_{syn}$, $I^{LR}_{degraded}$). In addition, in order to avoid color variation, an identity loss is applied to preserve the color composition between the input and output images [12]. Moreover, we introduce a cycle-perceptual loss as an additional constraint to preserve sharper edges and finer structures in the reconstructed images. The following sections provide more details of the forward-cycle and backward-cycle modules.

III-A1 Forward-cycle Module

The forward-cycle module contains the generator $G_A$, the discriminator $D_B$ and the feature extractor $FE_A$. The goal of $G_A$ is to map the artificially degraded LR image to the target domain, i.e., $\widehat{I}^{LR}_{real} = G_A((I^{LR}_{degraded})_i)$. The difficulty of the unpaired condition can be alleviated by adding additional constraints. We employ cycle consistency to guarantee that the relevant content in the generated $\widehat{I}^{LR}_{real}$ is preserved, i.e., $G_B(G_A((I^{LR}_{degraded})_i)) \approx I^{LR}_{degraded}$. In other words, the generated images of $G_A$ and $G_B$ should be cycle-consistent with each other. Thus, the forward cycle consistency is established as $I^{LR}_{degraded} \rightarrow G_A(I^{LR}_{degraded}) \rightarrow G_B(G_A(I^{LR}_{degraded})) \approx I^{LR}_{degraded}$. The forward cycle-consistency loss $L^{cyc}_{G_B}$ is defined as follows:

$\widehat{I}^{LR}_{recon} = G_B(G_A((I^{LR}_{degraded})_i))$ (2)
$L^{cyc}_{G_B} = \frac{1}{N}\sum_{i}^{N} \| (\widehat{I}^{LR}_{recon})_i - (I^{LR}_{degraded})_i \|_1$ (3)

where $N$ is the number of training images. As Eqn. 2 shows, with the help of $G_B$, $\widehat{I}^{LR}_{recon}$ should be nearly identical to the input $I^{LR}_{degraded}$. Through the cycle consistency mechanism, $I^{LR}_{degraded}$ in the source domain can be reconstructed after applying $G_A$ and $G_B$ to $I^{LR}_{degraded}$ in turn. In addition, we apply adversarial losses to $G_A$ and $D_B$ such that the distribution of $\widehat{I}^{LR}_{real}$ becomes indistinguishable from the real distribution of $I^{LR}_{real}$. We apply the Relativistic average GAN (RaGAN) [19] to train $G_A$ and $D_B$. $G_A$ receives the input $I^{LR}_{degraded}$ and generates $\widehat{I}^{LR}_{real}$ to fool $D_B$. The discriminator $D_B$ aims to predict the probability that the provided real image $I^{LR}_{real}$ is more realistic than the generated fake image $\widehat{I}^{LR}_{real}$. The input of $D_B$ contains the real data $I^{LR}_{real}$ and the fake data $\widehat{I}^{LR}_{real}$. This can be summarized as follows:

$D(X_{real}) = \sigma(C(I^{LR}_{real}) - E_{x_f}[C(\widehat{I}^{LR}_{real})])$ (4)
$D(X_{fake}) = \sigma(C(\widehat{I}^{LR}_{real}) - E_{x_r}[C(I^{LR}_{real})])$ (5)

where $\sigma$ denotes the sigmoid non-linearity, $C$ denotes the discriminator without the final sigmoid layer, $E_{x_f}[\cdot]$ and $E_{x_r}[\cdot]$ are averages over the fake samples $X_{fake}$ and the real samples $X_{real}$ in the training batch respectively, and $D(X_{real})$, $D(X_{fake})$ are the probabilities, predicted by the discriminator $D_B$, that $I^{LR}_{real}$ and $\widehat{I}^{LR}_{real}$ are real. The adversarial losses for $G_A$ and $D_B$ are expressed as $L^{adv}_{G_A}$ and $L^{adv}_{D_B}$ respectively:

$L^{adv}_{G_A} = -\mathbb{E}_{I^{LR}_{real} \sim p(I^{LR}_{real})}[\log(1 - D(X_{real}))] - \mathbb{E}_{\widehat{I}^{LR}_{real} \sim p(\widehat{I}^{LR}_{real})}[\log(D(X_{fake}))]$ (6)
$L^{adv}_{D_B} = -\mathbb{E}_{I^{LR}_{real} \sim p(I^{LR}_{real})}[\log(D(X_{real}))] - \mathbb{E}_{\widehat{I}^{LR}_{real} \sim p(\widehat{I}^{LR}_{real})}[\log(1 - D(X_{fake}))]$ (7)

where $I^{LR}_{real} \sim p(I^{LR}_{real})$ and $\widehat{I}^{LR}_{real} \sim p(\widehat{I}^{LR}_{real})$ indicate that the real image $I^{LR}_{real}$ and the fake image $\widehat{I}^{LR}_{real}$ follow their respective distributions. We further introduce the forward identity loss $L^{idt}_{degraded}$ to maintain the color composition between $\widehat{I}^{LR}_{idt}$ and $I^{LR}_{degraded}$. $\widehat{I}^{LR}_{idt}$ is generated by $G_B$, as Eqn. 8 shows, and $L^{idt}_{degraded}$ is expressed in Eqn. 9:

$\widehat{I}^{LR}_{idt} = G_B((I^{LR}_{degraded})_i)$ (8)
$L^{idt}_{degraded} = \frac{1}{N}\sum_{i}^{N} \| (\widehat{I}^{LR}_{idt})_i - (I^{LR}_{degraded})_i \|_1$ (9)

Moreover, in order to minimize the perceptual divergence between $\widehat{I}^{LR}_{recon}$ and $I^{LR}_{degraded}$, we utilize $FE_A$ to extract VGG feature maps for the cycle-perceptual loss. It is defined as follows:

$L^{percep}_{FE_A} = \frac{1}{N}\sum_{i}^{N} \| FE_{q,r}((\widehat{I}^{LR}_{recon})_i) - FE_{q,r}((I^{LR}_{degraded})_i) \|_2$ (10)

where $FE_{q,r}(\cdot)$ indicates the feature maps extracted from the $q$-th convolution layer (after activation) before the $r$-th max-pooling layer. The total objective loss for the forward cycle module, $L^{Forward}_{total}$, is a weighted sum of the four loss functions:

$L^{Forward}_{total} = \omega_1 L^{adv}_{G_A} + \omega_2 L^{cyc}_{G_B} + \omega_3 L^{idt}_{degraded} + \omega_4 L^{percep}_{FE_A}$ (11)

where the hyper-parameters $\omega_1$, $\omega_2$, $\omega_3$, $\omega_4$ are trade-off factors for each loss. We empirically set $\omega_1$, $\omega_2$, $\omega_3$, and $\omega_4$ to 1, 10, 1 and 1 respectively. A loss with a larger weight accounts for a larger proportion of the training objective.
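Putting Eqs. (2)-(11) together, the forward-cycle objective can be sketched as follows. The RaGAN terms are implemented with binary cross-entropy on the relativistic logits for numerical stability, the perceptual term uses a mean-squared feature distance as a stand-in for the $\ell_2$ norm in Eq. (10), and `vgg_features` is a placeholder for the truncated VGG19 extractor $FE_A$; the module names and the omission of detaching for the discriminator update are simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def ragan_losses(c_real, c_fake):
    """Relativistic average GAN losses (Eqs. 4-7).
    c_real / c_fake are raw discriminator outputs C(.) without the sigmoid."""
    d_real = c_real - c_fake.mean()   # logit of D(X_real), Eq. (4)
    d_fake = c_fake - c_real.mean()   # logit of D(X_fake), Eq. (5)
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_fake)
    # Generator loss, Eq. (6): push D(X_real) -> 0 and D(X_fake) -> 1.
    loss_g = (F.binary_cross_entropy_with_logits(d_real, zeros)
              + F.binary_cross_entropy_with_logits(d_fake, ones))
    # Discriminator loss, Eq. (7): push D(X_real) -> 1 and D(X_fake) -> 0.
    loss_d = (F.binary_cross_entropy_with_logits(d_real, ones)
              + F.binary_cross_entropy_with_logits(d_fake, zeros))
    return loss_g, loss_d

def forward_cycle_loss(g_a, g_b, d_b, vgg_features,
                       lr_degraded, lr_real, w=(1.0, 10.0, 1.0, 1.0)):
    """Weighted forward-cycle objective of Eq. (11); returns (generator-side total, D_B loss)."""
    fake_real = g_a(lr_degraded)                     # I_hat^LR_real
    recon = g_b(fake_real)                           # I_hat^LR_recon, Eq. (2)
    idt = g_b(lr_degraded)                           # I_hat^LR_idt,  Eq. (8)

    loss_adv_g, loss_adv_d = ragan_losses(d_b(lr_real), d_b(fake_real))
    loss_cyc = F.l1_loss(recon, lr_degraded)         # Eq. (3)
    loss_idt = F.l1_loss(idt, lr_degraded)           # Eq. (9)
    loss_percep = F.mse_loss(vgg_features(recon),    # Eq. (10), squared L2 for simplicity
                             vgg_features(lr_degraded))

    total = (w[0] * loss_adv_g + w[1] * loss_cyc
             + w[2] * loss_idt + w[3] * loss_percep)
    return total, loss_adv_d
```

The backward-cycle objective of Eq. (21) is obtained symmetrically by swapping the roles of $G_A$/$G_B$, $D_B$/$D_A$ and $FE_A$/$FE_B$.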

III-A2 Backward-cycle Module

To transfer images from the target domain back to the source domain, i.e., $I^{LR}_{real} \rightarrow I^{LR}_{syn}$, we construct a backward-cycle module in which the generator $G_B$ learns the mapping $G_B(I^{LR}_{real}) \approx I^{LR}_{syn}$. The learned mapping implicitly re-expresses the target domain image $I^{LR}_{real}$ as the source domain image $I^{LR}_{syn}$. The backward-cycle module consists of the generator $G_B$, the discriminator $D_A$ and the feature extractor $FE_B$. Several constraints are required in the backward-cycle module. First, we symmetrically design the backward cycle-consistency constraint, i.e., $G_A(G_B((I^{LR}_{real})_i)) \approx I^{LR}_{real}$, which further forces the contents of the generated images to remain relevant to the inputs. Thus, the backward cycle is composed of $I^{LR}_{real} \rightarrow G_B(I^{LR}_{real}) \rightarrow G_A(G_B(I^{LR}_{real})) \approx I^{LR}_{real}$. The backward cycle-consistency loss is defined as follows:

$I^{LR}_{recon} = G_A(G_B((I^{LR}_{real})_i))$ (12)
$L^{cyc}_{G_A} = \frac{1}{N}\sum_{i}^{N} \| (I^{LR}_{recon})_i - (I^{LR}_{real})_i \|_1$ (13)

As Eqn. 12 shows, owing to the formulated backward cycle scheme, $G_B$ and $G_A$ are capable of inverting the reconstructed image $I^{LR}_{recon}$ back to $I^{LR}_{real}$. Eqn. 13 minimizes the discrepancy between $I^{LR}_{recon}$ and $I^{LR}_{real}$ so that consistent characteristics can be preserved. Similarly to the forward cycle module, we also adopt adversarial learning in the backward cycle module, where the generator $G_B$ and discriminator $D_A$ are optimized with RaGAN. The adversarial losses $L^{adv}_{G_B}$ and $L^{adv}_{D_A}$ are defined as:

$D(X_{real}) = \sigma(C(I^{LR}_{degraded}) - E_{x_f}[C(I^{LR}_{syn})])$ (14)
$D(X_{fake}) = \sigma(C(I^{LR}_{syn}) - E_{x_r}[C(I^{LR}_{degraded})])$ (15)
$L^{adv}_{G_B} = -\mathbb{E}_{I^{LR}_{degraded} \sim p(I^{LR}_{degraded})}[\log(1 - D(X_{real}))] - \mathbb{E}_{I^{LR}_{syn} \sim p(I^{LR}_{syn})}[\log(D(X_{fake}))]$ (16)
$L^{adv}_{D_A} = -\mathbb{E}_{I^{LR}_{degraded} \sim p(I^{LR}_{degraded})}[\log(D(X_{real}))] - \mathbb{E}_{I^{LR}_{syn} \sim p(I^{LR}_{syn})}[\log(1 - D(X_{fake}))]$ (17)

According to the adversarial loss, $G_B$ can generate the desired $I^{LR}_{syn}$ under the supervision of $D_A$. In order to avoid color variation between $I^{LR}_{idt}$ and $I^{LR}_{real}$, we further define the backward identity loss $L^{idt}_{real}$. $I^{LR}_{idt}$ is generated by $G_A$, as Eqn. 18 shows, and $L^{idt}_{real}$ is defined in Eqn. 19:

$I^{LR}_{idt} = G_A((I^{LR}_{real})_i)$ (18)
$L^{idt}_{real} = \frac{1}{N}\sum_{i}^{N} \| (I^{LR}_{idt})_i - (I^{LR}_{real})_i \|_1$ (19)

Moreover, the backward cycle-perceptual loss $L^{percep}_{FE_B}$ is calculated to recover visually pleasing details. We utilize $FE_B$ to measure the Euclidean distance between the feature maps of $I^{LR}_{recon}$ and $I^{LR}_{real}$. It is defined as follows:

$L^{percep}_{FE_B} = \frac{1}{N}\sum_{i}^{N} \| FE_{q,r}((I^{LR}_{recon})_i) - FE_{q,r}((I^{LR}_{real})_i) \|_2$ (20)

In the end, the total optimization loss $L^{Backward}_{total}$ for the backward cycle module can be formulated as:

$L^{Backward}_{total} = \lambda_1 L^{adv}_{G_B} + \lambda_2 L^{cyc}_{G_A} + \lambda_3 L^{idt}_{real} + \lambda_4 L^{percep}_{FE_B}$ (21)

where the hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ are the corresponding weights for $L^{adv}_{G_B}$, $L^{cyc}_{G_A}$, $L^{idt}_{real}$ and $L^{percep}_{FE_B}$. $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are empirically set to 1, 10, 1 and 1 respectively.

III-A3 Total Unsupervised Bi-directional Cycle Domain Transfer Network Loss

The full optimization objective of the UBCDTN, $L^{UBCDTN}_{total}$, is the sum of the forward cycle module loss $L^{Forward}_{total}$ and the backward cycle module loss $L^{Backward}_{total}$:

$L^{UBCDTN}_{total} = L^{Forward}_{total} + L^{Backward}_{total}$ (22)


Figure 3: Red dotted rectangle: the architecture of the generator. Blue dotted rectangle: the architecture of the joint discriminator. $F_{SF}$ denotes shallow features, $F_{MDBM}$ denotes the outputs of the MDBM, $F_{GF}$ represents global features, and $F_{MHF}$ represents multiple hierarchical features. K, n and s are the kernel size, number of filters, and strides respectively. N is the number of neurons in the dense layer.

III-B Semantic Encoder Guided Super Resolution Network

In this section, we present the proposed SESRN, which aims to generate the desired $I^{SR}$ from the $\widehat{I}^{LR}_{real}$ produced by the UBCDTN. First, we describe the four main components of the SESRN: the generator $G_{SR}$, the semantic encoder $SE$, the joint discriminator $D_{SR}$, and the content extractor $\phi$. Second, we describe the optimization loss functions for the SESRN. The architecture of the SESRN is illustrated in Fig. 2. In addition, the structures of the joint discriminator and generator proposed in GAFH-RIDN [13] are shown in Fig. 3.

III-B1 Generator

We utilize the Dense Nested Blocks (DNBs) and Residual in Internal Dense Blocks (RIDBs) proposed in our GAFH-RIDN [13] as the basic units to construct $G_{SR}$. The goal of $G_{SR}$ is to super-resolve the real-like LR image $\widehat{I}^{LR}_{real}$ into the SR image $I^{SR}$. The overall super-resolution process is formulated as:

$I^{SR} = G_{SR}(\widehat{I}^{LR}_{real})$ (23)

As shown at the top of Fig. 3, the generator $G_{SR}$ contains three stages: the Shallow Feature Module (SFM), the Multi-level Dense Block Module (MDBM), and the Upsampling Module (UM). Specifically, the MDBM is built from multiple DNBs formed by several RIDBs. Each DNB includes 3 RIDBs cascaded by residual connections and one scaling layer. The designed architecture guarantees that the feature maps of each layer are propagated to all succeeding layers, providing an effective way to extract hierarchical features and alleviating the gradient vanishing problem. We emphasize that Local Residual Learning (LRL) is introduced to make effective use of the local residual features extracted by the RIDBs. In addition, to help the generator take full advantage of hierarchical features, we design Global Residual Learning (GRL) to fuse the shallow features $F_{SF}$ and the global features $F_{GF}$. Overall, benefiting from the proposed architecture [13], $G_{SR}$ is capable of exploiting abundant hierarchical features and super-resolving from the LR space to the HR space. A simplified sketch of this residual/dense structure is given below.
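The sketch below illustrates the residual-in-dense pattern described above; the exact RIDB and DNB configurations (number of convolutions, growth rate, residual scaling factor) follow GAFH-RIDN [13], so the concrete values here are placeholders for illustration only.

```python
import torch
import torch.nn as nn

class RIDB(nn.Module):
    """Residual in Internal Dense Block (simplified): dense convolutions + local residual learning."""
    def __init__(self, channels: int = 64, growth: int = 32, n_layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.LeakyReLU(0.2, inplace=True))
            for i in range(n_layers)])
        self.fuse = nn.Conv2d(channels + n_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))   # dense connectivity
        return x + self.fuse(torch.cat(feats, dim=1))     # local residual learning (LRL)

class DNB(nn.Module):
    """Dense Nested Block: 3 cascaded RIDBs followed by a residual scaling layer."""
    def __init__(self, channels: int = 64, res_scale: float = 0.2):
        super().__init__()
        self.body = nn.Sequential(RIDB(channels), RIDB(channels), RIDB(channels))
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)
```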

III-B2 Semantic Encoder

The proposed semantic encoder extracts embedded semantics (as shown in Fig. 2), which project the visual information (HR, LR) back to the latent space. The motivation is that GAN-based SR models [15, 7, 20] only exploit visual information during the discriminative procedure, ignoring the semantic information reflected by the latent representation. Therefore, the proposed semantic encoder complements this missing critical property. Previous GAN work [21, 22] has shown that semantic representations are beneficial to the discriminator.

Based on this observation, the proposed semantic encoder is designed to inversely map an image to its embedded semantics. Significantly, the most important advantage of the semantic encoder is that it can guide the discriminative process, since the embedded semantics obtained from the semantic encoder reflect semantic attributes such as image features (color, texture, and shape) and the spatial relationships between various components of the images. Notably, the embedded semantics are fed into the joint discriminator along with the HR and LR images, which helps optimize the adversarial process. Thanks to this property, the semantic encoder can guide the discriminative adversarial learning of the joint discriminator, thereby enhancing its discriminative ability.

In this context, we utilize the pre-trained VGG16 [23] network as the semantic encoder. To accommodate the different dimensions of HR and LR images, we adopt two side-by-side semantic encoders, which share the same structure but have different input dimensions, to obtain embedded semantics from different convolutional layers. A possible realization is sketched below.
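One possible realization of the two side-by-side semantic encoders is a pair of frozen, truncated VGG16 feature stacks. The truncation indices below are assumptions chosen so that, for a 4× scale factor, the HR and LR embeddings share the same spatial size; the paper does not prescribe these exact layers.

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights  # torchvision >= 0.13 weights API

def make_semantic_encoder(n_layers: int) -> nn.Module:
    """Frozen VGG16 front-end truncated after `n_layers` of .features."""
    enc = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:n_layers]
    for p in enc.parameters():
        p.requires_grad = False
    return enc.eval()

# Same structure, different depths (illustrative indices): the HR branch is
# downsampled 8x and the LR branch 2x, so for a 4x SR factor both embeddings
# end up with identical spatial resolution.
se_hr = make_semantic_encoder(n_layers=23)   # up to relu4_3 for the HR input
se_lr = make_semantic_encoder(n_layers=9)    # up to relu2_2 for the LR input
```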

III-B3 Joint Discriminator

As shown in Fig. 2, the proposed joint discriminator takes as input a tuple incorporating both visual information and embedded semantics. The Embedded Semantics-Level Discriminative Sub-Net (ESLDSN) identifies whether the embedded semantics come from the HR images, while the Image-Level Discriminative Sub-Net (ILDSN) distinguishes whether the input image comes from the HR image dataset or the generator. Then, the Fully Connected Module (FCM), operating on the concatenated vector, predicts the final probability.

Thanks to this property, the joint discriminator is capable of learning the joint probability distribution of the image data ($I^{HR}$, $I^{SR}$) and the embedded semantics ($SE(I^{HR})$, $SE(\widehat{I}^{LR}_{real})$). To satisfy this structure, we design two sets of paths entering the joint discriminator. The set of red paths in Fig. 2 represents the real tuple, which consists of the real HR image $I^{HR}$ from the dataset and its embedded semantics $SE(I^{HR})$. The set of blue paths indicates the fake tuple, constructed from the SR image $I^{SR}$ generated by the generator and $SE(\widehat{I}^{LR}_{real})$ produced by the semantic encoder from the real-like LR image. Therefore, the joint discriminator can measure the difference between the real tuple ($I^{HR}$, $SE(I^{HR})$) and the fake tuple ($I^{SR}$, $SE(\widehat{I}^{LR}_{real})$).

In addition, we adopt the Relativistic average Least Squares GAN (RaLSGAN) [19] loss for the joint discriminator by applying the relativistic average discriminator (RaD) to the least squares loss function [24]. The real tuple is denoted as $X_{real} = (I^{HR}, SE(I^{HR}))$, and the fake tuple is denoted as $X_{fake} = (I^{SR}, SE(\widehat{I}^{LR}_{real}))$. During training, the joint discriminator receives both $X_{real}$ and $X_{fake}$ as input. This can be expressed as follows:

$\tilde{C}(X_{real}) = C(X_{real}) - E_{x_f}[C(X_{fake})]$ (24)
$\tilde{C}(X_{fake}) = C(X_{fake}) - E_{x_r}[C(X_{real})]$ (25)

where $\tilde{C}(\cdot)$ denotes the output predicted by the joint discriminator. Moreover, the least squares loss is applied to evaluate the distance between HR and SR images. Thus, we define the optimization losses $L^{RaLS}_{D_{SR}}$ and $L^{RaLS}_{G_{SR}}$ for the joint discriminator and the generator respectively:

$L^{RaLS}_{D_{SR}} = \mathbb{E}_{I^{HR} \sim p(I^{HR})}[(\tilde{C}(X_{real}) - 1)^2] + \mathbb{E}_{I^{SR} \sim p(I^{SR})}[(\tilde{C}(X_{fake}) + 1)^2]$ (26)
$L^{RaLS}_{G_{SR}} = \mathbb{E}_{I^{SR} \sim p(I^{SR})}[(\tilde{C}(X_{fake}) - 1)^2] + \mathbb{E}_{I^{HR} \sim p(I^{HR})}[(\tilde{C}(X_{real}) + 1)^2]$ (27)

By taking advantage of the least-squares formulation and the relativistic design of RaLS, the SESRN enhances model stability and produces visually pleasant SR images.
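The RaLSGAN objectives in Eqs. (24)-(27) translate directly into code; `c_real` and `c_fake` below stand for the joint discriminator's raw outputs $C(X_{real})$ and $C(X_{fake})$ on the real and fake tuples, and detaching the fake tuple for the discriminator update is omitted for brevity.

```python
import torch

def rals_losses(c_real: torch.Tensor, c_fake: torch.Tensor):
    """Relativistic average least-squares GAN losses (Eqs. 24-27)."""
    c_tilde_real = c_real - c_fake.mean()   # Eq. (24)
    c_tilde_fake = c_fake - c_real.mean()   # Eq. (25)
    # Discriminator: real tuples pushed toward +1, fake tuples toward -1 (Eq. 26).
    loss_d = ((c_tilde_real - 1) ** 2).mean() + ((c_tilde_fake + 1) ** 2).mean()
    # Generator: fake tuples pushed toward +1, real tuples toward -1 (Eq. 27).
    loss_g = ((c_tilde_fake - 1) ** 2).mean() + ((c_tilde_real + 1) ** 2).mean()
    return loss_g, loss_d
```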

III-B4 Content Extractor

In the SESRN, we further leverage a pre-trained VGG19 network as the content extractor $\phi$ to obtain feature representations, which are used to formulate the content loss $L_{content}$. Specifically, we calculate $L_{content}$ as the Euclidean distance between the feature representations of the SR and HR images. We extract feature representations from the 'Conv3_3' layer of the content extractor, i.e., low-level features before the activation layer. With this loss term, the SESRN is encouraged to reconstruct finer high-frequency details and improve perceptual quality.

III-B5 Loss Function

We introduce the content loss $L_{content}$ to constrain SR images to be faithful to human visual perception. Besides, we also involve a pixel-wise loss $L_{pixel}$ to optimize our method. Furthermore, the adversarial losses $L^{RaLS}_{G_{SR}}$ and $L^{RaLS}_{D_{SR}}$ are applied to $G_{SR}$ and $D_{SR}$ respectively, which allows the generator to produce SR images consistent with the distribution of the HR images.

Content Loss: $L_{content}$ is able to improve the perceptual similarity between SR and HR images. It is formulated as:

$L_{content} = \| \phi_{i,j}(I^{HR}) - \phi_{i,j}(I^{SR}) \|^2_2$ (28)

where $\phi_{i,j}(\cdot)$ represents the feature representations obtained from the $j$-th convolution layer before the $i$-th max-pooling layer of the fixed content extractor.
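As a concrete (assumed) realization, the content extractor can be a frozen VGG19 truncated right after 'Conv3_3' and before its ReLU; the slicing index below relies on the standard torchvision layer ordering and is an assumption rather than the authors' released code.

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights  # torchvision >= 0.13 weights API

# Indices 0..14 of vgg19().features end at conv3_3 before its activation
# in the standard torchvision layout (assumed here).
content_extractor = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:15].eval()
for p in content_extractor.parameters():
    p.requires_grad = False

def content_loss(sr, hr):
    """Feature-space distance between SR and HR images, in the spirit of Eq. (28)."""
    return nn.functional.mse_loss(content_extractor(sr), content_extractor(hr))
```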

Pixel-wise Loss: The pixel-wise loss is widely applied to enforce the intensity similarity between the SR and HR images. It is calculated as:

$L_{pixel} = \frac{1}{N}\sum_{i}^{N} \| (I^{SR})_i - (I^{HR})_i \|_2$ (29)

We use $L_{pixel}$ to minimize the distance between $I^{SR}$ and $I^{HR}$.

Total Loss: Finally, we obtain the total loss function $L^{SESRN}_{total}$ for the SESRN. It is the weighted sum of the three losses discussed above:

$L^{SESRN}_{total} = \lambda_{con} L_{content} + \lambda_{adv} L^{RaLS}_{G_{SR}} + \lambda_{pixel} L_{pixel}$ (30)

where $\lambda_{con}$, $\lambda_{adv}$ and $\lambda_{pixel}$ are empirically set to 1, $10^{-3}$ and 1 respectively.

III-C Full objective loss for UBCDTL-GAN

The full objective loss for the UBCDTL-GAN, which is the combination of $L^{UBCDTN}_{total}$ and $L^{SESRN}_{total}$, is defined as:

$L^{UBCDTL\text{-}GAN}_{total} = L^{UBCDTN}_{total} + L^{SESRN}_{total}$ (31)

The complete objective loss $L^{UBCDTL\text{-}GAN}_{total}$ encourages the proposed UBCDTL-GAN to solve the unpaired real-world image super-resolution problem.

IV Experiments

In this section, we first present the datasets and the experimental details. Then, we compare our method with state-of-the-art SISR methods. The qualitative comparison can be seen in Fig. 4, and the quantitative comparisons are provided in Table I. All quantitative results are cited from the official published papers. For the qualitative comparison, we download the released code and pre-trained models of the compared methods and run them on the same validation dataset to obtain their results.

Figure 4: Qualitative comparison of visual results with state-of-the-art methods on NTIRE 2020 Real-World Track 1 images. Our method produces photo-realistic results.

IV-A Training Data

At the training stage, we conduct experiments on the DF2K dataset [25, 5], which merges the DIV2K and Flickr2K datasets and contains a total of 3450 images. Specifically, for the LR images, we use the real-world LR images from the DIV2K NTIRE 2017 unknown-degradation 4× dataset and the Flickr2K LR images collected from the NTIRE 2020 Real-World Track 1 training source dataset, where all LR images are corrupted with unknown degradation and downsampled 4× by an unknown operator to satisfy real-world conditions, resulting in sensor noise, compression artifacts, etc. Since the goal of our method is to solve an unsupervised super-resolution problem without LR-HR paired images, we select the first 1725 images (numbers 1-1725) from the DF2K HR dataset as our HR training dataset, and the LR training dataset is formed by the other 1725 images (numbers 1726-3450) from the DF2K real-world LR dataset. Overall, our method is trained on such an unpaired real-world LR-HR dataset.

To evaluate the proposed method on real-world data, at the testing stage, we use the validation dataset from the NTIRE 2020 Real-World SR challenge Track 1. This dataset contains 100 testing LR images (scaling factor: 4×), where all LR images are processed with unknown degradation operations to simulate the realistic artifacts and natural characteristics. In order to compare the qualitative and quantitative results fairly, we use the same validation dataset for all experiments.

IV-B Training Setups

The training procedure is divided into three steps. Instead of randomly initializing the model weights, we pre-train the UBCDTN and SESRN in the first and second steps, and then jointly train the whole method in an end-to-end manner. First, we train the UBCDTN with unpaired artificially degraded images $I^{LR}_{degraded}$ and real-world images $I^{LR}_{real}$, aiming to transfer LR images from the artificially degraded LR domain to the real-world LR domain. Second, we pre-train the SESRN using the approximated real-like images $\widehat{I}^{LR}_{real}$ and their HR counterparts $I^{HR}$ to generate realistic super-resolved images $I^{SR}$. For both pre-training stages, we use the same optimization strategy: the Adam optimizer [26] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a learning rate initialized to $10^{-4}$, and a minibatch size of 8. We train the UBCDTN and SESRN for 50K epochs each. After pre-training, both UBCDTN and SESRN have good initialization weights, which benefit training stability and speed. In the third step, we jointly train the two networks. The proposed UBCDTL-GAN takes an unpaired HR image $I^{HR}$ and real-world LR image $I^{LR}_{real}$ as input to produce the final result $I^{SR}$. The UBCDTL-GAN uses the Adam optimizer with an initial learning rate of $10^{-4}$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$ for the whole training process. The model is trained for 100K epochs with a minibatch size of 8.
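The stated optimizer settings map directly to the following configuration; the two `nn.Conv2d` modules are trivial placeholders for the actual UBCDTN and SESRN networks.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the real UBCDTN and SESRN networks.
ubcdtn = nn.Conv2d(3, 3, 3, padding=1)
sesrn = nn.Conv2d(3, 3, 3, padding=1)

# Settings stated above: Adam with beta1=0.9, beta2=0.999, lr=1e-4, minibatch size 8.
betas, lr, batch_size = (0.9, 0.999), 1e-4, 8
opt_ubcdtn = torch.optim.Adam(ubcdtn.parameters(), lr=lr, betas=betas)   # 50K-epoch pre-training
opt_sesrn = torch.optim.Adam(sesrn.parameters(), lr=lr, betas=betas)     # 50K-epoch pre-training

# Joint end-to-end training of the whole UBCDTL-GAN for 100K epochs.
opt_joint = torch.optim.Adam(
    list(ubcdtn.parameters()) + list(sesrn.parameters()), lr=lr, betas=betas)
```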

TABLE I: Quantitative comparison on NTIRE 2020 Real World Super-Resolution Challenge Track 1 validation dataset of the proposed method against state-of-the-art methods, in terms of average PSNR (dB) and SSIM for upscale factor 4×. The bold results indicate the best performance.
Methods | PSNR (dB) | SSIM
Bicubic | 23.87 | 0.644
Nearest Neighbor | 23.39 | 0.580
EDSR [5] | 25.36 | 0.640
ESRGAN [7] | 19.04 | 0.242
SRGAN [15] | 20.78 | 0.525
SRFBN [27] | 25.37 | 0.642
MsDNN [28] | 25.08 | 0.708
RCAN [29] | 25.31 | 0.640
USISResNet [30] | 21.71 | 0.589
CycleGAN [12] | 25.01 | 0.618
ZSSR [18] | 24.87 | 0.600
Proposed | 26.83 | 0.789

IV-C Quantitative Comparison

The quantitative results presented in Table I demonstrate the superiority of our method, which achieves the highest PSNR/SSIM values of 26.83dB/0.789. SRFBN [27] ranks second-best with 25.37dB/0.642; our method improves the PSNR/SSIM values by 1.46dB/0.147 over it. EDSR [5] and SRFBN [27] cannot achieve the desired numerical results since they are PSNR-oriented methods and are trained only on images with simple degradation. Moreover, we find that ESRGAN [7] and SRGAN [15] perform worse than most of the methods in Table I. Besides, as shown in Fig. 4, the visual results of ESRGAN and SRGAN exhibit over-smoothed textures and unrealistic artifacts. A similar phenomenon has also been reported in [10, 18, 30]. The underlying reason is that these methods are trained only on simple, clean, pre-defined LR images, ignoring the difference in domain distribution between the real-world LR domain and the pre-defined LR domain. Benefiting from the proposed UBCDTN and SESRN, our method is able to handle this domain distribution shift and boost quantitative performance when dealing with real-world SR tasks.

IV-D Qualitative Comparison

The visual comparisons are provided in Fig. 4. We compare with various SR methods such as Bicubic, Nearest Neighbor, SRGAN [15], ESRGAN [7], CycleGAN [12], and ZSSR [18]. For the Bicubic and Nearest Neighbor methods, it is obvious that the results lack high-frequency content, producing overly smooth edges and coarse textures. SRGAN fails to alleviate the blurring of lines and edges in its SR results. The results of ESRGAN suffer from apparent broken artifacts and severe degradation, which are unfaithful to human perception. As for the unsupervised methods CycleGAN and ZSSR, the SR results are improved, but they are still far from the ground truth. Although the SR images of CycleGAN present better shapes than the previously compared methods, the results still contain unnatural edges and distortions, leading to poor visual quality. Besides, the blind method ZSSR was also evaluated, but it fails to reduce visible corruption to some degree, since over-smoothed textures and noise-like characteristics remain in its images.

Compared with the aforementioned methods, our method outperforms all of them, super-resolving visually pleasant SR images with sharper edges and finer textures. Our SR results are more realistic than those of SRGAN and ESRGAN, since those two methods are trained only on simply degraded data (e.g., bicubic-downsampled LR images) without any of the complicated noise and artifacts of real-world images, whereas our method trains on approximated real-like LR images that share the characteristics of real-world LR images. The unsupervised method CycleGAN is less effective at super-resolving degraded LR images; although it involves a cycle translation model, it lacks a powerful super-resolution network such as the SESRN in our method. Besides, the other unsupervised method, ZSSR, also fails to achieve the expected results, since it does not take into account the domain gap between noise-free LR images and real-world images. In contrast, benefiting from the domain transfer network (UBCDTN), our method successfully eliminates the domain gap and produces real-like LR images comprising real-world patterns. Overall, the SR results verify the powerful unsupervised learning strategy used in the proposed method for super-resolving photo-realistic SR images.

TABLE II: The compared variants of the proposed method in the ablation study and the descriptions of the proposed components. The tick indicates that this variant includes this component.
Variants | Methods | SESRN | $G_A$ | $G_B$ | $D_B$ | $D_A$ | $FE_B$ | $FE_A$
VariantA | SESRN | $\surd$ | | | | | |
VariantB | SESRN+$G_A$+$D_B$ | $\surd$ | $\surd$ | | $\surd$ | | |
VariantC | SESRN+$G_A$+$FE_B$+$G_B$+$FE_A$ | $\surd$ | $\surd$ | $\surd$ | | | $\surd$ | $\surd$
VariantD | SESRN+$G_A$+$D_B$+$G_B$+$D_A$ | $\surd$ | $\surd$ | $\surd$ | $\surd$ | $\surd$ | |
VariantE | SESRN+$G_A$+$D_B$+$G_B$+$D_A$+$FE_B$+$FE_A$ | $\surd$ | $\surd$ | $\surd$ | $\surd$ | $\surd$ | $\surd$ | $\surd$

V Ablation Study

In this section, we conduct the ablation study to further investigate the components of the proposed method and demonstrate the advantages of UBCDTL-GAN. The list of compared variants of our method is presented in Table II. We provide visual SR results of different variants in Fig. 5. The quantitative comparison of several variants is presented in Table III.

V-A Description of Different Variants of the Proposed Method

In ablation studies, we design several variants which consist of different proposed components. We adopt the SESRN as the baseline variant in the following experiments and pay more attention to investigating the elements used in the UBCDTN. To comply with the single variable principle, we gradually add one of the components to the baseline variant. We describe the details of designed variants, all of which are specified as follows:

1) VariantA: VariantA is designed as the baseline variant, which only contains the SESRN. As shown in Table II, VariantA can be considered as the ultimate proposed method with all UBCDTN components removed. In the following variants, we successively add each of the components back to VariantA.

2) VariantB: In VariantB, we introduce $G_A$ and $D_B$, while $G_B$, $D_A$ and both $FE_A$ and $FE_B$ are removed. Because $G_A$ and $D_B$ are the essential components of the forward cycle module in UBCDTN, VariantB can be considered as the baseline model plus the forward cycle module of UBCDTN, with the backward cycle module removed.

3) VariantC: Besides the baseline model, it consists of the two generators $G_A$, $G_B$ and the two feature extractors $FE_A$, $FE_B$ of UBCDTN, eliminating the discriminators $D_A$ and $D_B$.

4) VariantD: It is constructed from the four components $G_A$, $G_B$, $D_B$ and $D_A$ of UBCDTN, while the feature extractors $FE_A$ and $FE_B$ are removed.

5) VariantE (Proposed): VariantE represents the ultimate proposed method, which comprises the baseline model and all components of UBCDTN.

Figure 5: Qualitative comparison of different variants in our ablation study: visual results on images "0896" and "0824" from the NTIRE 2020 Track 1 testing dataset with scale factor 4×. The best results are highlighted.
TABLE III: Quantitative results of ablation study with different variants on NTIRE 2020 validation T1 dataset, in terms of average PSNR (dB) and SSIM for upscale factor 4×. The bold results indicate the best performance.
Variants | PSNR (dB) | SSIM
VariantA | 25.97 | 0.757
VariantB | 23.47 | 0.698
VariantC | 25.44 | 0.729
VariantD | 25.81 | 0.746
VariantE (Proposed) | 26.83 | 0.789

V-B Effect of UBCDTN

This experiment is conducted with VariantA and VariantE. Specifically, because UBCDTN is removed, VariantA is trained directly on bicubic-downsampled LR images, while VariantE takes the real-like LR images obtained by UBCDTN as its training LR inputs. By analyzing the performance gap between VariantA and VariantE, we can demonstrate the advantages originating from UBCDTN. As shown in Fig. 5, VariantA produces over-smoothed SR images missing high-frequency details, while the SR results of VariantE contain naturally desired edges and textures. In addition, from Table III, the quantitative results decrease from 26.83dB/0.789 to 25.97dB/0.757 after removing UBCDTN. The reason is that VariantA is trained only on bicubic data, ignoring the domain distribution difference between bicubic data and real-world data when solving the real-world SR task. Incorporating the UBCDTN brings a noteworthy improvement in both qualitative and quantitative performance, which verifies that UBCDTN plays an important role in the super-resolution procedure. Thus, we validate the effectiveness and necessity of the proposed UBCDTN.

V-C Effect of $G_B$ and $D_A$

In this experiment, we compare VariantB and VariantE to verify the effect of $G_B$ and $D_A$, which is equivalent to showing the effectiveness of the backward cycle module. Note that VariantB is equivalent to UBCDTN for domain transformation without the backward cycle module. In this setting, VariantB takes the artificially degraded images as the input to $G_A$ and generates real-like LR images under the supervision of $D_B$ in the forward cycle module, without the backward cycle module. It can be observed from Table III that, in the absence of the backward cycle module, VariantB performs worse than VariantE, which contains the whole UBCDTN, in terms of PSNR/SSIM values, since there is no restriction preventing the forward and backward cycle modules from contradicting each other. A clear improvement can be observed after integrating the backward cycle module into VariantE, which greatly improves quantitative performance and further produces high-quality SR images with desirable details. This is due to the presence of the backward cycle module: the variant can exploit the cycle-consistency constraint, which guarantees consistency between the two inverse modules in an unsupervised manner. In short, by introducing $G_B$ and $D_A$, the whole bi-directional cycle-consistency learning strategy can be established to produce real-like LR images that maintain the same characteristics as real-world LR images. These results validate the significance of involving $G_B$ and $D_A$, and reveal that the backward cycle module improves both quantitative results and visual quality.

V-D Effect of $D_A$ and $D_B$

In this experiment, we aim to demonstrate the contribution of $D_A$ and $D_B$, which also reflects the importance of the adversarial loss in the UBCDTN. VariantC and VariantE are compared to investigate the effect of $D_A$ and $D_B$. In this case, VariantC is trained with the cycle-consistency loss, identity loss, and cycle-perceptual loss, while the adversarial loss is not involved because $D_A$ and $D_B$ are removed. As shown in Table III, performance decreases severely when the discriminators are removed in VariantC. By incorporating $D_A$ and $D_B$ into VariantE, we observe that VariantE outperforms VariantC by a large margin of 1.39dB/0.060 in terms of PSNR/SSIM. From Fig. 5, the visual results of VariantC are degraded, with SR images containing over-smoothed textures and noticeable artifacts, while VariantE generates visually realistic SR images with more natural-looking details. The underlying reason for the poor performance is that the adversarial loss cannot be employed in VariantC, which deprives it of adequate iterative adversarial training. According to the quantitative and qualitative comparisons, applying the adversarial loss enforced by $D_A$ and $D_B$ greatly enhances SR performance, indicating the effectiveness of $D_A$, $D_B$ and the adversarial loss.

V-E Effect of $FE_A$ and $FE_B$

We compare VariantD and VariantE in this experiment to verify the effectiveness of $FE_A$ and $FE_B$, whose advantages are equivalent to the benefit of the cycle-perceptual loss. With $FE_A$ and $FE_B$ removed, VariantD no longer employs the cycle-perceptual loss during training to optimize the model. As shown in Table III, VariantE achieves higher quantitative performance than VariantD, increasing PSNR/SSIM from 25.81dB/0.746 to 26.83dB/0.789. In addition, Fig. 5 shows the qualitative comparison between the SR results of VariantD and VariantE, i.e., without and with the cycle-perceptual loss in the training process. It is apparent that without the cycle-perceptual loss, deteriorated textures are visible in the SR results of VariantD. In contrast, when $FE_A$ and $FE_B$ are integrated into the variant, the cycle-perceptual loss can be calculated, which further encourages the variant to produce perceptually natural images with fine details. According to the numerical performance and visual results, we verify that $FE_A$ and $FE_B$ have a significant impact on the super-resolution process, which also confirms that the cycle-perceptual loss improves the perceptual quality of SR images.

V-F Final Effect

VariantE can be regarded as the final proposed method, which includes all the proposed components. Compared with the other variants, it substantially improves quantitative performance and clearly enhances the quality of the visual results. We therefore conclude that the proposed method, as well as each of its components, is effective.
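For completeness, the overall UBCDTN objective combining the four loss terms discussed above can be sketched as a simple weighted sum; the weight values below are placeholders, not the ones used in the paper.

```python
def ubcdtn_total_loss(adv, idt, cyc, percep,
                      w_adv=1.0, w_idt=1.0, w_cyc=10.0, w_percep=1.0):
    """Weighted sum of the adversarial, identity, cycle-consistency and
    cycle-perceptual terms; the weights here are illustrative placeholders."""
    return w_adv * adv + w_idt * idt + w_cyc * cyc + w_percep * percep
```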

VI Conclusion

We proposed UBCDTL-GAN, an unsupervised super-resolution method for real-world scenarios that requires neither paired image data nor a pre-defined degradation operation. The proposed method comprises two networks, UBCDTN and SESRN. First, UBCDTN transfers an artificially degraded LR image into a real-like LR image that carries real-world artifacts and characteristics. Then, SESRN reconstructs the approximated real-like LR image into a visually pleasing super-resolved image with realistic details and textures. Owing to the designed framework and the applied optimization constraints, UBCDTL-GAN improves real-world super-resolution performance. Quantitative and qualitative experiments on the NTIRE 2020 Track 1 real-world SR dataset validate the effectiveness of our method and show superior SR performance compared with existing state-of-the-art methods.
