
FREDSR: Fourier Residual Efficient Diffusive GAN for Fast Single Image Super Resolution

Kyoungwan Woo1 * Achyuta Rajaram 2 *
Abstract

FREDSR is a GAN variant that aims to outperform traditional GAN models on specific tasks such as Single Image Super Resolution with extreme parameter efficiency, at the cost of per-dataset generalizability. FREDSR integrates fast Fourier convolutions, residual prediction, and a diffusive discriminator to achieve strong performance compared to other models on the UHDSR4K dataset for 3x Single Image Super Resolution from 360p and 720p, with only 37000 parameters. The model fits the characteristics of the given dataset, trading generalizability for higher performance on tasks such as real-time upscaling.

1 Introduction

Image-to-Image Translation (I2I) is a highly general meta-task that subsumes most application-specific subtasks in image generation and processing. Broadly, we use the definition from [1]: we seek to convert an input image from domain A to domain B while preserving the "intrinsic source content", i.e. the meaningful semantic features of the image, and converting only the "extrinsic style" associated with the target domain. This process can be defined as a mapping that synthesizes images indistinguishable from samples of the target distribution B, using the semantic information present in a given sample of distribution A. This definition generalizes to many problems in image processing, with past work spanning image synthesis [2], image segmentation [3], style transfer [4], image inpainting [5], and image super-resolution [6], among many other applications.

Given the formulation of the I2I problem stated above, it seems natural that Generative Adversarial Networks (GANs) would be applicable to these tasks. GANs seek to generate images indistinguishable from those in a given distribution by playing a minimax game between a discriminator and a generator: the generator seeks to decrease the discriminator's accuracy in determining whether the generator's outputs belong to the specified distribution of images, while the discriminator seeks to maximise that accuracy [7]. This methodology circumvents the problems of using a pixel-level loss, or other naive supervised losses, which produce blurry images by averaging over the multiple plausible solutions for a given translation.

As a result of this natural fit, past research has applied GAN architectures to I2I translation problems, most notably the pix2pix network, which pioneered the development of GANs for I2I translation [8]. Pix2pix allowed high-quality low-resolution I2I translation to be computed without problem-specific hand-crafted loss functions. However, the initial formulation was not crafted for high resolutions: [9] states that conditional GANs such as pix2pix can struggle to generate high-definition images due to training instability. They further show that including a Perceptual Loss, an error function based on the outputs of classification networks that perform large-scale feature extraction, has significant benefits; using it as a supervised objective stabilized the generated images and improved over raw pixel-level objectives.

More recently, an improvement to pix2pix has been suggested, referred to as pix2pixHD, which combines classifier-based perceptual loss with discriminator-based loss to allow high-resolution image generation [10]. However, this model requires three discriminators to process information at every scale, which substantially increases the computational resources required during training. Furthermore, using several discriminators increases learning instability: matching gradient updates between the generator and discriminator is one of the largest challenges in GAN training, and failing to do so leads to common GAN failure modes such as catastrophic forgetting and mode collapse [11].

An alternative to this architecture, optimized for large-mask high-resolution image inpainting, uses Fast Fourier Convolutions (FFCs) to allow efficient processing of global-scale information alongside local-scale details [5], [12]. Building on this prior work, we contribute:

i. Using the previously introduced FFC block, we create a novel model architecture for I2I translation, adapting the original encoder-decoder stack from pix2pix. The FFC gives the model fast access to global information, allowing it to operate on extrinsic global information throughout the network and improving its parameter efficiency for translation tasks. Additionally, this architecture is easily adapted to Single Image Super Resolution (SISR) problems, and maintaining global information proves crucial to its performance.

ii. We extend the diffusive discriminator from [13], which combats discriminator overfitting by borrowing a tool from the highly successful Denoising Diffusion Probabilistic Model architecture [14], adding noise drawn from a forward diffusion process over a Gaussian mixture distribution to the discriminator inputs; we combine this with a residual discriminator design.

Overall, we combine these techniques into a novel GAN variant, which our experiments show to be performant for specialized SISR tasks while being orders of magnitude more parameter-efficient. We motivate the existence of such a technique with the application of real-time 3D rendering, such as in modern video games. Modern gaming requires high-resolution rendering of complicated 3D meshes with potentially billions of polygons, as well as expensive processes such as ray-traced per-pixel lighting, ambient occlusion, and motion blur. Performance in this application has traditionally scaled harshly with output resolution, due to a quadratic increase in total calculations, so a real-time application of SISR can provide efficiency benefits. One example of utilizing deep learning techniques for this task was introduced by NVIDIA with their Deep Learning Super Sampling application. FREDSR follows a similar notion of an extremely performant yet specialized SISR model, optimized for high performance on a specific dataset with extreme parameter efficiency.

2 Method

We seek to develop a novel Single Image Super Resolution (SISR) architecture that operates on a pair of three-channel images: a high-resolution image $q_{hr}$ and its low-resolution counterpart $q_{lr}$. We propose a model architecture that takes in $q_{lr}$ and outputs a three-channel image $q_{h}$ with the same resolution as $q_{hr}$.

Figure 2: The model architecture of the FREDSR generator. This network uses the recently proposed FFCs [12], along with a complex loss function that combines adversarial, perceptual, and pixel-level losses. Graphic style was inspired by [5].

2.1 Maintaining global information through all layers

Challenging problems in I2I translation, including large-scale modification of high-resolution data, require processing of global information. We argue that for successful high-resolution I2I translation, a high receptive field that encapsulates global information, and maintains it throughout the network, is necessary to efficiently process global extrinsic parameters of an image, such as the time of day, lighting conditions, or the artistic style associated with a painter. Traditional fully convolutional models suffer from slow growth of the effective receptive field, due to the small kernels used [15]. To improve on this, [16] proposed the dilated convolution, which introduces "gaps" between the values of a convolution, spreading a 3x3 convolution, for example, over an 8x8 area while using only 9 parameters. However, this loses some local information, so a more complex architecture combining dilated and traditional convolutions is necessary to achieve usable results, as in the I2I density-map generation network proposed in [17]. Additionally, even dilated convolutions only modestly increase the receptive field. Further improvement comes by way of the Fast Fourier Convolution (FFC) [12], based on the Fast Fourier Transform [18].

The FFC splits information into two main channels, global and local: the local information is processed using traditional convolutions, while the global information is processed using a real 2D FFT. Within each FFC block, the information from both branches is combined, allowing more efficient use of parameters. An illustration of the exact construction of the FFC block can be seen in Figure 2. These units are fully differentiable, and we use them as a drop-in replacement for traditional convolutions. Additionally, FFCs have been shown to be well suited to capturing periodic structures, which are extremely common in both natural and artificial environments, such as brick buildings or waves on the surface of water [5].
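To make the construction concrete, the following is a minimal sketch of an FFC-style block in TensorFlow/Keras (the framework used for FREDSR, Section 3). The class names, the channel split, and the simplified spectral transform are illustrative assumptions, not the exact FREDSR implementation:

import tensorflow as tf
from tensorflow.keras import layers

class SpectralTransform(layers.Layer):
    # Global branch: a 1x1 convolution applied in the real-FFT frequency domain.
    def __init__(self, channels):
        super().__init__()
        self.freq_conv = layers.Conv2D(channels * 2, 1, activation="relu")
        self.out_conv = layers.Conv2D(channels, 1)

    def call(self, x):
        h, w = tf.shape(x)[1], tf.shape(x)[2]
        x_t = tf.transpose(x, [0, 3, 1, 2])                  # (B, C, H, W)
        freq = tf.signal.rfft2d(x_t)                         # complex, (B, C, H, W//2+1)
        freq = tf.transpose(freq, [0, 2, 3, 1])              # channels last again
        stacked = tf.concat([tf.math.real(freq), tf.math.imag(freq)], axis=-1)
        stacked = self.freq_conv(stacked)                    # mix global information
        real, imag = tf.split(stacked, 2, axis=-1)
        freq = tf.transpose(tf.complex(real, imag), [0, 3, 1, 2])
        spatial = tf.signal.irfft2d(freq, fft_length=tf.stack([h, w]))
        return self.out_conv(tf.transpose(spatial, [0, 2, 3, 1]))

class FFCBlock(layers.Layer):
    # Local and global branches processed separately, then cross-fused.
    def __init__(self, local_ch, global_ch):
        super().__init__()
        self.l2l = layers.Conv2D(local_ch, 3, padding="same")
        self.g2l = layers.Conv2D(local_ch, 3, padding="same")
        self.l2g = layers.Conv2D(global_ch, 3, padding="same")
        self.g2g = SpectralTransform(global_ch)

    def call(self, inputs):
        x_local, x_global = inputs
        out_local = tf.nn.relu(self.l2l(x_local) + self.g2l(x_global))
        out_global = tf.nn.relu(self.l2g(x_local) + self.g2g(x_global))
        return out_local, out_global

Because the frequency-domain convolution touches every spatial location at once, a single such block already has an image-wide receptive field.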

2.2 Model Architecture

Here, we introduce our novel I2I architecture and the modifications necessary for high-resolution SISR tasks with an extremely efficient network. We have already described the foundational unit of this architecture, the FFC block. The full architecture is illustrated in Figure 2.

SISR Generator

A traditional I2I network cannot perform SISR, due to the differing resolutions of the input and output. However, our architecture is easily adapted to perform this task. We base this adaptation on the VDSR model proposed in [19]. To begin, we upscale the low-resolution input with a bicubic upscaler. We then use our model to predict the difference between the bicubic-upscaled image and the true image. This process is visualized in Figure 2. This approach has several advantages over other super-resolution networks: the FFC modules allow more efficient use of parameters and preserve low-frequency features through the upscaling. Additionally, the residual learning allows much more efficient use of parameters and faster training, as no training time or parameters are "wasted" learning to copy the original image. This allows our model to achieve performance competitive with state-of-the-art methods on the specialized task with only 37115 total parameters.
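A minimal sketch of this residual upscaling step is given below, assuming `model` is any network (such as an FFC stack) that maps the bicubic-upscaled image to a residual of the same shape; the function name and scale handling are illustrative rather than FREDSR's exact API:

import tensorflow as tf

def super_resolve(model, lr_image, scale=3):
    # VDSR-style residual SISR: upscale with bicubic interpolation,
    # then add the learned residual correction.
    new_h = tf.shape(lr_image)[1] * scale
    new_w = tf.shape(lr_image)[2] * scale
    upscaled = tf.image.resize(lr_image, [new_h, new_w], method="bicubic")
    residual = model(upscaled)      # the network only learns the correction
    return upscaled + residual      # final high-resolution estimate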

Figure 3: The model architecture of the FREDSR discriminator. This was based on a traditional discriminator, with special modifications for the super-resolution task. This discriminator uses diffusion applied to its inputs, among other tactics to improve discriminator performance and prevent overfitting. Graphic style was inspired by [5].

SISR Discriminator

We train the discriminator to distinguish between the true residuals, i.e. the differences between the upscaled images and the true images, and our model's predicted residuals. This architecture is visualized in full in Figure 3.
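As a sketch, the real and fake discriminator inputs could be formed as below; the helper name and the separate bicubic upscale are assumptions for illustration:

import tensorflow as tf

def residual_pair(hr_image, lr_image, generator, scale=3):
    # Real input: the true correction on top of bicubic upscaling.
    # Fake input: the generator's predicted correction.
    new_h = tf.shape(lr_image)[1] * scale
    new_w = tf.shape(lr_image)[2] * scale
    upscaled = tf.image.resize(lr_image, [new_h, new_w], method="bicubic")
    real_residual = hr_image - upscaled
    fake_residual = generator(upscaled)
    return real_residual, fake_residual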

2.3 Generative Loss Functions

In FREDSR, we combine several L1- and L2-type losses. The benefits of combining losses at multiple levels have been discussed previously in papers such as [8].

SSIM loss

As an L2-type loss, we utilize the Structural Similarity Index Measure (SSIM) for the SISR model. In conjunction with the residual discriminator, SSIM works to deblur the image. Compared to common L1/L2 losses such as MAE and MSE, SSIM has been shown to perform better on super-resolution tasks, as in [20]. The loss function is as follows:

\mathcal{L}_{SSIM}=-SSIM(x,y)=-\frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})} (1)

With x,y defined as the generated and real images respectively.
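A minimal TensorFlow sketch of Eq. (1), assuming images are scaled to [0, 1]:

import tensorflow as tf

def ssim_loss(y_true, y_pred):
    # Negative mean SSIM over the batch; max_val matches the [0, 1] image range.
    return -tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))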

Charbonnier Loss

As a method of enforcing color similarity, and of mixing L1 and L2 behavior, we use the Charbonnier loss. This differentiable variant of MAE has been shown to combine beneficial properties of L1 and L2 losses and has seen significant success.

\mathcal{L}_{CHARB}=E\left(\sqrt{(x-y)^{2}+\epsilon}\right) (2)

With x, y defined as the generated and real images respectively, and $\epsilon$ a chosen constant.
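A sketch of Eq. (2); the value of $\epsilon$ below is an assumed choice, as it is not stated here:

import tensorflow as tf

def charbonnier_loss(y_true, y_pred, eps=1e-6):
    # Smooth, differentiable variant of MAE (Charbonnier penalty).
    return tf.reduce_mean(tf.sqrt(tf.square(y_true - y_pred) + eps))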

MGE loss

To improve the performance of the model on SISR tasks, we add an additional loss on top of the global feature-reconstruction loss. This loss focuses on preserving edges and was introduced for SISR in [21]. Maintaining sharp edges yields an image closer to the one captured at high resolution. The Mean Gradient Error (MGE) uses the classical Sobel operator, first introduced in [22], to compute gradients for each pixel; MGE is then defined as follows.

\mathcal{L}_{MGE}=E\left((G(x)-G(y))^{2}\right) (3)

With x,y defined as the generated and real images respectively.
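A sketch of Eq. (3) using TensorFlow's built-in Sobel filter; taking G to be the Sobel gradient magnitude is one common reading of the mixed gradient error and is an assumption here:

import tensorflow as tf

def mge_loss(y_true, y_pred):
    # Mean squared error between Sobel gradient magnitudes of the two images.
    def gradient_magnitude(img):
        edges = tf.image.sobel_edges(img)   # shape (B, H, W, C, 2) holding [dy, dx]
        return tf.sqrt(tf.reduce_sum(tf.square(edges), axis=-1) + 1e-12)
    return tf.reduce_mean(tf.square(gradient_magnitude(y_true) - gradient_magnitude(y_pred)))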

Adversarial Loss

For the adversarial loss, since the model follows a GAN structure, we utilize the modified minimax loss from [7]. The loss function is as follows.

\mathcal{L}_{adv}=E_{y_{gen}}[\log(D(y_{gen}))] (4)

where $y_{gen}$ denotes the high-resolution images generated from the original low-resolution inputs.

Final Generative loss

In addition to the previous losses, we add a Perceptual Loss, as such losses have shown considerable benefits in the training and evaluation of image generation networks [23], [24], [25], [26]. Combining this with the above loss functions, we obtain the final generative loss function to be minimized by gradient descent:

\mathcal{L}_{final}=\lambda_{1}\mathcal{L}_{adv}+\lambda_{2}\mathcal{L}_{PL}+\lambda_{3}\mathcal{L}_{MGE}+\lambda_{4}\mathcal{L}_{SSIM}+\lambda_{5}\mathcal{L}_{CHARB} (5)

with $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5}$ as scaling parameters.
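Reusing the loss helpers sketched above, the weighted sum of Eq. (5) could be assembled as follows; `perceptual_fn`, the lambda weights, and the non-saturating form of the adversarial term are illustrative assumptions:

import tensorflow as tf

def generator_loss(y_true, y_pred, d_fake, perceptual_fn, lambdas):
    # Non-saturating adversarial term (minimized), plus the supervised terms.
    l_adv = -tf.reduce_mean(tf.math.log(d_fake + 1e-8))
    l_pl = perceptual_fn(y_true, y_pred)
    return (lambdas[0] * l_adv
            + lambdas[1] * l_pl
            + lambdas[2] * mge_loss(y_true, y_pred)
            + lambdas[3] * ssim_loss(y_true, y_pred)
            + lambdas[4] * charbonnier_loss(y_true, y_pred))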

2.4 Discriminative Loss Functions

Modified Minimax Loss

For the discriminative loss, since the model follows a GAN structure, we train the discriminator to score real images above 0.5 and fake images below 0.5. Thus, our loss is given by:

\mathcal{L}_{disc}=E_{(x,y)}[\log(D(y))]+E_{(x)}[\log(1-D(G(x)))] (6)

where x, y are corresponding low-resolution input and true high-resolution images respectively.
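A minimal form of Eq. (6), negated so that it can be minimized by gradient descent; `d_real` and `d_fake` are assumed to be discriminator probabilities in (0, 1):

import tensorflow as tf

def discriminator_loss(d_real, d_fake, eps=1e-8):
    # Standard GAN discriminator objective on real vs. predicted residuals.
    return -tf.reduce_mean(tf.math.log(d_real + eps) + tf.math.log(1.0 - d_fake + eps))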

3 Experiments

FREDSR was developed using TensorFlow [27] and Keras [28], then trained on UHDSR4K, a SISR benchmark dataset from [29]. The images were downscaled using bicubic algorithms, consistent with the original dataset.

Along with the techniques outlined in Sections 1 and 2, we utilize various traditional and novel GAN training methods, such as discriminator restart learning from [30] and variable Gaussian noise, an improved version of adaptive blur and control from [31]. It is important to note that FREDSR utilized minimal to no hyperparameter tuning: all λ\lambda values were set to equalize the importance of each loss function.

3.1 Diffusive Discriminator

A common problem in GAN training is discriminator overfitting, and several methods have been used to address it, including random data augmentations that adapt with the discriminator's performance [32]. For stable training, we implement the solution proposed in [13], adding adaptive noise sampled from the forward diffusion chain of a Gaussian mixture distribution to the discriminator inputs. This method has been shown both theoretically and experimentally to provide stable and data-efficient GAN training, and to improve over baselines for high-quality image generation [13]. In this work, we apply the diffusive inputs to our residual discriminator.
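A simplified sketch of this noise injection, shared between real and generated discriminator inputs, is shown below; the linear beta schedule and the single sampled timestep are assumptions and not the exact adaptive Gaussian-mixture scheme of [13]:

import tensorflow as tf

def diffuse_inputs(images, t, num_steps=50, beta_min=1e-4, beta_max=0.02):
    # Forward-diffusion-style corruption: scale the image and add Gaussian noise
    # according to the cumulative noise level at timestep t (a scalar index here).
    betas = tf.linspace(beta_min, beta_max, num_steps)
    alpha_bar = tf.math.cumprod(1.0 - betas)
    a = tf.gather(alpha_bar, t)
    noise = tf.random.normal(tf.shape(images))
    return tf.sqrt(a) * images + tf.sqrt(1.0 - a) * noise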

3.2 Discriminator Restart Learning

Even with diffusion applied to the discriminator inputs, the discriminator tends to converge faster than the generator, or to settle on poor weights after extended training. By resetting the discriminator learning rate mid-training, or performing a full weight reinitialization, we ensure that the discriminator continues learning against a more converged generator output. In this reset step, the discriminator receives a learning-rate increase, and the generator's weight on the adversarial loss is lowered to let the discriminator learn more freely. Once the discriminator is again near convergence, the changes to the coefficients are reversed.

3.3 Gaussian Random Noise

Adding random noise inside the generator is a commonly used tactic in deep learning to increase the generalizability of a model while decreasing the chance of overfitting [33]. Although our dataset is relatively large, to improve the higher-resolution generalizability of our model and to speed up training, we inject Gaussian random noise between each of our fast Fourier convolutions. The Gaussian noise level is decreased roughly in step with the generator loss, since in later phases of training randomized inputs can lead to less favorable outputs.
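A minimal example of this injection between convolution stages; the standard deviation is an assumed starting value that would be lowered on a schedule as the generator loss falls:

import tensorflow as tf
from tensorflow.keras import layers

def add_training_noise(x, stddev=0.05):
    # GaussianNoise is only active when training=True, so inference is unaffected.
    return layers.GaussianNoise(stddev)(x, training=True)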

3.4 AdamW Optimizer

We utilize the AdamW optimizer proposed by [34] to address the weight-regularization issue of Adam. Adam tends to regularize larger weights less than smaller weights, and thus does not converge as well as traditional methods such as stochastic gradient descent with weight decay. AdamW was proposed to correct this, and we notice a significant improvement over the Adam optimizer.
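For reference, AdamW can be used directly in recent TensorFlow releases (older versions expose it through TensorFlow Addons); the learning rate and weight decay below are illustrative values, not the ones used for FREDSR:

import tensorflow as tf

# Adam with decoupled weight decay, as proposed in [34].
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-5)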

3.5 Cosine Annealing Decay Learning Rate

We utilize a cosine annealing learning-rate scheduler with warm restarts, combining decaying and restart schedules as proposed by [35]. Restart schedules have commonly been used with stochastic gradient descent to speed up training, and decaying schedules have been used to help models converge.

Figure 4: The learning-rate schedule for the SISR generator: each peak is five percent lower than the previous, and the learning rate decays to half before the next reset.
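Keras provides a built-in schedule that approximates the behavior shown in Figure 4; the initial rate, cycle length, and decay settings below are illustrative rather than the exact values used:

import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-3,   # assumed starting rate
    first_decay_steps=10_000,     # assumed cycle length
    t_mul=1.0,                    # keep every restart cycle the same length (assumed)
    m_mul=0.95,                   # lower each restart's peak, roughly 5% per cycle
    alpha=0.5,                    # floor the decay near half the initial rate
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)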

3.6 UHDSR4K Image Super Resolution

For our main analysis, we focus on the super-resolution objective of the UHDSR4K dataset introduced in [29], the largest-scale UHD dataset in the field of 4K image super-resolution. For training, we downsample the original 4K images to 1920x1080. From there, we construct our 3x upsampling dataset by further downsampling to 640x360 resolution and training our model to perform 3x super-resolution. The dataset is composed of diverse environments, including city scenes, people, animals, buildings, cars, natural landscapes, and sculptures scraped from the internet, which diversifies the camera processing applied and provides high-resolution images from many surroundings. For downsampling, we follow [29] in using bicubic interpolation. For our metrics, we use the standard super-resolution metrics SSIM and PSNR [36], [37]. Comparing with both metrics verifies that our model can perform the super-resolution task, matching pixels successfully while also remaining globally close to the original image. It is important to note that our model's hyperparameters were not fully tuned, yielding suboptimal performance. We visualize generator loss with respect to training steps in Figure 5.

Figure 5: A partial plot of the generator loss over training steps. Note that our model converges much more slowly than other models such as VDSR [19]. We performed training on 12 Nvidia Tesla V100s for over 300 epochs.

Comparisons to Baselines

For our comparison, we use several baseline models, which allows us to directly benchmark our model's performance on its original training dataset. We note that although it would have been preferable to train the baseline models on the UHDSR4K dataset, as we did with our model, we were unable to do so due to time and resource constraints. However, as this dataset has high diversity, we believe it covers the training distributions of the baseline models. We test all models at 3x super-resolution on the UHDSR4K test set with our training-time downscaling applied, i.e. 3x super-resolution from 640x360 to 1920x1080. For the baselines, we use only publicly available pretrained models, selecting three strong deep learning baselines with diverse techniques: EDSR [38], ESPCN [39], and FSRCNN [40]. For completeness, we also report bilinear and bicubic upscaling as controls. The results, calculated using the SSIM and PSNR metrics described previously, are tabulated in Table 1.

UHDSR4K 640x360 3x Upscaling
Method    Params    SSIM ↑    PSNR ↑
Bicubic   0         0.856     26.75
Bilinear  0         0.848     26.49
EDSR      43000k    0.879     27.61
ESPCN     20k       0.858     27.02
FSRCNN    12k       0.857     27.03
FREDSR    37k       0.883     27.776
Table 1: Quantitative evaluation of 3x SISR performance on the downscaled UHDSR4K dataset. We report Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) metrics. FREDSR compares favorably against these baselines across all metrics.

As shown, our method outperforms the baselines on the SISR task at its original training resolution and dataset, strictly outperforming significantly larger models, which indicates high overall performance and parameter efficiency. Further ablation analysis is necessary to determine the exact root of this high performance, but we believe it can be attributed to the poor generalization of the other models, the high receptive field of FREDSR, and its ability to exploit the information present in repetitive structures to make more accurate predictions.

High-Resolution Performance Across Resolutions

Figure 6: Positive samples demonstrating high-resolution performance. These are examples of our model's outputs on the UHDSR4K dataset, displaying its high-resolution super-resolution capabilities. The top image is our model's output, while the bottom is the true high-resolution image. These examples, together with our results in Table 2, show an interesting improvement in our model as the resolution increases, which can be attributed to the super-resolution task becoming easier at higher resolutions.

Along with our model's performance at low resolutions, we seek to evaluate how it performs on high-resolution upscaling. We point out the large number of repetitive patches present in urban scenes and study the ability of our network to handle repeated signals deformed by changes in perspective, a deformation that has been shown to challenge past networks implementing FFCs [5]. We compare our model against a variety of strong baseline solutions for upscaling, and train a model, using the pretrained 360p model as a base, to perform 3x super-resolution from 720p to 4K.

UHDSR4K

For our full-resolution UHDSR4K analysis, we perform 3x upscaling from 720p to 4K. For comparison, we use the PSNR and SSIM results from [29], testing our model against SRCNN [Dong2015], FSRCNN [40], VDSR [19], EDSR [38], RCAN [41], RDN [42], HAN [43], DRLN [44], and MANet [29]. We report PSNR and SSIM results from [29] along with our own, comparing our model's generalization performance at the higher resolution against models natively trained to upscale it. We also include bilinear and bicubic upscaling as controls.

UHDSR4K 1280x720 3x Upscaling
Method    Params    SSIM ↑    PSNR ↑
Bilinear  0         0.9240    30.781
Bicubic   0         0.9375    31.984
SRCNN     57k       0.9503    34.082
FSRCNN    12k       0.9462    33.614
VDSR      665k      0.9575    35.115
EDSR      43000k    0.9608    35.674
RCAN      16000k    0.9608    35.576
RDN       21900k    0.9614    35.769
HAN       20000k    0.9601    35.547
DRLN      34000k    0.9617    35.808
MANet     27000k    0.9618    35.842
FREDSR    37k       0.9645    35.3344
Table 2: Quantitative evaluation of 3x SISR generalization performance on the original UHDSR4K dataset. We report Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) metrics; figures for the other deep learning methods are taken from [29]. FREDSR achieves the highest SSIM among these baselines and remains competitive in PSNR, despite having far fewer parameters. This can be attributed to the scale invariance conferred by the inductive biases of the FFC modules.

As demonstrated here, we outperform several strong baselines despite having orders of magnitude fewer parameters. This indicates an ability to efficiently capture the requisite information, which we attribute to the use of FFCs and the process of residual learning. Additionally, we see an increase in performance as resolution increases, where a model trained on multiple resolutions outperforms a model trained on only one. This is potentially due to the nature of the super-resolution task itself, though further analysis is required to determine the root cause of this phenomenon.

Other datasets

However, we note that, as expected, FREDSR does not perform as well on other datasets, such as Manga109 and Urban100, from [45] and [46] respectively. This is visualized in Figure 7, which shows our poor generalization performance. The reason is clear: bicubic upscaling performs poorly on such datasets, and the residual images to be learned generally carry little color and are of much higher resolution relative to the low-resolution input. This shows that FREDSR is highly specialized, losing generalizability in exchange for extreme parameter efficiency. However, for the main real-world application of video-game upscaling, this should not be a major issue, due to the consistent resolution and environment present.

Figure 7: Poor performance, with low clarity in edges, can be seen on the Manga109 and Urban100 datasets, from [45] and [46] respectively. This is potentially caused by the large distributional shift present, namely the change in resolution and color of the residual images.

4 Related Work

Within the domain of super-resolution, several traditional algorithms and data-driven models have been designed to accurately upscale images. Classical algorithms are aptly covered in [47] and will not be discussed further here, as most state-of-the-art solutions to SISR are based on learning techniques [48]. The use of convolutional networks for super-resolution was first introduced in [Dong2015], which also initiated the use of deep learning for this task. Skip connections, integral to stabilizing training, improving parameter efficiency, and increasing performance [49], were first introduced to the task with the development of DRCN [19]. The use of GANs for this task, which have the major benefit of generating high-quality images compared to more traditional methods, was introduced in [50]. The main challenge faced by GANs, their low mode coverage (i.e. output diversity), is not applicable to the super-resolution objective, so the main advantages of GANs, namely faster sampling than diffusion networks and higher-quality outputs than VAEs, can be utilized [51]. In a similar fashion to [50], [10] combines adversarial and perceptual loss functions to train a GAN for super-resolution. Other works have noted the lack of a high receptive field in most super-resolution networks, pointing out that redundant patches within an image, common in both artificial and natural landscapes, allow global context to inform the computation of local upscaling. As a result, several works have taken advantage of the increased receptive field of the dilated, or atrous, convolution [52], [21]. This further motivates our use of Fourier convolutions, whose receptive fields are theoretically unbounded. Other works have also used recurrent and attention mechanisms to allow global cross-patch information usage [41], [43], [53], [54]. These attention-based mechanisms are an extremely interesting avenue for further work.

5 Discussion

In this study, we investigate the use of the Fast Fourier Convolution for a specialized SISR objective. Using these convolutions, we have constructed a novel GAN architecture shown to outperform several baselines while being significantly smaller. Our model arguably performs well in urban settings due to the highly repeated structures present, which appear challenging for other methods, as seen in Figures 1 and 5. Our model shows extremely high parameter efficiency, outperforming models composed of more than a thousand times as many parameters. In exchange, we lose generalizability outside of a given image distribution. We believe this tradeoff can be harmful, but it is mitigated in the circumstance where we believe our method is strongest: video-game upscaling. This opinion is corroborated by the fact that video games contain highly repeating structures similar to urban scenes, due to their general reuse of built-in textures for three-dimensional models. The model is highly efficient, which is necessary for real-time application on consumer graphics hardware, and can be trained per game, as significant generalization is not required within such a narrow domain. This would allow rendering at a low resolution followed by upscaling. Further work in this domain would be required to enforce temporal coherence.
However, convolutions are not the only approach to increasing the receptive field of image processing networks, and models such as MLP-Mixer [55] as well as vision transformers [56] are exciting topics for future study in this field. Especially in the case of super-resolution, higher receptive fields will unlock possibilities for the development of deep learning for high-resolution image processing, allowing for increases in fidelity, diversity, and processing speed. Improved super-resolution will allow for advancements in medical imaging [57] and compression of image and video formats [58], among many other applications.

Acknowledgments

The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing (HPC, database, consultation) resources that have contributed to the research results reported within this paper. We additionally thank Professor Leslie Kaelbling and Ge Yang from the MIT Computer Science and Artificial Intelligence Laboratory for their valuable discussions and providing us with access to resources to complete this research. Lastly, we acknowledge Tensorflow and Horovod as the development language and platform of choice for this project.

References

  • [1] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-image translation: Methods and applications, 2021.
  • [2] Peihao Zhu, Rameen Abdal, Yipeng Qin, and Peter Wonka. Sean: Image synthesis with semantic region-adaptive normalization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [3] Xi Guo, Zhicheng Wang, Qin Yang, Weifeng Lv, Xianglong Liu, Qiong Wu, and Jian Huang. Gan-based virtual-to-real image translation for urban scene semantic segmentation. Neurocomputing, 394:127–135, 2020.
  • [4] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242–2251, 2017.
  • [5] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. CoRR, abs/2109.07161, 2021.
  • [6] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang, Chao Dong, and Liang Lin. Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 814–81409, 2018.
  • [7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
  • [8] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2016.
  • [9] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks, 2017.
  • [10] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans, 2017.
  • [11] Hoang Thanh-Tung and Truyen Tran. On catastrophic forgetting and mode collapse in generative adversarial networks, 2018.
  • [12] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4479–4488. Curran Associates, Inc., 2020.
  • [13] Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Diffusion-gan: Training gans with diffusion, 2022.
  • [14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
  • [15] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [16] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions, 2015.
  • [17] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes, 2018.
  • [18] E. O. Brigham and R. E. Morrow. The fast fourier transform. IEEE Spectrum, 4(12):63–70, 1967.
  • [19] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. CoRR, abs/1511.04587, 2015.
  • [20] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.
  • [21] Zhengyang Lu and Ying Chen. Single image super resolution based on a modified u-net with mixed gradient loss, 2019.
  • [22] N. Kanopoulos, N. Vasanthavada, and R.L. Baker. Design of an image edge detection filter using the sobel operator. IEEE Journal of Solid-State Circuits, 23(2):358–367, 1988.
  • [23] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution, 2016.
  • [24] Cheng Shi and Chi-Man Pun. Adaptive multi-scale deep neural networks with perceptual loss for panchromatic and multispectral images classification. Information Sciences, 490:1–17, 2019.
  • [25] Yifan Liu, Hao Chen, Yu Chen, Wei Yin, and Chunhua Shen. Generic perceptual loss for modeling structured output dependencies, 2021.
  • [26] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018.
  • [27] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [28] Francois Chollet et al. Keras, 2015.
  • [29] Kaihao Zhang, Dongxu Li, Wenhan Luo, Wenqi Ren, Björn Stenger, Wei Liu, Hongdong Li, and Ming-Hsuan Yang. Benchmarking ultra-high-definition image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14769–14778, October 2021.
  • [30] Kun Li and Dae-Ki Kang. Enhanced generative adversarial networks with restart learning rate in discriminator. Applied Sciences, 12:1191, 01 2022.
  • [31] Igor Susmelj, Eirikur Agustsson, and Radu Timofte. Abc-gan : Adaptive blur and control for improved training stability of generative adversarial networks. 2017.
  • [32] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data, 2020.
  • [33] Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643–674, 1996.
  • [34] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017.
  • [35] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with restarts. CoRR, abs/1608.03983, 2016.
  • [36] Zhou Wang, Alan Bovik, Hamid Sheikh, and Eero Simoncelli. Image quality assessment: From error visibility to structural similarity. Image Processing, IEEE Transactions on, 13:600 – 612, 05 2004.
  • [37] Alain Horé and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th International Conference on Pattern Recognition, pages 2366–2369, 2010.
  • [38] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1132–1140, 2017.
  • [39] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1874–1883, 2016.
  • [40] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network, 2016.
  • [41] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. CoRR, abs/1807.02758, 2018.
  • [42] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
  • [43] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network, 2020.
  • [44] Saeed Anwar and Nick Barnes. Densely residual laplacian super-resolution, 2019.
  • [45] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, nov 2016.
  • [46] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5197–5206, 2015.
  • [47] Jos Ouwerkerk. Image super-resolution survey. Image Vision Comput., 24:1039–1052, 10 2006.
  • [48] Zhihao Wang, Jian Chen, and Steven C. H. Hoi. Deep learning for image super-resolution: A survey. CoRR, abs/1902.06068, 2019.
  • [49] A. Emin Orhan. Skip connections as effective symmetry-breaking. CoRR, abs/1701.09175, 2017.
  • [50] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. CoRR, abs/1609.04802, 2016.
  • [51] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In International Conference on Learning Representations, 2022.
  • [52] George Seif and Dimitrios Androutsos. Large receptive field networks for high-scale image super-resolution. CoRR, abs/1804.08181, 2018.
  • [53] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [54] Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Linlin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 457–466, June 2022.
  • [55] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision, 2021.
  • [56] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2020.
  • [57] Wenzhe Shi, Jose Caballero, Christian Ledig, Xiahai Zhuang, Wenjia Bai, Kanwal Bhatia, Antonio M. Simoes Monteiro de Marvao, Tim Dawes, Declan O’Regan, and Daniel Rueckert. Cardiac image super-resolution with global correspondence using multi-atlas patchmatch. In Kensaku Mori, Ichiro Sakuma, Yoshinobu Sato, Christian Barillot, and Nassir Navab, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, pages 9–16, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
  • [58] Sheng Cao, Chao-Yuan Wu, and Philipp Krähenbühl. Lossless image compression through super-resolution, 2020.