
Perception-Distortion Balanced Super-Resolution:
A Multi-Objective Optimization Perspective

Lingchen Sun, Jie Liang, Shuaizheng Liu, Hongwei Yong, Lei Zhang

L. Sun, S. Liu, H. Yong, and L. Zhang are with the Department of Computing, the Hong Kong Polytechnic University, Hong Kong (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). This work is supported by the Hong Kong RGC RIF grant (R5001-18) and the PolyU-OPPO Joint Innovation Lab. J. Liang is with the OPPO Research Institute (e-mail: [email protected]).
Abstract

High perceptual quality and low distortion degree are two important goals in image restoration tasks such as super-resolution (SR). Most existing SR methods aim to achieve these goals by minimizing the corresponding yet conflicting losses, such as the \ell_{1} loss and the adversarial loss. Unfortunately, the commonly used gradient-based optimizers, such as Adam, struggle to balance these objectives due to the opposite gradient descent directions of the contradictory losses. In this paper, we formulate the perception-distortion trade-off in SR as a multi-objective optimization problem and develop a new optimizer by integrating the gradient-free evolutionary algorithm (EA) with the gradient-based Adam, where EA and Adam focus on the divergence and convergence of the optimization directions, respectively. As a result, a population of optimal models with different perception-distortion preferences is obtained. We then design a fusion network to merge these models into a single stronger one for an effective perception-distortion trade-off. Experiments demonstrate that, with the same backbone network, the perception-distortion balanced SR model trained by our method can achieve better perceptual quality than its competitors while attaining better reconstruction fidelity. Codes and models can be found at https://github.com/csslc/EA-Adam.

Index Terms:
Super-resolution, GAN, evolutionary algorithm, optimization.

I Introduction

Image super-resolution (SR) [1] is a valuable technique to reconstruct a high-resolution (HR) image with good quality from a low-resolution (LR) input image. Benefiting from deep learning techniques, it has become popular to train a deep neural network (DNN) for SR by using a large amount of LR-HR image pairs [2, 3, 4, 5, 6, 7, 8, 9]. One commonly used loss function to optimize the SR models is the distortion (or fidelity) measure, e.g., the norm of the prediction errors [2, 10, 11, 12]. However, SR is a typical ill-posed problem, and there are many possible HR estimations for a given LR input. It is well-known that the \ell_{2} or \ell_{1} loss tends to generate over-smoothed HR estimates [13, 14], although it can result in good distortion measures such as the peak signal-to-noise ratio (PSNR).

To better preserve the local structures and details of reconstructed images, the SSIM loss [15] and perceptual loss [16] have also been used to train SR models. The former adopts the widely used image quality assessment index, i.e., SSIM [15, 17], as the optimization objective, while the latter minimizes the \ell_{1} or \ell_{2} norm of the prediction difference in a feature space [7, 8]. In general, the SSIM loss and perceptual loss can improve the perceptual quality of SR outputs, but the improvement is limited because they can hardly generate details that are lost during degradation [18, 19].

Figure 1: Performance comparison of perception-distortion balanced SR models trained by the Adam optimizer and our hybrid EA-Adam optimizer on the BSD100 dataset. For the perceptual quality index LPIPS, the smaller the better. For the distortion degree index PSNR, the higher the better. The models optimized by Adam (orange triangle points) with different weighted combinations of the \ell_{1}, perceptual, and adversarial losses either stagnate at an unsatisfactory solution or achieve unbalanced performance. The models optimized by our EA-Adam optimizer (blue circle points) achieve better convergence and divergence. Furthermore, the fused model (red star) from the EA-Adam models achieves the best perceptual quality index while maintaining a high fidelity index.

To obtain perceptually more realistic HR images, perception-distortion trade-off methods [20, 7, 8, 21, 22, 23] have been developed for SR. Usually, they train a generative adversarial network (GAN) [24] by using the adversarial loss. Instead of minimizing the pixel-wise \ell_{2} or \ell_{1} loss or the local structure-wise SSIM or perceptual loss, the adversarial loss is optimized to minimize the distribution divergence between the predicted SR images and the HR images. In practice, training the SR model with only the adversarial loss is unstable and impractical. Therefore, most GAN-based SR methods combine the \ell_{1}, perceptual, and adversarial losses [7, 8] into one single weighted loss for optimization.

While GAN-based SR methods can generate some realistic details, they will also introduce undesired artifacts [25, 18]. Many follow-up works [26, 13, 27, 28, 29, 30, 31] aim to reduce the artifacts caused by the adversarial loss while keeping the realistic details. Dario et al. [26] applied \ell_{1} and adversarial losses in both the spatial and Fourier domains to improve the restoration of high-frequency details. Liang et al. [13] designed an LDL loss to suppress GAN-generated artifacts by considering their statistical difference from visually friendly details. To avoid artifacts in low-frequency areas, a region-aware adversarial loss [27] was designed to help the discriminator pay more attention to high-frequency areas. Zhang et al. [28] used a similar idea to handle the low-frequency and high-frequency areas differently by using a constrained loss. Xie et al. [29] proposed to detect the artifact regions and developed a fine-tuning procedure to improve GAN-based SR models.

Unfortunately, the above-mentioned methods have two major limitations. First, empirically selecting weights to combine the losses into one is unreliable and labor-consuming [28]. Second, and more importantly, distortion- and perception-oriented losses often conflict with each other, and the commonly used gradient-based DNN optimizers, such as Adam [32], struggle to balance the conflicting gradient descent directions. We illustrate this issue in Fig. 1, where we train a population of SR networks by combining the \ell_{1}, perceptual, and adversarial losses with different weights. The Adam optimizer is used to train the networks, and the PSNR (the higher the better) and LPIPS (the smaller the better) [33] indices are used to measure the distortion degree and perceptual quality, respectively. An ideal SR model is expected to achieve both high PSNR and low LPIPS, i.e., to approach the origin point in Fig. 1. However, we can see that the models (orange triangle points) learned by Adam with different weighted combinations of the \ell_{1}, perceptual, and adversarial losses either stagnate at an unsatisfactory solution or achieve unbalanced performance.

In this paper, we propose a new approach to perception-distortion balanced SR from the network optimization perspective. We formulate the problem as a multi-objective optimization task [18], where one objective aims for high perceptual quality and the other for low distortion degree. Considering that these two objectives are hard to achieve simultaneously by gradient-based optimizers because of their conflicting gradient descent directions, we propose to integrate the gradient-free evolutionary algorithm (EA) [34, 35, 36] with the gradient-based Adam for network optimization. Adam mainly ensures model convergence, i.e., the losses decrease along the gradient, while EA mainly ensures model divergence, i.e., the potential solutions interact with and benefit from each other through the crossover operation. Meanwhile, the mutation operation of EA helps the models jump out of local optima. As a result, the obtained Pareto front (PF) consists of a series of optimal models with different perception-distortion preferences, as shown by the blue circular points in Fig. 1. Compared with those optimized by Adam, the models optimized by our proposed EA-Adam method are closer to the origin point and are more evenly distributed. Motivated by the mixture-of-experts theory [37, 38] and network interpolation methods [39, 40], we further propose a simple yet effective fusion network to merge the obtained models into a single stronger one, which inherits the advantages of the model population. As indicated by the red star in Fig. 1, the fused model achieves the best perceptual quality index while maintaining a high fidelity index. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed hybrid EA-Adam optimizer and weight aggregation network. Compared with existing state-of-the-art methods, our model achieves a better perception-distortion trade-off.

The rest of this paper is organized as follows. Section II summarizes the related works on distortion- and perception-oriented SR methods, and evolutionary algorithms on DNN learning. Section III presents the multi-objective formulation of SR, the hybrid EA-Adam optimizer, and the network fusion approach in detail. Section IV reports the experimental results and discussions. Finally, Section V concludes this paper.

II Related Work

II-A Distortion-oriented SR Methods

Most of the SR methods in this category [2, 3, 11, 5] employ pixel-wise loss functions, such as the \ell_{2} or \ell_{1} norm of the prediction error, while focusing on the design of network architectures to improve SR performance. The seminal work of SRCNN [2] introduced a convolutional neural network (CNN) with three layers to perform the SR task. Afterward, many SR models have been developed. Kim et al. [6] trained a deeper and wider network to improve the representational ability of CNNs. Deep residual connections were introduced to improve the learning performance of deep SR networks [41, 42, 10]. Tong et al. [12] proposed to use dense skip connections in a very deep network. Recursive networks [6] were designed to improve SR performance without introducing new parameters for additional convolutions. Very recently, transformer networks [43, 44, 45, 46] have been successfully employed for SR, resulting in better distortion-based measures such as PSNR. Because the \ell_{2} or \ell_{1} loss tends to generate blurry SR outputs, the SSIM loss [15] and perceptual loss [16] have been proposed to regularize the local structures of images. These two losses can improve the visual quality of SR outputs without harming the distortion measures too much. However, they cannot generate image details to make the SR image perceptually more realistic.

II-B Perception-oriented SR Methods

To make the reconstructed SR images perceptually more realistic, SRGAN [7] combined the adversarial loss with distortion-oriented losses (e.g., the \ell_{1} loss) in training. Following SRGAN, ESRGAN [8] used the VGG features before activation and employed the RRDB backbone [8] as the generator, improving the perceptual quality. DualFormer [31] leveraged spectral and spatial discriminators to identify high-frequency and low-frequency information, respectively. While GAN-based methods can generate new image details, they may introduce many unnatural visual artifacts because of unstable adversarial training. Therefore, many follow-up works aim to reduce GAN-generated artifacts from the perspectives of frequency division [27, 26, 28, 47, 48], the weights of layers in the perceptual loss [30, 49], or different statistical properties [13, 29]. In addition, some works [22] aim to optimize no-reference image quality metrics, which can be non-differentiable. For example, RankSRGAN [22] designed a differentiable rank-content loss to rank the quality of predictions with the NIQE [50] or Ma [51] index.

II-C Evolutionary Algorithms in DNN Learning

EA [35, 36] is a classical search-based optimization technique, where individuals in a population evolve interactively. The EA operators mainly consist of crossover [52], mutation, and selection. Due to its global search and gradient-free properties, EA has proven effective in many applications [53, 54, 55]. It has recently been applied to DNN learning [56, 57, 58]. Guo et al. [56] applied evolutionary architecture search to construct a simplified supernet with all architectures optimized simultaneously. Yu et al. [57] searched over a big single-stage model that contains small child models. EAGAN [58] applied an efficient two-stage EA-based network search framework to learn GANs. Another application is network compression. For example, Zhou et al. [59] formulated filter pruning as a multi-objective optimization problem and proposed a knee-guided EA algorithm to trade off between the scale of parameters and performance. In addition, EA can be used as a network optimizer. ESGD [60] alternately applied SGD and EA in one framework to improve the average performance of DNNs. Zhang et al. [61] proposed a hierarchical cluster-based suppression algorithm to remove similar weights and improve population diversity. GEMONN [62] determined the search direction in the EA optimizer by the gradient of weights to improve the efficiency of EA for training DNNs.

To optimize a network, EA is usually coupled with a gradient-based optimizer to solve large-scale optimization problems [63]. Most of the previous EA-based DNN learning works [60] optimize a single objective. In this work, we formulate the perceptually realistic SR task as a multi-objective optimization problem and propose a hybrid EA and Adam algorithm to solve it.

III The Proposed Method

Figure 2: (a) The framework of traditional GAN-based SR methods. (b) The Adam steps optimize a set of generator and discriminator networks. (c) The EA steps optimize the generator networks by fixing the discriminator networks.

This section presents the proposed hybrid EA-Adam optimizer and a simple yet effective fusion network for perception-distortion balanced SR. Firstly, we formulate the SR task as a multi-objective optimization problem, with two objectives focusing on perceptual quality and reconstruction fidelity, respectively. Secondly, a hybrid EA-Adam optimizer is designed to solve the problem, where EA addresses model divergence and Adam addresses model convergence. A set of SR models with different perception-distortion preferences is trained to form a PF. Lastly, a simple yet effective fusion network is proposed to merge the learned models into a strong perception-distortion balanced SR model.

III-A Multi-Objective Formulation

Most of the GAN-based SR methods [7, 8] train the generator model by weighting the pixel-wise loss L_{pix}, the perceptual loss L_{percep}, and the adversarial loss L_{adv} as follows:

L_{GAN}=\alpha_{1}L_{pix}+\alpha_{2}L_{percep}+\alpha_{3}L_{adv}, (1)

where L_{pix} (i.e., the \ell_{1} loss) optimizes the reconstruction fidelity, while L_{percep} and L_{adv} improve the perceptual quality. The weights \alpha_{1}, \alpha_{2}, and \alpha_{3} are often empirically set to 0.01, 1, and 0.005, respectively.

The Adam optimizer [32] is predominantly used to optimize the above loss function. Unfortunately, the gradient-based Adam struggles to minimize the perception- and distortion-oriented losses simultaneously due to their opposite gradient descent directions, while an ideal weight setting is hard to find in an infinite search space. Therefore, most traditional SR models pursue either fidelity or perceptual quality at the sacrifice of the other. As illustrated in Fig. 1, when the weights are biased toward distortion- or perception-oriented losses, the SR results will be over-smoothed or contain visually unpleasant artifacts. When similar weights are set to balance perception and distortion, the optimization may stagnate at a local minimum as no explicit gradient descent direction can be found. Note that we apply gradient normalization [64] to avoid the trivial solution.

In this paper, rather than linearly combining the distortion- and perception-oriented losses into a single objective function, we formulate the perceptually realistic SR task as a multi-objective optimization problem as follows:

\left\{\begin{array}{l}\min f_{1}=L_{pix},\\ \min f_{2}=L_{percep}+\alpha L_{adv},\end{array}\right. (2)

where the objectives f_{1} and f_{2} focus on the reconstruction fidelity and the perceptual quality of the SR results, respectively. In the following, we propose to couple EA [36] with Adam [32] to address the multi-objective optimization problem in Eq. (2). EA and Adam drive the optimization toward divergence and convergence, respectively.
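To make the contrast between Eq. (1) and Eq. (2) concrete, the following is a minimal PyTorch-style sketch. The feature extractor, the discriminator, and the non-saturating form of the adversarial term are illustrative assumptions rather than the exact modules used in the paper; only the default weights 0.01/1/0.005 and the split into (f_{1}, f_{2}) come from the text above.

```python
import torch
import torch.nn.functional as F

def weighted_gan_loss(sr, hr, feat_extractor, discriminator,
                      a1=0.01, a2=1.0, a3=0.005):
    """Conventional single-objective loss of Eq. (1)."""
    l_pix = F.l1_loss(sr, hr)
    l_percep = F.l1_loss(feat_extractor(sr), feat_extractor(hr))
    l_adv = F.softplus(-discriminator(sr)).mean()   # non-saturating GAN loss (assumed form)
    return a1 * l_pix + a2 * l_percep + a3 * l_adv

def two_objectives(sr, hr, feat_extractor, discriminator, alpha=0.005):
    """Multi-objective formulation of Eq. (2): the two objectives are kept separate."""
    f1 = F.l1_loss(sr, hr)                                    # fidelity objective
    f2 = (F.l1_loss(feat_extractor(sr), feat_extractor(hr))
          + alpha * F.softplus(-discriminator(sr)).mean())    # perceptual objective
    return f1, f2
```

Keeping f_{1} and f_{2} separate is what allows the EA step below to compare models with the Tchebycheff aggregation instead of a fixed linear combination.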

TABLE I: The proposed hybrid EA-Adam optimization algorithm.
Input: The number of optimized models N, Adam epochs T^{Adam}, EA epochs T^{EA}, total epochs T, the pretrained model \theta_{G_{0}} optimized by the \ell_{1} loss, and the probability \delta of selecting parents from the neighboring population \mathcal{B}.
Step 1. Initialization: Initialize a population of generators \mathcal{G}=\{\theta_{G_{1}},\cdots,\theta_{G_{N}}\}, where \theta_{G_{k}}=\theta_{G_{0}}, k\in[1,\cdots,N]. Initialize the discriminator population \mathcal{D}=\{\theta_{D_{1}},\cdots,\theta_{D_{N}}\}. Generate the neighboring population \mathcal{B}. Define the Adam optimizers for each model.
Step 2. Optimization iteration:
For t \in 1:T^{EA}+T^{Adam}:T do
  # Adam Steps
  For i \in \{1,\cdots,T^{Adam}\} do
    For s \in \{2,\cdots,N\} do
      1. Update the weights of \theta_{G_{s}} by Adam.
      2. Update the weights of \theta_{D_{s}} by Adam.
    EndFor
  EndFor
  # EA Steps
  For j \in \{1,\cdots,T^{EA}\} do
    For k \in \{2,\cdots,N\} do
      1. If r<\delta, \mathcal{P}=\mathcal{B}(k); otherwise, \mathcal{P}=\mathcal{G}.
      2. Select parents \theta_{1} and \theta_{2} from \mathcal{P} randomly.
      3. Generate offspring \theta_{I} by the crossover and mutation in Eq. (3) and Eq. (4).
      4. Select the new individual \theta_{G_{k}} as in Eq. (5).
    EndFor
  EndFor
  Re-initialization: Re-define the Adam optimizer for each model in the population.
EndFor
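The alternating schedule of Table I can be summarized by the schematic sketch below. This is not the authors' released code: the data loader, the objective helper (e.g., a callable like the two_objectives sketch in Section III-A with the feature extractor bound internally), the EA update, and the per-model weights are placeholder arguments, and the discriminator loss form is an assumption.

```python
import torch
import torch.nn.functional as F

def ea_adam_training(generators, discriminators, loader, two_objectives,
                     ea_update, weights, T=100, T_adam=10, T_ea=1, lr=1e-4):
    """Alternate T_adam gradient epochs with T_ea evolutionary epochs (Table I)."""
    N = len(generators)
    for _cycle in range(T // (T_adam + T_ea)):
        # Re-define the Adam optimizers (and their momentum) after every EA step.
        g_opts = [torch.optim.Adam(g.parameters(), lr=lr) for g in generators]
        d_opts = [torch.optim.Adam(d.parameters(), lr=lr) for d in discriminators]
        # ---- Adam steps: convergence; model 0 (the pure-l1 model) stays fixed. ----
        for _ in range(T_adam):
            for lr_img, hr_img in loader:
                for s in range(1, N):
                    sr = generators[s](lr_img)
                    f1, f2 = two_objectives(sr, hr_img, discriminators[s])
                    g_loss = weights[s][0] * f1 + weights[s][1] * f2
                    g_opts[s].zero_grad(); g_loss.backward(); g_opts[s].step()
                    # Standard non-saturating discriminator update (assumed form).
                    d_loss = (F.softplus(-discriminators[s](hr_img)).mean()
                              + F.softplus(discriminators[s](sr.detach())).mean())
                    d_opts[s].zero_grad(); d_loss.backward(); d_opts[s].step()
        # ---- EA steps: divergence; discriminators are frozen here. ----
        for _ in range(T_ea):
            generators = ea_update(generators)  # crossover, mutation, Tchebycheff selection
    return generators
```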
Figure 3: Illustration of the convergence process of the proposed EA-Adam optimizer in an EA-Adam cycle (T^{Adam}+T^{EA}). The Adam steps primarily focus on optimization convergence, enabling rapid search for improved objective losses. In contrast, the EA step emphasizes divergence, ensuring a uniform spread of solutions across the objective plane.
Figure 4: Two stages of our network fusion process. Left: training of the attention-based weight regression network, which predicts the fusion weights of experts for each input image. Right: model fusion by averaging weight vectors of a batch of validation data.

III-B The Hybrid EA-Adam Optimizer

The algorithm of the proposed hybrid EA-Adam optimizer is summarized in Table I. EA and Adam alternately optimize the N SR models with different perception-distortion preferences. We employ the general framework of GAN-based SR learning, which is shown in Fig. 2(a). The generator network learns the mapping between the LR and HR images, and the discriminator network discriminates between the true and fake images. The two networks compete with each other to ensure that the generated images are of high quality and more natural.

The Adam step is illustrated in Fig. 2(b). Initialized by a model pre-trained with the \ell_{1} loss, the other N-1 generator-discriminator pairs are updated by Adam separately, as shown in the Adam step of Table I. We fix the first model throughout the training process as it has the best reconstruction fidelity, being trained with the pure pixel-wise \ell_{1} loss. Then, the EA step starts, as illustrated in Fig. 2(c). The adversarial training facilitates model convergence in the Adam step to obtain better objective scores, while the EA step focuses on model divergence, enabling the optimized models to be uniformly distributed on the performance plane spanned by the two objective indices. Therefore, the discriminators are only updated in the Adam steps to enhance convergence.

In the EA step, a population of generator models is optimized together. Following MOEA/D [36], the parents are first chosen from either the neighboring population \mathcal{B} or the whole population \mathcal{G}, which corresponds to local or global evolution, respectively. The offspring is generated from the selected parents via crossover and mutation operators. The crossover operator lets the individuals in the population interact with each other so that useful information can be effectively propagated within the population. We employ the SBX crossover [52], which is defined as follows:

\theta_{I}=0.5\times\left[(1+\beta)\theta_{1}+(1-\beta)\theta_{2}\right]. (3)

Here, \theta_{I}, \theta_{1}, and \theta_{2} denote the parameters of the corresponding models, and \beta is a weighting variable determined by a random number r\in[0,1]: \beta=(r\times 2)^{1/(1+\eta)} if r<0.5 and \beta=(1/(2-r\times 2))^{1/(1+\eta)} otherwise, where \eta is a constant set to 20 as in SBX [52]. The mutation operator perturbs the optimization to jump out of local optima by adding zero-mean Gaussian noise \varepsilon^{(k)}\sim\mathcal{N}(0,0.01) [60] to each of the N models, i.e.,

\theta_{I}=\theta_{I}+\varepsilon^{(k)},\quad k\in\{1,\cdots,N\}. (4)

To select a good model from the parent models and the newly generated offspring models, we measure the model performance by using an aggregation function derived from the Tchebycheff decomposition [65] as follows:

\min F^{te}=\max\{\lambda(f_{1}-z_{1}^{*}),(1-\lambda)(f_{2}-z_{2}^{*})\}, (5)

where z_{1}^{*} and z_{2}^{*} are the moving minimum values of f_{1} and f_{2} recorded up to the current iteration, and the weights \lambda are uniformly distributed within [0,1]. Specifically, we set \lambda=0,\frac{1}{N-1},\frac{2}{N-1},\cdots,1 for the N models, respectively. With different aggregation weights \lambda, the multi-objective optimization problem is divided into N single-objective sub-problems with different perception-distortion preferences. Note that the momentum in the Adam optimizer should be re-initialized after the EA optimization step. Finally, a set of generator models with different perception-distortion preferences is obtained to form a PF, as shown in Fig. 1.
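The EA operators of Eqs. (3)-(5) can be sketched as below, acting on flattened weight vectors. The evaluation of (f_{1}, f_{2}) on a probe batch is abstracted into a callable, and the neighborhood/global parent selection governed by \delta is simplified to global sampling, so this is an illustrative sketch rather than the exact implementation.

```python
import random
import torch

def sbx_crossover(theta1, theta2, eta=20.0):
    """Simulated binary crossover of Eq. (3) on flattened parameter vectors."""
    r = random.random()
    beta = (2 * r) ** (1 / (1 + eta)) if r < 0.5 else (1 / (2 - 2 * r)) ** (1 / (1 + eta))
    return 0.5 * ((1 + beta) * theta1 + (1 - beta) * theta2)

def mutate(theta, sigma=0.1):
    """Gaussian mutation of Eq. (4); N(0, 0.01) noise has standard deviation 0.1."""
    return theta + sigma * torch.randn_like(theta)

def tchebycheff(f1, f2, lam, z1, z2):
    """Aggregation function of Eq. (5) with the moving ideal point (z1, z2)."""
    return max(lam * (f1 - z1), (1 - lam) * (f2 - z2))

def ea_step(thetas, lam_list, z, evaluate):
    """One EA epoch: thetas are 1-D weight tensors, evaluate(theta) -> (f1, f2)."""
    for k in range(1, len(thetas)):                 # the pure-l1 model (index 0) is kept
        p1, p2 = random.sample(thetas, 2)           # parent selection (global, simplified)
        child = mutate(sbx_crossover(p1, p2))
        f1c, f2c = evaluate(child)
        f1k, f2k = evaluate(thetas[k])
        z[0] = min(z[0], f1c, f1k)                  # update the moving minima z1*, z2*
        z[1] = min(z[1], f2c, f2k)
        if tchebycheff(f1c, f2c, lam_list[k], z[0], z[1]) < tchebycheff(f1k, f2k, lam_list[k], z[0], z[1]):
            thetas[k] = child                       # offspring replaces the k-th individual
    return thetas, z
```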

To illustrate the training convergence process of the EA-Adam optimization algorithm, in Fig. 3 we plot the performance of the optimized models on the two conflicting objectives f_{1} and f_{2} over an Adam-EA cycle on the BSD100 dataset. Symbols of different shapes represent the N=4 optimized models, where the green color denotes the models after the first Adam step, the blue color denotes the models after T^{Adam} Adam steps (T^{Adam}=10 in this experiment), and the red color denotes the models further optimized by EA with T^{EA}=1 step. We can see that Adam focuses on the convergence of the 4 models, enabling a fast search for better objective scores. However, due to the opposite gradient directions of f_{1} and f_{2}, Adam suffers from optimization stagnation. That is, the optimized models targeting different perception-distortion objectives tend to converge to the same side of the plane, making it difficult to form a diverse solution space. Fortunately, the introduction of EA enhances the diversity of the N models so that the optimized models are more evenly distributed in the objective plane with improved performance.

III-C Network Fusion

As mentioned above, the N SR models (with the same architectural topology) optimized by the proposed hybrid EA-Adam optimizer work better than their counterparts optimized by Adam in balancing perceptual quality and distortion degree. However, it can be labor-consuming to manually select one of them to super-resolve the input LR image. On the other hand, if we apply all the N models to the input image and then fuse the outputs into one image, inference becomes expensive and less effective [39]. To solve this issue, we propose to automatically fuse the N models into one stronger model to pursue further enhanced performance beneath the PF formed by the N models. To achieve this goal, we first introduce a lightweight attention-based weight regression network to fuse the convolution layers of the N SR models as in [38, 37, 66]. As shown on the left of Fig. 4, the input is the feature from one layer and the output is the set of weights for the corresponding layer in the N models. The generated weights satisfy \sum_{k}w_{k}=1, k\in\{1,\cdots,N\}. There are 3 modules in the weight regression network. The first module consists of 3 convolution layers with Leaky ReLU activation, all with 5\times 5 kernels. The second module is a global average pooling layer, and the last is a mapping module, which consists of two linear layers followed by a Sigmoid activation. The first module extracts the features, and the last two modules estimate the weights to fuse the convolution layers of the SR models.

We train the attention-based weight regression network with the commonly-used L_{GAN} loss as it encourages the output to be perceptually more realistic. Meanwhile, since the aggregated model linearly interpolates among the N SR models, which are fixed in this stage, the reconstruction fidelity can also be well preserved. We fuse the N SR models in one shot and apply the fused model to all testing images, as shown on the right of Fig. 4. Here, we employ a set of M validation images, i.e., \{I_{i}\}_{i=1}^{M}, to facilitate the fusion process. Based on the validation data, we calculate the adaptive fusion weights for each I_{i}, i.e., [w_{1}^{i},w_{2}^{i},\cdots,w_{N}^{i}], using the trained weight regression network, and average them over the M weight vectors to get a universal weight vector [\bar{w}_{1},\bar{w}_{2},\cdots,\bar{w}_{N}]. Finally, a fused model is obtained, which inherits the advantages of the N SR models in terms of both reconstruction fidelity and perceptual quality, while holding a stronger generalization capability.
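A hedged sketch of the weight regression head and the one-shot fusion step is given below. The channel widths, the normalization that enforces the weights to sum to one, and applying a single universal weight vector to all layers are illustrative assumptions; the paper's actual implementation predicts weights per layer.

```python
import torch
import torch.nn as nn

class WeightRegression(nn.Module):
    """3 conv layers (5x5) + Leaky ReLU, global average pooling, 2 linear layers + Sigmoid."""
    def __init__(self, in_ch=64, n_models=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_ch, 64, 5, padding=2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 5, padding=2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 64, 5, padding=2), nn.LeakyReLU(0.2, inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mapping = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, n_models), nn.Sigmoid())

    def forward(self, feat):
        w = self.mapping(self.pool(self.features(feat)).flatten(1))
        return w / w.sum(dim=1, keepdim=True)      # enforce sum_k w_k = 1 (assumed normalization)

@torch.no_grad()
def fuse_models(state_dicts, w_bar):
    """Merge N state dicts with the averaged universal weights [w_bar_1, ..., w_bar_N]."""
    fused = {}
    for name in state_dicts[0]:
        fused[name] = sum(w * sd[name].float() for w, sd in zip(w_bar, state_dicts))
    return fused
```

The fused state dict can then be loaded into a single backbone network, so inference costs exactly one forward pass regardless of N.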

IV Experiments

IV-A Experiment Setup

Datasets and evaluation metrics. To train the population of SR models, we use either the DIV2K dataset [67] (800 images) or the DF2K dataset [67, 68] (800 DIV2K images and 2650 Flickr2K images), according to the training datasets of the competing methods.

Among the 800 images in the DIV2K training dataset, we utilize 700 images as training data for the weight regression network, and the remaining 100 images are used as the validation data in the network fusion process. We do not employ any test data during the training stage. Once learned, our weight regression network is fixed and applied to any given test set. Therefore, our method is zero-shot, which aligns with the approach taken by the referenced and compared SR methods.

Following prior works [8, 13, 48], we conduct experiments with a scaling factor of 4\times under both synthetic (downsampled using the MATLAB bicubic kernel) and real-world (degraded using the RealESRGAN pipeline [69]) settings. For the bicubic-degraded SR task, we evaluate the performance of different methods on 6 benchmarks, including Set5 [70], Set14 [71], BSD100 [72], Urban100 [73], Manga109 [74], and DIV2K100 [67]. We compute the PSNR and SSIM [15] indices on the Y channel in the YCbCr space to measure the distortion degree, and compute the LPIPS [33] and DISTS [75] indices in the RGB space to evaluate the perceptual quality. For the real-world SR task, we evaluate the performance of different methods on the RealSR [76] and DRealSR [77] benchmarks.
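For reference, a small sketch of the PSNR-on-Y protocol mentioned above is given below. It assumes the ITU-R BT.601 conversion coefficients and a border crop equal to the scale factor, which are common conventions rather than details stated in the paper.

```python
import numpy as np

def rgb_to_y(img):
    """Y channel of YCbCr (BT.601) for an HxWx3 uint8 RGB image."""
    r = img[..., 0].astype(np.float64)
    g = img[..., 1].astype(np.float64)
    b = img[..., 2].astype(np.float64)
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, crop=4):
    """PSNR on the Y channel; crop of the scale-factor border is an assumed convention."""
    y1, y2 = rgb_to_y(sr), rgb_to_y(hr)
    if crop:
        y1, y2 = y1[crop:-crop, crop:-crop], y2[crop:-crop, crop:-crop]
    mse = np.mean((y1 - y2) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```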

Backbones and compared methods: As in many previous perceptually realistic SR works [22, 21, 13, 27], we adopt SRResNet [7] and RRDB [8] as the backbone networks. SRResNet is a lightweight network proposed in SRGAN [7], while RRDB [8] is widely used in later GAN-based SR methods [21, 78, 13] for its superior performance. Specifically, we compare our SRResNet-based model with SRGAN [7] and RankSRGAN [22], and compare our RRDB-based model with ESRGAN [8], SPSR [78], LDL [13], CAL-GAN [79] and WGSR [48]. In addition, Transformer-based backbones have become popular as they can capture long-range dependencies. Therefore, we compare SwinIR [43] trained with the L_{GAN} loss by the Adam optimizer against SwinIR trained with our proposed hybrid EA-Adam optimization method to validate the effectiveness of our method on the Transformer backbone. For the methods that apply different backbones or training datasets, we compare them with our model of similar capacity and training data. Specifically, we compare our SRResNet-based model with CFSNet [80] and G-MGBP [20], and compare our RRDB-based model with SROOE [81], PD-ADMM [28] and DualFormer [31]. We further validate the proposed method on real-world SR tasks, and compare the obtained 'RealESRGAN+Ours' model with RealESRGAN [69], IKC [82], IKR-Net [83], LDL [13] and DASR [38]. The results of the compared methods are obtained using the officially released models. Note that we do not compare our method with the distortion-oriented SR methods, because this work focuses on perception-distortion balance.

Training details: We validate the effectiveness of our method with scaling factor 4, following the previous works [7, 8]. The data augmentation and discriminator networks used in SRGAN and ESRGAN are adopted in our SRResNet- and RRDB-based models, respectively, for a fair comparison. The size of training input patches is set to 32\times 32. For both the hybrid EA-Adam optimizer and the attention-based weight regression network, the learning rate is set to 1\times 10^{-5} for the SRResNet backbone, and 1\times 10^{-4} for the RRDB and SwinIR backbones. To balance the performance and the training efficiency, we set the population size to N=5. The number of Adam epochs T^{Adam} is 10, the number of EA epochs T^{EA} is 1, and the total number of optimization epochs T is 100. The probability \delta of selecting parents from the neighboring population \mathcal{B} is 0.7.
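For convenience, the hyper-parameters above can be collected into a configuration dictionary; the structure and key names below are our own illustration (only the values are taken from the text), and such a dictionary could drive a training-loop sketch like the one given after Table I.

```python
config = dict(
    scale=4,                      # super-resolution factor
    patch_size=32,                # LR training patch size (32x32)
    n_models=5,                   # population size N
    T_total=100,                  # total optimization epochs T
    T_adam=10,                    # Adam epochs per cycle
    T_ea=1,                       # EA epochs per cycle
    delta=0.7,                    # prob. of selecting parents from the neighboring population B
    lr={"SRResNet": 1e-5, "RRDB": 1e-4, "SwinIR": 1e-4},   # learning rates per backbone
)
```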

TABLE II: Quantitative comparison between the state-of-the-art methods and the proposed hybrid EA-Adam optimization method. Three groups of methods trained with the same backbone are compared, using the SRResNet [7], RRDB [8] and SwinIR [43] backbones, respectively. 'SwinIR+L_{GAN}' is the SwinIR model trained with the L_{GAN} loss by the Adam optimizer. The best results of each group are highlighted in bold.
Method SRGAN [7] RankSRGAN [22] SRResNet+Ours ESRGAN [8] SPSR [78] LDL [13] CAL-GAN [79] WGSR [48] RRDB+Ours SwinIR+L_{GAN} SwinIR+Ours
Training Dataset DIV2K DIV2K DIV2K DF2K+OST DIV2K DIV2K DIV2K DIV2K DIV2K DIV2K DIV2K
Set5 PSNR \uparrow 29.93 29.77 30.25 30.45 30.40 30.99 31.04 30.56 31.31 30.35 30.40
SSIM\uparrow 0.8535 0.8394 0.8584 0.8582 0.8443 0.8679 0.8555 0.8529 0.8792 0.8543 0.8581
LPIPS\downarrow 0.0740 0.0763 0.0694 0.0741 0.0686 0.0643 0.0670 0.0741 0.0683 0.0694 0.0678
DISTS\downarrow 0.0999 0.0958 0.0975 0.0972 0.0925 0.0938 0.1007 0.1073 0.0947 0.0960 0.0953
Set14 PSNR\uparrow 26.53 26.47 26.98 27.07 26.64 27.20 27.31 26.97 27.42 26.61 26.78
SSIM\uparrow 0.7173 0.7031 0.7342 0.7085 0.7138 0.7429 0.7367 0.7255 0.7537 0.7253 0.7302
LPIPS\downarrow 0.1430 0.1398 0.1330 0.1315 0.1345 0.1301 0.1308 0.1372 0.1239 0.1268 0.1268
DISTS\downarrow 0.1113 0.1105 0.1043 0.0985 0.0990 0.1003 0.1127 0.1147 0.1021 0.1074 0.1047
BSD100 PSNR\uparrow 25.58 25.51 25.94 25.33 25.51 26.12 26.29 26.01 26.51 25.58 25.75
SSIM\uparrow 0.6677 0.6510 0.6836 0.6643 0.6599 0.6941 0.6829 0.6815 0.7071 0.6764 0.6832
LPIPS\downarrow 0.1770 0.1829 0.1766 0.1602 0.1665 0.1614 0.1672 0.1791 0.1609 0.1568 0.1560
DISTS\downarrow 0.1293 0.1290 0.1286 0.1174 0.1186 0.1243 0.1287 0.1351 0.1212 0.1223 0.1207
Manga109 PSNR\uparrow 28.12 27.94 28.67 28.42 28.56 29.41 29.21 28.13 29.33 28.60 28.64
SSIM\uparrow 0.8656 0.8500 0.8735 0.8619 0.8590 0.8767 0.8671 0.8521 0.8874 0.8650 0.8694
LPIPS\downarrow 0.0704 0.0755 0.0648 0.0644 0.0664 0.0547 0.0690 0.0760 0.0536 0.0631 0.0621
DISTS\downarrow 0.0551 0.0560 0.0540 0.0467 0.0460 0.0402 0.0498 0.0653 0.0398 0.0407 0.0391
Urban100 PSNR\uparrow 24.41 24.53 24.83 24.36 24.80 25.50 25.33 24.83 25.57 25.06 25.14
SSIM\uparrow 0.7320 0.7284 0.7479 0.7363 0.7473 0.7692 0.7639 0.7453 0.7752 0.7564 0.7594
LPIPS\downarrow 0.1439 0.1435 0.1354 0.1235 0.1206 0.1096 0.1167 0.1303 0.1150 0.1215 0.1193
DISTS\downarrow 0.1076 0.1062 0.1038 0.0880 0.0861 0.0861 0.0877 0.1065 0.0842 0.0895 0.0863
DIV2K100 PSNR\uparrow 28.17 28.07 28.62 28.18 28.18 28.96 28.95 28.98 29.34 28.63 28.77
SSIM\uparrow 0.7765 0.7654 0.7900 0.7779 0.7720 0.7970 0.7897 0.7980 0.8123 0.7897 0.7945
LPIPS\downarrow 0.1254 0.1318 0.1208 0.1151 0.1126 0.1008 0.1072 0.1191 0.1038 0.1044 0.1027
DISTS\downarrow 0.0663 0.0657 0.0646 0.0594 0.0546 0.0529 0.0600 0.0687 0.0554 0.0562 0.0537
#FLOPs 10.38G 73.43G 4.42G
#Params 1.52M 16.69M 0.93M
TABLE III: Quantitative comparisons between the state-of-the-art perceptually realistic SR models and the models trained by our method. 'RRDB+Ours-PD' employs the RRDB backbone and the weight setting in PD-ADMM [28]. 'DualFormer+Ours' employs the RRDB backbone and the discriminator in DualFormer [31]. The best results of each group are highlighted in bold.
Method G-MGBP [20] CFSNet [80] SRResNet+Ours PD-ADMM [28] RRDB+Ours-PD SROOE [30] DualFormer [31] RRDB+Ours DualFormer+Ours
Training Dataset DIV2K DIV2K DIV2K DIV2K DIV2K DF2K DF2K DF2K DF2K
Set5 PSNR\uparrow 29.54 30.23 30.25 31.82 31.68 31.30 31.35 31.45 31.32
SSIM\uparrow 0.8419 0.8503 0.8584 0.8810 0.8838 0.8671 0.8690 0.8744 0.8713
LPIPS\downarrow 0.0855 0.0781 0.0694 0.0758 0.0746 0.0646 0.0678 0.0640 0.0595
DISTS\downarrow 0.1167 0.0986 0.0975 0.1107 0.1047 0.0936 0.0920 0.0930 0.0917
Set14 PSNR\uparrow 26.58 26.59 26.98 27.99 27.95 27.36 27.47 27.59 27.53
SSIM\uparrow 0.7151 0.7170 0.7342 0.7589 0.7668 0.7362 0.7397 0.7479 0.7443
LPIPS\downarrow 0.1502 0.1619 0.1330 0.1401 0.1336 0.1167 0.1191 0.1137 0.1134
DISTS\downarrow 0.1207 0.1178 0.1043 0.1129 0.1083 0.0907 0.0933 0.1007 0.0974
BSD100 PSNR\uparrow 25.68 25.60 25.94 26.86 26.93 26.34 26.50 26.64 26.43
SSIM\uparrow 0.6673 0.6677 0.6836 0.7137 0.7206 0.6953 0.6895 0.7044 0.7019
LPIPS\downarrow 0.1882 0.1822 0.1766 0.1854 0.1774 0.1528 0.1583 0.1583 0.1520
DISTS\downarrow 0.1418 0.1312 0.1286 0.1369 0.1312 0.1172 0.1207 0.1209 0.1176
Manga109 PSNR\uparrow 28.17 28.54 28.67 30.17 29.92 29.97 29.82 29.82 29.87
SSIM\uparrow 0.8609 0.8622 0.8735 0.8864 0.8952 0.8802 0.8837 0.8943 0.8876
LPIPS\downarrow 0.0786 0.0689 0.0648 0.0616 0.0528 0.0504 0.0528 0.0491 0.0504
DISTS\downarrow 0.0726 0.0558 0.0550 0.0544 0.0455 0.0402 0.0377 0.0386 0.0365
Urban100 PSNR\uparrow 24.40 24.72 24.83 26.29 25.89 25.94 25.68 25.94 25.95
SSIM\uparrow 0.7415 0.7445 0.7479 0.7882 0.7824 0.7801 0.7741 0.7853 0.7854
LPIPS\downarrow 0.1477 0.1362 0.1354 0.1234 0.1220 0.1080 0.1150 0.1124 0.1089
DISTS\downarrow 0.1358 0.1086 0.1038 0.1016 0.0931 0.0846 0.0842 0.0833 0.0827
DIV2K100 PSNR\uparrow 28.27 28.49 28.62 29.71 29.91 29.24 29.30 29.33 29.36
SSIM\uparrow 0.7766 0.7826 0.7900 0.8117 0.8226 0.8016 0.8020 0.8142 0.8111
LPIPS\downarrow 0.1488 0.1233 0.1208 0.1223 0.1136 0.0968 0.1027 0.1034 0.0989
DISTS\downarrow 0.0925 0.0649 0.0666 0.0763 0.0654 0.0565 0.0551 0.0540 0.0524
#FLOPs 22.42G 19.22G 10.38G 109.78G 73.43G 90.06G 73.43G 73.43G 73.43G
#Params 0.28M 5.00M 1.52M 18.67M 16.69M 103.44M 16.69M 16.69M 16.69M
Figure 5: Visual comparisons between the proposed EA-Adam optimization method and other state-of-the-art methods under the SRResNet backbone.
Figure 6: Visual comparisons between the proposed EA-Adam optimization method and other state-of-the-art methods under the RRDB backbone.
Figure 7: Visual comparisons between SwinIR optimized with the L_{GAN} loss by the Adam optimizer and by the proposed EA-Adam optimization method.
Figure 8: Visual comparisons between the proposed EA-Adam optimization method and PD-ADMM.
Figure 9: Visual comparisons between the proposed EA-Adam optimization method, SROOE, and DualFormer.

IV-B Comparison with State-of-the-Arts

In this section, we compare the final fused model by our proposed hybrid EA-Adam optimizer and network fusion method against state-of-the-art methods. Table II provides the quantitative comparisons among the methods with the same backbone. Table III provides the quantitative comparisons among the methods with similar model capacity.

From Table II, we can see that for the lightweight models with the SRResNet backbone, SRGAN obtains better PSNR and SSIM scores than RankSRGAN on almost all the benchmarks, while their LPIPS scores vary across benchmarks. Our proposed method improves the fidelity- and perception-oriented metrics by a large margin on all benchmarks without increasing the SR model complexity. This validates the advantages of our hybrid EA-Adam optimization strategy, which pushes the PF toward the optimal point of the solution space. For the models trained with the RRDB backbone, we can see that our proposed method achieves better results than the competing methods ESRGAN, SPSR, LDL, CAL-GAN, and WGSR in most cases. LDL also demonstrates competitive results in several instances, especially for the perceptual quality metrics on the Urban100 dataset. This advantage comes from its iterative and local discriminative strategy, which can effectively inhibit visual artifacts and encourage the generation of high-frequency details. Furthermore, our method with the SwinIR backbone demonstrates significant improvement over its counterpart, which verifies that the proposed method can be applied to both CNN-based and Transformer-based models.

From Table III, we can see that the proposed method achieves consistent improvement on most benchmarks in terms of both perceptual quality (LPIPS) and reconstruction accuracy (PSNR and SSIM), although compared methods such as G-MGBP, CFSNet and SROOE have more FLOPs (e.g., 22.42G FLOPs for 'G-MGBP' and 19.22G FLOPs for 'CFSNet', versus 10.38G FLOPs for 'SRResNet+Ours'). In addition, we test the released model of PD-ADMM, which performs favorably in terms of distortion control yet sacrifices the perceptual quality. PD-ADMM applies the same weight, i.e., [0.5, 0.5], to the perception and distortion objectives, and further introduces a constraint term to ensure that the low-frequency information in the two stages is equal; the ADMM optimization is applied to solve this constrained problem. Since it is improper and difficult to integrate the constraint term into our proposed hybrid EA-Adam approach with N optimized models, to fairly compare our method with PD-ADMM, we apply the optimization objective of PD-ADMM to EA-Adam without its constraint. One can see that our method outperforms PD-ADMM in almost all cases but with fewer FLOPs and parameters. SROOE employs an additional condition network to learn the loss aggregation weights, leading to a significantly larger model size and FLOPs than the other methods. In contrast, our EA-Adam method is more efficient while achieving comparable results with SROOE. Our method obtains better performance than DualFormer on both perception and distortion metrics. Note that DualFormer employs a spectral and a spatial discriminator. Our EA-Adam presents a new training strategy from the optimization perspective, which can be applied to DualFormer to further enhance its performance, as shown by 'DualFormer+Ours' in Table III.

We provide 4\times SR visual comparisons between the models (with SRResNet [7], RRDB [8] and SwinIR [43] backbones) trained by our method and other state-of-the-art models in Fig. 5, Fig. 6 and Fig. 7. We do not compare with the earlier method ESRGAN due to limited space. In addition, we give the visual comparisons between PD-ADMM [28] and 'RRDB+Ours-PD', i.e., the RRDB model trained by our EA-Adam optimization method with the weight setting in PD-ADMM, in Fig. 8. The visual comparisons between SROOE [30], DualFormer [31] and our proposed EA-Adam optimization method are presented in Fig. 9. One can see that our method restores more correct visual patterns (e.g., the steps in the second group of Fig. 5 and the zebra crossings in the third group of Fig. 6) while inhibiting artifacts (e.g., the building windows in the fifth group of Fig. 9). At the same time, our model can generate rich high-frequency details (e.g., the lines of buildings and the wood floor in the first and third groups of Figs. 7 and 8, respectively).

IV-C Ablation Studies

In this section, we conduct a series of ablation studies to validate the effectiveness of the proposed hybrid EA-Adam optimization. We first conduct experiments to compare our hybrid EA-Adam algorithm with the Adam algorithm, and compare their results with different selections of N. Then we compare our network fusion method with the learnable weight method. Furthermore, we study the selection of different loss functions in optimizing the weight regression network. We also study the aggregation weights of the obtained models with different perception-distortion preferences. All ablation results are evaluated on the Urban100 dataset.

TABLE IV: Quantitative comparison between the models optimized by Adam, AdamW, our hybrid EA-Adam, and hybrid EA-AdamW optimizers. The SRResNet backbone is adopted and the models are evaluated on the Urban100 dataset. 'Fused Model' is obtained by applying our network fusion method to the four trained models. The best results are highlighted.
Model Model1 Model2 Model3 Model4 Fused model
Weights [0.25,0.75] [0.5,0.5] [0.75,0.25] [1,0] -
PSNR Adam 24.36 23.82 23.96 24.22 24.41
EA-Adam 25.78 25.75 24.95 24.73 24.83
AdamW 25.48 24.67 23.41 24.18 24.37
EA-AdamW 25.95 25.81 25.21 24.82 24.91
LPIPS Adam 0.3061 0.2899 0.2665 0.1446 0.1437
EA-Adam 0.2120 0.2055 0.1536 0.1384 0.1354
AdamW 0.2570 0.2925 0.2306 0.1477 0.1444
EA-AdamW 0.2265 0.2201 0.1519 0.1404 0.1368
Figure 10: Visual comparisons among SR models trained by the original Adam optimizer and our hybrid EA-Adam optimizer with the SRResNet [7] backbone. Each model trained by EA-Adam outperforms the one trained by Adam in both fidelity and perceptual quality. The fused model by the proposed EA-Adam method achieves further benefits in fidelity and perception.

Hybrid EA-Adam vs. Adam. We compare the performance of models optimized by the original Adam and our hybrid EA-Adam. In addition, we replace Adam with AdamW [84], a recently developed variant of Adam, and report the results of AdamW and hybrid EA-AdamW for a more comprehensive evaluation of our proposed method. As traditional optimizers (Adam and AdamW) cannot directly optimize a multi-objective problem, we uniformly set 4 combinations of loss weights, i.e., [0.25,0.75], [0.5,0.5], [0.75,0.25] and [1,0], to weight the two objectives in Eq. (2) into a single-objective optimization problem. For example, if the weight is [0.25,0.75], the optimization objective of Adam is 0.25\times f_{1}+0.75\times f_{2}, which is the same as that of the Adam steps in our proposed hybrid optimizer. The models optimized with different weights are named 'Model1' to 'Model4', respectively. In addition, we apply the same network fusion method (as in Section III-C) to the 4 models optimized by Adam.

Table IV gives the quantitative comparison. Without loss of generality, we conduct experiments with the SRResNet backbone and evaluate on the Urban100 dataset. The PSNR and LPIPS metrics are used to measure the fidelity and perceptual quality, respectively. One can see that our hybrid EA-Adam/EA-AdamW optimizers outperform Adam/AdamW by a large margin in most cases. Both the fidelity and perceptual quality metrics are improved, validating that the proposed hybrid EA-Adam strategy can push the PF closer to the optimal point of the metric space. For most models, EA-Adam improves PSNR by more than 0.4dB over Adam and improves LPIPS by more than 10\%. The inferior performance of Adam might be caused by the gradient vanishing problem in the gradient-based optimization process, which can trap the models in a local minimum. In contrast, with our hybrid EA-Adam optimizer, effective interaction between different models is introduced by the crossover operator in EA, so that the models can be stably optimized with better convergence.

On the other hand, each of the four models may perform well in one aspect yet sacrifice the other as a trade-off. The fused model produced by our network fusion method achieves a better balance between fidelity and perceptual quality. Specifically, although the fidelity indices of the fused model are lower than those of Models 1, 2, and 3, its perceptual performance is better than that of Model 4, which is trained to maximize perceptual quality. This result aligns well with the goal of the fusion network, which aims to further improve the perceptual quality of the fused model without reducing the fidelity much. It is worth noting that the poor convergence of Models 1 to 3 makes the traditional Adam/AdamW optimizers gain less improvement from the network fusion process. In contrast, the proposed hybrid optimization strategy facilitates the effectiveness of network fusion via the interaction of individuals in the crossover operation. In addition, we find that AdamW does not show many advantages over Adam in the SR task. This is because AdamW is mainly proposed to handle the over-fitting problem in high-level vision tasks by introducing weight decay. However, this may not fit the SR task, as the optimization might be under-fitting given the large space of image details.

Fig. 10 shows the visual comparisons among SR models trained by the Adam and our EA-Adam optimizers, where observations consistent with Table IV can be drawn. Specifically, the models optimized by EA-Adam outperform their counterparts optimized by Adam, with structures and patterns that are more faithful to the HR image and richer generated details. It can also be observed that while each of the four models excels in either fidelity or perceptual quality, the fused model produced by our method shows comprehensive enhancement in both aspects.

TABLE V: Training time and quantitative comparisons on the Urban100 benchmark by using different numbers of individuals (N) in EA.
N PSNR SSIM LPIPS Training time
3 25.34 0.7525 0.1184 63h
5 25.57 0.7752 0.1150 112h
7 25.60 0.7804 0.1125 160h
9 25.64 0.7823 0.1114 211h
TABLE VI: Quantitative comparison between the 'Learnable Weight', 'Weight Regression', and 'Our Fusion Method' on the Urban100 dataset. The RRDB backbone is adopted. The best and second-best results are highlighted in red and blue.
PSNR SSIM LPIPS
Learnable Weight 25.60 0.7748 0.1195
Weight Regression 25.71 0.7756 0.1124
Our Fusion Method 25.57 0.7752 0.1150

Selection of the number of optimized models N. To show the influence of the number of optimized models, we conduct experiments with different selections of N. The results on Urban100 are shown in Table V. One can see that increasing N can obtain better perception-distortion performance, but at the price of longer optimization time. When N is greater than 5, the benefits are not significant. Considering both performance and training time, we choose N=5 in the experiments.

Network Fusion vs. Learnable Weight. We propose a two-stage framework for fusing the N networks. The first stage is the attention-based weight regression, which generates a set of weights for each input LR image, so that the corresponding weight vectors of the validation data can be obtained. The second stage is model fusion, which obtains a single stronger model by averaging all the weight vectors of the validation data. To show the effectiveness of the proposed fusion method, we introduce a baseline that sets the fusion weights of the N models directly as learnable parameters, which is named 'Learnable Weight'. We compare our two stages, named 'Weight Regression' and 'Our Fusion Method', with 'Learnable Weight' on the Urban100 dataset in Table VI. We can see that 'Weight Regression' achieves the best results in all metrics since it can dynamically infer the optimal weights for each input based on the image content. 'Our Fusion Method' obtains better performance than 'Learnable Weight' in the perceptual quality metric LPIPS with a 0.0045 improvement while keeping comparable PSNR and SSIM values. This is because the universal network weight is inherited from 'Weight Regression' using the validation data, which provides a stronger generalization capability.

TABLE VII: Quantitative comparison on the Urban100 benchmark among different loss functions for the network fusion process. The RRDB backbone is adopted.
PSNR SSIM LPIPS
RRDB+Ours-\ell_{1} 26.55 0.8029 0.1955
RRDB+Ours-PD [65] 25.89 0.7824 0.1220
RRDB+Ours-L_{adv} 25.57 0.7752 0.1150

Loss Selection for Network Fusion. The fusion network aims to leverage the strengths of the N models for perceptually more plausible SR results. Table VII gives the numerical comparisons on the Urban100 benchmark with the \ell_{1}, PD [28] and L_{adv} loss functions. We choose the adversarial loss in this paper because it achieves the best LPIPS score, which implies perceptually more realistic outputs.

TABLE VIII: Aggregation weights of two layers (L1 and L2) in the RRDB models with different perception-distortion preferences given two inputs (I1 and I2), and the final fusion model (Ours).
L1 L2
I1 [0.0463, 0.0167, 0.0678, 0.3333, 0.5358] [0.2348, 0.2084, 0.1426, 0.1452, 0.2689]
I2 [0.0263, 0.0144, 0.1014, 0.3307, 0.5272] [0.3371, 0.1061, 0.0022, 0.1985, 0.3561]
Ours [0.0965, 0.0733, 0.1429, 0.2629, 0.4243] [0.3203, 0.1212, 0.0224, 0.1932, 0.3428]
TABLE IX: Quantitative comparisons on the Urban100 benchmark between different ways of aggregation weight combination. The RRDB backbone is adopted. The best results are highlighted in bold.
PSNR SSIM LPIPS
(a) 21.86 0.6546 0.2269
(b) 24.52 0.7573 0.1755
(c) 25.57 0.7752 0.1150

Aggregation Weights of Obtained Models. In the fusion stage, the network parameters of the N models are frozen, and the aggregation weights are optimized for each layer using a set of validation data to fuse the N models into one by interpolation, as done in network interpolation methods [39]. To further show the advantage of our proposed fusion method, we study the aggregation weights of the obtained models. Given two inputs (I1 and I2), the aggregation weights in two randomly chosen layers (L1 and L2) of the 5 obtained RRDB models are shown in Table VIII. The aggregation weights of the final fusion model 'Ours' are also given. It can be seen that the weights are neither evenly distributed among models nor concentrated on one or two models. The aggregation weight distribution varies across layers, and each model contributes to the final fusion. This demonstrates the effectiveness of our EA-Adam optimizer, which can simultaneously obtain models with different focuses. We also perform an ablation study on the aggregation weights by (a) directly averaging the parameters of the 5 networks, (b) applying the aggregation weights of one layer to all layers of the 5 networks, and (c) our proposed fusion strategy. The results on Urban100 are shown in Table IX. Our fusion method achieves the best result, showing the effectiveness of our network fusion method.

IV-D Applications to Real-World Super-Resolution

To demonstrate the generalization capability of the proposed method, we further present comparisons under the complex real-world SR setting. Specifically, we use the generator and discriminator of RealESRGAN [69] and optimize the model with our proposed hybrid EA-Adam optimizer and the fusion strategy. DF2K [67, 68] is used as the training dataset, and the LR images are degraded with the RealESRGAN degradation pipeline [69]. The test data are from the real-world datasets RealSR [76] and DRealSR [77]. The compared methods include RealESRGAN [69], IKC [82], LDL [13], IKR-Net [83] and DASR [38]. The results are shown in Table X. One can see that IKC obtains the worst results since it focuses on solving blur degradation and cannot adapt to real-world SR tasks. The proposed method obtains the best perception-distortion results in the real-world SR task, demonstrating its effectiveness.

TABLE X: Quantitative comparisons with GAN-based state-of-the-art real-world super-resolution methods on the RealSR and DRealSR test datasets. The RRDB backbone is adopted. The best result is highlighted in bold.
RealSR DRealSR
PSNR/SSIM/LPIPS PSNR/SSIM/LPIPS
RealESRGAN 25.69/0.7616/0.2727 28.64/0.8053/0.2847
IKC 19.51/0.5088/0.4337 27.07/0.7429/0.3994
IKR-Net 25.15/0.7308/0.3912 28.94/0.8131/0.4274
LDL 25.28/0.7567/0.2766 28.21/0.8126/0.2815
DASR 27.02/0.7708/0.3151 29.77/0.8263/0.3126
RRDB+Ours 27.59/0.7806/0.2621 30.32/0.8358/0.2802

IV-E Training and Inference Time

Training time. We first compare the training time of our hybrid EA-Adam optimization method (including the EA-Adam optimizer and the network fusion stage) and the original Adam optimizer by training an SR model with the RRDB backbone for 4\times super-resolution. Table XI provides the quantitative comparison of the training time. Specifically, the Adam optimizer costs 250 epochs and 48h to converge. Though the hybrid EA-Adam costs a longer time in each epoch, it converges in far fewer epochs thanks to the effective interaction in the EA step. The hybrid EA-Adam optimization method costs 144h in total to finish the model training, which is about three times the training time of Adam. Most of the training time of the hybrid EA-Adam is spent on the EA part. It is noted that various parallel computation methods [85, 86] have been developed to speed up the EA process based on CUDA technology, showing great potential to accelerate the training of our method.

Inference time. While spending more time in training, the proposed method does not introduce any extra computational and memory burden during the inference stage. The trained SR model has the same inference time as other SR models with the same backbone.

TABLE XI: Comparison of the training time of our hybrid EA-Adam optimization method and the Adam method.
Method Adam EA-Adam optimizer Network Fusion
EA-iter Adam-iter
Run-time/epoch 0.19h 4.76h 0.80h 0.63h
Total-epoch 250 8 92 50
Total-run-time 48h 112h 32h
Figure 11: Visualization of failure cases generated by our model.

IV-F Failure Cases

While our method can improve the training process of GAN-based SR models and enhance their perception-distortion balance, it can still fail in some challenging cases. Fig. 11 shows some typical examples. If the high-frequency structures are largely destroyed in the input LR image, it is hard for our model to recover them in the SR output, as shown in Fig. 11(a). In addition, for some small-scale textures and text characters, our method may generate visual artifacts, as shown in Figs. 11(b) and 11(c).

V Conclusion

In this paper, we proposed a hybrid EA-Adam optimization method to train perception-distortion balanced image super-resolution (SR) models. We formulated the perception-distortion trade-off as a multi-objective optimization problem. A population of SR models was optimized by alternately performing the Adam and EA steps, where Adam was dedicated to model convergence and EA focused on rescuing the models from local minima thanks to its strong capability in multi-objective search. An effective model fusion strategy was then designed to merge the trained models into a stronger one, which further improved the perception-distortion trade-off without increasing the inference cost. Extensive experiments demonstrated the effectiveness of the proposed method against the traditional Adam optimizer, and the state-of-the-art perception-distortion balanced performance of the trained SR models.

References

  • [1] Z. Wang, J. Chen, and S. C. H. Hoi, “Deep learning for image super-resolution: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3365–3387, 2021.
  • [2] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European conference on computer vision, 2014, pp. 184–199.
  • [3] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in European conference on computer vision.   Springer, 2016, pp. 391–407.
  • [4] M. S. Sajjadi, B. Scholkopf, and M. Hirsch, “EnhanceNet: Single image super-resolution through automated texture synthesis,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4491–4500.
  • [5] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep back-projection networks for super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1664–1673.
  • [6] J. Kim, J. K. Lee, and K. M. Lee, “Deeply-recursive convolutional network for image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1637–1645.
  • [7] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
  • [8] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, “ESRGAN: Enhanced super-resolution generative adversarial networks,” in Proceedings of the European conference on computer vision (ECCV) workshops, 2018, pp. 0–0.
  • [9] C. Laroche, A. Almansa, and M. Tassano, “Deep model-based super-resolution with non-uniform blur,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, pp. 1797–1808.
  • [10] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 286–301.
  • [11] S. Anwar and N. Barnes, “Densely residual laplacian super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1192–1204, 2020.
  • [12] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proceedings of the IEEE International Conference on computer vision, 2017, pp. 4799–4807.
  • [13] J. Liang, H. Zeng, and L. Zhang, “Details or Artifacts: A locally discriminative learning approach to realistic image super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5657–5666.
  • [14] T. Xu, P. Mi, X. Zheng, L. Li, F. Chao, G. Jiang, W. Zhang, Y. Zhou, and R. Ji, “What hinders perceptual quality of PSNR-oriented methods?” arXiv preprint arXiv:2201.01034, 2022.
  • [15] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [16] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European conference on computer vision.   Springer, 2016, pp. 694–711.
  • [17] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [18] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6228–6237.
  • [19] J. W. Soh, G. Y. Park, J. Jo, and N. I. Cho, “Natural and realistic single image super-resolution with explicit natural manifold discrimination,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8122–8131.
  • [20] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 PIRM challenge on perceptual image super-resolution,” in Computer Vision – ECCV 2018 Workshops.   Cham: Springer International Publishing, 2019, pp. 334–355.
  • [21] K. Zhang, L. V. Gool, and R. Timofte, “Deep unfolding network for image super-resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 3217–3226.
  • [22] W. Zhang, Y. Liu, C. Dong, and Y. Qiao, “RankSRGAN: Generative adversarial networks with ranker for image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3096–3105.
  • [23] Q. Cai, J. Li, H. Li, Y.-H. Yang, F. Wu, and D. Zhang, “Tdpn: Texture and detail-preserving network for single image super-resolution,” IEEE Transactions on Image Processing, vol. 31, pp. 2375–2389, 2022.
  • [24] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [25] E. Prashnani, H. Cai, Y. Mostofi, and P. Sen, “PieAPP: Perceptual image-error assessment through pairwise preference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1808–1817.
  • [26] D. Fuoli, L. Van Gool, and R. Timofte, “Fourier space losses for efficient perceptual image super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2360–2369.
  • [27] W. Li, K. Zhou, L. Qi, L. Lu, and J. Lu, “Best-Buddy GANs for highly detailed image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1412–1420.
  • [28] Y. Zhang, B. Ji, J. Hao, and A. Yao, “Perception-distortion balanced ADMM optimization for single-image super-resolution,” in European Conference on Computer Vision.   Springer, 2022, pp. 108–125.
  • [29] L. Xie, X. Wang, X. Chen, G. Li, Y. Shan, J. Zhou, and C. Dong, “DeSRA: Detect and delete the artifacts of GAN-based real-world super-resolution models,” arXiv preprint arXiv:2307.02457, 2023.
  • [30] S. H. Park, Y. S. Moon, and N. I. Cho, “Perception-oriented single image super-resolution using optimal objective estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1725–1735.
  • [31] X. Luo, Y. Zhu, S. Xu, and D. Liu, “On the effectiveness of spectral discriminators for perceptual quality improvement,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13243–13253.
  • [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [34] N. Srinivas and K. Deb, “Muiltiobjective optimization using nondominated sorting in genetic algorithms,” Evolutionary computation, vol. 2, no. 3, pp. 221–248, 1994.
  • [35] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [36] Q. Zhang and H. Li, “MOEA/D: A multiobjective evolutionary algorithm based on decomposition,” IEEE Transactions on evolutionary computation, vol. 11, no. 6, pp. 712–731, 2007.
  • [37] Y. Wang, L. Wang, H. Wang, P. Li, and H. Lu, “Blind single image super-resolution with a mixture of deep networks,” Pattern Recognition, vol. 102, p. 107169, 2020.
  • [38] J. Liang, H. Zeng, and L. Zhang, “Efficient and degradation-adaptive network for real-world image super-resolution,” in European Conference on Computer Vision, 2022.
  • [39] X. Wang, K. Yu, C. Dong, X. Tang, and C. C. Loy, “Deep network interpolation for continuous imagery effect transition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1692–1701.
  • [40] M. Liu, J. Pan, Z. Yan, W. Zuo, and L. Zhang, “Adaptive network combination for single-image reflection removal: A domain generalization perspective,” arXiv preprint arXiv:2204.01505, 2022.
  • [41] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 136–144.
  • [42] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2472–2481.
  • [43] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using swin transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1833–1844.
  • [44] X. Zhang, H. Zeng, S. Guo, and L. Zhang, “Efficient long-range attention network for image super-resolution,” arXiv preprint arXiv:2203.06697, 2022.
  • [45] X. Chen, X. Wang, J. Zhou, Y. Qiao, and C. Dong, “Activating more pixels in image super-resolution transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22367–22377.
  • [46] Q. Cai, Y. Qian, J. Li, J. Lyu, Y.-H. Yang, F. Wu, and D. Zhang, “Hipa: Hierarchical patch transformer for single image super resolution,” IEEE Transactions on Image Processing, vol. 32, pp. 3226–3237, 2023.
  • [47] H. Huang, R. He, Z. Sun, and T. Tan, “Wavelet-SRNET: A wavelet-based CNN for multi-scale face super resolution,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1689–1697.
  • [48] C. Korkmaz, A. M. Tekalp, and Z. Dogan, “Training generative image super-resolution models by wavelet-domain losses enables better control of artifacts,” arXiv preprint arXiv:2402.19215, 2024.
  • [49] S. H. Park, Y. S. Moon, and N. I. Cho, “Flexible style image super-resolution using conditional objective,” IEEE Access, vol. 10, pp. 9774–9792, 2022.
  • [50] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, pp. 209–212, 2012.
  • [51] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, “Learning a no-reference quality metric for single-image super-resolution,” Computer Vision and Image Understanding, vol. 158, pp. 1–16, 2017.
  • [52] T. Higuchi, S. Tsutsui, and M. Yamamura, “Theoretical analysis of simplex crossover for real-coded genetic algorithms,” in International Conference on Parallel Problem Solving from Nature.   Springer, 2000, pp. 365–374.
  • [53] X. Zhou, A. K. Qin, M. Gong, and K. C. Tan, “A survey on evolutionary construction of deep neural networks,” IEEE Transactions on Evolutionary Computation, vol. 25, no. 5, pp. 894–912, 2021.
  • [54] R. Saravanan, S. Ramabalan, N. G. R. Ebenezer, and C. Dharmaraja, “Evolutionary multi criteria design optimization of robot grippers,” Applied Soft Computing, vol. 9, no. 1, pp. 159–172, 2009.
  • [55] H. Mala-Jetmarova, N. Sultanova, and D. Savic, “Lost in optimisation of water distribution systems? A literature review of system design,” Water, vol. 10, no. 3, p. 307, 2018.
  • [56] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single path one-shot neural architecture search with uniform sampling,” in European conference on computer vision.   Springer, 2020, pp. 544–560.
  • [57] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, “BigNAS: Scaling up neural architecture search with big single-stage models,” in European Conference on Computer Vision.   Springer, 2020, pp. 702–717.
  • [58] G. Ying, X. He, B. Gao, B. Han, and X. Chu, “EAGAN: Efficient two-stage evolutionary architecture search for GANs,” in European Conference on Computer Vision.   Springer, 2022, pp. 37–53.
  • [59] Y. Zhou, G. G. Yen, and Z. Yi, “A knee-guided evolutionary algorithm for compressing deep neural networks,” IEEE transactions on cybernetics, vol. 51, no. 3, pp. 1626–1638, 2019.
  • [60] X. Cui, W. Zhang, Z. Tüske, and M. Picheny, “Evolutionary stochastic gradient descent for optimization of deep neural networks,” Advances in neural information processing systems, vol. 31, 2018.
  • [61] H. Zhang, K. Hao, L. Gao, B. Wei, and X. Tang, “Optimizing deep neural networks through neuroevolution with stochastic gradient descent,” IEEE Transactions on Cognitive and Developmental Systems, 2022.
  • [62] S. Yang, Y. Tian, C. He, X. Zhang, K. C. Tan, and Y. Jin, “A gradient-guided evolutionary approach to training deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • [63] Y. Tian, H. Chen, H. Ma, X. Zhang, K. C. Tan, and Y. Jin, “Integrating conjugate gradients into evolutionary algorithms for large-scale continuous multi-objective optimization,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 10, pp. 1801–1817, 2022.
  • [64] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks,” in International conference on machine learning.   PMLR, 2018, pp. 794–803.
  • [65] K. Miettinen, Nonlinear multi-objective optimization.   Springer Science, 2012, vol. 12.
  • [66] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, “Dynamic convolution: Attention over convolution kernels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11030–11039.
  • [67] E. Agustsson and R. Timofte, “NTIRE 2017 challenge on single image super-resolution: Dataset and study,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 126–135.
  • [68] R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, and L. Zhang, “Ntire 2017 challenge on single image super-resolution: Methods and results,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 114–125.
  • [69] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1905–1914.
  • [70] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012.
  • [71] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in International conference on curves and surfaces.   Springer, 2010, pp. 711–730.
  • [72] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2.   IEEE, 2001, pp. 416–423.
  • [73] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5197–5206.
  • [74] Y. Matsui, K. Ito, Y. Aramaki, A. Fujimoto, T. Ogawa, T. Yamasaki, and K. Aizawa, “Sketch-based manga retrieval using Manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, pp. 21811–21838, 2017.
  • [75] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, “Image quality assessment: Unifying structure and texture similarity,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 5, pp. 2567–2581, 2020.
  • [76] J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang, “Toward real-world single image super-resolution: A new benchmark and a new model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [77] P. Wei, Z. Xie, H. Lu, Z. Zhan, Q. Ye, W. Zuo, and L. Lin, “Component divide-and-conquer for real-world image super-resolution,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16.   Springer, 2020, pp. 101–117.
  • [78] C. Ma, Y. Rao, Y. Cheng, C. Chen, J. Lu, and J. Zhou, “Structure-preserving super resolution with gradient guidance,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7769–7778.
  • [79] J. Park, S. Son, and K. M. Lee, “Content-aware local GAN for photo-realistic super-resolution,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10585–10594.
  • [80] W. Wang, R. Guo, Y. Tian, and W. Yang, “CFSNet: Toward a controllable feature space for image restoration,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [81] S. Vasu, N. Thekke Madam, and A. Rajagopalan, “Analyzing perception-distortion tradeoff using enhanced perceptual super-resolution network,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0–0.
  • [82] J. Gu, H. Lu, W. Zuo, and C. Dong, “Blind super-resolution with iterative kernel correction,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1604–1613.
  • [83] H. F. Ates, S. Yildirim, and B. K. Gunturk, “Deep learning-based blind image super-resolution with iterative kernel reconstruction and noise estimation,” Computer Vision and Image Understanding, vol. 233, p. 103718, 2023.
  • [84] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
  • [85] J. Shi, J. Sun, Q. Zhang, H. Zhang, and Y. Fan, “Improving pareto local search using cooperative parallelism strategies for multiobjective combinatorial optimization,” IEEE Transactions on Cybernetics, 2022.
  • [86] J. Shi, Q. Zhang, and J. Sun, “PPLS/D: Parallel pareto local search based on decomposition,” IEEE transactions on cybernetics, vol. 50, no. 3, pp. 1060–1071, 2018.