Residual-Conditioned Optimal Transport: Towards
Structure-Preserving Unpaired and Paired Image Restoration
Abstract
Deep learning-based image restoration methods generally struggle with faithfully preserving the structures of the original image. In this work, we propose a novel Residual-Conditioned Optimal Transport (RCOT) approach, which models image restoration as an optimal transport (OT) problem for both unpaired and paired settings, introducing the transport residual as a unique degradation-specific cue for both the transport cost and the transport map. Specifically, we first formalize a Fourier residual-guided OT objective by incorporating the degradation-specific information of the residual into the transport cost. We further design the transport map as a two-pass RCOT map that comprises a base model and a refinement process, in which the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. By duality, the RCOT problem is transformed into a minimax optimization problem, which can be solved by adversarially training neural networks. Extensive experiments on multiple restoration tasks show that RCOT achieves competitive performance in terms of both distortion measures and perceptual quality, restoring images with more faithful structures as compared with state-of-the-art methods.
1 Introduction

Image restoration is a fundamental low-level task of removing degradation (e.g., noise, down-scaling, rain, haze, blur) from a degraded image. Traditional methods (He et al., 2010; Sun et al., 2008; Zheng et al., 2023; Tang et al., 2022, 2023) focus on designing an optimization problem that exploits suitable priors of the clear image. Recently, significant advances in image restoration have been driven primarily by sophisticated network architectures (Liang et al., 2021; Zamir et al., 2021, 2022; Wang et al., 2022b; Zhou et al., 2023; Wang et al., 2023; Chen et al., 2023; Ren et al., 2016; Cui et al., 2023a) and efficient generative models, especially Generative Adversarial Networks (GANs) (Li et al., 2018b; Ledig et al., 2017; Zhao et al., 2020; Pan et al., 2020, 2021) and diffusion probabilistic models (DPMs) (Kawar et al., 2022; Luo et al., 2023; Zhu et al., 2023; Murata et al., 2023; Saharia et al., 2022b, a; Choi et al., 2023). Mainstream generation-based methods focus on how to effectively condition the generator on degraded images for high-fidelity restoration.
Classic restoration methods (Zamir et al., 2021, 2022; Potlapalli et al., 2023) utilize sophisticated networks to fit regression models that yield deterministic results by minimizing distortion measures (e.g., MSE, SSIM) w.r.t. the ground truth. However, due to the ill-posedness of image restoration, they often capture the “mean” of the posterior distribution of the high-quality data given the degraded data, yielding results with excessive smoothness and compromised structural details that may deviate from human perception (e.g., Restormer (Zamir et al., 2022) and PromptIR (Potlapalli et al., 2023) in Figure 1 (b)). In contrast, generation-based methods (Saharia et al., 2022b; Gao et al., 2023; Luo et al., 2023) regard restoration as a conditional distribution-modeling problem, which generally produces visually appealing results. However, they often use the degraded image as a conditional input without including specific information about the degradation, which can leave residual distortions and inaccurate structural details in the outputs (e.g., IR-SDE (Luo et al., 2023) in Figure 1 (b)). Therefore, faithfully preserving structures while minimizing distortion remains challenging.
To tackle the challenge, we propose to model image restoration as an optimal transport (OT) problem, in which we introduce a novel transport residual as a degradation-specific cue for both the transport cost and transport map. Specifically, we present a two-pass Residual-Conditioned Optimal Transport (RCOT) approach to realize a degradation-aware and structure-preserving OT map, which applies to both unpaired and paired data settings. The key idea is to incorporate the degradation-specific knowledge (from the residual or its embedding) into the transport cost, and more importantly, into the transport map via a two-pass process, in which the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. This conditioning mechanism enables the transport map to adjust its behaviors for multiple restoration tasks and restore images with better structural content (Figure 1 (b)).
In summary, our contributions mainly include:
- We model image restoration as an OT problem, in which we introduce a Fourier residual-guided OT objective, allowing us to incorporate degradation-specific knowledge into the transport cost. We further deduce a minimax dual formulation for the OT model.
- We propose a two-pass RCOT approach, which conditions the transport map on the residual embedding. This conditioning mechanism dynamically injects degradation-specific knowledge from the residual embedding into the restoration operator, i.e., the RCOT map, enhancing its capability to preserve image structure.
- Extensive experiments on multiple tasks (image denoising, super-resolution, deraining, and dehazing) on both synthetic and real-world datasets show the effectiveness of our method in terms of both distortion measures and perceptual quality. In particular, our method restores images with more faithful structural details than existing approaches.

2 Background
Image restoration. Classic state-of-the-art restoration methods focus on minimizing distortion (e.g., MSE). They are primarily driven by efficient network architectures (Liang et al., 2021; Zamir et al., 2021, 2022; Wang et al., 2022b; Zhou et al., 2023) and frame restoration as regression, producing deterministic results. However, they rely on large amounts of paired training data and are prone to “mean” results with compromised perceptual quality.
Differently, deep generative models have made much progress in generating perceptually realistic images. The early mainstream methods are built on the conditional generative adversarial network (CGAN) (Mirza & Osindero, 2014), in which the restoration problem is treated as a conditional generation problem. RoCGAN (Chrysos et al., 2020) utilizes high-quality data for supervised training of restoration from degraded inputs. AmbientGAN (Bora et al., 2018) generates clean images from noisy input, assuming the degradation satisfies certain conditions. Recently, a promising avenue is the adoption of diffusion models. Saharia et al. (2022b) utilize the degraded images as conditions to train diffusion models. Kawar et al. (2021, 2022) and Chung et al. (2023) operate under the assumption that the degradation and its parameters are known at test time. Gao et al. (2023) propose a scale-adaptive condition on the LR image for high-fidelity super-resolution. Luo et al. (2023) propose a maximum likelihood-based loss function to train a mean-reverting score-based model. However, these methods more or less omit degradation-specific knowledge, which limits their ability to produce faithful structures while minimizing distortion.
Intuitively, generation-based methods seek an efficient map that transforms the low-quality image distribution $\mu$ into the high-quality image distribution $\nu$. The key issue lies in the ambiguity that infinitely many maps may satisfy this constraint, and prior knowledge is required to determine which map is optimal. In this paper, we model image restoration from an OT perspective, seeking the most efficient map that transports $\mu$ to $\nu$ with minimal transport cost.
Optimal transport and its applications in restoration. The OT problem seeks the optimal transport map (Monge problem (Monge, 1781), a.k.a. MP) or transport plan (Kantorovich problem (Kantorovich, 1942), a.k.a. KP) that transforms a distribution $\mu \in \mathcal{P}(X)$ into another distribution $\nu \in \mathcal{P}(Y)$ with minimal transport cost, where $\mathcal{P}(X)$ and $\mathcal{P}(Y)$ respectively denote the sets of probability distributions on the Polish spaces $X$ and $Y$. Formally, the Monge and Kantorovich problems can be stated as follows:
$$\inf_{T:\, T_{\#}\mu = \nu} \int_X c\big(x, T(x)\big)\, d\mu(x), \qquad (1)$$
$$\inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times Y} c(x, y)\, d\pi(x, y), \qquad (2)$$
where $c(x, y)$ measures the transport cost between two samples. In the MP (1), the map attaining the infimum is called the optimal transport map; the infimum is taken over all transport maps $T: X \to Y$ satisfying $T_{\#}\mu = \nu$. This constraint means that $T$ pushes forward the probabilistic mass of $\mu$ to $\nu$, where $\#$ is the push-forward operator. In the KP (2), the coupling attaining the minimum is called the optimal transport plan; the minimum is taken over all transport plans $\pi \in \Pi(\mu, \nu)$, i.e., distributions on $X \times Y$ whose marginals are $\mu$ and $\nu$.
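To make the two formulations concrete, the following toy example (ours, not from the paper) computes a discrete Monge map: for two uniform empirical distributions with the same number of points, the MP reduces to a linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(5, 2))   # samples of mu (degraded domain)
Y = rng.normal(3.0, 1.0, size=(5, 2))   # samples of nu (target domain)

# Pairwise transport cost c(x, y) = ||x - y||^2.
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)

row, col = linear_sum_assignment(C)      # optimal one-to-one Monge map
print("x_i is transported to y_sigma(i), sigma =", col)
print("total transport cost:", C[row, col].sum())
```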
Recently, many attempts have been made to cast translation/restoration as an OT problem. In this context, $\mu$ and $\nu$ represent the degraded distribution and the target distribution, respectively. Gu et al. (2023) propose to use optimal transport to guide the training of a conditional score-based diffusion model for super-resolution and translation. Korotin et al. (2023b, a) compute optimal transport maps and plans with neural networks under the duality framework and apply their method to unpaired image translation, using a generic distance as the transport cost $c$. Wang et al. (2022a) relax the transport constraint in the Monge formulation with a Wasserstein-1 discrepancy penalty between the target distribution $\nu$ and the push-forward distribution $T_{\#}\mu$; likewise, they empirically choose a generic cost for the denoising task. These methods pioneer the modeling of image translation/restoration as an OT problem, but their performance across multiple restoration tasks is limited without proper prior knowledge about the correspondence between $\mu$ and $\nu$.
Different from the aforementioned studies, our RCOT customizes image restoration as an OT problem, crafting both the transport cost and the transport map through the integration of the transport residual. This innovation leads to a degradation-aware and structure-preserving transport map, i.e., the RCOT map.
3 Method
The key idea of RCOT is to introduce the transport residual as a degradation-specific cue for both the transport cost and the transport map. We first model image restoration as an OT problem, exploiting the frequency knowledge of the residual, yielding the Fourier residual-guided OT (FROT) objective (section 3.1). Secondly, and most crucially, we integrate the degradation-specific knowledge from the residual embedding into the transport map via a two-pass process (section 3.2), in which the transport residual is computed by the base model in the first pass and then encoded as a residual embedding to condition the second-pass restoration (see Figure 2). In section 3.3, we present the learning algorithm for the two-pass RCOT map, adversarially training two neural networks to solve the minimax optimization problem in both unpaired and paired settings.
3.1 Residual-guided OT Formulation for Restoration

We first discuss formalizing image restoration as an OT problem, which applies to both unpaired and paired data. For ease of understanding, we elaborate our method in the unpaired data setting in sections 3.1 and 3.2, and extend to the paired case in section 3.3. We denote the domains of degraded images and target images by $X$ and $Y$, whose distributions are $\mu$ and $\nu$, respectively. The Kantorovich (Kantorovich, 1942) form of the OT cost can then be defined by (2). However, this formulation does not consider prior knowledge of the degradation, limiting its applicability to multiple restoration tasks. To this end, we suggest the FROT objective by introducing a penalty on the degradation domain gap (i.e., the transport residual $y - x$) into the transport cost, leading to $c(x, y) = \|x - y\|_1 + \lambda\, \phi\big(\mathcal{F}(y - x)\big)$, where $\mathcal{F}$ denotes the Fourier transform, $\phi$ is a task-dependent regularizer, and $\lambda > 0$ is a trade-off weight. The FROT objective is then defined as
$$\mathrm{FROT}(\mu, \nu) := \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times Y} \Big[\, \|x - y\|_1 + \lambda\, \phi\big(\mathcal{F}(y - x)\big) \Big]\, d\pi(x, y). \qquad (3)$$
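The following PyTorch sketch illustrates how such a cost could be computed. The function name, the weight `lam`, and the exact norms are our illustrative reconstruction under the regularizer choices described next, not the paper's released code.

```python
import torch

def frot_cost(x, y, task="derain", lam=0.1):
    """Sketch of c(x, y) = ||x - y||_1 + lam * phi(F(y - x)) from Eq. (3).

    `lam` and the exact norms are illustrative assumptions: an l1 (sparsity)
    penalty on the Fourier residual for rain/haze/low-resolution, and a
    squared l2 penalty for Gaussian noise.
    """
    r = y - x                               # transport residual
    base = r.abs().mean()                   # l1 base distance
    R = torch.fft.fft2(r, norm="ortho")     # Fourier residual
    if task == "denoise":
        reg = R.abs().pow(2).mean()         # smooth, Gaussian-like spectrum
    else:
        reg = R.abs().mean()                # sparse spectrum (rain/haze/SR)
    return base + lam * reg
```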
Figure 3 presents the transport residuals for four types of degradation (noise, rain, haze, and low resolution) along with their Fourier signal histograms. These histograms indicate that for degradations like rain, haze, and low resolution, the residuals tend to be sparse in the frequency domain; for noise, the histograms exhibit a smoother profile resembling a Gaussian distribution. Based on this observation, we adopt the sparsity-promoting $\ell_1$ regularizer on the Fourier residuals for the deraining, dehazing, and super-resolution tasks, i.e., $\phi(\cdot) = \|\cdot\|_1$, and the squared $\ell_2$ regularizer for the denoising task. Our objective is to find the OT map that attains the infimum of Monge’s formulation (1) under this cost $c$. The duality of (3) leads to a more manageable approach. According to (Villani et al., 2009), (3) takes the following dual form:
$$\mathrm{FROT}(\mu, \nu) = \sup_{f} \left[ \int_X f^{c}(x)\, d\mu(x) + \int_Y f(y)\, d\nu(y) \right],$$
where $f^{c}(x) := \inf_{y \in Y}\big[c(x, y) - f(y)\big]$ is the $c$-transform of the potential $f$. Replacing the optimization of the first term over targets $y$ with an equivalent optimization (Rockafellar interchange theorem (Rockafellar, 1976), Theorem 3A) over the map of interest $T: X \to Y$, we obtain the minimax reformulation of the dual form:
$$\sup_{f}\, \inf_{T}\; \mathcal{L}(f, T) := \int_X \Big[ c\big(x, T(x)\big) - f\big(T(x)\big) \Big]\, d\mu(x) + \int_Y f(y)\, d\nu(y), \qquad (4)$$
where $T(x) - x$ represents the transport degradation domain gap (termed the transport residual). Now we show that tackling this minimax problem provides the OT map.
Proposition 3.1.
(Saddle points of FROT provide OT maps). For any optimal potential function $f^*$, it holds for the Monge OT map $T^*$ that
$$T^* \in \operatorname*{arg\,inf}_{T:\, X \to Y} \int_X \Big[ c\big(x, T(x)\big) - f^*\big(T(x)\big) \Big]\, d\mu(x). \qquad (5)$$
The proof is given in Appendix A. Proposition 3.1 affirms the feasibility of solving the minimax problem (4) to acquire an optimal pair $(f^*, T^*)$ constituting an OT map from $\mu$ to $\nu$. For general $\mu$ and $\nu$, given some optimal potential $f^*$, the set $\arg\inf_T \mathcal{L}(f^*, T)$ may encompass not only the OT map but also other saddle points, which nevertheless deliver decent performance in our experiments (section 4). To tackle the minimax problem (4), we approximate the map and potential with neural networks $T_\theta$ and $f_\omega$.
3.2 Two-pass Residual-Conditioned OT Map
After parameterizing the transport map with a neural network $T_\theta$, to better preserve image structures under different degradations, we are motivated to incorporate degradation-specific and structural knowledge into the map. Intuitively, the transport residual contains degradation-specific knowledge (e.g., degradation type and level) and structural information (as empirically shown in Appendix I). We therefore utilize the transport residual as an additional condition for the transport map. To achieve this, we develop a two-pass RCOT map (Figure 2) based on a transport residual condition (TRC) module, in which the transport residual is computed by the base model in the first pass and then encoded as a degradation-specific embedding to condition the second-pass restoration. Correspondingly, we introduce an image generator $G$ to produce the restored image and a residual encoder $E$ to control the restoration with the residual embedding as a condition; $G$ and $E$ together constitute the map $T_\theta$.
Two-pass transport map. Given a degraded image $x$, the first pass unconditionally generates an intermediate restored result $\bar{y} = G(x)$ via the image generator and calculates the corresponding intermediate transport residual $\bar{r} = \bar{y} - x$. The second pass extracts the residual embedding $E(\bar{r})$ via the residual encoder and uses it as a condition for the image generator, which then restores a refined result $\hat{y}$. This transport process is summarized as
$$\bar{y} = G(x), \qquad \bar{r} = \bar{y} - x, \qquad \hat{y} = T_\theta(x) := G\big(x, E(\bar{r})\big). \qquad (6)$$
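A minimal PyTorch sketch of this two-pass map follows; the module and argument names (`generator`, `res_encoder`, `cond`) are our assumptions, and the real architecture follows Figure 2.

```python
import torch.nn as nn

class RCOTMap(nn.Module):
    """Sketch of the two-pass RCOT map in Eq. (6). `generator` is any
    restoration backbone G(x, cond=...) accepting an optional embedding,
    and `res_encoder` is the residual encoder E; neither name matches the
    paper's exact implementation."""
    def __init__(self, generator: nn.Module, res_encoder: nn.Module):
        super().__init__()
        self.generator = generator
        self.res_encoder = res_encoder

    def forward(self, x):
        y_bar = self.generator(x, cond=None)    # first pass: base restoration
        r_bar = y_bar - x                       # intermediate transport residual
        e = self.res_encoder(r_bar)             # degradation-specific embedding
        return self.generator(x, cond=e)        # second pass: refined result
```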
Transport residual condition (TRC). The TRC module is intended to compensate for degradation-specific information that is lost through the degradation. Specifically, it consists of two key components: a residual embedding generation module (REGM) and a condition integration module. In REGM, the estimated transport residual $\bar{r}$ from the first pass is projected by the residual encoder $E$ to a residual embedding, which is then utilized as a degradation-specific condition for the second-pass restoration. In the condition integration module, we employ the cross-stage feature fusion of (Zamir et al., 2021) to integrate the features from the image generator with the degradation-specific embedding for structure-preserving restoration.
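For illustration only, a generic conditioning block in the spirit of feature modulation is sketched below; the paper's actual integration is the cross-stage feature fusion of MPRNet (Zamir et al., 2021), so this scale-and-shift form is an assumed stand-in.

```python
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Assumed stand-in for the condition integration module: modulates
    generator features `feat` with a spatial residual embedding `e` of
    matching resolution (an assumption of this sketch)."""
    def __init__(self, feat_ch: int, emb_ch: int):
        super().__init__()
        self.to_scale = nn.Conv2d(emb_ch, feat_ch, kernel_size=1)
        self.to_shift = nn.Conv2d(emb_ch, feat_ch, kernel_size=1)

    def forward(self, feat, e):
        # Per-channel, per-pixel scale and shift driven by the embedding.
        return feat * (1 + self.to_scale(e)) + self.to_shift(e)
```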
3.3 Overall Training
With the parameterization $T_\theta$ and $f_\omega$, the optimization objective of (4) can be written as
$$\mathcal{L}(\omega, \theta) := \mathbb{E}_{x \sim \mu}\Big[ c\big(x, T_\theta(x)\big) - f_\omega\big(T_\theta(x)\big) \Big] + \mathbb{E}_{y \sim \nu}\big[ f_\omega(y) \big]. \qquad (7)$$
For the unpaired setting, we train the networks $T_\theta$ and $f_\omega$ by respectively minimizing and maximizing $\mathcal{L}(\omega, \theta)$, i.e., solving $\sup_{\omega} \inf_{\theta} \mathcal{L}(\omega, \theta)$. This is achieved by adversarially training $T_\theta$ and $f_\omega$, estimating the expectations with mini-batch data in each training step.
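A training-loop sketch of this adversarial scheme, reusing the `frot_cost` and `RCOTMap` sketches above, is shown next; the optimizer settings and step counts are placeholders, not the paper's values.

```python
import torch

def train_unpaired(T, f, loader_x, loader_y, task="derain", k=1,
                   lr=1e-4, n_steps=10_000):
    """Adversarial sketch of sup_f inf_T L(f, T) in Eq. (4)."""
    opt_T = torch.optim.RMSprop(T.parameters(), lr=lr)
    opt_f = torch.optim.RMSprop(f.parameters(), lr=lr)
    for step, (x, y) in enumerate(zip(loader_x, loader_y)):
        if step >= n_steps:
            break
        for _ in range(k):                            # k inner steps on T
            Tx = T(x)
            loss_T = frot_cost(x, Tx, task=task) - f(Tx).mean()
            opt_T.zero_grad(); loss_T.backward(); opt_T.step()
        # Outer ascent on f; the cost term c(x, T(x)) is constant w.r.t. f.
        loss_f = f(T(x).detach()).mean() - f(y).mean()
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()
```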
(Partially) Paired Setting. In practice, datasets often include a fraction of paired samples. For this partially paired setting, we can leverage the paired samples to enforce $T_\theta(x)$ to approximate the target $y$ for any pair $(x, y)$ with a squared loss (where $\mathcal{D}_p$ denotes the paired subset of the data):
$$\mathcal{L}_{\mathrm{sup}}(\theta) := \mathbb{E}_{(x, y) \sim \mathcal{D}_p}\, \big\| T_\theta(x) - y \big\|^2. \qquad (8)$$
Consequently, the training objective for the partially paired setting is $\sup_{\omega} \inf_{\theta} \big[ \mathcal{L}(\omega, \theta) + \gamma\, \mathcal{L}_{\mathrm{sup}}(\theta) \big]$, where $\gamma$ is a trade-off parameter (Appendix E). The overall algorithm is detailed in Algorithm 1 in Appendix B.
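In code, the paired term simply augments the inner objective of the loop above; `gamma` below stands for the trade-off parameter studied in Appendix E, whose tuned value we do not assume here.

```python
def paired_objective(T, f, x, y, task="denoise", gamma=1.0):
    """Inner objective for T in the partially paired setting: the transport
    term of Eq. (7) plus gamma times the supervised loss of Eq. (8)."""
    Tx = T(x)
    transport = frot_cost(x, Tx, task=task) - f(Tx).mean()
    supervised = ((Tx - y) ** 2).mean()
    return transport + gamma * supervised
```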
4 Experiments
We evaluate the proposed RCOT on benchmark datasets for four representative image restoration tasks: image denoising, deraining, dehazing, and super-resolution (SR). In the tables, the best and second-best quality metrics (PSNR/SSIM for measuring pixel/structure similarity, and LPIPS (Zhang et al., 2018)/FID (Heusel et al., 2017) for measuring perceptual deviation) are highlighted and underlined. Implementation details and the compared methods are introduced in Appendix C. Extra evaluations (e.g., parameter quantity, cross-dataset and preliminary multiple-in-one comparisons) are included in Appendix G.
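For reference, the distortion metrics can be computed as in the sketch below (scikit-image ≥ 0.19 is assumed for `channel_axis`); LPIPS and FID are computed with their reference implementations and are omitted here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def psnr_ssim(restored: np.ndarray, target: np.ndarray):
    """PSNR/SSIM for HxWx3 float images in [0, 1]."""
    psnr = peak_signal_noise_ratio(target, restored, data_range=1.0)
    ssim = structural_similarity(target, restored,
                                 channel_axis=-1, data_range=1.0)
    return psnr, ssim
```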
Method (PSNR↑/SSIM↑/LPIPS↓/FID↓) | Kodak24 (Franzen, 1999), lower noise level | Kodak24, higher noise level | CBSD68 (Martin et al., 2001), lower noise level | CBSD68, higher noise level
---|---|---|---|---
NOT∗ (Korotin et al., 2023b) | 29.13/0.786/0.131/70.17 | 27.12/0.725/0.227/104.5 | 29.76/0.802/0.119/73.68 | 26.82/0.723/0.212/113.3 |
OTUR∗ (Wang et al., 2022a) | 31.05/0.848/0.104/55.74 | 28.03/0.744/0.178/83.27 | 30.27/0.833/0.095/67.74 | 27.36/0.733/0.169/93.17 |
MPRNet (Zamir et al., 2021) | 31.96/0.868/0.112/43.98 | 28.36/0.785/0.185/73.26 | 30.89/0.880/0.103/59.23 | 27.56/0.779/0.163/86.42 |
Restormer(Zamir et al., 2022) | 32.13/0.880/0.097/40.22 | 29.25/0.799/0.156/64.26 | 31.20/0.887/0.090/55.28 | 27.90/0.794/0.149/66.12 |
IR-SDE (Luo et al., 2023) | 31.40/0.842/0.080/45.56 | 28.03/0.721/0.134/83.66 | 30.46/0.856/0.075/57.30 | 26.98/0.737/0.138/96.40 |
RCD(Zhang et al., 2023) | 32.18/0.880/0.089/39.85 | 29.22/0.795/0.147/63.26 | 31.28/0.886/0.089/50.96 | 28.01/0.796/0.149/69.78 |
PromptIR (Potlapalli et al., 2023) | 32.25/0.883/0.091/30.91 | 29.19/0.799/0.154/60.42 | 31.31/0.888/0.085/45.45 | 28.03/0.797/0.143/63.98 |
RCOT∗ | 31.84/0.860/0.085/35.28 | 28.64/0.792/0.152/63.27 | 30.77/0.849/0.079/47.21 | 27.69/0.775/0.145/73.43 |
RCOT | 32.64/0.885/0.070/19.33 | 29.53/0.828/0.121/33.13 | 31.52/0.889/0.062/34.24 | 28.25/0.799/0.118/56.60 |
[Figure 4: visual comparison of Gaussian denoising results (PSNR/SSIM). Noisy: 14.60/0.256; NOT: 26.60/0.822; OTUR: 29.29/0.921; MPRNet: 29.77/0.940; Restormer: 31.32/0.955; IR-SDE: 29.52/0.921; PromptIR: 30.98/0.951; RCOT: 31.26/0.961.]
4.1 Results
Gaussian Image Denoising.
For Gaussian image denoising, we train the model on a combination of the BSD400 (Arbelaez et al., 2010) and WED (Ma et al., 2016) datasets. BSD400 comprises 400 training images, while WED consists of 4,744 images. Gaussian noise at each considered level is separately added to generate noisy images for training. We evaluate RCOT on the Kodak24 (Franzen, 1999) and CBSD68 (Martin et al., 2001) datasets under the same noise levels. Table 1 reports the PSNR/SSIM/LPIPS/FID scores of the compared methods. RCOT achieves the best performance under all metrics; in particular, the LPIPS and FID values of RCOT’s results are remarkably better than those of other methods. Figure 4 and Figure 13 in Appendix H.3 display visual examples. MPRNet, Restormer, and PromptIR (Zamir et al., 2021, 2022; Potlapalli et al., 2023), which directly minimize pixel-wise losses, perform well in distortion measures (PSNR and SSIM) but produce results with smoothed structures. The generation-based methods IR-SDE (Luo et al., 2023) and OTUR (Wang et al., 2022a) produce realistic structures but are limited in completely removing heavy noise. As a comparison, our RCOT reproduces a clean and sharp image with faithful textures.
Method | Synthetic Rain100L (Yang et al., 2017) | | | | Real-world SPANet (Wang et al., 2019) | | | |
 | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓) | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓)
NOT∗ (Korotin et al., 2023b) | 29.29 | 0.911 | 0.030 | 54.12 | 32.55 | 0.901 | 0.028 | 53.28 |
OTUR∗ (Wang et al., 2022a) | 33.71 | 0.954 | 0.027 | 36.64 | 39.23 | 0.961 | 0.017 | 29.64 |
MPRNet (Zamir et al., 2021) | 34.95 | 0.964 | 0.039 | 21.61 | 39.52 | 0.967 | 0.021 | 28.13 |
Restormer (Zamir et al., 2022) | 36.74 | 0.978 | 0.026 | 13.29 | 41.39 | 0.981 | 0.013 | 19.67 |
SFNet (Cui et al., 2023b) | 36.56 | 0.974 | 0.023 | 13.12 | 41.02 | 0.980 | 0.015 | 21.52 |
IR-SDE (Luo et al., 2023) | 36.94 | 0.978 | 0.014 | 9.52 | 42.56 | 0.987 | 0.009 | 16.25 |
PromptIR (Potlapalli et al., 2023) | 37.09 | 0.979 | 0.025 | 10.21 | 39.17 | 0.965 | 0.015 | 28.93 |
RCOT∗ | 36.22 | 0.972 | 0.019 | 12.59 | 41.05 | 0.977 | 0.014 | 18.93 |
RCOT | 37.27 | 0.980 | 0.015 | 7.97 | 43.77 | 0.993 | 0.008 | 9.52 |
[Figure 5: visual deraining comparisons (PSNR/SSIM). Example 1 — Rainy: 19.97/0.829; NOT: 24.87/0.876; OTUR: 29.23/0.966; MPRNet: 27.66/0.952; Restormer: 35.15/0.981; IR-SDE: 27.03/0.941; PromptIR: 30.69/0.937; RCOT: 35.40/0.989. Example 2 — Rainy: 21.96/0.729; NOT: 24.19/0.758; OTUR: 26.04/0.811; MPRNet: 28.83/0.859; Restormer: 32.14/0.928; IR-SDE: 32.52/0.931; PromptIR: 28.24/0.873; RCOT: 33.16/0.939.]
Method | Synthetic SOTS (outdoor) (Li et al., 2018a) | | | | Real-world O-HAZE (Ancuti et al., 2018) | | | |
 | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓) | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓)
NOT∗ (Korotin et al., 2023b) | 24.21 | 0.900 | 0.046 | 21.35 | 15.13 | 0.673 | 0.271 | 258.68 |
OTUR∗ (Wang et al., 2022a) | 26.36 | 0.953 | 0.024 | 18.96 | 16.43 | 0.719 | 0.209 | 248.74 |
MPRNet (Zamir et al., 2021) | 28.31 | 0.954 | 0.029 | 17.79 | 21.55 | 0.778 | 0.256 | 223.86 |
Restormer (Zamir et al., 2022) | 30.87 | 0.969 | 0.026 | 13.29 | 25.20 | 0.804 | 0.221 | 198.07 |
Dehazeformer (Song et al., 2023) | 31.45 | 0.978 | 0.021 | 15.54 | 25.56 | 0.812 | 0.209 | 199.35 |
IR-SDE (Luo et al., 2023) | 30.55 | 0.968 | 0.018 | 12.76 | 22.13 | 0.776 | 0.160 | 179.23 |
PromptIR (Potlapalli et al., 2023) | 31.31 | 0.973 | 0.021 | 16.28 | 25.27 | 0.813 | 0.216 | 217.66 |
RCOT∗ | 30.34 | 0.965 | 0.020 | 12.97 | 21.01 | 0.773 | 0.186 | 203.72 |
RCOT | 31.66 | 0.976 | 0.015 | 10.21 | 27.16 | 0.839 | 0.145 | 169.38 |
[Figure 6: visual dehazing comparisons (PSNR/SSIM). Example 1 — Hazy: 17.28/0.897; NOT: 26.65/0.949; OTUR: 27.70/0.985; MPRNet: 28.52/0.986; Restormer: 31.25/0.989; IR-SDE: 32.76/0.992; PromptIR: 33.01/0.991; RCOT: 33.21/0.993. Example 2 — Hazy: 16.78/0.704; NOT: 18.11/0.724; OTUR: 20.31/0.743; MPRNet: 23.61/0.810; Restormer: 25.51/0.781; IR-SDE: 23.14/0.775; PromptIR: 27.80/0.827; RCOT: 28.50/0.841.]
Image Deraining.
We evaluate RCOT on both the synthetic dataset Rain100L (Yang et al., 2017) and the real-world dataset SPANet (Wang et al., 2019). For Rain100L, we train the model with 13,712 paired clean-rain images collected from multiple datasets (Fu et al., 2017; Li et al., 2016; Yang et al., 2017; Zhang & Patel, 2018; Zhang et al., 2019a). The real-world SPANet dataset contains 27.5K paired rainy and rain-free images for training and 1,000 paired images for testing.
Table 2 reports the performance of the evaluated methods. The proposed RCOT achieves the best performance in both distortion and perceptual quality measures. The underlying reason should be that RCOT benefits from the FROT objective and the residual embedding, which exploit the information of the rain streaks. From the results shown in Figures 5 and 14, we observe that the results of NOT (Korotin et al., 2023b) are discolored. Restormer (Zamir et al., 2022) and PromptIR (Potlapalli et al., 2023) effectively remove the rain streaks, but the structural details are over-smoothed. IR-SDE (Luo et al., 2023) produces restored results that still exhibit a slight presence of rain streaks. RCOT restores rain-free images with better structural content.
Image Dehazing.
We evaluate RCOT on the synthetic SOTS (Li et al., 2018a) dataset, which contains 72,135 images for training and 500 images for testing, and on the real hazy O-HAZE (Ancuti et al., 2018) dataset, which contains 45 paired hazy and haze-free images collected at the same scenes under the same illumination conditions; 40 images are used for training and the remaining 5 for testing.
Table 3 reports the quantitative results. The proposed RCOT achieves the best overall performance. Notably, the unpaired RCOT also achieves promising performance, especially in terms of the LPIPS and FID values. From the results shown in Figures 6 and 12, we observe that NOT (Korotin et al., 2023b) and IR-SDE (Luo et al., 2023) produce sharp results but are not faithful in color and haze removal. Restormer, MPRNet, and PromptIR can remove the haze but produce results with distorted color. As a comparison, RCOT removes the haze while faithfully preserving the color.
Image Super-resolution.
Method | Bicubic | NOT∗ | OTUR∗ | Restormer | IR-SDE | IDM | LINF | RCOT∗ | RCOT
---|---|---|---|---|---|---|---|---|---
 | | (Korotin et al., 2023b) | (Wang et al., 2022a) | (Zamir et al., 2022) | (Luo et al., 2023) | (Gao et al., 2023) | (Yao et al., 2023) | |
PSNR | 26.70 | 25.73 | 24.88 | 27.90 | 26.89 | 27.59 | 27.33 | 26.78 | 28.41 |
SSIM | 0.771 | 0.718 | 0.679 | 0.796 | 0.775 | 0.785 | 0.769 | 0.758 | 0.804 |
LPIPS | 0.186 | 0.136 | 0.128 | 0.136 | 0.118 | 0.121 | 0.116 | 0.125 | 0.114 |
FID | 15.88 | 19.27 | 24.65 | 6.14 | 8.56 | 5.73 | 3.26 | 4.51 | 1.22 |
[Figure 7: visual ×4 super-resolution comparisons (PSNR/SSIM). Bicubic: 25.22/0.782; NOT: 24.76/0.768; OTUR: 24.34/0.744; Restormer: 25.54/0.772; IDM: 25.78/0.799; IR-SDE: 25.36/0.785; RCOT: 26.12/0.814.]
Method (PSNR↑/SSIM↑/FID↓) | SR on DIV2K | Deraining on Rain100L | Denoising on Kodak24 | Average
---|---|---|---|---
w/o TRC | 27.24/0.785/6.72 | 34.20/0.955/28.63 | 28.45/0.790/64.20 | 29.97/0.843/32.47 |
w/ TRC | 28.41/0.804/1.22 | 37.27/0.978/8.67 | 29.53/0.828/33.13 | 31.72/0.870/14.34 |
Loss | PSNR | SSIM | FID
---|---|---|---
$\mathcal{L}_{\mathrm{FROT}}$ (unpaired) | 27.60 | 0.772 | 69.43
supervised $\mathcal{L}_{\mathrm{sup}}$ | 27.69 | 0.779 | 78.69
$\mathcal{L}_{\mathrm{FROT}}$ + $\mathcal{L}_{\mathrm{sup}}$ | 28.25 | 0.799 | 56.60
For image super-resolution, we evaluate performance on the challenging DIV2K (Agustsson & Timofte, 2017) dataset, which consists of 800 pairs of (×4) LR and HR images for training and 100 pairs for testing.
The quantitative results in Table 4 show that the proposed RCOT outperforms the most recent generative methods, with significant gains in both distortion measures and perceptual quality. Notably, even in the unpaired setting, RCOT achieves the third-best FID score. The visual examples in Figures 7 and 15 show that the DPM-based methods (Luo et al., 2023; Gao et al., 2023) produce unnatural structures and the OT-based methods (Wang et al., 2022a; Korotin et al., 2023b) exhibit severe distortion. As a comparison, our RCOT restores a sharp image with more realistic details.
4.2 Ablation Study
We primarily showcase the ablation studies on the TRC module and the training loss components in the main body. Four additional ablation experiments are reported in Appendix D.
Importance of the TRC module. We validate the importance of the transport residual condition module on three tasks (SR on DIV2K (Agustsson & Timofte, 2017), deraining on Rain100L (Yang et al., 2017), and denoising on Kodak24 (Franzen, 1999)). We report the quantitative results with and without TRC in Table 5 and display visual comparisons in Figure 8. The proposed TRC module yields an average gain of 1.75 dB in PSNR over the basic model. With the TRC, the model restores images with better structures.
[Figure 8: visual ablation of the TRC module — Degraded, w/o TRC, w/ TRC, Ground truth.]
Effect of the loss components. To investigate the effect of the different loss components (the FROT loss and the supervised loss), we train the transport map under different loss functions and evaluate its denoising performance. Table 6 reports the results on CBSD68 (Martin et al., 2001) at the higher noise level of Table 1. When training under $\mathcal{L}_{\mathrm{FROT}}$ alone, the performance is close to that under the supervised loss. As a comparison, the training objective for the paired case, which integrates $\mathcal{L}_{\mathrm{FROT}}$ and the supervised $\mathcal{L}_{\mathrm{sup}}$, provides a significant gain to our model. These results validate the effectiveness of the proposed FROT objective in the unpaired setting and show the importance of integrating the supervised loss with $\mathcal{L}_{\mathrm{FROT}}$ when targets are available.
5 Conclusion
This paper proposed a novel Residual-Conditioned Optimal Transport (RCOT) approach to preserve perceptual structures while effectively removing the distortion, which treated image restoration as an OT problem and introduced the unique transport residual as a degradation-specific cue for both the transport cost and transport map. We first customized the Fourier residual-guided OT objective that exploits Fourier statistics of the degradation domain gap (represented by the residual). Based on FROT, we deduced a minimax problem that can be tackled to provide OT maps. Then we developed a two-pass RCOT map, in which the second pass generates refined results conditioned on the intermediate transport residual computed by the first pass for structure preservation. Extensive experiments demonstrated the effectiveness of RCOT for achieving competitive restoration performance, especially in terms of the ability to preserve structural content. In the future, we are also interested in developing an all-in-one RCOT framework for restoration by properly integrating domain knowledge.
Limitation and Future Research
In this work, based on the empirical observation of the degradation domain gap in the frequency domain, we utilize handcrafted priors to characterize the Fourier residual, which has been shown to be effective in the experiments. However, these priors may not be optimal for every degradation. Learning the regularizer automatically and adaptively will be explored in our future work. We also hope the method can be applied to all-in-one image restoration.
Impact Statement
This work aims to advance the field of machine learning with applications in image restoration. It may be valuable to research on optimal transport as a deep generative model working with low-quality data, and it raises no ethical concerns as far as we know.
Acknowledgement
This work was supported by National Key R&D Program 2021YFA1003002, and NSFC (12125104, U20B2075, 623B2084, 12326615).
References
- Agustsson & Timofte (2017) Agustsson, E. and Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 126–135, 2017.
- Ancuti et al. (2018) Ancuti, C., Ancuti, C., Timofte, R., and De Vleeschouwer, C. O-haze: a dehazing benchmark with real hazy and haze-free outdoor images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 754–762, 2018.
- Arbelaez et al. (2010) Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(5):898–916, 2010.
- Bora et al. (2018) Bora, A., Price, E., and Dimakis, A. G. Ambientgan: Generative models from lossy measurements. In International Conference on Machine Learning (ICML), 2018.
- Chen et al. (2023) Chen, X., Li, H., Li, M., and Pan, J. Learning a sparse transformer network for effective image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5896–5905, 2023.
- Choi et al. (2023) Choi, J., Park, Y., and Kang, M. Restoration based generative models. In International Conference on Machine Learning (ICML), pp. 5787–5816, 2023.
- Chrysos et al. (2020) Chrysos, G. G., Kossaifi, J., and Zafeiriou, S. Rocgan: Robust conditional gan. International Journal of Computer Vision (IJCV), 128:2665–2683, 2020.
- Chu et al. (2022) Chu, X., Chen, L., and Yu, W. Nafssr: Stereo image super-resolution using nafnet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1239–1248, June 2022.
- Chung et al. (2023) Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations (ICLR), 2023.
- Cui et al. (2023a) Cui, Y., Ren, W., Yang, S., Cao, X., and Knoll, A. Irnext: Rethinking convolutional network design for image restoration. In International Conference on Machine Learning (ICML), 2023a.
- Cui et al. (2023b) Cui, Y., Tao, Y., Bing, Z., Ren, W., Gao, X., Cao, X., Huang, K., and Knoll, A. Selective frequency network for image restoration. In The Eleventh International Conference on Learning Representations (ICLR), 2023b.
- Fan et al. (2019) Fan, Q., Chen, D., Yuan, L., Hua, G., Yu, N., and Chen, B. A general decoupled learning framework for parameterized image operators. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(1):33–47, 2019.
- Franzen (1999) Franzen, R. Kodak lossless true color image suite. Source: http://r0k.us/graphics/kodak, 4(2):9, 1999.
- Fu et al. (2017) Fu, X., Huang, J., Zeng, D., Huang, Y., Ding, X., and Paisley, J. Removing rain from single images via a deep detail network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3855–3863, 2017.
- Gao et al. (2023) Gao, S., Liu, X., Zeng, B., Xu, S., Li, Y., Luo, X., Liu, J., Zhen, X., and Zhang, B. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10021–10030, 2023.
- Gu et al. (2023) Gu, X., Yang, L., Sun, J., and Xu, Z. Optimal transport-guided conditional score-based diffusion model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- He et al. (2010) He, K., Sun, J., and Tang, X. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(12):2341–2353, 2010.
- Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017.
- Kantorovich (1942) Kantorovich, L. V. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pp. 199–201, 1942.
- Kawar et al. (2021) Kawar, B., Vaksman, G., and Elad, M. Snips: Solving noisy inverse problems stochastically. Advances in Neural Information Processing Systems (NeurIPS), 34:21757–21769, 2021.
- Kawar et al. (2022) Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Korotin et al. (2023a) Korotin, A., Selikhanovych, D., and Burnaev, E. Kernel neural optimal transport. In International Conference on Learning Representations (ICLR), 2023a. URL https://openreview.net/forum?id=Zuc_MHtUma4.
- Korotin et al. (2023b) Korotin, A., Selikhanovych, D., and Burnaev, E. Neural optimal transport. In International Conference on Learning Representations (ICLR), 2023b. URL https://openreview.net/forum?id=d8CBRlWNkqH.
- Ledig et al. (2017) Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A., Tejani, A., Totz, J., Wang, Z., et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4681–4690, 2017.
- Li et al. (2018a) Li, B., Ren, W., Fu, D., Tao, D., Feng, D., Zeng, W., and Wang, Z. Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing (TIP), 28(1):492–505, 2018a.
- Li et al. (2018b) Li, R., Pan, J., Li, Z., and Tang, J. Single image dehazing via conditional generative adversarial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8202–8211, 2018b.
- Li et al. (2016) Li, Y., Tan, R. T., Guo, X., Lu, J., and Brown, M. S. Rain streak removal using layer priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2736–2744, 2016.
- Liang et al. (2021) Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1833–1844, 2021.
- Luo et al. (2023) Luo, Z., Gustafsson, F. K., Zhao, Z., Sjölund, J., and Schön, T. B. Image restoration with mean-reverting stochastic differential equations. International Conference on Machine Learning (ICML), 2023.
- Ma et al. (2016) Ma, K., Duanmu, Z., Wu, Q., Wang, Z., Yong, H., Li, H., and Zhang, L. Waterloo exploration database: New challenges for image quality assessment models. IEEE Transactions on Image Processing (TIP), 26(2):1004–1016, 2016.
- Martin et al. (2001) Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pp. 416–423, 2001.
- Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
- Monge (1781) Monge, G. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pp. 666–704, 1781.
- Murata et al. (2023) Murata, N., Saito, K., Lai, C.-H., Takida, Y., Uesaka, T., Mitsufuji, Y., and Ermon, S. GibbsDDRM: A partially collapsed gibbs sampler for solving blind inverse problems with denoising diffusion restoration. In International Conference on Machine Learning (ICML), 2023.
- Pan et al. (2020) Pan, J., Dong, J., Liu, Y., Zhang, J., Ren, J., Tang, J., Tai, Y.-W., and Yang, M.-H. Physics-based generative adversarial models for image restoration and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(7):2449–2462, 2020.
- Pan et al. (2021) Pan, X., Zhan, X., Dai, B., Lin, D., Loy, C. C., and Luo, P. Exploiting deep generative prior for versatile image restoration and manipulation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 44(11):7474–7489, 2021.
- Potlapalli et al. (2023) Potlapalli, V., Zamir, S. W., Khan, S., and Khan, F. S. Promptir: Prompting for all-in-one blind image restoration. Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Ren et al. (2016) Ren, W., Liu, S., Zhang, H., Pan, J., Cao, X., and Yang, M. H. Single image dehazing via multi-scale convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–169, 2016.
- Rockafellar (1976) Rockafellar, R. Integral functionals, normal integrands and measurable selections. Nonlinear Operators and the Calculus of Variations, pp. 157–207, 1976.
- Saharia et al. (2022a) Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., and Norouzi, M. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–10, 2022a.
- Saharia et al. (2022b) Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(4):4713–4726, 2022b.
- Song et al. (2023) Song, Y., He, Z., Qian, H., and Du, X. Vision transformers for single image dehazing. IEEE Transactions on Image Processing (TIP), 32:1927–1941, 2023.
- Sun et al. (2008) Sun, J., Xu, Z., and Shum, H.-Y. Image super-resolution using gradient profile prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE, 2008.
- Tang et al. (2022) Tang, X., Zhao, X., Liu, J., Wang, J., Miao, Y., and Zeng, T. Uncertainty-aware unsupervised image deblurring with deep priors guided by domain knowledge. arXiv e-prints, 2022.
- Tang et al. (2023) Tang, X., Zhao, X., Liu, J., Wang, J., Miao, Y., and Zeng, T. Uncertainty-aware unsupervised image deblurring with deep residual prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9883–9892, 2023.
- Villani et al. (2009) Villani, C. et al. Optimal transport: old and new, volume 338. Springer, 2009.
- Wang et al. (2023) Wang, C., Pan, J., Wang, W., Dong, J., Wang, M., Ju, Y., and Chen, J. Promptrestorer: A prompting image restoration method with degradation perception. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Wang et al. (2019) Wang, T., Yang, X., Xu, K., Chen, S., Zhang, Q., and Lau, R. W. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- Wang et al. (2022a) Wang, W., Wen, F., Yan, Z., and Liu, P. Optimal transport for unsupervised denoising learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 45(2):2104–2118, 2022a.
- Wang et al. (2018) Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., and Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pp. 0–0, 2018.
- Wang et al. (2022b) Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., and Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17683–17693, 2022b.
- Wei et al. (2021) Wei, Y., Zhang, Z., Wang, Y., Xu, M., Yang, Y., Yan, S., and Wang, M. Deraincyclegan: Rain attentive cyclegan for single image deraining and rainmaking. IEEE Transactions on Image Processing, 30:4788–4801, 2021.
- Yang et al. (2017) Yang, W., Tan, R. T., Feng, J., Liu, J., Guo, Z., and Yan, S. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1357–1366, 2017.
- Yao et al. (2023) Yao, J.-E., Tsao, L.-Y., Lo, Y.-C., Tseng, R., Chang, C.-C., and Lee, C.-Y. Local implicit normalizing flow for arbitrary-scale image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Zamir et al. (2021) Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., Yang, M.-H., and Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Zamir et al. (2022) Zamir, S. W., Arora, A., Khan, S., Hayat, M., Khan, F. S., and Yang, M.-H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Zhang & Patel (2018) Zhang, H. and Patel, V. M. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 695–704, 2018.
- Zhang et al. (2019a) Zhang, H., Sindagi, V., and Patel, V. M. Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 30(11):3943–3956, 2019a.
- Zhang et al. (2018) Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 586–595, 2018.
- Zhang et al. (2019b) Zhang, W., Liu, Y., Dong, C., and Qiao, Y. Ranksrgan: Generative adversarial networks with ranker for image super-resolution. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3096–3105, 2019b.
- Zhang et al. (2023) Zhang, Z., Jiang, Y., Shao, W., Wang, X., Luo, P., Lin, K., and Gu, J. Real-time controllable denoising for image and video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14028–14038, June 2023.
- Zhao et al. (2020) Zhao, S., Cui, J., Sheng, Y., Dong, Y., Liang, X., Eric, I., Chang, C., and Xu, Y. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations (ICLR), 2020.
- Zheng et al. (2023) Zheng, H., Pang, T., and Ji, H. Unsupervised deep video denoising with untrained network. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 37, pp. 3651–3659, 2023.
- Zhou et al. (2023) Zhou, M., Huang, J., Guo, C.-L., and Li, C. Fourmer: an efficient global modeling paradigm for image restoration. In International Conference on Machine Learning (ICML), 2023.
- Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2223–2232, 2017.
- Zhu et al. (2023) Zhu, Y., Zhang, K., Liang, J., Cao, J., Wen, B., Timofte, R., and Van Gool, L. Denoising diffusion models for plug-and-play image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1219–1229, 2023.
Appendix
Appendix A Proof
Proposition A.1.
(Saddle points provide OT maps). For any optimal potential function $f^*$, it holds for the OT map $T^*$ (i.e., the transport map attaining the infimum of Monge’s formulation (1) under the FROT cost $c$) that
$$T^* \in \operatorname*{arg\,inf}_{T:\, X \to Y} \int_X \Big[ c\big(x, T(x)\big) - f^*\big(T(x)\big) \Big]\, d\mu(x). \qquad (9)$$
Proof.
First, we give Monge’s primal form of the FROT objective:
$$\mathrm{FROT}(\mu, \nu) = \inf_{T:\, T_{\#}\mu = \nu} \int_X c\big(x, T(x)\big)\, d\mu(x). \qquad (10)$$
For the OT map $T^*$, it holds that $T^*_{\#}\mu = \nu$; then, by the duality of (3), we find
$$\inf_{T} \mathcal{L}(f^*, T) = \mathrm{FROT}(\mu, \nu) = \int_X c\big(x, T^*(x)\big)\, d\mu(x). \qquad (11)$$
Using $T^*_{\#}\mu = \nu$ and the change of variables $y = T^*(x)$, we can derive
$$\int_Y f^*(y)\, d\nu(y) = \int_X f^*\big(T^*(x)\big)\, d\mu(x).$$
Substituting this equality into $\mathcal{L}(f^*, T^*)$ and comparing with (11), we obtain
$$\mathcal{L}(f^*, T^*) = \int_X c\big(x, T^*(x)\big)\, d\mu(x) = \inf_{T} \mathcal{L}(f^*, T),$$
i.e., $T^*$ attains the infimum in (9). ∎
Appendix B Algorithm
Input: degraded dataset $X \sim \mu$; high-quality dataset $Y \sim \nu$; transport network $T_\theta$;
potential network $f_\omega$; the number of iterations of $T_\theta$ per iteration of $f_\omega$: $k$;
Appendix C Implementation Details and Baselines
This section introduces the compared methods and detailed settings in our experiments.
Implementation details. We train separate models for different tasks using the RMSProp optimizer, with separate learning rates for the transport network $T_\theta$ and the potential network $f_\omega$. The inner iteration number $k$ is set to 1. The learning rate is decayed by a factor of 10 after 100 epochs. In all experiments, the transport network uses the backbone of MPRNet (Zamir et al., 2021). The residual encoder consists of two CNN down-sampling layers with residual channel attention blocks (RCAB) from RCAN (Zhang et al., 2018). In the FROT objective, the base distance is suggested to be the $\ell_1$-norm. During training, we crop patches of size 256×256 as input and use a batch size of 4. All experiments are conducted with the PyTorch framework on an NVIDIA 4090 GPU. For super-resolution, there is an extra preprocessing step: the LR images undergo bicubic rescaling to match the dimensions of their high-resolution counterparts. The source code will be released upon publication of our work.
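A minimal sketch of this preprocessing step, assuming 4-D `(N, C, H, W)` tensors:

```python
import torch.nn.functional as F

def upscale_lr(lr_batch, scale=4):
    """Bicubic pre-upscaling of LR inputs to HR size (the x4 DIV2K setting)."""
    return F.interpolate(lr_batch, scale_factor=scale,
                         mode="bicubic", align_corners=False)
```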
For the unpaired setting, although datasets containing paired labels are utilized for training, we randomly shuffle the targets and the degraded inputs to ensure the loss is isolated from paired information, which is a common strategy (Wang et al., 2022a; Korotin et al., 2023a, b) for unpaired training of restoration problems.
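A minimal sketch of this shuffling strategy on a paired mini-batch:

```python
import torch

def break_pairs(x: torch.Tensor, y: torch.Tensor):
    """Shuffle targets within the batch so (x, y) are no longer aligned."""
    return x, y[torch.randperm(y.size(0))]
```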
Representative compared methods. For image denoising, deraining, and dehazing, we choose four recent representative methods with state-of-the-art performance and two recent OT-based generative methods as competitors: MPRNet (Zamir et al., 2021), Restormer (Zamir et al., 2022), IR-SDE (Luo et al., 2023), PromptIR (Potlapalli et al., 2023), OTUR (Wang et al., 2022a), and NOT (Korotin et al., 2023b).
MPRNet (Zamir et al., 2021) and Restormer (Zamir et al., 2022) each specialize a backbone for restoration. Notably, Restormer (Zamir et al., 2022) designs an efficient Transformer model by making several key designs in the building blocks (multi-head attention and feed-forward network) such that it can capture long-range pixel interactions while remaining applicable to large images. It achieves state-of-the-art results on several image restoration tasks.
PromptIR (Potlapalli et al., 2023) utilizes learnable task prompts to incorporate degradation-specific knowledge into the Transformer model and achieve task-aware restoration. We train the PromptIR model in the single-task setting.
For image super-resolution, we choose the most recent state-of-the-art diffusion-based generative methods (Luo et al., 2023; Gao et al., 2023), the sophisticated Transformer-based method Restormer (Zamir et al., 2022), and the OT-based methods (Korotin et al., 2023b; Wang et al., 2022a) as competitors.
Appendix D Additional Ablation Studies
D.1 Versatility and generalizability beyond specific network designs
RCOT can be easily applied to different network architectures or frameworks. The two-pass TRC module is a plug-in module (Figure 2), allowing us to use any architecture as a base model to generate the restored result and then use this result to calculate the residual. We include a comparison on the Rain100L dataset (Yang et al., 2017) between MPRNet (Zamir et al., 2021), NAFNet (Chu et al., 2022), and Restormer (Zamir et al., 2022) and the corresponding variants equipped with the proposed TRC module.
The results in Table 7 show that the TRC module brings a meaningful boost to three SOTA network architectures, which validates its versatility and generalizability beyond specific network designs.
D.2 Effect of different conditions
In this paper, we treat the transport residual as a degradation-specific condition by encoding it into an embedding that adaptively enhances the representation for restoration. To better demonstrate the benefits, we also try conditioning directly on the restored image output by the base model. The average quantitative results over three tasks (SR, deraining, and denoising) are reported in Table 8, which sustains our claim.
Method | w/o condition | conditioned on restored image | conditioned on residual
---|---|---|---
PSNR | 29.97 | 30.39 | 31.72 |
SSIM | 0.843 | 0.848 | 0.870 |
FID | 32.47 | 26.52 | 14.34 |
D.3 Effect of the Fourier residual penalty in FROT objective
We investigate the effect of the proposed FROT cost on four restoration tasks (denoising on Kodak24 (Franzen, 1999), SR on DIV2K (Agustsson & Timofte, 2017), deraining on Rain100L (Yang et al., 2017), and dehazing on SOTS (Li et al., 2018a)). In Table 9, we compare the performance of models trained under the unpaired FROT cost (with the Fourier residual penalty) and the regular OT cost (without the penalty). The results show that the Fourier residual penalty, integrating degradation-specific knowledge into the transport cost, brings meaningful gains.
Task | Transport cost | PSNR | SSIM | FID
---|---|---|---|---
Denoising | w/o penalty | 28.37 | 0.780 | 69.12
 | w/ penalty | 28.64 | 0.792 | 63.27
SR | w/o penalty | 25.96 | 0.723 | 11.41
 | w/ penalty | 26.78 | 0.758 | 4.51
Deraining | w/o penalty | 35.69 | 0.945 | 20.13
 | w/ penalty | 36.22 | 0.972 | 12.59
Dehazing | w/o penalty | 29.72 | 0.953 | 15.56
 | w/ penalty | 30.34 | 0.965 | 12.57
D.4 Effect of different regularizers for the Fourier residual
We conduct an extra ablation study on the regularizer for the Fourier residual. The results are reported in Table 10.
Task | Regularizer (i) | Regularizer (ii) | Regularizer (iii, adopted)
---|---|---|---
SR | 26.82/0.756 | 26.49/0.747 | 26.78/0.758 |
Denoising | 28.52/0.782 | 28.79/0.795 | 28.64/0.792 |
Deraining | 36.01/0.968 | 35.80/0.948 | 36.22/0.972 |
Dehazing | 30.18/0.958 | 29.80/0.954 | 30.34/0.965 |
For the denoising task, the $\ell_2$ regularizer is more suitable for the Fourier residual: Gaussian noise yields an approximately Gaussian-distributed residual, whose negative log-prior is equivalent to $\ell_2$ regularization. The sparsity-promoting $\ell_1$ regularizer applies to the other degradations.
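This choice admits a standard maximum a posteriori reading, sketched below (our interpretation, not a derivation from the paper): a Gaussian prior on the Fourier residual $R$ yields a squared-$\ell_2$ penalty, while a Laplacian (sparsity-inducing) prior yields an $\ell_1$ penalty:
$$p(R) \propto e^{-\|R\|_2^2 / (2\sigma^2)} \;\Rightarrow\; -\log p(R) = \tfrac{1}{2\sigma^2}\|R\|_2^2 + \mathrm{const}, \qquad p(R) \propto e^{-\|R\|_1 / b} \;\Rightarrow\; -\log p(R) = \tfrac{1}{b}\|R\|_1 + \mathrm{const}.$$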
Appendix E Sensitivity to the Supervised Trade-off Parameter
We test the sensitivity of RCOT to the trade-off parameter $\gamma$ on the SR task. Figure 9 shows that beyond a moderate value of $\gamma$, RCOT’s performance approaches its limit. Interestingly, the distortion measures, i.e., PSNR and SSIM, level off as $\gamma$ increases further, whereas the FID score degrades if $\gamma$ keeps increasing. Based on this result, we fix $\gamma$ at a moderate value in our experiments.

Appendix F Training Cost Curves for Three tasks
In Figure 10, we display the training curves of the transport cost and the potential objective over three tasks (SR, deraining, and denoising). Each curve is normalized for display, and the potential cost is rescaled and negated.

Appendix G Additional Evaluations and Comparisons
Parameter quantity and computational complexity. We report the parameter quantity and computational complexity in Table 11, comparing with the two most recent SOTA methods IR-SDE and PromptIR. The results show that our approach is based on a relatively lightweight model, which further underscores its effectiveness.
Method | IR-SDE | PromptIR | RCOT |
---|---|---|---|
#Param | 36.2M | 33M | 14.2M |
FLOPs | 117G | 158G | 142G
Inference time | 9.09s | 3.25s | 1.06s |
Method | Real rainy SPANet (Wang et al., 2019) | | | | Real hazy O-HAZE (Ancuti et al., 2018) | | | |
 | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓) | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓)
NOT∗ (Korotin et al., 2023b) | 31.22 | 0.882 | 0.040 | 50.23 | 17.28 | 0.698 | 0.215 | 198.56 |
OTUR∗ (Wang et al., 2022a) | 36.52 | 0.932 | 0.025 | 33.21 | 17.45 | 0.712 | 0.202 | 199.78 |
MPRNet (Zamir et al., 2021) | 37.38 | 0.943 | 0.031 | 36.37 | 21.28 | 0.765 | 0.244 | 228.67 |
Restormer (Zamir et al., 2022) | 37.86 | 0.951 | 0.022 | 33.65 | 24.56 | 0.788 | 0.233 | 196.33 |
IR-SDE (Luo et al., 2023) | 38.29 | 0.966 | 0.013 | 19.52 | 24.53 | 0.796 | 0.169 | 186.44 |
PromptIR (Potlapalli et al., 2023) | 37.23 | 0.947 | 0.019 | 34.29 | 24.32 | 0.776 | 0.240 | 205.66 |
RCOT∗ | 37.37 | 0.948 | 0.015 | 21.20 | 22.35 | 0.775 | 0.178 | 189.72 |
RCOT | 40.02 | 0.972 | 0.009 | 15.66 | 26.59 | 0.827 | 0.148 | 165.12 |
Cross-domain evaluations. We conduct two representative cross-domain evaluations to compare the generalizability of RCOT and other methods. Specifically, we train the deraining models on the synthetic Rain13K dataset (Fu et al., 2017; Li et al., 2016; Yang et al., 2017; Zhang & Patel, 2018; Zhang et al., 2019a) and test them on the real rainy dataset SPANet (Wang et al., 2019); we train the dehazing models on the synthetic SOTS dataset and test them on the real hazy dataset O-HAZE (Ancuti et al., 2018). Table 12 reports the quantitative results. The compared methods generally exhibit declined performance in the presence of a domain gap, whereas RCOT still achieves notable performance under all metrics, which sustains its generalizability to real-world datasets.
Evaluation of a single model for multiple degradations. We are interested in extending RCOT to all-in-one restoration in the future, i.e., training a single model for multiple degradations. Since the Fourier residual regularizer in RCOT is task-specific, there is still a gap towards the all-in-one target. Nevertheless, the regularizers for deraining and dehazing are both set to sparsity. Hence we train our model on a combination of rainy data (the Rain100L training set) and noisy data (BSD400 and WED), and then test on the Kodak24 and Rain100L test sets. The quantitative results are reported in Table 14, with PromptIR (Potlapalli et al., 2023), the most recent representative all-in-one method, chosen as the competitor. The results show that RCOT exhibits a better capability of handling multiple degradations. The reason should be that our residual embedding contains richer degradation-specific knowledge (e.g., degradation type and level) than the learnable visual prompt in PromptIR (Potlapalli et al., 2023).
Method | Rain100L | Kodak24 |
---|---|---
PromptIR (Potlapalli et al., 2023) | 36.79/0.974 | 31.25/0.872 |
RCOT (ours) | 38.02/0.984 | 31.82/0.879 |
G.1 Additional comparison with GAN-based methods
Given that the potential network can be regarded as a discriminator, a comparison with GAN-based approaches provides valuable insight into the effectiveness and uniqueness of the proposed method for image restoration. We compare RCOT with recent GAN-based restoration methods for deraining (on synthetic Rain100L and real SPANet) and SR (on DIV2K).
Appendix H Additional Visual Results
H.1 Additional Comparison on Real Rainy Dataset
We evaluate the performance of the compared methods on the real rainy dataset SPANet (Wang et al., 2019). This dataset contains 27.5K paired rainy and rain-free images for training and 1,000 paired images for testing.
[Figure 11: visual comparisons on the real rainy SPANet dataset — Rainy, NOT, OTUR, MPRNet, Restormer, IR-SDE, PromptIR, RCOT.]
H.2 Additional Visual Comparison on Real Hazy Dataset
[Figure 12: visual comparisons on the real hazy O-HAZE dataset — Hazy, NOT, OTUR, MPRNet, Restormer, IR-SDE, PromptIR, RCOT.]
H.3 More visual examples
[Figure 13: additional denoising examples (PSNR/SSIM). Example 1 — Noisy: 14.75/0.271; NOT: 26.82/0.836; OTUR: 28.88/0.846; MPRNet: 30.70/0.946; Restormer: 32.04/0.960; IR-SDE: 30.18/0.903; PromptIR: 31.89/0.959; RCOT: 32.41/0.963. Example 2 — Noisy: 14.77/0.364; NOT: 21.97/0.608; OTUR: 25.29/0.840; MPRNet: 29.26/0.955; Restormer: 30.22/0.964; IR-SDE: 28.73/0.938; PromptIR: 30.18/0.965; RCOT: 30.61/0.967.]
[Figure 14: additional deraining examples (PSNR/SSIM). Example 1 — Rainy: 25.56/0.949; NOT: 28.56/0.897; OTUR: 29.70/0.968; MPRNet: 32.91/0.987; Restormer: 33.19/0.988; IR-SDE: 33.71/0.989; PromptIR: 33.22/0.988; RCOT: 34.20/0.990. Example 2 — Rainy: 31.23/0.967; NOT: 27.18/0.939; OTUR: 31.71/0.971; MPRNet: 34.96/0.981; Restormer: 35.20/0.986; IR-SDE: 36.12/0.988; PromptIR: 35.56/0.985; RCOT: 37.12/0.988.]
[Figure 15: additional super-resolution examples (PSNR/SSIM). Example 1 — LR: 20.20/0.543; NOT: 19.29/0.550; OTUR: 19.08/0.493; Restormer: 19.86/0.588; IDM: 19.57/0.538; IR-SDE: 19.80/0.570; RCOT: 20.75/0.621. Example 2 — LR: 24.86/0.743; NOT: 23.12/0.663; OTUR: 23.06/0.595; Restormer: 24.99/0.749; IDM: 23.96/0.711; IR-SDE: 23.82/0.727; RCOT: 25.62/0.793.]
Appendix I Visual examples of Transport Residuals
This section exhibits visual examples of the transport residual. We can observe that different degradations have unique degradation-specific transport residuals, which coincides with our motivation to utilize the residual embedding to encode degradation-specific knowledge (e.g., degradation type and level).



