
Spectral Normalization and Dual Contrastive Regularization for Image-to-Image Translation

Chen Zhao [email protected]    Wei-Ling Cai [email protected]    Zheng Yuan [email protected]

Department of Computer Science and Technology, Nanjing Normal University, Nanjing, China
Abstract

Existing image-to-image (I2I) translation methods achieve state-of-the-art performance by incorporating patch-wise contrastive learning into Generative Adversarial Networks. However, patch-wise contrastive learning only focuses on local content similarity and neglects global structure constraints, which affects the quality of the generated images. In this paper, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. To maintain consistency of the global structure and texture, we design a dual contrastive regularization that operates in two different deep feature spaces. To improve the global structure information of the generated images, we formulate a semantic contrastive loss that makes the global semantic structure of the generated images similar to that of the real images from the target domain in the semantic feature space. We use Gram matrices to extract the texture style from images and, analogously, design a style contrastive loss to improve the global texture information of the generated images. Moreover, to enhance the stability of the model, we employ a spectral normalized convolutional network in the design of our generator. We conduct comprehensive experiments to evaluate the effectiveness of SN-DCR, and the results show that our method achieves state-of-the-art performance on multiple tasks. The code and pretrained models are available at https://github.com/zhihefang/SN-DCR.

keywords:
image-to-image translation, contrastive learning, generative adversarial network

1 Introduction

Image-to-image translation (I2I) tasks aim to map an input image from the source domain to the target domain while retaining its original content and structure. In many I2I tasks it is impossible to collect paired training data, and the adversarial loss [10] alone therefore suffers from the collapse of content and structure. To alleviate this problem, many typical works [42] use a cycle consistency loss [42], which enforces consistency between the input images and the images reconstructed by an inverse mapping of the generated images. Motivated by the vision transformer (ViT), a recent cycle-consistency-based work [30] explored a ViT-based generator for unpaired image-to-image translation. However, cycle consistency requires that the mapping between the two domains be a bijection, which is too restrictive [23].

Figure 1: Visual comparison with all baselines on the Van Gogh\rightarrowPhoto dataset. The global structure of the images generated by previous methods is clearly corrupted, while our SN-DCR preserves the global structure information and generates photos with more natural details, performing better in terms of global structure and texture. Note that CUT and CycleGAN fail to generate a valid output for the first input.

Recently, inspired by the success of contrastive learning, CUT [23] proposed patch-wise contrastive learning to maximize the mutual information between corresponding locations of the input and generated images, introducing contrastive learning into I2I translation for the first time. DCLGAN [12] proposed a method based on patch-wise contrastive learning and a dual learning setting (using two generators and two discriminators). Although DCLGAN improves the quality of the generated images, an additional generator and discriminator need to be trained, which increases the training cost. F-LSeSim [41] utilized patch-wise contrastive learning by computing a learned self-similarity. However, it relies on VGG features to measure similarity, which reduces training efficiency. In previous studies, the queries for contrastive learning were selected randomly from the generated images, which is problematic since some locations contain little information from the source domain. Therefore, QS-Attn [16] designed a query-selected attention module that intentionally chooses significant anchors for patch-wise contrastive learning. Recently, a novel I2I method [28] based on style harmonization was proposed, which leverages two distinct styles: a class-aware memory style and an image-specific component style. Gou et al. [11] proposed multi-feature contrastive learning (MCL), which constructs a patch-wise contrastive loss using the feature information of the discriminator output layer for I2I tasks.

Previous works retain the content consistency of the generated images via patch-wise contrastive learning without any global regularization. However, patch-wise contrastive learning alone cannot effectively maintain the overall structure and texture of the images, as it only focuses on local content similarity and neglects the global structure constraint. This issue affects the quality of the generated images.

In this paper, we propose a new I2I translation framework based on dual contrastive regularization and spectral normalization (SN-DCR). To impose a global constraint on structure and texture in an unpaired manner, we formulate two new global contrastive loss functions that supplement the patch-wise contrastive loss, called dual contrastive regularization (DCR). DCR contains two parts: a semantic contrastive loss and a style contrastive loss. Specifically, the semantic contrastive loss is proposed to improve the global structure information of the generated images: it pulls the generated images toward the real images of the target domain (positives) in the semantic feature space while pushing them away from the real images of the source domain (negatives). For example, the semantic structure of a generated dog should be similar to that of a real dog in the semantic feature space. The style contrastive loss is developed to achieve global texture consistency, in which the Gram-based feature space [9] is adopted to represent the texture style.

Furthermore, it is well known that GAN training is unstable, with issues such as mode collapse and convergence difficulties. To alleviate these issues, we employ spectral normalization [21] in the design of our model, which enhances the stability of training. Moreover, to further boost the feature representation ability of our model, the Frequency Channel Attention Network (FCANet) [25] is introduced to improve translation performance. We demonstrate through ablation experiments that our designed generator is effective. To compute a more effective patch-wise contrastive loss, the QS-Attn module is also introduced in this paper. We conduct comprehensive experiments to evaluate the effectiveness of SN-DCR, and the results show that our method achieves state-of-the-art performance on multiple tasks.

In summary, the main contributions of our method are threefold:

  • We propose a novel unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. Our proposed SN-DCR focuses not only on local content similarity but also on the consistency of the global structure and style, enabling the generation of high-quality images with more natural structure and texture.

  • To improve the global information of the generated images, we design dual contrastive regularization (DCR), which achieves consistency of the global structure and texture. Our proposed DCR can be viewed as a universal regularization for enhancing the quality of generated images.

  • Experimental results compared with state-of-the-art methods clearly demonstrate that our SN-DCR outperforms prior unsupervised I2I translation approaches. Furthermore, we conduct comprehensive ablation experiments to scrutinize each of our contributions and prove the effectiveness of each component.

2 RELATED WORK

2.1 Image-to-Image Translation

GANs [10] [20] [19] have achieved great success, especially in I2I translation, where the key idea is the adversarial loss [10]. I2I translation can be categorized into two groups: the paired setting [18] [31] [22] (supervised) and the unpaired setting (unsupervised) [42] [36] [17] [1]. In the paired setting, each image from the source domain has a corresponding label image, so the task can be treated as a classic conditional GAN problem. To further improve the quality of the generated images, SPADE [22] introduces a spatially-adaptive normalization layer. However, it is difficult to obtain paired training data; as a result, current methods [42] are usually based on the unpaired setting and are developed under one assumption: cycle consistency. For example, CycleGAN [42], DualGAN [36] and MUNIT [17] train cross-domain GANs with a cycle-consistency loss. CycleGAN learns two mappings simultaneously, translating an image to the target domain and back while preserving the fidelity between the input and the reconstructed image. However, the cycle-consistency assumption that the two domains can be mapped in both directions is too strict. To alleviate this issue, many methods have tried to break cycle consistency. DistanceGAN [1] proposes a distance constraint that allows unsupervised domain mapping to be one-sided. GC-GAN [8] enforces geometry consistency as a constraint for unsupervised domain mapping. CUT [23] introduces patch-wise contrastive learning into I2I translation, which significantly improves translation quality. However, the patch-wise contrastive loss alone cannot effectively maintain the global structure and texture of the generated images.

2.2 Contrastive Learning

Recent studies on self-supervised learning [13] [5] [29] [38] show its strong ability to represent images without labels, particularly with the help of the contrastive loss [3] [6]. The idea is to perform instance-level discrimination and learn a feature embedding by pulling features from the same image together and pushing those from different images apart. The contrastive loss has also been utilized in several low-level vision tasks, achieving outstanding performance in applications such as style transfer [39], image generation [24], image smoothing [43], image super-resolution and deraining [34, 4]. PatchNCE [23] proposes patch-based contrastive learning, which uses a noise-contrastive estimation framework to learn the correspondence between patches of the input image and the corresponding patches of the generated image. It achieves excellent results, and recent methods [12] [41] [16] [32] [40] also obtain better performance by utilizing the idea of patch-wise contrastive learning. DCLGAN [12] proposes a method based on patch-wise contrastive learning and a dual learning setting (using two generators and two discriminators); although it improves the quality of the generated images, an additional generator and discriminator need to be trained, which increases the training cost. QS-Attn [16] designs a query-selected attention module that intentionally chooses significant anchors for patch-wise contrastive learning. MCL [11] constructs a patch-wise contrastive loss using the feature information of the discriminator output layer. In contrast to these methods, we mainly explore global regularization by introducing global contrastive losses. Therefore, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization to improve the global information of the generated images, which consistently shows better results.

Figure 2: Overall framework of our proposed SN-DCR. A cat (the input image) is translated by the generator G into a dog (the generated image). We introduce a dual contrastive regularization that combines semantic and style contrastive losses to pull the generated image closer to the real images of the target domain (positives) while pushing it away from the real images of the source domain (negatives).

3 OUR METHOD

3.1 Overall Framework

Given an input image $x \in \mathbb{R}^{H\times W\times 3}$ from the source domain $X$, our goal is to translate it into $G(x)$ in the target domain $Y$ via the adversarial loss, so that it has no apparent difference from a real image $y \in \mathbb{R}^{H\times W\times 3}$ from the domain $Y$. The framework of SN-DCR is shown in Fig. 2; it consists of a generator $G$ and a discriminator $D$. We divide the generator into two parts: the first part is defined as the encoder $E$, and the other part as the decoder. SN-DCR includes three loss functions: the adversarial loss, the patch-wise contrastive loss, and the dual contrastive regularization. To compute a more effective patch-wise contrastive loss, we introduce the QS-Attn (global) module in our proposed framework.

We expect the generated image $G(x)$ to be as similar as possible to a real image $y \in \mathbb{R}^{H\times W\times 3}$ from the domain $Y$, so that the generator $G$ learns a mapping from the domain $X$ to the domain $Y$ through the adversarial loss. The adversarial loss is defined as:

\mathcal{L}_{adv}=\mathbb{E}_{y\sim Y}\log D(\mathbf{y})+\mathbb{E}_{x\sim X}\log\left(1-D(G(\mathbf{x}))\right), \qquad (1)
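As an illustration, a minimal PyTorch sketch of Eq. (1) is given below. It assumes the discriminator outputs raw logits; the released implementation may use a different GAN loss variant (e.g., a least-squares objective), so treat this only as an approximation of the training code.

```python
# Sketch of the adversarial objective in Eq. (1), assuming D outputs logits.
import torch
import torch.nn.functional as F

def d_adversarial_loss(D, G, x, y):
    """Discriminator side: maximize log D(y) + log(1 - D(G(x)))."""
    real_logits = D(y)
    fake_logits = D(G(x).detach())          # stop gradients into G
    loss_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return loss_real + loss_fake

def g_adversarial_loss(D, G, x):
    """Generator side: make D classify the translated image G(x) as real."""
    fake_logits = D(G(x))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```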
Figure 3: Our proposed spectral normalized generator, denoted as G, incorporates InsNorm (IN) and a Frequency Channel Attention Network (FCA). Additionally, it utilizes our novel spectral normalized residual block (SN ResBlock), with nine such blocks in the middle of the architecture. In line with the configuration of CUT, we extract features from five different layers to compute the multi-layer patch-wise contrastive loss: the RGB pixels, the first two downsampling convolutions, and the first and fifth residual blocks.

3.2 Network Architecture

Generator. We employ a ResNet-based encoder-decoder network with nine residual blocks as the generator. As is well known, the training process of GANs is unstable, and problems such as mode collapse and convergence difficulties often occur. We therefore employ spectral normalized convolutions (SNConv) in the design of the residual blocks, which enhances the stability of training. In addition, to further boost the performance of SN-DCR, an up-to-date attention mechanism, the Frequency Channel Attention Network (FCANet), is introduced into our network. As illustrated in Fig. 3, given an input image x, the generator G maps x to G(x) in the target domain. To achieve this goal, G is supposed to preserve both image structure and details during translation. Given a cat, we first employ an initial layer and two down-sampling layers to encode the input image into a low-resolution feature map. Then, nine SN ResBlocks are adopted to extract more complex and deeper features in the low-resolution space; Fig. 4 shows the detailed structure of the SN ResBlock. After that, we employ two corresponding up-sampling layers and a 7×7 convolutional layer to output a dog. Moreover, as mentioned above, we introduce FCANet into the generator design to further enhance the ability of SN-DCR. FCANet combines the channel attention mechanism with the discrete cosine transform and extends SENet [15] to a multi-spectral channel attention mechanism, enabling our model to adaptively learn the weights of different feature maps. The results of the ablation study indicate that introducing FCANet improves the performance of the proposed model significantly.
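For concreteness, the following is a minimal PyTorch sketch of one SN ResBlock as described above; the exact layer ordering, padding, and activation are assumptions where the text and Fig. 4 do not pin them down.

```python
# Sketch of a spectral normalized residual block (SN ResBlock).
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1),
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3)),  # SNConv
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1),
            spectral_norm(nn.Conv2d(channels, channels, kernel_size=3)),  # SNConv
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # Residual connection: spectral normalization bounds the Lipschitz
        # constant of the convolutions, stabilizing adversarial training.
        return x + self.body(x)
```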

Discriminator. We use the same PatchGAN discriminator [27] architecture as CycleGAN and Pix2Pix, which operates on local 70×70 patches and assigns a result to every patch. This is equivalent to manually cropping an image into overlapping 70×70 patches, running a regular discriminator over each patch, and averaging the results. Concretely, the discriminator takes an image from either domain X or domain Y, passes it through five downsampling Convolution-Normalization-LeakyReLU layers, and outputs a 30×30 result matrix, where each element corresponds to the classification result of one patch. Following CycleGAN and Pix2Pix, to improve the stability of adversarial training we use a buffer that stores 50 previously generated images.
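A minimal sketch of such an image buffer is shown below; following the CycleGAN/Pix2Pix convention, an older fake image is returned with probability 0.5, although that probability is an assumption not stated in the text.

```python
# Sketch of a 50-image buffer of previously generated images fed to D.
import random

class ImagePool:
    def __init__(self, pool_size=50):
        self.pool_size = pool_size
        self.images = []

    def query(self, image):
        """Return either the current fake image or a stored older one."""
        if self.pool_size == 0:
            return image
        if len(self.images) < self.pool_size:
            self.images.append(image.detach().clone())
            return image
        if random.random() < 0.5:
            idx = random.randrange(self.pool_size)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()  # replace with the new fake
            return old
        return image
```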

Figure 4: SN Residual block. The SN residual block can enhance the stability of training and assist the generator to extract more complex features.

3.3 Dual Contrastive Regularization

We adopt the real images of the target domain and the source domain as positives and negatives, respectively, to improve the quality of the generated images. DCR constrains the generator via two different feature spaces. Note that, to ensure the flexibility of our proposed method, these positives and negatives are chosen randomly. To impose a constraint on the global structure, we propose a semantic contrastive loss, which encourages the generated image G(x) to be close to the positives P while keeping it away from the negatives N. For the feature space, inspired by AECR-Net [35], we employ a pre-trained VGG-16 network to extract the feature maps $f\in\mathbb{R}^{C\times H\times W}$.

The semantic contrastive loss can be expressed as:

\mathcal{L}_{semantic}=\sum_{i=1}^{n}\omega_{i}\cdot\frac{\left\|F_{i}(G(x))-F_{i}(P)\right\|_{1}}{\left\|F_{i}(G(x))-F_{i}(N)\right\|_{1}+1e^{-7}}, \qquad (2)

where $F_i$ denotes the $i$-th hidden features extracted from the VGG-16 network pre-trained on ImageNet, and $n$ is the number of chosen layers. $P$ refers to the real images of the target domain, and $N$ refers to the real images of the source domain. Here we choose the 1st, 3rd, 5th, 9th and 13th layers. $\omega_i$ are weight coefficients, and we set $\omega_1 = 1/16$, $\omega_2 = 1/8$, $\omega_3 = 1/4$, $\omega_4 = 1/2$, $\omega_5 = 1$.
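A possible PyTorch sketch of Eq. (2) is given below. The mapping from the "1st, 3rd, 5th, 9th and 13th layers" to indices of torchvision's vgg16.features is an approximation, so layer_ids should be treated as an assumption; the weights follow the values above.

```python
# Sketch of the semantic contrastive loss in Eq. (2) using VGG-16 features.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SemanticContrastiveLoss(nn.Module):
    def __init__(self, layer_ids=(1, 6, 11, 18, 25),
                 weights=(1/16, 1/8, 1/4, 1/2, 1.0)):
        super().__init__()
        # Newer torchvision versions use vgg16(weights="IMAGENET1K_V1").
        self.vgg = vgg16(pretrained=True).features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids, self.weights = layer_ids, weights
        self.l1 = nn.L1Loss()

    def _features(self, img):
        feats, h = [], img
        for i, layer in enumerate(self.vgg):
            h = layer(h)
            if i in self.layer_ids:
                feats.append(h)
            if len(feats) == len(self.layer_ids):
                break
        return feats

    def forward(self, fake, pos, neg):
        # Ratio of L1 distances: pull toward positives, push from negatives.
        f_fake, f_pos, f_neg = map(self._features, (fake, pos, neg))
        loss = 0.0
        for w, a, p, n in zip(self.weights, f_fake, f_pos, f_neg):
            loss = loss + w * self.l1(a, p) / (self.l1(a, n) + 1e-7)
        return loss
```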

Besides, to impose a constraint on the global texture, we use a dedicated feature space. This feature space can be built on any layer of the network and consists of the correlations between the different filter responses. These feature correlations are given by the Gram matrix:

M^{l}_{ij}=\sum_{k}f^{l}_{ik}f^{l}_{jk}, \qquad (3)

where $M^l_{ij}$ is the inner product between the (vectorized) feature maps $i$ and $j$ in layer $l$, and $k$ indexes the elements of the flattened feature maps. We employ a feature extraction network to obtain feature maps of shape $[C, H, W]$; via flattening and matrix transposition we obtain feature maps $i$ and $j$ of shape $[C, H\times W]$, and the Gram matrix is the inner product between them. We then obtain a set of Gram matrices $\{M^{1},M^{2},\ldots,M^{L}\}$ from layers $\{1,2,\ldots,L\}$ of the feature extraction network. The Gram matrix $M$ is a quantitative description of latent image features. We expect the distance $d$ (measured with the L2 norm) between the generated image $G(x)$ and the negatives $N$ to be much greater than the distance between $G(x)$ and the positives $P$:

d(M(G(x)),M(P))\ll d(M(G(x)),M(N)), \qquad (4)

Therefore, our proposed style contrastive loss can be expressed as:

\mathcal{L}_{style}=\max\left\{d(M(G(x)),M(P))-d(M(G(x)),M(N))+\alpha,\ 0\right\}, \qquad (5)

where $\alpha$ is a margin hyperparameter (similar to the triplet loss [26]); we set it to 0.04 in our experiments. Finally, our dual contrastive regularization is formulated as:

\mathcal{L}_{Dual}=\lambda_{1}\mathcal{L}_{semantic}+\lambda_{2}\mathcal{L}_{style}, \qquad (6)

where $\lambda_1$ and $\lambda_2$ are weight coefficients; we set $\lambda_1 = 1$ and $\lambda_2 = 0.5$.
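The sketch below illustrates Eqs. (3), (5) and (6) under the same caveats: the feature extraction network producing the per-layer maps is left abstract, the Gram normalization by C·H·W is an assumption, and the margin α = 0.04 and weights λ1 = 1, λ2 = 0.5 follow the text.

```python
# Sketch of the Gram matrix (Eq. (3)), style contrastive loss (Eq. (5))
# and dual contrastive regularization (Eq. (6)).
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """feat: [B, C, H, W] -> [B, C, C] Gram matrix of channel correlations."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)               # flatten each feature map
    gram = torch.bmm(f, f.transpose(1, 2))      # inner products between channels
    return gram / (c * h * w)                   # normalization is an assumption

def style_contrastive_loss(feats_fake, feats_pos, feats_neg, alpha=0.04):
    """Triplet-style margin loss over Gram matrices from the chosen layers."""
    loss = 0.0
    for f_f, f_p, f_n in zip(feats_fake, feats_pos, feats_neg):
        g_f, g_p, g_n = gram_matrix(f_f), gram_matrix(f_p), gram_matrix(f_n)
        d_pos = F.mse_loss(g_f, g_p)            # L2 distance to positives (target domain)
        d_neg = F.mse_loss(g_f, g_n)            # L2 distance to negatives (source domain)
        loss = loss + torch.clamp(d_pos - d_neg + alpha, min=0.0)
    return loss

def dual_contrastive_regularization(l_semantic, l_style, lam1=1.0, lam2=0.5):
    """Weighted combination of the two global terms (Eq. (6))."""
    return lam1 * l_semantic + lam2 * l_style
```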

Table 1: Quantitative comparison with all baselines.
Method | Cat\rightarrowDog FID\downarrow | Cat\rightarrowDog SWD\downarrow | Van Gogh\rightarrowPhoto FID\downarrow | Horse\rightarrowZebra FID\downarrow | Horse\rightarrowZebra SWD\downarrow | Horse\rightarrowZebra sec/iter\downarrow
CycleGAN | 80.5 | 19.5 | 103.0 | 72.2 | 39.1 | 0.40
CUT | 76.2 | 12.9 | 96.9 | 45.5 | 31.5 | 0.24
FastCUT | 94.0 | 17.6 | 105.3 | 73.4 | 38.2 | 0.15
FSeSim | 87.8 | 13.8 | 94.3 | 43.4 | 37.2 | 0.11
DCLGAN | 68.7 | 12.5 | 93.7 | 43.2 | 31.2 | 0.41
QS-Attn | 72.8 | 12.8 | 92.2 | 41.1 | 30.3 | 0.22
SN-DCR | 62.7 | 12.1 | 90.3 | 33.6 | 28.4 | 0.23
  • For these metrics, our algorithm clearly outperforms all the baselines, and SN-DCR achieves state-of-the-art performance on unpaired I2I translation tasks.

3.4 Patch-wise Contrastive Loss

Following the setup of CUT, we employ a patch-wise contrastive loss to maximize the mutual information between the inputs and outputs. The key idea is to pull the query (a patch from the generated image, yellow box in Fig. 2) and the positive (the corresponding patch from the input image, blue box) together, and to push the query and the negatives (non-local patches from the input image, red boxes) apart. We employ E to extract features from the input image and the generated image, and then introduce the QS-Attn module to deliberately choose important anchors for the patch-wise contrastive loss. The loss can be expressed as:

\ell=-\log\left[\frac{\exp\left(q\cdot k^{+}/\tau\right)}{\exp\left(q\cdot k^{+}/\tau\right)+\sum_{n=1}^{N-1}\exp\left(q\cdot k^{-}/\tau\right)}\right], \qquad (7)

where $q$ refers to the anchor feature (yellow box in Fig. 2) from $G(x)$, $k^+$ refers to a positive (blue box), and $k^-$ refers to the $(N-1)$ negatives (red boxes). Here $\tau$ is a temperature parameter used to scale the similarity between the query and the other samples; its default value is 0.07. Note that the positive $k^+$ is the patch corresponding to the anchor feature $q$ in the input image $x$, and the $(N-1)$ negatives are randomly selected from $x$.
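A compact sketch of Eq. (7) for a single layer is shown below; it assumes ℓ2-normalized patch features and uses the other patches of the same image as the (N−1) internal negatives, which matches the description above but simplifies the QS-Attn selection step.

```python
# Sketch of the patch-wise InfoNCE term in Eq. (7) for one layer.
import torch
import torch.nn.functional as F

def patch_nce_loss(q, k, tau=0.07):
    """q: [S, C] anchor patches from G(x); k: [S, C] patches from x,
    where k[i] is the positive for q[i] and k[j], j != i, are negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / tau                          # [S, S] similarity matrix
    targets = torch.arange(q.size(0), device=q.device)
    # Cross-entropy with diagonal targets is exactly -log softmax of the positive.
    return F.cross_entropy(logits, targets)
```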

We utilize $E$ to extract features from the generated image, select $L$ layers of interest from $E$, and send them to the QS-Attn module $Q$ to obtain the features we need. The resulting features can be denoted by $\{\hat{\mathbf{z}}_{l}\}_{L}=\{Q_{l}(E^{l}(G(\mathbf{x})))\}_{L}$, where $E^{l}$ represents the $l$-th chosen layer. We use the QS-Attn module to select important anchors in each selected layer, $s\in\{1,\ldots,S_{l}\}$, where $S_{l}$ is the number of patches selected in each layer. In the same way, the corresponding patches of the $L$ layers are obtained from the input image, $\{\mathbf{z}_{l}\}_{L}=\{Q_{l}(E^{l}(\mathbf{x}))\}_{L}$. We take the corresponding patches obtained from the input image as positives, and the other features as negatives.

The patch-wise contrastive loss can be expressed as:

\mathcal{L}_{patch}(G,Q,X)=\mathbb{E}_{\boldsymbol{x}\sim X}\sum_{l=1}^{L}\sum_{s=1}^{S_{l}}\ell\left(\hat{\boldsymbol{z}}_{l}^{s},\boldsymbol{z}_{l}^{s},\boldsymbol{z}_{l}^{S\setminus s}\right). \qquad (8)

In summary, the overall objective function of SN-DCR can be formulated as:

\mathcal{L}_{Total}=\mathcal{L}_{adv}+\mathcal{L}^{X}_{patch}+\mathcal{L}^{Y}_{patch}+\mathcal{L}_{Dual}, \qquad (9)

where $\mathcal{L}^{X}_{patch}$ is the patch-wise contrastive loss defined in Eq. (8), and $\mathcal{L}^{Y}_{patch}$ is the identity loss, in which the positive $k^+$ and the negatives $k^-$ are extracted from a real image $y$ from the domain $Y$ and the anchor $q$ is taken from $G(y)$. Following the CUT settings, we add this identity loss to prevent the generator $G$ from making unnecessary changes to images that already belong to the target domain.
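Putting the pieces together, a hypothetical training-step sketch for the generator objective in Eq. (9) could look as follows; sample_patches stands in for the encoder E plus QS-Attn patch selection, the multi-layer sum of Eq. (8) is collapsed to a single layer for brevity, and the helper names refer to the sketches above rather than the released code.

```python
# Sketch of assembling the total generator objective in Eq. (9).
def total_generator_loss(G, D, x, y, semantic_loss, style_feats, sample_patches):
    fake = G(x)
    l_adv = g_adversarial_loss(D, G, x)              # Eq. (1), generator side

    # Eq. (8) on the source domain: anchors from G(x), positives from x.
    l_patch_x = patch_nce_loss(sample_patches(fake), sample_patches(x))

    # Identity term: anchors from G(y), positives/negatives from the real y.
    l_patch_y = patch_nce_loss(sample_patches(G(y)), sample_patches(y))

    # Eq. (6): y serves as the global positive, x as the global negative.
    l_dual = dual_contrastive_regularization(
        semantic_loss(fake, y, x),
        style_contrastive_loss(style_feats(fake), style_feats(y), style_feats(x)),
    )
    return l_adv + l_patch_x + l_patch_y + l_dual
```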

Figure 5: Visual comparison with all baselines on the Horse\rightarrowZebra and Cat\rightarrowDog datasets. Compared with all baselines, our SN-DCR not only performs better in terms of global structure and texture, but also generates more realistic images with more natural details.
Table 2: Efficiency comparison with all baselines.
(Measured on the Horse\rightarrowZebra task.)
Method | Memory\downarrow | Overall training time\downarrow | FLOPs\downarrow | Parameters\downarrow
CycleGAN | 4.7G | 46h | 128.26G | 28.286M
CUT | 3.9G | 27h | 64.13G | 14.703M
FastCUT | 3.4G | 17h | 64.13G | 14.703M
FSeSim | 3.8G | 12h | 64.13G | 14.143M
DCLGAN | 7.8G | 44h | 128.26G | 29.406M
QS-Attn | 5.5G | 24h | 64.13G | 14.703M
SN-DCR | 5.5G | 26h | 42.38G | 14.679M
  • Our proposed SN-DCR delivers competitive image-to-image translation performance despite suboptimal results in training time and memory consumption, striking a reasonable balance between computational resources and translation quality. Moreover, we compare the FLOPs of each method's generator, and our approach performs best on this measure, which further substantiates the advantage of SN-DCR.

Table 3: Quantitative comparison with all baselines on Cityscapes dataset.
CityScapes
Method | FID\downarrow | mAP\uparrow | pixAcc\uparrow | classAcc\uparrow | SSIM\uparrow
CycleGAN | 76.3 | 20.4 | 55.9 | 25.4 | 0.3601
CUT | 56.4 | 24.7 | 68.8 | 30.7 | 0.4474
FastCUT | 68.8 | 19.1 | 59.9 | 24.3 | 0.4265
FSeSim | 54.3 | 22.1 | 69.4 | 27.8 | 0.4367
DCLGAN | 49.4 | 22.9 | 76.9 | 29.6 | 0.4486
QS-Attn | 53.5 | 25.5 | 79.9 | 31.2 | 0.4573
SN-DCR | 46.6 | 27.9 | 74.3 | 35.4 | 0.4638
  • For the Cityscapes dataset, we use the pretrained semantic segmentation network DRN and compute mean average precision (mAP), pixel-wise accuracy (pixAcc), and average class accuracy (classAcc), which reflect the semantic interpretability of the generated images. Unlike the other datasets, CityScapes has corresponding labels, so we also evaluate the structural similarity (SSIM) [33] between the ground truth and the generated images to assess structure consistency.

Figure 6: Visual comparison with all baselines on the CityScapes dataset. Our proposed SN-DCR shows visually satisfactory results: it clearly generates a wide variety of cars with natural details and performs better in terms of global structure and texture.
Figure 7: Visual results of the ablation study on the Horse\rightarrowZebra dataset. The leftmost column shows the input images. Model G is our proposed SN-DCR.
Table 4: Ablation study on the Horse\rightarrowZebra dataset (hyperparameters $\lambda_1$ and $\lambda_2$ in DCR).
 | $\lambda_{1}$=0.1 | $\lambda_{1}$=0.5 | $\lambda_{1}$=1 | $\lambda_{1}$=5 | $\lambda_{1}$=10
$\lambda_{2}$=0.1 | 38.1 | 37.9 | 36.5 | 37.4 | 38.4
$\lambda_{2}$=0.5 | 38.3 | 34.1 | 33.6 | 36.7 | 37.9
$\lambda_{2}$=1 | 39.2 | 35.5 | 33.9 | 36.2 | 38.9
$\lambda_{2}$=5 | 40.3 | 37.8 | 35.4 | 38.6 | 40.9
$\lambda_{2}$=10 | 41.1 | 38.4 | 36.2 | 39.5 | 40.7
  • We use FID to evaluate this ablation study.

Table 5: Ablation study on Horse \rightarrow Zebra dataset.
Method | Generator: FCA | Generator: SN | QS-Attn | DCR: semantic | DCR: style | FID\downarrow
CUT | – | – | – | – | – | 45.5
A | × | × | × | × | ✓ | 37.4
B | × | × | × | ✓ | × | 38.7
C | × | × | × | ✓ | ✓ | 35.5
D | × | × | ✓ | ✓ | ✓ | 34.5
E | ✓ | ✓ | ✓ | × | × | 38.9
F | × | ✓ | ✓ | ✓ | ✓ | 34.1
G | ✓ | ✓ | ✓ | ✓ | ✓ | 33.6
  • DCR refers to dual contrastive regularization; semantic refers to the semantic contrastive loss and style to the style contrastive loss. FCA refers to the Frequency Channel Attention Network, SN refers to the spectral normalized convolutional network, and QS-Attn refers to the QS-Attn module.

4 EXPERIMENTS

4.1 Experimental Settings

Datasets. SN-DCR is trained and evaluated on the Horse\rightarrowZebra, Cat\rightarrowDog, Van Gogh\rightarrowPhoto and CityScapes datasets. Horse\rightarrowZebra is provided in [42] and contains 1,067 and 1,334 training images of horses and zebras, respectively; we use 120 horse images as test images. Cat\rightarrowDog is from [7] and consists of 5,153 and 4,739 training images of cats and dogs, respectively; we use 500 cat images as test images. Van Gogh\rightarrowPhoto is a dataset of 400 Van Gogh paintings and 6,287 photos extracted from [7]; we use 400 Van Gogh images as test images. Cityscapes contains street scenes from German cities, with 2,975 training images and 500 test images.

Training details. The implementation of SN-DCR is mainly based on CUT. We use a ResNet-based generator and a PatchGAN discriminator. Unlike CUT, we introduce the QS-Attn module into the patch-wise contrastive loss and the SN and FCA modules into the design of the generator. Our proposed dual contrastive regularization employs VGG-16 to extract features. We use the Adam optimizer with $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The batch size is 1, and all training images are resized to 286×286 and then randomly cropped to 256×256. SN-DCR is trained for 400 epochs on each dataset with a learning rate of 0.0002, which starts to decay linearly to 0 after 200 epochs.
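As an illustration of this schedule, a sketch using PyTorch's Adam and LambdaLR is given below; the constants follow the settings above, while the use of LambdaLR itself is an assumption about the implementation.

```python
# Sketch of the optimizer and linear learning-rate decay described above.
import torch

def build_optimizer_and_scheduler(params, n_epochs=200, n_epochs_decay=200):
    optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        # Constant for the first n_epochs, then linear decay to 0.
        return 1.0 - max(0, epoch - n_epochs) / float(n_epochs_decay)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler

# Usage (hypothetical): step the scheduler once per epoch for 400 epochs.
# optimizer, scheduler = build_optimizer_and_scheduler(generator.parameters())
```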

Evaluation metrics. We primarily utilize the Fréchet Inception Distance (FID) [14] and the Sliced Wasserstein Distance (SWD) [2] as key evaluation metrics for assessing the performance of SN-DCR. Both FID and SWD are well-established metrics with a robust correlation to human perception: they measure the distance between the distributions of real and generated images, and lower values signify a closer resemblance between the generated and real images. For the Cityscapes dataset, we use the pretrained semantic segmentation network DRN [37] and compute mean average precision (mAP), pixel-wise accuracy (pixAcc), and average class accuracy (classAcc), which reflect the semantic interpretability of the generated images.

4.2 Comparison with other Methods

Table 1 shows the quantitative results of our proposed SN-DCR compared with all baselines on three datasets, including QS-Attn [16], DCLGAN [12], FSeSim [41], CUT [23], FastCUT [23] and CycleGAN [42]. We mainly use the FID and SWD scores as our quantitative metrics. For these metrics, our algorithm clearly outperforms all the baselines, and SN-DCR achieves state-of-the-art performance on unpaired I2I translation tasks. As illustrated in Table 2, SN-DCR demonstrates superiority over DCLGAN and CycleGAN in terms of both performance and computational cost. Despite having a computational cost comparable to QS-Attn and CUT, SN-DCR outperforms them in terms of performance. Although SN-DCR falls slightly short of FastCUT and FSeSim in computational cost, its performance significantly surpasses theirs. Despite these limitations in specific computational metrics, SN-DCR distinguishes itself by its notable efficacy in the image translation task. Fig. 1 shows the visual comparison with all baselines on the Van Gogh\rightarrowPhoto dataset. Compared with other methods, our SN-DCR is able to preserve the global structure information and generate photos with more natural details, proving that our DCR is effective in improving the global information. To better illustrate the superiority of SN-DCR, we display some visual results in Fig. 5, where we randomly pick four samples from the Horse\rightarrowZebra and Cat\rightarrowDog datasets. QS-Attn, DCLGAN and FSeSim fail to preserve the details and textures of the generated images, while CUT, FastCUT and CycleGAN generate unsatisfactory content. Compared with all baselines, our SN-DCR not only performs better in terms of global structure and texture, but also generates more realistic images with more natural details. Fig. 6 shows the visual comparison with all baselines on CityScapes. Compared with other methods, SN-DCR clearly generates a wide variety of cars with natural details and performs better in terms of global structure and texture, proving that our DCR is able to enhance the consistency of the global texture. Moreover, SN-DCR produces more realistic images with richer color details. Table 3 shows the quantitative results of SN-DCR compared with all baselines on the CityScapes dataset; our algorithm clearly outperforms all the baselines. For the structural similarity (SSIM) metric, SN-DCR performs better than the other methods, proving that our DCR is effective in enhancing the global structure information.

4.3 Ablation Study

In the comparison experiments, SN-DCR shows better performance than all baseline methods. In SN-DCR, we apply the FCA and SN modules in the design of the generator, the QS-Attn module in patch-wise contrastive learning, and the dual contrastive regularization.

For the effect of the weights (hyperparameters $\lambda_1$ and $\lambda_2$) in DCR, ablation experiments are performed on the Horse\rightarrowZebra dataset to determine their optimal values, as shown in Table 4. To evaluate the remaining components separately, we conduct the ablation study on the Horse\rightarrowZebra dataset with CUT as our baseline. The visual results and metrics of the ablation study are given in Fig. 7 and Table 5.

Models A and B achieve better metrics than CUT, reflecting that our proposed global contrastive loss functions are useful for I2I translation tasks. Model C outperforms A and B, indicating the effectiveness of our dual setting. Model D performs better than C, indicating that the addition of the QS-Attn module is beneficial to the model's performance. Model E removes DCR to isolate the influence of the global constraints from the local ones; it outperforms QS-Attn, proving that our designed generator is effective. Model F outperforms D, indicating the effectiveness of the SN ResBlock in our generator. Finally, Model G, which combines all the measures, achieves the best performance on I2I translation tasks, validating the efficacy of our overall approach.

5 CONCLUSION

In this paper, we propose a new unpaired I2I translation framework based on dual contrastive regularization and spectral normalization, namely SN-DCR. To impose a global constraint on structure and texture in an unpaired manner, we formulate two new global contrastive loss functions, called dual contrastive regularization (DCR), to supplement the patch-wise contrastive loss. To alleviate mode collapse and convergence difficulties, we employ a spectral normalized convolutional network in the design of our generator. Moreover, the Frequency Channel Attention Network is introduced to further boost the feature representation ability of our generator. The ablation study proves the effectiveness of each component of our method, and our design handles various I2I translation tasks well. In the comparison experiments, SN-DCR shows better performance than all baseline methods in terms of global structure and texture, and quantitative and visual results on multiple datasets demonstrate that SN-DCR achieves the best results. Furthermore, we highlight that the principal advantage of SN-DCR lies in its capability to generate highly realistic images, particularly excelling in the generation of textures and natural details. Nevertheless, our approach comes with certain constraints, chief among them its training cost and memory consumption. Finally, we believe that our work will stimulate further research on unpaired image-to-image translation.

Acknowledgements This work was supported in part by the National Natural Science Foundation of China under Grant No. 62276138 and Grant No. 61876087.

Data availability statement The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Declaration

Conflict of interest The authors declare that they have no conflict of interest.

References

  • Benaim and Wolf [2017] Benaim S, Wolf L (2017) One-sided unsupervised domain mapping. Neural Information Processing Systems pp 752–762
  • Bruckstein et al [2011] Bruckstein AM, ter Haar Romeny BM, Bronstein AM, et al (2011) Wasserstein barycenter and its application to texture mixing. International Conference on Scale Space and Variational Methods pp 435–446
  • Caron et al [2020] Caron M, Misra I, Mairal J, et al (2020) Unsupervised learning of visual features by contrasting cluster assignments. Neural Information Processing Systems
  • Chang et al [2023] Chang Y, Guo Y, Ye Y, et al (2023) Unsupervised deraining: Where asymmetric contrastive learning meets self-similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Chen et al [2020a] Chen T, Kornblith S, Norouzi M, et al (2020a) A simple framework for contrastive learning of visual representations. International Conference on Machine Learning pp 1597–1607
  • Chen et al [2020b] Chen X, Fan H, Girshick RB, et al (2020b) Improved baselines with momentum contrastive learning. CoRR
  • Choi et al [2020] Choi Y, Uh Y, Yoo J, et al (2020) Stargan v2: Diverse image synthesis for multiple domains. Conference on Computer Vision and Pattern Recognition pp 8185–8194
  • Fu et al [2019] Fu T, Gong M, Wang C, et al (2019) Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. Conference on Computer Vision and Pattern Recognition pp 2427–2436
  • Gatys et al [2016] Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. Conference on Computer Vision and Pattern Recognition pp 2414–2423
  • Goodfellow et al [2020] Goodfellow I, Pouget-Abadie J, Mirza M, et al (2020) Generative adversarial networks. Communications of the ACM 63(11):139–144
  • Gou et al [2023] Gou Y, Li M, Song Y, et al (2023) Multi-feature contrastive learning for unpaired image-to-image translation. Complex & Intelligent Systems 9(4):4111–4122
  • Han et al [2021] Han J, Shoeiby M, Petersson L, et al (2021) Dual contrastive learning for unsupervised image-to-image translation. Conference on Computer Vision and Pattern Recognition Workshops pp 746–755
  • He et al [2020] He K, Fan H, Wu Y, et al (2020) Momentum contrast for unsupervised visual representation learning. Conference on Computer Vision and Pattern Recognition pp 9726–9735
  • Heusel et al [2017] Heusel M, Ramsauer H, Unterthiner T, et al (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Neural Information Processing Systems pp 6626–6637
  • Hu et al [2020] Hu J, Shen L, Albanie S, et al (2020) Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 42(8):2011–2023
  • Hu et al [2022] Hu X, Zhou X, Huang Q, et al (2022) Qs-attn: Query-selected attention for contrastive learning in I2I translation. Conference on Computer Vision and Pattern Recognition pp 18270–18279
  • Huang et al [2018] Huang X, Liu MY, Belongie S, et al (2018) Multimodal unsupervised image-to-image translation. European conference on computer vision pp 172–189
  • Isola et al [2017] Isola P, Zhu J, Zhou T, et al (2017) Image-to-image translation with conditional adversarial networks. Conference on Computer Vision and Pattern Recognition pp 5967–5976
  • Karras et al [2021] Karras T, Laine S, Aila T (2021) A style-based generator architecture for generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 43(12):4217–4228
  • Mirza and Osindero [2014] Mirza M, Osindero S (2014) Conditional generative adversarial nets. CoRR
  • Miyato et al [2018] Miyato T, Kataoka T, Koyama M, et al (2018) Spectral normalization for generative adversarial networks. International Conference on Learning Representations
  • Park et al [2019] Park T, Liu M, Wang T, et al (2019) Semantic image synthesis with spatially-adaptive normalization. Conference on Computer Vision and Pattern Recognition pp 2337–2346
  • Park et al [2020] Park T, Efros AA, Zhang R, et al (2020) Contrastive learning for unpaired image-to-image translation. European conference on computer vision 12345:319–345
  • Phaphuangwittayakul et al [2023] Phaphuangwittayakul A, Ying F, Guo Y, et al (2023) Few-shot image generation based on contrastive meta-learning generative adversarial network. The Visual Computer 39(9):4015–4028
  • Qin et al [2021] Qin Z, Zhang P, Wu F, et al (2021) Fcanet: Frequency channel attention networks. International Conference on Computer Vision pp 763–772
  • Schroff et al [2015] Schroff F, Kalenichenko D, Philbin J (2015) Facenet: A unified embedding for face recognition and clustering. Conference on Computer Vision and Pattern Recognition pp 815–823
  • Son et al [2017] Son J, Park SJ, Jung K (2017) Retinal vessel segmentation in fundoscopic images with generative adversarial networks. CoRR
  • Song et al [2023] Song S, Lee S, Seong H, et al (2023) Shunit: Style harmonization for unpaired image-to-image translation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 2292–2302
  • Sung et al [2018] Sung F, Yang Y, Zhang L, et al (2018) Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1199–1208
  • Torbunov et al [2023] Torbunov D, Huang Y, Yu H, et al (2023) Uvcgan: Unet vision transformer cycle-consistent gan for unpaired image-to-image translation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 702–712
  • Wang et al [2018] Wang T, Liu M, Zhu J, et al (2018) High-resolution image synthesis and semantic manipulation with conditional gans. Conference on Computer Vision and Pattern Recognition pp 8798–8807
  • Wang et al [2021] Wang W, Zhou W, Bao J, et al (2021) Instance-wise hard negative example generation for contrastive learning in unpaired image-to-image translation. International Conference on Computer Vision pp 14000–14009
  • Wang et al [2004] Wang Z, Bovik AC, Sheikh HR, et al (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
  • Wu et al [2023] Wu G, Jiang J, Liu X (2023) A practical contrastive learning framework for single-image super-resolution. IEEE Transactions on Neural Networks and Learning Systems
  • Wu et al [2021] Wu H, Qu Y, Lin S, et al (2021) Contrastive learning for compact single image dehazing. Conference on Computer Vision and Pattern Recognition pp 10551–10560
  • Yi et al [2017] Yi Z, Zhang HR, Tan P, et al (2017) Dualgan: Unsupervised dual learning for image-to-image translation. International Conference on Computer Vision pp 2868–2876
  • Yu et al [2017] Yu F, Koltun V, Funkhouser TA (2017) Dilated residual networks. Conference on Computer Vision and Pattern Recognition pp 636–644
  • Zhang et al [2020] Zhang D, Zheng Z, Li M, et al (2020) Reinforced similarity learning: Siamese relation networks for robust object tracking. In: Proceedings of the 28th ACM International Conference on Multimedia, pp 294–303
  • Zhang et al [2023] Zhang Y, Tian Y, Hou J (2023) Csast: Content self-supervised and style contrastive learning for arbitrary style transfer. Neural Networks 164:146–155
  • Zhao et al [2023] Zhao C, Cai W, Yuan Z, et al (2023) Multi-crop contrastive learning for unsupervised image-to-image translation. CoRR
  • Zheng et al [2021] Zheng C, Cham T, Cai J (2021) The spatially-correlative loss for various image translation tasks. Conference on Computer Vision and Pattern Recognition pp 16407–16417
  • Zhu et al [2017] Zhu JY, Park T, Isola P, et al (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. International Conference on Computer Vision pp 2242–2251
  • Zhu et al [2023] Zhu D, Wang W, Xue X, et al (2023) Structure-preserving image smoothing via contrastive learning. The Visual Computer pp 1–15