

Further author information: (Send correspondence to Ganning Zhao)
Ganning Zhao: E-mail: [email protected]

Unsupervised Synthetic Image Refinement via Contrastive Learning and Consistent Semantic-Structural Constraints

Ganning Zhao, University of Southern California, Los Angeles, CA, USA
Tingwei Shen, University of California, Berkeley, Berkeley, CA, USA
Suya You, DEVCOM Army Research Laboratory, Los Angeles, CA, USA
C.-C. Jay Kuo, University of Southern California, Los Angeles, CA, USA
Abstract

Ensuring the realism of computer-generated synthetic images is crucial to deep neural network (DNN) training. Because synthetic and real-world captured datasets have different semantic distributions, a semantic mismatch arises between synthetic and refined images, which in turn results in semantic distortion. Recently, contrastive learning (CL) has been successfully used to pull correlated patches together and push uncorrelated ones apart. In this work, we exploit semantic and structural consistency between synthetic and refined images and adopt CL to reduce the semantic distortion. In addition, we incorporate hard negative mining to further improve the performance. We compare our method with several benchmarking methods using qualitative and quantitative measures and show that it offers state-of-the-art performance.

Keywords: Synthetic image refinement, Contrastive learning, Image translation, Unsupervised learning, Synthetic-to-real translation

1 INTRODUCTION

Synthetic image refinement, aiming at making synthetic images more realistic to human observers, is an increasingly important research topic due to its numerous applications. Machine learning models with a huge number of parameters (e.g., deep neural networks) demand vast amounts of training data to ensure good performance. However, real-world data acquisition, including data collection and labeling, is expensive and time-consuming. To overcome this challenge, researchers have turned to computer-generated synthetic data as a substitute for real-world data during training. Nevertheless, there exists a significant gap between the distributions of real and synthetic data. This gap tends to degrade the model performance in downstream tasks.

To narrow the gap, one solution is to improve the generator so that it produces more realistic-looking synthetic data. However, generating realistic synthetic data remains challenging and costly, since even a powerful generator may fail to produce close-to-real results. Another solution is to learn the characteristics of images captured in the real world and improve the realism of synthetic images accordingly. In this work, we take the second path. Simply speaking, we refine synthetic data by improving their realism so that their distribution becomes similar to that of real data.

Various lines of research have been proposed to refine synthetic images [1]. As one of the first papers in this field, SimGAN [2] used an adversarial structure with limited success. Many image-to-image translation methods have been proposed to transfer input images in the source domain to output images in the target domain. DiscoGAN [3], CycleGAN [4], and DualGAN [5] applied a bidirectional structure to ensure cycle-consistency between source and target images. However, the assumption of a bijective mapping may not hold in practice since images from the source domain may not have counterparts in the target domain, and vice versa. To address this issue, various one-sided image translation methods have been proposed, including DistanceGAN [6] and GcGAN [7]. However, the semantic distortion problem still exists: there is a semantic mismatch between input and output images.

Recently, contrastive learning (CL) has been adopted for image-to-image translation. One pioneering work is CUT [8]. Afterward, researchers came up with various ideas for further improvement, such as deriving more powerful loss functions [9, 10], utilizing more informative negative samples [11, 12, 13], and combining these two ideas. However, most of them have not been tailored to synthetic image refinement. Inspired by the recent success of CL in image-to-image translation, we propose an unsupervised CL method that improves the realism of synthetic images while reducing the semantic distortion.

It was pointed out in [14] that the semantic distortion is caused by the difference in semantic distributions of the source and target domains. For instance, consider the refinement of synthetic images in the GTA5 dataset toward the real-world CityScapes domain. There are more vegetation and cars in CityScapes than in GTA5. As a result, trees may wrongly appear in sky and building regions, and extra cars may appear on the street, in the refined images. This difference causes semantic inconsistency between input synthetic and output refined images, resulting in downstream semantic segmentation errors. Furthermore, structure information is highly correlated with semantics; for instance, buildings, trees, and roads have different structures. To reduce the semantic distortion, we incorporate both semantic and structural information in CL to ensure semantic consistency between input and output images.

This work has three main contributions.

  1. We propose a novel CL method that enhances the realism of synthetic images. Our method provides additional training data to improve downstream tasks and has the potential to aid computer game rendering.

  2. We derive a semantic-structural relation consistency (SSRC) loss to maintain semantic and structural consistency between input synthetic and output refined images.

  3. We conduct experiments on various datasets to demonstrate the effectiveness and generalizability of our method in improving the realism of synthetic images. Our method offers state-of-the-art performance in both quantitative and qualitative evaluations.

2 Related Work

2.1 Early Work on Synthetic Image Refinement

SimGAN [2] was one of the pioneering papers that improved the realism of synthetic images using unlabeled real data while preserving their annotation information with an adversarial network. It added an L1 self-regularization loss in the image pixel space to keep the annotations. Based on this method, Atapattu et al. [15] proposed an improved version that incorporated a perceptual loss term to further enhance the realism of the refined images. However, these methods only apply to gray-scale images with simple structures, and their performance on complex-scene datasets is limited.

CycleGAN [4], DualGAN [5], and DiscoGAN [3] assumed a bijective translation function and used the cycle-consistency loss to enforce consistency between images in the source domain, $X$, and the reconstructed domain, $F(G_{XY}(X))$, and vice versa. In this setting, $G_{XY}$ is the mapping from the source domain, $X$, to the target domain, $Y$, and $F$ is the inverse mapping from $Y$ back to $X$. These methods require an additional generator/discriminator pair and are restricted to bijective image translation [16, 8, 11]. As a result, they may not work well for synthetic image refinement since the source and target domains may not contain the same scenes or content.

One-sided image translation methods were proposed as alternatives to the cycle-consistency constraint. DistanceGAN [6] learns a one-sided mapping by ensuring a strong correlation between pairwise distances computed in the source and target domains. GcGAN [7] predefines a geometric transformation $f(\cdot)$ such that $f^{-1}(f(x))=f(f^{-1}(x))=x$ and enforces geometry-consistency between an input image and its counterpart transformed by $f$ in the new domain (i.e., $f(G_{XY}(x))\approx G_{XY}(f(x))$).

2.2 Contrastive Learning (CL) and the InfoNCE Loss

Several CL methods have recently been proposed to improve image translation tasks. CUT [8] utilized mutual information maximization to learn a cross-domain similarity between patches from the source and target domains. NEGCUT [11] trained a generator to produce negative samples online in place of randomly drawn ones, which effectively brought the positive and query samples closer. F-LSeSim [17] minimized a fixed self-similarity loss and a learned self-similarity loss to preserve scene structures: the first is computed as a spatially-correlative map between input and output patches, and the second as a contrastive loss among input, augmented input, and target-domain patches.

Several loss functions have been proposed to improve the CL model performance. A famous one is the InfoNCE loss [18], which relates correlated (positive) samples to both correlated and uncorrelated (negative) samples. Minimizing this loss maximizes a lower bound on the mutual information of positive pairs, making the feature representations of positive samples more correlated and those of negative samples less correlated. This technique and the corresponding loss function have been widely adopted, e.g., [19, 8, 20, 21].

Variants of InfoNCE have been investigated to further improve the performance. For example, the negative-positive coupling (NPC) effect reduces learning efficiency through vanishing gradients when randomly selected negative patches are uncorrelated with positive ones. The decoupled InfoNCE loss [9] addresses this problem by removing the positive-pair term from the denominator. Another idea is to mine hard negative samples, e.g., by adversarial training on negative samples [12] or by a more informative negative sampling strategy based on the von Mises-Fisher distribution [13].
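To make the NPC discussion concrete, the two losses can be sketched as follows for a query $q$ with positive $k^{+}$ and negatives $\{k_{i}^{-}\}$ (our transcription in generic CL notation; the formulations in [18] and [9] may differ in normalization details):

```latex
% InfoNCE: the positive similarity also appears in the denominator,
% which couples the positive and negative terms (the NPC effect).
L_{\mathrm{InfoNCE}}
  = -\log\frac{\exp(q^{T}k^{+}/\tau)}
              {\exp(q^{T}k^{+}/\tau)+\sum_{i=1}^{K}\exp(q^{T}k_{i}^{-}/\tau)}

% Decoupled infoNCE (DCE): the positive term is removed from the denominator.
L_{\mathrm{DCE}}
  = -\log\frac{\exp(q^{T}k^{+}/\tau)}
              {\sum_{i=1}^{K}\exp(q^{T}k_{i}^{-}/\tau)}
```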

Figure 1: System overview of the proposed method.

3 Proposed Method

3.1 System Overview

We aim to design a CL-based method to refine synthetic images so that they appear more realistic to humans. The system consists of three parts.

  • As shown in the middle row of Figure 1, input synthetic images are fed into a generator with an encoder/decoder structure to generate refined output images. Afterward, the refined images are fed back into the encoder with the same weights.

  • As shown in the bottom row of Figure 1, the encoded features from different layers of the synthetic and refined images are fed into a 2-layer MLP to yield a shared embedding space, denoted by $\mathcal{H}$. Since deeper layers in the encoder have larger receptive fields, the spatial positions of deeper features correspond to larger patches in the synthetic images. Hence, different patches correspond to different feature vectors in the MLP feature space, and deeper layers' features correspond to larger patches. Next, we propose the semantic-structural relation consistency (SSRC) loss to constrain semantic and structural distortions. Furthermore, we use the hard negative decoupled infoNCE (hDCE) loss to sample hard negative patches.

  • As shown in the top row of Figure 1, since the refined images should learn the attributes of real-world captured images, the real-world and refined images are fed into a discriminator to reduce their differences.

We adopt the following notation for the rest of this section. Let $X\in\mathcal{X}$ denote synthetic images, $\hat{Y}\in\mathcal{\hat{Y}}$ refined images, and $Y\in\mathcal{Y}$ real-world images, where $\mathcal{X}$, $\mathcal{\hat{Y}}$, and $\mathcal{Y}$ are the source, target, and destination domains, respectively. Patches in the corresponding images are represented by $x\in X$, $y\in Y$, and $\hat{y}\in\hat{Y}$. Their embedding vectors in the MLP feature space are $z\in x$, $w\in y$, and $\hat{w}\in\hat{y}$.
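As an illustration of this setup, the following PyTorch-style sketch samples patch positions from multi-layer encoder features and projects them with a 2-layer MLP; the class name, layer choices, and dimensions are our assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEmbedder(nn.Module):
    """Projects encoder features from several layers into a shared embedding space.
    Hypothetical sketch; layer selection and dimensions are assumptions."""
    def __init__(self, feat_dims, embed_dim=256):
        super().__init__()
        # One 2-layer MLP per selected encoder layer.
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(c, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
            for c in feat_dims
        ])

    def forward(self, feats, num_patches=256):
        # feats: list of feature maps [B, C_l, H_l, W_l] from the selected encoder layers.
        embeddings, indices = [], []
        for mlp, f in zip(self.mlps, feats):
            b, c, h, w = f.shape
            f = f.flatten(2).permute(0, 2, 1)                        # [B, H*W, C]
            idx = torch.randperm(h * w, device=f.device)[:num_patches]  # sample patch positions
            z = mlp(f[:, idx, :])                                    # [B, num_patches, embed_dim]
            embeddings.append(F.normalize(z, dim=-1))
            indices.append(idx)  # reuse the same positions for the refined image
        return embeddings, indices
```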

3.2 Semantic-Structural Relation Consistency (SSRC)

The semantic-structural relation consistency (SSRC) loss contains two complementary components: 1) semantic relation consistency (SRC) and 2) the structure consistency constraint (SCC). The SRC loss, $L_{SRC}$, is computed in the feature space to alleviate the semantic distortion, while the SCC loss, $L_{SCC}$, is computed in the pixel space to reduce the structural distortion. The SSRC loss is given by the weighted sum of $L_{SRC}$ and $L_{SCC}$:

$L_{SSRC}=\lambda_{SRC}L_{SRC}+\lambda_{SCC}L_{SCC}.$ (1)

We derive $L_{SRC}$ and $L_{SCC}$ below.

Figure 2: Illustration of the SRC loss. The encoder extracts features from different layers, which correspond to different patch sizes. We use pseudo labels to represent the similarity distribution, assigning higher pseudo scores to patches with closer semantic relations to a given query or positive patches.

3.2.1 Semantic Relation Consistency (SRC)

Most CL losses use a one-hot labeling scheme to differentiate patches: a label of one is assigned to positive patches and a label of zero to negative ones. However, this approach is uninformative because it treats all negative samples equally, although their semantic similarity to the query patch varies and they should therefore contribute to the loss with different weights. To address this issue, we encourage consistency between the semantic relation of the query and negative patches in the output image and that of the positive and negative patches in the input image.

We capture the semantic relation by soft pseudo labels [22, 23, 24] of negative patches, which reveal more information about their correlation with the query or positive patches. Specifically, for an output image, we define the semantic similarity distribution between a query, $x_{q}$, and a negative patch, $x_{i}^{-}$, in the feature space as a softmax function:

$Q(i)=\frac{\exp(z_{q}^{T}z_{i}^{-})}{\sum_{j=1}^{K}\exp(z_{q}^{T}z_{j}^{-})},$ (2)

where $z_{q}$ and $z_{i}^{-}$ are the embedding vectors of $x_{q}$ and $x_{i}^{-}$, respectively. Similarly, for an input image, the semantic similarity between a positive patch, $y_{p}$, and a negative patch, $y_{i}^{-}$, is defined as

$P(i)=\frac{\exp(w_{p}^{T}w_{i}^{-})}{\sum_{j=1}^{K}\exp(w_{p}^{T}w_{j}^{-})},$ (3)

where $w_{p}$ and $w_{i}^{-}$ are the corresponding embedding vectors of $y_{p}$ and $y_{i}^{-}$, respectively. We apply the Jensen-Shannon divergence (JSD) to preserve the consistency between the probability distributions $P_{k}$ and $Q_{k}$, as shown in Figure 2. Then, we have

$L_{SRC}=\sum_{k=1}^{K}\widehat{JSD}(P_{k}||Q_{k})=\sum_{k=1}^{K}\left[\frac{1}{2}KL(P_{k}||M_{k})+\frac{1}{2}KL(Q_{k}||M_{k})\right],$ (4)

where $M_{k}=\frac{1}{2}(P_{k}+Q_{k})$. Therefore, minimizing Eq. (4) constrains the semantic consistency between input and output patches.
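A minimal sketch of Eqs. (2)-(4) for a single layer is given below, assuming the first patch of each image serves as the query/positive and the remaining patches serve as negatives; the function name and tensor shapes are illustrative only.

```python
import torch.nn.functional as F

def src_loss(z, w, eps=1e-8):
    """Semantic relation consistency (SRC) sketch for one layer.
    z, w: [B, N, D] L2-normalized patch embeddings of the refined and synthetic image.
    Patch 0 acts as the query (z) / positive (w); patches 1..N-1 act as negatives."""
    z_q, z_neg = z[:, :1, :], z[:, 1:, :]   # query and negatives (refined image)
    w_p, w_neg = w[:, :1, :], w[:, 1:, :]   # positive and negatives (synthetic image)

    # Eqs. (2)/(3): softmax over inner products with the negatives.
    q_dist = F.softmax((z_q @ z_neg.transpose(1, 2)).squeeze(1), dim=-1)  # [B, N-1]
    p_dist = F.softmax((w_p @ w_neg.transpose(1, 2)).squeeze(1), dim=-1)  # [B, N-1]

    # Eq. (4): Jensen-Shannon divergence between the two similarity distributions.
    m = 0.5 * (p_dist + q_dist)
    jsd = 0.5 * (p_dist * ((p_dist + eps) / (m + eps)).log()).sum(-1) \
        + 0.5 * (q_dist * ((q_dist + eps) / (m + eps)).log()).sum(-1)
    return jsd.mean()
```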

3.2.2 Structure Consistency Constraint (SCC)

Previous studies [25, 14] reported significant differences in class statistics between datasets. For instance, more pixels are labeled as "sky" and "building" and fewer as "vegetation" in the GTA5 dataset than in the CityScapes dataset. When we use GTA5 as the source and CityScapes as the destination, sky or building regions in GTA5 images are prone to be falsely transformed into vegetation in the refined images. We call this the structure distortion between the source and target images. Together with the semantic distortion, it degrades the refinement of synthetic data.

To address this issue, we develop a new loss term, the structure consistency constraint. It operates in the image pixel space to minimize the structure distortion. More structure information is preserved if the corresponding pixels in the input and output images have a stronger correlation. Intuitively, a realistic style should be added to the input images without changing the image shape. Thus, maximizing the mutual information between input and output pixels enforces structural consistency, as it amplifies the non-linear dependencies between them. For efficient and numerically stable computation, we adopt the relative Squared-loss Mutual Information (rSMI), an extension of the Squared-loss Mutual Information [14].

For an input image in the source domain, $X_{i}\in\mathcal{X}$, and an output image in the target domain, $\hat{Y_{i}}\in\mathcal{\hat{Y}}$, we use $V^{X_{i}}$ and $V^{\hat{Y_{i}}}$ to denote random variables over the pixels of $X_{i}$ and $\hat{Y_{i}}$, respectively. The structure consistency constraint (SCC) is defined as

$L_{SCC}=-\frac{1}{N}\sum_{i=1}^{N}\widehat{rSMI}(V^{X_{i}},V^{\hat{Y_{i}}}),$ (5)

where $N$ is the number of samples, and $rSMI$ is computed via the relative Pearson (rPE) divergence [26] as

$rSMI(V^{X_{i}},V^{\hat{Y_{i}}})=D_{rPE}\big(P_{V^{X_{i}}}\otimes P_{V^{\hat{Y_{i}}}}\,||\,P_{(V^{X_{i}},V^{\hat{Y_{i}}})}\big).$ (6)
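For completeness, the relative Pearson divergence used above can be written, following [26], with a mixture density controlled by $\alpha\in[0,1)$ (our transcription; estimator details follow that reference):

```latex
% Relative Pearson (rPE) divergence between densities p and q, per Yamada et al. [26]:
D_{rPE}(p \,\|\, q) \;=\; \mathrm{PE}\!\left(p \,\|\, \alpha p + (1-\alpha) q\right),
\qquad
\mathrm{PE}(p \,\|\, q) \;=\; \frac{1}{2}\int q(x)\left(\frac{p(x)}{q(x)} - 1\right)^{2} dx .
```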

3.3 Hard Negative Decoupled infoNCE (hDCE)

If the negative patches are uncorrelated with the query patch (i.e., easy negative samples), the gradient of InfoNCE becomes small due to a large denominator, which can cause gradient vanishing and hurt learning efficiency. This is known as negative-positive coupling (NPC). It is therefore desirable to mine negative patches that are semantically correlated with the query patch (i.e., hard negative samples) and to use a loss function that alleviates the NPC effect. We re-sample negative samples with the von Mises-Fisher distribution [13] to obtain different latent classes and achieve a large semantic similarity (inner product) between query and negative patches. Mathematically, we have

$z^{-}\sim q_{\beta}(z^{-}),\quad\textrm{where}\quad q_{\beta}(z^{-})\propto e^{\beta z^{T}z^{-}}\cdot p(z^{-}),$ (7)

where $\beta$ is a concentration parameter that controls the similarity of hard negative samples to query samples.

As discussed in Section 2, the decoupled infoNCE (DCE) loss alleviates the NPC effect by removing the positive-pair term from the denominator of InfoNCE [9]. With the negative sampling strategy of [24], we can further refine the loss function as

$L_{hDCE}=\mathbb{E}_{(z,w)}\left[-\log\frac{\exp(w^{T}z/\tau)}{N\,\mathbb{E}_{z^{-}\sim q_{\beta}}[\exp(w^{T}z^{-}/\tau)]}\right],$ (8)

where $N$ is the number of negative patches and $\tau$ is the temperature parameter that controls the strength of penalties on hard negative samples.
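The sketch below illustrates one way Eqs. (7)-(8) can be realized in practice, approximating the von Mises-Fisher sampling by importance-reweighting in-batch negatives with weights proportional to $e^{\beta w^{T}z^{-}}$ as in [13]; the variable names and the in-batch approximation are our assumptions.

```python
import torch

def hdce_loss(w, z, z_neg, beta=1.0, tau=0.07):
    """Hard negative decoupled infoNCE (hDCE) sketch.
    w:     [B, D]    query embeddings
    z:     [B, D]    positive embeddings paired with the queries
    z_neg: [B, N, D] candidate negative embeddings per query
    The positive term is excluded from the denominator (decoupling), and negatives are
    reweighted toward hard ones with softmax(beta * similarity), mimicking q_beta in Eq. (7)."""
    pos = torch.exp((w * z).sum(dim=-1) / tau)               # [B], exp(w^T z / tau)
    sim_neg = torch.einsum('bd,bnd->bn', w, z_neg)           # [B, N], w^T z^-
    weights = torch.softmax(beta * sim_neg, dim=-1)          # hard-negative importance weights
    neg = (weights * torch.exp(sim_neg / tau)).sum(dim=-1)   # approx. E_{z^- ~ q_beta}[exp(.)]
    n = z_neg.shape[1]
    return (-torch.log(pos / (n * neg))).mean()              # Eq. (8)
```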

3.4 Full Objective Function

To ensure the distribution similarity between refined images in the target domain and real-world images in the destination domain, we also include the standard adversarial loss in the full objective:

$L_{GAN}(G,D,X,Y)=\mathbb{E}_{y\sim Y}\log D(y)+\mathbb{E}_{x\sim X}\log(1-D(\hat{y})).$ (9)

The full objective function takes the form

$L_{Full}=L_{SSRC}+\lambda_{hDCE}L_{hDCE}+\lambda_{GAN}L_{GAN},$ (10)

where $L_{SSRC}$, $L_{hDCE}$, and $L_{GAN}$ are given in Eqs. (1), (8), and (9), respectively, and $\lambda_{hDCE}$ and $\lambda_{GAN}$ are weights.
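As a rough illustration of how the generator-side objective in Eq. (10) could be assembled, the sketch below reuses the `src_loss` and `hdce_loss` sketches above; the `scc_loss` callable, the weight dictionary, and the non-saturating GAN form are our assumptions, and the discriminator update is omitted.

```python
import torch
import torch.nn.functional as F

def generator_loss(x, y_hat, d_fake_logits, embeds_x, embeds_y_hat, scc_loss, lambdas):
    """Full objective sketch, Eq. (10): L_SSRC + lambda_hDCE * L_hDCE + lambda_GAN * L_GAN.
    embeds_x / embeds_y_hat: lists of [B, N, D] patch embeddings, one entry per encoder layer.
    scc_loss: assumed callable estimating -rSMI between pixels of x and y_hat (Eq. (5))."""
    # SSRC = weighted SRC (feature space) + SCC (pixel space), Eq. (1).
    l_src = sum(src_loss(z, w) for z, w in zip(embeds_y_hat, embeds_x)) / len(embeds_x)
    l_ssrc = lambdas['src'] * l_src + lambdas['scc'] * scc_loss(x, y_hat)

    # hDCE over the same patch embeddings (patch 0 as query/positive, the rest as negatives).
    l_hdce = sum(hdce_loss(w[:, 0], z[:, 0], z[:, 1:])
                 for z, w in zip(embeds_y_hat, embeds_x)) / len(embeds_x)

    # Generator counterpart of the adversarial term in Eq. (9) (non-saturating form).
    l_gan = F.binary_cross_entropy_with_logits(d_fake_logits, torch.ones_like(d_fake_logits))

    return l_ssrc + lambdas['hdce'] * l_hdce + lambdas['gan'] * l_gan
```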

4 Experiments

We evaluate the performance of our method quantitatively and qualitatively in this section. We refine the synthetic GTA5 [27] and the more challenging Synthia (SYNTHIA-RAND-CITYSCAPES) [28] datasets toward the domain of the CityScapes dataset [29], as their labels are compatible. Specifically, we resize images in all datasets to $256\times 256$ and use the standard training/test splits of the GTA5 and CityScapes datasets. As no standard training/test split is provided for Synthia, we use the last 1,880 images for testing and the remaining 7,520 images for training.
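A small sketch of this data preparation is given below; the file layout and naming are hypothetical, and only the 256x256 resizing and the Synthia split (last 1,880 images for testing) come from the text.

```python
from pathlib import Path
from PIL import Image

def load_resized(path, size=(256, 256)):
    """Load an image and resize it to 256x256, as done for all datasets in our experiments."""
    return Image.open(path).convert('RGB').resize(size, Image.BICUBIC)

def synthia_split(image_dir):
    """Synthia has no standard split; we keep the last 1,880 images for testing
    and the remaining 7,520 for training. The file pattern is an assumption."""
    files = sorted(Path(image_dir).glob('*.png'))
    return files[:-1880], files[-1880:]   # (train, test)
```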

Table 1: Quantitative evaluation of our method and benchmarking methods in [14]. The left three columns report GTA5 → Cityscapes scores and the right three report Cityscapes parsing → Image scores; "\" marks results not reported.

| Methods  | pixel acc ↑ | class acc ↑ | mean IoU ↑ | pixel acc ↑ | class acc ↑ | mean IoU ↑ |
|----------|-------------|-------------|------------|-------------|-------------|------------|
| GAN+VGG  | 0.216 | 0.098 | 0.041 | 0.551 | 0.199 | 0.133 |
| DRIT++   | 0.423 | 0.138 | 0.071 | \     | \     | \     |
| GAN      | 0.382 | 0.137 | 0.068 | 0.437 | 0.161 | 0.098 |
| CycleGAN | 0.232 | 0.127 | 0.043 | 0.520 | 0.170 | 0.110 |
| CUT      | 0.546 | 0.165 | 0.095 | 0.695 | 0.259 | 0.178 |
| CUT+SCC  | 0.572 | 0.185 | 0.110 | 0.699 | 0.263 | 0.182 |
| Ours     | 0.654 | 0.186 | 0.113 | 0.714 | 0.263 | 0.184 |
Figure 3: Qualitative visual comparison of images refined by our method and other benchmarking methods (columns from left to right: Input, Ours, CUT, CUT+SCC, CUT+SRC).
Figure 4: Qualitative visual comparison of five images refined by our method on the more challenging Synthia dataset, where the first row shows five input synthetic images and the second row displays the corresponding refined images.

4.1 Quantitative Evaluation

To quantitatively evaluate our method in refining synthetic images and its effectiveness in improving downstream semantic segmentation, we train a segmentation network on the CityScapes dataset and then feed the refined synthetic data to it to predict segmentation label maps. Then, we compute the accuracy and Intersection over Union (IoU) scores by comparing the predicted maps with the ground-truth label maps of the synthetic dataset. High scores indicate semantic consistency between input and output images, which in turn benefits segmentation tasks.
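For reference, the three scores can be computed from a class confusion matrix as in the sketch below (our illustration following the standard definitions in [30], not the exact evaluation code):

```python
import numpy as np

def segmentation_scores(conf):
    """conf: [C, C] confusion matrix with conf[i, j] = #pixels of ground-truth class i predicted as class j."""
    conf = conf.astype(float)
    tp = np.diag(conf)
    gt = conf.sum(axis=1)              # pixels per ground-truth class
    pred = conf.sum(axis=0)            # pixels per predicted class
    pixel_acc = tp.sum() / conf.sum()
    valid = gt > 0                     # ignore classes absent from the ground truth
    class_acc = (tp[valid] / gt[valid]).mean()
    iou = tp[valid] / (gt[valid] + pred[valid] - tp[valid])
    return pixel_acc, class_acc, iou.mean()   # pixel acc, class acc, mean IoU
```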

To be consistent with previous work [14], we refine the first 500 images of the GTA5 test set and feed them to FCN-8s [30], an evaluation network pre-trained on CityScapes by pix2pix [31]. The network predicts segmentation label maps, and we compute the scores against the ground-truth label maps of GTA5. We compare the performance of our method with other methods in Table 1. As shown in the "GTA5 → Cityscapes" columns of the table, our method offers state-of-the-art performance by a substantial margin.

To further demonstrate the effectiveness of our method, we evaluate domain mappers on CityScapes using FCN scores. Specifically, we feed the ground-truth segmentation label maps to our model to generate predicted images. Then, we use the generated images as input to FCN-8s to predict segmentation label maps and compute the evaluation metrics between the predicted and ground-truth labels. As shown in the "Cityscapes parsing → Image" columns of Table 1, our method achieves state-of-the-art performance in this evaluation as well.

4.2 Qualitative Evaluation

Figure 3 shows four representative test images generated by four different methods: Ours, CUT, CUT with the SCC loss only, and CUT with the SRC loss only. Our method offers superior visual results by preserving both semantic and structural consistency between input and output. In contrast, the other three benchmarking methods exhibit more semantic distortions, such as rendering the sky as trees (or buildings) or distorting other regions. Clearly, the proposed SSRC loss outperforms the benchmarking variants that use only the SCC loss or only the SRC loss.

The effectiveness of our method is also demonstrated on the more challenging Synthia dataset. Five refined images are shown in Figure 4. Refining Synthia toward CityScapes is challenging because the two datasets differ in viewpoints, classes, and objects. For example, Synthia contains top-down views while CityScapes contains ground-level views, and more humans and objects appear in Synthia. Nevertheless, our method produces realistic-looking refined images that resemble those in the CityScapes dataset.

5 Conclusion

A novel unsupervised synthetic image refinement method combining contrastive learning and semantic-structural relation consistency (SSRC) was proposed in this work. By incorporating the SSRC loss, semantic and structural consistency between synthetic and refined images can be maintained, effectively reducing the semantic distortion. Consistency in this context is estimated by computing mutual information in both the feature and pixel spaces. Furthermore, we exploited both the semantic relation and hard negative mining to further improve the contrastive loss. Our method yields state-of-the-art performance, and its effectiveness was demonstrated quantitatively and qualitatively.

ACKNOWLEDGMENTS

This project was sponsored by US DoD LUCI (Laboratory University Collaboration Initiative) fellowship and US Army Research Laboratory. The authors also acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources. The first author would also like to extend her gratitude to Vasileios Magoulianitis for his valuable input in draft proofreading and editing.

References

  • [1] Shen, T., Zhao, G., and You, S., “A study on improving realism of synthetic data for machine learning,” (2023).
  • [2] Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., and Webb, R., “Learning from simulated and unsupervised images through adversarial training,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 2107–2116 (2017).
  • [3] Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J., “Learning to discover cross-domain relations with generative adversarial networks,” in [International conference on machine learning ], 1857–1865, PMLR (2017).
  • [4] Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A., “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in [Proceedings of the IEEE international conference on computer vision ], 2223–2232 (2017).
  • [5] Yi, Z., Zhang, H., Tan, P., and Gong, M., “Dualgan: Unsupervised dual learning for image-to-image translation,” in [Proceedings of the IEEE international conference on computer vision ], 2849–2857 (2017).
  • [6] Benaim, S. and Wolf, L., “One-sided unsupervised domain mapping,” Advances in neural information processing systems 30 (2017).
  • [7] Fu, H., Gong, M., Wang, C., Batmanghelich, K., Zhang, K., and Tao, D., “Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ], 2427–2436 (2019).
  • [8] Park, T., Efros, A. A., Zhang, R., and Zhu, J.-Y., “Contrastive learning for unpaired image-to-image translation,” in [Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16 ], 319–345, Springer (2020).
  • [9] Yeh, C.-H., Hong, C.-Y., Hsu, Y.-C., Liu, T.-L., Chen, Y., and LeCun, Y., “Decoupled contrastive learning,” in [Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI ], 668–684, Springer (2022).
  • [10] Chen, J., Gan, Z., Li, X., Guo, Q., Chen, L., Gao, S., Chung, T., Xu, Y., Zeng, B., Lu, W., et al., “Simpler, faster, stronger: Breaking the log-k curse on contrastive learners with flatnce,” arXiv preprint arXiv:2107.01152 (2021).
  • [11] Wang, W., Zhou, W., Bao, J., Chen, D., and Li, H., “Instance-wise hard negative example generation for contrastive learning in unpaired image-to-image translation,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision ], 14020–14029 (2021).
  • [12] Hu, Q., Wang, X., Hu, W., and Qi, G.-J., “Adco: Adversarial contrast for efficient learning of unsupervised representations from self-trained negative adversaries,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ], 1074–1083 (2021).
  • [13] Robinson, J., Chuang, C.-Y., Sra, S., and Jegelka, S., “Contrastive learning with hard negative samples,” arXiv preprint arXiv:2010.04592 (2020).
  • [14] Guo, J., Li, J., Fu, H., Gong, M., Zhang, K., and Tao, D., “Alleviating semantics distortion in unsupervised low-level image-to-image translation via structure consistency constraint,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ], 18249–18259 (2022).
  • [15] Atapattu, C. and Rekabdar, B., “Improving the realism of synthetic images through a combination of adversarial and perceptual losses,” in [2019 International Joint Conference on Neural Networks (IJCNN) ], 1–7, IEEE (2019).
  • [16] Lee, H.-Y., Tseng, H.-Y., Huang, J.-B., Singh, M., and Yang, M.-H., “Diverse image-to-image translation via disentangled representations,” in [Proceedings of the European conference on computer vision (ECCV) ], 35–51 (2018).
  • [17] Zheng, C., Cham, T.-J., and Cai, J., “The spatially-correlative loss for various image translation tasks,” in [Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ], 16407–16417 (2021).
  • [18] Oord, A. v. d., Li, Y., and Vinyals, O., “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748 (2018).
  • [19] Bachman, P., Hjelm, R. D., and Buchwalter, W., “Learning representations by maximizing mutual information across views,” Advances in neural information processing systems 32 (2019).
  • [20] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G., “A simple framework for contrastive learning of visual representations,” in [International conference on machine learning ], 1597–1607, PMLR (2020).
  • [21] He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R., “Momentum contrast for unsupervised visual representation learning,” in [Proceedings of the IEEE/CVF conference on computer vision and pattern recognition ], 9729–9738 (2020).
  • [22] Wei, C., Wang, H., Shen, W., and Yuille, A., “Co2: Consistent contrast for unsupervised visual representation learning,” arXiv preprint arXiv:2010.02217 (2020).
  • [23] Yang, C., An, Z., Cai, L., and Xu, Y., “Mutual contrastive learning for visual representation learning,” in [Proceedings of the AAAI Conference on Artificial Intelligence ], 36(3), 3045–3053 (2022).
  • [24] Jung, C., Kwon, G., and Ye, J. C., “Exploring patch-wise semantic relation for contrastive learning in image-to-image translation tasks,” in [Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition ], 18260–18269 (2022).
  • [25] Jia, Z., Yuan, B., Wang, K., Wu, H., Clifford, D., Yuan, Z., and Su, H., “Semantically robust unpaired image translation for data with unmatched semantics statistics,” in [Proceedings of the IEEE/CVF International Conference on Computer Vision ], 14273–14283 (2021).
  • [26] Yamada, M., Suzuki, T., Kanamori, T., Hachiya, H., and Sugiyama, M., “Relative density-ratio estimation for robust distribution comparison,” Neural computation 25(5), 1324–1370 (2013).
  • [27] Richter, S. R., Vineet, V., Roth, S., and Koltun, V., “Playing for data: Ground truth from computer games,” in [Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14 ], 102–118, Springer (2016).
  • [28] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M., “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 3234–3243 (2016).
  • [29] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B., “The cityscapes dataset for semantic urban scene understanding,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 3213–3223 (2016).
  • [30] Long, J., Shelhamer, E., and Darrell, T., “Fully convolutional networks for semantic segmentation,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 3431–3440 (2015).
  • [31] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A., “Image-to-image translation with conditional adversarial networks,” in [Proceedings of the IEEE conference on computer vision and pattern recognition ], 1125–1134 (2017).