¹¹institutetext: ¹Boston University ²Amazon ³Meta AI (FAIR)

SynCDR : Training Cross Domain Retrieval Models with Synthetic Data

Samarth Mishra¹ Carlos D. Castillo² Hongcheng Wang²
Kate Saenko^1,3 Venkatesh Saligrama¹

Abstract

In cross-domain retrieval, a model is required to identify images from the same semantic category across two visual domains. For instance, given a sketch of an object, a model needs to retrieve a real image of it from an online store’s catalog. A standard approach for such a problem is learning a feature space of images where Euclidean distances reflect similarity. Even without human annotations, which may be expensive to acquire, prior methods function reasonably well using unlabeled images for training. Our problem constraint takes this further to scenarios where the two domains do not necessarily share any common categories in training data. One instance where this can occur is when the two domains in question come from different versions of some biometric sensor recording identities of different people. We posit a simple solution, which is to generate synthetic data to fill in these missing category examples across domains. This, we do via category preserving translation of images from one visual domain to another. We compare approaches specifically trained for this translation for a pair of domains, as well as those that can use large-scale pre-trained text-to-image diffusion models via prompts, and find that the latter can generate better replacement synthetic data, leading to more accurate cross-domain. Our best SynCDR model can outperform prior art by up to 15%. Code for our work is available at https://github.com/samarth4149/SynCDR

Keywords:

Cross-domain Retrieval Synthetic Data Diffusion Models

1 Introduction

Refer to caption — Figure 1: Motivation. Cross Domain Retrieval problems show up in many applications, and prior work has developed solutions in different scenarios, including when labeled data is absent [22]. What these solutions rely on however is the presence of same category data in both domains so similar pairs can be discovered. When these are missing, such approaches can fail. Our solution is to make up for these missing examples using synthetic data (generated via label preserving translation). While such data may not be a perfect replacement (*e.g*. the real image generated from the painting in the above example may not be entirely realistic because of a white background), we show that they are still useful for training cross-domain retrieval models.

It is a non-trivial task for deep recognition models to generalize across visual domains, i.e., a model trained to recognize objects only in charcoal sketches fails to do so in real images due to a large training-test distribution shift [34]. This is one key challenge in cross-domain retrieval, where one may desire to retrieve an item from an online catalog based on a hand-drawn sketch.

Albeit with plentiful human annotated data in both domains available for training, this problem is simpler, one cannot always rely on their presence due to the expense of curating them. Kim et al. [22] attempted to solve this problem with access to only unlabeled images from the two domains at training time. They used these to train an embedding model that could be used to retrieve similar images (determined by the category of the primary object in the image) across domains. Their training method of cross-domain self-supervision (CDS), relied on optimizing for contrastive criteria—one within and one across domains. While within domains, the criterion used was instance discrimination, across domains, it was entropy minimization. This meant that the network discovered similar pairs in training instances across domains and was trained to reinforce the similarity measure between such highly similar instances. It is important to note that how effective this optimization is depends on (a) the network being initialized with pre-trained weights that can result in some meaningful initial similarity measures and (b) the presence of semantically similar (or same category) examples across domains.

While (a) may be available via the range of open-source pre-trained vision models (e.g. [48, 20]), (b) is problem specific and may not always be available. For instance, in some biometric data collection process, if a sensor is upgraded at some point, identities recorded with the old sensor would not be available with the new one and vice-versa. See Fig. 1 for an example in a dataset of bird images [45], where categories are defined by species of birds. In such scenarios with missing cross-domain pairs with same labels, we show that CDS is unable to learn representations that encode category semantics effectively across domains.

The solution we propose is to use synthetic data to fill in these gaps left by missing category examples in either domain. This we attempt to obtain via label-preserving translation across domains. Given an image in domain A, we wish to synthesize an image in domain B, which has the same category label as the original image. Note that this is done without access to the ground truth label information. An example is shown in Fig. 1. We then train our cross-domain retrieval model with synthetic data (SynCDR) using self-supervision losses used by prior work [22] as well as a new loss based on pseudo positive pairs, that utilizes these pair labels between a real image and its synthetic counterpart in the other domain, that we get for free from label-preserving cross-domain translations.

How should we generate the aforementioned synthetic data? One strategy could be to use image-to-image translation trained on unpaired data. While there is a gamut of prior work here [52, 21, 17, 18, 19, 23, 3, 25], their focus has primarily been on perceptual quality of generated images rather than their use as synthetic data for training recognition models. For instance, Lee et al. [23] used their GAN based translation method only for MNIST $\leftrightarrow$ MNIST-M digit recognition, which, prior work from the same year [38] reported better results on without using any synthetic data. In our experiments, we utilize contrastive unpaired translation (CUT) [25] as a representative approach, which is first trained on unlabeled data from the two domains for translation. This process is expensive, and can be avoided by using other pre-trained models in a text-guided fashion as we elucidate next.

The recent advent of large pre-trained text-to-image diffusion models such as Stable Diffusion [32], has facilitated text-guided image editing. This, we can also use for image translation without any subsequent target-specific training, using approaches like SDEdit [24] and InstructPix2Pix [2]. Because by design, these methods attempt to closely mimic the input image, we find that they often fail to accurately represent the target domain. Alternatively, they can edit the image too much (based on a control parameter) and not preserve category labels. A better solution, we found, is using a personalization approach such as ELITE [47], that can learn to encode input image content into Stable Diffusion’s vocabulary and more naturally combine it with the target domain, thus generating better synthetic data (see Fig. 2 for examples).

Our contributions can be summarized as follows:

•

We show prior work is ineffective for training cross-domain retrieval models when same category pairs are not available across domains.
•

We make up for these missing examples with synthetic data using label-preserving translation across domains and develop a loss to use pseudo positive pair labels that we get for free from such translations for training.
•

We compare multiple synthetic data generators, including translation methods trained specifically on a pair of domains, as well as those leveraging large-scale pre-trained generative models, which can be used without any target-specific training, via the use of prompts.
•

With the best synthetic data, SynCDR realizes performance benefits of $\sim$ 15% on the DomainNet dataset [26] and $\sim$ 6% on a challenging fine-grained recognition task over images of different bird species [45, 46], compared to baselines that do not use any synthetic data.

2 Related Work

Cross-domain Image Retrieval. Given a “query” image, there are multiple practical scenarios where one may expect to fetch similar images from a database. Often, there might be a shift in visual domains between the query image and the image database. Examples include retrieving person images from face sketches, retrieving product images using sketches/cell phone pictures, among others. Given these uses, there has been an array of prior research with multiple summary review articles [39, 6]. In the deep learning era, solutions for this problem involve learning a feature space where euclidean distances can reflect similarity and this metric is used to rank “target” images in the database, retrieving the top-1 or top few. Some characteristic approaches here have used category label information [36], automatically extracted attribute annotations [16], or triplet annotations [50, 40] for training a deep feature extractor.

Self-supervised Feature Learning for Cross-Domain Retrieval. While approaches above rely on labels for training, these may be expensive in labor-hours to collect. In the absence of any labels representation learning can be done by optimizing for different self-supervision objectives [10, 4, 11, 5]. Kim et al. [22], followed by others [51, 15], demonstrated that category discriminative domain-adaptive features can be learned without the presence of any labels using self-supervised criteria within and across domains. This, however, relies on the presence of same categories in both domains in training data. If unavailable, the cross-domain self-supervision losses can degrade retrieval performance by training the model to match unrelated instances. This is the problem we tackle in this paper, by using synthetically generated examples to make up for the missing categories of data.

Unpaired Image-to-Image Translation. One primary approach we adopt is translating an image such that its content is preserved but the style mimics the alternate domain. This way, we have synthetically added examples for missing categories in either domain. While there are prior work that can perform this translation when trained on paired images [21], we do not have such paired data. What we can use are methods that can be trained for this task with unpaired data from two domains [52, 19, 23, 25]. We use CUT [25] as a representative approach in our experiments.

Large Scale Pre-trained Text-to-Image Generative Models. The quality and diversity of generated images from diffusion models has given rise to models that can follow text instruction in image generation, via training on large-scale image-text paired data [31, 35, 32]. These models have enabled several further applications—image editing [24, 13, 2], personalization [9, 33, 47], text-to-3D generation [28], to name a few. Using the aforementioned image editing approaches [24, 2] can allow us to perform cross-domain translations for our use-case, as we shall see in Sec. 3. These methods, by design, attempt to closely resemble the input image, allowing control over the extent of resemblance via a single parameter. We find in many cases, they either cannot preserve image content well, or do not result in a large enough edit to effectively represent the target domain (see an example of the latter in Fig. 2). Using a personalization approach like ELITE [47] this can be mitigated. ELITE can encode a given image’s content as a new token in the vocabulary of Stable Diffusion, and then combine this more naturally with the target domain, without the restriction of closely resembling the given source image.

Synthetic Data from Text-to-Image Diffusion Models. Given the high fidelity and level of control that models like Stable-Diffusion [32] and Imagen [35] can provide, recent work has evaluated their use as data sources for training recognition models. [37] and [1] used them for generating Imagenet-like synthetic data and studied scaling and transfer properties of recognition models trained on them. Tian et al. developed a method StableRep [42], that used multiple generations from a single caption in a multi-positive contrastive loss to train an image representation with 20M synthetic images, that is as accurate as CLIP [30] trained on 50M real images. [43, 8, 29] used synthetic data from these models in classification problems with few/biased training examples. We extend this general research topic with our approach using these models for generating cross-domain translations.

3 Approach

Notation. Let the two domains be A and B. At training time, two sets of images are available (with no accompanying labels), $\mathcal{D}_{A}=\{\boldsymbol{x}^{A}_{i}\}_{i=1}^{n_{A}}$ and $\mathcal{D}_{B}=\{\boldsymbol{x}^{B}_{i}\}_{i=1}^{n_{B}}$ . The ground truth category labels are unknown but let each image in $\mathcal{D}_{A}$ come from a category label in $Y_{A}$ , and each image in $\mathcal{D}_{B}$ come from a category label in $Y_{B}$ . It is known that $Y_{A}\cap Y_{B}=\phi$ (This is not a strict requirement for SynCDR to function and we report results where $Y_{A}$ and $Y_{B}$ share some categories in Sec. 4.4). The outcome of training is a feature extractor $f:\mathcal{X}\rightarrow\mathbb{S}^{d}$ (where $\mathcal{X}$ is the support of all images, and $\mathbb{S}^{d}$ the $d$ -dimensional unit hypersphere $\mathbb{S}^{d}=\{\boldsymbol{x};\boldsymbol{x}\in\mathbb{R}^{d},\left|\left|x\right|\right|_{2}=1\}$ ), such that cosine similarities in the output feature space reflect category semantics.

For evaluation are provided two sets $\mathcal{D}^{(test)}_{A}$ and $\mathcal{D}^{(test)}_{B}$ both of which have images with known category labels and each set has at least one image from each label in $Y_{A}\cup Y_{B}$ . At test time, given an image $\boldsymbol{x}^{A}$ in $\mathcal{D}^{(test)}_{A}$ , it is desired to fetch images of the same category label from $\mathcal{D}^{(test)}_{B}$ , and vice-versa given an image in domain B. This is done by ranking in decreasing order $\{f(\boldsymbol{x}^{A})^{\top}f(\boldsymbol{x}^{B});\boldsymbol{x}^{B}\in\mathcal{D}^{(test)}_{B}\}$ and picking top K to calculate precision@K (for different values of K).

Synthetic Data. For our solution we use synthetic data generators $G_{A\rightarrow B}$ , which, given an image $\boldsymbol{x}^{A}\in\mathcal{X}^{A}$ ( $\mathcal{X}^{A}$ being the support of images from domain A), can generate a synthetic image $\boldsymbol{x}_{syn}^{B}$ with the content of $\boldsymbol{x}^{A}$ in the style of domain $B$ . We also have $G_{B\rightarrow A}$ , that can do this in the other direction. Using these, we generate $\mathcal{D}_{syn,B}=\{G_{A\rightarrow B}(\boldsymbol{x});\boldsymbol{x}\in\mathcal{D}_{A}\}$ and $\mathcal{D}_{syn,A}=\{G_{B\rightarrow A}(\boldsymbol{x});\boldsymbol{x}\in\mathcal{D}_{B}\}$ . Per our described motivation (Fig. 1), we can now combine these synthetic data with real data to make up for missing categories and train with a prior approach such as CDS [22] on this entire set. However, the manner in which we generated synthetic data provides us more information to use as described next.

Pseudo Positive Pairs (PPP). Given that $G_{A\rightarrow B}$ preserves content for image $\boldsymbol{x}^{A}$ , we can use the fact that $\boldsymbol{x}^{A}$ and $\boldsymbol{x}_{syn}^{B}$ are the same semantic category to train our model. In this context, we refer to $\boldsymbol{x}^{A}$ and $\boldsymbol{x}_{syn}^{B}$ as pseudo-positive pairs (PPP) and use them as follows in a PPP loss.

During training, we sample batches of size $m$ from real domain-A data $X_{A}=[\boldsymbol{x}^{A}_{i}]_{i=1}^{m}$ and $X_{syn,B}=[\boldsymbol{x}^{B}_{syn,i}]_{i=1}^{m}$ . Using the feature extractor $f$ , we define the loss

	$\displaystyle L_{PPP}(X_{A},X_{syn,B})$	$\displaystyle=\frac{1}{m}\sum_{\boldsymbol{x}^{A}\in X_{A}}-\log\left(\frac{\exp(f(\boldsymbol{x}^{A})^{\top}f(\boldsymbol{x}_{syn}^{B}))}{\sum_{\boldsymbol{x}\in X_{syn,B}}\exp(f(\boldsymbol{x}^{A})^{\top}f(\boldsymbol{x}))}\right)$
		$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}_{syn}^{B}\in X_{syn,B}}-\log\left(\frac{\exp(f(\boldsymbol{x}_{syn}^{B})^{\top}f(\boldsymbol{x}^{A}))}{\sum_{\boldsymbol{x}\in X_{A}}\exp(f(\boldsymbol{x}_{syn}^{B})^{\top}f(\boldsymbol{x}))}\right)$		(1)

which is a contrastive loss matching the real example $\boldsymbol{x}^{A}$ to its synthetic counterpart $\boldsymbol{x}_{syn}^{B}=G_{A\rightarrow B}(\boldsymbol{x}^{A})$ in a batch of synthetic examples and likewise in the opposite direction for a synthetic example, matching it to its real counterpart in a batch of real examples.

Training SynCDR. For training we combine $L_{PPP}$ with the CDS criterion [22] ( $L_{CDS}$ ), and minimize the following loss via minibatch stochastic gradient descent. To meet length constraints, we excluded a precise definition of $L_{CDS}$ and refer readers to Sec. 3 and Fig. 3 of [22]. For the sake of current discussion, we simply describe it as a combination of in-domain and cross-domain contrastive criteria given two sets of unlabeled images from two domains.

	$\displaystyle L$	$\displaystyle=L_{CDS}(\mathcal{D}_{A}\cup\mathcal{D}_{syn,A},\mathcal{D}_{B}\cup\mathcal{D}_{syn,B})$
		$\displaystyle+\frac{1}{2}\left[L_{PPP}(\mathcal{D}_{A},\mathcal{D}_{syn,B})+L_{PPP}(\mathcal{D}_{B},\mathcal{D}_{syn,A})\right]$		(2)

where the two arguments of $L_{CDS}$ reflect the unlabeled datasets in the two domains and we abuse notation described in Sec. 3 to describe the loss for entire sets rather than minibatches.

3.1 Methods for Synthetic Data Generation

We use the following methods for generating our synthetic cross-domain translations. In Fig. 2, we provide examples for translations generated by each.

Contrastive Unpaired Translation (CUT) [25] is an approach that trains GAN image generators for the requisite translation task using unpaired images from the two domains. For training this generator, we use all images in $\mathcal{D}_{A}$ and $\mathcal{D}_{B}$ . The training process is quite expensive, requiring $\sim$ 20 hrs on a single NVIDIA Tesla V100 gpu.

Img2Img/SDEdit [24] uses a pre-trained Stable Diffusion model for image editing by partially following the diffusion forward noising process (so that not all content in the input image is lost due to noising) and then denoising based on a text prompt. For instance in the example in Fig. 2, for translating real images of birds from the CUB dataset to paintings, we used the prompt “A painting of a bird.” and for translating sketches to paintings for Domainnet objects, we used “A painting of an object.”. The method allows for specification of edit strength $\in[0,1]$ , which is the fraction of diffusion noising steps to follow. The value $0$ leads to no edit and $1$ leads to completely following the prompt and retaining none of the input image information. In our experiments, we set this parameter to $0.3$ based on validation performance. For prompts used for all datasets and other generation parameters, we refer readers to Appendix 0.B.

InstructPix2Pix [2] is a Stable-Diffusion model fine-tuned and adapted for following natural language instructions for editing images. For converting to paintings (for both photos of birds as well as sketches of objects from Domainnet), we input the original image and the prompt “Convert to a painting.”. With InstructPix2Pix, we can provide a parameter $image\_guidance\_scale\geq 1$ . The larger the value of this parameter, more is the output image similar to the input image. In our experiments, we set this parameter to $1.2$ based on validation performance. Full prompts and other generation parameters for all datasets and additional examples are in Appendix 0.B.

The above translation methods attempt to closely resemble the original image in translations and we find that this is often not enough to accurately represent data in the other domain. In such scenarios, images may need additional content to make it look like the target domain. For instance, in Fig. 2, while the above methods do a decent job of converting paintings to pencil sketches or real photos to paintings, the translation in the opposite direction is much poorer. In translating a sketch to a painting (row 2), Img2Img and InstructPix2Pix fail to add color and CUT adds some incorrect colors based on what it could learn from unpaired images. Similar is the case in converting a painting of a bird to a realistic photograph (row 4), where a background may be expected but above methods fail to generate one (CUT does generate one, but does not generate an natural looking photograph). While Img2Img and InstructPix2Pix allow for increasing the amount of edit via a controllable parameter, in Appendix 0.B, we show that this often leads to non-preservation of image category. The following generation method can avoid these issues.

ELITE [47] is a personalization approach based on Stable Diffusion and a CLIP-based image encoder. The encoder encodes an object instance from an image as a word in Stable Diffusion’s vocabulary, which can then be used for generation with prompts. In the example in Fig. 2, for generating a painting (from either a real photo or a sketch), we input the image to ELITE’s image encoder and use the prompt “A painting of $\langle S\rangle$ .” (where $\langle S\rangle$ is the image encoded as a word, or more precisely, a token embedding). For full details, refer to Appendix 0.B.

Instead of being restricted by having to closely resemble the input image, ELITE can learn the object’s properties and more naturally combine them with those of the target domain. In Fig. 2 (rows 2 and 4), we see that this results in the most natural looking target domain images with the object from the original image. As we shall see in Sec. 4, this serves as the most effective synthetic data for learning cross-domain retrieval models.

4 Experiments

Datasets and Evaluation. We ran experiments on three different datasets : DomainNet [26], CUB [45, 46] and Office-Home [44]. We experimented with the 126 class subset of DomainNet used in [22] with 3 domains : clipart, painting and sketch. For CUB, we use two domains : Real (coming from the CUB-200-2011 data [45]) and Paintings (coming from [46]). In CUB, we removed categories that had fewer than 5 paintings, leaving us with 187 categories. The Office-Home dataset has 65 categories in 4 domains : Art, Clipart, Product and Real World.

We experimented with each possible pair of domains in each dataset. For the experiments, we made a 50:20:30 split of the data in the two domains using the first for training, second for validation and third for testing. For each pair of domains we split the training data into two disjoint sets of categories (of equal size) and run two experiments : one where category set 1 is from domain A and category set 2 is from domain B, and the other where this is reversed. Note that testing in either of these cases takes place on the same data, and we report average Prec@1 of retrieval in either direction (given domain A image, fetch a domain B image of same category and vice-versa), and the overall average of these over all pairs of domains in the dataset. In Sec. 0.A.3, we also report Prec@5 and Prec@15.

Other Implementation Details. Similar to CDS [22], we used a ResNet-50 [12] backbone pre-trained on Imagenet followed by a fully connected layer and $L_{2}$ normalization as our feature extractor. We trained for 15 epochs and perform early stopping based on average validation Prec@1. Other training hyperparameters are kept the same as CDS [22]. In our approach using synthetic data, we generated one synthetic example per real example in the training set. For full training details, please refer to Appendix 0.C.

4.1 Cross Domain Retrieval Results

Table 1: Cross Domain Retrieval Prec@1 in different scenarios of DomainNet. As highlighted in Fig. 1, prior art (CDS) cannot perform well (and is only about as good as a pre-trained ImageNet model) in this scenario where training data does not have examples from the same categories across domains. The cross-domain criterion from CDS, impedes performance and using only the In-domain ID criterion for training performs better. As seen in the results, synthetic data can make up for missing real examples and lead to improved performance. ELITE provides the best quality synthetic data for SynCDR, resulting in best cross-domain retrieval performance.

Model	Synthetic Data	Clipart - Painting	Clipart - Sketch	Painting - Sketch	Average
ImageNet (pt)	-	27.4	24.6	29.2	27.1
CDS [22]	-	28.6 ± 0.7	24.5 ± 0.6	30.5 ± 0.6	27.8 ± 0.6
In-domain ID [49]	-	30.7 ± 0.4	27.5 ± 0.2	31.8 ± 0.4	30.0 ± 0.4
SynCDR (Ours)	CUT	38.8 ± 0.8	40.1 ± 0.4	44.7 ± 0.6	41.2 ± 0.6
	Img2Img	38.2 ± 0.7	34.5 ± 0.7	37.4 ± 0.5	36.7 ± 0.6
	InstructPix2Pix	41.6 ± 0.6	39.8 ± 0.7	42.5 ± 0.7	41.3 ± 0.7
	ELITE	45.4 ± 0.7	44.2 ± 0.4	46.4 ± 1.3	45.3 ± 0.9

Methods for Comparison. We compare 4 versions of SynCDR using different synthetic data generation methods. We also compare them to prior work using no synthetic data. CDS is the method proposed by Kim et al. [22] which uses a combination of in-domain instance discrimination [49] (ID) and cross-domain matching. As another comparison, we drop the latter of these two criteria and train only with in-domain ID loss. As we shall see, in the absence of same category examples across domains, this can lead to better performance. Finally, we also report the performance of using simply an ImageNet-1K pre-trained ResNet-50 backbone, which serves as the initialization for all the other methods.

We digress briefly to note that while large image embedding models based on vision transformers [7] and CLIP [30] pre-training are available [20] and provide robust image embeddings across domains, they may be expensive to store and run on smaller devices at inference time. For instance, a CLIP ViT-B model (the smallest CLIP available) has 160M parameters compared to a ResNet-50’s 25M. We thus only use large pre-trained models at training time. Although knowledge distillation [14] is possible from these large CLIP models to smaller ones, as shown by Sun et al. [41] (in Fig. 3 of their paper), it still requires a large number of images and text captions (a few tens of millions). In Sec. 0.A.2, we show that distillation is not effective with the limited size of our training datasets (which also have no captions available).

Each of Tabs. 1, LABEL:, 2, LABEL: and 3 reports the results of cross-domain retrieval (Prec@1) across all pairs of domains for a given dataset. Each cell (except the ones for the pre-trained models) reports mean and standard deviation over 3 different training runs. The “Average” column reports the mean and the pooled standard deviation over the other columns in the table. We additionally report Prec@5 and Prec@15 in Sec. 0.A.3, while noting that they follow the same trends as seen in Prec@1.

Table 2: Cross Domain Retrieval Prec@1 for CUB

\leftrightarrow

CUB-Paintings. Compared to the Imagenet pre-trained model which serves as initialization for all methods, we find both CDS and In-domain ID help improve performance by a small amount. ELITE’s synthetic data again leads to the best performance with SynCDR. (For more discussion, refer to Sec. 4.1)

Model	Synthetic Data	Painting - Real
ImageNet (pt)	-	20.5
CDS [22]	-	22.0 ± 0.8
In-domain ID [49]	-	21.4 ± 1.2
SynCDR (Ours)	CUT	23.2 ± 0.5
	Img2Img	22.0 ± 0.5
	InstructPix2Pix	21.5 ± 0.7
	ELITE	28.2 ± 0.4

Table 3: Cross Domain Retrieval Prec@1 in different scenarios of Office-Home. As with other benchmarks, we find that SynCDR with ELITE generated data results in best performance on average over all scenarios in Office-Home. We see synthetic data results in relatively small performance improvements in scenarios with small domain gaps (e.g. Product-Real), but more significant improvements in larger domain gap scenarios (e.g. those involving Clipart). Refer to Sec. 4.1 for further discussion.

Model	Synthetic Data	Art - Clipart	Art - Product	Art - Real	Clipart - Product	Clipart - Real	Product - Real	Average
ImageNet (pt)	-	34.5	48.6	54.6	42.2	43.9	68.6	48.7
CDS	-	36.5 ± 0.3	50.0 ± 0.4	54.9 ± 0.3	43.0 ± 0.2	45.2 ± 0.3	69.1 ± 0.2	49.8 ± 0.3
In-domain ID	-	36.3 ± 0.3	50.0 ± 0.4	54.7 ± 0.3	43.4 ± 0.4	45.4 ± 0.4	69.0 ± 0.3	49.8 ± 0.3
SynCDR (Ours)	CUT	38.3 ± 0.4	50.8 ± 0.2	55.5 ± 0.6	43.3 ± 0.5	46.6 ± 0.3	69.3 ± 0.3	50.6 ± 0.4
	Img2Img	36.4 ± 0.3	49.9 ± 0.3	54.9 ± 0.5	43.4 ± 0.4	45.2 ± 0.4	69.0 ± 0.3	49.8 ± 0.4
	InstructPix2Pix	38.6 ± 0.5	50.2 ± 0.2	54.7 ± 0.3	43.9 ± 0.2	46.7 ± 0.4	69.0 ± 0.2	50.5 ± 0.3
	ELITE	38.3 ± 0.5	51.3 ± 0.3	55.4 ± 0.3	44.1 ± 0.6	46.3 ± 0.5	69.3 ± 0.3	50.8 ± 0.4

DomainNet. Tab. 1 reports the results. We can see that CDS and ID both improve on top of ImageNet pre-training but CDS, having the negative effect of incorrect matchings from the cross-domain loss, performs poorer. All SynCDR variants improve on this performance, with the worst being Img2Img. We find that the best performing synthetic data from Img2Img can only be attained with a fairly low edit strength. This is because using a higher edit strength substantially trades off Img2Img’s capacity to preserve the image content.

We also find that CUT and InstructPix2Pix perform at par with each other on average, but each is better in different scenarios. InstructPix2Pix is poorer in scenarios involving sketch since it is worse at “filling in color” into pencil sketches (as seen from Fig. 2), when simply prompted to convert a sketch to a painting or a clipart. On the other hand, CUT can add color to sketches based on it’s training on the two domains, even though these colors may not always be correct. In the clipart-painting scenario, InstructPix2Pix has a bigger edge in the quality of data being able to rely on Stable Diffusion’s prior understanding of these domains (more examples in Appendix 0.E). Finally, we see that synthetic data from ELITE performs the best. Aside from generating good quality data, it does not suffer from InstructPix2Pix’s limitations in adding color to sketches since ELITE parses the object into a textual token and then generates a painting of it, without having to be restricted by the amount of edit between the original and the generated image.

CUB. Tab. 2 reports the results for CUB. Compared to Domainnet, this dataset involves a much more fine-grained recognition task. We find that both CDS and ID provide improved performance over ImageNet pre-training albeit by a small amount. CDS does not seem to be disadvantaged by its cross-domain matching criterion possibly because it finds images that are similar enough across different bird categories. We find that CUT results in better synthetic data and consequently improves performance more than Img2Img or InstructPix2Pix (also seen in Fig. 2, row 4). This is possibly because most realistic photos of birds have a more consistent style than paintings of objects in DomainNet, where CUT is less accurate with adding colors (see Fig. 2 row 2). Again, similar to the case of DomainNet, ELITE from its ability of parsing objects and more naturally incorporating them in the target domain, leads to best quality synthetic data and hence, best retrieval performance.

Table 4: Ablating PPP — Cross-domain retrieval results on DomainNet. We see that using PPP, i.e., using the fact that cross-domain translations are label preserving for training, leads to big performance boosts for cross-domain retrieval. (up to 10% on average with synthetic data generated with ELITE)

Method (Synthetic Data)	Loss	Clipart - Painting	Clipart - Sketch	Painting - Sketch	Avg
SynCDR (CUT)	CDS	36.1 ± 0.8	36.8 ± 1.1	35.6 ± 3.8	36.2 ± 2.4
SynCDR (CUT)	CDS + PPP	38.8 ± 0.8	40.1 ± 0.4	44.7 ± 0.6	41.2 ± 0.6
SynCDR (Img2Img)	CDS	38.1 ± 0.6	33.8 ± 0.6	37.4 ± 0.4	36.4 ± 0.5
SynCDR (Img2Img)	CDS + PPP	38.2 ± 0.7	34.5 ± 0.7	37.4 ± 0.5	36.7 ± 0.6
SynCDR (InstructPix2Pix)	CDS	39.9 ± 0.6	37.5 ± 0.8	40.9 ± 0.7	39.4 ± 0.7
SynCDR (InstructPix2Pix)	CDS + PPP	41.6 ± 0.6	39.8 ± 0.7	42.5 ± 0.7	41.3 ± 0.7
SynCDR (ELITE)	CDS	38.2 ± 0.6	31.9 ± 0.9	36.1 ± 0.8	35.4 ± 0.7
SynCDR (ELITE)	CDS + PPP	45.4 ± 0.7	44.2 ± 0.4	46.4 ± 1.3	45.3 ± 0.9

Office-Home. The results for this dataset are in Tab. 3. As with other benchmarks, we found that SynCDR with ELITE’s synthetic data resulted in best cross-domain retrieval performance overall. We find on some domains similar to ImageNet data (such as Product-Real), an ImageNet pre-trained backbone performs quite well, and CDS and in-domain ID cannot improve much on this given they are only self-supervised training objectives functioning with no dataset specific labels. Additionally, we find that on such small domain gap scenarios (Product has realistic images on a white background, and some training images were even found to occur in both Product and Real domains), performance improvement from synthetic data is relatively small. This is because the problem of missing examples as shown in Fig. 1 is not as big with small domain gaps. On the other hand, when the domain gap is larger (e.g. when Clipart is involved) SynCDR helps improve performance more significantly.

Table 5: Analyzing Synthetic Data with CLIP features. On all pairs of domains in the DomainNet dataset, using CLIP as an image featurizer, we try to answer three questions : (1) How much does a synthetic generation method edit a real image on average (Distance to Source); (2) how well does the translation preserve category labels (NCM Accuracy); and (3) how well does the synthetic data mimic the real data that it replaces (Similarity to Real Target). ELITE, which leads to the best performing synthetic data brings about large edits, preserves labels reasonably well and mimics real data the best, out of the 4 different methods compared.

Synthetic Data	Distance to Source				NCM Accuracy				Similarity to Real Target
Synthetic Data	c-p	c-s	p-s	avg	c-p	c-s	p-s	avg	c-p	c-s	p-s	avg
CUT	0.41	0.30	0.39	0.37	0.45	0.63	0.47	0.52	0.57	0.62	0.61	0.60
Img2Img	0.15	0.14	0.17	0.15	0.67	0.72	0.58	0.65	0.57	0.64	0.59	0.60
InstructPix2Pix	0.29	0.24	0.23	0.25	0.58	0.65	0.61	0.61	0.60	0.64	0.62	0.62
ELITE	0.47	0.48	0.47	0.47	0.59	0.53	0.56	0.56	0.67	0.64	0.68	0.66

4.2 Ablating PPP

As described in Sec. 3, we use the fact that our cross-domain translations preserve category labels in the form of the pseudo-positive pairs (PPP) loss. In Tab. 4 we report the performance of methods where we ignore these pairs and simply combine synthetic and real data, training with the CDS criterion. Comparing with methods using no synthetic data Tab. 1, we find that the addition of synthetic examples does improve performance even when PPP is not used. Additionally, using PPP provides a big performance advantage (up to 10% on average in the case of ELITE). This differs for different synthetic data generators. In Sec. 4.3, we find that this boost from PPP is often low in cases where the amount of edit made by the translation method is small.

4.3 Further Analyzing Synthetic Data

In this section, we use a CLIP image featurizer [20], $f_{CLIP}:\mathcal{X}\rightarrow\mathbb{S}^{d}$ (following notation from Sec. 3, recall that $\mathbb{S}^{d}$ is the d-dimensional unit hypersphere), to quantitavely evaluate synthetic data from different generation sources (results in Tab. 5 on the Domainnet dataset). Here, we slightly abuse notation in that $\mathcal{D}_{A}$ and $\mathcal{D}_{B}$ now have examples from all categories in $Y_{A}\cup Y_{B}$ (previously $\mathcal{D}_{A}$ and $\mathcal{D}_{B}$ did not share any categories). Note that this is done here only for evaluating synthetic data generated by each method.

How different are synthetic images from the real counterparts they were generated from? To find this, we computed Distance to Source, reported in Tab. 5. Specifically, for an image $\boldsymbol{x}^{A}\in\mathcal{D}_{A}$ and its synthetic counterpart $\boldsymbol{x}_{syn}^{B}=G_{A\rightarrow B(\boldsymbol{x}^{A})}$ , we compute $1-f_{CLIP}(\boldsymbol{x}^{A})^{\top}f_{CLIP}(\boldsymbol{x}_{syn}^{B})$ and compute the average over all examples in $\mathcal{D}_{A}$ (note that this quantity is proportional to the square of the distance between the vectors, given they lie on the unit hypersphere). Similarly we compute the average of these distances for the opposite direction of translation over examples in $\mathcal{D}_{B}$ and report the average of the two for each pair of domains in Tab. 5.

We find the least edit amount from Img2Img data, followed by InstructPix2Pix and CUT, while ELITE generates images that are the most distinct from source since it is not tied down to closely mimicking the source real image. These reflect the amount of edits we can see in examples in Fig. 2. In Appendix 0.E, we present additional examples of synthetic data from each method. Additionally, factoring in results from Tab. 4, we find that larger edits corresponded to higher benefit of using PPP. In other words, the value of the pair label between a synthetic example and its real counterpart (indicating they are the same category) increases the more different the two are.

How well are classes preserved in generated synthetic data? This is necessary, because PPP relies on the translations being label-preserving. To measure, we compute nearest class mean (NCM) accuracy. Specifically, we compute NCM accuracy in the CLIP feature space, over examples in the set $\mathcal{D}_{syn,B}=\{G_{A\rightarrow B}(\boldsymbol{x});\boldsymbol{x}\in\mathcal{D}_{A}\}$ using the examples in $\mathcal{D}_{B}$ along with their ground truth category labels. In Tab. 5 we report the mean of this and accuracy over examples in $\mathcal{D}_{syn,A}$ using real examples and category labels from $\mathcal{D}_{A}$ . We find that Img2Img data has a high NCM accuracy which can possibly be explained by the lower amount of edit it makes to the real image during translation. While ELITE exhibits larger edits, it does not lag behind a lot in NCM accuracy, consequently making it a better synthetic data generator.

How well does synthetic data mimic real data? Ideally, the best synthetic data for performance is one that mimics the real data distribution (i.e. it is the best replacement for the missing real data in Fig. 1). Hence, we try to measure this using Similarity to Real Target. For an example $\boldsymbol{x}_{syn}^{B}$ , with class label $y$ (note that this is the same as the class label of the real example used to generate $\boldsymbol{x}_{syn}^{B}$ ), we compute $f_{CLIP}(\boldsymbol{x}_{syn}^{B})^{\top}\overline{f}_{CLIP}(\mathcal{D}_{B,y})$ (where $\overline{f}_{CLIP}(\mathcal{D}_{B,y})$ is the feature space mean of all examples from $\mathcal{D}_{B}$ with category label $y$ ), and find the average over all examples in $\mathcal{D}_{syn,B}$ . In Tab. 5 we report the mean of this and a similar average computed over examples in $\mathcal{D}_{syn,A}$ for each pair of domains. We find that ELITE generated data is closest to the real data, which is reflected in its efficacy as synthetic replacement for missing real data.

4.4 SynCDR performance with category overlap

Table 6: Performance in the case of non-zero category overlap across domains in training data. In the introduction, we motivated how synthetic data can come to the rescue in cases of missing similar category training examples across the two domains (i.e. 0% category overlap). In this experiment performed using the DomainNet dataset, we see that synthetic data (used to train SynCDR) is still useful to improve the performance of CDS [22] even with non-zero category overlap.

Method	Category Overlap in training data	C-P	C-S	P-S	Avg
CDS	0%	28.6 ± 0.7	24.5 ± 0.6	30.5 ± 0.6	27.8 ± 0.6
	50%	32.5 ± 0.6	29.3 ± 0.6	34.3 ± 0.9	32.1 ± 0.7
	100%	37.1 ± 0.5	35.1 ± 0.4	40.5 ± 0.5	37.6 ± 0.5
In-domain ID	0%	30.7 ± 0.4	27.5 ± 0.2	31.8 ± 0.4	30.0 ± 0.4
	50%	31.6 ± 0.3	28.3 ± 0.2	32.1 ± 0.2	30.6 ± 0.3
	100%	31.6 ± 0.3	29.1 ± 0.1	32.8 ± 0.7	31.2 ± 0.5
SynCDR (w ELITE)	0%	45.4 ± 0.7	44.2 ± 0.4	46.4 ± 1.3	45.3 ± 0.9
	50%	48.9 ± 0.9	44.4 ± 0.4	45.7 ± 1.4	46.3 ± 1.0
	100%	50.8 ± 0.4	48.5 ± 1.2	48.8 ± 1.1	49.4 ± 1.0

In Fig. 1, we motivated that synthetic data steps in to assist by replacing missing real training data in cases where similar categories of images are not available across domains. While this is the case when synthetic data is most effective, in this section, we show how much it can add to performance when training data across domains have overlapping categories.

In Tab. 6, on each scenario of DomainNet we report performance of different methods with different amounts of category overlaps. The dataset has images from $126$ categories. $0\%$ overlap corresponds to results from Tab. 1, each domain having training data from $63$ categories. In the case of $50\%$ overlap, each of domains A and B has training data from 84 categories, 42 of which are shared by both domains. In the case of $100\%$ overlap, each domain has training data from each of the $126$ different categories.

As discussed in the introduction, we can see the improved efficacy of CDS [22] as it gets training examples from same categories across the two domains. This is in contrast to in-domain ID which improved much less. We can see that SynCDR still helps improve over CDS performance significantly in this case.

5 Conclusion

Cross-domain retrieval models can be trained without labeled data but they rely heavily on presence of examples from same categories in both domains. In situations where this is unavailable, we show that they can perform poorly. In this paper, we presented a solution by replacing missing category examples with synthetic data. For this, we used label-preserving translation methods which could generate counterparts of real images in the opposite domain while preserving content, crucially, the semantic category. With such generated data, we get additional pair labels (indicating same category) for free, which we used during training by optimizing the pseudo positive pair PPP loss. We compared different synthetic data generators, including those that require specific training for a pair of domains as well as ones that can be prompted for translation and rely on pre-trained text-to-image generative diffusion models. Overall, with the best synthetic data, our SynCDR model could improve performance by $\sim$ 15% on the DomainNet dataset and $\sim$ 6% on the CUB dataset.

Limitations. The best synthetic data for SynCDR uses textual prompts which describe the domains. This may not work if domains involved cannot be described this way. Here, a method like Textual Inversion [9] could be used to learn the properties of a domain as a new token embedding. We evaluated this method on DomainNet in Sec. 0.A.1 but found its performance to be poorer on average than using textual descriptions. Experimenting on tasks where textually describing domains is not possible, is a topic of future work.

Acknowledgments. This work was funded by the National Science Foundation and the Hariri Institute at Boston University.

References

[1] Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., Fleet, D.J.: Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466 (2023)
[2] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18392–18402 (2023)
[3] Chang, H.Y., Wang, Z., Chuang, Y.Y.: Domain-specific mappings for generative adversarial style transfer. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16. pp. 573–589. Springer (2020)
[4] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)
[5] Chen, X., He, K.: Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15750–15758 (2021)
[6] Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys (Csur) 40(2), 1–60 (2008)
[7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[8] Dunlap, L., Umino, A., Zhang, H., Yang, J., Gonzalez, J.E., Darrell, T.: Diversify your vision datasets with automatic diffusion-based augmentation. arXiv preprint arXiv:2305.16289 (2023)
[9] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
[10] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: International Conference on Learning Representations (2018)
[11] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)
[12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[13] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
[14] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
[15] Hu, C., Lee, G.H.: Feature representation learning for unsupervised cross-domain image retrieval. In: European Conference on Computer Vision. pp. 529–544. Springer (2022)
[16] Huang, J., Feris, R.S., Chen, Q., Yan, S.: Cross-domain image retrieval with a dual attribute-aware ranking network. In: Proceedings of the IEEE international conference on computer vision. pp. 1062–1070 (2015)
[17] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE international conference on computer vision. pp. 1501–1510 (2017)
[18] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: ECCV (2018)
[19] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European conference on computer vision (ECCV). pp. 172–189 (2018)
[20] Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: Openclip (Jul 2021). https://doi.org/10.5281/zenodo.5143773, https://doi.org/10.5281/zenodo.5143773, if you use this software, please cite it as below.
[21] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1125–1134 (2017)
[22] Kim, D., Saito, K., Oh, T.H., Plummer, B.A., Sclaroff, S., Saenko, K.: Cds: Cross-domain self-supervised pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9123–9132 (2021)
[23] Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Diverse image-to-image translation via disentangled representations. In: Proceedings of the European conference on computer vision (ECCV). pp. 35–51 (2018)
[24] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073 (2021)
[25] Park, T., Efros, A.A., Zhang, R., Zhu, J.Y.: Contrastive learning for unpaired image-to-image translation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. pp. 319–345. Springer (2020)
[26] Peng, X., Bai, Q., Xia, X., Huang, Z., Saenko, K., Wang, B.: Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1406–1415 (2019)
[27] von Platen, P., Patil, S., Lozhkov, A., Cuenca, P., Lambert, N., Rasul, K., Davaadorj, M., Wolf, T.: Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers (2022)
[28] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988 (2022)
[29] Qraitem, M., Saenko, K., Plummer, B.A.: From fake to real (ffr): A two-stage training pipeline for mitigating spurious correlations with synthetic data. arXiv preprint arXiv:2308.04553 (2023)
[30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[31] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022)
[32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[33] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22500–22510 (2023)
[34] Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: Computer Vision–ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV 11. pp. 213–226. Springer (2010)
[35] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems 35, 36479–36494 (2022)
[36] Sangkloy, P., Burnell, N., Ham, C., Hays, J.: The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG) 35(4), 1–12 (2016)
[37] Sariyildiz, M.B., Alahari, K., Larlus, D., Kalantidis, Y.: Fake it till you make it: Learning (s) from a synthetic imagenet clone. arXiv preprint arXiv:2212.08420 (2022)
[38] Shu, R., Bui, H.H., Narui, H., Ermon, S.: A dirt-t approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735 (2018)
[39] Smeulders, A.W., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on pattern analysis and machine intelligence 22(12), 1349–1380 (2000)
[40] Song, J., Yu, Q., Song, Y.Z., Xiang, T., Hospedales, T.M.: Deep spatial-semantic attention for fine-grained sketch-based image retrieval. In: Proceedings of the IEEE international conference on computer vision. pp. 5551–5560 (2017)
[41] Sun, X., Zhang, P., Zhang, P., Shah, H., Saenko, K., Xia, X.: Dime-fm: Distilling multimodal and efficient foundation models. arXiv preprint arXiv:2303.18232 (2023)
[42] Tian, Y., Fan, L., Isola, P., Chang, H., Krishnan, D.: Stablerep: Synthetic images from text-to-image models make strong visual representation learners. arXiv preprint arXiv:2306.00984 (2023)
[43] Trabucco, B., Doherty, K., Gurinas, M., Salakhutdinov, R.: Effective data augmentation with diffusion models. arXiv preprint arXiv:2302.07944 (2023)
[44] Venkateswara, H., Eusebio, J., Chakraborty, S., Panchanathan, S.: Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5018–5027 (2017)
[45] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset (2011)
[46] Wang, S., Chen, X., Wang, Y., Long, M., Wang, J.: Progressive adversarial networks for fine-grained domain adaptation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9213–9222 (2020)
[47] Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848 (2023)
[48] Wightman, R.: Pytorch image models. https://github.com/rwightman/pytorch-image-models (2019). https://doi.org/10.5281/zenodo.4414861
[49] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3733–3742 (2018)
[50] Yu, Q., Liu, F., Song, Y.Z., Xiang, T., Hospedales, T.M., Loy, C.C.: Sketch me that shoe. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 799–807 (2016)
[51] Yue, X., Zheng, Z., Zhang, S., Gao, Y., Darrell, T., Keutzer, K., Vincentelli, A.S.: Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13834–13844 (2021)
[52] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)

Appendix 0.A Additional Results

0.A.1 Using Textual Inversion to learn domains

In the main paper under limitations, we mentioned that our synthetic data generation methods use simply a word or a phrase to represent the target domain (Tab. 12 lists these for different methods). This may not be possible in scenarios where this description is unavailable, e.g. in data collected from specific sensors. We hypothesize that in such a case, Textual Inversion [9] can be used. It allows us to encode a concept exemplified via a set of input images in a new token by optimizing its embeddings such that a pre-trained and frozen Stable Diffusion model generates the input images conditioned on this new token. Note that this description obfuscates some details to provide an intuitive understanding. For a full and precise description of textual inversion, we refer readers to Sec 3 of [9].

While in this paper we did not experiment with datasets where domains cannot be verbally described, we evaluated textually inverting domains on the DomainNet dataset. For textual inversion, we used a pre-trained Stable-Diffusion v1.4 model (which is the same version as used by ELITE), and encoded each domain (clipart, painting or sketch) using its unlabeled training images by using 5000 gradient steps, with a learning rate of $5\times 10^{-4}$ and a batch size of 4. We used code available in [27]. For generating a synthetic image in domain A, we use the prompt “An image of S in the style of $<d_{A}>$ ”, where S is the token corresponding to the input image object encoded by ELITE and $<d_{A}>$ corresponds to the textually inverted token encoding the properties of domain A.

Table 7: Using Textual Inversion to learn Domain Properties. We see that when we use a textually inverted token (see Sec. 0.A.1 and [9]) that encodes domain properties for generating synthetic data with ELITE, cross-domain retrieval performance of SynCDR is similar or decreases. This happens possibly because Stable Diffusion’s understanding of common domains such as paintings, clipart and sketches is quite good. Additionally, as discussed in Sec. 0.A.1, we see that in cases of generating paintings or clipart from sketches, using textual inversion is less accurate in preserving category of the input example. See examples in Fig. 3

Method	C-P	C-S	P-S	Avg
SynCDR (ELITE)	45.4 ± 0.7	44.2 ± 0.4	46.4 ± 1.3	45.3 ± 0.9
SynCDR (ELITE with textual inversion)	45.1 ± 0.5	41.8 ± 0.6	43.0 ± 0.8	43.3 ± 0.7

From the results in Tab. 7, we see that textually inverting a domain’s properties leads to similar or poorer results overall. It is poorer in the case when sketch is involved as a domain. On digging deeper, we find that when generating paintings or cliparts from sketches, using textual inversion is poorer at preserving the category of the input sketch example. For instance, NCM accuracy (as defined in Sec. 4.3 of the main paper) of synthetic paintings generated with textual inversion is 0.54 compared to 0.61 without (see some examples of such cases in Fig. 3). This suggests the generation process somehow get skewed more to pay attention on only the style in such cases. We leave further investigation of this method to future work.

0.A.2 Distilling large CLIP models for cross-domain retrieval

As mentioned in Sec. 4.1 (main paper) and as shown in Tab. 8, image embeddings from large CLIP transformer models are robust across domains, but are expensive to obtain on smaller memory devices at inference time. In the top section of Tab. 8, we report the performances of two of these CLIP pre-trained models. The CLIP RN-50 is a larger version of the ResNet-50 model with $\sim$ 40M parameters compared to the latter’s $\sim$ 25M. The CLIP ViT-L is a large transformer model, which provides highly robust image embeddings at test time, but is expensive to store and run with $\sim$ 305M parameters.

We additionally note that ELITE’s image encoder, which generates the token embedding encoding an input image concept, uses the CLIP ViT-L model. ELITE can hence be interpreted as a way of distilling the information in the CLIP ViT-L via synthetic training data. An alternative approach would be to distill the CLIP ViT-L model based on its feature similarities between pairs of images in the training data. This approach is described next.

CLIP-distill (Tab. 8) distills a CLIP ViT-L into a ResNet-50. Its optimization criterion is similar to Sun et al. [41], but uses only unlabeled images available to us as training data. More precisely, during training we are given a teacher (CLIP ViT-L) feature extractor $g:\mathcal{X}\rightarrow\mathbb{S}^{d}$ (recall $\mathbb{S}^{d}$ is the $d$ -dimensional unit hypersphere) and a student model $f:\mathcal{X}\rightarrow\mathbb{S}^{d}$ . Additionally for $m$ -sized minibatches of images $X_{A}$ from the domain A and $X_{B}$ from domain B, let us define $p_{f}(\boldsymbol{x},X)=\text{softmax}(f(\boldsymbol{x})^{\top}f(X))$ (where $\boldsymbol{x}\in\mathcal{X}$ and we overloaded notation for the batch of $m$ images $X$ so that $f(X)=[f(\boldsymbol{x});\boldsymbol{x}\in X]\in\mathbb{R}^{d\times m}$ )

$\displaystyle L_{distill}$	$\displaystyle(X_{A},X_{B})$
	$\displaystyle=\frac{1}{m}\sum_{\boldsymbol{x}^{A}\in X_{A}}D_{KL}(p_{f}(\boldsymbol{x}^{A},X_{B})\|\|p_{g}(\boldsymbol{x}^{A},X_{B}))$
	$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}^{A}\in X_{A}}D_{KL}(p_{f}(\boldsymbol{x}^{A},X_{A})\|\|p_{g}(\boldsymbol{x}^{A},X_{A}))$
	$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}^{B}\in X_{B}}D_{KL}(p_{f}(\boldsymbol{x}^{B},X_{A})\|\|p_{g}(\boldsymbol{x}^{B},X_{A}))$
	$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}^{B}\in X_{B}}D_{KL}(p_{f}(\boldsymbol{x}^{B},X_{B})\|\|p_{g}(\boldsymbol{x}^{B},X_{B}))$	(3)

where $D_{KL}$ is the KL divergence between two distributions. In essence, we attempt to distill the similarity predictions made by the teacher CLIP model $g$ for training images across domains (terms 1 and 3 of Sec. 0.A.2), and those within the same domain (terms 2 and 4 of Sec. 0.A.2).

From the results, we see that distilling the large CLIP model helps with performance compared to ID or CDS, but SynCDR (i.e. training with synthetic data) is a much more effective approach in terms of performance.

Table 8: CLIP Distillation over different scenarios of DomainNet. The top two rows contain the performance of using pre-trained CLIP models at inference time (CLIP ViT-L is a large transformer with

\sim

305M parameters, and CLIP RN-50 is a resnet-50 variant with

\sim

40M parameters). The next two rows, CDS and ID use train a ResNet-50 backbone in a self-supervised manner on the unlabeled training images available in the two domains. CLIP-distill distills the similarity predictions from a CLIP ViT-L model into a ResNet-50 using the same data. ELITE uses a CLIP ViT-L backbone in its image encoder for generating synthetic examples. As seen from the results, using these for training leads to a better performing model than distilling via feature similarities.

Model	C-P	C-S	P-S	Avg
CLIP (pt) - RN-50	35.3	39.9	39.5	38.2
CLIP (pt) - ViT-L	66.3	74.8	70.1	70.4
CDS	28.6 ± 0.7	24.5 ± 0.6	30.5 ± 0.6	27.8 ± 0.6
In-domain ID	30.7 ± 0.4	27.5 ± 0.2	31.8 ± 0.4	30.0 ± 0.4
CLIP-distill	33.2 ± 0.5	32.3 ± 0.5	33.4 ± 0.4	33.0 ± 0.5
SynCDR (with ELITE)	45.4 ± 0.7	44.2 ± 0.4	46.4 ± 1.3	45.3 ± 0.9

0.A.3 Full Retrieval Results

Table 9: Cross Domain Retrieval performance (Prec@1, 5, 15) across different scenarios of DomainNet. Here we additionally report precisions at different number of retrieved examples for all models from Tab. 1 (main paper).

Model	Synthetic Data	Clipart - Painting			Clipart - Sketch			Painting - Sketch
Model	Synthetic Data	Prec@1	Prec@5	Prec@15	Prec@1	Prec@5	Prec@15	Prec@1	Prec@5	Prec@15
ImageNet (pt)	-	27.4	23.4	19.2	24.6	20.4	16.8	29.2	25.5	21.5
CDS	-	28.6 ± 0.7	24.5 ± 0.7	20.5 ± 0.6	24.5 ± 0.6	20.8 ± 0.6	17.6 ± 0.6	30.5 ± 0.6	26.9 ± 0.6	23.0 ± 0.6
ID	-	30.7 ± 0.4	26.5 ± 0.4	22.2 ± 0.3	27.5 ± 0.2	23.3 ± 0.2	19.3 ± 0.3	31.8 ± 0.4	28.1 ± 0.3	24.0 ± 0.3
SynCDR (Ours)	CUT	38.8 ± 0.8	34.2 ± 0.7	30.0 ± 0.7	40.1 ± 0.4	36.5 ± 0.4	32.7 ± 0.4	44.7 ± 0.6	41.2 ± 0.5	37.2 ± 0.5
	Img2Img	38.2 ± 0.7	33.3 ± 0.6	28.7 ± 0.4	34.5 ± 0.7	30.8 ± 0.6	27.2 ± 0.5	37.4 ± 0.5	33.9 ± 0.5	30.1 ± 0.4
	InstructPix2Pix	41.6 ± 0.6	36.6 ± 0.6	32.0 ± 0.5	39.8 ± 0.7	35.9 ± 0.7	32.1 ± 0.6	42.5 ± 0.7	38.8 ± 0.6	34.6 ± 0.5
	ELITE	45.4 ± 0.7	41.6 ± 0.6	37.1 ± 0.6	44.2 ± 0.4	41.1 ± 0.3	37.6 ± 0.2	46.4 ± 1.3	43.9 ± 1.3	40.3 ± 1.2

Model	Synthetic Data	Average
Model	Synthetic Data	Prec@1	Prec@5	Prec@15
ImageNet (pt)	-	27.1	23.1	19.2
CDS	-	27.8 ± 0.6	24.1 ± 0.6	20.4 ± 0.6
ID	-	30.0 ± 0.4	26.0 ± 0.3	21.8 ± 0.3
SynCDR (Ours)	CUT	41.2 ± 0.6	37.3 ± 0.6	33.3 ± 0.5
	Img2Img	36.7 ± 0.6	32.7 ± 0.6	28.7 ± 0.5
	InstructPix2Pix	41.3 ± 0.7	37.1 ± 0.6	32.9 ± 0.5
	ELITE	45.3 ± 0.9	42.2 ± 0.9	38.3 ± 0.8

Table 10: Cross Domain Retrieval performance (Prec@1, 5, 15) for CUB

\leftrightarrow

CUB-Paintings. Here we additionally report precisions at different number of retrieved examples for all models from Tab. 2 (main paper).

Model	Synthetic Data	Painting - Real
Model	Synthetic Data	Prec@1	Prec@5	Prec@15
ImageNet (pt)	-	20.5	17.0	14.7
CDS	-	22.0 ± 0.8	18.2 ± 0.7	16.1 ± 0.6
ID	-	21.4 ± 1.2	18.3 ± 1.0	16.0 ± 0.6
SynCDR (Ours)	CUT	23.2 ± 0.5	19.3 ± 0.4	16.7 ± 0.4
	Img2Img	22.0 ± 0.5	18.2 ± 0.6	16.0 ± 0.5
	InstructPix2Pix	21.5 ± 0.7	17.8 ± 0.3	15.7 ± 0.3
	ELITE	28.2 ± 0.4	23.8 ± 0.2	21.0 ± 0.2

Table 11: Cross Domain Retrieval performance (Prec@1, 5, 15) in different scenarios of Office-Home. Here we additionally report precisions at different number of retrieved examples for all models from Tab 3 (main paper).

Model	Synthetic Data	Art - Clipart			Art - Product			Art - Real
Model	Synthetic Data	Prec@1	Prec@5	Prec@15	Prec@1	Prec@5	Prec@15	Prec@1	Prec@5	Prec@15
ImageNet (pt)	-	34.5	29.3	24.4	48.6	40.7	34.5	54.6	47.8	42.8
CDS	-	36.5 ± 0.3	31.4 ± 0.2	26.7 ± 0.2	50.0 ± 0.4	42.1 ± 0.1	36.3 ± 0.2	54.9 ± 0.3	48.9 ± 0.1	43.6 ± 0.1
In-domain ID	-	36.3 ± 0.3	31.2 ± 0.2	26.6 ± 0.1	50.0 ± 0.4	41.9 ± 0.1	36.1 ± 0.1	54.7 ± 0.3	48.9 ± 0.1	43.5 ± 0.1
SynCDR (Ours)	CUT	38.3 ± 0.4	33.3 ± 0.2	28.7 ± 0.1	50.8 ± 0.2	42.6 ± 0.1	37.0 ± 0.1	55.5 ± 0.6	49.7 ± 0.2	44.7 ± 0.1
	Img2Img	36.4 ± 0.3	31.4 ± 0.1	26.8 ± 0.1	49.9 ± 0.3	42.2 ± 0.2	36.4 ± 0.1	54.9 ± 0.5	49.2 ± 0.2	44.0 ± 0.2
	InstructPix2Pix	38.6 ± 0.5	33.0 ± 0.2	28.6 ± 0.3	50.2 ± 0.2	42.3 ± 0.2	36.6 ± 0.2	54.7 ± 0.3	49.3 ± 0.1	43.9 ± 0.1
	ELITE	38.3 ± 0.5	33.1 ± 0.2	29.1 ± 0.2	51.3 ± 0.3	42.5 ± 0.1	37.2 ± 0.1	55.4 ± 0.3	49.7 ± 0.2	44.8 ± 0.1
Model	Synthetic Data	Clipart - Product			Clipart - Real			Product - Real
Model	Synthetic Data	Prec@1	Prec@5	Prec@15	Prec@1	Prec@5	Prec@15	Prec@1	Prec@5	Prec@15
ImageNet (pt)	-	42.2	35.9	29.4	43.9	38.0	30.2	68.6	60.7	49.9
CDS	-	43.0 ± 0.2	38.0 ± 0.2	31.2 ± 0.1	45.2 ± 0.3	40.6 ± 0.2	32.5 ± 0.2	69.1 ± 0.2	62.5 ± 0.2	52.2 ± 0.2
In-domain ID	-	43.4 ± 0.4	37.9 ± 0.2	31.0 ± 0.1	45.4 ± 0.4	40.4 ± 0.2	32.4 ± 0.2	69.0 ± 0.3	62.4 ± 0.1	52.1 ± 0.1
SynCDR (Ours)	CUT	43.3 ± 0.5	38.0 ± 0.4	31.3 ± 0.3	46.6 ± 0.3	41.8 ± 0.3	34.5 ± 0.2	69.3 ± 0.3	62.8 ± 0.1	53.0 ± 0.1
	Img2Img	43.4 ± 0.4	38.1 ± 0.2	31.2 ± 0.2	45.2 ± 0.4	40.4 ± 0.3	32.6 ± 0.3	69.0 ± 0.3	62.6 ± 0.1	52.5 ± 0.1
	InstructPix2Pix	43.9 ± 0.2	39.2 ± 0.2	32.6 ± 0.2	46.7 ± 0.4	42.0 ± 0.2	34.0 ± 0.3	69.0 ± 0.2	62.5 ± 0.1	52.4 ± 0.1
	ELITE	44.1 ± 0.6	39.3 ± 0.4	32.7 ± 0.4	46.3 ± 0.5	41.6 ± 0.3	33.8 ± 0.5	69.3 ± 0.3	62.9 ± 0.3	53.1 ± 0.4

Model	Synthetic Data	Average
Model	Synthetic Data	Prec@1	Prec@5	Prec@15
ImageNet (pt)	-	48.7	42.1	35.2
CDS	-	49.8 ± 0.3	43.9 ± 0.2	37.1 ± 0.2
In-domainID	-	49.8 ± 0.3	43.8 ± 0.2	36.9 ± 0.1
SynCDR (Ours)	CUT	50.6 ± 0.4	44.7 ± 0.2	38.2 ± 0.2
	Img2Img	49.8 ± 0.4	44.0 ± 0.2	37.3 ± 0.2
	InstructPix2Pix	50.5 ± 0.3	44.7 ± 0.2	38.0 ± 0.2
	ELITE	50.8 ± 0.4	44.8 ± 0.3	38.4 ± 0.3

In Tabs. 9, LABEL:, 10, LABEL: and 11, we report Precs @1, 5 and 15 retrieved images as done in [22]. We observe largely similar trends using Prec@5 and Prec@15 as those using Prec@1, and results are reported for completeness.

0.A.4 Feature Visualization

Fig. 4 shows t-SNE visualization of the output features from the network before and after SynCDR training. The two domains in the plot are Clipart and Painting from the DomainNet dataset and we used 1000 points from the test set of each domain picked at random for the visualization. We see that features get better clustered class-wise and are more aligned across domains after SynCDR training.

Appendix 0.B Synthetic Data Generation Details

0.B.1 Contrastive Unpaired Translation (CUT)

CUT [25] trains a generator that can translate an input image to visually resemble a different domain. It can be trained using images from the two visual domains without the need for paired illustrations, making it viable for our use-case. CUT relies on optimizing a combination of a GAN loss and patchwise contrastive losses where the latter reuses features from the encoder of the image generator (which is a composition of a convolutional encoder and decoder) and uses negative patches from within the input image, resulting in a simple efficient approach without many components.

For our experiments, we used two CUT generators from code provided by Park et al. [25] trained for translation in either direction on the unlabeled training images in each of the two domains. This is the relatively slower higher quality version of CUT compared to FastCUT, a faster variant that Park et al. [25] also provide. For training each generator, we used a batch size of 4, and trained for 400 epochs (where each epoch was clipped to a maximum of 250 steps). While after training, image generation is fast (90 ms compared to few seconds for diffusion sampling methods—as described next), the training itself is relatively slow, taking approximately 20 hrs on an NVIDIA Tesla V100 gpu.

0.B.2 Img2Img/SDEdit

SDEdit [24] was a method first proposed for editing images before the advent of open large scale pre-trained text-to-image diffusion models. With a pre-trained denoising diffusion model the approach is straightforward, following the diffusion noising process on an input image, and then subsequently denoising it. What generates the edit is following the noising process partially based on an edit strength $\in[0,1]$ , such that not all information in the input image is lost. At the extremes, the value 0 for this parameter leads to no edit, and the value 1 leads to losing all input image information.

With a text-guided generative model like Stable Diffusion [32], this approach readily facilitates text-guided image editing using prompts. For our experiments, we use the implementation of StableDiffusionImg2ImgPipeline in [27] with the pre-trained model runwayml/stable-diffusion-v1-5. We used 50 diffusion steps, a guidance scale of 10 (these values being somewhat standard for Stable Diffusion image generation) and chose the final edit strength as 0.3, after computing validation accuracy for SynCDR using data for 3 different values : 0.3, 0.5 and 0.7. The prompts we used for translating to each domain are mentioned in Tab. 12. Since edit strength determines number of diffusion steps, the speed of the process depends on the edit strength parameter. For 0.3, we generated synthetic data at $\sim$ 2.8 seconds per image using an Nvidia Tesla V100 GPU, resulting in a total generation time of 6.7 hrs for our largest training dataset (painting domain of domainnet with $\sim$ 8600 images).

In Fig. 5, we show examples with different edit strength values. We notice in that example, that a higher strength can make a generated image look more like a painting, but at the cost of typically not retaining the input image category.

0.B.3 InstructPix2Pix

InstructPix2Pix [2] is a denoising diffusion model that can be conditioned on both an image and a text-prompt trained to carry out an edit to the input image, specified by the prompt. To control the amount of edit, we can use a parameter $image\_guidance\_scale\geq 1$ , larger values of which lead to the output image looking more like the input image.

For our experiments, we used StableDiffusionInstructPix2PixPipeline from [27], with the pre-trained model provided by [2] in timbrooks/instruct-pix2pix. We used 50 diffusion steps and a guidance scale of 10, and for setting the $image\_guidance\_scale$ we used validation accuracy across 4 different values of generation : [1, 1.2, 1.4, 1.6], finally choosing 1.2. The text prompts used to provide the edit instructions for each domain are in Tab. 12. Generation speed for this method is relatively slower than Img2Img at around 9 seconds per image resulting in a total generation time of 21 hrs on an NVIDIA Tesla V100 GPU for our largest training dataset (painting domain of domainnet with $\sim$ 8600 images).

Fig. 6 shows examples of InstructPix2Pix edits made with different values of $image\_guidance\_scale$ to convert a realistic photo of a bird to a painting. We notice that InstructPix2Pix does a decent job of editing across the different scales, however on occasion a smaller guidance can lead to large edits which may not preserve the bird category (for e.g. $image\_guidance\_scale=1$ in the second row).

0.B.4 ELITE

ELITE [47] is a method for fast personalized image generation from Stable Diffusion. Its goal is to learn an object from a single image and generate it in different contexts and styles based on a text prompt. This, ELITE does by encoding an input image into a token embedding that encodes the object into the dictionary of Stable Diffusion’s text encoder. Here we note that both ELITE and Textual Inversion [9] are personalized generation methods which can learn concepts from one or a few input images in the form of a new token embedding. While textual inversion does this via optimization for each new object/concept, which is slower, ELITE can do the same by using a trained encoder module which generates a new concept token embedding in one forward pass.

The full ELITE method uses both a global and a local image encoding module, where the latter is useful for better preservation of details from the input object (interested readers may see Figure 6 of [47]). In our experiments, we utilize only the global module since the latter additionally requires a segmentation mask around the primary object in the input image. For generation, we used the implementation of [47], which uses an image encoder based on a CLIP pre-trained ViT-L model from [30], and Stable Diffusion v1.4 for generation. The different prompts we used for generation are listed in Tab. 12. In these prompts, “S” corresponds to the special token that is output by ELITE’s image encoder representing the input object. Generation speed was about the same as InstructPix2Pix at around 9s per image on an NVIDIA Tesla V100 gpu resulting in 21 hrs total generation time for domainnet paintings domain (with $\sim$ 8600 images).

Table 12: Prompts used for synthetic data generation with different approaches. In the case of ELITE, “S” refers to the special token representing the input object, whose embedding is output by ELITE’s image encoder.

Dataset	Domain	Img2Img Prompt	InstructPix2Pix Prompt	ELITE Prompt
DomainNet	Clipart	A clipart image of an object.	Convert to a clipart image.	A clipart image of S
	Painting	A painting of an object.	Convert to a painting.	A painting of S
	Sketch	A pencil/charcoal sketch of an object.	Convert to a pencil/charcoal sketch.	A pencil/charcoal sketch of S
CUB	Painting	A painting of a bird.	Convert to a painting.	A painting of S
CUB	Real	A realistic photo of a bird.	Convert to a realistic photo.	A realistic photo of S
Office-Home	Art	A painting of an object.	Convert to a painting.	A painting of S
	Clipart	A clipart image of an object.	Convert to a clipart image.	A clipart image of S
	Product	A photo of an object without a background.	Convert to a photo without a background.	An image of S without a background
	Real	A realistic photo of an object.	Convert to a realistic photo.	A realistic photo of S

Appendix 0.C Other Training Details

For training SynCDR, we use most of the same hyperparameters as CDS [22]. Our model architecture is a ResNet-50 backbone followed by a fully connected layer resulting in an output feature dimension of 512. Before training, the backbone is initialized to Imagenet pre-trained weights. We use SGD with a learning rate of 0.003, with a batch size of 32 and momentum 0.9 for training SynCDR using a combination of the CDS and PPP losses as described in Sec. 3 of the main paper. We trained all models for 15 epochs validating after each epoch and do early stopping based on validation set Prec@1.

Appendix 0.D Code

The code for running SynCDR has been attached in a zip folder with the supplement. We intend to make it publicly available on github with the final version of the paper.

Appendix 0.E More Synthetic Generation Examples

Figs. 7, LABEL:, 8, LABEL: and 9 show some more examples of translations from the CUB and DomainNet datasets.

$\displaystyle L_{distill}$	$\displaystyle(X_{A},X_{B})$
	$\displaystyle=\frac{1}{m}\sum_{\boldsymbol{x}^{A}\in X_{A}}D_{KL}(p_{f}(\boldsymbol{x}^{A},X_{B})\|\|p_{g}(\boldsymbol{x}^{A},X_{B}))$
	$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}^{A}\in X_{A}}D_{KL}(p_{f}(\boldsymbol{x}^{A},X_{A})\|\|p_{g}(\boldsymbol{x}^{A},X_{A}))$
	$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}^{B}\in X_{B}}D_{KL}(p_{f}(\boldsymbol{x}^{B},X_{A})\|\|p_{g}(\boldsymbol{x}^{B},X_{A}))$
	$\displaystyle+\frac{1}{m}\sum_{\boldsymbol{x}^{B}\in X_{B}}D_{KL}(p_{f}(\boldsymbol{x}^{B},X_{B})\|\|p_{g}(\boldsymbol{x}^{B},X_{B}))$	(3)