Test-time Training for Data-efficient UCDR
Abstract
Image retrieval under generalized test scenarios has gained significant momentum in the literature, and the recently proposed protocol of Universal Cross-domain Retrieval (UCDR) is a pioneer in this direction. A common practice in any such generalized classification or retrieval algorithm is to exploit samples from many domains during training to learn a domain-invariant representation of data. Such a criterion is often restrictive, and thus in this work, for the first time, we explore the generalized retrieval problem in a data-efficient manner. Specifically, we aim to generalize any pre-trained cross-domain retrieval network towards any unknown query domain/category by adapting the model on the test data, leveraging self-supervised learning techniques. Toward that goal, we explore different self-supervised loss functions (for example, RotNet, JigSaw, and Barlow Twins) and analyze their effectiveness for this task. Extensive experiments demonstrate that the proposed approach is simple, easy to implement, and effective in handling data-efficient UCDR.
1 Introduction
Cross-domain data retrieval has emerged as an important and highly relevant research topic in today’s world because of the plethora of information being uploaded and shared over the internet in multiple forms or modalities (text, video, image, etc.) and categories. A large body of research addresses this problem, but mostly with a pre-defined domain of retrieval, such as text-based image retrieval [3], sketch-based image retrieval [12], audio-based object retrieval [1], etc. Even though all of these retrieval scenarios have important individual applications in real life, in recent times we have observed these domain boundaries to be blurrier than ever. For example, Google offers a search platform which can process user queries in the forms of keywords (text-based retrieval), voice commands (audio-based retrieval), and images, simultaneously. Maintaining domain-specific retrieval networks for each of these could incur huge training and maintenance costs. Thus we see an initiative to address such problems with domain-agnostic retrieval models, formally introduced as Universal Cross-domain Retrieval (UCDR) in [18]. UCDR combines the Domain Generalization (DG) efforts for classification [16] with traditional Zero-shot Sketch-based Image Retrieval (ZS-SBIR) [27][22]. The query, in this case, can belong to an unseen category as well as an unseen domain; thus the retrieval scenario becomes much closer to the real world and much more challenging than the stand-alone DG or ZS-SBIR problem.
However, to achieve such generalization across domains, both DG and UCDR methods combine training data collected over multiple domains (of the same set of categories) to learn a semantically-meaningful, domain-independent representation, which can translate directly to any unknown domain during retrieval. Clearly, this requires the models to be built with multi-domain, multi-category training samples in huge quantities. Additionally, any efficient domain-specific retrieval model (such as DSH [13] for SBIR), with a comparatively lower training-data requirement (only the sketch and image domains), becomes irrelevant in this case, since it would be heavily biased towards the sketch and image domains only. This is extremely restrictive in the real world, since collecting annotated data from multiple domains involves significant manual labour. Thus, we feel that it is time to take a step back and analyze the possibility of achieving such generalization in a data-efficient manner.
In this work, we aim to explore a data-efficient methodology to address the universal cross-domain retrieval problem. Specifically, we explore the possibility of re-using a cross-domain retrieval network trained on a comparatively smaller set (compared to DG or UCDR training) and adapting it to address retrieval under the UCDR protocol. Towards this goal, we choose the Semantic-neighbourhood and Mixture-prediction network (SnMpNet), originally proposed to address UCDR, as the baseline model. We first study the effect on its performance when it is trained with only two domains, e.g., sketch and image (instead of five, as in [18]). Next, we adapt this model, trained in a data-scarce setting, by leveraging the test-time samples from an unknown category and unknown domain, to improve its performance. This approach is based on our hypothesis that any information extracted from a query sample of an unknown domain and/or category may help an already trained model adapt quickly to the underlying distribution shift and thus produce a better retrieval list. Such adaptation is inspired by the test-time training (TTT) [23] strategy, originally proposed for classification and later applied to ZS-SBIR in [20]. TTT updates an already trained model with information extracted from test samples at test time, by minimizing a self-supervised loss function. In this work, we explore three different self-supervised losses for this purpose, namely (1) the RotNet loss [10], (2) the Jigsaw puzzle loss [17], and (3) the Barlow Twins loss [28]. Note that, unlike the original TTT [23], our proposed methodology (details in Section 3) does not require the inclusion of these self-supervised losses during the original training of the baseline model. Thus our test-time adaptation process is simpler and easy to implement. Additionally, this adaptation can be seamlessly incorporated with any existing retrieval algorithm, without modifying its original training or architecture.
Thus, we summarize the contributions of this work as follows:
1. We explore the UCDR problem in a data-efficient manner, instead of training on a large-scale multi-domain dataset.
2. We explore a number of self-supervision-based learning techniques to adapt the pre-trained base model towards unseen query data.
3. We perform extensive experiments and analysis on the large-scale DomainNet dataset to demonstrate the effectiveness of the proposed training and adaptation strategy.
The rest of the paper is organized as follows: we briefly discuss relevant recent work in Section 2, describe the proposed test-time training approach using self-supervised losses in Section 3, and present our findings and analysis in Sections 4 and 5. Finally, we conclude with a summary in Section 6.

2 Related Work
Here, we discuss some of the seminal works in image retrieval, test-time training, and self-supervised representation learning to elaborate on the background of this paper.
Zero-shot Sketch-based Image Retrieval (ZS-SBIR): This addresses the problem of sketch-based image retrieval when the query sketch belongs to a category that was not seen by the model during training. The problem was first introduced by [22][27] as a category-wise generalization of traditional sketch-based image retrieval (SBIR) [12][29]. Later, [6][7][8][9] reported significant improvements in this direction. The general approach followed in these papers is to learn a latent-space or shared-space representation of sketches and images by means of semantic supervision (generally the word2vec or GloVe representation of the training category names), so that sketches and images from the same categories are placed close to each other in this learned latent space. In contrast to this popular approach, [14] proposed a single-branch architecture for both domains with a domain-indicator function to reduce the number of trainable parameters. This architecture has been closely adopted in our base model SnMpNet [18] for UCDR.
Universal Cross-Domain Retrieval (UCDR): The UCDR protocol [18] further extends ZS-SBIR towards domain-wise generalization during retrieval. It moves beyond sketch queries to model more general real-life retrieval scenarios. The proposed model, SnMpNet, learns a domain-invariant and semantically meaningful representation of data for retrieval using a single-branch architecture as in [14]. The details of this model are discussed in Section 3.1.
Test-Time Training (TTT): This was first introduced by [23], where the test data is treated as an unlabelled dataset and the model weights are updated via self-supervised learning. It aims to improve model generalization by increasing robustness against distribution shifts, since in real-life situations the train and test data often belong to different distributions. The rotation-prediction task proposed by [10] is used as the self-supervision objective. Recently, [15] performed a detailed analysis of such test-time training strategies under significant distribution shift and proposed a test-time feature-alignment and moment-matching strategy to address them. Tent [24] adapts the model parameters to distribution shifts at test time by minimizing the entropy of model predictions. However, the entropy-minimization loss can only be used in the classification setting, since it requires class-probability outputs, which are not present in the retrieval setting of SnMpNet. [2] combines meta-learning, self-supervision, and test-time adaptation to address a corrupted image-classification benchmark on the CIFAR-10 dataset. A contrastive self-supervised learning technique is combined with pseudo-labeling in [4]. [25] proposes a continual-learning technique for test-time adaptation. Sketch3T [20] utilizes a self-supervised task of sketch-raster to sketch-vector translation, which helps adapt the model at test time to the unique style of new sketches, as well as to new categories. It is the first work to apply test-time training to the ZS-SBIR task; however, it deals solely with the sketch domain. In the UCDR setting, the query sample can belong to any unseen domain, and hence such a specialized self-supervised loss cannot be used for test-time adaptation.
Self-Supervised Learning: Recent self-supervised algorithms share a common methodology of learning semantic information about data that is independent of variations in style, orientation, and distortion. A few notable works in this direction are RotNet [10], Jigsaw Puzzles [17], Barlow Twins [28], BYOL [11], and SimCLR [5]. RotNet [10] argues that a model capable of predicting the rotation angle applied to an image necessarily has contextual and class awareness, and therefore the rotation-prediction task can be employed for self-supervised representation learning. Jigsaw Puzzles [17] formulates the self-supervision objective as a jigsaw puzzle-solving task to gain visuospatial understanding. Barlow Twins [28] computes a cross-correlation matrix between representations of distorted versions of the same batch and drives it towards the identity matrix. BYOL [11] proposes two neural networks, online and target, that interact and learn from each other while processing augmented versions of the same sample. SimCLR [5] proposes a contrastive self-supervised learning technique without requiring a memory bank. It is to be noted that the overall objective of self-supervised learning resonates well with the goal of UCDR, where we want to learn domain-independent representations of a class. Thus it is a logical choice for achieving generalization from unknown test data.
3 Proposed Method
We begin the discussion of the proposed method with a short description of the base model SnMpNet [18].
3.1 Base Model - SnMpNet:
The Semantic Neighbourhood and Mixture Prediction Network (SnMpNet) is a deep neural architecture with SE-ResNet50 as its backbone feature extractor. The network learns a 300-d semantic feature (through a linear projection layer) on top of the 2048-d SE-ResNet output. SnMpNet aims to make this semantic feature domain-invariant as well as semantically meaningful, so that domain-wise and category-wise generalization can be achieved simultaneously. Following CuMix [16], SnMpNet treats the input data in a mixed format, where the mixing may be performed inter- or intra-domain. Thus, for given samples $x_i$ and $x_j$ from a training set of $N$ samples, the input to the network is computed as $\tilde{x}_{ij} = \lambda x_i + (1-\lambda) x_j$. Here, $x_i$ and $x_j$ may belong to the same or different domains, and the mixing coefficient $\lambda \in [0,1]$ is sampled from a Beta distribution whose shape parameters are hyper-parameters. To obtain a domain-invariant representation of such mixed samples, the mixture-prediction loss $\mathcal{L}_{mp}$ is introduced, which predicts only the correct ratio of the component categories present in $\tilde{x}_{ij}$ and ignores their domains. Additionally, SnMpNet also minimizes a semantic-neighbourhood loss $\mathcal{L}_{sn}$, which is essentially a cross-entropy loss computed between the latent-space representation of $\tilde{x}_{ij}$ and its components’ semantic ground-truth (e.g., the word2vec representations of their category names). Combining both losses, the network learns meaningful, domain-agnostic semantic representations of data. Final retrieval is performed in the learned semantic space on the basis of the Euclidean distance between the query sample and the search-set instances.
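To make these two ingredients concrete, the following minimal PyTorch sketch illustrates the CuMix-style input mixing and the distance-based retrieval described above. The function names, the Beta parameter, and the exact sampling scheme are our assumptions for illustration, not the authors' released implementation.

```python
import torch

def mix_inputs(x_i, x_j, beta=2.0):
    # CuMix-style mixing: x_i and x_j may come from the same or from
    # different training domains; `beta` is an assumed hyper-parameter
    # of the Beta distribution the mixing coefficient is drawn from.
    lam = torch.distributions.Beta(beta, beta).sample()
    x_mix = lam * x_i + (1.0 - lam) * x_j   # mixed input fed to the network
    return x_mix, lam                       # lam is the mixture-prediction target

def retrieve(query_emb, search_embs, k=200):
    # Rank search-set instances by Euclidean distance to the query in the
    # learned semantic space (mAP@200 / Prec@200 use the top-200 list).
    dists = torch.cdist(query_emb.unsqueeze(0), search_embs).squeeze(0)
    return torch.topk(dists, k, largest=False).indices
```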
In our work, we retain the architecture and training methodology of SnMpNet unmodified, which allows this base model to be replaced by any other cross-domain retrieval algorithm with a shared-space representation-learning technique. Next, we discuss the test-time training proposed on top of such a retrieval algorithm to enhance its performance for UCDR. We begin this discussion with a brief overview of the self-supervision techniques explored for this purpose.
3.2 Self-supervision Techniques for Unknown Data
Self-supervised learning (SSL) has become a popular choice for learning from unlabeled or unstructured data. Here we explore the following three SSL loss components. We choose RotNet [10] and Jigsaw [17] for their simplicity, and Barlow Twins [28] for its effectiveness. We first detail the loss functions and then describe the adaptation process followed with each of these losses in this work.
RotNet Loss: RotNet [10] has been a very popular choice for introducing self-supervision in a feature-learning network, due to its simplicity and effectiveness. It uses four rotations of an input image, at angles of $0^\circ$, $90^\circ$, $180^\circ$, and $270^\circ$, and learns to predict the rotation-index of any such sample from its corresponding feature representation. Thus, the RotNet loss is computed as

$$\mathcal{L}_{rot} = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \mathcal{L}_{CE}(r_i, \hat{r}_i), \qquad (1)$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss computed between the true rotation-index $r_i$ and the predicted index $\hat{r}_i$.
The loss is averaged over the total $N_{te}$ samples present in the test set.
This loss is minimized without direct class supervision of the samples, but indirectly learns the semantic content of the data through rotation prediction.
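A minimal PyTorch sketch of this rotation-prediction loss follows; `backbone` (mapping images to the 300-d semantic embedding) and `aux_classifier` (the 4-way rotation head introduced later in this section) are placeholder names for the actual SnMpNet components, not the released code.

```python
import torch
import torch.nn.functional as F

def rotnet_loss(backbone, aux_classifier, x):
    # Four rotated copies of the batch: 0, 90, 180 and 270 degrees.
    rotations = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
    batch = torch.cat(rotations, dim=0)                  # (4B, C, H, W)
    # Rotation indices, aligned with the concatenation order above.
    targets = torch.arange(4, device=x.device).repeat_interleave(x.size(0))
    logits = aux_classifier(backbone(batch))             # (4B, 4) rotation logits
    return F.cross_entropy(logits, targets)              # Eq. (1), batch-averaged
```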
Jigsaw Puzzle Loss: Similar to RotNet, the jigsaw puzzle [17] is another very effective self-supervision component, which has been used in transfer learning [17], domain generalization [26], etc. Here, any input image is broken down into a number of patches based on a fixed grid size. For example, following the authors' approach in [17], we resize the image and then break it down into 9 patches using a $3 \times 3$ grid. These 9 patches are then shuffled, and a fixed subset of $P$ different permutations (out of a possible $9! = 362{,}880$) is used to create jigsaw images. The network is trained to predict the permutation index of such jigsaw images, forming the following jigsaw-puzzle loss:

$$\mathcal{L}_{jig} = \frac{1}{N_{te}} \sum_{i=1}^{N_{te}} \mathcal{L}_{CE}(p_i, \hat{p}_i), \qquad (2)$$

where this cross-entropy loss is computed between the true permutation-index $p_i$ and the predicted index $\hat{p}_i$, and averaged over the total $N_{te}$ samples present in the test set.
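The following sketch illustrates the jigsaw task. The subset size `P` and the random subset selection are assumptions for illustration ([17] selects a maximally-diverse subset of permutations), as are the helper names.

```python
import itertools
import random
import torch
import torch.nn.functional as F

P = 30  # assumed permutation-subset size
PERMUTATIONS = random.Random(0).sample(
    list(itertools.permutations(range(9))), P)  # fixed subset of the 9! orders

def make_jigsaw(img, perm_idx):
    # Shuffle the 3x3 grid of patches of one (C, H, W) image tensor.
    C, H, W = img.shape
    ph, pw = H // 3, W // 3
    patches = [img[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
               for r in range(3) for c in range(3)]
    shuffled = [patches[k] for k in PERMUTATIONS[perm_idx]]
    rows = [torch.cat(shuffled[r * 3:(r + 1) * 3], dim=2) for r in range(3)]
    return torch.cat(rows, dim=1)

def jigsaw_loss(backbone, aux_classifier, x):
    # Predict the permutation index of each shuffled image (Eq. 2).
    idx = torch.randint(P, (x.size(0),), device=x.device)
    puzzles = torch.stack([make_jigsaw(im, int(i)) for im, i in zip(x, idx)])
    logits = aux_classifier(backbone(puzzles))   # (B, P) permutation logits
    return F.cross_entropy(logits, idx)
```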
Barlow Twins Loss: We follow the formulation of this loss proposed in [28]. For each image $x_i$ in the test set, we create two differently augmented versions, $x_i^A$ and $x_i^B$. The augmented versions are created through various image operations, such as Gaussian blur, grayscale transformation, solarization, etc. A cross-correlation matrix $\mathcal{C}$ is computed between the batched feature representations $Z^A$ and $Z^B$ of these augmented versions, and the following loss function is minimized:

$$\mathcal{L}_{BT} = \sum_{d} (1 - \mathcal{C}_{dd})^2 + \lambda \sum_{d} \sum_{d' \neq d} \mathcal{C}_{dd'}^2, \qquad (3)$$

where

$$\mathcal{C}_{dd'} = \frac{\sum_{b} z^A_{b,d}\, z^B_{b,d'}}{\sqrt{\sum_{b} (z^A_{b,d})^2}\, \sqrt{\sum_{b} (z^B_{b,d'})^2}}, \qquad (4)$$

with $b$ indexing batch samples, $d, d'$ indexing the embedding dimensions, and $\lambda$ weighting the off-diagonal terms.
Here, the main idea is to drive the diagonal terms of the cross-correlation matrix to 1, so that the embedding becomes invariant to the applied distortions, while the off-diagonal terms are pushed towards zero to de-correlate the different components of the embedding vector [28]. This loss has been reported to be particularly successful in image-classification problems in the low-data regime.
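A minimal sketch of Eqs. (3)-(4) on the 300-d SnMpNet embeddings is given below; the trade-off value `lam` follows the public value from [28] and is an assumption here, not a tuned choice from our experiments.

```python
import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    # z_a, z_b: (B, D) embeddings of two augmented views of the same batch
    # (here the 300-d SnMpNet semantic embeddings).
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)   # per-dimension normalization
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / z_a.size(0)                   # (D, D) cross-correlation, Eq. (4)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()    # push diagonal towards 1
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()  # towards 0
    return on_diag + lam * off_diag                   # Eq. (3)
```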
In our implementation, these self-supervised losses come into play only to adapt the pre-trained base model on the test set during retrieval. Thus, we attach an additional auxiliary classifier to the existing pre-trained base model (on top of the 300-d semantic embedding of SnMpNet) to compute the preferred loss variant. However, this additional classifier is required only for the RotNet and Jigsaw loss computations; for Barlow Twins, we leverage the 300-d embeddings directly from SnMpNet to compute the $\mathcal{C}$-matrix. Next, we discuss our adaptation, or test-time training, in detail.
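For reference, the auxiliary classifier amounts to a small prediction head on the 300-d embedding; a sketch (layer shapes are our assumption) of the two variants that need it:

```python
import torch.nn as nn

# Hypothetical auxiliary heads on top of the 300-d SnMpNet embedding:
# 4 rotation classes for RotNet, P permutation classes for Jigsaw.
# Barlow Twins uses the embeddings directly and needs no head.
rotation_head = nn.Linear(300, 4)
jigsaw_head = nn.Linear(300, 30)   # 30 = assumed permutation-subset size P
```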
3.3 Test-time Training (TTT) for UCDR
Here, we discuss the proposed test-time training or adaptation framework to address UCDR in a data-efficient manner. In our setup, since the training data does not contain samples from many domains or categories, the generalization ability of the base SnMpNet model is expected to be low. In other words, when sketch and image are the only domains present during training, the network may be biased towards sketch-based image retrieval, and its performance degrades (details in the experiments section, Table 1) when a cartoon or a painting is presented as the query. We hypothesize that under such a condition, any information or clue extracted from the test query may help the network adapt and retrieve better for that query domain.
Towards that goal, we propose to perform a single-step parameter update of the network using the gradient computed through any of the SSL loss components discussed above. Thus, for any test sample, we perform a forward pass to compute the chosen SSL loss, compute the corresponding gradients, and back-propagate through the network components (the SE-ResNet50-based feature extractor, the linear projection layer for learning the semantic embedding, and the auxiliary classifier) just once, to tune the already trained SnMpNet on the basis of the test sample itself. We then make the final inference on the retrieval list for the query sample using this updated model, instead of the previously trained SnMpNet. The advantage of this proposed test-time adaptation is that any pre-trained retrieval network can be used as the base model, and no modification of the training strategy or additional data/computation during training is required. A minimal sketch of this update step is given below.
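The sketch assumes the SSL-loss helpers from Section 3.2 and the learning rates of Section 4.2; `adapt_step`, its arguments, and the weight-decay value are our names and assumptions, not the released implementation.

```python
import copy
import torch

def adapt_step(model, aux_head, ssl_loss_fn, query_batch,
               lr_base=1e-6, lr_clf=1e-5):
    # One forward pass, one backward pass, one parameter update (Sec. 3.3).
    model.train()
    opt = torch.optim.SGD(
        [{"params": model.parameters(), "lr": lr_base},     # feature extractor +
         {"params": aux_head.parameters(), "lr": lr_clf}],  # projection / aux head
        momentum=0.9, nesterov=True, weight_decay=5e-4)     # decay value assumed
    loss = ssl_loss_fn(model, aux_head, query_batch)        # e.g. rotnet_loss above
    opt.zero_grad()
    loss.backward()
    opt.step()                                              # exactly one update
    model.eval()

# Standard usage: adapt a throwaway copy so the pre-trained weights survive, e.g.
#   adapted = copy.deepcopy(pretrained)
#   adapt_step(adapted, rotation_head, rotnet_loss, query_batch)
```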
3.3.1 Proposed SnMpNet-variants
Depending on the SSL loss component used for test-time adaptation of the model, we propose three variants of SnMpNet, as described below:
1. rotation-SnMpNet: At test time, the test samples are rotated as stated previously to create an augmented set. The 300-d semantic embeddings of the samples in this set are extracted from the pre-trained SnMpNet and fed to the auxiliary classifier, which classifies their corresponding rotation angles. The network parameters are updated only once, on the basis of $\mathcal{L}_{rot}$ computed over this newly generated test set.
2. jigsaw-SnMpNet: Here, during test-time training, only $\mathcal{L}_{jig}$ is computed with the pre-trained SnMpNet plus the auxiliary classifier. The test samples are resized, and jigsaw images are created from them.
3. BT-SnMpNet: Finally, for this variant, noisy or distorted test samples are generated to compute the cross-correlation matrix $\mathcal{C}$ from their 300-d semantic embeddings obtained from the pre-trained SnMpNet. We compute $\mathcal{L}_{BT}$ based on $\mathcal{C}$ and update the model accordingly.
Thus, the proposed test-time training methodology is simple and can be seamlessly used with any trained cross-domain retrieval model, without any modification of its architecture or training process. With this discussion, we now move on to the experimental validation in the next section. Our overall approach with rotation-SnMpNet is illustrated in Figure 1 for reference.
4 Experiments
Table 1: Data-efficient UCDR results (mAP@200 and Prec@200) on DomainNet queries. The five-domain SnMpNet results are from [18]; the remaining rows use the two-domain (Sketch, Image) base model, with and without test-time adaptation.

| Query Domain | Training Domains | Method | mAP@200 (Unseen-class) | Prec@200 (Unseen-class) | mAP@200 (Seen+Unseen-class) | Prec@200 (Seen+Unseen-class) |
|---|---|---|---|---|---|---|
| Painting | Real, Sketch, Infograph, QuickDraw, Clipart | SnMpNet | 0.4031 | 0.3332 | 0.3635 | 0.3019 |
| | Sketch, Image | SnMpNet | 0.3827 | 0.3167 | 0.3480 | 0.2842 |
| | Sketch, Image | rotation-SnMpNet | 0.3880 | 0.3219 | 0.3508 | 0.2892 |
| | Sketch, Image | jigsaw-SnMpNet | 0.3807 | 0.3154 | 0.3441 | 0.2829 |
| | Sketch, Image | BT-SnMpNet | 0.3932 | 0.3337 | 0.3481 | 0.2905 |
| Clipart | Real, Sketch, Infograph, QuickDraw, Painting | SnMpNet | 0.4198 | 0.3323 | 0.3765 | 0.2959 |
| | Sketch, Image | SnMpNet | 0.3318 | 0.2539 | 0.2923 | 0.2172 |
| | Sketch, Image | rotation-SnMpNet | 0.3586 | 0.2811 | 0.3173 | 0.2432 |
| | Sketch, Image | jigsaw-SnMpNet | 0.3579 | 0.2803 | 0.3158 | 0.2425 |
| | Sketch, Image | BT-SnMpNet | 0.3465 | 0.2744 | 0.2987 | 0.2271 |
| QuickDraw | Real, Sketch, Infograph, Clipart, Painting | SnMpNet | 0.1736 | 0.1284 | 0.1512 | 0.1111 |
| | Sketch, Image | SnMpNet | 0.1845 | 0.1471 | 0.1551 | 0.1241 |
| | Sketch, Image | rotation-SnMpNet | 0.1931 | 0.1511 | 0.1600 | 0.1246 |
| | Sketch, Image | jigsaw-SnMpNet | 0.1905 | 0.1498 | 0.1577 | 0.1227 |
| | Sketch, Image | BT-SnMpNet | 0.1515 | 0.1231 | 0.1082 | 0.0819 |
| Infograph | Real, Sketch, Clipart, QuickDraw, Painting | SnMpNet | 0.2079 | 0.1717 | 0.1800 | 0.1496 |
| | Sketch, Image | SnMpNet | 0.1660 | 0.1322 | 0.1358 | 0.1071 |
| | Sketch, Image | rotation-SnMpNet | 0.1952 | 0.1592 | 0.1638 | 0.1331 |
| | Sketch, Image | jigsaw-SnMpNet | 0.1904 | 0.1539 | 0.1592 | 0.1281 |
| | Sketch, Image | BT-SnMpNet | 0.1634 | 0.1326 | 0.1322 | 0.1034 |
| Average | 5/6 DomainNet domains | SnMpNet | 0.3011 | 0.2414 | 0.2678 | 0.2146 |
| | Sketch, Image | SnMpNet | 0.2663 | 0.2125 | 0.2328 | 0.1832 |
| | Sketch, Image | rotation-SnMpNet | 0.2837 | 0.2283 | 0.2480 | 0.1975 |
| | Sketch, Image | jigsaw-SnMpNet | 0.2799 | 0.2248 | 0.2442 | 0.1940 |
| | Sketch, Image | BT-SnMpNet | 0.2636 | 0.2160 | 0.2218 | 0.1757 |
In this section, we discuss the experimental validation of the proposed test-time training strategy for data-efficient UCDR. As mentioned before, SnMpNet [18] is our base model throughout all experiments, and we analyse the performance of each of the variants discussed in the previous section. Specifically, we report our results for two different test cases: (1) Data-efficient UCDR, where the training set for SnMpNet is small (only two training domains, as in [6][8]), instead of the multi-domain setup of [18]; and (2) Traditional UCDR, where the training set contains large-scale multi-domain data, as in [18]. First, we briefly introduce the datasets used for this analysis.
4.1 Datasets
We experimented with two large-scale datasets:
Sketchy extended [21] contains 75,471 sketches and 73,002 images across 125 categories.
Following the ZS-SBIR setup of [6], 21 of these classes (which do not overlap with ImageNet) are reserved for testing, and the remaining classes are split between training and validation.
The presence of annotated samples from only two domains makes this dataset an excellent choice for the data-efficient generalized retrieval scenario.
Thus, we largely use this dataset for pre-training the base model, and use unknown-domain samples from DomainNet (excluding sketch and image) to validate the effectiveness of the proposed test-time adaptation in a data-efficient manner.
DomainNet [19] has approximately 0.6 million samples from 345 categories, collected over six domains, namely Clipart, Sketch, Real, Quickdraw, Infograph, and Painting. Following [16], 245 and 55 classes are used for training and validation, respectively. To simulate the unknown domain, we leave one domain out and use the rest for training. During testing, the samples from this left-out domain are used as queries. We train the base model using this setting for evaluating traditional UCDR only.
It is to be noted that, for a fair comparison, we always report results of SnMpNet and the proposed variants on the same training and test data-splits.
4.2 Implementation details
We use PyTorch 1.1.0 (following [18]) and a single Nvidia Tesla V100 GPU for all experiments. Since any pre-trained baseline model can be used for our purpose, we keep the same training parameters for SnMpNet whether it is trained with just two domains (for data-efficient UCDR) or with five domains (standard UCDR). For our test-time adaptation, we update the pre-trained model for a single iteration per query sample. To prevent large parameter divergence from the learned distribution, we use a low learning rate (1e-6 to 1e-5) and the SGD optimizer with weight decay and a Nesterov momentum of 0.9. The adaptation batch is created by applying standard data-augmentation techniques, such as random resized-crop, horizontal-flip, and color-jitter, to the query sample during the single iteration of training at test time; a sketch of this batch construction follows.
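As an illustration, the per-query adaptation batch can be built as below. This is a sketch: the transform parameters and batch size are assumptions, not the exact values used in our runs.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random resized-crop
    transforms.RandomHorizontalFlip(),      # horizontal-flip
    transforms.ColorJitter(0.4, 0.4, 0.4),  # color-jitter
])

def make_query_batch(query_img, batch_size=16):
    # Stack independently augmented copies of one PIL query image into
    # the batch used for the single test-time update iteration.
    to_tensor = transforms.ToTensor()
    return torch.stack([to_tensor(augment(query_img)) for _ in range(batch_size)])
```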
4.3 Test-time Training for Data-efficient UCDR
First, we analyze the effect of limited training data on UCDR. Towards that goal, we pre-train the SnMpNet model with the training split of the Sketchy-extended dataset [6], and test on query data from the additional domains of DomainNet [19], following [18]. This model is treated as the base model in the data-efficient UCDR scenario for the rest of the paper. We compare its performance with the results reported in [18], where the same model is trained with multi-domain data from DomainNet. Following [18], these results are reported in the form of mAP@200 and Prec@200 in Table 1. We also perform experiments for two different configurations of the search set: (1) when the search set contains samples from unseen categories only; and (2) when samples from both seen and unseen categories are present in the search set. We exclude the Sketch domain of DomainNet as a query in this experiment, as it is already present in the training set of Sketchy-extended and hence is not effectively an unseen domain for the model.
As we can observe from Table 1, the performance of SnMpNet drops significantly when the number of training domains is restricted. This result is in line with our expectations, since traditional SnMpNet is not designed to learn generalization in a data-efficient manner. This regression in performance is observed for all unseen query domains except Quickdraw. This may be because the content of Quickdraw is essentially sketch (though very abstract in nature); thus, SnMpNet can retrieve for this case more easily, compared to drastically different query domains (e.g., Painting or Infograph). This observation justifies the need for further effort in this direction.
Next, we explore the effectiveness of test-time training in the data-efficient learning context. We report the performance of all three SnMpNet variants under limited training domains in Table 1. We observe that for all four query domains (sketch excluded), test-time training improves the performance of SnMpNet [18]. In particular, rotation-SnMpNet significantly outperforms the base model on all four query domains and almost matches the performance of the original SnMpNet [18] trained on all five domains. These results are encouraging and support our earlier hypothesis that any information/hint extracted from the unknown test sample can improve the model's generalization performance under such a challenging condition. Moreover, the proposed adaptation does not require any modification of the pre-trained base network, which makes it easy to integrate with any data-efficient base model.
4.4 Test-time Training for Traditional UCDR
Next, we study the effectiveness of such test-time adaptation for traditional UCDR. Following [18], we assume that sufficient multi-domain data is available for training, and aim to explore whether the proposed test-time adaptation can further improve the performance of the model for UCDR. Thus, SnMpNet is pre-trained with training samples from five DomainNet domains (leaving one out) and used as the base model. When testing queries from the unknown domain, we apply the test-time adaptation discussed in Section 3. We summarize our observations in Table 2.
Table 2: Test-time adaptation for traditional UCDR (mAP@200 and Prec@200), with SnMpNet pre-trained on five DomainNet domains (leave-one-out).

| Query Domain | Method | mAP@200 (Unseen-class) | Prec@200 (Unseen-class) | mAP@200 (Seen+Unseen-class) | Prec@200 (Seen+Unseen-class) |
|---|---|---|---|---|---|
| Sketch | SnMpNet | 0.3007 | 0.2432 | 0.2624 | 0.2134 |
| | rotation-SnMpNet | 0.2959 | 0.2394 | 0.2624 | 0.2129 |
| | jigsaw-SnMpNet | 0.2929 | 0.2371 | 0.2632 | 0.2134 |
| | BT-SnMpNet | 0.2993 | 0.2479 | 0.2440 | 0.1987 |
| Quickdraw | SnMpNet | 0.1736 | 0.1284 | 0.1512 | 0.1111 |
| | rotation-SnMpNet | 0.1683 | 0.1244 | 0.1494 | 0.1094 |
| | jigsaw-SnMpNet | 0.1689 | 0.1250 | 0.1498 | 0.1100 |
| | BT-SnMpNet | 0.1613 | 0.1298 | 0.1134 | 0.0785 |
| Painting | SnMpNet | 0.4031 | 0.3332 | 0.3635 | 0.3019 |
| | rotation-SnMpNet | 0.3997 | 0.3301 | 0.3707 | 0.3064 |
| | jigsaw-SnMpNet | 0.3992 | 0.3301 | 0.3675 | 0.3041 |
| | BT-SnMpNet | 0.4072 | 0.3494 | 0.3615 | 0.3042 |
| Infograph | SnMpNet | 0.2079 | 0.1717 | 0.1800 | 0.1496 |
| | rotation-SnMpNet | 0.2058 | 0.1695 | 0.1815 | 0.1508 |
| | jigsaw-SnMpNet | 0.2053 | 0.1692 | 0.1819 | 0.1511 |
| | BT-SnMpNet | 0.1903 | 0.1597 | 0.1502 | 0.1229 |
| Clipart | SnMpNet | 0.4198 | 0.3323 | 0.3765 | 0.2959 |
| | rotation-SnMpNet | 0.4171 | 0.3295 | 0.3786 | 0.2978 |
| | jigsaw-SnMpNet | 0.4167 | 0.3298 | 0.3835 | 0.3020 |
| | BT-SnMpNet | 0.4281 | 0.3472 | 0.3790 | 0.2962 |
| Average | SnMpNet | 0.3010 | 0.2418 | 0.2667 | 0.2144 |
| | rotation-SnMpNet | 0.2974 | 0.2386 | 0.2685 | 0.2155 |
| | jigsaw-SnMpNet | 0.2966 | 0.2382 | 0.2692 | 0.2161 |
| | BT-SnMpNet | 0.2972 | 0.2468 | 0.2496 | 0.2001 |
We see that at least one of the proposed variants outperforms SnMpNet on the Painting and Clipart domains for both configurations of the search set. However, performance drops by small amounts for Sketch, Infograph, and Quickdraw, which also impacts the overall average performance; only BT-SnMpNet outperforms SnMpNet on Prec@200 for the unseen-class search set. We note this as a limitation of test-time adaptation in traditional UCDR: when sufficient data is available during training, the proposed self-supervision-based test-time adaptation cannot improve on the pre-trained base model's retrieval performance. Such a failure case is in line with the analysis of TTT reported by TTT++ [15] in the context of traditional image classification.
With these results, we now move on to the in-depth analysis of the proposed approach in the next section.
5 Analysis
Here, we provide the reasoning behind the design choices of the proposed test-time adaptation and ablate different learning components to understand the contribution of each. We choose the rotation-SnMpNet variant for this analysis, since it has demonstrated consistent improvement (refer to Table 1). The results in this section are reported with the data-efficient pre-trained base model, i.e., with two training domains for pre-training.
5.1 Ablation Studies
We first analyze the impact of different design parameters during test-time training of SnMpNet. We primarily identify the learning rate, embedding dimension, optimizer configuration, and number of iterations during the test-time update as crucial parameters affecting the final performance of the model. We summarize the effect of each of these parameters in Table 3, using Infograph as the query domain. We observe that the test-time adaptation strategy is largely robust to the number of update iterations, the optimizer configuration, and the embedding dimension, since the reported mAP@200 and Prec@200 remain similar across these conditions. However, the learning rate of this adaptation affects the retrieval results noticeably. We see that a learning rate of 1e-6 for the base model and 1e-5 for the auxiliary classifier performs much better than 1e-4 and 1e-3, respectively. For context, the final learning rate of the pre-trained SnMpNet is 1e-6 in our experiments. This finding is in line with the recommendation in [23] that the TTT learning rate should be of the same scale as that of the pre-trained model's final training epoch.
Table 3: Ablation of test-time training design parameters, with Infograph as the query domain. Mom. denotes Nesterov momentum; lrc/lrb are the learning rates of the auxiliary classifier and the base model; Emb. Dim is the embedding dimension.

| Iter. | Mom. | lrc/lrb | Emb. Dim | L2 decay | mAP@200 (Unseen-class) | Prec@200 (Unseen-class) | mAP@200 (Seen+Unseen-class) | Prec@200 (Seen+Unseen-class) |
|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | 1e-3/1e-4 | 300 | ✓ | 0.1869 | 0.1505 | 0.1561 | 0.1250 |
| 1 | ✗ | 1e-5/1e-6 | 300 | ✗ | 0.1952 | 0.1591 | 0.1638 | 0.1330 |
| 3 | ✓ | 1e-5/1e-6 | 300 | ✓ | 0.1940 | 0.1583 | 0.1626 | 0.1321 |
| 1 | ✓ | 1e-5/1e-6 | 2048 | ✓ | 0.1950 | 0.1587 | 0.1634 | 0.1325 |
| 1 | ✓ | 1e-5/1e-6 | 300 | ✓ | 0.1952 | 0.1592 | 0.1638 | 0.1331 |
5.2 Standard vs. Online TTT
Here, we explore variants of the test-time update strategy based on the literature [23]. We refer to the adaptation followed in this work as Standard adaptation, where the base-model parameters are updated based on each query sample and then discarded for the next query; essentially, for each query, the original pre-trained SnMpNet is adapted afresh, and retrieval is performed accordingly. We also explore the retrieval results using the online version [23] of adaptation, where the updates from each sample accumulate and the pre-trained base model keeps updating itself sequentially for each incoming query. This strategy has been reported to be effective in the classification setting for gradually changing distribution shifts at test time. However, the evaluation summarized in Table 4 clearly shows that it is not as effective as the standard adaptation process for data-efficient UCDR. This may be because UCDR is much more complex than the traditional classification problem: each query can potentially belong to a different domain and category, and a cumulative update from each of these samples can lead to complete parameter divergence of the model. Thus, the assumption of gradually changing distribution shifts that holds for traditional TTT benchmarks [23] is not valid for UCDR. The sketch below contrasts the two update strategies.
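The difference between the two strategies boils down to whether the pre-trained weights are restored before each query. A minimal sketch, reusing the hypothetical `adapt_step`, `make_query_batch`, and `retrieve` helpers from earlier sections:

```python
import copy
import torch

def run_retrieval(pretrained, aux_head, ssl_loss_fn, query_batches, search_embs,
                  online=False):
    # Each element of `query_batches` is one query's augmented batch
    # (see `make_query_batch`); its first entry serves as the query view here.
    model = copy.deepcopy(pretrained)            # online TTT: copied once, kept
    results = []
    for batch in query_batches:
        if not online:
            model = copy.deepcopy(pretrained)    # standard TTT: reset per query
        adapt_step(model, aux_head, ssl_loss_fn, batch)
        with torch.no_grad():
            q_emb = model(batch[:1]).squeeze(0)  # embed the query view
            results.append(retrieve(q_emb, search_embs))
    return results
```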
Table 4: Standard vs. online test-time adaptation (rotation-SnMpNet), in mAP@200 and Prec@200.

| Query Domain | TTT-version | mAP@200 (Unseen-class) | Prec@200 (Unseen-class) | mAP@200 (Seen+Unseen-class) | Prec@200 (Seen+Unseen-class) |
|---|---|---|---|---|---|
| Sketch | Standard | 0.2959 | 0.2394 | 0.2624 | 0.2129 |
| | Online | 0.2517 | 0.1904 | 0.2223 | 0.1685 |
| Quickdraw | Standard | 0.1683 | 0.1244 | 0.1494 | 0.1094 |
| | Online | 0.0988 | 0.0733 | 0.0867 | 0.0649 |
| Painting | Standard | 0.3997 | 0.3301 | 0.3707 | 0.3064 |
| | Online | 0.3405 | 0.2713 | 0.3153 | 0.2506 |
| Infograph | Standard | 0.2058 | 0.1695 | 0.1815 | 0.1508 |
| | Online | 0.1340 | 0.1006 | 0.1173 | 0.0880 |
| Clipart | Standard | 0.4171 | 0.3295 | 0.3786 | 0.2978 |
| | Online | 0.3320 | 0.2501 | 0.2947 | 0.2219 |
6 Conclusion
In this work, we proposed test-time training heuristics for the data-efficient UCDR task. We leveraged self-supervised learning techniques to adapt a pre-trained data-efficient retrieval model so that it generalizes to any unknown domain and/or category during inference. To the best of our knowledge, although test-time training has previously been explored in the classification and ZS-SBIR settings, this is the first work exploring data-efficient UCDR using test-time adaptation techniques. We also reported extensive experiments and in-depth analysis to corroborate our hypothesis, demonstrating that our approach is simple, easy to integrate, and effective for such a challenging retrieval paradigm.
References
- [1] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European conference on computer vision (ECCV), pages 435–451, 2018.
- [2] Alexander Bartler, Andre Bühler, Felix Wiewel, Mario Döbler, and Bin Yang. Mt3: Meta test-time training for self-supervised test-time adaption. In International Conference on Artificial Intelligence and Statistics, pages 3080–3090. PMLR, 2022.
- [3] Min Cao, Shiping Li, Juntao Li, Liqiang Nie, and Min Zhang. Image-text retrieval: A survey on recent research and development. In International Joint Conference on Artificial Intelligence, 2022.
- [4] Dian Chen, Dequan Wang, Trevor Darrell, and Sayna Ebrahimi. Contrastive test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 295–305, 2022.
- [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
- [6] Sounak Dey, Pau Riba, Anjan Dutta, Josep Llados, and Yi-Zhe Song. Doodle to search: Practical zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2179–2188, 2019.
- [7] Anjan Dutta and Zeynep Akata. Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5089–5098, 2019.
- [8] Titir Dutta, Anurag Singh, and Soma Biswas. Adaptive margin diversity regularizer for handling data imbalance in zero-shot sbir. In European Conference on Computer Vision, pages 349–364. Springer, 2020.
- [9] Titir Dutta, Anurag Singh, and Soma Biswas. Styleguide: zero-shot sketch-based image retrieval using style-guided image generation. IEEE Transactions on Multimedia, 23:2833–2842, 2020.
- [10] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
- [11] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
- [12] Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2862–2871, 2017.
- [13] Li Liu, Fumin Shen, Yuming Shen, Xianglong Liu, and Ling Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2871, 2017.
- [14] Qing Liu, Lingxi Xie, Huiyu Wang, and Alan L Yuille. Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3662–3671, 2019.
- [15] Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. TTT++: When does self-supervised test-time training fail or thrive? In Advances in Neural Information Processing Systems, 2021.
- [16] Massimiliano Mancini, Zeynep Akata, Elisa Ricci, and Barbara Caputo. Towards recognizing unseen categories in unseen domains. In European Conference on Computer Vision, 2020.
- [17] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.
- [18] Soumava Paul, Titir Dutta, and Soma Biswas. Universal cross-domain retrieval: Generalizing across classes and domains. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12056–12064, 2021.
- [19] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1406–1415, 2019.
- [20] Aneeshan Sain, Ayan Kumar Bhunia, Vaishnav Potlapalli, Pinaki Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. Sketch3t: Test-time training for zero-shot sbir. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7462–7471, 2022.
- [21] Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. The sketchy database: Learning to retrieve badly drawn bunnies. ACM Transactions on Graphics, 35(4):1–12, 2016.
- [22] Yuming Shen, Li Liu, Fumin Shen, and Ling Shao. Zero-shot sketch-image hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3598–3607, 2018.
- [23] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–9248. PMLR, 2020.
- [24] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
- [25] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. Continual test-time domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7201–7211, 2022.
- [26] Shujun Wang, Lequan Yu, Caizi Li, Chi-Wing Fu, and Pheng-Ann Heng. Learning from extrinsic and intrinsic supervisions for domain generalization. In European Conference on Computer Vision, 2020.
- [27] Sasi Kiran Yelamarthi, Shiva Krishna Reddy, Ashish Mishra, and Anurag Mittal. A zero-shot framework for sketch based image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018.
- [28] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.
- [29] Jingyi Zhang, Fumin Shen, Li Liu, Fan Zhu, Mengyang Yu, Ling Shao, Heng Tao Shen, and Luc Van Gool. Generative domain-migration hashing for sketch-to-image retrieval. In Proceedings of the European conference on computer vision (ECCV), pages 297–314, 2018.