
Multi-Similarity Contrastive Learning

Emily Mu
Massachusetts Institute of Technology
[email protected]
   John Guttag
Massachusetts Institute of Technology
[email protected]
   Maggie Makar
University of Michigan
[email protected]
Abstract

Given a similarity metric, contrastive methods learn a representation in which examples that are similar are pushed together and examples that are dissimilar are pulled apart. Contrastive learning techniques have been utilized extensively to learn representations for tasks ranging from image classification to caption generation. However, existing contrastive learning approaches can fail to generalize because they do not take into account the possibility of different similarity relations. In this paper, we propose a novel multi-similarity contrastive loss (MSCon) that learns generalizable embeddings by jointly utilizing supervision from multiple metrics of similarity. Our method automatically learns contrastive similarity weightings based on the uncertainty in the corresponding similarity, down-weighting uncertain tasks and leading to better out-of-domain generalization to new tasks. We show empirically that networks trained with MSCon outperform state-of-the-art baselines on in-domain and out-of-domain settings.

1 Introduction

Contrastive methods learn embeddings by pushing similar examples together and pulling dissimilar examples apart. Embeddings trained using contrastive learning have been shown to achieve state-of-the-art performance on a variety of computer vision tasks [30, 39, 21]. In contrastive learning, representations are trained to discriminate pairs of similar images (positive examples) from a set of dissimilar images (negative examples). Supervised contrastive learning approaches consider all instances with the same label to be positive examples and all examples with different labels to be negative examples [21].

Existing contrastive learning methods can fail to generalize because the learned embeddings are too simplistic, reflecting limited similarities between different examples. This limitation exists because current contrastive learning methods only consider a single way of defining similarity between examples. In settings where multiple notions of similarity are available, relying on only one notion of similarity represents a missed opportunity to learn more general representations [21, 17]. A challenge of generalizing to multiple notions is that training using multiple tasks adds complexity when tasks have different levels of uncertainty. Incorporating noisy similarity measures can lead to worse generalization performance. In multi-task and meta learning, it has been demonstrated that assigning different weights based upon relative task uncertainty can help models focus on tasks with low uncertainty, potentially leading to better classification accuracy and generalization towards new tasks and datasets [20, 26]. In this work, we propose multi-similarity contrastive loss (MSCon), a novel loss function that utilizes supervision from multiple similarity metrics and learns to down-weight more uncertain similarities.

Throughout, we will use shoe classification as a motivating example. Each of the shoes in Figure 1 is associated with distinct category, closure, and gender attributes. For example, images 1 and 2 are similar in category but are dissimilar in closure and gender, while images 2 and 3 are similar in gender but dissimilar in category and closure. We refer to such a dataset as a multi-similarity dataset. Other examples of multi-similarity datasets include multiple disease labels associated with chest radiographs [16, 19] and relational tables associated with website text [7, 5]. For convenience, we will refer to the similarity function induced by the labels of a task as the similarity metric of that task.

Suppose we are training a model using all three tasks: category, closure, and gender. Closure might be a task with low noise and low uncertainty, while gender might be a task with high noise and higher uncertainty. We find that our approach learns a higher weight for closure than for gender, ensuring that the model focuses more on closure during training.


Figure 1: Shoe Example. An example illustrating multiple disjoint similarity relationships between three images of shoes.

Our framework is shown in Figure 2. MSCon uses multiple projection heads to learn embeddings based on different metrics of similarity. In this way, a pair of examples can be treated as positive in one projected subspace and negative in a different projected subspace. Additionally, we model similarity-dependent uncertainty by first constructing a pseudo-likelihood function. Since our contrastive loss uses a non-parametric approach to learn the similarities between two inputs, we use the pseudo-likelihood function to approximate label uncertainty in the learned similarity spaces. We then learn a weighting parameter for each similarity metric that maximizes this pseudo-likelihood.

In extensive experiments, we show that our weighting scheme allows models to learn to down-weight more uncertain similarity metrics, which leads to better generalization of the learned representation to novel tasks. We also show that embeddings trained with our multi-similarity contrastive loss outperform embeddings trained with traditional self-supervised and supervised contrastive losses on two multi-similarity datasets. Finally, we show that embeddings trained with MSCon generalize better to out-of-domain tasks than do embeddings trained with multi-task cross-entropy.

Our main contributions are:

  1. We propose a novel multi-similarity contrastive learning method for utilizing supervision based on multiple metrics of similarity.

  2. We propose a weighting scheme to learn robust embeddings in the presence of possibly uncertain similarities induced by noisy tasks. Our weighting scheme learns to down-weight uninformative or uncertain tasks, leading to better out-of-distribution generalization.

  3. We empirically demonstrate that a network trained with our multi-similarity contrastive loss performs well on both in-domain and out-of-domain tasks and generalizes better than multi-task cross-entropy methods to out-of-domain tasks.


Figure 2: Multi-Similarity Contrastive Network. Multiple projection heads are trained to learn from multiple metrics of similarity. The base encoding network and the projection heads are trained together. The projection heads are discarded and only the encoding network is kept for downstream tasks. During training, our network is able to learn weightings for each similarity metric based on the task uncertainty.

2 Related Work

2.1 Contrastive Representation Learning

Our work draws from existing literature in contrastive representation learning. Many of the current state-of-the-art vision and language models are trained using contrastive losses [30, 39, 8, 21, 13]. Self-supervised contrastive learning methods, such as MoCo and SimCLR, maximize agreement between two different augmentations or views of the same image [13, 8]. Recently, vision-language contrastive learning has allowed dual-encoder models to pretrain with hundreds of millions of image-text pairs [18, 30]. The resulting learned embeddings achieve state-of-the-art performance on many vision and language benchmarks [39, 35, 22]. Supervised contrastive learning, SupCon, allows contrastive learning to take advantage of existing labels [21]. Contrastive learning has also been adapted to learn from both labels and text [36] and from hierarchies of labels [41]. The method most similar to ours is conditional similarity networks [34]. In conditional similarity networks, masks are learned or assigned to different embedding dimensions with respect to different metrics of similarity. These masks are learned jointly with the convolutional neural network parameters during training time. Conditional similarity networks differ from our work in two major ways. First, unlike our work, conditional similarity networks use triplet loss, a specialized version of contrastive loss, and at training time they require triplets based on each similarity metric. Second, we automatically learn separate projection spaces and weights for each metric of similarity, whereas they learn a linear transformation from the embedding space for each similarity and do not consider weighting metrics. As we show in Section 4.3, our multi-similarity contrastive network consistently outperforms conditional similarity networks.

2.2 Multi-Task Learning

Multi-task learning aims to simultaneously learn multiple related tasks and often outperforms learning each task alone [20, 6, 24]. However, if tasks are weighted improperly during training, the performance on some tasks suffers. Various learned task weighting methods have been proposed for multi-task learning in the vision and language domains [26, 20, 9, 32, 23, 24, 25]. These methods learn task weightings based on different task characteristics in order to improve the generalization performance towards novel tasks [26]. This is done by regularizing the task variance using gradient descent [9, 25] or by using adversarial training to divide models into task-specific and generalizable parameters [23]. Overwhelmingly, these methods are built for multiple tasks trained with likelihood-based losses, such as regression and classification. One of the most popular of these methods models task uncertainty to determine task-specific weighting and automatically learns weights to balance this uncertainty [20]. In our work, we adapt automatically learned task weighting to our multi-similarity contrastive loss by predicting similarity uncertainty. This is not straightforward since the contrastive loss is trained in a pairwise fashion and there is a lack of absolute labels in the learned output (a set of embedding vectors) [4].

2.3 Uncertainty in Contrastive Learning

Adapting uncertainty estimation techniques to contrastive learning remains an active area of research. This is because contrastive learning learns abstract embedding vectors rather than absolute labels and because of the pairwise training of contrastive models. Given access to training data and labels, previous work has proposed estimating the density and consistency of the hypersphere embedding space distribution as metrics to estimate uncertainty [4, 29]. The density of the embedding space at a point captures the amount of data the model has observed during training, and can serve as a proxy for epistemic, or model, uncertainty [29]. The consistency of the embedding space at a point uses k-nearest-neighbors to measure the extent to which the training data mapped closest to that point have consistent labels, and can serve as a proxy for aleatoric, or data-dependent, uncertainty [4]. Other recent work proposes learned temperature as a metric of heteroscedastic, or input-dependent, uncertainty to identify out-of-distribution data for labeled datasets [40]. To our knowledge, we are the first work to model similarity-dependent uncertainty, or the relative confidence between different training tasks, in the contrastive setting.

3 Method

3.1 Multi-Similarity Setup

We assume that during training time, we have access to a dataset $\mathcal{D}=\{x_{i},\textbf{Y}_{i}\}_{i=1}^{M}$, where $x_{i}$ is an image and $\textbf{Y}_{i}=\{y_{i}^{1},...,y_{i}^{C}\}$ are distinct categorical attributes associated with the image. We aim to learn an embedding function $f:x\rightarrow\mathbb{R}^{d}$ that maps $x$ to an embedding space. We define $h_{i}=f(x_{i})$ to be the embedding of $x_{i}$.

In the typical contrastive training setup, training proceeds by selecting a batch of $N$ randomly sampled data points $\{x_{i}\}_{i=1...N}$. We randomly sample two distinct label-preserving augmentations (e.g., from rotations, crops, flips) for each $x_{i}$, ($\tilde{x}_{2i}$ and $\tilde{x}_{2i-1}$), to construct $2N$ augmented samples, $\{\tilde{x}_{j}\}_{j=1...2N}$. Let $A(i)=\{1,...,2N\}\backslash i$ be the set of all samples and augmentations not including $i$. We define $g$ to be a projection head that maps the embedding to the similarity space, represented as the surface of the unit sphere $\mathbb{S}^{d}=\{v\in\mathbb{R}^{d}:||v||_{2}=1\}$. Finally, we define $v_{i}=g(h_{i})$ as the mapping of $h_{i}$ to the projection space.
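
As a concrete illustration, below is a minimal PyTorch-style sketch of this batch construction; the specific augmentation parameters (crop size, jitter strength) are assumptions rather than the exact pipeline used in the experiments.

```python
import torch
from torchvision import transforms

# Illustrative label-preserving augmentations (crops, flips, color jitter);
# the exact parameters here are assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(112),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def make_augmented_batch(images):
    """Given N PIL images, return the 2N augmented samples (two views per image)."""
    views = []
    for x in images:
        views.append(augment(x))   # view corresponding to x~_{2i-1}
        views.append(augment(x))   # view corresponding to x~_{2i}
    return torch.stack(views)      # shape (2N, 3, 112, 112)
```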

Supervised contrastive learning uses labels to implicitly define the positive sets of examples. Specifically, supervised contrastive learning encourages samples with the same label to have similar embeddings and samples with a different label to have different embeddings. We follow the literature in referring to samples with the same label as an image $i$ as the positive samples, and samples with a different label than that of $i$ as the negative samples.

Supervised contrastive learning (SupCon) [21] proceeds by minimizing the loss:

L^{supcon}=\sum_{i\in I}\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(v_{i}^{T}v_{p}/\tau)}{\sum_{a\in A(i)}\exp(v_{i}^{T}v_{a}/\tau)},   (1)

where $|S|$ denotes the cardinality of the set $S$, $P(i)$ denotes the positive set containing all other samples with the same label as $x_{i}$, i.e., $P(i)=\{j\in A(i):y_{j}=y_{i}\}$, $I$ denotes the set of all samples in a particular batch, and $\tau\in(0,\infty)$ is a temperature hyperparameter.

In contrast to SupCon, our multi-similarity contrastive (MSCon) approach proceeds by jointly training an embedding space using multiple notions of similarity. We do so by training the embedding with multiple projection heads $g^{c}$ that map the embedding to $C$ projection spaces, where each space distinguishes the image based on a different similarity metric. We define $v^{c}_{i}=g^{c}(h_{i})$ to be the mapping of $h_{i}$ to the projection space by projection head $g^{c}$. Because each projection space is already normalized, we assume that each similarity loss is similarly scaled. We define the multi-similarity contrastive loss to be a summation of the supervised contrastive loss over all similarity metrics, $L^{mscon}=\sum_{c\in C}\sum_{i\in I}L^{mscon}_{c,i}$, where each conditional loss $L^{mscon}_{c,i}$ is defined as in Equation 2. Specifically,

L^{mscon}_{c,i}=\frac{-1}{|P^{c}(i)|}\sum_{p\in P^{c}(i)}\log\frac{\exp(v_{i}^{cT}v^{c}_{p}/\tau)}{\sum_{a\in A(i)}\exp(v_{i}^{cT}v^{c}_{a}/\tau)},   (2)

where $P^{c}(i)$ is defined as the positive set under similarity $c$, such that for all $j\in P^{c}(i)$, $y_{j}^{c}=y_{i}^{c}$.
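
The following is a minimal PyTorch sketch of Equations 1 and 2: the per-similarity supervised contrastive loss computed on one projection head, and the unweighted MSCon loss obtained by summing it over all heads. The function and variable names are illustrative, not taken from our released code.

```python
import torch

def per_similarity_loss(v, labels, tau=0.1):
    """Supervised contrastive loss for one similarity metric (Eqs. 1-2), a sketch.

    v:      (2N, d) L2-normalized projections v_i^c of the augmented batch.
    labels: (2N,)   integer labels y_i^c (each image's label repeated for both views).
    """
    n = v.shape[0]
    sim = v @ v.T / tau                                    # v_i^T v_a / tau for all pairs
    self_mask = torch.eye(n, dtype=torch.bool, device=v.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # exclude i itself from A(i)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask   # P^c(i)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)          # |P^c(i)|
    return (-(log_prob * pos_mask).sum(dim=1) / pos_counts).sum()          # sum over i

def mscon_loss(projections, label_sets, tau=0.1):
    """Unweighted MSCon: sum the per-similarity loss over all C projection heads."""
    return sum(per_similarity_loss(v_c, y_c, tau)
               for v_c, y_c in zip(projections, label_sets))
```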

3.2 Contrastive Task Weighting

In the above formulation of our multi-similarity contrastive loss function, each similarity is weighted equally. However, previous work in multi-task learning for both vision and language has demonstrated that model performance can deteriorate when one or more of the tasks is noisy or uncertain. One way to tackle this is to learn task weights based on the uncertainty of each task. However, model performance can be sensitive to weight selection [12, 26, 20, 9, 25], and manually searching for optimal weightings is expensive in both computation and time. Previous work has suggested using the irreducible uncertainty of task predictions in a weighting scheme. For example, tasks whose predictions are more uncertain are weighted lower because they are less informative [20].

Such notions of uncertainty are typically predicated on an assumed parametric likelihood of a label given inputs. However, this work is not easily adapted to multi-similarity contrastive learning because 1) contrastive training does not directly predict downstream task performance and 2) the confidence in different similarity metrics has never been considered in this setting. In contrastive learning, the estimate of interest is a similarity metric between different examples rather than a predicted label, which means that downstream task performance is not directly predicted by training results. Furthermore, previous work in contrastive learning has only focused on modeling data-dependent uncertainty, or how similar a sample is to negative examples within the same similarity metric. To our knowledge, we are the first to utilize uncertainty in the training tasks and their corresponding similarity metrics as a basis for constructing a weighting scheme for multi-similarity contrastive losses.

We do this in two ways: 1) we construct a pseudo-likelihood function approximating task performance and 2) we introduce a similarity dependent temperature parameter to model relative confidence between different similarity metrics. We present an extension to the contrastive learning paradigm that enables estimation of the uncertainty in similarity metrics. In addition to providing useful information about the informativeness of each similarity metric, our estimate of uncertainty enables us to weight the different notions of similarity such that noisy notions of similarity are weighted lower than more reliable notions.

Our approach proceeds by constructing a pseudo-likelihood function which approximates task performance. We show in the supplement that maximizing our pseudo-likelihood also maximizes our MSCon objective function. This pseudo-likelihood endows the approach with a well-defined notion of uncertainty that can then be used to weight the different similarities.

Let $v_{i}^{c}$ be the model projection head output for similarity $c$ for input $x_{i}$. Let $\textbf{Y}^{c}$ be the $c$th column of $\textbf{Y}$. We define $P^{c}_{y}=\{x_{j}\in\mathcal{D}:\textbf{Y}^{c}_{j}=y\}$ to be the positive set for label $y$ under similarity metric $c$. We define the classification probability $p(y|v_{i}^{c},D,\tau)$ as the average distance of the representation $v_{i}^{c}$ from all representations for inputs conditioned on the similarity metric. Instead of directly optimizing Equation 1, we can maximize the following pseudo-likelihood:

p(y|v_{i}^{c},D,\tau)\propto\frac{1}{|P^{c}_{y}|}\sum_{p\in P^{c}_{y}}\exp\Big(\frac{v_{i}^{cT}v_{p}^{c}}{\tau}\Big).   (3)

Note that optimizing Equation 3 is equivalent to optimizing Equation 1 by applying Jensen's inequality (as shown in the supplement). By virtue of being a pseudo-likelihood, Equation 3 provides us with a well-defined probability associated with downstream task performance that we can use to weight the different tasks. We next outline how to construct this uncertainty from the pseudo-likelihood defined in Equation 3.
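
As a sketch of that step (the full derivation is in the supplement), take the batch-level pseudo-likelihood with positive set $P^{c}(i)$ and normalizing constant $\sum_{a\in A(i)}\exp(v_{i}^{cT}v_{a}^{c}/\tau)$. By concavity of the logarithm (Jensen's inequality),

-\log p(y_{i}^{c}|v_{i}^{c},D,\tau)=-\log\frac{1}{|P^{c}(i)|}\sum_{p\in P^{c}(i)}\frac{\exp(v_{i}^{cT}v_{p}^{c}/\tau)}{\sum_{a\in A(i)}\exp(v_{i}^{cT}v_{a}^{c}/\tau)}\;\leq\;\frac{-1}{|P^{c}(i)|}\sum_{p\in P^{c}(i)}\log\frac{\exp(v_{i}^{cT}v_{p}^{c}/\tau)}{\sum_{a\in A(i)}\exp(v_{i}^{cT}v_{a}^{c}/\tau)}=L^{mscon}_{c,i},

so minimizing the MSCon loss minimizes an upper bound on the negative log pseudo-likelihood, i.e., it pushes the pseudo-likelihood up.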

We assume that $v^{c}$ is a sufficient statistic for $y^{c}$, meaning that $y^{c}$ is independent of all other variables conditional on $v^{c}$. Such an assumption is not unrealistic; it simply reflects the notion that $v^{c}$ is an accurate estimation for $y^{c}$. Under this assumption, the pseudo-likelihood expressed in Equation 3 factorizes as follows:

p(y^{1},...,y^{C}|v_{i}^{1},...,v_{i}^{C},D,\tau)=p(y^{1}|v_{i}^{1},D,\tau)\cdots p(y^{C}|v_{i}^{C},D,\tau).   (4)

Previous work in contrastive learning modifies the temperature to learn from particularly difficult data examples [40, 31]. Inspired by this, we adapt the contrastive likelihood to incorporate a similarity-dependent scaled version of the temperature. We introduce a parameter $\sigma_{c}^{2}$ for each similarity metric, controlling the scaling of the temperature and representing the similarity-dependent uncertainty, in Equation 5.

p(y|v_{i}^{c},D,\tau,\sigma_{c}^{2})\propto\frac{1}{|P^{c}_{y}|}\sum_{p\in P^{c}_{y}}\exp\Big(\frac{v_{i}^{cT}v_{p}^{c}}{\tau\sigma_{c}^{2}}\Big)   (5)

The negative log-likelihood for this contrastive likelihood can be expressed as Equation 6.

-\log p(y|v_{i}^{c},D,\tau,\sigma_{c}^{2})\propto\frac{1}{\sigma_{c}^{2}}\sum_{i\in I}L^{mscon}_{c,i}+2\log(\sigma_{c})   (6)

We provide a detailed derivation of this equation in the supplement. Extending this analysis to consider multiple similarity metrics, we can adapt the optimization objective to learn weightings for each similarity as in Equation 7.

\text{argmin}_{f,g_{1},...,g_{C},\sigma_{1},...,\sigma_{C}}\;\sum_{c\in C}\Big(\frac{1}{\sigma_{c}^{2}}\sum_{i\in I}L^{mscon}_{c,i}+2\log(\sigma_{c})\Big)   (7)

During training, we learn the $\sigma_{c}$ weighting parameters through gradient descent.
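
A minimal sketch of this weighting in PyTorch is shown below, assuming one learnable parameter $s_{c}=\log\sigma_{c}^{2}$ per similarity metric (parameterizing the log-variance keeps $\sigma_{c}^{2}$ positive under gradient descent); the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class SimilarityUncertaintyWeighting(nn.Module):
    """Learnable per-similarity weights for Equation 7 (a sketch).

    We parameterize s_c = log(sigma_c^2), so 1/sigma_c^2 = exp(-s_c) and the
    regularizer 2*log(sigma_c) equals s_c.
    """
    def __init__(self, num_similarities):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(num_similarities))  # s_c, init sigma_c = 1

    def forward(self, per_similarity_losses):
        # per_similarity_losses: tensor of shape (C,), entry c is sum_i L^{mscon}_{c,i}
        return (torch.exp(-self.log_var) * per_similarity_losses + self.log_var).sum()
```

The log-variance parameters are registered with the same optimizer as the encoder and projection heads, so the weights in Equation 7 are updated jointly with the representation.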

4 Experiments

In this section, we evaluate the performance of our approach: 1) under varying levels of uncertainty in similarity metrics induced by varying levels of task noise and 2) across in-domain and out-of-domain classification tasks. We show that our multi-similarity contrastive loss significantly outperforms existing self-supervised and single-task supervised contrastive networks and outperforms multi-task cross-entropy networks on novel tasks. We also demonstrate that our method is able to learn to down-weight more uncertain similarities, and that compared to using equal weights, our weighted multi-similarity contrastive loss is more robust to similarity metric uncertainty and generalizes better to novel tasks under increasing uncertainty.

4.1 Datasets and Implementation

Datasets.

We use two datasets: Zappos50k [37, 38] and MEDIC [1, 2, 3, 27, 28]. Sample images are provided in the supplement.

Zappos50k consists of 50,000 $136\times 102$ images of shoes. We focus our analysis on three tasks: the category of shoe (shoes, boots, sandals, or slippers), the suggested gender of the shoe (for women, men, girls, boys), and the closing mechanism of the shoe (buckle, pull on, slip on, hook and loop, or laced). We fine-tune the embedding space to predict the brand of the shoe for the out-of-domain experiment. We split the images into 70% training, 10% validation, and 20% test sets and resize all images to $112\times 112$.

MEDIC is the largest multi-task learning disaster-related dataset, extending the CRISIS multi-task image benchmark dataset [2, 1]. MEDIC consists of approximately 71,000 images of disasters collected from Twitter, Google, Bing, Flickr, and Instagram. The dataset includes four disaster-related tasks that are relevant for humanitarian aid: the disaster type (earthquake, fire, flood, hurricane, landslide, other disaster, and not a disaster), the informativeness of the image for humanitarian response (informative or not informative), categories relevant to humanitarian response (having affected, injured, or dead people, infrastructure and utility damage, rescue volunteering or donation effort, and not needing humanitarian response), and the severity of the damage of the event (severe damage, mild damage, and little to no damage). For the out-of-domain analysis, we hold out each task from training and then attempt to predict the held-out task during evaluation. These tasks were annotated through a crowdsourcing platform, and the images are already split into 69% training, 9% validation, and 22% test sets. All images were resized to $224\times 224$.

Training Details.

Consistent with previous work [8, 21], images are augmented by applying various transformations to increase dataset diversity. We train using standard data augmentations, including random crops, flips, and color jitters.

An embedding network consisting of a shared encoder and multiple projection heads is then trained using MSCon with multiple similarity metrics defined by different tasks, as shown in Figure 2. The resulting vectors are normalized to the unit hypersphere, which allows us to use an inner product to measure distances in the projection space. Zappos50k encoders use ResNet18 backbones with projection heads of size 32. MEDIC encoders use ResNet50 backbones with projection spaces of size 64 [14]. All models are pretrained on ImageNet [10]. All networks are trained using SGD with momentum for 200 epochs with a batch size of 64 and a learning rate of 0.05, unless otherwise specified. We use a temperature of $\tau=0.1$.
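
A sketch of this architecture is shown below for the Zappos50k configuration (ResNet18 backbone, projection dimension 32); the two-layer MLP projection heads are an assumption, since only the projection dimension is specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class MultiSimilarityNet(nn.Module):
    """Shared encoder with one projection head per similarity metric (a sketch)."""
    def __init__(self, num_similarities=3, proj_dim=32):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")   # ImageNet-pretrained
        feat_dim = backbone.fc.in_features                     # 512 for ResNet18
        backbone.fc = nn.Identity()                            # expose the embedding h_i
        self.encoder = backbone
        # Assumed two-layer MLP projection heads g^c.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, feat_dim),
                          nn.ReLU(inplace=True),
                          nn.Linear(feat_dim, proj_dim))
            for _ in range(num_similarities)])

    def forward(self, x):
        h = self.encoder(x)                                    # h_i = f(x_i)
        # One unit-normalized projection v_i^c per similarity metric c
        return h, [F.normalize(g(h), dim=1) for g in self.heads]
```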

After training the multi-similarity contrastive network, we discard the projection heads and freeze the encoder network. We then evaluate the performance of the embedding network on downstream tasks by training a linear classifier on the embedding features. We train a linear classifier for 20 epochs and evaluate top-1 accuracy. Standard deviations are computed by bootstrapping the test set 1000 times. Additional implementation details can be found in the supplement. We will release code for implementing MSCon.
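
A sketch of this linear evaluation protocol is below; the probe's optimizer settings and the generic encoder interface (images mapped to feat_dim-dimensional features) are assumptions.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, loader, num_classes, feat_dim=512, epochs=20, lr=0.05, device="cuda"):
    """Train a linear classifier on frozen encoder features (a sketch)."""
    encoder.eval()                                             # freeze the encoder
    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                h = encoder(x)                                 # frozen embedding h_i
            loss = nn.functional.cross_entropy(clf(h), y)
            opt.zero_grad(); loss.backward(); opt.step()
    return clf
```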

Models.

We compare the unweighted and weighted versions of our Multi-Similarity Contrastive Network (MSCon) with the following baselines:

  • Cross-Entropy Networks (XEnt) We train separate cross-entropy networks with each of the available tasks. We also train a multitask cross-entropy network with all available tasks. We train each network with a learning rate of 0.01. We select the best model using the validation accuracy.

  • Conditional Similarity Network (CSN) We train a conditional similarity network that learns the convolutional filters, embedding, and mask parameters together. 10,000 triplets are constructed from the similarities available in the training dataset. We follow the training procedure specified in [34].

  • SimCLR and SupCon Networks We train a self-supervised contrastive network for each dataset and individual supervised contrastive networks with each of the similarity metrics represented in the training dataset. We pretrain with a temperature of 0.1 for all contrastive networks, which is the typical temperature used for SimCLR and SupCon [8, 21]. For evaluation, we fine-tune a classification layer on the frozen embedding space.

4.2 Role of Weighting in Achieving Robustness to Task Uncertainty


Figure 3: Weighted MSCon generalizes better to unseen tasks under increasing similarity task uncertainty. Unweighted and weighted versions of MSCon are trained on increasing task corruption. The x-axis on all plots represents the amount of task corruption $\rho$. The top row plots show that weighted MSCon learns to downweight the corrupted task. For both the Zappos50k and MEDIC datasets, weighted MSCon generalizes better to out-of-domain tasks than unweighted MSCon under increasing values of $\rho$, even as in-domain performance on the corrupted task degrades to random selection.

In this subsection, we evaluate the robustness of our learned embeddings to similarity uncertainty. Since the true level of task noise (similarity metric uncertainty) is unobserved, we use a semi-simulated approach, where we simulate uncertain similarities in both the Zappos50k and MEDIC datasets.

For the Zappos50k dataset, we train the encoder using the category, closure, and gender similarity metrics. To introduce task uncertainty, we randomly corrupt the closure task by proportion $\rho$: we randomly sample a fraction $\rho$ of the closure labels and reassign each uniformly at random among all possible labels. Note that when $\rho=1.0$, all labels are sampled uniformly at random from the available closure labels; when $\rho=0.0$, all labels are identical to the original dataset. For the MEDIC dataset, we train the encoder using the disaster types, humanitarian, and informative similarity metrics. We corrupt the disaster type task in order to introduce task uncertainty. As $\rho$ increases in Figure 3, we find that MSCon learns to down-weight the noisy task for both the Zappos50k and MEDIC datasets.
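
A sketch of this corruption procedure (function names and seed handling are illustrative):

```python
import numpy as np

def corrupt_labels(labels, rho, num_classes, seed=0):
    """Resample a fraction rho of the labels uniformly over all classes (a sketch)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_corrupt = int(round(rho * len(labels)))
    idx = rng.choice(len(labels), size=n_corrupt, replace=False)   # sample rho of the labels
    labels[idx] = rng.integers(0, num_classes, size=n_corrupt)     # reassign uniformly at random
    return labels
```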

For the Zappos50k dataset, we evaluate the top-1 classification accuracy on an out-of-domain task, brand classification, and on an in-domain task, the corrupted closure classification. Similarly, for the MEDIC dataset, we evaluate the top-1 classification accuracy on an out-of-domain task, damage-severity classification, and on an in-domain task, the corrupted disaster-type classification.

Figure 3 shows the results from this analysis. The top panel shows how the weights change as we change task uncertainty on the x-axis. The middle and bottom panels show how out-of-domain and in-domain evaluation accuracy change as we change task uncertainty. As expected, as $\rho$ increases to 1, the in-domain classification accuracy for both the equal-weighted and weighted MSCon learned embeddings decreases to random.

However, the out-of-domain classification accuracy for the weighted MSCon learned embeddings is more robust to changes in ρ\rho than the unweighted MSCon learned embeddings. This is because the weighted version of MSCon automatically learns to down-weight uncertain or more uninformative tasks during encoder training.

4.3 Classification Performance

In this section, we evaluate in- and out-of-domain performance of various methods. We find that our multi-similarity contrastive network significantly outperforms all other contrastive methods on in-domain tasks and outperforms multi-task cross-entropy learning on out-of-domain tasks. We also show how performance changes with variation in hyperparameter selection. More qualitative analysis of the learned similarity subspaces (i.e., TSNE visualizations) can be found in the supplement.

In-domain Performance.

To evaluate the quality of the learned embedding spaces, we measure top-1 classification accuracy on all tasks for both the Zappos50k and MEDIC datasets. We report the average accuracy and the standard deviation for all tasks in Table 1 and Table 2. For the Zappos50k dataset, MSCon has the highest top-1 classification accuracy of all models. For MEDIC, MSCon outperforms all of the contrastive learning techniques on all tasks. However, for three of the tasks, the best performance is achieved by one of the cross-entropy methods (but different methods dominate for different tasks). We hypothesize that this may be due to the inherent uncertainty of some of the tasks [2, 1]. For both datasets, CSN achieves accuracies that are lower than the single-task supervised networks. We believe this is because the conditional similarity network is trained with triplet loss [15], which has been shown to be outperformed by N-pairs loss and supervised contrastive learning for single-task learning [33, 21].

Table 1: Top-1 classification accuracy across all in-domain evaluation settings for the Zappos50k dataset. MSCon outperforms all baselines.
Zappos50k: In-Domain Evaluation
Loss Category Closure Gender
XEnt Cat 96.64 (0.34) 74.55 (0.38) 63.78 (0.59)
XEnt Clo 88.99 (0.33) 92.28 (0.35) 66.59 (0.57)
XEnt Gend 81.96 (0.32) 73.28 (0.37) 83.09 (0.60)
XEnt MT 96.98 (0.29) 93.33 (0.36) 85.07 (0.55)
SimCLR 90.05 (0.43) 81.30 (0.49) 69.10 (0.84)
SupCon Cat 96.95 (0.29) 73.02 (0.36) 61.24 (0.62)
SupCon Clo 83.62 (0.30) 91.75 (0.41) 65.90 (0.60)
SupCon Gen 76.40 (0.28) 69.52 (0.38) 85.11 (0.58)
CSN 83.33 (0.32) 72.12 (0.36) 69.21 (0.60)
MSCon 97.17 (0.27) 94.37 (0.35) 85.98 (0.56)
Table 2: Top-1 classification accuracy across all in-domain evaluation settings for the MEDIC dataset. We compare cross-entropy single-task and multi-task training, unsupervised contrastive training (SimCLR), single-similarity contrastive training (SupCon), and multi-similarity contrastive training (CSN, MSCon).
MEDIC: In-Domain Evaluation
Loss Damage severity Disaster types Humanitarian Informative
XEnt Damage severity 81.39 (0.35) 75.71 (0.37) 81.76 (0.33) 84.48 (0.31)
XEnt Disaster types 81.02 (0.34) 78.98 (0.35) 82.06 (0.34) 86.08 (0.3)
XEnt Humanitarian 81.32 (0.36) 76.52 (0.35) 82.1 (0.37) 86.41 (0.31)
XEnt Informative 80.2 (0.36) 76.73 (0.35) 80.83 (0.36) 85.68 (0.3)
XEnt Multi-Task 81.01 (0.36) 78.04 (0.32) 82.25 (0.35) 86.01 (0.29)
SimCLR 74.9 (0.4) 68.5 (0.42) 73.89 (0.4) 78.67 (0.33)
SupCon Damage severity 80.26 (0.33) 75.1 (0.4) 80.42 (0.4) 84.45 (0.34)
SupCon Disaster types 80.23 (0.34) 78.33 (0.37) 80.63 (0.36) 84.02 (0.3)
SupCon Humanitarian 79.98 (0.36) 74.89 (0.39) 80.36 (0.32) 85.07 (0.32)
SupCon Informative 79.14 (0.35) 74.67 (0.34) 79.97 (0.31) 84.02 (0.3)
CSN 75.13 (0.4) 70.02 (0.37) 70.52 (0.38) 76.28 (0.32)
MSCon 81.0 (0.3) 79.14 (0.31) 81.69 (0.3) 85.15 (0.3)

Out-of-domain Performance. Here, we test how well different approaches are able to generalize to previously unseen tasks. We compare MSCon to multi-task cross-entropy (XEnt MT). For the Zappos50k dataset, we train embedding spaces with the category, closure, and gender similarity metrics. We then select the 20 brands in the Zappos dataset with the most examples, and fine-tune a classification layer on the frozen embedding with the brand labels. We report the top-1 brand classification accuracy of the fine-tuned network on the test set and the standard deviation in Table 3. We find that MSCon significantly improves upon XEnt MT in the out-of-domain setting. More detailed top-1 classification results for all cross-entropy and contrastive networks are provided in the supplement.

To evaluate generalization on the MEDIC dataset, we hold out each of the four tasks. We then train an embedding space with the remaining three similarity metrics. Next, we fine-tune a classification layer on the frozen embedding with the hold-out task. Table 4 reports the top-1 classification accuracy and the standard deviation for the hold-out task on the test set. We observe that, except for the informative task, our approach is able to generalize to new tasks with higher accuracy than the multi-task cross-entropy learned embedding space. We hypothesize that this is because the informative task is the only binary task and the most ambiguous.

Table 3: Top-1 out-of-domain brand classification accuracy for the Zappos50k dataset. We compare cross-entropy multi-task training and multi-similarity contrastive training.
Zappos50k: OOD Evaluation
Loss Brand
XEnt MT 32.10 (1.48)
MSCon 42.62 (1.52)
Table 4: Top-1 out-of-domain classification accuracy for hold-out similarity metrics on the MEDIC dataset. We compare cross-entropy multi-task training (XEnt MT) and multi-similarity contrastive training (MSCon). We abbreviate the similarity metrics as damage severity (DS), disaster types (DT), humanitarian (Human), and informative (Inf).
MEDIC: OOD Evaluation
Loss DS DT Human Inf
XEnt MT 79.51 (0.36) 75.02 (0.38) 79.77 (0.4) 86.18 (0.3)
MSCon 80.98 (0.32) 76.17 (0.32) 81.45 (0.34) 85.22 (0.3)

Hyperparameter Analysis.

We test whether there exists a specific temperature that leads to optimal performance of MSCon across multiple similarity metrics. In Figure 4, we plot the top-1 classification accuracy for each of the category, closure, and gender tasks as a function of pretraining temperature for MSCon. We also plot the top-1 classification accuracy as a function of training epochs. We find that a pretraining temperature of $\tau=0.1$ and training for 200 epochs work well for all tasks. These hyperparameter settings are consistent with the optimal settings for SimCLR and SupCon. Note that previous work on SimCLR and SupCon has found that large batch sizes consistently result in better top-1 accuracy [8, 21]. We hypothesize that larger batch sizes would also improve performance for the MSCon loss. We include hyperparameter analyses of MSCon on the MEDIC dataset in the supplement.


Figure 4: Hyperparameter effect on top-1 accuracy for the Zappos50k dataset. The top row shows top-1 classification accuracy as a function of temperature during the pretraining stage for MSCon. The bottom row shows top-1 classification accuracy as a function of pretraining epochs for MSCon.

5 Conclusion

In this work, we propose the multi-similarity contrastive loss (MSCon). Existing contrastive learning methods learn a representation based on a single similarity metric. However, it is often the case that multiple tasks are available, each implying a different similarity metric. We show how to leverage multiple similarity metrics in a contrastive setting to learn embeddings that generalize well to unseen tasks. We additionally extend uncertainty-based task weighting to the contrastive framework. We do this by 1) modeling downstream classification performance for each similarity using a pseudo-likelihood and 2) representing similarity-dependent uncertainty as a temperature scaling factor for each similarity metric. We demonstrate that our MSCon learned embeddings outperform all contrastive baselines and generalize better than multi-task cross-entropy to novel tasks.

There are many interesting directions for future work. Firstly, we do not consider data-dependent uncertainty in our framework. It would be interesting to consider what would happen if we have variance in the uncertainty of our input data. Can we account for both similarity-dependent and input-dependent uncertainty? Another interesting direction for future work would be to see if we could incorporate non-categorical labels in our multi-similarity learning scheme. Currently, we define similarity metrics using multiple categorical tasks. However, some applications use continuous metrics of similarity (e.g. heart rate measurements available in patient electronic health record data or heel height associated with shoes). Defining positive and negative examples for continuous variables with different scales is not straightforward. Thus, an interesting follow-up question may be how to incorporate both categorical and continuous similarity metrics under a single contrastive framework.

Finally, we note that our method will not necessarily generalize well to any novel task. Sometimes, multi-task learning can degrade performance when models are unable to learn representations that generalize towards all tasks [11, 42]. Our work does not address criteria for the selection of tasks for training or evaluation.

References

  • [1] Firoj Alam, Tanvirul Alam, Md Hasan, Abul Hasnat, Muhammad Imran, Ferda Ofli, et al. Medic: a multi-task learning dataset for disaster image classification. Neural Computing and Applications, pages 1–24, 2022.
  • [2] Firoj Alam, Ferda Ofli, and Muhammad Imran. Crisismmd: Multimodal twitter datasets from natural disasters. In Twelfth international AAAI conference on web and social media, 2018.
  • [3] Firoj Alam, Ferda Ofli, Muhammad Imran, Tanvirul Alam, and Umair Qazi. Deep learning benchmarks and datasets for social media image classification for disaster response. In 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 151–158. IEEE, 2020.
  • [4] Shervin Ardeshir and Navid Azizan. Uncertainty in contrastive learning: On the predictability of downstream performance. arXiv preprint arXiv:2207.09336, 2022.
  • [5] Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. Tabel: Entity linking in web tables. In International Semantic Web Conference, pages 425–441. Springer, 2015.
  • [6] Deblina Bhattacharjee, Tong Zhang, Sabine Süsstrunk, and Mathieu Salzmann. Mult: An end-to-end multitask learning transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12031–12041, 2022.
  • [7] Hsin-Hsi Chen, Shih-Chung Tsai, and Jin-He Tsai. Mining tables from large scale html texts. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics, 2000.
  • [8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [9] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.
  • [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [11] Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
  • [12] Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz H Elibol. A comparison of loss weighting strategies for multi task learning in deep neural networks. IEEE Access, 7:141627–141632, 2019.
  • [13] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International workshop on similarity-based pattern recognition, pages 84–92. Springer, 2015.
  • [16] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 590–597, 2019.
  • [17] Ashraful Islam, Chun-Fu Richard Chen, Rameswar Panda, Leonid Karlinsky, Richard Radke, and Rogerio Feris. A broad study on the transferability of visual representations with contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8845–8855, 2021.
  • [18] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  • [19] Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. Mimic-iv. PhysioNet. Available online at: https://physionet.org/content/mimiciv/1.0/ (accessed August 23, 2021), 2020.
  • [20] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018.
  • [21] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.
  • [22] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
  • [23] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Adversarial multi-task learning for text classification. arXiv preprint arXiv:1704.05742, 2017.
  • [24] Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, and Carl Vondrick. Multitask learning strengthens adversarial robustness. In European Conference on Computer Vision, pages 158–174. Springer, 2020.
  • [25] Yuren Mao, Zekai Wang, Weiwei Liu, Xuemin Lin, and Wenbin Hu. Banditmtl: Bandit-based multi-task learning for text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5506–5516, 2021.
  • [26] Yuren Mao, Zekai Wang, Weiwei Liu, Xuemin Lin, and Pengtao Xie. Metaweighting: Learning to weight tasks in multi-task learning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3436–3448, 2022.
  • [27] Hussein Mouzannar, Yara Rizk, and Mariette Awad. Damage identification in social media posts using multimodal deep learning. In ISCRAM, 2018.
  • [28] Dat T Nguyen, Ferda Ofli, Muhammad Imran, and Prasenjit Mitra. Damage assessment from social media imagery data during disasters. In Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, pages 569–576, 2017.
  • [29] Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian Schroff, and Andrew Gallagher. Modeling uncertainty with hedged instance embedding. arXiv preprint arXiv:1810.00319, 2018.
  • [30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • [31] Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592, 2020.
  • [32] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. Advances in neural information processing systems, 31, 2018.
  • [33] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29, 2016.
  • [34] Andreas Veit, Serge Belongie, and Theofanis Karaletsos. Conditional similarity networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 830–838, 2017.
  • [35] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
  • [36] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, and Jianfeng Gao. Unified contrastive learning in image-text-label space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19163–19173, 2022.
  • [37] Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 192–199, 2014.
  • [38] Aron Yu and Kristen Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In Proceedings of the IEEE International Conference on Computer Vision, pages 5570–5579, 2017.
  • [39] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  • [40] Oliver Zhang, Mike Wu, Jasmine Bayrooti, and Noah Goodman. Temperature as uncertainty in contrastive learning. arXiv preprint arXiv:2110.04403, 2021.
  • [41] Shu Zhang, Ran Xu, Caiming Xiong, and Chetan Ramaiah. Use all the labels: A hierarchical multi-label contrastive learning framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16660–16669, 2022.
  • [42] Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021.