
Robust Contrastive Learning against Noisy Views

Ching-Yao Chuang   R Devon Hjelm   Xin Wang   Vibhav Vineet
Neel Joshi   Antonio Torralba   Stefanie Jegelka   Yale Song
MIT CSAIL   Microsoft Research
https://github.com/chingyaoc/RINCE
Abstract

Contrastive learning relies on an assumption that positive pairs contain related views, e.g., patches of an image or co-occurring multimodal signals of a video, that share certain underlying information about an instance. But what if this assumption is violated? The literature suggests that contrastive learning produces suboptimal representations in the presence of noisy views, e.g., false positive pairs with no apparent shared information. In this work, we propose a new contrastive loss function that is robust against noisy views. We provide rigorous theoretical justifications by showing connections to robust symmetric losses for noisy binary classification and by establishing a new contrastive bound for mutual information maximization based on the Wasserstein distance measure. The proposed loss is completely modality-agnostic and a simple drop-in replacement for the InfoNCE loss, which makes it easy to apply to existing contrastive frameworks. We show that our approach provides consistent improvements over the state-of-the-art on image, video, and graph contrastive learning benchmarks that exhibit a variety of real-world noise patterns.

1 Introduction

Contrastive learning [1, 2, 3] has become one of the most prominent self-supervised approaches to learn representations of high-dimensional signals, producing impressive results with image [4, 5, 6, 7, 8, 9], text [10, 11, 12, 13], audio [14, 15, 16], and video [17, 18, 19]. The central idea is to learn representations that capture the underlying information shared between different “views” of data [3, 20]. For images, the views are typically constructed by applying common data augmentation techniques, such as jittering, cropping, resizing and rotation [6], and for video the views are often chosen as adjacent frames [21] or co-occurring multimodal signals, such as video and the corresponding optical flow [22], audio [19] and transcribed speech [17].

Figure 1: Noisy views can deteriorate contrastive learning. We propose a new contrastive loss function (RINCE) that rescales the sample importance in the gradient space based on an estimated noise level. With a simple turn of a knob ($q\in(0,1]$), we can upweight or downweight sample pairs with low shared information.

Designing the right contrasting views has been shown to be a key ingredient of contrastive learning [6, 23]. This often requires domain knowledge, intuition, trial-and-error (and luck!). What would happen if the views are wrongly chosen and do not provide meaningful shared information? Prior work has reported deteriorating effects of such noisy views in contrastive learning under various scenarios, e.g., unrelated image patches due to extreme augmentation [20], irrelevant video-audio pairs due to overdubbing [24], and misaligned video-caption pairs [17]. The major issue with noisy views is that representations of different views are forced to align with each other even if there is no meaningful shared information. This often leads to suboptimal representations that merely capture spurious correlations [25] or collapse to a trivial solution [26]. Worse yet, when we attempt to learn from large-scale unlabeled data – i.e., the scenario where self-supervised learning is particularly expected to shine – the issue is only aggravated because of the increased noise in the real-world data [27], hindering the ultimate success of contrastive learning.

Consequently, a few attempts have been made to design contrastive approaches that are noise-tolerant. For example, Morgado et al. [24] optimize a soft instance discrimination loss to weaken the impact of noisy views. Miech et al. [17] address the misalignment between video and captions by aligning multiple neighboring segments of a video. However, existing approaches are often tied to specific modalities or make assumptions that may not hold for general scenarios, e.g., MIL-NCE [17] is not designed to address the issues of irrelevant audio-visual signals.

In this work, we develop a principled approach to make contrastive learning robust against noisy views. We start by making connections between contrastive learning and the classical noisy binary classification in supervised learning [28, 29]. This allows us to explore the wealth of literature on learning with noisy labels [30, 31, 32]. In particular, we focus on a family of robust loss functions that has the symmetric property [29], which provides strong theoretical guarantees against noisy labels in binary classification. We then show a functional form of contrastive learning that can satisfy the symmetry condition if given a proper symmetric loss function, motivating the design of new contrastive loss functions that provide similar theoretical guarantees.

This leads us to propose Robust InfoNCE (RINCE), a contrastive loss function that satisfies the symmetry condition. RINCE can be understood as a generalized form of the contrastive objective that is robust against noisy views. Intuitively, its symmetric property provides an implicit means to reweight sample importance in the gradient space without requiring an explicit noise estimator. It also provides a simple “knob” (a real-valued scalar $q\in(0,1]$) that controls the behavior of the loss function, balancing an exploration-exploitation trade-off (i.e., from being conservative to taking chances on potentially noisy samples).

We also provide a theoretical analysis of the proposed RINCE objective and show that it extends the analyses of Ghosh et al. [29] to the self-supervised contrastive learning regime. Furthermore, we relate the proposed loss function to dependency measurement. Analogous to the InfoNCE loss, which is a lower bound of the mutual information between two views [3], we show that RINCE is a lower bound of the Wasserstein Dependency Measure (WDM) [33] even in the noisy setting. By replacing the KL divergence in the mutual information estimator with the Wasserstein distance, WDM captures the geometry of the representation space via the equipped metric and provides better robustness against noisy views than the KL divergence, both in theory and practice. In particular, the features learned with RINCE achieve better class-wise separation, which has been shown to be crucial for generalization [34].

Despite its rigorous theoretical background, implementing RINCE requires only a few lines of code, and it can serve as a simple drop-in replacement for the InfoNCE loss to make contrastive learning robust against noisy views. Since InfoNCE sets the basis for many modern contrastive methods such as SimCLR [6] and MoCo-v1/v2/v3 [5, 7, 35], our construction can be easily applied to many existing frameworks.

Finally, we provide strong empirical evidence demonstrating the robustness of RINCE against noisy views under various scenarios with different modalities and noise types. We show that RINCE improves over the state-of-the-art on image [36, 37], video [27, 38] and graph [39] self-supervised learning benchmarks, demonstrating its generalizability across multiple modalities. We also show that RINCE exhibits strong robustness against different types of noise such as augmentation noise [20, 40], label noise [28, 41], and noisy audio-visual correspondence [24]. The improvement is consistently observed across different dataset scales and training epochs, demonstrating its scalability and computational efficiency. In short, our main contributions are:

  • We propose RINCE, a new contrastive learning objective that is robust against noisy views of data;

  • We provide a theoretical analysis to relate the proposed loss to symmetric losses and dependency measurement;

  • We demonstrate our approach on real-world scenarios of image, video, and graph contrastive learning.

2 Related Work

Contrastive Learning

Contrastive approaches have become prominent in unsupervised representation learning [1, 2, 42]: InfoNCE [3] and its variants [5, 6, 8, 9] achieve state-of-the-art results across different modalities [10, 14, 17, 18, 19]. Modern approaches improve upon InfoNCE from different directions. One line of work focuses on modifying training mechanisms, e.g., appending a projection head [6], a momentum encoder with dynamic dictionary update [5, 7], siamese networks with a stop-gradient trick [43, 8], and online cluster assignment [44]. Another line of work refines the loss function itself to make it more effective, e.g., upweighting hard negatives [45, 12], correcting false negatives [11], and alleviating feature suppression [46]. Along this second line of work, we propose a new contrastive loss function robust against noisy views. Some prior work in this direction [24, 17] was demonstrated on limited modalities only; we demonstrate the generality of our approach on image [37], video [27, 38], and graph [47, 48, 40] contrastive learning scenarios. Our approach is orthogonal to the first line of work; our loss function can easily be applied to existing training mechanisms such as SimCLR [6] and MoCo-v1/v2/v3 [5, 7, 35].

Robust Loss against Noisy Labels

Learning with noisy labels has been actively explored in recent years [28, 49, 29, 50, 51, 52, 31, 32, 53, 54, 55]. One line of work attempts to develop robust loss functions that are noise-tolerant [29, 30, 56, 57]. Ghosh et al. [29] prove that symmetric loss functions are robust against noisy labels, e.g., Mean Absolute Error (MAE) [30], while the commonly used Cross Entropy (CE) loss is not. Based on this idea, Zhang and Sabuncu [56] propose the generalized cross entropy loss to combine the MAE and CE loss functions. A similar idea is adopted in [57] by combining a reversed cross entropy loss with the CE loss. In the next section, we relate noisy views to noisy labels by interpreting contrastive learning as binary classification, and develop a robust symmetric contrastive loss that enjoys similar theoretical guarantees.

3 Prelim: From Noisy Labels to Noisy Views

We start by connecting two seemingly different but related frameworks: supervised binary classification with noisy labels and self-supervised contrastive learning with noisy views. We then introduce a family of symmetric loss functions that is noise-tolerant and show how we can transform contrastive objectives to a symmetric form.

3.1 Symmetric Losses for Noisy Labels

Denoting the input space by ${\mathcal{X}}$ and the binary output space by ${\mathcal{Y}}=\{-1,1\}$, let ${\mathcal{S}}=\{x_{i},y_{i}\}_{i=1}^{m}$ be the unobserved clean dataset drawn i.i.d. from the data distribution $\mathcal{D}$. In the noisy setting, the learner obtains a noisy dataset ${\mathcal{S}}_{\eta}=\{x_{i},\hat{y}_{i}\}_{i=1}^{m}$, where $\hat{y}_{i}=y_{i}$ with probability $1-\eta_{x_{i}}$ and $\hat{y}_{i}=-y_{i}$ with probability $\eta_{x_{i}}$. Note that the noise rate $\eta_{x}$ is data-point dependent. For a classifier $f\in{\mathcal{F}}\colon{\mathcal{X}}\rightarrow{\mathbb{R}}$, the expected risk under the noise-free scenario is $R_{\ell}(f)=\mathbb{E}_{\mathcal{D}}[\ell(f(x),y)]$, where $\ell\colon{\mathbb{R}}\times{\mathcal{Y}}\rightarrow{\mathbb{R}}$ is a binary classification loss function. When noise is present, the learner minimizes the noisy expected risk $R_{\ell}^{\eta}(f)=\mathbb{E}_{\mathcal{D}_{\eta}}[\ell(f(x),\hat{y})]$.

Ghosh et al. [29] show that symmetric loss functions are robust against noisy labels in binary classification. In particular, a loss function $\ell$ is symmetric if it sums to a constant:

$$\ell(s,1)+\ell(s,-1)=c,\quad\forall s\in{\mathbb{R}}, \qquad (1)$$

where $s$ is the prediction score from $f$. Note that the symmetry condition also holds for the gradients w.r.t. $s$. They show that if the noise rate satisfies $\eta_{x}\leq\eta_{\max}<0.5$ for all $x\in{\mathcal{X}}$ and the loss is symmetric and non-negative, the minimizer of the noisy risk $f_{\eta}^{\ast}=\arg\inf_{f\in{\mathcal{F}}}R^{\eta}(f)$ approximately minimizes the clean risk:

$$R(f_{\eta}^{\ast})\leq\epsilon/(1-2\eta_{\max}),$$

where $\epsilon=\inf_{f\in{\mathcal{F}}}R(f)$ is the optimal clean risk. This implies that the noisy risk under a symmetric loss is a good surrogate for the clean risk. In Appendix A.2, we further relax the non-negativity constraint on the loss with a corollary; this relaxation matters for our proposed RINCE loss, which involves an exponential function $\ell(s,y)=-ye^{s}$ that can produce negative values.

3.2 Towards Symmetric Contrastive Objectives

The results above suggest that we can achieve robustness against noisy views if a contrastive objective can be expressed in a form that satisfies the symmetry condition in the binary classification framework. To this end, we first relate contrastive learning to binary classification, and then express it in a form where symmetry can be achieved.

Contrastive learning as binary classification.

Given two views $X$ and $V$, we can interpret contrastive learning as binary classification operating over pairs of samples $(x,v)$: a pair is labeled $1$ if it is sampled from the joint distribution, $(x,v)\sim P_{XV}$, and $-1$ if it comes from the product of marginals, $(x,v^{\prime})\sim P_{X}P_{V}$. In the presence of noisy views, some negative pairs $(x,v^{\prime})\sim P_{X}P_{V}$ are mislabeled as positive, introducing noisy labels.

To see this more concretely, let us consider the InfoNCE loss [3], one of the most widely adopted contrastive objectives [58, 4, 6, 11]. It minimizes the following loss function:

$${\mathcal{L}}_{\textnormal{InfoNCE}}(\mathbf{s})=-\log\frac{e^{s^{+}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}:=-\log\frac{e^{f(x)^{T}g(v)/t}}{e^{f(x)^{T}g(v)/t}+\sum_{i=1}^{K}e^{f(x)^{T}g(v_{i})/t}}, \qquad (2)$$

where $\mathbf{s}=\{s^{+},\{s_{i}^{-}\}_{i=1}^{K}\}$, $s^{+}$ and $s_{i}^{-}$ are the scores of related (positive) and unrelated (negative) pairs, and $t$ is the temperature parameter introduced to avoid gradient saturation. The expectation of the loss is taken over $(x,v)\sim P_{XV}$ and $K$ independent samples $v_{i}\sim P_{V}$, where $P_{XV}$ denotes the joint distribution over pairs of views such as transformations of the same image or co-occurring multimodal signals. Although InfoNCE has the functional form of a $(K+1)$-way softmax cross entropy loss, the model ultimately learns to classify whether a pair $(x,v)$ is positive or negative by maximizing the positive score $s^{+}$ and minimizing the negative scores $s_{i}^{-}$. Therefore, InfoNCE under noisy views can be seen as binary classification with noisy labels. We acknowledge that similar interpretations have been made in prior works under different contexts [59, 60, 4].

Symmetric form of contrastive learning.

Now we turn to a functional form of contrastive learning that can achieve the symmetric property. Assume that we have a noise-tolerant loss function $\ell$ that satisfies the symmetry condition of equation 1. We say a contrastive learning objective is symmetric if it takes the following form

$${\mathcal{L}}(\mathbf{s})=\underbrace{\ell(s^{+},1)}_{\textnormal{Positive Pair}}+\lambda\underbrace{\sum_{i=1}^{K}\ell(s_{i}^{-},-1)}_{\textnormal{$K$ Negative Pairs}} \qquad (3)$$

which consists of a collection of $(K+1)$ binary classification losses; $\lambda>0$ is a density weighting term controlling the ratio between classes $1$ (positive pairs) and $-1$ (negative pairs). Reducing $\lambda$ places more weight on the positive score $s^{+}$, while setting $\lambda$ to zero recovers negative-pair-free contrastive losses such as BYOL [8].

Contrastive objectives that satisfy the symmetric form enjoy strong theoretical guarantees against noisy labels as described in Ghosh et al. [29], as long as we plug in a contrastive loss function $\ell$ that satisfies the symmetry condition. Unfortunately, the InfoNCE loss [3] does not satisfy the symmetry condition in the gradients w.r.t. $s^{+/-}$ (we provide the full derivations in Appendix A.5). This motivates us to develop a new contrastive loss function that satisfies the symmetry condition, described next.

4 Robust InfoNCE Loss

# pos: exponent for positive example
# neg: sum of exponents for negative examples
# q, lam: hyperparameters of RINCE

info_nce_loss = -log(pos/(pos+neg))
rince_loss = -pos**q/q + (lam*(pos+neg))**q/q
Figure 2: Pseudocode for RINCE. The implementation only requires a small modification to the InfoNCE code.
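As a concrete instantiation of the pseudocode in Figure 2, the sketch below implements RINCE in PyTorch with SimCLR-style in-batch negatives. It is only an illustration under our own assumptions (L2-normalized embeddings, a temperature t, and these particular function names), not a released implementation.

import torch
import torch.nn.functional as F

def rince_loss(z1, z2, q=0.5, lam=0.01, t=0.5):
    # z1, z2: [N, D] embeddings of two views of the same N instances.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / t                 # [N, N] pairwise scores
    pos = torch.exp(torch.diag(logits))      # exp(s+) for matched pairs
    total = torch.exp(logits).sum(dim=1)     # exp(s+) + sum_i exp(s_i^-)
    return (-pos.pow(q) / q + (lam * total).pow(q) / q).mean()

def info_nce_loss(z1, z2, t=0.5):
    # InfoNCE with the same pairing convention, for comparison.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / t
    return F.cross_entropy(logits, torch.arange(len(z1), device=z1.device))

In a SimCLR or MoCo-style training loop, replacing the InfoNCE call with a RINCE call of this form is the only change needed.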

Based on the idea of robust symmetric classification loss, we present the following Robust InfoNCE (RINCE) loss:

$${\mathcal{L}}^{\lambda,q}_{\textnormal{RINCE}}(\mathbf{s})=\frac{-e^{q\cdot s^{+}}}{q}+\frac{\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q}}{q},$$

where $q,\lambda\in(0,1]$. Figure 2 shows the pseudocode of RINCE: it is simple to implement. When $q=1$, RINCE becomes a contrastive loss that fully satisfies the symmetry property in the form of equation 3 with $\ell(s,y)=-ye^{s}$:

$${\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\mathbf{s})=-(1-\lambda)e^{s^{+}}+\lambda\sum_{i=1}^{K}e^{s_{i}^{-}}.$$

Notice that the exponential loss $-ye^{s}$ satisfies the symmetry condition defined in equation 1 with $c=0$. Therefore, when $q\rightarrow 1$, we achieve robustness against noisy views in the same manner as binary classification with noisy labels.
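As a quick check of this claim, plugging $y=\pm 1$ into $\ell(s,y)=-ye^{s}$ gives

$$\ell(s,1)+\ell(s,-1)=-e^{s}+e^{s}=0=c, \qquad \frac{\partial}{\partial s}\big[\ell(s,1)+\ell(s,-1)\big]=-e^{s}+e^{s}=0,$$

so both the loss values and their gradients satisfy the symmetry condition.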

In the limit of $q\rightarrow 0$, RINCE becomes asymptotically equivalent to InfoNCE, as the following lemma describes:

Lemma 1.

For any $\lambda>0$, it holds that

$$\lim_{q\rightarrow 0}{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q}(\mathbf{s})={\mathcal{L}}_{\textnormal{InfoNCE}}(\mathbf{s})+\log(\lambda);$$
$$\lim_{q\rightarrow 0}\frac{\partial}{\partial\mathbf{s}}{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q}(\mathbf{s})=\frac{\partial}{\partial\mathbf{s}}{\mathcal{L}}_{\textnormal{InfoNCE}}(\mathbf{s}).$$

We defer the proofs to Appendix A. Note that the convergence also holds for the derivatives: optimizing RINCE in the limit of $q\rightarrow 0$ is mathematically equivalent to optimizing InfoNCE. Therefore, by controlling $q\in(0,1]$ we smoothly interpolate between the InfoNCE loss ($q\rightarrow 0$) and the RINCE loss in its fully symmetric form ($q\rightarrow 1$).
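Lemma 1 is also easy to verify numerically. The following sketch (our own sanity check on arbitrary scores, not part of the paper) shows the gap ${\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q}-({\mathcal{L}}_{\textnormal{InfoNCE}}+\log\lambda)$ shrinking as $q\rightarrow 0$:

import math

def rince(s_pos, s_negs, q, lam):
    denom = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    return -math.exp(q * s_pos) / q + (lam * denom) ** q / q

def info_nce(s_pos, s_negs):
    denom = math.exp(s_pos) + sum(math.exp(s) for s in s_negs)
    return -math.log(math.exp(s_pos) / denom)

s_pos, s_negs, lam = 0.8, [0.1, -0.3, 0.5], 0.5
for q in [0.5, 0.1, 1e-3, 1e-6]:
    gap = rince(s_pos, s_negs, q, lam) - (info_nce(s_pos, s_negs) + math.log(lam))
    print(f"q={q:g}  gap={abs(gap):.6f}")  # the gap vanishes as q -> 0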

4.1 Intuition behind RINCE

We now analyze the behavior of RINCE through the lens of an exploration-exploitation trade-off. In particular, we reveal an implicit easy/hard positive mining scheme by inspecting the gradients of RINCE under different $q$ values, and show that we achieve stronger robustness (more exploitation) with larger $q$ at the cost of potentially useful clean hard positive samples (less exploration).

To simplify the analysis, we consider InfoNCE and RINCE with a single negative pair ($K=1$):

$${\mathcal{L}}_{\textnormal{InfoNCE}}(\mathbf{s})=-\log\left(e^{s^{+}}/(e^{s^{+}}+e^{s^{-}})\right);$$
$${\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q}(\mathbf{s})=\frac{-e^{q\cdot s^{+}}}{q}+\frac{\left(\lambda\cdot(e^{s^{+}}+e^{s^{-}})\right)^{q}}{q}.$$
Figure 3: Loss Visualization. We visualize the (a) loss value and the (b) gradient scale with respect to the positive score $s^{+}$ for different $q$ while setting $\lambda=0.5$. The gradient scale of InfoNCE ($q\rightarrow 0$) is larger when the positive score is smaller (hard positive pair). In contrast, for fully symmetric RINCE ($q=1$), the gradient is larger when the positive score is large (easy positive pair).

We visualize the loss and the scale of the gradients with respect to the positive score $s^{+}$ in Figure 3. Although the loss values differ for each $q$, they follow the same principle: the loss achieves its minimum when the positive score $s^{+}$ is maximized and the negative score $s^{-}$ is minimized.

The interesting bit lies in the gradients. The InfoNCE loss ($q\rightarrow 0$) places more emphasis on hard positive pairs, i.e., the pairs with low positive scores $s^{+}$ (the left-most part in the plot). In contrast, the fully symmetric RINCE loss ($q=1$) places more weight on easy positive pairs (the right-most part). Note that both $q\rightarrow 0$ and $q\rightarrow 1$ naturally perform hard negative mining; both their derivatives put exponentially more weight on hard negative pairs.

This reveals an implicit trade-off between exploration (convergence) and exploitation (robustness). When $q\rightarrow 0$, the loss performs hard positive mining, providing faster convergence in the noise-free setting. But in the presence of noise, exploration is harmful; it wrongly puts higher weight on false positive pairs because noisy samples tend to induce larger losses [61, 55, 62, 24], and this could hinder convergence. In contrast, when $q\rightarrow 1$, we perform easy positive mining. This provides robustness especially against false positives, but it comes at the cost of exploration with clean hard positives. An important aspect here is that RINCE does not require an explicit noise estimator: the scores $s^{+}$ and $s^{-}$, and the relationship between the two (which is what the loss function measures), act as noise estimates. In practice, we set $q\in[0.1,0.5]$ to strike a balance between exploration and exploitation.
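The easy/hard positive mining behavior can be read directly off the $K=1$ gradients: $|\partial{\mathcal{L}}_{\textnormal{InfoNCE}}/\partial s^{+}|=e^{s^{-}}/(e^{s^{+}}+e^{s^{-}})$ grows as $s^{+}$ shrinks, while $|\partial{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}/\partial s^{+}|=(1-\lambda)e^{s^{+}}$ grows with $s^{+}$. The short sketch below (our own illustration, with $\lambda=0.5$ and $s^{-}=0$ as arbitrary choices) prints both magnitudes across a range of positive scores:

import numpy as np

lam, s_neg = 0.5, 0.0
for s_pos in np.linspace(-2.0, 2.0, 5):
    g_infonce = np.exp(s_neg) / (np.exp(s_pos) + np.exp(s_neg))  # large for hard positives
    g_rince = (1 - lam) * np.exp(s_pos)                          # large for easy positives
    print(f"s+ = {s_pos:+.1f}   |grad| InfoNCE = {g_infonce:.3f}   |grad| RINCE(q=1) = {g_rince:.3f}")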

4.2 Theoretical Underpinnings

Next, we provide an information-theoretic explanation of what makes RINCE robust against noisy views. In particular, we show that RINCE is a contrastive lower bound of the Wasserstein dependency measure (WDM) [33], an analogue of mutual information (MI) that provides superior robustness against sample noise compared to the Kullback–Leibler (KL) divergence thanks to the strong geometric properties of the Wasserstein metric. We further show that, even in the presence of noise, RINCE is a lower bound of the clean WDM, indicating its robustness against noisy views.

Limitations of KL divergence in MI estimation.

Without loss of generality, let $f=g$ and consider $f=f^{\prime}\circ\phi$, where $\phi$ is a representation encoder and $f^{\prime}$ is a projection head [6]. Also, let $P^{\phi}=\phi_{\#}P$ be the pushforward measure of $P$ with respect to $\phi$. It has been shown [63, 20] that InfoNCE is a variational lower bound of the MI in the representation space, expressed with the KL divergence:

$$-\mathbb{E}\left[{\mathcal{L}}_{\textnormal{InfoNCE}}(\mathbf{s})\right]+\log(K)\leq I(\phi(X),\phi(V))=D_{\textnormal{KL}}(P^{\phi}_{XV},P^{\phi}_{X}P^{\phi}_{V}).$$

Intuitively, maximizing MI can be interpreted as maximizing the discrepancy between positive and negative pairs. However, prior works [64, 33] have identified theoretical limitations of maximizing MI using the KL divergence: because the KL divergence is not a metric, it is sensitive to small differences in data samples regardless of the geometry of the underlying data distributions. Therefore, the encoder $\phi$ may capture only limited information shared between $X$ and $V$, as long as the differences are sufficient to maximize the KL divergence. Note that this can be especially detrimental in the presence of noisy views, as the learner can quickly settle on spurious correlations in false positive pairs due to the absence of actual shared information.

RINCE is a lower bound of WDM.

We now establish RINCE as a lower bound of WDM [33], which is proposed as a replacement for the KL divergence in MI estimation.

WDM is based on the Wasserstein distance, a metric between probability distributions defined via an optimal transport cost. Letting $\mu,\nu\in\textnormal{Prob}({\mathbb{R}}^{d}\times{\mathbb{R}}^{d})$ be two probability measures, we define the Wasserstein-1 distance with a Euclidean cost function as

$${\mathcal{W}}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\mathbb{E}_{((X,V),(X^{\prime},V^{\prime}))\sim\pi}\left[\left\|X-X^{\prime}\right\|+\left\|V-V^{\prime}\right\|\right]$$

where $\Pi(\mu,\nu)$ denotes the set of couplings whose marginals are $\mu$ and $\nu$, respectively. By virtue of symmetry when $q=1$, if $\lambda>1/(K+1)$, the Kantorovich-Rubinstein duality [65] implies that (full theorem in Appendix A.3):

$$-\mathbb{E}\left[{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\mathbf{s})\right]\leq L\cdot I_{\mathcal{W}}(\phi(X),\phi(V)):=L\cdot{\mathcal{W}}(P_{XV}^{\phi},P_{X}^{\phi}P_{V}^{\phi}), \qquad (4)$$

where $I_{\mathcal{W}}(\phi(X),\phi(V))$ is the WDM defined in [33] and $L$ is a constant that depends on $t$, $\lambda$, and the Lipschitz constant of the projection head $f$. Note that we are not aware of any work that establishes a similar WDM bound for the InfoNCE loss.
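For reference, the duality invoked here is the standard dual form of the Wasserstein-1 distance over 1-Lipschitz functions (the constants folded into $L$ come from the full theorem in Appendix A.3):

$${\mathcal{W}}(\mu,\nu)=\sup_{\textnormal{Lip}(h)\leq 1}\;\mathbb{E}_{\mu}[h]-\mathbb{E}_{\nu}[h],$$

where the Lipschitz constant is taken with respect to the cost metric above.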

This provides another explanation of what makes RINCE robust against noisy views. Unlike InfoNCE, which maximizes the KL divergence, optimizing RINCE is equivalent to maximizing the WDM with a Lipschitz function. Equipped with a proper metric, RINCE measures the divergence between the two distributions $P_{XV}^{\phi}$ and $P_{X}^{\phi}P_{V}^{\phi}$ without being overly sensitive to individual sample noise, as long as the noise does not alter the geometry of the distributions. This also allows the encoder $\phi$ to learn more complete representations, as maximizing the Wasserstein distance requires the encoder to model not only the density ratio between the two distributions but also the optimal cost of transporting one distribution to the other.

RINCE is still a lower bound of WDM even with noise.

Finally, we show that RINCE still maximizes the noise-less WDM under additive noise, corroborating the robustness of RINCE. Let’s consider a simple mixture noise model:

$$P_{XV}^{\eta}=(1-\eta)P_{XV}+\eta P_{X}P_{V},$$

where $\eta$ is the noise rate and the noisy joint distribution $P_{XV}^{\eta}$ is a weighted sum of the noise-less positive distribution $P_{XV}$ and the negative distribution $P_{X}P_{V}$. Note that the marginals of $P_{XV}^{\eta}$ are still $P_{X}$ and $P_{V}$ by construction. The intuition behind the mixture noise model is that when we draw positive pairs from $P_{XV}^{\eta}$, we obtain false positives from $P_{X}P_{V}$ with probability $\eta$. Via the symmetry of the contrastive loss, we can extend bound (4) as follows (proof in Appendix A.4):

$$-\mathbb{E}_{P_{XV}^{\eta}}\left[{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\mathbf{s})\right]\leq(1-\eta)\cdot L\cdot I_{\mathcal{W}}(\phi(X),\phi(V)).$$

Compared to bound (4), the right hand side is reweighted by $(1-\eta)$. This implies that minimizing RINCE with noisy views still maximizes a lower bound of the noise-less WDM. Despite the simplicity of the analysis, it intuitively relates dependency measures and noisy views with interpretable bounds. It would be an interesting future direction to extend the analysis to more complicated noise models, e.g., $P_{XV}^{\eta}=(1-\eta)P_{XV}+\eta Q_{XV}$, where $Q$ is an unknown perturbation of the positive distribution.

5 Experiments

We evaluate RINCE on various contrastive learning scenarios involving images (CIFAR-10 [36], ImageNet [37]), videos (ACAV100M [27], Kinetics400 [38]) and graphs (TUDataset [39]). Empirically, we find that RINCE is insensitive to the choice of $\lambda$; we simply set $\lambda=0.01$ for all vision experiments and $\lambda=0.025$ for graph experiments.

Figure 4: Noisy CIFAR-10. We show the top-1 accuracy of RINCE with different values of $q$ across different noise rates $\eta$. Large $q$ ($q=0.5,1$) leads to better robustness, while small $q$ ($q=0.01$) performs similarly to InfoNCE ($q\rightarrow 0$).
Figure 5: t-SNE Visualization on CIFAR-10 with label noise. Colors indicate classes. RINCE leads to better class-wise separation than the InfoNCE loss in both noise-less and noisy cases.

5.1 Noisy CIFAR-10

We begin with controlled experiments on CIFAR-10 to verify the robustness of RINCE against synthetic noise by controlling the noise rate $\eta$. We consider two noise types:

Label noise: We start with the case of supervised contrastive learning [41] where positive pairs are different images of the same label. This allows us to control noise in the traditional sense, i.e., learning with noisy labels. Similar to [56], we flip the true labels to semantically related ones, e.g., CAT $\leftrightarrow$ DOG, with probability $\eta/2$. This is commonly referred to as class-dependent noise [55, 56, 57].
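A minimal sketch of this flipping procedure is shown below; only the CAT/DOG pair is stated in the text, so the remaining class pairs are illustrative assumptions (the exact mapping follows [56]):

import random

RELATED = {"cat": "dog", "dog": "cat",                    # stated in the text
           "truck": "automobile", "automobile": "truck",  # illustrative
           "deer": "horse", "horse": "deer",              # illustrative
           "bird": "airplane", "airplane": "bird"}        # illustrative

def flip_labels(labels, eta):
    # Flip each label to its semantically related class with probability eta/2.
    return [RELATED[y] if y in RELATED and random.random() < eta / 2 else y
            for y in labels]

print(flip_labels(["cat", "dog", "ship", "truck"], eta=0.8))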

Augmentation noise: We consider the self-supervised learning scenario and vary the crop size during data augmentation, similar to [20], i.e., after applying all the transformations as in SimCLR [6], images are further cropped to 1/5 of their original size with probability $\eta$. This effectively controls the noise rate, as the cropped patches will most likely be too small to contain any shared information.
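The sketch below illustrates one way to implement this noisy augmentation as a wrapper around the SimCLR transform; the interpretation of 1/5 as the side length and the resize back to the original resolution are our assumptions:

import random
import torchvision.transforms as T

class NoisyCrop:
    # With probability eta, crop the already-augmented view to roughly 1/5 of its
    # side length, destroying most of the information shared with the other view.
    def __init__(self, eta):
        self.eta = eta

    def __call__(self, img):  # img: PIL image after the SimCLR transformations
        if random.random() < self.eta:
            w, h = img.size
            patch = T.RandomCrop((max(1, h // 5), max(1, w // 5)))(img)
            return patch.resize((w, h))  # keep a fixed input size for batching
        return img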

Figure 4 shows the results of SimCLR trained with InfoNCE and RINCE for different choices of $q$ and $\lambda$. When augmentation noise is present, e.g., $\eta=0.4$, the accuracy of InfoNCE drops from 91.14% to 87.33%. In contrast, the robustness of RINCE is enhanced by increasing $q$, achieving 89.01% when $q=1.0$. InfoNCE also fails to address label noise and suffers a significant performance drop (93.38% $\rightarrow$ 87.11% when $\eta=0.8$). In comparison, RINCE retains its performance even when the noise rate is large (91.59% for $q=1.0$). In both cases, reducing the value of $q$ makes the performance of RINCE closer to InfoNCE, verifying our analysis in Lemma 1.

Figure 5 shows t-SNE visualizations [66] of representations learned with InfoNCE and RINCE ($q=1.0$) under different label noise rates. As the noise rate increases, representations of different classes start to tangle up for InfoNCE, while RINCE still achieves decent class-wise separation.

Method | Δ to SimCLR [6] | Top 1 | Top 5
Supervised [67] | N/A | 76.5 | -
SimSiam [43] | No negative pairs | 71.3 | -
BYOL [8] | No negative pairs | 74.3 | 91.6
Barlow Twins [9] | Redundancy reduction | 73.2 | 91.0
SwAV [44] | Cluster discrimination | 75.3 | -
SimCLR [6] | None | 69.3 | 89.0
+RINCE (Ours) | Symmetry controller q | 70.0 | 89.8
MoCo [5] | Momentum encoder | 60.6 | -
MoCov2 [7] | Momentum encoder | 71.1 | 90.1
MoCov3 [35] | Momentum encoder | 73.8 | -
+RINCE (Ours) | Symmetry controller q | 74.2 | 91.8
Table 1: Linear Evaluation on ImageNet. All the methods use ResNet-50 [67] as backbone architecture with 24M parameters. Note that RINCE subsumes InfoNCE when $q\rightarrow 0$.

5.2 Image Contrastive Learning

We verify our approach on the well-established ImageNet benchmark [37]. We adopt the same training protocol and hyperparameter settings as SimCLR [6] and MoCov3 [35] and simply replace InfoNCE with our RINCE loss ($q=0.1$ and $q=0.6$, respectively) as shown in Figure 2. Table 1 shows that RINCE improves over InfoNCE (SimCLR and MoCov3) by a non-trivial margin. We also include results from SOTA baselines, which improve over SimCLR by introducing a dynamic dictionary plus a momentum encoder (MoCo-v1/v2/v3 [5, 7, 35]), removing negative pairs with a stop-gradient trick (SimSiam [43], BYOL [8]), or using online cluster assignment (SwAV [44]). Our work is orthogonal to these developments, and the existing tricks can be applied alongside RINCE.

Figure 6 shows positive pairs from SimCLR augmentations and the corresponding positive scores $s^{+}=f(x)^{T}g(v)$ output by the trained RINCE model. Examples with lower positive scores contain pairs that are less informative about each other, while semantically meaningful pairs often have higher scores. This implies that positive scores are good noise detectors, and down-weighting samples with lower positive scores brings robustness during training, verifying our analysis in section 4.1.

5.3 Video Contrastive Learning

We examine our approach in the audio-visual learning scenario using two video datasets: Kinetics400 [38] and ACAV100M [27]. Here, we find that a simple $q$-warmup improves the stability of RINCE, i.e., $q$ starts at $0.01$ and increases linearly to $0.4$ by the last epoch. We apply this to all RINCE models in this section. As we show below, RINCE outperforms SOTA noise-robust contrastive methods [19, 24] on Kinetics400, while also providing better scalability and computational efficiency than InfoNCE.
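A sketch of the linear warm-up schedule we use (the exact endpoint handling is an implementation detail and an assumption here):

def q_schedule(epoch, total_epochs, q_start=0.01, q_end=0.4):
    # Linearly increase q from q_start at the first epoch to q_end at the last epoch.
    frac = epoch / max(1, total_epochs - 1)
    return q_start + frac * (q_end - q_start)

# e.g., with 400 pretraining epochs: q_schedule(0, 400) == 0.01, q_schedule(399, 400) == 0.4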

Method | Backbone | Finetune Input Size | HMDB | UCF
3D-RotNet [68] | R3D-18 | 16×112² | 33.7 | 62.9
ClipOrder [69] | R3D-18 | 16×112² | 30.9 | 72.4
DPC [70] | R3D-18 | 25×128² | 35.7 | 75.7
CBT [71] | S3D | 16×112² | 44.6 | 79.5
AVTS [72] | MC3-18 | 25×224² | 56.9 | 85.8
SeLaVi [73] | R(2+1)D-18 | 32×112² | 47.1 | 83.1
XDC [74] | R(2+1)D-18 | 32×224² | 52.6 | 86.8
Robust-xID [24] | R(2+1)D-18 | 32×224² | 55.0 | 85.6
Cross-AVID [19] | R(2+1)D-18 | 32×224² | 59.9 | 86.9
AVID+CMA [19] | R(2+1)D-18 | 32×224² | 60.8 | 87.5
InfoNCE (Ours) | R(2+1)D-18 | 32×224² | 57.8 | 88.6
RINCE (Ours) | R(2+1)D-18 | 32×224² | 61.6 | 88.8
GDT [23] | R(2+1)D-18 | 30×112² | 62.3 | 90.9
Note (GDT): Based on an advanced hierarchical data augmentation during pretraining.
Table 2: Kinetics400-pretrained performance on UCF101 and HMDB51 (top-1 accuracy). Ours use the same data augmentation approach as Cross-AVID and AVID+CMA, while GDT uses an advanced hierarchical sampling process.

Kinetics400

For a fair comparison to SOTA, we follow the same experimental protocol and hyperparameter settings as [19] and simply replace their loss functions with the InfoNCE and RINCE losses as shown in Figure 2. We use the same network architecture, i.e., an 18-layer R(2+1)D video encoder [75], a 9-layer VGG-like audio encoder, and a 3-layer MLP projection head producing 128-dim embeddings. We use the ADAM optimizer [76] for 400 epochs with a batch size of 4,096, a learning rate of 1e-4, and a weight decay of 1e-5. The pretrained encoders are finetuned on UCF-101 [77] and HMDB-51 [78] with clips composed of 32 frames of size 224 × 224. We defer the full experimental details to Appendix B.

Table 2 shows that RINCE outperforms most of the baseline approaches, including Robust-xID [24] and AVID+CMA [19], which are recent InfoNCE-based SOTA methods proposed to address the noisy view issue in audio-visual contrastive learning. Considering that the only change required is the simple replacement of InfoNCE with our RINCE loss, the results clearly show the effectiveness of our approach. The simplicity means we can easily apply RINCE to a variety of InfoNCE-based approaches, such as GDT [23], which uses advanced data augmentation mechanisms to achieve SOTA results.

ACAV100M

We conduct an in-depth analysis of RINCE on ACAV100M [27], a recent large-scale video dataset for self-supervised learning. Compared to Kinetics400 which is limited to human actions, ACAV100M contains videos “in-the-wild” exhibiting a wide variety of audio-visual patterns. The unconstrained nature of the dataset makes it a good benchmark to investigate the robustness of RINCE to various types of real-world noise, e.g., background music, overdubbed audio, studio narrations, etc.

Figure 6: Positive pairs and their scores. The positive scores $s^{+}\in[-1,1]$ are output by the trained RINCE model (temperature $=1$). Pairs that have lower scores are visually noisy, while informative pairs often have higher scores.

We focus on evaluating the (a) scalability and (b) convergence rate of RINCE, thereby answering the question: will it retain its edge over InfoNCE (a) even in the large-scale regime and (b) with a longer training time? We follow the same experimental setup as described above, but reduce the batch size to 512 and report the results only on the first split of UCF-101 to make our experiments tractable.

Figure 7 (a) shows the top-1 accuracy of RINCE and InfoNCE across different data scales and training epochs. RINCE outperforms InfoNCE by a large margin at every data scale. In terms of convergence rate, RINCE is comparable to or even outperforms fully-trained (200 epochs) InfoNCE models with only 100 or fewer epochs. Figure 7 (b) gives a closer look at the convergence at the 50K and 200K scales. Interestingly, InfoNCE saturates and even degenerates after epoch 150, while RINCE keeps improving. This verifies our analysis in section 4.1: InfoNCE can overfit noisy samples due to its exploration property, while RINCE downweights them and continues to obtain the learning signal from clean ones, achieving robustness against noise.

Figure 7: RINCE outperforms InfoNCE with fewer epochs across different scales. The results are based on ACAV100M-pretrained models transferred to UCF-101.

5.4 Graph Contrastive Learning

To see whether the modality-agnostic nature of RINCE applies beyond image and video data, we examine our approach on TUDataset [39], a popular benchmark suite for graph inference on molecules (BZR, NCI1), bioinformatics (PROTEINS), and social network (RDT-B, IMDB-B). Unlike vision datasets, data augmentation for graphs requires careful engineering with domain knowledge, limiting the applicability of InfoNCE-type contrastive objectives.

For a fair comparison, we follow the protocol of [40] and train graph isomorphism networks [79] with four types of data augmentation: node dropout, edge perturbation, attribute masking, and subgraph sampling. We train models using ADAM [76] for 20 epochs with a learning rate of 0.01 and report the mean and standard deviation over 5 independent trials. We set $q=0.1$ for all the experiments in this section.
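As an illustration of the simplest of these augmentations, the sketch below performs node dropout on a graph given as an edge list; this is our own toy example, whereas the benchmark implementation follows [40]:

import random

def drop_nodes(num_nodes, edges, drop_rate=0.2):
    # Node dropout: remove a random subset of nodes and all incident edges.
    kept = {v for v in range(num_nodes) if random.random() >= drop_rate}
    new_edges = [(u, v) for (u, v) in edges if u in kept and v in kept]
    return kept, new_edges

kept, new_edges = drop_nodes(5, [(0, 1), (1, 2), (2, 3), (3, 4)], drop_rate=0.2)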

Table 3 shows that RINCE outperforms three SOTA InfoNCE-based contrastive methods, GraphCL and JOAO/JOAOv2, setting new records on all four datasets. GraphCL applies different augmentations to different datasets, while JOAO/JOAOv2 require solving a bi-level optimization to choose the optimal augmentation per dataset. In contrast, we apply the same augmentation across all four datasets and achieve competitive performance, demonstrating the generality and robustness of our approach. In Figure 8, we control the perturbation rate by applying three augmentation types (node dropout, edge perturbation, attribute masking) to different percentages of nodes/edges. We show results on the two datasets most sensitive to augmentation. Again, RINCE consistently outperforms InfoNCE and has relatively smaller variance as the noise rate increases.

Methods | RDT-B | NCI1 | PROTEINS | DD
node2vec [80] | - | 54.9±1.6 | 57.5±3.6 | -
sub2vec [81] | 71.5±0.4 | 52.8±1.5 | 53.0±5.6 | -
graph2vec [82] | 75.8±1.0 | 73.2±1.8 | 73.3±2.1 | -
InfoGraph [47] | 82.5±1.4 | 76.2±1.1 | 74.4±0.3 | 72.9±1.8
GraphCL [40] | 89.5±0.8 | 77.9±0.4 | 74.4±0.5 | 78.6±0.4
JOAO [83] | 85.3±1.4 | 78.1±0.5 | 74.6±0.4 | 77.3±0.5
JOAOv2 [83] | 86.4±1.5 | 78.4±0.5 | 74.1±1.1 | 77.4±1.2
InfoNCE (Ours) | 89.9±0.4 | 78.2±0.8 | 74.4±0.5 | 78.6±0.8
RINCE (Ours) | 90.9±0.6 | 78.6±0.4 | 74.7±0.8 | 78.7±0.4
Note: InfoNCE (Ours) corresponds to GraphCL [40] but uses the same data augmentation as RINCE.
Table 3: Self-supervised representation learning on TUDataset: The baseline results are excerpted from the published papers.
Figure 8: Performance vs. Perturbation Rate: We increase the perturbation rate of node dropping, edge perturbation, and attribute masking from 10% to 60%. RINCE outperforms InfoNCE in terms of both accuracy and variance as the perturbation rate increases.

6 Conclusion

We presented Robust InfoNCE (RINCE) as a simple drop-in replacement for the InfoNCE loss in contrastive learning. Despite its simplicity, it comes with strong theoretical justifications and guarantees against noisy views. Empirically, we provided extensive results across image, video, and graph contrastive learning scenarios demonstrating its robustness against a variety of realistic noise patterns.

Acknowledgements

This work was in part supported by NSF Convergence Award 6944221 and ONR MURI 6942251.

References

  • Chopra et al. [2005] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1.   IEEE, 2005, pp. 539–546.
  • Hadsell et al. [2006] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2.   IEEE, 2006, pp. 1735–1742.
  • Oord et al. [2018] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • Tian et al. [2020a] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16.   Springer, 2020, pp. 776–794.
  • He et al. [2020] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
  • Chen et al. [2020a] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • Chen et al. [2020b] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” in 2020 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’20).   IEEE, 2020.
  • Grill et al. [2020] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” Advances in neural information processing systems, 2020.
  • Zbontar et al. [2021] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in International Conference on Machine Learning.   PMLR, 2021.
  • Logeswaran and Lee [2018] L. Logeswaran and H. Lee, “An efficient framework for learning sentence representations,” in International Conference on Learning Representations, 2018.
  • Chuang et al. [2020] C.-Y. Chuang, J. Robinson, L. Yen-Chen, A. Torralba, and S. Jegelka, “Debiased contrastive learning,” Advances in neural information processing systems, 2020.
  • Robinson et al. [2021a] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka, “Contrastive learning with hard negative samples,” in International Conference on Learning Representations, 2021.
  • Giorgi et al. [2020] J. M. Giorgi, O. Nitski, G. D. Bader, and B. Wang, “Declutr: Deep contrastive learning for unsupervised textual representations,” Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 2020.
  • Baevski et al. [2020] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in neural information processing systems, 2020.
  • Saeed et al. [2021] A. Saeed, D. Grangier, and N. Zeghidour, “Contrastive learning of general-purpose audio representations,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 3875–3879.
  • Wang and Oord [2021] L. Wang and A. v. d. Oord, “Multi-format contrastive learning of audio representations,” arXiv preprint arXiv:2103.06508, 2021.
  • Miech et al. [2020] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman, “End-to-end learning of visual representations from uncurated instructional videos,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9879–9889.
  • Ma et al. [2020] S. Ma, Z. Zeng, D. McDuff, and Y. Song, “Active contrastive learning of audio-visual video representations,” in International Conference on Learning Representations, 2020.
  • Morgado et al. [2021a] P. Morgado, N. Vasconcelos, and I. Misra, “Audio-visual instance discrimination with cross-modal agreement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 475–12 486.
  • Tian et al. [2020b] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning?” in Advances in neural information processing systems, 2020.
  • Sermanet et al. [2018] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain, “Time-contrastive networks: Self-supervised learning from video,” in 2018 IEEE international conference on robotics and automation, 2018.
  • Han et al. [2020] T. Han, W. Xie, and A. Zisserman, “Self-supervised co-training for video representation learning,” Advances in neural information processing systems, 2020.
  • Patrick et al. [2021] M. Patrick, Y. M. Asano, P. Kuznetsova, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi, “On compositions of transformations in contrastive self-supervised learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • Morgado et al. [2021b] P. Morgado, I. Misra, and N. Vasconcelos, “Robust audio-visual instance discrimination,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 934–12 945.
  • Arjovsky et al. [2019] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz, “Invariant risk minimization,” arXiv preprint arXiv:1907.02893, 2019.
  • Jing et al. [2021] L. Jing, P. Vincent, Y. LeCun, and Y. Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” arXiv preprint arXiv:2110.09348, 2021.
  • Lee et al. [2021] S. Lee, J. Chung, Y. Yu, G. Kim, T. Breuel, G. Chechik, and Y. Song, “Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 274–10 284.
  • Natarajan et al. [2013] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, “Learning with noisy labels,” Advances in neural information processing systems, vol. 26, pp. 1196–1204, 2013.
  • Ghosh et al. [2015] A. Ghosh, N. Manwani, and P. Sastry, “Making risk minimization tolerant to label noise,” Neurocomputing, vol. 160, pp. 93–107, 2015.
  • Ghosh et al. [2017] A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • Li et al. [2017] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learning from noisy labels with distillation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • Veit et al. [2017] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie, “Learning from noisy large-scale datasets with minimal supervision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • Ozair et al. [2019] S. Ozair, C. Lynch, Y. Bengio, A. v. d. Oord, S. Levine, and P. Sermanet, “Wasserstein dependency measure for representation learning,” arXiv preprint arXiv:1903.11780, 2019.
  • Chuang et al. [2021] C.-Y. Chuang, Y. Mroueh, K. Greenewald, A. Torralba, and S. Jegelka, “Measuring generalization with optimal transport,” in Advances in Neural Information Processing Systems, 2021.
  • Chen et al. [2021] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” arXiv preprint arXiv:2104.02057, 2021.
  • Krizhevsky et al. [2009] A. Krizhevsky et al., “Learning multiple layers of features from tiny images.”   Citeseer, 2009.
  • Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • Kay et al. [2017] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • Morris et al. [2020] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann, “Tudataset: A collection of benchmark datasets for learning with graphs,” arXiv preprint arXiv:2007.08663, 2020.
  • You et al. [2020] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” Advances in Neural Information Processing Systems, vol. 33, pp. 5812–5823, 2020.
  • Khosla et al. [2020] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in neural information processing systems, 2020.
  • Hjelm et al. [2018] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, “Learning deep representations by mutual information estimation and maximization,” arXiv preprint arXiv:1808.06670, 2018.
  • Chen and He [2021] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Caron et al. [2020] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Kalantidis et al. [2020] Y. Kalantidis, M. B. Sariyildiz, N. Pion, P. Weinzaepfel, and D. Larlus, “Hard negative mixing for contrastive learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Robinson et al. [2021b] J. Robinson, L. Sun, K. Yu, K. Batmanghelich, S. Jegelka, and S. Sra, “Can contrastive learning avoid shortcut solutions?” in Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Veličković et al. [2018] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax,” arXiv preprint arXiv:1809.10341, 2018.
  • Hassani and Khasahmadi [2020] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in International Conference on Machine Learning.   PMLR, 2020, pp. 4116–4126.
  • Sukhbaatar et al. [2015] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.
  • Xiao et al. [2015] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2691–2699.
  • Liu and Tao [2015] T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Transactions on pattern analysis and machine intelligence, vol. 38, no. 3, pp. 447–461, 2015.
  • Patrini et al. [2017] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making deep neural networks robust to label noise: A loss correction approach,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1944–1952.
  • Jiang et al. [2018] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels,” in International Conference on Machine Learning, 2018.
  • Ren et al. [2018] M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4334–4343.
  • Han et al. [2018] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in neural information processing systems, 2018.
  • Zhang and Sabuncu [2018] Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” Advances in neural information processing systems, 2018.
  • Wang et al. [2019] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 322–330.
  • Bachman et al. [2019] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” arXiv preprint arXiv:1906.00910, 2019.
  • Gutmann and Hyvärinen [2010] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics.   JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
  • Wu et al. [2018] Z. Wu, Y. Xiong, S. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance-level discrimination,” arXiv preprint arXiv:1805.01978, 2018.
  • Arpit et al. [2017] D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., “A closer look at memorization in deep networks,” in International Conference on Machine Learning.   PMLR, 2017, pp. 233–242.
  • Yu et al. [2019] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama, “How does disagreement help generalization against label corruption?” in International Conference on Machine Learning.   PMLR, 2019, pp. 7164–7173.
  • Poole et al. [2019] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker, “On variational bounds of mutual information,” in International Conference on Machine Learning.   PMLR, 2019, pp. 5171–5180.
  • McAllester and Stratos [2020] D. McAllester and K. Stratos, “Formal limitations on the measurement of mutual information,” in International Conference on Artificial Intelligence and Statistics, 2020.
  • Hörmander et al. [2006] F. H. N. H. L. Hörmander, N. S. B. Totaro, and A. V. M. Waldschmidt, “Grundlehren der mathematischen wissenschaften 332.”   Springer, 2006.
  • Van der Maaten and Hinton [2008] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • Jing and Tian [2018] L. Jing and Y. Tian, “Self-supervised spatiotemporal feature learning by video geometric transformations,” arXiv preprint arXiv:1811.11387, vol. 2, no. 7, p. 8, 2018.
  • Xu et al. [2019] D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang, “Self-supervised spatiotemporal learning via video clip order prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10 334–10 343.
  • Han et al. [2019] T. Han, W. Xie, and A. Zisserman, “Video representation learning by dense predictive coding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
  • Sun et al. [2019] C. Sun, F. Baradel, K. Murphy, and C. Schmid, “Learning video representations using contrastive bidirectional transformer,” arXiv preprint arXiv:1906.05743, 2019.
  • Korbar et al. [2018] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” in Advances in Neural Information Processing Systems (NeurIPS), 2018.
  • Asano et al. [2020] Y. M. Asano, M. Patrick, C. Rupprecht, and A. Vedaldi, “Labelling unlabelled videos from scratch with multi-modal self-supervision,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Alwassel et al. [2020] H. Alwassel, D. Mahajan, B. Korbar, L. Torresani, B. Ghanem, and D. Tran, “Self-supervised learning by cross-modal audio-video clustering,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Tran et al. [2018] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
  • Kingma and Ba [2015] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015.
  • Soomro et al. [2012] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • Kuehne et al. [2011] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: a large video database for human motion recognition,” in 2011 International Conference on Computer Vision.   IEEE, 2011, pp. 2556–2563.
  • Xu et al. [2018] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2018.
  • Grover and Leskovec [2016] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
  • Adhikari et al. [2018] B. Adhikari, Y. Zhang, N. Ramakrishnan, and B. A. Prakash, “Sub2vec: Feature learning for subgraphs,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining.   Springer, 2018, pp. 170–182.
  • Narayanan et al. [2017] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal, “graph2vec: Learning distributed representations of graphs,” arXiv preprint arXiv:1707.05005, 2017.
  • You et al. [2021] Y. You, T. Chen, Y. Shen, and Z. Wang, “Graph contrastive learning automated,” arXiv preprint arXiv:2106.07594, 2021.
  • Falcon and Cho [2020] W. Falcon and K. Cho, “A framework for contrastive self-supervised learning and designing a new approach,” arXiv preprint arXiv:2009.00104, 2020.

Appendix A Theory and Proofs

A.1 Proof of Lemma 1: From RINCE to InfoNCE

We show that RINCE becomes asymptotically equivalent to InfoNCE as $q\rightarrow 0$. In particular, we prove the convergence of both RINCE and its derivatives in the limit $q\rightarrow 0$.

Proof.

We first prove convergence in function space using L'Hôpital's rule:

\begin{align*}
\lim_{q\rightarrow 0}{\mathcal{L}}^{\lambda,q}_{\textnormal{RINCE}}(\textbf{s})
&=\lim_{q\rightarrow 0}\frac{-e^{q\cdot s^{+}}}{q}+\frac{\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q}}{q}\\
&=\lim_{q\rightarrow 0}\frac{1-e^{q\cdot s^{+}}}{q}+\frac{-1+\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q}}{q}\\
&=\lim_{q\rightarrow 0}\frac{1-e^{q\cdot s^{+}}}{q}+\lim_{q\rightarrow 0}\frac{-1+\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q}}{q}\\
&=-\log(e^{s^{+}})+\log\left(\lambda\left(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}\right)\right)&&\text{(L'Hôpital's rule)}\\
&=-\log\frac{e^{s^{+}}}{\lambda\left(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}\right)}\\
&={\mathcal{L}}_{\textnormal{InfoNCE}}(\textbf{s})+\log(\lambda).
\end{align*}

To prove convergence of the derivatives, we differentiate with respect to the positive score $s^{+}$ and the negative scores $s_{i}^{-}$. We begin with RINCE:

\begin{align*}
\text{(positive score)}\qquad
\lim_{q\rightarrow 0}\frac{\partial}{\partial s^{+}}{\mathcal{L}}^{\lambda,q}_{\textnormal{RINCE}}(\textbf{s})
&=\lim_{q\rightarrow 0}\frac{\partial}{\partial s^{+}}\frac{-e^{q\cdot s^{+}}}{q}+\frac{\partial}{\partial s^{+}}\frac{\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q}}{q}\\
&=\lim_{q\rightarrow 0}-e^{q\cdot s^{+}}+\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q-1}\cdot\lambda\cdot e^{s^{+}}\\
&=-1+\frac{e^{s^{+}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}};\\
\text{(negative score)}\qquad
\lim_{q\rightarrow 0}\frac{\partial}{\partial s_{i}^{-}}{\mathcal{L}}^{\lambda,q}_{\textnormal{RINCE}}(\textbf{s})
&=\lim_{q\rightarrow 0}\frac{\partial}{\partial s_{i}^{-}}\frac{\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q}}{q}\\
&=\lim_{q\rightarrow 0}\left(\lambda\cdot(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}})\right)^{q-1}\cdot\lambda\cdot e^{s_{i}^{-}}\\
&=\frac{e^{s_{i}^{-}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}.
\end{align*}

These limits match the corresponding derivatives of InfoNCE:

\begin{align*}
\text{(positive score)}\qquad
\frac{\partial}{\partial s^{+}}{\mathcal{L}}_{\textnormal{InfoNCE}}(\textbf{s})
&=-\frac{\partial}{\partial s^{+}}\log\frac{e^{s^{+}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}\\
&=-\frac{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}{e^{s^{+}}}\cdot\frac{\left(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}\right)\cdot e^{s^{+}}-e^{2s^{+}}}{\left(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}\right)^{2}}\\
&=-1+\frac{e^{s^{+}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}};\\
\text{(negative score)}\qquad
\frac{\partial}{\partial s_{i}^{-}}{\mathcal{L}}_{\textnormal{InfoNCE}}(\textbf{s})
&=-\frac{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}{e^{s^{+}}}\cdot\frac{-e^{s^{+}}\cdot e^{s_{i}^{-}}}{\left(e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}\right)^{2}}\\
&=\frac{e^{s_{i}^{-}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}.
\end{align*}
∎
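As a quick numerical sanity check of Lemma 1 (our illustration, not part of the proof), the sketch below evaluates both losses for decreasing $q$; the helper names, the scores, the number of negatives, and $\lambda$ are arbitrary illustrative choices.

import math, random

def rince(s_pos, s_negs, lam, q):
    # L_RINCE^{lam,q}(s) = -e^{q s+}/q + (lam * (e^{s+} + sum_i e^{s_i^-}))^q / q
    neg_sum = sum(math.exp(s) for s in s_negs)
    return -math.exp(q * s_pos) / q + (lam * (math.exp(s_pos) + neg_sum)) ** q / q

def infonce(s_pos, s_negs):
    # L_InfoNCE(s) = -log( e^{s+} / (e^{s+} + sum_i e^{s_i^-}) )
    neg_sum = sum(math.exp(s) for s in s_negs)
    return -math.log(math.exp(s_pos) / (math.exp(s_pos) + neg_sum))

random.seed(0)
s_pos = random.uniform(-1, 1)
s_negs = [random.uniform(-1, 1) for _ in range(8)]
lam = 0.5
for q in (0.5, 0.1, 0.01, 0.001):
    gap = rince(s_pos, s_negs, lam, q) - (infonce(s_pos, s_negs) + math.log(lam))
    print(f"q={q:<6} RINCE - (InfoNCE + log lam) = {gap:+.6f}")  # -> 0 as q -> 0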

A.2 Noisy Risk Bound for Exponential Loss

We justify the robustness of RINCE when $q=1$ by extending the theorem of Ghosh et al. [29] to the exponential loss. The proof technique can be applied to other bounded symmetric classification losses.

Corollary 2.

Consider the setting of Ghosh et al. [29] and the exponential loss function ${\mathcal{L}}(s,y)=-ye^{s}$. Let $f_{\eta}^{\ast}=\operatorname*{arg\,inf}_{f\in{\mathcal{F}}}R_{{\mathcal{L}}}^{\eta}(f)$ be the minimizer of the noisy risk and $\epsilon=\inf_{f\in{\mathcal{F}}}R_{{\mathcal{L}}}(f)$ be the optimal risk. If $\eta_{x}\leq\eta_{\max}<0.5$ for all $x\in{\mathcal{X}}$ and the prediction score is bounded by $s_{\max}$, then $R_{{\mathcal{L}}}(f_{\eta}^{\ast})\leq(\epsilon+2\eta_{\max}e^{s_{\max}})/(1-2\eta_{\max})$.

Proof.

Consider a binary classification loss with the following form:

\[\tilde{{\mathcal{L}}}_{x}(f(x),y)=B+{\mathcal{L}}_{x}(f(x),y)=B-y\cdot e^{f(x)}\geq 0,\]

where the prediction score $f(x)$ is bounded by $s_{\max}=\log(B)$. Note that the boundedness assumption holds for general representation learning on the hypersphere, where the prediction score is the inner product between normalized feature vectors. Importantly, the loss satisfies

\[\tilde{{\mathcal{L}}}(f(x),1)+\tilde{{\mathcal{L}}}(f(x),-1)=2B.\]

By construction, the optimal risk takes the following value:

\[\inf_{f\in{\mathcal{F}}}R_{\tilde{{\mathcal{L}}}}(f)=\inf_{f\in{\mathcal{F}}}\mathbb{E}_{x\sim\mu}[\tilde{{\mathcal{L}}}(f(x),y_{x})]=\epsilon+B=:\tilde{\epsilon},\]

and $f^{\ast}=\operatorname*{arg\,inf}_{f\in{\mathcal{F}}}R_{\tilde{{\mathcal{L}}}}(f)$. Note that $f^{\ast}$ is also a minimizer with respect to the original loss ${\mathcal{L}}$ (i.e., $f^{\ast}=\operatorname*{arg\,inf}_{f\in{\mathcal{F}}}R_{{\mathcal{L}}}(f)$), since an additive constant does not change the minimizers. Expanding the noisy risk gives

\begin{align*}
R_{\tilde{{\mathcal{L}}}}^{\eta}(f)
&=\mathbb{E}_{x\sim\mu}\left[(1-\eta_{x})\tilde{{\mathcal{L}}}(f(x),y_{x})+\eta_{x}\tilde{{\mathcal{L}}}(f(x),-y_{x})\right]\\
&=\mathbb{E}_{x\sim\mu}\left[(1-\eta_{x})\tilde{{\mathcal{L}}}(f(x),y_{x})+\eta_{x}\left(2B-\tilde{{\mathcal{L}}}(f(x),y_{x})\right)\right]&&\text{(symmetry)}\\
&=\mathbb{E}_{x\sim\mu}\left[(1-2\eta_{x})\tilde{{\mathcal{L}}}(f(x),y_{x})\right]+2B\,\mathbb{E}_{x\sim\mu}[\eta_{x}].
\end{align*}

Let $f_{\eta}^{\ast}=\operatorname*{arg\,inf}_{f\in{\mathcal{F}}}R_{{\mathcal{L}}}^{\eta}(f)=\operatorname*{arg\,inf}_{f\in{\mathcal{F}}}R_{\tilde{{\mathcal{L}}}}^{\eta}(f)$. Then

\[R_{\tilde{{\mathcal{L}}}}^{\eta}(f^{\ast})-R_{\tilde{{\mathcal{L}}}}^{\eta}(f_{\eta}^{\ast})=\mathbb{E}_{x\sim\mu}\left[(1-2\eta_{x})\left(\tilde{{\mathcal{L}}}(f^{\ast}(x),y_{x})-\tilde{{\mathcal{L}}}(f_{\eta}^{\ast}(x),y_{x})\right)\right]\geq 0\]

since $f_{\eta}^{\ast}$ is the minimizer of $R_{\tilde{{\mathcal{L}}}}^{\eta}$, which implies that

\[\mathbb{E}_{x\sim\mu}[(1-2\eta_{x})\tilde{{\mathcal{L}}}(f_{\eta}^{\ast}(x),y_{x})]\leq\mathbb{E}_{x\sim\mu}[(1-2\eta_{x})\tilde{{\mathcal{L}}}(f^{\ast}(x),y_{x})]\leq\tilde{\epsilon},\]

where the second inequality holds since $0<1-2\eta_{x}\leq 1$ by assumption and the loss is non-negative. Letting $\eta_{\max}=\sup_{x\in{\mathcal{X}}}\eta_{x}$, we have

\[(1-2\eta_{\max})\,\mathbb{E}_{x\sim\mu}[\tilde{{\mathcal{L}}}(f_{\eta}^{\ast}(x),y_{x})]\leq\tilde{\epsilon},\]

since $\eta_{x}\leq\eta_{\max}$ and the loss is non-negative, which implies

\[R_{\tilde{{\mathcal{L}}}}(f_{\eta}^{\ast})\leq\frac{\tilde{\epsilon}}{1-2\eta_{\max}}.\]

Finally, we recover the original exponential loss by removing the additive constant $B$. Plugging in $R_{\tilde{{\mathcal{L}}}}=B+R_{{\mathcal{L}}}$ and $\tilde{\epsilon}=\epsilon+B$, we have

\[B+R_{{\mathcal{L}}}(f_{\eta}^{\ast})\leq\frac{\epsilon+B}{1-2\eta_{\max}},\]

which implies

\[R_{{\mathcal{L}}}(f_{\eta}^{\ast})\leq\frac{\epsilon+B}{1-2\eta_{\max}}-B=\frac{\epsilon+2B\eta_{\max}}{1-2\eta_{\max}}.\]

For the exponential loss, setting $B=e^{s_{\max}}$ completes the proof. ∎

For instance, when the noise level is $40\%$, we have $R_{{\mathcal{L}}}(f_{\eta}^{\ast})\leq 5\epsilon+4B$. Note that in our case the prediction score is bounded by $1/t$, since the representations are projected onto the unit hypersphere.
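To make the bound concrete, the following sketch (ours; the helper name and the numbers are purely illustrative) evaluates Corollary 2 for a given noise level and score bound.

import math

def noisy_risk_bound(eps, eta_max, s_max):
    # Corollary 2: clean-risk bound for the minimizer of the noisy risk,
    # with B = e^{s_max} bounding the exponential loss.
    B = math.exp(s_max)
    return (eps + 2 * eta_max * B) / (1 - 2 * eta_max)

eps, t = 0.1, 0.5                        # illustrative optimal risk and temperature
print(noisy_risk_bound(eps, eta_max=0.4, s_max=1.0 / t))
print(5 * eps + 4 * math.exp(1.0 / t))   # the 40% noise case: 5*eps + 4*B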

A.3 Lower bound of Wasserstein Distance

We now establish RINCE as a lower bound of WDM [33]. WDM is based on the Wasserstein distance, a metric between probability distributions defined via an optimal transport cost. Letting $\mu,\nu\in\textnormal{Prob}({\mathbb{R}}^{d}\times{\mathbb{R}}^{d})$ be two probability measures, we define the Wasserstein-1 distance with a Euclidean cost function as

\[{\mathcal{W}}(\mu,\nu)=\inf_{\pi\in\Pi(\mu,\nu)}\mathbb{E}_{((X,V),(X^{\prime},V^{\prime}))\sim\pi}\left[\left\|X-X^{\prime}\right\|+\left\|V-V^{\prime}\right\|\right],\]

where $\Pi(\mu,\nu)$ denotes the set of couplings whose marginals are $\mu$ and $\nu$, respectively. We are now ready to state our theorem.

Theorem 3.

If $\lambda K>1-\lambda$ and $f$ projects the representation onto the unit hypersphere, we have

\[-\mathbb{E}\left[{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\textbf{s})\right]\leq\frac{\textnormal{Lip}(f)\cdot(1-\lambda)\cdot e^{1/t}}{t}\,{\mathcal{W}}_{1}(P_{XV}^{\phi},P_{X}^{\phi}P_{V}^{\phi}).\]
Proof.

By linearity of expectation, we can bound the negated symmetric loss as follows:

\begin{align*}
-\mathbb{E}\left[{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\textbf{s})\right]
&=\mathbb{E}_{\substack{x\sim P_{X}\\ v\sim P_{V|X=x}\\ v_{i}\sim P_{V}}}\left[(1-\lambda)e^{f(\phi(x))^{T}f(\phi(v))/t}-\lambda\sum_{i=1}^{K}e^{f(\phi(x))^{T}f(\phi(v_{i}))/t}\right]\\
&=\mathbb{E}_{(x,v)\sim P_{XV}}\left[(1-\lambda)e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-\mathbb{E}_{\substack{x\sim P_{X}\\ v_{i}\sim P_{V}}}\left[\lambda\sum_{i=1}^{K}e^{f(\phi(x))^{T}f(\phi(v_{i}))/t}\right]\\
&=\mathbb{E}_{(x,v)\sim P_{XV}}\left[(1-\lambda)e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-\lambda K\,\mathbb{E}_{\substack{x\sim P_{X}\\ v\sim P_{V}}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]\\
&\leq(1-\lambda)\left(\mathbb{E}_{(x,v)\sim P_{XV}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-\mathbb{E}_{\substack{x\sim P_{X}\\ v\sim P_{V}}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]\right),
\end{align*}

where the last inequality follows from $\lambda K>1-\lambda$. Note that $\frac{-1}{t}\leq s\leq\frac{1}{t}$, which implies $|\nabla_{s}e^{s}|\leq e^{1/t}$ on this interval. Therefore, by the mean value theorem, we have

\begin{align*}
&\left|e^{f(\phi(x))^{T}f(\phi(v))/t}-e^{f(\phi(x^{\prime}))^{T}f(\phi(v^{\prime}))/t}\right|\\
&\leq\frac{e^{1/t}}{t}\left|\langle f(\phi(x)),f(\phi(v))\rangle-\langle f(\phi(x^{\prime})),f(\phi(v^{\prime}))\rangle\right|&&\text{(mean value theorem)}\\
&=\frac{e^{1/t}}{t}\left|\langle f(\phi(x))-f(\phi(x^{\prime})),f(\phi(v))\rangle+\langle f(\phi(x^{\prime})),f(\phi(v))-f(\phi(v^{\prime}))\rangle\right|\\
&\leq\frac{e^{1/t}}{t}\left(\left|\langle f(\phi(x))-f(\phi(x^{\prime})),f(\phi(v))\rangle\right|+\left|\langle f(\phi(x^{\prime})),f(\phi(v))-f(\phi(v^{\prime}))\rangle\right|\right)\\
&\leq\frac{e^{1/t}}{t}\left(\|f(\phi(x))-f(\phi(x^{\prime}))\|\,\|f(\phi(v))\|+\|f(\phi(v))-f(\phi(v^{\prime}))\|\,\|f(\phi(x^{\prime}))\|\right)&&\text{(Cauchy–Schwarz)}\\
&=\frac{e^{1/t}}{t}\left(\|f(\phi(x))-f(\phi(x^{\prime}))\|+\|f(\phi(v))-f(\phi(v^{\prime}))\|\right)&&\text{($f(\phi(\cdot))$ has unit norm)}\\
&\leq\frac{\textnormal{Lip}(f)\cdot e^{1/t}}{t}\left(\|\phi(x)-\phi(x^{\prime})\|+\|\phi(v)-\phi(v^{\prime})\|\right)\\
&=\frac{\textnormal{Lip}(f)\cdot e^{1/t}}{t}\,d\big((\phi(x),\phi(v)),(\phi(x^{\prime}),\phi(v^{\prime}))\big).
\end{align*}

We can see that the Lipschitz constant of $(z,w)\mapsto\exp(f(z)^{T}f(w)/t)$ with respect to the metric $d$ is bounded by $\frac{\textnormal{Lip}(f)\cdot e^{1/t}}{t}$. Therefore, by Kantorovich–Rubinstein duality, we have

\begin{align*}
-\mathbb{E}\left[{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\textbf{s})\right]
&\leq(1-\lambda)\left(\mathbb{E}_{(x,v)\sim P_{XV}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-\mathbb{E}_{\substack{x\sim P_{X}\\ v\sim P_{V}}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]\right)\\
&\leq\frac{\textnormal{Lip}(f)\cdot(1-\lambda)\cdot e^{1/t}}{t}\,{\mathcal{W}}_{1}(\phi_{\#}P_{XV},\phi_{\#}P_{X}\cdot\phi_{\#}P_{V}).
\end{align*}
∎

A.4 Noisy Wasserstein Dependency Measure

The result follows by combining Corollary 2 and Theorem 3. If $\lambda\geq\frac{\eta K-\eta+1}{\eta K-\eta+1+K}$, then by the assumption of the additive noise model and the symmetry of the loss, we have

\begin{align*}
-\mathbb{E}\left[{\mathcal{L}}_{\textnormal{RINCE}}^{\lambda,q=1}(\textbf{s})\right]
&=\mathbb{E}_{\substack{(x,v)\sim P_{XV}^{\eta}\\ v_{i}\sim P_{V}}}\left[(1-\lambda)e^{f(\phi(x))^{T}f(\phi(v))/t}-\lambda\sum_{i=1}^{K}e^{f(\phi(x))^{T}f(\phi(v_{i}))/t}\right]\\
&=\mathbb{E}_{(x,v)\sim P_{XV}^{\eta}}\left[(1-\lambda)e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-\mathbb{E}_{\substack{x\sim P_{X}\\ v_{i}\sim P_{V}}}\left[\lambda\sum_{i=1}^{K}e^{f(\phi(x))^{T}f(\phi(v_{i}))/t}\right]\\
&=(1-\lambda)(1-\eta)\,\mathbb{E}_{(x,v)\sim P_{XV}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-K(\lambda-\eta+\eta\lambda)\,\mathbb{E}_{\substack{x\sim P_{X}\\ v\sim P_{V}}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]&&\text{(symmetry)}\\
&\leq(1-\lambda)(1-\eta)\left(\mathbb{E}_{(x,v)\sim P_{XV}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]-\mathbb{E}_{\substack{x\sim P_{X}\\ v\sim P_{V}}}\left[e^{f(\phi(x))^{T}f(\phi(v))/t}\right]\right)&&\text{($\lambda\geq\frac{\eta K-\eta+1}{\eta K-\eta+1+K}$)}\\
&\leq(1-\eta)\cdot\frac{\textnormal{Lip}(f)\cdot(1-\lambda)\cdot e^{1/t}}{t}\,{\mathcal{W}}_{1}(\phi_{\#}P_{XV},\phi_{\#}P_{X}\cdot\phi_{\#}P_{V})\\
&=(1-\eta)\cdot L\cdot I_{\mathcal{W}}(\phi(X),\phi(V)).
\end{align*}
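As a practical aside (our illustration, derived only from the condition stated above; the helper name is ours), the sketch below evaluates the smallest $\lambda$ for which the derivation applies. Larger noise rates $\eta$ require a larger $\lambda$, consistent with the bound.

def lambda_threshold(eta, K):
    # Smallest lambda satisfying lambda >= (eta*K - eta + 1) / (eta*K - eta + 1 + K).
    num = eta * K - eta + 1
    return num / (num + K)

for K in (64, 256, 1024):
    for eta in (0.0, 0.2, 0.5):
        print(f"K={K:4d}  eta={eta:.1f}  lambda >= {lambda_threshold(eta, K):.3f}")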

A.5 InfoNCE is not symmetric

Note that by taking the derivative with respect to the prediction score $s$, the symmetry condition is equivalent to $\frac{\partial{\mathcal{L}}(s,1)}{\partial s}+\frac{\partial{\mathcal{L}}(s,-1)}{\partial s}=0\;\forall s\in{\mathbb{R}}$. Recall from A.1 that the InfoNCE derivatives are

\begin{align*}
\frac{\partial{\mathcal{L}}_{\textnormal{InfoNCE}}(\textbf{s})}{\partial s^{+}}
&=-1+\frac{e^{s^{+}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}},
&\frac{\partial{\mathcal{L}}_{\textnormal{InfoNCE}}(\textbf{s})}{\partial s_{i}^{-}}
&=\frac{e^{s_{i}^{-}}}{e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}}.
\end{align*}

Within a batch of data, the gradients with respect to $s^{+}$ and $s_{i}^{-}$ are entangled through the shared normalization term $e^{s^{+}}+\sum_{i=1}^{K}e^{s_{i}^{-}}$: their sum depends on all scores in the batch rather than being a constant, so InfoNCE fails to meet the symmetry condition.
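For intuition, here is a minimal numerical illustration (ours) of the symmetry condition in the plain binary setting: the exponential loss used by RINCE satisfies ${\mathcal{L}}(s,1)+{\mathcal{L}}(s,-1)=\text{const}$, whereas the logistic loss, used here only as a simple stand-in for softmax-based objectives such as InfoNCE, does not.

import math

exp_loss = lambda s, y: -y * math.exp(s)               # symmetric: sums to 0 for every s
logistic = lambda s, y: math.log1p(math.exp(-y * s))   # not symmetric: sum varies with s

for s in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"s={s:+.1f}  exp: {exp_loss(s, 1) + exp_loss(s, -1):+.3f}"
          f"  logistic: {logistic(s, 1) + logistic(s, -1):+.3f}")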

Appendix B Experiment Details

B.1 CIFAR-10

We follow the experimental setup in [11], where the SimCLR [6] models are trained with the Adam optimizer for 500 epochs with learning rate 0.001 and weight decay 1e-6. The encoder is a ResNet-50 and the dimension of the latent vector is 128. The temperature is set to $t=0.5$. The models are then evaluated by training a linear classifier for 100 epochs with learning rate 0.001 and weight decay 1e-6. We use the PyTorch code in Figure 9 to generate the data augmentation noise.

from torchvision import transforms

def get_train_transform(noise_rate):
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(32),
        # Augmentation noise: with probability noise_rate, take an extreme crop
        # (20% of the image area) so the two views share little information.
        transforms.RandomApply([transforms.RandomResizedCrop(32, scale=(0.2, 0.2))], p=noise_rate),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=int(0.1 * 32)),
        transforms.ToTensor(),
        transforms.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])])
    return train_transform
Figure 9: PyTorch code for CIFAR-10 data augmentation noise.
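For completeness, a usage sketch (ours, not from the released code) of how the transform above can be plugged into a SimCLR-style paired dataset; the CIFAR10Pair wrapper and the noise_rate value are illustrative.

from PIL import Image
from torchvision.datasets import CIFAR10

class CIFAR10Pair(CIFAR10):
    # Return two independently augmented views of every CIFAR-10 image.
    def __getitem__(self, index):
        img = Image.fromarray(self.data[index])  # HWC uint8 array -> PIL image
        return self.transform(img), self.transform(img), self.targets[index]

# noise_rate controls the probability of replacing a view with an extreme crop
train_set = CIFAR10Pair(root="./data", train=True, download=True,
                        transform=get_train_transform(noise_rate=0.4))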

B.2 ImageNet

SimCLR

We adopt the SimCLR implementation from PyTorch Lightning [84] (https://github.com/PyTorchLightning/lightning-bolts/tree/master/pl_bolts/models/self_supervised/simclr). In addition, we spotted a bug in PyTorch Lightning's implementation of negative masking and fixed it as shown in Figure 10, which yields 68.9% top-1 accuracy on ImageNet (the number reported on the PyTorch Lightning website is 68.4%). To implement RINCE, we only modify the lines that compute the loss, following Figure 2.

MoCo v3

We adopt the official MoCo v3 code [35] (https://github.com/facebookresearch/moco-v3). To implement RINCE, we only modify the lines that compute the loss in moco/builder.py, following Figure 2.

def compute_neg_mask(self):
    total_images = self.num_nodes * self.gpus * self.batch_size * self.num_pos
    world_size = self.num_nodes * self.gpus
    batch_size = self.batch_size * self.num_pos
    orig_images = self.batch_size
    rank = int(os.environ["LOCAL_RANK"])

    neg_mask = torch.zeros(batch_size, total_images)
    all_indices = np.arange(total_images)
    pos_members = orig_images * world_size * np.arange(self.num_pos)
    for anchor in np.arange(self.num_pos):
        for img_idx in range(orig_images):
            delete_inds = orig_images * rank + img_idx + pos_members
            neg_inds = torch.tensor(np.delete(all_indices, delete_inds)).long()
            neg_mask[anchor * orig_images + img_idx, neg_inds] = 1
    neg_mask = neg_mask.cuda(non_blocking=True)

    return neg_mask

def nt_xent_loss(self, out_1, out_2, temperature):
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        out_1_dist = SyncFunction.apply(out_1)
        out_2_dist = SyncFunction.apply(out_2)
    else:
        out_1_dist = out_1
        out_2_dist = out_2

    out = torch.cat([out_1, out_2], dim=0)
    out_dist = torch.cat([out_1_dist, out_2_dist], dim=0)

    similarity = torch.exp(torch.mm(out, out_dist.t()) / temperature)

    #################################### original code ####################################
    # # from each row, subtract e^(1/temp) to remove similarity measure for x1.x1        #
    # neg = similarity.sum(dim=-1)                                                        #
    # row_sub = Tensor(neg.shape).fill_(math.e ** (1 / temperature)).to(neg.device)       #
    # neg = torch.clamp(neg - row_sub, min=eps)  # clamp for numerical stability          #
    #######################################################################################

    neg_mask = self.compute_neg_mask()
    neg = torch.sum(similarity * neg_mask, 1)

    pos = torch.exp(torch.sum(out_1 * out_2, dim=-1) / temperature)
    pos = torch.cat([pos, pos], dim=0)

    loss = -(torch.mean(torch.log(pos / (pos + neg))))

    return loss
Figure 10: PyTorch Lightning implementation of SimCLR. The original implementation of negative masking (commented out) is problematic: it subtracts $e^{1/t}$ from each row to remove the similarity measure for pairs consisting of the same image. However, subtracting a constant does not alter the gradient with respect to the model parameters; in particular, gradients still backpropagate through the false positive pairs. We fix this by directly filtering out those false pairs with a negative mask.
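For reference, a sketch (ours) of what the RINCE loss modification mentioned above can look like; it reuses the pos and neg tensors computed in nt_xent_loss, the helper name is ours, and the values of q and lam are illustrative. The released code in Figure 2 may differ in details.

def rince_from_scores(pos, neg, q=0.5, lam=0.01):
    # pos = exp(s+/t) and neg = sum_i exp(s_i^-/t), one entry per anchor.
    # RINCE: -pos^q / q + (lam * (pos + neg))^q / q, averaged over the batch.
    loss = -(pos ** q) / q + ((lam * (pos + neg)) ** q) / q
    return loss.mean()

# drop-in replacement for the last line of nt_xent_loss:
# loss = rince_from_scores(pos, neg, q=0.5, lam=0.01)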

B.3 Kinetics-400

We adopt the official implementation from [19] (https://github.com/facebookresearch/AVID-CMA). Similarly, we only modify the loss function in the criterions directory. In particular, we use the SimCLR-style implementation for both the InfoNCE and RINCE losses. We also adopt the same training hyperparameters described in the git repository. We set the learning rate to 1e-3 to fine-tune the models on downstream classification tasks such as UCF101 and HMDB51 with the provided evaluation code.

B.4 ACAV100M

We again modify the official implementation of [19] for the ACAV100M experiments, adapting the data loader to ACAV100M. Unlike the Kinetics-400 experiments, the input size is set to $8\times 224^{2}$ during fine-tuning for computational efficiency. We again use the exact same set of hyperparameters from [19] for both training and testing.

B.5 TU-Dataset

We adopt the official implementation from [40] (https://github.com/Shen-Lab/GraphCL/tree/master/unsupervised_TU). To implement RINCE, we only modify the loss in the gsimclr.py file.

Appendix C Additional Results

C.1 Exact Numbers for the CIFAR-10 and ACAV100M Experiments

We first provide the exact numbers for CIFAR-10 and ACAV100M experiments.

η InfoNCE q=0.01 q=0.1 q=0.5 q=1.0
0.0 93.4±0.2 93.4±0.2 93.2±0.1 93.3±0.1 93.0±0.2
0.2 93.1±0.1 93.3±0.3 93.0±0.1 93.2±0.2 92.9±0.3
0.4 90.7±0.2 93.0±0.2 92.0±0.9 93.1±0.1 92.8±0.1
0.6 88.2±0.4 90.8±0.2 90.6±0.3 92.9±0.2 92.4±0.2
0.8 87.1±0.5 89.1±0.2 89.3±0.1 89.9±0.3 91.6±0.3
1.0 87.1±1.0 88.7±0.1 89.3±0.4 89.3±0.6 88.2±0.3
Table 4: CIFAR-10 Label Noise (linear evaluation accuracy, %).
η InfoNCE q=0.01 q=0.1 q=0.5 q=1.0
0.0 91.1±0.1 91.6±0.1 91.5±0.1 91.8±0.2 90.7±0.1
0.2 89.3±0.1 89.8±0.2 89.7±0.1 90.4±0.1 90.9±0.1
0.4 87.3±0.4 87.7±0.5 87.5±0.2 88.8±0.1 89.0±0.1
0.6 84.5±0.2 85.4±0.2 85.3±0.2 86.6±0.1 86.3±0.2
0.8 80.6±0.1 81.2±0.2 80.3±0.2 82.5±0.2 82.8±0.3
1.0 71.0±0.5 71.2±0.6 71.8±0.4 71.5±0.3 72.7±0.2
Table 5: CIFAR-10 Augmentation Noise (linear evaluation accuracy, %).
model 20K 50K 100K 200K 500K
InfoNCE (100 epoch) 72.482 75.205 77.161 79.937 82.717
InfoNCE (150 epoch) 72.429 76.13 78.8 80.095 83.082
InfoNCE (200 epoch) 72.429 76.183 78.641 79.94 83.388
RINCE (100 epoch) 73.635 76.685 78.694 81.153 83.505
RINCE (150 epoch) 74.632 77.505 79.064 82.263 83.399
RINCE (200 epoch) 74.253 78.086 79.355 82.368 83.769
Table 6: Top-1 accuracy (%) on UCF101 of models trained on ACAV100M subsets of increasing size.

C.2 Positive Scores and Views, Continued

We extend our analysis of Figure 6 to the InfoNCE baseline and discuss the impact of implicit weighting. We can see that the positive scores produced by both the InfoNCE and RINCE models are correlated with the noisiness of the positive pairs.

Figure 11: Positive pairs and their scores. The corresponding positive scores are shown below the image pairs. The positive scores $s^{+}\in[-1,1]$ are output by the trained InfoNCE and RINCE models (temperature $=1$). Pairs with lower scores are visually noisy, while informative pairs often have higher scores.

We then study the distribution of positive scores and compare the scores output by InfoNCE and RINCE on noisy views. As Figure 12 (a) shows, the positive scores of clean pairs output by RINCE are slightly higher, making the density of RINCE around score 1.0 larger than that of InfoNCE. Figure 12 (b) takes a closer look at the scores of noisy views. We can see that InfoNCE tends to output higher scores for noisy views than RINCE, corroborating our analysis: InfoNCE tends to maximize the positive scores of hard (noisy) pairs. This inherently lowers the positive scores of clean pairs for InfoNCE, explaining the discrepancy between InfoNCE and RINCE in (a).

Figure 12: Comparison between RINCE and InfoNCE. (a) Distribution of Positive Scores for RINCE and InfoNCE; (b) InfoNCE outputs higher scores for noisy pairs.

C.3 Ablation Study on λ\lambda

Finally, we provide an ablation study on how $\lambda$ affects the performance of RINCE on the CIFAR-10 augmentation noise experiments. We can see that in both the clean and the noisy settings, RINCE is not sensitive to the choice of $\lambda$ as long as it is not too large. Therefore, we simply set $\lambda=0.01$ for all vision experiments and $\lambda=0.025$ for the graph experiments.

Noise Rate 0.0 0.4
RINCE (λ=0.01) 91.54 89.65
RINCE (λ=0.05) 91.81 89.81
RINCE (λ=0.1) 91.32 89.90
RINCE (λ=0.2) 90.55 89.69
RINCE (λ=0.4) 90.89 89.39
Table 7: CIFAR-10 Augmentation Noise (linear evaluation accuracy, %).