
Investigating the Role of Negatives in
Contrastive Representation Learning

Jordan T. Ash   Surbhi Goel   Akshay Krishnamurthy   Dipendra Misra
(Authors listed in alphabetical order. All authors contributed equally.)

Microsoft Research NYC
{ash.jordan, goel.surbhi, akshaykr, dimisra}@microsoft.com
Abstract

Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to distinguish a similar (positive) example from a collection of random (negative) examples. The success of modern contrastive learning pipelines relies on many parameters such as the choice of data augmentation, the number of negative examples, and the batch size; however, there is limited understanding as to how these parameters interact and affect downstream performance. We focus on disambiguating the role of one of these parameters: the number of negative examples. Theoretically, we show the existence of a collision-coverage trade-off suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data. Empirically, we scrutinize the role of the number of negatives in both NLP and vision tasks. In the NLP task, we find that the results broadly agree with our theory, while our vision experiments are murkier with performance sometimes even being insensitive to the number of negatives. We discuss plausible explanations for this behavior and suggest future directions to better align theory and practice.

1 Introduction

Unsupervised representation learning refers to a suite of algorithmic approaches for discovering meaningful transformations of complex data in the absence of an explicit supervision signal. These representations are trained to preserve and expose semantic information in the data, so that once learned, they can be used effectively in downstream supervised tasks of interest. In recent years, these methods have seen tremendous success when combined with deep learning models, obtaining state of the art results on a variety of important language and vision tasks (Smith and Eisner, 2005; Chopra et al., 2005; Mikolov et al., 2013; Oord et al., 2018; Bachman et al., 2019; Clark et al., 2020; Chen et al., 2020).

One popular approach for representation learning is contrastive learning, or noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010). In NCE, a notion of semantic similarity is used to create a self-supervision signal that encourages similar points to be represented (or embedded) in a similar manner. For example, a typical approach is to train an embedding so that similar points have large inner product in the embedding space and dissimilar points have small inner product. Intuitively, representations trained in this manner capture the semantic similarity of the raw data, so we might expect them to be useful in downstream supervised learning tasks. Indeed, NCE is employed in the Electra language model (Clark et al., 2020) and has also been used effectively in computer vision tasks at ImageNet scale (Chen et al., 2020).

While the implementation details vary, a mathematical abstraction for representation learning with NCE is: (1) a single NCE example consists of $k+2$ raw inputs, $(x,x^{+},x_{1}^{-},\ldots,x_{k}^{-})$, where $(x,x^{+})$ are semantically similar and the $x_{i}^{-}$ are sampled from some base distribution; (2) the representation $f$ is trained to encourage $f(x)^{\top}f(x^{+})\gg f(x)^{\top}f(x_{i}^{-})$ for each $i$. This second step can be done with standard classification objectives, such as the cross-entropy loss, where the model is viewed as a classifier over $k+1$ labels. For example, a candidate objective is

\mathop{\mathbb{E}}_{x,x^{+},x_{1:k}^{-}}\left[-\log\left(\frac{e^{f(x)^{\top}f(x^{+})}}{e^{f(x)^{\top}f(x^{+})}+\sum_{i=1}^{k}e^{f(x)^{\top}f(x_{i}^{-})}}\right)\right].

This objective encourages $f(x)^{\top}f(x^{+})$ to be large, so that the representation efficiently captures the semantic similarity inherent in the NCE sampling scheme.
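To make this concrete, the following minimal sketch (our own illustrative PyTorch code, not tied to any particular published pipeline) computes the $(k+1)$-way cross-entropy form of this objective from precomputed anchor, positive, and negative embeddings.

    import torch
    import torch.nn.functional as F

    def nce_cross_entropy(f_x, f_pos, f_neg):
        """Cross-entropy NCE objective.

        f_x:   (B, d)    anchor embeddings f(x)
        f_pos: (B, d)    positive embeddings f(x+)
        f_neg: (B, k, d) negative embeddings f(x_i^-)
        """
        pos_logit = (f_x * f_pos).sum(dim=-1, keepdim=True)   # (B, 1): f(x)^T f(x+)
        neg_logits = torch.einsum("bd,bkd->bk", f_x, f_neg)   # (B, k): f(x)^T f(x_i^-)
        logits = torch.cat([pos_logit, neg_logits], dim=1)    # (B, 1+k)
        labels = torch.zeros(f_x.shape[0], dtype=torch.long)  # positive sits in column 0
        return F.cross_entropy(logits, labels)

Minimizing this loss is exactly the $(k+1)$-way classification view described above, with the positive playing the role of the correct label.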

Modern contrastive pipelines have several parameters that together contribute to the success of the resulting representations. These parameters include modeling choices, such as the number of negatives $k$, the semantic sampling procedure for $(x,x^{+})$, and the neural network architecture, along with optimization decisions, such as the batch size. While there have been large scale empirical studies on contrastive learning (Chen et al., 2020), it remains unclear how these different choices interact with each other to affect the downstream performance of the learned representations. Indeed, a major challenge is that the exorbitant computational cost of training contrastive models precludes comprehensive ablations to disentangle the effects of each design decision. Instead, a hybrid approach that leverages theoretical insights to guide systematic experimentation may be more effective at shedding light on the role of these parameters.

As an example of entangled design choices, consider the number of negative samples $k$ and the batch size $B$ in the NCE objective. In practice, it is standard to set $k$ as a function of the batch size $B$, as $k=2B-2$, and then to set the batch size as large as the hardware can support. When implemented on a GPU, this allows for many more gradient evaluations with minimal computational overhead. However, in this setup, the effect of increasing $k$ on downstream performance is confounded by improvements from increasing the batch size.

In this paper, we focus on the number of negatives $k$, and investigate its role in contrastive learning. The key question we ask is how does the number of negative samples affect the quality of representations learned by NCE? As mentioned, existing empirical evidence confounds $k$ with the batch size $B$, so it does not provide much of an answer to this question. Moreover, conceptual reasoning paints a fairly nuanced picture. On one hand, a large value of $k$ implies that the negatives are more likely to cover the semantic space (coverage), which encourages the representation to disambiguate all semantically distinct concepts. However, there is a risk that some of the negative samples will be semantically very similar (collisions) to the positive example $x^{+}$, which compromises the supervision signal and may lead to worse representation quality.

Our contributions.

For our theoretical results, we adopt the framework of Saunshi et al. (2019), where the semantic similarity in the contrastive distribution is directly related to the labels in a downstream classification task. We refine the results of Saunshi et al. (2019) with a novel and tighter error transfer theorem that bounds the classification error by a function of the NCE error. We then study how this function depends on the number of concepts/classes in the data and the number of negatives in the NCE task. Our analysis captures the trade-off between coverage over the classes and collisions, and it reveals an expression for the ideal number of negatives $k$, which roughly scales as the number of concepts in the data. We emphasize that our sharper theoretical results are instrumental even for obtaining this qualitative trend; indeed, the analysis in Saunshi et al. (2019) instead suggests that increasing $k$ leads to decreasing performance.

Empirically, we study the effect of the number of negative samples on the performance of the final classifier on a downstream classification task for both vision and natural language datasets. For our language task, we observe that performance increases with $k$ up to a point, but it deteriorates as we increase $k$ further. The experiments also confirm that the optimal choice of $k$ depends on the number of concepts in the dataset. We do not observe the same trends in our vision experiments, and we describe how conflating factors, most notably data augmentation, cause disagreement between the observed empirical behavior and the predictions from the theoretical framework typically used to study NCE (Saunshi et al., 2019; Chuang et al., 2020).

Related work.

Self-supervised learning has elicited breakthrough successes in representation learning, achieving state of the art performance across a wide array of important machine learning tasks, including computer vision (He et al., 2020; Huo et al., 2020; Khosla et al., 2020; Wang et al., 2020; Bachman et al., 2019), natural language processing (Fang and Xie, 2020; Fu et al., 2021; Le-Khac et al., 2020; Giorgi et al., 2020), and deep reinforcement learning. In vision, Chen et al. (2020) have shown that simple NCE algorithms are able to obtain extremely competitive performance on the challenging ImageNet dataset, something previously thought to require extensive supervised training. In NLP, the NCE approach has garnered much attention since the pioneering work of Mikolov et al. (2013) demonstrated its efficacy in learning word-level representations. In deep RL, NCE-style losses have recently been employed to learn world models in complex environments (Nachum et al., 2018; Laskin et al., 2020; Zhu et al., 2020; Kipf et al., 2019; Shelhamer et al., 2017).

Theoretical work on representation learning is quite nascent, with most works focused on mathematical explanations for the efficacy of these approaches (Saunshi et al., 2019; Lee et al., 2020). One explanation is via conditional independence and redundancy (Tosh et al., 2020, 2021). Another theory, proposed by Wang and Isola (2020), is that contrastive representations simultaneously optimize for alignment of similar points and uniformity in the embedding space, in the asymptotic regime where the number of negatives tends to infinity. This asymptotic regime was also studied by Foster et al. (2020), who analyze a contrastive estimator for an information-gain quantity. However, Foster et al. (2020) do not study representation learning and Wang and Isola (2020) do not theoretically analyze downstream performance as we do. Indeed, our results show that there can be a price for too many negatives due to collisions, and perhaps call into question whether this asymptotic regime should be used to explain the downstream performance of NCE representations; similar concerns have been raised regarding viewing representation learning as a mutual information maximization problem (Tschannen et al., 2019). This collision phenomenon has been observed by Chuang et al. (2020), who propose a debiased version of the contrastive loss that attempts to correct for negative samples coming from the positive class. They experimentally observe significant gains from removing the effect of collision. In contrast, here we study the original contrastive objective that is primarily used in practice and characterize how an appropriate choice of $k$ can balance the trade-off between collisions and coverage.

Our theoretical results closely follow the setup of Saunshi et al. (2019). We adopt essentially the same framework in the present paper, but unlike Saunshi et al. (2019) we do not use the number of negatives $k$ to define the downstream classification tasks. Our setup yields a single $k$-independent downstream task, which allows us to study the influence of this parameter. In addition, many steps of our analysis are sharper than theirs, which leads to our new qualitative findings.

2 Setup and Algorithm

In this section, we present the mathematical framework that we use to study noise contrastive estimation, as well as the learning algorithm and the evaluation metric that we consider.

Notation.

Let the set of input features be denoted by $\mathcal{X}$ and the set of latent classes by $\mathcal{C}$, where $|\mathcal{C}|=N$. With each latent class $c\in\mathcal{C}$ we associate a distribution $\mathcal{D}_{c}$ over $\mathcal{X}$. We can view $\mathcal{D}_{c}$ as the distribution over data conditioned on belonging to latent class $c$. We also assume a distribution $\rho$ on $\mathcal{C}$. We learn with a function class $\mathcal{F}$, the set of representation functions $f:\mathcal{X}\rightarrow\mathbb{R}^{d}$ that map the input to a $d$-dimensional vector. We use $\lesssim$ to denote less than or equal to up to constants.

NCE data.

We assume access to similar data points in the form of pairs $(x,x^{+})$ and $k$ negative data points $x^{-}_{1},\ldots,x^{-}_{k}$. To formalize this, an unlabeled sample from $\mathcal{D}_{\textup{NCE}}$ is generated as follows:

  • Choose a class $c$ according to $\rho$ and draw i.i.d. samples $x,x^{+}$ from $\mathcal{D}_{c}$.

  • Choose classes $c^{-}_{1},\ldots,c^{-}_{k}$ i.i.d. from $\rho$ and draw $x^{-}_{1}\sim\mathcal{D}_{c^{-}_{1}},\ldots,x^{-}_{k}\sim\mathcal{D}_{c^{-}_{k}}$.

  • Output the tuple $(x,x^{+},x^{-}_{1:k})$.

This data generation is implicit: we only observe the data points and not the associated class labels. Observe that $c^{-}_{i}$ may be the same as $c$ for one or more $i\in\{1,\ldots,k\}$, causing collisions.

Empirically, similar pairs are typically generated from co-occurring data; for example, two parts of a document can be viewed as a similar pair (Tosh et al., 2020). Note that our data generation process implies that $x$ and $x^{+}$ are independent conditioned on the latent class. This is a simplification of the real-world setting, where the independence structure might not hold. However, it enables a transparent theoretical analysis that can provide insights applicable to some real-world setups. A sketch of this sampling process is given below.
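In the sketch below (our own illustrative code), rho and sample_from_class are hypothetical stand-ins for the class distribution $\rho$ and the class-conditional distributions $\mathcal{D}_{c}$, which the learner never observes directly.

    import numpy as np

    def sample_nce_tuple(rho, sample_from_class, k, rng=np.random.default_rng()):
        """Draw one unlabeled NCE tuple (x, x+, x_1:k^-) from D_NCE.

        rho:               length-N array of latent-class probabilities (sums to 1)
        sample_from_class: hypothetical function c -> one draw from D_c
        k:                 number of negative samples
        """
        N = len(rho)
        c = rng.choice(N, p=rho)                       # latent class of the positive pair
        x, x_pos = sample_from_class(c), sample_from_class(c)
        neg_classes = rng.choice(N, size=k, p=rho)     # negatives may collide with c
        x_negs = [sample_from_class(ci) for ci in neg_classes]
        return x, x_pos, x_negs                        # the class labels are discarded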

NCE algorithm.

The objective of an NCE algorithm is to use the available unsupervised data to learn a representation that represents (or embeds) similar points in a similar manner and distinguishes them from the negative samples. The following NCE loss captures this objective,

Definition 1 (NCE Loss).

The NCE loss for a representation $f$ on the distribution $\mathcal{D}_{\textup{NCE}}$ is defined as

\mathcal{L}^{(k)}_{\textup{NCE}}(f):=\underset{\mathcal{D}_{\textup{NCE}}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\right\}_{i=1}^{k}\right)\right].

The empirical NCE loss with a set $S$ of $M$ samples drawn from $\mathcal{D}_{\textup{NCE}}$ is

\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}(f;S):=\frac{1}{M}\sum_{(x,x^{+},x_{1:k}^{-})\in S}\ell\left(\left\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\right\}_{i=1}^{k}\right).

We restrict $\ell$ to be one of two standard loss functions: the hinge loss $\ell(v)=\max\{0,1+\max_{i}\{-v_{i}\}\}$ and the logistic loss $\ell(v)=\log\left(1+\sum_{i}\exp(-v_{i})\right)$, for a vector $v\in\mathbb{R}^{k}$. Both loss functions and their variants are deployed routinely in practical NCE implementations (Schroff et al., 2015; Chen et al., 2020).
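For reference, a batched sketch of these two losses applied to the margins $v_{i}=f(x)^{\top}(f(x^{+})-f(x^{-}_{i}))$ could look as follows (our own PyTorch-style code, not the exact implementation used in the experiments).

    import torch

    def nce_margins(f_x, f_pos, f_neg):
        """v_i = f(x)^T (f(x+) - f(x_i^-)); shapes (B,d), (B,d), (B,k,d) -> (B,k)."""
        return torch.einsum("bd,bkd->bk", f_x, f_pos.unsqueeze(1) - f_neg)

    def hinge_nce_loss(v):
        """l(v) = max{0, 1 + max_i(-v_i)}, averaged over the batch."""
        return torch.clamp(1.0 + (-v).max(dim=1).values, min=0.0).mean()

    def logistic_nce_loss(v):
        """l(v) = log(1 + sum_i exp(-v_i)), averaged over the batch."""
        return torch.log1p(torch.exp(-v).sum(dim=1)).mean()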

The algorithm we analyze is standard empirical risk minimization (ERM) on $\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}$ over $\mathcal{F}$. For the rest of the paper, we denote our learned representation by $\widehat{f}\in\arg\min_{f\in\mathcal{F}}\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}(f)$.

Downstream supervised learning task.

For evaluating representations, we consider the standard supervised learning task of classifying a data point into one of the classes in $\mathcal{C}$. More formally, a sample from $\mathcal{D}_{\textup{sup}}$ is generated as follows: (1) choose a class $c$ according to $\rho$, (2) draw a sample $x$ from $\mathcal{D}_{c}$, and (3) output the sample $(x,c)$.

Observe that the distribution over classes and the conditional distribution over data points are the same as in the unsupervised data. This is crucial for relating the NCE task to the downstream classification task. We will use the corresponding supervised loss for a function $g:\mathcal{X}\rightarrow\mathbb{R}^{N}$,

\mathcal{L}_{\textup{sup}}(g)=\underset{\mathcal{D}_{\textup{sup}}}{\mathbb{E}}\left[\ell\left(\left\{g(x)_{c}-g(x)_{c^{\prime}}\right\}_{c^{\prime}\neq c}\right)\right].

Note that Saunshi et al. (2019) consider a different downstream task which depends on the number of negative samples in the NCE task. Here we disentangle the number of negatives used in NCE from the number of downstream classes, which results in a more natural downstream task.

To evaluate the performance of the learned representation function $\widehat{f}$, we train a linear classifier $W\in\mathbb{R}^{N\times d}$ on top of the learned embedding and use it for the downstream task. We overload notation and define the supervised loss of an embedding $f$ as

\mathcal{L}_{\textup{sup}}(f)=\inf_{W\in\mathbb{R}^{N\times d}}\mathcal{L}_{\textup{sup}}(Wf).

We define the loss of the mean classifier as $\mathcal{L}_{\textup{sup}}^{\mu}(f)=\mathcal{L}_{\textup{sup}}(W^{\mu}f)$, where $W^{\mu}$ is the matrix whose $c^{\textrm{th}}$ row is the mean $\mu_{c}=\mathbb{E}_{x\sim\mathcal{D}_{c}}[f(x)]$ of the representations of data from class $c$. Note that $\mathcal{L}_{\textup{sup}}(f)\leq\mathcal{L}_{\textup{sup}}^{\mu}(f)$ for all $f$. Our guarantees will hold for the loss of the mean classifier, which directly implies a bound on the overall supervised loss. This analysis may be loose if the mean classifier itself has poor downstream performance; Saunshi et al. (2019) present a simple example where the two losses are very different.
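The mean-classifier evaluation can be made explicit with the short sketch below (our own illustrative code; the encoder f is assumed to return NumPy embedding arrays, and the labels are assumed to be NumPy arrays as well).

    import numpy as np

    def mean_classifier_accuracy(f, X_train, y_train, X_test, y_test, num_classes):
        """Evaluate an encoder f with the mean classifier W^mu."""
        Z_train, Z_test = f(X_train), f(X_test)           # (n, d) embedding arrays
        W_mu = np.zeros((num_classes, Z_train.shape[1]))
        for c in range(num_classes):
            W_mu[c] = Z_train[y_train == c].mean(axis=0)  # mu_c = E_{x ~ D_c}[f(x)]
        preds = (Z_test @ W_mu.T).argmax(axis=1)          # classify by the largest inner product
        return (preds == y_test).mean()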

Figure 1: Value of the pre-multiplier $\alpha(k,N)$ multiplied with the growth rates of $\mathcal{L}^{(k)}_{\textup{NCE}}-\tau_{k}$ (on a log scale) for $N=100$ with varying $k$ and growth rates. The plots represent the scaling of our upper bound on the supervised loss. The blue dots represent the optimal $k$ predicted by our theory.

3 Theoretical Overview

In this section, we present our main theoretical insights and overview our analysis, which provides sharp bounds on the supervised loss in terms of the NCE loss. Our bounds show an interesting trade-off between the number of negatives and the number of latent classes. For ease of presentation, the bounds in the main paper assume that $\rho$ is uniform over $\mathcal{C}$. Proofs and generalizations are deferred to Appendix B.

In the supervised loss, data points are distinguished against all classes other than the class that the point belongs to. However, for the NCE loss, if the number of negatives $k$ is small relative to $N$, then the negatives may not cover all the classes simultaneously. On the other hand, if $k$ is too large, then it is quite likely that some of the negatives will belong to the same class as the “anchor point” $x$, causing collisions that force the learner to distinguish between similar examples. This leads to opposing factors when choosing the optimal number of negative samples. We present our main result that captures this trade-off.

Theorem 1 (Transfer between NCE and supervised Loss).

For any $f\in\mathcal{F}$,

\mathcal{L}_{\textup{sup}}(f)\leq\mathcal{L}^{\mu}_{\textup{sup}}(f)\leq\underbrace{\frac{2}{1-\tau_{k}}\left\lceil\frac{2(N-1)H_{N-1}}{k}\right\rceil}_{\alpha(k,N)}\left(\mathcal{L}^{(k)}_{\textup{NCE}}(f)-\tau_{k}\right),

where $\tau_{k}=\Pr[\exists\,i:c^{-}_{i}=c]=1-\left(1-\frac{1}{N}\right)^{k}$ is the probability of a collision amongst the negatives and $H_{t}=\Theta(\log(t))$ is the $t^{\text{th}}$ harmonic number.

Note that the above bound is similar to the one of Saunshi et al. (2019), except that the left hand side and the pre-multiplier $\alpha(k,N)$ are different. The left hand side here is the supervised loss on all classes rather than the expected loss on a $(k+1)$-way classification task, where the tasks are drawn randomly from $\rho$. Further, our pre-multiplier incorporates the trade-off between $k$ and $N$. A key difference is that the bound in the prior work gets worse with increasing $k$, implying that $k=1$ is the best choice, whereas ours has a more subtle dependence on $k$.

Remark 2 (Choosing $k$).

The coefficient $\alpha(k,N)$ decreases with $k$ up to $k\approx\frac{1}{\log(N)-\log(N-1)}$ and then increases. Note that $\mathcal{L}_{\textup{NCE}}-\tau_{k}$ may itself increase as a function of $k$; however, as long as its growth rate is $o(k)$, we will see a similar trend (see Figure 1).
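The trend can be checked numerically with the short script below (a sketch that simply re-evaluates the quantities appearing in Theorem 1 and Theorem 3 for uniform $\rho$; it is not experiment code); both coefficients are minimized at a value of $k$ on the order of $N$.

    import numpy as np

    def alpha(k, N):
        """Pre-multiplier from Theorem 1 (uniform rho): (2 / (1 - tau_k)) * ceil(2(N-1)H_{N-1} / k)."""
        H = np.sum(1.0 / np.arange(1, N))        # harmonic number H_{N-1}
        tau_k = 1.0 - (1.0 - 1.0 / N) ** k       # collision probability tau_k
        return 2.0 / (1.0 - tau_k) * np.ceil(2 * (N - 1) * H / k)

    N = 100
    ks = np.arange(1, 20 * N)
    a = np.array([alpha(k, N) for k in ks])
    print("alpha(k,N) minimized at k =", ks[a.argmin()])   # roughly 1/(log N - log(N-1)), i.e. about N
    print("alpha(k,N)*sqrt(k/N) minimized at k =", ks[(a * np.sqrt(ks / N)).argmin()])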

We refine our bound by decomposing $\mathcal{L}_{\textup{NCE}}-\tau_{k}$ into two terms that can potentially be small for large $\mathcal{F}$.

Theorem 3 (Refined NCE Loss Bound).

For any $f\in\mathcal{F}$,

\mathcal{L}^{\mu}_{\textup{sup}}(f)\lesssim\alpha(k,N)\left(\mathcal{L}^{(k)}_{\neq}(f)+\sqrt{\frac{k}{N}}s(f)\right)

where $\mathcal{L}^{(k)}_{\neq}(f)$ is the NCE loss obtained by ignoring the coordinates corresponding to the collisions and $s(f)$ is a measure of average intra-class deviation.

We defer the formal definitions of $\mathcal{L}^{(k)}_{\neq}(f)$ and $s(f)$ to Appendix B.2. Note that the trade-off between $N$ and $k$ still holds here, even for the coefficient in front of $s(f)$ (see the $\sqrt{k}$ growth rate in Figure 1): $\alpha(k,N)\sqrt{\frac{k}{N}}$ is minimized roughly at $k\approx\frac{0.5}{\log(N)-\log(N-1)}$.

Lastly, we carry out a careful generalization analysis to guarantee closeness of the empirical NCE loss of $\widehat{f}$ to the true NCE loss. Given a generalization guarantee, using the fact that $\widehat{f}$ is the ERM solution, we obtain the following guarantee:

Theorem 4 (Guarantee on ERM).

For $\widehat{f}\in\arg\min_{f\in\mathcal{F}}\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}(f)$ and any $f\in\mathcal{F}$,

\mathcal{L}_{\textup{sup}}(\widehat{f})\lesssim\alpha(k,N)\left(\mathcal{L}^{(k)}_{\neq}(f)+\sqrt{\frac{k}{N}}s(f)+\sqrt{k}\,\mathsf{Gen}_{M}\right)

where $\mathsf{Gen}_{M}$ is a decreasing function of $M$ that depends on $\mathcal{F}$ but is independent of $k$.

See Appendix B.3 for a formal definition of $\mathsf{Gen}_{M}$, which is based on the Rademacher complexity of $\mathcal{F}$.

An example exhibiting the coverage-collision trade-off

Our bounds suggest that there are two competing forces influenced by the choice of $k$ in the NCE task. The first is coverage of all the classes, which intuitively improves supervised performance and suggests taking $k$ larger. On the other hand, when $k$ is large, the negatives are more likely to collide with the anchor class $c$, which will degrade performance. However, as the trends regarding the optimal choice of $k$ are based on analyzing an upper bound, there is always the concern that the upper bound is loose. To further substantiate our results, we now provide a simple example which demonstrates the existence of the coverage-collision trade-off and also verifies the sharpness of our transfer theorem.

Figure 2: Left: Schematic of the distribution with the two representation functions ($f_{1}$ maps each collapsed pair of classes to a basis vector $e_{1},e_{2},\ldots$, while $f_{2}$ maps the two points of each class to $(1\pm\epsilon)e_{1},\ldots,(1\pm\epsilon)e_{4}$), where colors correspond to distinct classes. $f_{1}$ collapses pairs of classes together, while $f_{2}$ separates all classes but has some intra-class variance in its embeddings. Center: plot of $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})$ and the bounds on $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})$ as a function of $k$. Right: Monte-Carlo simulations of $\mathcal{L}_{\textup{NCE}}^{(k)}$ for the two representations, demonstrating both cross points. In both plots we take $N=40$, $\epsilon=0.35$.

First we showcase the tightness of Theorem 1 and the importance of coverage. Consider a problem with $N$ classes $\{1,\ldots,N\}$, where $N$ is a multiple of $2$, and let $\rho$ be the uniform distribution. For each class $c$, the distribution $D_{c}$ is the uniform distribution on two examples $x_{1}^{c}$ and $x_{2}^{c}$. The representation $f_{1}$ pairs classes together, so that all points from classes $2i$ and $2i+1$ are mapped to the $i^{\textrm{th}}$ standard basis element $e_{i}$ (see Figure 2).

It is straightforward to verify the following facts for the hinge loss (see the appendix for detailed calculations): (1) since $f_{1}$ collapses classes together, it has $\mathcal{L}_{\textup{sup}}^{\mu}(f_{1})=1$; (2) $\mathcal{L}_{\textup{NCE}}(f_{1})=1-(1-2/N)^{k}$, since for each class $2i$ the NCE loss will be $1$ if and only if either $2i$ or $2i+1$ appears in the negatives (otherwise the loss will be $0$). Thus, for this function $f_{1}$ we have

\frac{\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}}{1-\tau_{k}}=1-\left(1-\frac{1}{N-1}\right)^{k}\leq\frac{k}{N-1},

while $\mathcal{L}_{\textup{sup}}^{\mu}(f_{1})=1$. This verifies that Theorem 1 is tight up to constants and the $\Theta(\log(N-1))$ factor arising from the $(N-1)$-st harmonic number.

In addition, observe that $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})$ increases rapidly with $k$, since when $k$ is large it is very likely that either the anchor class or its pair will appear in the negatives. This demonstrates the importance of coverage: when $k$ is large, $f_{1}$, which has high downstream loss, is unlikely to be selected by the NCE procedure.

To make this precise, consider another function $f_{2}$ such that $f_{2}(x_{1}^{i})=(1+\epsilon)e_{i}$ and $f_{2}(x_{2}^{i})=(1-\epsilon)e_{i}$. This representation correctly separates all of the classes, but there is some noise in the embedding space. By direct calculation, we have $\mathcal{L}_{\textup{sup}}^{\mu}(f_{2})=\epsilon/2$, so that $f_{2}$ should be preferred for downstream performance. We can also show that

0\leq\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})-(1-\tau_{k})\left(\frac{\epsilon^{2}}{4}+\frac{\epsilon}{2}\right)-\tau_{k}\leq\tau_{k}\epsilon.

See Figure 2, where we plot a sharper lower bound that we establish in the appendix. For moderately small $\epsilon$, say $\epsilon=0.1$, these bounds imply that $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})<\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})$ when $k/N$ is sufficiently small ($k\lessapprox 0.05N$), which means that the bad representation $f_{1}$ will be selected. On the other hand, when $k\gtrapprox 0.05N$, the better representation $f_{2}$ will be selected. This demonstrates how using many negatives can yield representations with better downstream performance.

In the asymptotic regime where $k\to\infty$, we observe a second important phenomenon with this example: if $k/N$ is too large, then the bad representation $f_{1}$ will be favored again. To see this, note that when there are many collisions $f_{2}$ incurs an NCE error of $1+\epsilon$, so $\lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})=1+\epsilon$. On the other hand, it is clear that $\lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})=1$, so $f_{1}$ will be selected. This phenomenon can be seen in Figure 2. In this regime, the collisions in the NCE problem compromise the supervision signal and lead to the selection of a suboptimal representation.
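The Monte-Carlo curves in the right panel of Figure 2 can be reproduced, up to sampling noise, with a simulation along the following lines (a sketch under the hinge loss and the construction above; the list of $k$ values and the trial count are arbitrary).

    import numpy as np

    def hinge(v):                                  # l(v) = max(0, 1 + max_i(-v_i))
        return max(0.0, 1.0 + np.max(-v))

    def mc_nce_loss(f_dot, N, k, trials, rng):
        """Monte-Carlo estimate of L_NCE^(k) for a representation given by its inner-product function."""
        total = 0.0
        for _ in range(trials):
            c = rng.integers(N)
            x, x_pos = (c, rng.integers(2)), (c, rng.integers(2))    # (class, which of the two examples)
            negs = [(rng.integers(N), rng.integers(2)) for _ in range(k)]
            total += hinge(np.array([f_dot(x, x_pos) - f_dot(x, xn) for xn in negs]))
        return total / trials

    N, eps, rng = 40, 0.35, np.random.default_rng(0)
    f1 = lambda a, b: float(a[0] // 2 == b[0] // 2)   # collapses paired classes onto a shared basis vector
    f2 = lambda a, b: (1 + eps * (1 - 2 * a[1])) * (1 + eps * (1 - 2 * b[1])) if a[0] == b[0] else 0.0
    for k in (1, 5, 20, 100, 500):
        print(k, mc_nce_loss(f1, N, k, 2000, rng), mc_nce_loss(f2, N, k, 2000, rng))

As $k$ ranges from small to large, the estimated losses of $f_{1}$ and $f_{2}$ should exhibit both cross points discussed above.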

4 Experiments

We test our theoretical findings on two domains, an NLP task and a vision task. We first describe our general learning setting and then describe these tasks and results.

Experimental setting.

We train the encoder model $f$ by performing NCE with the logistic loss. Following Saunshi et al. (2019), we train on the NCE objective for a fixed number of epochs and do not use a held-out set for model selection. We train a linear classifier on the downstream task for a fixed number of epochs and save the model after each epoch. We do not fine-tune $f$ during downstream training.

Computing $k$ negative examples for a batch of size $B$ results in $\mathcal{O}(Bk)$ forward passes through the model, which is prohibitively expensive for large values of $B$ and $k$. In order to reduce computational overhead, we follow the data reuse trick used in the literature (Chen et al., 2020; Chuang et al., 2020). Given $B$ pairs of positively aligned, semantically similar points $\{(x_{i},x^{+}_{i})\}_{i=1}^{B}$, we compute $\{f(x_{i}),f(x^{+}_{i})\}_{i=1}^{B}$ using $2B$ forward passes. For each $i\in[B]$, we treat the set $\mathcal{C}_{i}=\{x_{j},x^{+}_{j}\mid j\in[B],j\neq i\}$ as the candidate set of negative samples for the positive pair $(x_{i},x^{+}_{i})$. Previous work uses the entire candidate set $\mathcal{C}_{i}$, which gives $k=|\mathcal{C}_{i}|=2(B-1)$. While this allows most negative samples to be used, it results in a confounding effect due to the entanglement of $B$ and $k$. Instead, for a fixed value of $k\leq 2(B-1)$, we sample $k$ negative examples without replacement from $\mathcal{C}_{i}$, independently for each $i$. This sampling protocol marks a notable departure from prior work: it decouples the batch size from the number of negative examples, so that their influence on performance can be studied separately.
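A sketch of this sampling protocol is given below (our own PyTorch-style pseudocode, not the exact training code); for each anchor $i$ it draws $k$ of the $2(B-1)$ candidate embeddings without replacement, so $k$ can be varied independently of $B$.

    import torch

    def batch_nce_logits(f, x, x_pos, k):
        """Anchor/positive/negative inner products with k <= 2(B-1) resampled negatives per anchor.

        x, x_pos: (B, ...) batches of positively aligned pairs.
        Returns logits of shape (B, 1+k), with the positive in column 0.
        """
        z, z_pos = f(x), f(x_pos)                  # 2B forward passes, each output of shape (B, d)
        B = z.shape[0]
        cand = torch.cat([z, z_pos], dim=0)        # 2B candidate embeddings
        rows = []
        for i in range(B):
            idx = [j for j in range(2 * B) if j % B != i]   # candidate set C_i (excludes the i-th pair)
            sel = torch.randperm(len(idx))[:k]              # k negatives sampled without replacement
            negs = cand[torch.tensor(idx)[sel]]             # (k, d)
            pos = (z[i] * z_pos[i]).sum().view(1)
            rows.append(torch.cat([pos, negs @ z[i]]))
        return torch.stack(rows)                   # (B, 1+k)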

Figure 3: Mean classifier accuracy on the Wiki-3029 dataset. Left: Downstream performance as a function of $N$ and $k$ for a fixed batch size of $B=1000$. Each row is locally normalized by dividing the unnormalized accuracy in that row by the best performance in the row. Color denotes normalized average accuracy across 5 trials, where lighter areas have higher performance. Solid black circles highlight the $k$ with the best performance for each row. Shaded black circles show the smallest and largest values of $k$ with normalized performance of at least $0.995$. Center: Mean classifier average accuracy across 5 trials with varying $B$ and $k$. The performance is globally normalized by dividing the unnormalized accuracy by the overall best accuracy. Right: An experiment highlighting the effect of disabling collisions across $k$ for 4 values of $N$. Solid lines indicate vanilla training, and dotted lines indicate no collisions.

NLP experiments.

Here we perform a thorough analysis on the Wiki-3029 dataset of Saunshi et al. (2019). This dataset consists of 200 sentences taken from each of 3029 Wikipedia articles. The downstream task is to predict the article identity from a given sentence. A key advantage of this dataset is that it contains many classes, which allows us to study learning behavior as the number of classes $N$ increases. For a given $N\leq 3029$, we remove sentences from all but the first $N$ articles in both the training and test datasets. We tokenize sentences using the NLTK English word tokenizer (Bird et al., 2009) and replace all tokens that occur fewer than 50 times in the training set with the token {\tt UNK}, denoting an unknown token. This gives us a vocabulary size $V$ of 15,807, with at most 136 unique tokens per sentence. We create our training set by randomly sampling 80% of the sentences from each article; the remaining 20% of the sentences from each article form the test set.

We use a bag-of-words representation for modeling $f$. Formally, we map a sentence $x$ to a list of tuples $\{(t_{i},p_{i})\}$, where $\{t_{i}\}$ is the set of unique tokens in $x$ and $p_{i}$ is the number of times $t_{i}$ occurs in $x$, normalized by the number of tokens in $x$. We compute $f(x)=\sum_{i}p_{i}\nu_{t_{i}}$, where $\nu\in\mathbb{R}^{V\times d}$ is a word-embedding matrix with embedding dimension $d$ and $\nu_{t}$ is its $t^{\textrm{th}}$ row, denoting the word embedding of token $t$. We initialize the word-embedding matrix randomly and train it using NCE.
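A minimal sketch of this encoder (our own illustrative PyTorch code; tokenization and vocabulary construction are assumed to happen elsewhere, and the padding index is assumed to be 0) is shown below.

    import torch
    import torch.nn as nn

    class BagOfWordsEncoder(nn.Module):
        """f(x) = sum_i p_i * nu_{t_i}: a frequency-weighted average of word embeddings."""

        def __init__(self, vocab_size, dim):
            super().__init__()
            self.nu = nn.Embedding(vocab_size, dim)        # word-embedding matrix nu in R^{V x d}

        def forward(self, token_ids):
            """token_ids: (B, T) padded batch of token indices (padding index assumed to be 0)."""
            emb = self.nu(token_ids)                       # (B, T, d)
            mask = (token_ids != 0).float().unsqueeze(-1)  # ignore padding positions
            return (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

Averaging the embeddings of all non-padding token positions is equivalent to the frequency-weighted sum $\sum_{i}p_{i}\nu_{t_{i}}$, since repeated tokens contribute once per occurrence.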

We train the NCE model for a fixed number of epochs and do not use a held-out set, following the protocol of Saunshi et al. (2019). We measure the downstream performance of both the mean classifier $W^{\mu}$ and a trained linear classifier. For the former, we use the entire training dataset to compute $W^{\mu}$ using the trained representation. For the latter, we use the entire training dataset, with labels, to fit a linear layer on top of the model $f$, and we do not fine-tune $f$. For each setting, we repeat the experiment 5 times with different seeds. Please see Appendix D for full details on hyperparameter values and optimization.

The leftmost plot in Figure 3 shows the accuracy of the mean classifier as we vary $N$ and $k$. For a given number of classes $N$, performance improves as $k$ increases and then slowly decays for larger values of $k$. We show the general trend in this figure and report exact numbers in the appendix. Further, we find that the optimal choice of $k$, indicated by solid black circles, slowly increases as $N$ increases. Both of these empirical findings support the theoretical predictions of our improved error transfer bounds. As we have discussed, the previous theoretical results (Saunshi et al., 2019) do not predict this observed trend. Lastly, we observe that performance improves very rapidly as $k$ increases, up until $k=50$, while the decay in performance as $k$ increases beyond the optimal value is relatively mild in comparison to these initial gains.

The center plot in Figure 3 shows the mean classifier’s performance as we vary $B$ and $k$ for two values of $N$. Note that for a given value of $B$, we can only increase $k$ up to $2(B-1)$. We observe that increasing the batch size can improve performance even when the number of negative examples is fixed. Optimal performance is obtained for the largest batch size and an intermediate value of $k$. This highlights the very different roles played by $B$ and $k$ on this dataset. We note that these results could not have been observed by previous work that ties the value of $k$ to $B$.

Figure 4: Average downstream classifier accuracy as a function of both batch size and $k$ for various vision datasets and NCE parameters. Left: SVHN data, using a true positive sample from the same class, as in our analysis. Center Left: SVHN data, but using data augmentation. Here the positive sample is an augmented version of the reference sample, and negatives are augmented as well. Center Right: NCE on CIFAR-10 data, using real positive examples. Right: CIFAR-10 data with data augmentation.

Our theoretical analysis reveals a coverage-collision trade-off and suggests that the poor performance when $k$ is very large is due to many collisions between the negative samples and the anchor/positive, i.e., false negatives. To empirically verify this claim, we modify the representation learning procedure by sampling $k$ negative examples only from those members of the candidate set that do not share the same class as the positive reference point. (If we are unable to find $k$ true negatives for some example in the batch, then no update is performed on that batch; due to the choice of $N,B,k$, this never occurred in our experiments.) In the rightmost plot of Figure 3, we compare this sampling procedure with the standard one for different values of $N$ and $k$ and a fixed $B=1000$. We observe that artificially removing collisions improves performance across all values of $N$; however, the effect is more significant for smaller values of $N$ ($1.5\%$ for $N=200$ and $k=1000$). That the effect is more significant for small $N$ is to be expected, since the probability of collision decreases rapidly with $N$, so the two sampling procedures are quite similar for large $N$. We further notice that performance decays only slightly for very large values of $k$ in the absence of collisions. This evidence supports the collision hypothesis, namely that collisions are a central reason for performance degradation with increasing $k$. This observation was also made by Chuang et al. (2020). Lastly, as hypothesized by prior work (Chen et al., 2020), we believe a mild reduction in accuracy with large values of $k$ can happen even in the absence of collisions, due to instability in optimization.
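A sketch of this modified sampling step is given below (our own illustrative code; the class labels of the batch, which are otherwise unused by NCE, are assumed to be available for this diagnostic).

    import torch

    def sample_noncolliding_negatives(labels, i, k):
        """Return k candidate indices whose class differs from that of anchor i, or None.

        labels: length-2B list of class labels for the candidates [x_1..x_B, x_1^+..x_B^+].
        """
        B = len(labels) // 2
        idx = [j for j in range(2 * B) if j % B != i and labels[j] != labels[i]]
        if len(idx) < k:                       # cannot find k true negatives: skip this batch
            return None
        return torch.tensor(idx)[torch.randperm(len(idx))[:k]]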

Vision experiments.

In this section, we decouple the batch size from $k$ and measure downstream performance as a function of these parameters on the SVHN and CIFAR-10 datasets. We fit all parameters except the last layer of a ResNet-18 model using the NCE loss, and we evaluate downstream accuracy by training the last layer only, in a fully supervised fashion. We use a learning rate of $10^{-3}$, train for 400 epochs, and fit parameters with the Adam optimizer.

We consider two ways of sampling tuples of images for the NCE objective. In the first, we adhere to our theoretical framework and use class information to capture semantic similarity. Here, for a given sample, we use an image from the same class to act as a positive, and we do not use data augmentation. In the second, we deviate from our theory and do not use class information. Instead, we follow a common empirical approach and use an augmented version of the anchor point as the positive; negative samples are also augmented. For these experiments, we use the SimCLR data augmentation scheme, consisting of randomly applied crops, color jitter, horizontal flips, greyscaling, and blurring (Chen et al., 2020).
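For concreteness, one plausible torchvision parameterization of this kind of augmentation pipeline is sketched below (the exact crop sizes and jitter strengths used in our runs may differ; GaussianBlur requires a reasonably recent torchvision release).

    from torchvision import transforms

    # Two independent draws of this transform applied to the same image give the
    # anchor and positive views; negatives are other images augmented the same way.
    simclr_augment = transforms.Compose([
        transforms.RandomResizedCrop(32),                                       # CIFAR-10 / SVHN resolution
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
        transforms.ToTensor(),
    ])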

We experiment with both the SVHN and CIFAR-10 image datasets, with Figure 4 summarizing our results. When using augmentation, the trend shown in previous work is observed, with accuracy tending to improve with increasing batch size and $k$. When using class information for sampling positives, however, the performance trends are somewhat unexpected. Downstream SVHN accuracy in this setting is remarkably consistent, with nearly all combinations of $k$ and batch size producing roughly equivalent, high-quality representations for downstream classification. In the CIFAR-10 case, this is not true; in fact, performance tends to increase as batch size decreases. This is likely due to the exclusion of data augmentation, which is necessary for obtaining reasonable performance even in the fully supervised regime for these architectures. In its absence, the variance induced by decreasing the batch size appears to offer a regularization benefit that improves accuracy. Still, overall performance without augmentation is poor in comparison to data-augmented NCE.

These experiments raise an interesting conjecture: perhaps the observation that downstream performance increases with $k$ and batch size can be attributed to the kinds of data augmentation techniques that have been refined specifically for vision, rather than to the NCE objective itself. In NLP, where there are less clear-cut ways of performing data augmentation, we see trends that more closely agree with our theory. Either way, these experiments point to a distortion between the theoretical framework typically used to study NCE (using true positives) and, in the vision case, what is observed in practice. The core of this distortion appears to involve the use of augmentation techniques that have been highly tuned for object recognition and computer vision tasks.

5 Discussion

In this paper, we study how the number of negative samples $k$ used in contrastive representation learning affects the quality of learned representations. Our theoretical results highlight a tension between covering the concepts in the data and avoiding collisions, and suggest that the number of negatives should scale roughly linearly with the number of concepts. Our NLP experiments confirm this trade-off, whereas our vision experiments show a more entangled dependence on other factors such as the inductive bias of the architecture and the use of data augmentation.

This article therefore raises a number of interesting directions for future work, spanning both theory and practice, and calls for a more detailed investigation to better understand the success of NCE in different domains. We close with a few of these directions and related points of discussion:

Data augmentation.

As alluded to by our vision experiments, the use of data augmentation in practice seems to be a principal factor in the gap between what is observed empirically and what we describe theoretically. Because NCE is so commonly used with image data, developing a theoretical model that can incorporate properties of commonly used augmentations and the inductive bias of networks is an important avenue for future research.

Choosing $k$ in practice.

When the NCE distribution and the downstream task are closely linked, our theoretical results indicate that the number of negatives used in NCE should scale with the number of classes. In practice, however, choosing $k$ is less clear, with the optimal choice being a function of complex factors. The model’s inductive bias, the choice of data augmentation used, and the data type all seem to influence the ideal value. Given this, are there algorithmic principles for setting $k$ in practice?

Decoupling negatives from other parameters.

This work disentangles $k$ from the batch size used for model training, with our experiments suggesting that each plays a different role in the quality of the learned representation. Modern NCE pipelines have several conflating parameters, and we believe that a deeper empirical study is imperative for disambiguating and understanding the role of each.

Acknowledgements

We thank Nikunj Saunshi for helpful discussions regarding the experimental aspects of this work. We also thank Remi Tachet des Combes and Philip Bachman for interesting discussions and for providing references to relevant work.

References

  • Bachman et al. (2019) P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, 2019.
  • Bird et al. (2009) S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, 2009.
  • Chen et al. (2020) T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
  • Chopra et al. (2005) S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
  • Chuang et al. (2020) C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka. Debiased contrastive learning. In Advances in Neural Information Processing Systems, 2020.
  • Clark et al. (2020) K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020.
  • Fang and Xie (2020) H. Fang and P. Xie. Cert: Contrastive self-supervised learning for language understanding. arXiv:2005.12766, 2020.
  • Foster et al. (2020) A. Foster, M. Jankowiak, M. O’Meara, Y. W. Teh, and T. Rainforth. A unified stochastic gradient approach to designing bayesian-optimal experiments. In International Conference on Artificial Intelligence and Statistics, 2020.
  • Foster and Rakhlin (2019) D. J. Foster and A. Rakhlin. $\ell_{\infty}$-vector contraction for Rademacher complexity. arXiv:1911.06468, 2019.
  • Fu et al. (2021) H. Fu, S. Zhou, Q. Yang, J. Tang, G. Liu, K. Liu, and X. Li. Lrc-bert: Latent-representation contrastive knowledge distillation for natural language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • Giorgi et al. (2020) J. M. Giorgi, O. Nitski, G. D. Bader, and B. Wang. Declutr: Deep contrastive learning for unsupervised textual representations. arXiv:2006.03659, 2020.
  • Gutmann and Hyvärinen (2010) M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, 2010.
  • He et al. (2020) K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • Huo et al. (2020) X. Huo, L. Xie, L. Wei, X. Zhang, H. Li, Z. Yang, W. Zhou, H. Li, and Q. Tian. Heterogeneous contrastive learning: Encoding spatial information for compact visual representations. arXiv:2011.09941, 2020.
  • Khosla et al. (2020) P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, 2020.
  • Kipf et al. (2019) T. Kipf, E. van der Pol, and M. Welling. Contrastive learning of structured world models. In International Conference on Learning Representations, 2019.
  • Laskin et al. (2020) M. Laskin, A. Srinivas, and P. Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, 2020.
  • Le-Khac et al. (2020) P. H. Le-Khac, G. Healy, and A. F. Smeaton. Contrastive representation learning: A framework and review. IEEE Access, 2020.
  • Lee et al. (2020) J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv:2008.01064, 2020.
  • Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (Workshop Track), 2013.
  • Mohri et al. (2018) M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Nachum et al. (2018) O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. In International Conference on Learning Representations, 2018.
  • Oord et al. (2018) A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
  • Saunshi et al. (2019) N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, 2019.
  • Schroff et al. (2015) F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • Shelhamer et al. (2017) E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell. Loss is its own reward: Self-supervision for reinforcement learning. In International Conference on Learning Representations (Workshop Track), 2017.
  • Smith and Eisner (2005) N. A. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Annual Meeting of the Association for Computational Linguistics, 2005.
  • Tosh et al. (2020) C. Tosh, A. Krishnamurthy, and D. Hsu. Contrastive estimation reveals topic posterior information to linear models. arXiv:2003.02234, 2020.
  • Tosh et al. (2021) C. Tosh, A. Krishnamurthy, and D. Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, 2021.
  • Tschannen et al. (2019) M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic. On mutual information maximization for representation learning. In International Conference on Learning Representations, 2019.
  • Wang and Isola (2020) T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, 2020.
  • Wang et al. (2020) X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li. Dense contrastive learning for self-supervised visual pre-training. arXiv:2011.09157, 2020.
  • Zhu et al. (2020) J. Zhu, Y. Xia, L. Wu, J. Deng, W. Zhou, T. Qin, and H. Li. Masked contrastive representation learning for reinforcement learning. arXiv:2010.07470, 2020.

Appendix

The appendix is organized as follows: Appendix A-C present proofs of our theoretical results. Appendix D presents experimental details including additional experiments and hyperparameter values.

Appendix A Useful Facts

The standard multi-class hinge loss $\ell(v)=\max\{0,1+\max_{i}\{-v_{i}\}\}$ and the logistic loss $\ell(v)=\log\left(1+\sum_{i}\exp(-v_{i})\right)$, for a vector $v\in\mathbb{R}^{k}$, satisfy the following lemma:

Lemma 1 (Sub-additivity of losses).

Let $v\in\mathbb{R}^{k}$ be any vector. For all $I_{1},I_{2}\subset[k]$ with $S=I_{1}\cup I_{2}$, we have that

\ell(\{v_{i}\}_{i\in I_{1}})\leq\ell(\{v_{i}\}_{i\in S})\leq\ell(\{v_{i}\}_{i\in I_{1}})+\ell(\{v_{i}\}_{i\in I_{2}}).
Lemma 2 (Coupon Collector).

Consider the coupon collector problem with $n$ coupons, each with probability $p_{i}$. In each trial we collect a coupon by sampling from the distribution $(p_{1},\ldots,p_{n})$. Let $M$ denote the minimum number of trials until all coupons have been collected. Then:

\mathbb{E}[M]=\int_{0}^{\infty}\left(1-\prod_{i=1}^{n}\left(1-\exp(-p_{i}t)\right)\right)dt\leq\frac{H_{n}}{\min_{i}p_{i}},

where $H_{n}$ is the $n$-th harmonic number. Furthermore, using Markov’s inequality,

\Pr\left[M\geq\frac{2H_{n}}{\min_{i}p_{i}}\right]\leq\mathbb{E}\left[M\right]\cdot\frac{\min_{i}p_{i}}{2H_{n}}\leq 1/2.
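The expectation bound can be checked numerically with a short simulation (our own sanity-check sketch; the coupon probabilities below are illustrative and not part of the formal argument).

    import numpy as np

    def expected_collection_time(p, trials=2000, rng=np.random.default_rng(0)):
        """Monte-Carlo estimate of E[M], the number of trials needed to see every coupon."""
        n, total = len(p), 0
        for _ in range(trials):
            seen, m = set(), 0
            while len(seen) < n:
                seen.add(rng.choice(n, p=p))
                m += 1
            total += m
        return total / trials

    p = np.array([0.4, 0.3, 0.2, 0.1])
    H_n = np.sum(1.0 / np.arange(1, len(p) + 1))             # harmonic number H_n
    print(expected_collection_time(p), "<=", H_n / p.min())  # bound H_n / min_i p_i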
Lemma 3.

For any vectors $v,w\in\mathbb{R}^{t}$,

\ell(v)-\ell(w)\leq\left|\max_{i}(w_{i}-v_{i})\right|.
Proof.

For hinge loss,

\begin{align*}
\ell(v) &= \max\left(0,1+\max_{i}((w_{i}-v_{i})-w_{i})\right) \\
&\leq \max\left(0,1+\max_{i}(-w_{i})+\max_{i}(w_{i}-v_{i})\right) \\
&\leq \max\left(0,1+\max_{i}(-w_{i})\right)+\left|\max_{i}(w_{i}-v_{i})\right|=\ell(w)+\left|\max_{i}(w_{i}-v_{i})\right|.
\end{align*}

For logistic loss,

\begin{align*}
\ell(v) &= \log\left(1+\sum_{i}\exp(-v_{i})\right) \\
&= \log\left(1+\sum_{i}\exp(-w_{i})\exp(w_{i}-v_{i})\right) \\
&\leq \log\left(1+\exp\left(|\max_{i}(w_{i}-v_{i})|\right)\sum_{i}\exp(-w_{i})\right) \\
&\leq |\max_{i}(w_{i}-v_{i})|+\log\left(1+\sum_{i}\exp(-w_{i})\right)=\ell(w)+\left|\max_{i}(w_{i}-v_{i})\right|.
\end{align*}

The last inequality holds because $1\leq\exp(|\max_{i}(w_{i}-v_{i})|)$. ∎

Appendix B General Results with Non-uniform Probabilities and Proofs

In this section, we state and prove more general versions of Theorem 1 and Theorem 3 which can accommodate a non-uniform class distribution $\rho$. The particular results in the paper can be obtained by setting all class probabilities to $1/N$.

Notation.

For a drawn sample, denote the collisions by $I(c,c^{-}_{1},\ldots,c^{-}_{k})=\{i\in[k]\,|\,c=c^{-}_{i}\}$. Let $\tau_{k}(c)=\Pr_{c^{-}_{i}\sim\rho^{k}}[I(c,c^{-}_{1},\ldots,c^{-}_{k})\neq\emptyset]=1-(1-\rho(c))^{k}$ denote the probability of seeing class $c$ in the negative samples (i.e., the collision probability). Also, let $\tau_{k}=\sum_{c^{\prime}\in\mathcal{C}}\rho(c^{\prime})\tau_{k}(c^{\prime})=1-\sum_{c^{\prime}\in\mathcal{C}}\rho(c^{\prime})(1-\rho(c^{\prime}))^{k}$. We drop the arguments of $I$ when they are clear from context.

B.1 General Transfer Theorem

Let us first start with the general transfer theorem.

Theorem 5 (General Transfer Bound).

For all $f\in\mathcal{F}$, we have

\mathcal{L}_{\textup{sup}}(f)\leq\frac{2\max\left(1,\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right)}{\left(1-\rho_{\max}\right)^{k}}\left(\mathcal{L}^{(k)}_{\textup{NCE}}(f)-\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\,\middle|\,I\neq\emptyset\right]\right),

where $\rho_{\min}=\min_{c}\rho(c)$, $\rho_{\max}=\max_{c}\rho(c)$, and $H_{t}$ is the $t^{\text{th}}$ harmonic number.

Proof.

The first step is to apply Jensen’s inequality to get,

\begin{align*}
\mathcal{L}^{(k)}_{\textup{NCE}}(f) &= \underset{c,c_{i}^{-}\sim\rho^{k+1}}{\mathbb{E}}\;\underset{\begin{subarray}{c}x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\right\}\right)\right] \\
&\geq \underset{\begin{subarray}{c}c,c^{-}_{i}\sim\rho^{k+1}\\ x\sim D_{c}\end{subarray}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\right].
\end{align*}

Introducing the collision function $I$, we have

\begin{align*}
\mathcal{L}^{(k)}_{\textup{NCE}}(f) &\geq \underset{x\sim D}{\mathbb{E}}\;\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\right] \\
&\geq \underset{(x,c)\sim D}{\mathbb{E}}\left[(1-\tau_{k}(c))\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\middle|I=\emptyset,c\right]\right]+\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\middle|I\neq\emptyset\right].
\end{align*}

Here $\ell_{t}(0)$ denotes $\ell$ evaluated at a $t$-dimensional zero vector, and the $\ell_{|I|}(0)$ term is obtained using the sub-additivity property in Lemma 1. Note that for the second term, conditional on $I\neq\emptyset$, the distribution of $c$ is not $\rho$ in general. We deal with the second term in the next section, and focus now on the first term.

Lemma 4.

For any $c\in\mathcal{C}$ and $x$, we have

\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\}\right)\middle|I=\emptyset,c\right]\geq\frac{1}{2\left\lceil\frac{2(1-\rho(c))H_{N-1}}{k\min_{c^{\prime}\neq c}\rho(c^{\prime})}\right\rceil}\cdot\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right),

where $H_{t}$ is the $t^{\text{th}}$ harmonic number.

Proof.

We can view the conditioning on I=\emptyset and c as selecting the c^{-}_{i} from \rho_{-c}, the distribution with support on all c^{\prime}\neq c and \rho_{-c}(c^{\prime})=\frac{\rho(c^{\prime})}{1-\rho(c)}. Therefore, we have

\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\}\right)\middle|I=\emptyset,c\right]
\stackrel{\textnormal{(a)}}{=}\underset{c^{-}_{i}\sim\rho_{-c}^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\}\right)\right]
\stackrel{\textnormal{(b)}}{=}\frac{1}{m}\sum_{j=1}^{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{ij}}\right)\}\right)\right]
=\frac{1}{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\mathbb{E}}\left[\sum_{j=1}^{m}\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{ij}}\right)\}\right)\right]
\stackrel{\textnormal{(c)}}{\geq}\frac{1}{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\cup\{c^{-}_{ij}\}}\right)\right]
\stackrel{\textnormal{(d)}}{\geq}\frac{1}{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\mathbb{E}}\left[{\bf 1}\left[\cup\{c^{-}_{ij}\}=\mathcal{C}\setminus\{c\}\right]\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\right]
=\frac{1}{m}\cdot\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\cdot\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\Pr}\left[\cup\{c^{-}_{ij}\}=\mathcal{C}\setminus\{c\}\right].

In step (a) we replace the distribution over negatives, as described above. In step (b) we construct m copies of this expression and take an average, which is a key step in the argument. Next, step (c) uses Lemma 1, which allows us to replace the sum over the m losses with a single multi-class loss on the union of the labels. Inequality (d) uses the non-negativity of the loss, and finally we observe that the loss expression no longer depends on the samples c_{ij}^{-}.

To bound the probability term, note that it can be viewed as an instance of the standard coupon collector problem with N-1 different coupons and non-identical selection probabilities given by \rho_{-c}. In this instance, we perform mk trials. Therefore, using Lemma 2, if we take mk\geq\frac{2(1-\rho(c))H_{N-1}}{\min_{c^{\prime}\neq c}\rho(c^{\prime})} trials, we cover all coupons with probability at least 1/2. This implies that taking m=\left\lceil\frac{2(1-\rho(c))H_{N-1}}{k\min_{c^{\prime}\neq c}\rho(c^{\prime})}\right\rceil, the probability of hitting all elements in \mathcal{C}\setminus\{c\} is at least 1/2, which yields the desired result. ∎
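
As a quick sanity check on the coupon-collector step (illustrative only, not part of the proof; the values of N, k, and the trial count are arbitrary choices), the following Python sketch verifies that, for a uniform prior, m*k draws from \rho_{-c} with m as above cover all N-1 remaining classes with probability well above 1/2.

# Monte Carlo check of the coupon-collector step (uniform class prior assumed).
import math, random

N, k = 50, 10
rho = 1.0 / N                                   # uniform prior, so min_{c' != c} rho(c') = 1/N
H = sum(1.0 / i for i in range(1, N))           # harmonic number H_{N-1}
m = math.ceil(2 * (1 - rho) * H / (k * rho))    # the value of m used in the proof
trials, covered = 2000, 0
for _ in range(trials):
    seen = {random.randrange(N - 1) for _ in range(m * k)}   # m*k draws from rho_{-c}
    covered += (len(seen) == N - 1)
print(m, covered / trials)   # empirical coverage probability; should be at least 1/2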

Substituting back, for the first term, we get,

\underset{(x,c)\sim D}{\mathbb{E}}\left[(1-\tau_{k}(c))\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\middle|I=\emptyset,c\right]\right]
\geq\underset{(x,c)\sim D}{\mathbb{E}}\left[\frac{1-\tau_{k}(c)}{2\left\lceil\frac{2(1-\rho(c))H_{N-1}}{k\min_{c^{\prime}\neq c}\rho(c^{\prime})}\right\rceil}\cdot\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\right]
\geq\frac{\left(1-\rho_{\max}\right)^{k}}{2\left\lceil\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right\rceil}\underset{(x,c)\sim D}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\right]
\geq\frac{\left(1-\rho_{\max}\right)^{k}}{2\left\lceil\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right\rceil}\mathcal{L}^{\mu}_{\textup{sup}}(f).

This gives us the desired result. ∎
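
To make the shape of this bound concrete, the short sketch below (uniform class prior assumed; the choice of N and the grid of k values are ours) evaluates the coefficient (1-\rho_{\max})^{k}/\big(2\lceil 2(1-\rho_{\min})H_{N-1}/(k\rho_{\min})\rceil\big) in front of \mathcal{L}^{\mu}_{\textup{sup}}(f): it degrades for both very small and very large k and peaks at an intermediate k of order N, which is the collision-coverage trade-off discussed in the main text.

# Illustrative evaluation of the coefficient in the bound (uniform class prior assumed).
import math

def coefficient(N, k):
    rho = 1.0 / N                                   # rho_min = rho_max = 1/N
    H = sum(1.0 / i for i in range(1, N))           # H_{N-1}
    return (1 - rho) ** k / (2 * math.ceil(2 * (1 - rho) * H / (k * rho)))

N = 100
for k in (1, 10, 50, 100, 200, 500, 1000):
    print(k, coefficient(N, k))   # small for tiny and huge k, largest near k of order N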

B.2 Refining the Transfer Theorem

As Saunshi et al. [2019] point out, we cannot drive \mathcal{L}_{\textup{NCE}} to 0, as it is always lower bounded by \tau_{k}. To get a better understanding of when the loss is actually small, let us define two terms that will be useful for a more refined analysis.

Definition 2 (Average Intra-class Variance).

We define the average intra-class variance of a learned embedding as,

s(f):=\underset{c\sim\rho}{\mathbb{E}}\left[\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\right]

where \Sigma(f,c)=\mathbb{E}_{x\sim D_{c}}[(f(x)-\mu_{c})(f(x)-\mu_{c})^{\top}] is the covariance matrix of the representation f restricted to the class distribution D_{c}.
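
For intuition, s(f) can be estimated from samples. The sketch below (hypothetical inputs: a list of per-class embedding matrices and the prior weights) computes the plug-in estimate corresponding to Definition 2.

# A rough plug-in estimate of s(f) from sampled embeddings (hypothetical inputs).
import numpy as np

def estimate_s(embeddings_by_class, rho):
    """embeddings_by_class: list of (n_c, d) arrays of f(x) for x ~ D_c; rho: class prior weights."""
    total = 0.0
    for p, Z in zip(rho, embeddings_by_class):
        mu = Z.mean(axis=0)
        Sigma = (Z - mu).T @ (Z - mu) / Z.shape[0]     # empirical covariance Sigma(f, c)
        total += p * np.linalg.norm(Z, axis=1).mean() * np.sqrt(np.linalg.norm(Sigma, ord=2))
    return total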

Definition 3 (NCE loss without collisions).

We define a loss function that is obtained from the NCE loss by removing the colliding negatives,

\mathcal{L}^{(k)}_{\neq}(f):=\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\not\in I}\right)\right].

Observe that both \mathcal{L}^{(k)}_{\neq} and s(f) can be made very small for large enough \mathcal{F}. Now we are ready to present our refined bound.

Theorem 6 (General Refined Bound).

For all f\in\mathcal{F}, we have

\mathcal{L}^{\mu}_{\textup{sup}}(f)\leq\frac{2\left\lceil\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right\rceil}{(1-\rho_{\max})^{k}}\left(\mathcal{L}^{(k)}_{\neq}(f)+\sqrt{k\rho_{\max}}\cdot s(f)\right).
Proof.

Decomposing the NCE loss gives us,

\mathcal{L}^{(k)}_{\textup{NCE}}(f)-\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\middle|I\neq\emptyset\right]
\leq\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\not\in I}\right)+\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\in I}\right)\right]-\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\middle|I\neq\emptyset\right]
=\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\not\in I}\right)\right]+\tau_{k}\underbrace{\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell(\{f(x)^{\top}(f(x^{+})-f(x_{i}^{-}))\}_{i\in I})-\ell_{|I|}(0)\middle|I\neq\emptyset\right]}_{\Delta(f)}
=\mathcal{L}^{(k)}_{\neq}(f)+\tau_{k}\Delta(f).

As before, note that the second term samples the classes (c,c_{i}^{-}) from the conditional distribution given that there is a collision. We follow the convention that an empty index set contributes value 0 to \mathcal{L}^{(k)}_{\neq}. Note that this first term, \mathcal{L}^{(k)}_{\neq}(f), can be made very small if the classes are separable by f with a sufficient margin. As for the second term, we will first prove the following lemma.

Lemma 5.

For any t\geq 0:

\underset{x,x^{+},x_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x_{i})\right)\}_{i\in[t]}\right)-\ell_{t}(0)\right]\leq\sqrt{t}\,\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}.

The above lemma is an improvement over Lemma A.1 of Saunshi et al. [2019], which has a linear scaling with t. Observe that on the LHS, all samples are drawn from D_{c}, which corresponds to the situation where we have t collisions with the anchor class c.

Proof of Lemma 5.

From Lemma 3, we know that

\underset{x,x^{+},x^{-}_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\in[t]}\right)-\ell_{t}(0)\right]\leq\underset{x,x^{+},x^{-}_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\left|\max_{i}f(x)^{\top}\left(f(x^{-}_{i})-f(x^{+})\right)\right|\right].

Now we have,

\underset{x,x^{+},x^{-}_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\left|\max_{i}f(x)^{\top}\left(f(x^{-}_{i})-f(x^{+})\right)\right|\right]
=\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\underset{x^{+},x^{-}_{i}\sim D_{c}^{1+t}}{\mathbb{E}}\left[\left|\max_{i}\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x^{-}_{i})-f(x^{+})\right)\right|\right]\right]
\leq\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\sqrt{\underset{x^{+},x^{-}_{i}\sim D_{c}^{1+t}}{\mathbb{E}}\left[\left(\max_{i}\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x^{-}_{i})-f(x^{+})\right)\right)^{2}\right]}\right]
\leq\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\sqrt{\sum_{i}\underset{x^{+},x^{-}_{i}\sim D_{c}^{2}}{\mathbb{E}}\left[\left(\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x_{i}^{-})-f(x^{+})\right)\right)^{2}\right]}\right]
\leq\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\sqrt{t\underset{x^{+},x^{-}\sim D_{c}^{2}}{\mathbb{E}}\left[\left(\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x^{-})-f(x^{+})\right)\right)^{2}\right]}\right]
\leq\sqrt{t}\,\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}.

Combining with the above inequality gives us the desired result. ∎

Substituting back, we get the following bound on the second term,

\tau_{k}\Delta(f)\leq\underset{c\sim\rho}{\mathbb{E}}\left[\tau_{k}(c)\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\sqrt{|I|}\middle|I\neq\emptyset,c\right]\right]
\leq\underset{c\sim\rho}{\mathbb{E}}\left[\tau_{k}(c)\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\sqrt{\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[|I|\middle|I\neq\emptyset,c\right]}\right]
\stackrel{\textnormal{(a)}}{=}\underset{c\sim\rho}{\mathbb{E}}\left[\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\sqrt{k\rho(c)\tau_{k}(c)}\right]
=\sqrt{k}\underset{c\sim\rho}{\mathbb{E}}\left[\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\sqrt{\rho(c)\tau_{k}(c)}\right]
\stackrel{\textnormal{(b)}}{\leq}\sqrt{k\rho_{\max}}\,s(f),

where (a) follows from observing that \underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[|I|\middle|I\neq\emptyset,c\right]=\frac{k\rho(c)}{\tau_{k}(c)} (see also equation (35) of Saunshi et al. [2019]) and (b) follows from Definition 2. Combining these terms gives us the desired result. ∎
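
The identity used in step (a) is also easy to verify numerically; the following sketch (uniform prior; N, k, and the trial count are arbitrary choices) compares a Monte Carlo estimate of the conditional expectation with k\rho(c)/\tau_{k}(c).

# Monte Carlo check (uniform prior assumed) of E[|I| | I != emptyset, c] = k*rho(c)/tau_k(c).
import numpy as np

N, k, trials = 20, 15, 200_000
rho_c = 1.0 / N
tau_k = 1 - (1 - rho_c) ** k                                   # probability of at least one collision
collisions = (np.random.rand(trials, k) < rho_c).sum(axis=1)   # |I| ~ Binomial(k, rho(c))
print(collisions[collisions > 0].mean(), k * rho_c / tau_k)    # the two numbers agree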

Note that Saunshi et al. [2019] also obtain a similar result, but in their analysis the coefficient on the s(f) term scales linearly with k, whereas we obtain a \sqrt{k} scaling. However, for the case of a uniform class distribution \rho, their coefficient is tighter in the k\leq N regime.

B.3 Generalization Bound

To transfer our guarantee to the minimizer of the empirical NCE loss, we prove a generalization bound using a notion of worst case Rademacher complexity.

Definition 4 (Rademacher Complexity).

The empirical Rademacher complexity of a real-valued function class \mathcal{H} for a set S drawn from \mathcal{X}^{m} is defined to be

\widehat{\mathcal{R}}_{S}(\mathcal{H}):=\mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}}\left(\frac{1}{m}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\right)\right],

where the \sigma_{i} are i.i.d. Rademacher random variables. We define the worst-case Rademacher complexity to be

\mathcal{R}_{m}(\mathcal{H}):=\max_{S\in\mathcal{X}^{m}}\widehat{\mathcal{R}}_{S}(\mathcal{H}).
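
As a concrete (illustrative, not used in the proofs) instance of Definition 4, the empirical Rademacher complexity of the norm-bounded linear class \{x\mapsto\langle w,x\rangle:\|w\|\leq 1\} has the closed-form supremum \|\frac{1}{m}\sum_{i}\sigma_{i}x_{i}\|, which the sketch below averages over sampled sign vectors.

# Monte Carlo estimate of the empirical Rademacher complexity of a norm-bounded linear class.
import numpy as np

def empirical_rademacher_linear(X, num_draws=5000):
    m = X.shape[0]
    total = 0.0
    for _ in range(num_draws):
        sigma = np.random.choice([-1.0, 1.0], size=m)
        # sup_{||w|| <= 1} (1/m) sum_i sigma_i <w, x_i> = || (1/m) sum_i sigma_i x_i ||
        total += np.linalg.norm(sigma @ X / m)
    return total / num_draws

X = np.random.randn(200, 32)   # a fixed sample S of m = 200 points in R^32
print(empirical_rademacher_linear(X))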

We will use the following theorem to bound the Rademacher complexity of our vector valued class,

Theorem 7 (Foster and Rakhlin [2019]).

Let \mathcal{H}\subseteq\{h:\mathcal{X}\rightarrow\mathbb{R}^{K}\}, and let \phi be a real-valued L-Lipschitz function with respect to the \ell_{\infty} norm. We have

\mathcal{R}_{S}(\phi\circ\mathcal{H})\leq\tilde{O}\left(L\sqrt{K}\right)\max_{i}\mathcal{R}_{M}(\mathcal{H}|_{i}),

where \mathcal{H}|_{i} is the function class restricted to the i^{\text{th}} coordinate.

Note that the above theorem relates the empirical Rademacher complexity on a sample S to the worst-case Rademacher complexity. Finally, we will use the following standard generalization result based on Rademacher complexity.

Theorem 8 (Mohri et al. [2018]).

For a real-valued function class \mathcal{H} and a set S=\{x_{1},\ldots,x_{m}\}\in\mathcal{X}^{m} drawn i.i.d. from some distribution, with probability 1-\delta, for all h\in\mathcal{H},

\left|\mathbb{E}[h(x)]-\frac{1}{m}\sum_{i=1}^{m}h(x_{i})\right|\leq\mathcal{R}_{S}(\mathcal{H})+O\left(\sqrt{\frac{\log(1/\delta)}{m}}\right).

Now we state our generalization bound for the NCE loss.

Theorem 9.

Let B=\max_{x\in\mathcal{X},f^{\prime}\in\mathcal{F}}\|f^{\prime}(x)\|_{1}. For all f\in\mathcal{F}, with probability 1-\delta over the random sample S of M NCE examples used to fit \widehat{f},

\mathcal{L}^{(k)}_{\textup{NCE}}(\widehat{f})\leq\mathcal{L}^{(k)}_{\textup{NCE}}(f)+\tilde{O}\left(B\sqrt{(k+2)d}\right)\max_{i}\mathcal{R}_{M}(\mathcal{F}|_{i})+O\left(\sqrt{\frac{\log(1/\delta)}{M}}\right).
Proof.

Let us assume the dimension of the input features is n. Define the function \phi:\mathbb{R}^{d(k+2)}\rightarrow\mathbb{R} as

\phi(z,z^{+},z^{-}_{1},\ldots,z^{-}_{k})=\ell\left(\left\{z^{\top}(z^{+}-z^{-}_{i})\right\}_{i=1}^{k}\right),

where each z,z^{+},z^{-}_{i}\in\mathbb{R}^{d}. Also define the class of functions \mathcal{G}:=\{g_{f}\,|\,f\in\mathcal{F}\}, where g_{f}:\mathbb{R}^{n(k+2)}\rightarrow\mathbb{R}^{d(k+2)} is given by

g_{f}(x,x^{+},x^{-}_{1},\ldots,x^{-}_{k})=(f(x),f(x^{+}),f(x^{-}_{1}),\ldots,f(x^{-}_{k})).

Observe that \phi\circ g_{f} gives us exactly the NCE loss for a single sample. Therefore, using Theorem 8 and the standard ERM analysis, we have with probability 1-\delta, for all f\in\mathcal{F},

\mathcal{L}^{(k)}_{\textup{NCE}}(\widehat{f})\leq\mathcal{L}^{(k)}_{\textup{NCE}}(f)+2\mathcal{R}_{S}(\phi\circ\mathcal{G})+O\left(\sqrt{\frac{\log(1/\delta)}{M}}\right).

Now we will show that \phi is Lipschitz. Observe that, for any W=(w,w^{+},w^{-}_{1},\ldots,w^{-}_{k}) and Z=(z,z^{+},z^{-}_{1},\ldots,z^{-}_{k}),

|\phi(Z)-\phi(W)|=|\phi(z,z^{+},z^{-}_{1},\ldots,z^{-}_{k})-\phi(w,w^{+},w^{-}_{1},\ldots,w^{-}_{k})|
=\left|\ell\left(\left\{z^{\top}(z^{+}-z^{-}_{i})\right\}_{i=1}^{k}\right)-\ell\left(\left\{w^{\top}(w^{+}-w^{-}_{i})\right\}_{i=1}^{k}\right)\right|
\leq\left|\max_{i}\left(w^{\top}(w^{+}-w^{-}_{i})-z^{\top}(z^{+}-z^{-}_{i})\right)\right| (Using Lemma 3)
\leq\left|w^{\top}w^{+}-z^{\top}z^{+}+\max_{i}\left(z^{\top}z_{i}^{-}-w^{\top}w_{i}^{-}\right)\right|
\leq\left|w^{\top}w^{+}-z^{\top}z^{+}\right|+\max_{i}\left|w^{\top}w_{i}^{-}-z^{\top}z_{i}^{-}\right|.

Thus we have several terms of the form |w^{\top}\tilde{w}-z^{\top}\tilde{z}| where \tilde{w}\in W,\tilde{z}\in Z. For these, we have the bound

|w^{\top}\tilde{w}-z^{\top}\tilde{z}|\leq|w^{\top}(\tilde{w}-\tilde{z})+(w-z)^{\top}\tilde{z}|\leq 2B\|W-Z\|_{\infty},

where \|\cdot\|_{\infty} is the \ell_{\infty} norm in the d(k+2)-dimensional space. Thus we conclude that

|\phi(Z)-\phi(W)|\leq 4B\|W-Z\|_{\infty}.

Now applying Theorem 7, we have

\mathcal{R}_{S}(\phi\circ\mathcal{G})\leq\tilde{O}\left(B\sqrt{(k+2)d}\right)\max_{i\in[d(k+2)]}\mathcal{R}_{M}(\mathcal{G}|_{i})\leq\tilde{O}\left(B\sqrt{(k+2)d}\right)\max_{i\in[d]}\mathcal{R}_{M}(\mathcal{F}|_{i}).

Substituting back gives us the desired result. Here, to pass from the Rademacher complexity of \mathcal{G}, which has output dimension d(k+2), to that of \mathcal{F}, we use the fact that g_{f} is simply k+2 copies of f applied to different examples and that we are working with the worst-case Rademacher complexity. ∎

Applying this to \widehat{f} and using the fact that it is the minimizer of the empirical NCE loss gives us the required guarantee on the NCE loss of \widehat{f}.

Appendix C Calculations for the Example in Section 3

Properties of f_{1}.

First, it is easy to see that \mathcal{L}_{\sup}^{\mu}(f_{1})=1 for the hinge loss, since the mean embeddings satisfy \mu_{2i}=\mu_{2i+1}=e_{i}. In addition, every example is embedded as e_{j} for some j, which means that W^{\mu}e_{j} is a 2-sparse binary vector. Such score vectors always yield hinge loss 1.

Now, let us verify the stated identity for \mathcal{L}_{\textup{NCE}}^{(k)}(f_{1}). Due to symmetry, we can assume that x,x^{+} come from class 1. Now, using the definition of f_{1} we see that

\max\{0,1+f_{1}(x)^{\top}(f_{1}(x_{i}^{-})-f_{1}(x^{+}))\}={\bf 1}\{c_{i}^{-}\in\{1,2\}\}.

Taking the max over all of the negatives, we see that the NCE loss will be 1 if and only if class 1 or class 2 appears among the negatives, and otherwise it will be 0. Formally,

\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})=\mathbb{P}_{S\sim\rho^{k}}[S\cap\{1,2\}\neq\emptyset]=1-\mathbb{P}_{S\sim\rho^{k}}[S\cap\{1,2\}=\emptyset]=1-(1-2/N)^{k},

where in the last step we use that the prior \rho is uniform. Now the stated identity follows by algebraic manipulations. In particular:

\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}=1-(1-2/N)^{k}-1+(1-1/N)^{k}=\left(\frac{N-1}{N}\right)^{k}\left(1-\left(1-\frac{1}{N-1}\right)^{k}\right)
=(1-\tau_{k})\left(1-\left(1-\frac{1}{N-1}\right)^{k}\right).

Finally, observe that

\lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})=1-\lim_{k\to\infty}(1-2/N)^{k}=1.

Note that for f_{1},

\mathcal{L}^{\mu}_{\textup{sup}}(f_{1})=1=\frac{1}{(1-\tau_{k})\left(1-\left(1-\frac{1}{N-1}\right)^{k}\right)}\left(\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}\right)\geq\frac{N-1}{(1-\tau_{k})k}\left(\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}\right),

where the inequality follows from observing that 1-(1-x)^{r}\leq rx for 0<x<1. This shows that Theorem 1 is tight up to \log factors.
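
The f_{1} calculation is easy to check numerically; the sketch below (uniform prior, with N, k, and the trial count chosen arbitrarily; classes are 0-indexed, so the anchor class and its partner are 0 and 1) estimates \mathcal{L}_{\textup{NCE}}^{(k)}(f_{1}) by sampling negatives and compares it with 1-(1-2/N)^{k} and with the identity involving \tau_{k}.

# Monte Carlo check of the f_1 NCE hinge loss and the identity with tau_k.
import numpy as np

N, k, trials = 64, 20, 100_000
rng = np.random.default_rng(0)
negs = rng.integers(0, N, size=(trials, k))            # classes of the k negatives
loss = np.isin(negs, [0, 1]).any(axis=1).mean()        # loss is 1 iff class 0 or 1 appears
tau_k = 1 - (1 - 1.0 / N) ** k
print(loss, 1 - (1 - 2.0 / N) ** k)                    # empirical loss vs. 1 - (1 - 2/N)^k
print(loss - tau_k, (1 - tau_k) * (1 - (1 - 1.0 / (N - 1)) ** k))   # both sides of the identity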

Properties of f_{2}.

For f_{2}, the supervised loss is computed as follows. First note that \mu_{i}=e_{i} for each class i. Now, observe that if the example x is from class i, we have f_{2}(x)^{\top}\mu_{j}=0 for all j\neq i. For class i we have f_{2}(x)^{\top}\mu_{i}\sim\textrm{Unif}(\{1-\epsilon,1+\epsilon\}), where the randomness is over the realization of x. In the first of these cases, the hinge loss is \epsilon, while in the second case it is 0. Thus, we have \mathcal{L}_{\sup}^{\mu}(f_{2})=\epsilon/2.

To analyze \mathcal{L}_{\textup{NCE}} we consider two cases. Again, due to symmetry, we can assume that x,x^{+} come from class 1.

Case 1, no collisions:

If none of the negative examples comes from class 1, then f_{2}(x)^{\top}f_{2}(x_{i}^{-})=0 for all i; hence the hinge loss is (where \sigma,\sigma^{+} are Rademacher random variables):

\mathbb{E}_{\sigma,\sigma^{+}}\max\{0,1+(1+\sigma\epsilon)(-1-\sigma^{+}\epsilon)\}=\epsilon^{2}/4+\epsilon/2.

(The hinge clips the loss at 0 only in the \sigma=\sigma^{+}=1 case, which occurs with probability 1/4 over the Rademacher variables.)
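
This expectation can be verified by enumerating the four sign patterns, as in the following short check (the value of \epsilon is an arbitrary choice):

# Enumerate the four Rademacher sign patterns for the no-collision hinge term.
eps = 0.3
vals = [max(0.0, 1 + (1 + s * eps) * (-1 - sp * eps)) for s in (-1, 1) for sp in (-1, 1)]
print(sum(vals) / 4, eps ** 2 / 4 + eps / 2)   # the two numbers agree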

Case 2, collisions:

We provide crude upper and lower bounds in the case of collisions. Assume that x_{i}^{-} is a collision, so that this term in the hinge loss is:

\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}\max\{0,1+(1+\sigma\epsilon)(1+\sigma_{i}^{-}\epsilon-1-\sigma^{+}\epsilon)\}=\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}\max\{0,1+(1+\sigma\epsilon)((\sigma_{i}^{-}-\sigma^{+})\epsilon)\}
\geq\max\{0,\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}1+(1+\sigma\epsilon)((\sigma_{i}^{-}-\sigma^{+})\epsilon)\}=1.

Here, the inequality is Jensen's inequality, and we then use independence of the Rademacher random variables. This is clearly a lower bound on the expected NCE loss when there is a collision. We can also upper bound the loss when there is a collision by setting \sigma_{i}^{-}=1:

\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}\max\{0,1+(1+\sigma\epsilon)((\sigma_{i}^{-}-\sigma^{+})\epsilon)\}\leq\mathbb{E}_{\sigma,\sigma^{+}}\max\{0,1+(1+\sigma\epsilon)(1-\sigma^{+})\epsilon\}=1+\epsilon.

In addition, since \epsilon\leq 1, for any values of \sigma,\sigma^{+}, this term dominates the loss incurred from any other negative. Thus this is an upper bound on the NCE loss when there is a collision, and we have the bounds

(1-\tau_{k})(\epsilon^{2}/4+\epsilon/2)+\tau_{k}\leq\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})\leq(1-\tau_{k})(\epsilon^{2}/4+\epsilon/2)+\tau_{k}(1+\epsilon).

Actually, we can obtain a much better lower bound. Note that the smallest realization of a collision loss is 1+(1+\epsilon)(-2\epsilon)=1-2\epsilon-2\epsilon^{2}. This quadratic is positive for \epsilon\in[0,(\sqrt{3}-1)/2], which we assume going forward. With this choice of \epsilon, there will be no clipping and we can ignore taking the max with 0. Next, observe that the loss we incur for a non-colliding example is \max\{0,1+(1+\sigma\epsilon)(-1-\sigma^{+}\epsilon)\}. We can see that for any realizations of \sigma,\sigma^{+},\sigma_{i}^{-},

1+(1+\sigma\epsilon)(-1-\sigma^{+}\epsilon)\leq 1+(1+\sigma\epsilon)(\sigma_{i}^{-}-\sigma^{+})\epsilon,

which means that we can ignore all of the non-colliding examples when evaluating the loss. Thus, the loss is

\mathbb{E}\max\{0,1+(1+\sigma\epsilon)(\max_{i\in I}\sigma_{i}^{-}-\sigma^{+})\epsilon\}=1+\epsilon\cdot\mathbb{E}\max_{i\in I}\sigma_{i}^{-},

where I is the set of colliding examples. This is clearly upper bounded by \tau_{k}(1+\epsilon) after taking into account the probability of a collision. For a lower bound, we have

\mathbb{E}\left[{\bf 1}\{|I|>0\}(1+\epsilon\max_{i\in I}\sigma_{i}^{-})\right]=\tau_{k}+\epsilon\,\mathbb{E}\left[{\bf 1}\{|I|>0\}\max_{i\in I}\sigma_{i}^{-}\right]
=\tau_{k}+\epsilon\left(\sum_{t=1}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}\left((1-(1/2)^{t})\cdot 1+(1/2)^{t}\cdot(-1)\right)\right)
=\tau_{k}+\epsilon\left(\sum_{t=1}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}\left(1-2(1/2)^{t}\right)\right)
=\tau_{k}+\epsilon\left(\tau_{k}-2\sum_{t=1}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}(1/2)^{t}\right)
\geq\tau_{k}+\epsilon\left(\tau_{k}-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}-\frac{1}{2}\sum_{t=2}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}\right)
=\tau_{k}+\epsilon/2\cdot\left(\tau_{k}-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}\right).

Here the only inequality uses that in the terms with t\geq 2 we can upper bound (1/2)^{t}\leq 1/4. Combining this with the term from the “no-collision” case, we have

\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})\geq(1-\tau_{k})(\epsilon^{2}/4+\epsilon/2)+\tau_{k}+\epsilon/2\left(\tau_{k}-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}\right)
=\tau_{k}+(1-\tau_{k})\epsilon^{2}/4+\epsilon/2\left(1-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}\right).

Finally, notice that the upper bound for the collision case is tight if there is a collision with \sigma_{i}^{-}=1. In the asymptotic regime where k\to\infty, the probability of such a collision goes to 1, and simultaneously \tau_{k}\to 1. Thus in this limit the upper bound is tight, and we obtain \lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})=1+\epsilon.
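
As a sanity check on the f_{2} analysis (illustrative only; N, k, \epsilon, and the trial count are arbitrary choices), the sketch below simulates the NCE hinge loss of f_{2} directly and confirms that the empirical value falls between the crude lower and upper bounds derived above.

# Monte Carlo check that the simulated NCE hinge loss of f_2 lies between the two bounds.
import numpy as np

N, k, eps, trials = 32, 10, 0.25, 200_000              # eps < (sqrt(3) - 1) / 2
rng = np.random.default_rng(1)
tau_k = 1 - (1 - 1.0 / N) ** k

sigma = rng.choice([-1.0, 1.0], size=trials)            # sign for x (anchor, class 0)
sigma_plus = rng.choice([-1.0, 1.0], size=trials)       # sign for x^+
neg_class = rng.integers(0, N, size=(trials, k))        # classes of the k negatives
neg_sigma = rng.choice([-1.0, 1.0], size=(trials, k))   # signs for the negatives

anchor = 1 + sigma * eps                                # f_2(x)^T e_0
pos = 1 + sigma_plus * eps                              # f_2(x^+)^T e_0
neg = np.where(neg_class == 0, 1 + neg_sigma * eps, 0.0)   # f_2(x_i^-)^T e_0 (0 unless collision)
scores = anchor[:, None] * (neg - pos[:, None])            # f_2(x)^T (f_2(x_i^-) - f_2(x^+))
loss = np.maximum(0.0, 1 + scores.max(axis=1)).mean()

base = (1 - tau_k) * (eps ** 2 / 4 + eps / 2)
print(base + tau_k, loss, base + tau_k * (1 + eps))     # lower bound <= empirical loss <= upper bound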

Dependence on k/N.

Next, we claim that (bounds on) the first crossing point between the NCE losses for f_{1} and f_{2} depend on k only through the ratio k/N. To see why, recall the elementary inequalities

1-x\leq(1-x/n)^{n}\leq\exp(-x)\leq 1-x+x^{2}.

This means that a sufficient condition for f_{1} to be selected by NCE is

\frac{2k}{N}\leq(1-k/N)(\epsilon^{2}/4+\epsilon/2)+(k/N-k^{2}/N^{2}),

while a sufficient condition for f_{2} to be selected is

2k/N-4k^{2}/N^{2}\geq(1-k/N+k^{2}/N^{2})(\epsilon^{2}/4+\epsilon/2)+k/N(1+\epsilon).

These are both quadratic inequalities in k/N. Thus, if they admit solutions, the corresponding values of k scale linearly with N.
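
This can be made concrete numerically; the sketch below (\epsilon=0.1 chosen so that both sufficient conditions admit real solutions) solves the two quadratics in r=k/N. The resulting thresholds depend only on r, so the corresponding values of k scale linearly with N.

# The two sufficient conditions as quadratics in r = k/N (eps chosen for illustration).
import numpy as np

eps = 0.1
a = eps ** 2 / 4 + eps / 2                               # the no-collision term eps^2/4 + eps/2
# f_1 selected when 2r <= (1 - r)a + (r - r^2), i.e. r^2 + (1 + a)r - a <= 0
r_f1 = np.roots([1.0, 1.0 + a, -a]).real.max()
# f_2 selected when 2r - 4r^2 >= (1 - r + r^2)a + r(1 + eps),
# i.e. (4 + a)r^2 - (1 + a - eps)r + a <= 0
r_f2 = np.sort(np.roots([4.0 + a, -(1.0 + a - eps), a]).real)
print(r_f1)    # f_1 is preferred whenever k/N is below this threshold
print(r_f2)    # f_2 is preferred when k/N lies in this interval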

Appendix D Experiment Details

D.1 Wiki-3029 NLP Experiments

We present additional details for the Wiki-3029 results below.

Dataset details.

We use the dataset provided by Saunshi et al. [2019] (https://nlp.cs.princeton.edu/CURL/raw/). This dataset was released by the authors, and we also explicitly obtained their permission to use it for this study. We use the scripts provided to us by the authors to generate the Wiki-3029 dataset.

Training details.

We perform NCE training for a fixed number of epochs. In each epoch, we take sentences from each article and pair them randomly to form a set of positively aligned examples (x,x^{+}). We mix pairs across classes, divide this dataset into batches of a given size B, and fit the representation using the Adam optimizer on the NCE loss. Formally, for a given batch \mathcal{B}=\{(x_{i},x_{i}^{+})\}_{i=1}^{B} we optimize the loss:

\mathcal{L}_{\mathrm{nce}}(\theta)=-\frac{1}{2B}\sum_{i=1}^{B}\ln\frac{w(x_{i},x^{+}_{i})}{\sum_{j\neq i}\left\{w(x_{i},x^{+}_{j})+w(x_{i},x_{j})\right\}}-\frac{1}{2B}\sum_{i=1}^{B}\ln\frac{w(x_{i},x^{+}_{i})}{\sum_{j\neq i}\left\{w(x^{+}_{i},x^{+}_{j})+w(x^{+}_{i},x_{j})\right\}},

where w(x,x^{\prime})=\exp(f_{\theta}(x)^{\top}f_{\theta}(x^{\prime})/\tau), \theta denotes the parameters of the bag-of-words model f, and \tau is a temperature hyperparameter. The parameters \theta consist of a word embedding for every token in the vocabulary, including the UNK token.
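
A minimal PyTorch sketch of this batch objective is given below; it is not the released training code, and the function and variable names (nce_batch_loss, z, z_pos) are ours. It assumes the encoder has already been applied, so the inputs are the B x d matrices of embeddings f_{\theta}(x_{i}) and f_{\theta}(x_{i}^{+}).

# Sketch of the batch NCE objective above (not the authors' released code).
import torch

def nce_batch_loss(z, z_pos, tau=1.0):
    B = z.shape[0]
    w_zp = torch.exp(z @ z_pos.T / tau)        # w(x_i, x_j^+)
    w_zz = torch.exp(z @ z.T / tau)            # w(x_i, x_j)
    w_pp = torch.exp(z_pos @ z_pos.T / tau)    # w(x_i^+, x_j^+)
    w_pz = torch.exp(z_pos @ z.T / tau)        # w(x_i^+, x_j)
    off = ~torch.eye(B, dtype=torch.bool)      # mask selecting j != i
    num = torch.diag(w_zp)                     # w(x_i, x_i^+)
    den1 = (w_zp * off).sum(dim=1) + (w_zz * off).sum(dim=1)
    den2 = (w_pp * off).sum(dim=1) + (w_pz * off).sum(dim=1)
    return -0.5 * (torch.log(num / den1).mean() + torch.log(num / den2).mean())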

At the end of NCE training, we compute a mean classifier vector v_{c} for an article c by averaging f_{\theta}(x) over every sentence x from article c in the training dataset. For a given sentence x^{\prime} in the test dataset, we assign it the class \arg\max_{c}f_{\theta}(x^{\prime})^{\top}v_{c}.
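
A sketch of this mean-classifier evaluation (hypothetical helper; numpy arrays of embeddings and labels assumed) follows:

# Mean classifier: average training embeddings per article, then assign by inner product.
import numpy as np

def mean_classifier_predict(train_emb, train_labels, test_emb, num_classes):
    d = train_emb.shape[1]
    V = np.zeros((num_classes, d))
    for c in range(num_classes):
        V[c] = train_emb[train_labels == c].mean(axis=0)   # v_c: mean embedding of article c
    return (test_emb @ V.T).argmax(axis=1)                 # argmax_c f_theta(x')^T v_c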

We further train a linear classifier on top of f_{\theta}(x) for a fixed number of epochs. We optimize the logistic loss using the Adam optimizer. Formally, we use the entire NCE training dataset and do mini-batch learning. Given a batch \{(x_{i},c_{i})\}_{i=1}^{B} of sentences x_{i} and their article labels c_{i}, we optimize the loss:

\mathcal{L}_{\mathrm{downstream}}(W)=-\frac{1}{B}\sum_{i=1}^{B}\ln\frac{\exp\left((W^{\top}f_{\theta}(x_{i}))_{c_{i}}\right)}{\sum_{c^{\prime}}\exp\left((W^{\top}f_{\theta}(x_{i}))_{c^{\prime}}\right)},

where W denotes the parameters of the linear classifier. We keep the parameters \theta fixed while training the linear classifier. We save the linear classifier at the end of each epoch and compute the aggregate training loss recorded during that epoch. Finally, we report the performance of the model with the smallest aggregate training loss.

We initialize W and \theta randomly using the default PyTorch initialization.
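
A minimal sketch of this downstream step (not the released code; encoder, loader, and the default hyperparameter values are placeholders) trains a linear head with the logistic loss on top of the frozen encoder f_{\theta}.

# Downstream linear classifier trained on top of a frozen encoder (illustrative sketch).
import torch
import torch.nn.functional as F

def train_linear_head(encoder, loader, d, num_classes, epochs=50, lr=0.01):
    head = torch.nn.Linear(d, num_classes, bias=False)   # plays the role of W
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, c in loader:                  # sentences and their article labels
            with torch.no_grad():            # theta is kept fixed
                z = encoder(x)
            loss = F.cross_entropy(head(z), c)   # the softmax / logistic loss above
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head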

Hyperparameters.

We used the following hyperparameters for the Wiki-3029 experiments.

Hyperparameter Value
Word embedding dimension 768
Optimization method Adam
Parameter initialization PyTorch 1.4 default
Temperature 1
Gradient clip norm 2.5
NCE epochs 50
Downstream training epochs 50
Learning rate 0.01
Model architecture Bag of words
Data augmentation None
Table 1: Hyperparameters for Wiki-3029

Detailed Figures with Error Bars.

We show a detailed version of the left plot of Figure 3 in Figure 5. This shows the mean classifier accuracy for varying N and k at a fixed batch size of B=1000. The gap between the best and worst performance in each row is quite noticeable (≈2-3%), and this is much larger than the standard deviation (\leq 0.25). We also plot the performance of the trained linear classifier in Figure 6. Note that the general trend is similar to Figure 5, and an intermediate value of k seems to be best for every value of N. However, we see that in this case the value k=50 is optimal across all values of N.

Figure 5: Mean classifier accuracy on the Wiki-3029 dataset for different values of the number of classes (N) and the number of negative examples (k), at a fixed batch size of 1000. Values show average performance across 5 trials with different seeds; all standard deviations are at most 0.25 and typically much smaller.
Figure 6: Trained linear classifier accuracy on the Wiki-3029 dataset for different values of the number of classes (N) and the number of negative examples (k), at a fixed batch size of 1000. Values show average performance across 5 trials with different seeds; all standard deviations are at most 0.28 and typically much smaller.

We show a detailed version of the center plot of Figure 3 in Figure 7. This shows the mean and trained classifier accuracy for different values of B, k, and N. We observe that the gap between the best and worst performance is noticeable (≈1% for the mean classifier and ≈2-3% for the trained classifier). Further, the standard deviation is sufficiently low to enable reading trends from the mean performance. For the trained classifier, we observe that the optimal performance is achieved with the largest value of B and an intermediate value of k, similar to the mean classifier.

(a) Mean classifier
(b) Trained classifier
Figure 7: Mean and trained classifier accuracy on the Wiki-3029 dataset for different values of batch size B and number of negative examples k, for two values of the number of classes N. Left: mean classifier performance (all standard deviations are less than 0.16). Right: trained linear classifier performance (all standard deviations are less than 0.26).

Compute Infrastructure.

We run our experiments on a cluster containing a mixture of P40, P100, and V100 GPUs. Each experiment runs on a single GPU in a Docker container and takes less than 5 hours to complete on a V100 GPU.

D.2 Vision Experiment Details

Vision experiments used the same infrastructure as NLP experiments, but took roughly two days to fully run. We train models on the NCE task for 400 epochs. We used the following hyperparameters to run the included CIFAR-10 experiments:

Hyperparameter Value
Latent dimension 128
Gradient clip norm None
Optimization method Adam
Parameter initialization PyTorch 1.4 default
Learning rate 10^{-4}
Model architecture ResNet-18
Table 2: Hyperparameters for vision

Additional Experiments.

Here we provide a plot similar to Figure 4, but using a mean classifier rather than a trained linear classifier for the downstream task. The trend in Figure 8 is similar, with results varying based on both the dataset and whether or not augmentation is present. Interestingly, mean classifier performance on SVHN with augmented data is significantly worse than for the trained classifier version in Figure 4. This is possibly due to the fact that the augmentation was not designed for digit data and includes, for example, horizontal flips.

Figure 8: A companion to Figure 4, showing mean classifier performance, as used in our theory, rather than trained classifier performance. Here we show classifier accuracy as a function of both batch size and k for various vision datasets and NCE parameters. Left: SVHN data, using a true positive sample from the same class, as used in our analysis. Center Left: SVHN data, but using data augmentation; here the positive sample is an augmented version of the reference sample, and negatives are augmented as well. Center Right: NCE on CIFAR-10 data, using real positive examples. Right: CIFAR-10 results with data augmentation.