
Investigating the Role of Negatives in
Contrastive Representation Learning

Jordan T. Ash   Surbhi Goel   Akshay Krishnamurthy   Dipendra Misra
(Authors listed in alphabetical order. All authors contributed equally.)

Microsoft Research NYC
{ash.jordan, goel.surbhi, akshaykr, dimisra}@microsoft.com
Abstract

Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to distinguish a similar (positive) example from a collection of random (negative) examples. The success of modern contrastive learning pipelines relies on many parameters such as the choice of data augmentation, the number of negative examples, and the batch size; however, there is limited understanding as to how these parameters interact and affect downstream performance. We focus on disambiguating the role of one of these parameters: the number of negative examples. Theoretically, we show the existence of a collision-coverage trade-off suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data. Empirically, we scrutinize the role of the number of negatives in both NLP and vision tasks. In the NLP task, we find that the results broadly agree with our theory, while our vision experiments are murkier with performance sometimes even being insensitive to the number of negatives. We discuss plausible explanations for this behavior and suggest future directions to better align theory and practice.

1 Introduction

Unsupervised representation learning refers to a suite of algorithmic approaches for discovering meaningful transformations of complex data in the absence of an explicit supervision signal. These representations are trained to preserve and expose semantic information in the data, so that once learned, they can be used effectively in downstream supervised tasks of interest. In recent years, these methods have seen tremendous success when combined with deep learning models, obtaining state of the art results on a variety of important language and vision tasks (Smith and Eisner, 2005; Chopra et al., 2005; Mikolov et al., 2013; Oord et al., 2018; Bachman et al., 2019; Clark et al., 2020; Chen et al., 2020).

One popular approach for representation learning is contrastive learning, or noise contrastive estimation (NCE) (Gutmann and Hyvärinen, 2010). In NCE, a notion of semantic similarity is used to create a self-supervision signal that encourages similar points to be represented (or embedded) in a similar manner. For example, a typical approach is to train an embedding so that similar points have large inner product in the embedding space and dissimilar points have small inner product. Intuitively, representations trained in this manner capture the semantic similarity of the raw data, so we might expect them to be useful in downstream supervised learning tasks. Indeed, NCE is employed in the Electra language model (Clark et al., 2020) and has also been used effectively in computer vision tasks at ImageNet scale (Chen et al., 2020).

While the implementation details vary, a mathematical abstraction for representation learning with NCE is: (1) a single NCE example consists of $k+2$ raw inputs, $(x,x^{+},x_{1}^{-},\ldots,x_{k}^{-})$, where $(x,x^{+})$ are semantically similar and the $x_{i}^{-}$ are sampled from some base distribution; (2) the representation $f$ is trained to encourage $f(x)^{\top}f(x^{+})\gg f(x)^{\top}f(x_{i}^{-})$ for each $i$. This second step can be done with standard classification objectives, such as the cross-entropy loss, where the model is viewed as a classifier over $k+1$ labels. For example, a candidate objective is

\mathop{\mathbb{E}}_{x,x^{+},x_{1:k}^{-}}\left[-\log\left(\frac{e^{f(x)^{\top}f(x^{+})}}{e^{f(x)^{\top}f(x^{+})}+\sum_{i=1}^{k}e^{f(x)^{\top}f(x_{i}^{-})}}\right)\right].

This objective encourages $f(x)^{\top}f(x^{+})$ to be large, so that the representation efficiently captures the semantic similarity inherent in the NCE sampling scheme.
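To make this concrete, the following minimal sketch (our own illustrative PyTorch code, not tied to any particular published pipeline) computes the $(k+1)$-way cross-entropy form of this objective from precomputed anchor, positive, and negative embeddings.

    import torch
    import torch.nn.functional as F

    def nce_cross_entropy(f_x, f_pos, f_neg):
        """Cross-entropy NCE objective.

        f_x:   (B, d)    anchor embeddings f(x)
        f_pos: (B, d)    positive embeddings f(x+)
        f_neg: (B, k, d) negative embeddings f(x_i^-)
        """
        pos_logit = (f_x * f_pos).sum(dim=-1, keepdim=True)   # (B, 1): f(x)^T f(x+)
        neg_logits = torch.einsum("bd,bkd->bk", f_x, f_neg)   # (B, k): f(x)^T f(x_i^-)
        logits = torch.cat([pos_logit, neg_logits], dim=1)    # (B, 1+k)
        labels = torch.zeros(f_x.shape[0], dtype=torch.long)  # positive sits in column 0
        return F.cross_entropy(logits, labels)

Minimizing this loss is exactly the $(k+1)$-way classification view described above, with the positive playing the role of the correct label.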

Modern contrastive pipelines have several parameters that together contribute to the success of the resulting representations. These parameters include modeling choices, such as the number of negatives $k$, the semantic sampling procedure for $(x,x^{+})$, and the neural network architecture, along with optimization decisions, such as the batch size. While there have been large scale empirical studies on contrastive learning (Chen et al., 2020), it remains unclear how these different choices interact with each other to affect the downstream performance of the learned representations. Indeed, a major challenge is that the exorbitant computational cost of training contrastive models precludes comprehensive ablations to disentangle the effects of each design decision. Instead, a hybrid approach that leverages theoretical insights to guide systematic experimentation may be more effective at shedding light on the role of these parameters.

As an example of entangled design choices, consider the number of negative samples $k$ and the batch size $B$ in the NCE objective. In practice, it is standard to set $k$ as a function of the batch size $B$, as $k=2B-2$, and then to set the batch size as large as the hardware can support. When implemented on a GPU, this allows for many more gradient evaluations with minimal computational overhead. However, in this setup, the effect of increasing $k$ on downstream performance is confounded by improvements from increasing the batch size.

In this paper, we focus on the number of negatives $k$, and investigate its role in contrastive learning. The key question we ask is how does the number of negative samples affect the quality of representations learned by NCE? As mentioned, existing empirical evidence confounds $k$ with the batch size $B$, so it does not provide much of an answer to this question. Moreover, conceptual reasoning paints a fairly nuanced picture. On one hand, a large value of $k$ implies that the negatives are more likely to cover the semantic space (coverage), which encourages the representation to disambiguate all semantically distinct concepts. However, there is a risk that some of the negative samples will be semantically very similar (collisions) to the positive example $x^{+}$, which compromises the supervision signal and may lead to worse representation quality.

Our contributions.

For our theoretical results, we adopt the framework of Saunshi et al. (2019), where the semantic similarity in the contrastive distribution is directly related to the labels in a downstream classification task. We refine the results of Saunshi et al. (2019) with a novel and tighter error transfer theorem that bounds the classification error by a function of the NCE error. We then study how this function depends on the number of concepts/classes in the data and the number of negatives in the NCE task. Our analysis captures the trade-off between coverage over the classes and collisions, and it reveals an expression for the ideal number of negatives $k$, which roughly scales as the number of concepts in the data. We emphasize that our sharper theoretical results are instrumental even for obtaining this qualitative trend; indeed, the analysis in Saunshi et al. (2019) instead suggests that increasing $k$ leads to decreasing performance.

Empirically, we study the effect of the number of negative samples on the performance of the final classifier on a downstream classification task for both vision and natural language datasets. For our language task, we observe that performance increases with $k$ up to a point, but it deteriorates as we increase $k$ further. The experiments also confirm that the optimal choice of $k$ depends on the number of concepts in the dataset. We do not observe the same trends in our vision experiments, and we describe how conflating factors, most notably data augmentation, cause disagreement between the observed empirical behavior and the predictions from the theoretical framework typically used to study NCE (Saunshi et al., 2019; Chuang et al., 2020).

Related work.

Self-supervised learning has elicited breakthrough successes in representation learning, achieving state of the art performance across a wide array of important machine learning tasks, including computer vision (He et al., 2020; Huo et al., 2020; Khosla et al., 2020; Wang et al., 2020; Bachman et al., 2019), natural language processing (Fang and Xie, 2020; Fu et al., 2021; Le-Khac et al., 2020; Giorgi et al., 2020), and deep reinforcement learning. In vision, Chen et al. (2020) have shown that simple NCE algorithms are able to obtain extremely competitive performance on the challenging ImageNet dataset, something previously thought to require extensive supervised training. In NLP, the NCE approach has garnered much attention since the pioneering work of Mikolov et al. (2013) demonstrated its efficacy in learning word-level representations. In deep RL, NCE-style losses have recently been employed to learn world models in complex environments (Nachum et al., 2018; Laskin et al., 2020; Zhu et al., 2020; Kipf et al., 2019; Shelhamer et al., 2017).

Theoretical work on representation learning is quite nascent, with most works focused on mathematical explanations for the efficacy of these approaches (Saunshi et al., 2019; Lee et al., 2020). One explanation is via conditional independence and redundancy (Tosh et al., 2020, 2021). Another theory, proposed by Wang and Isola (2020), is that contrastive representations simultaneously optimize for alignment of similar points and uniformity in the embedding space, in the asymptotic regime where the number of negatives tends to infinity. This asymptotic regime was also studied by Foster et al. (2020), who analyze a contrastive estimator for an information-gain quantity. However, Foster et al. (2020) do not study representation learning and Wang and Isola (2020) do not theoretically analyze downstream performance as we do. Indeed, our results show that there can be a price for too many negatives due to collisions, and perhaps call into question whether this asymptotic regime should be used to explain the downstream performance of NCE representations; similar concerns have been raised regarding viewing representation learning as a mutual information maximization problem (Tschannen et al., 2019). This collision phenomenon has been observed by Chuang et al. (2020), who propose a debiased version of the contrastive loss that attempts to correct for negative samples coming from the positive class. They experimentally observe significant gains from removing the effect of collision. In contrast, here we study the original contrastive objective that is primarily used in practice and characterize how an appropriate choice of $k$ can balance the trade-off between collisions and coverage.

Our theoretical results closely follow the setup of Saunshi et al. (2019). We adopt essentially the same framework in the present paper, but unlike Saunshi et al. (2019) we do not use the number of negatives $k$ to define the downstream classification tasks. Our setup yields a single $k$-independent downstream task, which allows us to study the influence of this parameter. In addition, many steps of our analysis are sharper than theirs, which leads to our new qualitative findings.

2 Setup and Algorithm

In this section, we present the mathematical framework that we use to study noise contrastive estimation, as well as the learning algorithm and the evaluation metric that we consider.

Notation.

Let the set of input features be denoted by $\mathcal{X}$ and the set of latent classes by $\mathcal{C}$, where $|\mathcal{C}|=N$. With each latent class $c\in\mathcal{C}$ we associate a distribution $\mathcal{D}_{c}$ over $\mathcal{X}$. We can view $\mathcal{D}_{c}$ as the distribution over data conditioned on belonging to latent class $c$. We also assume a distribution $\rho$ on $\mathcal{C}$. We learn with a function class $\mathcal{F}$, the set of representation functions $f:\mathcal{X}\rightarrow\mathbb{R}^{d}$ that map the input to a $d$-dimensional vector. We use $\lesssim$ to denote less than or equal to up to constants.

NCE data.

We assume access to similar data points in the form of pairs $(x,x^{+})$ and $k$ negative data points $x^{-}_{1},\ldots,x^{-}_{k}$. To formalize this, an unlabeled sample from $\mathcal{D}_{\textup{NCE}}$ is generated as follows:

  • Choose a class $c$ according to $\rho$ and draw i.i.d. samples $x,x^{+}$ from $\mathcal{D}_{c}$.

  • Choose classes $c^{-}_{1},\ldots,c^{-}_{k}$ i.i.d. from $\rho$ and draw $x^{-}_{1}\sim\mathcal{D}_{c^{-}_{1}},\ldots,x^{-}_{k}\sim\mathcal{D}_{c^{-}_{k}}$.

  • Output the tuple $(x,x^{+},x^{-}_{1:k})$.

This data generation is implicit: we only observe the data points and not the associated class labels. Observe that $c^{-}_{i}$ may be the same as $c$ for one or more $i\in\{1,\ldots,k\}$, causing collisions.

Empirically, similar pairs are typically generated from co-occurring data; for example, two parts of a document can be viewed as a similar pair (Tosh et al., 2020). Note that our data generation process implies that $x$ and $x^{+}$ are independent conditioned on the latent class. This is a simplification of the real-world setting, where the independence structure might not hold. However, it enables a transparent theoretical analysis that can provide insights applicable to some real-world setups. A sketch of this sampling process is given below.
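In the sketch below (our own illustrative code), rho and sample_from_class are hypothetical stand-ins for the class distribution $\rho$ and the class-conditional distributions $\mathcal{D}_{c}$, which the learner never observes directly.

    import numpy as np

    def sample_nce_tuple(rho, sample_from_class, k, rng=np.random.default_rng()):
        """Draw one unlabeled NCE tuple (x, x+, x_1:k^-) from D_NCE.

        rho:               length-N array of latent-class probabilities (sums to 1)
        sample_from_class: hypothetical function c -> one draw from D_c
        k:                 number of negative samples
        """
        N = len(rho)
        c = rng.choice(N, p=rho)                       # latent class of the positive pair
        x, x_pos = sample_from_class(c), sample_from_class(c)
        neg_classes = rng.choice(N, size=k, p=rho)     # negatives may collide with c
        x_negs = [sample_from_class(ci) for ci in neg_classes]
        return x, x_pos, x_negs                        # the class labels are discarded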

NCE algorithm.

The objective of an NCE algorithm is to use the available unsupervised data to learn a representation that represents (or embeds) similar points in a similar manner and distinguishes them from the negative samples. The following NCE loss captures this objective,

Definition 1 (NCE Loss).

The NCE loss for a representation $f$ on the distribution $\mathcal{D}_{\textup{NCE}}$ is defined as

\mathcal{L}^{(k)}_{\textup{NCE}}(f):=\underset{\mathcal{D}_{\textup{NCE}}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\right\}_{i=1}^{k}\right)\right].

The empirical NCE loss with a set $S$ of $M$ samples drawn from $\mathcal{D}_{\textup{NCE}}$ is

\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}(f;S):=\frac{1}{M}\sum_{(x,x^{+},x_{1:k}^{-})\in S}\ell\left(\left\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\right\}_{i=1}^{k}\right).

We restrict $\ell$ to be one of two standard loss functions: the hinge loss $\ell(v)=\max\{0,1+\max_{i}\{-v_{i}\}\}$ and the logistic loss $\ell(v)=\log\left(1+\sum_{i}\exp(-v_{i})\right)$, for a vector $v\in\mathbb{R}^{k}$. Both loss functions and their variants are deployed routinely in practical NCE implementations (Schroff et al., 2015; Chen et al., 2020).
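For reference, a batched sketch of these two losses applied to the margins $v_{i}=f(x)^{\top}(f(x^{+})-f(x^{-}_{i}))$ could look as follows (our own PyTorch-style code, not the exact implementation used in the experiments).

    import torch

    def nce_margins(f_x, f_pos, f_neg):
        """v_i = f(x)^T (f(x+) - f(x_i^-)); shapes (B,d), (B,d), (B,k,d) -> (B,k)."""
        return torch.einsum("bd,bkd->bk", f_x, f_pos.unsqueeze(1) - f_neg)

    def hinge_nce_loss(v):
        """l(v) = max{0, 1 + max_i(-v_i)}, averaged over the batch."""
        return torch.clamp(1.0 + (-v).max(dim=1).values, min=0.0).mean()

    def logistic_nce_loss(v):
        """l(v) = log(1 + sum_i exp(-v_i)), averaged over the batch."""
        return torch.log1p(torch.exp(-v).sum(dim=1)).mean()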

The algorithm we analyze is standard empirical risk minimization (ERM) on $\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}$ over $\mathcal{F}$. For the rest of the paper, we denote our learned representation by $\widehat{f}\in\arg\min_{f\in\mathcal{F}}\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}(f)$.

Downstream supervised learning task.

For evaluating representations, we consider the standard supervised learning task of classifying a data point into one of the classes in $\mathcal{C}$. More formally, a sample from $\mathcal{D}_{\textup{sup}}$ is generated as follows: (1) choose a class $c$ according to $\rho$, (2) draw a sample $x$ from $\mathcal{D}_{c}$, and (3) output the sample $(x,c)$.

Observe that the distribution over classes and the conditional distribution over data points are the same as in the unsupervised data. This is crucial for relating the NCE task to the downstream classification task. We will use the corresponding supervised loss for a function $g:\mathcal{X}\rightarrow\mathbb{R}^{N}$,

\mathcal{L}_{\textup{sup}}(g)=\underset{\mathcal{D}_{\textup{sup}}}{\mathbb{E}}\left[\ell\left(\left\{g(x)_{c}-g(x)_{c^{\prime}}\right\}_{c^{\prime}\neq c}\right)\right].

Note that Saunshi et al. (2019) consider a different downstream task which depends on the number of negative samples in the NCE task. Here we disentangle the number of negatives used in NCE from the number of downstream classes, which results in a more natural downstream task.

To evaluate the performance of the learned representation function $\widehat{f}$, we train a linear classifier $W\in\mathbb{R}^{N\times d}$ on top of the learned embedding and use it for the downstream task. We overload notation and define the supervised loss of an embedding $f$ as

\mathcal{L}_{\textup{sup}}(f)=\inf_{W\in\mathbb{R}^{N\times d}}\mathcal{L}_{\textup{sup}}(Wf).

We define the loss of the mean classifier as $\mathcal{L}_{\textup{sup}}^{\mu}(f)=\mathcal{L}_{\textup{sup}}(W^{\mu}f)$, where $W^{\mu}$ is the matrix whose $c^{\textrm{th}}$ row is the mean $\mu_{c}=\mathbb{E}_{x\sim\mathcal{D}_{c}}[f(x)]$ of the representations of data from class $c$. Note that $\mathcal{L}_{\textup{sup}}(f)\leq\mathcal{L}_{\textup{sup}}^{\mu}(f)$ for all $f$. Our guarantees will hold for the loss of the mean classifier, which directly implies a bound on the overall supervised loss. This analysis may be loose if the mean classifier itself has poor downstream performance; Saunshi et al. (2019) present a simple example where the two losses are very different.
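The mean-classifier evaluation can be made explicit with the short sketch below (our own illustrative code; the encoder f is assumed to return NumPy embedding arrays, and the labels are assumed to be NumPy arrays as well).

    import numpy as np

    def mean_classifier_accuracy(f, X_train, y_train, X_test, y_test, num_classes):
        """Evaluate an encoder f with the mean classifier W^mu."""
        Z_train, Z_test = f(X_train), f(X_test)           # (n, d) embedding arrays
        W_mu = np.zeros((num_classes, Z_train.shape[1]))
        for c in range(num_classes):
            W_mu[c] = Z_train[y_train == c].mean(axis=0)  # mu_c = E_{x ~ D_c}[f(x)]
        preds = (Z_test @ W_mu.T).argmax(axis=1)          # classify by the largest inner product
        return (preds == y_test).mean()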

Figure 1: Value of the pre-multiplier $\alpha(k,N)$ multiplied with the growth rates of $\mathcal{L}^{(k)}_{\textup{NCE}}-\tau_{k}$ (on a log scale) for $N=100$ with varying $k$ and growth rates. The plots represent the scaling of our upper bound on the supervised loss. The blue dots represent the optimal $k$ predicted by our theory.

3 Theoretical Overview

In this section, we present our main theoretical insights and overview our analysis, which provides sharp bounds on the supervised loss in terms of the NCE loss. Our bounds show an interesting trade-off between the number of negatives and the number of latent classes. For ease of presentation, the bounds in the main paper assume that $\rho$ is uniform over $\mathcal{C}$. Proofs and generalizations are deferred to Appendix B.

In the supervised loss, data points are distinguished against all classes other than the class that the point belongs to. However, for the NCE loss, if the number of negatives $k$ is small relative to $N$, then the negatives may not cover all the classes simultaneously. On the other hand, if $k$ is too large, then it is quite likely that some of the negatives will belong to the same class as the “anchor point” $x$, causing collisions that force the learner to distinguish between similar examples. This leads to opposing factors when choosing the optimal number of negative samples. We present our main result that captures this trade-off.

Theorem 1 (Transfer between NCE and supervised Loss).

For any $f\in\mathcal{F}$,

\mathcal{L}_{\textup{sup}}(f)\leq\mathcal{L}^{\mu}_{\textup{sup}}(f)\leq\underbrace{\frac{2}{1-\tau_{k}}\left\lceil\frac{2(N-1)H_{N-1}}{k}\right\rceil}_{\alpha(k,N)}\left(\mathcal{L}^{(k)}_{\textup{NCE}}(f)-\tau_{k}\right),

where $\tau_{k}=\Pr[\exists\,i:c^{-}_{i}=c]=1-\left(1-\frac{1}{N}\right)^{k}$ is the probability of a collision amongst the negatives and $H_{t}=\Theta(\log(t))$ is the $t^{\text{th}}$ harmonic number.

Note that the above bound is similar to the one of Saunshi et al. (2019), except that the left hand side and the pre-multiplier $\alpha(k,N)$ are different. The left hand side here is the supervised loss on all classes rather than the expected loss on a $(k+1)$-way classification task, where the tasks are drawn randomly from $\rho$. Further, our pre-multiplier incorporates the trade-off between $k$ and $N$. A key difference is that the bound in the prior work gets worse with increasing $k$, implying that $k=1$ is the best choice, whereas ours has a more subtle dependence on $k$.

Remark 2 (Choosing $k$).

The coefficient $\alpha(k,N)$ decreases with $k$ up to $k\approx\frac{1}{\log(N)-\log(N-1)}$ and then increases. Note that $\mathcal{L}_{\textup{NCE}}-\tau_{k}$ may itself increase as a function of $k$; however, as long as its growth rate is $o(k)$, we will see a similar trend (see Figure 1).
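The trend can be checked numerically with the short script below (a sketch that simply re-evaluates the quantities appearing in Theorem 1 and Theorem 3 for uniform $\rho$; it is not experiment code); both coefficients are minimized at a value of $k$ on the order of $N$.

    import numpy as np

    def alpha(k, N):
        """Pre-multiplier from Theorem 1 (uniform rho): (2 / (1 - tau_k)) * ceil(2(N-1)H_{N-1} / k)."""
        H = np.sum(1.0 / np.arange(1, N))        # harmonic number H_{N-1}
        tau_k = 1.0 - (1.0 - 1.0 / N) ** k       # collision probability tau_k
        return 2.0 / (1.0 - tau_k) * np.ceil(2 * (N - 1) * H / k)

    N = 100
    ks = np.arange(1, 20 * N)
    a = np.array([alpha(k, N) for k in ks])
    print("alpha(k,N) minimized at k =", ks[a.argmin()])   # roughly 1/(log N - log(N-1)), i.e. about N
    print("alpha(k,N)*sqrt(k/N) minimized at k =", ks[(a * np.sqrt(ks / N)).argmin()])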

We refine our bound by decomposing $\mathcal{L}_{\textup{NCE}}-\tau_{k}$ into two terms that can potentially be small for large $\mathcal{F}$.

Theorem 3 (Refined NCE Loss Bound).

For any $f\in\mathcal{F}$,

\mathcal{L}^{\mu}_{\textup{sup}}(f)\lesssim\alpha(k,N)\left(\mathcal{L}^{(k)}_{\neq}(f)+\sqrt{\frac{k}{N}}s(f)\right)

where $\mathcal{L}^{(k)}_{\neq}(f)$ is the NCE loss obtained by ignoring the coordinates corresponding to the collisions and $s(f)$ is a measure of average intra-class deviation.

We defer the formal definitions of $\mathcal{L}^{(k)}_{\neq}(f)$ and $s(f)$ to Appendix B.2. Note that the trade-off between $N$ and $k$ still holds here, even for the coefficient in front of $s(f)$ (see the $\sqrt{k}$ growth rate in Figure 1): $\alpha(k,N)\sqrt{\frac{k}{N}}$ is minimized roughly at $k\approx\frac{0.5}{\log(N)-\log(N-1)}$.

Lastly, we carry out a careful generalization analysis to guarantee closeness of the empirical NCE loss of $\widehat{f}$ to the true NCE loss. Given a generalization guarantee, using the fact that $\widehat{f}$ is the ERM solution, we obtain the following guarantee:

Theorem 4 (Guarantee on ERM).

For $\widehat{f}\in\arg\min_{f\in\mathcal{F}}\widehat{\mathcal{L}}^{(k)}_{\textup{NCE}}(f)$ and any $f\in\mathcal{F}$,

\mathcal{L}_{\textup{sup}}(\widehat{f})\lesssim\alpha(k,N)\left(\mathcal{L}^{(k)}_{\neq}(f)+\sqrt{\frac{k}{N}}s(f)+\sqrt{k}\,\mathsf{Gen}_{M}\right)

where $\mathsf{Gen}_{M}$ is a decreasing function of $M$ that depends on $\mathcal{F}$ but is independent of $k$.

See Appendix B.3 for a formal definition of $\mathsf{Gen}_{M}$, which is based on the Rademacher complexity of $\mathcal{F}$.

An example exhibiting the coverage-collision trade-off

Our bounds suggest that there are two competing forces influenced by the choice of $k$ in the NCE task. The first is coverage of all the classes, which intuitively improves supervised performance and suggests taking $k$ larger. On the other hand, when $k$ is large, the negatives are more likely to collide with the anchor class $c$, which will degrade performance. However, as the trends regarding the optimal choice of $k$ are based on analyzing an upper bound, there is always the concern that the upper bound is loose. To further substantiate our results, we now provide a simple example which demonstrates the existence of the coverage-collision trade-off and also verifies the sharpness of our transfer theorem.

Figure 2: Left: Schematic of the distribution with the two representation functions ($f_{1}$ maps each collapsed pair of classes to a basis vector $e_{1},e_{2},\ldots$, while $f_{2}$ maps the two points of each class to $(1\pm\epsilon)e_{1},\ldots,(1\pm\epsilon)e_{4}$), where colors correspond to distinct classes. $f_{1}$ collapses pairs of classes together, while $f_{2}$ separates all classes but has some intra-class variance in its embeddings. Center: plot of $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})$ and the bounds on $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})$ as a function of $k$. Right: Monte-Carlo simulations of $\mathcal{L}_{\textup{NCE}}^{(k)}$ for the two representations, demonstrating both cross points. In both plots we take $N=40$, $\epsilon=0.35$.

First we showcase the tightness of Theorem 1 and the importance of coverage. Consider a problem with $N$ classes $\{1,\ldots,N\}$, where $N$ is a multiple of $2$, and let $\rho$ be the uniform distribution. For each class $c$, the distribution $D_{c}$ is the uniform distribution on two examples $x_{1}^{c}$ and $x_{2}^{c}$. The representation $f_{1}$ pairs classes together, so that all points from classes $2i$ and $2i+1$ are mapped to the $i^{\textrm{th}}$ standard basis element $e_{i}$ (see Figure 2).

It is straightforward to verify the following facts for the hinge loss (see the appendix for detailed calculations): (1) since $f_{1}$ collapses classes together, it has $\mathcal{L}_{\textup{sup}}^{\mu}(f_{1})=1$; (2) $\mathcal{L}_{\textup{NCE}}(f_{1})=1-(1-2/N)^{k}$, since for each class $2i$ the NCE loss will be $1$ if and only if either $2i$ or $2i+1$ appears in the negatives (otherwise the loss will be $0$). Thus, for this function $f_{1}$ we have

\frac{\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}}{1-\tau_{k}}=1-\left(1-\frac{1}{N-1}\right)^{k}\leq\frac{k}{N-1},

while $\mathcal{L}_{\textup{sup}}^{\mu}(f_{1})=1$. This verifies that Theorem 1 is tight up to constants and the $\Theta(\log(N-1))$ factor arising from the $(N-1)$-st harmonic number.

In addition, observe that $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})$ increases rapidly with $k$, since when $k$ is large it is very likely that either the anchor class or its pair will appear in the negatives. This demonstrates the importance of coverage: when $k$ is large, $f_{1}$, which has high downstream loss, is unlikely to be selected by the NCE procedure.

To make this precise, consider another function $f_{2}$ such that $f_{2}(x_{1}^{i})=(1+\epsilon)e_{i}$ and $f_{2}(x_{2}^{i})=(1-\epsilon)e_{i}$. This representation correctly separates all of the classes, but there is some noise in the embedding space. By direct calculation, we have $\mathcal{L}_{\textup{sup}}^{\mu}(f_{2})=\epsilon/2$, so that $f_{2}$ should be preferred for downstream performance. We can also show that

0\leq\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})-(1-\tau_{k})\left(\frac{\epsilon^{2}}{4}+\frac{\epsilon}{2}\right)-\tau_{k}\leq\tau_{k}\epsilon.

See Figure 2, where we plot a sharper lower bound that we establish in the appendix. For moderately small $\epsilon$, say $\epsilon=0.1$, these bounds imply that $\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})<\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})$ when $k/N$ is sufficiently small ($k\lessapprox 0.05N$), which means that the bad representation $f_{1}$ will be selected. On the other hand, when $k\gtrapprox 0.05N$, the better representation $f_{2}$ will be selected. This demonstrates how using many negatives can yield representations with better downstream performance.

In the asymptotic regime where $k\to\infty$, we observe a second important phenomenon with this example: if $k/N$ is too large, then the bad representation $f_{1}$ will be favored again. To see this, note that when there are many collisions $f_{2}$ incurs an NCE error of $1+\epsilon$, so $\lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})=1+\epsilon$. On the other hand, it is clear that $\lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})=1$, so $f_{1}$ will be selected. This phenomenon can be seen in Figure 2. In this regime, the collisions in the NCE problem compromise the supervision signal and lead to the selection of a suboptimal representation.
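The Monte-Carlo curves in the right panel of Figure 2 can be reproduced, up to sampling noise, with a simulation along the following lines (a sketch under the hinge loss and the construction above; the list of $k$ values and the trial count are arbitrary).

    import numpy as np

    def hinge(v):                                  # l(v) = max(0, 1 + max_i(-v_i))
        return max(0.0, 1.0 + np.max(-v))

    def mc_nce_loss(f_dot, N, k, trials, rng):
        """Monte-Carlo estimate of L_NCE^(k) for a representation given by its inner-product function."""
        total = 0.0
        for _ in range(trials):
            c = rng.integers(N)
            x, x_pos = (c, rng.integers(2)), (c, rng.integers(2))    # (class, which of the two examples)
            negs = [(rng.integers(N), rng.integers(2)) for _ in range(k)]
            total += hinge(np.array([f_dot(x, x_pos) - f_dot(x, xn) for xn in negs]))
        return total / trials

    N, eps, rng = 40, 0.35, np.random.default_rng(0)
    f1 = lambda a, b: float(a[0] // 2 == b[0] // 2)   # collapses paired classes onto a shared basis vector
    f2 = lambda a, b: (1 + eps * (1 - 2 * a[1])) * (1 + eps * (1 - 2 * b[1])) if a[0] == b[0] else 0.0
    for k in (1, 5, 20, 100, 500):
        print(k, mc_nce_loss(f1, N, k, 2000, rng), mc_nce_loss(f2, N, k, 2000, rng))

As $k$ ranges from small to large, the estimated losses of $f_{1}$ and $f_{2}$ should exhibit both cross points discussed above.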

4 Experiments

We test our theoretical findings on two domains, an NLP task and a vision task. We first describe our general learning setting and then describe these tasks and results.

Experimental setting.

We train the encoder model $f$ by performing NCE with the logistic loss. Following Saunshi et al. (2019), we train on the NCE objective for a fixed number of epochs and do not use a held-out set for model selection. We train a linear classifier on the downstream task for a fixed number of epochs and save the model after each epoch. We do not fine-tune $f$ during downstream training.

Computing $k$ negative examples for a batch of size $B$ results in $\mathcal{O}(Bk)$ forward passes through the model, which is prohibitively expensive for large values of $B$ and $k$. In order to reduce computational overhead, we follow the data reuse trick used in the literature (Chen et al., 2020; Chuang et al., 2020). Given $B$ pairs of positively aligned, semantically similar points $\{(x_{i},x^{+}_{i})\}_{i=1}^{B}$, we compute $\{f(x_{i}),f(x^{+}_{i})\}_{i=1}^{B}$ using $2B$ forward passes. For each $i\in[B]$, we treat the set $\mathcal{C}_{i}=\{x_{j},x^{+}_{j}\mid j\in[B],j\neq i\}$ as the candidate set of negative samples for the positive pair $(x_{i},x^{+}_{i})$. Previous work uses the entire candidate set $\mathcal{C}_{i}$, which gives $k=|\mathcal{C}_{i}|=2(B-1)$. While this allows most negative samples to be used, it results in a confounding effect due to the entanglement of $B$ and $k$. Instead, for a fixed value of $k\leq 2(B-1)$, we sample $k$ negative examples without replacement from $\mathcal{C}_{i}$, independently for each $i$. This sampling protocol marks a notable departure from prior work: it decouples the batch size from the number of negative examples, so that their influence on performance can be studied separately.
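A sketch of this sampling protocol is given below (our own PyTorch-style pseudocode, not the exact training code); for each anchor $i$ it draws $k$ of the $2(B-1)$ candidate embeddings without replacement, so $k$ can be varied independently of $B$.

    import torch

    def batch_nce_logits(f, x, x_pos, k):
        """Anchor/positive/negative inner products with k <= 2(B-1) resampled negatives per anchor.

        x, x_pos: (B, ...) batches of positively aligned pairs.
        Returns logits of shape (B, 1+k), with the positive in column 0.
        """
        z, z_pos = f(x), f(x_pos)                  # 2B forward passes, each output of shape (B, d)
        B = z.shape[0]
        cand = torch.cat([z, z_pos], dim=0)        # 2B candidate embeddings
        rows = []
        for i in range(B):
            idx = [j for j in range(2 * B) if j % B != i]   # candidate set C_i (excludes the i-th pair)
            sel = torch.randperm(len(idx))[:k]              # k negatives sampled without replacement
            negs = cand[torch.tensor(idx)[sel]]             # (k, d)
            pos = (z[i] * z_pos[i]).sum().view(1)
            rows.append(torch.cat([pos, negs @ z[i]]))
        return torch.stack(rows)                   # (B, 1+k)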

Figure 3: Mean classifier accuracy on the Wiki-3029 dataset. Left: Downstream performance as a function of $N$ and $k$ for a fixed batch size of $B=1000$. Each row is locally normalized by dividing the unnormalized accuracy in that row by the best performance in the row. Color denotes normalized average accuracy across 5 trials, where lighter areas have higher performance. Solid black circles highlight the $k$ with the best performance for each row. Shaded black circles show the smallest and largest values of $k$ with normalized performance of at least $0.995$. Center: Mean classifier average accuracy across 5 trials with varying $B$ and $k$. The performance is globally normalized by dividing the unnormalized accuracy by the overall best accuracy. Right: An experiment highlighting the effect of disabling collisions across $k$ for 4 values of $N$. Solid lines indicate vanilla training, and dotted lines indicate no collisions.

NLP experiments.

Here we perform a thorough analysis on the Wiki-3029 dataset of Saunshi et al. (2019). This dataset consists of 200 sentences taken from each of 3029 Wikipedia articles. The downstream task is to predict the article identity from a given sentence. A key advantage of this dataset is that it contains many classes, which allows us to study learning behavior as the number of classes $N$ increases. For a given $N\leq 3029$, we remove sentences from all but the first $N$ articles in both the training and test datasets. We tokenize sentences using the NLTK English word tokenizer (Bird et al., 2009) and replace all tokens that occur fewer than 50 times in the training set with the token {\tt UNK}, denoting an unknown token. This gives us a vocabulary size $V$ of 15,807, with at most 136 unique tokens per sentence. We create our training set by randomly sampling 80% of the sentences from each article; the remaining 20% of the sentences from each article form the test set.

We use a bag-of-words representation for modeling $f$. Formally, we map a sentence $x$ to a list of tuples $\{(t_{i},p_{i})\}$, where $\{t_{i}\}$ is the set of unique tokens in $x$ and $p_{i}$ is the number of times $t_{i}$ occurs in $x$, normalized by the number of tokens in $x$. We compute $f(x)=\sum_{i}p_{i}\nu_{t_{i}}$, where $\nu\in\mathbb{R}^{V\times d}$ is a word-embedding matrix with embedding dimension $d$ and $\nu_{t}$ is its $t^{\textrm{th}}$ row, denoting the word embedding of token $t$. We initialize the word-embedding matrix randomly and train it using NCE.
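A minimal sketch of this encoder (our own illustrative PyTorch code; tokenization and vocabulary construction are assumed to happen elsewhere, and the padding index is assumed to be 0) is shown below.

    import torch
    import torch.nn as nn

    class BagOfWordsEncoder(nn.Module):
        """f(x) = sum_i p_i * nu_{t_i}: a frequency-weighted average of word embeddings."""

        def __init__(self, vocab_size, dim):
            super().__init__()
            self.nu = nn.Embedding(vocab_size, dim)        # word-embedding matrix nu in R^{V x d}

        def forward(self, token_ids):
            """token_ids: (B, T) padded batch of token indices (padding index assumed to be 0)."""
            emb = self.nu(token_ids)                       # (B, T, d)
            mask = (token_ids != 0).float().unsqueeze(-1)  # ignore padding positions
            return (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

Averaging the embeddings of all non-padding token positions is equivalent to the frequency-weighted sum $\sum_{i}p_{i}\nu_{t_{i}}$, since repeated tokens contribute once per occurrence.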

We train the NCE model for a fixed number of epochs and do not use a held-out set, following the protocol of Saunshi et al. (2019). We measure the downstream performance of both the mean classifier $W^{\mu}$ and a trained linear classifier. For the former, we use the entire training dataset to compute $W^{\mu}$ using the trained representation. For the latter, we use the entire training dataset, with labels, to fit a linear layer on top of the model $f$, and we do not fine-tune $f$. For each setting, we repeat the experiment 5 times with different seeds. Please see Appendix D for full details on hyperparameter values and optimization.

The leftmost plot in Figure 3 shows the accuracy of the mean classifier as we vary $N$ and $k$. For a given number of classes $N$, performance improves as $k$ increases and then slowly decays for larger values of $k$. We show the general trend in this figure and report exact numbers in the appendix. Further, we find that the optimal choice of $k$, indicated by solid black circles, slowly increases as $N$ increases. Both of these empirical findings support the theoretical predictions of our improved error transfer bounds. As we have discussed, the previous theoretical results (Saunshi et al., 2019) do not predict this observed trend. Lastly, we observe that performance improves very rapidly as $k$ increases, up until $k=50$, while the decay in performance as $k$ increases beyond the optimal value is relatively mild in comparison to these initial gains.

The center plot in Figure 3 shows the mean classifier’s performance as we vary $B$ and $k$ for two values of $N$. Note that for a given value of $B$, we can only increase $k$ up to $2(B-1)$. We observe that increasing the batch size can improve performance even when the number of negative examples is fixed. Optimal performance is obtained for the largest batch size and an intermediate value of $k$. This highlights the very different roles played by $B$ and $k$ on this dataset. We note that these results could not have been observed by previous work that ties the value of $k$ to $B$.

Figure 4: Average downstream classifier accuracy as a function of both batch size and $k$ for various vision datasets and NCE parameters. Left: SVHN data, using a true positive sample from the same class, as in our analysis. Center Left: SVHN data, but using data augmentation. Here the positive sample is an augmented version of the reference sample, and negatives are augmented as well. Center Right: NCE on CIFAR-10 data, using real positive examples. Right: CIFAR-10 data with data augmentation.

Our theoretical analysis reveals a coverage-collision trade-off and suggests that the poor performance when $k$ is very large is due to many collisions between the negative samples and the anchor/positive, i.e., false negatives. To empirically verify this claim, we modify the representation learning procedure by sampling $k$ negative examples only from those members of the candidate set that do not share the same class as the positive reference point. (If we are unable to find $k$ true negatives for some example in the batch, then no update is performed on that batch; due to the choice of $N,B,k$, this never occurred in our experiments.) In the rightmost plot of Figure 3, we compare this sampling procedure with the standard one for different values of $N$ and $k$ and a fixed $B=1000$. We observe that artificially removing collisions improves performance across all values of $N$; however, the effect is more significant for smaller values of $N$ ($1.5\%$ for $N=200$ and $k=1000$). That the effect is more significant for small $N$ is to be expected, since the probability of collision decreases rapidly with $N$, so the two sampling procedures are quite similar for large $N$. We further notice that performance decays only slightly for very large values of $k$ in the absence of collisions. This evidence supports the collision hypothesis, namely that collisions are a central reason for performance degradation with increasing $k$. This observation was also made by Chuang et al. (2020). Lastly, as hypothesized by prior work (Chen et al., 2020), we believe a mild reduction in accuracy with large values of $k$ can happen even in the absence of collisions, due to instability in optimization.
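A sketch of this modified sampling step is given below (our own illustrative code; the class labels of the batch, which are otherwise unused by NCE, are assumed to be available for this diagnostic).

    import torch

    def sample_noncolliding_negatives(labels, i, k):
        """Return k candidate indices whose class differs from that of anchor i, or None.

        labels: length-2B list of class labels for the candidates [x_1..x_B, x_1^+..x_B^+].
        """
        B = len(labels) // 2
        idx = [j for j in range(2 * B) if j % B != i and labels[j] != labels[i]]
        if len(idx) < k:                       # cannot find k true negatives: skip this batch
            return None
        return torch.tensor(idx)[torch.randperm(len(idx))[:k]]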

Vision experiments.

In this section, we decouple the batch size from $k$ and measure downstream performance as a function of these parameters on the SVHN and CIFAR-10 datasets. We fit all parameters except the last layer of a ResNet-18 model using the NCE loss, and we evaluate downstream accuracy by training the last layer only, in a fully supervised fashion. We use a learning rate of $10^{-3}$, train for 400 epochs, and fit parameters with the Adam optimizer.

We consider two ways of sampling tuples of images for the NCE objective. In the first, we adhere to our theoretical framework and use class information to capture semantic similarity. Here, for a given sample, we use an image from the same class to act as a positive, and we do not use data augmentation. In the second, we deviate from our theory and do not use class information. Instead, we follow a common empirical approach and use an augmented version of the anchor point as the positive; negative samples are also augmented. For these experiments, we use the SimCLR data augmentation scheme, consisting of randomly applied crops, color jitter, horizontal flips, greyscaling, and blurring (Chen et al., 2020).
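For concreteness, one plausible torchvision parameterization of this kind of augmentation pipeline is sketched below (the exact crop sizes and jitter strengths used in our runs may differ; GaussianBlur requires a reasonably recent torchvision release).

    from torchvision import transforms

    # Two independent draws of this transform applied to the same image give the
    # anchor and positive views; negatives are other images augmented the same way.
    simclr_augment = transforms.Compose([
        transforms.RandomResizedCrop(32),                                       # CIFAR-10 / SVHN resolution
        transforms.RandomHorizontalFlip(),
        transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
        transforms.RandomGrayscale(p=0.2),
        transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 2.0)),
        transforms.ToTensor(),
    ])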

We experiment with both the SVHN and CIFAR-10 image datasets, with Figure 4 summarizing our results. When using augmentation, the trend shown in previous work is observed, with accuracy tending to improve with increasing batch size and $k$. When using class information for sampling positives, however, the performance trends are somewhat unexpected. Downstream SVHN accuracy in this setting is remarkably consistent, with nearly all combinations of $k$ and batch size producing roughly equivalent, high-quality representations for downstream classification. In the CIFAR-10 case, this is not true; in fact, performance tends to increase as batch size decreases. This is likely due to the exclusion of data augmentation, which is necessary for obtaining reasonable performance even in the fully supervised regime for these architectures. In its absence, the variance induced by decreasing the batch size appears to offer a regularization benefit that improves accuracy. Still, overall performance without augmentation is poor in comparison to data-augmented NCE.

These experiments raise an interesting conjecture: perhaps the observation that downstream performance increases with $k$ and batch size can be attributed to the kinds of data augmentation techniques that have been refined specifically for vision, rather than to the NCE objective itself. In NLP, where there are less clear-cut ways of performing data augmentation, we see trends that more closely agree with our theory. Either way, these experiments point to a distortion between the theoretical framework typically used to study NCE (using true positives) and, in the vision case, what is observed in practice. The core of this distortion appears to involve the use of augmentation techniques that have been highly tuned for object recognition and computer vision tasks.

5 Discussion

In this paper, we study how the number of negative samples $k$ used in contrastive representation learning affects the quality of learned representations. Our theoretical results highlight a tension between covering the concepts in the data and avoiding collisions, and suggest that the number of negatives should scale roughly linearly with the number of concepts. Our NLP experiments confirm this trade-off, whereas our vision experiments show a more entangled dependence on other factors such as the inductive bias of the architecture and the use of data augmentation.

This article therefore raises a number of interesting directions for future work, spanning both theory and practice, and calls for a more detailed investigation to better understand the success of NCE in different domains. We close with a few of these directions and related points of discussion:

Data augmentation.

As alluded to by our vision experiments, the use of data augmentation in practice seems to be a principal factor in the gap between what is observed empirically and what we describe theoretically. Because NCE is so commonly used with image data, developing a theoretical model that can incorporate properties of commonly used augmentations and the inductive bias of networks is an important avenue for future research.

Choosing $k$ in practice.

When the NCE distribution and the downstream task are closely linked, our theoretical results indicate that the number of negatives used in NCE should scale with the number of classes. In practice, however, choosing $k$ is less clear, with the optimal choice being a function of complex factors. The model’s inductive bias, the choice of data augmentation used, and the data type all seem to influence the ideal value. Given this, are there algorithmic principles for setting $k$ in practice?

Decoupling negatives from other parameters.

This work disentangles $k$ from the batch size used for model training, with our experiments suggesting that each plays a different role in the quality of the learned representation. Modern NCE pipelines have several conflating parameters, and we believe that a deeper empirical study is imperative for disambiguating and understanding the role of each.

Acknowledgements

We thank Nikunj Saunshi for helpful discussions regarding the experimental aspects of this work. We also thank Remi Tachet des Combes and Philip Bachman for interesting discussions and for providing references to relevant work.

References

  • Bachman et al. (2019) P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, 2019.
  • Bird et al. (2009) S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. O’Reilly Media, 2009.
  • Chen et al. (2020) T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 2020.
  • Chopra et al. (2005) S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
  • Chuang et al. (2020) C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka. Debiased contrastive learning. In Advances in Neural Information Processing Systems, 2020.
  • Clark et al. (2020) K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, 2020.
  • Fang and Xie (2020) H. Fang and P. Xie. Cert: Contrastive self-supervised learning for language understanding. arXiv:2005.12766, 2020.
  • Foster et al. (2020) A. Foster, M. Jankowiak, M. O’Meara, Y. W. Teh, and T. Rainforth. A unified stochastic gradient approach to designing bayesian-optimal experiments. In International Conference on Artificial Intelligence and Statistics, 2020.
  • Foster and Rakhlin (2019) D. J. Foster and A. Rakhlin. $\ell_{\infty}$-vector contraction for Rademacher complexity. arXiv:1911.06468, 2019.
  • Fu et al. (2021) H. Fu, S. Zhou, Q. Yang, J. Tang, G. Liu, K. Liu, and X. Li. Lrc-bert: Latent-representation contrastive knowledge distillation for natural language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • Giorgi et al. (2020) J. M. Giorgi, O. Nitski, G. D. Bader, and B. Wang. Declutr: Deep contrastive learning for unsupervised textual representations. arXiv:2006.03659, 2020.
  • Gutmann and Hyvärinen (2010) M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In International Conference on Artificial Intelligence and Statistics, 2010.
  • He et al. (2020) K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • Huo et al. (2020) X. Huo, L. Xie, L. Wei, X. Zhang, H. Li, Z. Yang, W. Zhou, H. Li, and Q. Tian. Heterogeneous contrastive learning: Encoding spatial information for compact visual representations. arXiv:2011.09941, 2020.
  • Khosla et al. (2020) P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning. In Advances in Neural Information Processing Systems, 2020.
  • Kipf et al. (2019) T. Kipf, E. van der Pol, and M. Welling. Contrastive learning of structured world models. In International Conference on Learning Representations, 2019.
  • Laskin et al. (2020) M. Laskin, A. Srinivas, and P. Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, 2020.
  • Le-Khac et al. (2020) P. H. Le-Khac, G. Healy, and A. F. Smeaton. Contrastive representation learning: A framework and review. IEEE Access, 2020.
  • Lee et al. (2020) J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo. Predicting what you already know helps: Provable self-supervised learning. arXiv:2008.01064, 2020.
  • Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (Workshop Track), 2013.
  • Mohri et al. (2018) M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Nachum et al. (2018) O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. In International Conference on Learning Representations, 2018.
  • Oord et al. (2018) A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
  • Saunshi et al. (2019) N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, 2019.
  • Schroff et al. (2015) F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • Shelhamer et al. (2017) E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell. Loss is its own reward: Self-supervision for reinforcement learning. In International Conference on Learning Representations (Workshop Track), 2017.
  • Smith and Eisner (2005) N. A. Smith and J. Eisner. Contrastive estimation: Training log-linear models on unlabeled data. In Annual Meeting of the Association for Computational Linguistics, 2005.
  • Tosh et al. (2020) C. Tosh, A. Krishnamurthy, and D. Hsu. Contrastive estimation reveals topic posterior information to linear models. arXiv:2003.02234, 2020.
  • Tosh et al. (2021) C. Tosh, A. Krishnamurthy, and D. Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, 2021.
  • Tschannen et al. (2019) M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic. On mutual information maximization for representation learning. In International Conference on Learning Representations, 2019.
  • Wang and Isola (2020) T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, 2020.
  • Wang et al. (2020) X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li. Dense contrastive learning for self-supervised visual pre-training. arXiv:2011.09157, 2020.
  • Zhu et al. (2020) J. Zhu, Y. Xia, L. Wu, J. Deng, W. Zhou, T. Qin, and H. Li. Masked contrastive representation learning for reinforcement learning. arXiv:2010.07470, 2020.

Appendix

The appendix is organized as follows: Appendix A-C present proofs of our theoretical results. Appendix D presents experimental details including additional experiments and hyperparameter values.

Appendix A Useful Facts

The standard multi-class hinge loss $\ell(v)=\max\{0,1+\max_{i}\{-v_{i}\}\}$ and the logistic loss $\ell(v)=\log\left(1+\sum_{i}\exp(-v_{i})\right)$, for a vector $v\in\mathbb{R}^{k}$, satisfy the following lemma:

Lemma 1 (Sub-additivity of losses).

Let $v\in\mathbb{R}^{k}$ be any vector. For all $I_{1},I_{2}\subset[k]$ with $S=I_{1}\cup I_{2}$, we have that

\ell(\{v_{i}\}_{i\in I_{1}})\leq\ell(\{v_{i}\}_{i\in S})\leq\ell(\{v_{i}\}_{i\in I_{1}})+\ell(\{v_{i}\}_{i\in I_{2}}).
Lemma 2 (Coupon Collector).

Consider the coupon collector problem with $n$ coupons, each with probability $p_{i}$. In each trial we collect a coupon by sampling from the distribution $(p_{1},\ldots,p_{n})$. Let $M$ denote the minimum number of trials until all coupons have been collected. Then:

\mathbb{E}[M]=\int_{0}^{\infty}\left(1-\prod_{i=1}^{n}\left(1-\exp(-p_{i}t)\right)\right)dt\leq\frac{H_{n}}{\min_{i}p_{i}},

where $H_{n}$ is the $n$-th harmonic number. Furthermore, using Markov’s inequality,

\Pr\left[M\geq\frac{2H_{n}}{\min_{i}p_{i}}\right]\leq\mathbb{E}\left[M\right]\cdot\frac{\min_{i}p_{i}}{2H_{n}}\leq 1/2.
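The expectation bound can be checked numerically with a short simulation (our own sanity-check sketch; the coupon probabilities below are illustrative and not part of the formal argument).

    import numpy as np

    def expected_collection_time(p, trials=2000, rng=np.random.default_rng(0)):
        """Monte-Carlo estimate of E[M], the number of trials needed to see every coupon."""
        n, total = len(p), 0
        for _ in range(trials):
            seen, m = set(), 0
            while len(seen) < n:
                seen.add(rng.choice(n, p=p))
                m += 1
            total += m
        return total / trials

    p = np.array([0.4, 0.3, 0.2, 0.1])
    H_n = np.sum(1.0 / np.arange(1, len(p) + 1))             # harmonic number H_n
    print(expected_collection_time(p), "<=", H_n / p.min())  # bound H_n / min_i p_i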
Lemma 3.

For any vectors $v,w\in\mathbb{R}^{t}$,

\ell(v)-\ell(w)\leq\left|\max_{i}(w_{i}-v_{i})\right|.
Proof.

For hinge loss,

\begin{align*}
\ell(v) &= \max\left(0,1+\max_{i}((w_{i}-v_{i})-w_{i})\right) \\
&\leq \max\left(0,1+\max_{i}(-w_{i})+\max_{i}(w_{i}-v_{i})\right) \\
&\leq \max\left(0,1+\max_{i}(-w_{i})\right)+\left|\max_{i}(w_{i}-v_{i})\right|=\ell(w)+\left|\max_{i}(w_{i}-v_{i})\right|.
\end{align*}

For logistic loss,

\begin{align*}
\ell(v) &= \log\left(1+\sum_{i}\exp(-v_{i})\right) \\
&= \log\left(1+\sum_{i}\exp(-w_{i})\exp(w_{i}-v_{i})\right) \\
&\leq \log\left(1+\exp\left(|\max_{i}(w_{i}-v_{i})|\right)\sum_{i}\exp(-w_{i})\right) \\
&\leq |\max_{i}(w_{i}-v_{i})|+\log\left(1+\sum_{i}\exp(-w_{i})\right)=\ell(w)+\left|\max_{i}(w_{i}-v_{i})\right|.
\end{align*}

The last inequality holds because $1\leq\exp(|\max_{i}(w_{i}-v_{i})|)$. ∎

Appendix B General Results with Non-uniform Probabilities and Proofs

In this section, we state and prove more general versions of Theorem 1 and Theorem 3 which can accommodate a non-uniform class distribution $\rho$. The particular results in the paper can be obtained by setting all class probabilities to $1/N$.

Notation.

For a drawn sample, denote the collisions by $I(c,c^{-}_{1},\ldots,c^{-}_{k})=\{i\in[k]\,|\,c=c^{-}_{i}\}$. Let $\tau_{k}(c)=\Pr_{c^{-}_{i}\sim\rho^{k}}[I(c,c^{-}_{1},\ldots,c^{-}_{k})\neq\emptyset]=1-(1-\rho(c))^{k}$ denote the probability of seeing class $c$ in the negative samples (i.e., the collision probability). Also, let $\tau_{k}=\sum_{c^{\prime}\in\mathcal{C}}\rho(c^{\prime})\tau_{k}(c^{\prime})=1-\sum_{c^{\prime}\in\mathcal{C}}\rho(c^{\prime})(1-\rho(c^{\prime}))^{k}$. We drop the arguments of $I$ when they are clear from context.

B.1 General Transfer Theorem

Let us first start with the general transfer theorem.

Theorem 5 (General Transfer Bound).

For all $f\in\mathcal{F}$, we have

\mathcal{L}_{\textup{sup}}(f)\leq\frac{2\max\left(1,\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right)}{\left(1-\rho_{\max}\right)^{k}}\left(\mathcal{L}^{(k)}_{\textup{NCE}}(f)-\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\,\middle|\,I\neq\emptyset\right]\right),

where $\rho_{\min}=\min_{c}\rho(c)$, $\rho_{\max}=\max_{c}\rho(c)$, and $H_{t}$ is the $t^{\text{th}}$ harmonic number.

Proof.

The first step is to apply Jensen’s inequality to get,

\begin{align*}
\mathcal{L}^{(k)}_{\textup{NCE}}(f) &= \underset{c,c_{i}^{-}\sim\rho^{k+1}}{\mathbb{E}}\;\underset{\begin{subarray}{c}x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\right\}\right)\right] \\
&\geq \underset{\begin{subarray}{c}c,c^{-}_{i}\sim\rho^{k+1}\\ x\sim D_{c}\end{subarray}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\right].
\end{align*}

Introducing the collision function $I$, we have

\begin{align*}
\mathcal{L}^{(k)}_{\textup{NCE}}(f) &\geq \underset{x\sim D}{\mathbb{E}}\;\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\right] \\
&\geq \underset{(x,c)\sim D}{\mathbb{E}}\left[(1-\tau_{k}(c))\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\middle|I=\emptyset,c\right]\right]+\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\middle|I\neq\emptyset\right].
\end{align*}

Here $\ell_{t}(0)$ denotes $\ell$ evaluated at a $t$-dimensional zero vector, and the $\ell_{|I|}(0)$ term is obtained using the sub-additivity property in Lemma 1. Note that for the second term, conditional on $I\neq\emptyset$, the distribution of $c$ is not $\rho$ in general. We deal with the second term in the next section, and focus now on the first term.

Lemma 4.

For any $c\in\mathcal{C}$ and $x$, we have

\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\}\right)\middle|I=\emptyset,c\right]\geq\frac{1}{2\left\lceil\frac{2(1-\rho(c))H_{N-1}}{k\min_{c^{\prime}\neq c}\rho(c^{\prime})}\right\rceil}\cdot\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right),

where $H_{t}$ is the $t^{\text{th}}$ harmonic number.

Proof.

We can view the conditioning on I=\emptyset and c as selecting the c^{-}_{i} from \rho_{-c}, the distribution with support on all c^{\prime}\neq c and \rho_{-c}(c^{\prime})=\frac{\rho(c^{\prime})}{1-\rho(c)}. Therefore, we have

\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\}\right)\middle|I=\emptyset,c\right]
\stackrel{\textnormal{(a)}}{=}\underset{c^{-}_{i}\sim\rho_{-c}^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\}\right)\right]
\stackrel{\textnormal{(b)}}{=}\frac{1}{m}\sum_{j=1}^{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{ij}}\right)\}\right)\right]
=\frac{1}{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\mathbb{E}}\left[\sum_{j=1}^{m}\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{ij}}\right)\}\right)\right]
\stackrel{\textnormal{(c)}}{\geq}\frac{1}{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\cup\{c^{-}_{ij}\}}\right)\right]
\stackrel{\textnormal{(d)}}{\geq}\frac{1}{m}\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\mathbb{E}}\left[{\bf 1}\left[\cup\{c^{-}_{ij}\}=\mathcal{C}\setminus\{c\}\right]\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\right]
=\frac{1}{m}\cdot\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\cdot\underset{c^{-}_{ij}\sim\rho_{-c}^{k\times m}}{\Pr}\left[\cup\{c^{-}_{ij}\}=\mathcal{C}\setminus\{c\}\right].

In step (a) we replace the distribution over negatives, as described above. In step (b) we construct m copies of this expression and take an average, which is a key step in the argument. Next, step (c) uses Lemma 1, which allows us to replace the sum over the m losses with a single multi-class loss on the union of the labels. Inequality (d) uses the non-negativity of the loss, and finally we observe that the loss expression no longer depends on the samples c_{ij}^{-}.

To bound the probability term, note that it can be viewed as an instance of the standard coupon collector problem with N-1 different coupons and non-identical selection probabilities given by \rho_{-c}. In this instance, we perform mk trials. Therefore, using Lemma 2, if we take mk\geq\frac{2(1-\rho(c))H_{N-1}}{\min_{c^{\prime}\neq c}\rho(c^{\prime})} trials, we cover all coupons with probability at least 1/2. This implies that taking m=\left\lceil\frac{2(1-\rho(c))H_{N-1}}{k\min_{c^{\prime}\neq c}\rho(c^{\prime})}\right\rceil, the probability of hitting all elements in \mathcal{C}\setminus\{c\} is at least 1/2, which yields the desired result. ∎
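
As a quick sanity check on the coupon-collector step (illustrative only, not part of the proof; the values of N, k, and the trial count are arbitrary choices), the following Python sketch verifies that, for a uniform prior, m*k draws from \rho_{-c} with m as above cover all N-1 remaining classes with probability well above 1/2.

# Monte Carlo check of the coupon-collector step (uniform class prior assumed).
import math, random

N, k = 50, 10
rho = 1.0 / N                                   # uniform prior, so min_{c' != c} rho(c') = 1/N
H = sum(1.0 / i for i in range(1, N))           # harmonic number H_{N-1}
m = math.ceil(2 * (1 - rho) * H / (k * rho))    # the value of m used in the proof
trials, covered = 2000, 0
for _ in range(trials):
    seen = {random.randrange(N - 1) for _ in range(m * k)}   # m*k draws from rho_{-c}
    covered += (len(seen) == N - 1)
print(m, covered / trials)   # empirical coverage probability; should be at least 1/2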

Substituting back, for the first term, we get,

\underset{(x,c)\sim D}{\mathbb{E}}\left[(1-\tau_{k}(c))\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\ell\left(\left\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{-}_{i}}\right)\right\}\right)\middle|I=\emptyset,c\right]\right]
\geq\underset{(x,c)\sim D}{\mathbb{E}}\left[\frac{1-\tau_{k}(c)}{2\left\lceil\frac{2(1-\rho(c))H_{N-1}}{k\min_{c^{\prime}\neq c}\rho(c^{\prime})}\right\rceil}\cdot\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\right]
\geq\frac{\left(1-\rho_{\max}\right)^{k}}{2\left\lceil\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right\rceil}\underset{(x,c)\sim D}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(\mu_{c}-\mu_{c^{\prime}}\right)\}_{c^{\prime}\in\mathcal{C}\setminus c}\right)\right]
\geq\frac{\left(1-\rho_{\max}\right)^{k}}{2\left\lceil\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right\rceil}\mathcal{L}^{\mu}_{\textup{sup}}(f).

This gives us the desired result. ∎
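
To make the shape of this bound concrete, the short sketch below (uniform class prior assumed; the choice of N and the grid of k values are ours) evaluates the coefficient (1-\rho_{\max})^{k}/\big(2\lceil 2(1-\rho_{\min})H_{N-1}/(k\rho_{\min})\rceil\big) in front of \mathcal{L}^{\mu}_{\textup{sup}}(f): it degrades for both very small and very large k and peaks at an intermediate k of order N, which is the collision-coverage trade-off discussed in the main text.

# Illustrative evaluation of the coefficient in the bound (uniform class prior assumed).
import math

def coefficient(N, k):
    rho = 1.0 / N                                   # rho_min = rho_max = 1/N
    H = sum(1.0 / i for i in range(1, N))           # H_{N-1}
    return (1 - rho) ** k / (2 * math.ceil(2 * (1 - rho) * H / (k * rho)))

N = 100
for k in (1, 10, 50, 100, 200, 500, 1000):
    print(k, coefficient(N, k))   # small for tiny and huge k, largest near k of order N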

B.2 Refining the Transfer Theorem

As Saunshi et al. [2019] point out, we cannot drive \mathcal{L}_{\textup{NCE}} to 0, as it is always lower bounded by \tau_{k}. To get a better understanding of when the loss is actually small, let us define two terms that will be useful for a more refined analysis.

Definition 2 (Average Intra-class Variance).

We define the average intra-class variance of a learned embedding as,

s(f):=\underset{c\sim\rho}{\mathbb{E}}\left[\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\right]

where \Sigma(f,c)=\mathbb{E}_{x\sim D_{c}}[(f(x)-\mu_{c})(f(x)-\mu_{c})^{\top}] is the covariance matrix of the representation f restricted to the class distribution D_{c}.
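
For intuition, s(f) can be estimated from samples. The sketch below (hypothetical inputs: a list of per-class embedding matrices and the prior weights) computes the plug-in estimate corresponding to Definition 2.

# A rough plug-in estimate of s(f) from sampled embeddings (hypothetical inputs).
import numpy as np

def estimate_s(embeddings_by_class, rho):
    """embeddings_by_class: list of (n_c, d) arrays of f(x) for x ~ D_c; rho: class prior weights."""
    total = 0.0
    for p, Z in zip(rho, embeddings_by_class):
        mu = Z.mean(axis=0)
        Sigma = (Z - mu).T @ (Z - mu) / Z.shape[0]     # empirical covariance Sigma(f, c)
        total += p * np.linalg.norm(Z, axis=1).mean() * np.sqrt(np.linalg.norm(Sigma, ord=2))
    return total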

Definition 3 (NCE loss without collisions).

We define a loss function that is obtained from the NCE loss by removing the colliding negatives,

\mathcal{L}^{(k)}_{\neq}(f):=\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\not\in I}\right)\right].

Observe that both \mathcal{L}^{(k)}_{\neq} and s(f) can be made very small for large enough \mathcal{F}. Now we are ready to present our refined bound.

Theorem 6 (General Refined Bound).

For all f\in\mathcal{F}, we have

\mathcal{L}^{\mu}_{\textup{sup}}(f)\leq\frac{2\left\lceil\frac{2(1-\rho_{\min})H_{N-1}}{k\rho_{\min}}\right\rceil}{(1-\rho_{\max})^{k}}\left(\mathcal{L}^{(k)}_{\neq}(f)+\sqrt{k\rho_{\max}}\cdot s(f)\right).
Proof.

Decomposing the NCE loss gives us,

\mathcal{L}^{(k)}_{\textup{NCE}}(f)-\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\middle|I\neq\emptyset\right]
\leq\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\not\in I}\right)+\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\in I}\right)\right]-\tau_{k}\underset{c,c^{-}_{i}\sim\rho^{k+1}}{\mathbb{E}}\left[\ell_{|I|}(0)\middle|I\neq\emptyset\right]
=\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\not\in I}\right)\right]+\tau_{k}\underbrace{\underset{\begin{subarray}{c}c,c_{i}^{-}\sim\rho^{k+1}\\ x,x^{+}\sim D_{c}^{2}\\ x^{-}_{i}\sim D_{c^{-}_{i}}\end{subarray}}{\mathbb{E}}\left[\ell(\{f(x)^{\top}(f(x^{+})-f(x_{i}^{-}))\}_{i\in I})-\ell_{|I|}(0)\middle|I\neq\emptyset\right]}_{\Delta(f)}
=\mathcal{L}^{(k)}_{\neq}(f)+\tau_{k}\Delta(f).

As before, note that the second term samples the classes (c,c_{i}^{-}) from the conditional distribution given that there is a collision. We follow the convention that an empty index set contributes value 0 to \mathcal{L}^{(k)}_{\neq}. Note that this first term, \mathcal{L}^{(k)}_{\neq}(f), can be made very small if the classes are separable by f with a sufficient margin. As for the second term, we will first prove the following lemma.

Lemma 5.

For any t\geq 0:

\underset{x,x^{+},x_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x_{i})\right)\}_{i\in[t]}\right)-\ell_{t}(0)\right]\leq\sqrt{t}\,\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}.

The above lemma is an improvement over Lemma A.1 of Saunshi et al. [2019], which has a linear scaling with t. Observe that on the LHS, all samples are drawn from D_{c}, which corresponds to the situation where we have t collisions with the anchor class c.

Proof of Lemma 5.

From Lemma 3, we know that

\underset{x,x^{+},x^{-}_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\ell\left(\{f(x)^{\top}\left(f(x^{+})-f(x^{-}_{i})\right)\}_{i\in[t]}\right)-\ell_{t}(0)\right]\leq\underset{x,x^{+},x^{-}_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\left|\max_{i}f(x)^{\top}\left(f(x^{-}_{i})-f(x^{+})\right)\right|\right].

Now we have,

\underset{x,x^{+},x^{-}_{i}\sim D_{c}^{2+t}}{\mathbb{E}}\left[\left|\max_{i}f(x)^{\top}\left(f(x^{-}_{i})-f(x^{+})\right)\right|\right]
=\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\underset{x^{+},x^{-}_{i}\sim D_{c}^{1+t}}{\mathbb{E}}\left[\left|\max_{i}\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x^{-}_{i})-f(x^{+})\right)\right|\right]\right]
\leq\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\sqrt{\underset{x^{+},x^{-}_{i}\sim D_{c}^{1+t}}{\mathbb{E}}\left[\left(\max_{i}\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x^{-}_{i})-f(x^{+})\right)\right)^{2}\right]}\right]
\leq\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\sqrt{\sum_{i}\underset{x^{+},x^{-}_{i}\sim D_{c}^{2}}{\mathbb{E}}\left[\left(\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x_{i}^{-})-f(x^{+})\right)\right)^{2}\right]}\right]
\leq\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\sqrt{t\underset{x^{+},x^{-}\sim D_{c}^{2}}{\mathbb{E}}\left[\left(\frac{f(x)^{\top}}{\|f(x)\|}\left(f(x^{-})-f(x^{+})\right)\right)^{2}\right]}\right]
\leq\sqrt{t}\,\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}.

Combining with the above inequality gives us the desired result. ∎

Substituting back, we get the following bound on the second term,

\tau_{k}\Delta(f)\leq\underset{c\sim\rho}{\mathbb{E}}\left[\tau_{k}(c)\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[\sqrt{|I|}\middle|I\neq\emptyset,c\right]\right]
\leq\underset{c\sim\rho}{\mathbb{E}}\left[\tau_{k}(c)\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\sqrt{\underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[|I|\middle|I\neq\emptyset,c\right]}\right]
\stackrel{\textnormal{(a)}}{=}\underset{c\sim\rho}{\mathbb{E}}\left[\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\sqrt{k\rho(c)\tau_{k}(c)}\right]
=\sqrt{k}\underset{c\sim\rho}{\mathbb{E}}\left[\underset{x\sim D_{c}}{\mathbb{E}}\left[\|f(x)\|\right]\sqrt{\|\Sigma(f,c)\|_{2}}\sqrt{\rho(c)\tau_{k}(c)}\right]
\stackrel{\textnormal{(b)}}{\leq}\sqrt{k\rho_{\max}}\,s(f),

where (a) follows from observing that \underset{c^{-}_{i}\sim\rho^{k}}{\mathbb{E}}\left[|I|\middle|I\neq\emptyset,c\right]=\frac{k\rho(c)}{\tau_{k}(c)} (see also equation (35) of Saunshi et al. [2019]) and (b) follows from Definition 2. Combining these terms gives us the desired result. ∎
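
The identity used in step (a) is also easy to verify numerically; the following sketch (uniform prior; N, k, and the trial count are arbitrary choices) compares a Monte Carlo estimate of the conditional expectation with k\rho(c)/\tau_{k}(c).

# Monte Carlo check (uniform prior assumed) of E[|I| | I != emptyset, c] = k*rho(c)/tau_k(c).
import numpy as np

N, k, trials = 20, 15, 200_000
rho_c = 1.0 / N
tau_k = 1 - (1 - rho_c) ** k                                   # probability of at least one collision
collisions = (np.random.rand(trials, k) < rho_c).sum(axis=1)   # |I| ~ Binomial(k, rho(c))
print(collisions[collisions > 0].mean(), k * rho_c / tau_k)    # the two numbers agree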

Note that Saunshi et al. [2019] also obtain a similar result, but in their analysis the coefficient on the s(f) term scales linearly with k, whereas we obtain a \sqrt{k} scaling. However, for the case of a uniform class distribution \rho, their coefficient is tighter in the k\leq N regime.

B.3 Generalization Bound

To transfer our guarantee to the minimizer of the empirical NCE loss, we prove a generalization bound using a notion of worst case Rademacher complexity.

Definition 4 (Rademacher Complexity).

The empirical Rademacher complexity of a real-valued function class \mathcal{H} for a set S drawn from \mathcal{X}^{m} is defined to be

\widehat{\mathcal{R}}_{S}(\mathcal{H}):=\mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}}\left(\frac{1}{m}\sum_{i=1}^{m}\sigma_{i}h(x_{i})\right)\right],

where the \sigma_{i} are i.i.d. Rademacher random variables. We define the worst-case Rademacher complexity to be

\mathcal{R}_{m}(\mathcal{H}):=\max_{S\in\mathcal{X}^{m}}\widehat{\mathcal{R}}_{S}(\mathcal{H}).
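
As a concrete (illustrative, not used in the proofs) instance of Definition 4, the empirical Rademacher complexity of the norm-bounded linear class \{x\mapsto\langle w,x\rangle:\|w\|\leq 1\} has the closed-form supremum \|\frac{1}{m}\sum_{i}\sigma_{i}x_{i}\|, which the sketch below averages over sampled sign vectors.

# Monte Carlo estimate of the empirical Rademacher complexity of a norm-bounded linear class.
import numpy as np

def empirical_rademacher_linear(X, num_draws=5000):
    m = X.shape[0]
    total = 0.0
    for _ in range(num_draws):
        sigma = np.random.choice([-1.0, 1.0], size=m)
        # sup_{||w|| <= 1} (1/m) sum_i sigma_i <w, x_i> = || (1/m) sum_i sigma_i x_i ||
        total += np.linalg.norm(sigma @ X / m)
    return total / num_draws

X = np.random.randn(200, 32)   # a fixed sample S of m = 200 points in R^32
print(empirical_rademacher_linear(X))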

We will use the following theorem to bound the Rademacher complexity of our vector valued class,

Theorem 7 (Foster and Rakhlin [2019]).

Let \mathcal{H}\subseteq\{h:\mathcal{X}\rightarrow\mathbb{R}^{K}\}, and let \phi be a real-valued L-Lipschitz function with respect to the \ell_{\infty} norm. We have

\mathcal{R}_{S}(\phi\circ\mathcal{H})\leq\tilde{O}\left(L\sqrt{K}\right)\max_{i}\mathcal{R}_{M}(\mathcal{H}|_{i}),

where \mathcal{H}|_{i} is the function class restricted to the i^{\text{th}} coordinate.

Note that the above theorem relates the empirical Rademacher complexity on a sample S to the worst-case Rademacher complexity. Finally, we will use the following standard generalization result based on Rademacher complexity.

Theorem 8 (Mohri et al. [2018]).

For a real-valued function class \mathcal{H} and a set S=\{x_{1},\ldots,x_{m}\}\in\mathcal{X}^{m} drawn i.i.d. from some distribution, with probability 1-\delta, for all h\in\mathcal{H},

\left|\mathbb{E}[h(x)]-\frac{1}{m}\sum_{i=1}^{m}h(x_{i})\right|\leq\mathcal{R}_{S}(\mathcal{H})+O\left(\sqrt{\frac{\log(1/\delta)}{m}}\right).

Now we state our generalization bound for the NCE loss.

Theorem 9.

Let B=\max_{x\in\mathcal{X},f^{\prime}\in\mathcal{F}}\|f^{\prime}(x)\|_{1}. For all f\in\mathcal{F}, with probability 1-\delta over the random sample S of M NCE examples used to fit \widehat{f},

\mathcal{L}^{(k)}_{\textup{NCE}}(\widehat{f})\leq\mathcal{L}^{(k)}_{\textup{NCE}}(f)+\tilde{O}\left(B\sqrt{(k+2)d}\right)\max_{i}\mathcal{R}_{M}(\mathcal{F}|_{i})+O\left(\sqrt{\frac{\log(1/\delta)}{M}}\right).
Proof.

Let us assume the dimension of the input features is n. Define the function \phi:\mathbb{R}^{d(k+2)}\rightarrow\mathbb{R} as

\phi(z,z^{+},z^{-}_{1},\ldots,z^{-}_{k})=\ell\left(\left\{z^{\top}(z^{+}-z^{-}_{i})\right\}_{i=1}^{k}\right),

where each z,z^{+},z^{-}_{i}\in\mathbb{R}^{d}. Also define the class of functions \mathcal{G}:=\{g_{f}\,|\,f\in\mathcal{F}\}, where g_{f}:\mathbb{R}^{n(k+2)}\rightarrow\mathbb{R}^{d(k+2)} is given by

g_{f}(x,x^{+},x^{-}_{1},\ldots,x^{-}_{k})=(f(x),f(x^{+}),f(x^{-}_{1}),\ldots,f(x^{-}_{k})).

Observe that \phi\circ g_{f} gives us exactly the NCE loss for a single sample. Therefore, using Theorem 8 and the standard ERM analysis, we have with probability 1-\delta, for all f\in\mathcal{F},

\mathcal{L}^{(k)}_{\textup{NCE}}(\widehat{f})\leq\mathcal{L}^{(k)}_{\textup{NCE}}(f)+2\mathcal{R}_{S}(\phi\circ\mathcal{G})+O\left(\sqrt{\frac{\log(1/\delta)}{M}}\right).

Now we will show that \phi is Lipschitz. Observe that, for any W=(w,w^{+},w^{-}_{1},\ldots,w^{-}_{k}) and Z=(z,z^{+},z^{-}_{1},\ldots,z^{-}_{k}),

|\phi(Z)-\phi(W)|=|\phi(z,z^{+},z^{-}_{1},\ldots,z^{-}_{k})-\phi(w,w^{+},w^{-}_{1},\ldots,w^{-}_{k})|
=\left|\ell\left(\left\{z^{\top}(z^{+}-z^{-}_{i})\right\}_{i=1}^{k}\right)-\ell\left(\left\{w^{\top}(w^{+}-w^{-}_{i})\right\}_{i=1}^{k}\right)\right|
\leq\left|\max_{i}\left(w^{\top}(w^{+}-w^{-}_{i})-z^{\top}(z^{+}-z^{-}_{i})\right)\right| (Using Lemma 3)
\leq\left|w^{\top}w^{+}-z^{\top}z^{+}+\max_{i}\left(z^{\top}z_{i}^{-}-w^{\top}w_{i}^{-}\right)\right|
\leq\left|w^{\top}w^{+}-z^{\top}z^{+}\right|+\max_{i}\left|w^{\top}w_{i}^{-}-z^{\top}z_{i}^{-}\right|.

Thus we have several terms of the form |w^{\top}\tilde{w}-z^{\top}\tilde{z}| where \tilde{w}\in W,\tilde{z}\in Z. For these, we have the bound

|w^{\top}\tilde{w}-z^{\top}\tilde{z}|\leq|w^{\top}(\tilde{w}-\tilde{z})+(w-z)^{\top}\tilde{z}|\leq 2B\|W-Z\|_{\infty},

where \|\cdot\|_{\infty} is the \ell_{\infty} norm in the d(k+2)-dimensional space. Thus we conclude that

|\phi(Z)-\phi(W)|\leq 4B\|W-Z\|_{\infty}.

Now applying Theorem 7, we have

\mathcal{R}_{S}(\phi\circ\mathcal{G})\leq\tilde{O}\left(B\sqrt{(k+2)d}\right)\max_{i\in[d(k+2)]}\mathcal{R}_{M}(\mathcal{G}|_{i})\leq\tilde{O}\left(B\sqrt{(k+2)d}\right)\max_{i\in[d]}\mathcal{R}_{M}(\mathcal{F}|_{i}).

Substituting back gives us the desired result. Here, to pass from the Rademacher complexity of \mathcal{G}, which has output dimension d(k+2), to that of \mathcal{F}, we use the fact that g_{f} is simply k+2 copies of f applied to different examples and that we are working with the worst-case Rademacher complexity. ∎

Applying this to \widehat{f} and using the fact that it is the minimizer of the empirical NCE loss gives us the required guarantee on the NCE loss of \widehat{f}.

Appendix C Calculations for the Example in Section 3

Properties of f_{1}.

First, it is easy to see that \mathcal{L}_{\sup}^{\mu}(f_{1})=1 for the hinge loss, since the mean embeddings satisfy \mu_{2i}=\mu_{2i+1}=e_{i}. In addition, every example is embedded as e_{j} for some j, which means that W^{\mu}e_{j} is a 2-sparse binary vector. Such score vectors always yield hinge loss 1.

Now, let us verify the stated identity for \mathcal{L}_{\textup{NCE}}^{(k)}(f_{1}). Due to symmetry, we can assume that x,x^{+} come from class 1. Now, using the definition of f_{1} we see that

\max\{0,1+f_{1}(x)^{\top}(f_{1}(x_{i}^{-})-f_{1}(x^{+}))\}={\bf 1}\{c_{i}^{-}\in\{1,2\}\}.

Taking the max over all of the negatives, we see that the NCE loss will be 1 if and only if class 1 or class 2 appears among the negatives, and otherwise it will be 0. Formally,

\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})=\mathbb{P}_{S\sim\rho^{k}}[S\cap\{1,2\}\neq\emptyset]=1-\mathbb{P}_{S\sim\rho^{k}}[S\cap\{1,2\}=\emptyset]=1-(1-2/N)^{k},

where in the last step we use that the prior \rho is uniform. Now the stated identity follows by algebraic manipulations. In particular:

\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}=1-(1-2/N)^{k}-1+(1-1/N)^{k}=\left(\frac{N-1}{N}\right)^{k}\left(1-\left(1-\frac{1}{N-1}\right)^{k}\right)
=(1-\tau_{k})\left(1-\left(1-\frac{1}{N-1}\right)^{k}\right).

Finally, observe that

\lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})=1-\lim_{k\to\infty}(1-2/N)^{k}=1.

Note that for f_{1},

\mathcal{L}^{\mu}_{\textup{sup}}(f_{1})=1=\frac{1}{(1-\tau_{k})\left(1-\left(1-\frac{1}{N-1}\right)^{k}\right)}\left(\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}\right)\geq\frac{N-1}{(1-\tau_{k})k}\left(\mathcal{L}_{\textup{NCE}}^{(k)}(f_{1})-\tau_{k}\right),

where the inequality follows from observing that 1-(1-x)^{r}\leq rx for 0<x<1. This shows that Theorem 1 is tight up to \log factors.
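
The f_{1} calculation is easy to check numerically; the sketch below (uniform prior, with N, k, and the trial count chosen arbitrarily; classes are 0-indexed, so the anchor class and its partner are 0 and 1) estimates \mathcal{L}_{\textup{NCE}}^{(k)}(f_{1}) by sampling negatives and compares it with 1-(1-2/N)^{k} and with the identity involving \tau_{k}.

# Monte Carlo check of the f_1 NCE hinge loss and the identity with tau_k.
import numpy as np

N, k, trials = 64, 20, 100_000
rng = np.random.default_rng(0)
negs = rng.integers(0, N, size=(trials, k))            # classes of the k negatives
loss = np.isin(negs, [0, 1]).any(axis=1).mean()        # loss is 1 iff class 0 or 1 appears
tau_k = 1 - (1 - 1.0 / N) ** k
print(loss, 1 - (1 - 2.0 / N) ** k)                    # empirical loss vs. 1 - (1 - 2/N)^k
print(loss - tau_k, (1 - tau_k) * (1 - (1 - 1.0 / (N - 1)) ** k))   # both sides of the identity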

Properties of f_{2}.

For f_{2}, the supervised loss is computed as follows. First note that \mu_{i}=e_{i} for each class i. Now, observe that if the example x is from class i, we have f_{2}(x)^{\top}\mu_{j}=0 for all j\neq i. For class i we have f_{2}(x)^{\top}\mu_{i}\sim\textrm{Unif}(\{1-\epsilon,1+\epsilon\}), where the randomness is over the realization of x. In the first of these cases, the hinge loss is \epsilon, while in the second case it is 0. Thus, we have \mathcal{L}_{\sup}^{\mu}(f_{2})=\epsilon/2.

To analyze \mathcal{L}_{\textup{NCE}} we consider two cases. Again, due to symmetry, we can assume that x,x^{+} come from class 1.

Case 1, no collisions:

If none of the negative examples comes from class 1, then f_{2}(x)^{\top}f_{2}(x_{i}^{-})=0 for all i; hence the hinge loss is (where \sigma,\sigma^{+} are Rademacher random variables):

\mathbb{E}_{\sigma,\sigma^{+}}\max\{0,1+(1+\sigma\epsilon)(-1-\sigma^{+}\epsilon)\}=\epsilon^{2}/4+\epsilon/2.

(The hinge clips the loss at 0 only in the \sigma=\sigma^{+}=1 case, which occurs with probability 1/4 over the Rademacher variables.)
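
This expectation can be verified by enumerating the four sign patterns, as in the following short check (the value of \epsilon is an arbitrary choice):

# Enumerate the four Rademacher sign patterns for the no-collision hinge term.
eps = 0.3
vals = [max(0.0, 1 + (1 + s * eps) * (-1 - sp * eps)) for s in (-1, 1) for sp in (-1, 1)]
print(sum(vals) / 4, eps ** 2 / 4 + eps / 2)   # the two numbers agree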

Case 2, collisions:

We provide crude upper and lower bounds in the case of collisions. Assume that x_{i}^{-} is a collision, so that this term in the hinge loss is:

\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}\max\{0,1+(1+\sigma\epsilon)(1+\sigma_{i}^{-}\epsilon-1-\sigma^{+}\epsilon)\}=\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}\max\{0,1+(1+\sigma\epsilon)((\sigma_{i}^{-}-\sigma^{+})\epsilon)\}
\geq\max\{0,\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}1+(1+\sigma\epsilon)((\sigma_{i}^{-}-\sigma^{+})\epsilon)\}=1.

Here, the inequality is Jensen's inequality, and we then use independence of the Rademacher random variables. This is clearly a lower bound on the expected NCE loss when there is a collision. We can also upper bound the loss when there is a collision by setting \sigma_{i}^{-}=1:

\mathbb{E}_{\sigma,\sigma^{+},\sigma_{i}^{-}}\max\{0,1+(1+\sigma\epsilon)((\sigma_{i}^{-}-\sigma^{+})\epsilon)\}\leq\mathbb{E}_{\sigma,\sigma^{+}}\max\{0,1+(1+\sigma\epsilon)(1-\sigma^{+})\epsilon\}=1+\epsilon.

In addition, since \epsilon\leq 1, for any values of \sigma,\sigma^{+}, this term dominates the loss incurred from any other negative. Thus this is an upper bound on the NCE loss when there is a collision, and we have the bounds

(1-\tau_{k})(\epsilon^{2}/4+\epsilon/2)+\tau_{k}\leq\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})\leq(1-\tau_{k})(\epsilon^{2}/4+\epsilon/2)+\tau_{k}(1+\epsilon).

Actually, we can obtain a much better lower bound. Note that the smallest realization of a collision loss is 1+(1+\epsilon)(-2\epsilon)=1-2\epsilon-2\epsilon^{2}. This quadratic is positive for \epsilon\in[0,(\sqrt{3}-1)/2], which we assume going forward. With this choice of \epsilon, there will be no clipping and we can ignore taking the max with 0. Next, observe that the loss we incur for a non-colliding example is \max\{0,1+(1+\sigma\epsilon)(-1-\sigma^{+}\epsilon)\}. We can see that for any realizations of \sigma,\sigma^{+},\sigma_{i}^{-},

1+(1+\sigma\epsilon)(-1-\sigma^{+}\epsilon)\leq 1+(1+\sigma\epsilon)(\sigma_{i}^{-}-\sigma^{+})\epsilon,

which means that we can ignore all of the non-colliding examples when evaluating the loss. Thus, the loss is

\mathbb{E}\max\{0,1+(1+\sigma\epsilon)(\max_{i\in I}\sigma_{i}^{-}-\sigma^{+})\epsilon\}=1+\epsilon\cdot\mathbb{E}\max_{i\in I}\sigma_{i}^{-},

where I is the set of colliding examples. This is clearly upper bounded by \tau_{k}(1+\epsilon) after taking into account the probability of a collision. For a lower bound, we have

\mathbb{E}\left[{\bf 1}\{|I|>0\}(1+\epsilon\max_{i\in I}\sigma_{i}^{-})\right]=\tau_{k}+\epsilon\,\mathbb{E}\left[{\bf 1}\{|I|>0\}\max_{i\in I}\sigma_{i}^{-}\right]
=\tau_{k}+\epsilon\left(\sum_{t=1}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}\left((1-(1/2)^{t})\cdot 1+(1/2)^{t}\cdot(-1)\right)\right)
=\tau_{k}+\epsilon\left(\sum_{t=1}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}\left(1-2(1/2)^{t}\right)\right)
=\tau_{k}+\epsilon\left(\tau_{k}-2\sum_{t=1}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}(1/2)^{t}\right)
\geq\tau_{k}+\epsilon\left(\tau_{k}-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}-\frac{1}{2}\sum_{t=2}^{k}{k\choose t}\left(\frac{1}{N}\right)^{t}\left(1-\frac{1}{N}\right)^{k-t}\right)
=\tau_{k}+\epsilon/2\cdot\left(\tau_{k}-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}\right).

Here the only inequality uses that in the terms with t\geq 2 we can upper bound (1/2)^{t}\leq 1/4. Combining this with the term from the “no-collision” case, we have

\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})\geq(1-\tau_{k})(\epsilon^{2}/4+\epsilon/2)+\tau_{k}+\epsilon/2\left(\tau_{k}-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}\right)
=\tau_{k}+(1-\tau_{k})\epsilon^{2}/4+\epsilon/2\left(1-\frac{k}{N}\left(1-\frac{1}{N}\right)^{k-1}\right).

Finally, notice that the upper bound for the collision case is tight if there is a collision with \sigma_{i}^{-}=1. In the asymptotic regime where k\to\infty, the probability of such a collision goes to 1, and simultaneously \tau_{k}\to 1. Thus in this limit the upper bound is tight, and we obtain \lim_{k\to\infty}\mathcal{L}_{\textup{NCE}}^{(k)}(f_{2})=1+\epsilon.
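
As a sanity check on the f_{2} analysis (illustrative only; N, k, \epsilon, and the trial count are arbitrary choices), the sketch below simulates the NCE hinge loss of f_{2} directly and confirms that the empirical value falls between the crude lower and upper bounds derived above.

# Monte Carlo check that the simulated NCE hinge loss of f_2 lies between the two bounds.
import numpy as np

N, k, eps, trials = 32, 10, 0.25, 200_000              # eps < (sqrt(3) - 1) / 2
rng = np.random.default_rng(1)
tau_k = 1 - (1 - 1.0 / N) ** k

sigma = rng.choice([-1.0, 1.0], size=trials)            # sign for x (anchor, class 0)
sigma_plus = rng.choice([-1.0, 1.0], size=trials)       # sign for x^+
neg_class = rng.integers(0, N, size=(trials, k))        # classes of the k negatives
neg_sigma = rng.choice([-1.0, 1.0], size=(trials, k))   # signs for the negatives

anchor = 1 + sigma * eps                                # f_2(x)^T e_0
pos = 1 + sigma_plus * eps                              # f_2(x^+)^T e_0
neg = np.where(neg_class == 0, 1 + neg_sigma * eps, 0.0)   # f_2(x_i^-)^T e_0 (0 unless collision)
scores = anchor[:, None] * (neg - pos[:, None])            # f_2(x)^T (f_2(x_i^-) - f_2(x^+))
loss = np.maximum(0.0, 1 + scores.max(axis=1)).mean()

base = (1 - tau_k) * (eps ** 2 / 4 + eps / 2)
print(base + tau_k, loss, base + tau_k * (1 + eps))     # lower bound <= empirical loss <= upper bound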

Dependence on k/N.

Next, we claim that (bounds on) the first crossing point between the NCE losses for f_{1} and f_{2} depend on k only through the ratio k/N. To see why, recall the elementary inequalities

1-x\leq(1-x/n)^{n}\leq\exp(-x)\leq 1-x+x^{2}.

This means that a sufficient condition for f_{1} to be selected by NCE is

\frac{2k}{N}\leq(1-k/N)(\epsilon^{2}/4+\epsilon/2)+(k/N-k^{2}/N^{2}),

while a sufficient condition for f_{2} to be selected is

2k/N-4k^{2}/N^{2}\geq(1-k/N+k^{2}/N^{2})(\epsilon^{2}/4+\epsilon/2)+k/N(1+\epsilon).

These are both quadratic inequalities in k/N. Thus, if they admit solutions, the corresponding values of k scale linearly with N.
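
This can be made concrete numerically; the sketch below (\epsilon=0.1 chosen so that both sufficient conditions admit real solutions) solves the two quadratics in r=k/N. The resulting thresholds depend only on r, so the corresponding values of k scale linearly with N.

# The two sufficient conditions as quadratics in r = k/N (eps chosen for illustration).
import numpy as np

eps = 0.1
a = eps ** 2 / 4 + eps / 2                               # the no-collision term eps^2/4 + eps/2
# f_1 selected when 2r <= (1 - r)a + (r - r^2), i.e. r^2 + (1 + a)r - a <= 0
r_f1 = np.roots([1.0, 1.0 + a, -a]).real.max()
# f_2 selected when 2r - 4r^2 >= (1 - r + r^2)a + r(1 + eps),
# i.e. (4 + a)r^2 - (1 + a - eps)r + a <= 0
r_f2 = np.sort(np.roots([4.0 + a, -(1.0 + a - eps), a]).real)
print(r_f1)    # f_1 is preferred whenever k/N is below this threshold
print(r_f2)    # f_2 is preferred when k/N lies in this interval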

Appendix D Experiment Details

D.1 Wiki-3029 NLP Experiments

We present additional details for the Wiki-3029 results below.

Dataset details.

We use the dataset provided by Saunshi et al. [2019] (https://nlp.cs.princeton.edu/CURL/raw/). This dataset was released by the authors, and we also explicitly obtained their permission to use it for this study. We use the scripts provided to us by the authors to generate the Wiki-3029 dataset.

Training details.

We perform NCE training for a fixed number of epochs. In each epoch, we take sentences from each article and pair them randomly to form a set of positively aligned examples (x,x^{+}). We mix pairs across classes, divide this dataset into batches of a given size B, and fit the representation using the Adam optimizer on the NCE loss. Formally, for a given batch \mathcal{B}=\{(x_{i},x_{i}^{+})\}_{i=1}^{B} we optimize the loss:

\mathcal{L}_{\mathrm{nce}}(\theta)=-\frac{1}{2B}\sum_{i=1}^{B}\ln\frac{w(x_{i},x^{+}_{i})}{\sum_{j\neq i}\left\{w(x_{i},x^{+}_{j})+w(x_{i},x_{j})\right\}}-\frac{1}{2B}\sum_{i=1}^{B}\ln\frac{w(x_{i},x^{+}_{i})}{\sum_{j\neq i}\left\{w(x^{+}_{i},x^{+}_{j})+w(x^{+}_{i},x_{j})\right\}},

where w(x,x^{\prime})=\exp(f_{\theta}(x)^{\top}f_{\theta}(x^{\prime})/\tau), \theta denotes the parameters of the bag-of-words model f, and \tau is a temperature hyperparameter. The parameters \theta consist of a word embedding for every token in the vocabulary, including the UNK token.
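
A minimal PyTorch sketch of this batch objective is given below; it is not the released training code, and the function and variable names (nce_batch_loss, z, z_pos) are ours. It assumes the encoder has already been applied, so the inputs are the B x d matrices of embeddings f_{\theta}(x_{i}) and f_{\theta}(x_{i}^{+}).

# Sketch of the batch NCE objective above (not the authors' released code).
import torch

def nce_batch_loss(z, z_pos, tau=1.0):
    B = z.shape[0]
    w_zp = torch.exp(z @ z_pos.T / tau)        # w(x_i, x_j^+)
    w_zz = torch.exp(z @ z.T / tau)            # w(x_i, x_j)
    w_pp = torch.exp(z_pos @ z_pos.T / tau)    # w(x_i^+, x_j^+)
    w_pz = torch.exp(z_pos @ z.T / tau)        # w(x_i^+, x_j)
    off = ~torch.eye(B, dtype=torch.bool)      # mask selecting j != i
    num = torch.diag(w_zp)                     # w(x_i, x_i^+)
    den1 = (w_zp * off).sum(dim=1) + (w_zz * off).sum(dim=1)
    den2 = (w_pp * off).sum(dim=1) + (w_pz * off).sum(dim=1)
    return -0.5 * (torch.log(num / den1).mean() + torch.log(num / den2).mean())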

At the end of NCE training, we compute a mean classifier vector v_{c} for an article c by averaging f_{\theta}(x) over every sentence x from article c in the training dataset. For a given sentence x^{\prime} in the test dataset, we assign it the class \arg\max_{c}f_{\theta}(x^{\prime})^{\top}v_{c}.
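
A sketch of this mean-classifier evaluation (hypothetical helper; numpy arrays of embeddings and labels assumed) follows:

# Mean classifier: average training embeddings per article, then assign by inner product.
import numpy as np

def mean_classifier_predict(train_emb, train_labels, test_emb, num_classes):
    d = train_emb.shape[1]
    V = np.zeros((num_classes, d))
    for c in range(num_classes):
        V[c] = train_emb[train_labels == c].mean(axis=0)   # v_c: mean embedding of article c
    return (test_emb @ V.T).argmax(axis=1)                 # argmax_c f_theta(x')^T v_c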

We further train a linear classifier on top of f_{\theta}(x) for a fixed number of epochs. We optimize the logistic loss using the Adam optimizer. Formally, we use the entire NCE training dataset and do mini-batch learning. Given a batch \{(x_{i},c_{i})\}_{i=1}^{B} of sentences x_{i} and their article labels c_{i}, we optimize the loss:

\mathcal{L}_{\mathrm{downstream}}(W)=-\frac{1}{B}\sum_{i=1}^{B}\ln\frac{\exp\left((W^{\top}f_{\theta}(x_{i}))_{c_{i}}\right)}{\sum_{c^{\prime}}\exp\left((W^{\top}f_{\theta}(x_{i}))_{c^{\prime}}\right)},

where W denotes the parameters of the linear classifier. We keep the parameters \theta fixed while training the linear classifier. We save the linear classifier at the end of each epoch and compute the aggregate training loss recorded during that epoch. Finally, we report the performance of the model with the smallest aggregate training loss.

We initialize W and \theta randomly using the default PyTorch initialization.
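
A minimal sketch of this downstream step (not the released code; encoder, loader, and the default hyperparameter values are placeholders) trains a linear head with the logistic loss on top of the frozen encoder f_{\theta}.

# Downstream linear classifier trained on top of a frozen encoder (illustrative sketch).
import torch
import torch.nn.functional as F

def train_linear_head(encoder, loader, d, num_classes, epochs=50, lr=0.01):
    head = torch.nn.Linear(d, num_classes, bias=False)   # plays the role of W
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, c in loader:                  # sentences and their article labels
            with torch.no_grad():            # theta is kept fixed
                z = encoder(x)
            loss = F.cross_entropy(head(z), c)   # the softmax / logistic loss above
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head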

Hyperparameters.

We used the following hyperparameters for the Wiki-3029 experiments.

Hyperparameter Value
Word embedding dimension 768
Optimization method Adam
Parameter initialization PyTorch 1.4 default
Temperature 1
Gradient clip norm 2.5
NCE epochs 50
Downstream training epochs 50
Learning rate 0.01
Model architecture Bag of words
Data augmentation None
Table 1: Hyperparameters for Wiki-3029

Detailed Figures with Error Bars.

We show a detailed version of the left plot of Figure 3 in Figure 5. This shows the mean classifier accuracy for varying N and k at a fixed batch size of B=1000. The gap between the best and worst performance in each row is quite noticeable (≈2-3%), and this is much larger than the standard deviation (\leq 0.25). We also plot the performance of the trained linear classifier in Figure 6. Note that the general trend is similar to Figure 5, and an intermediate value of k seems to be best for every value of N. However, we see that in this case the value k=50 is optimal across all values of N.

Figure 5: Mean classifier accuracy on the Wiki-3029 dataset for different values of the number of classes (N) and the number of negative examples (k), at a fixed batch size of 1000. Values show average performance across 5 trials with different seeds; all standard deviations are at most 0.25 and typically much smaller.
Figure 6: Trained linear classifier accuracy on the Wiki-3029 dataset for different values of the number of classes (N) and the number of negative examples (k), at a fixed batch size of 1000. Values show average performance across 5 trials with different seeds; all standard deviations are at most 0.28 and typically much smaller.

We show a detailed version of the center plot of Figure 3 in Figure 7. This shows the mean and trained classifier accuracy for different values of B, k, and N. We observe that the gap between the best and worst performance is noticeable (≈1% for the mean classifier and ≈2-3% for the trained classifier). Further, the standard deviation is sufficiently low to enable reading trends from the mean performance. For the trained classifier, we observe that the optimal performance is achieved with the largest value of B and an intermediate value of k, similar to the mean classifier.

(a) Mean classifier
(b) Trained classifier
Figure 7: Mean and trained classifier accuracy on the Wiki-3029 dataset for different values of batch size B and number of negative examples k, for two values of the number of classes N. Left: mean classifier performance (all standard deviations are less than 0.16). Right: trained linear classifier performance (all standard deviations are less than 0.26).

Compute Infrastructure.

We run our experiments on a cluster containing a mixture of P40, P100, and V100 GPUs. Each experiment runs on a single GPU in a Docker container and takes less than 5 hours to complete on a V100 GPU.

D.2 Vision Experiment Details

Vision experiments used the same infrastructure as NLP experiments, but took roughly two days to fully run. We train models on the NCE task for 400 epochs. We used the following hyperparameters to run the included CIFAR-10 experiments:

Hyperparameter Value
Latent dimension 128
Gradient clip norm None
Optimization method Adam
Parameter initialization PyTorch 1.4 default
Learning rate 10^{-4}
Model architecture ResNet-18
Table 2: Hyperparameters for vision

Additional Experiments.

Here we provide a plot similar to Figure 4, but using a mean classifier rather than a trained linear classifier for the downstream task. The trend in Figure 8 is similar, with results varying based on both the dataset and whether or not augmentation is present. Interestingly, mean classifier performance on SVHN with augmented data is significantly worse than for the trained classifier version in Figure 4. This is possibly due to the fact that the augmentation was not designed for digit data and includes, for example, horizontal flips.

Figure 8: A companion to Figure 4, showing mean classifier performance, as used in our theory, rather than trained classifier performance. Here we show classifier accuracy as a function of both batch size and k for various vision datasets and NCE parameters. Left: SVHN data, using a true positive sample from the same class, as used in our analysis. Center Left: SVHN data, but using data augmentation; here the positive sample is an augmented version of the reference sample, and negatives are augmented as well. Center Right: NCE on CIFAR-10 data, using real positive examples. Right: CIFAR-10 results with data augmentation.