The Power of Contrast for Feature Learning:
A Theoretical Analysis
Abstract
Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, the theoretical understanding of the superiority of contrastive learning is still limited. In this paper, under linear representation settings, (i) we provably show that contrastive learning outperforms standard autoencoders and generative adversarial networks, two classical generative unsupervised learning methods, for both feature recovery and in-domain downstream tasks; (ii) we also illustrate the impact of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations on in-domain downstream tasks, but can harm performance in transfer learning. We verify our theory with numerical experiments.
Keywords: Self-Supervised Learning, Contrastive Learning, Principal Component Analysis, Spiked Covariance Model, Supervised Contrastive Learning
1 Introduction
Deep supervised learning has achieved great success in various applications, including computer vision (Krizhevsky et al., 2012), natural language processing (Vaswani et al., 2017), and scientific computing (Han et al., 2018). However, its dependence on manually assigned labels, which are usually difficult and costly to obtain, has motivated research into alternative approaches that exploit unlabeled data. Self-supervised learning is a promising approach that leverages the unlabeled data itself as supervision and learns representations that are beneficial to potential in-domain downstream tasks.
At a high level, there are two common approaches to feature extraction in self-supervised learning: generative and contrastive (Liu et al., 2021; Jaiswal et al., 2021). Both approaches aim to learn latent representations of the original data; the difference is that the generative approach focuses on minimizing the reconstruction error from the latent representations, whereas the contrastive approach aims to increase the agreement between representations of positive pairs constructed by data augmentation while decreasing that of negative pairs. Recent works have shown the benefits of contrastive learning in practice (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b, c). However, these works did not explain the popularity of contrastive learning: what is the advantage of contrastive learning and where does it come from?
Additionally, recent works aim to further improve contrastive learning by introducing label information. Specifically, Khosla et al. (2020) proposed supervised contrastive learning, where the contrasting procedures are performed across different classes rather than different instances. With the help of label information, their proposed method outperforms self-supervised contrastive learning and classical cross-entropy-based supervised learning. However, despite this improvement on in-domain downstream tasks, Islam et al. (2021) found that the corresponding improvement in transfer learning is limited and can even be negative for supervised contrastive learning. This phenomenon motivates us to rethink the impact of labeled data in the contrastive learning framework.
In this paper, we first establish a theoretical framework to study contrastive learning under the linear representation setting. Under this framework, we provide a theoretical analysis of the feature learning performance of contrastive learning on the spiked covariance model (Bai and Yao, 2012; Yao et al., 2015; Zhang et al., 2018) and theoretically justify why contrastive learning outperforms standard autoencoders and generative adversarial networks (GANs) (Goodfellow et al., 2014): contrastive learning is able to remove more noise by constructing contrastive samples via augmentations. Moreover, we investigate the impact of label information in the contrastive learning framework and provide a theoretical justification of why labeled data help to gain accuracy in in-domain regression and classification but can hurt multi-task transfer learning.
1.1 Related Works
The idea of contrastive learning was first proposed in Hadsell et al. (2006) as an effective method for dimensionality reduction. Following this line of research, Dosovitskiy et al. (2014) proposed to perform instance discrimination by creating surrogate classes for each instance, and Wu et al. (2018) further proposed to maintain a memory bank as a dictionary of negative samples. Other extensions based on this memory bank approach include He et al. (2020); Misra and Maaten (2020); Tian et al. (2020); Chen et al. (2020c). Rather than keeping a costly memory bank, another line of work exploits the benefit of mini-batch training, where different samples within a batch are treated as negatives of each other (Ye et al., 2019; Chen et al., 2020a). Moreover, Khosla et al. (2020) explored the supervised version of contrastive learning, where pairs are generated based on label information.
Despite its success in practice, the theoretical understanding of contrastive learning is still limited. Previous works provide provable guarantees for contrastive learning under a conditional independence assumption (or its variants) (Arora et al., 2019; Lee et al., 2021; Tosh et al., 2021; Tsai et al., 2020). Specifically, they assume the two contrastive views are independent conditioned on the label and show that contrastive learning can provably learn representations beneficial for in-domain downstream tasks. In addition to this line of research, there exist several alternative perspectives for studying the theoretical properties of contrastive learning. To name a few, Wang and Isola (2020); Graf et al. (2021) explored the representation geometry, HaoChen et al. (2021) analyzed the augmentation graph, Tian (2022) proposed a two-player game theory framework, Zimmermann et al. (2021) demonstrated the connection between contrastive learning and nonlinear Independent Component Analysis (Hyvärinen et al., 2009), Saunshi et al. (2022) showed the importance of inductive bias in contrastive learning, and Jing et al. (2021) investigated the dimensional collapse phenomenon. Furthermore, Tian et al. (2021); Wang et al. (2021) explored the ability of self-supervised learning to learn features even without contrastive pairs, specifically in the context of linear representation settings.
More relevant to this paper, Wen and Li (2021) considered representation learning under the sparse coding model and studied the optimization properties of shallow ReLU neural networks. However, their assumptions that features are extremely sparse and that signals follow a Gaussian distribution seem strong for real data. Garg and Liang (2020) studied the combination of supervised learning and self-supervised learning. They derived sample complexity bounds in a PAC-learning style for various settings. Specifically, the authors assume that there is a ground-truth representation that keeps both the self-supervised loss and the supervised loss below a very low threshold. However, as the authors admit, it is hard to determine such a threshold in practical settings. For example, when the unlabeled data and labeled data come from different domains, such as ImageNet and CIFAR-10, domain-specific features may have a much lower loss compared with domain-transferable features.
While the aforementioned works aim to demonstrate that contrastive learning is capable of learning meaningful representations, the question of why contrastive learning outperforms other representation learning methods was left untouched. We also shed light on the impact of labeled data in the contrastive learning framework, which is underexplored in prior works. A detailed comparison with the existing literature is deferred to Appendix A.1.
1.2 Outline
This paper is organized as follows. Section 2 provides the setup for the data-generating process and the loss function. In Section 3, we review the connection between PCA and autoencoders/GANs. We also establish a theoretical framework to study contrastive learning in the linear representation setting. Under this framework, we evaluate the feature recovery performance and in-domain downstream task performance of contrastive learning and autoencoders. In Section 4, we analyze supervised contrastive learning. In Section 5, we verify the theoretical results given in Sections 3 and 4. Finally, we summarize our analysis and provide future directions in Section 6.
1.3 Notations
In this paper, we use $\lesssim$ to hide universal constants: for two sequences of positive numbers $\{a_n\}$ and $\{b_n\}$, we write $a_n \lesssim b_n$ if and only if there exists a universal constant $C$ such that $a_n \le C b_n$ for any $n$. We write $a_n \asymp b_n$ when $a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. We use $\|v\|_2$, $\|M\|$, and $\|M\|_F$ to represent the $\ell_2$ norm of vectors, the spectral norm of matrices, and the Frobenius norm of matrices, respectively. Let $\mathcal{O}(p, r)$ be the set of $p \times r$ orthogonal matrices, namely $\mathcal{O}(p, r) = \{V \in \mathbb{R}^{p \times r} : V^\top V = I_r\}$. We write $a \ll b$ when there exists a sufficiently small constant $c$, depending only on the constants in the assumptions and independent of $n$, $p$, and $r$, such that $a \le c\, b$ holds; $a \gg b$ is defined similarly. We use $|S|$ to denote the cardinality of a set $S$. For any $m \in \mathbb{N}$, let $[m] = \{1, \dots, m\}$. We use the sine distance to measure the discrepancy between two orthogonal matrices $U, V \in \mathcal{O}(p, r)$, defined by $\|\sin\Theta(U, V)\| = \|U_\perp^\top V\|$, where $U_\perp$ is any orthogonal complement of $U$. More properties of the sine distance can be found in Section A.3. We use $e_1, \dots, e_p$ to denote the canonical basis of the $p$-dimensional Euclidean space $\mathbb{R}^p$; that is, $e_i$ is the vector whose $i$-th coordinate is $1$ and all other coordinates are $0$. Let $\mathbb{1}\{A\}$ be the indicator function that takes the value $1$ when $A$ is true and $0$ otherwise. We write $a \wedge b$ and $a \vee b$ to denote $\min(a, b)$ and $\max(a, b)$, respectively.
2 Setup
Here we introduce the loss functions and data-generating models that will be used in the theoretical analysis.
2.1 Linear Representation Settings for Contrastive Learning
Given an input, contrastive learning aims to learn a low-dimensional representation by contrasting different samples, that is, by maximizing the agreement between positive pairs and minimizing the agreement between negative pairs. Suppose we have data points from the population distribution. The contrastive learning task can be formulated as the following optimization problem:
(1)
where is a contrastive loss and is a regularization term; are the sets of positive samples and negative samples corresponding to , the details of which are described below.
Linear Representation and Regularization Term
We consider the linear representation function , where the parameter is a matrix . This linear representation setting has been widely adopted in other theoretical papers to understand self-supervised contrastive learning (Jing et al., 2021; Wang et al., 2021; Tian et al., 2021) and to shed light on other complex machine learning phenomena, such as in Tripuraneni et al. (2021). Moreover, since regularization techniques have been widely adopted in contrastive learning practice (Chen et al., 2020a; He et al., 2020; Grill et al., 2020), we further penalize the representation by a regularization term to encourage the orthogonality of the representation matrix and therefore promote the diversity of the learned representations. The reason we use such a quadratic regularization instead of a standard regularization is to encourage a diverse representation in the linear representation setting by penalizing the similarity ; we defer a formal discussion and numerical experiments on this regularization to Appendix A.2.
Linear Contrastive Loss
The contrastive loss is set to be the average similarity (measured by the inner product) between negative pairs minus that between positive pairs, so that minimizing it maximizes the agreement of positive pairs and minimizes that of negative pairs:
(2)
where are sets of positive samples and negative samples corresponding to . This loss function has been commonly used in contrastive learning (Hadsell et al., 2006) and metric learning (Schroff et al., 2015; He et al., 2018). In Khosla et al. (2020), the authors show that the inner-product-based linear loss (2) is an approximation of the NT-Xent contrastive loss when one positive and one negative sample are used; the NT-Xent loss has been highlighted in recent contrastive learning practice (Sohn, 2016; Wu et al., 2018; Oord et al., 2018; Chen et al., 2020a). In Li et al. (2021), the authors proposed the SSL-HSIC contrastive loss, which reduces to this linear loss when the kernel is chosen to be a simple inner product. Following Li et al. (2021), we provide the results in Table 1, which show that the linear contrastive loss can also work well with some additional training techniques.
| Testing Accuracy | InfoNCE | Linear contrastive loss |
|---|---|---|
| CIFAR10 | | |
| STL10 | | |
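To make the linear loss concrete, the following NumPy sketch evaluates a loss of this form for a given representation matrix; the function name, the plain averaging over index pairs, and the sign convention (negative-pair similarity minus positive-pair similarity) are our illustrative choices rather than the exact implementation behind Table 1.

```python
import numpy as np

def linear_contrastive_loss(W, X, pos_idx, neg_idx):
    """Linear contrastive loss: average negative-pair similarity minus
    average positive-pair similarity of the representations W x.

    W       : (r, p) linear representation matrix
    X       : (n, p) data matrix (rows are samples)
    pos_idx : list of (i, j) index pairs treated as positive pairs
    neg_idx : list of (i, j) index pairs treated as negative pairs
    """
    Z = X @ W.T                                    # (n, r) representations
    pos = np.mean([Z[i] @ Z[j] for i, j in pos_idx])
    neg = np.mean([Z[i] @ Z[j] for i, j in neg_idx])
    return neg - pos                               # small when positives agree
```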
2.2 Generation of Positive and Negative Pairs
There are two common approaches to generating positive and negative pairs, depending on whether or not label information is available. When the label information is not available, the typical strategy is to generate different views of the original data via augmentation (Hadsell et al., 2006; Chen et al., 2020a). Two views of the same data point serve as the positive pair for each other, while those of different data serve as negative pairs.
Definition 2.1 (Augmented Pairs Generation in the Self-supervised Setting)
Given two augmentation functions and training samples , the augmented views are given by: , . Then for each view , , the corresponding positive samples and negative samples are defined by: and .
The loss function of the self-supervised contrastive learning problem can then be written as:
(3)
In particular, we adopt the following augmentation in our analysis.
Definition 2.2 (Random Masking Augmentation)
The two views of the original data are generated by randomly dividing its coordinates into two disjoint sets, that is, , where is the diagonal masking matrix whose diagonal entries are random variables sampled from a Bernoulli distribution with mean .
Remark 2.3
In this paper, we focus on random masking augmentation, which has also been used in other works on the theoretical understanding of contrastive learning, e.g., Wen and Li (2021). However, our primary interest lies in comparing the performance of contrastive learning with autoencoders and analyzing the impact of labeled data, while their work focuses on understanding the training process of neural networks in contrastive learning. Random masking augmentation is an analog of the random cropping augmentation used in practice. As shown in Chen et al. (2020a), cropping augmentation achieves overwhelming performance on linear evaluation (ImageNet top-1 accuracy) compared with other augmentation methods; see Figure 5 in Chen et al. (2020a) for details.
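As an illustration of Definition 2.2, the following sketch splits the coordinates of a single data point into two complementary masked views; the Bernoulli mean of 1/2 follows the definition, while keeping the masked coordinates at zero without any rescaling is an assumption of this sketch.

```python
import numpy as np

def random_masking_views(x, rng=None):
    """Split the coordinates of x into two disjoint random sets and
    return the two masked views (as in Definition 2.2).

    x : (p,) observed data vector
    """
    rng = np.random.default_rng(rng)
    mask = rng.binomial(1, 0.5, size=x.shape[0])  # Bernoulli(1/2) per coordinate
    view1 = mask * x          # coordinates kept in the first view
    view2 = (1 - mask) * x    # complementary coordinates in the second view
    return view1, view2
```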
When the label information is available, Khosla et al. (2020) proposed the following approach to generate positive and negative pairs.
Definition 2.4 (Pairs Generation in the Supervised Setting)
In a -class classification problem, given samples for each class : and letting , the corresponding positive samples and negative samples for are defined by and . That is, the positive samples are the remaining samples in the same class as , and the negative samples are the samples from different classes.
Correspondingly, the loss function of the supervised contrastive learning problem can be written as:
(4)
2.3 Data Generating Process
In real-world scenarios, data often comprises both signal (relevant information) and noise (irrelevant distractions). For instance, in image classification, the signal might be the primary subject of interest, while the noise could represent background elements. Self-supervised learning methods, without predefined tasks, aim to extract generalized patterns from data, ideally capturing as much of the signal as possible. It is commonly understood that signals tend to exhibit specific low-complexity structures, often being low-rank and showing higher correlations across coordinates. In contrast, background noise might lack a distinct structure, potentially being dense (or full rank) with lower coordinate correlations. To delve into this structural difference more rigorously, we consider an additive data-generating model. Here, the observed data emerges as a combination of a low-rank signal and dense noise.
(5)
where and are both zero-mean sub-Gaussian independent random variables, and is a constant that represents the signal strength. In particular, and . The first term represents the signal of interest residing in a low-dimensional subspace spanned by the columns of . The second term is dense heteroskedastic noise. Given that, the ideal low-dimensional representation compresses the observed data into a low-dimensional representation spanned by the columns of . This model is known as the spiked covariance model (Johnstone, 2001; Bai and Yao, 2012; Yao et al., 2015; Zhang et al., 2018). It was motivated by the empirical observation that the eigenvalues of the sample covariance matrix of phoneme data have a few "spikes", which correspond to the low-dimensional structure of the data generation. The model has been used in the literature on PCA (Johnstone, 2001; Deshpande and Montanari, 2014; Zhang et al., 2018) and contrastive learning (Wen and Li, 2021).
In this paper, we aim to learn a good projection onto a lower-dimensional subspace from the observations. Since the information of the signal is invariant under the transformation for any , the essential information is contained in the right eigenspace of . Thus, we quantify the goodness of the representation using the sine distance , where is the top- right eigenspace of . Notably, we only assume that the noise and the signal follow sub-Gaussian distributions; this includes bounded noise/signals such as image, sound, or text data.
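For concreteness, the following sketch draws data from a model of the form (5) with Gaussian core features and heteroskedastic Gaussian noise and computes a sine distance between subspaces; the specific constants (signal strength, noise scales), the function names, and the spectral-norm convention for the sine distance are illustrative assumptions of this sketch.

```python
import numpy as np

def sample_spiked_data(n, p, r, signal=1.0, rng=None):
    """Draw n samples x = signal * U z + xi from a spiked covariance model
    with a random r-dimensional core subspace spanned by the columns of U."""
    rng = np.random.default_rng(rng)
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))   # orthonormal U (p x r)
    Z = rng.standard_normal((n, r))                    # core features z
    noise_sd = rng.uniform(0.5, 1.5, size=p)           # heteroskedastic noise levels
    Xi = rng.standard_normal((n, p)) * noise_sd        # dense noise xi
    X = signal * Z @ U.T + Xi
    return X, U

def sine_distance(U_hat, U):
    """Sine distance between subspaces: spectral norm of U_perp^T U_hat."""
    U_perp = np.linalg.svd(U, full_matrices=True)[0][:, U.shape[1]:]
    return np.linalg.norm(U_perp.T @ U_hat, ord=2)
```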
3 Comparison of Self-Supervised Contrastive Learning and Autoencoders/GANs
Generative and contrastive learning are two popular approaches to self-supervised learning. Recent experiments have highlighted the improved performance of contrastive learning compared with the generative approach. For example, in Figure 1 of Chen et al. (2020a) and Figure 7 of Liu et al. (2021), state-of-the-art contrastive self-supervised learning shows more than a 10 percent improvement over state-of-the-art generative self-supervised learning with the same number of parameters. In this section, we rigorously demonstrate the advantage of contrastive learning over autoencoders/GANs, the representative methods of generative self-supervised learning, by investigating the linear representation setting under the spiked covariance model (5). The investigation is conducted for both feature recovery and in-domain downstream tasks.
Hereafter, we focus on the linear representation setting. This section is organized as follows: in Section 3.1, we first review the connection between principal component analysis (PCA) and autoencoders/GANs, two representative generative approaches in self-supervised learning, under the linear representation setting. Then we establish the connection between contrastive learning and PCA in Section 3.2. Based on these connections, we compare contrastive learning and autoencoders in terms of feature recovery ability (Section 3.3) and in-domain downstream performance (Section 3.4).
3.1 Autoencoders, GANs and PCA
Autoencoders are popular unsupervised learning methods for dimensionality reduction. An autoencoder learns two functions: an encoder and a decoder. The encoder compresses the original data into low-dimensional features, and the decoder recovers the original data from those features. Learning can be formulated as the following optimization problem for samples (Ballard, 1987; Fan et al., 2019):
(6)
By minimizing this loss, autoencoders try to preserve, in the low-dimensional representation, the essential features needed to recover the original data. In our setting, we consider the class of linear functions for and . The loss function is set as the mean squared error. Write and . Namely, we consider the following problem.
Let . By Theorem 2.4.8 in Golub and Van Loan (1996), the optimal solution is given by the top eigenspace of , which exactly corresponds to the result of PCA. Thus, in the linear representation setting, autoencoders are equivalent to PCA; this model is often known as the undercomplete linear autoencoder (Bourlard and Kamp, 1988; Plaut, 2018; Fan et al., 2019). We write the low-rank representation obtained by autoencoders as
(7)
where contains the top- eigenvectors of the matrix , is a diagonal matrix of the corresponding spectral values, and can be any orthonormal matrix.
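In the linear setting, the autoencoder/PCA representation of Equation (7) can be computed directly from an eigendecomposition rather than by gradient training; the following sketch returns only the top-r eigenvectors, since the diagonal scaling and the rotation do not affect the recovered subspace, and the function name is our own.

```python
import numpy as np

def linear_autoencoder_subspace(X, r):
    """Top-r eigenvectors of the sample second-moment matrix X^T X / n,
    i.e. the subspace recovered by an undercomplete linear autoencoder (PCA)."""
    n = X.shape[0]
    S = X.T @ X / n                          # sample second-moment matrix (p x p)
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    return eigvecs[:, -r:]                   # top-r eigenvectors (p x r)
```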
We also note that GANs (Goodfellow et al., 2014) are related to PCA. Namely, Feizi et al. (2020) showed that the global solution of GANs recovers the empirical PCA solution as the generative model.
To see this, let be the second-order Wasserstein distance. Also let be the set of linear generator functions from . Consider the following GAN optimization problem:
(8)
where denotes the empirical distribution of i.i.d. data and is the generated distribution with generator and . Note that the optimization problem in Equation (8) can be written as , where the first minimization is over probability distributions that have marginals and . By Theorem 2 in Feizi et al. (2020), the optimizer of the problem in Equation (8) is obtained as , where satisfies . This implies that is also a solution to the optimization problem in Equation (8). Hence GANs learn the PCA solution as a generator.
Given this equivalence among ordinary PCA, autoencoders, and GANs, we focus only on autoencoders hereafter for brevity.
3.2 Contrastive Learning and Diagonal-Deletion PCA
Here we bridge PCA and contrastive learning with certain augmentations under the linear representation setting. Recall that the optimization problem for self-supervised contrastive learning is formulated as:
(9)
To compare contrastive learning with autoencoders, we now derive the solution of the optimization problem (9). We start with the general result for self-supervised contrastive learning with augmented pairs generation in Definition 2.1, and then turn to the special case of random masking augmentation (Definition 2.2).
Proposition 3.1
For two fixed augmentation functions , denote the augmented data matrices as and . When the augmented pairs are generated as in Definition 2.1, all the optimal solutions of the contrastive learning problem (9) are given by:
where is a positive constant, is the -th largest eigenvalue of the following matrix:
(10)
is the corresponding eigenvector and can be any orthonormal matrix.
The proof is given in Appendix B.1.
Proposition 3.1 is a general result for augmented pair generation with fixed and deterministic augmentation functions. The result depends only on the augmented data matrices; thus, it is straightforward to generalize to the case where different augmentation functions are applied to different samples, which we omit for notational simplicity. Moreover, when the augmentation is sampled from a stochastic distribution, we can also characterize the optimal solution of the expected loss in the same way. Specifically, if we apply the random masking augmentation (Definition 2.2), we can further characterize the optimal solution. For any square matrix , we denote to be with all off-diagonal entries set to zero and to be with all diagonal entries set to zero. Then we have the following corollary for random masking augmentation.
Corollary 3.2
Under the same conditions as in Proposition 3.1, if we use random masking (Definition 2.2) as our augmentation function, then the minimizer of the expected loss function of contrastive learning problem (9) over the distribution of random augmentations (i.e., ) is given by:
where is a positive constant, is the -th largest eigenvalue of the following matrix:
(11)
is the corresponding eigenvector and can be any orthonormal matrix.
The proof is given in Appendix B.2.
With Proposition 3.1 and Corollary 3.2 established, we find that self-supervised contrastive learning equipped with augmented pair generation and random masking augmentation can eliminate the effect of random noise on the diagonal entries of the observed covariance matrix. Since is a diagonal matrix, when the diagonal entries account for only a small proportion of the total Frobenius norm, contrasting augmented pairs preserves the core features while eliminating most of the random noise, giving a more accurate estimate of the core features.
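The following sketch implements the resulting diagonal-deletion estimator suggested by Corollary 3.2: form the sample second-moment matrix, zero out its diagonal, and keep the top-r eigenvectors; the scaling constant and rotation appearing in the corollary are omitted since they do not change the recovered subspace, and the function name is ours.

```python
import numpy as np

def contrastive_subspace_random_masking(X, r):
    """Top-r eigenvectors of the off-diagonal part of the sample
    second-moment matrix, matching (up to scaling and rotation) the
    population solution characterized in Corollary 3.2."""
    n = X.shape[0]
    S = X.T @ X / n
    S_offdiag = S - np.diag(np.diag(S))      # delete the diagonal entries
    eigvals, eigvecs = np.linalg.eigh(S_offdiag)
    return eigvecs[:, -r:]
```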
3.3 Feature Recovery from Noisy Data
After connecting both autoencoders and contrastive learning to PCA, we can now analyze the feature recovery ability to understand the benefit of contrastive learning over autoencoders. As mentioned above, our target is to recover the subspace spanned by the columns of , which can further help us obtain information on the unobserved signal that is important for in-domain downstream tasks. However, the observed data has covariance matrix rather than the desired , which makes representation learning difficult. We demonstrate that contrastive learning can better exploit the structure of the core features and obtain a better estimate than autoencoders in this setting.
We start with autoencoders. In the noiseless case, the covariance matrix is and autoencoders can perfectly recover the core features. However, in noisy cases, the random noise perturbs the core features, which can make autoencoders fail to learn them. Such noisy cases are widespread in real applications, for example, measurement errors or image backgrounds such as grass and sky. Interestingly, we will later show that contrastive learning can better recover the core features despite the presence of large noise.
To provide rigorous analysis, we first introduce the incoherent constant (Candès and Recht, 2009).
Definition 3.3 (Incoherent Constant)
We define the incoherence constant of as
(12)
Intuitively, the incoherent constant measures the degree of incoherence of the distribution of entries among different coordinates, or loosely speaking, the similarity between and the canonical basis . For uncorrelated random noise, the covariance matrix is diagonal and its eigenspace is exactly spanned by the canonical basis (if the diagonal entries of are all different), which attains the maximum value of the incoherent constant. In contrast, the core features usually exhibit certain correlation structures, and the corresponding eigenspace of the covariance matrix is expected to have a lower incoherent constant.
We then introduce a few assumptions which our theoretical results are built on. Recall that in the spiked covariance model (5), , and .
Assumption 3.4 (Regular Covariance Condition)
The condition number of the covariance matrix satisfies , where represents the -th largest value among and is a universal constant.
Assumption 3.5 (Signal to noise ratio condition)
Define the signal-to-noise ratio ; we assume , implying that the covariance of the noise is of the same order as that of the core features.
Assumption 3.6 (Incoherent Condition)
The incoherent constant of the core feature matrix satisfies
The incoherent constant often appears in the literature on matrix completion (Candès and Recht, 2009) and PCA (Zhang et al., 2018). The order of can be arbitrary as long as it decreases to as . One can directly adapt the later results to this setting. If is distributed uniformly on , then the expectation of the incoherent constant is of order .
Lemma 3.7 (Expectation of incoherent constant over a uniform distribution)
(13)
Thus, we set to this order for simplicity. The proof is given in Appendix B.6. Here we provide a remark on the implications of our assumptions, and we defer further discussion on how to generalize our main results under weaker assumptions to Remark 3.11.
Remark 3.8
The three assumptions above can be explained as follows. Assumption 3.4 implies that the variances of all dimensions are of the same order. For Assumption 3.5, we focus on a large-noise regime where the noise may hurt the estimation significantly. Here we assume the ratio lies in a constant range, but our theory can easily adapt to the case where has a decreasing order. Specifically, for Theorems 3.9, 3.10, 3.13, and 3.14 presented below, we derive an explicit dependence on for each result in the appendix; see Equations (46) and (59)-(62), as well as the corresponding display for contrastive learning, for details. Assumption 3.6 implies a stronger correlation among the coordinates of core features, which is the essential property distinguishing them from random noise.
Now we are ready to present our first result, showing that the autoencoders are unable to recover the core features in the large-noise regime. Due to the equivalence among PCA, autoencoders, and GANs we presented in Section 3.1, for brevity, we only focus on autoencoders hereafter.
Theorem 3.9 (Recovery Ability of Autoencoders, Lower Bound)
Consider the spiked covariance model (5) under Assumptions 3.4-3.6 and , and let be the learned representation of autoencoders with singular value decomposition (as in Equation (7)). If we further assume that are different from each other and that for some universal constant , then there exist two universal constants , such that when , we have
(14)
The proof is given in Appendix B.7. The condition means that there exists a sufficiently small constant independent of and such that holds. The additional assumptions, that are different from each other and that for some universal constant , are made to ensure the identifiability of the top- eigenspace. We need these conditions to guarantee the uniqueness of . As an extreme example, the top- eigenspace of the identity matrix can be any -dimensional subspace and is thus not unique. To avoid discussing such arbitrariness of the output, we make these assumptions to guarantee the separability of the eigenspace.
Then we investigate the feature recovery ability of the self-supervised contrastive learning approach.
Theorem 3.10 (Recovery Ability of Contrastive Learning, Upper Bound)
The proof is given in Appendix B.9. The two terms in Equation (15) can be explained as follows: the first term is due to the shift between the distributions of the augmented data and the original data. Specifically, the random masking augmentation generates two views with disjoint nonzero coordinates and thus can mitigate the influence of random noise on the diagonal entries of the covariance matrix. However, such augmentation slightly hurts the estimation of core features. This bias, appearing as the first term in Equation (15), is measured by the incoherent constant defined in Equation (12). The second term corresponds to the estimation error of the population covariance matrix.
Theorems 3.9 and 3.10 characterize the difference in feature recovery ability between autoencoders and contrastive learning. The autoencoders fail to recover most of the core features in the large-noise regime since has a trivial upper bound . In contrast, with the help of data augmentation, the contrastive learning approach mitigates the corruption of random noise while preserving core features. As and increase, it yields a consistent estimator of core features and further leads to better performance in the in-domain downstream tasks, as shown in the next section.
Remark 3.11
Here we discuss the potential generalization of our results to settings with weaker assumptions. Intuitively speaking, the random masking augmentation exploits the prior knowledge that the core features in the original signal are more structured across different coordinates than the random noise. Thus the essential requirements are:

1. Noise is less correlated between different coordinates compared with core features.
2. Core features and noise are very different, i.e., is large.
These two requirements correspond to the diagonal assumption on and the incoherence assumption on (Assumption 3.6). In particular, the latter gives a lower bound for when is diagonal and heteroskedastic. For more general and , it suffices to assume , and , and we can still draw a similar comparison under these assumptions. Notice that by Lemma 3.7, when is randomly chosen we immediately have . Similar arguments apply to all of the later results in this paper, and we omit them for simplicity.
Remark 3.12
A similar random masking augmentation (as in Definition 2.2) can also be applied to autoencoders. Although directly applying this augmentation would not work as well, since it does not affect the optimal solution (see the discussion in Appendix C), an alternative strategy is to reconstruct the whole data from the masked one. This method was originally proposed as denoising autoencoders (DAEs) for general augmentations (Vincent et al., 2008), and it was proven to be powerful with masking augmentation in a recently proposed representation learning method, masked autoencoders (MAEs) (He et al., 2021). DAEs are a variant of autoencoders trained to reconstruct the original image from randomly masked patches. It has been found that DAEs (especially MAEs) outperform other self-supervised methods such as MoCo v3, DINO, and BEiT after fine-tuning (He et al., 2021).
More specifically, under the same setup described in Section 2, let be the random masking augmentation defined in Definition 2.2. We adopt symmetric linear encoders and decoders. Given samples , we formally define the loss minimization problem of DAEs as follows. (In He et al. (2021), the loss function is computed on masked coordinates only, but as the authors noted, "This choice is purely result-driven: computing the loss on all pixels leads to a slight decrease in accuracy (e.g., 0.5%)." Hence we analyze the loss with respect to all coordinates for simplicity.)
(16)
Notice that the DAEs may not preserve the norm of the input since . As a result, we optimize the loss under the scaled constraint .
Then, we claim that under the same conditions as in Theorem 3.10, DAEs behave similarly to contrastive learning: let be any solution that minimizes Equation (16), and denote its singular value decomposition as ; then we have (the proof is given in Appendix C.2)
(17)
From Theorem 3.9, we know that in high-dimensional settings with large sample sizes, DAEs (or masked autoencoders) significantly outperform classic autoencoders. Moreover, compared to Theorem 3.10, the upper bound for DAEs is the same as that for contrastive learning with random masking augmentation. We also provide experimental results on synthetic datasets to verify this result in Appendix C.2. Although the two methods have similar performance in our linear representation framework, because both exploit masked views to eliminate noise, differences could arise from other aspects such as network architecture and training algorithms; for example, He et al. (2021) used a Vision Transformer (Dosovitskiy et al., 2020) while Chen et al. (2020a) used a ResNet (He et al., 2016).
3.4 Performance on In-Domain Downstream Tasks
In the previous section, we saw that contrastive learning can recover the core features effectively. In practice, we are interested in using the learned features on in-domain downstream tasks. He et al. (2020) experimentally showed that linear classifiers trained on representations learned with contrastive learning outperform several supervised learning methods on such in-domain downstream tasks.
Following this recent success, here we evaluate the in-domain downstream performance of simple predictors that take a linear transformation of the representation as input. Let and be the learned representations based on the training data . We observe a new signal independent of following the spiked covariance model (5). For simplicity, assume follows and is independent of . We consider two major types of in-domain downstream tasks: classification and regression. For the binary classification task, we observe a new supervised sample following the binary response model:
(18)
where is a known monotone increasing function satisfying for any , and is a unit vector of coefficients. Notice that our model (18) includes the logistic model (when ) and the probit model (when , where is the cumulative distribution function of the standard normal distribution). We can also interpret model (18) as a shallow neural network model with width for binary classification. For the regression task, we observe a new supervised sample following the linear regression model:
(19)
where is independent of , , and is a unit vector of coefficients as before. We can interpret this model as a principal component regression (PCR) model (Jolliffe, 1982) under standard error-in-variables settings, where we assume that the coefficients lie in a low-dimensional subspace spanned by the column vectors of . (In error-in-variables settings, the bias term of the measurement error appears in the prediction and estimation risk. Since our focus lies in proving better performance of contrastive learning against autoencoders, we ignore this unavoidable bias term by considering the excess risk.) We either estimate or predict the signal based on the observed samples contaminated by the measurement error . For details of PCR in error-in-variables settings, see, for example, Ćevid et al. (2020); Agarwal et al. (2020); Bing et al. (2021).
In the classification setting, we use the 0-1 loss, that is, for some predictor taking values in . For the regression task, we employ the squared error loss . Based on a learned representation , we consider a class of linear predictors, namely for the classification task and for the regression task, where is a weight vector. Note that the learned representation depends only on the unsupervised samples . Let and denote the expectations with respect to and , respectively.
Our goal as stated above is to bound the prediction risk of predictors constructed upon the learned representations and , that is, the quantity and .
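As a concrete instance of this evaluation protocol for the regression task, the following sketch freezes a learned representation (given as a p x r matrix whose columns span the learned subspace), fits a least-squares weight vector on labeled samples, and reports the squared-error risk on held-out samples; the small ridge term is a numerical safeguard added for this sketch and is not part of the estimators analyzed above.

```python
import numpy as np

def downstream_regression_risk(W_hat, X_train, y_train, X_test, y_test, ridge=1e-8):
    """Fit a linear predictor on top of the frozen representation z = W_hat^T x
    and return the mean squared prediction error on the held-out set."""
    Z_train, Z_test = X_train @ W_hat, X_test @ W_hat        # learned features
    A = Z_train.T @ Z_train + ridge * np.eye(W_hat.shape[1])
    theta = np.linalg.solve(A, Z_train.T @ y_train)          # least-squares weights
    resid = y_test - Z_test @ theta
    return np.mean(resid ** 2)
```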
Now we state our results on the performance of the in-domain downstream prediction task.
Theorem 3.13 (Excess Risk for In-Domain Downstream Task: Upper Bound)
Suppose the conditions in Theorem 3.10 hold. Then, for the regression task, we have
and for the classification task,
The proofs are given in Appendix B.11.
This result shows that the price of estimating by contrastive learning for an in-domain downstream prediction task can be made small when the core features lie in a relatively low-dimensional subspace and the number of samples is relatively large compared to the ostensible dimension of the data.
However, the in-domain downstream performance of autoencoders is not as good as that of contrastive learning. We obtain the following lower bound on the in-domain downstream prediction risk for autoencoders.
Theorem 3.14 (Excess Risk for In-Domain Downstream Task: Lower Bound)
Suppose the conditions in Theorem 3.9 hold. Assume holds for some constant . Additionally assume that is sufficiently small and . Then, for the regression task,
and for classification task, if is differentiable at and , then
where and are constants independent of and .
The proof is given in Appendix B.11. The condition means that there exists a sufficiently small constant independent of , and such that holds.
The constant appearing in Theorem 3.14 is independent of and . Thus, when is sufficiently large compared to and is small, the upper bound on in-domain downstream task performance via contrastive learning in Theorem 3.13 is smaller than the lower bound on in-domain downstream task performance via autoencoders. The assumption in Theorem 3.14 is made for clarity of presentation. Using the same techniques as in the proof of Theorem 3.14, one can obtain a constant lower bound for autoencoders under slightly stronger assumptions, for example, with , without assuming . Our theory can be adapted to both of these assumptions. Our results support the empirical success of contrastive learning.
4 The Impact of Labeled Data in Supervised Contrastive Learning
Recent works have explored adding label information to improve contrastive learning (Khosla et al., 2020). Empirical results show that label information can significantly improve the accuracy of in-domain downstream tasks. However, when domain shift is considered, label information hardly improves transferability and can even hurt it (Islam et al., 2021). For example, in Table 2 of Khosla et al. (2020) and the first column in Table 4 of Islam et al. (2021), supervised contrastive learning shows a significant improvement, with a 7%-8% accuracy increase on in-domain downstream classification on ImageNet and Mini-ImageNet. On the contrary, in Table 4 of Khosla et al. (2020) and Table 4 of Islam et al. (2021), supervised contrastive learning hardly increases the predictive accuracy compared to self-supervised contrastive learning (the difference in mean accuracy is less than 1%) and can hurt significantly on some datasets (e.g., 5.5% lower on SUN 397 in Table 4 of Khosla et al. (2020)). These results indicate that some mechanism in supervised contrastive learning hurts model transferability even though the improvement on source tasks is significant. Moreover, in Table 4 of Islam et al. (2021), it is observed that combining supervised learning and self-supervised contrastive learning achieves the best transfer learning performance compared to each of them individually. Motivated by these empirical observations, in this section we aim to investigate the impact of labeled data in contrastive learning and provide a theoretical foundation for these phenomena.
4.1 Feature Mining in Multi-Class Classification
We first demonstrate the impact of labels in contrastive learning under the standard single-sourced (i.e. no transfer learning) setting. Suppose our samples are drawn from different classes with probability for class , and . For each class, samples are generated from a class-specific Gaussian distribution:
(20) |
We assume the norms of are of the same order; that is, denoting , we have . We further assume ; denote and assume , where the last assumption ensures identifiability since the classification problem (20) is invariant under translation. Denoting , we assume and for two universal constants and . We remark that this model is a labeled version of the spiked covariance model (5), since the core features and random noise are both sub-Gaussian. We use classes to ensure that the 's span an -dimensional space, and denote its orthonormal basis by . Recall that our target is to recover .
Remark 4.1
Here we focus on explaining the impact of label information in the SupCon algorithm (Khosla et al., 2020). SupCon is designed for multi-class classification tasks and requires using the class label to find positive samples. In that case, labels generated from a linear function of the latent features cannot be used as class labels in a multi-class classification setting. Hence we propose to use the Gaussian mixture model (20) to generate class labels while staying as consistent as possible with the model used earlier (i.e., the spiked covariance model (5)).
As introduced in Definition 2.4, the supervised contrastive learning of Khosla et al. (2020) allows us to generate contrastive pairs using label information and to discriminate instances across classes. When we have both labeled data and unlabeled data, we can perform contrastive learning based on pairs that are generated separately for the two types of data.
Data Generating Process
Formally, let us consider the case where we draw samples as unlabeled data from the Gaussian mixture model (20) with . For the labeled data, we draw samples, samples for each of the classes in the Gaussian mixture model, and denote them as . We discuss this case for simplicity; more general versions that allow different sample sizes for each class are considered in Theorem D.2 (in the appendix). We study the following hybrid loss to illustrate how label information helps promote performance over self-supervised contrastive learning:
(21)
where is the ratio between the supervised loss and the self-supervised contrastive loss. We consider this generalized hybrid loss to show the benefit of exploiting additional unlabeled data; if we choose , it corresponds to the original SupCon loss.
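To illustrate the structure of the hybrid objective (21), the following sketch combines a linear self-supervised contrastive loss on two augmented views of unlabeled data with a SupCon-style linear supervised contrastive loss on labeled data, weighted by gamma; the specific form of the quadratic regularizer and the uniform averaging over pairs are assumptions of this sketch rather than the exact loss analyzed in the theorems below.

```python
import numpy as np

def hybrid_contrastive_loss(W, views, X_labeled, labels, gamma, reg=1.0):
    """Sketch of a hybrid objective: self-supervised linear contrastive loss
    on two augmented views plus gamma times a SupCon-style linear loss on
    labeled data, with a quadratic regularizer on W W^T.

    W         : (r, p) linear representation matrix
    views     : tuple (V1, V2) of augmented views, each of shape (n, p)
    X_labeled : (m, p) labeled samples; labels : (m,) integer class labels
    """
    V1, V2 = views
    Z1, Z2 = V1 @ W.T, V2 @ W.T
    # self-supervised part: two views of the same sample are positives,
    # views of different samples are negatives
    pos = np.mean(np.sum(Z1 * Z2, axis=1))
    cross = Z1 @ Z2.T
    n = cross.shape[0]
    neg = (np.sum(cross) - np.trace(cross)) / (n * (n - 1))
    ssl_loss = neg - pos

    # supervised part: same-class pairs are positives, cross-class pairs negatives
    Z = X_labeled @ W.T
    G = Z @ Z.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                 # exclude self-pairs
    diff = labels[:, None] != labels[None, :]
    sup_loss = G[diff].mean() - G[same].mean()

    # quadratic regularizer encouraging (near-)orthogonal rows of W
    reg_term = reg * np.sum((W @ W.T - np.eye(W.shape[0])) ** 2)
    return ssl_loss + gamma * sup_loss + reg_term
```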
We first provide a high-level explanation of why label information can help learn core features. When label information is unavailable, no matter how much (unlabeled) data we have, we can only take each sample (and its augmented views) as its own positive samples. In such a scenario, performing augmentation leads to an unavoidable trade-off between estimation bias and accuracy. However, if we have additional class information, we can contrast data within the same class against other classes to extract more beneficial features that help distinguish a particular class from the others, and therefore reduce the bias.
Theorem 4.2
The proof is given in Appendix D.2.
Corollary 4.3
From Theorem 4.2, it directly follows that when we have labeled data for each class and no unlabeled data (),
The first bound in Theorem 4.2 demonstrates how the effect of labeled data changes with the ratio in the hybrid loss in Equation (21). In addition, compared with Theorem 3.10, when we only have labeled data (), the second bound in Theorem 4.2 indicates that with labeled data available, supervised contrastive learning yields a consistent estimate as , while self-supervised contrastive learning suffers from an irreducible bias term . At a high level, label information helps gain accuracy by creating more positive samples for a single anchor and therefore extracting more decisive features. One caveat is that when labeled data are extremely rare compared to unlabeled data, the estimate from supervised contrastive learning suffers from high variance; in that case, self-supervised contrastive learning, which can exploit a much larger number of samples, may outperform it.
4.2 Information Filtering in Multi-Task Transfer Learning
In this section, we show that the theoretical tools developed in this paper can be used to illustrate the role of label information when contrastive learning is used in the transfer learning setting. Label information tells us which information is beneficial for the in-domain downstream task, and learning with labeled data will filter out useless information and preserve the decisive parts of the core features. However, in transfer learning, label information is sometimes found to hurt the performance of contrastive learning. For example, in Table 4 of Islam et al. (2021), while supervised contrastive learning gains 8% improvement on source tasks by incorporating label information, it improves generalization to new datasets by only 1% on average and can even hurt on some datasets. This observation implies that label information in contrastive learning plays very different roles in generalization to source tasks and to new tasks. In this section, we consider two regimes of transfer learning: insufficient tasks and abundant tasks. In both regimes, we provide theory to support the empirical observations and further demonstrate how to wisely combine supervised and self-supervised contrastive learning to avoid these harms and achieve better performance. Specifically, we consider a transfer learning problem in a regression setting and a binary classification setting. Suppose we have source tasks that share a common data-generating model (5). For the -th task, the labels are generated by in the regression setting and by in the binary classification setting, where is a unit vector varying across tasks. The two settings share the same distribution for and only differ in how the labels are generated.
To incorporate label information, we maximize the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005; Barshan et al., 2011), which has been widely used in the literature (Song et al., 2007a, b, c; Barshan et al., 2011).
4.2.1 Hilbert-Schmidt Independence Criterion
Gretton et al. (2005) proposed the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependence between two random variables. It computes the Hilbert-Schmidt norm of the cross-covariance operator associated with their reproducing kernel Hilbert spaces (RKHS). Such a measure has been widely used as a supervised loss function in feature selection (Song et al., 2007c), feature extraction (Song et al., 2007a), clustering (Song et al., 2007b), and supervised PCA (Barshan et al., 2011).
The basic idea behind HSIC is that two random variables are independent if and only if any bounded continuous functions of the two random variables are uncorrelated. Let be a separable RKHS containing all continuous bounded real-valued functions mapping from to and be that for maps from to . For each point , there exists a corresponding element such that , where is a unique positive definite kernel. Likewise, define the kernel and feature map for . The empirical HSIC is defined as follows.
Definition 4.4 (Empirical HSIC (Gretton et al., 2005))
Let , be a series of independent and identically distributed observations. An estimator of HSIC, written as , is given by
where and .
In our setting, we aim to maximize the dependency between learned features and label via HSIC. Substituting and , we obtain our supervised loss for the representation matrix :
(22)
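For reference, the empirical HSIC of Definition 4.4 with linear kernels, which is the building block of the supervised loss above, can be computed in a few lines; reshaping the labels into a column matrix and the function name are assumptions of this sketch.

```python
import numpy as np

def empirical_hsic_linear(Z, Y):
    """Empirical HSIC between representations Z (m x r) and labels Y with
    linear kernels K = Z Z^T, L = Y Y^T and centering matrix H."""
    m = Z.shape[0]
    Y = np.asarray(Y).reshape(m, -1)            # treat labels as an (m, k) matrix
    H = np.eye(m) - np.ones((m, m)) / m         # centering matrix
    K = Z @ Z.T                                 # linear kernel on representations
    L = Y @ Y.T                                 # linear kernel on labels
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2
```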
A more commonly used supervised loss in the regression task is the mean squared error. Here we explain the equivalence of maximizing HSIC with penalty and minimizing the mean squared error in the regression task.
Recall that in the contrastive learning framework, we first learn the representation via a linear transformation and then perform linear regression to learn a predictor with the learned representation. Consider the mean squared error , where is a predictor of . Also consider a linear class of predictors with parameter . Assume that both and are centered. For any fixed representation , the minimum mean squared error is given by
where . Ignoring the constant term , it can be seen that the only essential difference between the minimization problem and maximizing HSIC in Equation (22) is the normalization term for . Thus, minimizing the mean squared error is equivalent to maximizing HSIC with regularization term . Since contains the regularization term , we can jointly use and HSIC as a surrogate for the standard regression error to avoid the singularity of learned representation .
4.2.2 Main results
First, we consider the regression setting. Before stating our results, we prepare some notation. Suppose we have unlabeled data and labeled data for each source task, where and are independently drawn from the spiked covariance model (5). We learn the linear representation via the joint optimization:
(23)
where is a pre-specified ratio between the self-supervised contrastive loss and HSIC. A more general setting, where the ratio and the number of labeled data for each source task are allowed to depend on , is considered in the appendix; see Section D.2 for details. We now present a theorem showing the recoverability of by minimizing the hybrid loss function (23).
Theorem 4.5
In the regression setting where , suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and . Further assume that for some constant and that the 's are orthogonal to each other. Let be any solution that optimizes the problem in Equation (23), and denote its singular value decomposition as . Then we have:
(24)
The proof is given in Appendix D.6.
Similar to Section 3.4, we can obtain an in-domain downstream task risk bound in the supervised contrastive learning setting. Consider a new test task where a label is generated by with . Recall that the loss in the in-domain downstream task is measured by the squared error: . We obtain the following result.
Theorem 4.6
Suppose the conditions in Theorem 4.5 hold. Then,
(25)
The proof is given in Appendix D.7.
In Theorem 4.5 and Theorem 4.6, as goes to infinity (corresponding to the case where we use only the supervised loss), the upper bounds in Equations (24) and (25) reduce to , which is worse than the rate obtained by self-supervised contrastive learning (Theorem 3.10). This implies that when the model focuses mainly on the supervised loss, the algorithm will extract only the information beneficial for the source tasks and fail to estimate the other parts of the core features. As a result, when the target task has a very different distribution, labeled data will bring extra bias and therefore hurt transferability. Additionally, one can minimize the right-hand side of Equation (24) to obtain a sharper rate. Specifically, we can choose an appropriate such that the upper bound becomes (when ), obtaining a smaller rate than that of self-supervised contrastive learning. These facts provide theoretical foundations for the recent empirical observation that smartly combining supervised and self-supervised contrastive learning achieves a significant improvement in transferability compared with using either of them individually (Islam et al., 2021).
Remark 4.7
A heuristic intuition for this surprising fact is that when the tasks are not diverse enough, supervised training will focus only on the features that are helpful for predicting the labels of the source tasks and ignore other features. For example, suppose we have unlabeled images that contain cats or dogs, and the background can be sand or forest. If the source task focuses on classifying the background, supervised learning will not learn features associated with cats and dogs, while self-supervised learning can learn these features since they are helpful for discriminating different images. As a result, although supervised learning can help classify sand and forest, it can hurt performance on the classification of dogs and cats, and we should incorporate self-supervised contrastive learning to learn these features.
When the tasks are abundant enough, estimation via labeled data can recover the core features completely. Similar to Theorem 4.5 and Theorem 4.6, we have the following results.
Theorem 4.8
The proof is given in Appendix D.8.
Theorem 4.9
Suppose the conditions in Theorem 4.8 hold. Then,
(27)
The proof is given in Appendix D.9.
Theorem 4.8 and Theorem 4.9 show that in the case where tasks are abundant, as goes to infinity (corresponding to the case where we use the supervised loss only), the upper bounds in Equations (26) and (27) reduce to . This rate can be worse than the rate obtained by self-supervised contrastive learning when is small. Recall that when the number of tasks is small, labeled data introduce an extra bias term (Theorem 4.5 and Theorem 4.6). We note that when the tasks are abundant enough, the harm of labeled data is mainly due to the variance brought by the labeled data. When is sufficiently large, supervised learning on the source tasks can yield a consistent estimate of the core features, whereas self-supervised contrastive learning cannot.
In extending our results from the regression to the binary classification setting, the only difference is in the label generation process, and we can obtain similar results with some modifications of the proofs. The corresponding feature recovery bounds of Theorem 4.5 (where the tasks are insufficient) and Theorem 4.8 (where the tasks are abundant) are stated as follows:
Theorem 4.10
Theorem 4.11
As the feature recovery bounds remain the same, the counterparts of the in-domain downstream task results, Theorem 4.6 and Theorem 4.9, in this classification setting follow immediately. For the sake of space, we defer the generalized versions and proofs to Theorem D.10 and Theorem D.11 in the appendix.
5 Numerical Experiments
5.1 Linear Model with Synthetic Data
To verify our theory, we conducted numerical experiments on the spiked covariance model (5) under the linear representation setting with the contrastive loss functions defined in (3) and (23). Since we have explicitly formulated the loss function and derived its equivalent form in the main body and the appendix, we simply minimize the corresponding loss by gradient descent to find the optimal linear representation . For self-supervised contrastive learning with random masking augmentation, we independently draw the augmentation functions according to Definition 2.2 and apply them to all samples in each iteration. To ensure convergence, we set a large maximum number of iterations (typically 10000 or 50000, depending on the dimension ).
We report two criteria to evaluate the quality of the representation: in-domain downstream error and sine distance. To obtain the sine distance for a learned representation , we perform a singular value decomposition to get and then compute . To obtain the in-domain downstream task performance, in the comparison between autoencoders and contrastive learning, we first draw labeled data from the spiked covariance model (5) with labels generated as in Section 3.4; we then train the model using the data without labels to obtain the linear representation , learn a linear predictor using the data with labels, and compute the regression error. In the transfer learning setting, we draw labeled data from the source tasks and additional unlabeled data. The number of labeled data is set to and the number of unlabeled data is set to . We then train with them to obtain the linear representation , draw labeled data from a new task to learn a linear predictor, and compute the regression error. In particular, we subtract the optimal regression error obtained by the best representation from each regression error and report the difference, that is, the excess risk, as the in-domain downstream performance.
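The closed-form characterizations from Section 3 also give a quick way to reproduce the qualitative comparison without gradient descent; the following self-contained sketch, with illustrative problem sizes of our own choosing, recovers the core subspace by ordinary PCA (the autoencoder solution) and by diagonal-deletion PCA (the contrastive solution under random masking) and prints their sine distances.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, signal = 2000, 200, 5, 1.0                           # illustrative sizes

U, _ = np.linalg.qr(rng.standard_normal((p, r)))              # true core subspace
X = signal * rng.standard_normal((n, r)) @ U.T \
    + rng.standard_normal((n, p)) * rng.uniform(0.5, 1.5, p)  # heteroskedastic noise

S = X.T @ X / n                                               # sample second moment
U_ae = np.linalg.eigh(S)[1][:, -r:]                           # autoencoder / PCA
U_cl = np.linalg.eigh(S - np.diag(np.diag(S)))[1][:, -r:]     # diagonal-deletion PCA

U_perp = np.linalg.svd(U, full_matrices=True)[0][:, r:]       # orthogonal complement
for name, U_hat in [("autoencoder", U_ae), ("contrastive", U_cl)]:
    print(name, np.linalg.norm(U_perp.T @ U_hat, ord=2))
```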
The results are reported in Figures 1 and 2 and Tables 2 and 3. As predicted by Theorems 3.9 and 3.10, the feature recovery error and in-domain downstream task risk of contrastive learning decrease as increases (Fig. 1: Left) and as increases (Fig. 1: Center), while those of autoencoders are insensitive to the changes in and . Consistent with our theory, in Fig. 1: Right, it is observed that when tasks are not abundant, the transfer performance exhibits a U-shaped curve, and the best result is achieved by choosing an appropriate . When tasks are abundant and labeled data are sufficient, the error remains small when we take large .
[Figures 1 and 2 appear here.]
|  | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 0.0242 | 0.0231 | 0.0199 | 0.0141 | 0.0122 | 0.0125 | 0.0184 | 0.0345 | 0.0499 | 0.0535 | 0.0587 |
|  | 0.0223 | 0.0163 | 0.0156 | 0.0096 | 0.0079 | 0.0055 | 0.0064 | 0.0064 | 0.0067 | 0.0070 | 0.0079 |
|  | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 2.0373 | 2.0371 | 2.0228 | 1.9908 | 2.0021 | 2.0055 | 2.0010 | 2.0362 | 2.0699 | 2.0705 | 2.0813 |
|  | 2.0352 | 2.0292 | 2.0030 | 1.9871 | 1.9740 | 1.9690 | 1.9766 | 1.9702 | 1.9790 | 1.9714 | 1.9672 |
5.2 Neural Nets with Real-World Dataset
In this section, we provide experimental results in real-world datasets to support our theoretical results. Although our model settings and assumptions might be violated under this scenario, as we shall see, our findings still remain valid in practice.
We conduct the experiments using the datasets STL-10 (Coates et al., 2011) and CIFAR-10 (Krizhevsky, 2009) with the ResNet-18 architecture (He et al., 2016). Our experiments are carried out based on linear evaluation following SimCLR (Chen et al., 2020a), where we first train a ResNet-18 encoder and a two-layer MLP projector with unlabeled augmented data. We then freeze the encoder, train a logistic regression on top of it with labeled data, and lastly evaluate the performance on the test data. Following Chen et al. (2020a), we apply augmentations including resized cropping, horizontal flipping, color distortion, and Gaussian blurring to generate the augmented data, and use the InfoNCE loss function to train the network. All training is carried out with the Adam optimizer (Kingma and Ba, 2015), batch size 256, learning rate , weight decay , and a cosine annealing learning rate scheduler for 100 epochs. Our code is implemented in PyTorch and runs on an NVIDIA V100 GPU.
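For concreteness, a standard NT-Xent-style implementation of the InfoNCE loss used in this pipeline is sketched below; the temperature value is illustrative, and the projector architecture and augmentation pipeline of our actual training script are omitted.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    """NT-Xent / InfoNCE loss over a batch of positive pairs (z1[i], z2[i]).
    z1, z2: (B, D) projector outputs of the two augmented views of the same batch."""
    B = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)           # (2B, D), unit-normalized
    sim = z @ z.t() / temperature                                 # pairwise cosine similarities
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)                          # positives: i <-> i + B
```

Here z1 and z2 are the projector outputs of the two augmented views of one mini-batch; the self-similarity entries are masked out before taking the cross-entropy over the remaining candidates.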
Contrastive learning vs. standard autoencoders:
Here we provide real-world evidence for our theoretical findings in Section 3.3 by comparing the performance of contrastive learning versus a standard autoencoder. The architecture of the encoder is the same for the two methods, and we use an inverted ResNet-18 as the decoder. During training, we use the encoder-decoder architecture and the mean squared error loss to reconstruct the input, and then we train a linear classifier on the features learned by the encoder. The results are listed in Table 4; contrastive learning demonstrates superior performance over the standard autoencoder.
Testing Accuracy | Contrastive Learning | Standard Autoencoder |
---|---|---|
CIFAR10 | ||
STL10 |
The impact of labeled data in transfer learning
Now we provide real-world evidence for our theoretical findings in Section 4.2. Following the joint optimization formulation in Equation (23), we combine the InfoNCE loss function for unlabeled data and the cross-entropy loss for labeled data with a ratio (a sketch of this joint objective is given after Table 5). For both the STL-10 and CIFAR-10 datasets, we divide the test data into two sets: one consists of the first five classes and the other consists of the remaining five classes. During training, we use the training data as unlabeled data and the first set of test data as the labeled data to train the model jointly, and then train a linear classifier on the features learned by the encoder with the second set of test data. As predicted by Theorem 4.5, when is small, introducing labeled data from the first five classes is beneficial for learning better representations and improves the performance on the last five classes; when is large, labeled data from the first five classes make the model focus only on features that are useful for discriminating the first five classes and ignore other features, so introducing labeled data can be harmful to the performance on the last five classes. Testing accuracies on the last five classes with different are listed in Table 5; it is observed that the accuracy first increases and then decreases as grows, which is consistent with our theoretical results.
| Testing Accuracy |  |  |  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CIFAR10 | 74.27 | 75.52 | 74.86 | 75.31 | 75.21 | 74.86 | 74.46 | 73.85 | 72.20 | 69.17 | 51.31 |
| STL10 | 82.56 | 83.54 | 83.27 | 83.24 | 83.21 | 83.03 | 82.34 | 82.11 | 80.94 | 76.88 | 52.37 |
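The joint objective used above can be sketched as a simple interpolation of the InfoNCE term on unlabeled pairs and a cross-entropy term on the labeled subset, with lam playing the role of the ratio in Equation (23); the batching and weighting details of our actual implementation are not reproduced, and info_nce_loss refers to the sketch in the previous subsection.

```python
import torch.nn.functional as F

def joint_loss(z1, z2, logits, labels, lam, temperature=0.5):
    """Schematic joint objective: self-supervised InfoNCE on the augmented pairs plus
    lam times the supervised cross-entropy on the labeled subset.
    info_nce_loss is the sketch from the previous subsection."""
    unsup = info_nce_loss(z1, z2, temperature)
    sup = F.cross_entropy(logits, labels)       # classification head on the labeled data
    return unsup + lam * sup
```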
6 Conclusion
In this work, we establish a theoretical framework to study contrastive learning under the linear representation setting. We theoretically prove that contrastive learning, compared with autoencoders and GANs, can obtain a better low-rank representation under the spiked covariance model, which further leads to better performance in in-domain downstream tasks. We also highlight the impact of labeled data in supervised contrastive learning and multi-task transfer learning: labeled data can reduce the domain shift bias in contrastive learning, but they can harm the learned representation in transfer learning. To our knowledge, our result is the first theoretical result to guarantee the success of contrastive learning by comparing it with existing representation learning methods. However, to obtain a tractable analysis, like many other theoretical works in representation learning (Du et al., 2020; Lee et al., 2021; Tripuraneni et al., 2021), our work starts with linear representations, which still provides important insights. Recently, Wen and Li (2021) and Refinetti and Goldt (2022) studied the training dynamics of autoencoders and contrastive learning with nonlinear shallow neural networks. Extending our results to these more complex models is an interesting direction for future work.
Acknowledgement
L.Z. is supported by National Science Foundation DMS 2015378. J.Z. is supported by the National Science Foundation (CCF 1763191 and CAREER 1942926), the US National Institutes of Health (P30AG059307 and U01MH098953) and grants from the Silicon Valley Foundation and the Chan-Zuckerberg Initiative.
A Background and omitted discussion
A.1 Comparison with other works
Here we compare the results in this paper with some closely related works.
To rigorously analyze contrastive learning, we consider the random masking augmentation strategy, which is also analyzed in Wen and Li (2021). In Wen and Li (2021), the authors aim to understand the training dynamics of contrastive learning in a shallow nonlinear neural network and focus more on dealing with nonlinearity. In comparison, our work focuses on the comparison between contrastive learning and autoencoders and on the role of label information in contrastive learning. To make the problem mathematically tractable, we adopt a linear model, which is simple but sufficient to shed light on many mysterious phenomena in practice. Moreover, while they assume a sparse coding model, where the features are extremely sparse, and Gaussianity of signals and noise, our analysis only requires that the features are sub-Gaussian (5). Furthermore, our technique allows the signal-to-noise ratio to have different orders, as long as it decreases slowly, while their analysis is restricted to a particular signal-to-noise ratio.
Tian (2022) studied the relationship between contrastive learning and PCA from a game-theoretic point of view. Specifically, the authors decompose the gradient descent on the contrastive loss into two dynamics, namely the max-player and min-player. It is proven that in deep linear networks, the max-player is equivalent to PCA and the landscape has no spurious minimum. While the results on max-player can be applied to a family of contrastive loss, it is still difficult to analyze the min-player in a general setting. In our paper, we use a linear contrastive loss (2) to explicitly obtain the features learned by contrastive learning. Moreover, our results can be directly extended to a deep linear network setting by the equivalence of a single linear transformation and a deep linear network. The major difference is the non-convexity of the loss landscape.
Garg and Liang (2020) studied the combination of supervised learning and self-supervised learning. They viewed training with unlabeled data as functional regularization on learning the representation function, and obtained sample complexity bounds in a PAC-learning style for various settings. In particular, they found that such functional regularization can help to reduce the amount of labeled data needed, and showed autoencoders and masked self-supervision as two concrete examples. In contrast to Garg and Liang (2020), this paper focuses on a regime of combining self-supervised learning and supervised learning where a trade-off between labeled data and unlabeled data exists. Specifically, Theorem 3 in Garg and Liang (2020) assumes that a ground truth representation exists that keeps both the self-supervised loss and the supervised loss below a very low threshold. However, as the authors admit, it is hard to determine such a threshold in practical settings. For example, when the unlabeled data and labeled data come from different domains, such as ImageNet and CIFAR-10, domain-specific features may have a much lower loss compared with domain-transferable features. In our paper, we first study the regime where tasks are not diverse enough in Theorem 4.5 (which corresponds to the case where a ground truth does not exist) and show the trade-off between the supervised loss and the self-supervised loss. Then, in Theorem 4.8, we show that when tasks are abundant (which corresponds to the case where a ground truth exists), labeled data help to achieve better error bounds, which is similar to the result of Garg and Liang (2020). Our result in Theorem 4.5 provides novel insight into the regime where tasks are not diverse, which has been left untouched in the literature.
In Li et al. (2021), the authors proposed a novel self-supervised loss function based on HSIC and discussed the relationship between InfoNCE and the proposed SSL-HSIC loss. The SSL-HSIC loss measures the dependence between the output features and one-hot encoded labels (which serve as the indicators of positive samples), and minimizing the SSL-HSIC loss encourages the network to discriminate augmented views from different samples. In comparison to this self-supervised loss, we use HSIC in Section 4.2 as a supervised loss to measure the dependence between the output features and the true labels, which is a common usage of HSIC in previous works (Barshan et al., 2011; Song et al., 2007c). Moreover, we want to point out that the proposed estimator of SSL-HSIC (see Equation (11) in Li et al. (2021)) reduces to the linear loss we use in this paper when the kernel is chosen to be a simple inner product. The authors argued that the standard InfoNCE loss may yield meaningless features in some cases, and thus the proposed HSIC-based loss could be a better alternative, and provided empirical results illustrating the comparable performance of SSL-HSIC. The benefits of such an HSIC-based method remain to be further explored.
Lee et al. (2021) studied self-supervised learning under a conditional independence assumption, and showed that with the optimal representation learned in pretext tasks, the in-domain downstream risk is guaranteed to be small. In the contrastive learning context, for example an image classification downstream task, such an assumption implies that the augmented views generated from the same picture are roughly independent conditional on its ground-truth class, which could be too strong since the two views are usually strongly correlated. Such an assumption would be closer to supervised contrastive learning as discussed in Section 4.1, where we contrast two independent samples from the same class; such sample pairs are independent conditional on the true label, but this requires label information and thus does not apply to self-supervised contrastive learning. Compared with this work, we study a specific data-generating model where the two views are obtained by practical augmentations and thus could be closer to a real-world setting. It is also notable that in Lee et al. (2021), the analysis can be adapted to the nonlinear representation setting in the sense that the optimal representation is directly assumed to be obtained. However, the representation learning problem in the contrastive learning context, even under a linear representation setting, could be non-convex. Thus our analysis starts from a simple setting to obtain a deep understanding of what can be learned in the pretext tasks.
Saunshi et al. (2022) proposed a novel perspective that theoretical analysis of contrastive learning must take the inductive bias into account. They show that without considering the function class, it is possible that the learned features totally fail in downstream classification tasks. Furthermore, they show that within the linear representation class, contrastive self-supervised learning is guaranteed to learn meaningful features under certain conditions. In our paper, we have restricted ourselves to a similar linear representation setting and avoided the collapse caused by complicated models. Compared with their analysis in the linear representation setting, our bounds provide an exact order, while their results still need to quantify the expressivity and inconsistency measures, which is very difficult without very strong assumptions. Moreover, their analysis requires a finite input space and augmentation set, which is a much stronger requirement than our data-generating model.
Subsequent to our paper, Fu et al. (2022) considered a similar issue regarding the transfer learning performance of the supervised contrastive method. The authors argue that directly minimizing the supervised contrastive loss will lead to feature collapse, which implies that within each class all data points will have the same embedding, and thus the supervised contrastive method loses information that could be useful for transfer learning. This intuition is similar to our setting in Section 4.2, and based on it the authors proposed a new loss function that combines the supervised contrastive loss and a within-class self-supervised loss. At a high level, this new loss function is quite similar to Equation (23), since both are linear interpolations of a self-supervised contrastive loss and a supervised loss, and the common motivation is to encourage the model to learn more background features. Our analysis in Section 4.2 can provide a theoretical foundation for when such an interpolation works and how to choose the ratio .
A.2 Discussion about the regularization term
In this paper we use a quadratic regularization term instead of a standard regularization term . Denote , then we can write these two terms as:
The main difference between these two terms is that, in addition to penalizing the norm of the representation , the quadratic term also penalizes the similarity between different representations. In particular, in the linear representation setting, where we deal with optimization problems like
where is a symmetric matrix determined by data and augmentation, we can easily find that the regularization would fail. To see this, we can rewrite the loss function as
it is easy to see that the optimal solution for each would be at infinity. Moreover, even if we add constraints like , the optimal solutions for each would all be the eigenvector corresponding to the smallest eigenvalue of , so the model would only learn a single representation from the data. In contrast, the quadratic regularization term encourages the diversity of the representations by penalizing the similarity between them, i.e., . In this situation, we have:
and it is easy to see that the optimal solution for is finite and that each corresponds to a different eigenvector of , orthogonal to the others, which implies that they learn totally different representations. As a result, the quadratic regularization term is more helpful for self-supervised learning with a linear representation. In real-world practice, regularization can still work well with the help of non-linearity and normalization techniques, but we also provide an empirical observation that applying quadratic regularization can help improve performance compared with using standard weight decay. We would also like to point out that in the linear regime, the choice of only affects the norm of and makes no difference to the direction of or the quality of the representation; thus we do not specify this value in our analysis. Similar regularization techniques are also used in Liu et al. (2021) for theoretical analysis in the linear representation setting.
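The following NumPy sketch illustrates this point numerically, assuming a generic positive semi-definite matrix M in place of the data/augmentation matrix and schematic step sizes: with the quadratic regularizer, gradient descent recovers several distinct directions spanning the top eigenspace, whereas the standard penalty (with row normalization to keep the iterates finite) collapses all rows onto a single direction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, rho, lr, iters = 20, 3, 1.0, 5e-3, 20000

A = rng.normal(size=(d, d))
M = A @ A.T / d                                   # stand-in for the symmetric data matrix
top = np.linalg.eigh(M)[1][:, -r:]                # its top-r eigenspace

def train(reg):
    W = 0.01 * rng.normal(size=(r, d))
    for _ in range(iters):
        grad = -2.0 * W @ M                        # gradient of -tr(W M W^T)
        if reg == "quadratic":
            grad += 2.0 * rho * (W @ W.T) @ W      # gradient of (rho/2)||W W^T||_F^2
            W = W - lr * grad
        else:                                      # standard l2 penalty + row normalization
            grad += 2.0 * rho * W
            W = W - lr * grad
            W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return W

for reg in ("quadratic", "l2"):
    W = train(reg)
    U_hat, svals, _ = np.linalg.svd(W.T, full_matrices=False)
    err = np.sqrt(max(r - np.linalg.norm(U_hat.T @ top, "fro") ** 2, 0.0))
    print(f"{reg:>9}: singular values of W = {np.round(svals, 2)}, "
          f"sine distance to top-{r} eigenspace = {err:.2f}")
```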
We verify this conjecture in neural networks, and the results are provided in Table 6. The first column corresponds to training with standard weight decay; in the setting of the second column, we do not apply weight decay to the weight matrices of the fully connected layers in the encoder and instead add an additional regularization term to the loss function. Note that the quadratic regularization does not apply to convolution layers and bias terms, so we keep weight decay on these parameters. We search the regularization parameter in each setting from and find that yields the best performance across settings and datasets.
It is observed that quadratic regularization slightly improves the performance of contrastive learning, which is consistent with our intuition.
Testing Accuracy | weight decay=0.0001 | quadratic regularization=0.0001 |
---|---|---|
CIFAR10 | ||
STL10 |
A.3 Background on distance between subspaces
In this section, we will provide some basic properties of sine distance between subspaces. Recall the definition:
(30) |
where are two orthogonal matrices. Similarly, we can also define:
We first give two equivalent definitions of this distance:
Proposition A.1
Proof Write . We have
then by definition of sine distance, we can obtain the desired equation.
Proposition A.2
Proof Expand the right hand and use Proposition A.1 we have:
With Propositions A.1 and A.2, it is easy to verify its properties to be a distance function. Obviously, we have and by definition. Moreover, we have the following results:
Lemma A.3 (Lemma 1 in Cai and Zhang (2018))
For any ,
(31) |
and
(32) |
Proposition A.4 (Identity of indiscernibles)
Proof It is a straightforward corollary by definition:
Proposition A.5 (Triangular inequality)
Proof By the triangular inequality for Frobenius norm we have:
then apply Proposition A.2 to replace the Frobenius norm with sine distance we can finish the proof.
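As a quick numerical sanity check of these properties (under the principal-angle formulation of the sine distance, which we assume agrees with Equation (30) up to normalization), one can verify the identity behind Proposition A.1 and the triangular inequality of Proposition A.5 on random subspaces:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 12, 3

def rand_orth(d, k):
    return np.linalg.qr(rng.normal(size=(d, k)))[0]

def sin_dist(U, V):
    # ||sin Theta(U, V)||_F from the cosines of the principal angles
    c = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.sqrt(np.maximum(np.sum(1.0 - c ** 2), 0.0))

U, V, W = rand_orth(d, k), rand_orth(d, k), rand_orth(d, k)

# Identity in the spirit of Proposition A.1: ||sin Theta(U, V)||_F^2 = k - ||U^T V||_F^2
print("identity gap  :", abs(sin_dist(U, V) ** 2 - (k - np.linalg.norm(U.T @ V, "fro") ** 2)))

# Triangular inequality in the spirit of Proposition A.5 (slack should be non-negative)
print("triangle slack:", sin_dist(U, W) + sin_dist(W, V) - sin_dist(U, V))
```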
B Omitted proofs for Section 3
B.1 Proofs for Section 3.1 and Section 3.2
In this section, we will provide the proofs of Proposition 3.1 and Corollary 3.2; their restatements and detailed proofs can be found in Proposition B.1 and Corollary B.2.
Proposition B.1 (Restatement of Proposition 3.1)
For two fixed augmentation functions , denote the augmented data matrices as and , when the augmented pairs are generated as in Definition 2.1, the optimal solution of contrastive learning problem (9) is given by:
where is a positive constant, is the -th largest eigenvalue of the following matrix:
(33) |
is the corresponding eigenvector and can be any orthonormal matrix.
Proof [Proof of Proposition 3.1] When augmented pairs generation in Definition 2.1 is applied, the contrastive loss can be written as:
Note that the last term only depends on , and the first term implies that when is the optimal solution, is the best rank- approximation of , where . Applying Lemma E.4 to the first term, we can conclude that satisfies the desired conditions.
Corollary B.2 (Restatement of Corollary 3.2)
Under the same conditions as in Proposition 3.1, if we use random masking (Definition 2.2) as our augmentation function, then in expectation over the data augmentation, the optimal solution of contrastive learning problem (9) is given by:
where is a positive constant, is the -th largest eigenvalue of the following matrix:
(34) |
is the corresponding eigenvector and can be any orthonormal matrix.
Proof [Proof of Corollary 3.2] Following the proof of Proposition 3.1, now we only need to compute the expectation over the augmentation distribution defined in Definition 2.2:
(35) |
Note that by the definition of the random masking augmentation, we have , which implies . On the other hand, and have no common nonzero entries, hence the matrix only consists of off-diagonal entries, and each off-diagonal entry, denoted as , appears if and only if . Moreover, if it appears, it must equal the -th element of . With this result, we can then compute the expectation in Equation (35):
By a similar argument as in the proof of Proposition 3.1, we can conclude that satisfies the desired conditions.
Remark B.3
Note that the two views generated by random masking augmentation have disjoint non-zero dimensions, hence contrasting such positive pairs yields a correlation between different dimensions only. That is why the first term in equation (34) appears to be where the diagonal entries are eliminated.
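This structure is easy to check empirically. The sketch below, with an illustrative covariance matrix and sample size, estimates the cross-view second moment under masking into two disjoint random coordinate subsets: the diagonal vanishes exactly, while under the independent coin-flip masking assumed here each off-diagonal entry of the population covariance survives with probability 1/4.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 6, 200000

Sigma = np.eye(d) + 0.4                                   # an illustrative covariance matrix
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

mask = rng.random((n, d)) < 0.5                           # coordinates kept by the first view
A1, A2 = X * mask, X * (~mask)                            # the two views have disjoint supports

C = A1.T @ A2 / n                                         # empirical cross-view second moment
print("max |diagonal of C|        :", np.abs(np.diag(C)).max())          # exactly 0
off = ~np.eye(d, dtype=bool)
print("max |off-diag C - Sigma/4| :", np.abs(C - Sigma / 4)[off].max())   # small
```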
B.2 Proofs for Section 3.3
In this section, we will prove Lemma 3.7 and Theorems 3.9 and 3.10 from Section 3.3. Their restatements and proofs can be found in Lemma B.6, Theorem B.7, and Theorem B.9.
Before starting the proof, we give two technical lemmas to help the proof.
Lemma B.4 (Uniform distribution on the unit sphere (Marsaglia, 1972))
If $x_1, \ldots, x_d \overset{\mathrm{i.i.d.}}{\sim} N(0,1)$ and $x = (x_1, \ldots, x_d)^\top$, then $x / \|x\|_2$ is uniformly distributed on the unit sphere $S^{d-1}$.
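For intuition, the construction in Lemma B.4 can be checked numerically in a few lines; the dimension, sample size, and the second-moment test below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 100000
g = rng.normal(size=(n, d))
u = g / np.linalg.norm(g, axis=1, keepdims=True)   # normalized Gaussians: uniform on the sphere

# For the uniform distribution on the unit sphere, E[u] = 0 and E[u u^T] = I_d / d.
print("norm of sample mean :", np.linalg.norm(u.mean(axis=0)))
print("sample second moment:\n", np.round(u.T @ u / n, 3))
```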
Lemma B.5
If i.i.d. , then:
Proof Denote , then we have:
Note that the moment-generating function of the chi-square distribution with $k$ degrees of freedom is $\mathbb{E}\left[e^{tX}\right] = (1-2t)^{-k/2}$ for $t < 1/2$.
Then combine this fact with Equation (B.2) we have:
which implies:
In particular, take yields:
as desired.
Lemma B.6 (Restatement of Lemma 3.7)
(36) |
Proof [Proof of Lemma 3.7] Denote the columns of as , we have:
By Lemma B.4 we can transform this expectation on the uniform sphere distribution into normalized multivariate Gaussian variables:
(37) |
where are i.i.d. standard normal random variables. Apply Chebyshev’s inequality we know that:
In particular, take we have:
Then take it back into Equation (37) and apply Lemma B.5 we obtain:
as desired.
Now we start proving our main results. Note that is the top- left eigenspace of the observed covariance matrix and is that of the core feature covariance matrix, and by Assumption 3.5 the observed covariance matrix is dominated by the covariance of the random noise. The Davis-Kahan theorem provides a technique to estimate the eigenspace distance via bounding the difference between the target matrices. We adopt this technique to prove the lower bound on the feature recovery error of autoencoders in Theorem 3.9.
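The toy computation below illustrates the Davis-Kahan mechanism we rely on, with illustrative sizes and noise level: the sine distance between the top eigenspaces of a matrix and of its perturbation is controlled by the size of the perturbation relative to the eigengap (the exact constants of Lemma E.1 are not tracked here).

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 30, 3

U = np.linalg.qr(rng.normal(size=(d, k)))[0]
S = U @ np.diag([10.0, 9.0, 8.0]) @ U.T            # eigengap between the k-th and (k+1)-th eigenvalue is 8
E = rng.normal(size=(d, d))
E = 0.05 * (E + E.T)                                # a small symmetric perturbation
U_hat = np.linalg.eigh(S + E)[1][:, -k:]            # top-k eigenvectors of the perturbed matrix

c = np.linalg.svd(U_hat.T @ U, compute_uv=False)    # cosines of the principal angles
sin_theta = np.sqrt(np.maximum(np.sum(1.0 - c ** 2), 0.0))
print("sin distance:", sin_theta, "   ||E|| / eigengap:", np.linalg.norm(E, 2) / 8.0)
```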
Theorem B.7 (Restatement of Theorem 3.9)
Consider the spiked covariance model Eq.(5), under Assumptions 3.4-3.6 and , and let be the learned representation of the autoencoder with singular value decomposition (as in Eq.(7)). If we further assume that are different from each other and that for some universal constant , then there exist two universal constants such that when , we have
(38) |
Proof [Proof of Theorem 3.9] Denote as the target matrix, let be the samples generated from model (5), and let be the corresponding matrices. In addition, we write the column mean matrix of a matrix as , that is, each column of is the column mean of . We denote the sum of variances as . As shown in Equation (7), autoencoders find the top- eigenspace of the following matrix:
The rest of the proof is divided into three steps for the sake of presentation.
Step 1. Bound the difference between and
In this step, we aim to show that the data recovery of autoencoders is dominated by the random noise term. Note that , we just need to bound the norm of the following matrix:
(39) |
and we will deal with these four terms separately.
1. For the first term, we obtain the bound in Equation (41).
2. For the second term, since and are independent, we must have , so applying Lemma E.2 twice we have:
(42)
3. For the third term, applying Lemma E.3 again yields:
(43)
4. For the last term, note that each column of and are the same, so we can rewrite it as: where and . Since and are independent zero-mean sub-Gaussian random variables and , we can conclude that:
(44)
To sum up, combining Equations (41), (42), (43), and (44) together, we obtain the upper bound for the expected 2-norm of the matrix :
(45) |
Step 2. Bound the sine distance between eigenspaces
As we have shown in Step 1, the target matrix of autoencoders is close to the covariance matrix of the random noise, that is, . Note that is assumed to be a diagonal matrix with distinct elements, hence its eigenspace only consists of canonical basis vectors . Denote as the top- eigenspace of and as its corresponding basis vectors; applying the Davis-Kahan theorem (Lemma E.1), we can conclude that:
Step 3. Obtain the final result by triangular inequality
By Assumption 3.6 we know that the distance between canonical basis and the eigenspace of core features can be large:
Then, applying the triangular inequality of the sine distance (Proposition A.5), we can obtain the lower bound for autoencoders.
(46) | ||||
Assumption 3.5 then implies that when n and d are sufficiently large and is sufficiently small (smaller than a given constant ), there exists a universal constant such that:
Before proving the next theorem, we first introduce a technical lemma.
Lemma B.8 (Lemma 4 in Zhang et al. (2018))
If is any square matrix and is the matrix with diagonal entries set to 0 , then
Here, the factor 2 in the statement above cannot be improved.
Theorem B.9 (Restatement of Theorem 3.10)
Proof [Proof of Theorem 3.10] The proof strategy is quite similar to that of Theorem 3.9 and we follow the notation defined in the first paragraph of that proof. As we have shown in Corollary 3.2, under our linear representation setting, the contrastive learning algorithm finds the top- eigenspace of the following matrix:
To prove the theorem, first we need to bound the difference between and . We aim to show that the contrastive learning algorithm is dominated by the core feature term. Note that , we just need to bound the norm of the following matrix:
(48) | ||||
and we will also deal with these five terms separately.
1. For the first term, we obtain the bound in Equation (50).
2. For the second term, applying Equation (42) yields:
(52)
3. For the third term, applying Equation (43) yields:
(53)
4. For the fourth term, applying Equation (44) yields:
(54)
5. For the fifth term, we obtain the bound in Equation (55).
To sum up, combining Equations (50), (52), (53), (54), and (55) together, we obtain the upper bound for the expected 2-norm of the matrix :
(56) |
With the upper bound for , applying Lemma E.1 yields the desired bound on the sine distance:
(57) | ||||
Moreover, there exists an orthogonal matrix depending on such that:
which finishes the proof.
B.3 Proofs for Section 3.4
In this section, we will provide the proofs of Theorems 3.13 and 3.14 for both the regression and classification settings. The corresponding statements and proofs can be found in Theorems B.10 and B.11.
For notation simplicity, define the prediction risk of predictor for classification and regression tasks as and , respectively. Define . We write for with a slight abuse of notation. For two matrices and of the same order, we define when is positive semi-definite.
Theorem B.10 (Restatement of Theorem 3.13)
Suppose the conditions in Theorem 3.10 hold. Then, for the classification task, we have
and for regression tasks,
Theorem B.11 (Restatement of Theorem 3.14)
Suppose the conditions in Theorem 3.9 hold. Assume holds for some constant . Additionally assume that is sufficiently small and . Then, for the regression task,
and for classification task, if is differentiable at and , then
where and are constants independent of and .
The proofs of Theorems B.10 and B.11 rely on Lemmas B.15, B.16, B.17, B.18, and B.20, which are proved later in this section.
Proof [Proof of Theorem B.10: Classification Task Part] Lemma B.17 gives for any ,
(58) | |||
(59) |
Substituting combined with Assumption 3.5 and concludes the proof.
Proof [Proof of Theorem B.10: Regression Part] Note that under Assumption 3.5 and , . Lemma B.20 gives for any ,
(60) |
Theorem 3.10 with substitution gives the desired result.
Proof [Proof of Theorem B.11: Classification Part] Lemma B.16 gives that for , we can take and sufficiently small so that holds. By Lemma B.18,
(61) |
where the last inequality follows since . If we further take , the right-hand side becomes a positive constant.
This concludes the proof.
Proof [Proof of Theorem B.11: Regression Part] From Proposition B.19, we have
Thus from Lemma B.15,
(62) |
Using Lemma B.16 and the same argument as in the proof of Theorem 3.14: Classification Part, we conclude the proof.
Lemma B.12
For any ,
Proof Since for symmetric positive semi-definite matrices and ,
where we used Weyl’s inequality in the second inequality.
Lemma B.13
For any ,
Proof Since ,
where we used Weyl’s inequality and .
Lemma B.14
For any ,
For the term ,
From Lemma A.3, . Finally from these results and ,
Since LHS does not depend on the orthogonal transformation where , we obtain
Combined again with Lemma A.3, we obtain the desired result.
Lemma B.15
For any ,
Proof Observe
Since , it follows that . Thus
where we used and from Proposition A.1. Combined with Lemma B.13, we obtain
Lemma B.16
Suppose the conditions in Theorem 3.9 hold. Fix . There exists a constant such that if , then,
where is a universal constant.
Proof By the Cauchy-Schwarz inequality,
From Theorem 3.9, there exists a constant such that we have
Therefore combined with a trivial bound ,
where we used since .
Thus we can take .
This concludes the proof.
Lemma B.17
For any ,
Proof Recall that we are considering the class of linear classifiers , where . For notational simplicity, write and .
Since and is monotone increasing, the false positive probability becomes
Write and . From assumption, jointly follows a normal distribution with mean . Write , , where . Let . By a formula for conditional normal distribution, we have . This gives
where is cumulative distribution function of and . We define as . When , is called the logistic-normal integral, whose analytical form is not known (Pirjol, 2013). Since a random variable is symmetric about mean and ,
Hence
Note that the true negative probability is exactly the same as the false positive probability under our settings:
Therefore
Let
From the Cauchy-Schwarz inequality,
Define and . Then, since on the event where , is monotone increasing and is non-negative, we have
This yields
Note that for any ,
where is a density function of standard normal distribution. Observe
where we used . Since for , and , we obtain
When , since is increasing in ,
(63) |
From Lemmas B.12 and B.13, we have
(64) |
Then, Equation (63) becomes
where the last inequality follows from Lemma B.14.
On the event where ,
In summary, on the event where ,
On the other hand, on the event where , we have a trivial inequality . This gives
where the last inequality follows from Markov’s inequality.
Lemma B.18
Suppose satisfies . Then,
Proof We firstly bound the term . From Lemma B.15,
(65) |
By assumption, the RHS of Equation (65) is non-negative. Then, using the inequality for ,
From Equation (64) and Equation (65),
(66) |
From the proof of Lemma B.17,
Note that for any , . Since we assume RHS of Equation (65) is positive, . Thus on the event where , . Observe
where in the last equality we transformed . Since is differentiable at and ,
Thus there exists a constant only depending on such that for all since . This gives
The last inequality follows since by assumption. It is noted that from Equation (64). Therefore with Equation (66),
Proposition B.19
For any ,
Proof [Proof of Proposition B.19] Generate random variables following the model (19). We calculate the prediction risk of as:
where and . From this, we obtain
Lemma B.20
For any ,
Proof [Proof of Lemma B.20] From Proposition B.19, we have
Note that for any orthogonal matrix . Take such that without loss of generality, since we can always take a sequence such that from Lemma A.3.
Lemma B.14 gives
On the event where ,
On the event where , we utilize the trivial upper bound
Combining these results, we have
where the last inequality follows by Markov’s inequality.
C Discussion about Autoencoders and random masking augmentation
In the following, we show that our results do not change if we apply the same augmentation (Definition 2.2) to autoencoders. As discussed in Section 3.1, we can ignore the bias term in autoencoders for simplicity, since it only serves as a centralization of the data matrix. In that case, we apply random augmentations and to the original data , and the optimization problem can be formulated as follows:
(67) |
Then, similar to Proposition 3.1 for contrastive learning, we can also obtain an explicit solution to this optimization problem.
Theorem C.1
The optimal solution of autoencoders with random masking augmentation (67) is given by:
where is a positive constant, is the -th largest eigenvalue of the following matrix:
(68) |
is the corresponding eigenvector and can be any orthonormal matrix.
Proof We first derive the equivalent form for this objective function:
(69) | ||||
where . Note that by Definition 2.2 we have and follows the Bernoulli distribution, so we have:
(70) |
Again, by Theorem 2.4.8 in Golub and Loan (1996), the optimal solution of Eq.(67) is given by the eigenvalue decomposition of , up to an orthogonal transformation, which finishes the proof.
With Theorem C.1 established, we can now derive the space distance for autoencoders with random masking augmentation.
Theorem C.2
Consider the spiked covariance model Eq.(5), under Assumptions 3.4-3.6 and , and let be the learned representation of the augmented autoencoder with singular value decomposition (i.e., the optimal solution of optimization problem (67)). If we further assume that are different from each other and that for some universal constant , then there exist two universal constants such that when , we have
(71) |
Proof Step 1. Similar to the proof of Theorem B.7, we first bound the difference between and . Note that:
(72) |
Since is a diagonal matrix, then by Lemma B.8 we have:
(73) |
Now, directly applying Equations (41), (42), and (43), we obtain:
(74) |
Step 2. Bound the distance between eigenspaces. As we have shown in Step 1, the target matrix of the autoencoder is close to the covariance matrix of the random noise, i.e., . Note that is assumed to be a diagonal matrix with distinct elements, hence its eigenspace only consists of canonical basis vectors . Denote as the top- eigenspace of and as its corresponding basis vectors; applying the Davis-Kahan theorem (Lemma E.1), we can conclude that:
Step 3. Obtain the final result by the triangular inequality. By Assumption 3.6, we know that the distance between the canonical basis and the eigenspace of the core features can be large:
(75) | ||||
Then, applying the triangular inequality of the sine distance (Proposition A.5), we can obtain the lower bound for the autoencoder.
Assumption 3.5 then implies that when n and d are sufficiently large and is sufficiently small (smaller than a given constant ), there exists a universal constant such that:
Compared with Theorem 3.9, we can find that random masking augmentation makes no difference to autoencoders, which justifies the fairness of our comparison between contrastive learning and autoencoders.
However, in contrast to autoencoders with random masking augmentation, we show that the representations obtained by denoising autoencoders (DAEs) behave like the representations obtained by contrastive learning.
Proof [Proof of Remark 3.12] Let be the loss function of DAEs. Then,
(76) |
We minimize the loss over such that . Then, the loss becomes
Thus, the solution to the (expected) loss minimization problem is the top- eigenvectors of , i.e., , where is any orthogonal matrix from . We use the same argument as in the proof of Theorem B.9. First note that
By Lemmas B.8 and E.3 and the incoherence condition , we have:
(77) |
For the second term, applying equation (42) yields:
(78) |
For the third term, applying equation (43) yields:
(79) |
Here we provide some experimental results for DAEs on synthetic datasets, analogous to Figures 1 and 2; the settings are the same as described in Section 5.1. The results are summarized in Figure 3. As we can observe, the performance of DAEs is comparable to that of contrastive learning, which aligns with our theoretical results above.
[Figure 3 appears here.]
D Omitted proofs for Section 4
D.1 Proofs for Section 4.1
In this section, we will provide the proof of a generalized version of Theorem 4.2 to cover the imbalanced setting, the statement and the detailed proof can be found in Theorem D.2.
In the main body, we assume the unlabeled data and labeled data are both balanced for the sake of clarity and simplicity. Now we allow them to be imbalanced and provide a more general analysis. Suppose we have unlabeled data and labeled data for class ; then the contrastive learning task can be formulated as:
(80) |
In addition, we write a generalized version of the supervised contrastive loss function to cover the imbalanced cases:
(81) |
where is the weight for supervised loss of class . Again we first provide a theorem to give the optimal solution to the contrastive learning problem.
Theorem D.1
The optimal solution of the supervised contrastive learning problem (80) is given by:
where is a positive constant, is the -th largest eigenvalue of the following matrix:
is the corresponding eigenvector and can be any orthonormal matrix.
Proof Under this setting, combined with the result obtained in Corollary 3.2, the contrastive loss can be rewritten as:
Then we deal with the last term separately; note that:
Thus we have:
Then by a similar argument as in the proof of Proposition 3.1, we can conclude that the optimal solution must satisfy the desired conditions.
With the optimal solution obtained in Theorem D.1, we can provide a generalized version of Theorem 4.2 to cover the imbalanced cases.
Theorem D.2 (Generalized version of Theorem 4.2)
Proof [Proof of Theorem D.2]
For the labeled data , we write , where and are two matrices consisting of the class means and the random noise. To be more specific, if belongs to the -th cluster, then and . Since the data are randomly drawn from each class, follows the multinomial distribution over with probabilities . Thus follows a sub-Gaussian distribution with covariance matrix .
As shown in Theorem D.1, the optimal solution of contrastive learning is equivalent to PCA of the following matrix:
Again we will deal with these terms separately,
1. For the first term, as we have discussed, can be divided into two matrices and , each of them consisting of sub-Gaussian columns. Again we can obtain the result as in (56) (the proof is exactly the same):
(82)
2. For the second term, notice that:
(83)
and that:
(84)
Since , we can conclude that:
(85)
Moreover, we have
(86)
Taking Equations (85) and (86) back into Equation (83), we can conclude:
(87)
On the other hand, by Equation (85) we know:
(88)
Notice that:
(89)
Thus, taking Equations (88) and (89) back into Equation (84), we have:
(90)
(91)
Then, combining Equations (82), (87), and (90) together, we can obtain the following result:
Since we have assumed that , we find that the top- eigenspace of the matrix:
is spanned by ; then, applying Lemma E.1 again, we have:
Now we use this result to derive Theorem 4.2. Since and , we approximately have . Although we cannot obtain the closed-form eigenvalues in general, in the special case where , and , it is easy to see that:
which further implies that:
and we can obtain the result in Theorem 4.2.
D.2 Proofs for Section 4.2
In this section, we will provide the proofs of the generalized versions of Theorems 4.5 and 4.8 to cover the imbalanced setting; the statements and detailed proofs can be found in Theorems D.6 and D.8. With the two generalized theorems proven, Theorems 4.5, 4.6, 4.8, and 4.9 hold immediately.
First, we prove a useful lemma to illustrate that the supervised loss function only yields estimation along a 1-dimensional space. Consider a single source task, where the data is generated by the spiked covariance model and the label is generated by
suppose we have collected labeled data from this task; denote the data as and the labels as . Then we have the following result.
Lemma D.3
Under the conditions similar to Theorem 3.10, we can find an event such that and:
(92) |
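Before turning to the proof, the message of Lemma D.3 can be illustrated with a small simulation under illustrative model constants: the cross-covariance between the data and the labels of a single source task concentrates around one direction inside the core subspace, so one supervised task pins down only a one-dimensional piece of the core features.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, r = 5000, 40, 3

U = np.linalg.qr(rng.normal(size=(d, r)))[0]            # core feature directions
z = rng.normal(size=(n, r))                              # latent core features
X = 3.0 * z @ U.T + 0.5 * rng.normal(size=(n, d))        # spiked-covariance-style data
theta = np.array([1.0, -2.0, 0.5])                       # regression coefficients of one source task
y = z @ theta + 0.1 * rng.normal(size=n)                 # labels of that task

v = X.T @ y / n                                          # empirical cross-covariance with the label
v /= np.linalg.norm(v)

print("alignment with U @ theta   :", round(abs(v @ (U @ theta) / np.linalg.norm(theta)), 4))
print("coordinates of v in U basis:", np.round(U.T @ v, 3))
print("theta / ||theta||          :", np.round(theta / np.linalg.norm(theta), 3))
```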
The proof strategy is to estimate the difference between the two rank-1 matrices via bounding the difference of the corresponding vector component. We first provide a simple lemma to illustrate the technique:
Lemma D.4
Suppose are two vectors, then we have:
Proof Denote , then we have:
Taking the square root on both sides finishes the proof.
Now we can prove Lemma D.3.
Proof [Proof of Lemma D.3] Clearly, we have:
(93) | ||||
thus we can replace the with in Equation (92) and conclude the proof. Denote ; note that both and are rank-1 matrices. We first bound the difference between and :
(94) | ||||
We deal with the four terms in (94) separately:
1. For the first term, applying Lemma E.3 we have:
(95)
2. For the second term, applying Lemma E.2 twice we have:
(96)
3. For the third and fourth terms, from Equation (44) we know:
(97)
Combining Equations (95), (96), and (97) together, we have:
(98) |
With equation (98), we can now turn to the difference between and . By Lemma D.4 we know that:
Using Markov’s inequality, we can conclude from (98) that:
Then denote we have:
which finishes the proof.
In the main body, we assume that the number of labeled data and the ratio of the loss function are both balanced. Now we provide a more general result to cover the imbalanced cases.
Formally, suppose we have unlabeled data and labeled data for source task , we learn the linear representation via joint optimization:
(99) |
To investigate its feature recovery ability, we first give the following result.
Theorem D.5
For the optimization problem (99), if we apply augmented pairs generation in Definition 2.1 with random masking augmentation 2.2 for unlabeled data, then the optimal solution is given by:
where is a constant, is the -th largest eigenvalue of the following matrix:
is the corresponding eigenvector, can be any orthogonal matrix and is the centering matrix.
Proof Under this setting, combined with the result obtained in Corollary 3.2, the loss function can be rewritten as:
Then by a similar argument as in the proof of Proposition 3.1, we can conclude that the optimal solution must satisfy the desired conditions.
Then we can give the proofs of Theorems 4.5 and 4.8 under our generalized setting; one can easily obtain those under the balanced setting by simply setting and , which is consistent with Theorems 4.5 and 4.8 in the main body.
Theorem D.6 (Generalized version of Theorem 4.5)
In the regression setting where , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5) and , if we further assume that and ’s are orthogonal to each other, and let be any solution that optimizes the problem in Equation (99), and denote its singular value decomposition as , then we have:
Proof [Proof of Theorem D.6] As shown in Theorem D.5, optimizing the loss function (99) is equivalent to finding the top- eigenspace of the matrix
Again denote and . By equation (56) we know that:
By Lemma D.3, we know that for each task , we can find an event such that :
The target matrix is , and we can obtain the upper bound for the difference between and :
(100) | ||||
We divide the top- eigenspace of into two parts: the top- eigenspace and top- to top- eigenspace . Similarly, we also divide the top- eigenspace of into two parts: and . Then applying Lemma E.1 we can bound the sine distance for each part: on the one hand,
On the other hand,
Note that:
and the sine distance has trivial upper bounds:
Thus we can conclude:
Theorem D.7 (Restatement of Theorem 4.6)
Suppose the conditions in Theorem 4.5 hold. Then,
(101) | |||
(102) |
Theorem D.8 (Generalized version of Theorem 4.8)
Proof [Proof of Theorem D.8] The proof strategy is similar to that of Theorem 4.5; here the difference is that each direction can be accurately estimated from the labeled data, and we do not need to separate the eigenspace. Directly applying Lemma E.1 and Equation (100), we have:
Now we move to a binary classification setting, where labels are generated by instead of as in the previous regression setting. We first give the corresponding generalized versions of Theorems 4.10 and 4.11 to cover general imbalanced settings.
Theorem D.10 (Generalized version of Theorem 4.10)
In the classification setting where , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5), follows a Gaussian distribution, and , if we further assume that and ’s are orthogonal to each other, and let be any solution that optimizes the problem in Equation (99), and denote its singular value decomposition as , then we have:
Theorem D.11 (Generalized version of Theorem 4.11)
In the classification setting where , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5), follows a Gaussian distribution, and , if we further assume that and is full rank, suppose is the optimal solution of optimization problem Equation (99), and denote its singular value decomposition as , then we have:
The only difference between these two settings is the distribution of the labels . Thus, to prove Theorems D.10 and D.11, we only need to establish the analog of Lemma D.3 in this binary classification setting. Since in the classification setting the labels are discrete and harder to analyze, we make the Gaussian assumption on to keep the problem mathematically tractable in these two theorems.
Lemma D.12 (Classification version of Lemma D.3)
Proof Again, by (93) we have:
thus we can replace the with in Equation (104) and conclude the proof. Denote ; note that both and are rank-1 matrices. We first bound the difference between and :
(105) | ||||
We deal with the four terms in (105) separately:
1. For the first term, note that and , thus follows a folded Gaussian distribution, which is a reflection of the standard Gaussian distribution along the normal plane of ; thus
(106)
2. For the second term, note that and are independent and almost surely
(107)
3. For the third and fourth terms, we have:
(108)
Combining Equations (106), (107), and (108) together, we have:
(109) |
With equation (109), we can now turn to the difference between and . By Lemma D.4 we know that:
Using Markov’s inequality, we can conclude from (109) that:
Then denote we have:
which finishes the proof.
With Lemma D.12 established, it is straightforward to obtain the same results as in Theorems 4.5, 4.6, 4.8, and 4.9 for this binary classification setting.
E Useful lemmas
In this section, we list some of the main techniques that have been used in the proof of the main results.
Lemma E.1 (Theorem 2 in Yu et al. (2015))
Let be symmetric, with eigenvalues and respectively. Fix and assume that where and Let , and let and have orthonormal columns satisfying and for Then
Moreover, there exists an orthogonal matrix such that
Lemma E.2 (Lemma 2 in Zhang et al. (2018))
Assume that has independent sub-Gaussian entries, and assume that
Let be a fixed orthogonal matrix. Then
Lemma E.3 (Theorem 6 in Cai et al. (2020))
Suppose is a -by- random matrix with independent mean-zero sub-Gaussian entries. If there exist such that for constant , then
Lemma E.4 (The Eckart-Young-Mirsky Theorem (Eckart and Young, 1936))
Suppose that is the singular value decomposition of . Then the best rank- approximation of the matrix w.r.t the Frobenius norm, , is given by
that is, for any matrix of rank at most k
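A short numerical illustration of this statement, with arbitrary matrix sizes, is given below: the truncated SVD attains the Frobenius error given by the trailing singular values, and any other rank-k matrix does at least as badly.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(8, 6))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]              # best rank-k approximation

B = rng.normal(size=(8, k)) @ rng.normal(size=(k, 6))     # an arbitrary rank-k competitor
print("||A - A_k||_F          :", np.linalg.norm(A - A_k))
print("sqrt(sum of tail s_i^2):", np.sqrt(np.sum(s[k:] ** 2)))   # equals the line above
print("||A - B||_F            :", np.linalg.norm(A - B))         # at least as large
```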
References
- Agarwal et al. (2020) Anish Agarwal, Devavrat Shah, and Dennis Shen. On principal component regression in a high-dimensional error-in-variables setting. arXiv preprint arXiv:2010.14449, 2020.
- Arora et al. (2019) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
- Bai and Yao (2012) Zhidong Bai and Jianfeng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177, 2012.
- Ballard (1987) Dana H Ballard. Modular learning in neural networks. In AAAI, volume 647, pages 279–284, 1987.
- Barshan et al. (2011) Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357–1371, 2011.
- Bing et al. (2021) Xin Bing, Florentina Bunea, Seth Strimas-Mackey, and Marten Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond. Journal of Machine Learning Research, 22(177):1–50, 2021.
- Bourlard and Kamp (1988) Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(4):291–294, 1988.
- Cai and Zhang (2018) T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.
- Cai et al. (2020) T Tony Cai, Rungang Han, and Anru R Zhang. On the non-asymptotic concentration of heteroskedastic wishart-type matrix. arXiv preprint arXiv:2008.12434, 2020.
- Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
- Ćevid et al. (2020) Domagoj Ćevid, Peter Bühlmann, and Nicolai Meinshausen. Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research, 21:232, 2020.
- Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020a.
- Chen et al. (2020b) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.
- Chen et al. (2020c) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.
- Coates et al. (2011) Adam Coates, A. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
- Deshpande and Montanari (2014) Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse pca. In 2014 IEEE International Symposium on Information Theory, pages 2197–2201. IEEE, 2014.
- Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. Advances in neural information processing systems, 27:766–774, 2014.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Du et al. (2020) Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
- Eckart and Young (1936) Carl Eckart and G. Marion Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.
- Fan et al. (2019) Jianqing Fan, Cong Ma, and Yiqiao Zhong. A selective overview of deep learning. arXiv preprint arXiv:1904.05526, 2019.
- Feizi et al. (2020) Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding gans in the lqg setting: Formulation, generalization and stability. IEEE Journal on Selected Areas in Information Theory, 1(1):304–311, 2020.
- Fu et al. (2022) Daniel Y Fu, Mayee F Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. The details matter: Preventing class collapse in supervised contrastive learning. In Computer Sciences & Mathematics Forum, volume 3, page 4. MDPI, 2022.
- Garg and Liang (2020) Siddhant Garg and Yingyu Liang. Functional regularization for representation learning: A unified theoretical perspective. Advances in Neural Information Processing Systems, 33:17187–17199, 2020.
- Golub and Loan (1996) Gene H. Golub and Charles Van Loan. Matrix computations (3rd ed.). 1996.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- Graf et al. (2021) Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised constrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.
- Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
- Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
- Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
- Han et al. (2018) Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
- HaoChen et al. (2021) Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. arXiv preprint arXiv:2106.04156, 2021.
- He et al. (2016) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- He et al. (2021) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
- He et al. (2018) Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3d object retrieval. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1945–1954, 2018.
- Hyvärinen et al. (2009) Aapo Hyvärinen, Jarmo Hurri, Patrik O Hoyer, Aapo Hyvärinen, Jarmo Hurri, and Patrik O Hoyer. Independent component analysis. Springer, 2009.
- Islam et al. (2021) Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Richard J. Radke, and Rogério Schmidt Feris. A broad study on the transferability of visual representations with contrastive learning. ArXiv, abs/2103.13517, 2021.
- Jaiswal et al. (2021) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2021.
- Jing et al. (2021) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.
- Johnstone (2001) Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of statistics, 29(2):295–327, 2001.
- Jolliffe (1982) Ian T Jolliffe. A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(3):300–303, 1982.
- Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
- Lee et al. (2021) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
- Li et al. (2021) Yazhe Li, Roman Pogodin, Danica J Sutherland, and Arthur Gretton. Self-supervised learning with kernel dependence maximization. Advances in Neural Information Processing Systems, 34, 2021.
- Liu et al. (2021) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021.
- Marsaglia (1972) George Marsaglia. Choosing a point from the surface of a sphere. Annals of Mathematical Statistics, 43:645–646, 1972.
- Misra and Maaten (2020) Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Pirjol (2013) Dan Pirjol. The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, 237(1):460–469, 2013.
- Plaut (2018) Elad Plaut. From principal subspaces to principal components with linear autoencoders. arXiv preprint arXiv:1804.10253, 2018.
- Refinetti and Goldt (2022) Maria Refinetti and Sebastian Goldt. The dynamics of representation learning in shallow, non-linear autoencoders. arXiv preprint arXiv:2201.02115, 2022.
- Saunshi et al. (2022) Nikunj Saunshi, Jordan Ash, Surbhi Goel, Dipendra Misra, Cyril Zhang, Sanjeev Arora, Sham Kakade, and Akshay Krishnamurthy. Understanding contrastive learning requires incorporating inductive biases. In International Conference on Machine Learning, pages 19250–19286. PMLR, 2022.
- Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
- Sohn (2016) Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in Neural Information Processing Systems, 29, 2016.
- Song et al. (2007a) Le Song, Arthur Gretton, Karsten Borgwardt, and Alex Smola. Colored maximum variance unfolding. Advances in Neural Information Processing Systems, 20:1385–1392, 2007a.
- Song et al. (2007b) Le Song, Alex Smola, Arthur Gretton, and Karsten M Borgwardt. A dependence maximization view of clustering. In Proceedings of the 24th International Conference on Machine Learning, pages 815–822, 2007b.
- Song et al. (2007c) Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning, pages 823–830, 2007c.
- Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 776–794. Springer, 2020.
- Tian (2022) Yuandong Tian. Understanding deep contrastive learning via coordinate-wise optimization. Advances in Neural Information Processing Systems, 2022.
- Tian et al. (2021) Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810, 2021.
- Tosh et al. (2021) Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pages 1179–1206. PMLR, 2021.
- Tripuraneni et al. (2021) Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
- Tsai et al. (2020) Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
- Wang and Isola (2020) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
- Wang et al. (2021) Xiang Wang, Xinlei Chen, Simon S Du, and Yuandong Tian. Towards demystifying representation learning with non-contrastive self-supervision. arXiv preprint arXiv:2110.04947, 2021.
- Wen and Li (2021) Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. arXiv preprint arXiv:2105.15134, 2021.
- Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
- Yao et al. (2015) Jianfeng Yao, Shurong Zheng, and Zhidong Bai. Sample covariance matrices and high-dimensional data analysis. Cambridge University Press, 2015.
- Ye et al. (2019) Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.
- Yu et al. (2015) Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
- Zhang et al. (2018) Anru R Zhang, T Tony Cai, and Yihong Wu. Heteroskedastic PCA: Algorithm, optimality, and applications. arXiv preprint arXiv:1810.08316, 2018.
- Zimmermann et al. (2021) Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.