\AtAppendix

The Power of Contrast for Feature Learning:
A Theoretical Analysis

\nameWenlong Ji \email[email protected]
\addrDepartment of Statistics
Stanford University
Stanford, CA 94305, USA \AND\nameZhun Deng \email[email protected]
\addrDepartment of Computer Science
Columbia University
New York, NY 10027, USA \AND\nameRyumei Nakada \email[email protected]
\addrDepartment of Statistics
Rutgers University
Piscataway, NJ 08854, USA \AND\nameJames Zou \email[email protected]
\addrDepartment of Biomedical Data Science
Stanford University
Stanford, CA 94305, USA \AND\nameLinjun Zhang \email[email protected]
\addrDepartment of Statistics
Rutgers University
Piscataway, NJ 08854, USA

Abstract

Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, theoretical understanding of the superiority of contrastive learning is still limited. In this paper, under linear representation settings, (i) we provably show that contrastive learning outperforms the standard autoencoders and generative adversarial networks, two classical generative unsupervised learning methods, for both feature recovery and in-domain downstream tasks; (ii) we also illustrate the impact of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations in the in-domain downstream task, but it can harm the performance in transfer learning. We verify our theory with numerical experiments.

Keywords: Self-Supervised Learning, Contrastive Learning, Principal Component Analysis, Spiked Covariance Model, Supervised Contrastive Learning

1 Introduction

Deep supervised learning has achieved great success in various applications, including computer vision (Krizhevsky et al., 2012), natural language processing (Vaswani et al., 2017), and scientific computing (Han et al., 2018). However, its dependence on manually assigned labels, which is usually difficult and costly, has motivated research into alternative approaches to exploit unlabeled data. Self-supervised learning is a promising approach that leverages the unlabeled data itself as supervision and learns representations that are beneficial to potential in-domain downstream tasks.

At a high level, there are two common approaches for feature extraction in self-supervised learning: generative and contrastive (Liu et al., 2021; Jaiswal et al., 2021). Both approaches aim to learn latent representations of the original data, while the difference is that the generative approach focused on minimizing the reconstruction error from latent representations, and the contrastive approach targets to decrease the similarity between the representations of contrastive pairs constructed by data augmentation. Recent works have shown the benefits of contrastive learning in practice (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b, c). However, these works did not explain the popularity of contrastive learning — what is the advantage of contrastive learning and where does it come from?

Additionally, recent works aim to further improve contrastive learning by introducing label information. Specifically, Khosla et al. (2020) proposed the supervised contrastive learning, where the contrasting procedures are performed across different classes rather than different instances. With the help of label information, their proposed method outperforms self-supervised contrastive learning and classical cross-entropy-based supervised learning. However, despite this improvement in in-domain downstream tasks, Islam et al. (2021) found that such improvement in transfer learning is limited and even negative for such supervised contrastive learning. This phenomenon motivates us to rethink the impact of labeled data in the contrastive learning framework.

In this paper, we first establish a theoretical framework to study contrastive learning under the linear representation setting. Under this framework, we provide a theoretical analysis of the feature learning performance of the contrastive learning on the spiked covariance model (Bai and Yao, 2012; Yao et al., 2015; Zhang et al., 2018) and theoretically justify why contrastive learning outperforms standard autoencoders and generative adversarial networks (GANs) (Goodfellow et al., 2014) —contrastive learning is able to remove more noise by constructing contrastive samples via augmentations. Moreover, we investigate the impact of label information in the contrastive learning framework and provide a theoretical justification of why labeled data help to gain accuracy in in-domain regression and classification while can hurt multi-task transfer learning.

1.1 Related Works

The idea of contrastive learning was firstly proposed in Hadsell et al. (2006) as an effective method to perform dimensional reduction. Following this line of research, Dosovitskiy et al. (2014) proposed to perform instance discrimination by creating surrogate classes for each instance and Wu et al. (2018) further proposed to preserve a memory bank as a dictionary of negative samples. Other extensions based on this memory bank approach include He et al. (2020); Misra and Maaten (2020); Tian et al. (2020); Chen et al. (2020c). Rather than keeping a costly memory bank, another line of work exploits the benefit of mini-batch training where different samples are treated as negative to each other (Ye et al., 2019; Chen et al., 2020a). Moreover, Khosla et al. (2020) explores the supervised version of contrastive learning where pairs are generated based on label information.

Despite its success in practice, the theoretical understanding of contrastive learning is still limited. Previous works provide provable guarantees for contrastive learning under conditional independence assumption (or its variants) (Arora et al., 2019; Lee et al., 2021; Tosh et al., 2021; Tsai et al., 2020). Specifically, they assume the two contrastive views are independent conditioned on the label and show that contrastive learning can provably learn representations beneficial for in-domain downstream tasks. In addition to this line of research, there exist several alternative perspectives for studying the theoretical properties of contrastive learning. To name a few, Wang and Isola (2020); Graf et al. (2021) explored the representation geometry, HaoChen et al. (2021) analyzed the augmentation graph, Tian (2022) proposed a two-player game theory framework, Zimmermann et al. (2021) demonstrated the connection between contrastive learning and nonlinear Independent Component Analysis (Hyvärinen et al., 2009), Saunshi et al. (2022) showed that the importance of inductive bias in contrastive learning, and Jing et al. (2021) investigated the dimensional collapse phenomenon. Furthermore, Tian et al. (2021); Wang et al. (2021) have also explored the ability of self-supervised learning to learn features even without contrastive pairs, specifically in the context of linear representation settings.

More relevant to this paper, Wen and Li (2021) considered representation learning under the sparse coding model and studied the optimization properties in shallow ReLU neural networks. However, the assumptions that features are extremely sparse and signals follow Gaussian distribution seem strong for real data. Garg and Liang (2020) studied the combination of supervised learning and self-supervised learning. They derived sample complexity bounds in a PAC-learning style for various settings. Specifically, the authors assume that there is a ground-truth representation such that it can keep both self-supervised loss and supervised loss at a very low threshold. However, as the authors admit, it is hard to determine such a threshold in practical settings. For example, since the unlabeled data and labeled data come from different domains, such as Image-Net and CIFAR-10, domain-specific features may have a much lower loss compared with domain-transferable features.

While the aforementioned previous works aim to demonstrate that contrastive learning is capable of learning meaningful representations, it was left untouched why contrastive learning outperforms other representation learning methods. We also shed light on the impact of labeled data in a contrastive learning framework, which is underexplored in prior works. A detailed comparison with existing literature is deferred to Appendix A.1.

1.2 Outline

This paper is organized as follows. Section 2 provides the setup for the data-generating process and the loss function. In Section 3, we review the connection between PCA and autoencoders/GANs. We also establish a theoretical framework to study contrastive learning in the linear representation setting. Under this framework, we evaluate the feature recovery performance and in-domain downstream task performance of contrastive learning and autoencoders. In Section 4, we analyze the supervised contrastive learning. In Section 5, we verify our theoretical results given in Sections 3 and 4. Finally, we summarize our analysis and provide future directions in Section 6.

1.3 Notations

In this paper, we use $O,\Omega,\Theta$ to hide universal constants and we write $a_{k}\lesssim b_{k}$ for two sequences of positive numbers $\{a_{k}\}$ and $\{b_{k}\}$ if and only if there exists a universal constant $C>0$ such that $a_{k}<Cb_{k}$ for any $k$ . We write $a_{k}\asymp b_{k}$ when $a_{k}\lesssim b_{k}$ and $a_{k}\gtrsim b_{k}$ holds simultaneously. We use $\|\cdot\|,\|\cdot\|_{2},\|\cdot\|_{F}$ to represent the $\ell_{2}$ norm of vectors, the spectral norm of matrices, and Frobenius norm of matrices respectively. Let $\mathbb{O}_{d,r}$ be a set of $d\times r$ orthogonal matrices. Namely, $\mathbb{O}_{d,r}\triangleq\{U\in\mathbb{R}^{d\times r}:U^{\top}U=I_{r}\}$ . We write $n\gg d$ when there exists a sufficiently small constant $c$ depending on the constant and independent of $n$ , $d$ and $r$ such that $d/n<c$ holds. $d\gg r$ is defined similarly. We use $|A|$ to denote the cardinality of a set $A$ . For any $n\in\mathbb{N}^{+}$ , let $[n]=\{1,2,\cdots,n\}$ . We use $\|\sin\Theta(U_{1},U_{2})\|_{F}$ to refer to the sine distance between two orthogonal matrices $U_{1},U_{2}\in\mathbb{O}_{d,r}$ , which is defined by: $\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\triangleq\left\|U_{1\perp}^{\top}U_{2}\right\|_{F}$ , where $U_{1\perp}\in\mathbb{O}_{d-r,r}$ is any orthogonal complement of $U_{1}$ . More properties of sine distance can be found in Section A.3. We use $\{e_{i}\}_{i=1}^{d}$ to denote the canonical basis in $d$ -dimensional Euclidean space $\mathbb{R}^{d}$ , that is, $e_{i}$ is the vector whose $i$ -th coordinate is $1$ and all the other coordinates are $0$ . Let $\mathbb{I}\{A\}$ be an indicator function that takes $1$ when $A$ is true, otherwise takes $0$ . We write $a\vee b$ and $a\wedge b$ to denote $\max(a,b)$ and $\min(a,b)$ , respectively.

2 Setup

Here we introduce loss and data-generative models that will be used for the theoretical analysis later.

2.1 Linear Representation Settings for Contrastive Learning

Given an input $x\in\mathbb{R}^{d}$ , contrastive learning aims to learn a low-dimensional representation $h=f(x;\theta)\in\mathbb{R}^{r}$ by contrasting different samples, that is, maximizing the agreement between positive pairs, and minimizing the agreement between negative pairs. Suppose we have $n$ data points $X=[x_{1},x_{2},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ from the population distribution $\mathcal{D}$ . The contrastive learning task can be formulated to the following optimization problem:

\min_{\theta}\mathcal{L}(\theta)=\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell(x_{i},\mathcal{B}_{i}^{Pos},\mathcal{B}_{i}^{Neg};f(\cdot;\theta))+\lambda R(\theta),

(1)

where $\ell(\cdot)$ is a contrastive loss and $\lambda R(\theta)$ is a regularization term; $\mathcal{B}_{i}^{Pos},\mathcal{B}_{i}^{Neg}$ are the sets of positive samples and negative samples corresponding to $x_{i}$ , the details of which are described below.

Linear Representation and Regularization Term

We consider the linear representation function $f(x;W)=Wx$ , where the parameter $\theta$ is a matrix $W\in\mathbb{R}^{r\times d}$ . This linear representation setting has been widely adopted in other theory papers to understand self-supervised contrastive learning (Jing et al., 2021; Wang et al., 2021; Tian et al., 2021) and shed light upon other complex machine learning phenomena such as in Tripuraneni et al. (2021). Moreover, since regularization techniques have been widely adopted in contrastive learning practice (Chen et al., 2020a; He et al., 2020; Grill et al., 2020), we further consider penalizing the representation by a regularization term $R(W)=\|WW^{\top}\|_{F}^{2}/2$ to encourage the orthogonality of $W$ and therefore promote the diversity of $w_{i}$ to learn different representations. The reason we use such quadratic regularization instead of a standard $\ell_{2}$ regularization is to encourage a diverse representation in the linear representation setting by penalizing on the similarity $\langle w_{i},w_{j}\rangle^{2}$ , we defer a formal discussion and numerical experiments about this regularization in the Appendix A.2.

Linear Contrastive Loss

The contrastive loss is set to be the average similarity (measured by the inner product) between positive pairs minus that between negative pairs:

\ell(x,\mathcal{B}_{x}^{Pos},\mathcal{B}_{x}^{Neg},f(\cdot;\theta))=-\sum_{x^{Pos}\in\mathcal{B}_{x}^{Pos}}\frac{\langle f(x,\theta),f(x^{Pos},\theta)\rangle}{|\mathcal{B}_{x}^{Pos}|}+\sum_{x^{Neg}\in\mathcal{B}_{x}^{Neg}}\frac{\langle f(x,\theta),f(x^{Neg},\theta)\rangle}{|\mathcal{B}_{x}^{Neg}|},

(2)

where $\mathcal{B}_{x}^{Pos},\mathcal{B}_{x}^{Neg}$ are sets of positive samples and negative samples corresponding to $x$ . This loss function has been commonly used in contrastive learning (Hadsell et al., 2006) and metric learning (Schroff et al., 2015; He et al., 2018). In Khosla et al. (2020), the authors show that the inner-product based linear loss (2) is an approximation of the NT-Xent contrastive loss when one positive and one negative are used, which has been highlighted in recent contrastive learning practice (Sohn, 2016; Wu et al., 2018; Oord et al., 2018; Chen et al., 2020a). In Li et al. (2021), the authors proposed the SSL-HSIC contrastive loss, which can be reduced to this linear loss when the kernel $k(\cdot,\cdot)$ is chosen to be a simple inner product. Following Li et al. (2021), we provide the results in Table 1, which shows that linear contrastive loss can also work well with some additional training techniques.

Testing Accuracy	InfoNCE	Linear contrastive loss
CIFAR10	$65.11\pm 0.51$	$\mathbf{66.07\pm 0.46}$
STL10	$\mathbf{71.02\pm 0.47}$	$70.30\pm 0.31$

Table 1: InfoNCE loss v.s. Linear contrastive loss. We train a ResNet-18 encoder on CIFAR-10 and STL-10 datasets with different contrastive loss functions. To train the linear contrastive loss, we follow the HSIC regularization techniques used in Li et al. (2021), which helps linear contrastive loss yield comparable performance to standard InfoNCE. We repeat each experiment for ten runs and report the mean and standard deviation of accuracy. Detailed experimental settings can be found in Section 5.2.

2.2 Generation of Positive and Negative Pairs

There are two common approaches to generating positive and negative pairs, depending on whether or not label information is available. When the label information is not available, the typical strategy is to generate different views of the original data via augmentation (Hadsell et al., 2006; Chen et al., 2020a). Two views of the same data point serve as the positive pair for each other, while those of different data serve as negative pairs.

Definition 2.1 (Augmented Pairs Generation in the Self-supervised Setting)

Given two augmentation functions $g_{1},g_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ and $n$ training samples $\mathcal{B}=\{x_{i}\}_{i\in[n]}$ , the augmented views are given by: $\{(g_{1}(x_{i})$ , $g_{2}(x_{i}))\}_{i\in[n]}$ . Then for each view $g_{v}(x_{i})$ , $v=1,2$ , the corresponding positive samples and negative samples are defined by: $\mathcal{B}_{i,v}^{Pos}=\{g_{s}(x_{i}):s\in[2]\setminus\{v\}\}$ and $\mathcal{B}_{i,v}^{Neg}=\{g_{s}(x_{j}):s\in[2],j\in[n]\setminus\{i\}\}$ .

The loss function of the self-supervised contrastive learning problem can then be written as:

\mathcal{L}_{\text{SelfCon}}(W)\!=-\!\frac{1}{2n}\sum_{i=1}^{n}\sum_{v=1}^{2}\biggl{[}\langle Wg_{v}(x_{i}),Wg_{[2]\setminus\{v\}}(x_{i})\rangle\!-\!\sum_{j\neq i}\sum_{s=1}^{2}\frac{\langle Wg_{v}(x_{i}),Wg_{s}(x_{j})\rangle}{2n-2}\biggr{]}\!+\!\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}.

(3)

In particular, we adopt the following augmentation in our analysis.

Definition 2.2 (Random Masking Augmentation)

The two views of the original data are generated by randomly dividing its dimensions into two sets, that is, $g_{1}(x_{i})=Ax_{i},\text{~{}and~{}}g_{2}(x_{i})=(I-A)x_{i}$ , where $A=\operatorname{diag}(a_{1},\cdots,a_{d})\in\mathbb{R}^{d\times d}$ is the diagonal masking matrix with $\{a_{i}\}_{i=1}^{d}$ being $i.i.d.$ random variables sampled from a Bernoulli distribution with mean $1/2$ .

Remark 2.3

In this paper, we focus on random masking augmentation, which has also been used in other works on the theoretical understanding of contrastive learning, eg. Wen and Li (2021). However, our primary interest lies in comparing the performance of contrastive learning with autoencoders and analyzing the impact of labeled data, while their work focuses on understanding the training process of neural networks in contrastive learning. Random masking augmentation is an analog of the random cropping augmentation used in practice. As shown in Chen et al. (2020a), cropping augmentation achieves overwhelming performance on linear evaluation (ImageNet top-1 accuracy) compared with other augmentation methods, please see Figure 5 in Chen et al. (2020a) for details.

When the label information is available, Khosla et al. (2020) proposed the following approach to generate positive and negative pairs.

Definition 2.4 (Pairs Generation in the Supervised Setting)

In a $K$ -class classification problem, given $n_{k}$ samples for each class $k\in[K]$ : $\{x_{i}^{k}:i\in[n_{k}]\}_{k=1}^{K}$ and let $n=\sum_{k=1}^{K}n_{k}$ , the corresponding positive samples and negative samples for $x_{i}^{k}$ are defined by $\mathcal{B}_{i,k}^{Pos}=\{x_{j}^{k}:j\in[n_{k}]\setminus i\}$ and $\mathcal{B}_{i,k}^{Neg}=\{x_{j}^{s}:s\in[K]\setminus k,j\in[n_{s}]\}$ . That is, the positive samples are the remaining ones in the same class with $x^{k}_{i}$ and the negative samples are the samples from different classes.

Correspondingly, the loss function of the supervised contrastive learning problem can be written as:

\mathcal{L}_{\text{SupCon}}(W)=-\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\biggl{[}\sum_{j\neq i}\frac{\langle Wx_{i}^{k},Wx_{j}^{k}\rangle}{n-1}-\sum_{j=1}^{n}\sum_{s\neq k}\frac{\langle Wx_{i}^{k},Wx_{j}^{s}\rangle}{n(K-1)}\biggr{]}+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}.

(4)

2.3 Data Generating Process

In real-world scenarios, data often comprises both signal (relevant information) and noise (irrelevant distractions). For instance, in image classification, the signal might be the primary subject of interest, while the noise could represent background elements. Self-supervised learning methods, without predefined tasks, aim to extract generalized patterns from data, ideally capturing as much of the signal as possible. It is commonly understood that signals tend to exhibit specific low-complexity structures, often being low-rank and showing higher correlations across coordinates. In contrast, background noise might lack a distinct structure, potentially being dense (or full rank) with lower coordinate correlations. To delve into this structural difference more rigorously, we consider an additive data-generating model. Here, the observed data emerges as a combination of a low-rank signal and dense noise.

x=U^{\star}z+\xi,\quad\mathrm{Cov}(z)=\nu^{2}I_{r},\quad\mathrm{Cov}(\xi)=\Sigma,

(5)

where $z\in\mathbb{R}^{r}$ and $\xi\in\mathbb{R}^{d}$ are both zero mean sub-Gaussian independent random variables, and $\nu\in\mathbb{R}$ is a constant represents the signal strength. In particular, $U^{\star}\in\mathbb{O}_{d,r}$ and $\Sigma=\operatorname{diag}(\sigma_{1}^{2},\cdots,\sigma_{d}^{2})$ . The first term $U^{\star}z$ represents the signal of interest residing in a low-dimensional subspace spanned by the columns of $U^{\star}$ . The second term $\xi$ is the dense noise with heteroskedastic noise. Given that, the ideal low-dimensional representation is to compress the observed $x$ into a low-dimensional representation spanned by the columns of $U^{\star}$ . This model is known as the spiked covariance model (Johnstone, 2001; Bai and Yao, 2012; Yao et al., 2015; Zhang et al., 2018). It was proposed from the empirical observation that the eigenvalues of the sample covariance matrix of phoneme data have few ”spikes”, which corresponds to the low-dimensional structure of data generation. The model has been used in the literature of PCA (Johnstone, 2001; Deshpande and Montanari, 2014; Zhang et al., 2018) and Contrastive Learning (Wen and Li, 2021).

In this paper, we aim to learn a good projection $W\in\mathbb{R}^{r\times d}$ onto a lower-dimensional subspace from the observation $x$ . Since the information of $W$ is invariant with the transformation $W\leftarrow OW$ for any $O\in\mathbb{O}_{r,r}$ , the essential information of $W$ is contained in the right eigenvector of $W$ . Thus, we quantify the goodness of the representation $W$ using the sine distance $\|\sin\Theta(U,U^{\star})\|_{F}$ , where $U$ is the top- $r$ right eigenspace of $W$ . It is notable that we only assume that noise and signal follow a sub-Gaussian distribution. This includes bounded noise/signals such as images, sound data, or text data.

3 Comparison of Self-Supervised Contrastive Learning and Autoencoders/GANs

Generative and contrastive learning are two popular approaches of self-supervised learning. Recent experiments have highlighted the improved performance of contrastive learning compared with the generative approach. For example, in Figure 1 of Chen et al. (2020a) and Figure 7 of Liu et al. (2021), it is observed that state-of-the-art contrastive self-supervised learning has more than 10 percent improvement over state-of-the-art generative self-supervised learning, with the same number of parameters. In this section, we rigorously demonstrate the advantage of contrastive learning over autoencoders/GANs, the representative methods in generative self-supervised learning, by investigating the linear representation settings under the spiked covariance model (5). The investigation is conducted for both feature recovery and in-domain downstream tasks.

Hereafter, we focus on the linear representation settings. This section is organized as follows: in Section 3.1 we first review the connection between principal component analysis (PCA) and autoencoders/GANs, which are two representative methods in generative approaches in self-supervised learning, under linear representation settings. Then we establish the connection between contrastive learning and PCA in Section 3.2. Based on these connections, we make the comparison between contrastive learning and autoencoder on feature recovery ability (Section 3.3) and in-domain downstream performance (Section 3.4).

3.1 Autoencoders, GANs and PCA

Autoencoders are popular unsupervised learning methods to perform dimensional reduction. Autoencoders learn two functions: encoder $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{r}$ and decoder $g:\mathbb{R}^{r}\rightarrow\mathbb{R}^{d}$ . While the encoder $f$ compresses the original data into low-dimensional features, and the decoder $g$ recovers the original data from those features. It can be formulated to be the following optimization problem for samples $\{x_{i}\}_{i=1}^{n}$ (Ballard, 1987; Fan et al., 2019):

\min_{f,g}\mathbb{E}_{x}\mathcal{L}(x,g(f(x))).

(6)

By minimizing this loss, autoencoders try to preserve the essential features to recover the original data in the low-dimensional representation. In our setting, we consider the class of linear functions for $f$ and $g$ . The loss function is set as the mean squared error. Write $f(x)=W_{\text{AE}}x$ and $g(x)=W_{\text{DE}}x$ . Namely, we consider the following problem.

\min_{W_{\text{AE}},W_{\text{DE}}}\frac{1}{n}\|X-W_{\text{DE}}W_{\text{AE}}X\|_{F}^{2}.

Let $X=(x_{1},\dots,x_{n})\in\mathbb{R}^{d\times n}$ . By Theorem 2.4.8 in Golub and Loan (1996), the optimal solution is given by the eigenspace of $XX^{\top}$ , which exactly corresponds to the result of PCA. Thus, in linear representation settings, autoencoders are equivalent to PCA, which is also often known as undercomplete linear autoencoders (Bourlard and Kamp, 1988; Plaut, 2018; Fan et al., 2019). We write the obtained low-rank representation by autoencoders as

W_{\text{AE}}=(U_{\text{AE}}\Sigma_{\text{AE}}V_{\text{AE}}^{\top})^{\top},

(7)

where $U_{\text{AE}}$ is the top- $r$ eigenvectors of matrix $XX^{\top}$ , $\Sigma_{\text{AE}}$ is a diagonal matrix of spectral values and $V_{\text{AE}}=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

We also note that GANs (Goodfellow et al., 2014) is related to PCA. Namely, Feizi et al. (2020) showed that the global solution for GANs recovers the empirical PCA solution as the generative model.

To see this, let $\mathcal{W}_{2}$ be the second-order Wasserstein distance. Also let $\mathcal{G}$ be the set of linear generator functions from $\mathbb{R}^{r}\to\mathbb{R}^{d}$ . Consider the following $\mathcal{W}_{2}$ GAN optimization problem:

\displaystyle\min_{g\in\mathcal{G}}\mathcal{W}_{2}^{2}(\mathbb{P}_{n},\mathbb{P}_{g(Z)}),

(8)

where $\mathbb{P}_{n}$ denotes the empirical distribution of i.i.d. data $x_{1},\dots,x_{n}\in\mathbb{R}^{d}$ and $\mathbb{P}_{g(Z)}$ is the generated distribution with generator $g$ and $Z\sim N(0,I_{r})$ . Note that the optimization problem Equation (8) can be written as $\min_{\mathbb{P}_{n,Z}}\min_{g\in\mathcal{G}}\mathbb{E}[\|X-g(Z)\|^{2}]$ , where the first minimization is over probability distributions which have marginals $\mathbb{P}_{n}$ and $\mathbb{P}_{Z}$ . By Theorem 2 in Feizi et al. (2020), the optimizer of problem Equation (8) is obtained as $\hat{g}:Z\mapsto\hat{G}Z$ , where $\hat{G}$ satisfies $\hat{G}\hat{G}^{\top}=U_{\text{AE}}\Sigma_{\text{AE}}^{2}U_{\text{AE}}^{\top}$ . This implies $W_{\text{AE}}^{\top}:\mathbb{R}^{r}\to\mathbb{R}^{d}$ is also a solution to the optimization problem Equation (8). Hence GANs learn the PCA solution as a generator.

From this equivalence among ordinary PCA, autoencoders, and GANs, we only focus on autoencoders hereafter for brevity.

3.2 Contrastive Learning and Diagonal-Deletion PCA

Here we bridge PCA and contrastive learning with certain augmentations under the linear representation setting. Recall that the optimization problem for self-supervised contrastive learning is formulated as:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)\!:=-\!\frac{1}{2n}\sum_{i=1}^{n}\sum_{v=1}^{2}\biggl{[}\langle Wg_{v}(x_{i}),Wg_{[2]\setminus\{v\}}(x_{i})\rangle\!-\!\sum_{j\neq i}\sum_{s=1}^{2}\frac{\langle Wg_{v}(x_{i}),Wg_{s}(x_{j})\rangle}{2n-2}\biggr{]}\!+\!\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}.

(9)

To compare contrastive learning with autoencoders, we now derive the solution of the optimization problem (9). We start with the general result for self-supervised contrastive learning with augmented pairs generation in Definition 2.1, and then turn to the special case of random masking augmentation (Definition 2.2).

Proposition 3.1

For two fixed augmentation functions $g_{1},g_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ , denote the augmented data matrices as $X_{1}=[g_{1}(x_{1}),\cdots,g_{1}(x_{n})]\in\mathbb{R}^{d\times n}$ and $X_{2}=[g_{2}(x_{1}),\cdots,g_{2}(x_{n})]\in\mathbb{R}^{d\times n}$ , when the augmented pairs are generated as in Definition 2.1, all the optimal solutions of contrastive learning problem (9) are given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

X_{1}X_{2}^{\top}+X_{2}X_{1}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top},

(10)

$u_{i}$ is the corresponding eigenvector and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

The proof is given in Appendix B.1.

Proposition 3.1 is a general result for augmented pairs generation with fixed and deterministic augmentation functions. The result itself only depends on the augmented data matrices, thus it is straightforward to generalize to the case where different augmentation functions are applied to different samples, we omit it here for the simplicity of notations. Moreover, when the augmentation is sampled from a stochastic distribution, we can also characterize the optimal solution of the expected loss in the same way. Specifically, if we apply the random masking augmentation (2.2), we can further obtain a result to characterize the optimal solution. For any square matrix $A\in\mathbb{R}^{d\times d}$ , we denote $D(A)$ to be $A$ with all off-diagonal entries set to be zero and $\Delta(A)=A-D(A)$ to be $A$ with all diagonal entries set to be zero. Then we have the following corollary for random masking augmentation.

Corollary 3.2

Under the same conditions as in Proposition 3.1, if we use random masking (Definition 2.2) as our augmentation function, then the minimizer of the expected loss function of contrastive learning problem (9) over the distribution of random augmentations (i.e., $\mathbb{E}_{g_{1},g_{2}}\mathcal{L}_{SelfCon}(W)$ ) is given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top},

(11)

$u_{i}$ is the corresponding eigenvector and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

The proof is given in Appendix B.2.

With Proposition 3.1 and Corollary 3.2 established, we can find that the self-supervised contrastive learning equipped with augmented pairs generation and random masking augmentation can eliminate the effect of random noise on the diagonal entries of the observed covariance matrix. Since $\mathrm{Cov}(\xi)=\Sigma$ is a diagonal matrix, when the diagonal entries $\mathrm{Cov}(U^{\star}z)=\nu^{2}U^{\star}U^{\star\top}$ only take a small proportion of the total Frobenius norm, the contrasting augmented pairs will preserve the core features while eliminating most of the random noise and give a more accurate estimation of core features.

3.3 Feature Recovery from Noisy Data

After bridging both autoencoder and contrastive learning with PCA, now we can perform the analysis of feature recovery ability to understand the benefit of contrastive learning over autoencoders. As mentioned above, our target is to recover the subspace spanned by the columns of $U^{\star}$ , which can further help us obtain information on the unobserved $z$ that is important for in-domain downstream tasks. However, the observed data has a covariance matrix of $\nu^{2}U^{\star}U^{\star\top}+\Sigma$ rather than the desired $\nu^{2}U^{\star}U^{\star\top}$ , which brings difficulty to representation learning. We demonstrate that contrastive learning can better exploit the structure of core features and obtain better estimation than autoencoders in this setting.

We start with autoencoders. In the noiseless case, the covariance matrix is $\nu^{2}U^{\star}U^{\star\top}$ and autoencoders can perfectly recover the core features. However, in noisy cases, the random noises sometimes perturb the core features, which makes autoencoders fail to learn the core features. Such noisy cases are widespread in real applications such as measurement errors and backgrounds in images such as grasses and sky. Interestingly, we will later show that contrastive learning can better recover $U^{\star}$ despite the presence of large noise.

To provide rigorous analysis, we first introduce the incoherent constant (Candès and Recht, 2009).

Definition 3.3 (Incoherent Constant)

We define the incoherence constant of $U\in\mathbb{O}_{d,r}$ as

I(U)=\max_{i\in[d]}\left\|e_{i}^{\top}U\right\|^{2}.

(12)

Intuitively, the incoherent constant measures the degree of the incoherence of the distribution of entries among different coordinates, or loosely speaking, the similarity between $U$ and canonical basis $\{e_{i}\}_{i=1}^{d}$ . For uncorrelated random noise, the covariance matrix is diagonal and its eigenspace is exactly spanned by the canonical basis $\{e_{i}\}_{i=1}^{d}$ (if the diagonal entries in $\Sigma$ are all different), which attains the maximum value of the incoherent constant. On the contrary, the core features usually exhibit certain correlation structures and the corresponding eigenspace of the covariance matrix is expected to have a lower incoherent constant.

We then introduce a few assumptions which our theoretical results are built on. Recall that in the spiked covariance model (5), $x=U^{\star}z+\xi$ , $\mathrm{Cov}(z)=\nu^{2}I_{r}$ and $\mathrm{Cov}(\xi)=\operatorname{diag}(\sigma_{1}^{2},\cdots,\sigma_{d}^{2})$ .

Assumption 3.4 (Regular Covariance Condition)

The condition number of covariance matrix $\Sigma=\operatorname{diag}(\sigma_{1}^{2},\cdots,\sigma_{d}^{2})$ satisfies $\kappa:=\sigma_{(1)}^{2}/\sigma_{(d)}^{2}<C,$ where $\sigma_{(j)}^{2}$ represents the $j$ -th largest number among $\sigma_{1}^{2},\cdots,\sigma_{d}^{2}$ and $C>0$ is a universal constant.

Assumption 3.5 (Signal to noise ratio condition)

Define the signal-to-noise ratio $\rho:=\nu/\sigma_{(1)}$ , we assume $\rho=\Theta(1)$ , implying that the covariance of noise is of the same order as that of the core features.

Assumption 3.6 (Incoherent Condition)

The incoherent constant of the core feature matrix $U^{\star}\in\mathbb{O}_{d,r}$ satisfies $I(U^{\star})=O\quantity(r\log d/d).$

The incoherent constant often appears in the literature of matrix completion (Candès and Recht, 2009) and PCA (Zhang et al., 2018). The order of $I(U^{\star})$ can be arbitrary as long as it decreases to $0$ as $d\rightarrow\infty$ . One can directly adapt the later results to this setting. If $U$ is distributed uniformly on $\mathbb{O}_{d,r}$ , then the expectation of incoherent constant is of order $r\log d/d$ .

Lemma 3.7 (Expectation of incoherent constant over a uniform distribution)

\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)=O\quantity(\frac{r}{d}\log d).

(13)

Thus, we set $I(U^{\star})$ to the order $r\log d/d$ for simplicity. The proof is given in Appendix B.6. Here we provide a remark on the implication of our assumptions, and we defer a further discussion on how to generalize our main results under weaker assumptions in Remark 3.11.

Remark 3.8

The three assumptions above can be explained as follows: Assumption 3.4 implies that the variances of all dimensions are of the same order. For Assumption 3.5, we focus on a large noise regime where the noise may hurt the estimation significantly. Here we assume the ratio lies in a constant range, but our theory can easily adapt to the case where $\rho$ has a decreasing order. Specifically, for Theorems 3.9, 3.10, 3.13 and 3.14 presented below, we derive an explicit dependence on $\rho$ of each result in the appendix. One can check Equations (46), (LABEL:dependence_of_rho_CL), (59), (60), (61) and (62) for details. Assumption 3.6 implies a stronger correlation among the coordinates of core features, which is the essential property to distinguish them from random noise.

Now we are ready to present our first result, showing that the autoencoders are unable to recover the core features in the large-noise regime. Due to the equivalence among PCA, autoencoders, and GANs we presented in Section 3.1, for brevity, we only focus on autoencoders hereafter.

Theorem 3.9 (Recovery Ability of Autoencoders, Lower Bound)

Consider the spiked covariance model (5), under Assumptions 3.4-3.6 and $n>d\gg r$ , let $W_{\text{AE}}$ be the learned representation of autoencoders with singular value decomposition $W_{\text{AE}}=(U_{\text{AE}}\Sigma_{\text{AE}}V_{\text{AE}}^{\top})^{\top}$ (as in Equation (7)). If we further assume $\{\sigma_{i}^{2}\}_{i=1}^{d}$ are different from each other and $\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma}$ for some universal constant $C_{\sigma}$ . Then there exist two universal constants $C_{\rho}>0$ , $c\in(0,1)$ , such that when $\rho<C_{\rho}$ , we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{\text{AE}}\right)\right\|_{F}\geq c\sqrt{r}.

(14)

The proof is given in Appendix B.7. The condition $d\gg r$ means that there exists a sufficiently small constant $c>0$ independent of $d$ and $r$ such that $r/d<c$ holds. The additional assumptions $\{\sigma_{i}^{2}\}_{i=1}^{d}$ are different from each other and $\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma}$ for some universal constant $C_{\sigma}$ are made to ensure the identifiability of top- $r$ eigenspace. We need these conditions to guarantee the uniqueness of $U_{\text{AE}}$ . As an extreme example, the top- $r$ eigenspace of the identity matrix can be any $r$ -dimensional subspace and thus not unique. To avoid discussing such arbitrariness of the output, we make these assumptions to guarantee the separability of the eigenspace.

Then we investigate the feature recovery ability of the self-supervised contrastive learning approach.

Theorem 3.10 (Recovery Ability of Contrastive Learning, Upper Bound)

Under the spiked covariance model (5), random masking augmentation in Definition 2.2, Assumptions 3.4-3.6 and $n>d\gg r$ , let $W_{\text{CL}}$ be any solution that minimizes Equation (3), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{\text{CL}}\right)\right\|_{F}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

(15)

The proof is given in Appendix B.9. The two terms in equation (15) can be explained as follows: the first term is due to the shift between the distributions of the augmented data and the original data. Specifically, the random masking augmentation generates two views with disjoint nonzero coordinates and thus can mitigate the influence of random noise on the diagonal entries in the covariance matrix. However, such augmentation slightly hurts the estimation of core features. This bias, appearing as the first term in Equation (15), is measured by the incoherent constant defined in Equation (12). The second term corresponds to the estimation error of the population covariance matrix.

Theorems 3.9 and 3.10 characterize the difference in feature recovery ability between autoencoders and contrastive learning. The autoencoders fail to recover most of the core features in the large-noise regime since $\|\sin\Theta(U,U^{\star})\|_{F}$ has a trivial upper bound $\sqrt{r}$ . In contrast, with the help of data augmentation, the contrastive learning approach mitigates the corruption of random noise while preserving core features. As $n$ and $d$ increase, it yields a consistent estimator of core features and further leads to better performance in the in-domain downstream tasks, as shown in the next section.

Remark 3.11

Here we discuss the potential generalization of our results to the setting with weaker assumptions. Intuitively speaking, the random masking augmentation exploits the prior knowledge that the core features in the original signal are more structural across different coordinates compared with the random noise. Thus the essential requirements are

1.

Noise is less correlated between different coordinates compared with core features.
2.

Core features and noise are very different, i.e., $\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}$ is large.

Those two requirements correspond to the diagonal assumption on $\Sigma$ and incoherent assumption on $U^{\star}$ (Assumption 3.6). In particular, the latter is to give a lower bound for $\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}$ when $\Sigma$ is diagonal and heteroskedasticity. For more general $\Sigma$ and $U^{\star}$ , it suffices to assume $\frac{\|\Delta(\Sigma)\|_{2}}{\nu^{2}}=o(1)$ , and $\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}=\Omega(\sqrt{r})$ , and we can still draw a similar comparison under these assumptions. Notice that by Lemma 3.7, when $U^{\star}$ is randomly chosen we will immediately have $\mathbb{E}\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}=(1-o(1))\sqrt{r}$ . Similar arguments also apply to all of the later results in this paper and we omit them for simplicity.

Remark 3.12

Similar random masking augmentation (as in Definition 2.2) can also apply to autoencoders. Although directly applying this augmentation would not work as well since it will not affect the optimal solution (see discussion in Appendix C), an alternative strategy is to reconstruct the whole data from the masked one. This method was originally proposed as denoising autoencoders (DAEs) for general augmentation Vincent et al. (2008), and was proven to be powerful with masking augmentation in a recently proposed representation learning method, masked autoencoders (MAEs) (He et al., 2021). DAEs are a variant of autoencoders that are trained to reconstruct the original image from randomly masked patches. It has been found that DAEs (especially MAEs) outperform other self-supervised methods like MoCo v3, DINO, and BEiT after fine-tuning (He et al., 2021).

More specifically, under the same setup described in Section 2, let $A$ be the random masking augmentation defined in Definition 2.2. We adopt the symmetric linear encoders and decoders. Given samples $X=[x_{1},\dots,x_{n}]\in\mathbb{R}^{d\times n}$ , we formally define the loss minimization problem¹¹1In (He et al., 2021), the loss function is computed on masked coordinates only, but as the authors noted, “This choice is purely result-driven: computing the loss on all pixels leads to a slight decrease in accuracy (e.g., 0.5%).” Hence we will analyze the loss with respect to all coordinates for simplicity. of DAEs as

\displaystyle\min_{W\in\mathbb{R}^{r\times d}:WW^{\top}=2I_{r}}\frac{1}{n}\mathbb{E}_{A}\quantity[\|W^{\top}WAX-X\|_{F}^{2}].

(16)

Notice that the DAEs may not preserve the norm of the input since $\mathbb{E}_{A}[\|Ax_{i}\|^{2}]=(1/2)\|x_{i}\|^{2}$ . As a result, we optimize the loss under the scaled constraint $WW^{\top}=2I_{r}$ .

Then, we claim that under the same conditions as in Theorem 3.10, DAEs behave similarly to contrastive learning: Let $W_{\text{DAE}}$ be any solution that minimizes equation (16), and denote its singular value decomposition as $W_{\text{DAE}}=(U_{\text{DAE}}\Sigma_{\text{DAE}}V_{\text{DAE}}^{\top})^{\top}$ , then we have (the proof is given in Appendix C.2.)

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{\text{DAE}}\right)\right\|_{F}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

(17)

From Theorem 3.9, we know that under high dimensional settings with large sample sizes, DAEs (or masked autoencoders) significantly outperform classic autoencoders. Moreover, compared to Theorem 3.10, the upper bounds of DAEs are the same as contrastive learning with random masking augmentations. We also provide experimental results on synthetic datasets to verify this result in Appendix C.2. Although they have similar performance in our linear representation framework because both of them exploit the masking views to eliminate noise, the difference could arise from other aspects such as network architecture and training algorithms, for example, He et al. (2021) used a vision Transformer (Dosovitskiy et al., 2020) while Chen et al. (2020a) used a ResNet (He et al., 2016).

3.4 Performance on In-Domain Downstream Tasks

In the previous section, we have seen that contrastive learning can recover the core feature effectively. In practice, we are interested in using the learned features on in-domain downstream tasks. He et al. (2020) experimentally showed the overwhelming performance of linear classifiers trained on representations learned with contrastive learning against several supervised learning methods in those in-domain downstream tasks.

Following the recent success, here we evaluate the in-domain downstream performance of simple predictors, which take a linear transformation of the representation as an input. Let $W_{\text{CL}}$ and $W_{\text{AE}}$ be the learned representations based on train data $X\in\mathbb{R}^{n\times d}$ . We observe a new signal $\check{x}=U^{\star}\check{z}+\check{\xi}$ independent of $X$ following the spiked covariance model (5). For simplicity, assume $\check{z}$ follows $N(0,\nu^{2}I_{r})$ and is independent of $\check{\xi}$ . We consider two major types of in-domain downstream tasks: classification and regression. For the binary classification task, we observe a new supervised sample $\check{y}$ following the binary response model:

\displaystyle\check{y}|\check{z}

\displaystyle\sim\text{Ber}(F(\langle\check{z},w^{\star}\rangle/\nu)),

(18)

where $F:\mathbb{R}\to[0,1]$ is a known monotone increasing function satisfying $1-F(u)=F(-u)$ for any $u\in\mathbb{R}$ , and $w^{\star}\in\mathbb{R}^{r}$ is a unit vector of coefficients. Notice that our model (18) includes a logistic model (when $F(u)=1/(1+e^{-u})$ ) and probit models (when $F(u)=\Phi(u)$ , where $\Phi$ is the cumulative distribution function of the standard normal distribution.) We can also interpret model (18) as a shallow neural network model with width $r$ for binary classification. For the regression task, we observe a new supervised sample $\check{y}$ following the linear regression model:

\displaystyle\check{y}

\displaystyle=\langle\check{z},w^{\star}\rangle/\nu+\check{\epsilon},

(19)

where $\check{\epsilon}\sim(0,\sigma_{\epsilon}^{2})$ is independent of $\check{z}$ , , and $w^{\star}\in\mathbb{R}^{r}$ is a unit vector of coefficients as before. We can interpret this model as a principal component regression model (PCR) (Jolliffe, 1982) under standard error-in-variables settings²²2 In error-in-variables settings, the bias term of the measurement error appears in prediction and estimation risk. Since our focus lies in proving a better performance of contrastive learning against autoencoders, we ignore the unavoidable bias term here by considering the excess risk. , where we assume that the coefficients lie in a low-dimensional subspace spanned by column vectors of $U^{\star}$ . We either estimate or predict the signal based on the observed samples contaminated by the measurement error $\check{\xi}$ . For details of PCR in error-in-variables settings, see, for example, Ćevid et al. (2020); Agarwal et al. (2020); Bing et al. (2021).

In classification setting, we specify $0$ - $1$ loss, that is, $\ell_{c}(\delta)\triangleq\mathbb{I}\{\check{y}\neq\delta(\check{x})\}$ for some predictor $\delta$ taking values in $\{0,1\}$ . For regression task, we employ the squared error loss $\ell_{r}(\delta)\triangleq(\check{y}-\delta(\check{x}))^{2}$ . Based on some learned representation $W$ , we consider a class of linear predictors. Namely, $\delta_{W,w}(\check{x})\triangleq\mathbb{I}\{F(w^{\top}W\check{x})\geq 1/2\}$ for classification task and $\delta_{W,w}(\check{x})\triangleq w^{\top}W\check{x}$ for regression task, where $w\in\mathbb{R}^{r}$ is a weight vector $w\in\mathbb{R}^{r}$ . Note that the learned representation depends only on unsupervised samples $X$ . Let $\mathbb{E}_{\mathcal{D}}[\cdot]$ and $\mathbb{E}_{\mathcal{E}}[\cdot]$ the expectations with respect to $(X,Z)$ and $(\check{y},\check{x},\check{z})$ , respectively.

Our goal as stated above is to bound the prediction risk of predictors $\{\delta_{W,w}:w\in\mathbb{R}^{r}\}$ constructed upon the learned representations $W_{\text{CL}}$ and $W_{\text{AE}}$ , that is, the quantity $\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell(\delta_{W_{\text{CL}},w})]$ and $\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell(\delta_{W_{\text{AE}},w})]$ .

Now we state our results on the performance of the in-domain downstream prediction task.

Theorem 3.13 (Excess Risk for In-Domain Downstream Task: Upper Bound)

Suppose the conditions in Theorem 3.10 hold. Then, for the regression task, we have

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}},

and for the classification task,

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star\top},w})]=O\quantity(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}})\wedge 1.

The proofs are given in Appendix B.11.

This result shows that the price of estimating $U^{\star}$ by contrastive learning on an in-domain downstream prediction task can be made small in the case where the core feature lies in a relatively low-dimensional subspace, and the number of samples is relatively large compared to the ostensible dimension of data.

However, the in-domain downstream performance of autoencoders is not as good as contrastive learning. We obtain the following lower bound for the in-domain downstream prediction risk with the autoencoders.

Theorem 3.14 (Excess Risk for In-Domain Downstream Task: Lower Bound))

Suppose the conditions in Theorem 3.9 hold. Assume $r\leq r_{c}$ holds for some constant $r_{c}>0$ . Additionally assume that $\rho=\Theta(1)$ is sufficiently small and $n\gg d\gg r$ . Then, For the regression task,

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star},w})]\geq c_{c}^{\prime},

and for classification task, if $F$ is differentiable at $0$ and $F^{\prime}(0)>0$ , then

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star},w})]\geq c_{r}^{\prime},

where $c_{r}^{\prime}>0$ and $c_{c}^{\prime}>0$ are constants independent of $n$ and $d$ .

The proof is given in Appendix B.11. The condition $n\gg d\gg r$ means that there exists a sufficiently small constant $c>0$ independent of $n$ , $d$ and $r$ such that $d/n\vee r/d<c$ holds.

The constant $c^{\prime}$ appearing in Theorem 3.14 is a constant term independent of $d$ and $n$ . Thus, when $d$ is sufficiently large compared to $r$ and $d/n$ is small, the upper bound of in-domain downstream task performance via contrastive learning in Theorem 3.13 is smaller than the lower bound of in-domain downstream task performance via autoencoders. The assumption of $r\leq r_{c}$ in Theorem 3.14 is assumed for clarity of presentation. Using the same techniques in the proof of Theorem 3.14, one can obtain a constant lower bound for autoencoders with slightly stronger assumptions, for example, $\rho^{2}=O(1/\log d)$ with $n\gg dr$ , without assuming $r\leq r_{c}$ . Our theory can be adapted to both of these assumptions. Our results support the empirical success of contrastive learning.

4 The Impact of Labeled Data in Supervised Contrastive Learning

Recent works have explored adding label information to improve contrastive learning (Khosla et al., 2020). Empirical results show that label information can significantly improve the accuracy of the in-domain downstream tasks. However, when domain shift is considered, the label information hardly improves and even hurts transferability (Islam et al., 2021). For example, in Table 2 of Khosla et al. (2020) and the first column in Table 4 of Islam et al. (2021), supervised contrastive learning shows significant improvement with 7%-8% accuracy increase on in-domain downstream classification on ImageNet and Mini-ImageNet. On the contrary, in Table 4 of Khosla et al. (2020) and Table 4 of Islam et al. (2021), supervised contrastive learning hardly increases the predictive accuracy compared to the self-supervised contrastive learning (the difference of mean accuracy is less than 1%) and can harm significantly on some datasets (e.g. 5.5% lower for SUN 397 in Table 4 of Khosla et al. (2020)). These results indicate that some mechanisms in supervised contrastive learning hurt model transferability while the improvement in source tasks is significant. Moreover, in Table 4 of Islam et al. (2021), it is observed that combining supervised learning and self-supervised contrastive learning together achieves the best transfer learning performance compared to each of them individually. Motivated by those empirical observations, in this section, we aim to investigate the impact of labeled data in contrastive learning and provide a theoretical foundation for these phenomena.

4.1 Feature Mining in Multi-Class Classification

We first demonstrate the impact of labels in contrastive learning under the standard single-sourced (i.e. no transfer learning) setting. Suppose our samples are drawn from $r+1$ different classes with probability $p_{k}$ for class $k\in[r+1]$ , and $\sum_{k=1}^{r+1}p_{k}=1$ . For each class, samples are generated from a class-specific Gaussian distribution:

x^{k}=\mu^{k}+\xi^{k},\quad\xi^{k}\sim\mathcal{N}(0,\Sigma^{k}),\quad\forall k=1,2,\cdots,r+1.

(20)

We assume the norms of $\mu^{k},\forall k\in[r+1]$ are in the same order, that is, denote $\nu=\|\mu^{1}\|/\sqrt{r}$ , we have $\|\mu^{k}\|=O(\sqrt{r}\nu),\forall k\in[r+1]$ . We further assume $\Sigma^{k}=\operatorname{diag}(\sigma_{1,k}^{2},\cdots,\sigma_{d,k}^{2})$ , denote $\sigma_{(1)}^{2}=\max_{1\leq i\leq d,1\leq j\leq r+1}\sigma_{i,j}^{2}$ and assume $\sum_{k=1}^{r+1}p_{k}\mu^{k}=0$ , where the last assumption is added to ensure identifiability since the classification problem (20) is invariant under translation. Denote $\Lambda=\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}$ , we assume $\operatorname{rank}(\Lambda)=r$ and $C_{1}\nu^{2}<\lambda_{(r)}(\Lambda)<\lambda_{(1)}(\Lambda)<C_{2}\nu^{2}$ for two universal constants $C_{1}$ and $C_{2}$ . We remark that this model is a labeled version of the spiked covariance model (5) since the core features and random noise are both sub-Gaussian. We use $r+1$ classes to ensure that $\mu^{k}$ ’s span an $r$ -dimensional space, and denote its orthonormal basis as $U^{\star}$ . Recall that our target is to recover $U^{\star}$ .

Remark 4.1

Here we focus on explaining the impact of label information in the SupCon algorithm(Khosla et al., 2020). SupCon is designed for multi-class classification tasks and it requires using the class label to find positive samples. In that case, labels from a linear function of the latent features can not be used as class labels in a multi-class classification setting. Hence we proposed to use the Gaussian Mixture Model (20) to generate class labels while keeping the most consistency with models used earlier(i.e., the spiked covariance model (5)).

As introduced in Definition 2.4, the supervised contrastive learning introduced by Khosla et al. (2020) allows us to generate contrastive pairs using labeled information and discriminate instances across classes. When we have both labeled data and unlabeled data, we can perform contrastive learning based on pairs that are generated separately for the two types of data.

Data Generating Process

Formally, let us consider the case where we draw $n$ samples as unlabeled data $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ from the Gaussian mixture model (20) with $p_{1}=p_{2}=\cdots=p_{r+1}$ . For the labeled data, we draw $(r+1)m$ samples; $m$ samples for each of the $r+1$ classes in the Gaussian mixture model, and denote them as $\hat{X}=[\hat{x}_{1},\cdots,\hat{x}_{(r+1)m}]\in\mathbb{R}^{d\times(r+1)m}$ . We discuss the above case for simplicity. More general versions that allow different sample sizes for each class are considered in Theorem D.2 (in the appendix). We study the following hybrid loss to illustrate how the label information helps promote performance over self-supervised contrastive learning:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)+\alpha\mathcal{L}_{\text{SupCon}}(W),

(21)

where $\alpha>0$ is the ratio between supervised loss and self-supervised contrastive loss. Here we consider this generalized hybrid loss to show the benefit of exploiting additional unlabeled data. If we choose $\alpha\rightarrow\infty$ it will correspond to the original SupCon loss.

We first provide a high-level explanation of why label information can help learn core features. When the label information is unavailable, no matter how much (unlabeled) data we have, we can only take them (and their augmented views) as positive samples. In such a scenario, performing augmentation leads to an unavoidable trade-off between estimation bias and accuracy. However, if we have additional class information, we can contrast between data in the same class to extract more beneficial features that help distinguish a particular class from others and therefore reduce the bias.

Theorem 4.2

Suppose the labeled and unlabeled samples are generated as the process mentioned above. If Assumptions 3.4-3.6 hold, $n>d\gg r$ and let $W_{\text{CL}}$ be any solution that minimizes the supervised contrastive learning problem in Equation (21), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim

\displaystyle\frac{1}{1+\alpha}\left(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}\right)+\frac{\alpha}{1+\alpha}\sqrt{\frac{dr}{m}}.

The proof is given in Appendix D.2

Corollary 4.3

From Theorem 4.2, it directly follows that when we have $m$ labeled data for each class and no unlabeled data ( $\alpha\to\infty$ ),

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U)\|_{F}\lesssim\sqrt{\frac{dr}{m}}.

The first bound in Theorem 4.2 demonstrates how the effect of labeled data changes with the ratio $\alpha$ in the hybrid loss in Equation (21). In addition, compared with Theorem 3.10, when we only have labeled data ( $\alpha\to\infty$ ), the second bound in Theorem 4.2 indicates that with labeled data being available, the supervised contrastive learning can yield consistent estimation as $m\rightarrow\infty$ while the self-supervised contrastive learning consists of an irreducible bias term $O(r^{3/2}\log d/d)$ . At a high level, label information can help gain accuracy by creating more positive samples for a single anchor and therefore extract more decisive features. One should notice a caveat that when labeled data is extremely rare compared to unlabeled data, the estimation of supervised contrastive learning suffers from high variance. In comparison, self-supervised contrastive learning, which can exploit a much larger number of samples, may outperform it.

4.2 Information Filtering in Multi-Task Transfer Learning

In this section, we show that the theoretical tools developed in this paper can be used to illustrate the role of label information when using contrastive learning in the transfer learning setting. Label information can tell us the beneficial information for the in-domain downstream task, and learning with labeled data will filter out useless information and preserve the decisive parts of core features. However, in transfer learning, the label information is sometimes rather found to hurt the performance of contrastive learning. For example, in Table 4 of Islam et al. (2021), while supervised contrastive learning gains 8% improvement in source tasks by incorporating label information, it improves only 1% on generalizing to new datasets on average and can even hurt on some datasets. Such observation implies that label information in contrastive learning has very different roles for generalization on source tasks and new tasks. In this section, we consider two regimes of transfer learning – tasks are insufficient/abundant. In both regimes, we provide theories to support the empirical observations and further demonstrate how to wisely combine supervised and self-supervised contrastive learning to avoid those harms and achieve better performance. Specifically, we consider a transfer learning problem with regression setting and binary classification setting. Suppose we have $T$ source tasks which share a common data generative model (5). For the $t$ -th task, the labels are generated by $y^{t}=\langle w_{t},z\rangle/\nu$ in a regression setting while $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle)$ in a binary classification setting, where $w_{t}\in\mathbb{R}^{r}$ is a unit vector varying across tasks. These two settings share the same distribution for $x$ only differ in the way to generate the labels $y^{t}$ .

To incorporate label information, we maximize the Hilbert-Schmidt Independence Criteria (HSIC) (Gretton et al., 2005; Barshan et al., 2011), which has been widely used in literature (Song et al., 2007a, b, c; Barshan et al., 2011).

4.2.1 Hilbert-Schmidt Independent Criteria

Gretton et al. (2005) proposed the Hilbert Schmidt Independent Criteria (HSIC) to measure the dependence between two random variables. It computes the Hilbert-Schmidt norm of the cross-covariance operator associated with their Reproducing Kernel Hilbert Space (RKHS). Such measurement has been widely used as a supervised loss function in feature selection (Song et al., 2007c), feature extraction (Song et al., 2007a), clustering (Song et al., 2007b) and supervised PCA (Barshan et al., 2011).

The basic idea behind HSIC is that two random variables are independent if and only if any bounded continuous functions of the two random variables are uncorrelated. Let $\mathcal{F}$ be a separable RKHS containing all continuous bounded real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$ and $\mathcal{G}$ be that for maps from $\mathcal{Y}$ to $\mathbb{R}$ . For each point $x\in$ $\mathcal{X}$ , there exists a corresponding element $\phi\in\mathcal{F}$ such that $\left\langle\phi(x),\phi\left(x^{\prime}\right)\right\rangle_{\mathcal{F}}=k\left(x,x^{\prime}\right)$ , where $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a unique positive definite kernel. Likewise, define the kernel $l(\cdot,\cdot)$ and feature map $\psi$ for $\mathcal{G}$ . The empirical HSIC is defined as follows.

Definition 4.4 (Empirical HSIC (Gretton et al., 2005))

Let $Z:=\{(x_{1},y_{1}),\ldots$ , $(x_{m},y_{m})\}$ $\subseteq\mathcal{X}\times\mathcal{Y}$ be a series of $m$ independent and identically distributed observations. An estimator of HSIC, written as $\operatorname{HSIC}(Z,\mathcal{F},\mathcal{G})$ , is given by

\operatorname{HSIC}(Z,\mathcal{F},\mathcal{G}):=(m-1)^{-2}\operatorname{tr}(KHLH),

where $H,K,L\in\mathbb{R}^{m\times m},K_{ij}:=k\left(x_{i},x_{j}\right),L_{ij}:=l\left(y_{i},y_{j}\right)$ and $H:=I_{m}-(1/m)1_{m}1_{m}^{\top}$ .

In our setting, we aim to maximize the dependency between learned features $WX\in\mathbb{R}^{r\times n}$ and label $y\in\mathbb{R}^{n}$ via HSIC. Substituting $K\leftarrow X^{\top}W^{\top}WX$ and $L\leftarrow yy^{\top}$ , we obtain our supervised loss for the representation matrix $W$ :

\operatorname{HSIC}(X,y;W)=\frac{1}{(n-1)^{2}}\tr(X^{\top}W^{\top}WXHyy^{\top}H).

(22)

A more commonly used supervised loss in the regression task is the mean squared error. Here we explain the equivalence of maximizing HSIC with penalty $\|WW^{\top}\|_{F}^{2}$ and minimizing the mean squared error in the regression task.

Recall that in the contrastive learning framework, we first learn the representation via a linear transformation and then perform linear regression to learn a predictor with the learned representation. Consider the mean squared error $\mathcal{L}(\delta)=(1/n)\sum_{i=1}^{n}(\delta(x_{i})-y_{i})^{2}$ , where $\delta(x_{i})$ is a predictor of $y_{i}$ . Also consider a linear class of predictors $\delta_{W,w}(x_{i})=w^{\top}Wx_{i}$ with parameter $w\in\mathbb{R}^{r}$ . Assume that both $X$ and $y$ are centered. For any fixed representation $W$ , the minimum mean squared error is given by

	$\displaystyle\min_{w\in\mathbb{R}^{r}}\mathcal{L}(\delta_{W,w})=$	$\displaystyle\frac{1}{n}\\|(WX)^{\top}w^{\star}-y\\|^{2}$
	$\displaystyle=$	$\displaystyle\frac{1}{n}\quantity(y^{\top}y-\tr(X^{\top}W^{\top}(WXX^{\top}W^{\top})^{-1}WXyy^{\top})),$

where $w^{\star}=(WXX^{\top}W^{\top})^{-1}WXy$ . Ignoring the constant term $y^{\top}y$ , it can be seen that the only essential difference between the minimization problem $\min_{w\in\mathbb{R}^{r}}\mathcal{L}(\delta_{W,w})$ and maximizing HSIC in Equation (22) is the normalization term $(WXX^{\top}W^{\top})^{-1/2}$ for $W$ . Thus, minimizing the mean squared error is equivalent to maximizing HSIC with regularization term $\|WW^{\top}\|_{F}^{2}$ . Since $L_{\text{SelfCon}}$ contains the regularization term $\|WW^{\top}\|_{F}^{2}$ , we can jointly use $L_{\text{SelfCon}}$ and HSIC as a surrogate for the standard regression error to avoid the singularity of learned representation $W$ .

4.2.2 Main results

First, we consider the regression setting. Before stating our results, we prepare some notations. Suppose we have $n$ unlabeled data $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ and $m$ labeled data for each source task $\hat{X}^{t}=[\hat{x}_{1}^{t},\cdots,\hat{x}_{m}^{t}],y^{t}=[y_{1}^{t},\cdots,y_{m}^{t}],\forall t=1,\dots,T$ where $x_{i}$ and $\hat{x}_{j}^{t}$ are independently drawn from the spiked covariance model (5), we learn the linear representation via the joint optimization:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)-\alpha\sum_{t=1}^{T}\operatorname{HSIC}(\hat{X}^{t},y^{t};W),

(23)

where $\alpha>0$ is a pre-specified ratio between the self-supervised contrastive loss and HSIC. A more general setting, where the ratio $\alpha$ and the number of labeled data for each source task are allowed to depend on $t$ , is considered in the appendix, see Section D.2 for details. We now present a theorem showing the recoverability of $W$ by minimizing the hybrid loss function (23).

Theorem 4.5

In the regression setting where $y^{t}=\langle w_{t},z\rangle/\nu$ , suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and $n>d\gg r$ , if we further assume that $\alpha>C$ for some constant $C$ , $T<r$ and $w_{t}$ ’s are orthogonal to each other, and let $W_{\text{CL}}$ be any solution that optimizes the problem in Equation (23), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have:

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim

\displaystyle\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right).

(24)

The proof is given in Appendix D.6.

Similar to Section 3.4, we can obtain an in-domain downstream task risk in a supervised contrastive learning setting. Consider a new test task where a label is generated by $\check{y}=\langle w^{\star},\check{z}\rangle/\nu$ with $\check{x}=U^{\star}\check{z}+\check{\xi}$ . Recall that the loss in the in-domain downstream task is measured by the squared error: $\ell_{r}(\delta):=(\check{y}-\delta(\check{x}))^{2}$ . We obtain the following result.

Theorem 4.6

Suppose the conditions in Theorem 4.5 hold. Then,

		$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]$		(25)
		$\displaystyle\quad\lesssim\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right).$		(25)

The proof is given in Appendix D.7.

In Theorem 4.5 and Theorem 4.6, as $\alpha$ goes to infinity (corresponding to the case where we only use the supervised loss), the upper bounds in Equations (24) and (LABEL:CL_transfer_upper_bound_4) are reduced to $\sqrt{r-T}+T^{3/2}\sqrt{d/m}$ , which is worse than the $r^{3/2}\log d/d$ rate obtained by self-supervised contrastive learning (Theorem 3.10). This implies that when the model focuses mainly on the supervised loss, the algorithm will extract the information only beneficial for the source tasks and fail to estimate other parts of core features. As a result, when the target task has a very different distribution, labeled data will bring extra bias and therefore hurt the transferability. Additionally, one can minimize the right-hand side of Equation (24) to obtain a sharper rate. Specifically, we can choose an appropriate $\alpha$ such that the upper bound becomes $\sqrt{r^{2}(r-T)}\log d/d$ (when $n,m\rightarrow\infty$ ), obtaining a smaller rate than that of the self-supervised contrastive learning. These facts provide theoretical foundations for the recent empirical observations that smartly combining supervised and self-supervised contrastive learning achieves significant improvement in transferability compared with performing each of them individually (Islam et al., 2021).

Remark 4.7

A heuristic intuition of this surprising fact is that when tasks are not diverse enough, supervised training will only focus on the features that are helpful to predict the labels of source tasks and ignore other features. For example, we have unlabeled images which contain cats or dogs and the background can be sandland or forest. If the source task focuses on classifying the background, supervised learning will not learn features associated with cats and dogs, while self-supervised learning can learn these features since they are helpful to discriminate different images. As a result, although supervised learning can help to classify sandland and forest, it can hurt performance on the classification of dogs and cats and we should incorporate self-supervised contrastive learning to learn these features.

When the tasks are abundant enough then estimation via labeled data can recover core features completely. Similar to Theorem 4.5 and Theorem 4.6, we have the following results.

Theorem 4.8

In the regression setting where $y^{t}=\langle w_{t},z\rangle/\nu$ , suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and $n>d\gg r$ , if we further assume that $T>r$ and $\lambda_{(r)}(\sum_{i=1}^{T}w_{i}w_{i}^{\top})>c$ for some constant $c>0$ , suppose $W_{\text{CL}}$ is the optimal solution of optimization problem Equation (23), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have:

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim

\displaystyle\frac{\sqrt{r}}{\alpha+1}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+T\sqrt{\frac{dr}{m}}.

(26)

The proof is given in Appendix D.8.

Theorem 4.9

Suppose the conditions in Theorem 4.8 hold. Then,

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\lesssim\frac{\sqrt{r}}{\alpha+1}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+T\sqrt{\frac{dr}{m}}.

(27)

The proof is given in Appendix D.9.

Theorem 4.8 and Theorem 4.9 show that in the case where tasks are abundant, as $\alpha$ goes to infinity (corresponding to the case where we use the supervised loss only), the upper bounds in Equations (26) and (27) are reduced to $T\sqrt{rd/m}$ . This rate can be worse than the $\sqrt{r^{3}}\log d/d+\sqrt{rd/n}$ rate obtained by self-supervised contrastive learning when $m$ is small. Recall that when the number of tasks is small, labeled data introduce extra bias term $\sqrt{r-T}$ (Theorem 4.5 and Theorem 4.6). We note that when the tasks are abundant enough, the harm of labeled data is mainly due to the variance brought by the labeled data. When $m$ is sufficiently large, supervised learning on source tasks can yield a consistent estimation of core features, whereas self-supervised contrastive learning can not.

In extending our results from regression to the binary classification setting, the only difference is in the label generation process, and we can obtain similar results with some modification of the proofs. The corresponding feature recovery bounds of Theorem 4.5 (where the tasks are insufficient) and Theorem 4.8 (where the tasks are abundant) are stated as follows:

Theorem 4.10

In the classification setting where $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle/\nu)$ , suppose the conditions in Theorem 4.5 hold and $z$ in the spiked covariance model (5) follows a Gaussian distribution, then we have:

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim

\displaystyle\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right).

(28)

Theorem 4.11

In the classification setting where $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle/\nu)$ , suppose the conditions in Theorem 4.8 hold and $z$ in the spiked covariance model (5) follows a Gaussian distribution, then we have:

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim

\displaystyle\frac{\sqrt{r}}{\alpha+1}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+T\sqrt{\frac{dr}{m}}.

(29)

As the feature recovery bounds remain to be the same, the counterpart of the in-domain downstream tasks results, Theorem 4.6 and Theorem 4.9, in this classification setting follows immediately. For the sake of space, we defer the generalized version and proofs in Theorem D.10 and Theorem D.11 in the appendix.

5 Numerical Experiments

5.1 Linear Model with Synthetic Data

To verify our theory, we conducted numerical experiments on the spiked covariance model (5) under a linear representation setting and contrastive loss functions defined in (3) and (23). As we have explicitly formulated the loss function and derived its equivalent form in the main body and appendix, we simply minimize the corresponding loss by gradient descent to find the optimal linear representation $W$ . For self-supervised contrastive learning with random masking augmentation, we independently draw the augmentation function by Definition 2.2 and apply them to all samples in each iteration. To ensure convergence, we set the maximum number of iterations for it (typically 10000 or 50000 depending on dimension $d$ ).

We report two criteria to evaluate the quality of the representation, in-domain downstream error, and sine distance. To obtain the sine distance for a learned representation $W$ , we perform singular value decomposition to get $W=(U\Sigma V^{\top})^{\top}$ and then compute $\|\sin\Theta(U,U^{\star})\|_{F}$ . To obtain the in-domain downstream task performance, in the comparison between autoencoders and contrastive learning, we first draw $n$ labeled data from spiked covariance model (5) with labels generated as in Section 3.4, then we train the model by using the data without labels to obtain the linear representation $W$ , and learn a linear predictor $w$ using the data with labels and compute the regression error. In the transfer learning setting, we draw some labeled data from the source tasks and additional unlabeled data. The number of labeled data is set to be $m=1000$ and the number of unlabeled data is set to be $n=1000$ . Then train with them to obtain the linear representation $W$ , and draw labeled data from a new source task to learn a linear predictor $w$ to compute the regression error. In particular, we subtract the optimal regression error obtained by the best representation $U^{\star\top}$ for each regression error and report the difference, or more precisely, the excess risk as in-domain downstream performance.

The results are reported in Fig. 1 and 2 and Table 2 and 3. As predicted by Theorems 3.9 and 3.10, the feature recovery error and in-domain downstream task risk of contrastive learning decrease as $d$ increases (Fig. 1: Left) and as $n$ increases (Fig. 1: Center) while that of autoencoders is insensible to the changes in $d$ and $n$ . Consistent with our theory, in Fig. 1: Right, it is observed that when tasks are not abundant, the transfer performance exhibit a $U$ -shaped curve, and the best result is achieved by choosing an appropriate $\alpha$ . When tasks are abundant and labeled data are sufficient, the error remains small when we take large $\alpha$ .

Refer to caption — Figure 1: The vertical axes indicate the downstream regression error. We subtract the regression error of the ground truth features to measure the excess error. Left: Comparison of in-domain downstream task performance between contrastive learning and autoencoders the dimension $d$ . The sample size $n$ is set as $n=20000$ . Center: Comparison of in-domain downstream task performance between contrastive learning and autoencoders the dimension $n$ . The dimension $d$ is set as $d=40$ . Right: In-domain downstream task performance in transfer learning against penalty parameter $\alpha$ in log scale. $T$ is the number of source tasks and $r$ is the dimension of the representation function. We set the number of labeled data and unlabeled data as $m=1000$ and $n=1000$ respectively.

$\log_{e}(\alpha)$	-5	-4	-3	-2	-1	0	1	2	3	4	5
$T=8,r=10$	0.0242	0.0231	0.0199	0.0141	0.0122	0.0125	0.0184	0.0345	0.0499	0.0535	0.0587
$T=20,r=10$	0.0223	0.0163	0.0156	0.0096	0.0079	0.0055	0.0064	0.0064	0.0067	0.0070	0.0079

Table 2: In-domain downstream performance in transfer learning against the penalty parameter

\alpha

T

is the number of source tasks.

$\log_{e}(\alpha)$	-5	-4	-3	-2	-1	0	1	2	3	4	5
$T=8,r=10$	2.0373	2.0371	2.0228	1.9908	2.0021	2.0055	2.0010	2.0362	2.0699	2.0705	2.0813
$T=20,r=10$	2.0352	2.0292	2.0030	1.9871	1.9740	1.9690	1.9766	1.9702	1.9790	1.9714	1.9672

Table 3: Feature recovery performance in transfer learning against the penalty parameter

\alpha

T

is the number of source tasks.

5.2 Neural Nets with Real-World Dataset

In this section, we provide experimental results in real-world datasets to support our theoretical results. Although our model settings and assumptions might be violated under this scenario, as we shall see, our findings still remain valid in practice.

We conduct the experiments using the datasets STL-10 (Coates et al., 2011) and CIFAR-10 (Krizhevsky, 2009) with the neural nets architecture ResNet-18 (He et al., 2016). Our experiments are carried out based on linear evaluation following SimCLR (Chen et al., 2020a), where we first train a ResNet-18 encoder and a two-layer MLP projector with unlabeled augmented data. We then freeze the encoder, train a logistic regression on top of it with labeled data, and lastly evaluate the performance on the test data. Following Chen et al. (2020a), we apply augmentations including resized cropping, horizontal flipping, color distortion, and Gaussian blurring to generate the augmented data, and use the InfoNCE loss function to train the network. All training is carried out with the Adam optimizer (Kingma and Ba, 2015), batch size 256, learning rate $3\times 10^{-4}$ , weight decay $10^{-4}$ , and a cosine annealing learning rate scheduler for 100 epochs. Our codes are implemented in Pytorch and run on an NVIDIA V100 GPU.

Contrastive learning v.s. standard autoencoders:

Here we provide real-world evidence for our theoretical findings in Section 3.3 by comparing the performance of contrastive learning versus a standard autoencoder. The architecture of encoders is the same for these two methods, and we use an inversed ResNet-18 as the decoder. During the training time, we use the encoder-decoder architecture and mean squared error loss to reconstruct the input, and then we train a linear classifier on the features learned by the encoder. The results are listed in Table 4, we can find that contrastive learning demonstrates superior performance over the standard autoencoder.

Testing Accuracy	Contrastive Learning	Standard Autoencoder
CIFAR10	$65.11\pm 0.51$	$44.76\pm 0.16$
STL10	$71.02\pm 0.47$	$39.00\pm 0.58$

Table 4: Comparison of linear evaluation performance of contrastive learning and autoencoders.

The impact of labeled data in transfer learning

Now we we provide real-world evidence for our theoretical findings in Section 4.2. Following the joint optimization formulation in equation 23, we combine the InfoNCE loss function for unlabeled data and cross-entropy loss for labeled data with a ratio $\alpha$ . For both STL-10 and CIFAR-10 datasets, we divide the test data into two sets, one consists of the first five classes and the other one consists of the remaining five classes. During training, we use the training data as unlabeled data and the first set of test data as the labeled data to train the model jointly, and then train a linear classifier with the second set of test data on features learned by the encoder. As predicted by Theorem 4.5, when $\alpha$ is small, introducing labeled data from the first five classes would be beneficial to learn better representations and improve the performance on the last five classes; when $\alpha$ is large, labeled data from the first five classes will make the model only focus on features that are useful to discriminate the first five classes and ignore other features, thus introducing labeled data could be harmful to the performance on the last five classes. Testing accuracy on the last five classes with different $\alpha$ are listed in table 5, it is observed that the accuracy first increases and then decreases as $\alpha$ grows, which is consistent with our theoretical results.

Testing Accuracy	$\alpha=0.0$	$\alpha=0.1$	$\alpha=0.2$	$\alpha=0.3$	$\alpha=0.4$	$\alpha=0.5$	$\alpha=0.6$	$\alpha=0.7$	$\alpha=0.8$	$\alpha=0.9$	$\alpha=1.0$
CIFAR10	74.27	75.52	74.86	75.31	75.21	74.86	74.46	73.85	72.20	69.17	51.31
STL10	82.56	83.54	83.27	83.24	83.21	83.03	82.34	82.11	80.94	76.88	52.37

Table 5: Transfer learning performance with different

\alpha

. For each experiment, we report the average accuracy for three independent runs.

6 Conclusion

In this work, we establish a theoretical framework to study contrastive learning under the linear representation setting. We theoretically prove that contrastive learning, compared with autoencoders and GANs, can obtain a better low-rank representation under the spiked covariance model, which further leads to better performance in in-domain downstream tasks. We also highlight the impact of labeled data in supervised contrastive learning and multi-task transfer learning: labeled data can reduce the domain shift bias in contrastive learning, but it harms the learned representation in transfer learning. To our knowledge, our result is the first theoretical result to guarantee the success of contrastive learning by comparing it with existing representation learning methods. However, to get a tractable analysis, like many other theoretical works in representation learning (Du et al., 2020; Lee et al., 2021; Tripuraneni et al., 2021), our work starts with linear representations, which still provides important insights. Recently, Wen and Li (2021) and Refinetti and Goldt (2022) studied the training dynamics of autoencoders and contrastive learning with nonlinear shallow neural networks. Extending our results to these more complex models is an interesting direction for future work.

Acknowledgement

L.Z. is supported by National Science Foundation DMS 2015378. J.Z. is supported by the National Science Foundation (CCF 1763191 and CAREER 1942926), the US National Institutes of Health (P30AG059307 and U01MH098953) and grants from the Silicon Valley Foundation and the Chan-Zuckerberg Initiative.

A Background and omitted discussion

A.1 Comparison with other works

Here we compare the results in this paper with some closely related works.

To rigorously analyze contrastive learning, we consider the random masking augmentation strategy which is also analyzed in Wen and Li (2021). In Wen and Li (2021), the authors aim to understand the training dynamics of contrastive learning in a shallow nonlinear neural network and focus more on dealing with nonlinearity. In comparison, our work focus on the comparison between contrastive learning and autoencoders and the role of label information in contrastive learning. To make the problem mathematically tractable, we adopt a linear model, which is simple but enough to shed light on many mysterious phenomena in practice. Moreover, while they assume a sparse coding model, where the features are extremely sparse, and Gaussianity of signals and noise, our analysis only requires that the features are sub-Gaussian (5). Furthermore, our technique allows the signal-to-noise ratio to have different orders, as long as it decreases slowly, while their analysis is restricted to a particular signal-to-noise ratio.

Tian (2022) studied the relationship between contrastive learning and PCA from a game-theoretic point of view. Specifically, the authors decompose the gradient descent on the contrastive loss into two dynamics, namely the max-player and min-player. It is proven that in deep linear networks, the max-player is equivalent to PCA and the landscape has no spurious minimum. While the results on max-player can be applied to a family of contrastive loss, it is still difficult to analyze the min-player in a general setting. In our paper, we use a linear contrastive loss (2) to explicitly obtain the features learned by contrastive learning. Moreover, our results can be directly extended to a deep linear network setting by the equivalence of a single linear transformation and a deep linear network. The major difference is the non-convexity of the loss landscape.

Garg and Liang (2020) studied the combination of supervised learning and self-supervised learning. They viewed training with unlabeled data as functional regularization on learning the representation function, and obtained sample complexity bounds in a PAC-learning style for various settings. In particular, they found that such functional regularization can help to reduce the amount of labeled data needed, and showed autoencoders and masked self-supervision as two concrete examples. Apart from Garg and Liang (2020), this paper focuses on a regime in combining self-supervised learning and supervised learning, where a trade-off between labeled data and unlabeled data exists. Specifically, Theorem 3 in Garg and Liang (2020) assumes that a ground truth representation exists such that it can keep both self-supervised loss and supervised loss at a very low threshold. However, as the authors admit, it is hard to determine such a threshold in practical settings. For example, since the unlabeled data and labeled data come from different domains, such as Image-Net and CIFAR-10, domain-specific features may have a much lower loss compared with domain-transferable features. In our paper, we first study the regime where tasks are not diverse enough in Theorem 4.5 (which corresponds to the case where ground truth does not exist) and show the trade-off between supervised loss and self-supervised loss. Then in Theorem 4.8 we show that when tasks are abundant (which corresponds to the case where ground truth exists), labeled data helps to achieve better error bounds, which is similar to the result of Garg and Liang (2020). Our result of Theorem 4.5 provides novel insight into the regime where tasks are not diverse, which has been left untouched in the literature.

In Li et al. (2021), the authors proposed a novel self-supervised loss function based on HSIC and discussed the relationship between InfoNCE and the proposed SSL-HSIC loss. The SSL-HSIC loss measures the dependence between the output features and one-hot encoded labels(which serve as the indicators of positive samples) and minimizing the SSL-HSIC loss encourages the network to discriminate augmented views from different samples. In comparison to this self-supervised loss, we use HSIC in Section 4.2 as a supervised loss to measure the dependence between output features and the true labels, which is a common usage of HSIC in previous works (Barshan et al., 2011; Song et al., 2007c). Moreover, we want to point out that the proposed estimator of SSL-HSIC (see Equation (11) in Li et al. (2021)) can be reduced to the linear loss we use in this paper when the kernel $k(\cdot,\cdot)$ is chosen to be a simple inner product. The authors argued that the standard InfoNCE loss may yield meaningless features in some cases thus the proposed HSIC-based loss could be a better alternative, and provided empirical results illustrating the comparable performance of SSL-HSIC. It remains to be further explored the benefits of such an HSIC-based method.

Lee et al. (2021) studied self-supervised learning under a conditional independence assumption, and showed that with the optimal representation learned in pretext tasks, the in-domain downstream risk is guaranteed to be small. In the contrastive learning context, for example, an image classification downstream task, such an assumption implies that the augmented views generated from the same picture are roughly independent conditional on its ground-truth class, which could be too strong since two views are usually strongly correlated. Such an assumption would be closer to supervised contrastive learning as we have discussed in Section 4.1 where we contrast two independent samples from the same class, such sample pairs are independent conditional on the true label, but it requires label information and thus not applied to self-supervised contrastive learning. Compared with this work, we studied a specific data-generating model where the two views are obtained by practical augmentation and thus could be more close to a real-world setting. It is also remarkable that inLee et al. (2021), their analysis can be adapted to the nonlinear representation setting in the sense that they directly assumed that the optimal representation is obtained. However, the representation learning in the contrastive learning context, even under a linear representation setting, could be non-convex. Thus our analysis starts from a simple setting to obtain a deep understanding of what could be learned in the pretext tasks.

Saunshi et al. (2022) proposed a novel perspective that theoretical analysis of contrastive learning must take the inductive bias into account. It is shown that without considering the function class, it is possible that the learned features totally fail in downstream classification tasks. Furthermore, it is shown that within the linear representation class, contrastive self-supervised learning is guaranteed to learn meaningful features under certain conditions. In our paper, we have restricted ourselves to a similar linear representation setting and avoided the collapse raised by complicated models. Compared with their analysis in a linear representation setting, our bounds provide an exact order while their results still need to quantify the expressivity and inconsistency measure, which is very difficult without very strong assumptions. Moreover, their analysis requires a finite input space and augmentation set, which is much stronger than our data-generating model.

Later than our paper, Fu et al. (2022) considered a similar issue of transfer learning performance of the supervised contrastive method. The authors argue that directly minimizing the supervised contrastive method will lead to feature collapse, which implies that within each class, all data points will have the same embedding and thus the supervised contrastive method loses information that could be useful for transfer learning. This intuition is similar to our setting in Section 4.2, and based on it the authors proposed a new loss function that combines the supervised contrastive loss and within-class self-supervised loss together. At a high level, this new loss function is quite similar to equation 23 since they are both linear interpolations of self-supervised contrastive loss and supervised loss, and the common motivation is to encourage the model to learn more background features. And our analysis in Section 4.2 can provide a theoretical foundation for when could such interpolation work and how to choose the ratio $\alpha$ .

A.2 Disucssion about the regularization term

In this paper we use a quadratic regularization term $R_{1}(W)=\|WW^{\top}\|_{F}^{2}$ instead of a standard $\ell_{2}$ regularization term $R_{2}(W)=\|W\|_{F}^{2}$ . Denote $W^{T}=[w_{1},\cdots,w_{r}]$ , then we can write these two terms as:

R_{1}(W)=\sum_{i}^{r}\|w_{i}\|^{4}+\sum_{i,j}\langle w_{i},w_{j}\rangle^{2},\quad R_{2}(W)=\sum_{i}^{r}\|w_{i}\|^{2}

The main difference between these two terms is that except for penalizing the norm of representation $w_{i}$ , it also penalizes the similarity between different representations. In particular, in the linear representation setting, where we will deal with optimization problems like

\min_{W\in\mathbb{R}^{r\times d}}\tr(WAW^{\top})+\frac{\lambda}{2}R(W)

where $A$ is a symmetric matrix determined by data and augmentation, we can easily find that the $\ell_{2}$ regularization would fail. To see this, we can rewrite the loss function as

\tr(WAW^{\top})+\frac{\lambda}{2}\|W\|_{F}^{2}=\sum_{i=1}^{r}(w_{i}^{\top}Aw_{i}+\frac{\lambda}{2}\|w_{i}\|^{2})=\sum_{i=1}^{r}w_{i}^{\top}(A+\frac{\lambda}{2}I)w_{i},

it is easy to find that the optimal solution of each $w_{i}$ would be at infinity. Moreover, even if we add constraints like $\|w_{i}\|<C,\forall i\in[r]$ , the optimal solution of each $w_{i}$ would all be the eigenvector corresponding to the smallest eigenvalue of $A$ thus the model would only learn a single representation from the data. In contrast, the quadratic regularization term encourages the diversity of representation by penalizing the similarity between representations, i.e., $\langle w_{i},w_{j}\rangle^{2}$ . In this situation, we have:

\tr(WAW^{\top})+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}=\frac{\lambda}{2}\|W^{\top}W+\frac{1}{\lambda}A\|_{F}^{2}-\frac{1}{2\lambda}\|A\|_{F}^{2}

and it is easy to find that the optimal solution of $w_{i}$ would be finite and each $w_{i}$ corresponds to different eigenvectors of $A$ and is orthogonal to each other, which implies that they would learn totally different representations. As a result, the quadratic regularization term would be more helpful to self-supervised learning with linear representation. In real-world practice, $\ell_{2}$ regularization can still work well with the help of non-linearity and normalization techniques, but we also provide an empirical observation that applying quadratic regularization would be helpful to improve the performance compared with using standard weight decay. We would also like to point out that in the linear regime, the choice of $\lambda\in\mathbb{R}_{+}$ would only affect the norm of $w_{i}$ and makes no difference to the direction of $w_{i}$ and the quality of representation, thus we do not specify this value in our analysis. Similar regularization techniques are also used in Liu et al. (2021) for theoretical analysis in the linear representation setting.

We verify this conjecture in neural networks, and the results are provided in Table 6. The first column corresponds to training with standard weight decay, and in the setting of the second column, we do not apply weight decay for each weight matrix of fully connected layers in the encoder, and add an additional regularization term $\lambda\|W^{T}W\|_{F}^{2}$ on the loss function instead. Note that the quadratic regularization does not apply to convolution layers and bias terms, thus we keep the weight decay on these parameters. We search the regularization parameter in each settings from $\lambda=0.1,0.01,0.001,0.0001,0.00001,0.000001$ and find that $\lambda=0.0001$ yields the best performance in each settings and datasets.

It is observed that quadratic regularization slightly improves the performance of contrastive learning, which is consistent with our intuition.

Testing Accuracy	weight decay=0.0001	quadratic regularization=0.0001
CIFAR10	$65.11\pm 0.51$	$\mathbf{65.54\pm 0.26}$
STL10	$71.02\pm 0.47$	$\mathbf{71.39\pm 0.39}$

Table 6: Quadratic regularization v.s. weight decay. We compare the top-1 accuracy of linear classifiers trained on features learned by SimCLR with the ResNet-18 encoder and different regularization methods. We repeat each experiment for 5 runs and report the mean and standard deviation. More details are provided in Section 5.2

A.3 Background on distance between subspaces

In this section, we will provide some basic properties of sine distance between subspaces. Recall the definition:

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\triangleq\left\|U_{1\perp}^{\top}U_{2}\right\|_{F}=\left\|U_{2\perp}^{\top}U_{1}\right\|_{F}.

(30)

where $U_{1},U_{2}\in\mathbb{O}_{d,r}$ are two orthogonal matrices. Similarly, we can also define:

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{2}\triangleq\left\|U_{1\perp}^{\top}U_{2}\right\|_{2}=\left\|U_{2\perp}^{\top}U_{1}\right\|_{2}.

We first give two equivalent definitions of this distance:

Proposition A.1

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}=r-\left\|U_{1}^{\top}U_{2}\right\|_{F}^{2}

Proof Write $U=[U_{1},U_{1\perp}]\in\mathbb{O}_{d,d}$ . We have

r=\|U_{2}\|_{F}^{2}=\|U^{\top}U_{2}\|_{F}^{2}=\left\|U_{1\perp}^{\top}U_{2}\right\|_{F}^{2}+\left\|U_{1}^{\top}U_{2}\right\|_{F}^{2},

then by definition of sine distance, we can obtain the desired equation.

Proposition A.2

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}=\frac{1}{2}\|U_{1}U_{1}^{\top}-U_{2}U_{2}^{\top}\|_{F}^{2}

Proof Expand the right hand and use Proposition A.1 we have:

	$\displaystyle\frac{1}{2}\\|U_{1}U_{1}^{\top}-U_{2}U_{2}^{\top}\\|_{F}^{2}=$	$\displaystyle\frac{1}{2}(\\|U_{1}U_{1}^{\top}\\|_{F}^{2}+\\|U_{2}U_{2}^{\top}\\|_{F}^{2}-2\tr(U_{1}U_{1}^{\top}U_{2}U_{2}^{\top}))$
	$\displaystyle=$	$\displaystyle\frac{1}{2}(r+r-2\tr(U_{1}^{\top}U_{2}U_{2}^{\top}U_{1}))$
	$\displaystyle=$	$\displaystyle r-\\|U_{1}^{\top}U_{2}\\|_{F}^{2}=\left\\|\sin\Theta\left(U_{1},U_{2}\right)\right\\|_{F}^{2}.$

With Propositions A.1 and A.2, it is easy to verify its properties to be a distance function. Obviously, we have $0\leq\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\leq\sqrt{r}$ and $\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}=\left\|\sin\Theta\left(U_{2},U_{1}\right)\right\|_{F}$ by definition. Moreover, we have the following results:

Lemma A.3 (Lemma 1 in Cai and Zhang (2018))

For any $U,V\in\mathbb{O}_{d,r}$ ,

\displaystyle\|\sin\Theta(U,V)\|_{2}\leq\inf_{O\in\mathbb{O}_{r,r}}\|UO-V\|_{2}\leq\sqrt{2}\|\sin\Theta(U,V)\|_{2},

(31)

and

\displaystyle\|\sin\Theta(U,V)\|_{F}\leq\inf_{O\in\mathbb{O}_{r,r}}\|UO-V\|_{F}\leq\sqrt{2}\|\sin\Theta(U,V)\|_{F}.

(32)

Proposition A.4 (Identity of indiscernibles)

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}=0\Leftrightarrow\exists O\in\mathbb{O}^{r\times r},\text{ s.t. }U_{1}O=U_{2}

Proof It is a straightforward corollary by definition:

	$\displaystyle\left\\|\sin\Theta\left(U_{1},U_{2}\right)\right\\|_{F}=0$	$\displaystyle\Leftrightarrow\left\\|U_{1\perp}^{\top}U_{2}\right\\|_{F}=0\Leftrightarrow U_{2\perp}\perp U_{1}$
		$\displaystyle\Leftrightarrow\exists O\in\mathbb{O}^{r\times r},\text{ s.t. }U_{1}O=U_{2}.$

Proposition A.5 (Triangular inequality)

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\leq\left\|\sin\Theta\left(U_{1},U_{3}\right)\right\|_{F}+\left\|\sin\Theta\left(U_{2},U_{3}\right)\right\|_{F}

Proof By the triangular inequality for Frobenius norm we have:

\|U_{1}U_{1}^{\top}-U_{2}U_{2}^{\top}\|_{F}\leq\|U_{1}U_{1}^{\top}-U_{3}U_{3}^{\top}\|_{F}+\|U_{2}U_{2}^{\top}-U_{3}U_{3}^{\top}\|_{F},

then apply Proposition A.2 to replace the Frobenius norm with sine distance we can finish the proof.

B Omitted proofs for Section 3

B.1 Proofs for Section 3.1 and Section 3.2

In this section, we will provide the proof of Proposition 3.1 and Corollary 3.2, the restatement of them and the detailed proof can be found in Proposition B.1 and Corollary B.2.

Proposition B.1 (Restatement of Proposition 3.1)

For two fixed augmentation functions $g_{1},g_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ , denote the augmented data matrices as $X_{1}=[g_{1}(x_{1}),\cdots,g_{1}(x_{n})]\in\mathbb{R}^{d\times n}$ and $X_{2}=[g_{2}(x_{1}),\cdots,g_{2}(x_{n})]\in\mathbb{R}^{d\times n}$ , when the augmented pairs are generated as in Definition 2.1, the optimal solution of contrastive learning problem (9) is given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

X_{1}X_{2}^{\top}+X_{2}X_{1}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{r}1_{r}^{\top}-I_{r})(X_{1}+X_{2})^{\top},

(33)

$u_{i}$ is the corresponding eigenvector and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

Proof [Proof of Proposition 3.1] When augmented pairs generation in Definition 2.1 is applied, the contrastive loss can be written as:

	$\displaystyle\mathcal{L}_{\text{SelfCon}}(W)=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{n}\sum_{i=1}^{n}[\langle Wg_{1}(x_{i}),Wg_{2}(x_{i})\rangle$
		$\displaystyle-\frac{1}{4(n-1)}\sum_{j\neq i}\langle Wg_{1}(x_{i})+Wg_{2}(x_{i}),Wg_{1}(x_{j})+Wg_{2}(x_{i})\rangle]$
	$\displaystyle=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{n}\sum_{i=1}^{n}\langle Wg_{1}(x_{i}),Wg_{2}(x_{i})\rangle$
		$\displaystyle+\frac{1}{4n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\langle Wg_{1}(x_{i})+Wg_{2}(x_{i}),Wg_{1}(x_{j})+Wg_{2}(x_{i})\rangle$
	$\displaystyle=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{2n}\tr(X_{1}^{\top}W^{\top}WX_{2}+X_{2}^{\top}W^{\top}WX_{1})$
		$\displaystyle+\frac{1}{4n(n-1)}\tr((1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top}W^{\top}W(X_{1}+X_{2}))$
	$\displaystyle=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}$
		$\displaystyle-\frac{1}{2n}\tr((X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})W^{\top}W)$
	$\displaystyle=$	$\displaystyle\frac{1}{2}\biggl{\\|}\lambda W^{\top}W-\frac{1}{2n\lambda}\quantity(X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})\biggr{\\|}_{F}^{2}$
		$\displaystyle-\biggl{\\|}\frac{1}{2n\lambda}\quantity(X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})\biggr{\\|}_{F}^{2}.$

Note that the last term only depends on $X$ , and the first term implies that when $W_{\text{CL}}$ is the optimal solution, $\lambda W_{\text{CL}}W_{\text{CL}}^{\top}$ is the best rank- $r$ approximation of $\frac{1}{(n-1)\lambda}XHX^{\top}$ , where $H:=1_{n}1_{n}^{\top}-I_{n}$ . Applying Lemma E.4 to the first term, we can conclude that $W_{\text{CL}}$ satisfies the desired conditions.

Corollary B.2 (Restatement of Corollary 3.2)

Under the same conditions as in Proposition 3.1, if we use random masking (Definition 2.2) as our augmentation function, then in expectation over the data augmentation, the optimal solution of contrastive learning problem (9) is given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top},

(34)

$u_{i}$ is the corresponding eigenvector and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

Proof [Proof of Corollary 3.2] Following the proof of Proposition 3.1, now we only need to compute the expectation over the augmentation distribution defined in Definition 2.2:

$\displaystyle\mathcal{L}_{\text{SelfCon}}(W)=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\mathbb{E}_{(g_{1},g_{2})}[\frac{1}{n}\sum_{i=1}^{n}[\langle Wg_{1}(x_{i}),Wg_{2}(x_{i})\rangle$
	$\displaystyle-\frac{1}{4(n-1)}\sum_{j\neq i}\langle Wg_{1}(x_{i})+Wg_{2}(x_{i}),Wg_{1}(x_{j})+Wg_{2}(x_{i})\rangle]]$
$\displaystyle=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\mathbb{E}_{(g_{1},g_{2})}[\frac{1}{2n}{\tr}((X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}$
	$\displaystyle-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})W^{\top}W)].$	(35)

Note that by the definition of random masking augmentation, we have $X_{1}=AX,X_{2}=(I-A)X$ , which implies $X_{1}+X_{2}=X$ . On the other hand, $X_{1}$ and $X_{2}$ have no common nonzero entries, hence the matrix $X_{1}X_{2}^{\top}+X_{2}X_{1}^{\top}$ only consists of off-diagonal entries and each of the off-diagonal entry denoted as $x_{ij}$ appears if and only if $a_{i}+a_{j}=1$ . Moreover, if it appears, we must have $x_{ij}$ equals to the $(i,j)$ -th element of $XX^{\top}$ . With this result, we can then compute the expectation in Equation (35):

	$\displaystyle\mathcal{L}_{\text{SelfCon}}(W)=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\mathbb{E}_{(g_{1},g_{2})}\biggl{[}\frac{1}{2n}{\tr}((X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}$
		$\displaystyle-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})W^{\top}W)\bigg{]}$
	$\displaystyle=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{2n}\tr(\quantity(\frac{1}{2}\Delta(XX^{\top})-\frac{1}{2(n-1)}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)$
	$\displaystyle=$	$\displaystyle\frac{1}{2}\norm{\lambda W^{\top}W-\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})}_{F}^{2}$
		$\displaystyle-\norm{\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})}_{F}^{2}.$

By a similar argument as in the proof of Proposition 3.1, we can conclude that $W_{\text{CL}}$ satisfies the desired conditions.

Remark B.3

Note that the two views generated by random masking augmentation have disjoint non-zero dimensions, hence contrasting such positive pairs yields a correlation between different dimensions only. That is why the first term in equation (34) appears to be $\Delta(XX^{\top})$ where the diagonal entries are eliminated.

B.2 Proofs for Section 3.3

In this section, we will prove Lemma 3.7, Theorems 3.9 and 3.10 in Section 3.3. The restatement and proof of them can be found in Lemma B.6, Theorem B.7, and Theorem B.9.

Before starting the proof, we give two technical lemmas to help the proof.

Lemma B.4 (Uniform distribution on the unit sphere (Marsaglia, 1972))

If $x_{1},x_{2},\cdots,x_{n}$ i.i.d. $\sim\mathcal{N}(0,1)$ , then $(x_{1}/\sqrt{{\sum_{i=1}^{n}x_{i}^{2}}},\cdots,x_{n}/\sqrt{\sum_{i=1}^{n}x_{i}^{2}})$ is uniformly distributed on the unit sphere $\mathbb{S}^{d}=\{(x_{1},\cdots,x_{n})\in\mathbb{R}^{n}:\sum_{i=1}^{n}x_{i}^{2}=1\}$ .

Lemma B.5

If $x_{1},x_{2},\cdots,x_{n}$ i.i.d. $\sim\mathcal{N}(0,1)$ , then:

\mathbb{E}\max_{1\leq i\leq n}x_{i}^{2}\leq 2\log(n).

Proof Denote $Y=\max_{1\leq i\leq n}x_{i}^{2}$ , then we have:

\displaystyle\exp(t\mathbb{E}Y)\leq\mathbb{E}\exp(tY)\leq\mathbb{E}\sum_{i=1}^{n}\exp(tx_{i}^{2})=n\mathbb{E}\exp(tx_{i}^{2}).

Note that the moment-generating function of chi-square distribution with $v$ degrees of freedom is:

M_{X}(t)=(1-2t)^{-v/2}.

Then combine this fact with Equation (B.2) we have:

\exp(t\mathbb{E}Y)\leq n(1-2t)^{-\frac{1}{2}},

which implies:

\mathbb{E}Y\leq\frac{\log(n)}{t}-\frac{1-2t}{2t},\quad\forall t<\frac{1}{2}.

In particular, take $t\rightarrow\frac{1}{2}$ yields:

\mathbb{E}Y\leq 2\log(n)

as desired.

Lemma B.6 (Restatement of Lemma 3.7)

\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U^{\star})=O\quantity(\frac{r}{d}\log d).

(36)

Proof [Proof of Lemma 3.7] Denote the columns of $U$ as $U=[u_{1},\cdots,u_{r}]\in\mathbb{O}_{d,r}$ , we have:

	$\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)=$	$\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}\max_{i\in[d]}\sum_{j=1}^{r}\|e_{i}^{\top}u_{j}\|^{2}$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}\sum_{j=1}^{r}\max_{i\in[d]}\|e_{i}^{\top}u_{j}\|^{2}$
	$\displaystyle=$	$\displaystyle r\mathbb{E}_{u\sim\operatorname{Uniform}(\mathbb{S}^{d})}\max_{i\in[d]}\|e_{i}^{\top}u\|^{2}.$

By Lemma B.4 we can transform this expectation on the uniform sphere distribution into normalized multivariate Gaussian variables:

\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)=r\mathbb{E}_{x_{1},\cdots,x_{d}}\frac{\max_{i\in[d]}x_{i}^{2}}{\sum_{j=1}^{d}x_{j}^{2}}.

(37)

where $x_{1},x_{2},\cdots,x_{d}$ are i.i.d. standard normal random variables. Apply Chebyshev’s inequality we know that:

\mathbb{P}\quantity(|\frac{1}{d}\sum_{i=1}^{d}x_{j}^{2}-1|>\epsilon)\leq\frac{2}{d\epsilon^{2}}.

In particular, take $\epsilon=1$ we have:

\mathbb{P}\quantity(\sum_{i=1}^{d}x_{j}^{2}<\frac{d}{2})\leq\frac{8}{d}.

Then take it back into Equation (37) and apply Lemma B.5 we obtain:

	$\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)=$	$\displaystyle r\mathbb{E}_{x_{1},\cdots,x_{d}}\frac{\max_{i\in[d]}x_{i}^{2}}{\sum_{j=1}^{d}x_{j}^{2}}\mathbb{I}\{\sum_{i=1}^{d}x_{j}^{2}<\frac{d}{2}\}$
		$\displaystyle+r\mathbb{E}_{x_{1},\cdots,x_{d}}\frac{\max_{i\in[d]}x_{i}^{2}}{\sum_{j=1}^{d}x_{j}^{2}}\mathbb{I}\{\sum_{i=1}^{d}x_{j}^{2}\geq\frac{d}{2}\}$
	$\displaystyle\leq$	$\displaystyle r\mathbb{P}\quantity(\sum_{i=1}^{d}x_{j}^{2}<\frac{d}{2})+\frac{2r}{d}\mathbb{E}_{x_{1},\cdots,x_{d}}\max_{i\in[d]}x_{i}^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{8r}{d}+\frac{4r\log d}{d}$

as desired.

Now we start proving our main results. Note that $U_{\text{AE}}$ is the top- $r$ left eigenspace of the observed covariance matrix and $U^{\star}$ is that of the core feature covariance matrix, and by Assumption 3.5 the observed covariance matrix is dominated by the covariance of random noise. The Davis-Kahan theorem provides a technique to estimate the eigenspace distance via estimating the difference between target matrices. We will adopt this technique to prove the lower bound of the feature recovery ability of autoencoders in Theorem 3.9.

Theorem B.7 (Restatement of Theorem 3.9)

Consider the spiked covariance model Eq.(5), under Assumptions 3.4-3.6 and $n>d\gg r$ , let $W_{AE}$ be the learned representation of autoencoder with singular value decomposition $W_{AE}=(U_{AE}\Sigma_{AE}V_{AE}^{\top})^{\top}$ (as in Eq.(7)). If we further assume $\{\sigma_{i}^{2}\}_{i=1}^{d}$ are different from each other and $\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma}$ for some universal constant $C_{\sigma}$ . Then there exist two universal constants $C_{\rho}>0,c\in(0,1)$ , such that when $\rho<C_{\rho}$ , we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{AE}\right)\right\|_{F}\geq c\sqrt{r}.

(38)

Proof [Proof of Theorem 3.9] Denote $M=\nu^{2}U^{\star}U^{\star\top}$ to be the target matrix, $x_{i}=U^{\star}z_{i}+\xi_{i},\quad i=1,2,\cdots n$ to be the samples generated from model 5 and let $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n},Z=[z_{1},\cdots,z_{n}]\in\mathbb{R}^{r\times n},E=[\xi_{1},\cdots,\xi_{n}]\in\mathbb{R}^{d\times n}$ to be the corresponding matrices. In addition, we write the column mean matrix $\bar{X}\in\mathbb{R}^{n\times d}$ of a matrix $X\in\mathbb{R}^{n\times d}$ to be $\bar{X}=\frac{1}{n}X1_{n}1_{n}^{\top}$ , that is, each column of $\bar{X}$ is the column mean of $X$ . We denote the sum of variance $\sigma_{i}^{2}$ as $\sigma_{\text{sum}}^{2}=\sum_{i=1}^{d}\sigma_{i}^{2}$ . As shown in Equation (7), autoencoders find the top- $r$ eigenspace of the following matrix:

\displaystyle\hat{M}_{1}=\frac{1}{n}X(I_{n}-\frac{1}{n}1_{n}1_{n}^{\top})X^{\top}

\displaystyle=\frac{1}{n}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}-\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}.

The rest of the proof is divided into three steps for the sake of presentation.

Step 1. Bound the difference between $\hat{M}_{1}$ and $\Sigma$

In this step, we aim to show that the data recovery of autoencoders is dominated by the random noise term. Note that $\Sigma=\mathrm{Cov}(\xi)=\mathbb{E}\xi\xi^{\top}$ , we just need to bound the norm of the following matrix:

\hat{M}_{1}-\Sigma=\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}+\frac{1}{n}(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})+(\frac{1}{n}EE^{\top}-\Sigma)-\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top},

(39)

and we will deal with these four terms separately.

For the first term, note that $\mathbb{E}zz^{\top}=\nu^{2}I_{r}$ , the first term can then be divided into two terms

\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}=M+U^{\star}(\frac{1}{n}ZZ^{\top}-\mathbb{E}zz^{\top})U^{\star\top}.

(40)

Then apply the concentration inequality of Wishart-type matrices (Lemma E.3) we have:

\mathbb{E}\|\frac{1}{n}ZZ^{\top}-\mathbb{E}zz^{\top}\|_{2}\leq(\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}.

Plug it back into (40) we obtain the bound for the first term:

\|\frac{1}{n}UZZ^{\top}U^{\top}\|_{2}\leq\|M\|_{2}+\|U\|_{2}\|\frac{1}{n}ZZ^{\top}-\mathbb{E}zz^{\top}\|_{2}\|U\|_{2}\leq\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}.

(41)

For the second term, since $Z$ and $E$ are independent, we must have $\mathbb{E}U^{\star}ZE^{\top}=0$ , so apply Lemma E.2 twice we have:

$\displaystyle\frac{1}{n}\mathbb{E}\\|EZ^{\top}U^{\star}\\|_{2}=$	$\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\mathbb{E}_{E}[\\|EZ^{\top}U^{\star}\\|_{2}\|Z]]$	(42)
$\displaystyle\lesssim$	$\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\\|Z\\|_{2}(\sigma_{\text{sum}}+r^{1/4}\sqrt{\sigma_{\text{sum}}\sigma_{(1)}}+\sqrt{r}\sigma_{(1)})]$
$\displaystyle\lesssim$	$\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\\|Z\\|_{2}]\sqrt{d}\sigma_{(1)}$
$\displaystyle\lesssim$	$\displaystyle\frac{1}{n}\sqrt{d}\sigma_{(1)}(r^{1/2}\nu+(nr)^{1/4}\nu+n^{1/2}\nu)$
$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu.$

3.

For the third term, apply Lemma E.3 again yields:

$\mathbb{E}\|\frac{1}{n}EE^{\top}-\Sigma\|_{2}\leq\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}.$ (43)

For the last term, note that each column of $\bar{Z}$ and $\bar{E}$ are the same, so we can rewrite it as:

\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}=(U^{\star}\bar{z}+\bar{\xi})(U^{\star}\bar{z}+\bar{\xi})^{\top},

where $\bar{z}=\frac{1}{n}\sum_{i=1}^{n}z_{i}$ and $\bar{\xi}=\frac{1}{n}\sum_{i=1}^{n}\xi_{i}$ . Since $z$ and $\xi$ are independent zero mean sub-Gaussian random variables and $\mathrm{Cov}(z)=\nu^{2}I_{r},\mathrm{Cov}(\xi)=\Sigma$ , we can conclude that:

		$\displaystyle\mathbb{E}\\|\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}\\|_{2}\leq\mathbb{E}\\|\bar{z}\bar{z}^{\top}\\|_{2}+2\mathbb{E}\\|\bar{z}\bar{\xi}^{\top}\\|_{2}+\mathbb{E}\\|\bar{\xi}\bar{\xi}^{\top}\\|_{2}$		(44)
	$\displaystyle\lesssim$	$\displaystyle\frac{r\nu^{2}}{n}+\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu+\frac{d\sigma_{(1)}^{2}}{n}.$		(44)

To sum up, combine equations (41)(42)(43)(LABEL:PCA_step1_term4) together we obtain the upper bound for the 2 norm expectation of matrix $\hat{M}-\Sigma$ :

\mathbb{E}\|\hat{M}_{1}-\Sigma\|_{2}\lesssim\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.

(45)

Step 2. Bound the sine distance between eigenspaces

As we have shown in step 1, the target matrix of autoencoders is close to the covariance matrix of random noise, that is, $\Sigma$ . Note that $\Sigma$ is assumed to be a diagonal matrix with different elements, hence its eigenspace only consists of canonical basis $e_{i}$ . Denote $U_{\Sigma}$ to be the top- $r$ eigenspace of $\Sigma$ and $\{e_{i}\}_{i\in C}$ to be its corresponding basis vectors, apply the Davis-Kahan Theorem E.1 we can conclude that:

		$\displaystyle\mathbb{E}\\|\sin\Theta(U_{\text{AE}},U_{\Sigma})\\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\\|\hat{M}_{1}-\Sigma\\|_{2}}{\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2}}$
	$\displaystyle\lesssim$	$\displaystyle\sqrt{r}\frac{1}{\sigma_{(1)}^{2}}\quantity(\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu)$
	$\displaystyle\lesssim$	$\displaystyle\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}).$

Step 3. Obtain the final result by triangular inequality

By Assumption 3.6 we know that the distance between canonical basis and the eigenspace of core features can be large:

	$\displaystyle\\|\sin\Theta(U^{\star},U_{\Sigma})\\|_{F}^{2}$	$\displaystyle=\\|U_{\Sigma\perp}^{\top}U^{\star}\\|_{F}^{2}=\sum_{i\in[d]\setminus C}\\|e_{i}^{\top}U^{\star}\\|^{2}=\\|U^{\star}\\|_{F}^{2}-\sum_{i\in C}\\|e_{i}^{\top}U^{\star}\\|^{2}$
		$\displaystyle\geq r-rI(U^{\star})=r-O\quantity(\frac{r^{2}}{d}\log d).$

Then apply the triangular inequality of sine distance (Proposition A.5) we can obtain the lower bound of autoencoders.

	$\displaystyle\mathbb{E}\\|\sin\Theta(U_{\text{AE}},U^{\star})\\|_{F}$	$\displaystyle\geq\mathbb{E}\\|\sin\Theta(U^{\star},U_{\Sigma})\\|_{F}-\mathbb{E}\\|\sin\Theta(U_{\text{AE}},U_{\Sigma})\\|_{F}$		(46)
		$\displaystyle\geq\sqrt{r}-O\quantity(\frac{r}{\sqrt{d}}\sqrt{\log d})-O\quantity(\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}})).$		(46)

By Assumption 3.5, it implies that when n and d are sufficiently large and $\rho$ is sufficiently small (smaller than a given constant $C_{\rho}>0$ ), there exists a universal constant $c\in(0,1)$ such that:

\mathbb{E}\|\sin\Theta(U_{\text{AE}},U^{\star})\|_{F}\geq c\sqrt{r}.

To start the proof, we introduce a technical lemma first.

Lemma B.8 (Lemma 4 in Zhang et al. (2018))

If $M\in\mathbb{R}^{p\times p}$ is any square matrix and $\Delta(M)$ is the matrix $M$ with diagonal entries set to 0 , then

\|\Delta(M)\|_{2}\leq 2\|M\|_{2}.

Here, factor ” 2 ” in the statement above cannot be improved.

Theorem B.9 (Restatement of Theorem 3.10)

Under the spiked covariance model Eq.(5), random masking augmentation in Definition 2.2, Assumptions 3.4-3.6 and $n>d\gg r$ , let $W_{CL}$ be any solution that minimizes Eq.(3), and denote its singular value decomposition as $W_{CL}=(U_{CL}\Sigma_{CL}V_{CL}^{\top})^{\top}$ , then we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{CL}\right)\right\|_{F}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

(47)

Proof [Proof of Theorem 3.10] The proof strategy is quite similar to that of Theorem 3.9 and we follow the notation defined in the first paragraph of that proof. As we have shown in Corollary 3.2, under our linear representation setting, the contrastive learning algorithm finds the top- $r$ eigenspace of the following matrix:

	$\displaystyle\hat{M}_{2}=$	$\displaystyle\frac{1}{n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})$
	$\displaystyle=$	$\displaystyle\frac{1}{n}\Delta((U^{\star}Z+E)(U^{\star}Z+E)^{\top})-\frac{1}{n-1}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}$
		$\displaystyle+\frac{1}{n(n-1)}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}.$

To prove the theorem, first we need to bound the difference between $\hat{M}_{2}$ and $M$ . We aim to show that the contrastive learning algorithm is dominated by the core feature term. Note that $\Sigma=\mathbb{E}Uzz^{\top}U^{\top}$ , we just need to bound the norm of the following matrix:

	$\displaystyle\hat{M}_{2}-M=$	$\displaystyle(\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})-M)+\frac{1}{n}\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})+\frac{1}{n}\Delta(EE^{\top})$		(48)
		$\displaystyle-\frac{1}{n-1}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}+\frac{1}{n(n-1)}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}.$		(48)

and we will also deal with these five terms separately.

For the first term, we can divide it into two parts:

\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})-M=\Delta(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star T}-M)+\Delta(M)-M.

(49)

Then apply Lemma B.8 and Lemma E.3 we have:

\mathbb{E}\|\Delta(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-M)\|_{2}\leq 2\mathbb{E}\|\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-M\|_{2}\leq 2(\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}.

Using the incoherent condition $I(U)=O(\frac{r}{d}\log d)$ , we know that:

\|M-\Delta(M)\|_{2}\leq\nu^{2}\max_{i\in[d]}\|e_{i}^{\top}U^{\star}\|_{2}^{2}=\nu^{2}I(U^{\star})\lesssim\frac{r}{d}\log d\nu^{2}.

Combine the two equations above together we obtain the bound for the first term:

	$\displaystyle\mathbb{E}\\|\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})-M\\|_{2}$	$\displaystyle\leq\mathbb{E}\\|\Delta(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-M)\\|_{2}+\\|M-\Delta(M)\\|_{2}$		(50)
		$\displaystyle\lesssim\nu^{2}(\frac{r}{d}\log d+\frac{r}{n}+\sqrt{\frac{r}{n}}).$		(51)

For the second term, apply equation (42) yields:

\displaystyle\frac{1}{n}\mathbb{E}\|\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})\|_{2}\leq\frac{4}{n}\mathbb{E}\|EZ^{\top}U^{\star\top}\|_{2}\lesssim\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu.

(52)

For the third term, apply equation (43) yields:

\mathbb{E}\|\frac{1}{n}\Delta(EE^{\top})\|_{2}=\mathbb{E}\|\Delta(\frac{1}{n}EE^{\top}-\Sigma)\|_{2}\leq 2\|\frac{1}{n}EE^{\top}-\Sigma\|_{2}\lesssim(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}.

(53)

For the fourth term, apply equation (LABEL:PCA_step1_term4) yields:

	$\displaystyle\mathbb{E}\\|\frac{1}{n-1}(U^{\star}\bar{Z}+\bar{E})(U\bar{Z}+\bar{E})^{\top}\\|_{2}\lesssim$	$\displaystyle\mathbb{E}\\|\frac{1}{n}(U\bar{Z}+\bar{E})(U\bar{Z}+\bar{E})^{\top}\\|_{2}$		(54)
	$\displaystyle\lesssim$	$\displaystyle\frac{r\nu^{2}}{n}+\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu+\frac{d\sigma_{(1)}^{2}}{n}.$		(54)

For the last term, by equations (41)(42)(43) we know:

	$\displaystyle\mathbb{E}\\|\frac{1}{n}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}\\|_{2}$
	$\displaystyle\lesssim\\|\Sigma\\|_{2}+\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu+\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}.$

Thus we can conclude that:

\mathbb{E}\|\frac{1}{n(n-1)}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}\|_{2}\lesssim\frac{d}{n}\sigma_{(1)}^{2}+\frac{r}{n}\nu^{2}.

(55)

To sum up, combine equations (50)(52)(53)(54)(55) together we obtain the upper bound for the 2 norm expectation of matrix $\hat{M}_{2}-M$ :

\mathbb{E}\|\hat{M}_{2}-M\|_{2}\lesssim\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}.

(56)

With the upper bound for $\|\hat{M}_{2}-M\|_{2}$ , simply apply Lemma E.1 we can obtain the desired bound for sine distance:

	$\displaystyle\mathbb{E}\\|\sin\Theta(U_{\text{CL}},U^{\star})\\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\\|\hat{M}_{2}-M\\|_{2}}{\nu^{2}}$	(57)
$\displaystyle\lesssim$	$\displaystyle\sqrt{r}\frac{1}{\nu^{2}}\quantity(\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}})$
$\displaystyle=$	$\displaystyle\sqrt{r}\quantity(\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\rho^{-2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\rho^{-1}\sqrt{\frac{d}{n}})$
$\displaystyle\lesssim$	$\displaystyle\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.$

Moreover, there exists an orthogonal matrix $\hat{O}\in\mathbb{O}^{r\times r}$ depending on $U_{\text{CL}}$ such that:

\mathbb{E}\|U^{\top}U_{\text{CL}}\hat{O}-I_{r}\|_{F}=\mathbb{E}\|U_{\text{CL}}\hat{O}-U\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\|\hat{M}_{2}-M\|_{2}}{\nu^{2}}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

which finishes the proof.

B.3 Proofs for Section 3.4

In this section, we will provide the proof of Theorems 3.13 and 3.14 with both regression and classification settings. The corresponding statement and proof can be found in Theorems B.10 and B.11.

For notation simplicity, define the prediction risk of predictor $\delta$ for classification and regression tasks as $\mathcal{R}_{c}(\delta):=\mathbb{E}_{\mathcal{D}}[\ell_{c}(\delta)]$ and $\mathcal{R}_{r}(\delta):=\mathbb{E}_{\mathcal{D}}[\ell_{r}(\delta)]$ , respectively. Define $\Sigma_{x}:=\nu^{2}U^{\star}U^{\star\top}+\Sigma$ . We write $\delta_{U,w}$ for $\delta_{U^{\top},w}$ with a slight abuse of notation. For two matrices $A$ and $B$ of the same order, we define $A\succeq B$ when $A-B$ is positive semi-definite.

Theorem B.10 (Restatement of Theorem 3.13)

Suppose the conditions in Theorem 3.10 hold. Then, for the classification task, we have

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star\top},w})]=O\quantity(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}})\wedge 1,

and for regression tasks,

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

Theorem B.11 (Restatement of Theorem 3.14)

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star},w})]\geq c_{c}^{\prime},

and for classification task, if $F$ is differentiable at $0$ and $F^{\prime}(0)>0$ , then

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star},w})]\geq c_{r}^{\prime},

where $c_{r}^{\prime}>0$ and $c_{c}^{\prime}>0$ are constants independent of $n$ and $d$ .

The proofs of Theorem B.10 and B.11 relies on Lemma B.15, B.16, B.17, B.18 and B.20 which are proved later in this section.

Proof [Proof of Theorem B.10: Classification Task Part] Lemma B.17 gives for any $U\in\mathbb{O}_{d,r}$ ,

	$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]$		(58)
	$\displaystyle\quad\leq((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2}+(\kappa\rho^{2}\vee 1)^{-1})\mathbb{E}_{\mathcal{D}}[\\|\sin\Theta(U,U^{\star})\\|_{2}].$		(59)

Substituting $U\leftarrow U_{AE}$ combined with Assumption 3.5 and $\kappa=O(1)$ concludes the proof.

Proof [Proof of Theorem B.10: Regression Part] Note that under Assumption 3.5 and $\kappa=O(1)$ , $(1+\rho^{-2})/(1+\kappa^{-1}\rho^{-2})^{2}=O(1)$ . Lemma B.20 gives for any $U\in\mathbb{O}_{d,r}$ ,

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})]=O\quantity((1+\rho^{-2})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]\|w^{\star}\|^{2}).

(60)

Theorem 3.10 with substitution $U\leftarrow U_{AE}$ gives the desired result.

Proof [Proof of Theorem B.11: Classification Part] Lemma B.16 gives that for $c_{1}:=1-1/(2\kappa r_{c})\in(0,1)$ , we can take $n\gg d\gg r$ and sufficiently small $\rho>0$ so that $\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U_{AE},U^{\star})\|_{F}^{2}]\geq c_{1}r$ holds. By Lemma B.18,

	$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U_{AE},w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]$
	$\displaystyle\quad\gtrsim\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(r-\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}^{2}))$
	$\displaystyle\quad\geq\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(1-c_{1})r)$
	$\displaystyle\quad\geq\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\frac{1}{2}),$		(61)

where the last inequality follows since $r\leq r_{c}$ . If we further take $\rho=\Theta(1)<1/2$ , the right hand becomes a positive constant. This concludes the proof.

Proof [Proof of Theorem B.11: Regression Part] From proposition B.19, we have

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U_{AE},w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})$
	$\displaystyle\quad={w^{\star}}^{\top}((I+(1/\nu^{2})U^{\star\top}\Sigma U^{\star})^{-1}$
	$\displaystyle\quad\quad-U^{\star\top}U_{AE}(U_{AE}^{\top}U^{\star}U^{\star\top}U_{AE}+(1/\nu^{2})U_{AE}^{\top}\Sigma U_{AE})^{-1}U_{AE}^{\top}U^{\star})w^{\star}.$

Thus from Lemma B.15,

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U_{AE},w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})$
	$\displaystyle\quad\geq\quantity(\frac{1}{1+\rho^{-2}}+\rho^{2}\kappa\quantity(\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}^{2}-r))\\|w^{\star}\\|^{2}.$		(62)

Using Lemma B.16 and by the same argument in the proof of Theorem 3.14: Classification Part, we conclude the proof.

Lemma B.12

For any $U\in\mathbb{O}_{d,r}$ ,

\displaystyle\lambda_{\min}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2}).

Proof Since $\lambda_{\min}(AC)\geq\lambda_{\min}(A)\lambda_{\min}(C)$ for symmetric positive semi-definite matrices $A$ and $C$ ,

	$\displaystyle\lambda_{\min}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})$
	$\displaystyle\quad\geq\lambda_{\min}(U^{\top}U^{\star}U^{\star\top}U)\lambda_{\min}(\nu^{2}(U^{\top}\Sigma_{x}U)^{-1})$
	$\displaystyle\quad\geq\lambda_{\min}(I-(I-U^{\top}U^{\star}U^{\star\top}U))\frac{\nu^{2}}{\lambda_{\max}(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)}$
	$\displaystyle\quad\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}),$

where we used Weyl’s inequality $\lambda_{\min}(A+C)\geq\lambda_{\min}(A)-\|C\|_{2}$ in the second inequality.

Lemma B.13

For any $U\in\mathbb{O}_{d,r}$ ,

\displaystyle\lambda_{\max}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})\leq\frac{\nu^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2})+\sigma_{(d)}^{2}}.

Proof Since $\|AC\|_{2}\leq\|A\|_{2}\|C\|_{2}$ ,

	$\displaystyle\lambda_{\max}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})$	$\displaystyle\leq\lambda_{\max}(\nu^{2}(U^{\top}\Sigma_{x}U)^{-1})$
		$\displaystyle\leq\frac{\nu^{2}}{\lambda_{\min}(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)}$
		$\displaystyle\leq\frac{\nu^{2}}{\lambda_{\min}(\nu^{2}I-\nu^{2}(I-U^{\top}U^{\star}U^{\star\top}U)+U^{\top}\Sigma U)}$
		$\displaystyle\leq\frac{\nu^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2})+\sigma_{(d)}^{2}},$

where we used Weyl’s inequality $\lambda_{\min}(A+C)\geq\lambda_{\min}(A)-\|C\|_{2}$ and $\lambda_{\min}(\nu^{2}I+U^{\top}\Sigma U)\geq\nu^{2}+\sigma_{(d)}^{2}$ .

Lemma B.14

For any $U\in\mathbb{O}_{d,r}$ ,

	$\displaystyle\\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad=O\quantity(\frac{1}{1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\kappa^{-1}\rho^{-2}}\frac{1+\rho^{-2}}{1+\kappa^{-1}\rho^{-2}}\\|\sin\Theta(U,U^{\star})\\|_{2}).$

Proof Observe that

	$\displaystyle\\|(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad\leq\\|(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-(U^{\top}\Sigma_{x}U)^{-1}\\|_{2}+\\|(U^{\top}\Sigma_{x}U)^{-1}-U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}{U^{\star}}\\|_{2}$
	$\displaystyle\quad:=(T1)+(T2).$

For the term $(T1)$ ,

	$\displaystyle(T1)$	$\displaystyle=\\|(U^{\top}\Sigma_{x}U)^{-1}(U^{\top}\Sigma_{x}U)(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-(U^{\top}\Sigma_{x}U)^{-1}(U^{\star\top}\Sigma_{x}U^{\star})(U^{\star\top}\Sigma_{x}U^{\star})^{-1}\\|_{2}$
		$\displaystyle\leq\\|(U^{\top}\Sigma_{x}U)^{-1}\\|_{2}\\|U^{\top}\Sigma_{x}U-U^{\star\top}\Sigma_{x}U^{\star}\\|_{2}\\|(U^{\star\top}\Sigma_{x}U^{\star})^{-1}\\|_{2}.$

Note

	$\displaystyle\\|U^{\top}\Sigma_{x}U-U^{\star\top}\Sigma_{x}U^{\star}\\|_{2}$	$\displaystyle=\\|\nu^{2}U^{\top}U^{\star}U^{\star\top}U-\nu^{2}I+U^{\top}\Sigma U-U^{\star\top}\Sigma U^{\star}\\|_{2}$
		$\displaystyle\leq\nu^{2}\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\\|U^{\top}\Sigma(U-U^{\star})+(U-U^{\star})^{\top}\Sigma U^{\star}\\|_{2}$
		$\displaystyle\leq\nu^{2}\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+2\sigma_{(1)}^{2}\\|U-U^{\star}\\|_{2}.$

Also we have $\lambda_{\min}(U^{\top}\Sigma_{x}U)\geq\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2}$ from the proof of Lemma B.13 and $\lambda_{\min}(U^{\star\top}\Sigma_{x}U^{\star})\geq\nu^{2}+\sigma_{(d)}^{2}$ . Therefore

\displaystyle(T1)

\displaystyle\leq\frac{1}{(\nu^{2}+\sigma_{(d)}^{2})(\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2})}(\nu^{2}\|\sin\Theta(U,U^{\star})\|_{2}^{2}+2\sigma_{(1)}^{2}\|U-U^{\star}\|_{2}).

For the term $(T2)$ ,

	$\displaystyle(T2)$	$\displaystyle=\\|(U^{\top}\Sigma_{x}U)^{-1}-U^{\star\top}{(U^{\star}+(U-U^{\star}))}(U^{\top}\Sigma_{x}U)^{-1}(U^{\star}+(U-U^{\star}))^{\top}U^{\star}\\|_{2}$
		$\displaystyle=\\|-U^{\star\top}(U-U^{\star})(U^{\top}\Sigma_{x}U)^{-1}-(U^{\top}\Sigma_{x}U)^{-1}(U-U^{\star})^{\top}U^{\star}$
		$\displaystyle\quad-U^{\star\top}{(U-U^{\star})}(U^{\top}\Sigma_{x}U)^{-1}(U-U^{\star})^{\top}U^{\star}\\|_{2}$
		$\displaystyle\leq\frac{1}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})+\sigma_{(d)}^{2}}(2\\|U-U^{\star}\\|_{2}+\\|U-U^{\star}\\|_{2}^{2}).$

From Lemma A.3, $\|\sin\Theta(U,U^{\star})\|_{2}\leq\|U-U^{\star}\|_{2}$ . Finally from these results and $\|U-U^{\star}\|_{2}^{2}\leq 2\|U-U^{\star}\|_{2}$ ,

	$\displaystyle\\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad=O\quantity(\frac{\nu^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})+\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}+\sigma_{(d)}^{2}}\\|U-U^{\star}\\|_{2}).$

Since LHS does not depend on the orthogonal transformation $U\leftarrow UO$ where $O\in\mathbb{O}_{r,r}$ , we obtain

	$\displaystyle\\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad=O\quantity(\frac{\nu^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})+\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}+\sigma_{(d)}^{2}}\inf_{O\in\mathbb{O}_{r,r}}\\|UO-U^{\star}\\|_{2}).$

Combined again with Lemma A.3, we obtain the desired result.

Lemma B.15

For any $U\in\mathbb{O}_{d,r}$ ,

	$\displaystyle\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})$
	$\displaystyle\quad\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}).$

Proof Observe

	$\displaystyle\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})$
	$\displaystyle\quad\geq\lambda_{\min}((I+(1/\nu^{2})U^{\star\top}\Sigma U^{\star})^{-1})-\\|U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star}\\|_{2}.$

Since $U^{\top}U^{\star}U^{\star\top}U\succeq 0$ , it follows that $(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}\preceq\nu^{2}(U^{\top}\Sigma U)^{-1}$ . Thus

	$\displaystyle\\|U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad\leq\nu^{2}\lambda_{\max}((U^{\top}\Sigma U)^{-1})\\|U^{\star\top}U\\|^{2}_{2}$
	$\displaystyle\quad\leq\frac{\nu^{2}}{\sigma_{(d)}^{2}}\\|U^{\star\top}U\\|^{2}_{F}$
	$\displaystyle\quad=\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}),$

where we used $\lambda_{\max}((U^{\top}\Sigma U)^{-1})\leq 1/\lambda_{\min}(U^{\top}\Sigma U)\leq 1/\sigma_{(d)}^{2}$ and $\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}=r-\left\|U_{1}^{\top}U_{2}\right\|_{F}^{2}$ from Proposition A.1. Combined with Lemma B.13, we obtain

	$\displaystyle\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})$
	$\displaystyle\quad\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}).$

Lemma B.16

Suppose the conditions in Theorem 3.9 hold. Fix $c_{1}\in(0,1)$ . There exists a constant $c_{2}>0$ such that if $\sqrt{r\log d/d}\vee\rho^{2}\vee d/n<c_{2}$ , then,

\displaystyle\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{\text{AE}},U^{\star})\|_{F}^{2}

\displaystyle\geq c_{1}r,

where $c_{1}\in(0,1)$ is a universal constant.

Proof By Cauchy-Schwartz inequality,

	$\displaystyle\mathbb{E}_{\mathcal{D}}\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}^{2}-r$
	$\displaystyle\quad\geq(\mathbb{E}_{\mathcal{D}}\\|\sin\Theta(U_{AE},U^{\star})\\|_{F})^{2}-r$
	$\displaystyle\quad=(\mathbb{E}_{\mathcal{D}}\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}-\sqrt{r})\quantity(\mathbb{E}_{\mathcal{D}}\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}+\sqrt{r}).$

From Theorem 3.9, there exists a constant $c_{3}>0$ such that we have

\mathbb{E}_{\mathcal{D}}\left\|\sin\Theta\left(U^{\star},U_{AE}\right)\right\|_{F}\geq\sqrt{r}-c_{3}\frac{r}{\sqrt{d}}\sqrt{\log d}-c_{3}\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}).

Therefore combined with a trivial bound $\|\sin\Theta(U_{AE},U^{\star})\|_{F}\leq\sqrt{r}$ ,

	$\displaystyle\mathbb{E}_{\mathcal{D}}\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}^{2}-r$	$\displaystyle\geq-rc_{3}\frac{r^{1/2}}{\sqrt{d}}\sqrt{\log d}+\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}$
		$\displaystyle\geq-rc_{3}\quantity(2\frac{r^{1/2}}{\sqrt{d}}\sqrt{\log d}\vee 6\rho^{2}\vee 6\sqrt{\frac{d}{n}}),.$

where we used $\rho\sqrt{d/n}\leq\rho^{2}\vee d/n\leq\rho^{2}\vee\sqrt{d/n}$ since $d<n$ . Thus we can take $c_{2}=6(1-c_{1})/c_{3}$ . This concludes the proof.

Lemma B.17

For any $U\in\mathbb{O}_{d,r}$ ,

	$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]$
	$\displaystyle\quad\leq((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2}+(\kappa\rho^{2}\vee 1)^{-1})\mathbb{E}_{\mathcal{D}}[\\|\sin\Theta(U,U^{\star})\\|_{2}].$

Proof Recall that we are considering the class of linear classifiers $\{\delta_{U,w}:w\in\mathbb{R}^{r}\}$ , where $\delta_{U,w}(\check{x})=\mathbb{I}\{F(\check{x}^{\top}Uw)>1/2\}$ . For notational simplicity, write $\beta:=Uw$ and $\beta^{\star}:=U^{\star}w^{\star}$ .

\displaystyle\mathcal{R}_{c}(\delta_{U,w})=\mathbb{P}_{\mathcal{E}}(\delta_{U,w}(\check{x})\neq\check{y})=\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)+\mathbb{P}_{\mathcal{E}}(\check{y}=1,F(\check{x}^{\top}\beta)\leq 1/2).

Since $F(0)=1/2$ and $F$ is monotone increasing, the false positive probability becomes

	$\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)$	$\displaystyle=\mathbb{P}_{\mathcal{E}}(\check{y}=0,\check{x}^{\top}\beta>0)$
		$\displaystyle=\mathbb{E}_{\mathcal{E}}[\mathbb{E}_{\mathcal{E}}[\mathbb{I}\{\check{y}=0\}\|\check{x},\check{z}]\mathbb{I}\{\check{x}^{\top}\beta>0\}]$
		$\displaystyle=\mathbb{E}_{\mathcal{E}}[(1-F(\nu^{-1}\check{z}^{\top}U^{\star\top}\beta^{\star}))\mathbb{I}\{\check{x}^{\top}\beta>0\}].$

Write $\omega:=\check{x}^{\top}\beta$ and $\omega^{\star}:=\nu^{-1}\check{z}^{\top}U^{\star\top}\beta^{\star}$ . From assumption, $(\omega^{\star},\omega)$ jointly follows a normal distribution with mean $0$ . Write ${v^{\star}}^{2}:=\mathrm{Var}(\omega^{\star})={w^{\star}}^{\top}w^{\star}$ , $v^{2}:=\mathrm{Var}(\omega)=\beta^{\top}\Sigma_{x}\beta$ , where $\Sigma_{x}:=\nu^{2}U^{\star}U^{\star\top}+\Sigma$ . Let $\tau:=\text{Cor}(\omega^{\star},\omega)=\nu{w^{\star}}^{\top}U^{\star\top}\beta/(v^{\star}v)$ . By a formula for conditional normal distribution, we have $\omega|\omega^{\star}\sim N(\tau v\omega^{\star}/v^{\star},v^{2}(1-\tau^{2}))$ . This gives

	$\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}[(1-F(\omega^{\star}))\mathbb{I}\{\omega>0\}]$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}[(1-F(\omega^{\star}))\mathbb{E}_{\mathcal{E}}[\mathbb{I}\{\omega>0\}\|\omega^{\star}]]$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}[(1-F(\omega^{\star}))\mathbb{P}_{\mathcal{E}}(\omega>0\|\omega^{\star})]$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\mathbb{P}_{\mathcal{E}}\quantity(\frac{\omega-\tau v\omega^{\star}/v^{\star}}{v(1-\tau^{2})^{1/2}}>-\frac{\tau v\omega^{\star}/v^{\star}}{v(1-\tau^{2})^{1/2}}\middle\|\omega^{\star})]$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})]$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}]+\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}<0\}],$

where $\Phi$ is cumulative distribution function of $N(0,1)$ and $\alpha:=\tau/(1-\tau^{2})^{1/2}$ . We define $\Psi_{F}$ as $\Psi_{F}(s^{2}):=2E_{u\sim N(0,s^{2})}[F(u)\mathbb{I}\{u>0\}]$ . When $F(u)=1/(1+e^{-u})$ , $\Psi_{F}(s^{2})$ is called the logistic-normal integral, whose analytical form is not known (Pirjol, 2013). Since a random variable $\omega^{\star}$ is symmetric about mean $0$ and $F(u)=1-F(-u)$ ,

	$\displaystyle\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}<0\}]$	$\displaystyle=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(-\omega^{\star}))\quantity(1-\Phi(\alpha\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}]$
		$\displaystyle=\mathbb{E}_{\mathcal{E}}\quantity[F(\omega^{\star})\quantity(1-\Phi(\alpha\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}].$

Hence

	$\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)$
	$\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(\Phi(\alpha\omega^{\star}/v^{\star})+F(\omega^{\star})-2F(\omega^{\star})\Phi(\alpha\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}]$
	$\displaystyle\quad=\frac{1}{2}\Psi_{F}({v^{\star}}^{2})-\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}].$

Note that the true negative probability is exactly the same as the false positive probability under our settings:

	$\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=1,F(\check{x}^{\top}\beta)\leq 1/2)$	$\displaystyle=\mathbb{E}_{\mathcal{E}}[F(\check{x}^{\top}\beta^{\star})\mathbb{I}\{\check{x}^{\top}\beta\leq 0\}]$
		$\displaystyle=\mathbb{E}_{\mathcal{E}}[F(-\check{x}^{\top}\beta^{\star})\mathbb{I}\{\check{x}^{\top}\beta\geq 0\}]$
		$\displaystyle=\mathbb{E}_{\mathcal{E}}[(1-F(\check{x}^{\top}\beta^{\star}))\mathbb{I}\{\check{x}^{\top}\beta\geq 0\}]$
		$\displaystyle=\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2).$

Therefore

\displaystyle\mathcal{R}_{c}(\delta_{U,w})

\displaystyle=\Psi_{F}({v^{\star}}^{2})-2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}].

Let

	$\displaystyle\tau_{\max,U}$	$\displaystyle:=\sup_{w\in\mathbb{R}^{r}}\nu{w^{\star}}^{\top}U^{\star\top}Uw/({w^{\star}}^{\top}w^{\star}w^{\top}U^{\top}\Sigma_{x}Uw)^{1/2},$
	$\displaystyle\tau_{\max,U^{\star}}$	$\displaystyle:=\sup_{w\in\mathbb{R}^{r}}\nu{w^{\star}}^{\top}w/({w^{\star}}^{\top}w^{\star}w^{\top}U^{\star\top}\Sigma_{x}U^{\star}w)^{1/2}.$

From Cauchy-Schwartz inequality,

	$\displaystyle\tau_{\max,U}^{2}$	$\displaystyle=\frac{\nu^{2}{w^{\star}}^{\top}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}w^{\star}}{{w^{\star}}^{\top}w^{\star}},$
	$\displaystyle\tau_{\max,U^{\star}}^{2}$	$\displaystyle=\frac{\nu^{2}{w^{\star}}^{\top}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}w^{\star}}{{w^{\star}}^{\top}w^{\star}}.$

Define $\alpha_{\max,U}:=\tau_{\max,U}/(1-\tau_{\max,U}^{2})^{1/2}$ and $\alpha_{\max,U^{\star}}:=\tau_{\max,U^{\star}}/(1-\tau_{\max,U^{\star}}^{2})^{1/2}$ . Then, since on the event where $\omega^{\star}>0$ , $\alpha\mapsto\Phi(\alpha\omega^{\star}/v^{\star})$ is monotone increasing and $2F(w^{\star})-1$ is non-negative, we have

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})$	$\displaystyle=\Psi_{F}({v^{\star}}^{2})-2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}]$
	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$	$\displaystyle=\Psi_{F}({v^{\star}}^{2})-2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}].$

This yields

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad=2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)(\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}].$

Note that for any $a,b\geq 0$ ,

\displaystyle|\Phi(b)-\Phi(a)|\leq\phi(a\wedge b)|b-a|,

where $\phi$ is a density function of standard normal distribution. Observe

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\leq 2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\|\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star})\|\mathbb{I}\{\omega^{\star}>0\}]$
	$\displaystyle\quad\lesssim\frac{2}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|\omega^{\star}\phi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})\omega^{\star}/v^{\star})\frac{\phi(\omega^{\star}/v^{\star})}{v^{\star}}\differential{\omega^{\star}}$
	$\displaystyle\quad\lesssim\frac{\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\phi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})\omega^{\star}/v^{\star})\differential{\omega^{\star}}$
	$\displaystyle\quad=\frac{\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|}{\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\frac{\exp(-1/(2((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})^{-2}{v^{\star}}^{2})){\omega^{\star}}^{2})}{\sqrt{2\pi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})^{-2}{v^{\star}}^{2})}}\differential{\omega^{\star}}$
	$\displaystyle\quad=\frac{\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|}{\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U}}(\Psi_{F}(((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U^{\star}})^{-2}{v^{\star}}^{2}))-1/2),$

where we used $\sup_{u>0}u\phi(u)<\infty$ . Since $(a-b)=(a^{2}-b^{2})/(a+b)\leq(a^{2}-b^{2})/(a\wedge b)$ for $a,b>0$ , and $\Psi_{F}\leq 1$ , we obtain

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})\lesssim\frac{|\alpha_{\max,U^{\star}}^{2}-\alpha_{\max,U}^{2}|}{\alpha_{\max,U^{\star}}^{2}\wedge\alpha_{\max,U}^{2}}.

When $\tau_{\max,U^{\star}}\geq\tau_{\max,U}$ , since $\tau\mapsto\tau^{2}/(1-\tau^{2})$ is increasing in $\tau>0$ ,

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$	$\displaystyle\lesssim\frac{\alpha_{\max,U^{\star}}^{2}-\alpha_{\max,U}^{2}}{\alpha_{\max,U}^{2}}$
		$\displaystyle=\frac{\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}}{(1-\tau_{\max,U^{\star}}^{2})\tau_{\max,U}^{2}}.$		(63)

From Lemma B.12 and B.13, we have

	$\displaystyle\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})\leq\tau_{\max,U}^{2}$	$\displaystyle\leq\frac{\nu^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})+\sigma_{(d)}^{2}},$
	$\displaystyle\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}\leq\tau_{\max,U^{\star}}^{2}$	$\displaystyle\leq\frac{\nu^{2}}{\nu^{2}+\sigma_{(d)}^{2}}.$		(64)

Then, Equation (63) becomes

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\lesssim\frac{\nu^{2}+\sigma_{(d)}^{2}}{\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})}(\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2})$
	$\displaystyle\quad\leq\frac{\nu^{2}+\sigma_{(d)}^{2}}{\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})}\\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad\leq\frac{(\kappa\rho^{2}+1)(\rho^{-2}+1)^{2}}{(1+\kappa^{-1}\rho^{-2})(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})^{2}}\\|\sin\Theta(U,U^{\star})\\|_{2}$
	$\displaystyle\quad=\frac{\kappa\rho^{2}(\rho^{-2}+1)^{2}}{(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})^{2}}\\|\sin\Theta(U,U^{\star})\\|_{2}.$

where the last inequality follows from Lemma B.14.

On the event where $\|\sin\Theta(U,U^{\star})\|_{2}^{2}\leq 1/2$ ,

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})\lesssim\kappa\rho^{2}(1+\rho^{-2})^{2}\|\sin\Theta(U,U^{\star})\|_{2}.

When $\tau_{\max,U^{\star}}<\tau_{\max,U}$ , on the event where $\|\sin\Theta(U,U^{\star})\|_{2}\leq\kappa^{-1}\rho^{-2}/2$ ,

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\lesssim\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}}\frac{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})+\sigma_{(d)}^{2}}{-\nu^{2}\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\sigma_{(d)}^{2}}(\tau_{\max,U}^{2}-\tau_{\max,U^{\star}}^{2})$
	$\displaystyle\quad\leq\frac{(\nu^{2}+\sigma_{(1)}^{2})^{2}}{\nu^{2}}\frac{1}{-\nu^{2}\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\sigma_{(d)}^{2}}$
	$\displaystyle\quad\quad\times\\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad\leq\frac{(1+\rho^{-2})^{3}}{(-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\kappa^{-1}\rho^{-2})^{3}}\\|\sin\Theta(U,U^{\star})\\|_{2}$
	$\displaystyle\quad\lesssim(\kappa(1+\rho^{2}))^{3}\\|\sin\Theta(U,U^{\star})\\|_{2},$

where we used Lemma B.14 again.

In summary, on the event where $\|\sin\Theta(U,U^{\star})\|_{2}\leq\kappa^{-1}\rho^{-2}/2\wedge 1/2$ ,

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\lesssim((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2})\\|\sin\Theta(U,U^{\star})\\|_{2}.$

On the other hand, on the event where $\|\sin\Theta(U,U^{\star})\|_{2}>\kappa^{-1}\rho^{-2}/2\wedge 1/2$ , we have a trivial inequality $\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})\leq 1$ . This gives

	$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]$
	$\displaystyle\quad\lesssim((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2})\mathbb{E}_{\mathcal{D}}[\\|\sin\Theta(U,U^{\star})\\|_{2}]$
	$\displaystyle\quad\quad+\mathbb{P}_{\mathcal{D}}(\\|\sin\Theta(U,U^{\star})\\|_{2}>\kappa^{-1}\rho^{-2}/2\wedge 1/2)$
	$\displaystyle\quad\lesssim((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2}+(\kappa\rho^{2}\vee 1))\mathbb{E}_{\mathcal{D}}[\\|\sin\Theta(U,U^{\star})\\|_{2}],$

where the last inequality follows from Markov’s inequality.

Lemma B.18

Suppose $U\in\mathbb{O}_{d,r}$ satisfies $1/(1+\rho^{2})-\kappa(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2})\geq 0$ . Then,

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\gtrsim\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2})).$

Proof We firstly bound the term $\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}$ . From Lemma B.15,

	$\displaystyle\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}$	$\displaystyle\geq\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})$
		$\displaystyle\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}).$		(65)

From assumption, RHS of Equation (65) is non-negative. Then using the inequality $a-b=(a^{2}-b^{2})/(a+b)\geq(a^{2}-b^{2})/(2a)$ for $a\geq b\geq 0$ ,

	$\displaystyle\alpha_{\max,U^{\star}}-\alpha_{\max,U}$	$\displaystyle\gtrsim\frac{1}{\alpha_{\max,U^{\star}}}(\alpha_{\max,U^{\star}}^{2}-\alpha_{\max,U^{\star}}^{2})$
		$\displaystyle\geq\frac{(1-\tau_{\max,U^{\star}}^{2})^{1/2}}{\tau_{\max,U^{\star}}}\frac{\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}}{(1-\tau_{\max,U^{\star}}^{2})(1-\tau_{\max,U}^{2})}.$

From Equation (64) and Equation (65),

	$\displaystyle\alpha_{\max,U^{\star}}-\alpha_{\max,U}$
	$\displaystyle\quad\gtrsim\quantity(\frac{\nu^{2}+\sigma_{(d)}^{2}}{\nu^{2}})^{1/2}\quantity(\frac{\nu^{2}+\sigma_{(1)}^{2}}{\sigma_{(1)}^{2}})^{3/2}\quantity(\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}))$
	$\displaystyle\quad=(1+\kappa^{-1}\rho^{-2})^{1/2}(1+\rho^{2})^{3/2}\quantity(\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2})).$		(66)

From the proof of Lemma B.17,

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad=2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)(\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}].$

Note that for any $b\geq a\geq 0$ , $\Phi(b)-\Phi(a)\geq\phi(b)(b-a)$ . Since we assume RHS of Equation (65) is positive, $\alpha_{\max,U^{\star}}\geq\alpha_{\max,U}$ . Thus on the event where $\omega^{\star}>0$ , $\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star}\geq\alpha_{\max,U}\omega^{\star}/v^{\star}$ . Observe

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\geq 2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star}-\alpha_{\max,U}\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}]$
	$\displaystyle\quad=\frac{2}{v^{\star}}(\alpha_{\max,U^{\star}}-\alpha_{\max,U})\int_{0}^{\infty}(2F(\omega^{\star})-1)\omega^{\star}\frac{\phi(\omega^{\star}/v^{\star})}{v^{\star}}\phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})\differential{\omega^{\star}}$
	$\displaystyle\quad\simeq\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\omega^{\star}\exp(-(1/2)(1+\alpha_{\max,U^{\star}}^{2}){\omega^{\star}}^{2}/{v^{\star}}^{2})\differential{\omega^{\star}}$
	$\displaystyle\quad\simeq\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}\int_{0}^{\infty}(2F((1+\alpha_{\max,U^{\star}}^{2})^{-1/2}v^{\star}\omega^{\star})-1)\omega^{\star}\exp(-(1/2){\omega^{\star}}^{2})\differential{\omega^{\star}},$

where in the last equality we transformed $w^{\star}\to(1+\alpha_{\max,U^{\star}}^{2})^{1/2}w^{\star}/v^{\star}$ . Since $F(u)$ is differentiable at $0$ and $F(0)=1/2$ ,

\displaystyle F(u)-1/2=F^{\prime}(0)u+o(u).

Thus there exists a constant $\epsilon>0$ only depending on $F$ such that $2(F(u)-1/2)\geq F^{\prime}(0)u$ for all $u\in[0,\epsilon]$ since $F^{\prime}(0)>0$ . This gives

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\gtrsim\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}F^{\prime}(0)(1+\alpha_{\max,U^{\star}}^{2})^{-1/2}v^{\star}$
	$\displaystyle\quad\quad\times\int_{0}^{\epsilon(1+\alpha_{\max,U^{\star}}^{2})^{1/2}v^{\star}}{\omega^{\star}}^{2}\exp(-(1/2){\omega^{\star}}^{2})\differential{\omega^{\star}}$
	$\displaystyle\quad\gtrsim\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}(1+\alpha_{\max,U^{\star}}^{2})^{-1/2}v^{\star}\int_{0}^{\epsilon v^{\star}}{\omega^{\star}}^{2}\exp(-(1/2){\omega^{\star}}^{2})\differential{\omega^{\star}}$
	$\displaystyle\quad\gtrsim\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}(1+\alpha_{\max,U^{\star}}^{2})^{-1/2}.$

The last inequality follows since $v^{\star}=\|w^{\star}\|=1$ by assumption. It is noted that $\alpha_{\max,U^{\star}}^{2}\leq\nu^{2}/\sigma_{(d)}^{2}$ from Equation (64). Therefore with Equation (66),

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\gtrsim\frac{1}{(1+\kappa\rho^{2})^{3/2}}(1+\kappa^{-1}\rho^{-2})^{1/2}(1+\rho^{2})^{3/2}\quantity(\frac{1}{1+\rho^{-2}}-\kappa\rho^{2}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}))$
	$\displaystyle\quad\gtrsim\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2})).$

Proposition B.19

For any $U\in\mathbb{O}_{d,r}$ ,

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})

\displaystyle=\nu^{2}{w^{\star}}^{\top}(I-\nu^{2}U^{\star\top}U(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)^{-1}U^{\top}U^{\star})w^{\star}+\sigma_{\epsilon}^{2}.

Proof [Proof of Proposition B.19] Generate random variables $(\check{x},\check{z},\check{\xi},\check{\epsilon})$ following the model (19). We calculate the prediction risk of $\delta_{U,w}$ as:

	$\displaystyle\mathcal{R}_{r}(\delta_{U,w})$	$\displaystyle:=\mathbb{E}_{\mathcal{E}}(\check{y}-\check{x}^{\top}Uw)^{2}$
		$\displaystyle=\mathrm{Var}_{\mathcal{E}}(\nu^{-1}\check{z}^{\top}w^{\star}+\check{\epsilon})^{2}-2\mathrm{Cov}_{\mathcal{E}}(\nu^{-1}\check{z}^{\top}w^{\star}+\check{\epsilon},U^{\star}\check{z}+\check{\xi})Uw$
		$\displaystyle\quad+w^{\top}U^{\top}\mathrm{Var}_{\mathcal{E}}(U^{\star}\check{z}+\check{\xi})Uw$
		$\displaystyle=\\|w^{\star}\\|^{2}+\sigma_{\epsilon}^{2}-2\nu{w^{\star}}^{\top}U^{\star\top}Uw+w^{\top}(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)w$
		$\displaystyle=(w-A^{-1}b)^{\top}A(w-A^{-1}b)-b^{\top}A^{-1}b+\\|w^{\star}\\|^{2}+\sigma_{\epsilon}^{2},$

where $A:=\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U$ and $b:=\nu U^{\top}U^{\star}w^{\star}$ . From this, we obtain

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})

\displaystyle={w^{\star}}^{\top}\quantity(I-U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star})w^{\star}+\sigma_{\epsilon}^{2}.

Lemma B.20

For any $U\in\mathbb{O}_{d,r}$ ,

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})=O\quantity((1+\rho^{-2})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]\|w^{\star}\|^{2}).

Proof [Proof of Lemma B.20] From proposition B.19, we have

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})$
	$\displaystyle\quad={w^{\star}}^{\top}\quantity((I+(1/\nu^{2})U^{\star\top}\Sigma U^{\star})^{-1}-U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}{U^{\star}})w^{\star}.$

Note that $\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})\equiv\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{UO,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})$ for any orthogonal matrix $O\in\mathbb{O}_{r,r}$ . Take $\tilde{O}\in\mathbb{O}_{r,r}$ such that $\|U\tilde{O}-U^{\star}\|_{2}\leq\sqrt{2}\|\sin\Theta(U,O)\|_{2}$ without loss of generality, since we can always take a sequence $(\tilde{O}_{m})_{m\geq 1}$ such that $\|UO_{m}-U^{\star}\|_{2}\leq\sqrt{2}\|\sin\Theta(U,O)\|_{2}+1/m$ from Lemma A.3.

Lemma B.14 gives

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})$
	$\displaystyle\quad=O\quantity(\frac{1}{1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\kappa^{-1}\rho^{-2}}\frac{1+\rho^{-2}}{1+\kappa^{-1}\rho^{-2}}\\|\sin\Theta(U,U^{\star})\\|_{2}\\|w^{\star}\\|^{2}).$

On the event where $\|\sin\Theta(U,U^{\star})\|_{2}^{2}<1/2$ ,

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})=O\quantity(\frac{1+\rho^{-2}}{(1+\kappa^{-1}\rho^{-2})^{2}}\|\sin\Theta(U,U^{\star})\|_{2}\|w^{\star}\|^{2}).

On the event where $\|\sin\Theta(U,U^{\star})\|_{2}^{2}\geq 1/2$ , we utilize the trivial upper bound

\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})

\displaystyle\leq\|(I+\nu^{-2}U^{\star\top}\Sigma U^{\star})^{-1}\|_{2}\|w^{\star}\|^{2}\leq\frac{\nu^{2}}{\nu^{2}+\sigma_{(d)}^{2}}\|w^{\star}\|^{2}.

Combining these results, we have

	$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})]$
	$\displaystyle\quad\lesssim\frac{1+\rho^{-2}}{(1+\kappa^{-1}\rho^{-2})^{2}}\mathbb{E}_{\mathcal{D}}[\\|\sin\Theta(U,U^{\star})\\|_{2}]\\|w^{\star}\\|^{2}$
	$\displaystyle\quad\quad+\frac{1}{1+\kappa^{-1}\rho^{-2}}\\|w^{\star}\\|^{2}\mathbb{P}_{\mathcal{D}}(\\|\sin\Theta(U,U^{\star})\\|_{2}\geq 1/\sqrt{2})$
	$\displaystyle\quad\lesssim\frac{1+\rho^{-2}}{(1+\kappa^{-1}\rho^{-2})^{2}}\mathbb{E}_{\mathcal{D}}[\\|\sin\Theta(U,U^{\star})\\|_{2}]\\|w^{\star}\\|^{2},$

where the last inequality follows by Markov’s inequality.

C Discussion about Autoencoders and random masking augmentation

In the following, we show that our results do not change if we applied the same augmentation (2.2) for autoencoders. As discussed in Section 3.1, we can ignore the bias term in autoencoders for simplicity, which only serves as centralization of the data matrix. In that case, we applied random augmentation $g_{1}(x)=Ax$ and $g_{2}(x)=(I-A)x$ to the original data $\{x_{i}\}_{i=1}^{n}$ , and the optimization problem can be formulated as follows:

\displaystyle\min_{W_{AE},W_{DE}}\frac{1}{2n}\mathbb{E}_{A}[\|AX-W_{DE}W_{AE}AX\|_{F}^{2}+\|(I-A)X-W_{DE}W_{AE}(I-A)X\|_{F}^{2}].

(67)

Then, similar to Theorem 3.1 for contrastive learning, we can also obtain an explicit solution for this optimization problem.

Theorem C.1

The optimal solution of autoencoders with random masking augmentation (67) is given by:

W_{AE}=W_{DE}^{\top}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

\frac{1}{2}\Delta(XX^{\top})+D(XX^{\top}),

(68)

$u_{i}$ is the corresponding eigenvector and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

Proof We first derive the equivalent form for this objective function:

	$\displaystyle\frac{1}{2n}\mathbb{E}_{A}[\\|AX-W_{DE}W_{AE}AX\\|_{F}^{2}+\\|(I-A)X-W_{DE}W_{AE}(I-A)X\\|_{F}^{2}]$	(69)
$\displaystyle=$	$\displaystyle\frac{1}{2n}\mathbb{E}_{A}[\tr(X^{\top}A^{\top}AX)+\tr(X^{\top}A^{\top}W_{DE}W_{AE}AX)+\tr(X^{\top}A^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE}AX)$
	$\displaystyle+\tr(X^{\top}(I-A)^{\top}(I-A)X)+\tr(X^{\top}(I-A)^{\top}W_{DE}W_{AE}(I-A)X)$
	$\displaystyle+\tr(X^{\top}(I-A)^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE}(I-A)X)]$
$\displaystyle=$	$\displaystyle\frac{1}{2n}\mathbb{E}_{A}[\tr(X^{\top}AX)+\tr(AXX^{\top}A^{\top}W_{DE}W_{AE})+\tr(AXX^{\top}A^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE})$
	$\displaystyle+\tr(X^{\top}(I-A)X)+\tr((I-A)XX^{\top}(I-A)^{\top}W_{DE}W_{AE})$
	$\displaystyle+\tr((I-A)XX^{\top}(I-A)^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE})]$
$\displaystyle=$	$\displaystyle\frac{1}{2n}\mathbb{E}_{A}[\tr(X^{\top}X)+\tr(\hat{M}W_{DE}W_{AE})+\tr(\hat{M}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE})],$

where $\hat{M}:=AXX^{\top}A^{\top}+(I-A)XX^{\top}(I-A)^{\top}$ . Note that by Definition 2.2 we have $A=\operatorname{diag}(a_{1},\cdots,a_{d})$ and $a_{i}$ follows the Bernoulli distribution, so we have:

\mathbb{E}_{A}\hat{M}=\frac{1}{2}\Delta(XX^{\top})+D(XX^{\top})

(70)

Again, by Theorem 2.4.8 in Golub and Loan (1996), the optimal solution of Eq.(67) is given by the eigenvalue decomposition of $\mathbb{E}_{A}\hat{M}=\frac{1}{2}\Delta(XX^{\top})+D(XX^{T})$ , up to an orthogonal transformation, which finishes the proof.
With Theorem C.1 established, we can now derive the space distance for autoencoders with random masking augmentation.

Theorem C.2

Consider the spiked covariance model Eq.(5), under Assumptions 3.4-3.6 and $n>d\gg r$ , let $W_{AE}$ be the learned representation of augmented autoencoder with singular value decomposition $W_{AE}=(U_{AE}\Sigma_{AE}V_{AE}^{\top})^{\top}$ (i.e., the optimal solution of optimization problem 67). If we further assume $\{\sigma_{i}^{2}\}_{i=1}^{d}$ are different from each other and $\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma}$ for some universal constant $C_{\sigma}$ . Then there exist two universal constants $C_{\rho}>0,c\in(0,1)$ , such that when $\rho<C_{\rho}$ , we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{AE}\right)\right\|_{F}\geq c\sqrt{r}.

(71)

Proof Step1, similar to the proof of Theorem B.7, we first bound the difference between $\hat{M}:=\Delta(XX^{\top})+2D(XX^{\top})$ and $\Sigma:=\mathrm{Cov}(\xi\xi^{\top})$ . Note that:

\|\hat{M}-\Sigma\|_{2}=\|XX^{\top}-\Sigma-\frac{1}{2}\Delta(XX^{\top})\|_{2}\leq\|XX^{\top}-\Sigma\|_{2}+\frac{1}{2}\|\Delta(XX^{\top}-\Sigma)\|_{2}+\frac{1}{2}\|\Delta(\Sigma)\|_{2}

(72)

Since $\Sigma$ is a diagonal matrix, then by Lemma B.8 we have:

\|\hat{M}-\Sigma\|_{2}\leq 2\|XX^{\top}-\Sigma\|_{2}

(73)

Now, directly apply equation (41)(42)(43) we can obtain that:

\mathbb{E}\|\hat{M}-\Sigma\|_{2}\lesssim\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.

(74)

Step 2, bound the $\sin\Theta$ distance between eigenspaces. As we have shown in step 1, the target matrix of the autoencoder is close to the covariance matrix of random noise, i.e., $\Sigma$ . Note that $\Sigma$ is assumed to be a diagonal matrix with different elements, hence its eigenspace only consists of canonical basis $e_{i}$ . Denote $U_{\Sigma}$ to be the top- $r$ eigenspace of $\Sigma$ and $\{e_{i}\}_{i\in C}$ to be its corresponding basis vectors, apply the Davis-Kahan Theorem E.1 we can conclude that:

		$\displaystyle\mathbb{E}\\|\sin\Theta(U_{AE},U_{\Sigma})\\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\\|\hat{M}-\Sigma\\|_{2}}{\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2}}$
	$\displaystyle\lesssim$	$\displaystyle\sqrt{r}\frac{1}{\sigma_{(1)}^{2}}\quantity(\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu)$
	$\displaystyle\lesssim$	$\displaystyle\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}).$

Step 3, obtain the final result by triangular inequality. By Assumption 3.6 we know that the distance between canonical basis and the eigenspace of core features can be large:

	$\displaystyle\\|\sin\Theta(U^{\star},U_{\Sigma})\\|_{F}^{2}$	$\displaystyle=\\|U_{\Sigma\perp}^{\top}U^{\star}\\|_{F}^{2}=\sum_{i\in[d]/C}\\|e_{i}^{\top}U^{\star}\\|^{2}=\\|U^{\star}\\|_{F}^{2}-\sum_{i\in C}\\|e_{i}^{\top}U^{\star}\\|^{2}$		(75)
		$\displaystyle\geq r-rI(U^{\star})=r-O\quantity(\frac{r^{2}}{d}\log d).$		(75)

Then apply the triangular inequality of $\sin\Theta$ distance (Proposition A.5) we can obtain the lower bound of the autoencoder.

	$\displaystyle\mathbb{E}\\|\sin\Theta(U_{AE},U^{\star})\\|_{F}$	$\displaystyle\geq\mathbb{E}\\|\sin\Theta(U^{\star},U_{\Sigma})\\|_{F}-\mathbb{E}\\|\sin\Theta(U_{AE},U_{\Sigma})\\|_{F}$
		$\displaystyle\geq\sqrt{r}-O\quantity(\frac{r}{\sqrt{d}}\sqrt{\log d})-O\quantity(\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}})).$

\mathbb{E}\|\sin\Theta(U_{AE},U^{\star})\|_{F}\geq c\sqrt{r}.

Compared with Theorem 3.9, we can find that random masking augmentation makes no difference to autoencoders, which justifies the fairness of our comparison between contrastive learning and autoencoders.

However, contrary to the autoencoders with random-masking augmentation, we show that the representations obtained by DAEs behave as the representations obtained by contrastive learning.

Proof [Proof of Remark 3.12] Let $\mathcal{L}:=(1/n)\|X-W^{\top}WAX\|_{F}^{2}$ be the loss function of DAEs. Then,

\displaystyle\mathbb{E}_{A}\mathcal{L}=\frac{1}{n}\tr(W^{\top}W\quantity(\frac{1}{2}D(XX^{\top})+\frac{1}{4}\Delta(XX^{\top}))W^{\top}W-W^{\top}WXX^{\top})+(\text{const}.).

(76)

We minimize the loss over $W$ such that $WW^{\top}=2I_{r}$ . Then, the loss becomes

	$\displaystyle\operatorname*{arg\,min}_{WW^{\top}=2I_{r}}E_{A}\mathcal{L}$	$\displaystyle=\operatorname*{arg\,min}_{WW^{\top}=2I_{r}}\frac{1}{n}\tr(W\quantity(D(XX^{\top})+\frac{1}{2}\Delta(XX^{\top}))W^{\top}-WXX^{\top}W^{\top})$
		$\displaystyle=\operatorname*{arg\,max}_{WW^{\top}=2I_{r}}\tr(W\frac{1}{n}\Delta(XX^{\top})W^{\top}).$

Thus, the solution to the (expected) loss minimization problem is the top- $r$ eigenvectors of $\Delta(n^{-1}XX^{\top})$ , i.e., $W^{\top}=\sqrt{2}OP_{r}(\Delta(n^{-1}XX^{\top}))$ , where $O$ is any orthogonal matrix from $\mathcal{O}_{r,r}$ . We use the same argument as in the proof of Theorem B.9. First note that

\displaystyle\frac{1}{n}\Delta(XX^{\top})=\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})+\frac{1}{n}\Delta(U^{\star}ZE+EZ^{\top}U^{\star\top})+\frac{1}{n}\Delta(EE^{\top}).

By Lemmas B.8, E.3 and the incoherent condition $I(U)=O(\frac{r}{d}\log d)$ , we have:

	$\displaystyle\mathbb{E}\norm{\Delta\quantity(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top})-\nu^{2}U^{\star}U^{\star\top}}_{2}$
	$\displaystyle\leq 2\mathbb{E}\norm{\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-\nu^{2}U^{\star}U^{\star\top}}_{2}+\mathbb{E}\norm{\Delta\quantity(\nu^{2}U^{\star}U^{\star\top})-\nu^{2}U^{\star}U^{\star\top})}_{2}$
	$\displaystyle\lesssim 2(\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}+\frac{r}{d}\log d\nu^{2}.$		(77)

For the second term, applying equation (42) yields:

\displaystyle\frac{1}{n}\mathbb{E}\|\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})\|_{2}\leq\frac{4}{n}\mathbb{E}\|EZ^{\top}U^{\star\top}\|_{2}\lesssim\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu.

(78)

For the third term, applying equation (43) yields:

\mathbb{E}\|\frac{1}{n}\Delta(EE^{\top})\|_{2}=\mathbb{E}\|\Delta(\frac{1}{n}EE^{\top}-\Sigma)\|_{2}\leq 2\|\frac{1}{n}EE^{\top}-\Sigma\|_{2}\lesssim(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}.

(79)

Combining equations (77)(78)(79) gives

\mathbb{E}\norm{\Delta\quantity(\frac{1}{n}XX^{\top}-\nu^{2}U^{\star}U^{\star\top})}_{2}\lesssim\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}.

From Lemma E.1, we obtain the desired bound:

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{DAE}},U^{\star})\|_{F}\leq\frac{2\sqrt{r}}{\nu^{2}}\mathbb{E}\norm{\Delta\quantity(\frac{1}{n}XX^{\top}-\nu^{2}U^{\star}U^{\star\top})}_{2}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

Here we provide some experimental results about DAEs on synthetic datasets as analog to Figure 1 and 2, the settings are the same as described in Section 5.1. The results are summarized in Figure 3, as we can observe, the performance of DAEs is comparable with contrastive learning, which aligns with our theoretical results above.

D Omitted proofs for Section 4

D.1 Proofs for Section 4.1

In this section, we will provide the proof of a generalized version of Theorem 4.2 to cover the imbalanced setting, the statement and the detailed proof can be found in Theorem D.2.

In the main body, we assume the unlabeled data and labeled data are both balanced for the sake of clarity and simplicity. Now we allow them to be imbalanced and provide a more general analysis. Suppose we have $n$ unlabeled data $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ and $n_{k}$ labeled data $X_{k}=[x_{k}^{1},\cdots,x_{k}^{n_{k}}]\in\mathbb{R}^{d\times n_{k}}$ for class $k$ , the contrastive learning task can be formulated as:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)+\mathcal{L}_{\text{SupCon}}(W;\alpha).

(80)

In addition, we write a generalized version of the supervised contrastive loss function to cover the imbalanced cases:

\mathcal{L}_{\text{SupCon}}(W;\alpha)=-\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}\sum_{i=1}^{n_{k}}[\sum_{j\neq i}\frac{\langle Wx_{i}^{k},Wx_{j}^{k}\rangle}{n_{k}-1}-\frac{\sum_{j=1}^{n}\sum_{s\neq k}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle}{\sum_{s\neq k}n_{s}}]+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2},

(81)

where $\alpha_{k}>0$ is the weight for supervised loss of class $k$ . Again we first provide a theorem to give the optimal solution to the contrastive learning problem.

Theorem D.1

The optimal solution of the supervised contrastive learning problem (80) is given by :

W_{\text{CL}}=C\quantity(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top})^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

	$\displaystyle\frac{1}{4n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})$
	$\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}\quantity[\frac{1}{n_{k}-1}X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}-\frac{1}{\sum_{t\neq k}n_{t}}X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}],$

$u_{i}$ is the corresponding eigenvector and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

Proof Under this setting, combined with the result obtained in Corollary 3.2, the contrastive loss can be rewritten as:

	$\displaystyle\mathcal{L}(W)=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{2n}\tr(\quantity(\frac{1}{2}\Delta(XX^{\top})-\frac{1}{2(n-1)}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)$
		$\displaystyle-\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\quantity[\frac{1}{n_{k}-1}\sum_{j\neq i}\langle Wx_{i}^{k},Wx_{j}^{k}\rangle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle].$

Then we deal with the last term independently, note that:

		$\displaystyle\sum_{i=1}^{n_{k}}\quantity[\frac{1}{n_{k}-1}\sum_{j\neq i}\langle Wx_{i}^{k},Wx_{j}^{k}\rangle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle]$
	$\displaystyle=$	$\displaystyle\frac{1}{n_{k}-1}\sum_{i=1}^{n_{k}}\sum_{j\neq i}\langle Wx_{i}^{k},Wx_{j}^{k}\rangle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{i=1}^{n_{k}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle$
	$\displaystyle=$	$\displaystyle\frac{1}{n_{k}-1}\tr(X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}W^{\top}W)-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\tr(X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}W^{\top}W).$

Thus we have:

	$\displaystyle\mathcal{L}(W)=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{4n}\tr((\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)$
		$\displaystyle-\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}[\frac{1}{n_{k}-1}\tr(X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}W^{\top}W)$
		$\displaystyle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\tr(X_{k}1_{k}1_{s}^{\top}XW^{\top}W)].$

Then by a similar argument as in the proof of Proposition 3.1, we can conclude that the optimal solution $W_{\text{CL}}$ must satisfy the desired conditions.
With the optimal solution obtained in Theorem D.1, we can provide a generalized version of Theorem 4.2 to cover the imbalance cases.

Theorem D.2 (Generalized version of Theorem 4.2)

If Assumptions 3.4-3.6 hold, $n>d\gg r$ and let $W_{\text{CL}}$ be any solution that minimizes the supervised contrastive learning problem in Equation (80), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have

	$\displaystyle\mathbb{E}\\|\sin\Theta(U_{\text{CL}},U)\\|_{F}\lesssim$	$\displaystyle\frac{\nu^{2}}{\lambda_{r}(T)}\biggl{(}\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}$
		$\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\biggl{[}\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}(\sqrt{\frac{d}{n_{k}}}+\sqrt{r})+\sqrt{\frac{dr}{n_{k}}}\biggr{]}\biggr{)},$

where $T\triangleq\frac{1}{4}\sum_{k=1}^{r+1}p_{i}\mu^{k}\mu^{k\top}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}(\mu^{k}\mu^{k\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top}))$ .

Proof [Proof of Theorem D.2] For labeled data $X=[x_{1},\cdots,x_{n}]$ , we write it to be $X=M+E$ , where $M=[\mu_{1},\cdots,\mu_{n}]$ and $E=[\xi_{1},\cdots,\xi_{n}]$ are two matrices consisting of class mean and random noise. To be more specific, if $x_{i}$ subject to the $k$ -th cluster, then $\mu_{i}=\mu^{k}$ and $\xi_{i}\sim\mathcal{N}(0,\Sigma^{k})$ . Since the data is randomly drawn from each class, $\mu_{i}$ follows the multinomial distribution over $\mu^{1},\cdots,\mu^{r}$ with probability $p_{1},\cdots,p_{r+1}$ . Thus $\mu_{i}$ follows a subgaussian distribution with covariance matrix $N=\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}$ .

As shown in Theorem D.1, the optimal solution of contrastive learning is equivalent to PCA of the following matrix:

	$\displaystyle\hat{T}\triangleq$	$\displaystyle\frac{1}{4n}(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})$
		$\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}[\frac{1}{n_{k}-1}X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}$
		$\displaystyle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\frac{1}{2}(X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}+X_{s}1_{s}1_{k}^{\top}X_{k}^{\top})].$

Again we will deal with these terms separately,

For the first term, as we have discussed, $X$ can be divided into two matrices $M$ and $E$ , each of them consisting of sub-gaussian columns. Again we can obtain the result as in (56) (the proof is totally the same):

\mathbb{E}\|\frac{1}{n}(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})-N\|_{2}\lesssim\nu^{2}(\frac{r}{d}\log d+\sqrt{\frac{r}{n}})+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}.

(82)

For the second term, notice that:

		$\displaystyle X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}=\sum_{i=1}^{n_{k}}\sum_{j\neq i}(\mu^{k}+\xi_{i}^{k})(\mu^{k}+\xi_{j}^{k})^{\top}$		(83)
	$\displaystyle=$	$\displaystyle n_{k}(n_{k}-1)\mu^{k}\mu^{k\top}+(n_{k}-1)\mu^{k}(\sum_{i=1}^{n_{k}}\xi_{i}^{k})^{\top}+(n_{k}-1)(\sum_{i=1}^{n_{k}}\xi_{i}^{k})\mu^{k\top}+\sum_{i=1}^{n_{k}}\sum_{j\neq i}\xi_{i}^{k}\xi_{j}^{kT},$		(83)

and that:

		$\displaystyle\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}=\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{i=1}^{n_{k}}(\mu^{k}+\xi_{i}^{k})\sum_{j=1}^{n_{s}}(\mu^{s}+\xi_{j}^{s})^{\top}$		(84)
	$\displaystyle=$	$\displaystyle\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}[n_{k}n_{s}\mu^{k}\mu^{s\top}+n_{k}\mu^{k}(\sum_{j=1}^{n_{s}}\xi_{j}^{s})^{\top}+n_{s}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\mu^{s\top}+\sum_{i=1}^{n_{k}}\xi_{i}^{k}\sum_{j=1}^{n_{s}}\xi_{j}^{sT}].$		(84)

Since $\xi_{i}^{k}\sim\mathcal{N}(0,\Sigma^{k})$ , we can conclude that:

\mathbb{E}\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\|_{2}\leq\sqrt{\mathbb{E}\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\|_{2}^{2}}=\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}.

(85)

Moreover, we have

	$\displaystyle\frac{1}{n_{k}(n_{k}-1)}\mathbb{E}\\|\sum_{i=1}^{n_{k}}\sum_{j\neq i}\xi_{i}^{k}\xi_{j}^{kT}\\|_{2}$	$\displaystyle\leq\frac{1}{n_{k}(n_{k}-1)}\mathbb{E}\\|E_{k}E_{k}^{\top}\\|_{2}+\frac{n_{k}}{n_{k}-1}\mathbb{E}\\|\bar{\xi^{k}}\bar{\xi^{k}}^{\top}\\|_{2}$		(86)
		$\displaystyle\lesssim\frac{d}{n_{k}}\sigma_{(1)}^{2}.$		(86)

Take equation (85) and (86) back into (LABEL:sup_mat_1) we can conclude:

\mathbb{E}\|\frac{1}{n_{k}(n_{k}-1)}X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}-\mu^{k}\mu^{k\top}\|_{2}\lesssim\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}\sqrt{r}\nu+\frac{d}{n_{k}}\sigma_{(1)}^{2}.

(87)

On the other hand, by equation (85) we know:

\mathbb{E}\|\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\xi_{j}^{s}\|_{2}\leq\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\mathbb{E}\|\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\xi_{i}^{s}\|_{2}\lesssim\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\sqrt{\frac{d}{n_{s}}}\sigma_{(1)}.

(88)

Notice that:

		$\displaystyle\mathbb{E}\\|\frac{1}{\sum_{t\neq k}n_{t}}\frac{1}{n_{k}}\sum_{s\neq k}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\sum_{j=1}^{n_{s}}\xi_{j}^{sT}\\|_{2}\leq\mathbb{E}\\|\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\bar{\xi^{k}}\bar{\xi^{s}}^{\top}\\|_{2}$		(89)
	$\displaystyle\leq$	$\displaystyle\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\mathbb{E}\\|\bar{\xi^{k}}\bar{\xi^{s}}^{\top}\\|_{2}\lesssim\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{d}{\sqrt{n_{k}n_{s}}}\sigma_{(1)}^{2}.$		(89)

Thus take equations (88) and (LABEL:sup_mat_eq_4) back into equation (LABEL:sup_mat_2) we have:

		$\displaystyle\mathbb{E}\\|\frac{1}{n_{k}}\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\mu^{k}\mu^{s\top}\\|_{2}$		(90)
		$\displaystyle\lesssim\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}(\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}^{2}+\sigma_{(1)}\sqrt{r}\nu).$		(91)

Then combine equations (82)(87)(90) together, we can obtain the following result:

		$\displaystyle\mathbb{E}\\|\hat{T}-\frac{1}{4}N-\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}(\mu^{k}\mu^{k\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top}))\\|_{2}$
	$\displaystyle\lesssim$	$\displaystyle\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}})+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}$
		$\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\quantity[\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}\quantity(\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}^{2}+\sqrt{r}\sigma_{(1)}\nu)+\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}\sqrt{r}\nu+\frac{d}{n_{k}}\sigma_{(1)}^{2}].$

Since we have assumed that $\operatorname{rank}(\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top})=r$ we can find that the top- $r$ eigenspace of matrix:

T=\frac{1}{4}\sum_{k=1}^{r+1}p_{i}\mu^{k}\mu^{k\top}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\quantity(\mu^{k}\mu^{k\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu_{k\top}))

is spanned by $U^{\star}$ , then apply Lemma E.1 again we have:

		$\displaystyle\mathbb{E}\\|\sin\Theta(U_{\text{SCL}},U)\\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\\|\hat{N}-N\\|_{2}}{\lambda_{r}(N)}$
	$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{r}}{\lambda_{r}(T)}\Biggl{[}\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}})+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}$
		$\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\Biggl{[}\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}\quantity(\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}^{2}+\sqrt{r}\sigma_{(1)}\nu)+\sqrt{\frac{d}{n_{k}}}\sqrt{r}\sigma_{(1)}\nu+\frac{d}{n_{k}}\sigma_{(1)}^{2}\Biggr{]}\Biggr{]}$
	$\displaystyle\lesssim$	$\displaystyle\frac{\nu^{2}}{\lambda_{r}(T)}\quantity(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\quantity[\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}\quantity(\sqrt{\frac{d}{n_{k}}}+\sqrt{r})+\sqrt{\frac{dr}{n_{k}}}]).$

Now we use this result to derive Theorem 4.2. Since $\|\mu^{k}\|=O(\sqrt{r}\nu)$ and $\sum_{k=1}^{r+1}p_{k}\mu^{k}=0$ , approximately we have $\frac{\nu^{2}}{\lambda_{r}(N)}\approx\frac{1}{\min_{k\in[r]}[1+\alpha_{k}]}$ . Although we can not obtain the closed-form eigenvalue in general, in a special case, where $\alpha=\alpha_{1}=\cdots=\alpha_{r+1}$ , $m=n_{1}=n_{2}=\cdots=n_{r+1}$ and $\frac{1}{r+1}=p_{1}=p_{2}=\cdots=p_{r+1}$ , it is easy to find that:

\sum_{s\neq k}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top})=-\mu^{k}\mu^{k\top},

which further implies that:

T=\frac{1}{4}\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha(1+\frac{1}{r})\mu^{k}\mu^{k\top},\quad\lambda_{r}(T)=[\frac{1}{4}+\alpha(1+\frac{1}{r})]\lambda(N).

and we can obtain the result in Theorem 4.2.

D.2 Proofs for Section 4.2

In this section, we will provide the proof of the generalized version of Theorems 4.5 and 4.8 to cover the imbalanced setting, the statement and detailed proof can be found in Theorems D.6 and D.8. With the two generalized theorems proven, Theorems 4.5, 4.6, 4.8, 4.9 holds immediately.

First, we prove a useful lemma to illustrate that the supervised loss function only yields estimation along a 1-dimensional space. Consider a single source task, where the data $x=U^{\star}z+\xi$ is generated by the spiked covariance model and the label is generated by

y=\langle w^{\star},z\rangle/\nu

suppose we have collect $n$ labeled data from this task, denote the data as $X=[x_{1},x_{2},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ and the label $y=[y_{1},y_{2},\cdots,y_{n}]\in\mathbb{R}^{n}$ , then we have the following result.

Lemma D.3

Under the conditions similar to Theorem 3.10, we can find an event $A$ such that $\mathbb{P}(A^{C})=O(\sqrt{d/n})$ and:

\mathbb{E}\quantity[\norm{\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}}_{F}\mathbb{I}\{A\}]\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.

(92)

The proof strategy is to estimate the difference between the two rank-1 matrices via bounding the difference of the corresponding vector component. We first provide a simple lemma to illustrate the technique:

Lemma D.4

Suppose $\alpha,\beta\in\mathbb{R}^{d}$ are two vectors, then we have:

\|\alpha\alpha^{\top}-\beta\beta^{\top}\|_{F}\leq\sqrt{2}(\|\alpha\|_{2}+\|\beta\|_{2})\|\alpha-\beta\|_{2}.

Proof Denote $\alpha=(\alpha_{1},\cdots,\alpha_{d}),\beta=(\beta_{1},\cdots,\beta_{d})$ , then we have:

		$\displaystyle\\|\alpha\alpha^{\top}-\beta\beta^{\top}\\|_{F}^{2}\leq\sum_{i=1}^{d}\sum_{j=1}^{d}\|\alpha_{i}\alpha_{j}-\beta_{i}\beta_{j}\|^{2}\leq 2\sum_{i=1}^{d}\sum_{j=1}^{d}\|\alpha_{i}\alpha_{j}-\alpha_{i}\beta_{j}\|^{2}+\|\alpha_{i}\beta_{j}-\beta_{i}\beta_{j}\|^{2}$
	$\displaystyle\leq$	$\displaystyle 2\sum_{i=1}^{d}\sum_{j=1}^{d}\|\alpha_{i}\|^{2}\|\alpha_{j}-\beta_{j}\|^{2}+\|\beta_{j}\|^{2}\|\alpha_{i}-\beta_{i}\|^{2}\leq 2(\\|\alpha\\|_{2}^{2}+\\|\beta\\|_{2}^{2})\\|\alpha-\beta\\|_{2}^{2}$
	$\displaystyle\leq$	$\displaystyle 2(\\|\alpha\\|_{2}+\\|\beta\\|_{2})^{2}\\|\alpha-\beta\\|_{2}^{2}.$

Take square root on both sides we can finish the proof.
Now we can prove the Lemma D.3.

Proof [Proof of Lemma D.3] Clearly, we have:

	$\displaystyle\\|\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}$	(93)
$\displaystyle\leq$	$\displaystyle\frac{n^{2}}{(n-1)^{2}}\\|\frac{1}{n^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}+\frac{2n+1}{(n-1)^{2}}\\|\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}$
$\displaystyle\lesssim$	$\displaystyle\\|\frac{1}{n^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}+\frac{r}{n}\nu^{2},$

thus we can replace the $\frac{1}{(n-1)^{2}}$ with $\frac{1}{n}$ in equation (92) and conclude the proof. Denote $\hat{N}\triangleq\frac{1}{n^{2}}XHyy^{\top}HX^{\top}$ , note that both of $\hat{N}$ and $Uw^{\star}w^{\star\top}U^{\top}$ are rank-1 matrices. We first bound the difference between $\frac{1}{n}XHy$ and $Uw^{\star}$ :

$\displaystyle\\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\\|=$	$\displaystyle\\|\frac{1}{n\nu}(U^{\star}Z+E)HZ^{\top}w^{\star}-\nu U^{\star}w^{\star}\\|$	(94)
$\displaystyle\leq$	$\displaystyle\\|\frac{1}{n\nu}(U^{\star}Z+E)HZ^{\top}-\nu U^{\star}\\|_{2}$
$\displaystyle\leq$	$\displaystyle\frac{1}{\nu}(\\|\frac{1}{n}U^{\star}ZZ^{\top}-\nu^{2}U^{\star}\\|_{2}+\frac{1}{n}\\|EZ^{\top}\\|_{2}+\frac{1}{n}\\|U^{\star}Z\bar{Z}^{\top}\\|_{2}+\frac{1}{n}\\|E\bar{Z}^{\top}\\|_{2}).$

We deal with the four terms in (94) separately:

For the first term, apply Lemma E.3 we have:

\mathbb{E}\|\frac{1}{n}U^{\star}ZZ^{\top}-\nu^{2}U^{\star}\|_{2}\leq\mathbb{E}\|\frac{1}{n}ZZ^{\top}-\nu^{2}I_{r}\|_{2}\leq\quantity(\frac{r}{n}+\sqrt{\frac{r}{n}})\nu^{2}.

(95)

For the second term, apply Lemma E.2 twice we have:

$\displaystyle\frac{1}{n}\mathbb{E}\\|EZ^{\top}\\|_{2}=$	$\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\mathbb{E}_{E}[\\|EZ^{\top}\\|_{2}\|Z]]$	(96)
$\displaystyle\lesssim$	$\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\\|Z\\|_{2}(\sigma_{\text{sum}}+r^{1/4}\sqrt{\sigma_{\text{sum}}\sigma_{(1)}}+\sqrt{r}\sigma_{(1)})]$
$\displaystyle\lesssim$	$\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\\|Z\\|_{2}]\sqrt{d}\sigma_{(1)}$
$\displaystyle\lesssim$	$\displaystyle\frac{1}{n}\sqrt{d}\sigma_{(1)}(r^{1/2}\nu+(nr)^{1/4}\nu+n^{1/2}\nu)$
$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu.$

For the third term and fourth term, from equation (LABEL:PCA_step1_term4) we know:

\mathbb{E}\frac{1}{n}\|U^{\star}Z\bar{Z}^{\top}\|_{2}+\mathbb{E}\frac{1}{n}\|E\bar{Z}^{\top}\|_{2}\leq\mathbb{E}\|\bar{z}\bar{z}^{\top}\|_{2}+\mathbb{E}\|\bar{\xi}\bar{z}^{\top}\|_{2}\leq\frac{r}{n}\nu^{2}+\sqrt{\frac{d}{n}}\nu\sigma_{(1)}.

(97)

Combine these three equations (95)(96)(97) together we have:

\mathbb{E}\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}.

(98)

With equation (98), we can now turn to the difference between $\hat{N}$ and $Uw^{\star}w^{\star\top}U^{\top}$ . By Lemma D.4 we know that:

\|\hat{N}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}\lesssim(\|\frac{1}{n}XHy\|+\|\nu U^{\star}w^{\star}\|)\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}|\|.

Using Markov’s inequality, we can conclude from (98) that:

\mathbb{P}(\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|\geq\nu)\leq\frac{\mathbb{E}\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|}{\nu}\lesssim\sqrt{\frac{d}{n}}.

Then denote $A=\{\omega:\|\frac{1}{n}XHy-\nu^{2}U^{\star}w^{\star}\|_{2}<\nu\}$ we have:

	$\displaystyle\mathbb{E}\\|\hat{N}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}\mathbb{I}\{A\}\lesssim$	$\displaystyle\mathbb{E}(\\|\frac{1}{n}XHy\\|+\\|\nu U^{\star}w^{\star}\\|)\\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|\\|\mathbb{I}\{A\}$
	$\displaystyle\lesssim$	$\displaystyle\nu(\mathbb{E}\\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\\|)\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.$

which finished the proof.
In the main body, we assume the number of labeled data and the ratio of the loss function is both balanced. Now we will provide a more general result to cover the imbalance occasions. Formally, suppose we have $n$ unlabeled data $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ and $n_{i}$ labeled data $\mathcal{S}_{i}$ $X_{i}=[x_{i}^{1},\cdots,x_{i}^{n_{i}}],y_{i}=[y_{i}^{1},\cdots,y_{i}^{n_{1}}],\forall i=1,\cdots T$ for source task , we learn the linear representation via joint optimization:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)-\sum_{t=1}^{T}\alpha_{i}\operatorname{HSIC}(\hat{X}^{t},y^{t};W),

(99)

To investigate its feature recovery ability, we first give the following result.

Theorem D.5

For the optimization problem (99), if we apply augmented pairs generation in Definition 2.1 with random masking augmentation 2.2 for unlabeled data, then the optimal solution is given by:

W_{\text{CL}}=C\quantity(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top})^{\top},

where $C>0$ is a constant, $\sigma_{i}$ is the $i$ -th largest eigenvalue of the following matrix:

\frac{1}{4n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})+\sum_{i=1}^{T}\frac{\alpha_{i}}{(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top}),

$u_{i}$ is the corresponding eigenvector, $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthogonal matrix and $H_{n_{i}}=I_{n_{i}}-\frac{1}{n_{i}}1_{n_{i}}1_{n_{i}}^{\top}$ is the centering matrix.

Proof Under this setting, combined with the result obtained in Corollary 3.2, the loss function can be rewritten as:

	$\displaystyle\mathcal{L}(W)=$	$\displaystyle\frac{\lambda}{2}\\|WW^{\top}\\|_{F}^{2}-\frac{1}{2n}\tr(\quantity(\frac{1}{2}\Delta(XX^{\top})-\frac{1}{2(n-1)}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)$
		$\displaystyle-\sum_{t=1}^{T}\alpha_{i}\frac{1}{(n_{i}-1)^{2}}\tr(X_{i}^{\top}W^{\top}WX_{i}Hy_{i}y_{i}^{\top}H)$
	$\displaystyle=$	$\displaystyle\frac{\lambda}{2}\biggl{\\|}WW^{\top}-\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})$
		$\displaystyle-\sum_{i=1}^{T}\frac{\alpha_{i}}{\lambda(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top})\biggr{\\|}_{F}^{2}$
		$\displaystyle-\frac{\lambda}{2}\biggl{\\|}\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})$
		$\displaystyle+\sum_{i=1}^{T}\frac{\alpha_{i}}{\lambda(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top}\biggr{\\|}_{F}^{2}.$

Then by a similar argument as in the proof of Proposition 3.1, we can conclude that the optimal solution $W_{\text{CL}}$ must satisfy the desired conditions.
Then we can give the proofs of Theorem 4.5 and Theorem 4.8 under our generalized setting, one can easily obtain those under balanced setting by simply setting $\alpha=\alpha_{1}=\cdots=\alpha_{T}$ and $m=n_{1}=\cdots=n_{T}$ , which is consistent with Theorem 4.5 and Theorem 4.8 in the main body.

Theorem D.6 (Generalized version of Theorem 4.5)

In the regression setting where $y^{t}=\langle w_{t},z\rangle/\nu$ , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5) and $n>d\gg r$ , if we further assume that $T<r$ and $w_{t}$ ’s are orthogonal to each other, and let $W^{\text{CL}}$ be any solution that optimizes the problem in Equation (99), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have:

	$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}\lesssim$	$\displaystyle\quantity(\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}})\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})$
		$\displaystyle+\sum_{i=1}^{T}\quantity(\sqrt{r-T}\frac{\alpha_{i}+\min_{i\in[T]}\{\alpha_{i},1\}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\sqrt{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}})\sqrt{\frac{d}{n_{i}}}.$

Proof [Proof of Theorem D.6] As shown in Theorem D.5, optimizing loss function (99) is equivalent to find the top- $r$ eigenspace of matrix

\frac{1}{4n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})+\sum_{i=1}^{T}\frac{\alpha_{i}}{(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top}.

Again denote $\hat{M}_{2}\triangleq\frac{1}{n}(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})$ and $\hat{N}_{i}\triangleq\frac{1}{(n_{i}-1)^{2}}X_{i}Hy_{i}y_{i}^{\top}HX_{i}^{\top}$ . By equation (56) we know that:

\mathbb{E}\|\hat{M}_{2}-M\|_{2}\lesssim\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}.

By Theorem D.3 we know that for each task $\mathcal{S}_{i}$ , we can find an event $A_{i}$ such that $\mathbb{P}(A_{i})=O(\sqrt{\frac{d}{n}})$ :

\mathbb{E}\|\hat{N}_{i}-\nu^{2}U^{\star}w_{i}w_{i}^{\top}U^{\star\top}\|_{F}\mathbb{I}\{A_{i}\}\lesssim\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu.

The target matrix is $N=\nu^{2}U^{\star}U^{\star\top}+\sum_{i=1}^{T}\alpha_{i}\nu^{2}U^{\star}w_{i}w_{i}^{T}U^{\star\top}$ , and we can obtain the upper bound for the difference between $N$ and $\hat{N}$ :

		$\displaystyle\mathbb{E}\\|\hat{N}-N\\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}\leq\frac{1}{4}\mathbb{E}\\|\hat{M}_{2}-M\\|_{2}+\sum_{i=1}^{T}\alpha_{i}\mathbb{E}\\|\hat{N}_{i}-\nu^{2}U^{\star}w_{i}w_{i}^{\top}U^{\star\top}\\|_{F}\mathbb{I}\{A_{i}\}$		(100)
	$\displaystyle\lesssim$	$\displaystyle\nu^{2}(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\quantity[\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu].$		(100)

We divide the top- $r$ eigenspace $U_{\text{CL}}$ of $W_{\text{CL}}W_{\text{CL}}^{\top}$ into two parts: the top- $T$ eigenspace $U_{\text{CL}}^{(1)}$ and top- $(T+1)$ to top- $r$ eigenspace $U_{\text{CL}}^{(2)}$ . Similarly, we also divide the top- $r$ eigenspace $U^{\star}$ of $N$ into two parts: $U^{\star(1)}$ and $U^{\star(2)}$ . Then applying Lemma E.1 we can bound the sine distance for each part: on the one hand,

		$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\\|_{F}$
	$\displaystyle=$	$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\\|_{F}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}+\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\\|_{F}\mathbb{I}\{\cup_{i=1}^{T}A_{i}^{C}\}$
	$\displaystyle\leq$	$\displaystyle\frac{\sqrt{T}\mathbb{E}\\|\hat{N}-N\\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}}{\lambda_{(T)}(N)-\lambda_{(T+1)}(N)}+\sqrt{T}\mathbb{P}(\cup_{i=1}^{T}A_{i}^{C})$
	$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}\nu^{2}}\quantity(\nu^{2}\frac{r}{d}\log d+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu)+\sqrt{T}\sum_{i=1}^{T}\sqrt{\frac{d}{n_{i}}}$
	$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sqrt{T}\sum_{i=1}^{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}}\sqrt{\frac{d}{n_{i}}}.$

On the other hand,

		$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\\|_{F}$
		$\displaystyle=\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\\|_{F}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}+\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\\|_{F}\mathbb{I}\{\cup_{i=1}^{T}A_{i}^{C}\}$
	$\displaystyle\leq$	$\displaystyle\frac{\sqrt{r-T}\mathbb{E}\\|\hat{N}-N\\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}}{\min\{\lambda_{(T)}(N)-\lambda_{(T+1)}(N),\lambda_{(r)}(N)\}}+\sqrt{r-T}\mathbb{P}(\cup_{i=1}^{T}A_{i}^{C})$
	$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}\nu^{2}}\quantity(\nu^{2}\frac{r}{d}\log d+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu)+\sqrt{r-T}\sum_{i=1}^{T}\sqrt{\frac{d}{n_{i}}}$
	$\displaystyle\lesssim$	$\displaystyle\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sqrt{r-T}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{\min_{i\in[T]}\{\alpha_{i},1\}}+1)\sqrt{\frac{d}{n_{i}}}.$

Note that:

		$\displaystyle\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}^{2}$
		$\displaystyle=r-\\|U_{\text{CL}}^{\top}U^{\star}\\|_{F}^{2}$
		$\displaystyle\leq r-\\|U_{\text{CL}}^{(1)\top}U^{\star(1)}\\|_{F}^{2}-\\|U_{\text{CL}}^{(2)T}U^{\star(2)}\\|_{F}^{2}$
		$\displaystyle\leq T-\\|U_{\text{CL}}^{(1)\top}U^{\star(1)}\\|_{F}^{2}+(r-T)-\\|U_{\text{CL}}^{(2)\top}U^{\star(2)}\\|_{F}^{2}$
		$\displaystyle\leq\\|\sin\Theta(U_{\text{CL}}^{(1)},U^{\star(1)})\\|_{F}^{2}+\\|\sin\Theta(U_{\text{CL}}^{(1)},U^{\star(1)})\\|_{F}^{2},$

and the sine distance has trivial upper bounds:

\|\sin\Theta(U_{\text{CL}}^{(1)},U^{\star(1)})\|_{F}^{2}\leq T,\quad\|\sin\Theta(U_{\text{CL}}^{(2)},U^{\star(2)})\|_{F}^{2}\leq r-T

Thus we can conclude:

		$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}$
		$\displaystyle\leq\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\\|_{F}+\mathbb{E}\\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\\|_{F}$
		$\displaystyle\lesssim\quantity(\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sum_{i=1}^{T}\sqrt{r-T}\frac{\alpha_{i}+\min_{i\in[T]}\{\alpha_{i},1\}}{\min_{i\in[T]}\{\alpha_{i},1\}}\sqrt{\frac{d}{n_{i}}})\wedge\sqrt{r-T}$
		$\displaystyle\quad+\quantity(\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sum_{i=1}^{T}\sqrt{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}}\sqrt{\frac{d}{n_{i}}})\wedge\sqrt{T}.$

Theorem D.7 (Restatement of Theorem 4.6)

Suppose the conditions in Theorem 4.5 hold. Then,

	$\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]$		(101)
	$\displaystyle\quad\lesssim\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right).$		(102)

Proof [Proof of Theorem D.7] Theorem 4.6 follows directly from Lemma B.20 and Theorem 4.5.

Theorem D.8 (Generalized version of Theorem 4.8)

In the regression setting where $y^{t}=\langle w_{t},z\rangle/\nu$ , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5) and $n>d\gg r$ , if we further assume that $T\geq r$ and $\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top}$ is full rank, suppose $W^{\text{CL}}$ is the optimal solution of optimization problem Equation (99), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have:

	$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}\lesssim$	$\displaystyle\frac{\sqrt{r}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})$
		$\displaystyle+\sqrt{r}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}+1)\sqrt{\frac{d}{n_{i}}}.$

Proof [Proof of Theorem D.8] The proof strategy is similar to that of Theorem 4.5, here the difference is that each direction can be accurately estimated by the labeled data and we do not need to separate the eigenspace. Directly applying Lemma E.1 and equation (100) we have:

		$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}$
		$\displaystyle=\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}+\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}\mathbb{I}\{\cup_{i=1}^{T}A_{i}^{C}\}$
		$\displaystyle\lesssim\frac{\sqrt{r}\mathbb{E}\\|\hat{N}-N\\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}}{\lambda_{(r)}(N)}+\sqrt{r}\mathbb{P}(\cup_{i=1}^{T}A_{i}^{C})$
		$\displaystyle\lesssim\frac{\sqrt{r}}{\nu^{2}+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\nu^{2}\frac{r}{d}\log d+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu)+\sqrt{r}\sum_{i=1}^{T}\sqrt{\frac{d}{n_{i}}}$
		$\displaystyle\lesssim\frac{\sqrt{r}}{1+\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sqrt{r}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{1+\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}+1)\sqrt{\frac{d}{n_{i}}}.$

Theorem D.9 (Restatement of Theorem 4.9)

Suppose the conditions in Theorem 4.8 hold. Then,

\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\lesssim\frac{\sqrt{r}}{\alpha+1}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+T\sqrt{\frac{dr}{m}}.

(103)

Proof [Proof of Theorem 4.9] Theorem 4.9 follows directly from Lemma B.20 and Theorem 4.8.

Now we move to a binary classification setting, where labels $y$ are generated by $y=\operatorname{sign}(\langle w^{\star},z\rangle)$ instead of $y=\langle w^{\star},z\rangle/\nu$ in previous regression setting. We first give the corresponding generalized version of Theorem 4.10 and Theorem 4.11 to cover the general imbalanced settings.

Theorem D.10 (Generalized version of Theorem 4.10)

In the classification setting where $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle)$ , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5), $z$ follows a Gaussian distribution, and $n>d\gg r$ , if we further assume that $T<r$ and $w_{t}$ ’s are orthogonal to each other, and let $W^{\text{CL}}$ be any solution that optimizes the problem in Equation (99), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have:

	$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}\lesssim$	$\displaystyle\quantity(\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}})\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})$
		$\displaystyle+\sum_{i=1}^{T}\quantity(\sqrt{r-T}\frac{\alpha_{i}+\min_{i\in[T]}\{\alpha_{i},1\}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\sqrt{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}})\sqrt{\frac{d}{n_{i}}}.$

Theorem D.11 (Generalized version of Theorem 4.11)

In the classification setting where $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle)$ , suppose Assumptions 3.4-3.6 hold for spiked covariance model (5), $z$ follows a Gaussian distribution, and $n>d\gg r$ , if we further assume that $T\geq r$ and $\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top}$ is full rank, suppose $W^{\text{CL}}$ is the optimal solution of optimization problem Equation (99), and denote its singular value decomposition as $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$ , then we have:

	$\displaystyle\mathbb{E}\\|\sin(\Theta(U_{\text{CL}},U^{\star}))\\|_{F}\lesssim$	$\displaystyle\frac{\sqrt{r}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})$
		$\displaystyle+\sqrt{r}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}+1)\sqrt{\frac{d}{n_{i}}}.$

The only difference between these two settings is the distribution of labels $y$ . Thus to prove Theorem D.10 and Theorem D.11, we only need to recover Lemma D.3 in this binary classification setting. Since in the classification setting the labels are discrete and could be harder to analyze, we make the Gaussian assumption on $z$ to make problems mathematically tractable in these two Theorems.

Lemma D.12 (Classification version of Lemma D.3)

In the binary classification setting, under the conditions similar to Theorem 3.10 and assume $z$ in the spiked covariance model (5) follows a Gaussian distribution, we can find an event $A$ such that $\mathbb{P}(A^{C})=O(\sqrt{d/n})$ and:

\mathbb{E}\quantity[\norm{\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}}_{F}\mathbb{I}\{A\}]\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.

(104)

Proof Again, by (93) we have:

		$\displaystyle\\|\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}$
	$\displaystyle\lesssim$	$\displaystyle\\|\frac{1}{n^{2}}XHyy^{\top}HX^{\top}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}+\frac{r}{n}\nu^{2},$

thus we can replace the $\frac{1}{(n-1)^{2}}$ with $\frac{1}{n}$ in equation (104) and conclude the proof. Denote $\hat{N}\triangleq\frac{1}{n^{2}}XHyy^{\top}HX^{\top}$ , note that both of $\hat{N}$ and $Uw^{\star}w^{\star\top}U^{\top}$ are rank-1 matrices. We first bound the difference between $\frac{1}{n}XHy$ and $\sqrt{\frac{2\nu^{2}}{\pi}}Uw^{\star}$ :

	$\displaystyle\\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\\|=$	$\displaystyle\\|\frac{1}{n}(U^{\star}Z+E)Hy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\\|$		(105)
	$\displaystyle\leq$	$\displaystyle\\|\frac{1}{n}U^{\star}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\\|+\frac{1}{n}\\|Ey\\|+\frac{1}{n}\\|U^{\star}Z\bar{y}\\|+\frac{1}{n}\\|E\bar{y}\\|.$		(105)

We deal with the four terms in (105) separately:

For the first term, note that: $\frac{1}{n}Zy=\frac{1}{n}\sum_{i=1}^{n}z_{i}\operatorname{sign}(z_{i}^{\top}w^{\star})$ and $z_{i}\sim\mathcal{N}(0,\nu^{2}I_{r})$ , thus $z_{i}\operatorname{sign}(z_{i}^{\top}w^{\star})$ follows a folded Gaussian distribution, which is a reflection of standard Gaussian distribution along the normal plane of $w^{\star}$ , thus

	$\displaystyle\mathbb{E}\\|\frac{1}{n}U^{\star}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\\|$	$\displaystyle\leq\mathbb{E}\\|\frac{1}{n}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}w^{\star}\\|\leq\sqrt{\mathbb{E}\\|\frac{1}{n}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}w^{\star}\\|^{2}}$		(106)
		$\displaystyle\leq\sqrt{\frac{r}{n}}\nu$		(106)

For the second term, note that $y$ and $E$ are independent and $|y|=1$ almost surely

\displaystyle\frac{1}{n}\mathbb{E}\|Ey\|=\frac{1}{n}\mathbb{E}\|\sum_{i=1}^{n}\xi_{i}\|\leq\frac{1}{n}\sqrt{\mathbb{E}\|\sum_{i=1}^{n}\xi_{i}\|^{2}}\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}

(107)

For the third term and fourth terms, we have:

\mathbb{E}\frac{1}{n}\|U^{\star}Z\bar{y}\|+\mathbb{E}\frac{1}{n}\|E\bar{y}\|\leq\mathbb{E}\frac{1}{n}\|\sum_{i=1}^{n}z_{i}\|+\mathbb{E}\frac{1}{n}\|\sum_{i=1}^{n}\xi_{i}\|\lesssim\sqrt{\frac{r}{n}}\nu+\sqrt{\frac{d}{n}}\sigma_{(1)}.

(108)

Combine these three equations (106)(107)(108) together we have:

\mathbb{E}\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}.

(109)

With equation (109), we can now turn to the difference between $\hat{N}$ and $\frac{2\nu^{2}}{\pi}Uw^{\star}w^{\star\top}U^{\top}$ . By Lemma D.4 we know that:

\|\hat{N}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}\lesssim(\|\frac{1}{n}XHy\|+\|\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|)\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|.

Using Markov’s inequality, we can conclude from (109) that:

\mathbb{P}(\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\geq\nu)\leq\frac{\mathbb{E}\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|}{\nu}\lesssim\sqrt{\frac{d}{n}}.

Then denote $A=\{\omega:\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|_{2}<\nu\}$ we have:

	$\displaystyle\mathbb{E}\\|\hat{N}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\\|_{F}\mathbb{I}\{A\}\lesssim$	$\displaystyle\mathbb{E}(\\|\frac{1}{n}XHy\\|+\\|\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\\|)\\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\\|\mathbb{I}\{A\}$
	$\displaystyle\lesssim$	$\displaystyle\nu\mathbb{E}\\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\\|\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.$

which finished the proof.
With Lemma D.12 established, it is straightforward to obtain the same results as in Theorem 4.5, Theorem 4.6, Theorem 4.8 and Theorem 4.9 for this binary classification setting.

E Useful lemmas

In this section, we list some of the main techniques that have been used in the proof of the main results.

Lemma E.1 (Theorem 2 in Yu et al. (2015))

Let $\Sigma,\hat{\Sigma}\in\mathbb{R}^{p\times p}$ be symmetric, with eigenvalues $\lambda_{1}\geq\ldots\geq\lambda_{p}$ and $\hat{\lambda}_{1}\geq$ $\ldots\geq\hat{\lambda}_{p}$ respectively. Fix $1\leq r\leq s\leq p$ and assume that $\min\left(\lambda_{r-1}-\lambda_{r},\lambda_{s}-\lambda_{s+1}\right)>0$ where $\lambda_{0}:=\infty$ and $\lambda_{p+1}:=-\infty.$ Let $d:=s-r+1$ , and let $V=\left(v_{r},v_{r+1},\ldots,v_{s}\right)\in\mathbb{R}^{p\times d}$ and $\hat{V}=\left(\hat{v}_{r},\hat{v}_{r+1},\ldots,\hat{v}_{s}\right)\in\mathbb{R}^{p\times d}$ have orthonormal columns satisfying $\Sigma v_{j}=\lambda_{j}v_{j}$ and $\hat{\Sigma}\hat{v}_{j}=\hat{\lambda}_{j}\hat{v}_{j}$ for $j=r,r+1,\ldots,s.$ Then

\|\sin\Theta(\hat{V},V)\|_{\mathrm{F}}\leq\frac{2\min\left(d^{1/2}\|\hat{\Sigma}-\Sigma\|_{\mathrm{2}},\|\hat{\Sigma}-\Sigma\|_{\mathrm{F}}\right)}{\min\left(\lambda_{r-1}-\lambda_{r},\lambda_{s}-\lambda_{s+1}\right)}.

Moreover, there exists an orthogonal matrix $\hat{O}\in\mathbb{R}^{d\times d}$ such that

\|\hat{V}\hat{O}-V\|_{\mathrm{F}}\leq\frac{2^{3/2}\min\left(d^{1/2}\|\hat{\Sigma}-\Sigma\|_{\mathrm{2}},\|\hat{\Sigma}-\Sigma\|_{\mathrm{F}}\right)}{\min\left(\lambda_{r-1}-\lambda_{r},\lambda_{s}-\lambda_{s+1}\right)}.

Lemma E.2 (Lemma 2 in Zhang et al. (2018))

Assume that $E\in\mathbb{R}^{p_{1}\times p_{2}}$ has independent sub-Gaussian entries, $\operatorname{Var}\left(E_{ij}\right)=$ $\sigma_{ij}^{2},\sigma_{C}^{2}=\max_{j}\sum_{i}\sigma_{ij}^{2},\sigma_{R}^{2}=\max_{i}\sum_{j}\sigma_{ij}^{2},\sigma_{(1)}^{2}=\max_{i,j}\sigma_{ij}^{2}.$ Assume that

\left\|E_{ij}/\sigma_{ij}\right\|_{\psi_{2}}:=\max_{q\geq 1}q^{-1/2}\left\{\mathbb{E}\left(\left|E_{ij}\right|/\sigma_{ij}\right)^{q}\right\}^{1/q}\leq\kappa.

Let $V\in\mathbb{O}_{p_{2},r}$ be a fixed orthogonal matrix. Then

\mathbb{P}\left(\|EV\|_{2}\geq 2\left(\sigma_{C}+x\right)\right)\leq 2\exp\left(5r-\min\left\{\frac{x^{4}}{\kappa^{4}\sigma_{(1)}^{2}\sigma_{C}^{2}},\frac{x^{2}}{\kappa^{2}\sigma_{(1)}^{2}}\right\}\right),

\mathbb{E}\|EV\|_{2}\lesssim\sigma_{C}+\kappa r^{1/4}\left(\sigma_{(1)}\sigma_{C}\right)^{1/2}+\kappa r^{1/2}\sigma_{(1)}.

Lemma E.3 (Theorem 6 in Cai et al. (2020))

Suppose $Z$ is a $p_{1}$ -by- $p_{2}$ random matrix with independent mean-zero sub-Gaussian entries. If there exist $\sigma_{1},\ldots,\sigma_{p}\geq 0$ such that $\left\|Z_{ij}/\sigma_{i}\right\|_{\psi_{2}}\leq C_{K}$ for constant $C_{K}>0$ , then

\mathbb{E}\left\|ZZ^{\top}-\mathbb{E}ZZ^{\top}\right\|_{2}\lesssim\sum_{i}\sigma_{i}^{2}+\sqrt{p_{2}\sum_{i}\sigma_{i}^{2}}\cdot\max_{i}\sigma_{i}.

Lemma E.4 (The Eckart-Young-Mirsky Theorem (Eckart and Young, 1936))

Suppose that $A=U\Sigma V^{T}$ is the singular value decomposition of $A$ . Then the best rank- $k$ approximation of the matrix $A$ w.r.t the Frobenius norm, $\|\cdot\|_{F}$ , is given by

A_{k}=\sum_{i=1}^{k}\sigma_{i}u_{i}v_{i}^{T}.

that is, for any matrix $B$ of rank at most k

\|A-A_{k}\|_{F}\leq\|A-B\|_{F}.

References

Agarwal et al. (2020) Anish Agarwal, Devavrat Shah, and Dennis Shen. On principal component regression in a high-dimensional error-in-variables setting. arXiv preprint arXiv:2010.14449, 2020.
Arora et al. (2019) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
Bai and Yao (2012) Zhidong Bai and Jianfeng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177, 2012.
Ballard (1987) Dana H Ballard. Modular learning in neural networks. In AAAI, volume 647, pages 279–284, 1987.
Barshan et al. (2011) Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357–1371, 2011.
Bing et al. (2021) Xin Bing, Florentina Bunea, Seth Strimas-Mackey, and Marten Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond. Journal of Machine Learning Research, 22(177):1–50, 2021.
Bourlard and Kamp (1988) Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(4):291–294, 1988.
Cai and Zhang (2018) T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.
Cai et al. (2020) T Tony Cai, Rungang Han, and Anru R Zhang. On the non-asymptotic concentration of heteroskedastic wishart-type matrix. arXiv preprint arXiv:2008.12434, 2020.
Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
Ćevid et al. (2020) Domagoj Ćevid, Peter Bühlmann, and Nicolai Meinshausen. Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research, 21:232, 2020.
Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020a.
Chen et al. (2020b) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.
Chen et al. (2020c) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.
Coates et al. (2011) Adam Coates, A. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
Deshpande and Montanari (2014) Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse pca. In 2014 IEEE International Symposium on Information Theory, pages 2197–2201. IEEE, 2014.
Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. Advances in neural information processing systems, 27:766–774, 2014.
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Du et al. (2020) Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
Eckart and Young (1936) Carl Eckart and G. Marion Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.
Fan et al. (2019) Jianqing Fan, Cong Ma, and Yiqiao Zhong. A selective overview of deep learning. arXiv preprint arXiv:1904.05526, 2019.
Feizi et al. (2020) Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding gans in the lqg setting: Formulation, generalization and stability. IEEE Journal on Selected Areas in Information Theory, 1(1):304–311, 2020.
Fu et al. (2022) Daniel Y Fu, Mayee F Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. The details matter: Preventing class collapse in supervised contrastive learning. In Computer Sciences & Mathematics Forum, volume 3, page 4. MDPI, 2022.
Garg and Liang (2020) Siddhant Garg and Yingyu Liang. Functional regularization for representation learning: A unified theoretical perspective. Advances in Neural Information Processing Systems, 33:17187–17199, 2020.
Golub and Loan (1996) Gene H. Golub and Charles Van Loan. Matrix computations (3rd ed.). 1996.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
Graf et al. (2021) Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised constrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.
Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
Han et al. (2018) Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
HaoChen et al. (2021) Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. arXiv preprint arXiv:2106.04156, 2021.
He et al. (2016) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
He et al. (2021) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
He et al. (2018) Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3d object retrieval. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1945–1954, 2018.
Hyvärinen et al. (2009) Aapo Hyvärinen, Jarmo Hurri, Patrik O Hoyer, Aapo Hyvärinen, Jarmo Hurri, and Patrik O Hoyer. Independent component analysis. Springer, 2009.
Islam et al. (2021) Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Richard J. Radke, and Rogério Schmidt Feris. A broad study on the transferability of visual representations with contrastive learning. ArXiv, abs/2103.13517, 2021.
Jaiswal et al. (2021) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2021.
Jing et al. (2021) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.
Johnstone (2001) Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of statistics, 29(2):295–327, 2001.
Jolliffe (1982) Ian T Jolliffe. A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(3):300–303, 1982.
Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
Lee et al. (2021) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
Li et al. (2021) Yazhe Li, Roman Pogodin, Danica J Sutherland, and Arthur Gretton. Self-supervised learning with kernel dependence maximization. Advances in Neural Information Processing Systems, 34, 2021.
Liu et al. (2021) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021.
Marsaglia (1972) George Marsaglia. Choosing a point from the surface of a sphere. Annals of Mathematical Statistics, 43:645–646, 1972.
Misra and Maaten (2020) Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Pirjol (2013) Dan Pirjol. The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, 237(1):460–469, 2013.
Plaut (2018) Elad Plaut. From principal subspaces to principal components with linear autoencoders. arXiv preprint arXiv:1804.10253, 2018.
Refinetti and Goldt (2022) Maria Refinetti and Sebastian Goldt. The dynamics of representation learning in shallow, non-linear autoencoders. arXiv preprint arXiv:2201.02115, 2022.
Saunshi et al. (2022) Nikunj Saunshi, Jordan Ash, Surbhi Goel, Dipendra Misra, Cyril Zhang, Sanjeev Arora, Sham Kakade, and Akshay Krishnamurthy. Understanding contrastive learning requires incorporating inductive biases. In International Conference on Machine Learning, pages 19250–19286. PMLR, 2022.
Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
Sohn (2016) Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
Song et al. (2007a) Le Song, Arthur Gretton, Karsten Borgwardt, and Alex Smola. Colored maximum variance unfolding. Advances in Neural Information Processing Systems, 20:1385–1392, 2007a.
Song et al. (2007b) Le Song, Alex Smola, Arthur Gretton, and Karsten M Borgwardt. A dependence maximization view of clustering. In Proceedings of the 24th international conference on Machine learning, pages 815–822, 2007b.
Song et al. (2007c) Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th international conference on Machine learning, pages 823–830, 2007c.
Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 776–794. Springer, 2020.
Tian (2022) Yuandong Tian. Understanding deep contrastive learning via coordinate-wise optimization. In Advances in Neural Information Processing Systems, 2022.
Tian et al. (2021) Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810, 2021.
Tosh et al. (2021) Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pages 1179–1206. PMLR, 2021.
Tripuraneni et al. (2021) Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
Tsai et al. (2020) Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
Wang and Isola (2020) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
Wang et al. (2021) Xiang Wang, Xinlei Chen, Simon S Du, and Yuandong Tian. Towards demystifying representation learning with non-contrastive self-supervision. arXiv preprint arXiv:2110.04947, 2021.
Wen and Li (2021) Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. arXiv preprint arXiv:2105.15134, 2021.
Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
Yao et al. (2015) Jianfeng Yao, Shurong Zheng, and ZD Bai. Sample covariance matrices and high-dimensional data analysis. Cambridge University Press Cambridge, 2015.
Ye et al. (2019) Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.
Yu et al. (2015) Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
Zhang et al. (2018) Anru R Zhang, T Tony Cai, and Yihong Wu. Heteroskedastic pca: Algorithm, optimality, and applications. arXiv preprint arXiv:1810.08316, 2018.
Zimmermann et al. (2021) Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.

	$\displaystyle\\|U^{\top}\Sigma_{x}U-U^{\star\top}\Sigma_{x}U^{\star}\\|_{2}$	$\displaystyle=\\|\nu^{2}U^{\top}U^{\star}U^{\star\top}U-\nu^{2}I+U^{\top}\Sigma U-U^{\star\top}\Sigma U^{\star}\\|_{2}$
		$\displaystyle\leq\nu^{2}\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+\\|U^{\top}\Sigma(U-U^{\star})+(U-U^{\star})^{\top}\Sigma U^{\star}\\|_{2}$
		$\displaystyle\leq\nu^{2}\\|\sin\Theta(U,U^{\star})\\|_{2}^{2}+2\sigma_{(1)}^{2}\\|U-U^{\star}\\|_{2}.$

	$\displaystyle(T2)$	$\displaystyle=\\|(U^{\top}\Sigma_{x}U)^{-1}-U^{\star\top}{(U^{\star}+(U-U^{\star}))}(U^{\top}\Sigma_{x}U)^{-1}(U^{\star}+(U-U^{\star}))^{\top}U^{\star}\\|_{2}$
		$\displaystyle=\\|-U^{\star\top}(U-U^{\star})(U^{\top}\Sigma_{x}U)^{-1}-(U^{\top}\Sigma_{x}U)^{-1}(U-U^{\star})^{\top}U^{\star}$
		$\displaystyle\quad-U^{\star\top}{(U-U^{\star})}(U^{\top}\Sigma_{x}U)^{-1}(U-U^{\star})^{\top}U^{\star}\\|_{2}$
		$\displaystyle\leq\frac{1}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})+\sigma_{(d)}^{2}}(2\\|U-U^{\star}\\|_{2}+\\|U-U^{\star}\\|_{2}^{2}).$

	$\displaystyle\\|U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad\leq\nu^{2}\lambda_{\max}((U^{\top}\Sigma U)^{-1})\\|U^{\star\top}U\\|^{2}_{2}$
	$\displaystyle\quad\leq\frac{\nu^{2}}{\sigma_{(d)}^{2}}\\|U^{\star\top}U\\|^{2}_{F}$
	$\displaystyle\quad=\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\\|\sin\Theta(U,U^{\star})\\|_{F}^{2}),$

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\leq 2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\|\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star})\|\mathbb{I}\{\omega^{\star}>0\}]$
	$\displaystyle\quad\lesssim\frac{2}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|\omega^{\star}\phi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})\omega^{\star}/v^{\star})\frac{\phi(\omega^{\star}/v^{\star})}{v^{\star}}\differential{\omega^{\star}}$
	$\displaystyle\quad\lesssim\frac{\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\phi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})\omega^{\star}/v^{\star})\differential{\omega^{\star}}$
	$\displaystyle\quad=\frac{\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|}{\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\frac{\exp(-1/(2((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})^{-2}{v^{\star}}^{2})){\omega^{\star}}^{2})}{\sqrt{2\pi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})^{-2}{v^{\star}}^{2})}}\differential{\omega^{\star}}$
	$\displaystyle\quad=\frac{\|\alpha_{\max,U^{\star}}-\alpha_{\max,U}\|}{\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U}}(\Psi_{F}(((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U^{\star}})^{-2}{v^{\star}}^{2}))-1/2),$

	$\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})$
	$\displaystyle\quad\lesssim\frac{\nu^{2}+\sigma_{(d)}^{2}}{\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})}(\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2})$
	$\displaystyle\quad\leq\frac{\nu^{2}+\sigma_{(d)}^{2}}{\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})}\\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\\|_{2}$
	$\displaystyle\quad\leq\frac{(\kappa\rho^{2}+1)(\rho^{-2}+1)^{2}}{(1+\kappa^{-1}\rho^{-2})(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})^{2}}\\|\sin\Theta(U,U^{\star})\\|_{2}$
	$\displaystyle\quad=\frac{\kappa\rho^{2}(\rho^{-2}+1)^{2}}{(1-\\|\sin\Theta(U,U^{\star})\\|_{2}^{2})^{2}}\\|\sin\Theta(U,U^{\star})\\|_{2}.$

The Power of Contrast for Feature Learning: A Theoretical Analysis

Abstract

1 Introduction

1.1 Related Works

1.2 Outline

1.3 Notations

2 Setup

2.1 Linear Representation Settings for Contrastive Learning

Linear Representation and Regularization Term

Linear Contrastive Loss

2.2 Generation of Positive and Negative Pairs

Definition 2.1 (Augmented Pairs Generation in the Self-supervised Setting)

Definition 2.2 (Random Masking Augmentation)

Remark 2.3

Definition 2.4 (Pairs Generation in the Supervised Setting)

2.3 Data Generating Process

3 Comparison of Self-Supervised Contrastive Learning and Autoencoders/GANs

3.1 Autoencoders, GANs and PCA

3.2 Contrastive Learning and Diagonal-Deletion PCA

Proposition 3.1

Corollary 3.2

3.3 Feature Recovery from Noisy Data

Definition 3.3 (Incoherent Constant)

Assumption 3.4 (Regular Covariance Condition)

Assumption 3.5 (Signal to noise ratio condition)

Assumption 3.6 (Incoherent Condition)

Lemma 3.7 (Expectation of incoherent constant over a uniform distribution)

Remark 3.8

Theorem 3.9 (Recovery Ability of Autoencoders, Lower Bound)

Theorem 3.10 (Recovery Ability of Contrastive Learning, Upper Bound)

Remark 3.11

Remark 3.12

3.4 Performance on In-Domain Downstream Tasks

Theorem 3.13 (Excess Risk for In-Domain Downstream Task: Upper Bound)

Theorem 3.14 (Excess Risk for In-Domain Downstream Task: Lower Bound))

4 The Impact of Labeled Data in Supervised Contrastive Learning

4.1 Feature Mining in Multi-Class Classification

Remark 4.1

Data Generating Process

Theorem 4.2

Corollary 4.3

4.2 Information Filtering in Multi-Task Transfer Learning

4.2.1 Hilbert-Schmidt Independent Criteria

Definition 4.4 (Empirical HSIC (Gretton et al., 2005))

4.2.2 Main results

Theorem 4.5

Theorem 4.6

Remark 4.7

Theorem 4.8

Theorem 4.9

Theorem 4.10

Theorem 4.11

5 Numerical Experiments

5.1 Linear Model with Synthetic Data

5.2 Neural Nets with Real-World Dataset

Contrastive learning v.s. standard autoencoders:

The impact of labeled data in transfer learning

6 Conclusion

Acknowledgement

A Background and omitted discussion

A.1 Comparison with other works

A.2 Disucssion about the regularization term

A.3 Background on distance between subspaces

Proposition A.1

Proposition A.2

Lemma A.3 (Lemma 1 in Cai and Zhang (2018))

Proposition A.4 (Identity of indiscernibles)

Proposition A.5 (Triangular inequality)

B Omitted proofs for Section 3

B.1 Proofs for Section 3.1 and Section 3.2

Proposition B.1 (Restatement of Proposition 3.1)

Corollary B.2 (Restatement of Corollary 3.2)

Remark B.3

B.2 Proofs for Section 3.3

Lemma B.4 (Uniform distribution on the unit sphere (Marsaglia, 1972))

Lemma B.5

Lemma B.6 (Restatement of Lemma 3.7)

Theorem B.7 (Restatement of Theorem 3.9)

Step 1. Bound the difference between M^1\hat{M}_{1} and Σ\Sigma

Step 2. Bound the sine distance between eigenspaces

The Power of Contrast for Feature Learning:
A Theoretical Analysis

Step 1. Bound the difference between $\hat{M}_{1}$ and $\Sigma$