

The Power of Contrast for Feature Learning:
A Theoretical Analysis

Wenlong Ji  [email protected]
Department of Statistics, Stanford University, Stanford, CA 94305, USA

Zhun Deng  [email protected]
Department of Computer Science, Columbia University, New York, NY 10027, USA

Ryumei Nakada  [email protected]
Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA

James Zou  [email protected]
Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA

Linjun Zhang  [email protected]
Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA
Abstract

Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, theoretical understanding of the superiority of contrastive learning is still limited. In this paper, under linear representation settings, (i) we provably show that contrastive learning outperforms standard autoencoders and generative adversarial networks, two classical generative unsupervised learning methods, for both feature recovery and in-domain downstream tasks; (ii) we also illustrate the impact of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations on in-domain downstream tasks, but can harm performance in transfer learning. We verify our theory with numerical experiments.

Keywords: Self-Supervised Learning, Contrastive Learning, Principal Component Analysis, Spiked Covariance Model, Supervised Contrastive Learning

1 Introduction

Deep supervised learning has achieved great success in various applications, including computer vision (Krizhevsky et al., 2012), natural language processing (Vaswani et al., 2017), and scientific computing (Han et al., 2018). However, its dependence on manually assigned labels, which are usually difficult and costly to obtain, has motivated research into alternative approaches that exploit unlabeled data. Self-supervised learning is a promising approach that leverages the unlabeled data itself as supervision and learns representations that are beneficial to potential in-domain downstream tasks.

At a high level, there are two common approaches for feature extraction in self-supervised learning: generative and contrastive (Liu et al., 2021; Jaiswal et al., 2021). Both approaches aim to learn latent representations of the original data. The difference is that the generative approach focuses on minimizing the reconstruction error from the latent representations, while the contrastive approach learns by contrasting pairs constructed via data augmentation, increasing the agreement between positive pairs and decreasing the agreement between negative pairs. Recent works have shown the benefits of contrastive learning in practice (Chen et al., 2020a; He et al., 2020; Chen et al., 2020b, c). However, these works did not explain the popularity of contrastive learning — what is the advantage of contrastive learning, and where does it come from?

Additionally, recent works aim to further improve contrastive learning by introducing label information. Specifically, Khosla et al. (2020) proposed supervised contrastive learning, where the contrasting procedure is performed across different classes rather than different instances. With the help of label information, their proposed method outperforms self-supervised contrastive learning and classical cross-entropy-based supervised learning. However, despite this improvement on in-domain downstream tasks, Islam et al. (2021) found that the gain of supervised contrastive learning in transfer learning is limited and can even be negative. This phenomenon motivates us to rethink the impact of labeled data in the contrastive learning framework.

In this paper, we first establish a theoretical framework to study contrastive learning under the linear representation setting. Under this framework, we provide a theoretical analysis of the feature learning performance of contrastive learning under the spiked covariance model (Bai and Yao, 2012; Yao et al., 2015; Zhang et al., 2018) and theoretically justify why contrastive learning outperforms standard autoencoders and generative adversarial networks (GANs) (Goodfellow et al., 2014): contrastive learning is able to remove more noise by constructing contrastive samples via augmentations. Moreover, we investigate the impact of label information in the contrastive learning framework and provide a theoretical justification of why labeled data help to gain accuracy in in-domain regression and classification while they can hurt multi-task transfer learning.

1.1 Related Works

The idea of contrastive learning was first proposed in Hadsell et al. (2006) as an effective method for dimensionality reduction. Following this line of research, Dosovitskiy et al. (2014) proposed to perform instance discrimination by creating surrogate classes for each instance, and Wu et al. (2018) further proposed to maintain a memory bank as a dictionary of negative samples. Other extensions based on this memory bank approach include He et al. (2020); Misra and Maaten (2020); Tian et al. (2020); Chen et al. (2020c). Rather than keeping a costly memory bank, another line of work exploits the benefit of mini-batch training, where different samples are treated as negatives for each other (Ye et al., 2019; Chen et al., 2020a). Moreover, Khosla et al. (2020) explored the supervised version of contrastive learning, where pairs are generated based on label information.

Despite its success in practice, the theoretical understanding of contrastive learning is still limited. Previous works provide provable guarantees for contrastive learning under a conditional independence assumption (or its variants) (Arora et al., 2019; Lee et al., 2021; Tosh et al., 2021; Tsai et al., 2020). Specifically, they assume the two contrastive views are independent conditioned on the label and show that contrastive learning can provably learn representations beneficial for in-domain downstream tasks. In addition to this line of research, there exist several alternative perspectives for studying the theoretical properties of contrastive learning. To name a few, Wang and Isola (2020); Graf et al. (2021) explored the representation geometry, HaoChen et al. (2021) analyzed the augmentation graph, Tian (2022) proposed a two-player game theory framework, Zimmermann et al. (2021) demonstrated the connection between contrastive learning and nonlinear Independent Component Analysis (Hyvärinen et al., 2009), Saunshi et al. (2022) showed the importance of inductive bias in contrastive learning, and Jing et al. (2021) investigated the dimensional collapse phenomenon. Furthermore, Tian et al. (2021); Wang et al. (2021) have also explored the ability of self-supervised learning to learn features even without contrastive pairs, specifically in the context of linear representation settings.

More relevant to this paper, Wen and Li (2021) considered representation learning under the sparse coding model and studied the optimization properties of shallow ReLU neural networks. However, the assumptions that features are extremely sparse and that signals follow a Gaussian distribution seem strong for real data. Garg and Liang (2020) studied the combination of supervised learning and self-supervised learning. They derived sample complexity bounds in a PAC-learning style for various settings. Specifically, the authors assume that there is a ground-truth representation that keeps both the self-supervised loss and the supervised loss below a very low threshold. However, as the authors admit, it is hard to determine such a threshold in practical settings. For example, when the unlabeled and labeled data come from different domains, such as ImageNet and CIFAR-10, domain-specific features may attain a much lower loss than domain-transferable features.

While the aforementioned works aim to demonstrate that contrastive learning is capable of learning meaningful representations, the question of why contrastive learning outperforms other representation learning methods was left untouched. We also shed light on the impact of labeled data in a contrastive learning framework, which is underexplored in prior works. A detailed comparison with existing literature is deferred to Appendix A.1.

1.2 Outline

This paper is organized as follows. Section 2 provides the setup for the data-generating process and the loss function. In Section 3, we review the connection between PCA and autoencoders/GANs. We also establish a theoretical framework to study contrastive learning in the linear representation setting. Under this framework, we evaluate the feature recovery performance and in-domain downstream task performance of contrastive learning and autoencoders. In Section 4, we analyze supervised contrastive learning. In Section 5, we verify the theoretical results given in Sections 3 and 4. Finally, we summarize our analysis and provide future directions in Section 6.

1.3 Notations

In this paper, we use $O,\Omega,\Theta$ to hide universal constants, and we write $a_{k}\lesssim b_{k}$ for two sequences of positive numbers $\{a_{k}\}$ and $\{b_{k}\}$ if and only if there exists a universal constant $C>0$ such that $a_{k}<Cb_{k}$ for any $k$. We write $a_{k}\asymp b_{k}$ when $a_{k}\lesssim b_{k}$ and $a_{k}\gtrsim b_{k}$ hold simultaneously. We use $\|\cdot\|,\|\cdot\|_{2},\|\cdot\|_{F}$ to denote the $\ell_{2}$ norm of vectors, the spectral norm of matrices, and the Frobenius norm of matrices, respectively. Let $\mathbb{O}_{d,r}$ be the set of $d\times r$ orthogonal matrices, namely $\mathbb{O}_{d,r}\triangleq\{U\in\mathbb{R}^{d\times r}:U^{\top}U=I_{r}\}$. We write $n\gg d$ when there exists a sufficiently small constant $c$, depending only on universal constants and independent of $n$, $d$ and $r$, such that $d/n<c$ holds; $d\gg r$ is defined similarly. We use $|A|$ to denote the cardinality of a set $A$. For any $n\in\mathbb{N}^{+}$, let $[n]=\{1,2,\cdots,n\}$. We use $\|\sin\Theta(U_{1},U_{2})\|_{F}$ to refer to the sine distance between two orthogonal matrices $U_{1},U_{2}\in\mathbb{O}_{d,r}$, defined by $\|\sin\Theta(U_{1},U_{2})\|_{F}\triangleq\|U_{1\perp}^{\top}U_{2}\|_{F}$, where $U_{1\perp}\in\mathbb{O}_{d,d-r}$ is any orthogonal complement of $U_{1}$. More properties of the sine distance can be found in Section A.3. We use $\{e_{i}\}_{i=1}^{d}$ to denote the canonical basis of the $d$-dimensional Euclidean space $\mathbb{R}^{d}$, that is, $e_{i}$ is the vector whose $i$-th coordinate is $1$ and all other coordinates are $0$. Let $\mathbb{I}\{A\}$ be the indicator function that takes the value $1$ when $A$ is true and $0$ otherwise. We write $a\vee b$ and $a\wedge b$ to denote $\max(a,b)$ and $\min(a,b)$, respectively.

2 Setup

Here we introduce the loss functions and data-generating models that will be used in the theoretical analysis later.

2.1 Linear Representation Settings for Contrastive Learning

Given an input $x\in\mathbb{R}^{d}$, contrastive learning aims to learn a low-dimensional representation $h=f(x;\theta)\in\mathbb{R}^{r}$ by contrasting different samples, that is, maximizing the agreement between positive pairs and minimizing the agreement between negative pairs. Suppose we have $n$ data points $X=[x_{1},x_{2},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ from the population distribution $\mathcal{D}$. The contrastive learning task can be formulated as the following optimization problem:

\min_{\theta}\mathcal{L}(\theta)=\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell(x_{i},\mathcal{B}_{i}^{Pos},\mathcal{B}_{i}^{Neg};f(\cdot;\theta))+\lambda R(\theta), \qquad (1)

where $\ell(\cdot)$ is a contrastive loss and $\lambda R(\theta)$ is a regularization term; $\mathcal{B}_{i}^{Pos}$ and $\mathcal{B}_{i}^{Neg}$ are the sets of positive samples and negative samples corresponding to $x_{i}$, the details of which are described below.

Linear Representation and Regularization Term

We consider the linear representation function $f(x;W)=Wx$, where the parameter $\theta$ is a matrix $W\in\mathbb{R}^{r\times d}$. This linear representation setting has been widely adopted in other theoretical papers to understand self-supervised contrastive learning (Jing et al., 2021; Wang et al., 2021; Tian et al., 2021) and to shed light on other complex machine learning phenomena, as in Tripuraneni et al. (2021). Moreover, since regularization techniques have been widely adopted in contrastive learning practice (Chen et al., 2020a; He et al., 2020; Grill et al., 2020), we further penalize the representation by a regularization term $R(W)=\|WW^{\top}\|_{F}^{2}/2$ to encourage the orthogonality of $W$ and therefore promote the diversity of the rows $w_{i}$, so that they learn different representations. The reason we use this quadratic regularization instead of a standard $\ell_{2}$ regularization is that it encourages a diverse representation in the linear representation setting by penalizing the similarities $\langle w_{i},w_{j}\rangle^{2}$; a formal discussion and numerical experiments on this regularization are deferred to Appendix A.2.

Linear Contrastive Loss

The contrastive loss is set to be the average similarity (measured by the inner product) between negative pairs minus that between positive pairs:

\ell(x,\mathcal{B}_{x}^{Pos},\mathcal{B}_{x}^{Neg};f(\cdot;\theta))=-\sum_{x^{Pos}\in\mathcal{B}_{x}^{Pos}}\frac{\langle f(x,\theta),f(x^{Pos},\theta)\rangle}{|\mathcal{B}_{x}^{Pos}|}+\sum_{x^{Neg}\in\mathcal{B}_{x}^{Neg}}\frac{\langle f(x,\theta),f(x^{Neg},\theta)\rangle}{|\mathcal{B}_{x}^{Neg}|}, \qquad (2)

where $\mathcal{B}_{x}^{Pos}$ and $\mathcal{B}_{x}^{Neg}$ are the sets of positive and negative samples corresponding to $x$. This loss function has been commonly used in contrastive learning (Hadsell et al., 2006) and metric learning (Schroff et al., 2015; He et al., 2018). In Khosla et al. (2020), the authors show that the inner-product-based linear loss (2) is an approximation of the NT-Xent contrastive loss when one positive and one negative sample are used; the NT-Xent loss has been highlighted in recent contrastive learning practice (Sohn, 2016; Wu et al., 2018; Oord et al., 2018; Chen et al., 2020a). In Li et al. (2021), the authors proposed the SSL-HSIC contrastive loss, which reduces to this linear loss when the kernel $k(\cdot,\cdot)$ is chosen to be a simple inner product. Following Li et al. (2021), we provide the results in Table 1, which show that the linear contrastive loss can also work well with some additional training techniques.

Testing Accuracy | InfoNCE | Linear contrastive loss
CIFAR10 | $65.11\pm 0.51$ | $\mathbf{66.07\pm 0.46}$
STL10 | $\mathbf{71.02\pm 0.47}$ | $70.30\pm 0.31$
Table 1: InfoNCE loss vs. linear contrastive loss. We train a ResNet-18 encoder on the CIFAR-10 and STL-10 datasets with different contrastive loss functions. To train with the linear contrastive loss, we follow the HSIC regularization techniques used in Li et al. (2021), which help the linear contrastive loss yield performance comparable to standard InfoNCE. We repeat each experiment for ten runs and report the mean and standard deviation of the accuracy. Detailed experimental settings can be found in Section 5.2.
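For concreteness, the following is a minimal NumPy sketch (our own illustration, not the training code behind Table 1) of how the linear contrastive loss (2) is evaluated for a single anchor under the linear representation $f(x;W)=Wx$; the array shapes and sizes are hypothetical.

```python
import numpy as np

def linear_contrastive_loss(x, positives, negatives, W):
    """Loss (2) for one anchor x, with linear features f(x; W) = W x.

    x:          (d,)  anchor sample
    positives:  (p, d) positive samples for x
    negatives:  (q, d) negative samples for x
    W:          (r, d) linear representation matrix
    """
    h = W @ x                          # anchor representation, shape (r,)
    h_pos = positives @ W.T            # (p, r) positive representations
    h_neg = negatives @ W.T            # (q, r) negative representations
    # average agreement with negatives minus average agreement with positives
    return -np.mean(h_pos @ h) + np.mean(h_neg @ h)

# toy usage with random data (dimensions chosen arbitrarily)
rng = np.random.default_rng(0)
d, r = 20, 3
W = rng.standard_normal((r, d))
loss = linear_contrastive_loss(rng.standard_normal(d),
                               rng.standard_normal((1, d)),   # one positive
                               rng.standard_normal((8, d)),   # eight negatives
                               W)
```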

2.2 Generation of Positive and Negative Pairs

There are two common approaches to generating positive and negative pairs, depending on whether or not label information is available. When the label information is not available, the typical strategy is to generate different views of the original data via augmentation (Hadsell et al., 2006; Chen et al., 2020a). Two views of the same data point serve as the positive pair for each other, while those of different data serve as negative pairs.

Definition 2.1 (Augmented Pairs Generation in the Self-supervised Setting)

Given two augmentation functions $g_{1},g_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ and $n$ training samples $\mathcal{B}=\{x_{i}\}_{i\in[n]}$, the augmented views are given by $\{(g_{1}(x_{i}),g_{2}(x_{i}))\}_{i\in[n]}$. Then for each view $g_{v}(x_{i})$, $v=1,2$, the corresponding positive and negative samples are defined by $\mathcal{B}_{i,v}^{Pos}=\{g_{s}(x_{i}):s\in[2]\setminus\{v\}\}$ and $\mathcal{B}_{i,v}^{Neg}=\{g_{s}(x_{j}):s\in[2],j\in[n]\setminus\{i\}\}$.

The loss function of the self-supervised contrastive learning problem can then be written as:

\mathcal{L}_{\text{SelfCon}}(W)=-\frac{1}{2n}\sum_{i=1}^{n}\sum_{v=1}^{2}\biggl[\langle Wg_{v}(x_{i}),Wg_{[2]\setminus\{v\}}(x_{i})\rangle-\sum_{j\neq i}\sum_{s=1}^{2}\frac{\langle Wg_{v}(x_{i}),Wg_{s}(x_{j})\rangle}{2n-2}\biggr]+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}. \qquad (3)
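The indices in Equation (3) can be parsed more easily from a direct (unoptimized) NumPy transcription; this sketch assumes the two views are stored as arrays of shape (d, n) whose i-th columns are $g_{1}(x_{i})$ and $g_{2}(x_{i})$.

```python
import numpy as np

def self_con_loss(W, X1, X2, lam):
    """Self-supervised contrastive loss, Equation (3).

    W:   (r, d) linear representation
    X1:  (d, n) first view, column i is g_1(x_i)
    X2:  (d, n) second view, column i is g_2(x_i)
    lam: regularization weight lambda
    """
    H = [W @ X1, W @ X2]                      # representations of the two views
    n = X1.shape[1]
    total = 0.0
    for i in range(n):
        for v in range(2):
            pos = H[v][:, i] @ H[1 - v][:, i]          # other view of x_i
            neg = 0.0
            for j in range(n):
                if j == i:
                    continue
                for s in range(2):
                    neg += H[v][:, i] @ H[s][:, j]     # both views of x_j, j != i
            total += pos - neg / (2 * n - 2)
    reg = 0.5 * lam * np.linalg.norm(W @ W.T, "fro") ** 2
    return -total / (2 * n) + reg
```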

In particular, we adopt the following augmentation in our analysis.

Definition 2.2 (Random Masking Augmentation)

The two views of the original data are generated by randomly dividing its dimensions into two sets, that is, $g_{1}(x_{i})=Ax_{i}$ and $g_{2}(x_{i})=(I-A)x_{i}$, where $A=\operatorname{diag}(a_{1},\cdots,a_{d})\in\mathbb{R}^{d\times d}$ is a diagonal masking matrix with $\{a_{i}\}_{i=1}^{d}$ being i.i.d. random variables sampled from a Bernoulli distribution with mean $1/2$.
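A minimal sketch of this augmentation (a single coordinate-wise Bernoulli(1/2) mask shared by all samples, as in Definition 2.2):

```python
import numpy as np

def random_masking_views(X, rng):
    """Random masking augmentation (Definition 2.2) for X of shape (d, n).

    Each coordinate is assigned to view 1 with probability 1/2; view 2 receives
    the complementary coordinates, so the two views have disjoint supports.
    """
    d = X.shape[0]
    a = rng.integers(0, 2, size=d).astype(float)   # diagonal of the masking matrix A
    A = np.diag(a)
    return A @ X, (np.eye(d) - A) @ X              # g_1(x_i) = A x_i, g_2(x_i) = (I - A) x_i
```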

Remark 2.3

In this paper, we focus on random masking augmentation, which has also been used in other works on the theoretical understanding of contrastive learning, e.g., Wen and Li (2021). However, our primary interest lies in comparing the performance of contrastive learning with autoencoders and analyzing the impact of labeled data, while their work focuses on understanding the training process of neural networks in contrastive learning. Random masking augmentation is an analog of the random cropping augmentation used in practice. As shown in Chen et al. (2020a), cropping augmentation achieves overwhelming performance on linear evaluation (ImageNet top-1 accuracy) compared with other augmentation methods; see Figure 5 in Chen et al. (2020a) for details.

When the label information is available, Khosla et al. (2020) proposed the following approach to generate positive and negative pairs.

Definition 2.4 (Pairs Generation in the Supervised Setting)

In a $K$-class classification problem, given $n_{k}$ samples for each class $k\in[K]$, $\{x_{i}^{k}:i\in[n_{k}]\}_{k=1}^{K}$, and letting $n=\sum_{k=1}^{K}n_{k}$, the corresponding positive and negative samples for $x_{i}^{k}$ are defined by $\mathcal{B}_{i,k}^{Pos}=\{x_{j}^{k}:j\in[n_{k}]\setminus\{i\}\}$ and $\mathcal{B}_{i,k}^{Neg}=\{x_{j}^{s}:s\in[K]\setminus\{k\},j\in[n_{s}]\}$. That is, the positive samples are the remaining samples in the same class as $x_{i}^{k}$, and the negative samples are the samples from different classes.

Correspondingly, the loss function of the supervised contrastive learning problem can be written as:

\mathcal{L}_{\text{SupCon}}(W)=-\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\biggl[\sum_{j\neq i}\frac{\langle Wx_{i}^{k},Wx_{j}^{k}\rangle}{n-1}-\sum_{j=1}^{n}\sum_{s\neq k}\frac{\langle Wx_{i}^{k},Wx_{j}^{s}\rangle}{n(K-1)}\biggr]+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}. \qquad (4)
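As with the self-supervised loss, Equation (4) translates directly into code; the sketch below assumes equal class sizes so that the indices in (4) line up, and is meant only as an illustration.

```python
import numpy as np

def sup_con_loss(W, Xc, lam):
    """Supervised contrastive loss, Equation (4).

    W:   (r, d) linear representation
    Xc:  list of K arrays, Xc[k] of shape (d, n): the n samples of class k
         (equal class sizes assumed so the indices match Equation (4))
    lam: regularization weight lambda
    """
    K = len(Xc)
    n = Xc[0].shape[1]
    H = [W @ X for X in Xc]                            # per-class representations, each (r, n)
    total = 0.0
    for k in range(K):
        for i in range(n):
            hi = H[k][:, i]
            pos = (H[k].T @ hi).sum() - hi @ hi        # sum over j != i within class k
            neg = sum((H[s].T @ hi).sum() for s in range(K) if s != k)
            total += pos / (n - 1) - neg / (n * (K - 1))
    reg = 0.5 * lam * np.linalg.norm(W @ W.T, "fro") ** 2
    return -total / (n * K) + reg
```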

2.3 Data Generating Process

In real-world scenarios, data often comprises both signal (relevant information) and noise (irrelevant distractions). For instance, in image classification, the signal might be the primary subject of interest, while the noise could represent background elements. Self-supervised learning methods, without predefined tasks, aim to extract generalized patterns from data, ideally capturing as much of the signal as possible. It is commonly understood that signals tend to exhibit specific low-complexity structures, often being low-rank and showing higher correlations across coordinates. In contrast, background noise might lack a distinct structure, potentially being dense (or full rank) with lower coordinate correlations. To delve into this structural difference more rigorously, we consider an additive data-generating model. Here, the observed data emerges as a combination of a low-rank signal and dense noise.

x=U^{\star}z+\xi,\quad\mathrm{Cov}(z)=\nu^{2}I_{r},\quad\mathrm{Cov}(\xi)=\Sigma, \qquad (5)

where $z\in\mathbb{R}^{r}$ and $\xi\in\mathbb{R}^{d}$ are independent, zero-mean sub-Gaussian random vectors, and $\nu\in\mathbb{R}$ is a constant representing the signal strength. In particular, $U^{\star}\in\mathbb{O}_{d,r}$ and $\Sigma=\operatorname{diag}(\sigma_{1}^{2},\cdots,\sigma_{d}^{2})$. The first term $U^{\star}z$ represents the signal of interest residing in a low-dimensional subspace spanned by the columns of $U^{\star}$. The second term $\xi$ is dense, heteroskedastic noise. Given that, the ideal low-dimensional representation compresses the observed $x$ onto the subspace spanned by the columns of $U^{\star}$. This model is known as the spiked covariance model (Johnstone, 2001; Bai and Yao, 2012; Yao et al., 2015; Zhang et al., 2018). It was motivated by the empirical observation that the eigenvalues of the sample covariance matrix of phoneme data exhibit a few "spikes", which correspond to the low-dimensional structure of the data generation. The model has been used in the literature on PCA (Johnstone, 2001; Deshpande and Montanari, 2014; Zhang et al., 2018) and contrastive learning (Wen and Li, 2021).

In this paper, we aim to learn a good projection $W\in\mathbb{R}^{r\times d}$ onto a lower-dimensional subspace from the observations $x$. Since the information in $W$ is invariant under the transformation $W\leftarrow OW$ for any $O\in\mathbb{O}_{r,r}$, the essential information of $W$ is contained in its right singular vectors. Thus, we quantify the goodness of the representation $W$ using the sine distance $\|\sin\Theta(U,U^{\star})\|_{F}$, where $U$ is the top-$r$ right singular space of $W$. It is notable that we only assume the noise and signal follow sub-Gaussian distributions. This includes bounded noise/signals such as images, sound data, or text data.
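The following sketch draws data from model (5) with Gaussian signal and noise (one admissible sub-Gaussian choice, with homoskedastic noise purely for brevity) and evaluates the sine distance used throughout; it uses the identity $\|\sin\Theta(U_{1},U_{2})\|_{F}^{2}=r-\|U_{1}^{\top}U_{2}\|_{F}^{2}$ to avoid forming an orthogonal complement explicitly.

```python
import numpy as np

def spiked_data(n, d, r, nu, sigma, rng):
    """Draw n samples from model (5): x = U* z + xi, with Gaussian z and xi."""
    U_star = np.linalg.qr(rng.standard_normal((d, r)))[0]    # random U* in O_{d,r}
    Z = nu * rng.standard_normal((r, n))                     # Cov(z) = nu^2 I_r
    Xi = sigma * rng.standard_normal((d, n))                 # homoskedastic noise (illustration only)
    return U_star @ Z + Xi, U_star

def sine_distance(U1, U2):
    """||sin Theta(U1, U2)||_F for U1, U2 in O_{d,r}."""
    r = U1.shape[1]
    return np.sqrt(max(r - np.linalg.norm(U1.T @ U2, "fro") ** 2, 0.0))
```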

3 Comparison of Self-Supervised Contrastive Learning and Autoencoders/GANs

Generative and contrastive learning are two popular approaches to self-supervised learning. Recent experiments have highlighted the improved performance of contrastive learning compared with the generative approach. For example, in Figure 1 of Chen et al. (2020a) and Figure 7 of Liu et al. (2021), it is observed that state-of-the-art contrastive self-supervised learning achieves more than a 10 percent improvement over state-of-the-art generative self-supervised learning with the same number of parameters. In this section, we rigorously demonstrate the advantage of contrastive learning over autoencoders/GANs, the representative methods in generative self-supervised learning, by investigating the linear representation settings under the spiked covariance model (5). The investigation is conducted for both feature recovery and in-domain downstream tasks.

Hereafter, we focus on the linear representation settings. This section is organized as follows: in Section 3.1 we first review the connection between principal component analysis (PCA) and autoencoders/GANs, two representative generative approaches to self-supervised learning, under linear representation settings. We then establish the connection between contrastive learning and PCA in Section 3.2. Based on these connections, we compare contrastive learning and autoencoders in terms of feature recovery ability (Section 3.3) and in-domain downstream performance (Section 3.4).

3.1 Autoencoders, GANs and PCA

Autoencoders are popular unsupervised learning methods for dimensionality reduction. Autoencoders learn two functions: an encoder $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{r}$ and a decoder $g:\mathbb{R}^{r}\rightarrow\mathbb{R}^{d}$. The encoder $f$ compresses the original data into low-dimensional features, and the decoder $g$ recovers the original data from those features. This can be formulated as the following optimization problem for samples $\{x_{i}\}_{i=1}^{n}$ (Ballard, 1987; Fan et al., 2019):

\min_{f,g}\mathbb{E}_{x}\mathcal{L}(x,g(f(x))). \qquad (6)

By minimizing this loss, autoencoders try to preserve, in the low-dimensional representation, the essential features needed to recover the original data. In our setting, we consider the class of linear functions for $f$ and $g$, and the loss function is set to the mean squared error. Write $f(x)=W_{\text{AE}}x$ and $g(x)=W_{\text{DE}}x$. Namely, we consider the following problem.

\min_{W_{\text{AE}},W_{\text{DE}}}\frac{1}{n}\|X-W_{\text{DE}}W_{\text{AE}}X\|_{F}^{2}.

Let $X=(x_{1},\dots,x_{n})\in\mathbb{R}^{d\times n}$. By Theorem 2.4.8 in Golub and Loan (1996), the optimal solution is given by the eigenspace of $XX^{\top}$, which exactly corresponds to the result of PCA. Thus, in linear representation settings, autoencoders are equivalent to PCA; such models are often known as undercomplete linear autoencoders (Bourlard and Kamp, 1988; Plaut, 2018; Fan et al., 2019). We write the low-rank representation obtained by autoencoders as

W_{\text{AE}}=(U_{\text{AE}}\Sigma_{\text{AE}}V_{\text{AE}}^{\top})^{\top}, \qquad (7)

where $U_{\text{AE}}$ consists of the top-$r$ eigenvectors of the matrix $XX^{\top}$, $\Sigma_{\text{AE}}$ is a diagonal matrix of singular values, and $V_{\text{AE}}=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.
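In this linear setting the autoencoder/PCA representation of Equation (7) is obtained from a single eigendecomposition; a minimal sketch:

```python
import numpy as np

def autoencoder_representation(X, r):
    """Top-r eigenvectors of X X^T, i.e. U_AE in Equation (7), for X of shape (d, n)."""
    evals, evecs = np.linalg.eigh(X @ X.T)   # eigenvalues in ascending order
    return evecs[:, ::-1][:, :r]             # (d, r) leading eigenvectors
```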

We also note that GANs (Goodfellow et al., 2014) are related to PCA. Namely, Feizi et al. (2020) showed that the global solution of GANs recovers the empirical PCA solution as the generative model.

To see this, let $\mathcal{W}_{2}$ be the second-order Wasserstein distance. Also let $\mathcal{G}$ be the set of linear generator functions from $\mathbb{R}^{r}$ to $\mathbb{R}^{d}$. Consider the following $\mathcal{W}_{2}$ GAN optimization problem:

\min_{g\in\mathcal{G}}\mathcal{W}_{2}^{2}(\mathbb{P}_{n},\mathbb{P}_{g(Z)}), \qquad (8)

where $\mathbb{P}_{n}$ denotes the empirical distribution of the i.i.d. data $x_{1},\dots,x_{n}\in\mathbb{R}^{d}$ and $\mathbb{P}_{g(Z)}$ is the generated distribution with generator $g$ and $Z\sim N(0,I_{r})$. Note that the optimization problem (8) can be written as $\min_{\mathbb{P}_{n,Z}}\min_{g\in\mathcal{G}}\mathbb{E}[\|X-g(Z)\|^{2}]$, where the first minimization is over probability distributions whose marginals are $\mathbb{P}_{n}$ and $\mathbb{P}_{Z}$. By Theorem 2 in Feizi et al. (2020), the optimizer of problem (8) is $\hat{g}:Z\mapsto\hat{G}Z$, where $\hat{G}$ satisfies $\hat{G}\hat{G}^{\top}=U_{\text{AE}}\Sigma_{\text{AE}}^{2}U_{\text{AE}}^{\top}$. This implies that $W_{\text{AE}}^{\top}:\mathbb{R}^{r}\to\mathbb{R}^{d}$ is also a solution to the optimization problem (8). Hence GANs learn the PCA solution as a generator.

Given this equivalence among ordinary PCA, autoencoders, and GANs, we focus only on autoencoders hereafter for brevity.

3.2 Contrastive Learning and Diagonal-Deletion PCA

Here we bridge PCA and contrastive learning with certain augmentations under the linear representation setting. Recall that the optimization problem for self-supervised contrastive learning is formulated as:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W):=-\frac{1}{2n}\sum_{i=1}^{n}\sum_{v=1}^{2}\biggl[\langle Wg_{v}(x_{i}),Wg_{[2]\setminus\{v\}}(x_{i})\rangle-\sum_{j\neq i}\sum_{s=1}^{2}\frac{\langle Wg_{v}(x_{i}),Wg_{s}(x_{j})\rangle}{2n-2}\biggr]+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}. \qquad (9)

To compare contrastive learning with autoencoders, we now derive the solution of the optimization problem (9). We start with the general result for self-supervised contrastive learning with augmented pairs generation in Definition 2.1, and then turn to the special case of random masking augmentation (Definition 2.2).

Proposition 3.1

For two fixed augmentation functions $g_{1},g_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, denote the augmented data matrices by $X_{1}=[g_{1}(x_{1}),\cdots,g_{1}(x_{n})]\in\mathbb{R}^{d\times n}$ and $X_{2}=[g_{2}(x_{1}),\cdots,g_{2}(x_{n})]\in\mathbb{R}^{d\times n}$. When the augmented pairs are generated as in Definition 2.1, all the optimal solutions of the contrastive learning problem (9) are given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$-th largest eigenvalue of the following matrix:

X_{1}X_{2}^{\top}+X_{2}X_{1}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top}, \qquad (10)

$u_{i}$ is the corresponding eigenvector, and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

The proof is given in Appendix B.1.

Proposition 3.1 is a general result for augmented pairs generation with fixed and deterministic augmentation functions. The result itself depends only on the augmented data matrices, so it is straightforward to generalize to the case where different augmentation functions are applied to different samples; we omit this here for simplicity of notation. Moreover, when the augmentation is sampled from a stochastic distribution, we can also characterize the optimal solution of the expected loss in the same way. Specifically, if we apply the random masking augmentation (Definition 2.2), we can further characterize the optimal solution. For any square matrix $A\in\mathbb{R}^{d\times d}$, we denote by $D(A)$ the matrix $A$ with all off-diagonal entries set to zero, and by $\Delta(A)=A-D(A)$ the matrix $A$ with all diagonal entries set to zero. Then we have the following corollary for random masking augmentation.

Corollary 3.2

Under the same conditions as in Proposition 3.1, if we use random masking (Definition 2.2) as our augmentation function, then the minimizer of the expected loss function of the contrastive learning problem (9) over the distribution of random augmentations (i.e., $\mathbb{E}_{g_{1},g_{2}}\mathcal{L}_{\text{SelfCon}}(W)$) is given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$-th largest eigenvalue of the following matrix:

\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top}, \qquad (11)

$u_{i}$ is the corresponding eigenvector, and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

The proof is given in Appendix B.2.

With Proposition 3.1 and Corollary 3.2 established, we find that self-supervised contrastive learning with augmented pairs generation and random masking augmentation eliminates the effect of random noise on the diagonal entries of the observed covariance matrix. Since $\mathrm{Cov}(\xi)=\Sigma$ is a diagonal matrix, when the diagonal entries of $\mathrm{Cov}(U^{\star}z)=\nu^{2}U^{\star}U^{\star\top}$ account for only a small proportion of its total Frobenius norm, contrasting augmented pairs preserves the core features while eliminating most of the random noise, and hence gives a more accurate estimate of the core features.
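Corollary 3.2 says that, in expectation over the random masking, contrastive learning performs PCA on the diagonal-deleted matrix (11); a sketch of the resulting estimator (our own illustration, returning the eigenspace up to the arbitrary rotation $V$):

```python
import numpy as np

def contrastive_representation(X, r):
    """Top-r eigenspace of the matrix in Equation (11) for X of shape (d, n)."""
    d, n = X.shape
    G = X @ X.T
    M = G - np.diag(np.diag(G))                                  # Delta(X X^T): remove the diagonal
    M -= X @ (np.ones((n, n)) - np.eye(n)) @ X.T / (n - 1)       # cross-sample (negative-pair) term
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, ::-1][:, :r]                                 # U_CL, up to rotation
```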

3.3 Feature Recovery from Noisy Data

After bridging both autoencoders and contrastive learning with PCA, we can now analyze their feature recovery ability to understand the benefit of contrastive learning over autoencoders. As mentioned above, our target is to recover the subspace spanned by the columns of $U^{\star}$, which can further help us obtain information about the unobserved $z$ that is important for in-domain downstream tasks. However, the observed data have covariance matrix $\nu^{2}U^{\star}U^{\star\top}+\Sigma$ rather than the desired $\nu^{2}U^{\star}U^{\star\top}$, which makes representation learning difficult. We demonstrate that contrastive learning can better exploit the structure of the core features and obtain a better estimate than autoencoders in this setting.

We start with autoencoders. In the noiseless case, the covariance matrix is $\nu^{2}U^{\star}U^{\star\top}$ and autoencoders can perfectly recover the core features. In noisy cases, however, the random noise can perturb the core features, which makes autoencoders fail to learn them. Such noisy cases are widespread in real applications, for example measurement errors or image backgrounds such as grass and sky. Interestingly, we will later show that contrastive learning can better recover $U^{\star}$ despite the presence of large noise.

To provide a rigorous analysis, we first introduce the incoherence constant (Candès and Recht, 2009).

Definition 3.3 (Incoherence Constant)

We define the incoherence constant of $U\in\mathbb{O}_{d,r}$ as

I(U)=\max_{i\in[d]}\left\|e_{i}^{\top}U\right\|^{2}. \qquad (12)

Intuitively, the incoherence constant measures the degree of incoherence of the distribution of entries among different coordinates, or, loosely speaking, the similarity between $U$ and the canonical basis $\{e_{i}\}_{i=1}^{d}$. For uncorrelated random noise, the covariance matrix is diagonal and its eigenspace is exactly spanned by the canonical basis $\{e_{i}\}_{i=1}^{d}$ (if the diagonal entries of $\Sigma$ are all different), which attains the maximum value of the incoherence constant. On the contrary, the core features usually exhibit certain correlation structures, and the corresponding eigenspace of the covariance matrix is expected to have a smaller incoherence constant.
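As a small illustration, the incoherence constant (12) is simply the largest squared row norm of $U$:

```python
import numpy as np

def incoherence_constant(U):
    """I(U) = max_i ||e_i^T U||^2 for U in O_{d,r}, Equation (12)."""
    return float(np.max(np.sum(U ** 2, axis=1)))   # e_i^T U is the i-th row of U
```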

We then introduce a few assumptions on which our theoretical results are built. Recall that in the spiked covariance model (5), $x=U^{\star}z+\xi$, $\mathrm{Cov}(z)=\nu^{2}I_{r}$, and $\mathrm{Cov}(\xi)=\operatorname{diag}(\sigma_{1}^{2},\cdots,\sigma_{d}^{2})$.

Assumption 3.4 (Regular Covariance Condition)

The condition number of the covariance matrix $\Sigma=\operatorname{diag}(\sigma_{1}^{2},\cdots,\sigma_{d}^{2})$ satisfies $\kappa:=\sigma_{(1)}^{2}/\sigma_{(d)}^{2}<C$, where $\sigma_{(j)}^{2}$ denotes the $j$-th largest value among $\sigma_{1}^{2},\cdots,\sigma_{d}^{2}$ and $C>0$ is a universal constant.

Assumption 3.5 (Signal-to-Noise Ratio Condition)

Define the signal-to-noise ratio $\rho:=\nu/\sigma_{(1)}$. We assume $\rho=\Theta(1)$, implying that the covariance of the noise is of the same order as that of the core features.

Assumption 3.6 (Incoherence Condition)

The incoherence constant of the core feature matrix $U^{\star}\in\mathbb{O}_{d,r}$ satisfies $I(U^{\star})=O(r\log d/d)$.

The incoherence constant often appears in the literature on matrix completion (Candès and Recht, 2009) and PCA (Zhang et al., 2018). The order of $I(U^{\star})$ can be arbitrary as long as it decreases to $0$ as $d\rightarrow\infty$; one can directly adapt the later results to such settings. If $U$ is distributed uniformly on $\mathbb{O}_{d,r}$, then the expectation of the incoherence constant is of order $r\log d/d$.

Lemma 3.7 (Expectation of the incoherence constant over a uniform distribution)
\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)=O\left(\frac{r}{d}\log d\right). \qquad (13)

Thus, we set $I(U^{\star})$ to be of order $r\log d/d$ for simplicity. The proof is given in Appendix B.6. Here we provide a remark on the implications of our assumptions; a further discussion on how to generalize our main results under weaker assumptions is deferred to Remark 3.11.

Remark 3.8

The three assumptions above can be explained as follows. Assumption 3.4 implies that the variances of all dimensions are of the same order. For Assumption 3.5, we focus on a large-noise regime where the noise may hurt the estimation significantly. Here we assume the ratio lies in a constant range, but our theory easily adapts to the case where $\rho$ has a decreasing order. Specifically, for Theorems 3.9, 3.10, 3.13 and 3.14 presented below, we derive an explicit dependence on $\rho$ for each result in the appendix; see Equations (46), (LABEL:dependence_of_rho_CL), (59), (60), (61) and (62) for details. Assumption 3.6 implies a stronger correlation among the coordinates of the core features, which is the essential property distinguishing them from random noise.

Now we are ready to present our first result, showing that autoencoders are unable to recover the core features in the large-noise regime. Due to the equivalence among PCA, autoencoders, and GANs presented in Section 3.1, for brevity we only focus on autoencoders hereafter.

Theorem 3.9 (Recovery Ability of Autoencoders, Lower Bound)

Consider the spiked covariance model (5) under Assumptions 3.4-3.6 with $n>d\gg r$, and let $W_{\text{AE}}$ be the learned representation of autoencoders with singular value decomposition $W_{\text{AE}}=(U_{\text{AE}}\Sigma_{\text{AE}}V_{\text{AE}}^{\top})^{\top}$ (as in Equation (7)). Further assume that the $\{\sigma_{i}^{2}\}_{i=1}^{d}$ are all distinct and that $\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma}$ for some universal constant $C_{\sigma}$. Then there exist two universal constants $C_{\rho}>0$ and $c\in(0,1)$ such that, when $\rho<C_{\rho}$, we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{\text{AE}}\right)\right\|_{F}\geq c\sqrt{r}. \qquad (14)

The proof is given in Appendix B.7. The condition $d\gg r$ means that there exists a sufficiently small constant $c>0$, independent of $d$ and $r$, such that $r/d<c$ holds. The additional assumptions that the $\{\sigma_{i}^{2}\}_{i=1}^{d}$ are all distinct and that $\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma}$ for some universal constant $C_{\sigma}$ are made to ensure the identifiability of the top-$r$ eigenspace; we need them to guarantee the uniqueness of $U_{\text{AE}}$. As an extreme example, the top-$r$ eigenspace of the identity matrix can be any $r$-dimensional subspace and is thus not unique. To avoid discussing such arbitrariness of the output, we make these assumptions to guarantee the separability of the eigenspace.

Then we investigate the feature recovery ability of the self-supervised contrastive learning approach.

Theorem 3.10 (Recovery Ability of Contrastive Learning, Upper Bound)

Under the spiked covariance model (5), with the random masking augmentation in Definition 2.2, Assumptions 3.4-3.6, and $n>d\gg r$, let $W_{\text{CL}}$ be any solution that minimizes Equation (3), and denote its singular value decomposition by $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$. Then we have

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{\text{CL}}\right)\right\|_{F}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}. \qquad (15)

The proof is given in Appendix B.9. The two terms in Equation (15) can be explained as follows: the first term is due to the shift between the distributions of the augmented data and the original data. Specifically, the random masking augmentation generates two views with disjoint nonzero coordinates and thus mitigates the influence of random noise on the diagonal entries of the covariance matrix. However, such augmentation slightly hurts the estimation of the core features. This bias, appearing as the first term in Equation (15), is measured by the incoherence constant defined in Equation (12). The second term corresponds to the estimation error of the population covariance matrix.

Theorems 3.9 and 3.10 characterize the difference in feature recovery ability between autoencoders and contrastive learning. Autoencoders fail to recover most of the core features in the large-noise regime, since $\|\sin\Theta(U,U^{\star})\|_{F}$ has the trivial upper bound $\sqrt{r}$. In contrast, with the help of data augmentation, the contrastive learning approach mitigates the corruption by random noise while preserving the core features. As $n$ and $d$ increase, it yields a consistent estimator of the core features and further leads to better performance on in-domain downstream tasks, as shown in the next section.
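A self-contained toy simulation in the spirit of Theorems 3.9 and 3.10 (synthetic Gaussian data only; the noise variances and signal strength below are arbitrary choices meant to roughly match the assumptions): on such data the autoencoder estimate stays close to the trivial bound $\sqrt{r}$, while the diagonal-deletion estimate of Corollary 3.2 is markedly closer to $U^{\star}$ and improves as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r, nu = 20000, 200, 5, 1.0

# heteroskedastic noise: the r largest variances are distinct and well separated
# from the rest, so the top-r eigenspace of Sigma is identifiable (cf. Theorem 3.9)
sig2 = np.concatenate([4.0 + 0.1 * np.arange(r), 1.0 + 1e-3 * np.arange(d - r)])

U_star = np.linalg.qr(rng.standard_normal((d, r)))[0]
X = U_star @ (nu * rng.standard_normal((r, n))) \
    + np.sqrt(sig2)[:, None] * rng.standard_normal((d, n))

def top_r(M, r):
    return np.linalg.eigh(M)[1][:, ::-1][:, :r]

def sine_dist(U1, U2):
    return np.sqrt(max(U1.shape[1] - np.linalg.norm(U1.T @ U2, "fro") ** 2, 0.0))

G = X @ X.T
U_ae = top_r(G, r)                                       # autoencoders / PCA, Equation (7)
s = X.sum(axis=1)                                        # X 1_n, so X 1_n 1_n^T X^T = s s^T
M_cl = G - np.diag(np.diag(G)) - (np.outer(s, s) - G) / (n - 1)
U_cl = top_r(M_cl, r)                                    # contrastive learning, Equation (11)

print("trivial bound sqrt(r):", np.sqrt(r))
print("sin distance, AE     :", sine_dist(U_star, U_ae))
print("sin distance, CL     :", sine_dist(U_star, U_cl))
```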

Remark 3.11

Here we discuss the potential generalization of our results to settings with weaker assumptions. Intuitively speaking, the random masking augmentation exploits the prior knowledge that the core features in the original signal are more structured across different coordinates than the random noise. Thus the essential requirements are:

  1. Noise is less correlated between different coordinates compared with the core features.

  2. Core features and noise are very different, i.e., $\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}$ is large.

These two requirements correspond to the diagonal assumption on $\Sigma$ and the incoherence assumption on $U^{\star}$ (Assumption 3.6). In particular, the latter gives a lower bound for $\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}$ when $\Sigma$ is diagonal and heteroskedastic. For more general $\Sigma$ and $U^{\star}$, it suffices to assume $\|\Delta(\Sigma)\|_{2}/\nu^{2}=o(1)$ and $\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}=\Omega(\sqrt{r})$, and we can still draw a similar comparison under these assumptions. Notice that, by Lemma 3.7, when $U^{\star}$ is chosen randomly we immediately have $\mathbb{E}\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}=(1-o(1))\sqrt{r}$. Similar arguments apply to all of the later results in this paper, and we omit them for simplicity.

Remark 3.12

A similar random masking augmentation (as in Definition 2.2) can also be applied to autoencoders. Although directly applying this augmentation would not work as well, since it does not affect the optimal solution (see the discussion in Appendix C), an alternative strategy is to reconstruct the whole data from the masked one. This method was originally proposed as denoising autoencoders (DAEs) for general augmentations (Vincent et al., 2008), and was proven to be powerful with masking augmentation in a recently proposed representation learning method, masked autoencoders (MAEs) (He et al., 2021). DAEs are a variant of autoencoders trained to reconstruct the original image from randomly masked patches. It has been found that DAEs (especially MAEs) outperform other self-supervised methods like MoCo v3, DINO, and BEiT after fine-tuning (He et al., 2021).

More specifically, under the same setup described in Section 2, let $A$ be the random masking augmentation defined in Definition 2.2. We adopt symmetric linear encoders and decoders. Given samples $X=[x_{1},\dots,x_{n}]\in\mathbb{R}^{d\times n}$, we formally define the loss minimization problem of DAEs as follows. (In He et al. (2021), the loss function is computed on masked coordinates only, but as the authors noted, “This choice is purely result-driven: computing the loss on all pixels leads to a slight decrease in accuracy (e.g., 0.5%).” Hence we analyze the loss with respect to all coordinates for simplicity.)

\min_{W\in\mathbb{R}^{r\times d}:WW^{\top}=2I_{r}}\frac{1}{n}\mathbb{E}_{A}\left[\|W^{\top}WAX-X\|_{F}^{2}\right]. \qquad (16)

Notice that DAEs may not preserve the norm of the input, since $\mathbb{E}_{A}[\|Ax_{i}\|^{2}]=(1/2)\|x_{i}\|^{2}$. As a result, we optimize the loss under the scaled constraint $WW^{\top}=2I_{r}$.

Then, we claim that under the same conditions as in Theorem 3.10, DAEs behave similarly to contrastive learning. Let $W_{\text{DAE}}$ be any solution that minimizes Equation (16), and denote its singular value decomposition by $W_{\text{DAE}}=(U_{\text{DAE}}\Sigma_{\text{DAE}}V_{\text{DAE}}^{\top})^{\top}$; then we have (the proof is given in Appendix C.2)

\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{\text{DAE}}\right)\right\|_{F}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}. \qquad (17)

From Theorem 3.9, we know that in high-dimensional settings with large sample sizes, DAEs (or masked autoencoders) significantly outperform classic autoencoders. Moreover, comparing with Theorem 3.10, the upper bound for DAEs is the same as that of contrastive learning with random masking augmentation. We also provide experimental results on synthetic datasets to verify this result in Appendix C.2. Although the two methods have similar performance in our linear representation framework, because both exploit masked views to eliminate noise, differences could arise from other aspects such as network architecture and training algorithms; for example, He et al. (2021) used a vision Transformer (Dosovitskiy et al., 2020) while Chen et al. (2020a) used a ResNet (He et al., 2016).

3.4 Performance on In-Domain Downstream Tasks

In the previous section, we have seen that contrastive learning can recover the core features effectively. In practice, we are interested in using the learned features on in-domain downstream tasks. He et al. (2020) experimentally showed the overwhelming performance, on such in-domain downstream tasks, of linear classifiers trained on representations learned with contrastive learning compared with several supervised learning methods.

Following this recent success, here we evaluate the in-domain downstream performance of simple predictors that take a linear transformation of the representation as input. Let $W_{\text{CL}}$ and $W_{\text{AE}}$ be the learned representations based on the training data $X\in\mathbb{R}^{d\times n}$. We observe a new signal $\check{x}=U^{\star}\check{z}+\check{\xi}$, independent of $X$, following the spiked covariance model (5). For simplicity, assume $\check{z}$ follows $N(0,\nu^{2}I_{r})$ and is independent of $\check{\xi}$. We consider two major types of in-domain downstream tasks: classification and regression. For the binary classification task, we observe a new supervised sample $\check{y}$ following the binary response model:

\check{y}\,|\,\check{z}\sim\text{Ber}(F(\langle\check{z},w^{\star}\rangle/\nu)), \qquad (18)

where $F:\mathbb{R}\to[0,1]$ is a known monotone increasing function satisfying $1-F(u)=F(-u)$ for any $u\in\mathbb{R}$, and $w^{\star}\in\mathbb{R}^{r}$ is a unit vector of coefficients. Notice that our model (18) includes the logistic model (when $F(u)=1/(1+e^{-u})$) and the probit model (when $F(u)=\Phi(u)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution). We can also interpret model (18) as a shallow neural network model of width $r$ for binary classification. For the regression task, we observe a new supervised sample $\check{y}$ following the linear regression model:

\check{y}=\langle\check{z},w^{\star}\rangle/\nu+\check{\epsilon}, \qquad (19)

where $\check{\epsilon}\sim(0,\sigma_{\epsilon}^{2})$ is independent of $\check{z}$, and $w^{\star}\in\mathbb{R}^{r}$ is a unit vector of coefficients as before. We can interpret this model as a principal component regression (PCR) model (Jolliffe, 1982) under standard error-in-variables settings, where we assume that the coefficients lie in a low-dimensional subspace spanned by the column vectors of $U^{\star}$. (In error-in-variables settings, the bias term of the measurement error appears in the prediction and estimation risk. Since our focus lies in proving better performance of contrastive learning against autoencoders, we ignore the unavoidable bias term here by considering the excess risk.) We either estimate or predict the signal based on the observed samples contaminated by the measurement error $\check{\xi}$. For details of PCR in error-in-variables settings, see, for example, Ćevid et al. (2020); Agarwal et al. (2020); Bing et al. (2021).

In the classification setting, we use the 0-1 loss, that is, $\ell_{c}(\delta)\triangleq\mathbb{I}\{\check{y}\neq\delta(\check{x})\}$ for a predictor $\delta$ taking values in $\{0,1\}$. For the regression task, we employ the squared error loss $\ell_{r}(\delta)\triangleq(\check{y}-\delta(\check{x}))^{2}$. Based on a learned representation $W$, we consider a class of linear predictors, namely $\delta_{W,w}(\check{x})\triangleq\mathbb{I}\{F(w^{\top}W\check{x})\geq 1/2\}$ for the classification task and $\delta_{W,w}(\check{x})\triangleq w^{\top}W\check{x}$ for the regression task, where $w\in\mathbb{R}^{r}$ is a weight vector. Note that the learned representation depends only on the unsupervised samples $X$. Let $\mathbb{E}_{\mathcal{D}}[\cdot]$ and $\mathbb{E}_{\mathcal{E}}[\cdot]$ denote the expectations with respect to $(X,Z)$ and $(\check{y},\check{x},\check{z})$, respectively.

Our goal, as stated above, is to bound the prediction risk of the predictors $\{\delta_{W,w}:w\in\mathbb{R}^{r}\}$ constructed upon the learned representations $W_{\text{CL}}$ and $W_{\text{AE}}$, that is, the quantities $\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell(\delta_{W_{\text{CL}},w})]$ and $\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell(\delta_{W_{\text{AE}},w})]$.
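A Monte Carlo sketch of the regression excess risk for a given learned representation (a hypothetical helper, not part of the paper's experiments): the infimum over $w$ is approximated by ordinary least squares on a large fresh sample drawn from models (5) and (19) with Gaussian signal and noise.

```python
import numpy as np

def regression_excess_risk(W, U_star, nu, sig, w_star, n_test, rng):
    """Estimate inf_w E[l_r(delta_{W,w})] - inf_w E[l_r(delta_{U*^T,w})] by simulation.

    W:       (r, d) learned representation
    U_star:  (d, r) true feature matrix
    sig:     (d,) noise standard deviations
    w_star:  (r,) unit coefficient vector in model (19)
    """
    d, r = U_star.shape
    Z = nu * rng.standard_normal((r, n_test))
    X = U_star @ Z + sig[:, None] * rng.standard_normal((d, n_test))
    y = (w_star @ Z) / nu + rng.standard_normal(n_test)    # sigma_eps = 1 for illustration

    def best_risk(rep):                                    # rep maps x -> rep @ x
        F = (rep @ X).T                                    # (n_test, r) features
        w, *_ = np.linalg.lstsq(F, y, rcond=None)          # approximates the infimum over w
        return np.mean((y - F @ w) ** 2)

    return best_risk(W) - best_risk(U_star.T)
```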

Now we state our results on the performance of the in-domain downstream prediction task.

Theorem 3.13 (Excess Risk for In-Domain Downstream Task: Upper Bound)

Suppose the conditions in Theorem 3.10 hold. Then, for the regression task, we have

\mathbb{E}_{\mathcal{D}}\Bigl[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\Bigr]\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}},

and for the classification task,

\mathbb{E}_{\mathcal{D}}\Bigl[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star\top},w})]\Bigr]=O\left(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}\right)\wedge 1.

The proofs are given in Appendix B.11.

This result shows that the price of estimating $U^{\star}$ by contrastive learning on an in-domain downstream prediction task can be made small when the core features lie in a relatively low-dimensional subspace and the number of samples is relatively large compared to the ostensible dimension of the data.

However, the in-domain downstream performance of autoencoders is not as good as that of contrastive learning. We obtain the following lower bound on the in-domain downstream prediction risk for autoencoders.

Theorem 3.14 (Excess Risk for In-Domain Downstream Task: Lower Bound)

Suppose the conditions in Theorem 3.9 hold, and assume $r\leq r_{c}$ for some constant $r_{c}>0$. Additionally assume that $\rho=\Theta(1)$ is sufficiently small and $n\gg d\gg r$. Then, for the regression task,

\mathbb{E}_{\mathcal{D}}\Bigl[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star},w})]\Bigr]\geq c_{c}^{\prime},

and for the classification task, if $F$ is differentiable at $0$ and $F^{\prime}(0)>0$, then

\mathbb{E}_{\mathcal{D}}\Bigl[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star},w})]\Bigr]\geq c_{r}^{\prime},

where $c_{r}^{\prime}>0$ and $c_{c}^{\prime}>0$ are constants independent of $n$ and $d$.

The proof is given in Appendix B.11. The condition $n\gg d\gg r$ means that there exists a sufficiently small constant $c>0$, independent of $n$, $d$ and $r$, such that $d/n\vee r/d<c$ holds.

The constants $c_{r}^{\prime}$ and $c_{c}^{\prime}$ appearing in Theorem 3.14 are independent of $d$ and $n$. Thus, when $d$ is sufficiently large compared to $r$ and $d/n$ is small, the upper bound on the in-domain downstream task performance of contrastive learning in Theorem 3.13 is smaller than the lower bound on the in-domain downstream task performance of autoencoders. The assumption $r\leq r_{c}$ in Theorem 3.14 is made for clarity of presentation. Using the same techniques as in the proof of Theorem 3.14, one can obtain a constant lower bound for autoencoders under slightly stronger assumptions, for example $\rho^{2}=O(1/\log d)$ with $n\gg dr$, without assuming $r\leq r_{c}$; our theory can be adapted to either set of assumptions. These results support the empirical success of contrastive learning.

4 The Impact of Labeled Data in Supervised Contrastive Learning

Recent works have explored adding label information to improve contrastive learning (Khosla et al., 2020). Empirical results show that label information can significantly improve the accuracy on in-domain downstream tasks. However, when domain shift is considered, the label information hardly improves and can even hurt transferability (Islam et al., 2021). For example, in Table 2 of Khosla et al. (2020) and the first column in Table 4 of Islam et al. (2021), supervised contrastive learning shows a significant improvement, with a 7%-8% accuracy increase on in-domain downstream classification on ImageNet and Mini-ImageNet. On the contrary, in Table 4 of Khosla et al. (2020) and Table 4 of Islam et al. (2021), supervised contrastive learning hardly increases the predictive accuracy compared to self-supervised contrastive learning (the difference in mean accuracy is less than 1%) and can hurt significantly on some datasets (e.g., 5.5% lower for SUN 397 in Table 4 of Khosla et al. (2020)). These results indicate that some mechanism in supervised contrastive learning hurts model transferability, even though the improvement on source tasks is significant. Moreover, in Table 4 of Islam et al. (2021), it is observed that combining supervised learning and self-supervised contrastive learning achieves better transfer learning performance than either of them individually. Motivated by these empirical observations, in this section we investigate the impact of labeled data in contrastive learning and provide a theoretical foundation for these phenomena.

4.1 Feature Mining in Multi-Class Classification

We first demonstrate the impact of labels in contrastive learning under the standard single-source (i.e., no transfer learning) setting. Suppose our samples are drawn from $r+1$ different classes, with probability $p_{k}$ for class $k\in[r+1]$ and $\sum_{k=1}^{r+1}p_{k}=1$. For each class, samples are generated from a class-specific Gaussian distribution:

x^{k}=\mu^{k}+\xi^{k},\quad\xi^{k}\sim\mathcal{N}(0,\Sigma^{k}),\quad\forall k=1,2,\cdots,r+1. \qquad (20)

We assume the norms of the $\mu^{k}$, $k\in[r+1]$, are of the same order; that is, denoting $\nu=\|\mu^{1}\|/\sqrt{r}$, we have $\|\mu^{k}\|=O(\sqrt{r}\nu)$ for all $k\in[r+1]$. We further assume $\Sigma^{k}=\operatorname{diag}(\sigma_{1,k}^{2},\cdots,\sigma_{d,k}^{2})$, denote $\sigma_{(1)}^{2}=\max_{1\leq i\leq d,1\leq j\leq r+1}\sigma_{i,j}^{2}$, and assume $\sum_{k=1}^{r+1}p_{k}\mu^{k}=0$, where the last assumption is added to ensure identifiability, since the classification problem (20) is invariant under translation. Denote $\Lambda=\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}$; we assume $\operatorname{rank}(\Lambda)=r$ and $C_{1}\nu^{2}<\lambda_{(r)}(\Lambda)<\lambda_{(1)}(\Lambda)<C_{2}\nu^{2}$ for two universal constants $C_{1}$ and $C_{2}$. We remark that this model is a labeled version of the spiked covariance model (5), since the core features and random noise are both sub-Gaussian. We use $r+1$ classes to ensure that the $\mu^{k}$ span an $r$-dimensional space, and denote its orthonormal basis by $U^{\star}$. Recall that our target is to recover $U^{\star}$.
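One simple way to simulate the labeled model (20) is sketched below (a hypothetical helper with homoskedastic noise and equal class probabilities; the class means are rescaled and then centered so that $\sum_{k}p_{k}\mu^{k}=0$, so their norms only approximately equal $\sqrt{r}\nu$).

```python
import numpy as np

def gaussian_mixture_data(m, d, r, nu, sig, rng):
    """Draw m samples per class from model (20) with K = r + 1 equally likely classes.

    Class means are random, scaled to norm sqrt(r) * nu, then centered so that
    sum_k p_k mu^k = 0; noise is Gaussian with covariance sig^2 * I (illustration only).
    """
    M = rng.standard_normal((d, r + 1))
    M = M / np.linalg.norm(M, axis=0) * (np.sqrt(r) * nu)   # ||mu^k|| ~ sqrt(r) * nu
    M = M - M.mean(axis=1, keepdims=True)                    # enforce the zero-mean constraint
    X, labels = [], []
    for k in range(r + 1):
        X.append(M[:, [k]] + sig * rng.standard_normal((d, m)))
        labels += [k] * m
    return np.hstack(X), np.array(labels), M                 # data (d, (r+1)m), labels, means
```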

Remark 4.1

Here we focus on explaining the impact of label information in the SupCon algorithm (Khosla et al., 2020). SupCon is designed for multi-class classification tasks and requires class labels to find positive samples. In that case, labels generated as a linear function of the latent features cannot serve as class labels in a multi-class classification setting. Hence we propose to use the Gaussian mixture model (20) to generate class labels while staying as consistent as possible with the model used earlier (i.e., the spiked covariance model (5)).

As introduced in Definition 2.4, the supervised contrastive learning introduced by Khosla et al. (2020) allows us to generate contrastive pairs using labeled information and discriminate instances across classes. When we have both labeled data and unlabeled data, we can perform contrastive learning based on pairs that are generated separately for the two types of data.

Data Generating Process

Formally, consider the case where we draw $n$ samples as unlabeled data $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ from the Gaussian mixture model (20) with $p_{1}=p_{2}=\cdots=p_{r+1}$. For the labeled data, we draw $(r+1)m$ samples, $m$ for each of the $r+1$ classes, and denote them by $\hat{X}=[\hat{x}_{1},\cdots,\hat{x}_{(r+1)m}]\in\mathbb{R}^{d\times(r+1)m}$. We discuss this balanced case for simplicity; more general versions that allow different sample sizes for each class are considered in Theorem D.2 (in the appendix). We study the following hybrid loss to illustrate how the label information helps promote performance over self-supervised contrastive learning:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)+\alpha\mathcal{L}_{\text{SupCon}}(W), (21)

where $\alpha>0$ is the ratio between the supervised and self-supervised contrastive losses. We consider this generalized hybrid loss to show the benefit of exploiting additional unlabeled data; choosing $\alpha\rightarrow\infty$ recovers the original SupCon loss.

We first provide a high-level explanation of why label information can help learn the core features. When label information is unavailable, no matter how much (unlabeled) data we have, we can only take each sample and its augmented views as positive pairs. In such a scenario, performing augmentation leads to an unavoidable trade-off between estimation bias and accuracy. However, with additional class information, we can contrast data within the same class to extract more of the features that help distinguish a particular class from the others, and therefore reduce the bias.

Theorem 4.2

Suppose the labeled and unlabeled samples are generated by the process described above, Assumptions 3.4-3.6 hold, and $n>d\gg r$. Let $W_{\text{CL}}$ be any solution that minimizes the supervised contrastive learning problem in Equation (21), and denote its singular value decomposition by $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$. Then we have

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim\frac{1}{1+\alpha}\left(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}\right)+\frac{\alpha}{1+\alpha}\sqrt{\frac{dr}{m}}.

The proof is given in Appendix D.2.

Corollary 4.3

From Theorem 4.2, it directly follows that when we have $m$ labeled data for each class and no unlabeled data ($\alpha\to\infty$),

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim\sqrt{\frac{dr}{m}}.

The first bound in Theorem 4.2 demonstrates how the effect of labeled data changes with the ratio $\alpha$ in the hybrid loss in Equation (21). In addition, compared with Theorem 3.10, the second bound in Theorem 4.2 indicates that when labeled data are available, supervised contrastive learning yields a consistent estimate as $m\rightarrow\infty$, whereas self-supervised contrastive learning carries an irreducible bias term $O(r^{3/2}\log d/d)$. At a high level, label information improves accuracy by creating more positive samples for a single anchor and therefore extracting more decisive features. One caveat is that when labeled data are extremely rare compared to unlabeled data, the supervised contrastive estimate suffers from high variance; in that case, self-supervised contrastive learning, which can exploit a much larger number of samples, may outperform it.

4.2 Information Filtering in Multi-Task Transfer Learning

In this section, we show that the theoretical tools developed in this paper can illustrate the role of label information when contrastive learning is used in the transfer learning setting. Label information identifies what is beneficial for the in-domain downstream task, so learning with labeled data filters out useless information and preserves the decisive parts of the core features. In transfer learning, however, label information is sometimes found to hurt the performance of contrastive learning. For example, in Table 4 of Islam et al. (2021), while supervised contrastive learning gains 8% improvement on source tasks by incorporating label information, it improves generalization to new datasets by only 1% on average and can even hurt on some datasets. This observation implies that label information in contrastive learning plays very different roles for generalization on source tasks and on new tasks. In this section, we consider two regimes of transfer learning, in which the tasks are insufficient or abundant. In both regimes, we provide theory to support the empirical observations and further demonstrate how to wisely combine supervised and self-supervised contrastive learning to avoid these harms and achieve better performance. Specifically, we consider a transfer learning problem in both a regression setting and a binary classification setting. Suppose we have $T$ source tasks which share a common data generative model (5). For the $t$-th task, the labels are generated by $y^{t}=\langle w_{t},z\rangle/\nu$ in the regression setting and by $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle)$ in the binary classification setting, where $w_{t}\in\mathbb{R}^{r}$ is a unit vector varying across tasks. The two settings share the same distribution for $x$ and differ only in the way the labels $y^{t}$ are generated.

To incorporate label information, we maximize the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005; Barshan et al., 2011), which has been widely used in the literature (Song et al., 2007a, b, c; Barshan et al., 2011).

4.2.1 Hilbert-Schmidt Independence Criterion

Gretton et al. (2005) proposed the Hilbert-Schmidt Independence Criterion (HSIC) to measure the dependence between two random variables. It computes the Hilbert-Schmidt norm of the cross-covariance operator associated with their Reproducing Kernel Hilbert Spaces (RKHS). This measure has been widely used as a supervised loss function in feature selection (Song et al., 2007c), feature extraction (Song et al., 2007a), clustering (Song et al., 2007b), and supervised PCA (Barshan et al., 2011).

The basic idea behind HSIC is that two random variables are independent if and only if any bounded continuous functions of the two random variables are uncorrelated. Let $\mathcal{F}$ be a separable RKHS containing all continuous bounded real-valued functions mapping from $\mathcal{X}$ to $\mathbb{R}$, and let $\mathcal{G}$ be the analogous space of maps from $\mathcal{Y}$ to $\mathbb{R}$. For each point $x\in\mathcal{X}$, there exists a corresponding element $\phi(x)\in\mathcal{F}$ such that $\left\langle\phi(x),\phi\left(x^{\prime}\right)\right\rangle_{\mathcal{F}}=k\left(x,x^{\prime}\right)$, where $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a unique positive definite kernel. Likewise, define the kernel $l(\cdot,\cdot)$ and feature map $\psi$ for $\mathcal{G}$. The empirical HSIC is defined as follows.

Definition 4.4 (Empirical HSIC (Gretton et al., 2005))

Let $Z:=\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\}\subseteq\mathcal{X}\times\mathcal{Y}$ be a series of $m$ independent and identically distributed observations. An estimator of HSIC, written as $\operatorname{HSIC}(Z,\mathcal{F},\mathcal{G})$, is given by

\operatorname{HSIC}(Z,\mathcal{F},\mathcal{G}):=(m-1)^{-2}\operatorname{tr}(KHLH),

where $H,K,L\in\mathbb{R}^{m\times m}$, $K_{ij}:=k\left(x_{i},x_{j}\right)$, $L_{ij}:=l\left(y_{i},y_{j}\right)$, and $H:=I_{m}-(1/m)1_{m}1_{m}^{\top}$.
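For illustration, the following sketch computes the empirical HSIC of Definition 4.4; with the linear kernels $K\leftarrow X^{\top}W^{\top}WX$ and $L\leftarrow yy^{\top}$ used below, it reduces to the supervised loss in Equation (22). The shapes and variable names are assumptions made for the example.

```python
import numpy as np

# Empirical HSIC of Definition 4.4: (m-1)^{-2} tr(K H L H).
def empirical_hsic(K, L):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix H = I - (1/m) 1 1^T
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def linear_kernel_hsic(X, y, W):
    # X: d x m data matrix, y: length-m label vector, W: r x d linear representation.
    K = X.T @ W.T @ W @ X                        # k(x_i, x_j) = <W x_i, W x_j>
    L = np.outer(y, y)                           # l(y_i, y_j) = y_i y_j
    return empirical_hsic(K, L)

rng = np.random.default_rng(0)
X, y, W = rng.standard_normal((8, 50)), rng.standard_normal(50), rng.standard_normal((3, 8))
print(linear_kernel_hsic(X, y, W))
```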

In our setting, we aim to maximize the dependency between the learned features $WX\in\mathbb{R}^{r\times n}$ and the labels $y\in\mathbb{R}^{n}$ via HSIC. Substituting $K\leftarrow X^{\top}W^{\top}WX$ and $L\leftarrow yy^{\top}$, we obtain our supervised loss for the representation matrix $W$:

\operatorname{HSIC}(X,y;W)=\frac{1}{(n-1)^{2}}\tr(X^{\top}W^{\top}WXHyy^{\top}H). (22)

A more commonly used supervised loss in the regression task is the mean squared error. Here we explain the equivalence between maximizing HSIC with the penalty $\|WW^{\top}\|_{F}^{2}$ and minimizing the mean squared error in the regression task.

Recall that in the contrastive learning framework, we first learn the representation via a linear transformation and then perform linear regression to learn a predictor on the learned representation. Consider the mean squared error $\mathcal{L}(\delta)=(1/n)\sum_{i=1}^{n}(\delta(x_{i})-y_{i})^{2}$, where $\delta(x_{i})$ is a predictor of $y_{i}$, and consider the linear class of predictors $\delta_{W,w}(x_{i})=w^{\top}Wx_{i}$ with parameter $w\in\mathbb{R}^{r}$. Assume that both $X$ and $y$ are centered. For any fixed representation $W$, the minimum mean squared error is given by

\min_{w\in\mathbb{R}^{r}}\mathcal{L}(\delta_{W,w})=\frac{1}{n}\|(WX)^{\top}w^{\star}-y\|^{2}
=\frac{1}{n}\left(y^{\top}y-\tr(X^{\top}W^{\top}(WXX^{\top}W^{\top})^{-1}WXyy^{\top})\right),

where $w^{\star}=(WXX^{\top}W^{\top})^{-1}WXy$. Ignoring the constant term $y^{\top}y$, the only essential difference between the minimization problem $\min_{w\in\mathbb{R}^{r}}\mathcal{L}(\delta_{W,w})$ and maximizing HSIC in Equation (22) is the normalization term $(WXX^{\top}W^{\top})^{-1/2}$ for $W$. Thus, minimizing the mean squared error is equivalent to maximizing HSIC with the regularization term $\|WW^{\top}\|_{F}^{2}$. Since $\mathcal{L}_{\text{SelfCon}}$ contains the regularization term $\|WW^{\top}\|_{F}^{2}$, we can jointly use $\mathcal{L}_{\text{SelfCon}}$ and HSIC as a surrogate for the standard regression error while avoiding singularity of the learned representation $W$.
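The following short sketch checks the displayed identity numerically: for a fixed $W$, the least-squares error over $w$ matches the closed form involving the normalized trace term. The data and $W$ are arbitrary centered placeholders.

```python
import numpy as np

# Verify: min_w (1/n)||(WX)^T w - y||^2 = (1/n)(y^T y - tr(X^T W^T (W X X^T W^T)^{-1} W X y y^T)).
rng = np.random.default_rng(0)
d, r, n = 20, 5, 500
X = rng.standard_normal((d, n)); X -= X.mean(axis=1, keepdims=True)
y = rng.standard_normal(n); y -= y.mean()
W = rng.standard_normal((r, d))

Z = W @ X                                       # r x n learned features
w_star = np.linalg.solve(Z @ Z.T, Z @ y)        # least-squares coefficient
mse = np.mean((Z.T @ w_star - y) ** 2)
closed_form = (y @ y - np.trace(
    X.T @ W.T @ np.linalg.solve(W @ X @ X.T @ W.T, W @ X @ np.outer(y, y)))) / n
print(mse, closed_form)                         # the two values coincide
```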

4.2.2 Main Results

First, we consider the regression setting. Before stating our results, we introduce some notation. Suppose we have $n$ unlabeled data $X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n}$ and, for each source task, $m$ labeled data $\hat{X}^{t}=[\hat{x}_{1}^{t},\cdots,\hat{x}_{m}^{t}]$ with labels $y^{t}=[y_{1}^{t},\cdots,y_{m}^{t}]$, $t=1,\dots,T$, where the $x_{i}$ and $\hat{x}_{j}^{t}$ are independently drawn from the spiked covariance model (5). We learn the linear representation via the joint optimization:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)-\alpha\sum_{t=1}^{T}\operatorname{HSIC}(\hat{X}^{t},y^{t};W), (23)

where $\alpha>0$ is a pre-specified ratio between the self-supervised contrastive loss and HSIC. A more general setting, where the ratio $\alpha$ and the number of labeled data for each source task are allowed to depend on $t$, is considered in the appendix; see Section D.2 for details. We now present a theorem showing how well the core feature space can be recovered by minimizing the hybrid loss function (23).

Theorem 4.5

In the regression setting where $y^{t}=\langle w_{t},z\rangle/\nu$, suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and $n>d\gg r$. Further assume that $\alpha>C$ for some constant $C$, $T<r$, and the $w_{t}$'s are orthogonal to each other. Let $W_{\text{CL}}$ be any solution of the optimization problem in Equation (23), and denote its singular value decomposition by $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$. Then we have:

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right). (24)

The proof is given in Appendix D.6.

Similar to Section 3.4, we can obtain an in-domain downstream task risk bound in the supervised contrastive learning setting. Consider a new test task whose label is generated by $\check{y}=\langle w^{\star},\check{z}\rangle/\nu$ with $\check{x}=U^{\star}\check{z}+\check{\xi}$. Recall that the loss in the in-domain downstream task is measured by the squared error $\ell_{r}(\delta):=(\check{y}-\delta(\check{x}))^{2}$. We obtain the following result.

Theorem 4.6

Suppose the conditions in Theorem 4.5 hold. Then,

\mathbb{E}_{\mathcal{D}}\left[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\right]\lesssim\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right). (25)

The proof is given in Appendix D.7.

In Theorems 4.5 and 4.6, as $\alpha$ goes to infinity (corresponding to the case where we only use the supervised loss), the upper bounds in Equations (24) and (25) reduce to $\sqrt{r-T}+T^{3/2}\sqrt{d/m}$, which is worse than the $r^{3/2}\log d/d$ rate obtained by self-supervised contrastive learning (Theorem 3.10). This implies that when the model focuses mainly on the supervised loss, the algorithm extracts only the information beneficial for the source tasks and fails to estimate the other parts of the core features. As a result, when the target task has a very different distribution, labeled data bring extra bias and therefore hurt transferability. Additionally, one can minimize the right-hand side of Equation (24) over $\alpha$ to obtain a sharper rate. Specifically, choosing an appropriate $\alpha$ makes the upper bound $\sqrt{r^{2}(r-T)}\log d/d$ (when $n,m\rightarrow\infty$), a smaller rate than that of self-supervised contrastive learning. These facts provide theoretical foundations for the recent empirical observation that smartly combining supervised and self-supervised contrastive learning achieves significant improvement in transferability compared with performing each of them individually (Islam et al., 2021).
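As an illustration of this trade-off, the sketch below evaluates the right-hand side of Equation (24) over a grid of $\alpha$, ignoring universal constants; the parameter values are arbitrary, and the point is only the qualitative U-shape with an interior minimizer when $T<r$.

```python
import numpy as np

# Shape of the bound in Equation (24) as a function of alpha (constants ignored).
d, r, T, n, m = 1000, 10, 4, 10**5, 10**4
alphas = np.logspace(-3, 3, 61)

def rhs(alpha):
    bias = r * np.log(d) / d + np.sqrt(d / n)
    return (np.sqrt(r - T) * (bias + min(alpha * T * np.sqrt(d / m), 1.0))
            + np.sqrt(T) * (bias / alpha + T * np.sqrt(d / m)))

values = np.array([rhs(a) for a in alphas])
print(alphas[values.argmin()])   # an interior alpha minimizes the bound
```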

Remark 4.7

A heuristic intuition for this surprising fact is that when the tasks are not diverse enough, supervised training will only focus on the features that help predict the labels of the source tasks and ignore other features. For example, suppose we have unlabeled images that contain cats or dogs, and the background can be sand or forest. If the source task focuses on classifying the background, supervised learning will not learn features associated with cats and dogs, while self-supervised learning can learn these features since they help discriminate between different images. As a result, although supervised learning helps to classify sand versus forest, it can hurt performance on the classification of dogs versus cats, and we should incorporate self-supervised contrastive learning to learn these features.

When the tasks are sufficiently abundant, estimation via labeled data can recover the core features completely. Similar to Theorems 4.5 and 4.6, we have the following results.

Theorem 4.8

In the regression setting where $y^{t}=\langle w_{t},z\rangle/\nu$, suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and $n>d\gg r$. Further assume that $T>r$ and $\lambda_{(r)}(\sum_{i=1}^{T}w_{i}w_{i}^{\top})>c$ for some constant $c>0$. Let $W_{\text{CL}}$ be the optimal solution of the optimization problem in Equation (23), and denote its singular value decomposition by $W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}$. Then we have:

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim\frac{\sqrt{r}}{\alpha+1}\left(\frac{r}{d}\log d+\sqrt{\frac{d}{n}}\right)+T\sqrt{\frac{dr}{m}}. (26)

The proof is given in Appendix D.8.

Theorem 4.9

Suppose the conditions in Theorem 4.8 hold. Then,

\mathbb{E}_{\mathcal{D}}\left[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\right]\lesssim\frac{\sqrt{r}}{\alpha+1}\left(\frac{r}{d}\log d+\sqrt{\frac{d}{n}}\right)+T\sqrt{\frac{dr}{m}}. (27)

The proof is given in Appendix D.9.

Theorems 4.8 and 4.9 show that in the case where tasks are abundant, as $\alpha$ goes to infinity (corresponding to the case where we use the supervised loss only), the upper bounds in Equations (26) and (27) reduce to $T\sqrt{rd/m}$. This rate can be worse than the $\sqrt{r^{3}}\log d/d+\sqrt{rd/n}$ rate obtained by self-supervised contrastive learning when $m$ is small. Recall that when the number of tasks is small, labeled data introduce an extra bias term $\sqrt{r-T}$ (Theorems 4.5 and 4.6). In contrast, when the tasks are abundant enough, the harm of labeled data is mainly due to the variance they introduce. When $m$ is sufficiently large, supervised learning on the source tasks yields a consistent estimate of the core features, whereas self-supervised contrastive learning cannot.

In extending our results from the regression to the binary classification setting, the only difference is the label generation process, and similar results follow with some modification of the proofs. The feature recovery bounds corresponding to Theorem 4.5 (insufficient tasks) and Theorem 4.8 (abundant tasks) are stated as follows:

Theorem 4.10

In the classification setting where $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle/\nu)$, suppose the conditions in Theorem 4.5 hold and $z$ in the spiked covariance model (5) follows a Gaussian distribution. Then we have:

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right). (28)
Theorem 4.11

In the classification setting where $y^{t}=\operatorname{sign}(\langle w_{t},z\rangle/\nu)$, suppose the conditions in Theorem 4.8 hold and $z$ in the spiked covariance model (5) follows a Gaussian distribution. Then we have:

\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\lesssim\frac{\sqrt{r}}{\alpha+1}\left(\frac{r}{d}\log d+\sqrt{\frac{d}{n}}\right)+T\sqrt{\frac{dr}{m}}. (29)

As the feature recovery bounds remain the same, the counterparts of the in-domain downstream task results, Theorems 4.6 and 4.9, in this classification setting follow immediately. For the sake of space, we defer the generalized versions and proofs to Theorems D.10 and D.11 in the appendix.

5 Numerical Experiments

5.1 Linear Model with Synthetic Data

To verify our theory, we conduct numerical experiments on the spiked covariance model (5) under the linear representation setting with the contrastive loss functions defined in (3) and (23). Since we have explicitly formulated the loss functions and derived their equivalent forms in the main body and the appendix, we simply minimize the corresponding loss by gradient descent to find the optimal linear representation $W$. For self-supervised contrastive learning with random masking augmentation, we independently draw the augmentation functions as in Definition 2.2 and apply them to all samples in each iteration. To ensure convergence, we set a large maximum number of iterations (typically 10000 or 50000, depending on the dimension $d$).
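A minimal sketch of this synthetic training loop is given below, using the random-masking form of the self-supervised contrastive loss derived in Appendix B.1; the masking probability of 1/2, the noise level, the learning rate, and the iteration count are illustrative assumptions rather than the exact configuration used for the reported results.

```python
import torch

# Gradient descent on the linear self-supervised contrastive loss, redrawing the
# random-masking augmentation at every iteration.
torch.manual_seed(0)
d, r, n, lam = 40, 5, 2000, 1.0

# Spiked covariance data x = U* z + xi (Gaussian core features and noise here).
U_star, _ = torch.linalg.qr(torch.randn(d, r))
X = U_star @ torch.randn(r, n) + 0.5 * torch.randn(d, n)

W = (0.1 * torch.randn(r, d)).requires_grad_()
opt = torch.optim.SGD([W], lr=1e-3)

def self_con_loss(W, X1, X2, lam):
    n = X1.shape[1]
    pos = (W @ X1 * (W @ X2)).sum() / n                     # positive-pair similarity
    S = W @ (X1 + X2)
    G = S.T @ S
    neg = (G.sum() - G.diag().sum()) / (4 * n * (n - 1))    # cross-sample (negative) term
    reg = 0.5 * lam * (W @ W.T).pow(2).sum()                # (lambda/2) ||W W^T||_F^2
    return reg - pos + neg

for it in range(5000):
    mask = (torch.rand(d, 1) < 0.5).float()                 # fresh random masking each step
    X1, X2 = mask * X, (1 - mask) * X
    opt.zero_grad()
    self_con_loss(W, X1, X2, lam).backward()
    opt.step()
# The row space of W now estimates the column space of U_star.
```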

We report two criteria to evaluate the quality of a representation: the in-domain downstream error and the sine distance. To obtain the sine distance for a learned representation $W$, we perform a singular value decomposition $W=(U\Sigma V^{\top})^{\top}$ and compute $\|\sin\Theta(U,U^{\star})\|_{F}$. To obtain the in-domain downstream task performance in the comparison between autoencoders and contrastive learning, we first draw $n$ labeled data from the spiked covariance model (5) with labels generated as in Section 3.4; we then train the model using the data without labels to obtain the linear representation $W$, learn a linear predictor $w$ using the data with labels, and compute the regression error. In the transfer learning setting, we draw labeled data from the source tasks and additional unlabeled data, with $m=1000$ labeled and $n=1000$ unlabeled samples; we train on them to obtain the linear representation $W$, and then draw labeled data from a new task to learn a linear predictor $w$ and compute the regression error. In particular, we subtract the optimal regression error obtained by the best representation $U^{\star\top}$ from each regression error and report the difference, i.e., the excess risk, as the in-domain downstream performance.
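The sine-distance criterion can be computed directly from the learned representation as sketched below; here W and U_star are random placeholders standing in for the outputs of the training code.

```python
import numpy as np

# Sine distance between the learned subspace (from the SVD of W) and U*.
def sin_theta_distance(W, U_star):
    U = np.linalg.svd(W.T, full_matrices=False)[0]   # d x r orthonormal basis, W = (U S V^T)^T
    r = U_star.shape[1]
    # ||sin Theta(U, U*)||_F^2 = r - ||U^T U*||_F^2 (Proposition A.1 in the appendix)
    return np.sqrt(max(r - np.linalg.norm(U.T @ U_star) ** 2, 0.0))

rng = np.random.default_rng(0)
d, r = 40, 5
U_star = np.linalg.qr(rng.standard_normal((d, r)))[0]
W = rng.standard_normal((r, d))
print(sin_theta_distance(W, U_star))
```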

The results are reported in Figures 1 and 2 and Tables 2 and 3. As predicted by Theorems 3.9 and 3.10, the feature recovery error and in-domain downstream task risk of contrastive learning decrease as $d$ increases (Fig. 1, left) and as $n$ increases (Fig. 1, center), while those of autoencoders are insensitive to changes in $d$ and $n$. Consistent with our theory, in Fig. 1 (right), when tasks are not abundant, the transfer performance exhibits a U-shaped curve, and the best result is achieved by choosing an appropriate $\alpha$. When tasks are abundant and labeled data are sufficient, the error remains small even for large $\alpha$.

Figure 1: The vertical axes indicate the downstream regression error. We subtract the regression error of the ground-truth features to measure the excess error. Left: comparison of in-domain downstream task performance between contrastive learning and autoencoders against the dimension $d$; the sample size is $n=20000$. Center: the same comparison against the sample size $n$; the dimension is $d=40$. Right: in-domain downstream task performance in transfer learning against the penalty parameter $\alpha$ in log scale; $T$ is the number of source tasks and $r$ is the dimension of the representation. We set the number of labeled and unlabeled data to $m=1000$ and $n=1000$, respectively.
Figure 2: Left: comparison of feature recovery performance between contrastive learning and autoencoders against the dimension $d$; the sample size is $n=20000$. Center: the same comparison against the sample size $n$; the dimension is $d=40$. Right: feature recovery performance in transfer learning against the penalty parameter $\alpha$ in log scale; $T$ is the number of source tasks. We set the number of labeled and unlabeled data to $m=1000$ and $n=1000$, respectively.
$\log_{e}(\alpha)$ -5 -4 -3 -2 -1 0 1 2 3 4 5
$T=8,r=10$ 0.0242 0.0231 0.0199 0.0141 0.0122 0.0125 0.0184 0.0345 0.0499 0.0535 0.0587
$T=20,r=10$ 0.0223 0.0163 0.0156 0.0096 0.0079 0.0055 0.0064 0.0064 0.0067 0.0070 0.0079
Table 2: In-domain downstream performance in transfer learning against the penalty parameter $\alpha$. $T$ is the number of source tasks.
$\log_{e}(\alpha)$ -5 -4 -3 -2 -1 0 1 2 3 4 5
$T=8,r=10$ 2.0373 2.0371 2.0228 1.9908 2.0021 2.0055 2.0010 2.0362 2.0699 2.0705 2.0813
$T=20,r=10$ 2.0352 2.0292 2.0030 1.9871 1.9740 1.9690 1.9766 1.9702 1.9790 1.9714 1.9672
Table 3: Feature recovery performance in transfer learning against the penalty parameter $\alpha$. $T$ is the number of source tasks.

5.2 Neural Nets with Real-World Dataset

In this section, we provide experimental results on real-world datasets to support our theoretical findings. Although our model settings and assumptions might be violated in this scenario, as we shall see, our findings remain valid in practice.

We conduct the experiments on the STL-10 (Coates et al., 2011) and CIFAR-10 (Krizhevsky, 2009) datasets with the ResNet-18 architecture (He et al., 2016). Our experiments follow the linear evaluation protocol of SimCLR (Chen et al., 2020a): we first train a ResNet-18 encoder and a two-layer MLP projector with unlabeled augmented data, then freeze the encoder, train a logistic regression classifier on top of it with labeled data, and finally evaluate the performance on the test data. Following Chen et al. (2020a), we apply augmentations including resized cropping, horizontal flipping, color distortion, and Gaussian blurring to generate the augmented data, and use the InfoNCE loss to train the network. All training uses the Adam optimizer (Kingma and Ba, 2015), batch size 256, learning rate $3\times 10^{-4}$, weight decay $10^{-4}$, and a cosine annealing learning rate scheduler for 100 epochs. Our code is implemented in PyTorch and run on an NVIDIA V100 GPU.
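For reference, a compact sketch of the InfoNCE (NT-Xent) objective used in this SimCLR-style training is given below; the temperature value and the cosine-similarity formulation follow common SimCLR practice and are assumptions here rather than the exact configuration we used.

```python
import torch
import torch.nn.functional as F

# InfoNCE (NT-Xent) on a batch of paired projections z1, z2 (two views per image).
def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)         # (2B, dim), unit norm
    sim = z @ z.T / temperature                                 # pairwise similarities
    n = sim.shape[0]
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                  # drop self-similarity
    # The positive of example i is its other view, located B positions away.
    targets = torch.arange(n, device=sim.device).roll(n // 2)
    return F.cross_entropy(sim, targets)
```

In the transfer experiments below, this loss on unlabeled batches is combined with a cross-entropy term on labeled batches weighted by $\alpha$, in the spirit of Equation (23).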

Contrastive learning vs. standard autoencoders:

Here we provide real-world evidence for our theoretical findings in Section 3.3 by comparing the performance of contrastive learning with that of a standard autoencoder. The encoder architecture is the same for the two methods, and we use an inverted ResNet-18 as the decoder. During training, we use the encoder-decoder architecture and the mean squared error loss to reconstruct the input, and we then train a linear classifier on the features learned by the encoder. The results are listed in Table 4; contrastive learning demonstrates superior performance over the standard autoencoder.

Testing Accuracy Contrastive Learning Standard Autoencoder
CIFAR10 $65.11\pm 0.51$ $44.76\pm 0.16$
STL10 $71.02\pm 0.47$ $39.00\pm 0.58$
Table 4: Comparison of linear evaluation performance of contrastive learning and autoencoders.
The impact of labeled data in transfer learning:

Now we provide real-world evidence for our theoretical findings in Section 4.2. Following the joint optimization formulation in Equation (23), we combine the InfoNCE loss on unlabeled data and the cross-entropy loss on labeled data with a ratio $\alpha$. For both STL-10 and CIFAR-10, we divide the test data into two sets, one consisting of the first five classes and the other consisting of the remaining five classes. During training, we use the training data as unlabeled data and the first set of test data as labeled data to train the model jointly, and we then train a linear classifier on the second set of test data using the features learned by the encoder. As predicted by Theorem 4.5, when $\alpha$ is small, introducing labeled data from the first five classes is beneficial for learning better representations and improves the performance on the last five classes; when $\alpha$ is large, labeled data from the first five classes make the model focus only on features that are useful for discriminating the first five classes and ignore other features, so introducing labeled data can hurt the performance on the last five classes. Testing accuracies on the last five classes for different $\alpha$ are listed in Table 5; the accuracy first increases and then decreases as $\alpha$ grows, which is consistent with our theoretical results.

Testing Accuracy $\alpha=0.0$ $\alpha=0.1$ $\alpha=0.2$ $\alpha=0.3$ $\alpha=0.4$ $\alpha=0.5$ $\alpha=0.6$ $\alpha=0.7$ $\alpha=0.8$ $\alpha=0.9$ $\alpha=1.0$
CIFAR10 74.27 75.52 74.86 75.31 75.21 74.86 74.46 73.85 72.20 69.17 51.31
STL10 82.56 83.54 83.27 83.24 83.21 83.03 82.34 82.11 80.94 76.88 52.37
Table 5: Transfer learning performance with different $\alpha$. For each experiment, we report the average accuracy for three independent runs.

6 Conclusion

In this work, we establish a theoretical framework to study contrastive learning under the linear representation setting. We theoretically prove that contrastive learning, compared with autoencoders and GANs, obtains a better low-rank representation under the spiked covariance model, which further leads to better performance in in-domain downstream tasks. We also highlight the impact of labeled data in supervised contrastive learning and multi-task transfer learning: labeled data can reduce the bias of contrastive learning in in-domain tasks, but can harm the learned representation in transfer learning. To our knowledge, our results are the first theoretical guarantees for the success of contrastive learning obtained by comparing it with existing representation learning methods. To keep the analysis tractable, like many other theoretical works in representation learning (Du et al., 2020; Lee et al., 2021; Tripuraneni et al., 2021), our work starts with linear representations, which still provides important insights. Recently, Wen and Li (2021) and Refinetti and Goldt (2022) studied the training dynamics of autoencoders and contrastive learning with nonlinear shallow neural networks. Extending our results to these more complex models is an interesting direction for future work.

Acknowledgement

L.Z. is supported by National Science Foundation DMS 2015378. J.Z. is supported by the National Science Foundation (CCF 1763191 and CAREER 1942926), the US National Institutes of Health (P30AG059307 and U01MH098953) and grants from the Silicon Valley Foundation and the Chan-Zuckerberg Initiative.

A Background and omitted discussion

A.1 Comparison with other works

Here we compare the results in this paper with some closely related works.

To rigorously analyze contrastive learning, we consider the random masking augmentation strategy, which is also analyzed in Wen and Li (2021). Wen and Li (2021) aim to understand the training dynamics of contrastive learning in a shallow nonlinear neural network and focus on dealing with the nonlinearity. In comparison, our work focuses on the comparison between contrastive learning and autoencoders and on the role of label information in contrastive learning. To make the problem mathematically tractable, we adopt a linear model, which is simple yet sufficient to shed light on many mysterious phenomena in practice. Moreover, while they assume a sparse coding model, where the features are extremely sparse, and Gaussianity of the signals and noise, our analysis only requires that the features are sub-Gaussian (5). Furthermore, our technique allows the signal-to-noise ratio to have different orders, as long as it decreases slowly, while their analysis is restricted to a particular signal-to-noise ratio.

Tian (2022) studied the relationship between contrastive learning and PCA from a game-theoretic point of view. Specifically, the authors decompose gradient descent on the contrastive loss into two dynamics, namely the max-player and the min-player, and prove that in deep linear networks the max-player is equivalent to PCA and the landscape has no spurious minima. While the results on the max-player apply to a family of contrastive losses, it is still difficult to analyze the min-player in a general setting. In our paper, we use a linear contrastive loss (2) to explicitly obtain the features learned by contrastive learning. Moreover, our results can be directly extended to the deep linear network setting by the equivalence of a single linear transformation and a deep linear network; the major difference is the non-convexity of the loss landscape.

Garg and Liang (2020) studied the combination of supervised learning and self-supervised learning. They viewed training with unlabeled data as a functional regularization on learning the representation function and obtained sample complexity bounds in a PAC-learning style for various settings. In particular, they found that such functional regularization can help to reduce the amount of labeled data needed, and showed autoencoders and masked self-supervision as two concrete examples. In contrast to Garg and Liang (2020), this paper focuses on a regime of combining self-supervised and supervised learning in which a trade-off between labeled and unlabeled data exists. Specifically, Theorem 3 in Garg and Liang (2020) assumes that a ground-truth representation exists that keeps both the self-supervised loss and the supervised loss below a very low threshold. However, as the authors admit, it is hard to determine such a threshold in practical settings. For example, when the unlabeled data and labeled data come from different domains, such as ImageNet and CIFAR-10, domain-specific features may have a much lower loss than domain-transferable features. In our paper, we first study the regime where tasks are not diverse enough in Theorem 4.5 (which corresponds to the case where such a ground truth does not exist) and show the trade-off between the supervised and self-supervised losses. Then, in Theorem 4.8, we show that when tasks are abundant (which corresponds to the case where the ground truth exists), labeled data help to achieve better error bounds, which is similar to the result of Garg and Liang (2020). Our Theorem 4.5 provides novel insight into the regime where tasks are not diverse, which has been left untouched in the literature.

In Li et al. (2021), the authors proposed a novel self-supervised loss function based on HSIC and discussed the relationship between InfoNCE and the proposed SSL-HSIC loss. The SSL-HSIC loss measures the dependence between the output features and one-hot encoded labels (which serve as indicators of positive samples), and minimizing it encourages the network to discriminate augmented views of different samples. In comparison to this self-supervised loss, we use HSIC in Section 4.2 as a supervised loss to measure the dependence between the output features and the true labels, which is a common usage of HSIC in previous works (Barshan et al., 2011; Song et al., 2007c). Moreover, we point out that the proposed estimator of SSL-HSIC (see Equation (11) in Li et al. (2021)) reduces to the linear loss we use in this paper when the kernel $k(\cdot,\cdot)$ is chosen to be a simple inner product. The authors argued that the standard InfoNCE loss may yield meaningless features in some cases, so the proposed HSIC-based loss could be a better alternative, and they provided empirical results illustrating the comparable performance of SSL-HSIC. The benefits of such HSIC-based methods remain to be further explored.

Lee et al. (2021) studied self-supervised learning under a conditional independence assumption and showed that with the optimal representation learned in pretext tasks, the in-domain downstream risk is guaranteed to be small. In the contrastive learning context, for example an image classification downstream task, such an assumption implies that the augmented views generated from the same picture are roughly independent conditional on its ground-truth class, which could be too strong since the two views are usually strongly correlated. Such an assumption is closer to supervised contrastive learning as discussed in Section 4.1, where we contrast two independent samples from the same class; such sample pairs are independent conditional on the true label, but this requires label information and thus does not apply to self-supervised contrastive learning. Compared with this work, we study a specific data-generating model where the two views are obtained by a practical augmentation and which could thus be closer to a real-world setting. It is also remarkable that the analysis in Lee et al. (2021) can be adapted to the nonlinear representation setting in the sense that they directly assume the optimal representation is obtained. However, representation learning in the contrastive learning context, even under a linear representation setting, can be non-convex; our analysis therefore starts from a simple setting to obtain a deeper understanding of what can be learned in the pretext tasks.

Saunshi et al. (2022) proposed a novel perspective that theoretical analyses of contrastive learning must take the inductive bias into account. It is shown that without considering the function class, the learned features may totally fail in downstream classification tasks. Furthermore, within the linear representation class, contrastive self-supervised learning is guaranteed to learn meaningful features under certain conditions. In our paper, we restrict ourselves to a similar linear representation setting and avoid the collapse caused by overly complicated models. Compared with their analysis in the linear representation setting, our bounds provide an exact order, while their results still need to quantify the expressivity and inconsistency measures, which is difficult without very strong assumptions. Moreover, their analysis requires a finite input space and augmentation set, which is much stronger than our data-generating model.

Later than our paper, Fu et al. (2022) considered a similar issue of the transfer learning performance of the supervised contrastive method. The authors argue that directly minimizing the supervised contrastive loss leads to feature collapse: within each class, all data points have the same embedding, and thus the supervised contrastive method loses information that could be useful for transfer learning. This intuition is similar to our setting in Section 4.2, and based on it the authors proposed a new loss function that combines the supervised contrastive loss and a within-class self-supervised loss. At a high level, this new loss function is quite similar to Equation (23), since both are linear interpolations of a self-supervised contrastive loss and a supervised loss, and the common motivation is to encourage the model to learn more background features. Our analysis in Section 4.2 provides a theoretical foundation for when such an interpolation works and how to choose the ratio $\alpha$.

A.2 Discussion about the regularization term

In this paper we use the quadratic regularization term $R_{1}(W)=\|WW^{\top}\|_{F}^{2}$ instead of the standard $\ell_{2}$ regularization term $R_{2}(W)=\|W\|_{F}^{2}$. Writing $W^{\top}=[w_{1},\cdots,w_{r}]$, the two terms can be expressed as:

R_{1}(W)=\sum_{i=1}^{r}\|w_{i}\|^{4}+\sum_{i\neq j}\langle w_{i},w_{j}\rangle^{2},\quad R_{2}(W)=\sum_{i=1}^{r}\|w_{i}\|^{2}

The main difference between the two terms is that, besides penalizing the norm of each representation $w_{i}$, the quadratic term also penalizes the similarity between different representations. In particular, in the linear representation setting, we deal with optimization problems of the form

\min_{W\in\mathbb{R}^{r\times d}}\tr(WAW^{\top})+\frac{\lambda}{2}R(W)

where $A$ is a symmetric matrix determined by the data and the augmentation. In this setting, the $\ell_{2}$ regularization fails. To see this, rewrite the loss function as

\tr(WAW^{\top})+\frac{\lambda}{2}\|W\|_{F}^{2}=\sum_{i=1}^{r}\left(w_{i}^{\top}Aw_{i}+\frac{\lambda}{2}\|w_{i}\|^{2}\right)=\sum_{i=1}^{r}w_{i}^{\top}\left(A+\frac{\lambda}{2}I\right)w_{i},

from which it is easy to see that whenever $A+\frac{\lambda}{2}I$ has a negative eigenvalue, the optimal solution for each $w_{i}$ lies at infinity. Moreover, even if we add constraints of the form $\|w_{i}\|<C$ for all $i\in[r]$, the optimal solutions for all the $w_{i}$ would be (scaled) eigenvectors corresponding to the smallest eigenvalue of $A$, so the model would only learn a single representation from the data. In contrast, the quadratic regularization term encourages diversity of the representation by penalizing the similarity between representations, i.e., $\langle w_{i},w_{j}\rangle^{2}$. In this situation, we have:

\tr(WAW^{\top})+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}=\frac{\lambda}{2}\left\|W^{\top}W+\frac{1}{\lambda}A\right\|_{F}^{2}-\frac{1}{2\lambda}\|A\|_{F}^{2}

and it is easy to see that the optimal $w_{i}$'s are finite, correspond to different eigenvectors of $A$, and are orthogonal to each other, which implies that they learn genuinely different representations. As a result, the quadratic regularization term is more helpful for self-supervised learning with linear representations. In real-world practice, $\ell_{2}$ regularization can still work well with the help of non-linearity and normalization techniques, but we also provide an empirical observation that applying quadratic regularization improves performance compared with standard weight decay. We also point out that in the linear regime, the choice of $\lambda\in\mathbb{R}_{+}$ only affects the norms of the $w_{i}$ and makes no difference to their directions or to the quality of the representation, so we do not specify this value in our analysis. Similar regularization techniques are also used in Liu et al. (2021) for theoretical analysis in the linear representation setting.
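The following sketch illustrates this point numerically: running gradient descent on $\tr(WAW^{\top})+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}$ drives the rows of $W$ toward the eigenvectors of $A$ with the most negative eigenvalues, so the learned directions are diverse rather than collapsed. The matrix $A$, the value of $\lambda$, and the step size are illustrative choices.

```python
import numpy as np

# Gradient descent on tr(W A W^T) + (lam/2) ||W W^T||_F^2.
rng = np.random.default_rng(0)
d, r, lam, lr = 20, 3, 1.0, 0.01

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = np.linspace(-3.0, 1.0, d)                 # some negative eigenvalues
A = Q @ np.diag(eigvals) @ Q.T

W = 0.1 * rng.standard_normal((r, d))
for _ in range(20000):
    grad = 2 * W @ A + 2 * lam * (W @ W.T) @ W      # gradient of the objective
    W -= lr * grad

# Rows of W should (up to rotation) span the eigenvectors of the 3 most negative eigenvalues.
U_neg = Q[:, :3]
U_w = np.linalg.svd(W.T, full_matrices=False)[0]
print(np.linalg.norm(U_neg.T @ U_w))                # close to sqrt(3): subspaces align
```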

We verify this conjecture in neural networks, and the results are provided in Table 6. The first column corresponds to training with standard weight decay. In the setting of the second column, we do not apply weight decay to the weight matrices of the fully connected layers in the encoder and instead add an additional regularization term $\lambda\|W^{\top}W\|_{F}^{2}$ to the loss function. Note that the quadratic regularization does not apply to convolution layers and bias terms, so we keep weight decay on those parameters. We search the regularization parameter over $\lambda\in\{0.1,0.01,0.001,0.0001,0.00001,0.000001\}$ and find that $\lambda=0.0001$ yields the best performance in each setting and dataset.

It is observed that quadratic regularization slightly improves the performance of contrastive learning, which is consistent with our intuition.

Testing Accuracy weight decay $=0.0001$ quadratic regularization $=0.0001$
CIFAR10 $65.11\pm 0.51$ $\mathbf{65.54\pm 0.26}$
STL10 $71.02\pm 0.47$ $\mathbf{71.39\pm 0.39}$
Table 6: Quadratic regularization vs. weight decay. We compare the top-1 accuracy of linear classifiers trained on features learned by SimCLR with the ResNet-18 encoder and different regularization methods. We repeat each experiment for 5 runs and report the mean and standard deviation. More details are provided in Section 5.2.

A.3 Background on distance between subspaces

In this section, we will provide some basic properties of sine distance between subspaces. Recall the definition:

\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\triangleq\left\|U_{1\perp}^{\top}U_{2}\right\|_{F}=\left\|U_{2\perp}^{\top}U_{1}\right\|_{F}. (30)

where U1,U2𝕆d,rU_{1},U_{2}\in\mathbb{O}_{d,r} are two orthogonal matrices. Similarly, we can also define:

sinΘ(U1,U2)2U1U22=U2U12.\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{2}\triangleq\left\|U_{1\perp}^{\top}U_{2}\right\|_{2}=\left\|U_{2\perp}^{\top}U_{1}\right\|_{2}.

We first give two equivalent definitions of this distance:

Proposition A.1
\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}=r-\left\|U_{1}^{\top}U_{2}\right\|_{F}^{2}

Proof  Write U=[U1,U1]𝕆d,dU=[U_{1},U_{1\perp}]\in\mathbb{O}_{d,d}. We have

r=U2F2=UU2F2=U1U2F2+U1U2F2,r=\|U_{2}\|_{F}^{2}=\|U^{\top}U_{2}\|_{F}^{2}=\left\|U_{1\perp}^{\top}U_{2}\right\|_{F}^{2}+\left\|U_{1}^{\top}U_{2}\right\|_{F}^{2},

then by definition of sine distance, we can obtain the desired equation.  

Proposition A.2
\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}=\frac{1}{2}\|U_{1}U_{1}^{\top}-U_{2}U_{2}^{\top}\|_{F}^{2}

Proof  Expand the right hand and use Proposition A.1 we have:

12U1U1U2U2F2=\displaystyle\frac{1}{2}\|U_{1}U_{1}^{\top}-U_{2}U_{2}^{\top}\|_{F}^{2}= 12(U1U1F2+U2U2F22tr(U1U1U2U2))\displaystyle\frac{1}{2}(\|U_{1}U_{1}^{\top}\|_{F}^{2}+\|U_{2}U_{2}^{\top}\|_{F}^{2}-2\tr(U_{1}U_{1}^{\top}U_{2}U_{2}^{\top}))
=\displaystyle= 12(r+r2tr(U1U2U2U1))\displaystyle\frac{1}{2}(r+r-2\tr(U_{1}^{\top}U_{2}U_{2}^{\top}U_{1}))
=\displaystyle= rU1U2F2=sinΘ(U1,U2)F2.\displaystyle r-\|U_{1}^{\top}U_{2}\|_{F}^{2}=\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}.

 
With Propositions A.1 and A.2, it is easy to verify the properties required of a distance function. Clearly, $0\leq\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\leq\sqrt{r}$ and $\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}=\left\|\sin\Theta\left(U_{2},U_{1}\right)\right\|_{F}$ by definition. Moreover, we have the following results:

Lemma A.3 (Lemma 1 in Cai and Zhang (2018))

For any U,V𝕆d,rU,V\in\mathbb{O}_{d,r},

\|\sin\Theta(U,V)\|_{2}\leq\inf_{O\in\mathbb{O}_{r,r}}\|UO-V\|_{2}\leq\sqrt{2}\|\sin\Theta(U,V)\|_{2}, (31)

and

\|\sin\Theta(U,V)\|_{F}\leq\inf_{O\in\mathbb{O}_{r,r}}\|UO-V\|_{F}\leq\sqrt{2}\|\sin\Theta(U,V)\|_{F}. (32)
Proposition A.4 (Identity of indiscernibles)
\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}=0\Leftrightarrow\exists O\in\mathbb{O}^{r\times r},\text{ s.t. }U_{1}O=U_{2}

Proof  It is a straightforward corollary by definition:

sinΘ(U1,U2)F=0\displaystyle\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}=0 U1U2F=0U2U1\displaystyle\Leftrightarrow\left\|U_{1\perp}^{\top}U_{2}\right\|_{F}=0\Leftrightarrow U_{2\perp}\perp U_{1}
O𝕆r×r, s.t. U1O=U2.\displaystyle\Leftrightarrow\exists O\in\mathbb{O}^{r\times r},\text{ s.t. }U_{1}O=U_{2}.

 

Proposition A.5 (Triangular inequality)
\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}\leq\left\|\sin\Theta\left(U_{1},U_{3}\right)\right\|_{F}+\left\|\sin\Theta\left(U_{2},U_{3}\right)\right\|_{F}

Proof  By the triangular inequality for Frobenius norm we have:

\|U_{1}U_{1}^{\top}-U_{2}U_{2}^{\top}\|_{F}\leq\|U_{1}U_{1}^{\top}-U_{3}U_{3}^{\top}\|_{F}+\|U_{2}U_{2}^{\top}-U_{3}U_{3}^{\top}\|_{F},

then apply Proposition A.2 to replace the Frobenius norm with sine distance we can finish the proof.  
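As a quick numerical sanity check, the sketch below verifies that the definition (30) and the identities in Propositions A.1 and A.2 give the same value on randomly generated orthonormal matrices; the dimensions are arbitrary.

```python
import numpy as np

# Three equivalent expressions for the sine distance between subspaces.
rng = np.random.default_rng(0)
d, r = 30, 5
U_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
U2, _ = np.linalg.qr(rng.standard_normal((d, r)))
U1, U1_perp = U_full[:, :r], U_full[:, r:]

by_definition = np.linalg.norm(U1_perp.T @ U2)                    # Eq. (30)
by_prop_a1 = np.sqrt(r - np.linalg.norm(U1.T @ U2) ** 2)          # Proposition A.1
by_prop_a2 = np.linalg.norm(U1 @ U1.T - U2 @ U2.T) / np.sqrt(2)   # Proposition A.2
print(by_definition, by_prop_a1, by_prop_a2)                      # three (nearly) equal numbers
```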

B Omitted proofs for Section 3

B.1 Proofs for Section 3.1 and Section 3.2

In this section, we provide the proofs of Proposition 3.1 and Corollary 3.2; their restatements and detailed proofs can be found in Proposition B.1 and Corollary B.2.

Proposition B.1 (Restatement of Proposition 3.1)

For two fixed augmentation functions $g_{1},g_{2}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$, denote the augmented data matrices by $X_{1}=[g_{1}(x_{1}),\cdots,g_{1}(x_{n})]\in\mathbb{R}^{d\times n}$ and $X_{2}=[g_{2}(x_{1}),\cdots,g_{2}(x_{n})]\in\mathbb{R}^{d\times n}$. When the augmented pairs are generated as in Definition 2.1, the optimal solution of the contrastive learning problem (9) is given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$-th largest eigenvalue of the following matrix:

X_{1}X_{2}^{\top}+X_{2}X_{1}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top}, (33)

$u_{i}$ is the corresponding eigenvector, and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

Proof [Proof of Proposition 3.1] When augmented pairs generation in Definition 2.1 is applied, the contrastive loss can be written as:

SelfCon(W)=\displaystyle\mathcal{L}_{\text{SelfCon}}(W)= λ2WWF21ni=1n[Wg1(xi),Wg2(xi)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{n}\sum_{i=1}^{n}[\langle Wg_{1}(x_{i}),Wg_{2}(x_{i})\rangle
14(n1)jiWg1(xi)+Wg2(xi),Wg1(xj)+Wg2(xi)]\displaystyle-\frac{1}{4(n-1)}\sum_{j\neq i}\langle Wg_{1}(x_{i})+Wg_{2}(x_{i}),Wg_{1}(x_{j})+Wg_{2}(x_{i})\rangle]
=\displaystyle= λ2WWF21ni=1nWg1(xi),Wg2(xi)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{n}\sum_{i=1}^{n}\langle Wg_{1}(x_{i}),Wg_{2}(x_{i})\rangle
+14n(n1)i=1njiWg1(xi)+Wg2(xi),Wg1(xj)+Wg2(xi)\displaystyle+\frac{1}{4n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\langle Wg_{1}(x_{i})+Wg_{2}(x_{i}),Wg_{1}(x_{j})+Wg_{2}(x_{i})\rangle
=\displaystyle= λ2WWF212ntr(X1WWX2+X2WWX1)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{2n}\tr(X_{1}^{\top}W^{\top}WX_{2}+X_{2}^{\top}W^{\top}WX_{1})
+14n(n1)tr((1n1nIn)(X1+X2)WW(X1+X2))\displaystyle+\frac{1}{4n(n-1)}\tr((1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top}W^{\top}W(X_{1}+X_{2}))
=\displaystyle= λ2WWF2\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}
12ntr((X2X1+X1X212(n1)(X1+X2)(1n1nIn)(X1+X2))WW)\displaystyle-\frac{1}{2n}\tr((X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})W^{\top}W)
=\displaystyle= 12λWW12nλ(X2X1+X1X212(n1)(X1+X2)(1n1nIn)(X1+X2))F2\displaystyle\frac{1}{2}\biggl{\|}\lambda W^{\top}W-\frac{1}{2n\lambda}\quantity(X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})\biggr{\|}_{F}^{2}
12nλ(X2X1+X1X212(n1)(X1+X2)(1n1nIn)(X1+X2))F2.\displaystyle-\biggl{\|}\frac{1}{2n\lambda}\quantity(X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})\biggr{\|}_{F}^{2}.

Note that the last term depends only on $X$, and the first term implies that when $W_{\text{CL}}$ is the optimal solution, $\lambda W_{\text{CL}}^{\top}W_{\text{CL}}$ is the best rank-$r$ approximation of $\frac{1}{2n\lambda}\left(X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top}\right)$. Applying Lemma E.4 to the first term, we conclude that $W_{\text{CL}}$ satisfies the desired conditions.  

Corollary B.2 (Restatement of Corollary 3.2)

Under the same conditions as in Proposition 3.1, if we use random masking (Definition 2.2) as our augmentation function, then in expectation over the data augmentation, the optimal solution of the contrastive learning problem (9) is given by:

W_{\text{CL}}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where $C>0$ is a positive constant, $\sigma_{i}$ is the $i$-th largest eigenvalue of the following matrix:

\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top}, (34)

$u_{i}$ is the corresponding eigenvector, and $V=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r}$ can be any orthonormal matrix.

Proof [Proof of Corollary 3.2] Following the proof of Proposition 3.1, now we only need to compute the expectation over the augmentation distribution defined in Definition 2.2:

SelfCon(W)=\displaystyle\mathcal{L}_{\text{SelfCon}}(W)= λ2WWF2𝔼(g1,g2)[1ni=1n[Wg1(xi),Wg2(xi)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\mathbb{E}_{(g_{1},g_{2})}[\frac{1}{n}\sum_{i=1}^{n}[\langle Wg_{1}(x_{i}),Wg_{2}(x_{i})\rangle
14(n1)jiWg1(xi)+Wg2(xi),Wg1(xj)+Wg2(xi)]]\displaystyle-\frac{1}{4(n-1)}\sum_{j\neq i}\langle Wg_{1}(x_{i})+Wg_{2}(x_{i}),Wg_{1}(x_{j})+Wg_{2}(x_{i})\rangle]]
=\displaystyle= λ2WWF2𝔼(g1,g2)[12ntr((X2X1+X1X2\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\mathbb{E}_{(g_{1},g_{2})}[\frac{1}{2n}{\tr}((X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}
12(n1)(X1+X2)(1n1nIn)(X1+X2))WW)].\displaystyle-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})W^{\top}W)]. (35)

Note that by the definition of the random masking augmentation, we have $X_{1}=AX$ and $X_{2}=(I-A)X$, which implies $X_{1}+X_{2}=X$. On the other hand, $X_{1}$ and $X_{2}$ have no common nonzero entries, so the matrix $X_{1}X_{2}^{\top}+X_{2}X_{1}^{\top}$ consists only of off-diagonal entries; its $(i,j)$-th entry is nonzero if and only if $a_{i}+a_{j}=1$, in which case it equals the $(i,j)$-th entry of $XX^{\top}$. With this result, we can compute the expectation in Equation (35):

SelfCon(W)=\displaystyle\mathcal{L}_{\text{SelfCon}}(W)= λ2WWF2𝔼(g1,g2)[12ntr((X2X1+X1X2\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\mathbb{E}_{(g_{1},g_{2})}\biggl{[}\frac{1}{2n}{\tr}((X_{2}X_{1}^{\top}+X_{1}X_{2}^{\top}
12(n1)(X1+X2)(1n1nIn)(X1+X2))WW)]\displaystyle-\frac{1}{2(n-1)}(X_{1}+X_{2})(1_{n}1_{n}^{\top}-I_{n})(X_{1}+X_{2})^{\top})W^{\top}W)\bigg{]}
=\displaystyle= λ2WWF212ntr((12Δ(XX)12(n1)X(1n1nIn)X)WW)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{2n}\tr(\quantity(\frac{1}{2}\Delta(XX^{\top})-\frac{1}{2(n-1)}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)
=\displaystyle= 12λWW14nλ(Δ(XX)1n1X(1n1nIn)X)F2\displaystyle\frac{1}{2}\norm{\lambda W^{\top}W-\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})}_{F}^{2}
14nλ(Δ(XX)1n1X(1n1nIn)X)F2.\displaystyle-\norm{\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})}_{F}^{2}.

By a similar argument as in the proof of Proposition 3.1, we can conclude that WCLW_{\text{CL}} satisfies the desired conditions.  

Remark B.3

Note that the two views generated by the random masking augmentation have disjoint sets of non-zero coordinates, so contrasting such positive pairs only captures correlations between different coordinates. That is why the first term in Equation (34) is $\Delta(XX^{\top})$, in which the diagonal entries are eliminated.
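For illustration, the following sketch computes the subspace spanned by the (expected) optimal solution of Corollary B.2, i.e., the top-$r$ eigenvectors of the matrix in Equation (34); the data-generating scales and the Gaussian core features are assumptions made for the example.

```python
import numpy as np

# Spectral solution of Corollary B.2 on synthetic spiked-covariance data.
rng = np.random.default_rng(0)
d, r, n = 40, 5, 5000

U_star, _ = np.linalg.qr(rng.standard_normal((d, r)))
X = U_star @ rng.standard_normal((r, n)) + 0.5 * rng.standard_normal((d, n))

G = X @ X.T
M = G - np.diag(np.diag(G))                    # Delta(X X^T): off-diagonal part
s = X.sum(axis=1)                              # X 1_n
M -= (np.outer(s, s) - G) / (n - 1)            # subtract X (1 1^T - I) X^T / (n - 1)

eigvecs = np.linalg.eigh(M)[1]                 # eigenvectors, ascending eigenvalues
U_cl = eigvecs[:, -r:]                         # top-r eigenvectors span the estimate
print(np.sqrt(r - np.linalg.norm(U_cl.T @ U_star) ** 2))   # sine distance to U*
```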

B.2 Proofs for Section 3.3

In this section, we prove Lemma 3.7 and Theorems 3.9 and 3.10 from Section 3.3. Their restatements and proofs can be found in Lemma B.6, Theorem B.7, and Theorem B.9.

Before starting the proof, we give two technical lemmas to help the proof.

Lemma B.4 (Uniform distribution on the unit sphere (Marsaglia, 1972))

If $x_{1},x_{2},\cdots,x_{n}$ are i.i.d. $\mathcal{N}(0,1)$, then $(x_{1}/\sqrt{\sum_{i=1}^{n}x_{i}^{2}},\cdots,x_{n}/\sqrt{\sum_{i=1}^{n}x_{i}^{2}})$ is uniformly distributed on the unit sphere $\mathbb{S}^{d}=\{(x_{1},\cdots,x_{n})\in\mathbb{R}^{n}:\sum_{i=1}^{n}x_{i}^{2}=1\}$.

Lemma B.5

If $x_{1},x_{2},\cdots,x_{n}$ are i.i.d. $\mathcal{N}(0,1)$, then:

\mathbb{E}\max_{1\leq i\leq n}x_{i}^{2}\leq 2\log(n).

Proof  Denote Y=max1inxi2Y=\max_{1\leq i\leq n}x_{i}^{2}, then we have:

exp(t𝔼Y)𝔼exp(tY)𝔼i=1nexp(txi2)=n𝔼exp(txi2).\displaystyle\exp(t\mathbb{E}Y)\leq\mathbb{E}\exp(tY)\leq\mathbb{E}\sum_{i=1}^{n}\exp(tx_{i}^{2})=n\mathbb{E}\exp(tx_{i}^{2}).

Note that the moment-generating function of chi-square distribution with vv degrees of freedom is:

MX(t)=(12t)v/2.M_{X}(t)=(1-2t)^{-v/2}.

Combining this fact with the bound on $\exp(t\mathbb{E}Y)$ above, we have:

exp(t𝔼Y)n(12t)12,\exp(t\mathbb{E}Y)\leq n(1-2t)^{-\frac{1}{2}},

which implies:

𝔼Ylog(n)t12t2t,t<12.\mathbb{E}Y\leq\frac{\log(n)}{t}-\frac{1-2t}{2t},\quad\forall t<\frac{1}{2}.

In particular, take t12t\rightarrow\frac{1}{2} yields:

𝔼Y2log(n)\mathbb{E}Y\leq 2\log(n)

as desired.  

Lemma B.6 (Restatement of Lemma 3.7)
𝔼UUniform(𝕆d,r)I(U)=O(rdlogd).\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U^{\star})=O\quantity(\frac{r}{d}\log d). (36)

Proof [Proof of Lemma 3.7] Denote the columns of UU as U=[u1,,ur]𝕆d,rU=[u_{1},\cdots,u_{r}]\in\mathbb{O}_{d,r}, we have:

𝔼UUniform(𝕆d,r)I(U)=\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)= 𝔼UUniform(𝕆d,r)maxi[d]j=1r|eiuj|2\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}\max_{i\in[d]}\sum_{j=1}^{r}|e_{i}^{\top}u_{j}|^{2}
\displaystyle\leq 𝔼UUniform(𝕆d,r)j=1rmaxi[d]|eiuj|2\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}\sum_{j=1}^{r}\max_{i\in[d]}|e_{i}^{\top}u_{j}|^{2}
=\displaystyle= r𝔼uUniform(𝕊d)maxi[d]|eiu|2.\displaystyle r\mathbb{E}_{u\sim\operatorname{Uniform}(\mathbb{S}^{d})}\max_{i\in[d]}|e_{i}^{\top}u|^{2}.

By Lemma B.4 we can transform this expectation on the uniform sphere distribution into normalized multivariate Gaussian variables:

𝔼UUniform(𝕆d,r)I(U)=r𝔼x1,,xdmaxi[d]xi2j=1dxj2.\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)=r\mathbb{E}_{x_{1},\cdots,x_{d}}\frac{\max_{i\in[d]}x_{i}^{2}}{\sum_{j=1}^{d}x_{j}^{2}}. (37)

where x1,x2,,xdx_{1},x_{2},\cdots,x_{d} are i.i.d. standard normal random variables. Apply Chebyshev’s inequality we know that:

(|1di=1dxj21|>ϵ)2dϵ2.\mathbb{P}\quantity(|\frac{1}{d}\sum_{i=1}^{d}x_{j}^{2}-1|>\epsilon)\leq\frac{2}{d\epsilon^{2}}.

In particular, take ϵ=1\epsilon=1 we have:

(i=1dxj2<d2)8d.\mathbb{P}\quantity(\sum_{i=1}^{d}x_{j}^{2}<\frac{d}{2})\leq\frac{8}{d}.

Then take it back into Equation (37) and apply Lemma B.5 we obtain:

𝔼UUniform(𝕆d,r)I(U)=\displaystyle\mathbb{E}_{U\sim\operatorname{Uniform}(\mathbb{O}_{d,r})}I(U)= r𝔼x1,,xdmaxi[d]xi2j=1dxj2𝕀{i=1dxj2<d2}\displaystyle r\mathbb{E}_{x_{1},\cdots,x_{d}}\frac{\max_{i\in[d]}x_{i}^{2}}{\sum_{j=1}^{d}x_{j}^{2}}\mathbb{I}\{\sum_{i=1}^{d}x_{j}^{2}<\frac{d}{2}\}
+r𝔼x1,,xdmaxi[d]xi2j=1dxj2𝕀{i=1dxj2d2}\displaystyle+r\mathbb{E}_{x_{1},\cdots,x_{d}}\frac{\max_{i\in[d]}x_{i}^{2}}{\sum_{j=1}^{d}x_{j}^{2}}\mathbb{I}\{\sum_{i=1}^{d}x_{j}^{2}\geq\frac{d}{2}\}
\displaystyle\leq r(i=1dxj2<d2)+2rd𝔼x1,,xdmaxi[d]xi2\displaystyle r\mathbb{P}\quantity(\sum_{i=1}^{d}x_{j}^{2}<\frac{d}{2})+\frac{2r}{d}\mathbb{E}_{x_{1},\cdots,x_{d}}\max_{i\in[d]}x_{i}^{2}
\displaystyle\leq 8rd+4rlogdd\displaystyle\frac{8r}{d}+\frac{4r\log d}{d}

as desired.  

Now we start proving our main results. Note that UAEU_{\text{AE}} is the top-rr left eigenspace of the observed covariance matrix and UU^{\star} is that of the core feature covariance matrix, and by Assumption 3.5 the observed covariance matrix is dominated by the covariance of random noise. The Davis-Kahan theorem provides a technique to estimate the eigenspace distance via estimating the difference between target matrices. We will adopt this technique to prove the lower bound of the feature recovery ability of autoencoders in Theorem 3.9.

Theorem B.7 (Restatement of Theorem 3.9)

Consider the spiked covariance model Eq.(5), under Assumptions 3.4-3.6 and n>drn>d\gg r, let WAEW_{AE} be the learned representation of autoencoder with singular value decomposition WAE=(UAEΣAEVAE)W_{AE}=(U_{AE}\Sigma_{AE}V_{AE}^{\top})^{\top} (as in Eq.(7)). If we further assume {σi2}i=1d\{\sigma_{i}^{2}\}_{i=1}^{d} are different from each other and σ(1)2/(σ(r)2σ(r+1)2)<Cσ\sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma} for some universal constant CσC_{\sigma}. Then there exist two universal constants Cρ>0,c(0,1)C_{\rho}>0,c\in(0,1), such that when ρ<Cρ\rho<C_{\rho}, we have

𝔼sinΘ(U,UAE)Fcr.\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{AE}\right)\right\|_{F}\geq c\sqrt{r}. (38)

Proof [Proof of Theorem 3.9] Denote M=ν2UUM=\nu^{2}U^{\star}U^{\star\top} to be the target matrix, xi=Uzi+ξi,i=1,2,nx_{i}=U^{\star}z_{i}+\xi_{i},\quad i=1,2,\cdots n to be the samples generated from model 5 and let X=[x1,,xn]d×n,Z=[z1,,zn]r×n,E=[ξ1,,ξn]d×nX=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n},Z=[z_{1},\cdots,z_{n}]\in\mathbb{R}^{r\times n},E=[\xi_{1},\cdots,\xi_{n}]\in\mathbb{R}^{d\times n} to be the corresponding matrices. In addition, we write the column mean matrix X¯n×d\bar{X}\in\mathbb{R}^{n\times d} of a matrix Xn×dX\in\mathbb{R}^{n\times d} to be X¯=1nX1n1n\bar{X}=\frac{1}{n}X1_{n}1_{n}^{\top}, that is, each column of X¯\bar{X} is the column mean of XX. We denote the sum of variance σi2\sigma_{i}^{2} as σsum2=i=1dσi2\sigma_{\text{sum}}^{2}=\sum_{i=1}^{d}\sigma_{i}^{2}. As shown in Equation (7), autoencoders find the top-rr eigenspace of the following matrix:

M^1=1nX(In1n1n1n)X\displaystyle\hat{M}_{1}=\frac{1}{n}X(I_{n}-\frac{1}{n}1_{n}1_{n}^{\top})X^{\top} =1n(UZ+E)(UZ+E)1n(UZ¯+E¯)(UZ¯+E¯).\displaystyle=\frac{1}{n}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}-\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}.

The rest of the proof is divided into three steps for the sake of presentation.

Step 1. Bound the difference between M^1\hat{M}_{1} and Σ\Sigma

In this step, we aim to show that the data recovery of autoencoders is dominated by the random noise term. Note that Σ=Cov(ξ)=𝔼ξξ\Sigma=\mathrm{Cov}(\xi)=\mathbb{E}\xi\xi^{\top}, we just need to bound the norm of the following matrix:

M^1Σ=1nUZZU+1n(UZE+EZU)+(1nEEΣ)1n(UZ¯+E¯)(UZ¯+E¯),\hat{M}_{1}-\Sigma=\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}+\frac{1}{n}(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})+(\frac{1}{n}EE^{\top}-\Sigma)-\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}, (39)

and we will deal with these four terms separately.

  1. 1.

    For the first term, note that 𝔼zz=ν2Ir\mathbb{E}zz^{\top}=\nu^{2}I_{r}, the first term can then be divided into two terms

    1nUZZU=M+U(1nZZ𝔼zz)U.\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}=M+U^{\star}(\frac{1}{n}ZZ^{\top}-\mathbb{E}zz^{\top})U^{\star\top}. (40)

    Then apply the concentration inequality of Wishart-type matrices (Lemma E.3) we have:

    𝔼1nZZ𝔼zz2(rn+rn)ν2.\mathbb{E}\|\frac{1}{n}ZZ^{\top}-\mathbb{E}zz^{\top}\|_{2}\leq(\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}.

    Plug it back into (40) we obtain the bound for the first term:

    1nUZZU2M2+U21nZZ𝔼zz2U2(1+rn+rn)ν2.\|\frac{1}{n}UZZ^{\top}U^{\top}\|_{2}\leq\|M\|_{2}+\|U\|_{2}\|\frac{1}{n}ZZ^{\top}-\mathbb{E}zz^{\top}\|_{2}\|U\|_{2}\leq\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}. (41)
  2. 2.

    For the second term, since ZZ and EE are independent, we must have 𝔼UZE=0\mathbb{E}U^{\star}ZE^{\top}=0, so apply Lemma E.2 twice we have:

    1n𝔼EZU2=\displaystyle\frac{1}{n}\mathbb{E}\|EZ^{\top}U^{\star}\|_{2}= 1n𝔼Z[𝔼E[EZU2|Z]]\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\mathbb{E}_{E}[\|EZ^{\top}U^{\star}\|_{2}|Z]] (42)
    \displaystyle\lesssim 1n𝔼Z[Z2(σsum+r1/4σsumσ(1)+rσ(1))]\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\|Z\|_{2}(\sigma_{\text{sum}}+r^{1/4}\sqrt{\sigma_{\text{sum}}\sigma_{(1)}}+\sqrt{r}\sigma_{(1)})]
    \displaystyle\lesssim 1n𝔼Z[Z2]dσ(1)\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\|Z\|_{2}]\sqrt{d}\sigma_{(1)}
    \displaystyle\lesssim 1ndσ(1)(r1/2ν+(nr)1/4ν+n1/2ν)\displaystyle\frac{1}{n}\sqrt{d}\sigma_{(1)}(r^{1/2}\nu+(nr)^{1/4}\nu+n^{1/2}\nu)
    \displaystyle\lesssim dnσ(1)ν.\displaystyle\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu.
  3. 3.

    For the third term, apply Lemma E.3 again yields:

    𝔼1nEEΣ2(dn+dn)σ(1)2.\mathbb{E}\|\frac{1}{n}EE^{\top}-\Sigma\|_{2}\leq\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}. (43)
  4. 4.

    For the last term, note that each column of Z¯\bar{Z} and E¯\bar{E} are the same, so we can rewrite it as:

    1n(UZ¯+E¯)(UZ¯+E¯)=(Uz¯+ξ¯)(Uz¯+ξ¯),\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}=(U^{\star}\bar{z}+\bar{\xi})(U^{\star}\bar{z}+\bar{\xi})^{\top},

    where z¯=1ni=1nzi\bar{z}=\frac{1}{n}\sum_{i=1}^{n}z_{i} and ξ¯=1ni=1nξi\bar{\xi}=\frac{1}{n}\sum_{i=1}^{n}\xi_{i}. Since zz and ξ\xi are independent zero mean sub-Gaussian random variables and Cov(z)=ν2Ir,Cov(ξ)=Σ\mathrm{Cov}(z)=\nu^{2}I_{r},\mathrm{Cov}(\xi)=\Sigma, we can conclude that:

    𝔼1n(UZ¯+E¯)(UZ¯+E¯)2𝔼z¯z¯2+2𝔼z¯ξ¯2+𝔼ξ¯ξ¯2\displaystyle\mathbb{E}\|\frac{1}{n}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}\|_{2}\leq\mathbb{E}\|\bar{z}\bar{z}^{\top}\|_{2}+2\mathbb{E}\|\bar{z}\bar{\xi}^{\top}\|_{2}+\mathbb{E}\|\bar{\xi}\bar{\xi}^{\top}\|_{2} (44)
    \displaystyle\lesssim rν2n+dnσ(1)ν+dσ(1)2n.\displaystyle\frac{r\nu^{2}}{n}+\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu+\frac{d\sigma_{(1)}^{2}}{n}.

To sum up, combine equations (41)(42)(43)(LABEL:PCA_step1_term4) together we obtain the upper bound for the 2 norm expectation of matrix M^Σ\hat{M}-\Sigma:

𝔼M^1Σ2ν2(1+rn+rn)+σ(1)2(dn+dn)+dnσ(1)ν.\mathbb{E}\|\hat{M}_{1}-\Sigma\|_{2}\lesssim\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu. (45)
Step 2. Bound the sine distance between eigenspaces

As we have shown in step 1, the target matrix of autoencoders is close to the covariance matrix of random noise, that is, Σ\Sigma. Note that Σ\Sigma is assumed to be a diagonal matrix with different elements, hence its eigenspace only consists of canonical basis eie_{i}. Denote UΣU_{\Sigma} to be the top-rr eigenspace of Σ\Sigma and {ei}iC\{e_{i}\}_{i\in C} to be its corresponding basis vectors, apply the Davis-Kahan Theorem E.1 we can conclude that:

𝔼sinΘ(UAE,UΣ)F2r𝔼M^1Σ2σ(r)2σ(r+1)2\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{AE}},U_{\Sigma})\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\|\hat{M}_{1}-\Sigma\|_{2}}{\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2}}
\displaystyle\lesssim r1σ(1)2(ν2(1+rn+rn)+σ(1)2(dn+dn)+dnσ(1)ν)\displaystyle\sqrt{r}\frac{1}{\sigma_{(1)}^{2}}\quantity(\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu)
\displaystyle\lesssim r(ρ2+dn+ρdn).\displaystyle\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}).
Step 3. Obtain the final result by triangular inequality

By Assumption 3.6 we know that the distance between canonical basis and the eigenspace of core features can be large:

sinΘ(U,UΣ)F2\displaystyle\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}^{2} =UΣUF2=i[d]CeiU2=UF2iCeiU2\displaystyle=\|U_{\Sigma\perp}^{\top}U^{\star}\|_{F}^{2}=\sum_{i\in[d]\setminus C}\|e_{i}^{\top}U^{\star}\|^{2}=\|U^{\star}\|_{F}^{2}-\sum_{i\in C}\|e_{i}^{\top}U^{\star}\|^{2}
rrI(U)=rO(r2dlogd).\displaystyle\geq r-rI(U^{\star})=r-O\quantity(\frac{r^{2}}{d}\log d).

Then apply the triangular inequality of sine distance (Proposition A.5) we can obtain the lower bound of autoencoders.

𝔼sinΘ(UAE,U)F\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{AE}},U^{\star})\|_{F} 𝔼sinΘ(U,UΣ)F𝔼sinΘ(UAE,UΣ)F\displaystyle\geq\mathbb{E}\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}-\mathbb{E}\|\sin\Theta(U_{\text{AE}},U_{\Sigma})\|_{F} (46)
rO(rdlogd)O(r(ρ2+dn+ρdn)).\displaystyle\geq\sqrt{r}-O\quantity(\frac{r}{\sqrt{d}}\sqrt{\log d})-O\quantity(\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}})).

By Assumption 3.5, it implies that when n and d are sufficiently large and ρ\rho is sufficiently small (smaller than a given constant Cρ>0C_{\rho}>0), there exists a universal constant c(0,1)c\in(0,1) such that:

𝔼sinΘ(UAE,U)Fcr.\mathbb{E}\|\sin\Theta(U_{\text{AE}},U^{\star})\|_{F}\geq c\sqrt{r}.

 

To start the proof, we introduce a technical lemma first.

Lemma B.8 (Lemma 4 in Zhang et al. (2018))

If Mp×pM\in\mathbb{R}^{p\times p} is any square matrix and Δ(M)\Delta(M) is the matrix MM with diagonal entries set to 0 , then

Δ(M)22M2.\|\Delta(M)\|_{2}\leq 2\|M\|_{2}.

Here, factor ” 2 ” in the statement above cannot be improved.

Theorem B.9 (Restatement of Theorem 3.10)

Under the spiked covariance model Eq.(5), random masking augmentation in Definition 2.2, Assumptions 3.4-3.6 and n>drn>d\gg r, let WCLW_{CL} be any solution that minimizes Eq.(3), and denote its singular value decomposition as WCL=(UCLΣCLVCL)W_{CL}=(U_{CL}\Sigma_{CL}V_{CL}^{\top})^{\top}, then we have

𝔼sinΘ(U,UCL)Fr3/2dlogd+drn.\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{CL}\right)\right\|_{F}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}. (47)

Proof [Proof of Theorem 3.10] The proof strategy is quite similar to that of Theorem 3.9 and we follow the notation defined in the first paragraph of that proof. As we have shown in Corollary 3.2, under our linear representation setting, the contrastive learning algorithm finds the top-rr eigenspace of the following matrix:

M^2=\displaystyle\hat{M}_{2}= 1n(Δ(XX)1n1X(1n1nIn)X)\displaystyle\frac{1}{n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})
=\displaystyle= 1nΔ((UZ+E)(UZ+E))1n1(UZ¯+E¯)(UZ¯+E¯)\displaystyle\frac{1}{n}\Delta((U^{\star}Z+E)(U^{\star}Z+E)^{\top})-\frac{1}{n-1}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}
+1n(n1)(UZ+E)(UZ+E).\displaystyle+\frac{1}{n(n-1)}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}.

To prove the theorem, first we need to bound the difference between M^2\hat{M}_{2} and MM. We aim to show that the contrastive learning algorithm is dominated by the core feature term. Note that Σ=𝔼UzzU\Sigma=\mathbb{E}Uzz^{\top}U^{\top}, we just need to bound the norm of the following matrix:

M^2M=\displaystyle\hat{M}_{2}-M= (1nΔ(UZZU)M)+1nΔ(UZE+EZU)+1nΔ(EE)\displaystyle(\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})-M)+\frac{1}{n}\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})+\frac{1}{n}\Delta(EE^{\top}) (48)
1n1(UZ¯+E¯)(UZ¯+E¯)+1n(n1)(UZ+E)(UZ+E).\displaystyle-\frac{1}{n-1}(U^{\star}\bar{Z}+\bar{E})(U^{\star}\bar{Z}+\bar{E})^{\top}+\frac{1}{n(n-1)}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}.

and we will also deal with these five terms separately.

  1. 1.

    For the first term, we can divide it into two parts:

    1nΔ(UZZU)M=Δ(1nUZZUTM)+Δ(M)M.\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})-M=\Delta(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star T}-M)+\Delta(M)-M. (49)

    Then apply Lemma B.8 and Lemma E.3 we have:

    𝔼Δ(1nUZZUM)22𝔼1nUZZUM22(rn+rn)ν2.\mathbb{E}\|\Delta(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-M)\|_{2}\leq 2\mathbb{E}\|\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-M\|_{2}\leq 2(\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}.

    Using the incoherent condition I(U)=O(rdlogd)I(U)=O(\frac{r}{d}\log d), we know that:

    MΔ(M)2ν2maxi[d]eiU22=ν2I(U)rdlogdν2.\|M-\Delta(M)\|_{2}\leq\nu^{2}\max_{i\in[d]}\|e_{i}^{\top}U^{\star}\|_{2}^{2}=\nu^{2}I(U^{\star})\lesssim\frac{r}{d}\log d\nu^{2}.

    Combine the two equations above together we obtain the bound for the first term:

    𝔼1nΔ(UZZU)M2\displaystyle\mathbb{E}\|\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})-M\|_{2} 𝔼Δ(1nUZZUM)2+MΔ(M)2\displaystyle\leq\mathbb{E}\|\Delta(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-M)\|_{2}+\|M-\Delta(M)\|_{2} (50)
    ν2(rdlogd+rn+rn).\displaystyle\lesssim\nu^{2}(\frac{r}{d}\log d+\frac{r}{n}+\sqrt{\frac{r}{n}}). (51)
  2. 2.

    For the second term, apply equation (42) yields:

    1n𝔼Δ(UZE+EZU)24n𝔼EZU2dnσ(1)ν.\displaystyle\frac{1}{n}\mathbb{E}\|\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})\|_{2}\leq\frac{4}{n}\mathbb{E}\|EZ^{\top}U^{\star\top}\|_{2}\lesssim\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu. (52)
  3. 3.

    For the third term, apply equation (43) yields:

    𝔼1nΔ(EE)2=𝔼Δ(1nEEΣ)221nEEΣ2(dn+dn)σ(1)2.\mathbb{E}\|\frac{1}{n}\Delta(EE^{\top})\|_{2}=\mathbb{E}\|\Delta(\frac{1}{n}EE^{\top}-\Sigma)\|_{2}\leq 2\|\frac{1}{n}EE^{\top}-\Sigma\|_{2}\lesssim(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}. (53)
  4. 4.

    For the fourth term, apply equation (LABEL:PCA_step1_term4) yields:

    𝔼1n1(UZ¯+E¯)(UZ¯+E¯)2\displaystyle\mathbb{E}\|\frac{1}{n-1}(U^{\star}\bar{Z}+\bar{E})(U\bar{Z}+\bar{E})^{\top}\|_{2}\lesssim 𝔼1n(UZ¯+E¯)(UZ¯+E¯)2\displaystyle\mathbb{E}\|\frac{1}{n}(U\bar{Z}+\bar{E})(U\bar{Z}+\bar{E})^{\top}\|_{2} (54)
    \displaystyle\lesssim rν2n+dnσ(1)ν+dσ(1)2n.\displaystyle\frac{r\nu^{2}}{n}+\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu+\frac{d\sigma_{(1)}^{2}}{n}.
  5. 5.

    For the last term, by equations (41)(42)(43) we know:

    𝔼1n(UZ+E)(UZ+E)2\displaystyle\mathbb{E}\|\frac{1}{n}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}\|_{2}
    Σ2+(1+rn+rn)ν2+dnσ(1)ν+(dn+dn)σ(1)2.\displaystyle\lesssim\|\Sigma\|_{2}+\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu+\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}.

    Thus we can conclude that:

    𝔼1n(n1)(UZ+E)(UZ+E)2dnσ(1)2+rnν2.\mathbb{E}\|\frac{1}{n(n-1)}(U^{\star}Z+E)(U^{\star}Z+E)^{\top}\|_{2}\lesssim\frac{d}{n}\sigma_{(1)}^{2}+\frac{r}{n}\nu^{2}. (55)

To sum up, combine equations (50)(52)(53)(54)(55) together we obtain the upper bound for the 2 norm expectation of matrix M^2M\hat{M}_{2}-M:

𝔼M^2M2ν2(rdlogd+rn+rn)+σ(1)2(dn+dn)+σ(1)νdn.\mathbb{E}\|\hat{M}_{2}-M\|_{2}\lesssim\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}. (56)

With the upper bound for M^2M2\|\hat{M}_{2}-M\|_{2}, simply apply Lemma E.1 we can obtain the desired bound for sine distance:

𝔼sinΘ(UCL,U)F2r𝔼M^2M2ν2\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\|\hat{M}_{2}-M\|_{2}}{\nu^{2}} (57)
\displaystyle\lesssim r1ν2(ν2(rdlogd+rn+rn)+σ(1)2(dn+dn)+σ(1)νdn)\displaystyle\sqrt{r}\frac{1}{\nu^{2}}\quantity(\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}})
=\displaystyle= r((rdlogd+rn+rn)+ρ2(dn+dn)+ρ1dn)\displaystyle\sqrt{r}\quantity(\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\rho^{-2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\rho^{-1}\sqrt{\frac{d}{n}})
\displaystyle\lesssim r3/2dlogd+drn.\displaystyle\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

Moreover, there exists an orthogonal matrix O^𝕆r×r\hat{O}\in\mathbb{O}^{r\times r} depending on UCLU_{\text{CL}} such that:

𝔼UUCLO^IrF=𝔼UCLO^UF2r𝔼M^2M2ν2r3/2dlogd+drn.\mathbb{E}\|U^{\top}U_{\text{CL}}\hat{O}-I_{r}\|_{F}=\mathbb{E}\|U_{\text{CL}}\hat{O}-U\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\|\hat{M}_{2}-M\|_{2}}{\nu^{2}}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

which finishes the proof.  

B.3 Proofs for Section 3.4

In this section, we will provide the proof of Theorems 3.13 and 3.14 with both regression and classification settings. The corresponding statement and proof can be found in Theorems B.10 and B.11.

For notation simplicity, define the prediction risk of predictor δ\delta for classification and regression tasks as c(δ):=𝔼𝒟[c(δ)]\mathcal{R}_{c}(\delta):=\mathbb{E}_{\mathcal{D}}[\ell_{c}(\delta)] and r(δ):=𝔼𝒟[r(δ)]\mathcal{R}_{r}(\delta):=\mathbb{E}_{\mathcal{D}}[\ell_{r}(\delta)], respectively. Define Σx:=ν2UU+Σ\Sigma_{x}:=\nu^{2}U^{\star}U^{\star\top}+\Sigma. We write δU,w\delta_{U,w} for δU,w\delta_{U^{\top},w} with a slight abuse of notation. For two matrices AA and BB of the same order, we define ABA\succeq B when ABA-B is positive semi-definite.

Theorem B.10 (Restatement of Theorem 3.13)

Suppose the conditions in Theorem 3.10 hold. Then, for the classification task, we have

𝔼𝒟[infwr𝔼[c(δWCL,w)]infwr𝔼[c(δU,w)]=O(r3/2dlogd+drn)1,\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star\top},w})]=O\quantity(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}})\wedge 1,

and for regression tasks,

𝔼𝒟[infwr𝔼[r(δWCL,w)]infwr𝔼[r(δU,w)]r3/2dlogd+drn.\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.
Theorem B.11 (Restatement of Theorem 3.14)

Suppose the conditions in Theorem 3.9 hold. Assume rrcr\leq r_{c} holds for some constant rc>0r_{c}>0. Additionally assume that ρ=Θ(1)\rho=\Theta(1) is sufficiently small and ndrn\gg d\gg r. Then, For the regression task,

𝔼𝒟[infwr𝔼[r(δUAE,w)]infwr𝔼[r(δU,w)]cc,\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star},w})]\geq c_{c}^{\prime},

and for classification task, if FF is differentiable at 0 and F(0)>0F^{\prime}(0)>0, then

𝔼𝒟[infwr𝔼[c(δUAE,w)]infwr𝔼[c(δU,w)]cr,\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U_{\text{AE}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{c}(\delta_{U^{\star},w})]\geq c_{r}^{\prime},

where cr>0c_{r}^{\prime}>0 and cc>0c_{c}^{\prime}>0 are constants independent of nn and dd.

The proofs of Theorem B.10 and B.11 relies on Lemma B.15, B.16, B.17, B.18 and B.20 which are proved later in this section.

Proof [Proof of Theorem B.10: Classification Task Part] Lemma B.17 gives for any U𝕆d,rU\in\mathbb{O}_{d,r},

𝔼𝒟[infwrc(δU,w)infwrc(δU,w)]\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})] (58)
((κ(1+ρ2))3+κρ2(1+ρ2)2+(κρ21)1)𝔼𝒟[sinΘ(U,U)2].\displaystyle\quad\leq((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2}+(\kappa\rho^{2}\vee 1)^{-1})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]. (59)

Substituting UUAEU\leftarrow U_{AE} combined with Assumption 3.5 and κ=O(1)\kappa=O(1) concludes the proof.  

Proof [Proof of Theorem B.10: Regression Part] Note that under Assumption 3.5 and κ=O(1)\kappa=O(1), (1+ρ2)/(1+κ1ρ2)2=O(1)(1+\rho^{-2})/(1+\kappa^{-1}\rho^{-2})^{2}=O(1). Lemma B.20 gives for any U𝕆d,rU\in\mathbb{O}_{d,r},

𝔼𝒟[infwrr(δU,w)infwrr(δU,w)]=O((1+ρ2)𝔼𝒟[sinΘ(U,U)2]w2).\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})]=O\quantity((1+\rho^{-2})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]\|w^{\star}\|^{2}). (60)

Theorem 3.10 with substitution UUAEU\leftarrow U_{AE} gives the desired result.  

Proof [Proof of Theorem B.11: Classification Part] Lemma B.16 gives that for c1:=11/(2κrc)(0,1)c_{1}:=1-1/(2\kappa r_{c})\in(0,1), we can take ndrn\gg d\gg r and sufficiently small ρ>0\rho>0 so that 𝔼𝒟[sinΘ(UAE,U)F2]c1r\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U_{AE},U^{\star})\|_{F}^{2}]\geq c_{1}r holds. By Lemma B.18,

𝔼𝒟[infwrc(δUAE,w)infwrc(δU,w)]\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U_{AE},w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]
(1+ρ2)3/2(1+κρ2)3/2ρ2(11+ρ2κ(rsinΘ(UAE,U)F2))\displaystyle\quad\gtrsim\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(r-\|\sin\Theta(U_{AE},U^{\star})\|_{F}^{2}))
(1+ρ2)3/2(1+κρ2)3/2ρ2(11+ρ2κ(1c1)r)\displaystyle\quad\geq\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(1-c_{1})r)
(1+ρ2)3/2(1+κρ2)3/2ρ2(11+ρ212),\displaystyle\quad\geq\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\frac{1}{2}), (61)

where the last inequality follows since rrcr\leq r_{c}. If we further take ρ=Θ(1)<1/2\rho=\Theta(1)<1/2, the right hand becomes a positive constant. This concludes the proof.  

Proof [Proof of Theorem B.11: Regression Part] From proposition B.19, we have

infwrr(δUAE,w)infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U_{AE},w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})
=w((I+(1/ν2)UΣU)1\displaystyle\quad={w^{\star}}^{\top}((I+(1/\nu^{2})U^{\star\top}\Sigma U^{\star})^{-1}
UUAE(UAEUUUAE+(1/ν2)UAEΣUAE)1UAEU)w.\displaystyle\quad\quad-U^{\star\top}U_{AE}(U_{AE}^{\top}U^{\star}U^{\star\top}U_{AE}+(1/\nu^{2})U_{AE}^{\top}\Sigma U_{AE})^{-1}U_{AE}^{\top}U^{\star})w^{\star}.

Thus from Lemma B.15,

infwrr(δUAE,w)infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U_{AE},w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})
(11+ρ2+ρ2κ(sinΘ(UAE,U)F2r))w2.\displaystyle\quad\geq\quantity(\frac{1}{1+\rho^{-2}}+\rho^{2}\kappa\quantity(\|\sin\Theta(U_{AE},U^{\star})\|_{F}^{2}-r))\|w^{\star}\|^{2}. (62)

Using Lemma B.16 and by the same argument in the proof of Theorem 3.14: Classification Part, we conclude the proof.  

Lemma B.12

For any U𝕆d,rU\in\mathbb{O}_{d,r},

λmin(ν2UU(UΣxU)1UU)ν2ν2+σ(1)2(1sinΘ(U,U)22).\displaystyle\lambda_{\min}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2}).

Proof  Since λmin(AC)λmin(A)λmin(C)\lambda_{\min}(AC)\geq\lambda_{\min}(A)\lambda_{\min}(C) for symmetric positive semi-definite matrices AA and CC,

λmin(ν2UU(UΣxU)1UU)\displaystyle\lambda_{\min}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})
λmin(UUUU)λmin(ν2(UΣxU)1)\displaystyle\quad\geq\lambda_{\min}(U^{\top}U^{\star}U^{\star\top}U)\lambda_{\min}(\nu^{2}(U^{\top}\Sigma_{x}U)^{-1})
λmin(I(IUUUU))ν2λmax(ν2UUUU+UΣU)\displaystyle\quad\geq\lambda_{\min}(I-(I-U^{\top}U^{\star}U^{\star\top}U))\frac{\nu^{2}}{\lambda_{\max}(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)}
ν2ν2+σ(1)2(1sinΘ(U,U)22),\displaystyle\quad\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2}),

where we used Weyl’s inequality λmin(A+C)λmin(A)C2\lambda_{\min}(A+C)\geq\lambda_{\min}(A)-\|C\|_{2} in the second inequality.  

Lemma B.13

For any U𝕆d,rU\in\mathbb{O}_{d,r},

λmax(ν2UU(UΣxU)1UU)ν2ν2(1sinΘ(U,U)2)+σ(d)2.\displaystyle\lambda_{\max}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})\leq\frac{\nu^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2})+\sigma_{(d)}^{2}}.

Proof  Since AC2A2C2\|AC\|_{2}\leq\|A\|_{2}\|C\|_{2},

λmax(ν2UU(UΣxU)1UU)\displaystyle\lambda_{\max}(\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}) λmax(ν2(UΣxU)1)\displaystyle\leq\lambda_{\max}(\nu^{2}(U^{\top}\Sigma_{x}U)^{-1})
ν2λmin(ν2UUUU+UΣU)\displaystyle\leq\frac{\nu^{2}}{\lambda_{\min}(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)}
ν2λmin(ν2Iν2(IUUUU)+UΣU)\displaystyle\leq\frac{\nu^{2}}{\lambda_{\min}(\nu^{2}I-\nu^{2}(I-U^{\top}U^{\star}U^{\star\top}U)+U^{\top}\Sigma U)}
ν2ν2(1sinΘ(U,U)2)+σ(d)2,\displaystyle\leq\frac{\nu^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2})+\sigma_{(d)}^{2}},

where we used Weyl’s inequality λmin(A+C)λmin(A)C2\lambda_{\min}(A+C)\geq\lambda_{\min}(A)-\|C\|_{2} and λmin(ν2I+UΣU)ν2+σ(d)2\lambda_{\min}(\nu^{2}I+U^{\top}\Sigma U)\geq\nu^{2}+\sigma_{(d)}^{2}.  

Lemma B.14

For any U𝕆d,rU\in\mathbb{O}_{d,r},

ν2(UΣxU)1ν2UU(UΣxU)1UU2\displaystyle\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\|_{2}
=O(11sinΘ(U,U)22+κ1ρ21+ρ21+κ1ρ2sinΘ(U,U)2).\displaystyle\quad=O\quantity(\frac{1}{1-\|\sin\Theta(U,U^{\star})\|_{2}^{2}+\kappa^{-1}\rho^{-2}}\frac{1+\rho^{-2}}{1+\kappa^{-1}\rho^{-2}}\|\sin\Theta(U,U^{\star})\|_{2}).

Proof  Observe that

(UΣxU)1UU(UΣxU)1UU2\displaystyle\|(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\|_{2}
(UΣxU)1(UΣxU)12+(UΣxU)1UU(UΣxU)1UU2\displaystyle\quad\leq\|(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-(U^{\top}\Sigma_{x}U)^{-1}\|_{2}+\|(U^{\top}\Sigma_{x}U)^{-1}-U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}{U^{\star}}\|_{2}
:=(T1)+(T2).\displaystyle\quad:=(T1)+(T2).

For the term (T1)(T1),

(T1)\displaystyle(T1) =(UΣxU)1(UΣxU)(UΣxU)1(UΣxU)1(UΣxU)(UΣxU)12\displaystyle=\|(U^{\top}\Sigma_{x}U)^{-1}(U^{\top}\Sigma_{x}U)(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-(U^{\top}\Sigma_{x}U)^{-1}(U^{\star\top}\Sigma_{x}U^{\star})(U^{\star\top}\Sigma_{x}U^{\star})^{-1}\|_{2}
(UΣxU)12UΣxUUΣxU2(UΣxU)12.\displaystyle\leq\|(U^{\top}\Sigma_{x}U)^{-1}\|_{2}\|U^{\top}\Sigma_{x}U-U^{\star\top}\Sigma_{x}U^{\star}\|_{2}\|(U^{\star\top}\Sigma_{x}U^{\star})^{-1}\|_{2}.

Note

UΣxUUΣxU2\displaystyle\|U^{\top}\Sigma_{x}U-U^{\star\top}\Sigma_{x}U^{\star}\|_{2} =ν2UUUUν2I+UΣUUΣU2\displaystyle=\|\nu^{2}U^{\top}U^{\star}U^{\star\top}U-\nu^{2}I+U^{\top}\Sigma U-U^{\star\top}\Sigma U^{\star}\|_{2}
ν2sinΘ(U,U)22+UΣ(UU)+(UU)ΣU2\displaystyle\leq\nu^{2}\|\sin\Theta(U,U^{\star})\|_{2}^{2}+\|U^{\top}\Sigma(U-U^{\star})+(U-U^{\star})^{\top}\Sigma U^{\star}\|_{2}
ν2sinΘ(U,U)22+2σ(1)2UU2.\displaystyle\leq\nu^{2}\|\sin\Theta(U,U^{\star})\|_{2}^{2}+2\sigma_{(1)}^{2}\|U-U^{\star}\|_{2}.

Also we have λmin(UΣxU)ν2(1sinΘ(U,U)22)+σ(d)2\lambda_{\min}(U^{\top}\Sigma_{x}U)\geq\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2} from the proof of Lemma B.13 and λmin(UΣxU)ν2+σ(d)2\lambda_{\min}(U^{\star\top}\Sigma_{x}U^{\star})\geq\nu^{2}+\sigma_{(d)}^{2}. Therefore

(T1)\displaystyle(T1) 1(ν2+σ(d)2)(ν2(1sinΘ(U,U)22)+σ(d)2)(ν2sinΘ(U,U)22+2σ(1)2UU2).\displaystyle\leq\frac{1}{(\nu^{2}+\sigma_{(d)}^{2})(\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2})}(\nu^{2}\|\sin\Theta(U,U^{\star})\|_{2}^{2}+2\sigma_{(1)}^{2}\|U-U^{\star}\|_{2}).

For the term (T2)(T2),

(T2)\displaystyle(T2) =(UΣxU)1U(U+(UU))(UΣxU)1(U+(UU))U2\displaystyle=\|(U^{\top}\Sigma_{x}U)^{-1}-U^{\star\top}{(U^{\star}+(U-U^{\star}))}(U^{\top}\Sigma_{x}U)^{-1}(U^{\star}+(U-U^{\star}))^{\top}U^{\star}\|_{2}
=U(UU)(UΣxU)1(UΣxU)1(UU)U\displaystyle=\|-U^{\star\top}(U-U^{\star})(U^{\top}\Sigma_{x}U)^{-1}-(U^{\top}\Sigma_{x}U)^{-1}(U-U^{\star})^{\top}U^{\star}
U(UU)(UΣxU)1(UU)U2\displaystyle\quad-U^{\star\top}{(U-U^{\star})}(U^{\top}\Sigma_{x}U)^{-1}(U-U^{\star})^{\top}U^{\star}\|_{2}
1ν2(1sinΘ(U,U)22)+σ(d)2(2UU2+UU22).\displaystyle\leq\frac{1}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2}}(2\|U-U^{\star}\|_{2}+\|U-U^{\star}\|_{2}^{2}).

From Lemma A.3, sinΘ(U,U)2UU2\|\sin\Theta(U,U^{\star})\|_{2}\leq\|U-U^{\star}\|_{2}. Finally from these results and UU222UU2\|U-U^{\star}\|_{2}^{2}\leq 2\|U-U^{\star}\|_{2},

ν2(UΣxU)1ν2UU(UΣxU)1UU2\displaystyle\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\|_{2}
=O(ν2ν2(1sinΘ(U,U)22)+σ(d)2ν2+σ(1)2ν2+σ(d)2UU2).\displaystyle\quad=O\quantity(\frac{\nu^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}+\sigma_{(d)}^{2}}\|U-U^{\star}\|_{2}).

Since LHS does not depend on the orthogonal transformation UUOU\leftarrow UO where O𝕆r,rO\in\mathbb{O}_{r,r}, we obtain

ν2(UΣxU)1ν2UU(UΣxU)1UU2\displaystyle\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\|_{2}
=O(ν2ν2(1sinΘ(U,U)22)+σ(d)2ν2+σ(1)2ν2+σ(d)2infO𝕆r,rUOU2).\displaystyle\quad=O\quantity(\frac{\nu^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}+\sigma_{(d)}^{2}}\inf_{O\in\mathbb{O}_{r,r}}\|UO-U^{\star}\|_{2}).

Combined again with Lemma A.3, we obtain the desired result.  

Lemma B.15

For any U𝕆d,rU\in\mathbb{O}_{d,r},

λmin(ν2(UΣxU)1ν2UU(UΣxU)1UU)\displaystyle\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})
ν2ν2+σ(1)2ν2σ(d)2(rsinΘ(U,U)F2).\displaystyle\quad\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2}).

Proof  Observe

λmin(ν2(UΣxU)1ν2UU(UΣxU)1UU)\displaystyle\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})
λmin((I+(1/ν2)UΣU)1)UU(UUUU+(1/ν2)UΣU)1UU2.\displaystyle\quad\geq\lambda_{\min}((I+(1/\nu^{2})U^{\star\top}\Sigma U^{\star})^{-1})-\|U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star}\|_{2}.

Since UUUU0U^{\top}U^{\star}U^{\star\top}U\succeq 0, it follows that (UUUU+(1/ν2)UΣU)1ν2(UΣU)1(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}\preceq\nu^{2}(U^{\top}\Sigma U)^{-1}. Thus

UU(UUUU+(1/ν2)UΣU)1UU2\displaystyle\|U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star}\|_{2}
ν2λmax((UΣU)1)UU22\displaystyle\quad\leq\nu^{2}\lambda_{\max}((U^{\top}\Sigma U)^{-1})\|U^{\star\top}U\|^{2}_{2}
ν2σ(d)2UUF2\displaystyle\quad\leq\frac{\nu^{2}}{\sigma_{(d)}^{2}}\|U^{\star\top}U\|^{2}_{F}
=ν2σ(d)2(rsinΘ(U,U)F2),\displaystyle\quad=\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2}),

where we used λmax((UΣU)1)1/λmin(UΣU)1/σ(d)2\lambda_{\max}((U^{\top}\Sigma U)^{-1})\leq 1/\lambda_{\min}(U^{\top}\Sigma U)\leq 1/\sigma_{(d)}^{2} and sinΘ(U1,U2)F2=rU1U2F2\left\|\sin\Theta\left(U_{1},U_{2}\right)\right\|_{F}^{2}=r-\left\|U_{1}^{\top}U_{2}\right\|_{F}^{2} from Proposition A.1. Combined with Lemma B.13, we obtain

λmin(ν2(UΣxU)1ν2UU(UΣxU)1UU)\displaystyle\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})
ν2ν2+σ(1)2ν2σ(d)2(rsinΘ(U,U)F2).\displaystyle\quad\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2}).

 

Lemma B.16

Suppose the conditions in Theorem 3.9 hold. Fix c1(0,1)c_{1}\in(0,1). There exists a constant c2>0c_{2}>0 such that if rlogd/dρ2d/n<c2\sqrt{r\log d/d}\vee\rho^{2}\vee d/n<c_{2}, then,

𝔼𝒟sinΘ(UAE,U)F2\displaystyle\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{\text{AE}},U^{\star})\|_{F}^{2} c1r,\displaystyle\geq c_{1}r,

where c1(0,1)c_{1}\in(0,1) is a universal constant.

Proof  By Cauchy-Schwartz inequality,

𝔼𝒟sinΘ(UAE,U)F2r\displaystyle\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{AE},U^{\star})\|_{F}^{2}-r
(𝔼𝒟sinΘ(UAE,U)F)2r\displaystyle\quad\geq(\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{AE},U^{\star})\|_{F})^{2}-r
=(𝔼𝒟sinΘ(UAE,U)Fr)(𝔼𝒟sinΘ(UAE,U)F+r).\displaystyle\quad=(\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{AE},U^{\star})\|_{F}-\sqrt{r})\quantity(\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{AE},U^{\star})\|_{F}+\sqrt{r}).

From Theorem 3.9, there exists a constant c3>0c_{3}>0 such that we have

𝔼𝒟sinΘ(U,UAE)Frc3rdlogdc3r(ρ2+dn+ρdn).\mathbb{E}_{\mathcal{D}}\left\|\sin\Theta\left(U^{\star},U_{AE}\right)\right\|_{F}\geq\sqrt{r}-c_{3}\frac{r}{\sqrt{d}}\sqrt{\log d}-c_{3}\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}).

Therefore combined with a trivial bound sinΘ(UAE,U)Fr\|\sin\Theta(U_{AE},U^{\star})\|_{F}\leq\sqrt{r},

𝔼𝒟sinΘ(UAE,U)F2r\displaystyle\mathbb{E}_{\mathcal{D}}\|\sin\Theta(U_{AE},U^{\star})\|_{F}^{2}-r rc3r1/2dlogd+ρ2+dn+ρdn\displaystyle\geq-rc_{3}\frac{r^{1/2}}{\sqrt{d}}\sqrt{\log d}+\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}
rc3(2r1/2dlogd6ρ26dn),.\displaystyle\geq-rc_{3}\quantity(2\frac{r^{1/2}}{\sqrt{d}}\sqrt{\log d}\vee 6\rho^{2}\vee 6\sqrt{\frac{d}{n}}),.

where we used ρd/nρ2d/nρ2d/n\rho\sqrt{d/n}\leq\rho^{2}\vee d/n\leq\rho^{2}\vee\sqrt{d/n} since d<nd<n. Thus we can take c2=6(1c1)/c3c_{2}=6(1-c_{1})/c_{3}. This concludes the proof.  

Lemma B.17

For any U𝕆d,rU\in\mathbb{O}_{d,r},

𝔼𝒟[infwrc(δU,w)infwrc(δU,w)]\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]
((κ(1+ρ2))3+κρ2(1+ρ2)2+(κρ21)1)𝔼𝒟[sinΘ(U,U)2].\displaystyle\quad\leq((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2}+(\kappa\rho^{2}\vee 1)^{-1})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}].

Proof  Recall that we are considering the class of linear classifiers {δU,w:wr}\{\delta_{U,w}:w\in\mathbb{R}^{r}\}, where δU,w(xˇ)=𝕀{F(xˇUw)>1/2}\delta_{U,w}(\check{x})=\mathbb{I}\{F(\check{x}^{\top}Uw)>1/2\}. For notational simplicity, write β:=Uw\beta:=Uw and β:=Uw\beta^{\star}:=U^{\star}w^{\star}.

c(δU,w)=(δU,w(xˇ)yˇ)=(yˇ=0,F(xˇβ)>1/2)+(yˇ=1,F(xˇβ)1/2).\displaystyle\mathcal{R}_{c}(\delta_{U,w})=\mathbb{P}_{\mathcal{E}}(\delta_{U,w}(\check{x})\neq\check{y})=\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)+\mathbb{P}_{\mathcal{E}}(\check{y}=1,F(\check{x}^{\top}\beta)\leq 1/2).

Since F(0)=1/2F(0)=1/2 and FF is monotone increasing, the false positive probability becomes

(yˇ=0,F(xˇβ)>1/2)\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2) =(yˇ=0,xˇβ>0)\displaystyle=\mathbb{P}_{\mathcal{E}}(\check{y}=0,\check{x}^{\top}\beta>0)
=𝔼[𝔼[𝕀{yˇ=0}|xˇ,zˇ]𝕀{xˇβ>0}]\displaystyle=\mathbb{E}_{\mathcal{E}}[\mathbb{E}_{\mathcal{E}}[\mathbb{I}\{\check{y}=0\}|\check{x},\check{z}]\mathbb{I}\{\check{x}^{\top}\beta>0\}]
=𝔼[(1F(ν1zˇUβ))𝕀{xˇβ>0}].\displaystyle=\mathbb{E}_{\mathcal{E}}[(1-F(\nu^{-1}\check{z}^{\top}U^{\star\top}\beta^{\star}))\mathbb{I}\{\check{x}^{\top}\beta>0\}].

Write ω:=xˇβ\omega:=\check{x}^{\top}\beta and ω:=ν1zˇUβ\omega^{\star}:=\nu^{-1}\check{z}^{\top}U^{\star\top}\beta^{\star}. From assumption, (ω,ω)(\omega^{\star},\omega) jointly follows a normal distribution with mean 0. Write v2:=Var(ω)=ww{v^{\star}}^{2}:=\mathrm{Var}(\omega^{\star})={w^{\star}}^{\top}w^{\star}, v2:=Var(ω)=βΣxβv^{2}:=\mathrm{Var}(\omega)=\beta^{\top}\Sigma_{x}\beta, where Σx:=ν2UU+Σ\Sigma_{x}:=\nu^{2}U^{\star}U^{\star\top}+\Sigma. Let τ:=Cor(ω,ω)=νwUβ/(vv)\tau:=\text{Cor}(\omega^{\star},\omega)=\nu{w^{\star}}^{\top}U^{\star\top}\beta/(v^{\star}v). By a formula for conditional normal distribution, we have ω|ωN(τvω/v,v2(1τ2))\omega|\omega^{\star}\sim N(\tau v\omega^{\star}/v^{\star},v^{2}(1-\tau^{2})). This gives

(yˇ=0,F(xˇβ)>1/2)\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)
=𝔼[(1F(ω))𝕀{ω>0}]\displaystyle\quad=\mathbb{E}_{\mathcal{E}}[(1-F(\omega^{\star}))\mathbb{I}\{\omega>0\}]
=𝔼[(1F(ω))𝔼[𝕀{ω>0}|ω]]\displaystyle\quad=\mathbb{E}_{\mathcal{E}}[(1-F(\omega^{\star}))\mathbb{E}_{\mathcal{E}}[\mathbb{I}\{\omega>0\}|\omega^{\star}]]
=𝔼[(1F(ω))(ω>0|ω)]\displaystyle\quad=\mathbb{E}_{\mathcal{E}}[(1-F(\omega^{\star}))\mathbb{P}_{\mathcal{E}}(\omega>0|\omega^{\star})]
=𝔼[(1F(ω))(ωτvω/vv(1τ2)1/2>τvω/vv(1τ2)1/2|ω)]\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\mathbb{P}_{\mathcal{E}}\quantity(\frac{\omega-\tau v\omega^{\star}/v^{\star}}{v(1-\tau^{2})^{1/2}}>-\frac{\tau v\omega^{\star}/v^{\star}}{v(1-\tau^{2})^{1/2}}\middle|\omega^{\star})]
=𝔼[(1F(ω))Φ(αω/v)]\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})]
=𝔼[(1F(ω))Φ(αω/v)𝕀{ω>0}]+𝔼[(1F(ω))Φ(αω/v)𝕀{ω<0}],\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}]+\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}<0\}],

where Φ\Phi is cumulative distribution function of N(0,1)N(0,1) and α:=τ/(1τ2)1/2\alpha:=\tau/(1-\tau^{2})^{1/2}. We define ΨF\Psi_{F} as ΨF(s2):=2EuN(0,s2)[F(u)𝕀{u>0}]\Psi_{F}(s^{2}):=2E_{u\sim N(0,s^{2})}[F(u)\mathbb{I}\{u>0\}]. When F(u)=1/(1+eu)F(u)=1/(1+e^{-u}), ΨF(s2)\Psi_{F}(s^{2}) is called the logistic-normal integral, whose analytical form is not known (Pirjol, 2013). Since a random variable ω\omega^{\star} is symmetric about mean 0 and F(u)=1F(u)F(u)=1-F(-u),

𝔼[(1F(ω))Φ(αω/v)𝕀{ω<0}]\displaystyle\mathbb{E}_{\mathcal{E}}\quantity[(1-F(\omega^{\star}))\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}<0\}] =𝔼[(1F(ω))(1Φ(αω/v))𝕀{ω>0}]\displaystyle=\mathbb{E}_{\mathcal{E}}\quantity[(1-F(-\omega^{\star}))\quantity(1-\Phi(\alpha\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}]
=𝔼[F(ω)(1Φ(αω/v))𝕀{ω>0}].\displaystyle=\mathbb{E}_{\mathcal{E}}\quantity[F(\omega^{\star})\quantity(1-\Phi(\alpha\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}].

Hence

(yˇ=0,F(xˇβ)>1/2)\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2)
=𝔼[(Φ(αω/v)+F(ω)2F(ω)Φ(αω/v))𝕀{ω>0}]\displaystyle\quad=\mathbb{E}_{\mathcal{E}}\quantity[(\Phi(\alpha\omega^{\star}/v^{\star})+F(\omega^{\star})-2F(\omega^{\star})\Phi(\alpha\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}]
=12ΨF(v2)𝔼[(2F(ω)1)Φ(αω/v)𝕀{ω>0}].\displaystyle\quad=\frac{1}{2}\Psi_{F}({v^{\star}}^{2})-\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}].

Note that the true negative probability is exactly the same as the false positive probability under our settings:

(yˇ=1,F(xˇβ)1/2)\displaystyle\mathbb{P}_{\mathcal{E}}(\check{y}=1,F(\check{x}^{\top}\beta)\leq 1/2) =𝔼[F(xˇβ)𝕀{xˇβ0}]\displaystyle=\mathbb{E}_{\mathcal{E}}[F(\check{x}^{\top}\beta^{\star})\mathbb{I}\{\check{x}^{\top}\beta\leq 0\}]
=𝔼[F(xˇβ)𝕀{xˇβ0}]\displaystyle=\mathbb{E}_{\mathcal{E}}[F(-\check{x}^{\top}\beta^{\star})\mathbb{I}\{\check{x}^{\top}\beta\geq 0\}]
=𝔼[(1F(xˇβ))𝕀{xˇβ0}]\displaystyle=\mathbb{E}_{\mathcal{E}}[(1-F(\check{x}^{\top}\beta^{\star}))\mathbb{I}\{\check{x}^{\top}\beta\geq 0\}]
=(yˇ=0,F(xˇβ)>1/2).\displaystyle=\mathbb{P}_{\mathcal{E}}(\check{y}=0,F(\check{x}^{\top}\beta)>1/2).

Therefore

c(δU,w)\displaystyle\mathcal{R}_{c}(\delta_{U,w}) =ΨF(v2)2𝔼[(2F(ω)1)Φ(αω/v)𝕀{ω>0}].\displaystyle=\Psi_{F}({v^{\star}}^{2})-2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}].

Let

τmax,U\displaystyle\tau_{\max,U} :=supwrνwUUw/(wwwUΣxUw)1/2,\displaystyle:=\sup_{w\in\mathbb{R}^{r}}\nu{w^{\star}}^{\top}U^{\star\top}Uw/({w^{\star}}^{\top}w^{\star}w^{\top}U^{\top}\Sigma_{x}Uw)^{1/2},
τmax,U\displaystyle\tau_{\max,U^{\star}} :=supwrνww/(wwwUΣxUw)1/2.\displaystyle:=\sup_{w\in\mathbb{R}^{r}}\nu{w^{\star}}^{\top}w/({w^{\star}}^{\top}w^{\star}w^{\top}U^{\star\top}\Sigma_{x}U^{\star}w)^{1/2}.

From Cauchy-Schwartz inequality,

τmax,U2\displaystyle\tau_{\max,U}^{2} =ν2wUU(UΣxU)1UUwww,\displaystyle=\frac{\nu^{2}{w^{\star}}^{\top}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}w^{\star}}{{w^{\star}}^{\top}w^{\star}},
τmax,U2\displaystyle\tau_{\max,U^{\star}}^{2} =ν2w(UΣxU)1www.\displaystyle=\frac{\nu^{2}{w^{\star}}^{\top}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}w^{\star}}{{w^{\star}}^{\top}w^{\star}}.

Define αmax,U:=τmax,U/(1τmax,U2)1/2\alpha_{\max,U}:=\tau_{\max,U}/(1-\tau_{\max,U}^{2})^{1/2} and αmax,U:=τmax,U/(1τmax,U2)1/2\alpha_{\max,U^{\star}}:=\tau_{\max,U^{\star}}/(1-\tau_{\max,U^{\star}}^{2})^{1/2}. Then, since on the event where ω>0\omega^{\star}>0, αΦ(αω/v)\alpha\mapsto\Phi(\alpha\omega^{\star}/v^{\star}) is monotone increasing and 2F(w)12F(w^{\star})-1 is non-negative, we have

infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w}) =ΨF(v2)2𝔼[(2F(ω)1)Φ(αmax,Uω/v)𝕀{ω>0}]\displaystyle=\Psi_{F}({v^{\star}}^{2})-2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}]
infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w}) =ΨF(v2)2𝔼[(2F(ω)1)Φ(αmax,Uω/v)𝕀{ω>0}].\displaystyle=\Psi_{F}({v^{\star}}^{2})-2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}].

This yields

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
=2𝔼[(2F(ω)1)(Φ(αmax,Uω/v)Φ(αmax,Uω/v))𝕀{ω>0}].\displaystyle\quad=2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)(\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}].

Note that for any a,b0a,b\geq 0,

|Φ(b)Φ(a)|ϕ(ab)|ba|,\displaystyle|\Phi(b)-\Phi(a)|\leq\phi(a\wedge b)|b-a|,

where ϕ\phi is a density function of standard normal distribution. Observe

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
2𝔼[(2F(ω)1)|Φ(αmax,Uω/v)Φ(αmax,Uω/v)|𝕀{ω>0}]\displaystyle\quad\leq 2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)|\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star})|\mathbb{I}\{\omega^{\star}>0\}]
2v0(2F(ω)1)|αmax,Uαmax,U|ωϕ((αmax,Uαmax,U)ω/v)ϕ(ω/v)vdω\displaystyle\quad\lesssim\frac{2}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)|\alpha_{\max,U^{\star}}-\alpha_{\max,U}|\omega^{\star}\phi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})\omega^{\star}/v^{\star})\frac{\phi(\omega^{\star}/v^{\star})}{v^{\star}}\differential{\omega^{\star}}
|αmax,Uαmax,U|v0(2F(ω)1)ϕ((αmax,Uαmax,U)ω/v)dω\displaystyle\quad\lesssim\frac{|\alpha_{\max,U^{\star}}-\alpha_{\max,U}|}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\phi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})\omega^{\star}/v^{\star})\differential{\omega^{\star}}
=|αmax,Uαmax,U|αmax,Uαmax,U0(2F(ω)1)exp(1/(2((αmax,Uαmax,U)2v2))ω2)2π((αmax,Uαmax,U)2v2)dω\displaystyle\quad=\frac{|\alpha_{\max,U^{\star}}-\alpha_{\max,U}|}{\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\frac{\exp(-1/(2((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})^{-2}{v^{\star}}^{2})){\omega^{\star}}^{2})}{\sqrt{2\pi((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U})^{-2}{v^{\star}}^{2})}}\differential{\omega^{\star}}
=|αmax,Uαmax,U|αmax,Uαmax,U(ΨF(((αmax,Uαmax,U)2v2))1/2),\displaystyle\quad=\frac{|\alpha_{\max,U^{\star}}-\alpha_{\max,U}|}{\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U}}(\Psi_{F}(((\alpha_{\max,U^{\star}}\wedge\alpha_{\max,U^{\star}})^{-2}{v^{\star}}^{2}))-1/2),

where we used supu>0uϕ(u)<\sup_{u>0}u\phi(u)<\infty. Since (ab)=(a2b2)/(a+b)(a2b2)/(ab)(a-b)=(a^{2}-b^{2})/(a+b)\leq(a^{2}-b^{2})/(a\wedge b) for a,b>0a,b>0, and ΨF1\Psi_{F}\leq 1, we obtain

infwrc(δU,w)infwrc(δU,w)|αmax,U2αmax,U2|αmax,U2αmax,U2.\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})\lesssim\frac{|\alpha_{\max,U^{\star}}^{2}-\alpha_{\max,U}^{2}|}{\alpha_{\max,U^{\star}}^{2}\wedge\alpha_{\max,U}^{2}}.

When τmax,Uτmax,U\tau_{\max,U^{\star}}\geq\tau_{\max,U}, since ττ2/(1τ2)\tau\mapsto\tau^{2}/(1-\tau^{2}) is increasing in τ>0\tau>0,

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w}) αmax,U2αmax,U2αmax,U2\displaystyle\lesssim\frac{\alpha_{\max,U^{\star}}^{2}-\alpha_{\max,U}^{2}}{\alpha_{\max,U}^{2}}
=τmax,U2τmax,U2(1τmax,U2)τmax,U2.\displaystyle=\frac{\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}}{(1-\tau_{\max,U^{\star}}^{2})\tau_{\max,U}^{2}}. (63)

From Lemma B.12 and B.13, we have

ν2ν2+σ(1)2(1sinΘ(U,U)22)τmax,U2\displaystyle\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})\leq\tau_{\max,U}^{2} ν2ν2(1sinΘ(U,U)22)+σ(d)2,\displaystyle\leq\frac{\nu^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2}},
ν2ν2+σ(1)2τmax,U2\displaystyle\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}\leq\tau_{\max,U^{\star}}^{2} ν2ν2+σ(d)2.\displaystyle\leq\frac{\nu^{2}}{\nu^{2}+\sigma_{(d)}^{2}}. (64)

Then, Equation (63) becomes

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
ν2+σ(d)2σ(d)2ν2+σ(1)2ν2(1sinΘ(U,U)22)(τmax,U2τmax,U2)\displaystyle\quad\lesssim\frac{\nu^{2}+\sigma_{(d)}^{2}}{\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})}(\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2})
ν2+σ(d)2σ(d)2ν2+σ(1)2ν2(1sinΘ(U,U)22)ν2(UΣxU)1ν2UU(UΣxU)1UU2\displaystyle\quad\leq\frac{\nu^{2}+\sigma_{(d)}^{2}}{\sigma_{(d)}^{2}}\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})}\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\|_{2}
(κρ2+1)(ρ2+1)2(1+κ1ρ2)(1sinΘ(U,U)22)2sinΘ(U,U)2\displaystyle\quad\leq\frac{(\kappa\rho^{2}+1)(\rho^{-2}+1)^{2}}{(1+\kappa^{-1}\rho^{-2})(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})^{2}}\|\sin\Theta(U,U^{\star})\|_{2}
=κρ2(ρ2+1)2(1sinΘ(U,U)22)2sinΘ(U,U)2.\displaystyle\quad=\frac{\kappa\rho^{2}(\rho^{-2}+1)^{2}}{(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})^{2}}\|\sin\Theta(U,U^{\star})\|_{2}.

where the last inequality follows from Lemma B.14.

On the event where sinΘ(U,U)221/2\|\sin\Theta(U,U^{\star})\|_{2}^{2}\leq 1/2,

infwrc(δU,w)infwrc(δU,w)κρ2(1+ρ2)2sinΘ(U,U)2.\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})\lesssim\kappa\rho^{2}(1+\rho^{-2})^{2}\|\sin\Theta(U,U^{\star})\|_{2}.

When τmax,U<τmax,U\tau_{\max,U^{\star}}<\tau_{\max,U}, on the event where sinΘ(U,U)2κ1ρ2/2\|\sin\Theta(U,U^{\star})\|_{2}\leq\kappa^{-1}\rho^{-2}/2,

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
ν2+σ(1)2ν2ν2(1sinΘ(U,U)22)+σ(d)2ν2sinΘ(U,U)22+σ(d)2(τmax,U2τmax,U2)\displaystyle\quad\lesssim\frac{\nu^{2}+\sigma_{(1)}^{2}}{\nu^{2}}\frac{\nu^{2}(1-\|\sin\Theta(U,U^{\star})\|_{2}^{2})+\sigma_{(d)}^{2}}{-\nu^{2}\|\sin\Theta(U,U^{\star})\|_{2}^{2}+\sigma_{(d)}^{2}}(\tau_{\max,U}^{2}-\tau_{\max,U^{\star}}^{2})
(ν2+σ(1)2)2ν21ν2sinΘ(U,U)22+σ(d)2\displaystyle\quad\leq\frac{(\nu^{2}+\sigma_{(1)}^{2})^{2}}{\nu^{2}}\frac{1}{-\nu^{2}\|\sin\Theta(U,U^{\star})\|_{2}^{2}+\sigma_{(d)}^{2}}
×ν2(UΣxU)1ν2UU(UΣxU)1UU2\displaystyle\quad\quad\times\|\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star}\|_{2}
(1+ρ2)3(sinΘ(U,U)22+κ1ρ2)3sinΘ(U,U)2\displaystyle\quad\leq\frac{(1+\rho^{-2})^{3}}{(-\|\sin\Theta(U,U^{\star})\|_{2}^{2}+\kappa^{-1}\rho^{-2})^{3}}\|\sin\Theta(U,U^{\star})\|_{2}
(κ(1+ρ2))3sinΘ(U,U)2,\displaystyle\quad\lesssim(\kappa(1+\rho^{2}))^{3}\|\sin\Theta(U,U^{\star})\|_{2},

where we used Lemma B.14 again.

In summary, on the event where sinΘ(U,U)2κ1ρ2/21/2\|\sin\Theta(U,U^{\star})\|_{2}\leq\kappa^{-1}\rho^{-2}/2\wedge 1/2,

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
((κ(1+ρ2))3+κρ2(1+ρ2)2)sinΘ(U,U)2.\displaystyle\quad\lesssim((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2})\|\sin\Theta(U,U^{\star})\|_{2}.

On the other hand, on the event where sinΘ(U,U)2>κ1ρ2/21/2\|\sin\Theta(U,U^{\star})\|_{2}>\kappa^{-1}\rho^{-2}/2\wedge 1/2, we have a trivial inequality infwrc(δU,w)infwrc(δU,w)1\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})\leq 1. This gives

𝔼𝒟[infwrc(δU,w)infwrc(δU,w)]\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})]
((κ(1+ρ2))3+κρ2(1+ρ2)2)𝔼𝒟[sinΘ(U,U)2]\displaystyle\quad\lesssim((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]
+𝒟(sinΘ(U,U)2>κ1ρ2/21/2)\displaystyle\quad\quad+\mathbb{P}_{\mathcal{D}}(\|\sin\Theta(U,U^{\star})\|_{2}>\kappa^{-1}\rho^{-2}/2\wedge 1/2)
((κ(1+ρ2))3+κρ2(1+ρ2)2+(κρ21))𝔼𝒟[sinΘ(U,U)2],\displaystyle\quad\lesssim((\kappa(1+\rho^{2}))^{3}+\kappa\rho^{2}(1+\rho^{-2})^{2}+(\kappa\rho^{2}\vee 1))\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}],

where the last inequality follows from Markov’s inequality.  

Lemma B.18

Suppose U𝕆d,rU\in\mathbb{O}_{d,r} satisfies 1/(1+ρ2)κ(rsinΘ(U,U)F2)01/(1+\rho^{2})-\kappa(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2})\geq 0. Then,

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
(1+ρ2)3/2(1+κρ2)3/2ρ2(11+ρ2κ(rsinΘ(U,U)F2)).\displaystyle\quad\gtrsim\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2})).

Proof  We firstly bound the term τmax,U2τmax,U2\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}. From Lemma B.15,

τmax,U2τmax,U2\displaystyle\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2} λmin(ν2(UΣxU)1ν2UU(UΣxU)1UU)\displaystyle\geq\lambda_{\min}(\nu^{2}(U^{\star\top}\Sigma_{x}U^{\star})^{-1}-\nu^{2}U^{\star\top}U(U^{\top}\Sigma_{x}U)^{-1}U^{\top}U^{\star})
ν2ν2+σ(1)2ν2σ(d)2(rsinΘ(U,U)F2).\displaystyle\geq\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2}). (65)

From assumption, RHS of Equation (65) is non-negative. Then using the inequality ab=(a2b2)/(a+b)(a2b2)/(2a)a-b=(a^{2}-b^{2})/(a+b)\geq(a^{2}-b^{2})/(2a) for ab0a\geq b\geq 0,

αmax,Uαmax,U\displaystyle\alpha_{\max,U^{\star}}-\alpha_{\max,U} 1αmax,U(αmax,U2αmax,U2)\displaystyle\gtrsim\frac{1}{\alpha_{\max,U^{\star}}}(\alpha_{\max,U^{\star}}^{2}-\alpha_{\max,U^{\star}}^{2})
(1τmax,U2)1/2τmax,Uτmax,U2τmax,U2(1τmax,U2)(1τmax,U2).\displaystyle\geq\frac{(1-\tau_{\max,U^{\star}}^{2})^{1/2}}{\tau_{\max,U^{\star}}}\frac{\tau_{\max,U^{\star}}^{2}-\tau_{\max,U}^{2}}{(1-\tau_{\max,U^{\star}}^{2})(1-\tau_{\max,U}^{2})}.

From Equation (64) and Equation (65),

αmax,Uαmax,U\displaystyle\alpha_{\max,U^{\star}}-\alpha_{\max,U}
(ν2+σ(d)2ν2)1/2(ν2+σ(1)2σ(1)2)3/2(ν2ν2+σ(1)2ν2σ(d)2(rsinΘ(U,U)F2))\displaystyle\quad\gtrsim\quantity(\frac{\nu^{2}+\sigma_{(d)}^{2}}{\nu^{2}})^{1/2}\quantity(\frac{\nu^{2}+\sigma_{(1)}^{2}}{\sigma_{(1)}^{2}})^{3/2}\quantity(\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2}))
=(1+κ1ρ2)1/2(1+ρ2)3/2(ν2ν2+σ(1)2ν2σ(d)2(rsinΘ(U,U)F2)).\displaystyle\quad=(1+\kappa^{-1}\rho^{-2})^{1/2}(1+\rho^{2})^{3/2}\quantity(\frac{\nu^{2}}{\nu^{2}+\sigma_{(1)}^{2}}-\frac{\nu^{2}}{\sigma_{(d)}^{2}}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2})). (66)

From the proof of Lemma B.17,

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
=2𝔼[(2F(ω)1)(Φ(αmax,Uω/v)Φ(αmax,Uω/v))𝕀{ω>0}].\displaystyle\quad=2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)(\Phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})-\Phi(\alpha_{\max,U}\omega^{\star}/v^{\star}))\mathbb{I}\{\omega^{\star}>0\}].

Note that for any ba0b\geq a\geq 0, Φ(b)Φ(a)ϕ(b)(ba)\Phi(b)-\Phi(a)\geq\phi(b)(b-a). Since we assume RHS of Equation (65) is positive, αmax,Uαmax,U\alpha_{\max,U^{\star}}\geq\alpha_{\max,U}. Thus on the event where ω>0\omega^{\star}>0, αmax,Uω/vαmax,Uω/v\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star}\geq\alpha_{\max,U}\omega^{\star}/v^{\star}. Observe

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
2𝔼[(2F(ω)1)ϕ(αmax,Uω/v)(αmax,Uω/vαmax,Uω/v)𝕀{ω>0}]\displaystyle\quad\geq 2\mathbb{E}_{\mathcal{E}}\quantity[(2F(\omega^{\star})-1)\phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star}-\alpha_{\max,U}\omega^{\star}/v^{\star})\mathbb{I}\{\omega^{\star}>0\}]
=2v(αmax,Uαmax,U)0(2F(ω)1)ωϕ(ω/v)vϕ(αmax,Uω/v)dω\displaystyle\quad=\frac{2}{v^{\star}}(\alpha_{\max,U^{\star}}-\alpha_{\max,U})\int_{0}^{\infty}(2F(\omega^{\star})-1)\omega^{\star}\frac{\phi(\omega^{\star}/v^{\star})}{v^{\star}}\phi(\alpha_{\max,U^{\star}}\omega^{\star}/v^{\star})\differential{\omega^{\star}}
αmax,Uαmax,Uv0(2F(ω)1)ωexp((1/2)(1+αmax,U2)ω2/v2)dω\displaystyle\quad\simeq\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{v^{\star}}\int_{0}^{\infty}(2F(\omega^{\star})-1)\omega^{\star}\exp(-(1/2)(1+\alpha_{\max,U^{\star}}^{2}){\omega^{\star}}^{2}/{v^{\star}}^{2})\differential{\omega^{\star}}
αmax,Uαmax,U1+αmax,U20(2F((1+αmax,U2)1/2vω)1)ωexp((1/2)ω2)dω,\displaystyle\quad\simeq\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}\int_{0}^{\infty}(2F((1+\alpha_{\max,U^{\star}}^{2})^{-1/2}v^{\star}\omega^{\star})-1)\omega^{\star}\exp(-(1/2){\omega^{\star}}^{2})\differential{\omega^{\star}},

where in the last equality we transformed w(1+αmax,U2)1/2w/vw^{\star}\to(1+\alpha_{\max,U^{\star}}^{2})^{1/2}w^{\star}/v^{\star}. Since F(u)F(u) is differentiable at 0 and F(0)=1/2F(0)=1/2,

F(u)1/2=F(0)u+o(u).\displaystyle F(u)-1/2=F^{\prime}(0)u+o(u).

Thus there exists a constant ϵ>0\epsilon>0 only depending on FF such that 2(F(u)1/2)F(0)u2(F(u)-1/2)\geq F^{\prime}(0)u for all u[0,ϵ]u\in[0,\epsilon] since F(0)>0F^{\prime}(0)>0. This gives

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
αmax,Uαmax,U1+αmax,U2F(0)(1+αmax,U2)1/2v\displaystyle\quad\gtrsim\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}F^{\prime}(0)(1+\alpha_{\max,U^{\star}}^{2})^{-1/2}v^{\star}
×0ϵ(1+αmax,U2)1/2vω2exp((1/2)ω2)dω\displaystyle\quad\quad\times\int_{0}^{\epsilon(1+\alpha_{\max,U^{\star}}^{2})^{1/2}v^{\star}}{\omega^{\star}}^{2}\exp(-(1/2){\omega^{\star}}^{2})\differential{\omega^{\star}}
αmax,Uαmax,U1+αmax,U2(1+αmax,U2)1/2v0ϵvω2exp((1/2)ω2)dω\displaystyle\quad\gtrsim\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}(1+\alpha_{\max,U^{\star}}^{2})^{-1/2}v^{\star}\int_{0}^{\epsilon v^{\star}}{\omega^{\star}}^{2}\exp(-(1/2){\omega^{\star}}^{2})\differential{\omega^{\star}}
αmax,Uαmax,U1+αmax,U2(1+αmax,U2)1/2.\displaystyle\quad\gtrsim\frac{\alpha_{\max,U^{\star}}-\alpha_{\max,U}}{1+\alpha_{\max,U^{\star}}^{2}}(1+\alpha_{\max,U^{\star}}^{2})^{-1/2}.

The last inequality follows since v=w=1v^{\star}=\|w^{\star}\|=1 by assumption. It is noted that αmax,U2ν2/σ(d)2\alpha_{\max,U^{\star}}^{2}\leq\nu^{2}/\sigma_{(d)}^{2} from Equation (64). Therefore with Equation (66),

infwrc(δU,w)infwrc(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{c}(\delta_{U^{\star},w})
1(1+κρ2)3/2(1+κ1ρ2)1/2(1+ρ2)3/2(11+ρ2κρ2(rsinΘ(U,U)F2))\displaystyle\quad\gtrsim\frac{1}{(1+\kappa\rho^{2})^{3/2}}(1+\kappa^{-1}\rho^{-2})^{1/2}(1+\rho^{2})^{3/2}\quantity(\frac{1}{1+\rho^{-2}}-\kappa\rho^{2}(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2}))
(1+ρ2)3/2(1+κρ2)3/2ρ2(11+ρ2κ(rsinΘ(U,U)F2)).\displaystyle\quad\gtrsim\frac{(1+\rho^{2})^{3/2}}{(1+\kappa\rho^{2})^{3/2}}\rho^{2}\quantity(\frac{1}{1+\rho^{2}}-\kappa(r-\|\sin\Theta(U,U^{\star})\|_{F}^{2})).

 

Proposition B.19

For any U𝕆d,rU\in\mathbb{O}_{d,r},

infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w}) =ν2w(Iν2UU(ν2UUUU+UΣU)1UU)w+σϵ2.\displaystyle=\nu^{2}{w^{\star}}^{\top}(I-\nu^{2}U^{\star\top}U(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)^{-1}U^{\top}U^{\star})w^{\star}+\sigma_{\epsilon}^{2}.

Proof [Proof of Proposition B.19] Generate random variables (xˇ,zˇ,ξˇ,ϵˇ)(\check{x},\check{z},\check{\xi},\check{\epsilon}) following the model (19). We calculate the prediction risk of δU,w\delta_{U,w} as:

r(δU,w)\displaystyle\mathcal{R}_{r}(\delta_{U,w}) :=𝔼(yˇxˇUw)2\displaystyle:=\mathbb{E}_{\mathcal{E}}(\check{y}-\check{x}^{\top}Uw)^{2}
\displaystyle=\mathrm{Var}_{\mathcal{E}}(\nu^{-1}\check{z}^{\top}w^{\star}+\check{\epsilon})-2\mathrm{Cov}_{\mathcal{E}}(\nu^{-1}\check{z}^{\top}w^{\star}+\check{\epsilon},U^{\star}\check{z}+\check{\xi})Uw
+wUVar(Uzˇ+ξˇ)Uw\displaystyle\quad+w^{\top}U^{\top}\mathrm{Var}_{\mathcal{E}}(U^{\star}\check{z}+\check{\xi})Uw
=w2+σϵ22νwUUw+w(ν2UUUU+UΣU)w\displaystyle=\|w^{\star}\|^{2}+\sigma_{\epsilon}^{2}-2\nu{w^{\star}}^{\top}U^{\star\top}Uw+w^{\top}(\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U)w
=(wA1b)A(wA1b)bA1b+w2+σϵ2,\displaystyle=(w-A^{-1}b)^{\top}A(w-A^{-1}b)-b^{\top}A^{-1}b+\|w^{\star}\|^{2}+\sigma_{\epsilon}^{2},

where A:=ν2UUUU+UΣUA:=\nu^{2}U^{\top}U^{\star}U^{\star\top}U+U^{\top}\Sigma U and b:=νUUwb:=\nu U^{\top}U^{\star}w^{\star}. From this, we obtain

infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w}) =w(IUU(UUUU+(1/ν2)UΣU)1UU)w+σϵ2.\displaystyle={w^{\star}}^{\top}\quantity(I-U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}U^{\star})w^{\star}+\sigma_{\epsilon}^{2}.

 

Lemma B.20

For any U𝕆d,rU\in\mathbb{O}_{d,r},

infwrr(δU,w)infwrr(δU,w)=O((1+ρ2)𝔼𝒟[sinΘ(U,U)2]w2).\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})=O\quantity((1+\rho^{-2})\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]\|w^{\star}\|^{2}).

Proof [Proof of Lemma B.20] From proposition B.19, we have

infwrr(δU,w)infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})
=w((I+(1/ν2)UΣU)1UU(UUUU+(1/ν2)UΣU)1UU)w.\displaystyle\quad={w^{\star}}^{\top}\quantity((I+(1/\nu^{2})U^{\star\top}\Sigma U^{\star})^{-1}-U^{\star\top}U(U^{\top}U^{\star}U^{\star\top}U+(1/\nu^{2})U^{\top}\Sigma U)^{-1}U^{\top}{U^{\star}})w^{\star}.

Note that \inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})\equiv\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{UO,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w}) for any orthogonal matrix O\in\mathbb{O}_{r,r}. Without loss of generality, take \tilde{O}\in\mathbb{O}_{r,r} such that \|U\tilde{O}-U^{\star}\|_{2}\leq\sqrt{2}\|\sin\Theta(U,U^{\star})\|_{2}; this is possible since, by Lemma A.3, we can always take a sequence (\tilde{O}_{m})_{m\geq 1} such that \|U\tilde{O}_{m}-U^{\star}\|_{2}\leq\sqrt{2}\|\sin\Theta(U,U^{\star})\|_{2}+1/m.

Lemma B.14 gives

infwrr(δU,w)infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})
=O(11sinΘ(U,U)22+κ1ρ21+ρ21+κ1ρ2sinΘ(U,U)2w2).\displaystyle\quad=O\quantity(\frac{1}{1-\|\sin\Theta(U,U^{\star})\|_{2}^{2}+\kappa^{-1}\rho^{-2}}\frac{1+\rho^{-2}}{1+\kappa^{-1}\rho^{-2}}\|\sin\Theta(U,U^{\star})\|_{2}\|w^{\star}\|^{2}).

On the event where sinΘ(U,U)22<1/2\|\sin\Theta(U,U^{\star})\|_{2}^{2}<1/2,

infwrr(δU,w)infwrr(δU,w)=O(1+ρ2(1+κ1ρ2)2sinΘ(U,U)2w2).\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})=O\quantity(\frac{1+\rho^{-2}}{(1+\kappa^{-1}\rho^{-2})^{2}}\|\sin\Theta(U,U^{\star})\|_{2}\|w^{\star}\|^{2}).

On the event where sinΘ(U,U)221/2\|\sin\Theta(U,U^{\star})\|_{2}^{2}\geq 1/2, we utilize the trivial upper bound

infwrr(δU,w)infwrr(δU,w)\displaystyle\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w}) (I+ν2UΣU)12w2ν2ν2+σ(d)2w2.\displaystyle\leq\|(I+\nu^{-2}U^{\star\top}\Sigma U^{\star})^{-1}\|_{2}\|w^{\star}\|^{2}\leq\frac{\nu^{2}}{\nu^{2}+\sigma_{(d)}^{2}}\|w^{\star}\|^{2}.

Combining these results, we have

𝔼𝒟[infwrr(δU,w)infwrr(δU,w)]\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U,w})-\inf_{w\in\mathbb{R}^{r}}\mathcal{R}_{r}(\delta_{U^{\star},w})]
1+ρ2(1+κ1ρ2)2𝔼𝒟[sinΘ(U,U)2]w2\displaystyle\quad\lesssim\frac{1+\rho^{-2}}{(1+\kappa^{-1}\rho^{-2})^{2}}\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]\|w^{\star}\|^{2}
+11+κ1ρ2w2𝒟(sinΘ(U,U)21/2)\displaystyle\quad\quad+\frac{1}{1+\kappa^{-1}\rho^{-2}}\|w^{\star}\|^{2}\mathbb{P}_{\mathcal{D}}(\|\sin\Theta(U,U^{\star})\|_{2}\geq 1/\sqrt{2})
1+ρ2(1+κ1ρ2)2𝔼𝒟[sinΘ(U,U)2]w2,\displaystyle\quad\lesssim\frac{1+\rho^{-2}}{(1+\kappa^{-1}\rho^{-2})^{2}}\mathbb{E}_{\mathcal{D}}[\|\sin\Theta(U,U^{\star})\|_{2}]\|w^{\star}\|^{2},

where the last inequality follows by Markov’s inequality.  

C Discussion about Autoencoders and random masking augmentation

In the following, we show that our results do not change if we apply the same augmentation (2.2) to autoencoders. As discussed in Section 3.1, we can ignore the bias term in autoencoders for simplicity, since it only serves to center the data matrix. In that case, we apply the random augmentations g_{1}(x)=Ax and g_{2}(x)=(I-A)x to the original data \{x_{i}\}_{i=1}^{n}, and the optimization problem can be formulated as follows:

minWAE,WDE12n𝔼A[AXWDEWAEAXF2+(IA)XWDEWAE(IA)XF2].\displaystyle\min_{W_{AE},W_{DE}}\frac{1}{2n}\mathbb{E}_{A}[\|AX-W_{DE}W_{AE}AX\|_{F}^{2}+\|(I-A)X-W_{DE}W_{AE}(I-A)X\|_{F}^{2}]. (67)

Then, similar to Theorem 3.1 for contrastive learning, we can also obtain an explicit solution for this optimization problem.

Theorem C.1

The optimal solution of autoencoders with random masking augmentation (67) is given by:

WAE=WDE=C(i=1ruiσivi),W_{AE}=W_{DE}^{\top}=C\left(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top}\right)^{\top},

where C>0C>0 is a positive constant, σi\sigma_{i} is the ii-th largest eigenvalue of the following matrix:

12Δ(XX)+D(XX),\frac{1}{2}\Delta(XX^{\top})+D(XX^{\top}), (68)

uiu_{i} is the corresponding eigenvector and V=[v1,,vr]r×rV=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r} can be any orthonormal matrix.

Proof  We first derive an equivalent form of the objective function:

\displaystyle\frac{1}{2n}\mathbb{E}_{A}[\|AX-W_{DE}W_{AE}AX\|_{F}^{2}+\|(I-A)X-W_{DE}W_{AE}(I-A)X\|_{F}^{2}] (69)
\displaystyle=\frac{1}{2n}\mathbb{E}_{A}[\tr(X^{\top}A^{\top}AX)-2\tr(X^{\top}A^{\top}W_{DE}W_{AE}AX)+\tr(X^{\top}A^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE}AX)
+\tr(X^{\top}(I-A)^{\top}(I-A)X)-2\tr(X^{\top}(I-A)^{\top}W_{DE}W_{AE}(I-A)X)
+\tr(X^{\top}(I-A)^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE}(I-A)X)]
\displaystyle=\frac{1}{2n}\mathbb{E}_{A}[\tr(X^{\top}AX)-2\tr(AXX^{\top}A^{\top}W_{DE}W_{AE})+\tr(AXX^{\top}A^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE})
+\tr(X^{\top}(I-A)X)-2\tr((I-A)XX^{\top}(I-A)^{\top}W_{DE}W_{AE})
+\tr((I-A)XX^{\top}(I-A)^{\top}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE})]
\displaystyle=\frac{1}{2n}\mathbb{E}_{A}[\tr(X^{\top}X)-2\tr(\hat{M}W_{DE}W_{AE})+\tr(\hat{M}W_{AE}^{\top}W_{DE}^{\top}W_{DE}W_{AE})],

where \hat{M}:=AXX^{\top}A^{\top}+(I-A)XX^{\top}(I-A)^{\top}. Note that by Definition 2.2 we have A=\operatorname{diag}(a_{1},\cdots,a_{d}) with the a_{i} drawn independently from a Bernoulli(1/2) distribution, so we have:

𝔼AM^=12Δ(XX)+D(XX)\mathbb{E}_{A}\hat{M}=\frac{1}{2}\Delta(XX^{\top})+D(XX^{\top}) (70)

Again, by Theorem 2.4.8 in Golub and Loan (1996), the optimal solution of Eq.(67) is given by the eigenvalue decomposition of 𝔼AM^=12Δ(XX)+D(XXT)\mathbb{E}_{A}\hat{M}=\frac{1}{2}\Delta(XX^{\top})+D(XX^{T}), up to an orthogonal transformation, which finishes the proof.  
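To make the closed-form solution above concrete, here is a minimal numerical sketch (ours, not part of the paper's experiments; it assumes numpy, and the helper name augmented_ae_solution is ours) that builds the matrix in (68) and reads off W_{AE} from its top-r eigenvectors:

```python
import numpy as np

def augmented_ae_solution(X, r, C=1.0):
    """Closed-form solution of the masked-autoencoder objective (67):
    top-r eigenpairs of (1/2)*Delta(X X^T) + D(X X^T), as in Theorem C.1."""
    G = X @ X.T
    target = 0.5 * G + 0.5 * np.diag(np.diag(G))   # = (1/2)Delta(G) + D(G)
    vals, vecs = np.linalg.eigh(target)
    idx = np.argsort(vals)[::-1][:r]               # top-r eigenpairs
    U, sigma = vecs[:, idx], vals[idx]
    V = np.eye(r)                                  # any orthonormal V is optimal
    M = (U * sigma) @ V.T                          # sum_i sigma_i u_i v_i^T  (d x r)
    return C * M.T                                 # W_AE = W_DE^T  (r x d)
```

Any orthonormal V yields the same reconstruction W_{DE}W_{AE}, which is why the theorem leaves V free; the identity is used here only for concreteness.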
With Theorem C.1 established, we can now derive the space distance for autoencoders with random masking augmentation.

Theorem C.2

Consider the spiked covariance model Eq. (5). Under Assumptions 3.4-3.6 and n>d\gg r, let W_{AE} be the learned representation of the augmented autoencoder, with singular value decomposition W_{AE}=(U_{AE}\Sigma_{AE}V_{AE}^{\top})^{\top} (i.e., the optimal solution of optimization problem (67)). Assume further that the \{\sigma_{i}^{2}\}_{i=1}^{d} are distinct and that \sigma_{(1)}^{2}/(\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2})<C_{\sigma} for some universal constant C_{\sigma}. Then there exist two universal constants C_{\rho}>0,c\in(0,1) such that when \rho<C_{\rho}, we have

𝔼sinΘ(U,UAE)Fcr.\mathbb{E}\left\|\sin\Theta\left(U^{\star},U_{AE}\right)\right\|_{F}\geq c\sqrt{r}. (71)

Proof  Step 1: similar to the proof of Theorem B.7, we first bound the difference between \hat{M}:=\frac{1}{2n}\Delta(XX^{\top})+\frac{1}{n}D(XX^{\top}) (the target matrix (68), normalized by n) and \Sigma:=\mathrm{Cov}(\xi). Note that:

\|\hat{M}-\Sigma\|_{2}=\norm{\frac{1}{n}XX^{\top}-\Sigma-\frac{1}{2n}\Delta(XX^{\top})}_{2}\leq\norm{\frac{1}{n}XX^{\top}-\Sigma}_{2}+\frac{1}{2}\norm{\Delta\quantity(\frac{1}{n}XX^{\top}-\Sigma)}_{2}+\frac{1}{2}\|\Delta(\Sigma)\|_{2} (72)

Since \Sigma is a diagonal matrix, \Delta(\Sigma)=0, and by Lemma B.8 we have:

\|\hat{M}-\Sigma\|_{2}\leq 2\norm{\frac{1}{n}XX^{\top}-\Sigma}_{2} (73)

Now, directly applying equations (41), (42), and (43), we obtain:

𝔼M^Σ2ν2(1+rn+rn)+σ(1)2(dn+dn)+dnσ(1)ν.\mathbb{E}\|\hat{M}-\Sigma\|_{2}\lesssim\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu. (74)

Step 2: bound the \sin\Theta distance between the eigenspaces. As shown in Step 1, the target matrix of the autoencoder is close to the covariance matrix of the random noise, i.e., \Sigma. Note that \Sigma is assumed to be a diagonal matrix with distinct entries, hence its eigenvectors are the canonical basis vectors e_{i}. Let U_{\Sigma} denote the top-r eigenspace of \Sigma and \{e_{i}\}_{i\in C} its corresponding basis vectors. Applying the Davis-Kahan theorem (Theorem E.1), we conclude that:

𝔼sinΘ(UAE,UΣ)F2r𝔼M^Σ2σ(r)2σ(r+1)2\displaystyle\mathbb{E}\|\sin\Theta(U_{AE},U_{\Sigma})\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\|\hat{M}-\Sigma\|_{2}}{\sigma_{(r)}^{2}-\sigma_{(r+1)}^{2}}
\displaystyle\lesssim r1σ(1)2(ν2(1+rn+rn)+σ(1)2(dn+dn)+dnσ(1)ν)\displaystyle\sqrt{r}\frac{1}{\sigma_{(1)}^{2}}\quantity(\nu^{2}\quantity(1+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sqrt{\frac{d}{n}}\sigma_{(1)}\nu)
\displaystyle\lesssim r(ρ2+dn+ρdn).\displaystyle\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}}).

Step 3: obtain the final result by the triangle inequality. By Assumption 3.6, the distance between the canonical basis and the eigenspace of the core features can be large:

sinΘ(U,UΣ)F2\displaystyle\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}^{2} =UΣUF2=i[d]/CeiU2=UF2iCeiU2\displaystyle=\|U_{\Sigma\perp}^{\top}U^{\star}\|_{F}^{2}=\sum_{i\in[d]/C}\|e_{i}^{\top}U^{\star}\|^{2}=\|U^{\star}\|_{F}^{2}-\sum_{i\in C}\|e_{i}^{\top}U^{\star}\|^{2} (75)
rrI(U)=rO(r2dlogd).\displaystyle\geq r-rI(U^{\star})=r-O\quantity(\frac{r^{2}}{d}\log d).

Then, applying the triangle inequality for the \sin\Theta distance (Proposition A.5), we obtain the lower bound for the autoencoder:

𝔼sinΘ(UAE,U)F\displaystyle\mathbb{E}\|\sin\Theta(U_{AE},U^{\star})\|_{F} 𝔼sinΘ(U,UΣ)F𝔼sinΘ(UAE,UΣ)F\displaystyle\geq\mathbb{E}\|\sin\Theta(U^{\star},U_{\Sigma})\|_{F}-\mathbb{E}\|\sin\Theta(U_{AE},U_{\Sigma})\|_{F}
rO(rdlogd)O(r(ρ2+dn+ρdn)).\displaystyle\geq\sqrt{r}-O\quantity(\frac{r}{\sqrt{d}}\sqrt{\log d})-O\quantity(\sqrt{r}\quantity(\rho^{2}+\sqrt{\frac{d}{n}}+\rho\sqrt{\frac{d}{n}})).

Together with Assumption 3.5, this implies that when n and d are sufficiently large and \rho is sufficiently small (smaller than a given constant C_{\rho}>0), there exists a universal constant c\in(0,1) such that:

𝔼sinΘ(UAE,U)Fcr.\mathbb{E}\|\sin\Theta(U_{AE},U^{\star})\|_{F}\geq c\sqrt{r}.

 
Compared with Theorem 3.9, we see that random masking augmentation makes no difference for autoencoders, which justifies the fairness of our comparison between contrastive learning and autoencoders.

In contrast to autoencoders with random-masking augmentation, however, we show that the representations obtained by DAEs behave like those obtained by contrastive learning.

Proof [Proof of Remark 3.12] Let :=(1/n)XWWAXF2\mathcal{L}:=(1/n)\|X-W^{\top}WAX\|_{F}^{2} be the loss function of DAEs. Then,

𝔼A=1ntr(WW(12D(XX)+14Δ(XX))WWWWXX)+(const.).\displaystyle\mathbb{E}_{A}\mathcal{L}=\frac{1}{n}\tr(W^{\top}W\quantity(\frac{1}{2}D(XX^{\top})+\frac{1}{4}\Delta(XX^{\top}))W^{\top}W-W^{\top}WXX^{\top})+(\text{const}.). (76)

We minimize the loss over WW such that WW=2IrWW^{\top}=2I_{r}. Then, the loss becomes

argminWW=2IrEA\displaystyle\operatorname*{arg\,min}_{WW^{\top}=2I_{r}}E_{A}\mathcal{L} =argminWW=2Ir1ntr(W(D(XX)+12Δ(XX))WWXXW)\displaystyle=\operatorname*{arg\,min}_{WW^{\top}=2I_{r}}\frac{1}{n}\tr(W\quantity(D(XX^{\top})+\frac{1}{2}\Delta(XX^{\top}))W^{\top}-WXX^{\top}W^{\top})
=argmaxWW=2Irtr(W1nΔ(XX)W).\displaystyle=\operatorname*{arg\,max}_{WW^{\top}=2I_{r}}\tr(W\frac{1}{n}\Delta(XX^{\top})W^{\top}).

Thus, the solution to the (expected) loss minimization problem is given by the top-r eigenvectors of \Delta(n^{-1}XX^{\top}), i.e., W^{\top}=\sqrt{2}P_{r}(\Delta(n^{-1}XX^{\top}))O, where O is any orthogonal matrix from \mathbb{O}_{r,r}. We use the same argument as in the proof of Theorem B.9. First note that

\frac{1}{n}\Delta(XX^{\top})=\frac{1}{n}\Delta(U^{\star}ZZ^{\top}U^{\star\top})+\frac{1}{n}\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})+\frac{1}{n}\Delta(EE^{\top}).

By Lemmas B.8 and E.3 and the incoherence condition I(U^{\star})=O(\frac{r}{d}\log d), we have:

𝔼Δ(1nUZZU)ν2UU2\displaystyle\mathbb{E}\norm{\Delta\quantity(\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top})-\nu^{2}U^{\star}U^{\star\top}}_{2}
2𝔼1nUZZUν2UU2+𝔼Δ(ν2UU)ν2UU)2\displaystyle\leq 2\mathbb{E}\norm{\frac{1}{n}U^{\star}ZZ^{\top}U^{\star\top}-\nu^{2}U^{\star}U^{\star\top}}_{2}+\mathbb{E}\norm{\Delta\quantity(\nu^{2}U^{\star}U^{\star\top})-\nu^{2}U^{\star}U^{\star\top})}_{2}
2(rn+rn)ν2+rdlogdν2.\displaystyle\lesssim 2(\sqrt{\frac{r}{n}}+\frac{r}{n})\nu^{2}+\frac{r}{d}\log d\nu^{2}. (77)

For the second term, applying equation (42) yields:

1n𝔼Δ(UZE+EZU)24n𝔼EZU2dnσ(1)ν.\displaystyle\frac{1}{n}\mathbb{E}\|\Delta(U^{\star}ZE^{\top}+EZ^{\top}U^{\star\top})\|_{2}\leq\frac{4}{n}\mathbb{E}\|EZ^{\top}U^{\star\top}\|_{2}\lesssim\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu. (78)

For the third term, applying equation (43) yields:

𝔼1nΔ(EE)2=𝔼Δ(1nEEΣ)221nEEΣ2(dn+dn)σ(1)2.\mathbb{E}\|\frac{1}{n}\Delta(EE^{\top})\|_{2}=\mathbb{E}\|\Delta(\frac{1}{n}EE^{\top}-\Sigma)\|_{2}\leq 2\|\frac{1}{n}EE^{\top}-\Sigma\|_{2}\lesssim(\sqrt{\frac{d}{n}}+\frac{d}{n})\sigma_{(1)}^{2}. (79)

Combining equations (77), (78), and (79) gives

𝔼Δ(1nXXν2UU)2ν2(rdlogd+rn+rn)+σ(1)2(dn+dn)+σ(1)νdn.\mathbb{E}\norm{\Delta\quantity(\frac{1}{n}XX^{\top}-\nu^{2}U^{\star}U^{\star\top})}_{2}\lesssim\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}.

From Lemma E.1, we obtain the desired bound:

𝔼sinΘ(UDAE,U)F2rν2𝔼Δ(1nXXν2UU)2r3/2dlogd+drn.\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{DAE}},U^{\star})\|_{F}\leq\frac{2\sqrt{r}}{\nu^{2}}\mathbb{E}\norm{\Delta\quantity(\frac{1}{n}XX^{\top}-\nu^{2}U^{\star}U^{\star\top})}_{2}\lesssim\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}.

 

Here we provide some experimental results for DAEs on synthetic datasets, analogous to Figures 1 and 2; the settings are the same as described in Section 5.1. The results are summarized in Figure 3. As we can observe, the performance of DAEs is comparable with that of contrastive learning, which aligns with our theoretical results above.
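As a minimal, self-contained analogue of these experiments (ours; it assumes numpy and one arbitrary parameter setting with heterogeneous noise variances, in the spirit of the assumptions of Theorem C.2), the following sketch compares the \sin\Theta distance to U^{\star} of the masked-autoencoder target (68) and of the DAE target \Delta(n^{-1}XX^{\top}):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 40, 3, 20000
nu = 2.0
noise_std = np.linspace(4.0, 1.0, d)        # heterogeneous noise, largest variances dominate

U_star, _ = np.linalg.qr(rng.standard_normal((d, r)))     # core feature subspace
Z = nu * rng.standard_normal((r, n))
E = noise_std[:, None] * rng.standard_normal((d, n))
X = U_star @ Z + E                                         # spiked covariance model (5)

def top_r_eigvecs(M, r):
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:r]]

def sin_theta_fro(U, V):
    # ||sin Theta(U, V)||_F = sqrt(r - ||U^T V||_F^2) for orthonormal bases U, V
    return np.sqrt(max(U.shape[1] - np.linalg.norm(U.T @ V) ** 2, 0.0))

G = X @ X.T / n
D = np.diag(np.diag(G))
U_ae = top_r_eigvecs(0.5 * (G - D) + D, r)   # masked-autoencoder target (Theorem C.1)
U_dae = top_r_eigvecs(G - D, r)              # DAE target Delta(X X^T / n) (Remark 3.12)

print("AE  sin-theta:", sin_theta_fro(U_ae, U_star))   # typically close to sqrt(r)
print("DAE sin-theta:", sin_theta_fro(U_dae, U_star))  # typically much smaller
```

When the noise variances dominate the signal on a few coordinates, the autoencoder eigenspace locks onto those coordinates, while removing the diagonal (as the DAE and contrastive objectives effectively do) recovers the core subspace.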

Figure 3: Comparison of denoising autoencoders, autoencoders, and contrastive learning on synthetic datasets. Left Column: The vertical axes indicate the downstream regression error. We subtract the regression error of the ground-truth features to measure the excess error. Top Row: Comparison of in-domain downstream task performance of autoencoders, contrastive learning, and denoising autoencoders with respect to the dimension d. The sample size n is set as n=20000. Bottom Row: Comparison of in-domain downstream task performance of autoencoders, contrastive learning, and denoising autoencoders with respect to the sample size n. The dimension d is set as d=40.

D Omitted proofs for Section 4

D.1 Proofs for Section 4.1

In this section, we provide the proof of a generalized version of Theorem 4.2 that covers the imbalanced setting; the statement and the detailed proof can be found in Theorem D.2.

In the main body, we assume that the unlabeled data and the labeled data are both balanced for the sake of clarity and simplicity. Now we allow them to be imbalanced and provide a more general analysis. Suppose we have n unlabeled samples X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n} and n_{k} labeled samples X_{k}=[x_{k}^{1},\cdots,x_{k}^{n_{k}}]\in\mathbb{R}^{d\times n_{k}} for class k. The contrastive learning task can be formulated as:

minWr×d(W):=minWr×dSelfCon(W)+SupCon(W;α).\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)+\mathcal{L}_{\text{SupCon}}(W;\alpha). (80)

In addition, we write a generalized version of the supervised contrastive loss function to cover the imbalanced cases:

\mathcal{L}_{\text{SupCon}}(W;\alpha)=-\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}\sum_{i=1}^{n_{k}}[\sum_{j\neq i}\frac{\langle Wx_{i}^{k},Wx_{j}^{k}\rangle}{n_{k}-1}-\frac{\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle}{\sum_{s\neq k}n_{s}}]+\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}, (81)

where αk>0\alpha_{k}>0 is the weight for supervised loss of class kk. Again we first provide a theorem to give the optimal solution to the contrastive learning problem.

Theorem D.1

The optimal solution of the supervised contrastive learning problem (80) is given by :

WCL=C(i=1ruiσivi),W_{\text{CL}}=C\quantity(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top})^{\top},

where C>0C>0 is a positive constant, σi\sigma_{i} is the ii-th largest eigenvalue of the following matrix:

\displaystyle\frac{1}{4n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})
\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}\quantity[\frac{1}{n_{k}-1}X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}],

uiu_{i} is the corresponding eigenvector and V=[v1,,vr]r×rV=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r} can be any orthonormal matrix.

Proof  Under this setting, combined with the result obtained in Corollary 3.2, the contrastive loss can be rewritten as:

(W)=\displaystyle\mathcal{L}(W)= λ2WWF212ntr((12Δ(XX)12(n1)X(1n1nIn)X)WW)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{2n}\tr(\quantity(\frac{1}{2}\Delta(XX^{\top})-\frac{1}{2(n-1)}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)
1r+1k=1r+1αk1nki=1nk[1nk1jiWxik,Wxjk1tkntskj=1nsWxik,Wxjs].\displaystyle-\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\quantity[\frac{1}{n_{k}-1}\sum_{j\neq i}\langle Wx_{i}^{k},Wx_{j}^{k}\rangle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle].

Then we deal with the last term independently, note that:

i=1nk[1nk1jiWxik,Wxjk1tkntskj=1nsWxik,Wxjs]\displaystyle\sum_{i=1}^{n_{k}}\quantity[\frac{1}{n_{k}-1}\sum_{j\neq i}\langle Wx_{i}^{k},Wx_{j}^{k}\rangle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle]
=\displaystyle= 1nk1i=1nkjiWxik,Wxjk1tknti=1nkskj=1nsWxik,Wxjs\displaystyle\frac{1}{n_{k}-1}\sum_{i=1}^{n_{k}}\sum_{j\neq i}\langle Wx_{i}^{k},Wx_{j}^{k}\rangle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{i=1}^{n_{k}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\langle Wx_{i}^{k},Wx_{j}^{s}\rangle
=\displaystyle= 1nk1tr(Xk(1nk1nkInk)XkWW)1tkntsktr(Xk1k1sXsWW).\displaystyle\frac{1}{n_{k}-1}\tr(X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}W^{\top}W)-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\tr(X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}W^{\top}W).

Thus we have:

(W)=\displaystyle\mathcal{L}(W)= λ2WWF214ntr((Δ(XX)1n1X(1n1nIn)X)WW)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{4n}\tr((\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)
1r+1k=1r+1αknk[1nk1tr(Xk(1nk1nkInk)XkWW)\displaystyle-\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}[\frac{1}{n_{k}-1}\tr(X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}W^{\top}W)
\displaystyle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\tr(X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}W^{\top}W)].

Then by a similar argument as in the proof of Proposition 3.1, we can conclude that the optimal solution WCLW_{\text{CL}} must satisfy the desired conditions.  
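The matrix in Theorem D.1 is straightforward to form explicitly; the following sketch (ours; it assumes numpy, the helper name supcon_target is ours, and the cross-class term is symmetrized, as in the proof of Theorem D.2 below) may help parse the expression:

```python
import numpy as np

def supcon_target(X, Xk_list, alphas):
    """Matrix whose top-r eigenvectors give the optimal W_CL in Theorem D.1.
    X: d x n pooled unlabeled data; Xk_list: list of d x n_k class matrices."""
    n = X.shape[1]
    G = X @ X.T
    sx = X.sum(axis=1, keepdims=True)                     # X 1_n
    T = (G - np.diag(np.diag(G))                          # Delta(X X^T)
         - (sx @ sx.T - G) / (n - 1)) / (4 * n)           # X(1 1^T - I)X^T = sx sx^T - G
    K = len(Xk_list)                                      # K = r + 1 classes
    sums = [Xk.sum(axis=1, keepdims=True) for Xk in Xk_list]   # X_k 1_{n_k}
    sizes = [Xk.shape[1] for Xk in Xk_list]
    for k, (Xk, ak) in enumerate(zip(Xk_list, alphas)):
        nk = sizes[k]
        within = (sums[k] @ sums[k].T - Xk @ Xk.T) / (nk - 1)
        n_rest = sum(sizes) - nk                          # sum_{t != k} n_t
        between = sum(0.5 * (sums[k] @ sums[s].T + sums[s] @ sums[k].T)
                      for s in range(K) if s != k) / n_rest
        T += ak / nk * (within - between) / K
    return T
```

Feeding the output of this helper to a symmetric eigendecomposition and keeping the top-r eigenvectors gives (up to scaling and an orthogonal transformation) the representation W_{CL} described in the theorem.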
With the optimal solution obtained in Theorem D.1, we can provide a generalized version of Theorem 4.2 that covers the imbalanced cases.

Theorem D.2 (Generalized version of Theorem 4.2)

Suppose Assumptions 3.4-3.6 hold and n>d\gg r. Let W_{\text{CL}} be any solution that minimizes the supervised contrastive learning problem in Equation (80), and denote its singular value decomposition by W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}. Then we have

𝔼sinΘ(UCL,U)F\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U)\|_{F}\lesssim ν2λr(T)(r3/2dlogd+drn\displaystyle\frac{\nu^{2}}{\lambda_{r}(T)}\biggl{(}\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}
+1r+1k=1r+1αk[sknsdtknt(dnk+r)+drnk]),\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\biggl{[}\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}(\sqrt{\frac{d}{n_{k}}}+\sqrt{r})+\sqrt{\frac{dr}{n_{k}}}\biggr{]}\biggr{)},

where T\triangleq\frac{1}{4}\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}(\mu^{k}\mu^{k\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top})).

Proof [Proof of Theorem D.2] For the data X=[x_{1},\cdots,x_{n}], we write X=M+E, where M=[\mu_{1},\cdots,\mu_{n}] and E=[\xi_{1},\cdots,\xi_{n}] are the matrices of class means and of random noise, respectively. To be more specific, if x_{i} belongs to the k-th cluster, then \mu_{i}=\mu^{k} and \xi_{i}\sim\mathcal{N}(0,\Sigma^{k}). Since the data are randomly drawn from the classes, \mu_{i} follows the multinomial distribution over \mu^{1},\cdots,\mu^{r+1} with probabilities p_{1},\cdots,p_{r+1}. Thus \mu_{i} follows a subgaussian distribution with covariance matrix N=\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}.

As shown in Theorem D.1, the optimal solution of contrastive learning is equivalent to PCA of the following matrix:

T^\displaystyle\hat{T}\triangleq 14n(Δ(XX)1n1X(1n1nIn)X)\displaystyle\frac{1}{4n}(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})
+1r+1k=1r+1αknk[1nk1Xk(1nk1nkInk)Xk\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\frac{\alpha_{k}}{n_{k}}[\frac{1}{n_{k}-1}X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}
1tkntsk12(Xk1k1sXs+Xs1s1kXk)].\displaystyle-\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\frac{1}{2}(X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}+X_{s}1_{s}1_{k}^{\top}X_{k}^{\top})].

Again we will deal with these terms separately,

  1.

    For the first term, as we have discussed, XX can be divided into two matrices MM and EE, each of them consisting of sub-gaussian columns. Again we can obtain the result as in (56) (the proof is totally the same):

    𝔼1n(Δ(XX)1n1X(1n1nIn)X)N2ν2(rdlogd+rn)+σ(1)2dn.\mathbb{E}\|\frac{1}{n}(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})-N\|_{2}\lesssim\nu^{2}(\frac{r}{d}\log d+\sqrt{\frac{r}{n}})+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}. (82)
  2.

    For the second term, notice that:

    Xk(1nk1nkInk)Xk=i=1nkji(μk+ξik)(μk+ξjk)\displaystyle X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}=\sum_{i=1}^{n_{k}}\sum_{j\neq i}(\mu^{k}+\xi_{i}^{k})(\mu^{k}+\xi_{j}^{k})^{\top} (83)
    =\displaystyle= nk(nk1)μkμk+(nk1)μk(i=1nkξik)+(nk1)(i=1nkξik)μk+i=1nkjiξikξjkT,\displaystyle n_{k}(n_{k}-1)\mu^{k}\mu^{k\top}+(n_{k}-1)\mu^{k}(\sum_{i=1}^{n_{k}}\xi_{i}^{k})^{\top}+(n_{k}-1)(\sum_{i=1}^{n_{k}}\xi_{i}^{k})\mu^{k\top}+\sum_{i=1}^{n_{k}}\sum_{j\neq i}\xi_{i}^{k}\xi_{j}^{kT},

    and that:

    1tkntskXk1k1sXs=1tkntski=1nk(μk+ξik)j=1ns(μs+ξjs)\displaystyle\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}=\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{i=1}^{n_{k}}(\mu^{k}+\xi_{i}^{k})\sum_{j=1}^{n_{s}}(\mu^{s}+\xi_{j}^{s})^{\top} (84)
    =\displaystyle= 1tkntsk[nknsμkμs+nkμk(j=1nsξjs)+nsi=1nkξikμs+i=1nkξikj=1nsξjsT].\displaystyle\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}[n_{k}n_{s}\mu^{k}\mu^{s\top}+n_{k}\mu^{k}(\sum_{j=1}^{n_{s}}\xi_{j}^{s})^{\top}+n_{s}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\mu^{s\top}+\sum_{i=1}^{n_{k}}\xi_{i}^{k}\sum_{j=1}^{n_{s}}\xi_{j}^{sT}].

    Since ξik𝒩(0,Σk)\xi_{i}^{k}\sim\mathcal{N}(0,\Sigma^{k}), we can conclude that:

    𝔼1nki=1nkξik2𝔼1nki=1nkξik22=dnkσ(1).\mathbb{E}\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\|_{2}\leq\sqrt{\mathbb{E}\|\frac{1}{n_{k}}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\|_{2}^{2}}=\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}. (85)

    Moreover, we have

    1nk(nk1)𝔼i=1nkjiξikξjkT2\displaystyle\frac{1}{n_{k}(n_{k}-1)}\mathbb{E}\|\sum_{i=1}^{n_{k}}\sum_{j\neq i}\xi_{i}^{k}\xi_{j}^{kT}\|_{2} 1nk(nk1)𝔼EkEk2+nknk1𝔼ξk¯ξk¯2\displaystyle\leq\frac{1}{n_{k}(n_{k}-1)}\mathbb{E}\|E_{k}E_{k}^{\top}\|_{2}+\frac{n_{k}}{n_{k}-1}\mathbb{E}\|\bar{\xi^{k}}\bar{\xi^{k}}^{\top}\|_{2} (86)
    dnkσ(1)2.\displaystyle\lesssim\frac{d}{n_{k}}\sigma_{(1)}^{2}.

    Plugging equations (85) and (86) back into (83), we conclude:

    𝔼1nk(nk1)Xk(1nk1nkInk)Xkμkμk2dnkσ(1)rν+dnkσ(1)2.\mathbb{E}\|\frac{1}{n_{k}(n_{k}-1)}X_{k}(1_{n_{k}}1_{n_{k}}^{\top}-I_{n_{k}})X_{k}^{\top}-\mu^{k}\mu^{k\top}\|_{2}\lesssim\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}\sqrt{r}\nu+\frac{d}{n_{k}}\sigma_{(1)}^{2}. (87)

    On the other hand, by equation (85) we know:

    𝔼1tkntskj=1nsξjs2sknstknt𝔼1nsi=1nsξis2sknstkntdnsσ(1).\mathbb{E}\|\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}\sum_{j=1}^{n_{s}}\xi_{j}^{s}\|_{2}\leq\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\mathbb{E}\|\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\xi_{i}^{s}\|_{2}\lesssim\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\sqrt{\frac{d}{n_{s}}}\sigma_{(1)}. (88)

    Notice that:

    𝔼1tknt1nkski=1nkξikj=1nsξjsT2𝔼sknstkntξk¯ξs¯2\displaystyle\mathbb{E}\|\frac{1}{\sum_{t\neq k}n_{t}}\frac{1}{n_{k}}\sum_{s\neq k}\sum_{i=1}^{n_{k}}\xi_{i}^{k}\sum_{j=1}^{n_{s}}\xi_{j}^{sT}\|_{2}\leq\mathbb{E}\|\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\bar{\xi^{k}}\bar{\xi^{s}}^{\top}\|_{2} (89)
    \displaystyle\leq sknstknt𝔼ξk¯ξs¯2sknstkntdnknsσ(1)2.\displaystyle\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\mathbb{E}\|\bar{\xi^{k}}\bar{\xi^{s}}^{\top}\|_{2}\lesssim\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{d}{\sqrt{n_{k}n_{s}}}\sigma_{(1)}^{2}.

    Thus, plugging equations (88) and (89) back into equation (84), we have:

    𝔼1nk1tkntskXk1k1sXssknstkntμkμs2\displaystyle\mathbb{E}\|\frac{1}{n_{k}}\frac{1}{\sum_{t\neq k}n_{t}}\sum_{s\neq k}X_{k}1_{k}1_{s}^{\top}X_{s}^{\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\mu^{k}\mu^{s\top}\|_{2} (90)
    sknsdtknt(dnkσ(1)2+σ(1)rν).\displaystyle\lesssim\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}(\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}^{2}+\sigma_{(1)}\sqrt{r}\nu). (91)

Then, combining equations (82), (87), and (90), we obtain the following result:

𝔼T^14N1r+1k=1r+1αk(μkμksknstknt12(μkμs+μsμk))2\displaystyle\mathbb{E}\|\hat{T}-\frac{1}{4}N-\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}(\mu^{k}\mu^{k\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top}))\|_{2}
\displaystyle\lesssim ν2(rdlogd+rn)+σ(1)2dn\displaystyle\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}})+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}
+1r+1k=1r+1αk[sknsdtknt(dnkσ(1)2+rσ(1)ν)+dnkσ(1)rν+dnkσ(1)2].\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\quantity[\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}\quantity(\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}^{2}+\sqrt{r}\sigma_{(1)}\nu)+\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}\sqrt{r}\nu+\frac{d}{n_{k}}\sigma_{(1)}^{2}].

Since we have assumed that rank(k=1r+1pkμkμk)=r\operatorname{rank}(\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top})=r we can find that the top-rr eigenspace of matrix:

T=\frac{1}{4}\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\quantity(\mu^{k}\mu^{k\top}-\sum_{s\neq k}\frac{n_{s}}{\sum_{t\neq k}n_{t}}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top}))

is spanned by U^{\star}. Applying Lemma E.1 again, we have:

\displaystyle\mathbb{E}\|\sin\Theta(U_{\text{CL}},U^{\star})\|_{F}\leq\frac{2\sqrt{r}\mathbb{E}\|\hat{T}-T\|_{2}}{\lambda_{r}(T)}
\displaystyle\lesssim rλr(T)[ν2(rdlogd+rn)+σ(1)2dn\displaystyle\frac{\sqrt{r}}{\lambda_{r}(T)}\Biggl{[}\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}})+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}
+1r+1k=1r+1αk[sknsdtknt(dnkσ(1)2+rσ(1)ν)+dnkrσ(1)ν+dnkσ(1)2]]\displaystyle+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\Biggl{[}\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}\quantity(\sqrt{\frac{d}{n_{k}}}\sigma_{(1)}^{2}+\sqrt{r}\sigma_{(1)}\nu)+\sqrt{\frac{d}{n_{k}}}\sqrt{r}\sigma_{(1)}\nu+\frac{d}{n_{k}}\sigma_{(1)}^{2}\Biggr{]}\Biggr{]}
\displaystyle\lesssim ν2λr(T)(r3/2dlogd+drn+1r+1k=1r+1αk[sknsdtknt(dnk+r)+drnk]).\displaystyle\frac{\nu^{2}}{\lambda_{r}(T)}\quantity(\frac{r^{3/2}}{d}\log d+\sqrt{\frac{dr}{n}}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha_{k}\quantity[\sum_{s\neq k}\frac{\sqrt{n_{s}d}}{\sum_{t\neq k}n_{t}}\quantity(\sqrt{\frac{d}{n_{k}}}+\sqrt{r})+\sqrt{\frac{dr}{n_{k}}}]).

 
Now we use this result to derive Theorem 4.2. Since \|\mu^{k}\|=O(\sqrt{r}\nu) and \sum_{k=1}^{r+1}p_{k}\mu^{k}=0, approximately we have \frac{\nu^{2}}{\lambda_{r}(T)}\approx\frac{1}{\min_{k}[1+\alpha_{k}]}. Although we cannot obtain the closed-form eigenvalues in general, in the special case where \alpha=\alpha_{1}=\cdots=\alpha_{r+1}, m=n_{1}=n_{2}=\cdots=n_{r+1} and \frac{1}{r+1}=p_{1}=p_{2}=\cdots=p_{r+1}, it is easy to find that:

sk12(μkμs+μsμk)=μkμk,\sum_{s\neq k}\frac{1}{2}(\mu^{k}\mu^{s\top}+\mu^{s}\mu^{k\top})=-\mu^{k}\mu^{k\top},

which further implies that:

T=\frac{1}{4}\sum_{k=1}^{r+1}p_{k}\mu^{k}\mu^{k\top}+\frac{1}{r+1}\sum_{k=1}^{r+1}\alpha(1+\frac{1}{r})\mu^{k}\mu^{k\top},\quad\lambda_{r}(T)=[\frac{1}{4}+\alpha(1+\frac{1}{r})]\lambda_{r}(N).

and we can obtain the result in Theorem 4.2.
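A quick numerical check of this special-case identity (ours; it assumes numpy and uses centered simplex vertices as class means, so that \sum_{k}\mu^{k}=0 and \operatorname{rank}(\sum_{k}p_{k}\mu^{k}\mu^{k\top})=r):

```python
import numpy as np

r, alpha = 4, 0.7
K = r + 1                                   # number of classes
# centered simplex vertices: mu^k = e_k - (1/K) 1, so sum_k mu^k = 0 and the rank is r
M = np.eye(K) - np.ones((K, K)) / K         # columns are the class means mu^k
p = np.full(K, 1.0 / K)

N = sum(p[k] * np.outer(M[:, k], M[:, k]) for k in range(K))
T = N / 4
for k in range(K):
    cross = sum(0.5 * (np.outer(M[:, k], M[:, s]) + np.outer(M[:, s], M[:, k]))
                for s in range(K) if s != k) / r      # weights n_s / sum_t n_t = 1/r
    T += alpha * (np.outer(M[:, k], M[:, k]) - cross) / K

def lam(A):
    return np.sort(np.linalg.eigvalsh(A))[::-1]

# the r-th largest eigenvalues satisfy lambda_r(T) = (1/4 + alpha*(1 + 1/r)) * lambda_r(N)
print(lam(T)[r - 1], (0.25 + alpha * (1 + 1 / r)) * lam(N)[r - 1])
```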

D.2 Proofs for Section 4.2

In this section, we provide proofs of generalized versions of Theorems 4.5 and 4.8 that cover the imbalanced setting; the statements and detailed proofs can be found in Theorems D.6 and D.8. With the two generalized theorems proven, Theorems 4.5, 4.6, 4.8, and 4.9 follow immediately.

First, we prove a useful lemma to illustrate that the supervised loss function only yields estimation along a 1-dimensional space. Consider a single source task, where the data x=Uz+ξx=U^{\star}z+\xi is generated by the spiked covariance model and the label is generated by

y=w,z/νy=\langle w^{\star},z\rangle/\nu

Suppose we have collected n labeled samples from this task; denote the data by X=[x_{1},x_{2},\cdots,x_{n}]\in\mathbb{R}^{d\times n} and the labels by y=[y_{1},y_{2},\cdots,y_{n}]\in\mathbb{R}^{n}. Then we have the following result.

Lemma D.3

Under the conditions similar to Theorem 3.10, we can find an event AA such that (AC)=O(d/n)\mathbb{P}(A^{C})=O(\sqrt{d/n}) and:

𝔼[1(n1)2XHyyHXν2UwwUF𝕀{A}]dnσ(1)ν.\mathbb{E}\quantity[\norm{\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}}_{F}\mathbb{I}\{A\}]\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu. (92)

The proof strategy is to estimate the difference between the two rank-1 matrices via bounding the difference of the corresponding vector component. We first provide a simple lemma to illustrate the technique:

Lemma D.4

Suppose α,βd\alpha,\beta\in\mathbb{R}^{d} are two vectors, then we have:

ααββF2(α2+β2)αβ2.\|\alpha\alpha^{\top}-\beta\beta^{\top}\|_{F}\leq\sqrt{2}(\|\alpha\|_{2}+\|\beta\|_{2})\|\alpha-\beta\|_{2}.

Proof  Denote α=(α1,,αd),β=(β1,,βd)\alpha=(\alpha_{1},\cdots,\alpha_{d}),\beta=(\beta_{1},\cdots,\beta_{d}), then we have:

ααββF2i=1dj=1d|αiαjβiβj|22i=1dj=1d|αiαjαiβj|2+|αiβjβiβj|2\displaystyle\|\alpha\alpha^{\top}-\beta\beta^{\top}\|_{F}^{2}\leq\sum_{i=1}^{d}\sum_{j=1}^{d}|\alpha_{i}\alpha_{j}-\beta_{i}\beta_{j}|^{2}\leq 2\sum_{i=1}^{d}\sum_{j=1}^{d}|\alpha_{i}\alpha_{j}-\alpha_{i}\beta_{j}|^{2}+|\alpha_{i}\beta_{j}-\beta_{i}\beta_{j}|^{2}
\displaystyle\leq 2i=1dj=1d|αi|2|αjβj|2+|βj|2|αiβi|22(α22+β22)αβ22\displaystyle 2\sum_{i=1}^{d}\sum_{j=1}^{d}|\alpha_{i}|^{2}|\alpha_{j}-\beta_{j}|^{2}+|\beta_{j}|^{2}|\alpha_{i}-\beta_{i}|^{2}\leq 2(\|\alpha\|_{2}^{2}+\|\beta\|_{2}^{2})\|\alpha-\beta\|_{2}^{2}
\displaystyle\leq 2(α2+β2)2αβ22.\displaystyle 2(\|\alpha\|_{2}+\|\beta\|_{2})^{2}\|\alpha-\beta\|_{2}^{2}.

Taking the square root of both sides finishes the proof.  
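A two-line numerical check of Lemma D.4 (ours; it assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(50), rng.standard_normal(50)
lhs = np.linalg.norm(np.outer(a, a) - np.outer(b, b), 'fro')
rhs = np.sqrt(2) * (np.linalg.norm(a) + np.linalg.norm(b)) * np.linalg.norm(a - b)
assert lhs <= rhs       # the deterministic inequality of Lemma D.4
print(lhs, rhs)
```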
Now we can prove Lemma D.3.

Proof [Proof of Lemma D.3] Clearly, we have:

1(n1)2XHyyHXν2UwwUF\displaystyle\|\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F} (93)
\displaystyle\leq n2(n1)21n2XHyyHXν2UwwUF+2n+1(n1)2ν2UwwUF\displaystyle\frac{n^{2}}{(n-1)^{2}}\|\frac{1}{n^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}+\frac{2n+1}{(n-1)^{2}}\|\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}
\displaystyle\lesssim 1n2XHyyHXν2UwwUF+rnν2,\displaystyle\|\frac{1}{n^{2}}XHyy^{\top}HX^{\top}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}+\frac{r}{n}\nu^{2},

thus it suffices to prove the bound in Equation (92) with \frac{1}{(n-1)^{2}} replaced by \frac{1}{n^{2}}. Denote \hat{N}\triangleq\frac{1}{n^{2}}XHyy^{\top}HX^{\top}; note that both \hat{N} and \nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top} are rank-one matrices. We first bound the difference between \frac{1}{n}XHy and \nu U^{\star}w^{\star}:

1nXHyνUw=\displaystyle\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|= 1nν(UZ+E)HZwνUw\displaystyle\|\frac{1}{n\nu}(U^{\star}Z+E)HZ^{\top}w^{\star}-\nu U^{\star}w^{\star}\| (94)
\displaystyle\leq 1nν(UZ+E)HZνU2\displaystyle\|\frac{1}{n\nu}(U^{\star}Z+E)HZ^{\top}-\nu U^{\star}\|_{2}
\displaystyle\leq 1ν(1nUZZν2U2+1nEZ2+1nUZZ¯2+1nEZ¯2).\displaystyle\frac{1}{\nu}(\|\frac{1}{n}U^{\star}ZZ^{\top}-\nu^{2}U^{\star}\|_{2}+\frac{1}{n}\|EZ^{\top}\|_{2}+\frac{1}{n}\|U^{\star}Z\bar{Z}^{\top}\|_{2}+\frac{1}{n}\|E\bar{Z}^{\top}\|_{2}).

We deal with the four terms in (94) separately:

  1.

    For the first term, applying Lemma E.3, we have:

    𝔼1nUZZν2U2𝔼1nZZν2Ir2(rn+rn)ν2.\mathbb{E}\|\frac{1}{n}U^{\star}ZZ^{\top}-\nu^{2}U^{\star}\|_{2}\leq\mathbb{E}\|\frac{1}{n}ZZ^{\top}-\nu^{2}I_{r}\|_{2}\leq\quantity(\frac{r}{n}+\sqrt{\frac{r}{n}})\nu^{2}. (95)
  2.

    For the second term, applying Lemma E.2 twice, we have:

    1n𝔼EZ2=\displaystyle\frac{1}{n}\mathbb{E}\|EZ^{\top}\|_{2}= 1n𝔼Z[𝔼E[EZ2|Z]]\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\mathbb{E}_{E}[\|EZ^{\top}\|_{2}|Z]] (96)
    \displaystyle\lesssim 1n𝔼Z[Z2(σsum+r1/4σsumσ(1)+rσ(1))]\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\|Z\|_{2}(\sigma_{\text{sum}}+r^{1/4}\sqrt{\sigma_{\text{sum}}\sigma_{(1)}}+\sqrt{r}\sigma_{(1)})]
    \displaystyle\lesssim 1n𝔼Z[Z2]dσ(1)\displaystyle\frac{1}{n}\mathbb{E}_{Z}[\|Z\|_{2}]\sqrt{d}\sigma_{(1)}
    \displaystyle\lesssim 1ndσ(1)(r1/2ν+(nr)1/4ν+n1/2ν)\displaystyle\frac{1}{n}\sqrt{d}\sigma_{(1)}(r^{1/2}\nu+(nr)^{1/4}\nu+n^{1/2}\nu)
    \displaystyle\lesssim dnσ(1)ν.\displaystyle\frac{\sqrt{d}}{\sqrt{n}}\sigma_{(1)}\nu.
  3.

    For the third and fourth terms, from the bound on the fourth term in Step 1 of the PCA analysis in Appendix B, we know:

    𝔼1nUZZ¯2+𝔼1nEZ¯2𝔼z¯z¯2+𝔼ξ¯z¯2rnν2+dnνσ(1).\mathbb{E}\frac{1}{n}\|U^{\star}Z\bar{Z}^{\top}\|_{2}+\mathbb{E}\frac{1}{n}\|E\bar{Z}^{\top}\|_{2}\leq\mathbb{E}\|\bar{z}\bar{z}^{\top}\|_{2}+\mathbb{E}\|\bar{\xi}\bar{z}^{\top}\|_{2}\leq\frac{r}{n}\nu^{2}+\sqrt{\frac{d}{n}}\nu\sigma_{(1)}. (97)

Combining equations (95), (96), and (97), we have:

𝔼1nXHyνUwdnσ(1).\mathbb{E}\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}. (98)

With Equation (98), we can now turn to the difference between \hat{N} and \nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}. By Lemma D.4 we know that:

\|\hat{N}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}\lesssim(\|\frac{1}{n}XHy\|+\|\nu U^{\star}w^{\star}\|)\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|.

Using Markov’s inequality, we can conclude from (98) that:

(1nXHyνUwν)𝔼1nXHyνUwνdn.\mathbb{P}(\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|\geq\nu)\leq\frac{\mathbb{E}\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|}{\nu}\lesssim\sqrt{\frac{d}{n}}.

Then, denoting A=\{\omega:\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|_{2}<\nu\}, we have:

\displaystyle\mathbb{E}\|\hat{N}-\nu^{2}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}\mathbb{I}\{A\}\lesssim\mathbb{E}(\|\frac{1}{n}XHy\|+\|\nu U^{\star}w^{\star}\|)\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|\mathbb{I}\{A\}
\displaystyle\lesssim ν(𝔼1nXHyνUw)dnσ(1)ν.\displaystyle\nu(\mathbb{E}\|\frac{1}{n}XHy-\nu U^{\star}w^{\star}\|)\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu.

which finishes the proof.  
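To see the key estimate (98) behind Lemma D.3 in action, here is a small simulation (ours; it assumes numpy and one arbitrary choice of parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n = 40, 3, 20000
nu, sigma = 1.0, 1.0
U_star, _ = np.linalg.qr(rng.standard_normal((d, r)))
w_star = np.zeros(r)
w_star[0] = 1.0                                  # unit-norm regression direction

Z = nu * rng.standard_normal((r, n))
X = U_star @ Z + sigma * rng.standard_normal((d, n))   # spiked covariance model (5)
y = Z.T @ w_star / nu                            # labels y_i = <w*, z_i> / nu
Hy = y - y.mean()                                # H y, without forming the n x n matrix H

v = X @ Hy / n                                   # (1/n) X H y
print(np.linalg.norm(v - nu * U_star @ w_star))  # small, of order sqrt(d/n) * sigma_(1)
```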
In the main body, we assume that the numbers of labeled samples and the weights in the loss function are both balanced. Now we provide a more general result to cover the imbalanced cases. Formally, suppose we have n unlabeled samples X=[x_{1},\cdots,x_{n}]\in\mathbb{R}^{d\times n} and, for each source task \mathcal{S}_{i} (i=1,\cdots,T), n_{i} labeled samples X_{i}=[x_{i}^{1},\cdots,x_{i}^{n_{i}}] with labels y_{i}=[y_{i}^{1},\cdots,y_{i}^{n_{i}}]. We learn the linear representation via the joint optimization:

\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}(W):=\min_{W\in\mathbb{R}^{r\times d}}\mathcal{L}_{\text{SelfCon}}(W)-\sum_{t=1}^{T}\alpha_{t}\operatorname{HSIC}(\hat{X}^{t},y^{t};W), (99)

To investigate its feature recovery ability, we first give the following result.

Theorem D.5

For the optimization problem (99), if we apply augmented pairs generation in Definition 2.1 with random masking augmentation 2.2 for unlabeled data, then the optimal solution is given by:

WCL=C(i=1ruiσivi),W_{\text{CL}}=C\quantity(\sum_{i=1}^{r}u_{i}\sigma_{i}v_{i}^{\top})^{\top},

where C>0C>0 is a constant, σi\sigma_{i} is the ii-th largest eigenvalue of the following matrix:

\frac{1}{4n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})+\sum_{i=1}^{T}\frac{\alpha_{i}}{(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top},

uiu_{i} is the corresponding eigenvector, V=[v1,,vr]r×rV=[v_{1},\cdots,v_{r}]\in\mathbb{R}^{r\times r} can be any orthogonal matrix and Hni=Ini1ni1ni1niH_{n_{i}}=I_{n_{i}}-\frac{1}{n_{i}}1_{n_{i}}1_{n_{i}}^{\top} is the centering matrix.

Proof  Under this setting, combined with the result obtained in Corollary 3.2, the loss function can be rewritten as:

(W)=\displaystyle\mathcal{L}(W)= λ2WWF212ntr((12Δ(XX)12(n1)X(1n1nIn)X)WW)\displaystyle\frac{\lambda}{2}\|WW^{\top}\|_{F}^{2}-\frac{1}{2n}\tr(\quantity(\frac{1}{2}\Delta(XX^{\top})-\frac{1}{2(n-1)}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})W^{\top}W)
\displaystyle-\sum_{i=1}^{T}\alpha_{i}\frac{1}{(n_{i}-1)^{2}}\tr(X_{i}^{\top}W^{\top}WX_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}})
=\displaystyle= λ2WW14nλ(Δ(XX)1n1X(1n1nIn)X)\displaystyle\frac{\lambda}{2}\biggl{\|}WW^{\top}-\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})
i=1Tαiλ(ni1)2XiHniyiyiHniXi)F2\displaystyle-\sum_{i=1}^{T}\frac{\alpha_{i}}{\lambda(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top})\biggr{\|}_{F}^{2}
λ214nλ(Δ(XX)1n1X(1n1nIn)X)\displaystyle-\frac{\lambda}{2}\biggl{\|}\frac{1}{4n\lambda}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})
+i=1Tαiλ(ni1)2XiHniyiyiHniXiF2.\displaystyle+\sum_{i=1}^{T}\frac{\alpha_{i}}{\lambda(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top}\biggr{\|}_{F}^{2}.

Then by a similar argument as in the proof of Proposition 3.1, we can conclude that the optimal solution WCLW_{\text{CL}} must satisfy the desired conditions.  
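As with Theorem D.1, the matrix in Theorem D.5 is easy to form explicitly; a sketch (ours; it assumes numpy and the helper name joint_target is ours):

```python
import numpy as np

def joint_target(X, tasks, alphas):
    """Matrix whose top-r eigenvectors give W_CL in Theorem D.5.
    X: d x n unlabeled data; tasks: list of (X_i, y_i) with X_i of shape d x n_i."""
    n = X.shape[1]
    G = X @ X.T
    sx = X.sum(axis=1, keepdims=True)                       # X 1_n
    T = (G - np.diag(np.diag(G))                            # Delta(X X^T)
         - (sx @ sx.T - G) / (n - 1)) / (4 * n)             # minus X(1 1^T - I)X^T / (n-1)
    for (Xi, yi), ai in zip(tasks, alphas):
        ni = Xi.shape[1]
        v = Xi @ (yi - yi.mean())                           # X_i H_{n_i} y_i
        T += ai * np.outer(v, v) / (ni - 1) ** 2            # supervised HSIC term for task i
    return T
```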
We can now give the proofs of Theorems 4.5 and 4.8 under our generalized setting; the balanced versions are obtained by simply setting \alpha=\alpha_{1}=\cdots=\alpha_{T} and m=n_{1}=\cdots=n_{T}, which is consistent with Theorems 4.5 and 4.8 in the main body.

Theorem D.6 (Generalized version of Theorem 4.5)

In the regression setting where y^{t}=\langle w_{t},z\rangle/\nu, suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and n>d\gg r, and further assume that T<r and that the w_{t}'s are orthogonal to each other. Let W_{\text{CL}} be any solution that optimizes the problem in Equation (99), and denote its singular value decomposition by W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}. Then we have:

𝔼sin(Θ(UCL,U))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}\lesssim (rTmini[T]{αi,1}+Tmini[T]αi)(rdlogd+dn)\displaystyle\quantity(\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}})\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})
+i=1T(rTαi+mini[T]{αi,1}mini[T]{αi,1}+Tαi+mini[T]αimini[T]αi)dni.\displaystyle+\sum_{i=1}^{T}\quantity(\sqrt{r-T}\frac{\alpha_{i}+\min_{i\in[T]}\{\alpha_{i},1\}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\sqrt{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}})\sqrt{\frac{d}{n_{i}}}.

Proof [Proof of Theorem D.6] As shown in Theorem D.5, optimizing the loss function (99) is equivalent to finding the top-r eigenspace of the matrix

14n(Δ(XX)1n1X(1n1nIn)X)+i=1Tαi(ni1)2XiHniyiyiHniXi.\frac{1}{4n}\quantity(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top})+\sum_{i=1}^{T}\frac{\alpha_{i}}{(n_{i}-1)^{2}}X_{i}H_{n_{i}}y_{i}y_{i}^{\top}H_{n_{i}}X_{i}^{\top}.

Again denote M^21n(Δ(XX)1n1X(1n1nIn)X)\hat{M}_{2}\triangleq\frac{1}{n}(\Delta(XX^{\top})-\frac{1}{n-1}X(1_{n}1_{n}^{\top}-I_{n})X^{\top}) and N^i1(ni1)2XiHyiyiHXi\hat{N}_{i}\triangleq\frac{1}{(n_{i}-1)^{2}}X_{i}Hy_{i}y_{i}^{\top}HX_{i}^{\top}. By equation (56) we know that:

𝔼M^2M2ν2(rdlogd+rn+rn)+σ(1)2(dn+dn)+σ(1)νdn.\mathbb{E}\|\hat{M}_{2}-M\|_{2}\lesssim\nu^{2}\quantity(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}.

By Lemma D.3 we know that for each task \mathcal{S}_{i}, we can find an event A_{i} such that \mathbb{P}(A_{i}^{C})=O(\sqrt{\frac{d}{n_{i}}}) and:

𝔼N^iν2UwiwiUF𝕀{Ai}dniσ(1)ν.\mathbb{E}\|\hat{N}_{i}-\nu^{2}U^{\star}w_{i}w_{i}^{\top}U^{\star\top}\|_{F}\mathbb{I}\{A_{i}\}\lesssim\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu.

The target matrix is N=\nu^{2}U^{\star}U^{\star\top}+\sum_{i=1}^{T}\alpha_{i}\nu^{2}U^{\star}w_{i}w_{i}^{\top}U^{\star\top}, and we can obtain the following upper bound on the difference between N and \hat{N}:

𝔼N^N2𝕀{i=1TAi}14𝔼M^2M2+i=1Tαi𝔼N^iν2UwiwiUF𝕀{Ai}\displaystyle\mathbb{E}\|\hat{N}-N\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}\leq\frac{1}{4}\mathbb{E}\|\hat{M}_{2}-M\|_{2}+\sum_{i=1}^{T}\alpha_{i}\mathbb{E}\|\hat{N}_{i}-\nu^{2}U^{\star}w_{i}w_{i}^{\top}U^{\star\top}\|_{F}\mathbb{I}\{A_{i}\} (100)
\displaystyle\lesssim ν2(rdlogd+rn+rn)+σ(1)2(dn+dn)+σ(1)νdn+i=1T[αidniσ(1)ν].\displaystyle\nu^{2}(\frac{r}{d}\log d+\sqrt{\frac{r}{n}}+\frac{r}{n})+\sigma_{(1)}^{2}\quantity(\sqrt{\frac{d}{n}}+\frac{d}{n})+\sigma_{(1)}\nu\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\quantity[\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu].

We divide the top-rr eigenspace UCLU_{\text{CL}} of WCLWCLW_{\text{CL}}W_{\text{CL}}^{\top} into two parts: the top-TT eigenspace UCL(1)U_{\text{CL}}^{(1)} and top-(T+1)(T+1) to top-rr eigenspace UCL(2)U_{\text{CL}}^{(2)}. Similarly, we also divide the top-rr eigenspace UU^{\star} of NN into two parts: U(1)U^{\star(1)} and U(2)U^{\star(2)}. Then applying Lemma E.1 we can bound the sine distance for each part: on the one hand,

𝔼sin(Θ(UCL(1),U(1)))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\|_{F}
=\displaystyle= 𝔼sin(Θ(UCL(1),U(1)))F𝕀{i=1TAi}+𝔼sin(Θ(UCL(1),U(1)))F𝕀{i=1TAiC}\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\|_{F}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}+\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\|_{F}\mathbb{I}\{\cup_{i=1}^{T}A_{i}^{C}\}
\displaystyle\leq T𝔼N^N2𝕀{i=1TAi}λ(T)(N)λ(T+1)(N)+T(i=1TAiC)\displaystyle\frac{\sqrt{T}\mathbb{E}\|\hat{N}-N\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}}{\lambda_{(T)}(N)-\lambda_{(T+1)}(N)}+\sqrt{T}\mathbb{P}(\cup_{i=1}^{T}A_{i}^{C})
\displaystyle\lesssim Tmini[T]αiν2(ν2rdlogd+σ(1)2dn+i=1Tαidniσ(1)ν)+Ti=1Tdni\displaystyle\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}\nu^{2}}\quantity(\nu^{2}\frac{r}{d}\log d+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu)+\sqrt{T}\sum_{i=1}^{T}\sqrt{\frac{d}{n_{i}}}
\displaystyle\lesssim Tmini[T]αi(rdlogd+dn)+Ti=1Tαi+mini[T]αimini[T]αidni.\displaystyle\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sqrt{T}\sum_{i=1}^{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}}\sqrt{\frac{d}{n_{i}}}.

On the other hand,

𝔼sin(Θ(UCL(2),U(2)))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\|_{F}
=𝔼sin(Θ(UCL(2),U(2)))F𝕀{i=1TAi}+𝔼sin(Θ(UCL(2),U(2)))F𝕀{i=1TAiC}\displaystyle=\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\|_{F}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}+\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\|_{F}\mathbb{I}\{\cup_{i=1}^{T}A_{i}^{C}\}
\displaystyle\leq rT𝔼N^N2𝕀{i=1TAi}min{λ(T)(N)λ(T+1)(N),λ(r)(N)}+rT(i=1TAiC)\displaystyle\frac{\sqrt{r-T}\mathbb{E}\|\hat{N}-N\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}}{\min\{\lambda_{(T)}(N)-\lambda_{(T+1)}(N),\lambda_{(r)}(N)\}}+\sqrt{r-T}\mathbb{P}(\cup_{i=1}^{T}A_{i}^{C})
\displaystyle\lesssim rTmini[T]{αi,1}ν2(ν2rdlogd+σ(1)2dn+i=1Tαidniσ(1)ν)+rTi=1Tdni\displaystyle\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}\nu^{2}}\quantity(\nu^{2}\frac{r}{d}\log d+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu)+\sqrt{r-T}\sum_{i=1}^{T}\sqrt{\frac{d}{n_{i}}}
\displaystyle\lesssim rTmini[T]{αi,1}(rdlogd+dn)+rTi=1T(αimini[T]{αi,1}+1)dni.\displaystyle\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sqrt{r-T}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{\min_{i\in[T]}\{\alpha_{i},1\}}+1)\sqrt{\frac{d}{n_{i}}}.

Note that:

sin(Θ(UCL,U))F2\displaystyle\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}^{2}
=rUCLUF2\displaystyle=r-\|U_{\text{CL}}^{\top}U^{\star}\|_{F}^{2}
rUCL(1)U(1)F2UCL(2)TU(2)F2\displaystyle\leq r-\|U_{\text{CL}}^{(1)\top}U^{\star(1)}\|_{F}^{2}-\|U_{\text{CL}}^{(2)T}U^{\star(2)}\|_{F}^{2}
TUCL(1)U(1)F2+(rT)UCL(2)U(2)F2\displaystyle\leq T-\|U_{\text{CL}}^{(1)\top}U^{\star(1)}\|_{F}^{2}+(r-T)-\|U_{\text{CL}}^{(2)\top}U^{\star(2)}\|_{F}^{2}
\displaystyle\leq\|\sin\Theta(U_{\text{CL}}^{(1)},U^{\star(1)})\|_{F}^{2}+\|\sin\Theta(U_{\text{CL}}^{(2)},U^{\star(2)})\|_{F}^{2},

and the sine distance has trivial upper bounds:

sinΘ(UCL(1),U(1))F2T,sinΘ(UCL(2),U(2))F2rT\|\sin\Theta(U_{\text{CL}}^{(1)},U^{\star(1)})\|_{F}^{2}\leq T,\quad\|\sin\Theta(U_{\text{CL}}^{(2)},U^{\star(2)})\|_{F}^{2}\leq r-T

Thus we can conclude:

𝔼sin(Θ(UCL,U))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}
𝔼sin(Θ(UCL(1),U(1)))F+𝔼sin(Θ(UCL(2),U(2)))F\displaystyle\leq\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(1)},U^{\star(1)}))\|_{F}+\mathbb{E}\|\sin(\Theta(U_{\text{CL}}^{(2)},U^{\star(2)}))\|_{F}
(rTmini[T]{αi,1}(rdlogd+dn)+i=1TrTαi+mini[T]{αi,1}mini[T]{αi,1}dni)rT\displaystyle\lesssim\quantity(\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sum_{i=1}^{T}\sqrt{r-T}\frac{\alpha_{i}+\min_{i\in[T]}\{\alpha_{i},1\}}{\min_{i\in[T]}\{\alpha_{i},1\}}\sqrt{\frac{d}{n_{i}}})\wedge\sqrt{r-T}
+(Tmini[T]αi(rdlogd+dn)+i=1TTαi+mini[T]αimini[T]αidni)T.\displaystyle\quad+\quantity(\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sum_{i=1}^{T}\sqrt{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}}\sqrt{\frac{d}{n_{i}}})\wedge\sqrt{T}.

 

Theorem D.7 (Restatement of Theorem 4.6)

Suppose the conditions in Theorem 4.5 hold. Then,

𝔼𝒟[infwr𝔼[r(δWCL,w)]infwr𝔼[r(δU,w)]\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})] (101)
rT(rlogdd+dn+αTdm1)+T(rlogdαd+1αdn+Tdm).\displaystyle\quad\lesssim\sqrt{r-T}\left(\frac{r\log d}{d}+\sqrt{\frac{d}{n}}+\alpha T\sqrt{\frac{d}{m}}\wedge 1\right)+\sqrt{T}\left(\frac{r\log d}{\alpha d}+\frac{1}{\alpha}\sqrt{\frac{d}{n}}+T\sqrt{\frac{d}{m}}\right). (102)

Proof [Proof of Theorem D.7] Theorem 4.6 follows directly from Lemma B.20 and Theorem 4.5.  

Theorem D.8 (Generalized version of Theorem 4.8)

In the regression setting where y^{t}=\langle w_{t},z\rangle/\nu, suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5) and n>d\gg r, and further assume that T\geq r and that \sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top} is full rank. Let W_{\text{CL}} be the optimal solution of the optimization problem in Equation (99), and denote its singular value decomposition by W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}. Then we have:

𝔼sin(Θ(UCL,U))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}\lesssim r1+ν2λ(r)(i=1Tαiwiwi)(rdlogd+dn)\displaystyle\frac{\sqrt{r}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})
+ri=1T(αi1+ν2λ(r)(i=1Tαiwiwi)+1)dni.\displaystyle+\sqrt{r}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}+1)\sqrt{\frac{d}{n_{i}}}.

Proof [Proof of Theorem D.8] The proof strategy is similar to that of Theorem D.6; the difference here is that each direction can be accurately estimated from the labeled data, so we do not need to split the eigenspace. Directly applying Lemma E.1 and Equation (100), we have:

𝔼sin(Θ(UCL,U))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}
=𝔼sin(Θ(UCL,U))F𝕀{i=1TAi}+𝔼sin(Θ(UCL,U))F𝕀{i=1TAiC}\displaystyle=\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}+\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}\mathbb{I}\{\cup_{i=1}^{T}A_{i}^{C}\}
r𝔼N^N2𝕀{i=1TAi}λ(r)(N)+r(i=1TAiC)\displaystyle\lesssim\frac{\sqrt{r}\mathbb{E}\|\hat{N}-N\|_{2}\mathbb{I}\{\cap_{i=1}^{T}A_{i}\}}{\lambda_{(r)}(N)}+\sqrt{r}\mathbb{P}(\cup_{i=1}^{T}A_{i}^{C})
rν2+ν2λ(r)(i=1Tαiwiwi)(ν2rdlogd+σ(1)2dn+i=1Tαidniσ(1)ν)+ri=1Tdni\displaystyle\lesssim\frac{\sqrt{r}}{\nu^{2}+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\nu^{2}\frac{r}{d}\log d+\sigma_{(1)}^{2}\sqrt{\frac{d}{n}}+\sum_{i=1}^{T}\alpha_{i}\sqrt{\frac{d}{n_{i}}}\sigma_{(1)}\nu)+\sqrt{r}\sum_{i=1}^{T}\sqrt{\frac{d}{n_{i}}}
r1+λ(r)(i=1Tαiwiwi)(rdlogd+dn)+ri=1T(αi1+λ(r)(i=1Tαiwiwi)+1)dni.\displaystyle\lesssim\frac{\sqrt{r}}{1+\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+\sqrt{r}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{1+\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}+1)\sqrt{\frac{d}{n_{i}}}.

 

Theorem D.9 (Restatement of Theorem 4.9)

Suppose the conditions in Theorem 4.8 hold. Then,

𝔼𝒟[infwr𝔼[r(δWCL,w)]infwr𝔼[r(δU,w)]rα+1(rdlogd+dn)+Tdrm.\displaystyle\mathbb{E}_{\mathcal{D}}[\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{W_{\text{CL}},w})]-\inf_{w\in\mathbb{R}^{r}}\mathbb{E}_{\mathcal{E}}[\ell_{r}(\delta_{U^{\star\top},w})]\lesssim\frac{\sqrt{r}}{\alpha+1}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})+T\sqrt{\frac{dr}{m}}. (103)

Proof [Proof of Theorem 4.9] Theorem 4.9 follows directly from Lemma B.20 and Theorem 4.8.  

Now we move to a binary classification setting, where labels yy are generated by y=sign(w,z)y=\operatorname{sign}(\langle w^{\star},z\rangle) instead of y=w,z/νy=\langle w^{\star},z\rangle/\nu in previous regression setting. We first give the corresponding generalized version of Theorem 4.10 and Theorem 4.11 to cover the general imbalanced settings.

Theorem D.10 (Generalized version of Theorem 4.10)

In the classification setting where y^{t}=\operatorname{sign}(\langle w_{t},z\rangle), suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5), z follows a Gaussian distribution, and n>d\gg r, and further assume that T<r and that the w_{t}'s are orthogonal to each other. Let W_{\text{CL}} be any solution that optimizes the problem in Equation (99), and denote its singular value decomposition by W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}. Then we have:

𝔼sin(Θ(UCL,U))F\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}\lesssim (rTmini[T]{αi,1}+Tmini[T]αi)(rdlogd+dn)\displaystyle\quantity(\frac{\sqrt{r-T}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\frac{\sqrt{T}}{\min_{i\in[T]}\alpha_{i}})\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})
+i=1T(rTαi+mini[T]{αi,1}mini[T]{αi,1}+Tαi+mini[T]αimini[T]αi)dni.\displaystyle+\sum_{i=1}^{T}\quantity(\sqrt{r-T}\frac{\alpha_{i}+\min_{i\in[T]}\{\alpha_{i},1\}}{\min_{i\in[T]}\{\alpha_{i},1\}}+\sqrt{T}\frac{\alpha_{i}+\min_{i\in[T]}\alpha_{i}}{\min_{i\in[T]}\alpha_{i}})\sqrt{\frac{d}{n_{i}}}.
Theorem D.11 (Generalized version of Theorem 4.11)

In the classification setting where y^{t}=\operatorname{sign}(\langle w_{t},z\rangle), suppose Assumptions 3.4-3.6 hold for the spiked covariance model (5), z follows a Gaussian distribution, and n>d\gg r, and further assume that T\geq r and that \sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top} is full rank. Let W_{\text{CL}} be the optimal solution of the optimization problem in Equation (99), and denote its singular value decomposition by W_{\text{CL}}=(U_{\text{CL}}\Sigma_{\text{CL}}V_{\text{CL}}^{\top})^{\top}. Then we have:

\displaystyle\mathbb{E}\|\sin(\Theta(U_{\text{CL}},U^{\star}))\|_{F}\lesssim \frac{\sqrt{r}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}\quantity(\frac{r}{d}\log d+\sqrt{\frac{d}{n}})
+\sqrt{r}\sum_{i=1}^{T}\quantity(\frac{\alpha_{i}}{1+\nu^{2}\lambda_{(r)}(\sum_{i=1}^{T}\alpha_{i}w_{i}w_{i}^{\top})}+1)\sqrt{\frac{d}{n_{i}}}.

The only difference between these two settings is the distribution of the labels $y$. Thus, to prove Theorem D.10 and Theorem D.11, we only need to establish the analogue of Lemma D.3 in this binary classification setting. Since the labels are discrete in the classification setting and therefore harder to analyze, we impose the Gaussian assumption on $z$ to keep the problem mathematically tractable in these two theorems.

Lemma D.12 (Classification version of Lemma D.3)

In the binary classification setting, under conditions similar to those of Theorem 3.10, and assuming $z$ in the spiked covariance model (5) follows a Gaussian distribution, we can find an event $A$ such that $\mathbb{P}(A^{C})=O(\sqrt{d/n})$ and:

\mathbb{E}\quantity[\norm{\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}}_{F}\mathbb{I}\{A\}]\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu. (104)

Proof  Again, by (93) we have:

\displaystyle\|\frac{1}{(n-1)^{2}}XHyy^{\top}HX^{\top}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}
\displaystyle\lesssim \|\frac{1}{n^{2}}XHyy^{\top}HX^{\top}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}+\frac{r}{n}\nu^{2},

thus it suffices to prove equation (104) with $\frac{1}{(n-1)^{2}}$ replaced by $\frac{1}{n^{2}}$. Denote $\hat{N}\triangleq\frac{1}{n^{2}}XHyy^{\top}HX^{\top}$, and note that both $\hat{N}$ and $U^{\star}w^{\star}w^{\star\top}U^{\star\top}$ are rank-1 matrices. We first bound the difference between $\frac{1}{n}XHy$ and $\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}$:

\displaystyle\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|=\|\frac{1}{n}(U^{\star}Z+E)Hy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\| (105)
\displaystyle\leq\|\frac{1}{n}U^{\star}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|+\frac{1}{n}\|Ey\|+\frac{1}{n}\|U^{\star}Z\bar{y}\|+\frac{1}{n}\|E\bar{y}\|.

We deal with the four terms in (105) separately:

  1.

    For the first term, note that $\frac{1}{n}Zy=\frac{1}{n}\sum_{i=1}^{n}z_{i}\operatorname{sign}(z_{i}^{\top}w^{\star})$ and $z_{i}\sim\mathcal{N}(0,\nu^{2}I_{r})$, so $z_{i}\operatorname{sign}(z_{i}^{\top}w^{\star})$ follows a folded Gaussian distribution, which is a reflection of the Gaussian distribution along the normal plane of $w^{\star}$ (a numerical sanity check of the resulting mean identity is sketched after equation (109)); thus

    \displaystyle\mathbb{E}\|\frac{1}{n}U^{\star}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\leq\mathbb{E}\|\frac{1}{n}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}w^{\star}\|\leq\sqrt{\mathbb{E}\|\frac{1}{n}Zy-\sqrt{\frac{2\nu^{2}}{\pi}}w^{\star}\|^{2}} (106)
    \displaystyle\leq\sqrt{\frac{r}{n}}\nu.
  2.

    For the second term, note that $y$ and $E$ are independent and $|y_{i}|=1$ almost surely, so that

    \displaystyle\frac{1}{n}\mathbb{E}\|Ey\|=\frac{1}{n}\mathbb{E}\|\sum_{i=1}^{n}\xi_{i}\|\leq\frac{1}{n}\sqrt{\mathbb{E}\|\sum_{i=1}^{n}\xi_{i}\|^{2}}\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)} (107)
  3.

    For the third and fourth terms, we have:

    \mathbb{E}\frac{1}{n}\|U^{\star}Z\bar{y}\|+\mathbb{E}\frac{1}{n}\|E\bar{y}\|\leq\mathbb{E}\frac{1}{n}\|\sum_{i=1}^{n}z_{i}\|+\mathbb{E}\frac{1}{n}\|\sum_{i=1}^{n}\xi_{i}\|\lesssim\sqrt{\frac{r}{n}}\nu+\sqrt{\frac{d}{n}}\sigma_{(1)}. (108)

Combining equations (106), (107), and (108), we obtain:

\mathbb{E}\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}. (109)
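As an aside (not part of the proof), the folded-Gaussian mean identity behind the first term, namely $\mathbb{E}[z\operatorname{sign}(\langle z,w^{\star}\rangle)]=\sqrt{2/\pi}\,\nu w^{\star}$ for $z\sim\mathcal{N}(0,\nu^{2}I_{r})$ and unit-norm $w^{\star}$, is easy to check numerically. The following Monte Carlo sketch uses hypothetical values of $r$, $\nu$, and the sample size, and is included purely as an illustration:

import numpy as np

rng = np.random.default_rng(0)
r, nu, n = 5, 2.0, 200_000                    # illustrative (hypothetical) values

w_star = rng.standard_normal(r)
w_star /= np.linalg.norm(w_star)              # assume ||w_star|| = 1, as in the proof

Z = nu * rng.standard_normal((n, r))          # rows z_i ~ N(0, nu^2 I_r)
y = np.sign(Z @ w_star)                       # y_i = sign(<z_i, w_star>)

empirical = (Z * y[:, None]).mean(axis=0)     # (1/n) sum_i z_i y_i
target = np.sqrt(2.0 / np.pi) * nu * w_star   # sqrt(2 nu^2 / pi) w_star

print(np.linalg.norm(empirical - target))     # small, roughly of order nu * sqrt(r / n)

The deviation observed here shrinks at roughly the $\sqrt{r/n}$ rate used in (106).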

With equation (109) in hand, we now turn to the difference between $\hat{N}$ and $\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}$. By Lemma D.4 we know that:

\|\hat{N}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}\lesssim(\|\frac{1}{n}XHy\|+\|\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|)\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|.
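For completeness, one elementary way to see a bound of this form (a side remark; the proof uses Lemma D.4 as stated): writing $a=\frac{1}{n}XHy$ and $b=\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}$, so that $\hat{N}=aa^{\top}$, we have

aa^{\top}-bb^{\top}=a(a-b)^{\top}+(a-b)b^{\top},\qquad\text{and hence}\qquad\|aa^{\top}-bb^{\top}\|_{F}\leq(\|a\|+\|b\|)\|a-b\|.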

Using Markov’s inequality, we can conclude from (109) that:

\mathbb{P}(\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\geq\nu)\leq\frac{\mathbb{E}\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|}{\nu}\lesssim\sqrt{\frac{d}{n}}.

Then, denoting $A=\{\omega:\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|_{2}<\nu\}$, we have:

\displaystyle\mathbb{E}\|\hat{N}-\frac{2\nu^{2}}{\pi}U^{\star}w^{\star}w^{\star\top}U^{\star\top}\|_{F}\mathbb{I}\{A\}\lesssim \mathbb{E}(\|\frac{1}{n}XHy\|+\|\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|)\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\mathbb{I}\{A\}
\displaystyle\lesssim \nu\mathbb{E}\|\frac{1}{n}XHy-\sqrt{\frac{2\nu^{2}}{\pi}}U^{\star}w^{\star}\|\lesssim\sqrt{\frac{d}{n}}\sigma_{(1)}\nu,

which finishes the proof.  
With Lemma D.12 established, it is straightforward to obtain the same results as in Theorem 4.5, Theorem 4.6, Theorem 4.8 and Theorem 4.9 for this binary classification setting.

E Useful lemmas

In this section, we list some of the main technical tools that have been used in the proofs of the main results.

Lemma E.1 (Theorem 2 in Yu et al. (2015))

Let $\Sigma,\hat{\Sigma}\in\mathbb{R}^{p\times p}$ be symmetric, with eigenvalues $\lambda_{1}\geq\ldots\geq\lambda_{p}$ and $\hat{\lambda}_{1}\geq\ldots\geq\hat{\lambda}_{p}$ respectively. Fix $1\leq r\leq s\leq p$ and assume that $\min\left(\lambda_{r-1}-\lambda_{r},\lambda_{s}-\lambda_{s+1}\right)>0$, where $\lambda_{0}:=\infty$ and $\lambda_{p+1}:=-\infty$. Let $d:=s-r+1$, and let $V=\left(v_{r},v_{r+1},\ldots,v_{s}\right)\in\mathbb{R}^{p\times d}$ and $\hat{V}=\left(\hat{v}_{r},\hat{v}_{r+1},\ldots,\hat{v}_{s}\right)\in\mathbb{R}^{p\times d}$ have orthonormal columns satisfying $\Sigma v_{j}=\lambda_{j}v_{j}$ and $\hat{\Sigma}\hat{v}_{j}=\hat{\lambda}_{j}\hat{v}_{j}$ for $j=r,r+1,\ldots,s$. Then

sinΘ(V^,V)F2min(d1/2Σ^Σ2,Σ^ΣF)min(λr1λr,λsλs+1).\|\sin\Theta(\hat{V},V)\|_{\mathrm{F}}\leq\frac{2\min\left(d^{1/2}\|\hat{\Sigma}-\Sigma\|_{\mathrm{2}},\|\hat{\Sigma}-\Sigma\|_{\mathrm{F}}\right)}{\min\left(\lambda_{r-1}-\lambda_{r},\lambda_{s}-\lambda_{s+1}\right)}.

Moreover, there exists an orthogonal matrix $\hat{O}\in\mathbb{R}^{d\times d}$ such that

V^O^VF23/2min(d1/2Σ^Σ2,Σ^ΣF)min(λr1λr,λsλs+1).\|\hat{V}\hat{O}-V\|_{\mathrm{F}}\leq\frac{2^{3/2}\min\left(d^{1/2}\|\hat{\Sigma}-\Sigma\|_{\mathrm{2}},\|\hat{\Sigma}-\Sigma\|_{\mathrm{F}}\right)}{\min\left(\lambda_{r-1}-\lambda_{r},\lambda_{s}-\lambda_{s+1}\right)}.
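As an illustration of how Lemma E.1 is typically applied, the following minimal numerical sketch (with hypothetical dimensions and an arbitrarily chosen small perturbation) compares $\|\sin\Theta(\hat{V},V)\|_{F}$ against the right-hand side of the first display:

import numpy as np

rng = np.random.default_rng(0)
p, r, s = 20, 1, 3                            # leading eigen-block (hypothetical sizes)
d = s - r + 1

# Symmetric Sigma with eigenvalues 10, 9, 8 and a well-separated tail.
eigvals = np.concatenate([np.array([10.0, 9.0, 8.0]), np.linspace(1.0, 0.1, p - 3)])
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Sigma = Q @ np.diag(eigvals) @ Q.T

# Small symmetric perturbation, so Sigma_hat - Sigma = E exactly.
E = 0.05 * rng.standard_normal((p, p))
E = (E + E.T) / 2
Sigma_hat = Sigma + E

_, V_all = np.linalg.eigh(Sigma)              # eigh returns ascending eigenvalues
V = V_all[:, ::-1][:, :d]                     # top-d eigenvectors of Sigma
_, V_hat_all = np.linalg.eigh(Sigma_hat)
V_hat = V_hat_all[:, ::-1][:, :d]             # top-d eigenvectors of Sigma_hat

# ||sin Theta(V_hat, V)||_F from the principal angles (singular values of V_hat^T V).
cosines = np.clip(np.linalg.svd(V_hat.T @ V, compute_uv=False), 0.0, 1.0)
lhs = np.sqrt(np.sum(1.0 - cosines**2))

# Right-hand side of Lemma E.1: with r = 1, lambda_0 = +infinity, so the relevant
# eigengap is lambda_s - lambda_{s+1} = 8.0 - 1.0.
gap = eigvals[s - 1] - eigvals[s]
rhs = 2 * min(np.sqrt(d) * np.linalg.norm(E, 2), np.linalg.norm(E, 'fro')) / gap
print(lhs, rhs)                               # lhs should not exceed rhs

On this example the computed $\|\sin\Theta(\hat{V},V)\|_{F}$ stays below the bound, as the lemma guarantees whenever the eigengap condition holds.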
Lemma E.2 (Lemma 2 in Zhang et al. (2018))

Assume that $E\in\mathbb{R}^{p_{1}\times p_{2}}$ has independent sub-Gaussian entries, with $\operatorname{Var}\left(E_{ij}\right)=\sigma_{ij}^{2}$, $\sigma_{C}^{2}=\max_{j}\sum_{i}\sigma_{ij}^{2}$, $\sigma_{R}^{2}=\max_{i}\sum_{j}\sigma_{ij}^{2}$, and $\sigma_{(1)}^{2}=\max_{i,j}\sigma_{ij}^{2}$. Assume that

Eij/σijψ2:=maxq1q1/2{𝔼(|Eij|/σij)q}1/qκ.\left\|E_{ij}/\sigma_{ij}\right\|_{\psi_{2}}:=\max_{q\geq 1}q^{-1/2}\left\{\mathbb{E}\left(\left|E_{ij}\right|/\sigma_{ij}\right)^{q}\right\}^{1/q}\leq\kappa.

Let $V\in\mathbb{O}_{p_{2},r}$ be a fixed orthogonal matrix. Then

(EV22(σC+x))2exp(5rmin{x4κ4σ(1)2σC2,x2κ2σ(1)2}),\mathbb{P}\left(\|EV\|_{2}\geq 2\left(\sigma_{C}+x\right)\right)\leq 2\exp\left(5r-\min\left\{\frac{x^{4}}{\kappa^{4}\sigma_{(1)}^{2}\sigma_{C}^{2}},\frac{x^{2}}{\kappa^{2}\sigma_{(1)}^{2}}\right\}\right),
𝔼EV2σC+κr1/4(σ(1)σC)1/2+κr1/2σ(1).\mathbb{E}\|EV\|_{2}\lesssim\sigma_{C}+\kappa r^{1/4}\left(\sigma_{(1)}\sigma_{C}\right)^{1/2}+\kappa r^{1/2}\sigma_{(1)}.
Lemma E.3 (Theorem 6 in Cai et al. (2020))

Suppose $Z$ is a $p_{1}$-by-$p_{2}$ random matrix with independent mean-zero sub-Gaussian entries. If there exist $\sigma_{1},\ldots,\sigma_{p_{1}}\geq 0$ such that $\left\|Z_{ij}/\sigma_{i}\right\|_{\psi_{2}}\leq C_{K}$ for some constant $C_{K}>0$, then

𝔼ZZ𝔼ZZ2iσi2+p2iσi2maxiσi.\mathbb{E}\left\|ZZ^{\top}-\mathbb{E}ZZ^{\top}\right\|_{2}\lesssim\sum_{i}\sigma_{i}^{2}+\sqrt{p_{2}\sum_{i}\sigma_{i}^{2}}\cdot\max_{i}\sigma_{i}.
Lemma E.4 (The Eckart-Young-Mirsky Theorem (Eckart and Young, 1936))

Suppose that $A=U\Sigma V^{T}$ is the singular value decomposition of $A$. Then the best rank-$k$ approximation of the matrix $A$ w.r.t. the Frobenius norm $\|\cdot\|_{F}$ is given by

Ak=i=1kσiuiviT.A_{k}=\sum_{i=1}^{k}\sigma_{i}u_{i}v_{i}^{T}.

That is, for any matrix $B$ of rank at most $k$,

AAkFABF.\|A-A_{k}\|_{F}\leq\|A-B\|_{F}.
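As a quick sanity check of Lemma E.4 (a sketch with randomly generated, hypothetical matrices), the truncated SVD yields a rank-$k$ approximation whose Frobenius error is no larger than that of an arbitrary rank-$k$ competitor:

import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols, k = 30, 20, 5                 # illustrative (hypothetical) sizes

A = rng.standard_normal((n_rows, n_cols))
U, svals, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation A_k = sum_{i <= k} sigma_i u_i v_i^T.
A_k = U[:, :k] @ np.diag(svals[:k]) @ Vt[:k, :]

# An arbitrary competitor of rank at most k (here a random low-rank product).
B = rng.standard_normal((n_rows, k)) @ rng.standard_normal((k, n_cols))

print(np.linalg.norm(A - A_k, 'fro'), np.linalg.norm(A - B, 'fro'))
# The first error should never exceed the second, as Lemma E.4 asserts.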

References

  • Agarwal et al. (2020) Anish Agarwal, Devavrat Shah, and Dennis Shen. On principal component regression in a high-dimensional error-in-variables setting. arXiv preprint arXiv:2010.14449, 2020.
  • Arora et al. (2019) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
  • Bai and Yao (2012) Zhidong Bai and Jianfeng Yao. On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177, 2012.
  • Ballard (1987) Dana H Ballard. Modular learning in neural networks. In AAAI, volume 647, pages 279–284, 1987.
  • Barshan et al. (2011) Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357–1371, 2011.
  • Bing et al. (2021) Xin Bing, Florentina Bunea, Seth Strimas-Mackey, and Marten Wegkamp. Prediction under latent factor regression: Adaptive pcr, interpolating predictors and beyond. Journal of Machine Learning Research, 22(177):1–50, 2021.
  • Bourlard and Kamp (1988) Hervé Bourlard and Yves Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics, 59(4):291–294, 1988.
  • Cai and Zhang (2018) T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89, 2018.
  • Cai et al. (2020) T Tony Cai, Rungang Han, and Anru R Zhang. On the non-asymptotic concentration of heteroskedastic wishart-type matrix. arXiv preprint arXiv:2008.12434, 2020.
  • Candès and Recht (2009) Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717–772, 2009.
  • Ćevid et al. (2020) Domagoj Ćevid, Peter Bühlmann, and Nicolai Meinshausen. Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research, 21:232, 2020.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020a.
  • Chen et al. (2020b) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020b.
  • Chen et al. (2020c) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020c.
  • Coates et al. (2011) Adam Coates, A. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
  • Deshpande and Montanari (2014) Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse pca. In 2014 IEEE International Symposium on Information Theory, pages 2197–2201. IEEE, 2014.
  • Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. Advances in neural information processing systems, 27:766–774, 2014.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Du et al. (2020) Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  • Eckart and Young (1936) Carl Eckart and G. Marion Young. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936.
  • Fan et al. (2019) Jianqing Fan, Cong Ma, and Yiqiao Zhong. A selective overview of deep learning. arXiv preprint arXiv:1904.05526, 2019.
  • Feizi et al. (2020) Soheil Feizi, Farzan Farnia, Tony Ginart, and David Tse. Understanding gans in the lqg setting: Formulation, generalization and stability. IEEE Journal on Selected Areas in Information Theory, 1(1):304–311, 2020.
  • Fu et al. (2022) Daniel Y Fu, Mayee F Chen, Michael Zhang, Kayvon Fatahalian, and Christopher Ré. The details matter: Preventing class collapse in supervised contrastive learning. In Computer Sciences & Mathematics Forum, volume 3, page 4. MDPI, 2022.
  • Garg and Liang (2020) Siddhant Garg and Yingyu Liang. Functional regularization for representation learning: A unified theoretical perspective. Advances in Neural Information Processing Systems, 33:17187–17199, 2020.
  • Golub and Loan (1996) Gene H. Golub and Charles Van Loan. Matrix computations (3rd ed.). 1996.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Graf et al. (2021) Florian Graf, Christoph Hofer, Marc Niethammer, and Roland Kwitt. Dissecting supervised constrastive learning. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2021.
  • Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
  • Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
  • Han et al. (2018) Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
  • HaoChen et al. (2021) Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. arXiv preprint arXiv:2106.04156, 2021.
  • He et al. (2016) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
  • He et al. (2021) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
  • He et al. (2018) Xinwei He, Yang Zhou, Zhichao Zhou, Song Bai, and Xiang Bai. Triplet-center loss for multi-view 3d object retrieval. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1945–1954, 2018.
  • Hyvärinen et al. (2009) Aapo Hyvärinen, Jarmo Hurri, Patrik O Hoyer, Aapo Hyvärinen, Jarmo Hurri, and Patrik O Hoyer. Independent component analysis. Springer, 2009.
  • Islam et al. (2021) Ashraful Islam, Chun-Fu Chen, Rameswar Panda, Leonid Karlinsky, Richard J. Radke, and Rogério Schmidt Feris. A broad study on the transferability of visual representations with contrastive learning. ArXiv, abs/2103.13517, 2021.
  • Jaiswal et al. (2021) Ashish Jaiswal, Ashwin Ramesh Babu, Mohammad Zaki Zadeh, Debapriya Banerjee, and Fillia Makedon. A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2021.
  • Jing et al. (2021) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.
  • Johnstone (2001) Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of statistics, 29(2):295–327, 2001.
  • Jolliffe (1982) Ian T Jolliffe. A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(3):300–303, 1982.
  • Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • Lee et al. (2021) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
  • Li et al. (2021) Yazhe Li, Roman Pogodin, Danica J Sutherland, and Arthur Gretton. Self-supervised learning with kernel dependence maximization. Advances in Neural Information Processing Systems, 34, 2021.
  • Liu et al. (2021) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021.
  • Marsaglia (1972) George Marsaglia. Choosing a point from the surface of a sphere. Annals of Mathematical Statistics, 43:645–646, 1972.
  • Misra and Maaten (2020) Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Pirjol (2013) Dan Pirjol. The logistic-normal integral and its generalizations. Journal of Computational and Applied Mathematics, 237(1):460–469, 2013.
  • Plaut (2018) Elad Plaut. From principal subspaces to principal components with linear autoencoders. arXiv preprint arXiv:1804.10253, 2018.
  • Refinetti and Goldt (2022) Maria Refinetti and Sebastian Goldt. The dynamics of representation learning in shallow, non-linear autoencoders. arXiv preprint arXiv:2201.02115, 2022.
  • Saunshi et al. (2022) Nikunj Saunshi, Jordan Ash, Surbhi Goel, Dipendra Misra, Cyril Zhang, Sanjeev Arora, Sham Kakade, and Akshay Krishnamurthy. Understanding contrastive learning requires incorporating inductive biases. In International Conference on Machine Learning, pages 19250–19286. PMLR, 2022.
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
  • Sohn (2016) Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, 2016.
  • Song et al. (2007a) Le Song, Arthur Gretton, Karsten Borgwardt, and Alex Smola. Colored maximum variance unfolding. Advances in Neural Information Processing Systems, 20:1385–1392, 2007a.
  • Song et al. (2007b) Le Song, Alex Smola, Arthur Gretton, and Karsten M Borgwardt. A dependence maximization view of clustering. In Proceedings of the 24th international conference on Machine learning, pages 815–822, 2007b.
  • Song et al. (2007c) Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th international conference on Machine learning, pages 823–830, 2007c.
  • Tian et al. (2020) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 776–794. Springer, 2020.
  • Tian (2022) Yuandong Tian. Understanding deep contrastive learning via coordinate-wise optimization. In Advances in Neural Information Processing Systems, 2022.
  • Tian et al. (2021) Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810, 2021.
  • Tosh et al. (2021) Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pages 1179–1206. PMLR, 2021.
  • Tripuraneni et al. (2021) Nilesh Tripuraneni, Chi Jin, and Michael Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
  • Tsai et al. (2020) Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
  • Wang and Isola (2020) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
  • Wang et al. (2021) Xiang Wang, Xinlei Chen, Simon S Du, and Yuandong Tian. Towards demystifying representation learning with non-contrastive self-supervision. arXiv preprint arXiv:2110.04947, 2021.
  • Wen and Li (2021) Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. arXiv preprint arXiv:2105.15134, 2021.
  • Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
  • Yao et al. (2015) Jianfeng Yao, Shurong Zheng, and ZD Bai. Sample covariance matrices and high-dimensional data analysis. Cambridge University Press Cambridge, 2015.
  • Ye et al. (2019) Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6210–6219, 2019.
  • Yu et al. (2015) Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
  • Zhang et al. (2018) Anru R Zhang, T Tony Cai, and Yihong Wu. Heteroskedastic pca: Algorithm, optimality, and applications. arXiv preprint arXiv:1810.08316, 2018.
  • Zimmermann et al. (2021) Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.