Direct Coloring for Self-Supervised Enhanced Feature Decoupling

Salman Mohamadi
West Virginia University
Morgantown, WV, USA
[email protected] Gianfranco Doretto
West Virginia University
Morgantown, WV, USA
[email protected] Donald A. Adjeroh
West Virginia University
Morgantown, WV, USA
[email protected]

Abstract

The success of self-supervised learning (SSL) has been the focus of multiple recent theoretical and empirical studies, including the role of data augmentation (in feature decoupling) as well as complete and dimensional representation collapse. While complete collapse is well-studied and addressed, dimensional collapse has only gain attention and addressed in recent years mostly using variants of redundancy reduction (aka whitening) techniques. In this paper, we further explore a complementary approach to whitening via feature decoupling for improved representation learning while avoiding representation collapse. In particular, we perform feature decoupling by early promotion of useful features via careful feature coloring. The coloring technique is developed based on a Bayesian prior of the augmented data, which is inherently encoded for feature decoupling. We show that our proposed framework is complementary to the state-of-the-art techniques, while outperforming both contrastive and recent non-contrastive methods. We also study the different effects of coloring approach to formulate it as a general complementary technique along with other baselines.

1 Introduction

Self-supervised learning (SSL) provides state-of-the-art results in unsupervised learning, outperforming deep active learning [24] and semi-supervised learning, while rivaling supervised learning under different settings. Specifically, the core idea of SSL frameworks is to train a model on properly augmented data [32] to accomplish a proxy task (also called pretext task) guided by an appropriate loss function [18]. Despite the emergence of a variety of techniques, a majority of the approaches are based on a first principle [25], enforcing invariance to the representation of augmented data. Seeing it from the perspective of Information Bottleneck (IB) principle [33], the goal is to learn a representation that is very much informative about the data distribution while un-informative of the augmentation. Recent literature offers a multitude of research on both pretext tasks and loss functions, leading to the emergence and evolution of different sets of frameworks, including contrastive, non-contrastive, clustering-based, and whitening-based approaches [23]. Though less explored, the augmentation process also has been investigated recently [31, 2, 35, 36]. Recent theoretical investigations along with empirical assessment of the learning process of SSL presented a number of findings regarding the elements behind its tremendous success [26]. Among them, it is established that augmentation is essential as it helps decouple two sets of features, sparse (useful) features and dense (less useful) features, leading to learning meaningful representations [35, 36] with respect to downstream tasks. The leading argument here is that augmentation decouples these two types of features, as proper augmentation reduces the correlation between dense features while keeping the correlation between sparse features.

In essence, proper augmentation encourages learning of useful features (sparse features) by perturbing mainly the dense features. In fact, it is mostly feature decoupling that is used to counteract the void created by lack of labels in SSL. Specifically, unlike in supervised learning where the labels guide the learning process toward learning and encoding useful features (as they are shared in samples with the same label), in SSL, learning useful features is due to feature decoupling. Thus, enhancing feature decoupling can be expected to significantly improve the learning mechanism in SSL. However, we argue that one less noticed downside effect of the augmentation process seems to be its indirect contribution to representation collapse. Representation collapse is a common phenomena in SSL training process, where essentially the learning process leads to some sort of trivial representation. While complete collapse is the main type of representation collapse, recently, another type of collapse, namely dimensional collapse, has also been characterized [17]. One way to think of the complete collapse is to see it as a special solution to the optimization where the corresponding representation is constant (all features are constant). Dimensional collapse, however, emerges out of highly correlated dimensions in the representation, where dimensions collapse to a single dimension (or potentially much fewer than actual number of dimensions). Complete collapse has been well-addressed by techniques such as careful training protocols [6], asymmetric architectural design and training protocols [13, 7]. These in essence inject some variance to avoid having zero variance (complete collapse). In contrast, dimensional collapse is well-addressed in recent work on whitening embedding/ latent space [11, 39] by standardizing some covariance matrix, in order to eliminate high correlation between dimensions of representation. One potential downside of these set of approaches is that the whitening process could generally limit the capacity of the model [30], especially if used earlier in low level feature learning. In other words, if done without a careful attention, whitening could hinder the feature decoupling provided by the augmentation process as it decorrelates the dimensions regardless of its relevance (or otherwise) to the target/desired representation.

In this paper, we present a technique to alleviate this downside of whitening by direct coloring to further enhancing feature decoupling, while promoting a faster learning process. Our proposed approach applies to whitening and non-whitening based approaches. We also theoretically and empirically examine how coloring would substantially reduce the chance of complete collapse, the primary type of collapse. Our key contributions are as follows:

•

We propose a technique based on coloring transform to further enhance feature decoupling based on augmentation, leading to improved performance. We also empirically show a faster learning convergence, and discuss the avoidance of complete collapse using constrained optimization.
•

We develop a direct coloring technique privileged by a Bayesian prior that does not require the conventional stage of whitening before coloring, allowing for faster coloring transform on the cross-correlation matrix of some embedding space.
•

We perform a detailed empirical study, suggesting that while coloring is most effective for whitening-based SSL frameworks, a simple variation (weaker version) also improves some other existing non-whitening baselines.

2 Preliminary and background

1. SSL: Enforcing invariance to the representation of augmented views is the driving first principle of most existing SSL approaches. This core idea has been instantiated via a variety of methods including contrastive approaches [6], non-contrastive approaches [13, 7], clustering-based methods [4, 3], whitening-based techniques [11, 39], etc. Along with this, there have been parallel efforts on improved augmentation protocols [32], sampling strategies [1, 34, 37], and robustness [29]. Studies on representation collapse and theoretical justification of approaches and results have also been considered [35, 36]. Augmentation effect is indirectly connected with representation collapse. However, augmentation is also the main source of feature decoupling [35], orienting the learning toward useful sparse features, similar to the role of labels in supervised learning. While existing standard augmentation protocols generally aim at useful feature decoupling, not every augmentation protocol leads to the desired feature decoupling. That is, certain augmentations may not necessarily lead to decoupling sparse (useful) and dense (less useful) features [35, 32]. Existing set of augmentation protocols generates views that are easy to associate for human visual perception. Intuitively, underlying useful features are not often perturbed to the point where positive views are visually dissociated from each other.

Representation collapse: Let’s say for a given sample $x$ , the random augmentation function $\tau$ generates two views $x_{1}$ , and $x_{2}$ , and the goal is to train a network $\Phi(.)$ so that the directions of $\Phi(x_{1})$ and $\Phi(x_{2})$ align [36]. Technically, we want the optimizer to find a robust representation for the augmentation effect. However, the optimizer might come up with practically meaningless representations that theoretically fit the optimization objective. Theoretically, one can analyse such cases in terms of variance and covariance of the representation. The most common type of such solutions is when $\Phi(.)$ leads to a constant vector, hence the variance of the representation is zero [17, 36]. This is called complete or total collapse, where the representation collapses to a single dot in the space. Another less recognized case is when the coordinates of the representation, $\Phi_{i}(.)$ , are scaled versions of each other, meaning that all of them are aligned. This is the case where the features are highly correlated, and the covariance matrix is far from standard; hence the representation collapses to a single line, also called dimensional collapse [17, 36]. The complete collapse has been addressed by a variety of techniques that typically add variance to the representation, while dimensional collapse is often addressed by standardizing the covariance matrix, practically via a decorrelation process. The latter involves whitening the latent/embedding space, allowing for decorrelation of the feature dimensions.

2. Whitening and Coloring in SSL and ML: While whitening and coloring are well-explored concepts in signal processing dating back to decades ago (see [27]), in modern literature of machine learning, they are reinvented with respect to compatibility with the gradient-based learning. From a reductionist perspective, whitening is a technique for decorrelating the covariance matrix, while coloring is to assign/induce desired statistical correlation within the covariance matrix, mainly for Gaussian processes. Specifically, within computer vision problem domains, different techniques for whitening and coloring transform have been used for various purposes, for instance, as a generalized form of batch normalization [30], for speeding up the training [30], enhancing domain generalization and adaptation [28, 8], preserving desired information and statistical characteristics [38], etc. Very recently, whitening has been used for feature decorrelation within SSL frameworks [11, 39, 17]. The prime idea behind this is that whitening standardizes the covariance matrix, thus resulting in redundancy reduction in the representation, which also helps the network to avoid dimensional collapse.

Except for some recent methods [11, 39] which perform whitening on the latent/embedding space, whitening and coloring approaches in the literature did not necessarily follow this general idea of whitening or coloring on the covariance matrix of the embedding space. Rather, they are usually customized for the application at hand. For instance, as a generalization of batch normalization, Siarohin et al [30] proposed a whitening and coloring transform performed on the spatial dimensions of a batch of $m$ images, respectively. This resulted in standardizing the covariance matrix of the given dimension of the batch (whitening), as well as projecting the covariance matrix to that of an arbitrary multivariate Gaussian (coloring). However, our coloring technique distinguishes itself from prior techniques in two important ways: (1) it performs direct coloring (as discussed later); and (2) the coloring is done in the latent space.

3 Coloring for enhanced feature decoupling

3.1 Description of method

Refer to caption — Figure 1: Left:Schematic diagram of the proposed framework. For a given sample, two augmented views are generated and fed to the symmetric networks. The two pairs of projectors are used to perform cascade coloring and whitening, respectively. Right: Desired cross-correlation for direct coloring; $E$ is a squared matrix with the same size as the latent space of each of the VAEs.

Fig. 1 shows a schematic diagram of our proposed framework. Our method is conceptually simple and is categorized as a non-contrastive approach, in the sense that it does not require negative views. Similar to some prior work such as [6, 39, 11], this framework is also constructed from a pair of symmetric networks, where the base design consists of encoders followed by projectors. The basic procedure starts with a random data augmentation function $\tau$ generating two augmented views for each sample image from a batch of samples $X$ , namely $X_{1}$ and $X_{2}$ , to be fed to the pair of symmetric networks. However, our framework distinguishes itself from others by further architectural modification as well as its loss function. The specific core idea is based on enhancing feature decoupling by inducing meaningful correlation (coloring) between features followed by eliminating unnecessary redundancy. We will empirically assess how this oriented coloring also diversifies the useful features. We view feature decoupling as the main role of data augmentation, and thus facilitates learning useful sparse features (as opposed to less useful dense features) [35, 36]. However, we must emphasize that only proper augmentation leads to desired feature decoupling, otherwise the learning would not capture plausible general features. Hence, we carefully engineer a coloring transform that allows for effective feature decoupling relying on proper augmentation. This coloring is shortly followed by a decorrelation (aka whitening) process allowing for dimensional collapse avoidance. Assuming the proper coloring (relying on proper data augmentation), the loss function is given as follows:

\scriptsize\mathcal{L}=\mathcal{L}_{W}+\lambda\mathcal{L}_{C}

(1)

where $\mathcal{L_{C}}$ and $\mathcal{L_{W}}$ are coloring loss and whitening loss, respectively. Here we have:

\scriptsize\begin{split}\mathcal{L}_{C}=\sum_{i}\sum_{j}(C_{ij}-E_{ij})^{2}\>\>\>\text{and }\>\>\>\mathcal{L}_{W}=\sum_{i}(1-W_{ii})^{2}+\alpha\sum_{i}\sum_{j\neq i}(W_{ij})^{2}\end{split}

(2)

where, $\lambda$ and $\alpha$ are weighting factors. Also $C_{ij}$ and $W_{ij}$ are elements of cross-correlation matrices computed for the coloring and whitening processes, respectively (see Fig. 1), while $E_{ij}$ , is the target cross-correlation matrix used for the desired coloring. In other words, matrix $C$ is the cross-correlation matrix computed between the two output vectors of the coloring projector heads, whereas matrix $W$ is the cross-correlation matrix computed between the outputs of the final projector heads (whitening projector heads). Matrix $E$ is the desired colored cross-correlation matrix, computed as will be explained in the next section. We have:

\scriptsize C_{ij}\triangleq\frac{\sum_{m}z_{m,i}^{(1)}z_{m,j}^{(2)}}{\sqrt{\sum_{m}(z_{m,i}^{(1)})^{2}}\sqrt{\sum_{m}(z_{m,j}^{(2)})^{2}}}

(3)

where $z^{(1)}$ and $z^{(2)}$ are the normalized outputs of the coloring heads respectively for views $x_{1}$ and $x_{2}$ , $m$ is the batch size. Similarly, $W{{}_{ij}}$ values are computed from the normalized outputs of the whitening heads.

A certain layer (close to the output layer) of each network is connected to a projector head intended for coloring the features at this level. The output layer is also connected to a projector head for the decorrelation process, whitening the embedding space. The coloring transform is performed on the cross-correlation matrix of the outputs of the projectors. The goal is to guide the network to target the desired features, by inducing controlled useful correlation in the cross-correlation matrix. The redundancy would be reduced in the next stage (the whitening process), by setting the cross-correlation matrix to an identity matrix. In the case of whitening, the diagonal elements are set to 1, encouraging similarity in representation, while the off-diagonal elements are to be close to zero, reducing redundancy in the feature representation [39].

3.2 Direct coloring and desired colored cross-correlation

Consider the standard practice of coloring transform under the Gaussian model for a multivariate Gaussian signal. Here the premise is that the signal (vector of random variables) at hand has a non-identity covariance matrix and the goal is to transform the signal to a new signal with the desired covariance matrix. To this end, first we need to perform the whitening transform, which transforms the multivariate Gaussian signal to a signal with an identity covariance matrix, similar to the ideal white noise. Whitening transform here is essentially a decorrelation by change of basis, and then scaling the principal axes to unit length. After that, the transformed signal will undergo a coloring process in which the data is scaled in desired directions toward desired variances and then rotated. This results in projecting the covariance matrix of the signal to that of an arbitrary multivariate Gaussian signal, hence the covariance matrix of the colored signal will be the desired covariance matrix [16].

However, in SSL literature in general, the purpose of whitening is mainly decorrelation, as was done in [11, 39]. We also follow this idea, aiming at instantiating the colored signal as the one with the desired cross-correlation. To this end, we deviate from the general procedure of coloring in which coloring is preceded by whitening, and introduce direct coloring relying on some Bayesian prior (as discussed in the next section) which is compatible with gradient based learning, while conceptually easier to perform. We perform direct coloring by projecting the (estimated) cross-correlation matrix to the desired cross-correlation via a gradient-based learning process minimizing the following functional:

\begin{split}\scriptsize&\mathcal{L}_{C}=\sum_{i}\sum_{j}(C_{ij}-E_{ij})^{2}\\ \end{split}

(4)

where $E_{ij}$ are the elements of the desired cross-correlation matrix, whereas $C_{ij}$ are elements of the cross-correlation matrix computed from two embedding vectors.

We can identify two advantages of direct coloring:

1.

The process is computationally faster as it does not require us to initially project the elements of $C$ to the identity matrix before setting them to desired values.
2.

It does not assume a multivariate Gaussian, and in fact, the nature of the desired cross-correlation depends heavily on the effect of the argumentation.

Proper or standard augmentation is an integral part of any SSL framework as it decouples features properly, allowing for learing useful features. We build upon this to further enhance the feature decoupling. To this end, the desired colored cross-correlation matrix is computed from augmented data by further decorrelating the dense features while keeping the correlation between sparse features. Specifically, a pair of variational autoencoders (VAE) are trained separately on pairs of augmented views $x_{1}$ and $x_{2}$ for sample image $x$ under standard augmentation protocols. Then $E$ , the cross-correlation between the normalized latent vectors from the VAEs are computed as the desired cross-correlation. See Fig. 1 (Right). The specific settings and scenarios are further discussed under experiments and ablation study. Elements of $E$ are computed along the whole dataset samples ( $n$ ) as follows:

\scriptsize E_{ij}\triangleq\frac{\sum_{n}z_{n,i}^{(1)}z_{n,j}^{(2)}}{\sqrt{\sum_{n}(z_{n,i}^{(1)})^{2}}\sqrt{\sum_{n}(z_{n,j}^{(2)})^{2}}}

(5)

where $z^{(1)}$ and $z^{(2)}$ are the latent vectors of the top and bottom VAEs in Fig. 1 (Right).

3.3 Maximum A Posteriori (MAP) analysis

Here we want to demonstrate that the proposed loss function based on a Bayesian prior (desired target cross-correlation) is a solution to the Maximum A Posteriori (MAP) estimation, specifically considering the prior and likelihood components in this work. The MAP estimation aims to find the most probable model parameters given both the data and some prior knowledge i.e., it combines a prior probability ${p(\Theta)}$ and a likelihood ${p(X|\Theta)}$ to find the best model parameters ${\Theta}$ :

\scriptsize\Theta_{MAP}=argmax_{\Theta}[p(\Theta)p(X|\Theta)]

(6)

here we have a prior in the form of the target colored cross-correlation matrix $\mathbf{E}$ from VAEs, which specifies the desired correlation structure between the augmented views. Hence the prior is Gaussian distribution with mean ${E}$ and a certain variance $\sigma^{2}$ as follows:

\scriptsize p(\Theta)=\mathcal{N}(\Theta|E,\sigma^{2})

(7)

Moreover, the likelihood is composed of two terms: the coloring likelihood $p_{color}(X|\Theta)$ and the whitening likelihood $p_{whiten}(X|\Theta)$ , which capture the respective effects of the coloring and whitening processes on the learned representations. These likelihoods are defined as follows:

\scriptsize p_{color}(X|\Theta)=\prod_{i}\prod_{j}\mathcal{N}(C_{ij}|E_{ij},\sigma^{2})

(8)

\scriptsize p_{whiten}(X|\Theta)=\prod_{i}\mathcal{N}(W_{ii}|1,\sigma^{2}).\prod_{i\neq j}\mathcal{N}(W_{ij}|0,\sigma^{2}).

(9)

Thus, we have:

\scriptsize\begin{split}&\Theta_{MAP}=argmax_{\Theta}[\mathcal{N}(\Theta|E,\sigma^{2}).\prod_{i}\prod_{j}\mathcal{N}(C_{ij}|E_{ij},\sigma^{2}).\\ &\prod_{i}\mathcal{N}(W_{ii}|1,\sigma^{2}).\prod_{i!=j}\mathcal{N}(W_{ij}|0,\sigma^{2})]\\ \end{split}

(10)

Since maximizing the product of probabilities is equivalent to minimizing the negative logarithm of the product, the MAP objective can be expressed in terms of our loss function:

\scriptsize\begin{split}&\Theta_{MAP}=argmin_{\Theta}[-\log\mathcal{N}(\Theta|E,\sigma^{2})-\log\prod_{i}\prod_{j}\mathcal{N}(C_{ij}|E_{ij},\sigma^{2})\\ &-\log\prod_{i}\mathcal{N}(W_{ii}|1,\sigma^{2}).\prod_{i\neq j}\mathcal{N}(W_{ij}|0,\sigma^{2})]\\ \end{split}

(11)

Comparing this expression to our loss function $\mathcal{L}$ , we can see that it aligns with the terms in Equation 12 and 13,i.e., joint optimization of whitening and Bayesian coloring terms in our loss function reflects the MAP estimation with a Gaussian prior (see supplementary for more analysis).

4 Experiments and results

In this section we present the experimental settings as well as empirical results in order to assess the effectiveness and generality of our proposed approach. We use datasets at different scales, and show results for applying our approach on different downstream tasks.

4.1 Datasets and Baselines

The main part of the experiments is performed on ImageNet dataset [10], under linear evaluation on ImageNet as well as transfer learning on smaller datasets for classification task. However, we also assess the approach on detection and segmentation tasks with different datasets. The experimental results are mainly obtained by building on Solo-Learn [9], a recently developed open access library of visual SSL approaches. Solo-Learn [9] provides an implementation of the existing baselines that we compared against in this section.

Datasets: In this study we make use of ImageNet [10], CIFAR10/100 [20], Tiny ImageNet [21], as well as VOC0712 [12] and COCO [22]. In our ablation study we use a separate dataset, ImageNet-100.

Baselines: For comparison, we contrast our framework to different classes of the recent baselines, including contrastive, non-contrastive, clustering-based, and whitening (aka redundancy reduction) baselines, as well as baselines primarily based on vision transformers. These baselines include SimCLR [6], BYOL [13], SimSiam [7], SwAV [4], Barlow-Twins (BT) [39], Whitening-MSE (W-MSE) with $d=4$ [11], and DINO [5].

4.2 Experimental setting

4.2.1 Architecture:

ResNet18 and ResNet50 [15] are used for encoder architecture, except in both cases the last layer is replaced with a three layer projector (we call it whitening projector) as described in [39]. The last layer of the projector is the output with size 2048. A set of identical projector heads, architecturally similar to the whitening projector are used for the direct coloring. Specifically the layer 16 and 46 of ResNet18 and ResNet50 pass through average pooling layer and then fed to the coloring heads (as shown in Fig. 1), encouraging the feature decoupling in training process for the former layers. The architecture of the VAE is based on ResNet18 or ResNet50, and that determines the number of the layer that is connected to the coloring head. For instance, in the case of ResNet18, the corresponding VAE to generate desired colored matrix, is made of an encoder consisting of the first 16 layers of ResNet18, a latent space of the same size as the coloring head output, and a decoder with the same size as the encoder. More detail is available in the supplementary.

4.2.2 Augmentation protocol:

For a given sample $x$ , two augmented views $x_{1}$ and $x_{2}$ are generated using augmentation protocols. Regarding the very recent literature, there are two sets of augmentation protocols, a set of standard augmentation protocols [6, 11, 39], as well as a set of heavy augmentation protocols [2]. Most of the baselines use standard augmentation protocols. We also use standard protocols for the main experiments. In both cases, a set of augmentation techniques are performed via a random process $\tau$ . Specifically, for standard augmentation protocols on all datasets we follow the specification in [6] which include random mirroring, random crop, random color jittering and gray-scaling, and random aspect ratio re-arrangement.

4.2.3 Implementation details:

Optimization of all experiments including pre-training and evaluation under linear setting and transfer learning has been performed using Adam optimizer [19]. Transfer learning using ImageNet pre-training of ResNet50 on CIFAR10/100 is based on standard settings available in [6]. The size of output of projection heads (both whitening and coloring heads) for ImageNet dataset is 2048, the same as the size of the latent space of VAEs, whereas for CIFAR10/100 and Tiny ImageNet we follow the details in [11]. Augmentation protocols is standard unless otherwise specified. The pre-training is performed for 1000 epochs consistently along all experiments. The value of $\lambda$ in the loss function (12) is static, and set to $0.05$ , however, in the supplementary material, there is a range of experiments with dynamic values for $\lambda$ , e.g., decreasing over the number of epochs. Learning rate and other hyperparameters for ImageNet dataset are same as in [39]. Learning rate for CIFAR10 and CIFAR100 is set to $3\times 10^{-3}$ ; whereas for Tiny ImageNet and VOC0712, the learning rate is set to $2\times 10^{-3}$ . The weight decay is set to $5\times 10^{-6}$ . All other baselines we follow the latest settings presented by [9].

Direct coloring: Direct coloring is performed using a desired pre-calculated cross-correlation matrix as the target of coloring the embedding of the coloring heads. The coloring heads are made of three linear layers, first two of which are each followed by batch normalization and ReLU, whereas the output layer is fully connected layer of size 2048. The coloring loss measures the difference between the cross-correlation computed from the outputs of the coloring heads, and the target cross-correlation. In case of ResNet18 as the base encoder, layer number 16 is connected to the coloring head, whereas in case of ResNet50 layer number 46 is connected to its corresponding coloring head. Note that a range of experiments regarding the optimum layer are presented later under ablation study. The target cross-correlation for coloring is computed between the latent spaces of two VAEs with the encoder/decoder same size the layer connected to the coloring head, i.e., Layer 16 and 46 for ResNet18 and ResNet50, respectively. The target cross-correlation for desired coloring is investigated further in ablation study.

Whitening: Whitening is performed on the output of the whitening heads, as a standard practice of decorrelation of highly correlated features. The whitening head is architecturally similar to the coloring head, whereas it replaces the last layer of the ResNet architecture. The hyperparameter $\alpha$ is set to $0.01$ , trading off between the diagonal and off-diagonal terms in the whitening loss function. The elements of $W$ , similar to $C$ , fall between -1 and 1, representing a spectrum of correlation (positive), no-correlation (zero) and anti-correlation (negative).

4.3 Evaluation settings

Standard evaluation is performed under linear and transfer learning settings following the details of former baselines [11, 39, 13]. For linear evaluation, the common procedure is to remove the projector heads and train a linear classifier placed on top of one of the fixed encoders under the supervised setting over the source data (used for pre-training). For transfer learning purposes (classification, detection, and segmentation), the same standard procedure is performed except that the supervised training and evaluation is performed on the target data. Following [39], we also perform linear and transfer learning evaluation, except here we have four heads to remove, both whitening and coloring projector heads. In linear evaluation a whitening head is replaced with the linear classifier (a fully connected layer followed by a softmax) for the evaluation process. The learning rate for linear and transfer learning evaluation consistently starts with $10^{-3}$ and exponentially decays to $10^{-6}$ for some 500 epochs.

Linear evaluation: Linear evaluation on classification task is performed on ImageNet, Tiny ImageNet, CIFAR10, and CIFAR100. The evaluation consists of 500 epochs of supervised training (on labeled data) and then testing. Note that the encoder is fixed and only linear classifier undergoes training.

Transfer learning: Transfer learning consisted of pre-training on ImageNet dataset and evaluation on other datasets, as presented in the results. Accordingly, on classification task, transfer learning was performed with CIFAR10 and CIFAR100, whereas on detection and segmentation task it is performed on VOC0712 and COCO respectively. Similar to the case of linear evaluation, the supervised training is performed for 500 epochs before testing the performance.

4.4 Results and comparison

In this section the results for different datasets, tasks, and learning paradigms are presented. Following is respectively the classification results with ImageNet, CIFAR1/100, and Tiny ImageNet, as well as object detection results with VOC0712 and segmentation results with COCO.

4.4.1 Linear evaluation with ImageNet

The evaluation on classification task, has been performed on ImageNet. The results are presented in Table 1, evaluating multiple baselines under 100, 400, and 1000 epochs of pre-training before supervised linear evaluation. The results show that coloring speeds up the convergence of training. At 1000 epochs, our approach offers $0.8\%$ improvement over the former best result. We note that, given the difficulty of this challenge, this magnitude of improvement has been difficult to achieve on this problem. This is evident from the relative difference in performance for the prior baselines, as shown in the table.

Framework	ImageNet			CIFAR10	CIFAR100	Tiny-ImgNet
	100	400	1000
SimCLR	66.6	70	71.06	91.03	66.48	49.11
BYOL	68.4	73.1	74.6	92.39	70.81	51.05
SwAV	66.5	70.8	72.6	90.06	65.09	48.89
SimSiam	67.9	70.8	71.6	90.81	66.19	49.85
W-MSE4	69.5	72.6	73.7	89.27	62.20	49.51
B-Twins	67.4	71.4	73.6	92.45	70.51	50.11
DINO (RN50)	68.1	72.4	75.3	90.46	67.01	48.54
Ours	69.6	73.2	76.1	93.47	72.52	50.27

Table 1: Top-1 linear classification accuracy for ImageNet using ResNet50 pre-trained on ImageNet under 100, 400, and 1000 epochs. SwAV reproduction is without multi-crop technique. Our method converges faster in terms of number of epochs, while also providing higher accuracy. Top-1 linear classification accuracy for CIFAR10, CIFAR100, and Tiny ImageNet, all using ResNet18.

Framework	Transfer Learning (ImageNet pre-training)
	CIFAR10	CIFAR100
SimCLR	97.52	85.41
BYOL	97.91	86.21
SwAV	97.53	84.19
SimSiam	97.11	85.27
W-MSE4	97.02	86.31
B-Twins	97.29	85.03
Ours	98.19	86.80

Table 2: Top-1 transfer learning classification accuracy, pre-trained on ImageNet, and fine-tuned on CIFAR10, and CIAFAR100.

4.4.2 CIFAR10 and CIFAR100, and Tiny ImageNet

Linear and transfer learning (pre-trained on ImageNet) evaluation with three datasets, CIFAR10, CIFAR100 and Tiny ImageNet on classification task are presented in Table 2 and Table 3 respectively. In case of linear evaluation, our approach slightly outperformed state-of-the-art on CIFAR10 and CIFAR100, offering respectively $1.02\%$ and $1.71\%$ improvements over the former best result, whereas it remains competitive (second best) with the state-of-the-art on Tiny ImageNet.

Transfer learning has also been performed with CIFAR10 and CIFAR100, as presented in Table 3. In both cases, the pre-training was performed on ImageNet while fine-tuning with the CIFAR10/100. Our approach slightly outperformed the state-of-the-art, offering $0.28\%$ and $0.49\%$ improvement, respectively, on CIFAR10 and CIFAR100.

4.4.3 Transfer learning with VOC0712 (Detection Task) and COCO (Segmentation Task)

Transfer learning with VOC0712 is performed on object detection. We follow setting in MoCo [14], finetuning the encoder. The results in Table 4 show that on AP₅₀ setting our method slightly outperformed former baselines, whereas on AP₇₅ and AP₁₀₀ is either on par or very competitive with the former baselines. Transfer learning results with COCO on segmentation task, in Table 4, indicate the competitiveness of our direct coloring technique.

Framework	VOC0712			COCO
	AP₁₀₀	AP₇₅	AP₅₀	AP₁₀₀	AP₅₀
SwAV	56.1	62.7	82.6	33.8	55.2
SimSiam	57	63.7	82.4	34.4	56.0
B-Twins	56.8	63.4	82.6	34.3	56.0
Ours	56.6	63.7	82.9	34.4	56.9

Table 3: Transfer learning on object detection task with VOC0712 (using Faster R-CNN), and segmentation task with COCO (using Mask R-CNN). Results for other baselines are taken from [39]. Direct coloring performs either on par or better than the state-of-the-art in both tasks.

5 Ablation Study

We performed ablation study to assess contribution offered by different aspects of this framework. Multiple scenarios including scenarios on location of coloring heads, alternative approaches to computing desired coloring matrix, and alternative/simpler architectures were considered. Other scenarios including direct coloring for other methods, and range of $\lambda$ are presented in supplementary materials. The experiments are performed with a separate dataset, ImageNet-100 on classification task, using ResNet18 evaluated after 500 epochs of pre-training. Note that regarding other baselines with direct coloring technique, empirical evidence is presented in supplementary.

5.1 Presence and location of coloring head

We assess the effect of the presence, and location of the coloring heads. Given the standard setting presented for the framework, the coloring heads are connected to layer 16 of ResNet18, with hidden layer dimension of 2048. Under this setting, the baseline performance on classification task is a top-1 accuracy of $80.93\%$ . Removing the coloring heads and corresponding optimization terms from the loss function, the framework is reduced to BT, and the accuracy descends to $79.69\%$ . However, with the current output size of 2048, there is a $1.24\%$ improvement with the coloring heads. Next we assess the performance with coloring head connected to different layers.

Coloring at layer 10: When the coloring heads are connected to layer number 10, the top-1 accuracy descends to $79.97\%$ (from $80.93\%$ ). We suspect that this drop is mainly because layers 11-16 are no longer under coloring process. Here, the desired coloring cross-correlation is computed from a VAE with both encoder and decoder of size 10, same as the location of coloring heads.

Coloring at layer 17: When the coloring heads are connected to the same layer as the whitening heads, layer number 17, the performance drops to $78.81\%$ . Roughly speaking, coloring could be seen as the opposite of whitening. Accordingly we hypothesize that what actually happens in case of parallel whitening and coloring heads, is that the result follows the superposition principle, meaning that coloring is cancelled out as the weight factor $\lambda$ in Equation 1, gives less weight to coloring.

5.2 Projector Dimension

The experiments with a range of different projector dimensionality indicate the sensitivity of the method to this factor, as the top-1 accuracy with output size 512, 1024, 2048, and 4096 are respectively $74.31\%$ , $77.15\%$ , $80.93\%$ , and $81.66\%$ . This is a similar behavior to that of Barlow-Twins.

5.3 Desired coloring matrix

We briefly assess the case in which coloring matrix is computed from a pair of autoencoder, instead of VAEs. The performance degrades to $80.21\%$ . The slightly better performance of VAEs might be due to the fact that the latent space of VAE is regularized, which allows for more robust cross-correlation in terms of variance of the elements.

5.4 Less computation with auto-correlation

We consider an architecturally simpler instantiation of the direct coloring in which, the cross-correlation is replaced with auto-correlation, as shown in Fig. 3. The idea turned out to be effective. As shown in Fig. 3, the the framework consists of only one network, in which the encoder undergoes the coloring before passing its output to whitening heads. Specifically, layer 16 of the encoder is connected to a coloring head where the auto-correlation of its output is becoming colored. Similarly in case of whitening, the auto-correlation of the output is becoming whitened. The desired coloring matrix is also an auto-correlation matrix computed from the latent space of one VAE. Amazingly, while the computational complexity reduces by half, the performance only slightly drops to $80.64\%$ . Corresponding loss function of coloring and whitening process as well as the total loss is presented in the supplementary material.

5.5 On the avoidance of complete collapse

Let’s consider direct coloring a rather general technique for any SSL framework. Here, in terms of loss function, we have two terms, first term being the original loss of the framework to be ungraded denoted as $f$ , plus a second term corresponding to the direct coloring, hence we have $\mathcal{L}=f+\lambda g$ . In essence, direct coloring in this setting would be seen as a constrained optimization problem in terms of Lagrange multipliers. The first term alone, $f$ , would be solved by a trivial solution, complete collapse. However, this is subjected to the second term, $g$ , as the coloring constraint. We observe that direct coloring would substantially reduce the chance of complete collapse, because to find the solution to this constrained optimization problem, the optimizer looks for points where the gradient vector of $f$ and $g$ are parallel to each other. Since complete collapse is only a solution to $f$ alone and certainly not a solution to $g$ , if one chooses a proper $\lambda$ , one can substantially avoid the complete collapse. Further theoretical and empirical evidence on this as well as empirical evidence on other baselines with direct coloring technique is presented in supplementary .

6 Conclusion and Future Direction

In this paper, we analyzed the general setting of SSL, the role of the augmentation process, and trivial solutions to the SSL problem. We extended the study to revisit the core principle underlying state-of-the-art approaches. Building upon this, we presented a new framework for SSL, based on direct coloring, which improved the performance, sped-up learning convergence, and reduced the chance of complete collapse. The key foundation is the idea of direct coloring, however, we also considered direct coloring as a general technique that can be used with existing baselines. Empirical assessment on multiple datasets and three downstream tasks show the effectiveness of the proposed framework. We leave the generalization of direct coloring technique in terms of multi-stage coloring as a future direction.

References

[1] Mehdi Azabou, Mohammad Gheshlaghi Azar, Ran Liu, Chi-Heng Lin, Erik C Johnson, Kiran Bhaskaran-Nair, Max Dabagia, Bernardo Avila-Pires, Lindsey Kitchell, Keith B Hengen, et al. Mine your own view: Self-supervised learning through across-sample prediction. arXiv preprint arXiv:2102.10106, 2021.
[2] Yalong Bai, Yifan Yang, Wei Zhang, and Tao Mei. Directional self-supervised learning for heavy image augmentations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16692–16701, 2022.
[3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pages 132–149, 2018.
[4] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[7] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
[8] Tai-Yin Chiu. Understanding generalized whitening and coloring transform for universal style transfer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4452–4460, 2019.
[9] Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo-learn: A library of self-supervised methods for visual representation learning. J. Mach. Learn. Res., 23:56–1, 2022.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[11] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In International Conference on Machine Learning, pages 3015–3024. PMLR, 2021.
[12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–308, 2009.
[13] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
[14] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[16] Maliha Hossain. Whitening and coloring transformations for multivariate gaussian data. A slecture partly based on the ECE662 Spring, 2014.
[17] Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9598–9608, 2021.
[18] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 43(11):4037–4058, 2020.
[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[21] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7(7):3, 2015.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[23] Salman Mohamadi. Active uncertainty representation learning: Toward more label efficiency in deep learning. 2024.
[24] Salman Mohamadi, Gianfranco Doretto, and Don Adjeroh. Deep active ensemble sampling for image classification. In Proceedings of the Asian Conference on Computer Vision, pages 4531–4547, 2022.
[25] Salman Mohamadi, Gianfranco Doretto, and Donald A Adjeroh. Fussl: Fuzzy uncertain self supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2799–2808, 2023.
[26] Salman Mohamadi, Gianfranco Doretto, and Donald A Adjeroh. More synergy, less redundancy: Exploiting joint mutual information for self-supervised learning. In 2023 IEEE International Conference on Image Processing (ICIP), pages 1390–1394. IEEE, 2023.
[27] Arye Nehorai and Martin Morf. Enhancement of sinusoids in colored noise and the whitening performance of exact least squares predictors. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30(3):353–363, 1982.
[28] Subhankar Roy, Aliaksandr Siarohin, and Nicu Sebe. Unsupervised domain adaptation using full-feature whitening and colouring. In Image Analysis and Processing–ICIAP 2019: 20th International Conference, Trento, Italy, September 9–13, 2019, Proceedings, Part II 20, pages 225–236. Springer, 2019.
[29] Aniruddha Saha, Ajinkya Tejankar, Soroush Abbasi Koohpayegani, and Hamed Pirsiavash. Backdoor attacks on self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13337–13346, 2022.
[30] Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening and coloring batch transform for gans. arXiv preprint arXiv:1806.00420, 2018.
[31] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European conference on computer vision, pages 776–794. Springer, 2020.
[32] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems, 33:6827–6839, 2020.
[33] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
[34] Feng Wang, Huaping Liu, Di Guo, and Sun Fuchun. Unsupervised representation learning by invariance propagation. Advances in Neural Information Processing Systems, 33:3510–3520, 2020.
[35] Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, pages 11112–11122. PMLR, 2021.
[36] Zixin Wen and Yuanzhi Li. The mechanism of prediction head in non-contrastive self-supervised learning. arXiv preprint arXiv:2205.06226, 2022.
[37] Haohang Xu, Xiaopeng Zhang, Hao Li, Lingxi Xie, Wenrui Dai, Hongkai Xiong, and Qi Tian. Seed the views: Hierarchical semantic alignment for contrastive representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[38] Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, and Jung-Woo Ha. Photorealistic style transfer via wavelet transforms. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9036–9045, 2019.
[39] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.

Appendix A $\lambda$

A.1 Sensitivity to $\lambda$

A range of experiments on ImageNet-100 with different values for $\lambda$ is presented here. Note that the value for $\alpha$ is set to $10^{-2}$ for all experiments and the pre-training was performed only for 250 epochs. The top-1 accuracy for different values of $\lambda$ is depicted in Fig. 1. As it is presented, for values less than $0.03$ down to $0.005$ , the accuracy drops somewhat steadily. However, for values greater than $0.05$ the accuracy degradation is sharper. In fact with large values of $\lambda$ , dimensional collapse is probable.

A.2 Dynamic values for $\lambda$

We also assess the case in which the value of $\lambda$ is not static, i.e., changing over time. The idea is to see the effect of stronger direct coloring in the early stages of the training, while it becomes less strong (relatively smaller $\lambda$ ) as the training progresses. Under this setting, we start with $\lambda=0.08$ and decrease it to $\lambda=0.04$ , scheduled as $[0.08,0.07,0.06,0.05,0.04]$ , for epochs 1-50, 51-100, …, 201-250. the top-1 accuracy is $76.70\%$ , about $0.32\%$ less than the case with static $\lambda=0.05$ .

Appendix B Simpler Design with Auto-correlation

As discussed in the paper, we evaluated the direct coloring with a relatively simpler architectural design, substituting the cross-correlation with auto-correlation. Corresponding loss function for whitening, coloring and the total loss function are presented in this section. The total loss, coloring loss and whitening loss are as follows:

\mathcal{L}=\mathcal{L}_{W^{\prime}}+\lambda\mathcal{L}_{C^{\prime}}

(12)

where $\mathcal{L}_{C^{\prime}}$ and $\mathcal{L}_{W^{\prime}}$ are coloring loss and whitening loss, respectively. Here we have:

\begin{split}&\mathcal{L}_{C^{\prime}}=\sum_{i}\sum_{j}(C^{\prime}_{ij}-E^{\prime}_{ij})^{2}\\ &\mathcal{L}_{W^{\prime}}=\sum_{i}(1-W^{\prime}_{ii})^{2}+\alpha\sum_{i}\sum_{j\neq i}(W^{\prime}_{ij})^{2}\end{split}

(13)

where, $\lambda$ and $\alpha$ are weighting factors. Also $C^{\prime}_{ij}$ and $W^{\prime}_{ij}$ are elements of auto-correlation matrices computed for the coloring and whitening processes, respectively (see Fig. 3 in the paper), while $E^{\prime}_{ij}$ , is the target auto-correlation matrix used for the desired coloring. We also have:

C^{\prime}_{ij}\triangleq\frac{\sum_{m}^{\prime}z_{m^{\prime},i}z_{m^{\prime},j}}{\sqrt{\sum_{m}^{\prime}(z_{m^{\prime},i})^{2}}\sqrt{\sum_{m}^{\prime}(z_{m^{\prime},j})^{2}}}

(14)

where $z$ is the normalized output of one coloring head for one view, $x_{1}$ , and $m^{\prime}$ is the batch size (note that similar to the original framework, here for each sample we fed two views to the network). Similarly, $W^{\prime}{{}_{ij}}$ values are computed from the normalized output of one whitening head, where the auto-correlation is computed from all views fed to the network. Finally the elements of the matrix $E^{\prime}$ is also computed from the latent space of one VAE similar to the equation for the elements of $C^{\prime}$ .

Appendix C More detail on Architecture

The architectural design as well as some other details are presented here in more detail. Two architectures, ResNet18 and ResNet50 are used for encoder architecture, with the last layer replaced with a three layer projector, whitening projector. The last layer of the projector is the output with sizes 2048 and 1024. In fact in case of ResNet50 (under both linear and transfer learning settings) the output size is 2048 for ImageNet whereas in case of ResNet18 with other datasets, such as CIFAR10/100 the output size is 1024. Specifically the layer 16 and 46 of ResNet18 and ResNet50 pass through average pooling layer and then fed to the coloring heads (as conceptually shown in Fig. 1 of the paper). The architecture of the VAE is based on ResNet18 or ResNet50, and that determines the number of the layer that is connected to the coloring head. The coloring and whitening heads are made of three linear layers, first two are each followed by batch normalization and ReLU, whereas the output layer is fully connected layer of size 2048 or 1024. Finally, note that the layer that is connected to each coloring head first go through a max pooling process, similar to the layer that is connected to whitening head.

Appendix D Direct Coloring for other baselines

In this section we evaluate the effectiveness of direct coloring for one other baseline. We chose SIMSIAM as it is a modified version of BYOL, as a breakthrough work. We assess the case under standard augmentation as presented in the paper. The baseline accuracy without coloring head, under 1000 epochs of pre-training on ImageNet-100 using ResNet18 is $77.17\%$ . Adding coloring heads, and corresponding term into the loss function, $\mathcal{L}=\mathcal{L}_{old}+\lambda\mathcal{L}_{Coloring}$ , with $\lambda=0.01$ , the top-1 accuracy upgrades to $78.40\%$ , offering some $1.23\%$ improvement. Higher value of $\lambda$ , $\lambda=0.05$ , however, sharply degrades the accuracy to $72.5\%$ . Hence, if used carefully, coloring can improve former baseline, SIMSIAM.

Appendix E Avoidance of complete collapse

From theoretical perspective, we argue that any method that guarantees the avoidance of zero-variance representation, somehow assures the avoidance of complete collapse. In this sense, even whitening process also could be considered as a helpful technique. Here we formulate the problem from the perspective of constraint optimization. With direct coloring term added to the loss function of a gievn SSL framework which is prone to complete collapse, we have two terms (set of terms). First term being the original loss of the framework plus a second term corresponding to the direct coloring, $\mathcal{L}=f+\lambda g$ . In essence, direct coloring in this setting would be seen as a constrained optimization problem. Thus, thinking in terms of Lagrange multipliers, one would see the first term alone, $f$ ,prone to a trivial solution, complete collapse. However, this is subjected to the second term, $g$ , as the coloring constraint. We observe that direct coloring would substantially reduce the chance of complete collapse, because to find the solution to this constrained optimization problem, the optimizer looks for points where the gradient vector of $f$ and $g$ are parallel to each other. Since complete collapse is only a solution to $f$ alone and certainly not a solution to $g$ (as $g$ encourages non-zero variance), if one chooses a proper $\lambda$ , one can substantially avoid the complete collapse.

We assess it experimentally as well. The idea is to measure the representation variance, both in presence and without the presence of the coloring head. To this end, we measure the variance of last layers of the whitening head, output vector, as the coloring effect would be detectable there. Using the same architecture as Fig. 1 of the paper, this is done both with and without the coloring head (with $\lambda>0$ and $\lambda=0$ ). After 500 epochs of pre-training on ImageNet-100 with and without direct coloring, the weights are fixed and the variance of the normalized output vector of the whitening head is computed. In case of pre-training with direct coloring, the variance is $0.97$ while in case of pre-training without direct coloring, the variance is $0.68$ , empirically confirming our theoretical analysis regarding the avoidance of complete collapse.

We performed the same experiments with the SIMSIAM, as the experimental setting is presented before. We measure the variance of the last layer of the projector head (a normalized vector), in presence and absence of the coloring head. The variance in presence of the coloring head is $0.89$ while in absence of the coloring heads the variance is $0.65$ . This shows that coloring head in general would decrease the chance of complete collapse as it inject more variance to the representation, even with other baselines.