
ContraNorm: A Contrastive Learning Perspective on Oversmoothing and Beyond

Xiaojun Guo1    Yifei Wang2∗   Tianqi Du2∗   Yisen Wang1,3
1National Key Lab of General Artificial Intelligence,
  School of Intelligence Science and Technology, Peking University
2School of Mathematical Sciences, Peking University
3Institute for Artificial Intelligence, Peking University
∗Equal Contribution. Corresponding Author: Yisen Wang ([email protected])
Abstract

Oversmoothing is a common phenomenon in a wide range of Graph Neural Networks (GNNs) and Transformers, where performance worsens as the number of layers increases. Instead of characterizing oversmoothing from the view of complete collapse, in which representations converge to a single point, we dive into a more general perspective of dimensional collapse, in which representations lie in a narrow cone. Accordingly, inspired by the effectiveness of contrastive learning in preventing dimensional collapse, we propose a novel normalization layer called ContraNorm. Intuitively, ContraNorm implicitly shatters representations in the embedding space, leading to a more uniform distribution and less severe dimensional collapse. On the theoretical side, we prove that ContraNorm can alleviate both complete collapse and dimensional collapse under certain conditions. Our proposed normalization layer can be easily integrated into GNNs and Transformers with negligible parameter overhead. Experiments on various real-world datasets demonstrate the effectiveness of our proposed ContraNorm. Our implementation is available at https://github.com/PKU-ML/ContraNorm.

1 Introduction

Recently, the rise of Graph Neural Networks (GNNs) has enabled important breakthroughs in various fields of graph learning (Ying et al., 2018; Senior et al., 2020). Along another avenue, despite getting rid of bespoke convolution operators, Transformers (Vaswani et al., 2017) also achieve phenomenal success in multiple natural language processing (NLP) tasks (Lan et al., 2020; Liu et al., 2019; Rajpurkar et al., 2018) and have been transferred successfully to the computer vision (CV) field (Dosovitskiy et al., 2021; Liu et al., 2021; Strudel et al., 2021). Despite their different model architectures, GNNs and Transformers are both hindered by the oversmoothing problem (Li et al., 2018; Tang et al., 2021), where deeply stacking layers gives rise to indistinguishable representations and significant performance deterioration.

To get rid of oversmoothing, we need to dive into the modules inside and understand how oversmoothing arises in the first place. However, we notice that existing oversmoothing analysis fails to fully characterize the behavior of learned features. A canonical metric for oversmoothing is the average similarity (Zhou et al., 2021; Gong et al., 2021; Wang et al., 2022). A similarity converging to 1 indicates that representations shrink to a single point (complete collapse). However, this metric cannot capture a more general collapse case, where representations span only a low-dimensional manifold in the embedding space and thereby also sacrifice expressive power, which is called dimensional collapse (left panel of Figure 1). In such cases, the similarity metric fails to quantify the collapse level. Therefore, we need to go beyond existing measures and take this so-called dimensional collapse into consideration. This dimensional collapse behavior is widely discussed in the contrastive learning literature (Jing et al., 2022; Hua et al., 2021; Chen & He, 2021; Grill et al., 2020), which may help us characterize the oversmoothing problem of GNNs and Transformers. The main idea of contrastive learning is maximizing agreement between different augmented views of the same data example (i.e., positive pairs) via a contrastive loss. A common contrastive loss can be decoupled into an alignment loss and a uniformity loss (Wang & Isola, 2020). The two ingredients correspond to different objectives: the alignment loss pulls positive pairs closer, while the uniformity loss measures how well the embeddings are uniformly distributed. Training with only the alignment loss may lead to a trivial solution where all representations shrink to one single point. Fortunately, the uniformity loss naturally helps solve this problem by drawing representations toward an even distribution in the embedding space.

Given the similarities between the oversmoothing problem and the representation collapse issue, we establish a connection between them. Instead of directly adding the uniformity loss to model training, we design a normalization layer that can be easily used out of the box with almost no parameter overhead. To achieve this, we first transfer the uniformity loss used for training to a loss defined over graph node representations, so that it optimizes the representations themselves rather than the model parameters. Intuitively, the loss meets the need of drawing a uniform node distribution. Following recent research on combining optimization schemes and model architectures (Yang et al., 2021; Zhu et al., 2021; Xie et al., 2021; Chen et al., 2022), we use the transferred uniformity loss as an energy function underlying our proposed normalization layer, such that descent steps along it correspond to forward passes. By analyzing the unfolded iterations of the principled uniformity loss, we design a new normalization layer, ContraNorm.

Figure 1: An illustration of how our proposed ContraNorm solves the dimensional collapse. Left: Features suffer from dimensional collapse. Right: With the help of ContraNorm, features become more uniform in the space, and the dimensional collapse is eased.

As a proof of concept, Figure 1 demonstrates that ContraNorm pushes the features away from each other, which eases the dimensional collapse. Theoretically, we prove that ContraNorm increases both the average variance and the effective rank of representations, thus addressing complete collapse and dimensional collapse effectively. We also conduct a comprehensive evaluation of ContraNorm on various tasks. Specifically, ContraNorm boosts the average performance of BERT (Devlin et al., 2018) from 82.59% to 83.54% on the validation set of the General Language Understanding Evaluation (GLUE) datasets (Wang et al., 2019), and raises the test accuracy of DeiT (Touvron et al., 2021) with 24 blocks from 77.69% to 78.67% on the ImageNet1K (Russakovsky et al., 2015) dataset. For GNNs, experiments are conducted on fully supervised graph node classification tasks, and our proposed model outperforms the vanilla Graph Convolution Network (GCN) (Kipf & Welling, 2017) across all depth settings. Our contributions are summarized as:

  • We dissect the limitations of existing oversmoothing analysis, and highlight the importance of incorporating the dimensional collapse issue into consideration.

  • Inspired by the techniques from contrastive learning to measure and resolve oversmoothing, we propose ContraNorm as an optimization-induced normalization layer to prevent dimensional collapse.

  • Experiments on a wide range of tasks show that ContraNorm can effectively mitigate dimensional collapse in various model variants, and demonstrate clear benefits across three different scenarios: ViT for image classification, BERT for natural language understanding, and GNNs for node classification.

2 Background & Related Work

Message Passing in Graph Neural Networks. In the literature of GNNs, message-passing graph neural networks (MP-GNNs) are intensively studied. They progressively update representations by exchanging information with neighbors. The update of node $i$'s representation in the $l$-th layer is formalized as $\bm{h}^{(l)}_{i}=\mbox{UPDATE}\big(\bm{h}^{(l-1)}_{i},\mbox{AGGREGATE}(\bm{h}^{(l-1)}_{i},\{\bm{h}^{(l-1)}_{j}\,|\,j\in\mathcal{N}(i)\})\big)$, where $\mathcal{N}(i)$ denotes the neighborhood set of node $i$, $\mbox{AGGREGATE}(\cdot)$ is the procedure where nodes exchange messages, and $\mbox{UPDATE}(\cdot)$ is often a multi-layer perceptron (MLP). A classical MP-GNN model is GCN (Kipf & Welling, 2017), which propagates messages among one-hop neighbors.
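To make the update rule concrete, below is a minimal PyTorch sketch of one GCN-style message-passing step; the function name and the dense-adjacency formulation are ours, for illustration only.

```python
import torch

def gcn_message_passing(H, A_norm, W):
    """One GCN-style message-passing layer on dense tensors (illustrative sketch).

    H:      (n, d_in) node representations h_i^{(l-1)}
    A_norm: (n, n) normalized adjacency matrix (encodes the neighborhoods N(i))
    W:      (d_in, d_out) weight of the UPDATE step (a one-layer MLP here)
    """
    aggregated = A_norm @ H            # AGGREGATE: mix each node with its one-hop neighbors
    return torch.relu(aggregated @ W)  # UPDATE: linear map followed by a nonlinearity
```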

Self-Attention in Transformers. Transformers encode information in a global scheme, with self-attention as the key ingredient (Vaswani et al., 2017). The self-attention module re-weights intermediate representations by aggregating semantically close neighbors. Formally, it estimates similarities between key and query, namely the self-attention matrix, as $\bar{\bm{A}}=\mbox{softmax}(\bm{Q}\bm{K}^{\top})$, $\bm{Q}=\bm{X}\bm{W}_{Q}$, and $\bm{K}=\bm{X}\bm{W}_{K}$, where $\bm{X}$, $\bm{W}_{Q}$, and $\bm{W}_{K}$ are the input and the weight matrices for query and key, respectively. A multi-head self-attention module with a residual connection can be formulated as $\mbox{attn}(\bm{X})=\bm{X}+\sum_{k=1}^{h}\bar{\bm{A}}_{k}\bm{X}\bm{V}_{k}\bm{W}_{k}^{\top}$, where $h$ is the number of heads, and $\bm{V}$, $\bm{W}$ are the weights for value and final output.

Connections between Message Passing and Self-Attention. Note that the self-attention matrix can be seen as a normalized adjacency matrix of a corresponding graph (Shi et al., 2022). Consider a weighted fully connected graph $G$ with adjacency matrix $\hat{\bm{A}}$; we map nodes to token representations and set the weight of the edge between node $i$ and node $j$ to $\exp(\bm{Q}_{i}^{\top}\bm{K}_{j})$. Then the $(i,j)$ entry of the normalized adjacency matrix is $\tilde{\bm{A}}_{ij}=\hat{\bm{A}}_{ij}/\hat{\bm{D}}_{ii}=\exp(\bm{Q}_{i}^{\top}\bm{K}_{j})/\sum_{k}\exp(\bm{Q}_{i}^{\top}\bm{K}_{k})$, where the diagonal matrix satisfies $\hat{\bm{D}}_{ii}=\sum_{j}\hat{\bm{A}}_{ij}$. Apparently, $\tilde{\bm{A}}$ is exactly the self-attention matrix defined in Transformers. Simultaneously, $\tilde{\bm{A}}$ plays a major part in the message-passing scheme by deciding which nodes to exchange information with.
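This equivalence can be checked numerically; the short snippet below (shapes, names, and the small weight scaling are ours) builds the row-normalized adjacency with edge weights exp(Q_i^T K_j) and compares it with the softmax attention matrix.

```python
import torch

n, d = 4, 8
X = torch.randn(n, d)
W_Q, W_K = 0.1 * torch.randn(d, d), 0.1 * torch.randn(d, d)  # small scale to keep exp() well-behaved
Q, K = X @ W_Q, X @ W_K

A_bar = torch.softmax(Q @ K.T, dim=-1)             # self-attention matrix
A_hat = torch.exp(Q @ K.T)                         # edge weights exp(Q_i^T K_j)
A_tilde = A_hat / A_hat.sum(dim=-1, keepdim=True)  # row normalization D^{-1} A_hat

print(torch.allclose(A_bar, A_tilde, atol=1e-6))   # True: the two matrices coincide
```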

Oversmoothing in GNNs and Transformers. The term oversmoothing was first proposed by Li et al. (2018) in the study of GNNs. Intuitively, representations converge to a constant after repeatedly exchanging messages with neighbors as the number of layers goes to infinity. Zhou et al. (2020) mathematically prove that, under some conditions, the convergence point carries only the information of the graph topology. Coincidentally, an oversmoothing-like phenomenon is also observed in Transformers. Unlike CNNs, Transformers cannot benefit from simply deepening layers, and even saturate as depth increases. Early works empirically ascribe this to attention/feature collapse or patch/token uniformity (Tang et al., 2021; Zhou et al., 2021; Gong et al., 2021; Yan et al., 2022). To be specific, the attention maps tend to be overly similar in later layers, whereby features insufficiently exchange information and lose diversity in the end. Outputs of pure transformers, i.e., attention without skip connections or MLPs, are also observed to converge to a rank-1 matrix (Dong et al., 2021). For illustration purposes, we also refer to this degeneration issue in Transformers as oversmoothing.

Whitening Methods for Dimensional Collapse. Whitening methods ensure that the covariance matrix of normalized outputs is diagonal, making dimensions mutually independent and implicitly solving dimensional collapse. Huang et al. (2018; 2019), Siarohin et al. (2018), and Ermolov et al. (2021) all adopt the idea of whitening, but differ in how the whitening matrix is computed and in their application domains. Compared with them, we avoid the complicated computation of the square root of the inverse covariance matrix and the delicate design of the backward pass for differentiability. Moreover, Huang et al. (2018; 2019) are proposed for convolution operations, Siarohin et al. (2018) targets GANs, and Ermolov et al. (2021) works in self-supervised learning. In contrast, we borrow the idea from contrastive learning and address the oversmoothing of neural networks.

3 Mitigating Oversmoothing from the Perspective of Contrastive Learning

In this section, we first empirically demonstrate that the commonly used similarity metric fails to characterize dimensional collapse, thus overlooking a crucial part of oversmoothing. To address this problem, we draw inspiration from contrastive learning, whose uniformity property naturally rules out dimensional collapse. Specifically, we transfer the uniformity loss to a loss that acts directly on representations. By unfolding the optimization steps along this loss, we induce a normalization layer called ContraNorm. Theoretically, we prove that our proposed layer helps mitigate dimensional collapse.

3.1 The Characterization of Oversmoothing

In this part, we begin by highlighting the limitations of existing metrics in characterizing oversmoothing. These limitations motivate us to adopt the effective rank metric, which has been demonstrated to be effective in capturing the degree of dimensional collapse in contrastive learning.

Taking the oversmoothing problem of Transformers as an example without loss of generality, a prevailing metric is the attention map similarity (Wang et al., 2022; Gong et al., 2021; Shi et al., 2022). The intuition is that as the attention map similarity converges to one, the feature similarity increases, which can result in a loss of representation expressiveness and decreased performance. However, by conducting experiments on Transformer-structured models like ViT and BERT, we find that high attention map similarity does not necessarily correspond to high feature similarity. As shown in Figure 2(a) and Figure 2(b), although the attention map similarity is close to 0.8 or even higher, the feature similarity remains below 0.5 in most cases. In other words, the oversmoothing problem still occurs even when the feature similarity is low. This finding suggests that similarity alone cannot fully depict the quality of representations and the oversmoothing problem.

Intuitively, we can consider a case where representations in the latent embedding space do not shrink to one single point but instead span a low-dimensional subspace. In such cases, the feature similarity may be relatively low, but the representations still lose expressive power. This representation degeneration problem is known as dimensional collapse, and it is widely discussed in the contrastive learning literature. There, a common practice for describing dimensional collapse is to inspect the singular value distribution, whose tail vanishes under collapse (Gao et al., 2019; Ethayarajh, 2019; Jing et al., 2022). To investigate whether dimensional collapse occurs in Transformers, we plot the singular value distribution of the features in the last block of a 12-layer BERT. As shown in Figure 2(c), insignificant (nearly zero) values dominate the singular value distribution in the deep layers of vanilla BERT, indicating that representations reside on a low-dimensional manifold and dimensional collapse happens. To show the collapse tendency along the layer index, we summarize the singular value distribution by a concise scalar, the effective rank (erank) (Roy & Vetterli, 2007), which covers the full singular value spectrum.

Definition 3.1 (Effective Rank).

Consider a matrix $\bm{X}\in\mathbb{R}^{m\times n}$ with singular value decomposition $\bm{X}=\bm{U}\mathbf{\Sigma}\bm{V}$, where $\mathbf{\Sigma}$ is a diagonal matrix with singular values $\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{Q}\geq 0$ and $Q=\min\{m,n\}$. The distribution of singular values is defined as the $L_{1}$-normalized form $p_{i}=\sigma_{i}/\sum_{k=1}^{Q}|\sigma_{k}|$. The effective rank of the matrix $\bm{X}$, denoted $\mbox{erank}(\bm{X})$, is defined as $\mbox{erank}(\bm{X})=\exp\big(H(p_{1},p_{2},\cdots,p_{Q})\big)$, where $H(p_{1},p_{2},\cdots,p_{Q})$ is the Shannon entropy given by $H(p_{1},p_{2},\cdots,p_{Q})=-\sum_{k=1}^{Q}p_{k}\log p_{k}$.
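For reference, Definition 3.1 can be computed directly from the singular value spectrum; the following minimal PyTorch sketch (the function name is ours) mirrors the definition and is reused in later sanity checks.

```python
import torch

def effective_rank(X: torch.Tensor) -> float:
    """erank(X) = exp(H(p)), with p the L1-normalized singular values of X."""
    sigma = torch.linalg.svdvals(X)     # singular values, non-negative
    p = sigma / sigma.sum()
    p = p[p > 0]                        # drop exact zeros so 0*log(0) is treated as 0
    entropy = -(p * p.log()).sum()      # Shannon entropy of the spectrum
    return entropy.exp().item()
```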

Based on erank, we revisit the oversmoothing issue of Transformers in Figure 2(d). We can see that the effective rank decreases with the layer index, indicating an increasingly imbalanced singular value distribution in deeper layers. This finding not only verifies that dimensional collapse does occur in Transformers, but also indicates that the effective rank is an effective tool for detecting this issue.

Figure 2: Comparison between different metrics of oversmoothing for a 12-layer BERT on the GLUE sub-task STS-B and a 12-layer ViT on CIFAR10. Panels: (a) Similarity on BERT; (b) Similarity on ViT; (c) Feature Distribution; (d) Effective Rank. Figures 2(a) and 2(b) show feature and attention-map cosine similarity along the layer index. Figures 2(c) and 2(d) show the singular value distribution of features and the effective rank of BERT w/ and w/o ContraNorm.

3.2 Inspirations from the Uniformity Loss in Contrastive Learning

The core idea of contrastive learning is maximizing agreement between augmented views of the same example (i.e., positive pairs) and disagreement of views from different samples (i.e., negative pairs). A popular form of contrastive learning optimizes feature representations using a loss function with limited negative samples (Chen et al., 2020). Concretely, given a batch of randomly sampled examples, for each example we generate its augmented positive views and finally obtain a total of $N$ samples. Considering an encoder function $f$, the contrastive loss for the positive pair $(i,i^{+})$ is

\mathcal{L}(i,i^{+})=-\log\frac{\exp(f(\bm{x}_{i})^{\top}f(\bm{x}_{i^{+}})/\tau)}{\sum_{k=1}^{N}\exp(f(\bm{x}_{i})^{\top}f(\bm{x}_{k})/\tau)-\exp(f(\bm{x}_{i})^{\top}f(\bm{x}_{i})/\tau)}, \qquad (1)

where τ\tau denotes the temperature. The loss can be decoupled into alignment loss and uniformity loss:

\mathcal{L}_{\rm{align}}(i,i^{+})=-f(\bm{x}_{i})^{\top}f(\bm{x}_{i^{+}})/\tau, \qquad \mathcal{L}_{\rm{uniform}}(i)=\log\sum_{k=1}^{N}\exp(f(\bm{x}_{i})^{\top}f(\bm{x}_{k})/\tau). \qquad (2)

The alignment loss encourages feature representations for positive pairs to be similar, thus being invariant to unnecessary noise. However, training with only the alignment loss will result in a trivial solution where all representations are identical. In other words, complete collapse happens. While batch normalization (Ioffe & Szegedy, 2015) can help avoid this issue, it cannot fully prevent the problem of dimensional collapse, which still negatively impacts learning (Hua et al., 2021).

Thanks to the property of the uniformity loss, dimensional collapse can be addressed effectively. Reviewing its form in Eq. (2), the uniformity loss maximizes the average distance between all samples, resulting in embeddings that are roughly uniformly distributed in the latent space, so that more information is preserved. Including the uniformity loss in the training process thus helps to alleviate dimensional collapse. Intuitively, it can also serve as a sharp tool for alleviating oversmoothing in models such as GNNs and Transformers.
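As a concrete reference, the two terms in Eq. (2) can be written in a few lines; the sketch below assumes the encoder outputs f(x) are already stacked into a batch matrix Z (all names are ours).

```python
import torch

def alignment_loss(z_i: torch.Tensor, z_pos: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """L_align(i, i+): negative similarity between a positive pair, Eq. (2), left term."""
    return -(z_i * z_pos).sum() / tau

def uniformity_loss(z_i: torch.Tensor, Z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """L_uniform(i): log-sum-exp of similarities between z_i and all N samples in Z, Eq. (2), right term."""
    return torch.logsumexp(Z @ z_i / tau, dim=0)
```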

An alternative approach to address oversmoothing is to directly incorporate the uniformity loss into the training objective. However, our experiments reveal that this method has limited effectiveness (see Appendix B.2 for more details). Instead, we propose a normalization layer that can be easily integrated into various models. Our approach utilizes the uniformity loss as an underlying energy function of the proposed layer, such that a descent step along the energy function corresponds to a forward pass of the layer. Alternatively, we can view the layer as the unfolded iterations of an optimization function. This perspective is adopted to elucidate GNNs (Yang et al., 2021; Zhu et al., 2021), Transformers (Yang et al., 2022), and classic MLPs (Xie et al., 2021).

Note that the uniformity loss works by optimizing model parameters, while what the normalization layer directly updates is the representation itself. So we first transfer the uniformity loss, which serves as a training loss, to a kind of architecture loss. Consider a fully connected graph with a restricted number of nodes, where node $\bm{h}_{i}$ is viewed as the representation of a random sample $\bm{x}_{i}$. Reminiscent of $\mathcal{L}_{\rm{uniform}}$, we define $\hat{\mathcal{L}}_{\rm{uniform}}$ over all nodes in the graph as

\hat{\mathcal{L}}_{\rm{uniform}}=\sum_{i}\mathcal{L}_{\rm{uniform}}(i)=\sum_{i}\log\sum_{j}e^{\bm{h}_{i}^{\top}\bm{h}_{j}/\tau}. \qquad (3)

This form of uniformity loss is defined directly on representations, and we later use it as the underlying energy function for representation update.
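The transferred loss in Eq. (3) acts on the representation matrix itself; a minimal sketch (the function name is ours) is:

```python
import torch

def uniformity_energy(H: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """\hat{L}_uniform(H) = sum_i log sum_j exp(h_i^T h_j / tau), Eq. (3)."""
    sim = H @ H.T / tau                      # (n, n) pairwise similarities
    return torch.logsumexp(sim, dim=1).sum()
```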

3.3 The Proposed ContraNorm

Till now, we are able to build a connection between layer design and the unfolded uniformity-minimizing iterations. Specifically, we take the derivative of $\hat{\mathcal{L}}_{\rm{uniform}}$ with respect to the node representations:

\frac{\partial\hat{\mathcal{L}}_{\rm{uniform}}}{\partial\bm{h}_{i}}=\sum_{j}\frac{\exp(\bm{h}_{i}^{\top}\bm{h}_{j}/\tau)}{\sum_{k}\exp(\bm{h}_{i}^{\top}\bm{h}_{k}/\tau)}\bm{h}_{j}/\tau+\sum_{j}\frac{\exp(\bm{h}_{i}^{\top}\bm{h}_{j}/\tau)}{\sum_{k}\exp(\bm{h}_{j}^{\top}\bm{h}_{k}/\tau)}\bm{h}_{j}/\tau. \qquad (4)

Denoting the feature matrix by $\bm{H}$ whose $i$-th row is $\bm{h}_{i}^{\top}$, we can rewrite Eq. (4) in matrix form:

\frac{\partial\hat{\mathcal{L}}_{\rm{uniform}}}{\partial\bm{H}}=(\bm{D}^{-1}\bm{A}+\bm{A}\bm{D}^{-1})\bm{H}/\tau, \qquad (5)

where $\bm{A}=\exp(\bm{H}\bm{H}^{\top}/\tau)$ and $\bm{D}=\deg(\bm{A})$ ($\deg(\bm{A})$ is a diagonal matrix whose $(i,i)$-th element equals the sum of the $i$-th row of $\bm{A}$). To reduce the uniformity loss $\hat{\mathcal{L}}_{\rm{uniform}}$, a natural way is to take a single step of gradient descent that updates $\bm{H}$ by

\bm{H}_{t}=\bm{H}_{b}-s\times\frac{\partial\hat{\mathcal{L}}_{\rm{uniform}}}{\partial\bm{H}_{b}}=\bm{H}_{b}-s/\tau\times(\bm{D}^{-1}\bm{A}+\bm{A}\bm{D}^{-1})\bm{H}_{b}, \qquad (6)

where $\bm{H}_{b}$ and $\bm{H}_{t}$ denote the representations before and after the update, respectively, and $s$ is the step size of the gradient descent. By taking this update after a certain representation layer, we can reduce the uniformity loss of the representations and thus help ease the dimensional collapse.
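The closed-form gradient in Eq. (5) can be cross-checked against automatic differentiation of the energy sketched above; the snippet below (with τ = 1 and row-normalized features to keep the exponentials small) is an illustrative sanity check, not part of the method.

```python
import torch
import torch.nn.functional as F

tau, s = 1.0, 0.1
H = F.normalize(torch.randn(16, 32), dim=1).requires_grad_()

grad = torch.autograd.grad(uniformity_energy(H, tau), H)[0]

A = torch.exp(H @ H.T / tau)                     # A = exp(H H^T / tau)
D_inv = torch.diag(1.0 / A.sum(dim=1))
closed_form = (D_inv @ A + A @ D_inv) @ H / tau  # Eq. (5)
print(torch.allclose(grad, closed_form.detach(), atol=1e-4))  # True

H_t = (H - s * grad).detach()                    # one descent step, Eq. (6)
```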

In Eq. (6), two terms, $\bm{D}^{-1}\bm{A}$ and $\bm{A}\bm{D}^{-1}$, are multiplied with $\bm{H}_{b}$. Empirically, the two terms play a similar role in our method. Note that the first term is related to the self-attention matrix in Transformers, so we only preserve it and discard the second one. Then Eq. (6) becomes

\bm{H}_{t}=\bm{H}_{b}-s/\tau\times(\bm{D}^{-1}\bm{A})\bm{H}_{b}=\bm{H}_{b}-s/\tau\times\mathrm{softmax}(\bm{H}_{b}\bm{H}_{b}^{\top})\bm{H}_{b}. \qquad (7)

In fact, the operation corresponds to the stop-gradient technique, which is widely used in contrastive learning methods (He et al., 2020; Grill et al., 2020; Tao et al., 2022). By throwing away some terms in the gradient, stop-gradient makes the training process asymmetric and thus avoids representation collapse with less computational overhead, which is discussed in detail in Appendix B.1.

However, the layer induced by Eq. (7) still cannot ensure uniformity of the representations. Consider an extreme case where $\mathrm{softmax}(\bm{H}_{b}\bm{H}_{b}^{\top})$ equals the identity matrix $\bm{I}$. Eq. (7) then becomes $\bm{H}_{t}=\bm{H}_{b}-s/\tau\times\bm{H}_{b}=(1-s/\tau)\bm{H}_{b}$, which merely shrinks the scale of $\bm{H}$ and does not help alleviate the complete collapse. To keep the representations stable, we explore two different approaches: feature norm regularization and layer normalization.

I. Feature Norm Regularization. We add a regularization term $-\frac{1}{2}\sum_{i}\|\bm{h}_{i}\|_{2}^{2}$ to the uniformity loss to encourage a larger feature norm. When the regularization term becomes smaller, the norm of $\bm{h}_{i}$ becomes larger. Therefore, adding this term helps prevent the norm of the representation $\bm{h}_{i}$ from shrinking. In this way, the update form becomes

\bm{H}_{t}=(1+s)\bm{H}_{b}-s/\tau\times\mathrm{softmax}(\bm{H}_{b}\bm{H}_{b}^{\top})\bm{H}_{b}. \qquad (8)
Proposition 1.

Let $\bm{e}=(1,1,\dots,1)^{\top}/\sqrt{n}$. For the attention matrix $\bar{\bm{A}}=\operatorname{softmax}(\bm{H}_{b}\bm{H}_{b}^{\top})$, let $\sigma_{\min}$ be the smallest eigenvalue of $\bm{P}=(\bm{I}-\bm{e}\bm{e}^{\top})(\bm{I}-\bar{\bm{A}})+(\bm{I}-\bar{\bm{A}})^{\top}(\bm{I}-\bm{e}\bm{e}^{\top})$. For the update in Eq. (8), i.e., $\bm{H}_{t}=((1+s)\bm{I}-s\bar{\bm{A}})\bm{H}_{b}$ with $s>0$, we have $\operatorname{Var}(\bm{H}_{t})\geq(1+s\sigma_{\min})\cdot\operatorname{Var}(\bm{H}_{b})$. In particular, if $\sigma_{\min}\geq 0$, we have $\operatorname{Var}(\bm{H}_{t})\geq\operatorname{Var}(\bm{H}_{b})$.

Proposition 1 bounds the ratio of the variance after and before the update. It shows that the change of the variance is governed by the symmetric matrix $\bm{P}=(\bm{I}-\bm{e}\bm{e}^{\top})(\bm{I}-\bar{\bm{A}})+(\bm{I}-\bar{\bm{A}})^{\top}(\bm{I}-\bm{e}\bm{e}^{\top})$. If $\bm{P}$ is positive semi-definite, we obtain $\operatorname{Var}(\bm{H}_{t})\geq\operatorname{Var}(\bm{H}_{b})$, which indicates that the representations become more uniform. In Appendix F, we give some sufficient conditions under which $\bm{P}$ is positive semi-definite.

II. Appending LayerNorm. Another option is to leverage Layer Normalization (LayerNorm) (Ba et al., 2016) to maintain the feature scale. The update form of LayerNorm is $\text{LayerNorm}(\bm{h}_{i})=\gamma\cdot(\bm{h}_{i}-\text{mean}(\bm{h}_{i}))/\sqrt{\text{Var}(\bm{h}_{i})+\varepsilon}+\beta$, where $\gamma$ and $\beta$ are learnable parameters and $\varepsilon=10^{-5}$. The learnable parameters $\gamma$ and $\beta$ can rescale the representation $\bm{h}_{i}$ to help ease the problem. We append LayerNorm to the original update in Eq. (7) and obtain

\bm{H}_{t}=\mathrm{LayerNorm}\left(\bm{H}_{b}-s/\tau\times\mathrm{softmax}(\bm{H}_{b}\bm{H}_{b}^{\top})\bm{H}_{b}\right), \qquad (9)

where applying layer normalization to a representation matrix $\bm{H}$ means applying it to all its components $\bm{h}_{1},\dots,\bm{h}_{n}$.

ContraNorm. We empirically compare the two proposed methods and find their performance comparable, while the second one performs slightly better. Therefore, we adopt the second update form and name it Contrastive Normalization (ContraNorm). The ContraNorm layer can be added after any representation layer to reduce the uniformity loss and help relieve the dimensional collapse. We discuss the best place to plug our ContraNorm in Appendix B.3.
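For concreteness, a minimal PyTorch sketch of the layer in Eq. (9) is given below. The module layout and default hyperparameters are ours; the official implementation (https://github.com/PKU-ML/ContraNorm) may differ in details such as where the temperature is applied.

```python
import torch
import torch.nn as nn

class ContraNorm(nn.Module):
    """Minimal sketch of Eq. (9): a gradient step on the uniformity energy followed by LayerNorm."""
    def __init__(self, dim: int, scale: float = 0.1, tau: float = 1.0):
        super().__init__()
        self.scale, self.tau = scale, tau
        self.ln = nn.LayerNorm(dim)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (..., n, d) token or node representations
        sim = torch.softmax(H @ H.transpose(-2, -1), dim=-1)   # softmax(H_b H_b^T)
        H = H - (self.scale / self.tau) * sim @ H              # push representations apart
        return self.ln(H)                                      # keep the feature scale stable
```

In Transformers, such a layer is inserted after the self-attention module and residual connection; in GNNs, after each graph convolution (see Sections 4.1 and 4.3).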

ContraNorm-D: Scalable ContraNorm for Large-scale Graphs. From the update rule in Eq. (9), we can see that the computational complexity of the ContraNorm layer is $\mathcal{O}(n^{2}d)$, where $n$ is the number of tokens and $d$ is the feature dimension per token, which is of the same order as self-attention (Vaswani et al., 2017). Therefore, this operation is affordable when the token number $n$ is comparable to or smaller than the feature dimension $d$, as is usually the case in the CV and NLP domains. However, for large-scale graphs that contain millions of nodes ($n\gg d$), the quadratic dependence on $n$ makes ContraNorm computationally prohibitive. To circumvent this issue, we propose Dual ContraNorm (ContraNorm-D), a scalable variant of ContraNorm whose computational complexity $\mathcal{O}(nd^{2})$ has a linear dependence on $n$, with the following dual update rule,

\bm{H}_{t}=\mathrm{LayerNorm}\left(\bm{H}_{b}-s/\tau\times\bm{H}_{b}\times\mathrm{softmax}(\bm{H}_{b}^{\top}\bm{H}_{b})\right), \qquad (10)

where we calculate the dimension-wise feature correlation matrix $\bm{H}_{b}^{\top}\bm{H}_{b}\in\mathbb{R}^{d\times d}$ and multiply it to the right of the features $\bm{H}_{b}$ after softmax normalization. In Appendix A, we provide a more thorough explanation, showing how to derive it from the feature decorrelation normalization adopted in non-contrastive learning methods (like Barlow Twins (Zbontar et al., 2021) and VICReg (Bardes et al., 2022)). As revealed in recent work (Garrido et al., 2022), there is a dual relationship between contrastive and non-contrastive learning methods, and we can therefore regard ContraNorm-D as a dual version of ContraNorm that focuses on decreasing dimension-wise feature correlation.
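Analogously, a minimal sketch of the dual update in Eq. (10), which softmax-normalizes the d × d feature correlation matrix and multiplies it on the right (again, module layout and defaults are ours):

```python
import torch
import torch.nn as nn

class ContraNormD(nn.Module):
    """Minimal sketch of Eq. (10); the cost is O(n d^2), linear in the number of nodes n."""
    def __init__(self, dim: int, scale: float = 0.1, tau: float = 1.0):
        super().__init__()
        self.scale, self.tau = scale, tau
        self.ln = nn.LayerNorm(dim)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (n, d) node representations, with n possibly much larger than d
        corr = torch.softmax(H.transpose(-2, -1) @ H, dim=-1)   # (d, d) softmax(H^T H)
        return self.ln(H - (self.scale / self.tau) * H @ corr)
```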

3.4 Theoretical Analysis

In this part, we give a theoretical result for ContraNorm and show that, with a slightly different form of the uniformity loss, the ContraNorm update can help alleviate the dimensional collapse.

Proposition 2.

Consider the update form $\bm{H}_{t}=(1+s)\bm{H}_{b}-s(\bm{H}_{b}\bm{H}_{b}^{\top})\bm{H}_{b}$, and let $\sigma_{\max}$ be the largest singular value of $\bm{H}_{b}$. For $s>0$ satisfying $1+(1-\sigma^{2}_{\max})s>0$, we have $\operatorname{erank}(\bm{H}_{t})>\operatorname{erank}(\bm{H}_{b})$.

Proposition 2 gives a promise of alleviating dimensional collapse under a special update form. Denote $\hat{\mathcal{L}}_{\rm{dim}}=\operatorname{tr}((\bm{I}-\bm{H}\bm{H}^{\top})^{2})/4$; then $\hat{\mathcal{L}}_{\rm{dim}}$ shares a similar form with the Barlow Twins loss (Zbontar et al., 2021). This loss pushes the diagonal elements of the similarity matrix $\bm{H}\bm{H}^{\top}$ toward 1 and the off-diagonal elements toward 0, which drives different features to become orthogonal and thus helps the features become more uniform in the space. Note that $\frac{\partial\hat{\mathcal{L}}_{\rm{dim}}}{\partial\bm{H}}=(\bm{H}\bm{H}^{\top}-\bm{I})\bm{H}$; therefore, the update above can be rewritten as $\bm{H}_{t}=\bm{H}_{b}-s\frac{\partial\hat{\mathcal{L}}_{\rm{dim}}(\bm{H}_{b})}{\partial\bm{H}_{b}}$, i.e., a gradient descent step on $\hat{\mathcal{L}}_{\rm{dim}}$, which implies that this update form is a ContraNorm with a different uniformity loss. Proposition 2 reveals that this update increases the effective rank of the representation matrix whenever $s$ satisfies $1+(1-\sigma^{2}_{\max})s>0$. Note that no matter whether $\sigma_{\max}>1$ or not, if $s$ is sufficiently close to 0, the condition is satisfied. Under this condition, the update alleviates the dimensional collapse.
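Proposition 2 can be probed numerically with the effective_rank helper sketched in Section 3.1; the snippet below (entirely illustrative, with random row-normalized features and a small step size so the condition 1 + (1 − σ²_max)s > 0 holds) typically shows an increase in effective rank after one update.

```python
import torch
import torch.nn.functional as F

H_b = F.normalize(torch.randn(64, 16), dim=1)     # rows on the unit sphere keep sigma_max moderate
s = 0.05                                          # small enough that 1 + (1 - sigma_max^2) * s > 0
H_t = (1 + s) * H_b - s * (H_b @ H_b.T) @ H_b     # update form of Proposition 2

print(effective_rank(H_b), effective_rank(H_t))   # the second value is typically larger
```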

4 Experiments

In this section, we demonstrate the effectiveness of ContraNorm through experiments including 1) language understanding tasks on GLUE datasets with BERT and ALBERT (Lan et al., 2020) as the backbones; 2) the image classification task on the ImageNet100 and ImageNet1k datasets with DeiT as the base model; 3) the fully supervised graph node classification task on popular graph datasets with GCN as the backbone. Moreover, we conduct ablation studies comparing ContraNorm variants and analyzing hyperparameter sensitivity.

4.1 Experiments on Language Models

Setup. To corroborate the potential of ContraNorm, we integrate it into two Transformer-structured models, BERT and ALBERT, and evaluate them on GLUE datasets. GLUE includes three categories of tasks: (i) single-sentence tasks CoLA and SST-2; (ii) similarity and paraphrase tasks MRPC, QQP, and STS-B; (iii) inference tasks MNLI, QNLI, and RTE. For the MNLI task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) versions. By default, we plug ContraNorm after the self-attention module and the residual connection (more position choices are in Appendix B.3). We use a batch size of 32 and fine-tune for 5 epochs over the data for all GLUE tasks. For each task, we select the best scale factor $s$ in Eq. (6) among $(0.005, 0.01, 0.05, 0.1, 0.2)$. We use base models (BERT-base and ALBERT-base) of 12 stacked blocks with hyperparameters fixed for all tasks: hidden size 128, number of attention heads 12, and maximum sequence length 384. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 2e-5.

Results. As shown in Table 1, ContraNorm substantially improves results on all datasets compared with vanilla BERT. Specifically, our ContraNorm improves the previous average performance from 82.59% to 83.39% for the BERT backbone and from 83.74% to 84.67% for the ALBERT backbone. We also submit our trained model to the GLUE benchmark leaderboard, and the results can be found in Appendix E.1. It is observed that BERT with ContraNorm also outperforms the vanilla model across all datasets. To verify the de-oversmoothing effect of ContraNorm, we build models with/without ContraNorm at various layer depths. The performance comparison is shown in Figure 3(a). Constantly stacking blocks causes an obvious deterioration in vanilla BERT, while BERT with ContraNorm maintains a competitive advantage. Moreover, for deep models, we also show the tendency of the variance and the effective rank in Figure 3(b) and Figure 3(c), which verifies the power of ContraNorm in alleviating complete collapse and dimensional collapse, respectively.

Figure 3: Comparison of BERT and BERT+ContraNorm on CoLA with various layer depths. Panels: (a) Deep Model Performance; (b) Variance; (c) Effective Rank. Figure 3(a) presents the performance of models with block number = 12, 24, 36, 48, 60, 72. Figure 3(b) and Figure 3(c) show the tendency of the feature variance and effective rank for models with 72 blocks.
Table 1: Results comparison on validation set of GLUE tasks. Following Devlin et al. (2018), we report F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. ContraNorm* denotes the best model when varying plugging positions of ContraNorm. Avg denotes the average performance on all the tasks and bold denotes the best performance.
Dataset COLA SST-2 MRPC QQP STS-B MNLI-m MNLI-mm QNLI RTE Avg
BERT-base 55.28 92.89 88.96 88.24 88.61 84.65 84.61 91.51 68.59 82.59
BERT + ContraNorm 58.83 93.12 89.49 88.30 88.66 84.87 84.66 91.78 70.76 83.39
BERT + ContraNorm* 59.57 93.23 89.97 88.30 88.93 84.87 84.67 91.78 70.76 83.54
ALBERT-base 57.35 93.69 92.09 87.23 90.54 84.56 84.37 90.90 76.53 83.74
ALBERT + ContraNorm 58.51 92.89 92.86 87.45 90.56 84.33 84.62 91.76 79.06 84.67
ALBERT + ContraNorm* 58.76 93.23 92.89 87.67 90.72 84.69 84.95 92.28 79.06 84.92

4.2 Experiments on Vision Transformers

We also validate the effectiveness of ContraNorm in the computer vision field. We choose DeiT as the backbone, and models are trained from scratch. Experiments with different depth settings are evaluated on the ImageNet100 and ImageNet1k datasets. Based on the Timm (Wightman, 2019) and DeiT repositories, we insert ContraNorm into the base model immediately following the self-attention module.

Setup. We follow the training recipe of Touvron et al. (2021) with minimal hyperparameter changes. Specifically, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with cosine learning rate decay. We train each model for 150 epochs with a batch size of 1024. Augmentation techniques are used to boost performance. For all experiments, the image size is set to $224\times 224$. For ImageNet100, we set the scale factor $s$ to 0.2 for all layer depth settings. For ImageNet1k, we set $s$ to 0.2 for models with 12 and 16 blocks, and raise it to 0.3 for models with 24 blocks.

Results. In Table 2, DeiT with ContraNorm outperforms vanilla DeiT under all layer settings on both datasets. Notably, our method shows a gain of nearly 5% in test accuracy on ImageNet100. For ImageNet1k, we boost the performance of DeiT with 24 blocks from 77.69% to 78.67%.

Table 2: Test accuracy (%) comparison results. For ImageNet100 and ImageNet1k, we use DeiT-tiny and DeiT-small as the baseline respectively. The block number is set to 12, 16 and 24. The best result for each dataset is bolded.
Dataset Model #L=12 #L=16 #L=24
ImageNet100 DeiT-tiny 76.58 75.34 76.76
+ContraNorm 79.34 80.44 81.28
ImageNet1k DeiT-small 77.32 78.25 77.69
+ContraNorm 77.80 79.04 78.67

4.3 Experiments on Graph Neural Networks

Setup. We implement ContraNorm as a simple normalization layer after each graph convolution of GCN, and evaluate it on the fully supervised graph node classification task. For datasets, we choose two popular citation networks, Cora (McCallum et al., 2000) and Citeseer (Giles et al., 1998), and the Wikipedia article networks Chameleon and Squirrel (Rozemberczki et al., 2021). More information on the datasets is deferred to Appendix D. We compare ContraNorm against PairNorm (Zhao & Akoglu, 2020), a popular normalization layer designed for preventing oversmoothing. We also take LayerNorm as a baseline by setting the scale $s=0$. We follow the data split setting in Kipf & Welling (2017) with train/validation/test splits of 60%, 20%, and 20%, respectively. To keep the comparison fair, we fix the hidden dimension to 32 and the dropout rate to 0.6, as reported in Zhao & Akoglu (2020). We choose the best scale controller $s$ in the range $\{0.2, 0.5, 0.8, 1.0\}$ for both PairNorm and ContraNorm. For PairNorm, we choose the best variant presented by Zhao & Akoglu (2020).
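The integration into GCN is straightforward; the sketch below is a hypothetical wiring (not our exact training script), assuming a PyTorch Geometric GCNConv backbone and the ContraNorm module sketched in Section 3.3.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv  # assumed backbone implementation

class GCNWithContraNorm(nn.Module):
    """Illustrative GCN where every hidden graph convolution is followed by ContraNorm."""
    def __init__(self, in_dim, hid_dim, out_dim, num_layers, scale=0.2, dropout=0.6):
        super().__init__()
        dims = [in_dim] + [hid_dim] * (num_layers - 1) + [out_dim]
        self.convs = nn.ModuleList(GCNConv(dims[i], dims[i + 1]) for i in range(num_layers))
        # one ContraNorm layer after each graph convolution except the output layer
        self.norms = nn.ModuleList(ContraNorm(dims[i + 1], scale=scale) for i in range(num_layers - 1))
        self.dropout = dropout

    def forward(self, x, edge_index):
        for conv, norm in zip(self.convs[:-1], self.norms):
            x = F.relu(norm(conv(x, edge_index)))
            x = F.dropout(x, p=self.dropout, training=self.training)
        return self.convs[-1](x, edge_index)
```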

Results. As shown in Table 3, at shallow depths (e.g., two layers), adding ContraNorm reduces the accuracy of vanilla GCN by a small margin, while it helps prevent the performance from deteriorating sharply as the network goes deeper. ContraNorm outperforms PairNorm and LayerNorm in its de-oversmoothing power. This also shows that the key to diluting oversmoothing is ContraNorm itself: LayerNorm alone fails to prevent GCN from oversmoothing, and even amplifies the negative effect on Cora with more than 16 layers.

Table 3: Test accuracy (%) comparison results. We use GCN as backbone and apply LayerNorm, PairNorm and ContraNorm, respectively. We fairly tune the scale parameter in LayerNorm and ContraNorm. The layer number is set to 2, 4, 8, 16, 32. For every layer setting, the best accuracy is marked in blue background, and the second best is underlined. Results are averaged over 5 runs.
Dataset Model #L=2 #L=4 #L=8 #L=16 #L=32
Cora Vanilla GCN 81.75 ± 0.51 72.61 ± 2.42 17.71 ± 6.89 20.71 ± 8.54 19.69 ± 9.54
+LayerNorm 79.96 ± 0.73 77.45 ± 0.67 39.09 ± 4.68 7.79 ± 0.00 7.79 ± 0.00
+PairNorm 75.32 ± 1.05 72.64 ± 2.67 71.86 ± 3.31 54.11 ± 9.49 36.62 ± 2.73
+ContraNorm 79.75 ± 0.33 77.02 ± 0.96 74.01 ± 0.64 68.75 ± 2.10 46.39 ± 2.46
CiteSeer Vanilla GCN 69.18 ± 0.34 55.01 ± 4.36 19.65 ± 0.00 19.65 ± 0.00 19.65 ± 0.00
+LayerNorm 63.27 ± 1.15 60.91 ± 0.76 33.74 ± 6.15 19.65 ± 0.00 19.65 ± 0.00
+PairNorm 61.59 ± 1.35 53.01 ± 2.19 55.76 ± 4.45 44.21 ± 1.73 36.68 ± 2.55
+ContraNorm 64.06 ± 0.85 60.55 ± 0.72 59.30 ± 0.67 49.01 ± 3.49 36.94 ± 1.70
Chameleon Vanilla GCN 45.79 ± 1.20 37.85 ± 1.35 22.37 ± 0.00 22.37 ± 0.00 23.37 ± 0.00
+LayerNorm 63.95 ± 1.29 55.79 ± 1.25 34.08 ± 2.62 22.37 ± 0.00 22.37 ± 0.00
+PairNorm 62.24 ± 1.73 58.38 ± 1.48 49.12 ± 2.32 37.54 ± 1.70 30.66 ± 1.58
+ContraNorm 64.78 ± 1.68 58.73 ± 1.12 48.99 ± 1.52 40.92 ± 1.74 35.44 ± 3.16
Squirrel Vanilla GCN 29.47 ± 0.96 19.31 ± 0.00 19.31 ± 0.00 19.31 ± 0.00 19.31 ± 0.00
+LayerNorm 43.04 ± 1.25 29.64 ± 5.50 19.63 ± 0.45 19.96 ± 0.44 19.40 ± 0.19
+PairNorm 43.86 ± 0.41 40.25 ± 0.55 36.03 ± 1.43 29.55 ± 2.19 29.05 ± 0.91
+ContraNorm 47.24 ± 0.66 40.31 ± 0.74 35.85 ± 1.58 32.37 ± 0.93 27.80 ± 0.72

4.4 Ablation Study

Recalling Section 3.3, we improve ContraNorm with the stop-gradient technique (SG) by masking the second term. To stabilize the data representations, we apply layer normalization (LN) to the original version, while for the convenience of theoretical analysis, layer normalization can be replaced by $L_{2}$ normalization ($L_{2}$N). Here, we investigate the effect of these tricks, and the results are shown in Table 4. Compared with the variant with only LayerNorm, ContraNorm with both stop gradient and layer normalization performs better. As for the two normalization choices, they are almost comparable, which verifies the applicability of our theoretical analysis.

In Appendix E.2, we further conduct an ablation study regarding the gains of ContraNorm while using different values of scale factor ss, and show that ContraNorm is robust in an appropriate range of ss.

Table 4: Performance comparison among different variants of ContraNorm. SG, LN, and $L_{2}$N are the abbreviations of stop gradient, layer normalization, and $L_{2}$ normalization, respectively. All experiments are conducted on GLUE tasks with the same parameter settings. Avg denotes the average performance on all the tasks. We bold the best result for each task.
Variants Datasets Avg
SG LN $L_{2}$N COLA SST-2 MRPC QQP STS-B MNLI-m MNLI-mm QNLI RTE
58.80 93.12 89.60 88.35 88.97 84.81 84.67 91.47 68.23 83.11
59.82 93.00 89.64 88.37 88.92 84.72 84.58 91.58 68.95 83.29
59.57 93.12 89.97 88.30 88.93 84.84 84.67 91.58 68.95 83.33

5 Conclusion

In this paper, we point out the deficiencies of current metrics in characterizing oversmoothing, and highlight the importance of taking dimensional collapse as the entry point for oversmoothing analysis. Inspired by the uniformity loss in contrastive learning, we design an optimization-induced normalization layer, ContraNorm. Our theoretical findings indicate that ContraNorm successfully mitigates collapse by increasing the variance and effective rank of representations. Experiments show that ContraNorm boosts the performance of Transformers and GNNs on various tasks. Our work provides a new perspective from contrastive learning on solving the oversmoothing problem, which we hope will motivate further research.

Acknowledgement

Yisen Wang is partially supported by the National Key R&D Program of China (2022ZD0160304), the National Natural Science Foundation of China (62006153), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and Huawei Technologies Inc. Xiaojun Guo thanks Hanqi Yan for her kind guidance on the implementation of BERT models. We thank anonymous reviewers for their constructive advice on this work.

References

  • Ali et al. (2021) Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. In NeurIPS, 2021.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Bardes et al. (2022) Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.
  • Chen et al. (2020) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In ICML, 2020.
  • Chen et al. (2022) Qi Chen, Yifei Wang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Optimization-induced graph implicit nonlinear diffusion. In ICML, 2022.
  • Chen & He (2021) Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021.
  • Chien et al. (2022) Eli Chien, Wei-Cheng Chang, Cho-Jui Hsieh, Hsiang-Fu Yu, Jiong Zhang, Olgica Milenkovic, and Inderjit S Dhillon. Node feature extraction by self-supervised multi-scale neighborhood prediction. In ICLR, 2022.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dong et al. (2021) Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In ICML, 2021.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • Ermolov et al. (2021) Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In ICML, 2021.
  • Ethayarajh (2019) Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.
  • Fulton (2000) William Fulton. Eigenvalues, invariant factors, highest weights, and schubert calculus. Bulletin of the American Mathematical Society, 37(3):209–249, 2000.
  • Gao et al. (2019) Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tieyan Liu. Representation degeneration problem in training natural language generation models. In ICLR, 2019.
  • Garrido et al. (2022) Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, and Yann Lecun. On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574, 2022.
  • Giles et al. (1998) C Lee Giles, Kurt D Bollacker, and Steve Lawrence. Citeseer: An automatic citation indexing system. In DL, 1998.
  • Gong et al. (2021) Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, and Qiang Liu. Vision transformers with patch diversification. arXiv preprint arXiv:2104.12753, 2021.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. In NeurIPS, 2020.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In NeurIPS, 2020.
  • Hua et al. (2021) Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In ICCV, 2021.
  • Huang et al. (2018) Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.
  • Huang et al. (2019) Lei Huang, Yi Zhou, Fan Zhu, Li Liu, and Ling Shao. Iterative normalization: Beyond standardization towards efficient whitening. In CVPR, 2019.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • Jing et al. (2022) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2022.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kipf & Welling (2017) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • Lan et al. (2020) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. In ICLR, 2020.
  • Li et al. (2021) Guohao Li, Matthias Müller, Bernard Ghanem, and Vladlen Koltun. Training graph neural networks with 1000 layers. In ICML, 2021.
  • Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • Loshchilov & Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  • McCallum et al. (2000) Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3:127–163, 2000.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
  • Roy & Vetterli (2007) Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In EUSIPCO, 2007.
  • Rozemberczki et al. (2021) Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2):cnab014, 2021.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. In IJCV, 2015.
  • Senior et al. (2020) Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander WR Nelson, Alex Bridgland, et al. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.
  • Shi et al. (2022) Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, and James Kwok. Revisiting over-smoothing in bert from the perspective of graph. In ICLR, 2022.
  • Siarohin et al. (2018) Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening and coloring batch transform for gans. arXiv preprint arXiv:1806.00420, 2018.
  • Strudel et al. (2021) Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021.
  • Tang et al. (2021) Yehui Tang, Kai Han, Chang Xu, An Xiao, Yiping Deng, Chao Xu, and Yunhe Wang. Augmented shortcuts for vision transformers. In NeurIPS, 2021.
  • Tao et al. (2022) Chenxin Tao, Honghui Wang, Xizhou Zhu, Jiahua Dong, Shiji Song, Gao Huang, and Jifeng Dai. Exploring the equivalence of siamese self-supervised learning via a unified gradient framework. In CVPR, 2022.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
  • Wang et al. (2022) Peihao Wang, Wenqing Zheng, Tianlong Chen, and Zhangyang Wang. Anti-oversmoothing in deep vision transformers via the fourier domain analysis: From theory to practice. arXiv preprint arXiv:2203.05962, 2022.
  • Wang & Isola (2020) Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
  • Wightman (2019) Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • Xie et al. (2021) Xingyu Xie, Qiuhao Wang, Zenan Ling, Xia Li, Yisen Wang, Guangcan Liu, and Zhouchen Lin. Optimization induced equilibrium networks. arXiv preprint arXiv:2105.13228, 2021.
  • Yan et al. (2022) Hanqi Yan, Lin Gui, Wenjie Li, and Yulan He. Addressing token uniformity in transformers via singular value transformation. In UAI, 2022.
  • Yang et al. (2021) Yongyi Yang, Tang Liu, Yangkun Wang, Jinjing Zhou, Quan Gan, Zhewei Wei, Zheng Zhang, Zengfeng Huang, and David Wipf. Graph neural networks inspired by classical iterative algorithms. In ICML, 2021.
  • Yang et al. (2022) Yongyi Yang, Zengfeng Huang, and David Wipf. Transformers from an optimization perspective. arXiv preprint arXiv:2205.13891, 2022.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In ACM SIGKDD, 2018.
  • Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
  • Zhao & Akoglu (2020) Lingxiao Zhao and Leman Akoglu. Pairnorm: Tackling oversmoothing in gnns. In ICLR, 2020.
  • Zhou et al. (2021) Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
  • Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020.
  • Zhu et al. (2021) Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In WWW, 2021.

Appendix A More Details on Dual ContraNorm

A.1 Methodology

The most time-consuming operation of ContraNorm is the matrix multiplication. Given $\mathbf{H}\in\mathbb{R}^{n\times d}$, where $n$ and $d$ denote the number of samples in a batch and the feature size respectively, the time complexity of ContraNorm is $\mathcal{O}(n^{2}d)$, which is of the same order as the self-attention operation in Transformers. Empirically, we report the training time of BERT with or without ContraNorm on GLUE tasks in Table 5. All experiments are conducted on a single NVIDIA GeForce RTX 3090. On average, we raise the performance of BERT on GLUE tasks from 82.59% to 83.54% (see Table 1) with less than 4 minutes of overhead. We consider this time overhead acceptable given the benefits it brings.

Table 5: Estimated training time of BERT with or without ContraNorm on GLUE tasks. All experiments are conducted on a single NVIDIA GeForce RTX 3090. Times are in seconds (s). Avg denotes the average training time across all the tasks.
Dataset COLA SST-2 MRPC QQP STS-B MNLI-m (/mm) QNLI RTE Avg
BERT 110s 851s 74s 6458s 110s 8186s 2402s 74s 2283s
+ContraNorm 125s 983s 80s 7150s 122s 8941s 2594s 80s 2509s

However, the computation bottleneck becomes more apparent when applying ContraNorm to large-scale graphs with $n\gg d$. We draw our inspiration from the duality between contrastive and non-contrastive learning (Garrido et al., 2022) to resolve this issue. It is shown that the uniformity loss used for deriving ContraNorm is equivalent (under limited assumptions) to the feature decorrelation loss adopted in non-contrastive learning, and the computational complexity of the decorrelation loss is $\mathcal{O}(nd^{2})$, which has a linear dependence on $n$. For the sake of consistency with the original ContraNorm, we choose the dual version of the InfoNCE uniformity loss, named VICReg-exp, proposed in Garrido et al. (2022), and apply it to the graph features $\bm{H}\in\mathbb{R}^{n\times d}$, leading to the following VICReg-exp decorrelation loss for $\bm{H}$,

L_{\rm VICReg-exp}(\bm{H})=\sum_{i}\log\sum_{j}e^{\bm{h}_{:,i}^{\top}\bm{h}_{:,j}}, \qquad (11)

where $\bm{h}_{:,i}$ denotes the $i$-th column vector (i.e., a feature dimension) of $\bm{H}$. We can observe its close resemblance to the InfoNCE uniformity loss in Eq. (2). Following the same procedure as in Section 3.3, we can derive the gradient descent rule of the decorrelation loss and arrive at the following normalization layer, named Dual ContraNorm (ContraNorm-D):

\mbox{ContraNorm-D}(\mathbf{H})=\mbox{LayerNorm}\big(\mathbf{H}-s/\tau\times\mathbf{H}\times\mbox{softmax}(\mathbf{H}^{T}\mathbf{H})\big), \qquad (12)

which is also very close in form to the ContraNorm layer in Eq. (9). We can see that ContraNorm-D only involves computing the feature correlation matrix ${\bm{H}}^{\top}{\bm{H}}$, with complexity $\mathcal{O}(nd^{2})$. The linear dependence on the node number $n$ makes it well suited for large-scale graphs. We note that other forms of feature decorrelation objectives in non-contrastive learning (Zbontar et al., 2021; Bardes et al., 2022) can also be applied, and we leave a more systematic study of their effects on neural network dynamics to future work.
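For reference, a minimal PyTorch sketch of Eq. (12) is given below; we assume a row-wise softmax over the $d\times d$ correlation matrix and a standard LayerNorm over the feature dimension. Unlike the sketch above, no $n\times n$ matrix is ever formed.

```python
import torch
import torch.nn.functional as F

def contranorm_d(H, s=1.0, tau=1.0):
    """Dual ContraNorm (Eq. (12)) on features H with shape (n, d)."""
    corr = torch.softmax(H.t() @ H, dim=-1)   # (d, d) feature correlation: O(n d^2) overall
    H = H - (s / tau) * H @ corr              # H x softmax(H^T H)
    return F.layer_norm(H, H.shape[-1:])
```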

A.2 BERT Experiments on GLUE Benchmark

To verify whether ContraNorm-D performs as well as ContraNorm, we conduct experiments on the validation set of GLUE tasks. The learning rate of BERT with ContraNorm-D is set to 4e-5, and the other experimental setups are the same as in Section 4.1. As shown in Table 6, BERT with ContraNorm-D surpasses vanilla BERT on every task, and the average performance is raised from 82.59% to 84.21%. The results imply the effectiveness of this modified version of ContraNorm, which can also be explained by the relationship between the Gram matrix ($\mathbf{G}=\mathbf{H}\mathbf{H}^{T}\in\mathbb{R}^{n\times n}$) and the covariance matrix ($\mathbf{C}=\mathbf{H}^{T}\mathbf{H}\in\mathbb{R}^{d\times d}$) (Ali et al., 2021).

Table 6: Results comparison of BERT with or without ContraNorm-D on validation set of GLUE tasks. Following Devlin et al. (2018), we report F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. Avg denotes the average performance on all the tasks and bold denotes the best performance.
Dataset COLA SST-2 MRPC QQP STS-B MNLI-m MNLI-mm QNLI RTE Avg
BERT 55.28 92.89 88.96 88.24 88.61 84.65 84.61 91.51 68.59 82.59
+ ContraNorm-D 62.08 93.69 91.78 88.36 89.59 85.24 85.19 91.95 70.04 84.21

A.3 GNN Experiments on Large-scale OGB Benchmark

Setup. To assess the efficacy of ContraNorm-D in enhancing existing GNN techniques, we conduct experiments on the ogbn-arxiv (Hu et al., 2020) benchmark, whose large number of nodes (169,343) and edges (1,166,243) makes the original ContraNorm unsuitable due to its quadratic cost in the number of nodes. We choose the high-ranking GIANT (Chien et al., 2022) method from the OGB leaderboard (see https://ogb.stanford.edu/docs/leader_nodeprop/#ogbn-arxiv) as the backbone, and combine it with RevGAT (Li et al., 2021) and self-knowledge distillation (selfKD) as suggested in the original paper. We follow the experimental settings in Chien et al. (2022), but reduce the number of optimization steps for each layer of the hierarchical label trees to 1,000 (as opposed to 2,000 in the original method). For the hyperparameters of ContraNorm-D, we set the scale $s$ to 2 and the temperature $\tau$ to 5. We insert a ContraNorm-D layer after the graph convolution operation, as sketched below. Our implementation is based on the GIANT repository (see https://github.com/elichienxD/deep_gcns_torch).
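As an illustration of the insertion point only (not the actual GIANT/RevGAT code), a ContraNorm-D step can be placed right after a generic graph-convolution step as follows; the layer and parameter names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConvWithContraNormD(nn.Module):
    """A plain A_hat @ H @ W graph convolution followed by a ContraNorm-D step (Eq. (12))."""

    def __init__(self, in_dim, out_dim, s=2.0, tau=5.0):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.s, self.tau = s, tau

    def forward(self, A_hat, H):
        H = A_hat @ self.lin(H)                   # graph convolution with normalized adjacency A_hat
        corr = torch.softmax(H.t() @ H, dim=-1)   # (d, d) feature correlation
        H = H - (self.s / self.tau) * H @ corr    # Dual ContraNorm step
        return F.layer_norm(H, H.shape[-1:])
```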

Results. As shown in Table 7, despite the various tricks already applied in the GIANT* setting, adding our ContraNorm-D variant still boosts its test accuracy from 76.15% to 76.39%. At the time of paper revision (April 17, 2023), this gain is enough to lift GIANT* from 6th place to 2nd place on the leaderboard, achieving state-of-the-art results on ogbn-arxiv at this parameter scale, given that the 1st-ranking method, GLEM+RevGAT, uses more than 100x the parameters of our model. This finding indicates that ContraNorm-D can serve as a plug-and-play component and boost existing state-of-the-art GNNs on large-scale graph datasets.

Table 7: Comparison of GIANT* (GIANT-XRT+RevGAT+KD) with or without ContraNorm-D on the test set of ogbn-arxiv benchmark.
Method GIANT* GIANT* + ContraNorm-D GLEM+RevGAT (SOTA)
Test Accuracy 76.15 76.39 76.94
Params 1.3M 1.3M 140.4M
Peak Memory 5.11G 6.09G -
Average epoch time 0.44s 0.61s -
Rank on OGB leaderboard 6 2 1

Appendix B Additional Discussions and Ablation Experiments

In this section, with more space available, we elaborate on the choices behind our architectural designs and evaluate their effects in real-world experiments.

B.1 Explaining and Evaluating the Stop-Gradient Operation in Eq. (7)

In this section, we illustrate the stop-gradient operation in Eq. (7) using the framework proposed by Tao et al. (2022). The original update form is Eq. (6):

{\bm{H}}_{t}={\bm{H}}_{b}-s\times\frac{\partial\hat{\mathcal{L}}_{\rm{uniform}}}{\partial{\bm{H}}_{b}}={\bm{H}}_{b}-s/\tau\times({\bm{D}}^{-1}{\bm{A}}+{\bm{A}}{\bm{D}}^{-1}){\bm{H}}_{b}.

We take SimCLR (Chen et al., 2020) as the contrastive learning framework. Tao et al. (2022) studied the stop-gradient form of SimCLR and showed that the stop-gradient operation yields performance similar to the original. Based on this, we elaborate below on how the $\mathbf{A}\mathbf{D}^{-1}$ term is removed. In fact, we can directly illustrate how the second term in Eq. (4) can be omitted:

\frac{\partial\hat{\mathcal{L}}_{\rm{uniform}}}{\partial{\bm{h}}_{i}}=\sum_{j}\frac{\exp({\bm{h}}_{i}^{\top}{\bm{h}}_{j}/\tau)}{\sum_{k}\exp({\bm{h}}_{i}^{\top}{\bm{h}}_{k}/\tau)}{\bm{h}}_{j}/\tau+\sum_{j}\frac{\exp({\bm{h}}_{i}^{\top}{\bm{h}}_{j}/\tau)}{\sum_{k}\exp({\bm{h}}_{j}^{\top}{\bm{h}}_{k}/\tau)}{\bm{h}}_{j}/\tau.

In SimCLR, denoting the normalized features from the online branch as $u^{o}_{i},\ i=1,2,\dots,n$, the normalized features from the target branch (although the two branches are identical in SimCLR) as $u^{t}_{i},\ i=1,2,\dots,n$, and $\mathcal{V}=\{u_{1}^{o},u_{2}^{o},\dots,u_{n}^{o},u_{1}^{t},u_{2}^{t},\dots,u_{n}^{t}\}$, the SimCLR loss can be written as

L=-\frac{1}{2n}\sum_{u_{1}\in\mathcal{V}}\log\frac{\exp(u_{1}^{\top}u_{1}^{\prime}/\tau)}{\sum_{u\in\mathcal{V}/u_{1}}\exp(u_{1}^{\top}u/\tau)},

where $u_{1}$ and $u_{1}^{\prime}$ form a positive pair and $\tau$ is the temperature. The gradient of $L$ with respect to $u_{1}^{o}$ can then be calculated as

\frac{\partial L}{\partial u_{1}^{o}}=\frac{1}{2\tau n}\left(-u_{1}^{t}+\sum_{v\in\mathcal{V}/u_{1}^{o}}s_{v}v\right)+\frac{1}{2\tau n}\left(-u_{1}^{t}+\sum_{v\in\mathcal{V}/u_{1}^{o}}t_{v}v\right),

where

s_{v}=\frac{\exp(u_{1}^{o\top}v/\tau)}{\sum_{v^{\prime}\in\mathcal{V}/u_{1}^{o}}\exp(u_{1}^{o\top}v^{\prime}/\tau)}

is the softmax weight computed over the similarities between $u_{1}^{o}$ and the other samples, and

t_{v}=\frac{\exp(v^{\top}u_{1}^{o}/\tau)}{\sum_{v^{\prime}\in\mathcal{V}/v}\exp(v^{\top}v^{\prime}/\tau)}

is computed over the similarities between sample $v$ and its contrastive samples $\mathcal{V}/v$. We can see that the first term of $\partial L/\partial u_{1}^{o}$ comes from the part that takes $u_{1}^{o}$ as the anchor, while the second term comes from the parts that take other features as the anchor. Tao et al. (2022) propose to stop the gradient through the second term and verify empirically that doing so does not affect performance.

Note that the $u_{1}^{t}$ term in the gradient comes from the alignment loss, so the gradient of the uniformity loss with respect to $u_{1}^{o}$ can be written as

\frac{\partial L_{\rm{uniform}}}{\partial u_{1}^{o}}=\frac{1}{2\tau n}\left(\sum_{v\in\mathcal{V}/u_{1}^{o}}s_{v}v\right)+\frac{1}{2\tau n}\left(\sum_{v\in\mathcal{V}/u_{1}^{o}}t_{v}v\right). \qquad (13)

It is noteworthy that, writing $\mathcal{V}=\{h_{1},h_{2},\dots,h_{N}\}$, Eq. (4) shares the same form as Eq. (13). Adopting the stop-gradient method of Tao et al. (2022), we remove the second term in Eq. (4), which is exactly the ${\bm{A}}{\bm{D}}^{-1}$ term in Eq. (6).

Empirically, we plot the singular value distribution of embeddings for vanilla BERT and for ContraNorm with only the ${\bm{A}}{\bm{D}}^{-1}$ term or only the ${\bm{D}}^{-1}{\bm{A}}$ term on the RTE task. As shown in Figure 4, compared with vanilla BERT, whose distribution has a long tail (dimensional collapse), adding ContraNorm with either ${\bm{D}}^{-1}{\bm{A}}$ or ${\bm{A}}{\bm{D}}^{-1}$ reduces the number of insignificant (nearly zero) singular values and yields a more balanced distribution. The similarity of the two resulting distributions indicates that the two terms play a similar role in alleviating dimensional collapse.

Figure 4: Singular value distributions for ContraNorm using only the ${\bm{A}}{\bm{D}}^{-1}$ term or only the ${\bm{D}}^{-1}{\bm{A}}$ term. Experiments are conducted with 12-layer BERT on the RTE task.
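For concreteness, the following sketch implements the pre-LayerNorm update of Eq. (6) and its two single-term variants, assuming ${\bm{A}}_{ij}=\exp({\bm{h}}_{i}^{\top}{\bm{h}}_{j}/\tau)$ with ${\bm{D}}$ the diagonal matrix of row sums of ${\bm{A}}$ (so that ${\bm{D}}^{-1}{\bm{A}}=\operatorname{softmax}({\bm{H}}{\bm{H}}^{\top}/\tau)$); the function names are ours and features are assumed to be roughly unit-norm.

```python
import torch

def contranorm_terms(H, tau=1.0):
    """Return the D^{-1} A and A D^{-1} terms of Eq. (6)."""
    A = torch.exp(H @ H.t() / tau)            # (n, n) pairwise similarities
    D_inv = torch.diag(1.0 / A.sum(dim=1))    # inverse of the diagonal matrix of row sums
    return D_inv @ A, A @ D_inv

def contranorm_update(H, s=1.0, tau=1.0, term="D^-1 A"):
    """Pre-LayerNorm update of Eq. (6); keeping only the D^{-1} A term is the stop-gradient variant."""
    T1, T2 = contranorm_terms(H, tau)
    P = {"D^-1 A": T1, "A D^-1": T2, "both": T1 + T2}[term]
    return H - (s / tau) * P @ H
```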

B.2 Comparing ContraNorm to Uniformity Regularization

We conduct a comparative experiment on the BERT model between directly applying a uniformity loss and using our proposed ContraNorm. Specifically, we add the uniformity loss ($loss_{\rm{uni}}$) to the classification loss (MSELoss, CrossEntropyLoss, or BCELoss depending on the task type, denoted by $loss_{\rm{cls}}$). Formally, the total loss is written as

loss_{\rm{total}}=loss_{\rm{cls}}+loss_{\rm{uni}},

where $loss_{\rm{uni}}=\sum_{i=1}^{N}\log\sum_{j=1}^{N}\exp(f(\mathbf{x}_{i})^{T}f(\mathbf{x}_{j})/\tau)$ and $N$ is the number of samples in the batch. We tune $\tau$ over $\{0.5,0.8,1.0,1.2,1.5\}$ and choose the best value in terms of average performance. Other hyperparameters are kept the same as in the ContraNorm setting. The results are shown in Table 8.

Table 8: Results comparison on validation set of GLUE tasks. Following Devlin et al. (2018), we report F1 scores for QQP and MRPC, Spearman correlations for STS-B, and accuracy scores for the other tasks. Avg denotes the average performance on all the tasks. For each task, the best performance is bolded.
Dataset COLA SST-2 MRPC QQP STS-B MNLI-m MNLI-mm QNLI RTE Avg
BERT-base 55.28 92.89 88.96 88.24 88.61 84.65 84.61 91.51 68.59 82.59
BERT + Uniformity Loss 58.08 93.00 89.46 88.14 88.69 84.45 84.43 91.60 68.59 82.94
BERT + ContraNorm 58.83 93.12 89.49 88.30 88.66 84.87 84.66 91.78 70.76 83.39

We can see that +ContraNorm obtains the best score on 8 of 9 tasks, while +Uniformity Loss is best on only 1 of 9 tasks. ContraNorm also has the highest average score across all tasks. The reason is that optimizing the total loss jointly pursues correct classification and uniform distribution: a lower $loss_{\rm{total}}$ may be caused only by a lower classification loss while the uniformity loss stays the same, which does not ensure a more uniform distribution of representations. In contrast, ContraNorm acts directly on the representations in each layer and enforces the uniformity property.

In fact, many GNN methods, such as Yang et al. (2021) and Zhu et al. (2021), design the propagation mechanism under the guidance of a corresponding objective. The well-designed propagation mechanism is shown to be the most fundamental part of GNNs (Zhu et al., 2021). Instead of directly optimizing the loss function, these methods translate the loss function into a specific propagation rule and achieve superior performance, which indicates that changing the network can be more effective than directly adding the objective to the loss function.
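For completeness, a minimal sketch of the uniformity-regularization baseline of Table 8 is given below; the variable names are placeholders, and the classification loss depends on the task type as described above.

```python
import torch

def uniformity_loss(features, tau=1.0):
    """loss_uni = sum_i log sum_j exp(f(x_i)^T f(x_j) / tau) over a batch of pooled features."""
    sim = features @ features.t() / tau        # (N, N) pairwise similarities
    return torch.logsumexp(sim, dim=1).sum()

# baseline objective: loss_total = loss_cls + uniformity_loss(pooled_features, tau)
```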

B.3 Analysis of the Plugging Position of ContraNorm

We explore two ways to integrate ContraNorm into BERT and ALBERT. Concretely, consider the update of ${\bm{H}}^{(l)}$ in the $l$-th block of a Transformer:

{\bm{H}}^{(l)}=\mbox{MultiHeadAttention}({\bm{H}}^{(l)}), \qquad (14)
{\bm{H}}^{(l)}={\bm{H}}^{(l)}+{\bm{X}}, \qquad (15)
{\bm{H}}^{(l)}=\mbox{LayerNorm}({\bm{H}}^{(l)}), \qquad (16)

where ${\bm{X}}={\bm{H}}_{b}$ is the input tensor.

We consider two positions: 1) between Eq. (14) and Eq. (15), named before-residual; and 2) between Eq. (15) and Eq. (16), named after-residual. The performance comparison between the two positions on the GLUE datasets is shown in Table 9. Putting ContraNorm after the residual connection slightly outperforms putting it before the residual, so we choose the after-residual variant as our default ContraNorm.

Table 9: Results comparison on different plugging positions of ContraNorm. Experiments are evaluated on the validation set of GLUE tasks. Avg denotes the average performance on all the tasks. The best result for each task is bolded.
Dataset COLA SST-2 MRPC QQP STS-B MNLI-m MNLI-mm QNLI RTE Avg
BERT-base 55.28 92.89 88.96 88.24 88.61 84.65 84.61 91.51 68.59 82.59
before-residual 59.57 93.12 89.97 88.30 88.93 84.84 84.67 91.58 68.95 83.33
after-residual 58.83 93.12 89.49 88.30 88.66 84.87 84.66 91.78 70.76 83.39
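The following sketch illustrates the two plugging positions within a Transformer block following Eqs. (14)-(16); the ContraNorm form here reuses the assumed Gram-matrix sketch from Appendix A.1 rather than the exact Eq. (9) implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contranorm(H, s=0.1, tau=1.0):
    # assumed Gram-matrix form of ContraNorm, applied token-wise within each sequence
    attn = torch.softmax(H @ H.transpose(-2, -1), dim=-1)
    return F.layer_norm(H - (s / tau) * attn @ H, H.shape[-1:])

class BlockWithContraNorm(nn.Module):
    """Transformer block of Eqs. (14)-(16) with the before-/after-residual insertion points."""

    def __init__(self, dim, heads, position="after-residual"):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.position = position

    def forward(self, X):
        H, _ = self.attn(X, X, X)             # Eq. (14)
        if self.position == "before-residual":
            H = contranorm(H)                 # between Eq. (14) and Eq. (15)
        H = H + X                             # Eq. (15)
        if self.position == "after-residual":
            H = contranorm(H)                 # between Eq. (15) and Eq. (16)
        return self.norm(H)                   # Eq. (16)
```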

Appendix C More Experimental Details

Appendix D Introduction of Graph Datasets

Here, we introduce the properties of the graph datasets discussed in this work.

Citation Network. Cora and CiteSeer are two popular citation graph datasets. In these graphs, nodes represent papers and edges correspond to citation relationships between papers. Nodes are classified according to academic topics.

Wikipedia Network. Chameleon and Squirrel are Wikipedia page networks on specific topics, where nodes represent web pages and edges are mutual links between them. Node features are bag-of-words representations of informative nouns. Nodes are classified into several categories according to the average monthly traffic of the page.

Statistics of the datasets are shown in Table 10.

Table 10: Statistics of the graph datasets.
Category Dataset # Nodes # Edges # Features Degree # Classes
Citation Cora 2708 5278 1433 4.90 7
CiteSeer 3327 4552 3703 3.77 6
Wikipedia Chameleon 2277 36101 500 5.0 6
Squirrel 5201 217073 2089 154.0 4

D.1 Metrics for feature and attention map cosine similarity

Following Wang et al. (2022), given a feature map ${\bm{H}}\in\mathbb{R}^{N\times D}$ and the attention map of the $h$-th head ${\bm{A}}^{(h)}\in\mathbb{R}^{N\times N}$, with batch size $N$ and hidden embedding size $D$, the feature cosine similarity and the attention cosine similarity are computed as

\mbox{sim}_{f}=\frac{2}{N(N-1)}\sum_{i,j>i}\frac{{\bm{h}}_{i}^{\top}{\bm{h}}_{j}}{\|{\bm{h}}_{i}\|_{2}\|{\bm{h}}_{j}\|_{2}},\quad\mbox{sim}_{\rm{attn}}=\frac{2}{N(N-1)H}\sum_{i,j>i}\frac{{{\bm{a}}^{(h)}_{i}}^{\top}{\bm{a}}^{(h)}_{j}}{\|{\bm{a}}^{(h)}_{i}\|_{2}\|{\bm{a}}^{(h)}_{j}\|_{2}},

where ${\bm{h}}_{i}$ denotes the $i$-th row of ${\bm{H}}$, ${\bm{a}}^{(h)}_{i}$ is the $i$-th column of ${\bm{A}}^{(h)}$, and $H$ is the number of attention heads (the attention similarity is averaged over all heads).
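A minimal PyTorch sketch of the two metrics is given below; we average the attention similarity over all heads, consistent with the $1/H$ factor above.

```python
import torch
import torch.nn.functional as F

def feature_cosine_similarity(H):
    """Average pairwise cosine similarity between the rows of H with shape (N, D)."""
    Hn = F.normalize(H, dim=-1)
    sim = Hn @ Hn.t()                                       # (N, N) cosine similarities
    N = H.shape[0]
    mask = torch.triu(torch.ones(N, N), diagonal=1).bool()  # pairs with j > i
    return sim[mask].mean()

def attention_cosine_similarity(A):
    """Average pairwise cosine similarity between attention columns, averaged over heads.

    A: (num_heads, N, N) attention maps; column a_i^(h) plays the role of a feature vector.
    """
    cols = F.normalize(A.transpose(-2, -1), dim=-1)         # row k now holds column k of A^(h)
    sim = cols @ cols.transpose(-2, -1)                     # (num_heads, N, N)
    N = A.shape[-1]
    mask = torch.triu(torch.ones(N, N), diagonal=1).bool()
    return sim[:, mask].mean()
```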

Appendix E Additional Experimental Results

E.1 Results on the test set of GLUE datasets

We submit our trained models to the GLUE benchmark leaderboard, and the resulting test-set performance is shown in Table 11.

Table 11: Results comparison on test set of GLUE tasks. Avg denotes the average performance on all the tasks.
Dataset COLA SST-2 MRPC QQP STS-B MNLI-m MNLI-mm QNLI RTE Avg
BERT-base 53.3 92.8 86.8 71.2 82.8 84.4 83.2 90.9 66.3 79.1
BERT + ContraNorm 54.5 93.0 87.9 71.4 83.0 84.5 83.4 91.0 66.9 79.5

E.2 More results of the Ablation Study

We conduct experiments using BERT+ContraNorm with a varying scale factor $s$ on the GLUE datasets. For each dataset, we vary the normalization scale around the best choice while keeping the other parameters fixed. As shown in Figure 5, the model with ContraNorm is robust within an appropriate range of the scale factor.

Figure 5: Performance when varying the scale factor $s$ on the GLUE datasets; panels: (a) RTE, (b) CoLA, (c) STS-B, (d) MRPC, (e) SST-2, (f) QNLI, (g) MNLI-m, (h) MNLI-mm, (i) QQP. We choose BERT as the base model. The varying range is an interval containing the best choice of $s$; here we fix it to $[0.005,0.1]$ for all tasks.

Appendix F Omitted Proofs

Here, we present the complete proofs of the propositions in Section 3.4.

First, we give two lemmas that will be useful for the proof.

Lemma F.1.

Denote ${\bm{e}}=[1,1,\dots,1]^{\top}/\sqrt{n}$. For any update of the form ${\bm{X}}_{1}={\bm{P}}{\bm{X}}_{0}$ and any $\lambda>0$, if the eigenvalues of $({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})-\lambda{\bm{P}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{P}}$ are all not larger than zero, then we have $\mathrm{Var}({\bm{X}}_{1})\geq\lambda^{-1}\cdot\mathrm{Var}({\bm{X}}_{0})$.

Proof.

We denote by ${\bm{X}}_{0}{\bm{X}}_{0}^{\top}={\bm{Y}}\,\text{diag}(\omega_{1},\dots,\omega_{n})\,{\bm{Y}}^{\top}$ the eigendecomposition of ${\bm{X}}_{0}{\bm{X}}_{0}^{\top}$, where ${\bm{Y}}=[{\bm{y}}_{1},{\bm{y}}_{2},\dots,{\bm{y}}_{n}]$ is an orthogonal basis and all $\omega_{i}\geq 0$. Note that $({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})^{2}={\bm{I}}-{\bm{e}}{\bm{e}}^{\top}$. Therefore,

\begin{aligned}
\mathrm{Var}({\bm{X}}_{0})&=\|({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{X}}_{0}\|_{F}^{2}\\
&=\mathrm{tr}\{({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{X}}_{0}{\bm{X}}_{0}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})\}\\
&=\mathrm{tr}\{({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{Y}}\,\text{diag}(\omega_{1},\dots,\omega_{n})\,{\bm{Y}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})\}\\
&=\mathrm{tr}\{\text{diag}(\omega_{1},\dots,\omega_{n}){\bm{Y}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{Y}}\}\\
&=\mathrm{tr}\{\text{diag}(\omega_{1},\dots,\omega_{n}){\bm{Y}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{Y}}\}\\
&=\sum_{i=1}^{n}\omega_{i}\,{\bm{y}}_{i}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{y}}_{i}. \qquad (17)
\end{aligned}

Similarly, we have

\mathrm{Var}({\bm{X}}_{1})=\sum_{i=1}^{n}\omega_{i}\,{\bm{y}}_{i}^{\top}{\bm{P}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{P}}{\bm{y}}_{i}. \qquad (18)

Therefore, we have

\mathrm{Var}({\bm{X}}_{0})-\lambda\mathrm{Var}({\bm{X}}_{1})=\sum_{i=1}^{n}\omega_{i}\,{\bm{y}}_{i}^{\top}\{({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})-\lambda{\bm{P}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{P}}\}{\bm{y}}_{i}. \qquad (19)

Thus, if the eigenvalues of $({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})-\lambda{\bm{P}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{P}}:=\bm{\Sigma}$ are all not larger than zero, i.e., $\bm{\Sigma}$ is negative semi-definite, then we have

{\bm{y}}_{i}^{\top}\{({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})-\lambda{\bm{P}}^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top}){\bm{P}}\}{\bm{y}}_{i}\leq 0, \qquad (20)

which implies that $\mathrm{Var}({\bm{X}}_{0})-\lambda\mathrm{Var}({\bm{X}}_{1})\leq 0$. Therefore, $\mathrm{Var}({\bm{X}}_{1})\geq\lambda^{-1}\cdot\mathrm{Var}({\bm{X}}_{0})$. ∎

The second lemma is from Eq. (13) in Fulton (2000).

Lemma F.2.

Let $A,B,C$ be symmetric matrices with $C=A+B$. Suppose the eigenvalues of $A$ are $\alpha_{1}\geq\alpha_{2}\geq\dots\geq\alpha_{n}$, the eigenvalues of $B$ are $\beta_{1}\geq\beta_{2}\geq\dots\geq\beta_{n}$, and the eigenvalues of $C$ are $\gamma_{1}\geq\gamma_{2}\geq\dots\geq\gamma_{n}$. Then we have the inequality

\max_{i+j=n+k}\alpha_{i}+\beta_{j}\leq\gamma_{k}\leq\min_{i+j=k+1}\alpha_{i}+\beta_{j}. \qquad (21)

We can now start to prove Proposition 1.

Proposition 1.

Let ${\bm{e}}=(1,1,\dots,1)^{\top}/\sqrt{n}$. For the attention matrix $\bar{{\bm{A}}}=\operatorname{softmax}({\bm{H}}_{b}{\bm{H}}_{b}^{\top})$, let $\sigma_{\min}$ be the smallest eigenvalue of ${\bm{P}}=({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}})+({\bm{I}}-\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})$. For the ContraNorm update ${\bm{H}}_{t}=((1+s){\bm{I}}-s\bar{{\bm{A}}}){\bm{H}}_{b}$ with $s>0$, we have $\mathrm{Var}({\bm{H}}_{t})\geq(1-s\sigma_{\min})^{-1}\cdot\mathrm{Var}({\bm{H}}_{b})$. In particular, if $\sigma_{\min}\geq 0$, we have $\mathrm{Var}({\bm{H}}_{t})\geq\mathrm{Var}({\bm{H}}_{b})$.

Proof.

We denote $\Sigma=({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})-\lambda((1+s){\bm{I}}-s\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})((1+s){\bm{I}}-s\bar{{\bm{A}}})$. Then,

\begin{aligned}
\Sigma&=(1-\lambda){\bm{I}}-\lambda\big(s({\bm{I}}-\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})+s({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}})+s^{2}({\bm{I}}-\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}})\big)\\
&=(1-\lambda){\bm{I}}-\lambda s{\bm{P}}-\lambda s^{2}({\bm{I}}-\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}}). \qquad (22)
\end{aligned}

Let $\alpha_{1}\geq\alpha_{2}\geq\dots\geq\alpha_{n}$ be the eigenvalues of $\Sigma$. Since ${\bm{I}}-{\bm{e}}{\bm{e}}^{\top}$ has one eigenvalue equal to 0 and $n-1$ eigenvalues equal to 1, it is positive semi-definite, and thus $s^{2}({\bm{I}}-\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}})$ is also positive semi-definite. Notice that the largest eigenvalue of $(1-\lambda){\bm{I}}-s{\bm{P}}$ is $(1-\lambda)-s\sigma_{\min}$ and the largest eigenvalue of $-s^{2}({\bm{I}}-\bar{{\bm{A}}})^{\top}({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}})$ is 0. Therefore, by Lemma F.2, the largest eigenvalue of $\Sigma$ is at most $(1-\lambda)-s\sigma_{\min}$. Letting $\lambda=1-s\sigma_{\min}$, the largest eigenvalue of $\Sigma$ is at most 0. By Lemma F.1, we have $\mathrm{Var}({\bm{H}}_{t})\geq(1-s\sigma_{\min})^{-1}\cdot\mathrm{Var}({\bm{H}}_{b})$.

Moreover, if $\sigma_{\min}\geq 0$, then $(1-s\sigma_{\min})^{-1}\geq 1$, leading to $\mathrm{Var}({\bm{H}}_{t})\geq\mathrm{Var}({\bm{H}}_{b})$. ∎

Remark. We now discuss some sufficient conditions for $\sigma_{\min}\geq 0$. If ${\bm{P}}$ is diagonally dominant, then $\sigma_{\min}\geq 0$. Denote $a_{ij}=\bar{A}_{ij}$, $b_{ij}=\delta_{ij}-a_{ij}$, and ${\bm{Q}}=({\bm{I}}-{\bm{e}}{\bm{e}}^{\top})({\bm{I}}-\bar{{\bm{A}}})$, where $\delta_{ij}=1$ if $i=j$ and $\delta_{ij}=0$ if $i\neq j$. Then we have

{\bm{Q}}_{ij}=b_{ij}-\frac{1}{n}\sum_{k=1}^{n}b_{kj}. \qquad (23)

If $\sum_{k=1}^{n}a_{kj}\leq 1+na_{ij}$ for any $i,j$, then we have

b_{jj}\geq\frac{\sum_{k=1}^{n}b_{kj}}{n}\geq b_{ij},\quad i\neq j. \qquad (24)

Noticing that $\sum_{k=1}^{n}b_{kj}=0$, we have

|Q_{jj}|=\Big|\sum_{k\neq j}Q_{kj}\Big|. \qquad (25)

Since $\bar{A}$ is an attention matrix, we also have

|Q_{jj}|=\Big|\sum_{k\neq j}Q_{jk}\Big|. \qquad (26)

Therefore, we have

\begin{aligned}
|{\bm{P}}_{jj}|&=2|{\bm{Q}}_{jj}|\\
&=\Big|\sum_{k\neq j}Q_{kj}\Big|+\Big|\sum_{k\neq j}Q_{jk}\Big|\\
&\geq\Big|\sum_{k\neq j}Q_{kj}+\sum_{k\neq j}Q_{jk}\Big|\\
&\geq\Big|\sum_{k\neq j}{\bm{P}}_{kj}\Big|, \qquad (27)
\end{aligned}

which indicates that ${\bm{P}}$ is diagonally dominant, and thus $\sigma_{\min}\geq 0$. Therefore, $\sum_{k=1}^{n}a_{kj}\leq 1+na_{ij}$ for any $i,j$ is just a sufficient condition. A special case satisfying it is $\sum_{k=1}^{n}a_{kj}=1$ for all $j$.

We now move on to the proof of Proposition 2. We first give a lemma on the property of effective rank.

Lemma F.3.

Let the eigenvalues of $AA^{\top}$ be $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{n}$ and the eigenvalues of $BB^{\top}$ be $\sigma_{1}\geq\sigma_{2}\geq\dots\geq\sigma_{n}$. If $\sigma_{i}/\lambda_{i}$ is increasing in $i$, then $\operatorname{erank}(B)\geq\operatorname{erank}(A)$.

This lemma can be proved directly from the definition of the effective rank.

Proposition 2.

Consider the update form

{\bm{H}}_{t}=(1+s){\bm{H}}_{b}-s({\bm{H}}_{b}{\bm{H}}_{b}^{\top}){\bm{H}}_{b}, \qquad (28)

and let $\sigma_{\max}$ be the largest singular value of ${\bm{H}}_{b}$. For $s>0$ satisfying $1+(1-\sigma^{2}_{\max})s>0$, we have $\operatorname{erank}({\bm{H}}_{t})>\operatorname{erank}({\bm{H}}_{b})$.

Proof.

We have

{\bm{H}}_{t}=((1+s){\bm{I}}-s{\bm{H}}_{b}{\bm{H}}_{b}^{\top}){\bm{H}}_{b}. \qquad (29)

Therefore,

{\bm{H}}_{t}{\bm{H}}_{t}^{\top}=s^{2}({\bm{H}}_{b}{\bm{H}}_{b}^{\top})^{3}-2s(1+s)({\bm{H}}_{b}{\bm{H}}_{b}^{\top})^{2}+(1+s)^{2}({\bm{H}}_{b}{\bm{H}}_{b}^{\top}). \qquad (30)

Suppose $\lambda_{1}\geq\lambda_{2}\geq\dots\geq\lambda_{n}$ are the eigenvalues of ${\bm{H}}_{b}{\bm{H}}_{b}^{\top}$. Then ${\bm{H}}_{t}{\bm{H}}_{t}^{\top}$ has the same eigenvectors as ${\bm{H}}_{b}{\bm{H}}_{b}^{\top}$, and its eigenvalues are $(\lambda_{i}s-(1+s))^{2}\lambda_{i}$. Since $s$ satisfies $1+(1-\sigma^{2}_{\max})s>0$, we have $1+s>\lambda_{i}s$. Therefore, $(\lambda_{i}s-(1+s))^{2}$ is increasing as $i$ increases, which yields $\operatorname{erank}({\bm{H}}_{t})>\operatorname{erank}({\bm{H}}_{b})$ by Lemma F.3. ∎
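As a quick numerical sanity check of Proposition 2 (not part of the proof), the snippet below compares the effective ranks before and after the update of Eq. (28) on random features, assuming the effective rank is defined as the exponential of the entropy of the normalized singular value distribution.

```python
import torch

def erank(H):
    """Effective rank: exp of the entropy of the normalized singular values (assumed definition)."""
    sv = torch.linalg.svdvals(H)
    p = sv / sv.sum()
    return torch.exp(-(p * torch.log(p)).sum())

torch.manual_seed(0)
H_b = torch.randn(64, 32)
H_b = H_b / torch.linalg.matrix_norm(H_b, ord=2)   # rescale so sigma_max = 1; then 1 + (1 - sigma_max^2)s > 0 for any s > 0
s = 0.5
H_t = (1 + s) * H_b - s * (H_b @ H_b.t()) @ H_b    # the update of Eq. (28)
print(float(erank(H_b)), float(erank(H_t)))        # the second value should be strictly larger
```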