
GATE: Graph CCA for Temporal SElf-supervised Learning for Label-efficient fMRI Analysis

Liang Peng, Nan Wang, Jie Xu, Xiaofeng Zhu, and Xiaoxiao Li. L. Peng and N. Wang contributed equally to this paper. L. Peng and J. Xu are with the Center for Future Media and the Department of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China. N. Wang is with the School of Computer Science and Engineering, East China Normal University, Shanghai 200062, China, and the University of British Columbia, Vancouver, BC V6T 1Z4, Canada. X. Zhu is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610056, China, and also with the Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, China. X. Li is with Electrical and Computer Engineering, the University of British Columbia, Vancouver, BC V6T 1Z4, Canada (e-mail: [email protected]).
Abstract

In this work, we focus on a challenging task: neuro-disease classification using functional magnetic resonance imaging (fMRI). In population graph-based disease analysis, graph convolutional neural networks (GCNs) have achieved remarkable success. However, these achievements are inseparable from abundant labeled data and are sensitive to spurious signals. To improve fMRI representation learning and classification under a label-efficient setting, we propose a novel, theory-driven self-supervised learning (SSL) framework on GCNs, namely Graph CCA for Temporal sElf-supervised learning on fMRI analysis (GATE). Concretely, it is demanding to design a suitable and effective SSL strategy to extract informative and robust features from fMRI. To this end, we investigate several new graph augmentation strategies from fMRI dynamic functional connectivity (FC) for SSL training. Further, we leverage canonical correlation analysis (CCA) on different temporal embeddings and present the theoretical implications. Consequently, this yields a novel two-step GCN learning procedure comprised of (i) SSL on an unlabeled fMRI population graph and (ii) fine-tuning on a small labeled fMRI dataset for a classification task. Our method is tested on two independent fMRI datasets, demonstrating superior performance on autism and dementia diagnosis. Our code is available at https://github.com/LarryUESTC/GATE.

Index Terms:
Graph Convolutional Network, fMRI analysis, Label-efficient Learning, Self-supervised Learning

I Introduction

As brain Functional Connectivity (FC) derived from functional Magnetic Resonance Imaging (fMRI) can capture abnormal brain functional activity [1], it has been widely applied in disease diagnosis. Recently, to quantify changes in brain connectivity over time, many studies apply the sliding window method to extract dynamic FC matrices from Blood-Oxygen-Level-Dependent (BOLD) signals [2]. The FC representations are then used for disease diagnosis through machine learning methods.

With the success of machine learning, deep learning methods have made breakthroughs in fMRI classification, such as Convolutional Neural Networks (CNNs) [3], Recurrent Neural Networks (RNNs) [4], and Graph Convolutional neural Networks (GCNs) [5]. Indeed, GCNs are widely used for disease diagnosis [5, 6], because biomarkers shared across subjects are critical for recognizing the common patterns associated with diseases. Previous studies have explored the development of GCNs for various prediction tasks on fMRI data [5, 7, 3, 8]. For example, Parisot et al. [5] apply a vanilla GCN for supervised disease prediction with fMRI data. However, these studies depend heavily on large-scale labeled/annotated data to obtain promising results. This condition is usually not satisfied in clinical practice due to the costly and complex annotation process. For example, the annotation of autism samples requires a doctor to score the child's behavior and developmental history (typically collected over months to years) following a complex protocol [9].

To leverage unlabeled data and assist representation learning in the label-efficient setting [10] (i.e., with only a small portion of labeled data), Self-Supervised Learning (SSL) has emerged as a powerful approach to unsupervised learning [11]. Finding a suitable SSL strategy for fMRI signals is essential. Existing SSL strategies can be broadly divided into three categories: contrastive-based SSL, reconstruction-based SSL, and similarity-based SSL. Contrastive-based SSL [12, 13, 6] requires selecting diverse negative samples to form a contrastive loss (e.g., InfoNCE loss [14] and triplet loss [15]), which is difficult for disease classification using fMRI due to the limited number of samples and the small number of classes. Reconstruction-based SSL [16, 17] uses an encoder-decoder structure to reconstruct the input, but it is not practical because fitting low signal-to-noise-ratio fMRI features may lead to overfitting to spurious features [16]. Therefore, we focus on similarity-based SSL [18, 19], which enforces the similarity of multiple views of the same data (e.g., data and its augmentation) and provides the best practical value for assisting node classification on a graph with unlabeled fMRI data.

Figure 1: Illustration. This work focuses on leveraging a self-supervised learning (SSL) strategy to improve fMRI prediction performance on a population graph with limited labels (left), where each node indicates a subject. Empirical studies (right; also see Sec. V) show that SSL achieves higher testing accuracy than vanilla training when only a small percentage of the data on the graph is labeled.

However, the promising similarity-based SSL strategy brings two unique challenges for GCN-based fMRI analysis, which have been unexplored so far:

Challenge 1: How to find suitable augmentations for fMRI analysis to generate different views from one BOLD signal?
Modern SSL methods rely heavily on applying various data augmentations to create different views of the sample [20]. As shown by Tian et al. [21], SSL methods take advantage of maximizing mutual information between the different views of the sample to generate discriminative representations/embeddings. However, not all data augmentations have a positive effect on SSL. As suggested by Wang et al. [22], a suitable/helpful data augmentation for SSL should satisfy certain principles. Moreover, as suggested by [23], the augmentations should reduce the correlation between the spurious features and target labels.

Challenge 2: How to design the corresponding consistency loss for SSL training on fMRI analysis?
The consistency among correlated signals should be maximized. As a classical multi-variate analysis method, Canonical Correlation Analysis (CCA) [24] has been widely used in fMRI analysis [25] and aims to maximize the correlation between two representations.

To address the aforementioned challenges, we propose Graph CCA for Temporal sElf-supervised learning on fMRI analysis (GATE), which achieves promising results with the assistance of pre-training on unlabeled fMRI data and fine-tuning on a small amount of labeled data (see Fig. 1). Specifically, as shown in Fig. 2, we first develop an augmentation strategy for fMRI analysis that generates two related views from the BOLD signals. The main motivation is to capture the critical information related to diseases from two diverse views using SSL techniques. With these two diverse views, we employ a GCN encoder to obtain their embedding matrices, which extract the associations among subjects. Furthermore, a CCA-based objective function is used to maximize the correlation of the representations from the two views. The novelties and contributions of this work can be summarized as follows:

  • We propose a novel SSL method (GATE) on fMRI data, which is an effective and versatile framework for learning on label-efficient datasets. This could bring GCN-based methods from research to clinical applications where labels are difficult to collect.

  • The proposed GATE can tackle the spurious factors in dynamic FC analysis by developing a GCN-based CCA regularization with the designed multi-view temporal augmentation strategy on BOLD signals.

  • We conduct a theoretical discussion to support our motivation and prove the critical implication of how GATE assists learning on label-efficient data.

  • The comprehensive comparison experiments demonstrate that GATE achieves state-of-the-art performance under the label-efficient setting. We also conduct extensive ablation experiments to discuss key components of our design and algorithm.

II Related work

Figure 2: The flowchart of the proposed method (GATE) for SSL-based dynamic FC representation learning. (a) Obtain BOLD signals by data preprocessing with the AAL template. (b) Two views $G^{a}$ and $G^{b}$ are randomly generated from the BOLD signals by our augmentation strategy (i.e., S-A and M-A). (c) GATE employs the GCN encoder to obtain the embeddings $\mathbf{Z}^{a}$ and $\mathbf{Z}^{b}$ of these two views. (d) The consistency of the embeddings is regularized by optimizing the CCA-based loss (i.e., Eq. (2)).

II-A Disease Prediction on fMRI data

Medical imaging techniques such as Computed Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and X-ray have been employed for the diagnosis and early detection of diseases. For example, Kam et al. utilized multiple convolutional neural networks for early mild cognitive impairment diagnosis [26]. In neuroscience, MRI is widely used to study neurological disorders and can be divided into structural MRI and functional MRI depending on the type of connectivity detected. In structural MRI, nodes are defined by anatomical connections between regions or brain tissues, while edges are constructed from the topology between them. For instance, Yao et al. presented a triplet GNN model for disease diagnosis on structural MRI data [27]. In functional MRI (fMRI), nodes represent functional regions of the brain, while edges are constructed from correlations between their activities. For example, Wang et al. proposed a similarity-driven multi-view model for autism spectrum disorder diagnosis on fMRI data [28]. In fact, analyzing fMRI has become a prevalent way to exploit the functional organization of the human brain, as shifts in human attention (e.g., during a stimulus or cognitive task) are associated with systematic changes in the activity of functional brain areas [29, 30].

There is growing evidence that functional connectivity may show dynamic changes over short periods of time [31, 32, 33]. Previous works have demonstrated that the use of sliding windows is beneficial for dynamic connectivity analysis on fMRI data, since the resting state reflects different physiological states, such as mind wandering and muscle fatigue. For example, Wang et al. first used sliding windows to divide the rs-fMRI time series into multiple segments and then proposed a spatio-temporal convolutional-recursive neural network [34]. Mao et al. analyzed dynamic functional differences between men and women based on sliding window measures and community detection [32]. Yao et al. proposed a temporal adaptive graph convolutional network to exploit the topological information and to model multi-level semantic information within the entire time series [35]. Yu et al. [36] presented a Transformer-based method to infer functional networks in both spatial and temporal space in a self-supervised manner.

II-B GCNs for disease prediction on fMRI data

Deep learning has revolutionized computer-aided diagnosis in the past decade [37, 38]. On this basis, deep GCNs take advantage of the relationships between nodes to find common patterns/biomarkers, and there has been an increasing focus on GCNs in neuroscience [3]. Previous GCN-based methods on fMRI data can be categorized into two subgroups depending on the definition of nodes in the graph, i.e., population graph-based models and brain region graph-based models [8]. These two subgroups correspond to two separate tasks for fMRI analysis: population graph-based models for the node classification task and brain region graph-based models for the graph classification task.

In population graph-based models, the nodes in the graph denote subjects and the edges represent the similarity between subjects. For example, Parisot et al. [5] exploit a GCN by representing the population as a sparse graph in which nodes are associated with imaging features and edge weights are constructed from phenotype information (e.g., age, gene, and sex of the subjects). Kazi et al. [39] propose a GCN model with multiple filter kernel sizes on a population graph.

In brain region graph-based models, the nodes denote anatomical brain regions and the edges represent functional or structural connections among these brain regions. For example, Li et al. [3] propose an interpretable GCN model on a brain region graph to understand which brain regions are related to a specific neurological disorder. Xing et al. [40] combine a GCN with a long short-term memory model on a brain region graph for disease diagnosis.

Although previous GCN methods achieve promising results on both population graphs and brain region graphs, existing methods are still limited by the need for large amounts of labeled data. Clearly, designing a GCN method for limited labeled fMRI data is an important yet unsolved challenge.

II-C Self-supervised learning

As a powerful unsupervised representation learning paradigm, SSL has recently been developed in the graph learning domain. Existing SSL methods can be broadly classified into three groups: contrastive-based SSL, reconstruction-based SSL, and similarity-based SSL.

Contrastive-based SSL methods enhance the similarity between the representations of two views (e.g., a global representation and a local representation) by manually constructing positive and negative sample pairs. For example, Deep Graph Infomax (DGI) [41] employs the InfoNCE [14] contrastive loss function to contrast local node representations and global graph representations. MVGRL [42] proposes a contrastive multi-view representation learning method by contrasting local and global embeddings from two views. However, contrastive-based SSL methods rely on negative samples, which is not suitable when the number of samples and the number of classes are small.

Reconstruction-based SSL methods learn a representation by solving reconstruction-based pretext tasks (e.g., image inpainting and recovering color channels). For example, He et al. [16] develop an SSL encoder-decoder model to reconstruct the original image from the latent representation and masked patches. Qiu et al. [17] propose a graph encoder-decoder model to reconstruct the relationships between nodes based on representation similarity. However, reconstruction-based SSL needs to reconstruct the input context (low-dimensional features are transformed back into high-dimensional features), which is impractical here because fitting low signal-to-noise-ratio fMRI features may result in overfitting to spurious features.

More recently, similarity-based SSL takes advantage of the coupling between multiple views of the same data (i.e., the input and its augmentation) to learn a good representation without labeled data. For example, Grill et al. [18] learn representations by encoding two augmented views through two distinct encoders. Although this technique avoids the limitations of selecting negative samples in contrastive-based SSL and the reconstruction loss in reconstruction-based SSL, a suitable strategy for generating multiple coupled views of fMRI data is still needed.

III Method

As shown in Fig. 2, we give an overview of our similarity-based SSL method. Specifically, the key components of the proposed GATE include three parts: (a) dynamic FC augmentation (Sec. III-A); (b) GCN encoder (Sec. III-B); and (c) objective function (Sec. III-C). Our method employs a two-step training procedure under label-efficient settings, i.e., step 1: unsupervised pre-training on unlabeled fMRI data, and step 2: fine-tuning on a small portion of labeled data for a specific prediction task. This section focuses on illustrating the novel SSL strategy used in step 1 and finally introduces step 2.

III-A Multi-view fMRI dynamic functional connectivity generation

The key idea of GATE is to ensure the consistency of two augmented views of an input. Specifically, by adopting similarity-based SSL, spurious factors can be mitigated by explicitly augmenting their variations. For example, if we train a model to distinguish an apple from a peach, the two views can be a "red apple" and a "yellow apple". Both views contain the main characteristics (i.e., shape) of an apple but vary on the spurious features (i.e., color).

Dynamic functional connectivity. To capture temporal variability, the entire fMRI time course is divided into multiple sub-segments by the sliding window method [2]. Denote the BOLD signals of the $i$-th subject as $\mathbf{S}_{i}\in\mathbb{R}^{R\times T}$, with $R$ being the number of brain Regions-Of-Interest (ROIs) in the fMRI images and $T$ being the length of the entire segment. Then, we construct the FC matrix $\mathbf{F}_{i}\in\mathbb{R}^{R\times R}$ by calculating the Pearson's correlation [43] between the matched BOLD segments of the paired ROIs. Last, we flatten the upper triangle of the FC matrix to represent the fMRI features $\mathbf{x}_{i}$ of the $i$-th subject. To construct the population graph $G=\{\mathbf{X},\mathbf{A}\}$ for fMRI analysis [5], where $\mathbf{X}$ represents the subject features and $\mathbf{A}$ represents similarities between subjects, we extract temporal fMRI BOLD signals via the constructed FCs as the node features, and follow Parisot et al. [5] to obtain the initial graph $\mathbf{A}$ via $k$NN from the subject features.
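To make this pipeline concrete, the following is a minimal Python/NumPy sketch (not the authors' released code) of dynamic FC feature extraction for one subject and a kNN population graph over subject features; the window size, step, $k$, and the cosine similarity are illustrative choices, and whether the FC diagonal is kept is an implementation detail.

```python
# Sketch only: dynamic FC features from a BOLD matrix and a kNN subject graph.
import numpy as np

def dynamic_fc_features(bold, window=30, step=15):
    """bold: (R, T) ROI time series; returns a list of flattened upper-triangle FC vectors."""
    R, T = bold.shape
    feats = []
    for start in range(0, T - window + 1, step):
        segment = bold[:, start:start + window]      # (R, window) BOLD sub-segment
        fc = np.corrcoef(segment)                    # (R, R) Pearson FC matrix
        iu = np.triu_indices(R, k=1)                 # upper triangle (diagonal excluded here)
        feats.append(fc[iu])
    return feats

def knn_graph(X, k=10):
    """Cosine-similarity kNN adjacency over the subject feature matrix X of shape (N, d)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    S = Xn @ Xn.T
    A = np.zeros_like(S)
    idx = np.argsort(-S, axis=1)[:, 1:k + 1]         # top-k neighbours, excluding self
    for i, nbrs in enumerate(idx):
        A[i, nbrs] = S[i, nbrs]
    return np.maximum(A, A.T)                        # symmetrize
```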

Existing literature indicates that dynamic FC-based fMRI analysis is sensitive to the choice of the sliding window size $L$ [44, 45]. Therefore, we aim to reduce the effect of perturbations in $L$, i.e., the spurious features in dynamic FC. Although brain FC is not static, a short sub-segment and its surrounding sub-segments should contain the same characteristics (e.g., FC patterns) associated with neurological disease [46, 31]. Thus, we propose to augment fMRI data along its temporal domain with sliding windows to obtain two relevant views ($G^{a}=\{\mathbf{X}^{a},\mathbf{A}^{a}\}$ and $G^{b}=\{\mathbf{X}^{b},\mathbf{A}^{b}\}$) by step window augmentation (S-A) and multi-scale window augmentation (M-A).

Step window augmentation (S-A). As shown in Fig. 2 (a), S-A takes two neighboring sliding windows as two related views ($G^{a}$ and $G^{b}$). In this case, the raw BOLD signals $\mathbf{S}_{i}$ of the $i$-th subject are divided into $M$ sub-segments $\{\mathbf{S}_{i}^{1},\dots,\mathbf{S}_{i}^{M}\}$ by setting the window size to $L$ and the step of the sliding window to $s$, resulting in $M=\lfloor\frac{T-L}{s}\rfloor+1$ sub-segments and the corresponding dynamic FCs. In each training iteration, S-A first randomly selects one sub-segment (e.g., the $m$-th sub-segment) for the first view $G^{a}=\{\mathbf{X}^{m},\mathbf{A}^{m}\}$, and then regards its neighboring sub-segment (e.g., the $(m\pm 1)$-th sub-segment) as the second view $G^{b}=\{\mathbf{X}^{m\pm 1},\mathbf{A}^{m\pm 1}\}$.

Multi-scale window augmentation (M-A). Considering that FCs generated with different window sizes contain related information, M-A takes two different scales of sliding windows as two related views. In this case, $\mathbf{S}_{i}$ is divided into $M$ sub-segments depending on the window size $L$. For the $m$-th sub-segment ($m\in[M]$), we sample BOLD signals using two different window sizes $l_{a}$ and $l_{b}$ to calculate FCs, forming two views of the graph, $G^{a}=\{\mathbf{X}^{m,l_{a}},\mathbf{A}^{m,l_{a}}\}$ and $G^{b}=\{\mathbf{X}^{m,l_{b}},\mathbf{A}^{m,l_{b}}\}$. These two views are fed into each training iteration of GATE, as shown in Fig. 2 (a).
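For clarity, here is a minimal sketch of how S-A and M-A could be realized on a single subject's BOLD matrix; the window size, step, and candidate scales follow the settings reported in Sec. V-A4, while the helper structure and random choices are illustrative assumptions. The returned segment pairs would then be turned into FC features and graphs, e.g., with the helpers from the previous sketch.

```python
# Sketch only: the two temporal augmentations on one (R, T) BOLD array.
import random

def sa_views(bold, window=30, step=15):
    """Step window augmentation: two neighbouring sliding-window sub-segments."""
    _, T = bold.shape
    starts = list(range(0, T - window + 1, step))
    m = random.randrange(len(starts))
    m2 = m + 1 if m + 1 < len(starts) else m - 1     # neighbouring sub-segment (m +/- 1)
    return (bold[:, starts[m]:starts[m] + window],
            bold[:, starts[m2]:starts[m2] + window])

def ma_views(bold, window=30, step=15, scales=(10, 20, 30, 40, 50)):
    """Multi-scale window augmentation: the same sub-segment sampled at two window sizes."""
    _, T = bold.shape
    start = random.choice(range(0, T - window + 1, step))
    la, lb = random.sample(list(scales), 2)          # two different window sizes l_a, l_b
    return (bold[:, start:min(start + la, T)],
            bold[:, start:min(start + lb, T)])
```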

It is worth mentioning that S-A and M-A can also be considered jointly, for example, by randomly selecting one of them in each training iteration. In addition, S-A and M-A can easily be combined with the commonly used random drop augmentation [47] (see Sec. V-C). Data augmentation plays an essential role in SSL; however, not all data augmentations have a positive effect on SSL [22]. Hence, it is important to investigate suitable data augmentations for SSL, considering the essential role of augmentations as well as the special properties and data structure of medical data compared with natural data. Compared with [48], our method operates directly on the raw data rather than on the graph representation. Furthermore, our proposed augmentation methods for generating multi-view fMRI data are a versatile solution that can be combined with any model architecture.

III-B Graph embedding

With the graph data from the two views ($G^{a}=\{\mathbf{X}^{a},\mathbf{A}^{a}\}$ and $G^{b}=\{\mathbf{X}^{b},\mathbf{A}^{b}\}$), we apply an encoder to learn the patterns of the data. Since GCN [49] and its variants have been widely used to capture the semantic information (i.e., $\mathbf{X}$) and the structural information (i.e., $\mathbf{A}$) in graph data, we adopt a GCN model as the encoder $f(\cdot)$ to obtain the embeddings of all subjects in each view. It is worth noting that the graph encoder $f(\cdot)$ is versatile, and other powerful graph learning encoders (e.g., GAT [50] and GIN [51]) can be applied. For fair comparison and easy application, we apply the commonly used GCN model as the encoder in this study. More specifically, the GCN operation on the $l$-th hidden layer is defined as

$$\mathbf{H}^{(l+1)}=\sigma\big(\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{H}^{(l)}\mathbf{\Theta}^{(l)}\big), \qquad (1)$$

where $\mathbf{D}$ is the diagonal degree matrix of $\mathbf{A}$, $\mathbf{H}^{(l)}$ is the output feature matrix of all subjects at the $l$-th layer, $\mathbf{\Theta}^{(l)}$ is the trainable weight matrix, and $\sigma(\cdot)$ is an activation function. The input features of Eq. (1) are the fMRI features (i.e., $\mathbf{X}$ obtained by S-A or M-A). To make the embeddings comparable, we use the same GCN encoder (weight sharing) to project the two views into the same embedding space. Consequently, we obtain the normalized embedding matrices of the two views, i.e., $\mathbf{Z}^{a}=f(\mathbf{X}^{a},\mathbf{A}^{a})$ and $\mathbf{Z}^{b}=f(\mathbf{X}^{b},\mathbf{A}^{b})$.
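As a concrete illustration, below is a minimal PyTorch sketch of a weight-sharing GCN encoder implementing Eq. (1); the single layer, ELU activation, and 256-dimensional embedding follow the implementation details in Sec. V-A4, while the dense adjacency handling and the row-wise L2 normalization of the output are our assumptions.

```python
# Sketch only: a one-layer GCN encoder for Eq. (1) with a dense adjacency matrix.
import torch
import torch.nn as nn

class GCNEncoder(nn.Module):
    def __init__(self, in_dim, hid_dim=256):
        super().__init__()
        self.theta = nn.Linear(in_dim, hid_dim, bias=False)   # Theta^{(l)}
        self.act = nn.ELU()

    @staticmethod
    def normalize(A):
        # D^{-1/2} A D^{-1/2} for a dense adjacency A of shape (N, N)
        deg = A.sum(dim=1)
        d_inv_sqrt = torch.pow(deg.clamp(min=1e-8), -0.5)
        return d_inv_sqrt.unsqueeze(1) * A * d_inv_sqrt.unsqueeze(0)

    def forward(self, X, A):
        H = self.act(self.normalize(A) @ self.theta(X))
        # one plausible reading of "normalized embedding matrices": unit-norm rows
        return nn.functional.normalize(H, dim=1)
```

The same encoder instance is applied to both views, which realizes the weight sharing described above.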

III-C Objective function

To optimize the model, contrastive-based SSL methods [41, 42] apply a contrastive loss on positive and negative pairs, which is not suitable with a limited number of samples and a small number of classes. Reconstruction-based SSL methods [16] apply a reconstruction loss (i.e., MSE) to reconstruct the input context, which is impractical here because fitting low signal-to-noise-ratio fMRI features may result in overfitting to spurious features. Therefore, GATE avoids sampling negative samples or reconstructing fMRI time series. Instead, we design novel augmentation methods together with a corresponding similarity loss to leverage unlabeled data, while preventing feature collapse in the embedding space. Once the embedding matrices of the two views are computed, the next step is to maximize the correlation of these two embedding matrices, as CCA does (see Sec. IV). To this end, we define an input-consistency regularization loss as

$$\mathcal{L}=-\frac{1}{N}\sum_{i=1}^{N}\frac{\left\langle\mathbf{z}^{a}_{i},\mathbf{z}^{b}_{i}\right\rangle}{\left\|\mathbf{z}^{a}_{i}\right\|\left\|\mathbf{z}^{b}_{i}\right\|}+\gamma\sum_{v=a,b}\|(\mathbf{Z}^{v})^{\top}\mathbf{Z}^{v}-\mathbf{I}\|_{F}^{2}, \qquad (2)$$

where $\left\langle\cdot,\cdot\right\rangle$ is the dot product, $\gamma$ is a trade-off coefficient, $v$ indexes the view (i.e., view $a$ or view $b$), and $\mathbf{Z}^{v}$ is the embedding matrix of that view. In Eq. (2), the first term can be regarded as a regularization of the embedded features, ensuring that the low-dimensional features maintain their representation capability. The second term ensures that the individual feature dimensions are uncorrelated, which avoids collapsed solutions, i.e., all outputs of the model being equal.
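A minimal PyTorch sketch of the loss in Eq. (2) is shown below; $\gamma=0.2$ follows the reported setting, and the exact normalization of the embedding matrices is left to the encoder.

```python
# Sketch only: invariance term (cosine similarity) plus decorrelation term of Eq. (2).
import torch
import torch.nn.functional as F

def gate_loss(Za, Zb, gamma=0.2):
    # first term: negative mean cosine similarity between paired embeddings
    invariance = -F.cosine_similarity(Za, Zb, dim=1).mean()
    # second term: push each view's Gram matrix Z^T Z towards the identity
    eye = torch.eye(Za.size(1), device=Za.device)
    decorrelation = sum(((Z.t() @ Z) - eye).pow(2).sum() for Z in (Za, Zb))
    return invariance + gamma * decorrelation
```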

Performing the downstream task. After the GCN encoder is trained, there is no need to keep the large graph structure in the fine-tuning step, for versatility and efficiency in clinical practice, because many GNN encoders (e.g., GCN and GAT) are transductive, which means $\mathbf{A}$ needs to be reconstructed whenever new samples are involved. Thus, we fine-tune the encoder without graph information (i.e., by replacing $\mathbf{A}$ with the identity matrix $\mathbf{I}$), followed by a linear layer with an ELU activation function (denoted as $\psi(\cdot):\mathcal{Z}\mapsto\mathbb{R}$), on a small portion of labeled data using a cross-entropy loss for a particular clinical prediction task. All sliding windows are entered into the model during inference. The pseudo code of GATE is given in Algorithm 1.

Algorithm 1 The pseudo code of our proposed GATE

Input: BOLD signal segments for all $N$ subjects $\{\mathbf{S}_{i}^{1},\dots,\mathbf{S}_{i}^{M}\}_{i=1}^{N}$, where $m\in[M]$ indexes the sliding window; random drop function $\mathcal{R}(\cdot)$; graph generation function $\mathcal{P}(\cdot)$; fine-tuning data $\mathbf{\tilde{X}}$ with labels $\mathbf{\tilde{Y}}$; and fine-tuning classifier $\psi(\cdot)$.
Output: Graph neural network encoder $f(\cdot)$

1: $\mathbf{S}^{a}\leftarrow\mathbf{S}^{m}$; $\mathbf{S}^{b}\leftarrow\mathbf{S}^{m\pm 1}$ ▷ Randomly select one window
2: SSL training:
3: for $t\leftarrow 1,2,\cdots,\mathrm{Epochs}$ do
4:    $\mathbf{X}^{a},\mathbf{A}^{a}=\mathcal{P}(\mathbf{S}^{a})$; $\mathbf{X}^{b},\mathbf{A}^{b}=\mathcal{P}(\mathbf{S}^{b})$
5:    if use random drop then
6:        $\mathbf{X}^{a},\mathbf{A}^{a}=\mathcal{R}(\mathbf{X}^{a},\mathbf{A}^{a})$
7:        $\mathbf{X}^{b},\mathbf{A}^{b}=\mathcal{R}(\mathbf{X}^{b},\mathbf{A}^{b})$
8:    end if ▷ Generate fMRI features and population graphs
9:    $\mathbf{Z}^{a}=f(\mathbf{X}^{a},\mathbf{A}^{a})$ ▷ Obtain embeddings of view (a)
10:   $\mathbf{Z}^{b}=f(\mathbf{X}^{b},\mathbf{A}^{b})$ ▷ Obtain embeddings of view (b)
11:   Loss ← Eq. (2) ▷ Objective function
12:   Update $f(\cdot)$ with the optimizer, i.e., AdamW
13: end for
14: Fine-tuning for the downstream task:
15: for $t\leftarrow 1,2,\cdots,\mathrm{Epochs}$ do
16:    $\mathbf{P}=\psi(f(\mathbf{\tilde{X}},\mathbf{I}))$
17:    Loss ← $\mathcal{L}_{ce}$ ▷ Train classifier using cross-entropy loss
18:    Update $\psi(\cdot)$ and $f(\cdot)$ with the optimizer, i.e., AdamW
19: end for
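Below is a minimal PyTorch sketch of the fine-tuning loop (lines 14-19), reusing the GCNEncoder name from the earlier sketch; replacing the adjacency with the identity matrix follows the paper, while the plain linear head (the paper describes a linear layer with ELU), epoch count, and data handling are illustrative assumptions.

```python
# Sketch only: fine-tuning the pre-trained encoder with an identity adjacency.
import torch
import torch.nn as nn

def fine_tune(encoder, X_lab, y_lab, n_classes=2, epochs=200, lr=1e-3):
    """X_lab: (N, d) labeled features; y_lab: (N,) LongTensor of class indices."""
    I = torch.eye(X_lab.size(0))                       # replace A with the identity matrix
    head = nn.Linear(encoder.theta.out_features, n_classes)   # classifier psi (simplified)
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits = head(encoder(X_lab, I))               # graph-free forward pass
        loss = ce(logits, y_lab)
        loss.backward()
        opt.step()
    return head
```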

IV Theoretical motivation and analysis on CCA Loss

Machine learning models may generalize poorly on testing data if the decision making relies on spurious features. Adding input-consistency regularization can help mitigate the correlation between spurious features and the label and improve testing performance [23, 52]. CCA, which matches the similarity of two representations, has been widely used for multi-view fusion and fMRI analysis [53], while leveraging CCA in SSL is under-explored. Although [48] uses CCA as a regularizer, it only implicitly connects view augmentation with the Information Bottleneck [54]. Here we improve the theoretical analysis by building the connection between a deep-learning-based non-linear CCA and SSL. Moreover, we show how this affects the downstream task. The general objective of deep CCA is expressed as

$$\max_{f}\ \mathcal{L}_{\rm CCA}\coloneqq\mathbb{E}_{G^{a},G^{b}}[f(G^{a})^{\top}f(G^{b})], \quad \text{s.t. }\ \Sigma_{f(G^{a}),f(G^{a})}=\Sigma_{f(G^{b}),f(G^{b})}=\mathbf{I}, \qquad (3)$$

where $f$ is a normalized non-linear embedding $\mathcal{G}\mapsto\mathcal{Z}$ with $f\leftarrow\frac{f}{\|f\|_{2}}$, w.r.t. the two views $G^{a}$ and $G^{b}$, and $\Sigma$ is the covariance matrix with $\Sigma_{f,f}=\mathbb{E}_{G^{a}}[f(G^{a})f(G^{a})^{\top}]$. Here, we state the identifiability between the input-consistency SSL regularization and non-linear CCA optimization.

Theorem 1 (CCA loss)

Considering the optimization problems in Eq. (2) and Eq. (3), Eq. (2) is the dual problem of Eq. (3) in Lagrangian form and satisfies the Karush-Kuhn-Tucker (KKT) conditions [55].
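As an informal sketch of this correspondence (a penalty-form relaxation rather than the full KKT argument), relaxing the covariance constraints of Eq. (3) with a single scalar multiplier $\gamma$ gives

$$\mathcal{L}(f,\gamma)= -\,\mathbb{E}_{G^{a},G^{b}}\!\left[f(G^{a})^{\top}f(G^{b})\right] +\gamma\sum_{v\in\{a,b\}}\left\|\Sigma_{f(G^{v}),f(G^{v})}-\mathbf{I}\right\|_{F}^{2},$$

and replacing the expectations with empirical averages over the $N$ subjects recovers the form of Eq. (2).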

Then, we build the connection between CCA regularization and bounding the generalization error on downstream tasks. Let $G^{a}$ and $G^{b}$ be two views of a datum, i.e., $G^{a}$ is one sample and $G^{b}$ is its variant. Denote the representation operator $\mathcal{T}$, the low-rank approximation operator $\mathcal{R}$, and $h:\mathcal{X}\mapsto\mathbb{R}$. Conduct the SVD of $\mathcal{T}$: find $k$ orthonormal vectors $U=[u_{1},\dots,u_{k}]$, $V=[v_{1},\dots,v_{k}]$ and scalars $s=\sigma_{1},\dots,\sigma_{k}\in\mathbb{R}$ that minimize

$$\mathcal{L}_{U,V,S}\coloneqq \max_{\|h\|_{L^{2}(G^{b})}=1}\|\mathcal{T}h-\mathcal{T}_{k}h\|_{L^{2}(G^{a})}, \qquad \mathcal{T}_{k}h\coloneqq \sum_{i=1}^{k}\sigma_{i}\langle v_{i},h\rangle_{L^{2}(G^{b})}u_{i}=f^{\top}\mathbf{w},$$

where $\mathbf{w}=\langle v_{i},h\rangle$. We state the general theorem of non-linear CCA on the approximation error as follows.

Lemma 2 (General theorem for non-linear CCA [56])

Let $f$ be the solution of Eq. (3) and $h^{*}$ be the optimal function to predict $\mathbf{Y}$, the one-hot encoding of the label $Y$. Denote $\sigma_{i}\coloneqq\mathbb{E}_{G^{a},G^{b}}[f_{i}(G^{a})f_{i}(G^{b})]$. Then the approximation error of $f$ satisfies

$$e_{apx}(f)\coloneqq\min_{\mathbf{W}}\mathbb{E}_{G^{a}}\big[\|h^{*}(G^{a})-\mathbf{W}^{\top}f(G^{a})\|^{2}\big] \leq 2\,\mathbb{E}_{y}\big[\|h^{*}-\mathcal{R}w_{b,y}\|^{2}+\|(\mathcal{R}-\mathcal{T}_{k})w_{b,y}\|^{2}\big]. \qquad (4)$$

Here $(\mathcal{T}_{k}\circ h_{y})(g_{a})\coloneqq\sum_{i=1}^{k}\sigma_{i}\mathbb{E}[f_{i}(G^{b})h_{y}(G^{b})]f(g_{a})$ and $(\mathcal{R}\circ h_{y})(g_{a})\coloneqq\mathbb{E}_{Y}[\mathbb{E}_{G^{b}}[h_{y}(G^{b})|Y]\,|\,G^{a}=g_{a}]$, where $h_{y}$ satisfies $\mathbb{E}[h_{y}(G^{b})|Y=y]=\mathbf{1}(Y=y)$.

Theorem 3 (Upper bound of excess risk of downstream task)

Let $G^{a}$ and $G^{b}$ be two views randomly generated from the same training instance and $\mathbf{Y}$ be the instance labels. Consider learning an embedding by minimizing Eq. (2), $f^{*}\coloneqq\operatorname*{arg\,min}\mathcal{L}$, and performing a downstream linear model on $\mathbf{Y}$ with $f(\cdot)$, i.e., $h(G)\coloneqq(\mathbf{W}^{*})^{\top}f(G)$, $\mathbf{W}^{*}\leftarrow\operatorname*{arg\,min}_{\mathbf{W}}\mathbb{E}_{G,\mathbf{Y}}\big[\|\mathbf{Y}-\mathbf{W}^{\top}f(G)\|^{2}\big]$, where, for simplicity of analysis, $\mathbf{W}\in\mathbb{R}^{k\times k}$. Suppose $\mathbf{Y}=h^{*}(G)+N$, where $N$ is $\sigma^{2}$-subgaussian and $\mathbb{E}[N]=0$. If the multi-view conditional independence assumption holds, then with probability at least $1-\delta$, the excess risk satisfies

$${\rm ER}_{f}(\mathbf{W})\coloneqq \mathbb{E}_{G}\big[\|h^{*}(G)-\mathbf{W}^{\top}f(G)\|_{2}^{2}\big] \leq \mathcal{O}\left(\frac{\alpha}{1-\lambda}+\sigma^{2}\frac{k}{n}\right), \qquad (5)$$

where $n$ is the number of labeled samples in the downstream task, $k$ is the number of classes in $\mathbf{Y}$, $\alpha$ is the Bayes error, and $\lambda$ is the $k$-th singular value of the representation operator $\mathcal{T}$. In our SSL context, $(\mathcal{T}h)\coloneqq\mathbb{E}[h_{y}(G^{b})^{\top}f(G^{b})\,|\,f(G^{a})]$.

Proof 4

Let $G^{a}$ and $G^{b}$ be two views of a datum, i.e., $G^{a}$ is one sample and $G^{b}$ is its variant. Let $w_{i,y}(G^{a})=\mathbf{1}(h^{*}(G^{a})=y)$ for $i\in\{a,b\}$. Under conditional independence, $\mathcal{T}=\mathcal{R}$ and the $(k-1)$-th maximal correlation of $G^{a}$ and $G^{b}$ is the $k$-th singular value of $\mathcal{T}$ [57]. We have

$$\sum_{y}\|\mathcal{T}_{k}w_{b,y}-w_{a,y}\|^{2} = 1-\int_{G^{a},G^{b}}P_{G^{a},G^{b}}(G^{a},G^{b})\,\mathbf{1}(h(G^{a})\neq h(G^{b})) \geq \sum_{i\in\{a,b\}}P(h^{*}(G^{i})\neq y)\geq 1-2\alpha.$$

Accordingly, we obtain

$$\sum_{y}\|\mathcal{R}w_{b,y}-w_{a,y}\|^{2} = \sum_{y}\big(\|\mathcal{R}w_{b,y}\|^{2}+\|w_{a,y}\|^{2}-2\langle w_{a,y},\mathcal{R}w_{b,y}\rangle\big) \leq 2-2(1-2\alpha)=4\alpha.$$

Additionally,

$$\sum_{y}\|(\mathcal{T}-\mathcal{T}_{k})w_{b,y}\|^{2}\leq\lambda^{2}\Big(1-\frac{(1-2\alpha)^{2}-\lambda^{2}}{1-\lambda^{2}}\Big)\leq\frac{4\alpha(1-\alpha)}{1-\lambda^{2}}.$$

Hence we have

$$\sum_{y}\|\mathcal{T}_{k}w_{b,y}-w_{a,y}\|^{2} \leq \left(\sqrt{\sum_{y}\|\mathcal{T}_{k}w_{b,y}-w_{a,y}\|^{2}}+\sqrt{\sum_{y}\|(\mathcal{T}-\mathcal{T}_{k})w_{b,y}\|^{2}}\right)^{2} \leq \Big(2\sqrt{\alpha}+\sqrt{\tfrac{4\alpha(1-\alpha)}{1-\lambda^{2}}}\Big)^{2}\leq\frac{16\alpha}{1-\lambda},$$

and $\sum_{y}\|\mathcal{T}_{k}w_{b,y}-w_{a,y}\|^{2}\leq\sum_{y}\|(\mathcal{T}-\mathcal{T}_{k})w_{b,y}\|^{2}$.

Notice that we need to upper bound the terms in Eq. (4), where

$$\sum_{y}\left\|\left(\mathcal{R}-\mathcal{T}_{k}\right)w_{b,y}\right\|^{2} \leq 2\sum_{y}\Big(\left\|\mathcal{R}w_{b,y}-w_{a,y}\right\|^{2}+\left\|w_{a,y}-\mathcal{T}_{k}w_{b,y}\right\|^{2}\Big) \leq \frac{16\alpha}{1-\lambda^{2}}+4\alpha=\mathcal{O}\Big(\frac{\alpha}{1-\lambda}\Big).$$

On the other hand, we have $\|h^{*}-\mathcal{R}w_{b,y}\|^{2}\leq\langle\mathbf{Y}-h^{*},\,\mathbf{W}^{*\top}f-\mathbf{W}^{\top}f\rangle\lesssim \mathrm{Tr}(\Sigma_{YY|G^{a}})(k+\log k/\delta)$. Therefore, the excess risk of fitting the linear layer satisfies

$$\frac{1}{n}\|h^{*}-\mathbf{W}^{\top}f\|^{2}_{F}\lesssim \frac{1}{n}\mathrm{Tr}(\Sigma_{YY|G^{a}})(k+\log k/\delta) = \mathcal{O}\Big(\sigma^{2}\frac{k}{n}\Big).$$

Therefore, we conclude that Theorem 3 holds.

Remark 5

Note that, based on the bound provided by Theorem 3, a larger number of labeled samples $n$ in the downstream task and a smaller value of $\lambda$ give a lower excess risk. A small value of $\lambda$ can be achieved by a low-rank representation of $\mathcal{T}$, e.g., via optimizing Eq. (3) [56].

V Experiments

V-A Experimental setup

TABLE I: Performance (i.e., mean (std)) of all methods on two datasets with 20% labeled data in the fine-tuning step. The "gray" rows give the results with vanilla training (semi-supervised or supervised learning). The "Avg" metric is the average of the five evaluation metrics for convenient comparison. The best performance is highlighted in boldface.
Dataset Method Accuracy AUC Precision Recall F1-score Avg
ABIDE Vanilla GCN [5] 59.6 (2.8) 59.3 (2.7) 61.0 (2.5) 67.3 (5.3) 64.7 (2.8) 62.4 (3.2)
GAT [50] 61.6 (2.2) 60.8 (1.9) 62.7 (2.2) 69.7 (4.0) 66.6 (2.2) 64.3 (2.5)
SAC-GCN [58] 61.8 (2.2) 61.2 (2.2) 63.4 (2.3) 68.6 (4.1) 65.8 (2.4) 64.1 (2.7)
DGI [41] 57.4 (2.7) 56.9 (2.4) 59.8 (2.6) 63.6 (5.3) 61.6 (2.7) 59.9 (3.1)
MVGRL [42] 58.1 (3.1) 57.6 (2.8) 60.5 (1.7) 64.0 (5.2) 62.2 (2.2) 60.5 (3.0)
BGRL [47] 61.0 (2.4) 60.3 (2.4) 62.4 (2.0) 69.3 (4.9) 65.6 (2.6) 63.7 (2.9)
CCA-SSG [48] 60.6 (2.3) 60.1 (2.0) 62.6 (2.6) 66.5 (4.4) 64.4 (4.9) 62.8 (3.6)
GATE(ours) 63.7 (1.8) 63.6 (2.1) 65.6 (1.5) 70.1 (3.7) 67.7 (2.5) 66.2 (2.5)
FTD Vanilla GCN [5] 64.5 (3.0) 65.5 (2.6) 65.9 (2.8) 69.4 (3.6) 66.8 (2.7) 66.4 (2.5)
GAT [50] 66.4 (2.7) 66.1 (2.2) 66.8 (2.2) 75.7 (2.2) 70.6 (2.2) 69.1 (2.3)
SAC-GCN [58] 67.8 (3.8) 67.5 (3.6) 66.9 (2.7) 71.8 (4.5) 68.1 (2.9) 68.5 (3.5)
DGI [41] 62.8 (2.9) 62.9 (2.4) 65.4 (2.8) 62.5 (7.3) 63.8 (2.0) 63.5 (3.5)
MVGRL [42] 63.5 (3.8) 63.4 (2.9) 65.7 (2.3) 64.0 (6.6) 64.7 (3.4) 64.3 (3.8)
BGRL [47] 68.2 (2.5) 68.5 (1.9) 67.8 (3.1) 73.6 (5.0) 70.6 (2.1) 69.7 (2.9)
CCA-SSG [48] 69.7 (3.0) 69.5 (2.4) 68.6 (2.7) 75.7 (4.5) 71.8 (2.4) 71.1 (3.0)
GATE (ours) 72.4 (2.5) 71.7 (2.0) 70.7 (2.2) 76.8 (4.2) 73.3 (2.8) 73.0 (2.7)

V-A1 Datasets

We conduct experiments on two fMRI datasets: the Autism Brain Imaging Data Exchange (ABIDE) for healthy control (HC) vs. autism classification (http://fcon_1000.projects.nitrc.org/indi/abide/) and the Frontotemporal Dementia (FTD) dataset for HC vs. dementia classification (https://cind.ucsf.edu/research/grants/frontotemporal-lobar-degeneration-neuroimaging-initiative-0).

ABIDE contains 1029 subjects with functional magnetic resonance imaging data from the ABIDE-I and ABIDE-II datasets, including 485 ASD patients and 544 healthy controls (HCs), a nearly balanced class distribution. The registered fMRI volumes were parcellated on a predefined template, using the Bootstrap Analysis of Stable Clusters parcellation with 122 ROIs (BASC-122). We construct a $122\times 122$ FC network for each subject, where each node is an ROI and the edge weight is the Pearson's correlation between the BOLD time series of the paired ROIs. To represent a subject, we use the upper triangle of the full association matrix, yielding a 7503-dimensional feature vector.

FTD includes 181 subjects, consisting of 86 normal cases (HC) and 95 subjects characterized as FTD, a nearly balanced class distribution. Subjects were scanned on 3.0T scanners at different centers with a gradient field strength of 80 mT/m and a gradient switching rate of 200 mT/m/ms, using an eight-channel phased-array receiver coil. We use the DPARSF toolbox [59] to preprocess the data with the following pipeline: (i) slice timing correction; (ii) head motion correction; (iii) spatial normalization to the Montreal Neurological Institute template; (iv) spatial smoothing with a Gaussian kernel (full width at half maximum); and (v) linear detrending and temporal bandpass filtering (0.01-0.10 Hz) of the BOLD signals. Finally, the registered fMRI volumes were parcellated into 116 ROIs according to the AAL template.

V-A2 Graph construction

We follow Parisot et al. [5] to obtain the initial graph $\mathbf{A}$. To begin with, we extract low-dimensional and discriminative features from the raw medical images. We then use them to build a similarity graph $\mathbf{S}\in\mathbb{R}^{n\times n}$, where $n$ indicates the number of nodes in the population graph, intending to limit the adverse influence of high-dimensional features, such as noisy or redundant features and the curse of dimensionality. Additionally, we employ phenotype data (e.g., sex, age, and gene) to calculate node similarity from another angle, supplying further information to produce high-quality graphs. Finally, we merge the edges derived from image-based node features and the edges derived from phenotype information to obtain the initial graph $\mathbf{A}$ by taking the Hadamard product of the similarity graph matrix $\mathbf{S}$ and the phenotypic graph matrix $\widetilde{\mathbf{S}}$ (i.e., $\mathbf{A}=\mathbf{S}\circ\widetilde{\mathbf{S}}$). We also sparsify the graph $\mathbf{A}$ by preserving, for each node, the $k$ edges with the largest weights and setting the others to zero. Last, we add the identity matrix $\mathbf{I}$ to $\mathbf{A}$ (i.e., $\mathbf{I}+\mathbf{A}\rightarrow\mathbf{A}$).
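A minimal sketch of this construction is given below; the cosine similarity, the exact-match phenotype similarity, and $k$ are illustrative simplifications, not the exact measures used in the paper.

```python
# Sketch only: combine image-feature and phenotype similarity, sparsify, add self-loops.
import numpy as np

def build_population_graph(X, phenotypes, k=10):
    """X: (n, d) subject features; phenotypes: (n, p) categorical phenotype matrix."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)
    S = Xn @ Xn.T                                               # image-feature similarity
    # phenotype similarity: fraction of matching phenotype entries per subject pair
    S_ph = (phenotypes[:, None, :] == phenotypes[None, :, :]).mean(axis=2)
    A = S * S_ph                                                # Hadamard product S o S~
    keep = np.argsort(-A, axis=1)[:, 1:k + 1]                   # k largest-weight edges per node
    mask = np.zeros_like(A, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    A = np.where(mask | mask.T, A, 0.0)                         # symmetric sparsification
    return A + np.eye(len(A))                                   # I + A -> A
```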

V-A3 Comparison methods

The comparison methods include three GCN methods without SSL strategies (vanilla GCN [5], GAT [50], and SAC-GCN [58]), two contrastive-based graph SSL methods (DGI [41] and MVGRL [42]), and two similarity-based graph SSL methods (BGRL [47] and CCA-SSG [48]). We adopt the open-source code of all comparison methods and use grid search to decide their best hyper-parameters. For a fair comparison, we use the same dynamic window strategy for the comparison methods and use grid search to find their best practice. We list the details of each comparison method below.

  • Vanilla GCN [5] trains a GCN model [49] with a cross-entropy loss on fMRI data for disease diagnosis using labeled training data. This is the baseline GCN method on fMRI data [5].

  • GAT [50] is a popular GCN method with an attention mechanism and is considered a state-of-the-art method for graph learning with labeled training data.

  • SAC-GCN [58] is a population graph-based method that conducts representation learning by taking into account both the functional graph and the structural graph.

  • DGI [41] is a pioneering SSL method for graph-structured data. As a contrastive learning method, DGI obtains good representations by manually constructing positive and negative pairs.

  • MVGRL [42] maximizes the mutual information of multiple related views inspired by DGI [41], which contains a more complex pipeline for graph SSL.

  • BGRL [47] generalizes BYOL [18] to graphs by forming online and target node embeddings, and gets promising results on large-scale graph representation learning.

  • CCA-SSG [48] proposes a similarity-based SSL method and obtains significant results. The main differences between GATE and CCA-SSG are that GATE designs a dedicated strategy to generate two views according to the characteristics of medical data, and that we improve the theoretical analysis by building the connection between a deep-learning-based non-linear CCA and SSL.

For the supervised learning methods (i.e., vanilla GCN, GAT, and SAC-GCN), we follow their original transductive setting to train the models. Among the comparison methods, DGI, MVGRL, BGRL, and CCA-SSG are state-of-the-art GCN models using SSL embeddings. For the contrastive-based models (i.e., DGI and MVGRL), following their original papers and official code, we use the corruption function and the readout function for negative and positive construction, respectively. For the similarity-based methods (i.e., BGRL and CCA-SSG), we use random masking on the features and the graph to generate the two views.

V-A4 Implementation details

All experiments are run on a server with 8 NVIDIA GeForce 3090 GPUs and implemented in PyTorch (version 1.9). All parameters are optimized by the AdamW [60] optimizer with a learning rate of $1e^{-3}$. The $\gamma$ in Eq. (2) is empirically set to 0.2, $L$ and $s$ in S-A are set to 30 and 15, respectively, and $l_{a}$ and $l_{b}$ in M-A are randomly selected from $[10,20,30,40,50]$ in each training iteration. We apply one graph convolutional layer followed by one linear layer in this study, considering that a large number of graph convolutional layers may cause over-smoothing [61] (which occurs easily on small datasets). We apply the ELU function as the nonlinear activation for each layer. The training times (in seconds) on the ABIDE and FTD datasets are 20.3 ($\pm$0.2) and 14.7 ($\pm$0.2), respectively. In the fine-tuning step, the labeled data are randomly selected from the original set (e.g., 206 labeled samples on the ABIDE dataset and 36 labeled samples on the FTD dataset for a 20% labeled rate), 5-fold cross-validation is performed, and the experiments are repeated 5 times with random seeds. The average performance with the corresponding standard deviation (std) is reported for all methods.

Figure 3: Accuracy of GATE and vanilla GCN at different label rates on (a) ABIDE and (b) FTD.
Figure 4: Performance of GATE with different combinations of augmentation methods on the two datasets: (a) ABIDE and (b) FTD.

V-A5 Performance evaluation

The performance evaluation includes five metrics: Accuracy, Area Under the ROC Curve (AUC), Precision, Recall, and F1-score. For all of these metrics, a higher value means better performance. Besides, we conduct significance testing using the two-sample t-test.
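As a small sketch of how these metrics and the t-test could be computed (scikit-learn and SciPy calls chosen by us; the per-run scores below are random placeholders, not reported results):

```python
# Sketch only: evaluation metrics and a two-sample t-test across repeated runs.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_prob),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred),
    }

rng = np.random.default_rng(0)
gate_runs = rng.normal(0.64, 0.02, size=25)   # placeholder per-run accuracies
base_runs = rng.normal(0.61, 0.02, size=25)   # placeholder per-run accuracies
t_stat, p_value = ttest_ind(gate_runs, base_runs)
```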

Figure 5: Effectiveness of different SSL strategies, i.e., 'CL': contrastive learning-based SSL and 'Re': reconstruction-based SSL, on (a) ABIDE and (b) FTD.

V-B Results and analysis

The quantitative comparison is shown in Table I. We observe that GATE achieves the best performance on the quantitative metrics across the two datasets, followed by CCA-SSG, SAC-GCN, GAT, BGRL, vanilla GCN, MVGRL, and DGI. Meanwhile, we find that the improvements are significant ($p<0.05$ via significance testing) compared with state-of-the-art graph SSL methods. In particular, GATE improves by 12.73% ($p=4.83e^{-4}$) and 4.04% ($p=4.16e^{-3}$) on average over the baseline SSL comparison method DGI and the best graph SSL comparison method CCA-SSG, respectively. This indicates the superior performance of our proposed GATE. Moreover, we observe that DGI and MVGRL cannot outperform the baseline vanilla GCN. A possible reason is the limited number of samples and small number of classes in the datasets (the class collision problem [62]). We emphasize that the ultimate success of SSL lies in leveraging unlabeled data, which means SSL methods can gain more benefits and achieve higher results with more unlabeled data.

We further conduct experiments under different ratios of labeled data (10% to 80%). Remarkably, labeled data is only used in the fine-tuning stage under the SSL setting. As shown in Fig. 3, GATE always outperforms the vanilla GCN, with an especially large gap when labeled data is limited. Specifically, compared to vanilla GCN, GATE achieves an average improvement of 11.4%, 9.9%, 4.3%, and 2.0% with 10%, 20%, 50%, and 80% labeled data, respectively. These results confirm our claim that GATE obtains satisfying prediction performance with a small portion of labeled data.

V-C Ablation study

Effectiveness of dynamic FC augmentation. To evaluate the effectiveness of our proposed augmentation strategy (Sec. III-A), we report the performance of seven augmentation strategies based on our framework in Fig. 4, i.e., DA (random drop augmentation [47]), MA (M-A augmentation), SA (S-A augmentation), MA+DA (both M-A and DA), SA+DA (both S-A and DA), SA+MA (both S-A and M-A), and SA+MA+DA (S-A, M-A, and DA). Generally, we observe that performance worsens with DA only, while the two proposed augmentations (S-A and M-A) increase performance. Such observations do not come as a surprise; indeed, SSL relies on defining two views appropriate for the specific data and task (i.e., fMRI in this study). Our best performance is obtained with SA+MA+DA.

Effectiveness of different SSL strategies. We perform an experiment to compare different SSL strategies, i.e., contrastive-based SSL (CL), reconstruction-based SSL (Re), and GATE. For CL, we replace Eq. (2) with the InfoNCE loss [14] and select negative samples randomly. For Re, we replace Eq. (2) with an MSE loss and add a decoder network. Fig. 5 shows the superiority of GATE over the other SSL strategies (CL and Re), which is consistent with our analysis.
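For reference, a minimal sketch of an InfoNCE-style loss usable as the 'CL' variant in this ablation is shown below; the temperature and the use of in-batch negatives are our assumptions, not the exact configuration used.

```python
# Sketch only: InfoNCE over paired view embeddings Za, Zb of shape (N, d).
import torch
import torch.nn.functional as F

def info_nce(Za, Zb, tau=0.5):
    Za, Zb = F.normalize(Za, dim=1), F.normalize(Zb, dim=1)
    logits = Za @ Zb.t() / tau                              # (N, N) cosine similarities
    targets = torch.arange(Za.size(0), device=Za.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)
```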

Figure 6: Accuracy of GATE with different numbers of embedding dimensions.
Figure 7: Accuracy of GATE with different values of $\gamma$ in Eq. (2).

Different embedding dimensions. We analyze the impact of the number of embedding dimensions in our method. For this purpose, we vary the number of embedding dimensions over $\{16,32,64,128,256,512,1024\}$. As shown in Fig. 6, the accuracy of our method increases with the number of embedding dimensions, i.e., from 16 to 256 on the ABIDE dataset and from 16 to 128 on the FTD dataset. On both datasets, the performance peaks around 128 to 256 and then changes only slightly. Moreover, the accuracy is slightly lower when the dimensionality is in the range $[16,64]$. The reason may be that low-dimensional embeddings cannot capture the useful information well, limited by their insufficient representational ability. The performance is stable for high-dimensional embeddings, which indicates that GATE is insensitive to the number of embedding dimensions once the dimensionality is sufficiently large. Based on this observation, we set the number of embedding dimensions to 256 in all experiments.

Figure 8: Accuracy of GATE without fine-tuning and GATE without considering graph information in SSL, on (a) ABIDE and (b) FTD.

Effectiveness of $\gamma$ in the objective function. We conduct an ablation experiment on the second term in the objective function. The results in Fig. 7 correspond to $\gamma$ in Eq. (2) varying from 0.001 to 10. We can see that GATE performs poorly on both datasets when $\gamma$ is small (e.g., $\gamma\leq 0.1$), providing empirical evidence for our claim that the second term in Eq. (2) is necessary to obtain promising results. Moreover, over a wide range of $\gamma$ (e.g., $0.1\leq\gamma\leq 0.8$), GATE achieves close to its best performance on both datasets, which shows that GATE is stable when $\gamma$ is in a reasonable range (e.g., $0.2\leq\gamma\leq 0.8$). As expected, a large value of $\gamma$ (e.g., $\gamma\geq 0.9$) hurts performance. The reason may be that a large $\gamma$ means the second term dominates the objective function, whereupon the model tends to produce decorrelated but less discriminative embeddings for downstream tasks.

Figure 9: Comparison of the low-rank representation between GATE and vanilla GCN.

Effectiveness of fine-tuning and the graph. To analyze the effectiveness of the fine-tuning step and the graph, we compare GATE without the fine-tuning step and GATE without the graph in SSL (i.e., replacing $\mathbf{A}$ with $\mathbf{I}$ in Eq. (1) in the SSL step). As shown in Fig. 8, we clearly observe that both fine-tuning and the graph are necessary in our pipeline. In practice, the fine-tuning step fits the practical medical deployment scenario where a small portion of labeled data can be obtained for a specific disease in the downstream task. Besides, the graph information provides common biomarkers between subjects for the model in the SSL step. Indeed, after the SSL step, there is no need to keep the large graph structure, which improves computational efficiency in the fine-tuning step and makes GATE easier to apply in clinical practice.

Low-rank representation. In Theorem 3, we state that GATE can obtain a low-rank representation to decrease the upper bound of the excess risk of the downstream task. We carry out Singular Value Decomposition (SVD) experiments on the embeddings to support this statement. According to Fig. 9, the singular values of GATE are considerably smaller than those of vanilla GCN on both datasets.
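For reference, the singular-value spectrum of an embedding matrix can be inspected with a short sketch like the following (names are illustrative):

```python
# Sketch only: singular values of an (N, k) embedding matrix, in descending order.
import torch

def singular_spectrum(Z: torch.Tensor) -> torch.Tensor:
    return torch.linalg.svdvals(Z)
```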

Parameter sensitivity analysis. As mentioned in Sec. III-A, existing dynamic FC-based methods are sensitive to the parameters of the sliding window algorithm (e.g., the length and the step of the sliding window) [44, 45]. We investigate whether GATE is sensitive to the parameters used in the two proposed augmentations, which are related to the sliding window algorithm. To do this, we conduct parameter sensitivity studies on the proposed augmentations and report the results in Fig. 10 and Fig. 11.

First, the augmentation S-A includes two parameters, i.e., the length of the sliding window $L$ and the step of the sliding window $s$, which control the total number of sliding windows. Figure 10 shows the results obtained by traversing $L$ from 20 to 70 and $s$ from 10 to 35, while fixing the other settings. We observe that GATE is insensitive to $L$ within a certain range of values and to $s$. Specifically, GATE consistently obtains promising results except for tiny values of $L$, and the suggested value ranges of $L$ and $s$ are $[30,60]$ and $[10,35]$, respectively. Second, for the augmentation M-A, we investigate the effect of the gap between the two different-scale sliding windows (i.e., $\|l_{a}-l_{b}\|$). Figure 11 shows the results obtained by traversing the gap from 5 to 40 while fixing the other settings (e.g., $L$ and $s$ are set to 30 and 15, respectively). As a result, the performance of GATE is relatively stable when the gap is in the range $[10,40]$. This observation demonstrates that GATE is insensitive to the parameter settings of M-A in our experiments. In summary, the reason for the above observations could be that GATE maximizes the similarity of the two representations (i.e., the embeddings of two different sliding windows), which reduces the perturbations caused by different sliding windows.

Figure 10: Accuracy of GATE at different parameter settings (i.e., window length and step size) on (a) ABIDE and (b) FTD.
Figure 11: Accuracy of GATE at different parameter settings, i.e., gaps between multiple windows ($\|l_{a}-l_{b}\|$), on (a) ABIDE and (b) FTD.

VI Discussion

VI-A The needs of label efficiency for fMRI

In recent years, fMRI analysis has been revolutionized by deep learning, while training a powerful model usually requires a large amount of labeled fMRI data. To illustrate the need for large labeled datasets in fMRI analysis, we depict the results of vanilla GCN under different label rates in Fig. 3. We observe that the accuracy of the vanilla GCN decreases significantly as the labeling rate decreases (e.g., from 50% to 10%). However, obtaining a sufficient number of labeled scans for fMRI data can be challenging due to the cost of obtaining neuro-disease annotations. Therefore, designing a method for limited labeled fMRI data is crucial [63]. To address this limitation, we propose the graph SSL method GATE, which can yield comparable results under label-efficient settings (e.g., the accuracy of GATE at a 20% labeling rate is comparable to the accuracy of vanilla GCN at a 50% labeling rate). The main reason could be that GATE helps extract robust features from label-efficient data. Consequently, our experiments show promising results for tackling the issue of limited labeled fMRI data.

VI-B Graph learning for neuroimaging

Deep graph learning models, such as GCNs, have become popular approaches for population-based disease analysis, including neuroimaging [5, 3]. Modeling subject-wise neuroimaging signals (and auxiliary phenotype features) can leverage the representations of similar subjects. Our work investigates how adding similarity regularization affects GCN-based representation learning on fMRI signals. As shown in Fig. 8 and Tab. I, graph information is essential for good results, and GATE is significantly better than other GCN-based methods. The reason could be better extraction of associations between subjects and maximization of the correlation of the representations from multiple coupled views. Hence, we conclude that graph learning can obtain embedding matrices of different views that characterize neuroimaging representations.

VI-C Technical contributions of our work

We further analyze the technical contributions of GATE. First, GATE utilizes an SSL-based strategy to capture the vital relationships between multiple coupled views generated from an fMRI BOLD signal. We achieve good performance by pre-training on unlabeled fMRI data and fine-tuning on a small amount of labeled data. Second, we design a GCN encoder for limited labeled fMRI data to extract associations between subjects. Our GCN-based method effectively addresses the issue of learning on label-efficient datasets. Moreover, GATE is a versatile method that can be further improved by adopting a more powerful GNN encoder (i.e., $f(\cdot)$) and a more powerful classifier (i.e., $\psi(\cdot)$) for real applications. In general, self-supervised tasks provide informative priors that can benefit the generalization of GCNs on target tasks.

VII Conclusion

In this paper, we propose GATE, a novel graph SSL method for fMRI analysis on the population graph. GATE achieves outstanding classification performance without using labels during the self-supervised training phase. Our theoretical analyses demonstrate the appropriateness of the proposed augmentation strategy and the CCA regularization loss in SSL for fMRI analysis, and provide a generalization bound for our method. Experimental results show that GATE outperforms alternative SSL methods on two neuroimaging datasets by significant margins. Furthermore, it opens perspectives for bringing deep learning from research to the clinic by learning from noisy and limited labeled data.

VIII Acknowledgment

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61876046), Medico-Engineering Cooperation Funds from University of Electronic Science and Technology of China (No. ZYGX2022YGRH009 and ZYGX2022YGRH014), the Guangxi “Bagui” Teams for Innovation and Research, China, Natural Sciences and Engineering Research Council of Canada (GECR-2022-00430), NVIDIA Hardware Award, and Public Safety Canada (NS-5001-22170).

References

  • [1] S. Lang, N. Duncan, and G. Northoff, “Resting-State Functional Magnetic Resonance Imaging: Review of Neurosurgical Applications,” Neurosurgery, vol. 74, no. 5, pp. 453–465, 2014.
  • [2] S. Menon and K. Krishnamurthy, “A comparison of static and dynamic functional connectivities for identifying subjects and biological sex using intrinsic individual brain connectivity,” Scientific Reports, vol. 9, 2019.
  • [3] X. Li, Y. Zhou, N. Dvornek, M. Zhang, S. Gao, J. Zhuang, D. Scheinost, L. H. Staib, P. Ventola, and J. S. Duncan, “Braingnn: Interpretable brain graph neural network for fmri analysis,” Medical Image Analysis, vol. 74, p. 102233, 2021.
  • [4] N. C. Dvornek, X. Li, J. Zhuang, and J. S. Duncan, “Jointly discriminative and generative recurrent neural networks for learning from fmri,” in MLMI.   Springer, 2019, pp. 382–390.
  • [5] S. Parisot, S. I. Ktena, E. Ferrante, M. Lee, R. Guerrero, B. Glocker, and D. Rueckert, “Disease prediction using graph convolutional networks: application to autism spectrum disorder and alzheimer’s disease,” Medical Image Analysis, vol. 48, pp. 117–130, 2018.
  • [6] Y. Zhou, T. Zhou, T. Zhou, H. Fu, J. Liu, and L. Shao, “Contrast-attentive thoracic disease recognition with dual-weighting graph reasoning,” IEEE Transactions on Medical Imaging, vol. 40, no. 4, pp. 1196–1206, 2021.
  • [7] M. Ghorbani, A. Kazi, M. S. Baghshah, H. R. Rabiee, and N. Navab, “Ra-gcn: Graph convolutional network for disease prediction problems with imbalanced data,” Medical Image Analysis, vol. 75, p. 102272, 2022.
  • [8] A. Bessadok, M. A. Mahjoub, and I. Rekik, “Graph neural networks in network neuroscience,” arXiv preprint arXiv:2106.03535, 2021.
  • [9] S. L. Hyman, S. E. Levy, S. M. Myers, D. Kuo, S. Apkon, T. Brei, L. F. Davidson, B. E. Davis, K. A. Ellerbeck, G. H. Noritz et al., “Executive summary: identification, evaluation, and management of children with autism spectrum disorder,” Pediatrics, vol. 145, no. 1, 2020.
  • [10] Z. Luo, Y. Zou, J. Hoffman, and L. F. Fei-Fei, “Label efficient learning of transferable representations acrosss domains and tasks,” in NIPS, 2017.
  • [11] L. Sun, K. Yu, and K. Batmanghelich, “Context matters: Graph-based self-supervised representation learning for medical images,” in AAAI, vol. 35, 2021, pp. 4874–4882.
  • [12] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
  • [13] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka, “Debiased contrastive learning,” in NIPS, vol. 33, 2020, pp. 8765–8775.
  • [14] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [15] Y. Mo, L. Peng, J. Xu, X. Shi, and X. Zhu, “Simple unsupervised graph representation learning,” in AAAI, 2022.
  • [16] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” arXiv preprint arXiv:2111.06377, 2021.
  • [17] C. Qiu, Z. Huang, W. Xu, and H. Li, “Vgaer: graph neural network reconstruction based community detection,” in AAAI, 2022.
  • [18] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent-a new approach to self-supervised learning,” in NIPS, vol. 33, 2020, pp. 21 271–21 284.
  • [19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020, pp. 9729–9738.
  • [20] R. Ni, M. Shu, H. Souri, M. Goldblum, and T. Goldstein, “The close relationship between contrastive learning and meta-learning,” in ICLR, 2021.
  • [21] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola, “What makes for good views for contrastive learning?” Advances in Neural Information Processing Systems, vol. 33, pp. 6827–6839, 2020.
  • [22] Y. Wang, Q. Zhang, Y. Wang, J. Yang, and Z. Lin, “Chaos is a ladder: A new theoretical understanding of contrastive learning via augmentation overlap,” in ICLR, 2022.
  • [23] Y. Chen, C. Wei, A. Kumar, and T. Ma, “Self-training avoids using spurious features under domain shift,” in NIPS, 2020, pp. 21 061–21 071.
  • [24] B. Thompson, Canonical correlation analysis: Uses and interpretation.   Sage, 1984, no. 47.
  • [25] O. Friman, M. Borga, P. Lundberg, and H. Knutsson, “Adaptive analysis of fmri data,” NeuroImage, vol. 19, no. 3, pp. 837–845, 2003.
  • [26] T.-E. Kam, H. Zhang, and D. Shen, “A novel deep learning framework on brain functional networks for early mci diagnosis,” in MICCAI, 2018, pp. 293–301.
  • [27] D. Yao, J. Sui, M. Wang, E. Yang, Y. Jiaerken, N. Luo et al., “A mutual multi-scale triplet graph convolutional network for classification of brain disorders using functional or structural connectivity,” IEEE Transactions on Medical Imaging, vol. 40, no. 4, pp. 1279–1289, 2021.
  • [28] N. Wang, D. Yao, L. Ma, and M. Liu, “Multi-site clustering and nested feature extraction for identifying autism spectrum disorder with resting-state fmri,” Medical Image Analysis, vol. 75, p. 102279, 2022.
  • [29] D. J. Heeger and D. Ress, “What does fmri tell us about neuronal activity?” Nature reviews neuroscience, vol. 3, no. 2, pp. 142–151, 2002.
  • [30] J. Huang, L. Zhou, L. Wang, and D. Zhang, “Attention-diffusion-bilinear neural network for brain network analysis,” IEEE Transactions on Medical Imaging, vol. 39, no. 7, pp. 2541–2552, 2020.
  • [31] J. Zhang, W. Cheng, Z. Liu, K. Zhang, X. Lei, Y. Yao et al., “Neural, electrophysiological and anatomical basis of brain-network variability and its characteristic changes in mental disorders,” Brain, vol. 139, no. 8, pp. 2307–2321, 2016.
  • [32] N. Mao, H. Zheng, Z. Long, L. Yao, and X. Wu, “Gender differences in dynamic functional connectivity based on resting-state fmri,” in EMBC, 2017, pp. 2940–2943.
  • [33] G. B. Mertzios, H. Molter, and V. Zamaraev, “Sliding window temporal graph coloring,” in AAAI, vol. 33, no. 01, 2019, pp. 7667–7674.
  • [34] M. Wang, C. Lian, D. Yao, D. Zhang, M. Liu, and D. Shen, “Spatial-temporal dependency modeling and network hub detection for functional mri analysis via convolutional-recurrent network,” IEEE Transactions on Biomedical Engineering, vol. 67, no. 8, pp. 2241–2252, 2019.
  • [35] D. Yao, J. Sui, E. Yang, P.-T. Yap, D. Shen, and M. Liu, “Temporal-adaptive graph convolutional network for automated identification of major depressive disorder using resting-state fmri,” in MLMI, 2020.
  • [36] X. Yu, L. Zhang, L. Zhao, Y. Lyu, T. Liu, and D. Zhu, “Disentangling spatial-temporal functional brain networks via twin-transformers,” arXiv preprint arXiv:2204.09225, 2022.
  • [37] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual review of biomedical engineering, vol. 19, pp. 221–248, 2017.
  • [38] J. Fan, X. Cao, P.-T. Yap, and D. Shen, “Birnet: Brain image registration using dual-supervised fully convolutional networks,” Medical Image Analysis, vol. 54, pp. 193–206, 2019.
  • [39] A. Kazi, S. Shekarforoush, S. A. Krishna, H. Burwinkel, G. Vivar, K. Kortüm, S.-A. Ahmadi, S. Albarqouni, and N. Navab, “Inceptiongcn: receptive field aware graph convolutional network for disease prediction,” in IPMI.   Springer, 2019, pp. 73–85.
  • [40] X. Xing, Q. Li, H. Wei, M. Zhang, Y. Zhan, X. S. Zhou, Z. Xue, and F. Shi, “Dynamic spectral graph convolution networks with assistant task training for early mci diagnosis,” in MICCAI, 2019, pp. 639–646.
  • [41] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax.” in ICLR, 2019.
  • [42] K. Hassani and A. H. Khasahmadi, “Contrastive multi-view representation learning on graphs,” in ICML, 2020, pp. 4116–4126.
  • [43] P. Schober, C. Boer, and L. A. Schwarte, “Correlation coefficients: appropriate use and interpretation,” Anesthesia & Analgesia, vol. 126, no. 5, pp. 1763–1768, 2018.
  • [44] A. D. Savva, G. D. Mitsis, and G. K. Matsopoulos, “Assessment of dynamic functional connectivity in resting-state fmri using the sliding window technique,” Brain and behavior, vol. 9, no. 4, p. 1255, 2019.
  • [45] S. Shakil, C.-H. Lee, and S. D. Keilholz, “Evaluation of sliding window correlation performance for characterizing dynamic functional connectivity and brain states,” Neuroimage, vol. 133, pp. 111–128, 2016.
  • [46] T.-E. Kam, H. Zhang, Z. Jiao, and D. Shen, “Deep learning of static and dynamic brain functional networks for early mci detection,” IEEE Transactions on Medical Imaging, vol. 39, no. 2, pp. 478–487, 2020.
  • [47] S. Thakoor, C. Tallec, M. G. Azar, M. Azabou, E. L. Dyer, R. Munos, P. Veličković, and M. Valko, “Large-scale representation learning on graphs via bootstrapping,” arXiv preprint arXiv:2102.06514, 2021.
  • [48] H. Zhang, Q. Wu, J. Yan, D. Wipf, and P. S. Yu, “From canonical correlation analysis to self-supervised graph neural networks,” in NIPS, vol. 34, 2021.
  • [49] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
  • [50] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
  • [51] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
  • [52] C. Wei, K. Shen, Y. Chen, and T. Ma, “Theoretical analysis of self-training with deep networks on unlabeled data,” in ICLR, 2021.
  • [53] X. Zhuang, Z. Yang, and D. Cordes, “A technical review of canonical correlation analysis for neuroscience applications,” Human Brain Mapping, vol. 41, no. 13, pp. 3807–3833, 2020.
  • [54] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
  • [55] B. Ghojogh, A. Ghodsi, F. Karray, and M. Crowley, “Kkt conditions, first-order and second-order optimization, and distributed optimization: Tutorial and survey,” arXiv preprint arXiv:2110.01858, 2021.
  • [56] J. D. Lee, Q. Lei, N. Saunshi, and J. Zhuo, “Predicting what you already know helps: Provable self-supervised learning,” in NIPS, vol. 34, 2021.
  • [57] A. Makur, F. Kozynski, S.-L. Huang, and L. Zheng, “An efficient algorithm for information decomposition and extraction,” in Allerton.   IEEE, 2015, pp. 972–979.
  • [58] X. Song, F. Zhou, A. F. Frangi, J. Cao, X. Xiao, Y. Lei, T. Wang, and B. Lei, “Graph convolution network with similarity awareness and adaptive calibration for disease-induced deterioration prediction,” Medical Image Analysis, vol. 69, p. 101947, 2021.
  • [59] C. Yan and Y. Zang, “Dparsf: a matlab toolbox for “pipeline” data analysis of resting-state fmri,” Frontiers in Systems Neuroscience, vol. 4, p. 13, 2010.
  • [60] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
  • [61] C. Yang, R. Wang, S. Yao, S. Liu, and T. Abdelzaher, “Revisiting over-smoothing in deep gcns,” arXiv preprint arXiv:2003.13663, 2020.
  • [62] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar, “A theoretical analysis of contrastive unsupervised representation learning,” in ICML, 2019, pp. 5628–5637.
  • [63] M. Wang, D. Zhang, J. Huang, P.-T. Yap, D. Shen, and M. Liu, “Identifying autism spectrum disorder with multi-site fmri via low-rank domain adaptation,” IEEE transactions on medical imaging, vol. 39, no. 3, pp. 644–655, 2019.