
Semi-supervised Domain Adaptation on Graphs with Contrastive Learning and Minimax Entropy

Jiaren Xiao Quanyu Dai Xiao Shen Xiaochen Xie Jing Dai
James Lam
Ka-Wai Kwok [email protected]
Department of Mechanical Engineering, The University of Hong Kong, Hong Kong, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China; School of Computer Science and Technology, Hainan University, Haikou, China; Department of Automation, Harbin Institute of Technology, Shenzhen, China; Institute for Automatic Control and Complex Systems, University of Duisburg-Essen, Duisburg, Germany
Abstract

Label scarcity in a graph is frequently encountered in real-world applications due to the high cost of data labeling. To this end, semi-supervised domain adaptation (SSDA) on graphs aims to leverage the knowledge of a labeled source graph to aid in node classification on a target graph with limited labels. SSDA tasks need to overcome the domain gap between the source and target graphs. However, to date, this challenging research problem has yet to be formally considered by the existing approaches designed for cross-graph node classification. This paper proposes a novel method called SemiGCL to tackle the graph Semi-supervised domain adaptation with Graph Contrastive Learning and minimax entropy training. SemiGCL generates informative node representations by contrasting the representations learned from a graph’s local and global views. Additionally, SemiGCL is adversarially optimized with the entropy loss of unlabeled target nodes to reduce domain divergence. Experimental results on benchmark datasets demonstrate that SemiGCL outperforms the state-of-the-art baselines on the SSDA tasks.

keywords:
Semi-supervised Domain Adaptation, Graph Transfer Learning, Node Classification, Graph Contrastive Learning, Adversarial Learning
journal: Neurocomputing

1 Introduction

Graphs represent the structured and relational data that are commonly found in the real world, such as social networks, citation networks, and protein-protein interaction (PPI) networks. Node classification is a crucial graph mining problem with practical applications in various fields like e-commerce and computational biology. Graph representation learning encodes graph information into low-dimensional node representation vectors, also known as node embeddings [1]. These learned node representations can then be used with classical machine learning methods, such as a logistic regression classifier, to classify the unlabeled nodes. In recent years, graph neural networks (GNNs) have emerged as promising approaches for node classification tasks [2]. Furthermore, many recent GNN models apply graph contrastive learning to produce meaningful node representations in a self-supervised manner [3].

Existing studies primarily consider the node classification tasks on a partially labeled graph [2, 4]. In this single-graph scenario, a node classification model is trained with the labeled nodes in a graph, and evaluated on the unlabeled nodes within the same graph. However, in realistic situations, there is often a need to classify nodes in a newly collected graph (target graph) with a scarcity of node labels [5, 6]. Due to the resource-intensive and time-consuming nature of data labeling [7], it is desirable to transfer the knowledge of an available labeled graph (source graph) to assist in node classification on the target graph. For instance, on a newly-formed social network that lacks labels, it would be advantageous to transfer the abundant label information from a well-developed social network in order to classify users into interest groups. Similarly, the label information of a well-established citation database could be transferred to assign research topics to papers in a newly constructed citation network. In this cross-graph scenario, the source and target graphs can be treated as independent domains, namely the source domain and the target domain. Since the source and target graphs have distinct data distributions, there exists a domain divergence (or distribution shift) between these domains, hindering knowledge transfer across domains.

Domain adaptation aims to reduce the domain gap and improve a model’s performance when deployed to the target domain [8, 9]. The studies on domain adaptation mainly focus on computer vision (CV) and natural language processing (NLP), assuming that the data within each domain, such as images and text, are independent and identically distributed (i.i.d.). The majority of existing work considers unsupervised domain adaptation (UDA), where the target domain data are completely unlabeled [8]. In many practical applications, it is feasible to acquire a small amount of labeled data from the target domain. When utilized appropriately, these limited labeled target data can aid in model training in conjunction with the source data. This research problem is known as semi-supervised domain adaptation (SSDA) [10, 11].

This paper investigates the problem of cross-graph node classification under the SSDA setting, as depicted in Figure 1. The source graph is fully labeled, while the target graph has a limited number of labeled nodes per class. The majority of nodes (exceeding 99.5% in most cases) are unlabeled in the target graph. The objective is to leverage information from both the source and target graphs to classify the unlabeled nodes in the target graph. Under the SSDA setting, it is crucial to effectively utilize the available labels in the target graph to improve the model performance when predicting the unlabeled target nodes. One intuitive way is to optimize the model using a cross-entropy loss that takes both the labeled source and target nodes into account. However, without additional considerations, the node classification model would be biased towards the source graph, which has a much larger number of labeled nodes [10]. In addition, unlike images, nodes within a graph are connected by edges, thereby violating the i.i.d. assumption. This poses an additional challenge for the SSDA tasks.

Figure 1: Semi-supervised domain adaptation on graphs. The source and target graphs are two independent domains with distinct data distributions. The source graph is fully labeled, while the target graph has a limited number of labeled nodes per class.

Although domain adaptation in CV and NLP has been extensively studied, research on graph-structured data is still in its early stages. A few recent models (e.g., ACDNE [12], UDA-GCN [6], ASN [13], and MFRReg [14]) transfer knowledge from a labeled source graph to an unlabeled target graph with the help of domain adaptation techniques. Since the target graph is unlabeled, these models can be categorized into the UDA methods.

CDNE [5] and AdaGCN [15] also explore the cross-graph node classification problem with the target graph partially labeled. However, they consider a target graph where a fraction of nodes (e.g., 5% of the total nodes) are randomly selected to have accessible labels. In contrast, our work follows the commonly used SSDA setting [10, 11], under which an equal number of labeled nodes (e.g., five nodes) are available for each class in the target graph. This setting is more realistic and challenging. Given an originally unlabeled target graph, we can label a limited number of nodes per class to assist in classifying the remaining nodes of each class. To reduce the labeling effort, the number of labeled nodes in the target graph is kept much lower than 5% of the total nodes, thereby increasing the difficulty of the domain adaptation tasks.

In addition, AdaGCN simply merges the labeled source and target nodes to calculate the cross-entropy loss. The domain adaptation technique adopted by AdaGCN is still a UDA one that reduces the Wasserstein distance between the source and target representations. CDNE, on the other hand, incorporates the target labels to calculate the class-conditional maximum mean discrepancy (MMD) [16], and reducing the MMD contributes to aligning the source and target distributions. However, the classical MMD metric has been empirically found inferior to some recent domain adaptation techniques [17]. Recent advances in domain adaptation from CV and NLP remain to be explored on graph-structured data. For instance, the recently proposed SSDA approach, minimax entropy (MME) [11], adversarially optimizes an image classification model on the entropy loss of unlabeled target data, leading to improved model performance.

In this paper, we propose an end-to-end method called SemiGCL that addresses the graph Semi-supervised domain adaptation with Graph Contrastive Learning and minimax entropy training. On the one hand, our method is empowered by graph contrastive learning to generate informative node representations for cross-graph node classification. Two graph neural network encoders jointly leverage graph structure information and node attributes to extract node representations from the original graph (local view) and the diffusion-augmented graph (global view), respectively. Graph contrastive learning is employed to maximize the mutual information between representations learned from the local and global views, which encourages the GNN encoders to capture a graph’s rich local and global structural information simultaneously.

On the other hand, as a design tailored to the SSDA setting, we introduce minimax entropy training to reduce the domain divergence between the source and target graphs. Our approach involves the utilization of a cosine similarity-based node classifier [18]. The process of domain adaptation is modeled as a minimax game played between the GNN encoders and the node classifier. More precisely, the node classifier is trained to maximize the entropy of unlabeled nodes in the target graph. By doing so, the classifier’s weight vectors can be moved to the target graph, alleviating the model bias towards the source graph. In contrast, we update the GNN encoders to minimize the same entropy, which drives the node representations of unlabeled target nodes to get clustered around the classifier’s weight vectors, resulting in discriminative node representations.

Our method is evaluated on eight benchmark transfer tasks using five real-world information networks. The main contributions can be summarized as follows.

  • 1.

    The first GNN-based model is proposed and devised specifically to address the challenging semi-supervised domain adaptation problem on graphs.

  • 2.

    The model is designed to incorporate graph contrastive learning and minimax entropy training, effectively reducing domain divergence and generating discriminative node representations.

  • 3.

    Extensive validations are conducted on real-world information networks, demonstrating that our model outperforms the state-of-the-art approaches on the benchmark transfer tasks.

This paper is structured as follows. Section 2 provides a review of the related work. Section 3 introduces the proposed method. Section 4 reports the experimental results. Finally, Section 6 presents the conclusion.

2 Related Work

2.1 Domain Adaptation

Existing work mainly focuses on the UDA problem that assumes the data of target domain are purely unlabeled. The basic approach of UDA is to reduce the domain divergence through moment matching [16, 19, 20] or adversarial learning [17, 21, 22, 23]. Deep adaptation network (DAN) [19] estimates the distribution divergence with maximum mean discrepancy. The reduction of MMD helps match the source and target distributions. DANN [21] introduces a domain classifier that is trained with the feature extractor in an adversarial manner. Specifically, the domain classifier is optimized to tell apart samples from the source and target domains. The feature extractor is trained to produce samples that can deceive the domain classifier, thus matching the feature distributions. Another adversarial adaptation method, WDGRL [17], utilizes the Wasserstein distance as the adversarial loss.

Differing from the mainstream UDA research, a few recent studies investigate the SSDA problem where a small number of data are labeled in the target domain. Minimax entropy model [11] consists of a feature encoder (e.g., AlexNet and VGG16) and a cosine similarity-based classifier [18] devised for the few-shot classification. With the entropy of unlabeled target data calculated as adversarial loss, MME alleviates domain discrepancy by enforcing a minimax game between the feature extractor and the image classifier. ECACL [10] incorporates categorical alignment and consistency alignment to improve the performance of existing UDA methods under the SSDA setting.

Domain adaptation has been widely studied in areas like CV and NLP. However, most previous models are designed for vector-based data (e.g., images and text) that follow the i.i.d. assumption. Our work considers the domain adaptation problem for graph-structured data where nodes have complicated edge connections, thereby violating the i.i.d. assumption. Note that GRDA [24] studies the problem of domain adaptation across multiple domains with adjacency described by a domain graph. Each domain is a node in the domain graph. This problem is completely different from the one we considered, i.e., domain adaptation on graphs, where each domain is a graph. Therefore, GRDA is not applicable to our setting.

2.2 Representation Learning on Graphs

Graph representation learning generates low-dimensional representation vectors for graph nodes [1, 4]. There are methods designed to preserve structural properties only [25, 26] or to jointly exploit graph structure and side information such as node attributes [27, 28, 29]. Among them, GNNs [2] have drawn a lot of research interest in recent years.

GraphSAGE [30] pioneered the generation of node representations by aggregating information from sampled neighboring nodes. Unlike GraphSAGE, which is evaluated in the supervised learning scenario, many studies focus on graph semi-supervised learning to overcome the label shortage in a partially labeled graph. The well-known model, GCN [31], simplifies the spectral graph convolution as a first-order approximation. Through jointly encoding local graph structure and node attributes, GCN demonstrates impressive performance and inspires many follow-up studies. GAT [32], for instance, introduces an attention mechanism and assigns learnable weights to the neighbors. Note that GCN and GAT utilize all neighbors of a node without neighborhood sampling.

Self-supervised learning [3] is another way to tackle the lack of labels. It enables the models to produce meaningful node embeddings by conducting various pretext tasks without any node label information. Under this learning paradigm, graph contrastive learning (GCL) maximizes the mutual information (MI) between two (augmented) instances of the same object, such as one graph and a node. For example, DGI [33] maximizes the MI between node representations and the global summaries of a graph. The resulting node representations are expected to encode the global structural information. MVGRL [34] contrasts the node and graph embeddings learned from two structural views of a graph.

The majority of graph representation learning models are not specially designed to address the domain discrepancy between graphs. Although we can sometimes adapt these models for cross-graph node classification, the domain divergence would still hinder their performance.

2.3 Cross-graph Node Classification

Recently, a few studies have been conducted to address the problem of cross-graph node classification, specifically focusing on two independent attributed graphs. The objective of this research is to assist in node classification on a target graph that lacks sufficient node labels by transferring knowledge from a source graph with abundant labels.

Similar to the UDA studies in CV and NLP, most existing methods are devised for a cross-graph scenario that assumes the target graph is unlabeled. To capture the attributed affinity and topological proximity, ACDNE [12] constructs two feature extractors that encode node attributes and neighborhood attributes, respectively. UDA-GCN [6] utilizes dual graph convolutional networks to exploit the local and global consistency on a graph. Both ACDNE and UDA-GCN follow the adversarial training paradigm in DANN [21] to match the embedding distributions of source and target graphs. ASN [13] also adopts the dual GCNs and improves the extraction of domain-shared features by separating the domain-private features. Xiao et al. recently proposed a GNN-based model (i.e., AdaGIn) that aligns the multimodal embedding distributions by conditional adversarial networks [35]. To capture the global graph information, AdaGIn maximizes mutual information between the node and graph-level representations.

Very recently, a few studies design theory-grounded algorithms for domain adaptation on graphs. For example, MFRReg [14] regularizes the GNN spectral property (i.e., maximum frequency response) to improve the GNN transferability. Additionally, inspired by the Weisfeiler-Lehman graph isomorphism test, GRADE [36] proposes a graph subtree discrepancy to measure the graph distribution shift. The reduction of graph subtree discrepancy contributes to aligning the graph data distributions.

Unlike these UDA methods, CDNE [5] and AdaGCN [15] consider a target graph in which a percentage of nodes are randomly selected to have accessible labels. CDNE generates node representations with stacked autoencoders. The source and target distributions are aligned by reducing the maximum mean discrepancy. AdaGCN extracts node representations with GCN and minimizes the Wasserstein distance [17] to reduce domain discrepancy. Our work focuses on the SSDA problem on graphs. To our knowledge, this problem has never been formally considered by the prior art. To address this problem, we devise a novel GNN-based model empowered by graph contrastive learning and minimax entropy training.

3 Proposed Method

In this section, we first define the research problem. Then, the model architecture is presented. After that, we elaborate on node representation learning, node label prediction, and semi-supervised domain adaptation. Finally, we summarize the overall objective and outline the model training procedure.

3.1 Problem Definition

An attributed graph, or an information network, is mathematically described as $\mathcal{G}\left(\bm{V},\bm{A},\bm{X},\bm{Y}\right)$, including node set $\bm{V}\in\mathbb{R}^{N}$, adjacency matrix $\bm{A}\in\mathbb{R}^{N\times N}$, attribute matrix $\bm{X}\in\mathbb{R}^{N\times L}$, and label matrix $\bm{Y}\in\mathbb{R}^{N\times C}$. $N$, $L$, and $C$ denote the number of nodes, the dimension of a node attribute vector, and the number of node classes, respectively. The $i$-th rows of $\bm{A}$, $\bm{X}$, and $\bm{Y}$ are associated with the $i$-th node $v\in\bm{V}$. Adjacency matrix $\bm{A}$ and label matrix $\bm{Y}$ contain binary values. Specifically, $A_{ij}=1$ indicates the $i$-th node has an edge connection with the $j$-th node; $Y_{ic}=1$ means the $i$-th node belongs to the $c$-th class. The degree of node $v$ is the number of its edge connections, i.e., $\sum_{j}A_{ij}$. In this paper, we consider the undirected graph. The average degree is computed by $\sum_{i}\sum_{j}A_{ij}/N$, revealing the graph density. Table 1 summarizes the main notations in this paper.

Table 1: Main notations.

Notation | Description
$\mathcal{G}$ | An attributed graph
$\mathcal{G}^{s}$, $\mathcal{G}^{t}$ | Source graph and target graph
$\bm{V}$, $\bm{A}$ | Node set and adjacency matrix
$\bm{X}$, $\bm{Y}$ | Attribute matrix and label matrix
$\bm{E}$, $\hat{\bm{Y}}$ | Representation matrix and label prediction matrix of $\mathcal{G}$
$\bm{\mathcal{X}}$ | Union attribute set
$\bm{x}_{v}$, $\bm{y}_{v}$ | Attribute vector and label vector of node $v\in\bm{V}$
$\bm{e}_{v}$, $\hat{\bm{y}}_{v}$ | Embedding vector and label prediction vector of node $v\in\bm{V}$
$n$ | Number of labeled nodes per class in $\mathcal{G}^{t}$
$f_{A}$, $f_{P}$, $f_{c}$ | Local-view GNN encoder, global-view GNN encoder, and node classifier
$\bm{\theta}_{g}$, $\bm{\theta}_{c}$ | Sets of parameters in the GNN encoders and in the node classifier
$\alpha$ | Teleport probability
$\sigma$ | Nonlinear activation function
$s$ | Neighborhood sample size
$T$ | Temperature parameter
$\eta_{0}$ | Initial learning rate
$\bm{B}$ | A batch of nodes
$\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$ | Contrastive learning coefficient, domain adaptation coefficient, and entropy coefficient

Graph representation learning encodes each node $v\in\bm{V}$ into a low-dimensional representation vector, that is, node embedding vector $\bm{e}_{v}$. In representation matrix $\bm{E}\in\mathbb{R}^{N\times l}$, $\bm{e}_{v}^{\top}$ is stored as one row. The embedding dimension is denoted as $l$. Using the learned node embeddings, a node classifier can be trained to perform node classification tasks.

In this paper, we study the problem of semi-supervised domain adaptation on graphs. The source domain refers to a labeled graph $\mathcal{G}^{s}\left(\bm{V}^{s},\bm{A}^{s},\bm{X}^{s},\bm{Y}^{s}\right)$, in which the label of each node is known. The target domain is a partially labeled graph $\mathcal{G}^{t}\left(\bm{V}^{t},\bm{A}^{t},\bm{X}^{t},\bm{Y}^{t}\right)$, where only a few nodes have known labels. In line with the common SSDA setting [10, 11], we randomly choose an equal number of nodes from each class to form the set of labeled nodes in the target graph. The objective of this study is to improve the node classification performance on the target graph by transferring knowledge from the source graph. To accomplish this, it is essential to produce node representations that are both transferable and discriminative. This research problem is challenging because of the distinction between the source and target graphs. Since the two graphs share no nodes and have no edges connecting them, they represent independent domains. The discrepancy between the domains arises from variations in graph scales and differences in the distributions of node attributes, edge connections, and node labels.

Following prior studies [5, 6, 15], the source and target graphs are assumed to have the same set of node classes. However, these two graphs possess different attribute sets, namely $\bm{\mathcal{X}}^{s}$ and $\bm{\mathcal{X}}^{t}$. A union attribute set is created and denoted as $\bm{\mathcal{X}}=\bm{\mathcal{X}}^{s}\cup\bm{\mathcal{X}}^{t}$. The attribute matrices can then be expressed as $\bm{X}^{s}\in\mathbb{R}^{N^{s}\times U}$ and $\bm{X}^{t}\in\mathbb{R}^{N^{t}\times U}$, where $U=\left|\bm{\mathcal{X}}\right|$ represents the total number of attributes across both graphs. This union attribute set enables parameter sharing, allowing the same model to be applied to both graphs; many modern domain adaptation techniques rely on parameter sharing [11, 17, 21]. To quantify the extent of attribute overlap, we define the common attribute rate as $R_{a}=\left|\bm{\mathcal{X}}^{s}\cap\bm{\mathcal{X}}^{t}\right|/\left|\bm{\mathcal{X}}\right|$, i.e., the percentage of node attributes present in both graphs.
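To make this preprocessing concrete, the sketch below (our own illustration; the helper name and data layout are assumptions, not the released implementation) builds the union attribute set and computes the common attribute rate $R_{a}$ from two attribute vocabularies.

```python
def build_union_attributes(source_attrs, target_attrs):
    """Build the union attribute set and report the common attribute rate.

    source_attrs / target_attrs: sets of attribute names (e.g., title words)
    observed in the source and target graphs, respectively.
    """
    union_attrs = sorted(source_attrs | target_attrs)            # union set X
    common_rate = len(source_attrs & target_attrs) / len(union_attrs)
    # Map every attribute to a fixed column index so that X^s and X^t share
    # the same feature dimension U = |X|, which enables parameter sharing.
    attr_to_index = {a: i for i, a in enumerate(union_attrs)}
    return attr_to_index, common_rate

# Toy usage: build_union_attributes({"graph", "learning"}, {"graph", "network"})
# returns a 3-attribute index and a common attribute rate of 1/3.
```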

3.2 Overview of Model Architecture

As shown in Figure 2, the proposed SemiGCL model is made up of two modules: the GNN encoders and the node classifier. In addition to using the original graph as a regular structural view, we use graph diffusion to obtain an augmented graph that serves as an additional structural view. Two GNN encoders are used to encode the original graph and the augmented graph, respectively. Representation vectors generated by the GNN encoders are then concatenated to obtain the node embedding vector. A cosine similarity-based classifier is employed as the node classifier, which takes the embedding vector as input and outputs the label prediction.

There are three losses involved in the model optimization. The contrastive loss is calculated for each graph in a self-supervised way. By minimizing the contrastive loss, the agreement is maximized between the representations learned from the two structural views. The cross-entropy loss is computed with the labeled nodes in the source and target graphs. The minimization of cross-entropy loss contributes to extracting discriminative node representations. The entropy loss is calculated on the unlabeled target nodes. To mitigate domain divergence, the GNN encoders and the node classifier are adversarially optimized with the entropy loss.

Figure 2: Architecture of the proposed model. Two GNN encoders extract node representations from two structural views of a graph, i.e., the original graph (local view) and the diffusion-augmented graph (global view). Node representations from the local and global views are then contrasted and concatenated to obtain an informative node embedding vector. The cosine similarity-based node classifier computes the label prediction by taking the node embedding vector as input. The domain divergence between the source and target graphs is reduced by adversarially optimizing the model with the entropy loss. More details can be found in Section 3.

3.3 Node Representation Learning

We consider two structural views of the same graph: one regular view (local view) and one additional view (global view). The regular structural view is the original graph represented by the binary adjacency matrix $\bm{A}$. The adjacency matrix stores the edge connections, which are a kind of local structural information around a node. Hence, the regular view is also regarded as a local view. The additional structural view is an augmented graph described by diffusion matrix $\bm{P}$. The augmented graph is generated using graph diffusion [37]:

$\tilde{\bm{P}}=\left(\bm{D}+\bm{I}_{N}\right)^{-1/2}\left(\bm{A}+\bm{I}_{N}\right)\left(\bm{D}+\bm{I}_{N}\right)^{-1/2}$ (1)
$\bm{P}=\alpha\left[\bm{I}_{N}-\left(1-\alpha\right)\tilde{\bm{P}}\right]^{-1}$ (2)

where $\bm{D}$ is the diagonal degree matrix of adjacency matrix $\bm{A}$, $\bm{I}_{N}\in\mathbb{R}^{N\times N}$ is an identity matrix, $\tilde{\bm{P}}$ is the transition matrix, and $\alpha\in\left(0,1\right)$ is the teleport probability. An edge weight in the diffusion matrix represents the influence between two nodes that may not be in each other's one-hop neighborhood in the original graph. In other words, the diffusion matrix can establish links to distant nodes of the original graph. Applying the diffusion matrix lets nodes in multi-hop neighborhoods directly participate in node representation learning, capturing global information. Therefore, the augmented structural view can be treated as a global view. Following GCN [31], the calculation of the diffusion matrix is regarded as a pre-processing step performed before loading the graph data for model training.
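For reference, a dense NumPy sketch of Eq. 1 and Eq. 2 is given below. It assumes a small graph where the matrix inverse is affordable; practical implementations would instead follow the sparse approximations discussed in [37].

```python
import numpy as np

def ppr_diffusion(A, alpha=0.1):
    """Compute the diffusion matrix P (Eq. 2) from a binary adjacency matrix A.

    A: (N, N) symmetric binary adjacency matrix of the undirected graph.
    alpha: teleport probability in (0, 1); the value 0.1 is only illustrative.
    """
    N = A.shape[0]
    A_hat = A + np.eye(N)                        # A + I_N (self-loops)
    d = A_hat.sum(axis=1)                        # diagonal of D + I_N
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    P_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt    # Eq. 1: transition matrix
    P = alpha * np.linalg.inv(np.eye(N) - (1.0 - alpha) * P_tilde)  # Eq. 2
    return P
```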

3.3.1 Representation Learning on Structural Views

Node representations of the local and global views are calculated following the neighborhood aggregation strategy of modern GNNs [38]. Through aggregating neighborhood information based on the adjacency matrix and the diffusion matrix, the GNN encoders can naturally exploit both graph structure and node attributes to generate node representations. The node representation of local view is computed as follows.

$\bm{h}_{S}^{k}=\dfrac{1}{\left|\bm{S}_{v}\right|}\sum\nolimits_{u\in\bm{S}_{v}}\bm{h}_{u}^{k-1},$ (3)
$\bm{h}_{v}^{k}=\sigma_{1}\left(\bm{W}_{A}^{k}\left[\bm{h}_{v}^{k-1};\bm{h}_{S}^{k}\right]\right),$ (4)

where the set $\bm{S}_{v}$ represents a sample of nodes taken from the one-hop neighborhood of node $v$, the vector $\bm{h}_{S}^{k}$ is the neighborhood representation at step $k\in\left\{1,\ldots,K\right\}$, the vector $\bm{h}_{v}^{k}$ is the representation of node $v$ at step $k$, $\bm{W}_{A}^{k}$ is a weight matrix, and $\sigma_{1}$ is the ReLU activation, that is, $\sigma_{1}\left(x\right)={\rm max}\left(0,x\right)$. Initially, node representation $\bm{h}_{v}^{0}$ is set to be node attribute vector $\bm{x}_{v}$. A simple form of skip connection [30] is implemented in Eq. 4, so that the previous representation of a node (i.e., $\bm{h}_{v}^{k-1}$) can be incorporated at the current step $k$. The intuition behind Eq. 3 and Eq. 4 is that at each step $k$, or search depth $k$, node $v$ aggregates information from its local neighbors in $\bm{S}_{v}$. The neighbors also aggregate information following the same procedure. As the process iterates from $k=1$ to $k=K$, node $v$ gradually gains more and more information from nodes at a larger search depth.

Before generating node representations, the required neighborhood sets (up to search depth $K$) are sampled for each node in batch $\bm{B}$. Sample size $s_{k}$ determines the number of neighbors sampled at search depth $k$. By the end of $K$ steps, the representation of node $v$ captures the information within its $K$-hop neighborhood. The representation vector at the final step, denoted as $\bm{h}_{v}^{K}$, serves as the node representation of the local view:

$\bm{e}_{v}^{A}=\bm{h}_{v}^{K}=f_{A}\left(\bm{x}_{v},\bm{x}_{S}\right),\ v\in\bm{B}.$ (5)

Here, $f_{A}$ refers to the GNN encoder of the local view, and $\bm{x}_{S}$ is the attribute matrix of the sampled neighbors.
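A minimal PyTorch sketch of one local-view aggregation step (Eq. 3 and Eq. 4) is shown below, assuming the neighborhood sets have already been sampled and padded into a dense tensor; the module name and tensor shapes are our own illustration rather than the released code. Stacking $K$ such layers, with node attributes as the initial representations, yields $\bm{e}_{v}^{A}$ in Eq. 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalViewLayer(nn.Module):
    """One neighborhood-aggregation step of the local-view encoder f_A."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # W_A^k acts on the concatenation [h_v^{k-1}; h_S^k], hence 2 * in_dim.
        self.W_A = nn.Linear(2 * in_dim, out_dim, bias=False)

    def forward(self, h_self, h_neigh):
        # h_self:  (B, in_dim)       representations h_v^{k-1} of the batch nodes
        # h_neigh: (B, s_k, in_dim)  representations of the sampled neighbors S_v
        h_S = h_neigh.mean(dim=1)                    # Eq. 3: mean aggregation
        h_cat = torch.cat([h_self, h_S], dim=-1)     # skip connection [h_v; h_S]
        return F.relu(self.W_A(h_cat))               # Eq. 4: sigma_1 = ReLU
```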

The node representation of global view is calculated as follows.

$\bm{h}_{P}^{k}=\sigma_{1}\left(\bm{W}_{P}^{k}\sum\nolimits_{u\in\bm{P}_{v}}p_{vu}\bm{h}_{u}^{k-1}\right),\ k\in\left\{1,\ldots,K\right\},$ (6)

where $p_{vu}$ are the diffusion matrix entries in the row corresponding to node $v$, $\bm{P}_{v}$ is the set of nodes associated with entries $p_{vu}$, $\bm{W}_{P}^{k}$ is the weight matrix, and $\bm{h}_{P}^{k}$ is the representation vector of node $v$ at step $k$. The node representation encoded from the global view is

$\bm{e}_{v}^{P}=\bm{h}_{P}^{K}=f_{P}\left(\bm{x}_{P}\right),\ v\in\bm{B},$ (7)

in which $f_{P}$ is the GNN encoder of the global view and $\bm{x}_{P}$ is the attribute matrix of the neighboring nodes in the augmented graph characterized by the diffusion matrix. As suggested by [37], the diffusion matrix $\bm{P}$ is sparsified by selecting the highest $s$ entries per row, which can be interpreted as sampling $s$ neighboring nodes based on the augmented graph. Note that, since the computation of the diffusion matrix already considers a node itself by adding the identity matrix (see Eq. 1), the skip connection is removed in the calculation of the global-view representation.
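Analogously, a sketch of one global-view aggregation step (Eq. 6), together with the per-row top-$s$ sparsification of the diffusion matrix described above, could look as follows. This is again our own illustration under the same assumptions, operating on dense torch tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def topk_per_row(P, s):
    """Keep the s largest diffusion entries per row (neighbor sampling on the
    augmented graph), zeroing out the remaining entries."""
    vals, idx = torch.topk(P, s, dim=1)
    return torch.zeros_like(P).scatter_(1, idx, vals)

class GlobalViewLayer(nn.Module):
    """One diffusion-weighted aggregation step of the global-view encoder f_P."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W_P = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h_neigh, p_weights):
        # h_neigh:   (B, s, in_dim) representations of the s neighbors kept after
        #            sparsifying the diffusion-matrix row of each batch node.
        # p_weights: (B, s)         the corresponding diffusion entries p_vu.
        h_agg = (p_weights.unsqueeze(-1) * h_neigh).sum(dim=1)  # Eq. 6 weighted sum
        return F.relu(self.W_P(h_agg))                          # sigma_1 = ReLU
```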

Representations of the local and global views are concatenated to obtain the embedding vector of a node:

$\bm{e}_{v}=\left[\bm{e}_{v}^{A};\bm{e}_{v}^{P}\right].$ (8)

Node representations are generated with Eq. 3–Eq. 8 for both the source and target graphs. The same GNN encoders, $f_{A}$ and $f_{P}$, are utilized to compute node representations for each graph. That is, the model parameters are shared when processing these two graphs. The GNN encoders can naturally be trained with a batch of nodes per iteration. Note that both parameter sharing and minibatch training are required by minimax entropy training.

3.3.2 Contrastive Learning between Structural Views

As shown in Figure 3, graph contrastive learning is applied to maximize the agreement between representations learned from the local and global views. The GNN encoders are encouraged to simultaneously encode a graph’s local and global information. Specifically, we maximize mutual information between the local and global views by contrasting the node representation of one view with a graph-level representation of the other view.

When considering a batch of nodes, denoted as $\bm{B}$, GNN encoder $f_{A}$ computes the local-view representations as $\bm{E}_{A}=\left[\bm{e}_{1}^{A},\ldots,\bm{e}_{\left|\bm{B}\right|}^{A}\right]$. A readout function, $\mathcal{R}$, is applied to obtain a summary vector $\bm{r}_{A}$ by averaging the node representations in this batch:

$\bm{r}_{A}=\mathcal{R}\left(\bm{E}_{A}\right)=\sigma_{2}\left(\dfrac{1}{\left|\bm{B}\right|}\sum_{i=1}^{\left|\bm{B}\right|}\bm{e}_{i}^{A}\right).$ (9)

Here, $\sigma_{2}$ represents the logistic sigmoid activation function, that is, $\sigma_{2}\left(x\right)=1/\left(1+{\rm exp}\left(-x\right)\right)$. Summary vector $\bm{r}_{A}$ serves as a graph-level representation associated with the nodes in this batch and their considered neighbors. Similarly, we can use GNN encoder $f_{P}$ to obtain the global-view representations $\bm{E}_{P}=\left[\bm{e}_{1}^{P},\ldots,\bm{e}_{\left|\bm{B}\right|}^{P}\right]$ and the corresponding summary vector $\bm{r}_{P}$.

The pairs $\left(\bm{e}_{i}^{A},\bm{r}_{P}\right)$ and $\left(\bm{e}_{i}^{P},\bm{r}_{A}\right)$ are regarded as positive samples. Mutual information between the local and global views is maximized by classifying the positive samples and their negative counterparts. Following [33], row-wise shuffling of attribute matrix $\bm{X}$ is applied to corrupt the graph. GNN encoders $f_{A}$ and $f_{P}$ take the original adjacency and diffusion matrices, as well as the corrupted attributes, as inputs to calculate the node representations for negative samples, namely $\tilde{\bm{E}}_{A}=\left[\tilde{\bm{e}}_{1}^{A},\ldots,\tilde{\bm{e}}_{\left|\bm{B}\right|}^{A}\right]$ and $\tilde{\bm{E}}_{P}=\left[\tilde{\bm{e}}_{1}^{P},\ldots,\tilde{\bm{e}}_{\left|\bm{B}\right|}^{P}\right]$. The negative samples are denoted as $\left(\tilde{\bm{e}}_{i}^{A},\bm{r}_{P}\right)$ and $\left(\tilde{\bm{e}}_{i}^{P},\bm{r}_{A}\right)$.
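The corruption step and the readout of Eq. 9 are straightforward to implement; a minimal sketch is given below (the function names are our own illustration).

```python
import torch

def corrupt_attributes(X):
    """Row-wise shuffling of the attribute matrix X, following DGI [33].

    The corrupted attributes are fed to the same GNN encoders together with the
    unchanged adjacency / diffusion matrices, producing the negative node
    representations used in the contrastive loss.
    """
    perm = torch.randperm(X.size(0))
    return X[perm]

def readout(E):
    """Eq. 9: summary vector r = sigmoid(mean of the batch node representations)."""
    return torch.sigmoid(E.mean(dim=0))
```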

Figure 3: Contrastive learning between the local view (original graph) and the global view (diffusion-augmented graph).

To classify the positive and negative samples, we utilize a bilinear function $\mathcal{D}$ that assigns a score to each sample, indicating its likelihood of being positive:

$\bm{s}_{i}=\mathcal{D}\left(\bm{e}_{i},\bm{r}\right)=\sigma_{2}\left(\bm{e}_{i}^{\top}\bm{W}_{b}\bm{r}\right),\ i\in\left\{1,\ldots,\left|\bm{B}\right|\right\},$ (10)

where $\bm{W}_{b}$ is a learnable matrix and $\left(\bm{e}_{i},\bm{r}\right)$ is an example of an input sample containing a representation vector $\bm{e}_{i}$ and a summary vector $\bm{r}$. Finally, a contrastive loss is defined using the Jensen-Shannon divergence [3]:

$\mathcal{L}_{\rm CL}=-\dfrac{1}{4\left|\bm{B}\right|}\sum_{i=1}^{\left|\bm{B}\right|}\left\{{\rm log}\,\mathcal{D}\left(\bm{e}_{i}^{A},\bm{r}_{P}\right)+{\rm log}\,\mathcal{D}\left(\bm{e}_{i}^{P},\bm{r}_{A}\right)+{\rm log}\left[1-\mathcal{D}\left(\tilde{\bm{e}}_{i}^{A},\bm{r}_{P}\right)\right]+{\rm log}\left[1-\mathcal{D}\left(\tilde{\bm{e}}_{i}^{P},\bm{r}_{A}\right)\right]\right\}.$ (11)

By optimizing the contrastive loss, the two instances within the positive sample are pulled closer, whereas their counterparts in the negative sample are pushed away. Note that the contrastive loss is independently computed for each of the source and target graphs, namely $\mathcal{L}_{\rm CL}^{s}$ and $\mathcal{L}_{\rm CL}^{t}$. Therefore, the overall contrastive loss is reformulated as the sum of the loss on each graph:

$\mathcal{L}_{\rm CL}=\mathcal{L}_{\rm CL}^{s}+\mathcal{L}_{\rm CL}^{t}.$ (12)
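The bilinear discriminator of Eq. 10 and the per-graph loss of Eq. 11 translate directly into PyTorch; the following sketch (class and function names are assumptions for illustration) computes the loss for one graph, and Eq. 12 is obtained by summing the result over the source and target graphs.

```python
import torch
import torch.nn as nn

class BilinearDiscriminator(nn.Module):
    """Bilinear scoring function D of Eq. 10 for (node representation, summary) pairs."""

    def __init__(self, dim):
        super().__init__()
        self.W_b = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W_b)

    def forward(self, E, r):
        # E: (B, dim) node representations; r: (dim,) summary vector of the other view.
        return torch.sigmoid(E @ self.W_b @ r)       # (B,) scores in (0, 1)

def contrastive_loss_one_graph(disc, E_A, E_P, E_A_neg, E_P_neg, r_A, r_P, eps=1e-8):
    """Contrastive loss of Eq. 11 for one graph (averaged over the batch)."""
    pos = torch.log(disc(E_A, r_P) + eps) + torch.log(disc(E_P, r_A) + eps)
    neg = torch.log(1.0 - disc(E_A_neg, r_P) + eps) + torch.log(1.0 - disc(E_P_neg, r_A) + eps)
    return -(pos + neg).mean() / 4.0
```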

3.4 Node Label Prediction

The cosine similarity-based classifier has been found effective in few-shot classification, where only a few labeled data from the new classes are given to train a classification model [18]. Few-shot classification shares certain similarities with the SSDA problem, where a few labeled nodes in the new domain (target domain) are provided. Following [11], we devise node classifier $f_{c}$ to be cosine similarity-based:

$\hat{\bm{y}}_{v}=f_{c}\left(\bm{e}_{v}\right)=\sigma_{3}\left(\dfrac{1}{T}\dfrac{\bm{W}_{c}^{\top}\bm{e}_{v}}{\left\|\bm{e}_{v}\right\|_{2}}\right),\ v\in\bm{B},$ (13)

where $\bm{W}_{c}$ is the learnable weight matrix; $T$ is a temperature parameter for scaling; nonlinear activation $\sigma_{3}$ is a softmax function for multiclass classification; class probability vector $\hat{\bm{y}}_{v}^{\top}$ is one row in label prediction matrix $\hat{\bm{Y}}$. Weight matrix $\bm{W}_{c}=\left[\bm{w}_{1},\bm{w}_{2},\ldots,\bm{w}_{C}\right]$ consists of a series of weight vectors $\bm{w}_{j},\ j\in\left\{1,\ldots,C\right\}$. In order to classify the nodes correctly, the direction of weight vector $\bm{w}_{j}$ has to be representative of the normalized node representations of class $j$. Therefore, each weight vector can be treated as an estimated prototype for the corresponding class [11].

Cross-entropy loss $\mathcal{L}_{\rm CE}$ is computed using the labeled nodes in both the source and target graphs:

$\mathcal{L}_{\rm CE}=-\mathop{\mathbb{E}}\nolimits_{v\in\bm{B}^{s}}\sum_{j=1}^{C}\left(Y_{vj}^{s}\,{\rm log}\,\hat{Y}_{vj}^{s}\right)-\mathop{\mathbb{E}}\nolimits_{v\in\bm{T}_{l}}\sum_{j=1}^{C}\left(Y_{vj}^{t}\,{\rm log}\,\hat{Y}_{vj}^{t}\right).$ (14)

Here, $\bm{B}^{s}$ represents a batch of source graph nodes, $\bm{T}_{l}$ denotes the set of labeled nodes in the target graph, binary element $Y_{vj}$ in label matrix $\bm{Y}$ indicates whether node $v$ belongs to class $j$, and $\hat{Y}_{vj}$ is the corresponding entry in label prediction matrix $\hat{\bm{Y}}$. The source and target graphs are distinguished by the superscripts $s$ and $t$. By minimizing the cross-entropy loss, the GNN encoders are expected to generate discriminative node representations.
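A compact PyTorch sketch of the cosine similarity-based classifier (Eq. 13) and the cross-entropy objective (Eq. 14) follows; the default temperature and the use of integer class labels (instead of the one-hot rows of $\bm{Y}$) are our own choices for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Eq. 13: temperature-scaled cosine-similarity node classifier f_c."""

    def __init__(self, emb_dim, num_classes, T=0.05):
        super().__init__()
        self.W_c = nn.Parameter(torch.empty(emb_dim, num_classes))
        nn.init.xavier_uniform_(self.W_c)
        self.T = T                                  # temperature (assumed default)

    def forward(self, e):
        e = F.normalize(e, p=2, dim=-1)             # divide by ||e_v||_2
        return e @ self.W_c / self.T                # logits; softmax applied in the loss

def cross_entropy_loss(classifier, e_src, y_src, e_tgt_labeled, y_tgt_labeled):
    """Eq. 14: cross-entropy over the labeled source batch and labeled target set.

    y_src / y_tgt_labeled: integer class labels (equivalent to the one-hot rows of Y).
    """
    loss_s = F.cross_entropy(classifier(e_src), y_src)
    loss_t = F.cross_entropy(classifier(e_tgt_labeled), y_tgt_labeled)
    return loss_s + loss_t
```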

3.5 Semi-supervised Domain Adaptation

If the model is optimized using the cross-entropy loss calculated by the labeled source and target nodes, the trained model is likely to be biased towards the source graph, since the source labels are dominant. The node representations of unlabeled target nodes would not be discriminative enough. To address this issue, we apply minimax entropy training [11] as a domain adaptation technique.

Minimax entropy optimizes the model on the entropy of unlabeled nodes in the target graph. With the model optimized on the cross-entropy loss (Eq. 14), the estimated prototypes $\bm{w}_{j},\ j\in\left\{1,\ldots,C\right\}$ will be closer to the embedding distribution of the source graph. The "position" of each prototype $\bm{w}_{j}$ is then moved towards the target graph by training the node classifier to increase the entropy of unlabeled target nodes:

$\mathcal{L}_{\rm EN}=-\mathop{\mathbb{E}}\nolimits_{v\in\bm{B}^{t}}\sum_{j=1}^{C}\left(\hat{Y}_{vj}^{t}\,{\rm log}\,\hat{Y}_{vj}^{t}\right),$ (15)

where $\bm{B}^{t}$ is a batch of nodes sampled from the set of unlabeled nodes in the target graph, i.e., $\bm{T}_{u}$. Increasing the entropy leads to a more uniform prediction score for each class, which means each prototype $\bm{w}_{j}$ shall be similar to all unlabeled target node representations. In contrast, the GNN encoders are optimized to decrease the entropy. By reducing the entropy, the embedding vectors of unlabeled target nodes are expected to cluster around one of the prototypes, which results in discriminative node representations.
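The entropy of Eq. 15 can be computed directly from the classifier logits; a small sketch (function name is our own) is shown below.

```python
import torch.nn.functional as F

def entropy_loss(logits_unlabeled_target):
    """Eq. 15: mean entropy of the class predictions for unlabeled target nodes.

    The node classifier is trained to maximize this quantity, while the GNN
    encoders are trained to minimize it; the sign handling is realized by the
    gradient reversal layer and the coefficients in Eq. 16 and Eq. 17.
    """
    p = F.softmax(logits_unlabeled_target, dim=-1)
    log_p = F.log_softmax(logits_unlabeled_target, dim=-1)
    return -(p * log_p).sum(dim=-1).mean()
```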

To summarize, the domain adaptation process is modeled as a minimax game between the GNN encoders and the node classifier. Specifically, the node classifier is trained to maximize the entropy, whereas the GNN encoders are optimized to minimize it. We insert a gradient reversal layer [21] between the GNN encoders and the node classifier, so that these two model modules can be updated in one backpropagation.
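The gradient reversal layer [21] is a standard construct; the generic PyTorch sketch below behaves as the identity in the forward pass and scales the gradient by a negative factor in the backward pass (the scaling argument `lambd`, which absorbs the balance between Eq. 16 and Eq. 17, is our own parameterization).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity forward, scaled negative gradient backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient flowing back to the GNN encoders.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```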

3.6 Overall Objective and Model Training

The proposed SemiGCL model is trained with cross-entropy loss $\mathcal{L}_{\rm CE}$, contrastive loss $\mathcal{L}_{\rm CL}$, and entropy loss $\mathcal{L}_{\rm EN}$. The overall learning objective functions are:

$\hat{\bm{\theta}}_{g}=\mathop{\rm argmin}\limits_{\bm{\theta}_{g}}\left(\mathcal{L}_{\rm CE}+\lambda_{1}\mathcal{L}_{\rm CL}+\lambda_{2}\mathcal{L}_{\rm EN}\right),$ (16)
$\hat{\bm{\theta}}_{c}=\mathop{\rm argmin}\limits_{\bm{\theta}_{c}}\left(\mathcal{L}_{\rm CE}-\lambda_{3}\mathcal{L}_{\rm EN}\right),$ (17)

where the balance coefficients $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the contrastive learning coefficient, domain adaptation coefficient, and entropy coefficient, respectively; $\bm{\theta}_{g}$ and $\bm{\theta}_{c}$ are the sets of learnable parameters in the GNN encoders and in the node classifier, respectively.

Input: Fully labeled source graph $\mathcal{G}^{s}\left(\bm{V}^{s},\bm{A}^{s},\bm{X}^{s},\bm{Y}^{s}\right)$; partially labeled target graph $\mathcal{G}^{t}\left(\bm{V}^{t},\bm{A}^{t},\bm{X}^{t},\bm{Y}^{t}\right)$; batch size; coefficients $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$.
1: Initialize model parameters $\bm{\theta}_{g}$ for the GNN encoders and $\bm{\theta}_{c}$ for the node classifier.
2: while not max epoch do
3:   while not max iteration do
4:     Sample a batch of labeled nodes $\bm{B}^{s}$ from $\mathcal{G}^{s}$ and a batch of unlabeled nodes $\bm{B}^{t}$ from $\mathcal{G}^{t}$;
5:     Generate node representations with Eq. 8;
6:     Calculate contrastive loss $\mathcal{L}_{\rm CL}$ in Eq. 12;
7:     Calculate cross-entropy loss $\mathcal{L}_{\rm CE}$ in Eq. 14;
8:     Calculate entropy loss $\mathcal{L}_{\rm EN}$ in Eq. 15;
9:     Backpropagate and update $\bm{\theta}_{g}$ and $\bm{\theta}_{c}$ using the overall losses in Eq. 16 and Eq. 17.
10:   end while
11: end while
Testing: With model parameters $\bm{\theta}_{g}$ and $\bm{\theta}_{c}$ optimized, node representations of unlabeled target nodes are computed using Eq. 8. The corresponding label predictions are subsequently calculated using Eq. 13.
Algorithm 1 SemiGCL

Algorithm 1 provides an outline of the main procedures for training and testing. During the training stage, the nodes are initially sampled independently from the source graph and the unlabeled set of the target graph (Line 4). Subsequently, the GNN encoders are utilized to generate representations for both the source and target nodes (Line 5). Following this, the contrastive loss, the cross-entropy loss, and the entropy loss are computed (Lines 6-8). Finally, the trainable parameters in SemiGCL are updated using the overall losses defined in Eq. 16 and Eq. 17 (Line 9). With the completion of multiple epochs, the model reaches convergence. The node representations produced by the GNN encoders would become transferrable and discriminative. During the testing stage, we apply the trained node classifier to categorize the unlabeled target nodes using their corresponding node representations.
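For readers implementing Algorithm 1, the fragment below sketches how the three losses can be combined in a single backward pass. It is a minimal sketch under the assumption that the entropy term was computed on classifier outputs whose input embeddings passed through the gradient reversal layer shown earlier; the function name and default coefficients are our own illustration, not values reported in the paper.

```python
def optimization_step(optimizer, loss_ce, loss_cl, loss_en_reversed,
                      lambda1=1.0, lambda3=0.1):
    """One update realizing Eq. 16 and Eq. 17 with a single backward pass.

    loss_en_reversed: entropy of Eq. 15 computed on classifier outputs whose
    input embeddings passed through grad_reverse(). Subtracting it makes the
    classifier maximize entropy, while the reversed gradient makes the GNN
    encoders minimize it (lambda2 is absorbed into the reversal strength).
    """
    loss = loss_ce + lambda1 * loss_cl - lambda3 * loss_en_reversed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```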

The time complexity of SemiGCL is derived by investigating its two modules: the GNN encoders (Eq. 5 and Eq. 7) and the node classifier (Eq. 13). Since the GNN encoders employ neighborhood aggregation, each GNN encoder has a per-node time complexity similar to that of GraphSAGE [30], i.e., $\mathcal{O}\left(\prod_{k=1}^{K}s_{k}\right)$, where $s_{k}$ represents the sample size of neighboring nodes at search depth $k$ and $K$ is the maximum search depth. The time complexity of the node classifier is linear in the number of processed nodes. Consequently, the overall time complexity of SemiGCL is linear with respect to the number of nodes.

3.7 Theoretical Analysis on Minimax Entropy

Following MME [11], we theoretically analyze the mechanism of minimax entropy training for reducing domain divergence. As shown in [39], domain divergence can be measured with a domain discriminator. Let $f_{d}\in\mathcal{H}$ be a domain discriminator from hypothesis space $\mathcal{H}$. The $\mathcal{H}$-divergence between source domain distribution $p$ and target domain distribution $q$ is

$D_{\mathcal{H}}(p,q)\triangleq 2\mathop{\sup}\limits_{f_{d}\in\mathcal{H}}\left|\mathop{\Pr}\limits_{\bm{e}\sim p}\left[f_{d}(\bm{e})=1\right]-\mathop{\Pr}\limits_{\bm{e}\sim q}\left[f_{d}(\bm{e})=1\right]\right|$ (18)

where $\bm{e}$ denotes the node representation in Eq. 8. This theory reveals that domain divergence can be measured by training a domain discriminator $f_{d}$ that distinguishes distributions $p$ and $q$.

Although there is no domain discriminator in the model (see Figure 2), SemiGCL minimizes the divergence between the source and target graphs through minimax training on the entropy of unlabeled target nodes. Without any additional model modules, the domain discriminator can be regarded as a classifier that assigns a binary domain label to a node representation according to its entropy:

$f_{d}(\bm{e})=\begin{cases}1,&\text{if }H(f_{c}(\bm{e}))\geq\gamma,\\0,&\text{otherwise},\end{cases}$ (19)

where $f_{c}$ is the node classifier, $H$ is the entropy, and $\gamma$ is a threshold. As shown in Eq. 13, node classifier $f_{c}$ outputs the probability of class prediction. The $\mathcal{H}$-divergence in Eq. 18 can be rewritten as

$D_{\mathcal{H}}(p,q)\triangleq 2\mathop{\sup}\limits_{f_{d}\in\mathcal{H}}\left|\mathop{\Pr}\limits_{\bm{e}\sim p}\left[f_{d}(\bm{e})=1\right]-\mathop{\Pr}\limits_{\bm{e}\sim q}\left[f_{d}(\bm{e})=1\right]\right|=2\mathop{\sup}\limits_{f_{c}\in\mathcal{C}}\left|\mathop{\Pr}\limits_{\bm{e}\sim p}\left[H(f_{c}(\bm{e}))\geq\gamma\right]-\mathop{\Pr}\limits_{\bm{e}\sim q}\left[H(f_{c}(\bm{e}))\geq\gamma\right]\right|\leq 2\mathop{\sup}\limits_{f_{c}\in\mathcal{C}}\mathop{\Pr}\limits_{\bm{e}\sim q}\left[H(f_{c}(\bm{e}))\geq\gamma\right],$ (20)

where $\mathcal{C}$ is the hypothesis space of node classifier $f_{c}$. The final inequality assumes that $\mathop{\Pr}\limits_{\bm{e}\sim p}\left[H(f_{c}(\bm{e}))\geq\gamma\right]\leq\mathop{\Pr}\limits_{\bm{e}\sim q}\left[H(f_{c}(\bm{e}))\geq\gamma\right]$, which holds because the entropy of a source node is driven to a small value by minimizing the cross-entropy loss in Eq. 14.

The inequality in Eq. 20 shows that the domain divergence can be bounded by the proportion of target nodes whose entropy exceeds threshold $\gamma$. The upper bound is attained by finding a node classifier $f_{c}\in\mathcal{C}$ that achieves maximum entropy on the node representations of the target graph. Our goal is to find node representations that achieve the lowest domain divergence. Thus, the minimax objective is

$\min_{\bm{e}\sim q}\mathop{\max}\limits_{f_{c}\in\mathcal{C}}\mathop{\Pr}\limits_{\bm{e}\sim q}\left[H(f_{c}(\bm{e}))\geq\gamma\right].$ (21)

It is required to find the GNN encoders that achieve the minimum with respect to node representations $\bm{e}\sim q$. The above minimax objective corresponds to the objective functions in Eq. 16 and Eq. 17. In summary, the minimax entropy training adopted by SemiGCL can theoretically bound and reduce the domain divergence between the source and target graphs through entropy maximization and entropy minimization, respectively. Minimax entropy training thus aligns the data distributions of these two graphs, promoting knowledge transfer from the source graph to the target graph.

4 Experiments

In this section, our goal is to address the following research questions (RQs) by conducting extensive experiments.

  • 1.

    RQ1: How does SemiGCL compare with the state-of-the-art baselines in terms of performance on the SSDA tasks?

  • 2.

    RQ2: What are the advantages of integrating graph contrastive learning and minimax entropy training in the SemiGCL model?

  • 3.

    RQ3: How does the model performance vary depending on the number of labeled nodes per class in the target graph?

  • 4.

    RQ4: How do the hyperparameters affect the performance of the SemiGCL model?

4.1 Experimental Setup

4.1.1 Datasets

Following CDNE [5], ACDNE [12], and AdaGCN [15], we conduct experiments on five real-world information networks, including three citation networks (i.e., ACMv9, Citationv1, and DBLPv7) and two social networks (i.e., Blog1 and Blog2). Table 2 presents the dataset statistics. Three citation networks are acquired from ArnetMiner [40]. These citation networks are represented as undirected graphs where each node corresponds to a paper. An edge between nodes signifies a citation relationship between two papers. The paper title is used to extract the node attribute vector. The union attribute set of these three citation graphs consists of 6,775 node attributes. Each node is assigned to one of the five classes based on the paper’s research topic: “Database”, “Artificial Intelligence”, “Computer Vision”, “Information Security”, and “Networking”.

Blog1 and Blog2 are extracted from the BlogCatalog dataset [41]. In these social networks, the bloggers and their friendships are represented by nodes and undirected edges, respectively. The attribute vector of each node is derived from the blogger’s self-description. Based on the blogger’s interest group, each node is assigned to one of the six classes. Unlike the citation networks, Blog1 and Blog2 have larger average degrees, indicating that the nodes in these social networks have more neighbors.

We assess all methods on eight transfer tasks: C→A, D→A, A→C, D→C, A→D, C→D, B2→B1, and B1→B2, where ACMv9, Citationv1, DBLPv7, Blog1, and Blog2 are abbreviated as A, C, D, B1, and B2, respectively. The arrow symbol "→" indicates the transfer of knowledge from a source graph to a target graph. Table 3 presents the common attribute rate of each transfer task, which indicates the discrepancy between the attribute distributions of the source and target graphs. Since the two social networks originate from the BlogCatalog dataset, their attribute distributions are relatively close.

Table 2: Summary of datasets.

Dataset | #Nodes | #Edges | Average Degree | #Attributes | #Union Attributes | #Classes
ACMv9 | 8,661 | 13,590 | 3.13 | 5,571 | 6,775 | 5
Citationv1 | 8,724 | 14,798 | 3.39 | 5,379 | 6,775 | 5
DBLPv7 | 5,463 | 8,098 | 2.96 | 4,412 | 6,775 | 5
Blog1 | 2,300 | 33,471 | 29.11 | 4,121 | 4,185 | 6
Blog2 | 2,896 | 53,836 | 37.18 | 4,158 | 4,185 | 6
* "#Nodes" means the number of nodes. The union attribute set is shared among the three citation networks (6,775 attributes) and between the two social networks (4,185 attributes).


Table 3: Common attribute rate on each transfer task.

Transfer Task | #Common Attributes | #Union Attributes | Common Attribute Rate
C→A, A→C | 4,285 | 6,665 | 64.29%
D→A, A→D | 3,621 | 6,362 | 56.92%
D→C, C→D | 3,783 | 6,008 | 62.97%
B2→B1, B1→B2 | 4,094 | 4,185 | 97.83%

4.1.2 Baselines

Following [5, 6, 12, 15], the baseline approaches are of three categories: (I) classical GNN models for representation learning on a single graph, (II) typical domain adaptation approaches, and (III) state-of-the-art models designed for cross-graph node classification. The implementation details of our method and the baselines are provided in Appendix A.

  1. 1.

    GCN [31], GAT [32], GraphSAGE [30], DGI [33], and MVGRL [34]: They are popular GNN models for single-graph representation learning. GCN designs a simplified filter to conduct spectral graph convolution. GAT specifies learnable weights to the neighboring nodes when aggregating their information. Rather than utilizing the complete neighborhood, GraphSAGE produces the representation of a node by aggregating information from a sampled set of neighbors. DGI and MVGRL are GNN models that devise contrastive losses to maximize the mutual information between a graph’s local and global representations. These GNN models are adapted and evaluated under the cross-graph scenario to show whether the classical single-graph GNN models are sufficient for cross-graph node classification tasks.

  2. 2.

    DANN [21], WDGRL [17], and MME [11]: They are typical domain adaptation approaches with the assumption that the input data are independent and identically distributed, such as images and text. These approaches are evaluated to show whether they can be directly applied to conduct domain adaptation on graphs where the nodes are connected by edges, violating the i.i.d. assumption. To process graph data, they are implemented to only take node attributes as input, ignoring the edge connections.

  3. 3.

    CDNE [5], ACDNE [12], AdaGCN [15], UDA-GCN [6], ASN [13], AdaGIn [35], MFRReg [14], and GRADE [36]: They are state-of-the-art models proposed for cross-graph node classification. These methods are hereafter referred to as cross-graph models. CDNE extracts node representations with stacked autoencoders and incorporates the MMD loss [16] to align graph distributions. ACDNE captures topology proximity and attribute affinity with two independent feature extractors. UDA-GCN utilizes the dual graph convolutional network. The gradient reversal layer [21] is implemented in ACDNE and UDA-GCN to reduce domain discrepancy. AdaGCN adopts GCN [31] as the feature extractor and alleviates domain divergence by reducing the Wasserstein distance [17]. ASN separates the features shared across domains from the domain-private features. AdaGIn is a GNN-based model that matches the multimodal embedding distributions with conditional adversarial networks. MFRReg improves the transferability of UDA-GCN with spectral regularization. GRADE reduces the graph subtree discrepancy to align the graph data distributions.

4.2 Performance Study (RQ1)

In this section, the proposed method and the baselines are evaluated on the cross-graph node classification tasks. Under this cross-graph scenario, all models perform node classification tasks by leveraging information from both a fully labeled source graph and a partially labeled target graph. We randomly select five nodes per class (i.e., $n=5$) to form the labeled set of the target graph. Classification accuracy is reported to quantify the performance of a model in classifying the unlabeled nodes of the target graph. As a supplementary evaluation, we also conduct semi-supervised node classification on the partially labeled target graph. In such a single-graph scenario, the models are trained and evaluated with the target graph only.

4.2.1 Cross-graph Node Classification on Citation Graphs

The first category of baselines includes the classical GNN models proposed for single-graph representation learning, i.e., GCN, GAT, GraphSAGE, DGI, and MVGRL. As introduced in Appendix A, these GNN models are adapted to conduct cross-graph node classification. In Table 4, with the help of contrastive learning, MVGRL and DGI outperform the other GNN models in this category. However, due to the lack of capability to address the domain gap, the accuracies of these GNN models are lower than those of the cross-graph models in the third category of baselines.

In the second category of baselines (i.e., DANN, WDGRL, and MME), domain adaptation techniques are applied to reduce the domain discrepancy. However, the performance of these approaches is still inferior compared with the models in other categories. This reveals that the node representations generated in citation graphs lack discriminative power when only node attributes are utilized. Moreover, as the attribute distributions are distinct (refer to Table 3), incorporating the graph structure is necessary. Otherwise, the domain gap would remain significantly large and cannot be reduced effectively using these domain adaptation techniques.

In Table 4, SemiGCL ranks first on all six transfer tasks. For example, SemiGCL outperforms the best-performing baseline (i.e., MFRReg) by 2.12% on task D→C and 1.44% on task D→A. On average, the performance gain of SemiGCL over MFRReg is 1.06% on the citation graphs. This demonstrates the superiority of SemiGCL under the SSDA setting, achieved by incorporating graph contrastive learning and minimax entropy training in a principled way. Although CDNE and AdaGCN also investigate the adaptation scenario where the target graph is partially labeled, their performance on the SSDA tasks is even inferior to some cross-graph models designed for the UDA setting, including ACDNE, UDA-GCN, AdaGIn, and MFRReg. As introduced in Appendix A, the UDA cross-graph models are extended to the SSDA setting by optimizing the models with the cross-entropy loss of labeled source and target nodes. However, the learned models would be biased towards the source graph, since the labeled source nodes are far more numerous. SemiGCL adopts the minimax entropy training to address this issue and reduce the domain divergence. In addition, the GNN encoders naturally exploit both graph structure and node attributes to generate node representations. Graph contrastive learning maximizes the mutual information between representations encoded from the local and global views of a graph. By doing so, the GNN encoders are encouraged to capture the local and global information of a graph, thereby producing more discriminative node representations that promote node classification.

Table 4: Accuracy (%) of node classification on target citation graph.

Method | C→A | D→A | A→C | D→C | A→D | C→D | Average
GCN [31] | 72.93 (0.24) | 68.73 (0.78) | 77.57 (0.20) | 74.55 (0.48) | 72.79 (0.47) | 74.60 (0.36) | 73.53
GAT [32] | 71.32 (1.73) | 67.60 (1.99) | 77.99 (0.15) | 75.43 (0.63) | 73.55 (0.65) | 74.95 (0.24) | 73.47
GraphSAGE [30] | 65.55 (1.09) | 59.28 (1.76) | 69.22 (1.66) | 64.50 (1.48) | 64.73 (1.60) | 68.91 (0.46) | 65.37
DGI [33] | 73.05 (0.49) | 70.95 (0.44) | 78.08 (0.57) | 76.07 (0.34) | 72.26 (0.38) | 73.66 (0.56) | 74.01
MVGRL [34] | 73.15 (0.45) | 70.01 (0.36) | 77.77 (0.54) | 76.49 (0.42) | 74.07 (0.25) | 76.54 (0.42) | 74.67
DANN [21] | 52.87 (0.31) | 49.72 (0.68) | 55.77 (0.50) | 54.71 (0.60) | 56.95 (1.26) | 56.80 (0.49) | 54.47
WDGRL [17] | 52.33 (0.16) | 49.38 (0.42) | 55.46 (0.31) | 53.52 (0.57) | 56.35 (1.51) | 56.71 (0.57) | 53.96
MME [11] | 52.12 (0.52) | 49.03 (0.47) | 55.30 (0.26) | 52.94 (0.60) | 56.26 (0.22) | 57.68 (0.25) | 53.89
CDNE [5] | 76.55 (0.35) | 73.23 (0.53) | 80.00 (0.25) | 78.75 (0.66) | 74.76 (0.26) | 74.42 (0.19) | 76.29
ACDNE [12] | 77.97 (0.21) | 74.44 (0.33) | 83.70 (0.12) | 81.93 (0.40) | 77.68 (0.13) | 78.01 (0.30) | 78.96
AdaGCN [15] | 76.45 (0.50) | 74.29 (0.40) | 82.16 (0.45) | 80.49 (0.30) | 76.75 (0.34) | 77.17 (0.31) | 77.89
UDA-GCN [6] | 77.48 (0.54) | 75.00 (0.55) | 82.06 (0.15) | 80.43 (0.45) | 77.45 (0.24) | 78.55 (0.42) | 78.50
ASN [13] | 77.86 (0.74) | 75.19 (0.99) | 81.05 (0.82) | 80.06 (0.92) | 75.61 (0.68) | 75.58 (1.37) | 77.56
AdaGIn [35] | 77.08 (0.36) | 75.86 (0.72) | 83.17 (0.21) | 82.92 (0.29) | 76.82 (0.45) | 77.03 (0.35) | 78.81
MFRReg [14] | 78.03 (0.45) | 76.32 (0.75) | 82.78 (0.15) | 81.73 (0.29) | 77.93 (0.29) | 78.78 (0.35) | 79.26
GRADE [36] | 74.01 (0.52) | 70.00 (0.37) | 79.04 (0.31) | 76.20 (0.60) | 74.23 (0.42) | 76.30 (0.38) | 74.96
SemiGCL [ours] | 79.03 (0.38) | 77.76 (0.56) | 83.89 (0.20) | 83.85 (0.38) | 78.00 (0.60) | 79.36 (0.49) | 80.32
* A: ACMv9, C: Citationv1, D: DBLPv7. In each column, the highest classification accuracy is in boldface, and the second-best accuracy is underlined. The values in parentheses are standard deviations.

4.2.2 Cross-graph Node Classification on Social Graphs

In Table 6, SemiGCL performs best on transfer tasks B2→B1 and B1→B2. Compared with the best-performing baseline (i.e., ACDNE), the average performance gain of SemiGCL is 1.08%. The accuracies of SemiGCL on the social graphs are higher than its results on the citation graphs. Blog1 and Blog2 have closer attribute distributions, as indicated by the largest common attribute rate in Table 3. Therefore, the domain divergence between Blog1 and Blog2 is smaller than that between two citation graphs, which reduces the difficulty of the transfer tasks. Note that, in this case, the classical domain adaptation methods (i.e., DANN, WDGRL, and MME) also yield impressive performance.

Table 5: Accuracy (%) of node classification on target social graph (Categories I and II).

Task | GCN | GAT | GraphSAGE | DGI | MVGRL | DANN | WDGRL | MME
B2→B1 | 74.61 (1.29) | 66.33 (0.80) | 87.68 (0.24) | 70.42 (0.30) | 78.03 (0.73) | 86.81 (0.25) | 86.31 (0.28) | 87.67 (0.16)
B1→B2 | 74.24 (0.63) | 67.33 (1.41) | 86.62 (0.47) | 71.65 (0.37) | 79.88 (0.61) | 86.13 (0.33) | 85.64 (0.33) | 86.76 (0.55)
Average | 74.43 | 66.83 | 87.15 | 71.04 | 78.96 | 86.47 | 85.98 | 87.22
* B1: Blog1, B2: Blog2. The values in parentheses are standard deviations.

Table 6: Accuracy (%) of node classification on target social graph (Category III).

Task | CDNE | ACDNE | AdaGCN | UDA-GCN | ASN | AdaGIn | MFRReg | GRADE | SemiGCL
B2→B1 | 81.79 (0.77) | 91.79 (0.68) | 75.95 (0.75) | 69.28 (1.81) | 68.75 (2.38) | 91.67 (0.12) | 70.43 (1.83) | 74.67 (0.51) | 92.94 (0.21)
B1→B2 | 84.96 (0.72) | 91.93 (0.11) | 75.12 (0.47) | 70.28 (1.15) | 68.90 (1.33) | 90.98 (0.19) | 70.80 (0.88) | 74.61 (0.46) | 92.94 (0.23)
Average | 83.38 | 91.86 | 75.54 | 69.78 | 68.83 | 91.33 | 70.62 | 74.64 | 92.94
* B1: Blog1, B2: Blog2. In each row, the highest classification accuracy is in boldface, and the second-best accuracy is underlined. The values in parentheses are standard deviations.

GNN models commonly adopt smoothing operations to produce similar representations for nodes connected by edges, which may lead to the same class prediction for connected nodes in the subsequent classification task [42]. These smoothing operations rely on the homophily assumption that nodes connected by an edge tend to belong to the same class [31, 43]. In Figure 4, we select the nodes of Class 3, referred to as central nodes for clarity, and analyze the class membership of their neighboring nodes. In the citation graphs, the majority of neighbors belong to the same class as the central node. In the social graphs, however, less than 50% of the neighboring nodes share the class of the central node, meaning that more neighboring nodes violate the homophily assumption. Hence, the neighborhood information around a node is noisier in the social graphs.
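The neighborhood statistics reported in Figure 4 can be reproduced with a short routine of the following form; the adjacency-list and label formats are assumptions made for illustration rather than the format used in our data pipeline.

```python
import numpy as np
from collections import Counter


def neighbor_class_distribution(adj_lists, labels, target_class):
    """Average class histogram over the neighborhoods of all nodes of target_class.

    adj_lists: dict mapping a node id to an iterable of neighbor ids (assumed format).
    labels:    1-D integer array of class labels indexed by node id.
    Returns a vector whose c-th entry is the average fraction of neighbors in class c.
    """
    num_classes = int(labels.max()) + 1
    totals = np.zeros(num_classes)
    central_nodes = [v for v in adj_lists if labels[v] == target_class]
    for v in central_nodes:
        counts = Counter(labels[u] for u in adj_lists[v])
        degree = sum(counts.values())
        if degree == 0:
            continue
        for c, k in counts.items():
            totals[c] += k / degree              # per-node class fractions
    return totals / max(len(central_nodes), 1)   # averaged over central nodes


# Example with hypothetical data: nodes 0 and 3 belong to class 3.
# adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
# print(neighbor_class_distribution(adj, np.array([3, 0, 1, 3]), target_class=3))
```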

Table 5 and Table 6 show that this neighborhood noise results in the clear underperformance of some GNN models, including GCN, GAT, DGI, MVGRL, AdaGCN, UDA-GCN, ASN, MFRReg, and GRADE. The reason is that these models consider the complete neighborhood of a node when computing its representation. In contrast, the other GNN models (i.e., GraphSAGE, AdaGIn, and SemiGCL) improve performance by sampling a few neighbors to generate node representations, since neighborhood sampling introduces fewer neighbors whose class differs from that of the central node.


Figure 4: Node class distribution in the neighborhood of a node that belongs to Class 3. On every graph, the reported percentage of each class is obtained by averaging the statistics of all nodes that are of Class 3.
Figure 5: Loss over training epoch.

4.2.3 Loss Evolution during Model Training

Figure 5 shows the evolution of the loss values as the GNN encoders are trained over multiple epochs, using two transfer tasks, C→A and B2→B1, as examples. Referring to Eq. 16, the GNN encoders are optimized with the overall loss function $\mathcal{L}=\mathcal{L}_{\rm CE}+\lambda_{1}\mathcal{L}_{\rm CL}+\lambda_{2}\mathcal{L}_{\rm EN}$. The overall loss decreases consistently at a gradually slowing rate, indicating that the model converges during training. Similar decreasing trends are observed for the cross-entropy loss $\mathcal{L}_{\rm CE}$, the contrastive loss $\mathcal{L}_{\rm CL}$, and the entropy loss $\mathcal{L}_{\rm EN}$.
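For reference, a minimal sketch of one optimization step that combines the three loss terms of Eq. 16 is given below; the dummy parameter and losses merely stand in for the quantities computed by the model.

```python
import torch

# Dummy scalar losses standing in for L_CE, L_CL and L_EN on one minibatch; in
# practice they are computed from the labeled nodes, the two structural views,
# and the unlabeled target nodes, respectively.
param = torch.nn.Parameter(torch.ones(1))
loss_ce, loss_cl, loss_en = param.sum(), (2 * param).sum(), (3 * param).sum()

lambda_1, lambda_2 = 0.1, 0.1  # loss coefficients of Eq. 16
total_loss = loss_ce + lambda_1 * loss_cl + lambda_2 * loss_en

optimizer = torch.optim.Adam([param], lr=0.01)
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```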

4.2.4 Single-graph Node Classification

Since the target graph has five labeled nodes per class (i.e., $n=5$), we can also train the node classification models solely with the information available in the target graph. This scenario is referred to as single-graph node classification, as the source graph is not involved in model training. The trained models are evaluated on the unlabeled nodes of the target graph. Although this work focuses on cross-graph node classification under the SSDA setting, we report classification accuracies under the single-graph scenario in Table 7 as a supplementary evaluation. Note that some baselines in Table 4 are not applicable in the single-graph scenario, including the domain adaptation approaches (Category II) and the cross-graph models (Category III). To conduct node classification in the single-graph scenario, we adapt SemiGCL by optimizing the model without the entropy loss $\mathcal{L}_{\rm EN}$ (see Eq. 16 and Eq. 17).

In Table 7, SemiGCL outperforms the classical GNN models by large margins. On average, the improvements of SemiGCL over MVGRL are 2.19% on the citation graphs and 6.23% on the social graphs, demonstrating its strength in generating discriminative node representations for a single graph. SemiGCL attains an average accuracy of 74.36% on the citation graphs and 72.58% on the social graphs, which is much lower than its cross-graph results of 80.32% (Table 4) and 92.94% (Table 6), respectively. As an example, consider ACMv9 as the target graph: SemiGCL achieves 79.03% on transfer task C→A, exceeding its single-graph result (i.e., 70.91%) by 8.12%. These observations motivate us to transfer knowledge from a labeled source graph so that the classification performance on the target graph can be improved. Note that MVGRL and DGI also incorporate contrastive learning, resulting in higher accuracies than those of GCN, GAT, and GraphSAGE.

Table 7: Accuracy (%) of single-graph node classification.

Target Graph | GCN [31] | GAT [32] | GraphSAGE [30] | DGI [33] | MVGRL [34] | SemiGCL [ours]
A | 48.17 | 48.91 | 43.24 | 70.42 | 68.34 | 70.91
C | 58.94 | 59.07 | 49.24 | 71.95 | 75.99 | 79.66
D | 53.57 | 54.37 | 49.28 | 67.50 | 72.18 | 72.50
Average | 53.56 | 54.12 | 47.25 | 69.96 | 72.17 | 74.36
B1 | 61.53 | 55.37 | 60.55 | 62.53 | 67.81 | 72.14
B2 | 61.28 | 57.31 | 62.69 | 61.61 | 64.88 | 73.02
Average | 61.41 | 56.34 | 61.62 | 62.07 | 66.35 | 72.58
* A: ACMv9, C: Citationv1, D: DBLPv7, B1: Blog1, B2: Blog2. In each row, the highest classification accuracy is in boldface, and the second-best accuracy is underlined.

4.3 Ablation Study (RQ2)

In this section, we study four key components of the SemiGCL model: contrastive learning (CL), global view (GV), local view (LV), and domain adaptation (DA). Each component is removed from SemiGCL individually to investigate its contribution, yielding the following model variants.

  • SemiGCL-CL: A variant of SemiGCL without contrastive learning (CL). The contrastive loss is removed from the overall objective function (i.e., Eq. 16).
  • SemiGCL-GV: A variant of SemiGCL without learning node representations from the global view (GV). The node embedding vector in Eq. 8 is the representation of the local view.
  • SemiGCL-LV: A variant of SemiGCL without learning node representations from the local view (LV). The embedding vector of a node (see Eq. 8) is the representation of the global view.
  • SemiGCL-DA: A variant of SemiGCL without domain adaptation (DA). We remove the entropy loss from the objective functions Eq. 16 and Eq. 17.

Note that, in SemiGCL-GV and SemiGCL-LV, the embedding vector is extracted from only one structural view (i.e., the local view or the global view), so the contrastive loss (Eq. 12) cannot be calculated to optimize the model. In this section, the labeled set of the target graph is identical to that of Section 4.2: each class in the target graph has five labeled nodes, i.e., $n=5$.

4.3.1 Performance of Model Variants

Table 8 shows that removing any single component leads to performance drops on all transfer tasks, revealing that each component contributes to the performance of SemiGCL. In particular, with domain adaptation applied, the average classification accuracy increases by 1.29% on the citation graphs and 1.21% on the social graphs. Therefore, minimax entropy training consistently improves the model performance.

Table 8: Accuracy (%) of SemiGCL variants on the target graph.

Task | SemiGCL | SemiGCL-CL | SemiGCL-GV | SemiGCL-LV | SemiGCL-DA
C→A | 79.03 | 74.81 | 67.26 | 74.45 | 78.29
D→A | 77.76 | 70.04 | 63.54 | 71.04 | 75.42
A→C | 83.89 | 80.43 | 74.05 | 79.02 | 82.82
D→C | 83.85 | 78.96 | 72.10 | 77.02 | 82.41
A→D | 78.00 | 74.05 | 68.68 | 72.34 | 76.73
C→D | 79.36 | 76.25 | 70.16 | 75.58 | 78.53
Average | 80.32 | 75.76 | 69.30 | 74.91 | 79.03
B2→B1 | 92.94 | 92.56 | 89.76 | 88.38 | 91.74
B1→B2 | 92.94 | 92.10 | 89.12 | 89.14 | 91.72
Average | 92.94 | 92.33 | 89.44 | 88.76 | 91.73
* CL: contrastive learning, GV: global view, LV: local view, DA: domain adaptation. The sign "-" indicates the removal of one component. The highest accuracy in each row is in boldface.

On average, contrasting the local and global views increases the classification accuracy by 4.56% on the citation graphs and 0.61% on the social graphs. Compared with the citation graphs, the social graphs have fewer nodes and larger average degrees (see Table 2). In a small-scale and dense social graph, the nodes are "closer" to each other; that is, it takes only a short random walk to reach one node from another. Therefore, if diffusion is applied to a social graph, the edges established in the augmented graph are likely to connect nodes that already co-occur within short random walks in the original graph. The information of these nodes can already be accessed through neighborhood aggregation based on the adjacency matrix (i.e., the local view). In such a case, the local-view and global-view representations already agree to a certain degree, which makes contrastive learning less powerful on the social graphs.
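For reference, the diffusion discussed above can be instantiated as a personalized PageRank (PPR) matrix with teleport probability $\alpha$; the dense computation below is an illustrative sketch (adding self-loops and using symmetric normalization are assumed choices), and large graphs would require sparse approximation and sparsification of the result.

```python
import numpy as np


def ppr_diffusion(adj: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Personalized PageRank diffusion with teleport probability alpha.

    Computes S = alpha * (I - (1 - alpha) * D^{-1/2} (A + I) D^{-1/2})^{-1},
    where the self-loops and the symmetric normalization are illustrative choices.
    """
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                       # add self-loops (assumed variant)
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = deg_inv_sqrt @ a_hat @ deg_inv_sqrt  # symmetric normalization
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * a_norm)


# Example on a 4-node path graph: the entries of S decay with graph distance, so the
# diffusion-augmented graph (global view) mainly connects nodes reachable by short
# random walks, as discussed above.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(np.round(ppr_diffusion(adj, alpha=0.1), 3))
```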

On the citation graphs, removing the global-view representation leads to an 11.02% decrease in average accuracy, which is much more significant than the corresponding decrease on the social graphs (i.e., 3.50%). Compared with the social graphs, a citation graph is sparser and larger in scale (see Table 2), so it is more difficult to capture its global information with local-view neighborhood aggregation alone. In this case, the global information extracted from the global view is essential for representation learning.

4.3.2 Visualization of Node Representations

Figure 6: Node representation visualization with t-SNE. Circles and crosses represent the source and target graph nodes, respectively. Five colors distinguish different paper classes.

In Figure 6, we plot the t-SNE [44] visualization of node representations produced by the SemiGCL variants. Two transfer tasks, C→A and D→A, are taken as examples, with ACMv9 as the target graph. Node representations produced by SemiGCL exhibit the most favorable clustering structure, with a clearer separation between the clusters of different classes. Additionally, node representations from the source and target graphs are mostly close to each other when they belong to the same class. The t-SNE visualization supports that SemiGCL achieves both discriminative node representations and domain alignment. With contrastive learning removed (see SemiGCL-CL), nodes of different classes are more likely to be mixed, revealing that contrastive learning improves the discriminability of node representations. Similar observations hold for the variant without domain adaptation (see SemiGCL-DA).
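A visualization of this kind can be produced with a few lines of scikit-learn and matplotlib, as sketched below; the random embeddings and labels are placeholders for the node representations extracted from a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder embeddings and labels; in practice these are the source- and
# target-graph node representations produced by a trained model, with 5 classes.
rng = np.random.default_rng(0)
src_emb, tgt_emb = rng.normal(size=(200, 128)), rng.normal(size=(200, 128))
src_y, tgt_y = rng.integers(0, 5, 200), rng.integers(0, 5, 200)

xy = TSNE(n_components=2, random_state=0).fit_transform(np.vstack([src_emb, tgt_emb]))
n_src = len(src_emb)
plt.scatter(xy[:n_src, 0], xy[:n_src, 1], c=src_y, cmap="tab10", marker="o", s=10)  # source: circles
plt.scatter(xy[n_src:, 0], xy[n_src:, 1], c=tgt_y, cmap="tab10", marker="x", s=10)  # target: crosses
plt.show()
```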

4.4 Effect of Target Labeled Number (RQ3)

In this section, we investigate the model performance when the number of labeled nodes per class in the target graph (i.e., $n$) varies from 0 to 10, that is, $n\in\{0,5,10\}$. When $n=0$, the target graph is completely unlabeled, which corresponds to the UDA setting. When there are five labeled nodes per class (i.e., $n=5$), the labeled set of the target graph is the same as the one in Section 4.2. As introduced in Section 2, CDNE and AdaGCN are cross-graph models that investigate the adaptation scenario with a partially labeled target graph. In addition to CDNE and AdaGCN, we also compare with three competitive UDA cross-graph models, i.e., ACDNE, AdaGIn, and MFRReg.

Figure 7 shows the average accuracy of each model on the six transfer tasks of the citation graphs, i.e., C→A, D→A, A→C, D→C, A→D, and C→D. The SemiGCL variant without domain adaptation (i.e., SemiGCL-DA) is also included. SemiGCL is the top-performing model on the SSDA tasks, i.e., $n=5$ and $n=10$. A significant performance lift is observed for SemiGCL from $n=0$ to $n=5$; SemiGCL can thus effectively exploit a few labeled target nodes to improve model performance. In contrast, the accuracies of the UDA approaches (i.e., ACDNE, AdaGIn, and MFRReg) remain stable as the number of labeled target nodes increases, and the accuracies of CDNE and AdaGCN increase only slightly from $n=0$ to $n=5$. This indicates that these five cross-graph models lack the capability to exploit the target labels well. Note that, under the UDA setting ($n=0$), SemiGCL shows no improvement over SemiGCL-DA, since the domain adaptation component of SemiGCL (i.e., minimax entropy training) is devised for the SSDA tasks.

Figure 7: Average accuracy of the six transfer tasks on the citation graphs. Three numbers of labeled target nodes per class (0, 5 and 10) are tested for comparison among six models.

4.5 Hyperparameter Sensitivity Analysis (RQ4)

In this section, we analyze the influence of four key hyperparameters on the model performance: the embedding dimension $l$, the contrastive learning coefficient $\lambda_{1}$, the teleport probability $\alpha$, and the temperature parameter $T$. The aim is to gain insights into how these hyperparameters should be configured. When investigating a specific hyperparameter, the remaining hyperparameters are set to the default values provided in Appendix A. Note that, in this section, the labeled set of the target graph is identical to the one in Section 4.2.


Figure 8: Classification accuracy varied with each of the four different hyperparameters.

Figure 8 presents the classification accuracies of transfer tasks on the citation and social graphs. The embedding dimension $l$ is the dimension of a node representation vector learned by SemiGCL. The model achieves the highest accuracy with an embedding dimension of 128 on the citation graph tasks (i.e., C→A and D→A). On the social graph tasks (i.e., B2→B1 and B1→B2), embedding dimensions $l\in\{128,256,512\}$ all lead to fairly good performance. The contrastive learning coefficient $\lambda_{1}$ is the weight of the contrastive loss when optimizing the GNN encoders. As discussed in Section 4.3.1, contrastive learning has a more significant impact on the citation graphs; therefore, the model performance is more sensitive to $\lambda_{1}$ on transfer tasks C→A and D→A. When $\lambda_{1}=0.1$, the best accuracy is observed on all four transfer tasks. The teleport probability $\alpha$ plays an important role in constructing the diffusion matrix, and its optimal value is around $\alpha=0.1$. The temperature parameter $T$ scales the output of the node classifier. When $T=5.0$, the classification accuracy peaks on transfer tasks B2→B1 and B1→B2, whereas a larger temperature (i.e., $T=20.0$) is more desirable on the citation graph tasks.
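To make the role of the temperature concrete, the sketch below shows the typical structure of a scaled cosine-similarity classifier: node embeddings are L2-normalized, compared with normalized class weight vectors, and the resulting cosine logits are scaled before the softmax. Whether the scale corresponds to $T$ or $1/T$ is a convention we deliberately leave open here, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosineClassifier(nn.Module):
    """Cosine-similarity classifier whose logits are scaled before the softmax."""

    def __init__(self, emb_dim: int, num_classes: int, scale: float):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, emb_dim))
        self.scale = scale  # related to the temperature T; the exact convention is left open

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = F.normalize(z, dim=1)            # L2-normalize node embeddings
        w = F.normalize(self.weight, dim=1)  # L2-normalize class weight vectors
        return self.scale * (z @ w.t())      # scaled cosine similarities


# Example usage with hypothetical 128-dimensional embeddings and 5 classes.
clf = CosineClassifier(emb_dim=128, num_classes=5, scale=20.0)
probs = F.softmax(clf(torch.randn(4, 128)), dim=1)
```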

5 Discussion

SSDA on graphs presents three significant challenges: domain divergence between the source and target graphs, model bias towards the source graph, and non-i.i.d. graph-structured data. Our method, SemiGCL, showcases superior performance on graph SSDA tasks through the principled integration of minimax entropy training and graph contrastive learning. Minimax entropy training plays a crucial role in mitigating domain divergence and alleviating model bias with the entropy loss of unlabeled target nodes. Meanwhile, graph contrastive learning captures both local and global structural information within a graph.

The experiments are conducted on the citation and social graphs, where SemiGCL's performance hinges on the effective functioning of minimax entropy training and graph contrastive learning. On both types of graphs, minimax entropy training consistently improves SemiGCL's performance. Compared with the social graphs, a citation graph is sparser and larger in scale, which makes it harder to capture global information through local-view neighborhood aggregation alone. Consequently, contrastive learning between the local and global views yields a larger improvement on the citation graphs, whereas it is less effective on a small-scale and dense social graph. Additionally, as minimax entropy training is specifically designed for the SSDA tasks, it does not yield an improvement under the UDA setting. Hyperparameters such as the embedding dimension, contrastive learning coefficient, teleport probability, and temperature parameter are also critical to the performance of SemiGCL.

6 Conclusion

A novel GNN-based model named SemiGCL has been proposed to tackle the semi-supervised domain adaptation problem on graphs. SemiGCL constructs two GNN encoders to extract node representations from two structural views of a graph, i.e., the original graph (local view) and the diffusion-augmented graph (global view). Graph contrastive learning is employed to maximize the mutual information between representations learned from the two structural views. By doing so, the GNN encoders are encouraged to encode rich local and global information in a graph. To mitigate domain discrepancy, the cosine similarity-based node classifier and the GNN encoders are trained in an adversarial manner using the entropy loss of unlabeled target nodes. Experimental results on real-world information networks demonstrate that our method surpasses the state-of-the-art baselines on the benchmark SSDA tasks.

This study considers only a single labeled source graph. Another practical scenario involves several labeled source graphs with diverse data distributions. One potential avenue for future research is the development of methods that select the optimal source graph from the available options. Using the optimal source graph, a single-source domain adaptation model [45] could achieve the best performance on a particular target graph that lacks labels. Another promising direction for future work is to develop multi-source domain adaptation approaches [46] that facilitate the transfer of knowledge from multiple labeled source graphs to the target graph.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants (62203137, 62362020, 62102124), in part by the Innovation and Technology Commission (ITC) of Hong Kong under Grant MRP/029/20X, in part by the Research Grants Council (RGC) of Hong Kong under Grants (17209021, 17207020, 17210023), and in part by the Alexander von Humboldt Foundation, Germany.

Appendix A Implementation Details

The proposed SemiGCL is implemented in PyTorch; the source code is publicly available at https://github.com/JiarenX/SemiGCL. In Table 1, we report the main hyperparameters selected for each transfer task. The GNN encoders, $f_{A}$ and $f_{P}$, have two-layer structures, with layer dimensions of 1024 and 64 in sequence (i.e., "1024/64" in Table 1). The cosine similarity-based classifier, $f_{c}$, is a logistic regression with $L_{2}$ normalization applied to its input. The output of the node classifier is scaled by a temperature parameter $T$ and activated by a softmax function. The SemiGCL model is trained over shuffled minibatches using the Adam optimizer. We select the initial learning rate $\eta_{0}$ from $\{0.005, 0.010, 0.015\}$. As the training progress $p$ increases linearly from 0 to 1, we follow [21] to decay the learning rate as $\eta_{p}=\eta_{0}\left(1+10p\right)^{-0.75}$. $L_{2}$ regularization is imposed on the trainable parameters to prevent overfitting, with a weight decay of $5\times 10^{-5}$. In the overall objective function (Eq. 16), the contrastive learning coefficient $\lambda_{1}$ is set to 0.1. The domain adaptation coefficient $\lambda_{2}$ starts from 0 and progressively increases as $\lambda_{2}=2\left(1+\exp\left(-10p\right)\right)^{-1}-1$, with its maximum value capped at 0.1. In Eq. 17, the entropy coefficient $\lambda_{3}$ is 1.0 for the citation graphs and 2.0 for the social graphs.
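The two schedules described above can be transcribed directly as follows, assuming the training progress $p$ is the fraction of completed training iterations.

```python
import math


def decayed_learning_rate(eta_0: float, p: float) -> float:
    """Learning-rate decay eta_p = eta_0 * (1 + 10 p)^(-0.75), with p in [0, 1]."""
    return eta_0 * (1.0 + 10.0 * p) ** (-0.75)


def domain_adaptation_coefficient(p: float, max_value: float = 0.1) -> float:
    """lambda_2 = 2 / (1 + exp(-10 p)) - 1, increased from 0 and capped at max_value."""
    return min(2.0 / (1.0 + math.exp(-10.0 * p)) - 1.0, max_value)


# Example: values at the start, middle, and end of training with eta_0 = 0.010.
for p in (0.0, 0.5, 1.0):
    print(round(decayed_learning_rate(0.010, p), 5), round(domain_adaptation_coefficient(p), 5))
```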

Table 1: Main hyperparameters for the SemiGCL model.

Transfer Task | η₀ | Epoch | Batch Size | Weight Decay | Layer Dimension | s* | α | T
C→A, D→A, A→C, D→C | 0.010 | 30 | 128 | 5×10⁻⁵ | 1024/64 | {20, 20} | 0.10 | 20.0
A→D, C→D | 0.005 | 100 | 256 | 5×10⁻⁵ | 1024/64 | {10, 10} | 0.05 | 20.0
B2→B1, B1→B2 | 0.010 | 50 | 64 | 5×10⁻⁵ | 1024/64 | {30, 30} | 0.10 | 15.0
* A set $\bm{s}=\{s_{1},\ldots,s_{K}\}$ consists of the neighborhood sample size $s_{k}$ at every search depth $k\in\{1,\ldots,K\}$.

The baselines are implemented following the original papers and evaluated with the same protocol as SemiGCL. We adapt the baselines in the first category (i.e., GCN, GAT, GraphSAGE, DGI, and MVGRL) so that they can conduct cross-graph node classification (Section 4.2.1 and Section 4.2.2). The cross-entropy loss is calculated with the labeled nodes in the source and target graphs. For DGI and MVGRL, the contrastive loss is computed independently for each graph (i.e., the source graph or the target graph), and the contrastive losses on both graphs are then added to optimize the model. After training, the GNN models in the first category are evaluated on the unlabeled nodes of the target graph. In single-graph node classification (Section 4.2.4), these GNN models are trained and evaluated only with the partially labeled target graph under the semi-supervised learning scenario. DANN, WDGRL, and MME are originally designed to process images with CNNs, such as AlexNet and ResNet34. To encode graph data, their feature extractors are replaced by multilayer perceptrons (MLPs) that take node attributes as input. To adapt DANN and WDGRL to the SSDA tasks, we calculate the cross-entropy loss with the labeled nodes in the source and target graphs.

The cross-graph models are adapted for the SSDA setting. AdaGCN [15] is implemented with PyTorch based on the original paper. Although CDNE and AdaGCN also consider the scenario in which the target graph is partially labeled, they assume that a percentage of target nodes is randomly selected to be labeled. We modify the CDNE code (https://github.com/shenxiaocam/CDNE) and the AdaGCN code to assign the same number of labeled nodes to each class in the target graph. The UDA methods, i.e., ACDNE (https://github.com/shenxiaocam/ACDNE), UDA-GCN (https://github.com/mandy976/UDAGCN), ASN (https://github.com/yuntaodu/ASN), AdaGIn (https://github.com/JiarenX/AdaGIn), MFRReg (https://github.com/Shen-Lab/GDA-SpecReg), and GRADE (https://github.com/jwu4sml/GRADE), are modified for the SSDA setting by incorporating the cross-entropy loss of the labeled nodes in the target graph. The hyperparameters of these cross-graph models are initialized with the values recommended in their papers or official implementations. We have tuned some of their key hyperparameters to improve their performance. For example, the GNN-based models (i.e., AdaGCN, UDA-GCN, ASN, AdaGIn, MFRReg, and GRADE) are tested with up to three GNN layers, and the dimension of each GNN layer is selected from $\{128, 256, 512, 1024, 2048\}$.

In all models except DGI and MVGRL, the embedding dimension $l$ is set to 128 following ACDNE. Unlike the other GNN models, which conduct node classification in an end-to-end manner, DGI and MVGRL first generate node embeddings and then train a logistic regression classifier for node classification. We find that an embedding dimension of 512 improves the performance of DGI and MVGRL. For each method, we report the mean classification accuracy over five runs with different random seeds. Note that, in multiclass classification, according to the scikit-learn documentation (https://scikit-learn.org/stable/modules/model_evaluation.html), the Micro-Precision, Micro-Recall, and Micro-F1 scores are all identical to the classification accuracy.
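This equivalence can be verified directly with scikit-learn on any single-label multiclass prediction; the labels below are arbitrary illustrative values.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 1, 2, 2, 1, 0]   # arbitrary illustrative labels
y_pred = [0, 2, 2, 2, 1, 1]

print(accuracy_score(y_true, y_pred))                       # 0.666...
print(precision_score(y_true, y_pred, average="micro"))     # equals accuracy
print(recall_score(y_true, y_pred, average="micro"))        # equals accuracy
print(f1_score(y_true, y_pred, average="micro"))            # equals accuracy
```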

References

  • [1] W. L. Hamilton, R. Ying, J. Leskovec, Representation learning on graphs: Methods and applications, arXiv:1709.05584 [cs] (Apr. 2018).
  • [2] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems 32 (1) (2021) 4–24.
  • [3] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, P. Yu, Graph self-supervised learning: A survey, IEEE Transactions on Knowledge and Data Engineering (2022).
  • [4] P. Cui, X. Wang, J. Pei, W. Zhu, A survey on network embedding, IEEE Transactions on Knowledge and Data Engineering 31 (5) (2019) 833–852.
  • [5] X. Shen, Q. Dai, S. Mao, F.-L. Chung, K.-S. Choi, Network together: Node classification via cross-network deep network embedding, IEEE Transactions on Neural Networks and Learning Systems 32 (5) (2021) 1935–1948.
  • [6] M. Wu, S. Pan, C. Zhou, X. Chang, X. Zhu, Unsupervised domain adaptive graph convolutional networks, in: Proceedings of The Web Conference 2020, New York, NY, USA, 2020, pp. 1457–1467.
  • [7] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, J. Leskovec, Strategies for pre-training graph neural networks, in: Proceedings of International Conference on Learning Representations, 2020.
  • [8] G. Wilson, D. J. Cook, A survey of unsupervised deep domain adaptation, ACM Transactions on Intelligent Systems and Technology 11 (5) (2020) 1–46.
  • [9] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (1) (2020) 43–76.
  • [10] K. Li, C. Liu, H. Zhao, Y. Zhang, Y. Fu, ECACL: A holistic framework for semi-supervised domain adaptation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8578–8587.
  • [11] K. Saito, D. Kim, S. Sclaroff, T. Darrell, K. Saenko, Semi-supervised domain adaptation via minimax entropy, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 8050–8058.
  • [12] X. Shen, Q. Dai, F.-l. Chung, W. Lu, K.-S. Choi, Adversarial deep network embedding for cross-network node classification, Proceedings of the AAAI Conference on Artificial Intelligence 34 (03) (2020) 2991–2999.
  • [13] X. Zhang, Y. Du, R. Xie, C. Wang, Adversarial separation network for cross-network node classification, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, New York, NY, USA, 2021, pp. 2618–2626.
  • [14] Y. You, T. Chen, Z. Wang, Y. Shen, Graph domain adaptation via theory-grounded spectral regularization, in: Proceedings of International Conference on Learning Representations, 2023.
  • [15] Q. Dai, X.-M. Wu, J. Xiao, X. Shen, D. Wang, Graph transfer learning via adversarial domain adaptation with graph convolution, IEEE Transactions on Knowledge and Data Engineering 35 (5) (2023) 4908–4922.
  • [16] M. Long, J. Wang, G. Ding, J. Sun, P. S. Yu, Transfer feature learning with joint distribution adaptation, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2200–2207.
  • [17] J. Shen, Y. Qu, W. Zhang, Y. Yu, Wasserstein distance guided representation learning for domain adaptation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [18] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, J.-B. Huang, A closer look at few-shot classification, in: Proceedings of International Conference on Learning Representations, 2019.
  • [19] M. Long, Y. Cao, J. Wang, M. I. Jordan, Learning transferable features with deep adaptation networks, in: Proceedings of the 32nd International Conference on Machine Learning, ICML’15, Lille, France, 2015, pp. 97–105.
  • [20] M. Wang, W. Deng, Deep face recognition with clustering based domain adaptation, Neurocomputing 393 (2020) 1–14.
  • [21] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, V. Lempitsky, Domain-adversarial training of neural networks, The Journal of Machine Learning Research 17 (1) (2016) 2096–2030.
  • [22] H. Wang, J. Tian, S. Li, H. Zhao, F. Wu, X. Li, Structure-conditioned adversarial learning for unsupervised domain adaptation, Neurocomputing 497 (2022) 216–226.
  • [23] P. Zhao, W. Zang, B. Liu, Z. Kang, K. Bai, K. Huang, Z. Xu, Domain adaptation with feature and label adversarial networks, Neurocomputing 439 (2021) 294–301.
  • [24] Z. Xu, H. He, G.-H. Lee, B. Wang, H. Wang, Graph-relational domain adaptation, in: Proceedings of International Conference on Learning Representations, 2022.
  • [25] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
  • [26] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.
  • [27] J. Xiao, Q. Dai, X. Xie, J. Lam, K.-W. Kwok, Adversarially regularized graph attention networks for inductive learning on partially labeled graphs, Knowledge-Based Systems 268 (May 2023).
  • [28] Z. Zhang, H. Yang, J. Bu, S. Zhou, P. Yu, J. Zhang, M. Ester, C. Wang, ANRL: Attributed network representation learning via deep neural networks, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, AAAI Press, Stockholm, Sweden, 2018, pp. 3155–3161.
  • [29] M. Chen, Z. Wei, Z. Huang, B. Ding, Y. Li, Simple and deep graph convolutional networks, in: Proceedings of the 37th International Conference on Machine Learning, 2020, pp. 1725–1735.
  • [30] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2017, pp. 1025–1035.
  • [31] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: Proceedings of International Conference on Learning Representations, Toulon, France, 2017.
  • [32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, Graph attention networks, in: Proceedings of International Conference on Learning Representations, 2018.
  • [33] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, R. D. Hjelm, Deep graph infomax, in: Proceedings of International Conference on Learning Representations, 2019.
  • [34] K. Hassani, A. H. Khasahmadi, Contrastive multi-view representation learning on graphs, in: Proceedings of the International Conference on Machine Learning, 2020, pp. 4116–4126.
  • [35] J. Xiao, Q. Dai, X. Xie, Q. Dou, K.-W. Kwok, J. Lam, Domain adaptive graph infomax via conditional adversarial networks, IEEE Transactions on Network Science and Engineering 10 (1) (2023) 35–52.
  • [36] J. Wu, J. He, E. Ainsworth, Non-IID transfer learning on graphs, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
  • [37] J. Klicpera, S. Weißenberger, S. Günnemann, Diffusion improves graph learning, in: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019, pp. 13366–13378.
  • [38] K. Xu, W. Hu, J. Leskovec, S. Jegelka, How powerful are graph neural networks?, in: Proceedings of International Conference on Learning Representations, 2019.
  • [39] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, Analysis of representations for domain adaptation, in: Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, MIT Press, Cambridge, MA, USA, 2006, pp. 137–144.
  • [40] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, Z. Su, ArnetMiner: Extraction and mining of academic social networks, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 990–998.
  • [41] J. Li, X. Hu, J. Tang, H. Liu, Unsupervised streaming feature selection in social media, in: Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM ’15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1041–1050.
  • [42] Q. Li, Z. Han, X.-M. Wu, Deeper insights into graph convolutional networks for semi-supervised learning, in: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, USA, 2018, pp. 3538–3545.
  • [43] Y. Ma, X. Liu, N. Shah, J. Tang, Is homophily a necessity for graph neural networks?, in: Proceedings of International Conference on Learning Representations, 2022.
  • [44] L. v. d. Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (Nov) (2008) 2579–2605.
  • [45] S. Zhao, X. Yue, S. Zhang, B. Li, H. Zhao, B. Wu, R. Krishna, J. E. Gonzalez, A. L. Sangiovanni-Vincentelli, S. A. Seshia, K. Keutzer, A review of single-source deep unsupervised visual domain adaptation, IEEE Transactions on Neural Networks and Learning Systems 33 (2) (2022) 473–493.
  • [46] X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, B. Wang, Moment matching for multi-source domain adaptation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1406–1415.