UniKG: A Benchmark and Universal Embedding for Large-Scale Knowledge Graphs
Abstract
Irregular data in the real world are usually organized as heterogeneous graphs (HGs) consisting of multiple types of nodes and edges. Exploring useful knowledge from real-world data requires both large-scale encyclopedic HG datasets and effective learning methods, neither of which has been well investigated. In this paper, we construct a large-scale HG benchmark dataset named UniKG from Wikidata to facilitate knowledge mining and heterogeneous graph representation learning. Overall, UniKG contains more than 77 million multi-attribute entities and 2,000 diverse association types, significantly surpassing the scale of existing HG datasets. To perform effective learning on the large-scale UniKG, we take two key measures: (i) a semantic alignment strategy for multi-attribute entities, which projects the feature descriptions of multi-attribute nodes into a common embedding space to facilitate node aggregation in a large receptive field; and (ii) a novel plug-and-play anisotropy propagation module (APM) that learns effective multi-hop anisotropy propagation kernels and extends methods for large-scale homogeneous graphs to heterogeneous graphs. These two strategies enable efficient information propagation among a tremendous number of multi-attribute entities while adaptively mining multi-attribute associations through multi-hop aggregation in large-scale HGs. We set up a node classification task on our UniKG dataset and evaluate multiple baseline methods constructed by embedding our APM into large-scale homogeneous graph learning methods. Our UniKG dataset and the baseline code have been released at https://github.com/Yide-Qiu/UniKG.
Introduction
Heterogeneous Graphs (HGs), also known as Heterogeneous Information Networks (HINs), consist of multiple types of nodes and links. Compared with homogeneous graphs, the heterogeneity in multi-attribute nodes and graph topology lets HGs carry richer semantics and makes them more suitable for characterizing a variety of complex real-world systems, such as academic networks and social networks. For this reason, methods for representation learning on HGs have drawn increasing attention in recent years and have facilitated numerous application tasks in diverse domains, including recommendation systems (Gao et al. 2022; Wu et al. 2022), malware detection systems (Zhang et al. 2022a), and healthcare systems (Cao et al. 2022).

Existing HG-related works involve both the design of effective learning methods and the construction of HG datasets, since an encyclopedic HG dataset is crucial to promote HG learning. Regarding HG dataset construction, existing research (Sharma et al. 2022; Chen, Razniewski, and Weikum 2023; Kubeša and Straka 2023; Whitehouse et al. 2023; Ahrabian et al. 2023; Stranisci et al. 2022; Luo et al. 2023) mostly focuses on combining specific entities and relationships within knowledge graphs (KGs) to facilitate downstream tasks in the corresponding domain. Existing datasets generally consist of thousands of nodes and dozens of relationships applied to a specific domain such as finance, recommendation systems, or social networks. For learning methods on HGs, depending on how semantic structure is captured, existing works can be broadly classified into aggregation-based methods (Schlichtkrull et al. 2018; Zhang et al. 2019; Xu et al. 2019) and meta-path-based methods (Wang et al. 2020; Huang et al. 2016; Sun et al. 2011; Chang et al. 2022). Aggregation-based methods primarily concentrate on iteratively aggregating neighbor information to update node representations. In contrast, meta-path-based methods focus on combining information from distinct meta-paths to learn node representations.

Although existing works on HGs have achieved notable success, several issues remain. The most important one is the lack of a large-scale HG dataset. Existing HG datasets are almost all small-scale, with just thousands of nodes and dozens of relation types limited to specific knowledge domains. This greatly limits representation learning and knowledge extraction, so there is an urgent need for a large-scale HG dataset with encyclopedic knowledge. Furthermore, regarding effective learning on large-scale HG datasets, another issue is the inapplicability of existing HG learning methods. On one hand, existing aggregation-based HG learning methods incur excessive memory overhead when performing multi-hop aggregation over very large receptive fields, and may therefore suffer from out-of-memory problems during model training. On the other hand, existing meta-path-based works apply to the case where only one type of association exists between two types of nodes. When the number of association types between two nodes increases significantly, the number of meta-paths may grow explosively (see the example in Figure 1). This causes great difficulty for topology perception if only a limited number of meta-paths can be sampled.
In this paper, we construct a large-scale heterogeneous graph benchmark dataset named UniKG from Wikidata, and release it to facilitate representation learning on heterogeneous graphs. Overall, UniKG contains 77.31 million multi-attribute entities labeled with 2,000 classes and 564 million directed edges annotated with 2,082 diverse association types, which significantly surpasses the scale of existing heterogeneous graph datasets. To perform effective learning on the large-scale UniKG, we take two key measures. First, we propose a semantic alignment strategy for multi-attribute entities, which projects the feature descriptions of multi-attribute nodes into a common embedding space. Second, we design a novel plug-and-play anisotropy propagation module (APM) as a basis for extending methods for large-scale homogeneous graphs to heterogeneous graphs. These two strategies enable efficient information propagation among a tremendous number of multi-attribute entities while adaptively mining multi-attribute associations through multi-hop aggregation in large-scale HGs. We set up a node classification task on our UniKG dataset and evaluate multiple baseline methods constructed by embedding our APM into large-scale homogeneous graph learning methods. In addition, we transfer knowledge from UniKG to a recommendation system to verify its capability to facilitate downstream tasks in diverse domains.
Our contributions can be summarized in the following three aspects:
1) We construct a large-scale Heterogeneous Graph benchmark dataset named UniKG from Wikidata which contains over 77 million nodes and 564 million edges associated with 2,082 types. To the best of our knowledge, UniKG significantly surpasses the scale of existing HG datasets.
2) We propose a plug-and-play Anisotropy Propagation Module (APM), which can adaptively mine association relationships of diverse types and extend methods of large-scale homogeneous graphs to heterogeneous graphs.
3) We transfer knowledge from UniKG to recommender systems, which effectively improves performance on the recommendation task.
Table 1: Statistics of UniKG compared with existing heterogeneous graph datasets.

Datasets | #Nodes | #Node Types | #Edges | #Edge Types | #Classes |
---|---|---|---|---|---|
ACM | 10,942 | 4 | 273,936 | 8 | 3 |
IMDB | 21,420 | 4 | 43,321 | 6 | 5 |
DBLP | 26,128 | 4 | 119,783 | 6 | 4 |
Freebase | 180,098 | 8 | 528,844 | 36 | 7 |
MAG | 1,939,743 | 4 | 21,111,007 | 4 | 349 |
OAG | 1,116,162 | 5 | 13,965,692 | 30 | 275 |
UniKG | 77,312,474 | 2,000 | 564,425,621 | 2,082 | 2,000 |
Related Work
In this section, we introduce classical HG datasets as well as representative GNN methods, such as those for KG tasks, large-scale graphs, and HG tasks. Finally, we briefly discuss the limitations of existing datasets and methods and highlight the advantages of our proposed dataset and method.
Heterogeneous Graph Datasets
Numerous studies have focused on constructing domain-specific heterogeneous graph datasets (Sharma et al. 2022; Chen, Razniewski, and Weikum 2023; Ahrabian et al. 2023; Jang et al. 2022) to facilitate research in their respective fields. For instance, (Sharma et al. 2022) demonstrates the insufficiency of existing relation extraction tasks in the financial domain. The recently proposed PubGraph (Ahrabian et al. 2023) contains knowledge from almost all scientific fields. (Jang et al. 2022) creates a temporal benchmark to address temporal misalignment issues. Moreover, some studies (Stranisci et al. 2022; Kubeša and Straka 2023) attempt to explore cultural differences by extracting and combining diverse linguistic entities or events; for example, (Stranisci et al. 2022) reveals the underrepresentation of non-Western writers.

Graph Neural Network Methods
As existing work is unsuitable for large-scale heterogeneous graphs, we review GNN methods designed for KG tasks, large-scale graphs and HGs.
GNN Methods for KG Tasks
The structural similarity between KGs and HGs has led to the increased use of GNN methods for KG tasks. For instance, (Banerjee et al. 2023) proposes GETT-QA, a question-answering system based on GNN embeddings. (Pramanik et al. 2021) constructs local context graphs for information retrieval. Recently, (Luo et al. 2023) proposes DHGE, which uses two branches of hyper-graph neural networks to learn entity instance views and concept ontology views. One particularly interesting work is KGQ (Hogan et al. 2021), which discusses a unified data model derived from KGs and queries and coincides with our idea.

GNN Methods for Large-Scale Graphs
Methods specifically designed for large-scale graphs are commonly categorized into two groups: sampling-based aggregation GNNs and decoupling-based GNNs. Sampling-based aggregation GNNs sample subgraphs with various strategies, such as node sampling (Hamilton, Ying, and Leskovec 2017), neighborhood sampling (Chen, Ma, and Xiao 2018), and subgraph sampling (Chiang et al. 2019; Zeng et al. 2019), seeking the optimal way to perform batch training. In contrast, decoupling-based GNNs precompute the feature propagation matrix once and store the results as inputs for the models. For instance, (Wu et al. 2019) proposes SGC, which uses the last-hop propagation feature as input; SIGN (Frasca et al. 2020) concatenates multi-hop propagation features; and SAGN (Sun, Gu, and Hu 2021) further introduces a hop-wise attention mechanism.

GNN Methods for Heterogeneous Graphs
HGNNs face a major obstacle in dealing with the heterogeneity arising from complex structures. One instance is PME (Chen et al. 2018), which measures the distances between nodes to capture heterogeneity. (Ma et al. 2021) proposes GATNE, aiming to learn node embeddings in multi-path graphs. Recently, based on the attention mechanism, scholars have proposed a series of attention-based HGNNs (Fu et al. 2020; Hong et al. 2020; Hu et al. 2020b; Fu et al. 2019) to capture the semantic importance of nodes.
In contrast, our dataset is large-scale and covers a universal domain, which can support model training on a large-scale heterogeneous graph. Moreover, our method can adaptively mine multi-attribute association relationships with low complexity, and extends methods of large-scale homogeneous graphs to heterogeneous graphs.
Preliminary
Heterogeneous Graph
A graph can be defined as $G = (V, E)$, in which $V$ and $E$ represent the sets of nodes and edges, respectively, and each node and each edge is associated with exactly one specific type. Formally, a heterogeneous graph (or heterogeneous information network) can be defined as $G = (V, E, C, R, \phi, \psi)$, where:

- $V = \{v_1, \ldots, v_n\}$ denotes the set of nodes, and $n$ is the total number of nodes.
- $E \subseteq V \times V$ denotes the set of directed edges between nodes.
- $C$ is the set of node classes, and $|C|$ is the total number of node types.
- $R$ is the set of edge types, and $|R|$ is the total number of edge types.
- $\phi: V \rightarrow C$ represents the node labeling function that maps nodes to their corresponding classes.
- $\psi: E \rightarrow R$ represents the edge type function that maps edges to their corresponding types.
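For concreteness, the definition above can be mirrored by a minimal data structure. The sketch below is for illustration only (the type and field names are hypothetical, not from the released code); classes and types are stored as sets because, as described in the Dataset Construction section, UniKG nodes and edges may carry multiple labels:

```python
from dataclasses import dataclass

@dataclass
class HeteroGraph:
    """Minimal container mirroring G = (V, E, C, R, phi, psi)."""
    num_nodes: int                      # |V| = n
    edges: list[tuple[int, int]]        # directed edges (u, v) in E
    node_classes: list[set[int]]        # phi: node index -> class ids in C
    edge_types: list[set[int]]          # psi: edge index -> type ids in R

    def num_node_classes(self) -> int:
        return len({c for cs in self.node_classes for c in cs})

    def num_edge_types(self) -> int:
        return len({t for ts in self.edge_types for t in ts})
```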
Decoupled Methods on Large-Scale Homogeneous Graphs
Given a graph $G$, each node $v_i$ is embedded as a $d$-dimensional vector $x_i$, forming the node feature matrix $X \in \mathbb{R}^{n \times d}$. Each directed edge corresponds to a binary entry in the adjacency matrix $A \in \{0, 1\}^{n \times n}$. Decoupled methods on large-scale homogeneous graphs are a proven effective paradigm for large-scale graph representation learning. They iteratively aggregate propagation features to update embeddings, which can be formulated as:

$$H = \sigma\left(\mathrm{AGG}\left(X^{(0)}, X^{(1)}, \ldots, X^{(K)}\right) W_1\right) W_2, \qquad X^{(k)} = \hat{A} X^{(k-1)}, \tag{1}$$

where $K$ refers to the maximum number of hops, $H$ denotes the hidden variable, $X^{(k)}$ refers to the $k$-th hop propagation matrix with $X^{(0)} = X$, $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is the symmetrically normalized adjacency matrix, $\mathrm{AGG}$ represents the aggregation function that fuses multi-hop features (e.g., sum, mean, or an attention mechanism), $\sigma$ is the non-linear activation function, and $W_1, W_2$ are learnable parameter matrices. Based on the widely accepted consensus that message passing dominates the computational overhead, decoupling-based GNNs precompute an approximation of the message-passing result on the CPU and use it as input to the overall model training. This architecture is therefore simple and efficient.
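As an illustration of this paradigm, the following is a minimal dense PyTorch sketch of Eq. (1). It is generic rather than the released code: propagation features are precomputed once offline, and a simple post-classifier with a sum aggregator (one of the options named above) consumes them.

```python
import torch

def precompute_propagation(adj: torch.Tensor, x: torch.Tensor, num_hops: int):
    """Offline CPU step: X^(k) = A_hat X^(k-1) for k = 1..K (Eq. 1), dense for clarity."""
    adj = adj + torch.eye(adj.size(0))                # add self-loops: A~ = A + I
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    a_hat = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)  # D~^-1/2 A~ D~^-1/2
    feats = [x]
    for _ in range(num_hops):
        feats.append(a_hat @ feats[-1])
    return feats                                      # [X^(0), ..., X^(K)]

class DecoupledGNN(torch.nn.Module):
    """Post-classifier over precomputed hop features; AGG = sum as one option in Eq. 1."""
    def __init__(self, in_dim: int, hid_dim: int, out_dim: int):
        super().__init__()
        self.w1 = torch.nn.Linear(in_dim, hid_dim)
        self.w2 = torch.nn.Linear(hid_dim, out_dim)

    def forward(self, hop_feats: list[torch.Tensor]) -> torch.Tensor:
        agg = torch.stack(hop_feats).sum(dim=0)       # fuse multi-hop features
        return self.w2(torch.relu(self.w1(agg)))      # sigma = ReLU
```

Because the expensive propagation happens once, only the small post-classifier runs during training, which is what makes batch training feasible at UniKG scale.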
Dataset Construction
In this section, we introduce our proposed dataset UniKG, including construction details and data statistics. Figure 2 shows a construction example; Table 1 and Table 2 show the overall statistics and the three scales of UniKG, respectively.
Entities Extraction
Most knowledge graphs, such as Wikidata, have an underlying database model that provides various functions for organizing data of a specific type. Directly deserializing the data of billions of entities to construct a large-scale graph is very difficult. Thus, we design an extraction strategy to carefully parse Wikidata and automatically select entities with complete attributes (i.e., containing 'id', 'description', 'labels', and 'claims' in English). For example, when parsing the entity 'Belgium', we extract its id 'Q31', its description 'country in Western Europe', its label 'Belgium', and dozens of 'claims' triplets. In addition, to ensure the semantic adequacy of the filtered entities, we semi-manually generate a screen set to filter out the mass of semantically uninformative 'claims' attributes in Wikidata. For example, we found that Wikidata stores a large number of 'external-id' relationships, which lack semantics and are thus ignored. The ⟨subject, object⟩ structure of the filtered claims can be modeled as directed edges between entities. In this way, we extract over 77 million nodes and 564 million directed edges.
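The extraction step can be sketched as follows. This is a simplified illustration assuming the public Wikidata JSON dump layout (one entity per line inside a JSON array); the real screen set covers far more than the single 'external-id' datatype shown here:

```python
import json

REQUIRED_KEYS = ("id", "labels", "descriptions", "claims")
IGNORED_DATATYPES = {"external-id"}  # illustrative screen set for unsemantic claims

def iter_complete_entities(dump_path: str):
    """Stream a Wikidata JSON dump and yield entities with complete English attributes."""
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue                      # skip the array brackets of the dump
            entity = json.loads(line)
            if not all(k in entity for k in REQUIRED_KEYS):
                continue                      # drop entities with incomplete attributes
            if "en" not in entity["labels"] or "en" not in entity["descriptions"]:
                continue                      # require English label and description
            claims = {                        # keep only claims whose datatype has semantics
                pid: stmts for pid, stmts in entity["claims"].items()
                if stmts and stmts[0]["mainsnak"].get("datatype") not in IGNORED_DATATYPES
            }
            yield (entity["id"], entity["labels"]["en"]["value"],
                   entity["descriptions"]["en"]["value"], claims)
```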
Semantic Alignment
To facilitate node aggregation in a large receptive field, we design an alignment strategy that preserves significant semantic structure. Specifically, the first sentence of each entity is organized into the structure '⟨label⟩ be ⟨description⟩.', which describes the attributes of the entity itself. Then, for each filtered 'claim', its label and the associated entity label (or value) are organized into a similar subject-predicate-object sentence, which is sequentially appended to the first sentence. In this way, each entity is automatically organized into a feature description that conforms to a consistent structure. We randomly sample a portion of entities to fine-tune the pre-trained language model (PLM) DeBERTa. The PLM then projects the feature descriptions of all entities into a common representation space to generate the node embeddings.
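A sketch of the alignment pipeline follows. The serialization template is our reading of the strategy above (punctuation is illustrative), the DeBERTa checkpoint name and mean pooling are assumptions, and the fine-tuning stage is omitted:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def build_description(label: str, description: str,
                      claims: list[tuple[str, str]]) -> str:
    """Serialize one entity into a feature description with a fixed structure."""
    sentences = [f"{label} is {description}."]                 # '<label> be <description>.'
    sentences += [f"{prop} is {obj}." for prop, obj in claims] # one sentence per claim
    return " ".join(sentences)

@torch.no_grad()
def embed_entities(texts: list[str],
                   model_name: str = "microsoft/deberta-base") -> torch.Tensor:
    """Project feature descriptions into a common embedding space with a PLM."""
    tok = AutoTokenizer.from_pretrained(model_name)
    plm = AutoModel.from_pretrained(model_name).eval()
    batch = tok(texts, padding=True, truncation=True,
                max_length=512, return_tensors="pt")
    hidden = plm(**batch).last_hidden_state      # (batch, seq_len, hidden)
    return hidden.mean(dim=1)                    # mean-pool to one vector per entity

# e.g. build_description("Belgium", "country in Western Europe",
#                        [("capital", "Brussels")])
# -> "Belgium is country in Western Europe. capital is Brussels."
```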
Label Annotation
Entities in Wikidata are labeled redundantly and trivially. For example, 'fictional character', 'character that may or may not be fictional', 'character in a fictitious work', 'conceptual character', 'fictional character in a musical work', and 'imaginary character' all coexist, so similar entities may receive multiple different annotations. This is unfavorable for classification tasks and imposes a heavy burden on representation learning. Therefore, we design an annotation strategy to semi-automatically annotate all filtered entities. i) For each entity $v$, we extract all 'instance of' relationships, which link to the parent entity set $P_v = \{p_1, \ldots, p_m\}$, where each $p_i$ corresponds to one 'instance of' relationship and $m$ is the number of 'instance of' relationships. The parent entity sets of all entities are collected into a union as the label set $L$. ii) Then, we cluster (e.g., by k-means) the feature description embeddings of $L$, resulting in a smaller high-level label set. For instance, the six 'fictional character' variants above are all annotated as 'Fictional characters from various mediums'. Finally, we annotate each entity with the help of TWO non-native fluent English speakers, who fine-check the incorrect associations between specific labels and high-level labels; all associations marked as incorrect by both annotators are manually relabeled. Additionally, each directed edge is annotated with its associated relationship label. In this way, we cluster 74,666 'instance of' object entities into 2,000 abstract manual annotations with physical meaning, and obtain 564 million edges with 2,082 relationship labels.
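The clustering step of the annotation strategy can be sketched as follows, with k-means as the example named in the text (the function name and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_parent_labels(label_embeddings: np.ndarray,
                          num_clusters: int = 2000) -> np.ndarray:
    """Group the fine-grained 'instance of' parent labels (74,666 in UniKG) into
    high-level classes (2,000) by k-means over their description embeddings."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return km.fit_predict(label_embeddings)   # high-level class id per parent label

# Each entity inherits the high-level class ids of its parent entity set; the
# cluster-to-name mapping (e.g. 'Fictional characters from various mediums') is
# subsequently fine-checked by the human annotators described above.
```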
Through these strategies, we obtain 77,312,474 nodes labeled with 74,666 semi-manual annotations and 564,425,621 directed edges of 2,082 distinct types, and construct an incredibly intricate large-scale heterogeneous graph. In previous heterogeneous graphs, an edge usually has only one type; in our dataset, however, an edge may have multiple types. For example, 'member of' and 'part of', or 'author of' and 'creator of', tend to co-exist on a node pair. As a result, the number of meta-paths may grow explosively.
Table 2: Statistics of the three scales of UniKG.

Datasets | #Nodes | #Edges | #Classes |
---|---|---|---|
UniKG-1M | 1,002,988 | 24,475,405 | 2,000 |
UniKG-10M | 10,044,777 | 216,295,022 | 2,000 |
UniKG-full | 77,312,474 | 564,425,621 | 2,000 |
Table 3: Results of the multi-label node classification task (%) on the three scales of UniKG.

Methods | UniKG-1M | UniKG-10M | UniKG-Full |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Acc. | Prec. | Rec. | F1. | Acc. | Prec. | Rec. | F1. | Acc. | Prec. | Rec. | F1. | |
R-MLP | 68.48 | 92.24 | 68.31 | 78.50 | 78.66 | 96.77 | 79.60 | 87.35 | 80.69 | 97.25 | 81.69 | 88.79 |
R-SGC | 44.18 | 88.47 | 42.75 | 57.65 | 64.25 | 93.00 | 65.21 | 76.67 | 69.84 | 94.79 | 71.40 | 81.45 |
R-SIGN | 66.69 | 94.35 | 65.27 | 77.16 | 69.52 | 96.84 | 70.18 | 81.38 | 75.45 | 97.58 | 76.42 | 85.71 |
R-SAGN | 75.41 | 90.64 | 75.95 | 82.64 | 89.03 | 96.19 | 90.11 | 93.05 | 93.16 | 98.46 | 93.83 | 96.09 |
R-GAMLP | 47.24 | 86.95 | 46.80 | 60.85 | 60.99 | 88.44 | 62.70 | 73.38 | 72.89 | 94.28 | 74.95 | 83.51 |

Anisotropic Graph Decoupling
In this section, we describe in detail the novel Heterogeneous Graph Decoupling (HGD) framework and the anisotropic propagation process of the plug-and-play Anisotropy Propagation Module (APM) on large-scale heterogeneous graphs, as shown in Figure 3.
By introducing $d_e$-dimensional edge embeddings, the adjacency matrix $A$ can be extended to a relation-aware adjacency matrix $\mathcal{A} \in \mathbb{R}^{n \times n \times d_e}$. The Heterogeneous Graph Decoupling (HGD) framework can then be represented as:

$$H = \sigma\left(\mathrm{AGG}\left(X^{(0)}, X^{(1)}, \ldots, X^{(K)}\right) W_1\right) W_2, \qquad X^{(k)} = f\left(X^{(k-1)}, \mathcal{A}\right), \tag{2}$$

where, for each $k \in \{1, \ldots, K\}$, $X^{(k)}$ denotes the anisotropic propagation matrix of the $k$-th hop, $f$ is the anisotropic propagation function, and $X^{(0)} = X$ is the original feature matrix.

To integrate the anisotropic edge features into the message matrix, we first lift the features from $\mathbb{R}^{n \times d}$ to $\mathbb{R}^{n \times n \times d}$, yielding the matrix $M^{(k-1)}$:

$$M^{(k-1)}_{i,j,:} = X^{(k-1)}_{i,:}, \quad \forall i, j \in \{1, \ldots, n\}, \tag{3}$$

so that the $i$-th slice of $M^{(k-1)}$ represents the messages from node $i$ to all other nodes. Based on $M^{(k-1)}$, we obtain the anisotropic propagation matrix at the $k$-th hop through the tensor product of the anisotropic message matrix and its coefficient matrix, computed by an attention mechanism:

$$X^{(k)} = C^{(k-1)} \otimes_2 S^{(k-1)}, \tag{4}$$

where $\otimes_2$ represents the tensor product along the second dimension, $X^{(k)}$ denotes the $k$-th hop anisotropic propagation matrix, $S^{(k-1)}$ is the message propagation matrix, and $C^{(k-1)}$ represents the edge-aware coefficient matrix. Specifically, we perform element-wise Hadamard multiplication between $M^{(k-1)}$ and $\mathcal{A}$ to derive the message propagation matrix $S^{(k-1)}$:

$$S^{(k-1)} = M^{(k-1)} \odot \mathcal{A}, \tag{5}$$

where $S^{(k-1)}_{i,j,:}$ describes the message between each node pair $(i, j)$. The $\ell_2$-norm matrix of $S^{(k-1)}$ along the third dimension is denoted $N^{(k-1)}$:

$$N^{(k-1)}_{i,j} = \left\| S^{(k-1)}_{i,j,:} \right\|_2, \tag{6}$$

from which the edge-aware coefficient matrix $C^{(k-1)}$ for the messages between each node pair is computed by a row-wise softmax:

$$C^{(k-1)} = \mathrm{softmax}\left(N^{(k-1)}\right). \tag{7}$$

Finally, the tensor multiplication along the second dimension between the message propagation matrix $S^{(k-1)}$ and the edge-aware coefficient matrix $C^{(k-1)}$ anisotropically propagates the $(k-1)$-th hop messages to the corresponding nodes of the $k$-th hop. Iterating this procedure $K$ times yields the sequence of anisotropic propagation matrices $\{X^{(0)}, X^{(1)}, \ldots, X^{(K)}\}$, which serves as the input to the post-classifier of the decoupling-based graph neural network.
This method supports fast multi-attribute aggregation and can extend decoupled methods from large-scale homogeneous graphs to large-scale heterogeneous graphs in an edge-aware manner.
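A minimal dense sketch of one APM propagation step, i.e., Eqs. (3)-(7), is given below. It assumes the edge-embedding dimension $d_e$ equals the node feature dimension $d$ so that the Hadamard product in Eq. (5) is well defined; dense tensors are used for clarity only, and a full-scale implementation would require sparse operations:

```python
import torch

def anisotropic_propagation(x: torch.Tensor, rel_adj: torch.Tensor, num_hops: int):
    """APM sketch. x: (n, d) node features; rel_adj: (n, n, d) relation-aware
    adjacency A (zero vector where no edge exists)."""
    feats = [x]
    for _ in range(num_hops):
        m = feats[-1].unsqueeze(1).expand_as(rel_adj)   # Eq.(3): M[i,j,:] = x_i
        s = m * rel_adj                                 # Eq.(5): S = M ⊙ A
        norm = s.norm(dim=2)                            # Eq.(6): N[i,j] = ||S[i,j,:]||_2
        coeff = torch.softmax(norm, dim=1)              # Eq.(7): row-wise softmax
        # Eq.(4): messages from node i reach node j, weighted by the coefficients.
        feats.append(torch.einsum("ij,ijd->jd", coeff, s))
    return feats   # [X^(0), ..., X^(K)] for the decoupled post-classifier
```

The returned hop sequence plugs directly into any decoupled post-classifier such as the `DecoupledGNN` sketch above, which is precisely how APM extends homogeneous decoupled methods to heterogeneous graphs.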
In the real world, nodes tend to carry multiple labels, so we use multi-label loss functions such as the binary cross-entropy (BCE) loss:

$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{|C|} \left[ y_{i,c} \log \hat{y}_{i,c} + \left(1 - y_{i,c}\right) \log\left(1 - \hat{y}_{i,c}\right) \right], \tag{8}$$

where $n$ is the number of nodes, $y_i$ denotes the ground-truth label vector of node $v_i$, and $\hat{y}_i$ represents the predicted label vector.
Table 4: Ablation results on the recommendation task; the 'w/ UniKG-Full' row reports relative improvements over LightGCN with random initialization.

Methods | Amazon-book | Yelp | Citeulike-a |
---|---|---|---|---|---|---|---|---|---|
Prec. | Rec. | Ndcg. | Prec. | Rec. | Ndcg. | Prec. | Rec. | Ndcg. |
LightGCN | 0.01716 | 0.06191 | 0.04106 | 0.00433 | 0.01123 | 0.00849 | 0.02329 | 0.07188 | 0.05064 |
LightGCN w/ UniKG-Full | +0.52% | +0.11% | +0.82% | +6.47% | +7.93% | +4.36% | +0.09% | +0.26% | +0.55% |
Table 5: Statistics of the recommendation datasets.

Datasets | #Users | #Items | #Edges | #Relations |
---|---|---|---|---|
Citeulike-a | 5,551 | 16,980 | 210,537 | 1 |
Amazon-book | 52,643 | 65,865 | 2,090,149 | 39 |
Yelp | 44,907 | 137,597 | 2,346,409 | 42 |
Experiments
In this section, we introduce two experiments that evaluate the proposed dataset and method, along with their ability to facilitate other downstream tasks. Detailed descriptions of the recommendation task are reported in the Appendix.
Experimental Setup
Datasets
In Table 1, we compare UniKG with common heterogeneous node classification datasets. UniKG significantly surpasses existing HG datasets in the numbers of nodes, edges, edge types, etc. Due to the high complexity associated with the enormous size of UniKG-Full, we construct two sub-datasets of small and medium sizes to accommodate models of different scales, using Snowball Sampling (Goodman 1960) following (Ahrabian et al. 2023); see the sketch after this paragraph. Specifically, we take high-degree nodes and random nodes as starting points for random walks, traversing and recording 1 million and 10 million nodes, respectively. To mitigate sampling bias, we conduct full-graph anisotropic propagation before the split. Table 2 shows the statistics of the three scales. We set up and evaluate a semi-supervised multi-label node classification task on the UniKG-1M, UniKG-10M, and UniKG-Full datasets. Each dataset is randomly split into training, validation, and testing sets with an 8:1:1 ratio. Moreover, we conduct the recommendation-task ablation experiment on the Amazon-book, Yelp, and Citeulike-a datasets; their statistics are shown in Table 5.
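The sub-dataset construction can be sketched as a breadth-first snowball expansion from the seed nodes; this is an illustrative reading of the procedure, and the exact traversal policy of the released pipeline may differ:

```python
from collections import deque

def snowball_sample(neighbors: dict[int, list[int]],
                    seeds: list[int], budget: int) -> set[int]:
    """Snowball sampling (Goodman 1960): expand from high-degree and random
    seed nodes until the node budget (e.g. 1M or 10M) is reached."""
    visited, queue = set(seeds), deque(seeds)
    while queue and len(visited) < budget:
        u = queue.popleft()
        for v in neighbors.get(u, []):
            if v not in visited:
                visited.add(v)          # record the newly reached node
                queue.append(v)
                if len(visited) >= budget:
                    break
    return visited                      # node ids of the sub-dataset
```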
Baselines
For the node classification task, we incorporate MLP, SGC (Wu et al. 2019), SIGN (Frasca et al. 2020), SAGN (Sun, Gu, and Hu 2021), and GAMLP (Zhang et al. 2022b) into our HGD framework, yielding five baselines applicable to large-scale heterogeneous graphs: R-MLP, R-SGC, R-SIGN, R-SAGN, and R-GAMLP. For the recommendation task, we adopt LightGCN (He et al. 2020) as our baseline model.
Ablations
For the recommendation task, we use similar embeddings retrieved from UniKG as more expressive initial representations instead of random initialization, as shown in the sketch below. The initial representations from the UniKG-Full dataset thus form two ablation settings: LightGCN and LightGCN w/ UniKG-Full.
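A sketch of this ablation setup follows; the attribute and mapping names are hypothetical and depend on the LightGCN implementation in use, and the random projection stands in for whatever dimension alignment the real pipeline performs:

```python
import torch

def init_lightgcn_from_unikg(item_emb_layer: torch.nn.Embedding,
                             unikg_emb: torch.Tensor,
                             item_to_entity: list[int]) -> None:
    """Replace LightGCN's random initial item embeddings with embeddings
    retrieved from UniKG-Full. `item_to_entity` (hypothetical) maps each
    recommendation item to its matched UniKG entity row."""
    retrieved = unikg_emb[item_to_entity]           # (num_items, d_unikg)
    if retrieved.size(1) != item_emb_layer.weight.size(1):
        # Placeholder projection to align dimensions (e.g. 128 -> 64).
        proj = torch.nn.Linear(retrieved.size(1),
                               item_emb_layer.weight.size(1), bias=False)
        retrieved = proj(retrieved)
    with torch.no_grad():
        item_emb_layer.weight.copy_(retrieved)      # overwrite random init
```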
Evaluation Metrics
For the multi-label node classification task, we choose Accuracy, Precision, Recall, and F1 score as evaluation metrics. Note that we use 'Subset Accuracy' as Accuracy, i.e., the proportion of nodes whose predicted labels exactly match the ground-truth labels. For the recommendation task, we choose Precision, Recall, and NDCG as evaluation metrics.
Hyperparameters and Environment
Certain reported performances on the OGB leaderboard (Hu et al. 2020a) incorporate various tricks such as SLE, RLU, and 'labels as input' (Wang et al. 2021). These tricks are recognized as having a significant influence on performance (Sun, Gu, and Hu 2021). To ensure fairness, we avoid them completely. Furthermore, we follow the hyperparameter settings of the official OGB (Hu et al. 2020a) or the original authors. For the node classification task, we adopt the following hyperparameters for all models: embedding dimension 128 for both entities and relations, hidden dimension 256, 300 epochs, learning rate 0.01, and dropout rate 0. For the recommendation task, we adhere to the original configuration of LightGCN. All experiments were conducted on a single 24GB GeForce RTX 4090 GPU.
Main Results and Analysis
In this experiment, we evaluate all baselines on UniKG and compare ablations on recommendation task.
Node Classification Task
From Table 3, we observe a clear trend: as the ability to capture multi-hop features improves, performance gradually increases. This trend aligns with conclusions drawn from existing large-scale graph benchmarks. Specifically, R-SAGN consistently outperforms the other baselines, achieving 93.16% subset accuracy and a 96.09% F1 score on the largest UniKG-Full dataset. We attribute the advantage of R-SAGN to its adaptive attention mechanism, which effectively aggregates multi-attribute multi-hop propagation features. This may imply that it is crucial to design post-classifiers with heterogeneous-graph structure-aware components. Additionally, we find that R-MLP, utilizing only the original features, achieves a subset accuracy of 80.69%. In contrast, R-SGC, utilizing only the last-hop propagation features, is 10.85% lower; R-SIGN, using multi-hop propagation features, is 5.24% lower; and R-GAMLP, using a layer-wise attention mechanism, is 7.80% lower. Similar trends are observed on the two smaller datasets. In summary, for representation learning on large-scale heterogeneous graphs, we conclude that: i) the Heterogeneous Graph Decoupling framework is effective; ii) the original features are important, so it is necessary to increase their weight or incorporate residual structures; iii) in the feature mapping process, it is crucial to design effective aggregation strategies, such as the adaptive attention mechanism in R-SAGN, to capture heterogeneous multi-hop propagation features.
Recommendation Task
The similar embeddings retrieved from UniKG serve as more expressive initial embeddings in the recommender system and constitute the ablation condition. Table 4 presents the ablation results. The performance of LightGCN improves on all datasets after using the more expressive initial embeddings. In particular, on the Yelp dataset, Precision improves by 6.47%, Recall by 7.93%, and NDCG by 4.36%. This indicates that more expressive features can improve model performance and that our proposed UniKG can facilitate downstream recommender-system tasks.
Conclusion
We construct a large-scale heterogeneous graph benchmark dataset, UniKG, along with associated representation learning methods: the Heterogeneous Graph Decoupling framework and a plug-and-play Anisotropy Propagation Module. Our construction methods model knowledge graphs as large-scale heterogeneous graphs, and the constructed UniKG significantly surpasses the scale of existing heterogeneous graph datasets. The integration of decoupling-based homogeneous graph methods with the Heterogeneous Graph Decoupling framework is easy to implement and achieves competitive performance on large-scale heterogeneous graphs. The real-world knowledge in UniKG can facilitate diverse downstream tasks.
References
- Ahrabian et al. (2023) Ahrabian, K.; Du, X.; Myloth, R. D.; Ananthan, A. B. S.; and Pujara, J. 2023. PubGraph: A Large Scale Scientific Temporal Knowledge Graph. arXiv preprint arXiv:2302.02231.
- Banerjee et al. (2023) Banerjee, D.; Nair, P. A.; Usbeck, R.; and Biemann, C. 2023. GETT-QA: Graph Embedding Based T2T Transformer for Knowledge Graph Question Answering. In European Semantic Web Conference, 279–297. Springer.
- Cao et al. (2022) Cao, J.; Zhang, X.; Shahinian, V.; Yin, H.; Steffick, D.; Saran, R.; Crowley, S.; Mathis, M.; Nadkarni, G. N.; Heung, M.; et al. 2022. Generalizability of an acute kidney injury prediction model across health systems. Nature Machine Intelligence, 4(12): 1121–1129.
- Chang et al. (2022) Chang, Y.; Chen, C.; Hu, W.; Zheng, Z.; Zhou, X.; and Chen, S. 2022. Megnn: Meta-path extracted graph neural network for heterogeneous graph representation learning. Knowledge-Based Systems, 107611.
- Chen et al. (2018) Chen, H.; Yin, H.; Wang, W.; Wang, H.; Nguyen, Q. V. H.; and Li, X. 2018. PME: projected metric embedding on heterogeneous networks for link prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 1177–1186.
- Chen, Ma, and Xiao (2018) Chen, J.; Ma, T.; and Xiao, C. 2018. Fastgcn: fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247.
- Chen, Razniewski, and Weikum (2023) Chen, L.; Razniewski, S.; and Weikum, G. 2023. Knowledge base completion for long-tail entities. arXiv preprint arXiv:2306.17472.
- Chiang et al. (2019) Chiang, W.-L.; Liu, X.; Si, S.; Li, Y.; Bengio, S.; and Hsieh, C.-J. 2019. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 257–266.
- Frasca et al. (2020) Frasca, F.; Rossi, E.; Eynard, D.; Chamberlain, B.; Bronstein, M.; and Monti, F. 2020. Sign: Scalable inception graph neural networks. arXiv preprint arXiv:2004.11198.
- Fu et al. (2020) Fu, X.; Zhang, J.; Meng, Z.; and King, I. 2020. Magnn: Metapath aggregated graph neural network for heterogeneous graph embedding. In Proceedings of The Web Conference 2020, 2331–2341.
- Fu et al. (2019) Fu, Y.; Xiong, Y.; Philip, S. Y.; Tao, T.; and Zhu, Y. 2019. Metapath enhanced graph attention encoder for hins representation learning. In 2019 IEEE International Conference on Big Data (Big Data), 1103–1110. IEEE.
- Gao et al. (2022) Gao, C.; Wang, X.; He, X.; and Li, Y. 2022. Graph neural networks for recommender system. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 1623–1625.
- Goodman (1960) Goodman, L. A. 1960. Snowball Sampling. The Annals of Mathematical Statistics.
- Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. Advances in neural information processing systems, 30.
- He et al. (2020) He, X.; Deng, K.; Wang, X.; Li, Y.; Zhang, Y.; and Wang, M. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 639–648.
- Hogan et al. (2021) Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; Melo, G. D.; Gutierrez, C.; Kirrane, S.; Gayo, J. E. L.; Navigli, R.; Neumaier, S.; et al. 2021. Knowledge graphs. ACM Computing Surveys (Csur), 54(4): 1–37.
- Hong et al. (2020) Hong, H.; Guo, H.; Lin, Y.; Yang, X.; Li, Z.; and Ye, J. 2020. An attention-based graph neural network for heterogeneous structural learning. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 4132–4139.
- Hu et al. (2020a) Hu, W.; Fey, M.; Zitnik, M.; Dong, Y.; Ren, H.; Liu, B.; Catasta, M.; and Leskovec, J. 2020a. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33: 22118–22133.
- Hu et al. (2020b) Hu, Z.; Dong, Y.; Wang, K.; and Sun, Y. 2020b. Heterogeneous graph transformer. In Proceedings of the web conference 2020, 2704–2710.
- Huang et al. (2016) Huang, Z.; Zheng, Y.; Cheng, R.; Sun, Y.; Mamoulis, N.; and Li, X. 2016. Meta Structure: Computing Relevance in Large Heterogeneous Information Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Jang et al. (2022) Jang, J.; Ye, S.; Lee, C.; Yang, S.; Shin, J.; Han, J.; Kim, G.; and Seo, M. 2022. TemporalWiki: A lifelong benchmark for training and evaluating ever-evolving language models. arXiv preprint arXiv:2204.14211.
- Kubeša and Straka (2023) Kubeša, D.; and Straka, M. 2023. DaMuEL: A Large Multilingual Dataset for Entity Linking. arXiv preprint arXiv:2306.09288.
- Luo et al. (2023) Luo, H.; Haihong, E.; Tan, L.; Zhou, G.; Yao, T.; and Wan, K. 2023. DHGE: Dual-View Hyper-Relational Knowledge Graph Embedding for Link Prediction and Entity Typing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 6467–6474.
- Ma et al. (2021) Ma, S.; Liu, J.-W.; Zuo, X.; and Li, W.-M. 2021. Heterogeneous graph gated attention network. In 2021 International Joint Conference on Neural Networks (IJCNN), 1–6. IEEE.
- Pramanik et al. (2021) Pramanik, S.; Alabi, J.; Roy, R. S.; and Weikum, G. 2021. UNIQORN: unified question answering over RDF knowledge graphs and natural language text. arXiv preprint arXiv:2108.08614.
- Schlichtkrull et al. (2018) Schlichtkrull, M.; Kipf, T. N.; Bloem, P.; Van Den Berg, R.; Titov, I.; and Welling, M. 2018. Modeling relational data with graph convolutional networks. In The Semantic Web: 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings 15, 593–607. Springer.
- Sharma et al. (2022) Sharma, S.; Nayak, T.; Bose, A.; Meena, A. K.; Dasgupta, K.; Ganguly, N.; and Goyal, P. 2022. FinRED: A dataset for relation extraction in financial domain. In Companion Proceedings of the Web Conference 2022, 595–597.
- Stranisci et al. (2022) Stranisci, M. A.; Spillo, G.; Musto, C.; Patti, V.; and Damiano, R. 2022. The URW-KG: a Resource for Tackling the Underrepresentation of non-Western Writers. arXiv preprint arXiv:2212.13104.
- Sun, Gu, and Hu (2021) Sun, C.; Gu, H.; and Hu, J. 2021. Scalable and adaptive graph neural networks with self-label-enhanced training. arXiv preprint arXiv:2104.09376.
- Sun et al. (2011) Sun, Y.; Han, J.; Yan, X.; Yu, P. S.; and Wu, T. 2011. PathSim: meta path-based top-K similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 992–1003.
- Wang et al. (2020) Wang, X.; Lu, Y.; Shi, C.; Wang, R.; Cui, P.; and Mou, S. 2020. Dynamic heterogeneous information network embedding with meta-path based proximity. IEEE Transactions on Knowledge and Data Engineering, 34(3): 1117–1132.
- Wang et al. (2021) Wang, Y.; Jin, J.; Zhang, W.; Yu, Y.; Zhang, Z.; and Wipf, D. 2021. Bag of tricks for node classification with graph neural networks. arXiv preprint arXiv:2103.13355.
- Whitehouse et al. (2023) Whitehouse, C.; Vania, C.; Aji, A. F.; Christodoulopoulos, C.; and Pierleoni, A. 2023. WebIE: Faithful and robust information extraction on the web. arXiv preprint arXiv:2305.14293.
- Wu et al. (2019) Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; and Weinberger, K. 2019. Simplifying graph convolutional networks. In International conference on machine learning, 6861–6871. PMLR.
- Wu et al. (2022) Wu, S.; Sun, F.; Zhang, W.; Xie, X.; and Cui, B. 2022. Graph neural networks in recommender systems: a survey. ACM Computing Surveys, 55(5): 1–37.
- Xu et al. (2019) Xu, F.; Lian, J.; Han, Z.; Li, Y.; Xu, Y.; and Xie, X. 2019. Relation-aware graph convolutional networks for agent-initiated social e-commerce recommendation. In Proceedings of the 28th ACM international conference on information and knowledge management, 529–538.
- Zeng et al. (2019) Zeng, H.; Zhou, H.; Srivastava, A.; Kannan, R.; and Prasanna, V. 2019. Graphsaint: Graph sampling based inductive learning method. arXiv preprint arXiv:1907.04931.
- Zhang et al. (2022a) Zhang, B.; Li, J.; Chen, C.; Lee, K.; and Lee, I. 2022a. A practical botnet traffic detection system using gnn. In Cyberspace Safety and Security: 13th International Symposium, CSS 2021, Virtual Event, November 9–11, 2021, Proceedings 13, 66–78. Springer.
- Zhang et al. (2019) Zhang, C.; Song, D.; Huang, C.; Swami, A.; and Chawla, N. V. 2019. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 793–803.
- Zhang et al. (2022b) Zhang, W.; Yin, Z.; Sheng, Z.; Li, Y.; Ouyang, W.; Li, X.; Tao, Y.; Yang, Z.; and Cui, B. 2022b. Graph attention multi-layer perceptron. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 4560–4570.