
SE-GSL: A General and Effective Graph Structure Learning Framework through Structural Entropy Optimization

Dongcheng Zou (Beihang University, Beijing, China), Hao Peng (Beihang University, Beijing, China), Xiang Huang (Beihang University, Beijing, China), Renyu Yang (Beihang University, Beijing, China), Jianxin Li (Beihang University, Beijing, China), Jia Wu (Macquarie University, Sydney, Australia), Chunyang Liu (Didi Chuxing, Beijing, China), and Philip S. Yu (University of Illinois Chicago, Chicago, USA)
(2023)
Abstract.

Graph Neural Networks (GNNs) are de facto solutions to structural data learning. However, they are susceptible to low-quality and unreliable structures, which have been the norm rather than the exception in real-world graphs. Existing graph structure learning (GSL) frameworks still lack robustness and interpretability. This paper proposes a general GSL framework, SE-GSL, through structural entropy and the graph hierarchy abstracted in the encoding tree. Particularly, we exploit the one-dimensional structural entropy to maximize embedded information content when auxiliary neighbourhood attributes are fused to enhance the original graph. A new scheme for constructing optimal encoding trees is proposed to minimize the uncertainty and noise in the graph whilst ensuring a proper community partition in the hierarchical abstraction. We present a novel sample-based mechanism for restoring the graph structure via the node structural entropy distribution, which increases the connectivity among nodes with larger uncertainty in lower-level communities. SE-GSL is compatible with various GNN models and enhances robustness towards noisy and heterophilous structures. Extensive experiments show significant improvements in the effectiveness and robustness of structure learning and node representation learning.

Graph structure learning, structural entropy, graph neural network
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the ACM Web Conference 2023, April 30-May 4, 2023, Austin, TX, USA; booktitle: Proceedings of the ACM Web Conference 2023 (WWW '23), April 30-May 4, 2023, Austin, TX, USA; price: 15.00; doi: 10.1145/3543507.3583453; isbn: 978-1-4503-9416-1/23/04; ccs: Computing methodologies, Artificial intelligence; ccs: Mathematics of computing, Graph algorithms; ccs: Information systems, Data mining

1. Introduction

Graph Neural Networks (GNNs) (Wu et al., 2020a; Zhou et al., 2020) have become the cornerstone and de facto solution for structural representation learning. Most state-of-the-art GNN models employ message passing (Gilmer et al., 2017) and recursive information aggregation from local neighborhoods (Veličković et al., 2017; Keyulu et al., 2019; Peng et al., 2021; Yang et al., 2023) to learn node representations. These models have been advancing a variety of tasks, including node classification (Welling and Kipf, 2016; Xu et al., 2018; Peng et al., 2019), node clustering (Bianchi et al., 2020; Peng et al., 2023), graph classification (Ying et al., 2018; Peng et al., 2020), and graph generation (You et al., 2018).

GNNs are extremely sensitive to the quality of given graphs and thus require resilient and high-quality graph structures. However, it is increasingly difficult to meet such a requirement in real-world graphs. Their structures tend to be noisy, incomplete, adversarial, and heterophilous (i.e., edges have a higher tendency to connect nodes of different types), which can drastically weaken the representation capability of GNNs (Pei et al., 2019; Luo et al., 2021a; Dai et al., 2021). Recent studies also reveal that even a minor perturbation in the graph structure can lead to inferior prediction quality (Bojchevski and Günnemann, 2019; Zhang and Zitnik, 2020; Sun et al., 2022a). Additionally, GNNs are vulnerable to attacks since the raw graph topology is decoupled from node features, and attackers can easily fabricate links between entirely different nodes (Zhang and Zitnik, 2020; Sun et al., 2022a).

To this end, Graph Structure Learning (GSL) (Chen et al., 2020b; Jin et al., 2020; Wang et al., 2020; Zhu et al., 2021b, a; Qingyun et al., 2022; Sun et al., 2022b) has become the recent driving force for learning superior task-relevant graph topology and for enhancing the resilience and robustness of node representation. Existing works focus on jointly optimizing GNNs whilst imposing regularization on the refined graph structures. Typical methods include metric-based (Chen et al., 2020b; Wang et al., 2020; Li et al., 2018a), probabilistic sampling (Franceschi et al., 2019; Rong et al., 2019; Zheng et al., 2020), and learnable structure approaches (Jin et al., 2020). While promising, GNNs and GSL still face the following issues. i) Robustness to noise and heterophilous graphs. While many GSL models strive to fuse node features and topological features through edge reconstruction (e.g., add, prune, or reweight) (Wang et al., 2020; Zhang and Zitnik, 2020; Zhu et al., 2021a), additional noise and disassortative connections are inevitably introduced into the fused structure due to the unreliable prior topology and node embeddings, which further degrades the representation capability of GNNs (Li et al., 2022). ii) Model interpretability. Fully parameterizing the adjacency matrix incurs a non-negligible cost of parameter storage and updating and leads to low model interpretability (Gilpin et al., 2018). Although some studies have improved GNN interpretability (Ying et al., 2019; Huang et al., 2022), few works can effectively explain the topology evolution during graph structure learning. Therefore, fusing node and topological features in a noisy environment to obtain GNN-friendly graphs by exploiting inherent graph structures remains an underexplored problem (Wu et al., 2020b).

Figure 1. An illustrative example of the hierarchical community (semantics) in a simple social network. (a) Vertices and edges represent people and their interconnectivity (e.g., common locations, interests, occupations). There are different abstraction levels, and each community can be divided into sub-communities in a finer-grained manner (e.g., students are placed in different classrooms while teachers are allocated different offices). The lowest abstraction comes down to the individuals with their own attributes, and the highest abstraction is the social network system. (b) An encoding tree is a natural form to represent and interpret such a multi-level hierarchy.

In this paper, we present SE-GSL, a general and effective graph structure learning framework that adaptively optimizes the topological graph structure in a learning-free manner and achieves superior node representations, widely applicable to mainstream GNN models. This study is among the first attempts to marry structural entropy and encoding tree theory (Li and Pan, 2016) with GSL, which offers an effective measure of the information content and structural diversity embedded in an arbitrary graph. The multi-level semantics of a graph can be abstracted and characterized through an encoding tree. An encoding tree (Li and Pan, 2016; Li et al., 2018b; Zeng et al., 2023) represents a multi-grained division of a graph into hierarchical communities and sub-communities, thus providing a pathway to better interpretability. Fig. 1 showcases how such graph semantics are hierarchically abstracted. Specifically, we first enhance the original graph topology by incorporating vertex similarities and auxiliary neighborhood information via the $k$-Nearest Neighbors ($k$-NN) approach, so that noise can be better identified and diminished. This procedure is guided by a $k$-selector that maximizes the amount of information embedded in the graph structure. We then propose a scheme to establish an optimal encoding tree that minimizes the graph uncertainty and edge noise whilst maximizing the knowledge in the encoding tree. To restore the entire graph structure that is further fed into GNN encoders, we recover edge connectivity between related vertices from the encoding tree, taking into account the structural entropy distribution among vertices. The core idea is to weaken the association between vertices in high-level communities whilst establishing dense and extensive connections between vertices in low-level communities. The steps above are iteratively conducted to co-optimize graph structure and node embedding learning. SE-GSL (code is available at https://github.com/RingBDStack/SE-GSL) is an interpretable GSL framework that effectively exploits the substantive structure of the graph. We conduct extensive experiments and demonstrate significant and consistent improvements in the effectiveness of node representation learning and the robustness against edge perturbations.

Contribution highlights: i) SE-GSL provides a generic GSL solution to improve both the effectiveness and robustness of mainstream GNN approaches. ii) SE-GSL offers a new perspective on navigating the complexity of attribute noise and edge noise by leveraging structural entropy as an effective measure and the encoding tree as the graph hierarchical abstraction. iii) SE-GSL presents a series of optimizations on the encoding tree and graph reconstruction that not only explicitly interpret the hierarchical semantics of the graph but also reduce the negative impact of unreliable fusion of node features and structural topology on the performance of GNNs. iv) We present a visualization study to reveal the improved interpretability as the graph structure evolves.

2. Preliminaries

This section formally reviews the basic concepts of Graph, Graph Neural Networks (GNNs), Graph Structure Learning (GSL), and Structural Entropy. Important notations are given in Appendix  A.1.

2.1. Graph and Graph Structure Learning

Graph and Community. Let $G=\{V,E,X\}$ denote a graph, where $V$ is the set of $n$ vertices (a vertex is defined in the graph and a node in the tree), $E\subseteq V\times V$ is the edge set, and $X\in\mathbb{R}^{n\times d}$ refers to the vertex attribute set. $\mathrm{A}\in\mathbb{R}^{n\times n}$ denotes the adjacency matrix of $G$, where $\mathrm{A}_{ij}$ is the weight of the edge between vertex $i$ and vertex $j$ in $G$. Particularly, if $G$ is unweighted, $\mathrm{A}\in\{0,1\}^{n\times n}$ and $\mathrm{A}_{ij}$ only indicates the existence of an edge. In this work, we only consider undirected graphs, where $\mathrm{A}_{ij}=\mathrm{A}_{ji}$. For any vertex $v_{i}$, the degree of $v_{i}$ is defined as $d(v_{i})=\sum_{j}\mathrm{A}_{ij}$, and $D=\mathrm{diag}(d(v_{1}),d(v_{2}),\dots,d(v_{n}))$ refers to the degree matrix.

Suppose that $\mathcal{P}=\{P_{1},P_{2},\dots,P_{L}\}$ is a partition of $V$. Each $P_{i}$ is called a community (aka. module or cluster), representing a group of vertices with commonality. Due to the grouping nature of real-world networks, each community of the graph can be hierarchically split into multi-level sub-communities. Such a hierarchical community partition (i.e., hierarchical semantics) of a graph can be intrinsically abstracted as an encoding tree (Li and Pan, 2016; Li et al., 2018b), where each tree node represents a specific community. Take Fig. 1 as an example: at a high abstraction (semantic) level, the entire graph can be categorized into two coarse-grained communities, i.e., teachers (T) and students (S). Students can be further identified as sub-communities such as S.1 and S.2, as per the class placement scheme.

Graph Structure Learning (GSL). For any given graph $G$, the goal of GSL (Zhu et al., 2021b) is to simultaneously learn an optimal graph structure $G^{*}$ optimized for a specific downstream task and the corresponding graph representation $Z$. In general, the objective of GSL can be summarized as $\mathcal{L}_{gsl}=\mathcal{L}_{task}(Z,Y)+\alpha\mathcal{L}_{reg}(Z,G^{*},G)$, where $\mathcal{L}_{task}$ refers to a task-specific objective with respect to the learned representation $Z$ and the ground truth $Y$, $\mathcal{L}_{reg}$ imposes constraints on the learned graph structure and representations, and $\alpha$ is a hyper-parameter.
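To make the generic objective concrete, the following minimal PyTorch sketch instantiates $\mathcal{L}_{gsl}$ for node classification. The concrete form of $\mathcal{L}_{reg}$ is framework-specific, so the feature-smoothness regularizer below (and all function and variable names) is only an illustrative assumption, not part of SE-GSL or any cited framework.

```python
import torch
import torch.nn.functional as F

def gsl_loss(Z, Y, train_mask, A_learned, X, alpha=0.1):
    """L_gsl = L_task(Z, Y) + alpha * L_reg(Z, G*, G) for node classification."""
    # Task term: cross-entropy on the labelled (training) nodes.
    loss_task = F.cross_entropy(Z[train_mask], Y[train_mask])
    # Illustrative regularizer: penalize large attribute distances across learned edges
    # (a common smoothness constraint; not the specific L_reg of any cited framework).
    diff = torch.cdist(X, X) ** 2
    loss_reg = (A_learned * diff).sum() / A_learned.sum()
    return loss_task + alpha * loss_reg
```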

2.2. Structural Entropy

Different from information entropy (aka. Shannon entropy), which measures the uncertainty of a probability distribution in information theory (Shannon, 1948), structural entropy (Li and Pan, 2016) measures the structural diversity of a system, e.g., the uncertainty embedded in a graph.

Encoding Tree. Formally, the encoding tree $\mathcal{T}$ of a graph $G=(V,E)$ holds the following properties: (1) The root node $\lambda$ in $\mathcal{T}$ has a label $T_{\lambda}=V$, where $V$ represents the set of all vertices in $G$. (2) Each non-root node $\alpha$ has a label $T_{\alpha}\subset V$; furthermore, if $\alpha$ is a leaf node, $T_{\alpha}$ is a singleton with one vertex in $V$. (3) For each non-root node $\alpha$, its parent node in $\mathcal{T}$ is denoted as $\alpha^{-}$. (4) For each non-leaf node $\alpha$, its $i$-th child node is denoted as $\alpha^{\langle i\rangle}$, ordered from left to right as $i$ increases. (5) For each non-leaf node $\alpha$ with $N$ children, all vertex subsets $T_{\alpha^{\langle i\rangle}}$ form a partition of $T_{\alpha}$, written as $T_{\alpha}=\bigcup_{i=1}^{N}T_{\alpha^{\langle i\rangle}}$ and $\bigcap_{i=1}^{N}T_{\alpha^{\langle i\rangle}}=\varnothing$. If the encoding tree's height is restricted to $K$, we call it a $K$-level encoding tree. Entropy measures can be conducted on different encoding trees.

One-dimensional Structural Entropy. In a single-level encoding tree $\mathcal{T}$, the structural entropy degenerates to the unstructured Shannon entropy, which is formulated as:

(1) $H^{1}(G)=-\sum_{v\in V}{\frac{d_{v}}{vol(G)}\log_{2}{\frac{d_{v}}{vol(G)}}}$,

where $d_{v}$ is the degree of vertex $v$, and $vol(G)$ is the sum of the degrees of all vertices in $G$. According to the fundamental research (Li and Pan, 2016), the one-dimensional structural entropy $H^{1}(G)$ measures the uncertainty of the vertex set $V$ in $G$, which is also the upper bound on the amount of information embedded in $G$.
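A minimal Python sketch of Eq. (1) follows (our own illustration rather than the released implementation); it computes $H^{1}(G)$ directly from a weighted adjacency matrix.

```python
import numpy as np

def one_dim_structural_entropy(A: np.ndarray) -> float:
    """H^1(G) = -sum_v (d_v / vol(G)) * log2(d_v / vol(G)) for an undirected graph."""
    degrees = A.sum(axis=1)            # weighted vertex degrees d_v
    vol = degrees.sum()                # vol(G): sum of all vertex degrees
    p = degrees[degrees > 0] / vol     # isolated vertices contribute nothing
    return float(-(p * np.log2(p)).sum())

# Toy usage: a triangle with one pendant vertex.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(one_dim_structural_entropy(A))   # ~1.91 bits
```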

High-dimensional Structural Entropy. For an encoding tree $\mathcal{T}$, we define the high-dimensional structural entropy of $G$ as:

(2) $H^{K}(G)=\min_{\forall\mathcal{T}:\,height(\mathcal{T})\leq K}\{H^{\mathcal{T}}(G)\}$,
(3) $H^{\mathcal{T}}(G)=\sum_{\alpha\in\mathcal{T},\alpha\neq\lambda}{H^{\mathcal{T}}(G;\alpha)}=-\sum_{\alpha\in\mathcal{T},\alpha\neq\lambda}{\frac{g_{\alpha}}{vol(G)}\log_{2}{\frac{\mathcal{V}_{\alpha}}{\mathcal{V}_{\alpha^{-}}}}}$,

where $g_{\alpha}$ is the total weight of the cut edge set $[T_{\alpha},T_{\lambda}/T_{\alpha}]$, i.e., all edges connecting vertices inside $T_{\alpha}$ with vertices outside $T_{\alpha}$. $\mathcal{V}_{\alpha}$ is the sum of degrees of all vertices in $T_{\alpha}$. $H^{\mathcal{T}}(G;\alpha)$ is the structural entropy of node $\alpha$, and $H^{\mathcal{T}}(G)$ is the structural entropy of $\mathcal{T}$. $H^{K}(G)$ is the $K$-dimensional structural entropy, obtained with the optimal encoding tree of $K$ levels.
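The sketch below (again our own illustration, with a hand-coded tree layout) evaluates Eq. (3) for a given encoding tree; each tree node stores its parent and the vertex subset $T_{\alpha}$ it labels.

```python
import numpy as np

def tree_structural_entropy(A: np.ndarray, tree: dict) -> float:
    """tree: node_id -> {"parent": id or None, "vertices": set of vertex indices};
    the root is the node whose parent is None and whose vertex set is V."""
    deg = A.sum(axis=1)
    vol = deg.sum()
    n = len(deg)
    H = 0.0
    for node, info in tree.items():
        if info["parent"] is None:                       # skip the root lambda
            continue
        T_a = info["vertices"]
        T_p = tree[info["parent"]]["vertices"]
        V_a = sum(deg[v] for v in T_a)                   # volume of the community
        V_p = sum(deg[v] for v in T_p)                   # volume of its parent
        g_a = sum(A[u, v] for u in T_a for v in range(n) if v not in T_a)  # cut weight
        H += -(g_a / vol) * np.log2(V_a / V_p)
    return float(H)

# A 2-level encoding tree over a 4-vertex graph: root -> {0,1}, {2,3} -> leaves.
A = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
tree = {
    "root": {"parent": None,   "vertices": {0, 1, 2, 3}},
    "c1":   {"parent": "root", "vertices": {0, 1}},
    "c2":   {"parent": "root", "vertices": {2, 3}},
    "l0": {"parent": "c1", "vertices": {0}}, "l1": {"parent": "c1", "vertices": {1}},
    "l2": {"parent": "c2", "vertices": {2}}, "l3": {"parent": "c2", "vertices": {3}},
}
print(tree_structural_entropy(A, tree))
```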

3. Our Approach

Figure 2. The overall architecture of SE-GSL.

This section presents the architecture of SE-GSL and then elaborates on how we enhance graph structure learning through structural entropy-based optimization of the hierarchical encoding tree.

3.1. Overview of  SE-GSL

Fig. 2 depicts the overall pipeline. At the core of SE-GSL is the structure optimization procedure that transforms and enhances the graph structure. More specifically, it encompasses multiple stages: graph structure enhancement, hierarchical encoding tree generation, and sampling-based structure reconstruction, before an iterative representation optimization.

First, the original topological information is integrated with vertex attributes and the neighborhood in close proximity. Specifically, we devise a similarity-based edge reweighting mechanism and incorporate $k$-NN graph structuralization to provide auxiliary edge information. The most suitable $k$ is selected under the guidance of the one-dimensional structural entropy maximization strategy (§ 3.2). Upon the enhanced graph, we present a hierarchical abstraction mechanism to further suppress edge noise and reveal the high-level hierarchical community structure (encoding tree) (§ 3.3). A novel sampling-based approach is designed to build a new graph topology from the encoding tree, particularly by restoring edge connectivity from the tree hierarchy (§ 3.4). The core idea is to weaken the association between high-level communities whilst establishing dense and extensive connections within low-level communities. To this end, we transform the node structural entropy into a probability rather than relying on a deterministic threshold. Through multiple rounds of stochastic sampling, favorable graph structures for GNNs are more likely to be found. Afterward, the rebuilt graph is fed into downstream generic GNN encoders. To constantly improve both the node representation and the graph structure, the optimization pipeline is iterated for multiple epochs; a high-level sketch follows, and the full training procedure of SE-GSL is summarized in Appendix A.4.
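The following Python pseudo-pipeline shows how the stages are iterated. The helper names, default hyperparameters, and data-flow details are our own placeholders for the stages above (not the released code); the epoch counts echo the settings in § 4.

```python
def se_gsl_pipeline(G, X, Y, train_mask, gnn, iterations=10, gnn_epochs=200):
    """Alternate structure optimization (Secs. 3.2-3.4) with GNN training."""
    for _ in range(iterations):
        S = pairwise_similarity(X)                  # Eq. (4): PCC similarity matrix
        k = select_k(G, S)                          # Sec. 3.2: maximize H^1(G_f^(k))
        G_f = fuse_and_reweight(G, S, k)            # Sec. 3.2: k-NN fusion + Eq. (5)
        tree = optimal_encoding_tree(G_f, K=3)      # Sec. 3.3: minimize H^T(G_f)
        G = sample_reconstruct(tree, G_f, theta=1)  # Sec. 3.4: entropy-guided sampling
        for _ in range(gnn_epochs):                 # Sec. 4: 200 GNN epochs per iteration
            Z = train_step(gnn, G, X, Y, train_mask)
        X = Z                                       # refreshed node representations
    return gnn, G
```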

3.2. Graph Structure Enhancement

To fully incorporate vertex attributes and neighborhood information into the graph structure, we perform feature fusion and edge reweighting so that the topological structure, together with the informative vertex adjacency similarity, can be passed on to the encoding tree generator. To begin with, we calculate the pair-wise similarity matrix $S\in\mathbb{R}^{|V|\times|V|}$ among vertices in graph $G$. To better depict the linear correlation between two vertex attribute vectors, we take the Pearson correlation coefficient (PCC) as the similarity measure in the experiments, i.e.,

(4) $S_{ij}=\mathrm{PCC}(x_{i},x_{j})=\frac{E((x_{i}-u_{i})(x_{j}-u_{j}))}{\sigma_{i}\sigma_{j}}$,

where $x_{i}$ and $x_{j}\in\mathbb{R}^{1\times d}$ are the attribute vectors of vertices $i$ and $j$, respectively, $u_{i}$ and $\sigma_{i}$ denote the mean and standard deviation of $x_{i}$, and $E(\cdot)$ denotes the expectation (i.e., the mean of the element-wise product). Based on $S$, we can intrinsically construct the $k$-NN graph $G_{knn}=\{V,E_{knn}\}$, where each edge in $E_{knn}$ connects a vertex with one of its $k$ nearest neighbors (e.g., the edges in red in Fig. 2). We fuse $G_{knn}$ and the original $G$ into $G_{f}=\{V,E_{f}=E\cup E_{knn}\}$.
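A brief sketch of this similarity and fusion step (the helper name is ours, and we rely on NumPy's corrcoef for the PCC of Eq. (4)):

```python
import numpy as np

def knn_fused_edges(X: np.ndarray, edges: set, k: int) -> set:
    """X: n x d attribute matrix; edges: set of undirected pairs (i, j) with i < j."""
    S = np.corrcoef(X)                        # pairwise Pearson correlation (Eq. 4)
    np.fill_diagonal(S, -np.inf)              # keep a vertex out of its own k-NN list
    E_knn = set()
    for i in range(X.shape[0]):
        for j in np.argsort(-S[i])[:k]:       # the k most similar neighbours of i
            E_knn.add((min(i, int(j)), max(i, int(j))))
    return edges | E_knn                      # E_f = E ∪ E_knn

# Toy usage: 5 vertices with 3-dimensional attributes and a small edge set.
X = np.random.rand(5, 3)
E = {(0, 1), (1, 2), (3, 4)}
print(sorted(knn_fused_edges(X, E, k=2)))
```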

A key follow-up step is pinpointing the most suitable number $k$ of nearest neighbors. An excessive value of $k$ would make $G_{f}$ over-noisy and computationally inefficient, while a small $k$ would result in insufficient information and difficulties in hierarchy extraction. As outlined in § 2.2, a larger one-dimensional structural entropy indicates more information that $G_{f}$ can potentially hold. Hence, we maximize the one-dimensional structural entropy $H^{1}(G_{f})$ to guide the selection of $k$ towards larger encoding information. In practice, we gradually increase the integer parameter $k$, generate the corresponding $G^{(k)}_{f}$, and compute $H^{1}(G^{(k)}_{f})$. Observably, when $k$ reaches a threshold $k_{m}$, $H^{1}(G^{(k)}_{f})$ reaches a plateau without noticeable increase. This motivates us to regard the critical point $k_{m}$ as the target parameter. The $k$-selector algorithm is depicted in Appendix A.5. In addition, the edge $e_{ij}$ between $v_{i}$ and $v_{j}$ is reweighted as:

(5) $\omega_{ij}=S_{ij}+M,\quad M=\frac{1}{2|V|}\cdot\frac{1}{|E|}\sum_{1<i,j<n}{S_{ij}}$,

where $M$ is a modification factor that amplifies trivial edge weights and thus makes the $k$-selector more sensitive to noise.
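A minimal sketch of the $k$-selector idea follows; the exact procedure is given in Appendix A.5, and the plateau tolerance `eps`, the unit weights for added $k$-NN edges, and the helper names below are our simplifying assumptions.

```python
import numpy as np

def one_dim_entropy(A):
    d = A.sum(axis=1); vol = d.sum(); p = d[d > 0] / vol
    return float(-(p * np.log2(p)).sum())

def fuse_knn(A, S, k):
    """Add each vertex's k most similar neighbours (unit weight, for simplicity)."""
    A_f = A.copy()
    Sm = S.copy(); np.fill_diagonal(Sm, -np.inf)
    for i in range(A.shape[0]):
        for j in np.argsort(-Sm[i])[:k]:
            A_f[i, j] = A_f[j, i] = max(A_f[i, j], 1.0)
    return A_f

def select_k(A, S, k_max=50, eps=1e-3):
    """Increase k until H^1(G_f^(k)) stops rising noticeably, i.e. the plateau k_m."""
    prev = -np.inf
    for k in range(1, k_max + 1):
        h = one_dim_entropy(fuse_knn(A, S, k))
        if h - prev < eps:            # entropy gain has flattened out
            return k - 1 if k > 1 else 1
        prev = h
    return k_max
```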

3.3. Hierarchical Encoding Tree Generation

Our methodology of abstracting the fused graph into a hierarchy is inspired by structural entropy theory (Li and Pan, 2016; Li et al., 2018b). We intend to minimize the graph uncertainty (i.e., edge noise) and maximize the knowledge embedded (e.g., the optimal partition) in the high-dimensional hierarchy. Correspondingly, the objective of structural entropy minimization is to find an encoding tree $\mathcal{T}$ that minimizes $H^{\mathcal{T}}(G_{f})$ as defined in Eq. 3. Due to the difficulty of quantifying graph semantic complexity, we restrict the optimization objective to a $K$-level tree with a hyperparameter $K$. The optimal $K$-dimensional encoding tree is represented as:

(6) $\mathcal{T^{*}}=\mathop{\arg\min}\limits_{\forall\mathcal{T}:\,height(\mathcal{T})\leq K}(H^{\mathcal{T}}(G))$.

To address this optimization problem, we design a greedy heuristic algorithm to approximate $H^{K}(G)$. To assist the greedy heuristic, we define two basic operators:

Definition 1.

Combining operator: Given an encoding tree $\mathcal{T}$ for $G=(V,E)$, let $\alpha$ and $\beta$ be two nodes in $\mathcal{T}$ sharing the same parent $\gamma$. The combining operator $\mathrm{CB}_{\mathcal{T}}(\alpha,\beta)$ updates the encoding tree as: $\gamma\leftarrow\delta^{-};\ \delta\leftarrow\alpha^{-};\ \delta\leftarrow\beta^{-}$. That is, a new node $\delta$ is inserted between $\gamma$ and its children $\alpha,\beta$.

Definition 2.

Lifting operator: Given an encoding tree $\mathcal{T}$ for $G=(V,E)$, let $\alpha$, $\beta$ and $\gamma$ be nodes in $\mathcal{T}$ satisfying $\beta^{-}=\gamma$ and $\alpha^{-}=\beta$. The lifting operator $\mathrm{LF}_{\mathcal{T}}(\alpha,\beta)$ updates the encoding tree as: $\gamma\leftarrow\alpha^{-};\ \mathrm{IF}\ T_{\beta}=\varnothing\ \mathrm{THEN}\ \mathrm{drop}(\beta)$. The subtree rooted at $\alpha$ is lifted by placing $\alpha$ as a child of $\gamma$; if $\beta$ has no children left after lifting, $\beta$ is deleted from $\mathcal{T}$.

In light of the high-dimensional structural entropy minimization principle (Li et al., 2018b), we first build a full-height binary encoding tree by greedily performing combining operations: two children of the root are combined to form a new partition, iteratively, until the structural entropy can no longer be reduced. To satisfy the height restriction, we further squeeze the encoding tree by lifting subtrees to higher levels. To do so, we select and conduct the lifting operation, between a non-root node and its parent node, that reduces the structural entropy the most. This is repeated until the encoding tree height is no greater than $K$ and the structural entropy can no longer be decreased. Eventually, we obtain an encoding tree of height $K$ with minimal structural entropy. The pseudo-code is detailed in Appendix A.6; a sketch of the entropy gain evaluated for one combining operation follows.
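The sketch below (the data layout and the toy graph are our own; the released code may organize this differently) computes the entropy reduction achieved by one combining operation $\mathrm{CB}_{\mathcal{T}}(\alpha,\beta)$ on siblings $\alpha,\beta$ under a parent $\gamma$; the greedy builder repeatedly applies the merge with the largest positive gain.

```python
import numpy as np

def node_entropy(g, V_node, V_parent, vol):
    """H^T(G; alpha) = -(g_alpha / vol(G)) * log2(V_alpha / V_alpha^-)."""
    return -(g / vol) * np.log2(V_node / V_parent)

def combine_gain(A, T_alpha, T_beta, V_gamma):
    """Entropy reduction of merging sibling communities alpha and beta (children of
    gamma, whose volume is V_gamma) under a new node delta."""
    deg = A.sum(axis=1); vol = deg.sum(); n = len(deg)
    cut = lambda T: sum(A[u, v] for u in T for v in range(n) if v not in T)
    V_a, V_b = sum(deg[v] for v in T_alpha), sum(deg[v] for v in T_beta)
    g_a, g_b = cut(T_alpha), cut(T_beta)
    T_delta = T_alpha | T_beta
    V_d, g_d = V_a + V_b, cut(T_delta)
    before = node_entropy(g_a, V_a, V_gamma, vol) + node_entropy(g_b, V_b, V_gamma, vol)
    after = (node_entropy(g_a, V_a, V_d, vol) + node_entropy(g_b, V_b, V_d, vol)
             + node_entropy(g_d, V_d, V_gamma, vol))
    return before - after          # positive gain: the merge lowers H^T(G)

# Three densely linked pairs on a chain: {0,1}-{2,3}-{4,5}; gamma is the root.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[u, v] = A[v, u] = 1.0
print(combine_gain(A, {0, 1}, {2, 3}, V_gamma=A.sum()))   # ~0.10 > 0, so merging helps
```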

3.4. Sample-based Graph Reconstruction

The subsequent step is to restore the topological structure from the hierarchy whilst preserving the established hierarchical semantics in the optimal encoding tree $\mathcal{T}^{*}$. The key to graph reconstruction is determining which edges to augment or weaken. Intuitively, nodes located in different communities of real-life graphs tend to have different labels. Prior work (Zhu et al., 2020a) has demonstrated the effectiveness of strengthening intra-cluster edges and reducing inter-cluster edges in a cluster-aware approach to refine the graph topology. However, for hierarchical communities, simply removing cross-community edges undermines the integrity of the higher-level community, while adding edges within communities could introduce additional edge noise to lower-level partitioning.

We optimize the graph structure with community preservation by using the structural entropy of deduction between two interrelated nodes as the criterion for edge reconstruction:

Definition 3.

Structural entropy of deduction: Let $\mathcal{T}$ be an encoding tree of $G$. We define the structural entropy of the deduction from a non-leaf node $\lambda$ to its descendant $\alpha$ as:

(7) $H^{\mathcal{T}}(G;(\lambda,\alpha])=\sum_{\beta,\,T_{\alpha}\subseteq T_{\beta}\subset T_{\lambda}}{H^{\mathcal{T}}(G;\beta)}$.

This node structural entropy definition exploits the hierarchical structure of the encoding tree and offers a generic, top-down measure of the deduction needed to determine a community or vertex in the graph.
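Concretely, Eq. (7) sums the structural entropies of $\alpha$ and its ancestors below the root; a small Python sketch (the tree layout is our own assumption, matching the earlier illustration) is:

```python
import numpy as np

def deduction_entropy(A, tree, alpha):
    """H^T(G; (lambda, alpha]): tree maps node -> {"parent": node or None,
    "vertices": set of vertex indices}; alpha is any non-root node id."""
    deg = A.sum(axis=1); vol = deg.sum(); n = len(deg)
    cut = lambda T: sum(A[u, v] for u in T for v in range(n) if v not in T)
    volume = lambda T: sum(deg[v] for v in T)
    H, node = 0.0, alpha
    while tree[node]["parent"] is not None:      # climb from alpha up to (not incl.) lambda
        T_b = tree[node]["vertices"]
        T_p = tree[tree[node]["parent"]]["vertices"]
        H += -(cut(T_b) / vol) * np.log2(volume(T_b) / volume(T_p))
        node = tree[node]["parent"]
    return float(H)
```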

From the viewpoint of message passing, vertices with a higher structural entropy of deduction carry more diversity and uncertainty and thus require more supervisory information. Therefore, such vertices need expanded connection fields during graph reconstruction so that they can aggregate more information via extensive edges. To achieve this, we propose a sampling-based graph reconstruction method. The idea is to explore node pairs at the leaf level (the lowest semantic level) and stochastically generate an edge for a given pair of nodes, with a probability associated with the deduction structural entropy.

Specifically, for a given $\mathcal{T}$, assume node $\delta$ has a set of child nodes $\{\delta^{\langle 1\rangle},\delta^{\langle 2\rangle},\dots,\delta^{\langle n\rangle}\}$. The probability of the child $\delta^{\langle i\rangle}$ is defined as $P(\delta^{\langle i\rangle})=\sigma_{\delta}(H^{\mathcal{T}}(G_{f};(\lambda,\delta^{\langle i\rangle}]))$, where $\lambda$ is the root of $\mathcal{T}$ and $\sigma_{\delta}(\cdot)$ represents a distribution function. Taking the $\mathrm{softmax}$ function as an example, the probability of $\delta^{\langle i\rangle}$ can be calculated as:

(8) $P(\delta^{\langle i\rangle})=\frac{\exp(H^{\mathcal{T}}(G_{f};(\lambda,\delta^{\langle i\rangle}]))}{\sum_{j=1}^{n}\exp(H^{\mathcal{T}}(G_{f};(\lambda,\delta^{\langle j\rangle}]))}$.

The probability of a non-root node can be acquired recursively. To generate new edges, we sample leaf node pairs in a top-down manner; a single sampling flow proceeds as follows:

(1) For the encoding tree (or subtree) rooted at $\delta$, two different child nodes $\delta^{\langle i\rangle}$ and $\delta^{\langle j\rangle}$ are selected by sampling according to $P(\delta^{\langle i\rangle})$ and $P(\delta^{\langle j\rangle})$; let $\delta_{1}\leftarrow\delta^{\langle i\rangle}$ and $\delta_{2}\leftarrow\delta^{\langle j\rangle}$. (2) If $\delta_{1}$ is a non-leaf node, we perform one sampling on the subtree rooted at $\delta_{1}$ to get $\delta_{1}^{\langle i\rangle}$ and update $\delta_{1}\leftarrow\delta_{1}^{\langle i\rangle}$; the same is done for $\delta_{2}$. (3) After recursively performing step (2), we obtain two leaf nodes $\delta_{1}$ and $\delta_{2}$, and add the edge connecting vertex $v_{1}=T_{\delta_{1}}$ and vertex $v_{2}=T_{\delta_{2}}$ to the edge set $E^{\prime}$ of graph $G^{\prime}$. To establish extensive connections at all levels, multiple samplings are performed on all encoding subtrees: for each subtree rooted at $\delta$, we conduct $\theta\times n$ independent samplings, where $n$ is the number of $\delta$'s children and $\theta$ is a hyperparameter positively correlated with the density of the reconstructed graph. For simplicity, we adopt a uniform $\theta$ for all subtrees; separately setting and tuning $\theta$ for each semantic level for more precise control is also feasible. A minimal sketch of this sampling procedure follows.
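In the sketch below (our own simplified layout, not the released implementation), each tree node stores its children and, for leaves, its singleton vertex set; `H_ded` holds precomputed deduction entropies, e.g., obtained with the function sketched after Eq. (7).

```python
import numpy as np

def child_probs(tree, H_ded, node):
    """Softmax over the deduction entropies of `node`'s children (Eq. 8)."""
    ch = tree[node]["children"]
    e = np.exp([H_ded[c] for c in ch])
    return ch, e / e.sum()

def descend_to_leaf(tree, H_ded, node, rng):
    while tree[node]["children"]:                        # step (2): recurse downwards
        ch, p = child_probs(tree, H_ded, node)
        node = rng.choice(ch, p=p)
    return next(iter(tree[node]["vertices"]))            # a leaf labels a single vertex

def sample_edges(tree, H_ded, theta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    E_new = set()
    for node, info in tree.items():                      # every encoding subtree
        ch = info["children"]
        if len(ch) < 2:
            continue
        for _ in range(int(theta * len(ch))):            # theta * n samplings per subtree
            ids, p = child_probs(tree, H_ded, node)
            c1, c2 = rng.choice(ids, size=2, replace=False, p=p)     # step (1)
            u = descend_to_leaf(tree, H_ded, c1, rng)
            v = descend_to_leaf(tree, H_ded, c2, rng)
            E_new.add((min(u, v), max(u, v)))            # step (3): connect the two leaves
    return E_new
```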

3.5. Time Complexity of  SE-GSL

The overall time complexity is $O(n^{2}+n+n\log^{2}n)$, where $n$ is the number of nodes. Separately, in § 3.2, calculating the similarity matrix costs $O(n^{2})$ and the $k$-selector costs $O(n)$. According to (Li and Pan, 2016), the optimization of a high-dimensional encoding tree in § 3.3 costs $O(n\log^{2}n)$. As for the sampling process in § 3.4, the time complexity can be shown to be $O(2n)$. We report the time cost of SE-GSL in Appendix A.3.

4. Experimental Setup

Table 1. Classification Accuracy (%) comparison, with improvement range of  SE-GSL  against the baselines. The best results are bolded and the second-best are underlined. Green denotes the outperformance percentage, while yellow denotes underperformance.
Method Cora Citeseer Pubmed PT TW Actor Cornell Texas Wisconsin
GCN 87.26±0.63 76.22±0.71 87.46±0.12 67.62±0.21 62.46±1.94 27.65±0.55 49.19±1.80 57.30±2.86 48.57±4.08
GAT 87.52±0.69 76.04±0.78 86.61±0.15 68.76±1.01 61.68±1.20 27.77±0.59 57.09±6.32 58.10±4.14 51.34±4.78
GCNII 87.57±0.87 75.47±1.01 88.64±0.23 68.93±0.93 65.17±0.47 30.66±0.66 58.76±7.11 55.36±6.45 51.96±4.36
Grand 87.93±0.71 77.59±0.85 86.14±0.98 69.80±0.75 66.79±0.22 29.80±0.60 57.21±2.48 56.56±1.53 52.94±3.36
Mixhop 85.71±0.85 75.94±1.00 87.31±0.44 69.48±0.30 66.34±0.22 33.72±0.76 64.47±4.78 63.16±6.28 72.12±3.34
Dropedge 86.32±1.09 76.12±1.32 87.58±0.34 68.49±0.91 65.24±1.45 30.10±0.71 58.94±5.95 59.20±5.43 60.45±4.48
Geom-GCN-P 84.93 75.14 88.09 - - 31.63 60.81 67.57 64.12
Geom-GCN-S 85.27 74.71 84.75 - - 30.30 55.68 59.73 56.67
GDC 87.17±0.36 76.13±0.53 88.08±0.16 66.14±0.54 64.14±1.40 28.74±0.76 59.46±4.35 56.42±3.99 48.30±4.29
GEN 87.84±0.69 78.77±0.88 86.13±0.41 71.62±0.78 65.16±0.77 36.69±1.02 65.57±6.74 73.38±6.65 54.90±4.73
H2GCN-2 87.81±1.35 76.88±1.77 89.59±0.33 68.15±0.30 63.33±0.77 35.62±1.30 82.16±6.00 82.16±5.28 85.88±4.22
SE-GSL 87.93±1.24 77.63±1.65 88.16±0.76 71.91±0.66 66.99±0.93 36.34±2.07 75.21±5.54 82.49±4.80 86.27±4.32
Improvement 0.00∼3.00 -1.14∼2.92 -1.43∼3.41 0.29∼5.77 0.20∼5.31 -0.35∼8.69 -6.95∼26.02 0.33∼27.13 0.39∼37.97

Software and Hardware. All experiments are conducted on a Linux server with GPU (NVIDIA RTX A6000) and CPU (Intel i9-10980XE), using PyTorch 1.12 and Python 3.9.

Datasets. We experiment on nine open graph benchmark datasets, including three citation networks (i.e., Cora, Citeseer, and Pubmed), two social networks (i.e., PT and TW), three website networks from WebKB (i.e., Cornell, Texas, and Wisconsin), and an actor co-occurrence network (Actor). Their statistics are summarized in Appendix A.2.

Baseline and backbone models. We compare SE-GSL with baselines including general GNNs (i.e., GCN, GAT, GCNII, Grand) and graph learning/high-order neighborhood awareness methods (i.e., MixHop, Dropedge, Geom-GCN, GDC, GEN, H2GCN). Four classic GNNs (GCN, GAT, GraphSAGE, APPNP) are chosen as the backbone encoders that SE-GSL works upon. Details are in Appendix A.3.

Parameter settings. For SE-GSL with various backbones, we uniformly adopt two-layer GNN encoders. We adopt ReLU (ELU for GAT) as the activation function and, to avoid over-fitting, apply a dropout layer with a dropout rate of 0.5. The number of training iterations is set to 10. The embedding dimension $d$ is chosen from $\{8,16,32,48,64\}$, the height of the encoding tree $K$ is searched among $\{2,3,4\}$, and the hyperparameter $\theta$ in § 3.4 is tuned among $\{0.5,1,3,5,10,30\}$. We adopt the data split scheme of Geom-GCN (Pei et al., 2019) and H2GCN (Zhu et al., 2020b) for all experiments: nodes are randomly divided into train, validation, and test sets, which take up 48%, 32%, and 20%, respectively. In each iteration, the GNN encoder is optimized for 200 epochs using the Adam optimizer with an initial learning rate of 0.01 and a weight decay of 5e-4. The model with the highest accuracy on the validation set is used for testing and reporting.

Figure 3. Results of  SE-GSL with different encoding tree heights.

5. Results and Analysis

In this section, we demonstrate the efficacy of SE-GSL on semi-supervised node classification (§ 5.1), followed by micro-benchmarks that investigate the detailed effect of the submodules on the overall performance and validate the robustness of SE-GSL when tackling random perturbations (§ 5.2). For better interpretation, we visualize the change of structural entropy and graph topology (§ 5.3).

5.1. Node Classification

5.1.1. Comparison with baselines

We compare the node classification performance of SE-GSL with ten baseline methods on nine benchmark datasets. Table 1 shows the average accuracy and the standard deviation. Note that the results of H2GCN (except PT and TW) and Geom-GCN are the values reported in the original papers ('-' for not reported), while the rest are obtained by executing the source code provided by their authors under our experimental settings. Our observations are three-fold:

(1) SE-GSL achieves the best results on 5 datasets, top-two results on 8 datasets, and competitive results on all datasets. The accuracy is improved by up to 3.41% on Pubmed, 3.00% on Cora, and 2.92% on Citeseer compared to the baselines. This indicates that our design can effectively capture the inherent and deep structure of the graph and hence improve classification.

(2) SE-GSL shows significant improvement on datasets with heterophilous graphs, e.g., up to 37.97% and 27.13% improvement on the Wisconsin and Texas datasets, respectively. This demonstrates the importance of the graph structure enhancement, which contributes to a more informative and robust node representation.

(3) While all GNN methods achieve satisfactory results on the citation networks, the graph learning/high-order neighborhood awareness frameworks substantially outperform the others on the WebKB datasets and the actor co-occurrence network, which are highly disassortative. This is because these methods optimize local neighborhoods for better information aggregation. Our method is one of the top performers among them due to its explicit exploitation of the global structure information in the graph hierarchical semantics.

5.1.2. Comparison based on different backbones

Table 2 shows the mean classification accuracy of SE-GSL with different backbone encoders. Observably, SE-GSL built upon GCN and GAT overwhelmingly outperforms the corresponding backbone models, with accuracy improvements of up to 31.04% and 27.48%, respectively. This indicates that the iterative mechanism in the SE-GSL pipeline can alternately optimize the node representation and the graph structure. We also notice that, despite their lower improvements, the SE-GSL variants based on GraphSAGE and APPNP perform relatively better than those based on GCN and GAT. This is most likely because these backbone models themselves are better adapted to handling disassortative graphs.

Table 2. Classification accuracy (%) of SE-GSL and corresponding backbones. Wisc. is short for Wisconsin.
Method Actor TW Texas Wisc. Improvement
SE-GSL$_{\mathrm{GCN}}$ 35.03 66.88 75.68 79.61 ↑ 5.20∼31.04
SE-GSL$_{\mathrm{SAGE}}$ 36.20 66.92 82.49 86.27 ↑ 0.25∼6.79
SE-GSL$_{\mathrm{GAT}}$ 32.46 63.57 74.59 78.82 ↑ 4.69∼27.48
SE-GSL$_{\mathrm{APPNP}}$ 36.34 66.99 81.28 83.14 ↑ 2.01∼12.16
Table 3. The $k$ selected in each iteration of structural optimization. Bold values mark the $k$ selected when the accuracy reaches its maximum.
Iteration 1 2 3 4 5 6 7 8 9
Cora 22 22 19 22 21 22 20 21 20
Actor 23 15 15 15 14 15 14 14 15
TW 50 16 16 17 15 17 27 16 16
Wisconsin 21 16 11 16 14 13 16 13 11
Texas 21 13 13 13 13 10 14 10 14

5.2. Micro-benchmarking

5.2.1. Effectiveness of the $k$-selector

This subsection evaluates how the one-dimensional structural entropy guides the $k$-selector in § 3.2. Table 3 showcases the parameter $k$ selected in each iteration with SE-GSL$_{\mathrm{GCN}}$. Noticeably, as the iterative optimization proceeds, the selected $k$ converges to a certain range, indicating the gradual stabilization of the graph structure and node representation. The disparity of $k$ among different datasets also demonstrates the necessity of customizing $k$ per dataset rather than treating it as a static hyperparameter.

5.2.2. Impact of the encoding tree's height $K$

We evaluate all four variants of SE-GSL on the website network datasets, with the encoding tree height $K$ involved in § 3.3 varying from 2 to 4. As shown in Fig. 3, the optimal tree height varies considerably across datasets. For example, in the variants based on GAT, GCN, and APPNP, the best results are obtained at $K=3$ on Texas and at $K=4$ on Cornell and Wisconsin. By contrast, for SE-GSL$_{\mathrm{SAGE}}$, $K=2$ yields the best accuracy of 86.27%. This weak correlation between the best $K$ and the model performance is worth further investigation, which we leave as future work.

Figure 4. Robustness of  SE-GSL  against random noises.

Figure 5. The normalized structural entropy changes during the training of SE-GSL$_{\mathrm{GAT}}$ with 2-dimensional structural entropy on (a) Texas, (b) Cornell, and (c) Wisconsin. The structure is iterated every 200 epochs. By comparison, (d) shows the entropy changes on Wisconsin without the graph reconstruction strategy.

Figure 6. The visualized evolution of the graph structure on Cora (a,b,c) and Citeseer (d,e,f). The corresponding Structural Entropy (SE) is also shown.

5.2.3. Sensitivity to perturbations

We introduce random edge noises into Cora and Citeseer, with perturbation rates of 0.2, 0.4, 0.6, 0.8, and 1. As shown in Fig. 4(a), SE-GSL outperforms the baselines in both the GCN and GAT cases under most noise settings. For instance, SE-GSL$_{\mathrm{GCN}}$ achieves up to 8.09% improvement over the native GCN when the perturbation rate is 0.8; by contrast, the improvements by GCN-Jaccard and GCN-DropEdge are merely 6.99% and 5.77%, respectively. A similar phenomenon is observed for most cases on the Citeseer dataset (Fig. 4(b)), with an exception when compared against GCN-Jaccard. Nevertheless, our approach is still competitive and even better than GCN-Jaccard at a high perturbation rate.

5.3. Interpretation of Structure Evolution

5.3.1. Structural entropy variations analysis

We evaluate how the structural entropy changes during the training of SE-GSL$_{\mathrm{GAT}}$ with 2-dimensional structural entropy on the WebKB datasets. For comparison, we also visualize the entropy changes on Wisconsin without structure learning. In this experimental setting, both the graph structure and the encoding tree are updated once per iteration (i.e., every 200 GNN epochs), and within one iteration the structural entropy is only affected by the edge weights determined by the similarity matrix. For comparability, we normalize the structural entropy as $\frac{H^{\mathcal{T}}(G)}{H^{1}(G)}$.

As shown in Fig. 5(a)-(c), as the accuracy goes up, the normalized structural entropy constantly decreases during the iterative graph reconstruction, reaching minimums of 0.7408 on Texas, 0.7245 on Cornell, and 0.7344 on Wisconsin. This reflects the increasing determinism of the overall graph structure and the reduced amount of information required to determine a vertex. Interestingly, if our graph reconstruction mechanism is disabled (as shown in Fig. 5(d)), the normalized structural entropy keeps rising from 0.7878, in contrast to Fig. 5(c). Accordingly, the final accuracy converges to a much lower level of 55.34%.

Such a comparison also provides a plausible explanation for the rising trend of the normalized structural entropy within every single iteration. This stems from the smoothing effect during GNN training: as the node representations become homogenized, the graph structure is gradually smoothed, which decreases the one-dimensional structural entropy and thus increases the normalized structural entropy.

5.3.2. Visualization

Fig. 6 visualizes the topology of the original Cora and Citeseer graphs and of the 2nd and 5th iterations. The vertex color indicates the class a vertex belongs to, and the layout reflects the connecting relations; edges are hidden for clarity. As the iterations continue, much clearer clustering manifests, with fewer outliers and more concentrated clusters. Vertices with the same label become more tightly connected due to the iterative graph reconstruction scheme. This improvement greatly facilitates the interpretability of the GSL and node representation models.

6. Related Work

Graph structure learning and neighborhood optimization. The performance of GNNs is heavily dependent on task-relevant links and neighborhoods. Graph structure learning explicitly learns and adjusts the graph topology, and our SE-GSL falls into this category. GDC (Gasteiger et al., 2019b) reconnects graphs through graph diffusion to aggregate from multi-hop neighborhoods. Dropedge (Rong et al., 2019) and Neuralsparse (Zheng et al., 2020) contribute to graph denoising via edge-dropping strategies but fail to renew the overall structure. LDS-GNN (Franceschi et al., 2019) models edges by sampling graphs from a Bernoulli distribution; in contrast, we link structural entropy, which is more expressive of graph topology, to the sampling probability. GEN (Wang et al., 2021) and IDGL (Chen et al., 2020b) explore the structure from the node attribute space via the $k$-NN method. Differently, instead of directly using attribute similarity, we regenerate edges from the hierarchical abstraction of graphs to avoid inappropriate metrics. Besides adjusting the graph structure, methods that optimize the aggregation have also been proposed, with promising results on heterophilous graphs. MixHop (Abu-El-Haija et al., 2019) learns the aggregation parameters for neighborhoods of different hops through a mixing network, while H2GCN (Zhu et al., 2020b) combines higher-order neighborhoods, neighbor-embedding separation, and intermediate representations to adapt to heterophilous graphs. Geom-GCN (Pei et al., 2019) aggregates messages over both the original graph and a latent space via a designed geometric scheme.

Structural entropy with neural networks. Structural information principles (Li and Pan, 2016), which define encoding trees and structural entropy, were first used in bioinformatic structure analysis (Li et al., 2016, 2018b). Existing work mainly focuses on network analysis, node clustering, and community detection (Li et al., 2017; Liu et al., 2019; Pan et al., 2021). As an advanced theory of graphs and hierarchical structures, structural information theory has great potential in combination with neural networks. SR-MARL (Zeng et al., 2023) applies structural information principles to hierarchical role discovery in multi-agent reinforcement learning, thereby boosting agent network optimization. SEP (Wu et al., 2022) provides a graph pooling scheme based on optimal encoding trees to address local structure damage and the suboptimality problem; it essentially uses structural entropy minimization to build a multi-layer coarsened graph. MinGE (Luo et al., 2021b) and MEDE (Yang et al., 2023) estimate the node embedding dimension of GNNs via structural entropy, introducing both attribute entropy and structure entropy as objectives. Although these works exploit structural entropy to mine the latent settings of neural networks and GNNs, how to incorporate this theory into the optimization process is still understudied, and ours is among the first attempts.

7. Conclusion

To cope with edge perturbations in graphs with heterophily, this paper proposes a novel graph structure learning framework  SE-GSL that considers the structural entropy theory in graph structure optimization. We design a structure enhancement module guided by the one-dimensional structural entropy maximization strategy to extend the information embedded in the topology. To capture the hierarchical semantics of graphs, high-dimensional structural entropy minimization is performed for optimal encoding trees. We propose a node sampling technique on the encoding tree to restore the most appropriate edge connections at different community levels, taking into account the deduction structural entropy distribution. In the future, we plan to combine delicate loss functions with structural entropy so that the knowledge in encoding trees can be converted into gradient information, which will further allow for end-to-end structure optimization.

Acknowledgements.
This paper was supported by the National Key R&D Program of China through grant 2021YFB1714800, NSFC through grant 62002007, S&T Program of Hebei through grant 20310101D, Natural Science Foundation of Beijing Municipality through grant 4222030, CCF-DiDi GAIA Collaborative Research Funds for Young Scholars, the Fundamental Research Funds for the Central Universities, Xiaomi Young Scholar Funds for Beihang University, and in part by NSF under grants III-1763325, III-1909323, III-2106758, and SaTC-1930941.

References

  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. 2019. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning. PMLR, 21–29.
  • Bianchi et al. (2020) Filippo Maria Bianchi, Daniele Grattarola, and Cesare Alippi. 2020. Spectral clustering with graph neural networks for graph pooling. In Proceedings of the International Conference on Machine Learning. PMLR, 874–883.
  • Bojchevski and Günnemann (2019) Aleksandar Bojchevski and Stephan Günnemann. 2019. Certifiable robustness to graph perturbations. Advances in Neural Information Processing Systems 32 (2019).
  • Chen et al. (2020a) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020a. Simple and deep graph convolutional networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1725–1735.
  • Chen et al. (2020b) Yu Chen, Lingfei Wu, and Mohammed Zaki. 2020b. Iterative deep graph learning for graph neural networks: Better and robust node embeddings. Advances in Neural Information Processing Systems 33 (2020), 19314–19326.
  • Dai et al. (2021) Enyan Dai, Charu Aggarwal, and Suhang Wang. 2021. Nrgnn: Learning a label noise resistant graph neural network on sparsely and noisily labeled graphs. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 227–236.
  • Feng et al. (2020) Wenzheng Feng, Jie Zhang, Yuxiao Dong, Yu Han, Huanbo Luan, Qian Xu, Qiang Yang, Evgeny Kharlamov, and Jie Tang. 2020. Graph random neural networks for semi-supervised learning on graphs. Advances in Neural Information Processing Systems 33 (2020), 22092–22103.
  • Franceschi et al. (2019) Luca Franceschi, Mathias Niepert, Massimiliano Pontil, and Xiao He. 2019. Learning discrete structures for graph neural networks. In Proceedings of the International Conference on Machine Learning. PMLR, 1972–1982.
  • Gasteiger et al. (2019a) Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. 2019a. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In Proceedings of the International Conference on Learning Representations.
  • Gasteiger et al. (2019b) Johannes Gasteiger, Stefan Weißenberger, and Stephan Günnemann. 2019b. Diffusion improves graph learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS). 13366–13378.
  • Getoor (2005) Lise Getoor. 2005. Link-based classification. In Advanced methods for knowledge discovery from complex data. Springer, 189–207.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the International Conference on Machine Learning. PMLR, 1263–1272.
  • Gilpin et al. (2018) Leilani H Gilpin, David Bau, Ben Z Yuan, Ayesha Bajwa, Michael Specter, and Lalana Kagal. 2018. Explaining explanations: An overview of interpretability of machine learning. In 2018 IEEE 5th International Conference on data science and advanced analytics (DSAA). IEEE, 80–89.
  • Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017).
  • Huang et al. (2022) Qiang Huang, Makoto Yamada, Yuan Tian, Dinesh Singh, and Yi Chang. 2022. Graphlime: Local interpretable model explanations for graph neural networks. IEEE Transactions on Knowledge and Data Engineering (2022).
  • Jin et al. (2020) Wei Jin, Yao Ma, Xiaorui Liu, Xianfeng Tang, Suhang Wang, and Jiliang Tang. 2020. Graph structure learning for robust graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 66–74.
  • Keyulu et al. (2019) Xu Keyulu, Hu Weihua, Leskovec Jure, and Stefanie Jegelka. 2019. How powerful are graph neural networks. Proceedings of the International Conference on Learning Representations.
  • Li and Pan (2016) Angsheng Li and Yicheng Pan. 2016. Structural information and dynamical complexity of networks. IEEE Transactions on Information Theory 62, 6 (2016), 3290–3339.
  • Li et al. (2016) Angsheng Li, Xianchen Yin, and Yicheng Pan. 2016. Three-dimensional gene map of cancer cell types: Structural entropy minimisation principle for defining tumour subtypes. Scientific reports 6, 1 (2016), 1–26.
  • Li et al. (2018b) Angsheng Li, Xianchen Yin, Bingxiang Xu, Danyang Wang, Jimin Han, Yi Wei, Yun Deng, Ying Xiong, and Zhihua Zhang. 2018b. Decoding topologically associating domains with ultra-low resolution Hi-C data by graph structural entropy. Nature communications 9, 1 (2018), 1–12.
  • Li et al. (2017) Angsheng Li, Xiaohui Zhang, and Yicheng Pan. 2017. Resistance maximization principle for defending networks against virus attack. Physica A: Statistical Mechanics and its Applications 466 (2017), 211–223.
  • Li et al. (2022) Kuan Li, Yang Liu, Xiang Ao, Jianfeng Chi, Jinghua Feng, Hao Yang, and Qing He. 2022. Reliable Representations Make A Stronger Defender: Unsupervised Structure Refinement for Robust GNN. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 925–935.
  • Li et al. (2018a) Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang. 2018a. Adaptive graph convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Liu et al. (2019) Yiwei Liu, Jiamou Liu, Zijian Zhang, Liehuang Zhu, and Angsheng Li. 2019. REM: From structural entropy to community structure deception. Advances in Neural Information Processing Systems 32 (2019).
  • Luo et al. (2021a) Dongsheng Luo, Wei Cheng, Wenchao Yu, Bo Zong, Jingchao Ni, Haifeng Chen, and Xiang Zhang. 2021a. Learning to drop: Robust graph neural network via topological denoising. In Proceedings of the 14th ACM international conference on web search and data mining. 779–787.
  • Luo et al. (2021b) Gongxu Luo, Jianxin Li, Jianlin Su, Hao Peng, Carl Yang, Lichao Sun, Philip S Yu, and Lifang He. 2021b. Graph entropy guided node embedding dimension selection for graph neural networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. 2767–2774.
  • Pan et al. (2021) Yicheng Pan, Feng Zheng, and Bingchen Fan. 2021. An Information-theoretic Perspective of Hierarchical Clustering. arXiv preprint arXiv:2108.06036 (2021).
  • Pei et al. (2019) Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. 2019. Geom-GCN: Geometric Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations.
  • Peng et al. (2019) Hao Peng, Jianxin Li, Qiran Gong, Yangqiu Song, Yuanxing Ning, Kunfeng Lai, and Philip S. Yu. 2019. Fine-Grained Event Categorization with Heterogeneous Graph Convolutional Networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 3238–3245.
  • Peng et al. (2020) Hao Peng, Jianxin Li, Qiran Gong, Senzhang Wang, Yuanxin Ning, and Lifang He. 2020. Motif-Matching Based Subgraph-Level Attentional Convolutional Network for Graph Classification. In Proceedings of 34th AAAI Conference on Artificial Intelligence.
  • Peng et al. (2021) Hao Peng, Ruitong Zhang, Yingtong Dou, Renyu Yang, Jingyi Zhang, and Philip S. Yu. 2021. Reinforced Neighborhood Selection Guided Multi-Relational Graph Neural Networks. ACM Trans. Inf. Syst. 40, 4 (2021), 1–46.
  • Peng et al. (2023) Hao Peng, Ruitong Zhang, Shaoning Li, Yuwei Cao, Shirui Pan, and Philip S. Yu. 2023. Reinforced, Incremental and Cross-Lingual Event Detection From Social Messages. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 980–998.
  • Qingyun et al. (2022) Sun Qingyun, Li Jianxin, Peng Hao, Wu Jia, Fu Xingcheng, Ji Cheng, and Yu Philip S. 2022. Graph Structure Learning with Variational Information Bottleneck. In Proceedings of 36th AAAI Conference on Artificial Intelligence.
  • Rong et al. (2019) Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. 2019. DropEdge: Towards Deep Graph Convolutional Networks on Node Classification. In Proceedings of the International Conference on Learning Representations.
  • Rozemberczki et al. (2021) Benedek Rozemberczki, Carl Allen, and Rik Sarkar. 2021. Multi-scale attributed node embedding. Journal of Complex Networks 9, 2 (2021), cnab014.
  • Rozemberczki and Sarkar (2021) Benedek Rozemberczki and Rik Sarkar. 2021. Twitch gamers: a dataset for evaluating proximity preserving and structural role-based node embeddings. arXiv preprint arXiv:2101.03091 (2021).
  • Shannon (1948) Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423.
  • Sun et al. (2022a) Lichao Sun, Yingtong Dou, Carl Yang, Kai Zhang, Ji Wang, S Yu Philip, Lifang He, and Bo Li. 2022a. Adversarial Attack and Defense on Graph Data: A Survey. IEEE Transactions on Knowledge and Data Engineering (2022), 1–20.
  • Sun et al. (2022b) Qingyun Sun, Jianxin Li, Haonan Yuan, Xingcheng Fu, Hao Peng, Cheng Ji, Qian Li, and Philip S. Yu. 2022b. Position-Aware Structure Learning for Graph Topology-Imbalance by Relieving Under-Reaching and Over-Squashing. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. Association for Computing Machinery, 1848–1857.
  • Tang et al. (2009) Jie Tang, Jimeng Sun, Chi Wang, and Zi Yang. 2009. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. 807–816.
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. Proceedings of the International Conference on Learning Representations (2017).
  • Wang et al. (2019) Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, et al. 2019. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315 (2019).
  • Wang et al. (2021) Ruijia Wang, Shuai Mou, Xiao Wang, Wanpeng Xiao, Qi Ju, Chuan Shi, and Xing Xie. 2021. Graph structure estimation neural networks. In Proceedings of the Web Conference 2021. 342–353.
  • Wang et al. (2020) Xiao Wang, Meiqi Zhu, Deyu Bo, Peng Cui, Chuan Shi, and Jian Pei. 2020. Am-gcn: Adaptive multi-channel graph convolutional networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 1243–1253.
  • Welling and Kipf (2016) Max Welling and Thomas N Kipf. 2016. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations.
  • Wu et al. (2019) Huijun Wu, Chen Wang, Yuriy Tyshetskiy, Andrew Docherty, Kai Lu, and Liming Zhu. 2019. Adversarial examples for graph data: deep insights into attack and defense. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 4816–4823.
  • Wu et al. (2022) Junran Wu, Xueyuan Chen, Ke Xu, and Shangzhe Li. 2022. Structural entropy guided graph hierarchical pooling. In Proceedings of the International Conference on Machine Learning. PMLR, 24017–24030.
  • Wu et al. (2020b) Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. 2020b. Graph information bottleneck. Advances in Neural Information Processing Systems 33 (2020), 20437–20448.
  • Wu et al. (2020a) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. 2020a. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
  • Xu et al. (2018) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks?. In Proceedings of the International Conference on Learning Representations.
  • Yang et al. (2016) Zhilin Yang, William Cohen, and Ruslan Salakhudinov. 2016. Revisiting semi-supervised learning with graph embeddings. In Proceedings of the International Conference on Machine Learning. PMLR, 40–48.
  • Yang et al. (2023) Zhengyu Yang, Ge Zhang, Jia Wu, Hao Peng, Shan Xue, Jian Yang, Angsheng Li, Jianlin Su, and Quan Z. Sheng. 2023. Minimum Entropy Principle Guided Graph Neural Networks. In Proceedings of the ACM International WSDM Conference.
  • Ying et al. (2019) Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. 2019. Gnnexplainer: Generating explanations for graph neural networks. Advances in Neural Information Processing Systems 32 (2019).
  • Ying et al. (2018) Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. 2018. Hierarchical graph representation learning with differentiable pooling. Advances in Neural Information Processing Systems 31 (2018).
  • You et al. (2018) Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. 2018. Graphrnn: Generating realistic graphs with deep auto-regressive models. In Proceedings of the International Conference on Machine Learning. PMLR, 5708–5717.
  • Zeng et al. (2023) Xianghua Zeng, Hao Peng, and Angsheng Li. 2023. Effective and Stable Role-based Multi-Agent Collaboration by Structural Information Principles. In Proceedings of the 37th AAAI Conference on Artificial Intelligence.
  • Zhang and Zitnik (2020) Xiang Zhang and Marinka Zitnik. 2020. Gnnguard: Defending graph neural networks against adversarial attacks. Advances in Neural Information Processing Systems 33 (2020), 9263–9275.
  • Zheng et al. (2020) Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. 2020. Robust graph representation learning via neural sparsification. In Proceedings of the International Conference on Machine Learning. PMLR, 11458–11468.
  • Zhou et al. (2020) Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57–81.
  • Zhu et al. (2020b) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020b. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems 33 (2020), 7793–7804.
  • Zhu et al. (2021b) Yanqiao Zhu, Weizhi Xu, Jinghao Zhang, Qiang Liu, Shu Wu, and Liang Wang. 2021b. Deep graph structure learning for robust representations: A survey. arXiv preprint arXiv:2103.03036 (2021).
  • Zhu et al. (2021a) Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021a. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.
  • Zhu et al. (2020a) Yanqiao Zhu, Yichen Xu, Feng Yu, Shu Wu, and Liang Wang. 2020a. Cagnn: Cluster-aware graph neural networks for unsupervised graph representation learning. arXiv preprint arXiv:2009.01674 (2020).

Appendix A Appendix

A.1. Glossary of notations

In Table  4, we summarize the notations used in our work.

Table 4. Glossary of Notations.
Notation  Description
G; A; S  Graph; Adjacency matrix; Similarity matrix.
v; e; x  Vertex; Edge; Vertex attribute.
V; E; X  Vertex set; Edge set; Attribute set.
|V|; |E|  The numbers of vertices and edges.
𝒫; P_i  The partition of V; A community.
D; d(v_i)  The degree matrix; The degree of vertex v_i.
e_{ij}  The edge between v_i and v_j.
w_{ij}  The weight of edge e_{ij}.
vol(G)  The volume of graph G, i.e., the degree sum in G.
G_{knn}^{(k)}  The k-NN graph with parameter k.
G_f  The fusion graph.
G_f^{(k)}  The fusion graph with parameter k.
𝒯  Encoding tree.
𝒯*  The optimal encoding tree.
λ  The root node of an encoding tree.
α  A non-root node of an encoding tree.
α^-  The parent node of α.
α^⟨i⟩  The i-th child of α.
T_λ  The label of λ, i.e., the node set V.
T_α  The label of α, i.e., a subset of V.
𝒱_α  The volume of T_α, i.e., the degree sum of the vertices in T_α.
g_α  The sum of weights of the cut edge set [T_α, T_λ \ T_α].
N(𝒯)  The number of non-root nodes in 𝒯.
H^𝒯(G)  The structural entropy of G under 𝒯.
H^K(G)  The K-dimensional structural entropy of G.
H^1(G)  The one-dimensional structural entropy of G.
H^𝒯(G; α)  The structural entropy of node α in 𝒯.
H^𝒯(G; (λ, α])  The structural entropy of the deduction from λ to α.
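For convenience, we recall how these notations combine in standard structural information theory. The formulas below restate the textbook definitions; we assume they agree with Eq. 1 and Eq. 7 referenced in the main text, which are not repeated in this appendix:

H^1(G) = -\sum_{v_i \in V} \frac{d(v_i)}{vol(G)} \log_2 \frac{d(v_i)}{vol(G)},

H^{\mathcal{T}}(G;\alpha) = -\frac{g_\alpha}{vol(G)} \log_2 \frac{\mathcal{V}_\alpha}{\mathcal{V}_{\alpha^-}}, \quad H^{\mathcal{T}}(G) = \sum_{\alpha \in \mathcal{T},\, \alpha \neq \lambda} H^{\mathcal{T}}(G;\alpha), \quad H^{K}(G) = \min_{\mathcal{T}:\, \mathrm{height}(\mathcal{T}) \leq K} H^{\mathcal{T}}(G).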

A.2. Dataset and Time Costs of  SE-GSL

Our framework SE-GSL is evaluated on nine graph datasets; their statistics are shown in Table 5, and a short sketch for reproducing the homophily column follows the dataset descriptions below. The time costs of SE-GSL on all datasets are reported in Table 6.

Table 5. Statistics of benchmark datasets.
Dataset  Nodes  Edges  Classes  Features  Homophily
Cora 2708 5429 7 1433 0.83
Citeseer 3327 4732 6 3703 0.71
Pubmed 19717 44338 3 500 0.79
PT 1912 31299 2 3169 0.59
TW 2772 63462 2 3169 0.55
Actor 7600 33544 5 931 0.24
Cornell 183 295 5 1703 0.30
Texas 183 309 5 1703 0.11
Wisconsin 251 499 5 1703 0.21
  • Citation networks (Yang et al., 2016; Welling and Kipf, 2016). Cora, Citeseer, and Pubmed are benchmark citation network datasets. Nodes represent papers, and edges represent citation relationships. The features are bag-of-words representations of the papers, and the labels denote their academic fields.

  • Social networks (Rozemberczki et al., 2021). TW and PT are two subsets of the Twitch Gamers dataset (Rozemberczki and Sarkar, 2021), designed for binary node classification, where nodes correspond to users and links to mutual friendships. The features describe the games liked, the location, and the streaming habits of the users. The labels denote whether a streamer uses explicit language; TW and PT correspond to the Taiwanese and Portuguese subsets, respectively.

  • WebKB networks (Getoor, 2005). Cornell, Texas, and Wisconsin are three subsets of WebKB, where nodes are web pages, and edges are hyperlinks. The features are the bag-of-words representation of pages. The labels denote categories of pages, including student, project, course, staff, and faculty.

  • Actor co-occurrence network (Tang et al., 2009). This dataset is a subgraph of the film-director-actor-writer network, in which nodes represent actors, edges represent co-occurrence relations, node features are keywords describing the actors, and labels are the types of actors.
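The homophily column in Table 5 can be sanity-checked with a few lines of code. The sketch below computes the edge-homophily ratio (the fraction of edges whose two endpoints share a label) on DGL's built-in citation datasets; the helper name edge_homophily is ours, and the exact homophily measure used in Table 5 is an assumption here, so small numerical deviations are expected.

import dgl
import torch

def edge_homophily(g: dgl.DGLGraph, labels: torch.Tensor) -> float:
    # Fraction of edges whose two endpoints share the same label.
    src, dst = g.edges()
    return (labels[src] == labels[dst]).float().mean().item()

for dataset in (dgl.data.CoraGraphDataset(), dgl.data.CiteseerGraphDataset(), dgl.data.PubmedGraphDataset()):
    g = dataset[0]
    print(dataset.name, round(edge_homophily(g, g.ndata["label"]), 2))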

A.3. Baselines

Baselines are briefly described as follows; for the GCN, GAT, GraphSAGE, and APPNP layers, we adopt the implementations from the DGL library (Wang et al., 2019), https://github.com/dmlc/dgl, and a minimal instantiation sketch follows this list:

  • GCN (Welling and Kipf, 2016) is the most popular GNN, which applies a first-order approximation of a localized spectral filter on graphs.

  • GAT (Veličković et al., 2017) introduces a self-attention mechanism to assign importance scores to different neighbor nodes.

  • GraphSAGE (Hamilton et al., 2017) is an inductive framework that leverages node features to generate embeddings by sampling and aggregating features from the local neighborhood.

  • APPNP (Gasteiger et al., 2019a) combines GCN with personalized PageRank.

  • GCNII (Chen et al., 2020a) employs residual connections and identity mapping (code: https://github.com/chennnM/GCNII).

  • Grand (Feng et al., 2020) proposes a random propagation strategy for data augmentation and uses consistency regularization for optimization (code: https://github.com/THUDM/GRAND).

  • Mixhop (Abu-El-Haija et al., 2019) aggregates information from mixed-order neighborhoods (code: https://github.com/samihaija/mixhop).

  • Geom-GCN (Pei et al., 2019) exploits geometric relationships to capture long-range dependencies within structural neighborhoods; three variants of Geom-GCN are used for comparison (code: https://github.com/graphdml-uiuc-jlu/geom-gcn).

  • GDC (Gasteiger et al., 2019b) refines the graph structure based on diffusion kernels (code: https://github.com/gasteigerjo/gdc).

  • GEN (Wang et al., 2021) estimates the underlying, more meaningful graph structure (code: https://github.com/BUPT-GAMMA/Graph-Structure-Estimation-Neural-Networks).

  • H2GCN (Zhu et al., 2020b) combines multi-hop neighbor embeddings to adapt to both heterophilous and homophilous graph settings (code: https://github.com/GemsLab/H2GCN).

  • DropEdge (Rong et al., 2019) randomly removes edges from the input graph to prevent over-fitting (code: https://github.com/DropEdge/DropEdge).

  • Jaccard (Wu et al., 2019) prunes edges connecting nodes with low Jaccard similarity (code: https://github.com/DSE-MSU/DeepRobust).
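As noted above, the GCN, GAT, GraphSAGE, and APPNP backbones are taken from DGL. The sketch below shows how such layers can be instantiated; the hidden size, number of attention heads, and APPNP propagation settings are illustrative assumptions rather than the hyperparameters used in our experiments.

import torch.nn as nn
from dgl.nn import APPNPConv, GATConv, GraphConv, SAGEConv

in_feats, hidden, n_classes = 1433, 64, 7   # illustrative sizes (e.g., Cora)

gcn_layers = nn.ModuleList([GraphConv(in_feats, hidden), GraphConv(hidden, n_classes)])
gat_layers = nn.ModuleList([GATConv(in_feats, hidden, num_heads=8),
                            GATConv(hidden * 8, n_classes, num_heads=1)])
sage_layers = nn.ModuleList([SAGEConv(in_feats, hidden, "mean"), SAGEConv(hidden, n_classes, "mean")])
appnp_prop = APPNPConv(k=10, alpha=0.1)     # propagation applied on top of an MLP predictor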

Table 6. Comparison of GPU training time (hours) required to achieve the best performance.
Method Cora Citeseer Pubmed PT TW Actor Cornell Texas Wisconsin
SE-GSLGCN 0.071 0.213 4.574 0.178 0.374 1.482 0.006 0.008 0.009
SE-GSLSAGE 0.074 0.076 4.852 0.169 0.214 0.817 0.006 0.007 0.009
SE-GSLGAT 0.071 0.180 4.602 0.172 0.329 1.273 0.006 0.008 0.009
SE-GSLAPPNP 0.073 0.215 4.854 0.138 0.379 1.367 0.010 0.011 0.013

A.4. Overall algorithm of SE-GSL

The overall algorithm of SE-GSL is shown in Algorithm 1. Note that if we choose to retain the connections from the previous iteration, a percentage of low-similarity edges in the reconstructed graph is dropped in each iteration, so that the number of edges remains stable during training; a sketch of this pruning step follows the pseudo-code.

Input: a graph G = (V, E), features X, labels Y_L, mode ∈ {True, False},
iterations η, encoding tree height K, hyperparameter θ
Output: optimized graph G′ = (V, E′), prediction Y_P, GNN parameters Θ
Initialize Θ;
for i = 1 to η do
    Update Θ by the classification loss ℒ_cls(Y_L, Y_P);
    Get node representations X′ = GNN(X);
    Initialize k = 1 for k-NN structuralization;
    Create the fusion graph G_f according to Algorithm 2;
    Create the K-dimensional encoding tree 𝒯* according to Algorithm 3;
    for each non-root node α in 𝒯* do
        Calculate H^{𝒯*}(G_f; (λ, α]) through Eq. 7;
        Assign probability P(α) to α through Eq. 8;
    for each subtree rooted at α in 𝒯* do
        Assuming α has n children, set t = θ × n;
        for j = 1 to t do
            Sample a node pair (v_m, v_n) according to § 3.4;
            Add edge e_mn to G′;
    if mode then
        Let E′ = E ∪ E′, where E′ and E are the edge sets of G′ and G, respectively;
        Drop a percentage of low-similarity edges in G′;
    Update graph structure G ← G′; update node representations X ← X′;
Get prediction Y_P;
Return G′, Y_P, and Θ;
Algorithm 1 Model training for SE-GSL
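To make the edge-retention note above concrete, the following is a minimal, self-contained sketch of the pruning step: a fixed fraction of reconstructed edges whose endpoints are least similar is dropped. The cosine similarity on node representations, the NumPy interface, and the default drop ratio are illustrative assumptions; the paper does not prescribe this exact form.

import numpy as np

def drop_low_similarity_edges(edges: np.ndarray, X: np.ndarray, drop_ratio: float = 0.1) -> np.ndarray:
    # Keep the (1 - drop_ratio) fraction of edges whose endpoint embeddings are most similar.
    # edges: (m, 2) array of node-index pairs; X: (n, d) node representations.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)   # row-normalize embeddings
    sim = (Xn[edges[:, 0]] * Xn[edges[:, 1]]).sum(axis=1)         # cosine similarity per edge
    keep = max(1, int(round(len(edges) * (1.0 - drop_ratio))))    # number of edges to retain
    return edges[np.argsort(-sim)[:keep]]                         # highest-similarity edges kept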

A.5. Algorithm of one-dimensional structural entropy guided graph enhancement

The k-selector is designed to choose an optimal k for k-NN structuralization, guided by one-dimensional structural entropy maximization. The k-selector and fusion graph construction procedure is shown in Algorithm 2; a runnable sketch follows the pseudo-code.

Input: a graph G = (V, E), node representations X
Output: fusion graph G_f
Calculate S ∈ ℝ^{|V|×|V|} via Eq. 4;
for k = 2 to |V| − 1 do
    Generate G_knn from S;
    Generate G_f^{(k)} = {V, E_f = E ∪ E_knn};
    Reweight G_f^{(k)} via Eq. 5;
    Calculate H^1(G_f^{(k)}) via Eq. 1;
    if H^1(G_f^{(k)}) reaches its maximum then
        G_f ← G_f^{(k)};
        Return G_f;
Algorithm 2 k-selector and fusion graph construction
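Algorithm 2 can be prototyped in a few lines of NumPy. The sketch below is a simplified, runnable approximation: cosine similarity stands in for Eq. 4, the fused graph is left unweighted (the Eq. 5 reweighting is omitted), the search is capped at k_max for brevity, and it stops at the first k where H^1 no longer increases. The function names are our own, not from the released code.

import numpy as np

def one_dim_entropy(adj: np.ndarray) -> float:
    # H^1(G) = - sum_i (d_i / vol(G)) * log2(d_i / vol(G)), skipping zero-degree nodes.
    deg = adj.sum(axis=1)
    vol = deg.sum()
    p = deg[deg > 0] / vol
    return float(-(p * np.log2(p)).sum())

def k_selector_fusion(adj: np.ndarray, X: np.ndarray, k_max: int = 30) -> np.ndarray:
    # Fuse the original adjacency with a k-NN graph built from node representations,
    # increasing k until the one-dimensional structural entropy stops improving.
    n = adj.shape[0]
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                                     # cosine similarity (assumed form of Eq. 4)
    np.fill_diagonal(S, -np.inf)                      # never pick a node as its own neighbour
    best_H, best_fused = -np.inf, adj
    for k in range(2, min(k_max, n - 1)):
        idx = np.argpartition(-S, k, axis=1)[:, :k]   # k most similar nodes per row
        knn = np.zeros_like(adj, dtype=float)
        knn[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
        fused = np.maximum(adj, np.maximum(knn, knn.T))   # E_f = E ∪ E_knn, symmetrized
        H = one_dim_entropy(fused)
        if H > best_H:
            best_H, best_fused = H, fused
        else:                                         # entropy stopped increasing: treat as the optimum
            break
    return best_fused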

A.6. Algorithm of high-dimensional structural entropy minimization

The pseudo-code of the high-dimensional structural entropy minimization algorithm is shown in Algorithm 3. A small helper for checking the entropy values that the algorithm greedily optimizes is sketched after the pseudo-code.

Input: a graph G = (V, E), the height of the encoding tree K > 1
Output: the optimal high-dimensional encoding tree 𝒯*
// Initialize an encoding tree 𝒯 with height 1 and root λ
Create root node λ;
for v_i ∈ V do
    Create node α_i and let T_{α_i} = {v_i};
    α_i^- = λ;
// Generation of the binary encoding tree
while λ has more than 2 children do
    Select α_i and α_j in 𝒯, conditioned on α_i^- = α_j^- = λ and argmax_{α_i, α_j} (H^𝒯(G) − H^𝒯_{CB(α_i, α_j)}(G));
    Execute CB(α_i, α_j) according to Definition 1;
// Squeezing of the encoding tree
while height(𝒯) > K do
    Select non-root nodes α and β in 𝒯, conditioned on α^- = β and argmax_{α, β} (H^𝒯(G) − H^𝒯_{LF(α, β)}(G));
    Execute LF(α, β) according to Definition 2;
Return 𝒯* ← 𝒯;
Algorithm 3 K-dimensional structural entropy minimization
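Both operators in Algorithm 3 are chosen greedily by the reduction in H^𝒯(G) they achieve. For verifying such entropy values on small graphs, the snippet below computes H^𝒯(G) for a height-2 encoding tree whose leaves are grouped into the given communities, directly from the glossary definitions. It assumes an unweighted adjacency matrix without self-loops and is an illustrative helper, not the paper's implementation.

import numpy as np

def structural_entropy_two_level(adj: np.ndarray, communities: list[list[int]]) -> float:
    # Height-2 encoding tree: root lambda -> one tree node per community -> leaf per vertex.
    # Each non-root node alpha contributes -(g_alpha / vol(G)) * log2(V_alpha / V_parent).
    deg = adj.sum(axis=1)
    vol = deg.sum()
    H = 0.0
    all_nodes = np.arange(adj.shape[0])
    for comm in communities:
        comm = np.asarray(comm)
        outside = np.setdiff1d(all_nodes, comm)
        V_alpha = deg[comm].sum()                       # volume of T_alpha
        g_alpha = adj[np.ix_(comm, outside)].sum()      # cut weight [T_alpha, T_lambda \ T_alpha]
        if V_alpha > 0:
            H -= g_alpha / vol * np.log2(V_alpha / vol)          # community-level term
            d = deg[comm][deg[comm] > 0]
            H -= (d / vol * np.log2(d / V_alpha)).sum()          # leaf-level terms (no self-loops assumed)
    return float(H)

# Example: a 4-node path graph split into two communities.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(structural_entropy_two_level(A, [[0, 1], [2, 3]]))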