Heterogeneous Directed Hypergraph Neural Network over abstract syntax tree (AST) for Code Classification
DOI reference number: 10.18293/SEKE2023-136
Abstract
Code classification is a difficult issue in program understanding and automatic coding. Due to the elusive syntax and complicated semantics in programs, most existing studies use techniques based on abstract syntax trees (AST) and graph neural networks (GNN) to create code representations for code classification. These techniques utilize the structure and semantic information of the code, but they only take into account pairwise associations and neglect the high-order correlations that already exist between nodes in the AST, which may result in the loss of code structural information. On the other hand, while a general hypergraph can encode high-order data correlations, it is homogeneous and undirected, which results in a lack of semantic and structural information such as node types, edge types, and the directions between child nodes and parent nodes when modeling the AST. In this study, we propose to represent the AST as a heterogeneous directed hypergraph (HDHG) and process the graph with a heterogeneous directed hypergraph neural network (HDHGN) for code classification. Our method improves code understanding and can represent high-order data correlations beyond paired interactions. We assess the HDHGN on public datasets of Python and Java programs. Our method outperforms previous AST-based and GNN-based methods, which demonstrates the capability of our model.
Index Terms:
hypergraph, heterogeneous graph, code classification, graph neural networks, program understanding
I Introduction
With the advancement of modern computer software, how to learn from vast open-source code repositories to enhance software development has become an essential research topic. In recent years, source code processing, which tries to help computers automatically comprehend and analyze source code, has received a lot of attention. Several tasks have been studied, including code classification [1, 2, 3, 4, 5, 6], method name prediction [3, 4, 7, 8], code summarization [3, 9, 10], and code clone detection [5, 11, 12].
Due to the improvement of machine learning technology, particularly deep learning, more and more work has employed deep learning for code classification. Currently, there are two main categories of code classification methods: AST-based and GNN-based. To take advantage of the semantic and structural information of the source code, several studies adopt the AST when learning code representations [1, 5, 7, 8]. Other research uses graph neural networks (GNN) built on the AST to create code representations for code classification and to better understand the structure of code [3, 4, 13, 14].
Although these AST-based and graph-based techniques employ the structural information of source code and demonstrate their effectiveness, they only take into account pairwise relationships and ignore the possible high-order correlations between AST nodes. For example, when code is parsed into an AST, each parent AST node has child AST nodes belonging to various fields, also called attributes. A parent node may have several child nodes under the same field, and these nodes have high-order correlations with one another. Fig. 1 depicts an example of a Python code snippet. The corresponding AST generated by the official Python ast module (https://docs.python.org/3/library/ast.html) is illustrated in Fig. 2(a). As we can see, “Module” is the root node of the AST. It has two child nodes, “Assign” and “Expr,” which belong to the field named “body.” When modeling the correlations between the three nodes, previous approaches only consider the pairwise relationships, i.e., the pair of “Module” and “Assign” and the pair of “Module” and “Expr,” as demonstrated in Fig. 3(a). Dismissing the high-order correlation that “Assign” and “Expr” both belong to the “body” of “Module,” as shown in Fig. 3(b), may result in the loss of code structural information.
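To make this concrete, the following minimal sketch shows the one-to-many “body” field in the official ast module; the two-statement program here is an illustrative stand-in for the code in Fig. 1, not the paper's exact example:

```python
import ast

# An illustrative two-statement program: an assignment followed by a call.
tree = ast.parse("x = 1\nprint(x)")

# The root "Module" node holds BOTH statements in its single "body" field:
# one one-to-many correlation rather than two independent pairwise edges.
print([type(child).__name__ for child in tree.body])  # ['Assign', 'Expr']
```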
In recent years, the hypergraph, which can encode high-order data correlations, has drawn a lot of interest. Considering the outstanding performance of hypergraphs in graph classification [15, 16], we introduce hypergraphs into code classification. On the other hand, a general hypergraph is homogeneous and undirected, i.e., it only has one type of node and one type of edge, and its hyperedges are undirected. If we represent the AST with a general hypergraph, it will result in a lack of semantic and structural information such as field names and directions. As illustrated in Fig. 4(a), a typical hypergraph has neither the field name nor the direction that shows which is the parent node and which is the child node.
To tackle the above problems, we propose a heterogeneous directed hypergraph (HDHG) to model the AST and a heterogeneous directed hypergraph neural network (HDHGN) for code classification. First, we propose to use heterogeneous directed hyperedges to express the relationships of AST nodes; the example of Fig. 3 is shown in Fig. 4(b). Second, we combine deep learning techniques from hypergraph neural networks and heterogeneous graph neural networks to create the HDHGN, and we also add operations to process directed hyperedges.
We evaluate our method on the public datasets Python800 and Java250 [17]. Our model achieves 97.87% accuracy on Python800 and 96.42% on Java250, outperforming previous state-of-the-art AST-based and GNN-based work. Our study demonstrates the utility of the HDHG and HDHGN for code classification.
The main contributions of this paper are:
• We propose an HDHG to depict the AST.
• We propose an HDHGN to generate vector representations for code classification.
• We assess our model on public datasets and compare it with previous SOTA AST-based and graph-based methods.
II Related Work
Code classification is to classify codes based on their functions. Different from natural language, code has structural information, so several works adopt the AST through various techniques. Mou et al. [1] present one of the first works to apply a tree-based convolutional neural network (TBCNN) to code classification. Alon et al. propose code2seq [7] and code2vec [8] to deconstruct code into a collection of paths in its AST. J. Zhang et al. [5] propose a novel neural network called ASTNN for source code representation, applied to code classification and clone detection. N. D. Q. Bui et al. [18] propose TreeCaps, which fuses capsule networks with TBCNN for code classification.
With the popularity of GNNs, more works apply various kinds of GNNs to code classification based on the AST to strengthen the comprehension of code structures. M. Allamanis et al. [13] first construct graphs from source code by adding edges such as control flow and data flow to the AST and employ a gated graph neural network (GGNN) to process the program graphs. V. Hellendoorn et al. [19] propose a model called GREAT, based on the transformer architecture, which extracts global relational information from code graphs. D. Vagavolu et al. [3] propose an approach that can extract and use program features from multiple code graphs. M. Lu et al. [6] improve the GGNN for program classification. T. Long [14] proposes a multi-view graph program representation method that combines data flow and control flow as multiple views and applies GNNs to process them. W. Wang et al. [4] propose to leverage heterogeneous graphs to represent code based on previous work and adopt a heterogeneous GNN to process them.
III Preliminary
In this section, we introduce some fundamental background ideas, such as hypergraph, heterogeneous graph, and AST.
III-A Hypergraph
In an ordinary graph, an edge connects exactly two vertices. Different from general graphs, an edge of a hypergraph [20] can link any number of vertices. Formally, a hypergraph is a pair $G = (V, E)$ where $V$ is a set of elements called nodes or vertices, and $E$ is a set of non-empty subsets of $V$ called hyperedges or links.
A directed hypergraph [21] is a hypergraph with directed hyperedges. A directed hyperedge or hyperarc is an ordered pair, $e = (T(e), H(e))$, of (possibly empty) disjoint subsets of vertices; $T(e)$ is the tail of $e$ while $H(e)$ is its head. A backward hyperarc, or simply B-arc, is a hyperarc with $|H(e)| = 1$. A forward hyperarc, or simply F-arc, is a hyperarc with $|T(e)| = 1$. A hypergraph whose hyperarcs are B-arcs is known as a B-graph (or B-hypergraph). A hypergraph whose hyperarcs are F-arcs is known as an F-graph or F-hypergraph.
In our study, since the child nodes of an AST point to their parent node and each child node has only one parent node, our HDHG is a B-hypergraph.
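As a minimal sketch, the B-arcs used in the rest of the paper can be pictured as the following Python structure (the names BArc, tail, head, and edge_type are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BArc:
    """A backward hyperarc: many tail vertices, exactly one head vertex."""
    tail: frozenset      # child-node ids T(e); any number of vertices
    head: int            # single parent-node id, since |H(e)| = 1
    edge_type: str       # AST field name, e.g. "body"

# "Assign" (id 1) and "Expr" (id 2) both point to "Module" (id 0) via "body".
body_arc = BArc(tail=frozenset({1, 2}), head=0, edge_type="body")
```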
III-B Heterogeneous Graph
A heterogeneous graph [22] is a graph consisting of multiple types of entities or nodes and multiple types of links or edges. A heterogeneous graph is represented as $G = (V, E)$, consisting of an entity set $V$ and a link set $E$. A heterogeneous graph is also associated with a node type mapping function $\phi: V \to A$ and a link type mapping function $\psi: E \to R$. $A$ and $R$ represent the sets of predefined object types and link types, where $|A| + |R| > 2$.
III-C Abstract Syntax Tree
The AST represents the source code’s abstract syntax structure. The code compiler parses the code into an AST through program syntax and semantic analysis. Each node on the tree represents a structure in the source code and belongs to a particular AST node type. Each AST node has zero, one, or several fields that can be thought of as the node’s attributes. Each field may hold none, one, or a list of objects such as AST nodes, numbers, and strings. If one AST node contains a field holding another AST node, the latter is the former’s child AST node.
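For instance, the official Python ast module exposes these fields directly; a brief sketch:

```python
import ast

assign = ast.parse("x = 1").body[0]  # the "Assign" node

# Each field holds zero, one, or a list of objects.
for field, value in ast.iter_fields(assign):
    print(field, type(value).__name__)
# targets list          -- a list of child AST nodes
# value Constant        -- a single child AST node
# type_comment NoneType -- an empty field
```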
IV Methodology
We first convert the code snippet into an AST and construct an HDHG based on it, then feed it into our HDHGN. Once we obtain the vector representation of every node from the network, we combine them into a single vector representation for code classification. The overview of our model is demonstrated in Fig. 5.
IV-A Heterogeneous Directed Hypergraph
We parse the code snippet into an AST with a code compiler, then we build the HDHG based on the AST. We set each node of the AST as an “AST” node and each identifier as an “identifier” node in the HDHG. We set the value of an “AST” node to its AST node type name, set the value of an “identifier” node to its content, and treat them as two different types of nodes. Each field is configured as a directed hyperedge. If one node has a field including another node, the latter node belongs to the tail of the field’s hyperedge, and the former is the head of the field’s hyperedge. We designate the field name as the type of the hyperedge. The HDHG of the AST in Fig. 2 is shown in Fig. 6.
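The construction can be sketched as follows for Python; it assumes that string and numeric field values become “identifier” nodes, and the function and variable names are illustrative:

```python
import ast

def build_hdhg(code: str):
    """Parse code and return (nodes, hyperedges).

    nodes:      list of (node_type, value) pairs, e.g. ("AST", "Module")
    hyperedges: list of (edge_type, tail_ids, head_id) B-arcs, children -> parent
    """
    nodes, hyperedges = [], []

    def visit(node: ast.AST) -> int:
        nid = len(nodes)
        nodes.append(("AST", type(node).__name__))        # value = AST type name
        for field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            tails = []
            for child in children:
                if isinstance(child, ast.AST):
                    tails.append(visit(child))            # nested AST node
                elif isinstance(child, (str, int, float, bytes)):
                    tails.append(len(nodes))              # identifier/constant leaf
                    nodes.append(("identifier", str(child)))
            if tails:                                     # one hyperedge per field
                hyperedges.append((field, tails, nid))
        return nid

    visit(ast.parse(code))
    return nodes, hyperedges

nodes, edges = build_hdhg("x = 1\nprint(x)")
print(edges[-1])  # ('body', [1, 7], 0): both statements share one "body" hyperedge
```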
IV-B Heterogeneous Directed Hypergraph Neural Network
IV-B1 Definition
Let an HDHG be $G = (V, E)$, which includes a node set $V$ and a directed hyperedge set $E$. Each node $n_i = (\tau_i, v_i) \in V$, where $\tau_i$ represents the node type and $v_i$ is the value of the node. Each directed hyperedge $e_j = (\rho_j, T(e_j), H(e_j)) \in E$, where $\rho_j$ represents the edge type, $T(e_j)$ is the set of tail nodes of hyperedge $e_j$, and $H(e_j)$ is the head node of hyperedge $e_j$; the direction of the hyperedge is from $T(e_j)$ to $H(e_j)$.
IV-B2 Feature initialization
According to the value $v_i$ and the category $\tau_i$ of node $n_i$, we obtain the embedding vector $x_i \in \mathbb{R}^{d_e}$ by an embedding function as (1), where $d_e$ is the dimension size of the embedding vector.

$x_i = \mathrm{Embedding}(v_i, \tau_i)$  (1)
To put embedding vectors of various types into the same vector space, we make a linear projection to obtain the initial feature vector $h_i^{(0)} \in \mathbb{R}^{d_h}$ of node $n_i$ based on the corresponding node type as (2), where $d_h$ is the dimension size of the feature vector and hidden vector.

$h_i^{(0)} = W_{\tau_i} x_i + b_{\tau_i}$  (2)
We also obtain the embedding vector $t_j \in \mathbb{R}^{d_h}$ of hyperedge $e_j$ according to its edge type $\rho_j$ as (3).

$t_j = \mathrm{Embedding}(\rho_j)$  (3)
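A PyTorch sketch of (1)-(3), assuming a value vocabulary has been built per node type; the class name, method names, and the identifier vocabulary size are illustrative (93 AST node types and 58 edge types follow Table I for Python800):

```python
import torch
import torch.nn as nn

class FeatureInit(nn.Module):
    """Sketch of (1)-(3): value embeddings per node type, then a per-type
    linear projection into a shared hidden space."""

    def __init__(self, vocab_sizes: dict, num_edge_types: int, d_e: int = 128, d_h: int = 128):
        super().__init__()
        # (1): one embedding table per node type ("AST", "identifier")
        self.embed = nn.ModuleDict({t: nn.Embedding(n, d_e) for t, n in vocab_sizes.items()})
        # (2): one linear projection per node type, into the shared d_h space
        self.proj = nn.ModuleDict({t: nn.Linear(d_e, d_h) for t in vocab_sizes})
        # (3): embedding table over edge types (AST field names)
        self.edge_embed = nn.Embedding(num_edge_types, d_h)

    def init_node(self, node_type: str, value_id: torch.Tensor) -> torch.Tensor:
        x = self.embed[node_type](value_id)   # x_i, eq. (1)
        return self.proj[node_type](x)        # h_i^(0), eq. (2)

init = FeatureInit({"AST": 93, "identifier": 10000}, num_edge_types=58)
h0 = init.init_node("AST", torch.tensor(0))
```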
IV-B3 Heterogeneous Directed Hypergraph Convolution Layer
Our model updates each node vector in one heterogeneous directed hypergraph convolution (HDHGConv) layer. We follow the two-stage message-passing framework of hypergraph neural networks [15], which has two steps: aggregating messages from nodes to hyperedges and aggregating messages from hyperedges to nodes. In addition, we add operations that incorporate heterogeneous information and direction information.
Aggregating messages from nodes to hyperedges: First, the hidden vector $h_i^{(l-1)}$ of each node $n_i$ is multiplied by the head matrix $W_{head}^{(l)}$ or the tail matrix $W_{tail}^{(l)}$ to get the message vector $m_{i \to j}^{(l)}$ from node $n_i$ to hyperedge $e_j$ as (4), where $l = 1, \dots, L$ indicates the layer number and $L$ is the total number of layers.

$m_{i \to j}^{(l)} = \begin{cases} W_{head}^{(l)} h_i^{(l-1)}, & n_i \in H(e_j) \\ W_{tail}^{(l)} h_i^{(l-1)}, & n_i \in T(e_j) \end{cases}$  (4)
Each directed hyperedge gathers messages from its tail nodes and head node by the transformer attention mechanism [23]. For each message, the attention score is formulated as (5), where $t_j$ is the edge type vector.

$\alpha_{ij}^{(l)} = \underset{n_i \in e_j}{\mathrm{softmax}}\left(\frac{\left(W_q^{(l)} t_j\right)^{\top} W_k^{(l)} m_{i \to j}^{(l)}}{\sqrt{d_h}}\right)$  (5)
We obtain the vector $\hat{z}_j^{(l)}$ of directed hyperedge $e_j$ as (6).

$\hat{z}_j^{(l)} = \sum_{n_i \in e_j} \alpha_{ij}^{(l)} W_v^{(l)} m_{i \to j}^{(l)}$  (6)
Then we add the edge type vector $t_j^{(l)}$ to $\hat{z}_j^{(l)}$ as (7), where $t_j^{(l)}$ is formed as (8).

$z_j^{(l)} = \hat{z}_j^{(l)} + t_j^{(l)}$  (7)

$t_j^{(l)} = W_e^{(l)} t_j + b_e^{(l)}$  (8)
Aggregating messages from hyperedges to nodes: For each directed hyperedge $e_j$ whose head node or tail node is $n_i$, the vector $z_j^{(l)}$ is linearly projected by (9) to get the message $m_{j \to i}^{(l)}$, which is sent to $n_i$.

$m_{j \to i}^{(l)} = W_m^{(l)} z_j^{(l)} + b_m^{(l)}$  (9)
Same as before, we aggregate the messages to get $\hat{h}_i^{(l)}$ by the transformer attention mechanism.

$\beta_{ij}^{(l)} = \underset{e_j \ni n_i}{\mathrm{softmax}}\left(\frac{\left(W_{q'}^{(l)} h_i^{(l-1)}\right)^{\top} W_{k'}^{(l)} m_{j \to i}^{(l)}}{\sqrt{d_h}}\right)$  (10)

$\hat{h}_i^{(l)} = \sum_{e_j \ni n_i} \beta_{ij}^{(l)} W_{v'}^{(l)} m_{j \to i}^{(l)}$  (11)
Last, we update the hidden vector of node $n_i$ by (12), where $\sigma$ is the ELU activation function and GraphNorm is graph normalization [24].

$h_i^{(l)} = \mathrm{GraphNorm}\left(\sigma\left(\hat{h}_i^{(l)}\right) + h_i^{(l-1)}\right)$  (12)
The $W$ matrices above are all weight matrices, and the $b$ terms above are all bias vectors; both are learned during training. All of the attention mechanisms mentioned above use multiple heads.
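A minimal single-head PyTorch sketch of one HDHGConv layer follows, written against the notation of (4)-(12). It is illustrative, not the paper's implementation: the query/key projections of (5) and (10) are folded away, GraphNorm is omitted, and plain Python loops stand in for torch_geometric scatter operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HDHGConv(nn.Module):
    """One heterogeneous directed hypergraph convolution layer (sketch)."""

    def __init__(self, d: int):
        super().__init__()
        self.w_tail = nn.Linear(d, d, bias=False)   # tail-node message matrix, eq. (4)
        self.w_head = nn.Linear(d, d, bias=False)   # head-node message matrix, eq. (4)
        self.w_edge = nn.Linear(d, d)               # edge-type projection, eq. (8)
        self.w_msg = nn.Linear(d, d)                # hyperedge-to-node message, eq. (9)
        self.d = d

    def forward(self, h, t, edges):
        # h: (N, d) node states; t: (E, d) edge-type vectors
        # edges: list of (tail_ids: list[int], head_id: int), one per hyperedge
        z = []
        for j, (tails, head) in enumerate(edges):
            # Stage 1: nodes -> hyperedge, attention keyed by the edge-type vector.
            msgs = torch.cat([self.w_tail(h[tails]), self.w_head(h[head]).unsqueeze(0)])
            att = F.softmax(msgs @ t[j] / self.d ** 0.5, dim=0)          # eq. (5)
            z.append(att @ msgs + self.w_edge(t[j]))                     # eqs. (6)-(8)
        z = torch.stack(z)
        # Stage 2: hyperedges -> nodes.
        out = h.clone()
        for i in range(h.size(0)):
            incident = [j for j, (tails, head) in enumerate(edges)
                        if i == head or i in tails]
            if incident:
                m = self.w_msg(z[incident])                              # eq. (9)
                att = F.softmax(m @ h[i] / self.d ** 0.5, dim=0)         # eq. (10)
                out[i] = m.t() @ att                                     # eq. (11)
        return F.elu(out) + h  # residual update; GraphNorm omitted, eq. (12)
```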
IV-C Classification
When we obtain the node hidden vectors $h_i^{(L)}$ from the last layer, we utilize attention pooling to aggregate the information of each node into the vector representation $r$ as (13) and (14), where $q$ is a learnable vector.

$\gamma_i = \underset{n_i \in V}{\mathrm{softmax}}\left(q^{\top} h_i^{(L)}\right)$  (13)

$r = \sum_{n_i \in V} \gamma_i\, h_i^{(L)}$  (14)
To obtain the final classification prediction, we use an MLP, which is expressed as (15).

$\hat{y} = \mathrm{softmax}\left(\mathrm{MLP}(r)\right)$  (15)
The attention mechanism above is also multi-head. We employ the standard cross-entropy loss function for training.
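A compact single-head sketch of (13)-(15); the class name and the MLP depth are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionReadout(nn.Module):
    """Attention pooling over final node states, followed by an MLP classifier."""

    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.q = nn.Parameter(torch.randn(d))   # learnable pooling query, eq. (13)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_classes))

    def forward(self, h):                        # h: (N, d) last-layer node vectors
        gamma = torch.softmax(h @ self.q, dim=0) # eq. (13): one weight per node
        r = gamma @ h                            # eq. (14): weighted sum -> graph vector
        return self.mlp(r)                       # eq. (15): logits
```

In training, pairing the returned logits with torch.nn.CrossEntropyLoss applies the softmax of (15) implicitly, matching the cross-entropy objective described above.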
V Evaluation
We implement our code with torch_geometric (https://pytorch-geometric.readthedocs.io/en/latest/index.html). Our implementation is available at https://github.com/qiankunmu/HDHGN.
V-A Datasets
We use Python800 and Java250 to train and assess our model. The two public datasets are from Project CodeNet [17] and were obtained by downloading submissions from two online judge websites: AIZU Online Judge and AtCoder. The code snippets are classified by problem. The statistics of the datasets are depicted in Table I. To be clear, “AST node types” means AST types such as “Module” and “Assign,” as distinct from the node types in the HDHG, i.e., “AST” and “identifier.” “Edge types” means the field names, also called attributes, in the AST. We randomly split each dataset into training, validation, and test sets by 6:2:2 (sketched below, after Table I).
TABLE I: Statistics of the datasets

| | Python800 | Java250 |
|---|---|---|
| Size | 240000 | 75000 |
| Labels | 800 | 250 |
| Avg. nodes | 202.05 | 216.43 |
| Avg. edges | 187.25 | 198.51 |
| AST node types | 93 | 58 |
| Edge types | 58 | 61 |
| Language | Python | Java |
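The 6:2:2 random split mentioned above can be sketched as follows (the seed and function name are illustrative):

```python
import random

def split(dataset, seed: int = 0):
    """Random 6:2:2 split into train/validation/test index lists."""
    idx = list(range(len(dataset)))
    random.Random(seed).shuffle(idx)
    a, b = int(0.6 * len(idx)), int(0.8 * len(idx))
    return idx[:a], idx[a:b], idx[b:]
```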
V-B Baselines
We compare our model with the AST-based and GNN-based techniques that achieve the best performance in code classification, including TBCNN [1], TreeCaps [18], GGNN [13], GREAT [19], and HPG+HGT [4]. TBCNN uses a tree-based convolutional neural network to extract features from the AST. TreeCaps combines TBCNN and capsule networks. GGNN adds edges such as control flow and data flow to the AST and processes the resulting graphs with a gated graph neural network. GREAT extracts global relational information from code graphs based on the transformer architecture. HPG+HGT represents code as a heterogeneous graph and processes it with a heterogeneous graph transformer. We also train a GCN [25] and a GIN [26] for comparison.
V-C Experiment settings
We use the parser from the official Python 3.8 ast library and the javalang library (https://pypi.org/project/javalang/) to parse the code snippets into ASTs. The embedding vectors are produced by random initialization and learned via training. We set the number of layers in our model to four. The hidden vector dimension size and embedding vector dimension size were both set to 128. We use a narrow multi-head attention mechanism [23] and set the number of heads to eight. We employed the Adam optimizer to train our model and set the dropout rate to 0.2. We tuned the hyper-parameters of the baselines for their best performance on the validation set. The models were trained for 100 epochs, and we saved the models that performed best on the validation set.
V-D Results
We use the performance of the models on the test set as the outcome and select classification accuracy as the metric. We repeat each experiment five times and report the mean accuracy and standard deviation. The results are depicted in Table II. Our HDHGN outperforms the other methods on both datasets. On Python800, our HDHGN is 2.88% higher than the best baseline. On Java250, our model outperforms the baseline models by at least 2.47%. This demonstrates that our model utilizes the semantic and structural features of the code AST more effectively than previous approaches.
TABLE II: Classification accuracy (%) on Python800 and Java250 (mean and standard deviation over five runs)

| | Models | Python800 (mean) | (sd) | Java250 (mean) | (sd) |
|---|---|---|---|---|---|
| AST-based | TBCNN | 91.16 | | 90.47 | |
| | TreeCaps | 90.33 | | 91.38 | |
| GNN-based | GCN | 91.94 | | 90.01 | |
| | GIN | 93.23 | | 90.72 | |
| | GGNN | 90.13 | | 89.97 | |
| | GREAT | 93.33 | | 93.17 | |
| | HPG+HGT* | 94.99 | - | 93.95 | - |
| Ours | HDHGN | 97.87 | | 96.42 | |

*We use the outcomes reported in their research because the paper did not make their code publicly available.
TABLE III: Ablation study results on Python800 (%)

| Variant | mean | sd |
|---|---|---|
| HDHGN | 97.87 | |
| - hyperedge | 94.79 | |
| - heterogeneous information | 95.23 | |
| - direction | 95.49 | |
V-E Ablation study
We perform ablation studies of our HDHGN on Python800. We consider three variants, described below.
V-E1 - hyperedge
We eliminate hyperedges from our model, leaving only pairwise, or ordinary, edges in the graph. Each original hyperedge is decomposed into several ordinary edges, as sketched below.
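A sketch of this decomposition, reusing the (edge_type, tails, head) triples from the construction sketch in Section IV-A:

```python
def decompose(hyperedges):
    """'- hyperedge' variant: replace each B-arc (tails -> head)
    with ordinary pairwise child -> parent edges."""
    return [(tail, head) for _, tails, head in hyperedges for tail in tails]

# One 'body' hyperedge {Assign, Expr} -> Module becomes two pairwise edges.
print(decompose([("body", [1, 2], 0)]))  # [(1, 0), (2, 0)]
```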
V-E2 - heterogeneous information
We eliminate heterogeneous information from our model, which entails treating identifier nodes and AST nodes as a single node type and removing the edge type information.
V-E3 - direction
We remove direction information from our model; the hyperedges become undirected and no longer differentiate head nodes from tail nodes.
We also repeat each experiment five times and compute the mean accuracy and standard deviation. The outcomes are depicted in Table III. Removing hyperedges decreases accuracy by 3.08%, which demonstrates that high-order data correlations between AST nodes are indeed useful for comprehending programs. Removing heterogeneous information reduces accuracy by 2.64%; heterogeneous information carries much of the semantic information that helps program understanding. Removing direction causes a drop of 2.38%; direction indicates whether the nodes connected by a hyperedge are parent or child nodes, helping the model capture structural information. These outcomes demonstrate that our model obtains a better understanding of the AST structure and acquires more precise results in code classification by considering high-order data correlations, heterogeneous information, and direction information.
VI Conclusion
In this study, we propose an HDHGN for code classification. To encode the possible high-order data correlations between nodes in the AST, we introduce hypergraphs. Because a general hypergraph is homogeneous and undirected, which results in a lack of semantic and structural information, we propose to represent the AST as a heterogeneous directed hypergraph. We create an HDHGN accordingly to utilize high-order data correlations, heterogeneous information, and direction information better than previous methods. We test our model on public Python and Java datasets and compare the results with SOTA AST-based and GNN-based baselines. The experiments demonstrate that our HDHGN outperforms the baselines. A further ablation study shows that each component of the HDHGN enhances the performance of code classification.
Presently, the hypergraphs we produce are large and contain many nodes and edges. Future research will focus on ways to scale down hypergraphs for modeling ASTs and enhance the current hypergraph model to make it more effective at classifying code.
Acknowledgment
This work was supported by the Open Research Fund of NPPA Key Laboratory of Publishing Integration Development, ECNUP.
References
- [1] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural networks over tree structures for programming language processing,” in AAAI, 2016.
- [2] S. Gilda, “Source code classification using neural networks,” 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 1–6, 2017.
- [3] D. Vagavolu, K. C. Swarna, and S. Chimalakonda, “A mocktail of source code representations,” in ASE, 2021.
- [4] W. Wang, K. Zhang, G. Li, and Z. Jin, “Learning to represent programs with heterogeneous graphs,” 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC), pp. 378–389, 2022.
- [5] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, “A novel neural source code representation based on abstract syntax tree,” 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794, 2019.
- [6] M. Lu, Y. Wang, D. Tan, and L. Zhao, “Student program classification using gated graph attention neural network,” IEEE Access, vol. 9, pp. 87857–87868, 2021.
- [7] U. Alon, S. Brody, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” in ICLR, 2019.
- [8] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: learning distributed representations of code,” Proceedings of the ACM on Programming Languages, vol. 3, pp. 1 – 29, 2019.
- [9] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, “Deep code comment generation,” 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pp. 200–20010, 2018.
- [10] S. Liu, Y. Chen, X. Xie, J. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid gnn,” in ICLR, 2021.
- [11] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “Deckard: Scalable and accurate tree-based detection of code clones,” 29th International Conference on Software Engineering (ICSE’07), pp. 96–105, 2007.
- [12] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in ASE, 2016.
- [13] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” in ICLR, 2018.
- [14] T. Long, Y. Xie, X. Chen, W. Zhang, Q. Cao, and Y. Yu, “Multi-view graph representation for programming language processing: An investigation into algorithm detection,” in AAAI, 2022.
- [15] J. Huang and J. Yang, “Unignn: a unified framework for graph and hypergraph neural networks,” in IJCAI, 2021.
- [16] Y. Feng, H. You, Z. Zhang, R. Ji, and Y. Gao, “Hypergraph neural networks,” in AAAI Conference on Artificial Intelligence, 2018.
- [17] R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker, V. Thost, L. Buratti, S. Pujar, S. Ramji, U. Finkler, S. Malaika, and F. Reiss, “Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks,” 2021.
- [18] N. D. Q. Bui, Y. Yu, and L. Jiang, “Treecaps: Tree-based capsule networks for source code processing,” in AAAI, 2021.
- [19] V. J. Hellendoorn, C. Sutton, R. Singh, P. Maniatis, and D. Bieber, “Global relational models of source code,” in ICLR, 2020.
- [20] C. Berge, “Graphs and hypergraphs,” 1973.
- [21] G. Gallo, G. Longo, and S. Pallottino, “Directed hypergraphs and applications,” Discret. Appl. Math., vol. 42, pp. 177–201, 1993.
- [22] Y. Sun and J. Han, “Mining heterogeneous information networks: Principles and methodologies,” in Mining Heterogeneous Information Networks: Principles and Methodologies, 2012.
- [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp. 5998–6008, 2017.
- [24] T. Cai, S. Luo, K. Xu, D. He, T.-Y. Liu, and L. Wang, “Graphnorm: A principled approach to accelerating graph neural network training,” in ICML, 2021.
- [25] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
- [26] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?,” in ICLR, 2019.