
Unified Abstract Syntax Tree Representation Learning for Cross-Language Program Classification

Kesu Wang1, Meng Yan2, He Zhang1∗, Haibo Hu2
1State Key Lab for Novel Software Technology, Nanjing University, Nanjing, China (∗corresponding author)
2School of Big Data & Software Engineering, Chongqing University, Chongqing, China
[email protected], [email protected], [email protected], [email protected]
Abstract.

Program classification can be regarded as a high-level abstraction of code. It lays a foundation for various tasks related to source code comprehension and has a wide range of applications in software engineering, such as code clone detection, code smell classification, and defect classification. Cross-language program classification enables code transfer between different programming languages and promotes cross-language code reuse, thereby helping developers write code quickly and reducing the development time of code migration. Most existing studies focus on learning code semantics within a single language, whilst few are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract the semantic features of different programming languages. To cope with this difficulty, we propose a Unified Abstract Syntax Tree (UAST for short) neural network. The core idea of UAST consists of two unified mechanisms. First, UAST learns an AST representation by unifying the AST traversal sequence and the graph-like AST structure to capture semantic code features. Second, we construct a mechanism called the unified vocabulary, which reduces the feature gap between different programming languages and thereby enables cross-language program classification. Besides, we collect a dataset containing 20,000 files in five programming languages, which can serve as a benchmark for the cross-language program classification task. We conduct experiments on two datasets, and the results show that our proposed approach outperforms the state-of-the-art baselines in terms of four evaluation metrics (Precision, Recall, F1-score, and Accuracy).

Program Comprehension, Program Classification, Code Representation Learning, Cross-language Program Classification
ICPC '22: 30th International Conference on Program Comprehension, May 16–17, 2022, Virtual Event, USA. Journal year: 2022. DOI: 10.1145/3524610.3527915. ISBN: 978-1-4503-9298-3/22/05.

1. Introduction

Program classification aims to automatically classify programs according to their functions or semantics. As one of the primary means of facilitating program comprehension, it has been widely employed in a variety of tasks such as code clone detection (Baker, 1993; Bellon et al., 2007; Börstler, 1995), defect detection and identification (Wang et al., 2016; Li et al., 2017b), and code search (Kim et al., 2018). Although the benefits of program classification are generally recognized and valued, it has long relied on manual classification, which is time-consuming and error-prone. In recent years, more and more researchers have paid attention to source code-oriented program comprehension, as source code is the most natural representation of a program. Moreover, source code is well structured, making it ideal for automating program comprehension.

Cross-language program classification asks how programs written in different programming languages can be classified by their functions according to the structure and semantics of their code. For example, although "quick sort" written in C++ and "bubble sort" written in Python differ in code structure and implementation logic, both are sorting algorithms in essence, so they should be classified under the same function. Consequently, cross-language program classification can reduce the time needed to implement the same functionality in different languages and can promote cross-language code reuse, thereby helping developers write code quickly and reducing the development time of code migration.

However, it is challenging to capture code semantics efficiently and accurately. Moreover, since different programming languages have different grammatical rules and coding conventions, accurately extracting cross-language code semantics is even harder. For extracting semantic information from code, Li et al. (Li et al., 2017a) and Harer et al. (Harer et al., 2018) use tokens to generate embedded vectors and feed them into neural networks for code feature learning. However, a token only carries the lexical information of the code and cannot reflect its structural and semantic characteristics. Ben-Nun et al. (Ben-Nun et al., 2018) use Intermediate Representations (IRs) to build a graph called XFG and then use neural networks (GNN (Scarselli et al., 2008), RNN (Elman, 1990)) to learn the semantic features of the graph. Since the XFG contains the data dependences of the code, it helps semantic extraction, but obtaining the IR requires compiling the code, which makes it impossible to process incomplete code fragments. Azcona et al. (Azcona et al., 2019) and Mou et al. (Mou et al., 2016) propose code learning approaches based on the Abstract Syntax Tree (AST), which learn features from the traversal or the tree structure of the AST and capture some semantic information.

Figure 1. An overview of the proposed approach – UAST.

The studies mentioned above all learn the semantic features of a single language and cannot learn the semantic features of different programming languages, which explains the limited progress in cross-language program classification. Bui et al. (Bui et al., 2019) pay attention to cross-language issues and propose a Bi-NN framework to learn the semantic features of two different programming languages. Specifically, it uses two networks of the same structure (similar to Siamese networks (Chopra et al., 2005)) to perform code feature learning. Although the model can handle cross-language program classification, it essentially learns the features of each programming language separately and then fuses the learned code features. In addition, the structure is not highly scalable: if the number of programming languages increases, a new network must be added, which complicates the network structure and increases training time. Bui et al. (Bui et al., 2021) propose a pre-trained model based on self-supervision. It trains code features on a large corpus that contains multiple programming languages, so it can be used for cross-language program classification tasks. However, it applies no language-specific preprocessing and directly generates ASTs for mixed training, which essentially ignores the differences between languages.

To address the problems mentioned above, we propose a neural network called Unified Abstract Syntax Tree (UAST for short). For semantic extraction, we employ self-attention combined with a Bi-LSTM to extract flattened AST sequence features, which capture the global logical structure of the code. Besides, we use a Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to extract local features from the graph-like AST structure. We then fuse these two kinds of features (sequence features and graph features) to strengthen the features of the corresponding dimensions, so that the structural and semantic characteristics of the code can be obtained. For cross-language semantic learning, we establish a unified vocabulary for embedding mapping and use it to reduce the differences between programming languages, thereby facilitating the learning of the neural networks. In addition, we collect a dataset that contains five different programming languages from Leetcode (https://leetcode.com/). We conduct experiments on two datasets, and the results show that UAST outperforms the state-of-the-art baselines by 4.54% - 22.62% in terms of four evaluation metrics. Furthermore, we conduct ablation experiments to explore the impact of our unified vocabulary and unified AST feature fusion on model performance.

In summary, the main contributions of this study can be summarized as follows.

  • We propose a unified AST representation learning approach using two sub-networks (SAST and GAST). SAST extracts the global code syntactic features contained in the AST path sequence, while GAST captures the local code semantic features in the AST tree. The unified AST network comprehensively considers the globally and locally learned code features, enabling it to effectively learn code semantics.

  • We construct a unified vocabulary mechanism to reduce the differences between programming languages. The initial embedded input vector is obtained through vocabulary mapping and then fed into the two sub-networks for training, which enables cross-language program classification.

  • We conduct experiments on two datasets, and the experimental results show that our performance is better than other state-of-the-art baselines (CodeBERT, Infercode) in terms of Recall, Precision, F1-score and Accuracy.

  • We contribute a benchmark dataset for the cross-language program classification task. The dataset contains five programming languages (C, C++, Java, Python, and JavaScript), with a total of 50 problems and 20,000 solution files. The solutions to each problem are semantically similar to each other. (Our replication package, including both datasets and scripts, can be found at https://github.com/kkcookies99/UAST.)

The remainder of this paper is organized as follows. Section 2 elaborates on the proposed approach. Sections 3 and 4 report on the experimental design and results respectively. Section 5 discusses the considerations behind the proposed approach. Section 6 reviews the related work. Finally, we conclude this paper and present the future work in Section 7.

2. The Proposed Approach

In this section, we present the design and implementation details of our proposed UAST (Unified Abstract Syntax Tree) neural network model for cross-language program classification.

Figure 2. The construction of unified vocabulary. Boxes of the same color have the same meaning, and the areas surrounded by dotted lines of the same color represent the same code structure.

2.1. Overall Structure

As illustrated in Figure 1, the model first takes source code in different programming languages as input and parses it into ASTs. It then performs path embedding and graph embedding on the ASTs using the unified vocabulary. The path embedded vector and the graph embedded vector are fed into separate sub-networks to capture code features. Finally, the separately learned features are fused and the classification task is conducted.

Specifically, it mainly consists of five parts: Unified Vocabulary, Sequence-based AST Network, Graph-based AST Network, Unified AST Feature Fusion, and Cross-language Program Classification. We explain the details of each part in the following subsections.

2.2. Unified Vocabulary

Since different programming languages have different coding rules and characteristics, there are certain differences in their text representations. Figure 2 shows the ASTs generated by tree-sitter (http://tree-sitter.github.io/tree-sitter/) when parsing a function that adds two integers, implemented in three programming languages (Java/C++/Python). We choose tree-sitter for syntactic analysis because it supports 40 programming languages and provides an easy-to-use Python API. As Figure 2 shows, the node names of the ASTs differ: the root node is called "program" in the Java AST, "translation unit" in the C++ AST, and "module" in the Python AST. These three terms all represent the same semantics, namely the coding unit, which generally means the code file or the program. To reduce the differences in node names between languages, we propose a mechanism called the "Unified Vocabulary", which normalizes AST node names across languages; for example, the "coding unit" mentioned above is unified into "unit". Specifically, we use "unit" to uniformly replace "program", "translation unit", and "module", which reduces the differences between coding languages. Furthermore, "block" in Java, "compound statement" in C++, and "block" in Python are essentially the same code block, so "block" is used for all of them. The nodes of the same color in Figure 2 represent the same meaning, so they are all processed into the same node name.

We have considered all of the similar but differently named node types across languages. The unified vocabulary mechanism unifies the AST node names generated for different programming languages, alleviating the divergence caused by language-specific coding conventions. Thus, the embedded vector generated with the unified vocabulary can learn features across programming languages and thereby tackle the cross-language program classification problem. The sketch below makes this mechanism concrete.
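The following Python sketch shows a hypothetical excerpt of the unified vocabulary and how node names could be normalized during a pre-order traversal; the mapping entries and function names are illustrative rather than the paper's full table, and the node interface assumed is tree-sitter's (node.type, node.children).

```python
# Hypothetical excerpt of the unified vocabulary: language-specific AST node
# names that share a meaning are mapped to one canonical name.
UNIFIED_VOCAB = {
    "program": "unit",              # Java root node
    "translation_unit": "unit",     # C/C++ root node
    "module": "unit",               # Python root node
    "compound_statement": "block",  # C/C++ counterpart of "block"
}

def unify(node_type: str) -> str:
    """Map a language-specific node name to its unified name; names that
    need no normalization pass through unchanged."""
    return UNIFIED_VOCAB.get(node_type, node_type)

def preorder(node, path=None):
    """Pre-order traversal of a parsed AST (tree-sitter Node interface),
    collecting the unified node names into a flat path sequence."""
    if path is None:
        path = []
    path.append(unify(node.type))
    for child in node.children:
        preorder(child, path)
    return path
```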

2.3. Sequence-based AST Network

The foundation of cross-language program classification is learning the semantic and syntactic code features of different programming languages. To extract the syntactic structure of the code, we propose a Sequence-based AST network (SAST for short). First, we perform a pre-order traversal of the unified AST; the resulting path sequence can be regarded as a flattened representation of the AST. The path sequence contains the global information of the source code and, to a certain extent, reflects its syntactic structure. To extract the dependencies between the nodes within the sequence, we use the self-attention structure (Vaswani et al., 2017). Self-attention is a powerful mechanism in the transformer architecture that is effective at extracting internal relationships and can alleviate the problem of long-range dependencies. It is calculated as follows:

(1) $Attention(Q,K,V)=SoftMax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$

where the three matrices $Q\in\mathbb{R}^{l\times d}$, $K\in\mathbb{R}^{l\times d}$, and $V\in\mathbb{R}^{l\times d}$ are initialized and generated from the embedded path sequence vector; in self-attention all three are derived from the same input, which reduces the number of model parameters and speeds up training. $d$ is the embedding dimension of the path sequence, $l$ is the length of the input path, and the dot product is computed between $Q$ and $K$. $d_{k}$ is the dimension of the key vectors, and $Attention(Q,K,V)$ is the resulting attention score.
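As an illustration, a minimal single-head version of Eq. (1) can be written in PyTorch as follows; the paper's SAST uses 4 heads, so this sketch only shows the core computation, and the projection matrices here are assumed inputs.

```python
import math
import torch

def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over an embedded path
    sequence x of shape (l, d); w_q, w_k, w_v are (d, d) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.transpose(0, 1)) / math.sqrt(k.size(-1))  # (l, l)
    return torch.softmax(scores, dim=-1) @ v                  # Eq. (1)
```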

When writing code, the context of a statement often reflects its intent; for example, in C++ a variable must be declared before it is used. The self-attention mechanism above captures the internal relationships of the embedded code vector, but in addition to these internal dependencies, the network also needs to capture the contextual dependencies of the source code. Therefore, the Bidirectional Long Short-Term Memory (Bi-LSTM) (Graves and Schmidhuber, 2005) is introduced here. A Bi-LSTM learns features of the input data from both directions, so it can infer the current information from the context of the code. In our SAST network, the Bi-LSTM comprehensively considers all available path information in the context to extract the semantic and logical features of the source code. Specifically, the hidden state of the LSTM at each position $t$ of the input path is computed as:

(2) $i_{t}=\sigma(W_{i}\cdot[h_{t-1},x_{t}]+b_{i})$
(3) $f_{t}=\sigma(W_{f}\cdot[h_{t-1},x_{t}]+b_{f})$
(4) $o_{t}=\sigma(W_{o}\cdot[h_{t-1},x_{t}]+b_{o})$
(5) $\tilde{c}_{t}=\tanh(W_{C}\cdot[h_{t-1},x_{t}]+b_{c})$
(6) $c_{t}=f_{t}\odot c_{t-1}+i_{t}\odot\tilde{c}_{t}$
(7) $h_{t}=o_{t}\odot\tanh(c_{t})$

where $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, $x_{t}\in\mathbb{R}^{d}$ is the data at position $t$ of the input path sequence after self-attention, $c_{t}$ is the cell state at position $t$, and $h_{t}\in\mathbb{R}^{h}$ is the hidden state of the learning layer, i.e., the final extracted code feature. $W_{i},W_{f},W_{o}\in\mathbb{R}^{h\times d}$ are trainable weight matrices, and $\odot$ is the element-wise multiplication operator.

(8) $h_{SAST}=\overrightarrow{h_{t}}\oplus\overleftarrow{h_{t}}$

After that, we concatenate the hidden state $\overrightarrow{h_{t}}\in\mathbb{R}^{h}$ learned by the forward LSTM and the hidden state $\overleftarrow{h_{t}}\in\mathbb{R}^{h}$ learned by the backward LSTM to obtain $h_{SAST}\in\mathbb{R}^{2\times h}$, which contains the contextual features of the code.
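A minimal sketch of this recurrent stage in PyTorch, using the hyper-parameters reported in Section 3.4 (embedding dimension 200, hidden size 64, 2 layers, dropout 0.5); the batch size and variable names are our assumptions.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=200, hidden_size=64, num_layers=2,
                 dropout=0.5, bidirectional=True, batch_first=True)

x = torch.randn(32, 700, 200)    # (batch, path length, embedding dim)
_, (h_n, _) = bilstm(x)          # h_n: (num_layers * 2, batch, 64)
h_fwd, h_bwd = h_n[-2], h_n[-1]  # last layer's forward / backward states
h_sast = torch.cat([h_fwd, h_bwd], dim=-1)  # (batch, 2*h), i.e., Eq. (8)
```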

2.4. Graph-based AST Network

The sequence-based AST network learns the global structure and syntactic features of the code from the path sequence, but the path sequence is a flattened representation, which means some tree-like structure information of the code is ignored. We find that, except for language-specific library or package files, the logic of different programming languages is generally the same when implementing a specific function. As shown in Figure 2, the parts enclosed by dashed lines of the same color all have the same meaning (e.g., the parts outlined by the green dotted line all represent the "return a+b" statement in the three programming languages). In addition, these parts are highly consistent in AST shape. To learn the local tree features of the AST, a natural choice is to use a GCN to extract local information from the code, so we propose the Graph-based AST network (GAST for short). We regard the AST as a special graph and perform convolution operations on it to extract code semantic features. The GCN aggregates and learns the features of each node's neighbors in the unified AST, so it can capture the local features of the AST, that is, the local semantics of the code.

Figure 3. A case of AST node aggregation of GAST.

Figure 3 illustrates the AST of the "return a+b" statement and a two-layer convolution operation on it. Specifically, with 1-hop aggregation, the "return statement" node aggregates the information of its first-order neighbors ("unit", "binary expression left", and "binary expression right"), which extracts the direct relationships between code statements. As the number of hops increases, the "return statement" node also indirectly aggregates the information of its second-order neighbors ("body", $identifier_{1}$, $identifier_{2}$). Therefore, the "return statement" node learns the local structural and semantic features of its neighbors. The right side of Figure 3 shows the adjacency matrix of the AST on the left; note that $\widetilde{A}$ adds an identity matrix to the adjacency matrix, so that each node also learns its own features. The graph convolution operation is defined as follows:

(9) $\widetilde{A}=A+I$
(10) $H_{i}^{(l+1)}=\sigma\left(\sum_{j\in N}D^{-\frac{1}{2}}\widetilde{A}D^{-\frac{1}{2}}H_{j}^{(l)}W^{(l)}\right)$

where $H_{i}^{(l+1)}$ is the feature of node $i$ in layer $(l+1)$; $H_{j}^{(l)}$ is the feature of the neighbor nodes of node $i$ (including itself) in layer $l$; $N$ is the set of neighbors of node $i$; $\widetilde{A}\in\mathbb{R}^{N\times N}$ is the adjacency matrix $A$ of node $i$ with the identity matrix added; $D\in\mathbb{R}^{N\times N}$ is the degree matrix of $\widetilde{A}$; and $W^{(l)}\in\mathbb{R}^{d_{in}\times d_{out}}$ is the trainable weight matrix of layer $l$.
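A minimal dense implementation of Eqs. (9)-(10) is sketched below; we assume ReLU for the activation $\sigma$, which the paper leaves generic.

```python
import torch

def gcn_layer(A: torch.Tensor, H: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One graph convolution step over a dense adjacency matrix A (n, n),
    node features H (n, d_in), and a weight matrix W (d_in, d_out)."""
    A_tilde = A + torch.eye(A.size(0))      # Eq. (9): add self-loops
    deg = A_tilde.sum(dim=1)                # node degrees of A~
    d_inv_sqrt = torch.diag(deg.pow(-0.5))  # D^{-1/2}
    return torch.relu(d_inv_sqrt @ A_tilde @ d_inv_sqrt @ H @ W)  # Eq. (10)
```

Stacking this function twice corresponds to the 2-layer GCN used in GAST.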

2.5. Unified AST Feature Fusion

The aforementioned SAST captures the global structure and logical characteristics of the code, and GAST extracts its local structural and semantic features. To comprehensively consider both global and local code semantics, a fusion mechanism is further needed. We strengthen the features of the corresponding dimensions through vector concatenation, which can be described as:

(11) $h_{code}=concat(h_{SAST},h_{GAST})$

where $h_{code}\in\mathbb{R}^{2\times h+d_{out}}$ represents the feature vector after unified AST feature fusion, $h_{SAST}\in\mathbb{R}^{2\times h}$ represents the global structural features learned from the flattened sequence, and $h_{GAST}\in\mathbb{R}^{d_{out}}$ represents the local semantic features learned from the graph-like AST.
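In code, the fusion is a single concatenation; the dimensions below are illustrative ($2\times h=128$ from the Bi-LSTM, and $d_{out}$ assumed to be 64 for the GCN).

```python
import torch

h_sast = torch.randn(32, 128)  # global features from SAST (2 * h)
h_gast = torch.randn(32, 64)   # local features from GAST (d_out, assumed)
h_code = torch.cat([h_sast, h_gast], dim=-1)  # (32, 192), Eq. (11)
```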

2.6. Cross-language Program Classification

Having obtained the feature vector that includes the global and local semantic features of the input code, we apply a fully connected layer for a linear dimensional transformation, and the probability $p_{i}$ is output through the SoftMax layer. We use Cross Entropy (De Boer et al., 2005) as our loss function and adopt the Adam optimizer (Kingma and Ba, 2015) to minimize it. The loss is calculated as follows:

(12) $J=-\sum_{i=1}^{k}y_{i}\log(p_{i})$

where $k$ is the number of program categories, $y$ is the label of the program ($y_{i}=1$ if the category is $i$, else $y_{i}=0$), and $p_{i}$ is the output of the SoftMax layer.
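A sketch of this classification head under the definitions above; nn.CrossEntropyLoss combines the SoftMax with Eq. (12), and the fused dimension (192) and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

num_classes = 50                          # e.g., the 50 Leetcode problems
classifier = nn.Linear(192, num_classes)  # linear dimensional transformation
criterion = nn.CrossEntropyLoss()         # SoftMax + Eq. (12) in one module
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

h_code = torch.randn(32, 192)             # fused code vectors
labels = torch.randint(0, num_classes, (32,))
optimizer.zero_grad()
loss = criterion(classifier(h_code), labels)
loss.backward()
optimizer.step()
```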

3. Experimental Design

This section first presents four research questions to be investigated, then describes our experimental datasets, compared baselines, experiment settings, and evaluation metrics in the following subsections.

3.1. Research Questions

Our motivation is to verify whether adopting the unified vocabulary and the unified AST feature fusion could improve the performance of our approach, and how those two mechanisms affect the performance of UAST. Taking the above concerns into account, we raise the following Research Questions (RQs):

RQ1: How effective is the proposed UAST compared with the other baselines?

RQ1 is intended to investigate whether the proposed UAST outperforms the other state-of-the-art baselines. Many other works also explore code representation learning for cross-language program classification; we select representative ones as baselines, described in Section 3.3.

RQ2: How does the unified vocabulary affect the performance of our proposed UAST?

RQ2 aims to explore whether unified vocabulary could reduce the difference between different programming languages and thus improve the performance of our proposed method. To analyze the impact of unified vocabulary on model effectiveness, we verify the impact of using the unified vocabulary or not on the overall performance through ablation experiments.

RQ3: How does the unified AST feature fusion affect the performance of our proposed UAST?

RQ3 is put forward to evaluate the impact of the unified AST feature fusion. Since the two sub-networks (SAST and GAST) described in Sections 2.3 and 2.4 extract global syntactic features and local semantic structure features respectively, it is necessary to explore the effect of the fusion mechanism on the features learned by the two sub-networks. Therefore, we conduct ablation experiments to verify the impact of the unified AST feature fusion mechanism on overall performance.

RQ4: How do different parameter settings affect the performance of our proposed UAST?

RQ4 examines the impact of UAST's own parameters on performance. UAST contains two important parameters that can affect model performance substantially: the length of the path obtained by pre-order traversal of the AST, and the number of GCN layers, which determines how many layers of neighbor information can be aggregated in GAST. We try different settings to find the parameters that yield the best performance.

3.2. Datasets

We use two datasets to evaluate the performance of program classification models. The first dataset is inherited from the Bi-TBCNN work by Bui et al. (Bui et al., 2019); since it contains two programming languages (Java and C++), we call it Dataset JC. To evaluate our model on more programming languages, we collect a dataset from Leetcode that contains 20,000 problem solutions in five programming languages, which we call Dataset Leetcode. The following is a detailed introduction to these two datasets.

  • JC: This dataset includes 10 different categories of programs crawled by Bui et al. (Bui et al., 2019) from GitHub. It contains 5822 Java files and 7019 C++ files. The code files in each category implement the same function, so they are semantically similar to each other. The dataset is split into training, validation, and testing sets at a 3:1:1 ratio. The specific split is shown in Table 1.

  • Leetcode: We crawl 50 different categories of programs from Leetcode, each of which contains 400 solutions in different programming languages, for a total of 20,000 files. The codes under the same problem are all semantically similar to each other. The dataset includes five programming languages: C, C++, Java, Python, and JavaScript. We use NICAD (http://www.txl.ca/txl-nicaddownload.html) to filter out duplicate code. Each code file contains some salient information: the function name of the solution to a given problem is the same across programming languages (for example, the solutions to the problem of "finding the median of two sorted arrays" all define a function named "findMedianSortedArrays"). We therefore replace all such solution function names with "XXX", while user-defined function names are not replaced (a sketch of this masking step follows Table 1). Besides, our dataset can also serve as a benchmark for the cross-language program classification task. Since the number of problem solutions per language is unbalanced, we split the dataset by language at a 3:1:1 training:validation:testing ratio. The specific split is shown in Table 1.

Table 1. The division of Dataset JC and Leetcode.

             |      JC      |                  Leetcode
             | Java | C++   | C   | C++  | Java | Python | JavaScript
Training     | 3498 | 4215  | 331 | 3428 | 5051 | 2633   | 557
Validation   | 1162 | 1402  | 110 | 1143 | 1684 | 878    | 185
Testing      | 1162 | 1402  | 110 | 1143 | 1684 | 878    | 185
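The function-name masking described in the Leetcode bullet above can be sketched as a whole-word substitution; the entry-point name per problem is assumed known, and the helper function name below is ours.

```python
import re

def mask_entry_point(code: str, entry_name: str) -> str:
    """Replace whole-word occurrences of a problem's solution function name
    with "XXX" so the label is not leaked; user-defined helpers are kept."""
    return re.sub(rf"\b{re.escape(entry_name)}\b", "XXX", code)

masked = mask_entry_point("def findMedianSortedArrays(a, b): ...",
                          "findMedianSortedArrays")  # -> "def XXX(a, b): ..."
```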

3.3. Baselines

We compare UAST with the following state-of-the-art baselines for the cross-language code learning.

  • CodeBERT (Feng et al., 2020): A variant of BERT (Devlin et al., 2019) that uses RTD (Replaced Token Detection) for pre-training and adopts deep bidirectional transformer components, generating code features that integrate context information and effectively extract features of the input sequence. It is pre-trained on six programming languages, so it can be used for cross-language program classification.

  • Infercode (Bui et al., 2021): This method applies self-supervised learning ideas from natural language processing to the AST of the code, training code representations by predicting subtrees of the AST. Because it is trained on multiple programming languages, it can extract features of different programming languages and can be used to classify cross-language programs.

3.4. Experimental Setting

We implement our network in PyTorch. For hardware, all experiments are run on a server with a 10-core 3.70GHz Intel(R) Core(TM) i9-10900X CPU and an NVIDIA GeForce RTX 3090 GPU. To ensure the fairness of the comparative experiments, we use the same training set for all models. We tune the parameters on the validation set; the specific settings are as follows: UAST is trained for 5 epochs with a batch size of 64, we adopt Adam (Kingma and Ba, 2015) as the optimizer, and the learning rate is set to 0.001. For SAST, the length of the path sequence is unified to 700 for Dataset JC and 200 for Dataset Leetcode (sequences shorter than this are padded with 0; longer ones are truncated). Besides, the embedding dimension of the path is 200; the Bi-LSTM has 64 units, 2 layers, and a dropout rate of 0.5; the self-attention layer has a dropout rate of 0.2 and 4 heads. For GAST, the size of the adjacency matrix is unified to [400, 400], and the number of GCN layers is 2.
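The pad-or-truncate rule for path sequences can be expressed as a small helper; the function name is ours.

```python
def fit_length(seq, target, pad=0):
    """Truncate a path sequence to `target` items, or right-pad it with
    zeros (target = 700 for Dataset JC, 200 for Dataset Leetcode)."""
    return seq[:target] + [pad] * max(0, target - len(seq))
```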

3.5. Evaluation Metrics

For multi-class classification tasks, it is common to use standard evaluation metrics (Precision, Recall, F1-score, Accuracy) to evaluate model performance (Zhang et al., 2019). To verify the effect of our model, we use these four metrics, defined as follows:

(13) $Precision=\frac{\overline{TP}}{\overline{TP}+\overline{FP}}$
(14) $Recall=\frac{\overline{TP}}{\overline{TP}+\overline{FN}}$
(15) $F1\text{-}score=2\times\frac{Precision\times Recall}{Precision+Recall}$
(16) $Accuracy=\frac{\overline{TP}+\overline{TN}}{\overline{TP}+\overline{FP}+\overline{TN}+\overline{FN}}$

where $\overline{TP}$ (True Positive) is the weighted average of the number of samples correctly predicted as positive examples for each class; $\overline{FN}$ (False Negative) is the weighted average of the number of samples incorrectly predicted as negative examples per class; $\overline{FP}$ (False Positive) is the weighted average of the number of samples incorrectly predicted as positive examples per class; and $\overline{TN}$ (True Negative) is the weighted average of the number of samples correctly predicted as negative examples per class.
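These per-class weighted averages correspond to scikit-learn's `average="weighted"` mode; the following is one way to compute all four metrics, assuming scikit-learn is available, with toy labels for illustration.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1]  # toy ground-truth labels
y_pred = [0, 1, 2, 1, 1]  # toy predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)
```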

4. Results and Analysis

This section reports the experimental results by addressing the four research questions that are proposed in Section 3.1.

4.1. RQ1: Model Performance

To answer RQ1, we compare UAST with the baselines described in Section 3.3. Tables 2 and 3 report the results on the two datasets. As Table 2 shows, UAST achieves a Recall of 0.9611, a Precision of 0.9631, an F1-score of 0.9617, and an Accuracy of 0.9626, outperforming the other baselines. The results on Dataset Leetcode also outperform the other baselines, indicating that UAST is effective for cross-language program classification.

Table 2. Comparative experiment on Dataset JC.
Model Recall Precision F1-score Accuracy
CodeBERT 0.9078 0.9177 0.9090 0.9005
Infercode 0.8317 0.8468 0.8325 0.8343
UAST 0.9611 0.9631 0.9617 0.9626
Table 3. Comparative experiment on Dataset Leetcode.
Model Recall Precision F1-score Accuracy
CodeBERT 0.6147 0.6348 0.6174 0.6245
Infercode 0.5696 0.5819 0.5762 0.5807
UAST 0.7958 0.8025 0.7965 0.7964

Among the baselines, Infercode learns the characteristics of AST subtrees through self-supervision and is trained on a cross-language corpus, so it can be used for cross-language program classification. However, its extracted feature vector has only 100 dimensions, which cannot fully reflect the characteristics of longer code, making its final performance worse than the other neural models. As a powerful pre-trained model, CodeBERT's bidirectional transformer structure effectively extracts the contextual information of the code, so its performance reaches 91% on Dataset JC. However, it learns code features from token embeddings, so it cannot fully reflect the semantic characteristics of the code. UAST comprehensively considers global and local structural-semantic information, and its unified vocabulary reduces the differences between programming languages, so its performance is better than the other baselines.

Furthermore, we conduct experiments on two datasets of different sizes and language types. UAST outperforms the other baselines on both, which not only shows the effectiveness of our model but also demonstrates its strong generalization ability.

Result 1: Our proposed UAST significantly outperforms other state-of-the-art baselines in terms of Precision, Recall, F1-score, and Accuracy.

4.2. RQ2: The Impact of the Unified Vocabulary

As mentioned in Section 2.2, we propose the unified vocabulary mechanism to reduce the differences between programming languages. To answer RQ2, we conduct an ablation experiment to explore the impact of the unified vocabulary on model performance. The results are shown in Table 4 (a method suffixed with -V does not use the unified vocabulary).

According to Table 4, using the unified vocabulary outperforms not using it by 0.37% - 3.32% in Recall, 0.37% - 3.86% in Precision, 0.42% - 3.26% in F1-score, and 0.30% - 2.86% in Accuracy on Datasets JC and Leetcode, which indicates that the unified vocabulary indeed reduces the feature gap between programming languages and thus improves the performance of cross-language program classification.

Table 4. Ablation experiment results of the unified vocabulary on Dataset JC and Leetcode.
Dataset Model Recall Precision F1-score Accuracy
JC SAST-V 0.8802 0.8868 0.8816 0.8856
SAST 0.9125 0.9254 0.9142 0.9142
GAST-V 0.9467 0.9479 0.9469 0.9478
GAST 0.9504 0.9516 0.9511 0.9508
UAST-V 0.9524 0.9526 0.9509 0.9516
UAST 0.9611 0.9631 0.9617 0.9626
Leetcode SAST-V 0.6553 0.6894 0.6540 0.6554
SAST 0.6718 0.7020 0.6707 0.6721
GAST-V 0.7744 0.7793 0.7749 0.7749
GAST 0.7892 0.7956 0.7887 0.7892
UAST-V 0.7893 0.7970 0.7904 0.7882
UAST 0.7958 0.8025 0.7965 0.7964

Since different programming languages have their own coding features, the terms produced by the parser for different languages differ even when they carry the same meaning, so the AST nodes show different node names. If we directly built a vocabulary from all AST node names, each language's AST would retain these differences, enlarging the gap between code features. Using the unified vocabulary reduces the differences in the terms obtained by code parsing, and in turn reduces the feature gap learned by the proposed network, so the performance on cross-language program classification tasks improves.

Furthermore, we find that while the unified vocabulary improves the performance of all three networks (SAST, GAST, and UAST), its effect on the SAST network is the most pronounced. The SAST network is designed to extract features from a traversal path sequence of the AST, and the initial embedded vector fed into the network is determined by the vocabulary mapping, so the unified vocabulary has the most direct impact on SAST, which explains the notable improvement. For the GAST network, the adjacency matrix is constructed only from the structure of the AST, not from the unified vocabulary; only the feature matrix, composed of one-hot vectors of node names, is related to the unified vocabulary, so the overall impact of the unified vocabulary on GAST is relatively small compared to SAST.

Given all those results discussed above, the unified vocabulary improves the performance to a certain extent, confirming the effectiveness of the unified vocabulary for the cross-language program classification.

Result 2: The unified vocabulary does improve model performance to a certain extent, so leveraging the unified vocabulary to generate the embedded vector is a good choice for cross-language classification tasks.

4.3. RQ3: The Impact of the Unified AST Feature Fusion

Table 5 shows that the performance after unified AST feature fusion is 1.06% - 4.86% higher than before fusion on Dataset JC, and 0.66% - 12.58% higher on Dataset Leetcode, indicating that the unified AST feature fusion further enhances the effectiveness of our proposed model.

In addition, we find that SAST does not perform as well as GAST on either dataset. This result implies that the features extracted from the flattened AST sequence are less useful than those extracted from the graph-like AST structure. The main reason is that the sequence obtained by traversing the AST carries fewer structural characteristics of branch statements, because it only captures the overall flow of the source code. In contrast, the adjacency matrix of the graph-like AST captures the relationships between neighboring nodes, which is very effective for certain structures (e.g., for-loop and if-condition structures), so SAST underperforms GAST.

Table 5. Ablation experiment results of the unified AST feature fusion on Dataset JC and Leetcode.
Dataset Model Recall Precision F1-score Accuracy
JC SAST 0.9125 0.9254 0.9142 0.9142
GAST 0.9504 0.9516 0.9511 0.9508
UAST 0.9611 0.9631 0.9617 0.9626
Leetcode SAST 0.6718 0.7020 0.6707 0.6721
GAST 0.7892 0.7956 0.7887 0.7892
UAST 0.7958 0.8025 0.7965 0.7964

To further analyze why the unified AST feature fusion works better, we reason as follows. The two code features extracted by the unified AST neural network (SAST and GAST) contain the global and local features of the code respectively. Considering only one of them would ignore the other aspect, whereas fusing the two hidden vectors strengthens the features of the corresponding dimensions, resulting in better performance. SAST focuses on extracting global information and GAST on capturing local information; concatenating and fusing the two feature vectors after the sub-networks therefore comprehensively considers global and local information and learns the semantic structure of the code from multiple aspects.

Given all those factors stated above, the unified AST feature fusion mechanism can indeed improve the performance of the network as a whole.

Result 3: The unified AST feature fusion mechanism leverages both global structural information and local semantic information for code representation, which can significantly help improve model performance.

4.4. RQ4: The Impact of Parameter Settings

In our proposed UAST, the length of the path sequence in SAST and the number of GCN layers in GAST are two important parameters that can affect model performance. The length of the path sequence determines how much of the code is fed into the neural network; since the length of the source code in the dataset is unevenly distributed, setting an appropriate length is particularly important. The number of GCN layers in GAST, i.e., the number of hops, determines the range over which each node can aggregate information from its neighbors: as the number of hops increases, each node aggregates information from a larger neighborhood, thereby attending to a larger range of local code semantics. To answer RQ4, we run experiments with different parameter settings; the results are shown in Figures 4 and 5.

For the length of the path sequence, we explore its effect on performance by setting different length values (100-1000). Figure 4 shows that UAST achieves the best performance in Recall, Precision, F1-score, and Accuracy when the length is set to 700 on Dataset JC and 200 on Dataset Leetcode. Increasing the length does not improve the model; it only increases its complexity. Moreover, a Bi-LSTM cannot effectively handle very long input sequences, which is one possible reason for the degraded performance. When the length is reduced, performance also drops: shortening the input cuts out useful information, so the network cannot effectively capture some key features of the code.

Figure 4. The effect of the length of the path sequence in the SAST network on model performance. (a) Results on Dataset JC; (b) Results on Dataset Leetcode.

Notably, we compute statistics on AST path lengths for both datasets. As shown in Table 6, 80% of the AST path sequences are within length 726 on Dataset JC, while 80% are within length 221 on Dataset Leetcode. Therefore, it is most suitable to set the input lengths to 700 and 200 on Datasets JC and Leetcode respectively: a longer input would cause short code to be padded with many zeros, while a shorter input would lose key information. Thus, when training on a new dataset, one should first compute statistics on AST path lengths and choose the most appropriate length.

Table 6. Statistical distribution of AST path sequence length on Dataset JC and Leetcode.
Dataset Mean Median 70% 80% 90%
JC 576 354 502 726 1498
Leetcode 165 144 189 221 279
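The statistics behind Table 6 can be reproduced with simple percentile computations; `path_sequences` below is toy data standing in for the real corpus of pre-order paths.

```python
import numpy as np

path_sequences = [["unit", "block"], ["unit"],
                  ["unit", "block", "return_statement"]]  # toy stand-in
lengths = np.array([len(p) for p in path_sequences])
print(lengths.mean(), np.median(lengths),
      np.percentile(lengths, [70, 80, 90]))  # mean, median, 70/80/90th pct.
```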

For the number of GCN layers, we explore the effect of different depths on performance by setting the number of layers to 1, 2, and 3. As Figure 5 shows, the model performs best with 2 GCN layers. First-order neighbors cover a smaller neighborhood, capturing a smaller tree structure, so one layer performs worse than stacking two. More layers cover a wider neighborhood, yet the results are even worse than with one layer. Our preliminary inference is that certain code structures only manifest within a two-layer range when writing code. In addition, increasing the number of layers makes the model more complex and can lead to overfitting.

Figure 5. The effect of the number of GCN layers in the GAST network on model performance. (a) Results on Dataset JC; (b) Results on Dataset Leetcode.

Result 4: For the choice of path length, it is better to compute statistics on the dataset's AST paths before training and choose the most suitable value, while setting the number of GCN layers to 2 is beneficial to model effectiveness.

5. Discussion

This section discusses the considerations behind the proposed approach and the threats to validity of this study.

5.1. Why do we choose the AST as the code representation form?

To capture the characteristics of code, current research mainly learns the code by exploring different code Intermediate Representations (IRs). Tokens represent the code's lexical information; the Abstract Syntax Tree (AST) is a representation that includes code structure and grammatical information; and graph-based IRs such as the Control Flow Graph (CFG) and Data Flow Graph (DFG) contain the control-flow and data-flow characteristics of the code. Different representation forms capture different features from different views of the code. In recent years, many excellent studies based on these IRs have been carried out (Azcona et al., 2019; Mou et al., 2016; Feng et al., 2020; Ben-Nun et al., 2018). Distinguishing codes of different categories (functions) by capturing code characteristics is currently the mainstream method of program classification.

For the intermediate representation of the code, we have a variety of options: tokens, ASTs, or graph-based forms (such as CFG and DFG). A token only contains lexical-level information, while program classification is based on the function implemented by the code (i.e., the semantic level), so tokens are not the best choice. As for graph-based representations, although the control information they contain can describe the semantics of the code, it is not easy to generate such graphs for multiple languages, and the cost of training a pure graph model is high. In contrast, ASTs are relatively easy to generate for most programming languages, and as tree structures they are easy to traverse. We therefore decide to use the AST as our code representation.

5.2. Why does UAST work?

The essential difficulty of cross-language program classification is learning the semantic features of code implemented in different programming languages. Our model achieves its strong performance on cross-language program classification mainly due to its two mechanisms: the unified vocabulary and the unified AST feature fusion. The unified vocabulary reduces the differences between programming languages, making the model more effective for cross-language classification. The fusion mechanism combines the learned global and local code features to capture the semantics of the code, leading to better classification performance.

5.3. Threats to Validity

There are two types of threats to this study, and accordingly, we have made efforts to mitigate those threats to validity.

Internal threat: The threats to the internal validity of this study may result from the dataset. Since we collect the data from Leetcode, the function names of the solutions to a given problem are the same across languages; if they explicitly appear in the code, the label would be exposed. We therefore substitute all solution function names to eliminate this threat. Meanwhile, the dataset may contain duplicate code, so we deduplicate similar code using the NICAD tool to reduce this threat.

External threat: To rule out that our proposed method is only effective on our Dataset Leetcode, we use two datasets for experiments, which reduces the threat of accidental results to a certain extent. Our results outperform the other baselines on both datasets, illustrating that our model has good generalization ability.

6. Related Work

6.1. Program Classification

Program classification can be regarded as a high-level abstraction of code. Among all source code mining tasks, program classification lays the foundation for various tasks related to source code understanding. It can be applied to many scenarios in software engineering, such as code clone detection (Bellon et al., 2007; Börstler, 1995), bug fixing (Peters et al., 2017), code smell classification (Fontana and Zanoni, 2017), defect classification (Wang et al., 2016), and program understanding (Alon et al., 2019). The earliest research on program classification dates back to the last century: early studies (Clark and Darlington, 1980; Börstler, 1995; Jiang and Su, 2009; Taherkhani et al., 2011) used features such as tokens, component names, statement counts, and code metrics to classify programs. Since these studies classified programs based on manually defined rules and extracted only surface-level features of the code, their performance was limited, but they marked the beginning of exploration into program classification.

Machine learning has been shown to yield promising results for classification. Ugurel et al. (Ugurel et al., 2002) used a Support Vector Machine (SVM) (Drucker et al., 1997) to classify a code's token set, and Ma et al. (Ma et al., 2018) used SVM, Decision Tree (Quinlan, 1986), and Bayesian Network (Friedman et al., 1997) classifiers on the code's token sequence to determine which types of software artifacts are produced by various open source projects at different levels of granularity. Shimonaka et al. (Shimonaka et al., 2016) used four machine learning algorithms to build a model of the code's syntactic information that identifies whether source code is automatically generated.

With the rising trend of deep learning, more and more studies combine neural networks to learn the characteristics of code. Mou et al. (Mou et al., 2016) proposed a tree-based convolutional network whose kernel captures the structural information of the source code and performed well on program classification. Zhang et al. (Zhang et al., 2019) divided the AST into small sub-ASTs, encoded those sub-ASTs into vectors capturing lexical and syntactic features, and then used a Bi-RNN model to generate the code vector representation; this algorithm achieved excellent results on source code classification. Barchi et al. (Barchi et al., 2021) explored the use of Convolutional Neural Networks (CNN) (Kalchbrenner et al., 2014) to analyze program source code and showed that CNN models can be successfully applied to source code classification, providing higher accuracy and less learning time than the most advanced methods.

6.2. Code Representation Learning

The emergence of deep neural network technology has brought new solutions to the field of software engineering, and at the same time, an increasing amount of attention has been devoted to code representation learning. Code representation forms analyze the meaning of the source code at different levels: the token represents the lexical information of the code, the Abstract Syntax Tree (AST) represents the code structure and grammatical information obtained by the syntax analyzer, and the Data Flow Graph (DFG) and the Control Flow Graph (CFG) represent the code's data-flow and control-flow information. There are also other intermediate representations (IRs); combining these representation forms with neural networks to learn code semantics is a direction researchers are currently exploring.

Harer et al. (Harer et al., 2018) used the Word2Vec (Mikolov et al., 2013) tool to generate initial embedded vectors for C/C++ tokens, and then used the TextCNN (Zhang and Wallace, 2015) model to learn features from those vectors for software vulnerability detection. Azcona et al. (Azcona et al., 2019) analyzed Python code submitted by students by embedding its tokens into vectors, then extracted effective features to analyze each student's progress and performance. The token contains the lexical information of the code but ignores its structural features and grammatical logic.

Mou et al. (Mou et al., 2016) proposed a convolutional neural network named TBCNN based on the AST. They designed a subtree kernel that slides over the AST to extract the tree's structural information and used dynamic pooling to handle the varying number of subtrees. Alon et al. (Alon et al., 2019) proposed a code embedding algorithm called code2vec, which assigns an attention score to each path in the AST so that it can extract the syntactic information of the entire source code while distinguishing statements of different importance. Wang et al. (Wang et al., 2020) constructed a graph called the Flow-Augmented Abstract Syntax Tree (FA-AST) to represent source code, which contains control-flow and data-flow information, and then used Graph Neural Networks (GNN) (Scarselli et al., 2008) to capture the features of the FA-AST.

Ben-Nun et al. (Ben-Nun et al., 2018) learned code semantics based on the IR of the code. They converted the IR into a contextual flow graph (XFG), which contains the data flow and control flow of the code, and then used neural networks to learn code features from XFG embeddings. Wei et al. (Wei et al., 2020) generated a type dependency graph from source code, which links type variables with logical constraints as well as name and usage information, and then used a GNN to extract features from this kind of IR.

6.3. Cross-language Code Learning

Cross-language learning is widely used in machine translation in the field of NLP (Huang et al., 2013; Conneau and Lample, 2019), while cross-language code learning is still in its infancy. Nguyen et al. (Nguyen et al., 2014) proposed an SMT-based tool called semSMT to migrate Java programs to C#, operating at the lexical, syntactic, and semantic levels to extract code features. Bui et al. (Bui et al., 2019) proposed a bilateral dependency neural network to learn the features of different languages, and then used a Siamese network for similarity detection. Ye et al. (Ye et al., 2020) proposed the MISIM (Machine Inferred Code Similarity System) model, in which the Context-Aware Semantic Structure (CASS) captures the code's contextual semantics to describe its intent; CASS can also learn a language-independent representation through manual configuration for cross-language detection. The Infercode model proposed by Bui et al. (Bui et al., 2021) implemented self-supervised learning by predicting subtrees of the AST, generating vectors trained over multiple languages, and performed well on multiple tasks. The TPTrans model proposed by Peng et al. (Peng et al., 2021) performs feature learning for different programming languages and incorporates the code tree structure into the transformer architecture via position embeddings; it performed well on code summarization tasks.

7. Conclusions

To better address the problem of cross-language code learning, we propose in this study a Unified Abstract Syntax Tree (UAST) neural network framework for the cross-language program classification task. UAST contains two sub-networks (SAST and GAST): SAST extracts the global code syntactic features contained in the AST path sequence, and GAST captures the local code semantic features of the AST tree. Through the unified AST feature fusion, the UAST network comprehensively considers global and local code features and thus effectively learns code semantics. In addition, the unified vocabulary mechanism we propose reduces the differences between programming languages. Comparative experiments on a public dataset and our collected dataset both show that UAST outperforms other state-of-the-art baselines by a significant margin and better distinguishes between different programs. In conclusion, this study leverages powerful representation learning techniques to model cross-language source code, contributing to the state of the art in AI for software engineering. For future work, we plan to leverage both source code and natural-language text to further improve program classification.

ACKNOWLEDGEMENTS

This work is supported by the National Key Research and Development Project (No.2019YFE0105500), the National Natural Science Foundation of China (No.62072227), the Research Council of Norway (No.309494), the Key Research and Development Program of Jiangsu Province (No.BE2021002-2), the Natural Science Foundation of Chongqing (No. cstc2021jcyj-msxmX0538), and the Chongqing Science and Technology Plan Project (No.cstc2018jszx-cyztzxX0037).

References

  • Alon et al. (2019) Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
  • Azcona et al. (2019) David Azcona, Piyush Arora, I-Han Hsiao, and Alan Smeaton. 2019. user2code2vec: Embeddings for profiling students based on distributional representations of source code. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge. ACM, 86–95.
  • Baker (1993) Brenda S Baker. 1993. A program for identifying duplicated code. Computing Science and Statistics (1993), 49–49.
  • Barchi et al. (2021) Francesco Barchi, Emanuele Parisi, Gianvito Urgese, Elisa Ficarra, and Andrea Acquaviva. 2021. Exploration of Convolutional Neural Network models for source code classification. Engineering Applications of Artificial Intelligence 97 (2021), 104075.
  • Bellon et al. (2007) Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore Merlo. 2007. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 33, 9 (2007), 577–591.
  • Ben-Nun et al. (2018) Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural code comprehension: A learnable representation of code semantics. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, Inc, 3585–3597.
  • Börstler (1995) Juergen Börstler. 1995. Feature-oriented classification for software reuse. In Proceedings of the 7th International Conference on Software Engineering and Knowledge Engineering, Vol. 95. Knowledge Systems Institute, 22–24.
  • Bui et al. (2019) Nghi DQ Bui, Yijun Yu, and Lingxiao Jiang. 2019. Bilateral dependency neural networks for cross-language algorithm classification. In Proceedings of the 26th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 422–433.
  • Bui et al. (2021) Nghi DQ Bui, Yijun Yu, and Lingxiao Jiang. 2021. InferCode: Self-supervised learning of code representations by predicting subtrees. In Proceedings of the 43rd International Conference on Software Engineering. IEEE, 1186–1197.
  • Chopra et al. (2005) Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1. IEEE, 539–546.
  • Clark and Darlington (1980) Keith L. Clark and John Darlington. 1980. Algorithm classification through synthesis. The Computer Journal 23, 1 (1980), 61–65.
  • Conneau and Lample (2019) Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc., 7059–7069.
  • De Boer et al. (2005) Pieter-Tjerk De Boer, Dirk P Kroese, Shie Mannor, and Reuven Y Rubinstein. 2005. A tutorial on the cross-entropy method. Annals of Operations Research 134, 1 (2005), 19–67.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL, 29–35.
  • Drucker et al. (1997) Harris Drucker, Chris JC Burges, Linda Kaufman, Alex Smola, Vladimir Vapnik, et al. 1997. Support vector regression machines. Advances in Neural Information Processing Systems 9 (1997), 155–161.
  • Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive Science 14, 2 (1990), 179–211.
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. ACL, 1536–1547.
  • Fontana and Zanoni (2017) Francesca Arcelli Fontana and Marco Zanoni. 2017. Code smell severity classification using machine learning techniques. Knowledge-Based Systems 128 (2017), 43–58.
  • Friedman et al. (1997) Nir Friedman, Dan Geiger, and Moises Goldszmidt. 1997. Bayesian network classifiers. Machine Learning 29, 2 (1997), 131–163.
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5–6 (2005), 602–610.
  • Harer et al. (2018) Jacob A Harer, Louis Y Kim, Rebecca L Russell, Onur Ozdemir, Leonard R Kosta, Akshay Rangamani, Lei H Hamilton, Gabriel I Centeno, Jonathan R Key, Paul M Ellingwood, et al. 2018. Automated software vulnerability detection with machine learning. arXiv preprint arXiv:1803.04497 (2018).
  • Huang et al. (2013) Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong. 2013. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Proceedings of the 2013 International Conference on Acoustics, Speech and Signal Processing. IEEE, 7304–7308.
  • Jiang and Su (2009) Lingxiao Jiang and Zhendong Su. 2009. Automatic mining of functionally equivalent code fragments via random testing. In Proceedings of the 18th International Symposium on Software Testing and Analysis. ACM, 81–92.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014).
  • Kim et al. (2018) Kisub Kim, Dongsun Kim, Tegawendé F Bissyandé, Eunjong Choi, Li Li, Jacques Klein, and Yves Le Traon. 2018. FaCoY: a code-to-code search engine. In Proceedings of the 40th International Conference on Software Engineering. ACM, 946–957.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations. OpenReview.
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Li et al. (2017b) Jian Li, Pinjia He, Jieming Zhu, and Michael R Lyu. 2017b. Software defect prediction via convolutional neural network. In Proceedings of the 2017 International Conference on Software Quality, Reliability and Security. IEEE, 318–328.
  • Li et al. (2017a) Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, and Barbara Ryder. 2017a. Cclearner: A deep learning-based clone detection approach. In Proceedings of 33rd International Conference on Software Maintenance and Evolution. IEEE, 249–260.
  • Ma et al. (2018) Yuzhan Ma, Sarah Fakhoury, Michael Christensen, Venera Arnaoudova, Waleed Zogaan, and Mehdi Mirakhorli. 2018. Automatic classification of software artifacts in open-source applications. In Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 414–425.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Mou et al. (2016) Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. ACM.
  • Nguyen et al. (2014) Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. 2014. Migrating code with statistical machine translation. In Proceedings of the 36th International Conference on Software Engineering Companion. ACM, 544–547.
  • Peng et al. (2021) Han Peng, Ge Li, Wenhan Wang, Yunfei Zhao, and Zhi Jin. 2021. Integrating tree path in transformer for code representation. In Proceedings of the 35th Conference on Neural Information Processing Systems. Curran Associates, Inc.
  • Peters et al. (2017) Fayola Peters, Thein Than Tun, Yijun Yu, and Bashar Nuseibeh. 2017. Text filtering and ranking for security bug report prediction. IEEE Transactions on Software Engineering 45, 6 (2017), 615–631.
  • Quinlan (1986) J. Ross Quinlan. 1986. Induction of decision trees. Machine Learning 1, 1 (1986), 81–106.
  • Scarselli et al. (2008) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2008), 61–80.
  • Shimonaka et al. (2016) Kento Shimonaka, Soichi Sumi, Yoshiki Higo, and Shinji Kusumoto. 2016. Identifying auto-generated code by using machine learning techniques. In Proceedings of the 7th International Workshop on Empirical Software Engineering in Practice. IEEE, 18–23.
  • Taherkhani et al. (2011) Ahmad Taherkhani, Ari Korhonen, and Lauri Malmi. 2011. Recognizing algorithms using language constructs, software metrics and roles of variables: An experiment with sorting algorithms. The Computer Journal 54, 7 (2011), 1049–1066.
  • Ugurel et al. (2002) Secil Ugurel, Robert Krovetz, and C Lee Giles. 2002. What’s the code? automatic classification of source code archives. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 632–638.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems. Curran Associates, Inc., 5998–6008.
  • Wang et al. (2016) Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, 297–308.
  • Wang et al. (2020) Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering. IEEE, 261–271.
  • Wei et al. (2020) Jiayi Wei, Maruth Goyal, Greg Durrett, and Isil Dillig. 2020. Lambdanet: Probabilistic type inference using graph neural networks. In Proceedings of the 8th International Conference on Learning Representations. OpenReview.
  • Ye et al. (2020) Fangke Ye, Shengtian Zhou, Anand Venkat, Ryan Marcus, Nesime Tatbul, Jesmin Jahan Tithi, Paul Petersen, Timothy Mattson, Tim Kraska, Pradeep Dubey, et al. 2020. MISIM: An end-to-end neural code similarity system. arXiv preprint arXiv:2006.05265 (2020).
  • Zhang et al. (2019) Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering. IEEE, 783–794.
  • Zhang and Wallace (2015) Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).