BAFLineDP: Code Bilinear Attention Fusion Framework for Line-Level Defect Prediction

Shaojian Qiu
College of Mathematics and Informatics
South China Agricultural University
Guangzhou, China
qiushaojian@scau.edu.cn

Huihao Huang
College of Mathematics and Informatics
South China Agricultural University
Guangzhou, China
huanghuihao@insoft-lab.com

Jianxiang Luo
College of Mathematics and Informatics
South China Agricultural University
Guangzhou, China
luojianxiang@insoft-lab.com

Yingjie Kuang
College of Mathematics and Informatics
South China Agricultural University
Guangzhou, China
kuangyj@scau.edu.cn

Haoyu Luo
College of Mathematics and Informatics
South China Agricultural University
Guangzhou, China
haoyuluo@scau.edu.cn

Abstract

Software defect prediction aims to identify defect-prone code, aiding developers in optimizing testing resource allocation. Most defect prediction approaches primarily focus on coarse-grained, file-level defect prediction, which fails to provide developers with the precision required to locate defective code. Recently, some researchers have proposed fine-grained, line-level defect prediction methods. However, most of these approaches lack an in-depth consideration of the contextual semantics of code lines and neglect the local interaction information among code lines. To address the above issues, this paper presents a line-level defect prediction method grounded in a code bilinear attention fusion framework (BAFLineDP). This method discerns defective code files and lines by integrating source code line semantics, line-level context, and local interaction information between code lines and line-level context. Through an extensive analysis involving within- and cross-project defect prediction across 9 distinct projects encompassing 32 releases, our results demonstrate that BAFLineDP outperforms current advanced file-level and line-level defect prediction approaches.

Index Terms:
line-level defect prediction, code contextual feature, code pre-trained model, bilinear attention fusion

I Introduction

As the scale and complexity of modern software continue to increase, software development and maintenance are becoming arduous. Software defects not resolved in time will inevitably affect the software quality [1]. Defect prediction can help the quality assurance teams discover potential defects during the development process, contributing to optimizing the allocation of limited testing resources. To facilitate this, researchers [2] mine data from software historical repositories, construct code metrics and representations, and train prediction models to pinpoint defect-prone areas in software.

In recent years, researchers have proposed defect prediction methods at various granularity levels, such as package level [3], component level [4], module level [5], file level [3, 6] and method level [7]. Current research indicates that a more granular approach to defect prediction in software testing yields greater cost-effectiveness in code inspection [7, 3, 8]. Kamei et al. [3] demonstrated that defect prediction at the file level is more effective than at the package level. Similarly, Hata et al. [7] pointed out that method-level defect prediction offers superior cost-efficiency over file-level defect prediction. Although these methods have been verified through empirical studies, they remain coarse-grained and offer developers limited help in actual defect-finding tasks. For example, file-level defect prediction can only show developers the defect-prone files. Developers still have to spend extra effort walking through files and identifying defective code lines, resulting in inefficient reviews. Therefore, studying more fine-grained line-level defect prediction is necessary to improve the efficiency of code review work.

Currently, a handful of researchers are attempting to identify the defective lines of code. Wattanakriengkrai et al. [9] adopt a model-agnostic technique to discern code tokens with defective risks and classify the code containing the risk tokens as defective lines. Based on ensemble learning and attention mechanism, Zhang et al. [10] used abstract syntax trees to build a model to predict method-level defects and locate suspicious code lines. Pornprasit et al. [11] proposed a method called DeepLineDP, which utilizes a hierarchical Bi-GRU [12] network to extract code features to predict file-level defects and identify defective code lines through the token weights.

While the aforementioned approaches excel in line-level defect prediction tasks, they mostly neglect the contextual semantics of code lines and the local interaction information between them. Specifically, these approaches mainly rely on token-level attention within code lines to gauge their potential for defects. However, the computation of these token-level attentions is constrained to the information contained solely within the confines of the individual code line. It often fails to fully encompass the broader contextual information pertaining to the code line. Furthermore, software defects may manifest when multiple lines of code interact locally, leading to issues such as null pointer exceptions. Token attention-based methodologies may encounter challenges in effectively predicting such types of flaws. Consequently, considerable room remains for enhancement within the domain of existing line-level defect prediction methods.

To illustrate the problem of missing code line context and local interaction information between code lines and line-level context, we present an abridged case in Figure 1. In this case, there is a potential NullPointerException. Line 6 of the code directly calls name.length() to get the length of the string without checking whether the variable is null. If jo.get("name") returns null, a NullPointerException will be thrown. The above-mentioned line-level defect prediction methods mainly focus on defects caused by tokens within code lines and cannot mine the local contextual association between lines 3-4 and line 6. Early warning of the defect is thus missed.

Figure 1: A motivation case of missing line context and local interaction between code lines and line-level context.

To solve the above problems, this paper proposes a code bilinear attention fusion framework for line-level defect prediction (BAFLineDP). Specifically, BAFLineDP first uses a pre-trained CodeBERT model to parse the code file into a line embedding matrix. Subsequently, we adopt Bi-GRU to extract contextual features between code lines. Then, a bilinear attention fusion network is employed to mine the interactive representation of each code line and corresponding line-level context. Finally, we use code representations generated by BAFLineDP to improve prediction performance on defect prediction tasks within and across projects. The empirical results demonstrate that the BAFLineDP approach is better than other advanced defect prediction methods in the AUC, MCC, and BA metrics of file-level defect prediction and the Recall@Top20%LOC and Effort@Top20%Recall indicators of line-level defect prediction.

The main contributions of this paper are as follows:

  1. This paper introduces a bilinear attention fusion mechanism into the line-level defect prediction framework, taking into full account the global line-level context and local interactions of code lines to deeply mine the representations of code features. This method addresses the issue in existing line-level defect prediction approaches, which lack a comprehensive consideration of both global and local information at the code line level.

  2. This paper conducts experiments on multiple open-source benchmark projects and compares BAFLineDP with existing advanced methods. Experimental results show that the proposed BAFLineDP method achieves considerable prediction performance in defect prediction tasks within and across projects.

II Related Work

Software defect prediction is one of the important research directions in the field of intelligent software engineering. Its purpose is to help software quality assurance teams rationally allocate limited testing resources to improve the quality of software products. Researchers in this field have proposed defect prediction methods at various granularity levels, such as package level [3], component level [4], module level [5], file level [3, 6] and method level [7]. Currently, most mainstream research focuses on file-level defect prediction.

File-level defect prediction methods are mainly used to predict whether code files have defect tendencies. These methods are usually based on the static attributes [13] of code (e.g., Halstead features, McCabe features, CK features, and MOOD features), change attributes [14] (e.g., process characteristics, organizational structure, code ownership, number of code revisions, change entropy, and developer characteristics) and structural-semantic features extracted by deep neural networks [15, 16, 17]. Most existing file-level defect prediction methods perform model training and evaluation within the same project. However, in actual application scenarios, it is not easy to obtain enough training samples for new projects. Therefore, cross-project file-level defect prediction approaches [18, 19, 20] have been proposed to address data scarcity in target projects. These methods leverage high-quality datasets collected from other projects to construct a defect prediction model for the target project. Whether in defect prediction scenarios within or across projects, file-level defect prediction has been confirmed by most studies to achieve notable performance. However, file-level defect prediction is still a coarse-grained prediction, which makes it challenging to help developers find defects in actual application scenarios. The main reason is that file-level defect prediction can only predict the defect tendency of code files. Developers still need to spend additional time traversing files and identifying defective lines, which affects their efficiency in locating and fixing defects.

Figure 2: An overview of our BAFLineDP framework.

Compared with file-level defect prediction, line-level defect prediction can discover more fine-grained defects and help developers locate defects more accurately and efficiently. A survey by Wan et al. [21] found that software practitioners are more inclined to use fine-grained defect prediction to assist code review work. In recent years, some researchers have tried to utilize various approaches to predict defective code lines [22, 23, 24, 9]. One of the simplest methods is to use static analysis tools to identify defective lines of code based on predefined rules [25]. However, static analysis tools may generate many false positive warnings unrelated to defects. Majd et al. [22] designed 32 statement-level features based on C/C++ and used a long short-term memory network to build a statement-level defect prediction model. However, these statement-level features require manual knowledge extraction by experts and can only capture the static attributes of code statements. Ray et al. [26] and Wang et al. [24] employed the n-gram model to capture surrounding code tokens and discern defective lines by identifying unnatural tokens. However, the n-gram model can only capture surrounding tokens within a limited length, which is insufficient to generate effective code context features. Recently, Pornprasit et al. [11] proposed a line-level defect prediction method called DeepLineDP, which automatically extracts semantic features from tokens in lines to predict file-level defects and identifies defect-prone code lines through the attention of code tokens. However, DeepLineDP's calculation of token attention only focuses on the information of the current code line, which makes it difficult to capture the contextual semantics and local interactions between code lines effectively.

Although the above approaches are better than most existing defect prediction methods [25, 22, 26] in line-level defect prediction tasks, they lack a comprehensive consideration of both global and local information at the code line level. The methodology presented in this paper comprehensively considers the line-level context and local interactions among code lines, enabling deep mining of code feature representations.

III Methodology

In this section, we present BAFLineDP, a line-level defect prediction approach based on a bilinear attention fusion framework. Figure 2 provides a holistic overview of our BAFLineDP framework, which encompasses four steps: (1) Code Preprocessing and Embedding; (2) Line-Level Context Extraction; (3) Bilinear Attention Fusion Feature Construction; and (4) File-Level and Line-Level Defect Prediction.

Figure 3: The process of feature construction with bilinear attention fusion.

III-A Code Preprocessing and Embedding

Code preprocessing [27] plays a pivotal role within the realm of deep learning. It serves as a critical step in enhancing the robustness and stability of models by standardizing and normalizing the source code. By eliminating redundant information that holds no relevance to code logic, the impact of noise on the model is reduced, allowing the model to extract code semantics and syntactic features more effectively. Therefore, we performed code preprocessing for each code file. Based on the investigation by Rahman et al. [28], we systematically removed all special characters (e.g., ( ) , . : ; ' ! " and spaces) since these special characters introduce undesirable noise to the prediction model. Blank lines were also removed because they make no substantive contribution to the code's behavior [29]. Eliminating such extraneous information allows the model to maintain its focus on learning the actual content of the code, free from interference by irrelevant factors. In addition, to enhance the model's generalization capability, we introduce the generic tokens <str> and <num> to replace constant strings and numeric literals within the source code. This substitution ensures the model does not generate independent representations for these elements, rendering it more adaptable to diverse scenarios and use cases.
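
The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the regular expressions, the exact set of special characters, and the function names `preprocess_line` and `preprocess_file` are assumptions introduced here for illustration.

```python
import re

# Characters treated as noise. The exact set is an assumption based on the
# examples listed in the text.
SPECIAL_CHARS = "(),.:;'!\""

# Simple patterns for double-quoted string literals and numeric literals
# (hypothetical; a real lexer would handle more cases).
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')
NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def preprocess_line(line):
    # Replace literals first so that their contents are not split up later.
    line = STRING_LITERAL.sub(" <str> ", line)
    line = NUMBER_LITERAL.sub(" <num> ", line)
    # Drop special characters; the angle brackets of the generic tokens survive.
    line = "".join(" " if ch in SPECIAL_CHARS else ch for ch in line)
    # Collapse runs of whitespace.
    return " ".join(line.split())


def preprocess_file(source):
    cleaned = (preprocess_line(raw) for raw in source.splitlines())
    # Lines that are blank after cleaning are removed.
    return [line for line in cleaned if line]


example = 'String name = jo.get("name");\n\nint n = name.length() + 42;'
print(preprocess_file(example))
```

Replacing literals before stripping characters is a design choice of this sketch: it keeps punctuation inside string constants from leaking into the token stream.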

In our pursuit of capturing precise semantics related to defective code, we utilize the pre-trained CodeBERT model to encode code lines. CodeBERT [30] is a bimodal pre-trained language model that leverages a multi-layer bidirectional Transformer as the architecture to capture the meaningful semantic relationships among diverse tokens, which helps construct an adequate representation of each source code line. The incorporation of the pre-trained CodeBERT model holds the promise of a deeper understanding of code line semantics and the accurate capture of features associated with defects. Furthermore, to maintain the structural information of the source code, we adopt a sequential structure to depict the organization of code files. In this configuration, each code file is expressed as a sequence of code lines, denoted as $\langle l_{1},l_{2},\cdots,l_{n}\rangle$. For each code line, the pre-trained CodeBERT model generates the corresponding embedded representation.

III-B Line-Level Context Extraction

In this step, we employ the Bi-GRU network to extract line-level contextual features, thereby capturing the global information pertaining to code lines. The Bi-GRU [12] is founded upon a bidirectional recurrent neural network architecture, which affords it the capability to concurrently consider both the preceding and succeeding information within sequences of source code lines. The bidirectional mode ensures the effective modeling of the contextual associations among code lines, culminating in a richer representation of their interdependencies.

Consider a sequence of code lines $V_{l}=[v_{l1},v_{l2},\cdots,v_{ln}]$, where $v_{li}\in R^{1\times d}$ is the vector representation of code line $l_{i}$, $i\in[1,n]$. These representations are encoded by the pre-trained CodeBERT model, where $d$ signifies the output dimension of CodeBERT. Bi-GRU comprises both a forward GRU, denoted as $\overrightarrow{h}_{i}=\overrightarrow{GRU}(v_{li})$, $i\in[1,n]$, and a backward GRU, denoted as $\overleftarrow{h}_{i}=\overleftarrow{GRU}(v_{li})$, $i\in[n,1]$. Through the concatenation of the two hidden states, $\overrightarrow{h}_{i}$ and $\overleftarrow{h}_{i}$, produced by the forward and backward GRU components, we arrive at the final hidden representation of the given $v_{li}$. In other words, the line-level context vector representation can be expressed as $v_{ci}=[\overrightarrow{h}_{i}\oplus\overleftarrow{h}_{i}]$. Consequently, the line-level context sequence is represented as $V_{c}=[v_{c1},v_{c2},\cdots,v_{cn}]$, where $v_{ci}\in R^{1\times d'}$ characterizes the line-level context vector representation of code line $l_{i}$, and $d'$ signifies the output dimension of Bi-GRU.
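
As a rough sketch of this step, the snippet below runs a bidirectional GRU over stand-in line embeddings using PyTorch. The values here are assumptions for illustration: $d=768$ is the hidden size commonly associated with the base CodeBERT checkpoint, the hidden size of 64 per direction follows the experimental details reported later in the paper, and random tensors replace real CodeBERT outputs.

```python
import torch
from torch import nn

d = 768        # assumed CodeBERT output dimension
hidden = 64    # hidden nodes per direction (from the experimental details)
n_lines = 5    # toy file with five code lines

torch.manual_seed(0)
# Stand-in for the line embedding matrix V_l of one file: (batch, n, d).
V_l = torch.randn(1, n_lines, d)

bigru = nn.GRU(input_size=d, hidden_size=hidden, num_layers=1,
               batch_first=True, bidirectional=True)

# V_c[:, i] is the concatenation of the forward and backward hidden states
# for line i, i.e. the line-level context vector v_ci.
V_c, _ = bigru(V_l)

d_prime = V_c.shape[-1]
print(V_c.shape, d_prime)
```

Under these assumptions the context dimension is $d'=2\times 64=128$, because the forward and backward states are concatenated.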

III-C Bilinear Attention Fusion Feature Construction

Source code is inherently complex and structured [31], encompassing not only overall structure and semantic features but also local intricacies and contextual dependencies. Merely considering line-level context information is insufficient for a comprehensive understanding of defective code lines. Hence, to address this limitation, we introduce a bilinear attention mechanism [32] to build a code fusion network, called BAFN. BAFN is designed to amalgamate global and local information by capturing the bilinear interaction attention weights between code lines and their respective line-level contextual information to construct defect code features. The structure of BAFN, as depicted in Figure 3, primarily includes two essential modules: the bilinear interaction module and the bilinear pooling module.

Bilinear Interaction Module. Assume that the code line matrix denoted as $H_{l}\in R^{\theta_{l}\times d}$ and the line-level context matrix denoted as $H_{c}\in R^{\theta_{l}\times d'}$ are constructed by splicing the vectors from the code line sequence $V_{l}$ and the line-level context sequence $V_{c}$, respectively. $\theta_{l}$ represents the total number of code lines. We construct the bilinear interaction matrix $A\in R^{\theta_{l}\times\theta_{l}}$ through the source code line matrix $H_{l}$ and the line-level context matrix $H_{c}$. The calculation formula is as follows:

$$A=((1\cdot q^{T})\circ\sigma(H_{l}^{T}U))\cdot\sigma(M^{T}H_{c})$$ (1)

where $U\in R^{d\times k}$ and $M\in R^{d'\times k}$ are two learnable linear transformation matrices, which are also weight matrices. $k$ represents the dimension of the linear transformation, and $q\in R^{1\times k}$ is a learnable weight vector. The symbol $\circ$ denotes the Hadamard product operation [33], and $\sigma(\cdot)$ signifies the ReLU activation function [34]. Each element within the bilinear interaction attention matrix $A$ encapsulates the extent of interaction between a source code line and its corresponding line-level context, further presenting the local associations between potential defects and code lines. In addition, to prevent the model from excessively focusing on extraneous noise information, we incorporate the mask into the interaction attention map $A$ to enhance the overall model performance.
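
One way to read Eq. (1) in code is sketched below with NumPy. The layout is an assumption of this sketch: rows of `H_l` and `H_c` are code lines, so the projections are written as `H_l @ U` and `H_c @ M`, and the second factor is transposed to obtain a $\theta_{l}\times\theta_{l}$ map. The mask mentioned above is omitted, and all matrices are random stand-ins.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def bilinear_interaction(H_l, H_c, U, M, q):
    """Return the (n x n) interaction map A for one attention head."""
    # Broadcasting q over rows plays the role of the (1 . q^T) Hadamard product.
    left = relu(H_l @ U) * q      # (n, k)
    right = relu(H_c @ M)         # (n, k)
    # A[i, j] = sum_t q[t] * relu(H_l U)[i, t] * relu(H_c M)[j, t]
    return left @ right.T


rng = np.random.default_rng(0)
n, d, d_ctx, k = 4, 8, 6, 3      # toy sizes, not the paper's settings
H_l = rng.normal(size=(n, d))
H_c = rng.normal(size=(n, d_ctx))
U = rng.normal(size=(d, k))
M = rng.normal(size=(d_ctx, k))
q = rng.normal(size=k)

A = bilinear_interaction(H_l, H_c, U, M, q)
print(A.shape)
```

Entry `A[i, j]` then scores how strongly code line `i` interacts with the context of line `j`; the diagonal pairs each line with its own context.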

TABLE I: An Overview of the Experimental Datasets

| Project | Description | Release Versions | #Files | #LOC | %Defective Files | %Defective Lines |
|---|---|---|---|---|---|---|
| ActiveMQ | Messaging and integration patterns | 5.0.0, 5.1.0, 5.2.0, 5.3.0, 5.8.0 | 1,884-3,420 | 142k-299k | 2%-7% | 0.08%-0.44% |
| Camel | Enterprise integration framework | 1.4.0, 2.9.0, 2.10.0, 2.11.0 | 1,515-8,846 | 75k-485k | 2%-8% | 0.09%-0.24% |
| Derby | Relational database | 10.2.1.6, 10.3.1.4, 10.5.1.1 | 1,963-2,705 | 412k-533k | 6%-28% | 0.10%-0.63% |
| Groovy | Java-syntax-compatible OOP | 1.5.7, 1.6.0.Beta1, 1.6.0.Beta2 | 757-884 | 74k-93k | 2%-4% | 0.10%-0.17% |
| HBase | Distributed scalable data store | 0.94.0, 0.95.0, 0.95.2 | 1,059-1,834 | 246k-537k | 7%-11% | 0.17%-1.02% |
| Hive | Data warehouse system for hadoop | 0.9.0, 0.10.0, 0.12.0 | 1,416-2,662 | 290k-567k | 6%-19% | 0.31%-2.90% |
| JRuby | Ruby programming lang for JVM | 1.1, 1.4, 1.5, 1.7 | 731-1,614 | 106k-240k | 2%-13% | 0.03%-0.09% |
| Lucene | Text search engine library | 2.3.0, 2.9.0, 3.0.0, 3.1.0 | 805-2,806 | 101k-342k | 2%-8% | 0.07%-0.39% |
| Wicket | Web application framework | 1.3.0.beta1, 1.3.0.beta2, 1.5.3 | 1,672-2,578 | 106k-165k | 2%-16% | 0.05%-0.46% |

Bilinear Pooling Module. To efficiently amalgamate global and local information from the code lines and create a high-level representation of defect code, denoted as $f''\in R^{1\times k}$, we employ a bilinear pooling module. This module is instrumental in extracting important features from the code. The calculation formula for $f''$ is as follows:

$$f''=(H_{l}^{T}U)^{T}\cdot A\cdot(M^{T}H_{c})$$ (2)

Notably, there are no new learnable parameters introduced in this layer. Instead, the bilinear pooling module and the bilinear interaction module share the weight matrices $U$ and $M$, thereby diminishing the number of model parameters. Furthermore, we add a sum pooling operation, grounded in the fused feature representation, to obtain a more compact defect code feature representation, denoted as $f'\in R^{1\times k/s}$. The calculation formula for $f'$ is as follows:

$$f'=SumPool(f'',s)$$ (3)

where the $SumPool(\cdot)$ function [35] is a one-dimensional non-overlapping sum pooling operation with stride $s$, which reduces the dimensionality of $f''$ from $k$ to $k/s$, allowing for the extraction of crucial features. For the pooling operation, we set the stride $s=3$.

Further, we extend the single-head bilinear interaction attention matrix into a double-head form. The final defect code fusion feature vector $f\in R^{1\times k/s}$ is a combination of two heads: $f_{1}'\in R^{1\times k/s}$ and $f_{2}'\in R^{1\times k/s}$. The calculation formula for $f$ is as follows:

$$f=f_{1}'\oplus f_{2}'$$ (4)

Because the weight matrices $U$ and $M$ are shared across the two modules, introducing an additional head only requires adding a new weight vector $q$, which is parameter-efficient. Through our experiments, we find that incorporating a double-head interaction mechanism yields better performance when compared to the single-head interaction approach.
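
Equations (2) to (4) can be sketched as follows under stated assumptions. Rows of `H_l` and `H_c` are taken to be code lines, each of the $k$ entries of $f''$ is computed as a bilinear form over the matching columns of the two projections, and the two heads are combined by element-wise summation so that the result keeps the stated size $k/s$ (concatenation would instead give $2k/s$). The attention maps are random stand-ins and the helper names are hypothetical.

```python
import numpy as np


def bilinear_pool(H_l, H_c, U, M, A):
    """Eq. (2): f''[t] = (H_l U)[:, t]^T . A . (H_c M)[:, t] for each t < k."""
    P = H_l @ U                   # (n, k)
    Q = H_c @ M                   # (n, k)
    return np.einsum("ik,ij,jk->k", P, A, Q)


def sum_pool(f, s):
    """Eq. (3): non-overlapping 1-D sum pooling; len(f) must be divisible by s."""
    return f.reshape(-1, s).sum(axis=1)


rng = np.random.default_rng(0)
n, d, d_ctx, k, s = 4, 8, 6, 6, 3    # toy sizes; s = 3 as in the text
H_l = rng.normal(size=(n, d))
H_c = rng.normal(size=(n, d_ctx))
U = rng.normal(size=(d, k))
M = rng.normal(size=(d_ctx, k))

# Random stand-ins for the attention maps of the two heads. In the framework
# they differ only through their weight vectors q; U and M are shared.
A_heads = [rng.random((n, n)) for _ in range(2)]

pooled = [sum_pool(bilinear_pool(H_l, H_c, U, M, A), s) for A in A_heads]

# Eq. (4): combining the heads by summation is an assumption of this sketch.
f = pooled[0] + pooled[1]
print(f.shape)
```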

III-D File-Level and Line-Level Defect Prediction

Via the BAFN, we derive a high-level vector representation $f$ of the source code file, which serves as a feature for performing file-level defect prediction. We employ a single-layer fully connected network as the prediction layer and utilize the source code file vector as input, leading to the generation of the defect prediction probability, denoted as $p$. The calculation formula is provided below:

$$p=Sigmoid(W_{0}f+b_{0})$$ (5)

where $W_{0}$ and $b_{0}$ are learnable weight matrices and bias values, and the $Sigmoid(\cdot)$ function [36] serves to map the prediction score into the $(0,1)$ range.

To identify defect-prone lines of code, we leverage the diagonal elements within the bilinear interaction attention maps to rank the risk associated with each source code line. We first consolidate the bilinear interaction attention maps from a double-head form into a single-head form. Then, we extract the diagonal elements within the consolidated map to assemble the line-level attention map. The attention score on the diagonal of this map is utilized as the final risk coefficient for each code line, which enables the sorting of all the code lines within the source code file. This sorting process allows us to pinpoint the most defect-prone code lines, ultimately achieving line-level defect prediction.
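
The ranking procedure can be illustrated with a small NumPy sketch. How the two heads are consolidated is not specified above, so summation is assumed here; the function name `rank_lines` and the toy attention maps are likewise hypothetical.

```python
import numpy as np


def rank_lines(A_heads):
    """Rank code lines by the diagonal of the consolidated attention map."""
    # Consolidate the heads into a single map (summation is an assumption).
    A = np.sum(A_heads, axis=0)
    # The diagonal pairs each code line with its own line-level context.
    risk = np.diag(A)
    # Indices of lines from most to least risky.
    order = np.argsort(-risk, kind="stable")
    return risk, order


A1 = np.array([[0.1, 0.3, 0.0],
               [0.2, 0.5, 0.1],
               [0.0, 0.4, 0.2]])
A2 = np.array([[0.2, 0.1, 0.1],
               [0.0, 0.1, 0.3],
               [0.3, 0.0, 0.6]])

risk, order = rank_lines([A1, A2])
print(order)
```

In this toy case the third line receives the highest diagonal score and would be inspected first.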

IV Experiment Design and Results

In this section, we present the research design and experimental results, including the experiment datasets, evaluation metrics, baseline methods, experimental details, and analysis of results for each research question.

IV-A Experiment Datasets

In our study, we utilize the line-level defect prediction datasets collected by Wattanakriengkrai et al. [9], encompassing 9 open-source projects and 32 software releases. We downloaded each project version from official websites for analysis, as detailed in Table I. This table includes project names, descriptions, release versions, file counts, line counts, file defect rates, and line defect rates. The datasets feature projects with a wide range of sizes and defect rates, with file counts varying from 731 to 8,846, and line counts from 74k to 567k. File defect rates span 2% to 28%, and line defect rates range from 0.03% to 2.90%. For detailed steps on collecting file-level and line-level ground truth, please refer to [11]. Our experimental approach involves training on the first release of each project, validation on the second, and testing on subsequent versions. This results in 14 distinct tasks for within-project defect prediction (WPDP) and 112 for cross-project defect prediction (CPDP).
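
The task counts can be checked with a short calculation from the release counts in Table I. The WPDP count follows directly from the train/validation/test split. The CPDP pairing used below (each test release evaluated with models from each of the other eight projects) is an assumption that happens to reproduce the reported figure.

```python
# Number of releases per project, read off Table I.
releases = {
    "ActiveMQ": 5, "Camel": 4, "Derby": 3, "Groovy": 3, "HBase": 3,
    "Hive": 3, "JRuby": 4, "Lucene": 4, "Wicket": 3,
}

# The first release is used for training and the second for validation,
# so every remaining release is a test release.
test_releases = {project: n - 2 for project, n in releases.items()}

wpdp_tasks = sum(test_releases.values())

# Assumed CPDP pairing: every test release is predicted by the models
# trained on each of the other projects.
other_projects = len(releases) - 1
cpdp_tasks = sum(t * other_projects for t in test_releases.values())

print(sum(releases.values()), wpdp_tasks, cpdp_tasks)
```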

IV-B Evaluation Metrics

We adopt the classification metrics Area Under Curve (AUC) [37], Balanced Accuracy (BA) [38], and Matthews Correlation Coefficient (MCC) [39] to evaluate the file-level defect prediction task, and the workload-aware metrics Recall@Top20%LOC [11] and Effort@Top20%Recall [11] to evaluate the line-level defect prediction task.

AUC measures the performance of binary classification models. The AUC metric is unaffected by class distribution and, therefore, has better evaluation performance in the case of class imbalance.

BA is the average of the true positive rate and the true negative rate. A high balanced accuracy indicates that the method can accurately predict both defective and clean instances.

MCC is used to evaluate the accuracy and stability of model predictions. MCC ranges from -1 to 1, where 1 means the predictions are perfect, 0 means the predictions are irrelevant to the actual outcome, and -1 means the predictions are completely opposite.
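
For concreteness, the three file-level metrics can be computed as in the sketch below, which uses their standard textbook definitions in plain Python. The function names and the toy inputs are illustrative and not taken from the paper's implementation.

```python
import math


def auc(scores, labels):
    """Probability that a defective file is scored above a clean one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn)    # true positive rate
    tnr = tn / (tn + fp)    # true negative rate
    return (tpr + tnr) / 2


def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Returning 0 when the denominator vanishes is a common convention.
    return (tp * tn - fp * fn) / denom if denom else 0.0


print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(balanced_accuracy(tp=8, fp=10, tn=80, fn=2))
print(mcc(tp=8, fp=10, tn=80, fn=2))
```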

Recall@Top20%LOC is a metric utilized to gauge the efficacy of identifying defective lines within the top 20% LOC in a software release. A high value of Recall@Top20%LOC signifies the method’s proficiency in uncovering numerous actual defective lines with limited effort, whereas a low value of Recall@Top20%LOC indicates that developers need to spend more effort to detect defective lines.

Effort@Top20%Recall is a metric assessing the amount of effort (i.e., LOC) required to identify the top 20% of actual defective lines in a software release. A low Effort@Top20%Recall value suggests minimal effort is required by developers to pinpoint the top 20% of actual defective lines, while a high Effort@Top20%Recall value implies a greater effort is needed for the same task.
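
The two effort-aware metrics can be sketched as follows for a single ranked list of lines. The input format (a list of 0/1 defect flags ordered from most to least risky), the rounding choices, and the function names are assumptions of this sketch; see [11] for the authoritative definitions.

```python
def recall_at_top20_loc(ranked_labels):
    """Share of defective lines found within the top 20% of ranked lines."""
    total = sum(ranked_labels)
    if total == 0:
        raise ValueError("at least one defective line is required")
    budget = len(ranked_labels) // 5           # 20% of LOC, rounded down
    return sum(ranked_labels[:budget]) / total


def effort_at_top20_recall(ranked_labels):
    """Share of LOC inspected until 20% of defective lines are found."""
    total = sum(ranked_labels)
    if total == 0:
        raise ValueError("at least one defective line is required")
    target = (total + 4) // 5                  # 20% of defects, rounded up
    found = 0
    for inspected, flag in enumerate(ranked_labels, start=1):
        found += flag
        if found >= target:
            return inspected / len(ranked_labels)


# 20 lines ordered by predicted risk; 1 marks an actually defective line.
ranked = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(recall_at_top20_loc(ranked), effort_at_top20_recall(ranked))
```

In this toy ranking, two of the five defective lines fall within the first four lines, and the first defective line is reached after inspecting two of the twenty lines.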

IV-C Baseline Methods

To evaluate the performance of our BAFLineDP approach, we compare it with current advanced defect prediction methods. Below, we present four file-level defect prediction methods [40, 41, 42, 43]:

  • DBN, a deep belief network, automatically extracts semantic and structural code features to predict defective code files. Adhering to the experimental setup outlined by Wang et al. [40], our configuration comprises a batch size of 32, 10 hidden layers with 100 nodes each, an embedding dimension of 50, a learning rate of 0.01, and 200 training epochs for each hidden layer.

  • CNN is a convolutional neural network that automatically learns code semantics and context to predict defective code files. Aligning with the experimental setup of Li et al. [41], our configuration includes a batch size of 32, an embedding dimension of 50, a filter length of 5, 100 filters, a learning rate of 0.001, and 10 training epochs.

  • BoW (also known as Bag-of-Words), which leverages code token frequencies as features for source code files, is employed to predict defective code files. We align with the experimental deployment of Hata et al. [42], incorporating the SMOTE technique to address the class imbalance within training data. A logistic regression classifier is trained on code token frequency to construct the BoW model.

  • Bi-LSTM, a bidirectional long short-term memory network, is utilized to incorporate both past and future code information, enabling the automatic learning of semantic and syntactic features for predicting defective code files. We follow the experimental configuration introduced by Dam et al. [43], featuring a batch size of 16, an embedding dimension of 50, 64 nodes in the hidden layer, a learning rate of 0.001, and 50 training epochs.

Subsequently, we introduce three advanced line-level defect prediction approaches [11, 44, 45]:

  • DeepLineDP, the state-of-the-art line-level defect prediction method, leverages a hierarchical Bi-GRU network to automatically capture context from surrounding tokens and code lines, enabling the extraction of code semantic features for file-level defect prediction. In addition, it incorporates a token-level attention layer to identify defective lines based on key tokens instrumental in file-level defect prediction. Our configuration, in line with the experimental setup of Pornprasit et al. [11], involves a batch size of 32, a learning rate of 0.001, an embedding dimension of 50, a single hidden layer with 64 nodes, a dropout ratio of 0.2, and 10 training epochs.

  • ErrorProne, an open-source tool by Google, leverages the Java compiler for static analysis and error detection in Java code, utilizing a set of error-prone rules to identify and rectify errors. By emulating developers’ sequential reading, ErrorProne analyzes code files top-down to pinpoint potentially defective lines. Adhering to the usage guide by Aftandilian et al. [44], we invoke the ErrorProne plug-in within a Java8 compilation environment for line-level defect prediction on the test code files.

  • N-gram is employed to assess the unnaturalness of code tokens by computing the entropy score of each token. Research conducted by Ray et al. [26] reveals that defect-prone code tends to exhibit higher entropy values, indicating increased unnaturalness. Therefore, N-gram has predictive utility in identifying defective code lines. Following the experimental setup outlined by Hellendoorn et al. [45], we train a cache-based N-gram model using clean code files and then perform line-level defect prediction on defective code files. By calculating and ranking the average entropy score of tokens within each line, the most defect-prone code lines can be found.

It should be noted that DeepLineDP is categorized as a line-level defect prediction method due to its hierarchical approach to constructing code features, which incorporates information from surrounding code lines. This design enhances the method’s accuracy in identifying line-level defects, exhibiting state-of-the-art (SOTA) performance. However, we also consider involving DeepLineDP in file-level defect prediction for a comprehensive comparison with our BAFLineDP.

IV-D Experimental Details

Our BAFLineDP model is implemented using PyTorch, a deep learning framework, and executed on a server equipped with an NVIDIA RTX 3090 GPU with 24 GB memory.

Regarding the experimental strategy, we employ binary cross-entropy as the loss function, utilize the Adam optimizer [46], and incorporate random dropout (rate of 0.2) and normalization to prevent model overfitting. Additionally, we address dataset class imbalance through weighted loss.
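
A weighted binary cross-entropy loss of the kind mentioned above can be sketched in plain Python. The weighting scheme is not specified in the text, so inverse class frequency is assumed here, and the function names are hypothetical.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights (an assumed scheme); needs both classes present."""
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    return n / (2 * pos), n / (2 * neg)


def weighted_bce(probs, labels, w_pos, w_neg):
    """Mean binary cross-entropy with separate weights for each class."""
    eps = 1e-12                                # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p)
    return total / len(labels)


labels = [1, 0, 0, 0]                          # one defective file in four
w_pos, w_neg = class_weights(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, w_pos, w_neg)
print(w_pos, w_neg, loss)
```

With these weights the single defective file contributes as much to the loss as the three clean files combined, which is the intended effect of reweighting under class imbalance.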

In terms of experimental parameters, we set the batch size to 16 and the learning rate to 0.001, and train the model for 10 epochs. The maximum input length of the pre-trained CodeBERT model is 75. The Bi-GRU network is configured with one hidden layer of 64 nodes. The output dimension of the BAFN is 256, and the convolution kernel size of the bilinear pooling layer is 3.

Regarding model selection, we take the trained model that attains the highest AUC value on the validation set as the final evaluation model for both file-level and line-level defect prediction tasks.
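
The selection rule can be sketched as follows, assuming one candidate model per training epoch and that each candidate's predicted probabilities on the validation set are available. The helper names (auc, select_best_epoch) and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the probability that a random defective file outranks a clean one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A tie between a defective and a clean file counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best_epoch(val_scores_per_epoch, val_labels):
    """Return (epoch_index, auc) for the epoch with the highest validation AUC."""
    aucs = [auc(scores, val_labels) for scores in val_scores_per_epoch]
    best = max(range(len(aucs)), key=lambda e: aucs[e])  # earliest epoch on ties
    return best, aucs[best]


# Toy validation set: two defective (1) and two clean (0) files, two epochs.
val_labels = [1, 0, 1, 0]
val_scores = [
    [0.6, 0.7, 0.4, 0.3],  # epoch 0: ranking partly wrong
    [0.9, 0.2, 0.8, 0.1],  # epoch 1: defective files ranked on top
]
best_epoch, best_auc = select_best_epoch(val_scores, val_labels)
```

The pairwise AUC above is quadratic in the number of files, which is fine for illustration; a rank-based implementation would be preferable for large validation sets.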

IV-E Statistical Test

Given the potential for certain datasets to yield models that exhibit significant performance discrepancies, we employ the Scott-Knott Effect Size Difference (Scott-Knott ESD) test [47] to compare the performance of different methods in our experiments. The Scott-Knott ESD test, a mean comparison technique, uses hierarchical clustering to partition measurements (e.g., AUC) into statistically distinct groups with non-negligible effect size differences. It comprises two steps: (1) correcting the non-normal distribution of the input data and (2) merging any two statistically distinct groups whose effect size difference is negligible into a single group. The rankings generated by the Scott-Knott ESD test ensure that (1) the differences in distribution magnitudes within each rank are negligible and (2) the differences in distribution magnitudes between ranks are non-negligible. For a detailed description of the Scott-Knott ESD test, please refer to [47].
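
The merging step hinges on deciding whether an effect size is negligible. The sketch below shows only that check, assuming Cohen's d with the conventional threshold of 0.2 for a negligible magnitude; it is not an implementation of the full Scott-Knott ESD procedure (normality correction and hierarchical clustering are omitted), and the AUC samples are made-up values.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (requires non-zero variance)."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def is_negligible(a, b, threshold=0.2):
    """True when the two samples would be merged into one rank under this check."""
    return abs(cohens_d(a, b)) < threshold


# Hypothetical AUC samples for three methods.
method_a = [0.80, 0.79, 0.81, 0.80]
method_b = [0.80, 0.81, 0.79, 0.80]
method_c = [0.60, 0.61, 0.59, 0.60]

same_rank = is_negligible(method_a, method_b)
different_rank = not is_negligible(method_a, method_c)
```

Here method_a and method_b would share a rank, while method_c would fall into a separate one.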

IV-F Research Questions and Analysis

To assess the efficacy of our BAFLineDP approach, we conduct both file-level and line-level defect prediction in WPDP and CPDP scenarios, respectively, and compare with current advanced defect prediction techniques. Below, we present the approaches and results of two research questions (RQs).

(RQ1) What is the performance and cost-effectiveness of BAFLineDP in the WPDP scenario?

Motivation. WPDP aims to identify and rectify software defects in their early stages, optimizing Software Quality Assurance (SQA) resource allocation and enhancing software quality. File-level and line-level defect prediction are techniques employed to realize these overarching objectives. Existing advanced defect prediction methodologies predominantly center on the file level, with limited attention to line-level approaches. Recently, DeepLineDP, which automatically learns the code hierarchical structure and considers code context, has been proposed to achieve SOTA performance in both file-level and line-level defect prediction within the WPDP scenario. However, this method disregards the importance of the context and local interaction information for code lines within line-level defect prediction. Therefore, we investigate whether BAFLineDP outperforms advanced file-level and line-level defect prediction methods within the WPDP scenario while offering superior cost-effectiveness.

Approach. To answer this question, we selected 14 training-validation-testing task combinations within the WPDP scenario. For instance, training and validation were conducted on datasets such as Hive-0.9.0 and Hive-0.10.0, followed by testing on Hive-0.12.0. Subsequently, we assess the performance of the BAFLineDP method by conducting evaluations in the context of both file-level and line-level defect prediction. In order to provide a comprehensive analysis, we compare BAFLineDP with four advanced file-level defect prediction approaches [40, 41, 42, 43], namely DBN, CNN, BoW, and Bi-LSTM. Additionally, we contrast BAFLineDP with three advanced line-level defect prediction methods [11, 44, 45], which include DeepLineDP, ErrorProne, and N-gram.

To evaluate the performance of these methods, we employ three traditional evaluation metrics (i.e., AUC, MCC, and BA) for file-level defect prediction and two effort-aware evaluation metrics (i.e., Recall@Top20%LOC and Effort@Top20%Recall) for line-level defect prediction. To effectively demonstrate the statistical performance disparities between different approaches, we apply the Scott-Knott ESD test. Figures 4 and 5 present the Scott-Knott ESD rankings and the distributions of corresponding metrics for BAFLineDP and other advanced file-level and line-level defect prediction approaches within the WPDP scenario.
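
For concreteness, the two effort-aware metrics can be sketched as below. The sketch assumes their usual definitions from the line-level defect prediction literature: Recall@Top20%LOC is the share of defective lines found within the top 20% of lines ranked by predicted risk, and Effort@Top20%Recall is the share of lines that must be inspected to find 20% of the defective lines. It also assumes a single pool of ranked lines containing at least one defective line; the function names and toy data are hypothetical.

```python
def ranked_indices(scores):
    """Line indices ordered from highest to lowest predicted risk."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_top_loc(scores, labels, loc_fraction=0.2):
    """Share of defective lines found when inspecting the top fraction of lines."""
    budget = int(len(scores) * loc_fraction)
    found = sum(labels[i] for i in ranked_indices(scores)[:budget])
    return found / sum(labels)


def effort_at_top_recall(scores, labels, recall_fraction=0.2):
    """Share of lines inspected before the target share of defects is found."""
    total_defective = sum(labels)
    found = 0
    for inspected, i in enumerate(ranked_indices(scores), start=1):
        found += labels[i]
        if found / total_defective >= recall_fraction:
            return inspected / len(scores)
    return 1.0


# Ten lines with descending risk scores; 1 marks a defective line.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
good_labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]   # a defect is ranked first
poor_labels = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]   # first defect is ranked fourth

recall_good = recall_at_top_loc(scores, good_labels)
effort_good = effort_at_top_recall(scores, good_labels)
recall_poor = recall_at_top_loc(scores, poor_labels)
effort_poor = effort_at_top_recall(scores, poor_labels)
```

A better ranking raises the first metric and lowers the second, which is why the two are reported with opposite preferred directions.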

Figure 4: (For RQ1) The Scott-Knott ESD rankings and the distributions of (a) AUC (↑), (b) BA (↑), and (c) MCC (↑) of BAFLineDP and other file-level prediction approaches within the WPDP scenario. The higher (↑) the values are, the better the approach is.

Figure 5: (For RQ1) The Scott-Knott ESD rankings and the distributions of (a) Recall@Top20%LOC (↑) and (b) Effort@Top20%Recall (↓) of BAFLineDP and other line-level prediction approaches within the WPDP scenario. The higher (↑) or the lower (↓) the values are, the better the approach is.

Results. In light of the findings presented in Figure 4, our BAFLineDP method demonstrates notably superior performance, as evidenced by its average AUC, BA, and MCC values, which respectively stand at 0.793, 0.655, and 0.212. These values exhibit a significant advantage, ranging from 2% to 34%, 1% to 25%, and 23% to 1016% when compared to those of other file-level defect prediction approaches. The outcome indicates the superiority of our BAFLineDP over existing file-level defect prediction methods. In addition, the Scott-Knott ESD test also confirmed that BAFLineDP consistently ranks among the best in terms of AUC, BA, and MCC, which shows that the performance difference is statistically significant with a non-negligible effect size.

Notably, BAFLineDP’s median BA is slightly lower than DeepLineDP’s. However, BA assesses how precisely a binary classification at a fixed threshold (e.g., 0.5) separates clean files from defective files, whereas in real-world software development, developers often favor a probabilistic ranking of defective files. When files are ranked by predicted probability, the superior performance of BAFLineDP is confirmed by AUC, which assesses the ability to discriminate between defective and clean files.
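
The threshold dependence of BA and MCC can be seen in a small sketch. The probabilities below are made-up values chosen so that every defective file is ranked above every clean file, yet all predicted probabilities fall below 0.5; the helper names are hypothetical.

```python
import math


def confusion(probs, labels, threshold=0.5):
    """Return (tp, fp, tn, fn) for a binary decision at the given threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of the true positive rate and the true negative rate."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


probs = [0.45, 0.40, 0.30, 0.20]   # a perfect ranking, but all below 0.5
labels = [1, 1, 0, 0]

ba_default = balanced_accuracy(*confusion(probs, labels, threshold=0.5))
ba_tuned = balanced_accuracy(*confusion(probs, labels, threshold=0.35))
mcc_default = mcc(*confusion(probs, labels, threshold=0.5))
mcc_tuned = mcc(*confusion(probs, labels, threshold=0.35))
```

The ranking is identical in both cases, so a ranking metric such as AUC is unaffected, while BA and MCC change with the chosen threshold.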

Based on the outcomes illustrated in Figure 5, our BAFLineDP method exhibits an average Recall@Top20%LOC value of 0.395 and an average Effort@Top20%Recall value of 0.2. Compared to other line-level defect prediction approaches, BAFLineDP demonstrates a notable enhancement in cost-effectiveness, boasting an improvement of 19%-42% in accurately identifying defective code lines with a fixed 20% of the overall effort. Moreover, from a cost perspective, there is a reduction ranging from 10% to 51%. These findings firmly establish the superiority of BAFLineDP over existing line-level defect prediction approaches. Furthermore, the Scott-Knott ESD test also affirms that BAFLineDP consistently ranks among the top performers in terms of Recall@Top20%LOC and Effort@Top20%Recall, which shows that the performance difference is statistically significant with non-negligible effect size.

Answer to RQ1: The experimental findings within the WPDP scenario illustrate the superior performance of BAFLineDP in both file-level and line-level defect prediction compared to existing defect prediction approaches, achieving better cost-effectiveness and lower cost overhead.

Figure 6: (For RQ2) The Scott-Knott ESD rankings and the distributions of (a) AUC (↑), (b) BA (↑), and (c) MCC (↑) of BAFLineDP and other file-level prediction approaches within the CPDP scenario. The higher (↑) the values are, the better the approach is.

Figure 7: (For RQ2) The Scott-Knott ESD rankings and the distributions of (a) Recall@Top20%LOC (↑) and (b) Effort@Top20%Recall (↓) of BAFLineDP and other line-level prediction approaches within the CPDP scenario. The higher (↑) or the lower (↓) the values are, the better the approach is.

(RQ2) How does BAFLineDP perform in both performance and cost-effectiveness within the CPDP scenario?

Motivation. The majority of existing defect prediction methods are primarily assessed within the WPDP scenario, yielding favorable performance. However, practical applications often encounter challenges such as limited availability of training samples for new projects and substantial variations in project structures and complexities [48]. Thus, although certain approaches excel in the WPDP scenario, their performance may not translate equally well to the CPDP scenario. Consequently, we focus on evaluating the performance and cost-effectiveness of BAFLineDP within the CPDP scenario while conducting a comparative analysis against other advanced file-level and line-level defect prediction methods.

Approach. To address this question, we selected 112 training-validation-testing task combinations within the CPDP scenario. For instance, we utilize Hive-0.9.0 and Hive-0.10.0 for training and validation, respectively, while testing is conducted on JRuby-1.5.0. Similar to WPDP, we evaluate the performance of BAFLineDP in file-level and line-level defect prediction using three traditional metrics (i.e., AUC, BA, and MCC) and two effort-aware metrics (i.e., Recall@Top20%LOC and Effort@Top20%Recall). Furthermore, we compare BAFLineDP against four file-level defect prediction methods (i.e., DBN, CNN, BoW, and Bi-LSTM) and three line-level defect prediction approaches (i.e., DeepLineDP, ErrorProne, and N-gram).
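
The task combinations can be enumerated with a short sketch. It assumes, based on the Hive example, that each project's first release is used for training, its second for validation, and every later release as a test target; this split is an inference, not a stated rule. The "Toy" project and its release names are hypothetical placeholders.

```python
def split_releases(releases):
    """Assumed split: first release trains, second validates, the rest are tests."""
    return releases[0], releases[1], releases[2:]


def wpdp_tasks(projects):
    """Train, validate, and test within the same project."""
    tasks = []
    for releases in projects.values():
        train, val, tests = split_releases(releases)
        tasks += [(train, val, test) for test in tests]
    return tasks


def cpdp_tasks(projects):
    """Train and validate on one project, test on releases of every other project."""
    tasks = []
    for source, source_releases in projects.items():
        train, val, _ = split_releases(source_releases)
        for target, target_releases in projects.items():
            if target != source:
                _, _, tests = split_releases(target_releases)
                tasks += [(train, val, test) for test in tests]
    return tasks


projects = {
    "Hive": ["Hive-0.9.0", "Hive-0.10.0", "Hive-0.12.0"],
    "Toy": ["Toy-1.0", "Toy-2.0", "Toy-3.0", "Toy-4.0"],  # hypothetical project
}
within = wpdp_tasks(projects)
cross = cpdp_tasks(projects)
```

Under this assumed scheme, 9 projects with 32 releases give 32 - 2 × 9 = 14 within-project tasks and 14 × 8 = 112 cross-project tasks, which agrees with the counts reported for the two scenarios.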

It is important to note that ErrorProne, an open-source static code analysis plug-in, is unaffected by training data in line-level defect prediction. Its performance solely relies on customized code rules. Consequently, the performance results of ErrorProne in CPDP line-level defect prediction align with those observed in WPDP line-level defect prediction.

To provide a comprehensive and statistically significant analysis of performance differences among the approaches, we again employ the Scott-Knott ESD test. Figures 6 and 7 illustrate the Scott-Knott ESD rankings and the distributions of corresponding indicators for BAFLineDP and other file-level and line-level defect prediction methods within the CPDP scenario.

Results. According to the results in Figure 6, the average AUC, BA, and MCC values of BAFLineDP are 0.789, 0.647, and 0.212, respectively, outperforming other file-level defect prediction methods by 7%-46%, 5%-32%, and 91%-2027%. These results establish BAFLineDP’s superiority in the CPDP scenario. Additionally, the Scott-Knott ESD test confirms that BAFLineDP consistently ranks among the top performers in AUC, BA, and MCC, which indicates that the performance difference is statistically significant with a non-negligible effect size.

As shown in Figure 7, the average Recall@Top20%LOC and Effort@Top20%Recall of BAFLineDP are 0.397 and 0.201, respectively. Compared to other line-level defect prediction methods, BAFLineDP exhibits a 12% to 42% improvement in identifying defective code lines using only 20% of the overall effort and a 14% to 51% reduction in costs. These findings position BAFLineDP as a superior choice over existing line-level defect prediction methods within the CPDP scenario. The Scott-Knott ESD test further confirms BAFLineDP’s top performance in Recall@Top20%LOC and Effort@Top20%Recall, indicating a statistically significant difference with a non-negligible effect size.

Answer to RQ2: The experimental results illustrate that within the CPDP scenario, BAFLineDP outperforms existing defect prediction methods in both file-level and line-level defect prediction, while exhibiting higher cost-effectiveness and incurring lower cost overhead. These outcomes further reaffirm the effectiveness of BAFLineDP.

TABLE II: Ablation Study Results of BAFLineDP

Approach        WPDP (Line-Level)                                           CPDP (Line-Level)
                Recall@Top20%LOC (↑)          Effort@Top20%Recall (↓)       Recall@Top20%LOC (↑)          Effort@Top20%Recall (↓)
                Abl.      Diff.               Abl.      Diff.               Abl.      Diff.               Abl.      Diff.
w/o CodeBERT    0.346     -12.4%              0.222     +11.0%              0.336     -15.4%              0.229     +13.9%
w/o Bi-GRU      0.365     -7.6%               0.211     +5.5%               0.382     -3.8%               0.203     +1.0%
w/o BAFN        0.363     -8.1%               0.255     +27.5%              0.343     -13.6%              0.260     +29.4%
BAFLineDP       0.395     -                   0.200     -                   0.397     -                   0.201     -

V Discussion

V-A Effects of the Pivotal Modules within the BAFLineDP

We conducted an ablation study to isolate the contributions of the key modules within the BAFLineDP framework and their impact on overall efficacy. We examine three components, namely CodeBERT, Bi-GRU, and BAFN, which we believe are central to capturing source code line defect semantics, global line-level context, and local interaction information of code lines, respectively. We focus the ablation study on line-level defect prediction, as it has greater practical applicability than file-level defect prediction.

In the ablation study, we observe the changes in the average Recall@Top20%LOC and Effort@Top20%Recall resulting from the removal or replacement of specific components in both the WPDP and CPDP scenarios. The results are reported in Table II, where Abl. denotes the performance of the ablated variant of BAFLineDP and Diff. denotes its relative change in performance compared with the full BAFLineDP.
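
The Diff. column is consistent with a plain relative change with respect to the full model, as the following sketch checks against values from Table II. The function name relative_diff is hypothetical.

```python
def relative_diff(ablated, full):
    """Relative change of the ablated variant with respect to the full model, in %."""
    return (ablated - full) / full * 100.0


# WPDP, w/o CodeBERT: Recall@Top20%LOC falls from 0.395 to 0.346.
recall_diff = relative_diff(0.346, 0.395)

# WPDP, w/o BAFN: Effort@Top20%Recall rises from 0.200 to 0.255.
effort_diff = relative_diff(0.255, 0.200)

print(f"{recall_diff:+.1f}% {effort_diff:+.1f}%")
```

A negative Diff. on Recall@Top20%LOC and a positive Diff. on Effort@Top20%Recall both indicate that the ablated variant performs worse than the full model.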

Replacing CodeBERT with Doc2Vec: In our investigation of CodeBERT’s impact, we substitute it with Doc2Vec [49], an unsupervised algorithm adept at acquiring fixed-length feature representations from variable-length textual data, for encoding lines of code. As demonstrated in Table II, after replacing CodeBERT, the cost-effectiveness of line-level defect prediction in WPDP and CPDP dropped by 12.4% and 15.4%, respectively, while the cost overhead increased by 11.0% and 13.9%, respectively.

Removing Bi-GRU: To explore the role of Bi-GRU, we remove it, thereby discontinuing the extraction of contextual information from code lines. As presented in Table II, the removal of Bi-GRU led to a decline in cost-effectiveness for line-level defect prediction in both WPDP and CPDP scenarios, with reductions of 7.6% and 3.8%, respectively, while the cost overhead rose by 5.5% in WPDP and 1% in CPDP.

Removing BAFN: To investigate the significance of BAFN, the component responsible for capturing local interaction information between source code lines and their line-level context, we remove BAFN. As shown in Table II, the removal of BAFN resulted in a marked reduction in cost-effectiveness for line-level defect prediction in WPDP and CPDP scenarios, with drops of 8.1% and 13.6%, respectively. In parallel, the cost overhead increased substantially, by 27.5% in WPDP and 29.4% in CPDP.

In summary, CodeBERT, Bi-GRU, and BAFN each play pivotal roles in capturing essential elements of source code line defect semantics, global line-level context, and local interaction information of code lines. This contribution significantly enhances the cost-effectiveness of SQA resource management, concurrently alleviating the burden on developers by reducing the need for manual inspection of defective code lines and the associated workload.

V-B Threats To Validity

In this section, we describe several threats that may have an impact on the effectiveness of our approach.

V-B1 Implementation of comparison methods

We implemented benchmark methods such as DeepLineDP and ErrorProne using open-source code to reduce the possible impact of incorrect implementations. For methods that do not provide source code (e.g., CNN, Bi-LSTM, BoW, DBN, and N-gram), we strictly follow the implementation details in the relevant papers, but there may still be some deviations.

V-B2 The experimental results may not be generalizable

We conducted experiments on 9 open-source software projects with different sizes and defect rates, which helps our findings generalize and reduces their dependence on any single project. However, we cannot guarantee that our method will achieve the same improvement on other software datasets.

V-B3 The model hyperparameter selection does not consider all options

In our experiments, we try to adjust the hyperparameters of the BAFLineDP model to obtain better defect prediction performance. However, it is impractical to evaluate all possible hyperparameter combinations. We evaluated several hyperparameter combinations within specific ranges based on previous research experience [11].

VI Conclusion

In this paper, we introduce BAFLineDP, a novel line-level defect prediction approach based on a code bilinear attention fusion framework. This methodology can effectively merge source code line semantics, line-level context, and local interaction information between code lines and corresponding line-level context to identify defective code files and lines. An empirical study on within- and cross-project defect prediction on 9 software projects covering 32 versions demonstrates that BAFLineDP outperforms existing file-level and line-level defect prediction methods. Thus, we anticipate that BAFLineDP can help software quality assurance teams find defective lines of code in a cost-effective manner. The data and source code that support the findings of this study are openly available on GitHub at https://github.com/insoft-lab/BAFLineDP.

Acknowledgment

This study was funded partly by Guangdong Natural Science Fund Project (2022A1515110564), Guangzhou Science and Technology Plan Project (202201010312), Key Research and Development Plan of Guangzhou (202206010091, 2023B03J1363), Special Fund for Rural Revitalization Strategy of Guangdong (2023TS-3), and Science and Technology Project of Meizhou City Tobacco Monopoly (202304).

References

  • [1] F. Meng, W. Cheng, and J. Wang, “Semi-supervised software defect prediction model based on tri-training.” KSII Transactions on Internet & Information Systems, vol. 15, no. 11, 2021.
  • [2] M. K. Thota, F. H. Shajin, P. Rajesh et al., “Survey on software defect prediction techniques,” International Journal of Applied Science and Engineering, vol. 17, no. 4, pp. 331–344, 2020.
  • [3] Y. Kamei, S. Matsumoto, A. Monden, K.-i. Matsumoto, B. Adams, and A. E. Hassan, “Revisiting common bug prediction findings using effort-aware models,” in 2010 IEEE international conference on software maintenance.   IEEE, 2010, pp. 1–10.
  • [4] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revisiting code ownership and its relationship with software quality in the scope of modern code review,” in Proceedings of the 38th international conference on software engineering, 2016, pp. 1039–1050.
  • [5] Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, and K.-i. Matsumoto, “The effects of over and under sampling on fault-prone module detection,” in First international symposium on empirical software engineering and measurement (ESEM 2007).   IEEE, 2007, pp. 196–204.
  • [6] T. Mende and R. Koschke, “Effort-aware defect prediction models,” in 2010 14th European Conference on Software Maintenance and Reengineering.   IEEE, 2010, pp. 107–116.
  • [7] H. Hata, O. Mizuno, and T. Kikuno, “Bug prediction based on fine-grained module histories,” in 2012 34th international conference on software engineering (ICSE).   IEEE, 2012, pp. 200–210.
  • [8] L. Pascarella, F. Palomba, and A. Bacchelli, “Fine-grained just-in-time defect prediction,” Journal of Systems and Software, vol. 150, pp. 22–36, 2019.
  • [9] S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata, and K. Matsumoto, “Predicting defective lines using a model-agnostic technique,” IEEE Transactions on Software Engineering, vol. 48, no. 5, pp. 1480–1496, 2020.
  • [10] T. Zhang, Q. Du, J. Xu, J. Li, and X. Li, “Software defect prediction and localization with attention-based models and ensemble learning,” in 2020 27th Asia-Pacific Software Engineering Conference (APSEC).   IEEE, 2020, pp. 81–90.
  • [11] C. Pornprasit and C. K. Tantithamthavorn, “Deeplinedp: Towards a deep learning approach for line-level defect prediction,” IEEE Transactions on Software Engineering, vol. 49, no. 1, pp. 84–98, 2022.
  • [12] Z. Zhu, W. Dai, Y. Hu, and J. Li, “Speech emotion recognition model based on bi-gru and focal loss,” Pattern Recognition Letters, vol. 140, pp. 358–365, 2020.
  • [13] F. Matloob, T. M. Ghazal, N. Taleb, S. Aftab, M. Ahmad, M. A. Khan, S. Abbas, and T. R. Soomro, “Software defect prediction using ensemble learning: A systematic literature review,” IEEE Access, vol. 9, pp. 98 754–98 771, 2021.
  • [14] W. Rhmann, B. Pandey, G. Ansari, and D. K. Pandey, “Software fault prediction based on change metrics using hybrid algorithms: An empirical study,” Journal of King Saud University-Computer and Information Sciences, vol. 32, no. 4, pp. 419–424, 2020.
  • [15] L. Qiao, X. Li, Q. Umer, and P. Guo, “Deep learning based software defect prediction,” Neurocomputing, vol. 385, pp. 100–110, 2020.
  • [16] R. Yedida and T. Menzies, “On the value of oversampling for deep learning in software defect prediction,” IEEE Transactions on Software Engineering, vol. 48, no. 8, pp. 3103–3116, 2021.
  • [17] J. Chen, K. Hu, Y. Yu, Z. Chen, Q. Xuan, Y. Liu, and V. Filkov, “Software visualization and deep transfer learning for effective software defect prediction,” in Proceedings of the ACM/IEEE 42nd international conference on software engineering, 2020, pp. 578–589.
  • [18] J. Deng, L. Lu, S. Qiu, and Y. Ou, “A suitable ast node granularity and multi-kernel transfer convolutional neural network for cross-project defect prediction,” IEEE Access, vol. 8, pp. 66 647–66 661, 2020.
  • [19] Z. Sun, J. Li, H. Sun, and L. He, “Cfps: Collaborative filtering based source projects selection for cross-project defect prediction,” Applied Soft Computing, vol. 99, p. 106940, 2021.
  • [20] Y. Xing, X. Qian, Y. Guan, B. Yang, and Y. Zhang, “Cross-project defect prediction based on g-lstm model,” Pattern Recognition Letters, vol. 160, pp. 50–57, 2022.
  • [21] Z. Wan, X. Xia, A. E. Hassan, D. Lo, J. Yin, and X. Yang, “Perceptions, expectations, and challenges in defect prediction,” IEEE Transactions on Software Engineering, vol. 46, no. 11, pp. 1241–1266, 2018.
  • [22] A. Majd, M. Vahidi-Asl, A. Khalilian, P. Poorsarvi-Tehrani, and H. Haghighi, “Sldeep: Statement-level software defect prediction using deep-learning model on static code features,” Expert Systems with Applications, vol. 147, p. 113156, 2020.
  • [23] C. Pornprasit and C. K. Tantithamthavorn, “Jitline: A simpler, better, faster, finer-grained just-in-time defect prediction,” in 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR).   IEEE, 2021, pp. 369–379.
  • [24] S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan, “Bugram: bug detection with n-gram language models,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 708–719.
  • [25] B. Johnson, Y. Song, E. Murphy-Hill, and R. Bowdidge, “Why don’t software developers use static analysis tools to find bugs?” in 2013 35th International Conference on Software Engineering (ICSE).   IEEE, 2013, pp. 672–681.
  • [26] B. Ray, V. Hellendoorn, S. Godhane, Z. Tu, A. Bacchelli, and P. Devanbu, “On the ‘naturalness’ of buggy code,” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 428–439.
  • [27] D. Singh and B. Singh, “Investigating the impact of data normalization on classification performance,” Applied Soft Computing, vol. 97, p. 105524, 2020.
  • [28] M. Rahman, D. Palani, and P. C. Rigby, “Natural software revisited,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).   IEEE, 2019, pp. 37–48.
  • [29] T. Hoang, H. J. Kang, D. Lo, and J. Lawall, “Cc2vec: Distributed representations of code changes,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 518–529.
  • [30] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
  • [31] T. H. Le, H. Chen, and M. A. Babar, “Deep learning for source code modeling and generation: Models, applications, and challenges,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–38, 2020.
  • [32] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” Advances in neural information processing systems, vol. 31, 2018.
  • [33] V. Arvind, A. Chatterjee, R. Datta, and P. Mukhopadhyay, “Fast exact algorithms using hadamard product of polynomials,” Algorithmica, pp. 1–28, 2022.
  • [34] J. Schmidt-Hieber, “Nonparametric regression using deep neural networks with relu activation function,” 2020.
  • [35] H. Gholamalinezhad and H. Khosravi, “Pooling methods in deep neural networks, a review,” arXiv preprint arXiv:2009.07485, 2020.
  • [36] H. Pratiwi, A. P. Windarto, S. Susliansyah, R. R. Aria, S. Susilowati, L. K. Rahayu, Y. Fitriani, A. Merdekawati, and I. R. Rahadjeng, “Sigmoid activation function in selecting the best model of artificial neural networks,” in Journal of Physics: Conference Series, vol. 1471, no. 1.   IOP Publishing, 2020, p. 012010.
  • [37] A. P. Bradley, “The use of the area under the roc curve in the evaluation of machine learning algorithms,” Pattern recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
  • [38] D. R. Velez, B. C. White, A. A. Motsinger, W. S. Bush, M. D. Ritchie, S. M. Williams, and J. H. Moore, “A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction,” Genetic Epidemiology: the Official Publication of the International Genetic Epidemiology Society, vol. 31, no. 4, pp. 306–315, 2007.
  • [39] D. Chicco, N. Tötsch, and G. Jurman, “The matthews correlation coefficient (mcc) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation,” BioData mining, vol. 14, no. 1, pp. 1–22, 2021.
  • [40] S. Wang, T. Liu, J. Nam, and L. Tan, “Deep semantic feature learning for software defect prediction,” IEEE Transactions on Software Engineering, vol. 46, no. 12, pp. 1267–1293, 2018.
  • [41] J. Li, P. He, J. Zhu, and M. R. Lyu, “Software defect prediction via convolutional neural network,” in 2017 IEEE international conference on software quality, reliability and security (QRS).   IEEE, 2017, pp. 318–328.
  • [42] H. Hata, O. Mizuno, and T. Kikuno, “Fault-prone module detection using large-scale text features based on spam filtering,” Empirical Software Engineering, vol. 15, pp. 147–165, 2010.
  • [43] H. K. Dam, T. Tran, T. Pham, S. W. Ng, J. Grundy, and A. Ghose, “Automatic feature learning for predicting vulnerable software components,” IEEE Transactions on Software Engineering, vol. 47, no. 1, pp. 67–85, 2018.
  • [44] E. Aftandilian, R. Sauciuc, S. Priya, and S. Krishnan, “Building useful program analysis tools using an extensible java compiler,” in 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation.   IEEE, 2012, pp. 14–23.
  • [45] V. J. Hellendoorn and P. Devanbu, “Are deep neural networks the best choice for modeling source code?” in Proceedings of the 2017 11th Joint meeting on foundations of software engineering, 2017, pp. 763–773.
  • [46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [47] S. Herbold, “Comments on scottknottesd in response to ‘an empirical comparison of model validation techniques for defect prediction models’,” IEEE Transactions on Software Engineering, vol. 43, no. 11, pp. 1091–1094, 2017.
  • [48] C. Jin, “Cross-project software defect prediction based on domain adaptation learning and optimization,” Expert Systems with Applications, vol. 171, p. 114637, 2021.
  • [49] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in International conference on machine learning.   PMLR, 2014, pp. 1188–1196.