[1]\fnmMd \surNadim
[1]\orgdivPh.D. Student, Software Research Lab (SRLab), Department of Computer Science, \orgnameUSASK, \orgaddress\citySaskatoon, \stateSaskatchewan, \countryCanada
[2]\orgdivAssistant Professor, Department of Computer Science, \orgnameUSASK, \orgaddress\citySaskatoon, \stateSaskatchewan, \countryCanada
Utilizing Source Code Syntax Patterns to Detect Bug Inducing Commits using Machine Learning Models
Abstract
Detecting Bug Inducing Commits (BIC), or Just in Time (JIT) defect prediction, using Machine Learning (ML) based models requires tabulated feature values extracted from the source code or historical maintenance data of a software system. Existing studies have utilized meta-data from source code repositories (which we name GitHub Statistics or GS), n-gram-based source code text processing, and developer information (e.g., the experience of a developer) as the feature values in ML-based bug detection models. However, these feature values do not represent the source code syntax styles or patterns that a developer might prefer over the available valid alternatives provided by programming languages. This investigation proposes a method of extracting features from source code syntax patterns to represent software commits and investigates whether they are helpful in detecting bug proneness in software systems. We utilize six manually and two automatically labeled datasets from eight open-source software projects written in Java, C++, and Python. Our datasets contain 642 manually labeled and 4,014 automatically labeled buggy and non-buggy commits from six and two subject systems, respectively. The subject systems contain a diverse number of revisions and are from various application domains. Our investigation shows that the inclusion of the proposed features increases the performance of detecting buggy and non-buggy software commits using five different machine learning classification models. Our proposed features also perform better in detecting buggy commits using the Deep Belief Network generated features and classification model. This investigation also used a state-of-the-art tool to compare the explainability of predicted buggy commits using our proposed and traditional features and found that our proposed features provide better reasoning about buggy commit detection than the traditional features. The continuation of this study can help enhance software effectiveness by identifying, minimizing, and fixing software bugs during maintenance and evolution.
keywords:
Bug Inducing Commit; Classification; Just in Time (JIT) defect prediction; Source Code Syntax Pattern; Token Pattern; Token Sequence; Deep Belief Network; Explainability of Bug Detection.

1 Introduction
There are several studies on detecting buggy commits or Just in Time (JIT) defect prediction Kim et al. (2008, 2007); Mizuno and Hata (2013); Shivaji et al. (2013a); Yang et al. (2015); Kim et al. (2006b); Śliwerski et al. (2005b); Wen et al. (2016); most of these studies used either statistics extracted from their respective GitHub repositories as features Borg et al. (2019); Rosen et al. (2015) or features extracted from n-gram Cavnar and Trenkle (1994) based source code processing Shivaji et al. (2013b). These studies can estimate whether a commit is bug-inducing or clean, but they cannot provide any specific idea or reasoning about how such bugs have been induced or what could be done to fix them. Existing studies Kamei et al. (2013); Kim et al. (2008); Śliwerski et al. (2005b); Yin et al. (2011); Gu et al. (2010) made it clear that failure to properly carry out the maintenance activities of a software system is one of the most common reasons for introducing bugs. A developer who is responsible for such maintenance activity performs coding in some unique syntax patterns or styles.
Two code fragments written to solve the same problem can have different structures and code complexity if two different programmers write them. This study practically verifies the effect of such differences in coding syntax patterns on detecting bug-inducing commits utilizing ML-based detection models. Nayrolles and Hamou-Lhadj (2018) reported using a clone detection tool named NiCad Cordy and Roy (2011) to determine the similarity of a new buggy code fragment to code fragment(s) where a bug had already been fixed previously, in order to suggest a possible bug fix to the developer. There are a few more similar studies Jeffrey et al. (2009); Jiang et al. (2018); Martinez and Monperrus (2015); Martinez et al. (2014) that investigated generating bug-fix patterns based on existing commit patches and identical code fragments. Suppose we can identify a specific set of coding syntax styles/patterns responsible for inducing bugs/inconsistencies in software systems. We can then determine the similarity of those patterns to new developers’ code and provide general guidelines to developers to avoid such patterns or structures in the codebase. More generalized recommendations (e.g., for commercial and open-source software systems of different sizes and application domains) about risky commits and their possible fixes can be suggested to software developers using the dangerous patterns identified from the commits that are most likely to induce bugs in software systems.
Casalnuovo et al. (2019) show that the source code of a software system is more repetitive than natural language text and that developers prefer some specific implementations of source code statements over other valid alternatives. For example, a simple statement in the C or Java programming language can be written in two different but equivalent syntax patterns/styles; both code fragments have the same meaning. Different syntax styles or patterns imply differences in some basic coding elements, such as the number of identifiers, the implementation of conditions, the nesting levels of different statements, and the use of inter-dependent statements. The preferences of software developers may also affect the ordering of programming tokens, as in the following two valid alternatives: a) i = i + 1, b) i = 1 + i. Although both alternative statements have the same meaning when compiled, they might be perceived in diverse ways when a software developer works on that software system later. The difference in such perceptions and preferences among the programmers/developers of a software system might add different levels of complexity during software maintenance and evolution. A similar scenario may also arise when different developers code complex or larger functionalities. This paper investigates whether such differences in developers’ coding syntax patterns also make a difference in the complexity of changing or fixing identified issues in those code fragments, by detecting buggy code commits using machine learning and deep learning classification models. Our investigation utilizes five different ML-based classification models and compares the performance improvement of BIC detection using GitHub statistics and code syntax pattern based features.
Our goal in this study is to extract and evaluate developers’ coding syntax pattern-based features for detecting bug-inducing commits (BICs) using machine learning (ML) classification models. The analysis of these features will help us identify distinctive coding syntax patterns in bug-inducing and clean commits. Identifying these patterns can provide meaningful ideas or reasoning to software developers about the induced bug in the software system. Wen et al. (2020) published a preliminary fixing strategy devised from the patterns of Bug Inducing Commits, and their study reported the automatic fix/repair of eight new bugs that state-of-the-art techniques could not repair. They emphasized analyzing BIC patterns, rather than analyzing how bugs are fixed, to boost automatic program repair (APR) Goues et al. (2019); Liu et al. (2019); Pei et al. (2014); Kim et al. (2013); Xin and Reiss (2019); Li et al. (2020b) techniques. Our study to evaluate and identify the developers’ coding syntax patterns responsible for inducing bugs in software systems could also contribute to finding relevant bug-fixing patterns.
Table 1: Manually labeled commit instances (Wen et al. 2019) from six subject systems.

Subject System | BIC | CC | BIC+CC
---|---|---|---
Accumulo | 51 (60%) | 34 (40%) | 85
Ambari | 34 (47.22%) | 38 (52.78%) | 72
Hadoop | 51 (50.50%) | 50 (49.50%) | 101
Jackrabbit | 58 (58.59%) | 41 (41.41%) | 99
Lucene | 131 (65.83%) | 68 (34.17%) | 199
Oozie | 46 (53.49%) | 40 (46.51%) | 86
Total Commit Instances: | | | 642
Table 2: Automatically labeled commit instances from two subject systems.

Subject System | Language | BIC | CC | BIC+CC
---|---|---|---|---
QT | CPP | 1,047 (43%) | 1,396 (57%) | 2,443
OpenStack | Python | 908 (58%) | 663 (42%) | 1,571
Total Commit Instances: | | | | 4,014
Our key contribution in this investigation is to propose source code syntax pattern-based features to detect bug-inducing commits (BICs) using different machine learning (ML) models. We compare the performance of detecting BICs using combinations of the commonly used feature values from similar studies Borg et al. (2019); Rosen et al. (2015); Fukushima et al. (2014) and our proposed feature values. We want to verify whether our proposed features, extracted from the source code syntax patterns of software projects, improve the performance of ML models in detecting BICs. There are many studies Kim et al. (2008, 2007); Mizuno and Hata (2013); Shivaji et al. (2013a); Kim et al. (2006b); Śliwerski et al. (2005b); Wen et al. (2016) in the literature using similar types of feature values with different experimental setups (i.e., machine learning, deep learning Yang et al. (2015); Zeng et al. (2021), single project, cross-project Fukushima et al. (2014); Li et al. (2020a); Tabassum et al. (2020); Kamei et al. (2016), etc.). Zeng et al. (2021) replicated the results of two popular Just In Time defect prediction models, CC2Vec Hoang et al. (2020) and DeepJIT Hoang et al. (2019), and reported differences between the results on the original datasets and on the experimental projects used in their study. They show that the selection of datasets and feature values impacts the CC2Vec and DeepJIT defect prediction models. In this investigation, we propose new types of features, extracted from the source code syntax patterns of software projects, to be used in ML models, and we use six manually and two automatically labeled datasets of buggy and non-buggy commits to perform the investigation. Instead of directly using the findings of any other study, we reproduce all the results in our experimental setup using commonly used feature values (which we name GitHub Statistics or GS) and compare them with the results of our proposed features. Therefore, we took our results using the commonly used features of recent studies as the baseline and compared them with the results using different combinations of our proposed features. As we wanted to see an improvement in detecting buggy and non-buggy commits using different feature combinations, we generated all the results in our implementation setup for a fair comparison. Our investigation shows that using the features extracted from source code syntax patterns improves the performance of detecting BICs. We believe that any other experimental setup should provide a similar result. In the future, we plan to replicate recent studies and compare the improvement of buggy commit detection using our proposed features extracted from the source code syntax pattern.
Our baseline is the result we obtain using conventional feature values (GS-ALL, GS) to identify bug-inducing commits similar to the existing related works Borg et al. (2019); Rosen et al. (2015); Fukushima et al. (2014). We then compare the results using developers’ coding syntax pattern-based features (TS, TP) and their various possible combinations in detecting BIC from eight subject systems. The token patterns and sequences generated in Figure 2 demonstrate their effects on the control flow of the source code. A simple token pattern represents a simple coding structure, and a simple code is more likely to be easier for software developers to edit and less likely to be bug-inducing. We investigate the encoding of these patterns as feature values to detect bug-inducing commits in this study. We also investigate the set of best features for each of the individual subject systems.
We applied Random Forest (RF) Classifier to the datasets of the six software repositories shown in Table 1 and five different ML-based classification models to the datasets of the two software repositories shown in Table 2 to detect bug-inducing commits and evaluated our results to answer the following research questions.
RQ1: How can we determine whether developers’ coding syntax patterns could be responsible for inducing bugs in a software system?
- We encoded developers’ coding syntax patterns as feature values and named them Token Sequence (TS) and Token Pattern (TP). Results of our investigation show that the inclusion of TS and TP as feature values provides better bug-inducing commit (BIC) detection performance than the conventional features (GS-ALL and GS). Therefore, we can conclude that, as TS and TP features contribute to detecting BIC using ML models, they might also be responsible for inducing bugs in software projects. Some repeating patterns might induce bugs in software systems, and ML models are being trained to identify those patterns utilizing the proposed TS and TP features.
RQ2: Do the features extracted from developers’ coding syntax patterns provide significantly better performance compared to the other feature values?
- Our Wilcoxon Signed Rank Tests (Section 3.3) show that the prioritized GS, TS, and TP features and their combinations provide significantly better Precision than GS-ALL, that TP alone or combined with GS provides significantly better Recall than GS-ALL and GS, and that GS+TP provides a significantly better F1 Score than GS. Therefore, combining the proposed TS and TP features with GS yields significantly better BIC detection performance than the conventional features alone.
RQ3: How generalized are the extracted features from one software system to the others?
- Our findings show that the number of features in the collection of best features differs across subject systems if the dataset size is not large enough (our six manually labeled subject systems). Based on the findings of this study, we can say that as different developers maintain different software systems, their coding syntax styles, working patterns, and codebase sizes are also different. Therefore, we require a separate list of features to detect future occurrences of buggy commits from those systems. With large datasets (the two automatically labeled subject systems), we get improved BIC detection results using all the feature values without prioritization in five different ML detection models.
RQ4: Do the features extracted from developers’ coding syntax patterns enhance the explainability of BIC detection from software systems?
- We utilized PyExplainer Pornprasit et al. (2021), a tool for explaining machine learning predictions. The use of this tool provided us with the examples in Table 8, which clearly demonstrate that source code syntax pattern-based features provide better reasoning about the bug proneness of a software commit compared to the code-churn-based features.
The knowledge of this study will help avoid future bug introduction in the codebase, which will lead to more effective use of financial and human resources in the software development community. The source code of commit patches and their equivalent XML representations from all eight subject systems used in this study are publicly available (https://github.com/mnadims/bicDetectionSF) for readers to investigate and to facilitate any replication study.
We organized this paper into the following sections. A full overview of this study in different steps is given in section 2; the results of this study are described and analyzed in section 3. Threats to validity are described in section 4, and section 5 reviews existing similar studies. Finally, we conclude this paper in section 6, mentioning the future directions of this study.
2 Study Overview
Most of the earlier studies to detect BICs either used features only from GitHub statistical data Borg et al. (2019); Rosen et al. (2015) or combined GitHub statistical features with natural language-based text processing of source code Shivaji et al. (2013b) in machine learning models. A systematic literature review Hall et al. (2012) reports that different studies use features extracted from various sources, such as previous historical data from GitHub or submitted bug reports, source code metrics, or any combination of those features. For example, some studies Hoang et al. (2019); Kamei et al. (2016) use historical patches to identify defects from software systems and call it JIT (Just in Time) defect prediction. Hall et al. (2012) also find that source code metrics alone work very poorly if no other kinds of features (such as GitHub Statistics) are combined with them. The decision of which values to use as features depends on some underlying assumptions. One such commonly used feature is Lines of Code (LOC), present in almost all ML-based software quality assurance or defect prediction studies Hall et al. (2012). The underlying assumption of using LOC as a feature value is that a bigger software system (more LOC) is more likely to be buggy than a smaller one. Some other standard features used in these studies (such as Lines of Code Added/Deleted/Updated, Number of Directories Modified, and Experience of a Developer) are based on similar assumptions. These features can provide an assumption or likelihood of a commit being buggy, but they remain far from giving a clear idea about which source code portion introduced the bug or how to find or fix it. We should focus on maintenance activities where source code is added or updated to identify specific reasons for bug introduction within a particular software system.
We perform this study to verify whether the developers’ coding syntax patterns can induce a bug in a software system by utilizing developers’ coding syntax pattern-based features (TS and TP) in ML-based detection models. We demonstrate the overall study in Figure 1, which includes the following key steps.
[Figure 1: Overall overview of the different steps of this study.]
2.1 Datasets Labelling (Buggy and Non-buggy Commits)
To evaluate the efficiency of any technique in detecting Bug Inducing Commits (BICs) or Just in Time (JIT) defect prediction, most of the related studies Borg et al. (2019); Rosen et al. (2015) used datasets labeled using the SZZ algorithm Śliwerski et al. (2005b). Wen et al. (2019) reported that about 63.7% of the bug-inducing commits (BICs) identified by the SZZ algorithm are not real. There are several updated versions of the SZZ algorithm to deal with its incorrect labeling da Costa et al. (2017); Davies et al. (2014); Kim et al. (2006b). We do not use a dataset labeled using the SZZ algorithm in this study because of its imprecision. As our primary goal is to verify the effectiveness of features extracted from developers’ coding syntax patterns, we utilize labeled datasets published by recent similar studies as our baseline and add our TS and TP-based features to investigate the improvements in detecting bug inducing commits (BICs). The datasets selected in this investigation fall into the following categories:
2.1.1 Manually labeled Datasets (Table 1)
Six of the eight datasets used in this investigation were labeled manually by Wen et al. (2019) and are publicly available (https://github.com/justinwm/InduceBenchmark). Their paper used datasets of bug-fixing commits and the associated bug-inducing commits for seven different projects. We utilized the six available project data files that contain enough data instances to apply machine learning based classification models such as the Random Forest classifier. We could not utilize one project data file (./Defects4J.csv) as it contains only the bug-inducing commits and associated Bug IDs from five different projects. Moreover, it includes only 92 data instances spread over these five projects, which is insufficient to apply ML-based classification models to the dataset of each of those projects.
2.1.2 Automatically Labeled Datasets (Table 2)
Our investigation also utilizes two popular datasets, QT (https://github.com/qt/) and OpenStack (https://github.com/openstack/), used in similar studies Hoang et al. (2019); Pornprasit and Tantithamthavorn (2021), which are labeled automatically and publicly available. We first download their data files from their data repository and then extract the commit instances that have a labeled number of bugs. The labeled bug presence falls into two categories: commits with a bug count of 0 and commits with 1 or more bugs. To keep buggy commit detection a simple binary classification and compatible with our other six subject systems, we labeled the commit instances from these two subject systems that have 1 or more bugs as buggy (L=1) and the commits that do not have any bugs as clean (L=0). OpenStack and QT contain 754 and 89 repositories, respectively, as their sub-modules. We find the source code patches of the labeled commits from QT in a sub-module named ’qtbase’. The labeled commits of OpenStack are from different sub-modules. To extract the source code patches, we first search for each commit in all 754 sub-modules of OpenStack and then store the source code of that commit patch for further processing.
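A minimal sketch of this binary labeling step is shown below; the file and column names (openstack_commits.csv, bugcount) are hypothetical placeholders, as the exact schema depends on the downloaded data files.

```python
import pandas as pd

commits = pd.read_csv("openstack_commits.csv")        # one row per labeled commit (hypothetical file)
labeled = commits.dropna(subset=["bugcount"]).copy()  # keep commits that have a labeled bug count

# Commits with 1 or more reported bugs become buggy (L=1); commits with 0 bugs become clean (L=0).
labeled["L"] = (labeled["bugcount"] >= 1).astype(int)
print(labeled["L"].value_counts())
```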
The datasets from the six manually labeled projects are shown in Table 1, and the two automatically labeled projects are in Table 2. We investigate the BIC detection performance of the ML model using our introduced features on these six project datasets. There are two categories of commits in the dataset: one category is identified as bug-inducing commits and the other as bug-fixing commits by manual investigation. We labeled the bug-inducing commits as L=1. As training and testing an ML-based detection model requires samples of the target class and its other classes, we labeled the bug-fixing commits that do not induce new bugs in the system as clean commits (L=0). The clean commits used as training and testing samples mainly remove the problems/issues inserted by the commits labeled as bug-inducing (L=1). These commits can provide more distinguishable features to ML-based detection models to identify the two categories of commits.
2.2 Identifying Developers’ Coding Syntax Patterns and Sequences
Developers’ coding syntax patterns represent the behavior of a software developer, which affects the overall structure of a source code statement. We identify thousands of different commit patterns from the source code patches of the eight subject systems of this investigation. About 18k of these patches, whose token pattern length is within 100 characters, are publicly available at our GitHub repository for an easier understanding of those patterns. We also extract a sequence of these patterns to prepare feature values, which can be used in our ML-based classification models. As tokens of a programming language are the basic building blocks of each feature type extracted from the XML representation of source code, we name the categories of features Token Sequence (TS) and Token Pattern (TP). Commits encoded by these TS and TP values represent the syntax style of coding by different developers.
We show an overall demonstration of extracting features (TS and TP) using a straightforward piece of Java code in Figure 2. A few additional examples of more complex program constructions, such as AND/OR operators in an if-statement, nested if statements, and switch-case statements, are also available in four additional files at our GitHub repository. Our public repository also contains all the project files, including the source files of commit patches and their equivalent XML representations, for further investigation and reproduction of this study.
We first downloaded all the commit patches using the commit ids in our labeled dataset and the git command for the corresponding GitHub repositories. Then we retrieved the source code segments from those patch files. After that, we utilized the srcML tool (https://www.srcml.org/), which converts source code into an XML format representing its coding structure in a normalized form. Suppose we have the source code of a commit patch in a file named test.java, as shown in Figure 2. Processing test.java with the srcML tool generates an XML file test.xml. The XML representation of the source file provides the hierarchy of the source code statement structure and token organization. In this example, test.java contains two main blocks, and in the XML representation of the source file we can see both blocks with their identifier hierarchy and sequence preserved. Processing the XML file, we can extract the normalized names of the different tokens in the source code and the hierarchy of those tokens. Examples of the extracted token sequence (TS) and token pattern (TP) are shown in Figure 2. We have given each token pattern an id, such as TP1, TP2, …, TPn, where n is the total number of patterns extracted from the source file. Each TPi is an extracted pattern of normalized tokens; for example, if_stmt-if-condition-expr-name is one of the patterns shown in Figure 2. Thousands of such patterns can be extracted from a software system, reflecting how the developers of that software system perform coding. To prepare the dataset for a commit, we count how many of these patterns occur in the source code of that commit patch. Similarly, the same figure also shows an example of the extracted token sequence. We processed that token sequence with the n-gram Cavnar and Trenkle (1994) technique, which is mainly used for extracting features from natural language text. We applied n-grams (with n=1 to n=5) to the extracted token sequence to produce a Bag of Words (BOW) Jason Brownlee (2017); Le and Mikolov (2014); Zhao and Mao (2018), which is then used as another source of features; we named these features Token Sequence or TS. We repeated this process for the source code of every commit patch in all the subject systems to extract the TS and TP features.
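A minimal sketch of this extraction pipeline is shown below, assuming the srcml command-line tool is installed. Here we take a Token Pattern to be the root-to-leaf path of normalized srcML tag names and the Token Sequence to be the pre-order sequence of tag names; this is our simplified reading of the procedure above, and the helper names are ours.

```python
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter

def srcml_to_xml(src_file, xml_file):
    # Convert a source file (e.g., a commit-patch fragment) into srcML XML.
    subprocess.run(["srcml", src_file, "-o", xml_file], check=True)

def strip_ns(tag):
    # srcML tags carry an XML namespace, e.g. '{http://www.srcML.org/srcML/src}if'
    return tag.split("}")[-1]

def token_sequence_and_patterns(xml_file):
    root = ET.parse(xml_file).getroot()
    sequence, patterns = [], []

    def walk(node, path):
        name = strip_ns(node.tag)
        path = path + [name]
        sequence.append(name)
        if len(node) == 0:                    # leaf: record the full tag path as a Token Pattern
            patterns.append("-".join(path))   # e.g. 'if_stmt-if-condition-expr-name'
        for child in node:
            walk(child, path)

    walk(root, [])
    return sequence, patterns

def ngrams(tokens, n_max=5):
    # n-grams (n = 1..5) over the Token Sequence form the Bag-of-Words TS features.
    grams = []
    for n in range(1, n_max + 1):
        grams += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

srcml_to_xml("test.java", "test.xml")
ts, tp = token_sequence_and_patterns("test.xml")
print(Counter(tp))   # Token Pattern (TP) counts for this commit patch
print(ngrams(ts))    # Token Sequence (TS) n-gram counts
```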
[Figure 2: Demonstration of extracting Token Sequence (TS) and Token Pattern (TP) features from a simple Java source file (test.java) and its srcML XML representation (test.xml).]
2.3 Encoding Developers’ Coding Syntax Patterns to be used in the ML-based Detection Model
Applying ML-based models depends on the availability of labeled datasets (sample commits from each of the labeled classes) and the extraction of relevant feature values from the codebase history to train and test the model. After identifying the TS and TP patterns from the source code of the commit patches, it is important to encode them as feature values that can be used in ML-based detection models. In this study, we used a simple encoding of TS and TP by considering their number of occurrences in the source code of the commit patch. For example, Figure 2 demonstrates how the TS and TP patterns are extracted from Java source code. Each code fragment may have several TS and TP. We extract all the TS and TP from all the commit patches of our labeled datasets and then select the unique TS and TP to be used as feature identifiers. The count of each feature identifier in a commit represents that commit in the ML-based detection models. A sample of TPs extracted for each commit and their encoding mechanism are demonstrated in Table 3 and Table 4. In Table 3, we show the commit id, its extracted TPs, and the associated label for each commit; here, L=0 indicates a clean commit, and L=1 indicates a BIC. We then took the unique TPs from all commits and used the number of occurrences of each TP to represent each commit in Table 4.
Table 3: Sample Token Patterns (TP) extracted from each commit with its label.

Commit ID | Token Pattern (TP) | Label (L)
---|---|---|
1 | TP1, TP2, TP3, TP3, TP4, TP4 | 0 |
2 | TP2, TP4, TP5 | 1 |
3 | TP1, TP3, TP6 | 0 |
4 | TP4, TP4, TP4 | 0 |
5 | TP4, TP6, TP7, TP7 | 1 |
Table 4: Encoding of the Token Patterns of Table 3 as per-commit occurrence counts.

Commit ID | TP1 | TP2 | TP3 | TP4 | TP5 | TP6 | TP7 | Label (L)
---|---|---|---|---|---|---|---|---
1 | 1 | 1 | 2 | 2 | 0 | 0 | 0 | 0
2 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1
3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0
4 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0
5 | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 1
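A small sketch that reproduces the toy encoding of Tables 3 and 4 (the per-commit pattern lists are those of Table 3):

```python
from collections import Counter

# Token Patterns extracted per commit (Table 3) with labels: 0 = clean, 1 = BIC
commits = {
    1: (["TP1", "TP2", "TP3", "TP3", "TP4", "TP4"], 0),
    2: (["TP2", "TP4", "TP5"], 1),
    3: (["TP1", "TP3", "TP6"], 0),
    4: (["TP4", "TP4", "TP4"], 0),
    5: (["TP4", "TP6", "TP7", "TP7"], 1),
}

# The vocabulary of unique patterns across all commits acts as the feature identifiers.
vocab = sorted({tp for patterns, _ in commits.values() for tp in patterns})

# Encode each commit as the occurrence count of every pattern (Table 4).
for cid, (patterns, label) in commits.items():
    counts = Counter(patterns)
    row = [counts.get(tp, 0) for tp in vocab]
    print(cid, row, "L =", label)
```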
2.4 Selecting The Set of Best Features for each of the Subject Systems
Machine learning algorithms provide their results based on the feature values given for each data instance. The features we consider for training and testing a machine learning model play a vital role in the performance of that model; a weak or irrelevant feature list can degrade the overall classification accuracy. We wanted to evaluate whether the best features for one subject system are also better for the other subject systems. We applied Recursive Feature Elimination (RFE) from the SciKit Learn Pedregosa et al. (2011) Python library to rank all the features in the dataset of a subject system. RFE is a recursive approach to finding a list of the best features for a dataset and its classification labels. A user first specifies the number of best features expected to be kept and the number of features to eliminate in each step. RFE then eliminates the weakest feature (or features) in each recursive step until the expected number of features is obtained. The parameter settings we used for RFE are shown in Figure 3, where we used the Random Forest classifier to estimate the importance of features and eliminated one feature in each step.
[Figure 3: Parameter settings used for Recursive Feature Elimination (RFE) with the Random Forest classifier as the estimator.]
As we set n_features_to_select=1, RFE ranks all the features from rank 1 to n (n = total number of features) and returns the top-ranked feature as the best feature for the dataset classification. We performed this for each feature combination, such as GS, TS, TP, GS+TS, etc. As GS has 12 features, this step ranks those features from rank-1 to rank-12 based on their importance in classifying the dataset. Similarly, GS+TS has 10,526 features (for Accumulo), which get ranks from rank-1 to rank-10526.
After obtaining the ranking of each of the features from the RFE algorithm, we applied the Random Forest classifier to select the set of best features for each of the subject systems. We first determine the Precision, Recall, and F1 Score of BIC detection using only the top (rank-1) feature in a subject system. Then we added one feature in each step from the following position in the rank list and determined the classification performance. We discarded the added feature if it did not improve the performance obtained in the previous step. Therefore, after completing the iteration from feature 1 to the total number of features, we will have the best features for that subject system. We repeated this approach for each of the feature combinations and each of the six manually labeled subject systems.
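A minimal sketch of this two-stage selection is shown below, assuming a numeric feature matrix X, binary labels y, and the time-ordered 70/30 split of Section 2.5; the helper names are ours, and only the F1 Score is used as the improvement criterion here for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score

def rank_features(X_train, y_train):
    # Rank every feature from 1 (most important) to n, eliminating one feature per step,
    # with a Random Forest estimating feature importance (mirrors the settings in Figure 3).
    rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42),
              n_features_to_select=1, step=1)
    rfe.fit(X_train, y_train)
    return np.argsort(rfe.ranking_)  # column indices ordered rank-1, rank-2, ...

def greedy_best_features(X_train, y_train, X_test, y_test, ranked_columns):
    # Add features one at a time in rank order; keep a feature only if it improves the F1 Score.
    best, best_f1 = [], -1.0
    for col in ranked_columns:
        candidate = best + [col]
        clf = RandomForestClassifier(n_estimators=100, random_state=42)
        clf.fit(X_train[:, candidate], y_train)
        f1 = f1_score(y_test, clf.predict(X_test[:, candidate]))
        if f1 > best_f1:
            best, best_f1 = candidate, f1
    return best, best_f1
```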
2.5 Applying ML-based BIC Detection Models
Our investigation utilizes five machine learning classification models to detect bug inducing commits from two (OpenStack and QT) of the eight subject systems. We use the OpenStack and QT datasets for applying multiple ML models as these datasets are the most widely used in similar studies and contain a large number of data samples compared to the six manually labeled subject systems. We use only the Random Forest (RF) classification model on the manually labeled datasets as they contain a smaller number of data instances. We find that the features extracted from source code statements improve the prediction of defective software commits. All the classification models except the DBN are available in the Python SciKit Learn Pedregosa et al. (2011) library. We utilized the DBN classification model made available by albertbup (2017).
i. Random Forest (RF) Classifier implements ensemble learning methods: it builds multiple decision trees and merges them to get a better and more stable prediction. At the root of each tree, it first divides the data so that the differences in each part of the data become as low as possible; it then divides the branches of the tree again to get more specialization on the data. We can control the depth of such a tree with the parameter max_depth. The algorithm repeats this process with different random selections and creates a set of random decision trees. During prediction, it classifies the case using each of the decision trees in the forest and finalizes the decision by the majority vote of the trees.
ii. K-Nearest Neighbors (KNN) Classifier is one of the most commonly used ML algorithms, known for its simplicity and effectiveness Taunk et al. (2019). It is a supervised learning method that uses a labeled training dataset to categorize data points for predicting the category of the test dataset.
iii. Gradient Boosting Classifier (GBC) is also an ensemble learning method, where the model learns by optimizing a loss function. It is used for both classification and regression. We use the default configuration of the Python SciKit Learn Pedregosa et al. (2011) library for utilizing this algorithm to detect bug inducing commits.
iv. Perceptron (PCT) is the simplest type of neural network model used for binary classification. It takes a row of input data and predicts its output class using the weighted sum of the inputs and a bias value. During training, PCT updates its weights to minimize the classification error on the training dataset.
v. Supervised Deep Belief Network (DBN) Classification utilizes an advanced deep learning algorithm Yang et al. (2015); Hinton and Salakhutdinov (2006); Hinton (2007); Hinton et al. (2006) to derive more meaningful features and increase the classification performance on a labeled dataset. It contains two phases: i) feature selection and ii) machine learning based classification. In the feature selection phase, it utilizes the provided features to extract more meaningful feature values and then applies classification based on the extracted features.
We applied the machine learning models using a time-sensitive detection technique, which means the training data must have earlier timestamps than the testing data. Tan et al. (2015) show that cross-validation can provide falsely high precision when predicting data instances that are sensitive to timestamps, because a situation may arise where the ML model is trained with data from the future and tested with data from the past. We first sorted the dataset in ascending order based on the commit date and took the first 70% of the data for training the model and the remaining 30% for testing. All the commit instances in the training dataset are therefore from the past compared to the testing data instances.
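A sketch of this time-sensitive split and of applying the scikit-learn classifiers is shown below (the DBN model from albertbup (2017) is omitted here); the file and column names (encoded_commits.csv, commit_date, L) are hypothetical placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Perceptron
from sklearn.metrics import f1_score

# df holds one row per commit: feature columns, a binary label 'L',
# and a 'commit_date' column (hypothetical names).
df = pd.read_csv("encoded_commits.csv", parse_dates=["commit_date"])
df = df.sort_values("commit_date")            # oldest commits first

split = int(len(df) * 0.7)                    # first 70% (past) for training
train, test = df.iloc[:split], df.iloc[split:]
feature_cols = [c for c in df.columns if c not in ("commit_date", "L")]

models = {
    "RF":  RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "GBC": GradientBoostingClassifier(),
    "PCT": Perceptron(),
}
for name, model in models.items():
    model.fit(train[feature_cols], train["L"])
    pred = model.predict(test[feature_cols])
    print(name, "F1 =", round(f1_score(test["L"], pred), 2))
```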
2.6 Feature Combinations
We extracted Token Pattern (TP) features utilizing the srcML tool, which does not support Python for converting source fragments into an XML representation. Therefore, we extracted only Token Sequence (TS) features from OpenStack, as it is written in the Python programming language. We extract both the TS and TP features from all the other seven subject systems listed in Tables 1 and 2. We investigated the ML-based classification models’ performance by taking different combinations of the GS, TS, and TP features. In this investigation, our baseline features are GS, and we propose the TS and TP features to improve BIC detection performance. Suppose we would like to combine the GS and TS features for the subject system Accumulo, where the numbers of GS and TS features are 12 and 10,514; the number of combined features (GS+TS) in Accumulo will then be 10,526. We rank those features based on their importance, where the most important feature gets rank-1 and the least important feature gets rank-10526. Similarly, if we would like to investigate GS+TS+TP for Accumulo, we will have 12 + 10,514 + 741 = 11,267 features and rank them from 1 to 11,267 based on their importance in the dataset classification of Accumulo. As we have three basic feature types, we get seven feature combinations (GS, TS, TP, GS+TS, GS+TP, TS+TP, GS+TS+TP). We wanted to investigate whether combining more than one type of feature improves the classification performance. Each of the seven feature combinations is prioritized using the technique described in Section 2.4.
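As an illustration, combining the per-commit feature blocks is a simple column-wise concatenation; this is a sketch where GS, TS, and TP stand for the already-encoded matrices with one row per commit.

```python
import numpy as np

# GS, TS, and TP are per-commit feature matrices sharing the same row order,
# e.g. shapes (n_commits, 12), (n_commits, 10514), and (n_commits, 741) for Accumulo.
def combine(*feature_blocks):
    return np.hstack(feature_blocks)

# The GS+TS+TP combination for Accumulo would then have 12 + 10,514 + 741 = 11,267 columns.
# X_gs_ts_tp = combine(GS, TS, TP)
```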
Table 5: BIC detection performance in the six manually labeled subject systems using GS-ALL, GS, TS, and TP features.

Features | Subject System | Precision | Recall | F1 Score | AUC
---|---|---|---|---|---
GitHub Statistics All Features (GS-ALL) | Accumulo | 0.57 | 0.73 | 0.64 | 0.76
 | Ambari | 0.67 | 0.80 | 0.73 | 0.76
 | Hadoop | 0.40 | 0.91 | 0.56 | 0.75
 | Jackrabbit | 0.50 | 0.69 | 0.58 | 0.55
 | Lucene | 0.63 | 0.79 | 0.70 | 0.78
 | Oozie | 0.50 | 0.67 | 0.57 | 0.64
GitHub Statistics Prioritized Features (GS) | Accumulo | 0.69 | 0.82 | 0.75 | 0.73
 | Ambari | 0.89 | 0.80 | 0.84 | 0.92
 | Hadoop | 0.53 | 0.91 | 0.67 | 0.78
 | Jackrabbit | 0.53 | 0.77 | 0.62 | 0.55
 | Lucene | 0.73 | 0.82 | 0.77 | 0.83
 | Oozie | 0.56 | 0.83 | 0.67 | 0.65
Token Sequence (TS) | Accumulo | 0.69 | 1.00 | 0.81 | 0.81
 | Ambari | 0.75 | 0.60 | 0.67 | 0.77
 | Hadoop | 0.85 | 1.00 | 0.92 | 0.96
 | Jackrabbit | 0.65 | 1.00 | 0.79 | 0.82
 | Lucene | 0.82 | 1.00 | 0.90 | 0.92
 | Oozie | 0.86 | 1.00 | 0.92 | 0.91
Token Pattern (TP) | Accumulo | 0.58 | 1.00 | 0.73 | 0.88
 | Ambari | 0.90 | 0.90 | 0.90 | 0.91
 | Hadoop | 0.64 | 0.82 | 0.72 | 0.67
 | Jackrabbit | 0.65 | 1.00 | 0.79 | 0.66
 | Lucene | 0.67 | 0.94 | 0.78 | 0.72
 | Oozie | 0.86 | 1.00 | 0.92 | 0.91
Table 6: BIC detection performance in the six manually labeled subject systems using the combined feature sets (GS+TS, GS+TP, TS+TP, GS+TS+TP).

Features | Subject System | Precision | Recall | F1 Score | AUC
---|---|---|---|---|---
GS+TS | Accumulo | 0.77 | 0.91 | 0.83 | 0.91
 | Ambari | 0.75 | 0.60 | 0.67 | 0.72
 | Hadoop | 0.56 | 0.91 | 0.69 | 0.77
 | Jackrabbit | 0.61 | 0.85 | 0.71 | 0.62
 | Lucene | 0.94 | 0.97 | 0.96 | 0.97
 | Oozie | 0.71 | 1.00 | 0.83 | 0.74
GS+TP | Accumulo | 0.71 | 0.91 | 0.80 | 0.82
 | Ambari | 0.82 | 0.90 | 0.86 | 0.76
 | Hadoop | 0.62 | 0.91 | 0.74 | 0.76
 | Jackrabbit | 0.71 | 0.92 | 0.80 | 0.78
 | Lucene | 0.79 | 0.91 | 0.85 | 0.83
 | Oozie | 0.73 | 0.92 | 0.81 | 0.74
TS+TP | Accumulo | 0.73 | 1.00 | 0.85 | 0.81
 | Ambari | 0.75 | 0.60 | 0.67 | 0.77
 | Hadoop | 0.67 | 0.91 | 0.77 | 0.77
 | Jackrabbit | 0.68 | 1.00 | 0.81 | 0.78
 | Lucene | 0.82 | 1.00 | 0.90 | 0.90
 | Oozie | 0.67 | 1.00 | 0.80 | 0.78
GS+TS+TP | Accumulo | 0.77 | 0.91 | 0.83 | 0.93
 | Ambari | 0.80 | 0.80 | 0.80 | 0.75
 | Hadoop | 0.53 | 0.91 | 0.67 | 0.66
 | Jackrabbit | 0.69 | 0.85 | 0.76 | 0.68
 | Lucene | 0.84 | 0.94 | 0.89 | 0.92
 | Oozie | 0.86 | 1.00 | 0.92 | 0.90
3 Results and Discussion
Our investigation contains two categories of datasets: i) manually labeled data instances in Table 1 and ii) automatically labeled data instances in Table 2. The identification of buggy and non-buggy commit instances in both datasets was made by earlier published studies Pornprasit and Tantithamthavorn (2021); Wen et al. (2019). We evaluate the performance of detecting bug inducing commits (BICs) using our proposed features (TS and TP) in all eight software projects and compare the results with the conventional (GS) features used in earlier studies. We also combine the features to detect BICs from the different subject systems. Our results are in Table 5, Table 6, Figure 5, Figure 7, and Figure 9. We discuss our results in the following sub-sections.
3.1 Evaluation Metric
We first calculate some metric values to compare the performance of different feature combinations proposed in this study. Calculating these metric values depends on the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These counts indicate how well the machine learning (ML) model detects bug-inducing commits (BICs). For example, TP and TN are the counts of correctly identified BIC and non-BIC. Similarly, FP and FN indicate the incorrectly detected BIC and non-BIC using the ML model. Once we have these counts for all the feature combinations from the dataset of all the subject systems, we use the following formula to calculate Precision, Recall, and F1 Score.
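These are the standard definitions (here TP, TN, FP, and FN denote the confusion-matrix counts, not Token Pattern):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$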
When an ML model has a lower count of FP, it provides a higher precision, indicating higher reliability of the model. Similarly, a model with a lower FN provides a higher recall, indicating higher completeness of the detected results. However, an attempt to increase the precision typically decreases the recall and vice versa Developers.Google (2020, accessed August 26, 2021). A BIC detection model that provides higher precision with lower recall, or higher recall with lower precision, is not acceptable in a real-life practical software development industry. Therefore, we calculate the F1 Score by taking the harmonic mean of precision and recall to evaluate the results of this study. A higher F1 Score in a comparison scenario indicates a better result, considering both precision and recall.
We also calculate the AUC (Area Under the ROC Curve) score Developers.Google (2020, accessed August 26, 2021), which provides an aggregate performance measure across all possible classification thresholds. The value of the AUC score is between 0 and 1, where 1.0 represents a model whose detections are all correct (100%), and a model whose detections are all incorrect has an AUC of 0.0.
We use the Scikit-learn Pedregosa et al. (2011) library available in the Python programming language to calculate the Precision, Recall, F1 Score, and AUC Score to determine the performance of detecting BIC and clean (non-BIC) commits. We extract basic features for all the subject systems from GS, TS, and TP. We combined those features and obtained four extra feature lists (GS+TS, GS+TP, TS+TP, GS+TS+TP). We obtain the results by prioritizing features (eliminating unimportant features using RFE) from all seven feature combinations. To compare the performance of BIC and non-BIC detection without eliminating unimportant features, we also report the results using all the features from GitHub Statistics (GS-ALL, no feature eliminated) for all the subject systems. Based on the results of this investigation, we answer our research questions as follows.
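For reference, a minimal sketch of how these four scores can be computed with Scikit-learn; y_test, y_pred, and y_prob are assumed to come from the time-ordered split and a fitted classifier as in Section 2.5.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# AUC uses the predicted probability of the BIC class, e.g. y_prob = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
```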
3.2 Answer to the RQ1
How can we determine whether developers’ coding syntax patterns could be responsible for inducing bugs in a software system?
This research question was the most important motivation for this study. We wanted to see whether the coding syntax style could induce buggy or faulty code fragments in a software system. The detailed extraction and encoding process of the TS and TP features discussed in Sections 2.2 and 2.3 explains the different steps of identifying the coding syntax styles/patterns and encoding the labeled commits with the identified TS and TP values. For instance, if a different developer wrote the test.java file demonstrated in Figure 2, it could have a different coding structure, which would lead to a completely different list of TS and TP feature values. As the source code patterns encoded by TS and TP values improve BIC detection performance in all eight software projects compared to the conventional feature values (GS), we can say that these patterns might also be responsible for inducing bugs in software systems. The evaluation of the results of this study verifies this assumption as follows.
We show our results (Precision, Recall, and F1 Score) for the six manually labeled subject systems of Table 1 in Figures 5, 7, and 9, and for the two automatically labeled subject systems of Table 2 in Figure 4. These figures clearly show that we can detect BIC (and non-BIC) in all software systems using only the TS or TP features with higher performance measures (F1 Scores) than the most commonly used GS-ALL and GS features. As ML-based detection models can distinguish BIC and non-BIC using feature values representing the developer’s coding syntax style/pattern, we can provide an affirmative answer to this research question. However, we can also argue that the ML-based model mainly utilizes those specific coding syntax styles/patterns to detect BICs from the software commits. Therefore, these coding syntax patterns must be dealt with very carefully while performing change operations in the codebase.
[Figure 4: Percentage (%) of improvement in F1 Score over the GS features for different feature combinations using five ML classification models on the QT and OpenStack datasets.]
The results obtained using the six subject systems in Table 1 show that, considering all the evaluation metrics (Precision, Recall, F1 Score, and AUC scores in all the subject systems), the performance obtained by only the GS features is lower than the performance obtained by the TS and TP features. Besides, using all 12 feature values (GS-ALL) extracted from GitHub Statistics yields lower performance than the prioritized GS features, where we removed some unimportant features, in all the subject systems. Comparing the BIC detection performance between GS-ALL and prioritized GS indicates that taking only the important features from GS-ALL improves the performance, but this is not enough to detect all the BICs. On the other hand, using feature values extracted from TS and TP can improve the detection performance at a statistically significant level. This scenario provides the answer to this research question: in all the subject systems, either TS or TP or both features improved performance more than the GS-ALL (non-prioritized) and GS (prioritized) features.
The results from the two automatically labeled subject systems in Table 2 also show improved F1 Scores using all five ML-based classification models in Figure 4. The horizontal line at 0.00 represents the F1 Score using GS features only, and the points above and below the 0.00 line indicate the percentage (%) of improvement and decrease in F1 Scores, respectively. We show the gain (in %) in F1 Scores using all the different feature combinations of GS, TS, and TP. Our results show that the inclusion of TS and TP features improves the F1 Score in all five ML-based classification models. Some feature combinations in PCT and KNN show a decrease in F1 Scores, but these are only six cases out of the 40 total test cases in the two subject systems.
The distribution of the AUC scores shown in Figure 12 also supports our findings. Most of the AUC scores obtained using TS and TP features are higher than those using GS-ALL and GS features. We find the highest distribution of AUC scores using the TS and GS+TS feature values. TP features also provide higher AUC scores but could not outperform TS and GS+TS. Although the other feature combinations, such as GS+TP, TS+TP, and GS+TS+TP, also provide improved AUC scores compared to the GS-ALL and GS features, their distributions are similar to those of the other TS and TP feature combinations.
[Figures 5-10: BIC detection results (Precision, Recall, and F1 Score) and their distributions across feature combinations for the six manually labeled subject systems.]
3.3 Answer to the RQ2
Do the features extracted from developers’ coding syntax patterns provide significantly better performance compared to the other feature values?
We performed the Wilcoxon Signed Rank Test to see whether the TS and TP-based feature values significantly improved the performance measures (e.g., Precision, Recall, F1 Score). This statistical significance test compares the results using the TS and TP feature values with those obtained using feature values from GitHub Statistics (e.g., GS-ALL, GS). For instance, to examine whether the Precision/Recall/F1 Score obtained by GS+TP features is significantly better than the results obtained by GS features, we applied both GS+TP and GS independently to our six subject systems and obtained Precision, Recall, and F1 Scores from each implementation, as shown in Table 7. Therefore, we got six pairs of observations for each of Precision, Recall, and F1 Score, and we used them to calculate the differences between each observation pair. Then, we used the differences between these observation pairs to perform the Wilcoxon Signed Rank Test utilizing the SciPy library Virtanen et al. (2020) available in the Python programming language. Finally, we performed the significance test for Precision, Recall, and F1 Score separately.
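A minimal sketch of this paired test using SciPy, with the F1 Scores of GS and GS+TP from Table 7 as input:

```python
from scipy.stats import wilcoxon

# Paired F1 Scores of the six subject systems (S1-S6) taken from Table 7.
f1_gs = [0.75, 0.84, 0.67, 0.62, 0.77, 0.67]
f1_gs_tp = [0.80, 0.86, 0.74, 0.80, 0.85, 0.81]

# wilcoxon() operates on the paired differences of the two observation lists.
statistic, p_value = wilcoxon(f1_gs, f1_gs_tp)
print("Wilcoxon statistic =", statistic, ", p-value =", p_value)
```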
[Figure 11: Summary of the Wilcoxon Signed Rank Test results comparing the feature combinations.]
A summary of the results obtained from the significance tests of the features is given in Figure 11, which marks with a vertical bar symbol the cases where the results of one feature list are significantly different from the results of the corresponding features. To compare which result is better, we can refer to the distributions of the results shown in Figures 6, 8, 10, and 12. When we consider the precision of BIC detection, the prioritized features from GS, TS, and TP and any combinations of these features perform significantly better than using all 12 features from GitHub Statistics (GS-ALL). Considering recall, TP features alone or combined with GS provide significantly better results than GS-ALL and GS. The significance test of the F1 Score also shows that the other feature combinations are significantly better than the GS-ALL features and that GS+TP is better than GS. Therefore, we can conclude that features extracted from TP, when combined with GS, provide significantly better results than the GS-ALL or GS features alone. Thus, our findings in this study provide enough evidence to believe that more source-code-related features (TS, TP) can significantly increase the accuracy of identifying BICs using machine learning models.
[Figure 12: Distribution of AUC scores across feature combinations for the six manually labeled subject systems.]
Table 7: Precision, Recall, and F1 Score obtained using GS and GS+TP features in the six manually labeled subject systems (S1: Accumulo, S2: Ambari, S3: Hadoop, S4: Jackrabbit, S5: Lucene, S6: Oozie).

Performance Measure | S1 | S2 | S3 | S4 | S5 | S6
---|---|---|---|---|---|---
Precision (GS) | 0.69 | 0.89 | 0.53 | 0.53 | 0.73 | 0.56 |
Precision (GS+TP) | 0.71 | 0.82 | 0.62 | 0.71 | 0.79 | 0.73 |
Difference | 0.02 | -0.07 | 0.09 | 0.18 | 0.06 | 0.17 |
Recall (GS) | 0.82 | 0.80 | 0.91 | 0.77 | 0.82 | 0.83 |
Recall (GS+TP) | 0.91 | 0.90 | 0.91 | 0.92 | 0.91 | 0.92 |
Difference | 0.09 | 0.10 | 0.00 | 0.15 | 0.09 | 0.09 |
F1 Score (GS) | 0.75 | 0.84 | 0.67 | 0.62 | 0.77 | 0.67 |
F1 Score (GS+TP) | 0.80 | 0.86 | 0.74 | 0.80 | 0.85 | 0.81 |
Difference | 0.05 | 0.02 | 0.07 | 0.18 | 0.08 | 0.14 |
3.4 Answer to the RQ3
How generalized are the extracted features from one software system to the others?
3.4.1 Manually Labeled Datasets of Table 1
We investigated the number of features in the best feature lists for each of the GS, TS, and TP features in all the subject systems. We plotted the list sizes of the different subject systems according to their number of data samples in Figure 13. In our investigation, Ambari has the smallest number of data samples (Table 1), Accumulo is next to Ambari, and Jackrabbit has the maximum number of data samples. By sorting the number of features in the list of best features, we wanted to see whether the best feature list depends on the number of available data instances of a subject system. We show the number of available data instances for each subject system in Table 1.
Figure 13 shows that the number of features providing the best result increases for most subject systems. In this figure, the subject systems are arranged based on the number of available data instances; thus, the left-most subject system (e.g., Ambari) contains the fewest data instances, and the right-most subject system (e.g., Jackrabbit) has the most. Although Hadoop, Lucene, and Jackrabbit show a decline for some features (GS, TS, TP), the other three subject systems show an increase in this number with the rise in data samples. To verify this assumption, we conducted both Pearson’s Kirch (2008) and Spearman’s Dodge (2008) correlation tests and found that the number of GS features is positively correlated, and the numbers of TS and TP features are negatively correlated, with the number of data instances of the software projects under investigation for providing the best BIC detection performance; however, these correlations are not statistically significant. Therefore, we can say that the number of features is not directly related to the number of data instances under investigation. There might be a unique feature set for each software project that provides better BIC detection performance using different ML models.
[Figure 13: Number of features in the best feature lists (GS, TS, TP) for the subject systems ordered by the number of data instances.]
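A sketch of these correlation tests with SciPy; the numbers of data instances are taken from Table 1, while the best-feature-list sizes shown here are placeholders for illustration only (the actual sizes are those plotted in Figure 13).

```python
from scipy.stats import pearsonr, spearmanr

# Data instances per subject system (Table 1): Accumulo, Ambari, Hadoop, Jackrabbit, Lucene, Oozie
n_instances = [85, 72, 101, 99, 199, 86]
# Hypothetical sizes of the best GS feature lists for the same systems (illustration only)
n_best_features = [4, 3, 4, 4, 3, 3]

r, p_pearson = pearsonr(n_instances, n_best_features)
rho, p_spearman = spearmanr(n_instances, n_best_features)
print(f"Pearson r = {r:.2f} (p = {p_pearson:.2f}), Spearman rho = {rho:.2f} (p = {p_spearman:.2f})")
```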
We also investigated whether a set of features is generalizable from one subject system to another. Analyzing the list of best features, we can see that for almost all the subject systems, even though the number of features in different best feature lists are the same, the feature identifiers are different. For example, though there are 12 features available in the case of GS features, only 3 to 4 features can give the best BIC detection performance, and they contain at least one or more distinct features among various available subject systems. Similarly, different sets containing only 6 to 12 features can provide the best result from the thousands of features when considering the TS and TP-based features.
These findings emphasize the importance of selecting an appropriate number of most relevant features specialized for each subject system to identify BIC based on the data instance size of the software project available to train and test a machine learning model. A fixed set of features might not provide the best results for all the software projects to identify bug-inducing software commits using ML models.
3.4.2 Automatically Labeled Datasets of Table 2
Our investigation evaluates whether source code syntax pattern (TP) and sequence (TS) based features are generalizable toward different software projects and ML-based classification models. We apply BIC detection using five classification models on the automatically labeled datasets (OpenStack and QT) of Table 2. Our results are in Figure 4, where we compare the percentage (%) of improvement in the F1 Scores of BIC detection using our extracted features against the GS features. Our results show that in all five classification models and both subject systems, our proposed features improve the F1 Scores. The highest improvement (20%) in F1 Score was obtained using the Random Forest (RF) classifier on the QT subject system with the GS+TP features. All the other feature combinations also improved the F1 Scores compared to the GS features using the RF classifier in both the OpenStack and QT subject systems. Although in some cases the Perceptron (PCT) and K-Nearest Neighbors (KNN) algorithms show a decrease in F1 Scores in both subject systems, they still show improvements in F1 Scores with other feature combinations. Therefore, we can conclude that in most cases our proposed features are generalizable toward different ML-based classification models and software projects, detecting BICs with higher F1 Scores than the GS features.
3.5 Answer to the RQ4
Do the features extracted from developers’ coding syntax patterns enhance the explainability of BIC detection from software systems?
Pornprasit et al. (2021) published a state-of-the-art tool named PyExplainer to explain the underlying conditions of machine learning predictions. We use their tool to explain our BIC detection using the conventional (GS) and our proposed Token Pattern (TP) features. The most frequent conditions reported by PyExplainer for the QT subject system are shown in Table 8. We show the top five most frequent feature conditions from the GS and TP features that detected buggy commits using the Random Forest (RF) algorithm; all the other conditions are publicly available in the GitHub repository of our investigation. The table shows that it is very difficult to derive proper reasoning for buggy commit detection from the conditions identified by PyExplainer for the GS features. For example, the top five most frequent conditions obtained from PyExplainer are ranges of values of GS features such as Awareness, Age, Lines Added, and Number of Functions, but they do not give any specific reasoning about what a developer can do to avoid bug proneness in a software project under a given condition. A software development environment with a set of developers may have a fixed set of given conditions, such as developer awareness of the software project, age of the developer, and number of lines to be added or changed. These given conditions are difficult to modify instantly or within a short period. Therefore, if we detect buggy software commits using those hard-to-change conditions as feature values, the detection cannot provide meaningful or explainable reasoning to the software developers about how they can avoid the bug proneness of a particular change.
Table 8: Top five most frequent PyExplainer conditions for commits detected as buggy in the QT subject system using (i) code-churn-based (GS) and (ii) code syntax pattern (TP) based features.

Feature Names | PyExplainer Condition
---|---
i. Code Churn Based Features |
Awareness | 0.0050 ≤ asawr ≤ 0.0250
Awareness | 0.1149 ≤ osawr ≤ 0.1150
Age | 694101.53 ≥ age ≥ 647989.44
Line Added | 57.59 ≤ la ≤ 621.30
Number of Functions | 4.14 ≥ nf ≥ 2.26
ii. Code Syntax Pattern (TP) Based Features |
Declaring a Variable Name | 1.84 ≤ decl_name ≤ 15.32
IF-Condition Expression with Operator | 0.40 ≤ if_condition_expr_operator ≤ 7.81
Expression with Operators | 0.47 ≤ expr_operator ≤ 2.42
Function Call Argument with Variable Name | 0.375 ≤ expr_call_argument_expr_name ≤ 7.20
Expression with Operators | 0.47 ≤ expr_operator ≤ 24.27
On the other hand, the PyExplainer results using the Token Pattern (TP) features show that the top five most frequent conditions for detecting buggy commits involve the presence of specific source code syntax patterns (i.e., the number of declared variable names, expressions with operators in IF-conditions, different levels of nested expressions and function calls, etc.). In any given software development environment, these source code syntax conditions can readily be revised to avoid or minimize the bug proneness of a software project. These scenarios provide evidence that Token Patterns (TP) as feature values in an ML-based classification model enhance the explainability of BIC detection from software systems compared to the GS features. We plan to investigate this scenario further with other subject systems in the future.
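To illustrate how such threshold conditions over syntax-pattern features can arise, the following is a simplified stand-in for a rule-style explainer, not PyExplainer's actual algorithm: it fits a shallow surrogate decision tree to the Random Forest's predictions and prints the learned threshold conditions on the features.

```python
# Simplified stand-in for rule-style explanations (not PyExplainer itself):
# a shallow decision tree mimics the Random Forest's predictions, and each
# path in the printed tree is a conjunction of threshold conditions such as
# "decl_name <= 1.84", similar in spirit to the rows of Table 8.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

def surrogate_rules(X, y, feature_names, max_depth=3):
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    surrogate = DecisionTreeClassifier(max_depth=max_depth,
                                       random_state=0).fit(X, rf.predict(X))
    return export_text(surrogate, feature_names=list(feature_names))
```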
4 Threats to Validity
4.1 Subject Systems & Programming Languages
Subject systems in this study are written in the Java, C++, and Python programming languages. Therefore, the results of this study may not generalize to subject systems written in other programming languages not used in this investigation. However, the subject systems used here are widely used in similar studies and cover a diverse range of domains, which gives us confidence in the outcome. Furthermore, since we wanted to compare our newly introduced features (TS, TP) against the most common features (GS) by exploiting the manually and automatically labeled data of eight such systems, we used these widely used, diverse Java, C++, and Python systems. For systems written in other programming languages, we note that the technique may provide similar results for similar programming paradigms. Of course, further investigation is necessary, which remains our future work.
4.2 Dataset Labeling for Buggy and Non-buggy Commits
Wen et al. (2019) reported that automatically labeling buggy and non-buggy commits leaves many incorrect labels. On the other hand, it is difficult to get enough data instances to apply ML classification models to manually labeled buggy and non-buggy commits. For example, only 642 data instances of buggy and non-buggy commits are available across the six software projects in the manually labeled dataset of Wen et al. (2019) used in this study. Both a small number of data instances and incorrectly labeled data instances have a detrimental effect on the generalizability of results obtained by ML models. Therefore, we used six manually and two automatically labeled software projects of buggy and non-buggy commits and performed our investigation in two steps. First, we apply our technique to the manually labeled datasets of buggy and non-buggy commits by Wen et al. (2019) and show that our proposed features improve the performance of the ML model in detecting buggy commits. We then apply five different ML classification models to two automatically labeled datasets having 4,014 data instances. In both experimental setups, we show that using our proposed feature values improves the performance of ML models in detecting buggy commits. Our goal is not to contradict the results of any state-of-the-art JIT defect prediction method. Instead, we applied five different machine learning models and eight different feature combinations and showed that, for almost all the ML algorithms, using our proposed features improves the detection performance measures of bug-inducing commits. Therefore, we believe selecting other datasets for this investigation would yield similar findings. Nevertheless, in the future, we want to extend this study using other software projects in different programming languages to substantiate this conclusion.
4.3 Machine Learning Classification Models
We also tried applying K-Nearest Neighbour (KNN) and Logistic Regression (LR) to detect BIC from the manually labeled datasets in Table 1. However, as these datasets contain far fewer data instances than features (different combinations of GS, TS, and TP), KNN and LR failed to provide reliable precision and recall estimates for the different implementations of BIC detection. Therefore, we used only the Random Forest (RF) classifier to identify BIC from the manually labeled datasets and utilized four additional ML classification models on the automatically labeled datasets in Table 2. As different ML algorithms work differently, it may be questionable whether our results obtained using the manually labeled datasets generalize across ML algorithms. Hall et al. (2012) reported that an ML model's performance mainly depends on the nature of the data, the quality of the features, and proper parameter tuning. This study's main goal was to compare ML-based model performance using the GS, TS, and TP features and their different combinations. We also applied five different classification models to the larger datasets shown in Table 2. The results in Figure 4 show that, in most cases, using the TS and TP (source code syntax pattern) based features improves BIC detection performance across all the ML models. If a different ML algorithm were used, it should affect all the feature combinations equally; therefore, the comparison observed in this study should hold for different ML-based detection models. Nevertheless, more investigations using larger datasets and other ML models are required to verify this, which we will do in future studies.
5 Related Work
Studies related to detecting or predicting Bug Inducing Commits (BICs) have attracted the attention of researchers due to their massive impact on software systems. Some studies Kim et al. (2008); Śliwerski et al. (2005b); Kim et al. (2006b); Eyolfson et al. (2011) tried to find the changes in a software system that are responsible for the first introduction of a bug, while other studies Asaduzzaman et al. (2012); Bavota et al. (2012); Bernardi et al. (2012); Canfora et al. (2011) identify bug-fixing commits and then link them to their corresponding bug-inducing commits using the SZZ algorithm Śliwerski et al. (2005b) and its variants da Costa et al. (2017); Davies et al. (2014); Kim et al. (2006b). Many studies have tried to understand the essential characteristics of bug-inducing commits using machine learning models Asaduzzaman et al. (2012); Bavota et al. (2012); Bernardi et al. (2012); Canfora et al. (2011); Ell (2013); Eyolfson et al. (2011); Kim and Whitehead (2006); Śliwerski et al. (2005a). Aversano et al. (2007) studied two relatively small Java systems and applied five machine learning algorithms with 10-fold cross-validation to predict whether a change is likely to be buggy, using only features extracted from the GitHub repository. Fukushima et al. (2014) and Kamei et al. (2013) worked on predicting defects in software systems, but neither used code quality measures in their prediction models. There are also other studies Kim et al. (2008, 2007); Mizuno and Hata (2013); Shivaji et al. (2013a); Yang et al. (2015); Kim et al. (2006b); Śliwerski et al. (2005b); Wen et al. (2016) that predict the likelihood of a commit being bug-inducing, but no earlier study explored the impact of code quality measures or developers' source code patterns in this research domain. We tried to minimize this gap by proposing and testing a method to encode software commits using developers' coding syntax patterns during software revisions, and our study compares the results of detecting bug-inducing commits using the conventional features of existing methods against our proposed source code pattern-based features.
Shivaji et al. (2013a) applied various feature selection techniques that are commonly used in classification-based bug prediction. These techniques drop unnecessary features until optimal classification performance is achieved; the total number of features used for training is substantially reduced, often to fewer than 10 percent of the initial number. We applied a similar technique using the Recursive Feature Elimination (RFE) implementation of the SciKit Learn Pedregosa et al. (2011) Python library to reduce the number of features required to obtain the best BIC detection result.
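A minimal sketch of this step, assuming a numeric feature matrix X and labels y; the Random Forest estimator and the number of features to keep are illustrative choices rather than the exact settings of our experiments.

```python
# Hedged sketch of feature reduction with scikit-learn's RFE, as referenced
# above; the estimator and the target number of features are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def reduce_features(X, y, keep=30):
    selector = RFE(estimator=RandomForestClassifier(random_state=0),
                   n_features_to_select=keep)
    selector.fit(X, y)
    # Returns the reduced feature matrix and the boolean mask of kept columns.
    return selector.transform(X), selector.support_
```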
Wu et al. (2018) applied Logistic Regression, Decision Trees, Naive Bayes, and Bayesian Network (BayesNet) models to find crash-inducing changes in several subject systems. Our work complements the work of Shivaji et al. (2013a) in several ways. Although they added some features from source code metrics and a BOW+ implementation on the source code, they did not focus on developers' coding syntax patterns, nor did they normalize the source code by tokenization Jimenez et al. (2018) to find the optimal feature values. Our study tried to overcome both limitations by extracting features from the tokenized XML representation of source code fragments.
Our work also complements prior similar studies in several ways. First, most prior studies tried to detect the likelihood of a commit being bug-inducing or clean (non-bug-inducing) using only the statistical measures extracted from the GitHub repositories of the respective subject systems. We added two extra types of features (Token Sequence-TS and Token Pattern-TP) by processing the source code of commit patches. The addition of source code pattern-based features not only provides the likelihood of a commit being buggy but also indicates which types of source code patterns are more likely to induce bugs in software systems. We also executed the BIC detection technique without those features and compared the effect of adding our features; the comparison shows that BIC detection performance improves with the addition of the source code TS and TP-based features. Second, we applied the Recursive Feature Elimination (RFE) algorithm to investigate the importance of each feature in an ML-based BIC detection model. Third, we investigated an approach to find the best features by starting with the top-ranked feature and adding only those features that improve detection performance; after completing a full iteration over all the features, we obtained a relatively small list of features providing better BIC detection results. We believe these additions make this study distinct from the other related works.
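The following sketch outlines the third step under our assumptions about data layout (a NumPy feature matrix X, labels y, and RFE-ranked feature indices); it keeps a feature only if adding it improves the cross-validated F1 score.

```python
# Illustrative sketch of the greedy feature search described above: iterate
# over the RFE-ranked features and keep a feature only if adding it improves
# the cross-validated F1 score. Helper names are ours, not from the tool chain.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def greedy_feature_list(X, y, ranked_indices):
    selected, best_f1 = [], 0.0
    for idx in ranked_indices:                     # best-ranked feature first
        candidate = selected + [idx]
        f1 = cross_val_score(RandomForestClassifier(random_state=0),
                             X[:, candidate], y, cv=5, scoring="f1").mean()
        if f1 > best_f1:                           # keep only improving features
            selected, best_f1 = candidate, f1
    return selected, best_f1
```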
This paper is an improved and rewritten version of a chapter of the M.Sc. thesis Nadim (2020) defended by the first author in September 2020. The thesis is available online in HARVEST, the official Graduate Theses and Dissertations repository of the University of Saskatchewan, Saskatoon, Canada, and includes two more related studies on detecting and reducing bug introduction in software systems Nadim et al. (2020). Our key focus in these studies is finding new tools and techniques that improve the detection and prevention of probable bug-inducing commits in software systems compared to the existing related studies, and this study is an essential step towards that goal. We proposed two new types of feature values (Token Sequence-TS and Token Pattern-TP), which can play an essential role in detecting bug-inducing commits using machine learning models. Our proposed TS and TP-based source code encoding can also open new research directions in source code representation and analysis related to software bug-proneness and bug-fix patterns Yue et al. (2017); Kim et al. (2006a); Vieira et al. (2019), which we intend to pursue in future research.
6 Conclusion & Future Work
Our investigation introduced two new types of features: one represents the token sequence of source code (TS), and the other represents the hierarchy of those tokens (Token Pattern-TP). We utilized them as feature values for detecting bug-inducing commits (BICs) from eight subject systems; the TS and TP feature values thus represent the developers' coding syntax styles/patterns. We extracted thousands of bug-inducing and bug-fixing source code patterns (TP) and their sequences (TS) from the history of these software projects and applied feature prioritization. We then applied the Random Forest classifier to detect BICs from the six manually labeled open-source Java projects using these TS and TP-based feature values alongside the conventionally used GitHub statistics (GS) based feature values. We also applied five different machine learning-based classification models to the two subject systems containing automatically labeled bug-inducing and clean commits. Finally, we used the BIC detection results obtained with the GS-based feature values as the baseline and compared them with the TS and TP-based results to determine the performance improvements in detecting buggy commits. Our results and analysis of four research questions show that the features representing developers' coding syntax styles/patterns (TS and TP) can increase both the performance of detecting bug-inducing commits and the explainability of the detected results compared to the conventional features (GS) in ML-based detection models.
We also performed a significance test of the obtained results using the Wilcoxon Signed Rank Test and found that the increase in F1 scores is statistically significant for the manually labeled datasets. Therefore, we conclude that developers' coding syntax styles/patterns could be crucial in introducing bugs into software systems, impacting software quality and reliability. New tools and techniques are thus required to help software developers avoid risky coding syntax patterns and sequences that have been found to induce bugs earlier in the history of software systems.
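For reference, a minimal sketch of this significance test using SciPy's paired Wilcoxon signed-rank test; the F1 values shown are placeholders, not our reported results.

```python
# Paired Wilcoxon signed-rank test on F1 scores from the same subject systems,
# once with the GS baseline and once with GS+TS+TP. Values are placeholders.
from scipy.stats import wilcoxon

f1_baseline = [0.61, 0.55, 0.68, 0.59, 0.63, 0.57]   # GS features (placeholder)
f1_proposed = [0.66, 0.60, 0.74, 0.65, 0.69, 0.62]   # GS+TS+TP (placeholder)
statistic, p_value = wilcoxon(f1_baseline, f1_proposed)
print(f"Wilcoxon statistic={statistic:.2f}, p={p_value:.4f}")  # p < 0.05 => significant
```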
Another observation in this study, with the manually labeled datasets from the six subject systems in Table 1, is that each software system requires a unique set of features to provide the best detection results when the dataset is not large enough, and the size of that best feature list also differs. From this observation, we conclude that the best feature set and its size depend on how the software system is maintained and how many data samples are available. Different developers maintain different software systems; their coding syntax styles differ, so their exposed patterns (TP) and sequences (TS) also differ, which might be the key reason for this observation. However, using all the TS and TP features improves BIC detection for both subject systems in Table 2, whose datasets have larger numbers of data instances. Therefore, if we have larger datasets for training the model with TS and TP-based features, we get improved buggy commit detection performance without prioritizing the features.
In our investigation, we use datasets from software projects written in three programming languages (Java, C++, and Python), and in all of them our proposed source code syntax-based features improve buggy commit detection performance compared to the conventionally used GS features. Our features also perform better when we use a deep learning-based feature extraction technique built on a Deep Belief Network. We also investigate the explainability of detected buggy commits using the dataset of one software project, QT, and find that including the TP features provides better reasoning about why a commit is detected as buggy than the GS features. In future studies, we want to extend the explainability analysis to software projects in different programming languages, which could provide important insights for automatically fixing buggy commits using the information about how a commit is identified as buggy by machine learning models. Finding the reasons behind bug-inducing commits may also lead to finding and automatically correcting buggy patterns by utilizing historical similar (cloned) fixes in the codebase. We also plan to exploit these findings to build IDE-based tools and libraries so that developers can deal with such issues right away in the IDE. As part of this, we will start with an IDE-based clone detection environment Zibran and Roy (2012) with different visualization support such as clone visualizations Asaduzzaman et al. (2011), attributed with bug-inducing patterns, and integrate the other clone detection tools used in our studies. We also plan to explore bug-inducing commits in the context of exception handling Asaduzzaman et al. (2016) and bug localization Rahman and Roy (2018) and see to what extent we could integrate them into the IDE support. Finally, we plan to study the relationships between bug-inducing patterns and several other related phenomena, such as bug propagation through code cloning Mondal et al. (2019, 2017a), bug-proneness and late propagation tendency in clones Mondal et al. (2018), replication of bugs in clones and micro-clones Islam et al. (2019, 2019, 2017), and clones with high possibilities of containing bugs Mondal et al. (2017b).
Acknowledgments This research is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery grants, and by an NSERC Collaborative Research and Training Experience (CREATE) grant.
Declarations
- Conflict of interest/Competing interests. The authors declare that they have no conflict of interest.
- Data availability. The datasets and source files generated during and/or analyzed during this study are available in our GitHub repository (https://github.com/mnadims/bicDetectionSF/) for readers to investigate and facilitate any replication study.
References
- albertbup (2017) albertbup (2017) A python implementation of deep belief networks built upon numpy and tensorflow with scikit-learn compatibility. URL https://github.com/albertbup/deep-belief-network
- Asaduzzaman et al. (2011) Asaduzzaman M, Roy CK, Schneider KA (2011) Viscad: Flexible code clone analysis support for nicad. In: Proceedings of the 5th International Workshop on Software Clones (IWSC’11). Association for Computing Machinery, New York, NY, USA, pp 77–78
- Asaduzzaman et al. (2012) Asaduzzaman M, Bullock MC, Roy CK, et al. (2012) Bug introducing changes: A case study with android. In: Proceedings of the 9th IEEE Working Conference on Mining Software Repositories (MSR’12), pp 116 – 119
- Asaduzzaman et al. (2016) Asaduzzaman M, Ahasanuzzaman M, Roy CK, et al. (2016) How developers use exception handling in java? In: Proceedings of the IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR), pp 516–519
- Aversano et al. (2007) Aversano L, Cerulo L, Del Grosso C (2007) Learning from bug-introducing changes to prevent fault prone code. In: Proceedings of the 9th International Workshop on Principles of Software Evolution: In Conjunction with the 6th ESEC/FSE Joint Meeting (IWPSE’07), pp 19 – 26
- Bavota et al. (2012) Bavota G, De Carluccio B, De Lucia A, et al. (2012) When does a refactoring induce bugs? an empirical study. In: Proceedings of the IEEE 12th International Working Conference on Source Code Analysis and Manipulation (SCAM’12), pp 104 – 113
- Bernardi et al. (2012) Bernardi ML, Canfora G, Di Lucca GA, et al. (2012) Do developers introduce bugs when they do not communicate? the case of eclipse and mozilla. In: Proceedings of the 16th European Conference on Software Maintenance and Reengineering (CSMR’12), pp 139 – 148
- Borg et al. (2019) Borg M, Svensson O, Berg K, et al. (2019) Szz unleashed: an open implementation of the szz algorithm - featuring example usage in a study of just-in-time bug prediction for the jenkins project. In: Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE’19)
- Canfora et al. (2011) Canfora G, Ceccarelli M, Cerulo L, et al. (2011) How long does a bug survive? an empirical study. In: Proceedings of the 18th Working Conference on Reverse Engineering (WCRE’11), pp 191 – 200
- Casalnuovo et al. (2019) Casalnuovo C, Lee K, Wang H, et al. (2019) Do people prefer "natural" code? CoRR
- Cavnar and Trenkle (1994) Cavnar W, Trenkle J (1994) N-gram-based text categorization. Ann Arbor MI 48113(2):161 – 175
- Cordy and Roy (2011) Cordy JR, Roy CK (2011) The nicad clone detector. In: Proceedings of the IEEE International Conference on Program Comprehension (ICPC’11), pp 219 – 220
- da Costa et al. (2017) da Costa DA, McIntosh S, Shang W, et al. (2017) A framework for evaluating the results of the szz approach for identifying bug-introducing changes. IEEE Transactions on Software Engineering 43(7):641 – 657
- Davies et al. (2014) Davies S, Roper M, Wood M (2014) Comparing text-based and dependence-based approaches for determining the origins of bugs. Journal of Software: Evolution and Process 26(1):107–139
- Developers.Google (2020a) Developers.Google (2020, accessed August 26, 2021) Classification: Precision and Recall - Machine Learning Crash Course. URL https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
- Developers.Google (2020b) Developers.Google (2020, accessed August 26, 2021) Classification: ROC Curve and AUC - Machine Learning Crash Course. URL https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
- Dodge (2008) Dodge Y (2008) Spearman Rank Correlation Coefficient, Springer New York, New York, NY, pp 502–505. 10.1007/978-0-387-32833-1_379, URL https://doi.org/10.1007/978-0-387-32833-1_379
- Ell (2013) Ell J (2013) Identifying failure inducing developer pairs within developer networks. In: Proceedings of the 35th International Conference on Software Engineering (ICSE’13), pp 1471 – 1473
- Eyolfson et al. (2011) Eyolfson J, Tan L, Lam P (2011) Do time of day and developer experience affect commit bugginess? In: Proceedings of the 8th Working Conference on Mining Software Repositories (MSR’11), pp 153 – 162
- Fukushima et al. (2014) Fukushima T, Kamei Y, McIntosh S, et al. (2014) An empirical study of just-in-time defect prediction using cross-project models. In: Proceedings of the 11th Working Conference on Mining Software Repositories (MSR’14), pp 172 – 181
- Goues et al. (2019) Goues CL, Pradel M, Roychoudhury A (2019) Automated program repair. Communications of the ACM p 56–65
- Gu et al. (2010) Gu Z, Barr ET, Hamilton DJ, et al. (2010) Has the bug really been fixed? In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE’10), pp 55 – 64
- Hall et al. (2012) Hall T, Beecham S, Bowes D, et al. (2012) A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering 38(6):1276 – 1304
- Hinton (2007) Hinton GE (2007) Learning multiple layers of representation. Trends in Cognitive Sciences 11(10):428–434
- Hinton and Salakhutdinov (2006) Hinton GE, Salakhutdinov R (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
- Hinton et al. (2006) Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554
- Hoang et al. (2019) Hoang T, Khanh Dam H, Kamei Y, et al. (2019) Deepjit: An end-to-end deep learning framework for just-in-time defect prediction. In: Proceedings of the IEEE/ACM 16th International Conference on Mining Software Repositories (MSR’19), pp 34 – 45
- Hoang et al. (2020) Hoang T, Kang HJ, Lo D, et al. (2020) Cc2vec: Distributed representations of code changes. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering (ICSE’20), pp 518–529
- Islam et al. (2017) Islam JF, Mondal M, Roy CK, et al. (2017) Comparing Software Bugs in Clone and Non-clone Code: An Empirical Study. World Scientific 27(9-10):1507–1527
- Islam et al. (2019) Islam JF, Mondal M, Roy CK (2019) A comparative study of software bugs in micro-clones and regular code clones. In: Proceedings of the International Conference on Software Analysis, Evolution, and Reengineering (SANER’19), pp 73 – 83
- Islam et al. (2019) Islam JF, Mondal M, Roy CK, et al. (2019) Comparing bug replication in regular and micro code clones. In: Proceedings of the IEEE International Conference on Program Comprehension (ICPC’19), pp 81 – 92
- Jason Brownlee (2017) Jason Brownlee (2017) A gentle introduction to the bag-of-words model. https://machinelearningmastery.com/gentle-introduction-bag-words-model/, [Online; accessed 28-September-2021]
- Jeffrey et al. (2009) Jeffrey D, Feng M, Neelam Gupta, et al. (2009) Bugfix: A learning-based tool to assist developers in fixing bugs. In: Proceedings of the IEEE 17th International Conference on Program Comprehension, pp 70–79
- Jiang et al. (2018) Jiang J, Xiong Y, Zhang H, et al. (2018) Shaping program repair space with existing patches and similar code. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’18), p 298 – 309
- Jimenez et al. (2018) Jimenez M, Maxime C, Le Traon Y, et al. (2018) On the impact of tokenizer and parameters on n-gram based code analysis. In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp 437–448, 10.1109/ICSME.2018.00053
- Kamei et al. (2013) Kamei Y, Shihab E, Adams B, et al. (2013) A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering pp 757 – 773
- Kamei et al. (2016) Kamei Y, Fukushima T, Mcintosh S, et al. (2016) Studying just-in-time defect prediction using cross-project models. Empirical Software Engineering p 2072–2106
- Kim et al. (2013) Kim D, Nam J, Song J, et al. (2013) Automatic patch generation learned from human-written patches. In: Proceedings of the 2013 International Conference on Software Engineering (ICSE ’13), p 802–811
- Kim and Whitehead (2006) Kim S, Whitehead EJJr. (2006) How long did it take to fix bugs? In: Proceedings of the International Workshop on Mining Software Repositories (MSR’06), pp 173 – 174
- Kim et al. (2006a) Kim S, Pan K, Whitehead EEJ (2006a) Memories of bug fixes. In: Proceedings of the 14th ACM SIGSOFT International Symposium on Foundations of Software Engineering. Association for Computing Machinery, New York, NY, USA, SIGSOFT ’06/FSE-14, p 35–45, URL https://doi.org/10.1145/1181775.1181781
- Kim et al. (2006b) Kim S, Zimmermann T, Pan K, et al. (2006b) Automatic identification of bug-introducing changes. In: Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE’06), pp 81 – 90
- Kim et al. (2007) Kim S, Zimmermann T, Whitehead Jr. EJ, et al. (2007) Predicting faults from cached history. In: Proceedings of the 29th International Conference on Software Engineering (ICSE’07), pp 489 – 498
- Kim et al. (2008) Kim S, Whitehead, Jr. EJ, Zhang Y (2008) Classifying software changes: Clean or buggy? IEEE Transactions on Software Engineering 34(2):181–196
- Kirch (2008) Kirch W (ed) (2008) Pearson’s Correlation Coefficient, Springer Netherlands, Dordrecht, pp 1090–1091. 10.1007/978-1-4020-5614-7_2569, URL https://doi.org/10.1007/978-1-4020-5614-7_2569
- Le and Mikolov (2014) Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML’14), p II–1188–II–1196
- Li et al. (2020a) Li K, Xiang Z, Chen T, et al. (2020a) Understanding the automated parameter optimization on transfer learning for cross-project defect prediction: An empirical study. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering (ICSE’20), pp 566 – 577
- Li et al. (2020b) Li Y, Wang S, Nguyen TN (2020b) Dlfix: Context-based code transformation learning for automated program repair. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering (ICSE’20), pp 602 – 614
- Liu et al. (2019) Liu K, Koyuncu A, Kim D, et al. (2019) Tbar: Revisiting template-based automated program repair. In: Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’19), p 31–42
- Martinez and Monperrus (2015) Martinez M, Monperrus M (2015) Mining software repair models for reasoning on the search space of automated program fixing. Empirical Software Engineering p 176–205
- Martinez et al. (2014) Martinez M, Weimer W, Monperrus M (2014) Do the fix ingredients already exist? an empirical inquiry into the redundancy assumptions of program repair approaches. In: Companion Proceedings of the 36th International Conference on Software Engineering, p 492–495
- Mizuno and Hata (2013) Mizuno O, Hata H (2013) A metric to detect fault-prone software modules using text filtering. International Journal of Reliability and Safety 7(1):17 – 31
- Mondal et al. (2017a) Mondal M, Roy CK, Schneider KA (2017a) Bug propagation through code cloning: An empirical study. In: Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), pp 227–237
- Mondal et al. (2017b) Mondal M, Roy CK, Schneider KA (2017b) Identifying code clones having high possibilities of containing bugs. In: Proceedings of the IEEE/ACM 25th International Conference on Program Comprehension (ICPC), pp 99–109
- Mondal et al. (2018) Mondal M, Roy CK, Schneider KA (2018) Bug-proneness and late propagation tendency of code clones: A comparative study on different clone types. Journal of Systems and Software 144:41 – 59
- Mondal et al. (2019) Mondal M, Roy B, Roy CK, et al. (2019) An empirical study on bug propagation through code cloning. Journal of Systems and Software 158:110407
- Nadim (2020) Nadim M (2020) Investigating the techniques to detect and reduce bug inducing commits during change operations in software systems. Master’s thesis, University of Saskatchewan, Saskatoon, Canada, URL https://harvest.usask.ca/handle/10388/13125
- Nadim et al. (2020) Nadim M, Mondal M, Roy CK (2020) Evaluating performance of clone detection tools in detecting cloned cochange candidates. In: Proceedings of the 14th International Workshop on Software Clones (IWSC’20), pp 15 – 21
- Nayrolles and Hamou-Lhadj (2018) Nayrolles M, Hamou-Lhadj A (2018) Clever: Combining code metrics with clone detection for just-in-time fault prevention and resolution in large industrial projects. In: Proceedings of the IEEE/ACM 15th International Conference on Mining Software Repositories (MSR’18), pp 153 – 164
- Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, et al. (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
- Pei et al. (2014) Pei Y, Furia CA, Nordio M, et al. (2014) Automatic program repair by fixing contracts. In: Proceedings of Fundamental Approaches to Software Engineering, pp 246–260
- Pornprasit and Tantithamthavorn (2021) Pornprasit C, Tantithamthavorn C (2021) Jitline: A simpler, better, faster, finer-grained just-in-time defect prediction. In: Proceedings of the International Conference on Mining Software Repositories (MSR), p To Appear
- Pornprasit et al. (2021) Pornprasit C, Tantithamthavorn C, Jiarpakdee J, et al. (2021) Pyexplainer: Explaining the predictions of just-in-time defect models. In: 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp 407–418, 10.1109/ASE51524.2021.9678763
- Rahman and Roy (2018) Rahman MM, Roy CK (2018) Improving ir-based bug localization with context-aware query reformulation. In: Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’18). Association for Computing Machinery, p 621–632
- Rosen et al. (2015) Rosen C, Grawi B, Shihab E (2015) Commit guru: Analytics and risk prediction of software commits. In: Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE’15), pp 966 – 969
- Rosner et al. (2006) Rosner B, Glynn RJ, Lee MLT (2006) The wilcoxon signed rank test for paired comparisons of clustered data. Biometrics 62(1):185–192
- Shivaji et al. (2013a) Shivaji S, James Whitehead E, Akella R, et al. (2013a) Reducing features to improve code change-based bug prediction. IEEE Transactions on Software Engineering 39(4):552 – 569
- Shivaji et al. (2013b) Shivaji S, James Whitehead E, Akella R, et al. (2013b) Reducing features to improve code change-based bug prediction. IEEE Transactions on Software Engineering 39(4):552–569
- Śliwerski et al. (2005a) Śliwerski J, Zimmermann T, Zeller A (2005a) Hatari: Raising risk awareness. ACM SIGSOFT Software Engineering Notes 30(5):107 – 110
- Śliwerski et al. (2005b) Śliwerski J, Zimmermann T, Zeller A (2005b) When do changes induce fixes? ACM SIGSOFT Software Engineering Notes 30(4):1 – 5
- Tabassum et al. (2020) Tabassum S, Minku LL, Feng D, et al. (2020) An investigation of cross-project learning in online just-in-time software defect prediction. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering (ICSE’20), pp 554 – 565
- Tan et al. (2015) Tan M, Tan L, Dara S, et al. (2015) Online defect prediction for imbalanced data. In: Proceedings of the 37th International Conference on Software Engineering (ICSE’15), pp 99 – 108
- Taunk et al. (2019) Taunk K, De S, Verma S, et al. (2019) A brief review of nearest neighbor algorithm for learning and classification. In: 2019 International Conference on Intelligent Computing and Control Systems (ICCS), pp 1255–1260, 10.1109/ICCS45141.2019.9065747
- Vieira et al. (2019) Vieira R, da Silva A, Rocha L, et al. (2019) From reports to bug-fix commits: A 10 years dataset of bug-fixing activity from 55 apache’s open source projects. In: Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering. Association for Computing Machinery, New York, NY, USA, PROMISE’19, p 80–89, URL https://doi.org/10.1145/3345629.3345639
- Virtanen et al. (2020) Virtanen P, Gommers R, Oliphant T, et al. (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17:261–272
- Wen et al. (2016) Wen M, Wu R, Cheung SC (2016) Locus: Locating bugs from software changes. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE’16), pp 262 – 273
- Wen et al. (2019) Wen M, Wu R, Liu Y, et al. (2019) Exploring and exploiting the correlations between bug-inducing and bug-fixing commits. In: Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’19), pp 326 – 337
- Wen et al. (2020) Wen M, Liu Y, Cheung SC (2020) Boosting automated program repair with bug-inducing commits. In: Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER’20), pp 77 – 80
- Wilcoxon (1945) Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1(6):80 – 83. URL http://www.jstor.org/stable/3001968
- Wu et al. (2018) Wu R, Wen M, Cheung SC, et al. (2018) Changelocator: locate crash-inducing changes based on crash reports. Empirical Software Engineering 23(5):2866–2900
- Xin and Reiss (2019) Xin Q, Reiss SP (2019) Better code search and reuse for better program repair. In: Proceedings of the 6th International Workshop on Genetic Improvement (GI ’19), p 10–17
- Yang et al. (2015) Yang X, Lo D, Xia X, et al. (2015) Deep learning for just-in-time defect prediction. In: Proceedings of the IEEE International Conference on Software Quality, Reliability and Security (QRS’15), pp 17 – 26
- Yin et al. (2011) Yin Z, Yuan D, Zhou Y, et al. (2011) How do fixes become bugs? In: Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering (ESEC/FSE ’11), pp 26 – 36
- Yue et al. (2017) Yue R, Meng N, Wang Q (2017) A characterization study of repeated bug fixes. In: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp 422–432, 10.1109/ICSME.2017.16
- Zeng et al. (2021) Zeng Z, Zhang Y, Zhang H, et al. (2021) Deep just-in-time defect prediction: How far are we? In: Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. Association for Computing Machinery, New York, NY, USA, ISSTA 2021, p 427–438, 10.1145/3460319.3464819, URL https://doi.org/10.1145/3460319.3464819
- Zhao and Mao (2018) Zhao R, Mao K (2018) Fuzzy bag-of-words model for document representation. IEEE Transactions on Fuzzy Systems 26(2):794–804
- Zibran and Roy (2012) Zibran MF, Roy CK (2012) Ide-based real-time focused search for near-miss clones. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, New York, NY, USA, SAC 2012, pp 1235–1242