Defect Prediction with Content-based Features
Abstract
Traditional defect prediction approaches often use metrics that measure the complexity of the design or implementing code of a software system, such as the number of lines of code in a source file. In this paper, we explore a different approach based on the content of source code. Our key assumption is that the source code of a software system contains information about its technical aspects, and those aspects might have different levels of defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source code file could be used to predict its defects. We have performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics and ii) the use of feature selection, reduction, and combination further improves the prediction performance.
keywords:
Defect prediction, text analysis, code analysis
1 Introduction
Software defects occur frequently in software development and often lead to costly and time-consuming activities to find and fix them. Gallaher and Kropp (2002) report that software defects cost the US economy nearly $60 billion a year. In addition, Hailpern and Santhanam (2002) found that finding and fixing them accounts for 50 - 75% of the total development cost in a software project. The later a defect is detected in the life cycle of a software product, the higher the cost and effort needed to fix it and re-deploy the fixed version of that product back to the field.
Many methods, techniques, and tools have been developed to support the early detection of software defects. One important line of research is defect prediction. Defect prediction approaches aim to identify the most defect-prone modules (binaries, source files, classes, or functions) in a given software system. Such prediction results can help software engineers focus their manual defect detection effort, such as code review or testing, on modules with a higher likelihood of success, thus improving the effectiveness and reducing the cost of their activities. Extensive literature reviews on existing defect prediction approaches can be found in Hall et al. (2012); Shihab (2012).
Most research on defect prediction focuses on the factors and metrics that could be used to predict defects. Often called predictors or features, they are used as input to a prediction model which outputs the predicted number of undiscovered defects in a software module (regression models) or whether it is defective (classification models). These factors are general in nature and reflect common beliefs about software systems.
Traditional defect prediction approaches for source code often use metrics measuring the complexity of the design or the implementing code of a software system. For example, the most commonly used code metric is the number of lines of code (LOC). Other frequently used object-oriented design metrics are the depth of inheritance tree (DIT), the number of children (NOC), the lack of cohesion in methods (LCOM), or the coupling between objects (CBO) of a class.
In this paper, we explore a new approach based on two key assumptions. The first one is that a software system often implements several groups of functionality, each of which might have different levels of defect-proneness. For example, JEdit is a subject system studied in this paper. It is a text editor with functionality for managing the graphical user interface (GUI), presenting documents, processing edit commands from users, searching text, managing files (e.g. loading, saving, parsing), etc. Our study of JEdit suggests that while the code for the GUI and editing commands is highly defect-prone (e.g. the most defective file is JEditTextArea.java with up to 45 post-release defects), the code for text search and parsing is much less defect-prone. Therefore, if we can infer the functionality implemented in a code module (e.g. a source file or a class) and the defect-proneness levels of such functionality from historical data, we can predict defects in that module.
The second assumption in our approach is that the functionality implemented in a code module could be inferred from its content, i.e. from identifiers, comments, annotations, string literals, keywords, embedded documentation, etc. For example, developers often name classes and methods using identifiers suggesting their functionality: JEdit has classes named JEditTextArea, OptionsDialog, and Buffer which clearly indicate the functionality they implement. In code comments, developers can also explain and discuss the functionality of their code, such as the implemented algorithms or the roles of variables and parameters.
Based on those ideas, we explore four new types of features extracted from code content for defect prediction. The first one includes textual terms extracted from all text elements in code, such as comments and identifiers. We use standard tokenizing and stemming techniques to extract those terms and use the bag-of-words model to represent them. For example, the identifier OptionsDialog is tokenized into two terms Options and Dialog, which are further stemmed into opt and dialog, respectively.
Although the textual content of source code could contain most of the information about the implemented functionality, the number of extracted terms can be large and noisy. Therefore, we use topic modeling Blei et al. (2003) as a feature reduction technique for the extracted terms. This produces topics, the second type of features explored in this paper. Prior studies suggest that topics extracted from the source code of a software system correspond to its technical concerns (e.g. functionality) and that those topics can be used to predict defects Nguyen et al. (2011). For example, in JDT, a compiler framework, code written for semantic analysis is more likely to have errors than code written for lexical analysis.
To further address the noise in textual features like terms and topics, we investigate programming-semantic features, including the data types of variables and objects and the packages containing those types. Compared to text features, types provide a higher level of abstraction. In object-oriented design, a particular data type, e.g. a class, is defined to perform certain tasks, which contribute to the actual functionality of the code using that type. For example, objects of type java.io.File represent particular files or folders. They provide methods to obtain information about files and to manipulate those files in the system. Hence, if a source file uses the java.io.File type, it likely works with the file system (e.g. creates, deletes, reads, or writes files) or performs file I/O functionality. In contrast, objects of type org.eclipse.jdt.core.dom.ASTParser provide functions to parse Java code into abstract syntax trees. Thus, if a source file contains objects of this type, it is likely to have parsing/compiling functionality. That means we could infer the functionality of a code unit based on the presence of particular data types.
As a large software system might contain thousands of data types, we consider package organization as a feature reduction technique for type features. In object-oriented programming, package organization can improve code modularity by grouping related modules (e.g. classes, source files) into packages. As related classes are grouped together, the resulting packages can provide higher levels of abstraction. For example, in the Java API libraries, package java.io provides input/output functions while package javax.sql provides data access and processing functions. Just as topic modeling infers the technical concerns of source code from textual features, package organization can be used to infer technical concerns from data type information. However, while topic modeling is an unsupervised learning task, package organization is done by humans, i.e. the developers who designed the target system; hence it is likely to be more accurate.
After extracting term, topic, type, and package features, we construct the feature vectors for a source file using the count of each term, the (log-transformed) counts of terms assigned to each topic, the presence of each type, and the presence of each package, respectively. Those vectors can further be combined into a unified vector covering all types of features.
As a software system can have large numbers of each of the four newly proposed feature types, we apply feature selection techniques to improve the prediction performance, in both accuracy and running time. For example, we only select term and type features that have high ranked correlation or mutual information with defects. We also apply principal component analysis (PCA), a standard feature reduction technique, to reduce the potential correlations among the selected features.
Because this paper focuses on the features rather than the models for defect prediction, we only use the simple linear regression model (LR) for the prediction task. It should be noted that other prediction models like logistic regression, decision tree, or support vector machine, can also be used.
We have conducted an extensive empirical evaluation on a public defect dataset including 42 releases of 15 real-world software systems. This evaluation contains more than 2,000 experiment runs, exercising different options of the prediction model, such as the number of selected term features, the number of extracted topics, the number of selected type features, etc. We also compared our features with the best reported traditional code metrics.
The results show that all our proposed features are predictive of defect-proneness and provide better prediction results than traditional code metrics like the number of lines of code (LOC) or the Chidamber-Kemerer (CK) metrics. More importantly, feature selection and reduction techniques could further improve prediction accuracy. Finally, combining all four types of features is better than using them alone.
The key contributions of this paper include:
1. New types of defect predictors including terms, topics, types, and packages extracted from source code, and
2. An empirical evaluation to compare those predictors with traditional code metrics.
In Section 2, we introduce our features in detail, including techniques for extracting them from source code and selecting the best ones. Section 3 describes our evaluation settings and Section 4 reports its results. Section 5 presents related work and conclusions appear last.
2 Approach
In this section, we describe in detail the extraction process for term, topic, type, and package features. We also discuss several feature selection and reduction techniques for those features. While term features can be extracted using simple text analysis techniques, topic features are inferred using LDA, a widely used topic modeling technique by Blei et al. (2003). Extracting types and packages is more complicated, involving code parsing and partial program analysis (PPA).
2.1 Extracting term features
Term features are extracted directly from the textual content of source code. First, raw code is tokenized using whitespace, numeric, and special characters as separators. Because our subject systems are all Java projects, each resulting token is further split using the Java naming convention (i.e. camel case). For example, token StringBuffer is split into two words String and Buffer; token JEditArea is split into three words, J, Edit, and Area. After that, words of length 1 like i or J are discarded. The remaining words are lowercased and stemmed using the standard Porter stemmer. For example, Buffer becomes buff and Condition becomes condit. Finally, the vector representing the term features of each source file is constructed using the bag-of-words model and weighted using the tf-idf scheme.
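The following sketch illustrates this pipeline in Python; it uses NLTK's Porter stemmer and scikit-learn's TfidfVectorizer, and the regular expressions, function name, and the file_contents variable are our own illustrative choices rather than the exact implementation used in the study.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def extract_terms(source_text):
    """Tokenize raw source text into lowercased, stemmed term features."""
    # Split on anything that is not a letter (whitespace, digits, punctuation).
    tokens = re.split(r"[^A-Za-z]+", source_text)
    terms = []
    for token in tokens:
        # Split camel-case identifiers, e.g. StringBuffer -> String, Buffer.
        for word in re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+", token):
            if len(word) > 1:                       # drop single-letter words
                terms.append(stemmer.stem(word.lower()))
    return terms

# file_contents: a list with the raw text of each source file (assumed given).
vectorizer = TfidfVectorizer(analyzer=extract_terms)
X_terms = vectorizer.fit_transform(file_contents)   # one tf-idf row per file
```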
2.2 Extracting topic features
In our work, topic features are extracted using LDA Blei et al. (2003). When applying LDA, we view each software system as a collection of source files (documents) containing words, from which a number of topics are inferred. Each topic is a distribution over all those words and is a sample of a Dirichlet distribution with parameter β. Each source file has a distinct topic proportion, which is a sample of a Dirichlet distribution with parameter α. Each word in a source file is assigned to a topic.
For example, assume that JEdit, an editor, has only two topics “text editing” (ED) and “graphical user interface” (GUI). LDA assumes each word has a probability to be assigned to each of those two topics. However, the probabilities of assigning words edit, delete, or buffer to ED are higher than to GUI. In contrast, the probabilities of assigning words button, window, or dialog to GUI are higher. Source file JEditArea.java could have 70% of its words assigned to ED and 30% to GUI. A word view in this file is likely assigned to GUI.
LDA is an unsupervised technique which automatically infers topics from documents. Its input includes the documents, their words, and the number of topics K. Its output consists of K word distributions, one per topic, and a topic assignment vector for each document. That is, the distribution of topic k gives the probability that each word is assigned to topic k, while the k-th entry of a document's assignment vector is the number of words in that document assigned to topic k.
After applying LDA on the source files of a system, we log-transform the topic assignment vector of each file to construct its topic feature vector, expecting that the log transformation will reduce the imbalance between common and rare words. One could consider topic modeling as a reduction technique for term features, as we produce topic feature vectors of K dimensions from term feature vectors whose dimension equals the (much larger) vocabulary size.
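A minimal sketch of this step, assuming a per-file term count matrix X_counts (e.g. built with scikit-learn's CountVectorizer) and using scikit-learn's LDA implementation; scaling the topic proportions by file length is our approximation of the per-topic word counts described above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

K = 20                                      # number of topics (a setting used in Experiment 2)
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(X_counts)     # per-file topic proportions

# Approximate "words assigned to each topic" by scaling the proportions with
# each file's word count, then log-transform to dampen the common/rare imbalance.
words_per_file = np.asarray(X_counts.sum(axis=1)).ravel()
X_topics = np.log1p(doc_topic * words_per_file[:, None])
```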
2.3 Extracting type features
To extract the type features of a source file, we first parse it into an abstract syntax tree. Then, we resolve the type binding of every identifier, object, variable, and expression appearing in the tree. To make the type-binding resolution robust, we employ Partial Program Analysis (PPA), which can resolve type bindings even when the system is not completely compiled or has some missing dependencies.
Rather than counting occurrences as we do for term features, the feature vector for type features is binary, i.e. it denotes only the presence/absence of a type in a source file. This design decision is suggested by our preliminary investigation, in which binary vectors for type features outperformed the corresponding count vectors in defect prediction.
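For illustration, assuming the set of resolved qualified type names of each source file is already available from the parsing step, the binary type vectors could be built as in the sketch below; types_per_file holds hypothetical example values.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# types_per_file: one set of resolved qualified type names per source file.
types_per_file = [
    {"java.io.File", "java.lang.String"},
    {"org.eclipse.jdt.core.dom.ASTParser", "java.lang.String"},
]
type_binarizer = MultiLabelBinarizer()
X_types = type_binarizer.fit_transform(types_per_file)   # 1 = type is present
```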
2.4 Extracting package features
The type-binding process provides the fully qualified name of every resolved type. For example, a variable parser is resolved to have type org.eclipse.jdt.core.dom.ASTParser. This qualified name indicates that the type belongs to the package org.eclipse.jdt.core.dom.
Thus, this package is considered to appear in the source file containing that variable. This allows us to derive package features from the type features of each source file.
Similar to type feature vectors, we construct package feature vectors in binary representation because our preliminary investigation suggests that it provides better prediction results. That means the package feature vector of a source file only indicates whether a type of a package is used in that source file or not. One could consider package features as a reduction of type features, as a package often contains several types.
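A sketch of this derivation (reusing the same hypothetical types_per_file as in the previous sketch): the simple class name is dropped from each qualified type name, and the remaining package prefixes are binarized per file.

```python
from sklearn.preprocessing import MultiLabelBinarizer

def package_of(qualified_name):
    # e.g. "org.eclipse.jdt.core.dom.ASTParser" -> "org.eclipse.jdt.core.dom"
    return qualified_name.rsplit(".", 1)[0]

# types_per_file: one set of resolved qualified type names per source file.
types_per_file = [
    {"java.io.File", "java.lang.String"},
    {"org.eclipse.jdt.core.dom.ASTParser", "java.lang.String"},
]
packages_per_file = [{package_of(t) for t in types} for types in types_per_file]
X_packages = MultiLabelBinarizer().fit_transform(packages_per_file)  # 0/1 presence
```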
2.5 Feature selection and reduction
The feature spaces for term and type features are generally large. For example, Xalan 2.7 has 11,087 extracted term features and MYL has 4,004 extracted type features. It is likely that many of those features are noise. For example, common words like an or the and common types like int or String appear frequently in source code and do not have high correlation with defects. In addition, related words or types often go together, so the corresponding features can be highly correlated. Thus, to reduce those noisy and highly correlated features, we use several feature selection and reduction techniques.
A feature selection method works by i) computing a relevance score between each feature and the defect count in the data, ii) using those scores to rank the corresponding features, and iii) selecting the features with the highest scores. Following prior studies in defect prediction and machine learning, we use three kinds of scores: Pearson correlation coefficients (Pearson), Spearman correlation coefficients (Spearman), and Mutual Information (MI).
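A sketch of these three steps, using SciPy and scikit-learn; ranking by absolute correlation and the function name are our own illustrative choices, not necessarily the exact procedure used in the study.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression

def select_top_k(X, y, k, score="spearman"):
    """Score each column of X against defect counts y, keep the k best."""
    if score == "spearman":
        scores = np.array([spearmanr(X[:, j], y).correlation for j in range(X.shape[1])])
    elif score == "pearson":
        scores = np.array([pearsonr(X[:, j], y)[0] for j in range(X.shape[1])])
    else:                                         # mutual information
        scores = mutual_info_regression(X, y)
    scores = np.nan_to_num(np.abs(scores))        # rank by absolute relevance
    top = np.argsort(scores)[::-1][:k]            # indices of the k best features
    return X[:, top], top
```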
The selected features might still be correlated. We address that by using Principal Component Analysis (PCA), a standard feature reduction technique. PCA transforms a potentially correlated feature set into a set of mutually orthogonal principal components (PCs). By selecting the top principal components that account for most (typically 90%) of the variance in the data, we can remove overlapping information and reduce noise.
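In scikit-learn terms, this reduction step might look as follows (X_selected is assumed to be the output of the selection sketch above):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90)              # keep components explaining ~90% of the variance
X_reduced = pca.fit_transform(X_selected)
print(pca.n_components_, "principal components retained")
```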
3 Empirical Evaluation
3.1 Datasets
ID | Name | Version | No. Files | No. Terms | No. Types | No. Packages |
---|---|---|---|---|---|---
ALR set | 4 projects | 4 releases | ||||
JDT | Eclipse JDT Core | 3.4 | 995 | 6,120 | 3,052 | 45 |
PDE | Eclipse PDE UI | 3.4.1 | 975 | 4,052 | 3,527 | 66 |
MYL | Mylyn | 3.1 | 1,063 | 4,510 | 4,004 | 127 |
EQU | Eclipse Equinox framework | 3.4 | 322 | 3,967 | 1,019 | 86 |
JM set | 11 projects | 38 releases | ||||
Ant | Apache Ant | 1.3 - 1.7 | 123 - 740 | 2,040 - 4,926 | 346 - 1,558 | 8 - 67 |
Camel | Apache Camel | 1.0 - 1.6 | 333 - 927 | 1,325 - 2,344 | 1,182 - 2,970 | 46 - 120 |
Ivy | Apache Ivy | 2.0 | 352 | 1,986 | 743 | 52 |
JEdit | JEdit | 3.2.1 - 4.2 | 260 - 355 | 2,744 - 3,371 | 882 - 1,241 | 16 - 23 |
Log4J | Apache Log4J | 1.0 - 1.2 | 104 - 194 | 1,852 - 2,282 | 287 - 600 | 12 - 25 |
Lucene | Apache Lucene | 2.0 - 2.4 | 186 - 330 | 1,936 - 2,650 | 378 - 655 | 10 - 13 |
Poi | Apache POI | 1.5 - 3.0 | 234 - 437 | 2,442 - 3,913 | 459 - 821 | 19 - 20 |
Synapse | Apache Synapse | 1.0 - 1.2 | 157 - 256 | 1,180 - 1,690 | 494 - 924 | 23 - 33 |
Velocity | Apache Velocity | 1.4-1.61 | 195 - 229 | 2,279 - 2,335 | 424 - 476 | 25 - 30 |
Xalan | Apache Xalan-Java | 2.4 - 2.7 | 676 - 899 | 4,716 - 11,087 | 1,245 - 1,565 | 38 - 42 |
Xerces | Apache Xerces | init - 1.4.4 | 162 - 452 | 2,299 - 3,471 | 270 - 706 | 17 - 28 |
To evaluate the predictive power of our proposed features in comparison to the traditional ones, we conducted several experiments on 42 releases of 15 open-source projects. Table 1 provides a brief description of those systems, including their versions and the numbers of collected source files. The bug data comes from two different publicly available datasets provided in prior work. The first dataset (ALR set) includes bug data and CK and OO metrics extracted for four open-source software systems, and the second (JM set) provides bug data and CK metrics for 38 releases of 11 projects. Lucene 2.4 is included in both datasets, so to maintain consistency it was excluded from the first dataset. The JM set also includes Apache Forrest, pBean, Apache Tomcat, and JEdit 4.3, but these were removed due to their small size (fewer than 100 files) or their lack of reported bugs (a bug ratio of less than 10%). Because these datasets do not contain the source code of those systems, we retrieved the code directly from their source repositories using the provided version information.
3.2 Evaluation method
We performed several experiments to evaluate the effectiveness of different feature types in predicting defects and to provide general guidance on the parameters of each feature type. This section describes our experiment methods and settings. Features were extracted using code written in Java, and the experiment steps, including feature selection, model training, evaluation, and qualitative analysis, were done using code written in Matlab and R.
3.2.1 The prediction model
To predict defects, we used a Linear Regression (LR) model. The reasons for selecting such a simple model are: i) we mainly focus on evaluating our proposed features, and a simple, universal, and easy-to-train model allows us to evaluate our system quickly; ii) we want to compare our proposed features with previously proposed metrics, many of which were evaluated using this model. Shihab (2012) provides an extensive list of models used in defect prediction, of which LR is the second most frequently used model after Logistic Regression. The LR model assumes that the input features are not highly correlated, so our PCA step ensures that redundant information and noise are removed.
3.2.2 Performance measurement and cross validation
To measure the system performance, we used two metrics: the Spearman ranked correlation coefficient (SCC) and the Mean Absolute Error (MAE). SCC measures the ranked correlation between the predicted defects and the actual numbers of post-release defects; the higher the coefficient, the better the system is at predicting the defect-proneness of source code. Because we focus on creating a system that can rank source files in terms of their defect-proneness, SCC is an important performance measure for us. The MAE metric evaluates the system's ability to predict the actual number of defects in a source file by measuring the average difference between the predicted and actual numbers of defects. Thus, a lower MAE indicates better prediction performance.
We used cross validation such that each experiment was repeated 50 times using 90% of the data for training and 10% for testing. To make paired t-tests valid when comparing different features and selection methods across all projects, a fixed random seed was set before each experiment so that the same subject system is cross-validated using the same fold configurations for every feature type and selection method.
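The evaluation loop could be sketched as follows; the repeated 90%/10% splits, the fixed seed, and the SCC/MAE computation mirror the description above, while the specific library calls and the evaluate function name are our own choices.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit

def evaluate(X, y, repeats=50, seed=0):
    """Average SCC and MAE of linear regression over repeated 90/10 splits."""
    splitter = ShuffleSplit(n_splits=repeats, test_size=0.10, random_state=seed)
    sccs, maes = [], []
    for train_idx, test_idx in splitter.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        sccs.append(spearmanr(pred, y[test_idx]).correlation)
        maes.append(mean_absolute_error(y[test_idx], pred))
    return np.mean(sccs), np.mean(maes)
```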
3.2.3 Baseline
We compared our results with existing metrics that have been used in the past: the traditional and simple Lines of Code (LOC) and the more recent CK metrics. To train the baseline models, we used the pre-computed LOC and CK metrics provided in the datasets. The four projects in the ALR set also contain some Object-Oriented (OO) metrics; we include them in the evaluation of the baseline systems for those projects as well.
4 Experiments and results
We conducted five experiments to evaluate our proposed features and feature selection methods. The first four examined each feature type in turn; the final experiment was conducted to evaluate the combined features. Each of the first three experiments was split into phases to discover the optimal settings for each feature type. Every experiment was performed on all 42 releases of the 15 projects. To evaluate the baseline systems in the final experiment, we used the CK and OO metrics provided for the projects in the ALR set, while for all other projects we used only the available CK metrics.
4.1 Experiment 1: Term features and term feature selections
In this experiment, we evaluated the performance of our proposed term features and determined their best configuration, including which feature selection method and how many selected term features should be used. In the first phase of this experiment, we ran our system using all three selection methods, Spearman, Pearson, and MI, with the number of selected term features set to 5, 10, and 20, to determine the most appropriate selection method. We also ran with no feature selection, i.e. all terms selected.
Selection method | Spearman | Pearson | MI | All features | ||||
---|---|---|---|---|---|---|---|---|
Prediction measure | SCC | MAE | SCC | MAE | SCC | MAE | SCC | MAE |
Across all releases | 0.461 | 0.676 | 0.432 | 1.088 | 0.397 | 0.676 | 0.352 | 1.510 |
Average within one release | ||||||||
Ant 1.5 | 0.349 | 0.167 | 0.325 | 0.167 | 0.307 | 0.176 | 0.127 | 0.597 |
Ivy 2.0 | 0.388 | 0.204 | 0.371 | 0.195 | 0.374 | 0.196 | 0.160 | 0.401 |
JEdit 4.1 | 0.534 | 0.670 | 0.547 | 0.675 | 0.523 | 0.678 | 0.536 | 1.393 |
Log4j 1.2 | 0.426 | 0.924 | 0.359 | 0.879 | 0.401 | 0.897 | 0.356 | 1.917 |
Lucene 2.4 | 0.564 | 1.384 | 0.554 | 1.425 | 0.550 | 1.446 | 0.550 | 2.328 |
Poi 2.5.1 | 0.615 | 0.840 | 0.557 | 0.880 | 0.593 | 0.857 | 0.687 | 1.434 |
Velocity 1.5 | 0.542 | 1.232 | 0.575 | 1.113 | 0.503 | 1.197 | 0.508 | 7.554 |
Xalan 2.7 | 0.480 | 0.425 | 0.460 | 0.448 | 0.414 | 0.430 | 0.485 | 0.400 |
Xerces 1.4.4 | 0.811 | 1.641 | 0.605 | 1.370 | 0.576 | 1.568 | 0.746 | 1.793 |
JDT | 0.411 | 0.468 | 0.412 | 0.455 | 0.403 | 0.471 | 0.403 | 0.727 |
EQU | 0.544 | 0.717 | 0.554 | 0.648 | 0.565 | 0.651 | 0.489 | 1.454 |
PDE | 0.352 | 0.307 | 0.311 | 0.337 | 0.350 | 0.299 | 0.348 | 0.358 |
Table 2 shows the average SCC and MAE across all evaluated releases in the top row and the scores for a sample of individual releases in the remaining rows. The best scores among the different selection methods are marked in bold. In terms of SCC, systems using the Spearman method provided the best overall performance and stability. On average, they outperformed the second best method, Pearson, by almost 7%. On the other hand, the MAE scores indicate that the Spearman and MI selection methods have similar prediction errors. On average these two reduce the error by almost 38% compared to the last method, Pearson. In Table 2, we can see that Spearman is not the best feature selection method for every evaluated release, but the differences are small. Paired t-tests confirmed that the differences in performance between Spearman and the other methods are statistically significant. However, the tests could not confirm whether, in terms of MAE, Spearman is better than MI or Pearson. It is important to note that using a feature selection method is better than not using one: the results without feature selection are often the worst.
In the second phase, we ran additional tests using only Spearman as the selection method and extended the number of selected term features to include 3, 50, 100, and 200. This phase was designed to determine the optimal number of term features to select. We found that selecting 5 term features seems to provide the best performance. However, the differences in predictive power between this setting and the settings of 3, 10, and 20 are small and not statistically significant (confirmed using paired t-tests). Our content-based features are project specific, so different projects will have different optimal settings. It is important to note, however, that the optimal number of selected term features is generally not more than 20.
4.2 Experiment 2: Topic features
This experiment was designed to confirm whether topic modeling provides higher predictive power compared to raw term features and to determine the optimal number of extracted topics. We ran our system using topic features extracted by LDA, with the number of topics set to 5, 10, 15, 20, 30, 50, and 100. Table 3 shows the predictive power of topic features and of raw term features (without any feature selection). In general, topic features outperformed raw term features by a large margin of 28% and 41% in terms of SCC and MAE scores, respectively. This result suggests that topics extracted from source code are predictive of defects.
Feature | Topics | All terms | Packages | All types | ||||
---|---|---|---|---|---|---|---|---|
Prediction measure | SCC | MAE | SCC | MAE | SCC | MAE | SCC | MAE |
Across all releases | 0.451 | 0.766 | 0.352 | 1.510 | 0.431 | 0.712 | 0.292 | 1.490 |
Average within one release | ||||||||
Ant 1.5 | 0.358 | 0.219 | 0.127 | 0.597 | 0.294 | 0.213 | 0.212 | 0.371 |
Ivy 2.0 | 0.339 | 0.285 | 0.160 | 0.401 | 0.377 | 0.282 | 0.191 | 0.420 |
JEdit 4.1 | 0.549 | 0.963 | 0.536 | 1.393 | 0.454 | 0.828 | 0.249 | 1.525 |
Log4j 1.2 | 0.412 | 0.966 | 0.356 | 1.917 | 0.495 | 0.908 | 0.106 | 3.048 |
Lucene 2.4 | 0.559 | 1.549 | 0.550 | 2.328 | 0.458 | 1.657 | 0.333 | 2.797 |
Poi 2.5.1 | 0.724 | 0.677 | 0.687 | 1.434 | 0.766 | 0.614 | 0.607 | 1.373 |
Velocity 1.5 | 0.625 | 1.051 | 0.508 | 7.554 | 0.590 | 1.069 | 0.396 | 1.729 |
Xalan 2.7 | 0.519 | 0.394 | 0.485 | 0.400 | 0.394 | 0.432 | 0.428 | 0.553 |
Xerces 1.4.4 | 0.667 | 1.699 | 0.746 | 1.793 | 0.745 | 1.479 | 0.514 | 3.581 |
JDT | 0.441 | 0.521 | 0.403 | 0.727 | 0.416 | 0.501 | 0.327 | 0.771 |
EQU | 0.621 | 0.714 | 0.489 | 1.454 | 0.483 | 0.776 | 0.438 | 1.071 |
PDE | 0.366 | 0.421 | 0.348 | 0.358 | 0.350 | 0.338 | 0.136 | 1.118 |
We also found from the experiment results that using 20 topics often provides the best overall predictive power. However, the performance differences between 20 topics and nearby settings are small; paired t-tests confirm that these differences are not statistically significant. Topic modeling extracts latent topics corresponding to the technical concerns (e.g. functionality) of a specific target project, so ideally the number of topics matches the number of such concerns. However, each target project has a different number of concerns, which means the optimal number of topics changes from project to project. The results suggest that there is no clear global optimal setting for the number of extracted topics, and that this setting is generally best set around 20 (performance with 100 topics is significantly lower).
4.3 Experiment 3: Type features and type feature selections
Experiment 3 was conducted using the same process as Experiment 1; it investigates the effectiveness of type features in the defect prediction task by measuring their predictive power and errors. In the first phase, comparing different feature selection methods, we ran the system using the three selection methods (Spearman, Pearson, and MI) with the number of selected type features set to 5, 10, and 20.
Selection method | Spearman | Pearson | MI | All features | ||||
---|---|---|---|---|---|---|---|---|
Measure | SCC | MAE | SCC | MAE | SCC | MAE | SCC | MAE |
Across all releases | 0.443 | 0.688 | 0.417 | 0.913 | 0.425 | 0.686 | 0.292 | 1.490 |
Average within one release | ||||||||
Ant 1.5 | 0.347 | 0.162 | 0.392 | 0.165 | 0.312 | 0.173 | 0.212 | 0.371 |
Ivy 2.0 | 0.232 | 0.254 | 0.234 | 0.254 | 0.255 | 0.249 | 0.191 | 0.420 |
JEdit 4.1 | 0.552 | 0.779 | 0.326 | 1.143 | 0.526 | 0.793 | 0.249 | 1.525 |
Log4j 1.2 | 0.411 | 0.932 | 0.270 | 0.968 | 0.397 | 0.936 | 0.106 | 3.048 |
Lucene 2.4 | 0.453 | 1.555 | 0.295 | 1.861 | 0.486 | 1.516 | 0.333 | 2.797 |
Poi 2.5.1 | 0.700 | 0.690 | 0.674 | 0.700 | 0.693 | 0.637 | 0.607 | 1.373 |
Velocity 1.5 | 0.547 | 1.111 | 0.554 | 1.085 | 0.555 | 1.096 | 0.396 | 1.729 |
Xalan 2.7 | 0.515 | 0.346 | 0.415 | 0.411 | 0.518 | 0.345 | 0.428 | 0.553 |
Xerces 1.4.4 | 0.677 | 1.397 | 0.323 | 1.752 | 0.487 | 1.535 | 0.514 | 3.581 |
JDT | 0.458 | 0.409 | 0.404 | 0.437 | 0.453 | 0.410 | 0.327 | 0.771 |
EQU | 0.522 | 0.649 | 0.499 | 0.726 | 0.537 | 0.651 | 0.438 | 1.071 |
PDE | 0.301 | 0.338 | 0.211 | 0.779 | 0.285 | 0.338 | 0.136 | 1.118 |
Overall, Spearman has the best predictive power (SCC) among the selection methods for type features. On average, Spearman performs better than MI, the second best method, by over 4% in terms of SCC. MI on average has a lower prediction error than Spearman, but the reduction is less than half a percent. Among the sampled projects, we can see that Spearman is the better selection method in most cases.
We conducted the second phase of this experiment to investigate the relation between the number of selected type features and the predictive power of our system. To do this, we used Spearman, the selection method established in the first phase, with 50 and 100 selected features (results for the settings of 5, 10, and 20 were carried over from the previous phase). We found that selecting 10 type features often provides the best overall predictive power; however, the differences between settings are small.
4.4 Experiment 4: Package features
Experiment 4 was performed to evaluate our last feature type, package features. Feature vectors were extracted based on functional (non-empty) packages. Since these functional packages are organizations of member classes, they provide abstract concepts over those classes. We assume that in well-designed software, different functional packages contain classes of different functionalities. PCA was applied to remove redundant information and reduce noise.
We compared package features with raw type features, without any feature selection technique, to confirm that the extra level of abstraction improves defect prediction performance. As Table 3 shows, the overall predictive power was improved significantly, by almost 48%, and the average error dropped by over 52%. This result suggests that package organization carries a significant amount of information about software modularity, which gives our system a way to categorize type features and thus helps improve defect prediction performance.
4.5 Experiment 5: The best prediction system
To find the best prediction system, we combined all features using their best overall settings, with the assumption that each feature type captures a different aspect of the target system's semantic complexity. PCA was applied to the input vectors to reduce noise and overlapping information; the variance threshold was set to 90%.
Feature type | Combined | Term | Topic | Type | Package | CKOO | LOC |
---|---|---|---|---|---|---|---|
Mean SCC | 0.4624 | 0.4621 | 0.4550 | 0.4426 | 0.4315 | 0.3362 | 0.3357 |
Mean MAE | 0.6733 | 0.6806 | 0.7560 | 0.6880 | 0.7117 | 0.7638 | 0.7686 |
To maintain consistency, we ran our combination tests using the best overall settings as well as neighboring setting values. We ran experiments using Spearman feature selection on both term and type features, with the number of selected features set to 3, 5, 10 and 5, 10, 20 respectively; the number of extracted topics was set to 5, 10, 15, 20, and 30. Because we only want to confirm whether combining different feature types creates a better predictive system, the performance of the combined features is compared directly to that of the individual feature types using the same settings.
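One possible realization of the combination step is sketched below: the per-file matrices for the selected terms, the topics, the selected types, and the packages (hypothetical variables carried over from the earlier sketches, assumed dense) are concatenated and then reduced with PCA at the 90% variance threshold before training the linear regression model.

```python
import numpy as np
from sklearn.decomposition import PCA

# X_terms_sel, X_topics, X_types_sel, X_packages: dense per-file feature
# matrices for the four feature types (assumed, from the earlier sketches).
X_combined = np.hstack([X_terms_sel, X_topics, X_types_sel, X_packages])
X_final = PCA(n_components=0.90).fit_transform(X_combined)   # unified vectors
```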
Table 5 shows the predictive power of the systems that use the combined features, our individual feature types, and two baseline systems using traditional metrics, CK and LOC. The results suggest that all four of our proposed feature types outperform the baseline systems in both prediction measures, SCC and MAE. Among the newly proposed feature types, term features yield the biggest predictive power improvement of 38% and the largest prediction error reduction of 11% over the best conventional metrics (CKOO). Table 5 also shows that combining all of our features does create a better prediction system, but the improvements are small: on average, the combined features improve SCC by only 0.4% and reduce MAE by only 1%. Using such a simple combination method might be the reason why the combined features did not show a larger improvement; more sophisticated combination methods might yield better results, which we leave to future work.
4.6 Case study
The evaluation results suggest that the content-based features extracted from source code provide substantial improvements in defect prediction. We investigated a representative subject system, JEdit 3.2.1, as a case study of those features.
4.6.1 Topic features
To verify in detail whether topics extracted from source code using topic modeling are representative of the functionality of the target system, we extracted the most relevant source files for each topic, i.e. the ones containing the most words assigned to that topic. That list helps us infer the functionality corresponding to each topic.
We found five topics corresponding to five functionalities of JEdit: 1. Text parsing, 2. Interpreter, 3. View and edit components, 4. Data model, and 5. Graphical User Interface (GUI) components. JEdit is a code editor for programmers, i.e. its main functions provide a development environment. As an editor, most operations involve user interaction to view, edit, copy, and delete text files. Components responsible for providing such functions are likely to be used extensively and to undergo many changes, which are likely to introduce defects. We found that Topic 3, involving view and edit functionality, has high predictive power for defects. For example, it has a Spearman ranked correlation coefficient of 0.701 (higher than all selected term and type features). Topic 5, involving GUI components, is also highly predictive of defects with a ranked correlation of 0.610 (higher than all selected type features). In contrast, Topic 2 and Topic 5 have nearly zero correlations (-0.135 and 0.157, respectively).
4.6.2 Package features
Our package features are extracted based on an assumption that a well-designed software system will have a generally good level of modularity and the package structure of this system will reflect its modular organization.
We found that two major packages of JEdit 3.2.1, org.gjt.sp.jedit and org.gjt.sp.util, have high correlations with defects (higher than all listed baseline metrics). They also appear almost exclusively in defective files. For example, 90% of the top defective files refer to org.gjt.sp.util, while it is referenced in only 11% of the files having no defects. Their sub-packages, like org.gjt.sp.jedit.browser or org.gjt.sp.jedit.syntax, appear even more exclusively in the top defective files. This explains why package features are predictive of defects.
It is interesting to observe that package bsh (BeanShell) has a negative correlation with defect counts, i.e. source files that use classes belonging to this package are more likely to have no defects. We found that this package provides scripting capabilities for Java. However, due to the inclusion of JavaScript in the JDK, its development has been discontinued since 2005. Although many components of JEdit still use this package, due to this discontinued development they are likely legacy code, i.e. rarely changed, and thus unlikely to have newly injected defects.
We further studied package features in Ant 1.4, a tool for compiling and building executable code for software projects. We found that three packages, org.apache.tools.ant.taskdefs, org.apache.tools.ant.taskdefs.rmic, and org.apache.tools.ant.taskdefs.condition, have the highest correlations with defects among all package features. Their correlations are higher than those of all baseline metrics. That is reasonable because they contain the core functionality of Ant, a build tool. For example, package taskdefs is for defining build tasks and taskdefs.condition is for specifying execution conditions on those build tasks. In contrast, org.apache.tools.zip and org.apache.tools.tar provide auxiliary functionality for processing compressed files, and they have nearly zero correlation with defects. It is expected that components providing core functionality are more defect-prone than others because they often have more active development activities and stricter requirements.
4.6.3 Defect distribution
We observed that JEdit 3.2.1 has a highly skewed defect distribution, where a small number of files have significantly large numbers of defects while the rest contain few to no defects. For example, file JEditTextArea.java contains 45 post-release defects, accounting for 12% of the total defects, and the top 10 most defective files account for 47% of the total defects. On the other hand, other systems like Ant 1.4 have more balanced defect distributions. The most defective file in Ant 1.4 has only 3 defects and the top 10 most defective files account for only 36% of the total defects.
Systems with skewed defect distributions have higher prediction performance than the more balanced ones. For example, the best SCC of JEdit 3.2.1 is 0.635 (across experiment settings for feature types and selection methods) while that of Ant 1.4 is lower, at 0.441. The reason is possibly the training process of Linear Regression (LR) models. LR models are trained to minimize the sum of squared residuals. Because outliers, such as files with unusually high defect counts, have a large impact on that sum, the model is likely to be trained to reduce the errors when predicting those files. In other words, it predicts the top defective files more accurately, and thus improves the ranked correlation (which is also strongly influenced by files with high defect counts). If the defect distribution is more balanced, the prediction model focuses less on files with high defect counts and predicts them with larger errors, which leads to a lower SCC score.
5 Related work
Defect prediction has attracted great research interest in software engineering. Researchers have searched for: 1) software metrics that are predictive of defect-proneness, and 2) prediction models that deliver accurate results.
Software metrics used in defect prediction studies mainly fall into three categories: product metrics (e.g. static code metrics), process metrics (e.g. code churn and previous bugs), and socio-technical metrics (e.g. developers).
One of the earliest code metrics used in software defect prediction is lines of code (LOC). Simple, easy to extract, and frequently used, this metric is still being discussed to this day. In a number of studies, LOC has been reported to correlate with the number of faults and performs quite well, although other studies show that LOC has only modest predictive power. In general, LOC appears to be useful in predicting software defects (a survey of LOC's use can be found in Hall et al. (2012)). Besides LOC, popular complexity metrics such as McCabe's cyclomatic complexity have been shown to be useful in predicting software defects. Khoshgoftaar and Allen (2003) used 16 code complexity metrics to predict defects in a large legacy system for telecommunications and achieved an accuracy of almost 80% when using modules from one release as training data to predict faults in the consecutive release of the system. In addition to general code metrics, object-oriented design measures have been widely used in defect prediction studies. Basili et al. (1996) investigated the usefulness of the Chidamber and Kemerer (CK) metrics in predicting bugs. They found five metrics correlating with the defect count: weighted methods per class, coupling between objects, the depth of inheritance, the number of children, and the response for a class. In a later study, Nesi et al. (1999) showed that coupling between objects, lack of cohesion among methods, and response for a class were highly predictive of the fault-proneness of a class. Overall, OO metrics have been reported to outperform general complexity metrics Hall et al. (2012).
Process metrics, such as code churn, the number of changes, and previous bugs, are extracted from the software development history. Graves et al. (2000) studied the change history and found that large and recent changes contributed the most to defects. Nagappan and Ball (2005) showed that system defect density could be predicted using a set of relative code churn measures that relate the amount of churn to other variables such as component size and the temporal extent of a churn. They later found that change bursts (i.e. frequently changed code) could be used as good predictors of bugs. Hassan (2009) proposed using the entropies of changes as measures of code change complexity and found that such measures outperformed the absolute numbers of changes.
There are arguments in favor of both product and process metrics. While process metrics appear to deliver more accurate predictions than product metrics, the latter are easier to obtain because they are derived from the code itself and do not require any information from the development process. When both types of metrics are available, it is useful to combine them for better predictions. In addition to these two categories of metrics, socio-technical metrics computed from information on organizational structure, developers, and social networks have also been shown to be useful in predicting defects.
Some content-based features have been explored previously for defect prediction. Jiang et al. (2013) use all words and special operators as features in their classifier of buggy changes. However, as they did not use any feature selection/reduction techniques nor provide any detailed analysis of the defect-proneness and predictive power of their features, our work is a complementary treatment and contains a deeper analysis of those textual features.
Chen et al. (2017) reports an empirical study using topics extracted from source code to explain defects. As their study focuses on explaining defect-proneness using topics, they did not perform defect prediction nor analyze the effect of feature selection on the performance of the prediction tasks. Thus, our work provides a deeper study of the relationship between topics and defects. Nguyen et al. (2011) only study topic features on one subject system and one configuration of topic modeling. In this paper, we explored more types of features and ran experiments on many more configurations and subject systems.
6 Conclusions
In this paper, we explore a content-based approach for defect prediction. Our approach extracts term, topic, type, and package features from the textual and semantic content of the source code and uses them as defect predictors. Our empirical evaluation shows that i) our content-based features are predictive of defect-proneness and have higher predictive power than traditional code metrics, and ii) selecting, reducing, and combining those features provides better prediction performance than using them individually.
Future work. In this paper, we designed the features based on direct processing of the textual and semantic content of source code. However, this method of feature engineering might not cover all possible aspects of source code semantics. Recent advances in unsupervised text processing and pre-trained language models that produce distributed representations of terms and sentences have been successfully applied to natural language. Thus, it is a natural next step to apply these techniques directly to source code. In addition, we could extend that concept to learn distributed representations of more abstract code features such as data types, functions, classes, or entire programs. These new representations could be the key to improving the performance of defect prediction systems.
References
- Basili et al. (1996) V.R. Basili, L.C. Briand, and W.L. Melo. A validation of object-oriented design metrics as quality indicators. IEEE Transactions on Software Engineering, 22(10):751–761, 1996. 10.1109/32.544352.
- Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- Chen et al. (2017) Tse-Hsun Chen, Weiyi Shang, Meiyappan Nagappan, Ahmed E. Hassan, and Stephen W. Thomas. Topic-based software defect explanation. Journal of Systems and Software, 129:79–106, 2017. ISSN 0164-1212. https://doi.org/10.1016/j.jss.2016.05.015.
- Gallaher and Kropp (2002) M.P. Gallaher and B.M. Kropp. Economic impacts of inadequate infrastructure for software testing, 2002.
- Graves et al. (2000) Todd L Graves, Alan F Karr, James S Marron, and Harvey Siy. Predicting fault incidence using software change history. IEEE Transactions on software engineering, 26(7):653–661, 2000.
- Hailpern and Santhanam (2002) Brent Hailpern and Peter Santhanam. Software debugging, testing, and verification, 2002.
- Hall et al. (2012) Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, 38(6):1276–1304, 2012.
- Hassan (2009) Ahmed E. Hassan. Predicting faults using the complexity of code changes. In ICSE ’09: Proceedings of the 31st International Conference on Software Engineering, pages 78–88. IEEE CS, 2009.
- Jiang et al. (2013) Tian Jiang, Lin Tan, and Sunghun Kim. Personalized defect prediction. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 279–289, 2013. 10.1109/ASE.2013.6693087.
- Khoshgoftaar and Allen (2003) Taghi M. Khoshgoftaar and Edward B. Allen. Ordering fault-prone software modules. Software Quality Control, 11(1):19–37, 2003.
- Nagappan and Ball (2005) Nachiappan Nagappan and Thomas Ball. Use of relative code churn measures to predict system defect density. In ICSE ’05: Proceedings of the 27th international conference on Software engineering, pages 284–292. ACM, 2005.
- Nesi et al. (1999) P. Nesi, C. Kemerer, and L. Briand. The quality and productivity of object-oriented development: Measurement and empirical results. In IEEE International Symposium on Software Metrics, 1999.
- Nguyen et al. (2011) Tung Thanh Nguyen, Tien N. Nguyen, and Tu Minh Phuong. Topic-based defect prediction: Nier track. In Proceeding of the 33rd international conference on Software engineering, ICSE ’11, pages 932–935. ACM, 2011.
- Shihab (2012) Emad Shihab. An exploration of challenges limiting pragmatic software defect prediction. PhD Thesis, 2012.