Do the Defect Prediction Models Really Work?
Abstract
“You may develop a potential prediction model, but how can I trust that your model will benefit my software?” Using a software defect prediction (SDP) model as a tool, we address this fundamental problem in machine learning research. This is a preliminary work aimed at analysing how a developed binary SDP model behaves in real-time working environments.
Index Terms:
Software Defect Prediction, Machine Learning, Probabilistic Bounds, Real-time Analysis.

I Introduction
Due to the rapid development of complex and critical software systems, testing has become a tough challenge. Software defect prediction (SDP) models are being developed to alleviate this challenge [1, 2, 3, 4, 5, 6]. The primary objectives in developing SDP models are to reduce the testing time, cost, and effort to be spent on the newly developed software project [7]. The task of SDP models is to predict the defect proneness of newly developed software modules.
Once an efficient SDP model has been developed, any organisation may utilise its services. However, it is evident from the machine learning (ML) literature that, in general, the developed prediction models may produce misclassifications on unseen data [8]. Owing to either a misclassification (by the prediction model) or ineffective testing, a malfunction in the software modules may cause problems ranging from inconvenience to loss of life [9]. A software system is more likely to fail when the prediction model wrongly predicts a defective module as clean.
To assess how feasible SDP models are in real-time working environments, we provide a theoretical analysis using probabilistic bounds. In a nutshell, the proofs bound the probability that a random variable (which models the hazard rate of a software system that utilises an SDP model) deviates far below the estimated hazard rate of a manually tested software. Additionally, the proofs are also provided in terms of the measure called reliability.
II Preliminaries
We begin by examining the chances of failures in the system arising from the predictions of SDP models. There are many ways a system can fail [9, 10]. The primary possible instance is when a defective module is predicted as clean. In such cases, in real-time working environments, the tester may miss the defective module. Now, the following assumption ensures a failure incident from each false negative module on the test set:
Assumption 1.
Misclassification of each defective module can cause one failure in the software.
This assumption enables us to count the total failures on the test set and on the newly developed project. Since general testing procedures do not reveal all the defects [10], the following assumption ensures the presence of failures in any software that is tested by using SDP models:
Assumption 2.
The integration test, system test, and acceptance test do not reveal the defects in the misclassified defective modules.
Now, to measure the percentage of failure occurrences on the test set, we use a measure called the false omission rate (FOR). The FOR is the ratio of the total number of false negatives (FN) to the total number of predicted clean modules (FN + TN). This is given as:
$$\mathrm{FOR} = \frac{FN}{FN + TN} \qquad (1)$$
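As a quick illustration of Equation 1, the following sketch computes the FOR from hypothetical confusion-matrix counts; the function name and the counts are illustrative assumptions, not values taken from any dataset.

```python
# Minimal sketch: computing the false omission rate (FOR) of Equation 1 from
# hypothetical confusion-matrix counts produced by an SDP model on a test set.

def false_omission_rate(fn: int, tn: int) -> float:
    """FOR = FN / (FN + TN), the fraction of predicted-clean modules
    that are actually defective."""
    predicted_clean = fn + tn
    if predicted_clean == 0:
        raise ValueError("No modules were predicted clean.")
    return fn / predicted_clean

# Example: 12 defective modules predicted clean (FN) and 288 clean modules
# predicted clean (TN) give FOR = 12 / 300 = 0.04.
print(false_omission_rate(fn=12, tn=288))  # 0.04
```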
Since only predicted clean modules may contain hidden defects, the measure FOR is well suited to estimating the percentage of failure occurrences on the test set. However, in real-time testing, FNs do not provide sufficient information about the failures in software because the actual class label of a predicted clean module is unknown. Hence, we model the actual class of each predicted clean module as a random variable. For any newly developed software with $n$ modules, let us assume $m$ modules are predicted as being from the clean class. Now, the following random variable $X_i$ is used to represent the failure case arising from the $i$-th predicted clean module being a wrongly predicted defective module:
$$X_i = \begin{cases} 1, & \text{if the } i\text{-th predicted clean module is actually defective}\\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
To provide a guarantee that, for any module $i$, $X_i$ takes a value in $\{0, 1\}$ with an identical probability, the following assumption must hold true.
Assumption 3.
The SDP model is trained on the historical data of the software project(s).
In general, SDP models are developed on the historical data of software projects, assuming similar data distributions for the training set, the test set, and the population set [1, 2, 3, 4, 5, 6, 7]. From Assumption 3, since the SDP model does not change dynamically, each predicted clean module is misclassified with the same probability, given by the FOR value. That is, the FOR is treated as the probability that each predicted clean module is actually a defective module. This is given as:
$$\Pr[X_i = 1] = \mathrm{FOR} = p \qquad (3)$$
This probability is used to define the failure distribution of the software project. Hence, from Equations 2 and 3, the probability distribution of the random variable $X_i$ is represented as:

$$\Pr[X_i = x] = \begin{cases} p, & x = 1\\ 1 - p, & x = 0 \end{cases}$$
Now, to count the total failure instances arising from the prediction model, the following assumption ensures independence between the tested modules:
Assumption 4.
The SDP model provides predictions for independent observations (software modules).
In fact, all the SDP models assume independence between the data points [1, 2, 3, 4, 5, 6, 7]. Since each predicted clean module is misclassified with an identical probability $p$, each such prediction constitutes a Bernoulli trial [11]. The sum of $m$ identical Bernoulli trials follows a binomial distribution [11, 12]. This is given as:
$$X = \sum_{i=1}^{m} X_i \sim \mathrm{Binomial}(m, p) \qquad (4)$$
Now, the mean of the random variable $X$ is derived as follows (using linearity of expectation):
$$\mu = \mathbb{E}[X] = \sum_{i=1}^{m} \mathbb{E}[X_i] = mp \qquad (5)$$
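The following sketch illustrates Equations 4 and 5 numerically: it treats the hidden defects among the predicted clean modules as a Binomial($m$, $p$) count and checks the expectation $mp$ by simulation. The values of $m$ and $p$ are purely illustrative.

```python
# Minimal sketch of Equations 4-5: the number of hidden defects X among the
# m predicted-clean modules modelled as a Binomial(m, p) variable, p = FOR.
import random

m, p = 500, 0.04           # hypothetical: predicted-clean modules, FOR
expected_failures = m * p  # Equation 5: E[X] = m * p = 20

# Monte Carlo check: average X over many simulated releases.
random.seed(0)
trials = 10_000
total = 0
for _ in range(trials):
    # one Bernoulli trial per predicted-clean module (Equation 4)
    total += sum(1 for _ in range(m) if random.random() < p)
print(expected_failures, total / trials)  # both close to 20
```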
So far, we have modelled the occurrence of failures (also referred to as the hazard rate later in the paper) as a random variable and estimated the expected number of failures (wrong predictions for the defective modules) in a software system. It is worth noting that, without loss of generality, the predicted defective modules will be tested by the tester. The following assumption ensures the presence of failures in some portion of the software after its release:
Assumption 5.
For the portion of the software other than the predicted clean modules, the hazard rate follows a Weibull distribution.
Here, the hazard rate is defined as the instantaneous rate of failures in a software system [9]. According to Hartz et al. [13], the hazards in a software system (or a part of it) may not be estimated with a single function. Hence, in order to fit various hazard curves, it is useful to investigate a hazard model of the form $z(t) = \alpha\lambda t^{\alpha-1}$, which is known as the Weibull distribution. Here, we assume a Weibull distribution of the hazard rate for the rest of the software modules (other than the predicted clean modules). Now, for a software system that is tested by using both the SDP model and the testers, the total estimated hazard rate ($Z_s(t)$) is calculated as:
$$Z_s(t) = X + \alpha_1\lambda_1 t^{\alpha_1 - 1} \qquad (6)$$
Here, $Z_s(t)$ is the hazard rate of a software system that is tested by using the SDP model (with the predicted defective modules later serviced by the testers), and $\alpha_1\lambda_1 t^{\alpha_1 - 1}$ is the hazard rate, represented in terms of the Weibull distribution, assumed for the software modules other than the predicted clean modules. The parameters $\alpha_1$ and $\lambda_1$ take positive real values, and the inequality constraints for these parameters are adopted directly from Lyu's work [9]. Hence, from Assumptions 1 and 5, for the total software modules, the resultant hazard model $Z_s(t)$ is the sum of the hazard rates of the sub-parts of the software. Now the expected hazard rate of the software is derived as:
$$\mathbb{E}[Z_s(t)] = \mathbb{E}[X] + \alpha_1\lambda_1 t^{\alpha_1 - 1} = mp + \alpha_1\lambda_1 t^{\alpha_1 - 1} \qquad (7)$$
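The sketch below evaluates Equations 6 and 7 for one realisation, assuming the Weibull hazard form $z(t) = \alpha\lambda t^{\alpha-1}$ used above; the function names and all parameter values are illustrative assumptions.

```python
# Minimal sketch of Equations 6-7, assuming the Weibull hazard form
# z(t) = alpha * lam * t**(alpha - 1); parameter values are illustrative.

def weibull_hazard(t: float, alpha: float, lam: float) -> float:
    """Weibull hazard rate alpha * lam * t**(alpha - 1)."""
    return alpha * lam * t ** (alpha - 1)

def sdp_hazard(t: float, x: int, alpha: float, lam: float) -> float:
    """Equation 6: hazard of the SDP-tested software, i.e. the hidden-defect
    count X plus the Weibull hazard of the remaining (tester-serviced) part."""
    return x + weibull_hazard(t, alpha, lam)

m, p = 500, 0.04                   # hypothetical predicted-clean modules and FOR
alpha1, lam1, t = 1.5, 0.2, 10.0   # hypothetical Weibull parameters and time

# Equation 7: E[Z_s(t)] = m*p + alpha1*lam1*t**(alpha1 - 1)
print(m * p + weibull_hazard(t, alpha1, lam1))      # expected hazard rate
print(sdp_hazard(t, x=23, alpha=alpha1, lam=lam1))  # one realisation with X = 23
```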
To demonstrate the feasibility of SDP models in the real-time scenario, the following assumptions must be met:
Assumption 6.
An identical software system is used for both cases: testing using the SDP model and manual testing.
This is an important assumption in providing proof of the feasibility of SDP models in the real-time scenario. For a software system, from Equation 6, we know that the hazard rate is $Z_s(t)$. Assume the same software is instead tested by the testers, for which we have a Weibull distribution of the hazard rate as:
$$z_h(t) = \alpha_2\lambda_2 t^{\alpha_2 - 1} \qquad (8)$$
The definition of the parameters $\alpha_2$ and $\lambda_2$ is similar to the definition of the parameters in Equation 6. Note that, at time $t$, the two hazard functions $z_h(t)$ and $Z_s(t)$ describe the instantaneous rate of failures in the software when tested manually and with the SDP model, respectively. Now, for any software, the proofs (given in Section III) provide tight bounds on the deviation of a random variable far below the corresponding hazard rate estimated with manual testing. Similarly, another possible approach is to find the deviation of the random variable (expressed in terms of reliability) far above the reliability of the manually tested software. Here, reliability is defined as the probability of failure-free software operation for a specified period of time in a specified environment [9].
The relation between reliability and hazard rate is given below [9]:
$$R(t) = \exp\!\left(-\int_0^t z(x)\,dx\right) \qquad (9)$$
where $R(t)$ is the software's reliability at time $t$, and $z(t)$ is the hazard rate. Using Equation 9, we can derive the reliability values from numerous hazard models over a time interval $[0, t]$. Here, we assume that the two identical software systems (having different testing scenarios: one is tested manually and the other is tested using the SDP model) are deployed at time 0.
Now, the reliability of the manually tested software is defined by the Weibull model of the hazard rate $z_h(t)$ (Equation 8).
Lemma 1.
For the Weibull hazard model of a software system, $z_h(t) = \alpha_2\lambda_2 t^{\alpha_2 - 1}$, its reliability is:

$$R_h(t) = e^{-\lambda_2 t^{\alpha_2}} \qquad (10)$$
Proof.
Substituting $z_h(t) = \alpha_2\lambda_2 t^{\alpha_2 - 1}$ into Equation 9 gives $R_h(t) = \exp\!\left(-\int_0^t \alpha_2\lambda_2 x^{\alpha_2 - 1}\,dx\right) = e^{-\lambda_2 t^{\alpha_2}}$. ∎
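A small numerical check of Lemma 1 and Equation 9, assuming the Weibull parametrisation above: the closed-form reliability $e^{-\lambda_2 t^{\alpha_2}}$ is compared with a direct numerical integration of the hazard rate. The chosen parameter values are hypothetical.

```python
# Minimal sketch of Lemma 1: for the Weibull hazard z_h(t) = alpha*lam*t**(alpha-1),
# Equation 9 gives R_h(t) = exp(-lam * t**alpha). Parameters are illustrative.
import math

def weibull_reliability(t: float, alpha: float, lam: float) -> float:
    """Closed form of Lemma 1."""
    return math.exp(-lam * t ** alpha)

def reliability_from_hazard(t: float, alpha: float, lam: float, steps: int = 100_000) -> float:
    """Equation 9 evaluated by a midpoint Riemann sum of the hazard integral."""
    dx = t / steps
    integral = sum(alpha * lam * ((i + 0.5) * dx) ** (alpha - 1) * dx for i in range(steps))
    return math.exp(-integral)

alpha2, lam2, t = 1.5, 2.0, 2.0
print(weibull_reliability(t, alpha2, lam2))      # closed form
print(reliability_from_hazard(t, alpha2, lam2))  # numerical integration, ~ same value
```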
The proofs in Section III are valid provided the probability value $p$ (the FOR) lies in the interval (0,1). Hence, the following assumption must hold true:
Assumption 7.
The SDP model should produce at least one false negative and one true negative on the test set.
III The Proofs
III-A The tight lower bound in terms of hazard rate
In Section II, we modelled the number of hazard (failure) instances in a software system that is tested by using the SDP model as a random variable (that is, $X$). Now, the following theorem bounds the deviation of this random variable below the value of the hazard rate of a manually tested software, $z_h(t)$ (in fact, far below the expectation, $\mu$).
Theorem 1.
Let $X_1, X_2, \dots, X_m$ be independent Bernoulli trials such that, for $1 \le i \le m$, $\Pr[X_i = 1] = p$, where $0 < p < 1$. Also let the parameters $\alpha_2, \lambda_2 > 0$ and the time $t > 0$. Then, for $X = \sum_{i=1}^{m} X_i$, $\mu = \mathbb{E}[X]$, and for the Weibull hazard model of a manually tested software, $z_h(t) = \alpha_2\lambda_2 t^{\alpha_2 - 1}$ with $z_h(t) < \mu$:

$$\Pr[X < z_h(t)] < \left[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right]^{\mu}, \quad \text{where } \delta = 1 - \frac{z_h(t)}{\mu}.$$
Proof.
As before, the hazard rate of the manually tested software, $z_h(t)$, can be rewritten as:

$$z_h(t) = \alpha_2\lambda_2 t^{\alpha_2 - 1} = (1-\delta)\,\mu, \quad \text{for some } \delta \in (0,1) \qquad (11)$$
We know that, for some $\delta \in (0,1)$ and $\mu = \mathbb{E}[X]$, using the Chernoff bound, the lower tail bound for the sum of independent Bernoulli trials, $X$, deviating far below the expectation is [14]:

$$\Pr[X < (1-\delta)\,\mu] < \left[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right]^{\mu} \qquad (12)$$
Here, the value $(1-\delta)\mu$ represents the left-side margin from the expectation $\mu$, with a band length of $\delta\mu$.
Now, we wish to obtain a tight bound on the probability that the random variable (that is, $X$) deviates far below the hazard rate of a manually tested software, $z_h(t)$. In Equation 11, for some $\delta \in (0,1)$ and time $t$, the value $z_h(t)$ is assumed to be below the expectation, $\mu$, in the given time period $t$. Now, from Equations 11 and 12:

$$\Pr[X < z_h(t)] < \left[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right]^{\mu}, \quad \text{where } \delta = 1 - \frac{z_h(t)}{\mu} \qquad (13)$$
From Equation 6, we know that the expected hazard rate of a software system which uses the SDP model is:

$$\mathbb{E}[Z_s(t)] = \mu + \alpha_1\lambda_1 t^{\alpha_1 - 1} \ge \mu$$

Since $Z_s(t) \ge X$, the event that the SDP-tested software exhibits a lower hazard rate than $z_h(t)$ is contained in the event bounded in Equation 13. ∎
Thus, we have from Theorem 1 that the probability of observing fewer hazards in the software that uses the SDP model than in the same software tested by a human is exponentially small in $\mu$ and $\delta$, implying that at larger values of these parameters the bound becomes tighter.
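To make Theorem 1 concrete, the sketch below evaluates the bound of Equation 13 for illustrative parameter values and compares it with a Monte Carlo estimate of $\Pr[X < z_h(t)]$; all numbers, names, and the Weibull form are assumptions made only for illustration.

```python
# Minimal sketch of Theorem 1 / Equations 12-13, assuming the Weibull hazard
# form z_h(t) = alpha2 * lam2 * t**(alpha2 - 1). All parameter values below
# are hypothetical and chosen only so that z_h(t) < mu holds.
import math, random

m, p = 500, 0.04
mu = m * p                                  # Equation 5: expected hidden defects
alpha2, lam2, t = 1.5, 2.0, 10.0
z_h = alpha2 * lam2 * t ** (alpha2 - 1)     # manual-testing hazard rate (~9.49)

assert 0 < z_h < mu                         # required so that delta lies in (0, 1)
delta = 1 - z_h / mu                        # Equation 11: z_h(t) = (1 - delta) * mu
bound = (math.exp(-delta) / (1 - delta) ** (1 - delta)) ** mu   # Equation 13

# Monte Carlo estimate of Pr[X < z_h(t)] for comparison (X ~ Binomial(m, p)).
random.seed(0)
trials = 20_000
hits = sum(
    1 for _ in range(trials)
    if sum(1 for _ in range(m) if random.random() < p) < z_h
)
print(f"Chernoff bound: {bound:.4g}, empirical estimate: {hits / trials:.4g}")
```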
III-B The tight upper bound in terms of reliability
In this section, we provide a lemma that calculates the reliability of a software that is tested using the SDP model.
Lemma 2.
Let $X_1, X_2, \dots, X_m$ be independent Bernoulli trials; also let the parameters $\alpha_1, \lambda_1 > 0$ and the time $t \ge 0$. Then, for $X = \sum_{i=1}^{m} X_i$ and $Z_s(t) = X + \alpha_1\lambda_1 t^{\alpha_1 - 1}$, its reliability is:

$$R_s(t) = e^{-\left(Xt + \lambda_1 t^{\alpha_1}\right)} \qquad (14)$$
Proof.
From Equation 6, we have the hazard rate of the software that is tested by using the SDP model. Now, substituting the value of $Z_s(t)$ (from Equation 6) into Equation 9, we have:

$$R_s(t) = \exp\!\left(-\int_0^t \left(X + \alpha_1\lambda_1 x^{\alpha_1 - 1}\right)dx\right) \qquad (15)$$
Here, $R_s(t)$ is a random variable used to represent the reliability of the software which is tested based on the predictions of the SDP model. Now, simplifying Equation 15 (evaluating the integral) results in the reliability of the software that is tested by using the SDP model, $R_s(t) = e^{-\left(Xt + \lambda_1 t^{\alpha_1}\right)}$. ∎
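To illustrate Lemma 2 and Equation 15 for one realisation of $X$, the sketch below evaluates $R_s(t) = e^{-(Xt + \lambda_1 t^{\alpha_1})}$ after sampling the hidden-defect count; the Weibull form and all parameter values are assumptions made for illustration.

```python
# Minimal sketch of Lemma 2 / Equation 15: for a realised hidden-defect count X,
# the reliability of the SDP-tested software is exp(-(X*t + lam1*t**alpha1)),
# assuming the Weibull hazard form used above. Values are illustrative.
import math, random

def sdp_reliability(t: float, x: int, alpha1: float, lam1: float) -> float:
    # integral_0^t (x + alpha1*lam1*s**(alpha1-1)) ds = x*t + lam1*t**alpha1
    return math.exp(-(x * t + lam1 * t ** alpha1))

m, p = 500, 0.04
alpha1, lam1, t = 1.5, 0.2, 2.0

random.seed(1)
x = sum(1 for _ in range(m) if random.random() < p)  # one sampled value of X
print(x, sdp_reliability(t, x, alpha1, lam1))
```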
Now, the expected reliability of a software system which uses the SDP model, $\mathbb{E}[R_s(t)]$, is derived as:

$$\mathbb{E}[R_s(t)] = \mathbb{E}\!\left[e^{-\left(Xt + \lambda_1 t^{\alpha_1}\right)}\right] = e^{-\lambda_1 t^{\alpha_1}}\,\mathbb{E}\!\left[e^{-Xt}\right] \qquad (16)$$
We observe that:

$$\mathbb{E}\!\left[e^{-Xt}\right] = \mathbb{E}\!\left[e^{-t\sum_{i=1}^{m} X_i}\right] = \mathbb{E}\!\left[\prod_{i=1}^{m} e^{-X_i t}\right] \qquad (17)$$
Since the $X_i$ are independent, the random variables $e^{-X_i t}$ are also independent. It follows that $\mathbb{E}\!\left[\prod_{i=1}^{m} e^{-X_i t}\right] = \prod_{i=1}^{m} \mathbb{E}\!\left[e^{-X_i t}\right]$. Now using these facts in Equation 16 gives:

$$\mathbb{E}[R_s(t)] = e^{-\lambda_1 t^{\alpha_1}}\prod_{i=1}^{m} \mathbb{E}\!\left[e^{-X_i t}\right] \qquad (18)$$
Here, the random variable $e^{-X_i t}$ assumes the value $e^{-t}$ with probability $p$, and the value 1 with probability $1-p$. Now, computing $\mathbb{E}\!\left[e^{-X_i t}\right]$ from these values, we have that:

$$\mathbb{E}\!\left[e^{-X_i t}\right] = p\,e^{-t} + (1-p) = 1 + p\left(e^{-t} - 1\right) \qquad (19)$$
Now we use the inequality $1 + x \le e^{x}$ with $x = p\left(e^{-t} - 1\right)$ to obtain the expected reliability:

$$\mathbb{E}[R_s(t)] \le e^{-\lambda_1 t^{\alpha_1}}\prod_{i=1}^{m} e^{\,p\left(e^{-t}-1\right)} = e^{-\lambda_1 t^{\alpha_1}}\,e^{\,mp\left(e^{-t}-1\right)} \qquad (20)$$
The inequality is used because it simplifies the computation and does not weaken the final bound in the following theorem. Now, by using Lemmas 1 and 2, the following theorem provides a bound on the deviation of the random variable $R_s(t)$ above the reliability of a manually tested software, $R_h(t)$.
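The following sketch illustrates Equations 16 to 20 numerically by comparing the exact expected reliability, computed from the product form in Equations 18 and 19, with the upper bound obtained from $1 + x \le e^x$; all parameter values are illustrative assumptions.

```python
# Minimal sketch of Equations 16-20: the exact expected reliability of the
# SDP-tested software versus the e^{mp(e^{-t}-1)} upper bound obtained from
# the inequality 1 + x <= e^x. Parameter values are illustrative.
import math

m, p = 500, 0.04
alpha1, lam1, t = 1.5, 0.2, 2.0

# Equations 18-19: E[R_s(t)] = e^{-lam1*t^alpha1} * (1 + p*(e^{-t} - 1))^m
exact = math.exp(-lam1 * t ** alpha1) * (1 + p * (math.exp(-t) - 1)) ** m

# Equation 20: E[R_s(t)] <= e^{-lam1*t^alpha1} * e^{m*p*(e^{-t} - 1)}
bound = math.exp(-lam1 * t ** alpha1) * math.exp(m * p * (math.exp(-t) - 1))

print(f"exact: {exact:.4g}, upper bound: {bound:.4g}")  # exact <= bound
```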
Theorem 2.
Let $X_1, X_2, \dots, X_m$ be independent Bernoulli trials such that, for $1 \le i \le m$, $\Pr[X_i = 1] = p$, where $0 < p < 1$. Also let the parameters $\alpha_1, \lambda_1, \alpha_2, \lambda_2 > 0$ and the time $t > 0$, with $0 < \lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1} < t\,\mu$. Then, for $X = \sum_{i=1}^{m} X_i$, the reliability $R_s(t)$ of the SDP-tested software, and the Weibull-based reliability function of the manually tested software, $R_h(t) = e^{-\lambda_2 t^{\alpha_2}}$:

$$\Pr\!\left[R_s(t) > R_h(t)\right] < \left[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right]^{\mu}, \quad \text{where } \delta = 1 - \frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t\,\mu}.$$
Proof.
The proof for this upper tail is very similar to the proof for the lower tail, as we saw in Theorem 1. As before, the event of interest can be rewritten as:

$$\Pr\!\left[R_s(t) > R_h(t)\right] = \Pr\!\left[e^{-\left(Xt + \lambda_1 t^{\alpha_1}\right)} > e^{-\lambda_2 t^{\alpha_2}}\right] = \Pr\!\left[X < \frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t}\right] \qquad (21)$$
Now, we wish to obtain a tight bound on the probability that the random variable $R_s(t)$ deviates far above the value $R_h(t)$. In Equation 21, for some $\delta \in (0,1)$ and time $t$, the value $\frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t}$ is assumed to be below the expectation, $\mu$, in the given time period $t$. Now, equating Equations 12 and 21, we get:

$$(1-\delta)\,\mu = \frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t}, \quad \text{that is, } \delta = 1 - \frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t\,\mu} \qquad (22)$$
Now, substituting the value of $\delta$ (from Equation 22) into Equation 12 gives the tight upper bound (expressed in terms of the lower tail of $X$) on the deviation of the random variable $R_s(t)$ above the reliability (derived from the Weibull model of the hazard rate) of a manually tested software, $R_h(t)$. This is expressed below:

$$\Pr\!\left[R_s(t) > R_h(t)\right] = \Pr\!\left[X < (1-\delta)\,\mu\right] < \left[\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right]^{\mu} \qquad (23)$$
After simplification (substituting the value of $\delta$ from Equation 22), we get:

$$\Pr\!\left[R_s(t) > R_h(t)\right] < e^{\frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t} - \mu}\left(\frac{t\,\mu}{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}\right)^{\frac{\lambda_2 t^{\alpha_2} - \lambda_1 t^{\alpha_1}}{t}} \qquad (24)$$

∎
Thus, we have from Theorem 2 that the probability of obtaining better reliability in the software that uses the SDP model than in the same software tested by a human is exponentially small in $\mu$ and $\delta$, implying that, similar to the result of Theorem 1, at larger values of these parameters the bound becomes tighter.
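As a final illustration, the sketch below evaluates the Theorem 2 bound (Equations 21 to 24) under the assumed Weibull parametrisation; the parameter values are hypothetical and chosen only so that the threshold stays below $\mu$.

```python
# Minimal sketch of Theorem 2 / Equations 21-24, under the assumed Weibull
# parametrisation: Pr[R_s(t) > R_h(t)] = Pr[X < (lam2*t^alpha2 - lam1*t^alpha1)/t],
# bounded by the Chernoff lower tail. All parameter values are illustrative.
import math

m, p = 500, 0.04
mu = m * p
alpha1, lam1 = 1.5, 0.2    # Weibull parameters of the SDP-tested part (Equation 6)
alpha2, lam2 = 1.5, 2.0    # Weibull parameters of the manually tested software (Equation 8)
t = 10.0

threshold = (lam2 * t ** alpha2 - lam1 * t ** alpha1) / t      # Equation 21
assert 0 < threshold < mu                                      # needed so delta is in (0, 1)
delta = 1 - threshold / mu                                     # Equation 22

bound = (math.exp(-delta) / (1 - delta) ** (1 - delta)) ** mu  # Equations 23-24
print(f"Pr[R_s(t) > R_h(t)] < {bound:.4g} (delta = {delta:.3f})")
```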
IV Future Plans
Theorems 1 and 2 provide preliminary bounds for the post-analysis of a binary classification model (the SDP model) in real-time working environments. We believe that providing a critique of a developed binary classification model in a real-time working environment is novel in machine learning theory and has the potential to provide insight into the feasibility of other applications (such as safety-critical applications, for example, tumour prediction systems for medical diagnosis, online fraud detection, etc.). Within the scope of this work, the possible extensions of Theorems 1 and 2 are numerous. A few examples include: 1) the bounds become more specific to the application if state-of-the-art hazard (and reliability) models are used in the construction of the proofs; 2) new bounds can be derived if the random variable is assumed to be a function of time, $t$; and 3) new bounds can be derived by assuming dependency among the random variables (relaxing Assumption 4), in which case we would derive bounds assuming the presence of cascading failures in the software as a result of the SDP model's predictions.
References
- [1] T. M. Khoshgoftaar and J. C. Munson, “Predicting software development errors using software complexity metrics,” IEEE Journal on Selected Areas in Communications, vol. 8, no. 2, pp. 253–261, 1990.
- [2] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking classification models for software defect prediction: A proposed framework and novel findings,” IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 485–496, 2008.
- [3] S. Herbold, A. Trautsch, and J. Grabowski, “A comparative study to benchmark cross-project defect prediction approaches,” IEEE Transactions on Software Engineering, vol. 44, no. 9, pp. 811–833, 2017.
- [4] Y. Zhou, Y. Yang, H. Lu, L. Chen, Y. Li, Y. Zhao, J. Qian, and B. Xu, “How far we have progressed in the journey? an examination of cross-project defect prediction,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 27, no. 1, pp. 1–51, 2018.
- [5] T. Zimmermann, N. Nagappan, H. Gall, E. Giger, and B. Murphy, “Cross-project defect prediction: a large scale experiment on data vs. domain vs. process,” in Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, 2009, pp. 91–100.
- [6] S. Amasaki, H. Aman, and T. Yokogawa, “An extended study on applicability and performance of homogeneous cross-project defect prediction approaches under homogeneous cross-company effort estimation situation,” Empirical Software Engineering, vol. 27, no. 2, pp. 1–29, 2022.
- [7] U. S. B and R. Sadam, “How far does the predictive decision impact the software project? the cost, service time, and failure analysis from a cross-project defect prediction model,” Journal of Systems and Software, p. 111522, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121222001984
- [8] C. M. Bishop and N. M. Nasrabadi, Pattern recognition and machine learning. Springer, 2006, vol. 4, no. 4.
- [9] M. R. Lyu et al., Handbook of software reliability engineering. IEEE Computer Society Press, Los Alamitos, 1996, vol. 222.
- [10] R. S. Pressman, Software engineering: a practitioner’s approach. Palgrave Macmillan, 2005.
- [11] S. M. Ross, Introduction to probability models. Academic Press, 2014.
- [12] R. Motwani and P. Raghavan, Randomized algorithms. Cambridge University Press, 1995.
- [13] M. A. Hartz, E. L. Walker, and D. Mahar, Introduction to software reliability: a state of the art review. The Center, 1997.
- [14] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations,” The Annals of Mathematical Statistics, pp. 493–507, 1952.