Data Diversity based Metamorphic Relation Prioritization

Madhusudan Srinivasan Montana State University
Bozeman, United States
[email protected] Upulee Kanewala University of North Texas
Florida, United States
[email protected]

Abstract

An oracle is a mechanism to decide whether the outputs of the program for the executed test cases are correct. For machine learning programs, such oracle is not available or too difficult to apply. Metamorphic testing is a testing approach that uses metamorphic relations, which are necessary properties of the software under test to help verify the correctness of a program. Prioritization of metamorphic relations helps to reduce the cost of metamorphic testing [1]. However, prioritizing metamorphic relations based on code coverage is often not effective for prioritizing MRs for machine learning programs, since the decision logic of a machine learning model is learned from training data, and 100% code coverage can be easily achieved with a single test input. To this end, in this work, we propose a cost-effective approach based on diversity in the source and follow-up data set to prioritize metamorphic relations for machine learning programs. We show that the proposed data diversity-based prioritization approach increase the fault detection effectiveness by up to 40% when compared to the code coverage-based approach and reduce the time taken to detect a fault by 29% when compared to random execution of MRs. Overall, our approach leads to saving time and cost during testing.

Index Terms:

Metamorphic Testing, Metamorphic Relations, Machine Learning

I Introduction and Motivation

Machine learning (ML) is increasingly deployed in large-scale software and safety-critical systems due to recent advancements in deep learning and reinforcement learning. The software applications powered by ML are applied and utilized in our daily lives; from finance, and energy, to health and transportation. AI-related accidents are already making headlines, from inaccurate facial recognition systems causing false arrests to unexpected racial and gender discrimination by machine learning software [2] [3] [4]. Also, a recent incident with an Uber autonomous car resulted in the death of a pedestrian [5]. Thus, ensuring the reliability and conducting rigorous testing of machine learning systems is very vital to prevent any future disasters.

One of the critical components of software testing is the test oracle. The mechanism to determine the correctness of test case output is known as test oracle [6]. A program can be considered a non-testable program if there does not exist a test oracle or too difficult to determine whether the test output is correct [7]. Hence, those programs suffer from a problem known as the test oracle problem. ML-based systems exhibit non-deterministic behavior and reason in probabilistic manner. Also, the output is learned and predicted by a machine learning model as result do not have an expected output to compare against the actual output. As a result, correctness of the output produced by machine learning-based applications cannot be easily determined and suffer from test oracle problem [8].

Metamorphic testing (MT) is a property-based software testing technique that is used to alleviate the test oracle problem. A central component of MT is a set of Metamorphic Relations (MRs), which are necessary properties of the target function or algorithm in relation to multiple inputs and their expected outputs [9] [10]. In recent years especially, MT has attracted a lot of attention and detected a large number of real-life faults in various domains. For example, Murphy et al [11] identified six metamorphic relations and real bugs were identified on three machine learning tools. In a related project, Xie et al [12], [13] used metamorphic testing for the detection of faults in K-nearest neighbours and Naive Bayes classifiers. The results revealed that both algorithms violated some of the metamorphic relations. Also, some real faults were detected in the open–source machine learning tool Weka [14]. Zhang et al [15] applied MT to test DNN based autonomous driving systems. MT was able to find inconsistent behaviors under different road scenes for real-world autonomous driving models.

Previous work [16] has shown that several MRs detect the same type of fault and differ only in fault detection effectiveness. Also, each MR in a machine learning application can have multiple source and follow-up test cases, as a result, the execution time of the MRs increases drastically. The source and follow-up test cases in a MR represented as training data set could take several days to train and build a machine learning model for a large training data set. Hence, prioritization of MRs is important for conducting effective metamorphic testing on these programs.

Prior work [1] introduced approaches to prioritize MRs based on code coverage and fault-based approach. The fault-based prioritization approach involves using mutant killed information of each MR to prioritize MR. The code coverage approach uses statement or branch coverage information for prioritizing MRs. Further, the code coverage-based approach does not work effectively for machine learning-based applications since the data used for training determines the program logic. Machine learning program is usually just a sequence of library function invocations and as a result 100% code coverage can be easily achieved when at least one test input is processed. To this end, in this work, we propose methods for MR prioritization for machine learning programs. The MR prioritization method proposed in this work qualitatively identifies the diversity between the source and follow-up training data set in a MR. Moreover, our results indicate that MR prioritization reduces the number of MRs needed to execute on SUT and reduces the time taken to detect a fault up to 29%.

We make the following contributions in this work:

1.

Propose a novel data diversity based automated method where we proposed four metrics to prioritize metamorphic relations based on diversity between the source and follow-up data set in a MR.
2.

Evaluate the effectiveness of our proposed MR prioritization methods on four machine learning algorithm implementations and perform a comparison against two baseline approaches (1) Random baseline: represents the current practice of executing the MRs in a random order, (2) Code coverage-based ordering: represents the MR ordering based on the statement or branch coverage information collected when testing the previous version of the SUT as show in the prior work [1].

II Proposed Approach

Refer to caption — Figure 1: Steps for MR prioritization of ML Programs

In this section, we discuss our proposed method for prioritization of MRs. Figure 1 shows the steps for prioritizing MRs. In this approach, we prioritize the MRs as follows:

1.

Let the set of source test cases in a MR used for testing the SUT be the prioritizing source test cases $(T_{sp})$ .
2.

Let the set of follow-up test cases in a MR used for testing the SUT be the prioritizing follow-up test cases $(T_{fp})$ .
3.

For $(T_{sp})$ and $(T_{fp})$ , apply the proposed metric: rule based, anomaly detection, clustering based and data distribution. The metrics are discussed in detail in section II-A.
4.

For $(T_{sp})$ and $(T_{fp})$ , apply the proposed metric: rule based, anomaly detection, clustering based and data distribution and get the MR diversity value of each MR based on the formulas provided in section II-A.
5.
MR diversity value calculated using the metrics for each MR is normalized using the following steps:
1. (a)
  
  Let the set of MR diversity value of MRs be $DMR_{v}$ .
2. (b)
  
  Identify minimum and maximum value in $DMR_{v}$ . Let minimum value be $Min_{v}$ and maximum value be $Max_{v}$ .
3. (c)
  
  For each MR diversity value $v\in DMR_{v}$ , generate the normalized MR diversity value using the formula below:
  
  $N_{MR}=\frac{v-Min_{v}}{Max_{v}-Min_{v}}$ (1)
  
  Where $N_{MR}$ represent the normalized MR diversity value.
6.

Prioritize the normalized MR diversity values and select the top n MRs from the prioritized ordering to execute based on the resources available for testing.

Our proposed metrics are explained in detail below.

II-A Dataset Diversity Approach

In this work, we propose four metrics to prioritize metamorphic relations based on diversity between source and follow-up data set in a MR. Our intuition behind the metrics is that greater the diversity between the source and follow-up data set, then greater the quality of MR and performance of the ML model.

The proposed metrics are explained in detail below.

II-A1 Rule Based Classifier

Rule-based classifiers are just another type of classifier which makes the class decision by using various if-then rules. We apply a rule based classifier (CN2) to determine diversity between the source and follow-up data set in a metamorphic relation. CN2 is an instance of sequential covering algorithm family based on Rule-Based classification used for learning set of rules.The basic idea here is to learn one rule, remove the data that it covers, then repeat the same process. We hypothesize that if greater the diversity in the classification rules between the source and followup data set, then greater the performance of the model and MR. We apply the following steps.

1.

Let the set of source test cases used for testing SUT be the prioritizing source test cases $(T_{sp})$ .
2.

Let the set of follow-up test cases of the SUT be the prioritizing follow-up test cases $(T_{fp})$ .
3.

Apply Sequential Covering Algorithm (CN2) on $T_{sp}$ and $T_{fp}$ and obtain the rules.The CN2 induction algorithm is a learning algorithm for induction of simple and comprehensible rules.
4.

Remove same set of rules covered in both $T_{sp}$ and $T_{fp}$ .
5.

Calculate the total rules identified in $T_{sp}$ and $T_{fp}$ . Let $Total_{sr}$ and $Total_{fr}$ represent the total rules in $T_{sp}$ and $T_{fp}$ respectively.
6.

We calculate the diversity of MR using the formula below:

$\begin{split}\text{$R_{MR}$}=\text{$\mid(Total_{sr})-(Total_{fr})\mid$}\end{split}$ (2)

Where $R_{MR}$ represent the diversity of the MR.

II-A2 Anomaly Detection

Anomaly detection (or outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. We apply anomaly detection to identify outliers in the source and followup data set in a metamorphic relation. We hypothesize that greater the diversity in outliers detected between the source and followup, greater the possibility of detecting fault in the machine learning model. We apply the following steps.

1.

Let the set of source test cases used for testing SUT be the prioritizing source test cases $(T_{sp})$ .
2.

Let the set of follow-up test cases of the SUT be the prioritizing follow-up test cases $(T_{fp})$ .
3.

Apply K Nearest Neighbour (KNN) algorithm to $T_{sp}$ and $T_{fp}$ and identify outlier instances.
4.

Remove identical outliers detected between $T_{sp}$ and $T_{fp}$ .
5.

Calculate the total outliers in $T_{sp}$ and $T_{fp}$ respectively. Let $Total_{so}$ and $Total_{fo}$ be the sum of outlier detected in $T_{sp}$ and $T_{fp}$ respectively.
6.

We calculate the diversity of MR using the formula below:

$\begin{split}\text{$O_{MR}$}=\text{$\mid(Total_{so})-(Total_{fo})\mid$}\end{split}$ (3)

Where $O_{MR}$ represent the diversity of the MR.

II-A3 Clustering-based

In this metric, we partition the input space ie. all possible inputs into different region and determine how the data is distributed in the space. To achieve this, we use clustering methods which simply try to group similar patterns into clusters whose members are more similar to each other based on some distance measure than to members of other clusters. We apply the following steps.

1.

Let the set of source test cases used for testing SUT be the prioritizing source test cases $(T_{sp})$ .
2.

Let the set of follow-up test cases of the SUT be the prioritizing follow-up test cases $(T_{fp})$ .
3.

Apply K mean algorithm to $T_{sp}$ and $T_{fp}$ and identify clusters. Let clusters of $T_{sp}$ and $T_{fp}$ be $S_{cl}$ and $F_{cl}$ .
4.

Find the distance between the clusters by calculating the euclidean distances between the clusters in $S_{cl}$ . Let the total distance between the clusters be $Ts_{cds}$ .
5.

Determine the size of each cluster in $S_{cl}$ . Let the total size of the clusters in $S_{cl}$ be $Ts_{sc}$ .
6.

Find the distance within the cluster by calculating the average of the distances from observations to the centroid of each cluster in $S_{cl}$ . Let average distance within the cluster in $S_{cl}$ be $As_{cd}$ .
7.

Apply the steps 4 to 6 to $F_{cl}$ respectively. Let $Tf_{cds}$ be the total distance between the clusters, $Tf_{sc}$ be the total size of the clusters and $Af_{cd}$ be the average distance within the cluster.

$\begin{split}\text{$R_{MR}$}=\text{$\mid(Total_{sr})-(Total_{fr})\mid$}\end{split}$ (4)

We calculate the diversity of MR using the formula below:

\begin{split}\begin{split}\text{$C_{MR}$}=\text{$\mid(Ts_{cds}+Ts_{sc}+As_{cd})-(Tf_{cds}+Tf_{sc}+Af_{cd})\mid$}\end{split}\end{split}

(5)

Where $C_{MR}$ is the diversity of MR.

II-A4 Data Distribution

Data used in training machine learning model often form very similar patterns. For example, a lot of data were grouped around the middle values, with fewer observations at the outside edges of the distribution. These patterns are known as distributions, because they describe how the data are distributed across the range of possible values. In this metric, we used shape and spread of data distribution of dataset to find diversity of metamorphic relations. The spread of distribution involves calculating range, variance and standard deviation of the source and follow-up data set. Similarly, shape of distribution involves calculating skewness and kurtosis of the source and follow-up data set of the MR. We hypothesize that larger the diversity in the data distribution between the source and follow-up data set of a MR, then greater the quality of MR. We apply the following steps.

1.

Let the set of source test cases used for testing SUT be the prioritizing source test cases $(T_{sp})$ .
2.

Let the set of follow-up test cases of the SUT be the prioritizing follow-up test cases $(T_{fp})$ .
3.

Calculate the skewness and kurtosis for $T_{sp}$ . Let $ST_{sk}$ be the sum of skewness and kurtosis for $T_{sp}$ .
4.

Calculate range, variance and standard deviation for $T_{sp}$ . Let $ST_{rvs}$ be the sum of variance, range and standard deviation for $T_{sp}$ .
5.

Apply step 3 and 4 to $T_{fp}$ . Let $FT_{sk}$ and $FT_{rvs}$ be the shape and spread of distribution for $T_{fp}$ .

We calculate the diversity of MR using the formula below:

\begin{split}\text{$DD_{MR}$}\begin{split}=\text{$\mid(ST_{sk}+ST_{rvs})-(FT_{sk}+FT_{rvs})\mid$}\end{split}\end{split}

(6)

Where $DD_{MR}$ is the diversity of MR.

III Evaluation

In this work, we plan to evaluate the utility of the developed prioritization approaches on the following aspects: (1) Fault detection effectiveness of MR sets, (2) Effective number of MRs required for testing, and (3) Time taken to detect a fault. We evaluate the effectiveness of the proposed approaches with (1) a random baseline: this represents the current practice of executing source and follow-up data set of the MRs in random order. (2) code-coverage based ordering: this represents the MR order based on the statement and branch-coverage information of the MRs. The approach selects MR which has covered highest number of statements or branches and place it in the prioritized MR ordering. The process continues until all the possible statement or branches are covered [1].

We conducted experiments to find answers for the following research questions:

1.

Research Question 1 (RQ1): Are the proposed MR prioritization approach based on data diversity more effective than the random baseline?
2.

Research Question 1 (RQ2): Are the proposed MR prioritization approach more effective than the code coverage-based prioritization?

III-A Evaluation Procedure

In order to answer the above research questions, we carry out the following validation procedure similar to previous work [1]:

1.

We generate a set of mutants for the SUT. These mutants will be referred to as the validation set of faults $(F_{v})$ . Then from the generated mutants, we remove mutants that give exceptions or mutants that do not finish execution within a reasonable amount of time from further consideration.
2.

We use the method described in Section II to obtain data diversity-based MR ordering. Then we applied the obtained MR ordering to $F_{v}$ and log the mutant killing information.
3.

Creating the random baseline: We generate 100 random MR orderings and apply each of those orderings to $F_{v}$ . The mutant killing information for each of those random orderings is logged. The average mutant killing rates of these 100 random orderings were computed to get the fault detection effectiveness of the random baseline.
4.
Creating the Code Coverage-based ordering: Apply greedy approach on the statements or branches covered to prioritize the MRs as follows [1]:
1. (a)
  
  Select the MR that covers the highest number of statement or branches in the program and place it in the prioritized MR ordering.
2. (b)
  
  Remove the statements and branches covered by the MR.
3. (c)
  
  Repeat the previous two steps until all the statements and branches are covered.

III-B Evaluation Measures

We used the following measures to evaluate the effectiveness of the MR orderings generated by our proposed metrics:

1.

To measure the fault detection effectiveness of a given set of MRs, we use the percentage of mutants killed by those MRs.
2.

To calculate the effective MR set size we used the following approach: typically, the fault detection effectiveness of MT increases as the number of MRs used for testing increases. However, after a certain number of MRs, the rate of increase in fault detection slows due to factors such as redundancy of MRs. Therefore when there is no significant increase in the fault detection between two MR sets of consecutive sizes of size $m$ and $m+1$ , where the MR set of size $m+1$ is created by adding one MR to the MR set set of size $m$ , the effective MR set size can be determined. That is, if the difference in fault detection effectiveness of MR set of size $m$ and MR set of size $m+1$ is less than some threshold value, $m$ would be the effective MR set size that should be used for testing. Determining this threshold value should be done considering the critical nature of the SUT. In this work, we used two threshold values of 5% and 2.5% as used in previous related work for determining the oracle data set size [17].
3.

We used the following approach to find the average time taken to detect a fault: for each killable mutant $m$ in $F_{v}$ , we computed the time taken to kill the mutant ( $t_{m}$ ) by computing the time taken to execute the source and follow-up test cases of the MRs in a given prioritized order until $m$ is killed (here it is assumed that the source and follow-up test cases for each MR are executed sequentially). Then the average time taken to detect a fault is computed using the following formula:

$\frac{\sum t_{m}}{\#killable\,mutants\,in\,F_{v}}$
4.

We calculate Average percentage of fault detected (APFD) developed by Elbaum et al. [18] [19] [20] that measures the average rate of fault detection per percentage of test suite execution. The APFD is calculated by taking the weighted average of the number of faults detected during the execution of MRs. APFD can be calculated using the formula below:

$\begin{split}\text{APFD}=\frac{1-MRF_{1}+MRF_{2}+...+MRF_{m}}{nm}+\frac{1}{2n}\end{split}$ (7)

Where $MRF_{i}$ represents the fault $F_{i}$ detected by the metamorphic relations under evaluation, m represents the number of faults present in the SUT and n represents the total number of MRs used.

III-C Subject Programs and MRs

We applied the above-mentioned validation procedure on the following four machine learning programs for evaluating our proposed MR prioritization methods.

•

IBk¹¹1http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html: K-nearest neighbours (KNN) classifier implementation in the Weka machine learning library [21]. The input to IBk is a training data set, and a test data set to represent in .arff format. The output is the classification predictions made on the instances in the test data set.
•

Linear Regression²²2https://weka.sourceforge.io/doc.dev/weka/classifiers/functions/LinearRegression: linear regression is a linear approach that models the relationship between the input variables and the single output variable. The subject program is the linear regression implementation in weka library. Weka latest stable version is 3.8.5, which consists of the multiple linear regression, named LinearRegression() class. The input to linear regression program is a training data set, and a test data set to represent in .arff format. The output is the prediction made on the instances in the test data set.
•

Naive Bayes³³3https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html: A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification task. The subject program is the Naive Bayes algorithm implementation, named NaiveBayes() class in Weka library. The input to Naive Bayes program is a training data set, and a test data set to represent in .arff format. The output is a classification made on the instances in the test data set.
•

Convolutional Neural Network⁴⁴4https://javadoc.io/doc/org.deeplearning4j/deeplearning4j-nn/latest/index.html: Convolutional Neural Network (CNN) is a class of Artificial Neural Network and most commonly applied to analyze visual imagery. The subject program is the CNN algorithm implementation in Deeplearning4j library. The input to the CNN is MNIST handwritten digit data set and the output is to classify the image of a handwritten digit into one of 10 classes representing integer values from 0 to 9.

III-D Metamorphic Relations

For conducting MT on IBk and Naive Bayes, we used 11 MRs developed by Xie et al. [12]. These MRs are developed based on the user expectations of supervised classifiers. These MRs modify the training and testing data so that the predictions do not change between the source and follow-up test cases. Similarly, for testing Linear Regression system, we used 11 MRs developed by Luu et al. [22]. The properties of the linear regression is related to addition of data points, the rescaling of inputs, the shifting of variables, the reordering of data, and the rotation of independent variables. These MRs were grouped into 6 categories and derived from the properties of the targeted algorithm. For testing convolutional neural network, we used 11 MRs developed by Ding et al. [23]. The MRs were developed on three levels: system level, data set level and data item level. The MRs in the data set level are created on the relation of the classification accuracy of reorganized training data sets. The MRs in the data item level are created on the relation of the classification accuracy of reproduced individual images.

III-E Source and Follow-up Test Cases

As we described before, MT involves the process of generating source and follow-up test cases based on the MRs used for testing.

IBk uses a training data set to train the k-nearest neighbor classifier, and a test data set is used to evaluate the performance of the trained classifier. After executing source test case, the output will be the class labels predicted for the test data set, which is a value from the set {0,1,2,3,4}. The attribute values are randomly selected within the range [0, 100]. The values of the class label are randomly selected from the set {0,1,2,3,4,5}. The size of the training testing data sets ranges within [0, 500]. To generate the follow-up test cases using these source test cases, we applied the input transformations described in the MRs.

Linear Regression uses a training data set to train the model, and a test data set is used to evaluate the performance of the trained model. We obtained the training data from machine learning repository ⁵⁵5https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD as source data set. The training data contains 515345 instances and 90 attributes. To generate the follow-up data set using these source data set, we applied the input transformations described in the MRs. Similarly for Naive Bayes, we obtained the training data from machine learning repository ⁶⁶6https://archive.ics.uci.edu/ml/datasets/Adult as source data set. The training data contains 48842 instances and 14 attributes. To generate the follow-up data set using these source data set, we applied the input transformations described in the MRs. For testing convolutional neural network, we obtained the training data ⁷⁷7http://yann.lecun.com/exdb/mnist/ as source data set. The training dataset contains 60000 handwritten digits and 10000 images for testing. To generate the follow-up test cases using these source test cases, we applied the input transformations described in the MRs.

III-F Mutant Generation

For each subject program, we aimed at developing generated mutant set to be used as $F_{v}$ . For this, we used three automated mutation tools: $\mu$ Java⁸⁸8https://github.com/jeffoffutt/muJava, PIT⁹⁹9http://pitest.org/ and Major¹⁰¹⁰10http://mutation-testing.org/. For generating mutants using $\mu$ Java, we used all the method level mutation operators [24]. With PIT mutation tool [25] and the Major mutation tool [26] we used all the available mutation operators provided by the tools. All the mutants the were generated using these tools were first order mutants, where a mutation operator is used to make a single modification to the source code to generate the mutant. Table I shows the number of mutants used for evaluation.

TABLE I: Mutants generated for version

v_{k}

of the SUTs

Version	Subject	#Major	#PIT	# $\mathbf{\mu}$ Java
3.8.6	IBk	1021	N/A	621
3.8.6	Linear Regression	377	73	N/A
3.8.6	Naive Bayes	452	452	N/A
0.9.1	CNN	92	N/A	N/A

TABLE II: Validation Setup

Subject	$\mathbf{{T_{sp}}}$	$\mathbf{{F_{v}}}$
IBk	Test1	Major
Linear Regression	YearPrediction	Major
Naive Bayes	Adult	PIT
CNN	MNIST	PIT

IV Result and Analysis

In this section, we discuss our experimental results and provide answers to the two research questions that we listed in Section III. For each subject program, we carried out the validation procedure described in Section III-A using the setup described in Table II. In this setup, we used the generated mutant set for evaluating the prioritized MR ordering as $F_{v},$ and the source test cases as $T_{sp}$ . For example, in Table II for IBk, Test1 refers to the randomly generated data set that we described in Section III-E. Similarly, for CNN, MNIST refers to the 60000 images used for training and testing. Adult and YearPrediction refers to the training data set used for Naive Bayes and Linear Regression program respectively. Figure 2 shows the average fault detection effectiveness for evaluation runs described above vs. the MR set size used for testing, for each subject program. We also plot the percentage of faults detected by the random baseline, statement and branch coverage based MR ordering for comparison. Results for each research question is provided below.

IV-A RQ1: Comparison of the MR Prioritization approaches against the Random Baseline

IV-A1 Fault detection effectiveness

We formulated the following statistical hypothesis to answer RQ1 in the context of fault detection effectiveness:

$H_{1}$ :

For a given MR set of size m, the fault detection effectiveness of the MR set produced by data diversity-based approach is higher than that of the random baseline.

The null hypothesis $H_{0x}$ for each of the above defined hypothesis $H_{X}$ is that the data-diversity based approaches perform equal to or worse than the random baseline.

In Table III to VI, we list the relative improvement in average fault detection effectiveness for the data diversity-based method over the currently used random approach of prioritizing MRs for Linear Regression, Naive Bayes, IBk and CNN respectively. To evaluate the above hypotheses, we used the paired permutation test since it does not make any assumptions about the underlying distribution of the data [27]. We apply the paired permutation test to each MR set size for each of the subject programs with $\alpha$ = 0.05, and the relative improvements that are significant are marked with a *.

As shown in Table III, for IBk, rule based metric shows improvements in average fault detection effectiveness over the random approach for all the MR set sizes except for the last MR set size. Particularly, among the different MR set sizes, the increase in fault detection percentage varies from 1.23% to 29.5%. Therefore, we reject the null hypotheses $H_{01}$ for IBk for rule based, Anomaly, clustering based and data distribution metric. Similarly for Linear Regression, Table IV indicates that the anomaly based, clustering based and data distribution based metrics show consistent significant improvements in fault detection over random. Therefore we can reject $H_{01}$ in general for Regression. The fault detection varies from 2.5% to 30% for the three metrics. Table V shows the relative improvement in the average fault detection percentage for Naive Bayes. Our proposed metrics shows improvements for MR set sizes $m=1$ to $m=5$ . Also overall, the relative improvement across MR prioritization methods varies between 0% - 40%. Therefore we reject $H_{01}$ in general for Naive Bayes. Table VI shows the relative improvement in the average fault detection percentage for CNN. Clustering-based metric shows improvements for all MR set sizes except MR set size $m=5$ . Also overall, the relative improvement across MR prioritization methods varies between 0% -66%. Therefore we can reject $H_{01}$ in general for CNN.

Therefore our data diversity based approach outperforms the random approach for all our subject programs.

TABLE III: Relative improvement in fault detection effectiveness of the data diversity-based approach compared to random baseline for IBk

MR Set-Size	Rule-based	Anomaly-based	Clustering-based	Data Distribution
1	29.54%*	29.54%*	25%*	20.45%*
2	17.54%*	8.77%*	5.26%*	19.29%*
3	11.11%*	3.17%*	9.52%*	9.52%*
4	5.88%*	1.47%*	4.41%*	2.94%*
5	5.63%*	4.22%*	7.04%*	2.81%*
6	5.40%*	1.35%*	4.05%*	0%
7	6.57%*	0%	1.31%*	0%
8	5.12%*	0%	0%	0%
9	5%*	1.25%*	-2.5%	3.75%*
10	2.43%*	1.21*	-1.21%	2.43%*
11	0%	0%	0%	0%

TABLE IV: Relative improvement in fault detection effectiveness of the data diversity based approaches compared to random baseline for Linear Regression

MR Set-Size	Anomaly-based	Clustering-based	Data Distribution
1	30.43%*	30.43%*	30.43%*
2	3.38%*	6.77%*	8.47%*
3	-6.15%	1.53%*	9.23%*
4	5.79%*	1.44%*	5.79%*
5	8.33%*	1.38%*	13.88%*
6	6.75%*	1.35%*	10.81%*
7	7.89%*	0%*	7.89%*
8	6.49%*	2.59%*	6.49%*
9	5.12%*	5.12%*	5.12%*
10	2.5%*	2.5%*	2.5%*
11	0%	0%	0%

TABLE V: Relative improvement in fault detection effectiveness of the data diversity-based approach compared to random baseline for Naive Bayes

MR Set-Size	Rule-based	Anomaly-based	Clustering-based	Data Distribution
1	38.02%*	40.84%*	40.84%*	40.84*
2	8.69%*	8.69%*	8.69%*	8.69*
3	2.04%*	2.04%*	2.04%*	2.04%*
4	1.01%*	1.01%*	1.01%*	1.01*
5	1.01%*	1.01%*	1.01%*	1.01*
6	0%	0%	0%	0%
7	0%	0%	0%	0%
8	0%	0%	0%	0%
9	0%	0%	0%	0%
10	0%	0%	0%	0%

TABLE VI: Relative improvement in fault detection effectiveness of the data diversity-based approach compared to random baseline for Convolutional Neural Network

MR Set-Size	Anomaly-based	Clustering-based
1	2.38%*	66.66%*
2	1.81%*	27.27%*
3	1.63%*	14.75%*
4	1.51%*	6.06%*
5	1.44%*	1.44%*
6	1.38%*	-2.77%*
7	1.35%*	8.18%*
8	1.31%*	5.26%*
9	1.29%*	3.89%*
10	1.28%*	2.56%*
11	0%	0%

TABLE VII: Average relative improvement in fault detection effectiveness of data diversity-based approach over statement coverage based approach for IBk

MR Set-Size	Rule-based	Anomaly-based	Clustering-based	Data Distribution
1	39.02%*	3.20%*	34.14%*	14.63*
2	36.73%*	26.53%*	22.44%*	6.12%*
3	9.37%*	1.56%*	7.81%*	10.93%*
4	7.46%*	2.98%*	5.97%*	1.49%*
5	8.69%*	7.24%*	10.14%*	0%
6	8.33%*	4.16%*	6.94%*	2.77%*
7	9.45%*	2.70%*	4.05%*	2.70%*
8	7.89%*	2.63%*	2.63%*	2.63%*
9	5.0%*	1.25%*	-2.50%*	3.75%*
10	2.43%*	1.21%	-1.21%	2.43%*
11	0%	0%	0%	0%

TABLE VIII: Average relative improvement in fault detection effectiveness of data diversity-based approach over branch coverage based approach for IBk

MR Set-Size	Rule-based	Anomaly-based	Clustering-based	Data Distribution
1	26.66%*	26.66%*	22.22*	22.22%*
2	36.73%*	26.53%*	22.44%*	6.12%*
3	11.11%*	3.17%*	9.52%*	9.52%*
4	5.88%*	1.47%*	4.41%*	2.49%*
5	5.63%*	4.22*	7.04%*	2.81%*
6	6.84%*	2.73%*	5.47%*	1.36%*
7	8%*	1.33%*	2.66%*	1.33%
8	6.49%*	1.29%*	1.29%*	1.29%
9	6.32%*	2.53%*	-1.26%	5.06%*
10	2.43%*	1.21%*	-1.21%*	2.43%*
11%	0%	0%	0%	0%

TABLE IX: Average relative improvement in fault detection effectiveness of data diversity-based approach over statement and branch coverage based approach for Naive Bayes

MR Set-Size	Rule-based	Anomaly-based	Clustering-based	Data Distribution
1	40%*	42.85%*	42.85*	42.85%*
2	9.80%*	9.89%*	9.89%*	9.898%*
3	2.04%*	2.04%*	2.04%*	2.04%*
4	1.01%	1.01%	1.01%	1.01%
5	0%	0%	0%	0%
6	0%	0%	0%	0%
7	0%	0%	0%	0%
8	0%	0%	0%	0%
9	0%	0%	0%	0%
10	0%	0%	0%	0%

TABLE X: Average relative improvement in fault detection effectiveness of data diversity-based approach over statement and branch coverage based approach for Linear Regression

MR Set-Size	Anomaly-based	Clustering-based	Data Distribution
1	33.33%*	30.43%*	33.33%*
2	6.71%*	6.77%*	8.47%*
3	1.53%*	1.53%*	9.23%*
4	1.44%*	1.44%*	5.79%*
5	1.38%*	1.38%*	13.88%*
6	1.35%*	1.35%*	10.81%*
7	0%	0%	7.89%*
8	2.59%*	2.59%*	6.49%*
9	3.79%*	5.12%*	3.79%*
10	2.5%*	2.5%*	2.58%*
11	0%	0%	0%

TABLE XI: Average relative improvement in fault detection effectiveness of data diversity-based approach over statement and branch coverage based approach for Convolutional Neural Network

MR Set-Size	Anomaly-based	Clustering-based
1	86.95%*	204.38%*
2	1.81%*	27.27%*
3	0%	12.90%*
4	1.51%*	6.06%*
5	0%	0%
6	1.38%*	-2.77%
7	1.35%*	8.10%*
8	1.31%*	5.26%*
9	0%	2.56%*
10	0%	1.26%*
11	0%	0%

TABLE XII: Average percentage of fault detected for subject programs

Subject	Data Distribution	Rule Based	Clustering Based	Anomaly Based	Random Based
IBk	0.85	-	0.91	0.72	0.72
Linear Regression	0.89	0.92	0.97	0.95	0.78
CNN	-	-	0.95	0.95	0.85
Naive Bayes	0.90	0.93	0.95	0.92	0.67

IV-A2 Effective number of MRs used for Testing

In Figure 2, we show the effective number of MRs using the fault detection thresholds for the subject programs. The pink vertical lines represent effective MR set size when using the rule-based, green vertical line represents data distribution, brown vertical line represents the clustering-based and the orange vertical lines represent the same for the random baseline. The respective thresholds are shown near each vertical line.

Similarly for CNN, clustering based approach provides a effective MR set size of 2 for the 2.5% threshold and random approach provides a effective MR set size of 4 for the 2.5% threshold. In Naive Bayes, anomaly based approach provides a effective MR set size of 1 for the 2.5% threshold and random approach provides a effective MR set size of 3 for the 2.5% threshold. In Linear Regression, data diversity based approach provides an effective MR set size of 1 for the 2.5% threshold and random approach provides a effective MR set size of 7 for the 2.5% threshold. This practically leads to 85% reduction in MR set size when compared to random approach. Therefore for all the subject programs, using our proposed methods resulted in reduction in the effective MR set size compared to using a random ordering of MRs.

IV-A3 The average time taken to detect a fault

Tables XIII and XIV shows the average time taken to detect a fault in Linear Regression and IBk respectively. As shown in these results, using Data Distribution resulted in significant reductions in the average time taken to detect a fault by 26% when compared to random baseline. Table XV shows the time taken to generate the metrics for the SUT. We observed that the time taken to generate the metrics for the SUT was less than a minute. Therefore, our proposed approach is time and cost effective when prioritizing MRs.

IV-A4 Average percentage of fault detected

Table XII shows the APFD for the SUT. We can observe that for all our subject programs, our proposed metrics provided higher APFD value between 0.72-0.92 when compared to random based prioritization. As a result, our metrics provides higher rate of fault detection when compared to random approach.

IV-B RQ2: Comparison of proposed MR Prioritization approach against code coverage-based approach

In this research question, we evaluate whether the data diversity based approach outperforms code coverage-based approach. In order to answer the research question, we formulate the following hypotheses:

$H_{2}$ :

For a given MR set of size $m$ , the fault detection effectiveness of the MR set produced using data diversity based approach is higher than the fault detection effectiveness of the MR set produced by the code coverage-based prioritization.

The null hypothesis $H_{0x}$ for each of the above defined Hypotheses $H_{x}$ is that the MR sets generated by the proposed method performs equal to or worse than the MR sets generated by the code coverage-based prioritization in terms of fault detection effectiveness.

Table VII and Table VIII shows the relative improvement in fault detection percentage between the MR sets generated by data diversity based approach over statement and branch coverage based approach for IBk. We observed that the rule based, anomaly and clustering based metric provided improvement in fault detection over statement coverage based approach for all the MR set sizes except $m=11$ . Therefore we can reject the null hypotheses $H_{02}$ .

Table IX shows the relative improvement in fault detection percentage between the MR sets generated by data diversity based approach and code coverage based approach for Naive Bayes. We observed that all the metric provided improvement in fault detection for MR set size $m=1$ to $m=4$ . Therefore we cannot reject the null hypotheses $H_{02}$ in general for Naive Bayes. Table X shows the relative improvement in fault detection percentage between the MR sets generated by data diversity based approach over statement and branch coverage based approach for Linear Regression. We observed that anomaly based and data distribution metric provided improvement in fault detection only for all MR set sizes except $m=11$ . Therefore we reject the null hypotheses $H_{02}$ in general for Linear Regression. Table XI shows the relative improvement in fault detection percentage between the MR sets generated by data diversity based approach over statement and branch coverage based approach for CNN. We observed that the clustering-based approach showed improvement in fault detection for all the MR set sizes and anomaly based approach showed improvement in fault detection only for MR set size $m=1$ to $m=5$ . Therefore, we reject the null hypothesis $H_{02}$ in general for clustering-based metric. Overall, our proposed metrics outperforms the statement and branch coverage based approach for the subject programs.

TABLE XIII: Comparison of average time taken to detect a fault using Data Diversity method and Random baseline for Linear Regression and IBk

Subject	Tsp	Data Distribution	Rule Based	Random Baseline	% Time Reduction
IBk	Test1	-	8910.80s	12562.49s	29.06%
Linear Regression	Prediction	17057.43s	-	21609.766s	26.06%

TABLE XIV: Comparison of average time taken to detect a fault using Data Diversity method and Random baseline for IBk

$\mathbf{{T_{sv}}}$	$\mathbf{{F_{v}}}$	Rule Based	Random baseline	% Time reduction
Testset1	PIT	8910.80s	12562.49s	29.06%
Testset2	Major	6148.29s	7508.29s	22.23%

TABLE XV: Time taken to generate the metrics for SUT

SUT	Rule-based	Anomaly-based	Clustering-based	Data Distribution
IBk	9 sec	7 sec	17 sec	6 sec
Linear Regression	-	4 sec	20 sec	3 sec
Naive Bayes	8 sec	6 sec	18 sec	5 sec
CNN	N/A	200 sec	300 sec	N/A

V Related Work

Prior work [1] proposed fault-based and code coverage-based approach to prioritize MRs in the context of regression testing. The fault-based approach applies a greedy algorithm on the faults detected by the MRs on the previous version of the SUT to prioritize MRs for testing the next version of the SUT. Code coverage-based approach uses statements and branches covered by the MRs to prioritize MRs. The proposed approach was applied to three diverse applications and the result indicates that the code coverage based approach did not work effectively for machine learning-based programs.

Mayer and Guderlei [28] conducted an empirical study to evaluate the usefulness of MRs. Based on the experiment result, the authors devise several criteria to asses the quality of the MRs based on the potential usefulness. These criteria were defined based on the different execution paths undertaken by the source and follow-up test cases. While this work provides some directions into selecting useful MRs, they do not directly address MR prioritization. In particular, we focus on developing automated methods for MR prioritization. Hui et al. [29] provided qualitative guidelines for MR prioritization. The guidelines could be used to select effective MRs and improve the performance of MT. The author also provides three measurable metrics for MR selection: in-degree of MR, algebra complexity of MR, and distance between test inputs of MR. The empirical result shows inconsistency between the correlation value and the metrics, indicating that other factors could influence the effectiveness of MR. Mani et al. [30] proposed metrics to identify the quality of dataset based on centroid positioning, pairwise boundary conditioning and equivalence partitioning. This work does not provide insight on the diversity in the dataset and does not address MR prioritization. Sun et al. [31] proposed a technique to generate good source test cases in metamorphic testing based on constraint solvers and techniques of symbolic execution. Also, a prioritization of source test cases was conducted based on path distances among test cases. This work does not focus on prioritization of metamorphic relations. Also, the work does not conduct prioritization of metamorphic relations for machine learning programs.

VI Threats to Validity

External Validity: Our proposed metric is applied to supervised ML classifiers. Metrics such as rule-based is not suitable for unsupervised classifiers and data-distribution metric can be applied only to data set with numerical values. We plan to design more metrics to cover unsupervised classifiers and non-numerical data. In this work, all the experiment subjects are implementation of machine learning algorithm used in the Weka ML library. Although they are popular algorithms and widely used, our findings may not be generalizable to commercial machine learning projects such as autonomous vehicles. To mitigate this threat, we plan to explore the effectiveness of our proposed approach on AI systems in the health care domain and Apollo autonomous driving platform projects in the future.

Internal Validity: We used dataset having an average of 500000 instances and 90 attributes on our subject programs. However, the dataset size could be small for generating the proposed metrics and prioritizing the MRs. We used less than 100 mutants for validating our prioritized MR ordering for CNN program. However, it is possible that the number of mutants used is low.

VII Conclusion

Metamorphic relations in metamorphic testing have multiple source and follow-up test cases. As a result, execution and training of machine learning model with large source and follow-up data set can exponentially increase the execution time and cost of metamorphic testing for ML applications. To overcome this problem, we developed four metrics based on diversity between the source and follow-up data set for prioritization of MRs.

We evaluated our proposed metrics on four open-source machine learning programs. The experimental results show that utilizing MR prioritization on ML applications can increase the fault detection effectiveness of a given MR set up to 33% for linear regression, 86% for CNN and up to 40% for Naive Bayes. In particular, using our proposed MR prioritization approaches reduced the average time taken to detect a fault compared to most common approach of randomly executing the MRs by 30% for KNN classifier implementation. Further, our results show that data diversity based metrics outperforms code coverage based approach for prioritizing MRs by 40%. Our proposed metrics also provided higher rate of fault detection when compared to random based prioritization. Our finding also indicates that the number of MRs needed to execute and test ML programs reduced by 66% which translates to saving cost and time for software testing practitioners.

References

[1] Anonymous, “Under submission. details omitted for double-blind reviewing.”
[2] S. Liao, “Chinese facial recognition system mistakes a face on a bus for a jaywalker.”
[3] C. Ziegler, “A google self-driving car caused a crash for the first time,” 2016.
[4] J. Glenday, “Ai-powered genderify platform shut down after bias-based backlash.”
[5] A. Ohnsman, “Lidar maker velodyne ‘baffled’by self-driving uber’s failure to avoid pedestrian,” 2018.
[6] E. T. Barr, M. Harman, P. McMinn, M. Shahbaz, and S. Yoo, “The oracle problem in software testing: A survey,” IEEE transactions on software engineering, vol. 41, no. 5, pp. 507–525, 2015.
[7] E. J. Weyuker, “On testing non-testable programs,” The Computer Journal, vol. 25, no. 4, pp. 465–470, 1982.
[8] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, “Machine learning testing: Survey, landscapes and horizons,” IEEE Transactions on Software Engineering, 2020.
[9] T. Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T. Tse, and Z. Q. Zhou, “Metamorphic testing: A review of challenges and opportunities,” ACM Computing Surveys (CSUR), vol. 51, no. 1, p. 4, 2018.
[10] S. Segura and Z. Q. Zhou, “Metamorphic testing 20 years later: A hands-on introduction,” in Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings. ACM, 2018, pp. 538–539.
[11] C. Murphy, G. E. Kaiser, and L. Hu, “Properties of machine learning applications for use in metamorphic testing,” 2008.
[12] X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, “Testing and validating machine learning classifiers by metamorphic testing,” Journal of Systems and Software, vol. 84, no. 4, pp. 544–558, 2011.
[13] X. Xie, J. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, “Application of metamorphic testing to supervised classifiers,” in 2009 Ninth International Conference on Quality Software. IEEE, 2009, pp. 135–144.
[14] J. E. Gewehr, M. Szugat, and R. Zimmer, “Bioweka—extending the weka framework for bioinformatics,” Bioinformatics, vol. 23, no. 5, pp. 651–653, 2007.
[15] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, “Deeproad: Gan-based metamorphic testing and input validation framework for autonomous driving systems,” in 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2018, pp. 132–142.
[16] Anonymous, “Under submission. details omitted for double-blind reviewing.”
[17] G. Gay, M. Staats, M. Whalen, and M. P. Heimdahl, “Automated oracle data selection support,” IEEE Transactions on Software Engineering, vol. 41, no. 11, pp. 1119–1137, 2015.
[18] S. Elbaum, A. G. Malishevsky, and G. Rothermel, “Test case prioritization: A family of empirical studies,” IEEE transactions on software engineering, vol. 28, no. 2, pp. 159–182, 2002.
[19] A. G. Malishevsky, J. R. Ruthruff, G. Rothermel, and S. Elbaum, “Cost-cognizant test case prioritization,” Citeseer, Tech. Rep., 2006.
[20] S. Elbaum, G. Rothermel, S. Kanduri, and A. G. Malishevsky, “Selecting a cost-effective test case prioritization technique,” Software Quality Journal, vol. 12, no. 3, pp. 185–210, 2004.
[21] D. W. Aha, D. Kibler, and M. K. Albert, “Instance-based learning algorithms,” Machine Learning, vol. 6, no. 1, pp. 37–66, Jan 1991. [Online]. Available: https://doi.org/10.1007/BF00153759
[22] Q.-H. Luu, M. F. Lau, S. P. Ng, and T. Y. Chen, “Testing multiple linear regression systems with metamorphic testing,” Journal of Systems and Software, vol. 182, p. 111062, 2021.
[23] J. Ding, X. Kang, and X.-H. Hu, “Validating a deep learning framework by metamorphic testing,” in 2017 IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET). IEEE, 2017, pp. 28–34.
[24] Y.-S. Ma, J. Offutt, and Y.-R. Kwon, “Mujava: a mutation system for java,” in Proceedings of the 28th international conference on Software engineering. ACM, 2006, pp. 827–830.
[25] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, “Pit: a practical mutation testing tool for java,” in Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 2016, pp. 449–452.
[26] R. Just, “The major mutation framework: Efficient and scalable mutation analysis for java,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, 2014, pp. 433–436.
[27] P. Good, Permutation tests: a practical guide to resampling methods for testing hypotheses. Springer Science & Business Media, 2013.
[28] J. Mayer and R. Guderlei, “An empirical study on the selection of good metamorphic relations,” in 30th Annual International Computer Software and Applications Conference (COMPSAC’06), vol. 1. IEEE, 2006, pp. 475–484.
[29] Z.-W. Hui, S. Huang, H. Li, J.-H. Liu, and L.-P. Rao, “Measurable metrics for qualitative guidelines of metamorphic relation,” in 2015 IEEE 39th Annual Computer Software and Applications Conference, vol. 3. IEEE, 2015, pp. 417–422.
[30] S. Mani, A. Sankaran, S. Tamilselvam, and A. Sethi, “Coverage testing of deep learning models using dataset characterization,” arXiv preprint arXiv:1911.07309, 2019.
[31] “Path-directed source test case generation and prioritization in metamorphic testing,” Journal of Systems and Software, vol. 183, p. 111091, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121221001886