
SoK: Machine Learning for Continuous Integration

Ali Kazemi Arani School of Computer Science
University of Adelaide
Adelaide, Australia
[email protected]
   Mansooreh Zahedi School of Computing and Information Systems
University of Melbourne
Melbourne, Australia
[email protected]
   Triet Huynh Minh Le School of Computer Science
University of Adelaide
Adelaide, Australia
[email protected]
   Muhammad Ali Babar School of Computer Science
University of Adelaide
Adelaide, Australia
[email protected]
Abstract

Continuous Integration (CI) has become a well-established software development practice for automatically and continuously integrating code changes during software development. An increasing number of Machine Learning (ML) based approaches for automating CI phases are being reported in the literature. It is therefore timely and relevant to provide a Systematization of Knowledge (SoK) of ML-based approaches for CI phases. This paper reports an SoK of different aspects of the use of ML for CI. Our systematic analysis also highlights the deficiencies of the existing ML-based solutions that can be addressed to advance the state-of-the-art.

Index Terms:
Continuous Integration, Machine Learning, Systematic Literature Review

I Introduction

In recent years, the software development industry has seen a significant shift towards the adoption of Continuous Integration (CI) practices. The CI process is a software development approach that aims to improve the speed and reliability of software delivery by continuously integrating code changes into a shared repository. The goal of the CI process is to catch and fix issues and bugs early in the development process before they become major problems that are harder and more expensive to resolve [1].

The growing popularity of CI and DevOps, along with the increasing volume and complexity of the data involved in these processes, has motivated researchers to propose Machine Learning (ML) based solutions for automating CI phases [1] and to take one more step toward enabling AIOps [2]. ML methods can support the fast feedback loop in CI by analysing development data, deployment log files, and data from the operating environment, and by making automated decisions [3]. These techniques can also efficiently predict the outcome of complex tasks. For example, ML methods can predict different types of software defects based on previous versions of code without running the current version [4]. Furthermore, ML-based solutions can provide accurate estimations, facilitate adaptation to frequent changes [5], and assist software engineers in timely decision-making [3]. These benefits would reduce human intervention and enhance the efficiency of CI services, which are essential requirements for cloud-based environments [6].

Given the variety of techniques employed in applying ML solutions to CI, and the growing interest in this domain, it is necessary to systematically identify the state-of-the-art practices used for automating CI tasks through ML methods. This body of knowledge can provide a valuable reference for practitioners and researchers to understand the potential of available ML techniques and make informed decisions when applying ML models in real-world CI environments. To the best of our knowledge, there is no existing Systematization of Knowledge (SoK) of the state-of-the-art ML practices for CI.

To bridge this gap, we review existing ML solutions that have been developed for CI phases in the last decade. Specifically, we analyze relevant scientific articles on the topic published in peer-reviewed venues. We first identify the CI phases that have been automated by ML. Moreover, we provide insights into key ML practices such as data/feature engineering, learning algorithms, and evaluation procedures, employed in these state-of-the-art ML solutions. Such knowledge can serve as a guide for practitioners to adopt/develop high-performing ML solutions to automate CI phases at scale. We also discuss areas for improving the utilization of the current ML solutions in real-world CI settings.

II Background


Figure 1: The key steps and concepts of Continuous Software Engineering. Notes: Rectangles: Steps. Two-way arrows: Automatic phases. Dashed arrows: Spanning multiple steps.

This review focuses on CI, which is the part of Continuous Software Engineering (CSE) that aims to provide quick and frequent feedback from customers’ experiences, as well as from software operations and maintenance [7]. As shown in Figure 1, CI is the first step of CSE and precedes the Continuous DElivery (CDE) and Continuous Deployment (CD) phases. Therefore, improving the performance of CI positively affects the whole CSE process. CI includes software building, testing, and validation [7]. This process involves multiple integrations within a day and requires frequent automated testing and prompt feedback to prevent issues from propagating to the delivery phase or affecting the development work of other team members [1]. Lastly, developers receive feedback on detected bugs or performance issues through the validation phase in CI and also by maintaining (monitoring in operations) the deployed software [8].

Most of the existing systematic review/mapping studies on the application of ML methods in CI have mainly focused on Test Case Prioritization (TCP) and Test Case Selection (TCS) methods. Pan et al. [9] investigated the feature sets and evaluation metrics employed in ML-based solutions for TCP and TCS in CI environments. Similarly, Lima et al. [10] focused only on reviewing TCP methods in CI and reported the types of ML approaches and the evaluation methods for the task. In contrast, our review did not limit the search to a specific solution in CI and instead investigated all ML-based methods in CI environments. This approach allows for a more comprehensive understanding of the application of ML in CI and of the techniques employed in the state-of-the-art methods in the field. The aim of this study is to provide valuable insights for researchers and practitioners and to identify areas where further research is needed to improve the effectiveness of ML-based solutions in CI.

III Methodology

We aim to provide a Systematization of Knowledge of the application of ML-based methods in CI by following guidelines for Systematic Literature Reviews (SLRs) [11]. We performed a rigorous and systematic search for relevant studies and used various criteria to ensure the quality of the studies included in our review. To achieve our aim, we investigated the following two Research Questions (RQs).

  • RQ1: What are the CI phases that have been addressed by ML-based solutions?

  • RQ2: What are the techniques employed in the development of state-of-the-art ML solutions for each CI phase?

The answers to these questions would help highlight the CI phases for which ML has shown potential and the current best practices for leveraging ML for automating CI phases.

III-A Study selection

We used Scopus (https://www.scopus.com/) as the main source for finding relevant studies because it includes many journals and conferences in Software Engineering and Computer Science that are relevant to CI-related studies [12, 13].

We then designed a search string to retrieve relevant studies on Scopus. The search string consisted of three segments: A) Machine Learning, Artificial Intelligence and associated synonyms; B) synonyms for CI, DevOps and CSE; and C) synonyms for “software”, “information systems”, “information technology”, “cloud”, and “service engineering”. This design ensured that our search was comprehensive and covered a wide range of relevant studies in the field. Combining these three segments helped us capture most of the relevant studies on the application of ML-based methods in CI and DevOps.

III-B Inclusion and Exclusion Criteria

To identify and remove irrelevant and low-quality studies, we defined the following inclusion criteria and selected papers accordingly.

  1. Research papers that were longer than four pages

  2. The full text of the papers was available in English

  3. The key topic of the papers was the application of ML-based methods in CI

Using these settings, we ran the search string on the Scopus indexing system and retrieved 1,662 studies. After applying the inclusion criteria and conducting snowballing to reduce the risk of missing studies [14], we obtained a final list of 44 related studies. Intuitively, the latest studies usually present state-of-the-art solutions; however, some of them might not include state-of-the-art ML solutions. To ensure relevance for RQ2, we thoroughly examined the full text of each paper to determine whether it presented a novel approach and demonstrated the superiority of its method over the current state-of-the-art in the relevant literature. This assists practitioners and researchers in determining the appropriate contexts for utilizing these ML-based methods to automate CI tasks.

III-C Data extraction and data synthesis

Based on SLR guidelines, we created a data extraction form to systematically extract information from selected studies to answer the RQs [11]. We used thematic analysis [15] to synthesize and analyze data for RQ1 and statistical analysis for RQ2. This approach allowed us to extract and organize data from selected studies, identify patterns and themes, and answer the RQs.

IV Results

IV-A RQ1: CI phases and ML applications

Based on reviewing and analyzing the selected papers, we identified five CI testing phases automated by ML. Figure 2 presents the application of ML-based methods to the five CI phases. The descriptions of these five identified phases based on their sequence in the CI pipeline are presented below.

Figure 2: Relation between the identified CI phases and the application of ML-based methods. Note: Numbers of published papers are presented in parentheses.

IV-A1 Unit Test (UT)

UT is the first step in the chain of events in the CI pipeline [16]. This test checks and validates the developed code in an isolated environment when a developer checks in (commits) new or modified compiled code and builds it [17]. To keep the master branch free from errors, developers commit all changes to a developers’ branch at this step [18]. Two out of the 44 selected studies presented ML-based solutions to reduce the risk of buggy software releases by predicting the outcomes of unit tests without explicitly running them. Lee et al. [19] proposed a method for predicting the outcomes of unit tests using ML and demonstrated its effectiveness in estimating the state of the alarms raised by static checkers. Vig et al. [20] proposed a method for estimating the required test effort by predicting the outcome of unit tests using ML. They compared different ML methods and feature sets to evaluate the effectiveness of their method.

IV-A2 Integration Test (IT)

IT is a CI step where the newly developed modules are integrated with the existing ones on the same branch after they have been validated through unit testing [21]. The purpose of IT is to verify that all the modules of a system or subsystem work together seamlessly. In this phase, ML techniques can aid in forecasting the outcomes of test cases [21] and in identifying the branch coverage of test cases, detecting bugs with fewer test runs without missing any changes that need to be tested [22].

IV-A3 Regression Test (RT)

After validating and testing the integration of new or modified software units with other software units, the current version needs to be tested entirely based on the previously designed test cases [23]. The goals of all 25 RT studies can be classified into “Test Optimization” studies that employ Test Case Prioritization (TCP) and Test Case Selection (TCS) strategies, “Defect Prediction” studies that predict the outcome of the test suites for committed code, and “Detecting Flaky Tests” studies. Flaky tests are tests that produce inconsistent pass or fail results and make the CI pipeline unreliable [24]. In TCP and TCS studies, the objective of ML models is to prioritize the test cases, or select a subset of them, that are more likely to uncover bugs and defects in the software under development [25]. This is achieved by analyzing previous changes in source code and the results of previous test runs to identify patterns that indicate a higher likelihood of uncovering defects.
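As a minimal illustration of this idea (a sketch, not the method of any reviewed paper), the snippet below trains a classifier on hypothetical historical test-run records and ranks test cases by their predicted probability of failure; the feature names and toy values are assumptions made for illustration only.

```python
# Hypothetical TCP sketch: rank test cases by predicted failure probability.
# Features and data are illustrative assumptions, not taken from a reviewed study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Historical runs: [recent_failure_rate, lines_changed_in_covered_code, days_since_last_failure]
X_history = np.array([
    [0.40, 120,  1],
    [0.05,  10, 30],
    [0.25,  60,  3],
    [0.00,   5, 90],
])
y_history = np.array([1, 0, 1, 0])  # 1 = the test failed in that run

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_history, y_history)

# Current CI cycle: the same features computed for the candidate test cases.
X_current = np.array([
    [0.10, 200,  2],
    [0.30,  15,  5],
    [0.02,   3, 60],
])
failure_prob = model.predict_proba(X_current)[:, 1]
ranked = np.argsort(-failure_prob)  # run the most failure-prone tests first
print("Prioritized test order:", ranked, "probabilities:", failure_prob[ranked])
```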

IV-A4 Build Validation (BV)

Based on the outcomes of the previous CI phases, developers can be relatively confident about the functionality and performance of the developed software units, so this version of the software is eligible for merging into the master branch. BV testing ensures the stability of the integrated code before releasing the software and handing it over to a system testing team [26]. Due to the high computational cost of building software products [27], predicting the build outcome is the main target of the ML-based solutions presented for the BV phase. Build prediction aims to reduce the resource usage and build time of software by predicting the outcome of the build process. The predictions are made by analyzing within-project or cross-project data [28]. Nine out of the 11 studies in this CI phase used batch data from the CI pipeline to train their models. However, the authors of [26] and [29] trained their ML models on a stream of data and on changes in data, respectively.
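To illustrate the difference between batch and stream-style training mentioned above, the sketch below incrementally updates a linear model build by build instead of retraining on the full history; the features and data are assumptions and this is not the approach of [26] or [29].

```python
# Illustrative stream-style training for build outcome prediction (not a reviewed method).
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # 0 = build passed, 1 = build failed

# Hypothetical stream of builds: [files_changed, lines_added, prior_failures_in_branch]
build_stream = [
    (np.array([[3,   40, 0]]), np.array([0])),
    (np.array([[12, 300, 2]]), np.array([1])),
    (np.array([[1,    5, 0]]), np.array([0])),
]

for X_build, y_build in build_stream:
    # Predict before updating (prequential style), once the model has seen some data.
    if hasattr(model, "coef_"):
        print("Predicted:", model.predict(X_build)[0], "actual:", y_build[0])
    model.partial_fit(X_build, y_build, classes=classes)  # incremental update per build
```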

IV-A5 System Test (ST)

In this step, all CI-related and quality aspects, including the performance, functionality, and compatibility of different software units, are tested. This phase can be performed by a portion of the users or by an internal system testing team [30]. For example, beta testers (users who receive the developed software before official releases) are asked to report their experience of using the fully integrated software product [31]. ML-based solutions in this phase target the discovery of installed software to ensure the compliance, security, and efficiency of the system under test [32], and the optimization of test suites for detecting performance defects in developed software products [31].

TABLE I: Employed techniques, data and characteristics of the state-of-the-art solutions.
Reference | CI phase - ML application | Data source - Data Type | Data Preparation | Feature Engineering | Learning algorithm(s) | Evaluation methods - Metrics
Lee et al. [19] | UT - Test Prediction | Samsung projects (Industrial) - Source code | Filtering - selecting source code lines | Tagging - word2vec | NN (Supervised) | Random K-fold - F1, Precision, Recall and Accuracy
Grano et al. [22] | IT - Branch Coverage Prediction | Google and Apache (OS) - Source code and Code MD | Not reported | Scaling - Normalization | RF, NN, SVM (Supervised) | Random K-fold - Mean Absolute Error (MAE)
Abdalkareem et al. [21] | IT - Test Prediction | GitHub (OS) - Commit MD | Balancing - Weka | Weighting - TF/IDF | DT (C4.5) (Supervised) | Random K-fold - F1, Precision, Recall and AUC
Yaraghi et al. [25] | RT - Test Optimization | GitHub (OS) - Build logs, Test code and Source code MD | Building data - Dependency graph of source code files and test cases | Scaling and Weighting - Normalization and TF/IDF | RF, SVM, XGBoost (Supervised) | Random K-fold - APFDc
Pan et al. [33] | RT - Defect Prediction | GitHub (OS) - Code MD and Test MD | Filtering - Based on last two outcomes of the test cases | Not reported | DT, RF, MLP, NB, SVM, LR (Supervised) | Sorted K-fold - F1, AUC, MCC and G-Measure
Parry et al. [24] | RT - Flaky Test Detection | Apache (OS) - Test MD and Test Code | Balancing - Tomek, SMOTE, Edited nearest neighbors | Scaling - Standardization | RF (Supervised) | Sorted K-fold - F1, Precision, Recall, FP, FN, and TP
Saidani et al. [34] | BV - Build Prediction | TravisTorrent (OS) - Source code MD and Test MD | Balancing - SMOTE | Not reported | LSTM, RNN (Supervised) | Sorted K-fold - F1, Precision, Recall, Accuracy and AUC
Byrne et al. [32] | ST - Installed Software Discovery | Simulated cloud environment - System logs | Filtering - Noise reduction | Tagging - Columbus | RL (Vowpal Wabbit) (Supervised) | Random K-fold - F1, Precision and Recall
Porres et al. [31] | ST - Performance Test Optimization | Simulated testing scenarios - Test MD | Not reported | Scaling - Normalization | DNN (Supervised) | Gradual evaluation - Positive predictive value (PPV)
Notes: OS = Open Source, MD = Meta Data, NN = Neural Network, DT = Decision Tree, RF = Random Forest, DNN = Deep Neural Network, RNN = Recurrent Neural Network, LR = Linear Regression, RL = Reinforcement Learning.

IV-B RQ2: State-of-the-art ML-based methods for automating the CI testing phases

Based on the results of RQ1, we identified the state-of-the-art ML solutions for each automated CI task. The list of selected papers, the properties of the collected data, and the techniques employed for training and evaluating the ML methods are analysed and presented in Table I. The key lessons learned across the different phases of developing state-of-the-art ML solutions are presented hereafter.

IV-B1 Data source and data preparation

According to the reviewed studies, the majority of the papers in the field of CI testing (24 out of 44) employed Open-Source (OS) projects. The popularity of OS projects is due to the availability of different types of OS project data, including source code, test code, and their metadata, as well as access to the data of cloud-based tools such as Travis CI and GitHub. For example, Pan et al. [33] evaluated their model on 242 OS projects, and Parry et al. [24] assessed their model on 26 open-source Python projects.

Additionally, Table I shows the different data preparation techniques used to ensure the quality of the input data for the ML models. A common first step in data preparation for ML is to balance the data using various re-sampling algorithms. These include methods such as those available in Weka [21], oversampling techniques like SMOTE that increase the number of instances in the minority class [35], and under-sampling techniques like Tomek links [36]. Researchers can also combine over- and under-sampling through the SMOGN technique [37]. In addition, the currently available data sets in the context of CI need to be cleaned by removing unrelated data [19] or noise [32]. Yaraghi et al. [25] employed a novel method for data preparation and created a dependency graph in which each node refers to a source file and each edge represents a dependency relation from the source node to the destination node, which they used to determine the code coverage of test cases.
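As a small, hedged example of the oversampling step discussed above, the SMOTE technique is available in the third-party imbalanced-learn library; the toy dataset below is synthetic and only illustrates the mechanics.

```python
# Illustrative use of SMOTE oversampling (imbalanced-learn) on a synthetic toy dataset.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 10% positive class (e.g., failing builds or tests).
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print("Before balancing:", Counter(y))

# SMOTE synthesizes new minority-class instances to balance the classes.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After balancing: ", Counter(y_res))
```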

IV-B2 Feature engineering

It is important to normalize the feature values in a dataset due to the diversity of their distributions and ranges. This can be done using statistical methods such as re-scaling the values to fit within a specific range, like [0,1] for normalization [22], standardizing by dividing the feature values by the standard deviation of all values [24], or applying techniques such as log transformation to bring values to the same order of magnitude and reduce the effect of extremely high or low values [38]. Scaling is commonly employed (17 out of 44 studies) in the context of CI due to the diversity of feature values and the prevalence of numerical features. Additionally, to convert text-based values, such as system logs [32] and source code [19], into input that ML methods can process, tagging and tokenization techniques are often used (9 out of 44 studies). The word2vec [19] and Columbus [32] methods are employed in the identified state-of-the-art studies for extracting tags from text. TF-IDF (term frequency-inverse document frequency) is a technique used to weigh the importance of words in a normalized way [21]. It can help address imbalanced term distributions in text data by giving more weight to terms that appear less frequently.
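The scikit-learn snippet below sketches the kinds of transformations discussed above (min-max normalization, standardization, log transformation, and TF-IDF weighting) on toy inputs; it mirrors the general techniques rather than any specific reviewed pipeline, and the example feature values and documents are assumptions.

```python
# Sketch of common feature-engineering steps on toy inputs (not from a specific study).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Numerical features with very different ranges (e.g., lines changed vs. test duration).
X_num = np.array([[1200, 0.5], [80, 3.2], [15000, 1.1]])
X_minmax = MinMaxScaler().fit_transform(X_num)   # rescale each feature into [0, 1]
X_std = StandardScaler().fit_transform(X_num)    # zero mean, unit variance per feature
X_log = np.log1p(X_num)                          # dampen extremely large values

# Text-based inputs (e.g., commit messages or log lines) weighted with TF-IDF.
docs = ["fix flaky integration test", "update build script", "fix build timeout"]
X_tfidf = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix

print(X_minmax.shape, X_std.shape, X_log.shape, X_tfidf.shape)
```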

IV-B3 Learning algorithms

As shown in Table I, all of the current state-of-the-art studies utilize supervised learning methods. It should also be noted that all papers published between 2020 and the time of writing have employed supervised ML algorithms. This trend suggests that supervised methods have become more popular among researchers in recent years. The popularity of supervised learning can be attributed to the acceptable accuracy of the models and the availability of labelled data from automated tools in CI environments.

Table I also shows that tree-based algorithms such as Decision Trees and Random Forests have commonly been employed as classification methods in state-of-the-art solutions. Tree-based algorithms have been used in 19 of the 44 selected studies. The reasons for their popularity are threefold. First, a CI environment continuously produces a huge volume of data, and ML models need to be retrained frequently [25]. Since training tree-based algorithms requires low computational resources, they are suitable for practical CI environments [39]. Second, the performance of tree-based algorithms is generally high when classifying unseen data [26]. For instance, for automating the TCP task in CI, Yaraghi et al. [25] compared their tree-based (XGBoost) model with BERT [40], a state-of-the-art language representation model, and found that their method achieved results similar to BERT's while being considerably cheaper to train. Third, tree-based algorithms can be well interpreted and are thus understandable to humans [41].
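As a brief illustration of why tree-based models are attractive in this setting (cheap to train and relatively interpretable), the sketch below fits a random forest on synthetic data and inspects its feature importances; the data and feature names are hypothetical.

```python
# Illustrative training of a tree-based model and inspection of feature importances.
# Synthetic data; the feature names are hypothetical examples of CI-related signals.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["lines_changed", "num_prior_failures", "test_duration", "files_touched"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")  # interpretable view of what drives predictions
```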

According to Table I, besides tree-based algorithms, Neural Network (NN) algorithms, including Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN), and Deep Neural Network (DNN), have also been employed in six studies. One of the main reasons for the popularity of NN algorithms, despite their need for high computational resources during training, is their ability to automatically extract features from data sets and make accurate predictions [31].

TABLE II: Commonly employed performance measures, and their descriptions and formulas.
Measure | Description | Formula
Precision | The percentage of the detected positive instances that were correct | $\frac{TP}{TP+FP}$
Recall | The proportion of positive instances that were correctly identified | $\frac{TP}{TP+FN}$
F1-score | The harmonic mean of recall and precision | $\frac{2\times Precision\times Recall}{Precision+Recall}$
Accuracy | The percentage of correctly classified instances | $\frac{TP+TN}{TP+FP+FN+TN}$
Notes: TP = True Positives, TN = True Negatives, FP = False Positives, FN = False Negatives.

IV-B4 Evaluation methods

The K-fold method is widely used for evaluating ML models. It involves dividing the data into K equal segments (folds); in each of the K evaluation rounds, the model is trained on K-1 segments of the data and evaluated on the remaining unseen segment. Among the selected studies, 19 out of 44 used this technique. In general, evaluation methods can be classified into random and sorted sampling. Under random sampling, each row of data has the same probability of being selected for the training, testing, or holdout data set, whereas in sorted sampling the testing set must be newer than the training data set. Accordingly, in sorted K-fold methods, the folds are ordered chronologically and the unseen testing fold is always newer than the folds used for training.
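The difference between the two sampling schemes can be sketched with scikit-learn's splitters: KFold shuffles rows regardless of time, while TimeSeriesSplit always tests on rows that come after the training rows, which serves here as a reasonable proxy for the "sorted" setting described above (an illustrative assumption, not the exact procedure of the reviewed studies).

```python
# Random vs. chronologically sorted cross-validation splits (illustrative).
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # rows assumed ordered oldest -> newest

print("Random K-fold splits:")
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    print("  train:", train_idx, "test:", test_idx)  # test rows may predate training rows

print("Sorted (chronological) splits:")
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    print("  train:", train_idx, "test:", test_idx)  # test rows always follow training rows
```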

F1-score, precision, and recall are the most commonly used evaluation measures in the literature, utilized by 16, 20, and 23 studies, respectively. Their descriptions and formulas are presented in Table II. AUC (“Area Under the ROC Curve”) has also been used, where the ROC curve plots the True Positive Rate (TPR, i.e., Recall) on the y-axis against the False Positive Rate ($\frac{FP}{FP+TN}$) on the x-axis.
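The measures in Table II, together with AUC, can be computed directly from predictions, for example with scikit-learn; the toy labels and scores below are made up for illustration.

```python
# Computing the common evaluation measures on toy predictions (illustrative values).
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                      # ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                      # hard class predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]     # predicted probabilities for AUC

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))
```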

The evaluation of efficiency (cost) has received significantly less attention compared to measures of effectiveness. Specifically, only Yaraghi et al. [25] reported a cost-related evaluation metric, known as the Cost-cognizant Average Percentage of Faults Detected ($APFD_c$). Malishevsky et al. [42] presented a version of $APFD_c$ that combines the cost of detection and the severity of faults, as presented in Eq. (1). In this metric, $n$ is the number of test cases in test suite $T$ with costs $t_i$, $m$ is the number of detected faults, $f_i$ is the severity of fault $i$, and $TF_i$ is the first test case that reveals fault $i$.

$$APFD_{c}=\frac{\sum_{i=1}^{m}\left(f_{i}\times\left(\sum_{j=TF_{i}}^{n}t_{j}-\frac{1}{2}t_{TF_{i}}\right)\right)}{\sum_{i=1}^{n}t_{i}\times\sum_{i=1}^{m}f_{i}} \quad (1)$$
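A direct, unoptimized implementation of Eq. (1) might look as follows; the test costs, fault severities, and fault-revealing positions are made-up values used only to exercise the formula.

```python
# Direct implementation of Eq. (1), cost-cognizant APFD; inputs are illustrative.
def apfd_c(costs, severities, first_detecting_test):
    """costs: t_1..t_n for the prioritized test cases;
    severities: f_1..f_m for the detected faults;
    first_detecting_test: TF_1..TF_m, 1-based position of the first test revealing each fault."""
    n = len(costs)
    numerator = sum(
        f * (sum(costs[tf - 1:n]) - 0.5 * costs[tf - 1])
        for f, tf in zip(severities, first_detecting_test)
    )
    denominator = sum(costs) * sum(severities)
    return numerator / denominator

# Example: 4 prioritized test cases, 2 faults first revealed by tests 1 and 3.
print(apfd_c(costs=[2.0, 1.0, 3.0, 1.0], severities=[1.0, 2.0], first_detecting_test=[1, 3]))
```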

V Conclusion and Future Opportunities

In this study, we distilled the CI phases automated by ML and the respective state-of-the-art ML solutions in this emerging area. We found data from multiple open-source projects were commonly considered for developing ML solutions. Additionally, preprocessing techniques such as filtering and balancing input data and scaling feature values were used to improve the model performance. The use of tree-based algorithms was common among the selected studies, as they are capable of extracting patterns from large-scale CI data with low computational overhead. Furthermore, we highlighted a variety of evaluation methods and measures employed, to facilitate the comparison between future work and previous studies. Despite the promising results of the current ML solutions, we identify three key challenges that may hinder the real-world adoption of these solutions and suggest potential directions to address these challenges.

Realistic evaluation: Despite the known limitations of random-based evaluation methods, these techniques have been utilized in a notable number of studies in the field: five of the state-of-the-art studies and 18 of the 44 reviewed studies employed them. Random-based evaluation may not be reliable in real-world settings because it uses future data for training the model, unlike the chronological way data are generated in CI environments, which undermines the robustness and generalizability of the results. Thus, ML models must be trained only on current data and evaluated on new and unseen data to ensure a realistic evaluation of model performance when deployed in real-world environments [43].

Cost benchmark: In addition to reporting performance and effectiveness measures, there is a need to report cost measures, such as the time to train a model and the time to perform prediction/classification. Among all the selected state-of-the-art studies, cost-based metrics have only been used in [25]; however, the authors did not report the training and prediction cost/time of their models. A full cost-benefit analysis can be conducted by combining performance and cost measures. Such an analysis can provide more detail about the potential trade-offs of adopting the presented models in industrial environments.

Under-explored CI areas: According to the reviewed studies, the application of ML methods in the CI testing environment is limited to nine areas, as shown in Figure 2. A recent review of predictive models in software engineering by Yang et al. [44] highlights the potential for applying ML-based methods in various software engineering testing areas. However, several of these testing areas remain under-explored in the CI literature. These include crash prediction, erroneous behaviour detection, bug severity classification, test report classification and assessment, fault injection, bug report management tasks (including bug report assignment, prediction, and classification), software quality and reliability assessment, vulnerability and malware detection, code smell detection, traceability detection methods for bugs, and code clone detection. It is still unclear to what extent these areas and their respective ML solutions can be adapted to CI. Thus, researchers and practitioners should also explore these missing areas, in addition to the current ones, to maximize the utilization of ML for CI automation.

Acknowledgement

We acknowledge the contribution of Dr Zohaib Md. Jan and Roshan Namal Rajapakse during the first phase of collecting and analysing data in this study.

References

  • [1] M. Shahin, M. A. Babar, and L. Zhu, “Continuous integration, delivery and deployment: a systematic review on approaches, tools, challenges and practices,” IEEE Access, vol. 5, pp. 3909–3943, 2017.
  • [2] A. Sen, “Devops, devsecops, aiops-paradigms to it operations,” in Evolving technologies for computing, communication and smart world, pp. 211–221, Springer, 2021.
  • [3] I. Figalist, A. Biesdorf, C. Brand, S. Feld, and M. Kiermeier, “Supporting the devops feedback loop using unsupervised machine learning,” in 2019 IEEE International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), pp. 1–6, IEEE, 2019.
  • [4] M. Kawalerowicz and L. Madeyski, “Continuous build outcome prediction: A small-n experiment in settings of a real software project,” in International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 412–425, Springer, 2021.
  • [5] P. Pospieszny, “Software estimation: towards prescriptive analytics,” in Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement, pp. 221–226, 2017.
  • [6] B. P. Rimal, A. Jukan, D. Katsaros, and Y. Goeleven, “Architectural requirements for cloud computing systems: an enterprise cloud approach,” Journal of Grid Computing, vol. 9, pp. 3–26, 2011.
  • [7] B. Fitzgerald and K.-J. Stol, “Continuous software engineering: A roadmap and agenda,” Journal of Systems and Software, vol. 123, pp. 176–189, 2017.
  • [8] P. Debois et al., “Devops: A software revolution in the making,” Journal of Information Technology Management, vol. 24, no. 8, pp. 3–39, 2011.
  • [9] R. Pan, M. Bagherzadeh, T. A. Ghaleb, and L. Briand, “Test case selection and prioritization using machine learning: a systematic literature review,” Empirical Software Engineering, vol. 27, no. 2, pp. 1–43, 2022.
  • [10] J. A. P. Lima and S. R. Vergilio, “Test case prioritization in continuous integration environments: A systematic mapping study,” Information and Software Technology, vol. 121, p. 106268, 2020.
  • [11] B. A. Kitchenham and S. Charters, “Guidelines for performing systematic literature reviews in software engineering technical report,” Software Engineering Group, EBSE Technical Report, Keele University and Department of Computer Science University of Durham, vol. 2, 2007.
  • [12] B. Kitchenham, R. Pretorius, D. Budgen, O. P. Brereton, M. Turner, M. Niazi, and S. Linkman, “Systematic literature reviews in software engineering–a tertiary study,” Information and software technology, vol. 52, no. 8, pp. 792–805, 2010.
  • [13] M. Daneva, D. Damian, A. Marchetto, and O. Pastor, “Empirical research methodologies and studies in requirements engineering: How far did we come?,” Journal of systems and software, vol. 95, pp. 1–9, 2014.
  • [14] C. Wohlin, “Guidelines for snowballing in systematic literature studies and a replication in software engineering,” in Proceedings of the 18th international conference on evaluation and assessment in software engineering, pp. 1–10, 2014.
  • [15] V. Braun and V. Clarke, “Using thematic analysis in psychology,” Qualitative research in psychology, vol. 3, no. 2, pp. 77–101, 2006.
  • [16] S. Stolberg, “Enabling agile testing through continuous integration,” in 2009 agile conference, pp. 369–374, IEEE, 2009.
  • [17] Y. Zhou, H. Leung, Q. Song, J. Zhao, H. Lu, L. Chen, and B. Xu, “An in-depth investigation into the relationships between structural metrics and unit testability in object-oriented systems,” Science china information sciences, vol. 55, no. 12, pp. 2800–2815, 2012.
  • [18] R. Martins, R. Abreu, M. Lopes, and J. Nadkarni, “Supervised learning for test suit selection in continuous integration,” in 2021 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 239–246, IEEE, 2021.
  • [19] S. Lee, S. Hong, J. Yi, T. Kim, C.-J. Kim, and S. Yoo, “Classifying false positive static checker alarms in continuous integration using convolutional neural networks,” in 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST), pp. 391–401, IEEE, 2019.
  • [20] V. Vig and A. Kaur, “Test effort estimation and prediction of traditional and rapid release models using machine learning algorithms,” Journal of Intelligent & Fuzzy Systems, vol. 35, no. 2, pp. 1657–1669, 2018.
  • [21] R. Abdalkareem, S. Mujahid, and E. Shihab, “A machine learning approach to improve the detection of ci skip commits,” IEEE Transactions on Software Engineering, 2020.
  • [22] G. Grano, T. V. Titov, S. Panichella, and H. C. Gall, “Branch coverage prediction in automated testing,” Journal of Software: Evolution and Process, vol. 31, no. 9, p. e2158, 2019.
  • [23] S. Ali, Y. Hafeez, S. Hussain, and S. Yang, “Enhanced regression testing technique for agile software development and continuous integration strategies,” Software Quality Journal, pp. 1–27, 2019.
  • [24] O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, “Evaluating features for machine learning detection of order-and non-order-dependent flaky tests,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST), pp. 93–104, IEEE, 2022.
  • [25] A. S. Yaraghi, M. Bagherzadeh, N. Kahani, and L. Briand, “Scalable and accurate test case prioritization in continuous integration contexts,” IEEE Transactions on Software Engineering, 2022.
  • [26] J. Finlay, R. Pears, and A. M. Connor, “Data stream mining for predicting software build outcomes using source code metrics,” Information and Software Technology, vol. 56, no. 2, pp. 183–198, 2014.
  • [27] Z. Xie and M. Li, “Cutting the software building efforts in continuous integration by semi-supervised online auc optimization.,” in IJCAI, pp. 2875–2881, 2018.
  • [28] J. Xia, Y. Li, and C. Wang, “An empirical study on the cross-project predictability of continuous integration outcomes,” in 2017 14th Web Information Systems and Applications Conference (WISA), pp. 234–239, IEEE, 2017.
  • [29] F. Hassan and X. Wang, “Change-aware build prediction model for stall avoidance in continuous integration,” in 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 157–162, IEEE, 2017.
  • [30] J. Vemulapati, A. S. Khastgir, and C. Savalgi, “Ai based performance benchmarking & analysis of big data and cloud powered applications: An in depth view,” in Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering, pp. 103–109, 2019.
  • [31] I. Porres, T. Ahmad, H. Rexha, S. Lafond, and D. Truscan, “Automatic exploratory performance testing using a discriminator neural network,” in 2020 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 105–113, IEEE, 2020.
  • [32] A. Byrne, E. Ates, A. Turk, V. Pchelin, S. S. Duri, S. Nadgowda, C. Isci, and A. Coskun, “Praxi: Cloud software discovery that learns from practice,” IEEE Transactions on Cloud Computing, 2020.
  • [33] C. Pan and M. Pradel, “Continuous test suite failure prediction,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 553–565, 2021.
  • [34] I. Saidani, A. Ouni, and M. W. Mkaouer, “Improving the prediction of continuous integration build failures using deep learning,” Automated Software Engineering, vol. 29, no. 1, pp. 1–61, 2022.
  • [35] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
  • [36] S. Dalla Palma, D. Di Nucci, F. Palomba, and D. A. Tamburri, “Within-project defect prediction of infrastructure-as-code using product and process metrics,” IEEE Transactions on Software Engineering, pp. 1–1, 2021.
  • [37] A. Sharif, D. Marijan, and M. Liaaen, “Deeporder: Deep learning for test case prioritization in continuous integration testing,” arXiv preprint arXiv:2110.07443, 2021.
  • [38] G. Grano, T. V. Titov, S. Panichella, and H. C. Gall, “How high will it be? using machine learning models to predict branch coverage in automated testing,” in 2018 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), pp. 19–24, IEEE, 2018.
  • [39] K. W. Al-Sabbagh, R. Hebig, and M. Staron, “The effect of class noise on continuous test case selection: A controlled experiment on industrial data,” in International Conference on Product-Focused Software Process Improvement, pp. 287–303, Springer, 2020.
  • [40] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [41] I. H. Witten and E. Frank, “Data mining: practical machine learning tools and techniques with java implementations,” Acm Sigmod Record, vol. 31, no. 1, pp. 76–77, 2002.
  • [42] A. G. Malishevsky, J. R. Ruthruff, G. Rothermel, and S. Elbaum, “Cost-cognizant test case prioritization,” tech. rep., Technical Report TR-UNL-CSE-2006-0004, University of Nebraska-Lincoln, 2006.
  • [43] D. Elsner, F. Hauer, A. Pretschner, and S. Reimer, “Empirically evaluating readily available information for regression test optimization in continuous integration,” in Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 491–504, 2021.
  • [44] Y. Yang, X. Xia, D. Lo, T. Bi, J. Grundy, and X. Yang, “Predictive models in software engineering: Challenges and opportunities,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 3, pp. 1–72, 2022.