
Automating Quantum Software Maintenance: Flakiness Detection and Root Cause Analysis

Janakan Sivaloganathan*, Department of Computer Science, Toronto Metropolitan University, Toronto, Canada
Ainaz Jamshidi*, Department of Information Systems, University of Maryland, Baltimore County, USA
Andriy Miranskyy, Department of Computer Science, Toronto Metropolitan University, Toronto, Canada
Lei Zhang, Department of Information Systems, University of Maryland, Baltimore County, USA
*The first two coauthors have made equal contributions to this work.
Abstract

Flaky tests, which pass or fail inconsistently without code changes, are a major challenge in software engineering in general and in quantum software engineering in particular, owing to the complexity and probabilistic nature of quantum computations; they hide genuine issues and waste developer effort.

We aim to create an automated framework to detect flaky tests in quantum software and an extended dataset of quantum flaky tests, overcoming the limitations of manual methods.

Building on prior manual analysis of 14 quantum software repositories, we expanded the dataset and automated flaky test detection using transformers and cosine similarity. We conducted experiments with Large Language Models (LLMs) from the OpenAI GPT and Meta LLaMA families to assess their ability to detect and classify flaky tests from code and issue descriptions.

Embedding transformers proved effective: we identified 25 new flaky tests, expanding the dataset by 54%. Top LLMs achieved an F1-score of 0.8871 for flakiness detection but only 0.5839 for root cause identification.

We introduced an automated flaky test detection framework using machine learning, showing promising results but highlighting the need for improved root cause detection and classification in large quantum codebases. Future work will focus on improving detection techniques and developing automatic flaky test fixes.

I Introduction

Flaky tests, which exhibit non-deterministic behavior by failing or passing inconsistently without any changes to the code under test, pose significant challenges for software maintenance and reliability [1]. In the field of quantum software engineering, flaky tests are particularly problematic due to the inherent complexities and probabilistic nature of quantum computations. They can obscure genuine issues, waste developers’ time, and undermine confidence in test suites [2, 3].

In previous work, Zhang et al. [2] explore the code and bug-tracking repositories of 14 quantum software projects and identify 46 unique quantum flaky tests in 12 of those projects (ranging from 0.26% to 1.85% of bug reports). They search issue reports (IRs) and pull requests (PRs) for 10 keywords related to flaky tests (e.g., flaky and flakiness) to identify flaky tests. They then identify and categorize eight types of flakiness and seven common fixes. Randomness is the most common cause of quantum flakiness, and the most common solution is to fix pseudo-random number generator (PRNG) seeds. However, the findings (46 instances of flakiness) are constrained by the limitations inherent in this vocabulary-based method. In addition, all flaky tests are examined and identified manually, which is time-consuming.

Our research explores a more effective and efficient flaky test detection technique for quantum software by answering the following research questions.

  • RQ1: How can we detect if a given IR or PR is related to a flaky test?

  • RQ2: How can we detect if a given IR or PR is related to a flaky test with additional code context?

  • RQ3: How can we identify the root cause of a flaky test?

Our main contributions are as follows.

  • We have enriched the existing dataset of flaky tests [2] by adding more observations, as well as including the buggy code causing flakiness and the corresponding fixes, which were absent from the original dataset. The extended dataset is publicly available at: https://doi.org/10.5281/zenodo.13913775.

  • We developed a method to semi-automatically detect flaky test-related IRs and PRs by mining software repositories using embedding transformers and cosine similarity.

  • We propose a method to automatically detect flaky issues using LLMs with zero-shot prompting.

By automating the detection of flaky tests, our framework aims to improve the reliability and maintainability of quantum software systems.

II Method

II-A Dataset

Our study builds upon a prior manual analysis by Zhang et al. [2], who examined 14 open-source quantum software repositories from platforms such as IBM Qiskit, Microsoft Quantum Development Kit, TensorFlow Quantum, and NetKet. They identified 46 flaky test reports across 12 repositories by searching closed GitHub issues using keywords related to flakiness (e.g., flaky, intermittent, nondeterministic).

II-A1 Using transformers and cosine similarity for flaky test detection

To extend this research and detect more flaky reports systematically and automatically, we employed embedding transformers to represent GitHub IRs and PRs and measured the cosine similarity with previously identified flaky tests.

Based on the Hugging Face leaderboard [4], we selected three embedding models that were top performers on generic tasks at the time of experiment design. We utilized the pre-trained ‘mixedbread-ai/mxbai-embed-large-v1’ transformer [5, 6] from the Hugging Face library [7] to extract contextual embeddings of the GitHub IRs and PRs; the embeddings were derived from the penultimate layer of the model, ensuring that they captured the nuanced features of the text. We also evaluated the other two transformers, ‘SFR-Embedding-Mistral’ [8] and ‘e5-mistral-7b-instruct’ [9, 10], using k-means clustering [11] to assess their effectiveness in distinguishing between flaky and non-flaky test cases. Our analysis showed that the ‘mixedbread-ai/mxbai-embed-large-v1’ model provided the most distinct and separable representation for this task.
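The separability check can be illustrated with a short k-means sketch. This is a minimal illustration rather than the exact analysis script used in the study; it assumes that the embeddings of labeled examples are already available as a NumPy array together with their flaky/non-flaky labels, and the helper name `separability` is ours.

```python
# Minimal sketch of a k-means separability check for a candidate embedding model.
# Assumes `embeddings` is an (n x d) NumPy array and `labels` marks known
# flaky (1) vs. non-flaky (0) cases.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def separability(embeddings: np.ndarray, labels: np.ndarray, seed: int = 0) -> dict:
    """Cluster embeddings into two groups and measure how well they separate."""
    clusters = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(embeddings)
    return {
        "adjusted_rand": adjusted_rand_score(labels, clusters),  # agreement with labels
        "silhouette": silhouette_score(embeddings, clusters),    # cluster separation
    }
```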

Using this model, we generated embeddings from the tokenized text of all scraped GitHub IRs and PRs from the 12 repositories, as well as from the previously identified flaky cases. We calculated cosine similarity scores between the embeddings and ranked the issues based on their similarity to known flaky cases. Excluding identical matches from our initial flaky set, at least two authors cross-examined the top-ranked issues, associated PRs, and code commits to determine if they were related to flaky tests and to establish their cause categories.
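A simplified sketch of this ranking step is shown below. It uses the sentence-transformers wrapper for ‘mixedbread-ai/mxbai-embed-large-v1’ rather than the raw penultimate-layer extraction described above, and the variable names (`issue_texts`, `known_flaky_texts`) are illustrative.

```python
# Sketch of the similarity-based ranking: embed scraped issue/PR texts, compare
# them with embeddings of known flaky cases, and surface the closest matches
# for manual cross-examination.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

def rank_candidates(issue_texts, known_flaky_texts, top_k=20):
    issue_emb = model.encode(issue_texts, convert_to_tensor=True, normalize_embeddings=True)
    flaky_emb = model.encode(known_flaky_texts, convert_to_tensor=True, normalize_embeddings=True)
    # For each candidate issue, keep its best similarity to any known flaky case.
    best_sim = util.cos_sim(issue_emb, flaky_emb).max(dim=1).values
    ranked = sorted(zip(issue_texts, best_sim.tolist()), key=lambda x: x[1], reverse=True)
    # In our labeling, candidates scoring below 0.5 were treated as non-flaky (Section II-A1).
    return ranked[:top_k]
```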

In the first iteration, we identified 15 new flaky tests from the top-ranked cosine similarity scores. We then augmented the original dataset with these new cases and repeated the process, resulting in the detection of an additional 10 flaky tests in the second iteration. In total, we identified 25 new flaky tests across the 12 repositories, increasing the original dataset size by 54%.

To create a balanced dataset, we also collected non-flaky tests from GitHub reports during the cosine similarity analysis. Test cases with a cosine similarity score of less than 0.5 were labeled as non-flaky. Additionally, we included instances that our method incorrectly labeled as flaky to further challenge the classification task. In total, we compiled 71 non-flaky cases to match the size of our expanded flaky dataset.

II-A2 Dataset preparation

We extracted descriptions and comments from IRs and PRs for each issue observation in the dataset, recorded code differences, and noted affected files before and after fixes. Method-level code changes were also extracted for a more concentrated analysis.

Manual verification ensured that the extracted artifacts matched the identified IRs and PRs. The dataset was organized into ‘full’ and ‘method’ directories, each further divided into ‘flaky’ and ‘non-flaky’ sections.

II-B Detecting Flaky Tests with LLMs

Figure 1: Architecture of the automated framework for quantum flakiness detection.

Our experiments demonstrate an automated LLM-based framework (see Figure 1) that efficiently gathers resources, configures inputs, and classifies bugs by streamlining the standard software development workflow with GitHub as the version control system. This framework simulates a typical software engineering process for bug resolution.

II-B1 LLM Inference Configuration

Leveraging the extracted codebase data (discussed in Section II-A2), we crafted input prompts to address our research questions. The prompts are provided in the supplementary material (https://doi.org/10.5281/zenodo.13913775).

To explore our research questions and assess how context size affects the answers, we designed the following experiments.

For RQ1, which aims to classify whether a particular IR (or PR, if no issue is associated with it) is flaky or non-flaky, we tested two levels of context: $R_p$ (partial), which includes only the initial IR (or PR) description, and $R_f$ (full), which includes the description along with all associated comments.

For RQ2, we expanded the context for the language model by adding the code involved in the PR before the fix was applied, also at two levels: $C_p$ (partial), which includes the method-level code, and $C_f$ (full), which provides the complete code listing.

By combining the context levels from RQ1 and RQ2, we generated four experimental conditions: $\{R_p, R_f\} \times \{C_p, C_f\}$. These conditions range from $(R_p, C_p)$, which uses only the description and method-level code, to $(R_f, C_f)$, which includes the description with comments and the full code listing.
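The four conditions can be assembled mechanically from the dataset artifacts. The sketch below is illustrative only; the record field names (`description`, `comments`, `method_level_code`, `full_code_listing`) are hypothetical rather than taken from the released dataset.

```python
# Illustrative assembly of the four experimental conditions {R_p, R_f} x {C_p, C_f}.
from itertools import product

def build_contexts(record: dict) -> dict:
    reports = {
        "R_p": record["description"],  # initial IR/PR description only
        "R_f": record["description"] + "\n" + "\n".join(record["comments"]),  # plus all comments
    }
    code = {
        "C_p": record["method_level_code"],  # methods touched by the fix, pre-fix version
        "C_f": record["full_code_listing"],  # complete pre-fix code listing
    }
    return {
        (r, c): f"{reports[r]}\n\n--- Code before fix ---\n{code[c]}"
        for r, c in product(reports, code)
    }
```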

For RQ3, the amount of information provided did not change; we simply followed up by asking which specific root cause a particular flaky test relates to, using the nine classes of root causes defined by [2]: “Randomness (PRNG),” “Floating Point Operations,” “Software Environment,” “Multi-threading,” “Visualization,” “Unhandled Exceptions,” “Network,” “Unordered Collection,” or “Others”. We find that we do not need to alter these classes for our extended dataset.

We utilized LangChain [12], an integration framework that abstracts prompt templating, retrieval strategies, and chaining, allowing us to manage conversational memory effectively. This setup enabled us to simulate a developer-to-AI interaction, appending additional context to study its impact on the LLM’s reasoning.
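A hedged sketch of this interaction pattern, assuming the langchain-openai bindings, is shown below. The prompt wording, model name, and helper function are illustrative and are not the exact prompts from our supplementary material.

```python
# Sketch of the two-turn developer-to-AI interaction: a flakiness verdict
# (RQ1/RQ2) followed by a root-cause follow-up (RQ3) that reuses the history.
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

ROOT_CAUSES = ("Randomness (PRNG), Floating Point Operations, Software Environment, "
               "Multi-threading, Visualization, Unhandled Exceptions, Network, "
               "Unordered Collection, Others")

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def classify(context_text: str) -> tuple[str, str]:
    history = [
        SystemMessage(content="You triage test reports for quantum software projects."),
        HumanMessage(content="Does the following report describe a flaky test? "
                             f"Answer 'flaky' or 'non-flaky'.\n\n{context_text}"),
    ]
    verdict = llm.invoke(history)          # RQ1/RQ2 answer
    history += [verdict,
                HumanMessage(content=f"If flaky, which root cause applies: {ROOT_CAUSES}?")]
    root_cause = llm.invoke(history)       # RQ3 answer, with conversational memory
    return verdict.content, root_cause.content
```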

II-B2 LLMs under study

To assess the performance of various LLMs, we study two open-source LLMs, namely Meta LLaMA-70B-Instruct v.3.1 and LLaMA-405B-Instruct v.3.1 [13], and two closed-source LLMs, namely OpenAI GPT-4o-mini v.2024-10-03 and GPT-4o v.2024-10-07 [14, 15]. All OpenAI and Meta Instruct models were accessed remotely through the serverless APIs of OpenAI [16] and Google Vertex AI [17], respectively. We also fine-tuned CodeBERT [18] for flaky test detection using a few-shot learning approach; while training was successful, the model failed to generalize to the test set of GitHub IRs and PRs.

II-C Limitations

II-C1 Repository Differences

In cases where the issue and fix resided in different repositories, we manually aligned the IRs with the fix repositories to maintain consistency in our analysis.

II-C2 Multiple Pull Requests

In some cases, a flaky bug has two PRs associated with it. For the four such observations in the flaky data and two in the non-flaky data, we split each observation into two instances, so that an issue addressed by two PRs appears once for each of the corresponding codebases.

II-C3 Partial Code Extraction (Method-Level)

The differences between methods were automatically extracted using PyDriller [19], which operates only on Python code. Given that Python dominates our dataset (12 out of 14 repositories are written primarily in Python), this limitation is not a significant concern.

However, even in Python-focused datasets, not all repositories contain method-level data. For example, some fixes might target configuration files or global variables within Python scripts. Therefore, we capture and report the total number of observations for each experimental setup to account for such cases.
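For illustration, a minimal PyDriller sketch of this method-level extraction is shown below; the repository URL and commit hash arguments are placeholders.

```python
# Minimal sketch of method-level extraction with PyDriller. Only Python files
# are parsed for method information, which is the limitation discussed above.
from pydriller import Repository

def changed_methods(repo_url: str, fix_commit: str) -> dict:
    """Map each modified Python file in the fix commit to its changed methods."""
    result = {}
    for commit in Repository(repo_url, single=fix_commit).traverse_commits():
        for mf in commit.modified_files:
            if mf.filename.endswith(".py") and mf.changed_methods:
                result[mf.filename] = sorted(m.name for m in mf.changed_methods)
    return result
```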

III Results and Analysis

TABLE I: Model Performance Comparison

Model | Context | F1 (RQ1) | F1 (RQ3) | MCC (RQ1) | MCC (RQ3) | Recall (RQ1) | Recall (RQ2) | Obs. (RQ1) | Obs. (RQ2 & RQ3)
GPT-4o | $\{R_p,C_p\}$ | 0.8443 | 0.5839 | 0.6971 | 0.5928 | 0.7746 | 0.7955 | 142 | 44
GPT-4o | $\{R_p,C_f\}$ | – | 0.4316 | – | 0.3823 | – | 0.6290 | 142 | 62
GPT-4o | $\{R_f,C_p\}$ | 0.8731 | 0.5562 | 0.7477 | 0.5689 | 0.8451 | 0.8636 | 142 | 44
GPT-4o | $\{R_f,C_f\}$ | – | 0.4919 | – | 0.4805 | – | 0.7580 | 142 | 62
LLaMA-405B-Instruct | $\{R_p,C_p\}$ | 0.8163 | 0.5219 | 0.6379 | 0.5457 | 0.7606 | 0.7727 | 142 | 44
LLaMA-405B-Instruct | $\{R_p,C_f\}$ | – | 0.4271 | – | 0.4252 | – | 0.5806 | 142 | 62
LLaMA-405B-Instruct | $\{R_f,C_p\}$ | 0.8519 | 0.5251 | 0.7060 | 0.5515 | 0.8169 | 0.7727 | 142 | 44
LLaMA-405B-Instruct | $\{R_f,C_f\}$ | – | 0.4729 | – | 0.4855 | – | 0.6935 | 142 | 62
GPT-4o-mini | $\{R_p,C_p\}$ | 0.8229 | 0.5219 | 0.6558 | 0.5606 | 0.7465 | 0.6136 | 142 | 44
GPT-4o-mini | $\{R_p,C_f\}$ | – | 0.4729 | – | 0.4874 | – | 0.5000 | 142 | 62
GPT-4o-mini | $\{R_f,C_p\}$ | 0.8871 | 0.5306 | 0.7774 | 0.5659 | 0.8451 | 0.6818 | 142 | 44
GPT-4o-mini | $\{R_f,C_f\}$ | – | 0.4754 | – | 0.4980 | – | 0.6452 | 142 | 62
LLaMA-70B-Instruct | $\{R_p,C_p\}$ | 0.8351 | 0.5158 | 0.7016 | 0.5230 | 0.7042 | 0.5227 | 142 | 44
LLaMA-70B-Instruct | $\{R_p,C_f\}$ | – | 0.4430 | – | 0.4368 | – | 0.5806 | 142 | 62
LLaMA-70B-Instruct | $\{R_f,C_p\}$ | 0.7980 | 0.5333 | 0.6370 | 0.5505 | 0.6479 | 0.5682 | 142 | 44
LLaMA-70B-Instruct | $\{R_f,C_f\}$ | – | 0.4760 | – | 0.4895 | – | 0.6452 | 142 | 62
Note: For RQ1, the $C_{(\cdot)}$ contexts are not used; the RQ1 values for the $C_f$ rows are therefore identical to those of the corresponding $C_p$ rows and are marked "–".

III-A LLM Detection: Interpretability and Challenges

We adopted four LLMs (GPT-4o, GPT-4o-mini, LLaMA-405B-Instruct, and LLaMA-70B-Instruct) and evaluated their performance on RQ1 using both flaky and non-flaky observations, and on RQ2 and RQ3 using solely the ground truths of the flaky observations. The results can be found in Table I; we employ the F1-score, Matthews correlation coefficient (MCC), recall, and the total number of observations to compare the performance.
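For reference, these metrics can be computed from ground-truth and predicted labels as sketched below. The toy label vectors are illustrative, and the averaging scheme suggested for the multi-class RQ3 scores is an assumption on our part, not a statement of the exact evaluation configuration.

```python
# Computing the reported metrics for a binary (RQ1-style) label vector.
from sklearn.metrics import f1_score, matthews_corrcoef, recall_score

y_true = [1, 0, 1, 1, 0]   # 1 = flaky, 0 = non-flaky (toy example)
y_pred = [1, 0, 0, 1, 0]

print("F1:    ", f1_score(y_true, y_pred))
print("MCC:   ", matthews_corrcoef(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
# For RQ3 (multi-class root causes), the same calls apply with categorical
# labels and an averaging argument, e.g., average="weighted", for f1_score
# and recall_score.
```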

We also evaluated two additional LLMs locally using the Ollama framework [20]. However, the results for the non-instruction-tuned LLaMA-8B and LLaMA-70B models were excluded due to insufficient performance, as many outputs were either empty or corrupted. This underperformance is likely because these are the smallest non-fine-tuned models in our study. Additionally, GPT-4o-mini was the only model whose outputs required explicit manual modification before scoring.

The difference in the number of observations between the $C_p$ and $C_f$ contexts is attributed to cases where method-level code could not be extracted. These situations include cases where the code was not written in Python or where no changes affected the methods in the code.

III-B Model Performance

III-B1 RQ1

Using the full context, the best-performing model is GPT-4o-mini, with an F1-score of 0.8871 and an MCC of 0.7774. This outcome is somewhat unexpected, as GPT-4o is generally considered more powerful. However, when only partial context is provided, GPT-4o outperforms GPT-4o-mini with an F1-score of 0.8443 vs. 0.8229 and MCC of 0.6971 vs. 0.6558. Thus, GPT-4o effectively identifies flakiness with respectable performance when provided just the initial report. This suggests that classification with limited context might be more challenging, and a more sophisticated model excels in such cases. LLaMA-405B-Instruct and LLaMA-70B-Instruct ranked third or fourth (depending on the context).

III-B2 RQ2

We assess whether adding code at the method level ($C_p$) or including a full code listing ($C_f$) improves classification accuracy. We compare recall values from RQ1 and RQ2 to evaluate this. For GPT-4o, adding method-level code ($C_p$) increases recall, with the highest score observed in the $\{R_f, C_p\}$ setup, as expected. Interestingly, providing a complete code listing ($C_f$) reduces recall, which aligns with the observation that a model, much like a human programmer, might struggle to identify the specific methods to focus on when faced with the entire codebase.

For GPT-4o-mini, recall decreases for RQ2, but less so for $C_p$ compared to $C_f$. This suggests that analyzing both natural language and code is a more complex task, one that requires a more advanced model.

The behavior of LLaMA-405B-Instruct is similar to GPT-4o, showing an increase in recall for $C_p$ and a decrease in recall for $C_f$. In contrast, LLaMA-70B-Instruct exhibits behavior similar to GPT-4o-mini, with decreased performance in both contexts.

Note that we should be cautious about drawing strong conclusions when comparing $C_p$ and $C_f$, as the number of observations for the $C_p$ cases is smaller than for the $C_f$ cases (44 vs. 62, respectively, due to the reasons discussed in Section II-C3).

III-B3 RQ3

RQ3 is the most challenging task, requiring both multi-label classification and code analysis. As anticipated, the performance of all models declines compared to RQ1 and RQ2. The highest results are achieved by the most powerful model, GPT-4o, when using method-level data ($C_p$) and the initial issue or pull request description ($R_p$), though the performance is still low, with an F1-score of 0.5839 and an MCC of 0.5928. With complete code ($C_f$), and thus a higher number of observations, the results drop further to an F1-score of 0.4919 and an MCC of 0.4805.

GPT-4o-mini, LLaMA-405B-Instruct, and LLaMA-70B-Instruct perform similarly, but their rankings differ depending on whether the F1-score or MCC is used. By F1-score, the order from second to fourth place is LLaMA-405B-Instruct, GPT-4o-mini, and LLaMA-70B-Instruct; by MCC, it is GPT-4o-mini, LLaMA-405B-Instruct, and LLaMA-70B-Instruct. These results suggest that there is considerable room for improvement in this area.

III-C Context

Based on the above discussion, we observe that, in general, $R_f$ aids models in making better decisions for RQ1 and RQ2, but not for RQ3. This is somewhat counterintuitive and may be related to the complexity of RQ3, warranting further investigation. For RQ1 and RQ2, the performance drop when only partial context ($R_p$) is available is relatively small, indicating that models can still provide practical value when an issue or pull request is first opened.

Method-level code ($C_p$) appears to yield better results than full code listings ($C_f$), but further analysis is necessary, as the number of observations differs between the two setups.

IV Threats to Validity

Validity threats are classified according to [21, 22].

Internal and construct validity. Data collection and labeling are error-prone processes. In our study, potential flaky tests are collected based on cosine similarity and then labeled manually, so a flaky test can be mislabeled. Our remedy is to have at least two authors cross-examine the potential flaky tests and confirm all positive cases. Non-flaky tests are collected based on cosine similarity (when its value is less than 0.5) and from cases that were labeled as non-flaky during manual review; to mitigate potential errors, at least two authors cross-examined all non-flaky tests as well.

External and conclusion validity. Generally, software engineering studies suffer from real-world variability, and the generalization problem can only be solved partially [23]. One threat to external validity is the limited scope of our dataset, which, although enriched from previous studies, still focuses on a subset of quantum software repositories. As a result, the findings may not be fully representative of the broader population of quantum software projects, especially those utilizing different testing frameworks or methodologies. Through future research, we hope to expand our dataset and findings.

V Future Plans

Our future plans include improving detection methods by exploring and fine-tuning various LLMs, developing automatic patching techniques (given that our dataset includes corresponding fixes), and further exploring the use of LLMs in quantum software maintenance tasks.

VI Conclusions

In this paper, we have presented an automated framework for detecting flaky tests in quantum software, leveraging embedding transformers and LLMs. Our approach enriches the existing dataset of flaky tests and provides methods for semi-automatic and automatic detection. The results show that LLMs perform well at flakiness detection, but there is considerable room for improvement in root cause identification for quantum software bugs.

References

  • [1] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” in Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 643–653.
  • [2] L. Zhang, M. Radnejad, and A. Miranskyy, “Identifying flakiness in quantum programs,” in 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM).   IEEE, 2023, pp. 1–7.
  • [3] L. Zhang and A. Miranskyy, “Automated flakiness detection in quantum software bug reports,” arXiv preprint arXiv:2408.05331, 2024.
  • [4] N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” arXiv preprint arXiv:2210.07316, 2022. [Online]. Available: https://arxiv.org/abs/2210.07316
  • [5] S. Lee, A. Shakir, D. Koenig, and J. Lipp. (2024) Open source strikes bread - new fluffy embeddings model. [Online]. Available: https://www.mixedbread.ai/blog/mxbai-embed-large-v1
  • [6] X. Li and J. Li, “Angle-optimized text embeddings,” arXiv preprint arXiv:2309.12871, 2023.
  • [7] T. Wolf, “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
  • [8] R. Meng, Y. Liu, and S. R. Joty, “SFR-Embedding-Mistral: Enhance text retrieval with transfer learning,” Salesforce AI Research Blog, 2024. [Online]. Available: https://blog.salesforceairesearch.com/sfr-embedded-mistral/
  • [9] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei, “Improving text embeddings with large language models,” arXiv preprint arXiv:2401.00368, 2023.
  • [10] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei, “Text embeddings by weakly-supervised contrastive pre-training,” arXiv preprint arXiv:2212.03533, 2022.
  • [11] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.   University of California Press, 1967.
  • [12] LangChain, Inc., “LangChain LLM App Development Framework,” https://www.langchain.com/.
  • [13] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, and A. Goyal, “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
  • [14] OpenAI, “GPT-4o mini: Advancing cost-efficient intelligence,” https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024.
  • [15] ——, “Hello GPT-4o,” https://openai.com/index/hello-gpt-4o/, 2024.
  • [16] (2024) Overview – OpenAI API. [Online]. Available: https://platform.openai.com/docs/overview
  • [17] (2024) Vertex AI documentation – Google Cloud. [Online]. Available: https://cloud.google.com/vertex-ai/docs
  • [18] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
  • [19] D. Spadini, M. Aniche, and A. Bacchelli, “PyDriller: Python framework for mining software repositories,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering - ESEC/FSE 2018, New York, New York, USA, 2018, pp. 908–911.
  • [20] J. Morgan and M. Chiang, “Ollama,” https://github.com/ollama/ollama, 2024.
  • [21] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering, ser. Computer Science.   Springer Berlin Heidelberg, 2012.
  • [22] R. Yin, Case Study Research: Design and Methods, ser. Applied Social Research Methods.   SAGE Publications, 2009.
  • [23] R. J. Wieringa and M. Daneva, “Six strategies for generalizing software engineering theories,” Science of computer programming, vol. 101, pp. 136–152, 4 2015.