
Identifying Flakiness in Quantum Programs

Lei Zhang
Department of Information Systems
University of Maryland, Baltimore County
Baltimore, USA
[email protected]

Mahsa Radnejad
Department of Computer Engineering
Isfahan Branch, Islamic Azad University
Isfahan, Iran
[email protected]

Andriy Miranskyy
Department of Computer Science
Toronto Metropolitan University
Toronto, Canada
[email protected]
Abstract

In recent years, software engineers have explored ways to assist quantum software programmers. Our goal in this paper is to continue this exploration and determine whether quantum software programmers face some of the problems that plague classical programs. Specifically, we examine whether intermittently failing tests, i.e., flaky tests, affect quantum software development.

To explore flakiness, we conduct a preliminary analysis of 14 quantum software repositories. Then, we identify flaky tests and categorize their causes and methods of fixing them.

We find flaky tests in 12 out of 14 quantum software repositories. In these 12 repositories, the lower boundary of the percentage of issues related to flaky tests ranges between 0.26% and 1.85% per repository. We identify 46 distinct flaky test reports with 8 groups of causes and 7 common solutions. Further, we notice that quantum programmers are not using some of the recent flaky test countermeasures developed by software engineers.

This work may interest practitioners, as it provides useful insight into the resolution of flaky tests in quantum programs. Researchers may also find the paper helpful as it offers quantitative data on flaky tests in quantum software and points to new research opportunities.

I Introduction

A test running on the same code sometimes produces different results, i.e., it shows “passed” sometimes and “failed” other times. Such tests are called flaky tests. Flaky tests can negatively impact developers by providing misleading signals. One can view flaky tests as testing bugs that produce non-deterministic results; such tests consume significant resources. At Google, in 2014, 73K out of 1.6M test failures (4.56%) were caused by flaky tests [1]; in 2017, 1.5% of 4.2M tests were flaky [2, 3].

A flaky test can be caused by two factors: either non-determinism in the source code or the test itself [1, 4, 5, 6, 7, 8]. Quantum programs are inherently non-deterministic. The randomness comes from a variety of sources. For example, it can be caused by the physical properties of quantum systems (e.g., quantum indeterminacy [9, Ch. 1]) or hardware issues (e.g., measurement errors or networking problems) or both (e.g., quantum decoherence [10, Ch. 7]). In a simulation of a quantum computer (QC) on a classical computer (CC), pseudo-random number generators (PRNGs) are used to emulate these sources of randomness. Randomness from all these sources leads to a distribution of output values, which may result in flaky tests.
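
To illustrate this mechanism on a classical simulator, consider the following minimal sketch (our own example, not code from any of the studied repositories): it emulates measuring a qubit in an equal superposition by drawing shots from a PRNG, and an assertion on the observed counts passes or fails depending on the sample drawn.

import random

def measure_plus_state(shots: int, rng: random.Random) -> int:
    """Emulate measuring the |+> state `shots` times; return the number of 1s.

    A classical simulator draws such outcomes from a PRNG, which is what we
    imitate here with Python's standard random module.
    """
    return sum(1 for _ in range(shots) if rng.random() < 0.5)

def test_plus_state_counts():
    # Flaky variant: a fresh, unseeded PRNG yields a different count on every
    # run, so this fairly tight assertion fails intermittently.
    ones = measure_plus_state(shots=1000, rng=random.Random())
    assert 480 <= ones <= 520, f"observed {ones} ones out of 1000 shots"

if __name__ == "__main__":
    test_plus_state_counts()

Fixing the seed, e.g., random.Random(42), makes the count reproducible; this foreshadows the most common fix we observe in quantum repositories (Section III-C1).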

To the best of our knowledge, there exists no quantum software engineering research on flaky tests in quantum programs. Thus, we aim to do an initial analysis of flakiness and its root causes in quantum programs. To achieve this, we will seek answers to the following three research questions (RQs).

RQ1. How prevalent are flaky tests in quantum programs?

RQ2. What causes flakiness in quantum software?

RQ3. How do quantum programmers fix flaky tests?

The paper makes two major contributions. First, we perform an initial study of flaky tests in quantum programs. Specifically, we investigate code and bug tracking repositories of 14 quantum software; 12 out of 14 software have at least one flaky test (46 unique flaky test reports in total). We estimate that at least 0.26% to 1.85% of issue reports in the 12 software are related to flaky tests. Second, we identify and categorize eight groups of causes for flakiness in quantum software and seven common fixes. We find that the most common cause of flakiness is randomness, and the most common solution is to fix seeds for PRNGs. Moreover, quantum programmers do not use some recent countermeasures (e.g.,  [7, 8]) developed by software engineers to deal with flaky tests. We publish the dataset that consists of quantum flaky tests, fixes, and categories at https://doi.org/10.5281/zenodo.7888639.

The paper is organized as follows. Section II presents our initial empirical study of flakiness in existing quantum programs. Section III seeks answers to our three RQs by analyzing flaky test reports. Section IV covers threats to validity. Section V introduces related work on testing quantum programs and flaky tests in classical programs. Section VI concludes the paper and outlines our long-term research objectives.

II Empirical Study: Data Gathering

To answer our RQs, we perform an empirical study focusing on open-source quantum programs on GitHub because open-source projects have transparent development histories and bug reports. We follow three steps to collect flaky test reports from open-source quantum projects.

First, we choose four popular quantum platforms (namely, IBM Qiskit [11], Microsoft Quantum Development Kit [12], TensorFlow Quantum [13], and NetKet [14]) and identify 14 repositories with active contributions and bug reports. While this is not an exhaustive list of all quantum software, these platforms represent some of the most active and popular open-source quantum ecosystems.

Second, we search the closed GitHub issues associated with the 14 GitHub repositories for 10 keywords (namely, “flaky”, “flakiness”, “flakey”, “occasion”, “occasional”, “intermit”, “fragile”, “non-deterministic”, “nondeterministic”, and “brittle”). We focus only on closed reports because they have been verified by developers.
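
For illustration, a keyword search of this kind can be scripted against the GitHub REST API. The sketch below (with an abbreviated repository list) approximates the search step but is not the exact tooling used in this study; authenticated requests may be needed to stay within the search API rate limits, and every hit still requires manual cross-examination, as described next.

import requests

KEYWORDS = ["flaky", "flakiness", "flakey", "occasion", "occasional",
            "intermit", "fragile", "non-deterministic", "nondeterministic",
            "brittle"]
REPOS = ["Qiskit/qiskit-terra", "netket/netket"]  # abbreviated repository list

def search_closed_issues(repo: str, keyword: str) -> list:
    """Return closed issues in `repo` whose title or body mentions `keyword`."""
    query = f'repo:{repo} is:issue is:closed "{keyword}"'
    response = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": 100},  # first result page only
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["items"]

if __name__ == "__main__":
    for repo in REPOS:
        # Deduplicate issues that match more than one keyword.
        hits = {issue["html_url"]
                for keyword in KEYWORDS
                for issue in search_closed_issues(repo, keyword)}
        print(f"{repo}: {len(hits)} candidate issue reports")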

Finally, a minimum of two authors cross-examine these issues and associated pull requests and code commits, determine if they are related to flakiness, and establish the cause category. If there is a disagreement between the examiners, a joint review session is conducted to reach an agreement. In our case, the search for the 10 keywords returned 253 issue reports, 46 of which we manually labeled as flaky test reports (upon cross-examination). We then filter out two repositories without a verified flaky test (namely, qiskit-finance and qiskit-optimization) and end up with 12 repositories.

Based on our manual examination, the common case of a flaky test report consists of an issue report with a pull request that fixes the flakiness. However, there are three other cases: 1) multiple pull requests are related to a single report of flaky tests, e.g., a backport pull request, which we consider as one flaky test report; 2) a pull request that resolves flakiness without an issue report; and 3) a closed flakiness issue report without an associated pull request, e.g., a flaky test that is resolved without detailed conversations or corresponding commits. For simplicity, we call all four cases above “flaky test reports”.

TABLE I: Statistics of quantum software repositories with flaky tests. The three right-most columns are as follows: count of closed issue reports (Column T), count of closed flaky test reports (Column F), and percentage of reports related to flaky tests (Column P), computed as F/T × 100%.
Platform Repository Language T F P
Qiskit qiskit-terra Python 2,810 25 0.89%
Qiskit qiskit-aer Python 558 3 0.54%
Qiskit qiskit-nature Python 287 1 0.35%
Qiskit qiskit-experiments Python 255 1 0.39%
Qiskit qiskit-ibm-runtime Python 217 2 0.92%
Qiskit qiskit-ibm-provider Python 184 2 1.09%
Qiskit qiskit-machine-learning Python 378 1 0.26%
Microsoft qdk-python Python 64 1 1.56%
Microsoft QuantumLibraries Q# 136 2 1.47%
Microsoft Quantum Q# 108 2 1.85%
TensorFlow quantum Python 192 1 0.52%
NetKet netket Python 295 5 1.69%
Total 5,484 46

III Analysis and Results

In this section, we seek answers to our three RQs by analyzing the obtained flaky test reports.

III-A RQ1: How prevalent are flaky tests in quantum programs?

Table I shows the statistics of the quantum program repositories and the flaky test reports that we detect. The data was last updated on January 12, 2023; thus, the statistics may change in the future.

As shown in Table I, we detect 46 flaky test reports in the 12 repositories. Among all the repositories, the core Qiskit component qiskit-terra has the most flakiness reports (i.e., 25) because of the size of the project. The percentage of flaky test reports per repository varies from 0.26% (qiskit-machine-learning) to 1.85% (Microsoft/Quantum). Since our list of keywords is not exhaustive, the flakiness percentages represent a lower bound on the number of flaky test reports (see further discussion in Section IV). In other words, there could be more flaky tests than what we have observed.

Finally, comparing the frequency of flaky tests in QC and CC programs is difficult. As mentioned in Section I, Google reported percentages of test cases and test failures rather than percentages of flaky test reports; other authors typically report similar statistics (see [5] for review). Therefore, a direct comparison is not possible. Our future plans include mapping flaky test reports to test cases and performing such a comparison. However, even without a direct comparison, we can definitively answer RQ1: flaky tests are present in significant quantities in quantum programs.

III-B RQ2: What causes flakiness in quantum software?

To answer RQ2, we manually analyze all flaky test reports, categorize them, and summarize the results in Table II. There are eight categories of causes plus two special cases, i.e., “others” and “unknown”. We detect one flakiness report containing issues and commits related simultaneously to randomness and floating point operations. Therefore, the total number of flaky test reports in Table II is 47 instead of 46 (the total number of flaky test reports in Table I).

What are the differences between our causes and those associated with CC software? Multiple analyses of flaky tests in CC software exist in the literature [1, 4, 5, 6, 7]. For example, Luo et al. [1] identified 10 causes of flakiness in CC software: async wait, concurrency, test order dependency, resource leak, network, time, IO, randomness, floating point operations, and unordered collections. Most flaky tests in [1] are related to Java projects and distributed systems (e.g., Apache HBase and Hadoop). We have five categories in common with theirs: concurrency, network, randomness, floating point operations, and unordered collections.

The leading causes of flakiness in [1] are async waits and concurrency. Our top causes of flaky test reports are randomness (21%) and software environments (15%). Why is this discrepancy occurring?

Researchers have identified randomness as a minor cause of flakiness in generic CC software [1, 5, 6]. A survey [5] indicates that, depending on the software, only 1% to 5% of flaky tests are caused by randomness, while async wait and concurrency are consistently the leading causes.

Our findings are closer to those of studies focusing on probabilistic programming and machine learning software, where 60% of flaky tests may be caused by randomness [7]. As PRNGs are heavily used in QC programming (see Section III-B1), randomness contributes significantly to flaky test reports in quantum programs.

Let us examine the eight causes (as well as the two special cases) listed in Table II in more detail.

TABLE II: Count of cause categories and fix patterns based on flaky test reports. Each cell gives the number of flaky test reports with the given cause (row) resolved with the given fix pattern (column); dashes denote zero.
Cause category       | Fix Seed | Alter Software Env. | Make Single Thread | Adjust Tolerance | Add Exception Handler | Synchronize | Use Keys for Order | Others | Grand Total | Percentage
Randomness           | 9   | -  | -  | -  | -  | -  | -  | 1   | 10  | 21%
Software Env.        | -   | 4  | -  | -  | -  | -  | -  | 3   | 7   | 15%
Multi-Threading      | -   | -  | 2  | -  | -  | -  | -  | 4   | 6   | 13%
Floating Point Ops.  | -   | -  | -  | 3  | -  | -  | -  | 2   | 5   | 11%
Visualization        | -   | -  | -  | -  | -  | -  | -  | 3   | 3   | 6%
Unhandled Exception  | -   | -  | -  | -  | 3  | -  | -  | -   | 3   | 6%
Network              | -   | -  | -  | -  | -  | 1  | -  | -   | 1   | 2%
Unordered Collection | -   | -  | -  | -  | -  | -  | 1  | -   | 1   | 2%
Others               | -   | -  | -  | -  | -  | -  | -  | 7   | 7   | 15%
Unknown              | -   | -  | -  | -  | -  | -  | -  | 4   | 4   | 9%
Grand Total          | 9   | 4  | 2  | 3  | 3  | 1  | 1  | 24  | 47  | 100%
Percentage           | 19% | 9% | 4% | 6% | 6% | 2% | 2% | 51% |     | 100%

III-B1 Randomness

As mentioned above, the use of PRNGs is the most common (21%) cause of reports of flakiness in quantum programs. The PRNGs produce different outputs from run to run, which may result in a flaky test. Given that most of the test cases are executed on simulators of QCs and that the simulators rely heavily on PRNGs to simulate the non-deterministic nature of QCs, it is not surprising that this is the primary cause of flaky test reports.

PRNGs are used both in quantum programs and in the associated test suites. Our first example (Listing 1) demonstrates a problem in the quantum program itself, reported in issue #3533 of qiskit-terra. The constructor of a two-qubit decomposer (details in [15]) has a non-deterministic component in Lines 6 and 7. As a result, the decomposer can produce different results when generating controlled versions of the same unitary gate. For example, Listing 2 creates a UnitaryGate instance in Line 1, and the equality operation in Line 5 returns “True” or “False” at random.

1  class TwoQubitWeylDecomposition:
2      ...
3      def __init__(self, unitary_matrix):
4          ...
5          for _ in range(100):
6              M2real = np.random.randn()*M2.real \
7                       + np.random.randn()*M2.imag
Listing 1: An example of randomness in a quantum program that can lead to a flaky test.
1  uni = UnitaryGate([[1, 0, 0, 0],
2                     [0, 1, 0, 0],
3                     [0, 0, 1, 0],
4                     [0, 0, 0, 1j]])
5  uni.control() == uni.control()
Listing 2: A testing code used to reproduce the flaky test.
1  def test_append_circuit(self, num_qubits):
2      ...
3      first_circuit = random_circuit(num_qubits[0], depth)
4      ...
5      for num in num_qubits[1:]:
6          circuit = random_circuit(num, depth)
Listing 3: An example of a PRNG-related flaky test in the code of a test case.

In Listing 3, the second example (based on issue report #5217 in qiskit-terra) illustrates an issue in a test case. The logic of the test test_append_circuit, which checks whether appending quantum circuits works properly, is correct. However, Lines 3 and 6 cause flakiness because the function random_circuit generates a random quantum circuit using a randomly selected seed (by default). Thus, the test fails occasionally due to randomness. We discuss how to fix the flakiness in Listings 1 and 3 in Section III-C1.

III-B2 Software Environment

This category includes flaky tests caused by specific software or library dependency issues. For example, pull request #1369 in netket discusses a flaky test observed only in the GitHub Actions Python 3.10 environment. The pull request starts with the following comment.

“No idea why, but on the GitHub runner python 3.10 this test keeps failing. I never reproduced locally so I think it’s not a real failure. Simplyfying [sic] the test to avoid.”

A common way to fix this issue is to alter the software environment as will be shown in Section III-C2.

III-B3 Multi-Threading

This category of flaky tests is caused by multi-threading issues, e.g., concurrency and overload. As an example, issue report #5904 in qiskit-terra describes a flaky test caused by address collisions due to parallel builds. The code in this example can be seen in Listing 4: sphinx-build generates documentation in parallel over N processes, where N is the number of CPUs (set via the argument -j auto). While the parallelization improves throughput, it causes an error when multiple sphinx-build jobs compete against each other. A fix pattern for this cause is given in Section III-C3.

commands = sphinx-build -W -b html -j auto docs/ docs/_build/html {posargs}
Listing 4: An example of a flaky test related to multi-threading.

III-B4 Floating Point Operations

We define a flaky test as floating-point-related if it occurs due to challenges such as round-off errors. For example, in netket pull request #1147 (Listing 5), the hard-coded tolerance value of 10^-5 (shown in Line 5) causes a test case to fail intermittently.

1  def test_vmc_functions():
2      ha, sx, ma, sampler, driver = _setup_vmc()
3      driver.advance(500)
4      assert driver.energy.mean == \
5          approx(ma.expect(ha).mean, abs=1e-5)
Listing 5: An example of a flaky test related to floating point operations.

To tackle this problem, developers modify the assertion to alter tolerance, round the actual value, or remove a flaky test case altogether. We will provide details of how developers address this challenge in Section III-C4.

III-B5 Visualization

This group of flaky tests is related to image generation. For example, qiskit-terra has a test manager that schedules visual tests sequentially and allows these tests to communicate with one another. Issue #3283 in qiskit-terra reports a flaky test in the visualizer of the test manager due to out-of-date reference indexes.

In this category, we are unable to identify a recurring fix pattern. There is a unique solution for each of the three reports (all associated with qiskit-terra).

III-B6 Unhandled Exception

This group of flaky tests is caused by code that does not handle exceptions appropriately. For example, Listing 6 shows the stack trace from issue report #398 in Microsoft/QuantumLibraries. In this case, the test can occasionally fail when the “number” falls outside the expected range. We explore a fix pattern for this cause in Section III-C5.

Unhandled exception.
Microsoft.Quantum.Simulation.Core.ExecutionFailException: "number" must be between 0 and 2^3 - 1, but was -1.
Listing 6: An example of an unhandled exception.

III-B7 Network

A flaky test in this category occurs as a result of network-related issues, such as an unstable network or server. For example, issue #584 of qiskit-ibm-runtime reports a flaky test due to timeouts and socket connection problems. As shown in Listing 7, test_websocket_proxy fails because jobs are completed before the websocket connection can be established. A method to fix this cause is given in Section III-C6.

FAIL: test_websocket_proxy (test.integration.test_results.TestIntegrationResults)
  (service=<QiskitRuntimeService>)
Listing 7: An example of network-related flaky test.

III-B8 Unordered Collection

This category of flaky tests is Python-specific (although, hypothetically, it may appear in other languages). Dictionaries (hash maps or hash tables) are implemented as unordered collections from Python 3.3 to 3.7 [6]. When tests depend on dictionary order, test outcomes can become non-deterministic, which leads to flakiness. For example, pull request #8627 in qiskit-terra reveals a flaky test that compares two dictionaries by insertion order, hence the non-deterministic results. We explore a fix for this cause in Section III-C7.

III-B9 Others

Flaky test reports are included in this special category if there is only one observation of the cause and the cause has not been classified in previous studies (e.g., [1, 5, 6]).

For example, in Listing 8 (issue report #62 in Microsoft/Quantum), there is a space between the left double quotation mark and Microsoft, which prevents the fixed seed from being matched to the test, so a random seed is used instead, causing intermittent failures.

1fixedSeeds.Add(" Microsoft.Quantum.Tests.RobustPhaseEstimationTest", 2020776761);
Listing 8: A typo causes a flaky test.

Another example is keeping local outputs in Jupyter Notebook and preventing flaky tests from being executed in the continuous integration pipeline (pull request #453 in TensorFlow/quantum).

A final example is given in pull request #795 of qiskit-aer. It relates to the changing configuration of a QC (as engineers keep tweaking the computer’s configuration). Engineers update the configuration files of the simulator of this QC whenever the configuration changes. However, the test case expected values were hard-coded, which caused intermittent test case failures. To resolve this issue, testers started dynamically reading configuration parameters from configuration files instead of hard-coding them.

III-B10 Unknown

A flaky test cause is classified in this special case as unknown if there is insufficient information about its cause or fix, i.e., it is impossible to identify the cause of a flaky test without detailed conversations or corresponding commits. For example, a flaky test stack trace is provided in issue report #185 of qiskit-machine-learning. The report is later closed because of insufficient information.

III-C RQ3: How do quantum programmers fix flaky tests?

Table II shows seven fix patterns for the flaky test reports that we found (based on the flakiness-related commits and pull requests). These patterns cover 49% of the flaky test reports. The remaining fixes are one-off and are bundled into the “others” fix pattern (51%). The one cause category with no associated fix pattern is “visualization”: there is no recurring pattern, and we only observe visualization-related flaky tests in qiskit-terra.

III-C1 Fix Seed

When the seed used to initialize a PRNG is fixed, its output becomes reproducible, and a variable can be compared against a constant. Thus, 9 out of 10 randomness-related flaky test reports (i.e., 19% of all flaky test reports) are fixed by setting a random seed value to a constant (or by combining this with a reduced convergence threshold, as shown in pull request #8820 of qiskit-terra). For example, to avoid the flakiness in Listing 3, Listing 9 (pull request #5599 in qiskit-terra) shows a solution that replaces the default non-constant random seed with a fixed seed value.

- first_circuit = random_circuit(num_qubits[0], depth)
+ first_circuit = random_circuit(num_qubits[0], depth, seed=4200)
...
- circuit = random_circuit(num, depth)
+ circuit = random_circuit(num, depth, seed=4200)
Listing 9: An example of fixed seed for Listing 3.

While flakiness can be controlled by fixing seeds for the PRNGs, this approach can potentially make testing less effective because it limits possible execution paths that can expose real bugs [8]. Further, a change in the PRNG algorithm may make the test case flaky again.

We have not seen robust fixes to this issue that involve running the test case multiple times to compute the distribution of successful and failed executions and then performing distribution analysis to assess whether new changes to the code alter the distribution (see [8] for details). The closest we observed is a strategy of running the test three times and declaring success if the test passes at least once (pull request #724 in netket). The absence of distribution-based fixes could be due to higher computation and execution time requirements. Additionally, programmers may not want to build such a test framework from scratch and may not realize that test frameworks of this kind, e.g., FLEX [8], already exist.
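
To make the distribution-based alternative concrete, here is a hedged sketch (inspired by, but not reproducing, FLEX [8]): the probabilistic test is rerun many times, and the meta-test checks whether the observed pass rate stays close to the rate estimated when the test was written, rather than demanding that a single run pass.

import math
import random

def rerun_pass_rate(test_fn, runs: int = 200) -> float:
    """Execute `test_fn` repeatedly and return the fraction of passing runs."""
    passes = 0
    for _ in range(runs):
        try:
            test_fn()
            passes += 1
        except AssertionError:
            pass
    return passes / runs

def probabilistic_test():
    # Stand-in for a test whose outcome depends on a PRNG-driven simulation.
    ones = sum(1 for _ in range(1000) if random.random() < 0.5)
    assert 480 <= ones <= 520

def test_pass_rate_distribution():
    # `expected_rate` would be estimated once, when the test is introduced;
    # the 3-sigma band is a design choice trading strictness for stability.
    expected_rate, runs = 0.80, 200
    rate = rerun_pass_rate(probabilistic_test, runs=runs)
    margin = 3 * math.sqrt(expected_rate * (1 - expected_rate) / runs)
    assert abs(rate - expected_rate) <= margin, f"pass rate drifted to {rate:.2f}"

if __name__ == "__main__":
    test_pass_rate_distribution()

This strategy requires roughly `runs` times the execution budget of a single test, which matches the cost concern noted above.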

Finally, we find one flaky test that is fixed by replacing the PRNG with a deterministic formula (pull request #3585 of qiskit-terra, as shown in Listing 10).

- for _ in range(100):
-     M2real = np.random.randn()*M2.real + np.random.randn()*M2.imag
+ for i in range(-50, 50):
+     M2real = (i/50)*M2.real + (i/25)*M2.imag
Listing 10: An example of removing the source of randomness in Listing 1.

III-C2 Alter Software Environment

Flaky tests can be removed by updating or changing the development environment. Of the seven observed flaky test reports related to the software environment (see Table II), four (9% of all the reports) are fixed by upgrading or changing dependencies. Listing 11 demonstrates a fix that removes the dependency on a specific numpy version (more details can be found in pull request #4656 of qiskit-terra).

- numpy!=1.19
Listing 11: An example of fixing library dependency.

The remaining three reports are fixed by changing the configuration of the continuous integration pipeline (issue #319 in Microsoft/qdk-python), simplifying the test case (pull request #1369 in netket), or simply closing the report because of dependency changes (issue #1466 in qiskit-aer).

III-C3 Make Single Thread

Multi-threading-related flaky tests can be resolved by limiting the number of threads to one. Setting a single thread resolved two of the six flaky tests (4% of all reports) related to multi-threading. For example, in pull request #780 of qiskit-experiments, developers observe occasional timeout issues when running multiple tests because multi-threading is disabled in certain environments. Though the flakiness may be resolved by disabling multi-threading, performance overheads may arise. Another example can be seen in Listing 12 (pull request #6539 in qiskit-terra), which shows the fix for the flaky test in Listing 4 by disabling the parallelization.

- sphinx-build -W -b html -j auto docs/ docs/_build/html {posargs}
+ sphinx-build -W -b html docs/ docs/_build/html {posargs}
Listing 12: The fix for the flaky test in Listing 4.

III-C4 Adjust Tolerance

For flaky tests related to floating point operations, one can adjust the tolerance. Three out of five reports (6% of all the reports) are fixed this way. For example, one can increase the tolerance manually, e.g., by doubling the tolerance value as shown in Listing 13. However, hard-coded loose tolerances may let real defects go undetected (false negatives), affecting the correctness of the program.

- np.testing.assert_allclose(result.values, [-1.307397243478641], rtol=0.05)
+ np.testing.assert_allclose(result.values, [-1.307397243478641], rtol=0.1)
Listing 13: An example of tolerance increase from pull request #8820 of qiskit-terra.

Setting dynamic tolerance may be a more robust solution. For example, to fix the issue shown in Listing 5, the developers compute the tolerance level dynamically, as shown in Lines 2 and 3 of Listing 14.

1  - assert driver.energy.mean == approx(ma.expect(ha).mean, abs=1e-5)
2  + tol = driver.energy.error_of_mean * 5
3  + assert driver.energy.mean == approx(ma.expect(ha).mean, abs=tol)
Listing 14: An example of tolerance increase for Listing 5.

Developers also round the actual value to combat numeric errors (e.g., pull request #4835 of qiskit-terra), or remove a flaky test case altogether if it is deemed non-essential (e.g., pull request #8582 of qiskit-terra).

III-C5 Add Exception Handler

Flaky tests caused by unhandled exceptions can be mitigated by adding an exception handler or removing the exception (all three reports, or 6% of all the reports, are fixed this way). For example, developers add an if condition to remove any unexpected negative coefficients in pull request #399 of Microsoft/QuantumLibraries (see Listing 15).

- ApplyXorInPlace(keepCoeff[idx], keepCoeffRegister);
+ if (keepCoeff[idx] >= 0) {
+     ApplyXorInPlace(keepCoeff[idx], keepCoeffRegister);
+ }
Listing 15: The fix for the flaky test in Listing 6.

III-C6 Synchronize

We observe one (i.e., 2% of all the reports) network-related flaky test report, i.e., issue #584 and pull request #588 in qiskit-ibm-runtime. Here, tests for callback functions depend on socket tests. However, sometimes the callback tests finish before the socket tests, which causes flakiness. Developers increase the number of callback test iterations so that the socket tests have sufficient time to finish first (see Listing 16). This may be a suboptimal solution, as the number of iterations is hard-coded. The ideal solution would be to synchronize the callback tests with the completion of the socket tests; a sketch of such a helper follows Listing 16.

- job = self._run_program(service, iterations=1, callback=result_callback)
+ job = self._run_program(service, iterations=10, callback=result_callback)
Listing 16: The fix for the flaky test in Listing 7.
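
The sketch below illustrates the synchronization idea in a hedged form; the names wait_until, socket_tests_done, and run_callback_checks are hypothetical and ours, not part of qiskit-ibm-runtime.

import time

def wait_until(condition, timeout: float = 60.0, poll: float = 0.5) -> bool:
    """Poll `condition()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False

def run_callback_test_after_sockets(socket_tests_done, run_callback_checks):
    # `socket_tests_done` reports whether the websocket tests have finished;
    # `run_callback_checks` executes the callback assertions. Blocking on an
    # explicit condition replaces the hard-coded iteration count in Listing 16.
    assert wait_until(socket_tests_done, timeout=120.0), \
        "socket tests did not finish in time"
    run_callback_checks()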

III-C7 Use Keys for Order

For flaky tests caused by unordered Python dictionaries, developers can order entries by their keys instead of relying on insertion order. In Python 3.8 and later, dictionaries preserve insertion order, so upgrading the Python environment can also solve the problem. However, Python upgrades can cause dependency issues. One flaky test report (i.e., 2% of all reports) had this cause and fix.
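
As a minimal sketch of this fix pattern (our own example, not code from qiskit-terra), the comparison below orders dictionary entries by key instead of relying on insertion order:

expected = {"00": 512, "11": 488}
observed = {"11": 488, "00": 512}  # same content, different insertion order

# Flaky comparison: depends on the order in which entries were inserted
# (and, for older interpreters, on non-deterministic hashing).
flaky_equal = list(expected.items()) == list(observed.items())

# Robust comparison: sort the entries by key before comparing.
robust_equal = sorted(expected.items()) == sorted(observed.items())

assert robust_equal      # always True when the contents match
print(flaky_equal)       # False here; historically could vary from run to run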

IV Threats to Validity

Validity threats are classified according to [16, 17].

Internal and construct validity. Data harvesting and cleaning are error-prone processes. The data were collected manually. Several co-authors independently examined the search results for flaky tests and then jointly reviewed and discussed the findings to finalize the list. The 10 keywords used to select the issue reports (discussed in Section II) can produce false negatives; therefore, we underestimate the number of reports related to flaky tests. Yet, even this limited set of keywords shows that flaky tests exist in QC programs. In the future, we will extend this research with automated methods and tools to determine whether reports are related to flakiness.

External and conclusion validity. Generally, software engineering studies suffer from real-world variability, and the generalization problem can only be solved partially [18]. To build a theory, one needs to generalize to a theoretical population and understand its architectural similarity relation [18]. Although 14 open-source projects are used in this study, our findings may not generalize to other projects. However, based on our findings in this pilot study, the same empirical examination can be conducted on other quantum software products using well-designed and controlled experiments. Through such future research, we hope the community will expand our taxonomy of causes and identify patterns that will eventually lead to a general theory of flaky tests in quantum computing.

V Related Work

The literature on testing and debugging quantum programs is growing. Quantum programs are challenging to test because of the properties of quantum mechanics [19]. Software engineering principles are being applied to quantum program testing and debugging [19, 20, 21]; see [22, 23] for a comprehensive overview of quantum software engineering research.

The community takes multiple approaches to tackle this challenge. Testing QC programs may be simplified by adding assertion checks to the code [24, 25, 26, 27] or, in some cases, by introducing debugging tricks, such as extracting classical information [20]. The identification of bug patterns in quantum programs can assist in defect analysis and categorization [28]. We can also adapt classical fuzz testing techniques [29] or perform property-based testing [30]. Additionally, quantum programs can be debugged in simulators on a CC (frameworks such as Qiskit and Q# readily provide such an option), but simulators can be used only for small problems [19, 31].

As far as we know, there has been no prior study of flaky tests in quantum programs. As discussed in Section III-B, Luo et al. [1] summarize 10 common causes of flakiness in software for CCs; five of them overlap with ours. Similar taxonomies for CC software have been presented by [5, 6, 7] and are complementary to ours.

VI Conclusions and Future Work

This paper examines flakiness in 14 quantum programs. We detect 46 flaky test reports in 12 of these programs, which means at least 0.26% to 1.85% of issue reports in those programs are related to flakiness. Additionally, we identify eight groups of causes of flaky tests in QC code and seven common fixes. The final observation is that quantum programmers use only some recent software engineering techniques developed to deal with flaky tests. We hope that these findings will assist researchers and developers in mitigating the risk of flakiness when designing and testing quantum programs.

There are several ways to expand this work. In the future, we will explore additional repositories and keywords to identify more flaky test patterns in QC. In addition, we plan to compute the fraction of test cases affected by flakiness to perform direct statistical comparisons between CC and QC programs.

Our long-term objectives are 1) to automate the manual approaches to detect flaky tests in quantum programs based on the bug patterns identified in this paper and 2) to develop methods and tools to automatically fix QC flaky tests based on the fix patterns we have identified. For the first objective, we will develop an automated tool for each quantum flaky test category and integrate them into a framework that covers the most popular flaky test categories. As for the second objective, we will follow a similar path as the first one, i.e., we will develop individual automated fix methods and tools followed by a more comprehensive framework to fix flaky tests. These two objectives will combine organically to provide an efficient solution to mitigate flakiness in quantum programs.

We hope the community can use our findings to improve methods for detecting flakiness in the text of bug reports, automatically identifying flaky tests, suggesting mitigation strategies, and enriching flaky test datasets. Moreover, by raising awareness of the similarities and differences between classical and quantum programs, we would like to encourage quantum programmers to adopt some of the tools developed by the software engineering community to improve quantum software quality.

References

  • [1] Q. Luo, F. Hariri, L. Eloussi, and D. Marinov, “An empirical analysis of flaky tests,” in Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, 2014, pp. 643–653.
  • [2] J. Micco, “The state of continuous integration testing @google,” 2017. [Online]. Available: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45880.pdf
  • [3] A. Memon, Z. Gao, B. Nguyen, S. Dhanda, E. Nickell, R. Siemborski, and J. Micco, “Taming google-scale continuous testing,” in Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP).   IEEE, 2017, pp. 233–242.
  • [4] M. Eck, F. Palomba, M. Castelluccio, and A. Bacchelli, “Understanding flaky tests: The developer’s perspective,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 830–840.
  • [5] O. Parry, G. M. Kapfhammer, M. Hilton, and P. McMinn, “A survey of flaky tests,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 1, pp. 1–74, 2021.
  • [6] M. Gruber, S. Lukasczyk, F. Kroiß, and G. Fraser, “An empirical study of flaky tests in python,” in Proceedings of the 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST).   IEEE, 2021, pp. 148–158.
  • [7] S. Dutta, A. Shi, R. Choudhary, Z. Zhang, A. Jain, and S. Misailovic, “Detecting flaky tests in probabilistic and machine learning applications,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2020, pp. 211–224.
  • [8] S. Dutta, A. Shi, and S. Misailovic, “Flex: fixing flaky tests in machine learning projects by updating assertion bounds,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 603–614.
  • [9] D. C. Marinescu, Classical and quantum information.   Academic Press, 2011.
  • [10] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information: 10th Anniversary Edition.   Cambridge Univ. Press, 2010.
  • [11] M. Treinish, J. Gambetta et al., “Qiskit/qiskit: Qiskit 0.39.5,” Jan. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.7545230
  • [12] Microsoft, “Q# and the quantum development kit,” 2023. [Online]. Available: https://azure.microsoft.com/en-us/resources/development-kit/quantum-computing
  • [13] TensorFlow, “TensorFlow quantum,” 2023. [Online]. Available: https://www.tensorflow.org/quantum
  • [14] NetKet, “NetKet - the machine learning toolbox for quantum physics,” 2023. [Online]. Available: https://www.netket.org/
  • [15] B. Kraus and J. I. Cirac, “Optimal creation of entanglement using a two-qubit gate,” Phys. Rev. A, vol. 63, p. 062309, May 2001. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.63.062309
  • [16] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering, ser. Computer Science.   Springer Berlin Heidelberg, 2012.
  • [17] R. Yin, Case Study Research: Design and Methods, ser. Applied Social Research Methods.   SAGE Publications, 2009.
  • [18] R. J. Wieringa and M. Daneva, “Six strategies for generalizing software engineering theories,” Science of computer programming, vol. 101, pp. 136–152, 4 2015.
  • [19] A. Miranskyy and L. Zhang, “On testing quantum programs,” in Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER).   IEEE, 2019, pp. 57–60.
  • [20] A. Miranskyy, L. Zhang, and J. Doliskani, “Is your quantum program bug-free?” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, ser. ICSE-NIER ’20.   ACM, 2020, p. 29–32.
  • [21] ——, “On testing and debugging quantum software,” arXiv preprint arXiv:2103.09172, 2021.
  • [22] J. Zhao, “Quantum software engineering: Landscapes and horizons,” arXiv preprint arXiv:2007.07047, 2020.
  • [23] M. De Stefano, F. Pecorelli, D. Di Nucci, F. Palomba, and A. De Lucia, “Software engineering for quantum programming: How far are we?” Journal of Systems and Software, vol. 190, p. 111326, 2022.
  • [24] Y. Huang and M. Martonosi, “Statistical assertions for validating patterns and finding bugs in quantum programs,” in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA’19.   Association for Computing Machinery, 2019, p. 541–553.
  • [25] G. Li, L. Zhou, N. Yu, Y. Ding, M. Ying, and Y. Xie, “Projection-based runtime assertions for testing and debugging quantum programs,” Proceedings of the ACM on Programming Languages, vol. 4, no. OOPSLA, pp. 150:1–150:29, 2020.
  • [26] J. Liu, G. T. Byrd, and H. Zhou, “Quantum circuits for dynamic runtime assertions in quantum computation,” in Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS’20.   Association for Computing Machinery, 2020, p. 1017–1030.
  • [27] S. Ali, P. Arcaini, X. Wang, and T. Yue, “Assessing the effectiveness of input and output coverage criteria for testing quantum programs,” in 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST).   IEEE, 2021, pp. 13–23.
  • [28] P. Zhao, J. Zhao, and L. Ma, “Identifying bug patterns in quantum programs,” in Proceedings of the 2021 IEEE/ACM 2nd International Workshop on Quantum Software Engineering (Q-SE).   IEEE, 2021, pp. 16–21.
  • [29] J. Wang, M. Gao, Y. Jiang, J. Lou, Y. Gao, D. Zhang, and J. Sun, “Quanfuzz: Fuzz testing of quantum program,” arXiv preprint arXiv:1810.10310, 2018.
  • [30] S. Honarvar, M. R. Mousavi, and R. Nagarajan, “Property-based testing of quantum programs in q#,” in Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 2020, pp. 430–435.
  • [31] M. P. Usaola, “Quantum software testing,” in Short Papers Proceedings of the 1st International Workshop on the QuANtum SoftWare Engineering & pRogramming, ser. CEUR Workshop Proceedings.   CEUR-WS.org, 2020, pp. 57–63.