
QuaLITi: Quantum Machine Learning Hardware Selection for Inferencing with Top-Tier Performance

Koustubh Phalak CSE Department
Pennsylvania State University
State College, PA
[email protected]
   Swaroop Ghosh School of EECS
Pennsylvania State University
State College, PA
[email protected]
Abstract

Quantum Machine Learning (QML) is an accelerating field of study that leverages the principles of quantum computing to enhance and innovate within machine learning methodologies. However, Noisy Intermediate-Scale Quantum (NISQ) computers suffer from noise that corrupts the quantum states of the qubits and affects the training and inferencing accuracy. Furthermore, quantum computers have long access queues: a single execution with a pre-defined number of shots can take hours just to reach the top of the wait queue, which is especially disadvantageous to QML algorithms that are iterative in nature. Many vendors provide access to a suite of quantum hardware with varied qubit technologies, qubit counts, coupling architectures, and noise characteristics. However, present QML algorithms do not use this suite for the training procedure and often rely on local noiseless/noisy simulators because of the cost and training time overhead on real hardware. Additionally, inferencing is generally performed on reduced datasets with fewer datapoints. Taking these constraints into account, we perform a study to maximize the inferencing performance of QML workloads based on the choice of hardware. Specifically, we perform a detailed analysis of quantum classifiers (both training and inference, viewed through the lens of hardware queue wait times) on the Iris and reduced Digits datasets under noise and varied conditions such as different hardware and coupling maps. We show that training on multiple readily available hardware, rather than relying on a single device, especially one with a long queue of pending jobs, incurs a performance impact of only 3-4% while providing up to a 45X reduction in training wait time.

Index Terms:
Quantum Hardware, Quantum Machine Learning, Inferencing

I Introduction

In recent years, the field of quantum computing has witnessed significant growth, propelled by its potential to solve complex problems far beyond the reach of classical computing paradigms [1]. This emerging technology, characterized by its principles of superposition, entanglement, and quantum interference, offers unprecedented computational advantages, promising revolutionary breakthroughs [2, 3] in various disciplines, including cryptography [4], finance [5, 6], chemistry and material science [7], and healthcare [8]. One of the most promising applications of quantum computing lies in the domain of machine learning, where the computational advantages of quantum algorithms can be leveraged to enhance the efficiency and capability of traditional machine learning algorithms [9]. The synthesis of quantum computing and machine learning has given rise to a new interdisciplinary field known as Quantum Machine Learning (QML), which seeks to harness quantum computational advantages to improve machine learning tasks. Examples of quantum machine learning algorithms, such as Quantum Neural Networks (QNN) [10], Variational Quantum Eigensolver (VQE) [11], Variational Quantum Classifier (VQC) [12], and Quantum Support Vector Machine (QSVM) [13], illustrate the potential of quantum computing to provide solutions to otherwise intractable learning problems.

Figure 1: Training the reduced Digits dataset (classes 8, 9) on randomly allocated configurations gives poor inferencing results. The training is done on 127-qubit hardware, where we observe a maximum inferencing performance of only 66%, suggesting that there is room for improvement in the choice of qubit configuration and even hardware.
Figure 2: Main methodology for obtaining the best inferencing results. ① We select a suitable model that has the best entangling arrangement. ② Multi-hardware training is performed as follows: ⓐ we perform configurational analysis based on properties such as coherence times, error rates, and circuit depth, give each configuration a cumulative score, and sort all configurations by score; ⓑ the top-scoring configurations from each hardware are selected for multi-hardware training. ③ Inferencing is performed using the multi-hardware trained model and the top-performing values are noted. Note that I-V denote coupling maps as shown in Fig. 3(d).

However, the practical realization of QML's potential is currently hindered by the limitations of the Noisy Intermediate-Scale Quantum (NISQ) technology era [14], such as gate errors, decoherence, and crosstalk errors [15], as well as by hardware constraints, e.g., the coupling map, that lead to performance degradation. Suppose we want to train a QNN on a particular quantum hardware for binary classification. One iteration/epoch of training classifies all the data points by predicting their classes and uses the predictions to compute the gradient and update the QNN parameters. To classify even a single data point, the QNN must (i) undergo transpilation to make the circuit hardware-compliant, which increases the depth and gate count of the circuit and makes it more susceptible to noise, and (ii) execute multiple times (also referred to as shots) to compute accurate expectation values for that datapoint. Repeating this process for all data points over multiple epochs is challenging under NISQ constraints such as hardware wait times and noise levels. These constraints underscore the importance of hardware selection for QML workloads without significant degradation in training and inference accuracy.

Previously, efforts have been made to mitigate some of the above constraints: employing circuit concurrency in the QML training pipeline [16] (mitigating noise and reducing wait time), directly converting quantum circuits into native pulse schedules without the need for transpilation [17] along with a similar work [18] (both mitigating noise), using state preparation circuits designed to work robustly under noise [19] (mitigating noise), performing noise-aware training [20] (mitigating noise), and training the unitary operator representation of a quantum circuit instead of the circuit itself [21] (mitigating noise).

Motivation: Suppose a user wants to train their QNN model on real quantum hardware. They accordingly define their QNN network consisting of an appropriate state preparation circuit to load classical data into the quantum Hilbert space, a Parametric Quantum Circuit (PQC) consisting of trainable quantum layers, and measurement operations for classical gradient computation and optimization of the PQC parameters. Suppose, however, that the user does not take hardware constraints such as the coupling map and noise into account and arbitrarily trains the QNN without explicitly specifying a reasonably good configuration. In that case, the QNN may be mapped to a poor configuration of qubits with high error rates, low coherence times, and high transpilation depth. For example, suppose a user trains an 8-qubit QNN on the Digits dataset (binary classification, classes 8 and 9) on 127-qubit hardware (qubits 0 to 126) with no configuration specified. The QNN could be mapped to any of the following sets of 8 qubits: (i) [9,10,11,12,17,30,29,28], (ii) [79,80,81,82,83,84,85,86], or (iii) [97,98,99,100,110,118,119,120]. These configurations have mean two-qubit error rates as high as 0.31 and mean coherence times across all qubits as low as 70 µs. These factors corrupt the qubit states, leading to significant performance degradation: the inferencing performance reaches a maximum of only 66% across all three configurations (Fig. 1). The situation can become even worse for multi-hardware training if the coupling maps are not chosen carefully.

Proposed approach: In this work, we address the above concern by studying the training of QNNs on multiple hardware to maximize inferencing performance while cutting down wait time. We show the overall methodology in Fig. 2. We choose an appropriate model, then perform configurational analysis to find the best configurations and hardware, and use them for multi-hardware training. Finally, we perform inferencing and obtain the results. Note that while in practice step 2 (multi-hardware training) should ideally run on real hardware, hardware access restrictions force us to perform step 2 using noisy simulations. Although we use the Iris and reduced Digits datasets, the proposed methodology is generic and can be extended to larger datasets, such as a reduced CIFAR-10 dataset, as shown later in this paper.

In the rest of the paper, Section II provides relevant background and related works. Section III explains the training setup, followed by the multi-hardware training setup and inferencing results in Section IV. Next, we perform additional analyses, such as the variation of inferencing performance, the effect of the coupling map, real-hardware queue depth analysis, and scalability to larger datasets, in Section V. Finally, we draw conclusions in Section VI. All the code corresponding to this work can be found in our GitHub repository: https://github.com/KoustubhPhalak/QuaLITi-QML-Workload-Optimization.

II Background and Related Works

II-A Quantum Computing

Qubits are the fundamental units of a quantum computer, equivalent to bits in classical computers. A qubit stores information in a quantum state, represented as a 2x1 vector and mathematically denoted as $\ket{\psi} = \big[\begin{smallmatrix}\alpha\\ \beta\end{smallmatrix}\big]$, where $|\alpha|^2$ is the probability of the qubit being measured as 0 and $|\beta|^2$ is the probability of it being measured as 1. There are two special states: $\alpha=1,\ \beta=0$ ($\ket{0} = \big[\begin{smallmatrix}1\\ 0\end{smallmatrix}\big]$) and $\alpha=0,\ \beta=1$ ($\ket{1} = \big[\begin{smallmatrix}0\\ 1\end{smallmatrix}\big]$). These are known as basis states and are the quantum analogues of the classical 0 and 1 bit values, respectively. The quantum state of a qubit is changed with the help of quantum gates, which are unitary matrix operations. These gates act either on a single qubit (e.g., the Hadamard gate and the Pauli X/Y/Z gates) or on multiple qubits (e.g., the CNOT gate, controlled Pauli CY/CZ gates, SWAP gate, and Toffoli gate). Combining qubits and quantum gates, we obtain quantum circuits: ordered sequences of quantum gates placed on qubits. Measurement eventually collapses the qubits to classical bit values (either 0 or 1) [22]. A special kind of quantum circuit, known as a Parametric Quantum Circuit (PQC), contains parametric rotation gates (such as U and Pauli RX/RY/RZ gates) that can be tuned classically using traditional optimization algorithms. PQCs can be thought of as trainable ML models that form an integral part of QNNs in QML.
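To make these notions concrete, the following minimal Pennylane sketch (we use Pennylane for our experiments, but the specific gates here are illustrative, not the model used in this paper) builds a tiny two-qubit PQC, measures a Pauli-Z expectation value, and computes the classical gradient of its rotation parameter:

```python
import pennylane as qml
from pennylane import numpy as np

# Two-qubit noiseless simulator.
dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def pqc(theta):
    qml.RY(theta, wires=0)            # trainable parametric rotation gate
    qml.CNOT(wires=[0, 1])            # two-qubit entangling gate
    return qml.expval(qml.PauliZ(0))  # measurement in the Pauli-Z basis

theta = np.array(0.3, requires_grad=True)
print(pqc(theta))             # expectation value, here cos(theta)
print(qml.grad(pqc)(theta))   # classically computed gradient, here -sin(theta)
```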

II-B Cloud-based NISQ Quantum Computing

Modern NISQ quantum computers are typically accessed via a cloud service from vendors such as IBM [23], Google [24], and Amazon [25]. The general access flow is: (1) the user writes a program containing the quantum circuit for their task; (2) the user sends the program to the cloud service along with the target hardware and some extra metadata (such as the number of shots, the optimization level for transpilation, the number of ancilla qubits, etc.); (3) the cloud service allocates the user's program to the desired hardware, which runs the program, generates the results, and returns them to the cloud service; (4) finally, the cloud service sends the results from the quantum hardware back to the user.
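A hedged sketch of steps (1)-(4) using Qiskit's runtime primitives is shown below; the backend name ibm_osaka is a placeholder, and the exact Sampler constructor argument (mode= vs. backend=) varies across qiskit-ibm-runtime versions:

```python
from qiskit import QuantumCircuit, transpile
from qiskit_ibm_runtime import QiskitRuntimeService, SamplerV2 as Sampler

# (1) Write the program: a simple Bell-state circuit with measurements.
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

# (2) Send it to the cloud service, naming the target hardware and metadata.
service = QiskitRuntimeService()        # reads saved IBM Quantum credentials
backend = service.backend("ibm_osaka")  # placeholder hardware name
qc_t = transpile(qc, backend=backend)   # make the circuit hardware-compliant

# (3) The job enters the hardware queue, runs, and results return to the cloud.
job = Sampler(mode=backend).run([qc_t], shots=1024)

# (4) The cloud service hands the results back to the user.
print(job.result()[0].data.c.get_counts())
```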

However, a major problem arises from this access model. Due to the limited availability of quantum computers, a single machine can have an access queue with hours of wait time before the user's program reaches the top. This problem is further aggravated for QML algorithms, which require iterative runs to optimize the rotation gate parameters; selecting quantum hardware with a short wait time is therefore an important consideration in the QML training pipeline.

II-C Noise in quantum hardware

The performance of QNNs, like all quantum computing systems, is significantly influenced by various forms of noise, each impacting the accuracy and efficiency of the system in distinct ways. A common source of error within QNNs is decoherence, a phenomenon where qubits lose their quantum state due to unintended interactions with the external environment. This loss is essentially an energy dissipation from the qubits, leading to a degradation of the quantum coherence of the system and, consequently, its computational capabilities. Another error is crosstalk, which occurs when there are unwanted interactions between qubits that are quantum mechanically coupled. These interactions can alter the state of neighboring qubits in an uncontrolled manner, introducing errors into the computation process.

The implementation of quantum gates also introduces potential sources of error. Quantum gates in QNNs are typically realized through the application of microwave pulses in systems utilizing superconducting qubits, or laser pulses in the case of trapped-ion qubits. Any inaccuracies or imperfections in these pulses can result in gate errors, where the intended quantum operation is applied incorrectly, leading to deviations from the expected computational outcome. Lastly, the process of measuring quantum states introduces another avenue for error, known as readout errors. Quantum measurement operations can vary widely depending on the physical implementation of the qubits. For instance, photonic qubits are often measured using photon detectors, trapped-ion qubits may be measured through the intensity of fluorescence, and superconducting qubits might be measured via resonator coupling. Each measurement technique has its own set of potential inaccuracies, whether due to imprecise measurements, equipment limitations, or inherent inaccuracies in the measurement apparatus itself. These readout errors can significantly affect the accuracy of the quantum computation, as they directly influence the interpretation of the quantum system’s final state.
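As an aside, one way to reproduce these gate, decoherence, and readout error channels in simulation (our assumption about tooling, not a workflow prescribed by the paper) is to derive a noise model from a backend's stored calibration data with Qiskit Aer:

```python
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator
from qiskit.providers.fake_provider import Fake27QPulseV1

backend = Fake27QPulseV1()
# Simulator inheriting the backend's gate, decoherence, and readout errors.
noisy_sim = AerSimulator.from_backend(backend)

qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])
qc_t = transpile(qc, backend=backend)

counts = noisy_sim.run(qc_t, shots=4096).result().get_counts()
print(counts)  # gate/readout errors appear as spurious '01'/'10' counts
```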

II-D Related Works

Many efforts have been made to reduce the effect of quantum hardware limitations. For example, [16] proposes running concurrent executions of different training data within the same batch for the same QNN circuit on different available qubits. [17, 18] propose converting Variational Quantum Algorithms (VQA) into pulse schedules such that the pulses are native to the quantum hardware and have tunable parameters; by training these parameters, the authors train the original model. Works like [19, 20] take the effect of noise into account and incorporate it into the VQA, e.g., by creating a noise-resilient state preparation circuit for better training, or by injecting quantum noise during training to make it noise-aware and performing post-measurement processing such as quantization and normalization of measurement outputs. Finally, [21] proposes training the unitary operator representation of a QML ansatz ($2^N \times 2^N$ in size for $N$ qubits) rather than the ansatz itself; the authors use a gradient descent algorithm for optimization and partition the unitary operator for further time complexity reduction. Of all these works, only [16] takes hardware queue wait time into consideration, achieving up to 20X speedup. Our proposed approach is complementary in nature and could be used in conjunction with [16] for additional benefit.

III Training Setup

III-A Hardware and configuration selection

We first choose a set of quantum hardware and their corresponding configurations for training and inferencing QML models. Considering the latest release of Qiskit 1.0, we select the Fake20QV1() (20 qubits), Fake27QPulseV1() (27 qubits), and Fake127QPulseV1() (127 qubits) noisy simulators (containing real-hardware calibration data) from the qiskit.providers.fake_provider library for our experiments. The coupling map of each hardware is shown in Fig. 3. We select five topologically different 8-qubit coupling maps (labeled I-V) for each hardware, shown visually in Fig. 3(d) and with individual qubit indices in Table I. Considering the exponential growth of simulation runtime with qubit count [26, 27], we pick small-scale datasets such as Iris and the UCI Digits dataset for the training process. Furthermore, we select all 3 classes of the Iris dataset for classification and 2 two-class subsets of the Digits dataset, classes {0,1} and {8,9} (henceforth referred to as Digits01 and Digits89, respectively).
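A small sketch of this hardware selection follows (import paths are those of Qiskit 1.0's qiskit.providers.fake_provider, as named above; they may move in later releases):

```python
from qiskit.providers.fake_provider import (
    Fake20QV1, Fake27QPulseV1, Fake127QPulseV1)

# Noisy simulators seeded with snapshots of real-hardware calibration data.
backends = {"20Q": Fake20QV1(), "27Q": Fake27QPulseV1(), "127Q": Fake127QPulseV1()}

# One of the five 8-qubit coupling maps (configuration I) per hardware, from Table I.
config_I = {"27Q": [4, 7, 10, 12, 15, 18, 20, 23],
            "127Q": [14, 18, 19, 20, 21, 22, 23, 24],
            "20Q": [0, 1, 6, 5, 10, 11, 15, 16]}

for name, backend in backends.items():
    cfg = backend.configuration()
    print(name, cfg.backend_name, cfg.n_qubits, "qubits,",
          len(cfg.coupling_map), "coupling edges; config I ->", config_I[name])
```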

Refer to caption
Figure 3: Coupling maps of (a) 20 qubit hardware, (b) 27 qubit hardware and (c) 127 qubit hardware; (d) five coupling configurations I-V used for noisy experiments.
TABLE I: 8-qubit coupling configurations used in different hardware. Note that C.M. = Coupling Map.

C.M. | 27Q | 127Q | 20Q
I | [4,7,10,12,15,18,20,23] | [14,18,19,20,21,22,23,24] | [0,1,6,5,10,11,15,16]
II | [4,7,10,12,13,15,18,20] | [19,20,21,22,23,24,25,15] | [5,6,7,8,9,14,13,3]
III | [7,10,12,15,18,20,13,14] | [20,21,22,23,24,25,15,4] | [6,7,8,9,14,13,3,2]
IV | [10,12,15,18,20,23,24,13] | [14,18,19,20,21,22,23,15] | [0,1,2,3,8,9,14,6]
V | [4,7,10,12,15,18,6,13] | [19,20,21,22,23,24,15,33] | [5,6,1,2,3,4,0,8]

III-B Model selection

We use 8-qubit QNNs containing a classical data loading circuit, a Parametric Quantum Circuit (PQC), and measurement operations. We select the input data loading scheme based on the dimensionality of the input. Given an 8-qubit QNN, the Iris dataset (4-element feature vector) fits an $n$-features-to-$n$-qubits embedding, while the Digits dataset (64-element feature vector) fits a $2^n$-features-to-$n$-qubits embedding. We therefore use angle embedding (4 features on 4 qubits) for the Iris dataset and amplitude embedding (64 features on 6 qubits) for the Digits dataset [28]. Strongly Entangling Layers (SEL) [29] are selected for the PQC since they create strong entanglement through many entangling gates such as CNOTs and contain many trainable parameters, which together improve the trainability of the PQC [30]. Furthermore, we take the number of classes into account when setting the number of layers: for the Iris dataset (3 classes) we choose a model with 6 SELs, and for the reduced Digits datasets (2 classes) we choose a model with 3 SELs. Finally, the measurement operations consist of expectation value measurements in the Pauli-Z basis. For all runs, we train for 10 epochs with the Adam optimizer (learning rate $10^{-3}$) and a batch size of 16.
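A Pennylane sketch of the Digits-style model under the stated choices is given below (3 SELs, amplitude embedding of 64 features on 6 of the 8 qubits, Pauli-Z readout, Adam with learning rate 10^-3); the exact readout wiring per class is our assumption:

```python
import pennylane as qml
from pennylane import numpy as np

n_qubits, n_layers = 8, 3                       # 3 SELs for the two-class Digits tasks
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnn(weights, x):
    # 2^n features on n qubits: a 64-element vector on 6 of the 8 qubits.
    qml.AmplitudeEmbedding(x, wires=range(6), normalize=True)
    # ranges=[1]*n_layers reproduces the r=1 CNOT arrangement (see the range study below).
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits),
                                 ranges=[1] * n_layers)
    # Pauli-Z expectation values; one readout qubit per class (our assumption).
    return [qml.expval(qml.PauliZ(w)) for w in range(2)]

shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
weights = np.random.uniform(0, 2 * np.pi, size=shape, requires_grad=True)
opt = qml.AdamOptimizer(stepsize=1e-3)          # Adam, learning rate 10^-3
```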

Refer to caption
Figure 4: Different arrangements of CNOT gates in SEL.
TABLE II: Inferencing performance of models with varying $r$.

Range (r) | Iris | Digits01 | Digits89
1 | 93.78% | 89.35% | 84.57%
2 | 86.00% | 84.72% | 83.73%
3 | 88.44% | 90.74% | 80.37%
4 | 74.22% | 61.38% | 75.23%
TABLE III: Analysis of various configurations. All scores are normalized; Final score = 0.2(A) + 0.2(B) + 0.1(C) + 0.3(D) + 0.2(E).

Config. | Decoherence score (A) | Readout score (B) | 1Q error score (C) | 2Q error score (D) | Layer depth score (E) | Final score
27Q(II) | 90.23 | 83.91 | 100.00 | 93.80 | 10.47 | 75.06
27Q(I) | 76.91 | 85.59 | 65.91 | 100.00 | 19.47 | 72.98
27Q(V) | 100.00 | 100.00 | 100.00 | 68.69 | 2.08 | 71.02
27Q(III) | 89.06 | 89.01 | 65.91 | 86.23 | 1.02 | 68.28
27Q(IV) | 47.97 | 75.94 | 65.91 | 76.13 | 13.26 | 56.86
20Q(II) | 3.69 | 29.04 | 9.09 | 13.78 | 100.00 | 31.59
127Q(V) | 5.55 | 64.38 | 11.36 | 15.28 | 51.51 | 30.01
20Q(IV) | 0.57 | 32.71 | 13.46 | 8.30 | 73.83 | 25.26
127Q(II) | 10.11 | 64.38 | 11.36 | 22.50 | 28.72 | 28.53
127Q(I) | 7.58 | 56.56 | 11.36 | 29.71 | 14.73 | 25.82
127Q(IV) | 6.68 | 51.70 | 14.77 | 19.58 | 24.80 | 23.98
20Q(III) | 0.00 | 22.19 | 7.95 | 9.60 | 54.73 | 19.06
20Q(I) | 12.33 | 0.00 | 0.00 | 14.77 | 73.83 | 21.66
20Q(V) | 7.55 | 34.61 | 13.46 | 0.00 | 48.46 | 19.47
127Q(III) | 9.55 | 18.99 | 10.51 | 20.15 | 0.00 | 12.80

Another aspect of the model selection process is the ansatz used in the SEL. SELs have a range parameter $r$ that dictates the target qubit for a given control qubit of each CNOT gate: if the control is on qubit $i$ (with $i$ ranging from 0 to $n-1$ for an $n$-qubit QNN) and the range value is $r$, then the target is on qubit $(i+r) \bmod n$. As an example, Fig. 4 shows how different range values change the arrangement of CNOT gates in the entangling part of an 8-qubit SEL. We select four range values $r \in \{1,2,3,4\}$, define a model for each, and train these models under noise on the chosen datasets to determine the model with the most appropriate entanglement. In each case, training is done on the 27-qubit hardware with configuration I (linear coupling map). The inferencing results for this experiment are reported in Table II; the tabulated values are the mean inferencing performance over 10 inferencing runs, to account for fluctuations due to noise (for all runs in this paper, we report the mean accuracy of 10 inferencing runs). We observe that $r=1$ performs best for the Iris dataset, while $r=3$ and $r=1$ perform best for Digits01 and Digits89, respectively. We therefore select these models for further analysis.
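A small helper (hypothetical, for illustration only) that enumerates the CNOT pairs implied by the formula above:

```python
def sel_cnot_pairs(n_qubits: int, r: int):
    """(control, target) pairs for one SEL entangling block with range r."""
    return [(i, (i + r) % n_qubits) for i in range(n_qubits)]

# For 8 qubits, r=1 yields a ring (0->1, 1->2, ..., 7->0), while r=4
# pairs each qubit with the one directly opposite (0->4, 1->5, ...).
for r in (1, 2, 3, 4):
    print(r, sel_cnot_pairs(8, r))
```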

IV Multi hardware training

IV-A Configurational analysis

We identify the potentially best-performing configurations by analyzing properties such as coherence times (both T1 and T2), single- and two-qubit error rates, single-layer post-transpilation depth, and readout error. In general, the CNOT gates added by the SWAP insertion procedure during transpilation make the resulting QNN circuit sensitive to the two-qubit error rate. Next, the native gate set of a particular hardware plays an important role in determining the overall QNN depth. The readout errors (preparing $\ket{0}$ and measuring 1, or preparing $\ket{1}$ and measuring 0) can further degrade performance. Finally, individual single-qubit error rates can lead to erroneous computation of the quantum state, although their effect is relatively small compared to the other factors. For the best results, (i) the two-qubit error rate, post-transpilation depth, readout errors, and single-qubit error rate should be low (inversely proportional ↓), and (ii) the coherence times should be high (directly proportional ↑). Both types of readout error should be low and both T1 and T2 should be high, so we combine T1 and T2 with their harmonic mean and the two readout errors with their arithmetic mean. The harmonic mean is sensitive to low values, so a (T1, T2) pair with even a single low value has a low harmonic mean, implying a low overall coherence time for the configuration; similarly, high readout error values yield a high arithmetic mean. Next, we create a score in the range [0,100] for each property as follows: (i) for inversely proportional properties, we take the reciprocal of the property value and normalize it to [0,100] using the minimum and maximum over all configurations; if $p$ is the list of property scores after taking reciprocals, then the normalized score of $p_i$ (with $1 \leq i \leq 15$, since there are 5 configurations on each of 3 hardware) is $p_i^{\prime} = \frac{p_i - \min(p)}{\max(p) - \min(p)} \times 100$; (ii) for directly proportional properties, we normalize the property value with the same formula without taking the reciprocal. Finally, we combine the individual property scores in a weighted fashion to obtain an overall score for each configuration. Based on the criticality of the properties discussed above, we assign a weight of 0.3 to the two-qubit error rate, 0.2 each to the post-transpilation layer depth, coherence times, and readout error rates, and 0.1 to the single-qubit error rate. These results are tabulated in Table III in decreasing order of final score from top to bottom. From the table, we observe that all the 27-qubit hardware configurations score highest, owing to their high coherence times and low error rates.
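The scoring procedure can be summarized by the following sketch (the function names and input layout are ours; the inputs are the per-configuration property values, e.g., harmonic-mean coherence times and arithmetic-mean readout errors):

```python
import numpy as np

def normalize(values, inverse=False):
    """Map property values to [0, 100]; reciprocals are taken first for
    inversely proportional properties (error rates, depth)."""
    v = np.asarray(values, dtype=float)
    if inverse:
        v = 1.0 / v
    return (v - v.min()) / (v.max() - v.min()) * 100

def final_scores(coherence, readout, err_1q, err_2q, depth):
    """Weighted combination over the 15 configurations (5 maps x 3 hardware)."""
    A = normalize(coherence)                # directly proportional: higher is better
    B = normalize(readout, inverse=True)    # arithmetic-mean readout error
    C = normalize(err_1q, inverse=True)     # single-qubit error rate
    D = normalize(err_2q, inverse=True)     # two-qubit error rate (most critical)
    E = normalize(depth, inverse=True)      # single-layer post-transpilation depth
    return 0.2 * A + 0.2 * B + 0.1 * C + 0.3 * D + 0.2 * E
```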

IV-B Multi hardware training procedure

From the analysis in Table III, it would be tempting to simply select the configurations with the highest scores. However, the top 5 scoring configurations all belong to the 27-qubit hardware. From a training standpoint, even though the individual coupling map configurations differ, we would then dedicate all 10 epochs of training to the 27-qubit hardware alone. If the 27-qubit hardware has a large queue of pending jobs, the overall training time overhead compounds with the number of epochs allotted to it. To address this challenge, we propose an alternative selection strategy where (i) a top-scoring configuration from each hardware is selected at least once during training, and (ii) the hardware selected for the next training slot differs from the current one. These two criteria allow the usage of top-performing hardware while avoiding scheduling the busiest hardware back to back, potentially saving training wait time in the hardware queue. We satisfy both conditions by selecting five configurations (2 epochs per configuration) such that the first three are among the best-performing configurations of each hardware and the next two are chosen in a similar fashion while also ensuring the least queue wait times. For example, the top three scoring configurations (in order) for each hardware are (i) 20Q: II, IV, III; (ii) 27Q: II, I, V; (iii) 127Q: V, II, IV. From these, we can randomly select one configuration from each hardware, switch to a different hardware, and repeat this process five times. We employ randomness here because the scores of these configurations within each hardware are relatively close, so selecting one over another makes little difference to the final inferencing performance. A set of configurations chosen in this fashion for the Iris dataset could be 20Q (IV), 127Q (V), 27Q (I), 20Q (III), and 127Q (IV). The final training configurations for all datasets, selected using this procedure, are shown in Table IV in training order from left to right. Note that we assume all chosen configurations have the least wait time.
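A sketch of this selection strategy follows (the live queue-depth tie-breaking is abstracted away as an assumption):

```python
import random

# Top-3 scoring coupling maps per hardware, taken from Table III.
top_configs = {"20Q": ["II", "IV", "III"],
               "27Q": ["II", "I", "V"],
               "127Q": ["V", "II", "IV"]}

def pick_schedule(n_slots=5, seed=None):
    """Five (hardware, coupling map) slots, 2 epochs each: every hardware
    appears at least once and no hardware is scheduled back to back."""
    rng = random.Random(seed)
    hardware = list(top_configs)
    rng.shuffle(hardware)   # first pass covers each hardware exactly once
    schedule = [(hw, rng.choice(top_configs[hw])) for hw in hardware]
    while len(schedule) < n_slots:
        prev = schedule[-1][0]
        hw = rng.choice([h for h in top_configs if h != prev])
        # A live queue-depth check would break ties here (our assumption).
        schedule.append((hw, rng.choice(top_configs[hw])))
    return schedule

print(pick_schedule(seed=0))  # e.g. five (hardware, coupling map) training slots
```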

The results for this training procedure are shown in Fig. 5. We observe that the 27-qubit hardware is the best-performing hardware for inferencing, followed by the 20-qubit hardware and finally the 127-qubit hardware. For the 20-, 27-, and 127-qubit hardware respectively, we note mean inferencing accuracies of (i) 92.76%, 93.91%, and 91.24% for the Iris dataset, (ii) 84.85%, 85.83%, and 78.47% for the Digits01 dataset, and (iii) 77.65%, 83.45%, and 72.57% for the Digits89 dataset.

TABLE IV: Configurations selected for the multi-hardware training procedure (in training order, left to right).

Dataset | 1st | 2nd | 3rd | 4th | 5th
Iris | 20Q, IV | 127Q, V | 27Q, I | 20Q, III | 127Q, IV
Digits01 | 127Q, II | 27Q, I | 20Q, III | 27Q, V | 20Q, IV
Digits89 | 27Q, II | 20Q, II | 127Q, V | 27Q, I | 20Q, IV
Figure 5: Multi hardware training results. D01 = Digits01, D89 = Digits89.
TABLE V: Mean hardware characteristics of different hardware, all coupling configurations combined. Note that A = {id, u1, u2, u3, cx} and B = {id, rz, sx, x, cx, reset}.

Property | 20Q | 27Q | 127Q
2Q error rate | 0.0172 | 0.0085 | 0.0147
Basis gate set | A | B | B

V Additional Analysis

V-A Performance variation with hardware and dataset

Variation with hardware: From Fig. 5, we observe inferencing performance variation as we switch to different hardware. In particular, we note that 27 qubit hardware performs the best, followed by 20 qubit hardware and finally the worst-performing 127 qubit hardware. This trend can be explained by examining internal hardware characteristics.

We show in Table V the mean hardware characteristics across all coupling configurations combined, namely the mean two-qubit error rate and the native basis gate set of each hardware. We observe that the 27-qubit hardware has the best two-qubit error rate (0.0085), which matches the high two-qubit error scores of all 27-qubit configurations in Table III and explains why it shows the best performance. The 20-qubit and 127-qubit hardware have roughly twice the two-qubit error rate (0.0172 and 0.0147, respectively). Furthermore, the 20-qubit hardware has a different basis gate set than the other two, which leads to a lower post-transpilation depth. For example, an amplitude embedding circuit followed by a single SEL with r=1 has an average post-transpilation depth of 608 on the 127-qubit hardware but only 321 on the 20-qubit hardware. Therefore, even though its two-qubit error rate is higher, the lower post-transpilation depth compensates and gives the 20-qubit hardware higher performance than the 127-qubit hardware. However, if the 27-qubit hardware has a large queue, we may lose performance by selecting other hardware in the process of optimizing wait time; this is a trade-off between wait time and inferencing performance that should be weighed while choosing configurations.
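A hedged way to reproduce such depth comparisons is sketched below; the single CNOT ring stands in for the full SEL entangler, so the absolute numbers will differ from the 608/321 averages reported above:

```python
import numpy as np
from qiskit import QuantumCircuit, transpile
from qiskit.providers.fake_provider import Fake20QV1, Fake127QPulseV1

# Amplitude-embed a random 64-element vector on 6 of 8 qubits, then add one
# ring of CNOTs approximating the r=1 SEL entangling block.
state = np.random.rand(64)
state /= np.linalg.norm(state)
qc = QuantumCircuit(8)
qc.prepare_state(state, range(6))    # amplitude embedding via state preparation
for i in range(8):
    qc.cx(i, (i + 1) % 8)

for backend in (Fake20QV1(), Fake127QPulseV1()):
    depth = transpile(qc, backend=backend, optimization_level=3).depth()
    print(backend.configuration().backend_name, "post-transpilation depth:", depth)
```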

Variation with dataset: As mentioned earlier, we utilize angle embedding to embed a vector of size 4 onto 4 qubits using parametric rotation angles for the Iris dataset, and amplitude embedding to embed 64-element feature vectors onto 6 qubits for the Digits datasets. Under ideal conditions, both state preparation circuits perform well. However, when the coupling constraints of hardware are taken into account, an amplitude embedding circuit requires depth exponential in the number of qubits used [31]. This contributes significantly to the overall post-transpilation circuit depth, leading to degradation in performance, as seen when comparing the performances of the Iris and Digits datasets in Fig. 5. We note that the Iris dataset always shows greater than 90% inferencing performance for any configuration, whereas the Digits datasets show inferencing performance below 90%, sometimes even below 80%.

We also observe significant performance differences between the two Digits datasets. Structurally, both datasets have images of the same dimensions (8x8 images, or 64x1 vectors for training). This means that both datasets have the same amplitude embedding depth post-transpilation; numerically, we observe a post-transpilation depth of roughly 490 for the amplitude embedding circuit for both datasets. The remaining difference is the structure of the digits themselves. For Digits01, 0 has a round 'o' shape while 1 has a line '|' shape; for Digits89, 8 has two 'o' shapes while 9 has an 'o' shape attached to a '|' shape. Under noiseless conditions, the selected model classifies both datasets with greater than 90% accuracy. Under noise, however, the same model likely finds it easier to distinguish images containing two distinct shapes ('o' and '|'), like 0 and 1, than images sharing a common shape (the 'o'), like 8 and 9. This can potentially explain the lower performance on the Digits89 dataset compared to the Digits01 dataset.

Figure 6: Variation in inferencing performance with different coupling maps for multi-hardware training setup.
TABLE VI: Training and inferencing wait times on real hardware.

H/W | Queue wait time (A) | Queue depth (B) | Avg wait time C = A ÷ B | # Train data (D, Iris/Digits) | Train wait time, min (C×D×10 ÷ 60) | # Inf. data (E, Iris/Digits) | Inf. wait time, s (C×E)
Brisbane | 5h 40m | 1447 | 14.1 s | 105/252 | 247/592 | 45/108 | 634/1522
Kyoto | 2h 15m | 183 | 44.3 s | 105/252 | 775/1860 | 45/108 | 1993/4784
Osaka | 1 s | 1 | 1 s | 105/252 | 17.5/42 | 45/108 | 45/108

V-B Effect of changing coupling map

We now view the previous results through the lens of coupling maps. We restructure the multi-hardware training results of Fig. 5 as boxplots for each coupling map and dataset in Fig. 6. An indicator of a good coupling map configuration is low fluctuation across different hardware. Across all three datasets combined, for coupling maps I to V respectively, we observe mean fluctuations of 3.74%, 3.46%, 2.59%, 3.33%, and 2.96%. From these values, we conclude that coupling maps III and V are the most resilient for inferencing multi-hardware trained models.

Figure 7: Queue depth boxplots for different IBM quantum hardware.

V-C Hardware queue depth analysis

We observe the queue depth variation of real hardware (IBM Brisbane, IBM Kyoto, and IBM Osaka) over time. Fig. 7 shows boxplots of the queue depth recorded from 04/10/2024 16:30 to 04/12/2024 19:30. IBM Brisbane is the busiest of the three, followed by IBM Kyoto and IBM Osaka. From the boxplots, the overall fluctuation in queue depth (i.e., the standard deviation) is 88 jobs for IBM Brisbane, 74 jobs for IBM Kyoto, and 43 jobs for IBM Osaka. To estimate how much wait time these queue depths translate to, we choose a dummy 8-qubit circuit (consisting of only a single RZ gate on every qubit) and send it for execution on each machine. The queue wait times for this circuit are tabulated in Table VI: for the aforementioned time period, Kyoto has the largest wait time per circuit (44.3 s), followed by Brisbane (14.1 s) and Osaka (1 s). Using this data, we estimate the wait time of a single inference run. For our datasets, we use a 70:30 train-test split, which gives 45 inference data points for the Iris dataset and 108 for each of the Digits01 and Digits89 datasets. Assuming each datapoint run waits for the computed wait time, a single inferencing run waits 634, 1993, and 45 seconds on IBM Brisbane, IBM Kyoto, and IBM Osaka, respectively, for the Iris dataset, and 1522, 4784, and 108 seconds, respectively, for the Digits datasets. We can also extrapolate the average wait time to estimate the training wait time. For example, consider training the Iris dataset on IBM Osaka: the training set has 105 images, so for an average wait time of 1 s and 10 epochs of training, the total training wait time is roughly 17.5 minutes. Table VI lists the training wait times for all datasets on all hardware. From these values, we note up to a 45X reduction (per epoch) in training wait time when the training hardware is switched from IBM Kyoto to IBM Osaka; this is potentially more than twice the speedup achieved in [16].
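The estimates in Table VI reduce to simple arithmetic, sketched below (the helper name is ours):

```python
def wait_times(queue_wait_s, queue_depth, n_train, n_infer, epochs=10):
    """Per-job wait C = A/B; training waits C*D*epochs; inferencing waits C*E."""
    per_job = queue_wait_s / queue_depth
    return {"per_job_s": round(per_job, 1),
            "train_wait_min": round(per_job * n_train * epochs / 60),
            "infer_wait_s": round(per_job * n_infer)}

# IBM Kyoto, Iris dataset: 2h 15m of queue wait across 183 queued jobs.
print(wait_times(2 * 3600 + 15 * 60, 183, n_train=105, n_infer=45))
# -> roughly 44.3 s per job, ~775 min training wait, ~1992 s inferencing wait
#    (Table VI rounds the per-job wait to 44.3 s, giving 1993 s).
```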

V-D Scalability to larger datasets

We also show that our methodology scales to larger datasets. In particular, we train a hybrid quantum-classical neural network (a few convolution layers followed by an 8-qubit QNN) in the multi-hardware training setup on the CIFAR-10 dataset [32]. We use a reduced version of the dataset consisting of the airplane (class 0) and frog (class 6) classes (300 images per class, 70:30 train-test split). Based on the best-scoring configurations from Table III, we select 27Q(II), 20Q(II), 127Q(V), 27Q(I), and 20Q(IV). For all configurations combined, we note mean inferencing accuracies of 80.96%, 82.84%, and 80.59% for the 20-, 27-, and 127-qubit hardware, respectively. Once again, the 27-qubit hardware performs best, followed by the 20-qubit and finally the 127-qubit hardware. Fig. 8 shows boxplots of the inferencing performance for the various hardware. We also compare these results against a model trained only on the single best-scoring configuration, 27Q(II); that model yields mean inferencing accuracies of 82.78%, 82.33%, and 81.82% for the 20-, 27-, and 127-qubit hardware, respectively. We thus note a mean accuracy reduction of only 0.84% across all hardware when switching from single-hardware to multi-hardware training for the CIFAR dataset.

Figure 8: Inferencing performance on various hardware for multi-hardware training on the reduced CIFAR-10 dataset.

VI Conclusion

In this work, we proposed a novel methodology to train QML models on multiple hardware. We first selected a suitable model and performed a configurational analysis of all configurations. Based on the intuition gained, we chose the top-scoring configurations and proposed a multi-hardware training setup. The multi-hardware training results show that small datasets such as Iris are resilient to the effects of noise, whereas more complex datasets such as Digits images are susceptible to factors such as coupling constraints and noise characteristics. Finally, we showed that the proposed methodology scales to larger datasets such as RGB CIFAR-10 images, yielding reasonable inferencing performance.

Acknowledgements

We acknowledge the usage of IBM Quantum along with Pennylane for performing all the experiments. All the relevant code has been added to our GitHub repository: https://github.com/KoustubhPhalak/QuaLITi-QML-Workload-Optimization. This work is supported in part by NSF (CNS-1722557, CNS-2129675, CCF-2210963, CCF-1718474, OIA-2040667, DGE-1723687, DGE-1821766, and DGE-2113839) and Intel's gift.

References

  • [1] A. Galindo and M. A. Martin-Delgado, “Information and computation: Classical and quantum aspects,” Reviews of Modern Physics, vol. 74, no. 2, p. 347, 2002.
  • [2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
  • [3] Y. Kim, A. Eddins, S. Anand, K. X. Wei, E. Van Den Berg, S. Rosenblatt, H. Nayfeh, Y. Wu, M. Zaletel, K. Temme et al., “Evidence for the utility of quantum computing before fault tolerance,” Nature, vol. 618, no. 7965, pp. 500–505, 2023.
  • [4] V. Mavroeidis, K. Vishi, M. D. Zych, and A. Jøsang, “The impact of quantum computing on present cryptography,” arXiv preprint arXiv:1804.00200, 2018.
  • [5] R. Orús, S. Mugel, and E. Lizaso, “Quantum computing for finance: Overview and prospects,” Reviews in Physics, vol. 4, p. 100028, 2019.
  • [6] D. Herman, C. Googin, X. Liu, Y. Sun, A. Galda, I. Safro, M. Pistoia, and Y. Alexeev, “Quantum computing for finance,” Nature Reviews Physics, vol. 5, no. 8, pp. 450–465, 2023.
  • [7] B. Bauer, S. Bravyi, M. Motta, and G. K.-L. Chan, “Quantum algorithms for quantum chemistry and quantum materials science,” Chemical Reviews, vol. 120, no. 22, pp. 12 685–12 717, 2020.
  • [8] S. Gupta, S. Modgil, P. C. Bhatt, C. J. C. Jabbour, and S. Kamble, “Quantum computing led innovation for achieving a more sustainable covid-19 healthcare industry,” Technovation, vol. 120, p. 102544, 2023.
  • [9] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, “Quantum machine learning,” Nature, vol. 549, no. 7671, pp. 195–202, 2017.
  • [10] S. Garg and G. Ramakrishnan, “Advances in quantum deep learning: An overview,” arXiv preprint arXiv:2005.04316, 2020.
  • [11] J. Tilly, H. Chen, S. Cao, D. Picozzi, K. Setia, Y. Li, E. Grant, L. Wossnig, I. Rungger, G. H. Booth et al., “The variational quantum eigensolver: a review of methods and best practices,” Physics Reports, vol. 986, pp. 1–128, 2022.
  • [12] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio et al., “Variational quantum algorithms,” Nature Reviews Physics, vol. 3, no. 9, pp. 625–644, 2021.
  • [13] D. Willsch, M. Willsch, H. De Raedt, and K. Michielsen, “Support vector machines on the d-wave quantum annealer,” Computer physics communications, vol. 248, p. 107006, 2020.
  • [14] J. W. Z. Lau, K. H. Lim, H. Shrotriya, and L. C. Kwek, “Nisq computing: where are we and where do we go?” AAPPS bulletin, vol. 32, no. 1, p. 27, 2022.
  • [15] J. Preskill, “Quantum computing in the nisq era and beyond,” Quantum, vol. 2, p. 79, 2018.
  • [16] S. Resch, A. Gutierrez, J. S. Huh, S. Bharadwaj, Y. Eckert, G. Loh, M. Oskin, and S. Tannu, “Accelerating variational quantum algorithms using circuit concurrency,” arXiv preprint arXiv:2109.01714, 2021.
  • [17] Z. Liang, H. Wang, J. Cheng, Y. Ding, H. Ren, Z. Gao, Z. Hu, D. S. Boning, X. Qian, S. Han et al., “Variational quantum pulse learning,” in 2022 IEEE International Conference on Quantum Computing and Engineering (QCE).   IEEE, 2022, pp. 556–565.
  • [18] Z. Liang, J. Cheng, H. Ren, H. Wang, F. Hua, Z. Song, Y. Ding, F. T. Chong, S. Han, X. Qian et al., “Napa: intermediate-level variational native-pulse ansatz for variational quantum algorithms,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024.
  • [19] H. Wang, Y. Liu, P. Liu, J. Gu, Z. Li, Z. Liang, J. Cheng, Y. Ding, X. Qian, Y. Shi et al., “Robuststate: Boosting fidelity of quantum state preparation via noise-aware variational training,” arXiv preprint arXiv:2311.16035, 2023.
  • [20] H. Wang, J. Gu, Y. Ding, Z. Li, F. T. Chong, D. Z. Pan, and S. Han, “Quantumnat: quantum noise-aware training with noise injection, quantization and normalization,” in Proceedings of the 59th ACM/IEEE design automation conference, 2022, pp. 1–6.
  • [21] B. Máté, B. L. Saux, and M. Henderson, “Beyond ansätze: Learning quantum circuits as unitary operators,” arXiv preprint arXiv:2203.00601, 2022.
  • [22] J. Von Neumann, Mathematical foundations of quantum mechanics: New edition.   Princeton university press, 2018, vol. 53.
  • [23] IBM, “IBM Quantum,” 2024. [Online]. Available: https://www.ibm.com/quantum
  • [24] Google Quantum AI, “Google Quantum Computer,” 2024. [Online]. Available: https://quantumai.google/quantumcomputer
  • [25] Amazon, “Amazon Braket,” 2024. [Online]. Available: https://aws.amazon.com/braket/
  • [26] A. Kitaev and J. Watrous, “Parallelization, amplification, and exponential time simulation of quantum interactive proof systems,” in Proceedings of the thirty-second annual ACM symposium on Theory of computing, 2000, pp. 608–617.
  • [27] A. J. Daley, I. Bloch, C. Kokail, S. Flannigan, N. Pearson, M. Troyer, and P. Zoller, “Practical quantum advantage in quantum simulation,” Nature, vol. 607, no. 7920, pp. 667–676, 2022.
  • [28] M. A. Nielsen and I. L. Chuang, Quantum computation and quantum information.   Cambridge university press Cambridge, 2001, vol. 2.
  • [29] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe, “Circuit-centric quantum classifiers,” Physical Review A, vol. 101, no. 3, p. 032308, 2020.
  • [30] X.-D. Cai, D. Wu, Z.-E. Su, M.-C. Chen, X.-L. Wang, L. Li, N.-L. Liu, C.-Y. Lu, and J.-W. Pan, “Entanglement-based machine learning on a quantum computer,” Physical review letters, vol. 114, no. 11, p. 110504, 2015.
  • [31] M. Weigold, J. Barzen, F. Leymann, and M. Salm, “Data encoding patterns for quantum computing,” in Proceedings of the 27th Conference on Pattern Languages of Programs, 2020, pp. 1–11.
  • [32] A. Krizhevsky et al., “CIFAR-10 (canadian institute for advanced research).” [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html