
QUACK: Quantum Aligned Centroid Kernel

Kilian Tscharke, Sebastian Issel, Pascal Debus
Quantum Security Technologies
Fraunhofer Institute for Applied and Integrated Security
Garching near Munich, Germany
⟨firstname⟩.⟨lastname⟩@aisec.fraunhofer.de
Abstract

Quantum computing (QC) shows potential for application in machine learning (ML). In particular, quantum kernel methods (QKM) exhibit promising properties for use in supervised ML tasks. However, a major disadvantage of kernel methods is their unfavorable quadratic scaling with the number of training samples. Together with the limits imposed by currently available quantum hardware (NISQ devices), with their low qubit coherence times, small numbers of qubits, and high error rates, the use of QC in ML at an industrially relevant scale is currently impossible. As a step towards improving the applicability of QKMs, we introduce QUACK, a quantum kernel algorithm whose time complexity scales linearly with the number of samples during training and is independent of the number of training samples in the inference stage. In the training process, only the kernel entries between the samples and the centers of the classes are calculated, i.e. the maximum shape of the kernel for n samples and c classes is (n, c). During training, the parameters of the quantum kernel and the positions of the centroids are optimized iteratively. In the inference stage, the circuit is evaluated only once per centroid for every new sample, i.e. c times. We show that the QUACK algorithm nevertheless provides satisfactory results and can perform at a similar level as classical kernel methods with quadratic scaling during training. In addition, our (simulated) algorithm is able to handle high-dimensional datasets such as MNIST with 784 features without any dimensionality reduction.

Index Terms:
quantum computing, machine learning, kernel methods, linear complexity
©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/QCE60285.2024.00169

I Introduction

Supervised Learning is an important branch of Machine Learning (ML) in which a model is trained on labeled data to predict the labels of new, unseen data. It encompasses two main types of tasks: classification, which predicts discrete labels or classes, and regression, which forecasts continuous values. Quantum Machine Learning (QML) is an emerging field at the intersection of Quantum Computing (QC) and ML with the goal of utilizing the potential advantages of QC - like superposition, entanglement, and the exponential size of the Hilbert space - for Machine Learning. In particular, Quantum Kernel Methods (QKM) have recently gained attention because of their ability to replace many supervised quantum models and their guarantee to find equally good or better quantum models than variational circuits [1]. In addition, theoretical results show that QKMs can handle classification problems that cannot be solved using classical ML techniques, such as classifying numbers based on the discrete logarithm [2]. However, a significant drawback of using (quantum) kernels is the quadratic time complexity of the kernel calculation, i.e. \mathcal{O}(n_{\text{train}}^{2}) for n_{\text{train}} training samples, since a kernel value must be estimated for each pair of samples. For the inference stage, the time complexity without using advanced techniques such as Support Vectors is \mathcal{O}(n_{\text{train}}n_{\text{predict}}). Estimating a quantum kernel for the - by classical ML standards very small - URL dataset [3] with around 36,000 samples requires approximately 10^{9} kernel value calculations. With the commonly used number of 1,000 shots, this involves 10^{12} circuit executions. A state-of-the-art IBM device with an Eagle r3 processor (e.g. ibm_sherbrooke [4]) achieves 5,000 CLOPS (circuit executions per second), and hence the execution time for calculating the kernel would be around 10^{8} seconds, or over 6 years.
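The runtime estimate above can be reproduced with a few lines of arithmetic; this is only a back-of-the-envelope sketch using the sample count, shot count, and CLOPS figure quoted in the text:

```python
# Back-of-the-envelope cost of a full quantum kernel for the URL dataset [3].
n_train = 36_000
kernel_values = n_train ** 2           # one entry per pair of samples, ~10^9
shots = 1_000                          # commonly used number of shots
executions = kernel_values * shots     # ~10^12 circuit executions
clops = 5_000                          # circuit executions per second (Eagle r3)
seconds = executions / clops           # ~10^8 seconds
years = seconds / (3600 * 24 * 365)    # well over 6 years
```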

Quantum Kernel Alignment (QKA) is a fascinating tool for QKMs that uses kernels with variational parameters which can be trained to align the kernel to the ideal kernel for a given dataset [5]. This could enable the use of a general quantum kernel architecture that can be trained for different datasets.

The remainder of this paper is structured as follows: The next subsection I-A gives an overview of related work in the fields of QKMs and QKA. The final part of the introduction lists our contributions (subsection I-B). In the following Background (section II), the fundamentals of supervised learning, quantum kernels, and QKA are introduced. Next, the Methods (section III) describe the implementation of the model and the Experiments (section IV) describe the numerical experiments carried out. In the Results and Discussion (section V) we present the results of our experiments and analyze them. Finally, Conclusion and Outlook (section VI) highlights the key results of this work and gives future research directions.

I-A Related Work

In 2021, Hubregtsen et al. [5] described the algorithm of QKA and defined the kernel-alignment measure. Moreover, they theoretically assessed the influence of noise on the algorithm and carried out numerical experiments on toy datasets, both in simulation and on real hardware.

In 2023, Gentinetta et al. [6] developed a Quantum Support Vector Machine (QSVM) whose quantum kernel is trained with QKA using the Pegasos algorithm. Unlike the default Support Vector Machine (SVM) implementation, their algorithm solves the primal formulation of the SVM, which results in a min-min optimization; hence the SVM weights and the kernel parameters can be optimized simultaneously, increasing the efficiency of the algorithm.

In the same year, Kölle et al. [7] introduced a one-class QSVM. By applying randomized measurements and the variable subsampling ensemble method, they reduced training and inference times compared to a default QSVM by up to 95% and 25%, respectively, while achieving a superior average precision compared to an SVM with Radial Basis Function (RBF) kernel.

Finally, in 2024 Bowles et al. [8] benchmarked 12 popular QML models on six binary classification tasks. They concluded that out-of-the-box classical ML models tend to outperform the QML models and that entanglement does not necessarily improve the models’ performance. Moreover, they noted that QML models both in simulations and on hardware can usually only handle input with size of the order of tens of features, and therefore classical pre-processing techniques such as principal component analysis are required to deal with higher dimensional data like the famous MNIST dataset. This classical pre-processing, however, influences the performance of the model and hence dilutes the results of a benchmark.

I-B Contributions

Our work aims to answer this question: Can the time complexity of QKMs be improved while still achieving satisfactory results?

We found a positive answer and report these contributions of our work:

  • We develop the Quantum Aligned Centroid Kernel (QUACK), a classifier based on quantum kernel alignment that improves the time complexity compared to basic kernel methods from \mathcal{O}(n_{\text{train}}^{2}) to \mathcal{O}(n_{\text{train}}) during training and from \mathcal{O}(n_{\text{train}}n_{\text{test}}) to \mathcal{O}(n_{\text{test}}) during testing.

  • We benchmark our classifier by evaluating it on eight different datasets with up to 784 features and different class ratios ranging from balanced to highly unbalanced.

  • Finally, we observe that QUACK performs on a similar level as a classical SVM with RBF kernel.

II Background

II-A Supervised Learning

Let \mathcal{X}\subset\mathbb{R}^{d} be the data space and \boldsymbol{x}=(x_{1},\ldots,x_{d})\in\mathcal{X} the feature vector of a single d-dimensional sample. Let \mathcal{Y} denote the target variable space and y\in\mathcal{Y} the target variable or label of a single sample. For the case of binary classification, we restrict the target variable to the set \{1,-1\}. Let further \Theta denote the space of model parameters. The general task of supervised machine learning is to train a parameterized model f_{\theta}:\mathcal{X}\times\Theta\to\mathcal{Y} such that it approximates a mapping between input \boldsymbol{x} and output \hat{y} based on the learned parameters \theta, as described in (1). During training, the parameters \theta are optimized such that the loss \mathcal{L}, quantifying the difference between the predicted output \hat{y} and the target y, is minimized, as in (2).

\hat{y}=f_{\theta}(x) (1)
\min_{\theta}\mathcal{L}(y,\hat{y}) (2)

II-B Quantum Kernels

Quantum kernels emerge as an important tool for encoding classical data into quantum systems and subsequently classifying the data. It is hoped that the unique properties of quantum computing, such as entanglement and superposition, which are utilized in quantum kernels, will enable them to be more powerful than classical kernels. This hypothesis is supported by theoretical results showing that a constructed classification problem based on the discrete logarithm can be efficiently solved by QKMs, but not by classical ML methods [2]. In general, the encoding in QKMs is achieved through unitary operations U(\boldsymbol{x}_{i}) that depend on individual data points \boldsymbol{x}_{i}, often implemented via Pauli rotations. The state of the system after the encoding is

\ket{\psi(\boldsymbol{x}_{i})}=\ket{\psi_{i}}=U(\boldsymbol{x}_{i})\ket{0}. (3)

Kernels are known from classical machine learning, where they are real- or complex-valued positive definite functions of two data points, i.e. \kappa:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{K}, where \mathbb{K}\in\{\mathbb{R},\mathbb{C}\}. This definition can be extended to the quantum case, where a kernel k between two pure data-encoding quantum states \psi_{i} and \psi_{j} is calculated from the fidelity between these states

k(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=F(\psi_{i},\psi_{j})=|\braket{\psi_{i}}{\psi_{j}}|^{2} (4)
=|\bra{0^{\otimes n}}U^{\dagger}(\boldsymbol{x}_{i})U(\boldsymbol{x}_{j})\ket{0^{\otimes n}}|^{2} (5)

with data encoding unitaries U(\boldsymbol{x}_{j}) and U^{\dagger}(\boldsymbol{x}_{i}), where U^{\dagger} denotes the conjugate transpose of U. This quantum kernel serves as a similarity measure between the states of two encoded samples: If both samples are identical, i.e. \boldsymbol{x}_{i}=\boldsymbol{x}_{j} and hence \psi_{i}=\psi_{j}, the kernel equation (4) simplifies to

k(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=k(\boldsymbol{x}_{i},\boldsymbol{x}_{i})=F(\psi_{i},\psi_{i})=|\braket{\psi_{i}}{\psi_{i}}|^{2}=1. (6)

On the other hand, if the encoded states \psi_{i} and \psi_{j} are orthogonal, the kernel will evaluate to

k(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=F(\psi_{i},\psi_{j})=|\braket{\psi_{i}}{\psi_{j}}|^{2}=0. (7)

A quantum kernel can be implemented as an n-qubit circuit that consists of a trainable unitary U(\boldsymbol{x}_{j}), encoding one sample, followed by the adjoint (conjugate transpose) U^{\dagger}(\boldsymbol{x}_{i}) for another sample, and a measurement of all qubits, as shown in Fig. 1. The kernel value of the two samples is then obtained as the probability of measuring the all-zero state, as given in equation (5). If a state vector simulator is used, the kernel value of two samples \boldsymbol{x}_{i} and \boldsymbol{x}_{j} is the fidelity of the states after application of the unitary U(\boldsymbol{x}_{i}), respectively U(\boldsymbol{x}_{j}), as given in (4).
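The fidelity kernel of equations (4), (6), and (7) can be sketched directly on simulated statevectors. The single-qubit RY encoding below is a toy assumption for illustration, not the paper's actual encoding circuit:

```python
import numpy as np

def fidelity_kernel(psi_i: np.ndarray, psi_j: np.ndarray) -> float:
    """Kernel entry k(x_i, x_j) = |<psi_i|psi_j>|^2, Eq. (4)."""
    return float(np.abs(np.vdot(psi_i, psi_j)) ** 2)

def encode(x: float) -> np.ndarray:
    """Toy single-qubit encoding U(x)|0> = RY(x)|0>."""
    return np.array([np.cos(x / 2), np.sin(x / 2)], dtype=complex)

# Identical samples give kernel value 1 (Eq. 6) ...
assert np.isclose(fidelity_kernel(encode(0.7), encode(0.7)), 1.0)
# ... while orthogonal encoded states give kernel value 0 (Eq. 7).
assert np.isclose(fidelity_kernel(encode(0.0), encode(np.pi)), 0.0)
```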


Figure 1: Architecture of the circuit if executed on hardware. The kernel entry K_{ij} for samples i and j is the probability of measuring the all-zero bit string.

II-C Trainable Quantum Kernels

A quantum kernel can contain not only parameters that encode the data into the circuit, but also adjustable parameters that affect the performance of the kernel for a particular dataset. This can, for example, be achieved by alternating layers of rotational gates whose parameters consist of either one or more features of the datum \boldsymbol{x} or some other variational parameter \boldsymbol{w}. These parameters \boldsymbol{w} can be optimized through Quantum Kernel Alignment (QKA), as explained in subsection II-D. For variational circuits, trainable encodings seem to be promising, since they were found to improve the robustness and generalization of the model [9, 10]. A trainable encoding embeds the datum \boldsymbol{x}, the variational parameter \boldsymbol{w}, and the bias \boldsymbol{b} in one parameter vector \boldsymbol{\theta}, where each entry of \boldsymbol{\theta} is a single parameter used in a rotational gate. The parameter vector \boldsymbol{\theta} is calculated as

\boldsymbol{\theta}=\boldsymbol{w}\circ\boldsymbol{x}+\boldsymbol{b}, (8)

analogous to the neurons of neural networks, where \circ is the element-wise product (Hadamard product).

II-D Quantum Kernel Alignment

QKA is a powerful tool that can be used to align a trainable kernel to the ideal kernel for a given dataset. It was originally developed for classical kernels [11] but can be used for quantum kernels as well. The implementation of QKA in this work is based on [5]. Kernel Alignment is used to optimize the kernel parameters for a specific task, improving the performance of kernel-based algorithms. The ideal kernel k^{*} is defined such that it always outputs the correct similarity between two data points:

k^{*}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=\begin{cases}1&\text{if }\boldsymbol{x}_{i}\text{ and }\boldsymbol{x}_{j}\text{ in same class}\\ -1&\text{if }\boldsymbol{x}_{i}\text{ and }\boldsymbol{x}_{j}\text{ in different classes}\end{cases} (9)

In general, this ideal kernel is not known, but for the training set the ideal kernel matrix K^{*} can be constructed from the labels, i.e. K^{*}_{ij}=y_{i}y_{j}, or in vectorized form

K^{*}=\boldsymbol{y}\boldsymbol{y}^{T}. (10)

The kernel-target alignment is a measure of the similarity between two kernels. To calculate it, we need the Frobenius inner product between two matrices as defined in (11).

\langle A,B\rangle_{F}=\sum_{ij}A_{ij}B_{ij}=\operatorname{Tr}\left\{A^{T}B\right\} (11)

With this, the kernel-target alignment \mathrm{TA} between the current kernel K and the ideal kernel K^{*} can be calculated as in (12).

\mathrm{TA}(K)=\frac{\langle K,K^{*}\rangle_{F}}{\sqrt{\langle K,K\rangle_{F}\langle K^{*},K^{*}\rangle_{F}}}=\frac{\sum_{ij}y_{i}y_{j}k\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)}{\sqrt{\left(\sum_{ij}k\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right)^{2}\right)\left(\sum_{ij}y_{i}^{2}y_{j}^{2}\right)}} (12)

The numerator of (12), \sum_{ij}y_{i}y_{j}k\left(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\right), is the kernel polarity. If two samples are in the same class, y_{i}y_{j}=1, the kernel value k(\boldsymbol{x}_{i},\boldsymbol{x}_{j}) increases the sum and hence the kernel-target alignment. For samples in different classes, y_{i}y_{j}=-1, the kernel polarity decreases by k(\boldsymbol{x}_{i},\boldsymbol{x}_{j}) and subsequently the kernel-target alignment decreases, too. The kernel-target alignment equals 1 if the matrices are perfectly aligned and -1 if they are perfectly misaligned, i.e. perfectly inversely correlated.
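The kernel-target alignment of (10)-(12) can be sketched in a few lines (NumPy used here purely for illustration; the function name is ours, not from the paper's implementation):

```python
import numpy as np

def target_alignment(K: np.ndarray, y: np.ndarray) -> float:
    """Kernel-target alignment TA(K), Eq. (12), with ideal kernel K* = y y^T, Eq. (10)."""
    K_star = np.outer(y, y)
    polarity = np.sum(K * K_star)      # Frobenius inner product <K, K*>_F, Eq. (11)
    norm = np.sqrt(np.sum(K * K) * np.sum(K_star * K_star))
    return float(polarity / norm)

y = np.array([1, 1, -1, -1])
K_ideal = np.outer(y, y).astype(float)
assert np.isclose(target_alignment(K_ideal, y), 1.0)    # perfectly aligned
assert np.isclose(target_alignment(-K_ideal, y), -1.0)  # perfectly misaligned
```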

III Methodology

Driven by the need for a NISQ-compatible quantum classification algorithm that can handle data on an industrially relevant scale, we developed QUACK, a linear complexity algorithm for supervised classification based on quantum kernel alignment. QUACK is motivated by the desire to find a quantum kernel algorithm that avoids the calculation of the pairwise distances between the samples. Instead, our algorithm optimizes the distance using centroids as a proxy for each class with labels l\in\{1,-1\}. The centroids are initialized as the means of the classes in the original input data space. However, during training of the embedding map, the initial centroids cease to represent the centers of the classes and we need to update them. This results in a two-step alternating training procedure, where we iteratively optimize the parameters of the embedding map, followed by the position of one of the centroids. Since we do not want to store the centroids as vectors in the 2^{n}-dimensional embedding space, we optimize the preimage of the centroids in embedding space, i.e. their positions in data space. Additionally, we alternate the class of the centroid to be optimized in each iteration.

The working principle of the algorithm is illustrated in Fig. 2. The centroids are initialized as the mean of each class in data space and initially, the embedding map performs a random embedding of the data in Hilbert space, as shown in the first figure. The first step of the algorithm, Kernel Alignment Optimization (KAO) iteration 1 for class 1 optimizes the parameters of the embedding map such that the distances between the samples of class 1 and centroid 1 are minimized, and the distances between the samples of class -1 and centroid 1 are maximized. This results in a new embedding map which is shown in the second figure. Next, the Centroid Optimization (CO) iteration 1 class -1 optimizes the position in data space of the centroid of the other class (class -1) with the aim of minimizing the distances between the centroid and the samples of class -1 and maximizing the distances between centroid -1 and the samples of class 1. The result of the CO step is shown in the third figure. After this, the first epoch of the QUACK algorithm is complete, and a new iteration of the two step process starts. For the second iteration of the KAO, the embedding map is optimized such that the distances between the samples of class -1 and centroid -1 are minimal, and the distances between the samples of class 1 and centroid -1 are maximal. The resulting new embedding space is shown in the fourth figure. Next, the second iteration of the CO is carried out, where the position of centroid 1 is optimized in data space, followed by the next epoch of the two step process and so on.


Figure 2: Alternating optimization procedure for QUACK. The blue (green) dots show samples of class 1 (-1) in the embedding space \Phi. The superscript defines the current epoch and the subscript the dimension. The diamonds represent the centroids of the classes, where the superscript is the epoch in which the centroid was last optimized and the subscript is the centroid class. For the KAO and CO steps, the superscript gives the epoch and the subscript the class for which the optimization is carried out, with + representing class 1 and - representing class -1.

The parameterized circuit with trainable encoding used for data encoding is described in subsubsection III-A1 Circuit Design and Data Encoding. The two-step optimization process is explained in subsubsection III-A2 Kernel Alignment Optimization and subsubsection III-A3 Centroid Optimization. The Prediction Stage - where the kernel entries of a new sample with the centroids are estimated and the sample is given the label of the class whose centroid yields the maximal kernel entry - is explained in subsubsection III-A4. The final part of this section, subsection III-B, briefly sketches how our state vector simulator works.

III-A QUACK

The QUACK training algorithm is sketched in algorithm 1 and can be summarized as follows: The algorithm estimates a quantum kernel for the train samples X and the current working centroid \boldsymbol{c}_{l}\in\{\boldsymbol{c}_{-1},\boldsymbol{c}_{1}\}. Each of the n_{\text{epochs}} training epochs consists of a two-step optimization iterating between n_{\text{KAO}} epochs of optimizing the model parameters \boldsymbol{w},\boldsymbol{b} and n_{\text{CO}} epochs of optimizing the centroids \boldsymbol{c}_{-1},\boldsymbol{c}_{1}. For predicting new data, the kernel values of the new data X_{\text{predict}} and both centroids are calculated, and each sample is given the label of the centroid with the higher kernel entry. In the following, the different parts of the algorithm are described in more detail.

Algorithm 1 QUACK training
1: Input: initial guess for \boldsymbol{c}_{-1},\boldsymbol{c}_{1}
2: l \leftarrow random bit \cdot 2 - 1
3: Repeat n_{\text{epochs}} times:
4:     Repeat n_{\text{KAO}} times:
5:         L_{\text{KAO}} \leftarrow \mathcal{L}_{\text{KAO}}(X,\boldsymbol{y},\boldsymbol{c}_{l},\boldsymbol{w},\boldsymbol{b})
6:         optimize model parameters \boldsymbol{w},\boldsymbol{b}
7:     l \leftarrow -l
8:     Repeat n_{\text{CO}} times:
9:         L_{\text{CO}} \leftarrow \mathcal{L}_{\text{CO}}(X,\boldsymbol{y},\boldsymbol{c}_{l},\boldsymbol{w},\boldsymbol{b})
10:        optimize \boldsymbol{c}_{l}

III-A1 Circuit Design and Data Encoding

We use a trainable encoding map that was found to yield robustness and generalization improvements over fixed encodings [9]. In this context, trainable encoding refers to encodings where the parameters of the gates depend on both trainable weights and features of the data. How exactly the gate parameters are composed is defined below.

The data encoding unitary U(\boldsymbol{x}_{j}) consists of m^{\prime}=m+1 layers of a unitary U_{m}(\boldsymbol{\theta}_{m}), as in Fig. 3. For clarification, if we have e.g. m^{\prime}=5 layers, the first layer is U_{0}(\boldsymbol{\theta}_{0}) and the last layer is U_{4}(\boldsymbol{\theta}_{4}). Each of the unitaries U_{m}(\boldsymbol{\theta}_{m}) is built using a layer of rotation gates and a ring of CNOT gates, as sketched in Fig. 4. The rotation gate is the general parameterized rotation gate [12] with the matrix representation shown in (15).

R(\theta_{m,i},\theta_{m,i+1},\theta_{m,i+2})=R(\phi,\theta,\omega)=RZ(\omega)RY(\theta)RZ(\phi)=
=\begin{bmatrix}e^{-i(\phi+\omega)/2}\cos(\theta/2)&-e^{i(\phi-\omega)/2}\sin(\theta/2)\\ e^{-i(\phi-\omega)/2}\sin(\theta/2)&e^{i(\phi+\omega)/2}\cos(\theta/2)\end{bmatrix} (15)

Each parameter \theta_{m,i} is calculated from the k-th feature of the sample \boldsymbol{x}_{j}, weight w_{m,i}, and bias b_{m,i} as given in (16), where k is a repeating counter over the number of input dimensions d, i.e. k=(3nm+i)\operatorname{mod}d for the n-qubit system, the m-th layer, and the i-th parameter in the layer.

\theta_{m,i}=w_{m,i}\cdot x_{j,k}+b_{m,i} (16)
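Under this indexing convention, the gate parameters of one layer can be assembled as follows (a sketch; the function and variable names are ours, not from the reference implementation):

```python
import numpy as np

def layer_parameters(x, w, b, n_qubits, m):
    """Gate parameters theta_{m,i} = w_{m,i} * x_{j,k} + b_{m,i}, Eq. (16),
    with the repeating feature counter k = (3*n*m + i) mod d."""
    d = len(x)
    theta = np.empty(3 * n_qubits)  # three rotation angles per qubit, Eq. (15)
    for i in range(3 * n_qubits):
        k = (3 * n_qubits * m + i) % d
        theta[i] = w[i] * x[k] + b[i]
    return theta

# Two features on two qubits: the features repeat across the six angles.
x = np.array([0.2, 0.5])
theta = layer_parameters(x, np.ones(6), np.zeros(6), n_qubits=2, m=0)
assert np.allclose(theta, [0.2, 0.5, 0.2, 0.5, 0.2, 0.5])
```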


Figure 3: Architecture of the encoding unitary U.


Figure 4: Architecture of the unitary of a single parameterized layer U_{m}(\boldsymbol{\theta}_{m}) with n^{\prime}=n-1.

III-A2 Kernel Alignment Optimization

During the Kernel Alignment Optimization, the parameter vectors \boldsymbol{w} and \boldsymbol{b} of the embedding map are optimized. This is achieved by estimating the kernel between the train samples and one centroid and comparing it to the ideal kernel to obtain the kernel alignment. Since we use only one centroid to calculate the kernel, the kernel matrix reduces to an n_{\text{samples}}-dimensional vector. The ideal kernel is simply the label vector \boldsymbol{y} if the centroid class is 1 and -\boldsymbol{y} if the centroid class is -1. The current kernel entries are the fidelities between the current centroid \boldsymbol{c}_{l} and the samples, where l\in\{-1,1\} defines the class label of the current centroid:

k(\boldsymbol{c}_{l},\boldsymbol{x}_{i})=|\braket{\psi_{l}}{\psi_{i}}|^{2} (17)

To get the kernel alignment, equation (12) is adapted for vectors as kernels and we obtain:

\mathrm{TA}=\frac{l\cdot\sum_{i}y_{i}k\left(\boldsymbol{c}_{l},\boldsymbol{x}_{i}\right)}{\sqrt{\left(\sum_{i}k\left(\boldsymbol{c}_{l},\boldsymbol{x}_{i}\right)^{2}\right)\sum_{i}y_{i}^{2}}} (18)

The loss function f_{\boldsymbol{c}_{l}} is derived from the kernel-target alignment, with an additional regularization term with regularization parameter \lambda_{\text{KAO}}. Note that in this loss function, the centroid \boldsymbol{c}_{l} is fixed.

\mathcal{L}_{\text{KAO}}=f_{\boldsymbol{c}_{l}}(\boldsymbol{w},\boldsymbol{b})=1-\mathrm{TA}+\lambda_{\text{KAO}}||\boldsymbol{w}||^{2}_{2} (19)

This loss function is then used to optimize the kernel parameters \boldsymbol{w} and \boldsymbol{b}, either through backpropagation if the circuits are executed on a simulator or the parameter-shift rule if real hardware is used, by solving this minimization problem:

\min_{\boldsymbol{w},\boldsymbol{b}}f_{\boldsymbol{c}_{l}}(\boldsymbol{w},\boldsymbol{b}) (20)
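A minimal sketch of the KAO loss (18)-(20) for the vector-shaped kernel, written in NumPy for illustration (the paper optimizes this loss with PyTorch backpropagation; all names here are ours):

```python
import numpy as np

def kao_loss(k_vec, y, l, w, lam_kao=0.0):
    """KAO loss, Eq. (19): 1 - TA + lambda_KAO * ||w||_2^2,
    with the vector-kernel alignment TA of Eq. (18)."""
    ta = l * np.sum(y * k_vec) / np.sqrt(np.sum(k_vec ** 2) * np.sum(y ** 2))
    return 1.0 - ta + lam_kao * np.sum(w ** 2)

# Centroid of class l = 1: same-class samples have fidelity 1, others 0.
y = np.array([1, 1, -1, -1])
k_vec = np.array([1.0, 1.0, 0.0, 0.0])
loss = kao_loss(k_vec, y, l=1, w=np.zeros(3))
# Fidelities lie in [0, 1], so TA cannot reach 1 here and the loss stays positive.
assert 0.0 < loss < 1.0
```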

III-A3 Centroid Optimization

The Centroid Optimization optimizes the position of the current centroid \boldsymbol{c}_{l} in data space. For this, the kernel alignment is calculated the same way as in the KAO step and then converted to a loss function g_{\boldsymbol{w},\boldsymbol{b}}, in which the parameters \boldsymbol{w} and \boldsymbol{b} of the embedding map are fixed:

\mathcal{L}_{\text{CO}}=g_{\boldsymbol{w},\boldsymbol{b}}(\boldsymbol{c}_{l})=1-\mathrm{TA}+\lambda_{\text{CO}}R (21)

Since the features are normalized to the range (0,1), the regularization term R introduces a penalty if the centroid position in data space is outside the normalization range:

R=\sum_{d}\left(\max(\boldsymbol{c}^{d}_{l}-1,0)-\min(\boldsymbol{c}^{d}_{l},0)\right), (22)

where \boldsymbol{c}^{d}_{l} is the d-th entry of the vector \boldsymbol{c}_{l}. This loss is then used to optimize the position of the working centroid \boldsymbol{c}_{l} in data space, by solving the following minimization problem:

\min_{\boldsymbol{c}_{l}}g_{\boldsymbol{w},\boldsymbol{b}}(\boldsymbol{c}_{l}) (23)
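The out-of-range penalty of Eq. (22) in particular reduces to elementwise clipping (a NumPy sketch with our own naming, not the paper's code):

```python
import numpy as np

def centroid_penalty(c: np.ndarray) -> float:
    """Regularization R, Eq. (22): penalize centroid entries outside (0, 1)."""
    return float(np.sum(np.maximum(c - 1.0, 0.0) - np.minimum(c, 0.0)))

assert centroid_penalty(np.array([0.2, 0.9])) == 0.0               # inside the range
assert np.isclose(centroid_penalty(np.array([-0.5, 1.25])), 0.75)  # 0.5 below + 0.25 above
```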

III-A4 Prediction Stage

After the training is complete, the labels of new samples can be predicted. For this, the kernel K_{\text{predict}} of shape (n_{\text{samples}},2) between the new samples X_{\text{predict}} and both centroids \boldsymbol{c}_{-1} and \boldsymbol{c}_{1} is calculated. Each sample is given a label according to (24), where K_{i,1} represents the kernel value for sample i and centroid 1, and K_{i,-1} the value for sample i and centroid -1.

\hat{y}_{i}=\operatorname{sign}(K_{i,1}-K_{i,-1}) (24)
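The prediction rule (24) is a sign comparison over the two kernel columns; in the sketch below the column order of K_predict (centroid 1 first) is our assumption:

```python
import numpy as np

def predict(K_predict: np.ndarray) -> np.ndarray:
    """Labels via Eq. (24); column 0 holds k(c_1, x_i), column 1 holds k(c_-1, x_i)."""
    return np.sign(K_predict[:, 0] - K_predict[:, 1])

K = np.array([[0.9, 0.1],   # closer to centroid 1  -> label  1
              [0.2, 0.8]])  # closer to centroid -1 -> label -1
assert np.array_equal(predict(K), np.array([1.0, -1.0]))
```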

III-B State Vector Simulator

To speed up the simulations, a state vector simulator is implemented using PyTorch [13]. The simulator builds on the nn.Module class, where the forward function returns the states of shape (n_{\text{samples}},2^{n_{\text{qubits}}}) after applying all gates. The main advantage of the simulator is that the unitary of each gate is applied to all n_{\text{samples}} states in one operation, making the execution of the circuits for multiple samples faster compared to other commonly used simulators.
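The batching idea can be sketched as follows: reshape the batch of statevectors so the target qubit becomes one tensor axis, then apply the 2x2 gate to that axis for all samples in a single matrix product. NumPy is used here purely for illustration; the paper's simulator uses PyTorch, and the qubit-ordering convention below is our assumption:

```python
import numpy as np

def apply_single_qubit_gate(states, gate, qubit, n_qubits):
    """Apply a 2x2 gate to `qubit` for a whole batch of statevectors at once.

    states: (n_samples, 2**n_qubits) complex array; qubit 0 is taken as the
    most significant bit of the basis-state index."""
    psi = states.reshape((states.shape[0],) + (2,) * n_qubits)
    psi = np.moveaxis(psi, 1 + qubit, -1)  # bring the target-qubit axis to the end
    psi = psi @ gate.T                     # one batched matrix product for all samples
    psi = np.moveaxis(psi, -1, 1 + qubit)
    return psi.reshape(states.shape)

# Flip the second qubit of |00> and |10> with a Pauli-X gate: gives |01> and |11>.
X = np.array([[0, 1], [1, 0]], dtype=complex)
states = np.eye(4, dtype=complex)[[0, 2]]
out = apply_single_qubit_gate(states, X, qubit=1, n_qubits=2)
assert np.allclose(out, np.eye(4)[[1, 3]])
```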

IV Experiments

Our linear time complexity QUACK algorithm is benchmarked on eight binary datasets from different areas and with various numbers of features and class ratios. The performance of our model is compared to three other models, comprising both classical and quantum approaches. The source code can be found in a public code repository (https://github.com/Fraunhofer-AISEC/QUACK/tree/v1).

IV-A Datasets

The model is benchmarked on eight datasets from different areas, including IT security and handwritten digits. An overview of the datasets is given in table I. The number of features varies between 14 for the Census dataset and 784 for the image datasets. The share of the smaller class in the total dataset varies between 0.09 (0.07/0.09) and 0.50 (0.49/0.49) for the train (validation/test) set, meaning there are both balanced and highly unbalanced datasets. For the datasets that have not been pre-split into test and train sets, the train (test) set is created by randomly selecting 70% (30%) of the samples from the dataset. Next, 1,000 samples are randomly selected from the training set for training, 400 from the test set for validation, and a further 400 from the test set for the final testing. The class labels are \{1,-1\} according to the criteria specified in table I.

TABLE I: Overview of the datasets used in the benchmark. The ratios show the ratio of the minority class to the number of samples in the set.
Dataset Ref. Description Class 1 Class -1 Ratio Train Ratio Val. Ratio Test Features
Census [14] Income \leq 50K >50K 0.28 0.26 0.23 14
CoverT [15] Forest tree types 4 >4 0.09 0.07 0.09 15
DoH [16] Network traffic Benign Malicious 0.24 0.23 0.21 33
EMNIST [17] Handwritten letters A-M N-Z 0.50 0.48 0.48 784
FMNIST [18] Clothing types 0-4 5-9 0.50 0.49 0.49 784
KDD [19] Network intrusion Normal Anomalous 0.50 0.44 0.47 42
MNIST [20] Handwritten digits 0-4 5-9 0.48 0.47 0.49 784
URL [21] URLs Benign Non-benign 0.15 0.14 0.15 79

IV-B Models for Benchmarking

For a comprehensive evaluation of our model, it is benchmarked against these three other approaches: 1) A vanilla SVM with RBF kernel and default parameters, implemented with scikit-learn [22]. This model has a quadratic time complexity during training and a linear one during testing. 2) An RBF centroid classifier. This classifier first determines the mean of each class in feature space and then calculates an RBF kernel of shape (n_{\text{samples}},2) that contains the kernel values between the samples and the means of both classes. Each sample is given the label of the class for which the kernel value between the sample and the respective centroid is larger. This classifier has a linear time complexity and does not require any training. 3) A Quantum Support Vector Machine that uses the trained kernel parameters from our approach. However, the QSVM determines the kernels in their quadratic form (n_{\text{train}},n_{\text{train}}) for the training of the SVM parameters and of shape (n_{\text{test}},n_{\text{train}}) for testing. The SVM hyperparameters are the default ones from scikit-learn.

IV-C Implementation Details

All models are trained three times with different random seeds, and the mean results with standard deviation are reported. The hyperparameters used for the QUACK algorithm are shown in tables II and III in Appendix A. Hyperparameter tuning is performed via a randomized grid search for each dataset, with the validation set used to evaluate the model. The numbers of layers and qubits were selected such that each feature is encoded at least once while the model can still be simulated in a reasonable time.
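A randomized grid search of this kind can be sketched as follows. The search space below only illustrates the value ranges reported in tables II and III; the actual grids and trial budget are assumptions, and train_and_validate is a placeholder for training QUACK and scoring it on the validation set:

```python
import random

# Illustrative search space (assumed; actual grids are not specified)
SEARCH_SPACE = {
    "lr_kao": [0.5, 1.0, 5.0],
    "lr_co": [0.1, 0.5, 1.0, 5.0],
    "r_decay": [0.8, 0.9],
    "reg_kao": [1e-4, 1e-3],
    "reg_co": [1e-3],
}

def randomized_grid_search(train_and_validate, n_trials=20, seed=42):
    """Sample random configurations from the grid and keep the one with
    the best validation score."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        score = train_and_validate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```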

IV-D Verification of the Simulator

To make sure that the state vector simulator works as intended, a small model is trained on both our simulator and PennyLane’s [23] default.qubit simulator with identical hyperparameters. After training, the weights, biases, loss, and metrics of both models are identical, from which we conclude that our simulator works as intended.

V Results and Discussion

The newly introduced QUACK algorithm, with its linear time complexity, is benchmarked together with three other models on eight binary datasets from different areas with various numbers of features and class ratios. Each model is run three times on each dataset with different random seeds.

V-A Model Performance

Figure 5 shows the average test area under the ROC curve (AUC) for each model and dataset; the values are listed in table IV in Appendix B. Our model performs on par or nearly on par with the SVM RBF on five out of eight datasets. More precisely, for CoverT, DoH, FMNIST, KDD, and MNIST, the difference in AUC is at most 0.02. The performance gap is largest on EMNIST with an AUC difference of 0.06, followed by Census and URL with 0.03. Fig. 7 shows the test AUCs of the best run of each model. The main difference from Fig. 5 is that the best QUACK run additionally matches the SVM on the Census dataset. From this, we conclude that QUACK performs on a similar level as the SVM and may be a reasonable alternative to it.

The QSVM, which uses the trained weights from our model to compute the full kernel, achieves very similar AUCs to our model, with the largest difference being 0.02 in either direction. This suggests that once the kernel training is completed and the kernel parameters are fixed, the SVM training and inference methods do not notably improve the model’s performance.

The RBF centroid classifier is the worst model on all datasets, which is intuitive since it does not require any training at all. It is surprising, however, that this classifier comes relatively close to the performance of the other models on the DoH and KDD datasets. Together with the particularly good performance of the other models on these two datasets, this suggests that DoH and KDD are relatively easy datasets for binary classification. This observation is consistent with the findings of other authors, who also reported high AUCs on these two datasets [24, 25].

Finally, the observed standard deviation for QUACK is consistently low across datasets, at 0.01 or below, except for Census and CoverT, which are notable outliers with standard deviations of 0.04 and 0.02, respectively. From this, we conclude that QUACK’s performance is largely independent of the initialization of the trainable parameters, but not entirely stable. The QSVM shows a similar standard deviation as QUACK, which was expected since both algorithms use the same optimized weights and biases. The SVM RBF and RBF centroid classifier exhibit no standard deviation, as they are deterministic algorithms.


Figure 5: Test AUCs of the different models.

V-B Number of Circuit Evaluations

We compare the number of circuit evaluations required during the training of our model with that of a standard kernel method. All circuit evaluations that result from evaluating the models during training are ignored. Fig. 6 shows the number of circuit evaluations over the number of train samples for the number of epochs used throughout the numerical experiments (see table III in Appendix A). The QUACK algorithm scales linearly with the number of training samples n_{\text{train}}: the number of circuit evaluations is N_{\text{QUACK}} = n_{\text{epochs}} \cdot (n_{\text{KAO}} + n_{\text{CO}}) \cdot n_{\text{train}}, where n_{\text{epochs}} is the number of two-step iterations performed, and n_{\text{KAO}} and n_{\text{CO}} are the numbers of kernel alignment optimization steps and centroid optimization steps, respectively. A standard kernel, on the other hand, requires a quadratic number of circuit evaluations, N_{\text{standard kernel}} = n_{\text{epochs}} \cdot n^{2}_{\text{train}}. As soon as the number of samples exceeds the sum of the numbers of kernel alignment and centroid optimization steps, i.e. n_{\text{train}} > n_{\text{KAO}} + n_{\text{CO}}, QUACK requires fewer circuit evaluations than the default kernel. As the sample size grows further, the evaluation count for QUACK grows only linearly while that of the standard kernel grows quadratically.
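The two evaluation counts can be written down directly (a sketch with the hyperparameter values from table III as defaults: 40 two-step epochs and 10 steps each for kernel alignment and centroid optimization):

```python
def n_quack(n_train, n_epochs=40, n_kao=10, n_co=10):
    # linear in the number of training samples
    return n_epochs * (n_kao + n_co) * n_train

def n_standard(n_train, n_epochs=40):
    # quadratic in the number of training samples
    return n_epochs * n_train ** 2
```

With these defaults, the crossover occurs at n_train > 20; at n_train = 1000, QUACK needs 8 * 10^5 circuit evaluations versus 4 * 10^7 for the standard kernel, a factor of 50.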


Figure 6: Comparison of the number of evaluation steps between QUACK and a standard kernel. The inset shows a zoom-in of the plot for the number of train samples in the range from 0 to 30.

VI Conclusion and Outlook

We developed QUACK, a classifier based on quantum kernel alignment that improves the time complexity compared to basic kernel methods from \mathcal{O}(n^{2}_{\text{train}}) to \mathcal{O}(n_{\text{train}}) during training and from \mathcal{O}(n_{\text{train}} n_{\text{test}}) to \mathcal{O}(n_{\text{test}}) during testing. QUACK’s training time complexity is a polynomial improvement compared to the SVM. The algorithm was benchmarked by evaluating it on eight different datasets with up to 784 features and various class ratios ranging from balanced to highly unbalanced. The performance was compared to a vanilla SVM with an RBF kernel, an RBF centroid classifier, and a QSVM. We conclude that QUACK performs on a similar level as the classical SVM and that the training of the SVM parameters does not improve the predictions of the model once the kernel parameters are trained. Finally, our algorithm works on data with up to 784 features without dimensionality reduction, which is often required for other state-of-the-art QML models.

Thanks to its linear scaling, QUACK can serve as a quick baseline for future classification tasks: if its performance is satisfactory, using more costly classification algorithms offers no advantage. If, however, QUACK performs poorly, the application of more costly algorithms, e.g. the quadratically scaling (Q)SVM, should be considered. Furthermore, there is an intuitive explanation for the classification results of the algorithm: if QUACK performs well, it has found an encoding circuit and centroids that separate the classes into different clusters around these centroids in Hilbert space.

On the other hand, QUACK is limited by the assumption that there exists an embedding in which each class forms a cluster around a distinct centroid. In addition, the performance of QUACK depends on the choice of hyperparameters, and extensive tuning of these hyperparameters is strongly recommended.

This work is only a first effort toward increasing the potential of QKMs and the next logical step is to benchmark QUACK on hardware. Further work can extend the algorithm to perform multi-class classification by using kk centroids for kk classes and running a one-versus-all classification in each iteration of the algorithm. In addition, the stability of the algorithm should be improved to make its performance less dependent on the choice of hyperparameters.
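For the proposed multi-class extension, the prediction step with k centroids could look as follows (a hedged sketch: kernel_fn stands in for the trained quantum kernel, and the one-versus-all training loop is omitted):

```python
import numpy as np

def multiclass_centroid_predict(kernel_fn, X_new, centroids):
    """Predict the class whose centroid has the largest kernel value.

    kernel_fn(X, c) is a placeholder for the trained quantum kernel and
    must return the kernel values between all rows of X and centroid c.
    """
    # shape (n_samples, k): kernel value between each sample and each centroid
    K = np.stack([kernel_fn(X_new, c) for c in centroids], axis=1)
    return np.argmax(K, axis=1)  # index of the best-matching class
```

Inference still needs only k circuit evaluations per sample, so the independence from the number of training samples carries over to the multi-class case.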

Acknowledgment

This research is part of the Munich Quantum Valley, which is supported by the Bavarian state government with funds from the Hightech Agenda Bayern Plus.

References

  • [1] Maria Schuld “Supervised quantum machine learning models are kernel methods”, 2021 arXiv:2101.11020 [quant-ph]
  • [2] Yunchao Liu, Srinivasan Arunachalam and Kristan Temme “A rigorous and robust quantum speed-up in supervised machine learning” In Nature Physics 17 Nature Research, 2021, pp. 1013–1017 DOI: 10.1038/s41567-021-01287-z
  • [3] Mohammad Saiful Islam Mamun et al. “Detecting Malicious URLs Using Lexical Analysis” In Network and System Security Cham: Springer International Publishing, 2016, pp. 467–482
  • [4] “ibm_sherbrooke” Accessed: 2024-03-28, https://quantum.ibm.com/services/resources?system=ibm_sherbrooke
  • [5] Thomas Hubregtsen et al. “Training quantum embedding kernels on near-term quantum computers” In Phys. Rev. A 106 American Physical Society, 2022, pp. 042431 DOI: 10.1103/PhysRevA.106.042431
  • [6] Gian Gentinetta et al. “Quantum Kernel Alignment with Stochastic Gradient Descent” In 2023 IEEE International Conference on Quantum Computing and Engineering (QCE) IEEE, 2023 DOI: 10.1109/qce57702.2023.00036
  • [7] Michael Kölle et al. “Towards Efficient Quantum Anomaly Detection: One-Class SVMs using Variable Subsampling and Randomized Measurements”, 2023 arXiv:2312.09174 [quant-ph]
  • [8] Joseph Bowles, Shahnawaz Ahmed and Maria Schuld “Better than classical? The subtle art of benchmarking quantum machine learning models”, 2024 arXiv:2403.07059 [quant-ph]
  • [9] Julian Berberich et al. “Training robust and generalizable quantum models”, 2023 arXiv:2311.11871 [quant-ph]
  • [10] Ben Jaderberg et al. “Let quantum neural networks choose their own frequencies” In Phys. Rev. A 109 American Physical Society, 2024, pp. 042421 DOI: 10.1103/PhysRevA.109.042421
  • [11] Nello Cristianini, John Shawe-Taylor, André Elisseeff and Jaz Kandola “On Kernel-Target Alignment” In Advances in Neural Information Processing Systems 14 MIT Press, 2001 URL: https://proceedings.neurips.cc/paper_files/paper/2001/file/1f71e393b3809197ed66df836fe833e5-Paper.pdf
  • [12] Maria Schuld and Francesco Petruccione “Quantum Computing” In Machine Learning with Quantum Computers Cham: Springer International Publishing, 2021, pp. 79–146 DOI: 10.1007/978-3-030-83098-4_3
  • [13] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”, 2019 arXiv:1912.01703 [cs.LG]
  • [14] Ron Kohavi “Census Income” DOI: https://doi.org/10.24432/C5GP7S, UCI Machine Learning Repository, 1996
  • [15] Jock A. Blackard and Denis J. Dean “Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables” In Computers and Electronics in Agriculture 24.3, 1999, pp. 131–151 DOI: https://doi.org/10.1016/S0168-1699(99)00046-0
  • [16] Mohammadreza MontazeriShatoori, Logan Davidson, Gurdip Kaur and Arash Habibi Lashkari “Detection of DoH Tunnels using Time-series Classification of Encrypted Traffic” In 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 2020, pp. 63–70 DOI: 10.1109/DASC-PICom-CBDCom-CyberSciTech49142.2020.00026
  • [17] Gregory Cohen, Saeed Afshar, Jonathan Tapson and André Schaik “EMNIST: Extending MNIST to handwritten letters” In 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2921–2926 DOI: 10.1109/IJCNN.2017.7966217
  • [18] Han Xiao, Kashif Rasul and Roland Vollgraf “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms”, 2017 arXiv:1708.07747 [cs.LG]
  • [19] Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu and Ali A. Ghorbani “A detailed analysis of the KDD CUP 99 data set” In 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, 2009, pp. 1–6 DOI: 10.1109/CISDA.2009.5356528
  • [20] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner “Gradient-based learning applied to document recognition” In Proceedings of the IEEE 86.11, 1998, pp. 2278–2324 DOI: 10.1109/5.726791
  • [21] Mohammad Saiful Islam Mamun et al. “Detecting Malicious URLs Using Lexical Analysis” In Network and System Security Cham: Springer International Publishing, 2016, pp. 467–482
  • [22] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python” In Journal of Machine Learning Research 12, 2011, pp. 2825–2830
  • [23] Ville Bergholm et al. “PennyLane: Automatic differentiation of hybrid quantum-classical computations”, 2022 arXiv:1811.04968 [quant-ph]
  • [24] Jan Philipp Schulze, Philip Sperl and Konstantin Bottinger “Double-Adversarial Activation Anomaly Detection: Adversarial Autoencoders are Anomaly Generators” In Proceedings of the International Joint Conference on Neural Networks 2022-July Institute of Electrical and Electronics Engineers Inc., 2022 DOI: 10.1109/IJCNN55064.2022.9892896
  • [25] Jan-Philipp Schulze et al. “R2-AD2: Detecting Anomalies by Analysing the Raw Gradient”, 2022 URL: http://arxiv.org/abs/2206.10259

Appendix A Hyperparameters

TABLE II: Overview of the optimized hyperparameters for each dataset.
Dataset lr_{kao} lr_{co} r_{\text{decay}} reg_{kao} reg_{co}
Census 0.5 0.5 0.9 0.001 0.001
CoverT 0.5 0.1 0.9 0.001 0.001
DoH 0.5 0.5 0.9 0.001 0.001
EMNIST 1.0 5.0 0.9 0.001 0.001
FMNIST 5.0 0.5 0.8 0.0001 0.001
KDD 0.5 1.0 0.9 0.001 0.001
MNIST 5.0 1.0 0.9 0.001 0.001
URL 0.5 0.5 0.9 0.001 0.001
TABLE III: Overview of the hyperparameters shared by QUACK across all datasets. The numbers of epochs for the two-step training, kernel alignment optimization, and centroid optimization are given by n, n_{kao}, and n_{co}, respectively. init_weights_scale gives the maximum value for the weights during random initialization.
layers qubits n_{\text{train}} n_{\text{val}} n_{\text{test}} n n_{kao} n_{co} init_weights_scale seeds
53 5 1000 400 400 40 10 10 0.1 42, 123, 1234

Appendix B Detailed Results

TABLE IV: Overview of the AUCs of the models. Columns without a model prefix refer to QUACK.
Dataset train_auc val_auc test_auc qsvm_train_auc qsvm_val_auc qsvm_test_auc
Census 0.91 ± 0.03 0.85 ± 0.02 0.84 ± 0.04 0.91 ± 0.03 0.85 ± 0.02 0.84 ± 0.04
CoverT 0.85 ± 0.03 0.75 ± 0.04 0.84 ± 0.02 0.88 ± 0.02 0.73 ± 0.05 0.82 ± 0.01
DoH 0.98 ± 0.00 0.96 ± 0.00 0.96 ± 0.01 0.98 ± 0.00 0.98 ± 0.00 0.97 ± 0.00
EMNIST 0.99 ± 0.01 0.84 ± 0.01 0.81 ± 0.00 0.99 ± 0.00 0.84 ± 0.01 0.82 ± 0.01
FMNIST 0.98 ± 0.00 0.96 ± 0.00 0.95 ± 0.00 0.99 ± 0.00 0.95 ± 0.01 0.94 ± 0.00
KDD 1.00 ± 0.00 0.99 ± 0.00 0.99 ± 0.00 1.00 ± 0.00 0.99 ± 0.00 0.99 ± 0.00
MNIST 0.99 ± 0.00 0.95 ± 0.01 0.97 ± 0.00 1.00 ± 0.00 0.96 ± 0.01 0.98 ± 0.00
URL 0.93 ± 0.01 0.89 ± 0.02 0.95 ± 0.01 0.95 ± 0.01 0.92 ± 0.01 0.97 ± 0.00
Dataset svm_rbf_train_auc svm_rbf_val_auc svm_rbf_test_auc rbf_centroid_val_auc rbf_centroid_test_auc
Census 0.88 ± 0.00 0.87 ± 0.00 0.87 ± 0.00 0.73 ± 0.00 0.77 ± 0.00
CoverT 0.92 ± 0.00 0.79 ± 0.00 0.85 ± 0.00 0.63 ± 0.00 0.61 ± 0.00
DoH 0.98 ± 0.00 0.96 ± 0.00 0.96 ± 0.00 0.90 ± 0.00 0.90 ± 0.00
EMNIST 0.99 ± 0.00 0.89 ± 0.00 0.87 ± 0.00 0.50 ± 0.00 0.50 ± 0.00
FMNIST 0.99 ± 0.00 0.97 ± 0.00 0.97 ± 0.00 0.50 ± 0.00 0.50 ± 0.00
KDD 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 0.95 ± 0.00 0.96 ± 0.00
MNIST 1.00 ± 0.00 0.98 ± 0.00 0.98 ± 0.00 0.50 ± 0.00 0.50 ± 0.00
URL 0.97 ± 0.00 0.95 ± 0.00 0.98 ± 0.00 0.71 ± 0.00 0.79 ± 0.00


Figure 7: Test AUCs of the best run of each model for each dataset.