
Lean classical-quantum hybrid neural network model for image classification

Ao Liu 1, Cuihong Wen 1, Jieci Wang 2 (E-mail: [email protected]; [email protected])
1 College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
2 Department of Physics and Key Laboratory of Low-Dimensional Quantum Structures and Quantum Control of Ministry of Education, Hunan Normal University, Changsha 410081, China
Abstract

The integration of algorithms from quantum information with neural networks has enabled unprecedented advancements in various domains. Nonetheless, the application of quantum machine learning algorithms to image classification predominantly relies on traditional architectures such as variational quantum circuits, whose performance is closely tied to parameter scale: the substantial demand for parameters can strain computational resources and significantly increase computation time. In this paper, we introduce a Lean Classical-Quantum Hybrid Neural Network (LCQHNN), which achieves efficient classification performance with only four layers of variational circuits, thereby substantially reducing computational costs. We apply the LCQHNN to image classification tasks on public datasets and achieve a classification accuracy of 99.02% on the FashionMNIST dataset, a 5.07% improvement over traditional deep learning methods. Under the same parameter conditions, the method improves training convergence speed by 75% on MNIST and 70.59% on FashionMNIST. Furthermore, visualization studies show that the model effectively captures key data features during training and establishes a clear association between these features and their corresponding categories. This study confirms that employing quantum algorithms enhances the model's ability to handle complex classification problems.

1 Introduction

Neural Networks (NNs), as a vital machine learning tool, have achieved widespread success across various domains [1, 2, 3, 4, 5, 6]. Extensive research has demonstrated the formidable capabilities of neural networks in pattern recognition, feature extraction, and classification decision-making [7, 8, 9, 10]. However, the performance of traditional deep learning algorithms is intimately linked to the scale of the model and the size of the dataset. An increase in the scale of parameters can lead to constraints in computational resources and a significant escalation in computation time. Moreover, the optimization of hyperparameters presents a challenge, as there is currently no universally effective optimization method akin to gradient descent for hyperparameter optimization. When the parameter scale is large, the temporal cost of evaluating a set of hyperparameter configurations becomes exceedingly high. Recently, there has been a notable increase in interest regarding the integration of neural networks with quantum technologies [11, 12, 13, 14, 15, 16, 17]. Quantum algorithms possess a range of distinctive advantages that have the potential to significantly enhance both the performance and efficiency of traditional computational methods. By harnessing quantum parallelism, quantum-enhanced neural networks can navigate a broader search space within a reduced timeframe. Simultaneously, these networks can exploit quantum entanglement to establish correlations among complex data structures such as images, language representations, and time series data. Such correlations may facilitate more effective identification of underlying patterns within datasets by neural networks.

Variational quantum circuits (VQCs) are quantum algorithms designed to address optimization problems and machine learning tasks [18, 19, 20, 21]. The fundamental principle of VQCs is to identify the optimal solution by adjusting the parameters of the quantum circuit. The parameter count of VQCs is typically modest: VQCs are designed to determine the optimal solution by adjusting a limited number of quantum circuit parameters, rather than relying on a large parameter scale. By learning features undetectable by classical neural networks, VQCs can significantly enhance robustness against classical adversarial attacks, suggesting a potential quantum advantage in machine learning tasks [22, 23]. A quantum convolutional neural network (QCNN) has also been proposed for high-energy physics event classification [24], demonstrating faster convergence and higher test accuracy compared to CNNs [25]. Several studies have addressed issues such as gradient vanishing, robustness, and learnability [26, 27], thereby overcoming some of the inherent challenges. In recent years, several technology giants such as IBM, Google, Rigetti, Amazon Braket, and Microsoft Azure Quantum have deployed quantum computing hardware on cloud services. These machines are characterized as Noisy Intermediate-Scale Quantum (NISQ) devices. NISQ devices are typified by their possession of a moderate number of qubits and a susceptibility to errors due to the fragile nature of quantum states[28, 29, 30, 31, 32, 33, 34]. Despite these limitations, NISQ devices have demonstrated superior efficiency compared to classical computers in simulating complex quantum systems and performing certain types of computations[35, 36, 37, 38, 39].

On the other hand, although the participation of quantum algorithms can help models efficiently complete image classification tasks with limited computing resources, many difficulties and limitations remain in practical applications, mainly in the following two respects. First, the 'curse of dimensionality' is one of the main challenges in using quantum machine learning algorithms for image classification. Some image data have a huge spatial dimension, and as the data dimension increases, VQCs need to process more qubits and perform more quantum gate operations, which can lead to exponential growth in computation time and resources. This computational limitation makes building a quantum image classification model that meets practical application needs both time-consuming and expensive. Second, complex quantum systems are highly sensitive to noise, and increasing the depth of VQCs leads to more noise accumulation. The loss or corruption of quantum information can limit their performance in practical applications. Improving the efficiency of these algorithms while ensuring the stability and reliability of quantum computing in image classification tasks remains a challenge.

The development of effective quantum image classification models and the minimization of noise impact are critical research areas in machine learning. This paper introduces a lean classical-quantum hybrid neural network (LCQHNN) for image classification. We employ multi-channel convolution operators to extract nonlinear features from images, thereby reducing spatial dimensions while capturing essential data information. This enhances compatibility with current hardware capabilities. The streamlined four-layer variational quantum circuit (VQC) module in the present model reduces quantum circuit complexity and mitigates noise sensitivity, thereby improving stability and reliability. Experimental results on public datasets demonstrate that LCQHNN significantly outperforms convolutional neural networks (CNNs) in terms of convergence speed and classification accuracy. Specifically, it achieves up to 75% and 70.59% faster training convergence on the two datasets, while also achieving superior test set accuracy. In addition, we utilize Grad-CAM heat maps to visualize the prediction process of LCQHNN, revealing how it learns edge contours without prior knowledge. These visualizations enhance our understanding of the model's decision-making process and validate its robustness in image recognition tasks.

2 Methodology

2.1 Quantum Machine Learning Algorithms

Figure 1: Lean classical-quantum hybrid neural network structure for studying image classification. Firstly, feature extraction is performed using classical neural networks, and then the extracted features are passed on to VQCs for classification tasks. Finally, the quantum data is converted back to classical data through the measurement layer to obtain the final classification result.

We first set out our theoretical framework. The quantum machine learning algorithm proposed in this study is a classical-quantum-classical process. We position a classical convolutional neural network at the front end of the VQCs for feature extraction, reducing the spatial dimension to meet the qubit requirements of LCQHNN[40]. Although we reduce the spatial dimension of the image, the CNN maps the information that needs attention to a higher feature dimension by increasing the number of channels, which is used to extract hidden features and improve accuracy[41]. The features extracted by the CNN can therefore serve as inputs to the VQCs, at which point the classical data become quantum states. The VQCs are the core of our algorithm: analogous to the universal approximation theorem for artificial neural networks, a VQC represents a circuit that can fit the objective function. During training, the VQC repeatedly performs the feedforward process and compares the results with the true labels of the training data. By iteratively adjusting the parameters of the variational quantum circuit, the cost function, which measures the difference between the predicted and true labels, is minimized. Finally, the decoded measurement results provide the classification of the input data, at which point the data return from the quantum state to the classical state.

As shown in Figure 1, we construct the quantum algorithm model from five aspects to achieve image classification. Firstly, the feature-extraction module of LCQHNN consists of two convolutional layers and two pooling layers. The convolutional layers extract information to ensure reliable information transmission throughout the network architecture, while the pooling layers reduce the spatial dimension, decreasing the number of parameters and the computational complexity of subsequent layers and making the network more efficient. Secondly, to obtain nonlinear features, we add an activation layer composed of the ReLU function between each convolutional layer and pooling layer, realizing a nonlinear mapping of the feature vectors. Let $f$ be the input image, $g$ the convolution kernel, $(i,j)$ the coordinates in the input image, and $(m,n)$ the summation coordinates. The convolution result is then $(f*g)(i,j)=\sum_{m=-\frac{k-1}{2}}^{\frac{k-1}{2}}\sum_{n=-\frac{k-1}{2}}^{\frac{k-1}{2}}f(m,n)\,g(i-m,j-n)$, where $k$ is the size of the convolution kernel, set to 3 in this work. The ReLU function is $\max(0,x)$. Let $\text{Window}(i,j)$ denote a pooling window centered at $(i,j)$, over whose positions $(u,v)$ the rectified convolution output is evaluated, and let $P(i,j)$ be the value of the pooled output feature map at position $(i,j)$. The extracted feature is then given as follows[42].

P(i,j)=max(m,n)Window(i,j)max(0,m=k12k12n=k12k12f(m,n)g(im,jn))P(i,j)=\max_{(m,n)\in\text{Window}(i,j)}\max(0,\sum_{m=-\frac{k-1}{2}}^{\frac{k-1}{2}}\sum_{n=-\frac{k-1}{2}}^{\frac{k-1}{2}}f(m,n)g(i-m,j-n)) (1)

Thirdly, after feature extraction, we set up a transition module composed of fully connected networks, which reduces the dimensionality of the features from 256 to 4 to accommodate quantum computing constraints. Owing to parameter sharing in convolutional neural networks, neurons in the network may develop complex co-adaptation relationships. We therefore incorporate Dropout as a regularization technique into the module, which reduces co-adaptation among neurons by randomly discarding them. Intuitively, each application of Dropout generates a different "sparse" network; this is equivalent to training many different networks and averaging their predictions, which usually helps prevent the network from overfitting the training data.
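A minimal sketch of this transition module follows, assuming the flattened extractor output is the 256-dimensional vector stated above; the dropout rate is an assumption, as the paper does not report it.

```python
import torch.nn as nn

# Dropout breaks co-adaptation by randomly zeroing neurons; the fully
# connected layer maps the 256 features down to 4 values, one per qubit.
transition = nn.Sequential(
    nn.Flatten(),          # flatten the extracted feature maps
    nn.Dropout(p=0.5),     # assumed rate; each pass samples a "sparse" net
    nn.Linear(256, 4),     # 256-dimensional features -> 4 VQC inputs
)
```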

Fourthly, we use quantum encoding to transform the 4-dimensional feature vector into a four-qubit quantum state, followed by a series of carefully designed single-qubit operations. Finally, by measuring the transition of the quantum states back to classical states, the classification results are obtained. We provide a detailed description of the quantum algorithm in Section 2.2.
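Putting the five aspects together, a hedged end-to-end sketch of the pipeline follows. It reuses the feature extractor and transition module sketched above and a `quantum_layer` wrapper of the VQC (sketched in Section 3.1.1); the final classical read-out layer mapping the measurement result to class scores is our assumption.

```python
import torch.nn as nn

# A hedged assembly of the LCQHNN pipeline: classical feature extraction,
# the 256 -> 4 transition, the quantum layer, and an assumed classical
# read-out from the measured expectation value to class scores.
class LCQHNN(nn.Module):
    def __init__(self, feature_extractor, transition, quantum_layer, n_classes=2):
        super().__init__()
        self.features = feature_extractor
        self.transition = transition
        self.quantum = quantum_layer
        self.readout = nn.Linear(1, n_classes)  # assumed read-out layer

    def forward(self, x):
        x = self.features(x)     # classical state: feature maps
        x = self.transition(x)   # 4 encoding angles for the VQC
        x = self.quantum(x)      # quantum evolution + measurement
        return self.readout(x)   # back to classical class scores
```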

2.2 Variational Quantum Circuit Architecture

We employ quantum encoding techniques to process classical data[43]. Within the quantum deep learning model, the single-qubit unitary layer serves as the core structure for quantum state transformation, comprising a series of meticulously designed single-qubit operations[44]. These operations include, but are not limited to, rotation gates, phase gates, and Hadamard gates, which collectively act on qubits to facilitate efficient processing and learning of quantum data. In the encoding layer of LCQHNN, we introduce Hadamard (H) gates and U1 gates for each qubit[45]. By appropriately selecting the rotation angle of the U1 gate in conjunction with the superposition characteristics of the H gate, we can effectively adjust and manipulate individual qubit states. Additionally, our model incorporates the controlled-NOT (CNOT) gate, a key multi-qubit operator that enables non-local state transfer through interactions between control and target qubits[46]. A notable feature of the CNOT gate is its parameter-free operation: it relies solely on the current states of the two qubits. When these qubits are in a superposition state, applying the CNOT gate entangles them, introducing powerful nonlinear characteristics into the quantum computation. Through this encoding structure, depicted in Figure 2, the LCQHNN not only achieves efficient quantum encoding of classical data but also fully leverages quantum-mechanical principles such as superposition and entanglement, significantly enhancing both its learning capacity and expressive power.

After the data have undergone the encoding phase, they proceed to a crucial transition step, the parametrized layer, which is composed of tunable quantum gates that can perform delicate adjustments and transformations on the quantum state. The core function of the parametrized layer is to capture and enhance the intrinsic features of the input data through the evolution of quantum states. Within this layer, each qubit undergoes a rotation around the Y-axis, achieved by applying RY gates. The parameters of the RY gates are trainable, determining the proximity of the qubits to the $|0\rangle$ and $|1\rangle$ basis states in the quantum state space[47]. The modulation of this proximity directly affects the model's representational power, significantly influencing the depth and breadth of feature extraction. These parameters can be precisely adjusted through classical optimization algorithms to suit specific classification tasks. During the training process, the algorithm iteratively optimizes the parameters to minimize the discrepancy between the model's predictions and the actual labels. This process not only enhances the model's sensitivity to data features but also strengthens its capability to address complex classification problems. The VQC algorithmic procedure is formulated in Equations (2)-(7).

We first apply four Hadamard gates (H) to transform the four qubits from their initial state $|0\rangle$ into the uniform superposition state $|\psi_1\rangle$; the Hadamard gate transforms $|0\rangle$ into $\frac{1}{\sqrt{2}}(|0\rangle+|1\rangle)$[48].

$|\psi_{1}\rangle=H^{\otimes 4}|0\rangle^{\otimes 4}$ (2)

Then four single-qubit rotation gates $U_{1,i}(\delta_{i})$ are applied, one to each qubit of $|\psi_1\rangle$. The rotation angle $\delta_{i}$ is twice the input data $x[i]$. This operation rotates the phase of each qubit by $\delta_{i}$ radians, generating a new quantum state $|\psi_2\rangle$.

$|\psi_{2}\rangle=U_{1,0}(\delta_{0})\otimes U_{1,1}(\delta_{1})\otimes U_{1,2}(\delta_{2})\otimes U_{1,3}(\delta_{3})\,|\psi_{1}\rangle,\qquad \delta_{i}=2\times x[i],\ i=0,1,2,3$ (3)

Next, three controlled-NOT (CNOT) gates in the VQC are applied to $|\psi_2\rangle$. The CNOT gate $CX_{i,j}$ is controlled by the $i$-th qubit and targets the $j$-th qubit. This operation entangles the information of the qubits, producing a new quantum state $|\psi_3\rangle$.

$|\psi_{3}\rangle=CX_{0,1}\cdot CX_{1,2}\cdot CX_{2,3}\,|\psi_{2}\rangle$ (4)

Afterwards, we apply four single-qubit Y-axis rotation gates $RY(\theta_{i})$, one to each qubit of $|\psi_3\rangle$. The rotation angles $\theta_{i}$ are trainable model parameters. This operation rotates the state of each qubit by an angle $\theta_{i}$ about the Y-axis of the Bloch sphere, thereby changing the phase and amplitude of the quantum state.

$|\psi_{4}\rangle=\bigotimes_{i=0}^{3}RY(\theta_{i})\,|\psi_{3}\rangle$ (5)

Then, we apply a second series of CNOT gates to $|\psi_4\rangle$, in the reverse order of Eq. (4): $CX_{2,3}$ acts first, followed by $CX_{1,2}$, and finally $CX_{0,1}$. Because these gates are applied in the opposite order, the sequence can be viewed as "undoing" the entangling pattern of the previous CNOT layer.

$|\psi_{5}\rangle=CX_{2,3}\cdot CX_{1,2}\cdot CX_{0,1}\,|\psi_{4}\rangle$ (6)

Finally, we apply a Hadamard gate $H_{0}$ to the first qubit of $|\psi_5\rangle$, while keeping the other qubits unchanged (identity gates $I$). This operation is typically used in the final stage of quantum circuits to prepare for measurement.

$|\psi_{6}\rangle=H_{0}\otimes I\otimes I\otimes I\,|\psi_{5}\rangle$ (7)
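To make the construction concrete, the following is a minimal Qiskit sketch of Eqs. (2)-(7). The phase gate `p` is used in place of the U1 gate, to which it is equivalent, and the gate application order follows the description in the text; Figure 2 remains authoritative for the actual circuit.

```python
from qiskit import QuantumCircuit
from qiskit.circuit import ParameterVector

x = ParameterVector("x", 4)          # classical inputs from the CNN head
theta = ParameterVector("theta", 4)  # trainable RY angles

qc = QuantumCircuit(4)
qc.h([0, 1, 2, 3])                   # Eq. (2): uniform superposition
for i in range(4):
    qc.p(2 * x[i], i)                # Eq. (3): U1 rotation by 2 * x[i]
for i in range(3):
    qc.cx(i, i + 1)                  # Eq. (4): entangling CNOT ladder
for i in range(4):
    qc.ry(theta[i], i)               # Eq. (5): trainable Y-axis rotations
for i in reversed(range(3)):
    qc.cx(i, i + 1)                  # Eq. (6): CNOT ladder in reverse order
qc.h(0)                              # Eq. (7): Hadamard on the first qubit
```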
Figure 2: Variational Quantum Circuit Structure. The encoding layer of this network introduces H gates and U1 gates to each qubit, enabling adjustment and manipulation of individual qubit states by selecting appropriate rotation angles for the U1 gates. The research model integrates CNOT gates, a crucial multi-qubit operator that facilitates non-local state transfer by controlling interactions between the control and target qubits.

To observe the impact of the quantum gates and operations on the quantum state more intuitively, we map the output states of the two circuits onto the Bloch sphere. The specific distribution of quantum states is shown in Figure 3. After the input qubits pass through the H gates and U1 gates of the encoding circuit, they are uniformly distributed on the equatorial plane; at this point, each qubit is in an equal superposition of $|0\rangle$ and $|1\rangle$. The uniform distribution of quantum states implies that the quantum information is uniformly encoded in the quantum state space, which is beneficial for quantum entanglement and quantum error correction. Moreover, it indicates that our encoding strategy does not introduce any specific preferences when processing sample data. This is crucial for enhancing the model's generalization capability: such uniformity helps ensure that the model does not overfit to specific features in the training data, thereby maintaining high predictive accuracy and robustness on new, unseen data. Figure 3(b) illustrates the distribution of quantum states after rotation around the Y-axis; the parameters of the RY gates determine the final position of the quantum states on the Bloch sphere. Since the Bloch sphere can only represent a single qubit, the uniform distribution of individual qubit states does not preclude entanglement among the qubits.
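Views like Figure 3 can be reproduced along the following lines, reusing the circuit sketch from Section 2.2; the bound parameter values here are illustrative assumptions.

```python
from qiskit.quantum_info import Statevector
from qiskit.visualization import plot_bloch_multivector

# Bind illustrative values (our assumption) to the circuit parameters and
# plot the reduced single-qubit states on Bloch spheres, as in Figure 3.
binding = {p: 0.3 for p in list(x) + list(theta)}
state = Statevector(qc.assign_parameters(binding))
plot_bloch_multivector(state)
```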

Figure 3: Bloch sphere representation of encoded quantum states. (a) After the input qubit undergoes the encoding circuit with H and U1 gates, it becomes uniformly distributed on the equatorial plane of the Bloch sphere; at this point, the qubit is in a superposition state of $|0\rangle$ and $|1\rangle$. (b) The image displays the distribution of the quantum state after rotation around the Y-axis. The parameter of the RY gate determines the final position of the quantum state on the Bloch sphere.

3 Results

3.1 Preparations

3.1.1 Experimental Environment

We conducted experiments on a server with a Xeon Gold 5315Y CPU and an RTX 3090 GPU, using Python 3.8.1, PyTorch 2.1.2, CUDA 11.8, and Qiskit 1.0.2[49, 50]. Qiskit is an open-source quantum computing software development kit (SDK) developed and maintained by the IBM Quantum team. It allows users to design, simulate, validate, and run quantum programs and algorithms on simulators or real quantum computers. Qiskit provides a complete set of tools for constructing quantum circuits, analyzing quantum information, and implementing quantum algorithms.

Our experiment uses Qiskit to simulate quantum circuits on a classical computer and integrates Qiskit with PyTorch through the Aer plugin. This integration enables the direct use of PyTorch's tensor operations with quantum circuits, as well as the use of PyTorch's automatic differentiation in quantum algorithms.
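As an illustration of how such an integration can look in code, the sketch below uses qiskit-machine-learning's EstimatorQNN and TorchConnector; this specific API choice is our assumption, since the paper names only Qiskit, PyTorch, and Aer. It reuses the circuit `qc` with parameters `x` and `theta` from the sketch in Section 2.2.

```python
from qiskit_machine_learning.neural_networks import EstimatorQNN
from qiskit_machine_learning.connectors import TorchConnector

# Wrap the parametrized circuit as a differentiable PyTorch module.
qnn = EstimatorQNN(
    circuit=qc,
    input_params=list(x),       # encoding angles fed from the CNN head
    weight_params=list(theta),  # trainable RY parameters
    input_gradients=True,       # allow gradients to flow back into the CNN
)
quantum_layer = TorchConnector(qnn)  # behaves like a torch.nn.Module
```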

3.1.2 Datasets

In order to evaluate the performance of the proposed image classification model, the MNIST and Fashion MNIST datasets were selected for experiments.

MNIST (Modified National Institute of Standards and Technology database) is a large handwritten digit database created by Yann LeCun et al. in the 1990s[51]. It contains 60,000 training samples and 10,000 test samples, each a 28x28-pixel grayscale image of a handwritten digit from 0 to 9; the images have been centered and standardized to ensure a consistent format. Because the training and validation processes require a large amount of computing resources and time, we chose a subset of MNIST restricted to the digits 0 and 1 (the binary task studied here), comprising 2048 balanced training samples and 512 balanced validation samples. We selected 1024 random samples as the test dataset to ensure the broad applicability of our method.

FashionMNIST is a dataset provided by Zalando (a German fashion e-commerce platform) as an alternative to MNIST[52], aiming to provide a similar dataset structure but containing clothing images rather than handwritten digits. FashionMNIST contains 70,000 images divided into 10 categories (such as T-shirts, trousers, and shoes), with 7,000 images per category, each 28x28 pixels in size. In this study, we trained on 2048 images labeled with two categories ("pants" and "shirt"), validated with 512 images, and tested with 1024 images.
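The subset construction can be sketched as follows; the exact sampling routine and, for FashionMNIST, the label indices (1 for trousers/"pants", 6 for "shirt") are our assumptions, since the paper states only the split sizes and category names.

```python
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

torch.manual_seed(42)  # the seed stated in Section 3.1.3
mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())

def balanced_subset(ds, classes, n_per_class):
    # take the first n_per_class indices of each requested class
    idx = torch.cat([(ds.targets == c).nonzero(as_tuple=True)[0][:n_per_class]
                     for c in classes])
    return Subset(ds, idx.tolist())

train_set = balanced_subset(mnist, classes=[0, 1], n_per_class=1024)  # 2048 samples
```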

3.1.3 Training Setting

After converting the raw image data into feature maps, we trained the model using the Adam optimizer for 50 epochs with a batch size of 64 and a learning rate of 0.001. We fixed the random seed to 42 to ensure the reproducibility of the algorithm.
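A minimal training-loop sketch under these settings is given below; `model` stands for the assembled LCQHNN and `train_set` for the subset built above, and the cross-entropy loss is an assumption, as the paper does not name the loss function.

```python
import torch

torch.manual_seed(42)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()  # assumed loss function

for epoch in range(50):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # feedforward pass
        loss.backward()                          # backpropagate
        optimizer.step()                         # Adam parameter update
```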

3.2 Classification Analysis of MNIST and FashionMNIST

To substantiate and evaluate the effectiveness of the methodologies proposed in this manuscript, we select accuracy (Acc) as the evaluation criterion to measure the performance of the different models. The computation of accuracy is defined by Equation (8), where $T$ denotes the total number of samples, $y_{i}$ the true class of the $i$-th sample, $\hat{y}_{i}$ the class predicted by the model, and $\Pi(\cdot)$ the indicator function[53].

$Acc=\frac{\sum_{i=1}^{T}\Pi(\hat{y}_{i}=y_{i})}{T}\times 100\%$ (8)

At the same time, we also compare the convergence speed of different models. Assuming $E_{A}$ represents the number of epochs for model A to converge and $E_{B}$ the number of epochs for model B to converge, the percentage improvement $Q_{AB}$ in the convergence speed of model A over model B can be calculated using the following formula.

$Q_{AB}=\left(\frac{E_{B}-E_{A}}{E_{B}}\right)\times 100\%$ (9)
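Both metrics translate directly into code. In the example call below, the CNN epoch count of 17 is inferred from the reported 70.59% figure rather than stated in the paper.

```python
def accuracy(y_pred, y_true):
    """Eq. (8): fraction of samples with y_hat_i == y_i, as a percentage."""
    correct = sum(int(p == t) for p, t in zip(y_pred, y_true))
    return 100.0 * correct / len(y_true)

def convergence_gain(epochs_a, epochs_b):
    """Eq. (9): Q_AB = (E_B - E_A) / E_B * 100%."""
    return 100.0 * (epochs_b - epochs_a) / epochs_b

print(convergence_gain(5, 17))  # ~70.59, matching the FashionMNIST result
```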

We compared the LCQHNN against CNNs whose fully connected modules have 4, 8, and 16 parameters, tracking the evolution of accuracy throughout training. As can be seen from Figure 4(a), LCQHNN converges faster than a traditional CNN with 4 variable parameters, and its accuracy exceeds 99% by the sixth epoch, 75% quicker than its CNN counterpart with the same parameters. Traditional CNNs require a larger number of parameters to match this performance. Figure 4(b) displays the accuracy of the four models after 10 epochs of training. It is apparent that the model introduced in this paper performs on par with, and occasionally exceeds, sophisticated traditional neural network models on the digit classification task. This hints at the VQC's superior transformation capacity compared to classical fully connected layers, allowing the model to concentrate on finer details and thereby attain higher predictive precision. The data suggest that the LCQHNN possesses greater potential than conventional models.

Figure 4: Experimental results on the MNIST dataset. (a) The comparison of the LCQHNN against traditional CNNs with varying numbers of parameters in terms of accuracy evolution during the training process was conducted following the unified setup described in the paper. (b) The accuracy of different models at the 10th epoch during training.

To further elucidate the differences in model performance, we trained the four models on the more challenging FashionMNIST dataset and evaluated their classification accuracy on a separate test set, structured analogously to the MNIST experiments above. The experimental results are presented in Figure 5. Figure 5(a) illustrates the evolution of accuracy throughout training for all four models, with LCQHNN's rapid convergence being particularly evident. The LCQHNN exceeded a 90% accuracy threshold on the test set after only five epochs of training, a 70.59% improvement in convergence speed over traditional CNN models with an equivalent number of parameters. Even models with more parameters, such as CNN-8 and CNN-16, did not demonstrate such notable efficacy. This phenomenon can be attributed to the VQC's adeptness at exploring high-dimensional feature spaces, which facilitates quicker comprehension of and adaptation to the underlying characteristics of the data. Figure 5(b) reports each model's accuracy on the FashionMNIST test set after ten epochs; these outcomes substantiate that within an identical constrained training period, LCQHNN achieves superior classification precision, corroborating that incorporating quantum attributes significantly enhances classification proficiency within a reduced training duration.

Figure 5: Experimental results on the FashionMNIST dataset. (a) Comparison of the accuracy change of LCQHNN training on the FashionMNIST dataset with traditional CNNs with different numbers of parameters. (b) The accuracy of each model on the FashionMNIST dataset after training for 10 epochs.

The complete training data is presented in Table 1. The MNIST dataset is relatively straightforward and is primarily used to explore the quantum characteristics within the LCQHNN. In terms of experimental evaluation metrics, there is no significant difference compared with traditional CNNs. However, on the more complex FashionMNIST dataset, it is evident that the LCQHNN outperforms the other three traditional CNNs of varying complexities in both convergence speed and final accuracy. The results indicate that the final test accuracy of the LCQHNN on FashionMNIST is 0.78% higher than that of CNN-16, 2.34% higher than CNN-8, and shows a significant improvement of 5.07% over CNN-4.

Model           MNIST (10)   MNIST (25)   MNIST (50)   FashionMNIST (10)   FashionMNIST (25)   FashionMNIST (50)
LCQHNN (ours)   100%         100%         100%         96.88%              98.24%              99.02%
CNN-4           86.33%       99.22%       100%         76.56%              91.41%              93.95%
CNN-8           99.61%       100%         100%         91.41%              95.12%              96.68%
CNN-16          99.8%        100%         100%         90.43%              96.48%              98.24%
Table 1: Comparison of classification accuracy on public datasets. The table summarizes the classification accuracy of the four models trained on the MNIST and FashionMNIST datasets for 10, 25, and 50 epochs (in parentheses), highlighting the unique highest accuracy within each group; no emphasis is given when two or more models tie.

3.3 Visual Explanations

Neural network models are frequently considered as black-box systems due to their inherently complex nonlinear transformations and opaque internal structural parameters. This lack of transparency prompts researchers to investigate the focal points of these models. To enhance our understanding of the model’s focus in this paper, we introduce Gradient-weighted Class Activation Mapping (Grad-CAM), an interpretability technique that leverages gradient information to identify regions in an image that contribute most significantly to the model’s predictions for a specific class. By combining gradients with activation maps from convolutional layers to generate Class Activation Maps (CAMs), areas with substantial activation are superimposed onto the original image, thereby highlighting regions that play a critical role in class prediction. We visualize the training process through heatmaps, which provide clearer insights into shifts in the model’s attention direction. The detailed computational formula is presented subsequently[54].

We first calculate the partial derivative of the model output $y$ with respect to the output $A^{k}$ of the $k$-th convolutional layer; $y$ is usually the model's score for a specific category. This partial derivative represents the contribution of each convolutional-layer output to the final classification result.

$\alpha_{k}=\frac{\partial y}{\partial A^{k}}$ (10)

Then, on the feature map $A^{k}$ of the $k$-th convolutional layer, we calculate the average of the partial derivatives $\alpha_{k}$ within the window centered at $(i,j)$. This average $\omega_{k}(i,j)$ represents the average contribution of the $k$-th convolutional layer to the final classification result within that window.

$\omega_{k}(i,j)=\frac{1}{\text{Window Size}}\sum_{(m,n)\in\text{Window}(i,j)}\alpha_{k}(m,n)$ (11)

Finally, we multiply the contribution $\omega_{k}$ of each convolutional layer by the corresponding feature map $A^{k}$, sum over all layers, and apply the ReLU function to the result so that only positive contributions are retained. The final $G$ is a heatmap showing the region of the input image that contributes most to the classification result for the specific category.

$G=\max\!\left(0,\ \sum_{k}\omega_{k}A^{k}\right)$ (12)
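A compact PyTorch sketch of this procedure is given below, in the standard Grad-CAM form where the averaging window of Eq. (11) spans the entire feature map; `model` and the chosen `target_layer` (the last convolutional layer, as in the experiments below) are assumptions.

```python
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda mod, inp, out: acts.update(a=out))
    h2 = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: grads.update(g=gout[0]))

    score = model(image)[0, class_idx]  # y: score for the target category
    model.zero_grad()
    score.backward()                    # Eq. (10): alpha_k = dy / dA^k
    h1.remove(); h2.remove()

    omega = grads["g"].mean(dim=(2, 3), keepdim=True)  # Eq. (11): average
    cam = F.relu((omega * acts["a"]).sum(dim=1))       # Eq. (12): G
    return cam / (cam.max() + 1e-8)                    # normalize for display
```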

The visualization interpretation results of the LCQHNN on the MNIST dataset are depicted in Figure 6. We selected the last convolutional layer as the target layer to observe the salient areas the model uses for prediction. Figure 6(a) presents the original input image, while Figures 6(b), 6(c), and 6(d) illustrate the visualization results of the model at different training stages. The heatmaps represent the importance of various regions of the image for the target category: the closer a heatmap pixel is to dark red, the more critical the model considers that area for predicting the target category. Initially, for the untrained model (Figure 6(b)), the visualization shows that the model's focus is scattered and chaotic, with no discernible pattern. This indicates that the model's initial cognition of the target category is inaccurate and that it has not yet learned effective features related to prediction. Subsequently, the model in the middle of training (Figure 6(c)) begins to learn feature information from the training data; it starts to focus on the edge contours of the handwritten digits, signifying that it is gradually grasping important features of the digit shapes. Finally, for the model after training is complete (Figure 6(d)), the areas of focus largely coincide with the shape of the handwritten digits. This demonstrates that the model has learned the feature information of the data and associated these features with the target categories. The model's ability to accurately identify and locate the important areas of the handwritten digits confirms the effectiveness of the LCQHNN on this task. Moreover, it indicates the practicality of VQCs in neural networks, which enhance the model's expressive power and performance by combining classical and quantum computing advantages. By leveraging the unique features of quantum computing, VQCs can offer advantages in handling complex data patterns and tasks, providing superior predictive and interpretive capabilities.

Figure 6: Visualization heatmap for model trained on MNIST dataset. (a) Original Images Input to the Model. (b) Visualization of Untrained Model. (c) Visualization of Model Trained for 25 Epochs. (d) Visualization of Model Trained for 50 Epochs.

Furthermore, our investigation extends to visualizing the decision-making process of the LCQHNN on the FashionMNIST dataset. As depicted in Figure 7, the visualization outcomes allow us to distinctly perceive the progression of the model's focal points of attention. Initially, during the early training phase, the model's attention was divided; as training progressed, the model gradually redirected its attention to the key components of the image, eventually accurately identifying the object in question. Nonetheless, contrary to the MNIST dataset, the heatmaps on FashionMNIST do not become saturated even after successive training iterations, a discrepancy potentially owing to the visual intricacy inherent in FashionMNIST images. Juxtaposing the heatmaps that evolve with each training epoch against the accuracy curves reveals a pronounced correlation: as the model's concentration on the predictive targets intensifies, test accuracy rises concomitantly. This correlation substantiates that the model indeed seizes upon critical image-data features throughout the training regimen, and that these features are instrumental in improving classification proficiency. The visual analytics within this research not only offer a clear vantage point for deciphering the LCQHNN's decision-making apparatus but also underscore the model's capacity to refine its performance through attentive readjustments during the learning trajectory when confronted with sophisticated image data. These findings furnish persuasive evidence for delving deeper into the potential of quantum computation within deep learning, thereby establishing a groundwork for the development of more efficacious and precise hybrid quantum-classical models.

Figure 7: Visualization heatmap for model trained on FashionMNIST dataset. (a) Original Images Input to the Model. (b) Visualization of Untrained Model. (c) Visualization of Model Trained for 25 Epochs. (d) Visualization of Model Trained for 50 Epochs.

4 Conclusion

In this study, we propose an innovative hybrid model that integrates classical and quantum computing, named LCQHNN, and conduct an in-depth comparative analysis of it. The novelty of the LCQHNN model lies in its combination of the powerful feature-extraction capabilities of traditional CNNs with the classification potential of VQCs. Specifically, the model first utilizes CNNs to extract feature information from images and then hands these features off to VQCs for precise classification. Experimental results demonstrate that the classical-quantum hybrid model exhibits superior performance in classification tasks across multiple public datasets, with enhancements in both recognition accuracy and convergence speed compared to traditional CNN models with the same parameters. In the binary MNIST classification task involving digits 0 and 1, the model was the fastest to achieve 100% accuracy. On the FashionMNIST dataset, LCQHNN surpassed CNNs, including those with a richer set of parameters, in both convergence rate and final accuracy. These results indicate that the LCQHNN model, which combines classical networks and quantum computing, can achieve higher precision and provides a reliable, flexible, and scalable deep learning approach for image classification tasks. Additionally, the model's visual interpretation results show that it effectively captured key data features during training and established associations with the corresponding categories. This visual analysis highlights the capability of LCQHNN to enhance performance by fine-tuning its learning trajectory when dealing with complex image data. Furthermore, it provides compelling evidence for exploring the emerging potential of quantum computing in deep learning, paving the way for innovative and more accurate hybrid quantum-classical models.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants No. 12374408 and No. 12475051; the Natural Science Foundation of Hunan Province under grant No. 2023JJ30384; the science and technology innovation Program of Hunan Province under grant No. 2024RC1050; and the innovative research group of Hunan Province under Grant No. 2024JJ1006.

Data availability: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Code availability: The code for this study can be obtained from the corresponding author upon request.

References

  • [1] M. Jordan, T. Mitchell, Science 2015, 349 255.
  • [2] Y. Sun, et al., arXiv:2312.13583 2023.
  • [3] D. P. Kingma, M. Welling, arXiv:1312.6114 2013.
  • [4] I. Goodfellow, et al., Advances in Neural Information Processing Systems 2014, 27 2672.
  • [5] A. Ho, A. Jain, P. Abbeel, arXiv:2006.11230 2020.
  • [6] D. Rumelhart, G. Hinton, R. Williams, Nature 1986, 323 533.
  • [7] A. Dosovitskiy, et al., arXiv:2010.11929 2020.
  • [8] K. He, X. Zhang, S. Ren, J. Sun, In Proceedings of the IEEE conference on computer vision and pattern recognition. 2016 770–778.
  • [9] S. Hochreiter, J. Schmidhuber, Neural Computation 1997, 9 1735.
  • [10] C. Subakan, et al., In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021 21–25.
  • [11] V. Dunjko, J. M. Taylor, H. J. Briegel, Phys. Rev. Lett. 2016, 117 130501.
  • [12] S. Tamiya, H. Yamasaki, npj Quantum Information 2022, 8 90.
  • [13] J. Jäger, R. V. Krems, Nature Communications 2023, 14 576.
  • [14] F. Fan, Y. Shi, T. Guggemos, X. X. Zhu, IEEE Transactions on Neural Networks and Learning Systems 2024, 35 18145.
  • [15] A. Wang, J. Hu, S. Zhang, L. Li, Quantum Information Processing 2024, 23 17.
  • [16] P. P, et al., In 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT). 2023 1–5.
  • [17] Y. Tian, et al., Advanced Quantum Technologies 2022, 5 2200025.
  • [18] J. Biamonte, P. Wittek, N. Pancotti, et al., Nature 2017, 549 195.
  • [19] J. Qi, C. Yang, P. Chen, et al., npj Quantum Information 2023, 9 4.
  • [20] Y. Bagoun, A. Zinedine, I. Berrada, In 2024 Sixth International Conference on Intelligent Computing in Data Sciences (ICDS). 2024 1–6.
  • [21] S. Bhowmik, H. Thapliyal, In 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 2024 634–639.
  • [22] Y. Du, et al., PRX Quantum 2021, 2 040337.
  • [23] M. T. West, et al., Phys. Rev. Res. 2023, 5 023186.
  • [24] S. Y.-C. Chen, et al., Phys. Rev. Research 2022, 4 013231.
  • [25] A. Krizhevsky, I. Sutskever, G. E. Hinton, In Advances in Neural Information Processing Systems, volume 25. 2012 1097–1105.
  • [26] J. R. McClean, et al., Nature Communications 2018, 9 1.
  • [27] S. Chakrabarti, et al., In Advances in Neural Information Processing Systems, volume 32. 2019 6778–6789.
  • [28] G. González-García, R. Trivedi, J. I. Cirac, PRX Quantum 2022, 3 040326.
  • [29] M. Weber, et al., Phys. Rev. Res. 2022, 4 033217.
  • [30] M. Ippoliti, et al., PRX Quantum 2021, 2 030346.
  • [31] K. Bharti, et al., Rev. Mod. Phys. 2022, 94 015004.
  • [32] K. Yeter-Aydeniz, et al., Phys. Rev. A 2019, 99 032306.
  • [33] M. L. Wall, M. R. Abernathy, G. Quiroz, Phys. Rev. Res. 2021, 3 023010.
  • [34] R.-B. Wu, X. Cao, P. Xie, Y.-x. Liu, Phys. Rev. Appl. 2020, 14 064020.
  • [35] S. Chen, J. Cotler, H.-Y. Huang, J. Li, Nature Communications 2024, 15 1.
  • [36] García-Molina, et al., Communications Physics 2024, 7 321.
  • [37] C. Lamb, et al., Nature Communications 2024, 15 1.
  • [38] H. Nishi, T. Kosugi, Y.-i. Matsushita, npj Quantum Information 2021, 5 85.
  • [39] F. Barratt, et al., npj Quantum Information 2021, 7 79.
  • [40] J. Y. Araz, M. Spannowsky, Phys. Rev. A 2022, 106 062423.
  • [41] Y. Kim, arXiv:1408.5882 2014.
  • [42] X. Zhang, et al., Phys. Rev. C 2022, 105 034611.
  • [43] J. C. Zuñiga Castro, et al., Phys. Rev. A 2024, 110 052615.
  • [44] M. Kuś, I. Bengtsson, Phys. Rev. A 2009, 80 022319.
  • [45] M. A. Yurtalan, et al., Phys. Rev. Lett. 2020, 125 180504.
  • [46] F. Luis, et al., Phys. Rev. Lett. 2011, 107 117203.
  • [47] K. Kubo, Y. O. Nakagawa, S. Endo, S. Nagayama, Phys. Rev. A 2021, 103 052425.
  • [48] R. Aurich, M. Sieber, F. Steiner, Phys. Rev. Lett. 1988, 61 483.
  • [49] A. Paszke, et al., In Advances in Neural Information Processing Systems 32. 2019 8024–8035.
  • [50] A. Javadi-Abhari, et al., arXiv:2405.08810 2024.
  • [51] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Proceedings of the IEEE 1998, 86 2278.
  • [52] H. Xiao, K. Rasul, R. Vollgraf, arXiv:1708.07747 2017.
  • [53] R. A. Fisher, Annals of Human Genetics 1936, 7 179.
  • [54] R. R. Selvaraju, et al., In Proceedings of IEEE International Conference on Computer Vision (ICCV). 2017 618–626.