
Quantum Federated Learning with Entanglement Controlled Circuits and Superposition Coding

Won Joon Yun, Jae Pyoung Kim, Hankyul Baek, Soyi Jung,  Jihong Park,  Mehdi Bennis,  and Joongheon Kim The parts of this research were presented at IEEE Conference on Computer Communications (INFOCOM), London, United Kingdom, May 2022 [1].This research is supported by the National Research Foundation of Korea (2021R1A4A1030775) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00467, Intelligent 6G Wireless Access System). (Corresponding authors: Soyi Jung, Jihong Park, Joongheon Kim)W. J. Yun, J. P. Kim, H. Baek, and J. Kim are with the School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea (e-mails: {ywjoon95,paulkim436,67back,joongheon}@korea.ac.kr).S. Jung is with the Department of Electrical and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea (e-mail: [email protected]).J. Park is with the School of Information Technology, Deakin University, Geelong, VIC 3220, Australia (e-mail: [email protected]).M. Bennis is with the Centre for Wireless Communications, University of Oulu, Oulu 90014, Finland (e-mail: [email protected]).
Abstract

While witnessing the noisy intermediate-scale quantum (NISQ) era and beyond, quantum federated learning (QFL) has recently become an emerging field of study. In QFL, each quantum computer or device locally trains its quantum neural network (QNN) with trainable gates, and communicates only these gate parameters over classical channels, without costly quantum communications. Towards enabling QFL under various channel conditions, in this article we develop a depth-controllable architecture of entangled slimmable quantum neural networks (eSQNNs), and propose an entangled slimmable QFL (eSQFL) that communicates the superposition-coded parameters of eSQNNs. Compared to the existing depth-fixed QNNs, training the depth-controllable eSQNN architecture is more challenging due to high entanglement entropy and inter-depth interference, which are mitigated by introducing entanglement controlled universal (CU) gates and an inplace fidelity distillation (IPFD) regularizer penalizing inter-depth quantum state differences, respectively. Furthermore, we optimize the superposition coding power allocation by deriving and minimizing the convergence bound of eSQFL. In an image classification task, extensive simulations corroborate the effectiveness of eSQFL in terms of prediction accuracy, fidelity, and entropy compared to Vanilla QFL as well as under different channel conditions and various data distributions.

Index Terms:
Quantum Machine Learning, Quantum Entanglement, Quantum Federated Learning, Superposition Coding

I Introduction

I-A Background and Motivation

Recent advances in quantum computing hardware and algorithms have led to the emergence of quantum machine learning (ML) [2, 3, 4]. As opposed to classical computation at a linear scale in bits, quantum computing can perform calculations at an exponential scale in qubits [5]. The main enablers are the stochastic nature and the entanglement phenomenon of qubits, allowing one to make each qubit represent superimposed multiple states and to simultaneously control multiple qubits, respectively. Consequently, even in the current era of noisy intermediate scale quantum (NISQ) [6], i.e., with 50 to a few hundred qubits, quantum ML has achieved linear or sublinear complexity in various applications, as compared with the polynomial complexity of classical ML [7].

Figure 1: A schematic illustration of (a) Vanilla quantum federated learning (QFL) and (b) the proposed entangled slimmable quantum FL (eSQFL) with 2 devices, each of which has an entangled slimmable quantum neural network (eSQNN) with 3 depth layers.

Quantum ML has recently established its standard framework. Analogous to the neural network (NN) of classical ML, the parameterized quantum circuit (PQC), also known as the quantum NN (QNN), has become a de facto standard quantum ML architecture [7, 8]. In a PQC, qubits flow through gates associated with trainable classical parameters, during which the states of the qubits can be adjusted. For applications ranging from image classification [9] to reinforcement learning [10], PQC training has achieved prediction accuracy on par with classical NNs while using a much smaller number of trainable parameters.

Focusing on the parameter efficiency of PQCs, by integrating federated learning (FL) [11, 12, 13, 14] into standalone quantum ML, quantum FL (QFL) has recently attracted attention [15, 16]. Without communicating qubits via costly quantum communications, QFL enables distributed quantum ML at scale by communicating the PQC’s trainable parameters via classical communications, even over wireless channels [17]. This is not in the distant future, but is an upcoming application, especially considering the ever-increasing pace of innovation in quantum computers, e.g., IBM’s development roadmap planning to implement a 1K-qubit beyond-NISQ computer in 2023 [18] and a 100K-qubit computer in 2026 [19].

I-B Algorithm Design Concept

Motivated by this trend in QFL, the overarching goal of this article is to develop a communication-efficient QFL framework that can cope with heterogeneous and time-varying channel conditions and computing resources. To this end, we first revisit slimmable FL (SFL) in classical ML [1], wherein each device has a width-controllable local model, known as a slimmable NN (SNN) [20, 21], and communicates its superposition-coded local model with different width levels, enabling multi-level local information exchanges depending on channel conditions. Inspired by this, as visualized in Fig. 1, we propose an entangled slimmable quantum FL (eSQFL) framework with entangled slimmable QNNs (eSQNNs), which is a non-trivial extension of SFL with SNNs to their quantum versions, as summarized next.

Unlike multi-width SNNs, the eSQNN is a multi-depth PQC wherein more depth levels incur higher von Neumann entanglement entropy on average. Unfortunately, the PQC trainability is often challenged by the problem of all gradients vanishing, known as barren plateaus [22], which is exacerbated under higher entanglement entropy [23]. Meanwhile, too low entanglement may negate the benefit of quantum ML. To resolve this issue for an unknown target degree of entanglement, our proposed eSQNN entangles different qubits using controlled universal (CU) quantum gates [24] such that the degree of entanglement is trainable.

Next, simultaneous local training of the multiple eSQNN depths may induce inter-depth interference, hindering convergence. In classical ML, SFL avoids its similar inter-width interference issue by adding an inplace knowledge distillation (IPKD) regularizer that penalizes the output difference from any smaller width to the largest width level [20]. Since IPKD uses the Kullback-Leibler (KL) divergence, it becomes less accurate (or even diverges) for larger differences. Alternatively, leveraging Uhlmann's fidelity function in quantum information theory [25], we propose a novel inplace fidelity distillation (IPFD) regularizer that is bounded between 0 and 1 while accurately measuring the quantum state difference even between the smallest and the largest levels.

Finally, for communication efficiency, the eSQNN parameters in multiple depths are superposition-coded and transmitted with a different transmit power allocation to each depth. Like SFL, the transmit power is optimized by deriving and minimizing the convergence bound of the eSQFL. Nevertheless, the convergence analysis is completely different since the gradient in PQC training is measured in a quantum computing way using the parameter shift rule [26].

Not only by analysis but also by extensive simulations, we corroborate that the proposed eSQFL with eSQNNs achieves convergence while each depth level can be trained into a separate model with reasonable accuracy and fidelity, under various channel conditions as well as independent and identically distributed (IID) or non-IID data distributions. Note that unlike eSQFL, Vanilla QFL with fixed local PQC architectures cannot cope with different channel conditions [15, 27]. A recent work [28] also considers a slimmable architecture in the context of QFL. However, it does not theoretically guarantee convergence, and its specific architecture (i.e., angle/pole parameters) only allows two-level superposition coding, as opposed to the generalized multi-level architectures using CU gates in eSQFL.

I-C Contributions

The major contributions of this paper are summarized as follows.

  • A multi-depth QNN architecture with CU gates, i.e., eSQNN, is proposed to enable superposition-coded transmissions while avoiding barren plateaus. We measure the von Neumann entanglement entropy between quantum states of different depths, and show that CU gates improve trainability when designing multi-depth QNNs.

  • A local eSQNN training algorithm with a fidelity-inspired regularizer, i.e., IPFD, is proposed in order to mitigate inter-depth interference. The proposed IPFD plays a crucial role in eSQNN training.

  • With eSQNNs and IPFD, a novel quantum FL framework, i.e., eSQFL, is proposed, and its convergence bound is theoretically derived.

  • Based on the derived convergence bound, transmit power allocation in superposition coding is optimized. In addition, we corroborate that the derived convergence bound helps eSQFL achieve high accuracy.

I-D Organization

The rest of this paper is organized as follows. Sec. II reviews the work related to the proposed quantum federated learning framework. Sec. III introduces the eSQNN architecture and its local training with the IPFD regularizer. Sec. IV describes superposition coding, successive decoding, and the proposed eSQFL framework. Sec. V provides the convergence analysis of eSQFL and its insights. Sec. VI presents the numerical experimental results that corroborate eSQFL empirically. Lastly, Sec. VII concludes this paper. The notations used throughout this paper are summarized in Tab. III.

II Related Work

II-A Quantum Machine Learning Basics

Basic Quantum Gates. A qubit is the unit of quantum computation whose state is represented with the two basis states $|0\rangle$ and $|1\rangle$ on the Bloch sphere [29]. Consider a $q$-qubit system, in which the quantum state $\bm{\psi}\in\mathbb{C}^{2^{q}}$ defined in the Hilbert space can be expressed as follows,

$|\bm{\psi}\rangle=\Lambda_{1}|0\cdots 0\rangle+\cdots+\Lambda_{2^{q}}|1\cdots 1\rangle$ (1)

where $\sum^{2^{q}}_{i=1}|\Lambda_{i}|^{2}=1$. Classical data $\bm{x}$ are encoded into a quantum state with the rotation gates $R_{\text{x}}(\bm{x})$, $R_{\text{y}}(\bm{x})$, and $R_{\text{z}}(\bm{x})$, which rotate the state by $\bm{x}$ around the $x$-, $y$-, and $z$-axes of the Bloch sphere, respectively. Moreover, qubits are entangled with controlled-NOT (CNOT) gates [30]. A CNOT gate acts on two qubits, using the first qubit as the control and applying an XOR (bit-flip) operation to the second qubit. These basic quantum gates constitute the building blocks of QNNs.
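For concreteness, the following minimal NumPy sketch (illustrative only, not the paper's circuit) encodes one classical feature with an $R_{\text{y}}$ rotation, entangles two qubits with a CNOT gate, and checks the normalization $\sum_{i}|\Lambda_{i}|^{2}=1$.

```python
import numpy as np

# Single-qubit rotation about the y-axis: R_y(x) = exp(-i x Y / 2).
def ry(x):
    return np.array([[np.cos(x / 2), -np.sin(x / 2)],
                     [np.sin(x / 2),  np.cos(x / 2)]])

# CNOT on two qubits (first qubit controls, second is the target).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

# Encode a classical feature x on qubit 0, leave qubit 1 in |0>, then entangle.
x = 0.7
psi0 = np.kron(ry(x) @ np.array([1.0, 0.0]), np.array([1.0, 0.0]))  # R_y(x)|0> ⊗ |0>
psi = CNOT @ psi0                                                   # entangled 2-qubit state

print(np.round(psi, 4))          # amplitudes over |00>, |01>, |10>, |11>
print(np.sum(np.abs(psi) ** 2))  # normalization: sum_i |Λ_i|^2 = 1
```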

Quantum Neural Network. The structure of a QNN is tripartite: the state encoder, the PQC, and the measurement layer [31, 32]. In the forward propagation, classical input data $\bm{x}$ are first encoded by the state encoder via basic rotation gates, a unitary operation denoted as $U(\bm{x})$. Then, the encoded quantum state is processed through the PQC $U(\bm{\theta})$, a multi-layered set of CNOT gates and rotation gates associated with trainable parameters $\bm{\theta}$. The resulting quantum state can be expressed as,

$|\bm{\psi}_{\bm{\theta}}\rangle=U(\bm{\theta})\cdot|\bm{\psi}_{0}\rangle=U(\bm{\theta})\cdot U(\bm{x})|0\rangle.$ (2)

The output of the PQC is an entangled quantum state that can be measured after applying a projection matrix $M\in\mathcal{M}\equiv\{M_{1},\cdots,M_{c},\cdots,M_{C}\}$ onto the reference $z$-axis. The measured output $\langle V\rangle_{\bm{\theta}}\in[-1,1]^{C}$ is called an observable, where $C$ denotes the output dimension. The operation of the QNN corresponding to the $c$-th observable is as follows,

$\langle V_{c}\rangle_{\bm{\theta}}=\langle 0|U^{\dagger}(\bm{x})U^{\dagger}(\bm{\theta})M_{c}U(\bm{\theta})U(\bm{x})|0\rangle=\langle\bm{\psi}|M_{c}|\bm{\psi}\rangle$ (3)

where $(\cdot)^{\dagger}$ denotes the conjugate transpose. Using the observable, a given loss function is calculated. Unlike classical NNs, whose hidden-layer activations are visible, the quantum states within a QNN cannot be measured without collapsing them [29]. This prevents quantum ML from computing the loss gradients via the chain rule, i.e., backpropagation. Instead, quantum ML evaluates the gradients using a zeroth-order method called the parameter shift rule [26] (see Appendix -A).
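The tripartite pipeline in (2)-(3) can be sketched end-to-end as follows. This is a hedged toy example with two qubits, an angle encoder, a single trainable rotation layer, and Pauli-$Z$ observables; it does not reproduce the paper's exact PQC.

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)

def ry(t):  # rotation about the y-axis
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def encoder(x):           # U(x): angle-encode two classical features
    return np.kron(ry(x[0]), ry(x[1]))

def pqc(theta):           # U(theta): one trainable rotation layer plus an entangler
    return CNOT @ np.kron(ry(theta[0]), ry(theta[1]))

def observable(psi, M):   # <V_c> = <psi| M_c |psi>
    return np.real(np.conj(psi) @ (M @ psi))

x, theta = np.array([0.3, 1.1]), np.array([0.5, -0.8])
psi = pqc(theta) @ encoder(x) @ np.array([1.0, 0.0, 0.0, 0.0])  # |psi> = U(theta) U(x) |0>

# Two observables: Z on qubit 0 and Z on qubit 1 (each value lies in [-1, 1]).
M = [np.kron(Z, I2), np.kron(I2, Z)]
print([round(float(observable(psi, Mc)), 4) for Mc in M])
```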

II-B Classical Federated Learning

Federated learning (FL) is a machine learning (ML) architecture made up of a server, local devices, and a global model [12]. The server transmits the global model to all the local devices, and each device produces local parameters by training the received global model. Then, these parameters are sent back to the central server, where they are aggregated to update the global model. Finally, the updated global model is transmitted to the local devices again for another iteration. Due to this mechanism, FL allows a large number of devices to learn a global model simultaneously without transmitting any data, ensuring data privacy as well. Considering the recent increase in the number and computational power of edge devices, FL is an extremely useful tool for reducing computational overhead and protecting data security, both of which are emerging challenges in the field of ML [33]. Within this architecture, various techniques with differing methods of aggregating parameters and training the global model exist, e.g., FedAvg [34], FedBN [35].

The convergence analysis of FL algorithms is especially challenging because of the data heterogeneity in FL, which forces researchers to rely on copious numbers of assumptions. Consequently, gaps in the understanding of FL analysis remain. Over the years, many major works have attempted to better understand FL by removing assumptions and exploring new techniques [36, 37, 34, 38, 39]. Even now, research on FL convergence in various aspects is still being carried out (e.g., non-convex settings, tighter convergence bounds). For example, [40] successfully analyzes local stochastic gradient descent (SGD) under arbitrarily heterogeneous data while using weaker assumptions than previous works. On the other hand, the convergence analysis of most QFL algorithms has not been fully developed yet. This paper aims to advance the convergence analysis of QFL by analyzing eSQFL while accounting for the characteristics of quantum computing. The convergence analysis of the proposed dynamic QFL is elaborated in Sec. V.

II-C Classical Slimmable Federated Learning

SFL is a framework that executes FL by using slimmable neural networks (SNNs) with SC and SD [1]. The architectural properties of SNNs reduce the memory cost of SFL [20] while providing communication and computational efficiency. SC is the process of compressing two different data signals into one signal. As the signals are encoded, different power levels are assigned to each data signal, which decide the priority of the signals during SD. Additionally, the SNN is composed of the left-hand (LH) and the right-hand (RH) sides. The LH side is occupied by the high-priority signal, while the lower-priority signal goes to the RH side. After SC is finished, the encoded message is uploaded to the server, where it undergoes SD. If the state of the communication channel is good, the LH of the SNN is decoded first, followed by the RH signal. However, if the communication channel is not stable enough, only the LH is decoded, resulting in a small model. Finally, if the communication channel is completely unstable, no signal is obtained. This flexible characteristic of SNNs allows SFL to be extremely adaptable to dynamic communication environments, making it suitable for practical applications.

II-D Quantum Federated Learning

In this section, the concept of QFL is elaborated in depth. QFL implements FL via quantum computation by replacing all the NNs with QNNs. Chen et al. [15] are the first to propose a hybrid quantum-classical QFL architecture where the local devices are quantum devices, unlike in FL models. After receiving the global model parameters, the quantum computers carry out quantum ML using QNNs. Then, the output of each device is aggregated to update the global model before repeating the process. In Chehimi et al. [27], a purely quantum FL framework is proposed. Similar to [15], this model is composed of a server and multiple quantum devices. However, instead of converting classical data into quantum states, the local devices generate quantum data by labeling qubits as excited or not excited according to the degree of rotation on the Bloch sphere. Both [15, 27] use FedAvg to aggregate parameters and execute training. As seen from the two examples above, QFL and FL share an identical system structure, but QFL leverages quantum ML instead of classical ML in order to exploit the advantages of quantum computing. In this work, Vanilla QFL refers to a purely quantum version of [15]. In addition, a quantum application of SFL has been studied to improve communication opportunities [28]. This SQFL utilizes trainable measurement parameters to configure two messages, which carry the trainable measurement parameters and the PQC parameters, respectively. In contrast, this work proposes multi-layer architectures and local training algorithms that are not present in [28].

III Architecture and Training of eSQNNs

Figure 2: Illustration of eSQNN: (1) CU gates are used in eSQNN, (2) the fidelity regularizer is applied to the sub-model layers (e.g., Layers 1 and 2) by comparing their quantum states, and (3) the proposed eSQNN is based on a multi-depth QNN.

In this section, we describe the architecture of eSQNN and its local training algorithm. To elaborate on this, by slightly modifying the depth-fixed architecture of Vanilla QNN [7], we first prepare its depth-controllable counterpart without controlling the level of entanglement, dubbed Vanilla SQNN, followed by introducing the proposed eSQNN controlling both the depth and the level of entanglement.

Architecture of Vanilla SQNN. Suppose that Vanilla SQNN consists of $L$ layers and produces the desired outputs at any layer $l\in[1,L]$. In this paper, the number of sub-models must be larger than 1 (i.e., $L\geq 2$). When the $l$-th sub-model is used, the model is configured from the encoding layer up to the $l$-th layer. For arbitrary $k\in\mathbb{N}[1,K]$ and $l\in\mathbb{N}[1,L]$, the model parameters of the $k$-th local device at the $l$-th depth are denoted as $\bm{\theta}^{k}\odot\sum^{l}_{l^{\prime}=1}\Xi_{l^{\prime}}$. Note that $\Xi_{l^{\prime}}$ is a binary mask which eliminates all trainable parameters except those of the $l^{\prime}$-th layer, and $\odot$ denotes the element-wise product. However, it is difficult to obtain desirable results at an arbitrary layer because the Vanilla SQNN is vulnerable to the barren plateau problem [22, 41]. The barren plateau is a bad local optimum which hinders convergence, and it is known that a higher degree of entanglement worsens the barren plateau problem [23]. The operations in Vanilla SQNN are as follows: 1) rotate the quantum state $|\bm{\psi}\rangle$ with rotation gates, 2) entangle qubits, and 3) repeat the first and second steps. We expect that these operations increase the degree of entanglement.

Architecture of eSQNN. The eSQNN is proposed to cope with the problem of the Vanilla SQNN architecture. Fig. 2 illustrates the eSQNN, which is mainly composed of CU gates. The operation of a CU gate on two qubits is written as $\begin{bmatrix}I&0\\ 0&U\end{bmatrix}$, where $U=\begin{bmatrix}u_{00}&u_{01}\\ u_{10}&u_{11}\end{bmatrix}$ is a unitary matrix, i.e., $U^{\dagger}U=I$. We focus on the architectural advantage of CU gates: they can adjust the direction of entanglement, disentanglement, or rotation during training. We describe the advantages of eSQNN and the barren plateau phenomenon next.
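As an illustration of the CU gate described above, the following sketch assembles the block-diagonal controlled-$U$ from a single-qubit unitary. Parameterizing $U$ by three rotation angles is a common convention and only an assumption here, not necessarily the paper's exact parameterization.

```python
import numpy as np

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0],
                     [0, np.exp(1j * t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

def single_qubit_unitary(a, b, c):
    # General single-qubit unitary (up to a global phase): Rz(c) Ry(b) Rz(a).
    return rz(c) @ ry(b) @ rz(a)

def controlled_u(U):
    # Block-diagonal controlled-U: identity when the control is |0>,
    # and U applied to the target when the control is |1>.
    CU = np.eye(4, dtype=complex)
    CU[2:, 2:] = U
    return CU

U = single_qubit_unitary(0.4, 1.2, -0.7)
CU = controlled_u(U)
print(np.allclose(U.conj().T @ U, np.eye(2)))    # unitarity check: U†U = I
print(np.allclose(CU.conj().T @ CU, np.eye(4)))  # the controlled gate is unitary too
```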

To this end, we first consider the von Neumann entanglement entropy, a metric for measuring the degree of quantum entanglement of bipartite subsystems in an entire system [42]. For instance, consider two subsystems, e.g., the $l$-th and $l^{\prime}$-th model configurations, where $l>l^{\prime}$. According to the two-copy test from [43], we can compare the different quantum states $|\bm{\psi}_{l}\rangle$ and $|\bm{\psi}_{l^{\prime}}\rangle$ by using additional qubits. Then, we can measure the entanglement entropy as follows. Suppose the quantum state that exists over the $l^{\prime}$-th and $l$-th depths is represented as $\bm{\psi}_{l^{\prime},l}\in\mathbb{C}^{2^{2q}}$. Its pure-state density matrix is obtained by $\rho_{l^{\prime},l}\triangleq|\bm{\psi}_{l^{\prime},l}\rangle\langle\bm{\psi}_{l^{\prime},l}|$. Finally, the entanglement entropy is calculated as follows,

$S_{l}(\rho_{l^{\prime},l})=-\mathrm{Tr}_{l}(\rho_{l^{\prime},l}\log\rho_{l^{\prime},l})$ (4)

where $\mathrm{Tr}_{l}(\cdot)$ stands for the partial trace over the $l$-th layer. As discussed in many studies, avoiding barren plateaus requires reducing the entanglement entropy [23].

On the basis of these studies, we assume that there exists an entropy threshold for every $l$-th model, i.e., $S_{l,th}$ for all $l\in\mathbb{N}[l^{\prime},L]$ and $l^{\prime}\in\mathbb{N}[0,l-1]$. The index starts at $l^{\prime}=0$ because we measure the entanglement entropy from the encoding state, i.e., $\bm{\psi}_{0}$. If $\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})\geq S_{l,th}$, the barren plateau becomes severe and training of the $l$-th model fails. For this reason, we observe the entanglement entropy between the encoding state and each layer of eSQNN. In order to ensure all model configurations are trained, we define the following metric,

$\mathbbm{1}_{\text{train}}=\prod^{L}_{l=1}\mathbbm{1}\left(\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})<S_{l,th}\right)$ (5)

where $\mathbbm{1}(\cdot)$ stands for an indicator function. To verify that the metric works as intended, we consider the following two cases. If all model configurations satisfy $\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})<S_{l,th}$, then $\mathbbm{1}_{\text{train}}=1$, which means the barren plateau is avoided. On the other hand, if there exists an $l$ satisfying $\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})\geq S_{l,th}$, then $\mathbbm{1}_{\text{train}}=0$, which means training suffers from the barren plateau. We conjecture that eSQNN is more robust to the barren plateau than Vanilla SQNN because the event $\mathbbm{1}_{\text{train}}=1$ occurs more frequently in eSQNN. More details are in Sec. VI-B.
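The entanglement entropy in (4) can be computed numerically from a reduced density matrix. The following minimal sketch (with illustrative two-qubit states, not eSQNN states) shows the partial-trace computation.

```python
import numpy as np

def entanglement_entropy(psi, dim_a, dim_b):
    # psi: pure state of a bipartite system A⊗B, reshaped to a dim_a x dim_b matrix.
    # The reduced density matrix rho_A = Tr_B(|psi><psi|), and
    # S(rho_A) = -Tr(rho_A log rho_A) is computed from its eigenvalues.
    m = psi.reshape(dim_a, dim_b)
    rho_a = m @ m.conj().T
    evals = np.linalg.eigvalsh(rho_a)
    evals = evals[evals > 1e-12]          # drop numerical zeros
    return float(-np.sum(evals * np.log(evals)))

# Product state |00>: zero entanglement entropy.
product = np.kron([1.0, 0.0], [1.0, 0.0])
# Maximally entangled Bell state (|00> + |11>)/sqrt(2): entropy log(2).
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

print(entanglement_entropy(product, 2, 2))  # ~0.0
print(entanglement_entropy(bell, 2, 2))     # ~0.6931 = log 2
```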

eSQNN Local Training. This subsection presents the eSQNN local training algorithm. In general, classical SNNs use the IPKD regularizer $\mathcal{L}_{KD}$ to transfer knowledge from a large model to a small model [20], which can be expressed as,

$\mathcal{L}_{KD}=D_{KL}\left(p(\bm{y}^{k,L}_{t,e})\,\|\,p(\bm{y}^{k,l}_{t,e})\right)$ (6)

where $D_{KL}$ is the KL divergence. IPKD is ill-suited when the difference between the outputs of the two models becomes large, in which case the KL divergence may even diverge. Alternatively, we propose the IPFD regularizer $\mathcal{L}_{FD}$, inspired by Uhlmann's fidelity function in quantum information theory, which measures the similarity between two quantum states [44]. Precisely, the fidelity of the quantum states in the $L$-th and $l$-th model configurations is defined as follows,

$\mathcal{F}(\bm{\psi}_{L},\bm{\psi}_{l})=|\langle\bm{\psi}_{L}|\bm{\psi}_{l}\rangle|^{2}.$ (7)

In (7), if $\mathcal{F}(\bm{\psi}_{L},\bm{\psi}_{l})\approx 1$, $\bm{\psi}_{l}$ is similar to $\bm{\psi}_{L}$, which means the logits of the $l$-th model are almost the same as the logits of the $L$-th model. On the other hand, the opposite condition $\mathcal{F}(\bm{\psi}_{L},\bm{\psi}_{l})\approx 0$ means the $l$-th model does not follow the $L$-th model.

Consequently, in a classification task, the local training of an eSQNN with the IPFD regularizer is described as follows. The pair $(\bm{x},\bm{y})$ denotes a data sample and its label, respectively. The label $\bm{y}=\{y_{c}\}^{C}_{c=1}$ is a one-hot encoded vector wherein the element $y_{c}$ is unity for the true class and zero otherwise, i.e., $y_{c^{\prime}}=0,\forall c^{\prime}\neq c$. Hereafter, we describe the local training of the parameters of local device $k$ in the $t$-th communication round and $e$-th local training iteration. The logit of class $c$ and its prediction under the $l$-th model are denoted as,

$y^{k,l,c}_{t,e}=\exp\left(a\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}\odot\sum^{l}_{l^{\prime}=1}\Xi_{l^{\prime}}}\right),$ (8)
$p(y^{k,l,c}_{t,e}|\bm{x})=\frac{y^{k,l,c}_{t,e}}{\sum^{C}_{c=1}y^{k,l,c}_{t,e}},$ (9)

where $a$ represents the observable hyperparameter. Additionally, the cross-entropy loss and the fidelity regularizer are given as,

$\mathcal{L}_{CE}=-\sum^{C}_{c=1}\left[y_{c}\log p(y^{k,l,c}_{t,e}|\bm{x})\right],$ (10)
$\mathcal{L}_{FD}=1-\mathcal{F}(\bm{\psi}^{k,L}_{t,e,\bm{x}},\bm{\psi}^{k,l}_{t,e,\bm{x}}).$ (11)

The loss function is given as,

$\mathcal{L}^{k,l}_{t,e}=\frac{1}{D}\sum_{(\bm{x},\bm{y})\in\zeta^{k}}\left[\lambda\mathcal{L}_{CE}+(1-\lambda)\mathcal{L}_{FD}\right]$ (12)

where $D$ and $\lambda$ stand for the batch size and the balancing parameter of the fidelity regularization, respectively. The gradient of (12) can be calculated with the parameter shift rule [26]. Algorithm 1 summarizes the local training process within one communication round. After training with Algorithm 1, the gradient to be transmitted to the server is given as,

$g^{k}_{t}=\sum^{E}_{e=1}\sum^{L}_{l=1}\nabla_{\bm{\theta}^{k}_{t,e}}\mathcal{L}^{k,l}_{t,e}$ (13)

where the per-iteration, per-depth gradients are accumulated, and $\eta_{t}$ (used in the update of Algorithm 1) denotes the learning rate at communication round $t$.
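A minimal sketch of the per-sample objective in (8)-(12) is given below; the observables, quantum states, and the default $\lambda$ are placeholder values rather than outputs of an actual eSQNN.

```python
import numpy as np

def fidelity(psi_L, psi_l):
    # Uhlmann fidelity between pure states: F = |<psi_L | psi_l>|^2, bounded in [0, 1].
    return np.abs(np.vdot(psi_L, psi_l)) ** 2

def esqnn_loss(obs_l, y_onehot, psi_L, psi_l, a=2.0, lam=0.01):
    # Logits (8): y_c = exp(a * <V_c>); prediction (9): normalize over classes.
    logits = np.exp(a * obs_l)
    probs = logits / logits.sum()
    ce = -np.sum(y_onehot * np.log(probs))   # cross-entropy (10)
    fd = 1.0 - fidelity(psi_L, psi_l)        # IPFD regularizer (11)
    return lam * ce + (1.0 - lam) * fd       # combined loss (12), single sample

# Placeholder values: 4-class observables in [-1, 1] and two 4-amplitude states.
obs_l = np.array([0.3, -0.1, 0.7, -0.5])
y = np.array([0.0, 0.0, 1.0, 0.0])
psi_L = np.array([0.8, 0.1, 0.5, 0.3]); psi_L /= np.linalg.norm(psi_L)
psi_l = np.array([0.7, 0.2, 0.5, 0.4]); psi_l /= np.linalg.norm(psi_l)

print(round(float(esqnn_loss(obs_l, y, psi_L, psi_l)), 4))
```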

1 Initialization. Local eSQNN parameters $\bm{\theta}$;
2 for $e=\{1,2,\dots,E\}$ do
3   for $(\bm{x},y)\in\mathcal{D}$ do
4     Get logits with the $L$-th model;
5     Calculate loss with labels and accumulate loss;
6     for $l=\{1,2,\dots,L-1\}$ do
7       Get logits with the $l$-th model;
8       Calculate loss gradient with the parameter-shift rule;
9     $\bm{\theta}^{k}_{t,e+1}\leftarrow\bm{\theta}^{k}_{t,e}-\eta_{t}\nabla_{\bm{\theta}^{k}_{t,e}}\mathcal{L}^{k,l}_{t,e}$;
Algorithm 1 Local-eSQNN Training

IV Entangled Slimmable Quantum Federated Learning

IV-A Superposition Coding & Successive Decoding

The successful reception of a wireless signal is mainly affected by the signal-to-interference-plus-noise ratio (SINR) [45]. At a receiver, SINR can be expressed as,

$\gamma=\chi d^{-\beta}P/(\sigma^{2}+P^{I})$ (14)

where $P$, $P^{I}$, $d$, and $\sigma^{2}$ denote the transmission power, interference power, transmitter-receiver distance, and noise power, respectively. In addition, $\beta\geq 2$ is a path loss exponent and $\chi$ is a small-scale fading parameter (i.e., Rayleigh fading). Following Shannon's capacity formula with a Gaussian codebook, the received throughput with bandwidth $W$ is $R=W\log_{2}(1+\gamma)$ (bits/sec). When the transmitter encodes raw data with a code rate $u$, its receiver successfully decodes the encoded data if $R>u$. Finally, the decoding success probability is given as follows,

$\Pr(R\geq u)=\Pr\!\left(\frac{\chi d^{-\beta}P}{\sigma^{2}+P^{I}}\geq u^{\prime}\right)$ (15)

where $u^{\prime}=2^{\frac{u}{W}}-1$. Consider transmitting $L$ messages from a transmitter to a receiver simultaneously. Before transmission, these messages are SC-encoded [46], and the whole transmission power budget $P$ is split so that the $l$-th message is allotted transmission power $P_{l}=\nu_{l}P$ for $l\in[1,L]$. Note that $\nu_{l}$ is an allocation variable such that $\nu_{l}>u^{\prime}\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}$, $\sum^{L}_{l=1}\nu_{l}=1$, and $\nu_{l}\geq 0$ for all $l$.

The SC-encoded message is meant to be sequentially decoded at the receiver by first decoding the strongest signal, then canceling out the decoded signal, and finally decoding the next strongest signal, i.e., SD, also known as successive interference cancellation [47, 48]. The small-scale fading parameter $\chi$ under Rayleigh fading follows an exponential distribution, i.e., $\chi\sim\exp(1)$. Assuming $l^{\prime}>l$, the receiver may gradually decode the $l$-th message while experiencing the remaining messages as interference $P_{l}^{I}$, i.e.,

$P_{l}^{I}=\chi d^{-\beta}P\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}},$ (16)

for $l\leq L-1$. However, $P^{I}_{L}=0$ as there is no interference for the last message. Let $R_{l}$ represent the throughput of the $l$-th message. Then, the distribution of $R_{l}$ is given as,

$\Pr(R_{l}\geq u)=\Pr\!\left(\chi\geq\frac{1/\bar{\gamma}}{\nu_{l}/u^{\prime}-\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}}\right)$ (17)

where $\bar{\gamma}=\frac{Pd^{-\beta}}{\sigma^{2}}$ denotes the average signal-to-noise ratio (SNR). Using this result, the $l$-th message's decoding success probability $p_{l}$ can be expressed as follows,

$p_{l}=\Pr(R_{1}\geq u,\cdots,R_{l}\geq u)$ (18)
$\;\;=\Pr\!\left(\chi\geq\max\!\left(\frac{1/\bar{\gamma}}{\nu_{1}/u^{\prime}-\sum^{L}_{l^{\prime}=2}\nu_{l^{\prime}}},\cdots,\frac{1/\bar{\gamma}}{\nu_{l}/u^{\prime}-\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}}\right)\right).$ (19)
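Under Rayleigh fading, $\Pr(\chi\geq x)=\exp(-x)$, so (18)-(19) admit a direct numerical evaluation. The following sketch is illustrative: the allocation $\bm{\nu}$ is the $L=3$ value reported in Sec. VI, while the average SNR and $u^{\prime}$ are placeholder assumptions.

```python
import numpy as np

def decoding_success_probs(nu, gamma_bar, u_prime):
    # nu: power allocation (nu_1, ..., nu_L) with sum(nu) = 1.
    # Returns p_l = Pr(R_1 >= u, ..., R_l >= u) for each l, per (18)-(19),
    # using Pr(chi >= x) = exp(-x) for chi ~ exp(1) (Rayleigh fading).
    L = len(nu)
    thresholds = []
    for j in range(L):  # decoding threshold for the j-th message (0-indexed)
        denom = nu[j] / u_prime - sum(nu[j + 1:])
        if denom <= 0:  # SC constraint nu_l > u' * sum_{l'>l} nu_l' violated
            return np.zeros(L)
        thresholds.append((1.0 / gamma_bar) / denom)
    return np.array([np.exp(-max(thresholds[: l + 1])) for l in range(L)])

gamma_bar = 10 ** (17 / 10)              # avg. SNR of 17 dB, as in the experiments
u_prime = 1.0                            # placeholder threshold u' = 2^{u/W} - 1
nu = np.array([0.8909, 0.0989, 0.0102])  # the L = 3 allocation reported in Sec. VI
print(np.round(decoding_success_probs(nu, gamma_bar, u_prime), 4))
```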
1 Notation. $\bm{\theta}^{k}_{t}$: $k$-th device's parameters, $\Theta_{t}$: parameters of the global eSQNN, $X_{l}$: set of devices whose $l$-th subdivided gradient is received;
2 Initialization. $X_{l}\leftarrow\emptyset,\forall l\in[1,L]$;
3 for $k=\{1,\dots,K\}$ do
4   Sample $\chi^{k}\sim\exp(1)$;
5   for $l=\{1,2,\dots,L\}$ do
6     if $\chi^{k}\geq u_{l}$ then
7       $X_{l}\leftarrow X_{l}\cup\{k\}$;
8 if $\prod^{L}_{l=1}\mathbbm{1}(X_{l}\neq\emptyset)\neq 0$ then
9   $\Theta_{t+1}\leftarrow\Theta_{t}-\eta_{t}\sum^{L}_{l=1}\frac{1}{|X_{l}|}\sum_{k\in X_{l}}g^{k}_{t}\odot\Xi_{l}$;
10 else
11   Skip aggregation;
12 for $k=\{1,\cdots,K\}$ do
13   $\bm{\theta}^{k}_{t+1,1}\leftarrow\Theta_{t+1}$;
Algorithm 2 eSQFL

IV-B eSQFL Operations

This section describes the operations of eSQFL, summarized in Algorithm 2. First of all, the local devices are trained with Algorithm 1. Power allocation is then conducted to configure the SC-encoded model parameters, i.e., $\bm{\nu}=\{\nu_{l}\}^{L}_{l=1}$ for the gradients $\{g^{k}_{t}\odot\Xi_{l}\}^{L}_{l=1}$ of the subdivided model configurations. After that, the local devices transmit their SC-encoded model parameters to the server. The server decodes the devices' SC-encoded model parameters with SD. If the server receives at least one local gradient for every model configuration, it aggregates them; otherwise, no aggregation occurs. For the aggregation of each subdivided model configuration, FedAvg is utilized [34]. The global update rule of eSQFL is formalized in Sec. V.
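The server-side step of Algorithm 2 can be sketched as follows; the masks, gradients, and reception sets are simplified placeholders, and this is only one way to instantiate the level-wise FedAvg aggregation, not the exact implementation.

```python
import numpy as np

def esqfl_aggregate(Theta, grads, received, masks, eta):
    # Theta: global parameters; grads[k]: device k's accumulated gradient g_t^k.
    # received[l]: list of device indices in X_l; masks[l]: binary mask Xi_l.
    # Level-wise FedAvg over the received masked gradients, then a global step;
    # aggregation is skipped if any level received no gradient (as in Algorithm 2).
    if any(len(X_l) == 0 for X_l in received):
        return Theta
    update = np.zeros_like(Theta)
    for X_l, Xi_l in zip(received, masks):
        update += np.mean([grads[k] * Xi_l for k in X_l], axis=0)
    return Theta - eta * update

# Toy setup: 6 parameters split into L = 2 depth levels, K = 3 devices.
rng = np.random.default_rng(0)
Theta = rng.normal(size=6)
grads = {k: rng.normal(size=6) for k in range(3)}
masks = [np.array([1, 1, 1, 0, 0, 0.]), np.array([0, 0, 0, 1, 1, 1.])]
received = [[0, 1, 2], [0, 2]]  # e.g., device 1's second-level message was not decoded
print(np.round(esqfl_aggregate(Theta, grads, received, masks, eta=0.01), 3))
```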

V Convergence Analysis

V-A Setup

In order to analyze the convergence rate of eSQFL, the following assumptions are considered. Firstly, the local-side decoding is always successful (Algorithm 2, lines 12–13) because the server-side transmission power is higher than the uplink power. Secondly, $K$ is assumed to be large enough such that $|X_{l}|\approx Kp_{l}$ for all $l$. During the $t$-th communication round, the server builds the global model, which can be expressed as follows,

$\Theta_{t+1}\leftarrow\Theta_{t}-\eta_{t}\underbrace{\sum^{L}_{l=1}\frac{1}{Kp_{l}}\sum_{k\in X_{l}}g^{k}_{t}\odot\Xi_{l}}_{:=f_{t}}.$ (20)

The objective function of the global model and the local objective functions are denoted as $F$ and $\{F^{k}\}$, respectively. The bar notation $\bar{\cdot}$ is used for the value averaged over $\{\zeta_{t}^{k}\}$, and the superscript $*$ indicates the optimum. For mathematical tractability, we consider the following assumptions on $F$ and $\{F^{k}\}$, as used in [49].

Assumption 1 ($\bm{\beta}$-Smoothness).

If $F$ and $\{F^{k}\}$ are $\beta$-smooth,

$F^{k}(\bm{\theta}_{v})\leq F^{k}(\bm{\theta}_{w})+(\bm{\theta}_{v}-\bm{\theta}_{w})^{T}\nabla F^{k}(\bm{\theta}_{w})+\frac{\beta}{2}\|\bm{\theta}_{v}-\bm{\theta}_{w}\|^{2},$ (21)

for all $v,w>0$.

Assumption 2 ($\bm{\mu}$-Strong Convexity).

If $F$ and $\{F^{k}\}$ are $\mu$-strongly convex,

$F^{k}(\bm{\theta}_{v})\geq F^{k}(\bm{\theta}_{w})+(\bm{\theta}_{v}-\bm{\theta}_{w})^{T}\nabla F^{k}(\bm{\theta}_{w})+\frac{\mu}{2}\|\bm{\theta}_{v}-\bm{\theta}_{w}\|^{2},$ (22)

for all $v,w>0$.

Assumption 3 (Bounded Local Gradient Variance).

For every device $k\in\mathbb{N}[1,K]$ and its local data $\zeta^{k}\in\mathbf{Z}$, the difference between the local gradient $\nabla_{\bm{\theta}}F^{k}(\bm{\theta}^{k};\zeta^{k})$ and $\nabla_{\bm{\theta}}\bar{F}^{k}(\bm{\theta}^{k};\mathbf{Z})$ is bounded, i.e.,

$\mathbb{E}[\|\nabla_{\bm{\theta}}F^{k}(\bm{\theta}^{k},\zeta^{k}_{t})-\nabla_{\bm{\theta}}\bar{F}^{k}(\bm{\theta}^{k};\mathbf{Z})\|^{2}]\leq\sigma_{k}^{2}.$ (23)

According to [40], the metric for the non-IIDness of $\mathbf{Z}$ is given as follows,

$\delta=\frac{1}{K}\sum^{K}_{k=1}\sigma_{k}^{2}.$ (24)

V-B Convergence Analysis

In classical ML, the convergence of FedAvg has been analyzed by assuming bounded local gradients in [49]. Without such an unrealistic assumption, the convergence bound of SFL has been derived in [1]. In quantum ML, local gradients can be shown to be inherently bounded thanks to the bounded fidelity and the parameter shift rule computing quantum gradients [26]. Hence, rather than adopting the methods in [1], we first derive the local gradient bound, and then derive the convergence bound of eSQFL by following the steps in [49]. The detailed proofs are deferred to the Appendix; only the results are presented as elaborated next.

Lemma 1 (Bounded Local Gradient).

For $t\geq 1$ and $\eta_{t}\leq\eta_{t+1}$, it follows that

$\mathbb{E}[\|g^{k}_{t}\|^{2}]\leq EL(2+(a-2)\lambda)^{2}.$ (25)
Lemma 2 (Bounded Global Gradient).

For $t\geq 1$, the global gradient is bounded as,

$\mathbb{E}[\|f_{t}\|^{2}]\leq EL^{2}(2+(a-2)\lambda)^{2}\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}.$ (26)
Lemma 3 (Bounded Global Gradient Variance).

Under Assumption 3, the variance of the global gradient $f_{t}$ over $\mathbf{Z}$ is bounded as,

$\mathbb{E}\|f_{t}-\bar{f}_{t}\|^{2}\leq L\delta\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}.$ (27)

Note that Lemmas 2 and 3 are different, in the sense that Lemma 2 focuses on the actual gradient, whereas Lemma 3 is related to the data distributions. The convergence analysis utilizes Lemmas 1–3, and the convergence of eSQFL can then be proven following [49].

Theorem 1 (eSQFL Convergence).

Under Assumptions 1 and 3 with the learning rate $\eta_{t}=\frac{2}{\mu t+2\beta-\mu}$, we obtain

$\mathbb{E}[F(\theta_{t})]-F^{*}\leq\frac{\beta}{\mu}\cdot\frac{\mu\beta\Delta_{1}+2B}{\mu t+2\beta-\mu},$ (28)

where $\Delta_{t}\triangleq\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}$, (29)

$B=\left(EL^{2}(2+(a-2)\lambda)^{2}+L\delta\right)\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}.$ (30)

Hence, $\lim_{t\rightarrow\infty}\mathbb{E}[F(\theta_{t})]=F^{*}$.

Theorem 1 provides several insights into eSQFL as follows.

  1. Failure under extremely poor channels: Consider an extremely poor channel condition, where the server cannot receive the $[l,L]$-th model configurations, i.e., $p_{l^{\prime}}\simeq 0,\forall l^{\prime}\in[l,L]$. In this case, the RHS of (28) diverges.

  2. Importance of successful reception: The optimality gap of eSQFL becomes smaller by increasing the communication opportunities. Consider a perfect channel condition, where the RHS of (28) is minimized. By optimizing the SC transmission, the optimality gap is reduced, as formalized in Proposition 1 and Corollary 1.

  3. Other important metrics: The optimality gap is affected by the number of local iterations per communication round $E$, the balancing factor $\lambda$, and the number of layers $L$.

Proposition 1 (Optimal SC Power Allocation).

The transmission power allocation $\bm{\nu}^{*}$ minimizing the optimality gap is given as,

$\bm{\nu}^{*}=\arg\min_{\bm{\nu}}\left(\sum^{L}_{l=1}\exp\left(-\frac{2/\bar{\gamma}}{\nu_{l}/u^{\prime}-\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}}\right)\right)$ (31)

where $L\geq 2$, $\nu_{l}>u^{\prime}\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}$ for all $l\in[1,L)$, and $\sum^{L}_{l=1}\nu_{l}=1$.

Proof.

Substituting the term $p_{l}$ into Theorem 1, the optimality gap is minimized by optimizing the power allocation. ∎

Corollary 1 (Low SNR, $\bm{L=2}$).

For $L=2$, $\bar{\gamma}\to 0$, and $u^{\prime}\geq(1+\sqrt{5})/2\approx 1.618$, the optimal power allocation is given as,

$(\nu_{1}^{*},\nu_{2}^{*})=\left(-\frac{\sqrt{u^{\prime}+1}-u^{\prime 2}+1}{u^{\prime 2}+u^{\prime}},\;1+\frac{\sqrt{u^{\prime}+1}-u^{\prime 2}+1}{u^{\prime 2}+u^{\prime}}\right).$ (32)

Proof.

Since $\exp(-x)\approx 1-x$ for $x\to 0$, the RHS of (31) becomes $2+\frac{2/\bar{\gamma}}{\nu_{1}/u^{\prime}-(1-\nu_{1})}+\frac{2/\bar{\gamma}}{(1-\nu_{1})/u^{\prime}}$, which is piece-wise convex. Applying the first-order necessary condition (FONC) with respect to $\nu_{1}$ completes the proof. ∎
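One way to instantiate the power-allocation optimization numerically is to search directly over the factor $\sum^{L}_{l=1}1/p_{l}^{2}$ appearing in $B$ of Theorem 1, with $p_{l}$ from (18)-(19). The sketch below does this for $L=2$ by a simple grid search, under placeholder SNR and $u^{\prime}$ values; it is not the paper's exact solver.

```python
import numpy as np

def success_probs(nu, gamma_bar, u_prime):
    # p_l from (18)-(19) under Rayleigh fading: Pr(chi >= x) = exp(-x).
    L, thr = len(nu), []
    for j in range(L):
        denom = nu[j] / u_prime - sum(nu[j + 1:])
        if denom <= 0:
            return None                     # violates nu_l > u' * sum_{l'>l} nu_l'
        thr.append((1.0 / gamma_bar) / denom)
    return np.array([np.exp(-max(thr[: l + 1])) for l in range(L)])

def bound_factor(nu, gamma_bar, u_prime):
    # The sum_l 1/p_l^2 factor that multiplies into the constant B of Theorem 1.
    p = success_probs(nu, gamma_bar, u_prime)
    return np.inf if p is None else float(np.sum(1.0 / p ** 2))

gamma_bar, u_prime = 10 ** (17 / 10), 1.0   # placeholder: 17 dB avg. SNR, u' = 1
grid = np.linspace(0.01, 0.99, 981)
best = min(grid, key=lambda v1: bound_factor(np.array([v1, 1 - v1]), gamma_bar, u_prime))
print(round(float(best), 3), round(float(1 - best), 3))  # searched (nu_1, nu_2)
```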

TABLE I: List of simulation parameters.
Description Value
Number of devices ($N$) 10
Local iterations per communication round ($E$) 10
Epochs ($T$) 100
Optimizer SGD
Learning rate ($\eta_{1}$) 0.01
Decaying rate 0.001
Observable hyperparameter ($a$) 2
Number of qubits 4
Number of parameters in eSQFL & Vanilla QFL 36
Number of data per device 128
Batch size ($D$) 32

VI Experiments

VI-A Experimental Design

To corroborate the main analysis and hypothesis of this paper, the experiments are designed as follows:

  • From Sec. V-A, the derived convergence bound is highly affected by the decoding success probability and non-IIDness. To corroborate these results numerically, we compare the top-1 accuracy of eSQFL in various channel conditions and degrees of non-IIDness with Vanilla QFL (referred to Fig. 1(a)).

  • We investigate the advantage of CU gates that compose eSQNN by designing an experiment which measures entanglement entropy and top-1 accuracy of eSQNN and standard QNNs under the same conditions. Then, the two metrics are compared to demonstrate the advantage of CU gates.

  • The increased effectiveness of local training with IPFD compared to IPKD is demonstrated. IPFD trains the local models by regularizing the fidelity of two quantum states, whereas IPKD trains the local models by making the small model follow the large model's predictions. A benchmark comparing the fidelity and top-1 accuracy of IPFD and IPKD is designed.

  • According to Proposition 1 and Corollary 1, the convergence bound is minimized by optimal transmission power allocation. To corroborate this, we compare the optimal power allocation scheme to its random power allocation counterpart.

  • Finally, we conduct experiments by controlling various variables and assess their impact on performance.

Figure 3: Class distributions with different Dirichlet concentration $\alpha$: (a) $\alpha=0.1$, (b) $\alpha=1.0$, (c) $\alpha=10$.

For the experiments, eSQFL and Vanilla QFL are evaluated. eSQFL is the proposed framework which leverages eSQNN. This specific QNN consists of three sub-models named 'L1', 'L2', and 'L3'. In contrast, Vanilla QFL uses a standard QNN made up of basic quantum gates [7], and does not consider SC and SD [15]. Despite the difference in structure, both eSQNN and the standard QNN use an equivalent number of parameters. Moreover, we conduct ablation studies on our eSQNN by comparing it with Vanilla SQNN, a depth-controllable yet entanglement-fixed QNN. Since the performance of QFL suffers under a system with a large number of qubits, many QFL works use a simple dataset [15]. In this paper, the MNIST dataset is transformed into a simpler form: the dimension of each MNIST image is reduced to $4\times 4$ by inter-area interpolation, and only four classes are used (i.e., 0, 1, 2, and 3) [50]. The four classes are represented with red, blue, black, and green, respectively. In addition, the Dirichlet distribution is used to investigate the non-IIDness of data [51]. Fig. 3 shows the data distribution with different values of the Dirichlet concentration ratio $\alpha$. Data with a high Dirichlet concentration ratio (i.e., $\alpha=10$) are IID, while data with a low Dirichlet concentration ratio (i.e., $\alpha=0.1$) are non-IID.
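A Dirichlet-based split in the spirit of Fig. 3 can be generated as sketched below; drawing per-class device proportions from $\mathrm{Dir}(\alpha)$ is a common convention and only an assumption about the exact procedure used here.

```python
import numpy as np

def dirichlet_split(labels, num_devices, alpha, rng):
    # For each class, draw device proportions from Dir(alpha) and split that
    # class's sample indices accordingly. Small alpha -> skewed (non-IID) splits,
    # large alpha -> near-uniform (IID) splits.
    device_idx = [[] for _ in range(num_devices)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_devices))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for dev, part in enumerate(np.split(idx, cuts)):
            device_idx[dev].extend(part.tolist())
    return device_idx

rng = np.random.default_rng(1)
labels = rng.integers(0, 4, size=1280)                    # 4 classes, as in mini-MNIST
for alpha in (0.1, 1.0, 10.0):
    split = dirichlet_split(labels, num_devices=10, alpha=alpha, rng=rng)
    counts = np.array([np.bincount(labels[d], minlength=4) for d in split])
    print(f"alpha={alpha}: per-device class counts (first 3 devices)\n{counts[:3]}")
```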

To compare IPFD and IPKD, we initialize the parameters of eSQNN identically. The simulation parameters used in these numerical experiments are summarized in Tab. I.

Figure 4: Comparison of top-1 accuracy under various avg. SNR [dB] and $\alpha$: (a) $\alpha=0.1$, (b) $\alpha=1$, (c) $\alpha=10$.
Figure 5: Comparison of top-1 accuracy under various $\alpha$ ($\bar{\gamma}=17\,$dB): (a) $\alpha=0.1$, (b) $\alpha=1$, (c) $\alpha=10$.

VI-B Numerical Results

Numerical Results and Convergence Analysis. According to Theorem 1, the convergence bound decreases if the decoding success probability increases. Fig. 4 shows the performance of eSQFL under various channel conditions obtained through various $\sigma^{2}$. As $\bar{\gamma}$ increases from $11\,$dB to $19\,$dB, the decoding success probability and the top-1 accuracy of eSQFL with all layers increase. The small models, i.e., eSQFL-L2 and eSQFL-L1, also show improvement in performance along with eSQFL-L3. Especially, eSQFL-L2 shows a significant improvement in top-1 accuracy from $28\%$ to $39\%$. Fig. 5 shows the top-1 accuracy and convergence of eSQFL and the comparison models. When $\bar{\gamma}=17\,$dB, the sub-models in eSQFL (i.e., eSQFL-L2, eSQFL-L3) achieve higher accuracy than Vanilla QFL. The final standard deviations of eSQFL under $\bar{\gamma}=17\,$dB are 0.041, 0.051, and 0.066 for eSQFL-L1, eSQFL-L2, and eSQFL-L3, respectively.

According to Theorem 1, the data distribution affects the convergence bound of eSQFL: with non-IID data, the convergence bound is widened. As shown in Fig. 5, we test various Dirichlet concentrations, i.e., $\alpha=\{0.1,1,10\}$. The overall performance of all comparison models decreases as $\alpha$ decreases. However, eSQFL shows robustness under non-IID data distributions. Vanilla QFL shows low top-1 accuracy under $\alpha=1$ and $\alpha=0.1$. In contrast, eSQFL maintains top-1 accuracies of $52\%$ and $41\%$ under $\alpha=1.0$ and $\alpha=0.1$, respectively. From the results in Fig. 4 and Fig. 5, eSQFL is robust under various channel conditions and non-IID data distributions.

Figure 6: Model architectural difference (eSQNN vs. Vanilla SQNN): (a) top-1 accuracy, (b) fidelity, (c) entropy.
Figure 7: Comparison of the IPFD training algorithm under non-IID and IID data: (a) IPKD regularization, (b) IPFD regularization.
Figure 8: Comparison of the fidelity training algorithm under non-IID and IID data: (a) $\alpha=0.1$, (b) $\alpha=10$.

Structural Advantage of eSQNN. In this subsection, we investigate the general performance, fidelity, and entanglement entropy of eSQNN. We conduct ablation studies on the model architecture (i.e., eSQNN vs. Vanilla SQNN). Fig. 6 shows the experimental results. As shown in Fig. 6(a), eSQNN shows better top-1 accuracy than Vanilla SQNN. Especially, eSQNN-L3 achieves a $20\%$ performance improvement over Vanilla SQNN. When eSQNN is used, the quantum state (i.e., knowledge) is successfully distilled to the small model, as shown in Fig. 6(b). In contrast, Vanilla SQNN fails to distill the knowledge to its sub-models. To understand why eSQNN is successfully trained, we calculate the entanglement entropy (cf. Sec. III). Fig. 6(c) exhibits the von Neumann entanglement entropy of each layer of eSQNN and Vanilla SQNN. The entanglement entropy of eSQNN is lower than that of Vanilla SQNN for all layers. This means that the event of exceeding the entropy threshold defined in Sec. III, i.e., $\mathbbm{1}_{\text{train}}=0$, rarely occurs compared to Vanilla SQNN. This underscores that eSQNN is more robust to barren plateaus than Vanilla SQNN.

Effectiveness of IPFD. To investigate the effectiveness of IPFD in eSQNN local training, we compare local training using the IPFD regularizer to that using the IPKD regularizer. Figs. 7(a)/(b) show the learning curves of $\mathcal{L}_{KD}$ and $\mathcal{L}_{FD}$, respectively. The learning curve of IPFD starts at $\mathcal{L}_{FD}=0$ because the fidelity is initially $\mathcal{F}(\bm{\psi}_{l},\bm{\psi}_{L})\approx 1$. As eSQNN is trained, the fidelity decreases and converges to 0.955 for L1 and 0.987 for L2. The learning curve of $\mathcal{L}_{KD}$ also tends to decrease and converge. However, the fluctuation of IPKD regularization is larger than that of IPFD, especially for eSQFL-L1. This is because the KL divergence becomes unstable when the difference between the two distributions is large, i.e., the overlapping area between the distributions is small; if there is no overlapping area, it diverges. In contrast, this phenomenon does not occur with IPFD regularization because the IPFD regularizer is bounded between 0 and 1. Therefore, IPFD regularization provides a more stable regularization signal to eSQNN than IPKD regularization.

TABLE II: Top-1 accuracy [%] comparison with ($\bm{\nu}^{*}$) and without power allocation optimization.
Condition | $L=3$: $l=1$ / $l=2$ / $l=3$ | $L=2$: $l=1$ / $l=2$
with optimization ($\bm{\nu}^{*}$) | 29.6 / 40.1 / 55.8 | 33.4 / 50.7
w/o optimization | 29.5 / 39.3 / 50.7 | 33.0 / 48.1

Impact of Optimal Power Allocation. We verify Proposition 1 and Corollary 1. When $L=3$, we calculate the power allocation variable as $\bm{\nu}^{*}=\{0.8909,0.0989,0.0102\}$ by non-convex optimization. When $L=2$, we obtain $\bm{\nu}^{*}=\{0.8969,0.1059\}$. The comparison power allocations are set to $\bm{\nu}=\{0.9170,0.0820,0.001\}$ for $L=3$ and $\bm{\nu}=\{0.8333,0.1667\}$ for $L=2$. The final accuracies are reported in Tab. II. Compared to eSQFL with $\bm{\nu}$, eSQFL with $\bm{\nu}^{*}$ achieves $10.1\%$ higher top-1 accuracy when $L=3$ and $5.41\%$ higher top-1 accuracy when $L=2$. Thus, we corroborate that the optimal power allocation minimizes the convergence bound.

Impact of the Balancing Parameter. The balancing parameter $\lambda$ is an important parameter in eSQNN local training. Fig. 8 shows the top-1 accuracy according to $\lambda$ under various data distributions (i.e., $\alpha=0.1$ and $\alpha=10$). With a finely adjusted IPFD parameter ($\lambda^{*}=0.01$), eSQNN shows the highest top-1 accuracy under the non-IID data distribution (i.e., $\alpha=0.1$). In addition, when not using IPFD ($\lambda=0$) or only using IPFD ($\lambda=1$), eSQNN fails to classify the mini-MNIST dataset. Under the IID data distribution (i.e., $\alpha=10$), eSQNN with $\lambda^{*}=0.01$ outperforms eSQNN using only label training ($\lambda=0$) by about $1.03\%$. From these results, we recommend utilizing eSQNN with $\lambda^{*}=0.01$ for robust performance under both IID and non-IID data distributions.

VII Conclusions

In this paper, we developed a depth-adjustable QNN architecture and proposed a novel QFL framework over wireless communications, termed eSQNN and eSQFL, respectively. To control the level of entanglement and reduce its entropy, we applied CU gates to the eSQNN architecture. To mitigate the inter-depth interference, we introduced a novel IPFD regularizer inspired by the fidelity in quantum information theory. Finally, to cope with various channel conditions, we applied SC across multiple depths and optimized the SC power allocation by deriving and minimizing the convergence bound of eSQFL. In conclusion, we proposed a QFL model that shows stable performance despite the NISQ limitations and variable channel conditions. The fidelity regularizer, in particular, decreases the error rate of quantum ML in a way that is tailored to quantum computing, rather than depending on classical optimization methods. Since the strengths of our model have been theoretically corroborated in this paper, we will go on to test its realistic efficacy. In future research, we will apply it to a plethora of real-life scenarios with various practical limitations.

TABLE III: List of notations.
Notation Description
$K$ Number of local devices, $[1,\cdots,k,\cdots,K]$
$L$ Number of local eSQNN blocks, $[1,\cdots,l,\cdots,L]$
$E$ Number of local iterations, $[1,\cdots,e,\cdots,E]$
$T$ Number of communication rounds, $[1,\cdots,t,\cdots,T]$
$\bm{\Xi}$ Binary masks, $\bm{\Xi}=\{\Xi_{1},\cdots,\Xi_{l},\cdots,\Xi_{L}\}$
$\bm{\psi}$ Quantum state
$\rho$ Reduced density matrix
$S_{l}(\rho)$ Entanglement entropy of $\rho$ over subsystem $l$
$\bm{Z}$ Whole data, $\bm{Z}=\{\zeta^{1},\cdots,\zeta^{k},\cdots,\zeta^{K}\}$
$\bm{\nu}$ Power allocation for SC, $\bm{\nu}=\{\nu_{1},\cdots,\nu_{l},\cdots,\nu_{L}\}$
$\alpha$ Dirichlet concentration

-A Parameter Shift Rule

The parameter shift rule [26], one of the most widely used quantum gradient calculators, is utilized to train the model. The eSQNN is accordingly trained using a zeroth-order stochastic gradient descent algorithm, e.g., the quantum natural gradient. Consider that the eSQNN consists of $I$ trainable parameters, i.e., $\bm{\theta}^{k}_{t,e}=[\theta^{k}_{t,e,1},\cdots,\theta^{k}_{t,e,i},\cdots,\theta^{k}_{t,e,I}]$. Then, the partial derivative of the $k$-th device's $c$-th observable with respect to parameter $\theta^{k}_{t,e,i}$ is given as follows,

$\frac{\partial\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}}}{\partial\theta^{k}_{t,e,i}}=\frac{\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}+\varepsilon\mathbf{e}_{i}}-\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}-\varepsilon\mathbf{e}_{i}}}{2\varepsilon}$ (33)

where $\mathbf{e}_{i}$ denotes the $i$-th standard basis vector, and $\varepsilon\in(0,\pi/2]$. We calculate the loss gradient using (33).
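A toy numerical illustration of the shifted-evaluation gradient in (33) is given below for a single-qubit circuit with $f(\theta)=\langle 0|R_{\text{y}}^{\dagger}(\theta)ZR_{\text{y}}(\theta)|0\rangle=\cos\theta$; the circuit is illustrative, and a small $\varepsilon$ is used so that the result closely matches the analytic derivative.

```python
import numpy as np

Z = np.diag([1.0, -1.0])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def expectation(theta):
    # <V> = <0| R_y(theta)^T Z R_y(theta) |0> = cos(theta) for this toy circuit.
    psi = ry(theta) @ np.array([1.0, 0.0])
    return psi @ (Z @ psi)

def shift_gradient(theta, eps=0.01):
    # Shifted-evaluation gradient as in (33): two extra circuit runs, no backpropagation.
    return (expectation(theta + eps) - expectation(theta - eps)) / (2 * eps)

theta = 0.9
print(round(float(shift_gradient(theta)), 4))  # ~ -0.7833
print(round(float(-np.sin(theta)), 4))         # analytic derivative of cos(theta)
```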

-B Proof of Lemma 1

Suppose the true label is class $c$; then the prediction terms for all other classes and their derivatives cancel out due to the definition of the cross-entropy. The cross-entropy loss is thus simplified as follows,

$\mathcal{L}_{CE}=-\log p(y^{k,l,c}_{t,e}|\bm{x}).$ (34)

Hereafter, we denote $\hat{y}_{c}=y^{k,l,c}_{t,e}$ and $p(\hat{y}_{c})=p(y^{k,l,c}_{t,e}|\bm{x})$. Let us denote the partial derivatives of the cross-entropy loss and the fidelity loss as,

$G_{1}=\frac{\partial\mathcal{L}_{CE}}{\partial\theta^{k}_{t,e,i}}=\frac{1}{p(\hat{y}_{c})}\cdot\frac{\partial p(\hat{y}_{c})}{\partial\theta^{k}_{t,e,i}},$ (35)
$G_{2}=\frac{\partial\mathcal{F}(\bm{\psi}^{k,L}_{t,e,\bm{x}},\bm{\psi}^{k,l}_{t,e,\bm{x}})}{\partial\theta^{k}_{t,e,i}}.$ (36)

By the triangle inequality, the partial derivative of (12) is bounded as follows,

$\left|\frac{\partial\mathcal{L}^{k,l}_{t,e}}{\partial\theta^{k}_{t,e,i}}\right|\leq\sum_{(\bm{x},\bm{y})\in\zeta^{k}}\left[\frac{\lambda}{D}|G_{1}|+\frac{1-\lambda}{D}|G_{2}|\right].$ (37)

We have

$G_{1}=a\left(\sum^{C}_{c^{\prime}\geq 1,c^{\prime}\neq c}\frac{\hat{y}_{c^{\prime}}}{\sum^{C}_{c=1}\hat{y}_{c}}\right)\cdot\frac{\partial\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}\odot\sum^{l}_{l^{\prime}=1}\Xi_{l^{\prime}}}}{\partial\theta^{k}_{t,e,i}}.$ (38)

The bound of $G_{1}$ is obtained as follows,

$|G_{1}|\leq a\left|\frac{\partial\langle V_{c^{\prime}}\rangle_{\bm{\theta}^{k}\odot\Xi_{l}}}{\partial\theta^{k}_{t,e,i}}\right|\leq a.$ (39)

The former step is due to $\sum^{C}_{c^{\prime}\geq 1,c^{\prime}\neq c}\hat{y}_{c^{\prime}}\leq\sum^{C}_{c=1}\hat{y}_{c}$, and the latter step is because the gradient computed with the parameter shift rule is bounded by $1$ [26]. The term $G_{2}$ and its bound are given as,

$G_{2}=2|\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle|\cdot\frac{\partial\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle}{\partial\theta^{k}_{t,e,i}},$ (40)
$|G_{2}|\leq 2\left|\frac{\partial\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle}{\partial\theta^{k}_{t,e,i}}\right|\leq 2.$ (41)

The former step is due to $|\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle|^{2}\leq 1$, and the latter step is due to the parameter shift rule. Substituting the bounds of $G_{1}$ and $G_{2}$ into the RHS of (37), we have the bound,

$\left|\frac{\partial\mathcal{L}^{k,l}_{t,e}}{\partial\theta^{k}_{i}}\right|\leq 2+(a-2)\lambda.$ (42)

Calculating the LHS of (37) for all $i\in[1,I]$, the loss gradient is obtained, and it is bounded as,

$\|\nabla_{\bm{\theta}^{k}_{t,e}}\mathcal{L}^{k,l}_{t,e}\|\leq 2+(a-2)\lambda.$ (43)

Applying these results to $g^{k}_{t}$, we complete the proof.

-C Proof of Lemma 2

We expand the global gradient $f_{t}$ as follows,

$\|f_{t}\|^{2}=\Big\|\sum^{L}_{l=1}\frac{1}{Kp_{l}}\sum_{k\in X_{l}}g^{k}_{t}\odot\Xi_{l}\Big\|^{2}$ (44)
$\leq\frac{L}{K}\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}\sum^{K}_{k=1}\|g^{k}_{t}\odot\Xi_{l}\|^{2}$ (45)
$\leq\frac{L}{K}\sum^{K}_{k=1}\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}\cdot\|g^{k}_{t}\|^{2}.$ (46)

The first step is due to Jensen’s inequality, i.e.,

$\Big\|\sum^{K}_{k=1}x_{k}\Big\|^{2}\leq K\sum^{K}_{k=1}\|x_{k}\|^{2}$ (47)

and the next step is due to the Cauchy-Schwarz inequality, i.e., $\|X\odot\Xi\|^{2}\leq\|X\|^{2}$ [1]. Combining Lemma 1 with the latter term of (46), we finalize the proof.
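For completeness, both inequalities admit one-line justifications (standard facts, restated here under the assumption that the mask entries are binary, i.e., $\Xi_{i}\in\{0,1\}$):
\[
\Big\|\sum^{K}_{k=1}x_{k}\Big\|^{2}=K^{2}\Big\|\frac{1}{K}\sum^{K}_{k=1}x_{k}\Big\|^{2}\leq K^{2}\cdot\frac{1}{K}\sum^{K}_{k=1}\|x_{k}\|^{2}=K\sum^{K}_{k=1}\|x_{k}\|^{2},
\qquad
\|X\odot\Xi\|^{2}=\sum_{i}\Xi_{i}^{2}X_{i}^{2}\leq\sum_{i}X_{i}^{2}=\|X\|^{2},
\]
where the first uses the convexity of $\|\cdot\|^{2}$ and the second uses $\Xi_{i}^{2}\leq 1$.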

-D Proof of Lemma 3

According to (20) and Assumption 3, the distance between $f_{t}$ and $\bar{f}_{t}$ is given as,

\|f_{t}-\bar{f}_{t}\|^{2} = \Big\|\sum^{L}_{l=1}\frac{1}{Kp_{l}}\sum^{K}_{k\in|X_{l}|}(g^{k}_{t}-\bar{g}^{k}_{t})\odot\Xi_{l}\Big\|^{2} (48)
\leq \frac{L}{K}\sum^{L}_{l=1}\sum^{K}_{k=1}\frac{1}{p_{l}^{2}}\|g^{k}_{t}-\bar{g}^{k}_{t}\|^{2}. (49)

This step is due to Jensen's inequality, as in (44)–(46). With Assumption 3, we have $\mathbb{E}\|g^{k}_{t}-\bar{g}^{k}_{t}\|^{2}\leq\sigma_{k}^{2}$. Combining these results finalizes the proof.
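Spelling out the combination (a restatement under the notation above, not a new result): taking the expectation of (49) and applying Assumption 3 termwise gives
\[
\mathbb{E}\|f_{t}-\bar{f}_{t}\|^{2}\leq\frac{L}{K}\sum^{L}_{l=1}\sum^{K}_{k=1}\frac{1}{p_{l}^{2}}\,\mathbb{E}\|g^{k}_{t}-\bar{g}^{k}_{t}\|^{2}\leq\frac{L}{K}\sum^{L}_{l=1}\sum^{K}_{k=1}\frac{\sigma_{k}^{2}}{p_{l}^{2}},
\]
which is presumably the bound stated in Lemma 3.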

-E Completing the Proof of Theorem 1

Using (20), the distance between $\Theta_{t+1}$ and the optimum $\Theta^{*}$ is expanded as,

\|\Theta_{t+1}-\Theta^{*}\|^{2} = \|\Theta_{t}-\eta_{t}f_{t}-\Theta^{*}+\eta_{t}\bar{f}_{t}-\eta_{t}\bar{f}_{t}\|^{2} (50)
= \underbrace{\|\Theta_{t}-\eta_{t}\bar{f}_{t}-\Theta^{*}\|^{2}}_{G_{3}} (51)
+ \underbrace{2\eta_{t}\langle\Theta_{t}-\Theta^{*}-\eta_{t}\bar{f}_{t},\bar{f}_{t}-f_{t}\rangle}_{G_{4}} (52)
+ \underbrace{\eta_{t}^{2}\|f_{t}-\bar{f}_{t}\|^{2}}_{G_{5}}. (53)
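As a sanity check of the decomposition (51)–(53) (with $\bar{f}_{t}$ inside $G_{4}$), write $\bm{a}=\Theta_{t}-\eta_{t}\bar{f}_{t}-\Theta^{*}$ and $\bm{b}=\eta_{t}(\bar{f}_{t}-f_{t})$; then
\[
\|\Theta_{t}-\eta_{t}f_{t}-\Theta^{*}\|^{2}=\|\bm{a}+\bm{b}\|^{2}=\|\bm{a}\|^{2}+2\langle\bm{a},\bm{b}\rangle+\|\bm{b}\|^{2}=G_{3}+G_{4}+G_{5}.
\]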

We investigate the bound of $G_{3}$ as follows,

G_{3}=\|\Theta_{t}-\Theta^{*}\|^{2}\underbrace{-2\eta_{t}\langle\Theta_{t}-\Theta^{*},\bar{f}_{t}\rangle}_{G_{6}}+\eta_{t}^{2}\|\bar{f}_{t}\|^{2}. (54)

The term $G_{6}/(2\eta_{t})$ is bounded as,

\frac{G_{6}}{2\eta_{t}} \stackrel{(\text{a})}{\leq} F(\Theta^{*})-F(\Theta_{t})-\frac{\mu}{2}\|\Theta_{t}-\Theta^{*}\|^{2} (55)
\stackrel{(\text{b})}{\leq} -\frac{1}{2\beta}\|\bar{f}_{t}\|^{2}-\frac{\mu}{2}\|\Theta_{t}-\Theta^{*}\|^{2} (56)
\stackrel{(\text{c})}{\leq} -\frac{\mu}{2}\|\Theta_{t}-\Theta^{*}\|^{2}. (57)
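Step (b), explained just below, follows from the standard descent consequence of smoothness; a minimal worked expansion, assuming $F$ is $\beta$-smooth and $\bar{f}_{t}=\nabla F(\Theta_{t})$:
\[
F(\Theta^{*})\leq\min_{\Theta^{\prime}}\Big[F(\Theta_{t})+\langle\bar{f}_{t},\Theta^{\prime}-\Theta_{t}\rangle+\frac{\beta}{2}\|\Theta^{\prime}-\Theta_{t}\|^{2}\Big]=F(\Theta_{t})-\frac{1}{2\beta}\|\bar{f}_{t}\|^{2},
\]
which rearranges to $F(\Theta^{*})-F(\Theta_{t})\leq-\frac{1}{2\beta}\|\bar{f}_{t}\|^{2}$ as used in (56).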

The steps (a), (b), and (c) are due to $\mu$-strong convexity, $\beta$-smoothness (as expanded above), and $\|\bar{f}_{t}\|^{2}\geq 0$, respectively. Since $\mathbb{E}[f_{t}]=\bar{f}_{t}$, we have $\mathbb{E}[G_{4}]=0$, while $G_{5}$ is bounded by Lemma 3. Combining Lemma 2, Lemma 3, and these results, we obtain the bound on the LHS of (50). Taking the expectation of (50)–(53), under Assumption 1 and with a learning rate $\eta_{t}\leq\frac{1}{\beta}$, the error between the updated global model and its optimum evolves as,

\mathbb{E}\|\Theta_{t+1}-\Theta^{*}\|^{2}\leq(1-\eta_{t}\mu)\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}+\eta_{t}^{2}\underbrace{\Big(EL^{2}(2-\lambda)^{2}+L\delta\Big)\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}}_{:=B}. (58)

Since $\eta_{t}=\frac{2}{\mu t+2\beta-\mu}\leq\frac{1}{\beta}$, applying (58), we have

\Delta_{t+1}\leq(1-\eta_{t}\mu)\Delta_{t}+\eta_{t}^{2}B. (59)

For the diminishing step size $\eta_{t}$, we prove by induction that $\Delta_{t}\leq\frac{v}{t+2\kappa-1}$, where $\kappa=\frac{\beta}{\mu}$ and $v=\max\{2\kappa\Delta_{1},4B/\mu^{2}\}$, as elaborated next. The base case $\Delta_{1}\leq\frac{v}{2\kappa}$ holds by the definition of $v$. Assuming $\Delta_{t}\leq\frac{v}{t+2\kappa-1}$ holds for some $t$, we have

\Delta_{t+1}\leq(1-\mu\eta_{t})\Delta_{t}+\eta^{2}_{t}B (60)
\leq\left(1-\frac{2}{t+2\kappa-1}\right)\frac{v}{t+2\kappa-1}+\frac{4B/\mu^{2}}{(t+2\kappa-1)^{2}} (61)
=\frac{(t+2\kappa-2)v-(v-4B/\mu^{2})}{(t+2\kappa-1)^{2}}\leq\frac{t+2\kappa-2}{(t+2\kappa-1)^{2}}v (62)
\leq\frac{v}{t+2\kappa}. (63)
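The inequality in (62) uses $v\geq 4B/\mu^{2}$ by the definition of $v$, and the final step (63) reduces to an elementary identity:
\[
(t+2\kappa-1)^{2}-(t+2\kappa)(t+2\kappa-2)=1\geq 0
\;\Longrightarrow\;
\frac{t+2\kappa-2}{(t+2\kappa-1)^{2}}\leq\frac{1}{t+2\kappa}.
\]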

For $t=1$, we obtain

v=\max\Big\{2\kappa\Delta_{1},\frac{4B}{\mu^{2}}\Big\}\leq 2\kappa\Delta_{1}+\frac{4B}{\mu^{2}}. (64)
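Combining the induction result $\Delta_{t}\leq\frac{v}{t+2\kappa-1}$ with (64), and with $\Delta_{t}=\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}$ as implied by (58)–(59), the convergence bound takes the following form (restated here under the definitions above; the constants in the statement of Theorem 1 may be arranged differently):
\[
\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}\leq\frac{1}{t+2\kappa-1}\Big(2\kappa\Delta_{1}+\frac{4B}{\mu^{2}}\Big).
\]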

Finally, using Assumption 1, (58), and the results above, we complete the proof of the theorem.

References

  • [1] H. Baek, W. J. Yun, Y. Kwak, S. Jung, M. Ji, M. Bennis, J. Park, and J. Kim, “Joint superposition coding and training for federated learning over multi-width neural networks,” in Proc. IEEE Conference on Computer Communications (INFOCOM), May 2022.
  • [2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
  • [3] W. J. Yun, J. Park, and J. Kim, “Quantum multi-agent meta reinforcement learning,” in Proc. AAAI Conference on Artificial Intelligence, Washington DC, USA, February 2023.
  • [4] W. J. Yun, Y. Kwak, J. P. Kim, H. Cho, S. Jung, J. Park, and J. Kim, “Quantum multi-agent reinforcement learning via variational quantum circuit design,” in Proc. IEEE International Conference on Distributed Computing Systems (ICDCS), Bologna, Italy, July 2022.
  • [5] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1484–1509, October 1997.
  • [6] J. Preskill, “Quantum computing in the NISQ era and beyond,” Quantum, vol. 2, p. 79, August 2018.
  • [7] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, “Variational quantum circuits for deep reinforcement learning,” IEEE Access, vol. 8, pp. 141 007–141 024, 2020.
  • [8] “Quantum distributed deep learning architectures: Models, discussions, and applications,” ICT Express, 2022.
  • [9] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, “Supervised learning with quantum-enhanced feature spaces,” Nature, vol. 567, no. 7747, pp. 209–212, 2019.
  • [10] O. Lockwood and M. Si, “Reinforcement learning with quantum variational circuit,” in Proc. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 16, no. 1, 2020, pp. 245–251.
  • [11] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning,” IEEE Network, vol. 33, no. 5, pp. 156–165, 2019.
  • [12] J. Park, S. Samarakoon, A. Elgabli, J. Kim, M. Bennis, S.-L. Kim, and M. Debbah, “Communication-efficient and distributed learning over wireless networks: Principles and applications,” Proceedings of the IEEE, vol. 109, no. 5, pp. 796–819, May 2021.
  • [13] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities, and challenges,” IEEE Communications Magazine, vol. 58, no. 6, pp. 46–51, 2020.
  • [14] D. Kwon, J. Jeon, S. Park, J. Kim, and S. Cho, “Multiagent DDPG-based deep learning for smart ocean federated learning IoT networks,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9895–9903, 2020.
  • [15] S. Y.-C. Chen and S. Yoo, “Federated quantum machine learning,” Entropy, vol. 23, no. 4, p. 460, 2021.
  • [16] H. Zhou, K. Lv, L. Huang, and X. Ma, “Quantum network: Security assessment and key management,” IEEE/ACM Transactions on Networking, vol. 30, no. 3, pp. 1328–1339, 2022.
  • [17] R. Pujahari and A. Tanwar, “Quantum federated learning for wireless communications,” in Federated Learning for IoT Applications.   Springer, 2022, pp. 215–230.
  • [18] A. Cho, “IBM promises 1000-qubit quantum computer—a milestone—by 2023,” Science, vol. 15, 2020.
  • [19] J. Gambetta, “Our new 2022 development roadmap,” IBM Quantum Computing, May 2022.
  • [20] J. Yu and T. S. Huang, “Universally slimmable networks and improved training techniques,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, October 2019, pp. 1803–1811.
  • [21] D. Kim, J. Kim, J. Kwon, and T.-H. Kim, “Depth-controllable very deep super-resolution network,” in Proc. IEEE International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, July 2019, pp. 1–8.
  • [22] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, “Barren plateaus in quantum neural network training landscapes,” Nature Communications, vol. 9, no. 1, pp. 1–6, 2018.
  • [23] S. H. Sack, R. A. Medina, A. A. Michailidis, R. Kueng, and M. Serbyn, “Avoiding barren plateaus using classical shadows,” PRX Quantum, vol. 3, no. 2, June 2022.
  • [24] T. Sleator and H. Weinfurter, “Realizable universal quantum logic gates,” Physical Review Letters, vol. 74, no. 20, p. 4087, 1995.
  • [25] M. M. Wilde, Quantum information theory.   Cambridge University Press, 2013.
  • [26] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, “Quantum circuit learning,” Physical Review A, vol. 98, no. 3, p. 032309, 2018.
  • [27] M. Chehimi and W. Saad, “Quantum federated learning with quantum data,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, May 2022, pp. 8617–8621.
  • [28] W. J. Yun, J. P. Kim, S. Jung, J. Park, M. Bennis, and J. Kim, “Slimmable quantum federated learning,” in Proc. of ICML Workshop on Dynamic Neural Networks, Baltimore, MD, USA, July 2022.
  • [29] D. Bouwmeester and A. Zeilinger, “The physics of quantum information: basic concepts,” in the Physics of Quantum Information, 2000, pp. 1–14.
  • [30] C. P. Williams, S. H. Clearwater et al., Explorations in quantum computing.   Springer, 1998.
  • [31] N. Killoran, T. R. Bromley, J. M. Arrazola, M. Schuld, N. Quesada, and S. Lloyd, “Continuous-variable quantum neural networks,” Physical Review Research, vol. 1, no. 3, p. 033063, 2019.
  • [32] O. Simeone, “An introduction to quantum machine learning for engineers,” Foundations and Trends® in Signal Processing, vol. 16, no. 1-2, pp. 1–223, 2022.
  • [33] N. H. Tran, W. Bao, A. Y. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in Proc. IEEE Conference on Computer Communications (INFOCOM), Paris, France, 2019, pp. 1387–1395.
  • [34] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, April 2017, pp. 1273–1282.
  • [35] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou, “FedBN: Federated learning on non-IID features via local batch normalization,” in Proc. International Conference on Learning Representations (ICLR), 2021.
  • [36] L. Mangasarian, “Parallel gradient distribution in unconstrained optimization,” SIAM Journal on Control and Optimization, vol. 33, no. 6, pp. 1916–1925, 1995.
  • [37] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan, “Better mini-batch algorithms via accelerated gradient methods,” Proc. Advances in Neural Information Processing Systems (NIPS), vol. 24, 2011.
  • [38] N. Karakoç, A. Scaglione, M. Reisslein, and R. Wu, “Federated edge network utility maximization for a multi-server system: Algorithm and convergence,” IEEE/ACM Transactions on Networking, vol. 30, no. 5, pp. 2002–2017, 2022.
  • [39] C. T. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated learning over wireless networks: Convergence analysis and resource allocation,” IEEE/ACM Transactions on Networking, vol. 29, no. 1, pp. 398–409, February 2021.
  • [40] A. Khaled, K. Mishchenko, and P. Richtarik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 108, August 2020, pp. 4519–4529.
  • [41] X. You and X. Wu, “Exponentially many local minima in quantum neural networks,” in Proc. of the International Conference on Machine Learning (ICML), Virtual, July 2021.
  • [42] D. Greenberger, K. Hentschel, and F. Weinert, Compendium of quantum physics: concepts, experiments, history and philosophy.   Springer Science & Business Media, 2009.
  • [43] Y. Subaşı, L. Cincio, and P. J. Coles, “Entanglement spectroscopy with a depth-two quantum circuit,” Journal of Physics A: Mathematical and Theoretical, vol. 52, no. 4, p. 044001, January 2019.
  • [44] R. Jozsa, “Fidelity for mixed quantum states,” Journal of Modern Optics, vol. 41, no. 12, pp. 2315–2323, 1994.
  • [45] D. N. C. Tse and P. Viswanath, Fundamentals of Wireless Communications, 2005.
  • [46] T. Cover, “Broadcast channels,” IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 2–14, January 1972.
  • [47] J. Choi, “Joint rate and power allocation for NOMA with statistical CSI,” IEEE Transactions on Communications, vol. 65, no. 10, pp. 4519–4528, October 2017.
  • [48] M. Choi, D. Yoon, and J. Kim, “Blind signal classification for non-orthogonal multiple access in vehicular networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9722–9734, 2019.
  • [49] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Proc. of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.
  • [50] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [51] T. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” CoRR, vol. abs/1909.06335, September 2019.
Won Joon Yun has been a Ph.D. student in electrical and computer engineering at Korea University, Seoul, Republic of Korea, since March 2021, where he received his B.S. in electrical engineering. He was a visiting researcher at Cipherome Inc., San Jose, CA, USA, during summer 2022, and a visiting researcher at the University of Southern California, Los Angeles, CA, USA, during winter 2022 for a joint project with Prof. Andreas F. Molisch at the Ming Hsieh Department of Electrical and Computer Engineering, USC Viterbi School of Engineering.
Jae Pyoung Kim has been with the School of Electrical Engineering, Korea University, Seoul, Republic of Korea, since March 2017, where he is currently a B.S. student in electrical and computer engineering. He has also been a research engineer at the Artificial Intelligence and Mobility (AIM) Laboratory at Korea University since 2021. His current research interests include quantum machine learning.
Hankyul Baek has been a Ph.D. student in electrical and computer engineering at Korea University, Seoul, Republic of Korea, since March 2021. He received his B.S. in electrical engineering from Korea University in 2020 and was with LG Electronics, Seoul, Republic of Korea, from 2020 to 2021. His current research interests include quantum machine learning and its applications.
Soyi Jung has been an assistant professor at the Department of Electrical and Computer Engineering, Ajou University, Suwon, Republic of Korea, since September 2022. She was a visiting scholar at the Donald Bren School of Information and Computer Sciences, University of California, Irvine, CA, USA, from 2021 to 2022, a research professor at Korea University, Seoul, Republic of Korea, during 2021, and a researcher at the Korea Testing and Research (KTR) Institute, Gwacheon, Republic of Korea, from 2015 to 2016. She received her B.S., M.S., and Ph.D. degrees in electrical and computer engineering from Ajou University, Suwon, Republic of Korea, in 2013, 2015, and 2021, respectively. Her current research interests include network optimization for autonomous vehicle communications, distributed system analysis, big-data processing platforms, and probabilistic access analysis. She was a recipient of the Best Paper Award by KICS (2015), the Young Women Researcher Award by WISET and KICS (2015), the Bronze Paper Award from the IEEE Seoul Section Student Paper Contest (2018), the ICT Paper Contest Award by Electronic Times (2019), and the IEEE ICOIN Best Paper Award (2021).
Jihong Park (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Yonsei University, South Korea. He is currently a Lecturer (Assistant Professor) with the School of Information Technology, Deakin University, Australia. His research interests include ultra-dense/ultra-reliable/mmWave system designs, and distributed learning/control/ledger technologies and their applications to beyond-5G/6G communication systems. He has served as a Conference/Workshop Program Committee Member for IEEE GLOBECOM, ICC, and WCNC, as well as for NeurIPS, ICML, and IJCAI. He is an Associate Editor of Frontiers in Data Science for Communications and a Review Editor of Frontiers in Aerial and Space Networks.
Mehdi Bennis (Fellow, IEEE) is a tenured Full Professor with the Centre for Wireless Communications, University of Oulu, Finland, an Academy of Finland Research Fellow, and the Head of the Intelligent Connectivity and Networks/Systems Group (ICON). He has published more than 200 research papers in international conferences, journals, and book chapters. His main research interests are in radio resource management, heterogeneous networks, game theory, and distributed machine learning in 5G networks and beyond. He has received several prestigious awards, including the 2015 Fred W. Ellersick Prize from the IEEE Communications Society, the 2016 Best Tutorial Prize from the IEEE Communications Society, the 2017 EURASIP Best Paper Award for the Journal of Wireless Communications and Networks, the All-University of Oulu Award for research, the 2019 IEEE ComSoc Radio Communications Committee Early Achievement Award, and recognition as a 2020 Clarivate Highly Cited Researcher (Web of Science). He is an Editor of IEEE Transactions on Communications and the Specialty Chief Editor of Data Science for Communications in Frontiers in Communications and Networks.
Joongheon Kim (Senior Member, IEEE) has been with Korea University, Seoul, Korea, since 2019, where he is currently an associate professor. He received the B.S. and M.S. degrees in computer science and engineering from Korea University, Seoul, Korea, in 2004 and 2006, respectively; and the Ph.D. degree in computer science from the University of Southern California (USC), Los Angeles, CA, USA, in 2014. Before joining Korea University, he was with LG Electronics (Seoul, Korea, 2006–2009), Intel Corporation (Santa Clara in Silicon Valley, CA, USA, 2013–2016), and Chung-Ang University (Seoul, Korea, 2016–2019). He serves as an editor for IEEE Transactions on Vehicular Technology, IEEE Transactions on Machine Learning in Communications and Networking, IEEE Communications Standards Magazine, Computer Networks (Elsevier), and ICT Express (Elsevier). He is also a distinguished lecturer for IEEE Communications Society (ComSoc) (2022-2023) and IEEE Systems Council (2022-2024). He was a recipient of Annenberg Graduate Fellowship with his Ph.D. admission from USC (2009), Intel Corporation Next Generation and Standards (NGS) Division Recognition Award (2015), IEEE Systems Journal Best Paper Award (2020), IEEE ComSoc Multimedia Communications Technical Committee (MMTC) Outstanding Young Researcher Award (2020), IEEE ComSoc MMTC Best Journal Paper Award (2021), and Best Special Issue Guest Editor Award by ICT Express (Elsevier) (2022). He also received several awards from IEEE conferences including IEEE ICOIN Best Paper Award (2021), IEEE Vehicular Technology Society (VTS) Seoul Chapter Awards (2019, 2021), and IEEE ICTC Best Paper Award (2022).