
Quantum Federated Learning with Entanglement Controlled Circuits and Superposition Coding

Won Joon Yun, Jae Pyoung Kim, Hankyul Baek, Soyi Jung,  Jihong Park,  Mehdi Bennis,  and Joongheon Kim The parts of this research were presented at IEEE Conference on Computer Communications (INFOCOM), London, United Kingdom, May 2022 [1].This research is supported by the National Research Foundation of Korea (2021R1A4A1030775) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2021-0-00467, Intelligent 6G Wireless Access System). (Corresponding authors: Soyi Jung, Jihong Park, Joongheon Kim)W. J. Yun, J. P. Kim, H. Baek, and J. Kim are with the School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea (e-mails: {ywjoon95,paulkim436,67back,joongheon}@korea.ac.kr).S. Jung is with the Department of Electrical and Computer Engineering, Ajou University, Suwon 16499, Republic of Korea (e-mail: [email protected]).J. Park is with the School of Information Technology, Deakin University, Geelong, VIC 3220, Australia (e-mail: [email protected]).M. Bennis is with the Centre for Wireless Communications, University of Oulu, Oulu 90014, Finland (e-mail: [email protected]).
Abstract

While witnessing the noisy intermediate-scale quantum (NISQ) era and beyond, quantum federated learning (QFL) has recently become an emerging field of study. In QFL, each quantum computer or device locally trains its quantum neural network (QNN) with trainable gates, and communicates only these gate parameters over classical channels, without costly quantum communications. Towards enabling QFL under various channel conditions, in this article we develop a depth-controllable architecture of entangled slimmable quantum neural networks (eSQNNs), and propose an entangled slimmable QFL (eSQFL) that communicates the superposition-coded parameters of eSQNNs. Compared to the existing depth-fixed QNNs, training the depth-controllable eSQNN architecture is more challenging due to high entanglement entropy and inter-depth interference, which are mitigated by introducing entanglement controlled universal (CU) gates and an inplace fidelity distillation (IPFD) regularizer penalizing inter-depth quantum state differences, respectively. Furthermore, we optimize the superposition coding power allocation by deriving and minimizing the convergence bound of eSQFL. In an image classification task, extensive simulations corroborate the effectiveness of eSQFL in terms of prediction accuracy, fidelity, and entropy compared to Vanilla QFL as well as under different channel conditions and various data distributions.

Index Terms:
Quantum Machine Learning, Quantum Entanglement, Quantum Federated Learning, Superposition Coding

I Introduction

I-A Background and Motivation

Recent advances in quantum computing hardware and algorithms have led to the emergence of quantum machine learning (ML) [2, 3, 4]. As opposed to classical computation at a linear scale in bits, quantum computing can perform calculations at an exponential scale in qubits [5]. The main enablers are the stochastic nature and the entanglement phenomenon of qubits, allowing one to make each qubit represent superimposed multiple states and to simultaneously control multiple qubits, respectively. Consequently, even in the current era of noisy intermediate scale quantum (NISQ) [6], i.e., with 50 to a few hundred qubits, quantum ML has achieved linear or sublinear complexity in various applications, as compared with the polynomial complexity of classical ML [7].

Figure 1: A schematic illustration of (a) Vanilla quantum federated learning (QFL) and (b) the proposed entangled slimmable quantum FL (eSQFL) with 2 devices, each of which has an entangled slimmable quantum neural network (eSQNN) with 3 depth layers.

Quantum ML has recently established its standard framework. Analogous to the neural network (NN) of classical ML, the parameterized quantum circuit (PQC), also known as the quantum NN (QNN), has become a de facto standard quantum ML architecture [7, 8]. In a PQC, qubits flow through gates associated with trainable classical parameters, during which the states of the qubits can be adjusted. For applications ranging from image classification [9] to reinforcement learning [10], PQC training has achieved prediction accuracy on par with classical NNs while using a much smaller number of trainable parameters.

Focusing on the parameter efficiency of PQCs, by integrating federated learning (FL) [11, 12, 13, 14] into standalone quantum ML, quantum FL (QFL) has recently attracted attention [15, 16]. Without communicating qubits via costly quantum communications, QFL enables distributed quantum ML at scale by communicating the PQC’s trainable parameters via classical communications, even over wireless channels [17]. This is not in the distant future, but is an upcoming application, especially considering the ever-increasing pace of innovation in quantum computers, e.g., IBM’s development roadmap planning to implement a 1K-qubit beyond-NISQ computer in 2023 [18] and a 100K-qubit computer in 2026 [19].

I-B Algorithm Design Concept

Motivated by this trend in QFL, the overarching goal of this article is to develop a communication-efficient QFL framework that can cope with heterogeneous and time-varying channel conditions and computing resources. To this end, we first revisit slimmable FL (SFL) in classical ML [1], wherein each device has a width-controllable local model, known as a slimmable NN (SNN) [20, 21], and communicates its superposition-coded local model with different width levels, enabling multi-level local information exchanges depending on channel conditions. Inspired by this, as visualized in Fig. 1, we propose an entangled slimmable quantum FL (eSQFL) framework with entangled slimmable QNNs (eSQNNs), which is a non-trivial extension of SFL with SNNs to their quantum versions, as summarized next.

Unlike multi-width SNNs, the eSQNN is a multi-depth PQC wherein more depth levels incur higher von Neumann entanglement entropy on average. Unfortunately, the PQC trainability is often challenged by the problem of all gradients vanishing, known as barren plateaus [22], which is exacerbated under higher entanglement entropy [23]. Meanwhile, too low entanglement may negate the benefit of quantum ML. To resolve this issue for an unknown target degree of entanglement, our proposed eSQNN entangles different qubits using controlled universal (CU) quantum gates [24] such that the degree of entanglement is trainable.

Next, simultaneous local training of the multiple eSQNN depths may induce inter-depth interference, hindering convergence. In classical ML, SFL avoids its similar inter-width interference issue by adding an inplace knowledge distillation (IPKD) regularizer that penalizes the output difference from any smaller width to the largest width level [20]. Since IPKD uses the Kullback-Leibler (KL) divergence, it becomes less accurate (or even diverges) for larger differences. Alternatively, leveraging Uhlmann's fidelity function in quantum information theory [25], we propose a novel inplace fidelity distillation (IPFD) regularizer that is bounded between 0 and 1 while accurately measuring the quantum state difference even between the smallest and the largest levels.

Finally, for communication efficiency, the eSQNN parameters in multiple depths are superposition-coded and transmitted with a different transmit power allocation to each depth. Like SFL, the transmit power is optimized by deriving and minimizing the convergence bound of the eSQFL. Nevertheless, the convergence analysis is completely different since the gradient in PQC training is measured in a quantum computing way using the parameter shift rule [26].

Not only by analysis but also by extensive simulations, we corroborate that the proposed eSQFL with eSQNNs achieves convergence while each depth level can be trained into a separate model with reasonable accuracy and fidelity, under various channel conditions as well as independent and identically distributed (IID) or non-IID data distributions. Note that unlike eSQFL, Vanilla QFL with fixed local PQC architectures cannot cope with different channel conditions [15, 27]. A recent work [28] also considers a slimmable architecture in the context of QFL. However, it does not theoretically guarantee convergence, and its specific architecture (i.e., angle/pole parameters) only allows two-level superposition coding, as opposed to the generalized multi-level architectures using CU gates in eSQFL.

I-C Contributions

The major contributions of this paper are summarized as follows.

  • A multi-depth QNN architecture with CU gates, i.e., eSQNN, is proposed to enable superposition-coded transmissions while avoiding barren plateaus. We measure the von Neumann entanglement entropy between quantum states of different depths, and show that CU gates improve trainability when designing multi-depth QNNs.

  • A local eSQNN training algorithm with a fidelity-inspired regularizer, i.e., IPFD, is proposed in order to mitigate inter-depth interference. The proposed IPFD plays a crucial role in eSQNN training.

  • With eSQNNs and IPFD, a novel quantum FL framework, i.e., eSQFL, is proposed, and its convergence bound is theoretically derived.

  • Based on the derived convergence bound, transmit power allocation in superposition coding is optimized. In addition, we corroborate that the derived convergence bound helps eSQFL achieve high accuracy.

I-D Organization

The rest of this paper is organized as follows. Sec. II reviews the work related to the proposed quantum federated learning framework. Sec. III introduces the eSQNN architecture and its local training with the IPFD regularizer. Sec. IV describes superposition coding, successive decoding, and the proposed eSQFL framework. Sec. V provides the convergence analysis of eSQFL and its insights. Sec. VI presents the numerical experimental results that corroborate eSQFL empirically. Lastly, Sec. VII concludes this paper. The notations used throughout this paper are summarized in Tab. III.

II Related Work

II-A Quantum Machine Learning Basics

Basic Quantum Gates. A qubit is the unit of quantum computation whose state is represented with the two basis states $|0\rangle$ and $|1\rangle$ on the Bloch sphere [29]. Consider a $q$-qubit system, in which the quantum state $\bm{\psi}\in\mathbb{C}^{2^{q}}$ defined in the Hilbert space can be expressed as follows,

$|\bm{\psi}\rangle=\Lambda_{1}|0\cdots 0\rangle+\cdots+\Lambda_{2^{q}}|1\cdots 1\rangle$ (1)

where $\sum^{2^{q}}_{i=1}|\Lambda_{i}|^{2}=1$. Classical data $\bm{x}$ are encoded into a quantum state with the rotation gates $R_{\text{x}}(\bm{x})$, $R_{\text{y}}(\bm{x})$, and $R_{\text{z}}(\bm{x})$, which rotate the state by $\bm{x}$ around the $x$-, $y$-, and $z$-axes of the Bloch sphere, respectively. Moreover, qubits are entangled with controlled-NOT (CNOT) gates [30]. A CNOT gate acts on two qubits, using the first qubit as the control and applying an XOR (bit-flip) operation to the second qubit. These basic quantum gates constitute the building blocks of QNNs.
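For concreteness, the following minimal NumPy sketch (illustrative only, not the paper's circuit) encodes one classical feature with an $R_{\text{y}}$ rotation, entangles two qubits with a CNOT gate, and checks the normalization $\sum_{i}|\Lambda_{i}|^{2}=1$.

```python
import numpy as np

# Single-qubit rotation about the y-axis: R_y(x) = exp(-i x Y / 2).
def ry(x):
    return np.array([[np.cos(x / 2), -np.sin(x / 2)],
                     [np.sin(x / 2),  np.cos(x / 2)]])

# CNOT on two qubits (first qubit controls, second is the target).
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

# Encode a classical feature x on qubit 0, leave qubit 1 in |0>, then entangle.
x = 0.7
psi0 = np.kron(ry(x) @ np.array([1.0, 0.0]), np.array([1.0, 0.0]))  # R_y(x)|0> ⊗ |0>
psi = CNOT @ psi0                                                   # entangled 2-qubit state

print(np.round(psi, 4))          # amplitudes over |00>, |01>, |10>, |11>
print(np.sum(np.abs(psi) ** 2))  # normalization: sum_i |Λ_i|^2 = 1
```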

Quantum Neural Network. The structure of a QNN is tripartite: the state encoder, the PQC, and the measurement layer [31, 32]. In the forward propagation, classical input data $\bm{x}$ are first encoded by the state encoder via basic rotation gates, a unitary operation denoted as $U(\bm{x})$. Then, the encoded quantum state is processed through the PQC $U(\bm{\theta})$, a multi-layered set of CNOT gates and rotation gates associated with trainable parameters $\bm{\theta}$. The resulting quantum state can be expressed as,

$|\bm{\psi}_{\bm{\theta}}\rangle=U(\bm{\theta})\cdot|\bm{\psi}_{0}\rangle=U(\bm{\theta})\cdot U(\bm{x})|0\rangle.$ (2)

The output of the PQC is an entangled quantum state that can be measured after applying a projection matrix $M\in\mathcal{M}\equiv\{M_{1},\cdots,M_{c},\cdots,M_{C}\}$ onto the reference $z$-axis. The measured output $\langle V\rangle_{\bm{\theta}}\in[-1,1]^{C}$ is called an observable, where $C$ denotes the output dimension. The operation of the QNN corresponding to the $c$-th observable is as follows,

$\langle V_{c}\rangle_{\bm{\theta}}=\langle 0|U^{\dagger}(\bm{x})U^{\dagger}(\bm{\theta})M_{c}U(\bm{\theta})U(\bm{x})|0\rangle=\langle\bm{\psi}|M_{c}|\bm{\psi}\rangle$ (3)

where $(\cdot)^{\dagger}$ denotes the conjugate transpose. Using the observable, a given loss function is calculated. Unlike classical NNs, whose hidden-layer activations are visible, the quantum states within a QNN cannot be measured without collapsing them [29]. This prevents quantum ML from computing the loss gradients via the chain rule, i.e., backpropagation. Instead, quantum ML evaluates the gradients using a zeroth-order method called the parameter shift rule [26] (see Appendix -A).
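The tripartite pipeline in (2)-(3) can be sketched end-to-end as follows. This is a hedged toy example with two qubits, an angle encoder, a single trainable rotation layer, and Pauli-$Z$ observables; it does not reproduce the paper's exact PQC.

```python
import numpy as np

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)

def ry(t):  # rotation about the y-axis
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def encoder(x):           # U(x): angle-encode two classical features
    return np.kron(ry(x[0]), ry(x[1]))

def pqc(theta):           # U(theta): one trainable rotation layer plus an entangler
    return CNOT @ np.kron(ry(theta[0]), ry(theta[1]))

def observable(psi, M):   # <V_c> = <psi| M_c |psi>
    return np.real(np.conj(psi) @ (M @ psi))

x, theta = np.array([0.3, 1.1]), np.array([0.5, -0.8])
psi = pqc(theta) @ encoder(x) @ np.array([1.0, 0.0, 0.0, 0.0])  # |psi> = U(theta) U(x) |0>

# Two observables: Z on qubit 0 and Z on qubit 1 (each value lies in [-1, 1]).
M = [np.kron(Z, I2), np.kron(I2, Z)]
print([round(float(observable(psi, Mc)), 4) for Mc in M])
```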

II-B Classical Federated Learning

Federated learning (FL) is a machine learning (ML) architecture made up of a server, local devices, and a global model [12]. The server transmits the global model to all the local devices, and each device produces local parameters by training the received global model. Then, these parameters are sent back to the central server, where they are aggregated to update the global model. Finally, the updated global model is transmitted to the local devices again for another iteration. Due to this mechanism, FL allows a large number of devices to learn a global model simultaneously without transmitting any data, ensuring data privacy as well. Considering the recent increase in the number and computational power of edge devices, FL is an extremely useful tool for reducing computational overhead and protecting data security, both of which are emerging challenges in the field of ML [33]. Within this architecture, various techniques with differing methods of aggregating parameters and training the global model exist, e.g., FedAvg [34], FedBN [35].

The convergence analysis of FL algorithms is especially challenging because of the data heterogeneity in FL, which forces researchers to rely on copious numbers of assumptions. Consequently, gaps in the understanding of FL analysis remain. Over the years, many major works have attempted to better understand FL by removing assumptions and exploring new techniques [36, 37, 34, 38, 39]. Even now, research on FL convergence in various aspects is still being carried out (e.g., non-convex settings, tighter convergence bounds). For example, [40] successfully analyzes local stochastic gradient descent (SGD) under arbitrarily heterogeneous data while using weaker assumptions than previous works. On the other hand, the convergence analysis of most QFL algorithms has not been fully developed yet. This paper aims to advance the convergence analysis of QFL by analyzing eSQFL while accounting for the characteristics of quantum computing. The convergence analysis of the proposed dynamic QFL is elaborated in Sec. V.

II-C Classical Slimmable Federated Learning

SFL is a framework that executes FL by using slimmable neural networks (SNNs) with SC and SD [1]. The architectural properties of SNNs reduce the memory cost of SFL [20] while providing communication and computational efficiency. SC is the process of compressing two different data signals into one signal. As the signals are encoded, different power levels are assigned to each data signal, which decide the priority of the signals during SD. Additionally, the SNN is composed of the left-hand (LH) and the right-hand (RH) sides. The LH side is occupied by the high-priority signal, while the lower-priority signal goes to the RH side. After SC is finished, the encoded message is uploaded to the server, where it undergoes SD. If the state of the communication channel is good, the LH of the SNN is decoded first, followed by the RH signal. However, if the communication channel is not stable enough, only the LH is decoded, resulting in a small model. Finally, if the communication channel is completely unstable, no signal is obtained. This flexible characteristic of SNNs allows SFL to be extremely adaptable to dynamic communication environments, making it suitable for practical applications.

II-D Quantum Federated Learning

In this section, the concept of QFL is elaborated in depth. QFL implements FL via quantum computation by replacing all the NNs with QNNs. Chen et al. [15] are the first to propose a hybrid quantum-classical QFL architecture where the local devices are quantum devices, unlike in FL models. After receiving the global model parameters, the quantum computers carry out quantum ML using QNNs. Then, the output of each device is aggregated to update the global model before repeating the process. In Chehimi et al. [27], a purely quantum FL framework is proposed. Similar to [15], this model is composed of a server and multiple quantum devices. However, instead of converting classical data into quantum states, the local devices generate quantum data by labeling qubits as excited or not excited according to the degree of rotation on the Bloch sphere. Both [15, 27] use FedAvg to aggregate parameters and execute training. As seen from the two examples above, QFL and FL share an identical system structure, but QFL leverages quantum ML instead of classical ML in order to exploit the advantages of quantum computing. In this work, Vanilla QFL refers to a purely quantum version of [15]. In addition, a quantum application of SFL has been studied to improve communication opportunities [28]. This SQFL utilizes trainable measurement parameters to configure two messages, which carry the trainable measurement parameters and the PQC parameters, respectively. In contrast, this work proposes multi-layer architectures and local training algorithms that are not present in [28].

III Architecture and Training of eSQNNs

Figure 2: Illustration of eSQNN: (1) CU gates are used in eSQNN, (2) the fidelity regularizer is applied to the sub-model layers (e.g., Layers 1 and 2) by comparing their quantum states, and (3) the proposed eSQNN is based on a multi-depth QNN.

In this section, we describe the architecture of eSQNN and its local training algorithm. To elaborate on this, by slightly modifying the depth-fixed architecture of Vanilla QNN [7], we first prepare its depth-controllable counterpart without controlling the level of entanglement, dubbed Vanilla SQNN, followed by introducing the proposed eSQNN controlling both the depth and the level of entanglement.

Architecture of Vanilla SQNN. Suppose that Vanilla SQNN consists of $L$ layers and produces the desired outputs at any layer $l\in[1,L]$. In this paper, the number of sub-models must be larger than 1 (i.e., $L\geq 2$). When the $l$-th sub-model is used, the model is configured from the encoding layer up to the $l$-th layer. For arbitrary $k\in\mathbb{N}[1,K]$ and $l\in\mathbb{N}[1,L]$, the model parameters of the $k$-th local device at the $l$-th depth are denoted as $\bm{\theta}^{k}\odot\sum^{l}_{l^{\prime}=1}\Xi_{l^{\prime}}$. Note that $\Xi_{l^{\prime}}$ is a binary mask which eliminates all trainable parameters except those of the $l^{\prime}$-th layer, and $\odot$ denotes the element-wise product. However, it is difficult to obtain desirable results at an arbitrary layer because the Vanilla SQNN is vulnerable to the barren plateau problem [22, 41]. The barren plateau is a bad local optimum which hinders convergence, and it is known that a higher degree of entanglement worsens the barren plateau problem [23]. The operations in Vanilla SQNN are as follows: 1) rotate the quantum state $|\bm{\psi}\rangle$ with rotation gates, 2) entangle qubits, and 3) repeat the first and second steps. We expect that these operations increase the degree of entanglement.

Architecture of eSQNN. The eSQNN is proposed to cope with the problem of the Vanilla SQNN architecture. Fig. 2 illustrates the eSQNN, which is mainly composed of CU gates. The operation of a CU gate on two qubits is written as $\begin{bmatrix}I&0\\ 0&U\end{bmatrix}$, where $U=\begin{bmatrix}u_{00}&u_{01}\\ u_{10}&u_{11}\end{bmatrix}$ is a unitary matrix, i.e., $U^{\dagger}U=I$. We focus on the architectural advantage of CU gates: they can adjust the direction of entanglement, disentanglement, or rotation during training. We describe the advantages of eSQNN and the barren plateau phenomenon next.
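As an illustration of the CU gate described above, the following sketch assembles the block-diagonal controlled-$U$ from a single-qubit unitary. Parameterizing $U$ by three rotation angles is a common convention and only an assumption here, not necessarily the paper's exact parameterization.

```python
import numpy as np

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0],
                     [0, np.exp(1j * t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]], dtype=complex)

def single_qubit_unitary(a, b, c):
    # General single-qubit unitary (up to a global phase): Rz(c) Ry(b) Rz(a).
    return rz(c) @ ry(b) @ rz(a)

def controlled_u(U):
    # Block-diagonal controlled-U: identity when the control is |0>,
    # and U applied to the target when the control is |1>.
    CU = np.eye(4, dtype=complex)
    CU[2:, 2:] = U
    return CU

U = single_qubit_unitary(0.4, 1.2, -0.7)
CU = controlled_u(U)
print(np.allclose(U.conj().T @ U, np.eye(2)))    # unitarity check: U†U = I
print(np.allclose(CU.conj().T @ CU, np.eye(4)))  # the controlled gate is unitary too
```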

To this end, we first consider the von Neumann entanglement entropy, a metric for measuring the degree of quantum entanglement of bipartite subsystems in an entire system [42]. For instance, consider two subsystems, e.g., the $l$-th and $l^{\prime}$-th model configurations, where $l>l^{\prime}$. According to the two-copy test from [43], we can compare the different quantum states $|\bm{\psi}_{l}\rangle$ and $|\bm{\psi}_{l^{\prime}}\rangle$ by using additional qubits. Then, we can measure the entanglement entropy as follows. Suppose the quantum state that exists over the $l^{\prime}$-th and $l$-th depths is represented as $\bm{\psi}_{l^{\prime},l}\in\mathbb{C}^{2^{2q}}$. Its pure-state density matrix is obtained by $\rho_{l^{\prime},l}\triangleq|\bm{\psi}_{l^{\prime},l}\rangle\langle\bm{\psi}_{l^{\prime},l}|$. Finally, the entanglement entropy is calculated as follows,

$S_{l}(\rho_{l^{\prime},l})=-\mathrm{Tr}_{l}(\rho_{l^{\prime},l}\log\rho_{l^{\prime},l})$ (4)

where $\mathrm{Tr}_{l}(\cdot)$ stands for the partial trace over the $l$-th layer. As discussed in many studies, avoiding barren plateaus requires reducing the entanglement entropy [23].

On the basis of these studies, we assume that there exists an entropy threshold for every $l$-th model, i.e., $S_{l,th}$ for all $l\in\mathbb{N}[l^{\prime},L]$ and $l^{\prime}\in\mathbb{N}[0,l-1]$. The index starts at $l^{\prime}=0$ because we measure the entanglement entropy from the encoding state, i.e., $\bm{\psi}_{0}$. If $\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})\geq S_{l,th}$, the barren plateau becomes severe and training of the $l$-th model fails. For this reason, we observe the entanglement entropy between the encoding state and each layer of eSQNN. In order to ensure all model configurations are trained, we define the following metric,

$\mathbbm{1}_{\text{train}}=\prod^{L}_{l=1}\mathbbm{1}\left(\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})<S_{l,th}\right)$ (5)

where $\mathbbm{1}(\cdot)$ stands for an indicator function. To verify that the metric works as intended, we consider the following two cases. If all model configurations satisfy $\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})<S_{l,th}$, then $\mathbbm{1}_{\text{train}}=1$, which means the barren plateau is avoided. On the other hand, if there exists an $l$ satisfying $\sum^{l-1}_{l^{\prime}=0}S(\rho_{l^{\prime},l})\geq S_{l,th}$, then $\mathbbm{1}_{\text{train}}=0$, which means training suffers from the barren plateau. We conjecture that eSQNN is more robust to the barren plateau than Vanilla SQNN because the event $\mathbbm{1}_{\text{train}}=1$ occurs more frequently in eSQNN. More details are in Sec. VI-B.
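The entanglement entropy in (4) can be computed numerically from a reduced density matrix. The following minimal sketch (with illustrative two-qubit states, not eSQNN states) shows the partial-trace computation.

```python
import numpy as np

def entanglement_entropy(psi, dim_a, dim_b):
    # psi: pure state of a bipartite system A⊗B, reshaped to a dim_a x dim_b matrix.
    # The reduced density matrix rho_A = Tr_B(|psi><psi|), and
    # S(rho_A) = -Tr(rho_A log rho_A) is computed from its eigenvalues.
    m = psi.reshape(dim_a, dim_b)
    rho_a = m @ m.conj().T
    evals = np.linalg.eigvalsh(rho_a)
    evals = evals[evals > 1e-12]          # drop numerical zeros
    return float(-np.sum(evals * np.log(evals)))

# Product state |00>: zero entanglement entropy.
product = np.kron([1.0, 0.0], [1.0, 0.0])
# Maximally entangled Bell state (|00> + |11>)/sqrt(2): entropy log(2).
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

print(entanglement_entropy(product, 2, 2))  # ~0.0
print(entanglement_entropy(bell, 2, 2))     # ~0.6931 = log 2
```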

eSQNN Local Training. This subsection presents the eSQNN local training algorithm. In general, classical SNNs use the IPKD regularizer $\mathcal{L}_{KD}$ to transfer knowledge from a large model to a small model [20], which can be expressed as,

$\mathcal{L}_{KD}=D_{KL}\left(p(\bm{y}^{k,L}_{t,e})\,\|\,p(\bm{y}^{k,l}_{t,e})\right)$ (6)

where $D_{KL}$ is the KL divergence. IPKD is ill-suited when the difference between the outputs of the two models becomes large, in which case the KL divergence may even diverge. Alternatively, we propose the IPFD regularizer $\mathcal{L}_{FD}$, inspired by Uhlmann's fidelity function in quantum information theory, which measures the similarity between two quantum states [44]. Precisely, the fidelity of the quantum states in the $L$-th and $l$-th model configurations is defined as follows,

$\mathcal{F}(\bm{\psi}_{L},\bm{\psi}_{l})=|\langle\bm{\psi}_{L}|\bm{\psi}_{l}\rangle|^{2}.$ (7)

In (7), if $\mathcal{F}(\bm{\psi}_{L},\bm{\psi}_{l})\approx 1$, $\bm{\psi}_{l}$ is similar to $\bm{\psi}_{L}$, which means the logits of the $l$-th model are almost the same as the logits of the $L$-th model. On the other hand, the opposite condition $\mathcal{F}(\bm{\psi}_{L},\bm{\psi}_{l})\approx 0$ means the $l$-th model does not follow the $L$-th model.

Consequently, in a classification task, the local training of an eSQNN with the IPFD regularizer is described as follows. The pair $(\bm{x},\bm{y})$ denotes a data sample and its label, respectively. The label $\bm{y}=\{y_{c}\}^{C}_{c=1}$ is a one-hot encoded vector wherein the element $y_{c}$ is unity for the true class and zero otherwise, i.e., $y_{c^{\prime}}=0,\forall c^{\prime}\neq c$. Hereafter, we describe the local training of the parameters of local device $k$ in the $t$-th communication round and $e$-th local training iteration. The logit of class $c$ and its prediction under the $l$-th model are denoted as,

$y^{k,l,c}_{t,e}=\exp\left(a\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}\odot\sum^{l}_{l^{\prime}=1}\Xi_{l^{\prime}}}\right),$ (8)
$p(y^{k,l,c}_{t,e}|\bm{x})=\frac{y^{k,l,c}_{t,e}}{\sum^{C}_{c=1}y^{k,l,c}_{t,e}},$ (9)

where $a$ represents the observable hyperparameter. Additionally, the cross-entropy loss and the fidelity regularizer are given as,

$\mathcal{L}_{CE}=-\sum^{C}_{c=1}\left[y_{c}\log p(y^{k,l,c}_{t,e}|\bm{x})\right],$ (10)
$\mathcal{L}_{FD}=1-\mathcal{F}(\bm{\psi}^{k,L}_{t,e,\bm{x}},\bm{\psi}^{k,l}_{t,e,\bm{x}}).$ (11)

The loss function is given as,

$\mathcal{L}^{k,l}_{t,e}=\frac{1}{D}\sum_{(\bm{x},\bm{y})\in\zeta^{k}}\left[\lambda\mathcal{L}_{CE}+(1-\lambda)\mathcal{L}_{FD}\right]$ (12)

where $D$ and $\lambda$ stand for the batch size and the balancing parameter of the fidelity regularization, respectively. The gradient of (12) can be calculated with the parameter shift rule [26]. Algorithm 1 summarizes the local training process within one communication round. After training with Algorithm 1, the gradient to be transmitted to the server is given as,

$g^{k}_{t}=\sum^{E}_{e=1}\sum^{L}_{l=1}\nabla_{\bm{\theta}^{k}_{t,e}}\mathcal{L}^{k,l}_{t,e}$ (13)

where the per-iteration, per-depth gradients are accumulated, and $\eta_{t}$ (used in the update of Algorithm 1) denotes the learning rate at communication round $t$.
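A minimal sketch of the per-sample objective in (8)-(12) is given below; the observables, quantum states, and the default $\lambda$ are placeholder values rather than outputs of an actual eSQNN.

```python
import numpy as np

def fidelity(psi_L, psi_l):
    # Uhlmann fidelity between pure states: F = |<psi_L | psi_l>|^2, bounded in [0, 1].
    return np.abs(np.vdot(psi_L, psi_l)) ** 2

def esqnn_loss(obs_l, y_onehot, psi_L, psi_l, a=2.0, lam=0.01):
    # Logits (8): y_c = exp(a * <V_c>); prediction (9): normalize over classes.
    logits = np.exp(a * obs_l)
    probs = logits / logits.sum()
    ce = -np.sum(y_onehot * np.log(probs))   # cross-entropy (10)
    fd = 1.0 - fidelity(psi_L, psi_l)        # IPFD regularizer (11)
    return lam * ce + (1.0 - lam) * fd       # combined loss (12), single sample

# Placeholder values: 4-class observables in [-1, 1] and two 4-amplitude states.
obs_l = np.array([0.3, -0.1, 0.7, -0.5])
y = np.array([0.0, 0.0, 1.0, 0.0])
psi_L = np.array([0.8, 0.1, 0.5, 0.3]); psi_L /= np.linalg.norm(psi_L)
psi_l = np.array([0.7, 0.2, 0.5, 0.4]); psi_l /= np.linalg.norm(psi_l)

print(round(float(esqnn_loss(obs_l, y, psi_L, psi_l)), 4))
```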

1 Initialization. Local eSQNN parameters $\bm{\theta}$;
2 for $e=\{1,2,\dots,E\}$ do
3   for $(\bm{x},y)\in\mathcal{D}$ do
4     Get logits with the $L$-th model;
5     Calculate loss with labels and accumulate loss;
6     for $l=\{1,2,\dots,L-1\}$ do
7       Get logits with the $l$-th model;
8       Calculate loss gradient with the parameter-shift rule;
9     $\bm{\theta}^{k}_{t,e+1}\leftarrow\bm{\theta}^{k}_{t,e}-\eta_{t}\nabla_{\bm{\theta}^{k}_{t,e}}\mathcal{L}^{k,l}_{t,e}$;
Algorithm 1 Local-eSQNN Training

IV Entangled Slimmable Quantum Federated Learning

IV-A Superposition Coding & Successive Decoding

The successful reception of a wireless signal is mainly affected by the signal-to-interference-plus-noise ratio (SINR) [45]. At a receiver, SINR can be expressed as,

$\gamma=\chi d^{-\beta}P/(\sigma^{2}+P^{I})$ (14)

where $P$, $P^{I}$, $d$, and $\sigma^{2}$ denote the transmission power, interference power, transmitter-receiver distance, and noise power, respectively. In addition, $\beta\geq 2$ is a path loss exponent and $\chi$ is a small-scale fading parameter (i.e., Rayleigh fading). Following Shannon's capacity formula with a Gaussian codebook, the received throughput with bandwidth $W$ is $R=W\log_{2}(1+\gamma)$ (bits/sec). When the transmitter encodes raw data with a code rate $u$, its receiver successfully decodes the encoded data if $R>u$. Finally, the decoding success probability is given as follows,

$\Pr(R\geq u)=\Pr\!\left(\frac{\chi d^{-\beta}P}{\sigma^{2}+P^{I}}\geq u^{\prime}\right)$ (15)

where $u^{\prime}=2^{\frac{u}{W}}-1$. Consider transmitting $L$ messages from a transmitter to a receiver simultaneously. Before transmission, these messages are SC-encoded [46], and the whole transmission power budget $P$ is split so that the $l$-th message is allotted transmission power $P_{l}=\nu_{l}P$ for $l\in[1,L]$. Note that $\nu_{l}$ is an allocation variable such that $\nu_{l}>u^{\prime}\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}$, $\sum^{L}_{l=1}\nu_{l}=1$, and $\nu_{l}\geq 0$ for all $l$.

The SC-encoded message is meant to be sequentially decoded at the receiver by first decoding the strongest signal, then canceling out the decoded signal, and finally decoding the next strongest signal, i.e., SD, also known as successive interference cancellation [47, 48]. The small-scale fading parameter $\chi$ under Rayleigh fading follows an exponential distribution, i.e., $\chi\sim\exp(1)$. Assuming $l^{\prime}>l$, the receiver may gradually decode the $l$-th message while experiencing the remaining messages as interference $P_{l}^{I}$, i.e.,

$P_{l}^{I}=\chi d^{-\beta}P\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}},$ (16)

for $l\leq L-1$. However, $P^{I}_{L}=0$ as there is no interference for the last message. Let $R_{l}$ represent the throughput of the $l$-th message. Then, the distribution of $R_{l}$ is given as,

$\Pr(R_{l}\geq u)=\Pr\!\left(\chi\geq\frac{1/\bar{\gamma}}{\nu_{l}/u^{\prime}-\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}}\right)$ (17)

where $\bar{\gamma}=\frac{Pd^{-\beta}}{\sigma^{2}}$ denotes the average signal-to-noise ratio (SNR). Using this result, the $l$-th message's decoding success probability $p_{l}$ can be expressed as follows,

$p_{l}=\Pr(R_{1}\geq u,\cdots,R_{l}\geq u)$ (18)
$\;\;=\Pr\!\left(\chi\geq\max\!\left(\frac{1/\bar{\gamma}}{\nu_{1}/u^{\prime}-\sum^{L}_{l^{\prime}=2}\nu_{l^{\prime}}},\cdots,\frac{1/\bar{\gamma}}{\nu_{l}/u^{\prime}-\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}}\right)\right).$ (19)
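Under Rayleigh fading, $\Pr(\chi\geq x)=\exp(-x)$, so (18)-(19) admit a direct numerical evaluation. The following sketch is illustrative: the allocation $\bm{\nu}$ is the $L=3$ value reported in Sec. VI, while the average SNR and $u^{\prime}$ are placeholder assumptions.

```python
import numpy as np

def decoding_success_probs(nu, gamma_bar, u_prime):
    # nu: power allocation (nu_1, ..., nu_L) with sum(nu) = 1.
    # Returns p_l = Pr(R_1 >= u, ..., R_l >= u) for each l, per (18)-(19),
    # using Pr(chi >= x) = exp(-x) for chi ~ exp(1) (Rayleigh fading).
    L = len(nu)
    thresholds = []
    for j in range(L):  # decoding threshold for the j-th message (0-indexed)
        denom = nu[j] / u_prime - sum(nu[j + 1:])
        if denom <= 0:  # SC constraint nu_l > u' * sum_{l'>l} nu_l' violated
            return np.zeros(L)
        thresholds.append((1.0 / gamma_bar) / denom)
    return np.array([np.exp(-max(thresholds[: l + 1])) for l in range(L)])

gamma_bar = 10 ** (17 / 10)              # avg. SNR of 17 dB, as in the experiments
u_prime = 1.0                            # placeholder threshold u' = 2^{u/W} - 1
nu = np.array([0.8909, 0.0989, 0.0102])  # the L = 3 allocation reported in Sec. VI
print(np.round(decoding_success_probs(nu, gamma_bar, u_prime), 4))
```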
1 Notation. $\bm{\theta}^{k}_{t}$: $k$-th device's parameters, $\Theta_{t}$: parameters of the global eSQNN, $X_{l}$: set of devices whose $l$-th subdivided gradient is received;
2 Initialization. $X_{l}\leftarrow\emptyset,\forall l\in[1,L]$;
3 for $k=\{1,\dots,K\}$ do
4   Sample $\chi^{k}\sim\exp(1)$;
5   for $l=\{1,2,\dots,L\}$ do
6     if $\chi^{k}\geq u_{l}$ then
7       $X_{l}\leftarrow X_{l}\cup\{k\}$;
8 if $\prod^{L}_{l=1}\mathbbm{1}(X_{l}\neq\emptyset)\neq 0$ then
9   $\Theta_{t+1}\leftarrow\Theta_{t}-\eta_{t}\sum^{L}_{l=1}\frac{1}{|X_{l}|}\sum_{k\in X_{l}}g^{k}_{t}\odot\Xi_{l}$;
10 else
11   Skip aggregation;
12 for $k=\{1,\cdots,K\}$ do
13   $\bm{\theta}^{k}_{t+1,1}\leftarrow\Theta_{t+1}$;
Algorithm 2 eSQFL

IV-B eSQFL Operations

This section describes the operations of eSQFL, summarized in Algorithm 2. First of all, the local devices are trained with Algorithm 1. Power allocation is then conducted to configure the SC-encoded model parameters, i.e., $\bm{\nu}=\{\nu_{l}\}^{L}_{l=1}$ for the gradients $\{g^{k}_{t}\odot\Xi_{l}\}^{L}_{l=1}$ of the subdivided model configurations. After that, the local devices transmit their SC-encoded model parameters to the server. The server decodes the devices' SC-encoded model parameters with SD. If the server receives at least one local gradient for every model configuration, it aggregates them; otherwise, no aggregation occurs. For the aggregation of each subdivided model configuration, FedAvg is utilized [34]. The global update rule of eSQFL is formalized in Sec. V.
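The server-side step of Algorithm 2 can be sketched as follows; the masks, gradients, and reception sets are simplified placeholders, and this is only one way to instantiate the level-wise FedAvg aggregation, not the exact implementation.

```python
import numpy as np

def esqfl_aggregate(Theta, grads, received, masks, eta):
    # Theta: global parameters; grads[k]: device k's accumulated gradient g_t^k.
    # received[l]: list of device indices in X_l; masks[l]: binary mask Xi_l.
    # Level-wise FedAvg over the received masked gradients, then a global step;
    # aggregation is skipped if any level received no gradient (as in Algorithm 2).
    if any(len(X_l) == 0 for X_l in received):
        return Theta
    update = np.zeros_like(Theta)
    for X_l, Xi_l in zip(received, masks):
        update += np.mean([grads[k] * Xi_l for k in X_l], axis=0)
    return Theta - eta * update

# Toy setup: 6 parameters split into L = 2 depth levels, K = 3 devices.
rng = np.random.default_rng(0)
Theta = rng.normal(size=6)
grads = {k: rng.normal(size=6) for k in range(3)}
masks = [np.array([1, 1, 1, 0, 0, 0.]), np.array([0, 0, 0, 1, 1, 1.])]
received = [[0, 1, 2], [0, 2]]  # e.g., device 1's second-level message was not decoded
print(np.round(esqfl_aggregate(Theta, grads, received, masks, eta=0.01), 3))
```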

V Convergence Analysis

V-A Setup

In order to analyze the convergence rate of eSQFL, the following assumptions are considered. Firstly, the local-side decoding is always successful (Algorithm 2, lines 12–13) because the server-side transmission power is higher than the uplink power. Secondly, $K$ is assumed to be large enough such that $|X_{l}|\approx Kp_{l}$ for all $l$. During the $t$-th communication round, the server builds the global model, which can be expressed as follows,

$\Theta_{t+1}\leftarrow\Theta_{t}-\eta_{t}\underbrace{\sum^{L}_{l=1}\frac{1}{Kp_{l}}\sum_{k\in X_{l}}g^{k}_{t}\odot\Xi_{l}}_{:=f_{t}}.$ (20)

The objective function of the global model and the local objective functions are denoted as $F$ and $\{F^{k}\}$, respectively. The bar notation $\bar{\cdot}$ is used for the value averaged over $\{\zeta_{t}^{k}\}$, and the superscript $*$ indicates the optimum. For mathematical tractability, we consider the following assumptions on $F$ and $\{F^{k}\}$, as used in [49].

Assumption 1 ($\bm{\beta}$-Smoothness).

If $F$ and $\{F^{k}\}$ are $\beta$-smooth,

$F^{k}(\bm{\theta}_{v})\leq F^{k}(\bm{\theta}_{w})+(\bm{\theta}_{v}-\bm{\theta}_{w})^{T}\nabla F^{k}(\bm{\theta}_{w})+\frac{\beta}{2}\|\bm{\theta}_{v}-\bm{\theta}_{w}\|^{2},$ (21)

for all $v,w>0$.

Assumption 2 ($\bm{\mu}$-Strong Convexity).

If $F$ and $\{F^{k}\}$ are $\mu$-strongly convex,

$F^{k}(\bm{\theta}_{v})\geq F^{k}(\bm{\theta}_{w})+(\bm{\theta}_{v}-\bm{\theta}_{w})^{T}\nabla F^{k}(\bm{\theta}_{w})+\frac{\mu}{2}\|\bm{\theta}_{v}-\bm{\theta}_{w}\|^{2},$ (22)

for all $v,w>0$.

Assumption 3 (Bounded Local Gradient Variance).

For every device $k\in\mathbb{N}[1,K]$ and its local data $\zeta^{k}\in\mathbf{Z}$, the difference between the local gradient $\nabla_{\bm{\theta}}F^{k}(\bm{\theta}^{k};\zeta^{k})$ and $\nabla_{\bm{\theta}}\bar{F}^{k}(\bm{\theta}^{k};\mathbf{Z})$ is bounded, i.e.,

$\mathbb{E}[\|\nabla_{\bm{\theta}}F^{k}(\bm{\theta}^{k},\zeta^{k}_{t})-\nabla_{\bm{\theta}}\bar{F}^{k}(\bm{\theta}^{k};\mathbf{Z})\|^{2}]\leq\sigma_{k}^{2}.$ (23)

According to [40], the metric for the non-IIDness of $\mathbf{Z}$ is given as follows,

$\delta=\frac{1}{K}\sum^{K}_{k=1}\sigma_{k}^{2}.$ (24)

V-B Convergence Analysis

In classical ML, the convergence of FedAvg has been analyzed by assuming bounded local gradients in [49]. Without such an unrealistic assumption, the convergence bound of SFL has been derived in [1]. In quantum ML, local gradients can be shown to be inherently bounded thanks to the bounded fidelity and the parameter shift rule computing quantum gradients [26]. Hence, rather than adopting the methods in [1], we first derive the local gradient bound, and then derive the convergence bound of eSQFL by following the steps in [49]. The detailed proofs are deferred to the Appendix; only the results are presented as elaborated next.

Lemma 1 (Bounded Local Gradient).

For $t\geq 1$ and $\eta_{t}\leq\eta_{t+1}$, it follows that

$\mathbb{E}[\|g^{k}_{t}\|^{2}]\leq EL(2+(a-2)\lambda)^{2}.$ (25)
Lemma 2 (Bounded Global Gradient).

For $t\geq 1$, the global gradient is bounded as,

$\mathbb{E}[\|f_{t}\|^{2}]\leq EL^{2}(2+(a-2)\lambda)^{2}\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}.$ (26)
Lemma 3 (Bounded Global Gradient Variance).

Under Assumption 3, the variance of the global gradient $f_{t}$ over $\mathbf{Z}$ is bounded as,

$\mathbb{E}\|f_{t}-\bar{f}_{t}\|^{2}\leq L\delta\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}.$ (27)

Note that Lemmas 2 and 3 are different, in the sense that Lemma 2 focuses on the actual gradient, whereas Lemma 3 is related to the data distributions. The convergence analysis utilizes Lemmas 1–3, and the convergence of eSQFL can then be proven following [49].

Theorem 1 (eSQFL Convergence).

Under Assumptions 1 and 3 with the learning rate $\eta_{t}=\frac{2}{\mu t+2\beta-\mu}$, we obtain

$\mathbb{E}[F(\theta_{t})]-F^{*}\leq\frac{\beta}{\mu}\cdot\frac{\mu\beta\Delta_{1}+2B}{\mu t+2\beta-\mu},$ (28)

where $\Delta_{t}\triangleq\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}$, (29)

$B=\left(EL^{2}(2+(a-2)\lambda)^{2}+L\delta\right)\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}.$ (30)

Hence, $\lim_{t\rightarrow\infty}\mathbb{E}[F(\theta_{t})]=F^{*}$.

Theorem 1 provides several insights into eSQFL as follows.

  1. Failure under extremely poor channels: Consider an extremely poor channel condition, where the server cannot receive the $[l,L]$-th model configurations, i.e., $p_{l^{\prime}}\simeq 0,\forall l^{\prime}\in[l,L]$. In this case, the RHS of (28) diverges.

  2. Importance of successful reception: The optimality gap of eSQFL becomes smaller by increasing the communication opportunities. Consider a perfect channel condition, where the RHS of (28) is minimized. By optimizing the SC transmission, the optimality gap is reduced, as formalized in Proposition 1 and Corollary 1.

  3. Other important metrics: The optimality gap is affected by the number of local iterations per communication round $E$, the balancing factor $\lambda$, and the number of layers $L$.

Proposition 1 (Optimal SC Power Allocation).

The transmission power allocation $\bm{\nu}^{*}$ minimizing the optimality gap is given as,

$\bm{\nu}^{*}=\arg\min_{\bm{\nu}}\left(\sum^{L}_{l=1}\exp\left(-\frac{2/\bar{\gamma}}{\nu_{l}/u^{\prime}-\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}}\right)\right)$ (31)

where $L\geq 2$, $\nu_{l}>u^{\prime}\sum^{L}_{l^{\prime}=l+1}\nu_{l^{\prime}}$ for all $l\in[1,L)$, and $\sum^{L}_{l=1}\nu_{l}=1$.

Proof.

Substituting the term $p_{l}$ into Theorem 1, the optimality gap is minimized by optimizing the power allocation. ∎

Corollary 1 (Low SNR, $\bm{L=2}$).

For $L=2$, $\bar{\gamma}\to 0$, and $u^{\prime}\geq(1+\sqrt{5})/2\approx 1.618$, the optimal power allocation is given as,

$(\nu_{1}^{*},\nu_{2}^{*})=\left(-\frac{\sqrt{u^{\prime}+1}-u^{\prime 2}+1}{u^{\prime 2}+u^{\prime}},\;1+\frac{\sqrt{u^{\prime}+1}-u^{\prime 2}+1}{u^{\prime 2}+u^{\prime}}\right).$ (32)

Proof.

Since $\exp(-x)\approx 1-x$ for $x\to 0$, the RHS of (31) becomes $2+\frac{2/\bar{\gamma}}{\nu_{1}/u^{\prime}-(1-\nu_{1})}+\frac{2/\bar{\gamma}}{(1-\nu_{1})/u^{\prime}}$, which is piece-wise convex. Applying the first-order necessary condition (FONC) with respect to $\nu_{1}$ completes the proof. ∎
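One way to instantiate the power-allocation optimization numerically is to search directly over the factor $\sum^{L}_{l=1}1/p_{l}^{2}$ appearing in $B$ of Theorem 1, with $p_{l}$ from (18)-(19). The sketch below does this for $L=2$ by a simple grid search, under placeholder SNR and $u^{\prime}$ values; it is not the paper's exact solver.

```python
import numpy as np

def success_probs(nu, gamma_bar, u_prime):
    # p_l from (18)-(19) under Rayleigh fading: Pr(chi >= x) = exp(-x).
    L, thr = len(nu), []
    for j in range(L):
        denom = nu[j] / u_prime - sum(nu[j + 1:])
        if denom <= 0:
            return None                     # violates nu_l > u' * sum_{l'>l} nu_l'
        thr.append((1.0 / gamma_bar) / denom)
    return np.array([np.exp(-max(thr[: l + 1])) for l in range(L)])

def bound_factor(nu, gamma_bar, u_prime):
    # The sum_l 1/p_l^2 factor that multiplies into the constant B of Theorem 1.
    p = success_probs(nu, gamma_bar, u_prime)
    return np.inf if p is None else float(np.sum(1.0 / p ** 2))

gamma_bar, u_prime = 10 ** (17 / 10), 1.0   # placeholder: 17 dB avg. SNR, u' = 1
grid = np.linspace(0.01, 0.99, 981)
best = min(grid, key=lambda v1: bound_factor(np.array([v1, 1 - v1]), gamma_bar, u_prime))
print(round(float(best), 3), round(float(1 - best), 3))  # searched (nu_1, nu_2)
```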

TABLE I: List of simulation parameters.
Description Value
Number of devices ($N$) 10
Local iterations per communication round ($E$) 10
Epochs ($T$) 100
Optimizer SGD
Learning rate ($\eta_{1}$) 0.01
Decaying rate 0.001
Observable hyperparameter ($a$) 2
Number of qubits 4
Number of parameters in eSQFL & Vanilla QFL 36
Number of data per device 128
Batch size ($D$) 32

VI Experiments

VI-A Experimental Design

To corroborate the main analysis and hypothesis of this paper, the experiments are designed as follows:

  • From Sec. V-A, the derived convergence bound is highly affected by the decoding success probability and non-IIDness. To corroborate these results numerically, we compare the top-1 accuracy of eSQFL in various channel conditions and degrees of non-IIDness with Vanilla QFL (referred to Fig. 1(a)).

  • We investigate the advantage of CU gates that compose eSQNN by designing an experiment which measures entanglement entropy and top-1 accuracy of eSQNN and standard QNNs under the same conditions. Then, the two metrics are compared to demonstrate the advantage of CU gates.

  • The increased effectiveness of local training with IPFD compared to IPKD is demonstrated. IPFD trains the local models by regularizing the fidelity of two quantum states, whereas IPKD trains the local models by making the small model follow the large model's predictions. A benchmark comparing the fidelity and top-1 accuracy of IPFD and IPKD is designed.

  • According to Proposition 1 and Corollary 1, the convergence bound is minimized by optimal transmission power allocation. To corroborate this, we compare the optimal power allocation scheme to its random power allocation counterpart.

  • Finally, we conduct experiments by controlling various variables and assess their impact on performance.

Figure 3: Class distributions with different Dirichlet concentration $\alpha$: (a) $\alpha=0.1$, (b) $\alpha=1.0$, (c) $\alpha=10$.

For the experiments, eSQFL and Vanilla QFL are evaluated. eSQFL is the proposed framework which leverages eSQNN. This specific QNN consists of three sub-models named 'L1', 'L2', and 'L3'. In contrast, Vanilla QFL uses a standard QNN made up of basic quantum gates [7], and does not consider SC and SD [15]. Despite the difference in structure, both eSQNN and the standard QNN use an equivalent number of parameters. Moreover, we conduct ablation studies on our eSQNN by comparing it with Vanilla SQNN, a depth-controllable yet entanglement-fixed QNN. Since the performance of QFL suffers under a system with a large number of qubits, many QFL works use a simple dataset [15]. In this paper, the MNIST dataset is transformed into a simpler form: the dimension of each MNIST image is reduced to $4\times 4$ by inter-area interpolation, and only four classes are used (i.e., 0, 1, 2, and 3) [50]. The four classes are represented with red, blue, black, and green, respectively. In addition, the Dirichlet distribution is used to investigate the non-IIDness of data [51]. Fig. 3 shows the data distribution with different values of the Dirichlet concentration ratio $\alpha$. Data with a high Dirichlet concentration ratio (i.e., $\alpha=10$) are IID, while data with a low Dirichlet concentration ratio (i.e., $\alpha=0.1$) are non-IID.
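A Dirichlet-based split in the spirit of Fig. 3 can be generated as sketched below; drawing per-class device proportions from $\mathrm{Dir}(\alpha)$ is a common convention and only an assumption about the exact procedure used here.

```python
import numpy as np

def dirichlet_split(labels, num_devices, alpha, rng):
    # For each class, draw device proportions from Dir(alpha) and split that
    # class's sample indices accordingly. Small alpha -> skewed (non-IID) splits,
    # large alpha -> near-uniform (IID) splits.
    device_idx = [[] for _ in range(num_devices)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(num_devices))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for dev, part in enumerate(np.split(idx, cuts)):
            device_idx[dev].extend(part.tolist())
    return device_idx

rng = np.random.default_rng(1)
labels = rng.integers(0, 4, size=1280)                    # 4 classes, as in mini-MNIST
for alpha in (0.1, 1.0, 10.0):
    split = dirichlet_split(labels, num_devices=10, alpha=alpha, rng=rng)
    counts = np.array([np.bincount(labels[d], minlength=4) for d in split])
    print(f"alpha={alpha}: per-device class counts (first 3 devices)\n{counts[:3]}")
```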

To compare IPFD and IPKD, we initialize the parameters of eSQNN identically. The simulation parameters used in these numerical experiments are summarized in Tab. I.

Figure 4: Comparison of top-1 accuracy under various avg. SNR [dB] and $\alpha$: (a) $\alpha=0.1$, (b) $\alpha=1$, (c) $\alpha=10$.
Figure 5: Comparison of top-1 accuracy under various $\alpha$ ($\bar{\gamma}=17\,$dB): (a) $\alpha=0.1$, (b) $\alpha=1$, (c) $\alpha=10$.

VI-B Numerical Results

Numerical Results and Convergence Analysis. According to Theorem 1, the convergence bound decreases if the decoding success probability increases. Fig. 4 shows the performance of eSQFL under various channel conditions obtained through various $\sigma^{2}$. As $\bar{\gamma}$ increases from $11\,$dB to $19\,$dB, the decoding success probability and the top-1 accuracy of eSQFL with all layers increase. The small models, i.e., eSQFL-L2 and eSQFL-L1, also show improvement in performance along with eSQFL-L3. Especially, eSQFL-L2 shows a significant improvement in top-1 accuracy from $28\%$ to $39\%$. Fig. 5 shows the top-1 accuracy and convergence of eSQFL and the comparison models. When $\bar{\gamma}=17\,$dB, the sub-models in eSQFL (i.e., eSQFL-L2, eSQFL-L3) achieve higher accuracy than Vanilla QFL. The final standard deviations of eSQFL under $\bar{\gamma}=17\,$dB are 0.041, 0.051, and 0.066 for eSQFL-L1, eSQFL-L2, and eSQFL-L3, respectively.

According to Theorem 1, the data distribution affects the convergence bound of eSQFL: with non-IID data, the convergence bound is widened. As shown in Fig. 5, we test various Dirichlet concentrations, i.e., $\alpha=\{0.1,1,10\}$. The overall performance of all comparison models decreases as $\alpha$ decreases. However, eSQFL shows robustness under non-IID data distributions. Vanilla QFL shows low top-1 accuracy under $\alpha=1$ and $\alpha=0.1$. In contrast, eSQFL maintains top-1 accuracies of $52\%$ and $41\%$ under $\alpha=1.0$ and $\alpha=0.1$, respectively. From the results in Fig. 4 and Fig. 5, eSQFL is robust under various channel conditions and non-IID data distributions.

Figure 6: Model architectural difference (eSQNN vs. Vanilla SQNN): (a) top-1 accuracy, (b) fidelity, (c) entropy.
Figure 7: Comparison of the IPFD training algorithm under non-IID and IID data: (a) IPKD regularization, (b) IPFD regularization.
Figure 8: Comparison of the fidelity training algorithm under non-IID and IID data: (a) $\alpha=0.1$, (b) $\alpha=10$.

Structural Advantage of eSQNN. In this subsection, we investigate the general performance, fidelity, and entanglement entropy of eSQNN. We conduct ablation studies on the model architecture (i.e., eSQNN vs. Vanilla SQNN). Fig. 6 shows the experimental results. As shown in Fig. 6(a), eSQNN shows better top-1 accuracy than Vanilla SQNN. Especially, eSQNN-L3 achieves a $20\%$ performance improvement over Vanilla SQNN. When eSQNN is used, the quantum state (i.e., knowledge) is successfully distilled to the small model, as shown in Fig. 6(b). In contrast, Vanilla SQNN fails to distill the knowledge to its sub-models. To understand why eSQNN is successfully trained, we calculate the entanglement entropy (cf. Sec. III). Fig. 6(c) exhibits the von Neumann entanglement entropy of each layer of eSQNN and Vanilla SQNN. The entanglement entropy of eSQNN is lower than that of Vanilla SQNN for all layers. This means that the event of exceeding the entropy threshold defined in Sec. III, i.e., $\mathbbm{1}_{\text{train}}=0$, rarely occurs compared to Vanilla SQNN. This underscores that eSQNN is more robust to barren plateaus than Vanilla SQNN.

Effectiveness of IPFD. To investigate the effectiveness of IPFD in eSQNN local training, we compare local training using the IPFD regularizer to that using the IPKD regularizer. Figs. 7(a)/(b) show the learning curves of $\mathcal{L}_{KD}$ and $\mathcal{L}_{FD}$, respectively. The learning curve of IPFD starts at $\mathcal{L}_{FD}=0$ because the fidelity is initially $\mathcal{F}(\bm{\psi}_{l},\bm{\psi}_{L})\approx 1$. As eSQNN is trained, the fidelity decreases and converges to 0.955 for L1 and 0.987 for L2. The learning curve of $\mathcal{L}_{KD}$ also tends to decrease and converge. However, the fluctuation of IPKD regularization is larger than that of IPFD, especially for eSQFL-L1. This is because the KL divergence becomes unstable when the difference between the two distributions is large, i.e., the overlapping area between the distributions is small; if there is no overlapping area, it diverges. In contrast, this phenomenon does not occur with IPFD regularization because the IPFD regularizer is bounded between 0 and 1. Therefore, IPFD regularization provides a more stable regularization signal to eSQNN than IPKD regularization.

TABLE II: Top-1 accuracy [%] comparison with ($\bm{\nu}^{*}$) and without power allocation optimization.
Condition | $L=3$: $l=1$ / $l=2$ / $l=3$ | $L=2$: $l=1$ / $l=2$
with optimization ($\bm{\nu}^{*}$) | 29.6 / 40.1 / 55.8 | 33.4 / 50.7
w/o optimization | 29.5 / 39.3 / 50.7 | 33.0 / 48.1

Impact of Optimal Power Allocation. We verify Proposition 1 and Corollary 1. When $L=3$, we calculate the power allocation variable as $\bm{\nu}^{*}=\{0.8909,0.0989,0.0102\}$ by non-convex optimization. When $L=2$, we obtain $\bm{\nu}^{*}=\{0.8969,0.1059\}$. The comparison power allocations are set to $\bm{\nu}=\{0.9170,0.0820,0.001\}$ for $L=3$ and $\bm{\nu}=\{0.8333,0.1667\}$ for $L=2$. The final accuracies are reported in Tab. II. Compared to eSQFL with $\bm{\nu}$, eSQFL with $\bm{\nu}^{*}$ achieves $10.1\%$ higher top-1 accuracy when $L=3$ and $5.41\%$ higher top-1 accuracy when $L=2$. Thus, we corroborate that the optimal power allocation minimizes the convergence bound.

Impact of the Balancing Parameter. The balancing parameter $\lambda$ is an important parameter in eSQNN local training. Fig. 8 shows the top-1 accuracy according to $\lambda$ under various data distributions (i.e., $\alpha=0.1$ and $\alpha=10$). With a finely adjusted IPFD parameter ($\lambda^{*}=0.01$), eSQNN shows the highest top-1 accuracy under the non-IID data distribution (i.e., $\alpha=0.1$). In addition, when not using IPFD ($\lambda=0$) or only using IPFD ($\lambda=1$), eSQNN fails to classify the mini-MNIST dataset. Under the IID data distribution (i.e., $\alpha=10$), eSQNN with $\lambda^{*}=0.01$ outperforms eSQNN using only label training ($\lambda=0$) by about $1.03\%$. From these results, we recommend utilizing eSQNN with $\lambda^{*}=0.01$ for robust performance under both IID and non-IID data distributions.

VII Conclusions

In this paper, we developed a depth-adjustable QNN architecture and proposed a novel QFL framework over wireless communications, termed eSQNN and eSQFL, respectively. To control the level of entanglement and reduce its entropy, we applied CU gates to the eSQNN architecture. To mitigate the inter-depth interference, we introduced a novel IPFD regularizer inspired by the fidelity in quantum information theory. Finally, to cope with various channel conditions, we applied SC across multiple depths and optimized the SC power allocation by deriving and minimizing the convergence bound of eSQFL. In conclusion, we proposed a QFL model that shows stable performance despite the NISQ limitations and variable channel conditions. The fidelity regularizer, in particular, decreases the error rate of quantum ML in a way that is tailored to quantum computing, rather than depending on classical optimization methods. Since the strengths of our model have been theoretically corroborated in this paper, we will go on to test its realistic efficacy. In future research, we will apply it to a plethora of real-life scenarios with various practical limitations.

TABLE III: List of notations.
Notation Description
$K$ Number of local devices, $[1,\cdots,k,\cdots,K]$
$L$ Number of local eSQNN blocks, $[1,\cdots,l,\cdots,L]$
$E$ Number of local iterations, $[1,\cdots,e,\cdots,E]$
$T$ Number of communication rounds, $[1,\cdots,t,\cdots,T]$
$\bm{\Xi}$ Binary masks, $\bm{\Xi}=\{\Xi_{1},\cdots,\Xi_{l},\cdots,\Xi_{L}\}$
$\bm{\psi}$ Quantum state
$\rho$ Reduced density matrix
$S_{l}(\rho)$ Entanglement entropy of $\rho$ over subsystem $l$
$\bm{Z}$ Whole data, $\bm{Z}=\{\zeta^{1},\cdots,\zeta^{k},\cdots,\zeta^{K}\}$
$\bm{\nu}$ Power allocation for SC, $\bm{\nu}=\{\nu_{1},\cdots,\nu_{l},\cdots,\nu_{L}\}$
$\alpha$ Dirichlet concentration

-A Parameter Shift Rule

The parameter shift rule [26], one of the most widely used quantum gradient calculators, is utilized to train the model. The eSQNN is accordingly trained using a zeroth-order stochastic gradient descent algorithm, e.g., the quantum natural gradient. Consider that the eSQNN consists of $I$ trainable parameters, i.e., $\bm{\theta}^{k}_{t,e}=[\theta^{k}_{t,e,1},\cdots,\theta^{k}_{t,e,i},\cdots,\theta^{k}_{t,e,I}]$. Then, the partial derivative of the $k$-th device's $c$-th observable with respect to parameter $\theta^{k}_{t,e,i}$ is given as follows,

$\frac{\partial\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}}}{\partial\theta^{k}_{t,e,i}}=\frac{\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}+\varepsilon\mathbf{e}_{i}}-\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}-\varepsilon\mathbf{e}_{i}}}{2\varepsilon}$ (33)

where $\mathbf{e}_{i}$ denotes the $i$-th standard basis vector, and $\varepsilon\in(0,\pi/2]$. We calculate the loss gradient using (33).
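A toy numerical illustration of the shifted-evaluation gradient in (33) is given below for a single-qubit circuit with $f(\theta)=\langle 0|R_{\text{y}}^{\dagger}(\theta)ZR_{\text{y}}(\theta)|0\rangle=\cos\theta$; the circuit is illustrative, and a small $\varepsilon$ is used so that the result closely matches the analytic derivative.

```python
import numpy as np

Z = np.diag([1.0, -1.0])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

def expectation(theta):
    # <V> = <0| R_y(theta)^T Z R_y(theta) |0> = cos(theta) for this toy circuit.
    psi = ry(theta) @ np.array([1.0, 0.0])
    return psi @ (Z @ psi)

def shift_gradient(theta, eps=0.01):
    # Shifted-evaluation gradient as in (33): two extra circuit runs, no backpropagation.
    return (expectation(theta + eps) - expectation(theta - eps)) / (2 * eps)

theta = 0.9
print(round(float(shift_gradient(theta)), 4))  # ~ -0.7833
print(round(float(-np.sin(theta)), 4))         # analytic derivative of cos(theta)
```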

-B Proof of Lemma 1

Suppose the true label is class $c$; then the prediction terms for all other classes and their derivatives cancel out due to the definition of the cross-entropy. The cross-entropy loss is thus simplified as follows,

$\mathcal{L}_{CE}=-\log p(y^{k,l,c}_{t,e}|\bm{x}).$ (34)

Hereafter, we denote $\hat{y}_{c}=y^{k,l,c}_{t,e}$ and $p(\hat{y}_{c})=p(y^{k,l,c}_{t,e}|\bm{x})$. Let us denote the partial derivatives of the cross-entropy loss and the fidelity loss as,

$G_{1}=\frac{\partial\mathcal{L}_{CE}}{\partial\theta^{k}_{t,e,i}}=\frac{1}{p(\hat{y}_{c})}\cdot\frac{\partial p(\hat{y}_{c})}{\partial\theta^{k}_{t,e,i}},$ (35)
$G_{2}=\frac{\partial\mathcal{F}(\bm{\psi}^{k,L}_{t,e,\bm{x}},\bm{\psi}^{k,l}_{t,e,\bm{x}})}{\partial\theta^{k}_{t,e,i}}.$ (36)

By the triangle inequality, the partial derivative of (12) is bounded as follows,

$\left|\frac{\partial\mathcal{L}^{k,l}_{t,e}}{\partial\theta^{k}_{t,e,i}}\right|\leq\sum_{(\bm{x},\bm{y})\in\zeta^{k}}\left[\frac{\lambda}{D}|G_{1}|+\frac{1-\lambda}{D}|G_{2}|\right].$ (37)

We have

$G_{1}=a\left(\sum^{C}_{c^{\prime}\geq 1,c^{\prime}\neq c}\frac{\hat{y}_{c^{\prime}}}{\sum^{C}_{c=1}\hat{y}_{c}}\right)\cdot\frac{\partial\langle V_{c}\rangle_{\bm{\theta}^{k}_{t,e}\odot\sum^{l}_{l^{\prime}=1}\Xi_{l^{\prime}}}}{\partial\theta^{k}_{t,e,i}}.$ (38)

The bound of $G_{1}$ is obtained as follows,

$|G_{1}|\leq a\left|\frac{\partial\langle V_{c^{\prime}}\rangle_{\bm{\theta}^{k}\odot\Xi_{l}}}{\partial\theta^{k}_{t,e,i}}\right|\leq a.$ (39)

The former step is due to $\sum^{C}_{c^{\prime}\geq 1,c^{\prime}\neq c}\hat{y}_{c^{\prime}}\leq\sum^{C}_{c=1}\hat{y}_{c}$, and the latter step is because the gradient computed with the parameter shift rule is bounded by $1$ [26]. The term $G_{2}$ and its bound are given as,

$G_{2}=2|\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle|\cdot\frac{\partial\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle}{\partial\theta^{k}_{t,e,i}},$ (40)
$|G_{2}|\leq 2\left|\frac{\partial\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle}{\partial\theta^{k}_{t,e,i}}\right|\leq 2.$ (41)

The former step is due to $|\langle\bm{\psi}^{k,L}_{t,e,\bm{x}}|\bm{\psi}^{k,l}_{t,e,\bm{x}}\rangle|^{2}\leq 1$, and the latter step is due to the parameter shift rule. Substituting the bounds of $G_{1}$ and $G_{2}$ into the RHS of (37), we have the bound,

$\left|\frac{\partial\mathcal{L}^{k,l}_{t,e}}{\partial\theta^{k}_{i}}\right|\leq 2+(a-2)\lambda.$ (42)

Calculating the LHS of (37) for all $i\in[1,I]$, the loss gradient is obtained, and it is bounded as,

$\|\nabla_{\bm{\theta}^{k}_{t,e}}\mathcal{L}^{k,l}_{t,e}\|\leq 2+(a-2)\lambda.$ (43)

Applying these results to $g^{k}_{t}$, we complete the proof.

-C Proof of Lemma 2

We expand the global gradient $f_{t}$ as follows,

$\|f_{t}\|^{2}=\Big\|\sum^{L}_{l=1}\frac{1}{Kp_{l}}\sum_{k\in X_{l}}g^{k}_{t}\odot\Xi_{l}\Big\|^{2}$ (44)
$\leq\frac{L}{K}\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}\sum^{K}_{k=1}\|g^{k}_{t}\odot\Xi_{l}\|^{2}$ (45)
$\leq\frac{L}{K}\sum^{K}_{k=1}\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}\cdot\|g^{k}_{t}\|^{2}.$ (46)

The first step is due to Jensen’s inequality, i.e.,

$\Big\|\sum^{K}_{k=1}x_{k}\Big\|^{2}\leq K\sum^{K}_{k=1}\|x_{k}\|^{2}$ (47)

and the next step is due to the Cauchy-Schwarz inequality, i.e., $\|X\odot\Xi\|^{2}\leq\|X\|^{2}$ [1]. Combining Lemma 1 with the latter term of (46), we finalize the proof.
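For completeness, both inequalities admit one-line justifications (standard facts, restated here under the assumption that the mask entries are binary, i.e., $\Xi_{i}\in\{0,1\}$):
\[
\Big\|\sum^{K}_{k=1}x_{k}\Big\|^{2}=K^{2}\Big\|\frac{1}{K}\sum^{K}_{k=1}x_{k}\Big\|^{2}\leq K^{2}\cdot\frac{1}{K}\sum^{K}_{k=1}\|x_{k}\|^{2}=K\sum^{K}_{k=1}\|x_{k}\|^{2},
\qquad
\|X\odot\Xi\|^{2}=\sum_{i}\Xi_{i}^{2}X_{i}^{2}\leq\sum_{i}X_{i}^{2}=\|X\|^{2},
\]
where the first uses the convexity of $\|\cdot\|^{2}$ and the second uses $\Xi_{i}^{2}\leq 1$.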

-D Proof of Lemma 3

According to (20) and Assumption 3, the distance between $f_{t}$ and $\bar{f}_{t}$ is given as,

\|f_{t}-\bar{f}_{t}\|^{2} = \Big\|\sum^{L}_{l=1}\frac{1}{Kp_{l}}\sum^{K}_{k\in|X_{l}|}(g^{k}_{t}-\bar{g}^{k}_{t})\odot\Xi_{l}\Big\|^{2} (48)
\leq \frac{L}{K}\sum^{L}_{l=1}\sum^{K}_{k=1}\frac{1}{p_{l}^{2}}\|g^{k}_{t}-\bar{g}^{k}_{t}\|^{2}. (49)

This step is due to Jensen's inequality, as in (44)–(46). With Assumption 3, we have $\mathbb{E}\|g^{k}_{t}-\bar{g}^{k}_{t}\|^{2}\leq\sigma_{k}^{2}$. Combining these results finalizes the proof.
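Spelling out the combination (a restatement under the notation above, not a new result): taking the expectation of (49) and applying Assumption 3 termwise gives
\[
\mathbb{E}\|f_{t}-\bar{f}_{t}\|^{2}\leq\frac{L}{K}\sum^{L}_{l=1}\sum^{K}_{k=1}\frac{1}{p_{l}^{2}}\,\mathbb{E}\|g^{k}_{t}-\bar{g}^{k}_{t}\|^{2}\leq\frac{L}{K}\sum^{L}_{l=1}\sum^{K}_{k=1}\frac{\sigma_{k}^{2}}{p_{l}^{2}},
\]
which is presumably the bound stated in Lemma 3.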

-E Completing the Proof of Theorem 1

Using (20), the distance between $\Theta_{t+1}$ and the optimum $\Theta^{*}$ is expanded as,

\|\Theta_{t+1}-\Theta^{*}\|^{2} = \|\Theta_{t}-\eta_{t}f_{t}-\Theta^{*}+\eta_{t}\bar{f}_{t}-\eta_{t}\bar{f}_{t}\|^{2} (50)
= \underbrace{\|\Theta_{t}-\eta_{t}\bar{f}_{t}-\Theta^{*}\|^{2}}_{G_{3}} (51)
+ \underbrace{2\eta_{t}\langle\Theta_{t}-\Theta^{*}-\eta_{t}\bar{f}_{t},\bar{f}_{t}-f_{t}\rangle}_{G_{4}} (52)
+ \underbrace{\eta_{t}^{2}\|f_{t}-\bar{f}_{t}\|^{2}}_{G_{5}}. (53)
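As a sanity check of the decomposition (51)–(53) (with $\bar{f}_{t}$ inside $G_{4}$), write $\bm{a}=\Theta_{t}-\eta_{t}\bar{f}_{t}-\Theta^{*}$ and $\bm{b}=\eta_{t}(\bar{f}_{t}-f_{t})$; then
\[
\|\Theta_{t}-\eta_{t}f_{t}-\Theta^{*}\|^{2}=\|\bm{a}+\bm{b}\|^{2}=\|\bm{a}\|^{2}+2\langle\bm{a},\bm{b}\rangle+\|\bm{b}\|^{2}=G_{3}+G_{4}+G_{5}.
\]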

We investigate the bound of $G_{3}$ as follows,

G_{3}=\|\Theta_{t}-\Theta^{*}\|^{2}\underbrace{-2\eta_{t}\langle\Theta_{t}-\Theta^{*},\bar{f}_{t}\rangle}_{G_{6}}+\eta_{t}^{2}\|\bar{f}_{t}\|^{2}. (54)

The term $G_{6}/(2\eta_{t})$ is bounded as,

\frac{G_{6}}{2\eta_{t}} \stackrel{(\text{a})}{\leq} F(\Theta^{*})-F(\Theta_{t})-\frac{\mu}{2}\|\Theta_{t}-\Theta^{*}\|^{2} (55)
\stackrel{(\text{b})}{\leq} -\frac{1}{2\beta}\|\bar{f}_{t}\|^{2}-\frac{\mu}{2}\|\Theta_{t}-\Theta^{*}\|^{2} (56)
\stackrel{(\text{c})}{\leq} -\frac{\mu}{2}\|\Theta_{t}-\Theta^{*}\|^{2}. (57)
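Step (b), explained just below, follows from the standard descent consequence of smoothness; a minimal worked expansion, assuming $F$ is $\beta$-smooth and $\bar{f}_{t}=\nabla F(\Theta_{t})$:
\[
F(\Theta^{*})\leq\min_{\Theta^{\prime}}\Big[F(\Theta_{t})+\langle\bar{f}_{t},\Theta^{\prime}-\Theta_{t}\rangle+\frac{\beta}{2}\|\Theta^{\prime}-\Theta_{t}\|^{2}\Big]=F(\Theta_{t})-\frac{1}{2\beta}\|\bar{f}_{t}\|^{2},
\]
which rearranges to $F(\Theta^{*})-F(\Theta_{t})\leq-\frac{1}{2\beta}\|\bar{f}_{t}\|^{2}$ as used in (56).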

The steps (a), (b), and (c) are due to $\mu$-strong convexity, $\beta$-smoothness (as expanded above), and $\|\bar{f}_{t}\|^{2}\geq 0$, respectively. Since $\mathbb{E}[f_{t}]=\bar{f}_{t}$, we have $\mathbb{E}[G_{4}]=0$, while $G_{5}$ is bounded by Lemma 3. Combining Lemma 2, Lemma 3, and these results, we obtain the bound on the LHS of (50). Taking the expectation of (50)–(53), under Assumption 1 and with a learning rate $\eta_{t}\leq\frac{1}{\beta}$, the error between the updated global model and its optimum evolves as,

\mathbb{E}\|\Theta_{t+1}-\Theta^{*}\|^{2}\leq(1-\eta_{t}\mu)\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}+\eta_{t}^{2}\underbrace{\Big(EL^{2}(2-\lambda)^{2}+L\delta\Big)\sum^{L}_{l=1}\frac{1}{p_{l}^{2}}}_{:=B}. (58)

Since $\eta_{t}=\frac{2}{\mu t+2\beta-\mu}\leq\frac{1}{\beta}$, applying (58), we have

\Delta_{t+1}\leq(1-\eta_{t}\mu)\Delta_{t}+\eta_{t}^{2}B. (59)

For the diminishing step size $\eta_{t}$, we prove by induction that $\Delta_{t}\leq\frac{v}{t+2\kappa-1}$, where $\kappa=\frac{\beta}{\mu}$ and $v=\max\{2\kappa\Delta_{1},4B/\mu^{2}\}$, as elaborated next. The base case $\Delta_{1}\leq\frac{v}{2\kappa}$ holds by the definition of $v$. Assuming $\Delta_{t}\leq\frac{v}{t+2\kappa-1}$ holds for some $t$, we have

\Delta_{t+1}\leq(1-\mu\eta_{t})\Delta_{t}+\eta^{2}_{t}B (60)
\leq\left(1-\frac{2}{t+2\kappa-1}\right)\frac{v}{t+2\kappa-1}+\frac{4B/\mu^{2}}{(t+2\kappa-1)^{2}} (61)
=\frac{(t+2\kappa-2)v-(v-4B/\mu^{2})}{(t+2\kappa-1)^{2}}\leq\frac{t+2\kappa-2}{(t+2\kappa-1)^{2}}v (62)
\leq\frac{v}{t+2\kappa}. (63)
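The inequality in (62) uses $v\geq 4B/\mu^{2}$ by the definition of $v$, and the final step (63) reduces to an elementary identity:
\[
(t+2\kappa-1)^{2}-(t+2\kappa)(t+2\kappa-2)=1\geq 0
\;\Longrightarrow\;
\frac{t+2\kappa-2}{(t+2\kappa-1)^{2}}\leq\frac{1}{t+2\kappa}.
\]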

For $t=1$, we obtain

v=\max\Big\{2\kappa\Delta_{1},\frac{4B}{\mu^{2}}\Big\}\leq 2\kappa\Delta_{1}+\frac{4B}{\mu^{2}}. (64)
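Combining the induction result $\Delta_{t}\leq\frac{v}{t+2\kappa-1}$ with (64), and with $\Delta_{t}=\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}$ as implied by (58)–(59), the convergence bound takes the following form (restated here under the definitions above; the constants in the statement of Theorem 1 may be arranged differently):
\[
\mathbb{E}\|\Theta_{t}-\Theta^{*}\|^{2}\leq\frac{1}{t+2\kappa-1}\Big(2\kappa\Delta_{1}+\frac{4B}{\mu^{2}}\Big).
\]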

Finally, using Assumption 1, (58), and the results above, we complete the proof of the theorem.

References

  • [1] H. Baek, W. J. Yun, Y. Kwak, S. Jung, M. Ji, M. Bennis, J. Park, and J. Kim, “Joint superposition coding and training for federated learning over multi-width neural networks,” in Proc. IEEE Conference on Computer Communications (INFOCOM), May 2022.
  • [2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell et al., “Quantum supremacy using a programmable superconducting processor,” Nature, vol. 574, no. 7779, pp. 505–510, 2019.
  • [3] W. J. Yun, J. Park, and J. Kim, “Quantum multi-agent meta reinforcement learning,” in Proc. AAAI Conference on Artificial Intelligence, Washington DC, USA, February 2023.
  • [4] W. J. Yun, Y. Kwak, J. P. Kim, H. Cho, S. Jung, J. Park, and J. Kim, “Quantum multi-agent reinforcement learning via variational quantum circuit design,” in Proc. IEEE International Conference on Distributed Computing Systems (ICDCS), Bologna, Italy, July 2022.
  • [5] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1484–1509, October 1997.
  • [6] J. Preskill, “Quantum computing in the NISQ era and beyond,” Quantum, vol. 2, p. 79, August 2018.
  • [7] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, “Variational quantum circuits for deep reinforcement learning,” IEEE Access, vol. 8, pp. 141 007–141 024, 2020.
  • [8] “Quantum distributed deep learning architectures: Models, discussions, and applications,” ICT Express, 2022.
  • [9] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, “Supervised learning with quantum-enhanced feature spaces,” Nature, vol. 567, no. 7747, pp. 209–212, 2019.
  • [10] O. Lockwood and M. Si, “Reinforcement learning with quantum variational circuit,” in Proc. AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 16, no. 1, 2020, pp. 245–251.
  • [11] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen, “In-edge AI: Intelligentizing mobile edge computing, caching and communication by federated learning,” IEEE Network, vol. 33, no. 5, pp. 156–165, 2019.
  • [12] J. Park, S. Samarakoon, A. Elgabli, J. Kim, M. Bennis, S.-L. Kim, and M. Debbah, “Communication-efficient and distributed learning over wireless networks: Principles and applications,” Proceedings of the IEEE, vol. 109, no. 5, pp. 796–819, May 2021.
  • [13] S. Niknam, H. S. Dhillon, and J. H. Reed, “Federated learning for wireless communications: Motivation, opportunities, and challenges,” IEEE Communications Magazine, vol. 58, no. 6, pp. 46–51, 2020.
  • [14] D. Kwon, J. Jeon, S. Park, J. Kim, and S. Cho, “Multiagent DDPG-based deep learning for smart ocean federated learning IoT networks,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9895–9903, 2020.
  • [15] S. Y.-C. Chen and S. Yoo, “Federated quantum machine learning,” Entropy, vol. 23, no. 4, p. 460, 2021.
  • [16] H. Zhou, K. Lv, L. Huang, and X. Ma, “Quantum network: Security assessment and key management,” IEEE/ACM Transactions on Networking, vol. 30, no. 3, pp. 1328–1339, 2022.
  • [17] R. Pujahari and A. Tanwar, “Quantum federated learning for wireless communications,” in Federated Learning for IoT Applications.   Springer, 2022, pp. 215–230.
  • [18] A. Cho, “IBM promises 1000-qubit quantum computer—a milestone—by 2023,” Science, vol. 15, 2020.
  • [19] J. Gambetta, “Our new 2022 development roadmap,” IBM Quantum Computing, May 2022.
  • [20] J. Yu and T. S. Huang, “Universally slimmable networks and improved training techniques,” in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, October 2019, pp. 1803–1811.
  • [21] D. Kim, J. Kim, J. Kwon, and T.-H. Kim, “Depth-controllable very deep super-resolution network,” in Proc. IEEE International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, July 2019, pp. 1–8.
  • [22] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, “Barren plateaus in quantum neural network training landscapes,” Nature Communications, vol. 9, no. 1, pp. 1–6, 2018.
  • [23] S. H. Sack, R. A. Medina, A. A. Michailidis, R. Kueng, and M. Serbyn, “Avoiding barren plateaus using classical shadows,” PRX Quantum, vol. 3, no. 2, June 2022.
  • [24] T. Sleator and H. Weinfurter, “Realizable universal quantum logic gates,” Physical Review Letters, vol. 74, no. 20, p. 4087, 1995.
  • [25] M. M. Wilde, Quantum information theory.   Cambridge University Press, 2013.
  • [26] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, “Quantum circuit learning,” Physical Review A, vol. 98, no. 3, p. 032309, 2018.
  • [27] M. Chehimi and W. Saad, “Quantum federated learning with quantum data,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, May 2022, pp. 8617–8621.
  • [28] W. J. Yun, J. P. Kim, S. Jung, J. Park, M. Bennis, and J. Kim, “Slimmable quantum federated learning,” in Proc. of ICML Workshop on Dynamic Neural Networks, Baltimore, MD, USA, July 2022.
  • [29] D. Bouwmeester and A. Zeilinger, “The physics of quantum information: basic concepts,” in the Physics of Quantum Information, 2000, pp. 1–14.
  • [30] C. P. Williams, S. H. Clearwater et al., Explorations in quantum computing.   Springer, 1998.
  • [31] N. Killoran, T. R. Bromley, J. M. Arrazola, M. Schuld, N. Quesada, and S. Lloyd, “Continuous-variable quantum neural networks,” Physical Review Research, vol. 1, no. 3, p. 033063, 2019.
  • [32] O. Simeone, “An introduction to quantum machine learning for engineers,” Foundations and Trends® in Signal Processing, vol. 16, no. 1-2, pp. 1–223, 2022.
  • [33] N. H. Tran, W. Bao, A. Y. Zomaya, M. N. H. Nguyen, and C. S. Hong, “Federated learning over wireless networks: Optimization model design and analysis,” in Proc. IEEE Conference on Computer Communications (INFOCOM), Paris, France, 2019, pp. 1387–1395.
  • [34] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, April 2017, pp. 1273–1282.
  • [35] X. Li, M. Jiang, X. Zhang, M. Kamp, and Q. Dou, “FedBN: Federated learning on non-IID features via local batch normalization,” in Proc. International Conference on Learning Representations (ICLR), 2021.
  • [36] L. Mangasarian, “Parallel gradient distribution in unconstrained optimization,” SIAM Journal on Control and Optimization, vol. 33, no. 6, pp. 1916–1925, 1995.
  • [37] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan, “Better mini-batch algorithms via accelerated gradient methods,” Proc. Advances in Neural Information Processing Systems (NIPS), vol. 24, 2011.
  • [38] N. Karakoç, A. Scaglione, M. Reisslein, and R. Wu, “Federated edge network utility maximization for a multi-server system: Algorithm and convergence,” IEEE/ACM Transactions on Networking, vol. 30, no. 5, pp. 2002–2017, 2022.
  • [39] C. T. Dinh, N. H. Tran, M. N. H. Nguyen, C. S. Hong, W. Bao, A. Y. Zomaya, and V. Gramoli, “Federated learning over wireless networks: Convergence analysis and resource allocation,” IEEE/ACM Transactions on Networking, vol. 29, no. 1, pp. 398–409, February 2021.
  • [40] A. Khaled, K. Mishchenko, and P. Richtarik, “Tighter theory for local SGD on identical and heterogeneous data,” in Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 108, August 2020, pp. 4519–4529.
  • [41] X. You and X. Wu, “Exponentially many local minima in quantum neural networks,” in Proc. of the International Conference on Machine Learning (ICML), Virtual, July 2021.
  • [42] D. Greenberger, K. Hentschel, and F. Weinert, Compendium of quantum physics: concepts, experiments, history and philosophy.   Springer Science & Business Media, 2009.
  • [43] Y. Subaşı, L. Cincio, and P. J. Coles, “Entanglement spectroscopy with a depth-two quantum circuit,” Journal of Physics A: Mathematical and Theoretical, vol. 52, no. 4, p. 044001, January 2019.
  • [44] R. Jozsa, “Fidelity for mixed quantum states,” Journal of Modern Optics, vol. 41, no. 12, pp. 2315–2323, 1994.
  • [45] D. N. C. Tse and P. Viswanath, Fundamentals of Wireless Communications, 2005.
  • [46] T. Cover, “Broadcast channels,” IEEE Transactions on Information Theory, vol. 18, no. 1, pp. 2–14, January 1972.
  • [47] J. Choi, “Joint rate and power allocation for NOMA with statistical CSI,” IEEE Transactions on Communications, vol. 65, no. 10, pp. 4519–4528, October 2017.
  • [48] M. Choi, D. Yoon, and J. Kim, “Blind signal classification for non-orthogonal multiple access in vehicular networks,” IEEE Transactions on Vehicular Technology, vol. 68, no. 10, pp. 9722–9734, 2019.
  • [49] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” in Proc. of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.
  • [50] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
  • [51] T. H. Hsu, H. Qi, and M. Brown, “Measuring the effects of non-identical data distribution for federated visual classification,” CoRR, vol. abs/1909.06335, September 2019.
Won Joon Yun has been a Ph.D. student in electrical and computer engineering at Korea University, Seoul, Republic of Korea, since March 2021, where he received his B.S. in electrical engineering. He was a visiting researcher at Cipherome Inc., San Jose, CA, USA, during summer 2022, and a visiting researcher at the University of Southern California, Los Angeles, CA, USA, during winter 2022 for a joint project with Prof. Andreas F. Molisch at the Ming Hsieh Department of Electrical and Computer Engineering, USC Viterbi School of Engineering.
Jae Pyoung Kim has been with the School of Electrical Engineering, Korea University, Seoul, Republic of Korea, since March 2017, where he is currently a B.S. student in electrical and computer engineering. He has also been a research engineer at the Artificial Intelligence and Mobility (AIM) Laboratory at Korea University since 2021. His current research interests include quantum machine learning.
Hankyul Baek has been a Ph.D. student in electrical and computer engineering at Korea University, Seoul, Republic of Korea, since March 2021. He received his B.S. in electrical engineering from Korea University in 2020 and was with LG Electronics, Seoul, Republic of Korea, from 2020 to 2021. His current research interests include quantum machine learning and its applications.
Soyi Jung has been an assistant professor at the Department of Electrical and Computer Engineering, Ajou University, Suwon, Republic of Korea, since September 2022. She was a visiting scholar at the Donald Bren School of Information and Computer Sciences, University of California, Irvine, CA, USA, from 2021 to 2022, a research professor at Korea University, Seoul, Republic of Korea, during 2021, and a researcher at the Korea Testing and Research (KTR) Institute, Gwacheon, Republic of Korea, from 2015 to 2016. She received her B.S., M.S., and Ph.D. degrees in electrical and computer engineering from Ajou University, Suwon, Republic of Korea, in 2013, 2015, and 2021, respectively. Her current research interests include network optimization for autonomous vehicle communications, distributed system analysis, big-data processing platforms, and probabilistic access analysis. She was a recipient of the Best Paper Award by KICS (2015), the Young Women Researcher Award by WISET and KICS (2015), the Bronze Paper Award from the IEEE Seoul Section Student Paper Contest (2018), the ICT Paper Contest Award by Electronic Times (2019), and the IEEE ICOIN Best Paper Award (2021).
Jihong Park (Senior Member, IEEE) received the B.S. and Ph.D. degrees from Yonsei University, South Korea. He is currently a Lecturer (Assistant Professor) with the School of Information Technology, Deakin University, Australia. His research interests include ultra-dense/ultra-reliable/mmWave system designs, and distributed learning/control/ledger technologies and their applications to beyond-5G/6G communication systems. He has served as a Conference/Workshop Program Committee Member for IEEE GLOBECOM, ICC, and WCNC, as well as for NeurIPS, ICML, and IJCAI. He is an Associate Editor of Frontiers in Data Science for Communications and a Review Editor of Frontiers in Aerial and Space Networks.
Mehdi Bennis (Fellow, IEEE) is a tenured Full Professor with the Centre for Wireless Communications, University of Oulu, Finland, an Academy of Finland Research Fellow, and the Head of the Intelligent Connectivity and Networks/Systems Group (ICON). He has published more than 200 research papers in international conferences, journals, and book chapters. His main research interests are in radio resource management, heterogeneous networks, game theory, and distributed machine learning in 5G networks and beyond. He has received several prestigious awards, including the 2015 Fred W. Ellersick Prize from the IEEE Communications Society, the 2016 Best Tutorial Prize from the IEEE Communications Society, the 2017 EURASIP Best Paper Award for the Journal of Wireless Communications and Networks, the All-University of Oulu Award for research, the 2019 IEEE ComSoc Radio Communications Committee Early Achievement Award, and recognition as a 2020 Clarivate Highly Cited Researcher (Web of Science). He is an Editor of IEEE Transactions on Communications and the Specialty Chief Editor of Data Science for Communications in Frontiers in Communications and Networks.
Joongheon Kim (Senior Member, IEEE) has been with Korea University, Seoul, Korea, since 2019, where he is currently an associate professor. He received the B.S. and M.S. degrees in computer science and engineering from Korea University, Seoul, Korea, in 2004 and 2006, respectively; and the Ph.D. degree in computer science from the University of Southern California (USC), Los Angeles, CA, USA, in 2014. Before joining Korea University, he was with LG Electronics (Seoul, Korea, 2006–2009), Intel Corporation (Santa Clara in Silicon Valley, CA, USA, 2013–2016), and Chung-Ang University (Seoul, Korea, 2016–2019). He serves as an editor for IEEE Transactions on Vehicular Technology, IEEE Transactions on Machine Learning in Communications and Networking, IEEE Communications Standards Magazine, Computer Networks (Elsevier), and ICT Express (Elsevier). He is also a distinguished lecturer for IEEE Communications Society (ComSoc) (2022-2023) and IEEE Systems Council (2022-2024). He was a recipient of Annenberg Graduate Fellowship with his Ph.D. admission from USC (2009), Intel Corporation Next Generation and Standards (NGS) Division Recognition Award (2015), IEEE Systems Journal Best Paper Award (2020), IEEE ComSoc Multimedia Communications Technical Committee (MMTC) Outstanding Young Researcher Award (2020), IEEE ComSoc MMTC Best Journal Paper Award (2021), and Best Special Issue Guest Editor Award by ICT Express (Elsevier) (2022). He also received several awards from IEEE conferences including IEEE ICOIN Best Paper Award (2021), IEEE Vehicular Technology Society (VTS) Seoul Chapter Awards (2019, 2021), and IEEE ICTC Best Paper Award (2022).