
Non-IID Quantum Federated Learning with One-shot Communication Complexity

Haimeng Zhao (赵海萌) [email protected] Zhili College, Tsinghua University, Beijing 100084, China
Abstract

Federated learning refers to the task of machine learning based on decentralized data from multiple clients while keeping the data private. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.

preprint: APS/123-QED

I Introduction

Recent advances in artificial intelligence [1] and quantum computing [2] have given birth to the emerging field of quantum machine learning [3, 4]. By leveraging quantum advantages in machine learning tasks, quantum machine learning has demonstrated unprecedented power in solving a wide range of problems. Notable examples include solving linear equations [5], quantum supervised learning [6, 7, 8, 9], and quantum generative learning [10, 11, 12]. This line of research mostly focuses on developing quantum algorithms that can showcase quantum speed-up in machine learning problems.

One of the most important ingredients in machine learning is the data. In real-world applications, data are often distributed among multiple clients and cannot be gathered into a single joint dataset for various reasons. For example, medical data from patients across multiple hospitals [13] or cyber-physical attack data from Internet of Things devices [14] are sensitive and private, and cannot be directly shared without anonymization. Obstacles in data transmission may also forbid us from constructing a joint dataset. For example, the data size might be too large and therefore the transmission is too expensive, or we need quantum data that suffer from decoherence and are hard to preserve or transmit.

Machine learning algorithms with such decentralized data have been developed under the name of federated learning [15, 16]. The conventional solution is the federated averaging algorithm [16], or FedAvg, in which multiple clients jointly train a global model by sharing only the model parameters/updates while keeping their data private. Its quantum extensions, dubbed qFedAvg, have been proposed to incorporate quantum features such as quantum speed-up and blind computing [17, 18, 19, 20, 21].

However, the classical FedAvg algorithm has three shortcomings already identified in the literature. Firstly, it suffers from the non-IID quagmire [22, 23], i.e. its performance provably deteriorates when the local data of different clients are not independent and identically distributed (IID), whereas real data are often heterogeneous or even multi-modal across clients. Secondly, the joint training involves gradient sharing, so data privacy is threatened by attacks based on gradient inversion [24, 25]. Thirdly, it requires many rounds of communication, which is often the bottleneck of real-world applications. To reduce communication burdens, multiple one-shot alternatives have been proposed [26, 27, 28, 29].

In this work, we discuss whether the non-IID quagmire exists in quantum federated learning. The answer is yes, and we support it with both theoretical analysis and numerical experiments. Then we move on to propose a solution to this problem. We prove that a global quantum channel can be exactly decomposed into channels trained by each client with the help of local density estimators. It provides a general framework, dubbed qFedInf, for quantum federated learning on non-IID data. Meanwhile, qFedInf is one-shot in terms of communication complexity and is free from gradient sharing. We further identify its connection to mixture of experts (MoE) and ensemble learning. Numerical experiments in highly non-IID settings demonstrate that the proposed framework outperforms the conventional algorithm significantly with only one communication round.

II Main Results

II.1 Decentralized Quantum Data

We begin by setting up a typical decentralized quantum dataset. Classical datasets can be regarded as a special case of it, where the data samples are orthogonal to each other.

In federated learning tasks, we are given a set of clients $\{C_{i}\}_{i=1}^{n}$, and we can use a set of orthonormal basis elements $\{\ket{C_{i}}\in\mathcal{H}_{C}\}_{i=1}^{n}$ to denote them. Each of the clients has access to its own dataset containing $N_{C_{i}}$ samples $D_{i}=\{\ket{\psi_{j}^{C_{i}}}\}_{j=1}^{N_{C_{i}}}$ from the data Hilbert space $\mathcal{H}_{x}$. The dataset can be statistically represented by the density matrix $\rho_{x}^{C_{i}}=N_{C_{i}}^{-1}\sum_{j}\ket{\psi_{j}^{C_{i}}}\bra{\psi_{j}^{C_{i}}}$. Note that the density matrices of different clients may vary dramatically (i.e. non-IID). For example, in a multi-class classification problem, each client may have only seen samples from two or three classes.

Since the construction of entanglement between macroscopic objects from distant places is in general very difficult and is not expected to be realized in the near future [2], we focus on the situation where there is no entanglement among the clients. Then the joint density matrix of both the data and the clients is represented by $\rho=\sum_{i=1}^{n}\rho_{x}^{C_{i}}\otimes p_{C_{i}}\ket{C_{i}}\bra{C_{i}}$, where $p_{C_{i}}=N_{C_{i}}/N$ is the statistical weight of client $C_{i}$, and $N=\sum_{i}N_{C_{i}}$ is the total number of samples. For a classical dataset, the joint density matrix $\rho$ is diagonal, and its matrix elements correspond to the joint probability distribution $p(x,C_{i})$.

The averaged data density matrix is obtained by tracing out the clients: $\rho_{x}=\text{Tr}_{C}(\rho)=\sum_{i}p_{C_{i}}\rho_{x}^{C_{i}}$. Let $P_{C_{i}}=\ket{C_{i}}\bra{C_{i}}$ be the projector onto the subspace of client $C_{i}$. Then we can introduce the conditional density matrix [30, 31] $\rho_{x|C_{i}}=P_{C_{i}}\rho P_{C_{i}}/\text{Tr}(\rho P_{C_{i}})=\rho_{x}^{C_{i}}\otimes\ket{C_{i}}\bra{C_{i}}$, which characterizes the data of a given client $C_{i}$. Tracing out the clients recovers the local dataset: $\text{Tr}_{C}(\rho_{x|C_{i}})=\rho_{x}^{C_{i}}$. We summarize these concepts in Figure 1. For supervised learning tasks such as classification, there will also be a label $y$ associated with each data sample $\ket{\psi}$. We have omitted the labels here for simplicity.

Figure 1: A schematic view of decentralized quantum data.
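For concreteness, the objects above can be assembled explicitly in a small classical simulation. The following NumPy sketch builds the local density matrices, the joint state, and the averaged data density matrix for a toy two-client example; the dimensions, sample counts, and random states are placeholders chosen purely for illustration.

import numpy as np

def random_pure_states(dim, n, seed):
    # Draw n random complex vectors of dimension dim and normalize them to pure states.
    rng = np.random.default_rng(seed)
    psi = rng.normal(size=(n, dim)) + 1j * rng.normal(size=(n, dim))
    return psi / np.linalg.norm(psi, axis=1, keepdims=True)

dim, n_clients = 4, 2
samples = [random_pure_states(dim, N, seed=i) for i, N in enumerate([30, 70])]
p_C = np.array([len(s) for s in samples], dtype=float)
p_C /= p_C.sum()                                   # statistical weights p_{C_i} = N_{C_i}/N

# Local data density matrices rho_x^{C_i} = (1/N_i) sum_j |psi_j><psi_j|
rho_local = [np.einsum('ji,jk->ik', s, s.conj()) / len(s) for s in samples]

# Joint state rho = sum_i p_{C_i} * rho_x^{C_i} (x) |C_i><C_i| (no entanglement among clients)
basis = np.eye(n_clients)
rho_joint = sum(p * np.kron(r, np.outer(basis[i], basis[i]))
                for i, (p, r) in enumerate(zip(p_C, rho_local)))

# Averaged data density matrix rho_x = Tr_C(rho) = sum_i p_{C_i} rho_x^{C_i}
rho_x = sum(p * r for p, r in zip(p_C, rho_local))
print(np.trace(rho_joint).real, np.trace(rho_x).real)  # both equal 1 up to numerical error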

II.2 Quantum Federated Averaging

Here we briefly review the quantum version of FedAvg [16]. In a general supervised quantum machine learning problem, we aim to find a quantum channel $\mathcal{M}$ that takes an input state $\ket{\psi}$ and transforms it into the desired result (e.g. which class the image belongs to). We achieve this by tuning the variational parameters $w$ of the channel $\mathcal{M}_{w}$ to minimize the average of some loss function $\mathcal{L}$ on a given training dataset $D$:

$$\min_{w}\frac{1}{N}\sum_{(\ket{\psi},y)\in D}\mathcal{L}(w,\ket{\psi},y),$$ (1)

where $\ket{\psi}$ is the input and $y$ is the corresponding label. We use gradient descent with learning rate $\eta$ to iteratively solve this optimization problem: at time step $t$,

$$w_{t}=w_{t-1}-\eta\nabla_{w}\frac{1}{N}\sum_{(\ket{\psi},y)\in D}\mathcal{L}(w,\ket{\psi},y).$$ (2)

For decentralized data, the total dataset is divided into local datasets from multiple clients: $D=\cup_{i}D_{i}$. Then we can decompose the update rule of Equation (2) into three steps: (1) local updates at time step $t$ for each client $C_{i}$,

$$w_{t}^{i}=w_{t-1}^{i}-\eta\nabla_{w}\frac{1}{N_{C_{i}}}\sum_{(\ket{\psi},y)\in D_{i}}\mathcal{L}(w,\ket{\psi},y),$$ (3)

(2) a global average every $T$ steps, e.g. at time step $mT$, $m\in\mathbb{N}$,

$$w_{mT}=\sum_{i=1}^{n}p_{C_{i}}w_{mT}^{i},$$ (4)

and (3) broadcast the averaged weights to all clients as the initial parameters for the next iteration. This is the basic protocol of qFedAvg, the common ground of all the existing quantum federated learning algorithms [17, 18, 19, 20, 21]. It recovers the centralized update rule when $T=1$. In practice, however, this cannot be achieved, since mini-batch training strategies are used.
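For concreteness, a minimal classical sketch of one such qFedAvg round is given below. It assumes a generic parameter vector and a hypothetical local_grad callback returning each client's local loss gradient (in the quantum case this would be estimated, e.g., via the parameter shift rule); the interface is illustrative, not the paper's implementation.

import numpy as np

def qfedavg_round(w_global, clients, eta, T):
    # One qFedAvg round: T local gradient steps per client (Eq. (3)),
    # followed by the weighted global average (Eq. (4)), which is then broadcast.
    local_weights = []
    for p_Ci, local_grad in clients:   # clients: list of (weight p_{C_i}, gradient oracle)
        w = w_global.copy()
        for _ in range(T):
            w = w - eta * local_grad(w)
        local_weights.append((p_Ci, w))
    return sum(p * w for p, w in local_weights)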

II.3 The Non-IID Quagmire of Quantum FedAvg

As pointed out in [22, 23], FedAvg faces the problem of non-IID quagmire, in which its performance deteriorates when the data of different clients are non-IID. Does this phenomenon also exist in the quantum regime?

We follow the steps of [22] and quantify the performance difference between qFedAvg and the centralized case by the weight divergence $\Delta=\|w^{(f)}-w^{(c)}\|$, where $w^{(f)}$ and $w^{(c)}$ are the weights given by qFedAvg and the centralized case, respectively. On the other hand, the level of non-IID can be quantified by the earth mover's distance (EMD) [32] between the label distribution of client $C_{i}$, $p^{(i)}(y=k)=\sum_{y\in D_{i}}\delta_{y,k}/N_{C_{i}}$, and the centralized distribution $p(y=k)=\sum_{y\in D}\delta_{y,k}/N$: $\text{EMD}_{i}=\sum_{k}\|p^{(i)}(y=k)-p(y=k)\|$. Then, as a direct quantum extension of Proposition 3.1 in [22], we have the following proposition:

Proposition 1

For a loss function of the form $\mathcal{L}(w,\ket{\psi},y)=\sum_{k=1}^{n_{c}}\delta_{y,k}f_{k}(w,\ket{\psi})$ (in a typical classification problem, $k$ denotes the different classes and the standard cross entropy loss is given by $f_{k}(w,\ket{\psi})=\log r_{k}(w,\ket{\psi})$, where $r_{k}$ is the predicted probability of $\ket{\psi}$ belonging to class $k$), if the gradient $\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w,\ket{\psi})$, defined as $\nabla_{w}(\sum_{\ket{\psi},y}\delta_{y,k}f_{k}(w,\ket{\psi})/\sum_{\ket{\psi},y}\delta_{y,k})$, is $\lambda_{k}$-Lipschitz for all possible values $k$ of the label $y$, then the following inequality holds for the weight divergence of qFedAvg:

$$\begin{split}\Delta_{mT}\leq&\sum_{i=1}^{n}p_{C_{i}}(a_{i})^{T}\Delta_{(m-1)T}\\ &+\eta\sum_{i=1}^{n}p_{C_{i}}\text{EMD}_{i}\sum_{j=1}^{T-1}(a_{i})^{j}g(w^{(c)}_{mT-1-j}),\end{split}$$ (5)

where $g(w)=\max_{k}\|\nabla_{w}\sum_{\ket{\psi},y}\delta_{y,k}f_{k}(w,\ket{\psi})\|$ and $a_{i}=1+\eta\sum_{k}p^{(i)}(y=k)\lambda_{k}$.

Therefore, EMD is indeed a source of the weight divergence when $T>1$. This provides a theoretical explanation for the existence of the non-IID quagmire in quantum federated learning, similar to its classical counterpart [22]. We provide a detailed proof of Proposition 1 in Appendix A. Numerical experiments are conducted in Section III.3 to empirically illustrate this phenomenon in quantum federated learning.
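To make the non-IID quantifier concrete, the sketch below computes the EMD between a client's empirical label distribution and the centralized one for an 8-class problem; the numbers are illustrative only.

import numpy as np

def label_distribution(labels, n_classes):
    # Empirical label distribution p(y=k) from an array of integer labels.
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def emd(p_local, p_global):
    # For discrete label distributions this reduces to a sum of absolute differences.
    return np.abs(np.asarray(p_local) - np.asarray(p_global)).sum()

p_i = label_distribution(np.array([0, 0, 1, 1, 1]), n_classes=8)   # client sees only classes 0 and 1
p_all = np.full(8, 1 / 8)                                          # balanced centralized distribution
print(emd(p_i, p_all))                                             # large EMD: a highly non-IID client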

II.4 Federated Decomposition of Quantum Channels

Now that we have identified the non-IID quagmire in qFedAvg, we aim to find a different approach to tackle Equation (1) when the data are decentralized. We consider the case where the loss function is a function of the output of the quantum channel and the label: $\mathcal{L}(w,\ket{\psi},y)=\mathcal{L}(\mathcal{M}_{w}(\sigma_{x}^{\psi}),y)$, where $\sigma_{x}^{\psi}=\ket{\psi}\bra{\psi}$.

We begin by noting that when the data are decentralized, each client $C_{i}$ can still train a local channel $\mathcal{M}_{i}$ on its own data $D_{i}$:

$$\mathcal{M}_{i}=\operatorname{argmin}_{\mathcal{M}}\frac{1}{N_{C_{i}}}\sum_{(\ket{\psi},y)\in D_{i}}\mathcal{L}(\mathcal{M}(\sigma_{x}^{\psi}),y).$$ (6)

To make use of the local channels, we can decompose the original problem, Equation (1), as

$$\begin{split}&\min_{\mathcal{M}}\frac{1}{N}\sum_{(\ket{\psi},y)\in D}\mathcal{L}(\mathcal{M}(\sigma_{x}^{\psi}),y)\\ &=\min_{\mathcal{M}}\sum_{i}p_{C_{i}}\left[\frac{1}{N_{C_{i}}}\sum_{(\ket{\psi},y)\in D_{i}}\mathcal{L}(\mathcal{M}(\sigma_{x}^{\psi}),y)\right].\end{split}$$ (7)

We can solve the whole minimization problem if we can solve each individual sub-problem simultaneously, which is exactly what the $\mathcal{M}_{i}$ do. Therefore, if we can construct a global channel $\mathcal{M}$ such that it coincides with all the local channels $\mathcal{M}_{i}$ on their own data, i.e.

$$\mathcal{M}(\sigma_{x}^{\psi})=\mathcal{M}_{i}(\sigma_{x}^{\psi}),\quad\forall\ket{\psi}\text{ from }D_{i},$$ (8)

then the goal is achieved. (The fact that identical samples have identical labels guarantees that the global channel $\mathcal{M}$ is well defined. For example, suppose we have two identical samples $\ket{\psi^{C_{i}}}=\ket{\psi^{C_{j}}}=\ket{\psi}$ with the same label $y$, but from different clients $C_{i}\neq C_{j}$. Then, in order to fulfill the local minimization problems, Equation (6), we must have $\mathcal{M}_{i}(\ket{\psi}\bra{\psi})=\mathcal{M}_{j}(\ket{\psi}\bra{\psi})$.) This requirement can be rephrased the other way around: the local channels $\mathcal{M}_{i}$ must coincide with the global channel $\mathcal{M}$ whenever the input data $\sigma_{x}^{\psi}$ come from the client's own dataset $D_{i}$, which is statistically represented by $\rho_{x}^{C_{i}}$. That is,

$$\mathcal{M}_{i}(\sigma_{x}^{\psi})=\mathcal{M}\left(\frac{P_{x}^{\psi}\rho_{x}^{C_{i}}P_{x}^{\psi}}{\text{Tr}(\rho_{x}^{C_{i}}P_{x}^{\psi})}\right)=\mathcal{M}\left(\text{Tr}_{C}\left(\frac{P_{x}^{\psi}\rho_{x|C_{i}}P_{x}^{\psi}}{\text{Tr}(\rho_{x|C_{i}}P_{x}^{\psi})}\right)\right),$$ (9)

where $P_{x}^{\psi}=\ket{\psi}\bra{\psi}$ denotes the projector. (To avoid confusion, we insist on using different notations for $P_{x}^{\psi}=\sigma_{x}^{\psi}=\ket{\psi}\bra{\psi}$ to emphasize their difference in physical meaning: $\sigma_{x}^{\psi}$ denotes the quantum state that can be loaded into the circuit, while $P_{x}^{\psi}$ denotes the projection operator.)

To combine these local channels into a global one, we also need to capture the information of the local datasets $\rho_{x}^{C_{i}}$. To achieve this, we introduce the local density estimator $\mathcal{D}_{i}$, which is trained to output the probability density of the input state within the local dataset:

$$\mathcal{D}_{i}(\sigma_{x})=\text{Tr}(\rho_{x}^{C_{i}}P_{x}^{\psi})=\text{Tr}(\rho_{x|C_{i}}P_{x}^{\psi}).$$ (10)
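In a classical simulation where the local density matrix is known exactly, Equation (10) is simply an overlap; a minimal NumPy sketch is given below (on hardware this quantity would instead be estimated, e.g. with classical shadows).

import numpy as np

def density_estimate(rho_local, psi):
    # D_i(sigma_x) = Tr(rho_x^{C_i} |psi><psi|) = <psi| rho_x^{C_i} |psi>
    return np.real(np.vdot(psi, rho_local @ psi))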

In fact, the local channels $\mathcal{M}_{i}$ and density estimators $\mathcal{D}_{i}$ are enough to give an explicit construction of the global channel $\mathcal{M}$. This is provided by the following theorem (proved in Appendix B):

Theorem 1

(Federated Decomposition of Quantum Channels)
For each client $C_{i}$, $i=1,\cdots,n$, which only has access to its own data $\rho_{x}^{C_{i}}$ with $N_{C_{i}}$ samples, a local channel $\mathcal{M}_{i}$ and a local density estimator $\mathcal{D}_{i}$ can be trained. Assuming that there is no entanglement among the clients, the global channel $\mathcal{M}$ can then be decomposed as

$$\mathcal{M}(\sigma_{x})=\sum_{i=1}^{n}\mathcal{M}_{i}(\sigma_{x})q_{i}=\sum_{i=1}^{n}\mathcal{M}_{i}(\sigma_{x})\frac{\mathcal{D}_{i}(\sigma_{x})p_{C_{i}}}{\sum_{j}\mathcal{D}_{j}(\sigma_{x})p_{C_{j}}},$$ (11)

where $q_{i}=\mathcal{D}_{i}(\sigma_{x})p_{C_{i}}/\sum_{j}\mathcal{D}_{j}(\sigma_{x})p_{C_{j}}$ is the decomposition weight, $\sigma_{x}$ is any pure input state, and $p_{C_{i}}=N_{C_{i}}/\sum_{j=1}^{n}N_{C_{j}}$. The extension to mixed input states follows from direct linear superposition.

We note that the classical special case of this theorem was first introduced in [36]. As a result, for any input state $\sigma_{x}$, if we randomly apply a local channel $\mathcal{M}_{i}$ with probability $q_{i}$, the result will be statistically the same as that of the global channel $\mathcal{M}$. This fact leads to the following framework for quantum federated learning.

II.5 A One-shot Quantum Federated Learning Framework for Non-IID Data

Theorem 1 provides a framework for quantum federated learning. The specific protocol goes as follows. Firstly, each client $C_{i}$ trains a local channel $\mathcal{M}_{i}$ and a local density estimator $\mathcal{D}_{i}$ with its own data $\rho_{x}^{C_{i}}$. This step is completely distributed and concludes the whole training phase. Secondly, the trained channels $\mathcal{M}_{i}$ and density estimators $\mathcal{D}_{i}$ are sent to the server for inference according to Equation (11). That is, when a new input $\sigma_{x}^{\psi}$ comes, the server computes $q_{i}$ via the density estimators. Then it randomly loads the parameters of $\mathcal{M}_{i}$ with probability $q_{i}$ and gathers the outcomes. We call this framework quantum federated inference, or qFedInf for short. The detailed procedures are summarized in Algorithms 1 and 2.

Input: the local dataset $D_{i}$; number of training epochs $n_{e}$; batch size $n_{b}$; learning rate $\eta$; the optimizer $\mathrm{Opt}$.
Output: the trained channel $\mathcal{M}_{i}$ and density estimator $\mathcal{D}_{i}$, to be sent to the server.
Initialize the channel $\mathcal{M}_{i}$ with parameters $w$.
for $\mathrm{epoch}\leftarrow 1$ to $n_{e}$ do
      for each batch $(\ket{\psi_{j}^{C_{i}}},y_{j})_{j=1}^{n_{b}}\subset D_{i}$ do
            Calculate the gradient $g\leftarrow\nabla_{w}\sum_{j}\mathcal{L}(\mathcal{M}_{i,w}(\ket{\psi_{j}^{C_{i}}}\bra{\psi_{j}^{C_{i}}}),y_{j})$.
            Update the channel parameters $w\leftarrow w-\eta\cdot\mathrm{Opt}(g)$.
Train the density estimator $\mathcal{D}_{i}$.
return $\mathcal{M}_{i},\mathcal{D}_{i}$.
Algorithm 1: Quantum Federated Inference, training phase on client $C_{i}$.
Input: a new sample $\ket{\psi}$ to perform inference on; channels $\{\mathcal{M}_{i}\}_{i=1}^{n}$ and density estimators $\{\mathcal{D}_{i}\}_{i=1}^{n}$ gathered from the clients.
Output: the inference outcome $\mathcal{M}(\ket{\psi}\bra{\psi})$.
for $i\leftarrow 1$ to $n$ do
      Calculate the density $\tilde{q}_{i}\leftarrow\mathcal{D}_{i}(\ket{\psi}\bra{\psi})$.
Calculate the weights $q_{i}\leftarrow\tilde{q}_{i}p_{C_{i}}/\sum_{m}\tilde{q}_{m}p_{C_{m}}$, $i=1,\cdots,n$.
Choose a random channel $\mathcal{M}_{i}$ with probability $q_{i}$.
$\mathcal{M}(\ket{\psi}\bra{\psi})\leftarrow\mathcal{M}_{i}(\ket{\psi}\bra{\psi})$.
return $\mathcal{M}(\ket{\psi}\bra{\psi})$.
Algorithm 2: Quantum Federated Inference, inference phase.
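A minimal classical-simulation sketch of the inference phase (Algorithm 2) is given below; channels[i] and density_estimators[i] are placeholder callables standing in for the clients' trained local channels and density estimators.

import numpy as np

def qfed_inference(psi, channels, density_estimators, p_C, rng=None):
    # Compute the decomposition weights q_i of Eq. (11) and apply a random local channel.
    rng = np.random.default_rng() if rng is None else rng
    d = np.array([D(psi) for D in density_estimators])   # local densities D_i(|psi><psi|)
    q = d * np.asarray(p_C)
    q = q / q.sum()                                      # decomposition weights q_i
    i = rng.choice(len(channels), p=q)                   # pick M_i with probability q_i
    return channels[i](psi)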

In practice, there is a wide range of choices for the channels $\mathcal{M}_{i}$ and the density estimators $\mathcal{D}_{i}$. For the channels, we can use classical, quantum, or hybrid algorithms at will [37, 1, 38]. For the density estimators, we can use classical ones like Gaussian mixture models [37] and normalizing flows [39], quantum-inspired ones like [40], or quantum ones such as classical shadow tomography [41] and quantum state diagonalization algorithms [42, 43]. Each kind of density estimator comes with a different training strategy. We can also classify the possible scenarios based on the classical/quantum nature of the data and the channels:

Classical Data & Classical Channel: This is a purely classical problem already discussed in [36].

Classical Data & Quantum Channel: To apply quantum channels to classical data, one needs to choose an appropriate encoding scheme, e.g. amplitude encoding or gate encoding, to encode the data into quantum states. This gives rise to the problem of whether to use a classical density estimator on the original data or a quantum one on the encoded states. Note that before encoding, different samples are orthogonal to each other. However, the encoded quantum samples are in general overlapped, meaning that there’s a possibility of mistaking one sample for another. Nevertheless, quantum density estimators offer a potential exponential speed-up. So there’s a trade-off between accuracy and efficiency.

Quantum Data & Classical Channel: In general, trying to apply classical channels to process quantum data is not very efficient, as the tomography and representation of a quantum state cost exponentially large classical computational resources. Nevertheless, proposals [44] have been made to use the classical shadows [41] of a quantum state as the input to a classical machine learning algorithm. We defer to future works to investigate the performance of these proposals in a federated learning context.

Quantum Data & Quantum Channel: For quantum data, we need a quantum density estimator to estimate $\text{Tr}(\rho_{x}^{C_{i}}P_{x}^{\psi})$. This task can be solved by classical shadow tomography, which is proved to saturate the information-theoretic bound [41].

Compared to qFedAvg, the proposed framework qFedInf shares some merits with its classical counterpart [36]. It’s one-shot in the sense that only one communication between the server and each client is required. Meanwhile, there’s no need for a global public dataset or data synthesis/distillation, which are the common ground of existing one-shot algorithms [28, 29]. Moreover, since no gradient information is transmitted, it’s automatically immune to attacks based on gradient inversion [24, 25]. Finally, the density estimators can capture possible data heterogeneity, providing a new way to perform federated learning with non-IID data.

Though we have mainly focused on supervised learning throughout this paper, we note that Theorem 1 holds for generic quantum channels, which are the most general form of quantum information processing. The proposed framework qFedInf may therefore be applied to machine learning tasks beyond classification. In Appendix C we provide an example of applying qFedInf to perform quantum generative learning [10, 11, 12].

Finally, we give a brief discussion on the complexity of the proposed framework. Suppose that the number of clients is $N_{\text{client}}$, the number of iterations needed for training is $N_{\text{iter}}$, and the numbers of parameters in each of the quantum channels $\mathcal{M}_{i}$ and density estimators $\mathcal{D}_{i}$ are $N_{\text{channel}}$ and $N_{\text{density}}$, respectively. Then the communication complexity [15] of the proposed qFedInf is $\mathcal{O}(N_{\text{client}}(N_{\text{channel}}+N_{\text{density}}))$. In comparison, the communication complexity of qFedAvg is $\mathcal{O}(N_{\text{client}}N_{\text{iter}}N_{\text{channel}})$, which is much less efficient when $N_{\text{iter}}$ is large. As for the circuit complexity [45, 46], it depends on the specific quantum circuit chosen for the channels. If the channels have a quantum speed-up, so will qFedInf.
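As a rough illustration of this gap, with purely made-up sizes (these are not the paper's actual parameter counts):

N_client, N_iter, N_channel, N_density = 8, 500, 384, 50
qfedinf_cost = N_client * (N_channel + N_density)   # one-shot: ~3.5e3 parameters communicated
qfedavg_cost = N_client * N_iter * N_channel        # iterative: ~1.5e6 parameters communicated
print(qfedavg_cost / qfedinf_cost)                  # qFedAvg sends roughly 440x more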

II.6 Connection to MoE and Ensemble Learning

In this section, we discuss the connection between the proposed framework qFedInf and mixture of experts (MoE) [47, 48, 49], an important strategy in ensemble learning. The idea of ensemble learning is to combine several models to make a joint prediction [50, 51]; different models are expected to compensate for each other, leading to better performance. MoE, as a special kind of ensemble learning method, consists of two parts: a gating function that decides the relative weights of the different models, and an ensemble of specialized models, called the experts, each expected to perform well only on a subset of the total input space. In practice, MoE, along with other ensemble approaches such as bagging and boosting, has given birth to many state-of-the-art solutions to a wide range of machine learning problems [50, 51, 47].

In hindsight, we can see that qFedInf also consists of two parts: a set of density estimators $\mathcal{D}_{i}$ that decides the probabilistic weights of the different local channels $\mathcal{M}_{i}$, as in Equation (11); and a set of local channels $\mathcal{M}_{i}$, each performing well only on its own client's data. In the language of MoE, the density estimators are the gating functions, and the local channels correspond to the experts. Therefore, qFedInf is exactly an MoE working in the one-shot federated learning context. In Section III, numerical experiments show that qFedInf can indeed combine the knowledge of many weak classifiers (shallow circuits only capable of binary classification) to achieve the capability of large models.

III Numerical Experiments

Figure 2: The quantum circuit used as the quantum classifier.
Top-1 Accuracy/%    Centralized    qFedInf(⋆)    qFedAvg(⋆)    qFedInf(∘)    qFedAvg(∘)
MNIST               91.2±0.6       92.4±0.3      86.2±1.0      92.7±0.2      88.4±0.8
Fashion-MNIST       77.2±0.5       74.0±0.3      61.4±1.4      75.4±0.3      66.7±1.3
Table 1: Performance of the centralized, qFedInf, and qFedAvg classifiers on the MNIST and Fashion-MNIST datasets under different federated settings (⋆/∘ for star/cycle-2 structure). The averages and standard deviations are obtained from 10 runs.
Figure 3: The test accuracy of qFedAvg on MNIST (left) and Fashion-MNIST (right) with different levels of non-IID in the cycle-m structure. The dashed lines represent the test accuracy of the centralized classifier (red) and of qFedInf (black) trained under the most heterogeneous setting (2 classes per client). The accuracy of qFedInf is roughly unaffected by the number of classes per client. The error bars and shaded regions mark the standard deviation over 10 runs.

III.1 Constructing Non-IID Datasets

We observe that the existing quantum federated learning algorithms can already achieve high accuracies on binary classification tasks with synthetic and common classical/quantum datasets [20, 17]. Therefore, to better illustrate the performance differences between qFedInf and qFedAvg, we devise a highly heterogeneous federated dataset based on 8 classes (“0” through “7”) from the MNIST handwritten digits dataset. To produce heterogeneity, we adopt the star structure and cycle-m structure settings from [36] as follows.

Star structure: Each client only has access to the data of two classes, with one of the classes fixed. That is, client $C_{i}$, $i=1,\cdots,7$, only has access to the data of digits "0" and "$i$".

Cycle-$m$ structure: Each client only has access to the data of $m$ classes in a cyclic way. Client $C_{i}$, $i=1,\cdots,7$, has access to the data of digits "$i$", …, "$i+m-1$ modulo 8".

Datasets of the same structures are also prepared for the Fashion-MNIST dataset, which is composed of images of daily objects and is regarded as a harder version of MNIST.
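The sketch below shows one plausible way to realize the two partitions above from integer label arrays; the exact client indexing in the released code may differ.

import numpy as np

def star_partition(labels, n_classes=8):
    # Star structure: client C_i (i = 1, ..., 7) gets all samples of classes 0 and i.
    return {i: np.where((labels == 0) | (labels == i))[0] for i in range(1, n_classes)}

def cycle_partition(labels, m, n_classes=8):
    # Cycle-m structure: each client gets the m classes i, i+1, ..., i+m-1 (modulo 8).
    parts = {}
    for i in range(n_classes):
        classes = [(i + k) % n_classes for k in range(m)]
        parts[i] = np.where(np.isin(labels, classes))[0]
    return parts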

III.2 Details of the Quantum Classifiers

We parameterize the quantum classifier as a quantum circuit of $l$ layers. Each layer contains a set of controlled-NOT gates on adjacent qubits, followed by a parameterized rotation on each qubit. The detailed circuit is shown in Figure 2.

To load the data into an 8-qubit quantum circuit, we interpolate and resize each image to a size of $16\times 16$ and use amplitude encoding to transform it into a quantum state [3, 52]:

$$\{x_{j}\}_{j=1}^{256}\rightarrow\ket{\psi}=\sum_{j=1}^{256}\frac{x_{j}}{\sqrt{\sum_{u=1}^{256}\|x_{u}\|^{2}}}\ket{j},$$ (12)

where $\{x_{j}\}$ are the pixel values of the image and $\{\ket{j}\}$ denote the computational basis states. In experiments, amplitude encoding can be implemented using quantum random access memory [53, 54] or universal gate decomposition [46, 55, 56].
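The simulation-side preprocessing of Equation (12) amounts to resizing and normalizing each image; a minimal JAX sketch is given below (preparing the resulting state on hardware would require the QRAM or gate-decomposition schemes cited above).

import jax.numpy as jnp
from jax.image import resize

def amplitude_encode(img_28x28):
    # Resize a 28x28 grayscale image to 16x16 and amplitude-encode the 256 pixels
    # into the amplitudes of an 8-qubit state, following Eq. (12).
    img = resize(jnp.asarray(img_28x28, dtype=jnp.float32), (16, 16), method="bilinear")
    x = img.reshape(-1)
    return x / jnp.linalg.norm(x)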

To perform inference, we measure the expectation values $\braket{Z_{k}}$ on each qubit, amplify the outcomes by a factor of 10, and feed them into the softmax function to predict the probability of each class. We use the Adam optimizer [57] with a learning rate of $10^{-2}$ and a batch size of 128 to minimize the standard cross entropy loss function. The gradients used in the optimization can be computed via the parameter shift rule [58, 17, 59].
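A minimal JAX sketch of this prediction head and loss is given below; the measured $\braket{Z_{k}}$ values are assumed to be given, and the circuit simulation and parameter shift rule are omitted.

import jax
import jax.numpy as jnp

def class_probabilities(z_expectations, scale=10.0):
    # Amplify the eight measured <Z_k> values by a factor of 10 and apply softmax.
    return jax.nn.softmax(scale * jnp.asarray(z_expectations))

def cross_entropy_loss(z_expectations, label):
    # Standard cross-entropy loss for an integer class label.
    return -jnp.log(class_probabilities(z_expectations)[label])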

All the numerical simulations are conducted with JAX [60] and TensorCircuit [61] on one NVIDIA Tesla V100 GPU. The source code is available at https://github.com/JasonZHM/quantum-fed-infer.

III.3 Performance of qFedAvg and qFedInf on Non-IID Data

Figure 4: The test loss and accuracy of the centralized classifier and qFedAvg on star structure MNIST (left) and Fashion-MNIST (right). The dashed line represents the test accuracy of qFedInf. The error bars and shaded regions mark the standard deviation over 10 runs.

We begin by demonstrating the non-IID quagmire of qFedAvg discussed in Section II.3 with numerical simulations. Specifically, we train and test qFedAvg on datasets of the cycle-$m$ structure. The parameter $m$ serves as a good control of the level of non-IID: as $m$ increases, each client has access to more classes, and the level of non-IID decreases.

We train the channel with $l=48$ and $m=2,3,4,5,6$ using qFedAvg for 5 epochs. The global synchronization frequency of qFedAvg is set to once per batch step. The resulting test accuracies on a test set of size 1024 are plotted in Figure 3. In line with the theoretical analysis, we find that the top-1 accuracy increases as the level of non-IID drops. When the data are highly heterogeneous ($m=2$), qFedAvg suffers a loss of $\sim 4\%$ ($10\%$) in accuracy on MNIST (Fashion-MNIST) compared to the benchmark trained on the centralized data with the same circuit structure. Nevertheless, when the data heterogeneity is mild ($m\geq 5$), qFedAvg achieves performance comparable to the centralized classifier. We also plot the test loss and accuracy curves on the star structure in Figure 4. As expected, qFedAvg converges much more slowly and to a significantly lower accuracy.

To test the performance of qFedInf on non-IID data, we train and test qFedInf in the most heterogeneous settings, namely the star structure and the cycle-2 structure, and compare it with qFedAvg and the centralized benchmark. In such settings, each client's local classifier only needs to perform a binary classification, and thus in practice we find that circuits with only a few layers suffice to achieve good performance. For comparison with the $l=48$ qFedAvg and the centralized benchmark, we choose $l=6$ for the local classifiers, so that the total number of variational parameters of the 8 clients remains the same. We train the local classifiers for 5 epochs and plot the loss and accuracy curves in Figure 5. We adopt Gaussian mixture models with 5 modes as the local density estimators. The combined global model achieves a top-1 accuracy similar to that of the centralized benchmark and significantly higher than that of qFedAvg in both settings on both datasets. The detailed accuracies are listed in Table 1. Meanwhile, the performance of qFedInf is roughly unaffected by the number of classes per client, demonstrating its robustness against the level of non-IID.
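A minimal sketch of such a Gaussian-mixture gating stage is shown below, assuming the estimators are fit on the clients' flattened (classical, pre-encoding) image vectors; the exact features and hyperparameters in the released code may differ.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_density_estimator(X_i, n_modes=5, seed=0):
    # Local density estimator D_i: a 5-mode Gaussian mixture fit on client C_i's own data.
    return GaussianMixture(n_components=n_modes, random_state=seed).fit(X_i)

def decomposition_weights(x, estimators, p_C):
    # Weights q_i of Eq. (11): exponentiate the GMM log-densities (shifted for stability),
    # multiply by the client weights p_{C_i}, and renormalize.
    log_d = np.array([est.score_samples(x[None, :])[0] for est in estimators])
    w = np.exp(log_d - log_d.max()) * np.asarray(p_C)
    return w / w.sum()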

We note that in the training process, qFedAvg requires a total of 500 communication rounds, while qFedInf only needs one. Moreover, the local classifiers used in qFedInf are much shallower compared to those of qFedAvg and the centralized benchmark. If we reduce the number of layers $l$ of the centralized classifier to 6, its test accuracy drops to only $\sim 75\%$. To rule out the possibility that the barren plateau phenomenon [62] causes this performance difference, we also test qFedInf with $l=48$, and the resulting performance is the same as for $l=6$, which suggests that the barren plateau issue is not significant in our settings. Therefore, putting the federated setting aside, qFedInf can indeed utilize the collective knowledge of many small models to achieve the capability of a large model, in line with MoE and ensemble learning as discussed in Section II.6.

Figure 5: The training loss and accuracy of the local classifiers $\mathcal{M}_{i}$ in qFedInf on star structure MNIST (left) and Fashion-MNIST (right). The average curves are highlighted.

IV Conclusions

In this work, we tackle the problem of non-IID data in quantum federated learning. We give a theoretical analysis of the non-IID quagmire in qFedAvg and support it with numerical experiments. We prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. It leads to a general framework qFedInf for quantum federated learning on non-IID data. It’s one-shot in terms of communication complexity and immune to attacks based on gradient inversion.

We conduct numerical experiments on multi-class classification tasks to demonstrate the proposed framework. We devise highly heterogeneous federated datasets based on MNIST and Fashion-MNIST. Experiments show that qFedInf achieves performance comparable to the centralized benchmark and outperforms qFedAvg with significantly fewer communication rounds.

The non-IID issue has been regarded as a major challenge and is under active research in the literature of classical federated learning. Future works may focus on a more thorough analysis of this issue in the quantum regime: develop more challenging quantum federated datasets to demonstrate the quantum non-IID quagmire and test the performance of different density estimators. On the other hand, as quantum channels are the most general form of quantum information processing, we expect that more quantum machine learning algorithms can be made federated through the proposed framework. Moreover, more quantum features such as quantum speed-up and universal blind quantum computation [63, 17] may also be incorporated into qFedInf in the future.

Acknowledgements.
We thank Weikang Li, Jingyi Zhang, Yuxuan Yan, Rebing Wu, and Yuchen Guo for their insightful discussions. We thank the anonymous reviewers for their constructive suggestions on the manuscript. We also acknowledge the Tsinghua Astrophysics High-Performance Computing platform for providing computational and data storage resources. This work is financially supported by Zhili College, Tsinghua University.

Appendix A Proof of Proposition 1

Proposition 1 is a quantum generalization of its classical counterpart, Proposition 3.1 in [22]. Below, we provide a detailed proof following the ideas introduced in [22]. Based on the definition of $\Delta$ and the update rules, Equations (2), (3) and (4), we have

$$\begin{split}&\Delta_{mT}=\|w^{(f)}_{mT}-w^{(c)}_{mT}\|\\ &=\|\sum_{i}p_{C_{i}}(w^{i}_{mT-1}-\eta\nabla_{w}\frac{1}{N_{C_{i}}}\sum_{(\ket{\psi},y)\in D_{i}}\sum_{k=1}^{n_{c}}\delta_{y,k}f_{k}(w^{i}_{mT-1},\ket{\psi}))-w^{(c)}_{mT-1}+\eta\nabla_{w}\frac{1}{N}\sum_{(\ket{\psi},y)\in D}\sum_{k=1}^{n_{c}}\delta_{y,k}f_{k}(w^{(c)}_{mT-1},\ket{\psi})\|\\ &=\|\sum_{i}p_{C_{i}}(w^{i}_{mT-1}-\eta\sum_{k=1}^{n_{c}}p^{(i)}(y=k)\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{i}_{mT-1},\ket{\psi}))-w^{(c)}_{mT-1}+\eta\sum_{k=1}^{n_{c}}p(y=k)\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{(c)}_{mT-1},\ket{\psi})\|.\end{split}$$ (13)

Now we apply the triangle inequality and the Lipschitz conditions. Together with the definition $a_{i}=1+\eta\sum_{k}p^{(i)}(y=k)\lambda_{k}$, we have

$$\begin{split}\Delta_{mT}&\leq\|\sum_{i}p_{C_{i}}w^{i}_{mT-1}-w^{(c)}_{mT-1}\|\\ &\quad+\eta\|\sum_{i}p_{C_{i}}\sum_{k}p^{(i)}(y=k)(\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{i}_{mT-1},\ket{\psi})-\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{(c)}_{mT-1},\ket{\psi}))\|\\ &\leq\sum_{i}p_{C_{i}}(1+\eta\sum_{k}p^{(i)}(y=k)\lambda_{k})\|w^{i}_{mT-1}-w^{(c)}_{mT-1}\|\\ &=\sum_{i}p_{C_{i}}a_{i}\|w^{i}_{mT-1}-w^{(c)}_{mT-1}\|.\end{split}$$ (14)

Then we continue going backwards in the time steps. With the triangle inequality and the definitions of $g(w)$ and $\mathrm{EMD}_{i}$, we have

$$\begin{split}&\|w^{i}_{mT-1}-w^{(c)}_{mT-1}\|\\ &\leq\|w^{i}_{mT-2}-w^{(c)}_{mT-2}\|+\eta\|\sum_{k}p^{(i)}(y=k)\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{i}_{mT-2},\ket{\psi})-\sum_{k}p(y=k)\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{(c)}_{mT-2},\ket{\psi})\|\\ &\leq\|w^{i}_{mT-2}-w^{(c)}_{mT-2}\|\\ &\quad+\eta\|\sum_{k}p^{(i)}(y=k)\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{i}_{mT-2},\ket{\psi})-\sum_{k}p^{(i)}(y=k)\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{(c)}_{mT-2},\ket{\psi})\|\\ &\quad+\eta\|\sum_{k}(p^{(i)}(y=k)-p(y=k))\nabla_{w}\mathbb{E}_{\ket{\psi}|y=k}f_{k}(w^{(c)}_{mT-2},\ket{\psi})\|\\ &\leq a_{i}\|w^{i}_{mT-2}-w^{(c)}_{mT-2}\|+\eta g(w^{(c)}_{mT-2})\mathrm{EMD}_{i}.\end{split}$$ (15)

By induction and the broadcast rule $w^{i}_{(m-1)T}=w^{(f)}_{(m-1)T}$, we arrive at

$$\begin{split}\|w^{i}_{mT-1}-w^{(c)}_{mT-1}\|&\leq a_{i}^{T-1}\|w^{i}_{(m-1)T}-w^{(c)}_{(m-1)T}\|+\eta\,\mathrm{EMD}_{i}\sum_{j=0}^{T-2}a_{i}^{j}g(w^{(c)}_{mT-2-j})\\ &=a_{i}^{T-1}\Delta_{(m-1)T}+\eta\,\mathrm{EMD}_{i}\sum_{j=0}^{T-2}a_{i}^{j}g(w^{(c)}_{mT-2-j}).\end{split}$$ (16)

Plug it into Equation (14), and we finally reach the desired result:

$$\begin{split}\Delta_{mT}&\leq\sum_{i}p_{C_{i}}a_{i}^{T}\Delta_{(m-1)T}\\ &\quad+\eta\sum_{i}p_{C_{i}}\mathrm{EMD}_{i}\sum_{j=1}^{T-1}a_{i}^{j}g(w^{(c)}_{mT-1-j}).\end{split}$$ (17)

Appendix B Proof of Theorem 1

With the definitions in Sections II.1 and II.4, for any pure input state $\sigma_{x}^{\psi}=\ket{\psi}\bra{\psi}$, the global channel $\mathcal{M}$ can be decomposed as

$$\begin{split}\mathcal{M}\left(\sigma_{x}^{\psi}\right)&=\mathcal{M}\left(\frac{P_{x}^{\psi}\rho_{x}P_{x}^{\psi}}{\text{Tr}(\rho_{x}P_{x}^{\psi})}\right)=\mathcal{M}\left(\frac{P_{x}^{\psi}\text{Tr}_{C}(\rho)P_{x}^{\psi}}{\text{Tr}(\rho_{x}P_{x}^{\psi})}\right)\\ &=\frac{1}{\text{Tr}(\rho_{x}P_{x}^{\psi})}\sum_{i}\mathcal{M}(P_{x}^{\psi}\text{Tr}_{C}(P_{C_{i}}\rho P_{C_{i}})P_{x}^{\psi})\\ &=\frac{1}{\text{Tr}(\rho_{x}P_{x}^{\psi})}\sum_{i}\mathcal{M}(\text{Tr}_{C}(P_{x}^{\psi}\rho_{x|C_{i}}P_{x}^{\psi}))\,\text{Tr}(\rho P_{C_{i}})\\ &=\frac{1}{\text{Tr}(\rho_{x}P_{x}^{\psi})}\sum_{i}\mathcal{M}_{i}(\sigma_{x})\,\text{Tr}(\rho_{x|C_{i}}P_{x}^{\psi})\,p_{C_{i}}\\ &=\sum_{i}\mathcal{M}_{i}(\sigma_{x})\frac{\mathcal{D}_{i}(\sigma_{x})p_{C_{i}}}{\sum_{j}\mathcal{D}_{j}(\sigma_{x})p_{C_{j}}},\end{split}$$ (18)

where the second line utilizes the fact that $\rho$ is diagonal in $\ket{C_{i}}$, i.e. $\rho=\sum_{i}P_{C_{i}}\rho P_{C_{i}}$, and the last equality follows from

$$\begin{split}\text{Tr}(\rho_{x}P_{x}^{\psi})&=\text{Tr}(\rho P_{x}^{\psi})=\sum_{j}\text{Tr}(P_{C_{j}}\rho P_{C_{j}}P_{x}^{\psi})\\ &=\sum_{j}\text{Tr}(\rho_{x|C_{j}}P_{x}^{\psi})\,\text{Tr}(\rho P_{C_{j}})=\sum_{j}\mathcal{D}_{j}(\sigma_{x})p_{C_{j}}.\end{split}$$ (19)

As for mixed states, we note that they can always be decomposed into a linear combination of pure states. By the linearity of quantum channels, the formula for $\mathcal{M}$ acting on mixed states then follows from direct linear superposition. This completes the proof of Theorem 1.

Appendix C A Proposal of Quantum Generative Learning with qFedInf

We mentioned in the main text that the proposed framework qFedInf may be applied to machine learning tasks beyond classification. Here we provide a specific example of performing quantum generative learning [10, 11, 12] with qFedInf. This only serves as a preliminary proposal and we leave the detailed study of its performance and implications to future works.

In quantum generative learning, we aim to learn a generative model that can reconstruct some target quantum state $\rho_{x}$. In a federated learning context, each client $C_{i}$ only has access to a small proportion of the total data, which statistically forms a quantum state $\rho_{x}^{C_{i}}$. Thus the whole target state can be written as $\rho_{x}=\sum_{i}p_{C_{i}}\rho_{x}^{C_{i}}$, where $p_{C_{i}}$ is the proportion of data accessible to client $C_{i}$. The notations here are the same as in Section II.1.

We take the quantum generative adversarial network (qGAN) [11] as our quantum channel $\mathcal{M}$ to perform the learning task. It's a quantum circuit that takes some fixed initial state, for example $\ket{0\cdots 0}$, as its input, and outputs a quantum state parameterized by the circuit parameters. Adversarial learning strategies are applied to train the circuit, and the output state after training is expected to approximate the target state. Here we omit the training details as our focus is on the federated learning aspect.

In the qFedInf framework, each client trains its own qGAN, denoted as the local channel $\mathcal{M}_{i}$, with its own data $\rho_{x}^{C_{i}}$. After training, we expect $\mathcal{M}_{i}(\ket{0\cdots 0}\bra{0\cdots 0})\approx\rho_{x}^{C_{i}}$. As for the density estimation part, we note that the input states are fixed to be $\ket{0\cdots 0}$, so the density estimators become trivial, i.e. $\mathcal{D}_{i}(\ket{0\cdots 0})=1$. Plugging these into Equation (11), we arrive at

$$\mathcal{M}(\ket{0\cdots 0}\bra{0\cdots 0})\approx\sum_{i}p_{C_{i}}\rho_{x}^{C_{i}}=\rho_{x},$$ (20)

which is exactly our goal. This is a concrete proposal of quantum federated generative learning which has not appeared in the literature so far.

Declarations

Competing interests The author declares no competing interests.

Funding This work is financially supported by Zhili College, Tsinghua University.

Availability of data and materials All the data and materials used in this work can be accessed at https://github.com/JasonZHM/quantum-fed-infer.

References

  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) http://www.deeplearningbook.org.
  • Nielsen and Chuang [2010] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, 2010).
  • Biamonte et al. [2017] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Quantum machine learning, Nature 549, 195 (2017).
  • Das Sarma et al. [2019] S. Das Sarma, D.-L. Deng, and L.-M. Duan, Machine learning meets quantum physics, Physics Today 72, 48 (2019).
  • Harrow et al. [2009] A. W. Harrow, A. Hassidim, and S. Lloyd, Quantum algorithm for linear systems of equations, Physical review letters 103, 150502 (2009).
  • Lloyd et al. [2014] S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum principal component analysis, Nature Physics 10, 631 (2014).
  • Schuld and Killoran [2019] M. Schuld and N. Killoran, Quantum machine learning in feature hilbert spaces, Physical review letters 122, 040504 (2019).
  • Havlíček et al. [2019] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, Supervised learning with quantum-enhanced feature spaces, Nature 567, 209 (2019).
  • Rebentrost et al. [2014] P. Rebentrost, M. Mohseni, and S. Lloyd, Quantum support vector machine for big data classification, Physical review letters 113, 130503 (2014).
  • Gao et al. [2018] X. Gao, Z.-Y. Zhang, and L.-M. Duan, A quantum machine learning algorithm based on generative models, Science advances 4, eaat9004 (2018).
  • Lloyd and Weedbrook [2018] S. Lloyd and C. Weedbrook, Quantum generative adversarial learning, Physical review letters 121, 040502 (2018).
  • Liu and Wang [2018] J.-G. Liu and L. Wang, Differentiable learning of quantum circuit born machines, Physical Review A 98, 062324 (2018).
  • Rieke et al. [2020] N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. N. Galtier, B. A. Landman, K. Maier-Hein, et al., The future of digital health with federated learning, NPJ digital medicine 3, 1 (2020).
  • Khraisat and Alazab [2021] A. Khraisat and A. Alazab, A critical review of intrusion detection systems in the internet of things: techniques, deployment strategy, validation strategy, attacks, public datasets and challenges, Cybersecurity 4, 1 (2021).
  • Konečnỳ et al. [2016] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, Federated learning: Strategies for improving communication efficiency, arXiv preprint arXiv:1610.05492  (2016).
  • McMahan et al. [2017] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, Communication-Efficient Learning of Deep Networks from Decentralized Data, in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 54, edited by A. Singh and J. Zhu (PMLR, 2017) pp. 1273–1282.
  • Li et al. [2021a] W. Li, S. Lu, and D.-L. Deng, Quantum federated learning through blind quantum computing, Science China Physics, Mechanics & Astronomy 64, 1 (2021a).
  • Xia and Li [2021] Q. Xia and Q. Li, Quantumfed: A federated learning framework for collaborative quantum training, in 2021 IEEE Global Communications Conference (GLOBECOM) (IEEE, 2021) pp. 1–6.
  • Chen and Yoo [2021] S. Y.-C. Chen and S. Yoo, Federated quantum machine learning, Entropy 23, 460 (2021).
  • Chehimi and Saad [2022] M. Chehimi and W. Saad, Quantum federated learning with quantum data, in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2022) pp. 8617–8621.
  • Yun et al. [2022] W. J. Yun, J. P. Kim, S. Jung, J. Park, M. Bennis, and J. Kim, Slimmable quantum federated learning, arXiv preprint arXiv:2207.10221  (2022).
  • Zhao et al. [2018] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, Federated learning with non-iid data, arXiv preprint arXiv:1806.00582  (2018).
  • Hsieh et al. [2020] K. Hsieh, A. Phanishayee, O. Mutlu, and P. Gibbons, The non-iid data quagmire of decentralized machine learning, in International Conference on Machine Learning (PMLR, 2020) pp. 4387–4398.
  • Zhu et al. [2019] L. Zhu, Z. Liu, and S. Han, Deep leakage from gradients, Advances in neural information processing systems 32 (2019).
  • Geiping et al. [2020] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, Inverting gradients-how easy is it to break privacy in federated learning?, Advances in Neural Information Processing Systems 33, 16937 (2020).
  • Guha et al. [2019] N. Guha, A. Talwalkar, and V. Smith, One-shot federated learning, arXiv preprint arXiv:1902.11175  (2019).
  • Salehkaleybar et al. [2021] S. Salehkaleybar, A. Sharif-Nassab, and S. J. Golestani, One-shot federated learning: Theoretical limits and algorithms to achieve them., J. Mach. Learn. Res. 22, 189 (2021).
  • Zhou et al. [2020] Y. Zhou, G. Pu, X. Ma, X. Li, and D. Wu, Distilled one-shot federated learning, arXiv preprint arXiv:2009.07999  (2020).
  • Kasturi et al. [2020] A. Kasturi, A. R. Ellore, and C. Hota, Fusion learning: A one shot federated learning, in International Conference on Computational Science (Springer, 2020) pp. 424–436.
  • J. M. Swart [2020] J. M. Swart, Introduction to quantum probability (2020), http://staff.utia.cas.cz/swart/lecture_notes/qua20_04_27.pdf, Last accessed on 2022-1-7.
  • González et al. [2021] F. A. González, V. Vargas-Calderón, and H. Vinck-Posada, Classification with quantum measurements, Journal of the Physical Society of Japan 90, 044002 (2021), https://doi.org/10.7566/JPSJ.90.044002.
  • Rubner et al. [2000] Y. Rubner, C. Tomasi, and L. J. Guibas, The earth mover’s distance as a metric for image retrieval, International journal of computer vision 40, 99 (2000).
  • Note [1] In a typical classification problem, $k$ denotes the different classes and the standard cross entropy loss is given by $f_{k}(w,\ket{\psi})=\log r_{k}(w,\ket{\psi})$, where $r_{k}$ is the predicted probability of $\ket{\psi}$ belonging to class $k$.
  • Note [2] The fact that identical samples have identical labels guarantees that the global channel $\mathcal{M}$ is well defined. For example, suppose we have two identical samples $\ket{\psi^{C_{i}}}=\ket{\psi^{C_{j}}}=\ket{\psi}$ with the same label $y$, but from different clients $C_{i}\neq C_{j}$. Then, in order to fulfill the local minimization problems, Equation (6), we must have $\mathcal{M}_{i}(\ket{\psi}\bra{\psi})=\mathcal{M}_{j}(\ket{\psi}\bra{\psi})$.
  • Note [3] To avoid confusion, we insist on using different notations for $P_{x}^{\psi}=\sigma_{x}^{\psi}=\ket{\psi}\bra{\psi}$ to emphasize their differences in physical meaning: we use $\sigma_{x}^{\psi}$ to denote the quantum state that can be loaded into the circuit, while we use $P_{x}^{\psi}$ to denote the projection operator.
  • Liu et al. [2022] J. Liu, Y. Tang, H. Zhao, X. Wang, F. Li, and J. Zhang, Cps attack detection under limited local information in cyber security: A multi-node multi-class classification ensemble approach (2022), arXiv:2209.00170 [cs.CR] .
  • Bishop [2006] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag, Berlin, Heidelberg, 2006).
  • Li and Deng [2021] W. Li and D.-L. Deng, Recent advances for quantum classifiers, Science China Physics, Mechanics & Astronomy 65, 10.1007/s11433-021-1793-6 (2021).
  • Rezende and Mohamed [2015] D. Rezende and S. Mohamed, Variational inference with normalizing flows, in International conference on machine learning (PMLR, 2015) pp. 1530–1538.
  • González et al. [2022] F. A. González, A. Gallego, S. Toledo-Cortés, and V. Vargas-Calderón, Learning with density matrices and random features, Quantum Machine Intelligence 4, 1 (2022).
  • Huang et al. [2020] H.-Y. Huang, R. Kueng, and J. Preskill, Predicting many properties of a quantum system from very few measurements, Nature Physics 16, 1050 (2020).
  • LaRose et al. [2019] R. LaRose, A. Tikku, É. O’Neel-Judy, L. Cincio, and P. J. Coles, Variational quantum state diagonalization, npj Quantum Information 5, 1 (2019).
  • Xin et al. [2021] T. Xin, L. Che, C. Xi, A. Singh, X. Nie, J. Li, Y. Dong, and D. Lu, Experimental quantum principal component analysis via parametrized quantum circuits, Physical Review Letters 126, 110502 (2021).
  • Huang et al. [2022] H.-Y. Huang, R. Kueng, G. Torlai, V. V. Albert, and J. Preskill, Provably efficient machine learning for quantum many-body problems, Science 377, eabk3333 (2022).
  • Li et al. [2021b] H.-S. Li, P. Fan, H. Peng, S. Song, and G.-L. Long, Multilevel 2-d quantum wavelet transforms, IEEE Transactions on Cybernetics  (2021b).
  • Barenco et al. [1995] A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo, N. Margolus, P. Shor, T. Sleator, J. A. Smolin, and H. Weinfurter, Elementary gates for quantum computation, Physical review A 52, 3457 (1995).
  • Masoudnia and Ebrahimpour [2014] S. Masoudnia and R. Ebrahimpour, Mixture of experts: a literature survey, Artificial Intelligence Review 42, 275 (2014).
  • Jordan and Jacobs [1994] M. I. Jordan and R. A. Jacobs, Hierarchical mixtures of experts and the em algorithm, Neural computation 6, 181 (1994).
  • Jacobs et al. [1991] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, Adaptive mixtures of local experts, Neural computation 3, 79 (1991).
  • Zhou [2021] Z.-H. Zhou, Machine learning (Springer Nature, 2021).
  • Sagi and Rokach [2018] O. Sagi and L. Rokach, Ensemble learning: A survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249 (2018).
  • Li et al. [2014] H.-S. Li, Q. Zhu, M.-C. Li, H. Ian, et al., Multidimensional color image storage, retrieval, and compression based on quantum amplitudes and phases, Information Sciences 273, 212 (2014).
  • Giovannetti et al. [2008a] V. Giovannetti, S. Lloyd, and L. Maccone, Quantum random access memory, Physical review letters 100, 160501 (2008a).
  • Giovannetti et al. [2008b] V. Giovannetti, S. Lloyd, and L. Maccone, Architectures for a quantum random access memory, Physical Review A 78, 052310 (2008b).
  • Long and Sun [2001] G.-L. Long and Y. Sun, Efficient scheme for initializing a quantum register with an arbitrary superposed state, Physical Review A 64, 014303 (2001).
  • Plesch and Brukner [2011] M. Plesch and Č. Brukner, Quantum-state preparation with universal gate decompositions, Physical Review A 83, 032302 (2011).
  • Kingma and Ba [2017] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization (2017), arXiv:1412.6980 [cs.LG] .
  • Mitarai et al. [2018] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, Quantum circuit learning, Physical Review A 98, 032309 (2018).
  • Li et al. [2017] J. Li, X. Yang, X. Peng, and C.-P. Sun, Hybrid quantum-classical approach to quantum optimal control, Physical review letters 118, 150503 (2017).
  • Bradbury et al. [2018] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, JAX: composable transformations of Python+NumPy programs (2018).
  • Zhang et al. [2022] S.-X. Zhang, J. Allcock, Z.-Q. Wan, S. Liu, J. Sun, H. Yu, X.-H. Yang, J. Qiu, Z. Ye, Y.-Q. Chen, et al., Tensorcircuit: a quantum software framework for the nisq era, arXiv preprint arXiv:2205.10091  (2022).
  • McClean et al. [2018] J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, and H. Neven, Barren plateaus in quantum neural network training landscapes, Nature communications 9, 1 (2018).
  • Broadbent et al. [2009] A. Broadbent, J. Fitzsimons, and E. Kashefi, Universal blind quantum computation, in 2009 50th Annual IEEE Symposium on Foundations of Computer Science (IEEE, 2009) pp. 517–526.