
Feature-based Federated Transfer Learning: Communication Efficiency, Robustness and Privacy

Feng Wang, M. Cenk Gursoy and Senem Velipasalar. The authors are with the Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, NY, 13244. E-mail: [email protected], [email protected], [email protected]. The material in this paper has been presented in part at the IEEE Global Communications Conference (Globecom), Dec. 2022.
Abstract

In this paper, we propose feature-based federated transfer learning as a novel approach to improve communication efficiency by reducing the uplink payload by multiple orders of magnitude compared to that of existing approaches in federated learning and federated transfer learning. Specifically, in the proposed feature-based federated transfer learning, the extracted features and outputs are uploaded instead of parameter updates. For this distributed learning model, we determine the required payload and provide comparisons with existing schemes. Subsequently, we analyze the robustness of feature-based federated transfer learning against packet loss, data insufficiency, and quantization. Finally, we address privacy considerations by defining and analyzing label privacy leakage and feature privacy leakage, and by investigating mitigating approaches. For all of the aforementioned analyses, we evaluate the performance of the proposed learning scheme via experiments on an image classification task and a natural language processing task to demonstrate its effectiveness. Code implementation is available at: https://github.com/wfwf10/Feature-based-Federated-Transfer-Learning.

Index Terms:
Federated learning, transfer learning, communication efficiency, robustness, privacy

I Introduction

Federated learning (FL) is a form of distributed learning in which only model parameters are exchanged while datasets are kept local, with the goal of maintaining data privacy [1]. Typically, in FL, a central server, known as the parameter server (PS), coordinates the collaborative training of a deep neural network (DNN) model [2] by aggregating updates on the weights and biases from multiple participating devices/clients. The widespread use of mobile phones and tablets with sufficient computational power and wireless communication capability enables FL in a wide range of applications such as speech recognition and image classification. The Internet of Things (IoT), with its large number of devices/sensors, may generate even larger amounts of data while imposing further constraints on computational power and transmission power [3]. Consequently, FL has gained widespread attention from both academic and industrial communities, resulting in a rapid increase in research on FL techniques, such as federated averaging (FedAvg) [4], federated transfer learning (FTL) [5, 6], and FL with differential privacy (DP) [7]. Among them, model-based FTL stands out as a particularly efficient scheme, because it transfers a well-trained source model to the target task of interest, where samples may have different input and output spaces; thus, FTL requires fewer training samples and shortens the training process [8, 9, 10, 11]. In particular, with the availability of open-source big data repositories and deep learning models [12, 13], transfer learning (TL) has become an attractive solution for various applications, such as text sentiment classification [14, 15, 16], image classification [17, 18, 19, 20, 21, 22], human activity classification [23, 24, 25], software defect classification [26, 27], and multi-language text classification [28, 29, 30].

However, when the clients in FL are mobile devices and the training process is performed over a wireless network such as a wireless local area network (WLAN) or cellular system, it is crucial to minimize the amount of data sent during updates of the model parameters. This is due to the limited availability of radio spectrum and the uncertain nature of the wireless environment caused by factors such as channel fading and interference [31]. In the context of FL applications on IoT devices, the added constraints on computation and on power consumption for uploading data make it even more challenging to perform complex tasks that require deep neural networks (DNNs) with many layers and a large number of trainable parameters. While many existing FL studies have focused on shallow DNNs with a few layers, state-of-the-art DNN models used in various applications often have dozens of layers and millions or billions of trainable parameters in order to achieve the highest accuracy in, e.g., image segmentation [32, 33], image classification [34, 35], object detection [36, 37], question answering [38], medical image segmentation [39], and speech recognition [40].

I-A Contributions

In response to the above-mentioned constraints on the uplink payload budget and limitations on the local computational power, we develop a novel scheme to perform model-based federated transfer learning and refer to this scheme as feature-based federated transfer learning (FbFTL). In FbFTL, rather than uploading the gradients, the input and output of the DNN sub-model to be trained are uploaded, with the goal of reducing the uplink payload. This is one of the key differences from prior FL and FTL strategies. In this paper, after describing the proposed FbFTL scheme, we analyze its overall uplink payload and provide comparisons with prior schemes to quantify the gains. Specifically, we test this approach by transferring the VGG-16 convolutional neural network (CNN) model [41] trained with the ImageNet dataset [42] to the CIFAR-10 dataset [43], and show that FL, two types of FTL, and FbFTL require uploading 3216 Tb, 949.5 Tb, 599 Tb, and 6.6 Gb of data, respectively, until performance convergence. With this, we demonstrate that the proposed FbFTL substantially outperforms FL and FTL by reducing the upload requirements by at least 5 orders of magnitude. We note that the ITU standard for the 5G uplink user experienced data rate is only 50 Mb/s [44], and even a 6G uplink user experienced data rate at the Gbit/s level may not sufficiently support such huge uploading requirements of regular FL. Additionally, to the best of our knowledge, existing works on improving FL efficiency (such as FedAvg [4], sparsification [45, 46] and quantization [47, 48] with or without error feedback [49, 50], federated distillation [51, 52, 53], pruning [54, 55] or partially trainable networks [56, 57], and over-the-air computation [58, 59, 60]) still consider the transmission of gradient updates, achieve a relatively limited reduction in payload, and experience degradation in performance. For instance, the payload reduction on the same CIFAR-10 dataset is only about two orders of magnitude [61, 62]. Therefore, FbFTL remains a more effective approach to reducing the uplink payload even after the efficiency of FL has been improved via the aforementioned methods. We further show that FbFTL has a substantially lower downlink payload and requires significantly less computational power from the edge devices, and therefore it is much better suited to facilitating training tasks on clients with a limited power budget.

To validate our approach, we further compare the robustness of FbFTL with that of FL and FTL. We show a significant reduction in packet loss rate (PLR) with FbFTL for the same block loss rate (BLR), and demonstrate the robustness of FbFTL under insufficient training data. While there have been numerous studies on gradient quantization and sparsification to reduce the uplink payload [63, 64], we illustrate that FbFTL also achieves a significant compression rate via quantization while the validation accuracy is barely affected. Furthermore, we analyze the privacy leakage to a potential adversary. We define the label privacy leakage and feature privacy leakage for both feature-based and gradient-based frameworks. To the best of our knowledge, this is the first work that divides privacy into different categories, defines each type, and proposes mitigation approaches. Via experimental results, we show that FbFTL achieves better performance (e.g., classification accuracy) with the same level of privacy protection when each client holds a small set of samples.

In summary, our main contributions can be listed as follows:

  • We propose the FbFTL scheme that uploads extracted features instead of gradients to reduce uplink payload by five orders of magnitude.

  • We analyze the robustness of FbFTL, and demonstrate that it maintains the payload advantage when strategies such as sparsification, quantization and error feedback are deployed.

  • We study the privacy guarantees in terms of entropy based formulations that quantify the uncertainty and also by utilizing differential privacy mechanisms.

  • We show that a small batch size is preferred for protecting the label privacy of shuffled batches, and that FbFTL achieves better performance (e.g., in terms of accuracy) under a given differential privacy constraint on input privacy with small batches.

I-B Organization and Notations

Table I: Notations

FL Parameters: Section II-A
\mathcal{U}, U: Set of clients, number of clients
\mathcal{K}_u: Set of local training samples at client u
K_u: Number of local training samples at client u
\textbf{s}_{u,k}: The kth sample at client u
\textbf{x}_{u,k}: Input vector of \textbf{s}_{u,k} with N_0 attributes
\textbf{y}_{u,k}: Output vector of \textbf{s}_{u,k} with N attributes (or classes)
f_{\boldsymbol{\theta}}(\textbf{x}): DNN that maps the input \textbf{x} to the estimated output \hat{\textbf{y}}
\boldsymbol{\theta}: Trainable parameter vector of the DNN
L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k}): Loss function
\textbf{g}_u: Sum of updates (or gradients) at client u
\alpha: Learning rate
I: Number of communication iterations during the training process (superscript indicates method)
C: Fraction of clients selected in each iteration

FTL Parameters: Sections II-B, III-A, III-B
M': Number of layers without trainable parameters
M: Number of layers with trainable parameters
m_c: Index of the layer partitioning the DNN into feature extraction and task-specific parts
f^1_{\boldsymbol{\theta}^1}: Feature extraction sub-model with trainable parameters \boldsymbol{\theta}^1
f^2_{\boldsymbol{\theta}^2}: Task-specific sub-model with trainable parameters \boldsymbol{\theta}^2
\textbf{z}_{u,k}: Extracted features f^1_{\boldsymbol{\theta}^1}(\textbf{x}_{u,k})
T_m: Number of trainable parameters in the mth trainable layer
N^-_m: Number of input nodes at the mth trainable layer
N^+_m: Number of output nodes at the mth trainable layer

Performance Measurements: Sections II-A, III-B, III-C
d: Payload of each float number in bits
P: Uplink payload (superscript indicates method)
D: Downlink payload (superscript indicates method)
O(X): Computation time complexity for training

Robustness Measurements: Section IV
PLR: Packet loss rate
BLR: Block loss rate
n_b: Number of blocks in each packet
n_r: Maximum number of packet retransmissions allowed
\mathcal{S}_r(\cdot): Sparsification function keeping a ratio r of the input elements
\mathcal{Q}_q(\cdot): Quantization function keeping q bits of each input element
\textbf{m}_{u,t}: Local memory vector for error feedback

Privacy Measurements: Section V
H(\cdot): Entropy quantifying the uncertainty
N: Output dimension (or number of labels)
\mathcal{B}: Set of possible batches
|\mathcal{B}|: Number of possible batches
B: Unknown batch type of client u as a random variable
P_y: Distribution over labels
P_{y,\text{uni}}: Uniform label distribution
n^y_{a,b}: Count of output label type a in a batch of type b
\textbf{P}_{B|\text{uni}}: Adversary's presumed probability distribution of batches without prior knowledge
H(B | P_{y,\text{uni}}): Adversary's uncertainty about a batch without prior knowledge of its content
P_{y,\text{true}}: True label distribution
\textbf{P}_{B|\text{true}}: Adversary's presumed probability distribution of batches given the true sample distribution
H(B | P_{y,\text{true}}): Adversary's uncertainty about a batch given the true sample distribution P_{y,\text{true}}
L^l_s: Label privacy leakage with statistical information
c(b): Count of batches of type b among shuffled batches
\mathcal{C}_{\mathcal{U}}: Numbers of each batch type b in shuffled batches, as a set of random variables
\textbf{P}_{B,\text{emp}}: Adversary's presumed empirical distribution of batches given shuffled batches \mathcal{C}_{\mathcal{U}} = c_U = [c(1), c(2), \ldots, c(|\mathcal{B}|)]
H(B | \mathcal{C}_{\mathcal{U}} = c_U): Adversary's uncertainty about a batch given shuffled batches \mathcal{C}_{\mathcal{U}}
L^l_q: Label privacy leakage with query
L^l_t: Total label privacy leakage
\textbf{n}: Additive independent Gaussian noise to protect local instance privacy, where \textbf{n} \sim \mathcal{N}(0, \sigma^2 R^2 \text{std}^2(g(\textbf{d}_1)) \textbf{I})
\textbf{d}: Training samples for one batch
g(\textbf{d}): Uploaded vector for one batch
(\epsilon, \delta): Differential privacy level
R: Maximum relative distance

Additionally, subscript f (\star_f) denotes transfer learning updating the full model, and subscript c (\star_c) denotes transfer learning updating the task-specific sub-model.

The remainder of the paper is organized as follows. In Section II, we discuss the preliminary concepts regarding FL and FTL, describe the system design via FedAvg, and characterize their uplink payloads. In Section III, we introduce the proposed FbFTL algorithm, describe its learning structure and system design, and compare its payload with those of FL and FTL. In this section, we also evaluate the performance of FbFTL via simulations and provide comparisons with FL and FTL. In Section IV, we illustrate the robustness of FbFTL in the presence of packet loss, data insufficiency, and quantization. In Section V, we define the label privacy leakage and the feature privacy leakage. For label privacy, we discuss the leakage with statistical information and the leakage with query, and investigate mitigation approaches to avoid these privacy leakages. For feature privacy leakage, we illustrate the mitigation approach via differential privacy. Finally, we conclude the paper in Section VI. The list of all notations used in the paper is provided in Table I as a quick reference.

II Preliminaries

In this section, we introduce preliminary concepts related to FL and FTL, and analyze their required successful uplink payloads.

II-A Federated Learning

We address a common FL task in which a PS coordinates a set \mathcal{U} of U clients to cooperatively train a DNN model. Each client u possesses a local dataset \mathcal{K}_u with K_u samples for training. In this dataset, the kth sample \textbf{s}_{u,k} \in \mathcal{K}_u is comprised of an input vector \textbf{x}_{u,k} \in \mathbb{R}^{N_0} and an output vector \textbf{y}_{u,k} \in \mathbb{R}^N. The DNN, represented by the function f_{\boldsymbol{\theta}}(\textbf{x}), maps the input \textbf{x} to an output \hat{\textbf{y}} (as an estimate of \textbf{y}) with trainable parameter vector \boldsymbol{\theta}. The goal of the FL training process is to update the parameter vector \boldsymbol{\theta} using the training samples from each client in order to minimize the expected loss \mathbb{E}(L(f_{\boldsymbol{\theta}}|\textbf{s}_{\text{test}})) on an unseen sample \textbf{s}_{\text{test}} drawn from the same distribution as the training samples. For most DNNs performing classification tasks, the output label \textbf{y} is an axis-aligned unit vector with one element equal to 1, indicating the class of the sample, and all others equal to 0. In this case, the loss function is typically the categorical cross-entropy:

L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k}) = -\sum_{n=1}^{N} \boldsymbol{y}_{u,k}[n] \log f_{\boldsymbol{\theta}}(\boldsymbol{x}_{u,k})[n]. (1)

In order to minimize the loss and keep the training samples local, the authors in [4] proposed an iterative distributed optimization scheme called FedAvg with the following steps:

  1. The PS initializes the trainable parameters \boldsymbol{\theta}, and broadcasts the model structure f with non-trainable parameters to each client.

  2. The PS chooses a subset of UC clients (with C denoting the fraction of clients in this iteration) and broadcasts the trainable parameters \boldsymbol{\theta}.

  3. Each client performs stochastic gradient descent (SGD) to obtain the parameter update corresponding to each training sample: \nabla_{\boldsymbol{\theta}} L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k}).

  4. Each client sends the sum of the updates over all local samples, \textbf{g}_u = \sum_{k=1}^{K_u} \nabla_{\boldsymbol{\theta}} L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k}), together with K_u, to the PS.

  5. The PS updates the parameters as \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \sum_{u=1}^{U} \frac{\alpha}{K_u} \textbf{g}_u, where \alpha is the learning rate that controls the training speed.

  6. Return to step 2) until convergence.
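
To make the iteration above concrete, the following is a minimal PyTorch-style sketch of steps 2) through 5). It is an illustration rather than a reference implementation: the client data structure, the client-sampling routine, and the use of a summed cross-entropy loss are our own assumptions.

import copy, random
import torch

def fedavg_round(global_model, clients, C, alpha):
    """One FedAvg iteration: sample a fraction C of clients, collect the
    summed per-sample gradients g_u, and apply the weighted update."""
    selected = random.sample(clients, max(1, int(C * len(clients))))
    loss_fn = torch.nn.CrossEntropyLoss(reduction="sum")
    # Accumulate sum_u (alpha / K_u) * g_u over the selected clients.
    update = [torch.zeros_like(p) for p in global_model.parameters()]
    for client in selected:                  # each client holds tensors (x_u, y_u)
        local = copy.deepcopy(global_model)  # step 2): broadcast of theta
        local.zero_grad()
        x_u, y_u = client["x"], client["y"]  # the K_u local samples
        loss_fn(local(x_u), y_u).backward()  # steps 3)-4): g_u = sum of per-sample grads
        K_u = len(y_u)
        for buf, p in zip(update, local.parameters()):
            buf += (alpha / K_u) * p.grad    # per-client weighting as in step 5)
    with torch.no_grad():                    # step 5): theta <- theta - update
        for p, buf in zip(global_model.parameters(), update):
            p -= buf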

Let us assume that the number of iterations in FL with FedAvg is I^{FL} (in the absence of any processing and transmission failures). This implies that the clients upload \textbf{g}_u a total of I^{FL}UC times within the training process. Assume further that the DNN includes M layers with trainable parameters (such as convolutional layers and fully connected, i.e., dense, layers) and M' layers without trainable parameters (such as pooling layers and residual connections). The mth trainable layer has T_m trainable parameters. Thus, for each trainable parameter, the client uploads one float number of d bits for the corresponding element in the update \textbf{g}_u during each iteration. Therefore, the overall uplink payload to train a DNN via FL with FedAvg is

P^{FL} = d I^{FL} UC \sum_{m=1}^{M} T_m \ \text{bits}, (2)

and we will set this as a benchmark to compare with three other training methods in the remainder of this paper.

II-B Federated Transfer Learning

TL is an effective learning technique that utilizes knowledge from a different domain to enhance performance in the target domain. FTL incorporates the privacy-preserving distributed learning paradigm into conventional TL in order to address the challenges of spectrum limitations in wireless applications and training data scarcity. In this paper, we consider FTL performed in the same fashion as in the previous subsection, where one PS orchestrates U clients. Based on the difference between the domains, FTL can be divided into three categories: instance-based FTL, feature-based FTL, and model-based FTL [6, 65]. The former two types of FTL assume similarity in the distributions of the input and output, and hence in this paper we focus on model-based FTL, which only assumes similarity in the functionality of extracting a high-dimensional description from the input data.

Refer to caption
Figure 1: Diagram of the iterative training process of model-based federated transfer learning.

As shown in Fig. 1, we consider FTL with a DNN model pre-trained on an open-source dataset that corresponds to a similar, but not identical, task as the source model. The goal here is to reduce the number of iterations during training. Many DNNs can be partitioned into two sections. The first section generates high-dimensional features carrying general information from the sample input data. The second section carries out operations for a particular task, such as classification, and is less likely to be transferable to a new task. To transfer the knowledge of the source model to a new task before training, the feature extraction section of the source model is directly copied to the new target model, while the task-specific part is randomly initialized. FTL can be performed by retraining all the parameters with new data. However, there typically exist a large number of parameters to train, and in order to reduce the training time and data requirements, a common approach is to perform FTL by fixing the feature extraction part and only updating the task-specific part. In particular, we select one fully connected layer m_c (which is Layer 4 in Fig. 1), close to the output and without a paralleling path (such as a residual connection), and randomly initialize all trainable parameters of layers m_c, m_c+1, \ldots, M (which are Layer 4 and Layer 5 in Fig. 1). During the distributed training process, all copied parameters are fixed, and we only update the parameters of layers m_c, \ldots, M.

Similar to the previous subsection on FL, we determine the minimum successful uplink payload required for FTL until convergence. For FTL that updates all parameters, assuming that FTL with FedAvg is iterated I^{FTL}_f times, the overall uplink payload required for training is

P^{FTL}_f = d I^{FTL}_f UC \sum_{m=1}^{M} T_m \ \text{bits}. (3)

On the other hand, for FTL with FedAvg that only updates the task-specific part over I^{FTL}_c iterations, each sum update \textbf{g}_u consists of \sum_{m=m_c}^{M} T_m trainable parameters, and the overall uplink payload for training is

P^{FTL}_c = d I^{FTL}_c UC \sum_{m=m_c}^{M} T_m \ \text{bits}. (4)

Typically, training from scratch requires more training samples, and thus we have I^{FTL}_c UC \approx I^{FTL}_f UC < I^{FL} UC. For most models, the majority of the DNN structure is dedicated to feature extraction, and obviously \sum_{m=m_c}^{M} T_m < \sum_{m=1}^{M} T_m. Therefore,

P^{FTL}_c < P^{FTL}_f < P^{FL}. (5)

II-C Other Recent Transfer and Federated Learning Schemes

Method | Payload | Computation | Key Assumption
FL | iterative, gradient-level | medium, distributed | starts with random initialization
FTL | iterative, gradient-level | low, distributed | starts with source model
FbFTL (ours) | one-time, feature-level | low, mostly centralized | starts with source model
EWC [66] | iterative, gradient-level | high, distributed | evaluates many different tasks
SFL [67] | iterative, gradient-level | medium, hybrid of distributed and centralized | two servers needed
PFL [68] | iterative, gradient-level | very high, distributed | large and heterogeneous local batches
Table II: Comparison between different federated learning methods

While in this paper we focus on FL schemes that generalize to the majority of use cases, there are also recent works that improve performance under certain assumptions and can be used in FL. We list these well-known methods, along with FL, FTL, and the proposed FbFTL, in Table II and provide a comparison. Below, we briefly describe the key assumptions of these methods and highlight the differences from the proposed FbFTL scheme.

Specifically, elastic weight consolidation (EWC) [66] addresses a large number of different tasks simultaneously, and the transfer between them, by remembering the importance of each weight; the performance is typically measured over all experienced tasks. EWC is inspired by neuroscience and Bayesian inference, and it aims to quantify the importance of each weight to the tasks the model has previously learned. The fundamental idea is that the weights critical to prior tasks should be altered less when new data is encountered. In this paper, we consider a different problem, where we transfer to a single target task from a single source model on another task whose performance is no longer of interest. In this problem setting, there is no clear difference between EWC and plain SGD in [66] in terms of either training performance or training time, but EWC does introduce extra communication payload and computation cost in FL. In contrast, FbFTL achieves a five-orders-of-magnitude payload reduction.

Split federated learning (SFL) [67] partitions the neural network into two segments: the initial part near the input undergoes training in a federated learning manner using a “Fed Server”, and the subsequent part near the output is trained using split learning with a “Main Server”, which receives the smashed data. Compared to FbFTL, the federated learning segment utilizing the Fed Server operates in the same way as FL and requires significant uplink and downlink payload. Hence, this part alone is typically much more communication-expensive than FbFTL. The advantage of SFL lies in its performance being similar to that of FL, and in not needing a pre-trained model since it updates all parameters. We present the experimental performance comparison in Table III in Section III-D.

Personalized federated learning (PFL) [68] considers clients with a large number of heterogeneous local samples, enabling generalization to the heterogeneous local distributions. However, in this paper we consider a large number of clients, each with a small set of local samples. As explained in Section V-A and Section V-C, such a setting better protects label privacy. Therefore, personalized federated learning addresses a different problem from ours. We cannot apply this approach to FbFTL because personalized FbFTL would violate the shuffled-batch assumption, unless we train the global model by FbFTL and then retrain locally. Although PFL has better local performance than FL with large heterogeneous batches, it requires several independent local batches and significantly more local computation. Furthermore, it does not allow quantization and sparsification, since the weights instead of the gradients are uploaded. Therefore, it is comparatively even less communication-efficient when quantization and sparsification are deployed in the other methods, as shown in Section IV-C and Table V.

III Feature-based Federated Transfer Learning

In this section, we describe the proposed FbFTL framework and demonstrate that it requires a substantially smaller uplink payload compared to FL and FTL. This efficiency arises from the fact that the extracted features and outputs are uploaded rather than the parameter updates.

III-A Learning Structure

Refer to caption
Figure 2: Diagram of the training process of feature-based federated transfer learning, which reduces to one-time communication of the output and the intermediate output (i.e., extracted features).

As depicted in Fig. 2, in FbFTL we consider model-based TL on a source DNN model f' pre-trained on a different task. We choose one fully connected layer m_c without a paralleling path as the cut layer, and divide the new target model f \leftarrow f' into two parts. The layers before layer m_c (which is Layer 4 in Fig. 2, i.e., m_c = 4) are regarded as the feature extraction sub-model f^1_{\boldsymbol{\theta}^1}, and its parameters \boldsymbol{\theta}^1 in the new target model f are copied from the pre-trained source model f' and fixed without any further update. The remaining layers are regarded as the task-specific sub-model f^2_{\boldsymbol{\theta}^2}, and all trainable parameters \boldsymbol{\theta}^2 in the new target model f are randomly initialized for training with the uth client's kth training sample \textbf{s}_{u,k} = \{\textbf{x}_{u,k}, \textbf{y}_{u,k}\} for all u and k.

The forward pass of the full model f_{\boldsymbol{\theta}}(\textbf{x}_{u,k}) = f^2_{\boldsymbol{\theta}^2}(f^1_{\boldsymbol{\theta}^1}(\textbf{x}_{u,k})) maps the input \textbf{x}_{u,k} to the estimated output \hat{\textbf{y}}_{u,k}, and we note that the output of the feature extraction sub-model, \textbf{z}_{u,k} = f^1_{\boldsymbol{\theta}^1}(\textbf{x}_{u,k}), is also the input of the task-specific sub-model f^2_{\boldsymbol{\theta}^2}. Note that only the task-specific sub-model needs to be trained. In this setting, each client generates the features \textbf{z}_{u,k} from the input \textbf{x}_{u,k} and sends these features to the PS only once. Subsequently, in FbFTL, the PS performs the gradient back-propagation iteratively without sending any feedback to the clients. In contrast, in FL and FTL, the parameter update, which is based on the gradients, depends heavily on the parameters in the current training iteration, and the same data sample may generate different parameter updates in different iterations within the training process. Therefore, in FL and FTL, we either require many more training samples, or need to upload updates multiple times for the same sample. In FbFTL, however, each client only needs to upload the intermediate output \textbf{z}_{u,k} and the output \textbf{y}_{u,k} once, instead of iteratively uploading the gradients. The PS may treat these pairs as the input and output of training samples for the task-specific sub-model f^2_{\boldsymbol{\theta}^2}. Such sample pairs are not correlated with the model parameters \boldsymbol{\theta}^2, and therefore they can be reused in different training iterations without downloading or uploading anything again. We provide the steps of the FbFTL algorithm in Algorithm 1 below.

Algorithm 1 FbFTL
PS copies the fixed sub-model f^1_{\boldsymbol{\theta}^1} from the pre-trained source model and randomly initializes the trainable sub-model f^2_{\boldsymbol{\theta}^2}.
PS broadcasts f^1_{\boldsymbol{\theta}^1} to each client u \in \mathcal{U}.
Clients execute:
for client u \in \mathcal{U} do
     for sample \textbf{s}_{u,k} \in \mathcal{K}_u do
          Upload \textbf{z}_{u,k} = f^1_{\boldsymbol{\theta}^1}(\textbf{x}_{u,k}) and \textbf{y}_{u,k} to PS
     end for
end for
PS executes:
while not converged do
     for mini-batch b of pairs \{u,k\} \in \mathcal{U} \times \mathcal{K}_u do
          \boldsymbol{\theta}^2 \leftarrow \boldsymbol{\theta}^2 - \alpha \sum_{\{u,k\} \in b} \nabla_{\boldsymbol{\theta}^2} L(f_{\boldsymbol{\theta}^2} | \textbf{z}_{u,k}, \textbf{y}_{u,k})
     end for
end while
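
A minimal PyTorch-style sketch of Algorithm 1 is given below. The dataset objects and function names are illustrative assumptions; the essential point it captures is that each client runs f^1 exactly once per sample, while all SGD iterations happen at the PS over the uploaded (z, y) pairs.

import torch

# --- Client side: one-time feature extraction and upload (no iteration) ---
def client_upload(f1, local_samples):
    """Run the frozen feature-extraction sub-model f1 once per sample and
    return the (z, y) pairs that would be uploaded to the PS."""
    f1.eval()
    uploads = []
    with torch.no_grad():                      # f1 is fixed, so no gradients needed
        for x, y in local_samples:             # y is a class index
            z = f1(x.unsqueeze(0)).squeeze(0)  # extracted features z_{u,k}
            uploads.append((z, y))
    return uploads

# --- PS side: iterate SGD on the task-specific sub-model f2 only ---
def ps_train(f2, feature_dataset, epochs=10, batch_size=128, lr=1e-2):
    loader = torch.utils.data.DataLoader(feature_dataset, batch_size=batch_size,
                                         shuffle=True)  # the PS may reshuffle freely
    opt = torch.optim.SGD(f2.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for z, y in loader:
            opt.zero_grad()
            loss_fn(f2(z), y).backward()       # back-propagation entirely at the PS
            opt.step()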

As emphasized above, a key distinction of FbFTL is that the training samples of the task-specific sub-model are uploaded rather than the gradient updates themselves. With this, FbFTL provides additional important benefits that can lead to further improvements in training performance and efficient management in practice. One of the most important benefits is the significant reduction of the packet loss rate for the same block loss rate, since FbFTL has much less data in each batch, as we will illustrate in detail in Section IV-A. Another benefit is that the requirement of synchronization between clients is waived, since there is only one upload iteration in FbFTL. The remaining benefits pertain to hyper-parameter fine-tuning, dataset balancing, and enabling flexible training batch size selection, as detailed below.

Specifically, one of the most important additional benefits is that FbFTL enables iterative fine-tuning of the SGD optimizer hyper-parameters, such as the learning rate. To obtain the optimized DNN, one needs to find the optimized hyper-parameters that provide the best convergence performance. In FbFTL, the model can be trained from scratch several times at the PS to identify the best hyper-parameter setting without additional communication with the clients. On the other hand, in FL and FTL, the entire online training process has to be run several times, resulting in a much higher cost compared to the ideal uplink payload P.

Another benefit of FbFTL is dataset balancing. If the overall dataset is imbalanced, and hence samples with certain types of output appear much more frequently than those with other outputs, it is hard for FL and FTL to detect this via gradient updates, and such an imbalanced data distribution can significantly degrade FL performance [69, 70]. However, FbFTL, with direct output information, enables techniques such as re-sampling specific classes or merging near-identical classes to mitigate dataset imbalance.

One more benefit of FbFTL is that it lifts the constraints of UC and K_u on the training batch size. To avoid over-fitting to the training dataset, one needs to validate the performance on a separate validation set in order to identify the optimal number of training iterations at which to stop training and conclude the final model. It is straightforward for FbFTL to divide the obtained dataset into training and validation subsets. However, the gradient-based FL frameworks call for extra effort from the communication system to meticulously perform the training and validation processes with the desired order and number of samples. Additionally, for the training process, due to the broadcast nature of the downlink in wireless FL and FTL, all selected clients in the same communication iteration receive the same parameters. Hence, each SGD mini-batch has a size larger than \sum_{u=1}^{UC} K_u, and an overwhelmingly large SGD mini-batch may delay the training process and require more training iterations. The mini-batches in these gradient-based FL frameworks are also tied to the given clients' samples. Therefore, when the clients' sample distribution is biased, we cannot shuffle the samples between iterations and have to accept the loss in final performance. In contrast, FbFTL may choose any rational SGD mini-batch size without constraints from the communication system, and can reshuffle the data in each training iteration.

III-B Payload Analysis

Note that FbFTL has the major benefit of requiring one-time communication between the clients and the PS. In addition, FbFTL carries much less data in a single upload batch compared to each batch uploaded in FL and FTL. Let us assume that in the fully connected layer m with bias, the numbers of input and output nodes are denoted by N^-_m and N^+_m, respectively. Then, the number of trainable parameters in this layer is given by T_m = N^+_m (N^-_m + 1). Compared to the dimension N^-_{m_c}, the amount of information needed to represent \textbf{y}_{u,k} \in \{1, \ldots, N\}, i.e., \log_2 N bits, is negligible. Note that in FbFTL, gradients are computed at the PS. Therefore, FedAvg cannot be applied, and each sample is required to be uploaded separately. Hence, for FbFTL, the uplink payload for each sample is d N^-_{m_c} bits, and the overall successful uplink payload required for training is

P^{FbFTL} = d \sum_{u=1}^{U} K_u N^-_{m_c} \ \text{bits}. (6)

Compared to the uplink payload in (4) of FTL with FedAvg (in which only the task-specific sub-model is updated), the ratio between each upload in FTL and each single-sample upload in FbFTL is

\frac{d \sum_{m=m_c}^{M} T_m}{d N^-_{m_c}} = \frac{T_{m_c} + \sum_{m=m_c+1}^{M} T_m}{N^-_{m_c}} = \frac{N^+_{m_c}(N^-_{m_c}+1) + \sum_{m=m_c+1}^{M} T_m}{N^-_{m_c}} > N^+_{m_c}. (7)

Therefore,

\frac{P^{FTL}_c}{P^{FbFTL}} = \frac{d I^{FTL}_c UC \sum_{m=m_c}^{M} T_m}{d \sum_{u=1}^{U} K_u N^-_{m_c}} > \frac{I^{FTL}_c UC}{\sum_{u=1}^{U} K_u} N^+_{m_c}. (8)

The number of extracted features for many state-of-the-art models is larger than 10^3, and TL usually requires a relatively deep task-specific sub-model; therefore, N^+_{m_c} can be large. For cross-device federated learning, the clients do not hold huge local datasets, so \sum_{u=1}^{U} K_u < I^{FTL}_c UC. Therefore, we have P^{FbFTL} \ll P^{FTL}_c. We note that FbFTL and FTL that updates the task-specific sub-model have the same performance at every iteration, because the only difference between the two methods is the communication model and not the numerical process that generates the gradient updates. Combining this observation with the conclusion in (5), we have

P^{FbFTL} \ll P^{FTL}_c < P^{FTL}_f < P^{FL}, (9)

and therefore we expect a drastically smaller uplink payload in FbFTL compared to FTL and FL.
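
To make the comparison among (2), (4), and (6) concrete, the following back-of-the-envelope script evaluates the three payload expressions with the VGG-16/CIFAR-10 values used in Section III-D (d = 32 bits, U = 6250, K_u = 8, N^-_{m_c} = 4096); the batch counts and parameter counts are taken from Table III, and treating them as plain Python constants is our own convention.

d = 32                                              # bits per float
# FL, eq. (2): all trainable parameters, every iteration
batches_FL, params_full = 656_250, 153_144_650     # I*U*C and sum_m T_m
# FTL_c, eq. (4): task-specific parameters only, every iteration
batches_FTLc, params_task = 525_000, 35_665_418    # I*U*C and sum_{m>=m_c} T_m
# FbFTL, eq. (6): one upload of N^-_{m_c} features per sample
num_samples, n_features = 6_250 * 8, 4_096         # sum_u K_u and N^-_{m_c}

P_FL    = d * batches_FL   * params_full
P_FTLc  = d * batches_FTLc * params_task
P_FbFTL = d * num_samples  * n_features

for name, bits in [("FL", P_FL), ("FTL_c", P_FTLc), ("FbFTL", P_FbFTL)]:
    print(f"{name:6s} {bits:.3e} bits")
# prints ~3.2e15, ~6.0e14, ~6.6e9 bits, i.e., 3216 Tb, 599 Tb, 6.6 Gb as in Table III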

We also note that FbFTL has the smallest downlink broadcast payload and requires the least local computation compared to FL and FTL. For FL, FTL that updates all parameters, FTL that updates the task-specific sub-model, and FbFTL, the overall downlink broadcast payloads are, respectively,

D^{FL} = d I^{FL} \sum_{m=1}^{M} T_m \ \text{bits}, (10)
D^{FTL}_f = d I^{FTL}_f \sum_{m=1}^{M} T_m \ \text{bits}, (11)
D^{FTL}_c = d I^{FTL}_c \sum_{m=1}^{M} T_m \ \text{bits}, (12)
D^{FbFTL} = d \sum_{m=1}^{m_c - 1} T_m \ \text{bits}. (13)

III-C Time Complexity Analysis

We note that FbFTL consumes less total computation time and power for training, and also transfers a proportion of the computation from the client devices to the PS. In FL and FTL, each client must complete one full forward pass and one back-propagation over the parameters to be trained for each training sample at each iteration. However, in FbFTL, each sample is processed only once by the client, and each client only needs to complete the forward pass of the feature extraction sub-model, while all other computations are completed at the PS. Such a shift of computational load to the PS is particularly beneficial if the end users and devices are severely limited in their computational capabilities and power resources (for instance, in IoT networks) and the PS is equipped with advanced processors and has access to more resources.

In this subsection, we use time complexity to estimate the computation time and energy consumption. Each fully connected layer m_1 with N^-_{m_1} input nodes and N^+_{m_1} output nodes has T_{m_1} = N^+_{m_1}(N^-_{m_1}+1) trainable parameters, and each sample requires O(N^+_{m_1}(N^-_{m_1}+1)) = O(T_{m_1}) time complexity for matrix multiplication during the forward pass, and the same time complexity during back-propagation. Each 2-dimensional convolutional layer m_2 with stride 1 in each direction and same padding has input shape \{a^-, b^-, c^-\} where N^-_{m_2} = a^- b^- c^-, kernel shape \{c^+, a', b', c^-\} where T_{m_2} = c^+(a' b' c^- + 1), and output shape \{a^-, b^-, c^+\} where N^+_{m_2} = a^- b^- c^+. Thus, each sample requires O(c^+(a' b' c^- + 1) a^- b^-) = O(a^- b^- T_{m_2}) time complexity for matrix multiplication during the forward pass, and the same during back-propagation. Comparatively, the time complexity of activation functions, max-pooling layers and dropout layers is negligible. Therefore, we denote the time complexity of each trainable layer m for a single training sample as O(X_m), and the overall time complexity required across all clients as O(X). Specifically, the overall time complexities at the clients for FL, FTL that updates all parameters, FTL that updates the task-specific sub-model, and FbFTL are, respectively,

O(X^{FL}) \propto O\left(2 I^{FL} UC \sum_{m=1}^{M} X_m\right), (14)
O(X^{FTL}_f) \propto O\left(2 I^{FTL}_f UC \sum_{m=1}^{M} X_m\right), (15)
O(X^{FTL}_c) = O\left(U \sum_{m=1}^{m_c - 1} X_m + 2 I^{FTL}_c UC \sum_{m=m_c}^{M} X_m\right), (16)
O(X^{FbFTL}) \propto O\left(U \sum_{m=1}^{m_c - 1} X_m\right), (17)

where \propto stands for “proportional to”. Typically, we have X^{FbFTL} < X^{FTL}_c \ll X^{FTL}_f < X^{FL}.
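
As a sanity check on the counting rules above, the helper below tallies per-sample forward-pass multiplications for fully connected and convolutional layers; the tuple-based layer description is our own convention, and the example shapes are purely illustrative.

def layer_mults(layer):
    """Per-sample multiplications in one forward pass of a trainable layer.
    FC layer m_1: O(T_{m_1}) with T = N_plus * (N_minus + 1).
    Conv layer m_2 (stride 1, same padding): O(a_minus * b_minus * T_{m_2})
    with T = c_plus * (a_prime * b_prime * c_minus + 1)."""
    if layer[0] == "fc":              # ("fc", N_minus, N_plus)
        _, n_minus, n_plus = layer
        return n_plus * (n_minus + 1)
    if layer[0] == "conv":            # ("conv", a_minus, b_minus, c_minus, a_prime, b_prime, c_plus)
        _, a_m, b_m, c_m, a_p, b_p, c_p = layer
        return a_m * b_m * c_p * (a_p * b_p * c_m + 1)
    raise ValueError("unknown layer type")

# Example: one 3x3 conv block on a 32x32x3 input, plus one 4096x4096 FC layer.
layers = [("conv", 32, 32, 3, 3, 3, 64), ("fc", 4096, 4096)]
forward = sum(layer_mults(l) for l in layers)
print(forward, 2 * forward)  # back-propagation roughly doubles the count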

III-D Experimental Results

In this section, we consider the application of FL, FTL and FbFTL to the VGG-16 CNN model [41] and transfer the knowledge learned from ImageNet dataset [42] to CIFAR-10 dataset [43].

ImageNet is a vast online database containing over 14 million images, each with hand-annotated labels describing the classification types or intended outputs for training. The pre-trained source model we use for TL was created on the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC2012, [71]), and the images come from 1000 different categories. As an illustration, we provide 10 samples with their labels in Fig. 3.

Refer to caption
Figure 3: ImageNet samples with labels.

CIFAR-10 is a database of 60000 images, each with one of N = 10 distinct labels. Out of these images, 50000 are used for training and 10000 for testing. Fig. 4 showcases the first ten samples with their labels. Note that these samples have simple labels and appear more blurred than those in ImageNet due to their lower resolution/dimension.

Refer to caption
Figure 4: CIFAR-10 samples with labels.

In Fig. 5, we depict the structure of VGG-16 used for training on CIFAR-10. For TL, we consider the first half of the layers, marked blue, as the feature extraction sub-model f^1_{\boldsymbol{\theta}^1} and directly transfer this sub-model from the one trained on ImageNet. We take the latter part, marked yellow, as the task-specific sub-model f^2_{\boldsymbol{\theta}^2} and randomly initialize this part. For FbFTL, the dimension of the intermediate output is N^-_{m_c} = 4096.

Refer to caption
Figure 5: Diagram of the VGG-16 model for training on the CIFAR-10 dataset. “Conv {receptive field size}-{number of output channels}” denotes the convolutional layers. “FC {output size}” denotes the fully connected layers.

To train the model on CIFAR-10, we utilize an Nvidia GeForce GPU with CUDA to run the algorithms in PyTorch [72]. We assume that there are U = 6250 clients in total, each iteration involves a fraction C = 1.28 \times 10^{-3} of all clients, and each batch contains K_u = 8 samples. In the two most commonly used deep learning tools, TensorFlow (including Keras) [73] and PyTorch, the default data type of each number has 32 bits, and therefore d = 32 bits. The learning rate is 10^{-2}, the momentum of the optimizer is 0.9, and the L2 penalty is 5 \times 10^{-4}.
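
For illustration, the following sketch builds the split of Fig. 5 on top of torchvision's pre-trained VGG-16 and configures the optimizer with the hyper-parameters above. The exact indexing into torchvision's classifier module is our assumption about how the paper's cut layer (with N^-_{m_c} = 4096 inputs) maps onto that implementation, and the 10-way output corresponds to CIFAR-10.

import torch
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")  # source model f'

# Feature-extraction sub-model f1: all conv blocks plus the first FC layer,
# so the intermediate output z has dimension N^-_{m_c} = 4096.
f1 = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten(),
                         *list(vgg.classifier[:3]))
for p in f1.parameters():
    p.requires_grad = False           # copied from f' and kept fixed

# Task-specific sub-model f2: randomly initialized, final layer resized to N = 10.
f2 = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
                         torch.nn.Dropout(), torch.nn.Linear(4096, 10))

# Optimizer settings from the experiments described above.
opt = torch.optim.SGD(f2.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4)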

Metric | FL | SFL | FL^low | FTL_f | FTL_f^low | FTL_c | FbFTL
number of upload batches | 656250 | 656250 | 68750 | 193750 | 25000 | 525000 | 50000
upload parameters per batch | 153144650 | 117483328 | 153144650 | 153144650 | 153144650 | 35665418 | 4096
uplink payload per batch | 4.9 Gb | 3.8 Gb | 4.9 Gb | 4.9 Gb | 4.9 Gb | 1.1 Gb | 131 Kb
total uplink payload P | 3216 Tb | 2467 Tb | 337 Tb | 949 Tb | 123 Tb | 599 Tb | 6.6 Gb
total downlink payload D | 402 Tb | 308 Tb | 42 Tb | 253 Tb | 15 Tb | 322 Tb | 3.8 Gb
computation time complexity X | 1.63 \times 10^{17} | 1.63 \times 10^{17} | 1.7 \times 10^{16} | 4.80 \times 10^{16} | 6.19 \times 10^{15} | 1.07 \times 10^{15} | 7.74 \times 10^{14}
validation accuracy | 89.42% | 89.42% | 86.64% | 93.75% | 86.45% | 86.51% | 86.51%
Table III: Performance comparison on VGG-16 between FL, SFL, FTL updating the full model (FTL_f), FTL updating the task-specific sub-model (FTL_c), and FbFTL. Additionally, we compare with cases in which FL and FTL_f achieve the same accuracy as FTL_c and FbFTL (\approx 86%) by terminating training slightly earlier; these cases are referred to as FL^low and FTL_f^low.

In Table III, we compare the performance of FL, SFL, FTL updating the full model (FTL_f), FTL updating the task-specific sub-model (FTL_c), and FbFTL. For increased fairness in the payload comparison, we further report the performance of FL^low and FTL_f^low, which are the FL and FTL_f algorithms with the training process terminated once the validation accuracy reaches that of FTL_c and FbFTL (i.e., \approx 86%). As we have analyzed, the other algorithms require IUC successfully uploaded batches, while FbFTL only requires \sum_{u=1}^{U} K_u batches. Also, FL, SFL, FTL_f and FTL_c require uploading d \sum_{m=1}^{M} T_m bits, d(K_u N^-_{m_c} + \sum_{m=m_c}^{M} T_m) bits, d \sum_{m=1}^{M} T_m bits, and d \sum_{m=m_c}^{M} T_m bits per batch, respectively, while FbFTL only requires uploading d N^-_{m_c} bits per batch. In the third row of the table, we observe that FbFTL reduces the uplink payload per batch by four orders of magnitude (i.e., by a factor of 10^{-4}) compared to the other algorithms for each client. Additionally, in the fourth row, FbFTL leads to a reduction of five orders of magnitude (i.e., by a factor of 10^{-5}) in the total uplink payload during training when compared to the other algorithms. These results demonstrate that FbFTL is clearly the most efficient scheme. FbFTL also results in a substantial decrease in the overall downlink payload. If larger models are trained for more complex tasks, or if the size of the training dataset is more limited, the difference in payload could be even greater. In (14) through (17), we have described the overall computation time complexity required for training at the clients as O(X). In the second-to-last row of the table, we quantify this time complexity for each method by counting the number of multiplications among floating-point numbers. We readily observe that FbFTL requires the least computation time and power consumption at the clients, with a reduction of two orders of magnitude compared to FL, SFL and FTL_f, since it only runs the forward pass over the feature extraction sub-model once for each sample. We further note that for FbFTL, the computation complexity at the PS is also low, with X = 3.00 \times 10^{14}. Furthermore, we note that the number of clients CU in each iteration does not affect the performance of FbFTL, but a large number of clients CU reduces the performance of FL and FTL, since it determines the minimum “training batch size” CU K_u. The performance of FL and FTL decreases with an overwhelmingly large training batch size, which is another benefit of FbFTL. In the experiment, we pick a small CU = 8 so that the benchmark schemes (i.e., FL, FTL_f, and FTL_c) achieve their best performance (at which point FTL_c and FbFTL have the same performance). Moreover, we also observe that FTL_c and FbFTL, which only update a small portion of the parameters, exhibit a slight decrease in validation accuracy, which is the trade-off for the reduced payload. However, we will show in Section IV and Section V that under communication efficiency or privacy constraints, FbFTL may also prevail in terms of validation accuracy in certain situations.

Metric | FTL_f | FTL_c | FbFTL | FTL_c | FbFTL | FTL_c | FbFTL
number of trained encoders | 8 | 8 | 8 | 4 | 4 | 2 | 2
number of upload batches | 132588 | 36830 | 7366 | 88392 | 7366 | 103124 | 7366
upload parameters per batch | 109860224 | 60511616 | 1024 | 51070144 | 1024 | 46349504 | 1024
uplink payload per batch | 3.5 Gb | 1.9 Gb | 32.7 Kb | 1.6 Gb | 32.7 Kb | 1.5 Gb | 32.7 Kb
total uplink payload P | 466.1 Tb | 71.3 Tb | 241.4 Mb | 144.5 Tb | 241.4 Mb | 152.9 Tb | 241.4 Mb
total downlink payload D | 116.0 Tb | 32.2 Tb | 1.58 Gb | 77.3 Tb | 1.88 Gb | 90.2 Tb | 2.03 Gb
validation ROUGE-1 | 45.9249 | 45.4680 | 45.4680 | 45.2827 | 45.2827 | 44.9862 | 44.9862
Table IV: Performance comparison on FLAN-T5-small between FTL_f updating the full model, FTL_c updating the task-specific sub-model, and FbFTL, with different numbers of encoders trained.

Furthermore, we note that in transfer learning, there is an additional trade-off between privacy protection, performance and payload. When partitioning a model into a feature extraction sub-model and a task-specific sub-model, choosing the cut layer closer to the output better preserves data privacy, while picking a cut layer closer to the input improves the training performance. In order to demonstrate this trade-off, we next consider a language model as our application scenario. For FL, FTL, and FbFTL on natural language processing tasks, we consider tasks where the model and the intermediate features are not proprietary or private, and thus we assume that the clients are willing to share them while keeping the local data private. As an example, we show the results on a conversation summarization task, SAMSum [74], with 32128 distinct token types, 14732 training dialogues, and 819 testing dialogues. For instance, dialogue ID 13728867 in SAMSum is “Olivia: Who are you voting for in this election? Oliver: Liberals as always. Olivia: Me too!! Oliver: Great”, and the ground-truth summary is “Olivia and Olivier are voting for liberals in this election.” For this task, we deploy the language model FLAN-T5-small [75], which is a transformer with 110 million parameters, including 8 encoders and 8 decoders. This model is pre-trained and the dataset is relatively small, so we do not include FL in this case. We assume that there are U = 7366 clients, each with K_u = 2 dialogues, and that each iteration involves a fraction C = 5.43 \times 10^{-4} of all clients. Our experiment is based on HuggingFace [76] with learning rate 2 \times 10^{-4}.

In Table IV, we compare the performance of FTL and FbFTL in terms of the ROUGE-1 score, which measures the match between the generated text and the reference text; a larger ROUGE-1 indicates better performance. We note that FTL_f trains all components, including the encoders, decoders, embedding and the final linear layer. On the other hand, FTL_c and FbFTL freeze the embedding, and may or may not freeze several encoders close to the input prompts. The number of trained encoders is given in the second row of Table IV. For instance, the performance results in the third and fourth columns are for FTL_c and FbFTL that train all encoders (and hence freeze none of them), while the other columns provide the performance with 4 or 2 encoders trained (indicating that 4 or 6 encoders close to the input are frozen). We do not need to freeze the decoders, since the label privacy leakage converges to zero with shuffled batches, as will be shown in Section V. In Table IV, we observe that FbFTL reduces the uplink and downlink payloads by similar orders of magnitude as in the VGG-16 experiment. Furthermore, we notice that training fewer layers or encoders leads to a slight reduction in the ROUGE-1 score, but it does not guarantee a lower payload: while FTL may require more iterations to reach convergence, FbFTL may need to broadcast a larger feature extraction sub-model.

IV Robustness Analysis

In this section, we compare the performance of FbFTL with that of FL and FTL under the same packet loss rate (PLR) for each uploaded batch, and illustrate the robustness of FbFTL against packet losses, data insufficiency, and compression, including quantization, sparsification and error feedback.

IV-A Packet Loss

As we have demonstrated in the previous section, gradient-based FedAvg FL frameworks, including FL and FTL, upload the gradient update \textbf{g}_u = \sum_{k=1}^{K_u} \nabla_{\boldsymbol{\theta}} L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k}) iteratively, where each batch contains the gradient summation over all samples of the client. In contrast, our proposed FbFTL uploads the data for each sample only once, and each batch contains the extracted features \textbf{z}_{u,k} and output \textbf{y}_{u,k} of one sample.

In (7), we have shown that each batch for gradient-based FL is much larger than each batch for FbFTL (by a factor of about 10^4 in our experiments), and we assume that the packets transmitting both types of batches consist of multiple transmission blocks of the same size. For both types of packets, we measure the robustness of all frameworks against packet loss caused by block losses due to network congestion or link outage (e.g., as a result of deep fading in wireless networks). For each learning framework whose packet consists of n_b transmission blocks, we consider the same block loss rate (BLR). The PLR without retransmission is

\text{PLR} = 1 - (1 - \text{BLR})^{n_b}. (18)

Obviously, the PLR depends strongly on the value of n_b. FbFTL has a significantly smaller packet size, and consequently we expect a much lower PLR for the same BLR. We show the significant performance difference among the learning frameworks in our experiments at the end of this section.

If, on the other hand, we allow at most n_r retransmissions for all packets and assume the channel state of each transmission to be independent and identically distributed (i.i.d.), we can lower the PLR and achieve similar packet-level reliability. However, this is realized at the cost of a higher total uplink payload. Specifically, for each learning framework requiring uplink payload P, the expected uplink payload with retransmissions is

P(\text{BLR}) = \sum_{n'=1}^{n_b} \frac{P}{n_b} \sum_{n=0}^{n_r} (\text{BLR})^n = P \sum_{n=0}^{n_r} (\text{BLR})^n. (19)

We note that \sum_{n=1}^{n_r} (\text{BLR})^n is the expected number of retransmissions, which is the same for all learning frameworks. If the transmission has a fixed bandwidth, \sum_{n=1}^{n_r} (\text{BLR})^n also represents the delay factor of the total transmission. We further note that if \text{BLR} = 0, then P(\text{BLR}) = P. Otherwise, P(\text{BLR}) = P + P \sum_{n=1}^{n_r} (\text{BLR})^n > P. Notice that the increase in payload, P \sum_{n=1}^{n_r} (\text{BLR})^n, grows with P and n_r. Hence, learning frameworks with higher payload experience a larger increase in payload when retransmissions are introduced.
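
Equations (18) and (19) are straightforward to evaluate; the short script below, with an assumed BLR of 10^{-4} and an assumed gradient-based packet of n_b = 10^4 blocks (reflecting the roughly 10^4-fold batch-size gap in Table III), illustrates how strongly the packet size drives the PLR.

def plr(blr, n_b):
    """Packet loss rate without retransmission, eq. (18)."""
    return 1.0 - (1.0 - blr) ** n_b

def expected_payload(P, blr, n_r):
    """Expected uplink payload with up to n_r retransmissions, eq. (19)."""
    return P * sum(blr ** n for n in range(n_r + 1))

blr = 1e-4
print(plr(blr, n_b=10_000))               # gradient-based batch: ~0.63
print(plr(blr, n_b=1))                    # FbFTL batch: 1e-4
print(expected_payload(1.0, blr, n_r=3))  # payload inflation factor: ~1.0001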

IV-B Data Insufficiency

At the end of Section III-A, we discussed the benefit that FbFTL does not require additional online uploads from the clients to test different sets of hyper-parameters, while gradient-based frameworks need to run the entire uploading process multiple times. In practice, another intangible aspect prior to training, in addition to the hyper-parameters, is whether there is a sufficient number of participating clients and training samples. If the planned set of samples is insufficient to train the neural network, gradient-based frameworks require rerunning the overall process with more clients, which leads to high consumption and potential waste of both computational and transmission resources. However, FbFTL only requires the new clients to compute and upload their batches, and hence no transmission resources are wasted.

IV-C Quantization, Sparsification and Error Feedback

To further reduce the uplink payload of gradient-based frameworks, there has been extensive work on gradient compression, including gradient sparsification [77, 78] and gradient quantization [63, 64], especially signSGD, the extreme case in which each element is reduced to a binary value without scaling [79]. Several recent studies utilize error feedback (i.e., quantization and sparsification with memory), which retains the compression error at the client's local device to improve the updates in future iterations [50, 49].

For gradient-based frameworks with gradient update \textbf{g}_u of X_g elements, we denote the sparsification function as \mathcal{S}_r(\cdot): \mathbb{R}^{X_g} \rightarrow \mathbb{R}^{X_g}, where r \in (0, 1]. This function keeps the r X_g elements of the vector with the highest absolute values, and sets all other elements to 0. We denote the quantization function as \mathcal{Q}_q(\cdot): \mathbb{R}^{X_g} \rightarrow \mathbb{R}^{X_g}, where q \in \mathbb{Z}^+; this function quantizes each element of the vector to q bits within the max/min range. Furthermore, we denote the error memory vector of client u at iteration t as \textbf{m}_{u,t} \in \mathbb{R}^{X_g}. The process of quantization and sparsification with error feedback at iteration t is then described as follows:

\textbf{g}'_u = \mathcal{Q}_q(\mathcal{S}_r(\textbf{g}_u + \textbf{m}_{u,t})), (20)
\textbf{m}_{u,t+1} = \textbf{g}_u + \textbf{m}_{u,t} - \textbf{g}'_u, (21)

where \textbf{g}'_u is uploaded to the PS, and the local error memory vector is updated to \textbf{m}_{u,t+1}. For initialization, \textbf{m}_{u,1} = \textbf{0}.

Similarly, for FbFTL utilizing the extracted feature $\textbf{z}_{u,k}$ with $X_{z}$ elements, we may also apply sparsification and quantization. However, error feedback is not applicable, because there is only one upload iteration in FbFTL, and the extracted feature, unlike the gradient, is not additive. Therefore, the process of quantization and sparsification for FbFTL is described as follows:

\textbf{z}^{\prime}_{u,k}=\mathcal{Q}_{q}(\mathcal{S}_{r}(\textbf{z}_{u,k})), (22)

where $\textbf{z}^{\prime}_{u,k}$ is uploaded to the PS.
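For concreteness, the following NumPy sketch (our own illustration, not the paper's released code) implements $\mathcal{S}_{r}$, $\mathcal{Q}_{q}$, and the updates in (20)–(22); the uniform max/min-range quantizer is a simplified stand-in, and all function names are ours:

```python
import numpy as np

def sparsify(v, r):
    """S_r: keep the ceil(r * len(v)) highest-magnitude elements, zero the rest.
    The top-k selection is the O(X log X) step discussed in Section IV-C."""
    k = max(1, int(np.ceil(r * v.size)))
    out = np.zeros_like(v)
    top = np.argsort(np.abs(v))[-k:]
    out[top] = v[top]
    return out

def quantize(v, q):
    """Q_q: quantize each element to q bits uniformly within the max/min range
    (a simplified stand-in for the quantizer used in the experiments)."""
    lo, hi = v.min(), v.max()
    if hi == lo:
        return v.copy()
    levels = 2 ** q - 1
    return lo + np.round((v - lo) / (hi - lo) * levels) * (hi - lo) / levels

def compress_gradient(g, m, r, q):
    """One client iteration of (20)-(21): compress g with error feedback."""
    g_up = quantize(sparsify(g + m, r), q)  # g'_u, uploaded to the PS
    m_next = g + m - g_up                   # residual carried to iteration t+1
    return g_up, m_next

def compress_feature(z, r, q):
    """FbFTL's one-shot compression (22): no error memory is kept."""
    return quantize(sparsify(z, r), q)

# Usage: initialize the error memory to zero, matching m_{u,1} = 0.
g = np.random.randn(1000).astype(np.float32)
g_up, m = compress_gradient(g, np.zeros_like(g), r=0.01, q=8)
z_up = compress_feature(np.random.randn(64).astype(np.float32), r=0.1, q=8)
```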

In both cases, the compression rate is close to $2^{d/q}/r$, where $d$ is the number of bits in the representation of the original data type. We note that there are also potential drawbacks in the practical deployment of sparsification and error feedback. On the one hand, while all other operations, including training, communication, inference, quantization, and error feedback, have time complexity no more than $O(X^{\prime})$, where $X^{\prime}$ is the total number of parameters in the neural network, sparsification requires up to $O(X_{g}\log{X_{g}})$ time complexity for gradient-based frameworks and $O(X_{z}\log{X_{z}})$ for FbFTL, depending on the sparsification ratio $r$. Typically, we have $O(X_{z})\ll O(X_{g})=O(X^{\prime})$, and FbFTL can therefore achieve a more significant reduction in local sparsification computation compared to gradient-based frameworks. On the other hand, by compensating for the compression error in future iterations, error feedback significantly improves the training performance when the compression rate is high, i.e., when $r$ and $q$ are low. However, error feedback requires $O(X_{g})$ additional memory throughout the entire training process, not just during local training. Also, we demonstrate below in the experimental results that the gradient-based frameworks do not prevail even with error feedback.

IV-D Experimental Results

In this section, we numerically demonstrate the aforementioned robustness metrics in our experiments with the VGG-16 CNN model on transfer learning from the ImageNet dataset to the CIFAR-10 dataset.

In Fig. 6, we show the PLR (without retransmissions) of different learning frameworks when the block size equals the batch size of FbFTL (i.e., the BLR equals the PLR of FbFTL). As shown in the uplink payload per batch row of Table III, the batch sizes of the other learning frameworks are about $10^{4}$ times larger than that of FbFTL. Due to this huge difference in batch sizes, the PLR curves of the other learning frameworks almost reach $1$ when the PLR of FbFTL is less than $0.001$ (or equivalently when $\text{BLR}<0.001$). The difference remains substantial even if a few rounds of retransmissions are allowed in gradient-based frameworks. Fig. 7 provides a magnified plot of Fig. 6 for $\text{BLR}<0.001$, and we note that the curve of FL and the curve of FTL that updates the full model overlap, since they have the same number of parameters to update and therefore the same batch size. In this figure, we again observe that the PLR curves of learning mechanisms other than FbFTL quickly approach $1$ within the considered range $\text{BLR}<0.001$, while the PLR of FbFTL (being equal to the BLR) stays below $0.001$.

Figure 6: Comparison of PLR (without retransmission) among FL, FTL updating full model, FTL updating task-specific sub-model, and FbFTL.
Figure 7: Magnification of Fig. 6 for BLR < 0.1%.

Since FbFTL has a significantly lower PLR, we subsequently analyze its performance with different amounts of data. In Fig. 8, we see that FbFTL attains relatively high validation accuracy even with only 10% of the samples. Indeed, when we have only $0.1\%$ of the samples (i.e., the case with 50 samples), the accuracy is $82.5\%$ even with $\text{PLR}=0.5$. Furthermore, we observe that as the PLR increases, the accuracy curves remain relatively flat and experience sharp drops only when the PLR approaches 1. Therefore, FbFTL is considerably robust against data insufficiency and packet loss. We note that such robustness requires significantly more training iterations $I^{\prime}$, such that $IU$ and $I^{\prime}U^{\prime}$ have the same order of magnitude, where $I$ is the original number of training iterations with all $U$ participating clients, and $I^{\prime}$ is the number of training iterations with limited data from $U^{\prime}$ clients. In the case of a smaller number of participating clients $U^{\prime}\ll U$ and $\text{PLR}=0$, FbFTL requires the same level of computational power consumption and more total downlink payload compared to the original case with $U$ clients. However, there is a huge reduction in total uplink payload and total local computational power consumption, because FbFTL only requires uploading once from each participating client. In comparison, gradient-based frameworks require uploading in every training iteration, and therefore obtain no such benefit in terms of reduced uplink payload and local computation, while also suffering from a higher downlink payload. Similarly, we show the performance curves for FLAN-T5-small in Fig. 9.

Figure 8: Validation accuracy of FbFTL with different numbers of participating samples and PLRs.
Figure 9: ROUGE-1 score of FbFTL with different numbers of participating samples and PLRs on FLAN-T5-small with SAMSum.

In Table V, we show the validation accuracy of each framework with data from all $U$ clients and $\text{PLR}=0$, but for different values of the quantization size $q$ bits and the sparsification ratio $r$. In the experiment, we pick the same set of values for $q$ and $r$ as in [77, 78]. Note that the highest reduction in uplink payload is achieved when we set $r=0.001$ and $q=2$. However, even in this case, the uplink payload of the gradient-based frameworks (i.e., FL, FTLf, FTLc) is still greater than that of FbFTL with $r=1$ and $q=32$. Moreover, with such drastic reduction, the gradient-based schemes achieve lower accuracies than FbFTL with $r=1$ and $q=32$. Therefore, even without sparsification and more restrictive quantization (e.g., $q=8$ or $q=2$), FbFTL outperforms the gradient-based frameworks in terms of both accuracy and payload reduction. We can further reduce the uplink payload of FbFTL by choosing $r<1$ and $q<32$. We note that FbFTL in the extreme case of $r=0.001$ keeps only 4 elements in the extracted features to distinguish among 10 classes. In this setting, the information is severely limited and is insufficient to make an accurate prediction, resulting in accuracy levels of about $65\%$ for FbFTL. If, on the other hand, we pick a certain threshold on the validation accuracy, for instance 82%, FbFTL requires the sparsification ratio to be no lower than $r=0.01$ (which is $10$ times that required by the best gradient-based framework), but still maintains a reduction of more than four orders of magnitude (i.e., a factor of $10^{-4}$) in total uplink payload compared to the gradient-based algorithms while achieving the same validation accuracy. Such good performance of FbFTL without error feedback is due to the robustness of the extracted features against noise: if we view the compression error as random noise, the extracted features from a well-trained source model are typically robust to noise, while the gradient update does not necessarily have the same level of robustness.

                 FL        FTLf      FTLc      FbFTL
q=32, r=1        89.42%    93.75%    86.51%    86.51%
q=32, r=0.1      83.96%    93.46%    85.45%    86.47%
q=32, r=0.01     59.97%    90.98%    84.49%    82.10%
q=32, r=0.001    42.26%    81.98%    82.78%    65.71%
q=8,  r=1        88.46%    93.36%    86.16%    86.41%
q=8,  r=0.1      82.34%    93.34%    85.44%    86.39%
q=8,  r=0.01     57.04%    90.83%    84.41%    81.85%
q=8,  r=0.001    39.34%    81.91%    82.53%    65.55%
q=2,  r=1        45.96%    88.20%    81.44%    85.50%
q=2,  r=0.1      44.67%    88.16%    81.29%    85.48%
q=2,  r=0.01     33.63%    86.19%    80.91%    80.99%
q=2,  r=0.001    31.91%    80.15%    78.61%    65.18%

Table V: Validation accuracy comparison among FL, FTL updating the full model (FTLf), FTL updating the task-specific sub-model (FTLc), and FbFTL, with quantization to $q$ bits, sparsification ratio $r$, and error feedback (applied to all frameworks except FbFTL).

V Privacy Analysis

In the previous sections, we have described the FbFTL framework, in which each client $u$ with $K_{u}$ samples uploads the extracted features $\textbf{z}_{u,k}$ and outputs $\textbf{y}_{u,k}$ for $k=1,\ldots,K_{u}$, instead of the gradient update $\textbf{g}_{u}=\sum_{k=1}^{K_{u}}{\nabla_{\boldsymbol{\theta}}{L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k})}}$ as done in FL and FTL with FedAvg. In this section, we conduct a privacy analysis by studying the privacy leakage to a potential adversary, and propose protection strategies. In particular, we consider leakage due to the unveiling of information regarding the outputs/labels $\{\textbf{y}_{u,k}\}$ (henceforth referred to as label privacy leakage) and the unveiling of the intermediate features $\{\textbf{z}_{u,k}\}$ (henceforth referred to as feature privacy leakage). For example, when training a classifier to decide whether the subject of a photo is a dog or a cat, the identity of the subject being labeled as a dog is label privacy, while feature privacy covers additional information in the photo beyond its label, such as the color of the fur and the furniture in the background.

The label privacy leakage quantifies how much information about the batch of a targeted client is revealed to an adversary through the information on the outputs $\{\textbf{y}_{u,k}\}$ (or the type of label in classification problems). The information on the outputs/labels can be of statistical nature or can be obtained via the unveiling of the outputs to the adversary. The former considers the information leakage via knowledge of the general sample output distribution, while the latter specifies the leakage when the adversary has access to the output/label values. We analyze the label privacy leakage via the entropy and mutual information from the adversary's perspective, and propose an uploading design that randomly shuffles all batches to conceal the dependency between the client address and the output data.

Feature privacy leakage describes the amount of information that possibly leaks when the adversary obtains the uploaded contents from clients. We analyze the feature privacy leakage via the differential privacy (DP) framework [7], and provide comparisons between FbFTL and the gradient-based frameworks (FL and FTL, in which case feature privacy leaks when the adversary obtains the gradient updates) through experiments and numerical results.

V-A Label Privacy Leakage

First, we analyze the label privacy leakage. While FbFTL directly uploads the output of each sample in each batch, we note that FL and FTL with FedAvg that update the final layer also leak the output, and hence label privacy leakage occurs in these cases as well. According to [80], the count of each output type in a batch can be numerically solved given the average gradient. Adversaries with certain prior knowledge of the training data can gain further knowledge and reconstruct the input from the average gradient, e.g., via deep leakage [81] or gradient inversion [82, 83]. Although the following label privacy analysis applies to both the feature-based FbFTL and the gradient-based FedAvg frameworks, in this analysis we use the word “batch” to indicate all uploaded information from one client with multiple samples. Compared to the high-dimensional gradient $\textbf{g}_{u}$ or feature $\textbf{z}_{u,k}$, the output y also potentially reveals clients' private information, though only to a certain degree. In this setting, we analyze the conditions under which the label privacy leakage (with statistical information) vanishes. We also address the role of shuffling the output information from all clients as a way of hiding the client's address from the uploaded content when the adversary has access to the outputs.

To validate these approaches, we analyze the private output information leakage of a specific batch to a potential adversary without access to client addresses. According to [84], the privacy loss of a query can be evaluated as the difference in the privacy before and after the query. In our setting, we determine the amount of label privacy via the uncertainty from the adversary's perspective, and quantify it by utilizing the entropy formulation.

As shown in Fig. 10, we consider a common learning task in which the output $\textbf{y}\in[0,1]^{N}$ is an axis-aligned unit vector with one element having a value of 1, indicating the label associated with the given input, and all others equal to 0. We assume that each sample $\{\textbf{x}_{u,k},\textbf{z}_{u,k},\textbf{y}_{u,k}\}$ is independent and identically distributed (i.i.d.). Each client $u\in\mathcal{U}$ transmits one batch with $K_{u}=K$ samples, including the outputs/labels $\{\textbf{y}_{u,k}\}_{k=1}^{K}$. Therefore, for this client, there are $N^{K}$ different possible ordered batches of outputs in total (where $N$ is the number of possible labels that can be associated with the input).

Figure 10: Diagram of every sample label $\textbf{y}_{u,k}\in[0,1]^{N}$ in batches.

Assuming that the order within the batch does not carry information, we use $\mathcal{B}$ to denote the set of possible batches without order from client $u$, and the privacy information of the actual batch $b$ from client $u$ is the sum vector $\sum^{K}_{k=1}\textbf{y}_{u,k}=[n^{y}_{1,b},n^{y}_{2,b},\ldots,n^{y}_{N,b}]$, where $\sum^{N}_{a=1}n^{y}_{a,b}=K$. Subsequently, the number of possible batches without order is the number of combinations with replacement of $N$ items taken $K$ at a time, which is given by

|\mathcal{B}|=\binom{N+K-1}{K}. (23)

We denote the index of client $u$'s batch as a random variable $B$, and the index of the uploaded batch as $B=b\in\mathcal{B}$. We denote the distribution of the label $\textbf{y}_{u,k}$ as $P_{y}$. To achieve better performance, most machine learning applications desire a uniform distribution over the different labels, and we denote the uniform distribution over $N$ labels as $P_{y,\text{uni}}=[\frac{1}{N},\frac{1}{N},\ldots,\frac{1}{N}]$.

To evaluate the adversary's knowledge of one given batch $B=b$ of a target victim client, we consider the presumed distribution $\textbf{P}_{B}$ of the batch $B$ from the adversary's perspective, and quantify the adversary's uncertainty through the entropy $H(B)$ of the given batch.

V-A1 Label Privacy Leakage with Statistical Information

A weak adversary without any prior knowledge of the sample distribution may presume the sample distribution to be uniform, i.e., $P_{y,\text{uni}}$. In this case, the distribution of ordered batches from the adversary's perspective is also uniform, with probability $1/N^{K}$ for each. Each batch index $b\in\mathcal{B}$ without order corresponds to $K!/{\prod^{N}_{a=1}n^{y}_{a,b}!}$ different batch indices with order, and hence the adversary's presumed probability distribution of batches without order (under the assumption of a uniform sample distribution) is $\textbf{P}_{B|\text{uni}}=[p_{0,1},p_{0,2},\ldots,p_{0,|\mathcal{B}|}]$, where

p_{0,b}=\frac{K!}{N^{K}\prod^{N}_{a=1}n^{y}_{a,b}!}. (24)

Therefore, from the weak adversary's perspective, the uncertainty of the batch from client $u$, with known format and size but without prior knowledge about the content, can be quantified by the entropy

H(B\,|\,P_{y,\text{uni}})=-\sum^{|\mathcal{B}|}_{b=1}p_{0,b}\log_{2}{p_{0,b}}, (25)

where the condition in the entropy signifies that this is the entropy under the assumption of uniformly distributed samples.
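To make (23)–(25) concrete, the sketch below (our own illustration with small, illustrative values of $N$ and $K$) enumerates the unordered batches and evaluates the weak adversary's entropy:

```python
import math
from itertools import combinations_with_replacement

def weak_adversary_entropy(N, K):
    """H(B | P_y,uni) in (25): enumerate the unordered batches counted by (23)
    and weight each one by the probability p_{0,b} in (24)."""
    H = 0.0
    for batch in combinations_with_replacement(range(N), K):
        counts = [batch.count(a) for a in range(N)]
        p = math.factorial(K) / (N ** K * math.prod(map(math.factorial, counts)))
        H -= p * math.log2(p)
    return H

# For N = 10 labels and K = 4 samples, (23) gives C(13, 4) = 715 unordered
# batches, and the entropy is slightly below the ordered-batch entropy
# K * log2(N), which is about 13.29 bits.
print(weak_adversary_entropy(10, 4))
```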

On the contrary, a stronger adversary with the ability to steal and decode a large amount of uploaded data may learn the structure of the DNN and decode the labels of the $K$ samples in a batch; hence, such an adversary is likely to have prior knowledge of the general distribution of batches from a large number of clients. In the case that the true sample distribution $P_{y,\text{true}}$ is known, we denote the strong adversary's presumed distribution of the batch from client $u$ as

\textbf{P}_{B|\text{true}}=[p_{1},p_{2},\ldots,p_{|\mathcal{B}|}], (26)

and the corresponding uncertainty at the adversary in the original setting before query is

H(B\,|\,P_{y,\text{true}})=-\sum^{|\mathcal{B}|}_{b=1}p_{b}\log_{2}{p_{b}}. (27)

By comparing the uncertainty between the weak adversary and the strong adversary, we give the definition of label privacy leakage with statistical information.

Definition 1.

Label Privacy Leakage with Statistical Information: For an adversary without prior knowledge of the sample distribution (and hence one that initially assumes a uniform sample distribution), the label privacy leakage is the amount of information regarding the target batch $B$ that leaks to the adversary when it obtains the true distribution $P_{y,\text{true}}$. This privacy leakage can be formulated as follows:

L^{l}_{s}=H(B\,|\,P_{y,\text{uni}})-H(B\,|\,P_{y,\text{true}}). (28)

Note that $H(B\,|\,P_{y,\text{uni}})$ is the uncertainty in $B$ under the assumption that the labels are uniformly distributed, and $H(B\,|\,P_{y,\text{true}})$ is the remaining uncertainty in $B$ when the true sample distribution is learned by the adversary. Hence, the difference is the information gained by, or equivalently leaked to, the adversary.

Machine learning tasks typically require dataset balancing. If the data collection process is sufficiently well designed so that each output type is almost equally likely and we have $\textbf{P}_{B|\text{true}}\to\textbf{P}_{B|\text{uni}}$, then the label privacy leakage $L^{l}_{s}\to 0$. As we have illustrated in Section III-A, achieving this goal is more viable in FbFTL.

V-A2 Label Privacy Leakage with Access via Query

Next, we analyze the label privacy leakage when the adversary has access to the shuffled outputs (e.g., via a query). In the worst case, the strongest query is the process by which the adversary acquires all batches without the clients' addresses. Specifically, we assume that the adversary has access to the randomly shuffled $U$ batches. Recall that there are $|\mathcal{B}|$ different types of batches. We use $c(b)$ to denote the count of type-$b$ batches among the given $U$ batches. With this definition, we have $\sum_{b=1}^{|\mathcal{B}|}c(b)=U$. For different sets of $U$ batches, the counts will be different. We define $\mathcal{C}_{U}$ as a random vector of the counts of the different types of batches among a total of $U$ batches. Hence, for given $U$ batches, the realization of this vector is $\mathcal{C_{U}}=c_{U}=[c(1),c(2),\ldots,c(|\mathcal{B}|)]$. In the absence of any other statistical information, the adversary can utilize the following empirical distribution of $B$ based on the frequency of each batch type among all $U$ batches:

\textbf{P}_{B,\text{emp}}\triangleq\textbf{P}\{B\,|\,\mathcal{C_{U}}=c_{U}\}=\bigg{[}\frac{c(1)}{U},\frac{c(2)}{U},\ldots,\frac{c(|\mathcal{B}|)}{U}\bigg{]}. (29)

With this empirical distribution based on the shuffled $U$ batches, the adversary's uncertainty is given by the entropy

H(B\,|\,\mathcal{C_{U}}=c_{U})=-\sum^{|\mathcal{B}|}_{b=1}\frac{c(b)}{U}\log_{2}{\frac{c(b)}{U}}. (30)

In the case in which the adversary has acquired the $U$ batches via the query and knows the distribution $P_{y,\text{true}}$ of the samples, we denote the adversary's presumed distribution as $\textbf{P}(B\,|\,\mathcal{C_{U}}=c_{U},P_{y,\text{true}})$, and its uncertainty on client $u$'s batch as $H(B\,|\,\mathcal{C_{U}}=c_{U},P_{y,\text{true}})$.

First, we establish the following result.

Lemma 1.

Assume that the true sample distribution is known. Once the $U$ batches are revealed to the adversary, the distribution of the batches (from the adversary's perspective) depends only on the frequency/count of each batch type in the unveiled $U$ batches, and not on the sample distribution, i.e.,

P(B=b\,|\,\mathcal{C_{U}}=c_{U},P_{y,\text{true}})=P\{B=b\,|\,\mathcal{C_{U}}=c_{U}\}. (31)
Proof.

We prove this lemma by applying Bayes' rule and determining the ratio of the two conditional probabilities as follows:

\begin{aligned} P(B=b\,|\,\mathcal{C_{U}}=c_{U},P_{y,\text{true}})&=\frac{P(\mathcal{C_{U}}=c_{U},B=b\,|\,P_{y,\text{true}})}{P(\mathcal{C_{U}}=c_{U}\,|\,P_{y,\text{true}})}\\ &=\frac{\frac{(U-1)!}{(c(b)-1)!\prod_{b^{\prime}\neq b}c(b^{\prime})!}}{\frac{U!}{c(b)!\prod_{b^{\prime}\neq b}c(b^{\prime})!}}\\ &=\frac{\frac{(U-1)!}{(c(b)-1)!}}{\frac{U!}{c(b)!}}\\ &=\frac{c(b)}{U}\\ &=P\{B=b\,|\,\mathcal{C_{U}}=c_{U}\}.\end{aligned} (32)

Here, the factorial terms are the multinomial counts of the orderings of the $U$ batches consistent with $c_{U}$ (with the target client's batch fixed to type $b$ in the numerator); the probability of each such ordering is identical and cancels in the ratio. ∎

This result indicates that the batch distribution is equal to the empirical distribution in (29). Consequently, we also have the following characterization for the entropies:

H(B\,|\,\mathcal{C_{U}}=c_{U},P_{y,\text{true}})=H(B\,|\,\mathcal{C_{U}}=c_{U}). (33)

Next, we give the definition of the label privacy leakage with query, and identify a condition under which the privacy leakage vanishes.

Definition 2.

Label Privacy Leakage with Query: For an adversary with prior knowledge of the true sample distribution $P_{y,\text{true}}$, the label privacy leakage with query is the amount of information regarding the batch of a target client that leaks to the adversary when it obtains the randomly shuffled set of $U$ uploaded batches, including the target's batch. This privacy leakage can be formulated as follows:

\begin{aligned} L^{l}_{q}&=H(B\,|\,P_{y,\text{true}})-H(B\,|\,P_{y,\text{true}},\,\mathcal{C_{U}}=c_{U})\\ &=H(B\,|\,P_{y,\text{true}})-H(B\,|\,\mathcal{C_{U}}=c_{U}).\end{aligned} (34)
Lemma 2.

As the number of shuffled batches goes to infinity, the label privacy leakage with query converges to 0, i.e.,

\lim\limits_{U\to\infty}L^{l}_{q}=0. (35)
Proof.

As $U$ grows without bound, we have $\lim\limits_{U\to\infty}\textbf{P}_{B,\text{emp}}=\textbf{P}_{B|\text{true}}$ by the law of large numbers, and as a result, $\lim\limits_{U\to\infty}H(B\,|\,\mathcal{C_{U}}=c_{U})=H(B\,|\,P_{y,\text{true}})$. Hence, the label privacy leakage with query $L^{l}_{q}$ converges to 0. ∎

Note that if the adversary does not initially know even the distribution of the samples and assumes uniformly distributed samples, then the total label privacy leakage after the unveiling of the $U$ batches to the adversary can be defined as

L_{t}^{l}=L_{s}^{l}+L_{q}^{l}=H(B\,|\,P_{y,\text{uni}})-H(B\,|\,\mathcal{C_{U}}=c_{U}). (36)

If the shuffled dataset is perfectly balanced and we have $\textbf{P}_{B,\text{emp}}=\textbf{P}_{B|\text{uni}}$, the total label privacy leakage is zero, i.e., $L_{t}^{l}=L^{l}_{s}+L^{l}_{q}=0$.
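Lemma 2 can also be checked numerically. The sketch below (our own illustration with placeholder $N$, $K$, and $U$) samples shuffled batches, forms the empirical distribution in (29), and shows the entropy in (30) approaching the true-batch entropy as $U$ grows:

```python
import numpy as np
from collections import Counter

def empirical_entropy(N, K, U, rng):
    """H(B | C_U = c_U) in (30): sample U batches of K uniform labels,
    drop the order within each batch (sort), and take the entropy of the
    empirical batch-type frequencies (29)."""
    labels = np.sort(rng.integers(0, N, size=(U, K)), axis=1)
    counts = Counter(map(tuple, labels))
    freq = np.array(list(counts.values())) / U
    return -np.sum(freq * np.log2(freq))

rng = np.random.default_rng(0)
for U in (10**2, 10**4, 10**6):
    print(U, empirical_entropy(N=10, K=2, U=U, rng=rng))
# As U grows, H(B | C_U = c_U) approaches H(B | P_y,true), so the query
# leakage L_q^l in (34) shrinks toward 0, consistent with (35).
```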

V-B Feature Privacy Leakage

In this section, we analyze the privacy leakage on the input x via the uploaded content. Such privacy leakage includes information that is not necessarily needed to determine the output y, and we define it as feature privacy. Unlike label privacy, where the adversary knows the implication of each type of output (or the meaning of each label), it is hard to define the adversary's prior knowledge of the gradients and extracted features, and shuffling is unlikely to eliminate feature privacy leakage. Furthermore, it is hard to quantify and analyze the entropy and mutual information of gradients and extracted features because of the complexity of neural networks. Instead, we analyze the feature privacy leakage via differential privacy.

For FbFTL, the intermediate output z is also referred to as the smashed data in split learning [85, 86], and cannot be directly transformed back to the input x due to the nonlinearity of the activation functions in each layer. However, it is possible that FbFTL leaks privacy to some extent and partially reveals the input. Since the feature before a fully connected layer can be analytically solved from a sample's gradient [87], it is possible that gradient-based frameworks also leak the input. Strategies to extract the input from FedAvg gradients include deep leakage [81, 87, 88] and gradient inversion [82, 83], and have been extensively studied for image recognition. However, deep leakage highly depends on the dataset and the DNN structure. Therefore, an analysis of the privacy leakage beyond label privacy is needed to identify what may be partially revealed about the input x from the uploaded features.

Differential privacy (DP) provides an upper bound on the privacy leakage using a different measuring approach, and we compare the feature privacy preserving performance of FbFTL and the other schemes via DP. In the DP analysis of feature privacy, we denote the training samples for each batch as d and the function that generates the upload vector as $g$. More specifically, $\textbf{d}=\textbf{s}_{u,k}$ and $g(\textbf{d})=\textbf{z}_{u,k}=f^{1}_{\boldsymbol{\theta}^{1}}(\textbf{x}_{u,k})$ for FbFTL, while $\textbf{d}=\bigcup_{k=1}^{K}\textbf{s}_{u,k}$ and $g(\textbf{d})=\textbf{g}_{u}=\sum_{k=1}^{K}{\nabla_{\boldsymbol{\theta}}{L(f_{\boldsymbol{\theta}}|\textbf{s}_{u,k})}}$ for the gradient-based FedAvg frameworks. Similarly as in [89, 90, 91], we define the feature privacy through the following condition: $g$ satisfies $(\epsilon,\delta)$-DP if for any subset of possible outputs $G$ it holds that

\text{Pr}\{g(\textbf{d})\in G\}\leq e^{\epsilon}\text{Pr}\{g(\textbf{d}^{\prime})\in G\}+\delta, (37)

where d and $\textbf{d}^{\prime}$ differ in a single sample drawn from the true distribution $P_{y,\text{true}}$.

In [92], it has been shown via the moments accountant approach that adding Gaussian noise n to $g(\textbf{d})$ prior to transmission maintains an overall privacy loss of $(\epsilon,\delta)$. Typically, the variance of the noise depends on the maximum distance $\max{\|g(\textbf{d})-g(\textbf{d}^{\prime})\|}$ between any two adjacent inputs d and $\textbf{d}^{\prime}$. To provide a fair comparison between the different learning frameworks, we consider the maximum relative distance $R=\max\limits_{d,d^{\prime}}{\frac{\max\left|g(\textbf{d})-g(\textbf{d}^{\prime})\right|}{K\,\text{std}(g(\textbf{d}))}}$ within the training set for each framework, where $|\cdot|$ denotes the absolute value and $\text{std}(\cdot)$ denotes the standard deviation. Thus, the additive Gaussian noise to each output $g(\textbf{d}_{1})$ should be $\textbf{n}\sim\mathcal{N}(0,\,\sigma^{2}R^{2}\text{std}^{2}(g(\textbf{d}_{1}))\textbf{I})$ to mitigate the feature privacy leakage, and the lower bound on the $\sigma$ value can be determined via the moments accountant approach.

As a result, according to [92, Theorem 1], there exist constants $c_{1}$ and $c_{2}$ such that for any $\epsilon<c_{1}C^{2}I$, the training process with $g(\textbf{d})+\textbf{n}$ is $(\epsilon,\delta)$-differentially private for any $\delta>0$ if we choose $\sigma\geq\frac{c_{2}C}{\epsilon}\sqrt{-I\log{\delta}}$, where $I$ is the number of iterations ($I=1$ for FbFTL), and $C$ is the fraction of clients selected in each iteration. When the value of $\delta$ and the noise factor $\sigma R$ are fixed, $\epsilon$ for FedAvg DP decreases as the batch size $K$ grows, but increases as the number of training iterations $I$ (and, potentially, the number of retrainings for hyper-parameter tuning) increases. However, $\epsilon$ for FbFTL DP does not depend on these factors. Furthermore, the final performance also depends on the robustness of each framework against noise. Therefore, we demonstrate the comparisons via experiments in the following subsection.
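As an illustration of this noise calibration (our own sketch: the constant $c_{2}$ and the example values of $\epsilon$, $\delta$, $I$, $C$, and $R$ are placeholders, since tight constants come from the moments accountant rather than this closed-form bound):

```python
import numpy as np

def dp_noise_sigma(eps, delta, n_iter, C, c2=1.0):
    """Noise multiplier bound from [92, Theorem 1]:
    sigma >= (c2 * C / eps) * sqrt(I * log(1/delta)).
    c2 is a placeholder; a moments/Renyi accountant yields tight values."""
    return (c2 * C / eps) * np.sqrt(-n_iter * np.log(delta))

def privatize_upload(g_d, R, sigma, rng):
    """Add n ~ N(0, sigma^2 R^2 std(g(d))^2 I) to the upload g(d), where R
    is the maximum relative distance defined above."""
    scale = sigma * R * np.std(g_d)
    return g_d + rng.normal(0.0, scale, size=g_d.shape)

rng = np.random.default_rng(0)
z = rng.normal(size=4096)                                     # stand-in feature
sigma = dp_noise_sigma(eps=2.0, delta=1e-6, n_iter=1, C=0.1)  # I = 1 for FbFTL
z_private = privatize_upload(z, R=0.05, sigma=sigma, rng=rng)
```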

V-C Experimental Results

In our experiments, we utilize the CIFAR-10 dataset, which has $N=10$ types of labels and is well balanced (i.e., $P_{y,\text{true}}=P_{y,\text{uni}}$). We generate shuffled batches according to the uniform sample distribution $P_{y,\text{uni}}$ and evaluate the total label privacy leakage in different scenarios in Fig. 11. The blue curve plots the initial uncertainty $H(B\,|\,P_{y,\text{uni}})$ for different values of $K$ (where $K$ is the number of samples from each client). Each of the other curves shows the uncertainty given the shuffled data, $H(B\,|\,\mathcal{C}_{U}=c_{U})$, with a fixed total number of samples $UK$ from all clients, where the number of clients $U$ is determined by each value of $K$ accordingly. By Definition 1 and Definition 2, the total label privacy leakage given the shuffled data is $L_{t}^{l}=L^{l}_{s}+L^{l}_{q}$ and equals the difference $H(B\,|\,P_{y,\text{uni}})-H(B\,|\,\mathcal{C}_{U}=c_{U})$. We see that $H(B\,|\,P_{y,\text{uni}})\approx H(B\,|\,\mathcal{C}_{U}=c_{U})$ when $K$ is small, and thus the label privacy is well preserved. However, for a given value of the product $UK$, $H(B\,|\,\mathcal{C}_{U}=c_{U})$ diminishes quickly once a certain threshold on $K$ is exceeded. Therefore, the label privacy leakage becomes high when $K$ is large. As addressed before, the label privacy analysis applies in the same way to both FbFTL and the gradient-based FedAvg frameworks, and hence a smaller $K$ might be preferred in order to preserve label privacy. However, current FL privacy analyses typically focus on DP, which provides a relatively loose upper bound on the total local privacy. Within the DP setting, studies typically consider a small number of clients and assume that each client holds hundreds of samples, if not thousands, to achieve improved DP performance. However, as observed above, a setting with higher values of $K$ can lead to a higher label privacy loss (in addition to requiring more training epochs and achieving lower validation accuracy). This leads to the important conclusion that the balance between different types of privacy leakage needs to be considered carefully.

Figure 11: Label privacy of each client in bits against the number of samples $K$ from each client.

To compare the DP performances in terms of feature privacy leakage, we apply different noise levels $\sigma$ to the upload content $g(\textbf{d})$ of the different frameworks to achieve the same level $(\epsilon,\delta)$ of privacy, and compare the validation accuracy at convergence for different $K$ in Fig. 12. In our experiments, we implement the moments accountant via a Rényi divergence-based DP accountant to obtain a tighter bound. With fixed $\delta=10^{-6}$, each plot shows the validation accuracy at convergence as a function of $\epsilon$ for a given $K$. We note that a higher $\epsilon$ indicates weaker DP protection and a lower noise level $\sigma$, and thus higher accuracy (closer to the original performance without noise). As shown in Fig. 11, label privacy is better preserved at $K=4$, and we show the DP performance in the same setting in Fig. 12(a). While the gradient-based FedAvg frameworks fail completely, FbFTL performs well because it has a much smaller data size $|g(\textbf{d})|$ and is more robust against noise, as it receives constant data instead of a varying perturbation in each training iteration and converges to a sub-optimal model. This advantage of FbFTL persists until $K$ increases to 100, as shown in Fig. 12(b), where FTL that updates the task-specific sub-model outperforms FbFTL at $\epsilon\geq 4$, since they have the same data size $|g(\textbf{d})|$ while the noise level of FedAvg decreases as $K$ increases. However, the label privacy completely leaks for $K\geq 16$, as we have seen in Fig. 11. FedAvg with $K\geq 16$ prevails only in protecting the subset of feature privacy that is not mitigated via shuffling. In Fig. 12(c), we show the comparison for an even larger $K=600$, where all types of FedAvg perform better than FbFTL.

Figure 12: Comparison of validation accuracy at different $K$ and $\epsilon$, with $\delta=10^{-6}$, between FL, FTL updating the full model, FTL updating the task-specific sub-model, and FbFTL; panels (a), (b), and (c) correspond to K=4, K=100, and K=600, respectively. The accuracy is marked as 0 if the model fails to converge.

Note that it is extremely difficult to analytically evaluate the amount of feature privacy leakage that can be mitigated via shuffling. However, we can compare the performances of FbFTL and FedAvg for a given $K$ via experimental results, as done above. One key conclusion is the following: in terms of privacy preservation, FbFTL is preferred when $K$ is small (i.e., each client holds a small set of samples), and gradient-based FedAvg is preferred when $K$ is large.

VI Conclusion

In this paper, we have presented a novel communication-efficient federated transfer learning method. In the proposed feature-based federated transfer learning (FbFTL), the features and outputs are uploaded rather than the gradient updates. We have provided a thorough description of the system design and the learning algorithm, and compared its theoretical payload with that of federated learning and federated transfer learning. Our results demonstrate substantial reductions in both uplink and downlink payload when using FbFTL; in particular, our experiments show that FbFTL reduces the uplink payload by up to five orders of magnitude compared to existing methods. Subsequently, we have demonstrated that FbFTL with a small batch size has a significantly lower packet loss rate than the gradient-based frameworks, and illustrated its robustness against data insufficiency and quantization. Finally, we have considered different types of privacy leakage and analyzed mitigation approaches. Specifically, we have first analyzed label privacy leakage under both statistical knowledge and query access (resulting in access to the label outputs), and identified a condition under which the label privacy leakage vanishes. We have also addressed feature privacy and considered a DP mechanism for preserving it. We have shown that with small $K$ and shuffling that eliminates label privacy leakage, FbFTL also attains good differential feature privacy protection. These characterizations and results render FbFTL a communication-efficient, robust, and privacy-preserving federated transfer learning scheme.

References

  • [1] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv preprint arXiv:1610.02527, 2016.
  • [2] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., “Advances and open problems in federated learning,” arXiv preprint arXiv:1912.04977, 2019.
  • [3] J. Mills, J. Hu, and G. Min, “Communication-efficient federated learning for wireless edge intelligence in iot,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 5986–5994, 2019.
  • [4] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics.   PMLR, 2017, pp. 1273–1282.
  • [5] Y. Cheng, J. Lu, D. Niyato, B. Lyu, J. Kang, and S. Zhu, “Federated transfer learning with client selection for intrusion detection in mobile edge computing,” IEEE Communications Letters, vol. 26, no. 3, pp. 552–556, 2022.
  • [6] Q. Yang, Y. Liu, Y. Cheng, Y. Kang, T. Chen, and H. Yu, “Federated learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 13, no. 3, pp. 1–207, 2019.
  • [7] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farokhi, S. Jin, T. Q. Quek, and H. V. Poor, “Federated learning with differential privacy: Algorithms and performance analysis,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3454–3469, 2020.
  • [8] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for face recognition with under-represented data,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5704–5713.
  • [9] Y. Chen, X. Qin, J. Wang, C. Yu, and W. Gao, “Fedhealth: A federated transfer learning framework for wearable healthcare,” IEEE Intelligent Systems, vol. 35, no. 4, pp. 83–93, 2020.
  • [10] C. Ju, D. Gao, R. Mane, B. Tan, Y. Liu, and C. Guan, “Federated transfer learning for eeg signal classification,” in 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC).   IEEE, 2020, pp. 3040–3045.
  • [11] H. Yang, H. He, W. Zhang, and X. Cao, “Fedsteg: A federated transfer learning framework for secure image steganalysis,” IEEE Transactions on Network Science and Engineering, 2020.
  • [12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
  • [13] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” Journal of Big data, vol. 3, no. 1, pp. 1–40, 2016.
  • [14] C. Wang and S. Mahadevan, “Heterogeneous domain adaptation using manifold alignment,” in Twenty-second international joint conference on artificial intelligence, 2011.
  • [15] M. Kaya, G. Fidan, and I. H. Toroslu, “Transfer learning using twitter data for improving sentiment classification of turkish political news,” in Information sciences and systems 2013.   Springer, 2013, pp. 139–148.
  • [16] F. H. Khan, U. Qamar, and S. Bashir, “Enhanced cross-domain sentiment classification utilizing a multi-source transfer learning approach,” Soft Computing, vol. 23, no. 14, pp. 5431–5442, 2019.
  • [17] L. Duan, D. Xu, and I. Tsang, “Learning with augmented features for heterogeneous domain adaptation,” arXiv preprint arXiv:1206.4660, 2012.
  • [18] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in CVPR 2011.   IEEE, 2011, pp. 1785–1792.
  • [19] Y. Zhu, Y. Chen, Z. Lu, S. J. Pan, G.-R. Xue, Y. Yu, and Q. Yang, “Heterogeneous transfer learning for image classification,” in Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
  • [20] M. Shaha and M. Pawar, “Transfer learning for image classification,” in 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA).   IEEE, 2018, pp. 656–660.
  • [21] M. Hussain, J. J. Bird, and D. R. Faria, “A study on cnn transfer learning for image classification,” in UK Workshop on computational Intelligence.   Springer, 2018, pp. 191–202.
  • [22] D. Han, Q. Liu, and W. Fan, “A new image classification method using cnn transfer learning and web data augmentation,” Expert Systems with Applications, vol. 95, pp. 43–56, 2018.
  • [23] M. Harel and S. Mannor, “Learning from multiple outlooks,” arXiv preprint arXiv:1005.0027, 2010.
  • [24] J. Park, R. J. Javier, T. Moon, and Y. Kim, “Micro-doppler based classification of human aquatic activities via transfer learning of convolutional neural networks,” Sensors, vol. 16, no. 12, p. 1990, 2016.
  • [25] N. Agarwal, A. Sondhi, K. Chopra, and G. Singh, “Transfer learning: Survey and classification,” in Smart Innovations in Communication and Computational Sciences.   Springer, 2021, pp. 145–155.
  • [26] Y. Ma, G. Luo, X. Zeng, and A. Chen, “Transfer learning for cross-company software defect prediction,” Information and Software Technology, vol. 54, no. 3, pp. 248–256, 2012.
  • [27] J. Nam, W. Fu, S. Kim, T. Menzies, and L. Tan, “Heterogeneous defect prediction,” IEEE Transactions on Software Engineering, vol. 44, no. 9, pp. 874–896, 2017.
  • [28] P. Prettenhofer and B. Stein, “Cross-language text classification using structural correspondence learning,” in Proceedings of the 48th annual meeting of the association for computational linguistics, 2010, pp. 1118–1127.
  • [29] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan, “Heterogeneous domain adaptation for multiple classes,” in Artificial intelligence and statistics.   PMLR, 2014, pp. 1095–1103.
  • [30] J. T. Zhou, S. J. Pan, I. W. Tsang, and Y. Yan, “Hybrid heterogeneous transfer learning through deep learning,” in Twenty-eighth AAAI conference on artificial intelligence, 2014.
  • [31] M. Chen, D. Gündüz, K. Huang, W. Saad, M. Bennis, A. V. Feljan, and H. V. Poor, “Distributed learning in wireless networks: Recent progress and future challenges,” arXiv preprint arXiv:2104.02151, 2021.
  • [32] Y. Yuan, X. Chen, X. Chen, and J. Wang, “Segmentation transformer: Object-contextual representations for semantic segmentation,” in European Conference on Computer Vision (ECCV), vol. 1, 2021.
  • [33] J. Jain, A. Singh, N. Orlov, Z. Huang, J. Li, S. Walton, and H. Shi, “Semask: Semantically masked transformers for semantic segmentation,” arXiv preprint arXiv:2112.12782, 2021.
  • [34] Z. Dai, H. Liu, Q. V. Le, and M. Tan, “Coatnet: Marrying convolution and attention for all data sizes,” arXiv preprint arXiv:2106.04803, 2021.
  • [35] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur, “Sharpness-aware minimization for efficiently improving generalization,” arXiv preprint arXiv:2010.01412, 2020.
  • [36] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong et al., “Swin transformer v2: Scaling up capacity and resolution,” arXiv preprint arXiv:2111.09883, 2021.
  • [37] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, “Simple copy-paste is a strong data augmentation method for instance segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2918–2928.
  • [38] I. Yamada, A. Asai, H. Shindo, H. Takeda, and Y. Matsumoto, “Luke: deep contextualized entity representations with entity-aware self-attention,” arXiv preprint arXiv:2010.01057, 2020.
  • [39] A. Srivastava, D. Jha, S. Chanda, U. Pal, H. D. Johansen, D. Johansen, M. A. Riegler, S. Ali, and P. Halvorsen, “Msrf-net: A multi-scale residual fusion network for biomedical image segmentation,” arXiv preprint arXiv:2105.07451, 2021.
  • [40] Y. Zhang, J. Qin, D. S. Park, W. Han, C.-C. Chiu, R. Pang, Q. V. Le, and Y. Wu, “Pushing the limits of semi-supervised learning for automatic speech recognition,” arXiv preprint arXiv:2010.10504, 2020.
  • [41] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [42] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [43] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009, citeseer.
  • [44] M. Series, “Minimum requirements related to technical performance for imt-2020 radio interface (s),” Report, pp. 2410–0, 2017.
  • [45] H. Sun, S. Li, F. R. Yu, Q. Qi, J. Wang, and J. Liao, “Toward communication-efficient federated learning in the internet of things with edge computing,” IEEE Internet of Things Journal, vol. 7, no. 11, pp. 11 053–11 067, 2020.
  • [46] E. Ozfatura, K. Ozfatura, and D. Gündüz, “Time-correlated sparsification for communication-efficient federated learning,” in 2021 IEEE International Symposium on Information Theory (ISIT).   IEEE, 2021, pp. 461–466.
  • [47] Y. He, H.-P. Wang, and M. Fritz, “Cossgd: Communication-efficient federated learning with a simple cosine-based quantization,” in 1st NeurIPS Workshop on New Frontiers in Federated Learning (NFFL), 2021.
  • [48] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
  • [49] D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [50] P. Richtárik, I. Sokolov, and I. Fatkhullin, “Ef21: A new, simpler, theoretically better, and practically faster error feedback,” Advances in Neural Information Processing Systems, vol. 34, pp. 4384–4396, 2021.
  • [51] D. Li and J. Wang, “Fedmd: Heterogenous federated learning via model distillation,” arXiv preprint arXiv:1910.03581, 2019.
  • [52] T. Lin, L. Kong, S. U. Stich, and M. Jaggi, “Ensemble distillation for robust model fusion in federated learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 2351–2363, 2020.
  • [53] J.-H. Ahn, O. Simeone, and J. Kang, “Wireless federated distillation for distributed edge learning with heterogeneous data,” in 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC).   IEEE, 2019, pp. 1–6.
  • [54] Y. Jiang, S. Wang, V. Valls, B. J. Ko, W.-H. Lee, K. K. Leung, and L. Tassiulas, “Model pruning enables efficient federated learning on edge devices,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [55] J. Liu, S. Tripathi, U. Kurup, and M. Shah, “Pruning algorithms to accelerate convolutional neural networks for edge applications: A survey,” arXiv preprint arXiv:2005.04275, 2020.
  • [56] H. Sidahmed, Z. Xu, A. Garg, Y. Cao, and M. Chen, “Efficient and private federated learning with partially trainable networks,” arXiv preprint arXiv:2110.03450, 2021.
  • [57] T.-J. Yang, D. Guliani, F. Beaufays, and G. Motta, “Partial variable training for efficient on-device federated learning,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 4348–4352.
  • [58] O. Aygün, M. Kazemi, D. Gündüz, and T. M. Duman, “Hierarchical over-the-air federated edge learning,” in ICC 2022-IEEE International Conference on Communications.   IEEE, 2022, pp. 3376–3381.
  • [59] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2022–2035, 2020.
  • [60] Y. Sun, S. Zhou, Z. Niu, and D. Gündüz, “Time-correlated sparsification for efficient over-the-air model aggregation in wireless federated learning,” arXiv preprint arXiv:2202.08420, 2022.
  • [61] O. Shahid, S. Pouriyeh, R. M. Parizi, Q. Z. Sheng, G. Srivastava, and L. Zhao, “Communication efficiency in federated learning: Achievements and challenges,” arXiv preprint arXiv:2107.10996, 2021.
  • [62] H. Xu, C.-Y. Ho, A. M. Abdelmoniem, A. Dutta, E. H. Bergou, K. Karatsenidis, M. Canini, and P. Kalnis, “Compressed communication for distributed deep learning: Survey and quantitative evaluation,” Tech. Rep., 2020.
  • [63] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2020, pp. 2021–2031.
  • [64] N. Shlezinger, M. Chen, Y. C. Eldar, H. V. Poor, and S. Cui, “Federated learning with quantization constraints,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 8851–8855.
  • [65] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
  • [66] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [67] C. Thapa, P. C. M. Arachchige, S. Camtepe, and L. Sun, “Splitfed: When federated learning meets split learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, pp. 8485–8493.
  • [68] A. Fallah, A. Mokhtari, and A. Ozdaglar, “Personalized federated learning: A meta-learning approach,” arXiv preprint arXiv:2002.07948, 2020.
  • [69] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint arXiv:1806.00582, 2018.
  • [70] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
  • [71] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [72] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS 2017 Workshop on Autodiff, 2017. [Online]. Available: https://openreview.net/forum?id=BJJsrmfCZ
  • [73] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [74] B. Gliwa, I. Mochol, M. Biesek, and A. Wawer, “SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization,” in Proceedings of the 2nd Workshop on New Frontiers in Summarization.   Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 70–79. [Online]. Available: https://www.aclweb.org/anthology/D19-5409
  • [75] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  • [76] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Huggingface’s transformers: State-of-the-art natural language processing,” 2020.
  • [77] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [78] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd with memory,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [79] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signsgd: Compressed optimisation for non-convex problems,” in International Conference on Machine Learning.   PMLR, 2018, pp. 560–569.
  • [80] A. Wainakh, F. Ventola, T. Müßig, J. Keim, C. G. Cordero, E. Zimmer, T. Grube, K. Kersting, and M. Mühlhäuser, “User-level label leakage from gradients in federated learning,” Proceedings on Privacy Enhancing Technologies, vol. 2022, no. 2, pp. 227–244, 2022.
  • [81] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [82] J. Jeon, K. Lee, S. Oh, J. Ok et al., “Gradient inversion with generative image prior,” Advances in Neural Information Processing Systems, vol. 34, pp. 29 898–29 908, 2021.
  • [83] Y. Huang, S. Gupta, Z. Song, K. Li, and S. Arora, “Evaluating gradient inversion attacks and defenses in federated learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 7232–7241, 2021.
  • [84] L. Longpré, V. Kreinovich, and T. Dumrongpokaphan, “Entropy as a measure of average loss of privacy,” Thai Journal of Mathematics, pp. 7–15, 2017.
  • [85] O. Gupta and R. Raskar, “Distributed learning of deep neural network over multiple agents,” Journal of Network and Computer Applications, vol. 116, pp. 1–8, 2018.
  • [86] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar, “Split learning for health: Distributed deep learning without sharing raw patient data,” arXiv preprint arXiv:1812.00564, 2018.
  • [87] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “Inverting gradients-how easy is it to break privacy in federated learning?” Advances in Neural Information Processing Systems, vol. 33, pp. 16 937–16 947, 2020.
  • [88] B. Zhao, K. R. Mopuri, and H. Bilen, “idlg: Improved deep leakage from gradients,” arXiv preprint arXiv:2001.02610, 2020.
  • [89] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference.   Springer, 2006, pp. 265–284.
  • [90] C. Dwork, “A firm foundation for private data analysis,” Communications of the ACM, vol. 54, no. 1, pp. 86–95, 2011.
  • [91] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy.” Found. Trends Theor. Comput. Sci., vol. 9, no. 3-4, pp. 211–407, 2014.
  • [92] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 2016, pp. 308–318.