This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

Jianyu Wang§,  Hang Qi, Ankit Singh Rawat, Sashank Reddi,
Sagar Waghmare, Felix X. Yu, Gauri Joshi§

§Carnegie Mellon University, Google Research
Work performed while doing an internship at Google Research
Abstract

In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a small part of the model is stored and trained on clients while the remaining large part of the model only stays at the servers. However, the model partitioning employed in split learning introduces a significant amount of communication cost. This paper addresses this issue by compressing the additional communication using a novel clustering scheme accompanied by a gradient correction method. Extensive empirical evaluations on image and text benchmarks show that the proposed method can achieve up to 490×490\times communication cost reduction with minimal drop in accuracy, and enables a desirable performance vs. communication trade-off.

1 Introduction

Federated learning (FL) is an emerging field that collaboratively trains machine learning models on decentralized data (Li et al., 2019; Kairouz et al., 2019; Wang et al., 2021). One major advantage of FL is that it does not require clients to upload their data which may contain sensitive personal information. Instead, clients separately train local models on their private datasets, and the resulting locally trained model parameters are infrequently synchronized with the help of a coordinating server (McMahan et al., 2017). While the FL framework helps alleviate data-privacy concerns for distributed training, most of existing FL algorithms critically assume that the clients have enough compute and storage resources to perform local updates on the entire machine learning model. However, this assumption does not necessarily hold in many modern applications. For example, classification problems with an extremely large number of classes (often in millions and billions) commonly arise in the context of recommender systems (Covington et al., 2016), and language modeling (Levy & Goldberg, 2014). Here, the classification layer of a neural network itself is large enough such that a typical FL client (such as a mobile device) cannot even store and locally update this single layer, let alone the entire neural network.

In order to overcome the above limitations of FL, split learning (SL) technique is adopted in (Vepakomma et al., 2018; Thapa et al., 2020). Specifically, SL splits the underlying model between the clients and server such that the first few layers are shared across the clients and the server, while the remaining layers are only stored at the server. The reduction of resource requirement at the clients is particularly pronounced when the last few dense layers constitute a large portion of the entire model. For instance, in an example convolutional neural network (Krizhevsky, 2014), the last two fully connected layers contribute to 95%95\% parameters of the entire model. In this case, if we could allocate the last two layers to the server, then the client-side memory usage can be directly reduced by a factor of 2020.

Limitations of SL.

One major limitation of SL is that the underlying model partitioning leads to an increased communication cost for the resulting framework. In order to train the split neural network, the activations and gradients at the layer where the model is split (referred to as cut layer) need to be communicated between the server and clients at each iteration. The additional message size is in proportion to the mini-batch size as well as the cut layer size. As a result, the communication cost for the model training can become prohibitive whenever the mini-batch size and the activation size of the cut layer are large (e.g. in one of our experiments, the additional message size can be 1010 times larger than that of the client-side model).

Refer to caption
Figure 1: Overview of the proposed algorithm. In order to reduce the additional communication associated with split learning, we propose to cluster similar client activations and only send the cluster centroids to the server. This is equivalent to adding a vector quantization layer in split neural network training. The success of our method relies on a novel variant of product quantizer and a gradient correction technique. We refer the reader to Section 4 for further details.

Contributions.

In this paper, we aim to make the split neural network training communication-efficient and enable its widespread adoption for FL in resource-constrained settings. Our proposed solution is based on the critical observation that, given a mini-batch of data, the client does not need to communicate per-example activation vectors if the activation vectors (at the cut layer) for different examples in the mini-batch exhibit enough similarity. Thus, we propose a training framework that performs clustering of the activation vectors and only communicates the cluster centroids to the server. Interestingly, this is equivalent to adding a vector quantization layer in the middle of the split neural network (see Figure 1 for an overview of our proposed method). Our main contributions are as follows.

  • We propose a communication-efficient splitting-based FL approach for resource-constrained clients (cf. Section 4), which improves previous SL solutions. The approach employs a novel clustering scheme that leverages product quantization to effectively compress the activations communicated between the clients and server.

  • After applying the activation quantization, the clients can only receive possibly noisy gradients from the server. The inaccurate client-side model updates lead to significant accuracy drops in our experiments. In order to mitigate this problem, we propose a gradient correction scheme, which plays a critical role in achieving a high compression ratio with minimal accuracy loss.

  • We empirically evaluate the performance of our approach on three standard FL datasets (cf. Section 5). We show that our approach allows for up to 490×490\times communication reduction without significant accuracy loss.

  • In addition, we present a convergence analysis for the proposed method (cf. Section 4.3), which reveals a trade-off between communication reduction and convergence speed. The analysis further helps explain why the proposed gradient correction technique is beneficial.

It is also worth mentioning that the proposed method does not expose any additional client-side information to the server than vanilla split neural network training approach. Similar to SplitFed (Thapa et al., 2020), our method can also leverage existing privacy preserving mechanisms such as differential privacy (Wei et al., 2020) or instance-hiding schemes (Huang et al., 2020) to provide formal privacy guarantees, which are out of the scope of this paper. Besides, our proposed method has potential applications beyond FL with resource-constrained clients as it can be applied in any learning framework that can benefit from a vector quantization layer. For instance, it can be used to reduce the communication overhead in two-party vertical FL (Romanini et al., 2021), where the model is naturally split across two institutions and the data labels are generated on the server.

2 Background and Related Works

Federated Learning (FL).

Suppose we have a collection of MM clients ={1,2,,M}{\mathcal{I}}=\{1,2,\dots,M\}. Each client ii\in{\mathcal{I}} has a local dataset 𝔻i{\mathbb{D}}_{i} and a corresponding empirical loss function Fi(𝒘)=ξ𝔻if(𝒘;ξ)/niF_{i}({\bm{w}})=\sum_{\xi\in{\mathbb{D}}_{i}}f({\bm{w}};\xi)/n_{i}, where 𝒘{\bm{w}} denotes the model parameters, and ni=|𝔻i|n_{i}=|{\mathbb{D}}_{i}| denotes the number of samples of the local dataset. The goal of FL is to find a shared model 𝒘{\bm{w}} that can minimize the averaged loss over all clients, defined as follows:

F(𝒘)=i=1MpiFi(𝒘)\displaystyle F({\bm{w}})=\sum_{i=1}^{M}p_{i}F_{i}({\bm{w}}) (1)

where pi=ni/i=1Mnip_{i}=n_{i}/\sum_{i=1}^{M}n_{i} is the relative weight of local loss FiF_{i}. Motivated by the data privacy concerns, under a FL framework, the clients perform local training and only communicate the resulting models to a coordinating server as opposed to sharing their raw local data with the server (McMahan et al., 2017). In current FL algorithms, the model size is limited by the resource constraints of clients. Consequently, these algorithms are not feasible when dealing with large machine learning models that cannot fit in clients memory or require large compute.

Large Model Training in FL.

To deploy large machine learning models on resource-constrained clients, a few recent works proposed methods to reduce the effective model size on clients. For example, Diao et al. (2020) varied the layer width on clients depending on their computational capacities; and Horvath et al. (2021) proposed ordered dropout to randomly removing neurons in the neural network. These methods are effective in reducing the width of the intermediate layers; however, they still place the full classification layer (which is in proportion to the size of the output space) on each client to compute the the client update. This may not be feasible for many clients as the parameters in the classification layer of a model can dominate the total number of model parameters (Krizhevsky, 2014).

Gradient Compression.

There are a large amount of literature studying gradient compression in distributed training, such as QSGD (Alistarh et al., 2017), SignSGD (Bernstein et al., 2018), Atomo (Wang et al., 2018), PowerSGD (Vogels et al., 2019), etc. Although these works share the same goal as our work (i.e., reducing communciation by compressing message size), we focus on orthogonal aspects. In particular, all these previous gradient compression methods only compress one gradient vector using the inherent properties of the stochastic gradient (e.g., sparsity or low rankness). However, our method considers compressing a batch of activation vectors and utilizes the similarity among the activations. This is a new dimension that does not exist in distributed training scenarios and has not been studied in gradient compression literature.

Split Learning (SL).

Split learning (SL) is another way of minimizing 1 without explicitly sharing local data on clients (Vepakomma et al., 2018). In particular, SL splits the neural network model into two parts on a layer basis. The first few layers (called client-side model parameterized by 𝒘c{\bm{w}}_{\text{c}}) are shared across clients and the server, while the remaining layers (called server-side model parameterized by 𝒘s{\bm{w}}_{\text{s}}) are only stored and trained on the server. Under this splitting, the original loss function for a data sample ξ\xi can be re-written as follows:

f(𝒘;ξ)=h(𝒘s;u(𝒘c;ξ)),ξ𝔻i,i\displaystyle f({\bm{w}};\xi)=h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi)),\quad\forall\xi\in{\mathbb{D}}_{i},\forall i\in{\mathcal{I}} (2)

where uu is the client-side function mapping the input data to the activation space and hh is the server-side function mapping the activation to a scalar loss value. Training both the client- and server-side models requires communicating client activations (i.e., the output of the client-side model, also called smashed data in some literature) between the clients and the server. We further elaborate on a concrete SL algorithm as a baseline in Section 3.

Communication-efficient SL.

Recently, He et al. (2020); Han et al. (2021) proposed to add a classification layer on the client-side model in SL so that each client can locally update its parameters. This can effectively reduce the communication frequency between the clients and the server. However, this kind of method is not suitable for the settings where the classification layer dominates the entire model size such that the clients may not have enough memory space to store the entire classification layer.

Product Quantization.

Product quantization (PQ) is a compression method and has been widely used in approximate nearest neighbor search (Jegou et al., 2010; Ge et al., 2013b). Given a batch of vectors, PQ divides each vector into multiple chunks and performs K-means clustering separately on each chunk. Now, the resulting cluster centroids construct the codebook and the closest codeword will be used to represent each vector. Instead of directly computing the distance between two high-dimensional vectors, computing the distance between two codewords is orders of magnitude more efficient and faster. Note that the previous works used PQ to compress either learned features after training (see, e.g., Ge et al., 2013a) or the parameters of a neural network (see, e.g., Chen et al., 2020). In this paper, we show that PQ is also effective in compressing the output of intermediate layers and present a simple technique to enable backpropagation through the corresponding quantization layer.

3 Baseline: the SplitFed Algorithm

In this section, we review a representative split learning algorithm, namely SplitFed (Thapa et al., 2020), that provides a baseline FL training approach in resource-constrained settings.

Introduction to SplitFed.

Each iteration of SplitFed contains four steps detailed as follows. Wherever it simplifies the presentation, we assume that the mini-batch size BB on each client is 11 but the derivations hold true for arbitrary batch sizes.

  • 1.

    Client Forward Pass: Let 𝒮{\mathcal{S}} be a randomly selected subset of clients. For i𝒮i\in{\mathcal{S}}, compute the output of the client-side model 𝒛i=u(𝒘c;ξ){\bm{z}}_{i}=u({\bm{w}}_{\text{c}};\xi), where ξ𝔻i\xi\in{\mathbb{D}}_{i} is a randomly chosen training sample, and 𝒛id{\bm{z}}_{i}\in\mathbb{R}^{d} denotes the activations of the last layer of the client-side model. Then, each selected client sends 𝒛i{\bm{z}}_{i} (together with its corresponding label if necessary111In vertical FL applications (Romanini et al., 2021), the labels do not need to be transferred.) to the server.

  • 2.

    Server Update: The server treats all the activations (also referred to as smashed data) {𝒛i}i𝒮\{{\bm{z}}_{i}\}_{i\in{\mathcal{S}}} as inputs to perform one step of gradient descent on the server-side model: Δ𝒘s=ηsi𝒮pi𝒘sh(𝒘s;𝒛i)/i𝒮pi\Delta{\bm{w}}_{\text{s}}=\eta_{\text{s}}\sum_{i\in{\mathcal{S}}}p_{i}\nabla_{{\bm{w}}_{\text{s}}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})/\sum_{i\in{\mathcal{S}}}p_{i}, where ηs\eta_{\text{s}} denotes the server learning rate.222The server learning rate here has a different definition from previous FL literature, e.g., Reddi et al. (2020). In addition, the server also computes the gradient with respect to the activations 𝒛ih(𝒘s;𝒛i)\nabla_{{\bm{z}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i}) and sends it back to the corresponding client ii.

  • 3.

    Client Backward Pass: Each client computes the gradient with respect to the client-side model using the chain rule: 𝒈i𝒘cf=𝒛ih(𝒘s;𝒛i)𝒘cu(𝒘c;ξ){\bm{g}}_{i}\triangleq\nabla_{{\bm{w}}_{\text{c}}}f=\nabla_{{\bm{z}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})\nabla_{{\bm{w}}_{\text{c}}}u({\bm{w}}_{\text{c}};\xi).

  • 4.

    Client-side Model Synchronization: At last, the client-side model is updated by synchronizing client gradients Δ𝒘c=ηci𝒮pi𝒈i/i𝒮pi\Delta{\bm{w}}_{\text{c}}=\eta_{\text{c}}\sum_{i\in{\mathcal{S}}}p_{i}{\bm{g}}_{i}/\sum_{i\in{\mathcal{S}}}p_{i}, where ηc\eta_{\text{c}} denotes the learning rate for the client-side model.

One can easily validate that, with the above four steps, SplitFed is equivalent to mini-batch stochastic gradient descent (SGD) with a total batch size of B|𝒮|B|{\mathcal{S}}|. No matter how the network is split, the performance of the algorithm remains unchanged and has the same iteration complexity as mini-batch SGD.

Besides, note that, a selected client needs to upload both the activations during “Client Forward Pass” and the gradients of the client-side model during “Client-side Model Synchronization”. Thus, the communication size per client per iteration can be written as |𝒘c|+Bd|{\bm{w}}_{\text{c}}|+Bd. Since the up-link communication bandwidth is quite limited for clients in federated learning, when the mini-batch size BB or the dimension of the activation dd gets large, the up-link communication may become the main bottleneck that limits the scalability of SplitFed. It is critical to compress the uploaded message from clients to the server (Reisizadeh et al., 2019; Li et al., 2019).

Comparison with FedAvg.

In Table 1, we further compare the difference between FedAvg and SplitFed. One can observe that SplitFed always requires less computation and memory on clients, and could potentially communicate less than FedAvg. In terms of convergence rate, it is still under discussion which algorithm is better Woodworth et al. (2020). Note that the goal of this paper is to show that we can effectively reduce the additional communication in SplitFed rather than showing SplitFed is better than FedAvg.

Algorithm Batch size Total compute Client-side compute Communication cost
FedAvg B/HB/H 𝒪(B|𝒘|){\mathcal{O}}(B|{\bm{w}}|) 𝒪(B|𝒘|){\mathcal{O}}(B|{\bm{w}}|) |𝒘||{\bm{w}}|
SplitFed B/HB/H 𝒪(B|𝒘|/H){\mathcal{O}}(B|{\bm{w}}|/H) 𝒪(B|𝒘c|/H){\mathcal{O}}(B|{\bm{w}}_{\text{c}}|/H) Bd/H+|𝒘c|Bd/H+|{\bm{w}}_{\text{c}}|
SplitFed BB 𝒪(B|𝒘|){\mathcal{O}}(B|{\bm{w}}|) 𝒪(B|𝒘c|){\mathcal{O}}(B|{\bm{w}}_{\text{c}}|) Bd+|𝒘c|Bd+|{\bm{w}}_{\text{c}}|
Table 1: Comparisons between FL (FedAvg) and SL (SplitFed). In the table, HH denotes the number of local steps in FedAvg. No matter of using which batch sizes, SL-based methods always require less computaion and memory on clients and potentially reduce the communication cost per round, depending on the batch size BB and activation size dd.

4 Proposed Method: FedLite

We now introduce our proposed method, FEDerated spLIT learning with vEctor quantization (FedLite), which can drastically reduce the up-link communication cost on clients in the SplitFed algorithm.

Overview.

The key idea of our method is to compress the redundant information in the client activations for each mini-batch. For instance, if we assume that all the activation vectors within a mini-batch are nearly identical to each other, then instead of sending all BB activation vectors, the client needs to send only one representative vector to the server, thereby, reducing the communication cost by a factor of BB.

Building on the above observation, we let each client input their activations into a clustering algorithm to get LL centroids. Then, each activation vector is represented by the index of its closest centroid. Instead of sending a mini-batch of BB vectors, only LL centroids and the cluster assignments are transmitted to the server. This procedure is equivalent to vector quantization. Here, a simple choice of the clustering algorithm is K-means (Krishna & Murty, 1999). However, as we observe in Figure 3, vanilla K-means clustering leads to a high quantization error with very limited compression ratio. It is non-trivial to design a proper clustering algorithm that can minimize the communication between clients and server, while maintaining a low quantization error. In Section 4.1, we design a novel variant of product quantizer that can flexibly operate this trade-off.

While the above approach is appealing, it also introduces new challenges. First, note that, in the backward pass, due to the additional quantization layer, the server no longer knows the true client activations; thus, constraining the server to compute the gradients with respect to the (possibly noisy) quantized activations. Furthermore, in order to update the client-side model parameters, the clients need to receive the gradients with respect to the original activations, which is not available any more. In Section 4.2, we introduce a gradient correction technique to mitigate this gradient mismatch, which is critical for our method to achieve a high compression ratio with minimal accuracy loss.

In the following sections, we present the key components of our proposed method and highlight how they address the aforementioned challenges. In addition, a convergence analysis is provided in Section 4.3. Unless otherwise stated, for the ease of exposition, we will focus on the operations on a specific client ii and omit the client index. However, it is important to keep in mind that the client-side operations happen in parallel on all selected clients.

Refer to caption
Figure 2: Illustration of our proposed quantizer. Given a mini-batch of activations 𝒁d×B{\bm{Z}}\in\mathbb{R}^{d\times B}, there are three steps: (i) divide each activation vector into qq subvectors; (ii) stack subvectors into RR groups based on their indices; (iii) perform K-means clustering to get LL centroids for each group. Each subvector can be represented by the index of the closest centroid in its corresponding group. On the server, by simply rearranging centroids, we can get the quantized activations 𝒁~\widetilde{{\bm{Z}}}.

4.1 Forward Pass: Compressing Activations with Product Quantization

In this subsection, we first present our proposed qunatizer and then elaborate on its advantages. A visual illustration is provided in Figure 2.

The Proposed Compression Scheme.

Suppose a client has computed a batch of activations 𝒁=[𝒛(1),,𝒛(B)]d×B{\bm{Z}}=[{\bm{z}}^{(1)},\dots,{\bm{z}}^{(B)}]\in\mathbb{R}^{d\times B} using the client-side model 𝒘c{\bm{w}}_{\text{c}}, where dd is the dimension of each activation and BB denotes the mini-batch size on each client. Our proposed quantizer first divides each activation vector 𝒛(j),j[1,B]{\bm{z}}^{(j)},j\in[1,B] into qq subvectors with equal dimension d/qd/q. We denote the ss-th subvector of 𝒛(j){\bm{z}}^{(j)} as 𝒛(j,s)d/q{\bm{z}}^{(j,s)}\in\mathbb{R}^{d/q}. Then, the quantizer stacks all subvectors into RR groups based on their indices. For example, the first group 𝒢(1){\mathcal{G}}^{(1)} contains all subvectors with indices (j,1),(j,2),(j,q/R),j[1,B](j,1),(j,2),\dots(j,q/R),\forall j\in[1,B]; the second group 𝒢(2){\mathcal{G}}^{(2)} contains all subvectors with indices (j,q/R+1),(j,q/R+2),,(j,2q/R),j[1,B](j,q/R+1),(j,q/R+2),\dots,(j,2q/R),\forall j\in[1,B], and so forth. At last, K-means clustering is performed on subvectors within each group 𝒢(r),r[1,R]{\mathcal{G}}^{(r)},\forall r\in[1,R] to find LL cluster centroids 𝒞(r)={𝒄(r,1),,𝒄(r,L)}d/q{\mathcal{C}}^{(r)}=\{{\bm{c}}^{(r,1)},\dots,{\bm{c}}^{(r,L)}\}\subset\mathbb{R}^{d/q}. Now each subvector 𝒛(j,s){\bm{z}}^{(j,s)} can be approximated by its closest centroid in the corresponding group. Formally, if 𝒛(j,s)𝒢(r){\bm{z}}^{(j,s)}\in{\mathcal{G}}^{(r)}, then

𝒛~(j,s)\displaystyle\widetilde{{\bm{z}}}^{(j,s)} 𝒄(r,l(j,s)),\displaystyle\triangleq{\bm{c}}^{(r,l_{*}^{(j,s)})}, (3)
l(j,s)\displaystyle l_{*}^{(j,s)} =argminl[1,L]𝒛(j,s)𝒄(r,l)2\displaystyle=\operatorname*{arg\,min}_{l\in[1,L]}\|{\bm{z}}^{(j,s)}-{\bm{c}}^{(r,l)}\|^{2} (4)

In the above equation, 𝒛~(j,s)\widetilde{{{\bm{z}}}}^{(j,s)} represents the quantized version of 𝒛(j,s){\bm{z}}^{(j,s)}. By concatenating all quantized subvectors, server can get the quantized activations 𝒁~\widetilde{{\bm{Z}}} and use it as input to update the server-side model. In order to obtain the quantized subvectors, each client needs to send all cluster centroids {𝒞(r)}r[1,R]\{{\mathcal{C}}^{(r)}\}_{r\in[1,R]} (referred to as the codebook) and the cluster assignments {l(j,s)}j[1,B],s[1,q]\{l_{*}^{(j,s)}\}_{j\in[1,B],s\in[1,q]} (referred to as the codewords) to the server. Assuming that each floating-point number occupies ϕ\phi bits333When computing the compression ratio in this paper, we always assume ϕ=64\phi=64., the transmitted message size per client is reduced to ϕdRL/q+Bqlog2L\phi dRL/q+Bq\log_{2}L from ϕdB\phi dB.

The Benefit of Subvector Division: Reduced Quantization Error.

The first key component of our proposed quantizer is subvector division, i.e., setting the number of subvectors qq to be greater than 11. When q=1q=1, the quantizer reduces to vanilla K-means. In this case, the codebook size is ϕdL\phi dL and each activation vector have LL possible choices. When we set q=R>1q=R>1, the codebook size is still ϕdL\phi dL. However, each activation vector now has LqL^{q} possible choices, as each of qq subvectors has LL choices. Thus, the number of quantization levels becomes exponentially larger without any increase in memory usage and computation complexity. As illustrated by the green lines in Figure 3, having more quantization levels or effective centroid choices can significantly lower the quantization error.

The Benefit of Subvector Grouping: Improved Compression Ratio.

By using subvector division, although the quantization error can be reduced, the codebook size ϕdL\phi dL is still pretty large. This is because subvectors within one vector are quantized separately using different codebooks. To further reduce the communication size, we propose subvector grouping and force subvectors within one group to share the same codebook. As a result, the total codebook size can be reduced to ϕdLR/q\phi dLR/q. When qR1q\gg R\geq 1, there can be an order of magnitude increase in the compression ratio, as illustrated by the red lines in Figure 3. One can also observe that, by changing the value of RR, our proposed quantizer (red lines) achieves a much better error-versus-compression trade-off than K-means and vanilla production quantization scheme.

Refer to caption
Figure 3: Quantization error (lower is better) versus compression ratio (larger is better). Our proposed quantizer achieves a better quantization error vs. compression ratio trade-off as compared to vanilla product quantization (PQ) and K-means. In this example, the activation size is d=9216d=9216, and the mini-batch size is B=20B=20. The activations are trained via a two-layer CNN on Federated EMNIST. In each curve, we vary the number of clusters LL. For green curves, the number of subvectors qq takes value in {288,1152,4608}\{288,1152,4608\}; for red curves, the number of groups RR takes value in {2304,1152,384,1}\{2304,1152,384,1\}, and qq is fixed as 46084608.

Why not Reuse the Codebooks from Previous Iterations?

In our proposed scheme, the codebook is reconstructed at each iteration on clients. This is necessary, since in FL, only few stateless clients are selected to participate training in at each iteration and they have non-IID data distributions (Kairouz et al., 2019). Previous codebooks can be stale and other clients’ codebooks may not be suitable.

4.2 Backward Pass: Gradient Correction

As mentioned at the beginning of Section 4, there is a gradient mismatch problem in the client backward pass. While client ii needs 𝒛ih(𝒘s;𝒛i)\nabla_{{\bm{z}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i}) to update the client-side models, it can only receive 𝒛~ih(𝒘s;𝒛~i)\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i}) from the server. A naive solution is to treat 𝒛~ih\nabla_{\widetilde{{\bm{z}}}_{i}}h as an approximation of 𝒛ih\nabla_{{\bm{z}}_{i}}h and update the client-side model by 𝒛~ih(𝒘s;𝒛~i)𝒘cu(𝒘c;ξ)=(f/𝒛~i)(𝒛i/𝒘c)\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})\nabla_{{\bm{w}}_{\text{c}}}u({\bm{w}}_{\text{c}};\xi)=(\partial f/\partial\widetilde{{\bm{z}}}_{i})(\partial{\bm{z}}_{i}/\partial{\bm{w}}_{\text{c}}). However, due to gradient mismatch, this approach can lead to a significant performance drop, as observed in our experiments (cf. Section 5). In order to address this issue, we propose a gradient correction technique. In particular, we approximate the gradient 𝒛ih(𝒘s;𝒛i)\nabla_{{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i}) by its first-order Taylor series expansion: 𝒛~ih(𝒘s;𝒛~i)+𝒛~i2h(𝒘s;𝒛~i)(𝒛i𝒛~i)\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})+\nabla^{2}_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})\cdot({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}). While the higher-order derivative may be expensive to compute, we use a scalar parameter λ𝑰>0\lambda{\bm{I}}>0 to replace 𝒛~i2h(𝒘s;𝒛~)i\nabla^{2}_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}})_{i}. Consequently, the gradient of the client-side model is defined as follows:

𝒈~i[h(𝒘s;𝒛~i)𝒛~i+λ(𝒛i𝒛~i)]u(𝒘c;ξ)𝒘c.\displaystyle\widetilde{{\bm{g}}}_{i}\triangleq\left[\frac{\partial h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})}{\partial\widetilde{{\bm{z}}}_{i}}+\lambda({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})\right]\frac{\partial u({\bm{w}}_{\text{c}};\xi)}{\partial{\bm{w}}_{\text{c}}}. (5)

In Section 4.3, we will provide a convergence analysis that can help explain the effects of gradient correction; in Section 5, we will empirically show that setting a strictly positive λ\lambda is crucial for the success of the proposed method. In the following discussion, we provide an intuitive explanation of the correction to further motivate our approach.

The Effect of Regularization.

Here, we provide an intuitive explanation of why the gradient correction technique works. The client-side gradient in 5 can be considered as a gradient of the following surrogate loss (proof is provided in the Appendix):

𝒛i𝒛^i2+λ2𝒛i𝒛~i2,\displaystyle\|{\bm{z}}_{i}-\hat{{\bm{z}}}_{i}\|^{2}+\frac{\lambda}{2}\|{\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}\|^{2}, (6)

where 𝒛i=u(𝒘c;ξ){\bm{z}}_{i}=u({\bm{w}}_{\text{c}};\xi) is the activation of the client-side model. Besides, 𝒛^i𝒛i𝒛~ih(𝒘s;𝒛~i)/2\hat{{\bm{z}}}_{i}\triangleq{\bm{z}}_{i}-\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})/2 and the quantized activation 𝒛~i\widetilde{{\bm{z}}}_{i} are fixed vectors that do not have derivatives when computing the gradient. Setting λ>0\lambda>0 is equivalent to having a regularization term, as per 6. The regularizer encourages the client-side model 𝒘c{\bm{w}}_{\text{c}} to move in a direction that can decrease the quantization error 𝒛i𝒛~i\|{\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}\|. However, one cannot set a arbitrarily large value for λ\lambda because the client-side model may output the same activation for all inputs and fail to minimize the original loss function.

4.3 Convergence Analysis

In this subsection, we provide a convergence analysis for FedLite. The analysis will highlight how the quantization error influences the convergence and how the gradient correction technique helps. The analysis is conducted under standard assumptions for mini-batch SGD. We assume the objective function F(𝒘)=F([𝒘c;𝒘s])F({\bm{w}})=F([{\bm{w}}_{\text{c}};{\bm{w}}_{\text{s}}]) is LL-Lipschitz smooth (i.e., F(𝒘)F(𝒗)L𝒘𝒗\left\|\nabla F({\bm{w}})-\nabla F({\bm{v}})\right\|\leq L\left\|{\bm{w}}-{\bm{v}}\right\|), and stochastic gradient 𝒈(𝒘){\bm{g}}({\bm{w}}) has bounded variance 𝔼𝒈(𝒘)F(𝒘)2σ2/BS\mathbb{E}\|{\bm{g}}({\bm{w}})-\nabla F({\bm{w}})\|^{2}\leq\sigma^{2}/BS, where BB is the mini-batch size per client and SS is the number of selected clients per iteration. Under these assumptions, we have the following theorem.

Theorem 4.1 (Convergence of FedLite).

If the client-side model and server-side model update using the same learning rate α=BS/T\alpha=\sqrt{BS/T} where TT is the number of total iterations, then the expected gradient norm of the global function mint[0,T1]𝔼F(𝐰(t))2\min_{t\in[0,T-1]}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2} can be bounded by

4(F(𝒘(0))Finf)BST+4Lσ2BSTOpt. error of mini-batch SGD+(4BST+2)(Λ12+(Λ2λ)2Λ32)κ2Addnl. error caused by quantization\displaystyle\underbrace{\frac{4(F({\bm{w}}^{(0)})-F_{\text{inf}})}{\sqrt{BST}}+\frac{4L\sigma^{2}}{\sqrt{BST}}}_{\text{Opt. error of mini-batch SGD}}+\underbrace{\left(4\sqrt{\frac{BS}{T}}+2\right)(\Lambda_{1}^{2}+(\Lambda_{2}-\lambda)^{2}\Lambda_{3}^{2})\kappa^{2}}_{\text{Addnl. error caused by quantization}} (7)

where κ\kappa is the maximal quantization error max𝐳𝐳~\max\|{\bm{z}}-\widetilde{{\bm{z}}}\|, λ\lambda is the tunable parameter in our gradient correction scheme, FinfF_{\text{inf}} is the lower bound of function value, and constants Λ1,Λ2,Λ3\Lambda_{1},\Lambda_{2},\Lambda_{3} are the largest eigenvalues of matrices 2h(𝐰s;𝐳)/𝐳𝐰s,2h(𝐰s;𝐳)/𝐳2,u(𝐰c;ξ)/𝐰c\partial^{2}h({\bm{w}}_{\text{s}};{\bm{z}})/\partial{\bm{z}}\partial{\bm{w}}_{\text{s}},\partial^{2}h({\bm{w}}_{\text{s}};{\bm{z}})/\partial{\bm{z}}^{2},\partial u({\bm{w}}_{\text{c}};\xi)/{\bm{w}}_{\text{c}}, respectively.

Theorem 4.1 guarantees that FedLite converges to a neighbourhood around the stationary point of the global function. And the size of this neighborhood is in proportion to the maximal quantization error κ\kappa during the training. When there is no quantization, the error bound 7 recovers that of mini-batch SGD. Moreover, one can observe that setting a positive λ\lambda can help to reduce the additional error caused by adding the quantization layer.

5 Experiments

We implement the proposed method FedLite using FedJAX (Ro et al., 2021) and evaluate its effectiveness on three standard federated datasets provided by the Tensorflow Federated (TFF) package (TFF, 2021): (i) image classification on FEMNIST, (ii) tag prediction on StackOverflow (referred to as SO Tag) and (iii) next word prediction on StackOverflow (referred to as SO NWP). On FEMNIST, the same model architecture as (Reddi et al., 2020) is adopted. We place two convolutional layers and two dense layers on the clients and the server, respectively. With this splitting, the client-side model only has about 1.6%1.6\% trainable parameters of the entire model. Therefore, the client-side resource requirement is significantly reduced. On SO Tag, both the client- and server-side models contains only one dense layer. On SO NWP, we place one LSTM layer and one dense layer on the clients, and another dense layer on the server. The ratios between the client-side model size and entire model size are 83%83\% and 79%79\% on SO Tag and SO NWP, respectively. Here, we note that even though the client-side model sizes for SO Tag and SO NWP do not correspond to an ideal setting for the split learning-based approaches, we include these two datasets to showcase the utility of our proposed method for language-based tasks.

For each task, we select the learning rate that is best for the baseline SplitFed algorithm. Although separately tuning the learning rate for our proposed method can further improve its performance, using the same learning rate as SplitFed already demonstrates the advantages of our method. Unless otherwise stated, the number of groups RR in our proposed quantizer is set to 11, as it exhibits the best trade-off in Figure 3 and experiments.

Main Results: Effectiveness of FedLite.

As discussed in Section 4.1, there is a trade-off between the final performance and compression ratio. Although the additional quantization layer helps reduce the communication cost, it also causes performance drop. While utilizing the quantization layer, it is important to understand the range of the compression ratio that does not degrade the final performance too much. To this end, we vary the number of clusters LL and number of subvectors qq in our proposed method and report the resulting accuracy vs. compression trade-off in Figure 4. Note that we define the compression ratio to be the ratio between the original activation size and compressed message (codebook + codewords) size.

One can observe that the proposed method can achieve at least 10×10\times compression ratio with almost no accuracy loss. Furthermore, if we allow for 5%5\% relative loss compared to SplitFed, then our method can achieve a 490×490\times compression ratio on FEMNIST. This suggests that 99.8%99.8\% information is redundant and can be dropped during the up-link communication. Surprisingly, on SO Tag, the Recall@5 even improves when adding the quantization layer. We conjecture that compressing the outputs of a intermediate layer might have similar effects to dropout.

Refer to caption
(a) FEMNIST
Refer to caption
(b) SO Tag
Refer to caption
(c) SO NWP
Figure 4: Trade-off between the accuracy and compression ratio. Our proposed method can achieve up to 490×490\times, 247×247\times, and 51×51\times communication reduction with less than 5%5\% accuracy drop on FEMNSIT, SO Tag, and SO NWP, respectively. Each curve in the figures corresponds to one value of qq (number of subvectors); and each point on a curve corresponds to a specific value of LL (number of clusters). For our method, the number of groups RR is fixed as 11.
Refer to caption
(a) λ=5×105\lambda=5\times 10^{-5}
Refer to caption
(b) λ=0\lambda=0
Refer to caption
(c) Quantizer comparison
Figure 5: Ablation studies on FEMNIST. (a) and (b): Validation accuracy on FEMNIST when fixing λ\lambda and varying qq and LL. Setting a small positive value for λ\lambda improves accuracy for almost all (q,L)(q,L) pairs; (c) Thanks to subvector grouping, our proposed quantizer can achieve an order of magnitude larger compression ratio as compared to vanilla PQ scheme.

Overall Communication and Computation Efficiencies.

Here, we provide a concrete example showing how the proposed method improves the communication and computation efficiencies over previous works. On FEMNIST, setting q=1152q=1152 and L=2L=2 amounts to 490×490\times compression ratio, which amounts to a 490×490\times reduction in the additional communication introduced by the network splitting. Compared to SplitFed, the overall up-link commutation cost (including the client-side gradients synchronization) is about 10×10\times smaller. Compared to FedAvg, the up-link communication cost per round is reduced by 62×62\times, with 64×64\times less trainable parameters on the clients. We further compare the training curves with respect to total communication costs of FedLite against SplitFed and FedAvg in Figure 6 in Appendix. Besides, the overall wall-clock time saving may depend on the characteristics of the training system. One can get an estimate using the analytical model in (Agarwal et al., 2021).

Effectiveness of the Gradient Correction.

From Figure 4, it is easy to observe that the gradient correction technique (i.e., λ>0\lambda>0) is crucial for improving performance. Without correction, the algorithm can even diverge in the high compression ratio regime. While λ\lambda is separately tuned for each point (i.e., each (q,L)(q,L) pair) in Figure 4, we found that fixing λ\lambda for all (q,L)(q,L) pairs still leads to significant improvements. In Figure 5, we report the performance on FEMNIST for various choices of qq and LL. In particular, with q=288q=288, the accuracy improvement ranges from 372%3-72\%. Furthermore, we found that a small λ\lambda ranging from 10510^{-5} to 10310^{-3} works well on all training tasks in our experiments. Setting a larger λ\lambda leads to near-zero quantization error. But the model may have poor performance, as it tends to output the same activation for all inputs and becomes meaningless.

Effectiveness of the Proposed Quantizer.

In Section 4.1, we already show that stacking subvectors into groups (i.e., setting R<qR<q) is critical to reach a high compression ratio. But how does it affect the model performance? To answer the question, we run the proposed method with R=q>1R=q>1 (vanilla PQ), and report the results in Figure 5(c). Observe that the proposed method significantly improves the compression ratio with minimal loss of accuracy. In addition, any other quantizers that compress each individual vector but do not utilize the similarities among vectors (such as stochastic quantization Suresh et al. (2017)) can be combined with our method to further improve the compression ratio, as they compress a different dimension from ours.

6 Conclusion

We studied the problem of training large ML models in FL with resource-constrained clients. Due to limitations on storage and compute capacity, the clients cannot locally optimize the entire model, prohibiting the usage of previous federated learning algorithms. SL is a promising approach to address this issue, as it only assigns the first few small layers to the clients. But, the network splitting incurs additional communication. In order to make split neural network training to be communication-efficient, we propose FedLite that can compress the additional communication by up to 490×490\times with minimal loss of accuracy. The success of our method relies on a variant of product quantization scheme and a gradient correction technique. We perform theoretical analysis as well as extensive experiments on both vision and language tasks to validate its effectiveness.

References

  • Agarwal et al. (2021) Agarwal, S., Wang, H., Venkataraman, S., and Papailiopoulos, D. On the utility of gradient compression in distributed training systems. arXiv preprint arXiv:2103.00543, 2021.
  • Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
  • Bernstein et al. (2018) Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
  • Chen et al. (2020) Chen, T., Li, L., and Sun, Y. Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pp. 1617–1626. PMLR, 2020.
  • Covington et al. (2016) Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp.  191–198, 2016.
  • Diao et al. (2020) Diao, E., Ding, J., and Tarokh, V. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2020.
  • Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • Ge et al. (2013a) Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  2946–2953, 2013a.
  • Ge et al. (2013b) Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence, 36(4):744–755, 2013b.
  • Han et al. (2021) Han, D.-J., Bhatti, H. I., Lee, J., and Moon, J. Accelerating federated learning with split learning on locally generated losses. ICML’21 workshop on federated learning for User Privacy and Data Confidentiality, 2021.
  • He et al. (2020) He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
  • Horvath et al. (2021) Horvath, S., Laskaridis, S., Almeida, M., Leontiadis, I., Venieris, S. I., and Lane, N. D. Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout. arXiv preprint arXiv:2102.13451, 2021.
  • Huang et al. (2020) Huang, Y., Song, Z., Li, K., and Arora, S. Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning, pp. 4507–4518. PMLR, 2020.
  • Jegou et al. (2010) Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.
  • Kairouz et al. (2019) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krishna & Murty (1999) Krishna, K. and Murty, M. N. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3):433–439, 1999.
  • Krizhevsky (2014) Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
  • Levy & Goldberg (2014) Levy, O. and Goldberg, Y. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27:2177–2185, 2014.
  • Li et al. (2019) Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. CoRR, 2019.
  • McMahan et al. (2017) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
  • Reddi et al. (2020) Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečnỳ, J., Kumar, S., and McMahan, H. B. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • Reisizadeh et al. (2019) Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. arXiv preprint arXiv:1909.13014, 2019.
  • Ro et al. (2021) Ro, J. H., Suresh, A. T., and Wu, K. Fedjax: Federated learning simulation with jax. arXiv preprint arXiv:2108.02117, 2021.
  • Romanini et al. (2021) Romanini, D., Hall, A. J., Papadopoulos, P., Titcombe, T., Ismail, A., Cebere, T., Sandmann, R., Roehm, R., and Hoeh, M. A. Pyvertical: A vertical federated learning framework for multi-headed splitnn. arXiv preprint arXiv:2104.00489, 2021.
  • Suresh et al. (2017) Suresh, A. T., Felix, X. Y., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pp. 3329–3337. PMLR, 2017.
  • TFF (2021) TFF. TensorFlow Federated: machine learning on decentralized data. https://www.tensorflow.org/federated, 2021. Accessed: 30 September 2021.
  • Thapa et al. (2020) Thapa, C., Chamikara, M. A. P., and Camtepe, S. Splitfed: When federated learning meets split learning. arXiv preprint arXiv:2004.12088, 2020.
  • Vepakomma et al. (2018) Vepakomma, P., Gupta, O., Swedish, T., and Raskar, R. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564, 2018.
  • Vogels et al. (2019) Vogels, T., Karimireddy, S. P., and Jaggi, M. PowerSGD: Practical low-rank gradient compression for distributed optimization. 2019.
  • Wang et al. (2018) Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9850–9861, 2018.
  • Wang et al. (2021) Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
  • Wei et al. (2020) Wei, K., Li, J., Ding, M., Ma, C., Yang, H. H., Farokhi, F., Jin, S., Quek, T. Q., and Poor, H. V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
  • Woodworth et al. (2020) Woodworth, B., Patel, K. K., and Srebro, N. Minibatch vs local sgd for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020.

Appendix A Detailed Explanation for the Effects of Regularization

In Section 4.2, we argue that the client-side gradient is a gradient of the following surrogate loss function

s(𝒘c)=𝒛i𝒛^i2+λ2𝒛i𝒛~i2\displaystyle s({\bm{w}}_{\text{c}})=\|{\bm{z}}_{i}-\hat{{\bm{z}}}_{i}\|^{2}+\frac{\lambda}{2}\|{\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}\|^{2} (8)

where 𝒛^i𝒛i𝒛~ih(𝒘s;𝒛~i)/2\hat{{\bm{z}}}_{i}\triangleq{\bm{z}}_{i}-\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})/2 and the quantized activation 𝒛~i\widetilde{{\bm{z}}}_{i} are fixed vectors that do not have derivatives when computing the gradient. Here, we are going to provide a formal proof to support this argument. Directly taking the derivative, we have

s(𝒘c)𝒘c=\displaystyle\frac{\partial s({\bm{w}}_{\text{c}})}{\partial{\bm{w}}_{\text{c}}}= 2(𝒛i𝒛^i)𝒛i𝒘c+λ(𝒛i𝒛~i)𝒛i𝒘c\displaystyle 2({\bm{z}}_{i}-\hat{{\bm{z}}}_{i})\frac{\partial{\bm{z}}_{i}}{\partial{\bm{w}}_{\text{c}}}+\lambda({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})\frac{\partial{\bm{z}}_{i}}{\partial{\bm{w}}_{\text{c}}} (9)
=\displaystyle= (𝒛~ih(𝒘s;𝒛~i)+λ(𝒛i𝒛~i))𝒛i𝒘c.\displaystyle\left(\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})+\lambda({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})\right)\frac{\partial{\bm{z}}_{i}}{\partial{\bm{w}}_{\text{c}}}. (10)

The above equation is exactly the same as the client-side gradient we defined in 5.

Appendix B Proof of Theorem 1

B.1 Preliminaries

Without loss of generality, we define 𝒘=[𝒘c;𝒘s]{\bm{w}}=[{\bm{w}}_{\text{c}};{\bm{w}}_{\text{s}}]. Then, the gradient of the original loss function can be decomposed into two parts:

F(𝒘)=[𝒘cF(𝒘);𝒘sF(𝒘)]\displaystyle\nabla F({\bm{w}})=[\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}});\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}})] (11)

Similarly, we denote the stochastic gradients as follows:

𝒈(𝒘)=[𝒈c(𝒘);𝒈s(𝒘)]\displaystyle{\bm{g}}({\bm{w}})=[{\bm{g}}_{c}({\bm{w}});{\bm{g}}_{s}({\bm{w}})] (12)

where 𝔼[𝒈c(𝒘)]=𝒘cF(𝒘)\mathbb{E}[{\bm{g}}_{c}({\bm{w}})]=\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}) and 𝔼[𝒈s(𝒘)]=𝒘sF(𝒘)\mathbb{E}[{\bm{g}}_{s}({\bm{w}})]=\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}}). In addition, we assume the stochastic gradients with a mini-batch size of BSBS have bounded variance:

𝔼𝒈(𝒘)F(𝒘)2σ2BS.\displaystyle\mathbb{E}\left\|{\bm{g}}({\bm{w}})-\nabla F({\bm{w}})\right\|^{2}\leq\frac{\sigma^{2}}{BS}. (13)

That is equivalent to

𝔼𝒈c(𝒘)𝒘cF(𝒘)2+𝔼𝒈s(𝒘)𝒘sF(𝒘)2\displaystyle\mathbb{E}\left\|{\bm{g}}_{c}({\bm{w}})-\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}})\right\|^{2}+\mathbb{E}\left\|{\bm{g}}_{s}({\bm{w}})-\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}})\right\|^{2}\leq σ2BS\displaystyle\frac{\sigma^{2}}{BS} (14)

Then, we denote the client- and server-side gradients in the presence of the quantization layer as follows:

𝒈~c(𝒘)\displaystyle\widetilde{{\bm{g}}}_{c}({\bm{w}}) =𝒈c(𝒘)+𝜹c(𝒘)\displaystyle={\bm{g}}_{c}({\bm{w}})+\bm{\delta}_{c}({\bm{w}}) (15)
𝒈~s(𝒘)\displaystyle\widetilde{{\bm{g}}}_{s}({\bm{w}}) =𝒈s(𝒘)+𝜹s(𝒘).\displaystyle={\bm{g}}_{s}({\bm{w}})+\bm{\delta}_{s}({\bm{w}}). (16)

The concrete expressions of 𝜹c\bm{\delta}_{c} and 𝜹s\bm{\delta}_{s} will be derived later.

B.2 Main Proof

With the above notation, according to the Lipschitz smoothness of the loss function, we have

𝔼[F(𝒘(t+1))]F(𝒘(t))\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq α𝒘cF(𝒘(t)),𝔼[𝒈~c(𝒘(t))]α𝒘sF(𝒘(t)),𝔼[𝒈~s(𝒘(t))]\displaystyle-\alpha\left<\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)}),\mathbb{E}[\widetilde{{\bm{g}}}_{c}({\bm{w}}^{(t)})]\right>-\alpha\left<\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}}^{(t)}),\mathbb{E}[\widetilde{{\bm{g}}}_{s}({\bm{w}}^{(t)})]\right>
+α2L2𝔼𝒈~c(𝒘(t))2+α2L2𝔼𝒈~s(𝒘(t))2\displaystyle+\frac{\alpha^{2}L}{2}\mathbb{E}\left\|\widetilde{{\bm{g}}}_{c}({\bm{w}}^{(t)})\right\|^{2}+\frac{\alpha^{2}L}{2}\mathbb{E}\left\|\widetilde{{\bm{g}}}_{s}({\bm{w}}^{(t)})\right\|^{2} (17)

For the first term on the right-hand side of 17, we have

𝒘cF(𝒘(t)),𝒘cF(𝒘(t))+𝔼[𝜹c(𝒘(t))]\displaystyle-\left<\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)}),\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})+\mathbb{E}[\bm{\delta}_{c}({\bm{w}}^{(t)})]\right>
\displaystyle\leq 𝒘cF(𝒘(t))2+12ϵ𝒘cF(𝒘(t))2+ϵ2𝔼[𝜹c(𝒘(t))]2\displaystyle-\left\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\|^{2}+\frac{1}{2\epsilon}\left\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\|^{2}+\frac{\epsilon}{2}\left\|\mathbb{E}[\bm{\delta}_{c}({\bm{w}}^{(t)})]\right\|^{2} (18)
\displaystyle\leq (112ϵ)𝒘cF(𝒘(t))2+ϵ2𝔼𝜹c(𝒘(t))2\displaystyle-(1-\frac{1}{2\epsilon})\left\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\|^{2}+\frac{\epsilon}{2}\mathbb{E}\left\|\bm{\delta}_{c}({\bm{w}}^{(t)})\right\|^{2} (19)
=\displaystyle= 12𝒘cF(𝒘(t))2+12𝔼𝜹c(𝒘(t))2\displaystyle-\frac{1}{2}\left\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\|^{2}+\frac{1}{2}\mathbb{E}\left\|\bm{\delta}_{c}({\bm{w}}^{(t)})\right\|^{2} (20)

where 18 uses Young’s Inequality, and ϵ>0\epsilon>0 is an arbitrary positive number. We set ϵ=1\epsilon=1 right after 18. For the third term on the right-hand side of 17, we have

𝔼𝒈c(𝒘(t))+𝜹c(𝒘(t))2\displaystyle\mathbb{E}\left\|{\bm{g}}_{c}({\bm{w}}^{(t)})+\bm{\delta}_{c}({\bm{w}}^{(t)})\right\|^{2}\leq 2𝔼𝒈c(𝒘(t))2+2𝔼𝜹c(𝒘(t))2\displaystyle 2\mathbb{E}\left\|{\bm{g}}_{c}({\bm{w}}^{(t)})\right\|^{2}+2\mathbb{E}\left\|\bm{\delta}_{c}({\bm{w}}^{(t)})\right\|^{2} (21)

For the server-side gradients 𝒈~s\widetilde{{\bm{g}}}_{s}, we can repeat the same process. Combining them together, it follows that

𝔼[F(𝒘(t+1))]F(𝒘(t))\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq α2F(𝒘(t))2+α2L𝔼𝒈(𝒘(t))2\displaystyle-\frac{\alpha}{2}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}+\alpha^{2}L\mathbb{E}\left\|{\bm{g}}({\bm{w}}^{(t)})\right\|^{2}
+2α2L(χc2+χs2)+α2(χc2+χs2)\displaystyle+2\alpha^{2}L(\chi_{c}^{2}+\chi_{s}^{2})+\frac{\alpha}{2}(\chi_{c}^{2}+\chi_{s}^{2}) (22)

where

χc2=\displaystyle\chi_{c}^{2}= max𝔼𝜹c(𝒘)2,\displaystyle\max\mathbb{E}\left\|\bm{\delta}_{c}({\bm{w}})\right\|^{2}, (23)
χs2=\displaystyle\chi_{s}^{2}= max𝔼𝜹s(𝒘)2.\displaystyle\max\mathbb{E}\left\|\bm{\delta}_{s}({\bm{w}})\right\|^{2}. (24)

Using the assumption that the stochastic gradient has bounded variance, we obtain that

𝔼[F(𝒘(t+1))]F(𝒘(t))\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq α(12αL)F(𝒘(t))2+α2Lσ2BS\displaystyle-\alpha(\frac{1}{2}-\alpha L)\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}+\frac{\alpha^{2}L\sigma^{2}}{BS}
+2α2L(χc2+χs2)+α2(χc2+χs2).\displaystyle+2\alpha^{2}L(\chi_{c}^{2}+\chi_{s}^{2})+\frac{\alpha}{2}(\chi_{c}^{2}+\chi_{s}^{2}). (25)

When αL1/4\alpha L\leq 1/4, we have

𝔼[F(𝒘(t+1))]F(𝒘(t))\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq α4F(𝒘(t))2+α2Lσ2BS+2α2L(χc2+χs2)\displaystyle-\frac{\alpha}{4}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}+\frac{\alpha^{2}L\sigma^{2}}{BS}+2\alpha^{2}L(\chi_{c}^{2}+\chi_{s}^{2})
+α2(χc2+χs2).\displaystyle+\frac{\alpha}{2}(\chi_{c}^{2}+\chi_{s}^{2}). (26)

With minor rearranging and taking the total expectation, it follows that

𝔼F(𝒘(t))2\displaystyle\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq 4𝔼[F(𝒘(t))F(𝒘(t+1))]α+4αLσ2BS+(2+4αL)(χc2+χs2).\displaystyle\frac{4\mathbb{E}[F({\bm{w}}^{(t)})-F({\bm{w}}^{(t+1)})]}{\alpha}+\frac{4\alpha L\sigma^{2}}{BS}+(2+4\alpha L)(\chi_{c}^{2}+\chi_{s}^{2}). (27)

Taking the average from t=0t=0 to t=T1t=T-1, obtain that

1Tt=0T1𝔼F(𝒘(t))2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq 4(F(𝒘(0))F(𝒘))αT+4αLσ2BS+(2+4αL)(χc2+χs2).\displaystyle\frac{4(F({\bm{w}}^{(0)})-F({\bm{w}}^{*}))}{\alpha T}+\frac{4\alpha L\sigma^{2}}{BS}+(2+4\alpha L)(\chi_{c}^{2}+\chi_{s}^{2}). (28)

If α=BS/T\alpha=\sqrt{BS/T}, then

1Tt=0T1𝔼F(𝒘(t))2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq 4(F(𝒘(0))F(𝒘))BST+4Lσ2BST+(4BSLT+2)(χc2+χs2).\displaystyle\frac{4(F({\bm{w}}^{(0)})-F({\bm{w}}^{*}))}{\sqrt{BST}}+\frac{4L\sigma^{2}}{\sqrt{BST}}+\left(\frac{4\sqrt{BS}L}{\sqrt{T}}+2\right)(\chi_{c}^{2}+\chi_{s}^{2}). (29)

Next we are going to show how χc,χs\chi_{c},\chi_{s} relate to the quantization error. Suppose that i(t){\mathcal{B}}_{i}^{(t)} denotes the randomly sampled mini-batch on client ii. Then according to the definition of server-side gradient, we have

𝒈~s(𝒘)=\displaystyle\widetilde{{\bm{g}}}_{s}({\bm{w}})= 1|𝒮(t)|i𝒮(t)[1|i(t)|ξi(t)h(𝒘s;𝒬(u(𝒘c;ξ)))𝒘s]\displaystyle\frac{1}{|{\mathcal{S}}^{(t)}|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{|{\mathcal{B}}_{i}^{(t)}|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}\right] (30)
=\displaystyle= 𝒈s(𝒘)+1|𝒮(t)|i𝒮(t)[1|i(t)|ξi(t)(h(𝒘s;𝒬(u(𝒘c;ξ)))𝒘sh(𝒘s;u(𝒘c;ξ))𝒘s)].\displaystyle{\bm{g}}_{s}({\bm{w}})+\frac{1}{|{\mathcal{S}}^{(t)}|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{|{\mathcal{B}}_{i}^{(t)}|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\left(\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial{\bm{w}}_{\text{s}}}\right)\right]. (31)

where 𝒬{\mathcal{Q}} represents the quantizer and maps the original activation to its quantized version. As a consequence,

𝜹s(𝒘)=\displaystyle\bm{\delta}_{s}({\bm{w}})= 1|𝒮(t)|i𝒮(t)[1|i(t)|ξi(t)(h(𝒘s;𝒬(u(𝒘c;ξ)))𝒘sh(𝒘s;u(𝒘c;ξ))𝒘s)]\displaystyle\frac{1}{|{\mathcal{S}}^{(t)}|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{|{\mathcal{B}}_{i}^{(t)}|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\left(\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial{\bm{w}}_{\text{s}}}\right)\right] (32)

and

𝜹s(𝒘)2\displaystyle\left\|\bm{\delta}_{s}({\bm{w}})\right\|^{2}\leq maxξ𝔻i,ih(𝒘s;𝒬(u(𝒘c;ξ)))𝒘sh(𝒘s;u(𝒘c;ξ))𝒘s2.\displaystyle\max_{\xi\in{\mathbb{D}}_{i},i\in{\mathcal{I}}}\left\|\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial{\bm{w}}_{\text{s}}}\right\|^{2}. (33)

For the ease of writing, we denote 𝒛=u(𝒘c;ξ){\bm{z}}=u({\bm{w}}_{\text{c}};\xi) and 𝒛~=𝒬(u(𝒘c;ξ))\widetilde{{\bm{z}}}={\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)). Using mean value theorem, we have

𝜹s(𝒘)2\displaystyle\left\|\bm{\delta}_{s}({\bm{w}})\right\|^{2}\leq max2h(𝒘s;𝒖)𝒖𝒘s(𝒛~𝒛)2\displaystyle\max\left\|\frac{\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})}{\partial{\bm{u}}\partial{\bm{w}}_{\text{s}}}\left(\widetilde{{\bm{z}}}-{\bm{z}}\right)\right\|^{2} (34)
\displaystyle\leq Λ12κ2\displaystyle\Lambda_{1}^{2}\kappa^{2} (35)

where 𝒖=t𝒛+(1t)𝒛~{\bm{u}}=t{\bm{z}}+(1-t)\widetilde{{\bm{z}}} for some 0<t<10<t<1, constant Λ1\Lambda_{1} is the largest eigenvalue of matrix 2h(𝒘s;𝒖)/𝒖𝒘s\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})/\partial{\bm{u}}\partial{\bm{w}}_{\text{s}}, and κ=max𝒛~𝒛\kappa=\max\|\widetilde{{\bm{z}}}-{\bm{z}}\| denotes the maximal quantization error.

Using the same technique, for the client-side gradients, we have

𝜹c(𝒘)=\displaystyle\bm{\delta}_{c}({\bm{w}})= 1|𝒮(t)|i𝒮(t)[1|i(t)|ξi(t)(h(𝒘s;𝒬(u(𝒘c;ξ)))𝒬(u(𝒘c;ξ))h(𝒘s;u(𝒘c;ξ))u(𝒘c;ξ)))u(𝒘c;ξ)𝒘c]\displaystyle\frac{1}{|{\mathcal{S}}^{(t)}|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{|{\mathcal{B}}_{i}^{(t)}|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\left(\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi))}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial u({\bm{w}}_{\text{c}};\xi))}\right)\frac{\partial u({\bm{w}}_{\text{c}};\xi)}{\partial{\bm{w}}_{\text{c}}}\right]
+1|𝒮(t)|i𝒮(t)[1|i(t)|ξi(t)λ(u(𝒘c;ξ)𝒬(u(𝒘c;ξ)))u(𝒘c;ξ)𝒘c].\displaystyle+\frac{1}{|{\mathcal{S}}^{(t)}|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{|{\mathcal{B}}_{i}^{(t)}|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\lambda(u({\bm{w}}_{\text{c}};\xi)-{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))\frac{\partial u({\bm{w}}_{\text{c}};\xi)}{\partial{\bm{w}}_{\text{c}}}\right]. (36)

Accordingly,

𝜹s(𝒘)2\displaystyle\left\|\bm{\delta}_{s}({\bm{w}})\right\|^{2}\leq max(h(𝒘s;𝒛~)𝒛~h(𝒘s;𝒛)𝒛+λ(𝒛𝒛~))𝒛𝒘c2\displaystyle\max\left\|\left(\frac{\partial h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}})}{\partial\widetilde{{\bm{z}}}}-\frac{\partial h({\bm{w}}_{\text{s}};{\bm{z}})}{\partial{\bm{z}}}+\lambda({\bm{z}}-\widetilde{{\bm{z}}})\right)\frac{\partial{\bm{z}}}{\partial{\bm{w}}_{\text{c}}}\right\|^{2} (37)
=\displaystyle= max(2h(𝒘s;𝒖)𝒖2λ)(𝒛~𝒛)𝒛𝒘c2\displaystyle\max\left\|\left(\frac{\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})}{\partial{\bm{u}}^{2}}-\lambda\right)(\widetilde{{\bm{z}}}-{\bm{z}})\frac{\partial{\bm{z}}}{\partial{\bm{w}}_{\text{c}}}\right\|^{2} (38)
\displaystyle\leq (Λ2λ)2Λ32κ2\displaystyle(\Lambda_{2}-\lambda)^{2}\Lambda_{3}^{2}\kappa^{2} (39)

where Λ2\Lambda_{2} and Λ3\Lambda_{3} are the largest eigenvalues of matrices 2h(𝒘s;𝒖)/𝒖2\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})/\partial{\bm{u}}^{2} and 𝒛/𝒘c\partial{\bm{z}}/\partial{\bm{w}}_{\text{c}}, respectively. Substituting the above bounds 35 and 39 on 𝜹c,𝜹s\bm{\delta}_{c},\bm{\delta}_{s} into 29, we have

1Tt=0T1𝔼F(𝒘(t))2\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq 4(F(𝒘(0))F(𝒘))BST+4Lσ2BST\displaystyle\frac{4(F({\bm{w}}^{(0)})-F({\bm{w}}^{*}))}{\sqrt{BST}}+\frac{4L\sigma^{2}}{\sqrt{BST}}
+(4BST+2)(Λ12+(Λ2λ)2Λ32)max𝒛𝒛~2.\displaystyle+\left(\frac{4\sqrt{BS}}{\sqrt{T}}+2\right)(\Lambda_{1}^{2}+(\Lambda_{2}-\lambda)^{2}\Lambda_{3}^{2})\max\left\|{\bm{z}}-\widetilde{{\bm{z}}}\right\|^{2}. (40)

Here we complete the proof.

Appendix C More Experimental Details

C.1 Additional Results

We additionally report the training curves of FedAvg, FedLite and SplitFed on EMNIST in Figure 6. It can be observed that in terms of communication cost, FedLite is significantly faster than other two baselines. We also need to highlight that FedLite and SplitFed have lower memory and computation requirements on clients than FedAvg.

Refer to caption
Figure 6: Training curves for different algorithms on FEMNIST. Although split learning-based approaches (SplitFed and ours FedLite) communicate at every iteration, they communicate much less than FedAvg.

C.2 Hyper-parameter Choices

Here, we summarize all the hyper-parameters on three training tasks.

FEMNIST

  • Best learning rate: 101.510^{-1.5}

  • Optimizer: SGD

  • Mini-batch size per client BB: 2020

  • Activation size dd: 92169216

  • Number of clients per iteration: 1010

  • Ranges of qq: {4608,2304,1152,576,288,144}\{4608,2304,1152,576,288,144\}

  • Ranges of LL: {32,16,8,4,2}\{32,16,8,4,2\}

  • Ranges of λ\lambda: {0,105,5×105,104,5×104}\{0,10^{-5},5\times 10^{-5},10^{-4},5\times 10^{-4}\}

  • Client-side model: Same as the first five layers of the neural network used in (Reddi et al., 2020): Conv2d + Conv2d + MaxPool2d + Dropout + Flatten. Model size: 18,816×6418,816\times 64 bits.

  • Server-side model: Same as the last three layers of the neural network used in (Reddi et al., 2020): Dense + Dropout + Dense. Model size: 1,187,774×641,187,774\times 64 bits.

SO NWP

  • Best learning rate: 0.010.01

  • Optimizer: Adam (Kingma & Ba, 2014)

  • Mini-batch size per client BB: 128128. Each sample contains 3030 words. So the effective batch size is 38403840.

  • Activation size dd: 9696

  • Number of clients per iteration: 5050

  • Ranges of qq: {48,24,12,6,3}\{48,24,12,6,3\}

  • Ranges of LL: {960,480,240,120,60,30}\{960,480,240,120,60,30\}

  • Ranges of λ\lambda: {0,5×104,103,5×103,102}\{0,5\times 10^{-4},10^{-3},5\times 10^{-3},10^{-2}\}

  • Client-side model: Same as the first three layers of the neural network used in (Reddi et al., 2020): Embedding + LSTM + Dense. Model size: 3,680,360×643,680,360\times 64 bits.

  • Server-side model: Same as the last layer of the neural network used in (Reddi et al., 2020): Dense. Model size: 970,388×64970,388\times 64 bits.

SO Tag

  • Best learning rate: 100.510^{-0.5}

  • Optimizer: AdaGrad (Duchi et al., 2011)

  • Mini-batch size per client BB: 100100.

  • Activation size dd: 20002000

  • Number of clients per iteration: 1010

  • Ranges of qq: {1000,500,250,200,125,25}\{1000,500,250,200,125,25\}

  • Ranges of LL: {100,60,40,20,10}\{100,60,40,20,10\}

  • Ranges of λ\lambda: {0,103,5×103,102,5×102}\{0,10^{-3},5\times 10^{-3},10^{-2},5\times 10^{-2}\}

  • Client-side model: One dense layer. Model size: 5000×2000×645000\times 2000\times 64 bits.

  • Server-side model: One dense layer. Model size: 2000×1000×642000\times 1000\times 64 bits.