FedLite: A Scalable Approach for Federated Learning on Resource-constrained Clients

Jianyu Wang^§, Hang Qi^†, Ankit Singh Rawat^†, Sashank Reddi^†,
Sagar Waghmare^†, Felix X. Yu^†, Gauri Joshi^§

^§Carnegie Mellon University, ^†Google Research Work performed while doing an internship at Google Research

Abstract

In classical federated learning, the clients contribute to the overall training by communicating local updates for the underlying model on their private data to a coordinating server. However, updating and communicating the entire model becomes prohibitively expensive when resource-constrained clients collectively aim to train a large machine learning model. Split learning provides a natural solution in such a setting, where only a small part of the model is stored and trained on clients while the remaining large part of the model only stays at the servers. However, the model partitioning employed in split learning introduces a significant amount of communication cost. This paper addresses this issue by compressing the additional communication using a novel clustering scheme accompanied by a gradient correction method. Extensive empirical evaluations on image and text benchmarks show that the proposed method can achieve up to $490\times$ communication cost reduction with minimal drop in accuracy, and enables a desirable performance vs. communication trade-off.

1 Introduction

Federated learning (FL) is an emerging field that collaboratively trains machine learning models on decentralized data (Li et al., 2019; Kairouz et al., 2019; Wang et al., 2021). One major advantage of FL is that it does not require clients to upload their data which may contain sensitive personal information. Instead, clients separately train local models on their private datasets, and the resulting locally trained model parameters are infrequently synchronized with the help of a coordinating server (McMahan et al., 2017). While the FL framework helps alleviate data-privacy concerns for distributed training, most of existing FL algorithms critically assume that the clients have enough compute and storage resources to perform local updates on the entire machine learning model. However, this assumption does not necessarily hold in many modern applications. For example, classification problems with an extremely large number of classes (often in millions and billions) commonly arise in the context of recommender systems (Covington et al., 2016), and language modeling (Levy & Goldberg, 2014). Here, the classification layer of a neural network itself is large enough such that a typical FL client (such as a mobile device) cannot even store and locally update this single layer, let alone the entire neural network.

In order to overcome the above limitations of FL, split learning (SL) technique is adopted in (Vepakomma et al., 2018; Thapa et al., 2020). Specifically, SL splits the underlying model between the clients and server such that the first few layers are shared across the clients and the server, while the remaining layers are only stored at the server. The reduction of resource requirement at the clients is particularly pronounced when the last few dense layers constitute a large portion of the entire model. For instance, in an example convolutional neural network (Krizhevsky, 2014), the last two fully connected layers contribute to $95\%$ parameters of the entire model. In this case, if we could allocate the last two layers to the server, then the client-side memory usage can be directly reduced by a factor of $20$ .

Limitations of SL.

One major limitation of SL is that the underlying model partitioning leads to an increased communication cost for the resulting framework. In order to train the split neural network, the activations and gradients at the layer where the model is split (referred to as cut layer) need to be communicated between the server and clients at each iteration. The additional message size is in proportion to the mini-batch size as well as the cut layer size. As a result, the communication cost for the model training can become prohibitive whenever the mini-batch size and the activation size of the cut layer are large (e.g. in one of our experiments, the additional message size can be $10$ times larger than that of the client-side model).

Refer to caption — Figure 1: Overview of the proposed algorithm. In order to reduce the additional communication associated with split learning, we propose to *cluster similar client activations and only send the cluster centroids to the server*. This is equivalent to adding a vector quantization layer in split neural network training. The success of our method relies on a novel variant of product quantizer and a gradient correction technique. We refer the reader to Section 4 for further details.

Contributions.

In this paper, we aim to make the split neural network training communication-efficient and enable its widespread adoption for FL in resource-constrained settings. Our proposed solution is based on the critical observation that, given a mini-batch of data, the client does not need to communicate per-example activation vectors if the activation vectors (at the cut layer) for different examples in the mini-batch exhibit enough similarity. Thus, we propose a training framework that performs clustering of the activation vectors and only communicates the cluster centroids to the server. Interestingly, this is equivalent to adding a vector quantization layer in the middle of the split neural network (see Figure 1 for an overview of our proposed method). Our main contributions are as follows.

•

We propose a communication-efficient splitting-based FL approach for resource-constrained clients (cf. Section 4), which improves previous SL solutions. The approach employs a novel clustering scheme that leverages product quantization to effectively compress the activations communicated between the clients and server.
•

After applying the activation quantization, the clients can only receive possibly noisy gradients from the server. The inaccurate client-side model updates lead to significant accuracy drops in our experiments. In order to mitigate this problem, we propose a gradient correction scheme, which plays a critical role in achieving a high compression ratio with minimal accuracy loss.
•

We empirically evaluate the performance of our approach on three standard FL datasets (cf. Section 5). We show that our approach allows for up to $490\times$ communication reduction without significant accuracy loss.
•

In addition, we present a convergence analysis for the proposed method (cf. Section 4.3), which reveals a trade-off between communication reduction and convergence speed. The analysis further helps explain why the proposed gradient correction technique is beneficial.

It is also worth mentioning that the proposed method does not expose any additional client-side information to the server than vanilla split neural network training approach. Similar to SplitFed (Thapa et al., 2020), our method can also leverage existing privacy preserving mechanisms such as differential privacy (Wei et al., 2020) or instance-hiding schemes (Huang et al., 2020) to provide formal privacy guarantees, which are out of the scope of this paper. Besides, our proposed method has potential applications beyond FL with resource-constrained clients as it can be applied in any learning framework that can benefit from a vector quantization layer. For instance, it can be used to reduce the communication overhead in two-party vertical FL (Romanini et al., 2021), where the model is naturally split across two institutions and the data labels are generated on the server.

2 Background and Related Works

Federated Learning (FL).

Suppose we have a collection of $M$ clients ${\mathcal{I}}=\{1,2,\dots,M\}$ . Each client $i\in{\mathcal{I}}$ has a local dataset ${\mathbb{D}}_{i}$ and a corresponding empirical loss function $F_{i}({\bm{w}})=\sum_{\xi\in{\mathbb{D}}_{i}}f({\bm{w}};\xi)/n_{i}$ , where ${\bm{w}}$ denotes the model parameters, and $n_{i}=|{\mathbb{D}}_{i}|$ denotes the number of samples of the local dataset. The goal of FL is to find a shared model ${\bm{w}}$ that can minimize the averaged loss over all clients, defined as follows:

\displaystyle F({\bm{w}})=\sum_{i=1}^{M}p_{i}F_{i}({\bm{w}})

(1)

where $p_{i}=n_{i}/\sum_{i=1}^{M}n_{i}$ is the relative weight of local loss $F_{i}$ . Motivated by the data privacy concerns, under a FL framework, the clients perform local training and only communicate the resulting models to a coordinating server as opposed to sharing their raw local data with the server (McMahan et al., 2017). In current FL algorithms, the model size is limited by the resource constraints of clients. Consequently, these algorithms are not feasible when dealing with large machine learning models that cannot fit in clients memory or require large compute.

Large Model Training in FL.

To deploy large machine learning models on resource-constrained clients, a few recent works proposed methods to reduce the effective model size on clients. For example, Diao et al. (2020) varied the layer width on clients depending on their computational capacities; and Horvath et al. (2021) proposed ordered dropout to randomly removing neurons in the neural network. These methods are effective in reducing the width of the intermediate layers; however, they still place the full classification layer (which is in proportion to the size of the output space) on each client to compute the the client update. This may not be feasible for many clients as the parameters in the classification layer of a model can dominate the total number of model parameters (Krizhevsky, 2014).

Gradient Compression.

There are a large amount of literature studying gradient compression in distributed training, such as QSGD (Alistarh et al., 2017), SignSGD (Bernstein et al., 2018), Atomo (Wang et al., 2018), PowerSGD (Vogels et al., 2019), etc. Although these works share the same goal as our work (i.e., reducing communciation by compressing message size), we focus on orthogonal aspects. In particular, all these previous gradient compression methods only compress one gradient vector using the inherent properties of the stochastic gradient (e.g., sparsity or low rankness). However, our method considers compressing a batch of activation vectors and utilizes the similarity among the activations. This is a new dimension that does not exist in distributed training scenarios and has not been studied in gradient compression literature.

Split Learning (SL).

Split learning (SL) is another way of minimizing 1 without explicitly sharing local data on clients (Vepakomma et al., 2018). In particular, SL splits the neural network model into two parts on a layer basis. The first few layers (called client-side model parameterized by ${\bm{w}}_{\text{c}}$ ) are shared across clients and the server, while the remaining layers (called server-side model parameterized by ${\bm{w}}_{\text{s}}$ ) are only stored and trained on the server. Under this splitting, the original loss function for a data sample $\xi$ can be re-written as follows:

\displaystyle f({\bm{w}};\xi)=h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi)),\quad\forall\xi\in{\mathbb{D}}_{i},\forall i\in{\mathcal{I}}

(2)

where $u$ is the client-side function mapping the input data to the activation space and $h$ is the server-side function mapping the activation to a scalar loss value. Training both the client- and server-side models requires communicating client activations (i.e., the output of the client-side model, also called smashed data in some literature) between the clients and the server. We further elaborate on a concrete SL algorithm as a baseline in Section 3.

Communication-efficient SL.

Recently, He et al. (2020); Han et al. (2021) proposed to add a classification layer on the client-side model in SL so that each client can locally update its parameters. This can effectively reduce the communication frequency between the clients and the server. However, this kind of method is not suitable for the settings where the classification layer dominates the entire model size such that the clients may not have enough memory space to store the entire classification layer.

Product Quantization.

Product quantization (PQ) is a compression method and has been widely used in approximate nearest neighbor search (Jegou et al., 2010; Ge et al., 2013b). Given a batch of vectors, PQ divides each vector into multiple chunks and performs K-means clustering separately on each chunk. Now, the resulting cluster centroids construct the codebook and the closest codeword will be used to represent each vector. Instead of directly computing the distance between two high-dimensional vectors, computing the distance between two codewords is orders of magnitude more efficient and faster. Note that the previous works used PQ to compress either learned features after training (see, e.g., Ge et al., 2013a) or the parameters of a neural network (see, e.g., Chen et al., 2020). In this paper, we show that PQ is also effective in compressing the output of intermediate layers and present a simple technique to enable backpropagation through the corresponding quantization layer.

3 Baseline: the SplitFed Algorithm

In this section, we review a representative split learning algorithm, namely SplitFed (Thapa et al., 2020), that provides a baseline FL training approach in resource-constrained settings.

Introduction to SplitFed.

Each iteration of SplitFed contains four steps detailed as follows. Wherever it simplifies the presentation, we assume that the mini-batch size $B$ on each client is $1$ but the derivations hold true for arbitrary batch sizes.

1.

Client Forward Pass: Let ${\mathcal{S}}$ be a randomly selected subset of clients. For $i\in{\mathcal{S}}$ , compute the output of the client-side model ${\bm{z}}_{i}=u({\bm{w}}_{\text{c}};\xi)$ , where $\xi\in{\mathbb{D}}_{i}$ is a randomly chosen training sample, and ${\bm{z}}_{i}\in\mathbb{R}^{d}$ denotes the activations of the last layer of the client-side model. Then, each selected client sends ${\bm{z}}_{i}$ (together with its corresponding label if necessary¹¹1In vertical FL applications (Romanini et al., 2021), the labels do not need to be transferred.) to the server.
2.

Server Update: The server treats all the activations (also referred to as smashed data) $\{{\bm{z}}_{i}\}_{i\in{\mathcal{S}}}$ as inputs to perform one step of gradient descent on the server-side model: $\Delta{\bm{w}}_{\text{s}}=\eta_{\text{s}}\sum_{i\in{\mathcal{S}}}p_{i}\nabla_{{\bm{w}}_{\text{s}}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})/\sum_{i\in{\mathcal{S}}}p_{i}$ , where $\eta_{\text{s}}$ denotes the server learning rate.²²2The server learning rate here has a different definition from previous FL literature, e.g., Reddi et al. (2020). In addition, the server also computes the gradient with respect to the activations $\nabla_{{\bm{z}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})$ and sends it back to the corresponding client $i$ .
3.

Client Backward Pass: Each client computes the gradient with respect to the client-side model using the chain rule: ${\bm{g}}_{i}\triangleq\nabla_{{\bm{w}}_{\text{c}}}f=\nabla_{{\bm{z}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})\nabla_{{\bm{w}}_{\text{c}}}u({\bm{w}}_{\text{c}};\xi)$ .
4.

Client-side Model Synchronization: At last, the client-side model is updated by synchronizing client gradients $\Delta{\bm{w}}_{\text{c}}=\eta_{\text{c}}\sum_{i\in{\mathcal{S}}}p_{i}{\bm{g}}_{i}/\sum_{i\in{\mathcal{S}}}p_{i}$ , where $\eta_{\text{c}}$ denotes the learning rate for the client-side model.

One can easily validate that, with the above four steps, SplitFed is equivalent to mini-batch stochastic gradient descent (SGD) with a total batch size of $B|{\mathcal{S}}|$ . No matter how the network is split, the performance of the algorithm remains unchanged and has the same iteration complexity as mini-batch SGD.

Besides, note that, a selected client needs to upload both the activations during “Client Forward Pass” and the gradients of the client-side model during “Client-side Model Synchronization”. Thus, the communication size per client per iteration can be written as $|{\bm{w}}_{\text{c}}|+Bd$ . Since the up-link communication bandwidth is quite limited for clients in federated learning, when the mini-batch size $B$ or the dimension of the activation $d$ gets large, the up-link communication may become the main bottleneck that limits the scalability of SplitFed. It is critical to compress the uploaded message from clients to the server (Reisizadeh et al., 2019; Li et al., 2019).

Comparison with FedAvg.

In Table 1, we further compare the difference between FedAvg and SplitFed. One can observe that SplitFed always requires less computation and memory on clients, and could potentially communicate less than FedAvg. In terms of convergence rate, it is still under discussion which algorithm is better Woodworth et al. (2020). Note that the goal of this paper is to show that we can effectively reduce the additional communication in SplitFed rather than showing SplitFed is better than FedAvg.

Algorithm	Batch size	Total compute	Client-side compute	Communication cost
FedAvg	$B/H$	${\mathcal{O}}(B\|{\bm{w}}\|)$	${\mathcal{O}}(B\|{\bm{w}}\|)$	$\|{\bm{w}}\|$
SplitFed	$B/H$	${\mathcal{O}}(B\|{\bm{w}}\|/H)$	${\mathcal{O}}(B\|{\bm{w}}_{\text{c}}\|/H)$	$Bd/H+\|{\bm{w}}_{\text{c}}\|$
SplitFed	$B$	${\mathcal{O}}(B\|{\bm{w}}\|)$	${\mathcal{O}}(B\|{\bm{w}}_{\text{c}}\|)$	$Bd+\|{\bm{w}}_{\text{c}}\|$

Table 1: Comparisons between FL (FedAvg) and SL (SplitFed). In the table,

H

denotes the number of local steps in FedAvg. No matter of using which batch sizes, SL-based methods always require less computaion and memory on clients and potentially reduce the communication cost per round, depending on the batch size

B

and activation size

d

4 Proposed Method: FedLite

We now introduce our proposed method, FEDerated spLIT learning with vEctor quantization (FedLite), which can drastically reduce the up-link communication cost on clients in the SplitFed algorithm.

Overview.

The key idea of our method is to compress the redundant information in the client activations for each mini-batch. For instance, if we assume that all the activation vectors within a mini-batch are nearly identical to each other, then instead of sending all $B$ activation vectors, the client needs to send only one representative vector to the server, thereby, reducing the communication cost by a factor of $B$ .

Building on the above observation, we let each client input their activations into a clustering algorithm to get $L$ centroids. Then, each activation vector is represented by the index of its closest centroid. Instead of sending a mini-batch of $B$ vectors, only $L$ centroids and the cluster assignments are transmitted to the server. This procedure is equivalent to vector quantization. Here, a simple choice of the clustering algorithm is K-means (Krishna & Murty, 1999). However, as we observe in Figure 3, vanilla K-means clustering leads to a high quantization error with very limited compression ratio. It is non-trivial to design a proper clustering algorithm that can minimize the communication between clients and server, while maintaining a low quantization error. In Section 4.1, we design a novel variant of product quantizer that can flexibly operate this trade-off.

While the above approach is appealing, it also introduces new challenges. First, note that, in the backward pass, due to the additional quantization layer, the server no longer knows the true client activations; thus, constraining the server to compute the gradients with respect to the (possibly noisy) quantized activations. Furthermore, in order to update the client-side model parameters, the clients need to receive the gradients with respect to the original activations, which is not available any more. In Section 4.2, we introduce a gradient correction technique to mitigate this gradient mismatch, which is critical for our method to achieve a high compression ratio with minimal accuracy loss.

In the following sections, we present the key components of our proposed method and highlight how they address the aforementioned challenges. In addition, a convergence analysis is provided in Section 4.3. Unless otherwise stated, for the ease of exposition, we will focus on the operations on a specific client $i$ and omit the client index. However, it is important to keep in mind that the client-side operations happen in parallel on all selected clients.

4.1 Forward Pass: Compressing Activations with Product Quantization

In this subsection, we first present our proposed qunatizer and then elaborate on its advantages. A visual illustration is provided in Figure 2.

The Proposed Compression Scheme.

Suppose a client has computed a batch of activations ${\bm{Z}}=[{\bm{z}}^{(1)},\dots,{\bm{z}}^{(B)}]\in\mathbb{R}^{d\times B}$ using the client-side model ${\bm{w}}_{\text{c}}$ , where $d$ is the dimension of each activation and $B$ denotes the mini-batch size on each client. Our proposed quantizer first divides each activation vector ${\bm{z}}^{(j)},j\in[1,B]$ into $q$ subvectors with equal dimension $d/q$ . We denote the $s$ -th subvector of ${\bm{z}}^{(j)}$ as ${\bm{z}}^{(j,s)}\in\mathbb{R}^{d/q}$ . Then, the quantizer stacks all subvectors into $R$ groups based on their indices. For example, the first group ${\mathcal{G}}^{(1)}$ contains all subvectors with indices $(j,1),(j,2),\dots(j,q/R),\forall j\in[1,B]$ ; the second group ${\mathcal{G}}^{(2)}$ contains all subvectors with indices $(j,q/R+1),(j,q/R+2),\dots,(j,2q/R),\forall j\in[1,B]$ , and so forth. At last, K-means clustering is performed on subvectors within each group ${\mathcal{G}}^{(r)},\forall r\in[1,R]$ to find $L$ cluster centroids ${\mathcal{C}}^{(r)}=\{{\bm{c}}^{(r,1)},\dots,{\bm{c}}^{(r,L)}\}\subset\mathbb{R}^{d/q}$ . Now each subvector ${\bm{z}}^{(j,s)}$ can be approximated by its closest centroid in the corresponding group. Formally, if ${\bm{z}}^{(j,s)}\in{\mathcal{G}}^{(r)}$ , then

	$\displaystyle\widetilde{{\bm{z}}}^{(j,s)}$	$\displaystyle\triangleq{\bm{c}}^{(r,l_{*}^{(j,s)})},$		(3)
	$\displaystyle l_{*}^{(j,s)}$	$\displaystyle=\operatorname*{arg\,min}_{l\in[1,L]}\\|{\bm{z}}^{(j,s)}-{\bm{c}}^{(r,l)}\\|^{2}$		(4)

In the above equation, $\widetilde{{{\bm{z}}}}^{(j,s)}$ represents the quantized version of ${\bm{z}}^{(j,s)}$ . By concatenating all quantized subvectors, server can get the quantized activations $\widetilde{{\bm{Z}}}$ and use it as input to update the server-side model. In order to obtain the quantized subvectors, each client needs to send all cluster centroids $\{{\mathcal{C}}^{(r)}\}_{r\in[1,R]}$ (referred to as the codebook) and the cluster assignments $\{l_{*}^{(j,s)}\}_{j\in[1,B],s\in[1,q]}$ (referred to as the codewords) to the server. Assuming that each floating-point number occupies $\phi$ bits³³3When computing the compression ratio in this paper, we always assume $\phi=64$ ., the transmitted message size per client is reduced to $\phi dRL/q+Bq\log_{2}L$ from $\phi dB$ .

The Benefit of Subvector Division: Reduced Quantization Error.

The first key component of our proposed quantizer is subvector division, i.e., setting the number of subvectors $q$ to be greater than $1$ . When $q=1$ , the quantizer reduces to vanilla K-means. In this case, the codebook size is $\phi dL$ and each activation vector have $L$ possible choices. When we set $q=R>1$ , the codebook size is still $\phi dL$ . However, each activation vector now has $L^{q}$ possible choices, as each of $q$ subvectors has $L$ choices. Thus, the number of quantization levels becomes exponentially larger without any increase in memory usage and computation complexity. As illustrated by the green lines in Figure 3, having more quantization levels or effective centroid choices can significantly lower the quantization error.

The Benefit of Subvector Grouping: Improved Compression Ratio.

By using subvector division, although the quantization error can be reduced, the codebook size $\phi dL$ is still pretty large. This is because subvectors within one vector are quantized separately using different codebooks. To further reduce the communication size, we propose subvector grouping and force subvectors within one group to share the same codebook. As a result, the total codebook size can be reduced to $\phi dLR/q$ . When $q\gg R\geq 1$ , there can be an order of magnitude increase in the compression ratio, as illustrated by the red lines in Figure 3. One can also observe that, by changing the value of $R$ , our proposed quantizer (red lines) achieves a much better error-versus-compression trade-off than K-means and vanilla production quantization scheme.

Why not Reuse the Codebooks from Previous Iterations?

In our proposed scheme, the codebook is reconstructed at each iteration on clients. This is necessary, since in FL, only few stateless clients are selected to participate training in at each iteration and they have non-IID data distributions (Kairouz et al., 2019). Previous codebooks can be stale and other clients’ codebooks may not be suitable.

4.2 Backward Pass: Gradient Correction

As mentioned at the beginning of Section 4, there is a gradient mismatch problem in the client backward pass. While client $i$ needs $\nabla_{{\bm{z}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})$ to update the client-side models, it can only receive $\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})$ from the server. A naive solution is to treat $\nabla_{\widetilde{{\bm{z}}}_{i}}h$ as an approximation of $\nabla_{{\bm{z}}_{i}}h$ and update the client-side model by $\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})\nabla_{{\bm{w}}_{\text{c}}}u({\bm{w}}_{\text{c}};\xi)=(\partial f/\partial\widetilde{{\bm{z}}}_{i})(\partial{\bm{z}}_{i}/\partial{\bm{w}}_{\text{c}})$ . However, due to gradient mismatch, this approach can lead to a significant performance drop, as observed in our experiments (cf. Section 5). In order to address this issue, we propose a gradient correction technique. In particular, we approximate the gradient $\nabla_{{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};{\bm{z}}_{i})$ by its first-order Taylor series expansion: $\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})+\nabla^{2}_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})\cdot({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})$ . While the higher-order derivative may be expensive to compute, we use a scalar parameter $\lambda{\bm{I}}>0$ to replace $\nabla^{2}_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}})_{i}$ . Consequently, the gradient of the client-side model is defined as follows:

\displaystyle\widetilde{{\bm{g}}}_{i}\triangleq\left[\frac{\partial h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})}{\partial\widetilde{{\bm{z}}}_{i}}+\lambda({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})\right]\frac{\partial u({\bm{w}}_{\text{c}};\xi)}{\partial{\bm{w}}_{\text{c}}}.

(5)

In Section 4.3, we will provide a convergence analysis that can help explain the effects of gradient correction; in Section 5, we will empirically show that setting a strictly positive $\lambda$ is crucial for the success of the proposed method. In the following discussion, we provide an intuitive explanation of the correction to further motivate our approach.

The Effect of Regularization.

Here, we provide an intuitive explanation of why the gradient correction technique works. The client-side gradient in 5 can be considered as a gradient of the following surrogate loss (proof is provided in the Appendix):

\displaystyle\|{\bm{z}}_{i}-\hat{{\bm{z}}}_{i}\|^{2}+\frac{\lambda}{2}\|{\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}\|^{2},

(6)

where ${\bm{z}}_{i}=u({\bm{w}}_{\text{c}};\xi)$ is the activation of the client-side model. Besides, $\hat{{\bm{z}}}_{i}\triangleq{\bm{z}}_{i}-\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})/2$ and the quantized activation $\widetilde{{\bm{z}}}_{i}$ are fixed vectors that do not have derivatives when computing the gradient. Setting $\lambda>0$ is equivalent to having a regularization term, as per 6. The regularizer encourages the client-side model ${\bm{w}}_{\text{c}}$ to move in a direction that can decrease the quantization error $\|{\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}\|$ . However, one cannot set a arbitrarily large value for $\lambda$ because the client-side model may output the same activation for all inputs and fail to minimize the original loss function.

4.3 Convergence Analysis

In this subsection, we provide a convergence analysis for FedLite. The analysis will highlight how the quantization error influences the convergence and how the gradient correction technique helps. The analysis is conducted under standard assumptions for mini-batch SGD. We assume the objective function $F({\bm{w}})=F([{\bm{w}}_{\text{c}};{\bm{w}}_{\text{s}}])$ is $L$ -Lipschitz smooth (i.e., $\left\|\nabla F({\bm{w}})-\nabla F({\bm{v}})\right\|\leq L\left\|{\bm{w}}-{\bm{v}}\right\|$ ), and stochastic gradient ${\bm{g}}({\bm{w}})$ has bounded variance $\mathbb{E}\|{\bm{g}}({\bm{w}})-\nabla F({\bm{w}})\|^{2}\leq\sigma^{2}/BS$ , where $B$ is the mini-batch size per client and $S$ is the number of selected clients per iteration. Under these assumptions, we have the following theorem.

Theorem 4.1 (Convergence of FedLite).

If the client-side model and server-side model update using the same learning rate $\alpha=\sqrt{BS/T}$ where $T$ is the number of total iterations, then the expected gradient norm of the global function $\min_{t\in[0,T-1]}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}$ can be bounded by

\displaystyle\underbrace{\frac{4(F({\bm{w}}^{(0)})-F_{\text{inf}})}{\sqrt{BST}}+\frac{4L\sigma^{2}}{\sqrt{BST}}}_{\text{Opt. error of mini-batch SGD}}+\underbrace{\left(4\sqrt{\frac{BS}{T}}+2\right)(\Lambda_{1}^{2}+(\Lambda_{2}-\lambda)^{2}\Lambda_{3}^{2})\kappa^{2}}_{\text{Addnl. error caused by quantization}}

(7)

where $\kappa$ is the maximal quantization error $\max\|{\bm{z}}-\widetilde{{\bm{z}}}\|$ , $\lambda$ is the tunable parameter in our gradient correction scheme, $F_{\text{inf}}$ is the lower bound of function value, and constants $\Lambda_{1},\Lambda_{2},\Lambda_{3}$ are the largest eigenvalues of matrices $\partial^{2}h({\bm{w}}_{\text{s}};{\bm{z}})/\partial{\bm{z}}\partial{\bm{w}}_{\text{s}},\partial^{2}h({\bm{w}}_{\text{s}};{\bm{z}})/\partial{\bm{z}}^{2},\partial u({\bm{w}}_{\text{c}};\xi)/{\bm{w}}_{\text{c}}$ , respectively.

Theorem 4.1 guarantees that FedLite converges to a neighbourhood around the stationary point of the global function. And the size of this neighborhood is in proportion to the maximal quantization error $\kappa$ during the training. When there is no quantization, the error bound 7 recovers that of mini-batch SGD. Moreover, one can observe that setting a positive $\lambda$ can help to reduce the additional error caused by adding the quantization layer.

5 Experiments

We implement the proposed method FedLite using FedJAX (Ro et al., 2021) and evaluate its effectiveness on three standard federated datasets provided by the Tensorflow Federated (TFF) package (TFF, 2021): (i) image classification on FEMNIST, (ii) tag prediction on StackOverflow (referred to as SO Tag) and (iii) next word prediction on StackOverflow (referred to as SO NWP). On FEMNIST, the same model architecture as (Reddi et al., 2020) is adopted. We place two convolutional layers and two dense layers on the clients and the server, respectively. With this splitting, the client-side model only has about $1.6\%$ trainable parameters of the entire model. Therefore, the client-side resource requirement is significantly reduced. On SO Tag, both the client- and server-side models contains only one dense layer. On SO NWP, we place one LSTM layer and one dense layer on the clients, and another dense layer on the server. The ratios between the client-side model size and entire model size are $83\%$ and $79\%$ on SO Tag and SO NWP, respectively. Here, we note that even though the client-side model sizes for SO Tag and SO NWP do not correspond to an ideal setting for the split learning-based approaches, we include these two datasets to showcase the utility of our proposed method for language-based tasks.

For each task, we select the learning rate that is best for the baseline SplitFed algorithm. Although separately tuning the learning rate for our proposed method can further improve its performance, using the same learning rate as SplitFed already demonstrates the advantages of our method. Unless otherwise stated, the number of groups $R$ in our proposed quantizer is set to $1$ , as it exhibits the best trade-off in Figure 3 and experiments.

Main Results: Effectiveness of FedLite.

As discussed in Section 4.1, there is a trade-off between the final performance and compression ratio. Although the additional quantization layer helps reduce the communication cost, it also causes performance drop. While utilizing the quantization layer, it is important to understand the range of the compression ratio that does not degrade the final performance too much. To this end, we vary the number of clusters $L$ and number of subvectors $q$ in our proposed method and report the resulting accuracy vs. compression trade-off in Figure 4. Note that we define the compression ratio to be the ratio between the original activation size and compressed message (codebook + codewords) size.

One can observe that the proposed method can achieve at least $10\times$ compression ratio with almost no accuracy loss. Furthermore, if we allow for $5\%$ relative loss compared to SplitFed, then our method can achieve a $490\times$ compression ratio on FEMNIST. This suggests that $99.8\%$ information is redundant and can be dropped during the up-link communication. Surprisingly, on SO Tag, the Recall@5 even improves when adding the quantization layer. We conjecture that compressing the outputs of a intermediate layer might have similar effects to dropout.

Overall Communication and Computation Efficiencies.

Here, we provide a concrete example showing how the proposed method improves the communication and computation efficiencies over previous works. On FEMNIST, setting $q=1152$ and $L=2$ amounts to $490\times$ compression ratio, which amounts to a $490\times$ reduction in the additional communication introduced by the network splitting. Compared to SplitFed, the overall up-link commutation cost (including the client-side gradients synchronization) is about $10\times$ smaller. Compared to FedAvg, the up-link communication cost per round is reduced by $62\times$ , with $64\times$ less trainable parameters on the clients. We further compare the training curves with respect to total communication costs of FedLite against SplitFed and FedAvg in Figure 6 in Appendix. Besides, the overall wall-clock time saving may depend on the characteristics of the training system. One can get an estimate using the analytical model in (Agarwal et al., 2021).

Effectiveness of the Gradient Correction.

From Figure 4, it is easy to observe that the gradient correction technique (i.e., $\lambda>0$ ) is crucial for improving performance. Without correction, the algorithm can even diverge in the high compression ratio regime. While $\lambda$ is separately tuned for each point (i.e., each $(q,L)$ pair) in Figure 4, we found that fixing $\lambda$ for all $(q,L)$ pairs still leads to significant improvements. In Figure 5, we report the performance on FEMNIST for various choices of $q$ and $L$ . In particular, with $q=288$ , the accuracy improvement ranges from $3-72\%$ . Furthermore, we found that a small $\lambda$ ranging from $10^{-5}$ to $10^{-3}$ works well on all training tasks in our experiments. Setting a larger $\lambda$ leads to near-zero quantization error. But the model may have poor performance, as it tends to output the same activation for all inputs and becomes meaningless.

Effectiveness of the Proposed Quantizer.

In Section 4.1, we already show that stacking subvectors into groups (i.e., setting $R<q$ ) is critical to reach a high compression ratio. But how does it affect the model performance? To answer the question, we run the proposed method with $R=q>1$ (vanilla PQ), and report the results in Figure 5(c). Observe that the proposed method significantly improves the compression ratio with minimal loss of accuracy. In addition, any other quantizers that compress each individual vector but do not utilize the similarities among vectors (such as stochastic quantization Suresh et al. (2017)) can be combined with our method to further improve the compression ratio, as they compress a different dimension from ours.

6 Conclusion

We studied the problem of training large ML models in FL with resource-constrained clients. Due to limitations on storage and compute capacity, the clients cannot locally optimize the entire model, prohibiting the usage of previous federated learning algorithms. SL is a promising approach to address this issue, as it only assigns the first few small layers to the clients. But, the network splitting incurs additional communication. In order to make split neural network training to be communication-efficient, we propose FedLite that can compress the additional communication by up to $490\times$ with minimal loss of accuracy. The success of our method relies on a variant of product quantization scheme and a gradient correction technique. We perform theoretical analysis as well as extensive experiments on both vision and language tasks to validate its effectiveness.

References

Agarwal et al. (2021) Agarwal, S., Wang, H., Venkataraman, S., and Papailiopoulos, D. On the utility of gradient compression in distributed training systems. arXiv preprint arXiv:2103.00543, 2021.
Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.
Bernstein et al. (2018) Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. signSGD: compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
Chen et al. (2020) Chen, T., Li, L., and Sun, Y. Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pp. 1617–1626. PMLR, 2020.
Covington et al. (2016) Covington, P., Adams, J., and Sargin, E. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198, 2016.
Diao et al. (2020) Diao, E., Ding, J., and Tarokh, V. Heterofl: Computation and communication efficient federated learning for heterogeneous clients. In International Conference on Learning Representations, 2020.
Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
Ge et al. (2013a) Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953, 2013a.
Ge et al. (2013b) Ge, T., He, K., Ke, Q., and Sun, J. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence, 36(4):744–755, 2013b.
Han et al. (2021) Han, D.-J., Bhatti, H. I., Lee, J., and Moon, J. Accelerating federated learning with split learning on locally generated losses. ICML’21 workshop on federated learning for User Privacy and Data Confidentiality, 2021.
He et al. (2020) He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large cnns at the edge. Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.
Horvath et al. (2021) Horvath, S., Laskaridis, S., Almeida, M., Leontiadis, I., Venieris, S. I., and Lane, N. D. Fjord: Fair and accurate federated learning under heterogeneous targets with ordered dropout. arXiv preprint arXiv:2102.13451, 2021.
Huang et al. (2020) Huang, Y., Song, Z., Li, K., and Arora, S. Instahide: Instance-hiding schemes for private distributed learning. In International Conference on Machine Learning, pp. 4507–4518. PMLR, 2020.
Jegou et al. (2010) Jegou, H., Douze, M., and Schmid, C. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.
Kairouz et al. (2019) Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krishna & Murty (1999) Krishna, K. and Murty, M. N. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3):433–439, 1999.
Krizhevsky (2014) Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
Levy & Goldberg (2014) Levy, O. and Goldberg, Y. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems, 27:2177–2185, 2014.
Li et al. (2019) Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. CoRR, 2019.
McMahan et al. (2017) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
Reddi et al. (2020) Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečnỳ, J., Kumar, S., and McMahan, H. B. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
Reisizadeh et al. (2019) Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization. arXiv preprint arXiv:1909.13014, 2019.
Ro et al. (2021) Ro, J. H., Suresh, A. T., and Wu, K. Fedjax: Federated learning simulation with jax. arXiv preprint arXiv:2108.02117, 2021.
Romanini et al. (2021) Romanini, D., Hall, A. J., Papadopoulos, P., Titcombe, T., Ismail, A., Cebere, T., Sandmann, R., Roehm, R., and Hoeh, M. A. Pyvertical: A vertical federated learning framework for multi-headed splitnn. arXiv preprint arXiv:2104.00489, 2021.
Suresh et al. (2017) Suresh, A. T., Felix, X. Y., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pp. 3329–3337. PMLR, 2017.
TFF (2021) TFF. TensorFlow Federated: machine learning on decentralized data. https://www.tensorflow.org/federated, 2021. Accessed: 30 September 2021.
Thapa et al. (2020) Thapa, C., Chamikara, M. A. P., and Camtepe, S. Splitfed: When federated learning meets split learning. arXiv preprint arXiv:2004.12088, 2020.
Vepakomma et al. (2018) Vepakomma, P., Gupta, O., Swedish, T., and Raskar, R. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564, 2018.
Vogels et al. (2019) Vogels, T., Karimireddy, S. P., and Jaggi, M. PowerSGD: Practical low-rank gradient compression for distributed optimization. 2019.
Wang et al. (2018) Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9850–9861, 2018.
Wang et al. (2021) Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat, M., Andrew, G., Avestimehr, S., Daly, K., Data, D., et al. A field guide to federated optimization. arXiv preprint arXiv:2107.06917, 2021.
Wei et al. (2020) Wei, K., Li, J., Ding, M., Ma, C., Yang, H. H., Farokhi, F., Jin, S., Quek, T. Q., and Poor, H. V. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
Woodworth et al. (2020) Woodworth, B., Patel, K. K., and Srebro, N. Minibatch vs local sgd for heterogeneous distributed learning. arXiv preprint arXiv:2006.04735, 2020.

Appendix A Detailed Explanation for the Effects of Regularization

In Section 4.2, we argue that the client-side gradient is a gradient of the following surrogate loss function

\displaystyle s({\bm{w}}_{\text{c}})=\|{\bm{z}}_{i}-\hat{{\bm{z}}}_{i}\|^{2}+\frac{\lambda}{2}\|{\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i}\|^{2}

(8)

where $\hat{{\bm{z}}}_{i}\triangleq{\bm{z}}_{i}-\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})/2$ and the quantized activation $\widetilde{{\bm{z}}}_{i}$ are fixed vectors that do not have derivatives when computing the gradient. Here, we are going to provide a formal proof to support this argument. Directly taking the derivative, we have

	$\displaystyle\frac{\partial s({\bm{w}}_{\text{c}})}{\partial{\bm{w}}_{\text{c}}}=$	$\displaystyle 2({\bm{z}}_{i}-\hat{{\bm{z}}}_{i})\frac{\partial{\bm{z}}_{i}}{\partial{\bm{w}}_{\text{c}}}+\lambda({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})\frac{\partial{\bm{z}}_{i}}{\partial{\bm{w}}_{\text{c}}}$		(9)
	$\displaystyle=$	$\displaystyle\left(\nabla_{\widetilde{{\bm{z}}}_{i}}h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}}_{i})+\lambda({\bm{z}}_{i}-\widetilde{{\bm{z}}}_{i})\right)\frac{\partial{\bm{z}}_{i}}{\partial{\bm{w}}_{\text{c}}}.$		(10)

The above equation is exactly the same as the client-side gradient we defined in 5.

Appendix B Proof of Theorem 1

B.1 Preliminaries

Without loss of generality, we define ${\bm{w}}=[{\bm{w}}_{\text{c}};{\bm{w}}_{\text{s}}]$ . Then, the gradient of the original loss function can be decomposed into two parts:

\displaystyle\nabla F({\bm{w}})=[\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}});\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}})]

(11)

Similarly, we denote the stochastic gradients as follows:

\displaystyle{\bm{g}}({\bm{w}})=[{\bm{g}}_{c}({\bm{w}});{\bm{g}}_{s}({\bm{w}})]

(12)

where $\mathbb{E}[{\bm{g}}_{c}({\bm{w}})]=\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}})$ and $\mathbb{E}[{\bm{g}}_{s}({\bm{w}})]=\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}})$ . In addition, we assume the stochastic gradients with a mini-batch size of $BS$ have bounded variance:

\displaystyle\mathbb{E}\left\|{\bm{g}}({\bm{w}})-\nabla F({\bm{w}})\right\|^{2}\leq\frac{\sigma^{2}}{BS}.

(13)

That is equivalent to

\displaystyle\mathbb{E}\left\|{\bm{g}}_{c}({\bm{w}})-\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}})\right\|^{2}+\mathbb{E}\left\|{\bm{g}}_{s}({\bm{w}})-\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}})\right\|^{2}\leq

\displaystyle\frac{\sigma^{2}}{BS}

(14)

Then, we denote the client- and server-side gradients in the presence of the quantization layer as follows:

	$\displaystyle\widetilde{{\bm{g}}}_{c}({\bm{w}})$	$\displaystyle={\bm{g}}_{c}({\bm{w}})+\bm{\delta}_{c}({\bm{w}})$		(15)
	$\displaystyle\widetilde{{\bm{g}}}_{s}({\bm{w}})$	$\displaystyle={\bm{g}}_{s}({\bm{w}})+\bm{\delta}_{s}({\bm{w}}).$		(16)

The concrete expressions of $\bm{\delta}_{c}$ and $\bm{\delta}_{s}$ will be derived later.

B.2 Main Proof

With the above notation, according to the Lipschitz smoothness of the loss function, we have

	$\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq$	$\displaystyle-\alpha\left<\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)}),\mathbb{E}[\widetilde{{\bm{g}}}_{c}({\bm{w}}^{(t)})]\right>-\alpha\left<\nabla_{{\bm{w}}_{\text{s}}}F({\bm{w}}^{(t)}),\mathbb{E}[\widetilde{{\bm{g}}}_{s}({\bm{w}}^{(t)})]\right>$
		$\displaystyle+\frac{\alpha^{2}L}{2}\mathbb{E}\left\\|\widetilde{{\bm{g}}}_{c}({\bm{w}}^{(t)})\right\\|^{2}+\frac{\alpha^{2}L}{2}\mathbb{E}\left\\|\widetilde{{\bm{g}}}_{s}({\bm{w}}^{(t)})\right\\|^{2}$		(17)

For the first term on the right-hand side of 17, we have

	$\displaystyle-\left<\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)}),\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})+\mathbb{E}[\bm{\delta}_{c}({\bm{w}}^{(t)})]\right>$
$\displaystyle\leq$	$\displaystyle-\left\\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\\|^{2}+\frac{1}{2\epsilon}\left\\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\\|^{2}+\frac{\epsilon}{2}\left\\|\mathbb{E}[\bm{\delta}_{c}({\bm{w}}^{(t)})]\right\\|^{2}$	(18)
$\displaystyle\leq$	$\displaystyle-(1-\frac{1}{2\epsilon})\left\\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\\|^{2}+\frac{\epsilon}{2}\mathbb{E}\left\\|\bm{\delta}_{c}({\bm{w}}^{(t)})\right\\|^{2}$	(19)
$\displaystyle=$	$\displaystyle-\frac{1}{2}\left\\|\nabla_{{\bm{w}}_{\text{c}}}F({\bm{w}}^{(t)})\right\\|^{2}+\frac{1}{2}\mathbb{E}\left\\|\bm{\delta}_{c}({\bm{w}}^{(t)})\right\\|^{2}$	(20)

where 18 uses Young’s Inequality, and $\epsilon>0$ is an arbitrary positive number. We set $\epsilon=1$ right after 18. For the third term on the right-hand side of 17, we have

\displaystyle\mathbb{E}\left\|{\bm{g}}_{c}({\bm{w}}^{(t)})+\bm{\delta}_{c}({\bm{w}}^{(t)})\right\|^{2}\leq

\displaystyle 2\mathbb{E}\left\|{\bm{g}}_{c}({\bm{w}}^{(t)})\right\|^{2}+2\mathbb{E}\left\|\bm{\delta}_{c}({\bm{w}}^{(t)})\right\|^{2}

(21)

For the server-side gradients $\widetilde{{\bm{g}}}_{s}$ , we can repeat the same process. Combining them together, it follows that

	$\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq$	$\displaystyle-\frac{\alpha}{2}\left\\|\nabla F({\bm{w}}^{(t)})\right\\|^{2}+\alpha^{2}L\mathbb{E}\left\\|{\bm{g}}({\bm{w}}^{(t)})\right\\|^{2}$
		$\displaystyle+2\alpha^{2}L(\chi_{c}^{2}+\chi_{s}^{2})+\frac{\alpha}{2}(\chi_{c}^{2}+\chi_{s}^{2})$		(22)

where

	$\displaystyle\chi_{c}^{2}=$	$\displaystyle\max\mathbb{E}\left\\|\bm{\delta}_{c}({\bm{w}})\right\\|^{2},$		(23)
	$\displaystyle\chi_{s}^{2}=$	$\displaystyle\max\mathbb{E}\left\\|\bm{\delta}_{s}({\bm{w}})\right\\|^{2}.$		(24)

Using the assumption that the stochastic gradient has bounded variance, we obtain that

	$\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq$	$\displaystyle-\alpha(\frac{1}{2}-\alpha L)\left\\|\nabla F({\bm{w}}^{(t)})\right\\|^{2}+\frac{\alpha^{2}L\sigma^{2}}{BS}$
		$\displaystyle+2\alpha^{2}L(\chi_{c}^{2}+\chi_{s}^{2})+\frac{\alpha}{2}(\chi_{c}^{2}+\chi_{s}^{2}).$		(25)

When $\alpha L\leq 1/4$ , we have

	$\displaystyle\mathbb{E}[F({\bm{w}}^{(t+1)})]-F({\bm{w}}^{(t)})\leq$	$\displaystyle-\frac{\alpha}{4}\left\\|\nabla F({\bm{w}}^{(t)})\right\\|^{2}+\frac{\alpha^{2}L\sigma^{2}}{BS}+2\alpha^{2}L(\chi_{c}^{2}+\chi_{s}^{2})$
		$\displaystyle+\frac{\alpha}{2}(\chi_{c}^{2}+\chi_{s}^{2}).$		(26)

With minor rearranging and taking the total expectation, it follows that

\displaystyle\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq

\displaystyle\frac{4\mathbb{E}[F({\bm{w}}^{(t)})-F({\bm{w}}^{(t+1)})]}{\alpha}+\frac{4\alpha L\sigma^{2}}{BS}+(2+4\alpha L)(\chi_{c}^{2}+\chi_{s}^{2}).

(27)

Taking the average from $t=0$ to $t=T-1$ , obtain that

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq

\displaystyle\frac{4(F({\bm{w}}^{(0)})-F({\bm{w}}^{*}))}{\alpha T}+\frac{4\alpha L\sigma^{2}}{BS}+(2+4\alpha L)(\chi_{c}^{2}+\chi_{s}^{2}).

(28)

If $\alpha=\sqrt{BS/T}$ , then

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla F({\bm{w}}^{(t)})\right\|^{2}\leq

\displaystyle\frac{4(F({\bm{w}}^{(0)})-F({\bm{w}}^{*}))}{\sqrt{BST}}+\frac{4L\sigma^{2}}{\sqrt{BST}}+\left(\frac{4\sqrt{BS}L}{\sqrt{T}}+2\right)(\chi_{c}^{2}+\chi_{s}^{2}).

(29)

Next we are going to show how $\chi_{c},\chi_{s}$ relate to the quantization error. Suppose that ${\mathcal{B}}_{i}^{(t)}$ denotes the randomly sampled mini-batch on client $i$ . Then according to the definition of server-side gradient, we have

	$\displaystyle\widetilde{{\bm{g}}}_{s}({\bm{w}})=$	$\displaystyle\frac{1}{\|{\mathcal{S}}^{(t)}\|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{\|{\mathcal{B}}_{i}^{(t)}\|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}\right]$		(30)
	$\displaystyle=$	$\displaystyle{\bm{g}}_{s}({\bm{w}})+\frac{1}{\|{\mathcal{S}}^{(t)}\|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{\|{\mathcal{B}}_{i}^{(t)}\|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\left(\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial{\bm{w}}_{\text{s}}}\right)\right].$		(31)

where ${\mathcal{Q}}$ represents the quantizer and maps the original activation to its quantized version. As a consequence,

\displaystyle\bm{\delta}_{s}({\bm{w}})=

\displaystyle\frac{1}{|{\mathcal{S}}^{(t)}|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{|{\mathcal{B}}_{i}^{(t)}|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\left(\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial{\bm{w}}_{\text{s}}}\right)\right]

(32)

and

\displaystyle\left\|\bm{\delta}_{s}({\bm{w}})\right\|^{2}\leq

\displaystyle\max_{\xi\in{\mathbb{D}}_{i},i\in{\mathcal{I}}}\left\|\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\bm{w}}_{\text{s}}}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial{\bm{w}}_{\text{s}}}\right\|^{2}.

(33)

For the ease of writing, we denote ${\bm{z}}=u({\bm{w}}_{\text{c}};\xi)$ and $\widetilde{{\bm{z}}}={\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi))$ . Using mean value theorem, we have

	$\displaystyle\left\\|\bm{\delta}_{s}({\bm{w}})\right\\|^{2}\leq$	$\displaystyle\max\left\\|\frac{\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})}{\partial{\bm{u}}\partial{\bm{w}}_{\text{s}}}\left(\widetilde{{\bm{z}}}-{\bm{z}}\right)\right\\|^{2}$		(34)
	$\displaystyle\leq$	$\displaystyle\Lambda_{1}^{2}\kappa^{2}$		(35)

where ${\bm{u}}=t{\bm{z}}+(1-t)\widetilde{{\bm{z}}}$ for some $0<t<1$ , constant $\Lambda_{1}$ is the largest eigenvalue of matrix $\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})/\partial{\bm{u}}\partial{\bm{w}}_{\text{s}}$ , and $\kappa=\max\|\widetilde{{\bm{z}}}-{\bm{z}}\|$ denotes the maximal quantization error.

Using the same technique, for the client-side gradients, we have

	$\displaystyle\bm{\delta}_{c}({\bm{w}})=$	$\displaystyle\frac{1}{\|{\mathcal{S}}^{(t)}\|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{\|{\mathcal{B}}_{i}^{(t)}\|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\left(\frac{\partial h({\bm{w}}_{\text{s}};{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))}{\partial{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi))}-\frac{\partial h({\bm{w}}_{\text{s}};u({\bm{w}}_{\text{c}};\xi))}{\partial u({\bm{w}}_{\text{c}};\xi))}\right)\frac{\partial u({\bm{w}}_{\text{c}};\xi)}{\partial{\bm{w}}_{\text{c}}}\right]$
		$\displaystyle+\frac{1}{\|{\mathcal{S}}^{(t)}\|}\sum_{i\in{\mathcal{S}}^{(t)}}\left[\frac{1}{\|{\mathcal{B}}_{i}^{(t)}\|}\sum_{\xi\in{\mathcal{B}}_{i}^{(t)}}\lambda(u({\bm{w}}_{\text{c}};\xi)-{\mathcal{Q}}(u({\bm{w}}_{\text{c}};\xi)))\frac{\partial u({\bm{w}}_{\text{c}};\xi)}{\partial{\bm{w}}_{\text{c}}}\right].$		(36)

Accordingly,

$\displaystyle\left\\|\bm{\delta}_{s}({\bm{w}})\right\\|^{2}\leq$	$\displaystyle\max\left\\|\left(\frac{\partial h({\bm{w}}_{\text{s}};\widetilde{{\bm{z}}})}{\partial\widetilde{{\bm{z}}}}-\frac{\partial h({\bm{w}}_{\text{s}};{\bm{z}})}{\partial{\bm{z}}}+\lambda({\bm{z}}-\widetilde{{\bm{z}}})\right)\frac{\partial{\bm{z}}}{\partial{\bm{w}}_{\text{c}}}\right\\|^{2}$	(37)
$\displaystyle=$	$\displaystyle\max\left\\|\left(\frac{\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})}{\partial{\bm{u}}^{2}}-\lambda\right)(\widetilde{{\bm{z}}}-{\bm{z}})\frac{\partial{\bm{z}}}{\partial{\bm{w}}_{\text{c}}}\right\\|^{2}$	(38)
$\displaystyle\leq$	$\displaystyle(\Lambda_{2}-\lambda)^{2}\Lambda_{3}^{2}\kappa^{2}$	(39)

where $\Lambda_{2}$ and $\Lambda_{3}$ are the largest eigenvalues of matrices $\partial^{2}h({\bm{w}}_{\text{s}};{\bm{u}})/\partial{\bm{u}}^{2}$ and $\partial{\bm{z}}/\partial{\bm{w}}_{\text{c}}$ , respectively. Substituting the above bounds 35 and 39 on $\bm{\delta}_{c},\bm{\delta}_{s}$ into 29, we have

	$\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\\|\nabla F({\bm{w}}^{(t)})\right\\|^{2}\leq$	$\displaystyle\frac{4(F({\bm{w}}^{(0)})-F({\bm{w}}^{*}))}{\sqrt{BST}}+\frac{4L\sigma^{2}}{\sqrt{BST}}$
		$\displaystyle+\left(\frac{4\sqrt{BS}}{\sqrt{T}}+2\right)(\Lambda_{1}^{2}+(\Lambda_{2}-\lambda)^{2}\Lambda_{3}^{2})\max\left\\|{\bm{z}}-\widetilde{{\bm{z}}}\right\\|^{2}.$		(40)

Here we complete the proof.

Appendix C More Experimental Details

C.1 Additional Results

We additionally report the training curves of FedAvg, FedLite and SplitFed on EMNIST in Figure 6. It can be observed that in terms of communication cost, FedLite is significantly faster than other two baselines. We also need to highlight that FedLite and SplitFed have lower memory and computation requirements on clients than FedAvg.

C.2 Hyper-parameter Choices

Here, we summarize all the hyper-parameters on three training tasks.

FEMNIST

•

Best learning rate: $10^{-1.5}$
•

Optimizer: SGD
•

Mini-batch size per client $B$ : $20$
•

Activation size $d$ : $9216$
•

Number of clients per iteration: $10$
•

Ranges of $q$ : $\{4608,2304,1152,576,288,144\}$
•

Ranges of $L$ : $\{32,16,8,4,2\}$
•

Ranges of $\lambda$ : $\{0,10^{-5},5\times 10^{-5},10^{-4},5\times 10^{-4}\}$
•

Client-side model: Same as the first five layers of the neural network used in (Reddi et al., 2020): Conv2d + Conv2d + MaxPool2d + Dropout + Flatten. Model size: $18,816\times 64$ bits.
•

Server-side model: Same as the last three layers of the neural network used in (Reddi et al., 2020): Dense + Dropout + Dense. Model size: $1,187,774\times 64$ bits.

SO NWP

•

Best learning rate: $0.01$
•

Optimizer: Adam (Kingma & Ba, 2014)
•

Mini-batch size per client $B$ : $128$ . Each sample contains $30$ words. So the effective batch size is $3840$ .
•

Activation size $d$ : $96$
•

Number of clients per iteration: $50$
•

Ranges of $q$ : $\{48,24,12,6,3\}$
•

Ranges of $L$ : $\{960,480,240,120,60,30\}$
•

Ranges of $\lambda$ : $\{0,5\times 10^{-4},10^{-3},5\times 10^{-3},10^{-2}\}$
•

Client-side model: Same as the first three layers of the neural network used in (Reddi et al., 2020): Embedding + LSTM + Dense. Model size: $3,680,360\times 64$ bits.
•

Server-side model: Same as the last layer of the neural network used in (Reddi et al., 2020): Dense. Model size: $970,388\times 64$ bits.

SO Tag

•

Best learning rate: $10^{-0.5}$
•

Optimizer: AdaGrad (Duchi et al., 2011)
•

Mini-batch size per client $B$ : $100$ .
•

Activation size $d$ : $2000$
•

Number of clients per iteration: $10$
•

Ranges of $q$ : $\{1000,500,250,200,125,25\}$
•

Ranges of $L$ : $\{100,60,40,20,10\}$
•

Ranges of $\lambda$ : $\{0,10^{-3},5\times 10^{-3},10^{-2},5\times 10^{-2}\}$
•

Client-side model: One dense layer. Model size: $5000\times 2000\times 64$ bits.
•

Server-side model: One dense layer. Model size: $2000\times 1000\times 64$ bits.