
DAdaQuant: Doubly-adaptive quantization for communication-efficient Federated Learning

Robert Hönig, Yiren Zhao, Robert Mullins
Department of Computer Science
University of Cambridge
{rh723,yiren.zhao,robert.mullins}@cl.cam.ac.uk
Abstract

Federated Learning (FL) is a powerful technique for training a model on a server with data from several clients in a privacy-preserving manner. In FL, a server sends the model to every client, who then trains the model locally and sends it back to the server. The server aggregates the updated models and repeats the process for several rounds. FL incurs significant communication costs, in particular when transmitting the updated local models from the clients back to the server. Recently proposed algorithms quantize the model parameters to efficiently compress FL communication. These algorithms typically have a quantization level that controls the compression factor. We find that dynamic adaptations of the quantization level can boost compression without sacrificing model quality. First, we introduce a time-adaptive quantization algorithm that increases the quantization level as training progresses. Second, we introduce a client-adaptive quantization algorithm that assigns each individual client the optimal quantization level at every round. Finally, we combine both algorithms into DAdaQuant, the doubly-adaptive quantization algorithm. Our experiments show that DAdaQuant consistently improves client→server compression, outperforming the strongest non-adaptive baselines by up to $2.8\times$.

1 Introduction

Edge devices such as smartphones, remote sensors and smart home appliances generate massive amounts of data (Wang et al., 2018b; Cao et al., 2017; Shi & Dustdar, 2016). In recent years, Federated Learning (FL) has emerged as a technique to train models on this data while preserving privacy (McMahan et al., 2017; Li et al., 2018).

In FL, we have a single server that is connected to many clients. Each client stores a local dataset that it does not want to share with the server because of privacy concerns or law enforcement (Voigt & Von dem Bussche, 2017). The server wants to train a model on all local datasets. To this end, it initializes the model and sends it to a random subset of clients. Each client trains the model on its local dataset and sends the trained model back to the server. The server accumulates all trained models into an updated model for the next iteration and repeats the process for several rounds until some termination criterion is met. This procedure enables the server to train a model without accessing any local datasets.

Today’s neural network models often have millions or even billions (Brown et al., 2020) of parameters, which makes high communication costs a concern in FL. In fact, Qiu et al. (2020) suggest that communication between clients and server may account for over 70% of energy consumption in FL. Reducing communication in FL is an attractive area of research because it lowers bandwidth requirements, energy consumption and training time.

Communication in FL occurs in two phases: Sending parameters from the server to clients (downlink) and sending updated parameters from clients to the server (uplink). Uplink bandwidth usually imposes a tighter bottleneck than downlink bandwidth, for several reasons. For one, the average global mobile upload bandwidth is currently less than one fourth of the download bandwidth (Speedtest). For another, FL downlink communication sends the same parameters to each client. Broadcasting parameters is usually more efficient than the accumulation of parameters from different clients that is required for uplink communication (Amiri et al., 2020; Reisizadeh et al., 2019). For these reasons, we seek to compress uplink communication.

[Figure 1: illustration with two panels. (a) Static quantization: clients A and B hold parameters $p_A, p_B \propto [0,1]$ and quantize both with $q_A = q_B = 8$; the server accumulates $\frac{1}{5}\textrm{Q}(p_A) + \frac{4}{5}\textrm{Q}(p_B)$, giving $\mathbb{E}_{p_A,p_B}[\mathrm{Var}(\textrm{Q}(p))] = 0.0018$. (b) Client-adaptive quantization: $q_A = 4$ and $q_B = 9$, giving $\mathbb{E}_{p_A,p_B}[\mathrm{Var}(\textrm{Q}(p))] = 0.0017$.]
Figure 1: Static quantization vs. client-adaptive quantization when accumulating parameters $p_A$ and $p_B$. (a) Static quantization uses the same quantization level for $p_A$ and $p_B$. (b) Client-adaptive quantization uses a slightly higher quantization level for $p_B$ because $p_B$ is weighted more heavily. This allows us to use a significantly lower quantization level $q_A$ for $p_A$ while keeping the quantization error measure $\mathbb{E}_{p_A,p_B}[\mathrm{Var}(\textrm{Q}(p))]$ roughly constant. Since communication is approximately proportional to $q_A + q_B$, client-adaptive quantization communicates less data.
[Figure 2: plot of loss vs. communication for quantization levels $q=1$, $q=2$, $q=4$ and adaptive $q$.]
Figure 2: Time-adaptive quantization. A small quantization level $q$ decreases the loss with less communication than a large $q$, but converges to a higher loss. This motivates an adaptive quantization strategy that uses a small $q$ as long as it is beneficial and then switches over to a large $q$. We generalize this idea into an algorithm that monotonically increases $q$ based on the training loss.

A large class of compression algorithms for FL apply some lossy quantizer Q, optionally followed by a lossless compression stage. Q usually provides a “quantization level” hyperparameter $q$ to control the coarseness of quantization (e.g. the number of bins for fixed-point quantization). When $q$ is kept constant during training, we speak of static quantization. When $q$ changes, we speak of adaptive quantization. Adaptive quantization can exploit asymmetries in the FL framework to minimize communication. One such asymmetry lies in FL’s training time, where we observe that early training rounds can use a lower $q$ without affecting convergence. Figure 2 illustrates how time-adaptive quantization leverages this phenomenon to minimize communication. Another asymmetry lies in FL’s client space, because most FL algorithms weight client contributions to the global model proportionally to their local dataset sizes. Figure 1 illustrates how client-adaptive quantization can minimize the quantization error. Intuitively, FL clients with greater weighting should have a greater communication budget, and our proposed client-adaptive quantization achieves this in a principled way. To this end, we introduce the expected variance of an accumulation of quantized parameters, $\mathbb{E}[\mathrm{Var}(\sum\textrm{Q}(p))]$, as a measure of the quantization error. Our client-adaptive quantization algorithm then assigns clients minimal quantization levels, subject to a fixed $\mathbb{E}[\mathrm{Var}(\sum\textrm{Q}(p))]$. This lowers the amount of data communicated from clients to the server, without increasing the quantization error.
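To make the error measure concrete, the following minimal Python sketch (our illustration, not code from the paper) estimates $\mathbb{E}[\mathrm{Var}(\sum_i w_i \textrm{Q}_{q_i}(p_i))]$ for an unbiased stochastic fixed-point quantizer by Monte Carlo sampling; with the weights and levels from Figure 1 it reproduces the reported values of roughly 0.0018 and 0.0017.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(p, q):
    """Unbiased stochastic fixed-point quantization of p in [0, 1] into q bins."""
    scaled = p * q
    low = np.floor(scaled)
    round_up = rng.random(p.shape) < (scaled - low)  # round up with prob. equal to the fractional part
    return (low + round_up) / q

def expected_var(levels, weights, n=200_000):
    """Monte Carlo estimate of E[Var(sum_i w_i Q_{q_i}(p_i))] for independent p_i ~ U[0, 1]."""
    total = 0.0
    for q, w in zip(levels, weights):
        p = rng.random(n)
        err = quantize(p, q) - p          # zero-mean quantization error of one client
        total += w**2 * np.mean(err**2)   # independent clients: variances add with weights squared
    return total

weights = [1/5, 4/5]
print(expected_var([8, 8], weights))  # ~0.0018, static quantization (Figure 1a)
print(expected_var([4, 9], weights))  # ~0.0017, client-adaptive quantization (Figure 1b)
```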

DAdaQuant (Doubly Adaptive Quantization) combines time- and client-adaptive quantization with an adaptation of the QSGD fixed-point quantization algorithm to achieve state-of-the-art FL uplink compression. In this paper, we make the following contributions:

  • We introduce the concept of client-adaptive quantization and develop algorithms for time- and client-adaptive quantization that are computationally efficient, empirically Pareto optimal and compatible with arbitrary FL quantizers. Our client-adaptive quantization is provably optimal for stochastic fixed-point quantizers.

  • We create Federated QSGD as an adaptation of the stochastic fixed-point quantizer QSGD that works with FL. Federated QSGD outperforms all other quantizers, establishing a strong baseline for FL compression with static quantization.

  • We combine time- and client-adaptive quantization into DAdaQuant. We demonstrate DAdaQuant’s state-of-the-art compression by empirically comparing it against several competitive FL compression algorithms.

2 Related Work

FL research has explored several approaches to reduce communication. We identify three general directions.

First, there is growing interest in FL algorithms that can converge in fewer rounds. FedAvg (McMahan et al., 2017) achieves this with prolonged local training, while FOLB (Nguyen et al., 2020) speeds up convergence through a more principled client sampling. Since communication is proportional to the number of training rounds, these algorithms effectively reduce communication.

Second, communication can be reduced by shrinking the model, since the amount of training communication is proportional to the model size. PruneFL (Jiang et al., 2019) progressively prunes the model over the course of training, while AFD (Bouacida et al., 2021) only trains submodels on clients.

Third, it is possible to directly compress FL training communication. FL compression algorithms typically apply techniques like top-k sparsification (Malekijoo et al., 2021; Rothchild et al., 2020) or quantization (Reisizadeh et al., 2019; Shlezinger et al., 2020) to parameter updates, optionally followed by lossless compression. Our work applies to quantization-based compression algorithms. It is partially based on QSGD (Alistarh et al., 2017), which combines lossy fixed-point quantization with a lossless compression algorithm to compress gradients communicated in distributed training. DAdaQuant adapts QSGD into Federated QSGD, which works with Federated Learning. DAdaQuant also draws inspiration from FedPAQ (Reisizadeh et al., 2019), the first FL framework to use lossy compression based on model parameter update quantization. However, FedPAQ does not explore the advantages of additional lossless compression or adaptive quantization. UVeQFed (Shlezinger et al., 2020) is an FL compression algorithm that generalizes scalar quantization to vector quantization and subsequently employs lossless compression with arithmetic coding. Like FedPAQ, UVeQFed also limits itself to a single static quantization level.

Faster convergence, model size reduction and communication compression are orthogonal techniques, so they can be combined for further communication savings. For this paper, we limit the scope of empirical comparisons to quantization-based FL compression algorithms.

Regarding quantization-based compression for model training, prior works have demonstrated that DNNs can be successfully trained at low precision (Banner et al., 2018; Gupta et al., 2015; Sun et al., 2019). There are also several adaptive quantization algorithms for training neural networks in a non-distributed setting. Shen et al. (2020) use different quantization levels for different parameters of a neural network. FracTrain (Fu et al., 2020) introduced multi-dimensional adaptive quantization by developing time-adaptive quantization and combining it with parameter-adaptive quantization. However, FracTrain uses the current loss to decide on the quantization level. FL generally can only compute local client losses, which are too noisy to be practical for FracTrain. AdaQuantFL introduces time-adaptive quantization to FL, but requires the global loss (Jhunjhunwala et al., 2021). To compute the global loss, AdaQuantFL has to communicate with every client each round. We show in Section 4.2 that this quickly becomes impractical as the number of clients grows. DAdaQuant’s time-adaptive quantization overcomes this issue without compromising on the underlying FL communication. In addition, to the best of our knowledge, DAdaQuant is the first algorithm to use client-adaptive quantization.

3 The DAdaQuant method

3.1 Federated Learning

Federated Learning assumes a client-server topology with a set ${\mathbb{C}}=\{c_{i}\,|\,i\in\{1,2,\ldots,N\}\}$ of $N$ clients that are connected to a single server. Each client $c_{k}$ has a local dataset $D_{k}$ from the local data distribution $\mathcal{D}_{k}$. Given a model $M$ with parameters ${\bm{p}}$, a loss function $f_{{\bm{p}}}(d\in D_{k})$ and the local loss $F_{k}({\bm{p}})=\frac{1}{|D_{k}|}\sum_{d\in D_{k}}f_{{\bm{p}}}(d)$, FL seeks to minimize the global loss $G({\bm{p}})=\sum_{k=1}^{N}\frac{|D_{k}|}{\sum_{l}|D_{l}|}F_{k}({\bm{p}})$.

3.2 Federated Averaging (FedAvg)

DAdaQuant makes only minimal assumptions about the FL algorithm. Crucially, DAdaQuant can complement FedAvg (McMahan et al., 2017), which is representative of a large class of FL algorithms.

FedAvg trains the model $M$ over several rounds. In each round $t$, FedAvg sends the model parameters ${\bm{p}}_{t}$ to a random subset ${\mathbb{S}}_{t}$ of $K$ clients who then optimize their local objectives $F_{k}({\bm{p}}_{t})$ and send the updated model parameters ${\bm{p}}_{t+1}^{k}$ back to the server. The server accumulates all parameters into the new global model ${\bm{p}}_{t+1}=\sum_{k\in{\mathbb{S}}_{t}}\frac{|D_{k}|}{\sum_{j}|D_{j}|}{\bm{p}}_{t+1}^{k}$ and starts the next round. Algorithm 1 lists FedAvg in detail. For our experiments, we use the FedProx (Li et al., 2018) adaptation of FedAvg. FedProx improves the convergence of FedAvg by adding the proximal term $\frac{\mu}{2}\|{\bm{p}}_{t+1}^{k}-{\bm{p}}_{t}\|^{2}$ to the local objective $F_{k}({\bm{p}}_{t+1}^{k})$ in Algorithm 1.
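As a concrete reference, here is a minimal PyTorch-style sketch (our own, with illustrative names such as fedprox_local_loss and aggregate) of the two operations described above: the FedProx local objective with its proximal term, and the server-side weighted accumulation.

```python
import torch

def fedprox_local_loss(model, global_params, batch, loss_fn, mu):
    """Local FedProx objective: task loss plus the proximal term (mu/2) * ||p - p_t||^2."""
    inputs, targets = batch
    task_loss = loss_fn(model(inputs), targets)
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    return task_loss + 0.5 * mu * prox

def aggregate(client_params, weights):
    """FedAvg accumulation: p_{t+1} = sum_k w_k p_{t+1}^k, applied per parameter tensor."""
    return [sum(w * params[i] for w, params in zip(weights, client_params))
            for i in range(len(client_params[0]))]
```

Here global_params is a list of the server's parameter tensors ${\bm{p}}_t$ and client_params holds one such list per sampled client; the weights are $w_k = |D_k| / \sum_j |D_j|$.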

3.3 Quantization with Federated QSGD

While DAdaQuant can be applied to any quantizer with a configurable quantization level, it is optimized for fixed-point quantization. We introduce Federated QSGD as a competitive fixed-point quantizer on top of which DAdaQuant is applied.

In general, fixed-point quantization uses a quantizer $\textrm{Q}_{q}$ with quantization level $q$ that splits $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{\leq 0}$ into $q$ intervals each. $\textrm{Q}_{q}(p)$ then returns the sign of $p$ and $|p|$ rounded to one of the endpoints of its encompassing interval. $\textrm{Q}_{q}({\bm{p}})$ quantizes the vector ${\bm{p}}$ elementwise.

We design DAdaQuant’s quantization stage based on QSGD, an efficient fixed-point quantizer for state-of-the-art gradient compression. QSGD quantizes a vector ${\bm{p}}$ in three steps:

  1. Quantize ${\bm{p}}$ as $\textrm{Q}_{q}(\frac{{\bm{p}}}{||{\bm{p}}||_{2}})$ into $q$ bins in $[0,1]$, storing signs and $||{\bm{p}}||_{2}$ separately. (lossy)

  2. Encode the resulting integers with zero run-length encoding. (lossless)

  3. Encode the resulting integers with Elias $\omega$ coding. (lossless)

QSGD has been designed specifically for quantizing gradients. This makes it not directly applicable to parameter compression. To overcome this limitation, we apply difference coding to uplink compression, first introduced to FL by FedPAQ. Each client $c_{k}$ applies $\textrm{Q}_{q}$ to the parameter updates ${\bm{p}}^{k}_{t+1}-{\bm{p}}_{t}$ (cf. Algorithm 1) and sends them to the server. The server keeps track of the previous parameters ${\bm{p}}_{t}$ and accumulates the quantized parameter updates into the new parameters as ${\bm{p}}_{t+1}={\bm{p}}_{t}+\sum_{k\in{\mathbb{S}}_{t}}\frac{|D_{k}|}{\sum_{l}|D_{l}|}\textrm{Q}_{q}({\bm{p}}^{k}_{t+1}-{\bm{p}}_{t})$ (cf. Algorithm 1). We find that QSGD works well with parameter updates, which can be regarded as accumulations of gradients over several training steps. We call this adaptation of QSGD Federated QSGD.
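The sketch below (our simplified Python illustration, not the paper's implementation) shows the lossy stage of such a quantizer applied to a parameter update, together with the server-side accumulation of quantized updates; the lossless run-length and Elias $\omega$ coding stages are omitted, and in practice a client would transmit the integer bins, signs and norm rather than the dequantized tensor.

```python
import torch

def quantize_update(update, q):
    """Stochastic fixed-point quantization of a parameter update into q bins per sign
    (lossy stage only; zero run-length and Elias omega coding are not shown)."""
    norm = update.norm(p=2)
    if norm == 0:
        return torch.zeros_like(update)
    scaled = update.abs() / norm * q                     # map |p| / ||p||_2 into [0, q]
    low = scaled.floor()
    round_up = torch.rand_like(scaled) < (scaled - low)  # stochastic rounding keeps the quantizer unbiased
    return update.sign() * (low + round_up) / q * norm   # dequantized value the server accumulates

def server_accumulate(prev_params, quantized_updates, weights):
    """p_{t+1} = p_t + sum_k w_k Q_q(p_{t+1}^k - p_t), for one parameter tensor."""
    return prev_params + sum(w * u for w, u in zip(weights, quantized_updates))
```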

3.4 Time-adaptive quantization

Time-adaptive quantization uses a different quantization level $q_{t}$ for each round $t$ of FL training. DAdaQuant chooses $q_{t}$ to minimize communication costs without sacrificing accuracy. To this end, we find that lower quantization levels suffice to initially reduce the loss, while partly trained models require higher quantization levels to further improve (as illustrated in Figure 2). FracTrain is built on similar observations for non-distributed training. Therefore, we design DAdaQuant to mimic FracTrain in monotonically increasing $q_{t}$ as a function of $t$ and using the training loss to inform increases in $q_{t}$.

When $q$ is too low, FL converges prematurely. Like FracTrain, DAdaQuant monitors the FL loss and increases $q$ when it converges. Unlike FracTrain, there is no single centralized loss function to evaluate, and unlike AdaQuantFL, we do not assume availability of the global training loss $G({\bm{p}}_{t})$. Instead, we estimate $G({\bm{p}}_{t})$ as the average local loss $\hat{G}_{t}=\sum_{k\in{\mathbb{S}}_{t}}\frac{|D_{k}|}{\sum_{l}|D_{l}|}F_{k}({\bm{p}}_{t})$, where ${\mathbb{S}}_{t}$ is the set of clients sampled at round $t$. Since ${\mathbb{S}}_{t}$ typically consists of only a small fraction of all clients, $\hat{G}_{t}$ is a very noisy estimate of $G({\bm{p}}_{t})$. This makes it unsuitable for convergence detection. Instead, DAdaQuant tracks a running average loss $\hat{\hat{G}}_{t}=\psi\hat{\hat{G}}_{t-1}+(1-\psi)\hat{G}_{t}$.

We initialize $q_{1}=q_{\text{min}}$ for some $q_{\text{min}}\in{\mathbb{N}}$. DAdaQuant determines training to have converged whenever $\hat{\hat{G}}_{t}\geq\hat{\hat{G}}_{t+1-\phi}$ for some $\phi\in{\mathbb{N}}$ that specifies the number of rounds across which we compare $\hat{\hat{G}}$. On convergence, DAdaQuant sets $q_{t}=2q_{t-1}$ and keeps the quantization level fixed for at least $\phi$ rounds to allow reductions in $G$ to manifest in $\hat{\hat{G}}$. Eventually, the training loss converges regardless of the quantization level. To avoid unconstrained quantization level increases on convergence, we limit the quantization level to $q_{\text{max}}$.

The following equation summarizes DAdaQuant’s time-adaptive quantization:

$$q_{t}\longleftarrow\begin{cases}q_{\text{min}}&t=0\\ 2q_{t-1}&t>0\text{ and }\hat{\hat{G}}_{t-1}\geq\hat{\hat{G}}_{t-\phi}\text{ and }t>\phi\text{ and }2q_{t-1}<q_{\text{max}}\text{ and }q_{t-1}=q_{t-\phi}\\ q_{t-1}&\text{else}\end{cases}$$
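A minimal Python sketch of this update rule follows (our own illustration; class and attribute names like TimeAdaptiveQuantization are hypothetical, and the convergence test is a simplification of the rule above):

```python
class TimeAdaptiveQuantization:
    """Track a running average of the sampled clients' average loss and
    double the quantization level q when that average stops improving."""

    def __init__(self, q_min=1, q_max=32, phi=50, psi=0.9):
        self.q, self.q_max, self.phi, self.psi = q_min, q_max, phi, psi
        self.history = []       # running-average losses, one entry per round
        self.last_increase = 0  # round of the most recent doubling of q

    def update(self, avg_client_loss, t):
        # Exponential running average of the noisy per-round loss estimate.
        smoothed = (avg_client_loss if not self.history
                    else self.psi * self.history[-1] + (1 - self.psi) * avg_client_loss)
        self.history.append(smoothed)
        # Double q if the smoothed loss has not improved over the last phi rounds,
        # q stayed fixed over that window, and the cap q_max is respected.
        if (t > self.phi
                and t - self.last_increase >= self.phi
                and self.history[-1] >= self.history[-1 - self.phi]
                and 2 * self.q <= self.q_max):
            self.q *= 2
            self.last_increase = t
        return self.q
```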
Round 1 2 3 4 5
Client Samples Quantization level
A 1    8
B 2    8    8    8    8
C 3    8    8    8
D 4    8    8
(a) Static quantization.
Round 1 2 3 4 5
Client Samples Quantization level
A 1    4
B 2    1    2    2    4
C 3    1    2    8
D 4    2    8
(b) Time-adaptive quantization.
Round 1 2 3 4 5
Client Samples Quantization level
A 1    6
B 2    7    6    7    9
C 3    9    9    7
D 4    9    9
(c) Client-adaptive quantization.
Round 1 2 3 4 5
Client Samples Quantization level
A 1    3
B 2    1    1    2    5
C 3    1    2    7
D 4    1    9
(d) Time-adaptive and client-adaptive quantization.
Figure 3: Exemplary quantization level assignment for 4 FL clients that train over 5 rounds. Each round, two clients get sampled for training.

3.5 Client-adaptive quantization

FL algorithms typically accumulate each parameter $p_{i}$ over all clients into a weighted average $p=\sum_{i=1}^{K}w_{i}p_{i}$ (see Algorithm 1). Quantized FL communicates and accumulates quantized parameters $\textrm{Q}_{q}(p)=\sum_{i=1}^{K}w_{i}\textrm{Q}_{q}(p_{i})$, where $q$ is the quantization level. We define the quantization error $e^{q}_{p}$ as $e^{q}_{p}=|p-\textrm{Q}_{q}(p)|$. We observe that $\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(\textrm{Q}_{q}(p))]$ is a useful statistic of the quantization error because it strongly correlates with the loss added by quantization. For a stochastic, unbiased fixed-point compressor like Federated QSGD, $\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(\textrm{Q}_{q}(p))]$ equals $\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q}_{p})]$ and can be evaluated analytically.

We observe in our experiments that the communication cost per client is roughly a linear function of Federated QSGD’s quantization level $q$. This means that the communication cost per round is proportional to $Q=Kq$. We call $Q$ the communication budget and use it as a proxy measure of communication cost.

Client-adaptive quantization dynamically adjusts the quantization level of each client. This means that even within a single round, each client $c_{k}$ can be assigned a different quantization level $q_{k}$. The communication budget of client-adaptive quantization is then $Q=\sum_{k=1}^{K}q_{k}$, and $\textrm{Q}_{q}(p)$ generalizes to $\textrm{Q}_{q_{1}\ldots q_{K}}(p)=\sum_{i=1}^{K}w_{i}\textrm{Q}_{q_{i}}(p_{i})$. We devise an algorithm that chooses $q_{k}$ to minimize $Q$ subject to $\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q_{1}\ldots q_{K}}_{p})]=\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q}_{p})]$ for a given $q$. Thus, our algorithm effectively minimizes communication costs while maintaining a quantization error similar to static quantization. Theorem 1 provides us with an analytical formula for the quantization levels $q_{1}\ldots q_{K}$.

Theorem 1.

Given parameters $p_{1}\ldots p_{K}\sim\mathcal{U}[-t,t]$ and quantization level $q$, the problem $\min_{q_{1}\ldots q_{K}}\sum_{i=1}^{K}q_{i}$ subject to $\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q_{1}\ldots q_{K}}_{p})]=\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q}_{p})]$ is solved by $q_{i}=\sqrt{\frac{a}{b}}\times w_{i}^{2/3}$, where $a=\sum_{j=1}^{K}w_{j}^{2/3}$ and $b=\sum_{j=1}^{K}\frac{w_{j}^{2}}{q^{2}}$.

DAdaQuant applies Theorem 1 to lower communication costs while maintaining the same loss as static quantization does with a fixed $q$. To ensure that quantization levels are natural numbers, DAdaQuant approximates the optimal real-valued solution as $q_{i}=\max(1,\text{round}(\sqrt{\frac{a}{b}}\times w_{i}^{2/3}))$. Appendix B gives a detailed proof of Theorem 1. To the best of our knowledge, DAdaQuant is the first algorithm to use client-adaptive quantization.
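For illustration, a short Python sketch of this assignment (our own; the function name is hypothetical). The second call reproduces the levels $q_A=4$, $q_B=9$ from Figure 1 for weights $(1/5, 4/5)$ and $q=8$; with equal weights every client simply receives the static level $q$.

```python
import numpy as np

def client_quantization_levels(weights, q):
    """Client-adaptive levels from Theorem 1: q_i = sqrt(a / b) * w_i^(2/3),
    with a = sum_j w_j^(2/3), b = sum_j w_j^2 / q^2, rounded to natural numbers."""
    w = np.asarray(weights, dtype=float)
    a = np.sum(w ** (2 / 3))
    b = np.sum(w ** 2) / q ** 2
    q_i = np.sqrt(a / b) * w ** (2 / 3)
    return np.maximum(1, np.round(q_i)).astype(int)

print(client_quantization_levels([0.25, 0.25, 0.25, 0.25], q=8))  # -> [8 8 8 8]
print(client_quantization_levels([1/5, 4/5], q=8))                # -> [4 9], as in Figure 1
```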

1  Function RunServer()
2      Initialize $w_{i}=\frac{|D_{i}|}{\sum_{j}|D_{j}|}$ for all $i\in[1,\ldots,N]$;
3      for $t=0,\dots,T-1$ do
4          Choose ${\mathbb{S}}_{t}\subset{\mathbb{C}}$ with $|{\mathbb{S}}_{t}|=K$, including each $c_{k}\in{\mathbb{C}}$ with uniform probability;
5          $q_{t}\longleftarrow\begin{cases}q_{\text{min}}&t=0\\ 2q_{t-1}&t>0\text{ and }\hat{\hat{G}}_{t-1}\geq\hat{\hat{G}}_{t-\phi}\text{ and }t>\phi\text{ and }q_{t}\leq q_{\text{max}}\text{ and }q_{t-1}=q_{t-\phi}\\ q_{t-1}&\text{else}\end{cases}$;
6          for $c_{k}\in{\mathbb{S}}_{t}$ do in parallel
7              $q_{t}^{k}\longleftarrow\max\left(1,\mathrm{round}\left(\sqrt{\sum_{j=1}^{K}w_{j}^{2/3}\Big/\sum_{j=1}^{K}\frac{w_{j}^{2}}{q_{t}^{2}}}\times w_{k}^{2/3}\right)\right)$;
8              Send($c_{k}$, ${\bm{p}}_{t}$, $q_{t}^{k}$);
9              Receive($c_{k}$, ${\bm{p}}_{t+1}^{k}$, $\hat{G}_{t}^{k}$);
10         end for
11         ${\bm{p}}_{t+1}\longleftarrow\sum_{k\in{\mathbb{S}}_{t}}w_{k}{\bm{p}}_{t+1}^{k}$;
12         $\hat{G}_{t}\longleftarrow\sum_{k\in{\mathbb{S}}_{t}}w_{k}\hat{G}_{t}^{k}$;
13         $\hat{\hat{G}}_{t}\longleftarrow\begin{cases}\hat{G}_{0}&t=0\\ \psi\hat{\hat{G}}_{t-1}+(1-\psi)\hat{G}_{t}&\text{else}\end{cases}$;
14     end for
15 end
16 Function RunClient($c_{k}$)
17     Receive(Server, ${\bm{p}}_{t}$, $q_{t}^{k}$);
18     $\hat{G}_{t}^{k}\longleftarrow F_{k}({\bm{p}}_{t})$;
19     ${\bm{p}}_{t+1}^{k}\longleftarrow{\bm{p}}_{t}$ trained with SGD on $F_{k}$ for $E$ epochs with learning rate $\eta$;
20     Send(Server, $\textrm{Q}_{q_{t}^{k}}({\bm{p}}_{t+1}^{k})$, $\hat{G}_{t}^{k}$);
21 end
Algorithm 1: The FedAvg and DAdaQuant algorithms. The plain lines list FedAvg. Adding the quantization, client-adaptive quantization and time-adaptive quantization steps creates DAdaQuant.

3.6 Doubly-adaptive quantization (DAdaQuant)

DAdaQuant combines the time-adaptive and client-adaptive quantization algorithms described in the previous sections. At each round $t$, time-adaptive quantization determines a preliminary quantization level $q_{t}$. Client-adaptive quantization then finds the client quantization levels $q_{t}^{k}$, $k\in\{1,\ldots,K\}$, that minimize $\sum_{i=1}^{K}q_{i}$ subject to $\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q_{1}\ldots q_{K}}_{p})]=\mathbb{E}_{p_{1}\ldots p_{K}}[\mathrm{Var}(e^{q}_{p})]$. Algorithm 1 lists DAdaQuant in detail. Figure 3 gives an example of how our time-adaptive, client-adaptive and doubly-adaptive quantization algorithms set quantization levels.
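Reusing the illustrative helpers from the earlier sketches (TimeAdaptiveQuantization and client_quantization_levels, which are our naming, not the paper's code), one round of DAdaQuant's level selection could look like this:

```python
controller = TimeAdaptiveQuantization(q_min=1, q_max=32, phi=50, psi=0.9)

def select_levels(t, avg_client_loss, sampled_weights):
    q_t = controller.update(avg_client_loss, t)              # time-adaptive level for round t
    return client_quantization_levels(sampled_weights, q_t)  # per-client levels for this round
```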

Reisizadeh et al. (2019) prove the convergence of FL with quantization for convex and non-convex cases as long as the quantizer Q is (1) unbiased and (2) has a bounded variance. These convergence results extend to DAdaQuant when combined with any quantizer that satisfies (1) and (2) for DAdaQuant’s minimum quantization level q=1q=1. Crucially, this includes Federated QSGD.

We highlight DAdaQuant’s low overhead and general applicability. The computational overhead is dominated by an additional evaluation epoch per round per client to compute $\hat{\hat{G}}_{t}$, which is negligible when training for many epochs per round. DAdaQuant can complement any FL algorithm that trains models over several rounds and accumulates a weighted average of client parameters. Most FL algorithms, including FedAvg, follow this design.

4 Experiments

4.1 Experimental details

Evaluation We use DAdaQuant with Federated QSGD to train different models with FedProx on different datasets for a fixed number of rounds. We monitor the test loss and accuracy at fixed intervals and measure uplink communication at every round across all devices.

Models & datasets We select a broad and diverse set of five models and datasets to demonstrate the general applicability of DAdaQuant. To this end, we use DAdaQuant to train a linear model, CNNs and LSTMs of varying complexity on a federated synthetic dataset (Synthetic), as well as two federated image datasets (FEMNIST and CelebA) and two federated natural language datasets (Sent140 and Shakespeare) from the LEAF (Caldas et al., 2018) project for standardized FL research. We refer to Section A.1 for more information on the models, datasets, training objectives and implementation.

System heterogeneity In practice, FL has to cope with clients that have different compute capabilities. We follow Li et al. (2018) and simulate this system heterogeneity by randomly reducing the number of epochs to $E^{\prime}$ for a random subset ${\mathbb{S}}_{t}^{\prime}\subset{\mathbb{S}}_{t}$ of clients at each round $t$, where $E^{\prime}$ is sampled from $[1,\ldots,E]$ and $|{\mathbb{S}}_{t}^{\prime}|=0.9K$.
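A small Python sketch of this simulation step (our own illustration; we draw $E^{\prime}$ independently per straggler, which is one of several reasonable readings of the setup above):

```python
import numpy as np

def assign_epochs(sampled_clients, E, rng=np.random.default_rng(0)):
    """Give a reduced epoch count E' ~ U{1, ..., E} to a random 90% of the sampled clients."""
    k = len(sampled_clients)
    stragglers = rng.choice(k, size=int(0.9 * k), replace=False)
    epochs = np.full(k, E)
    epochs[stragglers] = rng.integers(1, E + 1, size=len(stragglers))
    return dict(zip(sampled_clients, epochs))
```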

Baselines We compare DAdaQuant against competing quantization-based algorithms for FL parameter compression, namely Federated QSGD, FedPAQ (Reisizadeh et al., 2019), GZip with fixed-point quantization (FxPQ + GZip), UVeQFed (Shlezinger et al., 2020) and FP8. Federated QSGD (see section 3.3) is our most important baseline because it outperforms the other algorithms. FedPAQ only applies fixed-point quantization, which is equivalent to Federated QSGD without lossless compression. Similarly, FxPQ + GZip is equivalent to Federated QSGD with Gzip for its lossless compression stages. UVeQFed generalizes scalar quantization to vector quantization, followed by arithmetic coding. We apply UVeQFed with the optimal hyperparameters reported by its authors. FP8 (Wang et al., 2018a) is a floating-point quantizer that uses an 8-bit floating-point format designed for storing neural network gradients. We also evaluate all experiments without compression to establish an accuracy benchmark.

Hyperparameters With the exception of CelebA, all our datasets and models are also used by Li et al. (2018). We therefore adopt most of the hyperparameters from Li et al. (2018) and use LEAF’s hyperparameters for CelebA (Caldas et al., 2018). For all experiments, we sample 10 clients each round. We train Synthetic, FEMNIST and CelebA for 500 rounds each. We train Sent140 for 1000 rounds due to slow convergence and Shakespeare for 50 rounds due to rapid convergence. We use batch size 10, learning rates 0.01, 0.003, 0.3, 0.8, 0.1 and $\mu$ values (FedProx’s proximal term coefficient) 1, 1, 1, 0.001, 0 for Synthetic, FEMNIST, Sent140, Shakespeare and CelebA, respectively. We randomly split the local datasets into 80% training set and 20% test set.

To select the quantization level $q$ for static quantization with Federated QSGD, FedPAQ and FxPQ + GZip, we run a grid search over $q=1,2,4,8,\ldots$ and choose for each dataset the lowest $q$ for which Federated QSGD exceeds uncompressed training in accuracy. We set UVeQFed’s “coding rate” hyperparameter to $R=4$, which is the lowest value for which UVeQFed achieves negligible accuracy differences compared to uncompressed training. We set the remaining hyperparameters of UVeQFed to the optimal values reported by its authors. Section A.3 shows further experiments that compare against UVeQFed with $R$ chosen to maximize its compression factor.

For DAdaQuant’s time-adaptive quantization, we set $\psi$ to 0.9, $\phi$ to $1/10$th of the number of rounds and $q_{\textrm{max}}$ to the quantization level $q$ for each experiment. For Synthetic and FEMNIST, we set $q_{\textrm{min}}$ to 1. We find that Sent140, Shakespeare and CelebA require a high quantization level to achieve top accuracies and/or converge in few rounds. This prevents time-adaptive quantization from increasing the quantization level quickly enough, resulting in prolonged low-precision training that hurts model performance. To counter this effect, we set $q_{\textrm{min}}$ to $q_{\textrm{max}}/2$. This effectively results in binary time-adaptive quantization with an initial low-precision phase with $q=q_{\textrm{max}}/2$, followed by a high-precision phase with $q=q_{\textrm{max}}$.

4.2 Results

[Figure 4: plot of total client→server communication in Megabytes for AdaQuantFL and DAdaQuant over datasets with 10, 100, 200 and 400 clients.]
Figure 4: Comparison of AdaQuantFL and DAdaQuant. We plot the total client→server communication required to train an MLR model on synthetic datasets with 10, 100, 200 and 400 clients. AdaQuantFL’s communication increases linearly with the number of clients because it trains the model on all clients at each round. In contrast, DAdaQuant’s communication does not change with the number of clients.

We repeat the main experiments three times and report average results and their standard deviation (where applicable). Table 1 shows the highest accuracy and total communication for each experiment. Figure 5 plots the maximum accuracy achieved for any given amount of communication.

                          Synthetic                 FEMNIST                    Sent140
Uncompressed              78.3±0.3   12.2 MB        77.7±0.4   132.1 GB        69.7±0.5   43.9 GB
Federated QSGD            -0.1±0.1   17×            +0.7±0.5   2809×           -0.0±0.5   90×
FP8                       +0.1±0.4   4.0× (0.23×)   -0.1±0.4   4.0× (0.00×)    -0.2±0.5   4.0× (0.04×)
FedPAQ (FxPQ)             -0.1±0.1   6.4× (0.37×)   +0.7±0.5   11× (0.00×)     -0.0±0.5   4.0× (0.04×)
FxPQ + GZip               -0.1±0.1   14× (0.82×)    +0.6±0.2   1557× (0.55×)   -0.0±0.6   71× (0.79×)
UVeQFed                   -0.5±0.2   0.6× (0.03×)   -2.8±0.5   12× (0.00×)     +0.0±0.2   15× (0.16×)
DAdaQuant                 -0.2±0.4   48× (2.81×)    +0.7±0.1   4772× (1.70×)   -0.1±0.4   108× (1.19×)
DAdaQuant$_{\text{time}}$     -0.1±0.5   37× (2.16×)    +0.8±0.2   4518× (1.61×)   -0.1±0.6   93× (1.03×)
DAdaQuant$_{\text{clients}}$  +0.0±0.3   26× (1.51×)    +0.7±0.4   3017× (1.07×)   +0.1±0.6   105× (1.16×)

                          Shakespeare                CelebA
Uncompressed              49.9±0.3   267.0 MB       90.4±0.0   12.6 GB
Federated QSGD            -0.5±0.6   9.5×           -0.1±0.1   648×
FP8                       -0.2±0.4   4.0× (0.42×)   +0.0±0.1   4.0× (0.01×)
FedPAQ (FxPQ)             -0.5±0.6   3.2× (0.34×)   -0.1±0.1   6.4× (0.01×)
FxPQ + GZip               -0.5±0.6   9.3× (0.97×)   -0.1±0.2   494× (0.76×)
UVeQFed                   -0.0±0.4   7.9× (0.83×)   -0.4±0.3   31× (0.05×)
DAdaQuant                 -0.6±0.5   21× (2.21×)    -0.1±0.1   775× (1.20×)
DAdaQuant$_{\text{time}}$     -0.5±0.5   12× (1.29×)    -0.1±0.2   716× (1.10×)
DAdaQuant$_{\text{clients}}$  -0.4±0.5   16× (1.67×)    -0.1±0.0   700× (1.08×)
Table 1: Top-1 test accuracies and total client→server communication of all baselines, DAdaQuant, DAdaQuant$_{\text{time}}$ and DAdaQuant$_{\text{clients}}$. An entry "x±y  p× (q×)" denotes an accuracy difference of x% w.r.t. the uncompressed accuracy with a standard deviation of y%, a compression factor of p w.r.t. the uncompressed communication and a compression factor of q w.r.t. Federated QSGD.

Baselines

Table 1 shows that the accuracy of most experiments lies within the margin of error of the uncompressed experiments. This reiterates the viability of quantization-based compression algorithms for communication reduction in FL. For all experiments, Federated QSGD achieves a significantly higher compression factor than the other baselines. The authors of FedPAQ and UVeQFed also compare their methods against QSGD and report them as superior. However, FedPAQ is compared against “unfederated” QSGD that communicates gradients after each local training step, and UVeQFed is compared against QSGD without its lossless compression stages.

Time-adaptive quantization

The purely time-adaptive version of DAdaQuant, DAdaQuant$_{\textrm{time}}$, universally outperforms Federated QSGD and the other baselines in Table 1, achieving comparable accuracies while lowering communication costs. DAdaQuant$_{\textrm{time}}$ performs particularly well on Synthetic and FEMNIST, where it starts from the lowest possible quantization level $q=1$. However, binary time-adaptive quantization still measurably improves over QSGD for Sent140, Shakespeare and CelebA.

Figure 4 provides empirical evidence that AdaQuantFL’s communication scales linearly with the number of clients. As a result, AdaQuantFL is prohibitively expensive for datasets with thousands of clients such as Celeba and Sent140. DAdaQuant does not face this problem because its communication is unaffected by the number of clients.

[Figure 5 plots: accuracy in % vs. client→server communication in Megabytes for Synthetic, FEMNIST and Sent140, comparing Federated QSGD and DAdaQuant.]
Figure 5: Communication-accuracy trade-off curves for training on Synthetic, FEMNIST and Sent140 with Federated QSGD and DAdaQuant. We plot the average highest accuracies achieved up to any given amount of client→server communication. Section A.2 shows curves for all datasets, with similar results.

Client-adaptive quantization

The purely client-adaptive version of DAdaQuant, DAdaQuant$_{\textrm{clients}}$, also universally outperforms Federated QSGD and the other baselines in Table 1, achieving similar accuracies while lowering communication costs. Unsurprisingly, the performance of DAdaQuant$_{\textrm{clients}}$ is correlated with the coefficient of variation $c_{v}=\frac{\sigma}{\mu}$ of the numbers of samples in the local datasets, with mean $\mu$ and standard deviation $\sigma$: Synthetic ($c_{v}=3.3$) and Shakespeare ($c_{v}=1.7$) achieve significantly higher compression factors than Sent140 ($c_{v}=0.3$), FEMNIST ($c_{v}=0.4$) and CelebA ($c_{v}=0.3$).

DAdaQuant

DAdaQuant outperforms DAdaQuant$_{\textrm{time}}$ and DAdaQuant$_{\textrm{clients}}$ in communication while achieving similar accuracies. The compression factors of DAdaQuant are roughly multiplicative in those of DAdaQuant$_{\textrm{clients}}$ and DAdaQuant$_{\textrm{time}}$. This demonstrates that we can effectively combine time- and client-adaptive quantization for maximal communication savings.

Pareto optimality

Figure 5 shows that DAdaQuant achieves a higher accuracy than the strongest baseline, Federated QSGD, for any fixed amount of communication. This means that DAdaQuant is Pareto optimal for the datasets we have explored.

5 Conclusion

We introduced DAdaQuant as a computationally efficient and robust algorithm to boost the performance of quantization-based FL compression algorithms. We showed intuitively and mathematically how DAdaQuant’s dynamic adjustment of the quantization level across time and clients minimizes client→server communication while maintaining convergence speed. Our experiments establish DAdaQuant as nearly universally superior to static quantizers, achieving state-of-the-art compression factors when applied to Federated QSGD. The communication savings of DAdaQuant effectively lower FL bandwidth usage, energy consumption and training time. Future work may apply and adapt DAdaQuant to new quantizers, further pushing the state of the art in FL uplink compression.

6 Reproducibility Statement

Our submission includes a repository with the source code for DAdaQuant and for the experiments presented in this paper. All the datasets used in our experiments are publicly available. Any post-processing steps of the datasets are described in Section A.1. To facilitate the reproduction of our results, we have bundled all our source code, dependencies and datasets into a Docker image. The repository submitted with this paper contains instructions on how to use this Docker image and reproduce all plots and tables in this paper.

7 Ethics Statement

FL trains models on private client datasets in a privacy-preserving manner. However, FL does not completely eliminate privacy concerns, because the transmitted model updates and the learned model parameters may expose the private client data from which they are derived. Our work does not directly target privacy concerns in FL. With that said, it is worth noting that DAdaQuant does not expose any client data that is not already exposed through standard FL training algorithms. In fact, DAdaQuant reduces the amount of exposed data through lossy compression of the model updates. We therefore believe that DAdaQuant is free of ethical complications.

References

  • Alistarh et al. (2017) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp.  1709–1720. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/6768-qsgd-communication-efficient-sgd-via-gradient-quantization-and-encoding.pdf.
  • Amiri et al. (2020) Mohammad Mohammadi Amiri, Deniz Gunduz, Sanjeev R. Kulkarni, and H. Vincent Poor. Federated learning with quantized global model updates. arXiv:2006.10672, 2020.
  • Banner et al. (2018) Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp.  5151–5159, 2018.
  • Beutel et al. (2020) Daniel J. Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D. Lane. Flower: A friendly Federated Learning research framework. arXiv:2007.14390, 2020.
  • Bouacida et al. (2021) Nader Bouacida, Jiahui Hou, Hui Zang, and Xin Liu. Adaptive Federated Dropout: Improving communication efficiency and generalization for federated learning. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp.  1–6. IEEE, 2021.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Caldas et al. (2018) Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. arXiv:1812.01097, 2018.
  • Cao et al. (2017) Bokai Cao, Lei Zheng, Chenwei Zhang, Philip S Yu, Andrea Piscitello, John Zulueta, Olu Ajilore, Kelly Ryan, and Alex D Leow. DeepMood: modeling mobile phone typing dynamics for mood detection. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.  747–755, 2017.
  • Fu et al. (2020) Yonggan Fu, Haoran You, Yang Zhao, Yue Wang, Chaojian Li, Kailash Gopalakrishnan, Zhangyang Wang, and Yingyan Lin. FracTrain: Fractionally squeezing bit savings both temporally and spatially for efficient DNN training. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  12127–12139. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8dc5983b8c4ef1d8fcd5f325f9a65511-Paper.pdf.
  • Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International conference on machine learning, pp. 1737–1746. PMLR, 2015.
  • Jhunjhunwala et al. (2021) Divyansh Jhunjhunwala, Advait Gadhikar, Gauri Joshi, and Yonina C Eldar. Adaptive quantization of model updates for communication-efficient federated learning. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  3110–3114. IEEE, 2021.
  • Jiang et al. (2019) Yuang Jiang, Shiqiang Wang, Bong Jun Ko, Wei-Han Lee, and Leandros Tassiulas. Model pruning enables efficient Federated Learning on edge devices. arXiv:1909.12326, 2019.
  • Li et al. (2018) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv:1812.06127, 2018.
  • Malekijoo et al. (2021) Amirhossein Malekijoo, Mohammad Javad Fadaeieslam, Hanieh Malekijou, Morteza Homayounfar, Farshid Alizadeh-Shabdiz, and Reza Rawassizadeh. FEDZIP: A compression framework for communication-efficient Federated Learning. arXiv:2102.01593, 2021.
  • McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp.  1273–1282. PMLR, 2017.
  • Nguyen et al. (2020) Hung T Nguyen, Vikash Sehwag, Seyyedali Hosseinalipour, Christopher G Brinton, Mung Chiang, and H Vincent Poor. Fast-convergent federated learning. IEEE Journal on Selected Areas in Communications, 39(1):201–218, 2020.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp.  8024–8035. Curran Associates, Inc., 2019. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
  • Qiu et al. (2020) Xinchi Qiu, Titouan Parcolle, Daniel J Beutel, Taner Topa, Akhil Mathur, and Nicholas D Lane. A first look into the carbon footprint of federated learning. arXiv preprint arXiv:2010.06537, 2020.
  • Reisizadeh et al. (2019) Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. FedPAQ: A communication-efficient Federated Learning method with periodic averaging and quantization. arXiv:1909.13014, 2019.
  • Rothchild et al. (2020) Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. FetchSGD: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pp. 8253–8265. PMLR, 2020.
  • Shen et al. (2020) Jianghao Shen, Yue Wang, Pengfei Xu, Yonggan Fu, Zhangyang Wang, and Yingyan Lin. Fractional skipping: Towards finer-grained dynamic CNN inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  5700–5708, 2020.
  • Shi & Dustdar (2016) Weisong Shi and Schahram Dustdar. The promise of edge computing. Computer, 49(5):78–81, 2016.
  • Shlezinger et al. (2020) Nir Shlezinger, Mingzhe Chen, Yonina C Eldar, H Vincent Poor, and Shuguang Cui. UVeQFed: Universal vector quantization for federated learning. IEEE Transactions on Signal Processing, 69:500–514, 2020.
  • Speedtest. Speedtest global index. https://www.speedtest.net/global-index. Accessed: 2021-05-12.
  • Sun et al. (2019) Xiao Sun, Jungwook Choi, Chia-Yu Chen, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Xiaodong Cui, Wei Zhang, and Kailash Gopalakrishnan. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp.  4900–4909, 2019.
  • Voigt & Von dem Bussche (2017) Paul Voigt and Axel Von dem Bussche. The EU general data protection regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10:3152676, 2017.
  • Wang et al. (2018a) Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and Kailash Gopalakrishnan. Training deep neural networks with 8-bit floating point numbers. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp.  7675–7684. Curran Associates, Inc., 2018a. URL http://papers.nips.cc/paper/7994-training-deep-neural-networks-with-8-bit-floating-point-numbers.pdf.
  • Wang et al. (2018b) Pan Wang, Feng Ye, and Xuejiao Chen. A smart home gateway platform for data collection and awareness. IEEE Communications magazine, 56(9):87–93, 2018b.

Appendix A Additional simulation details and experiments

A.1 Additional simulation details

Here, we give detailed information on the models, datasets, training objectives and implementation that we use for our experiments. We set the five following FL tasks:

  • Multinomial logistic regression (MLR) on a synthetic dataset called Synthetic that contains vectors in $\mathbb{R}^{60}$ with a label of one out of 10 classes. We use the synthetic dataset generator in Li et al. (2018) to generate synthetic datasets. The generator samples Synthetic’s local datasets and labels from MLR models with randomly initialized parameters. For this purpose, parameters $\alpha$ and $\beta$ control different kinds of data heterogeneity: $\alpha$ controls the variation in the local models from which the local dataset labels are generated, and $\beta$ controls the variation in the local dataset samples. We set $\alpha=1$ and $\beta=1$ to simulate an FL setting with both kinds of data heterogeneity. This makes Synthetic a useful testbed for FL.

  • Character classification into 62 classes of handwritten characters from the FEMNIST dataset using a CNN. FEMNIST groups samples from the same author into the same local dataset.

  • Smile detection in facial images from the CelebA dataset using a CNN. CelebA groups samples of the same person into the same local dataset. We note that LEAF’s CNN for CelebA uses BatchNorm layers. We replace them with LayerNorm layers because they are more amenable to quantization. This change does not affect the final accuracy.

  • Binary sentiment analysis of tweets from the Sent140 dataset using an LSTM. Sent140 groups tweets from the same user into the same local dataset. The majority of local datasets in the raw Sent140 dataset only have a single sample. This impedes FL convergence. Therefore, we filter Sent140 to clients with at least 10 samples (i.e. one complete batch). Caldas et al. (2018); Li et al. (2018) similarly filter Sent140 for their FL experiments.

  • Next character prediction on text snippets from the Shakespeare dataset of Shakespeare’s collected plays using an LSTM. Shakespeare groups lines from the same character into the same local dataset.

Table 2 provides statistics of our models and datasets.

For our experiments in Figure 4, AdaQuantFL requires a hyperparameter $s$ that determines the initial quantization level. We set $s$ to 2, the optimal value reported by the authors of AdaQuantFL. The remaining hyperparameters are identical to those used for the Synthetic dataset experiments in Section 4.2.

We implement the models with PyTorch (Paszke et al., 2019) and use Flower (Beutel et al., 2020) to simulate the FL server and clients.

Dataset      Model         Parameters          Clients   Samples     Samples per client (mean / min / max / stddev)
Synthetic    MLR           610                 30        9,600       320.0 / 45 / 5,953 / 1051.6
FEMNIST      2-layer CNN   $6.6\times 10^{6}$  3,500     785,582     224.1 / 19 / 584 / 87.8
CelebA       4-layer CNN   $6.3\times 10^{5}$  9,343     200,288     21.4 / 5 / 35 / 7.6
Sent140      2-layer LSTM  $1.1\times 10^{6}$  21,876    430,707     51.1 / 10 / 549 / 17.1
Shakespeare  2-layer LSTM  $1.3\times 10^{5}$  1,129     4,226,158   3743 / 3 / 66,903 / 6212
Table 2: Statistics of the models and datasets used for evaluation. MLR stands for “Multinomial Logistic Regression”.

A.2 Complete communication-accuracy trade-off curves

[Figure 6 plots: accuracy in % vs. communication in Megabytes for Synthetic, FEMNIST, Sent140, Shakespeare and CelebA, comparing Federated QSGD and DAdaQuant.]
Figure 6: Communication-accuracy trade-off curves for QSGD and DAdaQuant. We plot the average highest accuracies achieved up to any given amount of communication.

A.3 Additional UVeQFed experiments

To demonstrate that the choice of UVeQFed’s “coding rate” hyperparameter $R$ does not affect our findings on the superior compression factors of DAdaQuant, we re-evaluate UVeQFed with $R=1$, which maximizes UVeQFed’s compression factor. Our results in Table 3 show that with the exception of Shakespeare, DAdaQuant still achieves considerably higher compression factors than UVeQFed.

                 Synthetic       FEMNIST          Sent140         Shakespeare     CelebA
Uncompressed     12.2 MB         132.1 GB         43.9 GB         267.0 MB        12.6 GB
QSGD             17×             2809×            90×             9.5×            648×
UVeQFed (R=4)    0.6× (0.03×)    12× (0.00×)      15× (0.16×)     7.9× (0.83×)    31× (0.05×)
UVeQFed (R=1)    13× (0.77×)     34× (0.01×)      41× (0.45×)     21× (2.22×)     93× (0.14×)
DAdaQuant        48× (2.81×)     4772× (1.70×)    108× (1.19×)    21× (2.21×)     775× (1.20×)
Table 3: Comparison of the compression factors of DAdaQuant, UVeQFed with $R=4$ (the default value used for our experiments in Section 4.2) and UVeQFed with $R=1$ (which maximizes UVeQFed's compression factor). An entry "p× (q×)" denotes a compression factor of p w.r.t. the uncompressed communication and a compression factor of q w.r.t. Federated QSGD.

Appendix B Proofs

Lemma 1.

For arbitrary $t>0$ and parameter $p_i\in[-t,t]$, let $s_i=\frac{t}{q_i}$, $b_i=\mathrm{rem}(p_i,s_i)$ and $u_i=s_i-b_i$. Then, $\mathrm{Var}\left(\mathrm{Q}_{q_i}(p_i)\right)=u_ib_i$.

Proof.
\begin{align*}
\mathrm{Var}\left(\mathrm{Q}_{q_i}(p_i)\right)
&= \mathrm{E}\left[\left(\mathrm{Q}_{q_i}(p_i)-\mathrm{E}\left[\mathrm{Q}_{q_i}(p_i)\right]\right)^2\right] \\
&= \mathrm{E}\left[\left(\mathrm{Q}_{q_i}(p_i)-p_i\right)^2\right] && \text{$\mathrm{Q}_{q_i}(p_i)$ is an unbiased estimator of $p_i$} \\
&= \frac{b_i}{s_i}u_i^2+\frac{u_i}{s_i}b_i^2 && \text{cf. Figure 7} \\
&= \frac{u_ib_i}{s_i}\left(u_i+b_i\right) \\
&= u_ib_i \qquad\qed
\end{align*}
[Figure 7: A number line showing $p_i$ lying between the quantization points $cs_i$ and $(c+1)s_i$, with $P\left(\mathrm{Q}_{q_i}(p_i)=cs_i\right)=\frac{u_i}{s_i}$ and $P\left(\mathrm{Q}_{q_i}(p_i)=(c+1)s_i\right)=\frac{b_i}{s_i}$.]
Figure 7: Illustration of the Bernoulli random variable $\mathrm{Q}_{q_i}(p_i)$ in Lemma 1. $s_i$ is the length of the quantization interval. $p_i$ is rounded up to $(c+1)s_i$ with a probability proportional to its distance from $cs_i$.
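As a quick numerical sanity check of Lemma 1, the following sketch implements a stochastic fixed-point quantizer consistent with Figure 7 (not necessarily the exact implementation used in our experiments) and compares the empirical mean and variance against $p_i$ and $u_ib_i$.

# Quick Monte Carlo check of Lemma 1 (sketch; the quantizer below follows Figure 7
# and is not necessarily the exact implementation used in the experiments).
import numpy as np

rng = np.random.default_rng(0)

def quantize(p, q, t, size):
    s = t / q                        # quantization interval length s_i
    c = np.floor(p / s)              # index of the lower grid point
    b = p - c * s                    # distance to the lower grid point (b_i)
    round_up = rng.random(size) < b / s
    return (c + round_up) * s        # round up with probability b_i / s_i

t, q, p = 1.0, 8, 0.3
s = t / q
b = p - np.floor(p / s) * s
u = s - b
samples = quantize(p, q, t, size=1_000_000)
print(samples.mean(), p)             # unbiasedness: E[Q(p)] = p
print(samples.var(), u * b)          # variance matches u * b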
Lemma 2.

Let $\mathrm{Q}$ be a fixed-point quantizer. Assume that parameters $p_1\ldots p_K$ are sampled from $\mathcal{U}[-t,t]$ for arbitrary $t>0$. Then, $\mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q_1\ldots q_K}_{p}\right)\right]=\frac{t^2}{6}\sum_{i=1}^{K}\frac{w_i^2}{q_i^2}$.

Proof.
\begin{align*}
\mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q_1\ldots q_K}_{p}\right)\right]
&= \frac{1}{2t}\int_{-t}^{t}\ldots\frac{1}{2t}\int_{-t}^{t}\mathrm{Var}\left(\sum_{i=1}^{K}w_i\mathrm{Q}_{q_i}(p_i)-p\right)dp_1\ldots dp_K \\
&= \frac{1}{t}\int_{0}^{t}\ldots\frac{1}{t}\int_{0}^{t}\mathrm{Var}\left(\sum_{i=1}^{K}w_i\mathrm{Q}_{q_i}(p_i)-p\right)dp_1\ldots dp_K && \text{symmetry of $\mathrm{Q}_{q_i}(p_i)$ w.r.t. negation} \\
&= \frac{1}{t^K}\int_{0}^{t}\ldots\int_{0}^{t}\sum_{i=1}^{K}w_i^2\,\mathrm{Var}\left(\mathrm{Q}_{q_i}(p_i)\right)dp_1\ldots dp_K && \text{mutual independence of $\mathrm{Q}_{q_i}(p_i)~\forall i$} \\
&= \frac{1}{t^K}\sum_{i=1}^{K}\int_{0}^{t}\ldots\int_{0}^{t}w_i^2\,\mathrm{Var}\left(\mathrm{Q}_{q_i}(p_i)\right)dp_1\ldots dp_K && \text{exchangeability of finite sums and integrals} \\
&= \frac{1}{t^K}\sum_{i=1}^{K}t^{K-1}\int_{0}^{t}w_i^2\,\mathrm{Var}\left(\mathrm{Q}_{q_i}(p_i)\right)dp_i \\
&= \frac{1}{t}\sum_{i=1}^{K}w_i^2\int_{0}^{t}\mathrm{Var}\left(\mathrm{Q}_{q_i}(p_i)\right)dp_i \\
&= \frac{1}{t}\sum_{i=1}^{K}w_i^2\int_{0}^{t}u_ib_i\,dp_i && \text{Lemma 1} \\
&= \frac{1}{t}\sum_{i=1}^{K}w_i^2\,q_i\int_{0}^{s_i}u_ib_i\,dp_i && \text{$s_i$-periodicity of $u_i$ and $b_i$} \\
&= \frac{1}{t}\sum_{i=1}^{K}w_i^2\,q_i\int_{0}^{s_i}\left(s_i-p_i\right)p_i\,dp_i \\
&= \frac{1}{6t}\sum_{i=1}^{K}w_i^2\,q_i\,s_i^3 \\
&= \frac{t^2}{6}\sum_{i=1}^{K}\frac{w_i^2}{q_i^2} \qquad\qed
\end{align*}
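The closed form in Lemma 2 can also be checked numerically. The sketch below averages the per-draw variance $\sum_i w_i^2 u_ib_i$ from Lemma 1 over random draws $p_i\sim\mathcal{U}[-t,t]$ and compares it with $\frac{t^2}{6}\sum_i w_i^2/q_i^2$; the weights and quantization levels are arbitrary illustrative values.

# Numerical check of Lemma 2 (sketch; weights and levels are illustrative):
# average the per-draw variance sum_i w_i^2 * u_i * b_i (Lemma 1) over
# p_i ~ U[-t, t] and compare with the closed form (t^2/6) * sum_i w_i^2 / q_i^2.
import numpy as np

rng = np.random.default_rng(0)
t = 1.0
w = np.array([0.5, 0.3, 0.2])   # client weights
q = np.array([4, 8, 16])        # per-client quantization levels

s = t / q                                     # quantization interval lengths
p = rng.uniform(-t, t, size=(1_000_000, len(w)))
b = np.mod(np.abs(p), s)                      # |p_i| mod s_i (variance is symmetric in p_i)
u = s - b
empirical = np.mean(np.sum(w**2 * u * b, axis=1))
closed_form = t**2 / 6 * np.sum(w**2 / q**2)
print(empirical, closed_form)                 # should agree closely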

Lemma 3.

Let $\mathrm{Q}$ be a fixed-point quantizer. Assume that parameters $p_1\ldots p_K$ are sampled from $\mathcal{U}[-t,t]$ for arbitrary $t>0$. Then, subject to $\sum_{i=1}^{K}q_i=Q$, the expected variance $\mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q_1\ldots q_K}_{p}\right)\right]$ is minimized by $q_i=Q\frac{w_i^{2/3}}{\sum_{k=1}^{K}w_k^{2/3}}$.

Proof.

Define

\begin{align*}
f({\bm{q}}) &= \mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q_1\ldots q_K}_{p}\right)\right] \\
g({\bm{q}}) &= \sum_{i=1}^{K}q_i \\
\mathcal{L}({\bm{q}}) &= f({\bm{q}})-\lambda g({\bm{q}}) && \text{(Lagrangian)}
\end{align*}

Any (local) minimum $\hat{{\bm{q}}}$ satisfies

\begin{align*}
& \nabla\mathcal{L}\left(\hat{{\bm{q}}}\right)=\mathbf{0} \\
\iff\quad & \nabla\frac{t^2}{6}\sum_{i=1}^{K}\frac{w_i^2}{q_i^2}-\lambda\nabla\sum_{i=1}^{K}q_i=0 \;\wedge\; \sum_{i=1}^{K}q_i=Q && \text{Lemma 2} \\
\iff\quad & \forall i=1\ldots K.\;-\frac{t^2}{3}\frac{w_i^2}{q_i^3}=\lambda \;\wedge\; \sum_{i=1}^{K}q_i=Q \\
\iff\quad & \forall i=1\ldots K.\;q_i=\sqrt[3]{-\frac{t^2}{3\lambda}w_i^2} \;\wedge\; \sum_{i=1}^{K}q_i=Q \\
\implies\quad & \forall i=1\ldots K.\;q_i=Q\frac{w_i^{2/3}}{\sum_{j=1}^{K}w_j^{2/3}} \qquad\qed
\end{align*}
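The allocation in Lemma 3 can be illustrated numerically: for a fixed total budget of levels, distributing them proportionally to $w_i^{2/3}$ gives a lower expected variance than, e.g., an equal split. The sketch below treats the $q_i$ as continuous, as in the lemma; the weights and budget are arbitrary illustrative values.

# Sketch illustrating Lemma 3 (illustrative weights/budget; q_i treated as continuous):
# a budget Q of levels allocated proportionally to w_i^(2/3) yields a lower expected
# error variance (t^2/6 * sum_i w_i^2 / q_i^2, Lemma 2) than an equal split.
import numpy as np

def expected_variance(w, q, t=1.0):
    return t**2 / 6 * np.sum(w**2 / q**2)

w = np.array([0.6, 0.25, 0.1, 0.05])      # heterogeneous client weights
Q = 40.0                                  # total budget of quantization levels

q_opt = Q * w**(2/3) / np.sum(w**(2/3))   # allocation from Lemma 3
q_uniform = np.full_like(w, Q / len(w))   # equal split of the same budget

print(expected_variance(w, q_opt))        # lower
print(expected_variance(w, q_uniform))    # higher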

B.1 Proof of Theorem 1

Proof.

Using Lemma 3, it is straightforward to show that for any $V$, the total budget $\sum_{i=1}^{K}q_i$ subject to $\mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q_1\ldots q_K}_{p}\right)\right]=V$ is minimized by $q_i=Cw_i^{2/3}$ for the unique $C\in\mathbb{R}_{>0}$ that satisfies this constraint.

Then, taking $V=\mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q}_{p}\right)\right]$ and $C=\sqrt{\frac{a}{b}}$ (see Theorem 1), we do indeed get

\begin{align*}
\mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q_1\ldots q_K}_{p}\right)\right]
&= \frac{t^2}{6}\sum_{i=1}^{K}\frac{w_i^2}{\left(Cw_i^{2/3}\right)^2} && \text{Lemma 2} \\
&= \frac{1}{C^2}\,\frac{t^2}{6}\sum_{i=1}^{K}w_i^{2/3} \\
&= \frac{\sum_{j=1}^{K}\frac{w_j^2}{q^2}}{\sum_{j=1}^{K}w_j^{2/3}}\,\frac{t^2}{6}\sum_{i=1}^{K}w_i^{2/3} \\
&= \frac{t^2}{6}\sum_{j=1}^{K}\frac{w_j^2}{q^2} \\
&= \mathbb{E}_{p_1\ldots p_K}\left[\mathrm{Var}\left(e^{q}_{p}\right)\right] && \text{Lemma 2} \qquad\qed
\end{align*}
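As a numerical sanity check of this derivation (a sketch only; the constant $C$ is written out explicitly below rather than through the $a$ and $b$ of Theorem 1), choosing $q_i=Cw_i^{2/3}$ with $C^2=\sum_j w_j^{2/3}\big/\sum_j \frac{w_j^2}{q^2}$ reproduces the expected error variance of a fixed level $q$ while using a total budget $\sum_i q_i\le Kq$.

# Numerical check of the derivation above (sketch; weights and q are illustrative):
# q_i = C * w_i^(2/3) with C^2 = sum_j w_j^(2/3) / sum_j (w_j^2 / q^2) matches the
# expected variance of a fixed level q (Lemma 2) with a total budget of at most K*q.
import numpy as np

def expected_variance(w, q, t=1.0):
    return t**2 / 6 * np.sum(w**2 / np.asarray(q, dtype=float)**2)

w = np.array([0.4, 0.3, 0.2, 0.1])
q_fixed = 16.0                                          # static level for all clients

C = np.sqrt(np.sum(w**(2/3)) / np.sum(w**2 / q_fixed**2))
q_adaptive = C * w**(2/3)

print(expected_variance(w, np.full_like(w, q_fixed)))   # variance with static q
print(expected_variance(w, q_adaptive))                 # identical variance
print(len(w) * q_fixed, q_adaptive.sum())               # adaptive budget <= K*q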