This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\@ACM@journal@bibstripfalse

LOCKS: User Differentially Private and Federated Optimal Client Sampling

Ajinkya K Mulay 1234-5678-9012 Purdue UniversityWest LafayetteINUSA [email protected]
Abstract.

With changes in privacy laws, there is often a hard requirement for client data to remain on the device rather than being sent to the server. Therefore, most processing happens on the device, and only an altered element is sent to the server. Such mechanisms are developed by leveraging differential privacy and federated learning. Differential privacy adds noise to the client outputs and thus deteriorates the quality of each iteration. This distributed setting adds a layer of complexity and additional communication and performance overhead. These costs are additive per round, so we need to reduce the number of iterations. In this work, we provide an analytical framework for studying the convergence guarantees of gradient-based distributed algorithms. We show that our private algorithm minimizes the expected gradient variance in d2\approx d^{2} rounds, where dd is the dimensionality of the model. We discuss and suggest novel ways to improve the convergence rate to minimize the overhead using Importance Sampling (IS) and gradient diversity. Finally, we provide alternative frameworks that might be better suited to exploit client sampling techniques like IS and gradient diversity.

local differential privacy, high-dimensional linear models, federated learning
journalyear: YYYYjournalvolume: YYYYjournalnumber: Xdoi: XXXXXXX.XXXXXXX

1. Introduction

Stochastic Gradient Descent (SGD) functions by randomly selecting a subset of data points per round and computing their gradients. The update step in SGD uses the average gradient to modify the weights from the previous round. Optimal model weights can improve the trained model and its inference capabilities on unseen data. For high-dimensional non-convex parameter paces, obtaining optimal models is non-trivial and computationally heavy. Thus, improving the quality of gradients and accelerating the training process are fundamental goals for improving gradient algorithms.

However, as we often see in practice, the information assimilated in each gradient can vary widely. After the first few training epochs, the gradients of specific data points lose their value. (Katharopoulos and Fleuret, 2018). We could select data points randomly or by batching the entire training set per round. Both methods are inefficient and do not consider the information contributed by the gradients or the data points.

A way to assess the importance of each data point could be based on the magnitude of the gradient or loss value contributed by the data point (Loshchilov and Hutter, 2015), (Alain et al., 2015). Gradient magnitude refers to metrics like the L1L_{1}-norm or L2L_{2}-norm computed on individual gradients. Larger magnitudes suggest higher levels of information in gradients. For high-dimensional parameter models, gradient norms can be costly to compute. Computing loss values is straightforward but needs to reflect the gradient update accurately. In this work, we, therefore, focus on the magnitude of the gradient update. Such methods are based on Importance Sampling (IS) algorithms.

The rationale behind randomly picking data points per round is to obtain an unbiased estimate of the actual gradient. However, this unbiasedness comes at the cost of higher gradient variance. In such cases, Stochastic Variance Reduction Gradient (SVRG) techniques have been suggested to reduce the variance of the gradient.

Besides the gradient magnitude, we should prioritize diverse gradients. If two gradients are similar in the form of some similarity metric (ex., cosine similarity), then we may only need to use such a gradient. Gradient diversity should therefore be another basis for client selection.

This work further focuses on distributed and private machine-learning settings. Big data technologies have enabled massive leaps in predictive learning. However, these techniques commonly access private personal client data that can potentially cause severe harm to the owner if leaked. Such leaks are preventable under Federated Learning (FL) and Differential Privacy (DP), allowing owners tighter ownership over their data. For vanilla SGD, there are already techniques like FedAvg (Konečnỳ et al., 2016) and DP-SGD (Abadi et al., 2016) that can partially or fully protect private data. However, these algorithms’ privacy and convergence analysis cannot be directly applied to the algorithms proposed in this work.

In this work, we study the problem of partial client availability under Federated Learning under differential privacy. We analyze the expected gradient variance due to uniform sampling and the noise added by DP with adaptive server learning rates. As per our knowledge, such an exact theoretical analysis does not exist. Therefore, we provide analytical results and convergence guarantees in Theorem 5.1 in terms of the required iteration. We also suggested a modified approach that allows for lower communication rounds by simple scaling. We provide a pseudo-code for the algorithm suggested. We motivate the identification of other frameworks due to significant drawbacks to the one suggested in the reference papers.

2. Motivation

Federated Learning involves training machine learning models on individual clients rather than sending the entire dataset to a central server. Clients process and train their datasets with an SGD variant on-device. At each round, the server selects a subset of its client for training. Clients only send their computed gradient to the server. The server computes the average gradient over these client gradients and updates the model. After each round, the server sends back the updated model. These rounds are repeated until the model converges, similar to vanilla training. With more rounds, the communication cost incurred by the clients and the server substantially increases. Also, with limited computational resources and data, the clients must train longer and for more rounds. Therefore, we prefer to lower the number of training rounds.

Differential Privacy (DP) is a mathematical framework over queries to a dataset that limits the leakage of a single client or data point in a dataset. However, even though we wish to learn aggregated values, the query outputs can disclose information about a particular individual in the dataset. Combining auxiliary public data with aggregated histograms or anonymized datasets can help us extract exact individual identities (Narayanan and Shmatikov, 2006). DP adds noise or performs specific sampling to limit the impact of a single client or data point on the query outputs, thus protecting individual privacy. When accessing data over multiple rounds, DP needs to add noise at each stage; thus, the DP cost is proportional to the number of rounds. Higher DP costs lead to practically trivial privacy for real-world applications. Thus, reducing the number of rounds is crucial to ensuring high privacy standards.

Therefore, an accelerated SGD algorithm will improve training quality and speeds and potentially lead to more private methods. Federated IS (FedIS) and Differentially Private IS (DPIS) leverage gradient magnitude and diversity to achieve this acceleration. We study approaches that combine some of these ideas and introduce new ones for private training.

3. Preliminaries

3.1. Differential Privacy

DP proposes a solution to publishing or sharing private datasets. DP mathematically defines privacy leakage caused due to computations over private data points. By providing an attacker-agnostic and information-agnostic defense, DP protects against worst-case scenarios. Thus, we do not need specific assumptions about the attacker and are protected against future information releases.

In the DP literature, we commonly consider a dataset as a multi-set of nn samples in the 𝒳n\mathcal{X}^{n} space. This is analogous to the dataset belonging to the real space n×p\mathbb{R}^{n\times p}. The following section is based on preliminary definitions found in (Dwork et al., 2014).

To understand the mathematical notion of DP, we define pure and approximate DP.

Definition 3.1.

(Differential Privacy) A randomized mechanism :𝒳nd\mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d} is considered to be (ϵ,δ)(\epsilon,\delta) DP if for all neighboring datasets x,x𝒳nx,x^{{}^{\prime}}\in\mathcal{X}^{n} differing in a single sample (i.e., xx1||x-x^{{}^{\prime}}||\leq 1), we have

(1) Pr[(x)d]eϵPr[(x)d]+δ\displaystyle\text{Pr}[\mathcal{M}(x)\in\mathbb{R}^{d}]\leq e^{\epsilon}\text{Pr}[\mathcal{M}(x^{{}^{\prime}})\in\mathbb{R}^{d}]+\delta

Differential privacy allows for pure-DP (δ=0\delta=0) or approximate-DP (δ>0\delta>0), each delivered by a different noise addition mechanism. Approximate DP allows for an infinitesimally small probability of data leakage. However, it composes (adds up) well over multiple iterations compared to pure DP. For most computations in this work, we will stick to approximate DP.

Definition 3.2.

(lkl_{k}-Sensitivity) For a function f:𝒳ndf:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}, we define its lkl_{k} norm sensitivity (denoted as Δkf\Delta_{k}f) over all neighboring datasets x,x𝒳nx,x^{{}^{\prime}}\in\mathcal{X}^{n} differing in a single sample as

(2) supx,x𝒳nf(x)f(x)kΔkf\displaystyle\text{sup}_{x,x^{{}^{\prime}}\in\mathcal{X}^{n}}||f(x)-f(x^{{}^{\prime}})||_{k}\leq\Delta_{k}f

Given the l2l_{2} sensitivity of a function, we can further define the Gaussian Mechanism that adds noise to the function f:𝒳ndf:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}. Gaussian DP is used to achieve approximate DP.

Lemma 3.3.

(Gaussian Mechanism) Consider an arbitrary ϵ(0,1)\epsilon\in(0,1). Let the l2l_{2} sensitivity of the function f:𝒳ndf:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d} be Δ2f\Delta_{2}f. Further, consider σ=cΔ2fϵ\sigma=\frac{c\Delta_{2}f}{\epsilon} where c2>2ln(1.25δ)c^{2}>2ln(\frac{1.25}{\delta}). Then, the following randomized mechanism \mathcal{M},

(3) η\displaystyle\eta 𝒩(0,σ2)\displaystyle\sim\mathcal{N}(0,\sigma^{2})
(4) (x,σ)\displaystyle\mathcal{M}(x,\sigma) f(x)+η\displaystyle\coloneqq f(x)+\eta

satisfies approximate or (ϵ,δ)(\epsilon,\delta) DP.

We define a few other properties of DP.

Definition 3.4.

(Rényi Divergence) We use Rényi divergence to quantify the closeness of two probability distributions. Given, two probability distributions 𝒫1,𝒫2\mathcal{P}_{1},\mathcal{P}_{2} and their density functions p1(x),p2(x)p_{1}(x),p_{2}(x) s.t. x𝒳x\in\mathcal{X} then the Rényi divergence of order (α>1\alpha>1) is given as,

(5) Dα(𝒫1||𝒫2)1α1ln𝒳q(x)(p(x)q(c))αdx\displaystyle D_{\alpha}(\mathcal{P}_{1}||\mathcal{P}_{2})\coloneqq\frac{1}{\alpha-1}ln\int_{\mathcal{X}}q(x)(\frac{p(x)}{q(c)})^{\alpha}dx

Based on the Rényi divergence, we use the Rényi Differential Privacy (RDP) (Mironov, 2017) to quantify our proposed algorithms’ privacy cost. Although approximate DP is a stricter notion of privacy, its composition analysis is weaker than the newer forms of privacy like RDP, Concentrated Differential Privacy (CDP), and z-CDP. In this case, we choose to use RDP to be consistent with the privacy costs shown in (Wei et al., 2022).

Definition 3.5.

(Rényi Differential Privacy (RDP)) For any neighboring datasets x,x𝒳nx,x^{{}^{\prime}}\in\mathcal{X}^{n}, a randomized mechanism :𝒳nd\mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d} satisfies (α,τ)(\alpha,\tau)-RDP if

(6) Dα((x)||(x))τ\displaystyle D_{\alpha}(\mathcal{M}(x)||\mathcal{M}(x^{{}^{\prime}}))\leq\tau
Definition 3.6.

((α,τ)(\alpha,\tau)-RDP to (ϵ,δ)(\epsilon,\delta)-DP conversion) The randomized mechanism :𝒳nd\mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}, satisfies (α,τ)(\alpha,\tau)-RDP. For the constants, α>1,ϵ0\alpha>1,\epsilon\geq 0 and appropriate δ,\delta,\mathcal{M} satisfies (ϵ,δ)(\epsilon,\delta)-DP for,

(7) ϵ=τ+log(1δ)+(α1)log(11α)log(α)α1\displaystyle\epsilon=\tau+\frac{\text{log}(\frac{1}{\delta})+(\alpha-1)\text{log}(1-\frac{1}{\alpha})-\text{log}(\alpha)}{\alpha-1}

3.2. Federated Optimization

Federated Learning pursues distributed optimization by allowing clients to train models on their data. The raw data never leaves the device, and each client only sends the gradient value to be aggregated on the server. Thus, the optimization function is split across nn clients and can be arranged as the following,

(8) w\displaystyle w^{*} arg minwpF(w)\displaystyle\coloneqq\text{arg min}_{w\in\mathbb{R}^{p}}F(w)
(9) arg minwp1ni=1nf(w;Xi,yi)\displaystyle\coloneqq\text{arg min}_{w\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}f(w;X_{i},y_{i})
(10) arg minwp1ni=1nfi(w)\displaystyle\coloneqq\text{arg min}_{w\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)

Here, XX and yy are the dataset and response variable resp.

We assume a non-convex setting and make the following assumptions as made in (Wang et al., 2022b) (restated for completeness).

3.2.1. Assumptions

Assumption 1 (L-smooth).

Each client has a local objective function that is LL-Lipschitz Smooth (L>0L>0) s.t.

(11) fi(w1)fi(w2)2Lw1w22,w1,w2p;i[n]\displaystyle||\triangledown f_{i}(w_{1})-\triangledown f_{i}(w_{2})||_{2}\leq L||w_{1}-w_{2}||_{2},\forall w_{1},w_{2}\in\mathbb{R}^{p};i\in[n]
Assumption 2.

(Unbiased Local Gradient Estimator and Local Variance) Suppose for the ttht^{th} round we have sample i,xtii,x_{t}^{i} and model parameter wtw_{t}. Then for all clients i[n]i\in[n]

(12) Exti𝕌(Si)[fi(w,xti)]=fi(w)\displaystyle E_{x_{t}^{i}\sim\mathbb{U}(S^{i})}[\triangledown f_{i}(w,x_{t}^{i})]=\triangledown f_{i}(w)

where SiS^{i} is the set of all the data samples at client ii and 𝕌\mathbb{U} is the uniform distribution.

Further, the local utility functions have bounded local variance s.t.,

(13) Exti𝕌(Si)[fi(w,xti)fi(w)2]=σL,i2σL2\displaystyle E_{x_{t}^{i}\sim\mathbb{U}(S^{i})}[||\triangledown f_{i}(w,x_{t}^{i})-\triangledown f_{i}(w)||^{2}]=\sigma_{L,i}^{2}\leq\sigma_{L}^{2}
Assumption 3.

(Bound Dissimilarity) (Wang et al., 2020) The following expression relates the expected value of the local and global objective functions,

(14) 𝔼tfi(w)22(A2+1)F(w)22+σG2\displaystyle\mathbb{E}_{t}||\triangledown f_{i}(w)||^{2}_{2}\leq(A^{2}+1)||\triangledown F(w)||^{2}_{2}+\sigma^{2}_{G}

where A,σG0A,\sigma_{G}\leq 0 are constants. For the case when A=σG=0A=\sigma_{G}=0, then all local loss functions are similar.

Assumption 4.

(Bounded Clipping) Suppose vector xx is clipped s.t. xc=xmax{1,x2/C}x_{c}=\frac{x}{\max\{1,||x||_{2}/C\}} then we define νxxcxc\nu_{x}\coloneqq x_{c}-x_{c}. We say that

(15) E[νx2]cν2\displaystyle E[||\nu_{x}||^{2}]\leq c_{\nu}^{2}

The last assumption is a new one that suggests that the variance due to clipping is generally small. (Wang et al., 2022b) also assumes that the stochastic gradient’s norm is uniformly bounded. However, since we will be clipping the local gradients before adding noise for DP, we do not require this explicit assumption. Rather, we use the clipping constant to bound each local gradient’s norm.

4. Preliminary Work

Lemma 4.1.

The expected batch size (𝔼[|𝒵m,t|]\mathbb{E}[|\mathcal{Z}_{m,t}|]) due to Algorithm 1 is less than b. Here nn is the number of total clients, mm is the fraction of clients chosen in the first round, and mk=b\frac{m}{k}=b where n,k,m+n,k,m\in\mathbb{Z}^{+}. Further note that pi,t1p_{i,t}\leq 1 for all i[n]i\in[n]. The values are defined as shown in Algorithm 1.

Proof: We know that 𝔼[|𝒵m,t|]\mathbb{E}[|\mathcal{Z}_{m,t}|] can be written as the follows,

(16) 𝔼[|𝒵m,t|]\displaystyle\mathbb{E}[|\mathcal{Z}_{m,t}|] =i=1n1mnpi,tk\displaystyle=\sum_{i=1}^{n}1\cdot\frac{m}{n}\frac{p_{i,t}}{k}
(17) =i=1nbnpi,t\displaystyle=\sum_{i=1}^{n}\frac{b}{n}p_{i,t}
(18) i=1nbn=b\displaystyle\leq\sum_{i=1}^{n}\frac{b}{n}=b
Lemma 4.2.

The private gradient estimate at round tt, (Δ~t\tilde{\Delta}_{t}) in Algorithm 1 is an unbiased estimator of the aggregate clipped gradients.

Let us first understand the expression for Δ~t\tilde{\Delta}_{t}. Note that u¯t=1bl𝒵m,tul\bar{u}_{t}=\frac{1}{b}\sum_{l\in\mathcal{Z}_{m,t}u_{l}} and thus by definition has expected value of zero. (1) follows from this idea. (2) follows from introducing the next sampling probability of clients in the first and second rounds (combined).

(19) 𝔼t[Δ~t]\displaystyle\mathbb{E}_{t}[\tilde{\Delta}_{t}] =𝔼t[Δt+u¯t]\displaystyle=\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]
(20) =(1)𝔼t[Δt]\displaystyle\overset{(1)}{=}\mathbb{E}_{t}[\Delta_{t}]
(21) =𝔼t[1bj𝒵m.tg¯j(wt)pj,t]\displaystyle=\mathbb{E}_{t}[\frac{1}{b}\sum_{j\in\mathcal{Z}_{m.t}}\frac{\bar{g}_{j}(w_{t})}{p_{j,t}}]
(22) =𝔼t[1b𝟙[i𝒵m,t]i=1ng¯i(wt)pi,t]\displaystyle=\mathbb{E}_{t}[\frac{1}{b}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\sum_{i=1}^{n}\frac{\bar{g}_{i}(w_{t})}{p_{i,t}}]
(23) =i=1ng¯i(wt)1bpi,t𝔼t[𝟙[i𝒵m,t]]\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{1}{bp_{i,t}}\mathbb{E}_{t}[\mathbbm{1}[i\in\mathcal{Z}_{m,t}]]
(24) =(2)i=1ng¯i(wt)1bpi,tqi,tpi,tk\displaystyle\overset{(2)}{=}\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{1}{bp_{i,t}}\frac{q_{i,t}p_{i,t}}{k}
(25) =i=1ng¯i(wt)1bmnk\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{1}{b}\frac{m}{nk}
(26) =1ni=1ng¯i(wt)=Δ¯t\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bar{g}_{i}(w_{t})=\bar{\Delta}_{t}

5. Main Theorems

We need to find an optimal sampling mechanism for our proposed method that minimizes overall variance. Additional variance is introduced due to the partial participation of clients leading to an approximation of the overall objective. We first study the impact of uniformly choosing samples from the set in Theorem 5.1.

Theorem 5.1.

Let the adaptive global learning rates be ηt\eta_{t}. Under our assumptions 1-4, the model parameter sequence {wt}\{w_{t}\} generated by Algorithm 1 satisfies the following:

(27) mint[T]𝔼[F(wt)22]\displaystyle\min_{t\in[T]}\mathbb{E}[||\triangledown F(w_{t})||_{2}^{2}] F(w0)F(w)Tακ+Φ\displaystyle\leq\frac{F_{(}w_{0})-F_{(}w_{*})}{\sqrt{T}\alpha_{\kappa}}+\Phi
(28) =𝒪(ΔFT+σL2+cν2+σ02+σG2d)\displaystyle=\mathcal{O}(\frac{\Delta F}{\sqrt{T}}+\sigma_{L}^{2}+c_{\nu}^{2}+\sigma_{0}^{2}+\frac{\sigma^{2}_{G}}{d})

where the expectation is taken over the local dataset samples among clients. Φ\Phi is the variance seen due to random sampling and privacy noise. Here, T=d2log(T)2T=d^{2}log(T)^{2}. We uniformly sample clients as shown in Algorithm 1 with a net probability bV0n\frac{bV_{0}}{n} (across both rounds of sampling). The variance bounds σL,σG,cν2\sigma_{L},\sigma_{G},c_{\nu}^{2} are introduced in assumptions 1-4. dd is the number of model parameters, σ0\sigma_{0} is the privacy noise added as per Algorithm 1. Furthermore,

(29) ΔF\displaystyle\Delta F =F(w0)F(w)\displaystyle=F(w_{0})-F(w_{*})
(30) ακ\displaystyle\alpha_{\kappa} =ηbT1TV0Lηb2log(T)(A2+1)bT\displaystyle=\eta_{b}\frac{\sqrt{T}-1}{\sqrt{T}}-\frac{V_{0}L\eta_{b}^{2}log(T)(A^{2}+1)}{b\sqrt{T}}
(32) Φ\displaystyle\Phi =(2ηbKmax(σL2+cν2))Kminακ\displaystyle=\frac{(2\eta_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}\alpha_{\kappa}}
(33) +Lηb2bακT[Kmax(σL2+cν2)Kmin+σG2+dσ022]V0log(T)\displaystyle+\frac{L\eta_{b}^{2}}{b\alpha_{\kappa}\sqrt{T}}[\frac{K_{max}(\sigma_{L}^{2}+c_{\nu}^{2})}{K_{min}}+\frac{\sigma^{2}_{G}+d\sigma^{2}_{0}}{2}]V_{0}log(T)

Proof: We define Δ¯t1ni=1ng~i(wt)\bar{\Delta}_{t}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\tilde{g}_{i}(w_{t}). In the case all participants participate, we can see that Δ¯t=Δ~t\bar{\Delta}_{t}=\tilde{\Delta}_{t}. However, for partial client participation we see that Δ¯t=i=1ng~i(wt)i𝒾𝒳𝓂,𝓉g~i(wt)=Δ~t\bar{\Delta}_{t}=\sum_{i=1}^{n}\tilde{g}_{i}(w_{t})\neq\sum_{i\in\mathcal{i\in\mathcal{X}_{m,t}}}\tilde{g}_{i}(w_{t})=\tilde{\Delta}_{t}. Thus, the randomness in partial client participation consists of client sampling, stochastic gradient, and privacy noise.

Due to the smoothness assumption on FF (by assumption  1), and taking the expectation of F(wt+1)F(w_{t+1}) over the client sampling and at round tt. Note wtw_{t} is unaffected by the sampling at round tt.

𝔼\displaystyle\mathbb{E} [F(wt+1)]t{}_{t}[F(w_{t+1})]
F(wt)+F(wt),𝔼t[wt+1wt]+L2𝔼t[wt+1wt2]\displaystyle\leq F(w_{t})+\langle\triangledown F(w_{t}),\mathbb{E}_{t}[w_{t+1}-w_{t}]\rangle+\frac{L}{2}\mathbb{E}_{t}[||w_{t+1}-w_{t}||^{2}]
=F(wt)+F(wt),η𝔼t[Δ~t]+L2η2𝔼t[Δ~t2]\displaystyle=F(w_{t})+\langle\triangledown F(w_{t}),\eta\mathbb{E}_{t}[\tilde{\Delta}_{t}]\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[||\tilde{\Delta}_{t}||^{2}]
=F(wt)+F(wt),η𝔼t[Δt+u¯t]+L2η2𝔼t[Δt+u¯t2]\displaystyle=F(w_{t})+\langle\triangledown F(w_{t}),\eta\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[||\Delta_{t}+\bar{u}_{t}||^{2}]

Here, u¯t=1|𝒵m,t|i𝒵m,tui\bar{u}_{t}=\frac{1}{|\mathcal{Z}_{m,t}|}\sum_{i\in\mathcal{Z}_{m,t}}u_{i} and then u¯t𝒩(0,|𝒵m,t|σ02)\bar{u}_{t}\sim\mathcal{N}(0,\sqrt{|\mathcal{Z}_{m,t}|}\sigma_{0}^{2}). Further, Δt=i𝒵m,tg¯i(wt)nqi,tpi,t\Delta_{t}=\sum_{i\in\mathcal{Z}_{m,t}}\frac{\bar{g}_{i}(w_{t})}{nq_{i,t}p_{i,t}}. We note that expectation over the randomness of client sampling and private Gaussian noise gives us 𝔼t[Δt+u¯t]=i𝒵m,t𝔼t[g¯i(wt)bnqi,tpi,t+u¯t]=i𝒵m,t𝔼t[g¯i(wt)bnqi,tpi,t]=𝔼t[Δt]\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]=\sum_{i\in\mathcal{Z}_{m,t}}\mathbb{E}_{t}[\frac{\bar{g}_{i}(w_{t})}{bnq_{i,t}p_{i,t}}+\bar{u}_{t}]=\sum_{i\in\mathcal{Z}_{m,t}}\mathbb{E}_{t}[\frac{\bar{g}_{i}(w_{t})}{bnq_{i,t}p_{i,t}}]=\mathbb{E}_{t}[\Delta_{t}]. Due to the independent Gaussian noise and expectation over its randomness, we have 𝔼t[u¯t]=0\mathbb{E}_{t}[\bar{u}_{t}]=0 and Δtu¯t𝔼t[Δt,u¯t]=0\Delta_{t}\perp\!\!\!\perp\bar{u}_{t}\rightarrow\mathbb{E}_{t}[\langle\Delta_{t},\bar{u}_{t}\rangle]=0.

Thus, 𝔼t[Δt+u¯t22]=𝔼t[Δt22+u¯t22+2Δt,u¯t]=𝔼t[Δt22+u¯t22]\mathbb{E}_{t}[||\Delta_{t}+\bar{u}_{t}||^{2}_{2}]=\mathbb{E}_{t}[||\Delta_{t}||_{2}^{2}+||\bar{u}_{t}||^{2}_{2}+2\langle\Delta_{t},\bar{u}_{t}\rangle]=\mathbb{E}_{t}[||\Delta_{t}||_{2}^{2}+||\bar{u}_{t}||^{2}_{2}] . Therefore,

𝔼\displaystyle\mathbb{E} [F(wt+1)]t{}_{t}[F(w_{t+1})]
(34) =F(wt)ηtF(wt)22+F(wt),ηt𝔼t[Δt+F(wt)]A1\displaystyle=F(w_{t})-\eta_{t}||\triangledown F(w_{t})||^{2}_{2}+\underbrace{\langle\triangledown F(w_{t}),\eta_{t}\mathbb{E}_{t}[\Delta_{t}+\triangledown F(w_{t})]\rangle}_{A_{1}}
+L2ηt2𝔼t[Δt2]A2+L2ηt2𝔼t[u¯t2]A3\displaystyle+\frac{L}{2}\eta_{t}^{2}\underbrace{\mathbb{E}_{t}[||\Delta_{t}||^{2}]}_{A_{2}}+\underbrace{\frac{L}{2}\eta_{t}^{2}\mathbb{E}_{t}[||\bar{u}_{t}||^{2}]}_{A_{3}}

Let us bound term A3A_{3} first. We note that by our sampling technique, the expected batch size is 𝔼[|𝒵m,t|]b\mathbb{E}[|\mathcal{Z}_{m,t}|]\leq b (Lemma 4.1) and that u¯t𝒩(0,|𝒵m,t|σ02𝕀d)\bar{u}_{t}\sim\mathcal{N}(0,\sqrt{|\mathcal{Z}_{m,t}|}\sigma_{0}^{2}\mathbb{I}_{d}).

(35) A3\displaystyle A_{3} =L2ηt2𝔼t[u¯t2]\displaystyle=\frac{L}{2}\eta_{t}^{2}\mathbb{E}_{t}[||\bar{u}_{t}||^{2}]
(36) =(1)L2ηt2𝔼t[i=1n𝟙[i𝒵m,t]uibpi,t22]\displaystyle\overset{(1)}{=}\frac{L}{2}\eta_{t}^{2}\mathbb{E}_{t}[||\sum_{i=1}^{n}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\frac{u_{i}}{bp_{i,t}}||^{2}_{2}]
(37) =Lηt22b2i=1nPr[i𝒵m,t]𝔼t[uipi,t22]\displaystyle=\frac{L\eta_{t}^{2}}{2b^{2}}\sum_{i=1}^{n}\text{Pr}[i\in\mathcal{Z}_{m,t}]\mathbb{E}_{t}[||\frac{u_{i}}{p_{i,t}}||^{2}_{2}]
(38) =i=1nLηt22b2pi,tqi,tkpi,t2𝔼t[ui22]\displaystyle=\sum_{i=1}^{n}\frac{L\eta_{t}^{2}}{2b^{2}}\frac{p_{i,t}q_{i,t}}{kp_{i,t}^{2}}\mathbb{E}_{t}[||u_{i}||^{2}_{2}]
(39) =i=1nLηt22bnpi,tdσ02\displaystyle=\sum_{i=1}^{n}\frac{L\eta_{t}^{2}}{2bnp_{i,t}}d\sigma_{0}^{2}

Here (1) is from the law of total expectation (or tower expression), (2) is due to E[x22]=E[k=1pxi2]=pσ2E[||x||_{2}^{2}]=E[\sum_{k=1}^{p}x_{i}^{2}]=p\sigma^{2} where xi𝒩(0,σ2)x_{i}\sim\mathcal{N}(0,\sigma^{2}), (3) is by Jensen’s inequality for a concave function (squared root is concave) and the (4) follows from Lemma 4.1.

Let us now bound A2A_{2} with the assumption that we sample with replacement.

(40) A2\displaystyle A_{2} =𝔼t[Δt22]\displaystyle=\mathbb{E}_{t}[||\Delta_{t}||_{2}^{2}]
(41) =𝔼t[1bjZm,tg¯j(wt)pj,t2]\displaystyle=\mathbb{E}_{t}[||\frac{1}{b}\sum_{j\in Z_{m,t}}\frac{\bar{g}_{j}(w_{t})}{p_{j,t}}||^{2}]
(42) =(1)1b2𝔼t[j𝒵m,tr=1Kjfj(wt;xjr)fj(wt)νjrpj,tKj22]B1\displaystyle\overset{(1)}{=}\underbrace{\frac{1}{b^{2}}\mathbb{E}_{t}[||\sum_{j\in\mathcal{Z}_{m,t}}\sum_{r=1}^{K_{j}}\frac{\triangledown f_{j}(w_{t};x_{jr})-\triangledown f_{j}(w_{t})-\nu_{jr}}{p_{j,t}K_{j}}||_{2}^{2}]}_{B_{1}}
(43) +1b2𝔼t[j𝒵m,t1pj,tfj(wt)22]B2\displaystyle+\underbrace{\frac{1}{b^{2}}\mathbb{E}_{t}[||\sum_{j\in\mathcal{Z}_{m,t}}\frac{1}{p_{j,t}}\triangledown f_{j}(w_{t})||_{2}^{2}]}_{B_{2}}

Here, (1) follows from the simple expression E[x2]=E[xE[x]2]+E[x]2E[||x||^{2}]=E[||x-E[x]||^{2}]+||E[x]||^{2}. We first bound the first term B1B_{1}. Note, νjr\nu_{jr} is the factor clipped off by clipping s.t. νjrfi(wt,xjrf¯i(wt,xjr\nu_{jr}\coloneqq\triangledown f_{i}(w_{t},x_{jr}-\triangledown\bar{f}_{i}(w_{t},x_{jr}. Here, f¯i(wt,xjr\bar{f}_{i}(w_{t},x_{jr} is the clipped form of fi(wt,xjr\triangledown f_{i}(w_{t},x_{jr}. We assume Et[νjr22]cν2E_{t}[||\nu_{jr}||_{2}^{2}]\leq c_{\nu}^{2}.

B1\displaystyle B_{1} =(1)1b2𝔼t[i=1n𝟙[iZm,t]r=1Kifi(wt;xir)fi(wt)νirpi,tKi2]\displaystyle\overset{(1)}{=}\frac{1}{b^{2}}\mathbb{E}_{t}[||\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}}{p_{i,t}K_{i}}||^{2}]
=1b2i=1n𝔼t[𝟙[iZm,t]r=1Kifi(wt;xir)fi(wt)νirpi,tKi]2\displaystyle=\frac{1}{b^{2}}||\sum_{i=1}^{n}\mathbb{E}_{t}[\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}}{p_{i,t}K_{i}}]||^{2}
=(2)i=1nPr[iZm,t]𝔼t[1bpi,tKir=1Kifi(wt;xir)fi(wt)νir22]\displaystyle\overset{(2)}{=}\sum_{i=1}^{n}\text{Pr}[i\in Z_{m,t}]\mathbb{E}_{t}[||\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}||^{2}_{2}]
=i=1npi,tqi,tk𝔼t[||1bpi,tKir=1Ki[fi(wt;xir)fi(wt)νir]||22\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{k}\mathbb{E}_{t}[||\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]||^{2}_{2}
(44) =i=1npi,tqi,tkb2pi,t2Ki2𝔼t[||1bpi,tKir=1Ki[fi(wt;xir)fi(wt)νir]||22\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{kb^{2}p_{i,t}^{2}K_{i}^{2}}\mathbb{E}_{t}[||\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]||^{2}_{2}
(45) i=1n1bnpi,tKmin2𝔼t[||1bpi,tKir=1Ki[fi(wt;xir)fi(wt)νir]||22\displaystyle\leq\sum_{i=1}^{n}\frac{1}{bnp_{i,t}K_{min}^{2}}\mathbb{E}_{t}[||\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]||^{2}_{2}
(46) (3)i=1n2Kmaxbnpi,tKmin2(σL2+cν2)\displaystyle\overset{(3)}{\leq}\sum_{i=1}^{n}\frac{2K_{max}}{bnp_{i,t}K_{min}^{2}}(\sigma^{2}_{L}+c_{\nu}^{2})

Here (1) rewrites the summation with the indicator variable trick, and in (2), we take the expected value of the indicator function (also equal to its probability). In (3), we use the inequality E[||x1+x2++xn||2]n(E[||x1||2]+E[||xn||2)E[||x_{1}+x_{2}+...+x_{n}||^{2}]\leq n(E[||x_{1}||^{2}]+...E[||x_{n}||^{2}) and E[||x1x2||2]2(E[||x1||2]+E[||x2||2)E[||x_{1}-x_{2}||^{2}]\leq 2(E[||x_{1}||^{2}]+E[||x_{2}||^{2}), assumption 2 and assume Kmax=maxi=1nKiK_{max}=\max_{i=1}^{n}K_{i} (achieved by clipping the number of clients per user) and Kmin=mini=1nKiK_{min}=\min_{i=1}^{n}K_{i}. The server broadcasts both KmaxK_{max} and KminK_{min}. Clients limit their training to KmaxK_{max} samples with repetition and ensure that at least KminK_{min} samples (with replacement) are trained each time.

Let us bound B2B_{2} now,

(47) B2\displaystyle B_{2} =1b2𝔼t[i=1n𝟙[i𝒵m,t]1pi,tfi(wt)22]\displaystyle=\frac{1}{b^{2}}\mathbb{E}_{t}[||\sum_{i=1}^{n}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\frac{1}{p_{i,t}}f_{i}(w_{t})||^{2}_{2}]
(48) =(1)1b2i=1nPr(i𝒵m,t)Etfi(wt)2\displaystyle\overset{(1)}{=}\frac{1}{b^{2}}\sum_{i=1}^{n}\text{Pr}(i\in\mathcal{Z}_{m,t})E_{t}||\triangledown f_{i}(w_{t})||^{2}
(49) =i=1npi,tqi,tkb2pi,t2Etfi(wt)2\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{kb^{2}p_{i,t}^{2}}E_{t}||\triangledown f_{i}(w_{t})||^{2}
(50) =i=1n1bnpi,tEtfi(wt)2\displaystyle=\sum_{i=1}^{n}\frac{1}{bnp_{i,t}}E_{t}||\triangledown f_{i}(w_{t})||^{2}
(51) (2)i=1n1bnpi,t((A2+1)F(wt)2+σG2)\displaystyle\overset{(2)}{\leq}\sum_{i=1}^{n}\frac{1}{bnp_{i,t}}((A^{2}+1)||\triangledown F(w_{t})||^{2}+\sigma^{2}_{G})

where (1) follows from analysis similar to B1B_{1} and cancellation of probabilities, (2) uses the inequality E[||x1+x2++xn||2]n(E[||x1||2]+E[||xn||2)E[||x_{1}+x_{2}+...+x_{n}||^{2}]\leq n(E[||x_{1}||^{2}]+...E[||x_{n}||^{2}) and Assumption 3.

Now, we only need to bound A1A_{1}.

(52) A1\displaystyle A_{1} =F(wt),ηt𝔼t[Δt+F(wt)]\displaystyle=\langle\triangledown F(w_{t}),\eta_{t}\mathbb{E}_{t}[\Delta_{t}+\triangledown F(w_{t})]\rangle
(53) =F(wt),ηt𝔼t[i=1n𝟙[iZm,t]g¯i(wt)bpi,t+F(wt)]\displaystyle=\langle\triangledown F(w_{t}),\eta_{t}\mathbb{E}_{t}[\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\frac{-\bar{g}_{i}(w_{t})}{bp_{i,t}}+\triangledown F(w_{t})]\rangle
=F(wt),ηt𝔼t[i=1n𝟙[iZm,t]r=1Kifi(wt;xir)+νirbpi,tKi\displaystyle=\langle\triangledown F(w_{t}),-\eta_{t}\mathbb{E}_{t}[\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}}{bp_{i,t}K_{i}}
+\displaystyle+ F(wt)]\displaystyle\triangledown F(w_{t})]\rangle
=F(wt),ηti,rn,KiPr[i𝒵m,t]Et[fi(wt;xir)+νir+fi(wt)bpi,tKi]\displaystyle=\langle\triangledown F(w_{t}),\eta_{t}\sum_{i,r}^{n,K_{i}}\text{Pr}[i\in\mathcal{Z}_{m,t}]E_{t}[\frac{-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}+\triangledown f_{i}(w_{t})}{bp_{i,t}K_{i}}]\rangle
(1)ηtF(wt),i,rn,Kibpi,tnEt[fi(wt;xir)+νir+fi(wt)Kminbpi,tηt]\displaystyle\overset{(1)}{\leq}\langle\sqrt{\eta_{t}}\triangledown F(w_{t}),\sum_{i,r}^{n,K_{i}}\frac{bp_{i,t}}{n}E_{t}[\frac{-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}+\triangledown f_{i}(w_{t})}{\frac{K_{min}bp_{i,t}}{\sqrt{\eta_{t}}}}]\rangle
(54) =ηtF(wt),i,rn,KiηtnKminEt[fi(wt;xir)+νir+fi(wt)]\displaystyle=\langle\sqrt{\eta_{t}}\triangledown F(w_{t}),\sum_{i,r}^{n,K_{i}}\frac{\sqrt{\eta_{t}}}{nK_{min}}E_{t}[-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}+\triangledown f_{i}(w_{t})]\rangle
(55) (2)ηt2F(wt)2+ηt2nKmin2Kmaxn(σL2+cν2)\displaystyle\overset{(2)}{\leq}\frac{\eta_{t}}{2}||\triangledown F(w_{t})||^{2}+\frac{\eta_{t}}{2nK_{min}}\cdot 2K_{max}n(\sigma_{L}^{2}+c_{\nu}^{2})
(56) =ηt2F(wt)2+ηtKmaxKmin(σL2+cν2)\displaystyle=\frac{\eta_{t}}{2}||\triangledown F(w_{t})||^{2}+\frac{\eta_{t}K_{max}}{K_{min}}(\sigma_{L}^{2}+c_{\nu}^{2})

where (1) is a simple re-shuffling of constants and (2) uses the inequality a,b12(a22+b22)\langle a,b\rangle\leq\frac{1}{2}(||a||^{2}_{2}+||b||^{2}_{2}) and the analysis used in B1B_{1} based on E[||x1+x2++xn||2]n(E[||x1||2]+E[||xn||2)E[||x_{1}+x_{2}+...+x_{n}||^{2}]\leq n(E[||x_{1}||^{2}]+...E[||x_{n}||^{2}).

Putting everything together, we get that,

(57) 𝔼t\displaystyle\mathbb{E}_{t} [F(wt+1)]\displaystyle[F(w_{t+1})]
(58) F(wt)ηt2(1i=1nLηt(A2+1)bnpi,t)F(wt)22\displaystyle\leq F(w_{t})-\frac{\eta_{t}}{2}(1-\sum_{i=1}^{n}\frac{L\eta_{t}(A^{2}+1)}{bnp_{i,t}})||\triangledown F(w_{t})||^{2}_{2}
(59) +(ηtKmax(σL2+cν2))Kmin(1+i=1nLηtbnpi,tKmin)\displaystyle+\frac{(\eta_{t}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}}(1+\sum_{i=1}^{n}\frac{L\eta_{t}}{bnp_{i,t}K_{min}})
(60) +i=1nLηt22bnpi,t(σG2+dσ02)\displaystyle+\sum_{i=1}^{n}\frac{L\eta_{t}^{2}}{2bnp_{i,t}}(\sigma^{2}_{G}+d\sigma_{0}^{2})

Re-arranging the terms, taking expectations for all the steps and summing from 0 to T1T-1 steps (and canceling some terms), and dividing by TT, we next get the following,

(61) t=0T1𝔼t[F(wt)2](η2T(1i=1nLηt2(A2+1)bnpi,t))\displaystyle\sum_{t=0}^{T-1}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}](\frac{\eta}{2T}(1-\sum_{i=1}^{n}\frac{L\eta_{t}^{2}(A^{2}+1)}{bnp_{i,t}}))
(62) F(w0)F(wT)T+tT1(ηtKmax(σL2+cν2))TKmin\displaystyle\leq\frac{F(w_{0})-F(w_{T})}{T}+\sum_{t}^{T-1}\frac{(\eta_{t}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{TK_{min}}
(63) +t,iT1,nLηt2Kmax(σL2+cν2)bnTpi,tKmin2+t,iT1,nLηt22bnTpi,t(σG2+dσ02)\displaystyle+\sum_{t,i}^{T-1,n}\frac{L\eta_{t}^{2}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2})}{bnTp_{i,t}K_{min}^{2}}+\sum_{t,i}^{T-1,n}\frac{L\eta_{t}^{2}}{2bnTp_{i,t}}(\sigma^{2}_{G}+d\sigma_{0}^{2})
(64) (1)F(w0)F(w)T+tT1(ηbKmax(σL2+cν2))Kmint+1T\displaystyle\overset{(1)}{\leq}\frac{F(w_{0})-F(w_{*})}{T}+\sum_{t}^{T-1}\frac{(\eta_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}\sqrt{t+1}T}
(65) +t,iT1,nLηb2Kmax(σL2+cν2)bnTpi,tKmin2(t+1)+t,iT1,nLηb22bnpi,t(t+1)T(σG2+dσ02)\displaystyle+\sum_{t,i}^{T-1,n}\frac{L\eta^{2}_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2})}{bnTp_{i,t}K_{min}^{2}(t+1)}+\sum_{t,i}^{T-1,n}\frac{L\eta_{b}^{2}}{2bnp_{i,t}(t+1)T}(\sigma^{2}_{G}+d\sigma_{0}^{2})
(66) (2)F(w0)F(w)T+(2ηbKmax(σL2+cν2))KminT\displaystyle\overset{(2)}{\leq}\frac{F(w_{0})-F(w_{*})}{T}+\frac{(2\eta_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}\sqrt{T}}
(67) +Lηb2bnT[Kmax(σL2+cν2)Kmin+σG2+dσ022]C1t,i1pt,i(t+1)\displaystyle+\underbrace{\frac{L\eta_{b}^{2}}{bnT}[\frac{K_{max}(\sigma_{L}^{2}+c_{\nu}^{2})}{K_{min}}+\frac{\sigma^{2}_{G}+d\sigma^{2}_{0}}{2}]}_{C_{1}}\sum_{t,i}\frac{1}{p_{t,i}(t+1)}

Here, in (1) we assume that ηt=ηbt+1\eta_{t}=\frac{\eta_{b}}{\sqrt{t+1}} where ηb\eta_{b} is the base learning rate. Also, we know that by definition, F(w)F(wT+1)F(w)F(wT+1)F(w_{*})\leq F(w_{T+1})\rightarrow-F(w_{*})\geq-F(w_{T+1}). (2) follows from the idea that t=0T11t11T11t1=2(T)22T\sum_{t=0}^{T-1}\frac{1}{\sqrt{t-1}}\leq\int_{1}^{T-1}\frac{1}{\sqrt{t-1}}=2(\sqrt{T})-2\leq 2\sqrt{T}.

We need to ensure that for an appropriate constant and all values of tt, ηb,Lηb2(A2+1)bni=1n(t+1)1pi,t1\eta_{b},\frac{L\eta_{b}^{2}(A^{2}+1)}{bn}\sum_{i=1}^{n(t+1)}\frac{1}{p_{i,t}}\leq 1. Note that the value is largest when t=1t=1, and so we essentially need,

(68) i=1n1pi,tbnLηb2(A2+1)\displaystyle\sum_{i=1}^{n}\frac{1}{p_{i,t}}\leq\frac{bn}{L\eta_{b}^{2}(A^{2}+1)}

Finally, we notice that to reduce the variance; we need a policy that follows equation 68, pi,t1p_{i,t}\leq 1 for all i[n],t[0:T1]i\in[n],t\in[0:T-1] combinations s.t.

(69) minpi,t1,(68)t,iC1pi,t(t+1)\displaystyle\min_{p_{i,t}\leq 1,(\ref{eqn:assumption-new})}\sum_{t,i}\frac{C_{1}}{p_{i,t}(t+1)}

If we choose pi,t=1V0p_{i,t}=\frac{1}{V_{0}} where V0>1V_{0}>1 is a constant, then we get the following value at round tt,

(70) ηt2(1iLηt2(A2+1)bnpi,t)=ηt2(1V0Lηt2(A2+1)b)=κt\displaystyle\frac{\eta_{t}}{2}(1-\sum_{i}\frac{L\eta^{2}_{t}(A^{2}+1)}{bnp_{i,t}})=\frac{\eta_{t}}{2}(1-\frac{V_{0}L\eta^{2}_{t}(A^{2}+1)}{b})=\kappa_{t}

Taking summation over all values of tt,

(71) tκt\displaystyle\sum_{t}\kappa_{t} =tηt2(1V0Lηt2(A2+1)b)\displaystyle=\sum_{t}\frac{\eta_{t}}{2}(1-\frac{V_{0}L\eta^{2}_{t}(A^{2}+1)}{b})
(72) =ηb(T1)V0Lηb2log(T)(A2+1)b=ακT\displaystyle=\eta_{b}(\sqrt{T}-1)-\frac{V_{0}L\eta_{b}^{2}log(T)(A^{2}+1)}{b}=\alpha_{\kappa}\sqrt{T}

where ακ=ηbT1TV0Lηb2log(T)(A2+1)bT\alpha_{\kappa}=\eta_{b}\frac{\sqrt{T}-1}{\sqrt{T}}-\frac{V_{0}L\eta_{b}^{2}log(T)(A^{2}+1)}{b\sqrt{T}}. Now, we can rewrite our expected variance function as,

(73) mint𝔼t[F(wt)22]1Ttκt\displaystyle\min_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]\frac{1}{T}\sum_{t}\kappa_{t}
(74) =1Ttmint𝔼t[F(wt)22]κt\displaystyle=\frac{1}{T}\sum_{t}\min_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]\kappa_{t}
(75) 1Tt𝔼t[F(wt)22]κt\displaystyle\leq\frac{1}{T}\sum_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]\kappa_{t}

Substituting this back, our final result convergence result is,

(76) mint𝔼t[F(wt)22]\displaystyle\min_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]
(77) 1Tt𝔼t[F(wt)22]\displaystyle\leq\frac{1}{T}\sum_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]
(78) =𝒪(ΔFTακ+σL2+cν2ακ+(σL2+cν2+σG2+dσ02)log(T)Tακ)\displaystyle=\mathcal{O}(\frac{\Delta F}{\sqrt{T}\alpha_{\kappa}}+\frac{\sigma^{2}_{L}+c_{\nu}^{2}}{\alpha_{\kappa}}+\frac{(\sigma^{2}_{L}+c_{\nu}^{2}+\sigma^{2}_{G}+d\sigma_{0}^{2})log(T)}{\sqrt{T}\alpha_{\kappa}})
(79) =𝒪(ΔFT+σL2+cν2+(σL2+cν2+σG2+dσ02)log(T)T)\displaystyle=\mathcal{O}(\frac{\Delta F}{\sqrt{T}}+\sigma^{2}_{L}+c_{\nu}^{2}+\frac{(\sigma^{2}_{L}+c_{\nu}^{2}+\sigma^{2}_{G}+d\sigma_{0}^{2})log(T)}{\sqrt{T}})

Here, ΔF=F(w0)F(w)\Delta F=F(w_{0})-F(w_{*}). For this bound to be meaningful and assuming the bounded variances are reasonably small, the biggest challenge is the additional term of dlog(T)σ02dlog(T)\sigma^{2}_{0}. We minimize variance only when Tdlog(T)\sqrt{T}\approx dlog(T) for models with many parameters. After that, we get the bound for uniform sampling as,

(80) mint𝔼t[F(wt)22]=𝒪(ΔFT+σL2+cν2+σG2d+σ02)\displaystyle\min_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]=\mathcal{O}(\frac{\Delta F}{\sqrt{T}}+\sigma_{L}^{2}+c_{\nu}^{2}+\frac{\sigma^{2}_{G}}{d}+\sigma_{0}^{2})

This concludes the proof of the main theorem. The probabilities in the denominator should be large. But, with dynamic sampling some probabilities might be significantly small leading to large values for the variance. Therefore, we look at alternative ways of tackling the same problem.

We notice that the unbiasedness of the gradient can be maintained without probability scaling by assuming a weighted gradient output.

Lemma 5.2.

The private unscaled gradient estimate at round tt, (Δ~tu\tilde{\Delta}_{t}^{u}) is an unbiased estimator of the aggregate clipped gradients, assuming that at each round i=1npi,t=1\sum_{i=1}^{n}p_{i,t}=1 and that the actual gradient obtained is weighted.

Let us first understand the expression for the unscaled Δ~t=Δ~tu\tilde{\Delta}_{t}=\tilde{\Delta}_{t}^{u}.

(81) 𝔼t[Δ~tu]\displaystyle\mathbb{E}_{t}[\tilde{\Delta}_{t}^{u}] =𝔼t[Δtu+u¯tu]\displaystyle=\mathbb{E}_{t}[\Delta_{t}^{u}+\bar{u}_{t}^{u}]
(82) =(1)𝔼t[Δt]\displaystyle\overset{(1)}{=}\mathbb{E}_{t}[\Delta_{t}]
(83) =𝔼t[nbj𝒵m.tgj(wt)]\displaystyle=\mathbb{E}_{t}[\frac{n}{b}\sum_{j\in\mathcal{Z}_{m.t}}{g}_{j}(w_{t})]
(84) =𝔼t[nbi=1n𝟙[i𝒵m,t]g¯i(wt)]\displaystyle=\mathbb{E}_{t}[\frac{n}{b}\sum_{i=1}^{n}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\bar{g}_{i}(w_{t})]
(85) =i=1ng¯i(wt)nb𝔼t[𝟙[i𝒵m,t]]\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{n}{b}\mathbb{E}_{t}[\mathbbm{1}[i\in\mathcal{Z}_{m,t}]]
(86) =i=1ng¯i(wt)nb1kqi,tpi,t\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{n}{b}\frac{1}{k}q_{i,t}p_{i,t}
(87) =i=1ng¯i(wt)nbmnkpi,t\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{n}{b}\frac{m}{nk}p_{i,t}
(88) =i=1ng¯i(wt)pi,t\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})p_{i,t}

Further, the expected batch size is iqi,tpi,tk=impi,tkn=ibnpi,tibnpi,t=b\sum_{i}\frac{q_{i,t}p_{i,t}}{k}=\sum_{i}\frac{mp_{i,t}}{kn}=\sum_{i}\frac{b}{n}p_{i,t}\leq\sum_{i}\frac{b}{n}p_{i,t}=b as pi,t1p_{i,t}\leq 1. We believe this might be a better framework for analysis since this provides the following advantages:

  • Avoids having dynamic probabilities in the denominator.

  • Sums up to 1 and provides weighted gradients to denote the weights intuitively.

  • Maintains unbiased estimator, small batch size.

The only minor inconvenience might be the nn factor in the numerator. We study the variance factor as part of our future work.

Theorem 5.3.

Algorithm 1 is (ϵ,δ)(\epsilon,\delta) differentially private.

Proof: We notice that in Algorithm 1, the only private parameters are the raw gradients and their norms. Since, we bound each gradient by a clipped factor of CC, we have bounded sensitivity. Normally, we would add noise with the variance =Cc0ϵstep=\frac{Cc_{0}}{\epsilon_{step}} where c0=2ln(1.25/δstep)c_{0}=2\sqrt{\text{ln}(1.25/\delta_{step})} and ϵstep,δstep\epsilon_{step},\delta_{step} are the per-step privacy budgets. With strong composition (Kairouz et al., 2015) we get,

(89) ϵ\displaystyle\epsilon =ϵstep2Tln(1/δstep)+Teϵstep1eϵstep+1\displaystyle=\epsilon_{step}\sqrt{2T\text{ln}(1/\delta_{step})}+T\frac{e^{\epsilon_{step}}-1}{e^{\epsilon_{step}}+1}
(90) ϵstep2Tln(1/δstep)+Tϵϵ+1\displaystyle\approx\epsilon_{step}\sqrt{2T\text{ln}(1/\delta_{step})}+\frac{T\epsilon}{\epsilon+1}
(91) δ=(T+1)δstep\displaystyle\delta=(T+1)\delta_{step}

We add epsilons from the mechanisms with RDP (Mironov, 2017). However, the optimal value of α\alpha depends on the value of ϵ\epsilon and δ\delta. Thus, given a target δ\delta, δtarget\delta_{target} we have the following,

(92) Et[ϵ]\displaystyle E_{t}[\epsilon] mϵs(α)+bϵ0(α)\displaystyle\leq m\epsilon_{s}(\alpha_{*})+b\epsilon_{0}(\alpha_{*})
(93) δ=δtarget\displaystyle\delta=\delta_{target}

Here, α\alpha_{*} remains constant and can be replaced by the value that minimizes privacy costs. Further, the expectation is taken over the sampling randomness at step tt, and the value of α\alpha_{*} is also dependent on the number of steps.

6. Algorithms

For the LOCks algorithm (Algorithm 1), we provide details and explanations about the client-server architecture and how each update is processed.

In the first round, the server randomly picks a client w.p. mn\frac{m}{n}. These clients compute a private estimate of their last gradient norm. If the norm is negative, we clip to a minimal value 106\approx 10^{-6}. These norms are then converted to a probability distribution, and the ones with the highest probability are prioritized in the sampling. Thus, the probability is proportionate to the noisy gradient norm of the client. The clients picked to submit their private gradient to the server. The server aggregates this gradient and broadcasts the updated model to all the clients.

Algorithm 1 Generalized Optimal Client Sampling for Federated DP-SGD
1:Input: nn federated clients, privacy parameters ϵ,δ\epsilon,\delta, number of iterations TT, expected batch size bb, maximum/minimum samples per client Kmax,KminK_{max},K_{min}, σs,σ0\sigma_{s},\sigma_{0} for gradient norms and gradient respectively identified by RDP analysis.
2:Initialize: model parameters w0pw_{0}\in\mathbb{R}^{p},
3:for t=0:T-1 do \triangleright Total Iterations
4:     Get random sample 𝒳m,t\mathcal{X}_{m,t} with sampling probability mn\frac{m}{n}
5:     for i𝒳mi\in\mathcal{X}_{m} do \triangleright Computations at client ii
6:         gi(wt)jKifi(wt;xij)Kipg_{i}(w_{t})\leftarrow\sum_{j}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ij})}{{K_{i}}}\in\mathbb{R}^{p} \triangleright Client ii has KiK_{i} samples
7:         g¯i(wt)gi(wt)max{1,gi(wt)2C}\bar{g}_{i}(w_{t})\leftarrow\frac{g_{i}(w_{t})}{\max\{1,\frac{||g_{i}(w_{t})||_{2}}{C}\}} \triangleright Clip Gradients
8:         vi𝒩(0,σs2𝕀p)v_{i}\sim\mathcal{N}(0,\sigma_{s}^{2}\mathbb{I}_{p})
9:         z~ig¯i(wt)2+vi\tilde{z}_{i}\leftarrow||\bar{g}_{i}(w_{t})||_{2}+v_{i} \triangleright Private Gradient Norm
10:         Send z~i\tilde{z}_{i} to the server
11:     end for
12:     z~Si𝒳tz~i\tilde{z}_{S}\leftarrow\sum_{i\in\mathcal{X}_{t}}\tilde{z}_{i} \triangleright Server: Gradient norm sum over clients
13:     z~Smin(max(z~S,bC),N~C)\tilde{z}_{S}^{{}^{\prime}}\leftarrow\min(\max(\tilde{z}_{S},bC),\tilde{N}C)
14:     for j𝒳tj\in\mathcal{X}_{t} do \triangleright Computations at client jj
15:         pj,tmin{1,bz~jkz~S}p_{j,t}\leftarrow\min\{1,\frac{b\tilde{z}_{j}}{k\tilde{z}_{S}^{{}^{\prime}}}\}
16:     end for
17:     Sample 𝒵m,t\mathcal{Z}_{m,t} with the probability distribution derived from qj,j𝒳tq_{j},j\in\mathcal{X}_{t}
18:     for l𝒵m,tl\in\mathcal{Z}_{m,t} do \triangleright Computations at client ll
19:         ul𝒩(0,σ02)u_{l}\sim\mathcal{N}(0,\sigma^{2}_{0})
20:     end for
21:     Δ~l1bl𝒵m,t(g¯l(wt)+ul)\tilde{\Delta}_{l}\leftarrow\frac{1}{b}\sum_{l\in\mathcal{Z}_{m,t}}(\bar{g}_{l}(w_{t})+u_{l}) \triangleright Aggregation at server
22:     wt+1wtηtΔ~lw_{t+1}\leftarrow w_{t}-\eta_{t}\tilde{\Delta}_{l} \triangleright Server Update
23:     Return wt+1w_{t+1} to all clients
24:end for

7. Experiments

We provide preliminary experiments to motivate further investigation in this space. By uniformly sampling clients, we analyze a federated system. Here, the client only adds noise to the last gradient before sending it to the server. This provides user privacy and keeps the data on the device. For our analysis, we consider 100100 clients and assume that 20%20\% are uniformly randomly picked in each round. Each client runs 55 local epochs between the server updates. The noise multiplier in each round is picked after searching for the best α\alpha value in RDP using Google’s Tensorflow privacy library. We notice that for an ϵ=\epsilon=\infty, we need a multiplier of 0, for ϵ=50\epsilon=50, we require a multiplier of 0.5\approx 0.5, and for ϵ=10\epsilon=10, we need a multiplier of 1.321.32. The smaller the multiplier, the lower the noise, but the higher the privacy loss leading to significant leaks.

Note that none of the provided ϵ\epsilon values are satisfactory. We expect this since the gradient variance is proportionately higher if we use a smaller noise multiplier.

Refer to caption
Figure 1. Uniform Partial Client Sampling v/s LOCKS [MNIST]: Current sampling techniques do not perform well for simple datasets with user differential privacy. Further investigation is necessary to improve these mechanisms under noise.

We also empirically analyze the optimal client sampling techniques suggested in (Wei et al., 2022) and (Li et al., 2019) as modified in Algorithm 1. In this scenario, we first randomly sample 30%30\% clients. We prioritize the clients with the highest private gradient norms in the second round. So, the probability of picking these clients is proportionate to their gradient size. By prioritizing the norm gradient, we see mild speed-ups in our convergence rate (empirically), as shown in Figure 1. Though LOCKS generally performs better for high-noise / high-privacy cases, we still need to understand the underlying mechanism for its success.

We showcase our preliminary results for MNIST in Figure 1.

8. Conclusions

This work demonstrates the convergence of user differentially private and federated systems for non-convex objective functions with an adaptive learning rate and partial client participation. We provide a general two-pronged sampling framework that reduces the communication requirement for Federated systems per round. Our method maintains an unbiased gradient estimate while providing a flexible framework for further analysis. We also offer an alternative approach for analyzing the weighted sum of user gradients. There are, however, significant gaps in the literature that suggest multiple open problems.

  • (Wang et al., 2022a) shows that different gradients due to data heterogeneity do not impact empirical results. Therefore, we need to understand why gradient dissimilarity is important and how it impacts the convergence rate.

  • Currently, our model requires d2log(T)2d^{2}log(T)^{2} steps to reasonably converge. A high number of steps adds significant communication and privacy costs, and future work should tackle this issue.

  • Study how adaptive clipping and adaptive sampling affect convergence rates.

  • Demonstrate empirical results for complex datasets.

  • Alternative approaches for gradient similarity using similarity metrics like angles, cosine similarity, and correlation.

References

  • (1)
  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (2016), 308–318.
  • Alain et al. (2015) Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. 2015. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481 (2015).
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 3-4 (2014), 211–407.
  • Kairouz et al. (2015) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for differential privacy. In International conference on machine learning. PMLR, 1376–1385.
  • Katharopoulos and Fleuret (2018) Angelos Katharopoulos and François Fleuret. 2018. Not all samples are created equal: Deep learning with importance sampling. In International conference on machine learning. PMLR, 2525–2534.
  • Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
  • Li et al. (2019) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. 2019. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189 (2019).
  • Loshchilov and Hutter (2015) Ilya Loshchilov and Frank Hutter. 2015. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343 (2015).
  • Mironov (2017) Ilya Mironov. 2017. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF). IEEE, 263–275.
  • Narayanan and Shmatikov (2006) Arvind Narayanan and Vitaly Shmatikov. 2006. How to break anonymity of the netflix prize dataset. arXiv preprint cs/0610105 (2006).
  • Wang et al. (2022a) Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, and Tong Zhang. 2022a. On the unreasonable effectiveness of federated averaging with heterogeneous data. arXiv preprint arXiv:2206.04723 (2022).
  • Wang et al. (2020) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. 2020. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems 33 (2020), 7611–7623.
  • Wang et al. (2022b) Lin Wang, YongXin Guo, Tao Lin, and Xiaoying Tang. 2022b. DELTA: Diverse Client Sampling for Fasting Federated Learning. https://doi.org/10.48550/ARXIV.2205.13925
  • Wei et al. (2022) Jianxin Wei, Ergute Bao, Xiaokui Xiao, and Yin Yang. 2022. DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling. arXiv preprint arXiv:2210.09634 (2022).