\@ACM@journal@bibstripfalse

LOCKS: User Differentially Private and Federated Optimal Client Sampling

Ajinkya K Mulay 1234-5678-9012 Purdue UniversityWest LafayetteINUSA [email protected]

Abstract.

With changes in privacy laws, there is often a hard requirement for client data to remain on the device rather than being sent to the server. Therefore, most processing happens on the device, and only an altered element is sent to the server. Such mechanisms are developed by leveraging differential privacy and federated learning. Differential privacy adds noise to the client outputs and thus deteriorates the quality of each iteration. This distributed setting adds a layer of complexity and additional communication and performance overhead. These costs are additive per round, so we need to reduce the number of iterations. In this work, we provide an analytical framework for studying the convergence guarantees of gradient-based distributed algorithms. We show that our private algorithm minimizes the expected gradient variance in $\approx d^{2}$ rounds, where $d$ is the dimensionality of the model. We discuss and suggest novel ways to improve the convergence rate to minimize the overhead using Importance Sampling (IS) and gradient diversity. Finally, we provide alternative frameworks that might be better suited to exploit client sampling techniques like IS and gradient diversity.

local differential privacy, high-dimensional linear models, federated learning

^†^†journalyear: YYYY^†^†journalvolume: YYYY^†^†journalnumber: X^†^†doi: XXXXXXX.XXXXXXX

1. Introduction

Stochastic Gradient Descent (SGD) functions by randomly selecting a subset of data points per round and computing their gradients. The update step in SGD uses the average gradient to modify the weights from the previous round. Optimal model weights can improve the trained model and its inference capabilities on unseen data. For high-dimensional non-convex parameter paces, obtaining optimal models is non-trivial and computationally heavy. Thus, improving the quality of gradients and accelerating the training process are fundamental goals for improving gradient algorithms.

However, as we often see in practice, the information assimilated in each gradient can vary widely. After the first few training epochs, the gradients of specific data points lose their value. (Katharopoulos and Fleuret, 2018). We could select data points randomly or by batching the entire training set per round. Both methods are inefficient and do not consider the information contributed by the gradients or the data points.

A way to assess the importance of each data point could be based on the magnitude of the gradient or loss value contributed by the data point (Loshchilov and Hutter, 2015), (Alain et al., 2015). Gradient magnitude refers to metrics like the $L_{1}$ -norm or $L_{2}$ -norm computed on individual gradients. Larger magnitudes suggest higher levels of information in gradients. For high-dimensional parameter models, gradient norms can be costly to compute. Computing loss values is straightforward but needs to reflect the gradient update accurately. In this work, we, therefore, focus on the magnitude of the gradient update. Such methods are based on Importance Sampling (IS) algorithms.

The rationale behind randomly picking data points per round is to obtain an unbiased estimate of the actual gradient. However, this unbiasedness comes at the cost of higher gradient variance. In such cases, Stochastic Variance Reduction Gradient (SVRG) techniques have been suggested to reduce the variance of the gradient.

Besides the gradient magnitude, we should prioritize diverse gradients. If two gradients are similar in the form of some similarity metric (ex., cosine similarity), then we may only need to use such a gradient. Gradient diversity should therefore be another basis for client selection.

This work further focuses on distributed and private machine-learning settings. Big data technologies have enabled massive leaps in predictive learning. However, these techniques commonly access private personal client data that can potentially cause severe harm to the owner if leaked. Such leaks are preventable under Federated Learning (FL) and Differential Privacy (DP), allowing owners tighter ownership over their data. For vanilla SGD, there are already techniques like FedAvg (Konečnỳ et al., 2016) and DP-SGD (Abadi et al., 2016) that can partially or fully protect private data. However, these algorithms’ privacy and convergence analysis cannot be directly applied to the algorithms proposed in this work.

In this work, we study the problem of partial client availability under Federated Learning under differential privacy. We analyze the expected gradient variance due to uniform sampling and the noise added by DP with adaptive server learning rates. As per our knowledge, such an exact theoretical analysis does not exist. Therefore, we provide analytical results and convergence guarantees in Theorem 5.1 in terms of the required iteration. We also suggested a modified approach that allows for lower communication rounds by simple scaling. We provide a pseudo-code for the algorithm suggested. We motivate the identification of other frameworks due to significant drawbacks to the one suggested in the reference papers.

2. Motivation

Federated Learning involves training machine learning models on individual clients rather than sending the entire dataset to a central server. Clients process and train their datasets with an SGD variant on-device. At each round, the server selects a subset of its client for training. Clients only send their computed gradient to the server. The server computes the average gradient over these client gradients and updates the model. After each round, the server sends back the updated model. These rounds are repeated until the model converges, similar to vanilla training. With more rounds, the communication cost incurred by the clients and the server substantially increases. Also, with limited computational resources and data, the clients must train longer and for more rounds. Therefore, we prefer to lower the number of training rounds.

Differential Privacy (DP) is a mathematical framework over queries to a dataset that limits the leakage of a single client or data point in a dataset. However, even though we wish to learn aggregated values, the query outputs can disclose information about a particular individual in the dataset. Combining auxiliary public data with aggregated histograms or anonymized datasets can help us extract exact individual identities (Narayanan and Shmatikov, 2006). DP adds noise or performs specific sampling to limit the impact of a single client or data point on the query outputs, thus protecting individual privacy. When accessing data over multiple rounds, DP needs to add noise at each stage; thus, the DP cost is proportional to the number of rounds. Higher DP costs lead to practically trivial privacy for real-world applications. Thus, reducing the number of rounds is crucial to ensuring high privacy standards.

Therefore, an accelerated SGD algorithm will improve training quality and speeds and potentially lead to more private methods. Federated IS (FedIS) and Differentially Private IS (DPIS) leverage gradient magnitude and diversity to achieve this acceleration. We study approaches that combine some of these ideas and introduce new ones for private training.

3. Preliminaries

3.1. Differential Privacy

DP proposes a solution to publishing or sharing private datasets. DP mathematically defines privacy leakage caused due to computations over private data points. By providing an attacker-agnostic and information-agnostic defense, DP protects against worst-case scenarios. Thus, we do not need specific assumptions about the attacker and are protected against future information releases.

In the DP literature, we commonly consider a dataset as a multi-set of $n$ samples in the $\mathcal{X}^{n}$ space. This is analogous to the dataset belonging to the real space $\mathbb{R}^{n\times p}$ . The following section is based on preliminary definitions found in (Dwork et al., 2014).

To understand the mathematical notion of DP, we define pure and approximate DP.

Definition 3.1.

(Differential Privacy) A randomized mechanism $\mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ is considered to be $(\epsilon,\delta)$ DP if for all neighboring datasets $x,x^{{}^{\prime}}\in\mathcal{X}^{n}$ differing in a single sample (i.e., $||x-x^{{}^{\prime}}||\leq 1$ ), we have

(1)

\displaystyle\text{Pr}[\mathcal{M}(x)\in\mathbb{R}^{d}]\leq e^{\epsilon}\text{Pr}[\mathcal{M}(x^{{}^{\prime}})\in\mathbb{R}^{d}]+\delta

Differential privacy allows for pure-DP ( $\delta=0$ ) or approximate-DP ( $\delta>0$ ), each delivered by a different noise addition mechanism. Approximate DP allows for an infinitesimally small probability of data leakage. However, it composes (adds up) well over multiple iterations compared to pure DP. For most computations in this work, we will stick to approximate DP.

Definition 3.2.

( $l_{k}$ -Sensitivity) For a function $f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ , we define its $l_{k}$ norm sensitivity (denoted as $\Delta_{k}f$ ) over all neighboring datasets $x,x^{{}^{\prime}}\in\mathcal{X}^{n}$ differing in a single sample as

(2)

\displaystyle\text{sup}_{x,x^{{}^{\prime}}\in\mathcal{X}^{n}}||f(x)-f(x^{{}^{\prime}})||_{k}\leq\Delta_{k}f

Given the $l_{2}$ sensitivity of a function, we can further define the Gaussian Mechanism that adds noise to the function $f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ . Gaussian DP is used to achieve approximate DP.

Lemma 3.3.

(Gaussian Mechanism) Consider an arbitrary $\epsilon\in(0,1)$ . Let the $l_{2}$ sensitivity of the function $f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ be $\Delta_{2}f$ . Further, consider $\sigma=\frac{c\Delta_{2}f}{\epsilon}$ where $c^{2}>2ln(\frac{1.25}{\delta})$ . Then, the following randomized mechanism $\mathcal{M}$ ,

(3)		$\displaystyle\eta$	$\displaystyle\sim\mathcal{N}(0,\sigma^{2})$
(4)		$\displaystyle\mathcal{M}(x,\sigma)$	$\displaystyle\coloneqq f(x)+\eta$

satisfies approximate or $(\epsilon,\delta)$ DP.

We define a few other properties of DP.

Definition 3.4.

(Rényi Divergence) We use Rényi divergence to quantify the closeness of two probability distributions. Given, two probability distributions $\mathcal{P}_{1},\mathcal{P}_{2}$ and their density functions $p_{1}(x),p_{2}(x)$ s.t. $x\in\mathcal{X}$ then the Rényi divergence of order ( $\alpha>1$ ) is given as,

(5)

\displaystyle D_{\alpha}(\mathcal{P}_{1}||\mathcal{P}_{2})\coloneqq\frac{1}{\alpha-1}ln\int_{\mathcal{X}}q(x)(\frac{p(x)}{q(c)})^{\alpha}dx

Based on the Rényi divergence, we use the Rényi Differential Privacy (RDP) (Mironov, 2017) to quantify our proposed algorithms’ privacy cost. Although approximate DP is a stricter notion of privacy, its composition analysis is weaker than the newer forms of privacy like RDP, Concentrated Differential Privacy (CDP), and z-CDP. In this case, we choose to use RDP to be consistent with the privacy costs shown in (Wei et al., 2022).

Definition 3.5.

(Rényi Differential Privacy (RDP)) For any neighboring datasets $x,x^{{}^{\prime}}\in\mathcal{X}^{n}$ , a randomized mechanism $\mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ satisfies $(\alpha,\tau)$ -RDP if

(6)

\displaystyle D_{\alpha}(\mathcal{M}(x)||\mathcal{M}(x^{{}^{\prime}}))\leq\tau

Definition 3.6.

( $(\alpha,\tau)$ -RDP to $(\epsilon,\delta)$ -DP conversion) The randomized mechanism $\mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{d}$ , satisfies $(\alpha,\tau)$ -RDP. For the constants, $\alpha>1,\epsilon\geq 0$ and appropriate $\delta,\mathcal{M}$ satisfies $(\epsilon,\delta)$ -DP for,

(7)

\displaystyle\epsilon=\tau+\frac{\text{log}(\frac{1}{\delta})+(\alpha-1)\text{log}(1-\frac{1}{\alpha})-\text{log}(\alpha)}{\alpha-1}

3.2. Federated Optimization

Federated Learning pursues distributed optimization by allowing clients to train models on their data. The raw data never leaves the device, and each client only sends the gradient value to be aggregated on the server. Thus, the optimization function is split across $n$ clients and can be arranged as the following,

(8)	$\displaystyle w^{*}$	$\displaystyle\coloneqq\text{arg min}_{w\in\mathbb{R}^{p}}F(w)$
(9)		$\displaystyle\coloneqq\text{arg min}_{w\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}f(w;X_{i},y_{i})$
(10)		$\displaystyle\coloneqq\text{arg min}_{w\in\mathbb{R}^{p}}\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$

Here, $X$ and $y$ are the dataset and response variable resp.

We assume a non-convex setting and make the following assumptions as made in (Wang et al., 2022b) (restated for completeness).

3.2.1. Assumptions

Assumption 1 (L-smooth).

Each client has a local objective function that is $L$ -Lipschitz Smooth ( $L>0$ ) s.t.

(11)

\displaystyle||\triangledown f_{i}(w_{1})-\triangledown f_{i}(w_{2})||_{2}\leq L||w_{1}-w_{2}||_{2},\forall w_{1},w_{2}\in\mathbb{R}^{p};i\in[n]

Assumption 2.

(Unbiased Local Gradient Estimator and Local Variance) Suppose for the $t^{th}$ round we have sample $i,x_{t}^{i}$ and model parameter $w_{t}$ . Then for all clients $i\in[n]$

(12)

\displaystyle E_{x_{t}^{i}\sim\mathbb{U}(S^{i})}[\triangledown f_{i}(w,x_{t}^{i})]=\triangledown f_{i}(w)

where $S^{i}$ is the set of all the data samples at client $i$ and $\mathbb{U}$ is the uniform distribution.

Further, the local utility functions have bounded local variance s.t.,

(13)

\displaystyle E_{x_{t}^{i}\sim\mathbb{U}(S^{i})}[||\triangledown f_{i}(w,x_{t}^{i})-\triangledown f_{i}(w)||^{2}]=\sigma_{L,i}^{2}\leq\sigma_{L}^{2}

Assumption 3.

(Bound Dissimilarity) (Wang et al., 2020) The following expression relates the expected value of the local and global objective functions,

(14)

\displaystyle\mathbb{E}_{t}||\triangledown f_{i}(w)||^{2}_{2}\leq(A^{2}+1)||\triangledown F(w)||^{2}_{2}+\sigma^{2}_{G}

where $A,\sigma_{G}\leq 0$ are constants. For the case when $A=\sigma_{G}=0$ , then all local loss functions are similar.

Assumption 4.

(Bounded Clipping) Suppose vector $x$ is clipped s.t. $x_{c}=\frac{x}{\max\{1,||x||_{2}/C\}}$ then we define $\nu_{x}\coloneqq x_{c}-x_{c}$ . We say that

(15)

\displaystyle E[||\nu_{x}||^{2}]\leq c_{\nu}^{2}

The last assumption is a new one that suggests that the variance due to clipping is generally small. (Wang et al., 2022b) also assumes that the stochastic gradient’s norm is uniformly bounded. However, since we will be clipping the local gradients before adding noise for DP, we do not require this explicit assumption. Rather, we use the clipping constant to bound each local gradient’s norm.

4. Preliminary Work

Lemma 4.1.

The expected batch size ( $\mathbb{E}[|\mathcal{Z}_{m,t}|]$ ) due to Algorithm 1 is less than b. Here $n$ is the number of total clients, $m$ is the fraction of clients chosen in the first round, and $\frac{m}{k}=b$ where $n,k,m\in\mathbb{Z}^{+}$ . Further note that $p_{i,t}\leq 1$ for all $i\in[n]$ . The values are defined as shown in Algorithm 1.

Proof: We know that $\mathbb{E}[|\mathcal{Z}_{m,t}|]$ can be written as the follows,

(16)	$\displaystyle\mathbb{E}[\|\mathcal{Z}_{m,t}\|]$	$\displaystyle=\sum_{i=1}^{n}1\cdot\frac{m}{n}\frac{p_{i,t}}{k}$
(17)		$\displaystyle=\sum_{i=1}^{n}\frac{b}{n}p_{i,t}$
(18)		$\displaystyle\leq\sum_{i=1}^{n}\frac{b}{n}=b$

Lemma 4.2.

The private gradient estimate at round $t$ , ( $\tilde{\Delta}_{t}$ ) in Algorithm 1 is an unbiased estimator of the aggregate clipped gradients.

Let us first understand the expression for $\tilde{\Delta}_{t}$ . Note that $\bar{u}_{t}=\frac{1}{b}\sum_{l\in\mathcal{Z}_{m,t}u_{l}}$ and thus by definition has expected value of zero. (1) follows from this idea. (2) follows from introducing the next sampling probability of clients in the first and second rounds (combined).

(19)	$\displaystyle\mathbb{E}_{t}[\tilde{\Delta}_{t}]$	$\displaystyle=\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]$
(20)		$\displaystyle\overset{(1)}{=}\mathbb{E}_{t}[\Delta_{t}]$
(21)		$\displaystyle=\mathbb{E}_{t}[\frac{1}{b}\sum_{j\in\mathcal{Z}_{m.t}}\frac{\bar{g}_{j}(w_{t})}{p_{j,t}}]$
(22)		$\displaystyle=\mathbb{E}_{t}[\frac{1}{b}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\sum_{i=1}^{n}\frac{\bar{g}_{i}(w_{t})}{p_{i,t}}]$
(23)		$\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{1}{bp_{i,t}}\mathbb{E}_{t}[\mathbbm{1}[i\in\mathcal{Z}_{m,t}]]$
(24)		$\displaystyle\overset{(2)}{=}\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{1}{bp_{i,t}}\frac{q_{i,t}p_{i,t}}{k}$
(25)		$\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{1}{b}\frac{m}{nk}$
(26)		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\bar{g}_{i}(w_{t})=\bar{\Delta}_{t}$

5. Main Theorems

We need to find an optimal sampling mechanism for our proposed method that minimizes overall variance. Additional variance is introduced due to the partial participation of clients leading to an approximation of the overall objective. We first study the impact of uniformly choosing samples from the set in Theorem 5.1.

Theorem 5.1.

Let the adaptive global learning rates be $\eta_{t}$ . Under our assumptions 1-4, the model parameter sequence $\{w_{t}\}$ generated by Algorithm 1 satisfies the following:

(27)		$\displaystyle\min_{t\in[T]}\mathbb{E}[\|\|\triangledown F(w_{t})\|\|_{2}^{2}]$	$\displaystyle\leq\frac{F_{(}w_{0})-F_{(}w_{*})}{\sqrt{T}\alpha_{\kappa}}+\Phi$
(28)			$\displaystyle=\mathcal{O}(\frac{\Delta F}{\sqrt{T}}+\sigma_{L}^{2}+c_{\nu}^{2}+\sigma_{0}^{2}+\frac{\sigma^{2}_{G}}{d})$

where the expectation is taken over the local dataset samples among clients. $\Phi$ is the variance seen due to random sampling and privacy noise. Here, $T=d^{2}log(T)^{2}$ . We uniformly sample clients as shown in Algorithm 1 with a net probability $\frac{bV_{0}}{n}$ (across both rounds of sampling). The variance bounds $\sigma_{L},\sigma_{G},c_{\nu}^{2}$ are introduced in assumptions 1-4. $d$ is the number of model parameters, $\sigma_{0}$ is the privacy noise added as per Algorithm 1. Furthermore,

(29)	$\displaystyle\Delta F$	$\displaystyle=F(w_{0})-F(w_{*})$
(30)	$\displaystyle\alpha_{\kappa}$	$\displaystyle=\eta_{b}\frac{\sqrt{T}-1}{\sqrt{T}}-\frac{V_{0}L\eta_{b}^{2}log(T)(A^{2}+1)}{b\sqrt{T}}$
(32)	$\displaystyle\Phi$	$\displaystyle=\frac{(2\eta_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}\alpha_{\kappa}}$
(33)		$\displaystyle+\frac{L\eta_{b}^{2}}{b\alpha_{\kappa}\sqrt{T}}[\frac{K_{max}(\sigma_{L}^{2}+c_{\nu}^{2})}{K_{min}}+\frac{\sigma^{2}_{G}+d\sigma^{2}_{0}}{2}]V_{0}log(T)$

Proof: We define $\bar{\Delta}_{t}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\tilde{g}_{i}(w_{t})$ . In the case all participants participate, we can see that $\bar{\Delta}_{t}=\tilde{\Delta}_{t}$ . However, for partial client participation we see that $\bar{\Delta}_{t}=\sum_{i=1}^{n}\tilde{g}_{i}(w_{t})\neq\sum_{i\in\mathcal{i\in\mathcal{X}_{m,t}}}\tilde{g}_{i}(w_{t})=\tilde{\Delta}_{t}$ . Thus, the randomness in partial client participation consists of client sampling, stochastic gradient, and privacy noise.

Due to the smoothness assumption on $F$ (by assumption 1), and taking the expectation of $F(w_{t+1})$ over the client sampling and at round $t$ . Note $w_{t}$ is unaffected by the sampling at round $t$ .

	$\displaystyle\mathbb{E}$	${}_{t}[F(w_{t+1})]$
		$\displaystyle\leq F(w_{t})+\langle\triangledown F(w_{t}),\mathbb{E}_{t}[w_{t+1}-w_{t}]\rangle+\frac{L}{2}\mathbb{E}_{t}[\|\|w_{t+1}-w_{t}\|\|^{2}]$
		$\displaystyle=F(w_{t})+\langle\triangledown F(w_{t}),\eta\mathbb{E}_{t}[\tilde{\Delta}_{t}]\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[\|\|\tilde{\Delta}_{t}\|\|^{2}]$
		$\displaystyle=F(w_{t})+\langle\triangledown F(w_{t}),\eta\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[\|\|\Delta_{t}+\bar{u}_{t}\|\|^{2}]$

Here, $\bar{u}_{t}=\frac{1}{|\mathcal{Z}_{m,t}|}\sum_{i\in\mathcal{Z}_{m,t}}u_{i}$ and then $\bar{u}_{t}\sim\mathcal{N}(0,\sqrt{|\mathcal{Z}_{m,t}|}\sigma_{0}^{2})$ . Further, $\Delta_{t}=\sum_{i\in\mathcal{Z}_{m,t}}\frac{\bar{g}_{i}(w_{t})}{nq_{i,t}p_{i,t}}$ . We note that expectation over the randomness of client sampling and private Gaussian noise gives us $\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]=\sum_{i\in\mathcal{Z}_{m,t}}\mathbb{E}_{t}[\frac{\bar{g}_{i}(w_{t})}{bnq_{i,t}p_{i,t}}+\bar{u}_{t}]=\sum_{i\in\mathcal{Z}_{m,t}}\mathbb{E}_{t}[\frac{\bar{g}_{i}(w_{t})}{bnq_{i,t}p_{i,t}}]=\mathbb{E}_{t}[\Delta_{t}]$ . Due to the independent Gaussian noise and expectation over its randomness, we have $\mathbb{E}_{t}[\bar{u}_{t}]=0$ and $\Delta_{t}\perp\!\!\!\perp\bar{u}_{t}\rightarrow\mathbb{E}_{t}[\langle\Delta_{t},\bar{u}_{t}\rangle]=0$ .

Thus, $\mathbb{E}_{t}[||\Delta_{t}+\bar{u}_{t}||^{2}_{2}]=\mathbb{E}_{t}[||\Delta_{t}||_{2}^{2}+||\bar{u}_{t}||^{2}_{2}+2\langle\Delta_{t},\bar{u}_{t}\rangle]=\mathbb{E}_{t}[||\Delta_{t}||_{2}^{2}+||\bar{u}_{t}||^{2}_{2}]$ . Therefore,

	$\displaystyle\mathbb{E}$	${}_{t}[F(w_{t+1})]$
(34)			$\displaystyle=F(w_{t})-\eta_{t}\|\|\triangledown F(w_{t})\|\|^{2}_{2}+\underbrace{\langle\triangledown F(w_{t}),\eta_{t}\mathbb{E}_{t}[\Delta_{t}+\triangledown F(w_{t})]\rangle}_{A_{1}}$
		$\displaystyle+\frac{L}{2}\eta_{t}^{2}\underbrace{\mathbb{E}_{t}[\|\|\Delta_{t}\|\|^{2}]}_{A_{2}}+\underbrace{\frac{L}{2}\eta_{t}^{2}\mathbb{E}_{t}[\|\|\bar{u}_{t}\|\|^{2}]}_{A_{3}}$

Let us bound term $A_{3}$ first. We note that by our sampling technique, the expected batch size is $\mathbb{E}[|\mathcal{Z}_{m,t}|]\leq b$ (Lemma 4.1) and that $\bar{u}_{t}\sim\mathcal{N}(0,\sqrt{|\mathcal{Z}_{m,t}|}\sigma_{0}^{2}\mathbb{I}_{d})$ .

(35)	$\displaystyle A_{3}$	$\displaystyle=\frac{L}{2}\eta_{t}^{2}\mathbb{E}_{t}[\|\|\bar{u}_{t}\|\|^{2}]$
(36)		$\displaystyle\overset{(1)}{=}\frac{L}{2}\eta_{t}^{2}\mathbb{E}_{t}[\|\|\sum_{i=1}^{n}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\frac{u_{i}}{bp_{i,t}}\|\|^{2}_{2}]$
(37)		$\displaystyle=\frac{L\eta_{t}^{2}}{2b^{2}}\sum_{i=1}^{n}\text{Pr}[i\in\mathcal{Z}_{m,t}]\mathbb{E}_{t}[\|\|\frac{u_{i}}{p_{i,t}}\|\|^{2}_{2}]$
(38)		$\displaystyle=\sum_{i=1}^{n}\frac{L\eta_{t}^{2}}{2b^{2}}\frac{p_{i,t}q_{i,t}}{kp_{i,t}^{2}}\mathbb{E}_{t}[\|\|u_{i}\|\|^{2}_{2}]$
(39)		$\displaystyle=\sum_{i=1}^{n}\frac{L\eta_{t}^{2}}{2bnp_{i,t}}d\sigma_{0}^{2}$

Here (1) is from the law of total expectation (or tower expression), (2) is due to $E[||x||_{2}^{2}]=E[\sum_{k=1}^{p}x_{i}^{2}]=p\sigma^{2}$ where $x_{i}\sim\mathcal{N}(0,\sigma^{2})$ , (3) is by Jensen’s inequality for a concave function (squared root is concave) and the (4) follows from Lemma 4.1.

Let us now bound $A_{2}$ with the assumption that we sample with replacement.

(40)	$\displaystyle A_{2}$	$\displaystyle=\mathbb{E}_{t}[\|\|\Delta_{t}\|\|_{2}^{2}]$
(41)		$\displaystyle=\mathbb{E}_{t}[\|\|\frac{1}{b}\sum_{j\in Z_{m,t}}\frac{\bar{g}_{j}(w_{t})}{p_{j,t}}\|\|^{2}]$
(42)		$\displaystyle\overset{(1)}{=}\underbrace{\frac{1}{b^{2}}\mathbb{E}_{t}[\|\|\sum_{j\in\mathcal{Z}_{m,t}}\sum_{r=1}^{K_{j}}\frac{\triangledown f_{j}(w_{t};x_{jr})-\triangledown f_{j}(w_{t})-\nu_{jr}}{p_{j,t}K_{j}}\|\|_{2}^{2}]}_{B_{1}}$
(43)		$\displaystyle+\underbrace{\frac{1}{b^{2}}\mathbb{E}_{t}[\|\|\sum_{j\in\mathcal{Z}_{m,t}}\frac{1}{p_{j,t}}\triangledown f_{j}(w_{t})\|\|_{2}^{2}]}_{B_{2}}$

Here, (1) follows from the simple expression $E[||x||^{2}]=E[||x-E[x]||^{2}]+||E[x]||^{2}$ . We first bound the first term $B_{1}$ . Note, $\nu_{jr}$ is the factor clipped off by clipping s.t. $\nu_{jr}\coloneqq\triangledown f_{i}(w_{t},x_{jr}-\triangledown\bar{f}_{i}(w_{t},x_{jr}$ . Here, $\bar{f}_{i}(w_{t},x_{jr}$ is the clipped form of $\triangledown f_{i}(w_{t},x_{jr}$ . We assume $E_{t}[||\nu_{jr}||_{2}^{2}]\leq c_{\nu}^{2}$ .

	$\displaystyle B_{1}$	$\displaystyle\overset{(1)}{=}\frac{1}{b^{2}}\mathbb{E}_{t}[\|\|\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}}{p_{i,t}K_{i}}\|\|^{2}]$
		$\displaystyle=\frac{1}{b^{2}}\|\|\sum_{i=1}^{n}\mathbb{E}_{t}[\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}}{p_{i,t}K_{i}}]\|\|^{2}$
		$\displaystyle\overset{(2)}{=}\sum_{i=1}^{n}\text{Pr}[i\in Z_{m,t}]\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}\|\|^{2}_{2}]$
		$\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{k}\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]\|\|^{2}_{2}$
(44)			$\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{kb^{2}p_{i,t}^{2}K_{i}^{2}}\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]\|\|^{2}_{2}$
(45)			$\displaystyle\leq\sum_{i=1}^{n}\frac{1}{bnp_{i,t}K_{min}^{2}}\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]\|\|^{2}_{2}$
(46)			$\displaystyle\overset{(3)}{\leq}\sum_{i=1}^{n}\frac{2K_{max}}{bnp_{i,t}K_{min}^{2}}(\sigma^{2}_{L}+c_{\nu}^{2})$

Here (1) rewrites the summation with the indicator variable trick, and in (2), we take the expected value of the indicator function (also equal to its probability). In (3), we use the inequality $E[||x_{1}+x_{2}+...+x_{n}||^{2}]\leq n(E[||x_{1}||^{2}]+...E[||x_{n}||^{2})$ and $E[||x_{1}-x_{2}||^{2}]\leq 2(E[||x_{1}||^{2}]+E[||x_{2}||^{2})$ , assumption 2 and assume $K_{max}=\max_{i=1}^{n}K_{i}$ (achieved by clipping the number of clients per user) and $K_{min}=\min_{i=1}^{n}K_{i}$ . The server broadcasts both $K_{max}$ and $K_{min}$ . Clients limit their training to $K_{max}$ samples with repetition and ensure that at least $K_{min}$ samples (with replacement) are trained each time.

Let us bound $B_{2}$ now,

(47)	$\displaystyle B_{2}$	$\displaystyle=\frac{1}{b^{2}}\mathbb{E}_{t}[\|\|\sum_{i=1}^{n}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\frac{1}{p_{i,t}}f_{i}(w_{t})\|\|^{2}_{2}]$
(48)		$\displaystyle\overset{(1)}{=}\frac{1}{b^{2}}\sum_{i=1}^{n}\text{Pr}(i\in\mathcal{Z}_{m,t})E_{t}\|\|\triangledown f_{i}(w_{t})\|\|^{2}$
(49)		$\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{kb^{2}p_{i,t}^{2}}E_{t}\|\|\triangledown f_{i}(w_{t})\|\|^{2}$
(50)		$\displaystyle=\sum_{i=1}^{n}\frac{1}{bnp_{i,t}}E_{t}\|\|\triangledown f_{i}(w_{t})\|\|^{2}$
(51)		$\displaystyle\overset{(2)}{\leq}\sum_{i=1}^{n}\frac{1}{bnp_{i,t}}((A^{2}+1)\|\|\triangledown F(w_{t})\|\|^{2}+\sigma^{2}_{G})$

where (1) follows from analysis similar to $B_{1}$ and cancellation of probabilities, (2) uses the inequality $E[||x_{1}+x_{2}+...+x_{n}||^{2}]\leq n(E[||x_{1}||^{2}]+...E[||x_{n}||^{2})$ and Assumption 3.

Now, we only need to bound $A_{1}$ .

(52)		$\displaystyle A_{1}$	$\displaystyle=\langle\triangledown F(w_{t}),\eta_{t}\mathbb{E}_{t}[\Delta_{t}+\triangledown F(w_{t})]\rangle$
(53)			$\displaystyle=\langle\triangledown F(w_{t}),\eta_{t}\mathbb{E}_{t}[\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\frac{-\bar{g}_{i}(w_{t})}{bp_{i,t}}+\triangledown F(w_{t})]\rangle$
		$\displaystyle=\langle\triangledown F(w_{t}),-\eta_{t}\mathbb{E}_{t}[\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}}{bp_{i,t}K_{i}}$
	$\displaystyle+$	$\displaystyle\triangledown F(w_{t})]\rangle$
		$\displaystyle=\langle\triangledown F(w_{t}),\eta_{t}\sum_{i,r}^{n,K_{i}}\text{Pr}[i\in\mathcal{Z}_{m,t}]E_{t}[\frac{-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}+\triangledown f_{i}(w_{t})}{bp_{i,t}K_{i}}]\rangle$
		$\displaystyle\overset{(1)}{\leq}\langle\sqrt{\eta_{t}}\triangledown F(w_{t}),\sum_{i,r}^{n,K_{i}}\frac{bp_{i,t}}{n}E_{t}[\frac{-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}+\triangledown f_{i}(w_{t})}{\frac{K_{min}bp_{i,t}}{\sqrt{\eta_{t}}}}]\rangle$
(54)			$\displaystyle=\langle\sqrt{\eta_{t}}\triangledown F(w_{t}),\sum_{i,r}^{n,K_{i}}\frac{\sqrt{\eta_{t}}}{nK_{min}}E_{t}[-\triangledown f_{i}(w_{t};x_{ir})+\nu_{ir}+\triangledown f_{i}(w_{t})]\rangle$
(55)			$\displaystyle\overset{(2)}{\leq}\frac{\eta_{t}}{2}\|\|\triangledown F(w_{t})\|\|^{2}+\frac{\eta_{t}}{2nK_{min}}\cdot 2K_{max}n(\sigma_{L}^{2}+c_{\nu}^{2})$
(56)			$\displaystyle=\frac{\eta_{t}}{2}\|\|\triangledown F(w_{t})\|\|^{2}+\frac{\eta_{t}K_{max}}{K_{min}}(\sigma_{L}^{2}+c_{\nu}^{2})$

where (1) is a simple re-shuffling of constants and (2) uses the inequality $\langle a,b\rangle\leq\frac{1}{2}(||a||^{2}_{2}+||b||^{2}_{2})$ and the analysis used in $B_{1}$ based on $E[||x_{1}+x_{2}+...+x_{n}||^{2}]\leq n(E[||x_{1}||^{2}]+...E[||x_{n}||^{2})$ .

Putting everything together, we get that,

(57)	$\displaystyle\mathbb{E}_{t}$	$\displaystyle[F(w_{t+1})]$
(58)		$\displaystyle\leq F(w_{t})-\frac{\eta_{t}}{2}(1-\sum_{i=1}^{n}\frac{L\eta_{t}(A^{2}+1)}{bnp_{i,t}})\|\|\triangledown F(w_{t})\|\|^{2}_{2}$
(59)		$\displaystyle+\frac{(\eta_{t}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}}(1+\sum_{i=1}^{n}\frac{L\eta_{t}}{bnp_{i,t}K_{min}})$
(60)		$\displaystyle+\sum_{i=1}^{n}\frac{L\eta_{t}^{2}}{2bnp_{i,t}}(\sigma^{2}_{G}+d\sigma_{0}^{2})$

Re-arranging the terms, taking expectations for all the steps and summing from $0$ to $T-1$ steps (and canceling some terms), and dividing by $T$ , we next get the following,

(61)		$\displaystyle\sum_{t=0}^{T-1}\mathbb{E}_{t}[\|\|\triangledown F(w_{t})\|\|^{2}](\frac{\eta}{2T}(1-\sum_{i=1}^{n}\frac{L\eta_{t}^{2}(A^{2}+1)}{bnp_{i,t}}))$
(62)		$\displaystyle\leq\frac{F(w_{0})-F(w_{T})}{T}+\sum_{t}^{T-1}\frac{(\eta_{t}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{TK_{min}}$
(63)		$\displaystyle+\sum_{t,i}^{T-1,n}\frac{L\eta_{t}^{2}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2})}{bnTp_{i,t}K_{min}^{2}}+\sum_{t,i}^{T-1,n}\frac{L\eta_{t}^{2}}{2bnTp_{i,t}}(\sigma^{2}_{G}+d\sigma_{0}^{2})$
(64)		$\displaystyle\overset{(1)}{\leq}\frac{F(w_{0})-F(w_{*})}{T}+\sum_{t}^{T-1}\frac{(\eta_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}\sqrt{t+1}T}$
(65)		$\displaystyle+\sum_{t,i}^{T-1,n}\frac{L\eta^{2}_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2})}{bnTp_{i,t}K_{min}^{2}(t+1)}+\sum_{t,i}^{T-1,n}\frac{L\eta_{b}^{2}}{2bnp_{i,t}(t+1)T}(\sigma^{2}_{G}+d\sigma_{0}^{2})$
(66)		$\displaystyle\overset{(2)}{\leq}\frac{F(w_{0})-F(w_{*})}{T}+\frac{(2\eta_{b}K_{max}(\sigma^{2}_{L}+c_{\nu}^{2}))}{K_{min}\sqrt{T}}$
(67)		$\displaystyle+\underbrace{\frac{L\eta_{b}^{2}}{bnT}[\frac{K_{max}(\sigma_{L}^{2}+c_{\nu}^{2})}{K_{min}}+\frac{\sigma^{2}_{G}+d\sigma^{2}_{0}}{2}]}_{C_{1}}\sum_{t,i}\frac{1}{p_{t,i}(t+1)}$

Here, in (1) we assume that $\eta_{t}=\frac{\eta_{b}}{\sqrt{t+1}}$ where $\eta_{b}$ is the base learning rate. Also, we know that by definition, $F(w_{*})\leq F(w_{T+1})\rightarrow-F(w_{*})\geq-F(w_{T+1})$ . (2) follows from the idea that $\sum_{t=0}^{T-1}\frac{1}{\sqrt{t-1}}\leq\int_{1}^{T-1}\frac{1}{\sqrt{t-1}}=2(\sqrt{T})-2\leq 2\sqrt{T}$ .

We need to ensure that for an appropriate constant and all values of $t$ , $\eta_{b},\frac{L\eta_{b}^{2}(A^{2}+1)}{bn}\sum_{i=1}^{n(t+1)}\frac{1}{p_{i,t}}\leq 1$ . Note that the value is largest when $t=1$ , and so we essentially need,

(68)

\displaystyle\sum_{i=1}^{n}\frac{1}{p_{i,t}}\leq\frac{bn}{L\eta_{b}^{2}(A^{2}+1)}

Finally, we notice that to reduce the variance; we need a policy that follows equation 68, $p_{i,t}\leq 1$ for all $i\in[n],t\in[0:T-1]$ combinations s.t.

(69)

\displaystyle\min_{p_{i,t}\leq 1,(\ref{eqn:assumption-new})}\sum_{t,i}\frac{C_{1}}{p_{i,t}(t+1)}

If we choose $p_{i,t}=\frac{1}{V_{0}}$ where $V_{0}>1$ is a constant, then we get the following value at round $t$ ,

(70)

\displaystyle\frac{\eta_{t}}{2}(1-\sum_{i}\frac{L\eta^{2}_{t}(A^{2}+1)}{bnp_{i,t}})=\frac{\eta_{t}}{2}(1-\frac{V_{0}L\eta^{2}_{t}(A^{2}+1)}{b})=\kappa_{t}

Taking summation over all values of $t$ ,

(71)		$\displaystyle\sum_{t}\kappa_{t}$	$\displaystyle=\sum_{t}\frac{\eta_{t}}{2}(1-\frac{V_{0}L\eta^{2}_{t}(A^{2}+1)}{b})$
(72)			$\displaystyle=\eta_{b}(\sqrt{T}-1)-\frac{V_{0}L\eta_{b}^{2}log(T)(A^{2}+1)}{b}=\alpha_{\kappa}\sqrt{T}$

where $\alpha_{\kappa}=\eta_{b}\frac{\sqrt{T}-1}{\sqrt{T}}-\frac{V_{0}L\eta_{b}^{2}log(T)(A^{2}+1)}{b\sqrt{T}}$ . Now, we can rewrite our expected variance function as,

(73)		$\displaystyle\min_{t}\mathbb{E}_{t}[\|\|\triangledown F(w_{t})\|\|^{2}_{2}]\frac{1}{T}\sum_{t}\kappa_{t}$
(74)		$\displaystyle=\frac{1}{T}\sum_{t}\min_{t}\mathbb{E}_{t}[\|\|\triangledown F(w_{t})\|\|^{2}_{2}]\kappa_{t}$
(75)		$\displaystyle\leq\frac{1}{T}\sum_{t}\mathbb{E}_{t}[\|\|\triangledown F(w_{t})\|\|^{2}_{2}]\kappa_{t}$

Substituting this back, our final result convergence result is,

(76)		$\displaystyle\min_{t}\mathbb{E}_{t}[\|\|\triangledown F(w_{t})\|\|^{2}_{2}]$
(77)		$\displaystyle\leq\frac{1}{T}\sum_{t}\mathbb{E}_{t}[\|\|\triangledown F(w_{t})\|\|^{2}_{2}]$
(78)		$\displaystyle=\mathcal{O}(\frac{\Delta F}{\sqrt{T}\alpha_{\kappa}}+\frac{\sigma^{2}_{L}+c_{\nu}^{2}}{\alpha_{\kappa}}+\frac{(\sigma^{2}_{L}+c_{\nu}^{2}+\sigma^{2}_{G}+d\sigma_{0}^{2})log(T)}{\sqrt{T}\alpha_{\kappa}})$
(79)		$\displaystyle=\mathcal{O}(\frac{\Delta F}{\sqrt{T}}+\sigma^{2}_{L}+c_{\nu}^{2}+\frac{(\sigma^{2}_{L}+c_{\nu}^{2}+\sigma^{2}_{G}+d\sigma_{0}^{2})log(T)}{\sqrt{T}})$

Here, $\Delta F=F(w_{0})-F(w_{*})$ . For this bound to be meaningful and assuming the bounded variances are reasonably small, the biggest challenge is the additional term of $dlog(T)\sigma^{2}_{0}$ . We minimize variance only when $\sqrt{T}\approx dlog(T)$ for models with many parameters. After that, we get the bound for uniform sampling as,

(80)

\displaystyle\min_{t}\mathbb{E}_{t}[||\triangledown F(w_{t})||^{2}_{2}]=\mathcal{O}(\frac{\Delta F}{\sqrt{T}}+\sigma_{L}^{2}+c_{\nu}^{2}+\frac{\sigma^{2}_{G}}{d}+\sigma_{0}^{2})

This concludes the proof of the main theorem. The probabilities in the denominator should be large. But, with dynamic sampling some probabilities might be significantly small leading to large values for the variance. Therefore, we look at alternative ways of tackling the same problem.

We notice that the unbiasedness of the gradient can be maintained without probability scaling by assuming a weighted gradient output.

Lemma 5.2.

The private unscaled gradient estimate at round $t$ , ( $\tilde{\Delta}_{t}^{u}$ ) is an unbiased estimator of the aggregate clipped gradients, assuming that at each round $\sum_{i=1}^{n}p_{i,t}=1$ and that the actual gradient obtained is weighted.

Let us first understand the expression for the unscaled $\tilde{\Delta}_{t}=\tilde{\Delta}_{t}^{u}$ .

(81)	$\displaystyle\mathbb{E}_{t}[\tilde{\Delta}_{t}^{u}]$	$\displaystyle=\mathbb{E}_{t}[\Delta_{t}^{u}+\bar{u}_{t}^{u}]$
(82)		$\displaystyle\overset{(1)}{=}\mathbb{E}_{t}[\Delta_{t}]$
(83)		$\displaystyle=\mathbb{E}_{t}[\frac{n}{b}\sum_{j\in\mathcal{Z}_{m.t}}{g}_{j}(w_{t})]$
(84)		$\displaystyle=\mathbb{E}_{t}[\frac{n}{b}\sum_{i=1}^{n}\mathbbm{1}[i\in\mathcal{Z}_{m,t}]\bar{g}_{i}(w_{t})]$
(85)		$\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{n}{b}\mathbb{E}_{t}[\mathbbm{1}[i\in\mathcal{Z}_{m,t}]]$
(86)		$\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{n}{b}\frac{1}{k}q_{i,t}p_{i,t}$
(87)		$\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})\frac{n}{b}\frac{m}{nk}p_{i,t}$
(88)		$\displaystyle=\sum_{i=1}^{n}\bar{g}_{i}(w_{t})p_{i,t}$

Further, the expected batch size is $\sum_{i}\frac{q_{i,t}p_{i,t}}{k}=\sum_{i}\frac{mp_{i,t}}{kn}=\sum_{i}\frac{b}{n}p_{i,t}\leq\sum_{i}\frac{b}{n}p_{i,t}=b$ as $p_{i,t}\leq 1$ . We believe this might be a better framework for analysis since this provides the following advantages:

•

Avoids having dynamic probabilities in the denominator.
•

Sums up to 1 and provides weighted gradients to denote the weights intuitively.
•

Maintains unbiased estimator, small batch size.

The only minor inconvenience might be the $n$ factor in the numerator. We study the variance factor as part of our future work.

Theorem 5.3.

Algorithm 1 is $(\epsilon,\delta)$ differentially private.

Proof: We notice that in Algorithm 1, the only private parameters are the raw gradients and their norms. Since, we bound each gradient by a clipped factor of $C$ , we have bounded sensitivity. Normally, we would add noise with the variance $=\frac{Cc_{0}}{\epsilon_{step}}$ where $c_{0}=2\sqrt{\text{ln}(1.25/\delta_{step})}$ and $\epsilon_{step},\delta_{step}$ are the per-step privacy budgets. With strong composition (Kairouz et al., 2015) we get,

(89)	$\displaystyle\epsilon$	$\displaystyle=\epsilon_{step}\sqrt{2T\text{ln}(1/\delta_{step})}+T\frac{e^{\epsilon_{step}}-1}{e^{\epsilon_{step}}+1}$
(90)		$\displaystyle\approx\epsilon_{step}\sqrt{2T\text{ln}(1/\delta_{step})}+\frac{T\epsilon}{\epsilon+1}$
(91)		$\displaystyle\delta=(T+1)\delta_{step}$

We add epsilons from the mechanisms with RDP (Mironov, 2017). However, the optimal value of $\alpha$ depends on the value of $\epsilon$ and $\delta$ . Thus, given a target $\delta$ , $\delta_{target}$ we have the following,

(92)		$\displaystyle E_{t}[\epsilon]$	$\displaystyle\leq m\epsilon_{s}(\alpha_{})+b\epsilon_{0}(\alpha_{})$
(93)			$\displaystyle\delta=\delta_{target}$

Here, $\alpha_{*}$ remains constant and can be replaced by the value that minimizes privacy costs. Further, the expectation is taken over the sampling randomness at step $t$ , and the value of $\alpha_{*}$ is also dependent on the number of steps.

6. Algorithms

For the LOCks algorithm (Algorithm 1), we provide details and explanations about the client-server architecture and how each update is processed.

In the first round, the server randomly picks a client w.p. $\frac{m}{n}$ . These clients compute a private estimate of their last gradient norm. If the norm is negative, we clip to a minimal value $\approx 10^{-6}$ . These norms are then converted to a probability distribution, and the ones with the highest probability are prioritized in the sampling. Thus, the probability is proportionate to the noisy gradient norm of the client. The clients picked to submit their private gradient to the server. The server aggregates this gradient and broadcasts the updated model to all the clients.

Algorithm 1 Generalized Optimal Client Sampling for Federated DP-SGD

1:Input:

n

federated clients, privacy parameters

\epsilon,\delta

, number of iterations

T

, expected batch size

b

, maximum/minimum samples per client

K_{max},K_{min}

\sigma_{s},\sigma_{0}

for gradient norms and gradient respectively identified by RDP analysis.

2:Initialize: model parameters

w_{0}\in\mathbb{R}^{p}

3:for t=0:T-1 do

\triangleright

Total Iterations

4: Get random sample

\mathcal{X}_{m,t}

with sampling probability

\frac{m}{n}

5: for

i\in\mathcal{X}_{m}

\triangleright

Computations at client

i

g_{i}(w_{t})\leftarrow\sum_{j}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ij})}{{K_{i}}}\in\mathbb{R}^{p}

\triangleright

Client

i

has

K_{i}

samples

\bar{g}_{i}(w_{t})\leftarrow\frac{g_{i}(w_{t})}{\max\{1,\frac{||g_{i}(w_{t})||_{2}}{C}\}}

\triangleright

Clip Gradients

v_{i}\sim\mathcal{N}(0,\sigma_{s}^{2}\mathbb{I}_{p})

\tilde{z}_{i}\leftarrow||\bar{g}_{i}(w_{t})||_{2}+v_{i}

\triangleright

Private Gradient Norm

10: Send

\tilde{z}_{i}

to the server

11: end for

12:

\tilde{z}_{S}\leftarrow\sum_{i\in\mathcal{X}_{t}}\tilde{z}_{i}

\triangleright

Server: Gradient norm sum over clients

13:

\tilde{z}_{S}^{{}^{\prime}}\leftarrow\min(\max(\tilde{z}_{S},bC),\tilde{N}C)

14: for

j\in\mathcal{X}_{t}

\triangleright

Computations at client

j

15:

p_{j,t}\leftarrow\min\{1,\frac{b\tilde{z}_{j}}{k\tilde{z}_{S}^{{}^{\prime}}}\}

16: end for

17: Sample

\mathcal{Z}_{m,t}

with the probability distribution derived from

q_{j},j\in\mathcal{X}_{t}

18: for

l\in\mathcal{Z}_{m,t}

\triangleright

Computations at client

l

19:

u_{l}\sim\mathcal{N}(0,\sigma^{2}_{0})

20: end for

21:

\tilde{\Delta}_{l}\leftarrow\frac{1}{b}\sum_{l\in\mathcal{Z}_{m,t}}(\bar{g}_{l}(w_{t})+u_{l})

\triangleright

Aggregation at server

22:

w_{t+1}\leftarrow w_{t}-\eta_{t}\tilde{\Delta}_{l}

\triangleright

Server Update

23: Return

w_{t+1}

to all clients

24:end for

7. Experiments

We provide preliminary experiments to motivate further investigation in this space. By uniformly sampling clients, we analyze a federated system. Here, the client only adds noise to the last gradient before sending it to the server. This provides user privacy and keeps the data on the device. For our analysis, we consider $100$ clients and assume that $20\%$ are uniformly randomly picked in each round. Each client runs $5$ local epochs between the server updates. The noise multiplier in each round is picked after searching for the best $\alpha$ value in RDP using Google’s Tensorflow privacy library. We notice that for an $\epsilon=\infty$ , we need a multiplier of $0$ , for $\epsilon=50$ , we require a multiplier of $\approx 0.5$ , and for $\epsilon=10$ , we need a multiplier of $1.32$ . The smaller the multiplier, the lower the noise, but the higher the privacy loss leading to significant leaks.

Note that none of the provided $\epsilon$ values are satisfactory. We expect this since the gradient variance is proportionately higher if we use a smaller noise multiplier.

Refer to caption — Figure 1. Uniform Partial Client Sampling v/s LOCKS [MNIST]: Current sampling techniques do not perform well for simple datasets with user differential privacy. Further investigation is necessary to improve these mechanisms under noise.

We also empirically analyze the optimal client sampling techniques suggested in (Wei et al., 2022) and (Li et al., 2019) as modified in Algorithm 1. In this scenario, we first randomly sample $30\%$ clients. We prioritize the clients with the highest private gradient norms in the second round. So, the probability of picking these clients is proportionate to their gradient size. By prioritizing the norm gradient, we see mild speed-ups in our convergence rate (empirically), as shown in Figure 1. Though LOCKS generally performs better for high-noise / high-privacy cases, we still need to understand the underlying mechanism for its success.

We showcase our preliminary results for MNIST in Figure 1.

8. Conclusions

This work demonstrates the convergence of user differentially private and federated systems for non-convex objective functions with an adaptive learning rate and partial client participation. We provide a general two-pronged sampling framework that reduces the communication requirement for Federated systems per round. Our method maintains an unbiased gradient estimate while providing a flexible framework for further analysis. We also offer an alternative approach for analyzing the weighted sum of user gradients. There are, however, significant gaps in the literature that suggest multiple open problems.

•

(Wang et al., 2022a) shows that different gradients due to data heterogeneity do not impact empirical results. Therefore, we need to understand why gradient dissimilarity is important and how it impacts the convergence rate.
•

Currently, our model requires $d^{2}log(T)^{2}$ steps to reasonably converge. A high number of steps adds significant communication and privacy costs, and future work should tackle this issue.
•

Study how adaptive clipping and adaptive sampling affect convergence rates.
•

Demonstrate empirical results for complex datasets.
•

Alternative approaches for gradient similarity using similarity metrics like angles, cosine similarity, and correlation.

References

(1)
Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC conference on computer and communications security (2016), 308–318.
Alain et al. (2015) Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. 2015. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481 (2015).
Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9, 3-4 (2014), 211–407.
Kairouz et al. (2015) Peter Kairouz, Sewoong Oh, and Pramod Viswanath. 2015. The composition theorem for differential privacy. In International conference on machine learning. PMLR, 1376–1385.
Katharopoulos and Fleuret (2018) Angelos Katharopoulos and François Fleuret. 2018. Not all samples are created equal: Deep learning with importance sampling. In International conference on machine learning. PMLR, 2525–2534.
Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
Li et al. (2019) Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. 2019. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189 (2019).
Loshchilov and Hutter (2015) Ilya Loshchilov and Frank Hutter. 2015. Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343 (2015).
Mironov (2017) Ilya Mironov. 2017. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF). IEEE, 263–275.
Narayanan and Shmatikov (2006) Arvind Narayanan and Vitaly Shmatikov. 2006. How to break anonymity of the netflix prize dataset. arXiv preprint cs/0610105 (2006).
Wang et al. (2022a) Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, and Tong Zhang. 2022a. On the unreasonable effectiveness of federated averaging with heterogeneous data. arXiv preprint arXiv:2206.04723 (2022).
Wang et al. (2020) Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. 2020. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems 33 (2020), 7611–7623.
Wang et al. (2022b) Lin Wang, YongXin Guo, Tao Lin, and Xiaoying Tang. 2022b. DELTA: Diverse Client Sampling for Fasting Federated Learning. https://doi.org/10.48550/ARXIV.2205.13925
Wei et al. (2022) Jianxin Wei, Ergute Bao, Xiaokui Xiao, and Yin Yang. 2022. DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling. arXiv preprint arXiv:2210.09634 (2022).

	$\displaystyle\mathbb{E}$	${}_{t}[F(w_{t+1})]$
		$\displaystyle\leq F(w_{t})+\langle\triangledown F(w_{t}),\mathbb{E}_{t}[w_{t+1}-w_{t}]\rangle+\frac{L}{2}\mathbb{E}_{t}[\|\|w_{t+1}-w_{t}\|\|^{2}]$
		$\displaystyle=F(w_{t})+\langle\triangledown F(w_{t}),\eta\mathbb{E}_{t}[\tilde{\Delta}_{t}]\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[\|\|\tilde{\Delta}_{t}\|\|^{2}]$
		$\displaystyle=F(w_{t})+\langle\triangledown F(w_{t}),\eta\mathbb{E}_{t}[\Delta_{t}+\bar{u}_{t}]\rangle+\frac{L}{2}\eta^{2}\mathbb{E}_{t}[\|\|\Delta_{t}+\bar{u}_{t}\|\|^{2}]$

	$\displaystyle B_{1}$	$\displaystyle\overset{(1)}{=}\frac{1}{b^{2}}\mathbb{E}_{t}[\|\|\sum_{i=1}^{n}\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}}{p_{i,t}K_{i}}\|\|^{2}]$
		$\displaystyle=\frac{1}{b^{2}}\|\|\sum_{i=1}^{n}\mathbb{E}_{t}[\mathbbm{1}[i\in Z_{m,t}]\sum_{r=1}^{K_{i}}\frac{\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}}{p_{i,t}K_{i}}]\|\|^{2}$
		$\displaystyle\overset{(2)}{=}\sum_{i=1}^{n}\text{Pr}[i\in Z_{m,t}]\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}\|\|^{2}_{2}]$
		$\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{k}\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]\|\|^{2}_{2}$
(44)			$\displaystyle=\sum_{i=1}^{n}\frac{p_{i,t}q_{i,t}}{kb^{2}p_{i,t}^{2}K_{i}^{2}}\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]\|\|^{2}_{2}$
(45)			$\displaystyle\leq\sum_{i=1}^{n}\frac{1}{bnp_{i,t}K_{min}^{2}}\mathbb{E}_{t}[\|\|\frac{1}{bp_{i,t}K_{i}}\sum_{r=1}^{K_{i}}[\triangledown f_{i}(w_{t};x_{ir})-\triangledown f_{i}(w_{t})-\nu_{ir}]\|\|^{2}_{2}$
(46)			$\displaystyle\overset{(3)}{\leq}\sum_{i=1}^{n}\frac{2K_{max}}{bnp_{i,t}K_{min}^{2}}(\sigma^{2}_{L}+c_{\nu}^{2})$