
Private Stochastic Convex Optimization:
Efficient Algorithms for Non-smooth Objectives

Raman Arora
[email protected]
   Teodor V. Marinov
[email protected]
   Enayat Ullah
[email protected]
   Johns Hopkins University
Abstract

In this paper, we revisit the problem of private stochastic convex optimization. We propose an algorithm based on noisy mirror descent, which achieves optimal rates both in terms of statistical complexity and number of queries to a first-order stochastic oracle in the regime when the privacy parameter is inversely proportional to the number of samples.

1 Introduction

Modern machine learning systems often leverage data that are generated ubiquitously and seamlessly through devices such as smartphones, cameras, microphones, or user’s weblogs, transaction logs, social media, etc. Much of this data is private, and releasing models trained on such data without serious privacy considerations can reveal sensitive information (Narayanan and Shmatikov, 2008; Sweeney, 1997). Consequently, much emphasis has been placed in recent years on machine learning under the constraints of a robust privacy guarantee. One such notion that has emerged as a de facto standard is that of differential privacy.

Informally, differential privacy provides a quantitative assessment of how different the outputs of a randomized algorithm are when it is fed two very similar inputs. If small changes in the input do not manifest as drastically different outputs, then it is hard to discern much information about the inputs solely based on the outputs of the algorithm. In the context of machine learning, this implies that if the learning algorithm is not overly sensitive to any single datum in the training set, then releasing the trained model should preserve the privacy of the training data. This requirement, a priori, seems compatible with the goal of learning, which is to find a model that generalizes well on the population and does not overfit to the given training sample. It seems reasonable then to argue that privacy is not necessarily at odds with generalization, especially when large training sets are available.

We take the following stochastic optimization view of machine learning, where the goal is to find a predictor that minimizes the expected loss (aka risk)

minw𝒲F(w)=𝔼z𝒟[f(w,z)],\min_{\mathrm{w}\in{\mathcal{W}}}F(\mathrm{w})=\mathbb{E}_{\mathrm{z}\sim\mathcal{D}}[f(\mathrm{w},\mathrm{z})], (1)

based on i.i.d. samples from the source distribution 𝒟\mathcal{D}, and full knowledge of the instantaneous objective function f(,)f(\cdot,\cdot) and the hypothesis class 𝒲{\mathcal{W}}. We are particularly interested in convex learning problems where the hypothesis class is a convex set and the loss function f(,z)f(\cdot,\mathrm{z}) is a convex function in the first argument for all z𝒵\mathrm{z}\in{\mathcal{Z}}. We seek a learning algorithm that uses the smallest possible number of samples and the least runtime and returns w~\tilde{\mathrm{w}} such that F(w~)infw𝒲F(w)+αF(\tilde{\mathrm{w}})\leq\inf_{\mathrm{w}\in{\mathcal{W}}}F(\mathrm{w})+\alpha, for a user specified α>0\alpha>0, while guaranteeing differential privacy (see Section 2.2 for a formal definition).

A natural approach to solving Problem 1 is sample average approximation (SAA), or empirical risk minimization (ERM), where we instead minimize an empirical approximation of the objective based on the i.i.d. sample. Empirical risk minimization for convex learning problems has been studied in the context of differential privacy by several researchers including Bassily et al. (2019) who give statistically efficient algorithms with oracle complexity matching that of optimal non-private ERM.

An alternative approach to solving Problem 1 is stochastic approximation (SA), wherein rather than form an approximation of the objective, the goal is to directly minimize the true risk. The learning algorithm is an iterative algorithm that processes a single sample from the population in each iteration to perform an update. Stochastic gradient descent (SGD), for instance, is a classic SA algorithm. Recent work of Feldman et al. (2020) gives optimal rates for convex learning problems (Problem 1) using stochastic approximation for smooth loss functions; however, they leave open the question of optimal rates for non-smooth convex learning problems, a large class that includes, for example, support vector machines. In this work, we focus on non-smooth convex learning problems and propose a simple algorithm which achieves nearly optimal rates in the special case where the privacy guarantee scales inversely with the number of samples. There are alternative approaches (which we became aware of in a private communication with Raef Bassily); for further discussion we refer the reader to Section 6.

2 Notation and Preliminaries

We consider the general learning setup of Vapnik (2013). Let 𝒵{\mathcal{Z}} be the sample space and let 𝒟{\mathcal{D}} be an unknown distribution over 𝒵{\mathcal{Z}}. We are given a sample z1,z2,,zn\mathrm{z}_{1},\mathrm{z}_{2},\ldots,\mathrm{z}_{n} drawn independently and identically (i.i.d.) from 𝒟{\mathcal{D}} – the samples zi\mathrm{z}_{i} may correspond to (feature, label) tuples as in supervised learning, or to features in unsupervised learning. We assume that the loss f:𝒲×𝒵Rf:{\mathcal{W}}\times{\mathcal{Z}}\rightarrow\mathrm{R} is a convex function in its first argument w\mathrm{w} and the hypothesis set 𝒲{\mathcal{W}} is a convex set. Given n samples, the goal is to minimize the population risk (Problem 1).

We assume that the hypothesis class 𝒲\mathcal{W} is a convex, closed and bounded set in Rd\mathrm{R}^{d} equipped with norm \left\|{\cdot}\right\|, such that wD\|\mathrm{w}\|\leq D for all w𝒲\mathrm{w}\in{\mathcal{W}}. We recall that the dual space of (Rd,)(\mathrm{R}^{d},\left\|{\cdot}\right\|) is the set of all linear functionals over it; the dual norm, denoted by \left\|{\cdot}\right\|_{*}, is defined as h=supw1h(w)\left\|{h}\right\|_{*}=\sup_{\left\|{w}\right\|\leq 1}h(\mathrm{w}), where hh is an element of the dual space (Rd,)(\mathrm{R}^{d},\left\|{\cdot}\right\|_{*}). In general, we will use \|\cdot\| to denote the 2\ell_{2} norm when there is no risk of confusion.

We do not assume that f(w,z)f(\mathrm{w},\mathrm{z}) is necessarily differentiable and will denote an arbitrary sub-gradient as f(w,z)\nabla f(\mathrm{w},\mathrm{z}). We assume that f(,z)f(\cdot,\mathrm{z}) is LL-Lipschitz with respect to the norm \left\|{\cdot}\right\|, i.e., |f(w1,z)f(w2,z)|Lw1w2\left|{f(\mathrm{w}_{1},\mathrm{z})-f(\mathrm{w}_{2},\mathrm{z})}\right|\leq L\left\|{\mathrm{w}_{1}-\mathrm{w}_{2}}\right\| for all z\mathrm{z}. For convex ff, this implies that sub-gradients with respect to w\mathrm{w} are bounded in the dual norm as f(w,z)L,w𝒲,z𝒵\left\|{\nabla f(\mathrm{w},\mathrm{z})}\right\|_{*}\leq L,\ \forall\ \mathrm{w}\in\mathcal{W},\mathrm{z}\in\mathcal{Z}. A popular instance of the above is the p/q\ell_{p}/\ell_{q} setup, which considers the p\ell_{p} norm in the primal space and the corresponding q\ell_{q} norm in the dual space such that 1p+1q=1\frac{1}{p}+\frac{1}{q}=1. A function g()g(\cdot) is β\beta-strongly smooth (or just β\beta-smooth) if g(w1)g(w2)βw1w2,w1,w2𝒲\left\|{\nabla g(\mathrm{w}_{1})-\nabla g(\mathrm{w}_{2})}\right\|_{*}\leq\beta\left\|{\mathrm{w}_{1}-\mathrm{w}_{2}}\right\|,\forall\mathrm{w}_{1},\mathrm{w}_{2}\in{\mathcal{W}}. For convex functions, this is equivalent to the condition g(w2)g(w1)+g(w1),w2w1+β2w1w22,w1,w2𝒲g(\mathrm{w}_{2})\leq g(\mathrm{w}_{1})+\langle\nabla g(\mathrm{w}_{1}),\mathrm{w}_{2}-\mathrm{w}_{1}\rangle+\frac{\beta}{2}\|\mathrm{w}_{1}-\mathrm{w}_{2}\|^{2},\forall\mathrm{w}_{1},\mathrm{w}_{2}\in\mathcal{W}. A function gg is λ\lambda-strongly convex if g(w2)g(w1)+g(w1),w2w1+λ2w1w22,w1,w2𝒲g(\mathrm{w}_{2})\geq g(\mathrm{w}_{1})+\langle\nabla g(\mathrm{w}_{1}),\mathrm{w}_{2}-\mathrm{w}_{1}\rangle+\frac{\lambda}{2}\left\|{\mathrm{w}_{1}-\mathrm{w}_{2}}\right\|^{2},\forall\mathrm{w}_{1},\mathrm{w}_{2}\in{\mathcal{W}}. Finally, we use O~()\tilde{O}(\cdot) to suppress poly-logarithmic factors in the complexity.

2.1 Mirror descent and potential functions

We recall some basics of convex duality, which will help us set up the mirror descent algorithm and analysis. For a convex function Φ:d\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}, we define its conjugate Φ:d{}\Phi^{*}:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{\infty\} as Φ(Y)=supXY,XΦ(X)\Phi^{*}(Y)=\sup_{X}\langle Y,X\rangle-\Phi(X). Moreover, DΦ(z||y)D_{\Phi}(\mathrm{z}||\mathrm{y}) denotes Bregman divergence induced by Φ\Phi, defined as

DΦ(z||y)=Φ(z)Φ(y)Φ(y),zy.\displaystyle D_{\Phi}(\mathrm{z}||\mathrm{y})=\Phi(\mathrm{z})-\Phi(\mathrm{y})-\langle\nabla\Phi(\mathrm{y}),\mathrm{z}-\mathrm{y}\rangle.

For a convex set 𝒲{\mathcal{W}}, we denote by I𝒲\mathrm{I}_{{\mathcal{W}}} the indicator function of the set 𝒲{\mathcal{W}}, I𝒲(z)=0 if z𝒲\mathrm{I}_{{\mathcal{W}}}(\mathrm{z})=0\ \text{ if }\ \mathrm{z}\in{\mathcal{W}}, and \infty otherwise. The following result from Kakade et al. (2009) establishes a natural relation between strong convexity and strong smoothness of a potential function and its conjugate, respectively.

Theorem 1 (Theorem 6 from Kakade et al. (2009)).

Assume Φ\Phi is a closed convex function. Then Φ\Phi is α\alpha-strongly convex with respect to a norm \|\cdot\| iff Φ\Phi^{*} is 1α\frac{1}{\alpha}-strongly smooth with respect to the dual norm \|\cdot\|_{*}.
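To make Theorem 1 concrete, the following sketch numerically checks the duality for the scaled squared Euclidean norm, whose conjugate has a closed form; the function names and the choice α=2 are ours and purely illustrative.

```python
import numpy as np

# Phi(x) = (alpha/2)||x||_2^2 is alpha-strongly convex w.r.t. ||.||_2, and its
# conjugate Phi*(y) = (1/(2*alpha))||y||_2^2 is (1/alpha)-smooth, as Theorem 1
# predicts. (alpha = 2 is an arbitrary illustrative choice.)
alpha = 2.0
rng = np.random.default_rng(0)

def phi(x):
    return 0.5 * alpha * np.dot(x, x)

def phi_star(y):
    return np.dot(y, y) / (2.0 * alpha)

def grad_phi_star(y):
    return y / alpha

for _ in range(200):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    # alpha-strong convexity of Phi: first-order lower bound with quadratic term
    assert phi(x2) >= phi(x1) + np.dot(alpha * x1, x2 - x1) \
        + 0.5 * alpha * np.dot(x2 - x1, x2 - x1) - 1e-9
    y1, y2 = rng.normal(size=3), rng.normal(size=3)
    # (1/alpha)-smoothness of Phi*: its gradient is (1/alpha)-Lipschitz
    assert np.linalg.norm(grad_phi_star(y1) - grad_phi_star(y2)) \
        <= np.linalg.norm(y1 - y2) / alpha + 1e-9
```

For this quadratic potential both inequalities in fact hold with equality, which makes it a convenient boundary case for the theorem.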

2.2 Differential privacy

We seek to design algorithms for solving the stochastic convex optimization problem (Problem 1) with the additional constraint that the algorithm’s output is differentially private.

Definition 1 ((ϵ,δ)(\epsilon,\delta)-differential privacy (Dwork et al., 2006)).

An algorithm 𝒜\mathcal{A} satisfies (ϵ,δ)(\epsilon,\delta)-differential privacy if given two datasets SS and SS^{\prime}, differing in only one data point, it satisfies that for any measurable event RRange(𝒜)R\in\text{Range}({\mathcal{A}})

(𝒜(S)R)eϵ(𝒜(S)R)+δ.\displaystyle\mathbb{P}(\mathcal{A}(S)\in R)\leq e^{\epsilon}\mathbb{P}(\mathcal{A}(S^{\prime})\in R)+\delta.
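Definition 1 can be sanity-checked on the classical Laplace mechanism, which is (ϵ,0)-differentially private for queries of sensitivity Δ when the noise scale is Δ/ϵ; the sketch below (illustrative values, not part of this paper's algorithm) bounds the output density ratio on a grid.

```python
import math

# A sensitivity-Delta query answered with Laplace noise of scale Delta/eps
# is (eps, 0)-DP: the density ratio at every output point is at most e^eps.
eps, Delta = 0.5, 1.0
b = Delta / eps                      # Laplace scale

def laplace_pdf(x, mu):
    return math.exp(-abs(x - mu) / b) / (2 * b)

# query values on two neighboring datasets, differing by at most Delta
q, q_prime = 3.0, 3.0 + Delta
worst_ratio = max(laplace_pdf(x / 10, q) / laplace_pdf(x / 10, q_prime)
                  for x in range(-100, 200))
assert worst_ratio <= math.exp(eps) + 1e-9
```

The triangle inequality gives |x−q′|−|x−q| ≤ Δ pointwise, so the ratio never exceeds exp(Δ/b) = exp(ϵ), which is exactly the δ = 0 case of the definition.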

3 Related Work

In convex learning and optimization, the following four classes of functions are widely studied: (a) LL-Lipschitz convex functions, (b) β\beta-smooth and convex functions, (c) λ\lambda-strongly convex functions, and (d) β\beta-smooth and λ\lambda-strongly convex functions. In the computational framework of a first-order stochastic oracle, algorithms with optimal oracle complexity for all these classes of functions have long been known (Agarwal et al., 2009). However, the landscape of known results changes with the additional constraint of privacy. We can trace two approaches to solving the private version of Problem 1. The first is private ERM (Chaudhuri et al., 2011; Bassily et al., 2014; Feldman et al., 2018; Bassily et al., 2019) and the second is private stochastic approximation (Feldman et al., 2020). Chaudhuri et al. (2011) began the study of private ERM by constructing algorithms based on output perturbation and objective perturbation. Under the assumption that the stochastic gradients are β\beta-Lipschitz continuous, the output perturbation bounds achieve excess population risk of O(LDmax(1/n,d/(nϵ),dβ/(n2/3ϵ)))O(LD\max(1/\sqrt{n},d/(n\epsilon),d\sqrt{\beta}/(n^{2/3}\epsilon))), where LL is the Lipschitz constant of the loss function and DD measures the diameter of the hypothesis class in the respective norm. The objective perturbation bounds have a similar form. Bassily et al. (2014) showed tight upper and lower bounds for the excess empirical risk. They also showed bounds for the excess population risk when the loss function does not have Lipschitz continuous gradients, achieving a rate of O(d1/4/(nϵ))O(d^{1/4}/(\sqrt{n}\epsilon)). Their techniques appeal to uniform convergence, i.e., bounding supw𝒲F(w)F^(w)\sup_{\mathrm{w}\in{\mathcal{W}}}F(\mathrm{w})-\widehat{F}(\mathrm{w}), and convert the guarantees on excess empirical risk into a bound on the excess population risk. These guarantees, however, were sub-optimal. Bassily et al. (2019) improved these to get optimal bounds on excess population risk, leveraging connections between algorithmic stability and generalization. The algorithms given by Bassily et al. (2019) are sample-efficient but their runtimes are superlinear (in the number of samples), whereas the non-private counterparts run in linear time. In a follow-up work, Feldman et al. (2020) improved the runtime of some of these algorithms without sacrificing statistical efficiency; however, the authors require that the stochastic gradients are Lipschitz continuous. Essentially, the statistical complexity of private stochastic convex optimization has been resolved, but some questions about computational efficiency still remain open. In the subsequent paragraphs, we discuss the different settings for the population loss, describe what is already known, and identify avenues for improvement.

Non-smooth Lipschitz Convex.

For the class of LL-Lipschitz convex functions, Bassily et al. (2014) improved upon Chaudhuri et al. (2011) and gave optimal bounds on excess empirical risk of O(dϵn)O\left(\frac{d}{\epsilon n}\right). They then appeal to uniform convergence to convert the guarantees on excess empirical risk to get an excess population risk of O(max(d1/4n,dϵn))O\left(\max\left(\frac{d^{1/4}}{\sqrt{n}},\frac{\sqrt{d}}{\epsilon n}\right)\right). This is sub-optimal and was very recently improved by Bassily et al. (2019) using connections between algorithmic stability and generalization to get O(max(1n,dϵn))O\left(\max\left(\frac{1}{\sqrt{n}},\frac{\sqrt{d}}{\epsilon n}\right)\right). This is optimal since 1n\frac{1}{\sqrt{n}} is the optimal excess risk without privacy constraints, and dϵn\frac{\sqrt{d}}{\epsilon n} is the optimal excess risk when the data distribution is the empirical distribution. This resolves the statistical complexity of private convex learning and shows that in various regimes of interest, e.g., when d=Θ(n)d=\Theta(n) and ϵ=Θ(1)\epsilon=\Theta(1), the constraint of privacy has no effect on utility. However, the algorithm proposed by Bassily et al. (2019) is based on Moreau smoothing and proximal methods, and requires O(n5)O(n^{5}) stochastic gradient computations. This rate vastly exceeds the gradient computations needed for non-private stochastic convex optimization which are of the order O(n)O(n). The computational complexity was improved by Feldman et al. (2020) to O(n2)O(n^{2}) by using a regularized ERM algorithm that runs in phases and after each phase, noise is added to the solution (output perturbation) and used as regularization for the next phase.

Smooth Lipschitz Convex.

For β\beta-smooth convex LL-Lipschitz functions, Bassily et al. (2019) give an algorithm with optimal bounds on excess risk of O(LDmax{1n,dϵn})O\left(LD\max\left\{\frac{1}{\sqrt{n}},\frac{\sqrt{d}}{\epsilon n}\right\}\right) with O(min{n3/2,n5/2d})O\left(\min\left\{n^{3/2},\frac{n^{5/2}}{d}\right\}\right) queries to the stochastic gradient oracle. This, again, was improved in a later work of Feldman et al. (2020) to O(n)O(n) stochastic gradient queries. Note that even for non-private stochastic optimization, O(n)O(n) stochastic gradient computations are necessary, so Feldman et al. (2020) achieve optimal statistical and oracle complexity for smooth Lipschitz convex functions. The algorithm they present is an instance of single-pass noisy SGD, and the guarantees hold for the last iterate.

(Smooth and Non-smooth) Lipschitz Strongly Convex.

For LL-Lipschitz λ\lambda-strongly convex functions, Bassily et al. (2014) gave an algorithm with optimal excess empirical risk of O~(dλn2ϵ2)\tilde{O}\left(\frac{d}{\lambda n^{2}\epsilon^{2}}\right). However, as in the non-strongly convex case, the corresponding excess population risk is O~(dλϵn)\tilde{O}\left(\frac{\sqrt{d}}{\lambda\epsilon n}\right), which is sub-optimal. For this case, Feldman et al. (2020) proposed an algorithm which achieves optimal rates in O(n)O(n) time for smooth functions but O(n2)O(n^{2}) for non-smooth functions.

4 Algorithm and Utility Analysis

In this section, we describe the proposed algorithm and provide convergence guarantees.

Recall that we are given a set of nn samples (z1,,zn)(\mathrm{z}_{1},\ldots,\mathrm{z}_{n}) drawn i.i.d. from 𝒟{\mathcal{D}}. We consider an iterative algorithm, wherein, at time tt, we sample index YtY_{t} uniformly at random from the set of indices, [n][n]. Note that YtY_{t} is a random variable; we denote the realization of YtY_{t} as yty_{t}. Through the run of the algorithm, we maintain a set, FtF_{t}, of indices observed until time tt, i.e., Ft={yττ<t}.F_{t}=\{y_{\tau}\mid\tau<t\}.

If ytFty_{t}\notin F_{t}, i.e., zyt\mathrm{z}_{y_{t}} is a fresh sample that has not been seen and processed prior to the ttht^{\textrm{th}} iteration, then we proceed with a projected gradient descent step using the noisy gradient f(wt,zyt)+ξt\nabla f(\mathrm{w}_{t},\mathrm{z}_{y_{t}})+\xi_{t}. If ytFty_{t}\in F_{t}, then the algorithm simply perturbs the current iterate, wt\mathrm{w}_{t}, and projects on to the set 𝒲{\mathcal{W}}. We remark that the noise-only step is important for the privacy analysis, as it allows for privacy amplification by sub-sampling.

The algorithm terminates when half of the points in the training set have been processed at least once, i.e., the size of FtF_{t} is greater than or equal to n/2n/2. We denote this stopping time by τ\tau. We refer the reader to Algorithm 1 for the pseudo-code.

Algorithm 1 Private SGD(w1,{ηt}t,{ξt}t,{zt}t\mathrm{w}_{1},\{\eta_{t}\}_{t},\{\xi_{t}\}_{t},\{\mathrm{z}_{t}\}_{t})
0:  Step size schedule {ηt}t\{\eta_{t}\}_{t}, noise sequence {ξt}t\{\xi_{t}\}_{t}, stream of data points {zt}t\{\mathrm{z}_{t}\}_{t}, initial iterate w1\mathrm{w}_{1}
1:  F1=F_{1}=\emptyset,τ=1\tau=1
2:  while |Fτ|n/2|F_{\tau}|\leq n/2 do
3:     Sample index yτy_{\tau} uniformly from [n][n]
4:     if {yτ}Fτ\{y_{\tau}\}\bigcap F_{\tau}\neq\emptyset then
5:        wτ+1=𝒫(wτητξτ)\mathrm{w}_{\tau+1}=\mathcal{P}(\mathrm{w}_{\tau}-\eta_{\tau}\xi_{\tau})
6:        Fτ+1=FτF_{\tau+1}=F_{\tau}
7:     else
8:        wτ+1=𝒫(wτητ(f(wτ,zyτ)+ξτ))\mathrm{w}_{\tau+1}=\mathcal{P}(\mathrm{w}_{\tau}-\eta_{\tau}(\nabla f(\mathrm{w}_{\tau},\mathrm{z}_{y_{\tau}})+\xi_{\tau}))
9:        Fτ+1=Fτ{yτ}F_{\tau+1}=F_{\tau}\bigcup\{y_{\tau}\}
10:     end if
11:     τ=τ+1\tau=\tau+1
12:  end while
Output: Average iterate w^τ=1|Fτ|jFτwj\widehat{\mathrm{w}}_{\tau}=\frac{1}{|F_{\tau}|}\sum_{j\in F_{\tau}}\mathrm{w}_{j}
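For concreteness, the following is a minimal Python sketch of Algorithm 1 under our reading of the pseudo-code: function and variable names are illustrative, and we average the iterates produced at fresh-sample steps.

```python
import numpy as np

def private_sgd(w1, eta, sigma, Z, D, grad_f, rng):
    """Sketch of Algorithm 1: projected noisy SGD where a repeated index
    triggers a noise-only step. Z is the n x d sample matrix, grad_f(w, z)
    returns a subgradient of f(., z), and iterates stay in the D-ball."""
    n = Z.shape[0]
    d = w1.shape[0]

    def project(w):                       # Euclidean projection onto {||w|| <= D}
        nrm = np.linalg.norm(w)
        return w if nrm <= D else w * (D / nrm)

    w, seen, iterates = w1, set(), []
    while len(seen) <= n / 2:             # stop once half the sample is processed
        y = rng.integers(n)               # sample an index uniformly from [n]
        xi = rng.normal(0.0, sigma, size=d)
        if y in seen:                     # noise-only step (enables amplification)
            w = project(w - eta * xi)
        else:                             # fresh sample: noisy subgradient step
            w = project(w - eta * (grad_f(w, Z[y]) + xi))
            seen.add(y)
            iterates.append(w)
    return np.mean(iterates, axis=0)      # average over fresh-sample steps
```

For instance, with the non-smooth loss f(w, z) = ‖w − z‖, whose subgradient is (w − z)/‖w − z‖, one would pass that map as `grad_f` and obtain the averaged iterate.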

Next, we establish the utility guarantee for Algorithm 1. We first show that τ\tau is finite almost surely and bounded by O(n)O(n) with probability 1O(exp(n))1-O(\operatorname{exp}\left(-n\right)). The proof follows a standard bins-and-balls argument.

Theorem 2.

For any n16n\geq 16, with probability 12exp(n/16)1-2\operatorname{exp}\left(-n/16\right), it holds that τ2n\tau\leq 2n, which implies 𝔼[τ]O(n)\mathbb{E}[\tau]\leq O(n).
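The bins-and-balls behaviour of the stopping time is easy to simulate; the sketch below (illustrative, not part of the analysis) draws indices with replacement until n/2 distinct ones have been seen.

```python
import random

def stopping_time(n, rng):
    """Simulate tau: draw uniformly from [n] with replacement until
    more than n/2 distinct indices have been seen."""
    seen, t = set(), 0
    while len(seen) <= n / 2:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
n = 200
taus = [stopping_time(n, rng) for _ in range(500)]
assert max(taus) <= 2 * n          # consistent with Theorem 2: tau <= 2n w.h.p.
assert sum(taus) / len(taus) < n   # empirically E[tau] is about n*ln(2)
```

The mean concentrates near n·ln 2, a coupon-collector calculation, so the 2n bound of Theorem 2 is comfortably loose in simulation.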

We proceed with bounding the regret of Algorithm 1. Given a sequence of convex functions f~t:𝒲\tilde{f}_{t}:\mathcal{W}\rightarrow\mathbb{R}, instantiated below in terms of the samples and the injected noise, we bound the regret

R¯(τ,u)=t=1τ𝔼[f~t(wt)]t=1τ𝔼[f~t(u)],\displaystyle\bar{R}(\tau,u)=\sum_{t=1}^{\tau}\mathbb{E}[\tilde{f}_{t}(\mathrm{w}_{t})]-\sum_{t=1}^{\tau}\mathbb{E}[\tilde{f}_{t}(\mathrm{u})],

where the expectation is with respect to any randomness in the algorithm and randomness of f~t\tilde{f}_{t}’s. We assume full access to the gradient of f~t\tilde{f}_{t} and that the diameter of 𝒲\mathcal{W} is DD, i.e., w,v𝒲,wvD\forall\mathrm{w},\mathrm{v}\in\mathcal{W},\|\mathrm{w}-\mathrm{v}\|\leq D.

Our setup fits the popular framework of Online Stochastic Mirror Descent (OSMD) algorithm, wherein, given a strictly convex potential Φ:d\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}, the updates are given as

w~t+1\displaystyle\tilde{\mathrm{w}}_{t+1} =Φ(Φ(wt)ηf~t(wt))\displaystyle=\nabla\Phi^{*}(\nabla\Phi(\mathrm{w}_{t})-\eta\nabla\tilde{f}_{t}(\mathrm{w}_{t})) (2)
wt+1\displaystyle\mathrm{w}_{t+1} =argminw𝒲DΦ(w||w~t+1).\displaystyle=\mathop{\arg\min}_{\mathrm{w}\in{\mathcal{W}}}D_{\Phi}(\mathrm{w}||\tilde{\mathrm{w}}_{t+1}).
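As a concrete instance of update (2): for the negative-entropy potential on the probability simplex, OSMD reduces to the multiplicative-weights update. The sketch below verifies this numerically; it is a standard fact, not specific to this paper.

```python
import numpy as np

# For Phi(w) = sum_i w_i log w_i on the simplex, grad Phi(w) = 1 + log w and
# grad Phi*(theta) = exp(theta - 1), while the Bregman (KL) projection onto
# the simplex is plain normalization. Update (2) therefore reduces to
#   w_{t+1,i} proportional to w_{t,i} * exp(-eta * g_i).
def osmd_entropy_step(w, g, eta):
    w_tilde = np.exp((1.0 + np.log(w)) - eta * g - 1.0)  # grad Phi*(grad Phi(w) - eta g)
    return w_tilde / w_tilde.sum()                        # KL projection onto simplex

w = np.full(4, 0.25)
g = np.array([1.0, 0.0, -1.0, 0.5])
eta = 0.1
step = osmd_entropy_step(w, g, eta)
direct = w * np.exp(-eta * g)
direct /= direct.sum()                                    # multiplicative weights
assert np.allclose(step, direct)
```

With Φ = ½‖·‖₂², the same two-step update instead collapses to the projected gradient step used in Algorithm 1.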

We utilize the following result.

Theorem 3 (Theorem 5.5 (Bubeck and Cesa-Bianchi, 2012)).

Let f1,,fτf_{1},\ldots,f_{\tau} be a sequence of functions from d\mathbb{R}^{d} to \mathbb{R}. Suppose f~t\nabla\tilde{f}_{t} is an unbiased estimator of ft\nabla f_{t}, i.e., 𝔼[f~t]=ft\mathbb{E}[\nabla\tilde{f}_{t}]=\nabla f_{t}. If one initializes OSMD 2 with w1=argminw𝒲Φ(w)\mathrm{w}_{1}=\mathop{\arg\min}_{\mathrm{w}\in{\mathcal{W}}}\Phi(\mathrm{w}), then the expected pseudo-regret R¯(τ)\bar{R}(\tau) is bounded as

R¯(τ,u)DΦ(u||w1)η+1ηt=1τ𝔼[DΦ(ft(wt)ηf~t(wt)||ft(wt))].\displaystyle\bar{R}(\tau,\mathrm{u})\leq\frac{D_{\Phi}(\mathrm{u}||\mathrm{w}_{1})}{\eta}+\frac{1}{\eta}\sum_{t=1}^{\tau}\mathbb{E}[D_{\Phi^{*}}(\nabla f_{t}(\mathrm{w}_{t})-\eta\nabla\tilde{f}_{t}(\mathrm{w}_{t})||\nabla f_{t}(\mathrm{w}_{t}))].

For any fixed τ<\tau<\infty, we can view Algorithm 1 as an instance of OSMD, with Φ1222\Phi\equiv\frac{1}{2}\|\cdot\|_{2}^{2}, and f~t()\tilde{f}_{t}(\cdot) defined as

f~t()={f(,zyt)+ξt,ytFtξt,otherwise,\displaystyle\tilde{f}_{t}(\cdot)=\begin{cases}f(\cdot,\mathrm{z}_{y_{t}})+\langle\xi_{t},\cdot\rangle&y_{t}\not\in F_{t}\\ \langle\xi_{t},\cdot\rangle&\text{otherwise,}\end{cases}

and ft(w)=𝔼ξt[f~t(w)]f_{t}(\mathrm{w})=\mathbb{E}_{\xi_{t}}[\tilde{f}_{t}(\mathrm{w})]. Theorem 1 implies that for any z,yd\mathrm{z},\mathrm{y}\in\mathbb{R}^{d}, DΦ(z||y)12αzy2D_{\Phi^{*}}(\mathrm{z}||\mathrm{y})\leq\frac{1}{2\alpha}\|\mathrm{z}-\mathrm{y}\|_{*}^{2}, where α\alpha is the strong-convexity parameter of the potential (in this case α=1\alpha=1) – using this result with Theorem 3, and taking expectation with respect to the randomness in τ\tau and ξt,t[τ]\xi_{t},t\in[\tau], we have that for any u𝒲\mathrm{u}\in{\mathcal{W}}

𝔼[t=1τft(wt)]𝔼[t=1τft(u)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(\mathrm{w}_{t})\right]-\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(u)\right] uw1222η+12η𝔼[t=1τη2f~t(wt)22]\displaystyle\leq\frac{\|\mathrm{u}-\mathrm{w}_{1}\|_{2}^{2}}{2\eta}+\frac{1}{2\eta}\mathbb{E}\left[\sum_{t=1}^{\tau}\eta^{2}\|\nabla\tilde{f}_{t}(\mathrm{w}_{t})\|_{2}^{2}\right] (3)
uw1222η+η𝔼[t=1τft(wt)22+ξt22].\displaystyle\leq\frac{\|\mathrm{u}-\mathrm{w}_{1}\|_{2}^{2}}{2\eta}+\eta\mathbb{E}\left[\sum_{t=1}^{\tau}\|\nabla f_{t}(\mathrm{w}_{t})\|_{2}^{2}+\|\xi_{t}\|_{2}^{2}\right].

Since ξt𝒩(0,σ2I)\xi_{t}\sim\mathcal{N}(0,\sigma^{2}\mathrm{I}), we have the following corollary.

Corollary 1.

Suppose the diameter of 𝒲{\mathcal{W}} is bounded by DD and that f(,zt)f(\cdot,\mathrm{z}_{t}) is an LL-Lipschitz function for all t[n]t\in[n]. Running Algorithm 1 for τ\tau iterations with noise sequence ξt𝒩(0,σ2I)\xi_{t}\sim\mathcal{N}(0,\sigma^{2}\mathrm{I}) and step size ηt=Dn(L+σd)\eta_{t}=\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})} guarantees that

𝔼[t=1τft(wt)]𝔼[t=1τft(u)]2D(L+σd)n.\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(\mathrm{w}_{t})\right]-\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(u)\right]\leq 2D(L+\sigma\sqrt{d})\sqrt{n}.
Proof.

Combine Equation 3, Wald's lemma, and the assumptions of the corollary to get

𝔼[t=1τft(wt)]𝔼[t=1τft(u)]D22η+η(L2+σ2d)𝔼[τ].\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(\mathrm{w}_{t})\right]-\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(u)\right]\leq\frac{D^{2}}{2\eta}+\eta(L^{2}+\sigma^{2}d)\mathbb{E}[\tau].

Using the bound on 𝔼[τ]O(n)\mathbb{E}[\tau]\leq O(n) from Theorem 2 together with the step-size choice in the corollary claim finishes the proof. ∎
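To unpack the arithmetic of this last step (a sketch with loose constants; we use 𝔼[τ] ≤ 2n from Theorem 2 and the elementary bound L² + σ²d ≤ (L + σ√d)²):

\displaystyle\frac{D^{2}}{2\eta}+\eta(L^{2}+\sigma^{2}d)\,\mathbb{E}[\tau]\leq\frac{D\sqrt{n}(L+\sigma\sqrt{d})}{2}+\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})}\cdot(L+\sigma\sqrt{d})^{2}\cdot 2n=\frac{5}{2}D(L+\sigma\sqrt{d})\sqrt{n},

which, after normalizing by nn via Lemma 1, is consistent with the 52\frac{5}{2} constant appearing in Theorem 4.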

The result now follows using the Corollary above and the following lemma.

Lemma 1.

For the iterates {wt}t=1τ\{\mathrm{w}_{t}\}_{t=1}^{\tau} of Algorithm 1 it holds that 𝔼[tFτf(wt,zYt)]n𝔼[F(w^τ)]\mathbb{E}[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})]\geq n\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})], where w^τ\widehat{\mathrm{w}}_{\tau} is the output of the algorithm.

Proof.

Let St=|Ft|S_{t}=|F_{t}| be the random variable measuring the size of FtF_{t}. It then holds that τ\tau is a stopping time for the process {St}t=1\{S_{t}\}_{t=1}^{\infty}, i.e., τ=min{t:St>n/2}\tau=\min\{t:S_{t}>n/2\}. Further, for some TT, denote by 𝐲=(y1,,yT)\mathbf{y}=(y_{1},\ldots,y_{T}) the vector of sampled indices y1,,yTy_{1},\ldots,y_{T}. Let us focus on the term 𝔼[tFτf(wt,zYt)]\mathbb{E}[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})]. Let μ=𝒩(𝟎,σ2𝐈d)T×Unif[n]T\mu=\mathcal{N}(\boldsymbol{0},\sigma^{2}\mathbf{I}_{d})^{T}\times\text{Unif}[n]^{T} be the measure capturing all the randomness after TT iterations except for the randomness with respect to the data distribution 𝒟\mathcal{D}. Furthermore, let χ()\chi(\cdot) denote the indicator of the input event. We now show that 𝔼[tFτf(wt,zYt)]=𝔼[tFτF(wt)]\mathbb{E}[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})]=\mathbb{E}\left[\sum_{t\in F_{\tau}}F(\mathrm{w}_{t})\right], which essentially follows using the tower rule of expectation. We have the following

𝔼[tFτf(wt,zYt)]\displaystyle\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})\right] =𝔼[𝔼[tFτf(wt,zYt)|τ,Fτ]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{\mathrm{Y}_{t}})|\tau,F_{\tau}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼[χ(τ=T,FT=𝐲)tFTf(wt,zYt)[τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in F_{T}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼μ×𝒟T[χ(τ=T,FT=𝐲)t𝐲f(wt,zyt)[τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{\mu\times{\mathcal{D}}^{T}}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in\mathbf{y}}f(\mathrm{w}_{t},\mathrm{z}_{y_{t}})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼μ[χ(τ=T,FT=𝐲)𝔼𝒟T[t𝐲f(wt,zyt)][τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{\mu}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{{\mathcal{D}}^{T}}[\sum_{t\in\mathbf{y}}f(\mathrm{w}_{t},\mathrm{z}_{y_{t}})]}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼μ[χ(τ=T,FT=𝐲)𝔼𝒟T[t𝐲F(wt)][τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{\mu}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{{\mathcal{D}}^{T}}[\sum_{t\in\mathbf{y}}F(\mathrm{w}_{t})]}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼[χ(τ=T,FT=𝐲)t𝐲F(wt)[τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in\mathbf{y}}F(\mathrm{w}_{t})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[𝔼[tFτF(wt)|τ,Fτ]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\sum_{t\in F_{\tau}}F(\mathrm{w}_{t})|\tau,F_{\tau}\right]\right]
=𝔼[tFτF(wt)]\displaystyle=\mathbb{E}\left[\sum_{t\in F_{\tau}}F(\mathrm{w}_{t})\right]
n𝔼[F(w^τ)],\displaystyle\geq n\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})],

where in the second equality, we condition on τ,Fτ\tau,F_{\tau}, and write the conditional expectation as total expectation as: 𝔼[tFτf(wt,zYt)|τ,Fτ]=𝔼[χ(τ=T,FT=𝐲)tFTf(wt,zYt)[τ=T,FT=𝐲]]\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{\mathrm{Y}_{t}})|\tau,F_{\tau}\right]=\mathbb{E}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in F_{T}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]. In the fourth equality, we use the fact that the iterates wt\mathrm{w}_{t} are fixed, and take the expectation of f(wt,zyt)f(\mathrm{w}_{t},\mathrm{z}_{y_{t}}) with respect to the data generating distribution 𝒟{\mathcal{D}}, yielding F(wt)F(\mathrm{w}_{t}). The rest of the steps collapses the conditional expectation back to a total expectation, and finally the last inequality holds using convexity. Using Corollary 1 we get

2D(L+σd)n\displaystyle 2D(L+\sigma\sqrt{d})\sqrt{n} 𝔼[tFτf(wt,zYt)]+𝔼[t=1τξt,wt]𝔼[(tFτf(u,zyt)+t=1τξt,u)]\displaystyle\geq\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})\right]+\mathbb{E}\left[\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{w}_{t}\rangle\right]-\mathbb{E}\left[\left(\sum_{t\in F_{\tau}}f(\mathrm{u},\mathrm{z}_{y_{t}})+\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{u}\rangle\right)\right]
n𝔼[F(w^τ)]+𝔼[τ]×0nF(u)+𝔼[τ]×0=n(𝔼[F(w^τ)]F(u)),\displaystyle\geq n\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})]+\mathbb{E}[\tau]\times 0-nF(\mathrm{u})+\mathbb{E}[\tau]\times 0=n(\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})]-F(\mathrm{u})),

where in the second inequality we have used Wald's lemma to guarantee 𝔼[t=1τξt,wt]=0\mathbb{E}[\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{w}_{t}\rangle]=0 and 𝔼[t=1τξt,u]=0\mathbb{E}[\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{u}\rangle]=0. ∎

Corollary 1 with Lemma 1 gives the utility guarantee.

Theorem 4.

Suppose the elements in 𝒲{\mathcal{W}} are bounded in norm by DD and that f(,zt)f(\cdot,\mathrm{z}_{t}) is an LL-Lipschitz function for all t[n]t\in[n]. Running Algorithm 1 for τ\tau iterations with noise sequence ξt𝒩(0,σ2I)\xi_{t}\sim\mathcal{N}(0,\sigma^{2}\mathrm{I}) and step size ηt=Dn(L+σd)\eta_{t}=\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})} guarantees that

𝔼[F(w^τ)]F(w)52D(L+σd)n.\displaystyle\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})]-F(\mathrm{w}^{*})\leq\frac{5}{2}\frac{D(L+\sigma\sqrt{d})}{\sqrt{n}}.

5 Privacy Proof

In this section, we establish the differential privacy of Algorithm 1. The privacy proof essentially follows the analysis of noisy SGD in Bassily et al. (2014), but is stated in full detail for completeness and to provide a self-contained presentation. The basic idea is as follows. We view Algorithm 1 as a variant of noisy stochastic gradient descent on the ERM problem over nn samples {zi}i=1n\{\mathrm{z}_{i}\}_{i=1}^{n}. Indeed, but for the step where we update the iterate only with noise and do not compute a gradient, the proposed algorithm would be exactly equivalent to noisy SGD. Prior work shows that using noisy SGD to solve the ERM problem enjoys stronger differential privacy guarantees due to privacy amplification via subsampling. Intuitively, Algorithm 1 should also benefit from privacy amplification, and the algorithm should not suffer privacy loss in steps where we use the zero gradient. We formalize this intuition using the standard analysis for Rényi differential privacy (Wang et al., 2019) and properties of the Rényi divergence (Mironov, 2017).

We first introduce additional notation. Let i:𝒲×[n]×S𝒲×[n]{\mathcal{M}}_{i}:{\mathcal{W}}\times[n]^{*}\times S\rightarrow{\mathcal{W}}\times[n]^{*} be a function which describes the ithi^{\text{th}} iteration of the algorithm – it takes as input an iterate wi𝒲\mathrm{w}_{i}\in{\mathcal{W}}, a set, Fi[n]F_{i}\subseteq[n]^{*}, of indices of data points, and a subset SiSS_{i}\subseteq S; it outputs an iterate wi+1𝒲\mathrm{w}_{i+1}\in{\mathcal{W}} and a set of indices Fi+1[n]F_{i+1}\subseteq[n]^{*}. We assume that SS is totally ordered according to the indices of the datapoints. Further, let 𝔖i\mathfrak{S}_{i} denote the set of indices of datapoints in SiS_{i}. Let i1{\mathcal{M}}_{i}^{1} and i2{\mathcal{M}}_{i}^{2} denote the first and second outputs of i{\mathcal{M}}_{i}, i.e., i1(wi,Fi,S)=wi+1{\mathcal{M}}_{i}^{1}(\mathrm{w}_{i},F_{i},S)=\mathrm{w}_{i+1} and i2(wi,Fi,S)=Fi+1{\mathcal{M}}_{i}^{2}(\mathrm{w}_{i},F_{i},S)=F_{i+1}. Note that in the algorithm, the i{\mathcal{M}}_{i}’s are composed – we initialize F1=F_{1}=\emptyset and w1\mathrm{w}_{1} is fixed, therefore, 1(w1,,S)=(w2,F2){\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S)=(\mathrm{w}_{2},F_{2}). 2{\mathcal{M}}_{2} acts on the output of 1(w1,,S){\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S) as 2((1(w1,,S)),S)=(w3,F3){\mathcal{M}}_{2}(({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S)),S)=(\mathrm{w}_{3},F_{3}). In general, we have that

\[
{\mathcal{M}}_{i}({\mathcal{M}}_{i-1}(\cdots({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S))\cdots),S)=(\mathrm{w}_{i+1},F_{i+1}).
\]

Formally, given $\mathrm{w}\in\mathcal{W}$, $m\in\mathbb{N}$, and $F\in[n]^{*}$, we define the random variable ${\mathcal{M}}_{i}(\mathrm{w},F,S):{\mathbb{R}}^{d}\times[n]^{m}\rightarrow{\mathcal{W}}\times[n]^{*}$ as

\[
\mathcal{M}_{i}(\mathrm{w}_{i},F_{i},S)(\xi_{i},\mathfrak{S}_{i})=\big[\mathcal{P}\left(\mathrm{w}_{i}-\eta_{i}(\mathrm{g}_{i}+\xi_{i})\right),\,F_{i}\cup\mathfrak{S}_{i}\big],
\]

where $\mathrm{g}_{i}$ is the following gradient operator:

\[
\mathrm{g}_{i}=\begin{cases}\frac{1}{|\mathfrak{S}_{i}\setminus F_{i}|}\sum_{j\in\mathfrak{S}_{i}\setminus F_{i}}\nabla f(\mathrm{w}_{i},\mathrm{z}_{j})&\text{if }|\mathfrak{S}_{i}\setminus F_{i}|>0,\\ 0&\text{if }|\mathfrak{S}_{i}\setminus F_{i}|=0.\end{cases}
\]

In Algorithm 1, $m=1$; in general, $m$ is the size of the set sub-sampled from $S$, which we use to construct a mini-batched gradient.
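For concreteness, a single iteration $\mathcal{M}_{i}$ can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `grad_f`, `project`, and the parameter names are assumptions.

```python
import numpy as np

def mechanism_step(w, F, S, grad_f, project, eta, sigma, rng, m=1):
    """One iteration M_i: sub-sample m indices, average gradients over
    previously unseen points (or use the zero gradient if there are none),
    add Gaussian noise, and project back onto the feasible set."""
    n = len(S)
    sampled = set(int(j) for j in rng.choice(n, size=m, replace=False))
    fresh = sampled - F                  # indices not seen in earlier rounds
    if fresh:
        g = np.mean([grad_f(w, S[j]) for j in fresh], axis=0)
    else:
        g = np.zeros_like(w)             # noise-only step
    xi = rng.normal(0.0, sigma, size=w.shape)
    return project(w - eta * (g + xi)), F | sampled
```

Composing these steps, starting from $F_{1}=\emptyset$, produces the sequence of iterates analyzed below.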

The input space of $\mathcal{M}_{i}(\mathrm{w}_{i},F_{i},S)$, which is ${\mathbb{R}}^{d}\times[n]^{m}$, is endowed with the measure ${\mathcal{N}}(0,\sigma^{2}{\mathbb{I}})\times\text{Unif}([n],m)$, where $\text{Unif}([n],m)$ is the sub-sampling measure that draws $m$ out of $n$ data points uniformly at random. We now construct a vector concatenating the first outputs of all the ${\mathcal{M}}_{i}$'s. Let

\begin{align*}
\mathbf{M}_{\tau}(\mathrm{w}_{1},S) &=[\mathrm{w}_{1},\mathrm{w}_{2},\ldots,\mathrm{w}_{\tau}]\\
&=[\mathrm{w}_{1},\,{\mathcal{M}}_{1}^{1}(\mathrm{w}_{1},\emptyset,S),\,{\mathcal{M}}_{2}^{1}({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S),S),\,\ldots,\,{\mathcal{M}}_{\tau}^{1}({\mathcal{M}}_{\tau-1}(\cdots({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S))\cdots),S)].
\end{align*}

We claim that $\mathbf{M}_{\tau}(\mathrm{w}_{1},S)$ is differentially private.

Theorem 5.

Suppose we run Algorithm 1 with noise sampled from $\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{d})$, where $\sigma=\frac{L\sqrt{3\log(1/\delta)}}{\tilde{\epsilon}}$. For a fixed $\tau$ and $\tilde{\epsilon}\leq 1.256$, $\mathbf{M}_{\tau}(\mathrm{w}_{1},S)$ is $\left(\frac{2\tilde{\epsilon}\sqrt{2\tau\log(1/\delta^{\prime})}}{|S|}+\frac{4\tau\tilde{\epsilon}^{2}}{|S|^{2}},\,\frac{\tau}{|S|}\delta+\delta^{\prime}\right)$-DP.

To prove the theorem, we first show that the first coordinate of each $\mathcal{M}_{i}$ is sufficiently differentially private.

Lemma 2.

Let $\sigma=\frac{L\sqrt{3\log(1/\delta)}}{\tilde{\epsilon}}$. For any $\mathrm{w},F$ and any $i$, the mechanism $\mathcal{M}^{1}_{i}(\mathrm{w},F,S)$ is $\left(\frac{m}{|S|}(e^{\tilde{\epsilon}}-1),\frac{m}{|S|}\delta\right)$-DP.

Proof.

Consider two neighboring datasets $S$ and $S^{\prime}$, and let $\rho$ be the index at which they differ. Let $p=m/|S|$. Denote the random subsamples from $S$ and $S^{\prime}$ by $\tilde{S}$ and $\tilde{S}^{\prime}$, respectively. Under uniform random sampling we have $\tilde{\mathfrak{S}}=\tilde{\mathfrak{S}}^{\prime}$; this holds because the sampling of indices depends only on the common size $n$ of $S$ and $S^{\prime}$. The proof follows the standard privacy amplification by sub-sampling argument (see https://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf). For fixed $\mathrm{w}$ and $F$, and any measurable (with respect to the Lebesgue measure on $\mathbb{R}^{d}$) $\mathcal{E}\subseteq\mathcal{W}$, define the measures

\begin{align*}
P(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\,|\,\rho\in\tilde{\mathfrak{S}},\rho\not\in F\right]\\
P^{\prime}(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\,|\,\rho\in\tilde{\mathfrak{S}},\rho\not\in F\right]\\
Q(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\,|\,\rho\in F\right]\\
Q^{\prime}(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\,|\,\rho\in F\right]\\
R(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\,|\,\rho\not\in\tilde{\mathfrak{S}},\rho\not\in F\right]\\
R^{\prime}(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\,|\,\rho\not\in\tilde{\mathfrak{S}},\rho\not\in F\right].
\end{align*}

First, we note that $Q(\mathcal{E})=Q^{\prime}(\mathcal{E})$ because the gradient step is restricted to points indexed by $\tilde{\mathfrak{S}}\setminus F$ and the differing index $\rho$ does not belong to that set. We also have $R(\mathcal{E})=R^{\prime}(\mathcal{E})$ because in this case $\rho$ is not among the subsampled points. We consider two cases: $\rho\in F$ and $\rho\not\in F$. In the first case, $\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in{\mathcal{E}}\right]=Q({\mathcal{E}})=Q^{\prime}({\mathcal{E}})=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in{\mathcal{E}}\right]$, and we have perfect $(0,0)$-DP. In the other case, we have

\begin{align*}
\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\right]-p\delta &=pP(\mathcal{E})+(1-p)R(\mathcal{E})-p\delta\\
&\leq p\,e^{\tilde{\epsilon}}\min\{P^{\prime}(\mathcal{E}),R(\mathcal{E})\}+p\delta+(1-p)R(\mathcal{E})-p\delta\\
&=p\left(\min\{P^{\prime}(\mathcal{E}),R(\mathcal{E})\}+(e^{\tilde{\epsilon}}-1)\min\{P^{\prime}(\mathcal{E}),R(\mathcal{E})\}\right)+(1-p)R(\mathcal{E})\\
&\leq p\left(P^{\prime}(\mathcal{E})+(e^{\tilde{\epsilon}}-1)(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\right)+(1-p)R(\mathcal{E})\\
&=pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E})+p(e^{\tilde{\epsilon}}-1)(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\\
&=(1+p(e^{\tilde{\epsilon}}-1))(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\\
&\leq e^{p(e^{\tilde{\epsilon}}-1)}(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\\
&=e^{p(e^{\tilde{\epsilon}}-1)}\,\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\right],
\end{align*}

where in the first inequality we used the following DP guarantee of $\mathcal{M}^{1}_{i}(\mathrm{w},F,S)$: for any subsampled sets $\tilde{S}\subseteq S$ and $\tilde{S}^{\prime}\subseteq S^{\prime}$, the mechanism $\mathcal{M}^{1}_{i}$ is $(\tilde{\epsilon},\delta)$-DP, as it is a step of noisy projected gradient descent with gradients bounded in norm by $L$ and Gaussian noise of sufficient variance. We use this DP guarantee twice: once on the neighboring datasets $S$ and $S^{\prime}$ to get the inequality $P(\mathcal{E})\leq e^{\tilde{\epsilon}}P^{\prime}(\mathcal{E})+\delta$, and once on the neighboring dataset $S\setminus\{\mathrm{z}_{\rho}\}$ to get the inequality $P(\mathcal{E})\leq e^{\tilde{\epsilon}}R(\mathcal{E})+\delta$. In the second application, note that $R(\mathcal{E})$ covers events in which a previously seen point is sampled; the noise-only step ensures that the output is then also a Gaussian with the same variance. In the second inequality we use the fact that a convex combination of two numbers is at least their minimum, and in the last inequality we use the standard relation $1+x\leq e^{x}$. Combining the two cases finishes the proof. ∎
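The amplification at the heart of Lemma 2 can also be checked numerically. The standard sub-sampling result states that running an $(\tilde{\epsilon},\delta)$-DP mechanism on a $p$-fraction sub-sample yields $(\log(1+p(e^{\tilde{\epsilon}}-1)),\,p\delta)$-DP, and $\log(1+x)\leq x$ recovers the bound $p(e^{\tilde{\epsilon}}-1)$ used above. A minimal sketch (hypothetical function name):

```python
import math

def amplified_eps(eps, p):
    """Return the exact sub-sampled epsilon and the upper bound of Lemma 2
    for a base (eps, delta)-DP mechanism run on a p-fraction sub-sample."""
    exact = math.log(1 + p * (math.exp(eps) - 1))
    bound = p * (math.exp(eps) - 1)
    return exact, bound
```

For small $p$ the two quantities are nearly identical; with $m=1$ and $|S|=n$, the per-step privacy cost therefore scales as $1/n$.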

We can now prove Theorem 5.

Proof of Theorem 5.

The proof essentially follows that of Theorem 3.3 of Dwork et al. (2010). Let $\epsilon_{0}=\frac{1}{|S|}(e^{\tilde{\epsilon}}-1)$, $\delta_{0}=\frac{1}{|S|}\delta$, $\epsilon_{1}=e^{\epsilon_{0}}-1$, and $\epsilon^{\prime}=\epsilon_{0}\sqrt{2\tau\log(1/\delta^{\prime})}+\tau\epsilon_{0}\epsilon_{1}$. Condition on the event that throughout the $\tau$ rounds, $\epsilon_{0}$-DP holds for all of the mechanisms $\mathcal{M}_{i}^{1}$. This event fails with probability at most $\tau\delta_{0}$, which will be accounted for in the final DP bound. Define the set of events on which $\epsilon^{\prime}$-DP does not hold as

\[
\mathcal{B}=\{\mathcal{E}\colon\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S)\in\mathcal{E}\right]>e^{\epsilon^{\prime}}\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S^{\prime})\in\mathcal{E}\right]\}.
\]

The goal is to show that $\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S)\in\mathcal{B}\right]\leq\delta^{\prime}$. Let $\mathbf{W}=\mathbf{M}_{\tau}(\mathrm{w}_{1},S)=[\mathrm{W}_{1},\ldots,\mathrm{W}_{\tau}]$, where $\mathrm{W}_{i}$ is the random variable taking values $\mathrm{w}_{i}$, and let $\mathbf{W}^{\prime}=\mathbf{M}_{\tau}(\mathrm{w}_{1},S^{\prime})$. Further, let $\mathbf{w}=[\mathrm{w}_{1},\ldots,\mathrm{w}_{\tau}]$ be a fixed realization of $\mathbf{W}$. We have

\begin{align*}
\log\left(\frac{\mathbb{P}\left[\mathbf{W}=\mathbf{w}\right]}{\mathbb{P}\left[\mathbf{W}^{\prime}=\mathbf{w}\right]}\right) &=\sum_{i=1}^{\tau}\log\left(\frac{\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}{\mathbb{P}\left[\mathrm{W}^{\prime}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}^{\prime}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}^{\prime}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}\right)\\
&\quad+\log\left(\frac{\mathbb{P}\left[Y_{1}=y_{1}\right]}{\mathbb{P}\left[Y_{1}=y_{1}\right]}\right).
\end{align*}

Consider the conditional probability $\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,Y_{i}=y_{i},\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2}\right]$. After conditioning on all the randomness so far in the algorithm, the event $\{\mathrm{W}_{i}=\mathrm{w}_{i}\}$ is exactly the event $\{\mathcal{M}^{1}_{i-1}(\mathrm{w}_{i-1},F_{i-1},S)=\mathrm{w}_{i}\}$, where $\mathrm{w}_{i-1}$ and $F_{i-1}$ are fixed. Lemma 2 now implies that

\[
\frac{\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}{\mathbb{P}\left[\mathrm{W}^{\prime}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}^{\prime}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}^{\prime}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}\leq e^{\epsilon_{0}},
\]

where we recall that we conditioned on the event that each mechanism $\mathcal{M}_{i}^{1}$ is $\epsilon_{0}$-DP. Define the random variable $c_{i-1}(\mathrm{W}_{2},\ldots,\mathrm{W}_{i},\mathrm{W}_{2}^{\prime},\ldots,\mathrm{W}_{i}^{\prime},Y_{1},\ldots,Y_{i-1})$, which takes values

\[
c_{i-1}(\mathrm{w}_{2},\ldots,\mathrm{w}_{i},y_{1},\ldots,y_{i-1})=\log\left(\frac{\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}{\mathbb{P}\left[\mathrm{W}^{\prime}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}^{\prime}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}^{\prime}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}\right).
\]

We have just shown that $c_{i-1}(\mathrm{w}_{2:i},\mathrm{w}_{2:i},y_{1:i-1})\leq\epsilon_{0}$ for any $\mathrm{w}_{2:i},y_{1:i-1}$. By switching $\mathrm{W}_{i}$ with $\mathrm{W}_{i}^{\prime}$ we can show in the exact same way that $-c_{i-1}(\mathrm{w}_{2:i},\mathrm{w}_{2:i},y_{1:i-1})\leq\epsilon_{0}$, which implies that the random variable $c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})$ is a.s. bounded by $\epsilon_{0}$. Further, using the fact that $\epsilon_{0}$-DP implies boundedness of the $\infty$-divergence, we also have

\[
\max\big\{D_{\infty}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime})),\,D_{\infty}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime})\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S))\big\}\leq\epsilon_{0}.
\]

Lemma 3.2 of Dwork et al. (2010) now implies that the KL-divergence between $\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)$ and $\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime})$ is bounded as $D_{KL}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime}))\leq\epsilon_{1}\epsilon_{0}$. Together with the definition of $c_{i-1}$, this implies

\[
\mathbb{E}_{\mathbf{W}_{i},\mathbf{W}_{i}^{\prime}}[c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})\,|\,\mathrm{w}_{2:i-1},y_{1:i-1}]=D_{KL}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime}))\leq\epsilon_{0}\epsilon_{1}.
\]

Because the filtration generated by $\mathrm{W}_{2:i-1},Y_{1:i-1}$ includes the one generated by $c_{1:i-1}$, we can apply Azuma's inequality to the martingale difference sequence $c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})-D_{KL}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime}))$, which is bounded since the KL term is at most $\epsilon_{0}\epsilon_{1}$ and the $c_{i}$'s are bounded by $\epsilon_{0}$, to get

\begin{align*}
\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S)\in\mathcal{B}\right] &=\mathbb{P}\left[\sum_{i=1}^{\tau}c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})>\epsilon^{\prime}\right]\\
&=\mathbb{P}\left[\sum_{i=1}^{\tau}c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})-\tau\epsilon_{0}\epsilon_{1}>\epsilon_{0}\sqrt{2\tau\log(1/\delta^{\prime})}\right]\\
&\leq\exp\left(-\frac{2\epsilon_{0}^{2}\tau\log(1/\delta^{\prime})}{2\tau\epsilon_{0}^{2}}\right)=\delta^{\prime}.
\end{align*}

Because $e^{\tilde{\epsilon}}-1\leq 2\tilde{\epsilon}$ for $\tilde{\epsilon}\leq 1.256$, we have $\epsilon_{0}\leq\frac{2\tilde{\epsilon}}{|S|}$. This in turn implies $\epsilon_{1}\leq 2\epsilon_{0}$, which shows that for a fixed $\tau$, Algorithm 1 is $(\frac{2\tilde{\epsilon}\sqrt{2\tau\log(1/\delta^{\prime})}}{|S|}+\frac{4\tau}{|S|^{2}}\tilde{\epsilon}^{2},\frac{\tau}{|S|}\delta+\delta^{\prime})$-DP. ∎
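To get a concrete feel for Theorem 5, the composed privacy parameters can be evaluated directly. The following is an illustrative helper mirroring the theorem's formulas (hypothetical names, not part of the paper):

```python
import math

def composed_privacy(eps_tilde, delta, delta_prime, tau, n):
    """Privacy parameters after tau adaptive rounds, per Theorem 5:
    eps   = 2*eps_tilde*sqrt(2*tau*log(1/delta'))/n + 4*tau*eps_tilde**2/n**2,
    delta = tau*delta/n + delta'."""
    eps = (2 * eps_tilde * math.sqrt(2 * tau * math.log(1 / delta_prime)) / n
           + 4 * tau * eps_tilde ** 2 / n ** 2)
    total_delta = tau * delta / n + delta_prime
    return eps, total_delta
```

With $\tilde{\epsilon}=\sqrt{n}\epsilon$ and $\tau\leq 2n$, as in Theorem 6, the first term becomes $4\epsilon\sqrt{\log(1/\delta^{\prime})}$, consistent with the guarantee stated there.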

Combining the utility and privacy guarantees yields the main result, stated below.

Theorem 6.

Let $f(\mathrm{w},\mathrm{z})$ be a convex $L$-Lipschitz function in its first argument, $\mathrm{w}\in{\mathcal{W}}$, for all $\mathrm{z}\in{\mathcal{Z}}$. Furthermore, assume that the diameter of ${\mathcal{W}}$ is bounded by $D$. For any $n\geq 16$, given $0<\epsilon\leq\frac{1}{2\sqrt{n}}$ and $\delta>0$, Algorithm 1 run with step size $\eta=\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})}$ and $\sigma=\frac{8L\sqrt{\log(1/\delta)}}{\sqrt{n}\epsilon}$ outputs $\widehat{\mathrm{w}}_{\tau}$ which satisfies $(4\epsilon(\sqrt{\log(1/\delta^{\prime})}+2),\,\delta+\delta^{\prime}+2\exp(-n/16))$-differential privacy, for any $\delta^{\prime}>0$, and its accuracy is bounded as

\[
\mathbb{E}\left[F(\widehat{\mathrm{w}}_{\tau})\right]-F(\mathrm{w}^{*})\leq\frac{5LD}{\sqrt{n}}+\frac{20LD\sqrt{d\log(1/\delta)}}{\epsilon n}.
\]
Proof.

Condition on the event that $\tau\leq 2n$. Using Theorem 5 with $\tilde{\epsilon}=\sqrt{n}\epsilon$ implies that Algorithm 1 is $(4\epsilon(\sqrt{\log(1/\delta^{\prime})}+2),\delta+\delta^{\prime})$-DP under the condition $\epsilon\leq\frac{1.256}{\sqrt{n}}$. The event $\tau>2n$ happens with probability at most $2\exp(-n/16)$ according to Theorem 2, which implies that Algorithm 1 is $(4\epsilon(\sqrt{\log(1/\delta^{\prime})}+2),\delta+\delta^{\prime}+2\exp(-n/16))$-DP. Finally, the convergence guarantee follows from Theorem 4. ∎

Remark 1.

To get $(\bar{\epsilon},\bar{\delta})$-DP for any $6\exp(-n/16)\leq\bar{\delta}\leq 3e^{-4}$, we can set $\delta^{\prime}=\delta=\frac{\bar{\delta}}{3}$ and $\epsilon=\frac{\bar{\epsilon}}{8\sqrt{\log(1/\delta^{\prime})}}$. Using the condition $\epsilon\leq\frac{1}{\sqrt{n}}$, we get that $\frac{\bar{\epsilon}}{\sqrt{\log(3/\bar{\delta})}}\leq\frac{8}{\sqrt{n}}$. The utility guarantee becomes $\mathbb{E}\left[F(\widehat{\mathrm{w}}_{\tau})\right]-F(\mathrm{w}^{*})\leq\frac{5LD}{\sqrt{n}}+\frac{160LD\sqrt{d}\log(3/\bar{\delta})}{\bar{\epsilon}n}$.
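The parameter mapping in the remark can be packaged as a small helper. This is a hypothetical sketch that mirrors the remark's settings (it assumes $\bar{\delta}$ lies in the stated range and does not check it):

```python
import math

def dp_parameters(eps_bar, delta_bar, n, L):
    """Map a target (eps_bar, delta_bar)-DP guarantee to the internal
    parameters of Remark 1: delta' = delta = delta_bar/3,
    eps = eps_bar / (8*sqrt(log(1/delta'))), and the noise scale sigma
    from Theorem 6."""
    delta = delta_bar / 3
    delta_prime = delta_bar / 3
    eps = eps_bar / (8 * math.sqrt(math.log(1 / delta_prime)))
    sigma = 8 * L * math.sqrt(math.log(1 / delta)) / (math.sqrt(n) * eps)
    return eps, delta, delta_prime, sigma
```

Whenever $\bar{\epsilon}/\sqrt{\log(3/\bar{\delta})}\leq 8/\sqrt{n}$, the returned $\epsilon$ satisfies the condition $\epsilon\leq 1/\sqrt{n}$ required by the analysis.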

6 Other Approaches

We now briefly discuss a few other approaches that would also recover a convergence rate of $\tilde{O}(1/\sqrt{n}+\sqrt{d}/(\epsilon n))$ for $\epsilon\leq O(1/\sqrt{n})$. The first approach uses the standard decomposition of excess population risk into the generalization gap plus the excess empirical risk. As pointed out in Bassily et al. (2019); Feldman et al. (2020), it is possible to use the optimal rates for ERM from Bassily et al. (2014) together with the generalization properties of DP from Dwork et al. (2015) to bound the generalization gap, yielding a rate of $O\left(\max\left(d^{1/4}/\sqrt{n},\sqrt{d}/(n\epsilon)\right)\right)$, which evaluates to $\sqrt{d/n}$ in the regime $\epsilon\leq O(1/\sqrt{n})$. Even though the runtime of noisy SGD for ERM stated in Bassily et al. (2014) is $O(n^{2})$, it can be shown that their result holds with only $O\left(\frac{(n\epsilon)^{2}}{\sqrt{d}}\right)$ stochastic gradient computations; therefore, in the regime $\epsilon\lesssim\frac{1}{\sqrt{n}}$, it is a linear-time algorithm.

Second, it may be possible to use amplification by subsampling in Algorithm 1 of Feldman et al. (2020) instead of amplification by iteration. This would remove the smoothness requirement on $f(\cdot,\mathrm{z})$ and perhaps yield the same result as Theorem 6. We note that our Algorithm 1 is much simpler and does not require the complicated mini-batch schedule of Algorithm 1 of Feldman et al. (2020).

7 Conclusion

In this paper, we studied the problem of private stochastic convex optimization for non-smooth objectives. We proposed a noisy version of the OSMD algorithm and presented convergence and privacy guarantees for the $\ell_{2}$ geometry. Algorithm 1 achieves statistically optimal rates, runs in linear time, and is differentially private as long as the DP parameter is sufficiently small. Algorithm 1 can easily be extended to geometries induced by arbitrary strictly convex potentials $\Phi$. We leave it as future work to explore what privacy guarantees can be retained when the privacy mechanism is tailored to the geometry induced by $\Phi$. Finally, we believe it should be possible to extend our analysis to the case where $f(\cdot,\mathrm{z}_{i})$ is strongly convex for all $i\in[n]$ and achieve optimal statistical rates in linear time.

References

  • Agarwal et al. (2009) Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
  • Bassily et al. (2014) Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
  • Bassily et al. (2019) Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.
  • Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  • Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 2011.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • Dwork et al. (2010) Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and Differential Privacy. In FOCS, pages 51–60, 2010.
  • Dwork et al. (2015) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126. ACM, 2015.
  • Feldman et al. (2018) Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521–532, 2018.
  • Feldman et al. (2020) Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
  • Kakade et al. (2009) Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2(1), 2009.
  • Mironov (2017) Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
  • Narayanan and Shmatikov (2008) Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.
  • Sweeney (1997) Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.
  • Vapnik (2013) Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
  • Wang et al. (2019) Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.

Appendix A Proof of Theorem 2

Proof.

We begin by fixing a time horizon $T$ and showing that, with high probability, $\tau\leq T$ for appropriately chosen $T$. Let $q:[n]^{T}\rightarrow\mathbb{N}$ be the function which counts the unique draws among $(Y_{1},\ldots,Y_{T})$. Further, let $Z_{i}$ be the indicator random variable of the event that the $i$-th datapoint is drawn at least once in the $T$ rounds, i.e., $\exists j\in[T]:Y_{j}=i$. We can write the expectation of $q$ as

\begin{align*}
\mathbb{E}[q(Y_{1},\ldots,Y_{T})] &=\mathbb{E}\sum_{i=1}^{n}Z_{i}=\sum_{i=1}^{n}\mathbb{E}[Z_{i}]=\sum_{i=1}^{n}\mathbb{P}\left[\exists j\in[T]:Y_{j}=i\right]\\
&=\sum_{i=1}^{n}\left(1-\mathbb{P}\left[\forall j\in[T],Y_{j}\neq i\right]\right)=\sum_{i=1}^{n}\left(1-\left(1-1/n\right)^{T}\right)\geq n-n\exp\left(-T/n\right).
\end{align*}

Setting $T\geq n$ already shows that the expected number of unique elements is at least $n/2$, so the stopping condition is reached in expectation. Next, we show concentration of $q$ around its expectation. Note that if a single element $y_{i}$ is changed to $y_{i}^{\prime}$, the value of $q$ changes by at most $1$, i.e.,
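The lower bound $n-ne^{-T/n}$ follows from $1-x\leq e^{-x}$; it can be checked numerically against the exact expectation $n(1-(1-1/n)^{T})$ (an illustrative sketch, not part of the proof):

```python
import math

def expected_unique(n, T):
    """Exact expected number of unique indices among T uniform draws from [n]."""
    return n * (1 - (1 - 1 / n) ** T)
```

Already at $T=n$, the expectation exceeds $n(1-e^{-1})\approx 0.63n>n/2$.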

\[
|q(y_{1},\ldots,y_{T})-q(y_{1},\ldots,y_{i-1},y_{i}^{\prime},y_{i+1},\ldots,y_{T})|\leq 1.
\]

This allows us to use McDiarmid’s inequality to claim that

\[
\mathbb{P}\left[q(Y_{1},\ldots,Y_{T})-\mathbb{E}[q(Y_{1},\ldots,Y_{T})]<-\sqrt{\log(2/\delta)T/2}\right]\leq\delta.
\]

This implies that with probability at least 1δ1-\delta we have

\[
q(Y_{1},\ldots,Y_{T})\geq n-n\exp\left(-T/n\right)-\sqrt{\log(2/\delta)T/2}.
\]

Setting $T=2n$ and $\delta=2\exp(-n/16)$, we have that for any $n\geq 16$, with probability at least $1-2\exp(-n/16)$,

\[
q(Y_{1},\ldots,Y_{T})\geq n-ne^{-2}-\sqrt{n\times n/16}\geq n/2.
\]

To get the bound in expectation, we note that $\mathbb{P}\left[\tau\geq T\right]=\mathbb{P}\left[q(Y_{1},\ldots,Y_{T})\leq n/2\right]$. On the other hand, we know from the above derivation that for $T\geq 2n$ it holds that $\mathbb{P}\left[q(Y_{1},\ldots,Y_{T})\leq n/2\right]\leq 2e^{-T/16}$. This implies

\[
\mathbb{E}[\tau]\leq 2n\,\mathbb{P}\left[\tau\leq 2n\right]+\int_{2n}^{\infty}\mathbb{P}\left[\tau\geq t\right]dt\leq 2n+\int_{2n}^{\infty}2\exp\left(-t/16\right)dt\leq O(n+ne^{-n/16}).
\]
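As a sanity check on this bound (a hypothetical simulation, not part of the argument), one can draw uniform indices until half of the $n$ datapoints have been seen and inspect the empirical stopping time:

```python
import numpy as np

def stopping_time(n, rng):
    """Draw uniform indices from [n] until at least n/2 unique ones
    are seen; return the number of draws (the stopping time tau)."""
    seen, t = set(), 0
    while len(seen) < n / 2:
        t += 1
        seen.add(int(rng.integers(n)))
    return t
```

Empirically, the mean is close to $n\ln 2\approx 0.69n$, consistent with $\mathbb{E}[\tau]=O(n)$ and with $\tau\leq 2n$ holding with overwhelming probability.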