
Private Stochastic Convex Optimization:
Efficient Algorithms for Non-smooth Objectives

Raman Arora
[email protected]
   Teodor V. Marinov
[email protected]
   Enayat Ullah
[email protected]
   Johns Hopkins University
Abstract

In this paper, we revisit the problem of private stochastic convex optimization. We propose an algorithm based on noisy mirror descent, which achieves optimal rates both in terms of statistical complexity and number of queries to a first-order stochastic oracle in the regime when the privacy parameter is inversely proportional to the number of samples.

1 Introduction

Modern machine learning systems often leverage data that are generated ubiquitously and seamlessly through devices such as smartphones, cameras, microphones, or user’s weblogs, transaction logs, social media, etc. Much of this data is private, and releasing models trained on such data without serious privacy considerations can reveal sensitive information (Narayanan and Shmatikov, 2008; Sweeney, 1997). Consequently, much emphasis has been placed in recent years on machine learning under the constraints of a robust privacy guarantee. One such notion that has emerged as a de facto standard is that of differential privacy.

Informally, differential privacy provides a quantitative assessment of how different the outputs of a randomized algorithm are when it is fed two very similar inputs. If small changes in the input do not manifest as drastically different outputs, then it is hard to discern much information about the inputs solely based on the outputs of the algorithm. In the context of machine learning, this implies that if the learning algorithm is not overly sensitive to any single datum in the training set, then releasing the trained model should preserve the privacy of the training data. This requirement, a priori, seems compatible with the goal of learning, which is to find a model that generalizes well on the population and does not overfit to the given training sample. It seems reasonable then to argue that privacy is not necessarily at odds with generalization, especially when large training sets are available.

We take the following stochastic optimization view of machine learning, where the goal is to find a predictor that minimizes the expected loss (aka risk)

minw𝒲F(w)=𝔼z𝒟[f(w,z)],\min_{\mathrm{w}\in{\mathcal{W}}}F(\mathrm{w})=\mathbb{E}_{\mathrm{z}\sim\mathcal{D}}[f(\mathrm{w},\mathrm{z})], (1)

based on i.i.d. samples from the source distribution 𝒟\mathcal{D}, and full knowledge of the instantaneous objective function f(,)f(\cdot,\cdot) and the hypothesis class 𝒲{\mathcal{W}}. We are particularly interested in convex learning problems where the hypothesis class is a convex set and the loss function f(,z)f(\cdot,\mathrm{z}) is a convex function in the first argument for all z𝒵\mathrm{z}\in{\mathcal{Z}}. We seek a learning algorithm that uses the smallest possible number of samples and the least runtime and returns w~\tilde{\mathrm{w}} such that F(w~)infw𝒲F(w)+αF(\tilde{\mathrm{w}})\leq\inf_{\mathrm{w}\in{\mathcal{W}}}F(\mathrm{w})+\alpha, for a user specified α>0\alpha>0, while guaranteeing differential privacy (see Section 2.2 for a formal definition).

A natural approach to solving Problem 1 is sample average approximation (SAA), or empirical risk minimization (ERM), where we instead minimize an empirical approximation of the objective based on the i.i.d. sample. Empirical risk minimization for convex learning problems has been studied in the context of differential privacy by several researchers including Bassily et al. (2019) who give statistically efficient algorithms with oracle complexity matching that of optimal non-private ERM.

An alternative approach to solving Problem 1 is stochastic approximation (SA), wherein rather than form an approximation of the objective, the goal is to directly minimize the true risk. The learning algorithm is an iterative algorithm that processes a single sample from the population in each iteration to perform an update. Stochastic gradient descent (SGD), for instance, is a classic SA algorithm. Recent work of Feldman et al. (2020) gives optimal rates for convex learning problems (Problem 1) using stochastic approximation for smooth loss functions; however, they leave open the question of optimal rates for non-smooth convex learning problems, a large class that includes, for example, support vector machines. In this work, we focus on non-smooth convex learning problems and propose a simple algorithm which achieves nearly optimal rates in the special case where the privacy guarantee scales inversely with the number of samples. There are alternative approaches (which we became aware of in a private communication with Raef Bassily); for further discussion we refer the reader to Section 6.

2 Notation and Preliminaries

We consider the general learning setup of Vapnik (2013). Let 𝒵{\mathcal{Z}} be the sample space and let 𝒟{\mathcal{D}} be an unknown distribution over 𝒵{\mathcal{Z}}. We are given a sample z1,z2,,zn\mathrm{z}_{1},\mathrm{z}_{2},\ldots,\mathrm{z}_{n} drawn independently and identically (i.i.d.) from 𝒟{\mathcal{D}} – the samples zi\mathrm{z}_{i} may correspond to (feature, label) tuples as in supervised learning, or to features in unsupervised learning. We assume that the loss f:𝒲×𝒵Rf:{\mathcal{W}}\times{\mathcal{Z}}\rightarrow\mathrm{R} is a convex function in its first argument w\mathrm{w} and the hypothesis set 𝒲{\mathcal{W}} is a convex set. Given n samples, the goal is to minimize the population risk (Problem 1).

We assume that the hypothesis class 𝒲\mathcal{W} is a convex, closed and bounded set in Rd\mathrm{R}^{d} equipped with norm \left\|{\cdot}\right\|, such that wD\|\mathrm{w}\|\leq D for all w𝒲\mathrm{w}\in{\mathcal{W}}. We recall that the dual space of (Rd,)(\mathrm{R}^{d},\left\|{\cdot}\right\|) is the set of all linear functionals over it; the dual norm, denoted by \left\|{\cdot}\right\|_{*}, is defined as h=supw1h(w)\left\|{h}\right\|_{*}=\sup_{\left\|{w}\right\|\leq 1}h(\mathrm{w}), where hh is an element of the dual space (Rd,)(\mathrm{R}^{d},\left\|{\cdot}\right\|_{*}). In general, we will use \|\cdot\| to denote the 2\ell_{2} norm when there is no risk of confusion.

We do not assume that f(w,z)f(\mathrm{w},\mathrm{z}) is necessarily differentiable and will denote an arbitrary sub-gradient as f(w,z)\nabla f(\mathrm{w},\mathrm{z}). We assume that f(,z)f(\cdot,\mathrm{z}) is LL-Lipschitz with respect to the norm \left\|{\cdot}\right\|, i.e., |f(w1,z)f(w2,z)|Lw1w2\left|{f(\mathrm{w}_{1},\mathrm{z})-f(\mathrm{w}_{2},\mathrm{z})}\right|\leq L\left\|{\mathrm{w}_{1}-\mathrm{w}_{2}}\right\| for all z\mathrm{z}. For convex ff, this implies that sub-gradients with respect to w\mathrm{w} are bounded in the dual norm as f(w,z)L,w𝒲,z𝒵\left\|{\nabla f(\mathrm{w},\mathrm{z})}\right\|_{*}\leq L,\ \forall\ \mathrm{w}\in\mathcal{W},\mathrm{z}\in\mathcal{Z}. A popular instance of the above is the p/q\ell_{p}/\ell_{q} setup, which considers the p\ell_{p} norm in the primal space and the corresponding q\ell_{q} norm in the dual space such that 1p+1q=1\frac{1}{p}+\frac{1}{q}=1. A function g()g(\cdot) is β\beta-strongly smooth (or just β\beta-smooth) if g(w1)g(w2)βw1w2,w1,w2𝒲\left\|{\nabla g(\mathrm{w}_{1})-\nabla g(\mathrm{w}_{2})}\right\|_{*}\leq\beta\left\|{\mathrm{w}_{1}-\mathrm{w}_{2}}\right\|,\forall\mathrm{w}_{1},\mathrm{w}_{2}\in{\mathcal{W}}. For convex functions, this is equivalent to the condition g(w2)g(w1)+g(w1),w2w1+β2w1w22,w1,w2𝒲g(\mathrm{w}_{2})\leq g(\mathrm{w}_{1})+\langle\nabla g(\mathrm{w}_{1}),\mathrm{w}_{2}-\mathrm{w}_{1}\rangle+\frac{\beta}{2}\|\mathrm{w}_{1}-\mathrm{w}_{2}\|^{2},\forall\mathrm{w}_{1},\mathrm{w}_{2}\in\mathcal{W}. A function gg is λ\lambda-strongly convex if g(w2)g(w1)+g(w1),w2w1+λ2w1w22,w1,w2𝒲g(\mathrm{w}_{2})\geq g(\mathrm{w}_{1})+\langle\nabla g(\mathrm{w}_{1}),\mathrm{w}_{2}-\mathrm{w}_{1}\rangle+\frac{\lambda}{2}\left\|{\mathrm{w}_{1}-\mathrm{w}_{2}}\right\|^{2},\forall\mathrm{w}_{1},\mathrm{w}_{2}\in{\mathcal{W}}. Finally, we use O~()\tilde{O}(\cdot) to suppress poly-logarithmic factors in the complexity.

2.1 Mirror descent and potential functions

We recall some basics of convex duality, which will help us set up the mirror descent algorithm and analysis. For a convex function Φ:d\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}, we define its conjugate Φ:d{}\Phi^{*}:\mathbb{R}^{d}\rightarrow\mathbb{R}\cup\{\infty\} as Φ(Y)=supXY,XΦ(X)\Phi^{*}(Y)=\sup_{X}\langle Y,X\rangle-\Phi(X). Moreover, DΦ(z||y)D_{\Phi}(\mathrm{z}||\mathrm{y}) denotes Bregman divergence induced by Φ\Phi, defined as

DΦ(z||y)=Φ(z)Φ(y)Φ(y),zy.\displaystyle D_{\Phi}(\mathrm{z}||\mathrm{y})=\Phi(\mathrm{z})-\Phi(\mathrm{y})-\langle\nabla\Phi(\mathrm{y}),\mathrm{z}-\mathrm{y}\rangle.

For a convex set 𝒲{\mathcal{W}}, we denote by I𝒲\mathrm{I}_{{\mathcal{W}}} the indicator function of the set 𝒲{\mathcal{W}}, I𝒲(z)=0 if z𝒲\mathrm{I}_{{\mathcal{W}}}(\mathrm{z})=0\ \text{ if }\ \mathrm{z}\in{\mathcal{W}}, and \infty otherwise. The following result from Kakade et al. (2009) establishes a natural relation between strong convexity and strong smoothness of a potential function and its conjugate, respectively.

Theorem 1 (Theorem 6 from Kakade et al. (2009)).

Assume Φ\Phi is a closed convex function. Then Φ\Phi is α\alpha-strongly convex with respect to a norm \|\cdot\| iff Φ\Phi^{*} is 1α\frac{1}{\alpha}-strongly smooth with respect to the dual norm \|\cdot\|_{*}.
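To make Theorem 1 concrete, the following sketch numerically checks the duality for the scaled squared Euclidean norm, whose conjugate has a closed form; the function names and the choice α=2 are ours and purely illustrative.

```python
import numpy as np

# Phi(x) = (alpha/2)||x||_2^2 is alpha-strongly convex w.r.t. ||.||_2, and its
# conjugate Phi*(y) = (1/(2*alpha))||y||_2^2 is (1/alpha)-smooth, as Theorem 1
# predicts. (alpha = 2 is an arbitrary illustrative choice.)
alpha = 2.0
rng = np.random.default_rng(0)

def phi(x):
    return 0.5 * alpha * np.dot(x, x)

def phi_star(y):
    return np.dot(y, y) / (2.0 * alpha)

def grad_phi_star(y):
    return y / alpha

for _ in range(200):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    # alpha-strong convexity of Phi: first-order lower bound with quadratic term
    assert phi(x2) >= phi(x1) + np.dot(alpha * x1, x2 - x1) \
        + 0.5 * alpha * np.dot(x2 - x1, x2 - x1) - 1e-9
    y1, y2 = rng.normal(size=3), rng.normal(size=3)
    # (1/alpha)-smoothness of Phi*: its gradient is (1/alpha)-Lipschitz
    assert np.linalg.norm(grad_phi_star(y1) - grad_phi_star(y2)) \
        <= np.linalg.norm(y1 - y2) / alpha + 1e-9
```

For this quadratic potential both inequalities in fact hold with equality, which makes it a convenient boundary case for the theorem.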

2.2 Differential privacy

We seek to design algorithms for solving the stochastic convex optimization problem (Problem 1) with the additional constraint that the algorithm’s output is differentially private.

Definition 1 ((ϵ,δ)(\epsilon,\delta)-differential privacy (Dwork et al., 2006)).

An algorithm 𝒜\mathcal{A} satisfies (ϵ,δ)(\epsilon,\delta)-differential privacy if given two datasets SS and SS^{\prime}, differing in only one data point, it satisfies that for any measurable event RRange(𝒜)R\in\text{Range}({\mathcal{A}})

(𝒜(S)R)eϵ(𝒜(S)R)+δ.\displaystyle\mathbb{P}(\mathcal{A}(S)\in R)\leq e^{\epsilon}\mathbb{P}(\mathcal{A}(S^{\prime})\in R)+\delta.
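Definition 1 can be sanity-checked on the classical Laplace mechanism, which is (ϵ,0)-differentially private for queries of sensitivity Δ when the noise scale is Δ/ϵ; the sketch below (illustrative values, not part of this paper's algorithm) bounds the output density ratio on a grid.

```python
import math

# A sensitivity-Delta query answered with Laplace noise of scale Delta/eps
# is (eps, 0)-DP: the density ratio at every output point is at most e^eps.
eps, Delta = 0.5, 1.0
b = Delta / eps                      # Laplace scale

def laplace_pdf(x, mu):
    return math.exp(-abs(x - mu) / b) / (2 * b)

# query values on two neighboring datasets, differing by at most Delta
q, q_prime = 3.0, 3.0 + Delta
worst_ratio = max(laplace_pdf(x / 10, q) / laplace_pdf(x / 10, q_prime)
                  for x in range(-100, 200))
assert worst_ratio <= math.exp(eps) + 1e-9
```

The triangle inequality gives |x−q′|−|x−q| ≤ Δ pointwise, so the ratio never exceeds exp(Δ/b) = exp(ϵ), which is exactly the δ = 0 case of the definition.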

3 Related Work

In convex learning and optimization, the following four classes of functions are widely studied: (a) LL-Lipschitz convex functions, (b) β\beta-smooth and convex functions, (c) λ\lambda-strongly convex functions, and (d) β\beta-smooth and λ\lambda-strongly convex functions. In the computational framework of a first-order stochastic oracle, algorithms with optimal oracle complexity for all these classes of functions have long been known (Agarwal et al., 2009). However, the landscape of known results changes with the additional constraint of privacy. We can trace two approaches to solving the private version of Problem 1. The first is private ERM (Chaudhuri et al., 2011; Bassily et al., 2014; Feldman et al., 2018; Bassily et al., 2019) and the second is private stochastic approximation (Feldman et al., 2020). Chaudhuri et al. (2011) began the study of private ERM by constructing algorithms based on output perturbation and objective perturbation. Under the assumption that the stochastic gradients are β\beta-Lipschitz continuous, the output perturbation bounds achieve excess population risk of O(LDmax(1/n,d/(nϵ),dβ/(n2/3ϵ)))O(LD\max(1/\sqrt{n},d/(n\epsilon),d\sqrt{\beta}/(n^{2/3}\epsilon))), where LL is the Lipschitz constant of the loss function and DD measures the diameter of the hypothesis class in the respective norm. The objective perturbation bounds have a similar form. Bassily et al. (2014) showed tight upper and lower bounds for the excess empirical risk. They also showed bounds for the excess population risk when the loss function does not have Lipschitz continuous gradients, achieving a rate of O(d1/4/(nϵ))O(d^{1/4}/(\sqrt{n}\epsilon)). Their techniques appeal to uniform convergence, i.e., bounding supw𝒲F(w)F^(w)\sup_{\mathrm{w}\in{\mathcal{W}}}F(\mathrm{w})-\widehat{F}(\mathrm{w}), and convert the guarantees on excess empirical risk into a bound on the excess population risk. These guarantees, however, were sub-optimal. Bassily et al. (2019) improved these to get optimal bounds on excess population risk, leveraging connections between algorithmic stability and generalization. The algorithms given by Bassily et al. (2019) are sample-efficient but their runtimes are superlinear (in the number of samples), whereas the non-private counterparts run in linear time. In a follow-up work, Feldman et al. (2020) improved the runtime of some of these algorithms without sacrificing statistical efficiency; however, the authors require that the stochastic gradients are Lipschitz continuous. Essentially, the statistical complexity of private stochastic convex optimization has been resolved, but some questions about computational efficiency still remain open. In the subsequent paragraphs, we discuss the different settings for the population loss, describe what is already known, and identify avenues for improvement.

Non-smooth Lipschitz Convex.

For the class of LL-Lipschitz convex functions, Bassily et al. (2014) improved upon Chaudhuri et al. (2011) and gave optimal bounds on excess empirical risk of O(dϵn)O\left(\frac{d}{\epsilon n}\right). They then appeal to uniform convergence to convert the guarantees on excess empirical risk to get an excess population risk of O(max(d1/4n,dϵn))O\left(\max\left(\frac{d^{1/4}}{\sqrt{n}},\frac{\sqrt{d}}{\epsilon n}\right)\right). This is sub-optimal and was very recently improved by Bassily et al. (2019) using connections between algorithmic stability and generalization to get O(max(1n,dϵn))O\left(\max\left(\frac{1}{\sqrt{n}},\frac{\sqrt{d}}{\epsilon n}\right)\right). This is optimal since 1n\frac{1}{\sqrt{n}} is the optimal excess risk without privacy constraints, and dϵn\frac{\sqrt{d}}{\epsilon n} is the optimal excess risk when the data distribution is the empirical distribution. This resolves the statistical complexity of private convex learning and shows that in various regimes of interest, e.g., when d=Θ(n)d=\Theta(n) and ϵ=Θ(1)\epsilon=\Theta(1), the constraint of privacy has no effect on utility. However, the algorithm proposed by Bassily et al. (2019) is based on Moreau smoothing and proximal methods, and requires O(n5)O(n^{5}) stochastic gradient computations. This rate vastly exceeds the gradient computations needed for non-private stochastic convex optimization which are of the order O(n)O(n). The computational complexity was improved by Feldman et al. (2020) to O(n2)O(n^{2}) by using a regularized ERM algorithm that runs in phases and after each phase, noise is added to the solution (output perturbation) and used as regularization for the next phase.

Smooth Lipschitz Convex.

For β\beta-smooth convex LL-Lipschitz functions, Bassily et al. (2019) give an algorithm with optimal bounds on excess risk of O(LDmax{1n,dϵn})O\left(LD\max\left\{\frac{1}{\sqrt{n}},\frac{\sqrt{d}}{\epsilon n}\right\}\right) with O(min{n3/2,n5/2d})O\left(\min\left\{n^{3/2},\frac{n^{5/2}}{d}\right\}\right) queries to the stochastic gradient oracle. This, again, was improved in a later work of Feldman et al. (2020) to O(n)O(n) stochastic gradient queries. Note that even for non-private stochastic optimization, O(n)O(n) stochastic gradient computations are necessary, so Feldman et al. (2020) achieve optimal statistical and oracle complexity for smooth Lipschitz convex functions. The algorithm they present is an instance of single-pass noisy SGD, and the guarantees hold for the last iterate.

(Smooth and Non-smooth) Lipschitz Strongly Convex.

For LL-Lipschitz λ\lambda-strongly convex functions, Bassily et al. (2014) gave an algorithm with optimal excess empirical risk of O~(dλn2ϵ2)\tilde{O}\left(\frac{d}{\lambda n^{2}\epsilon^{2}}\right). However, as in the non-strongly convex case, the corresponding excess population risk is O~(dλϵn)\tilde{O}\left(\frac{\sqrt{d}}{\lambda\epsilon n}\right), which is sub-optimal. For this case, Feldman et al. (2020) proposed an algorithm which achieves optimal rates in O(n)O(n) time for smooth functions but O(n2)O(n^{2}) for non-smooth functions.

4 Algorithm and Utility Analysis

In this section, we describe the proposed algorithm and provide convergence guarantees.

Recall that we are given a set of nn samples (z1,,zn)(\mathrm{z}_{1},\ldots,\mathrm{z}_{n}) drawn i.i.d. from 𝒟{\mathcal{D}}. We consider an iterative algorithm, wherein, at time tt, we sample index YtY_{t} uniformly at random from the set of indices, [n][n]. Note that YtY_{t} is a random variable; we denote the realization of YtY_{t} as yty_{t}. Through the run of the algorithm, we maintain a set, FtF_{t}, of indices observed until time tt, i.e., Ft={yττ<t}.F_{t}=\{y_{\tau}\mid\tau<t\}.

If ytFty_{t}\notin F_{t}, i.e., zyt\mathrm{z}_{y_{t}} is a fresh sample that has not been seen and processed prior to the ttht^{\textrm{th}} iteration, then we proceed with a projected gradient descent step using the noisy gradient f(wt,zyt)+ξt\nabla f(\mathrm{w}_{t},\mathrm{z}_{y_{t}})+\xi_{t}. If ytFty_{t}\in F_{t}, then the algorithm simply perturbs the current iterate, wt\mathrm{w}_{t}, and projects on to the set 𝒲{\mathcal{W}}. We remark that the noise-only step is important for the privacy analysis, as it allows for privacy amplification by sub-sampling.

The algorithm terminates when half of the points in the training set have been processed at least once, i.e., the size of FtF_{t} is greater than or equal to n/2n/2. We denote this stopping time by τ\tau. We refer the reader to Algorithm 1 for the pseudo-code.

Algorithm 1 Private SGD(w1,{ηt}t,{ξt}t,{zt}t\mathrm{w}_{1},\{\eta_{t}\}_{t},\{\xi_{t}\}_{t},\{\mathrm{z}_{t}\}_{t})
0:  Step size schedule {ηt}t\{\eta_{t}\}_{t}, noise sequence {ξt}t\{\xi_{t}\}_{t}, stream of data points {zt}t\{\mathrm{z}_{t}\}_{t}, initial iterate w1\mathrm{w}_{1}
1:  F1=F_{1}=\emptyset,τ=1\tau=1
2:  while |Fτ|n/2|F_{\tau}|\leq n/2 do
3:     Sample index yτy_{\tau} uniformly from [n][n]
4:     if {yτ}Fτ\{y_{\tau}\}\bigcap F_{\tau}\neq\emptyset then
5:        wτ+1=𝒫(wτητξτ)\mathrm{w}_{\tau+1}=\mathcal{P}(\mathrm{w}_{\tau}-\eta_{\tau}\xi_{\tau})
6:        Fτ+1=FτF_{\tau+1}=F_{\tau}
7:     else
8:        wτ+1=𝒫(wτητ(f(wτ,zyτ)+ξτ))\mathrm{w}_{\tau+1}=\mathcal{P}(\mathrm{w}_{\tau}-\eta_{\tau}(\nabla f(\mathrm{w}_{\tau},\mathrm{z}_{y_{\tau}})+\xi_{\tau}))
9:        Fτ+1=Fτ{yτ}F_{\tau+1}=F_{\tau}\bigcup\{y_{\tau}\}
10:     end if
11:     τ=τ+1\tau=\tau+1
12:  end while
Output: Average iterate w^τ=1|Fτ|jFτwj\widehat{\mathrm{w}}_{\tau}=\frac{1}{|F_{\tau}|}\sum_{j\in F_{\tau}}\mathrm{w}_{j}
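For concreteness, the following is a minimal Python sketch of Algorithm 1 under our reading of the pseudo-code: function and variable names are illustrative, and we average the iterates produced at fresh-sample steps.

```python
import numpy as np

def private_sgd(w1, eta, sigma, Z, D, grad_f, rng):
    """Sketch of Algorithm 1: projected noisy SGD where a repeated index
    triggers a noise-only step. Z is the n x d sample matrix, grad_f(w, z)
    returns a subgradient of f(., z), and iterates stay in the D-ball."""
    n = Z.shape[0]
    d = w1.shape[0]

    def project(w):                       # Euclidean projection onto {||w|| <= D}
        nrm = np.linalg.norm(w)
        return w if nrm <= D else w * (D / nrm)

    w, seen, iterates = w1, set(), []
    while len(seen) <= n / 2:             # stop once half the sample is processed
        y = rng.integers(n)               # sample an index uniformly from [n]
        xi = rng.normal(0.0, sigma, size=d)
        if y in seen:                     # noise-only step (enables amplification)
            w = project(w - eta * xi)
        else:                             # fresh sample: noisy subgradient step
            w = project(w - eta * (grad_f(w, Z[y]) + xi))
            seen.add(y)
            iterates.append(w)
    return np.mean(iterates, axis=0)      # average over fresh-sample steps
```

For instance, with the non-smooth loss f(w, z) = ‖w − z‖, whose subgradient is (w − z)/‖w − z‖, one would pass that map as `grad_f` and obtain the averaged iterate.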

Next, we establish the utility guarantee for Algorithm 1. We first show that τ\tau is finite almost surely and bounded by O(n)O(n) with probability 1O(exp(n))1-O(\operatorname{exp}\left(-n\right)). The proof follows a standard bins-and-balls argument.

Theorem 2.

For any n16n\geq 16, with probability 12exp(n/16)1-2\operatorname{exp}\left(-n/16\right), it holds that τ2n\tau\leq 2n, which implies 𝔼[τ]O(n)\mathbb{E}[\tau]\leq O(n).
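The bins-and-balls behaviour of the stopping time is easy to simulate; the sketch below (illustrative, not part of the analysis) draws indices with replacement until n/2 distinct ones have been seen.

```python
import random

def stopping_time(n, rng):
    """Simulate tau: draw uniformly from [n] with replacement until
    more than n/2 distinct indices have been seen."""
    seen, t = set(), 0
    while len(seen) <= n / 2:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
n = 200
taus = [stopping_time(n, rng) for _ in range(500)]
assert max(taus) <= 2 * n          # consistent with Theorem 2: tau <= 2n w.h.p.
assert sum(taus) / len(taus) < n   # empirically E[tau] is about n*ln(2)
```

The mean concentrates near n·ln 2, a coupon-collector calculation, so the 2n bound of Theorem 2 is comfortably loose in simulation.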

We proceed with bounding the regret of Algorithm 1. Given a sequence of convex functions f~t:𝒲\tilde{f}_{t}:\mathcal{W}\rightarrow\mathbb{R}, instantiated below in terms of the samples and the injected noise, we bound the regret

R¯(τ,u)=t=1τ𝔼[f~t(wt)]t=1τ𝔼[f~t(u)],\displaystyle\bar{R}(\tau,u)=\sum_{t=1}^{\tau}\mathbb{E}[\tilde{f}_{t}(\mathrm{w}_{t})]-\sum_{t=1}^{\tau}\mathbb{E}[\tilde{f}_{t}(\mathrm{u})],

where the expectation is with respect to any randomness in the algorithm and randomness of f~t\tilde{f}_{t}’s. We assume full access to the gradient of f~t\tilde{f}_{t} and that the diameter of 𝒲\mathcal{W} is DD, i.e., w,v𝒲,wvD\forall\mathrm{w},\mathrm{v}\in\mathcal{W},\|\mathrm{w}-\mathrm{v}\|\leq D.

Our setup fits the popular framework of Online Stochastic Mirror Descent (OSMD) algorithm, wherein, given a strictly convex potential Φ:d\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}, the updates are given as

w~t+1\displaystyle\tilde{\mathrm{w}}_{t+1} =Φ(Φ(wt)ηf~t(wt))\displaystyle=\nabla\Phi^{*}(\nabla\Phi(\mathrm{w}_{t})-\eta\nabla\tilde{f}_{t}(\mathrm{w}_{t})) (2)
wt+1\displaystyle\mathrm{w}_{t+1} =argminw𝒲DΦ(w||w~t+1).\displaystyle=\mathop{\arg\min}_{\mathrm{w}\in{\mathcal{W}}}D_{\Phi}(\mathrm{w}||\tilde{\mathrm{w}}_{t+1}).
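As a concrete instance of update (2): for the negative-entropy potential on the probability simplex, OSMD reduces to the multiplicative-weights update. The sketch below verifies this numerically; it is a standard fact, not specific to this paper.

```python
import numpy as np

# For Phi(w) = sum_i w_i log w_i on the simplex, grad Phi(w) = 1 + log w and
# grad Phi*(theta) = exp(theta - 1), while the Bregman (KL) projection onto
# the simplex is plain normalization. Update (2) therefore reduces to
#   w_{t+1,i} proportional to w_{t,i} * exp(-eta * g_i).
def osmd_entropy_step(w, g, eta):
    w_tilde = np.exp((1.0 + np.log(w)) - eta * g - 1.0)  # grad Phi*(grad Phi(w) - eta g)
    return w_tilde / w_tilde.sum()                        # KL projection onto simplex

w = np.full(4, 0.25)
g = np.array([1.0, 0.0, -1.0, 0.5])
eta = 0.1
step = osmd_entropy_step(w, g, eta)
direct = w * np.exp(-eta * g)
direct /= direct.sum()                                    # multiplicative weights
assert np.allclose(step, direct)
```

With Φ = ½‖·‖₂², the same two-step update instead collapses to the projected gradient step used in Algorithm 1.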

We utilize the following result.

Theorem 3 (Theorem 5.5 (Bubeck and Cesa-Bianchi, 2012)).

Let f1,,fτf_{1},\ldots,f_{\tau} be a sequence of functions from d\mathbb{R}^{d} to \mathbb{R}. Suppose f~t\nabla\tilde{f}_{t} is an unbiased estimator of ft\nabla f_{t}, i.e., 𝔼[f~t]=ft\mathbb{E}[\nabla\tilde{f}_{t}]=\nabla f_{t}. If one initializes OSMD 2 with w1=argminw𝒲Φ(w)\mathrm{w}_{1}=\mathop{\arg\min}_{\mathrm{w}\in{\mathcal{W}}}\Phi(\mathrm{w}), then the expected pseudo-regret R¯(τ)\bar{R}(\tau) is bounded as

R¯(τ,u)DΦ(u||w1)η+1ηt=1τ𝔼[DΦ(ft(wt)ηf~t(wt)||ft(wt))].\displaystyle\bar{R}(\tau,\mathrm{u})\leq\frac{D_{\Phi}(\mathrm{u}||\mathrm{w}_{1})}{\eta}+\frac{1}{\eta}\sum_{t=1}^{\tau}\mathbb{E}[D_{\Phi^{*}}(\nabla f_{t}(\mathrm{w}_{t})-\eta\nabla\tilde{f}_{t}(\mathrm{w}_{t})||\nabla f_{t}(\mathrm{w}_{t}))].

For any fixed τ<\tau<\infty, we can view Algorithm 1 as an instance of OSMD, with Φ1222\Phi\equiv\frac{1}{2}\|\cdot\|_{2}^{2}, and f~t()\tilde{f}_{t}(\cdot) defined as

f~t()={f(,zyt)+ξt,ytFtξt,otherwise,\displaystyle\tilde{f}_{t}(\cdot)=\begin{cases}f(\cdot,\mathrm{z}_{y_{t}})+\langle\xi_{t},\cdot\rangle&y_{t}\not\in F_{t}\\ \langle\xi_{t},\cdot\rangle&\text{otherwise,}\end{cases}

and ft(w)=𝔼ξt[f~t(w)]f_{t}(\mathrm{w})=\mathbb{E}_{\xi_{t}}[\tilde{f}_{t}(\mathrm{w})]. Theorem 1 implies that for any z,yd\mathrm{z},\mathrm{y}\in\mathbb{R}^{d}, DΦ(z||y)12αzy2D_{\Phi^{*}}(\mathrm{z}||\mathrm{y})\leq\frac{1}{2\alpha}\|\mathrm{z}-\mathrm{y}\|_{*}^{2}, where α\alpha is the strong-convexity parameter of the potential (in this case α=1\alpha=1) – using this result with Theorem 3, and taking expectation with respect to the randomness in τ\tau and ξt,t[τ]\xi_{t},t\in[\tau], we have that for any u𝒲\mathrm{u}\in{\mathcal{W}}

𝔼[t=1τft(wt)]𝔼[t=1τft(u)]\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(\mathrm{w}_{t})\right]-\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(u)\right] uw1222η+12η𝔼[t=1τη2f~t(wt)22]\displaystyle\leq\frac{\|\mathrm{u}-\mathrm{w}_{1}\|_{2}^{2}}{2\eta}+\frac{1}{2\eta}\mathbb{E}\left[\sum_{t=1}^{\tau}\eta^{2}\|\nabla\tilde{f}_{t}(\mathrm{w}_{t})\|_{2}^{2}\right] (3)
uw1222η+η𝔼[t=1τft(wt)22+ξt22].\displaystyle\leq\frac{\|\mathrm{u}-\mathrm{w}_{1}\|_{2}^{2}}{2\eta}+\eta\mathbb{E}\left[\sum_{t=1}^{\tau}\|\nabla f_{t}(\mathrm{w}_{t})\|_{2}^{2}+\|\xi_{t}\|_{2}^{2}\right].

Since ξt𝒩(0,σ2I)\xi_{t}\sim\mathcal{N}(0,\sigma^{2}\mathrm{I}), we have the following corollary.

Corollary 1.

Suppose the diameter of 𝒲{\mathcal{W}} is bounded by DD and that f(,zt)f(\cdot,\mathrm{z}_{t}) is an LL-Lipschitz function for all t[n]t\in[n]. Running Algorithm 1 for τ\tau iterations with noise sequence ξt𝒩(0,σ2I)\xi_{t}\sim\mathcal{N}(0,\sigma^{2}\mathrm{I}) and step size ηt=Dn(L+σd)\eta_{t}=\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})} guarantees that

𝔼[t=1τft(wt)]𝔼[t=1τft(u)]2D(L+σd)n.\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(\mathrm{w}_{t})\right]-\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(u)\right]\leq 2D(L+\sigma\sqrt{d})\sqrt{n}.
Proof.

Combine Equation 3, Wald's lemma, and the assumptions of the corollary to get

𝔼[t=1τft(wt)]𝔼[t=1τft(u)]D22η+η(L2+σ2d)𝔼[τ].\displaystyle\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(\mathrm{w}_{t})\right]-\mathbb{E}\left[\sum_{t=1}^{\tau}f_{t}(u)\right]\leq\frac{D^{2}}{2\eta}+\eta(L^{2}+\sigma^{2}d)\mathbb{E}[\tau].

Using the bound on 𝔼[τ]O(n)\mathbb{E}[\tau]\leq O(n) from Theorem 2 together with the step-size choice in the corollary claim finishes the proof. ∎
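To unpack the arithmetic of this last step (a sketch with loose constants; we use 𝔼[τ] ≤ 2n from Theorem 2 and the elementary bound L² + σ²d ≤ (L + σ√d)²):

\displaystyle\frac{D^{2}}{2\eta}+\eta(L^{2}+\sigma^{2}d)\,\mathbb{E}[\tau]\leq\frac{D\sqrt{n}(L+\sigma\sqrt{d})}{2}+\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})}\cdot(L+\sigma\sqrt{d})^{2}\cdot 2n=\frac{5}{2}D(L+\sigma\sqrt{d})\sqrt{n},

which, after normalizing by nn via Lemma 1, is consistent with the 52\frac{5}{2} constant appearing in Theorem 4.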

The result now follows using the Corollary above and the following lemma.

Lemma 1.

For the iterates {wt}t=1τ\{\mathrm{w}_{t}\}_{t=1}^{\tau} of Algorithm 1 it holds that 𝔼[tFτf(wt,zYt)]n𝔼[F(w^τ)]\mathbb{E}[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})]\geq n\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})], where w^τ\widehat{\mathrm{w}}_{\tau} is the output of the algorithm.

Proof.

Let St=|Ft|S_{t}=|F_{t}| be the random variable measuring the size of FtF_{t}. It then holds that τ\tau is a stopping time for the process {St}t=1\{S_{t}\}_{t=1}^{\infty}, i.e., τ=min{t:St>n/2}\tau=\min\{t:S_{t}>n/2\}. Further, for some TT, denote by 𝐲=(y1,,yT)\mathbf{y}=(y_{1},\ldots,y_{T}) the vector of sampled indices y1,,yTy_{1},\ldots,y_{T}. Let us focus on the term 𝔼[tFτf(wt,zYt)]\mathbb{E}[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})]. Let μ=𝒩(𝟎,σ2𝐈d)T×Unif[n]T\mu=\mathcal{N}(\boldsymbol{0},\sigma^{2}\mathbf{I}_{d})^{T}\times\text{Unif}[n]^{T} be the measure capturing all the randomness after TT iterations except for the randomness with respect to the data distribution 𝒟\mathcal{D}. Furthermore, let χ()\chi(\cdot) denote the indicator of the input event. We now show that 𝔼[tFτf(wt,zYt)]=𝔼[tFτF(wt)]\mathbb{E}[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})]=\mathbb{E}\left[\sum_{t\in F_{\tau}}F(\mathrm{w}_{t})\right], which essentially follows using the tower rule of expectation. We have the following

𝔼[tFτf(wt,zYt)]\displaystyle\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})\right] =𝔼[𝔼[tFτf(wt,zYt)|τ,Fτ]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{\mathrm{Y}_{t}})|\tau,F_{\tau}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼[χ(τ=T,FT=𝐲)tFTf(wt,zYt)[τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in F_{T}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼μ×𝒟T[χ(τ=T,FT=𝐲)t𝐲f(wt,zyt)[τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{\mu\times{\mathcal{D}}^{T}}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in\mathbf{y}}f(\mathrm{w}_{t},\mathrm{z}_{y_{t}})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼μ[χ(τ=T,FT=𝐲)𝔼𝒟T[t𝐲f(wt,zyt)][τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{\mu}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{{\mathcal{D}}^{T}}[\sum_{t\in\mathbf{y}}f(\mathrm{w}_{t},\mathrm{z}_{y_{t}})]}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼μ[χ(τ=T,FT=𝐲)𝔼𝒟T[t𝐲F(wt)][τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{\mu}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}_{{\mathcal{D}}^{T}}[\sum_{t\in\mathbf{y}}F(\mathrm{w}_{t})]}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[τ=T,FT=𝐲χ(τ=T,FT=𝐲)𝔼[χ(τ=T,FT=𝐲)t𝐲F(wt)[τ=T,FT=𝐲]]]\displaystyle=\mathbb{E}\left[\sum_{\tau=T,F_{T}=\mathbf{y}}\chi(\tau=T,F_{T}=\mathbf{y})\mathbb{E}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in\mathbf{y}}F(\mathrm{w}_{t})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]\right]
=𝔼[𝔼[tFτF(wt)|τ,Fτ]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\sum_{t\in F_{\tau}}F(\mathrm{w}_{t})|\tau,F_{\tau}\right]\right]
=𝔼[tFτF(wt)]\displaystyle=\mathbb{E}\left[\sum_{t\in F_{\tau}}F(\mathrm{w}_{t})\right]
n𝔼[F(w^τ)],\displaystyle\geq n\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})],

where in the second equality, we condition on τ,Fτ\tau,F_{\tau}, and write the conditional expectation as total expectation as: 𝔼[tFτf(wt,zYt)|τ,Fτ]=𝔼[χ(τ=T,FT=𝐲)tFTf(wt,zYt)[τ=T,FT=𝐲]]\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{\mathrm{Y}_{t}})|\tau,F_{\tau}\right]=\mathbb{E}\left[\frac{\chi(\tau=T,F_{T}=\mathbf{y})\sum_{t\in F_{T}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})}{\mathbb{P}\left[\tau=T,F_{T}=\mathbf{y}\right]}\right]. In the fourth equality, we use the fact that the iterates wt\mathrm{w}_{t} are fixed, and take the expectation of f(wt,zyt)f(\mathrm{w}_{t},\mathrm{z}_{y_{t}}) with respect to the data generating distribution 𝒟{\mathcal{D}}, yielding F(wt)F(\mathrm{w}_{t}). The rest of the steps collapses the conditional expectation back to a total expectation, and finally the last inequality holds using convexity. Using Corollary 1 we get

2D(L+σd)n\displaystyle 2D(L+\sigma\sqrt{d})\sqrt{n} 𝔼[tFτf(wt,zYt)]+𝔼[t=1τξt,wt]𝔼[(tFτf(u,zyt)+t=1τξt,u)]\displaystyle\geq\mathbb{E}\left[\sum_{t\in F_{\tau}}f(\mathrm{w}_{t},\mathrm{z}_{Y_{t}})\right]+\mathbb{E}\left[\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{w}_{t}\rangle\right]-\mathbb{E}\left[\left(\sum_{t\in F_{\tau}}f(\mathrm{u},\mathrm{z}_{y_{t}})+\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{u}\rangle\right)\right]
n𝔼[F(w^τ)]+𝔼[τ]×0nF(u)+𝔼[τ]×0=n(𝔼[F(w^τ)]F(u)),\displaystyle\geq n\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})]+\mathbb{E}[\tau]\times 0-nF(\mathrm{u})+\mathbb{E}[\tau]\times 0=n(\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})]-F(\mathrm{u})),

where in the second inequality we have used Wald's lemma to guarantee 𝔼[t=1τξt,wt]=0\mathbb{E}[\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{w}_{t}\rangle]=0 and 𝔼[t=1τξt,u]=0\mathbb{E}[\sum_{t=1}^{\tau}\langle\xi_{t},\mathrm{u}\rangle]=0. ∎

Corollary 1 with Lemma 1 gives the utility guarantee.

Theorem 4.

Suppose the elements in 𝒲{\mathcal{W}} are bounded in norm by DD and that f(,zt)f(\cdot,\mathrm{z}_{t}) is an LL-Lipschitz function for all t[n]t\in[n]. Running Algorithm 1 for τ\tau iterations with noise sequence ξt𝒩(0,σ2I)\xi_{t}\sim\mathcal{N}(0,\sigma^{2}\mathrm{I}) and step size ηt=Dn(L+σd)\eta_{t}=\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})} guarantees that

𝔼[F(w^τ)]F(w)52D(L+σd)n.\displaystyle\mathbb{E}[F(\widehat{\mathrm{w}}_{\tau})]-F(\mathrm{w}^{*})\leq\frac{5}{2}\frac{D(L+\sigma\sqrt{d})}{\sqrt{n}}.

5 Privacy Proof

In this section, we establish the differential privacy of Algorithm 1. The privacy proof essentially follows the analysis of noisy SGD in Bassily et al. (2014), but is stated in full detail for completeness and to provide a self-contained presentation. The basic idea is as follows. We view Algorithm 1 as a variant of noisy stochastic gradient descent on the ERM problem over nn samples {zi}i=1n\{\mathrm{z}_{i}\}_{i=1}^{n}. Indeed, but for the step where we update the iterate only with noise and do not compute a gradient, the proposed algorithm would be exactly equivalent to noisy SGD. Prior work shows that using noisy SGD to solve the ERM problem enjoys stronger differential privacy guarantees due to privacy amplification via subsampling. Intuitively, Algorithm 1 should also benefit from privacy amplification, and the algorithm should not suffer privacy loss in steps where we use the zero gradient. We formalize this intuition using the standard analysis for Rényi differential privacy (Wang et al., 2019) and properties of the Rényi divergence (Mironov, 2017).

We first introduce additional notation. Let i:𝒲×[n]×S𝒲×[n]{\mathcal{M}}_{i}:{\mathcal{W}}\times[n]^{*}\times S\rightarrow{\mathcal{W}}\times[n]^{*} be a function which describes the ithi^{\text{th}} iteration of the algorithm – it takes as input an iterate wi𝒲\mathrm{w}_{i}\in{\mathcal{W}}, a set, Fi[n]F_{i}\subseteq[n]^{*}, of indices of data points, and a subset SiSS_{i}\subseteq S; it outputs an iterate wi+1𝒲\mathrm{w}_{i+1}\in{\mathcal{W}} and a set of indices Fi+1[n]F_{i+1}\subseteq[n]^{*}. We assume that SS is totally ordered according to the indices of the datapoints. Further, let 𝔖i\mathfrak{S}_{i} denote the set of indices of datapoints in SiS_{i}. Let i1{\mathcal{M}}_{i}^{1} and i2{\mathcal{M}}_{i}^{2} denote the first and second outputs of i{\mathcal{M}}_{i}, i.e., i1(wi,Fi,S)=wi+1{\mathcal{M}}_{i}^{1}(\mathrm{w}_{i},F_{i},S)=\mathrm{w}_{i+1} and i2(wi,Fi,S)=Fi+1{\mathcal{M}}_{i}^{2}(\mathrm{w}_{i},F_{i},S)=F_{i+1}. Note that in the algorithm, the i{\mathcal{M}}_{i}’s are composed – we initialize F1=F_{1}=\emptyset and w1\mathrm{w}_{1} is fixed, therefore, 1(w1,,S)=(w2,F2){\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S)=(\mathrm{w}_{2},F_{2}). 2{\mathcal{M}}_{2} acts on the output of 1(w1,,S){\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S) as 2((1(w1,,S)),S)=(w3,F3){\mathcal{M}}_{2}(({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S)),S)=(\mathrm{w}_{3},F_{3}). In general, we have that

\[
{\mathcal{M}}_{i}({\mathcal{M}}_{i-1}(\cdots({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S))\cdots),S)=(\mathrm{w}_{i+1},F_{i+1}).
\]

Formally, given $\mathrm{w}\in\mathcal{W}$, $m\in\mathbb{N}$, and $F\in[n]^{*}$, we define the random variable ${\mathcal{M}}_{i}(\mathrm{w},F,S):{\mathbb{R}}^{d}\times[n]^{m}\rightarrow{\mathcal{W}}\times[n]^{*}$ as

\[
\mathcal{M}_{i}(\mathrm{w}_{i},F_{i},S)(\xi_{i},\mathfrak{S}_{i})=\big[\mathcal{P}\left(\mathrm{w}_{i}-\eta_{i}(\mathrm{g}_{i}+\xi_{i})\right),\,F_{i}\cup\mathfrak{S}_{i}\big],
\]

where $\mathrm{g}_{i}$ is the following gradient operator:

\[
\mathrm{g}_{i}=\begin{cases}\frac{1}{|\mathfrak{S}_{i}\setminus F_{i}|}\sum_{j\in\mathfrak{S}_{i}\setminus F_{i}}\nabla f(\mathrm{w}_{i},\mathrm{z}_{j})&\text{if }|\mathfrak{S}_{i}\setminus F_{i}|>0,\\ 0&\text{if }|\mathfrak{S}_{i}\setminus F_{i}|=0.\end{cases}
\]

In Algorithm 1, $m=1$; in general, $m$ is the size of the set sub-sampled from $S$, which we use to construct a mini-batched gradient.
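For concreteness, a single iteration $\mathcal{M}_{i}$ can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `grad_f`, `project`, and the parameter names are assumptions.

```python
import numpy as np

def mechanism_step(w, F, S, grad_f, project, eta, sigma, rng, m=1):
    """One iteration M_i: sub-sample m indices, average gradients over
    previously unseen points (or use the zero gradient if there are none),
    add Gaussian noise, and project back onto the feasible set."""
    n = len(S)
    sampled = set(int(j) for j in rng.choice(n, size=m, replace=False))
    fresh = sampled - F                  # indices not seen in earlier rounds
    if fresh:
        g = np.mean([grad_f(w, S[j]) for j in fresh], axis=0)
    else:
        g = np.zeros_like(w)             # noise-only step
    xi = rng.normal(0.0, sigma, size=w.shape)
    return project(w - eta * (g + xi)), F | sampled
```

Composing these steps, starting from $F_{1}=\emptyset$, produces the sequence of iterates analyzed below.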

The input space of $\mathcal{M}_{i}(\mathrm{w}_{i},F_{i},S)$, which is ${\mathbb{R}}^{d}\times[n]^{m}$, is endowed with the measure ${\mathcal{N}}(0,\sigma^{2}{\mathbb{I}})\times\text{Unif}([n],m)$, where $\text{Unif}([n],m)$ is the sub-sampling measure that draws $m$ out of $n$ data points uniformly at random. We now construct a vector concatenating the first outputs of all the ${\mathcal{M}}_{i}$'s. Let

\begin{align*}
\mathbf{M}_{\tau}(\mathrm{w}_{1},S) &=[\mathrm{w}_{1},\mathrm{w}_{2},\ldots,\mathrm{w}_{\tau}]\\
&=[\mathrm{w}_{1},\,{\mathcal{M}}_{1}^{1}(\mathrm{w}_{1},\emptyset,S),\,{\mathcal{M}}_{2}^{1}({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S),S),\,\ldots,\,{\mathcal{M}}_{\tau}^{1}({\mathcal{M}}_{\tau-1}(\cdots({\mathcal{M}}_{1}(\mathrm{w}_{1},\emptyset,S))\cdots),S)].
\end{align*}

We claim that $\mathbf{M}_{\tau}(\mathrm{w}_{1},S)$ is differentially private.

Theorem 5.

Suppose we run Algorithm 1 with noise sampled from $\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I}_{d})$, where $\sigma=\frac{L\sqrt{3\log(1/\delta)}}{\tilde{\epsilon}}$. For a fixed $\tau$ and $\tilde{\epsilon}\leq 1.256$, $\mathbf{M}_{\tau}(\mathrm{w}_{1},S)$ is $\left(\frac{2\tilde{\epsilon}\sqrt{2\tau\log(1/\delta^{\prime})}}{|S|}+\frac{4\tau\tilde{\epsilon}^{2}}{|S|^{2}},\,\frac{\tau}{|S|}\delta+\delta^{\prime}\right)$-DP.

To prove the theorem, we first show that the first coordinate of each $\mathcal{M}_{i}$ is sufficiently differentially private.

Lemma 2.

Let $\sigma=\frac{L\sqrt{3\log(1/\delta)}}{\tilde{\epsilon}}$. For any $\mathrm{w},F$ and any $i$, the mechanism $\mathcal{M}^{1}_{i}(\mathrm{w},F,S)$ is $\left(\frac{m}{|S|}(e^{\tilde{\epsilon}}-1),\frac{m}{|S|}\delta\right)$-DP.

Proof.

Consider two neighboring datasets $S$ and $S^{\prime}$, and let $\rho$ be the index at which they differ. Let $p=m/|S|$. Denote the random subsamples from $S$ and $S^{\prime}$ by $\tilde{S}$ and $\tilde{S}^{\prime}$, respectively. Under uniform random sampling we have $\tilde{\mathfrak{S}}=\tilde{\mathfrak{S}}^{\prime}$; this holds because the sampling of indices depends only on the common size $n$ of $S$ and $S^{\prime}$. The proof follows the standard privacy amplification by sub-sampling argument (see https://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf). For fixed $\mathrm{w}$ and $F$, and any measurable (with respect to the Lebesgue measure on $\mathbb{R}^{d}$) $\mathcal{E}\subseteq\mathcal{W}$, define the measures

\begin{align*}
P(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\,|\,\rho\in\tilde{\mathfrak{S}},\rho\not\in F\right]\\
P^{\prime}(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\,|\,\rho\in\tilde{\mathfrak{S}},\rho\not\in F\right]\\
Q(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\,|\,\rho\in F\right]\\
Q^{\prime}(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\,|\,\rho\in F\right]\\
R(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\,|\,\rho\not\in\tilde{\mathfrak{S}},\rho\not\in F\right]\\
R^{\prime}(\mathcal{E}) &=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\,|\,\rho\not\in\tilde{\mathfrak{S}},\rho\not\in F\right].
\end{align*}

First, we note that $Q(\mathcal{E})=Q^{\prime}(\mathcal{E})$ because the gradient step is restricted to points indexed by $\tilde{\mathfrak{S}}\setminus F$ and the differing index $\rho$ does not belong to that set. We also have $R(\mathcal{E})=R^{\prime}(\mathcal{E})$ because in this case $\rho$ is not among the subsampled points. We consider two cases: $\rho\in F$ and $\rho\not\in F$. In the first case, $\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in{\mathcal{E}}\right]=Q({\mathcal{E}})=Q^{\prime}({\mathcal{E}})=\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in{\mathcal{E}}\right]$, and we have perfect $(0,0)$-DP. In the other case, we have

\begin{align*}
\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S)\in\mathcal{E}\right]-p\delta &=pP(\mathcal{E})+(1-p)R(\mathcal{E})-p\delta\\
&\leq p\,e^{\tilde{\epsilon}}\min\{P^{\prime}(\mathcal{E}),R(\mathcal{E})\}+p\delta+(1-p)R(\mathcal{E})-p\delta\\
&=p\left(\min\{P^{\prime}(\mathcal{E}),R(\mathcal{E})\}+(e^{\tilde{\epsilon}}-1)\min\{P^{\prime}(\mathcal{E}),R(\mathcal{E})\}\right)+(1-p)R(\mathcal{E})\\
&\leq p\left(P^{\prime}(\mathcal{E})+(e^{\tilde{\epsilon}}-1)(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\right)+(1-p)R(\mathcal{E})\\
&=pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E})+p(e^{\tilde{\epsilon}}-1)(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\\
&=(1+p(e^{\tilde{\epsilon}}-1))(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\\
&\leq e^{p(e^{\tilde{\epsilon}}-1)}(pP^{\prime}(\mathcal{E})+(1-p)R(\mathcal{E}))\\
&=e^{p(e^{\tilde{\epsilon}}-1)}\,\mathbb{P}\left[\mathcal{M}_{i}^{1}(\mathrm{w},F,S^{\prime})\in\mathcal{E}\right],
\end{align*}

where in the first inequality we used the following DP guarantee of $\mathcal{M}^{1}_{i}(\mathrm{w},F,S)$: for any subsampled sets $\tilde{S}\subseteq S$ and $\tilde{S}^{\prime}\subseteq S^{\prime}$, the mechanism $\mathcal{M}^{1}_{i}$ is $(\tilde{\epsilon},\delta)$-DP, as it is a step of noisy projected gradient descent with gradients bounded in norm by $L$ and Gaussian noise of sufficient variance. We use this DP guarantee twice: once on the neighboring datasets $S$ and $S^{\prime}$ to get the inequality $P(\mathcal{E})\leq e^{\tilde{\epsilon}}P^{\prime}(\mathcal{E})+\delta$, and once on the neighboring dataset $S\setminus\{\mathrm{z}_{\rho}\}$ to get the inequality $P(\mathcal{E})\leq e^{\tilde{\epsilon}}R(\mathcal{E})+\delta$. In the second application, note that $R(\mathcal{E})$ covers events in which a previously seen point is sampled; the noise-only step ensures that the output is then also a Gaussian with the same variance. In the second inequality we use the fact that a convex combination of two numbers is at least their minimum, and in the last inequality we use the standard relation $1+x\leq e^{x}$. Combining the two cases finishes the proof. ∎
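The amplification at the heart of Lemma 2 can also be checked numerically. The standard sub-sampling result states that running an $(\tilde{\epsilon},\delta)$-DP mechanism on a $p$-fraction sub-sample yields $(\log(1+p(e^{\tilde{\epsilon}}-1)),\,p\delta)$-DP, and $\log(1+x)\leq x$ recovers the bound $p(e^{\tilde{\epsilon}}-1)$ used above. A minimal sketch (hypothetical function name):

```python
import math

def amplified_eps(eps, p):
    """Return the exact sub-sampled epsilon and the upper bound of Lemma 2
    for a base (eps, delta)-DP mechanism run on a p-fraction sub-sample."""
    exact = math.log(1 + p * (math.exp(eps) - 1))
    bound = p * (math.exp(eps) - 1)
    return exact, bound
```

For small $p$ the two quantities are nearly identical; with $m=1$ and $|S|=n$, the per-step privacy cost therefore scales as $1/n$.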

We can now prove Theorem 5.

Proof of Theorem 5.

The proof essentially follows that of Theorem 3.3 of Dwork et al. (2010). Let $\epsilon_{0}=\frac{1}{|S|}(e^{\tilde{\epsilon}}-1)$, $\delta_{0}=\frac{1}{|S|}\delta$, $\epsilon_{1}=e^{\epsilon_{0}}-1$, and $\epsilon^{\prime}=\epsilon_{0}\sqrt{2\tau\log(1/\delta^{\prime})}+\tau\epsilon_{0}\epsilon_{1}$. Condition on the event that throughout the $\tau$ rounds, $\epsilon_{0}$-DP holds for all of the mechanisms $\mathcal{M}_{i}^{1}$. This event fails with probability at most $\tau\delta_{0}$, which will be accounted for in the final DP bound. Define the set of events on which $\epsilon^{\prime}$-DP does not hold as

\[
\mathcal{B}=\{\mathcal{E}\colon\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S)\in\mathcal{E}\right]>e^{\epsilon^{\prime}}\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S^{\prime})\in\mathcal{E}\right]\}.
\]

The goal is to show that $\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S)\in\mathcal{B}\right]\leq\delta^{\prime}$. Let $\mathbf{W}=\mathbf{M}_{\tau}(\mathrm{w}_{1},S)=[\mathrm{W}_{1},\ldots,\mathrm{W}_{\tau}]$, where $\mathrm{W}_{i}$ is the random variable taking values $\mathrm{w}_{i}$, and let $\mathbf{W}^{\prime}=\mathbf{M}_{\tau}(\mathrm{w}_{1},S^{\prime})$. Further, let $\mathbf{w}=[\mathrm{w}_{1},\ldots,\mathrm{w}_{\tau}]$ be a fixed realization of $\mathbf{W}$. We have

\begin{align*}
\log\left(\frac{\mathbb{P}\left[\mathbf{W}=\mathbf{w}\right]}{\mathbb{P}\left[\mathbf{W}^{\prime}=\mathbf{w}\right]}\right) &=\sum_{i=1}^{\tau}\log\left(\frac{\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}{\mathbb{P}\left[\mathrm{W}^{\prime}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}^{\prime}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}^{\prime}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}\right)\\
&\quad+\log\left(\frac{\mathbb{P}\left[Y_{1}=y_{1}\right]}{\mathbb{P}\left[Y_{1}=y_{1}\right]}\right).
\end{align*}

Consider the conditional probability $\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,Y_{i}=y_{i},\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2}\right]$. After conditioning on all the randomness so far in the algorithm, the event $\{\mathrm{W}_{i}=\mathrm{w}_{i}\}$ is exactly the event $\{\mathcal{M}^{1}_{i-1}(\mathrm{w}_{i-1},F_{i-1},S)=\mathrm{w}_{i}\}$, where $\mathrm{w}_{i-1}$ and $F_{i-1}$ are fixed. Lemma 2 now implies that

\[
\frac{\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}{\mathbb{P}\left[\mathrm{W}^{\prime}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}^{\prime}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}^{\prime}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}\leq e^{\epsilon_{0}},
\]

where we recall that we conditioned on the event that each mechanism $\mathcal{M}_{i}^{1}$ is $\epsilon_{0}$-DP. Define the random variable $c_{i-1}(\mathrm{W}_{2},\ldots,\mathrm{W}_{i},\mathrm{W}_{2}^{\prime},\ldots,\mathrm{W}_{i}^{\prime},Y_{1},\ldots,Y_{i-1})$, which takes values

\[
c_{i-1}(\mathrm{w}_{2},\ldots,\mathrm{w}_{i},y_{1},\ldots,y_{i-1})=\log\left(\frac{\mathbb{P}\left[\mathrm{W}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}{\mathbb{P}\left[\mathrm{W}^{\prime}_{i}=\mathrm{w}_{i}\,|\,\mathrm{W}^{\prime}_{i-1}=\mathrm{w}_{i-1},Y_{i-1}=y_{i-1},\ldots,\mathrm{W}^{\prime}_{2}=\mathrm{w}_{2},Y_{2}=y_{2},Y_{1}=y_{1}\right]}\right).
\]

We have just shown that $c_{i-1}(\mathrm{w}_{2:i},\mathrm{w}_{2:i},y_{1:i-1})\leq\epsilon_{0}$ for any $\mathrm{w}_{2:i},y_{1:i-1}$. By switching $\mathrm{W}_{i}$ with $\mathrm{W}_{i}^{\prime}$ we can show in the exact same way that $-c_{i-1}(\mathrm{w}_{2:i},\mathrm{w}_{2:i},y_{1:i-1})\leq\epsilon_{0}$, which implies that the random variable $c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})$ is a.s. bounded by $\epsilon_{0}$. Further, using the fact that $\epsilon_{0}$-DP implies boundedness of the $\infty$-divergence, we also have

\[
\max\big\{D_{\infty}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime})),\,D_{\infty}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime})\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S))\big\}\leq\epsilon_{0}.
\]

Lemma 3.2 of Dwork et al. (2010) now implies that the KL-divergence between $\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)$ and $\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime})$ is bounded as $D_{KL}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime}))\leq\epsilon_{1}\epsilon_{0}$. Together with the definition of $c_{i-1}$, this implies

\[
\mathbb{E}_{\mathbf{W}_{i},\mathbf{W}_{i}^{\prime}}[c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})\,|\,\mathrm{w}_{2:i-1},y_{1:i-1}]=D_{KL}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime}))\leq\epsilon_{0}\epsilon_{1}.
\]

Because the filtration generated by $\mathrm{W}_{2:i-1},Y_{1:i-1}$ includes the one generated by $c_{1:i-1}$, we can apply Azuma's inequality to the martingale difference sequence $c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})-D_{KL}(\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S)\,||\,\mathcal{M}_{i}^{1}(\mathrm{w}_{i},F_{i},S^{\prime}))$, which is bounded since the KL term is at most $\epsilon_{0}\epsilon_{1}$ and the $c_{i}$'s are bounded by $\epsilon_{0}$, to get

\begin{align*}
\mathbb{P}\left[\mathbf{M}_{\tau}(\mathrm{w}_{1},S)\in\mathcal{B}\right] &=\mathbb{P}\left[\sum_{i=1}^{\tau}c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})>\epsilon^{\prime}\right]\\
&=\mathbb{P}\left[\sum_{i=1}^{\tau}c_{i-1}(\mathrm{W}_{2:i},\mathrm{W}_{2:i}^{\prime},Y_{1:i-1})-\tau\epsilon_{0}\epsilon_{1}>\epsilon_{0}\sqrt{2\tau\log(1/\delta^{\prime})}\right]\\
&\leq\exp\left(-\frac{2\epsilon_{0}^{2}\tau\log(1/\delta^{\prime})}{2\tau\epsilon_{0}^{2}}\right)=\delta^{\prime}.
\end{align*}

Because $e^{\tilde{\epsilon}}-1\leq 2\tilde{\epsilon}$ for $\tilde{\epsilon}\leq 1.256$, we have $\epsilon_{0}\leq\frac{2\tilde{\epsilon}}{|S|}$. This in turn implies $\epsilon_{1}\leq 2\epsilon_{0}$, which shows that for a fixed $\tau$, Algorithm 1 is $(\frac{2\tilde{\epsilon}\sqrt{2\tau\log(1/\delta^{\prime})}}{|S|}+\frac{4\tau}{|S|^{2}}\tilde{\epsilon}^{2},\frac{\tau}{|S|}\delta+\delta^{\prime})$-DP. ∎
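To get a concrete feel for Theorem 5, the composed privacy parameters can be evaluated directly. The following is an illustrative helper mirroring the theorem's formulas (hypothetical names, not part of the paper):

```python
import math

def composed_privacy(eps_tilde, delta, delta_prime, tau, n):
    """Privacy parameters after tau adaptive rounds, per Theorem 5:
    eps   = 2*eps_tilde*sqrt(2*tau*log(1/delta'))/n + 4*tau*eps_tilde**2/n**2,
    delta = tau*delta/n + delta'."""
    eps = (2 * eps_tilde * math.sqrt(2 * tau * math.log(1 / delta_prime)) / n
           + 4 * tau * eps_tilde ** 2 / n ** 2)
    total_delta = tau * delta / n + delta_prime
    return eps, total_delta
```

With $\tilde{\epsilon}=\sqrt{n}\epsilon$ and $\tau\leq 2n$, as in Theorem 6, the first term becomes $4\epsilon\sqrt{\log(1/\delta^{\prime})}$, consistent with the guarantee stated there.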

Combining the utility and privacy guarantees yields the main result, stated below.

Theorem 6.

Let $f(\mathrm{w},\mathrm{z})$ be a convex $L$-Lipschitz function in its first argument, $\mathrm{w}\in{\mathcal{W}}$, for all $\mathrm{z}\in{\mathcal{Z}}$. Furthermore, assume that the diameter of ${\mathcal{W}}$ is bounded by $D$. For any $n\geq 16$, given $0<\epsilon\leq\frac{1}{2\sqrt{n}}$ and $\delta>0$, Algorithm 1 run with step size $\eta=\frac{D}{\sqrt{n}(L+\sigma\sqrt{d})}$ and $\sigma=\frac{8L\sqrt{\log(1/\delta)}}{\sqrt{n}\epsilon}$ outputs $\widehat{\mathrm{w}}_{\tau}$ which satisfies $(4\epsilon(\sqrt{\log(1/\delta^{\prime})}+2),\,\delta+\delta^{\prime}+2\exp(-n/16))$-differential privacy, for any $\delta^{\prime}>0$, and its accuracy is bounded as

\[
\mathbb{E}\left[F(\widehat{\mathrm{w}}_{\tau})\right]-F(\mathrm{w}^{*})\leq\frac{5LD}{\sqrt{n}}+\frac{20LD\sqrt{d\log(1/\delta)}}{\epsilon n}.
\]
Proof.

Condition on the event that $\tau\leq 2n$. Using Theorem 5 with $\tilde{\epsilon}=\sqrt{n}\epsilon$ implies that Algorithm 1 is $(4\epsilon(\sqrt{\log(1/\delta^{\prime})}+2),\delta+\delta^{\prime})$-DP under the condition $\epsilon\leq\frac{1.256}{\sqrt{n}}$. The event $\tau>2n$ happens with probability at most $2\exp(-n/16)$ according to Theorem 2, which implies that Algorithm 1 is $(4\epsilon(\sqrt{\log(1/\delta^{\prime})}+2),\delta+\delta^{\prime}+2\exp(-n/16))$-DP. Finally, the convergence guarantee follows from Theorem 4. ∎

Remark 1.

To get $(\bar{\epsilon},\bar{\delta})$-DP for any $6\exp(-n/16)\leq\bar{\delta}\leq 3e^{-4}$, we can set $\delta^{\prime}=\delta=\frac{\bar{\delta}}{3}$ and $\epsilon=\frac{\bar{\epsilon}}{8\sqrt{\log(1/\delta^{\prime})}}$. Using the condition $\epsilon\leq\frac{1}{\sqrt{n}}$, we get that $\frac{\bar{\epsilon}}{\sqrt{\log(3/\bar{\delta})}}\leq\frac{8}{\sqrt{n}}$. The utility guarantee becomes $\mathbb{E}\left[F(\widehat{\mathrm{w}}_{\tau})\right]-F(\mathrm{w}^{*})\leq\frac{5LD}{\sqrt{n}}+\frac{160LD\sqrt{d}\log(3/\bar{\delta})}{\bar{\epsilon}n}$.
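The parameter mapping in the remark can be packaged as a small helper. This is a hypothetical sketch that mirrors the remark's settings (it assumes $\bar{\delta}$ lies in the stated range and does not check it):

```python
import math

def dp_parameters(eps_bar, delta_bar, n, L):
    """Map a target (eps_bar, delta_bar)-DP guarantee to the internal
    parameters of Remark 1: delta' = delta = delta_bar/3,
    eps = eps_bar / (8*sqrt(log(1/delta'))), and the noise scale sigma
    from Theorem 6."""
    delta = delta_bar / 3
    delta_prime = delta_bar / 3
    eps = eps_bar / (8 * math.sqrt(math.log(1 / delta_prime)))
    sigma = 8 * L * math.sqrt(math.log(1 / delta)) / (math.sqrt(n) * eps)
    return eps, delta, delta_prime, sigma
```

Whenever $\bar{\epsilon}/\sqrt{\log(3/\bar{\delta})}\leq 8/\sqrt{n}$, the returned $\epsilon$ satisfies the condition $\epsilon\leq 1/\sqrt{n}$ required by the analysis.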

6 Other Approaches

We now briefly discuss a few other approaches that would also recover a convergence rate of $\tilde{O}(1/\sqrt{n}+\sqrt{d}/(\epsilon n))$ for $\epsilon\leq O(1/\sqrt{n})$. The first approach uses the standard decomposition of excess population risk into the generalization gap plus the excess empirical risk. As pointed out in Bassily et al. (2019); Feldman et al. (2020), it is possible to use the optimal rates for ERM from Bassily et al. (2014) together with the generalization properties of DP from Dwork et al. (2015) to bound the generalization gap, yielding a rate of $O\left(\max\left(d^{1/4}/\sqrt{n},\sqrt{d}/(n\epsilon)\right)\right)$, which evaluates to $\sqrt{d/n}$ in the regime $\epsilon\leq O(1/\sqrt{n})$. Even though the runtime of noisy SGD for ERM stated in Bassily et al. (2014) is $O(n^{2})$, it can be shown that their result holds with only $O\left(\frac{(n\epsilon)^{2}}{\sqrt{d}}\right)$ stochastic gradient computations; therefore, in the regime $\epsilon\lesssim\frac{1}{\sqrt{n}}$, it is a linear-time algorithm.

Second, it may be possible to use amplification by subsampling in Algorithm 1 of Feldman et al. (2020) instead of amplification by iteration. This would remove the smoothness requirement on $f(\cdot,\mathrm{z})$ and perhaps yield the same result as Theorem 6. We note that our Algorithm 1 is much simpler and does not require the complicated mini-batch schedule of Algorithm 1 of Feldman et al. (2020).

7 Conclusion

In this paper, we studied the problem of private stochastic convex optimization for non-smooth objectives. We proposed a noisy version of the OSMD algorithm and presented convergence and privacy guarantees for the $\ell_{2}$ geometry. Algorithm 1 achieves statistically optimal rates, runs in linear time, and is differentially private as long as the DP parameter is sufficiently small. Algorithm 1 can easily be extended to geometries induced by arbitrary strictly convex potentials $\Phi$. We leave it as future work to explore what privacy guarantees can be retained when the privacy mechanism is tailored to the geometry induced by $\Phi$. Finally, we believe it should be possible to extend our analysis to the case where $f(\cdot,\mathrm{z}_{i})$ is strongly convex for all $i\in[n]$ and achieve optimal statistical rates in linear time.

References

  • Agarwal et al. (2009) Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
  • Bassily et al. (2014) Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
  • Bassily et al. (2019) Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.
  • Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
  • Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 2011.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • Dwork et al. (2010) Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and Differential Privacy. In FOCS, pages 51–60, 2010.
  • Dwork et al. (2015) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126. ACM, 2015.
  • Feldman et al. (2018) Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521–532, 2018.
  • Feldman et al. (2020) Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
  • Kakade et al. (2009) Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2(1), 2009.
  • Mironov (2017) Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
  • Narayanan and Shmatikov (2008) Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.
  • Sweeney (1997) Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.
  • Vapnik (2013) Vladimir Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 2013.
  • Wang et al. (2019) Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.

Appendix A Proof of Theorem 2

Proof.

We begin by fixing a time horizon $T$ and showing that, with high probability, $\tau\leq T$ for appropriately chosen $T$. Let $q:[n]^{T}\rightarrow\mathbb{N}$ be the function which counts the unique draws among $(Y_{1},\ldots,Y_{T})$. Further, let $Z_{i}$ be the indicator random variable of the event that the $i$-th datapoint is drawn at least once in the $T$ rounds, i.e., $\exists j\in[T]:Y_{j}=i$. We can write the expectation of $q$ as

\begin{align*}
\mathbb{E}[q(Y_{1},\ldots,Y_{T})] &=\mathbb{E}\sum_{i=1}^{n}Z_{i}=\sum_{i=1}^{n}\mathbb{E}[Z_{i}]=\sum_{i=1}^{n}\mathbb{P}\left[\exists j\in[T]:Y_{j}=i\right]\\
&=\sum_{i=1}^{n}\left(1-\mathbb{P}\left[\forall j\in[T],Y_{j}\neq i\right]\right)=\sum_{i=1}^{n}\left(1-\left(1-1/n\right)^{T}\right)\geq n-n\exp\left(-T/n\right).
\end{align*}

Setting $T\geq n$ already shows that the expected number of unique elements is at least $n/2$, so the stopping condition is reached in expectation. Next, we show concentration of $q$ around its expectation. Note that if a single element $y_{i}$ is changed to $y_{i}^{\prime}$, the value of $q$ changes by at most $1$, i.e.,
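The lower bound $n-ne^{-T/n}$ follows from $1-x\leq e^{-x}$; it can be checked numerically against the exact expectation $n(1-(1-1/n)^{T})$ (an illustrative sketch, not part of the proof):

```python
import math

def expected_unique(n, T):
    """Exact expected number of unique indices among T uniform draws from [n]."""
    return n * (1 - (1 - 1 / n) ** T)
```

Already at $T=n$, the expectation exceeds $n(1-e^{-1})\approx 0.63n>n/2$.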

\[
|q(y_{1},\ldots,y_{T})-q(y_{1},\ldots,y_{i-1},y_{i}^{\prime},y_{i+1},\ldots,y_{T})|\leq 1.
\]

This allows us to use McDiarmid’s inequality to claim that

\[
\mathbb{P}\left[q(Y_{1},\ldots,Y_{T})-\mathbb{E}[q(Y_{1},\ldots,Y_{T})]<-\sqrt{\log(2/\delta)T/2}\right]\leq\delta.
\]

This implies that with probability at least 1δ1-\delta we have

\[
q(Y_{1},\ldots,Y_{T})\geq n-n\exp\left(-T/n\right)-\sqrt{\log(2/\delta)T/2}.
\]

Setting $T=2n$ and $\delta=2\exp(-n/16)$, we have that for any $n\geq 16$, with probability at least $1-2\exp(-n/16)$,

\[
q(Y_{1},\ldots,Y_{T})\geq n-ne^{-2}-\sqrt{n\times n/16}\geq n/2.
\]

To get the bound in expectation, we note that $\mathbb{P}\left[\tau\geq T\right]=\mathbb{P}\left[q(Y_{1},\ldots,Y_{T})\leq n/2\right]$. On the other hand, we know from the above derivation that for $T\geq 2n$ it holds that $\mathbb{P}\left[q(Y_{1},\ldots,Y_{T})\leq n/2\right]\leq 2e^{-T/16}$. This implies

\[
\mathbb{E}[\tau]\leq 2n\,\mathbb{P}\left[\tau\leq 2n\right]+\int_{2n}^{\infty}\mathbb{P}\left[\tau\geq t\right]dt\leq 2n+\int_{2n}^{\infty}2\exp\left(-t/16\right)dt\leq O(n+ne^{-n/16}).
\]
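As a sanity check on this bound (a hypothetical simulation, not part of the argument), one can draw uniform indices until half of the $n$ datapoints have been seen and inspect the empirical stopping time:

```python
import numpy as np

def stopping_time(n, rng):
    """Draw uniform indices from [n] until at least n/2 unique ones
    are seen; return the number of draws (the stopping time tau)."""
    seen, t = set(), 0
    while len(seen) < n / 2:
        t += 1
        seen.add(int(rng.integers(n)))
    return t
```

Empirically, the mean is close to $n\ln 2\approx 0.69n$, consistent with $\mathbb{E}[\tau]=O(n)$ and with $\tau\leq 2n$ holding with overwhelming probability.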