Private Stochastic Convex Optimization:
Efficient Algorithms for Non-smooth Objectives
Abstract
In this paper, we revisit the problem of private stochastic convex optimization. We propose an algorithm based on noisy mirror descent that achieves optimal rates, both in terms of statistical complexity and the number of queries to a first-order stochastic oracle, in the regime where the privacy parameter is inversely proportional to the number of samples.
1 Introduction
Modern machine learning systems often leverage data that are generated ubiquitously and seamlessly through devices such as smartphones, cameras, and microphones, or through users’ weblogs, transaction logs, and social media. Much of this data is private, and releasing models trained on such data without serious privacy considerations can reveal sensitive information (Narayanan and Shmatikov, 2008; Sweeney, 1997). Consequently, much emphasis has been placed in recent years on machine learning under the constraint of a robust privacy guarantee. One such notion that has emerged as a de facto standard is differential privacy.
Informally, differential privacy provides a quantitative assessment of how different the outputs of a randomized algorithm are when it is fed two very similar inputs. If small changes in the input do not manifest as drastically different outputs, then it is hard to discern much information about the inputs solely from the outputs of the algorithm. In the context of machine learning, this implies that if the learning algorithm is not overly sensitive to any single datum in the training set, then releasing the trained model should preserve the privacy of the training data. This requirement, a priori, seems compatible with the goal of learning, which is to find a model that generalizes well to the population and does not overfit the given training sample. It seems reasonable, then, to argue that privacy is not necessarily at odds with generalization, especially when large training sets are available.
We take the following stochastic optimization view of machine learning, where the goal is to find a predictor $w$ that minimizes the expected loss (aka risk)
$$\min_{w \in \mathcal{W}} F(w) := \mathbb{E}_{z \sim \mathcal{D}}\left[f(w, z)\right] \qquad (1)$$
based on i.i.d. samples $z_1, \ldots, z_n$ from the source distribution $\mathcal{D}$, and full knowledge of the instantaneous objective function $f$ and the hypothesis class $\mathcal{W}$. We are particularly interested in convex learning problems, where the hypothesis class $\mathcal{W}$ is a convex set and the loss function $f(\cdot, z)$ is convex in its first argument for all $z$. We seek a learning algorithm that uses the smallest possible number of samples and the least runtime and returns $\hat{w}$ such that $\mathbb{E}[F(\hat{w})] - \min_{w \in \mathcal{W}} F(w) \leq \alpha$ for a user-specified $\alpha > 0$, while guaranteeing differential privacy (see Section 2.2 for a formal definition).
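As a concrete running example of Problem 1, consider learning a linear predictor with the (non-smooth) hinge loss. The following sketch is purely illustrative – the function names (`hinge_loss`, `population_risk`, `sample_z`) and the toy distribution are our own, and the population risk is approximated by Monte Carlo sampling rather than computed exactly.

```python
import numpy as np

def hinge_loss(w, z):
    """Non-smooth convex loss f(w, z) for z = (x, y) with y in {-1, +1}."""
    x, y = z
    return max(0.0, 1.0 - y * np.dot(w, x))

def population_risk(w, sample_z, m=10_000):
    """Monte Carlo estimate of F(w) = E_{z ~ D}[f(w, z)]."""
    return float(np.mean([hinge_loss(w, sample_z()) for _ in range(m)]))

# Toy data distribution D: Gaussian features, labels given by a ground-truth w*.
rng = np.random.default_rng(0)
d = 5
w_star = np.ones(d) / np.sqrt(d)

def sample_z():
    x = rng.normal(size=d)
    y = 1.0 if np.dot(w_star, x) >= 0 else -1.0
    return (x, y)

print(population_risk(np.zeros(d), sample_z))  # risk of the zero predictor is 1.0
```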
A natural approach to solving Problem 1 is sample average approximation (SAA), also known as empirical risk minimization (ERM), where we instead minimize an empirical approximation of the objective based on the i.i.d. sample. Empirical risk minimization for convex learning problems has been studied in the context of differential privacy by several researchers, including Bassily et al. (2019), who give statistically efficient algorithms with oracle complexity matching that of optimal non-private ERM.
An alternative approach to solving Problem 1 is stochastic approximation (SA), wherein, rather than forming an approximation of the objective, the goal is to directly minimize the true risk. The learning algorithm is an iterative algorithm that processes a single sample from the population in each iteration to perform an update. Stochastic gradient descent (SGD), for instance, is a classic SA algorithm. Recent work of Feldman et al. (2020) gives optimal rates for convex learning problems (Problem 1) using stochastic approximation for smooth loss functions; however, they leave open the question of optimal rates for non-smooth convex learning problems, a large class which includes, for example, support vector machines. In this work, we focus on non-smooth convex learning problems and propose a simple algorithm which achieves nearly optimal rates in the special case where the privacy guarantee scales inversely with the number of observed samples. There are alternative approaches (we became aware of one in a private communication with Raef Bassily); for further discussion we refer the reader to Section 6.
2 Notation and Preliminaries
We consider the general learning setup of Vapnik (2013). Let $\mathcal{Z}$ be the sample space and let $\mathcal{D}$ be an unknown distribution over $\mathcal{Z}$. We are given a sample $S = \{z_1, \ldots, z_n\}$ drawn independently and identically distributed (i.i.d.) from $\mathcal{D}$ – the samples may correspond to (feature, label) tuples as in supervised learning, or to features alone as in unsupervised learning. We assume that the loss $f(\cdot, z)$ is a convex function in its first argument and the hypothesis set $\mathcal{W}$ is a convex set. Given $n$ samples, the goal is to minimize the population risk (Problem 1).
We assume that the hypothesis class $\mathcal{W}$ is a convex, closed and bounded set in $\mathbb{R}^d$ equipped with a norm $\|\cdot\|$, such that $\|w\| \leq D$ for all $w \in \mathcal{W}$. We recall that the dual space of $(\mathbb{R}^d, \|\cdot\|)$ is the set of all linear functionals over it; the dual norm, denoted by $\|\cdot\|_{*}$, is defined as $\|g\|_{*} = \sup_{\|w\| \leq 1} \langle g, w \rangle$, where $g$ is an element of the dual space. In general we will write $\|\cdot\|$ for the norm when there is no risk of confusion.
We do not assume that $f(\cdot, z)$ is necessarily differentiable and will denote an arbitrary sub-gradient by $\nabla f(w, z) \in \partial f(w, z)$. We assume that $f(\cdot, z)$ is $L$-Lipschitz with respect to the norm $\|\cdot\|$, i.e., $|f(w, z) - f(w', z)| \leq L \|w - w'\|$ for all $w, w' \in \mathcal{W}$. For convex $f(\cdot, z)$, this implies that sub-gradients are bounded in the dual norm as $\|\nabla f(w, z)\|_{*} \leq L$. A popular instance of the above is the $\ell_p / \ell_q$ setup, which considers the $\ell_p$ norm in the primal space and the corresponding $\ell_q$ norm in the dual space, where $1/p + 1/q = 1$. A function $\Phi$ is $\beta$-strongly smooth (or just $\beta$-smooth) if $\Phi(w') \leq \Phi(w) + \langle \nabla \Phi(w), w' - w \rangle + \frac{\beta}{2} \|w' - w\|^2$ for all $w, w'$. For convex functions, this is equivalent to the condition $\|\nabla \Phi(w) - \nabla \Phi(w')\|_{*} \leq \beta \|w - w'\|$. A function $\Phi$ is $\lambda$-strongly convex if $\Phi(w') \geq \Phi(w) + \langle \nabla \Phi(w), w' - w \rangle + \frac{\lambda}{2} \|w' - w\|^2$ for all $w, w'$. Finally, we use $\tilde{O}(\cdot)$ to suppress poly-logarithmic factors in the complexity.
2.1 Mirror descent and potential functions
We recall some basics of convex duality, which will help us set up the mirror descent algorithm and its analysis. For a convex function $\Phi$, we define its conjugate as $\Phi^{*}(g) = \sup_{w} \left\{ \langle g, w \rangle - \Phi(w) \right\}$. Moreover, $D_{\Phi}(\cdot, \cdot)$ denotes the Bregman divergence induced by $\Phi$, defined as
$$D_{\Phi}(w, w') = \Phi(w) - \Phi(w') - \langle \nabla \Phi(w'), w - w' \rangle.$$
For a convex set $\mathcal{W}$, we denote by $\chi_{\mathcal{W}}$ the indicator function of the set: $\chi_{\mathcal{W}}(w) = 0$ if $w \in \mathcal{W}$, and $\chi_{\mathcal{W}}(w) = +\infty$ otherwise. The following result from Kakade et al. (2009) establishes a natural relation between strong convexity and strong smoothness of a potential function and its conjugate, respectively.
Theorem 1 (Theorem 6 from Kakade et al. (2009)).
Assume $\Phi$ is a closed convex function. Then $\Phi$ is $\lambda$-strongly convex with respect to a norm $\|\cdot\|$ if and only if $\Phi^{*}$ is $\frac{1}{\lambda}$-strongly smooth with respect to the dual norm $\|\cdot\|_{*}$.
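As a concrete instance of this duality – the one underlying the $\ell_2$ geometry used later in the paper – consider the Euclidean potential; the computation below is a standard sanity check, not a result of this paper:
$$\Phi(w) = \tfrac{1}{2}\|w\|_2^2 \quad \Longrightarrow \quad \Phi^{*}(g) = \sup_{w}\big\{ \langle g, w \rangle - \tfrac{1}{2}\|w\|_2^2 \big\} = \tfrac{1}{2}\|g\|_2^2,$$
so $\Phi$ is $1$-strongly convex with respect to $\|\cdot\|_2$, its conjugate $\Phi^{*}$ is $1$-strongly smooth with respect to the (self-dual) $\ell_2$ norm, and the induced Bregman divergence is $D_{\Phi}(w, w') = \frac{1}{2}\|w - w'\|_2^2$.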
2.2 Differential privacy
We seek to design algorithms for solving the stochastic convex optimization problem (Problem 1) with the additional constraint that the algorithm’s output is differentially private.
Definition 1 (-differential privacy (Dwork et al., 2006)).
An algorithm $\mathcal{A}$ satisfies $(\epsilon, \delta)$-differential privacy if, given any two datasets $S$ and $S'$ differing in only one data point, it satisfies, for any measurable event $E$ in the range of $\mathcal{A}$,
$$\mathbb{P}\big( \mathcal{A}(S) \in E \big) \leq e^{\epsilon}\, \mathbb{P}\big( \mathcal{A}(S') \in E \big) + \delta.$$
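To make the definition concrete, the following minimal sketch privatizes a bounded-sensitivity query (the empirical mean of values in $[0, 1]$) using the standard Gaussian mechanism; the calibration $\sigma = \Delta \sqrt{2\log(1.25/\delta)}/\epsilon$ is the textbook choice, valid for $\epsilon \leq 1$, and the function names are our own.

```python
import numpy as np

def gaussian_mechanism_mean(data, eps, delta, rng=np.random.default_rng()):
    """(eps, delta)-DP release of the mean of n values lying in [0, 1].

    Changing one record moves the mean by at most 1/n, so the l2-sensitivity
    is Delta = 1/n; Gaussian noise calibrated to Delta gives (eps, delta)-DP
    for eps <= 1.
    """
    n = len(data)
    sensitivity = 1.0 / n
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return float(np.mean(data)) + rng.normal(scale=sigma)

print(gaussian_mechanism_mean(np.linspace(0, 1, 1000), eps=0.5, delta=1e-6))
```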
3 Related Work
In convex learning and optimization, the following four classes of functions are widely studied: (a) $L$-Lipschitz convex functions, (b) $\beta$-smooth convex functions, (c) $\lambda$-strongly convex functions, and (d) $\beta$-smooth and $\lambda$-strongly convex functions. In the computational framework of a first-order stochastic oracle, algorithms with optimal oracle complexity for all these classes of functions have long been known (Agarwal et al., 2009). However, the landscape of known results changes with the additional constraint of privacy. We can trace two approaches to solving the private version of Problem 1. The first is private ERM (Chaudhuri et al., 2011; Bassily et al., 2014; Feldman et al., 2018; Bassily et al., 2019), and the second is private stochastic approximation (Feldman et al., 2020). Chaudhuri et al. (2011) began the study of private ERM by constructing algorithms based on output perturbation and objective perturbation. Under the assumption that the stochastic gradients are Lipschitz continuous, their output perturbation bounds achieve an excess population risk that scales with the Lipschitz constant $L$ of the loss function and the diameter $D$ of the hypothesis class in the respective norm; the objective perturbation bounds have a similar form. Bassily et al. (2014) showed tight upper and lower bounds for the excess empirical risk. They also showed bounds for the excess population risk when the loss function does not have Lipschitz continuous gradients. Their techniques appeal to uniform convergence, i.e., bounding $\sup_{w \in \mathcal{W}} |F(w) - \hat{F}(w)|$ for the empirical risk $\hat{F}$, to convert the guarantees on excess empirical risk into a bound on the excess population risk. These guarantees, however, were sub-optimal. Bassily et al. (2019) improved these to get optimal bounds on the excess population risk, leveraging connections between algorithmic stability and generalization. The algorithms given by Bassily et al. (2019) are sample efficient, but their runtimes are super-linear (in the number of samples), whereas the non-private counterparts run in linear time. In a follow-up work, Feldman et al. (2020) improved the runtime of some of these algorithms without sacrificing statistical efficiency; however, the authors require that the stochastic gradients are Lipschitz continuous. Essentially, the statistical complexity of private stochastic convex optimization has been resolved, but some questions about computational efficiency remain open. In the subsequent paragraphs, we discuss the different settings for the population loss, describe what is already known, and identify the avenues for improvement.
Non-smooth Lipschitz Convex.
For the class of $L$-Lipschitz convex functions, Bassily et al. (2014) improved upon Chaudhuri et al. (2011) and gave an optimal bound on the excess empirical risk of $\tilde{O}\big(\frac{LD\sqrt{d}}{n\epsilon}\big)$. They then appeal to uniform convergence to convert the guarantee on the excess empirical risk into a bound on the excess population risk, which is sub-optimal and was very recently improved by Bassily et al. (2019), using connections between algorithmic stability and generalization, to $O\big(LD\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\big)\big)$. This is optimal, since $\frac{LD}{\sqrt{n}}$ is the optimal excess risk without privacy constraints, and $\frac{LD\sqrt{d}}{n\epsilon}$ is the optimal excess risk when the data distribution is the empirical distribution. This resolves the statistical complexity of private convex learning and shows that in various regimes of interest, e.g., when $d = O(1)$ and $\epsilon = \Omega(1/\sqrt{n})$, the constraint of privacy has no effect on utility. However, the algorithm proposed by Bassily et al. (2019) is based on Moreau smoothing and proximal methods, and requires a super-linear number of stochastic gradient computations. This vastly exceeds the $O(n)$ gradient computations needed for non-private stochastic convex optimization. The computational complexity was subsequently improved by Feldman et al. (2020) using a regularized ERM algorithm that runs in phases; after each phase, noise is added to the solution (output perturbation) and used as regularization for the next phase.
Smooth Lipschitz Convex.
For $\beta$-smooth, convex, $L$-Lipschitz functions, Bassily et al. (2019) give an algorithm with an optimal bound on the excess risk of $O\big(LD\big(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon}\big)\big)$, using a super-linear number of queries to the stochastic gradient oracle. This, again, was improved in the later work of Feldman et al. (2020) to $O(n)$ stochastic gradient queries. Note that even for non-private stochastic optimization, $\Omega(n)$ stochastic gradient computations are necessary, so Feldman et al. (2020) achieve optimal statistical and oracle complexity for smooth Lipschitz convex functions. The algorithm they present is an instance of single-pass noisy SGD, and the guarantees hold for the last iterate.
(Smooth and Non-smooth) Lipschitz Strongly Convex.
For $L$-Lipschitz, $\lambda$-strongly convex functions, Bassily et al. (2014) gave an algorithm with optimal excess empirical risk of $\tilde{O}\big(\frac{L^2 d}{\lambda n^2 \epsilon^2}\big)$. However, as in the non-strongly convex case, the corresponding excess population risk obtained via uniform convergence is sub-optimal. For this case, Feldman et al. (2020) proposed an algorithm which achieves optimal rates in linear time for smooth functions, but in super-linear time for non-smooth functions.
4 Algorithm and Utility Analysis
In this section, we describe the proposed algorithm and provide convergence guarantees.
Recall that we are given a set of $n$ samples $S = \{z_1, \ldots, z_n\}$ drawn i.i.d. from $\mathcal{D}$. We consider an iterative algorithm wherein, at time $t$, we sample an index uniformly at random from the set of indices $[n] = \{1, \ldots, n\}$. Note that this index is a random variable; we denote its realization by $i_t$. Through the run of the algorithm, we maintain the set $I_t$ of indices observed until time $t$, i.e., $I_t = \{i_1, \ldots, i_t\}$.
If $i_t \notin I_{t-1}$, i.e., $z_{i_t}$ is a fresh sample that has not been seen and processed prior to iteration $t$, then we proceed with a projected gradient descent step using the noisy gradient $\nabla f(w_t, z_{i_t}) + \xi_t$, where $\xi_t \sim \mathcal{N}(0, \sigma^2 I_d)$. If $i_t \in I_{t-1}$, then the algorithm simply perturbs the current iterate $w_t$ by the noise $\xi_t$ and projects back onto the set $\mathcal{W}$. We remark that the noise-only step is important for the privacy analysis, as it allows for privacy amplification by sub-sampling.
The algorithm terminates when half of the points in the training set have been processed at least once, i.e., when the size of $I_t$ is greater than or equal to $n/2$. We denote this stopping time by $\tau$. We refer the reader to Algorithm 1 for the pseudo-code.
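Since the pseudo-code did not survive extraction here, the following Python sketch reconstructs the loop just described from the surrounding text; it is a minimal sketch, assuming the $\ell_2$ geometry of our analysis (Euclidean projection), with the noise scale `sigma` and step size `eta` left as inputs – they are calibrated in the utility and privacy analyses below.

```python
import numpy as np

def project_l2_ball(w, D):
    """Euclidean projection onto the l2 ball of radius D (our choice of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= D else (D / norm) * w

def noisy_md(samples, grad_f, w0, eta, sigma, D, rng=np.random.default_rng()):
    """Sketch of Algorithm 1: noisy projected SGD with noise-only repeat steps.

    Runs until at least half of the n samples have been seen (stopping time tau)
    and returns the averaged iterate. grad_f(w, z) is a sub-gradient oracle.
    """
    n = len(samples)
    w = np.array(w0, dtype=float)
    seen, iterates = set(), []
    while len(seen) < n / 2:
        i = int(rng.integers(n))                  # i_t ~ Uniform([n])
        xi = rng.normal(scale=sigma, size=w.shape)
        if i not in seen:                         # fresh sample: gradient step
            w = project_l2_ball(w - eta * (grad_f(w, samples[i]) + xi), D)
            seen.add(i)
        else:                                     # repeat sample: noise-only step
            w = project_l2_ball(w - eta * xi, D)
        iterates.append(w)
    return np.mean(iterates, axis=0)              # averaged iterate, the output
```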
Next, we establish the utility guarantee for Algorithm 1. We first show that $\tau$ is finite almost surely and bounded by $O(n)$ with probability $1 - \gamma$. The proof follows a standard balls-and-bins argument and is deferred to Appendix A.
Theorem 2.
For any $\gamma \in (0, 1)$ satisfying $\log(1/\gamma) \leq cn$ for a small enough absolute constant $c > 0$, with probability at least $1 - \gamma$ it holds that $\tau \leq 4n$, which implies $\mathbb{E}[\tau] = O(n)$.
We proceed by bounding the regret of Algorithm 1. Given a sequence of convex functions $f_1, \ldots, f_T$, where $f_t : \mathcal{W} \to \mathbb{R}$, we bound the regret
$$R_T = \mathbb{E}\Big[\sum_{t=1}^{T} f_t(w_t)\Big] - \min_{w \in \mathcal{W}} \mathbb{E}\Big[\sum_{t=1}^{T} f_t(w)\Big],$$
where the expectation is with respect to any randomness in the algorithm and the randomness of the $f_t$'s. We assume full access to the gradient of each $f_t$ and that the diameter of $\mathcal{W}$ is bounded by $D$, i.e., $\sup_{w, w' \in \mathcal{W}} \|w - w'\| \leq D$.
Our setup fits the popular framework of the Online Stochastic Mirror Descent (OSMD) algorithm wherein, given a strictly convex potential $\Phi$ and stochastic gradients $\tilde{g}_t$, the updates are given as
$$\tilde{w}_{t+1} = \nabla \Phi^{*}\big( \nabla \Phi(w_t) - \eta\, \tilde{g}_t \big), \qquad w_{t+1} = \operatorname*{argmin}_{w \in \mathcal{W}} D_{\Phi}(w, \tilde{w}_{t+1}). \qquad (2)$$
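For the Euclidean potential used in our analysis, the two steps in (2) collapse to the familiar noisy projected gradient descent update; this specialization is a standard observation:
$$\Phi(w) = \tfrac{1}{2}\|w\|_2^2 \quad \Longrightarrow \quad w_{t+1} = \Pi_{\mathcal{W}}\big( w_t - \eta\, \tilde{g}_t \big), \qquad \Pi_{\mathcal{W}}(u) = \operatorname*{argmin}_{w \in \mathcal{W}} \|w - u\|_2,$$
which is exactly the update performed by Algorithm 1, with $\tilde{g}_t$ the noisy (or noise-only) gradient.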
We utilize the following result.
Theorem 3 (Theorem 5.5 of Bubeck and Cesa-Bianchi (2012)).
Let $f_1, \ldots, f_T$ be a sequence of convex functions from $\mathcal{W}$ to $\mathbb{R}$ and let $\Phi$ be $\lambda$-strongly convex with respect to $\|\cdot\|$. Suppose $\tilde{g}_t$ is an unbiased estimator of a sub-gradient $g_t \in \partial f_t(w_t)$, i.e., $\mathbb{E}[\tilde{g}_t \mid w_t] = g_t$. If one initializes OSMD (2) with $w_1 = \operatorname*{argmin}_{w \in \mathcal{W}} \Phi(w)$, then the expected pseudo-regret is bounded as
$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(w_t)\Big] - \min_{w \in \mathcal{W}} \mathbb{E}\Big[\sum_{t=1}^{T} f_t(w)\Big] \leq \frac{\sup_{w \in \mathcal{W}} D_{\Phi}(w, w_1)}{\eta} + \frac{\eta}{2\lambda} \sum_{t=1}^{T} \mathbb{E}\big[\|\tilde{g}_t\|_{*}^{2}\big].$$
For a fixed horizon $T$, we can view Algorithm 1 as an instance of OSMD with potential $\Phi(w) = \frac{1}{2}\|w\|_2^2$, and with $f_t$ and $\tilde{g}_t$ defined as
$$f_t(w) = \frac{1}{n} \sum_{i \notin I_{t-1}} f(w, z_i), \qquad \tilde{g}_t = \mathbb{1}\{i_t \notin I_{t-1}\}\, \nabla f(w_t, z_{i_t}) + \xi_t,$$
so that $\mathbb{E}[\tilde{g}_t \mid w_t, I_{t-1}] \in \partial f_t(w_t)$. Theorem 1 implies that $\Phi^{*}$ is $\frac{1}{\lambda}$-strongly smooth, where $\lambda$ is the strong-convexity parameter of the potential (in this case $\lambda = 1$) – using this result with Theorem 3, and taking expectation with respect to the randomness in $i_t$ and $\xi_t$, we have that for any $T$
$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(w_t)\Big] - \min_{w \in \mathcal{W}} \mathbb{E}\Big[\sum_{t=1}^{T} f_t(w)\Big] \leq \frac{D^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \mathbb{E}\big[\|\tilde{g}_t\|_2^2\big]. \qquad (3)$$
Since $\mathbb{E}\big[\|\tilde{g}_t\|_2^2\big] \leq L^2 + \sigma^2 d$, we have the following corollary.
Corollary 1.
Suppose the diameter of $\mathcal{W}$ is bounded by $D$ and that $f(\cdot, z)$ is an $L$-Lipschitz function for all $z \in \mathcal{Z}$. Running Algorithm 1 for $T$ iterations with noise sequence $\xi_1, \ldots, \xi_T \sim \mathcal{N}(0, \sigma^2 I_d)$ and step size $\eta = \frac{D}{\sqrt{(L^2 + \sigma^2 d)\,T}}$ guarantees that
$$\mathbb{E}\Big[\sum_{t=1}^{T} f_t(w_t)\Big] - \min_{w \in \mathcal{W}} \mathbb{E}\Big[\sum_{t=1}^{T} f_t(w)\Big] \leq D \sqrt{(L^2 + \sigma^2 d)\,T}.$$
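The step size in the corollary is simply the choice that balances the two terms of (3) under the bound $\mathbb{E}\|\tilde{g}_t\|_2^2 \leq L^2 + \sigma^2 d$:
$$\min_{\eta > 0} \Big\{ \frac{D^2}{2\eta} + \frac{\eta T}{2}\big( L^2 + \sigma^2 d \big) \Big\} \quad \text{is attained at} \quad \eta = \frac{D}{\sqrt{(L^2 + \sigma^2 d)\,T}}, \quad \text{with value} \quad D\sqrt{(L^2 + \sigma^2 d)\,T}.$$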
The utility guarantee of Algorithm 1, Theorem 4 below, follows by combining the corollary above with the following lemma.
Lemma 1.
For the iterates of Algorithm 1 it holds that
$$\mathbb{E}[F(\bar{w})] - \min_{w \in \mathcal{W}} F(w) \leq \frac{4}{n}\, \mathbb{E}\Big[\sum_{t=1}^{\tau} \big( f_t(w_t) - f_t(w^{\star}) \big)\Big],$$
where $\bar{w} = \frac{1}{\tau} \sum_{t=1}^{\tau} w_t$ is the output of the algorithm and $w^{\star} \in \operatorname*{argmin}_{w \in \mathcal{W}} F(w)$.
Proof.
Let $U_t$ be the random variable measuring the size of $I_t$. It then holds that $\tau$ is a stopping time for the process $(U_t)_{t \geq 1}$, i.e., $\tau = \min\{t : U_t \geq n/2\}$. Further, denote by $i_{1:t}$ the vector of indices $(i_1, \ldots, i_t)$, and let $\mathbb{1}\{\cdot\}$ denote the indicator of the input event. Let us focus on the term $\mathbb{E}\big[\sum_{t=1}^{\tau} (f_t(w_t) - f_t(w^{\star}))\big]$. We now show that $\mathbb{E}\big[(f_t(w_t) - f_t(w^{\star}))\,\mathbb{1}\{t \leq \tau\}\big] \geq \frac{1}{2}\,\mathbb{E}\big[(F(w_t) - F(w^{\star}))\,\mathbb{1}\{t \leq \tau\}\big]$, which essentially follows using the tower rule of expectation. We have
$$\mathbb{E}\big[(f_t(w_t) - f_t(w^{\star}))\,\mathbb{1}\{t \leq \tau\}\big] = \mathbb{E}\Big[\mathbb{1}\{t \leq \tau\}\;\mathbb{E}\big[f_t(w_t) - f_t(w^{\star}) \,\big|\, i_{1:t-1}, \xi_{1:t-1}\big]\Big] = \mathbb{E}\Big[\mathbb{1}\{t \leq \tau\}\,\frac{n - |I_{t-1}|}{n}\,\big( F(w_t) - F(w^{\star}) \big)\Big] \geq \frac{1}{2}\,\mathbb{E}\big[\mathbb{1}\{t \leq \tau\}\,\big( F(w_t) - F(w^{\star}) \big)\big],$$
where in the first equality we condition on $i_{1:t-1}$ and the noise $\xi_{1:t-1}$, noting that the event $\{t \leq \tau\}$ is determined by $i_{1:t-1}$. In the second equality, we use the fact that, conditionally, the iterate $w_t$ is fixed while the samples $z_i$ with $i \notin I_{t-1}$ are independent of it, so taking the expectation with respect to the data-generating distribution $\mathcal{D}$ yields $\mathbb{E}\big[f_t(w) \mid i_{1:t-1}, \xi_{1:t-1}\big] = \frac{n - |I_{t-1}|}{n} F(w)$ for $w \in \{w_t, w^{\star}\}$. The final inequality holds because $|I_{t-1}| < n/2$ for $t \leq \tau$ and, by optimality of $w^{\star}$, $F(w_t) - F(w^{\star}) \geq 0$. Summing over $t$, using convexity of $F$ to bound $F(\bar{w}) \leq \frac{1}{\tau} \sum_{t=1}^{\tau} F(w_t)$, and using $\tau \geq n/2$ (each iteration adds at most one index to $I_t$) proves the claimed inequality. Using Corollary 1 we get
$$\mathbb{E}[F(\bar{w})] - \min_{w \in \mathcal{W}} F(w) \leq \frac{4}{n}\Big( \frac{D^2}{2\eta} + \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^{\tau} \|\tilde{g}_t\|_2^2\Big] \Big) \leq \frac{4}{n}\Big( \frac{D^2}{2\eta} + \frac{\eta}{2}\,\mathbb{E}[\tau]\,(L^2 + \sigma^2 d) \Big),$$
where in the second inequality we have used Wald's lemma to guarantee $\mathbb{E}\big[\sum_{t=1}^{\tau} \|\tilde{g}_t\|_2^2\big] \leq \mathbb{E}[\tau]\,(L^2 + \sigma^2 d)$, and $\mathbb{E}[\tau] = O(n)$ by Theorem 2.
∎
Theorem 4.
Suppose the elements of $\mathcal{W}$ are bounded in norm by $D$ and that $f(\cdot, z)$ is an $L$-Lipschitz function for all $z \in \mathcal{Z}$. Running Algorithm 1 until its stopping time $\tau$ with noise sequence $\xi_t \sim \mathcal{N}(0, \sigma^2 I_d)$ and step size $\eta = \frac{D}{\sqrt{(L^2 + \sigma^2 d)\,n}}$ guarantees that
$$\mathbb{E}[F(\bar{w})] - \min_{w \in \mathcal{W}} F(w) = O\Big( D\,\sqrt{\frac{L^2 + \sigma^2 d}{n}} \Big).$$
5 Privacy Proof
In this section, we establish the differential privacy of Algorithm 1. The privacy proof essentially follows the analysis of noisy SGD in Bassily et al. (2014), but is stated in full detail for completeness and to provide a self-contained presentation. The basic idea is as follows. We view Algorithm 1 as a variant of noisy stochastic gradient descent on the ERM problem over the samples $z_1, \ldots, z_n$. Indeed, were it not for the steps in which we update the iterate only with noise and do not compute a gradient, the proposed algorithm would be exactly equivalent to noisy SGD. Prior work shows that using noisy SGD to solve the ERM problem enjoys stronger differential privacy guarantees due to privacy amplification by subsampling. Intuitively, Algorithm 1 should also benefit from privacy amplification, and the algorithm should not suffer additional privacy loss in steps where we use the zero gradient. We formalize this intuition using the standard analysis for Rényi differential privacy (Wang et al., 2019) and properties of the Rényi divergence (Mironov, 2017).
We first introduce additional notation. Let $h_t$ be a function which describes the $t$-th iteration of the algorithm – it takes as input an iterate $w$, a set $I$ of indices of data points, and a sub-sampled set $B$ of data points; it outputs an iterate and a set of indices. We assume that $I$ is totally ordered according to the indices of the datapoints. Further, let $\mathcal{I}(B)$ denote the set of indices of the datapoints in $B$. Let $h_t^1$ and $h_t^2$ denote the first and second outputs of $h_t$, i.e., $h_t^1(w, I, B) = w'$ and $h_t^2(w, I, B) = I'$. Note that in the algorithm the $h_t$'s are composed – we initialize $w_1$ and $I_0 = \emptyset$ is fixed; therefore, $(w_2, I_1) = h_1(w_1, I_0, B_1)$, and $h_2$ acts on the output of $h_1$ as $(w_3, I_2) = h_2\big( h_1^1(w_1, I_0, B_1),\, h_1^2(w_1, I_0, B_1),\, B_2 \big)$. In general, we have that
$$(w_{t+1}, I_t) = h_t\big( h_{t-1}^1(w_{t-1}, I_{t-2}, B_{t-1}),\; h_{t-1}^2(w_{t-1}, I_{t-2}, B_{t-1}),\; B_t \big).$$
Formally, given $w$, $I$, and $B$, we define the random variable $h_t(w, I, B)$ as
$$h_t(w, I, B) = \Big( \Pi_{\mathcal{W}}\big( w - \eta\,( g(w, I, B) + \xi_t ) \big),\; I \cup \mathcal{I}(B) \Big), \qquad \xi_t \sim \mathcal{N}(0, \sigma^2 I_d),$$
where $g$ is the following gradient operator:
$$g(w, I, B) = \frac{1}{|B|} \sum_{i \in \mathcal{I}(B)\,:\, i \notin I} \nabla f(w, z_i).$$
In Algorithm 1, $|B| = 1$; in general, $|B|$ is just the size of the set sub-sampled from $[n]$, which we use to construct a mini-batched gradient.
The input space of each $h_t$ is equipped with the product measure $\mu_t \otimes \rho$, where $\rho$ is the sub-sampling measure, which sub-samples $|B|$ out of $n$ data points uniformly at random, and $\mu_t$ is the law of the composed inputs $(w_t, I_{t-1})$. We now construct a vector concatenating the first outputs of all these $h_t$'s. Let
$$H(S) = \big( h_1^1(w_1, I_0, B_1),\; h_2^1(w_2, I_1, B_2),\; \ldots,\; h_T^1(w_T, I_{T-1}, B_T) \big).$$
We claim that $H$ is differentially private.
Theorem 5.
Suppose we run Algorithm 1 with noise sampled from $\mathcal{N}(0, \sigma^2 I_d)$, where $\sigma$ is calibrated so that a single noisy gradient step (without sub-sampling) is $(\epsilon_0, \delta_0)$-DP with $\epsilon_0 \leq 1$. For a fixed number of iterations $T$ and a fixed $\delta \in (0, 1)$, $H$ is $\big( \epsilon,\, T\delta_0/n + \delta \big)$-DP with $\epsilon = O\big( \frac{\epsilon_0}{n} \sqrt{T \log(1/\delta)} \big)$.
To prove the above theorem, we first show that the first coordinate of each $h_t$ is sufficiently differentially private.
Lemma 2.
Let $q = |B|/n$ denote the sub-sampling rate. For any $t$ and any fixed $w$ and $I$, the mechanism $h_t^1(w, I, \cdot)$ is $\big( \log(1 + q(e^{\epsilon_0} - 1)),\; q\delta_0 \big)$-DP.
Proof.
Consider the two differing datasets $S$ and $S'$ and let the index at which they differ be $j$. Denote the random subsamples from $S$ and $S'$ by $B$ and $B'$ respectively. Under the uniform random sampling we have $\mathbb{P}(j \in B) = \mathbb{P}(j \in B') = q$; this holds because the sampling of indices depends only on the sizes of $S$ and $S'$. The proof follows the standard privacy-amplification-by-sub-sampling argument (see, e.g., https://www.ccs.neu.edu/home/jullman/cs7880s17/HW1sol.pdf). For fixed $w$ and $I$, and any event $E$ measurable with respect to the Lebesgue measure on $\mathcal{W}$, define the measures
$$\mu_0(E) = \mathbb{P}\big( h^1(w, I, B) \in E \mid j \notin B \big), \quad \mu_1(E) = \mathbb{P}\big( h^1(w, I, B) \in E \mid j \in B \big), \quad \mu_1'(E) = \mathbb{P}\big( h^1(w, I, B') \in E \mid j \in B' \big).$$
First we note that $\mathbb{P}\big( h^1(w, I, B) \in E \mid j \notin B \big) = \mathbb{P}\big( h^1(w, I, B') \in E \mid j \notin B' \big) = \mu_0(E)$, because the gradient descent step only touches points indexed by the subsample and the differing point does not belong to that set. We consider two cases: $j \notin B$ and $j \in B$. In the first case the conditional output distributions under $S$ and $S'$ coincide, and we have perfect $0$-DP. In the other case, we have
$$\mathbb{P}\big( h^1(w, I, B) \in E \big) = (1 - q)\,\mu_0(E) + q\,\mu_1(E) \leq (1 - q)\,\mu_0(E) + q\big( e^{\epsilon_0} \min\{\mu_0(E), \mu_1'(E)\} + \delta_0 \big) \leq \big( 1 + q(e^{\epsilon_0} - 1) \big)\big( (1 - q)\,\mu_0(E) + q\,\mu_1'(E) \big) + q\delta_0,$$
where in the first inequality we have used the following DP guarantee of $h^1$: for any subsampled sets $B$ and $B'$ it holds that the mechanism $h^1(w, I, \cdot)$ is $(\epsilon_0, \delta_0)$-DP, as it is a step of noisy projected gradient descent with gradients bounded in norm by $L$ and Gaussian noise with sufficient variance. We use the DP guarantee twice – once on the neighboring sets $B$ and $B'$ to get the inequality $\mu_1(E) \leq e^{\epsilon_0}\mu_1'(E) + \delta_0$, and once on the neighboring sets $B$ and $B \setminus \{j\}$ to get the inequality $\mu_1(E) \leq e^{\epsilon_0}\mu_0(E) + \delta_0$. In the second application, note that $\mu_0$ contains events in which a previously seen point is sampled; however, the noise-only step ensures that the output is then also a Gaussian with the same variance. In the second inequality we use the fact that a convex combination of two numbers is at least their minimum, and the claimed privacy parameter follows from the standard relation $\log(1 + q(e^{\epsilon_0} - 1)) \leq q(e^{\epsilon_0} - 1)$. Combining the two cases finishes the proof. ∎
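To see the strength of the amplification, note that the lemma's guarantee $\epsilon' = \log(1 + q(e^{\epsilon_0} - 1)) \approx q(e^{\epsilon_0} - 1)$ for small $q$, i.e., roughly a factor-$q$ reduction of the per-step privacy loss. The snippet below is purely illustrative.

```python
import math

def amplified_eps(eps0, q):
    """Privacy amplification by sub-sampling: eps' = log(1 + q * (e^eps0 - 1))."""
    return math.log(1 + q * (math.exp(eps0) - 1))

n = 10_000
print(amplified_eps(eps0=1.0, q=1 / n))  # ~1.7e-4, roughly (e - 1) * eps0 / n
```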
We can now prove Theorem 5.
Proof of Theorem 5.
The proof essentially follows the proof of Theorem 3.3 in Dwork et al. (2010). Let $\epsilon' = \log(1 + q(e^{\epsilon_0} - 1))$ and $\delta' = q\delta_0$ be the per-step parameters given by Lemma 2, and let $\epsilon^{*} = 2T\epsilon'^2 + \epsilon'\sqrt{2T\log(1/\delta)}$ be the target privacy parameter of the composition. Condition on the event that throughout the $T$ rounds, $\epsilon'$-DP holds for all of the mechanisms $h_t^1$; this event fails with probability at most $T\delta'$ and will be accounted for in the final DP bound. Define the set of outputs on which $\epsilon^{*}$-DP does not hold as
$$\mathcal{B} = \Big\{ v = (v_1, \ldots, v_T) : \log \frac{\mathbb{P}(H(S) = v)}{\mathbb{P}(H(S') = v)} > \epsilon^{*} \Big\}.$$
The goal is to show that $\mathbb{P}\big( H(S) \in \mathcal{B} \big) \leq \delta$. Let $V = (V_1, \ldots, V_T)$, where $V_t$ is the random variable taking values $h_t^1(w_t, I_{t-1}, B_t)$ on input $S$, and let $V' = (V_1', \ldots, V_T')$ denote the analogous sequence on input $S'$. Further let $v = (v_1, \ldots, v_T)$ be a fixed realization of $V$. We have
$$\log \frac{\mathbb{P}(V = v)}{\mathbb{P}(V' = v)} = \sum_{t=1}^{T} \log \frac{\mathbb{P}\big( V_t = v_t \mid V_{1:t-1} = v_{1:t-1} \big)}{\mathbb{P}\big( V_t' = v_t \mid V_{1:t-1}' = v_{1:t-1} \big)}.$$
Consider the conditional probability $\mathbb{P}(V_t = v_t \mid V_{1:t-1} = v_{1:t-1})$. After conditioning on all the randomness so far in the algorithm, the event is just the event that $h_t^1(w, I, B_t) = v_t$, with $w$ and $I$ fixed. Lemma 2 now implies that
$$\log \frac{\mathbb{P}\big( V_t = v_t \mid V_{1:t-1} = v_{1:t-1} \big)}{\mathbb{P}\big( V_t' = v_t \mid V_{1:t-1}' = v_{1:t-1} \big)} \leq \epsilon';$$
recall we conditioned on the event that each of the mechanisms is $\epsilon'$-DP. Define the random variable $Z_t$ which takes values
$$Z_t = \log \frac{\mathbb{P}\big( V_t = v_t \mid V_{1:t-1} = v_{1:t-1} \big)}{\mathbb{P}\big( V_t' = v_t \mid V_{1:t-1}' = v_{1:t-1} \big)} \Bigg|_{v = V}.$$
We have just shown that $Z_t \leq \epsilon'$ for any $t$. By switching $S$ with $S'$ we can show in the exact same way that $Z_t \geq -\epsilon'$, which implies that the random variable $Z_t$ is a.s. bounded by $|Z_t| \leq \epsilon'$. Further, using the fact that $\epsilon'$-DP implies boundedness of the KL-divergence between the conditional output distributions, Lemma 3.2 in Dwork et al. (2010) implies that the KL-divergence between the conditional laws of $V_t$ and $V_t'$ is bounded as $\mathbb{E}[Z_t \mid Z_{1:t-1}] \leq \epsilon'(e^{\epsilon'} - 1)$. Together with the definition of $Z_t$ this implies
$$\mathbb{E}\Big[ \sum_{t=1}^{T} Z_t \Big] \leq T\,\epsilon'(e^{\epsilon'} - 1).$$
Because the filtration generated by $V_{1:t}$ includes the one generated by $Z_{1:t}$, we can now apply Azuma's inequality to the martingale difference sequence $Z_t - \mathbb{E}[Z_t \mid Z_{1:t-1}]$, which we have just shown is bounded (the KL term is bounded by $\epsilon'(e^{\epsilon'} - 1)$ and the $Z_t$'s are bounded by $\epsilon'$), and get
$$\mathbb{P}\Big( \sum_{t=1}^{T} Z_t > T\,\epsilon'(e^{\epsilon'} - 1) + \epsilon'\sqrt{2T\log(1/\delta)} \Big) \leq \delta.$$
Because for $\epsilon' \leq 1$ it holds that $e^{\epsilon'} - 1 \leq 2\epsilon'$, we have $T\,\epsilon'(e^{\epsilon'} - 1) \leq 2T\epsilon'^2$. This in hand implies $\mathbb{P}(H(S) \in \mathcal{B}) \leq \delta$, which shows that for a fixed $T$, Algorithm 1 is $\big( 2T\epsilon'^2 + \epsilon'\sqrt{2T\log(1/\delta)},\; T\delta' + \delta \big)$-DP. ∎
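Putting the pieces together numerically: with per-step $\epsilon'$ after amplification at rate $q = 1/n$ and $T = O(n)$ composed steps, the final guarantee derived in the proof above can be evaluated directly. The snippet is illustrative only and uses the constants from our reconstruction of the bound.

```python
import math

def composed_eps(eps0, n, T, delta):
    """Total eps after sub-sampling at rate 1/n and T-fold composition:
    eps_total = 2*T*eps'^2 + eps' * sqrt(2*T*log(1/delta))."""
    eps_prime = math.log(1 + (math.exp(eps0) - 1) / n)  # Lemma 2
    return 2 * T * eps_prime**2 + eps_prime * math.sqrt(2 * T * math.log(1 / delta))

n = 10_000
print(composed_eps(eps0=1.0, n=n, T=4 * n, delta=1e-6))  # ~0.18: sub-constant eps
```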
Combining the utility and privacy guarantees yields the main result, stated below.
Theorem 6.
Let $f(\cdot, z)$ be a convex, $L$-Lipschitz function in its first argument for all $z \in \mathcal{Z}$. Furthermore, assume that the diameter of $\mathcal{W}$ is bounded by $D$. For any $\delta \in (0, 1)$ and any $\epsilon = O\big(\sqrt{\log(1/\delta)/n}\big)$, given $n$ samples, Algorithm 1, run with step size $\eta$ and noise scale $\sigma$ as in Remark 1, outputs $\bar{w}$ which satisfies $(\epsilon, \delta)$-differential privacy, and its accuracy is bounded as
$$\mathbb{E}[F(\bar{w})] - \min_{w \in \mathcal{W}} F(w) = \tilde{O}\Big( LD\Big( \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon} \Big) \Big).$$
Proof.
The privacy guarantee follows from Theorem 5 instantiated with the parameters of Remark 1, and the utility guarantee follows by substituting the resulting $\sigma$ into Theorem 4. ∎
Remark 1.
To get $(\epsilon, \delta)$-DP for any $\epsilon = O\big(\sqrt{\log(1/\delta)/n}\big)$, we can set $\sigma = \Theta\big( \frac{L \log(1/\delta)}{\epsilon \sqrt{n}} \big)$ and $\eta = \frac{D}{\sqrt{(L^2 + \sigma^2 d)\,n}}$. Using the condition on $\epsilon$, we get that the per-step privacy parameter $\epsilon_0$ required by Lemma 2 is $O(1)$. The utility guarantee becomes $\tilde{O}\big( LD\big( \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon} \big) \big)$.
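The snippet below instantiates these parameter choices for a concrete problem size; the constants are indicative only, since Remark 1 sets $\sigma$ and $\eta$ up to unspecified constant factors.

```python
import math

def algorithm1_params(n, d, L, D, eps, delta):
    """Illustrative noise scale, step size, and target risk (constants omitted)."""
    sigma = L * math.log(1 / delta) / (eps * math.sqrt(n))          # noise std
    eta = D / math.sqrt((L**2 + sigma**2 * d) * n)                  # step size
    risk = L * D * (1 / math.sqrt(n) + math.sqrt(d) / (n * eps))    # target rate
    return sigma, eta, risk

print(algorithm1_params(n=100_000, d=100, L=1.0, D=1.0, eps=3e-3, delta=1e-6))
```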
6 Other Approaches
We now briefly discuss a few other approaches which would also recover a convergence rate of $\tilde{O}\big( LD\big( \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon} \big) \big)$ in the regime $\epsilon = O(1/\sqrt{n})$. The first approach uses the standard decomposition of the excess population risk into the generalization gap plus the excess empirical risk. As pointed out in Bassily et al. (2019) and Feldman et al. (2020), it is possible to use the optimal rates for ERM from Bassily et al. (2014), together with the generalization properties of DP from Dwork et al. (2015) to bound the generalization gap, to guarantee a rate of $\tilde{O}\big( LD\big( \epsilon + \frac{\sqrt{d}}{n\epsilon} \big) \big)$, which evaluates to $\tilde{O}\big( LD\big( \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon} \big) \big)$ in the regime $\epsilon = O(1/\sqrt{n})$. Even though the runtime of noisy SGD for ERM, as stated in Bassily et al. (2014), is $O(n^2)$, it can be shown that their result holds with only $O(n^2\epsilon^2)$ stochastic gradient computations – therefore, in the regime $\epsilon = O(1/\sqrt{n})$, it is a linear-time algorithm.
Second, it may be possible to use amplification by sub-sampling in Algorithm 1 of Feldman et al. (2020) instead of amplification by iteration. This would remove the smoothness requirement on $f$ and would perhaps yield the same result as Theorem 6. We note that our Algorithm 1 is much simpler and does not require the complicated mini-batch schedule of Algorithm 1 of Feldman et al. (2020).
7 Conclusion
In this paper, we studied the problem of private stochastic convex optimization for non-smooth objectives. We proposed a noisy version of the OSMD algorithm and presented convergence and privacy guarantees for the $\ell_2$ geometry. Algorithm 1 achieves statistically optimal rates, while running in linear time, and is differentially private as long as the DP parameter $\epsilon$ is sufficiently small. Algorithm 1 can easily be extended to geometries induced by arbitrary strictly convex potentials $\Phi$. We leave it as future work to explore what kind of privacy guarantees can be retained if the privacy mechanism is tailored to the geometry induced by $\Phi$. Finally, we think it should be possible to extend our analysis to the case where $f(\cdot, z)$ is strongly convex for all $z$ and achieve optimal statistical rates in linear time.
References
- Agarwal et al. (2009) Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.
- Bassily et al. (2014) Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
- Bassily et al. (2019) Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288, 2019.
- Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
- Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. JMLR, 2011.
- Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
- Dwork et al. (2010) Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and Differential Privacy. In FOCS, pages 51–60, 2010.
- Dwork et al. (2015) Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117–126. ACM, 2015.
- Feldman et al. (2018) Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. Privacy amplification by iteration. 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 521–532, 2018.
- Feldman et al. (2020) Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
- Kakade et al. (2009) Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished Manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2009.
- Mironov (2017) Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
- Narayanan and Shmatikov (2008) Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE, 2008.
- Sweeney (1997) Latanya Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.
- Vapnik (2013) Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
- Wang et al. (2019) Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1226–1235. PMLR, 2019.
Appendix A Proof of Theorem 2
Proof.
We begin by fixing a time horizon $T$ and showing that $U_T \geq n/2$ with high probability for an appropriately chosen $T$, where $U_T$ is the function which counts the unique draws among $i_1, \ldots, i_T$. Further, let $X_j$ be the indicator random variable of the event that the $j$-th datapoint is drawn at least once in the $T$ rounds, i.e., $X_j = \mathbb{1}\{j \in I_T\}$, so that $U_T = \sum_{j=1}^{n} X_j$. We can write the expectation of $U_T$ as
$$\mathbb{E}[U_T] = \sum_{j=1}^{n} \mathbb{P}(j \in I_T) = n\Big( 1 - \big( 1 - \tfrac{1}{n} \big)^{T} \Big).$$
Setting $T = n$ already shows that the expected number of unique elements after $T$ rounds is at least $n(1 - e^{-1}) \geq n/2$, and thus the stopping condition is reached in expectation. Next, we show concentration around the expectation of $U_T$. Note that if a single element $i_s$ is changed, the value of $U_T$ will not change by more than $1$, i.e.,
$$\big| U_T(i_1, \ldots, i_s, \ldots, i_T) - U_T(i_1, \ldots, i_s', \ldots, i_T) \big| \leq 1.$$
This allows us to use McDiarmid's inequality to claim that
$$\mathbb{P}\big( U_T \leq \mathbb{E}[U_T] - s \big) \leq \exp\big( -2s^2/T \big).$$
This implies that with probability at least $1 - \gamma$ we have
$$U_T \geq n\Big( 1 - \big( 1 - \tfrac{1}{n} \big)^{T} \Big) - \sqrt{\tfrac{T \log(1/\gamma)}{2}}.$$
Setting $s = \sqrt{T \log(1/\gamma)/2}$ and $T = 4n$, we have that for any $\gamma$ with $\log(1/\gamma) \leq cn$, for a small enough absolute constant $c > 0$, it holds that $U_T \geq n/2$, and hence $\tau \leq 4n$, with probability at least $1 - \gamma$.
To get the bound in expectation we note that $\mathbb{E}[\tau] = \sum_{T \geq 0} \mathbb{P}(\tau > T)$. On the other hand, we know from the above derivation (applied to each successive block of $4n$ fresh draws, whose unique elements alone already exceed $n/2$ with probability at least $1 - e^{-cn}$) that $\mathbb{P}(\tau > 4nk) \leq e^{-ckn}$ for every integer $k \geq 1$. This implies
$$\mathbb{E}[\tau] \leq 4n \sum_{k \geq 0} \mathbb{P}(\tau > 4nk) \leq 4n \sum_{k \geq 0} e^{-ckn} = O(n).$$
∎
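As a quick empirical sanity check of Theorem 2, one can simulate the stopping time directly. In fact, the expected number of draws needed to see $n/2$ distinct indices is $n(H_n - H_{n/2}) \approx n\ln 2 \approx 0.69\,n$ by the standard coupon-collector calculation, comfortably within the $O(n)$ bound; the snippet below is illustrative only.

```python
import numpy as np

def stopping_time(n, rng):
    """Number of uniform draws from [n] until n/2 distinct indices are seen."""
    seen, t = set(), 0
    while len(seen) < n / 2:
        seen.add(int(rng.integers(n)))
        t += 1
    return t

rng = np.random.default_rng(0)
n = 1000
taus = [stopping_time(n, rng) for _ in range(200)]
print(np.mean(taus) / n)  # ~0.69, matching n * ln(2)
```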