Convergence of PrivateGD in Strongly Convex Problems
Abstract
This note describes the convergence of noisy gradient descent in strongly convex problems, with references to the machine learning / optimization literature. The specific considerations we focus on in this note are what these existing results imply in the context of differentially private learning, and how they help us obtain good hyperparameter choices.
1 Notation and problem setup
General convex ERM: We consider the problem of
\[
\min_{\theta \in \mathcal{K}} \; F(\theta) := \sum_{i=1}^{n} \ell(\theta; z_i),
\]
where each $\ell(\cdot; z_i)$ is assumed to be convex. Note that the objective function is the sum, rather than the average, of the loss functions over the dataset $D = \{z_1, \dots, z_n\}$.
Sometimes we also consider the regularized objective $F_\lambda(\theta) := F(\theta) + \frac{\lambda}{2}\|\theta - \theta_{\mathrm{ref}}\|^2$ to make it strongly convex. The subscript $\lambda$ and the reference point $\theta_{\mathrm{ref}}$ are often omitted to avoid clutter.
Besides convexity, we also make the following assumptions: for every data point $z$, $\ell(\cdot; z)$ is convex, $L$-Lipschitz and $\beta$-smooth in its first argument; the regularized objective $F_\lambda$ is $(n\beta + \lambda)$-smooth and $\lambda$-strongly convex. Note that $F$ itself is then $nL$-Lipschitz and $n\beta$-smooth.
Specifically, for the purpose of defining differential privacy, $\ell(\cdot; z)$ belongs to a universe of loss functions induced by the universe of data points. The assumptions stated above are required to hold for every loss in this universe, so that any neighboring dataset obtained by adding or removing individual data points satisfies the same assumptions.
Generalized linear models (GLM): In some cases, we consider the restriction to the family of generalized linear models, where the loss function is $\ell(\theta; (x, y)) = f(x^\top \theta; y)$ for some convex link function $f$. In this case, $\nabla_\theta \ell(\theta; (x, y)) = f'(x^\top \theta; y)\, x$, i.e., the gradient is always parallel to the feature vector $x$.
Algorithm: The algorithm we consider is Noisy Gradient Descent (NoisyGD), which updates the parameter vector iteratively using
\[
\theta_{t+1} = \Pi_{\mathcal{K}}\Big( \theta_t - \eta_t \big( \nabla F(\theta_t) + \mathcal{N}(0, \sigma^2 I_d) \big) \Big),
\]
where $\mathcal{K} \subseteq \mathbb{R}^d$ is a convex feasible region and $\Pi_{\mathcal{K}}$ is the Euclidean projection onto this set. When we consider unconstrained problems, $\mathcal{K} = \mathbb{R}^d$ and $\Pi_{\mathcal{K}}$ is the identity map.
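To make the update concrete, the following is a minimal NumPy sketch of NoisyGD with an (optional) Euclidean-ball feasible region; the gradient oracle grad_F, the radius R, and all hyperparameter values are illustrative placeholders rather than choices prescribed by the analysis below.

import numpy as np

def noisy_gd(grad_F, d, T, eta, sigma, R=None, theta0=None, rng=None):
    # NoisyGD: theta <- Proj_K( theta - eta * (grad F(theta) + N(0, sigma^2 I_d)) ).
    # grad_F : callable returning the gradient of the *sum* objective F at theta
    # R      : radius of an l2-ball feasible region K (None means unconstrained, K = R^d)
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(d) if theta0 is None else np.array(theta0, dtype=float)
    iterates = []
    for _ in range(T):
        noisy_grad = grad_F(theta) + sigma * rng.standard_normal(d)
        theta = theta - eta * noisy_grad
        if R is not None:                    # Euclidean projection onto the l2-ball of radius R
            norm = np.linalg.norm(theta)
            if norm > R:
                theta = theta * (R / norm)
        iterates.append(theta.copy())
    return np.mean(iterates, axis=0)         # averaged iterate, as used in the bounds below

The bounds in Section 2 are stated for the averaged (or weighted-averaged) iterate, which is why the sketch returns the running average rather than the last iterate.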
2 Convergence guarantees
2.1 General convex case
Theorem 1 (Convex, smooth and Lipschitz).
Let the learning rate be $\eta \le \frac{1}{n\beta}$, assume $\theta^* \in \mathcal{K}$ is a minimizer of $F$ over $\mathcal{K}$, and let $\bar\theta_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ denote the averaged iterate. Then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{\eta\, d\, \sigma^2}{2}. \tag{1}
\]
In particular, if $\eta = \frac{1}{n\beta}$ and $\sigma$ is chosen such that the algorithm satisfies $\rho$-zCDP, i.e., $\sigma^2 = \frac{T L^2}{2\rho}$, then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \frac{n\beta \|\theta_0 - \theta^*\|^2}{2T} + \frac{T d L^2}{4 n \beta \rho}, \tag{2}
\]
and choosing $T \asymp n\beta\|\theta_0 - \theta^*\|\sqrt{2\rho/(dL^2)}$ to balance the two terms gives
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; L \|\theta_0 - \theta^*\| \sqrt{\frac{d}{2\rho}}. \tag{3}
\]
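As a sanity check on how Theorem 1 is used for hyperparameter selection, here is a small helper (assuming the calibration $\sigma^2 = TL^2/(2\rho)$ and the form of the bound above) that sets the step size, the number of iterations, and the noise scale from a target zCDP budget; the inputs L, beta, n, d, rho and the initialization distance D0 are assumed to be known or estimated.

import numpy as np

def calibrate_noisygd(L, beta, n, d, rho, D0):
    # Step size matched to the (n*beta)-smooth sum objective.
    eta = 1.0 / (n * beta)
    # Number of iterations that balances the two terms of the bound (2).
    T = max(1, int(round(n * beta * D0 * np.sqrt(2 * rho / (d * L**2)))))
    # Per-step Gaussian noise so that T steps compose to rho-zCDP (sensitivity L per step).
    sigma = L * np.sqrt(T / (2 * rho))
    bound = n * beta * D0**2 / (2 * T) + T * d * L**2 / (4 * n * beta * rho)
    return eta, T, sigma, bound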
Theorem 2 (Convex and Lipschitz).
Let the learning rate be $\eta = \frac{\|\theta_0 - \theta^*\|}{\sqrt{T (n^2 L^2 + d\sigma^2)}}$, and also assume $\theta^* \in \mathcal{K}$. Then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \|\theta_0 - \theta^*\| \sqrt{\frac{n^2 L^2 + d\sigma^2}{T}}. \tag{4}
\]
In particular, if $T \ge \frac{2 n^2 \rho}{d}$, and $\sigma$ is chosen such that the algorithm satisfies $\rho$-zCDP, i.e., $\sigma^2 = \frac{T L^2}{2\rho}$, then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \|\theta_0 - \theta^*\| \sqrt{\frac{n^2 L^2}{T} + \frac{d L^2}{2\rho}} \tag{5}
\]
\[
\;\le\; L \|\theta_0 - \theta^*\| \sqrt{\frac{d}{\rho}}. \tag{6}
\]
2.2 Strongly convex case
Theorem 3.
Let the learning rate be $\eta_t = \frac{2}{\lambda (t+1)}$, and let $\bar\theta_T = \frac{2}{T(T+1)}\sum_{t=1}^{T} t\, \theta_t$ be the weighted averaged iterate. Then
\[
\mathbb{E}\big[F_\lambda(\bar\theta_T)\big] - F_\lambda(\theta^*_\lambda) \;\le\; \frac{2\,(n^2 L^2 + d\sigma^2)}{\lambda (T+1)} \tag{7}
\]
\[
\;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + \frac{d L^2}{\lambda \rho}, \tag{8}
\]
where $\rho$ is the zCDP parameter of the algorithm, i.e., $\sigma^2 = \frac{T L^2}{2\rho}$.
Note that for any $\delta > 0$, $\rho$-zCDP implies $(\epsilon, \delta)$-DP with $\epsilon = \rho + 2\sqrt{\rho \log(1/\delta)}$. This means that as $T$ goes to $\infty$ while $\rho$ stays a constant, the first term vanishes and the second term matches the information-theoretic limit of differentially private learning under $(\epsilon,\delta)$-DP, on the order of $L\|\theta_0 - \theta^*\|\sqrt{d}/\epsilon$ in the convex case and $d L^2/(\lambda \epsilon^2)$ in the strongly convex case respectively (for the sum objective, up to logarithmic factors in $1/\delta$).
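The zCDP-to-DP conversion used here is the standard one, and it is often needed in the other direction as well (a target $(\epsilon, \delta)$ is given and we need the corresponding $\rho$ for calibrating $\sigma$). A small helper:

import numpy as np

def zcdp_to_dp(rho, delta):
    # rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP.
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))

def dp_to_zcdp(epsilon, delta):
    # Largest rho whose standard conversion stays within (epsilon, delta)-DP.
    # Solves rho + 2*sqrt(rho*log(1/delta)) = epsilon, a quadratic in sqrt(rho).
    c = np.log(1.0 / delta)
    sqrt_rho = np.sqrt(c + epsilon) - np.sqrt(c)   # positive root of x^2 + 2*sqrt(c)*x - epsilon = 0
    return sqrt_rho ** 2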
2.3 Utility in the regularized case
When an additional regularization with parameter $\lambda$ is used, the utility we consider should still be measured in terms of $F$. Let $\tilde\theta \in \mathcal{K}$ be any comparator; it satisfies
\[
F(\bar\theta_T) - F(\tilde\theta) \;\le\; F_\lambda(\bar\theta_T) - F_\lambda(\tilde\theta) + \frac{\lambda}{2}\|\tilde\theta - \theta_{\mathrm{ref}}\|^2 \;\le\; F_\lambda(\bar\theta_T) - \min_\theta F_\lambda(\theta) + \frac{\lambda}{2}\|\tilde\theta - \theta_{\mathrm{ref}}\|^2.
\]
Take expectations on both sides and substitute the bound (8), with an initialization at $\theta_0 = \theta_{\mathrm{ref}}$:
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\tilde\theta) \;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + \frac{d L^2}{\lambda \rho} + \frac{\lambda}{2}\|\tilde\theta - \theta_{\mathrm{ref}}\|^2.
\]
In the case when $F$ itself is not strongly convex (so that the strong convexity comes entirely from the regularizer), this is the bound we get; when we choose $\lambda = \frac{L}{\|\tilde\theta - \theta_{\mathrm{ref}}\|}\sqrt{\frac{2d}{\rho}}$ we get
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\tilde\theta) \;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + L\|\tilde\theta - \theta_{\mathrm{ref}}\|\sqrt{\frac{2d}{\rho}}.
\]
Again, as $T \to \infty$, the first term vanishes and the second term is at the information-theoretic limit for general convex problems when $\theta_{\mathrm{ref}}$ is chosen without prior knowledge of $\tilde\theta$. However, if we choose $\theta_{\mathrm{ref}}$ according to the public data, then if the public data gives either a good initialization or a good reference point for the strongly convex regularization, it provably improves the utility under the same privacy parameter $\rho$.
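Based on the bound above (and therefore subject to the same form of the bound), a hedged sketch of the resulting choice of $\lambda$: it simply balances $d L^2/(\lambda\rho)$ against $\lambda D_{\mathrm{ref}}^2/2$, where $D_{\mathrm{ref}}$ is an assumed a-priori bound on $\|\tilde\theta - \theta_{\mathrm{ref}}\|$.

import numpy as np

def choose_lambda(L, d, rho, D_ref):
    # Equates d*L^2/(lam*rho) with lam*D_ref^2/2 and reports the resulting T -> infinity excess risk.
    lam = (L / D_ref) * np.sqrt(2 * d / rho)
    excess_risk = L * D_ref * np.sqrt(2 * d / rho)
    return lam, excess_risk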
2.4 Provably good initialization via a public data set
Assume the public data with $m$ samples are drawn from the same distribution as the private data. Assume that the population risk
\[
R(\theta) := \mathbb{E}_{z \sim \mathcal{D}}\big[\ell(\theta; z)\big]
\]
is $\mu$-strongly convex at its minimizer $\theta^\dagger$ for some constant $\mu > 0$. Note that this does not require strong convexity everywhere, and it also does not require strong convexity to hold for the empirical risk objective on any finite dataset.
Moreover, assume that the same $L$-Lipschitz condition holds for the losses on the public data. Then running one-pass SGD over the public data, which returns $\theta_{\mathrm{pub}}$ as the weighted average of its iterates, satisfies
\[
\mathbb{E}\big[\|\theta_{\mathrm{pub}} - \theta^\dagger\|^2\big] \;\le\; \frac{c\, L^2}{\mu^2 m}
\]
for a universal constant $c$.
This result, together with the bound above (instantiated with $\theta_{\mathrm{ref}} = \theta_{\mathrm{pub}}$ and comparator $\tilde\theta = \theta^\dagger$), implies that
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^\dagger) \;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + \frac{d L^2}{\lambda \rho} + \frac{\lambda}{2}\cdot\frac{c\, L^2}{\mu^2 m}.
\]
The outer expectation is taken over the distribution of the public data. This implies a learning bound of the form (by dividing both sides by $n$ and adjusting for the difference between empirical and population risks)
\[
\mathbb{E}\big[R(\bar\theta_T)\big] - R(\theta^\dagger) \;\le\; \frac{2 n L^2}{\lambda (T+1)} + \frac{d L^2}{n \lambda \rho} + \frac{\lambda c L^2}{2 n \mu^2 m} + \mathrm{GeneralizationGap}(n),
\]
where $\mathrm{GeneralizationGap}(n)$ is the difference between the empirical and the population risk of $\bar\theta_T$, typically obtained via a uniform convergence argument. The generalization gap is required even for non-private learning and is between $O(1/\sqrt{n})$ and $O(1/n)$, the latter when we qualify for a fast-rate condition (e.g., strong convexity, low noise, etc.). Note that in the above bound the extra term due to privacy is on the order of $\frac{L^2}{\mu n}\sqrt{\frac{d}{\rho m}}$ after optimizing $\lambda$, compared to $\frac{L\|\theta_0 - \theta^\dagger\|}{n}\sqrt{\frac{d}{\rho}}$ when we have private data only; i.e., the public data buys roughly a $1/\sqrt{m}$ factor.
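For completeness, a minimal sketch of the public-data initialization step: a single pass of SGD over the $m$ public samples with the tail-heavy weighted averaging used in the strongly convex analysis. The gradient oracle grad_loss and the local strong-convexity constant mu are assumed inputs, and the step-size schedule shown is one standard choice rather than the only valid one.

import numpy as np

def public_init_one_pass_sgd(grad_loss, public_data, d, mu):
    # One pass over the public samples; returns the weighted average (weight t for iterate t).
    theta = np.zeros(d)
    theta_weighted_sum = np.zeros(d)
    weight_sum = 0.0
    for t, z in enumerate(public_data, start=1):
        eta_t = 2.0 / (mu * (t + 1))           # decaying step size for the (locally) strongly convex risk
        theta = theta - eta_t * grad_loss(theta, z)
        theta_weighted_sum += t * theta
        weight_sum += t
    return theta_weighted_sum / weight_sum     # theta_pub

The returned theta_pub can then serve both as the initialization $\theta_0$ and as the regularization center $\theta_{\mathrm{ref}}$ in NoisyGD.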
2.5 Improved dimension dependence for GLMs
For the generalized linear model case, the dimension $d$ in all bounds above can be replaced with $\mathrm{rank}(X)$, where $X \in \mathbb{R}^{n \times d}$ is the data matrix and $\mathrm{rank}(X) \le \min(n, d)$. The argument follows (song21evading) by leveraging that the gradients all lie within the row space of $X$, so it suffices to consider the component of the noise inside this subspace.
This observation is quite useful to us because the linearization of a deep model is often very high-dimensional, with $d \gg n$ and hence $\mathrm{rank}(X) \le n$. In this case, it says that we can choose the learning rate as a function of $\mathrm{rank}(X)$ instead of $d$.
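The row-space observation is easy to verify numerically: for any GLM, the sum gradient has the form $X^\top f'(X\theta; y)$ and therefore lies in the row space of $X$. The snippet below checks this for a logistic link (an illustrative choice) in a regime with $d \gg n$.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 500                                  # high-dimensional regime, d >> n
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n) * 2 - 1          # labels in {-1, +1}
theta = rng.standard_normal(d)

margins = y * (X @ theta)
grad = X.T @ (-y / (1.0 + np.exp(margins)))     # gradient of the sum of logistic losses

Q, _ = np.linalg.qr(X.T)                        # columns of Q span the row space of X (rank <= n)
residual = grad - Q @ (Q.T @ grad)              # component of the gradient outside the row space
print(np.linalg.matrix_rank(X), np.linalg.norm(residual))   # rank <= 20, residual ~ 0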
2.6 Gradient clipping under GLM
It was established in (song21evading) that gradient clipping has the effect of Huberizing the loss function in the GLM case. Convergence to the optimal solution of the correspondingly modified objective function is guaranteed if we fix the clipping parameter. Our adaptive clipping approach has the effect of selecting this hyperparameter adaptively in a homotopy-style algorithm, which shows further empirical benefits.
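The Huberization effect can be checked directly in a toy example (our own illustration, not code from the cited papers): for the squared link $f(u; y) = (u - y)^2/2$, clipping the per-example gradient $(x^\top\theta - y)\, x$ to norm $C$ coincides with the gradient of a Huber loss applied to the residual with threshold $C/\|x\|$.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
y, theta, C = 0.3, rng.standard_normal(5), 1.0

r = x @ theta - y                                # residual fed into the squared link
g = r * x                                        # unclipped per-example gradient
clipped = g * min(1.0, C / np.linalg.norm(g))    # standard l2 gradient clipping to norm C

tau = C / np.linalg.norm(x)                      # Huber threshold induced by the clipping norm
huber_grad = np.clip(r, -tau, tau) * x           # gradient of Huber_tau(x^T theta - y)
print(np.linalg.norm(clipped - huber_grad))      # ~ 0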
2.7 Related work
The theory on the convergence of SGD is well studied under various regimes. The exposition above is based on the analyses of (ghadimi2013stochastic) (for the convex cases) and (lacoste2012simpler) (for the strongly convex case). song21evading provided the first analysis of this algorithm in the general convex case with a dependence on $\mathrm{rank}(X)$ rather than the ambient dimension, and first illustrated the interpretation of gradient clipping as a Huberization of the link function. chen2020understanding provided an alternative explanation of gradient clipping in terms of how it effectively modifies the data distribution. To the best of our knowledge, we are the first to state these results for GLMs with a dependence on $\mathrm{rank}(X)$ in the strongly convex case, as well as to show that the use of public data gives a provably smaller sample complexity compared to the standard private learning setting.
3 Discussion
We make a few interesting points of discussion here.