Convergence of PrivateGD in Strongly Convex Problems
Abstract
This note describes the convergence of noisy gradient descent in strongly convex problems, with references to the machine learning / optimization literature. The specific considerations we focus on in this note are what these existing results imply in the context of differentially private learning, and how they help us obtain good hyperparameter choices.
1 Notation and problem setup
General convex ERM: We consider the problem of
\[
\min_{\theta \in \mathcal{K}} \; F(\theta) := \sum_{i=1}^{n} \ell(\theta; z_i),
\]
where each $\ell(\cdot; z_i)$ is assumed to be convex. Note that the objective function is the sum, rather than the average, of the loss functions over the dataset $D = \{z_1, \dots, z_n\}$.
Sometimes we also consider the regularized objective $F_\lambda(\theta) := F(\theta) + \frac{\lambda}{2}\|\theta - \theta_{\mathrm{ref}}\|^2$ to make it strongly convex. The subscript $\lambda$ and the reference point $\theta_{\mathrm{ref}}$ are often omitted to avoid clutter.
Besides convexity, we also make the following assumptions: for every data point $z$, $\ell(\cdot; z)$ is convex, $L$-Lipschitz and $\beta$-smooth in its first argument; the regularized objective $F_\lambda$ is $(n\beta + \lambda)$-smooth and $\lambda$-strongly convex. Note that $F$ itself is then $nL$-Lipschitz and $n\beta$-smooth.
Specifically, for the purpose of defining differential privacy, $\ell(\cdot; z)$ belongs to a universe of loss functions induced by the universe of data points. The assumptions stated above are required to hold for every loss in this universe, so that any neighboring dataset obtained by adding or removing individual data points satisfies the same assumptions.
Generalized linear models (GLM): In some cases, we consider the restriction to the family of generalized linear models, where the loss function is $\ell(\theta; (x, y)) = f(x^\top \theta; y)$ for some convex link function $f$. In this case, $\nabla_\theta \ell(\theta; (x, y)) = f'(x^\top \theta; y)\, x$, i.e., the gradient is always parallel to the feature vector $x$.
Algorithm: The algorithm we consider is Noisy Gradient Descent (NoisyGD), which updates the parameter vector iteratively using
\[
\theta_{t+1} = \Pi_{\mathcal{K}}\Big( \theta_t - \eta_t \big( \nabla F(\theta_t) + \mathcal{N}(0, \sigma^2 I_d) \big) \Big),
\]
where $\mathcal{K} \subseteq \mathbb{R}^d$ is a convex feasible region and $\Pi_{\mathcal{K}}$ is the Euclidean projection onto this set. When we consider unconstrained problems, $\mathcal{K} = \mathbb{R}^d$ and $\Pi_{\mathcal{K}}$ is the identity map.
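To make the update concrete, the following is a minimal NumPy sketch of NoisyGD with an (optional) Euclidean-ball feasible region; the gradient oracle grad_F, the radius R, and all hyperparameter values are illustrative placeholders rather than choices prescribed by the analysis below.

import numpy as np

def noisy_gd(grad_F, d, T, eta, sigma, R=None, theta0=None, rng=None):
    # NoisyGD: theta <- Proj_K( theta - eta * (grad F(theta) + N(0, sigma^2 I_d)) ).
    # grad_F : callable returning the gradient of the *sum* objective F at theta
    # R      : radius of an l2-ball feasible region K (None means unconstrained, K = R^d)
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(d) if theta0 is None else np.array(theta0, dtype=float)
    iterates = []
    for _ in range(T):
        noisy_grad = grad_F(theta) + sigma * rng.standard_normal(d)
        theta = theta - eta * noisy_grad
        if R is not None:                    # Euclidean projection onto the l2-ball of radius R
            norm = np.linalg.norm(theta)
            if norm > R:
                theta = theta * (R / norm)
        iterates.append(theta.copy())
    return np.mean(iterates, axis=0)         # averaged iterate, as used in the bounds below

The bounds in Section 2 are stated for the averaged (or weighted-averaged) iterate, which is why the sketch returns the running average rather than the last iterate.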
2 Convergence guarantees
2.1 General convex case
Theorem 1 (Convex, smooth and Lipschitz).
Let the learning rate be $\eta \le \frac{1}{n\beta}$, assume $\theta^* \in \mathcal{K}$ is a minimizer of $F$ over $\mathcal{K}$, and let $\bar\theta_T = \frac{1}{T}\sum_{t=1}^{T} \theta_t$ denote the averaged iterate. Then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \frac{\|\theta_0 - \theta^*\|^2}{2\eta T} + \frac{\eta\, d\, \sigma^2}{2}. \tag{1}
\]
In particular, if $\eta = \frac{1}{n\beta}$ and $\sigma$ is chosen such that the algorithm satisfies $\rho$-zCDP, i.e., $\sigma^2 = \frac{T L^2}{2\rho}$, then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \frac{n\beta \|\theta_0 - \theta^*\|^2}{2T} + \frac{T d L^2}{4 n \beta \rho}, \tag{2}
\]
and choosing $T \asymp n\beta\|\theta_0 - \theta^*\|\sqrt{2\rho/(dL^2)}$ to balance the two terms gives
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; L \|\theta_0 - \theta^*\| \sqrt{\frac{d}{2\rho}}. \tag{3}
\]
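As a sanity check on how Theorem 1 is used for hyperparameter selection, here is a small helper (assuming the calibration $\sigma^2 = TL^2/(2\rho)$ and the form of the bound above) that sets the step size, the number of iterations, and the noise scale from a target zCDP budget; the inputs L, beta, n, d, rho and the initialization distance D0 are assumed to be known or estimated.

import numpy as np

def calibrate_noisygd(L, beta, n, d, rho, D0):
    # Step size matched to the (n*beta)-smooth sum objective.
    eta = 1.0 / (n * beta)
    # Number of iterations that balances the two terms of the bound (2).
    T = max(1, int(round(n * beta * D0 * np.sqrt(2 * rho / (d * L**2)))))
    # Per-step Gaussian noise so that T steps compose to rho-zCDP (sensitivity L per step).
    sigma = L * np.sqrt(T / (2 * rho))
    bound = n * beta * D0**2 / (2 * T) + T * d * L**2 / (4 * n * beta * rho)
    return eta, T, sigma, bound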
Theorem 2 (Convex and Lipschitz).
Let the learning rate be $\eta = \frac{\|\theta_0 - \theta^*\|}{\sqrt{T (n^2 L^2 + d\sigma^2)}}$, and also assume $\theta^* \in \mathcal{K}$. Then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \|\theta_0 - \theta^*\| \sqrt{\frac{n^2 L^2 + d\sigma^2}{T}}. \tag{4}
\]
In particular, if $T \ge \frac{2 n^2 \rho}{d}$, and $\sigma$ is chosen such that the algorithm satisfies $\rho$-zCDP, i.e., $\sigma^2 = \frac{T L^2}{2\rho}$, then
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^*) \;\le\; \|\theta_0 - \theta^*\| \sqrt{\frac{n^2 L^2}{T} + \frac{d L^2}{2\rho}} \tag{5}
\]
\[
\;\le\; L \|\theta_0 - \theta^*\| \sqrt{\frac{d}{\rho}}. \tag{6}
\]
2.2 Strongly convex case
Theorem 3.
Let the learning rate be $\eta_t = \frac{2}{\lambda (t+1)}$, and let $\bar\theta_T = \frac{2}{T(T+1)}\sum_{t=1}^{T} t\, \theta_t$ be the weighted averaged iterate. Then
\[
\mathbb{E}\big[F_\lambda(\bar\theta_T)\big] - F_\lambda(\theta^*_\lambda) \;\le\; \frac{2\,(n^2 L^2 + d\sigma^2)}{\lambda (T+1)} \tag{7}
\]
\[
\;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + \frac{d L^2}{\lambda \rho}, \tag{8}
\]
where $\rho$ is the zCDP parameter of the algorithm, i.e., $\sigma^2 = \frac{T L^2}{2\rho}$.
Note that for any $\delta > 0$, $\rho$-zCDP implies $(\epsilon, \delta)$-DP with $\epsilon = \rho + 2\sqrt{\rho \log(1/\delta)}$. This means that as $T$ goes to $\infty$ while $\rho$ stays a constant, the first term vanishes and the second term matches the information-theoretic limit of differentially private learning under $(\epsilon,\delta)$-DP, on the order of $L\|\theta_0 - \theta^*\|\sqrt{d}/\epsilon$ in the convex case and $d L^2/(\lambda \epsilon^2)$ in the strongly convex case respectively (for the sum objective, up to logarithmic factors in $1/\delta$).
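The zCDP-to-DP conversion used here is the standard one, and it is often needed in the other direction as well (a target $(\epsilon, \delta)$ is given and we need the corresponding $\rho$ for calibrating $\sigma$). A small helper:

import numpy as np

def zcdp_to_dp(rho, delta):
    # rho-zCDP implies (rho + 2*sqrt(rho*log(1/delta)), delta)-DP.
    return rho + 2.0 * np.sqrt(rho * np.log(1.0 / delta))

def dp_to_zcdp(epsilon, delta):
    # Largest rho whose standard conversion stays within (epsilon, delta)-DP.
    # Solves rho + 2*sqrt(rho*log(1/delta)) = epsilon, a quadratic in sqrt(rho).
    c = np.log(1.0 / delta)
    sqrt_rho = np.sqrt(c + epsilon) - np.sqrt(c)   # positive root of x^2 + 2*sqrt(c)*x - epsilon = 0
    return sqrt_rho ** 2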
2.3 Utility in the regularized case
When an additional regularization with parameter $\lambda$ is used, the utility we consider should still be measured in terms of $F$. Let $\tilde\theta \in \mathcal{K}$ be any comparator; it satisfies
\[
F(\bar\theta_T) - F(\tilde\theta) \;\le\; F_\lambda(\bar\theta_T) - F_\lambda(\tilde\theta) + \frac{\lambda}{2}\|\tilde\theta - \theta_{\mathrm{ref}}\|^2 \;\le\; F_\lambda(\bar\theta_T) - \min_\theta F_\lambda(\theta) + \frac{\lambda}{2}\|\tilde\theta - \theta_{\mathrm{ref}}\|^2.
\]
Take expectations on both sides and substitute the bound (8), with an initialization at $\theta_0 = \theta_{\mathrm{ref}}$:
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\tilde\theta) \;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + \frac{d L^2}{\lambda \rho} + \frac{\lambda}{2}\|\tilde\theta - \theta_{\mathrm{ref}}\|^2.
\]
In the case when $F$ itself is not strongly convex (so that the strong convexity comes entirely from the regularizer), this is the bound we get; when we choose $\lambda = \frac{L}{\|\tilde\theta - \theta_{\mathrm{ref}}\|}\sqrt{\frac{2d}{\rho}}$ we get
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\tilde\theta) \;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + L\|\tilde\theta - \theta_{\mathrm{ref}}\|\sqrt{\frac{2d}{\rho}}.
\]
Again, as $T \to \infty$, the first term vanishes and the second term is at the information-theoretic limit for general convex problems when $\theta_{\mathrm{ref}}$ is chosen without prior knowledge of $\tilde\theta$. However, if we choose $\theta_{\mathrm{ref}}$ according to the public data, then if the public data gives either a good initialization or a good reference point for the strongly convex regularization, it provably improves the utility under the same privacy parameter $\rho$.
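Based on the bound above (and therefore subject to the same form of the bound), a hedged sketch of the resulting choice of $\lambda$: it simply balances $d L^2/(\lambda\rho)$ against $\lambda D_{\mathrm{ref}}^2/2$, where $D_{\mathrm{ref}}$ is an assumed a-priori bound on $\|\tilde\theta - \theta_{\mathrm{ref}}\|$.

import numpy as np

def choose_lambda(L, d, rho, D_ref):
    # Equates d*L^2/(lam*rho) with lam*D_ref^2/2 and reports the resulting T -> infinity excess risk.
    lam = (L / D_ref) * np.sqrt(2 * d / rho)
    excess_risk = L * D_ref * np.sqrt(2 * d / rho)
    return lam, excess_risk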
2.4 Provably good initialization via a public data set
Assume the public data with $m$ samples are drawn from the same distribution as the private data. Assume that the population risk
\[
R(\theta) := \mathbb{E}_{z \sim \mathcal{D}}\big[\ell(\theta; z)\big]
\]
is $\mu$-strongly convex at its minimizer $\theta^\dagger$ for some constant $\mu > 0$. Note that this does not require strong convexity everywhere, and it also does not require strong convexity to hold for the empirical risk objective on any finite dataset.
Moreover, assume that the same $L$-Lipschitz condition holds for the losses on the public data. Then running one-pass SGD over the public data, which returns $\theta_{\mathrm{pub}}$ as the weighted average of its iterates, satisfies
\[
\mathbb{E}\big[\|\theta_{\mathrm{pub}} - \theta^\dagger\|^2\big] \;\le\; \frac{c\, L^2}{\mu^2 m}
\]
for a universal constant $c$.
This result, together with the bound above (instantiated with $\theta_{\mathrm{ref}} = \theta_{\mathrm{pub}}$ and comparator $\tilde\theta = \theta^\dagger$), implies that
\[
\mathbb{E}\big[F(\bar\theta_T)\big] - F(\theta^\dagger) \;\le\; \frac{2 n^2 L^2}{\lambda (T+1)} + \frac{d L^2}{\lambda \rho} + \frac{\lambda}{2}\cdot\frac{c\, L^2}{\mu^2 m}.
\]
The outer expectation is taken over the distribution of the public data. This implies a learning bound of the form (by dividing both sides by $n$ and adjusting for the difference between empirical and population risks)
\[
\mathbb{E}\big[R(\bar\theta_T)\big] - R(\theta^\dagger) \;\le\; \frac{2 n L^2}{\lambda (T+1)} + \frac{d L^2}{n \lambda \rho} + \frac{\lambda c L^2}{2 n \mu^2 m} + \mathrm{GeneralizationGap}(n),
\]
where $\mathrm{GeneralizationGap}(n)$ is the difference between the empirical and the population risk of $\bar\theta_T$, typically obtained via a uniform convergence argument. The generalization gap is required even for non-private learning and is between $O(1/\sqrt{n})$ and $O(1/n)$, the latter when we qualify for a fast-rate condition (e.g., strong convexity, low noise, etc.). Note that in the above bound the extra term due to privacy is on the order of $\frac{L^2}{\mu n}\sqrt{\frac{d}{\rho m}}$ after optimizing $\lambda$, compared to $\frac{L\|\theta_0 - \theta^\dagger\|}{n}\sqrt{\frac{d}{\rho}}$ when we have private data only; i.e., the public data buys roughly a $1/\sqrt{m}$ factor.
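For completeness, a minimal sketch of the public-data initialization step: a single pass of SGD over the $m$ public samples with the tail-heavy weighted averaging used in the strongly convex analysis. The gradient oracle grad_loss and the local strong-convexity constant mu are assumed inputs, and the step-size schedule shown is one standard choice rather than the only valid one.

import numpy as np

def public_init_one_pass_sgd(grad_loss, public_data, d, mu):
    # One pass over the public samples; returns the weighted average (weight t for iterate t).
    theta = np.zeros(d)
    theta_weighted_sum = np.zeros(d)
    weight_sum = 0.0
    for t, z in enumerate(public_data, start=1):
        eta_t = 2.0 / (mu * (t + 1))           # decaying step size for the (locally) strongly convex risk
        theta = theta - eta_t * grad_loss(theta, z)
        theta_weighted_sum += t * theta
        weight_sum += t
    return theta_weighted_sum / weight_sum     # theta_pub

The returned theta_pub can then serve both as the initialization $\theta_0$ and as the regularization center $\theta_{\mathrm{ref}}$ in NoisyGD.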
2.5 Improved dimension dependence for GLMs
For the generalized linear model case, the dimension $d$ in all bounds above can be replaced with $\mathrm{rank}(X)$, where $X \in \mathbb{R}^{n \times d}$ is the data matrix and $\mathrm{rank}(X) \le \min(n, d)$. The argument follows (song21evading) by leveraging that the gradients all lie within the row space of $X$, so it suffices to consider the component of the noise inside this subspace.
This observation is quite useful to us because the linearization of a deep model is often very high-dimensional, with $d \gg n$ and hence $\mathrm{rank}(X) \le n$. In this case, it says that we can choose the learning rate as a function of $\mathrm{rank}(X)$ instead of $d$.
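The row-space observation is easy to verify numerically: for any GLM, the sum gradient has the form $X^\top f'(X\theta; y)$ and therefore lies in the row space of $X$. The snippet below checks this for a logistic link (an illustrative choice) in a regime with $d \gg n$.

import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 500                                  # high-dimensional regime, d >> n
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n) * 2 - 1          # labels in {-1, +1}
theta = rng.standard_normal(d)

margins = y * (X @ theta)
grad = X.T @ (-y / (1.0 + np.exp(margins)))     # gradient of the sum of logistic losses

Q, _ = np.linalg.qr(X.T)                        # columns of Q span the row space of X (rank <= n)
residual = grad - Q @ (Q.T @ grad)              # component of the gradient outside the row space
print(np.linalg.matrix_rank(X), np.linalg.norm(residual))   # rank <= 20, residual ~ 0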
2.6 Gradient clipping under GLM
It was established in (song21evading) that gradient clipping has the effect of Huberizing the loss function in the GLM case. Convergence to the optimal solution of the correspondingly modified objective function is guaranteed if we fix the clipping parameter. Our adaptive clipping approach has the effect of selecting this hyperparameter adaptively in a homotopy-style algorithm, which shows further empirical benefits.
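The Huberization effect can be checked directly in a toy example (our own illustration, not code from the cited papers): for the squared link $f(u; y) = (u - y)^2/2$, clipping the per-example gradient $(x^\top\theta - y)\, x$ to norm $C$ coincides with the gradient of a Huber loss applied to the residual with threshold $C/\|x\|$.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5)
y, theta, C = 0.3, rng.standard_normal(5), 1.0

r = x @ theta - y                                # residual fed into the squared link
g = r * x                                        # unclipped per-example gradient
clipped = g * min(1.0, C / np.linalg.norm(g))    # standard l2 gradient clipping to norm C

tau = C / np.linalg.norm(x)                      # Huber threshold induced by the clipping norm
huber_grad = np.clip(r, -tau, tau) * x           # gradient of Huber_tau(x^T theta - y)
print(np.linalg.norm(clipped - huber_grad))      # ~ 0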
2.7 Related work
The theory on the convergence of SGD is well studied under various regimes. The exposition above is based on the analyses of (ghadimi2013stochastic) (for the convex cases) and (lacoste2012simpler) (for the strongly convex case). song21evading provided the first analysis of this algorithm in the general convex case with a dependence on $\mathrm{rank}(X)$ rather than the ambient dimension, and first illustrated the interpretation of gradient clipping as a Huberization of the link function. chen2020understanding provided an alternative explanation of gradient clipping in terms of how it effectively modifies the data distribution. To the best of our knowledge, we are the first to state these results for GLMs with a dependence on $\mathrm{rank}(X)$ in the strongly convex case, as well as to show that the use of public data gives a provably smaller sample complexity compared to the standard private learning setting.
3 Discussion
We make a few interesting points of discussion here.