Private Convex Optimization in General Norms

Sivakanth Gopi Microsoft Research, [email protected]    Yin Tat Lee University of Washington, [email protected]    Daogao Liu University of Washington, [email protected]    Ruoqi Shen University of Washington, [email protected]    Kevin Tian Microsoft Research, [email protected]. Work completed while at Stanford University.

1 Introduction

The study of convex optimization in spaces where the natural geometry is non-Euclidean, beyond being a natural question of independent interest, has resulted in many successes across algorithm design. A basic example of this is the celebrated multiplicative weights, or exponentiated gradient, method [AHK12], which caters to the \ell_{1} geometry and has numerous applications in learning theory and algorithms. Moreover, optimization in real vector spaces equipped with different \ell_{p} norms has found use in sparse recovery [CRT06], combinatorial optimization [KLOS14, KPSW19], multi-armed bandit problems [BC12], fair resource allocation [DFO20], and more (see e.g. [AKPS19, DG21] and references therein). Furthermore, optimization in Schatten-p norm geometries (the natural generalization of \ell_{p} norms to matrix spaces) has resulted in improved algorithms for matrix completion [ANW10] and outlier-robust PCA [JLT20]. In addition to \ell_{p} and Schatten-p norms, the theory of non-Euclidean geometries has been very useful in settings such as linear and semidefinite programming [Nem04] and optimization on matrix spaces [AGL+18], amongst others.

The main result of this paper is a framework for differentially private convex optimization in general normed spaces under a Lipschitz parameter bound. Differential privacy [DKM+06, DMNS06] has been adopted as the standard privacy notion for data analysis in both theory and practice, and differentially private algorithms have been deployed in many important settings in the industry as well as the U.S. census [EPK14, Abo16, Tea17, BEM+17, DKY17]. Consequently, differentially private optimization is an increasingly important and fundamental primitive in modern machine learning applications [BST14, ACG+16]. However, despite an extensive body of theoretical work providing privacy-utility tradeoffs (and more) for optimization in the Euclidean norm geometry, e.g. [CM08, CMS11, KST12, JT14, BST14, KJ16] (and many other follow-up works), more general settings have been left relatively unexplored. This state of affairs prohibits the application of private optimization theory to problems where the natural geometry is non-Euclidean. Recent works [AFKT21, BGN21, BGM21] have investigated special cases of private convex optimization, e.g. for p\ell_{p} spaces or polyhedral sets, under smoothness assumptions, or under structured losses. However, the systematic study of private convex optimization in general normed spaces in the most fundamental setting of Lipschitz losses has been left open, a gap that our work addresses.

Our framework for private convex optimization is simple: we demonstrate strong privacy-utility tradeoffs for a regularized exponential mechanism when optimizing a loss over a set \mathcal{X}\subset\mathbb{R}^{d} equipped with a norm \left\lVert\cdot\right\rVert_{\mathcal{X}}. More concretely, our algorithms sample from densities

\propto\exp\left(-k(F_{\mathcal{D}}+\mu r)\right)

where k,\mu>0 are tunable parameters, F_{\mathcal{D}} is a (data-dependent) empirical risk, and r is a strongly convex regularizer in \left\lVert\cdot\right\rVert_{\mathcal{X}} with bounded range over \mathcal{X}. In the analogous non-private Lipschitz convex optimization setting, most theoretical developments (namely mirror descent frameworks) have focused on precise applications where such an r is readily available [Sha12, Bub15]. In this sense, our framework directly extends existing Lipschitz convex optimization theory to the private setting (and indeed, recovers existing non-private guarantees obtained by mirror descent [NY83]).
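
To make this concrete, the following one-dimensional sketch (our illustration, not the sampler used in the paper; all function names are hypothetical) instantiates such a mechanism with absolute-value losses and a quadratic regularizer, using a simple grid discretization in place of the efficient structured samplers discussed in Section 1.3:

```python
import numpy as np

def regularized_exp_mechanism_1d(samples, k, mu, lo=-1.0, hi=1.0, grid=10_000, rng=None):
    """Draw x with density proportional to exp(-k (F_D(x) + mu r(x))) on [lo, hi],
    where F_D(x) = mean_i |x - s_i| (1-Lipschitz) and r(x) = x^2 / 2 (1-strongly convex).
    The grid discretization is only a stand-in for a proper logconcave sampler."""
    rng = np.random.default_rng() if rng is None else rng
    xs = np.linspace(lo, hi, grid)
    F_D = np.mean(np.abs(xs[None, :] - np.asarray(samples)[:, None]), axis=0)
    r = 0.5 * xs ** 2
    log_density = -k * (F_D + mu * r)
    weights = np.exp(log_density - log_density.max())  # stabilize before normalizing
    return rng.choice(xs, p=weights / weights.sum())

x = regularized_exp_mechanism_1d(samples=[0.2, -0.1, 0.4], k=50.0, mu=0.5)
```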

In the remainder of the introduction, we summarize our results (Section 1.1), highlight our technical contributions (Section 1.2), and situate our paper in the context of prior work (Section 1.3).

1.1 Our results

We study both the empirical risk minimization (ERM) problem and the stochastic convex optimization (SCO) problem in this paper; the goal in the latter case is to minimize the population risk. We formalize this under the following assumption, which parameterizes the space we are optimizing and the (empirical and population) risks we aim to minimize.

Assumption 1.

We make the following assumptions.

  1. There is a compact, convex set \mathcal{X}\subset\mathbb{R}^{d} equipped with a norm \left\lVert\cdot\right\rVert_{\mathcal{X}}.

  2. There is a 1-strongly convex function r:\mathcal{X}\to\mathbb{R} in \left\lVert\cdot\right\rVert_{\mathcal{X}}, and \Theta\geq\max_{x\in\mathcal{X}}r(x)-\min_{x\in\mathcal{X}}r(x).

  3. There is a set \Omega such that for any s\in\Omega, there is a convex function f(\cdot;s):\mathcal{X}\to\mathbb{R} which is G-Lipschitz in \left\lVert\cdot\right\rVert_{\mathcal{X}}.

For definitions used above, see Section 2. We remark that by strong convexity, the parameter \Theta scales at least as \Omega(D^{2}), where D is the diameter of \mathcal{X} with respect to \left\lVert\cdot\right\rVert_{\mathcal{X}}; in many cases of interest, we may also upper bound \Theta by O(D^{2}) up to a logarithmic factor.
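
To see the lower bound on \Theta, apply 1-strong convexity of r with t=\frac{1}{2} to any x,y\in\mathcal{X} achieving \left\lVert x-y\right\rVert_{\mathcal{X}}=D; since \max_{z\in\mathcal{X}}r(z)\geq\frac{1}{2}(r(x)+r(y)) and \min_{z\in\mathcal{X}}r(z)\leq r(\frac{x+y}{2}),

\Theta\geq\frac{r(x)+r(y)}{2}-r\left(\frac{x+y}{2}\right)\geq\frac{1}{8}\left\lVert x-y\right\rVert_{\mathcal{X}}^{2}=\frac{D^{2}}{8}.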

Finally, throughout the paper when working under Assumption 1, \mathcal{D}=\{s_{i}\}_{i\in[n]} denotes a dataset drawn independently from \mathcal{P}, a distribution supported on \Omega, and we define F_{\mathcal{D}}:\mathcal{X}\to\mathbb{R} and F_{\textup{pop}}:\mathcal{X}\to\mathbb{R} by

F_{\mathcal{D}}(x):=\frac{1}{n}\sum_{i\in[n]}f(x;s_{i}),\;F_{\textup{pop}}(x):=\mathbb{E}_{s\sim\mathcal{P}}[f(x;s)]. (1)

Private ERM and SCO.

We first present the following general results under Assumption 1.

Theorem 1 (Informal, see Theorems 3 and 4).

Under Assumption 1 and following notation (1), drawing a sample x from the density \nu\propto\exp(-k(F_{\mathcal{D}}+\mu r)) for some k,\mu>0 specified in Theorem 3 is (\epsilon,\delta)-differentially private, and produces x such that

\mathbb{E}_{x\sim\nu}[F_{\mathcal{D}}(x)]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x)\leq G\sqrt{\Theta}\cdot\frac{\sqrt{8d\log\frac{1}{2\delta}}}{n\epsilon}.

Moreover, drawing a sample x from the density \nu\propto\exp(-k(F_{\mathcal{D}}+\mu r)) for some k,\mu>0 specified in Theorem 4 is (\epsilon,\delta)-differentially private, and produces x such that

\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x)\leq G\sqrt{\Theta}\cdot\left(\frac{\sqrt{8d\log\frac{1}{2\delta}}}{n\epsilon}+\sqrt{\frac{8}{n}}\right).

Minimizing the non-private population risk under the same setting as Assumption 1 is a very well-studied problem, with matching upper and lower bounds in many cases of interest, such as p\ell_{p} norms [NY83, ABRW12, DJW14]. The population utility achieved by our regularized exponential mechanism in Theorem 1 (namely, as ϵ\epsilon\to\infty) matches the rate obtained by the classical mirror descent algorithm [NY83], which to our knowledge has not been previously observed. Finally, in Appendix A we provide an analog of Theorem 1 under the stronger assumption that the sample losses f(;s)f(\cdot;s) are strongly convex, bypassing the need for explicit regularization. Our results in Appendix A recover the optimal rate in the Euclidean case, matching known lower bounds [BST14].

We next show how to apply the results of Theorem 1 under various instantiations of Assumption 1 to derive new rates for private convex optimization under \ell_{p} and Schatten-p norm geometries.

\ell_{p} and Schatten-p norms.

In Corollaries 2, 3, and 4, we combine known (optimal) uniform convexity estimates for \ell_{p} spaces [BCL94] with the algorithms of Theorems 3 and 4 to obtain the privacy-utility tradeoffs summarized in Table 1. Interestingly, we achieve all of these bounds with a single algorithmic framework, which in all cases matches or partially matches known lower bounds.

\ell_{p} norm | Optimality gap (ERM loss F_{\mathcal{D}}) | Optimality gap (SCO loss F_{\textup{pop}})
p\in(1,2) (\star) | GD\cdot\frac{\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}} | GD\cdot\left(\frac{1}{\sqrt{n(p-1)}}+\frac{\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}}\right)
p=1 (\dagger) | GD\cdot\frac{\sqrt{d\log d\log\frac{1}{2\delta}}}{n\epsilon} | GD\cdot\left(\sqrt{\frac{\log d}{n}}+\frac{\sqrt{d\log d\log\frac{1}{2\delta}}}{n\epsilon}\right)
p\geq 2 (\dagger) | GD\cdot\frac{d^{1-\frac{1}{p}}\sqrt{\log\frac{1}{2\delta}}}{n\epsilon} | GD\cdot\left(\frac{d^{\frac{1}{2}-\frac{1}{p}}}{\sqrt{n}}+\frac{d^{1-\frac{1}{p}}\sqrt{\log\frac{1}{2\delta}}}{n\epsilon}\right)
Table 1: Privacy-utility tradeoffs for \ell_{p} norm optimization under (\epsilon,\delta)-differential privacy obtained by: Corollary 2 (p\in(1,2)), Corollary 3 (p=1), and Corollary 4 (p\geq 2). We assume \mathcal{X} has \ell_{p} diameter bounded by D and hide constants (stated in the formal results) for brevity. (\star) indicates that our result matches the private ERM and SCO lower bound [BGN21, LL22]. (\dagger) indicates that our result (as \epsilon\to\infty) matches the non-private SCO lower bound [ABRW12, DJW14].
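
As an illustration of how these entries follow from Theorem 1 (a sketch we include here; the formal statements are Corollaries 2 through 4), consider p\in(1,2): taking the regularizer of Proposition 7 centered at a point of \mathcal{X} gives \Theta\leq\frac{D^{2}}{2(p-1)}, so the ERM bound of Theorem 1 becomes

G\sqrt{\Theta}\cdot\frac{\sqrt{8d\log\frac{1}{2\delta}}}{n\epsilon}\leq GD\cdot\frac{2\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}},

matching the first row of Table 1 up to the hidden constant.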

We now contextualize our results with regard to the existing literature. In the following discussion, the “privacy-dependent” loss term is the term in the SCO loss scaling with \epsilon,\delta, and the “privacy-independent” loss term is the SCO loss when \epsilon\to\infty.

In the case of constant p\in(1,2), our Corollary 2 sharpens Theorem 5 of [AFKT21] by a \sqrt{\log d} factor in the privacy-dependent loss term, and is the first to match the lower bounds of [BGN21, LL22]. It also improves upon the bounds of [BGN21] by at least a \log n factor on both parts of the SCO loss; moreover, [BGN21] loses an additional n^{\frac{1}{4}} factor on the privacy-dependent loss and requires additional smoothness assumptions.

In the important case of p=1p=1, of fundamental interest due to its applications in sparse recovery [CRT06] as well as online learning [Sha12, AHK12], our Corollary 3 improves the privacy-dependent loss term of [AFKT21] by a logd\log d factor, and matches the privacy-independent loss lower bound in the SCO literature [DJW14], matching the rate of entropic mirror descent. The privacy-dependent loss term incurs an additional overhead of logd\sqrt{\log d} compared to existing lower bounds. However, just as lower bounds on the privacy-independent loss increase as p1p\to 1, it is plausible that the upper bound obtained by Corollary 3 is optimal, which we leave as an interesting open direction.

In the p\geq 2 case, prior work by [BGN21] obtains a rate matched by Corollary 4. The non-private population risk term in (15) is again known to be optimal [ABRW12]. We again find it an interesting open direction to close the gap between the upper bound (13) and known lower bounds for private convex optimization when p\geq 2, e.g. [BGN21, LL22].

We further demonstrate in Corollary 5 that all of these results have direct analogs in the case of optimization over matrix spaces equipped with Schatten norm geometries. To the best of our knowledge, this is the first such result in the private optimization literature; we believe this showcases the generality and versatility of our approach.

Finally, we mention that all of these results are algorithmic and achieved by samplers with polynomial query complexity and runtime, following developments of [LST21, GLL22]. In all cases, by simple norm equivalence relations, the query complexity of our samplers is at most a dd factor worse than the 2\ell_{2} case, with improvements as p2p\to 2. It is an exciting direction to develop efficient high-accuracy samplers catering to structured densities relevant to the setups considered in this paper, e.g. those whose negative log-likelihoods are strongly convex in p\ell_{p} norms. The design of sampling algorithms for continuous distributions has been an area of intense research activity in the machine learning community, discussed in greater detail in Section 1.3. We mention here that our hope is that our results and general optimization framework serve as additional motivation for the pursuit of efficient structured sampling algorithms working directly in non-Euclidean geometries.

1.2 Our techniques

Our results essentially build on the recent work of [GLL22], who observed that a regularized exponential mechanism achieves optimal privacy-utility tradeoffs for empirical and population risks when losses are Lipschitz in the \ell_{2} norm. Under a Euclidean specialization of Assumption 1, [GLL22] provided variants of Theorem 1 using the regularizer r(x)=\frac{1}{2}\left\lVert x\right\rVert_{2}^{2}, i.e. reweighting by a Gaussian.

We demonstrate that several key tools used in [GLL22] have non-Euclidean extensions by using a simple, general approach based on a convex geometric tool known as localization. For example, the starting point of our developments is relating the privacy curve of two nearby strongly logconcave densities with the privacy curve of Gaussians (see Section 2 for definitions).

Theorem 2.

Let \mathcal{X}\subset\mathbb{R}^{d} be compact and convex, let F,\widetilde{F}:\mathcal{X}\to\mathbb{R} be \mu-strongly convex in \left\lVert\cdot\right\rVert_{\mathcal{X}}, and let P\propto\exp(-F) and Q\propto\exp(-\widetilde{F}). Suppose \widetilde{F}-F is G-Lipschitz in \left\lVert\cdot\right\rVert_{\mathcal{X}}. For all \epsilon\in\mathbb{R}_{\geq 0},

\delta\left(P\;\middle\|\;Q\right)(\epsilon)\leq\delta\left(\mathcal{N}(0,1)\;\middle\|\;\mathcal{N}\left(\frac{G}{\sqrt{\mu}},1\right)\right)(\epsilon).

An analog of Theorem 2 when 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}} is the Euclidean norm was proven as Theorem 4.1 of [GLL22]. Moreover, the analog of Theorem 1 in [GLL22] follows from combining Theorem 4.1 of that work, and their Theorem 6.10, a reduction from the SCO problem to the ERM problem (containing a generalization error bound). These proofs in [GLL22] rely on powerful inequalities from probability theory, which were initially studied in the Gaussian (Euclidean norm regularizer) setting. For example, Theorem 4.1 applied the Gaussian isoperimetric inequality of [ST74, Bor75a] (see also Theorem 1.1, [Led99]), which states that strongly logconcave distributions in the Euclidean norm have expansion quality at least as good as a corresponding Gaussian. Moreover, the generalization error bound in Theorem 6.10 was proven based on a Euclidean norm log-Sobolev inequality and transportation inequality, relating Wasserstein distances, KL divergences, and Lipschitz bounds on negative log-densities. Fortunately, it turns out that all of these inequalities have non-Euclidean generalizations (possibly losing constant factors). For example, a non-Euclidean log-Sobolev inequality was shown by Proposition 3.1 of [BL00], and a non-Euclidean transport inequality sufficient for our purposes is proved as Proposition 1 of [CE17]. Finally, variants of the Gaussian isoperimetric inequality in general norms are given by [MS08, Kol11]. Plugging in these tools into the proofs of [GLL22] allows us to recover Theorems 1 and 2, as well as our applications.

In this work, we take a different (and in our opinion, simpler) strategy to proving the probability-theoretic inequalities required by Theorems 1 and 2, yielding an alternative to the proofs in [GLL22] which we believe may be of independent intellectual interest to the privacy community. In particular, our technical insight is the simple observation that several of the definitions in differential privacy are naturally cast in the language of localization [KLS95, FG04], which characterizes extremal logconcave densities subject to linear constraints (see our proof of Lemma 2 for an example of this). This observation allows us to reduce the proofs of key technical tools used in Theorems 1 and 2 to proving these tools in one dimension, where all norms are equivalent up to constant factor rescaling (the one-dimensional case can then typically be handled by more straightforward “combinatorial” arguments, see e.g. Section 2.b of [LS93] or Appendix B.3 of [CDWY20] for examples). After deriving several extensions of basic localization arguments in Section 3.1, we follow this reduction approach to give a more unified proof of Theorems 1 and 2. To our knowledge, this is the first direct application of localization techniques in differential privacy.

The interplay between the privacy and probability theory communities is an increasingly active area of exploration [DRS21, GLL22, GTU22] (discussed in more detail in Section 1.3). We are hence optimistic that localization-based proof strategies will have further applications in the privacy literature, especially in situations (beyond this paper) where probability theoretic tools used in the Euclidean case do not have non-Euclidean variants in full generality. In such settings, it may be a valuable endeavor to see if necessary inequalities may be directly recast in the language of localization.

1.3 Prior work

Private optimization in Euclidean norm.

Many prior works on private convex optimization have focused on variants of the ERM and SCO problems studied in this work, under \ell_{2} Lipschitz losses and \ell_{2} bounded domains, such as [CMS11, KST12, BST14, BFTT19, BFGT20]. The optimal information-theoretic rate for these private optimization problems was given by [BST14], which was matched algorithmically up to constant factors by [BFTT19, BFGT20].

From an algorithmic perspective, a topic of recent interest in the Euclidean case is the problem of attaining optimal privacy-utility tradeoffs in nearly-linear time, namely, with n\approx n gradient queries [FKT20, AFKT21, KLL21]. Under additional smoothness assumptions, this goal was achieved by [FKT20]; however, achieving near-optimal gradient oracle query rates in the general Lipschitz case remains open. We note that under value oracle access, a near-optimal bound was recently achieved by [GLL22]. This paper primarily focuses on the information-theoretic problem of achieving optimal privacy-utility tradeoffs for a given dataset size. However, we believe the corresponding problem of designing algorithms with near-optimal query complexities and runtimes (under value or gradient oracle access) is also an important open direction in the case of general norm geometries.

Private optimization in non-Euclidean norms.

The study of convex optimization in non-Euclidean geometries was recently initiated by [AFKT21, BGN21], who focused primarily on developing algorithms under p\ell_{p} regularity assumptions and bounded domains. In follow-up work, [BGM21] gave improved guarantees for the family of generalized linear losses. We discuss the rates we achieve for p\ell_{p} norm geometries compared to [AFKT21, BGN21] in Section 1.1; in short, we improve prior results by logarithmic factors in the case p[1,2)p\in[1,2), and match them when p2p\geq 2. Independently from our work, [HLL+22] designed an algorithm for private optimization in p\ell_{p} geometries improving upon [BGN21] in gradient query complexity (matching their privacy-utility tradeoffs); both [BGN21, HLL+22] require further smoothness assumptions on the loss functions.

One of the main motivations for this work was to develop a general theory for private convex optimization under non-Euclidean geometries, beyond p\ell_{p} setups. In particular, [BGN21] designed a generalized Gaussian mechanism for the case p[1,2)p\in[1,2), where gradients were perturbed by a noise distribution catering to the p\ell_{p} geometry. However, how to design a corresponding mechanism for more general norms may be less clear. The algorithm of [AFKT21] in the non-smooth case was based on a (Euclidean norm) Gaussian mechanism; again, this strategy is potentially more specialized to p\ell_{p} geometries. Beyond giving a general algorithmic framework for non-Euclidean convex optimization based on structured logconcave sampling, we hope that the information-theoretic properties we show regarding regularized exponential mechanisms (e.g. Theorem 2) may find use in designing “generalized Gaussian mechanisms” beyond p\ell_{p} norms.

Connections between privacy and sampling.

Our work extends a line of work exploring privacy-utility tradeoffs for the exponential mechanism, a general strategy for designing private algorithms introduced by [MT07] (see additional discussion in [GLL22]). For example, the regularized exponential mechanisms we design are similar in spirit to the exponential mechanism “in the \mathcal{X} norm” (that is, the norm induced by the convex body \mathcal{X}, not to be confused with the \left\lVert\cdot\right\rVert_{\mathcal{X}} of Assumption 1) designed by [HT10, BDKT12]. Moreover, our work continues a recent interface between the sampling and privacy literature, where (continuous and discrete-time) sampling algorithms are shown to efficiently obtain strong privacy-utility tradeoffs for optimization problems [GLL22, GTU22]. This work further develops this interface, motivating the design of efficient samplers for densities satisfying non-Euclidean regularity assumptions.

The design of sampling algorithms under general geometries (e.g. “mirrored Langevin algorithms”) has been a topic of great recent interest, independently from applications in private optimization. Obtaining mixing guarantees under regularity assumptions naturally found in applications is a notoriously challenging problem in the recent algorithmic sampling literature [HKRC18, ZPFP20, AC21, Jia21, LTVW22]. For example, it has been observed both theoretically and empirically that without (potentially restrictive) relationships between regularity parameters, natural discretizations of the mirrored Langevin dynamics may not even have vanishing bias [ZPFP20, Jia21, LTVW22]. Recently, [LST21] gave an alternative strategy (to discretizing Langevin dynamics) for designing sampling algorithms in the Euclidean case, used in [GLL22] to obtain private algorithms for 2\ell_{2}-structured ERM and SCO problems (see Proposition 8). Our work suggests a natural non-Euclidean generalization of these sampling problems, which is useful to study from an algorithmic perspective. We are optimistic that a non-Euclidean variant of [LST21] may shed light on these mysteries and yield new efficient private algorithms. More generally (beyond the particular [LST21] framework), we state the direction of designing efficient samplers for densities of the form exp(F𝒟μr)\exp(-F_{\mathcal{D}}-\mu r) satisfying Assumption 1 as an important open research endeavor with implications for both sampling and private optimization, the latter of which this paper demonstrates.

2 Preliminaries

General notation.

Throughout, \widetilde{O} hides logarithmic factors in problem parameters when clear from the context. For n\in\mathbb{N}, [n] refers to the naturals 1\leq i\leq n. We use \mathcal{X} to denote a compact convex subset of \mathbb{R}^{d}. The standard (\ell_{2}) Euclidean norm is denoted \left\lVert\cdot\right\rVert_{2}. We will be concerned with optimizing functions f:\mathcal{X}\to\mathbb{R}, and \left\lVert\cdot\right\rVert_{\mathcal{X}} will refer to a norm on \mathcal{X}. The diameter of such a set is denoted \textup{diam}_{\left\lVert\cdot\right\rVert_{\mathcal{X}}}(\mathcal{X}):=\max_{x,y\in\mathcal{X}}\left\lVert x-y\right\rVert_{\mathcal{X}}. We let \mathcal{N}(\mu,\boldsymbol{\Sigma}) be the Gaussian density of specified mean and covariance. We denote the convex hull of a set S (when well-defined) by \mathrm{conv}(S). When a,b\in\mathbb{R}^{d}, we abuse notation and let [a,b] be the line segment between a and b.

Norms.

For p\geq 1, we let \left\lVert\cdot\right\rVert_{p} applied to a vector-valued variable be the \ell_{p} norm, namely \left\lVert v\right\rVert_{p}=(\sum_{i\in[d]}|v_{i}|^{p})^{1/p} for v\in\mathbb{R}^{d}; the \ell_{\infty} norm is the maximum absolute value. We will use the well-known inequality

\left\lVert v\right\rVert_{p}\leq\left\lVert v\right\rVert_{q}\leq d^{\frac{1}{q}-\frac{1}{p}}\left\lVert v\right\rVert_{p},\text{ for }v\in\mathbb{R}^{d},\;q\leq p. (2)

Matrices will be denoted in boldface throughout, and \left\lVert\cdot\right\rVert_{p} applied to a matrix-valued variable \mathbf{M} is the Schatten-p norm, i.e. the \ell_{p} norm of the singular values of \mathbf{M}.
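
As a quick numerical sanity check of (2) (our illustration, not part of the paper), the following snippet verifies both inequalities for random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, q = 50, 3.0, 1.5                                  # any q <= p
for _ in range(100):
    v = rng.standard_normal(d)
    lp, lq = np.linalg.norm(v, p), np.linalg.norm(v, q)
    assert lp <= lq + 1e-12                             # ||v||_p <= ||v||_q
    assert lq <= d ** (1.0 / q - 1.0 / p) * lp + 1e-9   # ||v||_q <= d^(1/q - 1/p) ||v||_p
```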

Optimization.

In the following discussion, fix some f:\mathcal{X}\to\mathbb{R}. We say f is G-Lipschitz in \left\lVert\cdot\right\rVert_{\mathcal{X}} if for all x,y\in\mathcal{X}, |f(x)-f(y)|\leq G\left\lVert x-y\right\rVert_{\mathcal{X}}. We say f is \mu-strongly convex in \left\lVert\cdot\right\rVert_{\mathcal{X}} if for all x,y\in\mathcal{X} and t\in[0,1],

f\left(tx+(1-t)y\right)\leq tf(x)+(1-t)f(y)-\frac{\mu t(1-t)}{2}\left\lVert x-y\right\rVert_{\mathcal{X}}^{2}.

Probability.

For two densities \pi,\pi^{\prime}, we define their total variation distance by \left\lVert\pi-\pi^{\prime}\right\rVert_{\textup{TV}}:=\frac{1}{2}\int|\pi(x)-\pi^{\prime}(x)|dx and (when the Radon-Nikodym derivative exists) their KL divergence by D_{\textup{KL}}(\pi\|\pi^{\prime}):=\int\pi(x)\log\frac{\pi(x)}{\pi^{\prime}(x)}dx. We define the 2-Wasserstein distance by

W_{2}(\pi,\pi^{\prime})=\inf_{\Gamma\in\Gamma(\pi,\pi^{\prime})}\sqrt{\mathbb{E}_{(x,x^{\prime})\sim\Gamma}\|x-x^{\prime}\|_{2}^{2}},

where \Gamma(\pi,\pi^{\prime}) is the set of couplings of \pi and \pi^{\prime}. We note W_{2} satisfies the following inequality.

Fact 1.

Let \textup{Lip}_{2}(f) be the Lipschitz constant in the \ell_{2} norm of a function f. Then, for densities \pi,\pi^{\prime} supported on \mathcal{X},

W_{2}(\pi,\pi^{\prime})\geq\sup_{\textup{Lip}_{2}(f)\leq 1}\int_{\mathcal{X}}f(x)(\pi(x)-\pi^{\prime}(x))dx.
Proof.

This follows from the dual characterization of the 11-Wasserstein distance (which shows supLip(f)1𝒳f(x)(π(x)π(x))𝑑x=infΓΓ(π,π)𝔼(x,x)Γxx2\sup_{\textup{Lip}(f)\leq 1}\int_{\mathcal{X}}f(x)(\pi(x)-\pi^{\prime}(x))dx=\inf_{\Gamma\in\Gamma(\pi,\pi^{\prime})}\mathbb{E}_{(x,x^{\prime})\sim\Gamma}\|x-x^{\prime}\|_{2}), and convexity of the square. ∎

We use \propto to indicate proportionality, e.g. if π\pi is a density and we write πexp(f)\pi\propto\exp(-f), we mean π(x)=exp(f)Z\pi(x)=\frac{\exp(-f)}{Z} where Z:=exp(f(x))𝑑xZ:=\int\exp(-f(x))dx and the integration is over the support of π\pi.

We say that a measure \pi on \mathbb{R}^{d} is logconcave if for any \lambda\in(0,1) and compact A,B\subset\mathbb{R}^{d},

\pi(\lambda A+(1-\lambda)B)\geq\pi(A)^{\lambda}\pi(B)^{1-\lambda}.

We have the following equivalent characterization of logconcave measures.

Proposition 1 ([Bor75b]).

Let π\pi be a measure on d\mathbb{R}^{d}. Let EE be the least affine subspace containing the support of π\pi, and let mEm_{E} be the Lebesgue measure in EE. Then π\pi is logconcave if and only if dπ=fdmEd\pi=fdm_{E}, ff is nonnegative and locally integrable, and logf:E{+}-\log f:E\to\mathbb{R}\cup\{+\infty\} is convex.

In particular, Proposition 1 shows that the measure of any proper affine subspace of E according to \pi is zero. If in the characterization of [Bor75b] the function -\log f is affine, we say \pi is logaffine. Following [Bor75b], we analogously define strong logconcavity with respect to a norm.

Definition 1 (strong logconcavity).

Let π\pi be a measure on d\mathbb{R}^{d}. Let EE be the least affine subspace containing the support of π\pi, and let mEm_{E} be the Lebesgue measure in EE. We say π\pi is μ\mu-strongly logconcave with respect to 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}} if dπ=fdmEd\pi=fdm_{E}, ff is nonnegative and locally integrable, and logf:E{+}-\log f:E\to\mathbb{R}\cup\{+\infty\} is μ\mu-strongly convex in 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}.

Privacy.

Throughout, \mathcal{M} denotes a mechanism, and \mathcal{D} denotes a dataset. We say \mathcal{D} and \mathcal{D}^{\prime} are neighboring if they differ in one entry. We say a mechanism \mathcal{M} satisfies (\epsilon,\delta)-differential privacy if it has output space \Omega and for any neighboring \mathcal{D},\mathcal{D}^{\prime},

\sup_{S\subseteq\Omega}\Pr[\mathcal{M}(\mathcal{D})\in S]-\exp(\epsilon)\Pr[\mathcal{M}(\mathcal{D}^{\prime})\in S]\leq\delta.

We define the privacy curve of two random variables X,Y supported on \Omega by

\delta(X\|Y)(\epsilon):=\sup_{S\subseteq\Omega}\Pr[Y\in S]-\exp(\epsilon)\Pr[X\in S].

We say \mathcal{M} has a privacy curve \delta:\mathbb{R}_{\geq 0}\to[0,1] if for all neighboring \mathcal{D}, \mathcal{D}^{\prime} and all \epsilon\in\mathbb{R}_{\geq 0}, \delta(\mathcal{M}(\mathcal{D})\|\mathcal{M}(\mathcal{D}^{\prime}))(\epsilon)\leq\delta(\epsilon). For any \epsilon\in\mathbb{R}_{\geq 0}, it is clear that such an \mathcal{M} is (\epsilon,\delta(\epsilon))-differentially private. We will frequently compare to the privacy curve of a Gaussian, so we recall the following bound from prior work.

Fact 2 (Gaussian privacy curve, Lemma 6.3, [GLL22]).

Let \delta\in(0,\frac{1}{2}) and \epsilon>0. For any |t|\leq\sqrt{2\log\frac{1}{2\delta}+2\epsilon}-\sqrt{2\log\frac{1}{2\delta}}\leq\frac{\epsilon}{\sqrt{2\log\frac{1}{2\delta}}}, we have \delta\left(\mathcal{N}(0,1)\;\middle\|\;\mathcal{N}(t,1)\right)(\epsilon)\leq\delta.

We will use Fact 2 after deriving our Gaussian differential privacy guarantees [DRS21] for strongly logconcave densities in Theorem 2.
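
For intuition, the Gaussian privacy curve appearing in Fact 2 has a standard closed form in the Gaussian differential privacy literature [DRS21]: for t>0, \delta\left(\mathcal{N}(0,1)\;\middle\|\;\mathcal{N}(t,1)\right)(\epsilon)=\Phi(\frac{t}{2}-\frac{\epsilon}{t})-e^{\epsilon}\Phi(-\frac{t}{2}-\frac{\epsilon}{t}), where \Phi is the standard normal CDF. The snippet below (our addition, not part of the paper) evaluates it and numerically checks the choice of t in Fact 2:

```python
import math

def gaussian_privacy_curve(t, eps):
    """delta(N(0,1) || N(t,1))(eps) for t > 0, via the closed form
    Phi(t/2 - eps/t) - exp(eps) * Phi(-t/2 - eps/t)."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return Phi(t / 2.0 - eps / t) - math.exp(eps) * Phi(-t / 2.0 - eps / t)

eps, delta = 1.0, 1e-6
t = math.sqrt(2 * math.log(1 / (2 * delta)) + 2 * eps) - math.sqrt(2 * math.log(1 / (2 * delta)))
assert gaussian_privacy_curve(t, eps) <= delta   # the bound promised by Fact 2
```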

3 Gaussian differential privacy in general norms

In this section, we give a generalization of Theorem 4.1 of [GLL22], which demonstrates that a regularized exponential mechanism for (Euclidean norm) Lipschitz losses achieves privacy guarantees comparable to an analogous instance of the Gaussian mechanism. The proof from [GLL22] was specialized to the Euclidean setup; to show our more general result, we draw upon the localization technique from convex geometry [LS93, KLS95]. We provide the relevant localization tools we will use in Section 3.1, and prove our Gaussian differential privacy result in Section 3.2.

3.1 Localization

We recall the localization lemma from [FG04]. We remark that the statement in [FG04] is more refined than our statement (in that [FG04] gives a complete characterization of extreme points, whereas we give a superset), but the following form of the [FG04] result suffices for our purposes.

Proposition 2 (Theorem 1, [FG04]).

Let \mathcal{X}\subset\mathbb{R}^{d} be compact and convex, and let f:\mathcal{X}\to\mathbb{R} be upper semi-continuous. Let \mathcal{S}(f) be the set of logconcave densities \nu supported in \mathcal{X} satisfying \int_{\mathcal{X}}fd\nu\geq 0. Each extreme point \nu of \mathrm{conv}(\mathcal{S}(f)) satisfies one of the following.

  • \nu is a Dirac measure at x\in\mathcal{X} such that f(x)\geq 0.

  • \nu is logaffine and supported on [a,b]\subset\mathcal{X} such that \int_{[a,b]}fd\nu=0.

We next derive several extensions of Proposition 2.

Lemma 1 (Strongly logconcave localization).

Let \mathcal{X}\subset\mathbb{R}^{d} be compact and convex, let \beta:\mathcal{X}\to\mathbb{R}_{>0} be continuous, and let f:\mathcal{X}\to\mathbb{R} be upper semi-continuous. Let \mathcal{S}_{\mu,\beta}(f) be the set of probability densities \pi such that \pi is \mu-strongly logconcave with respect to \left\lVert\cdot\right\rVert_{\mathcal{X}} and supported in \mathcal{X}, such that \pi^{\prime}\propto\beta\pi is also \mu-strongly logconcave, and \int_{\mathcal{X}}fd\pi\geq 0. Each extreme point \pi of \mathrm{conv}(\mathcal{S}_{\mu,\beta}(f)) satisfies one of the following.

  • \pi is a Dirac measure at x\in\mathcal{X} such that f(x)\geq 0.

  • \pi is supported on [a,b]\subset\mathcal{X}.

Proof.

Clearly, Dirac measures at xx with f(x)0f(x)\geq 0 are extreme points, so it suffices to consider other extreme points. Given any extreme point π\pi which is not a Dirac measure, we prove the least affine subspace containing the support of π\pi has dimension one, i.e. denoting the least affine subspace containing the support of π\pi by SS, we prove dimS=1\dim S=1.

Assume for the sake of contradiction that dimS2\dim S\geq 2. There exists x0x_{0} in the relative interior of the support of π\pi and a two-dimensional subspace EdE\subset\mathbb{R}^{d} such that x0+ESx_{0}+E\subseteq S. Let S1(E)S_{1}(E) be the unit circle in EE, and for any uS1(E)u\in S_{1}(E) denote Hu={xS:xx0,u=0},Hu+={xS:xx0,u0}H_{u}=\{x\in S:\langle x-x_{0},u\rangle=0\},H_{u}^{+}=\{x\in S:\langle x-x_{0},u\rangle\geq 0\} and Hu={xS:xx0,u0}H_{u}^{-}=\{x\in S:\langle x-x_{0},u\rangle\leq 0\}. Finally, define ϕ:S1(E)\phi:S_{1}(E)\to\mathbb{R} by ϕ(u):=Hu+f𝑑π12(f𝑑π)\phi(u):=\int_{H_{u}^{+}}fd\pi-\frac{1}{2}(\int fd\pi), such that ϕ(u)=0Hu+f𝑑π=12f𝑑π0\phi(u)=0\implies\int_{H_{u}^{+}}fd\pi=\frac{1}{2}\int fd\pi\geq 0.

By Proposition 1, we know π(Hu)=0\pi(H_{u})=0. Moreover, ϕ\phi is continuous since every hyperplane HuH_{u} has π(Hu)=0\pi(H_{u})=0. Since ϕ(u)=ϕ(u)\phi(u)=-\phi(-u), by the intermediate value theorem there exists u0S1(E)u_{0}\in S_{1}(E) such that ϕ(u0)=0\phi(u_{0})=0. We can hence rewrite π\pi as a convex combination of its restrictions to Hu0+H_{u_{0}}^{+} and Hu0H_{u_{0}}^{-}, both of which are μ\mu-strongly logconcave, and whose (renormalized) multiplications by β\beta are also μ\mu-strongly logconcave. Since ϕ(u0)=0\phi(u_{0})=0 both of these restrictions belong to 𝒮μ,β(f)\mathcal{S}_{\mu,\beta}(f), contradicting extremality of π\pi. ∎

We briefly remark that the proof technique used in Lemma 1 is quite general, and the only property we used about 𝒮μ,β\mathcal{S}_{\mu,\beta} is that it is a subset of logconcave densities, and it is closed under restrictions to convex subsets. Similar arguments hold for other density families with these properties. Further, we note that restrictions to compact sets are upper semi-continuous; it is straightforward to verify our applications of Lemma 1 satisfy the upper semi-continuity assumption.

We prove the following two technical lemmas using Lemma 1.

Lemma 2.

Following notation of Lemma 1, fix a continuous function α:𝒳\alpha:\mathcal{X}\to\mathbb{R} and a subset S𝒳S\subset\mathcal{X}. For any probability density π\pi on 𝒳\mathcal{X}, define the renormalized density π~eαπ\tilde{\pi}\propto e^{-\alpha}\pi. Finally, let

g(π):=Prxπ~[xS]eϵPrxπ[xS].g(\pi):=\Pr_{x\sim\tilde{\pi}}[x\in S]-e^{\epsilon}\Pr_{x\sim\pi}[x\in S].

Then maxπ𝒮μ,βg(π)=maxπ𝒮μ,βg(π)\max_{\pi\in\mathcal{S}_{\mu,\beta}}g(\pi)=\max_{\pi\in\mathcal{S}^{*}_{\mu,\beta}}g(\pi) where 𝒮μ,β\mathcal{S}_{\mu,\beta}^{*} is the subset of densities π𝒮μ,β\pi\in\mathcal{S}_{\mu,\beta} satisfying one of the following.

  • π\pi is a Dirac measure at x𝒳x\in\mathcal{X}.

  • π\pi is supported on [a,b]𝒳[a,b]\subset\mathcal{X}.

Proof.

Let 𝒮μ,β(f)𝒮μ,β\mathcal{S}_{\mu,\beta}(f)\subseteq\mathcal{S}_{\mu,\beta} be the set of π𝒮μ,β\pi\in\mathcal{S}_{\mu,\beta} such that f𝑑π0\int fd\pi\geq 0. We have

maxπ𝒮μ,βg(π)\displaystyle\max_{\pi\in\mathcal{S}_{\mu,\beta}}g(\pi) =maxπ𝒮μ,βxS𝑑π~(x)eϵxS𝑑π(x)\displaystyle=\max_{\pi\in\mathcal{S}_{\mu,\beta}}\int_{x\in S}d\tilde{\pi}(x)-e^{\epsilon}\int_{x\in S}d\pi(x)
=maxπ𝒮μ,βxSeα(x)𝑑π(x)x𝒳eα(x)𝑑π(x)eϵxS𝑑π(x)\displaystyle=\max_{\pi\in\mathcal{S}_{\mu,\beta}}\frac{\int_{x\in S}e^{-\alpha(x)}d\pi(x)}{\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x)}-e^{\epsilon}\int_{x\in S}d\pi(x)
=maxπ𝒮μ,βmaxCx𝒳eα(x)𝑑π(x)xSeα(x)𝑑π(x)CeϵxS𝑑π(x)\displaystyle=\max_{\pi\in\mathcal{S}_{\mu,\beta}}\quad\max_{C\geq\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x)}\frac{\int_{x\in S}e^{-\alpha(x)}d\pi(x)}{C}-e^{\epsilon}\int_{x\in S}d\pi(x)
=maxCmaxπ𝒮μ,β(Ceα)x𝒳(eα(x)Ceϵ)𝟏S(x)𝑑π(x)\displaystyle=\max_{C}\max_{\pi\in\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})}\int_{x\in\mathcal{X}}\left(\frac{e^{-\alpha(x)}}{C}-e^{\epsilon}\right)\mathbf{1}_{S}(x)d\pi(x)
=maxCmaxπ𝒮μ,β(Ceα)x𝒳(eα(x)Ceϵ)𝟏S(x)𝑑π(x),\displaystyle=\max_{C}\max_{\pi\in\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*}}\int_{x\in\mathcal{X}}\left(\frac{e^{-\alpha(x)}}{C}-e^{\epsilon}\right)\mathbf{1}_{S}(x)d\pi(x),

where 𝒮μ,β(Ceα)\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*} is the (super)set of extreme points of 𝒮μ,β(Ceα)\mathcal{S}_{\mu,\beta}(C-e^{-\alpha}) given by the strongly logconcave localization lemma (Lemma 1). These candidate extreme points are Dirac measures at xx such that Ceα(x)C\geq e^{-\alpha(x)}, or are supported in [a,b]𝒳[a,b]\subset\mathcal{X}. Hence, 𝒮μ,β(Ceα)𝒮μ,β\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*}\subseteq\mathcal{S}_{\mu,\beta}^{*}, and we conclude

maxπ𝒮μ,βg(π)\displaystyle\max_{\pi\in\mathcal{S}_{\mu,\beta}}g(\pi) =maxCmaxπ𝒮μ,β(Ceα)x𝒳(eα(x)Ceϵ)𝟏S(x)𝑑π(x)\displaystyle=\max_{C}\max_{\pi\in\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*}}\int_{x\in\mathcal{X}}\left(\frac{e^{-\alpha(x)}}{C}-e^{\epsilon}\right)\mathbf{1}_{S}(x)d\pi(x) (3)
maxCmaxπ𝒮μ,β(Ceα)x𝒳(eα(x)x𝒳eα(x)𝑑π(x)eϵ)𝟏S(x)𝑑π(x)\displaystyle\leq\max_{C}\max_{\pi\in\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*}}\int_{x\in\mathcal{X}}\left(\frac{e^{-\alpha(x)}}{\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x)}-e^{\epsilon}\right)\mathbf{1}_{S}(x)d\pi(x) (4)
maxπ𝒮μ,βx𝒳(eα(x)x𝒳eα(x)𝑑π(x)eϵ)𝟏S(x)𝑑π(x)=maxπ𝒮μ,βg(π).\displaystyle\leq\max_{\pi\in\mathcal{S}_{\mu,\beta}^{*}}\int_{x\in\mathcal{X}}\left(\frac{e^{-\alpha(x)}}{\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x)}-e^{\epsilon}\right)\mathbf{1}_{S}(x)d\pi(x)=\max_{\pi\in\mathcal{S}_{\mu,\beta}^{*}}g(\pi). (5)

The first inequality used that Cx𝒳eα(x)𝑑π(x)C\geq\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x) for π𝒮μ,β(Ceα)\pi\in\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*}, and the second used that 𝒮μ,β(Ceα)𝒮μ,β\mathcal{S}_{\mu,\beta}(C-e^{-\alpha})^{*}\subseteq\mathcal{S}_{\mu,\beta}^{*} for any CC. Since 𝒮μ,β𝒮μ,β\mathcal{S}_{\mu,\beta}^{*}\subseteq\mathcal{S}_{\mu,\beta}, we have the claim. ∎

Lemma 3.

Following notation of Lemma 1, fix continuous function α:𝒳\alpha:\mathcal{X}\to\mathbb{R} and upper semi-continuous function f:𝒳f:\mathcal{X}\to\mathbb{R}. For any probability density π\pi on 𝒳\mathcal{X}, define π~eαπ\tilde{\pi}\propto e^{-\alpha}\pi to be a renormalized density on 𝒳\mathcal{X}. Finally, let

g(π):=x𝒳f(x)d(ππ~)(x).\displaystyle g(\pi):=\int_{x\in\mathcal{X}}f(x)d(\pi-\tilde{\pi})(x).

Then maxπ𝒮μ,βg(π)=maxπ𝒮μ,βg(π)\max_{\pi\in\mathcal{S}_{\mu,\beta}}g(\pi)=\max_{\pi\in\mathcal{S}_{\mu,\beta}^{*}}g(\pi) where 𝒮μ,β\mathcal{S}_{\mu,\beta}^{*} is the subset of densities π𝒮μ,β\pi\in\mathcal{S}_{\mu,\beta} satisfying one of the following.

  • π\pi is a Dirac measure at x𝒳x\in\mathcal{X}.

  • π\pi is supported on [a,b]𝒳[a,b]\subset\mathcal{X}.

Proof.

We follow the notation from Lemma 2. If π\pi is a Dirac measure, g(π)=0g(\pi)=0, so we only need to consider the case when maxπ𝒮μ,βg(π)>0\max_{\pi\in\mathcal{S}_{\mu,\beta}}g(\pi)>0. We have

maxπ𝒮μ,βg(π)\displaystyle\max_{\pi\in\mathcal{S}_{\mu,\beta}}g(\pi) =maxπ𝒮μ,βx𝒳f(x)(1eα(x)x𝒳eα(x)𝑑π(x))𝑑π(x)\displaystyle=\max_{\pi\in\mathcal{S}_{\mu,\beta}}\int_{x\in\mathcal{X}}f(x)\left(1-\frac{e^{-\alpha(x)}}{\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x)}\right)d\pi(x)
=maxπ𝒮μ,βmaxCx𝒳eα(x)𝑑π(x)x𝒳(f(x)eα(x)f(x)C)𝑑π(x)\displaystyle=\max_{\pi\in\mathcal{S}_{\mu,\beta}}\quad\max_{C\leq\int_{x\in\mathcal{X}}e^{-\alpha(x)}d\pi(x)}\int_{x\in\mathcal{X}}\left(f(x)-\frac{e^{-\alpha(x)}f(x)}{C}\right)d\pi(x)
=maxCmaxπ𝒮μ,β(eαC)x𝒳(f(x)eα(x)f(x)C)𝑑π(x)\displaystyle=\max_{C}\max_{\pi\in\mathcal{S}_{\mu,\beta}(e^{-\alpha}-C)}\int_{x\in\mathcal{X}}\left(f(x)-\frac{e^{-\alpha(x)}f(x)}{C}\right)d\pi(x)
=maxCmaxπ𝒮μ,β(eαC)x𝒳(f(x)eα(x)f(x)C)𝑑π(x).\displaystyle=\max_{C}\max_{\pi\in\mathcal{S}_{\mu,\beta}(e^{-\alpha}-C)^{*}}\int_{x\in\mathcal{X}}\left(f(x)-\frac{e^{-\alpha(x)}f(x)}{C}\right)d\pi(x).

The remainder of the proof is analogous to Lemma 2. ∎

3.2 Gaussian differential privacy

Using an instantiation of the localization lemma, we prove Gaussian differential privacy in general norms by first reducing to one dimension and then using the result of [GLL22] to handle the one-dimensional case. Gaussian differential privacy was introduced by [DRS21] and is a useful tool to compare privacy curves. We first recall the (2\ell_{2}) Gaussian differential privacy result of [GLL22].

Proposition 3 (Theorem 4.1, [GLL22]).

Let \mathcal{X}\subset\mathbb{R}^{d} be compact and convex, let F,\widetilde{F}:\mathcal{X}\to\mathbb{R} be \mu-strongly convex in \left\lVert\cdot\right\rVert_{2}, and let P\propto\exp(-F) and Q\propto\exp(-\widetilde{F}). Suppose \widetilde{F}-F is G-Lipschitz in \left\lVert\cdot\right\rVert_{2}. For all \epsilon\in\mathbb{R}_{\geq 0},

\delta\left(P\;\middle\|\;Q\right)(\epsilon)\leq\delta\left(\mathcal{N}(0,1)\;\middle\|\;\mathcal{N}\left(\frac{G}{\sqrt{\mu}},1\right)\right)(\epsilon).

We next give a simple comparison result between norms.

Lemma 4.

For f:\mathcal{X}\to\mathbb{R}, fix a,b\in\mathcal{X}, and let \tilde{f}:[a,b]\to\mathbb{R} be the restriction of f to [a,b].

  1. If f is G-Lipschitz in \left\lVert\cdot\right\rVert_{\mathcal{X}}, then \tilde{f} is G\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}}{\left\lVert b-a\right\rVert_{2}}-Lipschitz in \left\lVert\cdot\right\rVert_{2}.

  2. If f is \mu-strongly convex in \left\lVert\cdot\right\rVert_{\mathcal{X}}, then \tilde{f} is \mu\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}^{2}}{\left\lVert b-a\right\rVert_{2}^{2}}-strongly convex in \left\lVert\cdot\right\rVert_{2}.

Proof.

To see the first claim, let c=a+r(ba)c=a+r(b-a) and d=a+s(ba)d=a+s(b-a) for s,r[0,1]s,r\in[0,1]. We have by Lipschitzness of ff that

|f~(d)f~(c)|G|sr|ba𝒳=(Gba𝒳ba2)dc2.\displaystyle\left|\tilde{f}(d)-\tilde{f}(c)\right|\leq G\left|s-r\right|\left\lVert b-a\right\rVert_{\mathcal{X}}=\left(G\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}}{\left\lVert b-a\right\rVert_{2}}\right)\cdot\left\lVert d-c\right\rVert_{2}.

Similarly, to see the second claim, by strong convexity of ff,

f~(tc+(1t)d)\displaystyle\tilde{f}\left(tc+(1-t)d\right) tf~(c)+(1t)f~(d)μt(1t)2cd𝒳2\displaystyle\leq t\tilde{f}(c)+(1-t)\tilde{f}(d)-\frac{\mu t(1-t)}{2}\left\lVert c-d\right\rVert_{\mathcal{X}}^{2}
=tf~(c)+(1t)f~(d)μt(1t)(rs)22ab𝒳2\displaystyle=t\tilde{f}(c)+(1-t)\tilde{f}(d)-\frac{\mu t(1-t)(r-s)^{2}}{2}\left\lVert a-b\right\rVert_{\mathcal{X}}^{2}
=tf~(c)+(1t)f~(d)(μba𝒳2ba22)(t(1t)2dc22).\displaystyle=t\tilde{f}(c)+(1-t)\tilde{f}(d)-\left(\mu\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}^{2}}{\left\lVert b-a\right\rVert_{2}^{2}}\right)\left(\frac{t(1-t)}{2}\left\lVert d-c\right\rVert_{2}^{2}\right).

We now present our main result on Gaussian differential privacy with respect to arbitrary norms.

We now prove Theorem 2, restated from Section 1.2.

Proof.

Throughout this proof, fix some α\alpha which is GG-Lipschitz in 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}} by assumption. We first claim that amongst all μ\mu-strongly convex (in 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}) functions F:𝒳F:\mathcal{X}\to\mathbb{R} such that F+αF+\alpha is also μ\mu-strongly convex, defining Pexp(F)P\propto\exp(-F) and Qexp((F+α))Q\propto\exp(-(F+\alpha)), some FF maximizing δ(PQ)(ϵ)\delta\left(P\;\middle\|\;Q\right)(\epsilon) is either a Dirac measure or supported on [a,b]𝒳[a,b]\subset\mathcal{X}. We will prove this by contradiction.

Suppose otherwise, and let FF be a μ\mu-strongly convex function that maximizes δ(PQ)(ϵ)\delta\left(P\;\middle\|\;Q\right)(\epsilon) defined above. Define Pexp(F)P\propto\exp(-F) and Qexp((F+α))Q\propto\exp(-(F+\alpha)). Let S𝒳S^{*}\subseteq\mathcal{X} be the set achieving

δ(PQ)(ϵ)=PrXP[XS]exp(ϵ)PrXQ[XS].\delta\left(P\;\middle\|\;Q\right)(\epsilon)=\Pr_{X\sim P}[X\in S^{*}]-\exp(\epsilon)\Pr_{X\sim Q}[X\in S^{*}].

By Lemma 2, there is another μ\mu-strongly logconcave π\pi where the renormalized density πexp(α)\propto\pi\exp(-\alpha) is also μ\mu-strongly logconcave, such that (following notation of Lemma 2) g(π)g(P)g(\pi)\geq g(P), where π\pi is either a Dirac or supported on [a,b][a,b]. We conclude that δ(PQ)(ϵ)δ(ππexp(α))(ϵ)\delta\left(P\;\middle\|\;Q\right)(\epsilon)\leq\delta\left(\pi\;\middle\|\;\pi\exp(-\alpha)\right)(\epsilon) (since the maximizing set for π\pi is at least as good as SS^{*}), a contradiction.

It hence suffices to prove the theorem statement for F,\widetilde{F} which are supported on some [a,b]\subset\mathcal{X}. By Lemma 4, we have that \widetilde{F}-F is G\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}}{\left\lVert b-a\right\rVert_{2}}-Lipschitz in \left\lVert\cdot\right\rVert_{2} and F,\widetilde{F} are \mu\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}^{2}}{\left\lVert b-a\right\rVert_{2}^{2}}-strongly convex in \left\lVert\cdot\right\rVert_{2}. We conclude by Proposition 3, which shows

δ(PQ)(ϵ)\displaystyle\delta\left(P\;\middle\|\;Q\right)(\epsilon) δ(𝒩(0,1)𝒩(Gμba𝒳ba2ba2ba𝒳,1))(ϵ)\displaystyle\leq\delta\left(\mathcal{N}(0,1)\;\middle\|\;\mathcal{N}\left(\frac{G}{\sqrt{\mu}}\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}}{\left\lVert b-a\right\rVert_{2}}\cdot\frac{\left\lVert b-a\right\rVert_{2}}{\left\lVert b-a\right\rVert_{\mathcal{X}}},1\right)\right)(\epsilon)
=δ(𝒩(0,1)𝒩(Gμ,1))(ϵ).\displaystyle=\delta\left(\mathcal{N}(0,1)\;\middle\|\;\mathcal{N}\left(\frac{G}{\sqrt{\mu}},1\right)\right)(\epsilon).

Our proof strategy is a reduction to an application of Proposition 3 in one dimension. It is an interesting open question to obtain a simpler direct proof of Proposition 3 in the one-dimensional setting (without using the machinery of [GLL22]), which is tight up to constant factors.

4 Private ERM and SCO in general norms

In this section, we derive our results for private ERM and SCO in general norms. We will state our results for private ERM (Section 4.1) and SCO (Section 4.2) with respect to an arbitrary compact convex subset 𝒳\mathcal{X} of a dd-dimensional normed space, satisfying Assumption 1. We then use this to derive guarantees for a variety of settings of import in Section 4.3.

4.1 Private ERM under Assumption 1

To develop our private ERM algorithms, we recall the following risk guarantee from [dKL18] of sampling from Gibbs distributions (improving upon [KV06, BST14]).

Proposition 4 ([dKL18], Corollary 1).

Let \mathcal{X}\subset\mathbb{R}^{d} be compact and convex, let F:\mathcal{X}\to\mathbb{R} be convex, and let k>0. If \nu\propto\exp(-kF),

\mathbb{E}_{x\sim\nu}[F(x)]\leq\min_{x\in\mathcal{X}}F(x)+\frac{d}{k}.
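
As a quick one-dimensional illustration of Proposition 4 (ours, not from the paper), take F(x)=|x| on [-1,1]; the Gibbs distribution \nu\propto\exp(-kF) should then have excess risk at most \frac{d}{k}=\frac{1}{k}:

```python
import numpy as np

def gibbs_excess_risk(k, grid=200_000):
    # E_{x ~ nu}[F(x)] - min F for F(x) = |x| on [-1, 1] and nu proportional to exp(-k F)
    xs = np.linspace(-1.0, 1.0, grid)
    F = np.abs(xs)
    w = np.exp(-k * F)
    w /= w.sum()
    return float(np.dot(w, F)) - F.min()

for k in [1.0, 3.0, 10.0]:
    assert gibbs_excess_risk(k) <= 1.0 / k   # Proposition 4 with d = 1
```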

We conclude by a simple combination of Proposition 4 (providing a risk guarantee) and Theorem 2 (providing a privacy guarantee), which yields our main result on private ERM.

Theorem 3 (Private ERM).

Under Assumption 1 and following notation (1), drawing a sample x from the density \nu\propto\exp(-k(F_{\mathcal{D}}+\mu r)) for

k=\frac{\sqrt{d}n\epsilon}{G\sqrt{2\Theta\log\frac{1}{2\delta}}},\;\mu=\frac{G\sqrt{2d\log\frac{1}{2\delta}}}{\sqrt{\Theta}n\epsilon},

is (\epsilon,\delta)-differentially private, and produces x such that

\mathbb{E}_{x\sim\nu}[F_{\mathcal{D}}(x)]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x)\leq G\sqrt{\Theta}\cdot\frac{\sqrt{8d\log\frac{1}{2\delta}}}{n\epsilon}.
Proof.

Let F𝒟F_{\mathcal{D}^{\prime}} be the realization of (1) when 𝒟\mathcal{D} is replaced with a neighboring dataset 𝒟\mathcal{D}^{\prime} which agrees in all entries except some sample sisis^{\prime}_{i}\neq s_{i}. By Assumption 1, we have k(F𝒟F𝒟)k(F_{\mathcal{D}}-F_{\mathcal{D}^{\prime}}) is kGn\frac{kG}{n}-Lipschitz, and both k(F𝒟+μr)k(F_{\mathcal{D}}+\mu r) and k(F𝒟+μr)k(F_{\mathcal{D}^{\prime}}+\mu r) are kμk\mu-strongly convex (all with respect to 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}). Hence, combining Theorem 2 and Fact 2 shows the mechanism is (ϵ,δ)(\epsilon,\delta)-differentially private, since

μ=2G2klog12δn2ϵ2Gknμϵ2log12δ.\mu=\frac{2G^{2}k\log\frac{1}{2\delta}}{n^{2}\epsilon^{2}}\implies\frac{G\sqrt{k}}{n\sqrt{\mu}}\leq\frac{\epsilon}{\sqrt{2\log\frac{1}{2\delta}}}. (6)

Let x𝒟:=argminx𝒳F𝒟(x)x^{\star}_{\mathcal{D}}:=\textup{argmin}_{x\in\mathcal{X}}F_{\mathcal{D}}(x). We obtain the risk guarantee by the calculation (see Proposition 4)

𝔼xν[F𝒟(x)]\displaystyle\mathbb{E}_{x\sim\nu}[F_{\mathcal{D}}(x)] F𝒟(x𝒟)+(μr(x𝒟)𝔼xνμr(x))+dk\displaystyle\leq F_{\mathcal{D}}(x^{\star}_{\mathcal{D}})+\left(\mu r(x^{\star}_{\mathcal{D}})-\mathbb{E}_{x\sim\nu}\mu r(x)\right)+\frac{d}{k}
F𝒟(x𝒟)+μΘ+dk\displaystyle\leq F_{\mathcal{D}}(x^{\star}_{\mathcal{D}})+\mu\Theta+\frac{d}{k}

and plugging in our choices of μ\mu and kk. ∎
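
To make the parameter choices of Theorem 3 concrete, the following helper (our illustration, with hypothetical names) computes k, \mu, and the resulting excess empirical risk bound from the problem parameters:

```python
import math

def erm_mechanism_parameters(G, Theta, d, n, eps, delta):
    """k, mu, and the excess empirical risk bound of Theorem 3."""
    L = math.log(1.0 / (2.0 * delta))
    k = math.sqrt(d) * n * eps / (G * math.sqrt(2.0 * Theta * L))
    mu = G * math.sqrt(2.0 * d * L) / (math.sqrt(Theta) * n * eps)
    risk_bound = G * math.sqrt(Theta) * math.sqrt(8.0 * d * L) / (n * eps)
    return k, mu, risk_bound

# e.g. a unit Euclidean ball with r(x) = ||x||_2^2 / 2, so Theta = 1/2:
k, mu, bound = erm_mechanism_parameters(G=1.0, Theta=0.5, d=100, n=10_000, eps=1.0, delta=1e-6)
```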

4.2 Private SCO under Assumption 1

We first give a generic comparison result between population risk and empirical risk under Assumption 1. To do so, we use two helper results from prior work. The first was derived in [GLL22] by combining a transportation inequality and a log-Sobolev inequality (see e.g. [OV00]).

Proposition 5 ([GLL22], Theorem 6.7, Lemma 6.8).

Let 𝒳d\mathcal{X}\subseteq\mathbb{R}^{d} be compact and convex, let F,F~:𝒳F,\widetilde{F}:\mathcal{X}\to\mathbb{R} be μ\mu-strongly convex in 2\left\lVert\cdot\right\rVert_{2}, and let Pexp(F)P\propto\exp(-F) and Qexp(F~)Q\propto\exp(-\widetilde{F}). Suppose F~F\widetilde{F}-F is HH-Lipschitz in 2\left\lVert\cdot\right\rVert_{2}. Then, W2(P,Q)HμW_{2}(P,Q)\leq\frac{H}{\mu}.

Corollary 1.

Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} be compact and convex, and let α,f:𝒳\alpha,f:\mathcal{X}\to\mathbb{R} be HH-Lipschitz and GG-Lipschitz respectively in 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}. Let 𝒮μ,exp(α)\mathcal{S}_{\mu,\exp(-\alpha)} be the set of densities π\pi over 𝒳\mathcal{X} such that π\pi is μ\mu-strongly logconcave with respect to 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}, and π~πexp(α)\tilde{\pi}\propto\pi\exp(-\alpha) is also μ\mu-strongly logconcave. For any π𝒮μ,exp(α)\pi\in\mathcal{S}_{\mu,\exp(-\alpha)} define g(π):=𝒳f(x)d(ππ~)(x)g(\pi):=\int_{\mathcal{X}}f(x)d\left(\pi-\tilde{\pi}\right)(x) where π~πexp(α)\tilde{\pi}\propto\pi\exp(-\alpha). Then, g(π)GHμg(\pi)\leq\frac{GH}{\mu}.

Proof.

By Lemma 3 (and following its notation), it suffices to show g(π)GHμg(\pi)\leq\frac{GH}{\mu} for all π𝒮μ,exp(α)\pi\in\mathcal{S}^{*}_{\mu,\exp(-\alpha)}. Clearly this is true for a Dirac measure π\pi as then g(π)=0g(\pi)=0, so consider the other case where π\pi is supported on [a,b][a,b], such that πexp(F)\pi\propto\exp(-F) and FF is μ\mu-strongly convex in 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}. Further, define F~=F+α\widetilde{F}=F+\alpha, so that F~\widetilde{F} is also μ\mu-strongly convex and supported on [a,b][a,b].

By Lemma 4, restricting to [a,b][a,b], FF and F~\widetilde{F} are μba𝒳2ba22\mu\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}^{2}}{\left\lVert b-a\right\rVert_{2}^{2}}-strongly convex in 2\left\lVert\cdot\right\rVert_{2}, FF~F-\widetilde{F} is Hba𝒳ba2H\cdot\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}}{\left\lVert b-a\right\rVert_{2}}-Lipschitz in 2\ell_{2} and ff is ba𝒳ba2\frac{\left\lVert b-a\right\rVert_{\mathcal{X}}}{\left\lVert b-a\right\rVert_{2}}-Lipschitz in 2\left\lVert\cdot\right\rVert_{2}. Hence, where the inequalities are by Fact 1 and Proposition 5 respectively,

g(π)=𝒳f(x)d(ππ~)(x)GW2(π,π~)GHμ.g(\pi)=\int_{\mathcal{X}}f(x)d(\pi-\tilde{\pi})(x)\leq GW_{2}(\pi,\tilde{\pi})\leq\frac{GH}{\mu}.

The second relates the population risk to the empirical risk on an independent sample.

Proposition 6 (Lemma 7, [BE02]).

Suppose \mathcal{D}=\{s_{i}\}_{i\in[n]} is drawn independently from \mathcal{P}, let s\sim\mathcal{P} be drawn independently from \mathcal{D}, and let \mathcal{D}^{\prime}:=\{s\}\cup\{s_{i}\}_{i\in[n]\setminus\{1\}} be \mathcal{D} where s_{1} is swapped with s. Then, for any symmetric mechanism \mathcal{M}:\textup{supp}(\mathcal{P})^{n}\to\mathbb{R}^{d} (here, a symmetric mechanism is one which only depends on the set of inputs rather than their order),

\mathbb{E}\left[F_{\textup{pop}}(\mathcal{M}(\mathcal{D}))-F_{\mathcal{D}}(\mathcal{M}(\mathcal{D}))\right]=\mathbb{E}\left[f(\mathcal{M}(\mathcal{D});s)-f(\mathcal{M}(\mathcal{D}^{\prime});s)\right],

where expectations are over \mathcal{M} and the randomness used in producing \mathcal{D} and s.

By applying Corollary 1 and Proposition 6 (which bound the generalization error of our mechanism), we provide the following extension of Theorem 3, our main result on private SCO.

Theorem 4 (Private SCO).

Under Assumption 1 and following notation (1), drawing a sample x from the density \nu\propto\exp(-k(F_{\mathcal{D}}+\mu r)) for

k=\sqrt{\frac{d+C_{2}}{C_{1}}},\;\mu=\frac{2G^{2}k\log\frac{1}{2\delta}}{n^{2}\epsilon^{2}},\;C_{1}:=\frac{2G^{2}\Theta\log\frac{1}{2\delta}}{n^{2}\epsilon^{2}},\;C_{2}:=\frac{n\epsilon^{2}}{2\log\frac{1}{2\delta}},

is (\epsilon,\delta)-differentially private, and produces x such that

\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x)\leq G\sqrt{\Theta}\cdot\left(\frac{\sqrt{8d\log\frac{1}{2\delta}}}{n\epsilon}+\sqrt{\frac{8}{n}}\right).
Proof.

For the given choice of k,μk,\mu, the privacy proof follows identically to Theorem 3, so we focus on the risk proof. We follow the notation of Proposition 6 and let s𝒫s\sim\mathcal{P} independently from 𝒟\mathcal{D}. By exchanging the expectation and minimum and using that 𝔼𝒟𝒫nF𝒟=Fpop\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n}}F_{\mathcal{D}}=F_{\textup{pop}},

𝔼𝒟𝒫n,xν[Fpop(x)]minx𝒳Fpop(x)\displaystyle\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\sim\mathcal{X}}F_{\textup{pop}}(x) 𝔼𝒟𝒫n[𝔼xν[Fpop(x)]minx𝒳[F𝒟(x)]]\displaystyle\leq\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}\left[F_{\mathcal{D}}(x)\right]\right]
𝔼𝒟𝒫n[𝔼xν[Fpop(x)F𝒟(x)]]\displaystyle\leq\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\textup{pop}}(x)-F_{\mathcal{D}}(x)\right]\right]
+𝔼𝒟𝒫n[𝔼xν[F𝒟(x)]minx𝒳[F𝒟(x)]]\displaystyle+\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\mathcal{D}}(x)\right]-\min_{x\in\mathcal{X}}\left[F_{\mathcal{D}}(x)\right]\right]
𝔼𝒟𝒫n[𝔼xν[Fpop(x)F𝒟(x)]]+μΘ+dk,\displaystyle\leq\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\textup{pop}}(x)-F_{\mathcal{D}}(x)\right]\right]+\mu\Theta+\frac{d}{k},

where we bounded the empirical risk in the proof of Theorem 3. Next, let ν\nu^{\prime} be the density exp(k(F𝒟+μr))\propto\exp(-k(F_{\mathcal{D}^{\prime}}+\mu r)). Our mechanism is symmetric, and hence by Proposition 6,

𝔼[Fpop(x)F𝒟(x)]=𝔼[𝔼xν[f(x;s)]𝔼xν[f(x;s)]]\mathbb{E}\left[F_{\textup{pop}}(x)-F_{\mathcal{D}}(x)\right]=\mathbb{E}\left[\mathbb{E}_{x\sim\nu}\left[f(x;s)\right]-\mathbb{E}_{x\sim\nu^{\prime}}\left[f(x;s)\right]\right]

where the outer expectations are over the randomness of drawing 𝒟,s\mathcal{D},s. Finally, for any fixed realization of 𝒟,s\mathcal{D},s, the densities ν,ν\nu,\nu^{\prime} satisfy the assumption of Corollary 1 with H=GnH=\frac{G}{n}, and f(;s)f(\cdot;s) is GG-Lipschitz, so Corollary 1 shows that

𝔼xν[f(x;s)]𝔼xν[f(x;s)]G2nμ.\mathbb{E}_{x\sim\nu}\left[f(x;s)\right]-\mathbb{E}_{x\sim\nu^{\prime}}\left[f(x;s)\right]\leq\frac{G^{2}}{n\mu}.

Combining the above three displays bounds the population risk by

𝔼𝒟𝒫n,xν[Fpop(x)]minx𝒳Fpop(x)\displaystyle\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x) G2nμ+μΘ+dk\displaystyle\leq\frac{G^{2}}{n\mu}+\mu\Theta+\frac{d}{k}
=C1k+C2+dk,\displaystyle=C_{1}k+\frac{C_{2}+d}{k},

for our given value of μ\mu. The conclusion follows by optimizing over kk, which yields a risk of 2C1(C2+d)2\sqrt{C_{1}(C_{2}+d)}, and then applying the scalar inequality a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for nonnegative a,ba,b. ∎
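For concreteness, the following Python sketch (illustrative only; the numerical values of n, d, ε, δ, G, Θ are hypothetical) computes the constants C1, C2 and the parameters k, μ of Theorem 4, and checks that the risk expression C1·k + (C2+d)/k from the proof equals its optimized value 2√(C1(C2+d)) at the chosen k and is dominated by the stated bound.

```python
import math

# Hypothetical problem parameters (for illustration only).
n, d = 10_000, 100          # samples, dimension
eps, delta = 1.0, 1e-6      # privacy parameters
G, Theta = 1.0, 1.0         # Lipschitz constant and regularizer range bound

# Constants from Theorem 4.
C1 = 2 * G**2 * Theta * math.log(1 / (2 * delta)) / (n**2 * eps**2)
C2 = n * eps**2 / (2 * math.log(1 / (2 * delta)))

# Inverse temperature k and regularization weight mu.
k = math.sqrt((d + C2) / C1)
mu = 2 * G**2 * k * math.log(1 / (2 * delta)) / (n**2 * eps**2)

# Population risk bound from the proof: C1*k + (C2 + d)/k, minimized at the k above.
risk = C1 * k + (C2 + d) / k
assert abs(risk - 2 * math.sqrt(C1 * (C2 + d))) < 1e-9

# Looser closed form stated in Theorem 4 (via sqrt(a+b) <= sqrt(a) + sqrt(b)).
stated = G * math.sqrt(Theta) * (math.sqrt(8 * d * math.log(1 / (2 * delta))) / (n * eps)
                                 + math.sqrt(8 / n))
assert risk <= stated + 1e-12
print(f"k = {k:.3g}, mu = {mu:.3g}, risk bound = {risk:.3g} (stated: {stated:.3g})")
```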

4.3 Applications

To derive our private optimization algorithms for p\ell_{p}-norm and Schatten-pp norm geometries, we recall the following results on the existence of bounded strongly convex regularizers.

Proposition 7 ([BCL94]).

For 1<p21<p\leq 2, letting p\left\lVert\cdot\right\rVert_{p} be the p\ell_{p} norm of a vector, r(v):=12(p1)vp2r(v):=\frac{1}{2(p-1)}\|v\|_{p}^{2} is 11-strongly convex in p\left\lVert\cdot\right\rVert_{p}. Similarly, for 1<p21<p\leq 2, letting p\left\lVert\cdot\right\rVert_{p} be the Schatten-pp norm of a matrix, r(𝐌):=12(p1)𝐌p2r(\mathbf{M}):=\frac{1}{2(p-1)}\|\mathbf{M}\|_{p}^{2} is 11-strongly convex in p\left\lVert\cdot\right\rVert_{p}.
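The regularizers of Proposition 7 are explicit and easy to write down; the following sketch (code and names are ours, for illustration) implements both the vector and Schatten versions, which we instantiate in the corollaries below.

```python
import numpy as np

def lp_regularizer(v: np.ndarray, p: float) -> float:
    """r(v) = ||v||_p^2 / (2(p-1)); 1-strongly convex w.r.t. ||.||_p for 1 < p <= 2."""
    return np.linalg.norm(v, ord=p) ** 2 / (2 * (p - 1))

def schatten_regularizer(M: np.ndarray, p: float) -> float:
    """r(M) = ||M||_{S_p}^2 / (2(p-1)), with ||M||_{S_p} the l_p norm of the singular values."""
    sigma = np.linalg.svd(M, compute_uv=False)
    return np.linalg.norm(sigma, ord=p) ** 2 / (2 * (p - 1))
```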

We next state a useful result of [GLL22] (building upon the framework of [LST21]) on efficiently sampling from Lipschitz, strongly logconcave densities under value oracle access. We slightly specialize the result of [GLL22], giving a rephrasing sufficient for our purposes.

Proposition 8 ([GLL22], Theorem 2.3).

Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} be compact and convex with diam2(𝒳)D\textup{diam}_{\left\lVert\cdot\right\rVert_{2}}(\mathcal{X})\leq D. Let 𝒟={si}i[n]\mathcal{D}=\{s_{i}\}_{i\in[n]} and let F~𝒟(x)=1ni[n]f(x;si)+ψ(x)\widetilde{F}_{\mathcal{D}}(x)=\frac{1}{n}\sum_{i\in[n]}f(x;s_{i})+\psi(x) such that all f(;si):𝒳f(\cdot;s_{i}):\mathcal{X}\to\mathbb{R} are GG-Lipschitz in 2\left\lVert\cdot\right\rVert_{2} and convex, and ψ(x):𝒳\psi(x):\mathcal{X}\to\mathbb{R} is μ\mu-strongly convex in 2\left\lVert\cdot\right\rVert_{2}. For δ(0,12)\delta\in(0,\frac{1}{2}), we can generate a sample within total variation δ\delta of the density exp(F~𝒟)\propto\exp(-\widetilde{F}_{\mathcal{D}}) in NN value queries to some f(;si)f(\cdot;s_{i}) and samples from densities exp(ψ12ηv22)\propto\exp\left(-\psi-\frac{1}{2\eta}\|\cdot-v\|_{2}^{2}\right) for some η>0\eta>0, vdv\in\mathbb{R}^{d}, where

N=O(G2μlog2(G2(D2+μ1)dδ)).\displaystyle N=O\left(\frac{G^{2}}{\mu}\log^{2}\left(\frac{G^{2}(D^{2}+\mu^{-1})d}{\delta}\right)\right).

p\ell_{p} norms.

We now state our results on private convex optimization under p\ell_{p} geometry. As a preliminary, we combine the norm equivalence bounds (2) and Proposition 8 to give the following algorithmic result on sampling from a logconcave distribution in p\ell_{p} geometry under value oracle access.

Proposition 9.

Let p1p\geq 1 and let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} be compact and convex with diamp(𝒳)D\textup{diam}_{\left\lVert\cdot\right\rVert_{p}}(\mathcal{X})\leq D. Let 𝒟={si}i[n]\mathcal{D}=\{s_{i}\}_{i\in[n]} and let F~𝒟(x)=1ni[n]f(x;si)+ψ(x)\widetilde{F}_{\mathcal{D}}(x)=\frac{1}{n}\sum_{i\in[n]}f(x;s_{i})+\psi(x) such that all f(;si):𝒳f(\cdot;s_{i}):\mathcal{X}\to\mathbb{R} are GG-Lipschitz in p\left\lVert\cdot\right\rVert_{p} and convex, and ψ(x):𝒳\psi(x):\mathcal{X}\to\mathbb{R} is μ\mu-strongly convex in p\left\lVert\cdot\right\rVert_{p}. For δ(0,12)\delta\in(0,\frac{1}{2}), we can generate a sample within total variation δ\delta of the density exp(F~𝒟)\propto\exp(-\widetilde{F}_{\mathcal{D}}) in NN value queries to some f(;si)f(\cdot;s_{i}) and samples from densities exp(ψ12ηv22)\propto\exp\left(-\psi-\frac{1}{2\eta}\|\cdot-v\|_{2}^{2}\right) for some η>0\eta>0, vdv\in\mathbb{R}^{d}, where

N\displaystyle N =O(G2d2p1μlog2(G2(D2+μ1)dδ)) if p[1,2],\displaystyle=O\left(\frac{G^{2}d^{\frac{2}{p}-1}}{\mu}\log^{2}\left(\frac{G^{2}(D^{2}+\mu^{-1})d}{\delta}\right)\right)\text{ if }p\in[1,2],
N\displaystyle N =O(G2d12pμlog2(G2(D2+μ1)dδ)) if p[2,).\displaystyle=O\left(\frac{G^{2}d^{1-\frac{2}{p}}}{\mu}\log^{2}\left(\frac{G^{2}(D^{2}+\mu^{-1})d}{\delta}\right)\right)\text{ if }p\in[2,\infty).
Proof.

For p[1,2]p\in[1,2], note that each f(;si)f(\cdot;s_{i}) is d1p12Gd^{\frac{1}{p}-\frac{1}{2}}G-Lipschitz in the 2\ell_{2} norm by combining (2) and the definition of Lipschitzness. Moreover, because the p\ell_{p} norm is larger than the 2\ell_{2} norm, ψ\psi remains μ\mu-strongly convex in the 2\ell_{2} norm. The diameter DD is only affected by poly(d)\text{poly}(d) factors when converting norms, which is accounted for by the logarithmic term. Hence, the complexity bound follows by applying Proposition 8 under this change of parameters. For the other case of p[2,)p\in[2,\infty), the Lipschitz bound is GG, and the strong convexity bound is d2p1μd^{\frac{2}{p}-1}\mu by a similar argument. ∎
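The norm conversions used in this proof are the standard ℓp/ℓ2 equivalences (our reading of (2)): ||x||_2 ≤ ||x||_p ≤ d^{1/p−1/2}||x||_2 for p ∈ [1,2], and ||x||_p ≤ ||x||_2 ≤ d^{1/2−1/p}||x||_p for p ≥ 2 (the latter also drives Corollary 4 below). The following quick numerical check (ours) illustrates both directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
x = rng.standard_normal(d)
l2 = np.linalg.norm(x, 2)

for p in (1.25, 1.5, 4.0, 8.0):
    lp = np.linalg.norm(x, ord=p)
    if p <= 2:
        # ||x||_2 <= ||x||_p <= d^(1/p - 1/2) ||x||_2: a G-Lipschitz function in l_p
        # is d^(1/p - 1/2) G-Lipschitz in l_2.
        assert l2 <= lp <= d ** (1 / p - 0.5) * l2 + 1e-9
    else:
        # ||x||_p <= ||x||_2 <= d^(1/2 - 1/p) ||x||_p: mu-strong convexity in l_p
        # yields d^(2/p - 1) mu-strong convexity in l_2.
        assert lp <= l2 <= d ** (0.5 - 1 / p) * lp + 1e-9
```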

In the following discussion, we primarily focus on the value oracle query complexity of our samplers. Generic results on logconcave sampling (see e.g. [LV07], or more recent developments by [JLLV21, Che21, KL22]) imply that sampling from the densities exp(ψ12ηv22)\propto\exp\left(-\psi-\frac{1}{2\eta}\|\cdot-v\|_{2}^{2}\right) can be performed in polynomial time, for all the ψ\psi that are relevant in our applications (which are all squared p\ell_{p} distances). We expect that samplers running in nearly-linear time (in dd) can be designed for applications where 𝒳\mathcal{X} is structured, such as an p\ell_{p} ball, but for brevity we omit this discussion.
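As a concrete illustration of the inner sampling subproblems above: when ψ is a squared ℓ2 distance, say ψ(x) = (a/2)||x − x0||_2², and the restriction to 𝒳 is ignored, the density ∝ exp(−ψ(x) − (1/2η)||x − v||_2²) is itself a Gaussian, so the oracle is exact; handling the constraint set 𝒳 (e.g. by rejection) and squared ℓp distances with p ≠ 2 requires more care. A minimal sketch of this Gaussian special case, with hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, a, eta = 50, 2.0, 0.1                 # hypothetical dimension and parameters
x0, v = np.zeros(d), rng.standard_normal(d)

# Unconstrained density  exp(-(a/2)||x - x0||^2 - (1/(2*eta))||x - v||^2)  is Gaussian:
# precision (a + 1/eta) * I, mean given by the precision-weighted average of x0 and v.
precision = a + 1.0 / eta
mean = (a * x0 + v / eta) / precision
sample = mean + rng.standard_normal(d) / np.sqrt(precision)
```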

Corollary 2.

Let 1<p21<p\leq 2 be a constant, and let ϵ>0\epsilon>0, δ(0,1)\delta\in(0,1). Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} have diamp(𝒳)D\textup{diam}_{\left\lVert\cdot\right\rVert_{p}}(\mathcal{X})\leq D, and let Fpop=𝔼s𝒫[f(;s)]F_{\textup{pop}}=\mathbb{E}_{s\sim\mathcal{P}}[f(\cdot;s)] where all f(;s):df(\cdot;s):\mathbb{R}^{d}\to\mathbb{R} are convex and GG-Lipschitz in p\left\lVert\cdot\right\rVert_{p}. Finally, let 𝒟={si}i[n]𝒫n\mathcal{D}=\{s_{i}\}_{i\in[n]}\sim\mathcal{P}^{n} independently and F𝒟:=1ni[n]f(;si)F_{\mathcal{D}}:=\frac{1}{n}\sum_{i\in[n]}f(\cdot;s_{i}).

  1. (1)

    There is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces xx such that

    𝔼[F𝒟(x)]minx𝒳F𝒟(x)2GDdlog12δnϵp1\mathbb{E}_{\mathcal{M}}\left[F_{\mathcal{D}}(x)\right]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x)\leq 2GD\cdot\frac{\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}} (7)

    using

    O(n2ϵ2d2p1log1δlog2(GDdnϵδ)) value queries to some f(;si).O\left(\frac{n^{2}\epsilon^{2}d^{\frac{2}{p}-1}}{\log\frac{1}{\delta}}\log^{2}\left(\frac{GDdn\epsilon}{\delta}\right)\right)\text{ value queries to some }f(\cdot;s_{i}). (8)
  2. (2)

    There is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces xx such that

    𝔼𝒟𝒫n,[Fpop(x)]minx𝒳Fpop(x)2GD(1n(p1)+dlog12δnϵp1).\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x)\leq 2GD\cdot\left(\sqrt{\frac{1}{n(p-1)}}+\frac{\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}}\right). (9)

    using

    O(n2ϵ2d2p1log1δlog2(GDdnϵδ)) value queries to some f(;si).O\left(\frac{n^{2}\epsilon^{2}d^{\frac{2}{p}-1}}{\log\frac{1}{\delta}}\log^{2}\left(\frac{GDdn\epsilon}{\delta}\right)\right)\text{ value queries to some }f(\cdot;s_{i}).
Proof.

We will parameterize Assumption 1 with the function r(x):=12(p1)xx0p2r(x):=\frac{1}{2(p-1)}\|x-x_{0}\|_{p}^{2}, where x0𝒳x_{0}\in\mathcal{X} is an arbitrary point, and strong convexity follows from Proposition 7. By assumption, we may set Θ=12(p1)D2\Theta=\frac{1}{2(p-1)}D^{2}. The conclusions follow by combining Theorems 3 and 4. To obtain (ϵ,δ)(\epsilon,\delta)-differential privacy, it suffices to run the mechanism with privacy level δδ2\delta\leftarrow\frac{\delta}{2}, run the sampler to total variation distance δ2\frac{\delta}{2} using Proposition 9, and take a union bound. For both ERM and SCO, note that our choices of kk and μ\mu satisfy the relation (6), namely kG2μ=O(n2ϵ2/log1δ)\frac{kG^{2}}{\mu}=O(n^{2}\epsilon^{2}/\log\frac{1}{\delta}). Since both the Lipschitz and strong convexity parameters are scaled up by kk in our application of Proposition 9, the leading-order term in the query complexity is kG2μ\frac{kG^{2}}{\mu}, which yields the conclusion. ∎
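The following sketch (hypothetical numerical values; names are ours) traces the parameter plumbing of this proof in the ℓp case: the regularizer range Θ = D²/(2(p−1)), the parameters k, μ from Theorem 4, the ERM bound (7), and the identity kG²/μ = n²ε²/(2 log(1/(2δ))) underlying the query complexity (8).

```python
import math

# Hypothetical instance (for illustration only).
n, d, p = 10_000, 100, 1.5
eps, delta, G, D = 1.0, 1e-6, 1.0, 1.0

Theta = D**2 / (2 * (p - 1))        # range of r(x) = ||x - x0||_p^2 / (2(p-1))
log_term = math.log(1 / (2 * delta))

# Parameters k, mu from Theorem 4.
C1 = 2 * G**2 * Theta * log_term / (n**2 * eps**2)
C2 = n * eps**2 / (2 * log_term)
k = math.sqrt((d + C2) / C1)
mu = 2 * G**2 * k * log_term / (n**2 * eps**2)

# Relation (6) driving the query complexity: k G^2 / mu = n^2 eps^2 / (2 log(1/(2 delta))).
assert abs(k * G**2 / mu - n**2 * eps**2 / (2 * log_term)) < 1e-6

# Empirical risk bound (7).
erm_bound = 2 * G * D * math.sqrt(d * log_term) / (n * eps * math.sqrt(p - 1))
print(f"ERM bound (7): {erm_bound:.3g}")
```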

For any pp such that p1p-1 is bounded away from 0, Corollary 2 matches the information-theoretic lower bound of [BGN21] (and its subsequent sharpening by [LL22]). When this is not the case, we use norm equivalence (2) to obtain a weaker bound.

Corollary 3.

Let ϵ>0\epsilon>0, δ(0,1)\delta\in(0,1). Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} have diam1(𝒳)D\textup{diam}_{\left\lVert\cdot\right\rVert_{1}}(\mathcal{X})\leq D, and let Fpop=𝔼s𝒫[f(;s)]F_{\textup{pop}}=\mathbb{E}_{s\sim\mathcal{P}}[f(\cdot;s)] where all f(;s):df(\cdot;s):\mathbb{R}^{d}\to\mathbb{R} are convex and GG-Lipschitz in 1\left\lVert\cdot\right\rVert_{1}. Finally, let 𝒟={si}i[n]𝒫n\mathcal{D}=\{s_{i}\}_{i\in[n]}\sim\mathcal{P}^{n} independently and F𝒟:=1ni[n]f(;si)F_{\mathcal{D}}:=\frac{1}{n}\sum_{i\in[n]}f(\cdot;s_{i}).

  1. (1)

    There is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces xx such that

    𝔼[F𝒟(x)]minx𝒳F𝒟(x)6GDlogddlog12δnϵ\mathbb{E}_{\mathcal{M}}\left[F_{\mathcal{D}}(x)\right]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x)\leq 6GD\sqrt{\log d}\cdot\frac{\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon} (10)

    using

    O(n2ϵ2dlog1δlog2(GDdnϵδ)) value queries to some f(;si).O\left(\frac{n^{2}\epsilon^{2}d}{\log\frac{1}{\delta}}\log^{2}\left(\frac{GDdn\epsilon}{\delta}\right)\right)\text{ value queries to some }f(\cdot;s_{i}). (11)
  2. (2)

    There is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces xx such that

    𝔼𝒟𝒫n,[Fpop(x)]minx𝒳Fpop(x)6GDlogd(1n+dlog12δnϵ)\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x)\leq 6GD\sqrt{\log d}\cdot\left(\sqrt{\frac{1}{n}}+\frac{\sqrt{d\log\frac{1}{2\delta}}}{n\epsilon}\right) (12)

    using

    O(n2ϵ2dlog1δlog2(GDdnϵδ)) value queries to some f(;si).O\left(\frac{n^{2}\epsilon^{2}d}{\log\frac{1}{\delta}}\log^{2}\left(\frac{GDdn\epsilon}{\delta}\right)\right)\text{ value queries to some }f(\cdot;s_{i}).
Proof.

We will parameterize Assumption 1 with the function r(x):=e22(q1)xx0q2r(x):=\frac{e^{2}}{2(q-1)}\|x-x_{0}\|_{q}^{2}, where q=1+1logdq=1+\frac{1}{\log d}. By combining Proposition 7 (which shows rr is e2e^{2}-strongly convex in q\ell_{q}) and (2), we have that rr is 11-strongly convex in 1\ell_{1}. The remainder of the proof follows identically to Corollary 2. ∎
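A quick numerical sanity check (ours) of the choice q = 1 + 1/log d: the conversion factor d^{1−1/q} between the ℓ1 and ℓq norms is at most e, so the e² prefactor in r indeed yields 1-strong convexity in ℓ1, at the cost of a factor of order log d in Θ.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 1_000, 100_000):
    q = 1 + 1 / math.log(d)
    # d^(1 - 1/q) = exp(log(d) / (log(d) + 1)) <= e for this choice of q.
    assert d ** (1 - 1 / q) <= math.e + 1e-12
    # Hence ||x||_1 <= d^(1 - 1/q) ||x||_q <= e ||x||_q.
    x = rng.standard_normal(d)
    assert np.linalg.norm(x, 1) <= math.e * np.linalg.norm(x, q) + 1e-9
```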

The term scaling as logd/n\sqrt{\log d/n} in (12), namely the non-private population risk, is known to be optimal from existing lower bounds on SCO [DJW14]. Up to a logd\sqrt{\log d} factor, the empirical risk bound (10) is optimal with respect to current private optimization lower bounds [BGN21, LL22].

Corollary 4.

Let p2p\geq 2, and let ϵ>0\epsilon>0, δ(0,1)\delta\in(0,1). Let 𝒳d\mathcal{X}\subset\mathbb{R}^{d} have diamp(𝒳)D\textup{diam}_{\left\lVert\cdot\right\rVert_{p}}(\mathcal{X})\leq D, and let Fpop=𝔼s𝒫[f(;s)]F_{\textup{pop}}=\mathbb{E}_{s\sim\mathcal{P}}[f(\cdot;s)] where all f(;s):df(\cdot;s):\mathbb{R}^{d}\to\mathbb{R} are convex and GG-Lipschitz in p\left\lVert\cdot\right\rVert_{p}. Finally, let 𝒟={si}i[n]𝒫n\mathcal{D}=\{s_{i}\}_{i\in[n]}\sim\mathcal{P}^{n} independently and F𝒟:=1ni[n]f(;si)F_{\mathcal{D}}:=\frac{1}{n}\sum_{i\in[n]}f(\cdot;s_{i}).

  1. (1)

    There is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces xx such that

    𝔼[F𝒟(x)]minx𝒳F𝒟(x)2GDd11plog12δnϵ\mathbb{E}_{\mathcal{M}}\left[F_{\mathcal{D}}(x)\right]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x)\leq 2GD\cdot\frac{d^{1-\frac{1}{p}}\sqrt{\log\frac{1}{2\delta}}}{n\epsilon} (13)

    using

    O(n2ϵ2d12plog1δlog2(GDdnϵδ)) value queries to some f(;si).O\left(\frac{n^{2}\epsilon^{2}d^{1-\frac{2}{p}}}{\log\frac{1}{\delta}}\log^{2}\left(\frac{GDdn\epsilon}{\delta}\right)\right)\text{ value queries to some }f(\cdot;s_{i}). (14)
  2. (2)

    There is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces xx such that

    𝔼𝒟𝒫n,[Fpop(x)]minx𝒳Fpop(x)2GD(d121pn+d11plog12δnϵ).\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x)\leq 2GD\cdot\left(\frac{d^{\frac{1}{2}-\frac{1}{p}}}{\sqrt{n}}+\frac{d^{1-\frac{1}{p}}\sqrt{\log\frac{1}{2\delta}}}{n\epsilon}\right). (15)

    using

    O(n2ϵ2d12plog1δlog2(GDdnϵδ)) value queries to some f(;si).O\left(\frac{n^{2}\epsilon^{2}d^{1-\frac{2}{p}}}{\log\frac{1}{\delta}}\log^{2}\left(\frac{GDdn\epsilon}{\delta}\right)\right)\text{ value queries to some }f(\cdot;s_{i}).
Proof.

We will parameterize Assumption 1 with the function r(x):=12xx022r(x):=\frac{1}{2}\left\lVert x-x_{0}\right\rVert_{2}^{2}. By combining Proposition 7 (which shows rr is 11-strongly convex in 2\ell_{2}, and hence also p\ell_{p}) and (2), we may set Θ=12d12/pD2\Theta=\frac{1}{2}d^{1-2/p}D^{2}. The remainder of the proof follows identically to Corollary 2. ∎

Schatten-pp norms.

Our results extend immediately to matrix spaces equipped with Schatten-pp norm geometries. We record our relevant results in the following.

Corollary 5.

Let p[1,)p\in[1,\infty), ϵ>0\epsilon>0, δ(0,1)\delta\in(0,1), and let d1,d2d_{1},d_{2}\in\mathbb{N} have d1>d2d_{1}>d_{2}. Let 𝒳d1×d2\mathcal{X}\subset\mathbb{R}^{d_{1}\times d_{2}} have diamp(𝒳)D\textup{diam}_{\left\lVert\cdot\right\rVert_{p}}(\mathcal{X})\leq D, and let Fpop=𝔼s𝒫[f(;s)]F_{\textup{pop}}=\mathbb{E}_{s\sim\mathcal{P}}[f(\cdot;s)] where all f(;s):d1×d2f(\cdot;s):\mathbb{R}^{d_{1}\times d_{2}}\to\mathbb{R} are convex and GG-Lipschitz in p\left\lVert\cdot\right\rVert_{p}. Finally, let 𝒟={si}i[n]𝒫n\mathcal{D}=\{s_{i}\}_{i\in[n]}\sim\mathcal{P}^{n} independently and F𝒟:=1ni[n]f(;si)F_{\mathcal{D}}:=\frac{1}{n}\sum_{i\in[n]}f(\cdot;s_{i}).

  1. (1)

    For constant 1<p21<p\leq 2, there is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces 𝐌\mathbf{M} such that

    𝔼[F𝒟(𝐌)]min𝐌𝒳F𝒟(𝐌)2GDd1d2log12δnϵp1,\displaystyle\mathbb{E}_{\mathcal{M}}[F_{\mathcal{D}}(\mathbf{M})]-\min_{\mathbf{M}\in\mathcal{X}}F_{\mathcal{D}}(\mathbf{M})\leq 2GD\cdot\frac{\sqrt{d_{1}d_{2}\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}},
    𝔼𝒟𝒫n,[Fpop(𝐌)]min𝐌𝒳Fpop(𝐌)2GD(1n(p1)+d1d2log12δnϵp1).\displaystyle\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}[F_{\textup{pop}}(\mathbf{M})]-\min_{\mathbf{M}\in\mathcal{X}}F_{\textup{pop}}(\mathbf{M})\leq 2GD\cdot\left(\sqrt{\frac{1}{n(p-1)}}+\frac{\sqrt{d_{1}d_{2}\log\frac{1}{2\delta}}}{n\epsilon\sqrt{p-1}}\right).

    The value oracle complexity of the algorithm is bounded as in (8) for dd2d\leftarrow d_{2} in the non-logarithmic term, and dd1d\leftarrow d_{1} in the logarithmic term.

  2. (2)

    For p=1p=1, there is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces 𝐌\mathbf{M} such that

    𝔼[F𝒟(𝐌)]min𝐌𝒳F𝒟(𝐌)6GDlogd2d1d2log12δnϵ,\displaystyle\mathbb{E}_{\mathcal{M}}[F_{\mathcal{D}}(\mathbf{M})]-\min_{\mathbf{M}\in\mathcal{X}}F_{\mathcal{D}}(\mathbf{M})\leq 6GD\sqrt{\log d_{2}}\cdot\frac{\sqrt{d_{1}d_{2}\log\frac{1}{2\delta}}}{n\epsilon},
    𝔼𝒟𝒫n,[Fpop(𝐌)]min𝐌𝒳Fpop(𝐌)6GDlogd2(1n+d1d2log12δnϵ).\displaystyle\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}[F_{\textup{pop}}(\mathbf{M})]-\min_{\mathbf{M}\in\mathcal{X}}F_{\textup{pop}}(\mathbf{M})\leq 6GD\sqrt{\log d_{2}}\cdot\left(\sqrt{\frac{1}{n}}+\frac{\sqrt{d_{1}d_{2}\log\frac{1}{2\delta}}}{n\epsilon}\right).

    The value oracle complexity of the algorithm is bounded as in (11) for dd2d\leftarrow d_{2} in the non-logarithmic term, and dd1d\leftarrow d_{1} in the logarithmic term.

  3. (3)

    For p2p\geq 2, there is an (ϵ,δ)(\epsilon,\delta)-differentially private algorithm \mathcal{M} which produces 𝐌\mathbf{M} such that

    𝔼[F𝒟(𝐌)]min𝐌𝒳F𝒟(𝐌)2GDd2121pd1d2log12δnϵ,\displaystyle\mathbb{E}_{\mathcal{M}}[F_{\mathcal{D}}(\mathbf{M})]-\min_{\mathbf{M}\in\mathcal{X}}F_{\mathcal{D}}(\mathbf{M})\leq 2GD\cdot\frac{d_{2}^{\frac{1}{2}-\frac{1}{p}}\sqrt{d_{1}d_{2}\log\frac{1}{2\delta}}}{n\epsilon},
    𝔼𝒟𝒫n,[Fpop(𝐌)]min𝐌𝒳Fpop(𝐌)2GD(d2121pn+d2121pd1d2log12δnϵ).\displaystyle\mathbb{E}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}[F_{\textup{pop}}(\mathbf{M})]-\min_{\mathbf{M}\in\mathcal{X}}F_{\textup{pop}}(\mathbf{M})\leq 2GD\cdot\left(\frac{d_{2}^{\frac{1}{2}-\frac{1}{p}}}{\sqrt{n}}+\frac{d_{2}^{\frac{1}{2}-\frac{1}{p}}\sqrt{d_{1}d_{2}\log\frac{1}{2\delta}}}{n\epsilon}\right).

    The value oracle complexity of the algorithm is bounded as in (14) for dd2d\leftarrow d_{2} in the non-logarithmic term, and dd1d\leftarrow d_{1} in the logarithmic term.

Proof.

The privacy and utility proofs follow identically to Corollaries 2, 3, and 4, where we use the second portion of Proposition 7 instead of the first. We note that the “dimension-dependent” term in the risk inherited from Proposition 4 scales as d1d2d_{1}d_{2} (the dimensionality of the matrix space). However, the terms in the risk due to the size of regularizers (inherited from the tradeoffs in (2), for p=1p=1 and p>2p>2) scale as a power of d2d_{2}, the number of singular values. To obtain the value oracle complexity, we note that by definition of the Schatten norm, it satisfies the relationship (2) as well. Moreover, the Schatten-22 norm agrees with the vector 2\ell_{2} norm (when the matrix is flattened into a vector), since both equal the Frobenius norm. Hence, we may directly apply Proposition 8 after paying a norm conversion, in the same way as was done in Proposition 9. ∎
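The Schatten-norm facts used in this proof are also easy to verify numerically; the snippet below (ours) checks that the Schatten-2 norm coincides with the ℓ2 norm of the flattened matrix, and that the norm-equivalence factor involves only d2, the number of singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, p = 40, 10, 4.0
M = rng.standard_normal((d1, d2))

def schatten(M, p):
    # l_p norm of the singular values of M.
    return np.linalg.norm(np.linalg.svd(M, compute_uv=False), ord=p)

# Schatten-2 norm = Frobenius norm = l_2 norm of the flattened matrix.
assert np.isclose(schatten(M, 2), np.linalg.norm(M.ravel(), 2))

# Norm equivalence over the d2 singular values (p >= 2 case):
# ||M||_{S_p} <= ||M||_{S_2} <= d2^(1/2 - 1/p) ||M||_{S_p}.
assert schatten(M, p) <= schatten(M, 2) <= d2 ** (0.5 - 1 / p) * schatten(M, p) + 1e-9
```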

Remark on high-probability bounds.

One advantage of using a sampling-based algorithm is an immediate high-probability bound which follows due to the good concentration of Lipschitz functions over samples from strongly logconcave distributions, stated below.

Lemma 5 (Concentration of Lipschitz functions, [Led99], Section 2.3 and [BL00], Proposition 3.1).

Let \ell be a GG-Lipschitz function and Xexp(F)X\sim\exp(-F) for a μ\mu-strongly convex function FF, all with respect to the same norm 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}. For all t0t\geq 0,

Pr[(X)𝔼[(X)]t]exp(t2μ2G2).\displaystyle\Pr\big{[}\ell(X)-\mathbb{E}[\ell(X)]\geq t\big{]}\leq\exp\left(-\frac{t^{2}\mu}{2G^{2}}\right).

In particular, we have demonstrated that the population and empirical risks (which are Lipschitz) have well-controlled expectations under our mechanism. Naïvely combining Lemma 5 with our main results bounding the expected utility then yields tight concentration around the mean in some parameter regimes, but we suspect the resulting bound is loose in general. We leave it as an interesting open problem to obtain tight high-probability bounds in all parameter regimes.
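For intuition, Lemma 5 is simple to probe by simulation in the Gaussian special case (a strongly convex potential in ℓ2); the sketch below (ours, with hypothetical parameters) compares the empirical tail of a 1-Lipschitz statistic against the exp(−t²μ/(2G²)) bound.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, G, d, N = 2.0, 1.0, 20, 200_000

# X ~ exp(-F) with F(x) = (mu/2)||x||_2^2, i.e. N(0, I/mu): mu-strongly convex in l_2.
X = rng.standard_normal((N, d)) / np.sqrt(mu)

# A 1-Lipschitz (in l_2) test function: distance to the origin.
ell = np.linalg.norm(X, axis=1)
centered = ell - ell.mean()

for t in (0.5, 1.0, 1.5):
    empirical = (centered >= t).mean()
    bound = np.exp(-t**2 * mu / (2 * G**2))
    print(f"t={t}: empirical tail {empirical:.4f}, Lemma 5 bound {bound:.4f}")
```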

Acknowledgments

We would like to thank the anonymous SODA reviewers for their helpful suggestions. We would also like to thank a reviewer and Cristóbal Guzmán for pointing out a mistake in the utility guarantees in Corollary 5 in the original version, which is corrected in this version.

References

  • [Abo16] John M. Abowd. The challenge of scientific reproducibility and privacy protection for statistical agencies. Technical report, Census Scientific Advisory Committee, 2016.
  • [ABRW12] Alekh Agarwal, Peter L. Bartlett, Pradeep Ravikumar, and Martin J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory, 58(5):3235–3249, 2012.
  • [AC21] Kwangjun Ahn and Sinho Chewi. Efficient constrained sampling via the mirror-langevin algorithm. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 28405–28418, 2021.
  • [ACG+16] Martín Abadi, Andy Chu, Ian J. Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Edgar R. Weippl, Stefan Katzenbeisser, Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016, pages 308–318. ACM, 2016.
  • [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in L1 geometry. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 393–403. PMLR, 2021.
  • [AGL+18] Zeyuan Allen-Zhu, Ankit Garg, Yuanzhi Li, Rafael Mendes de Oliveira, and Avi Wigderson. Operator scaling via geodesically convex optimization, invariant theory and polynomial identity testing. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 172–181. ACM, 2018.
  • [AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory Comput., 8(1):121–164, 2012.
  • [AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng, and Sushant Sachdeva. Iterative refinement for ℓp-norm regression. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 1405–1424. SIAM, 2019.
  • [ANW10] Alekh Agarwal, Sahand N. Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In John D. Lafferty, Christopher K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and Aron Culotta, editors, Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada, pages 37–45. Curran Associates, Inc., 2010.
  • [BC12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends Mach. Learn., 5(1):1–122, 2012.
  • [BCL94] Keith Ball, Eric A. Carlen, and Elliott H. Lieb. Sharp uniform convexity and smoothness estimates for trace norms. Inventiones mathematicae, 115(1):463–482, 1994.
  • [BDKT12] Aditya Bhaskara, Daniel Dadush, Ravishankar Krishnaswamy, and Kunal Talwar. Unconditional differentially private mechanisms for linear queries. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1269–1284, 2012.
  • [BE02] Olivier Bousquet and André Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.
  • [BEM+17] Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnés, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017, pages 441–459. ACM, 2017.
  • [BFGT20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 11279–11288, 2019.
  • [BGM21] Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private stochastic optimization: New results in convex and non-convex settings. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 9317–9329, 2021.
  • [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In Mikhail Belkin and Samory Kpotufe, editors, Conference on Learning Theory, COLT 2021, 15-19 August 2021, Boulder, Colorado, USA, volume 134 of Proceedings of Machine Learning Research, pages 474–499. PMLR, 2021.
  • [BL00] Sergey G Bobkov and Michel Ledoux. From brunn-minkowski to brascamp-lieb and to logarithmic sobolev inequalities. GAFA, Geometric and Functional Analysis, 10:1028–1052, 2000.
  • [Bor75a] Christer Borell. The Brunn-Minkowski inequality in Gauss space. Inventiones mathematicae, 30:207–216, 1975.
  • [Bor75b] Christer Borell. Convex set functions in d-space. Periodica Mathematica Hungarica, 6(2):111–136, 1975.
  • [BST14] Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 464–473. IEEE Computer Society, 2014.
  • [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231–357, 2015.
  • [CDWY20] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Fast mixing of metropolized hamiltonian monte carlo: Benefits of multi-step gradients. J. Mach. Learn. Res., 21:92:1–92:72, 2020.
  • [CE17] Dario Cordero-Erausquin. Transport inequalities for log-concave measures, quantitative forms, and applications. Canada J. Math, 69(3):481–501, 2017.
  • [Che21] Yuansi Chen. An almost constant lower bound of the isoperimetric coefficient in the kls conjecture. GAFA, Geometric and Functional Analysis, 31:34–61, 2021.
  • [CM08] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 289–296. Curran Associates, Inc., 2008.
  • [CMS11] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12:1069–1109, 2011.
  • [CRT06] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489–509, 2006.
  • [DFO20] Jelena Diakonikolas, Maryam Fazel, and Lorenzo Orecchia. Fair packing and covering on a relative scale. SIAM J. Optim., 30(4):3284–3314, 2020.
  • [DG21] Jelena Diakonikolas and Cristóbal Guzmán. Complementary composite minimization, small gradients in general norms, and applications to regression problems. CoRR, abs/2101.11041, 2021.
  • [DJW14] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Privacy aware learning. J. ACM, 61(6):38:1–38:57, 2014.
  • [dKL18] Etienne de Klerk and Monique Laurent. Comparison of lasserre’s measure-based bounds for polynomial optimization to bounds obtained by simulated annealing. Math. Oper. Res., 43(4):1317–1325, 2018.
  • [DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Serge Vaudenay, editor, Advances in Cryptology - EUROCRYPT 2006, 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May 28 - June 1, 2006, Proceedings, volume 4004 of Lecture Notes in Computer Science, pages 486–503. Springer, 2006.
  • [DKY17] Bolin Ding, Janardhan Kulkarni, and Sergey Yekhanin. Collecting telemetry data privately. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 3571–3580, 2017.
  • [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings, volume 3876 of Lecture Notes in Computer Science, pages 265–284. Springer, 2006.
  • [DRS21] Jinshuo Dong, Aaron Roth, and Weijie J. Su. Gaussian differential privacy. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(1):3–37, 2021.
  • [EPK14] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Gail-Joon Ahn, Moti Yung, and Ninghui Li, editors, Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, November 3-7, 2014, pages 1054–1067. ACM, 2014.
  • [FG04] Matthieu Fradelizi and Olivier Guédon. The extreme points of subsets of s-concave probabilities and a geometric localization theorem. Discrete & Computational Geometry, 31(2):327–335, 2004.
  • [FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Konstantin Makarychev, Yury Makarychev, Madhur Tulsiani, Gautam Kamath, and Julia Chuzhoy, editors, Proccedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 439–449. ACM, 2020.
  • [GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. Private convex optimization via exponential mechanism. In Po-Ling Loh and Maxim Raginsky, editors, Conference on Learning Theory, 2-5 July 2022, London, UK, volume 178 of Proceedings of Machine Learning Research, pages 1948–1989. PMLR, 2022.
  • [GTU22] Arun Ganesh, Abhradeep Thakurta, and Jalaj Upadhyay. Langevin diffusion: An almost universal algorithm for private euclidean (convex) optimization. CoRR, abs/2204.01585, 2022.
  • [HKRC18] Ya-Ping Hsieh, Ali Kavis, Paul Rolland, and Volkan Cevher. Mirrored langevin dynamics. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 2883–2892, 2018.
  • [HLL+22] Yuxuan Han, Zhicong Liang, Zhipeng Liang, Yang Wang, Yuan Yao, and Jiheng Zhang. Private streaming SCO in ℓp geometry with applications in high dimensional online decision making. In International Conference on Machine Learning, pages 8249–8279. PMLR, 2022.
  • [HT10] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 705–714, 2010.
  • [Jia21] Qijia Jiang. Mirror langevin monte carlo: the case under isoperimetry. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 715–725, 2021.
  • [JLLV21] He Jia, Aditi Laddha, Yin Tat Lee, and Santosh S. Vempala. Reducing isotropy and volume to KLS: an O*(n³ψ²) volume algorithm. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 961–974. ACM, 2021.
  • [JLT20] Arun Jambulapati, Jerry Li, and Kevin Tian. Robust sub-gaussian principal component analysis and width-independent schatten packing. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [JT14] Prateek Jain and Abhradeep Guha Thakurta. (near) dimension independent risk bounds for differentially private learning. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 476–484. JMLR.org, 2014.
  • [KJ16] Shiva Prasad Kasiviswanathan and Hongxia Jin. Efficient private empirical risk minimization for high-dimensional learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 488–497. JMLR.org, 2016.
  • [KL22] Bo’az Klartag and Joseph Lehec. Bourgain’s slicing problem and kls isoperimetry up to polylog. arXiv preprint arXiv:2203.15551, 2022.
  • [KLL21] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth ERM and SCO in subquadratic steps. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 4053–4064, 2021.
  • [KLOS14] Jonathan A. Kelner, Yin Tat Lee, Lorenzo Orecchia, and Aaron Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In Chandra Chekuri, editor, Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 217–226. SIAM, 2014.
  • [KLS95] Ravi Kannan, László Lovász, and Miklós Simonovits. Isoperimetric problems for convex bodies and a localization lemma. Discrete & Computational Geometry, 13(3):541–559, 1995.
  • [Kol11] Alexander V. Kolesnikov. Mass transportation and contractions. arXiv preprint arXiv:1103.1479, 2011.
  • [KPSW19] Rasmus Kyng, Richard Peng, Sushant Sachdeva, and Di Wang. Flows in almost linear time via adaptive preconditioning. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 902–913. ACM, 2019.
  • [KST12] Daniel Kifer, Adam D. Smith, and Abhradeep Thakurta. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. In Shie Mannor, Nathan Srebro, and Robert C. Williamson, editors, COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, volume 23 of JMLR Proceedings, pages 25.1–25.40. JMLR.org, 2012.
  • [KV06] Adam Tauman Kalai and Santosh S. Vempala. Simulated annealing for convex optimization. Math. Oper. Res., 31(2):253–266, 2006.
  • [Led99] Michel Ledoux. Concentration of measure and logarithmic Sobolev inequalities. Seminaire de probabilities XXXIII, 1999.
  • [LL22] Daogao Liu and Zhou Lu. Lower bounds for differentially private erm: Unconstrained and non-euclidean. arXiv preprint arXiv:2105.13637, 2022.
  • [LS93] László Lovász and Miklós Simonovits. Random walks in a convex body and an improved volume algorithm. Random structures & algorithms, 4(4):359–412, 1993.
  • [LST21] Yin Tat Lee, Ruoqi Shen, and Kevin Tian. Structured logconcave sampling with a restricted gaussian oracle. In Conference on Learning Theory, pages 2993–3050. PMLR, 2021.
  • [LTVW22] Ruilin Li, Molei Tao, Santosh S. Vempala, and Andre Wibisono. The mirror langevin algorithm converges with vanishing bias. In Sanjoy Dasgupta and Nika Haghtalab, editors, International Conference on Algorithmic Learning Theory, 29-1 April 2022, Paris, France, volume 167 of Proceedings of Machine Learning Research, pages 718–742. PMLR, 2022.
  • [LV07] László Lovász and Santosh S. Vempala. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms, 30(3):307–358, 2007.
  • [MS08] Emanuel Milman and Sasha Sodin. An isoperimetric inequality for uniformly log-concave measures and uniformly convex bodies. Journal of Functional Analysis, 254(5):1235–1268, 2008.
  • [MT07] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), October 20-23, 2007, Providence, RI, USA, Proceedings, pages 94–103. IEEE Computer Society, 2007.
  • [Nem04] Arkadi Nemirovski. Interior point polynomial time methods in convex programming. Lecture notes, 42(16):3215–3224, 2004.
  • [NY83] A. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
  • [OV00] Felix Otto and Cédric Villani. Generalization of an inequality by talagrand and links with the logarithmic sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
  • [Sha12] Shai Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107–194, 2012.
  • [ST74] Vladimir Sudakov and Boris Tsirelson. Extremal properties of half-spaces for spherically invariant measures. J. Soviet Math, 9:9–18, 1974.
  • [Tea17] Apple Differential Privacy Team. Learning with privacy at scale. Technical report, Apple, 2017.
  • [ZPFP20] Kelvin Shuangjian Zhang, Gabriel Peyré, Jalal Fadili, and Marcelo Pereyra. Wasserstein control of mirror langevin monte carlo. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 3814–3841. PMLR, 2020.

Appendix A Private ERM and SCO under strong convexity

In this section, we derive our results for private ERM and SCO in general norms under the assumption that the sample losses are strongly convex. We will state our results for private ERM (Theorem 5) and SCO (Theorem 6) with respect to an arbitrary compact convex subset 𝒳\mathcal{X} of a dd-dimensional normed space, satisfying the following Assumption 2.

Assumption 2.

We make the following assumptions.

  1. (1)

    There is a compact, convex subset 𝒳d\mathcal{X}\subset\mathbb{R}^{d} equipped with a norm 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}.

  2. (2)

    There is a set Ω\Omega such that for any sΩs\in\Omega, there is a function f(;s):𝒳f(\cdot;s):\mathcal{X}\to\mathbb{R} which is GG-Lipschitz and μ\mu-strongly convex in 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}.

Theorem 5 (Private ERM).

Under Assumption 2 and following notation (1), drawing a sample xx from the density νexp(kF𝒟)\nu\propto\exp(-kF_{\mathcal{D}}) for

k=n2ϵ2μ2G2log12δ,k=\frac{n^{2}\epsilon^{2}\mu}{2G^{2}\log\frac{1}{2\delta}},

is (ϵ,δ)(\epsilon,\delta)-differentially private, and produces xx such that

𝔼xν[F𝒟(x)]minx𝒳F𝒟(x)2dG2log12δn2ϵ2μ.\mathbb{E}_{x\sim\nu}[F_{\mathcal{D}}(x)]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x)\leq\frac{2dG^{2}\log\frac{1}{2\delta}}{n^{2}\epsilon^{2}\mu}.
Proof.

Let F𝒟F_{\mathcal{D}^{\prime}} be the realization of (1) when 𝒟\mathcal{D} is replaced with a neighboring dataset 𝒟\mathcal{D}^{\prime} which agrees in all entries except some sample sisis^{\prime}_{i}\neq s_{i}. By Assumption 2, we have k(F𝒟F𝒟)k(F_{\mathcal{D}}-F_{\mathcal{D}^{\prime}}) is kGn\frac{kG}{n}-Lipschitz, and both kF𝒟kF_{\mathcal{D}} and kF𝒟kF_{\mathcal{D}^{\prime}} are kμk\mu-strongly convex (all with respect to 𝒳\left\lVert\cdot\right\rVert_{\mathcal{X}}). Hence, combining Theorem 2 and Fact 2 shows the mechanism is (ϵ,δ)(\epsilon,\delta)-differentially private, since

k=n2ϵ2μ2G2log12δGknμϵ2log12δ.k=\frac{n^{2}\epsilon^{2}\mu}{2G^{2}\log\frac{1}{2\delta}}\implies\frac{G\sqrt{k}}{n\sqrt{\mu}}\leq\frac{\epsilon}{\sqrt{2\log\frac{1}{2\delta}}}.

Let x𝒟:=argminx𝒳F𝒟(x)x^{\star}_{\mathcal{D}}:=\textup{argmin}_{x\in\mathcal{X}}F_{\mathcal{D}}(x). We obtain the risk guarantee by the calculation (see Proposition 4)

𝔼xν[F𝒟(x)]\displaystyle\mathbb{E}_{x\sim\nu}[F_{\mathcal{D}}(x)] F𝒟(x𝒟)+dkF𝒟(x𝒟)+2dG2log12δn2ϵ2μ.\displaystyle\leq F_{\mathcal{D}}(x^{\star}_{\mathcal{D}})+\frac{d}{k}\leq F_{\mathcal{D}}(x^{\star}_{\mathcal{D}})+\frac{2dG^{2}\log\frac{1}{2\delta}}{n^{2}\epsilon^{2}\mu}. ∎
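As with Theorem 4, the parameter choice in Theorem 5 is a one-line computation; the sketch below (hypothetical values, for illustration only) evaluates k and the resulting excess empirical risk bound.

```python
import math

# Hypothetical strongly convex instance.
n, d = 10_000, 100
eps, delta = 1.0, 1e-6
G, mu = 1.0, 0.5          # Lipschitz and strong convexity constants of each f(.; s_i)

log_term = math.log(1 / (2 * delta))
k = n**2 * eps**2 * mu / (2 * G**2 * log_term)

# Excess empirical risk bound of Theorem 5: d/k = 2 d G^2 log(1/(2 delta)) / (n^2 eps^2 mu).
erm_bound = d / k
assert math.isclose(erm_bound, 2 * d * G**2 * log_term / (n**2 * eps**2 * mu))
print(f"k = {k:.3g}, ERM bound = {erm_bound:.3g}")
```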

Theorem 6 (Private SCO).

Under Assumption 2 and following notation (1), drawing a sample xx from the density νexp(kF𝒟)\nu\propto\exp(-kF_{\mathcal{D}}) for

k=n2ϵ2μ2G2log12δk=\frac{n^{2}\epsilon^{2}\mu}{2G^{2}\log\frac{1}{2\delta}}

is (ϵ,δ)(\epsilon,\delta)-differentially private, and produces xx such that

𝔼𝒟πn,xν[Fpop(x)]minx𝒳Fpop(x)G2nμ(1+2dlog12δnϵ2).\mathbb{E}_{\mathcal{D}\sim\pi^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x)\leq\frac{G^{2}}{n\mu}\left(1+\frac{2d\log\frac{1}{2\delta}}{n\epsilon^{2}}\right).
Proof.

For the given choice of kk, the privacy proof follows identically to Theorem 5, so we focus on the risk proof. We follow the notation of Proposition 6 and let sπs\sim\pi be drawn independently of 𝒟\mathcal{D}. By exchanging the expectation and minimum and using that 𝔼𝒟πnF𝒟=Fpop\mathbb{E}_{\mathcal{D}\sim\pi^{n}}F_{\mathcal{D}}=F_{\textup{pop}},

𝔼𝒟πn,xν[Fpop(x)]minx𝒳Fpop(x)\displaystyle\mathbb{E}_{\mathcal{D}\sim\pi^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x) 𝔼𝒟πn[𝔼xν[Fpop(x)]minx𝒳[F𝒟(x)]]\displaystyle\leq\mathbb{E}_{\mathcal{D}\sim\pi^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}\left[F_{\mathcal{D}}(x)\right]\right]
𝔼𝒟πn[𝔼xν[Fpop(x)F𝒟(x)]]\displaystyle\leq\mathbb{E}_{\mathcal{D}\sim\pi^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\textup{pop}}(x)-F_{\mathcal{D}}(x)\right]\right]
+𝔼𝒟πn[𝔼xν[F𝒟(x)]minx𝒳[F𝒟(x)]]\displaystyle+\mathbb{E}_{\mathcal{D}\sim\pi^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\mathcal{D}}(x)\right]-\min_{x\in\mathcal{X}}\left[F_{\mathcal{D}}(x)\right]\right]
𝔼𝒟πn[𝔼xν[Fpop(x)F𝒟(x)]]+dk,\displaystyle\leq\mathbb{E}_{\mathcal{D}\sim\pi^{n}}\left[\mathbb{E}_{x\sim\nu}\left[F_{\textup{pop}}(x)-F_{\mathcal{D}}(x)\right]\right]+\frac{d}{k},

where the last inequality uses the bound on the empirical risk established in the proof of Theorem 5. Next, let ν\nu^{\prime} be the density exp(kF𝒟)\propto\exp(-kF_{\mathcal{D}^{\prime}}). Our mechanism is symmetric, and hence by Proposition 6,

𝔼[Fpop(x)F𝒟(x)]=𝔼[𝔼xν[f(x;s)]𝔼xν[f(x;s)]]\mathbb{E}\left[F_{\textup{pop}}(x)-F_{\mathcal{D}}(x)\right]=\mathbb{E}\left[\mathbb{E}_{x\sim\nu}\left[f(x;s)\right]-\mathbb{E}_{x\sim\nu^{\prime}}\left[f(x;s)\right]\right]

where the outer expectations are over the randomness of drawing 𝒟,s\mathcal{D},s. Finally, for any fixed realization of 𝒟,s\mathcal{D},s, the densities ν,ν\nu,\nu^{\prime} satisfy the assumption of Corollary 1 with H=GnH=\frac{G}{n}, and f(;s)f(\cdot;s) is GG-Lipschitz, so Corollary 1 shows that

𝔼xν[f(x;s)]𝔼xν[f(x;s)]G2nμ.\mathbb{E}_{x\sim\nu}\left[f(x;s)\right]-\mathbb{E}_{x\sim\nu^{\prime}}\left[f(x;s)\right]\leq\frac{G^{2}}{n\mu}.

Combining the above three displays bounds the population risk by

𝔼𝒟πn,xν[Fpop(x)]minx𝒳Fpop(x)\displaystyle\mathbb{E}_{\mathcal{D}\sim\pi^{n},x\sim\nu}\left[F_{\textup{pop}}(x)\right]-\min_{x\in\mathcal{X}}F_{\textup{pop}}(x) G2nμ+dk=G2nμ(1+2dlog12δnϵ2).\displaystyle\leq\frac{G^{2}}{n\mu}+\frac{d}{k}=\frac{G^{2}}{n\mu}\left(1+\frac{2d\log\frac{1}{2\delta}}{n\epsilon^{2}}\right).

for our given value of kk. ∎