
Privately Learning Markov Random Fields

Huanyu Zhang (Cornell University, [email protected]; supported by NSF #1815893 and NSF #1704443; this work was partially done while the author was an intern at Microsoft Research Redmond), Gautam Kamath (University of Waterloo, [email protected]; supported by a University of Waterloo startup grant; part of this work was done while supported as a Microsoft Research Fellow, as part of the Simons-Berkeley Research Fellowship program, and while visiting Microsoft Research Redmond), Janardhan Kulkarni (Microsoft Research Redmond, [email protected]), and Zhiwei Steven Wu (University of Minnesota, [email protected]; supported in part by NSF FAI Award #1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant). The last three authors are listed in alphabetical order.
Abstract

We consider the problem of learning Markov Random Fields (including the prototypical example, the Ising model) under the constraint of differential privacy. Our learning goals include both structure learning, where we try to estimate the underlying graph structure of the model, as well as the harder goal of parameter learning, in which we additionally estimate the parameter on each edge. We provide algorithms and lower bounds for both problems under a variety of privacy constraints – namely pure, concentrated, and approximate differential privacy. While non-privately, both learning goals enjoy roughly the same complexity, we show that this is not the case under differential privacy. In particular, only structure learning under approximate differential privacy maintains the non-private logarithmic dependence on the dimensionality of the data, while a change in either the learning goal or the privacy notion would necessitate a polynomial dependence. As a result, we show that the privacy constraint imposes a strong separation between these two learning problems in the high-dimensional data regime.

1 Introduction

In this chapter, we continue to study the problem of private distribution estimation. However, we focus on a more complicated class of distributions: those defined by undirected graphical models.

Graphical models are a common structure used to model high-dimensional data, which find a myriad of applications in diverse research disciplines, including probability theory, Markov Chain Monte Carlo, computer vision, theoretical computer science, social network analysis, game theory, and computational biology [LevinPW09, Chatterjee05, Felsenstein04, DaskalakisMR11, GemanG86, Ellison93, MontanariS10]. While statistical tasks involving general distributions over $p$ variables often run into the curse of dimensionality (i.e., an exponential sample complexity in $p$), Markov Random Fields (MRFs) are a particular family of undirected graphical models which are parameterized by the “order” $t$ of their interactions. Restricting the order of interactions allows us to capture most distributions which may naturally arise, and also avoids this severe dependence on the dimension (i.e., we often pay an exponential dependence on $t$ instead of $p$). An MRF is defined as follows; see Section 2 for more precise definitions and the notation we use in this chapter.

Definition 1.1.

Let $k,t,p\in\mathbb{N}$, let $G=(V,E)$ be a graph on $p$ nodes, and let $C_{t}(G)$ be the set of cliques of size at most $t$ in $G$. A Markov Random Field with alphabet size $k$ and $t$-order interactions is a distribution $\mathcal{D}$ over $[k]^{p}$ such that

$$\Pr_{X\sim\mathcal{D}}[X=x]\propto\exp\left(\sum_{I\in C_{t}(G)}\psi_{I}(x)\right),$$

where $\psi_{I}:[k]^{p}\rightarrow\mathbb{R}$ depends only on the variables in $I$.

The case $k=t=2$ corresponds to the prototypical example of an MRF, the Ising model [Ising25] (Definition 2.1). More generally, if $t=2$, we call the model pairwise (Definition 2.2), and if $k=2$ but $t$ is unrestricted, we call the model a binary MRF (Definition 2.4). In this chapter, we mainly look at these two special cases of MRFs.

Given the wide applicability of these graphical models, there has been a great deal of work on the problem of graphical model estimation [RavikumarWL10, SanthanamW12, Bresler15, VuffrayMLC16, KlivansM17, HamiltonKM17, RigolletH17, LokhovVMC18, WuSD19]. That is, given a dataset generated from a graphical model, can we infer properties of the underlying distribution? Most of the attention has focused on two learning goals.

  1. Structure learning (Definition 2.9): Recover the set of non-zero edges in $G$.

  2. Parameter learning (Definition 2.10): Recover the set of non-zero edges in $G$, as well as $\psi_{I}$ for all cliques $I$ of size at most $t$.

It is clear that structure learning is easier than parameter learning. Nonetheless, the sample complexity of both learning goals is known to be roughly equivalent. That is, both can be performed using a number of samples which is only logarithmic in the dimension $p$, assuming a model of bounded “width” $\lambda$ (a common parameterization of the problem, which roughly corresponds to the graph having bounded degree; see Section 2 for more details), thus facilitating estimation in very high-dimensional settings.

Our goal is to design algorithms which guarantee both:

  • Accuracy: With probability greater than $2/3$, the algorithm learns the underlying graphical model;

  • Privacy: The algorithm satisfies differential privacy, even when the dataset is not drawn from a graphical model.

Thematically, we investigate the following question: how much additional data is needed to learn Markov Random Fields under the constraint of differential privacy? As mentioned before, absent privacy constraints, the sample complexity is logarithmic in $p$. Can we guarantee privacy with comparable amounts of data? Or if more data is needed, how much more?

1.1 Results and Techniques

We proceed to describe our results on privately learning Markov Random Fields. In this section, we will assume familiarity with some of the most common notions of differential privacy: pure $\varepsilon$-differential privacy, $\rho$-zero-concentrated differential privacy, and approximate $(\varepsilon,\delta)$-differential privacy. In particular, one should know that these are in (strictly) decreasing order of strength (i.e., an algorithm which satisfies pure DP gives a stronger privacy guarantee than one which only satisfies concentrated DP); formal definitions appear in Section 2. Furthermore, in order to be precise, some of our theorem statements will use notation which is defined later (Section 2) – these may be skipped on a first reading, as our prose will not require this knowledge.

Upper Bounds.

Our first upper bounds are for parameter learning. First, we have the following theorem, which gives an upper bound for parameter learning of pairwise graphical models under concentrated differential privacy, showing that this learning goal can be achieved with $O(\sqrt{p})$ samples. In particular, this includes the special case of the Ising model, which corresponds to an alphabet size of $k=2$. Note that this implies the same result if one relaxes the learning goal to structure learning, or the privacy notion to approximate DP, as these modifications only make the problem easier. Further details are given in Section 3.3.

Theorem 1.2.

There exists an efficient $\rho$-zCDP algorithm which learns the parameters of a pairwise graphical model to accuracy $\alpha$ with probability at least $2/3$, which requires a sample complexity of

$$n=O\left(\frac{\lambda^{2}k^{5}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\,\lambda^{2}k^{5.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^{3}}\right).$$

This result can be seen as a private adaptation of the elegant work of [WuSD19] (which in turn builds on the structural results of [KlivansM17]). Wu, Sanghavi, and Dimakis [WuSD19] show that $\ell_{1}$-constrained logistic regression suffices to learn the parameters of all pairwise graphical models. We first develop a private analog of this method, based on the private Frank-Wolfe method of Talwar, Thakurta, and Zhang [TalwarTZ14, TalwarTZ15], which is of independent interest. This method is studied in Section 3.1.

Theorem 1.3.

If we consider the problem of private sparse logistic regression, there exists an efficient $\rho$-zCDP algorithm that produces a parameter vector $w^{priv}$ such that, with probability at least $1-\beta$, the excess empirical risk satisfies

$$\mathcal{L}(w^{priv};D)-\mathcal{L}(w^{erm};D)=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right).$$

We note that Theorem 1.3 avoids a polynomial dependence on the dimension $p$ in favor of a polynomial dependence on the “sparsity” parameter $\lambda$. The greater dependence on $p$ which arises in Theorem 1.2 is from applying Theorem 1.3 and then using composition properties of concentrated DP.

We go on to generalize the results of [WuSD19], showing that $\ell_{1}$-constrained logistic regression can also learn the parameters of binary $t$-wise MRFs. This result is novel even in the non-private setting. Further details are presented in Section 4.

The following theorem shows that we can learn the parameters of binary $t$-wise MRFs with $\tilde{O}(\sqrt{p})$ samples.

Theorem 1.4.

Let $\mathcal{D}$ be an unknown binary $t$-wise MRF with associated polynomial $h$. Then there exists a $\rho$-zCDP algorithm which, with probability at least $2/3$, learns the maximal monomials of $h$ to accuracy $\alpha$, given $n$ i.i.d. samples $Z^{1},\cdots,Z^{n}\sim\mathcal{D}$, where

$$n=O\left(\frac{e^{5\lambda t}\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\,\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}\sqrt{p}\log p}{\sqrt{\rho}\,\alpha^{2}}+\frac{e^{6\lambda t}\log(p)}{\alpha^{6}}\right).$$

To obtain the rate above, our algorithm uses the Private Multiplicative Weights (PMW) method of [HardtR10] to estimate all parity queries of order at most $t$. The PMW method runs in time exponential in $p$, since it maintains a distribution over the data domain. We can also obtain an oracle-efficient algorithm that runs in polynomial time when given access to an empirical risk minimization oracle over the class of parities. By replacing PMW with such an oracle-efficient algorithm, sepFEM in [VietriTBSW20], we obtain a slightly worse sample complexity of

$$n=O\left(\frac{e^{5\lambda t}\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\,\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}p^{5/4}\log p}{\sqrt{\rho}\,\alpha^{2}}+\frac{e^{6\lambda t}\log(p)}{\alpha^{6}}\right).$$

For the special case of structure learning under approximate differential privacy, we provide a significantly better algorithm. In particular, we can achieve an $O(\log p)$ sample complexity, which improves exponentially on the above algorithm’s sample complexity of $O(\sqrt{p})$. The following is a representative theorem statement for pairwise graphical models, though we derive similar statements for binary MRFs of higher order.

Theorem 1.5.

There exists an efficient $(\varepsilon,\delta)$-differentially private algorithm which, with probability at least $2/3$, learns the structure of a pairwise graphical model, which requires a sample complexity of

$$n=O\left(\frac{\lambda^{2}k^{4}\exp(14\lambda)\log(pk)\log(1/\delta)}{\varepsilon\eta^{4}}\right).$$

This result can be derived using stability properties of non-private algorithms. In particular, in the non-private setting, algorithms for this problem recover the entire graph exactly with constant probability. This allows us to derive private algorithms at a multiplicative cost of $O(\log(1/\delta)/\varepsilon)$ samples, using either the propose-test-release framework [DworkL09] or stability-based histograms [KorolovaKMN09, BunNSV15]. Further details are given in Section 6.

Lower Bounds.

We note the significant gap between the aforementioned upper bounds: in particular, our more generally applicable upper bound (Theorem 1.2) has a $O(\sqrt{p})$ dependence on the dimension, whereas the best known lower bound is $\Omega(\log p)$ [SanthanamW12]. However, we show that our upper bound is tight. That is, even if we relax the privacy notion to approximate differential privacy, or relax the learning goal to structure learning, the sample complexity is still $\Omega(\sqrt{p})$. Perhaps surprisingly, if we perform both relaxations simultaneously, this falls into the purview of Theorem 1.5, and the sample complexity drops to $O(\log p)$.

First, we show that even under approximate differential privacy, learning the parameters of a graphical model requires $\Omega(\sqrt{p})$ samples. The formal statement is given in Section 5.

Theorem 1.6 (Informal).

Any algorithm which satisfies approximate differential privacy and learns the parameters of a pairwise graphical model with probability at least $2/3$ requires $\operatorname{poly}(p)$ samples.

This result is proved by constructing a family of instances of binary pairwise graphical models (i.e., Ising models) which encode product distributions. Specifically, we consider the set of graphs formed by a perfect matching with edges $(2i,2i+1)$ for $i\in[p/2]$. In order to estimate the parameter on every edge, one must estimate the correlation between each such pair of nodes, which can be shown to correspond to learning the mean of a particular product distribution in $\ell_{\infty}$-distance. This problem is well-known to have a gap between the non-private and private sample complexities, due to methods derived from fingerprinting codes [BunUV14, DworkSSUV15, SteinkeU17a], and differentially private Fano’s inequality.
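To make the construction concrete, the following is a minimal Python sketch (illustrative helper names, not code from this chapter; it assumes $p$ is even and uses 0-indexed pairs) that builds the matching-based weight matrix and samples from it; because the dependency graph is a matching, each matched pair can be sampled independently from its four-point distribution.

    import numpy as np

    def matching_ising_weights(p, a):
        # Weight matrix whose dependency graph is the perfect matching
        # {(2i, 2i+1)}; the mean-field vector theta is taken to be zero.
        A = np.zeros((p, p))
        for i in range(p // 2):
            A[2 * i, 2 * i + 1] = A[2 * i + 1, 2 * i] = a
        return A

    def sample_matching_ising(p, a, n, rng):
        # The model factorizes over matched pairs, so each pair is drawn
        # independently with Pr(z1, z2) proportional to exp(a * z1 * z2).
        combos = np.array([(s, t) for s in (-1, 1) for t in (-1, 1)])
        probs = np.exp(a * combos[:, 0] * combos[:, 1])
        probs /= probs.sum()
        Z = np.empty((n, p), dtype=int)
        for i in range(p // 2):
            idx = rng.choice(4, size=n, p=probs)
            Z[:, 2 * i:2 * i + 2] = combos[idx]
        return Z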

Second, we show that learning the structure of a graphical model, under either pure or concentrated differential privacy, requires $\operatorname{poly}(p)$ samples. The formal theorem appears in Section 7.

Theorem 1.7 (Informal).

Any algorithm which satisfies pure or concentrated differential privacy and learns the structure of a pairwise graphical model with probability at least $2/3$ requires $\operatorname{poly}(p)$ samples.

We derive this result via packing arguments [HardtT10, BeimelBKN14] and differentially private Fano’s inequality, by showing that there exists a large number (exponential in $p$) of different binary pairwise graphical models which must be distinguished. The construction of a set of size $m$ implies lower bounds of $\Omega(\log m)$ and $\Omega(\sqrt{\log m})$ for learning under pure and concentrated differential privacy, respectively.

1.1.1 Summary and Discussion

We summarize our findings on privately learning Markov Random Fields in Table 1, focusing on the specific case of the Ising model. We note that qualitatively similar relationships between problems also hold for general pairwise models as well as higher-order binary Markov Random Fields. Each cell denotes the sample complexity of a learning task, which is a combination of an objective and a privacy constraint. Problems become harder as we go down (as the privacy requirement is tightened) and to the right (structure learning is easier than parameter learning).

The top row shows that, absent privacy constraints, both learning goals require only $\Theta(\log p)$ samples, and are thus tractable even in very high-dimensional settings or when data is limited. However, if we additionally wish to guarantee privacy, our results show that this logarithmic sample complexity is only achievable when one considers structure learning under approximate differential privacy. If one changes the learning goal to parameter learning, or tightens the privacy notion to concentrated differential privacy, then the sample complexity jumps to become polynomial in the dimension, in particular $\Omega(\sqrt{p})$. Nonetheless, we provide algorithms which match this dependence, giving a tight $\Theta(\sqrt{p})$ bound on the sample complexity.

|                      | Structure Learning                        | Parameter Learning                       |
| Non-private          | $\Theta(\log p)$ (folklore)               | $\Theta(\log p)$ (folklore)              |
| Approximate DP       | $\Theta(\log p)$ (Theorem 6.3)            | $\Theta(\sqrt{p})$ (Theorems 3.8 and 5.1) |
| Zero-concentrated DP | $\Theta(\sqrt{p})$ (Theorems 3.8 and 7.1) | $\Theta(\sqrt{p})$ (Theorems 3.8 and 5.1) |
| Pure DP              | $\Omega(p)$ (Theorem 7.1)                 | $\Omega(p)$ (Theorem 7.1)                |

Table 1: Sample complexity (dependence on $p$) of privately learning an Ising model.

1.2 Related Work

As mentioned before, there has been significant work in learning the structure and parameters of graphical models, see, e.g., [ChowL68, CsiszarT06, AbbeelKN06, RavikumarWL10, JalaliJR11, JalaliRVS11, SanthanamW12, BreslerGS14b, Bresler15, VuffrayMLC16, KlivansM17, HamiltonKM17, RigolletH17, LokhovVMC18, WuSD19]. Perhaps a turning point in this literature is the work of Bresler [Bresler15], who showed for the first time that general Ising models of bounded degree can be learned in polynomial time. Since this result, subsequent works have focused on both generalizing these results to broader settings (including MRFs with higher-order interactions and non-binary alphabets) as well as simplifying existing arguments. There has also been work on learning, testing, and inferring other statistical properties of graphical models [BhattacharyaM16, MartindelCampoCU16, DaskalakisDK17, MukherjeeMY18, Bhattacharya19]. In particular, learning and testing Ising models in statistical distance have also been explored [DaskalakisDK18, GheissariLP18, DevroyeMR20, DaskalakisDK19, BezakovaBCSV19], and are interesting questions under the constraint of privacy.

Recent investigations at the intersection of graphical models and differential privacy include [BernsteinMSSHM17, ChowdhuryRJ20, McKennaSM19]. Bernstein et al. [BernsteinMSSHM17] privately learn graphical models by adding noise to the sufficient statistics and use an expectation-maximization based approach to recover the parameters. However, the focus is somewhat different, as they do not provide finite sample guarantees for the accuracy when performing parameter recovery, nor consider structure learning at all. Chowdhury, Rekatsinas, and Jha [ChowdhuryRJ20] study differentially private learning of Bayesian Networks, another popular type of graphical model which is incomparable with Markov Random Fields. McKenna, Sheldon, and Miklau [McKennaSM19] apply graphical models in place of full contingency tables to privately perform inference.

Graphical models can be seen as a natural extension of product distributions, which correspond to the case where the order $t$ of the MRF is $1$. There has been significant work in differentially private estimation of product distributions [BlumDMN05, BunUV14, DworkMNS06, SteinkeU17a, KamathLSU18, CaiWZ19, BunKSW2019]. Recently, this investigation has been broadened into differentially private distribution estimation, including sample-based estimation of properties and parameters, see, e.g., [NissimRS07, Smith11, BunNSV15, DiakonikolasHS15, KarwaV18, AcharyaKSZ18, KamathLSU18, BunKSW2019]. For further coverage of differentially private statistics, see [KamathU20].

2 Preliminaries and Notation

In order to distinguish between the vector coordinate and the sample index, we use a different notation in this chapter. Given a set of points $X^{n}$, we use superscripts, i.e., $X^{i}$, to denote the $i$-th datapoint. Given a vector $X\in\mathbb{R}^{p}$, we use subscripts, i.e., $X_{i}$, to denote its $i$-th coordinate. We also use $X_{-i}$ to denote the vector after deleting the $i$-th coordinate, i.e., $X_{-i}=[X_{1},\cdots,X_{i-1},X_{i+1},\cdots,X_{p}]$.

2.1 Markov Random Field Preliminaries

We first introduce the definition of the Ising model, which is the special case of general MRFs with $k=t=2$.

Definition 2.1.

The $p$-variable Ising model is a distribution $\mathcal{D}(A,\theta)$ on $\{-1,1\}^{p}$ that satisfies

$$\Pr\left(Z=z\right)\propto\exp\left(\sum_{1\leq i\leq j\leq p}A_{i,j}z_{i}z_{j}+\sum_{i\in[p]}\theta_{i}z_{i}\right),$$

where $A\in\mathbb{R}^{p\times p}$ is a symmetric weight matrix with $A_{ii}=0$ for all $i\in[p]$, and $\theta\in\mathbb{R}^{p}$ is a mean-field vector. The dependency graph of $\mathcal{D}(A,\theta)$ is an undirected graph $G=(V,E)$, with vertices $V=[p]$ and edges $E=\{(i,j):A_{i,j}\neq 0\}$. The width of $\mathcal{D}(A,\theta)$ is defined as

$$\lambda(A,\theta)=\max_{i\in[p]}\left(\sum_{j\in[p]}\left|A_{i,j}\right|+\left|\theta_{i}\right|\right).$$

Let $\eta(A,\theta)$ be the minimum edge weight in absolute value, i.e., $\eta(A,\theta)=\min_{i,j\in[p]:A_{i,j}\neq 0}\left|A_{i,j}\right|$.
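For concreteness, the quantities $\lambda(A,\theta)$ and $\eta(A,\theta)$ can be computed directly from the definition, as in the following minimal sketch (illustrative Python, assuming numpy arrays A and theta):

    import numpy as np

    def width(A, theta):
        # lambda(A, theta) = max_i ( sum_j |A_{i,j}| + |theta_i| )
        return np.max(np.sum(np.abs(A), axis=1) + np.abs(theta))

    def min_edge_weight(A):
        # eta(A, theta) = minimum of |A_{i,j}| over the nonzero (edge) entries
        nonzero = np.abs(A)[A != 0]
        return nonzero.min() if nonzero.size > 0 else 0.0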

We note that the Ising model is supported on $\{-1,1\}^{p}$. A natural generalization is to extend its support to $[k]^{p}$, while maintaining pairwise correlations.

Definition 2.2.

The $p$-variable pairwise graphical model is a distribution $\mathcal{D}(\mathcal{W},\Theta)$ on $[k]^{p}$ that satisfies

$$\Pr\left(Z=z\right)\propto\exp\left(\sum_{1\leq i\leq j\leq p}W_{i,j}(z_{i},z_{j})+\sum_{i\in[p]}\theta_{i}(z_{i})\right),$$

where $\mathcal{W}=\{W_{i,j}\in\mathbb{R}^{k\times k}:i\neq j\in[p]\}$ is a set of weight matrices satisfying $W_{i,j}=W^{T}_{j,i}$, and $\Theta=\{\theta_{i}\in\mathbb{R}^{k}:i\in[p]\}$ is a set of mean-field vectors. The dependency graph of $\mathcal{D}(\mathcal{W},\Theta)$ is an undirected graph $G=(V,E)$, with vertices $V=[p]$ and edges $E=\{(i,j):W_{i,j}\neq 0\}$. The width of $\mathcal{D}(\mathcal{W},\Theta)$ is defined as

$$\lambda(\mathcal{W},\Theta)=\max_{i\in[p],a\in[k]}\left(\sum_{j\in[p]\backslash i}\max_{b\in[k]}\left|W_{i,j}(a,b)\right|+\left|\theta_{i}(a)\right|\right).$$

Define $\eta(\mathcal{W},\Theta)=\min_{(i,j)\in E}\max_{a,b}|W_{i,j}(a,b)|$.

Both models above only consider pairwise interactions between nodes. In order to capture higher-order interactions, we examine the more general model of Markov Random Fields (MRFs). In this chapter, we will restrict our attention to MRFs over a binary alphabet (i.e., distributions over $\{\pm 1\}^{p}$). In order to define binary $t$-wise MRFs, we first need the following definitions of multilinear polynomials, partial derivatives, and maximal monomials.

Definition 2.3.

A multilinear polynomial is a function $h:\mathbb{R}^{p}\rightarrow\mathbb{R}$ of the form $h(x)=\sum_{I}\bar{h}(I)\prod_{i\in I}x_{i}$, where $\bar{h}(I)$ denotes the coefficient of the monomial $\prod_{i\in I}x_{i}$. Let $\partial_{i}h(x)=\sum_{J:i\not\in J}\bar{h}(J\cup\{i\})\prod_{j\in J}x_{j}$ denote the partial derivative of $h$ with respect to $x_{i}$. Similarly, for $I\subseteq[p]$, let $\partial_{I}h(x)=\sum_{J:J\cap I=\emptyset}\bar{h}(J\cup I)\prod_{j\in J}x_{j}$ denote the partial derivative of $h$ with respect to the variables $(x_{i}:i\in I)$. We say $I\subseteq[p]$ is a maximal monomial of $h$ if $\bar{h}(J)=0$ for all $J\supset I$.
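As an illustration of Definition 2.3, one can store a multilinear polynomial as a map from monomials (frozensets of variable indices) to coefficients; the sketch below (a hypothetical representation, not code from this chapter) evaluates $h$, computes $\partial_{I}h$, and checks whether a monomial is maximal.

    def evaluate(h, x):
        # h: dict mapping frozenset(I) -> coefficient h_bar(I); x: list of +/-1 values
        total = 0.0
        for I, coeff in h.items():
            prod = 1.0
            for i in I:
                prod *= x[i]
            total += coeff * prod
        return total

    def partial_derivative(h, I):
        # partial_I h(x) = sum over J with J disjoint from I of h_bar(J u I) * prod_{j in J} x_j;
        # equivalently, every monomial K containing I contributes its coefficient to K \ I.
        I = frozenset(I)
        out = {}
        for K, coeff in h.items():
            if I <= K:
                out[K - I] = out.get(K - I, 0.0) + coeff
        return out

    def is_maximal(h, I):
        # I is a maximal monomial if h_bar(J) = 0 for every strict superset J of I
        I = frozenset(I)
        return all(coeff == 0 for J, coeff in h.items() if I < J)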

Now we are able to formally define binary tt-wise MRFs.

Definition 2.4.

For a graph $G=(V,E)$ on $p$ vertices, let $C_{t}(G)$ denote the set of all cliques of size at most $t$ in $G$. A binary $t$-wise Markov random field on $G$ is a distribution $\mathcal{D}$ on $\{-1,1\}^{p}$ which satisfies

$$\Pr_{Z\sim\mathcal{D}}\left(Z=z\right)\propto\exp\left(\sum_{I\in C_{t}(G)}\varphi_{I}(z)\right),$$

where each $\varphi_{I}:\mathbb{R}^{p}\rightarrow\mathbb{R}$ is a multilinear polynomial that depends only on the variables in $I$.

We call $G$ the dependency graph of the MRF and $h(x)=\sum_{I\in C_{t}(G)}\varphi_{I}(x)$ the factorization polynomial of the MRF. The width of $\mathcal{D}$ is defined as $\lambda=\max_{i\in[p]}\|\partial_{i}h\|_{1}$, where $\|h\|_{1}\coloneqq\sum_{I}\left|\bar{h}(I)\right|$.

Now we introduce the definition of a $\delta$-unbiased distribution and its properties. The proofs appear in [KlivansM17].

Definition 2.5 ($\delta$-unbiased).

Let $S$ be the alphabet, e.g., $S=\{-1,1\}$ for binary $t$-wise MRFs and $S=[k]$ for pairwise graphical models. A distribution $\mathcal{D}$ on $S^{p}$ is $\delta$-unbiased if for $Z\sim\mathcal{D}$, every $i\in[p]$, and any assignment $x\in S^{p-1}$ to $Z_{-i}$, we have $\min_{z\in S}\Pr\left(Z_{i}=z\,|\,Z_{-i}=x\right)\geq\delta$.

The marginal distribution of a $\delta$-unbiased distribution also satisfies $\delta$-unbiasedness.

Lemma 2.6.

Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $S^{p}$ with alphabet $S$. For $X\sim\mathcal{D}$ and any $i\in[p]$, the distribution of $X_{-i}$ is also $\delta$-unbiased.

The following lemmas provide $\delta$-unbiasedness guarantees for various graphical models.

Lemma 2.7.

Let $\mathcal{D}(\mathcal{W},\Theta)$ be a pairwise graphical model with alphabet size $k$ and width $\lambda(\mathcal{W},\Theta)$. Then $\mathcal{D}(\mathcal{W},\Theta)$ is $\delta$-unbiased with $\delta=e^{-2\lambda(\mathcal{W},\Theta)}/k$. In particular, an Ising model $\mathcal{D}(A,\theta)$ is $e^{-2\lambda(A,\theta)}/2$-unbiased.

Lemma 2.8.

Let $\mathcal{D}$ be a binary $t$-wise MRF with width $\lambda$. Then $\mathcal{D}$ is $\delta$-unbiased with $\delta=e^{-2\lambda}/2$.

Finally, we define two possible goals for learning graphical models. First, the easier goal is structure learning, which involves recovering the set of non-zero edges.

Definition 2.9.

An algorithm learns the structure of a graphical model if, given samples $Z_{1},\dots,Z_{n}\sim\mathcal{D}$, it outputs a graph $\hat{G}=(V,\hat{E})$ over $V=[p]$ such that $\hat{E}=E$, the set of edges in the dependency graph of $\mathcal{D}$.

The more difficult goal is parameter learning, which requires the algorithm to learn not only the location of the edges, but also their parameter values.

Definition 2.10.

An algorithm learns the parameters of an Ising model (resp. pairwise graphical model) if, given samples $Z_{1},\dots,Z_{n}\sim\mathcal{D}$, it outputs a matrix $\hat{A}$ (resp. set of matrices $\hat{\mathcal{W}}$) such that $\max_{i,j\in[p]}|A_{i,j}-\hat{A}_{i,j}|\leq\alpha$ (resp. $|W_{i,j}(a,b)-\widehat{W}_{i,j}(a,b)|\leq\alpha$ for all $i\neq j\in[p]$ and all $a,b\in[k]$).

Definition 2.11.

An algorithm learns the parameters of a binary $t$-wise MRF with associated polynomial $h$ if, given samples $X^{1},\dots,X^{n}\sim\mathcal{D}$, it outputs another multilinear polynomial $u$ such that for every maximal monomial $I\subseteq[p]$, $\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha$.

2.2 Privacy Preliminaries

A dataset $X=X^{n}\in\mathcal{X}^{n}$ is a collection of points from some universe $\mathcal{X}$. In this chapter we consider a few different variants of differential privacy. The first is the standard notion of differential privacy, which has been used heavily in the previous chapters. The second is concentrated differential privacy [DworkR16]; specifically, we consider its refinement, zero-concentrated differential privacy [BunS16].

Definition 2.12 (Zero-Concentrated Differential Privacy (zCDP) [BunS16]).

A randomized algorithm $\mathcal{A}:\mathcal{X}^{n}\rightarrow\mathcal{S}$ satisfies $\rho$-zCDP if for every pair of neighboring datasets $X,X^{\prime}\in\mathcal{X}^{n}$,

$$\forall\alpha\in(1,\infty)\quad D_{\alpha}\left(\mathcal{A}(X)\,||\,\mathcal{A}(X^{\prime})\right)\leq\rho\alpha,$$

where $D_{\alpha}\left(\mathcal{A}(X)\,||\,\mathcal{A}(X^{\prime})\right)$ is the $\alpha$-Rényi divergence between $\mathcal{A}(X)$ and $\mathcal{A}(X^{\prime})$.

The following lemma quantifies the relationships between $(\varepsilon,0)$-DP, $\rho$-zCDP, and $(\varepsilon,\delta)$-DP.

Lemma 2.13 (Relationships Between Variants of DP [BunS16]).

For every $\varepsilon\geq 0$:

  1. If $\mathcal{A}$ satisfies $(\varepsilon,0)$-DP, then $\mathcal{A}$ is $\frac{\varepsilon^{2}}{2}$-zCDP.

  2. If $\mathcal{A}$ satisfies $\frac{\varepsilon^{2}}{2}$-zCDP, then $\mathcal{A}$ satisfies $(\frac{\varepsilon^{2}}{2}+\varepsilon\sqrt{2\log(\frac{1}{\delta})},\delta)$-DP for every $\delta>0$.

Roughly speaking, pure DP is stronger than zero-concentrated DP, which is stronger than approximate DP.
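For a sense of scale, the following short calculation applies Lemma 2.13 (illustrative only): a $(1,0)$-DP algorithm is $0.5$-zCDP, and a $\rho$-zCDP algorithm is $(\rho+2\sqrt{\rho\log(1/\delta)},\delta)$-DP, which for $\rho=0.5$ and $\delta=10^{-6}$ gives $\varepsilon\approx 5.76$.

    import math

    def pure_dp_to_zcdp(eps):
        # Lemma 2.13, part 1: (eps, 0)-DP implies (eps^2 / 2)-zCDP
        return eps ** 2 / 2

    def zcdp_to_approx_dp(rho, delta):
        # Lemma 2.13, part 2, rewritten with rho = eps^2 / 2:
        # rho-zCDP implies (rho + 2 * sqrt(rho * log(1/delta)), delta)-DP
        return rho + 2 * math.sqrt(rho * math.log(1 / delta)), delta

    rho = pure_dp_to_zcdp(1.0)                  # 0.5
    print(zcdp_to_approx_dp(rho, delta=1e-6))   # approximately (5.76, 1e-06)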

A crucial property of all variants of differential privacy is that they can be composed adaptively. By adaptive composition, we mean a sequence of algorithms $\mathcal{A}_{1}(X),\dots,\mathcal{A}_{T}(X)$ where the algorithm $\mathcal{A}_{t}(X)$ may also depend on the outcomes of the algorithms $\mathcal{A}_{1}(X),\dots,\mathcal{A}_{t-1}(X)$.

Lemma 2.14 (Composition of zero-concentrated DP [BunS16]).

If $\mathcal{A}$ is an adaptive composition of differentially private algorithms $\mathcal{A}_{1},\dots,\mathcal{A}_{T}$, and $\mathcal{A}_{1},\dots,\mathcal{A}_{T}$ are $\rho_{1},\dots,\rho_{T}$-zCDP respectively, then $\mathcal{A}$ is $\rho$-zCDP for $\rho=\sum_{t}\rho_{t}$.

3 Parameter Learning of Pairwise Graphical Models

3.1 Private Sparse Logistic Regression

As a subroutine of our parameter learning algorithm, we consider the following problem: given a training dataset $D=\{d^{j}\}_{j=1}^{n}=\{(x^{j},y^{j})\}_{j=1}^{n}$ consisting of $n$ pairs, where $x^{j}\in\mathbb{R}^{p}$ and $y^{j}\in\mathbb{R}$, a constraint set $\mathcal{C}\subseteq\mathbb{R}^{p}$, and a loss function $\ell:\mathcal{C}\times\mathbb{R}^{p+1}\rightarrow\mathbb{R}$, we want to find $w^{erm}=\arg\min_{w\in\mathcal{C}}\mathcal{L}(w;D)=\arg\min_{w\in\mathcal{C}}\frac{1}{n}\sum_{j=1}^{n}\ell(w;d^{j})$ under a zCDP constraint. This problem was previously studied in [TalwarTZ14]. Before stating their results, we need the following two definitions. The first definition concerns Lipschitz continuity.

Definition 3.1.

A function $\ell:\mathcal{C}\rightarrow\mathbb{R}$ is $L_{1}$-Lipschitz with respect to the $\ell_{1}$ norm if

$$\forall w_{1},w_{2}\in\mathcal{C},\quad\left|\ell(w_{1})-\ell(w_{2})\right|\leq L_{1}\|w_{1}-w_{2}\|_{1}.$$

The performance of the algorithm also depends on the “curvature” of the loss function, which is defined below, based on the definition of [Clarkson10, Jaggi13]. A side remark is that this is a strictly weaker constraint than smoothness [TalwarTZ14].

Definition 3.2 (Curvature constant).

For $\ell:\mathcal{C}\rightarrow\mathbb{R}$, the curvature constant $\Gamma_{\ell}$ is defined as

$$\Gamma_{\ell}=\sup_{w_{1},w_{2}\in\mathcal{C},\,\gamma\in(0,1],\,w_{3}=w_{1}+\gamma(w_{2}-w_{1})}\frac{2}{\gamma^{2}}\left(\ell(w_{3})-\ell(w_{1})-\langle w_{3}-w_{1},\nabla\ell(w_{1})\rangle\right).$$

Now we are able to introduce the algorithm and its theoretical guarantees.

Algorithm 1 $\mathcal{A}_{PFW}(D,\mathcal{L},\rho,\mathcal{C})$: Private Frank-Wolfe Algorithm

Input: Dataset $D=\{d^{1},\cdots,d^{n}\}$; loss function $\mathcal{L}(w;D)=\frac{1}{n}\sum_{j=1}^{n}\ell(w;d^{j})$ (with Lipschitz constant $L_{1}$); privacy parameter $\rho$; convex set $\mathcal{C}=\text{conv}(S)$ with $\|\mathcal{C}\|_{1}\coloneqq\max_{s\in S}\|s\|_{1}$; number of iterations $T$

1: Initialize $w_{1}$ to an arbitrary point in $\mathcal{C}$
2: for $t=1$ to $T-1$:
3:   For each $s\in S$, $\alpha_{s}\leftarrow\langle s,\nabla\mathcal{L}(w_{t};D)\rangle+\text{Lap}\left(0,\frac{L_{1}\|\mathcal{C}\|_{1}\sqrt{T}}{n\sqrt{\rho}}\right)$
4:   $\tilde{w}_{t}\leftarrow\arg\min_{s\in S}\alpha_{s}$
5:   $w_{t+1}\leftarrow(1-\mu_{t})w_{t}+\mu_{t}\tilde{w}_{t}$, where $\mu_{t}=\frac{2}{t+2}$

Output: $w^{priv}=w_{T}$
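For intuition, here is a minimal numpy sketch of Algorithm 1 specialized to the $\ell_{1}$ ball $\mathcal{C}=\{w:\|w\|_{1}\leq\lambda\}$, whose vertex set $S$ consists of the $2p$ points $\{\pm\lambda e_{j}\}$; the function signature and the gradient callback are illustrative, not from the paper.

    import numpy as np

    def private_frank_wolfe(grad_fn, p, lam, L1, n, rho, T, rng):
        # Private Frank-Wolfe over C = {w : ||w||_1 <= lam}.
        # grad_fn(w) returns the gradient of the empirical loss L(w; D).
        w = np.zeros(p)                          # an arbitrary point in C
        noise_scale = L1 * lam * np.sqrt(T) / (n * np.sqrt(rho))
        for t in range(1, T):
            g = grad_fn(w)
            # noisy score <s, grad> + Laplace noise for each vertex s = +/- lam * e_j
            scores = np.concatenate([lam * g, -lam * g])
            scores += rng.laplace(scale=noise_scale, size=2 * p)
            j = int(np.argmin(scores))
            s = np.zeros(p)
            s[j % p] = lam if j < p else -lam
            mu = 2.0 / (t + 2)
            w = (1 - mu) * w + mu * s
        return w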
Lemma 3.3 (Theorem 5.5 from [TalwarTZ14]).

Algorithm 1 satisfies $\rho$-zCDP. Furthermore, let $L_{1}$ and $\|\mathcal{C}\|_{1}$ be defined as in Algorithm 1, let $\Gamma_{\ell}$ be an upper bound on the curvature constant of the loss function $\ell(\cdot;d)$ for all $d$, and let $\left|S\right|$ be the number of extreme points in $S$. If we set $T=\frac{\Gamma_{\ell}^{\frac{2}{3}}(n\sqrt{\rho})^{\frac{2}{3}}}{L_{1}\|\mathcal{C}\|_{1}^{\frac{2}{3}}}$, then with probability at least $1-\beta$ over the randomness of the algorithm,

$$\mathcal{L}(w^{priv};D)-\mathcal{L}(w^{erm};D)=O\left(\frac{\Gamma_{\ell}^{\frac{1}{3}}(L_{1}\|\mathcal{C}\|_{1})^{\frac{2}{3}}\log(\frac{n\left|S\right|}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right).$$
Proof.

The utility guarantee is proved in [TalwarTZ14]; therefore, it suffices to show that the algorithm satisfies $\rho$-zCDP. By the definition of the Laplace mechanism, every iteration of the algorithm satisfies $(\sqrt{\frac{\rho}{T}},0)$-DP, which implies $\frac{\rho}{T}$-zCDP by Lemma 2.13. Then, by the composition theorem for zCDP (Lemma 2.14), the algorithm satisfies $\rho$-zCDP. ∎

For the specific problem of sparse logistic regression, we obtain the following corollary.

Corollary 3.4.

Consider the problem of sparse logistic regression, i.e., $\mathcal{L}(w;D)=\frac{1}{n}\sum_{j=1}^{n}\log(1+e^{-y^{j}\langle w,x^{j}\rangle})$ with the constraint set $\mathcal{C}=\{w:\|w\|_{1}\leq\lambda\}$, and further assume that $\left\lVert x^{j}\right\rVert_{\infty}\leq 1$ and $y^{j}\in\{\pm 1\}$ for all $j$. If we set $T=\lambda^{\frac{2}{3}}(n\sqrt{\rho})^{\frac{2}{3}}$, then with probability at least $1-\beta$ over the randomness of the algorithm,

$$\mathcal{L}(w^{priv};D)-\mathcal{L}(w^{erm};D)=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right).$$

Furthermore, the time complexity of the algorithm is $O(T\cdot(np+p^{2}))=O\left(n^{\frac{2}{3}}\cdot(np+p^{2})\right)$.

Proof.

First, we show that $L_{1}\leq 2$. Fix a sample $d=(x,y)$; then for any $w_{1},w_{2}\in\mathcal{C}$,

$$\left|\ell(w_{1};d)-\ell(w_{2};d)\right|\leq\max_{w}\left\lVert\nabla_{w}\ell(w;d)\right\rVert_{\infty}\cdot\|w_{1}-w_{2}\|_{1}.$$

Since $\nabla_{w}\ell(w;d)=\left(\sigma(\langle w,x\rangle)-y\right)\cdot x$, we have $\left\lVert\nabla_{w}\ell(w;d)\right\rVert_{\infty}\leq 2$.

Next, we wish to show that $\Gamma_{\ell}\leq\lambda^{2}$. We use the following lemma from [TalwarTZ14].

Lemma 3.5 (Remark 4 in [TalwarTZ14]).

For any $q,r\geq 1$ such that $\frac{1}{q}+\frac{1}{r}=1$, $\Gamma_{\ell}$ is upper bounded by $\alpha\left\lVert\mathcal{C}\right\rVert_{q}^{2}$, where $\alpha=\max_{w\in\mathcal{C},\left\lVert v\right\rVert_{q}=1}\left\lVert\nabla^{2}\ell(w)\cdot v\right\rVert_{q}$.

If we take $q=1$ and $r=+\infty$, then $\Gamma_{\ell}\leq\alpha\lambda^{2}$, where

$$\alpha=\max_{w\in\mathcal{C},\|v\|_{1}=1}\left\lVert\nabla^{2}\ell(w;d)\cdot v\right\rVert_{\infty}\leq\max_{i,j\in[p]}\left(\nabla^{2}\ell(w;d)\right)_{i,j}.$$

We have $\alpha\leq 1$, since $\nabla^{2}\ell(w;d)=\sigma(\langle w,x\rangle)\left(1-\sigma(\langle w,x\rangle)\right)\cdot xx^{T}$ and $\left\lVert x\right\rVert_{\infty}\leq 1$.

Finally, given $\mathcal{C}=\{w:\|w\|_{1}\leq\lambda\}$, the number of extreme points satisfies $\left|S\right|=2p$. Substituting all of these parameters into Lemma 3.3 proves the loss guarantee in the corollary.

With respect to the time complexity, we note that the time complexity of each iteration is $O(np+p^{2})$ and there are $T$ iterations in total. ∎
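Continuing the earlier sketch of Algorithm 1, a hypothetical instantiation for sparse logistic regression passes in the empirical gradient of the logistic loss (with labels in $\{\pm 1\}$, so $L_{1}\leq 2$) and sets $T=\lambda^{2/3}(n\sqrt{\rho})^{2/3}$ as in Corollary 3.4; the names below are illustrative, not from the paper.

    import numpy as np

    def logistic_grad(w, X, y):
        # Gradient of (1/n) * sum_m log(1 + exp(-y_m <w, x_m>)) with y_m in {-1, +1}:
        # each sample contributes -y_m * sigma(-y_m <w, x_m>) * x_m.
        margins = y * (X @ w)
        coeff = -y / (1.0 + np.exp(margins))
        return X.T @ coeff / X.shape[0]

    # hypothetical usage, with X an n x p array with entries in [-1, 1] and y in {-1, +1}^n:
    # T = int((lam ** 2 * n ** 2 * rho) ** (1 / 3)) + 1
    # w_priv = private_frank_wolfe(lambda w: logistic_grad(w, X, y), p=X.shape[1],
    #                              lam=lam, L1=2.0, n=n, rho=rho, T=T,
    #                              rng=np.random.default_rng(0))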

Now, if we further assume that the dataset $D$ is drawn i.i.d. from some underlying distribution $P$, the following result from learning theory relates the true risk to the empirical risk; it will be used heavily in the following sections.

Theorem 3.6.

Consider the same problem setting and assumptions as in Corollary 3.4, and further assume that the training dataset $D$ is drawn i.i.d. from some unknown distribution $P$. Then with probability at least $1-\beta$ over the randomness of the algorithm and the training dataset,

$$\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}+\frac{\lambda\log\left(\frac{1}{\beta}\right)}{\sqrt{n}}\right),$$

where $w^{*}=\arg\min_{w\in\mathcal{C}}\mathbb{E}_{(X,Y)\sim P}\left[\ell(w;(X,Y))\right]$.

Proof.

By the triangle inequality,

$$\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]$$
$$\leq\left|\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{priv};d^{m})\right|+\left|\frac{1}{n}\sum_{m=1}^{n}\ell(w^{priv};d^{m})-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{erm};d^{m})\right|$$
$$+\left(\frac{1}{n}\sum_{m=1}^{n}\ell(w^{erm};d^{m})-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{*};d^{m})\right)+\left|\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{*};d^{m})\right|.$$

We now bound each term. The first and last terms are handled simultaneously: by the generalization error bound (Lemma 7 of [WuSD19]), both are at most $O\left(\frac{\lambda\log\left(\frac{1}{\beta}\right)}{\sqrt{n}}\right)$ with probability greater than $1-\frac{2}{3}\beta$. For the second term, by Corollary 3.4, with probability greater than $1-\frac{1}{3}\beta$ it is bounded by $O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right)$. Finally, by the definition of $w^{erm}$, the third term is at most $0$. Therefore, by a union bound, with probability greater than $1-\beta$, $\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}+\frac{\lambda\log\left(\frac{1}{\beta}\right)}{\sqrt{n}}\right)$. ∎

3.2 Privately Learning Ising Models

We first consider the problem of estimating the weight matrix of the Ising model. To be precise, given $n$ i.i.d. samples $\{z^{1},\cdots,z^{n}\}$ generated from an unknown distribution $\mathcal{D}(A,\theta)$, our goal is to design a $\rho$-zCDP estimator $\hat{A}$ such that with probability at least $\frac{2}{3}$, $\max_{i,j\in[p]}\left|A_{i,j}-\hat{A}_{i,j}\right|\leq\alpha$.

A key observation about the Ising model is that, for any node $Z_{i}$, the probability that $Z_{i}=1$ conditioned on the values of the remaining nodes $Z_{-i}$ follows a sigmoid function. The next lemma, from [KlivansM17], formalizes this observation.

Lemma 3.7.

Let $Z\sim\mathcal{D}(A,\theta)$ with $Z\in\{-1,1\}^{p}$. Then for all $i\in[p]$ and all $x\in\{-1,1\}^{[p]\backslash\{i\}}$,

$$\Pr\left(Z_{i}=1|Z_{-i}=x\right)=\sigma\left(\sum_{j\neq i}2A_{i,j}x_{j}+2\theta_{i}\right)=\sigma\left(\langle w,x^{\prime}\rangle\right),$$

where $w=2[A_{i,1},\cdots,A_{i,i-1},A_{i,i+1},\cdots,A_{i,p},\theta_{i}]\in\mathbb{R}^{p}$ and $x^{\prime}=[x,1]\in\mathbb{R}^{p}$.

Proof.

The proof is from [KlivansM17], and we include it here for completeness. According to the definition of the Ising model, after canceling the terms that do not involve $z_{i}$ from the numerator and denominator,

$$\Pr\left(Z_{i}=1|Z_{-i}=x\right)=\frac{\exp\left(\sum_{j\neq i}A_{i,j}x_{j}+\theta_{i}\right)}{\exp\left(\sum_{j\neq i}A_{i,j}x_{j}+\theta_{i}\right)+\exp\left(-\sum_{j\neq i}A_{i,j}x_{j}-\theta_{i}\right)}=\sigma\left(\sum_{j\neq i}2A_{i,j}x_{j}+2\theta_{i}\right).\qquad∎$$
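The identity in Lemma 3.7 is also easy to verify numerically for small $p$ by comparing the sigmoid expression against the conditional probability computed from the unnormalized Ising weights; the sketch below (illustrative only) performs this brute-force check.

    import itertools
    import numpy as np

    def unnormalized_weight(z, A, theta):
        # exp( sum_{i <= j} A_{ij} z_i z_j + sum_i theta_i z_i ), A symmetric with zero diagonal
        return np.exp(0.5 * z @ A @ z + theta @ z)

    def conditional_brute_force(i, x, A, theta):
        # Pr(Z_i = 1 | Z_{-i} = x) by summing over the two completions of coordinate i
        z_plus = np.insert(x, i, 1.0)
        z_minus = np.insert(x, i, -1.0)
        w_plus = unnormalized_weight(z_plus, A, theta)
        w_minus = unnormalized_weight(z_minus, A, theta)
        return w_plus / (w_plus + w_minus)

    def conditional_sigmoid(i, x, A, theta):
        z = np.insert(x, i, 0.0)                 # the value placed at coordinate i is irrelevant
        return 1.0 / (1.0 + np.exp(-(2 * A[i] @ z + 2 * theta[i])))

    rng = np.random.default_rng(1)
    p = 4
    A = rng.normal(size=(p, p)); A = (A + A.T) / 2; np.fill_diagonal(A, 0.0)
    theta = rng.normal(size=p)
    for x in itertools.product([-1.0, 1.0], repeat=p - 1):
        x = np.array(x)
        assert np.isclose(conditional_brute_force(0, x, A, theta),
                          conditional_sigmoid(0, x, A, theta))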

By Lemma 3.7, we can estimate the weight matrix by solving a logistic regression for each node, an approach utilized in [WuSD19] to design non-private estimators. Our algorithm uses the private Frank-Wolfe method to solve the per-node logistic regression problem, achieving the following theoretical guarantee.

Algorithm 2 Privately Learning Ising Models

Input: $n$ samples $\{z^{1},\cdots,z^{n}\}$, where $z^{m}\in\{\pm 1\}^{p}$ for $m\in[n]$; an upper bound $\lambda$ on the width, $\lambda(A,\theta)\leq\lambda$; privacy parameter $\rho$

1: for $i=1$ to $p$:
2:   For all $m\in[n]$, $x^{m}\leftarrow[z_{-i}^{m},1]$, $y^{m}\leftarrow z_{i}^{m}$
3:   $w^{priv}\leftarrow\mathcal{A}_{PFW}(D,\mathcal{L},\rho^{\prime},\mathcal{C})$, where $\rho^{\prime}=\frac{\rho}{p}$, $D=\{(x^{m},y^{m})\}_{m=1}^{n}$, $\mathcal{L}(w;D)=\frac{1}{n}\sum_{m=1}^{n}\log\left(1+e^{-y^{m}\langle w,x^{m}\rangle}\right)$, and $\mathcal{C}=\{w:\|w\|_{1}\leq 2\lambda\}$
4:   For all $j\in[p]\backslash\{i\}$, $\hat{A}_{i,j}\leftarrow\frac{1}{2}w^{priv}_{\tilde{j}}$, where $\tilde{j}=j$ if $j<i$ and $\tilde{j}=j-1$ if $j>i$

Output: $\hat{A}\in\mathbb{R}^{p\times p}$
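Putting the pieces together, a hypothetical end-to-end sketch of Algorithm 2 calls the illustrative private_frank_wolfe and logistic_grad sketches from Section 3.1 once per node with budget $\rho/p$ and reads off the first $p-1$ coordinates of each per-node solution.

    import numpy as np

    def learn_ising_privately(Z, lam, rho, rng):
        # Z: n x p array of +/-1 samples; returns an estimate of the weight matrix A.
        n, p = Z.shape
        A_hat = np.zeros((p, p))
        T = int(((2 * lam) ** 2 * n ** 2 * rho / p) ** (1 / 3)) + 1   # Corollary 3.4 with rho' = rho / p
        for i in range(p):
            X = np.column_stack([np.delete(Z, i, axis=1), np.ones(n)])   # x^m = [z^m_{-i}, 1]
            y = Z[:, i].astype(float)
            w = private_frank_wolfe(lambda w: logistic_grad(w, X, y),
                                    p=p, lam=2 * lam, L1=2.0,
                                    n=n, rho=rho / p, T=T, rng=rng)
            A_hat[i, np.arange(p) != i] = 0.5 * w[:p - 1]                 # drop the bias coordinate
        return A_hat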

Theorem 3.8.

Let $\mathcal{D}(A,\theta)$ be an unknown $p$-variable Ising model with $\lambda(A,\theta)\leq\lambda$. There exists an efficient $\rho$-zCDP algorithm which outputs a weight matrix $\hat{A}\in\mathbb{R}^{p\times p}$ such that, with probability greater than $2/3$, $\max_{i,j\in[p]}\left|A_{i,j}-\hat{A}_{i,j}\right|\leq\alpha$, provided the number of i.i.d. samples satisfies

$$n=\Omega\left(\frac{\lambda^{2}\log(p)e^{12\lambda}}{\alpha^{4}}+\frac{\sqrt{p}\,\lambda^{2}\log^{2}(p)e^{9\lambda}}{\sqrt{\rho}\,\alpha^{3}}\right).$$
Proof.

We first prove that Algorithm 2 satisfies $\rho$-zCDP. In each iteration, the algorithm solves a private sparse logistic regression under $\frac{\rho}{p}$-zCDP; therefore, Algorithm 2 satisfies $\rho$-zCDP by composition (Lemma 2.14).

For the accuracy analysis, we start by looking at the first iteration ($i=1$) and showing that $\left|A_{1,j}-\hat{A}_{1,j}\right|\leq\alpha$ for all $j\in[p]$, with probability greater than $1-\frac{1}{10p}$.

Given a random sample $Z\sim\mathcal{D}(A,\theta)$, let $X=[Z_{-1},1]$ and $Y=Z_{1}$. From Lemma 3.7, $\Pr\left(Y=1|X=x\right)=\sigma\left(\langle w^{*},x\rangle\right)$, where $w^{*}=2[A_{1,2},\cdots,A_{1,p},\theta_{1}]$. We also note that $\|w^{*}\|_{1}\leq 2\lambda$, as a consequence of the width constraint of the Ising model.

Given $n$ i.i.d. samples $\{z^{m}\}_{m=1}^{n}$ drawn from the Ising model, let $x^{m}=[z_{-1}^{m},1]$ and $y^{m}=z_{1}^{m}$; each $(x^{m},y^{m})$ is then a realization of $(X,Y)$. Let $w^{priv}$ be the output of $\mathcal{A}_{PFW}\left(D,\mathcal{L},\frac{\rho}{p},\{w:\|w\|_{1}\leq 2\lambda\}\right)$, where $D=\{(x^{m},y^{m})\}_{m=1}^{n}$. By Theorem 3.6, when $n=O\left(\frac{\sqrt{p}\lambda^{2}\log^{2}(p)}{\sqrt{\rho}\gamma^{\frac{3}{2}}}+\frac{\lambda^{2}\log(p)}{\gamma^{2}}\right)$, with probability greater than $1-\frac{1}{10p}$, $\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{*};(X,Y))\right]\leq\gamma$.

We will use the following lemma from [WuSD19]. Roughly speaking, under the assumption that the samples are generated from an Ising model, any estimator $w^{priv}$ which achieves a small error in the loss $\mathcal{L}$ guarantees an accurate parameter recovery in $\ell_{\infty}$ distance.

Lemma 3.9.

Let $P$ be a distribution on $\{-1,1\}^{p-1}\times\{-1,1\}$. Given $u_{1}\in\mathbb{R}^{p-1}$ and $\theta_{1}\in\mathbb{R}$, suppose $\Pr\left(Y=1|X=x\right)=\sigma\left(\langle u_{1},x\rangle+\theta_{1}\right)$ for $(X,Y)\sim P$. If the marginal distribution of $P$ on $X$ is $\delta$-unbiased, and $\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{1},X\rangle+\theta_{1}\right)}\right)\right]-\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{2},X\rangle+\theta_{2}\right)}\right)\right]\leq\gamma$ for some $u_{2}\in\mathbb{R}^{p-1},\theta_{2}\in\mathbb{R}$, with $\gamma\leq\delta e^{-2\|u_{1}\|_{1}-2|\theta_{1}|-6}$, then $\left\lVert u_{1}-u_{2}\right\rVert_{\infty}=O(e^{\|u_{1}\|_{1}+|\theta_{1}|}\cdot\sqrt{\gamma/\delta})$.

By Lemma 2.6, Lemma 2.7, and Lemma 3.9, if $\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{*};(X,Y))\right]\leq O\left(\alpha^{2}e^{-6\lambda}\right)$, then $\left\lVert w^{priv}-w^{*}\right\rVert_{\infty}\leq\alpha$. Substituting $\gamma=\alpha^{2}e^{-6\lambda}$, we conclude that $\left|A_{1,j}-\hat{A}_{1,j}\right|\leq\alpha$ for all $j$ with probability greater than $1-\frac{1}{10p}$. A similar argument works for the other iterations, and non-overlapping parts of the matrix are recovered in different iterations. By a union bound over the $p$ iterations, with probability at least $\frac{2}{3}$, $\max_{i,j\in[p]}\left|A_{i,j}-\hat{A}_{i,j}\right|\leq\alpha$.

Finally, we note that the time complexity of the algorithm is $\operatorname{poly}(n,p)$, since the private Frank-Wolfe algorithm is time efficient by Corollary 3.4. ∎

3.3 Privately Learning Pairwise Graphical Models

Next, we study parameter learning for pairwise graphical models over a general alphabet. Given $n$ i.i.d. samples $\{z^{1},\cdots,z^{n}\}$ drawn from an unknown distribution $\mathcal{D}(\mathcal{W},\Theta)$, we want to design a $\rho$-zCDP estimator $\hat{\mathcal{W}}$ such that with probability at least $\frac{2}{3}$, $\left|W_{i,j}(u,v)-\widehat{W}_{i,j}(u,v)\right|\leq\alpha$ for all $i\neq j\in[p]$ and all $u,v\in[k]$. To facilitate our presentation, we assume that for all $i\neq j\in[p]$, every row (and column) vector of $W_{i,j}$ has zero mean. (This assumption is without loss of generality and is widely used in the literature [KlivansM17, WuSD19]: suppose the $a$-th row of $W_{i,j}$ is not centered, i.e., $\sum_{b}W_{i,j}(a,b)\neq 0$; we can define $W^{\prime}_{i,j}(a,b)=W_{i,j}(a,b)-\frac{1}{k}\sum_{b}W_{i,j}(a,b)$ and $\theta^{\prime}_{i}(a)=\theta_{i}(a)+\frac{1}{k}\sum_{b}W_{i,j}(a,b)$, and the probability distribution remains unchanged.)

Analogous to Lemma 3.7 for the Ising model, a pairwise graphical model has the following property, which can be utilized to recover its parameters.

Lemma 3.10 (Fact 2 of [WuSD19]).

Let $Z\sim\mathcal{D}(\mathcal{W},\Theta)$ with $Z\in[k]^{p}$. For any $i\in[p]$, any $u\neq v\in[k]$, and any $x\in[k]^{p-1}$,

$$\Pr\left(Z_{i}=u|Z_{i}\in\{u,v\},Z_{-i}=x\right)=\sigma\left(\sum_{j\neq i}\left(W_{i,j}(u,x_{j})-W_{i,j}(v,x_{j})\right)+\theta_{i}(u)-\theta_{i}(v)\right).$$

Now we introduce our algorithm. Without loss of generality, we consider estimating $W_{1,j}$ for all $j\in[p]$ as a running example. Fix a pair of values $(u,v)$, where $u,v\in[k]$ and $u\neq v$, and let $S_{u,v}$ be the set of samples with $Z_{1}\in\{u,v\}$. In order to utilize Lemma 3.10, we perform the following transformation on the samples in $S_{u,v}$: for the $m$-th sample $z^{m}$, let $y^{m}=1$ if $z_{1}^{m}=u$, and $y^{m}=-1$ otherwise, and let $x^{m}$ be the one-hot encoding of the vector $[z^{m}_{-1},1]$, where $\text{OneHotEncode}(s)$ maps $[k]^{p}$ to $\mathbb{R}^{p\times k}$ and its $i$-th row is the $t$-th standard basis vector when $s_{i}=t$. Then we define $w^{*}\in\mathbb{R}^{p\times k}$ as follows:

$$w^{*}(j,\cdot)=W_{1,j+1}(u,\cdot)-W_{1,j+1}(v,\cdot),\quad\forall j\in[p-1];$$
$$w^{*}(p,\cdot)=[\theta_{1}(u)-\theta_{1}(v),0,\cdots,0].$$

Lemma 3.10 implies that for every such sample, $\Pr\left(Y^{m}=1\right)=\sigma\left(\langle w^{*},X^{m}\rangle\right)$, where $\langle\cdot,\cdot\rangle$ denotes the entrywise inner product of matrices. According to the definition of the width of $\mathcal{D}(\mathcal{W},\Theta)$, $\|w^{*}\|_{1}\leq\lambda k$. Now we can apply the sparse logistic regression method, as in Algorithm 3, to the samples in $S_{u,v}$.
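A small sketch of this transformation (illustrative helper names; indices are 0-based, and the constant coordinate is encoded with a fixed symbol) might look as follows.

    import numpy as np

    def one_hot_encode(s, k):
        # Map s in [k]^p to a p x k matrix whose i-th row is the standard basis vector e_{s_i}.
        X = np.zeros((len(s), k))
        X[np.arange(len(s)), s] = 1.0
        return X

    def build_pairwise_samples(Z, i, u, v, k):
        # Keep samples with Z_i in {u, v}; label +1 if Z_i = u and -1 if Z_i = v;
        # features are the one-hot encoding of [Z_{-i}, constant].
        S = Z[np.isin(Z[:, i], [u, v])]
        y = np.where(S[:, i] == u, 1.0, -1.0)
        X = np.stack([one_hot_encode(np.append(np.delete(z, i), 0), k) for z in S])
        return X, y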

Suppose $w^{priv}_{u,v}$ is the output of the private Frank-Wolfe algorithm. We define $U_{u,v}\in\mathbb{R}^{p\times k}$ as follows: for all $b\in[k]$,

$$U_{u,v}(j,b)=w^{priv}_{u,v}(j,b)-\frac{1}{k}\sum_{a\in[k]}w^{priv}_{u,v}(j,a),\quad\forall j\in[p-1];$$
$$U_{u,v}(p,b)=w^{priv}_{u,v}(p,b)+\frac{1}{k}\sum_{j\in[p-1]}\sum_{a\in[k]}w^{priv}_{u,v}(j,a).\qquad(1)$$

$U_{u,v}$ can be seen as a “centered” version of $w^{priv}_{u,v}$ (for the first $p-1$ rows). It is not hard to see that $\langle U_{u,v},x\rangle=\langle w^{priv}_{u,v},x\rangle$ for any one-hot encoded $x$, so $U_{u,v}$ attains the same value of the sparse logistic regression objective as $w^{priv}_{u,v}$.

For now, assume that for all $j\in[p-1]$ and $b\in[k]$, $U_{u,v}(j,b)$ is a “good” approximation of $W_{1,j+1}(u,b)-W_{1,j+1}(v,b)$; we will justify this later. Averaging over $v\in[k]$, it can be shown that $\frac{1}{k}\sum_{v\in[k]}U_{u,v}(j,b)$ is then a “good” approximation of $W_{1,j+1}(u,b)$, for all $j\in[p-1]$ and $u,b\in[k]$, because of the centering assumption on $\mathcal{W}$, i.e., $\sum_{v\in[k]}W_{1,j+1}(v,b)=0$ for all $j\in[p-1]$ and $b\in[k]$. With these considerations in mind, we are able to introduce our algorithm.
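The centering step of Equation (1) and the averaging over $v$ that recovers $W_{1,j+1}(u,\cdot)$ can be sketched as follows (illustrative names; here U_by_v collects the centered matrices $U_{u,v}$ over $v$ for a fixed $u$).

    import numpy as np

    def center_rows(w_priv):
        # Equation (1): subtract the mean of each of the first p-1 rows and add the
        # total of those means to the last row; <U, x> = <w_priv, x> for one-hot x.
        U = w_priv.copy()
        row_means = U[:-1].mean(axis=1, keepdims=True)
        U[:-1] -= row_means
        U[-1] += row_means.sum()
        return U

    def recover_row(U_by_v, j, k):
        # W_hat_{1, j+1}(u, :) = (1/k) * sum_v U_{u,v}(j, :), using the zero-mean rows of W.
        return sum(U_by_v[v][j] for v in range(k)) / k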

Algorithm 3 Privately Learning Pairwise Graphical Model

Input: alphabet size $k$; $n$ i.i.d. samples $\{z^{1},\cdots,z^{n}\}$, where $z^{m}\in[k]^{p}$ for $m\in[n]$; an upper bound $\lambda$ on the width, $\lambda(\mathcal{W},\Theta)\leq\lambda$; privacy parameter $\rho$

1: for $i=1$ to $p$:
2:   for each pair $u\neq v\in[k]$:
3:     $S_{u,v}\leftarrow\{z^{m},m\in[n]:z_{i}^{m}\in\{u,v\}\}$
4:     For all $z^{m}\in S_{u,v}$: $x^{m}\leftarrow\text{OneHotEncode}([z^{m}_{-i},1])$; $y^{m}\leftarrow 1$ if $z_{i}^{m}=u$, and $y^{m}\leftarrow-1$ if $z_{i}^{m}=v$
5:     $w^{priv}_{u,v}\leftarrow\mathcal{A}_{PFW}(D,\mathcal{L},\rho^{\prime},\mathcal{C})$, where $\rho^{\prime}=\frac{\rho}{k^{2}p}$, $D=\{(x^{m},y^{m}):z^{m}\in S_{u,v}\}$, $\mathcal{L}(w;D)=\frac{1}{|S_{u,v}|}\sum_{m=1}^{|S_{u,v}|}\log\left(1+e^{-y^{m}\langle w,x^{m}\rangle}\right)$, and $\mathcal{C}=\{w:\|w\|_{1}\leq 2\lambda k\}$
6:     Define $U_{u,v}\in\mathbb{R}^{p\times k}$ by centering the first $p-1$ rows of $w^{priv}_{u,v}$, as in Equation (1)
7:   for $j\in[p]\backslash\{i\}$ and $u\in[k]$:
8:     $\widehat{W}_{i,j}(u,:)\leftarrow\frac{1}{k}\sum_{v\in[k]}U_{u,v}(\tilde{j},:)$, where $\tilde{j}=j$ if $j<i$ and $\tilde{j}=j-1$ if $j>i$

Output: $\widehat{W}_{i,j}\in\mathbb{R}^{k\times k}$ for all $i\neq j\in[p]$

The following theorem is the main result of this section. Its proof is structurally similar to that of Theorem 3.8.

Theorem 3.11.

Let $\mathcal{D}(\mathcal{W},\Theta)$ be an unknown $p$-variable pairwise graphical model with width $\lambda(\mathcal{W},\Theta)\leq\lambda$. There exists an efficient $\rho$-zCDP algorithm which outputs $\widehat{W}$ such that with probability greater than $2/3$, $\left|W_{i,j}(u,v)-\widehat{W}_{i,j}(u,v)\right|\leq\alpha$ for all $i\neq j\in[p]$ and all $u,v\in[k]$, provided the number of i.i.d. samples satisfies

$$n=\Omega\left(\frac{\lambda^{2}k^{5}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\,\lambda^{2}k^{5.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^{3}}\right).$$
Proof.

We consider estimating $W_{1,j}$ for all $j\in[p]$ as an example. Fixing one pair $(u,v)$, let $S_{u,v}$ be the set of samples whose first element is either $u$ or $v$, and let $n^{u,v}$ be the number of samples in $S_{u,v}$. We perform the following transformation on the samples in $S_{u,v}$: for a sample $Z$, let $Y=1$ if $Z_{1}=u$ and $Y=-1$ otherwise, and let $X$ be the one-hot encoding of the vector $[Z_{-1},1]$.

Suppose the underlying joint distribution of $X$ and $Y$ is $P$, i.e., $(X,Y)\sim P$. Then by Theorem 3.6, when $n^{u,v}=O\left(\frac{\lambda^{2}k^{2}\log^{2}(pk)}{\gamma^{2}}+\frac{\sqrt{p}\,\lambda^{2}k^{3}\log^{2}(pk)}{\gamma^{\frac{3}{2}}\sqrt{\rho}}\right)$, with probability greater than $1-\frac{1}{10pk^{2}}$,

$$\mathbb{E}_{(X,Y)\sim P}\left[\ell(U_{u,v};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]\leq\gamma.$$

The following lemma from [WuSD19] is analogous to Lemma 3.9 for the Ising model.

Lemma 3.12.

Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $[k]^{p-1}$. For $Z\sim\mathcal{D}$, let $X$ denote the one-hot encoding of $Z$. Let $u_{1},u_{2}\in\mathbb{R}^{(p-1)\times k}$ be two matrices with $\sum_{a}u_{1}(i,a)=0$ and $\sum_{a}u_{2}(i,a)=0$ for all $i\in[p-1]$. Let $P$ be a distribution such that, given $u_{1}$ and $\theta_{1}\in\mathbb{R}$, $\Pr\left(Y=1|X=x\right)=\sigma\left(\langle u_{1},x\rangle+\theta_{1}\right)$ for $(X,Y)\sim P$. Suppose $\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{1},X\rangle+\theta_{1}\right)}\right)\right]-\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{2},X\rangle+\theta_{2}\right)}\right)\right]\leq\gamma$ for some $\theta_{2}\in\mathbb{R}$, with $\gamma\leq\delta e^{-2\left\lVert u_{1}\right\rVert_{\infty,1}-2|\theta_{1}|-6}$. Then $\left\lVert u_{1}-u_{2}\right\rVert_{\infty}=O(e^{\left\lVert u_{1}\right\rVert_{\infty,1}+|\theta_{1}|}\cdot\sqrt{\gamma/\delta})$.

By Lemma 2.6, Lemma 2.7 and Lemma 3.12, if we substitute γ=e6λα2k\gamma=\frac{e^{-6\lambda}\alpha^{2}}{k}, when nu,v=O(λ2k4log(pk)eO(λ)α4+pλ2k4.5log2(pk)eO(λ)ρα3),n^{u,v}=O{\left({\frac{\lambda^{2}k^{4}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\lambda^{2}k^{4.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\alpha^{3}}}\right)},

|W1,j(u,b)W1,j(v,b)Uu,v(j,b)|α,j[p1],b[k].\displaystyle\left|W_{1,j}(u,b)-W_{1,j}(v,b)-U^{u,v}(j,b)\right|\leq\alpha,\forall j\in[p-1],\forall b\in[k]. (2)

By a union bound, Equation (2) holds for all (u,v)(u,v) pairs simultaneously with probability greater than 1110p1-\frac{1}{10p}. If we sum over v[k]v\in[k] and use the fact that j,b,v[k]W1,j(v,b)=0\forall j,b,\sum_{v\in[k]}W_{1,j}(v,b)=0, we have

|W1,j(u,b)1kv[k]Uu,v(j,b)|α,j[p1],u,b[k].\left|W_{1,j}(u,b)-\frac{1}{k}\sum_{v\in[k]}U^{u,v}(j,b)\right|\leq\alpha,\forall j\in[p-1],\forall u,b\in[k].
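In more detail, adopting the convention U^{u,u}\equiv 0 (under which Equation (2) holds trivially for v=u), the triangle inequality gives

\left|W_{1,j}(u,b)-\frac{1}{k}\sum_{v\in[k]}U^{u,v}(j,b)\right|=\frac{1}{k}\left|\sum_{v\in[k]}{\left({W_{1,j}(u,b)-W_{1,j}(v,b)-U^{u,v}(j,b)}\right)}\right|\leq\frac{1}{k}\sum_{v\in[k]}\alpha=\alpha,

where the equality uses the zero-sum property \sum_{v\in[k]}W_{1,j}(v,b)=0.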

Note that we need to guarantee that we obtain nu,vn^{u,v} samples for each pair (u,v)(u,v). Since 𝒟(𝒲,Θ){\cal D}({\cal W},\Theta) is δ\delta-unbiased, given Z𝒟(𝒲,Θ)Z\sim{\cal D}({\cal W},\Theta), for all uvu\neq v, Pr(ZSu,v)2δ\Pr{\left({Z\in S_{u,v}}\right)}\geq 2\delta. By Hoeffding’s inequality, when n=O(nu,vδ+log(pk2)δ2)n=O{\left({\frac{n^{u,v}}{\delta}+\frac{\log(pk^{2})}{\delta^{2}}}\right)}, with probability greater than 1110p1-\frac{1}{10p}, we have enough samples for all (u,v)(u,v) pairs simultaneously. Substituting δ=e6λk\delta=\frac{e^{-6\lambda}}{k}, we have

n=O(λ2k5log(pk)eO(λ)α4+pλ2k5.5log2(pk)eO(λ)ρα3).n=O{\left({\frac{\lambda^{2}k^{5}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\lambda^{2}k^{5.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\alpha^{3}}}\right)}.

The same argument holds for the other entries of 𝒲{\cal W}. We conclude the proof by a union bound over the pp iterations.

Finally, we note that the time complexity of the algorithm is poly(n,p)\operatorname*{poly}(n,p) since the private Frank-Wolfe algorithm is time efficient by Corollary 3.4. ∎

4 Privately Learning Binary tt-wise MRFs

Let 𝒟{\cal D} be a tt-wise MRF on {1,1}p\{1,-1\}^{p} with underlying dependency graph GG and factorization polynomial h(x)=ICt(G)hI(x)h(x)=\sum_{I\in C_{t}(G)}h_{I}(x). We assume that the width of 𝒟{\cal D} is bounded by λ\lambda, i.e., maxi[p]\normoneihλ\max_{i\in[p]}\normone{\partial_{i}h}\leq\lambda, where \normonehICt(G)|h¯(I)|\normone{h}\coloneqq\sum_{I\in C_{t}(G)}\left|\bar{h}(I)\right|. Similar to [KlivansM17], given nn i.i.d. samples {z1,,zn}\{z^{1},\cdots,z^{n}\} generated from an unknown distribution 𝒟{\cal D}, we consider the following two related learning objectives, under the constraint of ρ\rho-zCDP:

  1. find a multilinear polynomial uu such that with probability greater than 23\frac{2}{3}, \normonehuICt(G)|h¯(I)u¯(I)|α\normone{h-u}\coloneqq\sum_{I\in C_{t}(G)}\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha;

  2. find a multilinear polynomial uu such that with probability greater than 23\frac{2}{3}, for every maximal monomial II of hh, |h¯(I)u¯(I)|α\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha.

We note that our first objective can be viewed as parameter estimation in 1\ell_{1} distance, where only an average performance guarantee is provided. In the second objective, the algorithm recovers every maximal monomial, which can be viewed as parameter estimation in \ell_{\infty} distance. These two objectives are addressed in Sections 4.1 and 4.2, respectively.

4.1 Parameter Estimation in 1\ell_{1} Distance

The following property of MRFs, from [KlivansM17], plays a critical role in our algorithm. The proof is similar to that of Lemma 3.7.

Lemma 4.1 (Lemma 7.6 of [KlivansM17]).

Let 𝒟{\cal D} be a tt-wise MRF on {1,1}p\{1,-1\}^{p} with underlying dependency graph GG and factorization polynomial h(x)=ICt(G)hI(x)h(x)=\sum_{I\in C_{t}(G)}h_{I}(x), then

Pr(Zi=1|Zi=x)=σ(2ih(x)),i[p],x{1,1}[p]\i.\Pr{\left({Z_{i}=1|Z_{-i}=x}\right)}=\sigma(2\partial_{i}h(x)),\forall i\in[p],\forall x\in\{1,-1\}^{[p]\backslash i}.

Lemma 4.1 shows that, as for pairwise graphical models, it suffices to learn the parameters of a binary tt-wise MRF via sparse logistic regression.
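Concretely, writing \partial_{i}h(x)=\sum_{I\subset[p]\backslash i,|I|\leq t-1}\overline{\partial_{i}h}(I)\prod_{j\in I}x_{j}, Lemma 4.1 says that for y\in\{1,-1\},

\Pr\left(Z_{i}=y\,|\,Z_{-i}=x\right)=\sigma\left(2y\,\partial_{i}h(x)\right)=\sigma\left(y\langle w^{*},\tilde{x}\rangle\right),

where \tilde{x}=\left(\prod_{j\in I}x_{j}\right)_{I\subset[p]\backslash i,|I|\leq t-1} and w^{*}=2\,\overline{\partial_{i}h}. This is exactly the sparse logistic regression problem, over the monomial feature map, that Algorithm 4 below solves privately for each node ii.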

Algorithm 4: Privately learning a binary tt-wise MRF in 1\ell_{1} distance

Input: nn i.i.d. samples {z1,,zn}\{z^{1},\cdots,z^{n}\}, where zm{±1}pz^{m}\in\{\pm 1\}^{p} for m[n]m\in[n]; an upper bound λ\lambda on maxi[p]\normoneih\max_{i\in[p]}\normone{\partial_{i}h}; privacy parameter ρ\rho

For i=1i=1 to pp:

  1: m[n]\forall m\in[n], xm[jIzjm:I[p\i],|I|t1]x_{m}\leftarrow{\left[{\prod_{j\in I}z_{j}^{m}:I\subset[p\backslash i],\left|I\right|\leq t-1}\right]}, ymzimy_{m}\leftarrow z_{i}^{m}

  2: wpriv𝒜(D,,ρ,𝒞)w^{priv}\leftarrow{\cal A}(D,{\cal L},\rho^{\prime},{\cal C}), where D={(xm,ym)}m=1nD=\{{\left({x_{m},y_{m}}\right)}\}_{m=1}^{n}, (w;d)=log(1+eyw,x)\ell(w;d)=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}, 𝒞={\normonew2λ}{\cal C}=\{\normone{w}\leq 2\lambda\}, and ρ=ρp\rho^{\prime}=\frac{\rho}{p}

  3: For each I[p\i]I\subset[p\backslash i] with |I|t1\left|I\right|\leq t-1: if argmin(Ii)=i\arg\min(I\cup i)=i, set u¯(I{i})=12wpriv(I)\bar{u}(I\cup\{i\})=\frac{1}{2}w^{priv}(I)

Output: u¯(I):ICt(Kp)\bar{u}(I):I\in C_{t}(K_{p}), where KpK_{p} is the pp-dimensional complete graph
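To make the loop structure concrete, here is a schematic Python rendering of Algorithm 4. The private sparse logistic regression subroutine (the role played by 𝒜, e.g. a private Frank-Wolfe solver over the \ell_{1}-ball of radius 2λ) is taken as a black box private_logreg that we assume is available; the clique-indexed dictionary output is an implementation choice of this sketch, not of the pseudocode.

from itertools import combinations
import numpy as np

def learn_twise_mrf_l1(samples, t, lam, rho, private_logreg):
    """Schematic version of Algorithm 4.

    samples        : (n, p) numpy array with entries in {-1, +1}
    private_logreg : assumed rho'-zCDP sparse logistic regression black box
                     (plays the role of the subroutine A in the pseudocode);
                     assumed to return a weight vector aligned with the columns of X
    Returns a dict mapping cliques (frozensets of vertices) to estimates u_bar(I).
    """
    n, p = samples.shape
    u = {}
    for i in range(p):
        others = [j for j in range(p) if j != i]
        # all monomials prod_{j in I} z_j with I subset of [p]\{i}, |I| <= t-1
        index = [frozenset(I) for size in range(t)
                 for I in combinations(others, size)]
        X = np.array([[np.prod(z[list(I)]) for I in index] for z in samples])
        y = samples[:, i]
        # each of the p regressions receives budget rho/p, so the loop is rho-zCDP
        w_priv = private_logreg(X, y, l1_radius=2 * lam, rho=rho / p)
        for coeff, I in zip(w_priv, index):
            if all(i < j for j in I):   # keep the clique I U {i} only when i = argmin
                u[I | frozenset({i})] = 0.5 * coeff
    return u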

Theorem 4.2.

There exists a ρ\rho-zCDP algorithm which, with probability at least 2/32/3, finds a multilinear polynomial uu such that \normonehuα,\normone{h-u}\leq\alpha, given nn i.i.d. samples Z1,,Zn𝒟Z^{1},\cdots,Z^{n}\sim{\cal D}, where

n=O((2t)O(t)eO(λt)p4tlog(p)α4+(2t)O(t)eO(λt)p3t+12log2(p)ρα3).n=O{\left({\frac{(2t)^{O(t)}e^{O(\lambda t)}\cdot p^{4t}\cdot\log(p)}{\alpha^{4}}+\frac{(2t)^{O(t)}e^{O(\lambda t)}\cdot p^{3t+\frac{1}{2}}\cdot\log^{2}(p)}{\sqrt{\rho}\alpha^{3}}}\right)}.
Proof.

Similar to the previous proof, we start by fixing i=1i=1. Given a random sample Z𝒟Z\sim{\cal D}, let X=[jIZj:I[p]\1,|I|t1]X={\left[{\prod_{j\in I}Z_{j}:I\subset[p]\backslash 1,\left|I\right|\leq t-1}\right]} and Y=Z1Y=Z_{1}. According to Lemma 4.1, we know that Pr(Y=1|X)=σ(w,X)\Pr{\left({Y=1|X}\right)}=\sigma{\left({\langle w^{*},X\rangle}\right)}, where w=(21h¯(I):I[p]\1,|I|t1)w^{*}={\left({2\cdot\overline{\partial_{1}h}(I):I\subset[p]\backslash 1,\left|I\right|\leq t-1}\right)}. Furthermore, \normonew2λ\normone{w^{*}}\leq 2\lambda by the width constraint. Now, given nn i.i.d. samples {zm}m=1n\{z^{m}\}_{m=1}^{n} drawn from 𝒟{\cal D}, it is easy to check that for any given zmz^{m}, its corresponding (xm,ym)(x^{m},y^{m}) is one realization of (X,Y)(X,Y). Let wprivw^{priv} be the output of 𝒜(D,,ρp,{w:\normonew2λ}){\cal A}{\left({D,{\cal L},\frac{\rho}{p},\{w:\normone{w}\leq 2\lambda\}}\right)}, where D={(xm,ym)}m=1nD=\{(x^{m},y^{m})\}_{m=1}^{n} and (w;(x,y))=log(1+eyw,x)\ell(w;(x,y))=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}. By Lemma 3.6, 𝔼Z𝒟[(wpriv;(X,Y))]𝔼Z𝒟[(w;(X,Y))]γ\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{*};(X,Y))\right]\leq\gamma with probability greater than 1110p1-\frac{1}{10p}, assuming n=Ω(pλ2log2(p)ργ32+λ2log(p)γ2)n=\Omega{\left({\frac{\sqrt{p}\lambda^{2}\log^{2}(p)}{\sqrt{\rho}\gamma^{\frac{3}{2}}}+\frac{\lambda^{2}\log(p)}{\gamma^{2}}}\right)}.

Now we need the following lemma from [KlivansM17], which is analogous to Lemma 3.7 for the Ising model.

Lemma 4.3 (Lemma 6.4 of [KlivansM17]).

Let PP be a distribution on {1,1}p1×{1,1}\{-1,1\}^{p-1}\times\{-1,1\}. Suppose that, for a multilinear polynomial u1u_{1} in p1p-1 variables, Pr(Y=1|X=x)=σ(u1(x))\Pr{\left({Y=1|X=x}\right)}=\sigma{\left({u_{1}(x)}\right)} for (X,Y)P(X,Y)\sim P. Suppose further that the marginal distribution of PP on XX is δ\delta-unbiased, and that 𝔼(X,Y)P[log(1+eY(u1(X)))]𝔼(X,Y)P[log(1+eY(u2(X)))]γ\mathbb{E}_{(X,Y)\sim P}\left[\log{\left({1+e^{-Y{\left({u_{1}(X)}\right)}}}\right)}\right]-\mathbb{E}_{(X,Y)\sim P}\left[\log{\left({1+e^{-Y{\left({u_{2}(X)}\right)}}}\right)}\right]\leq\gamma for another multilinear polynomial u2u_{2}, where γδte2\normoneu16\gamma\leq\delta^{t}e^{-2\normone{u_{1}}-6}. Then \normoneu1u2=O((2t)te\normoneu1γ/δt(pt)).\normone{u_{1}-u_{2}}=O{\left({(2t)^{t}e^{\normone{u_{1}}}\cdot\sqrt{\gamma/\delta^{t}}\cdot{p\choose t}}\right)}.

By substituting γ=eO(λt)(2t)O(t)p3tα2\gamma=e^{-O(\lambda t)}\cdot(2t)^{-O(t)}\cdot p^{-3t}\cdot\alpha^{2}, we have that with probability greater than 1110p1-\frac{1}{10p}, I:argminI=1|u¯(I)h¯(I)|αp\sum_{I:\arg\min I=1}\left|\bar{u}(I)-\bar{h}(I)\right|\leq\frac{\alpha}{p}. We note that iteration ii recovers exactly the coefficients of the monomials whose minimum element is ii, so every monomial is handled exactly once. Therefore, by a union bound over the pp iterations, we prove the desired result. ∎

4.2 Parameter Estimation in \ell_{\infty} Distance

In this section, we introduce a slightly modified version of the algorithm in the last section.

Algorithm 5: Privately learning a binary tt-wise MRF in \ell_{\infty} distance

Input: n=n1+n2n=n_{1}+n_{2} i.i.d. samples {z1,,zn}\{z^{1},\cdots,z^{n}\}, where zm{±1}pz^{m}\in\{\pm 1\}^{p} for m[n]m\in[n]; an upper bound λ\lambda on maxi[p]\normoneih\max_{i\in[p]}\normone{\partial_{i}h}; privacy parameter ρ\rho

  1: For each I[p]I\subset[p] with |I|t\left|I\right|\leq t, let Q(I)1n2m=n1+1n1+n2jIzjmQ(I)\coloneqq\frac{1}{n_{2}}\sum_{m=n_{1}+1}^{n_{1}+n_{2}}\prod_{j\in I}z_{j}^{m}

  2: Compute Q^(I)\hat{Q}(I), an estimate of Q(I)Q(I), through a ρ/2\rho/2-zCDP query release algorithm (PMW [HardtR10] or sepFEM [VietriTBSW20])

  3: For i=1i=1 to pp:

    4: m[n1]\forall m\in[n_{1}], xm[jIzjm:I[p\i],|I|t1]x_{m}\leftarrow{\left[{\prod_{j\in I}z^{m}_{j}:I\subset[p\backslash i],\left|I\right|\leq t-1}\right]}, ymzimy_{m}\leftarrow z_{i}^{m}

    5: wpriv𝒜(D,,ρ,𝒞)w^{priv}\leftarrow{\cal A}(D,{\cal L},\rho^{\prime},{\cal C}), where D={(xm,ym)}m=1n1D=\{{\left({x_{m},y_{m}}\right)}\}_{m=1}^{n_{1}}, (w;d)=log(1+eyw,x)\ell(w;d)=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}, 𝒞={\normonew2λ}{\cal C}=\{\normone{w}\leq 2\lambda\}, and ρ=ρ2p\rho^{\prime}=\frac{\rho}{2p}

    6: Define a polynomial vi:p1v_{i}:\mathbb{R}^{p-1}\rightarrow\mathbb{R} by setting vi¯(I)=12wpriv(I)\bar{v_{i}}(I)=\frac{1}{2}w^{priv}(I) for all I[p\i]I\subset[p\backslash i]

    7: For each I[p\i]I\subset[p\backslash i] with |I|t1\left|I\right|\leq t-1: if argmin(Ii)=i\arg\min(I\cup i)=i, set u¯(I{i})=I[p]Ivi¯(I)Q^(I)\bar{u}(I\cup\{i\})=\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{i}}(I^{\prime})\cdot\hat{Q}(I^{\prime})

Output: u¯(I):ICt(Kp)\bar{u}(I):I\in C_{t}(K_{p}), where KpK_{p} is the pp-dimensional complete graph
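For step 7, it helps to spell out the plug-in computation. The sketch below (same schematic Python style as before) assumes the usual convention for discrete derivatives of a multilinear polynomial, namely \overline{\partial_{I}v}(I')=\bar{v}(I\cup I') for I' disjoint from I, and assumes Q̂ is available as a dictionary over subsets of [p] that contains the empty set with value 1; both conventions are ours rather than part of the pseudocode.

def derivative_coeffs(v_coeffs, I):
    """Coefficients of the iterated discrete derivative d_I v of a multilinear
    polynomial v, with v stored as {frozenset of variables: coefficient}."""
    I = frozenset(I)
    return {J - I: c for J, c in v_coeffs.items() if I <= J}

def plug_in_estimate(v_coeffs, I, Q_hat):
    """Step 7 of Algorithm 5: sum over I' of (d_I v_i)-bar(I') * Q_hat(I')."""
    return sum(c * Q_hat.get(J, 0.0)
               for J, c in derivative_coeffs(v_coeffs, I).items())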

We first show that if the estimates Q^\hat{Q} for the parity queries QQ are sufficiently accurate, Algorithm 5 solves the \ell_{\infty} estimation problem, as long as the sample size n1n_{1} is large enough.

Lemma 4.4.

Suppose that the estimates Q^\hat{Q} satisfy |Q^(I)Q(I)|α/(8λ)|\hat{Q}(I)-Q(I)|\leq\alpha/(8\lambda) for all I[p]I\subset[p] with |I|t|I|\leq t, and that n2=Ω(λ2tlog(p)/α2)n_{2}=\Omega(\lambda^{2}t\log(p)/\alpha^{2}). Then with probability at least 3/43/4, Algorithm 5 outputs a multilinear polynomial uu such that for every maximal monomial II of hh, |h¯(I)u¯(I)|α,\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha, given nn i.i.d. samples Z1,,Zn𝒟Z^{1},\cdots,Z^{n}\sim{\cal D}, as long as

n1=Ω(e5λtplog2(p)ρα92+e6λtlog(p)α6).n_{1}=\Omega{\left({\frac{e^{5\lambda t}\cdot\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{e^{6\lambda t}\cdot\log(p)}{\alpha^{6}}}\right)}.
Proof.

We will condition on the event that Q^\hat{Q} is a “good” estimate of QQ: |Q^(I)Q(I)|α/(8λ)|\hat{Q}(I)-Q(I)|\leq\alpha/(8\lambda) for all I[p]I\subset[p] such that |I|t|I|\leq t. Let us fix i=1i=1. Let X=[jIZj:I[p]\{1},|I|t1]X={\left[{\prod_{j\in I}Z_{j}:I\subset[p]\backslash\{1\},\left|I\right|\leq t-1}\right]}, Y=Z1Y=Z_{1}, and we know that Pr(Y=1|X)=σ(w,X)\Pr{\left({Y=1|X}\right)}=\sigma{\left({\langle w^{*},X\rangle}\right)}, where w=(21h¯(I):I[p]\1,|I|t1)w^{*}={\left({2\cdot\overline{\partial_{1}h}(I):I\subset[p]\backslash 1,\left|I\right|\leq t-1}\right)}. Now given n1n_{1} i.i.d. samples {zm}m=1n1\{z^{m}\}_{m=1}^{n_{1}} drawn from 𝒟{\cal D}, let wprivw^{priv} be the output of 𝒜(D,,ρ2p,{w:\normonew2λ}){\cal A}{\left({D,{\cal L},\frac{\rho}{2p},\{w:\normone{w}\leq 2\lambda\}}\right)}, where D={(xm,ym)}m=1n1D=\{(x^{m},y^{m})\}_{m=1}^{n_{1}} and (w;(x,y))=log(1+eyw,x)\ell(w;(x,y))=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}. Similarly, with probability at least 1110p1-\frac{1}{10p},

𝔼Z𝒟[(wpriv;(X,Y))]𝔼Z𝒟[(w;(X,Y))]γ\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{*};(X,Y))\right]\leq\gamma

as long as n1=Ω(pλ2log2(p)ργ32+λ2log(p)γ2)n_{1}=\Omega{\left({\frac{\sqrt{p}\lambda^{2}\log^{2}(p)}{\sqrt{\rho}\gamma^{\frac{3}{2}}}+\frac{\lambda^{2}\log(p)}{\gamma^{2}}}\right)}.

Now we utilize Lemma 6.4 from [KlivansM17], which states that if 𝔼Z𝒟[(wpriv;(X,Y))]𝔼Z𝒟[(w;(X,Y))]γ\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{*};(X,Y))\right]\leq\gamma, then for a random sample XX and any maximal monomial I[p]\{1}I\subset[p]\backslash\{1\} of 1h\partial_{1}h,

Pr(|1h¯(I)Iv1(X)|α4)<O(γe3λtα2).\Pr{\left({\left|\overline{\partial_{1}h}(I)-\partial_{I}v_{1}(X)\right|\geq\frac{\alpha}{4}}\right)}<O{\left({\frac{\gamma\cdot e^{3\lambda t}}{\alpha^{2}}}\right)}.

Substituting γ=e3λtα38λ\gamma=\frac{e^{-3\lambda t}\cdot\alpha^{3}}{8\lambda}, we have Pr(|1h¯(I)Iv1(X)|α4)<α8λ\Pr{\left({\left|\overline{\partial_{1}h}(I)-\partial_{I}v_{1}(X)\right|\geq\frac{\alpha}{4}}\right)}<\frac{\alpha}{8\lambda}, as long as n1=Ω(pe5λtlog2(p)ρα92+e6λtlog(p)α6)n_{1}=\Omega{\left({\frac{\sqrt{p}e^{5\lambda t}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{e^{6\lambda t}\log(p)}{\alpha^{6}}}\right)}. Accordingly, for any maximal monomial II, |𝔼[Iv1(X)]1h¯(I)|𝔼[|Iv1(X)1h¯(I)|]α4+2λα8λ=α2\left|\mathbb{E}\left[\partial_{I}v_{1}(X)\right]-\overline{\partial_{1}h}(I)\right|\leq\mathbb{E}\left[\left|\partial_{I}v_{1}(X)-\overline{\partial_{1}h}(I)\right|\right]\leq\frac{\alpha}{4}+2\lambda\cdot\frac{\alpha}{8\lambda}=\frac{\alpha}{2}. By Hoeffding's inequality, given n2=Ω(λ2tlogpα2)n_{2}=\Omega{\left({\frac{\lambda^{2}t\log{p}}{\alpha^{2}}}\right)}, for each maximal monomial II, with probability greater than 11pt1-\frac{1}{p^{t}}, |1n2m=1n2Iv1(Xm)𝔼[Iv1(X)]|α4\left|\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})-\mathbb{E}\left[\partial_{I}v_{1}(X)\right]\right|\leq\frac{\alpha}{4}. Note that 1n2m=1n2Iv1(Xm)=I[p]Iv1¯(I)Q(I)\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})=\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot Q(I^{\prime}) and |Q(I)Q^(I)|α8λ\left|Q(I^{\prime})-\hat{Q}(I^{\prime})\right|\leq\frac{\alpha}{8\lambda} for every II^{\prime}, so |1n2m=1n2Iv1(Xm)I[p]Iv1¯(I)Q^(I)|\normoneIv1α8λα8\left|\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})-\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot\hat{Q}(I^{\prime})\right|\leq\normone{\partial_{I}v_{1}}\cdot\frac{\alpha}{8\lambda}\leq\frac{\alpha}{8}, since \normoneIv1\normonev1λ\normone{\partial_{I}v_{1}}\leq\normone{v_{1}}\leq\lambda. Therefore,

|I[p]Iv1¯(I)Q^(I)1h¯(I)|\displaystyle\left|\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot\hat{Q}(I^{\prime})-\overline{\partial_{1}h}(I)\right|
\displaystyle\leq |I[p]Iv1¯(I)Q^(I)1n2m=1n2Iv1(Xm)|+|1n2m=1n2Iv1(Xm)𝔼[Iv1(X)]|\displaystyle\left|\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot\hat{Q}(I^{\prime})-\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})\right|+\left|\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})-\mathbb{E}\left[\partial_{I}v_{1}(X)\right]\right|
+\displaystyle+ |𝔼[Iv1(X)]1h¯(I)|\displaystyle\left|\mathbb{E}\left[\partial_{I}v_{1}(X)\right]-\overline{\partial_{1}h}(I)\right|
\displaystyle\leq α8+α4+α2=7α8.\displaystyle\frac{\alpha}{8}+\frac{\alpha}{4}+\frac{\alpha}{2}=\frac{7\alpha}{8}.

Finally, by a union bound over pp iterations and all the maximal monomials, we prove the desired results.∎

We now consider two private algorithms for releasing the parity queries. The first algorithm is called Private Multiplicative Weights (PMW) [HardtR10], which provides a better accuracy guarantee but runs in time exponential in the dimension pp. The following lemma can be viewed as a zCDP version of Theorem 4.3 in [Vadhan17], by noting that during the analysis, every iteration satisfies ε0\varepsilon_{0}-DP, which in turn satisfies 12ε02\frac{1}{2}\varepsilon_{0}^{2}-zCDP, and by replacing the strong composition theorem of (ε,δ)(\varepsilon,\delta)-DP with the composition theorem of zCDP (Lemma 2.14).

Lemma 4.5 (Sample complexity of PMW, modification of Theorem 4.3 of [Vadhan17]).

The PMW algorithm satisfies ρ\rho-zCDP and releases Q^\hat{Q} such that with probability greater than 1920\frac{19}{20}, for all I[p]I\subset[p] with |I|t\left|I\right|\leq t, |Q^(I)Q(I)|α8λ\left|\hat{Q}(I)-Q(I)\right|\leq\frac{\alpha}{8\lambda} as long as the size of the data set

n2=Ω(tλ2plogpρα2).n_{2}=\Omega{\left({\frac{t\lambda^{2}\cdot\sqrt{p}\log{p}}{\sqrt{\rho}\alpha^{2}}}\right)}.

The second algorithm sepFEM (Separator-Follow-the-perturbed-leader with exponential mechanism) has slightly worse sample complexity, but runs in polynomial time when it has access to an optimization oracle 𝒪\mathcal{O} that does the following: given as input a weighted dataset (I1,w1),,(Im,wm)2[p]×(I_{1},w_{1}),\ldots,(I_{m},w_{m})\in 2^{[p]}\times\mathbb{R}, find x{±1}px\in\{\pm 1\}^{p} that attains

maxx{±1}pi=1mwijIixj.\max_{x\in\{\pm 1\}^{p}}\sum_{i=1}^{m}w_{i}\prod_{j\in I_{i}}x_{j}.

The oracle 𝒪\mathcal{O} essentially solves cost-sensitive classification problems over the set of parity functions [ZLA03], and it can be implemented with an integer program solver [VietriTBSW20, GaboardiAHRW14].
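For reference, the oracle's specification can also be written out directly in code. The brute-force search below is only meant to make the objective concrete; it runs in time exponential in pp, whereas [VietriTBSW20, GaboardiAHRW14] implement 𝒪 with an integer program solver.

from itertools import product
from math import prod

def parity_oracle_bruteforce(weighted_parities, p):
    """weighted_parities: list of pairs (I, w) with I a set of indices in range(p).
    Returns x in {-1,+1}^p maximizing sum_i w_i * prod_{j in I_i} x_j."""
    best_x, best_val = None, float("-inf")
    for x in product((-1, 1), repeat=p):
        val = sum(w * prod(x[j] for j in I) for I, w in weighted_parities)
        if val > best_val:
            best_x, best_val = x, val
    return best_x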

Lemma 4.6 (Sample complexity of sepFEM, [VietriTBSW20]).

The sepFEM algorithm satisfies ρ\rho-zCDP and releases Q^\hat{Q} such that with probability greater than 1920\frac{19}{20}, for all I[p]I\subset[p] with |I|t\left|I\right|\leq t, |Q^(I)Q(I)|α8λ\left|\hat{Q}(I)-Q(I)\right|\leq\frac{\alpha}{8\lambda} as long as the size of the data set

n2=Ω(tλ2p5/4logpρα2)n_{2}=\Omega{\left({\frac{t\lambda^{2}\cdot{p^{5/4}}\log{p}}{\sqrt{\rho}\alpha^{2}}}\right)}

The algorithm runs in polynomial time given access to the optimization oracle 𝒪\mathcal{O} defined above.

Now we can combine Lemmas 4.4, 4.5, and 4.6 to state the formal guarantee of Algorithm 5.

Theorem 4.7.

Algorithm 5 is a ρ\rho-zCDP algorithm which, with probability at least 2/32/3, finds a multilinear polynomial uu such that for every maximal monomial II of hh, |h¯(I)u¯(I)|α,\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha, given nn i.i.d. samples Z1,,Zn𝒟Z^{1},\cdots,Z^{n}\sim{\cal D}, and

  1. if it uses PMW for releasing Q^\hat{Q}, it has a sample complexity of

     n=O(e5λtplog2(p)ρα92+tλ2plogpρα2+e6λtlog(p)α6)n=O{\left({\frac{e^{5\lambda t}\cdot\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}\cdot\sqrt{p}\log{p}}{\sqrt{\rho}\alpha^{2}}+\frac{e^{6\lambda t}\cdot\log(p)}{\alpha^{6}}}\right)}

     and a runtime complexity that is exponential in pp;

  2. if it uses sepFEM for releasing Q^\hat{Q}, it has a sample complexity of

     n=O~(e5λtplog2(p)ρα92+tλ2p5/4logpρα2+e6λtlog(p)α6)n=\tilde{O}{\left({\frac{e^{5\lambda t}\cdot\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}\cdot p^{5/4}\log{p}}{\sqrt{\rho}\alpha^{2}}+\frac{e^{6\lambda t}\cdot\log(p)}{\alpha^{6}}}\right)}

     and runs in polynomial time whenever t=O(1)t=O(1).

5 Lower Bounds for Parameter Learning

The lower bound for parameter estimation (in \ell_{\infty} distance) is obtained via a reduction to private mean estimation for binary product distributions.

Theorem 5.1.

Suppose 𝒜{\cal A} is an (ε,δ)(\varepsilon,\delta)-differentially private algorithm that takes nn i.i.d. samples Z1,,ZnZ^{1},\ldots,Z^{n} drawn from any unknown pp-variable Ising model 𝒟(A,θ){\cal D}(A,\theta) and outputs A^\hat{A} such that 𝔼[maxi,j[p]|Ai,jA^i,j|]α1/50.\mathbb{E}\left[\max_{i,j\in[p]}|A_{i,j}-\hat{A}_{i,j}|\right]\leq\alpha\leq 1/50. Then n=Ω(pαε)n=\Omega{\left({\frac{\sqrt{p}}{\alpha\varepsilon}}\right)}.

Proof.

Consider an Ising model 𝒟(A,0){\cal D}(A,0) with Ap×pA\in\mathbb{R}^{p\times p} defined as follows: for i[p2],A2i1,2i=A2i,2i1=ηi[ln(2),ln(2)]i\in[\frac{p}{2}],A_{2i-1,2i}=A_{2i,2i-1}=\eta_{i}\in[-\ln(2),\ln(2)], and All=0A_{ll^{\prime}}=0 for all other pairs (l,l)(l,l^{\prime}). This construction divides the pp nodes into p2\frac{p}{2} pairs, where there is no correlation between nodes belonging to different pairs. It follows that

Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=12eηieηi+1,\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=-1}\right)}=\frac{1}{2}\frac{e^{\eta_{i}}}{e^{\eta_{i}}+1},
Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=121eηi+1.\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=-1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=1}\right)}=\frac{1}{2}\frac{1}{e^{\eta_{i}}+1}.

For each observation ZZ, we obtain an observation X{±1}p/2X\in\{\pm 1\}^{p/2} such that Xi=Z2i1Z2iX_{i}=Z_{2i-1}Z_{2i}. Then each observation XX is distributed according to a product distribution over {±1}p/2\{\pm 1\}^{p/2} such that the mean of coordinate ii is (eηi1)/(eηi+1)[1/3,1/3](e^{\eta_{i}}-1)/(e^{\eta_{i}}+1)\in[-1/3,1/3].
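Indeed, for each i\in[p/2],

\mathbb{E}[X_{i}]=\Pr\left(Z_{2i-1}=Z_{2i}\right)-\Pr\left(Z_{2i-1}\neq Z_{2i}\right)=\frac{e^{\eta_{i}}}{e^{\eta_{i}}+1}-\frac{1}{e^{\eta_{i}}+1}=\frac{e^{\eta_{i}}-1}{e^{\eta_{i}}+1},

which lies in [-1/3,1/3] precisely because \eta_{i}\in[-\ln(2),\ln(2)].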

Suppose that an (ε,δ)(\varepsilon,\delta)-differentially private algorithm takes nn observations drawn from any such Ising model distribution and outputs a matrix A^\hat{A} such that E[maxi,j[p]|Ai,jA^i,j|]α\mbox{\bf E}\left[\max_{i,j\in[p]}|A_{i,j}-\hat{A}_{i,j}|\right]\leq\alpha. Let η^i=min{max{A^2i1,2i,ln(2)},ln(2)}\hat{\eta}_{i}=\min\{\max\{\hat{A}_{2i-1,2i},-\ln(2)\},\ln(2)\} be the value of A^2i1,2i\hat{A}_{2i-1,2i} rounded into the range [ln(2),ln(2)][-\ln(2),\ln(2)], and so |ηiη^i|α|\eta_{i}-\hat{\eta}_{i}|\leq\alpha. It follows that

|eηi1eηi+1eη^i1eη^i+1|\displaystyle\left|\frac{e^{\eta_{i}}-1}{e^{\eta_{i}}+1}-\frac{e^{\hat{\eta}_{i}}-1}{e^{\hat{\eta}_{i}}+1}\right| =2|eηieη^i(eηi+1)(eη^i+1)|\displaystyle=2\left|\frac{e^{\eta_{i}}-e^{\hat{\eta}_{i}}}{(e^{\eta_{i}}+1)(e^{\hat{\eta}_{i}}+1)}\right|
<2|eηieη^i|4(e|ηiη^i|1)8|ηiη^i|\displaystyle<2\left|{e^{\eta_{i}}-e^{\hat{\eta}_{i}}}\right|\leq 4\left(e^{|\eta_{i}-\hat{\eta}_{i}|}-1\right)\leq 8|\eta_{i}-\hat{\eta}_{i}|

where the last step follows from the fact that ea1+2ae^{a}\leq 1+2a for any a[0,1]a\in[0,1]. Thus, such private algorithm also can estimate the mean of the product distribution accurately:

E[i=1p/2|eηi1eηi+1eη^i1eη^i+1|2]32pα2\mbox{\bf E}\left[\sum_{i=1}^{p/2}\left|\frac{e^{\eta_{i}}-1}{e^{\eta_{i}}+1}-\frac{e^{\hat{\eta}_{i}}-1}{e^{\hat{\eta}_{i}}+1}\right|^{2}\right]\leq 32p\alpha^{2}

Now we will use the following sample complexity lower bound on private mean estimation on product distributions.

Lemma 5.2 (Lemma 6.2 of [KamathLSU18]).

If M:{±1}n×d[1/3,1/3]dM\colon\{\pm 1\}^{n\times d}\rightarrow[-1/3,1/3]^{d} is (ε,3/(64n))(\varepsilon,3/(64n))-differentially private, and for every product distribution PP over {±1}d\{\pm 1\}^{d} such that the mean of each coordinate μj\mu_{j} satisfies 1/3μj1/3-1/3\leq\mu_{j}\leq 1/3,

EXPn[M(X)μ22]γ2d54,\mbox{\bf E}_{X\sim P^{n}}\left[\|M(X)-\mu\|_{2}^{2}\right]\leq\gamma^{2}\leq\frac{d}{54},

then nd/(72γε)n\geq d/(72\gamma\varepsilon).

Then our stated bound follows by instantiating γ2=32pα2\gamma^{2}=32p\alpha^{2} and d=p/2d=p/2 in Lemma 5.2: n\geq\frac{d}{72\gamma\varepsilon}=\frac{p/2}{72\sqrt{32p}\,\alpha\varepsilon}=\Omega\left(\frac{\sqrt{p}}{\alpha\varepsilon}\right).∎

6 Structure Learning of Graphical Models

In this section, we will give an (ε,δ)(\varepsilon,\delta)-differentially private algorithm for learning the structure of a Markov Random Field. The dependence on the dimension pp will be only logarithmic, in contrast to the complexity of privately learning the parameters. As we have shown in Section 5, the dependence for parameter learning is necessarily polynomial in pp, even under approximate differential privacy. Furthermore, as we will show in Section 7, if we wish to learn the structure of an MRF under more restrictive notions of privacy (such as pure or concentrated), the complexity also becomes polynomial in pp. Thus, in very high-dimensional settings, learning the structure of the MRF under approximate differential privacy is essentially the only notion of private learnability which is tractable.

The following lemma is immediate from stability-based mode arguments (see, e.g., Proposition 3.4 of [Vadhan17]).

Lemma 6.1.

Suppose there exists a (non-private) algorithm which takes X=(X1,,Xn)X=(X^{1},\dots,X^{n}) sampled i.i.d. from some distribution 𝒟{\cal D}, and outputs some fixed value YY (which may depend on 𝒟{\cal D}) with probability at least 2/32/3. Then there exists an (ε,δ)(\varepsilon,\delta)-differentially private algorithm which takes O(nlog(1/δ)ε)O\left(\frac{n\log(1/\delta)}{\varepsilon}\right) samples and outputs YY with probability at least 1δ1-\delta.
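As an illustration of how such a reduction can be realized, the sketch below runs the non-private learner on disjoint batches and releases the most common output only if it wins by a comfortable noisy margin. The helper name, the Laplace noise scale, and the threshold are schematic placeholders of ours; the precise calibration follows the stability framework referenced above, and the non-private learner is assumed to return a hashable object (e.g., a frozenset of edges).

import numpy as np
from collections import Counter

def stable_mode_release(samples, nonprivate_alg, batch_size, eps, delta, rng):
    """Run nonprivate_alg on disjoint batches and release the majority output
    only when its (noisy) lead over the runner-up is large enough."""
    k = len(samples) // batch_size
    votes = Counter(
        nonprivate_alg(samples[j * batch_size:(j + 1) * batch_size])
        for j in range(k)
    )
    (top, top_count), *rest = votes.most_common()
    runner_up = rest[0][1] if rest else 0
    # changing one sample changes one batch, so the count gap moves by at most 2
    noisy_gap = top_count - runner_up + rng.laplace(scale=2.0 / eps)
    if noisy_gap > 2.0 * np.log(1.0 / delta) / eps + 2.0:
        return top
    return None  # no sufficiently stable answer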

We can now directly import the following theorem from [WuSD19].

Theorem 6.2 ([WuSD19]).

There exists an algorithm which, with probability at least 2/32/3, learns the structure of a pairwise graphical model. It requires n=O(λ2k4e14λlog(pk)η4)n=O\left(\frac{\lambda^{2}k^{4}e^{14\lambda}\log(pk)}{\eta^{4}}\right) samples.

This gives us the following private learning result as a corollary.

Corollary 6.3.

There exists an (ε,δ)(\varepsilon,\delta)-differentially private algorithm which, with probability at least 2/32/3, learns the structure of a pairwise graphical model. It requires n=O(λ2k4e14λlog(pk)log(1/δ)εη4)n=O\left(\frac{\lambda^{2}k^{4}e^{14\lambda}\log(pk)\log(1/\delta)}{\varepsilon\eta^{4}}\right) samples.

For binary MRFs of higher-order, we instead import the following theorem from [KlivansM17]:

Theorem 6.4 ([KlivansM17]).

There exists an algorithm which, with probability at least 2/32/3, learns the structure of a binary tt-wise MRF. It requires n=O(eO(λt)log(pη)η4)n=O\left(\frac{e^{O{\left({\lambda t}\right)}}\log(\frac{p}{\eta})}{\eta^{4}}\right) samples.

This gives us the following private learning result as a corollary.

Corollary 6.5.

There exists an (ε,δ)(\varepsilon,\delta)-differentially private algorithm which, with probability at least 2/32/3, learns the structure of a binary tt-wise MRF. It requires

n=O(eO(λt)log(pη)log(1/δ)εη4)n=O\left(\frac{e^{O{\left({\lambda t}\right)}}\log(\frac{p}{\eta})\log(1/\delta)}{\varepsilon\eta^{4}}\right)

samples.

7 Lower Bounds for Structure Learning of Graphical Models

In this section, we will prove structure learning lower bounds under pure DP or zero-concentrated DP. The graphical models we consider are the Ising model and pairwise graphical models. However, we note that all the lower bounds for the Ising model also hold for binary tt-wise MRFs, since the Ising model is a special case of binary tt-wise MRFs corresponding to t=2t=2. We will show that under (ε,0)(\varepsilon,0)-DP or ρ\rho-zCDP, a polynomial dependence on the dimension is unavoidable in the sample complexity.

In Section 7.1, we assume that our samples are generated from an Ising model. In Section 7.2, we extend our lower bounds to pairwise graphical models.

7.1 Lower Bounds for Structure Learning of Ising Models

Theorem 7.1.

Any (ε,0)(\varepsilon,0)-DP algorithm which learns the structure of an Ising model with minimum edge weight η\eta with probability at least 2/32/3 requires n=Ω(pηε+pε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}+\frac{p}{\varepsilon}}\right)} samples. Furthermore, at least n=Ω(pρ)n=\Omega{\left({\sqrt{\frac{p}{\rho}}}\right)} samples are required for the same task under ρ\rho-zCDP.

Proof.

Our lower bound argument proceeds in two steps. The first step is to construct a set of distributions, consisting of 2p22^{\frac{p}{2}} different Ising models, such that any accurate structure learning algorithm must output different answers for different distributions. In the second step, we apply our private Fano's inequality (for pure DP) or the packing argument for zCDP [BunS16] to obtain the desired lower bound.

To start, we use the following binary code to construct the distribution set. Let 𝒞={0,1}p2{\cal C}=\{0,1\}^{\frac{p}{2}}. Given c𝒞c\in{\cal C}, we construct the corresponding distribution 𝒟(Ac,0){\cal D}(A^{c},0) with Acp×pA^{c}\in\mathbb{R}^{p\times p} defined as follows: for i[p2],A2i1,2ic=A2i,2i1c=ηc[i]i\in[\frac{p}{2}],A^{c}_{2i-1,2i}=A^{c}_{2i,2i-1}=\eta\cdot c[i], and 0 elsewhere. By construction, we divide the pp nodes into p2\frac{p}{2} different pairs, where there is no correlation between nodes belonging to different pairs. Furthermore, for pair ii, if c[i]=0c[i]=0, then node 2i12i-1 is independent of node 2i2i, and it is not hard to show that

Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=14,\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=-1}\right)}=\frac{1}{4},
Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=14.\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=-1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=1}\right)}=\frac{1}{4}.

On the other hand, if c[i]=1c[i]=1,

Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=12eηeη+1,\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=-1}\right)}=\frac{1}{2}\cdot\frac{e^{\eta}}{e^{\eta}+1},
Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=121eη+1.\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=-1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=1}\right)}=\frac{1}{2}\cdot\frac{1}{e^{\eta}+1}.

The Chi-squared distance between these two distributions is

8[(12eηeη+114)2+(121eη+114)2]=(12eη+1)24η2.8{\left[{{\left({\frac{1}{2}\cdot\frac{e^{\eta}}{e^{\eta}+1}-\frac{1}{4}}\right)}^{2}+{\left({\frac{1}{2}\cdot\frac{1}{e^{\eta}+1}-\frac{1}{4}}\right)}^{2}}\right]}={\left({1-\frac{2}{e^{\eta}+1}}\right)}^{2}\leq 4\eta^{2}.
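The final inequality holds because 1-\frac{2}{e^{\eta}+1}=\frac{e^{\eta}-1}{e^{\eta}+1}=\tanh(\eta/2)\leq\eta/2.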

Now we want to upper bound the total variation distance between 𝒟(Ac1,0){\cal D}(A^{c_{1}},0) and 𝒟(Ac2,0){\cal D}(A^{c_{2}},0) for any c1c2𝒞c_{1}\neq c_{2}\in{\cal C}. Let PiP_{i} and QiQ_{i} denote the joint distributions of nodes 2i12i-1 and 2i2i under 𝒟(Ac1,0){\cal D}(A^{c_{1}},0) and 𝒟(Ac2,0){\cal D}(A^{c_{2}},0), respectively. We have that

dTV(𝒟(Ac1,0),𝒟(Ac2,0))\displaystyle d_{TV}{\left({{\cal D}(A^{c_{1}},0),{\cal D}(A^{c_{2}},0)}\right)} 2dKL(𝒟(Ac1,0),𝒟(Ac2,0))\displaystyle\leq\sqrt{2d_{KL}{\left({{\cal D}(A^{c_{1}},0),{\cal D}(A^{c_{2}},0)}\right)}}
=2i=1p2dKL(Pi,Qi)min(2ηp,1),\displaystyle=\sqrt{2\sum_{i=1}^{\frac{p}{2}}d_{KL}{\left({P_{i},Q_{i}}\right)}}\leq\min{\left({2\eta\sqrt{p},1}\right)},

where the first inequality is by Pinsker’s inequality, and the last inequality comes from the fact that the KL divergence is always upper bounded by the Chi-squared distance.

In order to attain pure DP lower bounds, we utilize the corollary of DP Fano’s inequality for estimation (Theorem LABEL:coro:fano).

For any c1,c2𝒞c_{1},c_{2}\in{\cal C}, we have dTV(𝒟(Ac1,0),𝒟(Ac2,0))min(2ηp,1)d_{TV}{\left({{\cal D}(A^{c_{1}},0),{\cal D}(A^{c_{2}},0)}\right)}\leq\min{\left({2\eta\sqrt{p},1}\right)}. By the property of maximal coupling [Hollander12], there exists a coupling between 𝒟n(Ac1,0){\cal D}^{n}(A^{c_{1}},0) and 𝒟n(Ac2,0){\cal D}^{n}(A^{c_{2}},0) with expected Hamming distance at most min(2nηp,n)\min{\left({2n\eta\sqrt{p},n}\right)}. Therefore, since log|𝒞|=Θ(p)\log\left|{\cal C}\right|=\Theta(p), we have ε=Ω(log|𝒞|min(nηp,n))\varepsilon=\Omega{\left({\frac{\log\left|{\cal C}\right|}{\min{\left({n\eta\sqrt{p},n}\right)}}}\right)}, and accordingly, n=Ω(pηε+pε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}+\frac{p}{\varepsilon}}\right)}.

Now we move to zCDP lower bounds. We utilize a different version of the packing argument [BunS16], which works under zCDP.

Lemma 7.2.

Let 𝒱={P1,P2,,PM}{\cal V}=\{P_{1},P_{2},...,P_{M}\} be a set of MM distributions over 𝒳n{\cal X}^{n}, and let {Si}i[M]\{S_{i}\}_{i\in[M]} be a collection of disjoint subsets of an output space 𝒮{\cal S}. If there exists a ρ\rho-zCDP algorithm 𝒜:𝒳n𝒮{\cal A}:{\cal X}^{n}\to{\cal S} such that for every i[M]i\in[M], given ZnPiZ^{n}\sim P_{i}, Pr(𝒜(Zn)Si)910\Pr{\left({{\cal A}(Z^{n})\in S_{i}}\right)}\geq\frac{9}{10}, then

ρ=Ω(logMn2).\rho=\Omega{\left({\frac{\log M}{n^{2}}}\right)}.

By Lemma 7.2, we derive ρ=Ω(pn2)\rho=\Omega{\left({\frac{p}{n^{2}}}\right)} and n=Ω(pρ)n=\Omega{\left({\sqrt{\frac{p}{\rho}}}\right)} accordingly. ∎

7.2 Lower Bounds for Structure Learning of Pairwise Graphical Models

Similar techniques can be used to derive lower bounds for pairwise graphical models.

Theorem 7.3.

Any (ε,0)(\varepsilon,0)-DP algorithm which learns the structure of a pp-variable pairwise graphical model with minimum edge weight η\eta with probability at least 2/32/3 requires n=Ω(pηε+k2pε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}+\frac{k^{2}p}{\varepsilon}}\right)} samples. Furthermore, at least n=Ω(k2pρ)n=\Omega{\left({\sqrt{\frac{k^{2}p}{\rho}}}\right)} samples are required for the same task under ρ\rho-zCDP.

Proof.

As before, we start by constructing a distribution set, consisting of 2Ω(k2p)2^{\Omega{\left({k^{2}p}\right)}} different pairwise graphical models, such that any accurate structure learning algorithm must output different answers for different distributions.

Let 𝒞{\cal C} be the set of k×kk\times k symmetric matrices with each entry equal to either 0 or η\eta, i.e., 𝒞={W{0,η}k×k:W=WT}{\cal C}=\{W\in\{0,\eta\}^{k\times k}:W=W^{T}\}. Without loss of generality, we assume pp is even. Given c=[c1,c2,,cp/2]c=[c_{1},c_{2},\cdots,c_{p/2}], where c1,c2,,cp/2𝒞c_{1},c_{2},\cdots,c_{p/2}\in{\cal C}, we construct the corresponding distribution 𝒟(𝒲c,0){\cal D}({\cal W}^{c},0) with 𝒲c{\cal W}^{c} defined as follows: for l[p2],W2l1,2lc=cll\in[\frac{p}{2}],W^{c}_{2l-1,2l}=c_{l}, and for other pairs (i,j)(i,j), Wi,jc=0W^{c}_{i,j}=0. Similarly, by this construction we divide the pp nodes into p2\frac{p}{2} different pairs, and there is no correlation between nodes belonging to different pairs.

We first prove lower bounds under (ε,0)(\varepsilon,0)-DP. Let MM denote the number of distributions in our construction. By Theorem LABEL:coro:fano, ε=Ω(logMn)\varepsilon=\Omega{\left({\frac{\log M}{n}}\right)}, since for any two nn-sample distributions, the expected coupling distance can always be upper bounded by nn. We also note that M=|𝒞|p/2=(2k(k+1)2)p/2M=\left|{\cal C}\right|^{p/2}={\left({2^{\frac{k(k+1)}{2}}}\right)}^{p/2}, so logM=Ω(k2p)\log M=\Omega(k^{2}p). Therefore, we have n=Ω(k2pε)n=\Omega{\left({\frac{k^{2}p}{\varepsilon}}\right)}. At the same time, n=Ω(pηε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}}\right)} is another lower bound, inherited from the easier task of learning Ising models.

With respect to zCDP, we utilize Lemma 7.2 and obtain ρ=Ω(k2pn2)\rho=\Omega{\left({\frac{k^{2}p}{n^{2}}}\right)}. Therefore, we have n=Ω(k2pρ)n=\Omega{\left({\sqrt{\frac{k^{2}p}{\rho}}}\right)}. ∎

Acknowledgments

The authors would like to thank Kunal Talwar for suggesting the study of this problem, and Adam Klivans, Frederic Koehler, Ankur Moitra, and Shanshan Wu for helpful and inspiring conversations. GK would like to thank Chengdu Style Restaurant (古月飘香) in Berkeley for inspiration in the conception of this project.