
Privately Learning Markov Random Fields

Huanyu Zhang (Cornell University, [email protected]; supported by NSF #1815893 and NSF #1704443; this work was partially done while the author was an intern at Microsoft Research Redmond), Gautam Kamath (University of Waterloo, [email protected]; supported by a University of Waterloo startup grant; part of this work was done while supported as a Microsoft Research Fellow, as part of the Simons-Berkeley Research Fellowship program, and while visiting Microsoft Research Redmond), Janardhan Kulkarni (Microsoft Research Redmond, [email protected]), and Zhiwei Steven Wu (University of Minnesota, [email protected]; supported in part by NSF FAI Award #1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant). The last three authors are listed in alphabetical order.
Abstract

We consider the problem of learning Markov Random Fields (including the prototypical example, the Ising model) under the constraint of differential privacy. Our learning goals include both structure learning, where we try to estimate the underlying graph structure of the model, as well as the harder goal of parameter learning, in which we additionally estimate the parameter on each edge. We provide algorithms and lower bounds for both problems under a variety of privacy constraints – namely pure, concentrated, and approximate differential privacy. While non-privately, both learning goals enjoy roughly the same complexity, we show that this is not the case under differential privacy. In particular, only structure learning under approximate differential privacy maintains the non-private logarithmic dependence on the dimensionality of the data, while a change in either the learning goal or the privacy notion would necessitate a polynomial dependence. As a result, we show that the privacy constraint imposes a strong separation between these two learning problems in the high-dimensional data regime.

1 Introduction

In this chapter, we continue to study the problem of private distribution estimation. However, we focus on a more complicated class of distributions: those defined by undirected graphical models.

Graphical models are a common structure used to model high-dimensional data, which find a myriad of applications in diverse research disciplines, including probability theory, Markov Chain Monte Carlo, computer vision, theoretical computer science, social network analysis, game theory, and computational biology [LevinPW09, Chatterjee05, Felsenstein04, DaskalakisMR11, GemanG86, Ellison93, MontanariS10]. While statistical tasks involving general distributions over $p$ variables often run into the curse of dimensionality (i.e., an exponential sample complexity in $p$), Markov Random Fields (MRFs) are a particular family of undirected graphical models which are parameterized by the “order” $t$ of their interactions. Restricting the order of interactions allows us to capture most distributions which may naturally arise, and also avoids this severe dependence on the dimension (i.e., we often pay an exponential dependence on $t$ instead of $p$). An MRF is defined as follows; see Section 2 for more precise definitions and the notation we use in this chapter.

Definition 1.1.

Let $k,t,p\in\mathbb{N}$, let $G=(V,E)$ be a graph on $p$ nodes, and let $C_{t}(G)$ be the set of cliques of size at most $t$ in $G$. A Markov Random Field with alphabet size $k$ and $t$-order interactions is a distribution $\mathcal{D}$ over $[k]^{p}$ such that

$$\Pr_{X\sim\mathcal{D}}[X=x]\propto\exp\left(\sum_{I\in C_{t}(G)}\psi_{I}(x)\right),$$

where $\psi_{I}:[k]^{p}\rightarrow\mathbb{R}$ depends only on the variables in $I$.

The case $k=t=2$ corresponds to the prototypical example of an MRF, the Ising model [Ising25] (Definition 2.1). More generally, if $t=2$, we call the model pairwise (Definition 2.2), and if $k=2$ but $t$ is unrestricted, we call the model a binary MRF (Definition 2.4). In this chapter, we mainly look at these two special cases of MRFs.

Given the wide applicability of these graphical models, there has been a great deal of work on the problem of graphical model estimation [RavikumarWL10, SanthanamW12, Bresler15, VuffrayMLC16, KlivansM17, HamiltonKM17, RigolletH17, LokhovVMC18, WuSD19]. That is, given a dataset generated from a graphical model, can we infer properties of the underlying distribution? Most of the attention has focused on two learning goals.

  1. Structure learning (Definition 2.9): Recover the set of non-zero edges in $G$.

  2. Parameter learning (Definition 2.10): Recover the set of non-zero edges in $G$, as well as $\psi_{I}$ for all cliques $I$ of size at most $t$.

It is clear that structure learning is easier than parameter learning. Nonetheless, the sample complexity of both learning goals is known to be roughly equivalent. That is, both can be performed using a number of samples which is only logarithmic in the dimension $p$, assuming a model of bounded “width” $\lambda$ (a common parameterization of the problem, which roughly corresponds to the graph having bounded degree; see Section 2 for more details), thus facilitating estimation in very high-dimensional settings.

Our goal is to design algorithms which guarantee both:

  • Accuracy: With probability greater than $2/3$, the algorithm learns the underlying graphical model;

  • Privacy: The algorithm satisfies differential privacy, even when the dataset is not drawn from a graphical model.

Thematically, we investigate the following question: how much additional data is needed to learn Markov Random Fields under the constraint of differential privacy? As mentioned before, absent privacy constraints, the sample complexity is logarithmic in $p$. Can we guarantee privacy with comparable amounts of data? Or if more data is needed, how much more?

1.1 Results and Techniques

We proceed to describe our results on privately learning Markov Random Fields. In this section, we will assume familiarity with some of the most common notions of differential privacy: pure $\varepsilon$-differential privacy, $\rho$-zero-concentrated differential privacy, and approximate $(\varepsilon,\delta)$-differential privacy. In particular, one should know that these are in (strictly) decreasing order of strength (i.e., an algorithm which satisfies pure DP gives a stronger privacy guarantee than one which only satisfies concentrated DP); formal definitions appear in Section 2. Furthermore, in order to be precise, some of our theorem statements will use notation which is defined later (Section 2) – these may be skipped on a first reading, as our prose will not require this knowledge.

Upper Bounds.

Our first upper bounds are for parameter learning. First, we have the following theorem, which gives an upper bound for parameter learning of pairwise graphical models under concentrated differential privacy, showing that this learning goal can be achieved with $O(\sqrt{p})$ samples. In particular, this includes the special case of the Ising model, which corresponds to an alphabet size of $k=2$. Note that this implies the same result if one relaxes the learning goal to structure learning, or the privacy notion to approximate DP, as these modifications only make the problem easier. Further details are given in Section 3.3.

Theorem 1.2.

There exists an efficient $\rho$-zCDP algorithm which learns the parameters of a pairwise graphical model to accuracy $\alpha$ with probability at least $2/3$, which requires a sample complexity of

$$n=O\left(\frac{\lambda^{2}k^{5}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\,\lambda^{2}k^{5.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^{3}}\right).$$

This result can be seen as a private adaptation of the elegant work of [WuSD19] (which in turn builds on the structural results of [KlivansM17]). Wu, Sanghavi, and Dimakis [WuSD19] show that $\ell_{1}$-constrained logistic regression suffices to learn the parameters of all pairwise graphical models. We first develop a private analog of this method, based on the private Frank-Wolfe method of Talwar, Thakurta, and Zhang [TalwarTZ14, TalwarTZ15], which is of independent interest. This method is studied in Section 3.1.

Theorem 1.3.

If we consider the problem of private sparse logistic regression, there exists an efficient $\rho$-zCDP algorithm that produces a parameter vector $w^{priv}$ such that, with probability at least $1-\beta$, the excess empirical risk satisfies

$$\mathcal{L}(w^{priv};D)-\mathcal{L}(w^{erm};D)=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right).$$

We note that Theorem 1.3 avoids a polynomial dependence on the dimension $p$ in favor of a polynomial dependence on the “sparsity” parameter $\lambda$. The greater dependence on $p$ which arises in Theorem 1.2 is from applying Theorem 1.3 and then using composition properties of concentrated DP.

We go on to generalize the results of [WuSD19], showing that $\ell_{1}$-constrained logistic regression can also learn the parameters of binary $t$-wise MRFs. This result is novel even in the non-private setting. Further details are presented in Section 4.

The following theorem shows that we can learn the parameters of binary $t$-wise MRFs with $\tilde{O}(\sqrt{p})$ samples.

Theorem 1.4.

Let $\mathcal{D}$ be an unknown binary $t$-wise MRF with associated polynomial $h$. Then there exists a $\rho$-zCDP algorithm which, with probability at least $2/3$, learns the maximal monomials of $h$ to accuracy $\alpha$, given $n$ i.i.d. samples $Z^{1},\cdots,Z^{n}\sim\mathcal{D}$, where

$$n=O\left(\frac{e^{5\lambda t}\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\,\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}\sqrt{p}\log p}{\sqrt{\rho}\,\alpha^{2}}+\frac{e^{6\lambda t}\log(p)}{\alpha^{6}}\right).$$

To obtain the rate above, our algorithm uses the Private Multiplicative Weights (PMW) method of [HardtR10] to estimate all parity queries of order at most $t$. The PMW method runs in time exponential in $p$, since it maintains a distribution over the data domain. We can also obtain an oracle-efficient algorithm that runs in polynomial time when given access to an empirical risk minimization oracle over the class of parities. By replacing PMW with such an oracle-efficient algorithm, sepFEM in [VietriTBSW20], we obtain a slightly worse sample complexity of

$$n=O\left(\frac{e^{5\lambda t}\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\,\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}p^{5/4}\log p}{\sqrt{\rho}\,\alpha^{2}}+\frac{e^{6\lambda t}\log(p)}{\alpha^{6}}\right).$$

For the special case of structure learning under approximate differential privacy, we provide a significantly better algorithm. In particular, we can achieve an $O(\log p)$ sample complexity, which improves exponentially on the above algorithm’s sample complexity of $O(\sqrt{p})$. The following is a representative theorem statement for pairwise graphical models, though we derive similar statements for binary MRFs of higher order.

Theorem 1.5.

There exists an efficient $(\varepsilon,\delta)$-differentially private algorithm which, with probability at least $2/3$, learns the structure of a pairwise graphical model, which requires a sample complexity of

$$n=O\left(\frac{\lambda^{2}k^{4}\exp(14\lambda)\log(pk)\log(1/\delta)}{\varepsilon\eta^{4}}\right).$$

This result can be derived using stability properties of non-private algorithms. In particular, in the non-private setting, algorithms for this problem recover the entire graph exactly with constant probability. This allows us to derive private algorithms at a multiplicative cost of $O(\log(1/\delta)/\varepsilon)$ samples, using either the propose-test-release framework [DworkL09] or stability-based histograms [KorolovaKMN09, BunNSV15]. Further details are given in Section 6.

Lower Bounds.

We note the significant gap between the aforementioned upper bounds: in particular, our more generally applicable upper bound (Theorem 1.2) has a $O(\sqrt{p})$ dependence on the dimension, whereas the best known lower bound is $\Omega(\log p)$ [SanthanamW12]. However, we show that our upper bound is tight. That is, even if we relax the privacy notion to approximate differential privacy, or relax the learning goal to structure learning, the sample complexity is still $\Omega(\sqrt{p})$. Perhaps surprisingly, if we perform both relaxations simultaneously, this falls into the purview of Theorem 1.5, and the sample complexity drops to $O(\log p)$.

First, we show that even under approximate differential privacy, learning the parameters of a graphical model requires $\Omega(\sqrt{p})$ samples. The formal statement is given in Section 5.

Theorem 1.6 (Informal).

Any algorithm which satisfies approximate differential privacy and learns the parameters of a pairwise graphical model with probability at least $2/3$ requires $\operatorname{poly}(p)$ samples.

This result is proved by constructing a family of instances of binary pairwise graphical models (i.e., Ising models) which encode product distributions. Specifically, we consider the set of graphs formed by a perfect matching with edges $(2i,2i+1)$ for $i\in[p/2]$. In order to estimate the parameter on every edge, one must estimate the correlation between each such pair of nodes, which can be shown to correspond to learning the mean of a particular product distribution in $\ell_{\infty}$-distance. This problem is well-known to have a gap between the non-private and private sample complexities, due to methods derived from fingerprinting codes [BunUV14, DworkSSUV15, SteinkeU17a], and differentially private Fano’s inequality.
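To make the construction concrete, the following is a minimal Python sketch (illustrative helper names, not code from this chapter; it assumes $p$ is even and uses 0-indexed pairs) that builds the matching-based weight matrix and samples from it; because the dependency graph is a matching, each matched pair can be sampled independently from its four-point distribution.

    import numpy as np

    def matching_ising_weights(p, a):
        # Weight matrix whose dependency graph is the perfect matching
        # {(2i, 2i+1)}; the mean-field vector theta is taken to be zero.
        A = np.zeros((p, p))
        for i in range(p // 2):
            A[2 * i, 2 * i + 1] = A[2 * i + 1, 2 * i] = a
        return A

    def sample_matching_ising(p, a, n, rng):
        # The model factorizes over matched pairs, so each pair is drawn
        # independently with Pr(z1, z2) proportional to exp(a * z1 * z2).
        combos = np.array([(s, t) for s in (-1, 1) for t in (-1, 1)])
        probs = np.exp(a * combos[:, 0] * combos[:, 1])
        probs /= probs.sum()
        Z = np.empty((n, p), dtype=int)
        for i in range(p // 2):
            idx = rng.choice(4, size=n, p=probs)
            Z[:, 2 * i:2 * i + 2] = combos[idx]
        return Z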

Second, we show that learning the structure of a graphical model, under either pure or concentrated differential privacy, requires $\operatorname{poly}(p)$ samples. The formal theorem appears in Section 7.

Theorem 1.7 (Informal).

Any algorithm which satisfies pure or concentrated differential privacy and learns the structure of a pairwise graphical model with probability at least $2/3$ requires $\operatorname{poly}(p)$ samples.

We derive this result via packing arguments [HardtT10, BeimelBKN14] and differentially private Fano’s inequality, by showing that there exists a large number (exponential in $p$) of different binary pairwise graphical models which must be distinguished. The construction of a set of size $m$ implies lower bounds of $\Omega(\log m)$ and $\Omega(\sqrt{\log m})$ for learning under pure and concentrated differential privacy, respectively.

1.1.1 Summary and Discussion

We summarize our findings on privately learning Markov Random Fields in Table 1, focusing on the specific case of the Ising model. We note that qualitatively similar relationships between problems also hold for general pairwise models as well as higher-order binary Markov Random Fields. Each cell denotes the sample complexity of a learning task, which is a combination of an objective and a privacy constraint. Problems become harder as we go down (as the privacy requirement is tightened) and to the right (structure learning is easier than parameter learning).

The top row shows that, absent privacy constraints, both learning goals require only $\Theta(\log p)$ samples, and are thus tractable even in very high-dimensional settings or when data is limited. However, if we additionally wish to guarantee privacy, our results show that this logarithmic sample complexity is only achievable when one considers structure learning under approximate differential privacy. If one changes the learning goal to parameter learning, or tightens the privacy notion to concentrated differential privacy, then the sample complexity jumps to become polynomial in the dimension, in particular $\Omega(\sqrt{p})$. Nonetheless, we provide algorithms which match this dependence, giving a tight $\Theta(\sqrt{p})$ bound on the sample complexity.

|                      | Structure Learning                        | Parameter Learning                       |
| Non-private          | $\Theta(\log p)$ (folklore)               | $\Theta(\log p)$ (folklore)              |
| Approximate DP       | $\Theta(\log p)$ (Theorem 6.3)            | $\Theta(\sqrt{p})$ (Theorems 3.8 and 5.1) |
| Zero-concentrated DP | $\Theta(\sqrt{p})$ (Theorems 3.8 and 7.1) | $\Theta(\sqrt{p})$ (Theorems 3.8 and 5.1) |
| Pure DP              | $\Omega(p)$ (Theorem 7.1)                 | $\Omega(p)$ (Theorem 7.1)                |

Table 1: Sample complexity (dependence on $p$) of privately learning an Ising model.

1.2 Related Work

As mentioned before, there has been significant work in learning the structure and parameters of graphical models, see, e.g., [ChowL68, CsiszarT06, AbbeelKN06, RavikumarWL10, JalaliJR11, JalaliRVS11, SanthanamW12, BreslerGS14b, Bresler15, VuffrayMLC16, KlivansM17, HamiltonKM17, RigolletH17, LokhovVMC18, WuSD19]. Perhaps a turning point in this literature is the work of Bresler [Bresler15], who showed for the first time that general Ising models of bounded degree can be learned in polynomial time. Since this result, subsequent works have focused on both generalizing these results to broader settings (including MRFs with higher-order interactions and non-binary alphabets) as well as simplifying existing arguments. There has also been work on learning, testing, and inferring other statistical properties of graphical models [BhattacharyaM16, MartindelCampoCU16, DaskalakisDK17, MukherjeeMY18, Bhattacharya19]. In particular, learning and testing Ising models in statistical distance have also been explored [DaskalakisDK18, GheissariLP18, DevroyeMR20, DaskalakisDK19, BezakovaBCSV19], and are interesting questions under the constraint of privacy.

Recent investigations at the intersection of graphical models and differential privacy include [BernsteinMSSHM17, ChowdhuryRJ20, McKennaSM19]. Bernstein et al. [BernsteinMSSHM17] privately learn graphical models by adding noise to the sufficient statistics and use an expectation-maximization based approach to recover the parameters. However, the focus is somewhat different, as they do not provide finite sample guarantees for the accuracy when performing parameter recovery, nor consider structure learning at all. Chowdhury, Rekatsinas, and Jha [ChowdhuryRJ20] study differentially private learning of Bayesian Networks, another popular type of graphical model which is incomparable with Markov Random Fields. McKenna, Sheldon, and Miklau [McKennaSM19] apply graphical models in place of full contingency tables to privately perform inference.

Graphical models can be seen as a natural extension of product distributions, which correspond to the case where the order $t$ of the MRF is $1$. There has been significant work in differentially private estimation of product distributions [BlumDMN05, BunUV14, DworkMNS06, SteinkeU17a, KamathLSU18, CaiWZ19, BunKSW2019]. Recently, this investigation has been broadened into differentially private distribution estimation, including sample-based estimation of properties and parameters, see, e.g., [NissimRS07, Smith11, BunNSV15, DiakonikolasHS15, KarwaV18, AcharyaKSZ18, KamathLSU18, BunKSW2019]. For further coverage of differentially private statistics, see [KamathU20].

2 Preliminaries and Notation

In order to distinguish between the vector coordinate and the sample index, we use a different notation in this chapter. Given a set of points $X^{n}$, we use superscripts, i.e., $X^{i}$, to denote the $i$-th datapoint. Given a vector $X\in\mathbb{R}^{p}$, we use subscripts, i.e., $X_{i}$, to denote its $i$-th coordinate. We also use $X_{-i}$ to denote the vector after deleting the $i$-th coordinate, i.e., $X_{-i}=[X_{1},\cdots,X_{i-1},X_{i+1},\cdots,X_{p}]$.

2.1 Markov Random Field Preliminaries

We first introduce the definition of the Ising model, which is the special case of general MRFs with $k=t=2$.

Definition 2.1.

The $p$-variable Ising model is a distribution $\mathcal{D}(A,\theta)$ on $\{-1,1\}^{p}$ that satisfies

$$\Pr\left(Z=z\right)\propto\exp\left(\sum_{1\leq i\leq j\leq p}A_{i,j}z_{i}z_{j}+\sum_{i\in[p]}\theta_{i}z_{i}\right),$$

where $A\in\mathbb{R}^{p\times p}$ is a symmetric weight matrix with $A_{ii}=0$ for all $i\in[p]$, and $\theta\in\mathbb{R}^{p}$ is a mean-field vector. The dependency graph of $\mathcal{D}(A,\theta)$ is an undirected graph $G=(V,E)$, with vertices $V=[p]$ and edges $E=\{(i,j):A_{i,j}\neq 0\}$. The width of $\mathcal{D}(A,\theta)$ is defined as

$$\lambda(A,\theta)=\max_{i\in[p]}\left(\sum_{j\in[p]}\left|A_{i,j}\right|+\left|\theta_{i}\right|\right).$$

Let $\eta(A,\theta)$ be the minimum edge weight in absolute value, i.e., $\eta(A,\theta)=\min_{i,j\in[p]:A_{i,j}\neq 0}\left|A_{i,j}\right|$.
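For concreteness, the quantities $\lambda(A,\theta)$ and $\eta(A,\theta)$ can be computed directly from the definition, as in the following minimal sketch (illustrative Python, assuming numpy arrays A and theta):

    import numpy as np

    def width(A, theta):
        # lambda(A, theta) = max_i ( sum_j |A_{i,j}| + |theta_i| )
        return np.max(np.sum(np.abs(A), axis=1) + np.abs(theta))

    def min_edge_weight(A):
        # eta(A, theta) = minimum of |A_{i,j}| over the nonzero (edge) entries
        nonzero = np.abs(A)[A != 0]
        return nonzero.min() if nonzero.size > 0 else 0.0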

We note that the Ising model is supported on $\{-1,1\}^{p}$. A natural generalization is to extend its support to $[k]^{p}$, while maintaining pairwise correlations.

Definition 2.2.

The $p$-variable pairwise graphical model is a distribution $\mathcal{D}(\mathcal{W},\Theta)$ on $[k]^{p}$ that satisfies

$$\Pr\left(Z=z\right)\propto\exp\left(\sum_{1\leq i\leq j\leq p}W_{i,j}(z_{i},z_{j})+\sum_{i\in[p]}\theta_{i}(z_{i})\right),$$

where $\mathcal{W}=\{W_{i,j}\in\mathbb{R}^{k\times k}:i\neq j\in[p]\}$ is a set of weight matrices satisfying $W_{i,j}=W^{T}_{j,i}$, and $\Theta=\{\theta_{i}\in\mathbb{R}^{k}:i\in[p]\}$ is a set of mean-field vectors. The dependency graph of $\mathcal{D}(\mathcal{W},\Theta)$ is an undirected graph $G=(V,E)$, with vertices $V=[p]$ and edges $E=\{(i,j):W_{i,j}\neq 0\}$. The width of $\mathcal{D}(\mathcal{W},\Theta)$ is defined as

$$\lambda(\mathcal{W},\Theta)=\max_{i\in[p],a\in[k]}\left(\sum_{j\in[p]\backslash i}\max_{b\in[k]}\left|W_{i,j}(a,b)\right|+\left|\theta_{i}(a)\right|\right).$$

Define $\eta(\mathcal{W},\Theta)=\min_{(i,j)\in E}\max_{a,b}|W_{i,j}(a,b)|$.

Both models above only consider pairwise interactions between nodes. In order to capture higher-order interactions, we examine the more general model of Markov Random Fields (MRFs). In this chapter, we will restrict our attention to MRFs over a binary alphabet (i.e., distributions over $\{\pm 1\}^{p}$). In order to define binary $t$-wise MRFs, we first need the following definitions of multilinear polynomials, partial derivatives, and maximal monomials.

Definition 2.3.

A multilinear polynomial is a function $h:\mathbb{R}^{p}\rightarrow\mathbb{R}$ of the form $h(x)=\sum_{I}\bar{h}(I)\prod_{i\in I}x_{i}$, where $\bar{h}(I)$ denotes the coefficient of the monomial $\prod_{i\in I}x_{i}$. Let $\partial_{i}h(x)=\sum_{J:i\not\in J}\bar{h}(J\cup\{i\})\prod_{j\in J}x_{j}$ denote the partial derivative of $h$ with respect to $x_{i}$. Similarly, for $I\subseteq[p]$, let $\partial_{I}h(x)=\sum_{J:J\cap I=\emptyset}\bar{h}(J\cup I)\prod_{j\in J}x_{j}$ denote the partial derivative of $h$ with respect to the variables $(x_{i}:i\in I)$. We say $I\subseteq[p]$ is a maximal monomial of $h$ if $\bar{h}(J)=0$ for all $J\supset I$.
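As an illustration of Definition 2.3, one can store a multilinear polynomial as a map from monomials (frozensets of variable indices) to coefficients; the sketch below (a hypothetical representation, not code from this chapter) evaluates $h$, computes $\partial_{I}h$, and checks whether a monomial is maximal.

    def evaluate(h, x):
        # h: dict mapping frozenset(I) -> coefficient h_bar(I); x: list of +/-1 values
        total = 0.0
        for I, coeff in h.items():
            prod = 1.0
            for i in I:
                prod *= x[i]
            total += coeff * prod
        return total

    def partial_derivative(h, I):
        # partial_I h(x) = sum over J with J disjoint from I of h_bar(J u I) * prod_{j in J} x_j;
        # equivalently, every monomial K containing I contributes its coefficient to K \ I.
        I = frozenset(I)
        out = {}
        for K, coeff in h.items():
            if I <= K:
                out[K - I] = out.get(K - I, 0.0) + coeff
        return out

    def is_maximal(h, I):
        # I is a maximal monomial if h_bar(J) = 0 for every strict superset J of I
        I = frozenset(I)
        return all(coeff == 0 for J, coeff in h.items() if I < J)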

Now we are able to formally define binary tt-wise MRFs.

Definition 2.4.

For a graph $G=(V,E)$ on $p$ vertices, let $C_{t}(G)$ denote the set of all cliques of size at most $t$ in $G$. A binary $t$-wise Markov random field on $G$ is a distribution $\mathcal{D}$ on $\{-1,1\}^{p}$ which satisfies

$$\Pr_{Z\sim\mathcal{D}}\left(Z=z\right)\propto\exp\left(\sum_{I\in C_{t}(G)}\varphi_{I}(z)\right),$$

where each $\varphi_{I}:\mathbb{R}^{p}\rightarrow\mathbb{R}$ is a multilinear polynomial that depends only on the variables in $I$.

We call $G$ the dependency graph of the MRF and $h(x)=\sum_{I\in C_{t}(G)}\varphi_{I}(x)$ the factorization polynomial of the MRF. The width of $\mathcal{D}$ is defined as $\lambda=\max_{i\in[p]}\|\partial_{i}h\|_{1}$, where $\|h\|_{1}\coloneqq\sum_{I}\left|\bar{h}(I)\right|$.

Now we introduce the definition of a $\delta$-unbiased distribution and its properties. The proofs appear in [KlivansM17].

Definition 2.5 ($\delta$-unbiased).

Let $S$ be the alphabet, e.g., $S=\{-1,1\}$ for binary $t$-wise MRFs and $S=[k]$ for pairwise graphical models. A distribution $\mathcal{D}$ on $S^{p}$ is $\delta$-unbiased if for $Z\sim\mathcal{D}$, every $i\in[p]$, and any assignment $x\in S^{p-1}$ to $Z_{-i}$, we have $\min_{z\in S}\Pr\left(Z_{i}=z\,|\,Z_{-i}=x\right)\geq\delta$.

The marginal distribution of a $\delta$-unbiased distribution also satisfies $\delta$-unbiasedness.

Lemma 2.6.

Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $S^{p}$ with alphabet $S$. For $X\sim\mathcal{D}$ and any $i\in[p]$, the distribution of $X_{-i}$ is also $\delta$-unbiased.

The following lemmas provide $\delta$-unbiasedness guarantees for various graphical models.

Lemma 2.7.

Let $\mathcal{D}(\mathcal{W},\Theta)$ be a pairwise graphical model with alphabet size $k$ and width $\lambda(\mathcal{W},\Theta)$. Then $\mathcal{D}(\mathcal{W},\Theta)$ is $\delta$-unbiased with $\delta=e^{-2\lambda(\mathcal{W},\Theta)}/k$. In particular, an Ising model $\mathcal{D}(A,\theta)$ is $e^{-2\lambda(A,\theta)}/2$-unbiased.

Lemma 2.8.

Let $\mathcal{D}$ be a binary $t$-wise MRF with width $\lambda$. Then $\mathcal{D}$ is $\delta$-unbiased with $\delta=e^{-2\lambda}/2$.

Finally, we define two possible goals for learning graphical models. First, the easier goal is structure learning, which involves recovering the set of non-zero edges.

Definition 2.9.

An algorithm learns the structure of a graphical model if, given samples $Z_{1},\dots,Z_{n}\sim\mathcal{D}$, it outputs a graph $\hat{G}=(V,\hat{E})$ over $V=[p]$ such that $\hat{E}=E$, the set of edges in the dependency graph of $\mathcal{D}$.

The more difficult goal is parameter learning, which requires the algorithm to learn not only the location of the edges, but also their parameter values.

Definition 2.10.

An algorithm learns the parameters of an Ising model (resp. pairwise graphical model) if, given samples $Z_{1},\dots,Z_{n}\sim\mathcal{D}$, it outputs a matrix $\hat{A}$ (resp. set of matrices $\hat{\mathcal{W}}$) such that $\max_{i,j\in[p]}|A_{i,j}-\hat{A}_{i,j}|\leq\alpha$ (resp. $|W_{i,j}(a,b)-\widehat{W}_{i,j}(a,b)|\leq\alpha$ for all $i\neq j\in[p]$ and all $a,b\in[k]$).

Definition 2.11.

An algorithm learns the parameters of a binary $t$-wise MRF with associated polynomial $h$ if, given samples $X^{1},\dots,X^{n}\sim\mathcal{D}$, it outputs another multilinear polynomial $u$ such that for every maximal monomial $I\subseteq[p]$, $\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha$.

2.2 Privacy Preliminaries

A dataset $X=X^{n}\in\mathcal{X}^{n}$ is a collection of points from some universe $\mathcal{X}$. In this chapter we consider a few different variants of differential privacy. The first is the standard notion of differential privacy, which has been used heavily in the previous chapters. The second is concentrated differential privacy [DworkR16]; specifically, we consider its refinement, zero-concentrated differential privacy [BunS16].

Definition 2.12 (Zero-Concentrated Differential Privacy (zCDP) [BunS16]).

A randomized algorithm $\mathcal{A}:\mathcal{X}^{n}\rightarrow\mathcal{S}$ satisfies $\rho$-zCDP if for every pair of neighboring datasets $X,X^{\prime}\in\mathcal{X}^{n}$,

$$\forall\alpha\in(1,\infty)\quad D_{\alpha}\left(\mathcal{A}(X)\,||\,\mathcal{A}(X^{\prime})\right)\leq\rho\alpha,$$

where $D_{\alpha}\left(\mathcal{A}(X)\,||\,\mathcal{A}(X^{\prime})\right)$ is the $\alpha$-Rényi divergence between $\mathcal{A}(X)$ and $\mathcal{A}(X^{\prime})$.

The following lemma quantifies the relationships between $(\varepsilon,0)$-DP, $\rho$-zCDP, and $(\varepsilon,\delta)$-DP.

Lemma 2.13 (Relationships Between Variants of DP [BunS16]).

For every $\varepsilon\geq 0$:

  1. If $\mathcal{A}$ satisfies $(\varepsilon,0)$-DP, then $\mathcal{A}$ is $\frac{\varepsilon^{2}}{2}$-zCDP.

  2. If $\mathcal{A}$ satisfies $\frac{\varepsilon^{2}}{2}$-zCDP, then $\mathcal{A}$ satisfies $(\frac{\varepsilon^{2}}{2}+\varepsilon\sqrt{2\log(\frac{1}{\delta})},\delta)$-DP for every $\delta>0$.

Roughly speaking, pure DP is stronger than zero-concentrated DP, which is stronger than approximate DP.
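For a sense of scale, the following short calculation applies Lemma 2.13 (illustrative only): a $(1,0)$-DP algorithm is $0.5$-zCDP, and a $\rho$-zCDP algorithm is $(\rho+2\sqrt{\rho\log(1/\delta)},\delta)$-DP, which for $\rho=0.5$ and $\delta=10^{-6}$ gives $\varepsilon\approx 5.76$.

    import math

    def pure_dp_to_zcdp(eps):
        # Lemma 2.13, part 1: (eps, 0)-DP implies (eps^2 / 2)-zCDP
        return eps ** 2 / 2

    def zcdp_to_approx_dp(rho, delta):
        # Lemma 2.13, part 2, rewritten with rho = eps^2 / 2:
        # rho-zCDP implies (rho + 2 * sqrt(rho * log(1/delta)), delta)-DP
        return rho + 2 * math.sqrt(rho * math.log(1 / delta)), delta

    rho = pure_dp_to_zcdp(1.0)                  # 0.5
    print(zcdp_to_approx_dp(rho, delta=1e-6))   # approximately (5.76, 1e-06)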

A crucial property of all variants of differential privacy is that they can be composed adaptively. By adaptive composition, we mean a sequence of algorithms $\mathcal{A}_{1}(X),\dots,\mathcal{A}_{T}(X)$ where the algorithm $\mathcal{A}_{t}(X)$ may also depend on the outcomes of the algorithms $\mathcal{A}_{1}(X),\dots,\mathcal{A}_{t-1}(X)$.

Lemma 2.14 (Composition of zero-concentrated DP [BunS16]).

If $\mathcal{A}$ is an adaptive composition of differentially private algorithms $\mathcal{A}_{1},\dots,\mathcal{A}_{T}$, and $\mathcal{A}_{1},\dots,\mathcal{A}_{T}$ are $\rho_{1},\dots,\rho_{T}$-zCDP respectively, then $\mathcal{A}$ is $\rho$-zCDP for $\rho=\sum_{t}\rho_{t}$.

3 Parameter Learning of Pairwise Graphical Models

3.1 Private Sparse Logistic Regression

As a subroutine of our parameter learning algorithm, we consider the following problem: given a training dataset $D=\{d^{j}\}_{j=1}^{n}=\{(x^{j},y^{j})\}_{j=1}^{n}$ consisting of $n$ pairs, where $x^{j}\in\mathbb{R}^{p}$ and $y^{j}\in\mathbb{R}$, a constraint set $\mathcal{C}\subseteq\mathbb{R}^{p}$, and a loss function $\ell:\mathcal{C}\times\mathbb{R}^{p+1}\rightarrow\mathbb{R}$, we want to find $w^{erm}=\arg\min_{w\in\mathcal{C}}\mathcal{L}(w;D)=\arg\min_{w\in\mathcal{C}}\frac{1}{n}\sum_{j=1}^{n}\ell(w;d^{j})$ under a zCDP constraint. This problem was previously studied in [TalwarTZ14]. Before stating their results, we need the following two definitions. The first definition concerns Lipschitz continuity.

Definition 3.1.

A function $\ell:\mathcal{C}\rightarrow\mathbb{R}$ is $L_{1}$-Lipschitz with respect to the $\ell_{1}$ norm if

$$\forall w_{1},w_{2}\in\mathcal{C},\quad\left|\ell(w_{1})-\ell(w_{2})\right|\leq L_{1}\|w_{1}-w_{2}\|_{1}.$$

The performance of the algorithm also depends on the “curvature” of the loss function, which is defined below, based on the definition of [Clarkson10, Jaggi13]. A side remark is that this is a strictly weaker constraint than smoothness [TalwarTZ14].

Definition 3.2 (Curvature constant).

For $\ell:\mathcal{C}\rightarrow\mathbb{R}$, the curvature constant $\Gamma_{\ell}$ is defined as

$$\Gamma_{\ell}=\sup_{w_{1},w_{2}\in\mathcal{C},\,\gamma\in(0,1],\,w_{3}=w_{1}+\gamma(w_{2}-w_{1})}\frac{2}{\gamma^{2}}\left(\ell(w_{3})-\ell(w_{1})-\langle w_{3}-w_{1},\nabla\ell(w_{1})\rangle\right).$$

Now we are able to introduce the algorithm and its theoretical guarantees.

Algorithm 1 $\mathcal{A}_{PFW}(D,\mathcal{L},\rho,\mathcal{C})$: Private Frank-Wolfe Algorithm

Input: Dataset $D=\{d^{1},\cdots,d^{n}\}$; loss function $\mathcal{L}(w;D)=\frac{1}{n}\sum_{j=1}^{n}\ell(w;d^{j})$ (with Lipschitz constant $L_{1}$); privacy parameter $\rho$; convex set $\mathcal{C}=\text{conv}(S)$ with $\|\mathcal{C}\|_{1}\coloneqq\max_{s\in S}\|s\|_{1}$; number of iterations $T$

1: Initialize $w_{1}$ to an arbitrary point in $\mathcal{C}$
2: for $t=1$ to $T-1$:
3:   For each $s\in S$, $\alpha_{s}\leftarrow\langle s,\nabla\mathcal{L}(w_{t};D)\rangle+\text{Lap}\left(0,\frac{L_{1}\|\mathcal{C}\|_{1}\sqrt{T}}{n\sqrt{\rho}}\right)$
4:   $\tilde{w}_{t}\leftarrow\arg\min_{s\in S}\alpha_{s}$
5:   $w_{t+1}\leftarrow(1-\mu_{t})w_{t}+\mu_{t}\tilde{w}_{t}$, where $\mu_{t}=\frac{2}{t+2}$

Output: $w^{priv}=w_{T}$
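For intuition, here is a minimal numpy sketch of Algorithm 1 specialized to the $\ell_{1}$ ball $\mathcal{C}=\{w:\|w\|_{1}\leq\lambda\}$, whose vertex set $S$ consists of the $2p$ points $\{\pm\lambda e_{j}\}$; the function signature and the gradient callback are illustrative, not from the paper.

    import numpy as np

    def private_frank_wolfe(grad_fn, p, lam, L1, n, rho, T, rng):
        # Private Frank-Wolfe over C = {w : ||w||_1 <= lam}.
        # grad_fn(w) returns the gradient of the empirical loss L(w; D).
        w = np.zeros(p)                          # an arbitrary point in C
        noise_scale = L1 * lam * np.sqrt(T) / (n * np.sqrt(rho))
        for t in range(1, T):
            g = grad_fn(w)
            # noisy score <s, grad> + Laplace noise for each vertex s = +/- lam * e_j
            scores = np.concatenate([lam * g, -lam * g])
            scores += rng.laplace(scale=noise_scale, size=2 * p)
            j = int(np.argmin(scores))
            s = np.zeros(p)
            s[j % p] = lam if j < p else -lam
            mu = 2.0 / (t + 2)
            w = (1 - mu) * w + mu * s
        return w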
Lemma 3.3 (Theorem 5.5 from [TalwarTZ14]).

Algorithm 1 satisfies $\rho$-zCDP. Furthermore, let $L_{1}$ and $\|\mathcal{C}\|_{1}$ be defined as in Algorithm 1, let $\Gamma_{\ell}$ be an upper bound on the curvature constant of the loss function $\ell(\cdot;d)$ for all $d$, and let $\left|S\right|$ be the number of extreme points in $S$. If we set $T=\frac{\Gamma_{\ell}^{\frac{2}{3}}(n\sqrt{\rho})^{\frac{2}{3}}}{L_{1}\|\mathcal{C}\|_{1}^{\frac{2}{3}}}$, then with probability at least $1-\beta$ over the randomness of the algorithm,

$$\mathcal{L}(w^{priv};D)-\mathcal{L}(w^{erm};D)=O\left(\frac{\Gamma_{\ell}^{\frac{1}{3}}(L_{1}\|\mathcal{C}\|_{1})^{\frac{2}{3}}\log(\frac{n\left|S\right|}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right).$$
Proof.

The utility guarantee is proved in [TalwarTZ14]; therefore, it suffices to show that the algorithm satisfies $\rho$-zCDP. By the definition of the Laplace mechanism, every iteration of the algorithm satisfies $(\sqrt{\frac{\rho}{T}},0)$-DP, which implies $\frac{\rho}{T}$-zCDP by Lemma 2.13. Then, by the composition theorem for zCDP (Lemma 2.14), the algorithm satisfies $\rho$-zCDP. ∎

For the specific problem of sparse logistic regression, we obtain the following corollary.

Corollary 3.4.

Consider the problem of sparse logistic regression, i.e., $\mathcal{L}(w;D)=\frac{1}{n}\sum_{j=1}^{n}\log(1+e^{-y^{j}\langle w,x^{j}\rangle})$ with the constraint set $\mathcal{C}=\{w:\|w\|_{1}\leq\lambda\}$, and further assume that $\left\lVert x^{j}\right\rVert_{\infty}\leq 1$ and $y^{j}\in\{\pm 1\}$ for all $j$. If we set $T=\lambda^{\frac{2}{3}}(n\sqrt{\rho})^{\frac{2}{3}}$, then with probability at least $1-\beta$ over the randomness of the algorithm,

$$\mathcal{L}(w^{priv};D)-\mathcal{L}(w^{erm};D)=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right).$$

Furthermore, the time complexity of the algorithm is $O(T\cdot(np+p^{2}))=O\left(n^{\frac{2}{3}}\cdot(np+p^{2})\right)$.

Proof.

First, we show that $L_{1}\leq 2$. Fix a sample $d=(x,y)$; then for any $w_{1},w_{2}\in\mathcal{C}$,

$$\left|\ell(w_{1};d)-\ell(w_{2};d)\right|\leq\max_{w}\left\lVert\nabla_{w}\ell(w;d)\right\rVert_{\infty}\cdot\|w_{1}-w_{2}\|_{1}.$$

Since $\nabla_{w}\ell(w;d)=\left(\sigma(\langle w,x\rangle)-y\right)\cdot x$, we have $\left\lVert\nabla_{w}\ell(w;d)\right\rVert_{\infty}\leq 2$.

Next, we wish to show that $\Gamma_{\ell}\leq\lambda^{2}$. We use the following lemma from [TalwarTZ14].

Lemma 3.5 (Remark 4 in [TalwarTZ14]).

For any $q,r\geq 1$ such that $\frac{1}{q}+\frac{1}{r}=1$, $\Gamma_{\ell}$ is upper bounded by $\alpha\left\lVert\mathcal{C}\right\rVert_{q}^{2}$, where $\alpha=\max_{w\in\mathcal{C},\left\lVert v\right\rVert_{q}=1}\left\lVert\nabla^{2}\ell(w)\cdot v\right\rVert_{q}$.

If we take $q=1$ and $r=+\infty$, then $\Gamma_{\ell}\leq\alpha\lambda^{2}$, where

$$\alpha=\max_{w\in\mathcal{C},\|v\|_{1}=1}\left\lVert\nabla^{2}\ell(w;d)\cdot v\right\rVert_{\infty}\leq\max_{i,j\in[p]}\left(\nabla^{2}\ell(w;d)\right)_{i,j}.$$

We have $\alpha\leq 1$, since $\nabla^{2}\ell(w;d)=\sigma(\langle w,x\rangle)\left(1-\sigma(\langle w,x\rangle)\right)\cdot xx^{T}$ and $\left\lVert x\right\rVert_{\infty}\leq 1$.

Finally, given $\mathcal{C}=\{w:\|w\|_{1}\leq\lambda\}$, the number of extreme points satisfies $\left|S\right|=2p$. Substituting all of these parameters into Lemma 3.3 proves the loss guarantee in the corollary.

With respect to the time complexity, we note that the time complexity of each iteration is $O(np+p^{2})$ and there are $T$ iterations in total. ∎
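Continuing the earlier sketch of Algorithm 1, a hypothetical instantiation for sparse logistic regression passes in the empirical gradient of the logistic loss (with labels in $\{\pm 1\}$, so $L_{1}\leq 2$) and sets $T=\lambda^{2/3}(n\sqrt{\rho})^{2/3}$ as in Corollary 3.4; the names below are illustrative, not from the paper.

    import numpy as np

    def logistic_grad(w, X, y):
        # Gradient of (1/n) * sum_m log(1 + exp(-y_m <w, x_m>)) with y_m in {-1, +1}:
        # each sample contributes -y_m * sigma(-y_m <w, x_m>) * x_m.
        margins = y * (X @ w)
        coeff = -y / (1.0 + np.exp(margins))
        return X.T @ coeff / X.shape[0]

    # hypothetical usage, with X an n x p array with entries in [-1, 1] and y in {-1, +1}^n:
    # T = int((lam ** 2 * n ** 2 * rho) ** (1 / 3)) + 1
    # w_priv = private_frank_wolfe(lambda w: logistic_grad(w, X, y), p=X.shape[1],
    #                              lam=lam, L1=2.0, n=n, rho=rho, T=T,
    #                              rng=np.random.default_rng(0))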

Now, if we further assume that the dataset $D$ is drawn i.i.d. from some underlying distribution $P$, the following result from learning theory relates the true risk to the empirical risk; it will be used heavily in the following sections.

Theorem 3.6.

Consider the same problem setting and assumptions as in Corollary 3.4, and further assume that the training dataset $D$ is drawn i.i.d. from some unknown distribution $P$. Then with probability at least $1-\beta$ over the randomness of the algorithm and the training dataset,

$$\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}+\frac{\lambda\log\left(\frac{1}{\beta}\right)}{\sqrt{n}}\right),$$

where $w^{*}=\arg\min_{w\in\mathcal{C}}\mathbb{E}_{(X,Y)\sim P}\left[\ell(w;(X,Y))\right]$.

Proof.

By the triangle inequality,

$$\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]$$
$$\leq\left|\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{priv};d^{m})\right|+\left|\frac{1}{n}\sum_{m=1}^{n}\ell(w^{priv};d^{m})-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{erm};d^{m})\right|$$
$$+\left(\frac{1}{n}\sum_{m=1}^{n}\ell(w^{erm};d^{m})-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{*};d^{m})\right)+\left|\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]-\frac{1}{n}\sum_{m=1}^{n}\ell(w^{*};d^{m})\right|.$$

We now bound each term. The first and last terms are handled simultaneously: by the generalization error bound (Lemma 7 of [WuSD19]), both are at most $O\left(\frac{\lambda\log\left(\frac{1}{\beta}\right)}{\sqrt{n}}\right)$ with probability greater than $1-\frac{2}{3}\beta$. For the second term, by Corollary 3.4, with probability greater than $1-\frac{1}{3}\beta$ it is bounded by $O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}\right)$. Finally, by the definition of $w^{erm}$, the third term is at most $0$. Therefore, by a union bound, with probability greater than $1-\beta$, $\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]=O\left(\frac{\lambda^{\frac{4}{3}}\log(\frac{np}{\beta})}{(n\sqrt{\rho})^{\frac{2}{3}}}+\frac{\lambda\log\left(\frac{1}{\beta}\right)}{\sqrt{n}}\right)$. ∎

3.2 Privately Learning Ising Models

We first consider the problem of estimating the weight matrix of the Ising model. To be precise, given $n$ i.i.d. samples $\{z^{1},\cdots,z^{n}\}$ generated from an unknown distribution $\mathcal{D}(A,\theta)$, our goal is to design a $\rho$-zCDP estimator $\hat{A}$ such that with probability at least $\frac{2}{3}$, $\max_{i,j\in[p]}\left|A_{i,j}-\hat{A}_{i,j}\right|\leq\alpha$.

A key observation about the Ising model is that, for any node $Z_{i}$, the probability that $Z_{i}=1$ conditioned on the values of the remaining nodes $Z_{-i}$ follows a sigmoid function. The next lemma, from [KlivansM17], formalizes this observation.

Lemma 3.7.

Let $Z\sim\mathcal{D}(A,\theta)$ with $Z\in\{-1,1\}^{p}$. Then for all $i\in[p]$ and all $x\in\{-1,1\}^{[p]\backslash\{i\}}$,

$$\Pr\left(Z_{i}=1|Z_{-i}=x\right)=\sigma\left(\sum_{j\neq i}2A_{i,j}x_{j}+2\theta_{i}\right)=\sigma\left(\langle w,x^{\prime}\rangle\right),$$

where $w=2[A_{i,1},\cdots,A_{i,i-1},A_{i,i+1},\cdots,A_{i,p},\theta_{i}]\in\mathbb{R}^{p}$ and $x^{\prime}=[x,1]\in\mathbb{R}^{p}$.

Proof.

The proof is from [KlivansM17], and we include it here for completeness. According to the definition of the Ising model, after canceling the terms that do not involve $z_{i}$ from the numerator and denominator,

$$\Pr\left(Z_{i}=1|Z_{-i}=x\right)=\frac{\exp\left(\sum_{j\neq i}A_{i,j}x_{j}+\theta_{i}\right)}{\exp\left(\sum_{j\neq i}A_{i,j}x_{j}+\theta_{i}\right)+\exp\left(-\sum_{j\neq i}A_{i,j}x_{j}-\theta_{i}\right)}=\sigma\left(\sum_{j\neq i}2A_{i,j}x_{j}+2\theta_{i}\right).\qquad∎$$
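The identity in Lemma 3.7 is also easy to verify numerically for small $p$ by comparing the sigmoid expression against the conditional probability computed from the unnormalized Ising weights; the sketch below (illustrative only) performs this brute-force check.

    import itertools
    import numpy as np

    def unnormalized_weight(z, A, theta):
        # exp( sum_{i <= j} A_{ij} z_i z_j + sum_i theta_i z_i ), A symmetric with zero diagonal
        return np.exp(0.5 * z @ A @ z + theta @ z)

    def conditional_brute_force(i, x, A, theta):
        # Pr(Z_i = 1 | Z_{-i} = x) by summing over the two completions of coordinate i
        z_plus = np.insert(x, i, 1.0)
        z_minus = np.insert(x, i, -1.0)
        w_plus = unnormalized_weight(z_plus, A, theta)
        w_minus = unnormalized_weight(z_minus, A, theta)
        return w_plus / (w_plus + w_minus)

    def conditional_sigmoid(i, x, A, theta):
        z = np.insert(x, i, 0.0)                 # the value placed at coordinate i is irrelevant
        return 1.0 / (1.0 + np.exp(-(2 * A[i] @ z + 2 * theta[i])))

    rng = np.random.default_rng(1)
    p = 4
    A = rng.normal(size=(p, p)); A = (A + A.T) / 2; np.fill_diagonal(A, 0.0)
    theta = rng.normal(size=p)
    for x in itertools.product([-1.0, 1.0], repeat=p - 1):
        x = np.array(x)
        assert np.isclose(conditional_brute_force(0, x, A, theta),
                          conditional_sigmoid(0, x, A, theta))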

By Lemma 3.7, we can estimate the weight matrix by solving a logistic regression for each node, an approach utilized in [WuSD19] to design non-private estimators. Our algorithm uses the private Frank-Wolfe method to solve the per-node logistic regression problem, achieving the following theoretical guarantee.

Algorithm 2 Privately Learning Ising Models

Input: $n$ samples $\{z^{1},\cdots,z^{n}\}$, where $z^{m}\in\{\pm 1\}^{p}$ for $m\in[n]$; an upper bound $\lambda$ on the width, $\lambda(A,\theta)\leq\lambda$; privacy parameter $\rho$

1: for $i=1$ to $p$:
2:   For all $m\in[n]$, $x^{m}\leftarrow[z_{-i}^{m},1]$, $y^{m}\leftarrow z_{i}^{m}$
3:   $w^{priv}\leftarrow\mathcal{A}_{PFW}(D,\mathcal{L},\rho^{\prime},\mathcal{C})$, where $\rho^{\prime}=\frac{\rho}{p}$, $D=\{(x^{m},y^{m})\}_{m=1}^{n}$, $\mathcal{L}(w;D)=\frac{1}{n}\sum_{m=1}^{n}\log\left(1+e^{-y^{m}\langle w,x^{m}\rangle}\right)$, and $\mathcal{C}=\{w:\|w\|_{1}\leq 2\lambda\}$
4:   For all $j\in[p]\backslash\{i\}$, $\hat{A}_{i,j}\leftarrow\frac{1}{2}w^{priv}_{\tilde{j}}$, where $\tilde{j}=j$ if $j<i$ and $\tilde{j}=j-1$ if $j>i$

Output: $\hat{A}\in\mathbb{R}^{p\times p}$
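Putting the pieces together, a hypothetical end-to-end sketch of Algorithm 2 calls the illustrative private_frank_wolfe and logistic_grad sketches from Section 3.1 once per node with budget $\rho/p$ and reads off the first $p-1$ coordinates of each per-node solution.

    import numpy as np

    def learn_ising_privately(Z, lam, rho, rng):
        # Z: n x p array of +/-1 samples; returns an estimate of the weight matrix A.
        n, p = Z.shape
        A_hat = np.zeros((p, p))
        T = int(((2 * lam) ** 2 * n ** 2 * rho / p) ** (1 / 3)) + 1   # Corollary 3.4 with rho' = rho / p
        for i in range(p):
            X = np.column_stack([np.delete(Z, i, axis=1), np.ones(n)])   # x^m = [z^m_{-i}, 1]
            y = Z[:, i].astype(float)
            w = private_frank_wolfe(lambda w: logistic_grad(w, X, y),
                                    p=p, lam=2 * lam, L1=2.0,
                                    n=n, rho=rho / p, T=T, rng=rng)
            A_hat[i, np.arange(p) != i] = 0.5 * w[:p - 1]                 # drop the bias coordinate
        return A_hat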

Theorem 3.8.

Let $\mathcal{D}(A,\theta)$ be an unknown $p$-variable Ising model with $\lambda(A,\theta)\leq\lambda$. There exists an efficient $\rho$-zCDP algorithm which outputs a weight matrix $\hat{A}\in\mathbb{R}^{p\times p}$ such that, with probability greater than $2/3$, $\max_{i,j\in[p]}\left|A_{i,j}-\hat{A}_{i,j}\right|\leq\alpha$, provided the number of i.i.d. samples satisfies

$$n=\Omega\left(\frac{\lambda^{2}\log(p)e^{12\lambda}}{\alpha^{4}}+\frac{\sqrt{p}\,\lambda^{2}\log^{2}(p)e^{9\lambda}}{\sqrt{\rho}\,\alpha^{3}}\right).$$
Proof.

We first prove that Algorithm 2 satisfies $\rho$-zCDP. In each iteration, the algorithm solves a private sparse logistic regression under $\frac{\rho}{p}$-zCDP; therefore, Algorithm 2 satisfies $\rho$-zCDP by composition (Lemma 2.14).

For the accuracy analysis, we start by looking at the first iteration ($i=1$) and showing that $\left|A_{1,j}-\hat{A}_{1,j}\right|\leq\alpha$ for all $j\in[p]$, with probability greater than $1-\frac{1}{10p}$.

Given a random sample $Z\sim\mathcal{D}(A,\theta)$, let $X=[Z_{-1},1]$ and $Y=Z_{1}$. From Lemma 3.7, $\Pr\left(Y=1|X=x\right)=\sigma\left(\langle w^{*},x\rangle\right)$, where $w^{*}=2[A_{1,2},\cdots,A_{1,p},\theta_{1}]$. We also note that $\|w^{*}\|_{1}\leq 2\lambda$, as a consequence of the width constraint of the Ising model.

Given $n$ i.i.d. samples $\{z^{m}\}_{m=1}^{n}$ drawn from the Ising model, let $x^{m}=[z_{-1}^{m},1]$ and $y^{m}=z_{1}^{m}$; each $(x^{m},y^{m})$ is then a realization of $(X,Y)$. Let $w^{priv}$ be the output of $\mathcal{A}_{PFW}\left(D,\mathcal{L},\frac{\rho}{p},\{w:\|w\|_{1}\leq 2\lambda\}\right)$, where $D=\{(x^{m},y^{m})\}_{m=1}^{n}$. By Theorem 3.6, when $n=O\left(\frac{\sqrt{p}\lambda^{2}\log^{2}(p)}{\sqrt{\rho}\gamma^{\frac{3}{2}}}+\frac{\lambda^{2}\log(p)}{\gamma^{2}}\right)$, with probability greater than $1-\frac{1}{10p}$, $\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{*};(X,Y))\right]\leq\gamma$.

We will use the following lemma from [WuSD19]. Roughly speaking, under the assumption that the samples are generated from an Ising model, any estimator $w^{priv}$ which achieves a small error in the loss $\mathcal{L}$ guarantees an accurate parameter recovery in $\ell_{\infty}$ distance.

Lemma 3.9.

Let $P$ be a distribution on $\{-1,1\}^{p-1}\times\{-1,1\}$. Given $u_{1}\in\mathbb{R}^{p-1}$ and $\theta_{1}\in\mathbb{R}$, suppose $\Pr\left(Y=1|X=x\right)=\sigma\left(\langle u_{1},x\rangle+\theta_{1}\right)$ for $(X,Y)\sim P$. If the marginal distribution of $P$ on $X$ is $\delta$-unbiased, and $\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{1},X\rangle+\theta_{1}\right)}\right)\right]-\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{2},X\rangle+\theta_{2}\right)}\right)\right]\leq\gamma$ for some $u_{2}\in\mathbb{R}^{p-1},\theta_{2}\in\mathbb{R}$, with $\gamma\leq\delta e^{-2\|u_{1}\|_{1}-2|\theta_{1}|-6}$, then $\left\lVert u_{1}-u_{2}\right\rVert_{\infty}=O(e^{\|u_{1}\|_{1}+|\theta_{1}|}\cdot\sqrt{\gamma/\delta})$.

By Lemma 2.6, Lemma 2.7, and Lemma 3.9, if $\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim\mathcal{D}(A,\theta)}\left[\ell(w^{*};(X,Y))\right]\leq O\left(\alpha^{2}e^{-6\lambda}\right)$, then $\left\lVert w^{priv}-w^{*}\right\rVert_{\infty}\leq\alpha$. Substituting $\gamma=\alpha^{2}e^{-6\lambda}$, we conclude that $\left|A_{1,j}-\hat{A}_{1,j}\right|\leq\alpha$ for all $j$ with probability greater than $1-\frac{1}{10p}$. A similar argument works for the other iterations, and non-overlapping parts of the matrix are recovered in different iterations. By a union bound over the $p$ iterations, with probability at least $\frac{2}{3}$, $\max_{i,j\in[p]}\left|A_{i,j}-\hat{A}_{i,j}\right|\leq\alpha$.

Finally, we note that the time complexity of the algorithm is $\operatorname{poly}(n,p)$, since the private Frank-Wolfe algorithm is time efficient by Corollary 3.4. ∎

3.3 Privately Learning Pairwise Graphical Models

Next, we study parameter learning for pairwise graphical models over a general alphabet. Given $n$ i.i.d. samples $\{z^{1},\cdots,z^{n}\}$ drawn from an unknown distribution $\mathcal{D}(\mathcal{W},\Theta)$, we want to design a $\rho$-zCDP estimator $\hat{\mathcal{W}}$ such that with probability at least $\frac{2}{3}$, $\left|W_{i,j}(u,v)-\widehat{W}_{i,j}(u,v)\right|\leq\alpha$ for all $i\neq j\in[p]$ and all $u,v\in[k]$. To facilitate our presentation, we assume that for all $i\neq j\in[p]$, every row (and column) vector of $W_{i,j}$ has zero mean. (This assumption is without loss of generality and is widely used in the literature [KlivansM17, WuSD19]: suppose the $a$-th row of $W_{i,j}$ is not centered, i.e., $\sum_{b}W_{i,j}(a,b)\neq 0$; we can define $W^{\prime}_{i,j}(a,b)=W_{i,j}(a,b)-\frac{1}{k}\sum_{b}W_{i,j}(a,b)$ and $\theta^{\prime}_{i}(a)=\theta_{i}(a)+\frac{1}{k}\sum_{b}W_{i,j}(a,b)$, and the probability distribution remains unchanged.)

Analogous to Lemma 3.7 for the Ising model, a pairwise graphical model has the following property, which can be utilized to recover its parameters.

Lemma 3.10 (Fact 2 of [WuSD19]).

Let $Z\sim\mathcal{D}(\mathcal{W},\Theta)$ with $Z\in[k]^{p}$. For any $i\in[p]$, any $u\neq v\in[k]$, and any $x\in[k]^{p-1}$,

$$\Pr\left(Z_{i}=u|Z_{i}\in\{u,v\},Z_{-i}=x\right)=\sigma\left(\sum_{j\neq i}\left(W_{i,j}(u,x_{j})-W_{i,j}(v,x_{j})\right)+\theta_{i}(u)-\theta_{i}(v)\right).$$

Now we introduce our algorithm. Without loss of generality, we consider estimating $W_{1,j}$ for all $j\in[p]$ as a running example. Fix a pair of values $(u,v)$, where $u,v\in[k]$ and $u\neq v$, and let $S_{u,v}$ be the set of samples with $Z_{1}\in\{u,v\}$. In order to utilize Lemma 3.10, we perform the following transformation on the samples in $S_{u,v}$: for the $m$-th sample $z^{m}$, let $y^{m}=1$ if $z_{1}^{m}=u$, and $y^{m}=-1$ otherwise, and let $x^{m}$ be the one-hot encoding of the vector $[z^{m}_{-1},1]$, where $\text{OneHotEncode}(s)$ maps $[k]^{p}$ to $\mathbb{R}^{p\times k}$ and its $i$-th row is the $t$-th standard basis vector when $s_{i}=t$. Then we define $w^{*}\in\mathbb{R}^{p\times k}$ as follows:

$$w^{*}(j,\cdot)=W_{1,j+1}(u,\cdot)-W_{1,j+1}(v,\cdot),\quad\forall j\in[p-1];$$
$$w^{*}(p,\cdot)=[\theta_{1}(u)-\theta_{1}(v),0,\cdots,0].$$

Lemma 3.10 implies that for every such sample, $\Pr\left(Y^{m}=1\right)=\sigma\left(\langle w^{*},X^{m}\rangle\right)$, where $\langle\cdot,\cdot\rangle$ denotes the entrywise inner product of matrices. According to the definition of the width of $\mathcal{D}(\mathcal{W},\Theta)$, $\|w^{*}\|_{1}\leq\lambda k$. Now we can apply the sparse logistic regression method, as in Algorithm 3, to the samples in $S_{u,v}$.
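A small sketch of this transformation (illustrative helper names; indices are 0-based, and the constant coordinate is encoded with a fixed symbol) might look as follows.

    import numpy as np

    def one_hot_encode(s, k):
        # Map s in [k]^p to a p x k matrix whose i-th row is the standard basis vector e_{s_i}.
        X = np.zeros((len(s), k))
        X[np.arange(len(s)), s] = 1.0
        return X

    def build_pairwise_samples(Z, i, u, v, k):
        # Keep samples with Z_i in {u, v}; label +1 if Z_i = u and -1 if Z_i = v;
        # features are the one-hot encoding of [Z_{-i}, constant].
        S = Z[np.isin(Z[:, i], [u, v])]
        y = np.where(S[:, i] == u, 1.0, -1.0)
        X = np.stack([one_hot_encode(np.append(np.delete(z, i), 0), k) for z in S])
        return X, y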

Suppose $w^{priv}_{u,v}$ is the output of the private Frank-Wolfe algorithm. We define $U_{u,v}\in\mathbb{R}^{p\times k}$ as follows: for all $b\in[k]$,

$$U_{u,v}(j,b)=w^{priv}_{u,v}(j,b)-\frac{1}{k}\sum_{a\in[k]}w^{priv}_{u,v}(j,a),\quad\forall j\in[p-1];$$
$$U_{u,v}(p,b)=w^{priv}_{u,v}(p,b)+\frac{1}{k}\sum_{j\in[p-1]}\sum_{a\in[k]}w^{priv}_{u,v}(j,a).\qquad(1)$$

$U_{u,v}$ can be seen as a “centered” version of $w^{priv}_{u,v}$ (for the first $p-1$ rows). It is not hard to see that $\langle U_{u,v},x\rangle=\langle w^{priv}_{u,v},x\rangle$ for any one-hot encoded $x$, so $U_{u,v}$ attains the same value of the sparse logistic regression objective as $w^{priv}_{u,v}$.

For now, assume that for all $j\in[p-1]$ and $b\in[k]$, $U_{u,v}(j,b)$ is a “good” approximation of $W_{1,j+1}(u,b)-W_{1,j+1}(v,b)$; we will justify this later. Averaging over $v\in[k]$, it can be shown that $\frac{1}{k}\sum_{v\in[k]}U_{u,v}(j,b)$ is then a “good” approximation of $W_{1,j+1}(u,b)$, for all $j\in[p-1]$ and $u,b\in[k]$, because of the centering assumption on $\mathcal{W}$, i.e., $\sum_{v\in[k]}W_{1,j+1}(v,b)=0$ for all $j\in[p-1]$ and $b\in[k]$. With these considerations in mind, we are able to introduce our algorithm.
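The centering step of Equation (1) and the averaging over $v$ that recovers $W_{1,j+1}(u,\cdot)$ can be sketched as follows (illustrative names; here U_by_v collects the centered matrices $U_{u,v}$ over $v$ for a fixed $u$).

    import numpy as np

    def center_rows(w_priv):
        # Equation (1): subtract the mean of each of the first p-1 rows and add the
        # total of those means to the last row; <U, x> = <w_priv, x> for one-hot x.
        U = w_priv.copy()
        row_means = U[:-1].mean(axis=1, keepdims=True)
        U[:-1] -= row_means
        U[-1] += row_means.sum()
        return U

    def recover_row(U_by_v, j, k):
        # W_hat_{1, j+1}(u, :) = (1/k) * sum_v U_{u,v}(j, :), using the zero-mean rows of W.
        return sum(U_by_v[v][j] for v in range(k)) / k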

Algorithm 3 Privately Learning Pairwise Graphical Model

Input: alphabet size $k$; $n$ i.i.d. samples $\{z^{1},\cdots,z^{n}\}$, where $z^{m}\in[k]^{p}$ for $m\in[n]$; an upper bound $\lambda$ on the width, $\lambda(\mathcal{W},\Theta)\leq\lambda$; privacy parameter $\rho$

1: for $i=1$ to $p$:
2:   for each pair $u\neq v\in[k]$:
3:     $S_{u,v}\leftarrow\{z^{m},m\in[n]:z_{i}^{m}\in\{u,v\}\}$
4:     For all $z^{m}\in S_{u,v}$: $x^{m}\leftarrow\text{OneHotEncode}([z^{m}_{-i},1])$; $y^{m}\leftarrow 1$ if $z_{i}^{m}=u$, and $y^{m}\leftarrow-1$ if $z_{i}^{m}=v$
5:     $w^{priv}_{u,v}\leftarrow\mathcal{A}_{PFW}(D,\mathcal{L},\rho^{\prime},\mathcal{C})$, where $\rho^{\prime}=\frac{\rho}{k^{2}p}$, $D=\{(x^{m},y^{m}):z^{m}\in S_{u,v}\}$, $\mathcal{L}(w;D)=\frac{1}{|S_{u,v}|}\sum_{m=1}^{|S_{u,v}|}\log\left(1+e^{-y^{m}\langle w,x^{m}\rangle}\right)$, and $\mathcal{C}=\{w:\|w\|_{1}\leq 2\lambda k\}$
6:     Define $U_{u,v}\in\mathbb{R}^{p\times k}$ by centering the first $p-1$ rows of $w^{priv}_{u,v}$, as in Equation (1)
7:   for $j\in[p]\backslash\{i\}$ and $u\in[k]$:
8:     $\widehat{W}_{i,j}(u,:)\leftarrow\frac{1}{k}\sum_{v\in[k]}U_{u,v}(\tilde{j},:)$, where $\tilde{j}=j$ if $j<i$ and $\tilde{j}=j-1$ if $j>i$

Output: $\widehat{W}_{i,j}\in\mathbb{R}^{k\times k}$ for all $i\neq j\in[p]$

The following theorem is the main result of this section. Its proof is structurally similar to that of Theorem 3.8.

Theorem 3.11.

Let $\mathcal{D}(\mathcal{W},\Theta)$ be an unknown $p$-variable pairwise graphical model with width $\lambda(\mathcal{W},\Theta)\leq\lambda$. There exists an efficient $\rho$-zCDP algorithm which outputs $\widehat{W}$ such that with probability greater than $2/3$, $\left|W_{i,j}(u,v)-\widehat{W}_{i,j}(u,v)\right|\leq\alpha$ for all $i\neq j\in[p]$ and all $u,v\in[k]$, provided the number of i.i.d. samples satisfies

$$n=\Omega\left(\frac{\lambda^{2}k^{5}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\,\lambda^{2}k^{5.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\,\alpha^{3}}\right).$$
Proof.

We consider estimating $W_{1,j}$ for all $j\in[p]$ as an example. Fixing one pair $(u,v)$, let $S_{u,v}$ be the set of samples whose first element is either $u$ or $v$, and let $n^{u,v}$ be the number of samples in $S_{u,v}$. We perform the following transformation on the samples in $S_{u,v}$: for a sample $Z$, let $Y=1$ if $Z_{1}=u$ and $Y=-1$ otherwise, and let $X$ be the one-hot encoding of the vector $[Z_{-1},1]$.

Suppose the underlying joint distribution of $X$ and $Y$ is $P$, i.e., $(X,Y)\sim P$. Then by Theorem 3.6, when $n^{u,v}=O\left(\frac{\lambda^{2}k^{2}\log^{2}(pk)}{\gamma^{2}}+\frac{\sqrt{p}\,\lambda^{2}k^{3}\log^{2}(pk)}{\gamma^{\frac{3}{2}}\sqrt{\rho}}\right)$, with probability greater than $1-\frac{1}{10pk^{2}}$,

$$\mathbb{E}_{(X,Y)\sim P}\left[\ell(U_{u,v};(X,Y))\right]-\mathbb{E}_{(X,Y)\sim P}\left[\ell(w^{*};(X,Y))\right]\leq\gamma.$$

The following lemma from [WuSD19] is analogous to Lemma 3.9 for the Ising model.

Lemma 3.12.

Let $\mathcal{D}$ be a $\delta$-unbiased distribution on $[k]^{p-1}$. For $Z\sim\mathcal{D}$, let $X$ denote the one-hot encoding of $Z$. Let $u_{1},u_{2}\in\mathbb{R}^{(p-1)\times k}$ be two matrices with $\sum_{a}u_{1}(i,a)=0$ and $\sum_{a}u_{2}(i,a)=0$ for all $i\in[p-1]$. Let $P$ be a distribution such that, given $u_{1}$ and $\theta_{1}\in\mathbb{R}$, $\Pr\left(Y=1|X=x\right)=\sigma\left(\langle u_{1},x\rangle+\theta_{1}\right)$ for $(X,Y)\sim P$. Suppose $\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{1},X\rangle+\theta_{1}\right)}\right)\right]-\mathbb{E}_{(X,Y)\sim P}\left[\log\left(1+e^{-Y\left(\langle u_{2},X\rangle+\theta_{2}\right)}\right)\right]\leq\gamma$ for some $\theta_{2}\in\mathbb{R}$, with $\gamma\leq\delta e^{-2\left\lVert u_{1}\right\rVert_{\infty,1}-2|\theta_{1}|-6}$. Then $\left\lVert u_{1}-u_{2}\right\rVert_{\infty}=O(e^{\left\lVert u_{1}\right\rVert_{\infty,1}+|\theta_{1}|}\cdot\sqrt{\gamma/\delta})$.

By Lemma 2.6, Lemma 2.7 and Lemma 3.12, if we substitute γ=e6λα2k\gamma=\frac{e^{-6\lambda}\alpha^{2}}{k}, when nu,v=O(λ2k4log(pk)eO(λ)α4+pλ2k4.5log2(pk)eO(λ)ρα3),n^{u,v}=O{\left({\frac{\lambda^{2}k^{4}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\lambda^{2}k^{4.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\alpha^{3}}}\right)},

|W1,j(u,b)W1,j(v,b)Uu,v(j,b)|α,j[p1],b[k].\displaystyle\left|W_{1,j}(u,b)-W_{1,j}(v,b)-U^{u,v}(j,b)\right|\leq\alpha,\forall j\in[p-1],\forall b\in[k]. (2)

By a union bound, Equation (2) holds for all (u,v)(u,v) pairs simultaneously with probability greater than 1110p1-\frac{1}{10p}. If we sum over v[k]v\in[k] and use the fact that j,b,v[k]W1,j(v,b)=0\forall j,b,\sum_{v\in[k]}W_{1,j}(v,b)=0, we have

|W1,j(u,b)1kv[k]Uu,v(j,b)|α,j[p1],u,b[k].\left|W_{1,j}(u,b)-\frac{1}{k}\sum_{v\in[k]}U^{u,v}(j,b)\right|\leq\alpha,\forall j\in[p-1],\forall u,b\in[k].
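In more detail, adopting the convention U^{u,u}\equiv 0 (under which Equation (2) holds trivially for v=u), the triangle inequality gives

\left|W_{1,j}(u,b)-\frac{1}{k}\sum_{v\in[k]}U^{u,v}(j,b)\right|=\frac{1}{k}\left|\sum_{v\in[k]}{\left({W_{1,j}(u,b)-W_{1,j}(v,b)-U^{u,v}(j,b)}\right)}\right|\leq\frac{1}{k}\sum_{v\in[k]}\alpha=\alpha,

where the equality uses the zero-sum property \sum_{v\in[k]}W_{1,j}(v,b)=0.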

Note that we need to guarantee that we obtain nu,vn^{u,v} samples for each pair (u,v)(u,v). Since 𝒟(𝒲,Θ){\cal D}({\cal W},\Theta) is δ\delta-unbiased, given Z𝒟(𝒲,Θ)Z\sim{\cal D}({\cal W},\Theta), for all uvu\neq v, Pr(ZSu,v)2δ\Pr{\left({Z\in S_{u,v}}\right)}\geq 2\delta. By Hoeffding’s inequality, when n=O(nu,vδ+log(pk2)δ2)n=O{\left({\frac{n^{u,v}}{\delta}+\frac{\log(pk^{2})}{\delta^{2}}}\right)}, with probability greater than 1110p1-\frac{1}{10p}, we have enough samples for all (u,v)(u,v) pairs simultaneously. Substituting δ=e6λk\delta=\frac{e^{-6\lambda}}{k}, we have

n=O(λ2k5log(pk)eO(λ)α4+pλ2k5.5log2(pk)eO(λ)ρα3).n=O{\left({\frac{\lambda^{2}k^{5}\log(pk)e^{O(\lambda)}}{\alpha^{4}}+\frac{\sqrt{p}\lambda^{2}k^{5.5}\log^{2}(pk)e^{O(\lambda)}}{\sqrt{\rho}\alpha^{3}}}\right)}.

The same argument holds for the other entries of 𝒲{\cal W}. We conclude the proof by a union bound over the pp iterations.

Finally, we note that the time complexity of the algorithm is poly(n,p)\operatorname*{poly}(n,p) since the private Frank-Wolfe algorithm is time efficient by Corollary 3.4. ∎

4 Privately Learning Binary tt-wise MRFs

Let 𝒟{\cal D} be a tt-wise MRF on {1,1}p\{1,-1\}^{p} with underlying dependency graph GG and factorization polynomial h(x)=ICt(G)hI(x)h(x)=\sum_{I\in C_{t}(G)}h_{I}(x). We assume that the width of 𝒟{\cal D} is bounded by λ\lambda, i.e., maxi[p]\normoneihλ\max_{i\in[p]}\normone{\partial_{i}h}\leq\lambda, where \normonehICt(G)|h¯(I)|\normone{h}\coloneqq\sum_{I\in C_{t}(G)}\left|\bar{h}(I)\right|. Similar to [KlivansM17], given nn i.i.d. samples {z1,,zn}\{z^{1},\cdots,z^{n}\} generated from an unknown distribution 𝒟{\cal D}, we consider the following two related learning objectives, under the constraint of ρ\rho-zCDP:

  1. find a multilinear polynomial uu such that with probability greater than 23\frac{2}{3}, \normonehuICt(G)|h¯(I)u¯(I)|α\normone{h-u}\coloneqq\sum_{I\in C_{t}(G)}\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha;

  2. find a multilinear polynomial uu such that with probability greater than 23\frac{2}{3}, for every maximal monomial II of hh, |h¯(I)u¯(I)|α\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha.

We note that our first objective can be viewed as parameter estimation in 1\ell_{1} distance, where only an average performance guarantee is provided. In the second objective, the algorithm recovers every maximal monomial, which can be viewed as parameter estimation in \ell_{\infty} distance. These two objectives are addressed in Sections 4.1 and 4.2, respectively.

4.1 Parameter Estimation in 1\ell_{1} Distance

The following property of MRFs, from [KlivansM17], plays a critical role in our algorithm. The proof is similar to that of Lemma 3.7.

Lemma 4.1 (Lemma 7.6 of [KlivansM17]).

Let 𝒟{\cal D} be a tt-wise MRF on {1,1}p\{1,-1\}^{p} with underlying dependency graph GG and factorization polynomial h(x)=ICt(G)hI(x)h(x)=\sum_{I\in C_{t}(G)}h_{I}(x), then

Pr(Zi=1|Zi=x)=σ(2ih(x)),i[p],x{1,1}[p]\i.\Pr{\left({Z_{i}=1|Z_{-i}=x}\right)}=\sigma(2\partial_{i}h(x)),\forall i\in[p],\forall x\in\{1,-1\}^{[p]\backslash i}.

Lemma 4.1 shows that, as for pairwise graphical models, it suffices to learn the parameters of a binary tt-wise MRF via sparse logistic regression.
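Concretely, writing \partial_{i}h(x)=\sum_{I\subset[p]\backslash i,|I|\leq t-1}\overline{\partial_{i}h}(I)\prod_{j\in I}x_{j}, Lemma 4.1 says that for y\in\{1,-1\},

\Pr\left(Z_{i}=y\,|\,Z_{-i}=x\right)=\sigma\left(2y\,\partial_{i}h(x)\right)=\sigma\left(y\langle w^{*},\tilde{x}\rangle\right),

where \tilde{x}=\left(\prod_{j\in I}x_{j}\right)_{I\subset[p]\backslash i,|I|\leq t-1} and w^{*}=2\,\overline{\partial_{i}h}. This is exactly the sparse logistic regression problem, over the monomial feature map, that Algorithm 4 below solves privately for each node ii.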

Algorithm 4: Privately learning a binary tt-wise MRF in 1\ell_{1} distance

Input: nn i.i.d. samples {z1,,zn}\{z^{1},\cdots,z^{n}\}, where zm{±1}pz^{m}\in\{\pm 1\}^{p} for m[n]m\in[n]; an upper bound λ\lambda on maxi[p]\normoneih\max_{i\in[p]}\normone{\partial_{i}h}; privacy parameter ρ\rho

For i=1i=1 to pp:

  1: m[n]\forall m\in[n], xm[jIzjm:I[p\i],|I|t1]x_{m}\leftarrow{\left[{\prod_{j\in I}z_{j}^{m}:I\subset[p\backslash i],\left|I\right|\leq t-1}\right]}, ymzimy_{m}\leftarrow z_{i}^{m}

  2: wpriv𝒜(D,,ρ,𝒞)w^{priv}\leftarrow{\cal A}(D,{\cal L},\rho^{\prime},{\cal C}), where D={(xm,ym)}m=1nD=\{{\left({x_{m},y_{m}}\right)}\}_{m=1}^{n}, (w;d)=log(1+eyw,x)\ell(w;d)=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}, 𝒞={\normonew2λ}{\cal C}=\{\normone{w}\leq 2\lambda\}, and ρ=ρp\rho^{\prime}=\frac{\rho}{p}

  3: For each I[p\i]I\subset[p\backslash i] with |I|t1\left|I\right|\leq t-1: if argmin(Ii)=i\arg\min(I\cup i)=i, set u¯(I{i})=12wpriv(I)\bar{u}(I\cup\{i\})=\frac{1}{2}w^{priv}(I)

Output: u¯(I):ICt(Kp)\bar{u}(I):I\in C_{t}(K_{p}), where KpK_{p} is the pp-dimensional complete graph
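To make the loop structure concrete, here is a schematic Python rendering of Algorithm 4. The private sparse logistic regression subroutine (the role played by 𝒜, e.g. a private Frank-Wolfe solver over the \ell_{1}-ball of radius 2λ) is taken as a black box private_logreg that we assume is available; the clique-indexed dictionary output is an implementation choice of this sketch, not of the pseudocode.

from itertools import combinations
import numpy as np

def learn_twise_mrf_l1(samples, t, lam, rho, private_logreg):
    """Schematic version of Algorithm 4.

    samples        : (n, p) numpy array with entries in {-1, +1}
    private_logreg : assumed rho'-zCDP sparse logistic regression black box
                     (plays the role of the subroutine A in the pseudocode);
                     assumed to return a weight vector aligned with the columns of X
    Returns a dict mapping cliques (frozensets of vertices) to estimates u_bar(I).
    """
    n, p = samples.shape
    u = {}
    for i in range(p):
        others = [j for j in range(p) if j != i]
        # all monomials prod_{j in I} z_j with I subset of [p]\{i}, |I| <= t-1
        index = [frozenset(I) for size in range(t)
                 for I in combinations(others, size)]
        X = np.array([[np.prod(z[list(I)]) for I in index] for z in samples])
        y = samples[:, i]
        # each of the p regressions receives budget rho/p, so the loop is rho-zCDP
        w_priv = private_logreg(X, y, l1_radius=2 * lam, rho=rho / p)
        for coeff, I in zip(w_priv, index):
            if all(i < j for j in I):   # keep the clique I U {i} only when i = argmin
                u[I | frozenset({i})] = 0.5 * coeff
    return u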

Theorem 4.2.

There exists a ρ\rho-zCDP algorithm which, with probability at least 2/32/3, finds a multilinear polynomial uu such that \normonehuα,\normone{h-u}\leq\alpha, given nn i.i.d. samples Z1,,Zn𝒟Z^{1},\cdots,Z^{n}\sim{\cal D}, where

n=O((2t)O(t)eO(λt)p4tlog(p)α4+(2t)O(t)eO(λt)p3t+12log2(p)ρα3).n=O{\left({\frac{(2t)^{O(t)}e^{O(\lambda t)}\cdot p^{4t}\cdot\log(p)}{\alpha^{4}}+\frac{(2t)^{O(t)}e^{O(\lambda t)}\cdot p^{3t+\frac{1}{2}}\cdot\log^{2}(p)}{\sqrt{\rho}\alpha^{3}}}\right)}.
Proof.

Similar to the previous proof, we start by fixing i=1i=1. Given a random sample Z𝒟Z\sim{\cal D}, let X=[jIZj:I[p]\1,|I|t1]X={\left[{\prod_{j\in I}Z_{j}:I\subset[p]\backslash 1,\left|I\right|\leq t-1}\right]} and Y=Z1Y=Z_{1}. According to Lemma 4.1, we know that Pr(Y=1|X)=σ(w,X)\Pr{\left({Y=1|X}\right)}=\sigma{\left({\langle w^{*},X\rangle}\right)}, where w=(21h¯(I):I[p]\1,|I|t1)w^{*}={\left({2\cdot\overline{\partial_{1}h}(I):I\subset[p]\backslash 1,\left|I\right|\leq t-1}\right)}. Furthermore, \normonew2λ\normone{w^{*}}\leq 2\lambda by the width constraint. Now, given nn i.i.d. samples {zm}m=1n\{z^{m}\}_{m=1}^{n} drawn from 𝒟{\cal D}, it is easy to check that for any given zmz^{m}, its corresponding (xm,ym)(x^{m},y^{m}) is one realization of (X,Y)(X,Y). Let wprivw^{priv} be the output of 𝒜(D,,ρp,{w:\normonew2λ}){\cal A}{\left({D,{\cal L},\frac{\rho}{p},\{w:\normone{w}\leq 2\lambda\}}\right)}, where D={(xm,ym)}m=1nD=\{(x^{m},y^{m})\}_{m=1}^{n} and (w;(x,y))=log(1+eyw,x)\ell(w;(x,y))=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}. By Lemma 3.6, 𝔼Z𝒟[(wpriv;(X,Y))]𝔼Z𝒟[(w;(X,Y))]γ\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{*};(X,Y))\right]\leq\gamma with probability greater than 1110p1-\frac{1}{10p}, assuming n=Ω(pλ2log2(p)ργ32+λ2log(p)γ2)n=\Omega{\left({\frac{\sqrt{p}\lambda^{2}\log^{2}(p)}{\sqrt{\rho}\gamma^{\frac{3}{2}}}+\frac{\lambda^{2}\log(p)}{\gamma^{2}}}\right)}.

Now we need the following lemma from [KlivansM17], which is analogous to Lemma 3.7 for the Ising model.

Lemma 4.3 (Lemma 6.4 of [KlivansM17]).

Let PP be a distribution on {1,1}p1×{1,1}\{-1,1\}^{p-1}\times\{-1,1\}. Suppose that, for a multilinear polynomial u1u_{1} in p1p-1 variables, Pr(Y=1|X=x)=σ(u1(x))\Pr{\left({Y=1|X=x}\right)}=\sigma{\left({u_{1}(x)}\right)} for (X,Y)P(X,Y)\sim P. Suppose further that the marginal distribution of PP on XX is δ\delta-unbiased, and that 𝔼(X,Y)P[log(1+eY(u1(X)))]𝔼(X,Y)P[log(1+eY(u2(X)))]γ\mathbb{E}_{(X,Y)\sim P}\left[\log{\left({1+e^{-Y{\left({u_{1}(X)}\right)}}}\right)}\right]-\mathbb{E}_{(X,Y)\sim P}\left[\log{\left({1+e^{-Y{\left({u_{2}(X)}\right)}}}\right)}\right]\leq\gamma for another multilinear polynomial u2u_{2}, where γδte2\normoneu16\gamma\leq\delta^{t}e^{-2\normone{u_{1}}-6}. Then \normoneu1u2=O((2t)te\normoneu1γ/δt(pt)).\normone{u_{1}-u_{2}}=O{\left({(2t)^{t}e^{\normone{u_{1}}}\cdot\sqrt{\gamma/\delta^{t}}\cdot{p\choose t}}\right)}.

By substituting γ=eO(λt)(2t)O(t)p3tα2\gamma=e^{-O(\lambda t)}\cdot(2t)^{-O(t)}\cdot p^{-3t}\cdot\alpha^{2}, we have that with probability greater than 1110p1-\frac{1}{10p}, I:argminI=1|u¯(I)h¯(I)|αp\sum_{I:\arg\min I=1}\left|\bar{u}(I)-\bar{h}(I)\right|\leq\frac{\alpha}{p}. We note that iteration ii recovers exactly the coefficients of the monomials whose minimum element is ii, so every monomial is handled exactly once. Therefore, by a union bound over the pp iterations, we prove the desired result. ∎

4.2 Parameter Estimation in \ell_{\infty} Distance

In this section, we introduce a slightly modified version of the algorithm in the last section.

Algorithm 5: Privately learning a binary tt-wise MRF in \ell_{\infty} distance

Input: n=n1+n2n=n_{1}+n_{2} i.i.d. samples {z1,,zn}\{z^{1},\cdots,z^{n}\}, where zm{±1}pz^{m}\in\{\pm 1\}^{p} for m[n]m\in[n]; an upper bound λ\lambda on maxi[p]\normoneih\max_{i\in[p]}\normone{\partial_{i}h}; privacy parameter ρ\rho

  1: For each I[p]I\subset[p] with |I|t\left|I\right|\leq t, let Q(I)1n2m=n1+1n1+n2jIzjmQ(I)\coloneqq\frac{1}{n_{2}}\sum_{m=n_{1}+1}^{n_{1}+n_{2}}\prod_{j\in I}z_{j}^{m}

  2: Compute Q^(I)\hat{Q}(I), an estimate of Q(I)Q(I), through a ρ/2\rho/2-zCDP query release algorithm (PMW [HardtR10] or sepFEM [VietriTBSW20])

  3: For i=1i=1 to pp:

    4: m[n1]\forall m\in[n_{1}], xm[jIzjm:I[p\i],|I|t1]x_{m}\leftarrow{\left[{\prod_{j\in I}z^{m}_{j}:I\subset[p\backslash i],\left|I\right|\leq t-1}\right]}, ymzimy_{m}\leftarrow z_{i}^{m}

    5: wpriv𝒜(D,,ρ,𝒞)w^{priv}\leftarrow{\cal A}(D,{\cal L},\rho^{\prime},{\cal C}), where D={(xm,ym)}m=1n1D=\{{\left({x_{m},y_{m}}\right)}\}_{m=1}^{n_{1}}, (w;d)=log(1+eyw,x)\ell(w;d)=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}, 𝒞={\normonew2λ}{\cal C}=\{\normone{w}\leq 2\lambda\}, and ρ=ρ2p\rho^{\prime}=\frac{\rho}{2p}

    6: Define a polynomial vi:p1v_{i}:\mathbb{R}^{p-1}\rightarrow\mathbb{R} by setting vi¯(I)=12wpriv(I)\bar{v_{i}}(I)=\frac{1}{2}w^{priv}(I) for all I[p\i]I\subset[p\backslash i]

    7: For each I[p\i]I\subset[p\backslash i] with |I|t1\left|I\right|\leq t-1: if argmin(Ii)=i\arg\min(I\cup i)=i, set u¯(I{i})=I[p]Ivi¯(I)Q^(I)\bar{u}(I\cup\{i\})=\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{i}}(I^{\prime})\cdot\hat{Q}(I^{\prime})

Output: u¯(I):ICt(Kp)\bar{u}(I):I\in C_{t}(K_{p}), where KpK_{p} is the pp-dimensional complete graph
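For step 7, it helps to spell out the plug-in computation. The sketch below (same schematic Python style as before) assumes the usual convention for discrete derivatives of a multilinear polynomial, namely \overline{\partial_{I}v}(I')=\bar{v}(I\cup I') for I' disjoint from I, and assumes Q̂ is available as a dictionary over subsets of [p] that contains the empty set with value 1; both conventions are ours rather than part of the pseudocode.

def derivative_coeffs(v_coeffs, I):
    """Coefficients of the iterated discrete derivative d_I v of a multilinear
    polynomial v, with v stored as {frozenset of variables: coefficient}."""
    I = frozenset(I)
    return {J - I: c for J, c in v_coeffs.items() if I <= J}

def plug_in_estimate(v_coeffs, I, Q_hat):
    """Step 7 of Algorithm 5: sum over I' of (d_I v_i)-bar(I') * Q_hat(I')."""
    return sum(c * Q_hat.get(J, 0.0)
               for J, c in derivative_coeffs(v_coeffs, I).items())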

We first show that if the estimates Q^\hat{Q} for the parity queries QQ are sufficiently accurate, Algorithm 5 solves the \ell_{\infty} estimation problem, as long as the sample size n1n_{1} is large enough.

Lemma 4.4.

Suppose that the estimates Q^\hat{Q} satisfy |Q^(I)Q(I)|α/(8λ)|\hat{Q}(I)-Q(I)|\leq\alpha/(8\lambda) for all I[p]I\subset[p] with |I|t|I|\leq t, and that n2=Ω(λ2tlog(p)/α2)n_{2}=\Omega(\lambda^{2}t\log(p)/\alpha^{2}). Then with probability at least 3/43/4, Algorithm 5 outputs a multilinear polynomial uu such that for every maximal monomial II of hh, |h¯(I)u¯(I)|α,\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha, given nn i.i.d. samples Z1,,Zn𝒟Z^{1},\cdots,Z^{n}\sim{\cal D}, as long as

n1=Ω(e5λtplog2(p)ρα92+e6λtlog(p)α6).n_{1}=\Omega{\left({\frac{e^{5\lambda t}\cdot\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{e^{6\lambda t}\cdot\log(p)}{\alpha^{6}}}\right)}.
Proof.

We will condition on the event that Q^\hat{Q} is a “good” estimate of QQ: |Q^(I)Q(I)|α/(8λ)|\hat{Q}(I)-Q(I)|\leq\alpha/(8\lambda) for all I[p]I\subset[p] such that |I|t|I|\leq t. Let us fix i=1i=1. Let X=[jIZj:I[p]\{1},|I|t1]X={\left[{\prod_{j\in I}Z_{j}:I\subset[p]\backslash\{1\},\left|I\right|\leq t-1}\right]}, Y=Z1Y=Z_{1}, and we know that Pr(Y=1|X)=σ(w,X)\Pr{\left({Y=1|X}\right)}=\sigma{\left({\langle w^{*},X\rangle}\right)}, where w=(21h¯(I):I[p]\1,|I|t1)w^{*}={\left({2\cdot\overline{\partial_{1}h}(I):I\subset[p]\backslash 1,\left|I\right|\leq t-1}\right)}. Now given n1n_{1} i.i.d. samples {zm}m=1n1\{z^{m}\}_{m=1}^{n_{1}} drawn from 𝒟{\cal D}, let wprivw^{priv} be the output of 𝒜(D,,ρ2p,{w:\normonew2λ}){\cal A}{\left({D,{\cal L},\frac{\rho}{2p},\{w:\normone{w}\leq 2\lambda\}}\right)}, where D={(xm,ym)}m=1n1D=\{(x^{m},y^{m})\}_{m=1}^{n_{1}} and (w;(x,y))=log(1+eyw,x)\ell(w;(x,y))=\log{\left({1+e^{-y\langle w,x\rangle}}\right)}. Similarly, with probability at least 1110p1-\frac{1}{10p},

𝔼Z𝒟[(wpriv;(X,Y))]𝔼Z𝒟[(w;(X,Y))]γ\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{*};(X,Y))\right]\leq\gamma

as long as n1=Ω(pλ2log2(p)ργ32+λ2log(p)γ2)n_{1}=\Omega{\left({\frac{\sqrt{p}\lambda^{2}\log^{2}(p)}{\sqrt{\rho}\gamma^{\frac{3}{2}}}+\frac{\lambda^{2}\log(p)}{\gamma^{2}}}\right)}.

Now we utilize Lemma 6.4 from [KlivansM17], which states that if 𝔼Z𝒟[(wpriv;(X,Y))]𝔼Z𝒟[(w;(X,Y))]γ\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{priv};(X,Y))\right]-\mathbb{E}_{Z\sim{\cal D}}\left[\ell(w^{*};(X,Y))\right]\leq\gamma, then for a random sample XX and any maximal monomial I[p]\{1}I\subset[p]\backslash\{1\} of 1h\partial_{1}h,

Pr(|1h¯(I)Iv1(X)|α4)<O(γe3λtα2).\Pr{\left({\left|\overline{\partial_{1}h}(I)-\partial_{I}v_{1}(X)\right|\geq\frac{\alpha}{4}}\right)}<O{\left({\frac{\gamma\cdot e^{3\lambda t}}{\alpha^{2}}}\right)}.

Substituting γ=e3λtα38λ\gamma=\frac{e^{-3\lambda t}\cdot\alpha^{3}}{8\lambda}, we have Pr(|1h¯(I)Iv1(X)|α4)<α8λ\Pr{\left({\left|\overline{\partial_{1}h}(I)-\partial_{I}v_{1}(X)\right|\geq\frac{\alpha}{4}}\right)}<\frac{\alpha}{8\lambda}, as long as n1=Ω(pe5λtlog2(p)ρα92+e6λtlog(p)α6)n_{1}=\Omega{\left({\frac{\sqrt{p}e^{5\lambda t}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{e^{6\lambda t}\log(p)}{\alpha^{6}}}\right)}. Accordingly, for any maximal monomial II, |𝔼[Iv1(X)]1h¯(I)|𝔼[|Iv1(X)1h¯(I)|]α4+2λα8λ=α2\left|\mathbb{E}\left[\partial_{I}v_{1}(X)\right]-\overline{\partial_{1}h}(I)\right|\leq\mathbb{E}\left[\left|\partial_{I}v_{1}(X)-\overline{\partial_{1}h}(I)\right|\right]\leq\frac{\alpha}{4}+2\lambda\cdot\frac{\alpha}{8\lambda}=\frac{\alpha}{2}. By Hoeffding's inequality, given n2=Ω(λ2tlogpα2)n_{2}=\Omega{\left({\frac{\lambda^{2}t\log{p}}{\alpha^{2}}}\right)}, for each maximal monomial II, with probability greater than 11pt1-\frac{1}{p^{t}}, |1n2m=1n2Iv1(Xm)𝔼[Iv1(X)]|α4\left|\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})-\mathbb{E}\left[\partial_{I}v_{1}(X)\right]\right|\leq\frac{\alpha}{4}. Note that 1n2m=1n2Iv1(Xm)=I[p]Iv1¯(I)Q(I)\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})=\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot Q(I^{\prime}) and |Q(I)Q^(I)|α8λ\left|Q(I^{\prime})-\hat{Q}(I^{\prime})\right|\leq\frac{\alpha}{8\lambda} for every II^{\prime}, so |1n2m=1n2Iv1(Xm)I[p]Iv1¯(I)Q^(I)|\normoneIv1α8λα8\left|\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})-\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot\hat{Q}(I^{\prime})\right|\leq\normone{\partial_{I}v_{1}}\cdot\frac{\alpha}{8\lambda}\leq\frac{\alpha}{8}, since \normoneIv1\normonev1λ\normone{\partial_{I}v_{1}}\leq\normone{v_{1}}\leq\lambda. Therefore,

|I[p]Iv1¯(I)Q^(I)1h¯(I)|\displaystyle\left|\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot\hat{Q}(I^{\prime})-\overline{\partial_{1}h}(I)\right|
\displaystyle\leq |I[p]Iv1¯(I)Q^(I)1n2m=1n2Iv1(Xm)|+|1n2m=1n2Iv1(Xm)𝔼[Iv1(X)]|\displaystyle\left|\sum_{I^{\prime}\subset[p]}\overline{\partial_{I}v_{1}}(I^{\prime})\cdot\hat{Q}(I^{\prime})-\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})\right|+\left|\frac{1}{n_{2}}\sum_{m=1}^{n_{2}}\partial_{I}v_{1}(X_{m})-\mathbb{E}\left[\partial_{I}v_{1}(X)\right]\right|
+\displaystyle+ |𝔼[Iv1(X)]1h¯(I)|\displaystyle\left|\mathbb{E}\left[\partial_{I}v_{1}(X)\right]-\overline{\partial_{1}h}(I)\right|
\displaystyle\leq α8+α4+α2=7α8.\displaystyle\frac{\alpha}{8}+\frac{\alpha}{4}+\frac{\alpha}{2}=\frac{7\alpha}{8}.

Finally, by a union bound over pp iterations and all the maximal monomials, we prove the desired results.∎

We now consider two private algorithms for releasing the parity queries. The first algorithm is called Private Multiplicative Weights (PMW) [HardtR10], which provides a better accuracy guarantee but runs in time exponential in the dimension pp. The following lemma can be viewed as a zCDP version of Theorem 4.3 in [Vadhan17], by noting that during the analysis, every iteration satisfies ε0\varepsilon_{0}-DP, which in turn satisfies 12ε02\frac{1}{2}\varepsilon_{0}^{2}-zCDP, and by replacing the strong composition theorem of (ε,δ)(\varepsilon,\delta)-DP with the composition theorem of zCDP (Lemma 2.14).

Lemma 4.5 (Sample complexity of PMW, modification of Theorem 4.3 of [Vadhan17]).

The PMW algorithm satisfies ρ\rho-zCDP and releases Q^\hat{Q} such that with probability greater than 1920\frac{19}{20}, for all I[p]I\subset[p] with |I|t\left|I\right|\leq t, |Q^(I)Q(I)|α8λ\left|\hat{Q}(I)-Q(I)\right|\leq\frac{\alpha}{8\lambda} as long as the size of the data set

n2=Ω(tλ2plogpρα2).n_{2}=\Omega{\left({\frac{t\lambda^{2}\cdot\sqrt{p}\log{p}}{\sqrt{\rho}\alpha^{2}}}\right)}.

The second algorithm sepFEM (Separator-Follow-the-perturbed-leader with exponential mechanism) has slightly worse sample complexity, but runs in polynomial time when it has access to an optimization oracle 𝒪\mathcal{O} that does the following: given as input a weighted dataset (I1,w1),,(Im,wm)2[p]×(I_{1},w_{1}),\ldots,(I_{m},w_{m})\in 2^{[p]}\times\mathbb{R}, find x{±1}px\in\{\pm 1\}^{p} that attains

maxx{±1}pi=1mwijIixj.\max_{x\in\{\pm 1\}^{p}}\sum_{i=1}^{m}w_{i}\prod_{j\in I_{i}}x_{j}.

The oracle 𝒪\mathcal{O} essentially solves cost-sensitive classification problems over the set of parity functions [ZLA03], and it can be implemented with an integer program solver [VietriTBSW20, GaboardiAHRW14].
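For reference, the oracle's specification can also be written out directly in code. The brute-force search below is only meant to make the objective concrete; it runs in time exponential in pp, whereas [VietriTBSW20, GaboardiAHRW14] implement 𝒪 with an integer program solver.

from itertools import product
from math import prod

def parity_oracle_bruteforce(weighted_parities, p):
    """weighted_parities: list of pairs (I, w) with I a set of indices in range(p).
    Returns x in {-1,+1}^p maximizing sum_i w_i * prod_{j in I_i} x_j."""
    best_x, best_val = None, float("-inf")
    for x in product((-1, 1), repeat=p):
        val = sum(w * prod(x[j] for j in I) for I, w in weighted_parities)
        if val > best_val:
            best_x, best_val = x, val
    return best_x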

Lemma 4.6 (Sample complexity of sepFEM, [VietriTBSW20]).

The sepFEM algorithm satisfies ρ\rho-zCDP and releases Q^\hat{Q} such that with probability greater than 1920\frac{19}{20}, for all I[p]I\subset[p] with |I|t\left|I\right|\leq t, |Q^(I)Q(I)|α8λ\left|\hat{Q}(I)-Q(I)\right|\leq\frac{\alpha}{8\lambda} as long as the size of the data set

n2=Ω(tλ2p5/4logpρα2)n_{2}=\Omega{\left({\frac{t\lambda^{2}\cdot{p^{5/4}}\log{p}}{\sqrt{\rho}\alpha^{2}}}\right)}

The algorithm runs in polynomial time given access to the optimization oracle 𝒪\mathcal{O} defined above.

Now we can combine Lemmas 4.4, 4.5, and 4.6 to state the formal guarantee of Algorithm 5.

Theorem 4.7.

Algorithm 5 is a ρ\rho-zCDP algorithm which, with probability at least 2/32/3, finds a multilinear polynomial uu such that for every maximal monomial II of hh, |h¯(I)u¯(I)|α,\left|\bar{h}(I)-\bar{u}(I)\right|\leq\alpha, given nn i.i.d. samples Z1,,Zn𝒟Z^{1},\cdots,Z^{n}\sim{\cal D}, and

  1. if it uses PMW for releasing Q^\hat{Q}, it has a sample complexity of

     n=O(e5λtplog2(p)ρα92+tλ2plogpρα2+e6λtlog(p)α6)n=O{\left({\frac{e^{5\lambda t}\cdot\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}\cdot\sqrt{p}\log{p}}{\sqrt{\rho}\alpha^{2}}+\frac{e^{6\lambda t}\cdot\log(p)}{\alpha^{6}}}\right)}

     and a runtime complexity that is exponential in pp;

  2. if it uses sepFEM for releasing Q^\hat{Q}, it has a sample complexity of

     n=O~(e5λtplog2(p)ρα92+tλ2p5/4logpρα2+e6λtlog(p)α6)n=\tilde{O}{\left({\frac{e^{5\lambda t}\cdot\sqrt{p}\log^{2}(p)}{\sqrt{\rho}\alpha^{\frac{9}{2}}}+\frac{t\lambda^{2}\cdot p^{5/4}\log{p}}{\sqrt{\rho}\alpha^{2}}+\frac{e^{6\lambda t}\cdot\log(p)}{\alpha^{6}}}\right)}

     and runs in polynomial time whenever t=O(1)t=O(1).

5 Lower Bounds for Parameter Learning

The lower bound for parameter estimation (in \ell_{\infty} distance) is obtained via a reduction to private mean estimation for binary product distributions.

Theorem 5.1.

Suppose 𝒜{\cal A} is an (ε,δ)(\varepsilon,\delta)-differentially private algorithm that takes nn i.i.d. samples Z1,,ZnZ^{1},\ldots,Z^{n} drawn from any unknown pp-variable Ising model 𝒟(A,θ){\cal D}(A,\theta) and outputs A^\hat{A} such that 𝔼[maxi,j[p]|Ai,jA^i,j|]α1/50.\mathbb{E}\left[\max_{i,j\in[p]}|A_{i,j}-\hat{A}_{i,j}|\right]\leq\alpha\leq 1/50. Then n=Ω(pαε)n=\Omega{\left({\frac{\sqrt{p}}{\alpha\varepsilon}}\right)}.

Proof.

Consider an Ising model 𝒟(A,0){\cal D}(A,0) with Ap×pA\in\mathbb{R}^{p\times p} defined as follows: for i[p2],A2i1,2i=A2i,2i1=ηi[ln(2),ln(2)]i\in[\frac{p}{2}],A_{2i-1,2i}=A_{2i,2i-1}=\eta_{i}\in[-\ln(2),\ln(2)], and All=0A_{ll^{\prime}}=0 for all other pairs (l,l)(l,l^{\prime}). This construction divides the pp nodes into p2\frac{p}{2} pairs, where there is no correlation between nodes belonging to different pairs. It follows that

Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=12eηieηi+1,\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=-1}\right)}=\frac{1}{2}\frac{e^{\eta_{i}}}{e^{\eta_{i}}+1},
Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=121eηi+1.\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=-1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=1}\right)}=\frac{1}{2}\frac{1}{e^{\eta_{i}}+1}.

For each observation ZZ, we obtain an observation X{±1}p/2X\in\{\pm 1\}^{p/2} such that Xi=Z2i1Z2iX_{i}=Z_{2i-1}Z_{2i}. Then each observation XX is distributed according to a product distribution over {±1}p/2\{\pm 1\}^{p/2} such that the mean of coordinate ii is (eηi1)/(eηi+1)[1/3,1/3](e^{\eta_{i}}-1)/(e^{\eta_{i}}+1)\in[-1/3,1/3].
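Indeed, for each i\in[p/2],

\mathbb{E}[X_{i}]=\Pr\left(Z_{2i-1}=Z_{2i}\right)-\Pr\left(Z_{2i-1}\neq Z_{2i}\right)=\frac{e^{\eta_{i}}}{e^{\eta_{i}}+1}-\frac{1}{e^{\eta_{i}}+1}=\frac{e^{\eta_{i}}-1}{e^{\eta_{i}}+1},

which lies in [-1/3,1/3] precisely because \eta_{i}\in[-\ln(2),\ln(2)].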

Suppose that an (ε,δ)(\varepsilon,\delta)-differentially private algorithm takes nn observations drawn from any such Ising model distribution and outputs a matrix A^\hat{A} such that E[maxi,j[p]|Ai,jA^i,j|]α\mbox{\bf E}\left[\max_{i,j\in[p]}|A_{i,j}-\hat{A}_{i,j}|\right]\leq\alpha. Let η^i=min{max{A^2i1,2i,ln(2)},ln(2)}\hat{\eta}_{i}=\min\{\max\{\hat{A}_{2i-1,2i},-\ln(2)\},\ln(2)\} be the value of A^2i1,2i\hat{A}_{2i-1,2i} rounded into the range [ln(2),ln(2)][-\ln(2),\ln(2)], and so |ηiη^i|α|\eta_{i}-\hat{\eta}_{i}|\leq\alpha. It follows that

|eηi1eηi+1eη^i1eη^i+1|\displaystyle\left|\frac{e^{\eta_{i}}-1}{e^{\eta_{i}}+1}-\frac{e^{\hat{\eta}_{i}}-1}{e^{\hat{\eta}_{i}}+1}\right| =2|eηieη^i(eηi+1)(eη^i+1)|\displaystyle=2\left|\frac{e^{\eta_{i}}-e^{\hat{\eta}_{i}}}{(e^{\eta_{i}}+1)(e^{\hat{\eta}_{i}}+1)}\right|
<2|eηieη^i|4(e|ηiη^i|1)8|ηiη^i|\displaystyle<2\left|{e^{\eta_{i}}-e^{\hat{\eta}_{i}}}\right|\leq 4\left(e^{|\eta_{i}-\hat{\eta}_{i}|}-1\right)\leq 8|\eta_{i}-\hat{\eta}_{i}|

where the last step follows from the fact that ea1+2ae^{a}\leq 1+2a for any a[0,1]a\in[0,1]. Thus, such private algorithm also can estimate the mean of the product distribution accurately:

E[i=1p/2|eηi1eηi+1eη^i1eη^i+1|2]32pα2\mbox{\bf E}\left[\sum_{i=1}^{p/2}\left|\frac{e^{\eta_{i}}-1}{e^{\eta_{i}}+1}-\frac{e^{\hat{\eta}_{i}}-1}{e^{\hat{\eta}_{i}}+1}\right|^{2}\right]\leq 32p\alpha^{2}

Now we will use the following sample complexity lower bound on private mean estimation on product distributions.

Lemma 5.2 (Lemma 6.2 of [KamathLSU18]).

If M:{±1}n×d[1/3,1/3]dM\colon\{\pm 1\}^{n\times d}\rightarrow[-1/3,1/3]^{d} is (ε,3/(64n))(\varepsilon,3/(64n))-differentially private, and for every product distribution PP over {±1}d\{\pm 1\}^{d} such that the mean of each coordinate μj\mu_{j} satisfies 1/3μj1/3-1/3\leq\mu_{j}\leq 1/3,

EXPn[M(X)μ22]γ2d54,\mbox{\bf E}_{X\sim P^{n}}\left[\|M(X)-\mu\|_{2}^{2}\right]\leq\gamma^{2}\leq\frac{d}{54},

then nd/(72γε)n\geq d/(72\gamma\varepsilon).

Then our stated bound follows by instantiating γ2=32pα2\gamma^{2}=32p\alpha^{2} and d=p/2d=p/2 in Lemma 5.2: n\geq\frac{d}{72\gamma\varepsilon}=\frac{p/2}{72\sqrt{32p}\,\alpha\varepsilon}=\Omega\left(\frac{\sqrt{p}}{\alpha\varepsilon}\right).∎

6 Structure Learning of Graphical Models

In this section, we will give an (ε,δ)(\varepsilon,\delta)-differentially private algorithm for learning the structure of a Markov Random Field. The dependence on the dimension pp will be only logarithmic, in contrast to the complexity of privately learning the parameters. As we have shown in Section 5, the dependence for parameter learning is necessarily polynomial in pp, even under approximate differential privacy. Furthermore, as we will show in Section 7, if we wish to learn the structure of an MRF under more restrictive notions of privacy (such as pure or concentrated), the complexity also becomes polynomial in pp. Thus, in very high-dimensional settings, learning the structure of the MRF under approximate differential privacy is essentially the only notion of private learnability which is tractable.

The following lemma is immediate from stability-based mode arguments (see, e.g., Proposition 3.4 of [Vadhan17]).

Lemma 6.1.

Suppose there exists a (non-private) algorithm which takes X=(X1,,Xn)X=(X^{1},\dots,X^{n}) sampled i.i.d. from some distribution 𝒟{\cal D}, and outputs some fixed value YY (which may depend on 𝒟{\cal D}) with probability at least 2/32/3. Then there exists an (ε,δ)(\varepsilon,\delta)-differentially private algorithm which takes O(nlog(1/δ)ε)O\left(\frac{n\log(1/\delta)}{\varepsilon}\right) samples and outputs YY with probability at least 1δ1-\delta.
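As an illustration of how such a reduction can be realized, the sketch below runs the non-private learner on disjoint batches and releases the most common output only if it wins by a comfortable noisy margin. The helper name, the Laplace noise scale, and the threshold are schematic placeholders of ours; the precise calibration follows the stability framework referenced above, and the non-private learner is assumed to return a hashable object (e.g., a frozenset of edges).

import numpy as np
from collections import Counter

def stable_mode_release(samples, nonprivate_alg, batch_size, eps, delta, rng):
    """Run nonprivate_alg on disjoint batches and release the majority output
    only when its (noisy) lead over the runner-up is large enough."""
    k = len(samples) // batch_size
    votes = Counter(
        nonprivate_alg(samples[j * batch_size:(j + 1) * batch_size])
        for j in range(k)
    )
    (top, top_count), *rest = votes.most_common()
    runner_up = rest[0][1] if rest else 0
    # changing one sample changes one batch, so the count gap moves by at most 2
    noisy_gap = top_count - runner_up + rng.laplace(scale=2.0 / eps)
    if noisy_gap > 2.0 * np.log(1.0 / delta) / eps + 2.0:
        return top
    return None  # no sufficiently stable answer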

We can now directly import the following theorem from [WuSD19].

Theorem 6.2 ([WuSD19]).

There exists an algorithm which, with probability at least 2/32/3, learns the structure of a pairwise graphical model. It requires n=O(λ2k4e14λlog(pk)η4)n=O\left(\frac{\lambda^{2}k^{4}e^{14\lambda}\log(pk)}{\eta^{4}}\right) samples.

This gives us the following private learning result as a corollary.

Corollary 6.3.

There exists an (ε,δ)(\varepsilon,\delta)-differentially private algorithm which, with probability at least 2/32/3, learns the structure of a pairwise graphical model. It requires n=O(λ2k4e14λlog(pk)log(1/δ)εη4)n=O\left(\frac{\lambda^{2}k^{4}e^{14\lambda}\log(pk)\log(1/\delta)}{\varepsilon\eta^{4}}\right) samples.

For binary MRFs of higher-order, we instead import the following theorem from [KlivansM17]:

Theorem 6.4 ([KlivansM17]).

There exists an algorithm which, with probability at least 2/32/3, learns the structure of a binary tt-wise MRF. It requires n=O(eO(λt)log(pη)η4)n=O\left(\frac{e^{O{\left({\lambda t}\right)}}\log(\frac{p}{\eta})}{\eta^{4}}\right) samples.

This gives us the following private learning result as a corollary.

Corollary 6.5.

There exists an (ε,δ)(\varepsilon,\delta)-differentially private algorithm which, with probability at least 2/32/3, learns the structure of a binary tt-wise MRF. It requires

n=O(eO(λt)log(pη)log(1/δ)εη4)n=O\left(\frac{e^{O{\left({\lambda t}\right)}}\log(\frac{p}{\eta})\log(1/\delta)}{\varepsilon\eta^{4}}\right)

samples.

7 Lower Bounds for Structure Learning of Graphical Models

In this section, we will prove structure learning lower bounds under pure DP or zero-concentrated DP. The graphical models we consider are the Ising model and pairwise graphical models. However, we note that all the lower bounds for the Ising model also hold for binary tt-wise MRFs, since the Ising model is a special case of binary tt-wise MRFs corresponding to t=2t=2. We will show that under (ε,0)(\varepsilon,0)-DP or ρ\rho-zCDP, a polynomial dependence on the dimension is unavoidable in the sample complexity.

In Section 7.1, we assume that our samples are generated from an Ising model. In Section 7.2, we extend our lower bounds to pairwise graphical models.

7.1 Lower Bounds for Structure Learning of Ising Models

Theorem 7.1.

Any (ε,0)(\varepsilon,0)-DP algorithm which learns the structure of an Ising model with minimum edge weight η\eta with probability at least 2/32/3 requires n=Ω(pηε+pε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}+\frac{p}{\varepsilon}}\right)} samples. Furthermore, at least n=Ω(pρ)n=\Omega{\left({\sqrt{\frac{p}{\rho}}}\right)} samples are required for the same task under ρ\rho-zCDP.

Proof.

Our lower bound argument proceeds in two steps. The first step is to construct a set of distributions, consisting of 2p22^{\frac{p}{2}} different Ising models, such that any accurate structure learning algorithm must output different answers for different distributions. In the second step, we apply our private Fano's inequality (for pure DP) or the packing argument for zCDP [BunS16] to obtain the desired lower bound.

To start, we use the following binary code to construct the distribution set. Let 𝒞={0,1}p2{\cal C}=\{0,1\}^{\frac{p}{2}}. Given c𝒞c\in{\cal C}, we construct the corresponding distribution 𝒟(Ac,0){\cal D}(A^{c},0) with Acp×pA^{c}\in\mathbb{R}^{p\times p} defined as follows: for i[p2],A2i1,2ic=A2i,2i1c=ηc[i]i\in[\frac{p}{2}],A^{c}_{2i-1,2i}=A^{c}_{2i,2i-1}=\eta\cdot c[i], and 0 elsewhere. By construction, we divide the pp nodes into p2\frac{p}{2} different pairs, where there is no correlation between nodes belonging to different pairs. Furthermore, for pair ii, if c[i]=0c[i]=0, then node 2i12i-1 is independent of node 2i2i, and it is not hard to show that

Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=14,\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=-1}\right)}=\frac{1}{4},
Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=14.\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=-1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=1}\right)}=\frac{1}{4}.

On the other hand, if c[i]=1c[i]=1,

Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=12eηeη+1,\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=-1}\right)}=\frac{1}{2}\cdot\frac{e^{\eta}}{e^{\eta}+1},
Pr(Z2i1=1,Z2i=1)=Pr(Z2i1=1,Z2i=1)=121eη+1.\displaystyle\Pr{\left({Z_{2i-1}=1,Z_{2i}=-1}\right)}=\Pr{\left({Z_{2i-1}=-1,Z_{2i}=1}\right)}=\frac{1}{2}\cdot\frac{1}{e^{\eta}+1}.

The Chi-squared distance between these two distributions is

8[(12eηeη+114)2+(121eη+114)2]=(12eη+1)24η2.8{\left[{{\left({\frac{1}{2}\cdot\frac{e^{\eta}}{e^{\eta}+1}-\frac{1}{4}}\right)}^{2}+{\left({\frac{1}{2}\cdot\frac{1}{e^{\eta}+1}-\frac{1}{4}}\right)}^{2}}\right]}={\left({1-\frac{2}{e^{\eta}+1}}\right)}^{2}\leq 4\eta^{2}.
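The final inequality holds because 1-\frac{2}{e^{\eta}+1}=\frac{e^{\eta}-1}{e^{\eta}+1}=\tanh(\eta/2)\leq\eta/2.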

Now we want to upper bound the total variation distance between 𝒟(Ac1,0){\cal D}(A^{c_{1}},0) and 𝒟(Ac2,0){\cal D}(A^{c_{2}},0) for any c1c2𝒞c_{1}\neq c_{2}\in{\cal C}. Let PiP_{i} and QiQ_{i} denote the joint distributions of nodes 2i12i-1 and 2i2i under 𝒟(Ac1,0){\cal D}(A^{c_{1}},0) and 𝒟(Ac2,0){\cal D}(A^{c_{2}},0), respectively. We have that

dTV(𝒟(Ac1,0),𝒟(Ac2,0))\displaystyle d_{TV}{\left({{\cal D}(A^{c_{1}},0),{\cal D}(A^{c_{2}},0)}\right)} 2dKL(𝒟(Ac1,0),𝒟(Ac2,0))\displaystyle\leq\sqrt{2d_{KL}{\left({{\cal D}(A^{c_{1}},0),{\cal D}(A^{c_{2}},0)}\right)}}
=2i=1p2dKL(Pi,Qi)min(2ηp,1),\displaystyle=\sqrt{2\sum_{i=1}^{\frac{p}{2}}d_{KL}{\left({P_{i},Q_{i}}\right)}}\leq\min{\left({2\eta\sqrt{p},1}\right)},

where the first inequality is by Pinsker’s inequality, and the last inequality comes from the fact that the KL divergence is always upper bounded by the Chi-squared distance.

In order to attain pure DP lower bounds, we utilize the corollary of DP Fano’s inequality for estimation (Theorem LABEL:coro:fano).

For any c1,c2𝒞c_{1},c_{2}\in{\cal C}, we have dTV(𝒟(Ac1,0),𝒟(Ac2,0))min(2ηp,1)d_{TV}{\left({{\cal D}(A^{c_{1}},0),{\cal D}(A^{c_{2}},0)}\right)}\leq\min{\left({2\eta\sqrt{p},1}\right)}. By the property of maximal coupling [Hollander12], there exists a coupling between 𝒟n(Ac1,0){\cal D}^{n}(A^{c_{1}},0) and 𝒟n(Ac2,0){\cal D}^{n}(A^{c_{2}},0) with expected Hamming distance at most min(2nηp,n)\min{\left({2n\eta\sqrt{p},n}\right)}. Therefore, since log|𝒞|=Θ(p)\log\left|{\cal C}\right|=\Theta(p), we have ε=Ω(log|𝒞|min(nηp,n))\varepsilon=\Omega{\left({\frac{\log\left|{\cal C}\right|}{\min{\left({n\eta\sqrt{p},n}\right)}}}\right)}, and accordingly, n=Ω(pηε+pε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}+\frac{p}{\varepsilon}}\right)}.

Now we move to zCDP lower bounds. We utilize a different version of the packing argument [BunS16], which works under zCDP.

Lemma 7.2.

Let 𝒱={P1,P2,,PM}{\cal V}=\{P_{1},P_{2},...,P_{M}\} be a set of MM distributions over 𝒳n{\cal X}^{n}, and let {Si}i[M]\{S_{i}\}_{i\in[M]} be a collection of disjoint subsets of an output space 𝒮{\cal S}. If there exists a ρ\rho-zCDP algorithm 𝒜:𝒳n𝒮{\cal A}:{\cal X}^{n}\to{\cal S} such that for every i[M]i\in[M], given ZnPiZ^{n}\sim P_{i}, Pr(𝒜(Zn)Si)910\Pr{\left({{\cal A}(Z^{n})\in S_{i}}\right)}\geq\frac{9}{10}, then

ρ=Ω(logMn2).\rho=\Omega{\left({\frac{\log M}{n^{2}}}\right)}.

By Lemma 7.2, we derive ρ=Ω(pn2)\rho=\Omega{\left({\frac{p}{n^{2}}}\right)} and n=Ω(pρ)n=\Omega{\left({\sqrt{\frac{p}{\rho}}}\right)} accordingly. ∎

7.2 Lower Bounds for Structure Learning of Pairwise Graphical Models

Similar techniques can be used to derive lower bounds for pairwise graphical models.

Theorem 7.3.

Any (ε,0)(\varepsilon,0)-DP algorithm which learns the structure of a pp-variable pairwise graphical model with minimum edge weight η\eta with probability at least 2/32/3 requires n=Ω(pηε+k2pε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}+\frac{k^{2}p}{\varepsilon}}\right)} samples. Furthermore, at least n=Ω(k2pρ)n=\Omega{\left({\sqrt{\frac{k^{2}p}{\rho}}}\right)} samples are required for the same task under ρ\rho-zCDP.

Proof.

As before, we start by constructing a distribution set, consisting of 2Ω(k2p)2^{\Omega{\left({k^{2}p}\right)}} different pairwise graphical models, such that any accurate structure learning algorithm must output different answers for different distributions.

Let 𝒞{\cal C} be the set of k×kk\times k symmetric matrices with each entry equal to either 0 or η\eta, i.e., 𝒞={W{0,η}k×k:W=WT}{\cal C}=\{W\in\{0,\eta\}^{k\times k}:W=W^{T}\}. Without loss of generality, we assume pp is even. Given c=[c1,c2,,cp/2]c=[c_{1},c_{2},\cdots,c_{p/2}], where c1,c2,,cp/2𝒞c_{1},c_{2},\cdots,c_{p/2}\in{\cal C}, we construct the corresponding distribution 𝒟(𝒲c,0){\cal D}({\cal W}^{c},0) with 𝒲c{\cal W}^{c} defined as follows: for l[p2],W2l1,2lc=cll\in[\frac{p}{2}],W^{c}_{2l-1,2l}=c_{l}, and for other pairs (i,j)(i,j), Wi,jc=0W^{c}_{i,j}=0. Similarly, by this construction we divide the pp nodes into p2\frac{p}{2} different pairs, and there is no correlation between nodes belonging to different pairs.

We first prove lower bounds under (ε,0)(\varepsilon,0)-DP. Let MM denote the number of distributions in our construction. By Theorem LABEL:coro:fano, ε=Ω(logMn)\varepsilon=\Omega{\left({\frac{\log M}{n}}\right)}, since for any two nn-sample distributions, the expected coupling distance can always be upper bounded by nn. We also note that M=|𝒞|p/2=(2k(k+1)2)p/2M=\left|{\cal C}\right|^{p/2}={\left({2^{\frac{k(k+1)}{2}}}\right)}^{p/2}, so logM=Ω(k2p)\log M=\Omega(k^{2}p). Therefore, we have n=Ω(k2pε)n=\Omega{\left({\frac{k^{2}p}{\varepsilon}}\right)}. At the same time, n=Ω(pηε)n=\Omega{\left({\frac{\sqrt{p}}{\eta\varepsilon}}\right)} is another lower bound, inherited from the easier task of learning Ising models.

With respect to zCDP, we utilize Lemma 7.2 and obtain ρ=Ω(k2pn2)\rho=\Omega{\left({\frac{k^{2}p}{n^{2}}}\right)}. Therefore, we have n=Ω(k2pρ)n=\Omega{\left({\sqrt{\frac{k^{2}p}{\rho}}}\right)}. ∎

Acknowledgments

The authors would like to thank Kunal Talwar for suggesting the study of this problem, and Adam Klivans, Frederic Koehler, Ankur Moitra, and Shanshan Wu for helpful and inspiring conversations. GK would like to thank Chengdu Style Restaurant (古月飘香) in Berkeley for inspiration in the conception of this project.