
Towards Plausible Differentially Private ADMM Based Distributed Machine Learning

Jiahao Ding University of Houston [email protected] Jingyi Wang San Francisco State University [email protected] Guannan Liang University of Connecticut [email protected] Jinbo Bi University of Connecticut [email protected]  and  Miao Pan University of Houston [email protected]
Abstract.

The Alternating Direction Method of Multipliers (ADMM) and its distributed version have been widely used in machine learning. In the iterations of ADMM, model updates using local private data and model exchanges among agents impose critical privacy concerns. Despite some pioneering works to relieve such concerns, differentially private ADMM still confronts many research challenges. For example, the guarantee of differential privacy (DP) relies on the premise that the optimality of each local problem can be perfectly attained in each ADMM iteration, which may never happen in practice, and the model trained by DP ADMM may have low prediction accuracy. In this paper, we address these concerns by proposing a novel Plausible differentially Private ADMM algorithm, called PP-ADMM, together with an improved version, IPP-ADMM. In PP-ADMM, each agent approximately solves a perturbed optimization problem formulated from its local private data in each iteration, and then perturbs the approximate solution with Gaussian noise to provide the DP guarantee. To further improve the model accuracy and convergence, the improved version IPP-ADMM adopts the sparse vector technique (SVT) to determine whether an agent should update its neighbors with the current perturbed solution. The agent computes the difference between the current solution and that of the last iteration; if the difference is larger than a threshold, it passes the solution to its neighbors, and otherwise the solution is discarded. Moreover, we propose to track the total privacy loss under zero-concentrated DP (zCDP) and provide a generalization performance analysis. Experiments on real-world datasets demonstrate that, under the same privacy guarantee, the proposed algorithms are superior to the state of the art in terms of model accuracy and convergence rate.

differential privacy; distributed machine learning; ADMM; decentralized optimization

1. Introduction

Nowadays, the development of machine learning creates many emerging applications that can improve the quality of our life, such as medical diagnosis, autonomous driving, face recognition, etc. With the proliferation of mobile phones and Internet-of-Things devices, a vast amount of data has been generated at an ever-increasing rate, which leads to significant computational complexity for data collection and processing via a centralized machine learning approach. Therefore, distributed machine learning plays an increasingly important role in dealing with large-scale machine learning tasks. Research efforts on distributed training for large-scale optimization problems mainly fall into two types: (sub)gradient based methods, and Alternating Direction Method of Multipliers (ADMM) based methods. As shown in (Chang et al., 2014), the convergence of (sub)gradient based methods is usually slow, namely $\mathcal{O}(1/\sqrt{T})$ for general convex objectives, while ADMM based algorithms can achieve an $\mathcal{O}(1/T)$ convergence rate, where $T$ is the number of iterations. Therefore, ADMM has been widely used in distributed machine learning (Zhang and Kwok, 2014; Mota et al., 2013), and in this paper, we focus on distributed ADMM.

In the framework of distributed ADMM, data providers (agents) collaboratively solve a learning problem, which can be decomposed into several subproblems, via an interactive procedure of local computation and message passing. However, the information exchanges during this process raise serious privacy concerns, and an adversary can extract private information from the shared learning models via various inference attacks (Shokri et al., 2017; Melis et al., 2019). To prevent such privacy leakage, differential privacy (DP) (Dwork et al., 2006) provides a de facto standard for protecting data privacy, guaranteeing that an adversary with arbitrary background knowledge cannot extract any sensitive information about the training dataset.

Many pioneering works have studied how to effectively integrate ADMM with DP, e.g., (Zhang and Zhu, 2016; Zhang et al., 2018a, b; Huang et al., 2019; Ding et al., 2019a, b, c), which can be classified into two categories in general. The first type of works adds a noisy term to perturb the objective function in each ADMM iteration using an objective perturbation approach (Chaudhuri et al., 2011). The second type perturbs the updates of the original distributed ADMM algorithm via an output perturbation approach (Chaudhuri et al., 2011). Specifically, as an objective perturbation method, (Zhang and Zhu, 2016) proposed to inject noise into the update of the dual variable to provide a DP guarantee, while the total privacy loss over the whole iterative process is not quantified. Further, (Zhang et al., 2018a, b) proposed to perturb the penalty parameter of ADMM and re-utilize the previous iteration's results to save privacy loss; these methods also quantify the total privacy loss over the entire process. Moreover, (Huang et al., 2019) perturbed the augmented Lagrangian with time-varying Gaussian noise and considered a centralized network structure to perform ADMM updates. As output perturbation methods, (Ding et al., 2019b, c) proposed to perturb the primal variables with linearly decaying Gaussian noise to preserve DP while maintaining utility.

However, the guarantee of DP in the above works relies on the premise that the optimality of each local problem can be perfectly attained in each iteration during the whole training procedure, which is seldom seen in practice. Further, the trained models from the above works exhibit severe degradation in terms of the convergence performance and model accuracy, compared to their non-private versions.

In this paper, we propose two plausible differentially private ADMM based distributed machine learning algorithms, called PP-ADMM and IPP-ADMM. Instead of requiring each local problem to reach optimality, PP-ADMM releases a noisy approximate solution of the local optimization, with Gaussian noise calibrated to the optimization accuracy, while preserving DP.

To further improve the utility, we propose an improved version of PP-ADMM, i.e., IPP-ADMM, by exploiting the sparse vector technique (SVT) to check whether the current approximate solution differs sufficiently from that of the previous iteration. Moreover, the privacy analysis of our algorithms based on zero-concentrated DP (zCDP) yields a tight privacy loss bound, and we analyze the generalization performance of PP-ADMM. Our salient contributions are summarized as follows.

  • To release the shackles of “exact local optimal results” and make ADMM based distributed machine learning differentially private and plausible, we propose a novel PP-ADMM method that exploits an inexact solution of the perturbed local optimization over each agent's private data in every ADMM iteration, while preserving data privacy.

  • We further propose an improved version of PP-ADMM (IPP-ADMM) by employing SVT to evaluate whether the current approximate solution differs sufficiently from that of the previous iteration. If the difference surpasses a predefined threshold, the approximate solution with Gaussian noise is shared with neighbors; otherwise, the current approximate solution is discarded. In this way, redundant privacy loss accumulation and transmissions of “low quality” solutions are avoided during the ADMM iterative process.

  • To tightly track the privacy loss accumulation, we leverage the serial and parallel composition theorems of zCDP to theoretically quantify and analyze the overall privacy guarantees of the PP-ADMM and IPP-ADMM algorithms. Moreover, we provide a generalization performance analysis of PP-ADMM by measuring the number of data samples required to achieve a given accuracy criterion.

  • Through extensive experiments on real-world datasets, we show the superior performance of the proposed algorithms over state-of-the-art differentially private ADMM algorithms.

The rest of the paper is organized as follows. Section 2 formulates the optimization problem and describes preliminaries of ADMM and differential privacy. Then, the plausible private ADMM (PP-ADMM) algorithm and its corresponding privacy and sample complexity analysis are presented in Section 3. In Section 4, we provide IPP-ADMM, an improved version of PP-ADMM, and the corresponding privacy analysis. The experimental results on real-world datasets are shown in Section 5. Finally, we draw concluding remarks in Section 6.

2. Problem Formulation and Preliminaries

In this paper, we consider a connected network of $N$ agents with node set $\mathscr{N}=\{1,\cdots,N\}$, where each agent $i$ has a dataset $D_i=\{(x_i^n, y_i^n)\}_{n=1}^{|D_i|}$, with $x_i^n\in\mathcal{X}$ a feature vector and $y_i^n\in\mathcal{Y}$ a label. The communication among agents can be represented by an undirected graph $G=\{\mathscr{N},\mathscr{E}\}$, where $\mathscr{E}$ denotes the set of communication links between agents. Two agents $i$ and $j$ can communicate with each other only when they are neighbors, i.e., $(i,j)\in\mathscr{E}$. We denote the set of neighbors of agent $i$ as $\mathscr{B}_i$. The goal is to cooperatively train a classifier $\theta\in\mathbb{R}^d$ over the union of all local datasets in a decentralized fashion (i.e., without a centralized controller) while keeping each data sample private, which can be formulated as an Empirical Risk Minimization (ERM) problem:

(1)  $\min_{\theta\in\mathbb{R}^d}\ \sum_{i=1}^N \frac{1}{|D_i|}\sum_{n=1}^{|D_i|}\mathscr{L}(y_i^n\theta^T x_i^n) + \hat{\lambda}\mathscr{R}(\theta),$

where $\mathscr{L}(\cdot):\mathcal{X}\times\mathcal{Y}\times\mathbb{R}^d\to\mathbb{R}$ stands for a convex loss function with $|\mathscr{L}'(\cdot)|\leq 1$ and $0<\mathscr{L}''(\cdot)\leq c_1$, $\mathscr{R}(\theta):\mathbb{R}^d\to\mathbb{R}$ is a differentiable and 1-strongly convex regularizer to prevent overfitting, and $\hat{\lambda}\geq 0$ is a regularizer parameter that controls the impact of the regularizer. We assume that each feature vector $x_i^n$ is normalized so that $\|x_i^n\|_2\leq 1$. Note that common classification formulations in machine learning, such as logistic regression and support vector machines, also fall into the ERM framework. In order to solve the ERM problem (1) in a decentralized manner, we adopt a simple but efficient optimization method, ADMM. In the following subsection, we review some preliminaries of the ADMM algorithm for solving Problem (1).

2.1. ADMM

It is easy to see that the ERM problem (1) can be equivalently reformulated in the following consensus form by introducing $\theta_i$, the local copy of the common classifier $\theta$ at agent $i$:

(2)  $\min_{\{\theta_i\},\{\rho_{ij}\}}\ \sum_{i=1}^N f_i(\theta_i) \quad \text{s.t.} \quad \theta_i=\rho_{ij},~\theta_j=\rho_{ij},~ i\in\mathscr{N},~ j\in\mathscr{B}_i,$

where $\{\rho_{ij} \mid i\in\mathscr{N}, j\in\mathscr{B}_i\}$ is a set of slack variables enforcing that all local copies are equal, i.e., $\theta_1=\theta_2=\cdots=\theta_N$, and $f_i(\theta_i)=\frac{1}{|D_i|}\sum_{n=1}^{|D_i|}\mathscr{L}(y_i^n\theta_i^T x_i^n)+\frac{\hat{\lambda}}{N}\mathscr{R}(\theta_i)$. According to Problem (2), each agent $i$ can minimize the local function $f_i(\theta_i)$ over its own private dataset with respect to $\theta_i$, subject to the consensus constraints. In (Zhang et al., 2018a), ADMM is employed to optimize Problem (2) in a decentralized fashion. By defining a dual variable $\lambda_i$ for agent $i$ and introducing the notation $\mathcal{L}_{non}(\theta_i,D_i)=f_i(\theta_i)+(2\lambda_i^t)^T\theta_i+\eta\sum_{j\in\mathscr{B}_i}\|\frac{1}{2}(\theta_i^t+\theta_j^t)-\theta_i\|_2^2$, ADMM has the following iterative updates in the $(t+1)$-th iteration:

(3)  $\theta_i^{t+1}=\underset{\theta_i}{\text{argmin}}~\mathcal{L}_{non}(\theta_i,D_i);$
(4)  $\lambda_i^{t+1}=\lambda_i^t+\frac{\eta}{2}\sum_{j\in\mathscr{B}_i}(\theta_i^{t+1}-\theta_j^{t+1}),$

where $\eta>0$ is a penalty parameter. Note that the variable $\rho_{ij}$ does not appear in (3) and (4) because it can be expressed in terms of the primal variables $\theta_i$, as shown in (Mateos et al., 2010). In iteration $t+1$, each agent $i\in\mathscr{N}$ updates its local $\theta_i^{t+1}$ via (3) by using its previous results $\theta_i^t$ and $\lambda_i^t$, and the shared local classifiers $\theta_j^t$ from its neighbors $j\in\mathscr{B}_i$. Next, agent $i$ broadcasts $\theta_i^{t+1}$ to all its neighboring agents. After obtaining all of its neighbors' computation results, each agent updates the dual variable $\lambda_i^{t+1}$ through (4).
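To make the update rules concrete, the following Python sketch (an illustration under simplifying assumptions, not the implementation used in the paper; the local objectives, graph representation, and generic solver are placeholders) performs one synchronous round of the primal update (3) and the dual update (4):

```python
import numpy as np
from scipy.optimize import minimize

def admm_round(theta, lam, neighbors, f_local, eta):
    """One synchronous decentralized ADMM round.

    theta, lam: dicts mapping agent id -> current primal/dual variables (d-vectors);
    neighbors:  dict mapping agent id -> list of neighbor ids;
    f_local:    dict mapping agent id -> callable local objective f_i(theta_i).
    """
    new_theta = {}
    for i in theta:
        def L_non(th, i=i):
            # augmented Lagrangian of agent i, cf. (3)
            consensus = sum(np.linalg.norm((theta[i] + theta[j]) / 2 - th) ** 2
                            for j in neighbors[i])
            return f_local[i](th) + (2 * lam[i]) @ th + eta * consensus
        new_theta[i] = minimize(L_non, theta[i]).x  # primal update (3)
    new_lam = {}
    for i in theta:
        # dual update (4) uses the freshly broadcast primal variables
        new_lam[i] = lam[i] + (eta / 2) * sum(new_theta[i] - new_theta[j]
                                              for j in neighbors[i])
    return new_theta, new_lam
```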

2.2. Differential Privacy

For privacy-preserving data analysis, the standard privacy metric, differential privacy (DP) (Dwork et al., 2006; Dwork and Roth, 2014), was proposed to measure the privacy risk of each data sample in the dataset, and has already been adopted in many machine learning domains (Ding et al., 2020a; Ding et al., 2020b; Zhang et al., 2019; Shi et al., 2019; Wang et al., 2018). Under the DP framework, privacy protection is guaranteed by limiting the change in the distribution of the output regardless of the value of any one sample in the dataset.

Definition 2.1 ($(\epsilon,\delta)$-DP (Dwork et al., 2006)).

A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon,\delta)$-DP if for any two neighboring datasets $D$ and $\hat{D}$ differing in at most one data sample, and for any possible output $o\in Range(\mathcal{M})$, we have $\Pr[\mathcal{M}(D)=o]\leq e^{\epsilon}\Pr[\mathcal{M}(\hat{D})=o]+\delta$.

Here $\epsilon,\delta$ are privacy loss parameters that indicate the strength of the privacy protection offered by the mechanism $\mathcal{M}$; the smaller they are, the stronger the privacy protection. The above privacy definition reduces to pure DP when $\delta=0$, and when $\delta>0$ it is referred to as approximate DP. We can achieve pure and approximate DP by utilizing two popular approaches called the Laplace and Gaussian mechanisms, both of which share the same form $\mathcal{M}(D)=\mathcal{M}_q(D)+\mathbf{u}$, where $\mathcal{M}_q(D)$ is a query function over dataset $D$ and $\mathbf{u}$ is random noise. We also write two neighboring datasets $D$ and $\hat{D}$ as $D\sim\hat{D}$, and denote by ${\rm Lap}(\lambda)$ a zero-mean Laplace distribution with scale parameter $\lambda$.

Definition 2.2 (Laplace Mechanism (Dwork et al., 2006)).

Given any function $\mathcal{M}_q:\mathcal{D}\to\mathbb{R}^d$, the Laplace mechanism is defined as $\mathcal{M}_L(D,q,\epsilon)=\mathcal{M}_q(D)+\mathbf{u}$, where $\mathbf{u}$ is drawn from a Laplace distribution ${\rm Lap}(\frac{\Delta_1}{\epsilon})$ with scale parameter proportional to the $L_1$-sensitivity of $\mathcal{M}_q$, given as $\Delta_1=\sup_{D\sim\hat{D}}\|\mathcal{M}_q(D)-\mathcal{M}_q(\hat{D})\|_1$. The Laplace mechanism preserves $\epsilon$-differential privacy.

Definition 2.3 (Gaussian Mechanism (Dwork et al., 2006)).

Given any function $\mathcal{M}_q:\mathcal{D}\to\mathbb{R}^d$, the Gaussian mechanism is defined as $\mathcal{M}_G(D,q,\epsilon,\delta)=\mathcal{M}_q(D)+\mathbf{u}$, where $\mathbf{u}$ is drawn from a Gaussian distribution $\mathcal{N}(0,\sigma^2 I_d)$ with $\sigma\geq\frac{\sqrt{2\ln(1.25/\delta)}\,\Delta_2}{\epsilon}$, and $\Delta_2$ is the $L_2$-sensitivity of the function $\mathcal{M}_q$, i.e., $\Delta_2=\sup_{D\sim\hat{D}}\|\mathcal{M}_q(D)-\mathcal{M}_q(\hat{D})\|_2$. The Gaussian mechanism provides $(\epsilon,\delta)$-differential privacy.
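As a minimal illustration of Definitions 2.2 and 2.3, the following Python sketch adds calibrated Laplace or Gaussian noise to a query answer (the function names are our own; the Gaussian calibration below is the classical one, valid for $\epsilon\leq 1$):

```python
import numpy as np

def laplace_mechanism(answer, l1_sensitivity, eps):
    """Definition 2.2: add Lap(Delta_1 / eps) noise; gives eps-DP."""
    return answer + np.random.laplace(scale=l1_sensitivity / eps,
                                      size=np.shape(answer))

def gaussian_mechanism(answer, l2_sensitivity, eps, delta):
    """Definition 2.3: sigma >= sqrt(2 ln(1.25/delta)) * Delta_2 / eps;
    gives (eps, delta)-DP."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return answer + np.random.normal(scale=sigma, size=np.shape(answer))
```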

Next, we introduce a generalization of DP called zero-concentrated DP (zCDP) (Bun and Steinke, 2016), which is stated in terms of the Rényi divergence between $\mathcal{M}(D)$ and $\mathcal{M}(\hat{D})$ and achieves a much tighter privacy loss bound under the composition of multiple privacy mechanisms.

Definition 2.4 ($\rho$-zCDP (Bun and Steinke, 2016)).

We say that a randomized algorithm $\mathcal{M}$ provides $\rho$-zCDP if for all neighboring datasets $D$ and $\hat{D}$ and all $\tau\in(1,\infty)$, we have $D_\tau(\mathcal{M}(D)\|\mathcal{M}(\hat{D}))\leq\rho\tau$, where $D_\tau(\mathcal{M}(D)\|\mathcal{M}(\hat{D}))$ is the $\tau$-Rényi divergence (see (Bun and Steinke, 2016) for the definition) between the distributions $\mathcal{M}(D)$ and $\mathcal{M}(\hat{D})$.

The following lemmas state that the Gaussian mechanism satisfies zCDP, the composition theorem of $\rho$-zCDP, and the relationships among $\rho$-zCDP, $\epsilon$-DP, and $(\epsilon,\delta)$-DP.

Lemma 2.5 ((Bun and Steinke, 2016)).

The Gaussian mechanism with noise $\mathcal{N}(0,\sigma^2)$ satisfies $\Delta_2^2/(2\sigma^2)$-zCDP.

Lemma 2.6 (Serial Composition (Bun and Steinke, 2016)).

If randomized mechanisms $\mathcal{M}_1:\mathcal{D}\to\mathcal{R}_1$ and $\mathcal{M}_2:\mathcal{D}\to\mathcal{R}_2$ obey $\rho_1$-zCDP and $\rho_2$-zCDP, respectively, then their composition $\mathcal{M}'':\mathcal{D}\to\mathcal{R}_1\times\mathcal{R}_2$ defined by $\mathcal{M}''=(\mathcal{M}_1,\mathcal{M}_2)$ obeys $(\rho_1+\rho_2)$-zCDP.

Lemma 2.7 (DP to zCDP conversion (Bun and Steinke, 2016)).

If a randomized mechanism $\mathcal{M}$ provides $\epsilon$-DP, then $\mathcal{M}$ is $\frac{1}{2}\epsilon^2$-zCDP. Moreover, for $\mathcal{M}$ to satisfy $(\epsilon,\delta)$-DP, it suffices to satisfy $\rho$-zCDP with $\rho=\frac{\epsilon^2}{4\ln(1/\delta)}$.

Lemma 2.8 (zCDP to DP conversion (Bun and Steinke, 2016)).

If a randomized mechanism $\mathcal{M}$ obeys $\rho$-zCDP, then $\mathcal{M}$ obeys $(\rho+2\sqrt{\rho\ln(1/\delta)},\delta)$-DP for all $0<\delta<1$.
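The conversions in Lemmas 2.5, 2.7, and 2.8 are purely arithmetic, so they can be packaged as small helpers; a sketch (with our own naming) is:

```python
import numpy as np

def gaussian_zcdp(l2_sensitivity, sigma):
    """Lemma 2.5: N(0, sigma^2) noise satisfies Delta_2^2 / (2 sigma^2)-zCDP."""
    return l2_sensitivity ** 2 / (2 * sigma ** 2)

def approx_dp_to_zcdp(eps, delta):
    """Lemma 2.7: rho = eps^2 / (4 ln(1/delta)) suffices for (eps, delta)-DP."""
    return eps ** 2 / (4 * np.log(1 / delta))

def zcdp_to_dp(rho, delta):
    """Lemma 2.8: rho-zCDP implies (rho + 2 sqrt(rho ln(1/delta)), delta)-DP."""
    return rho + 2 * np.sqrt(rho * np.log(1 / delta))
```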

2.3. Sparse Vector Technique

A powerful approach for achieving DP employs the sparse vector technique (SVT) (Dwork et al., 2009), which takes a sequence of queries with bounded sensitivity $\Delta$ against a fixed sensitive dataset and outputs a vector indicating whether each query answer exceeds a threshold or not. A unique advantage of SVT is that it can output some query answers without paying additional privacy cost. Specifically, as shown in (Lyu et al., 2017), SVT has the following four steps: (i) compute a noisy threshold $\hat{\gamma}$ by adding threshold noise ${\rm Lap}(\frac{\Delta}{\epsilon_1})$ to the predefined threshold $\gamma$; (ii) perturb each query $q_i$ with noise $\nu_i\sim{\rm Lap}(\frac{2c\Delta}{\epsilon_2})$; (iii) compare each noisy query answer $q_i(D)+\nu_i$ with the noisy threshold $\hat{\gamma}$ and report whether it is higher ($\top$) or lower ($\perp$); (iv) continue until the number of $\top$'s in the output reaches the predefined bound $c$. According to (Lyu et al., 2017), the SVT algorithm satisfies $\epsilon$-DP with $\epsilon=\epsilon_1+\epsilon_2$. To analyze the privacy guarantee of SVT under the zCDP framework, we utilize the conversion result in Lemma 2.7, from which SVT satisfies $\frac{1}{2}\epsilon^2$-zCDP.
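The four steps above can be sketched in Python as follows (a simplified rendition of the SVT variant described in (Lyu et al., 2017); it does not re-noise the threshold after a top answer, matching the description in the text):

```python
import numpy as np

def sparse_vector(queries, data, gamma, sensitivity, eps1, eps2, c):
    """Answer above/below-threshold for a query stream under (eps1 + eps2)-DP.

    queries: callables q_i(data) with L1-sensitivity at most `sensitivity`;
    gamma:   public threshold; c: maximum number of 'top' answers."""
    gamma_hat = gamma + np.random.laplace(scale=sensitivity / eps1)  # step (i)
    answers, tops = [], 0
    for q in queries:
        nu = np.random.laplace(scale=2 * c * sensitivity / eps2)     # step (ii)
        if q(data) + nu >= gamma_hat:                                 # step (iii)
            answers.append("top")
            tops += 1
            if tops >= c:                                             # step (iv)
                break
        else:
            answers.append("bottom")  # costs no additional privacy budget
    return answers
```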

3. Plausible Private ADMM

In this section, we present our plausible differentially private ADMM (PP-ADMM) algorithm, which adds Gaussian noise calibrated to the maximum tolerable gradient norm of the perturbed objective in each ADMM iteration. This relaxes the requirement of an exact optimal solution in (Zhang and Zhu, 2016; Zhang et al., 2018a, b) while still providing a differential privacy guarantee for each training data sample during the iterative procedure. We also adopt the privacy framework of zCDP to compute a much tighter privacy loss estimate for PP-ADMM. In addition, a generalization performance guarantee of PP-ADMM is provided by measuring the number of data samples required at each agent to achieve a specific accuracy criterion.

Specifically, in each iteration, we perturb the subproblem (3) with the objective perturbation method used in previous studies (Zhang and Zhu, 2016; Zhang et al., 2018a, b), where a random linear term $(b_{i1})^T\theta_i$ is injected into the objective function, and $b_{i1}$ is a random vector drawn from a zero-mean Gaussian distribution $\mathcal{N}(0,\sigma_{i1}^2 I_d)$. Consequently, the objective function (3) used to update the primal variable $\theta_i^{t+1}$ becomes the following modified function:

(5)  $\mathcal{L}_{per}(\theta_i,D_i)=f_i(\theta_i)+(2\lambda_i^t+b_{i1})^T\theta_i+\eta\sum_{j\in\mathscr{B}_i}\|\frac{1}{2}(\theta_i^t+\theta_j^t)-\theta_i\|_2^2,$

where $f_i(\theta_i)=\frac{1}{|D_i|}\sum_{n=1}^{|D_i|}\mathscr{L}(y_i^n\theta_i^T x_i^n)+\frac{\hat{\lambda}}{N}\mathscr{R}(\theta_i)$. In order to ensure the DP guarantee, as pointed out in (Zhang and Zhu, 2016; Zhang et al., 2018a, b), each agent $i\in\mathscr{N}$ needs to find the optimal solution $\tilde{\theta}_i^{t+1}$ of the perturbed objective function $\mathcal{L}_{per}(\theta_i,D_i)$, i.e.,

(6)  $\tilde{\theta}_i^{t+1}=\underset{\theta_i}{\text{argmin}}~\mathcal{L}_{per}(\theta_i,D_i).$

However, the subproblem (6) may not be easy to solve to optimality in finite time. For instance, if we choose the logistic loss, the subproblem (6) does not admit an analytical solution due to the form of the logistic function. Especially when the problem dimension or the number of training samples is large, obtaining the optimal solution in every iteration might not be feasible.

Thus, we consider obtaining an approximate solution of the perturbed objective function $\mathcal{L}_{per}(\theta_i,D_i)$ to provide privacy guarantees when the optimal solution is not attainable.

Specifically, we approximately solve the perturbed problem until the norm of the gradient of $\mathcal{L}_{per}$ is within a predefined threshold $\beta$. However, due to the limitations of the objective perturbation method (Chaudhuri et al., 2011), releasing this inexact solution fails to provide a DP guarantee. We therefore perturb the approximate solution $\hat{\theta}_i^{t+1}$ with another random noise $b_{i2}$ drawn from the Gaussian distribution $\mathcal{N}(0,\sigma_{i2}^2 I_d)$ to “fuzz” the difference between $\hat{\theta}_i^{t+1}$ and the optimal solution $\tilde{\theta}_i^{t+1}$. Note that the noise variance $\sigma_{i2}^2$ depends on the maximum tolerable gradient norm $\beta$, which leads to a trade-off between the gradient norm bound and the difficulty of obtaining an approximate solution within that bound.

Algorithm 1 Plausible Private ADMM
1:  Input: datasets $\{D_i\}_{i=1}^N$; initial variables $\theta_i^0\in\mathbb{R}^d$ and $\lambda_i^0=0_d$; step size $\eta$; privacy parameters $\epsilon_{i1}$, $\delta_{i1}$, $\epsilon_{i3}$, $\rho_{i2}$; optimizer $\mathscr{O}(\cdot,\cdot):\mathscr{F}\times\bm{\beta}\to\mathbb{R}^d$ ($\mathscr{F}$ is the class of objectives, and $\bm{\beta}$ is the optimization accuracy, i.e., the gradient norm of the objective); gradient norm threshold $\beta\in\bm{\beta}$.
2:  Set $\epsilon_{i1}$, $\delta_{i1}$, $\epsilon_{i3}$, $\rho_{i2}>0$ such that $\epsilon_{i1}>\epsilon_{i3}$.
3:  Set regularizer parameter $\hat{\lambda}\geq\max_i\frac{2.8 N c_1}{(\epsilon_{i1}-\epsilon_{i3})|D_i|}$.
4:  for $t=0,\dots,T-1$ do
5:        for $i=1,\dots,N$ do
6:              Generate noise $b_{i1}\sim\mathcal{N}(0,\sigma_{i1}^2 I_d)$ with $\sigma_{i1}=2\sqrt{2\ln(1.25/\delta_{i1})}/(|D_i|\epsilon_{i3})$.
7:              Construct the perturbed objective function $\mathcal{L}_{per}(\theta_i,D_i)$ according to (5).
8:              Compute an approximate solution $\hat{\theta}_i^{t+1}=\mathscr{O}(\mathcal{L}_{per}(\theta_i,D_i),\beta)$.
9:              Generate noise $b_{i2}\sim\mathcal{N}(0,\sigma_{i2}^2 I_d)$ with $\sigma_{i2}=\beta/[\sqrt{2\rho_{i2}}(\frac{\hat{\lambda}}{N}+2\eta|\mathscr{B}_i|)]$.
10:              Perturb $\hat{\theta}_i^{t+1}$: $\theta_i^{t+1}=\hat{\theta}_i^{t+1}+b_{i2}$.
11:        end for
12:        for $i=1,\dots,N$ do
13:              Broadcast $\theta_i^{t+1}$ to all neighbors $j\in\mathscr{B}_i$.
14:        end for
15:        for $i=1,\dots,N$ do
16:              Update the local dual variable $\lambda_i^{t+1}=\lambda_i^t+\frac{\eta}{2}\sum_{j\in\mathscr{B}_i}(\theta_i^{t+1}-\theta_j^{t+1})$.
17:        end for
18:  end for

The key steps of PP-ADMM are summarized in Algorithm 1. The privacy parameters $(\epsilon_{i1},\delta_{i1})$ are used to perturb the objective function, while the parameter $\rho_{i2}$ is used to perturb the approximate solution. Moreover, the parameter $\epsilon_{i3}$, a portion of $\epsilon_{i1}$, is used to scale the noise injected into the objective function, and the remaining privacy budget $(\epsilon_{i1}-\epsilon_{i3})$ is allocated to setting the regularizer parameter. We also define an optimizer $\mathscr{O}(\cdot,\cdot):\mathscr{F}\times\bm{\beta}\to\mathbb{R}^d$, where $\mathscr{F}$ is the class of objectives and $\bm{\beta}$ is the optimization accuracy, i.e., the gradient norm of the objective. Each agent $i$ constructs the perturbed function $\mathcal{L}_{per}(\theta_i,D_i)$ with a Gaussian random vector $b_{i1}$ and finds an inexact solution $\hat{\theta}_i^{t+1}$ whose gradient norm is lower than $\beta$, i.e., $\hat{\theta}_i^{t+1}=\mathscr{O}(\mathcal{L}_{per}(\theta_i,D_i),\beta)$. After that, each agent $i$ generates a random Gaussian noise $b_{i2}$ and transmits $\theta_i^{t+1}=\hat{\theta}_i^{t+1}+b_{i2}$ to its neighbors $j\in\mathscr{B}_i$. Finally, each agent updates the local dual variable via $\lambda_i^{t+1}=\lambda_i^t+\frac{\eta}{2}\sum_{j\in\mathscr{B}_i}(\theta_i^{t+1}-\theta_j^{t+1})$.
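The per-agent computation in steps 6-10 can be sketched as follows (a simplified single-agent view under our own naming conventions; `f_i` is the local objective, `n_i` is $|D_i|$, and a generic BFGS solver with gradient tolerance $\beta$ stands in for the abstract optimizer $\mathscr{O}$):

```python
import numpy as np
from scipy.optimize import minimize

def pp_admm_local_step(theta_t, lam_t, neighbor_thetas, f_i, n_i, d,
                       eta, eps_i3, delta_i1, rho_i2, lam_hat, N, beta):
    """One PP-ADMM primal update for a single agent (Algorithm 1, steps 6-10)."""
    # Step 6: objective-perturbation noise b_{i1}
    sigma_i1 = 2 * np.sqrt(2 * np.log(1.25 / delta_i1)) / (n_i * eps_i3)
    b_i1 = np.random.normal(scale=sigma_i1, size=d)

    # Step 7: perturbed augmented Lagrangian, cf. (5)
    def L_per(th):
        consensus = sum(np.linalg.norm((theta_t + th_j) / 2 - th) ** 2
                        for th_j in neighbor_thetas)
        return f_i(th) + (2 * lam_t + b_i1) @ th + eta * consensus

    # Step 8: solve only until the gradient norm drops below beta
    theta_hat = minimize(L_per, theta_t, method="BFGS",
                         options={"gtol": beta}).x

    # Steps 9-10: output perturbation calibrated to beta and rho_{i2}
    B_i = len(neighbor_thetas)
    sigma_i2 = beta / (np.sqrt(2 * rho_i2) * (lam_hat / N + 2 * eta * B_i))
    return theta_hat + np.random.normal(scale=sigma_i2, size=d)
```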

3.1. Privacy Analysis

Here, we provide the privacy guarantee of PP-ADMM (Algorithm 1) in the following two theorems. Due to space limitations, we only provide a proof sketch of Theorem 3.1; the detailed proof can be found in the Appendix.

Theorem 3.1.

PP-ADMM in Algorithm 1 satisfies $\rho_i$-zCDP for each agent $i$ with $\rho_i=T(\rho_{i1}+\rho_{i2})$, where $\rho_{i1}=\epsilon_{i1}^2/(4\ln(1/\delta_{i1}))$ and $\rho_{i2}>0$ is the privacy budget for perturbing the approximate solution.

Proof Sketch.

To establish $\rho_i$-zCDP for each agent $i$ at iteration $t+1$ in Algorithm 1, we first divide the output of iteration $t+1$ into two parts. The first part is obtaining the optimal solution $\tilde{\theta}_i^{t+1}$ of the perturbed objective function $\mathcal{L}_{per}(\theta_i,D_i)$, and the second part is obtaining the approximate solution with Gaussian noise, $\theta_i^{t+1}$. We show that the first part provides $\rho_{i1}$-zCDP with $\rho_{i1}=\epsilon_{i1}^2/(4\ln(1/\delta_{i1}))$, and that releasing an approximate solution in the second part is $\rho_{i2}$-zCDP. By the composition of zCDP in Lemma 2.6, releasing the perturbed primal variable $\theta_i^{t+1}$ at iteration $t+1$ provides $(\rho_{i1}+\rho_{i2})$-zCDP. Over $T$ iterations, the total privacy loss for each agent $i$ is bounded by $\rho_i=T(\rho_{i1}+\rho_{i2})$. ∎

We then give the following parallel composition theorem of $\rho$-zCDP, which provides an overall characterization of the total privacy loss of distributed algorithms.

Lemma 3.2 (Parallel Composition (Yu et al., 2019)).

Suppose that a mechanism $\mathcal{M}$ consists of a sequence of $k$ adaptive mechanisms $\mathcal{M}_1,\cdots,\mathcal{M}_k$, where each $\mathcal{M}_i:\prod_{j=1}^{i-1}\mathcal{R}_j\times\mathcal{D}\to\mathcal{R}_i$ satisfies $\rho_i$-zCDP. Let $\mathbb{D}_1,\mathbb{D}_2,\cdots,\mathbb{D}_k$ be the result of a randomized partition of the input domain $\mathbb{D}$. Then the mechanism $\mathcal{M}(D)=(\mathcal{M}_1(D\cap\mathbb{D}_1),\cdots,\mathcal{M}_k(D\cap\mathbb{D}_k))$ satisfies $(\max_i\rho_i)$-zCDP.

Based on Lemma 3.2, we can directly obtain the total privacy loss of PP-ADMM given as follows.

Theorem 3.3.

PP-ADMM in Algorithm 1 satisfies $\rho$-zCDP with $\rho=\max_i\rho_i$, and satisfies $(\epsilon,\delta)$-DP with $\epsilon=\rho+2\sqrt{\rho\ln(1/\delta)}$.
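Theorems 3.1 and 3.3 together give a simple privacy accountant; a sketch (our own helper) that returns the total zCDP bound and the converted $(\epsilon,\delta)$-DP guarantee is:

```python
import numpy as np

def pp_admm_privacy(T, eps_i1, delta_i1, rho_i2, delta):
    """Total privacy of PP-ADMM over T iterations (Theorems 3.1 and 3.3).

    eps_i1, delta_i1, rho_i2: per-agent parameter lists of equal length."""
    rho_agents = [T * (e ** 2 / (4 * np.log(1 / d)) + r)
                  for e, d, r in zip(eps_i1, delta_i1, rho_i2)]
    rho = max(rho_agents)                             # parallel composition
    eps = rho + 2 * np.sqrt(rho * np.log(1 / delta))  # Lemma 2.8
    return rho, eps
```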

3.2. Sample Complexity Analysis

We next measure the generalization performance of PP-ADMM by focusing on the ERM problem given in Section 2. We assume that the data samples of each agent $i$ are drawn from a data distribution $\mathscr{P}$. The expected loss of classifier $\theta_i^t$ at iteration $t$ is defined as

$\mathbb{L}(\theta_i^t)=\mathbb{E}_{(x,y)\sim\mathscr{P}}\left[\mathscr{L}(y(\theta_i^t)^T x)\right].$

Following an analysis similar to (Chaudhuri et al., 2011; Zhang and Zhu, 2016), we first introduce a reference classifier $\theta_{ref}$ with expected loss $\mathbb{L}(\theta_{ref})$, and we then measure the generalization performance by the number of samples $|D_i|$ required at each agent to achieve $\mathbb{L}(\theta_i^t)\leq\mathbb{L}(\theta_{ref})+a_{acc}$, where $a_{acc}$ is the generalization error.

3.2.1. PP-ADMM without Noise

Here, we consider the learning performance at all iterations rather than only at the final output. Let the intermediate classifier at iteration $t+1$ be $\hat{\theta}_{i,non}^{t+1}=\mathscr{O}(\mathcal{L}_{non}(\theta_i,D_i),\beta)$. Note that $\{\hat{\theta}_{i,non}^{t+1}\}$ is a sequence of non-private classifiers obtained without adding perturbations. Let $\theta^*=\underset{\theta_i}{\text{argmin}}~f_i(\theta_i,D_i)$ be the optimal output of PP-ADMM without noise. The sequence $\{\hat{\theta}_{i,non}^{t+1}\}$ is bounded, and $\hat{\theta}_{i,non}^{t+1}$ converges to $\theta^*$ as $t\to\infty$. Therefore, there exists a constant $\Delta_{i,non}^{t+1}$ at iteration $t+1$ such that $\mathbb{L}(\hat{\theta}_{i,non}^{t+1})\leq\mathbb{L}(\theta^*)+\Delta_{i,non}^{t+1}$. We then have the following result; the detailed proof can be found in the supplemental material.

Theorem 3.4.

Consider a regularized ERM problem with $\mathscr{R}(\theta)=\frac{1}{2}\|\theta\|_2^2$, let $\theta_{ref}$ be the reference classifier for all agents, and let $\{\hat{\theta}_{i,non}^{t+1}\}$ be the sequence of outputs of PP-ADMM without added noise. If the number of samples at agent $i$ satisfies

$|D_i|\geq V\max_t\left\{\frac{\log(1/\xi)}{\frac{(a_{acc}-\Delta_{i,non}^{t+1})^2}{2\|\theta_{ref}\|_2^2}-(1+a)\beta}\right\}$

for some constant $V$, then $\hat{\theta}_{i,non}^{t+1}$ satisfies

$\Pr\left[\mathbb{L}(\hat{\theta}_{i,non}^{t+1})\leq\mathbb{L}(\theta_{ref})+a_{acc}\right]\geq 1-\xi$

with $a_{acc}\geq\Delta_{i,non}^{t+1}$.

Remark 1.

As we can see from Theorem 3.4, the number of data samples $|D_i|$ depends on the $\ell_2$-norm of the reference classifier, $\|\theta_{ref}\|_2^2$, and on the parameter $\beta$ that bounds the optimization accuracy of the non-private intermediate classifier. The result shows that if $|D_i|\geq V\max_t\left\{\frac{\log(1/\xi)}{\frac{(a_{acc}-\Delta_{i,non}^{t+1})^2}{2\|\theta_{ref}\|_2^2}-(1+a)\beta}\right\}$, each agent's non-private intermediate classifier has an additional error of at most $a_{acc}$ compared to any classifier with norm $\|\theta_{ref}\|_2^2$. Moreover, if $\beta=0$, the bound reduces to $|D_i|\geq V\max_t\left\{\frac{2\|\theta_{ref}\|_2^2\log(1/\xi)}{(a_{acc}-\Delta_{i,non}^{t+1})^2}\right\}$, the same as given in (Zhang and Zhu, 2016). This shows that the lower the optimization accuracy of the non-private intermediate classifier, the more samples are required to achieve the same accuracy.

3.2.2. PP-ADMM

We now give the sample complexity of the PP-ADMM algorithm. Similar to the analysis of PP-ADMM without noise, we consider bounding the generalization error of the intermediate classifier $\theta_i^{t+1}$ of each agent $i$ at all iterations. In order to compare the private classifier $\theta_i^{t+1}$ with a reference classifier $\theta_{ref}$, we follow the same strategy as in (Zhang and Zhu, 2016). We define a new objective $f_i^{new}(\theta_i,D_i)=f_i(\theta_i,D_i)+b_{i1}^T\theta_i$, so that running PP-ADMM is equivalent to solving a new optimization problem in which each agent $i$ performs the local minimization $\theta_i^{t+1}=\mathscr{O}(f_i^{new}(\theta_i,D_i),\beta)+b_{i2}$. The sequence of outputs $\{\theta_i^{t+1}\}$ is bounded, and $\theta_i^{t+1}$ converges to a fixed point $\theta^*_{new}$ as $t\to\infty$. Thus, there exists a constant $\Delta_{i,new}^{t+1}$ at iteration $t+1$ such that $\mathbb{L}(\theta_i^{t+1})\leq\mathbb{L}(\theta^*_{new})+\Delta_{i,new}^{t+1}$. We then give the following result; the detailed proof can be found in the supplemental material.

Theorem 3.5.

Consider a regularized ERM problem with $\mathscr{R}(\theta)=\frac{1}{2}\|\theta\|_2^2$, let $\theta_{ref}$ be the reference classifier for all agents, and let $\{\theta_i^{t+1}\}$ be the sequence of outputs of PP-ADMM. If the number of samples at agent $i$ satisfies, for some constant $V$,

$|D_i|\geq V\max_t\left\{\frac{\log(1/\xi)}{\frac{(a_{acc}-\Delta_{i,new}^{t+1})^2}{2\|\theta_{ref}\|_2^2}-(1+a)(\beta+\mathscr{H})}\right\}$

with $\mathscr{H}=\frac{\sigma_{i2}(a_{acc}-\Delta_{i,new}^{t+1})\sqrt{2d\log\frac{1}{\xi}}}{\|\theta_{ref}\|_2^2}+2\sigma_{i1}^2 d\log\frac{1}{\xi}$, then $\theta_i^{t+1}$ satisfies

$\Pr\left[\mathbb{L}(\theta_i^{t+1})\leq\mathbb{L}(\theta_{ref})+a_{acc}\right]\geq 1-3\xi$

with $a_{acc}\geq\Delta_{i,new}^{t+1}$.

Remark 2.

Compared to Theorem 3.4, we can see that in Theorem 3.5, the privacy constraints impose an additional term $\mathscr{H}=\sigma_{i2}(a_{acc}-\Delta_{i,new}^{t+1})\sqrt{2d\log\frac{1}{\xi}}/\|\theta_{ref}\|_2^2+2\sigma_{i1}^2 d\log\frac{1}{\xi}$. If both noise variances $\sigma_{i1}$ and $\sigma_{i2}$ are zero, the number of required samples $|D_i|$ reduces to the result in Theorem 3.4. Moreover, the additional term $\mathscr{H}$ shows that the higher the feature dimension, the more noise is added, and hence more data samples are required to achieve the same accuracy.

4. Improved Plausible Private ADMM

In this section, we present an improved version of PP-ADMM, called Improved Plausible Private ADMM (IPP-ADMM), which leverages the sparse vector technique (SVT) to improve performance and reduce the communication cost of PP-ADMM. Compared with current differentially private ADMM algorithms (Zhang and Zhu, 2016; Zhang et al., 2018a, b), the proposed PP-ADMM algorithm ensures a DP guarantee without requiring the optimal solution in each ADMM iteration; however, the primal variable is still updated using the local data in every iteration and frequently broadcast to neighboring agents, which causes the privacy loss to accumulate unavoidably over the iterations and compromises accuracy during the whole training procedure.

Hence, we adopt SVT, which can output some local computational results without paying any privacy budget, to check whether the current approximate solution differs sufficiently from that of the previous iteration. The difference is quantified by a quality function, $f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})$, based on the change in the value of the local function between the previously transmitted primal variable and the current approximate solution. If a sufficient level of difference $\alpha$ is achieved, the agent transmits the current approximate solution with Gaussian noise to its neighbors. Intuitively, if the difference between the current approximate solution $\hat{\theta}_i^{t+1}$ and the previously transmitted $\theta_i^t$ is small, then using either one does not help the convergence of the iterative process, so skipping the transmission reduces the communication cost.

However, one difficulty in using SVT is that there is no known a priori bound on the query (i.e., the quality function) $f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})$. To bound its sensitivity, we apply a clipping method to the loss function $\mathscr{L}(\cdot)$: given a fixed clipping threshold $C_{loss}$, we compute the value of the loss function $\mathscr{L}(\cdot)$ on each local data sample, clip each value to at most $C_{loss}$, and compute the value of $f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})$ from the clipped values. We denote this loss function clipping procedure as Clip; a sketch is given below.
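A sketch of the clipped quality function (our own naming; `loss` is the per-sample loss $\mathscr{L}$ and `reg` the regularizer $\mathscr{R}$) is:

```python
import numpy as np

def clipped_quality(theta_prev, theta_hat, X, y, loss, reg,
                    C_loss, lam_hat, N):
    """Clip[f_i(theta_i^t) - f_i(theta_hat_i^{t+1})]: each per-sample loss
    is clipped to at most C_loss before averaging, so the query has
    sensitivity at most 2 * C_loss (Lemma 4.2)."""
    def f_clipped(theta):
        per_sample = np.minimum(loss(y * (X @ theta)), C_loss)
        return per_sample.mean() + (lam_hat / N) * reg(theta)
    return f_clipped(theta_prev) - f_clipped(theta_hat)
```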

Algorithm 2 Improved Plausible Private ADMM Run by Agent $i$
1:  Input: dataset $D_i$; initial variables $\theta_i^0\in\mathbb{R}^d$ and $\lambda_i^0=0_d$; threshold $\alpha$; maximum number of primal variables that can be broadcast, $\mathbf{c}$; loss function clipping threshold $C_{loss}$; step size $\eta$; privacy parameters $\epsilon_{i1}$, $\delta_{i1}$, $\epsilon_{i3}$, $\rho_{i2}$, $\epsilon_1$, $\epsilon_2$; optimizer $\mathscr{O}(\cdot,\cdot):\mathscr{F}\times\bm{\beta}\to\mathbb{R}^d$ ($\mathscr{F}$ is the class of objectives, and $\bm{\beta}$ is the optimization accuracy, i.e., the gradient norm of the objective); gradient norm threshold $\beta\in\bm{\beta}$.
2:  Set $\epsilon_{i1}$, $\delta_{i1}$, $\epsilon_{i3}$, $\rho_{i2}$, $\epsilon_1$, $\epsilon_2>0$ such that $\epsilon_{i1}>\epsilon_{i3}$.
3:  Set regularizer parameter $\hat{\lambda}\geq\max_i\frac{2.8 N c_1}{(\epsilon_{i1}-\epsilon_{i3})|D_i|}$.
4:  $count_i=0$.
5:  for $t=0,\dots,T-1$ do
6:        Generate noise $b_{i1}\sim\mathcal{N}(0,\sigma_{i1}^2 I_d)$ with $\sigma_{i1}=2\sqrt{2\ln(1.25/\delta_{i1})}/(|D_i|\epsilon_{i1})$.
7:        Construct the perturbed objective function $\mathcal{L}_{per}(\theta_i,D_i)$ according to (5).
8:        Compute an approximate solution $\hat{\theta}_i^{t+1}=\mathscr{O}(\mathcal{L}_{per}(\theta_i,D_i),\beta)$.
9:        if $\texttt{Clip}\left[f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})\right]+{\rm Lap}(\frac{4\mathbf{c}C_{loss}}{\epsilon_2})\geq\alpha+{\rm Lap}(\frac{2\mathbf{c}C_{loss}}{\epsilon_1})$ then
10:              $count_i=count_i+1$; abort if $count_i>\mathbf{c}$.
11:              Generate noise $b_{i2}\sim\mathcal{N}(0,\sigma_{i2}^2 I_d)$ with $\sigma_{i2}=\beta/[\sqrt{2\rho_{i2}}(\frac{\hat{\lambda}}{N}+2\eta|\mathscr{B}_i|)]$.
12:              Perturb $\hat{\theta}_i^{t+1}$: $\theta_i^{t+1}=\hat{\theta}_i^{t+1}+b_{i2}$.
13:              Broadcast $\theta_i^{t+1}$ to all neighbors $j\in\mathscr{B}_i$.
14:        else
15:              Let $\theta_i^{t+1}=\theta_i^t$.
16:        end if
17:        if $\theta_j^{t+1}$ is not received from neighbor $j\in\mathscr{B}_i$ then
18:              Replace $\theta_j^{t+1}$ with $\theta_j^t$.
19:        else
20:              Keep $\theta_j^{t+1}$.
21:        end if
22:        Update the local dual variable $\lambda_i^{t+1}=\lambda_i^t+\frac{\eta}{2}\sum_{j\in\mathscr{B}_i}(\theta_i^{t+1}-\theta_j^{t+1})$.
23:  end for

The complete procedure of IPP-ADMM for a single agent is shown in Algorithm 2. The privacy parameters $\epsilon_1$ and $\epsilon_2$ are allocated to perturbing the threshold $\alpha$ and the quality function, respectively. In each iteration, agent $i$ first constructs the perturbed function $\mathcal{L}_{per}(\theta_i,D_i)$ with a Gaussian random vector $b_{i1}$ and finds an inexact solution $\hat{\theta}_i^{t+1}$ whose gradient norm is lower than $\beta$, i.e., $\hat{\theta}_i^{t+1}=\mathscr{O}(\mathcal{L}_{per}(\theta_i,D_i),\beta)$. The agent then applies the clipping method Clip to the quality function $f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})$ with clipping threshold $C_{loss}$ to limit its sensitivity. Further, the agent uses SVT to check whether the difference between the approximate solution $\hat{\theta}_i^{t+1}$ and $\theta_i^t$ is below the noisy threshold $\hat{\alpha}=\alpha+{\rm Lap}(\frac{2\mathbf{c}C_{loss}}{\epsilon_1})$ via the noisy quality function $\texttt{Clip}\left[f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})\right]+{\rm Lap}(\frac{4\mathbf{c}C_{loss}}{\epsilon_2})$. If so, agent $i$ does not transmit any computational result and sets $\theta_i^{t+1}=\theta_i^t$; otherwise, it generates random noise $b_{i2}\sim\mathcal{N}(0,\sigma_{i2}^2 I_d)$ with $\sigma_{i2}=\beta/[\sqrt{2\rho_{i2}}(\frac{\hat{\lambda}}{N}+2\eta|\mathscr{B}_i|)]$ and transmits $\theta_i^{t+1}=\hat{\theta}_i^{t+1}+b_{i2}$ to its neighbors. Moreover, each agent maintains a counter $count_i$ to bound the total number of broadcasts of primal variables during the whole interactive process: if the predefined transmission number $\mathbf{c}$ ($\mathbf{c}\leq T$) is exceeded, agent $i$ stops transmitting even when the condition in Line 9 of Algorithm 2 is satisfied. Accordingly, if agent $i$ does not receive $\theta_j^{t+1}$ from a neighbor $j\in\mathscr{B}_i$, it sets $\theta_j^{t+1}=\theta_j^t$. Finally, each agent updates the local dual variable via $\lambda_i^{t+1}=\lambda_i^t+\frac{\eta}{2}\sum_{j\in\mathscr{B}_i}(\theta_i^{t+1}-\theta_j^{t+1})$.
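The SVT test in step 9 reduces to one noisy comparison per iteration; a sketch (our own helper, consuming the clipped quality value computed by the Clip procedure above) is:

```python
import numpy as np

def should_broadcast(quality, alpha, C_loss, eps1, eps2, c):
    """Algorithm 2, step 9: noisy comparison of the clipped quality function
    (sensitivity 2 * C_loss) against the noisy threshold alpha."""
    noisy_threshold = alpha + np.random.laplace(scale=2 * c * C_loss / eps1)
    noisy_quality = quality + np.random.laplace(scale=4 * c * C_loss / eps2)
    return noisy_quality >= noisy_threshold
```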

4.1. Privacy Analysis

We provide the privacy guarantee of IPP-ADMM (Algorithm 2) in the following theorem.

Theorem 4.1.

IPP-ADMM in Algorithm 2 satisfies $\rho'_i$-zCDP for each agent $i$ with $\rho'_i=\rho'_1+\mathbf{c}(\rho_{i1}+\rho_{i2})$, where $\rho'_1=\frac{(\epsilon_1+\epsilon_2)^2}{2}$, $\rho_{i1}=\epsilon_{i1}^2/(4\ln(1/\delta_{i1}))$, $\rho_{i2}>0$ is the privacy budget for perturbing the approximate solution, and $\mathbf{c}$ ($\mathbf{c}<T$) is the maximum number of primal variables that can be broadcast. Moreover, the total privacy guarantee of IPP-ADMM is $\rho'$-zCDP with $\rho'=\max_i\rho'_i$.

Proof.

To establish $\rho'_i$-zCDP for each agent $i$ in Algorithm 2, we divide the procedure of the algorithm into two parts. The first part uses SVT to compare the noisy threshold with the perturbed query answer (i.e., the value of the quality function) to check the quality of the approximate solution obtained in Step 8 of Algorithm 2. The second part shares the approximate solution, perturbed with Gaussian noise, whenever its quality is above the threshold. We prove that the DP mechanism used in the first part provides $\rho'_1$-zCDP (Lemma 4.3). Moreover, at each iteration, the privacy budget spent on releasing an approximate solution in the second part is $(\rho_{i1}+\rho_{i2})$-zCDP (Theorem 3.1). Then, using the composition of zCDP in Lemma 2.6 and accounting for at most $\mathbf{c}$ broadcasts of primal variables, the privacy guarantee of IPP-ADMM for each agent $i$ is $\rho'_i=\rho'_1+\mathbf{c}(\rho_{i1}+\rho_{i2})$. Lastly, the total privacy guarantee of IPP-ADMM, i.e., $\rho'$-zCDP with $\rho'=\max_i\rho'_i$, follows from the parallel composition in Lemma 3.2. ∎

Before presenting the privacy guarantee of the first part, i.e., comparing the noisy threshold with the perturbed query answer to check the quality of the approximate solution, we first give the sensitivity of the clipped quality function.

Lemma 4.2.

Given a clipping threshold $C_{loss}$ of the loss function $\mathscr{L}(\cdot)$, the sensitivity of the quality function $f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})$ is at most $2C_{loss}$, where $f_i(\theta_i)=\frac{1}{|D_i|}\sum_{n=1}^{|D_i|}\mathscr{L}(y_i^n\theta_i^T x_i^n)+\frac{\hat{\lambda}}{N}\mathscr{R}(\theta_i)$.

Proof.

Fix a pair of adjacent datasets $D_i$ and $\hat{D}_i$, and assume that only the first data point differs, i.e., $(x_i^1,y_i^1)$ versus $(\hat{x}_i^1,\hat{y}_i^1)$. According to the definition of $L_1$-sensitivity, we have

$\Delta_f = \left\|f_i(\theta_i^t,D_i)-f_i(\hat{\theta}_i^{t+1},D_i)-f_i(\theta_i^t,\hat{D}_i)+f_i(\hat{\theta}_i^{t+1},\hat{D}_i)\right\|_1$
$= \left\|\mathscr{L}(y_i^1(\theta_i^t)^T x_i^1)-\mathscr{L}(\hat{y}_i^1(\theta_i^t)^T\hat{x}_i^1)-\left(\mathscr{L}(y_i^1(\hat{\theta}_i^{t+1})^T x_i^1)-\mathscr{L}(\hat{y}_i^1(\hat{\theta}_i^{t+1})^T\hat{x}_i^1)\right)\right\|_1$
$\leq \left\|\mathscr{L}(y_i^1(\theta_i^t)^T x_i^1)-\mathscr{L}(\hat{y}_i^1(\theta_i^t)^T\hat{x}_i^1)\right\|_1+\left\|\mathscr{L}(y_i^1(\hat{\theta}_i^{t+1})^T x_i^1)-\mathscr{L}(\hat{y}_i^1(\hat{\theta}_i^{t+1})^T\hat{x}_i^1)\right\|_1$
$\leq 2C_{loss}.$ ∎

Then we show the privacy guarantee of the first part in the following lemma.

Lemma 4.3.

Given the maximum number of primal variables that can be broadcast, $\mathbf{c}$, using SVT to check whether the approximate solution is above the threshold $\alpha$ provides $\rho'_1$-zCDP with $\rho'_1=\frac{(\epsilon_1+\epsilon_2)^2}{2}$.

Proof.

During the whole training process, each agent answers a stream of queries (i.e., the clipped quality functions $\texttt{Clip}[f_i(\theta_i^t)-f_i(\hat{\theta}_i^{t+1})]$) with sensitivity $2C_{loss}$, comparing them with the noisy threshold $\alpha+{\rm Lap}(\frac{2\mathbf{c}C_{loss}}{\epsilon_1})$. According to Theorem 1 in (Lyu et al., 2017), this procedure satisfies $(\epsilon_1+\epsilon_2)$-DP, and by Lemma 2.7 it also satisfies $\frac{(\epsilon_1+\epsilon_2)^2}{2}$-zCDP. ∎

5. Numerical Experiments

Datasets. Experiments are performed on three benchmark datasets: Adult, US, and Brazil (from http://archive.ics.uci.edu/ml/datasets/Adult and http://international.ipums.org). Adult has 48,842 data samples and 41 features, and the task is to predict whether an annual income exceeds $50k. US has 40,000 records and 58 features, and the task is to predict whether an individual's annual income exceeds $25k. Brazil has 38,000 samples and 53 features, and the task is to predict whether an individual's monthly income exceeds $300.

Data preprocessing. We follow the same preprocessing procedure as (Zhang et al., 2018a). We first normalize each attribute so that its maximum value is 1, and normalize each sample so that its $L_2$-norm is at most 1. We map the label column to $\{-1,1\}$. In each simulation, we randomly sample 35,000 records for training and divide them among the $N$ parties, so each party holds $|D_i|=35000/N$ data samples. The remaining records are used as testing data.

Baselines. We compare our proposed algorithms against four baselines: (i) DVP (Zhang and Zhu, 2016), a dual variable perturbation method, where the dual variable of each agent in each ADMM iteration is perturbed by Gamma noise; (ii) M-ADMM (Zhang et al., 2018a), a penalty perturbation approach, where each agent's penalty variable is perturbed by Gamma noise in each ADMM iteration; (iii) R-ADMM (Zhang et al., 2018b), which is based on the penalty perturbation approach and the re-utilization of the previous iteration's results to save privacy loss; and (iv) non-private decentralized ADMM without added noise. Note that the privacy guarantees of DVP, M-ADMM, and R-ADMM hold only when the optimal solution of the perturbed subproblem is obtained in each iteration. For a fair comparison, we adopt a Newton solver to obtain the optimal solution in each iteration. We also provide sharpened, tight privacy loss bounds for the above private ADMM algorithms under the zCDP framework.

Setup. We adopt the logistic loss $\mathscr{L}(y_i^n\theta_i^T x_i^n)=\log(1+\exp(-y_i^n\theta_i^T x_i^n))$ as the loss function; its derivative $\mathscr{L}'(\cdot)$ is bounded with $|\mathscr{L}'(\cdot)|\leq 1$ and is $c_1$-Lipschitz with $c_1=1/4$. We also let $\mathscr{R}(\theta_i)=\frac{1}{2}\|\theta_i\|_2^2$. We evaluate accuracy by the classification error rate over the testing set, defined as $Error~rate=\frac{Number~of~incorrect~predictions}{Total~number~of~predictions~made}$, and the convergence of the algorithms by the average loss over the training samples, given by $\mathscr{L}_t=\frac{1}{N}\sum_{i=1}^N\frac{1}{|D_i|}\sum_{n=1}^{|D_i|}\mathscr{L}(y_i^n(\theta_i^t)^T x_i^n)$. We report the mean and standard deviation of the average loss; the smaller the average loss, the higher the accuracy.
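For concreteness, the evaluation metrics above can be computed as in the following sketch (labels are assumed to be in $\{-1,+1\}$):

```python
import numpy as np

def logistic_loss(y, X, theta):
    """Per-sample logistic loss L(y * theta^T x)."""
    return np.log1p(np.exp(-y * (X @ theta)))

def error_rate(y, X, theta):
    """Classification error rate over a labeled set."""
    preds = np.where(X @ theta >= 0, 1, -1)
    return np.mean(preds != y)

def average_loss(agents):
    """L_t: mean over agents of each agent's mean training loss;
    agents is a list of (y_i, X_i, theta_i^t) triples."""
    return np.mean([logistic_loss(y, X, th).mean() for y, X, th in agents])
```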

Parameter settings. We consider a randomly generated undirected network with $N=5$ agents, and we fix the step size $\eta=0.5$ and the total iteration number $T=30$. We also set the maximum number of primal variables that can be shared to $\mathbf{c}=15$. Moreover, to maximize the utility of SVT, we follow the ratio between $\epsilon_1$ and $\epsilon_2$ in (Lyu et al., 2017), i.e., $\epsilon_1:\epsilon_2=1:(2\mathbf{c})^{2/3}$. In all experiments, we set $\delta=10^{-4}$ and $\epsilon\in\{0.5,1,1.5,2,10\}$.

Figure 1. Effects of privacy budget splitting (left: training error, right: testing error).

Figure 2. Effects of optimization accuracy $\beta$ (left: training error, right: testing error).

Figure 3. Effects of clipping threshold $C_{loss}$ (left: training error, right: testing error).

Figure 4. Effects of $\alpha$ (left: training error, right: testing error).
Figure 5. Trade-off between classification error rate and privacy on the Adult dataset.
Figure 6. Convergence comparisons on the Adult dataset (left: $\epsilon=1$, middle: $\epsilon=2$, right: $\epsilon=10$).

Figure 7. Classification error rate comparisons on the Brazil and US datasets ((a) Brazil, (b) US).

Impacts of parameters. In this set of experiments on the Adult dataset, we examine the effects of privacy budget splitting and optimization accuracy (i.e., the gradient norm threshold $\beta$) on the performance of PP-ADMM, and of the loss clipping threshold $C_{loss}$ and the quality function significance threshold $\alpha$ on the performance of IPP-ADMM. Specifically, we vary each parameter separately while keeping the rest constant, and report the impact on training and testing accuracy.

For the privacy budget splitting of PP-ADMM, we first convert the overall privacy budget parameters $(\epsilon,\delta)$ to $\rho_{total}=\frac{\epsilon^2}{4\ln(1/\delta)}$. We set $\rho_{i1}=\frac{\rho_{total}}{T}(1-splits)$ and $\rho_{i2}=\frac{\rho_{total}}{T}\cdot splits$, where $splits$ denotes the fraction of the per-iteration budget allocated to $\rho_{i2}$. By tuning $splits$, we can find a good trade-off between the privacy budget for perturbing the objective and that for perturbing the approximate solution. In addition, we compute $\epsilon_{i1}=\rho_{i1}+2\sqrt{\rho_{i1}\ln(1/\delta_{i1})}$ with $\delta_{i1}=10^{-4}$, and set $\epsilon_{i3}=0.99\epsilon_{i1}$ to dedicate most of the budget to reducing the noise added to the objective and to increasing the influence of the regularization. Figure 1 shows the effects of privacy budget splitting on the performance of PP-ADMM with $\beta=10^{-6}$. As $splits$ decreases, i.e., less privacy budget is allocated to perturbing the approximate solution, training and testing accuracy improve. We therefore set $splits=0.001$ to achieve a good trade-off between the amount of noise added to the objective and to the approximate solution.
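The budget-splitting computation just described is a few lines of arithmetic; a sketch under the stated settings ($\delta_{i1}=10^{-4}$, $\epsilon_{i3}=0.99\epsilon_{i1}$) is:

```python
import numpy as np

def split_budget(eps, delta, T, splits, delta_i1=1e-4):
    """Per-iteration budget split used in the experiments."""
    rho_total = eps ** 2 / (4 * np.log(1 / delta))   # (eps, delta) -> zCDP
    rho_i1 = (rho_total / T) * (1 - splits)          # objective perturbation
    rho_i2 = (rho_total / T) * splits                # output perturbation
    eps_i1 = rho_i1 + 2 * np.sqrt(rho_i1 * np.log(1 / delta_i1))
    eps_i3 = 0.99 * eps_i1   # most of eps_i1 goes to the objective noise
    return rho_i2, eps_i1, eps_i3
```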

Figure 2 shows how classification accuracy changes as $\beta$ varies with $splits=0.001$ fixed. The parameter $\beta$ controls both the optimization accuracy in each iteration of the PP-ADMM training process and the amount of noise used to perturb the approximate solution. As the figure shows, because of the randomness that the added noise introduces into the objective, solving the noisy objective perfectly in each iteration (i.e., a very small $\beta$) may not help the final performance. Conversely, when $\beta$ is too large, a large amount of noise is added to perturb the approximate solution, which also degrades performance. In our experiments, we thus fix $\beta=10^{-3.5}$, which achieves the lowest training/testing error rates.
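As a concrete illustration of this stopping rule, the sketch below runs gradient descent on the perturbed local objective until the gradient norm falls below $\beta$, and then releases the approximate solution with Gaussian noise. The solver, step size, and function names are our own illustrative assumptions; the noise calibration in the comment uses the standard Gaussian mechanism bound $\rho=\Delta^{2}/(2\sigma^{2})$ for zCDP (Bun and Steinke, 2016), not code from the paper.

import numpy as np

def approx_solve(grad, x0, beta, lr=0.1, max_iter=10000):
    # Gradient descent on the (already perturbed) local objective;
    # stop once the gradient norm is below the accuracy threshold beta.
    x = x0.copy()
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= beta:
            break
        x -= lr * g
    return x

def perturb_release(x, sensitivity, rho_i2):
    # Output perturbation: Gaussian noise calibrated so that releasing
    # x satisfies rho_i2-zCDP, via rho = sensitivity^2 / (2 sigma^2).
    sigma = sensitivity / np.sqrt(2.0 * rho_i2)
    return x + np.random.normal(0.0, sigma, size=x.shape)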

The IPP-ADMM algorithm has two threshold parameters, $C_{loss}$ and $\alpha$, which bound the sensitivity of the quality function and the value of the quality function, respectively. If the clipping threshold $C_{loss}$ is set to a small value, it significantly reduces the sensitivity, but at the same time it causes much information loss in the estimation of the quality function. On the other hand, if $C_{loss}$ is large, the sensitivity becomes large, which results in adding too much noise to the estimation. Thus, values of $C_{loss}$ that are either too large or too small negatively affect the use of SVT to check whether the current approximate solution differs enough from that of the previous iteration (see the sketch after this paragraph). As we can see from Figure 3, $C_{loss}=2$ achieves a good trade-off between high information loss and large sensitivity. In Figure 4, we fix the clipping threshold $C_{loss}=2$ and vary $\alpha$ from $10^{-3}$ to $10$ to see its effect on the performance. Although a large value of $\alpha$ may reduce the release of low-quality approximate solutions and thus the communication cost, we observe that it also degrades learning performance. We therefore choose $\alpha=10^{-3}$ in our experiments, which achieves the lowest training/testing error rates.
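For intuition, a minimal sketch of this SVT-style check follows, in the standard form analyzed by Lyu et al. (2017): both the threshold $\alpha$ and the quality score are perturbed with Laplace noise calibrated to the score's sensitivity, and the solution is broadcast only if the noisy score clears the noisy threshold. The exact quality function and all names here are illustrative assumptions rather than the paper's code; the clipping at $C_{loss}$ mirrors how the sensitivity is bounded in the text.

import numpy as np

def svt_should_release(loss_prev, loss_curr, alpha, C_loss, eps_svt, sensitivity):
    # Quality score: clipped improvement in average loss of the current
    # approximate solution over the previously released one.
    q = np.mean(np.clip(loss_prev, 0.0, C_loss) - np.clip(loss_curr, 0.0, C_loss))
    # Standard SVT noise scales for answering one above-threshold query.
    noisy_threshold = alpha + np.random.laplace(0.0, 2.0 * sensitivity / eps_svt)
    noisy_q = q + np.random.laplace(0.0, 4.0 * sensitivity / eps_svt)
    # Broadcast to neighbors only if the difference is significant.
    return noisy_q >= noisy_threshold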

Performance comparisons. We also present the trade-off between classification error rate and privacy cost in Figure 5, where we measure the privacy cost each algorithm needs to reach specified testing error rates. Figure 5 illustrates that both of our methods have consistently lower privacy cost than the baseline algorithms. Compared with PP-ADMM, IPP-ADMM saves further privacy cost by limiting the number of released low-quality computational results. Additionally, we inspect the convergence performance (i.e., average loss) of the different algorithms under the same budgets, as shown in Fig. 6. We observe that as the budget $\epsilon$ decreases from 10 to 1, the average loss values of the baseline algorithms increase, which matches the simulation results in (Zhang et al., 2018a, b; Ding et al., 2019c). Although we also analyze the baseline algorithms using zCDP to provide tight privacy bounds, using Gaussian noise instead of Gamma noise is likely more beneficial to the performance, since it usually yields at least a $\sqrt{d}$-fold improvement in the empirical risk bound (Kifer et al., 2012), where $d$ is the dimension of the training model. Our proposed algorithms continue to outperform the baseline algorithms significantly.
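The per-algorithm privacy cost reported in Figure 5 follows from zCDP accounting, sketched below under the assumption that each of the $T$ iterations spends $\rho_{i1}+\rho_{i2}$: zCDP composes additively, and the final conversion to $(\epsilon,\delta)$-DP reuses the formula from the budget split above. The helper name is hypothetical.

import math

def total_privacy_loss(T, rho_i1, rho_i2, delta=1e-4):
    # zCDP composes additively across the T iterations (and across the
    # two mechanisms used within each iteration).
    rho = T * (rho_i1 + rho_i2)
    # Convert the accumulated rho-zCDP guarantee to (epsilon, delta)-DP.
    epsilon = rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))
    return rho, epsilon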

Figure 7 compares the accuracy (classification error rate) of the different algorithms on the Brazil and US datasets. The noise parameters of all algorithms are chosen so that they achieve the same total privacy loss. As expected, the lower the privacy budget, the higher the classification error rate. As observed in the experiments, our proposed algorithms come close to the best achievable classification error rate over the wide range of total privacy loss considered.

6. Conclusions

In this paper, we have developed plausible differentially private ADMM algorithms, PP-ADMM and its improved version IPP-ADMM. In PP-ADMM, to remove the dependence on an exact optimal solution in each ADMM iteration for ensuring differential privacy, each agent outputs a noisy approximate solution to a perturbed objective. To further improve the utility of PP-ADMM, IPP-ADMM adopts SVT to check whether the current approximate solution differs enough from that of the previous iteration. Moreover, we have analyzed the total privacy loss under the zCDP framework and provided a generalization performance guarantee. Finally, through experiments on real-world datasets, we have demonstrated that the proposed algorithms outperform other differentially private ADMM based algorithms while providing the same privacy guarantee. In future work, we plan to extend our privacy analysis to non-convex loss functions and apply our methods to deep learning frameworks. Another research direction is to apply the idea of using SVT to other distributed (federated) learning frameworks to save privacy budget and reduce communication cost.

Acknowledgements.
We thank the reviewers for their insightful comments. The work of J. Ding and M. Pan was supported in part by the U.S. National Science Foundation under grants CNS-1646607, CNS-1801925, and CNS-2029569. The work of J. Bi was partially supported by NSF grants CCF-1514357 and IIS-1718738, and NIH grant 5K02DA043063-03.

References

  • Bun and Steinke (2016) Mark Bun and Thomas Steinke. 2016. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography. Springer Berlin Heidelberg, Beijing, China, 635–658.
  • Chang et al. (2014) Tsung-Hui Chang, Mingyi Hong, and Xiangfeng Wang. 2014. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing 63, 2 (2014), 482–497.
  • Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. 2011. Differentially private empirical risk minimization. JMLR 12, Mar (2011), 1069–1109.
  • Ding et al. (2020a) Jiahao Ding, Sai Mounika Errapotu, Yuanxiong Guo, Haixia Zhang, Dongfeng Yuan, and Miao Pan. 2020a. Private Empirical Risk Minimization with Analytic Gaussian Mechanism for Healthcare System. IEEE Transactions on Big Data (2020), 1–1.
  • Ding et al. (2019a) Jiahao Ding, Sai Mounika Errapotu, Haijun Zhang, Miao Pan, and Zhu Han. 2019a. Stochastic ADMM Based Distributed Machine Learning with Differential Privacy. In International conference on security and privacy in communication systems. Springer International Publishing, Orlando, FL, 257–277.
  • Ding et al. (2019b) Jiahao Ding, Yanmin Gong, Chi Zhang, Miao Pan, and Zhu Han. 2019b. Optimal Differentially Private ADMM for Distributed Machine Learning. CoRR abs/1901.02094 (2019). https://arxiv.org/abs/1901.02094
  • Ding et al. (2019c) Jiahao Ding, Xinyue Zhang, Mingsong Chen, Kaiping Xue, Chi Zhang, and Miao Pan. 2019c. Differentially Private Robust ADMM for Distributed Machine Learning. In IEEE International Conference on Big Data. IEEE, Los Angeles, CA, 1302–1311.
  • Ding et al. (2020b) Jiahao Ding, Xinyue Zhang, Xiaohuan Li, Junyi Wang, Rong Yu, and Miao Pan. 2020b. Differentially Private and Fair Classification via Calibrated Functional Mechanism. In Proceedings of the AAAI Conference on Artificial Intelligence. AAAI, New York, NY, 622–629.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography. Springer Berlin Heidelberg, Berlin, Heidelberg, 265–284.
  • Dwork et al. (2009) Cynthia Dwork, Moni Naor, Omer Reingold, Guy N Rothblum, and Salil Vadhan. 2009. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM symposium on Theory of computing. ACM, New York, NY, 381–390.
  • Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy (1st ed.). Now Publishers, Inc., Boston, MA, USA.
  • Huang et al. (2019) Zonghao Huang, Rui Hu, Yuanxiong Guo, Eric Chan-Tin, and Yanmin Gong. 2019. DP-ADMM: ADMM-based distributed learning with differential privacy. IEEE Transactions on Information Forensics and Security 15 (July 2019), 1002–1012.
  • Kifer et al. (2012) Daniel Kifer, Adam Smith, and Abhradeep Thakurta. 2012. Private Convex Empirical Risk Minimization and High-dimensional Regression. In Proceedings of the 25th Annual Conference on Learning Theory, Vol. 23. PMLR, Edinburgh, Scotland, 25.1–25.40.
  • Lyu et al. (2017) Min Lyu, Dong Su, and Ninghui Li. 2017. Understanding the sparse vector technique for differential privacy. Proceedings of the VLDB Endowment 10, 6 (2017), 637–648.
  • Mateos et al. (2010) Gonzalo Mateos, Juan Andrés Bazerque, and Georgios B Giannakis. 2010. Distributed sparse linear regression. IEEE Transactions on Signal Processing 58, 10 (2010), 5262–5276.
  • Melis et al. (2019) Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. 2019. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE S&P. IEEE, San Francisco, CA, 691–706.
  • Mota et al. (2013) Joao FC Mota, Joao MF Xavier, Pedro MQ Aguiar, and Markus Püschel. 2013. D-ADMM: A communication-efficient distributed algorithm for separable optimization. IEEE Transactions on Signal Processing 61, 10 (2013), 2718–2723.
  • Shi et al. (2019) Dian Shi, Jiahao Ding, Sai Mounika Errapotu, Hao Yue, Wenjun Xu, Xiangwei Zhou, and Miao Pan. 2019. Deep Q-Network-Based Route Scheduling for TNC Vehicles With Passengers’ Location Differential Privacy. IEEE Internet of Things Journal 6, 5 (2019), 7681–7692.
  • Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE S&P. IEEE, San Jose, CA, 3–18.
  • Wang et al. (2018) Jingyi Wang, Xinyue Zhang, Haijun Zhang, Hai Lin, Hideki Tode, Miao Pan, and Zhu Han. 2018. Data-Driven Optimization for Utility Providers with Differential Privacy of Users’ Energy Profile. In 2018 IEEE Global Communications Conference (GLOBECOM). IEEE, Singapore, 1–6.
  • Yu et al. (2019) Lei Yu, Ling Liu, Calton Pu, Mehmet Emre Gursoy, and Stacey Truex. 2019. Differentially Private Model Publishing for Deep Learning. In 2019 IEEE S&P. IEEE Computer Society, Los Alamitos, CA, USA, 332–349.
  • Zhang and Kwok (2014) Ruiliang Zhang and James Kwok. 2014. Asynchronous distributed ADMM for consensus optimization. In ICML. JMLR, Beijing, China, 1701–1709.
  • Zhang and Zhu (2016) Tao Zhang and Quanyan Zhu. 2016. Dynamic differential privacy for ADMM-based distributed classification learning. IEEE Transactions on Information Forensics and Security 12, 1 (2016), 172–187.
  • Zhang et al. (2019) Xinyue Zhang, Jiahao Ding, Sai Mounika Errapotu, Xiaoxia Huang, Pan Li, and Miao Pan. 2019. Differentially Private Functional Mechanism for Generative Adversarial Networks. In 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, Waikoloa, HI, 1–6.
  • Zhang et al. (2018a) Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. 2018a. Improving the Privacy and Accuracy of ADMM-Based Distributed Algorithms. In ICML. PMLR, Stockholmsmässan, Stockholm Sweden, 5796–5805.
  • Zhang et al. (2018b) Xueru Zhang, Mohammad Mahdi Khalili, and Mingyan Liu. 2018b. Recycled ADMM: Improve privacy and accuracy with less computation in distributed algorithms. In Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, Monticello, IL, 959–965.