Differentially Private and Fair Deep Learning:
A Lagrangian Dual Approach

Cuong Tran
Syracuse University
[email protected] &Ferdinando Fioretto
Syracuse University
[email protected] &Pascal Van Hentenryck
Georgia Institute of Technology
[email protected]

Abstract

A critical concern in data-driven decision making is to build models whose outcomes do not discriminate against some demographic groups, including gender, ethnicity, or age. To ensure non-discrimination in learning tasks, knowledge of the sensitive attributes is essential, while, in practice, these attributes may not be available due to legal and ethical requirements. To address this challenge, this paper studies a model that protects the privacy of the individuals’ sensitive information while also allowing it to learn non-discriminatory predictors. The method relies on the notion of differential privacy and the use of Lagrangian duality to design neural networks that can accommodate fairness constraints while guaranteeing the privacy of sensitive attributes. The paper analyses the tension between accuracy, privacy, and fairness and the experimental evaluation illustrates the benefits of the proposed model on several prediction tasks.

1 Introduction

A number of socio-technical decisions, such as criminal assessment, landing, and hiring, are increasingly being aided by machine learning systems. A critical concern is that the learned models are prone to report outcomes that are discriminatory against some demographic group, including gender, ethnicity, or age. These concerns have spurred the recent development of fairness definitions and algorithms for decision-making, focusing attention on the tradeoff between the model accuracy and fairness.

To ensure non-discrimination in learning tasks, knowledge of the sensitive attributes is essential. At the same time, legal and ethical requirements often prevent the use of this sensitive data. For example, U.S. law prevents using racial identifiers in the development of models for consumer lending or credit scoring. Other requirements may be even more stringent, and prevent the collection of protected user attributes, such as for the case of racial attributes in the E.U. General Data Protection Regulation (GDPR), or require protection of the consumer data privacy. In this scenario, an important tension arise between (1) the demand for models to be non-discriminatory, (2) the requirement for such model to use the protected attribute during training, and (3) the restriction on the data or protected attributes that can be used. There is thus a need to provide learning models that can both guarantee non discriminatory decisions and protect the privacy of the individuals’ sensitive attributes.

To this end, this paper introduces a differential privacy framework to train deep learning models that satisfy several group fairness notions, including equalized odds, accuracy parity, and demographic parity [40, 22, 2], while providing privacy of the protected attributes. The key elements of the framework can be summarized as follows:

1.

The fairness requirement is captured by casting the learning task as a constrained optimization problem. A Lagrangian dual approach is then applied to the learning task, dualizing the fairness constraints using augmented Lagrangian terms [23].
2.

The privacy requirement is enforced by using a clipping approach on the primal and dual steps and adding noise calibrated by the sensitivities of the constraint terms and their gradients. The primal step only applies clipping on constraint gradients involving sensitive attributes, and thus, does not have a major effect on the model accuracy.
3.

The framework addresses the bias-variance trade-off of clipping by providing bounds on the expected errors of constraint gradients and constraint violations. The clipping bounds can then be calibrated by minimizing these upper bounds.
4.

Finally, the framework generalizes to the important scenario where only a subset of the individuals reports the sensitive attributes, i.e., when participants are given the choice of releasing sensitive information.

The rest of the paper reviews the related work (Section 2), introduces the problem setting (Section 3), discusses the fairness and privacy definitions adopted (Section 4), presents the proposed Private and Fair Lagrangian Dual (PF-LD) framework (Sections 5, 6, and 7), its theoretical results (Sections 6.1 and 6.2), and its empirical evaluation on several prediction tasks (Section 8). The empirical results show that, on selected benchmarks, PF-LD achieves an excellent trade-off among accuracy, privacy, and fairness. It may represent a promising step towards a practical tool for privacy-preserving and fair decision making.

2 Related Work

The line of works on algorithmic fairness can be categorized into three groups those adopting pre-processing techniques to guarantee fairness [43, 15, 6], those developing modifying algorithms to satisfy some fairness notion [11, 39, 38, 8, 34], and those using post-processing techniques [17, 21, 28, 27]. The interested reader is referred to [9] for a recent survey on algorithmic fairness. On the differentially private deep learning front, there are two relevant lines of work. The first is based on the seminal work of Abadi et al. [1], which derives a differential privacy version of stochastic gradient descent (DP-SGD) and proposes a technique, called moment accountant to track detailed information of the privacy loss incurred by a sequence of SGD steps [1]. This idea has been extended and improved by a number of follow-up work [29, 36]. The second avoids using the iterative nature of the optimization algorithms used to training deep learning models by exploiting a collection of teacher models [35, 5] to train a privacy-preserving model.

While this literature is extensive, the topics of privacy and fairness have been study mostly in isolation. A few exceptions are represented by the following work. The work by Dwork et. al [12] is one of the earliest contribution linking fairness and differential privacy and shows that individual fairness is a generalization of differential privacy. More recently, Cummings et. al [10] consider the tradeoffs when considering differential privacy and equal opportunity, a notion of fairness that restricts a classifier to produce equal true positive rates across different groups. The work claim that there is no classifier that achieves $(\epsilon,0)$ -differential privacy, satisfies equal opportunity, and has accuracy better than a constant classifier. Ekstrand et. al [16] raise questions about the tradeoffs involved between privacy and fairness and, finally, Jagielski et. al [24] shows two simple, yet effective algorithms that satisfy $(\epsilon,\delta)$ -differential privacy and equalized odds. Finally, a recent line of work has also observed that private models may have a negative impact towards fairness. In particular, Pujol et. al [37] shows that differential privacy could disproportionately affect some groups on several Census resource allocation tasks. A similar observation was made by Bagdasaryan et. al [4] in the context of private deep learning models trained using DP-SDG. The authors observed disparity in performance across different sub-populations on several classification tasks.

It is also worth mentioning that in order to build a fair model it is necessary to collect a subset of users with their sensitive information like their gender, races or ages. This poses a privacy risks on fair models [24, 26]. To achieve a fair model without disclosing those sensitive attributes, ¡ozannar et. al [33] have very recently developed a differential privacy mechanism in which the true sensitive information is perturbed prior being applied to a fair learning model.

In contrast to the work discussed above, this paper, presents a Lagrangian dual method to enforce several fairness constraints directly into the training cycle of a deep neural network and proposes a differentially private and fair version of the learning algorithm.

3 Problem Settings and Goals

The paper adopts boldface symbols to describe vectors (lowercase) and matrices (uppercase). Italic symbols are used to denote scalars (lowercase) and random variables or data features (uppercase). Notation $\|\cdot\|$ is used to denote the $l_{2}$ norm.

The paper considers datasets $D$ consisting of $n$ individual data points $(X_{i},A_{i},Y_{i})$ , with $i\!\in\![n]$ drawn i.i.d. from an unknown distribution. Therein, $X_{i}\!\in\!\mathcal{X}$ is a non-sensitive feature vector, $A_{i}\!\in\!\mathcal{A}$ , with $\mathcal{A}=[m]$ (for some finite $m$ ) is a protected attribute, and $Y_{i}\!\in\!\mathcal{Y}=\{0,1\}$ is a binary label. The goal is to learn a classifier $\mathcal{M}_{\theta}:\mathcal{X}\to\mathcal{Y}$ , where $\theta$ is a vector of real-valued parameters, that ensures a specified non-discriminatory notion with respect to $A$ while guaranteeing the privacy of the sensitive attribute $A$ . The model quality is measured in terms of a nonnegative, and assumed differentiable, loss function $\mathcal{L}:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}_{+}$ , and the problem is that of minimizing the empirical risk function:

\min_{\theta}J(\mathcal{M}_{\theta},D)=\frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(\mathcal{M}_{\theta}(X_{i}),Y_{i}).

(L)

The paper focuses on learning general classifiers, such as neural networks, that satisfy group fairness (as defined next) and protect the disclosure of the sensitive attributes using the notion of differential privacy. Importantly, the paper assumes that the attribute $A$ is not part of the model input during inference. This is crucial in the application of interest to this work as the protected attributes cannot be disclosed.

4 Preliminaries

This section reviews the fairness and privacy notion adopted in this work.

4.1 Fairness

The paper consider a classifier $\mathcal{M}$ satisfying some group fairness notion under a distribution over $(X,A,Y)$ for the protected attribute $A$ and focuses on three fairness notions:

•

Demographic Parity: $\mathcal{M}$ ’s predictions are statistically independent of the protected attribute $A$ . That is,

\Pr[\mathcal{M}(X)\!=\!\hat{y}\mid A\!=\!a]\!=\!\Pr[\mathcal{M}(X)\!=\!\hat{y}]\;\;\forall a\in\mathcal{A},\hat{y}\in\mathcal{Y},

which, since $\hat{y}\in\{0,1\}$ , can be expressed as

\mathbb{E}[\mathcal{M}(X)\mid A=a]=\mathbb{E}[\mathcal{M}(X)],\;\;\forall a\in\mathcal{A}.

•

Equalized odds: $\mathcal{M}$ ’s predictions are conditionally independent of the protected attribute $A$ given the label $Y$ . That is, for all $a\in\mathcal{A},\hat{y}\in\mathcal{Y}$ , and $y\in\mathcal{Y}$ :

$\Pr[\mathcal{M}(X)\!=\!\hat{y}\mid A\!=\!a,Y\!=\!y]\!=\!\Pr[\mathcal{M}(X)\!=\!\hat{y}\mid Y\!=\!y].$

or, equivalently, for all $a\in\mathcal{A},y\in\mathcal{Y}$ ,

$\mathbb{E}[\mathcal{M}(X)\mid A\!=\!a,Y\!=\!y]\!=\!\mathbb{E}[\mathcal{M}(X)\mid Y\!=\!y].$

•

Accuracy parity: $\mathcal{M}$ ’s miss-classification rate is conditionally independent of the protected attribute:

\Pr[\mathcal{M}(X)\neq Y\mid A=a]=\Pr[\mathcal{M}(X)\neq Y],\;\;\forall a\in\mathcal{A},

or equivalently,

\mathbb{E}[\mathcal{L}(\mathcal{M}(X),Y)\mid A\!=\!a]\!=\!\mathbb{E}[\mathcal{L}(\mathcal{M}(X),Y)],\;\;\forall a\in\mathcal{A},

where $\mathcal{L}$ is the loss function to minimize in problem (L).

As noted by [3] and [18], several fairness notions, including those above, can be viewed as equality constraints between the properties of each group with respect to the population. These constraints can be expressed as:

\mathbb{E}_{z\sim D_{P_{i}}}[h(z)]-\mathbb{E}_{z\sim D_{G_{i}}}[h(z)]=0

(1)

where, for $i$ in some index set $\mathcal{I}$ , $D_{P_{i}}$ is a subset of the dataset $D$ , indicating the population term, $D_{G_{i}}$ is a subset of $D_{P_{i}}$ , indicating the group term, and is obtained by accessing the protected attributes $A$ , the function $h$ characterizes the model output under some fairness definition.

Example 1 (Demographic parity).

Demographic parity can be expressed as a set of $|\mathcal{A}|$ constraints, with $h(z)\!=\!\mathcal{M}_{\theta}(z)$ and, for each $i\!\in\!\mathcal{A}$ , the subsets indicating population terms are defined as:

D_{P_{i}}\!=\!\{(X,Y)\mid(X,A,Y)\!\in\!D\},

and the subsets indicating the group terms as:

D_{G_{i}}\!=\!\{(X,Y)\mid(X,A,Y)\!\in\!D\land A\!=\!i\}.

Example 2 (Equalized odds).

Equalized odds can be expressed as a set of $2|\mathcal{A}|$ constraints, with $h(z)\!=\!\mathcal{M}_{\theta}(z)$ , and for each choice of $y\in\{0,1\}$ , and $i\!\in\!\mathcal{A}$ , the subsets indicating population terms are defined as:

D_{P_{i}}\!=\!\{(X,Y)\mid(X,A,Y)\!\in\!D\land Y\!=\!y\},

and the subsets indicating the group terms as:

D_{G_{i}}\!=\!\{(X,Y)\mid(X,A,Y)\!\in\!D\land Y\!=\!y\land A\!=\!i\}.

Example 3 (Accuracy parity).

Accuracy parity can be expressed as a set of $|\mathcal{A}|$ constraints, with $h(z)\!=\!\mathcal{L}_{\theta}(\mathcal{M}_{\theta}(z))$ , where $\mathcal{L}$ is the loss function defined in problem (L), and, for each $i\!\in\!\mathcal{A}$ , the subsets indicating population terms are defined as:

D_{P_{i}}\!=\!\{(X,Y)\mid(X,A,Y)\!\in\!D\},

and the subsets indicating the group terms as:

D_{G_{i}}\!=\!\{(X,Y)\mid(X,A,Y)\!\in\!D\land A\!=\!i\}.

4.2 Differential Privacy

Differential privacy (DP) [13] is a strong privacy notion used to quantify and bound the privacy loss of an individual participation to a computation. While traditional DP protects the participation of an individual to a dataset used in a computation, similarly to [24, 33], this work focuses on the instance where the protection is restricted to the sensitive attributes only. A dataset $D\!\in\!\mathcal{D}\!=\!(\mathcal{X}\!\times\!\mathcal{A}\!\times\!\mathcal{Y})$ of size $n$ can be described as a pair $(D_{P},D_{S})$ where $D_{P}\!\in\!(\mathcal{X}\!\times\!\mathcal{Y})^{n}$ describes the public attributes and $D_{S}\!\in\!\mathcal{A}^{n}$ describes the sensitive attributes. The privacy goal is to guarantee that the output of the learning model does not differ much when a single individual sensitive attribute is changed.

The action of changing a single attribute from a dataset $D_{S}$ , resulting in a new dataset $D_{S}^{\prime}$ , defines the notion of dataset adjacency. Two dataset $D_{S}$ and $D_{S}^{\prime}\in\mathcal{A}^{n}$ are said adjacent, denoted $D_{S}\sim D_{S}^{\prime}$ , if they differ in at most a single entry (e.g., in one individual’s group membership).

Definition 1 (Differential Privacy).

A randomized mechanism $\mathcal{M}\!:\!\mathcal{D}\!\to\!\mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ is $(\epsilon,\delta)$ -differentially private w.r.t. attribute $A$ , if, for any dataset $D_{P}\!\in\!(\mathcal{X}\times\mathcal{Y})^{n}$ , any two adjacent inputs $D_{S},D_{S}^{\prime}\!\in\!\mathcal{A}^{n}$ , and any subset of output responses $R\subseteq\mathcal{R}$ :

\Pr[\mathcal{M}(D_{P},D_{S})\in R]\leq e^{\epsilon}\Pr[\mathcal{M}(D_{P},D_{S}^{\prime})\in R]+\delta.

When $\delta\!=\!0$ the algorithm is said to satisfy $\epsilon$ -differential privacy. Parameter $\epsilon>0$ describes the privacy loss of the algorithm, with values close to $0$ denoting strong privacy, while parameter $\delta\in[0,1]$ captures the probability of failure of the algorithm to satisfy $\epsilon$ -differential privacy. The global sensitivity $\Delta_{f}$ of a real-valued function $f:\mathcal{D}\to\mathbb{R}^{k}$ is defined as the maximum amount by which $f$ changes in two adjacent inputs $D$ and $D^{\prime}$ : $\Delta_{f}=\max_{D\sim D^{\prime}}\|f(D)-f(D^{\prime})\|.$ In particular, the Gaussian mechanism, defined by

\mathcal{M}(D)=f(D)+\mathcal{N}(0,\Delta_{f}^{2}\,\sigma^{2}),

where $\mathcal{N}(0,\Delta_{f}\,\sigma^{2})$ is the Gaussian distribution with $0$ mean and standard deviation $\Delta_{f}\,\sigma^{2}$ , satisfies $(\epsilon,\delta)$ -DP for $\delta\!>\!\frac{4}{5}\exp(-(\sigma\epsilon)^{2}/2)$ and $\epsilon\!<\!1$ [14].

5 Constrained Learning with Lagrangian Duality

When interpreted as constraints of the form (1), fairness properties can be explicitly imposed to problem (L), resulting in a constrained empirical risk minimization problem. Solving this new problem, however, becomes challenging due to the presence of constraints. To address this challenge, this work uses concepts borrowed from Lagrangian duality.

Consider a set of $|\mathcal{I}|$ constraints of the form (1), and expressed succinctly as:

\bm{\mu}(D_{P})-\bm{\mu}(D_{G})=\bm{0}^{\top},

(2)

where $\bm{\mu}(D_{P})$ and $\bm{\mu}(D_{G})$ are vectors containing elements $\mu(D_{P_{i}})\!=\!\hat{\mathbb{E}}_{z\sim D_{P_{i}}}[h(z)]$ and $\mu(D_{G_{i}})\!=\!\hat{\mathbb{E}}_{z\sim D_{G_{i}}}[h(z)]$ , respectively, for each $i\!\in\!\mathcal{I}$ . Notice that the constraints in $\bm{\mu}(D_{P})$ access public data only, while the constraints in $\bm{\mu}(D_{G})$ access also the sensitive data. The resulting learning task is defined by the following optimization problem


$\displaystyle\operatorname*{argmin}_{\theta}$	$\displaystyle\;J(\mathcal{M}_{\theta},D_{P})=\frac{1}{n}\sum_{i=1}^{n}{\cal L}({\cal M}_{\theta}(X_{i}),Y_{i})$	(3a)
subject to	$\displaystyle\hskip 30.0pt\bm{\mu}(D_{P})-\bm{\mu}(D_{G})=\bm{0}^{\top}.$	(3b)

In Lagrangian relaxation, the problem constraints are relaxed into the objective function using Lagrangian multipliers $\lambda_{i}\geq 0$ associated to each of the $|\mathcal{I}|$ constraints and expressing the penalty induced by violating them. When all the constraints are relaxed, the Lagrangian function becomes

\mathcal{L}_{\bm{\lambda}}(\theta)=J(\mathcal{M}_{\theta},D_{P})+\bm{\lambda}^{\top}\left|\bm{\mu}(D_{P})-\bm{\mu}(D_{G})\right|,

(4)

where $\bm{\lambda}=(\lambda_{1},\ldots,\lambda_{|\mathcal{I}|})$ and the function $|\cdot|$ , used here to denote the element-wise operator (i.e, $|\mu(D_{P_{i}})-\mu(D_{G_{i}})|$ for $i\!\in\!\mathcal{I}$ ), captures a quantification of the constraint violations, often used in constraint programming [20].

Using a Lagrangian function, the optimization becomes

\mathrel{\mathop{\theta}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}}(\bm{\lambda})=LR_{\bm{\lambda}}=\operatorname*{argmin}_{\theta}\mathcal{L}_{\bm{\lambda}}(\theta),

(5)

that produces an approximation $\mathcal{M}_{\mathrel{\mathop{\theta}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}}(\bm{\lambda})}$ of $\mathcal{M}_{\mathrel{\mathop{\theta}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}}}$ . The Lagrangian dual finds the best Lagrangian multipliers, i.e.,

\textstyle\mathrel{\mathop{\bm{\lambda}}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}}=\operatorname*{argmax}_{\bm{\lambda}\geq 0}J(\mathcal{M}_{\mathrel{\mathop{\theta}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}}(\bm{\lambda})},D_{P}),

(6)

to obtain $\mathcal{M}_{\mathrel{\mathop{\theta}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}}(\mathrel{\mathop{\bm{\lambda}}\limits^{\vbox to0.0pt{\kern-2.0pt\hbox{$\scriptstyle\scalebox{0.5}{$\star$}$}\vss}}})}$ , i.e., the strongest Lagrangian relaxation of $\mathcal{M}$ . Learning this relaxation relies on an iterative scheme that interleaves the learning of a number of Lagrangian relaxations (for various multipliers) with a subgradient method to learn the best multipliers. The resulting method, called Fair-Lagrangian Dual (F-LD) is sketched in Algorithm 1. Given the input dataset $D$ , the optimizer step size $\alpha\!>\!0$ , and step sizes $\bm{s}$ , the Lagrangian multipliers are initialized in line 1. The training is performed for a fixed number of $T$ epochs. At each epoch $k$ , the primal update step (lines 1 and 1) optimizes the model parameters $\theta$ using stochastic gradient descent over different mini-batches $B\subseteq D$ . The optimization step uses the current Lagrangian multipliers $\bm{\lambda}_{k}$ . Therein, $B_{P}$ and $B_{G}$ indicate the population and group terms over a minibatch. After each epoch, the dual update step (line 1), updates the value of the Lagrangian multipliers following to a dual ascent rule [7, 19]. The multipliers values are thus restricted to a predefined upper bound $\lambda^{\max}\!$ (line 1).

input :

D=(X_{i},A_{i},Y_{i})_{i=1}^{n}:

Training data;

\alpha,\bm{s}=(s_{1},s_{2},\ldots):

step sizes.

\lambda^{\max}

: Max multipliers value.

\lambda_{1,i}\leftarrow 0\;\;\forall i\in\mathcal{I}

3 for epoch $k=1,2,\ldots T$ do

4 foreach Mini-batch $B\subseteq D$ do

\theta\!\leftarrow\!\theta-\alpha\nabla_{\theta}\big{[}J(\mathcal{M}_{\theta},B_{P})\text{+}\bm{\lambda}_{k}^{\top}\!\left|\bm{\mu}(B_{P})\,\text{-}\,\bm{\mu}(B_{G})\right|\big{]}\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!

\bm{\lambda}_{k+1}\leftarrow\bm{\lambda}_{k}+s_{k}\,\left|\bm{\mu}(D_{P})-\bm{\mu}(D_{G})\right|

\lambda_{k+1,i}\leftarrow\min(\lambda^{\max},\lambda_{k+1,i})\;\;\forall i\in\mathcal{I}

Algorithm 1 Fair-Lagrangian Dual (F-LD)

6 A Private and Fair LD Model

To ensure fairness, the primal (line 1) and dual (line 1) updates of Algorithm 1 involve terms to compute the violations associated to constraints (3b). These terms rely on the attributes $A$ , and therefore, the resulting model leaks the sensitive information. To contrast this issue, this section introduces an extension to F-LD, called Private and Fair Lagrangian Dual (PF-LD) method, that guarantees both fairness and privacy. The idea is to render the computations of the primal and dual update steps differentially private with respect to the sensitive attributes.

Private Primal Update

At each epoch $k$ , the primal update (line 1 of Algorithm 1) computes the gradients over the loss function $\mathcal{L}_{\bm{\lambda}_{k}}(\theta)$ , which is composed of two terms (see Equation (4)). The first term, $J(\mathcal{M}_{\theta},D_{P})$ , uses exclusively public information, while the second term, $\bm{\lambda}^{\top}|\bm{\mu}(D_{P})-\bm{\mu}(D_{G})|$ requires both the public and sensitive group information. The computation of these gradients can be made differentially private by the introduction of carefully calibrated Gaussian noise. The general concept, relies on performing a differentially private Stochastic Gradient Descent (DP-SDG) step [1]. In a nutshell, DP-SDG computes the gradients for each data sample in a random mini-batch, clips their L2-norm, computes the average, and adds noise to ensure privacy.

The result below bounds the global sensitivity $\Delta_{p}$ of the sensitive term in the primal update, which is needed to calibrate the noise necessary to guarantee privacy.

Theorem 1.

Let $\|\nabla_{\theta}h(z)\|\!\leq\!C_{p}$ , for all $z\!\in\!B_{G_{i}}$ , $i\!\in\!\mathcal{I}$ , and some $C_{p}\!>\!0$ . The global sensitivity $\Delta_{p}$ of the gradients of the constraints violation $\nabla_{\theta}\bm{\lambda}^{\top}|\bm{\mu}(B_{P})-\bm{\mu}(B_{G})|$ is

\Delta_{p}\leq\frac{2C_{p}\lambda^{\max}}{\min_{i\in\mathcal{I}}|B_{G_{i}}|-1}.

(7)

The above uses a clipping term, $C_{p}$ , to control the maximal change of the gradients. Crucially, this is non-limiting, as it can be enforced by clipping the gradient contribution $\left\|\nabla_{\theta}h(z)\right\|$ to $C_{p}$ , similarly to what done in DP-SDG.

Proof.

Consider two neighboring dataset $B$ and $B^{\prime}$ differing in the membership to a protected group of one participating sample $z=(X,A,Y)$ . W.l.o.g., consider a sample $z$ that changes membership from group $A=k_{1}$ to $A^{\prime}=k_{2}$ . Upon this change, there are exactly two groups ( $B_{G_{k_{1}}}$ and $B_{G_{k_{2}}}$ ) whose sizes are being affected. Namely, one group size increases while the other decrease. Additionally notice that $|B_{G_{i}}-B_{G_{i}}^{\prime}|\leq 1$ when $i\in\{k_{1},k_{2}\}$ . Additionally, notice that this change does not impact the population terms $B_{P}$ and $B_{P}^{\prime}$ . For $i\in\{k_{1},k_{2}\}$ , the gradient contributions of the constraint violations associated to the group terms can be bound as:


	$\displaystyle\left\\|\nabla_{\theta}\lambda_{i}\left\|\mu(B_{P_{i}})-\mu(B_{G_{i}})\right\|-\nabla_{\theta}\lambda_{i}\left\|\mu(B^{\prime}_{P_{i}})-\mu(B^{\prime}_{G_{i}})\right\|\right\\|$	(8a)
$\displaystyle=$	$\displaystyle\left\\|\nabla_{\theta}\lambda_{i}\left\|\mu(B^{\prime}_{G_{i}})-\mu(B_{G_{i}})\right\|\right\\|$	(8b)
$\displaystyle=$	$\displaystyle\lambda_{i}\left\\|\nabla_{\theta}\hat{\mathbb{E}}_{z\sim B^{\prime}_{G_{i}}}\left[h(z)\right]-\nabla_{\theta}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[h(z)\right]\right\\|$	(8c)
$\displaystyle=$	$\displaystyle\lambda_{i}\left\\|\nabla_{\theta}\frac{1}{\|B^{\prime}_{G_{i}}\|}\sum_{z\in B^{\prime}_{G_{i}}}h(z)-\nabla_{\theta}\frac{1}{\|B_{G_{i}}\|}\sum_{z\in B_{G_{i}}}h(z)\right\\|$	(8d)
$\displaystyle\leq$	$\displaystyle\lambda_{i}\left\\|\nabla_{\theta}\frac{1}{\|B_{G_{i}}\|-1}h(z)\right\\|$	(8e)
$\displaystyle=$	$\displaystyle\frac{\lambda_{i}}{\|B_{G_{i}}\|-1}\left\\|\nabla_{\theta}h(z)\right\\|\leq\frac{\lambda_{i}C_{p}}{\|B_{G_{i}}\|-1}$	(8f)

where equation (8b) follows from that $B_{P}=B_{P}^{\prime}$ (and thus $\mu(B_{P_{i}})=\mu(B_{P^{\prime}_{i}})$ ) and since the whole expression is under a norm operator, equation (8c) from that $\lambda_{i}\geq 0$ and from definition of the $\mu$ terms, equation (8e) follows from the notion of adjacent datasets and that there is a single element differing between $B_{G_{i}}$ and $B^{\prime}_{G_{i}}$ , and, finally, equation (8f) follows from the theorem assumption. Therefore, the global sensitivity of the gradient contributions of the constraint violations can be bounded above as:


$\displaystyle\Delta_{p}$	$\displaystyle=\max_{B,B^{\prime}}\left\\|\nabla_{\theta}\bm{\lambda}^{\top}\|\bm{\mu}(B_{P})-\bm{\mu}(B_{G})\|-\nabla_{\theta}\bm{\lambda}^{\top}\|\bm{\mu}(B_{P}^{\prime})-\bm{\mu}(B_{G}^{\prime})\|\right\\|$	(9a)
	$\displaystyle\leq\sum_{i\in\mathcal{I}}\left\\|\nabla_{\theta}\lambda_{i}\|\mu(B_{P_{i}})-\mu(B_{G_{i}})\|-\nabla_{\theta}\lambda_{i}\|\mu(B_{P_{i}}^{\prime})-\mu(B_{G_{i}}^{\prime})\|\right\\|$	(9b)
	$\displaystyle\leq\frac{C_{p}\lambda_{k_{1}}}{\|B_{G_{k_{1}}}\|-1}+\frac{C_{p}\lambda_{k_{2}}}{\|B_{G_{k_{2}}}\|-1}\leq\frac{2C_{p}\lambda^{\max}}{\min_{i\in\mathcal{I}}\|B_{G_{i}}\|-1}.$	(9c)

where the last inequality follows from Equation (8), and noticing that (1) there are exactly two groups $B_{G_{i}}$ ( $i\in\{k_{1},k_{2}\}$ ) whose size is being affected, (2) for any $i\in\mathcal{I}$ , $\lambda_{i}\leq\lambda^{\max}$ and (3) $|B_{G_{i}}|\geq\min_{i\in\mathcal{I}}|B_{G_{i}}|$ . ∎

Using Theorem 1, the privacy-preserving primal update step for a mini-batch $B\!\subseteq\!D$ can be executed by clipping exclusively the gradients of the functions $h(z)$ associated with the group terms in $B_{G}$ . It is not necessary to perform gradient clipping for the functions $h(z)$ associated with the population terms in $B_{P}$ . While this may induce propagating population and group terms gradients of different magnitudes, the authors observed often improved performance in the adopted setting. Thus, PF-LD substitutes line 1 of Algorithm 1 with the following

\displaystyle\theta\leftarrow\theta-\alpha\big{(}\nabla_{\theta}\left[J(\mathcal{M}_{\theta},B_{P})\right]+\bm{\lambda}^{\top}\left|\nabla_{\theta}\bm{\mu}(B_{P})-\bar{\nabla}^{C_{p}}_{\theta}\bm{\mu}(B_{G})\right|+\mathcal{N}(0,\sigma_{p}^{2}\ \Delta^{2}_{p}\bm{I})\!\big{)},

(10)

with $\bm{I}\!\in\!\{0,1\}^{|\mathcal{I}|\times|\mathcal{I}|}$ , $\sigma_{p}\!>\!0$ , and $\bar{\nabla}^{C_{p}}_{\theta}$ is applied to each element $\mu(B_{G_{i}})$ of vector $\bm{\mu}(B_{G})$ , where

\bar{\nabla}_{\theta}^{C_{p}}(x)\!=\!\frac{\nabla x}{\max(1,\frac{\|\nabla x\|}{C_{p}})}

denotes the gradients of a given scalar loss $x$ clipped in a $C_{p}$ -ball, for $C_{p}>0$ .

Private Dual Update

Similar to the primal step, the dual update requires access to the sensitive group information (see line 1 of Algorithm 1). It updates the multipliers based on amount of constraint violation $\left|\bm{\mu}(D_{P})-\bm{\mu}(D_{G})\right|$ computed over the entire dataset $D$ . Privacy can be attained by injecting Gaussian noise to the computation of the multipliers, but computing the global sensitivity $\Delta_{d}$ of the constraint violations is non-trivial since the range of the violations is unbounded. Once again, the paper recurs to the adoption of a clipping term, $C_{d}$ , that controls the maximal contribution of the constraint violation to the associated multiplier value.

Theorem 2.

Let $|h(z)|\!\leq\!C_{d}$ , for all samples $z\in D_{G_{i}}$ , $i\!\in\!\mathcal{I}$ , and some $C_{d}\!>\!0$ . The global sensitivity $\Delta_{d}$ of the constraint violation $\left|\bm{\mu}(D_{P})-\bm{\mu}(D_{G})\right|$ is

\displaystyle\Delta_{d}

\displaystyle\leq\frac{\sqrt{2}C_{d}}{\min_{i\in\mathcal{I}}|D_{G_{i}}|-1}.

(11)

Proof.

Consider two neighboring datasets $D$ and $D^{\prime}$ differing by the group membership of one particular sample $z$ . Using a similar argument as that used in the proof of Theorem 1, consider a sample $z$ that changes membership from group $A=k_{1}$ to $A^{\prime}=k_{2}$ . Upon this change, there are exactly two groups ( $B_{G_{k_{1}}}$ and $B_{G_{k_{2}}}$ ) whose sizes are being affected. Namely, one group size increases while the other decrease. Additionally notice that $|B_{G_{i}}-B_{G_{i}}^{\prime}|\leq 1$ when $i\in\{k_{1},k_{2}\}$ . Additionally, notice that this change does not impact the population terms $B_{P}$ and $B_{P}^{\prime}$ .

The maximal amount of difference between the conditional empirical mean $\mu(D_{G_{i}})$ constructed from $D$ and $\mu(D^{\prime}_{G_{i}})$ constructed from $D^{\prime}$ , for any given $D,D^{\prime}$ , can be bounded as


	$\displaystyle\left\\|\left\|\bm{\mu}(D_{P})-\bm{\mu}(D_{G})\right\|-\left\|\bm{\mu}(D^{\prime}_{P})-\bm{\mu}(D^{\prime}_{G})\right\|\right\\|$	(12a)
$\displaystyle\leq$	$\displaystyle\sum_{i\in\mathcal{I}}\left\\|\left\|\mu(D_{P_{i}})-\mu(D_{G_{i}})\right\|-\left\|\mu(D^{\prime}_{P_{i}})-\mu(D^{\prime}_{G_{i}})\right\|\right\\|$	(12b)
$\displaystyle=$	$\displaystyle\sum_{i\in\mathcal{I}}\left\\|\left\|\mu(D^{\prime}_{G_{i}})-\mu(D_{G_{i}})\right\|\right\\|$	(12c)
$\displaystyle=$	$\displaystyle\sum_{i\in\mathcal{I}}\left\\|\left\|\hat{\mathbb{E}}_{z\sim D^{\prime}_{G_{i}}}[h(z)]-\hat{\mathbb{E}}_{z\sim D_{G_{i}}}[h(z)]\right\|\right\|$	(12d)
$\displaystyle=$	$\displaystyle\sum_{i\in\mathcal{I}}\left\\|\left\|\frac{1}{\|D^{\prime}_{G_{i}}\|}\sum_{z\in D^{\prime}_{G_{i}}}h(z)-\frac{1}{\|D_{G_{i}}\|}\sum_{z\in D_{G_{i}}}h(z)\right\|\right\\|$	(12e)
$\displaystyle\leq$	$\displaystyle\left\\|\frac{\|h(z)\|}{\|D_{G_{k_{1}}}\|-1}+\frac{\|h(z)\|}{\|D_{G_{k_{2}}}\|-1}\right\\|$	(12f)
$\displaystyle\leq$	$\displaystyle\left\\|\frac{C_{d}}{\|D_{G_{k_{1}}}\|-1}+\frac{C_{d}}{\|D_{G_{k_{2}}}\|-1}\right\\|$	(12g)
$\displaystyle=$	$\displaystyle\sqrt{\left(\frac{C_{d}}{\|D_{G_{k_{1}}}\|-1}\right)^{2}+\left(\frac{C_{d}}{\|D_{G_{k_{2}}}\|-1}\right)^{2}}$	(12h)
$\displaystyle\leq$	$\displaystyle\frac{\sqrt{2}C_{d}}{\min_{i\in\mathcal{I}}\|D_{G_{i}}-1\|}$	(12i)

where (12b) follows from triangle inequality, (12c) from that $D_{P_{i}}=D^{\prime}_{P_{i}}$ , for any $i\in\mathcal{I}$ (and thus $\mu(B_{P_{i}})=\mu(B_{P^{\prime}_{i}})$ ) and since the whole expression is under a norm operator, (12d) and (12e) follow by definition of the function $\mu$ . The inequality (12f) follows from that $D_{G_{i}}$ and $D_{G_{i}}^{\prime}$ differ in at most one element (when $i=k_{1}$ or $i=k_{2}$ ) and (12g), by assumption $|h(z)|\leq C_{d}$ , and, finally, (12i) follows by noting that $|D_{G_{i}}|\geq\min_{i\in\mathcal{I}}|D_{G_{i}}|$ . ∎

The privacy-preserving dual update step, used in lieu of line 1 of Algorithm 1, is given by the following

\displaystyle\!\!\!\bm{\lambda}_{k+1}\leftarrow\bm{\lambda}_{k}

\displaystyle+s_{k}\left(|\bm{\mu}(D_{P})-\bar{\bm{\mu}}^{C_{d}}(D_{G})|\!+\!\mathcal{N}(0,\sigma_{d}^{2}\Delta_{d}^{2}\bm{I})\right)\!\!

(13)

with $\bm{I}\in\{0,1\}^{|\mathcal{I}|\times|\mathcal{I}|}$ , $\sigma_{d}>0$ , and where, for every $i\in\mathcal{I}$ ,

\bar{\mu}^{C_{d}}(D_{G_{i}})=\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[\frac{h(z)}{\max(1,\frac{|h(z)|}{C_{d}})}\right].

Note that while Theorem 1 bounds the individual gradient norms of each functions $h(z)$ for samples $z\!\in\!B_{G_{i}}$ and $i\!\in\!\mathcal{I}$ , Theorem 2 bounds their maximum absolute values. The choice of terms $C_{p}$ and $C_{d}$ plays a special role in limiting the impact of an individual change in the protected attributes. It controls, indirectly, the privacy loss, as it impacts the global sensitivities $\Delta_{p}$ and $\Delta_{d}$ . However, these terms also affect the model accuracy and fairness. In particular, larger $C_{p}$ values will propagate more precise gradients ensuring better accuracy and model fairness. On the other hand, larger clipping values will also introduce more noise, and thus degrade the information propagated. The converse is true for smaller clipping values. A theoretical and experimental analysis of the impact of these terms to the model accuracy and fairness is provided in the next sections.

6.1 Privacy Analysis

The privacy analysis of PF-LD relies on the moment accountant for Sampled Gaussian (SG) mechanism [32], whose privacy is analyzed using Rényi Differential Privacy (RDP) [31].

Definition 2.

[Sampled Gaussian Mechanism] Let $f:S\subseteq D\to\mathbb{R}^{d}$ be a function mapping subsets $S$ of the input data $D$ to $\mathbb{R}^{d}$ . The Sampled Gaussian (SG) mechanism with sampling rate $0<q\leq 1$ and standard deviation $\sigma>0$ is defined as follows:

\displaystyle SG_{q,\sigma}(D)\triangleq f(\{x:x\in D\ \mbox{is sampled with probability q}\})+\mathcal{N}(0,\sigma^{2}\bm{I}),

where each element of $D$ is sampled independently at random without replacement with probability $q$ , and $\mathcal{N}(0,\sigma^{2}\bm{I})$ is the spherical $d$ -dimensional Gaussian noise with per-coordinate variance $\sigma^{2}$ .

Theorem 3.

(( $\alpha,\epsilon$ )-RDP) A randomized mechanism $f:\mathcal{D}\to\mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ is said to have $\epsilon$ -Rényi differential privacy of order $\alpha$ , or ( $\alpha,\epsilon$ )-RDP for short, if for any adjacent $D,D^{\prime}\in\mathcal{D}$ it holds that

\mathcal{D}_{\alpha}(f(D)\parallel f(D^{\prime}))\leq\epsilon,

where $\mathcal{D}_{\alpha}(P\parallel Q)\triangleq\frac{1}{1-\alpha}\log\mathbb{E}_{x\sim Q}\left(\frac{P(x)}{Q(x)}\right)^{\alpha}$ is the Rényi divergence of order $\alpha$ between two probability distributions $P$ and $Q$ .

The privacy analysis of the SG mechanism is described by the following Theorem from [32].

Theorem 4.

For a given function $f:D\to\mathbb{R}^{d}$ , such that $||f(D)-f(D^{\prime})||_{2}\leq 1$ for any neighboring databases $D,D^{\prime}$ , the SG mechanism $SG_{q,\sigma}$ with sampling ratio $q$ and standard deviation $\sigma$ satisfies $(\alpha,\epsilon)$ -RDP with:


$\displaystyle\epsilon$	$\displaystyle\leq\mathcal{D}_{\alpha}\left[\mathcal{N}(0,\sigma^{2})\;\parallel\;(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2})\right]$	(14a)
$\displaystyle\epsilon$	$\displaystyle\leq\mathcal{D}_{\alpha}\left[(1-q)\mathcal{N}(0,\sigma^{2})+q\mathcal{N}(1,\sigma^{2})\;\parallel\;\mathcal{N}(0,\sigma^{2})\right].$	(14b)

The Rényi divergences appearing in Equations (14a) and (14b) can be computed numerically according to the procedure described in [32]. Additionally, RDP enjoys composition: if $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ are mechanisms satisfying $(\alpha,\epsilon_{1})$ -RDP and $(\alpha,\epsilon_{2})$ -RDP, respectively, then their composition satisfies $(\alpha,\epsilon_{1}+\epsilon_{2})$ -RDP [30]

The following results consider a PF-LD algorithm that trains a model over $T$ epochs using a dataset $D$ containing $n$ training samples, uses mini-batches $B$ at each iteration, and standard deviation parameters $\sigma_{p}$ and $\sigma_{d}$ , associated to the primal and dual steps, respectively. Note that, Equations (10) and (13), correspond to instances of the SG mechanism.

Lemma 1.

The PF-LD primal step satisfies $(\alpha,\epsilon_{p})$ -RDP, where $\epsilon_{p}$ satisfies Equations (14) with $q\!=\!\nicefrac{{|B|}}{{n}}$ and $\sigma=\sigma_{p}\Delta_{p}$ are, respectively, the SG sampling ratio $q$ and standard deviation parameters.

Proof.

The result follows directly by Definition 2 and Theorems 3 and 4 since the primal step of PF-LD uses Gaussian noise with parameter $\sigma_{p}$ over a subsample of the dataset of size $|B|$ . ∎

Lemma 2.

The PF-LD dual step satisfies $(\alpha,\epsilon_{d})$ -RDP, where $\epsilon_{d}$ satisfies Equations (14) with $q\!=\!1$ and $\sigma\!=\!\sigma_{d}\Delta_{d}$ are, respectively, the SG sampling ratio and standard deviation parameters.

Proof.

The result follows directly by Definition 2 and Theorem 3 since the dual step of PF-LD uses Gaussian noise with parameter $\sigma_{d}$ . ∎

PF-LD uses a predefined amount of noise (specified by parameters $\sigma_{p}$ and $\sigma_{d}$ ) at each iteration, so that each iteration has roughly the same privacy loss, and uses the moment accountant [1] to track detailed information of the cumulative privacy loss.

Theorem 5.

PF-LD satisfies $(\alpha,\frac{Tn\epsilon_{p}}{|B|}+T\epsilon_{d})$ -RDP.

Proof.

The result follows from Lemmas 1 and 2 and by the composability of RDP [30]. ∎

The final privacy loss in the $(\epsilon,\delta)$ -differential privacy model is obtained by observing that a mechanism satisfying $(\alpha,\epsilon)$ -Rényi differential privacy also satisfies $(\epsilon+\frac{\log\nicefrac{{1}}{{\delta}}}{\alpha-1},\delta)$ -differential privacy, for any $0\!<\!\delta\!<\!1$ [30].

6.2 Bias-Variance Analysis

A key aspect of PF-LD is the choice of values $C_{d}$ and $C_{p}$ used to bound the functions $h(z)$ , and their gradients, respectively, for every sample $z\in D_{G_{i}}$ and $i\in\mathcal{I}$ . The choice of these clipping terms affects the global sensitivity of the functions of interest, which, in turn, impacts the amount of noise used by the differentially private mechanisms. As a result, the clipping terms are associated to a bias-variance trade-off: Small values can discard significant amounts of constraints information, thus introduce bias; large values retain more information but force the differential privacy mechanism to introduce larger noise, inducing more variance. It is important to recall that, at every iteration, PF-LD induces some privacy loss, thus, for a fixed privacy budget, the use of small values cannot be compensated by longer runs. This section formulates a bias-variance analysis that is helpful to select clipping values for gradient norms under the SG mechanism.

Let $\bm{G}\!=\!\nabla_{\theta}\bm{\lambda}^{\top}\!|\bm{\mu}(B_{P})-\bm{\mu}(B_{G})|$ be the gradient computed over a minibatch $B$ during the primal update of F-LD (Algorithm 1 line 1) and $\tilde{\bm{G}}\!=\!\bm{\lambda}^{\top}\!|\nabla_{\theta}\bm{\mu}(B_{P})-\bar{\nabla}^{C_{p}}_{\theta}\bm{\mu}(B_{G})|+\mathcal{N}(0,\sigma_{p}^{2}\Delta^{2}_{p}\bm{I})$ be its privacy-preserving counterpart, as computed by PF-LD (Equation (10)).

Theorem 6.

The expected error between the real and noisy gradients, $\bm{G}$ and $\tilde{\bm{G}}$ , incurred during the primal step can be upper bounded as:

\displaystyle\mathbb{E}\left[\big{\|}\bm{G}-\tilde{\bm{G}}\big{\|}\right]\leq

\displaystyle\frac{2\sqrt{S_{\tilde{\bm{G}}}}\sigma_{p}\lambda^{\max}C_{p}}{\min_{i\in\mathcal{I}}|B_{G_{i}}|-1}+\sum_{i\in\mathcal{I}}\lambda_{i}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\max\big{(}0,\left\|\nabla_{\theta}h(z)\right\|-C_{p}\big{)}\right],

where $S_{\tilde{\bm{G}}}$ is the shape (i.e., the number of entries) of $\tilde{\bm{G}}$ .

Proof.

By triangular inequality,

\mathbb{E}\left[\|\bm{G}-\tilde{\bm{G}}\|\right]\leq\mathbb{E}\left[\|\tilde{\bm{G}}-\mathbb{E}[\tilde{\bm{G}}]\|\right]+\|\mathbb{E}[\tilde{\bm{G}}]-\bm{G}\|

Where the first term of the right hand side inequality represents the variance term and the second is the bias term. The squared variance term can be bounded as follows:

\displaystyle\left(\mathbb{E}\left[\|\tilde{\bm{G}}-\mathbb{E}[\tilde{\bm{G}}]\|\right]\right)^{2}\leq\mathbb{E}\left[\|\tilde{\bm{G}}-\mathbb{E}[\tilde{\bm{G}}]\|^{2}\right]=S_{\tilde{\bm{G}}}\sigma^{2}_{p}\Delta^{2}_{p}.

(15)

The above follows from Jensen inequality: For any convex function $f$ , $f(\mathbb{E}[X])\leq\mathbb{E}[f(X)]$ , and by setting $f=(\cdot)^{2}$ and $X=\|\tilde{\bm{G}}-\mathbb{E}[\tilde{\bm{G}}]\|$ , and from noticing that the term $\mathbb{E}\left[\|\tilde{\bm{G}}-\mathbb{E}[\tilde{\bm{G}}]\|^{2}\right]$ correspond to the variance of the Gaussian noise added to the each of the $S_{\tilde{\bm{G}}}$ model gradients (see Equation (10)). Therefore $\mathbb{E}\left[\|\tilde{\bm{G}}-\mathbb{E}[\tilde{\bm{G}}]\|\right]\leq\sqrt{S_{\tilde{\bm{G}}}}\sigma_{p}\Delta_{p}$ . The bias term can be bound as follows:

$\displaystyle\left\\|\mathbb{E}[\tilde{\bm{G}}]-\bm{G}\right\\|$	$\displaystyle=\left\\|\sum_{i\in\mathcal{I}}\lambda_{i}\left(\nabla_{\theta}\big{\|}\mu(B_{P_{i}})-\bar{\nabla}^{C_{p}}_{\theta}\mu(B_{G_{i}})\big{\|}-\nabla_{\theta}\big{\|}\mu(B_{P_{i}})-\nabla_{\theta}\mu(B_{G_{i}})\big{\|}\right)\right\\|$
	$\displaystyle=\left\\|\sum_{i\in\mathcal{I}}\lambda_{i}\left(\nabla_{\theta}\mu(B_{G_{i}})-\bar{\nabla}^{C_{p}}_{\theta}\mu(B_{G_{i}})\|\right)\right\\|$
	$\displaystyle=\sum_{i\in\mathcal{I}}\lambda_{i}\left\\|\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\nabla_{\theta}h(z)\right]-\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\bar{\nabla}_{\theta}^{C_{p}}h(z)\right]\right\\|$
	$\displaystyle\leq\sum_{i\in\mathcal{I}}\lambda_{i}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\left\\|\nabla_{\theta}h(z)-\bar{\nabla}_{\theta}^{C_{p}}h(z)\right\\|\right]$	(by Jensen inequality)
	$\displaystyle=\sum_{i\in\mathcal{I}}\lambda_{i}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\max\big{(}0,\left\\|\nabla_{\theta}h(z)\right\\|-C_{p}\big{)}\right],$	(16)

where the last equality is due to gradient clipping.

Combining Equation (15) with Equation (16) and replacing the term $\Delta_{p}$ with $\frac{2C_{p}\lambda^{\max}}{\min_{i\in\mathcal{I}}|B_{G_{i}}|-1}$ (by Theorem 1), gives the sought upper bound for the expected error of the private gradients. ∎

Note that the bound above is a convex function of $C_{p}$ . Its unique minimizer satisfies:

\frac{2\sqrt{S_{\tilde{\bm{G}}}}\sigma_{p}\lambda^{\max}}{\min_{i\in\mathcal{I}}|B_{G_{i}}|-1}\!=\!\sum_{i\in\mathcal{I}}\!\lambda_{i}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\!\Big{[}\mathbbm{1}\big{[}\|\nabla_{\theta}h(z)\|\geq C_{p}\big{]}\Big{]}.

While beyond the scope of the this work, the above illustrates that a procedure to find the optimal $C_{p}$ privately can be constructed effectively.

Next, the paper shows how to bound the expected error incurred in using the noisy constraint violations during the dual step. Let $V_{i}\!=\!|\mu(D_{P_{i}})-\mu(D_{G_{i}})|$ be the value corresponding to the i-th constraint violation ( $i\in\mathcal{I}$ ), and $\tilde{V}_{i}\!=\!|\mu(D_{P_{i}})-\bar{\mu}^{C_{d}}(D_{G_{i}})|+\mathcal{N}(0,\sigma_{d}^{2}\Delta_{d}^{2})$ be its privacy-preserving version (see Equation (13)).

Theorem 7.

The expected absolute error between the real and noisy constraint violations $V_{i}$ and $\tilde{V}_{i}$ , for $i\!\in\!\mathcal{I}$ , is bounded by the following

\displaystyle\mathbb{E}\left[|V_{i}-\tilde{V}_{i}|\right]

\displaystyle\leq\frac{\sqrt{2}C_{d}\,\sigma_{d}}{\min_{i\in\mathcal{I}}|D_{G_{i}}|-1}+\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[\max(0,|h(z)|-C_{d})\right].

The proof uses similar arguments as those in the proof of Theorem 6

Proof.

By triangular inequality,

\displaystyle\mathbb{E}\left[|V_{i}-\tilde{V}_{i}|\right]\leq\left|V_{i}-\mathbb{E}[\tilde{V}_{i}]\right|+\mathbb{E}\left[\left|\tilde{V}_{i}-\mathbb{E}\left[\tilde{V}_{i}\right]\right|\right]

Where the first term of the right hand side inequality represents the variance term and the second is the bias term.

Using the Jensen inequality, the squared variance term can be bound as follows:

\left(\mathbb{E}\left[|\tilde{V}_{i}-\mathbb{E}[\tilde{V}_{i}]|\right]\right)^{2}\leq\mathbb{E}\left[\left(\tilde{V}_{i}-\mathbb{E}[\tilde{V}_{i}]\right)^{2}\right]=\sigma^{2}_{d}\Delta^{2}_{d},

where the last equality holds because the noisy constraint violations are perturbed with Gaussian random noise with a standard deviation $\sigma_{d}\Delta_{d}$ . By replacing $\Delta_{d}=\frac{\sqrt{2}C_{d}}{\min_{i\in\mathcal{I}}|D_{G_{i}}|}$ by Theorem 1, the variance term now can thus be upper bounded by:

\mathbb{E}\left[|\tilde{V}_{i}-\mathbb{E}[\tilde{V}_{i}]|\right]\leq\frac{\sqrt{2}C_{d}\,\sigma_{d}}{\min_{i\in\mathcal{I}}|D_{G_{i}}|-1}.

(17)

Next, focusing on the bias term, note that the true constraint $V_{i}$ is represented as:

\displaystyle V_{i}

\displaystyle=|\mu(D_{P_{i}})-\mu(D_{G_{i}})|=\left|\hat{\mathbb{E}}_{z\sim D_{P_{i}}}[h(z)]-\hat{\mathbb{E}}_{z\sim D_{G_{i}}}[h(z)]\right|,

(18)

and the expectation of its private version, $\tilde{V}_{i}$ , can be written as

\displaystyle\mathbb{E}\left[\tilde{V}_{i}\right]

\displaystyle=\left|\hat{\mathbb{E}}_{z\sim D_{P_{i}}}\left[h(z))\right]-\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[\frac{h(z)}{\max(1,\frac{|h(z)|}{C_{d}})}\right]\right|

(19)

where the above follows from the definition of the constraint on the group term $\bar{\mu}^{C_{d}}(D_{G_{i}}$ (see main paper, equation after Equation (13)). Combining Equations (18) and (19), the bias term can be rewritten as:

	$\displaystyle\left\|V_{i}-\mathbb{E}[\tilde{V}_{i}]\right\|$	$\displaystyle=\left\|\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[h(z)\right]-\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[\frac{h(z)}{\max(1,\frac{\|h(z)\|}{C_{d}})}\ \right]\right\|$
		$\displaystyle=\mathbb{E}_{z\sim D_{G_{i}}}\left[\max(0,\|h(z))\|-C_{d})\|\right].$		(20)

Combining Equation (20) and (17) provides the sought bound. ∎

7 PF-LD: Handing Missing Values

This section presents a simple, yet important, extension to the PF-LD model where only a subset of the individuals reports the value of the sensitive attributes. Recall that to build a fair model it is necessary to collect sensitive information about users. While the privacy risks of this data collection can be mitigated with the techniques proposed in this paper, a practical scenario may involve the data curator to give the user a choice to whether release their sensitive information or not. This section shows that it is straightforward to adapt the proposed model to this scenario.

The paper considers the case where only a fraction $r\leq 1$ of the training samples presents the sensitive attribute $A$ . To operate under this scenario, it is sufficient to modify the sensitivity $\Delta_{p}$ of the gradients of the constraint violations (Equation (7)) and the sensitivity $\Delta_{d}$ of the constraint violations (Equation (11)) to, respectively, $\Delta_{p}^{\prime}=\nicefrac{{\Delta_{p}}}{{r}}$ and $\Delta_{d}^{\prime}=\nicefrac{{\Delta_{d}}}{{r}}$ . The argument follows from the observation that reducing the number of training samples having sensitive data also reduces the components $\min_{i\in\mathcal{I}}|B_{G_{i}}|$ (Equation (7)) and $\min_{i\in\mathcal{I}}|D_{G_{i}}|$ (Equation (11)) by a factor of $r$ . Consequentially, while on one side, using less information may affect the model ability to satisfy the fairness constraint, on the other, it also require smaller amounts of noise to guarantee the same level of privacy. This trade-off is subject of analytical study, presented next.

8 Experimental Analysis

Datasets, Models, and Metrics

This section studies the behavior of the proposed algorithm on several datasets, including Income. Bank, and Compass [41] datasets, described as follows.

•

Income: The task is to predict if an user has annual income above $50,000. The protected attribute associated with the group membership is the gender, and there are two groups: male and female.
•

Compas: The task is to predict if a defendant will re-offend in the next two years (2 years recidivism). The protected attribute associated with the group membership is gender.
•
Bank: The task is to detect client subscriptions to the term deposit. The protected attribute associated with the group membership is age and the experiments consider the following three different group membership sizes:
- –
  
  Bank, in which the number of protected groups $|\mathcal{A}|=2$ , with the majority of the people of age ranging from 25 to 60 and minority, being under 25 or over 60 years old.
- –
  
  $\mbox{M-Bank}^{3}$ , in which the number of protected groups is $|\mathcal{A}|=3$ , indicating people with age ranging from 25 to 40, 41 to 60 and the remaining ages.
- –
  
  $\mbox{M-Bank}^{5}$ , in which the number of protected groups is $|\mathcal{A}|=5$ , indicating people with age ranging from 25 to 33, 34 to 40, 41 to 48, 49 to 60, and the remaining ages.

All datasets above were pre-processed according to [42, 41] and are so that all features have zero mean and unit variance. A summary of the datasets adopted and their features is provided in Table 1.

Data	# Samples	# Features	$\mathcal{Y}$	$\mathcal{A}$	$\|\mathcal{A}\|$	Protected group sizes (%)
Bank	11,162	15	Deposit	Age	2	8% ; 92%
Income	45,222	50	$\geq$ 50K	Gender	2	32% ; 68%
Compas	6,150	10	2 years recidivism	Gender	2	20% ; 80%
$\mbox{M-Bank}^{3}$	11,162	15	Deposit	Age	3	8% ; 46% ; 46%
$\mbox{M-Bank}^{5}$	11,162	15	Deposit	Age	5	8% ; 23% ; 23% ; 23% ; 23%

Table 1: Dataset summary

The experiments consider a baseline classifier (CLF), implemented as a neural network with two hidden layers, that maximize accuracy only, without considerations for fairness or privacy, and compare the proposed PF-LD model against the following state-of-the-art algorithms: Z, it implements a fair logistic regression models that achieves group fairness. These models were presented in [42] for demographic parity and in [41] for accuracy parity and equalized odds. A, it implements the fair logistic regression model based on reduction approaches. introduced in [2]. Note that the models above preserves group fairness but do not guarantee privacy. They are used to highlight the effectiveness of the proposed approach based on the Lagrangian dual to ensure fairness. Finally, M, proposed in [33], the model most related to the proposed work, ensures both fairness and $\epsilon$ -differential privacy with respect to the sensitive attributes. The algorithm uses the fair model A on perturbed noisy data generated according to a randomized response mechanism [25]. While these models where studied in the context of equalized odds, this work extends them to satisfy all fairness definitions considered in this work. Compared to the proposed model M has the disadvantage of introducing large amounts of noise to the sensitive attribute, especially when the domain of these attributes is large and/or when $A$ is high-dimensional. ¹¹1 The authors note there is an additional work which addresses learning a fair and private classifier [24]. While an important contribution, it has been shown to induce significant privacy losses (see Figure 1 of [24]). Model M was shown to outperform these algorithms presented in [24] in terms of classification error bounds. Therefore, this paper adopts M, as the state-of-the-art.

The experiments analyze the accuracy, fairness violations, and privacy losses (when applicable) of the models above. The fairness violations are measured as the maximal difference in fairness constraint violations between any two protected groups. The privacy losses are set to $\epsilon=1.0$ and $\delta=10^{-5}$ , unless otherwise specified. PF-LD uses clipping bound values $C_{p}=10.0$ and $C_{d}=5.0$ .

			CLF	Z	A	M	DF-LP
Bank	Accuracy Parity	acc	0.8 (0.007)	0.801 (0.005)	0.808 (0.004)	0.799 (0.007)	0.812 (0.004)
	Accuracy Parity	fv	0.041 (0.021)	0.025 (0.015)	0.006 (0.004)	0.036 (0.019)	0.021 (0.009)
	Demographic Parity	acc	0.8 (0.007)	0.78 (0.009)	0.772 (0.01)	0.784 (0.008)	0.793 (0.013)
	Demographic Parity	fv	0.299 (0.034)	0.03 (0.021)	0.03 (0.017)	0.131 (0.048)	0.126 (0.027)
	Equalized odds	acc	0.8 (0.007)	0.791 (0.006)	0.764 (0.007)	0.796 (0.006)	0.808 (0.004)
	Equalized odds	fv	0.239 (0.046)	0.113 (0.054)	0.125 (0.055)	0.252 (0.103)	0.188 (0.076)
Income	Accuracy Parity	acc	0.848 (0.003)	0.848 (0.004)	0.834 (0.005)	0.842 (0.004)	0.782 (0.008)
	Accuracy Parity	fv	0.114 (0.005)	0.114 (0.007)	0.093 (0.006)	0.11 (0.006)	0.061 (0.019)
	Demographic Parity	acc	0.848 (0.003)	0.827 (0.019)	0.789 (0.031)	0.792 (0.035)	0.799 (0.002)
	Demographic Parity	fv	0.181 (0.007)	0.039 (0.012)	0.044 (0.053)	0.063 (0.054)	0.019 (0.009)
	Equalized odds	acc	0.848 (0.003)	0.842 (0.003)	0.84 (0.003)	0.837 (0.002)	0.841 (0.003)
	Equalized odds	fv	0.094 (0.013)	0.102 (0.022)	0.042 (0.005)	0.064 (0.014)	0.044 (0.006)
Compas	Accuracy Parity	acc	0.683 (0.011)	0.677 (0.013)	0.675 (0.013)	0.661 (0.019)	0.671 (0.015)
	Accuracy Parity	fv	0.024 (0.014)	0.016 (0.008)	0.033 (0.022)	0.016 (0.012)	0.031 (0.029)
	Demographic Parity	acc	0.683 (0.011)	0.634 (0.017)	0.643 (0.014)	0.664 (0.014)	0.667 (0.015)
	Demographic Parity	fv	0.121 (0.022)	0.023 (0.012)	0.048 (0.023)	0.092 (0.032)	0.098 (0.037)
	Equalized odds	acc	0.683 (0.011)	0.647 (0.011)	0.662 (0.019)	0.67 (0.013)	0.677 (0.015)
	Equalized odds	fv	0.118 (0.038)	0.049 (0.033)	0.097 (0.044)	0.139 (0.036)	0.115 (0.04)
$\mbox{M-Bank}^{3}$	Accuracy Parity	acc	0.807 (0.008)	NA	0.813 (0.005)	0.817 (0.009)	0.827 (0.007)
	Accuracy Parity	fv	0.064 (0.021)	NA	0.037 (0.012)	0.04 (0.032)	0.033 (0.007)
	Demographic Parity	acc	0.807 (0.008)	NA	0.771 (0.008)	0.797 (0.006)	0.811 (0.008)
	Demographic Parity	fv	0.317 (0.035)	NA	0.067 (0.026)	0.261 (0.085)	0.218 (0.033)
	Equalized odds	acc	0.807 (0.008)	NA	0.783 (0.004)	0.808 (0.01)	0.825 (0.006)
	Equalized odds	fv	0.244 (0.07)	NA	0.199 (0.054)	0.348 (0.133)	0.279 (0.082)
$\mbox{M-Bank}^{5}$	Accuracy Parity	acc	0.807 (0.008)	NA	0.809 (0.007)	0.814 (0.008)	0.827 (0.007)
	Accuracy Parity	fv	0.078 (0.013)	NA	0.055 (0.018)	0.059 (0.009)	0.046 (0.008)
	Demographic Parity	acc	0.807 (0.008)	NA	0.757 (0.015)	0.774 (0.011)	0.805 (0.009)
	Demographic Parity	fv	0.341 (0.034)	NA	0.091 (0.028)	0.336 (0.047)	0.193 (0.035)
	Equalized odds	acc	0.807 (0.008)	NA	0.785 (0.011)	0.8 (0.008)	0.823 (0.007)
	Equalized odds	fv	0.271 (0.068)	NA	0.214 (0.045)	0.4 (0.117)	0.297 (0.084)

Table 2: Accuracy (acc) and fairness violation (fv) among all models and datasets. Models M and DF-LP uses privacy budget

\epsilon=1.0

. The results report the average value of a five fold cross validation and its associated standard deviation (in parenthesis). Best accuracy or fairness violation scores between M and DF-LP are highlighted in bold.

Architectures and Parameter Tuning Models PF-LD, M, and A all use a ReLU neural network with two hidden layers. Thus these model have the same number of parameters to optimize.

The authors have not performed a systematic model tuning for the proposed PF-LD. Thus, they suspect that the proposed model can achieve a higher accuracy, fairness, and privacy tradeoff that that reported here, if the hyperparameters are tuned for specific benchmark and fairness definitions. However, this is not the focus of this work.

8.1 Accuracy and Fairness

This section analyzes the impact on accuracy and fairness of the privacy-preserving models introduced above. The main results are summarized in Table 2. Notice that the algorithm associated with model Z only handles binary group data, and thus the table entries associated with the multi-group experiments are left blank. For each dataset and fairness definition, the table reports the average accuracy (acc) and fairness violation (fv) results and standard deviations (in parenthesis) of a five fold cross validation. The results highlight in bold the best outcome, in terms of accuracy or fairness violation, attained when comparing models M and DF-LP.

The table illustrates a clear trend: The proposed DF-LP is able to attain accuracy and fairness violation scores that are comparable with those attained by the models that do not consider privacy, and achieves a better accuracy-fairness tradeoff, when compared with M. Indeed, DF-LP achieves a better accuracy or fairness scores in 90%(27/30) of cases. Additionally, while the fairness violation scores achieved by model M are comparable with (or worse than) those attained by the basic classifier on the multi protected attribute datasets, DF-LP is able to consistently reduce the model disparate effects. This is due to that model M replies on an input perturbation mechanism to guarantee privacy. However, the amount of noise required to guarantee privacy increases with the size of the protected attribute domain.

8.2 Privacy, Fairness, and Accuracy Tradeoff

This section illustrates the tradeoff between privacy, fairness, and accuracy attained by PF-LD and compares them with algorithm M. The results are summarized in Figure 1, that depicts the average and standard deviation of 10 model runs.

Firstly, observe that the fairness violation score decreases as the privacy budget $\epsilon$ increases (note that the scale differs across the plots). Large privacy losses allow PF-LD to either run more iterations, given fixed noise values used at each iteration, or reduce the level of noise applied to a given number of iterations. These cases imply propagating more (former case) or more accurate (latter case) noisy constraint violations that results in better capturing the fairness constraints violations during the primal and dual update steps. This aspect is not obvious for M.

Next, notice that the model accuracy slightly decreases as $\epsilon$ increases. While this may seems surprising, our analysis shows that the fairness constraints, having their violations being propagated more exactly when $\epsilon$ increase, have a negative impact on the model accuracy.

Finally, notice that, in most cases, PF-LD is more accurate and produce models that have smaller fairness violations, and, importantly, it produces models that are more robust than those produced by M. This is noticeable by comparing the standard deviations on accuracy and fairness violations of the two models.

These observations demonstrates the practical benefits of the proposed model.

8.3 PF-LD: Analysis of the Primal Clipping Bound Value

This sections analyses the impact of primal clipping bound $C_{p}$ to the privacy, accuracy, and fairness tradeoff. Figure 2 illustrates the effects of $C_{p}$ on the model accuracy and fairness, at varying of the privacy parameter $\epsilon$ .

Observe that, for different fairness definitions, the best accuracy/fairness tradeoff is obtained when $C_{p}\in[10,20]$ (green and yellow curves). The use of small clipping values (blue curve) slows the drop in fairness violations at the increasing of the privacy budget $\epsilon$ . This is because small $C_{p}$ values limit the impact of the constraints violations to the model. On the other extreme, for high $C_{p}$ values (e.g., brown curve), not only it is observed a degradation in fairness violations, but also in the model accuracy. This is because large $C_{p}$ values imply larger amount of noise to be added to the gradient of the constraint violations, resulting in less accurate information to be back-propagated at each step. Additionally, the noisy constraints gradients can negatively impact the classifier loss function. Thus, the resulting models tend to have worse accuracy/fairness tradeoff than those trained with intermediate $C_{p}$ values.

These observations support the theoretical analysis which showed that the expected error between the true and private gradients of the fairness constraints is upper bounded by a convex function of the primal and the dual clipping bounds.

To further shed lights on the impacts of $C_{p}$ to the model fairness and accuracy, Figure 3 illustrates the model accuracy (left column of each sub-figure) the fairness violations (middle column of each sub-figure) and the percentage of times the norm of the gradients associated to the constraint violations of a protected group exceeds the clipping value $C_{p}$ : $\|\tilde{G}\|>C_{p}\%$ (right column of each sub-figure). The last column on each sub-figure indicates the frequency of propagating the correct or the clipped information. The figure uses demographic parity, but the results are consistent across the other fairness metrics studied. Observe that, the percentage of individual constraint gradients exceeding $C_{p}$ is very high when $C_{p}$ is small. Thus, a significant amount of information is lost due to clipping. On the other extreme, at large $C_{p}$ regimes most individual gradients (for both protected groups) are smaller than $C_{p}$ . This choice reduces bias, but it introduces large variances due to noise necessary to preserve privacy. Therefore, both cases result in models that have large fairness violations. Conversely, at intermediate $C_{p}$ regimes, the produced models have lower constraint violations while retaining high accuracy.

8.4 PF-LD: Analysis of the Dual Clipping Bound Value

This sections analyses the impact of dual clipping value $C_{d}$ to the privacy, accuracy, and fairness tradeoff. Figure 4 shows one example of the impact of $C_{d}$ to the model accuracy and fairness violations on Bank dataset. The results show a similar trend to what depicted for the for primal clipping bound: In dual update using larger values $C_{d}$ can introduce higher variance. In contrast, small values can limit the amount of constraint violation being back propagated to update the multipliers $\bm{\lambda}$ . Hence, the model bias may increase. A clear example of this case is illustrated in Figure 4. It showed that using large values of $C_{d}$ (e.g $C_{d}=100.0$ (green curve) or $C_{d}=1000.0$ (red curve)) introduce large fluctuation to the model accuracy and fairness violation scores. In addition, large clipping bounds $C_{d}$ can deteriorate the model’s accuracy. Nevertheless, small clipping bounds $C_{d}$ might not reduce fairness violation at high privacy loss regimes. Therefore, intermediate clipping values suggest better accuracy/fairness tradeoffs and, at the same time, do not introduce much variance to the model performance.

8.5 Missing Values

The last experiments present results for the PF-LD model extension that handles the missing sensitive information. The model is tested for cases when 60%, 40%, 20%, and no entry in the training data misses the sensitive information. Missing values are not considered during the model evaluation process to assess the model performance. Figure 5 depict the tradeoff among accuracy, fairness, and privacy on Bank data using demographic parity as a fairness metric. It can be noted that the model achieve smaller fairness violations as the number of missing values decreases. Similarly, the model is better able to trade accuracy for fairness violations as the number of missing values decreases.

9 Conclusions

This paper was motivated by the discrepancy between concerns in building models whose outcomes do not discriminate against some demographic groups and the requirements that the sensitive attributes, which are essential to build these models, may not be available due to legal and ethical requirements. It proposed a framework to train deep learning models that satisfy several notions of group fairness, including equalized odds, accuracy parity, and demographic parity, while ensuring that the model satisfies differential privacy for the protected attributes. The framework relies on the use of Lagrangian duality to accommodate the fairness constraints and the paper showed how to inject carefully calibrated noise to the primal and dual steps of the Lagrangian dual process to guarantee privacy of the sensitive attributes. The paper further analyses the tension between accuracy, privacy, and fairness and an extensive experimental evaluation illustrates the benefits of the proposed framework showing that it may be come a practical tool for privacy-preserving and fair decision making.

References

[1] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
[2] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In ICML, 2018.
[3] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In ICML, pages 60–69, 2018.
[4] Eugene Bagdasaryan, Omid Poursaeed, and Vitaly Shmatikov. Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems, pages 15479–15488, 2019.
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. In NIPS, pages 5049–5059, 2019.
[6] Alex Beutel, Jilin Chen, Zhe Zhao, and Ed H. Chi. Data decisions and theoretical implications when adversarially learning fair representations, 2017.
[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
[8] Toon Calders, Asim Karim, Faisal Kamiran, Wasif Ali, and Xiangliang Zhang. Controlling attribute effect in linear regression. In ICDM, 2013.
[9] Alexandra Chouldechova and Aaron Roth. The frontiers of fairness in machine learning, 2018.
[10] Rachel Cummings, Varun Gupta, Dhamma Kimpara, and Jamie Morgenstern. On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pages 309–315, 2019.
[11] Michele Donini, Luca Oneto, Shai Ben-David, John Shawe-Taylor, and Massimiliano Pontil. Empirical risk minimization under fairness constraints. In NIPS, 2018.
[12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
[13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
[14] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[15] Harrison Edwards and Amos Storkey. Censoring representations with an adversary, 2015.
[16] Michael D Ekstrand, Rezvan Joshaghani, and Hoda Mehrpouyan. Privacy for all: Ensuring fair and equitable privacy protections. In Conference on Fairness, Accountability and Transparency, pages 35–47, 2018.
[17] Michael Feldman. Computational fairness: Preventing machine-learned discrimination. 2015.
[18] Ferdinando Fioretto, Terrence W. K. Mak, Federico Baldo, Michele Lombardi, and Pascal Van Hentenryck. A lagrangian dual framework for deep neural networks with constraints. CoRR, abs/2001.09394, 2020.
[19] Ferdinando Fioretto, Terrence W. K. Mak, and Pascal Van Hentenryck. Predicting AC optimal power flows: Combining deep learning and lagrangian dual methods. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), pages 630–637, 2020.
[20] Daniel Fontaine, Michel Laurent, and Pascal Van Hentenryck. Constraint-based lagrangian relaxation. In Principles and Practice of Constraint Programming, pages 324–339, 2014.
[21] Moritz Hardt, Eric Price, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In NIPS. 2016.
[22] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.
[23] Magnus R Hestenes. Multiplier and gradient methods. Journal of optimization theory and applications, 4(5):303–320, 1969.
[24] Matthew Jagielski, Michael Kearns, Jieming Mao, Alina Oprea, Aaron Roth, Saeed Sharifi Malvajerdi, and Jonathan Ullman. Differentially private fair learning. In Proceedings of the 36th International Conference on Machine Learning, 2019.
[25] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. In Advances in Neural Information Processing Systems 27. 2014.
[26] Niki Kilbertus, Adria Gascon, Matt Kusner, Michael Veale, Krishna Gummadi, and Adrian Weller. Blind justice: Fairness with encrypted sensitive attributes. Proceedings of Machine Learning Research, 2018.
[27] Michael P. Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019.
[28] Pranay Kr. Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R. Varshney, and Ruchir Puri. Bias mitigation post-processing for individual and group fairness. CoRR, abs/1812.06135, 2018.
[29] H. Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A general approach to adding differential privacy to iterative training procedures, 2018.
[30] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.
[31] Ilya Mironov. Rényi differential privacy. 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Aug 2017.
[32] Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled gaussian mechanism, 2019.
[33] Hussein Mozannar, Mesrob I. Ohannessian, and Nathan Srebro. Fair learning with private demographic data. In ICML, 2020.
[34] Luca Oneto, Michele Doninini, Amon Elders, and Massimiliano Pontil. Taking advantage of multitask learning for fair classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019.
[35] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data, 2016.
[36] Venkatadheeraj Pichapati, Ananda Theertha Suresh, Felix X. Yu, Sashank J. Reddi, and Sanjiv Kumar. Adaclip: Adaptive clipping for private sgd, 2019.
[37] David Pujol, Ryan McKenna, Satya Kuppam, Michael Hay, Ashwin Machanavajjhala, and Gerome Miklau. Fair decision making using privacy-protected data. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 189–199, 2020.
[38] Novi Quadrianto and Viktoriia Sharmanska. Recycling privileged learning and distribution matching for fairness. In NIPS. 2017.
[39] Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors, 2017.
[40] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web, pages 1171–1180, 2017.
[41] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th international conference on world wide web, pages 1171–1180, 2017.
[42] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification, 2015.
[43] Han Zhao, Amanda Coston, Tameem Adel, and Geoffrey J. Gordon. Conditional learning of fair representations. In ICLR. 2019.

$\displaystyle\left\\|\mathbb{E}[\tilde{\bm{G}}]-\bm{G}\right\\|$	$\displaystyle=\left\\|\sum_{i\in\mathcal{I}}\lambda_{i}\left(\nabla_{\theta}\big{\|}\mu(B_{P_{i}})-\bar{\nabla}^{C_{p}}_{\theta}\mu(B_{G_{i}})\big{\|}-\nabla_{\theta}\big{\|}\mu(B_{P_{i}})-\nabla_{\theta}\mu(B_{G_{i}})\big{\|}\right)\right\\|$
	$\displaystyle=\left\\|\sum_{i\in\mathcal{I}}\lambda_{i}\left(\nabla_{\theta}\mu(B_{G_{i}})-\bar{\nabla}^{C_{p}}_{\theta}\mu(B_{G_{i}})\|\right)\right\\|$
	$\displaystyle=\sum_{i\in\mathcal{I}}\lambda_{i}\left\\|\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\nabla_{\theta}h(z)\right]-\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\bar{\nabla}_{\theta}^{C_{p}}h(z)\right]\right\\|$
	$\displaystyle\leq\sum_{i\in\mathcal{I}}\lambda_{i}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\left\\|\nabla_{\theta}h(z)-\bar{\nabla}_{\theta}^{C_{p}}h(z)\right\\|\right]$	(by Jensen inequality)
	$\displaystyle=\sum_{i\in\mathcal{I}}\lambda_{i}\hat{\mathbb{E}}_{z\sim B_{G_{i}}}\left[\max\big{(}0,\left\\|\nabla_{\theta}h(z)\right\\|-C_{p}\big{)}\right],$	(16)

	$\displaystyle\left\|V_{i}-\mathbb{E}[\tilde{V}_{i}]\right\|$	$\displaystyle=\left\|\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[h(z)\right]-\hat{\mathbb{E}}_{z\sim D_{G_{i}}}\left[\frac{h(z)}{\max(1,\frac{\|h(z)\|}{C_{d}})}\ \right]\right\|$
		$\displaystyle=\mathbb{E}_{z\sim D_{G_{i}}}\left[\max(0,\|h(z))\|-C_{d})\|\right].$		(20)

Differentially Private and Fair Deep Learning: A Lagrangian Dual Approach

Abstract

1 Introduction

2 Related Work

3 Problem Settings and Goals

4 Preliminaries

4.1 Fairness

Example 1 (Demographic parity).

Example 2 (Equalized odds).

Example 3 (Accuracy parity).

4.2 Differential Privacy

Definition 1 (Differential Privacy).

5 Constrained Learning with Lagrangian Duality

6 A Private and Fair LD Model

Private Primal Update

Theorem 1.

Proof.

Private Dual Update

Theorem 2.

Proof.

6.1 Privacy Analysis

Definition 2.

Theorem 3.

Theorem 4.

Lemma 1.

Proof.

Lemma 2.

Proof.

Theorem 5.

Proof.

6.2 Bias-Variance Analysis

Theorem 6.

Proof.

Theorem 7.

Proof.

7 PF-LD: Handing Missing Values

8 Experimental Analysis

Datasets, Models, and Metrics

8.1 Accuracy and Fairness

8.2 Privacy, Fairness, and Accuracy Tradeoff

8.3 PF-LD: Analysis of the Primal Clipping Bound Value

8.4 PF-LD: Analysis of the Dual Clipping Bound Value

8.5 Missing Values

9 Conclusions

References

Differentially Private and Fair Deep Learning:
A Lagrangian Dual Approach