A Novel Regularization Approach to Fair ML
Abstract
A number of methods have been introduced for the fair ML problem, most of them complex and many of them very specific to the underlying ML methodology. Here we introduce a new approach that is simple, easily explained, and potentially applicable to a number of standard ML algorithms. Explicitly Deweighted Features (EDF) reduces the impact of each feature among the proxies of the sensitive variables, allowing a different amount of deweighting to be applied to each such feature. The user specifies the deweighting hyperparameters in order to achieve a given point in the Utility/Fairness tradeoff spectrum. We also introduce a new, simple criterion for evaluating the degree of protection afforded by any fair ML method.
1 Introduction
There has been increasing concern that the use of powerful ML tools may act in ways discriminatory toward various protected groups, a concern that has spawned an active research field. One challenge has been that the term “discriminatory” can have a number of different statistical meanings, which can themselves be the subject of debate concerning the propriety of applying such criteria.
A common criterion in the litigation realm is that two individuals who are of different races, genders, ages and so on but who are otherwise similarly situated should be treated equally [21]. As the Court noted in Carson v. Bethlehem Steel [1],
The central question in any employment-discrimination case is whether the employer would have taken the same action had the employee been of a different race (age, sex, religion, national origin, etc.) and everything else had remained the same
Many US regulatory agencies apply a broader standard, the 4/5 Rule [21], and as noted, various criteria have been proposed in the ML fairness literature. However, the present paper focuses on the above “if everything else had remained the same” criterion of Carson v. Bethlehem Steel, which is also the context of [14] and [20]. This is the notion of disparate treatment [21]. See further discussion of legal aspects in [11], where the author is both a statistician and a lawyer.
Disparate treatment can be described statistically as follows. Consider data of the form $(Y, X, S)$, for an outcome variable $Y$, a feature set $X$ and a separate set of sensitive/protected variables $S$. Say we have a new case, with features $X_{new}$, whose outcome $Y_{new}$ is to be predicted. Let $\widehat{Y}_{new}$ denote our predicted value. Then ideally, $\widehat{Y}_{new}$ will be statistically independent of $S_{new}$ [14].

However, note our use of the word ideally above. Full independence of $\widehat{Y}$ and $S$ is achievable only under very restrictive assumptions, as in [14] and [20]. In particular, an obstacle exists in the form of a set of features $C$, belonging to $X$, that are correlated with $S$ and thus carry indirect information about $S$.

On the other hand, the features in $C$ may have substantial predictive power for $Y$. One would not wish to exclude a feature that is strongly related to $Y$ simply because it is somewhat related to $S$. In other words, we have the following issue.

Thus the set $C$ is the crux of the ML fairness issue. How do we use $C$ and the rest of $X$ to predict $Y$ well (utility), while using only minimal information about $S$ (fairness)?
Though the fairness literature is already quite extensive, much of it is either highly complex, or closely tied to specific ML methods, or both. Our goals in the present work are to develop fair methods that
• are simple to explain and use,

• apply to a number of standard ML algorithms, and

• do not require use of, or even knowledge of, $S$. (That last point refers to the fact that if we believe the features in $C$ are related to $S$, we do not actually need data on $S$ itself.)
We present a general approach, Explicitly Deweighted Features (EDF), which reduces the impact of $C$. The term explicit alludes to the fact that the user directly specifies the deweighting hyperparameters, one for each feature in $C$.
Our work is inspired by [17] and [20], which deal with linear models. Here EDF borrows from the well-understood methodologies of ridge regression and the LASSO. Moreover, we will also present our analyses of random forests and k-nearest neighbor methods, in which EDF deweights features in other manners natural to the structure of those ML algorithms.
We will also introduce a novel measure of the strength of the relation between $\widehat{Y}$ and $S$ (the weaker, the better). Recalling our ideal criterion above of independence between these two quantities, one might measure their correlation. However, correlation is a less desirable measure in cases in which one or both of the quantities is a one-hot (indicator) variable. Instead, we correlate modified versions of the quantities, to be presented below. Note that the modified quantities will be based only on $C$ and the rest of $X$, not on $S$ itself, thus forming a measure of fairness that deals purely with the proxy information in $X$.
The organization of this paper is as follows. We set the stage in Section 2 regarding notation and statistical issues. A literature review follows next, in Section 3, and we describe our novel fairness measure in Section 4. The linear/generalized linear portion of the paper then begins in Section 5, with empirical evaluations in Section 8. The material on other ML algorithms then starts in Section 9.
2 Notation and Context
Except when otherwise stated, all mathematical symbols will refer to population quantities rather than sample estimates. (This population/sample paradigm is of course the classical statistical view. Readers who prefer ML terms such as probabilistic generative mechanism may substitute accordingly.)
We consider here the common case in which $\widehat{Y}$ is derived from an estimated regression function. By “regression function” we simply mean conditional expectation. Notationally, it is convenient to use the random variable form: in $E(Y \mid X)$, $X$ is considered random, and thus so is $E(Y \mid X)$ [19].

The regression function is estimated from training data via either a parametric (e.g. linear) or nonparametric (e.g. random forests) approach. And we of course include the binary/categorical case, in which $Y$ is a dummy/one-hot indicator variable or a vector of such variables, so that the regression function consists of conditional probabilities.

In other words, ideally we aim to have $\widehat{Y}_{new}$ independent of $S_{new}$, or at least approximately so. (In the case of dichotomous $Y$, the analyst will typically do something like rounding $\widehat{Y}_{new}$ to 0 or 1.) To avoid equation clutter, let us drop the new subscript and simply write the ideal condition as

$\widehat{Y} \;\perp\; S \qquad (1)$
We assume here that policy is to completely exclude $S$ from the ML computation. This is a slightly stronger assumption than those typically made in the fairness literature, and we make it first for convenience: it simplifies notation and code. Our methodology would need only slight changes below if $S$ were included, so this assumption should not be considered a restriction.

And, second, the assumption makes sense from a policy point of view. The analyst may, for instance, be legally required to exclude data on $S$ during the training stage, i.e. to develop prediction rules based only on estimation of $E(Y \mid X)$ rather than of $E(Y \mid X, S)$. (Presumably the analyst is still allowed to use the $S$ data in the training and test sets to gauge how well her resulting prediction rule works.)
3 Previous Work on Regularized Approaches to Fair ML
The term regularization commonly refers to minimization of a loss function subject to some penalty, often in the form of $L_1$ and $L_2$ penalties in linear regression models [12]. Our EDF method falls into this category in the case of linear models.

Let us first review some of the literature in this regard. In [5] the goal is to satisfy an equalized odds criterion, matching the rates of both false negatives and false positives across groups. This is realized by adding terms to the loss function that penalize the differences in FPR and FNR between the groups in the population.
[7] proposes a meta-algorithm that incorporates a large set of linear fairness constraints. It runs a polynomial time algorithm that computes an approximately optimal classifier under the fairness constraints. This approach is based on empirical risk minimization, incorporating a fairness constraint that minimizes the difference of conditional risks between two groups. Since this minimization is a difficult nonconvex nonsmooth problem, they replace the hard loss in the risk with a convex loss function (hinge loss) and the hard loss in the constraint with the linear loss.
[15] proposed a (double-)regularization approach based on mutual information between $Y$ and $S$, with special attention to indirect prejudice, as in the present paper.
3.1 Relation to the Work of Komiyama et al and Scutari et al
A much-cited work that shares some similarity with the present paper is that of [17], later extended by, for instance, [20].

In [17], the usual least-squares estimate is modified by imposing an upper bound on the coefficient of determination in the prediction of $\widehat{Y}$ by (a scalar) $S$. Closed-form expressions are derived, and a kernelized form is considered as well. In this paper, we focus on the approach of [20], which builds on that of [17].
Those two papers develop a kind of 2-stage least squares method, which we now describe in terms relevant to the present work. We use different notation, make previously-implicit assumptions explicit, and apply a bit more rigor, but the method is the same. As noted, all quantities here are population values, not sample statistics.
The basic assumption (BA) amounts to $(Y, X, S)$ having a multivariate Gaussian distribution, with $Y$ scalar and $X$ being a vector of length $p$. For convenience, assume here that $S$ is scalar. All variables are assumed centered. The essence of the algorithm is as follows.

One first applies a linear model in regressing $X$ on $S$,

$E(X \mid S) = bS \qquad (2)$

where $b$ is a length-$p$ coefficient vector. BA implies the linearity.

This yields residuals

$U = X - bS \qquad (3)$

Note that $U$ is a vector of length $p$.

$Y$ is then regressed on $S$ and $U$, which again by BA is linear:

$E(Y \mid S, U) = \alpha S + \beta'U \qquad (4)$

where $\alpha$ is a scalar and $\beta$ is a vector of length $p$.

Residuals and regressors are uncorrelated, i.e. have 0 covariance (in general, not just in linear or multivariate normal settings). So, from (3), we have that $U$ is uncorrelated with $S$. In general 0 correlation does not imply independence, but again under BA, uncorrelatedness does imply independence. This yields

$U \;\perp\; S \qquad (5)$

and, by the Tower Property [16], since $E(S \mid U) = E(S) = 0$,

$E(Y \mid U) = E\left[\,E(Y \mid S, U) \mid U\,\right] = \alpha\, E(S \mid U) + \beta'U = \beta'U \qquad (6)$

In other words, we have enabled $S$-free predictions of $Y$, based on $U$, i.e. (1) holds. The residuals $U$ form our new features, replacing $X$.
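For concreteness, here is a minimal R sketch of the sample analog of this two-stage scheme (variable and function names are hypothetical, not the notation or code of [17] or [20]):

```r
# Two-stage residualization, sample analog of (2)-(6).
# y: numeric outcome; s: numeric sensitive variable; x: numeric feature matrix.
residualFit <- function(y, s, x) {
  x <- scale(as.matrix(x), scale = FALSE)            # center features
  s <- s - mean(s); y <- y - mean(y)                 # center S and Y
  u <- apply(x, 2, function(xj) resid(lm(xj ~ s)))   # residuals U, as in (3)
  fit <- lm(y ~ u - 1)                               # regress Y on U only, as in (6)
  list(coef = coef(fit), newFeatures = u)            # U replaces X in prediction
}
```

Predictions for new cases would then be formed from their residualized features, computed using the same regression of $X$ on $S$.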
This technically solves the fairness problem, since $U$ and $S$ are independent. However, there are some major concerns:

(a) The BA holds only rarely, if ever, in practical use.

(b) The above methodology, while achieving fairness, does so at a significant cost in terms of our ability to predict $Y$. In removing all vestige of correlation with $S$, it may weaken some variables in $X$ that are helpful in predicting $Y$. In other words, this methodology may go too far toward the Fairness end of the Fairness-Utility Tradeoff.
The authors in [17] were most concerned about (b) above. They thus modified their formulation of the problem to allow some correlation between $S$ and $\widehat{Y}$, the predicted value of $Y$. They then formulate and solve an optimization problem that finds the best predictor, subject to a user-specified bound on the coefficient of determination, $R^2$, in predicting $\widehat{Y}$ from $S$. The smaller the bound, the fairer the predictions, but the weaker the predictive ability. Their method is thus an approach to the Fairness/Utility Tradeoff.
The later paper [20] extends this approach, using a ridge penalty. This has the advantage of simplicity. Specifically, they choose $\alpha$ and $\beta$ to be

$(\widehat{\alpha}, \widehat{\beta}) = \arg\min_{\alpha, \beta} \; \| Y - \alpha S - U\beta \|^2 + \lambda \alpha^2 \qquad (7)$

where $\lambda$ is a regularizing hyperparameter.
If this produces a coefficient of determination at or below the desired level, these are the final estimated regression coefficients. Otherwise, an update formula is applied.
Thus [20] finds a simpler, nearly-closed form extension of [17]. And they find in an empirical study that their method has superior performance in the Fairness-Utility Tradeoff:
We compare our approach with the regression model from Komiyama et al. (2018), which implements a provably-optimal linear regression model, and with the fair models from Zafar et al. (2019). We evaluate these approaches empirically on six different data sets, and we find that our proposal provides better goodness of fit and better predictive accuracy for the same level of fairness.
The present work differs from these approaches in several respects. In particular:

• EDF allows limited but explicit involvement of the specific features in $C$ in predicting $Y$.
4 Evaluation Criteria
As noted, the Fairness-Utility Tradeoff is a common theme in ML fairness literature [26] [4]. This raises the question of how to quantify how well we do in that tradeoff. In particular, how do we evaluate fairness?
Again, the literature is full of various measures for assessing fairness, but here we propose another that some will find useful. In light of our ideal state, (1), one might take a preliminary measure to be
$\rho\left(\widehat{Y}, S\right) \qquad (8)$

However, a correlation coefficient is generally used to measure association between two continuous variables, yet in our context one or both of $\widehat{Y}$ and $S$ may be one-hot variables, not continuous. What can be done? We propose the following.
4.1 A Novel Fairness Measure
As noted earlier, we aim for simplicity, with the measure of fairness consisting of a single number if possible. This makes the measure easier to understand by nontechnical consumers of the analysis, say a jury in litigation, and is more amenable to graphical display. Our goal here is thus to retain the correlation concept (as in [20]), but in a modified setting.
Specifically, we will use the correlation between modified versions of $\widehat{Y}$ and $S$, denoted by $\widetilde{Y}$ and $\widetilde{S}$. They are defined as follows:

• If $Y$ is continuous, set $\widetilde{Y} = \widehat{Y}$. If instead $Y$ is a one-hot variable, set $\widetilde{Y}$ to the estimated $P(Y = 1 \mid X)$.

• If $S$ is continuous, set $\widetilde{S} = S$. If instead $S$ is a one-hot variable, set $\widetilde{S}$ to the estimated $P(S = 1 \mid X)$.
Our fairness measure is then the correlation $\rho\left(\widetilde{Y}, \widetilde{S}\right)$, calculated for each component of $S$.

In other words, we have arranged things so that in all cases we are correlating two continuous quantities. One then reports $\rho^2\left(\widetilde{Y}, \widetilde{S}\right)$, estimated from our dataset. The reason for squaring is that squared correlation is well known to be the proportion of variance of one variable explained by the other (as in the use of the coefficient of determination in [17] and [20]).

For categorical $S$, a squared correlation is reported for each category, e.g. for each race if $S$ represents race.

We assume that the ML method being used provides the needed estimated conditional means, for both numeric and categorical variables; in the latter case, this is a vector of probabilities, one for each category. Such estimated means and probabilities come naturally from ML methods such as linear/generalized linear models, tree-based models, and neural networks. For others, e.g. support vector machines, one can use probability calibration techniques [13].
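As a concrete illustration, here is a minimal R sketch of the measure for the common case of numeric $\widehat{Y}$ and a 0/1 $S$ (names hypothetical; the estimate of $P(S = 1 \mid X)$ is obtained here with a logistic model, one of several possibilities):

```r
# Squared-correlation fairness measure for numeric predictions and a binary S.
# yhat: predicted Y on a holdout set; s: 0/1 sensitive variable; x: features (S excluded).
fairnessRho2 <- function(yhat, s, x) {
  x <- as.data.frame(x)
  sProb <- fitted(glm(s ~ ., data = x, family = binomial))  # estimate of P(S = 1 | X)
  cor(yhat, sProb)^2                                        # the reported measure
}
```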
5 EDF in the Linear Case
In the linear setting, we employ a regularizer that works as a modified ridge regression operator, applied only to the variables in $C$. Note that the deweighting hyperparameter here will be a vector, allowing different amounts of deweighting for different features in $C$.

For the $n$ data points in our training set, let $Y$ denote the vector of their outcome values, and define the matrix $X$ similarly for their feature values.
5.1 Structural Details
Our only real assumption is linearity, i.e. that

$E(Y \mid X) = X\beta \qquad (9)$

for some constant vector $\beta$ of length $p$. The predictors in $X$ can be continuous, discrete, categorical etc. Since we are not performing statistical inference, assumptions such as homoscedasticity are not needed.
We then generalize the classic ridge regression problem to finding
$\widehat{\beta}_D = \arg\min_b \; (Y - Xb)'(Y - Xb) + b'Db \qquad (10)$

where the diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_p)$ is a hyperparameter. We will want $d_i = 0$ if feature $i$ is not in $C$, while within $C$, features that we wish to deweight more heavily will have larger $d_i$. (In order for $D$ to be invertible, we will temporarily assume that $d_i$ is positive but small for features not in $C$.)
Now with the change of variables $\widetilde{X} = X D^{-1/2}$ and $\widetilde{b} = D^{1/2} b$, (10) becomes

$\min_{\widetilde{b}} \; (Y - \widetilde{X}\widetilde{b})'(Y - \widetilde{X}\widetilde{b}) + \widetilde{b}'\widetilde{b} \qquad (11)$

with $Xb = \widetilde{X}\widetilde{b}$ and $b'Db = \widetilde{b}'\widetilde{b}$.

Equation (11) is then in the standard ridge regression form, and has the closed-form solution

$\widehat{\widetilde{b}} = \left(\widetilde{X}'\widetilde{X} + I\right)^{-1}\widetilde{X}'Y \qquad (12)$
Mapping back to $X$ and $b$, we have

$\widehat{\beta}_D = D^{-1/2}\,\widehat{\widetilde{b}} = D^{-1/2}\left(\widetilde{X}'\widetilde{X} + I\right)^{-1}\widetilde{X}'Y \qquad (13)$

Substituting $\widetilde{X} = X D^{-1/2}$ in (13), and using the identity $\widetilde{X}' = D^{-1/2}X'$, the above simplifies to

$\widehat{\beta}_D = D^{-1/2}\left(D^{-1/2}X'X D^{-1/2} + I\right)^{-1} D^{-1/2} X' Y \qquad (14)$

and then, with $D^{-1/2}\left(D^{-1/2}X'X D^{-1/2} + I\right)^{-1} D^{-1/2} = \left(X'X + D\right)^{-1}$, we have

$\widehat{\beta}_D = \left(X'X + D\right)^{-1} X' Y \qquad (15)$
We originally assumed $d_i > 0$ for all $i$, but since (15) is continuous in the $d_i$, we can take the limit as $d_i \to 0$, showing that 0 values are valid. We now drop that assumption.

In other words, we can achieve our goal of deweighting only in $C$, in fact with different weightings within $C$ if we wish, via a simple generalization of ridge regression. (In a different context, see also [24].) We can assign a different value $d_i$ to each feature in $C$, according to our assessment (whether it be ad hoc, based on formal correlation, etc.) of its impact on $S$, and we set $d_i = 0$ for features not in $C$.

Note again that the diagonal matrix $D$ is a hyperparameter, a set of quantities controlling the relative weights we wish to place within $C$; our deweighting parameters are the $d_i$.
Applying the well-known duality between penalized and constrained optimization, minimizing (10) is equivalent to choosing $b$ as

$\arg\min_b \; (Y - Xb)'(Y - Xb) \;\;\text{subject to}\;\; b'Db \le \eta \qquad (16)$

where $\eta$ is also a hyperparameter.
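A minimal R implementation of (15) follows; this is a sketch, not the EDFfair code, and all names are hypothetical:

```r
# Generalized-ridge EDF estimate, eq. (15): (X'X + D)^{-1} X'Y, with centering.
# x: numeric (dummy-coded) feature matrix; y: numeric outcome;
# d: per-feature deweighting values, 0 for features outside C.
edfLinear <- function(x, y, d) {
  x <- as.matrix(x)
  xm <- colMeans(x); ym <- mean(y)
  xc <- scale(x, center = xm, scale = FALSE)                  # center features
  D <- diag(d, ncol(xc))                                      # the diagonal matrix D
  beta <- drop(solve(crossprod(xc) + D, crossprod(xc, y - ym)))
  list(beta = beta, intercept = ym - sum(xm * beta))
}
```

With all $d_i = 0$ this reduces to ordinary least squares.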
5.2 Computation
One computational trick in ridge regression is to add certain artificial data points, as follows.
Specifically, add $p$ artificial rows to the feature design matrix, which in our case means

$X \rightarrow \begin{pmatrix} X \\ D^{1/2} \end{pmatrix} \qquad (17)$

and add $p$ 0s to $Y$:

$Y \rightarrow \begin{pmatrix} Y \\ 0 \end{pmatrix} \qquad (18)$

Writing $X_{aug}$ and $Y_{aug}$ for the augmented quantities, then compute the linear model coefficients as usual,

$\widehat{\beta}_D = \left(X_{aug}'X_{aug}\right)^{-1} X_{aug}'Y_{aug} \qquad (19)$

Since

$X_{aug}'X_{aug} = X'X + D \qquad (20)$

and

$X_{aug}'Y_{aug} = X'Y \qquad (21)$

this reproduces (15).
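The trick in R, as a sketch (again with hypothetical names; x and y assumed centered):

```r
# Data-augmentation version of (17)-(19), reusing an ordinary least-squares routine.
edfLinearAug <- function(x, y, d) {
  x <- as.matrix(x)
  xAug <- rbind(x, diag(sqrt(d), ncol(x)))   # append the rows of D^{1/2}, eq. (17)
  yAug <- c(y, rep(0, ncol(x)))              # append 0s, eq. (18)
  lm.fit(xAug, yAug)$coefficients            # OLS on the augmented data, eq. (19)
}
```

The result matches the closed form (15) computed directly.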
5.3 Generalized Linear Case
In the case of the generalized linear model, here taken to be logistic, the situation is more complicated.
Regularization in generalized linear models is often implemented by penalizing the norm of the coefficient vector when maximizing the likelihood function [10]. Also, [15] uses conjugate gradient to solve for the coefficient vector. In the EDF setting, such approaches might be taken in conjunction with a penalty term involving $D$ as above.
As a simpler approach, one might add artificial rows to the design matrix, as in Section 5.2. We are currently investigating the efficacy of this ad hoc approach.
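As an illustration of the penalized-likelihood route, here is a minimal sketch (not what EDFfair currently does; names hypothetical) of logistic regression with a per-feature quadratic penalty $b'Db$, fit with base R's optim():

```r
# Logistic EDF sketch: minimize the negative log-likelihood plus 0.5 * b'Db.
# x: numeric feature matrix (no intercept column); y: 0/1 outcome; d: deweighting values.
edfLogit <- function(x, y, d) {
  x1 <- cbind(1, as.matrix(x))     # prepend an intercept column
  dFull <- c(0, d)                 # the intercept is not penalized
  negPenLogLik <- function(b) {
    eta <- drop(x1 %*% b)
    -sum(y * eta - log(1 + exp(eta))) + 0.5 * sum(dFull * b^2)
  }
  optim(rep(0, ncol(x1)), negPenLogLik, method = "BFGS")$par
}
```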
6 Issues Involving C
6.1 Choosing C
Which features should go into the deweighting set $C$? These are features that predict $S$ well. The analyst may rely on domain expertise here, or use any of the plethora of feature selection methods that have been developed in the ML field. In some of the experiments reported later, we turned to FOCI, a model-free method for choosing features to predict a specified outcome variable [2].
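A hedged sketch of the kind of FOCI call involved; the foci() interface shown here (target vector first, candidate feature matrix second) is our recollection of the package and should be checked against its documentation:

```r
# Hypothetical sketch: ask FOCI which features in X best predict the sensitive variable S.
library(FOCI)
# s: numeric or 0/1 sensitive variable; x: data frame of candidate features (S excluded)
fociOut <- foci(s, as.matrix(x))  # model-free feature selection for predicting s
print(fociOut)                    # selected variables, in order of importance
```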
6.2 Choosing the Deweighting Parameters
The deweighting parameters can be chosen by cross-validation, finding the best predictive ability for $Y$ at a given desired level of the fairness measure $\rho^2\left(\widetilde{Y}, \widetilde{S}\right)$. But if $C$ consists of more than one variable, each with a different deweighting parameter, we must do a grid search. Our software standardizes the features, which puts the various deweighting parameters on the same scale, so the analyst may opt to use the same parameter value for each variable in $C$; the search then becomes one-dimensional. Note that even in that case, EDF differs from other shrinkage approaches to fair ML, since there is no shrinkage associated with features outside $C$.
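A minimal sketch of such a one-dimensional search, built on the hypothetical edfLinear() and fairnessRho2() helpers sketched earlier and a single random holdout split:

```r
# Grid search over a common deweighting value for all columns of C.
# x: data frame/matrix of numeric (dummy-coded) features; y: numeric outcome;
# s: 0/1 sensitive variable; cIdx: column indices of C; grid: candidate d values.
chooseDeweight <- function(x, y, s, cIdx, grid, holdoutSize = 1000) {
  x <- as.data.frame(x)
  idx <- sample(nrow(x), holdoutSize)                 # holdout rows
  t(sapply(grid, function(dval) {
    d <- rep(0, ncol(x)); d[cIdx] <- dval             # deweight only the C columns
    fit <- edfLinear(x[-idx, ], y[-idx], d)
    yhat <- drop(as.matrix(x[idx, ]) %*% fit$beta) + fit$intercept
    c(deweight = dval,
      MAPE = mean(abs(y[idx] - yhat)),                # mean absolute prediction error
      rho2 = fairnessRho2(yhat, s[idx], x[idx, cIdx, drop = FALSE]))
  }))
}
```

The analyst would then pick the largest deweighting value whose estimated $\rho^2$ meets the desired fairness level.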
Our examples below are only exploratory, illustrating the use of EDF for a few values of the deweighting parameters; no attempt is made to optimize.
7 Derivative Benefits from Use of a Ridge Variant
It has long been known that ridge regression, although originally developed to improve stability of estimated linear regression coefficients, also tends to bring improved prediction accuracy [9]. Thus, while in fair ML methods we are prepared to deal with a tradeoff between accuracy and fairness, with EDF we may actually achieve improvements in both aspects simultaneously.
8 Empirical Investigation of Linear EDF
The software used here is our EDFfair package [18].
8.1 Census Dataset
Let us start with a dataset derived from the 2000 US census, consisting of salaries for programmers and engineers in California's Silicon Valley. (This is the pef data in the R regtools package.) Census data is often used to investigate biases of various kinds. Let's see if we can predict income without gender bias.

Even though the dataset is restricted to programmers and engineers, it's well known that there is quite a bit of variation among the different job types within that tech field. One might suspect that this variation is gendered, and indeed it is. Here is the gender breakdown across the six occupation codes:
gender | occ 100 | occ 101 | occ 102 | occ 106 | occ 140 | occ 141 |
---|---|---|---|---|---|---|
men | 0.2017 | 0.2203 | 0.3434 | 0.0192 | 0.0444 | 0.1709 |
women | 0.3117 | 0.2349 | 0.3274 | 0.0426 | 0.0259 | 0.0575 |
Men and women do seem to work in substantially different occupations. A man was 3 times as likely as a woman to be in occupation 141, for instance.
Moreover, those occupations vary widely in mean wages:
occ 100 | occ 101 | occ 102 | occ 106 | occ 140 | occ 141 |
50396.47 | 51373.53 | 68797.72 | 53639.86 | 67019.26 | 69494.44 |
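These tables can be generated along the following lines; the column names occ, sex and wageinc are our recollection of the pef data in regtools and should be checked:

```r
# Sketch: gender breakdown across occupations, and mean wage by occupation.
library(regtools)
data(pef)                                          # 2000 census, Silicon Valley programmers/engineers
prop.table(table(pef$sex, pef$occ), margin = 1)    # row proportions, as in the first table
tapply(pef$wageinc, pef$occ, mean)                 # mean wage by occupation, second table
```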
So, the occupation variable is a prime candidate for inclusion in $C$.

Thus $Y$ will be income, $S$ will be gender, and we'll take $C$ to be occupation. Here are the Mean Absolute Prediction Errors (MAPE) and squared correlations between predicted income and gender, based on the results of 500 runs, i.e. 500 random holdout sets. (The default holdout size of 1000 was used; the dataset has 20090 rows.)

deweighting $d$ | MAPE | $\widehat{\rho}^2\left(\widetilde{Y}, \widetilde{S}\right)$
---|---|---
1.00 | 25631.21 | 0.22 |
4.00 | 25586.08 | 0.22 |
25.00 | 25548.24 | 0.22 |
625.00 | 25523.83 | 0.21 |
5625.00 | 25576.15 | 0.18 |
15625.00 | 25671.33 | 0.15 |
Now, remember the goal here. We know that occupation is related to gender, and want to avoid using the latter, so we hope to use occupation only “somewhat” in predicting income, in the spirit of the Fairness-Utility Tradeoff. Did we achieve that goal here?
We see the anticipated effect for occupation: the squared correlation between gender and predicted income decreases as $d$ increases, with only a slight deterioration in predictive accuracy in the last settings, and with predictive accuracy actually improving in the first few settings, as anticipated in Section 7. Further reductions in $\widehat{\rho}^2$ might be possible by expanding $C$ to include more variables.
8.2 Empirical Comparison to Scutari et al
The authors of [20] have made available software for their method (and others) in their fairml package. Below are the results of running their function frrm() on the census data. (The package also contains implementations of the methods of [17] and [25], but since [20] found their method to work better, we do not compare to the other two.) The ‘unfairness’ parameter, which takes values in [0,1], corresponds to their ridge restriction; smaller values place a more stringent restriction on $\alpha$ in (4), i.e. are fairer. We set their parameter to 0.2.

unfairness | MAPE | $\widehat{\rho}^2\left(\widetilde{Y}, \widetilde{S}\right)$
---|---|---
0.35 | 25538 | 0.2200 |
0.30 | 25483 | 0.2193 |
0.25 | 25353 | 0.2194 |
0.20 | 25322 | 0.2198 |
0.15 | 25484 | 0.2203 |
0.10 | 25378 | 0.2193 |
0.05 | 25339 | 0.2140 |
The mean absolute prediction errors here are somewhat smaller than those of EDF. But the squared correlations between predicted income and the estimated probabilities of being female (the sensitive attribute) are larger than those of EDF. In fact, the correlation seemed to be essentially constant: no value of the fairness parameter succeeded in bringing the squared correlation below 0.20. Further investigation is needed.
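For reference, here is a sketch of the kind of call used above. The argument names (response, predictors, sensitive, unfairness) reflect our reading of the fairml documentation and should be verified; the predictor selection shown is hypothetical.

```r
# Hypothetical sketch of a single frrm() fit on the census data.
library(fairml)
library(regtools)
data(pef)
fit <- frrm(response = pef$wageinc,
            predictors = pef[, c("age", "educ", "occ", "wkswrkd")],  # hypothetical choice
            sensitive = pef[, "sex", drop = FALSE],
            unfairness = 0.20)
summary(fit)
```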
9 Other ML Algorithms
The key point, then, is that regularization places bounds on the estimated linear regression coefficients (of course this is true also for $L_1$ penalties and so on), but for us the bounds are placed only on the coefficients corresponding to the variables in $C$. The goal, as noted, is to have a “slider” one can use in the tradeoff between utility and fairness: smaller bounds move the slider more in the direction of fairness, while larger ones focus more on utility. That leads to our general approach to fairness:

For any ML method, impose fairness by restricting the amount of influence the variables in $C$ have on the predicted $Y$.
9.1 Approaches to Deweighting
This is easy to do for linear models and several other standard ML methods:

• Random forests: In considering whether to split a tree node, set the probability of choosing variables in $C$ lower than for the other variables in $X$. The R package ranger [23] sets variable weights in this manner (in general, of course, not specific to our fair-ML context); see the sketch below.

• k-nearest neighbors (k-NN): In defining the distance metric, place smaller weight on the coordinates corresponding to $C$, as allowed in [8] (again, not specific to the fair-ML context).

• Support vector machines: Apply a norm constraint on the portion of the vector of hyperplane coefficients corresponding to $C$. This could be implemented using, for instance, the weighted-feature SVM method of [22] (once again, a feature not originally motivated by fairness).
As noted, in each of the above cases, the deweighting of features has traditionally been aimed at improving prediction accuracy. Here we repurpose the deweighting for improving fairness in ML.
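For the random-forest bullet above, a minimal sketch; split.select.weights is, to our understanding, the ranger argument that sets per-variable selection probabilities, and the data and variable names here are hypothetical:

```r
# Sketch: random forest with deweighted split selection for the C variables.
library(ranger)
# dat: data frame with outcome y and features; cNames: names of the columns in C
predNames <- setdiff(names(dat), "y")
deweight <- 0.2                                     # 1.0 would mean no deweighting
w <- ifelse(predNames %in% cNames, deweight, 1.0)   # selection weight per predictor
rf <- ranger(y ~ ., data = dat, num.trees = 500, min.node.size = 10,
             split.select.weights = w)              # assumed to follow predictor order
```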
9.2 Empirical Investigation
9.2.1 The COMPAS Dataset
This dataset has played a major role in calling attention to the need for fair ML. (The version used here is the one included in the fairml package.) This is a highly complex issue; see [3] for the many nuances, legal and methodological. The outcome variable is an indicator for recidivism within two years.
As with other studies, $S$ will be race, but the choice of $C$ is more challenging in this more complex dataset.

In predicting the race African-American, of special interest in the COMPAS controversy, the relevant variables were found by FOCI, in order of importance. The decile_score variable is of particular interest, as it has been at the center of the COMPAS controversy [3]. There are indications that it may be racially biased, which on the one hand suggests using it in $C$ but on the other hand suggests using a small deweighting parameter. In this example, we chose to use decile_score, gender, priors_count and age.
Here we used a random forests analysis, with a common deweighting factor applied to these variables (1.0 means no reduction in importance of a variable), and the default values of 500 and 10 for the number of trees and minimum node size, respectively.

Below are the results, showing the overall probability of misclassification (OPM) and the fairness measure. There were 100 replications for each line in the table.
deweight factor | OPM | $\widehat{\rho}^2\left(\widetilde{Y}, \widetilde{S}\right)$
---|---|---
1.00 | 0.2150 | 0.3136 |
0.80 | 0.2109 | 0.3029 |
0.60 | 0.2118 | 0.3028 |
0.40 | 0.2116 | 0.2870 |
0.20 | 0.2126 | 0.2770 |
0.10 | 0.2187 | 0.2488 |
0.05 | 0.2126 | 0.2484 |
0.01 | 0.2203 | 0.2210 |
Deweighting did reduce the correlation between predicted $Y$ and $S$, with essentially no compromise in predictive ability. Interestingly, there seems to be something of a “phase change,” i.e. a sharp drop, between the values 0.20 and 0.10.

The $\widehat{\rho}^2$ values are still somewhat high, and expansion of $C$ may be helpful.
9.2.2 The Mortgage Denial Dataset
This dataset, from the SortedEffects package, consists of data on mortgage applications in Boston in 1990. We predict whether a mortgage application is denied, based on a list of credit-related variables. Here we take the indicator for Black applicant to be the sensitive variable $S$; the outcome $Y$ is an indicator variable for mortgage application denial.

Again, choosing $C$ is challenging in this complex dataset, so we ran FOCI to find the variables relevant to predicting the indicator for Black applicants, in order of importance. In this case, after some experimentation, we chose to use just loan_val as the $C$ variable.

Here we used k-nearest neighbors (k-NN), with the default value of 25 for $k$. Again we apply a deweighting factor to the $C$ variable (1.0 means no reduction in importance of the variable).
deweight factor | OPM | $\widehat{\rho}^2\left(\widetilde{Y}, \widetilde{S}\right)$
---|---|---
1.00 | 0.0953 | 0.0597 |
0.80 | 0.0977 | 0.0621 |
0.60 | 0.1012 | 0.0597 |
0.40 | 0.1006 | 0.0563 |
0.20 | 0.1189 | 0.0305 |
0.10 | 0.2461 | 0.0373 |
0.05 | 0.2820 | 0.0267 |
0.01 | 0.3249 | 0.0238 |
We can see that deweighting reduces the correlation between $S$ and the predicted $Y$, with little impact on predictive accuracy up through a deweighting factor of 0.20. Again, there appear to be phase changes in both columns.
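The k-NN deweighting can be sketched by simply rescaling the $C$ coordinates before computing distances, which amounts to a weighted distance metric. A sketch under that assumption (not the EDFfair implementation), using the FNN package for the neighbor search:

```r
# Sketch: k-NN with deweighted C coordinates, via feature scaling.
# xTrain, xTest: standardized numeric feature matrices; yTrain: 0/1 outcome;
# cIdx: column indices of C; deweight in (0, 1].
knnDeweighted <- function(xTrain, yTrain, xTest, cIdx, deweight, k = 25) {
  xTrain[, cIdx] <- xTrain[, cIdx] * deweight            # shrink C's role in the distance
  xTest[, cIdx]  <- xTest[, cIdx] * deweight
  nbrs <- FNN::get.knnx(xTrain, xTest, k = k)$nn.index   # indices of the k nearest neighbors
  rowMeans(matrix(yTrain[nbrs], nrow = nrow(xTest)))     # estimated P(Y = 1 | X)
}
```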
Next we also compute the correlations between the fitted values of the full model (i.e. including $S$) and those of the EDF k-NN model with deweighted variable $C$:
deweight factor | black.0 | black.1 |
---|---|---|
1.0 | 0.7865 | 0.7751 |
0.8 | 0.7824 | 0.7756 |
0.6 | 0.7754 | 0.7681 |
0.2 | 0.7363 | 0.7793 |
Here we can see that overall the correlations for both levels (black.0 and black.1) are high; as we deweight more, the correlations decrease a little. Thus we can say that $Y$ is not strongly related to the sensitive variable Black itself, while $C$ has relatively substantial predictive power for $Y$; in other words, loan_val is an appropriate proxy for the sensitive variable Black.
We also applied a random forests analysis to the mortgage dataset, with the $C$ variable again being loan_val. Below are the results:
deweight factor | OPM | $\widehat{\rho}^2\left(\widetilde{Y}, \widetilde{S}\right)$
---|---|---
1.0 | 0.0943 | 0.06953 |
0.8 | 0.0882 | 0.0602 |
0.6 | 0.0910 | 0.0523 |
0.3 | 0.0924 | 0.0434 |
We have similar results for the random forests analysis: a decrease in correlation as we deweight the variable, and even a somewhat higher overall accuracy than with k-NN. In future analyses, one might apply deweighting with various ML methods to see which best suits the dataset. And again, using more features in $C$ would likely be helpful.
10 Conclusions and Future Work
In the examples here, and others not included here, we have found that EDF can be an effective tool for analysts hoping to achieve a desired point in the Utility/Fairness tradeoff spectrum. More types of data need to be explored, especially those in which the set $C$ is large.

Ridge regression, feature weighting in random forests and k-NN, and so on, were originally developed for the purpose of improved prediction and estimation. We have repurposed them here in a novel setting, seeking fair ML.
We have also introduced a new approach to evaluating fairness, involving correlations of probabilities. It will be interesting to see how the various fair ML methods fare under this criterion.
As noted, R software for our methods is publicly available.
11 Author Contributions
NM conceived of the problem, derived the mathematical expressions, and wrote the initial version of the software. WZ joined the project midway, but made significant contributions to the experimental design, software and analysis of relations to previous literature.
References
- [1] 70 FEP cases 921, 7th cir. (1996).
- [2] Azadkia, M., and Chatterjee, S. A simple measure of conditional dependence, 2021.
- [3] Bao, M., Zhou, A., Zottola, S., Brubach, B., Desmarais, S., Horowitz, A., Lum, K., and Venkatasubramanian, S. It’s compaslicated: The messy relationship between rai datasets and algorithmic fairness benchmarks, 2021.
- [4] Barocas, S., Hardt, M., and Narayanan, A. Fairness and machine learning, 2021.
- [5] Bechavod, Y., and Ligett, K. Penalizing unfairness in binary classification, 2017.
- [6] Besse, P., del Barrio, E., Gordaliza, P., Loubes, J.-M., and Risser, L. A survey of bias in machine learning through the prism of statistical parity. The American Statistician 76, 2 (2022), 188–198.
- [7] Celis, L. E., Huang, L., Keswani, V., and Vishnoi, N. K. Classification with fairness constraints: A meta-algorithm with provable guarantees, 2018.
- [8] Elizabeth Yancey, R., Xin, B., and Matloff, N. Modernizing k-nearest neighbors. Stat 10, 1 (2021), e335.
- [9] Frank, L. E., and Friedman, J. H. A statistical view of some chemometrics regression tools. Technometrics 35, 2 (1993), 109–135.
- [10] Friedman, J., Hastie, T., Tibshirani, R., Narasimhan, B., Tay, K., Simon, N., Qian, J., and Yang, J. glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models.
- [11] Gray, M. W. The struggle for equal pay, the lament of a female statistician. CHANCE 35, 2 (2022), 29–31.
- [12] Hastie, T., Tibshirani, R., and Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations. Taylor & Francis, 2015.
- [13] Huang, Y., Li, W., Macheret, F., Gabriel, R. A., and Ohno-Machado, L. A tutorial on calibration measurements and calibration models for clinical prediction models. Journal of the American Medical Informatics Association 27, 4 (2020), 621–633.
- [14] Johndrow, J. E., and Lum, K. An algorithm for removing sensitive information: Application to race-independent recidivism prediction. The Annals of Applied Statistics 13, 1 (2019), 189 – 220.
- [15] Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In 2011 IEEE 11th International Conference on Data Mining Workshops (2011), pp. 643–650.
- [16] Klenke, A. Probability Theory: A Comprehensive Course. World Publishing Corporation, 2012.
- [17] Komiyama, J., Takeda, A., Honda, J., and Shimao, H. Nonconvex optimization for regression with fairness constraints. In Proceedings of the 35th International Conference on Machine Learning (10–15 Jul 2018), J. Dy and A. Krause, Eds., vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 2737–2746.
- [18] Matloff, N., and Zhang, W. EDFfair, github.com/matloff/EDFfair.
- [19] Ross, S. Introduction to Probability Models. Elsevier Science, 2007.
- [20] Scutari, M., Panero, F., and Proissl, M. Achieving fairness with a simple ridge penalty, 2021.
- [21] SHRM. What are disparate impact and disparate treatment?
- [22] Wang, K., Wang, X., and Zhong, Y. A weighted feature support vector machines method for semantic image classification. In 2010 International Conference on Measuring Technology and Mechatronics Automation (2010), pp. 377–380.
- [23] Wright, M. N., and Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77, 1 (2017), 1–17.
- [24] Wu, D., and Xu, J. On the optimal weighted regularization in overparameterized linear regression, 2020.
- [25] Zafar, M. B., Valera, I., Gomez-Rodriguez, M., and Gummadi, K. P. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research 20, 75 (2019), 1–42.
- [26] Zhao, H., and Gordon, G. J. Inherent tradeoffs in learning fair representation. CoRR abs/1906.08386 (2019).