Fair Representation Clustering with Several Protected Classes
Abstract
We study the problem of fair $k$-median where each cluster is required to have a fair representation of individuals from different groups. In the fair representation $k$-median problem, we are given a set of points $X$ in a metric space. Each point $x\in X$ belongs to one of $\ell$ groups. Further, we are given fair representation parameters $\alpha_j$ and $\beta_j$ for each group $j\in[\ell]$. We say that a $k$-clustering $C_1,\dots,C_k$ fairly represents all groups if the number of points from group $j$ in cluster $C_i$ is between $\beta_j\,|C_i|$ and $\alpha_j\,|C_i|$ for every $j\in[\ell]$ and $i\in[k]$. The goal is to find a set $\mathcal{C}$ of $k$ centers and an assignment $\phi:X\to\mathcal{C}$ such that the clustering defined by $(\mathcal{C},\phi)$ fairly represents all groups and minimizes the $\ell_1$-objective $\sum_{x\in X} d(x,\phi(x))$.
We present an $O(\log k)$-approximation algorithm that runs in time $n^{O(\ell)}$. Note that the known algorithms for the problem either (i) violate the fairness constraints by an additive term or (ii) run in time that is exponential in both $k$ and $\ell$. We also consider an important special case of the problem where $\alpha_j=\beta_j=\frac{|P_j|}{|X|}$ for all $j\in[\ell]$. For this special case, we present an $O(\log k)$-approximation algorithm that runs in $(kf)^{O(\ell)}\,\mathrm{poly}(n)$ time, where $f$ is the fairlet size.
1 Introduction
Algorithmic decision making is widely used for high-stakes decisions like college admissions (Marcinkowski et al., 2020) and criminal justice (Chouldechova, 2017; Kleinberg et al., 2018). While automated decision-making processes are often very efficient, there are serious concerns about their fairness. Consequently, in recent years, there has been an extensive line of research on fairness of algorithms and machine learning approaches (Chouldechova and Roth, 2020; Kearns and Roth, 2019; Kleinberg et al., 2017).
In this paper, we study the “fair representation” clustering problem proposed in the seminal work of Chierichetti et al. (2017). The notion, which is motivated by the concept of disparate impact (Feldman et al., 2015), requires that each protected class has an approximately equal representation in each cluster. In many scenarios, a different set of benefits is associated with each cluster of points output by the algorithm. Then, it is desirable that different groups of individuals (e.g., men or women) receive the benefits associated with each of the clusters (e.g., mortgage options) in similar proportions. Further, clustering is often used for feature engineering. In this case, we need to ensure that the generated features are fair; that is, they neither introduce new biases nor amplify existing biases in the data set. Now, we formally define the notion of representation fairness for clustering.
Definition 1.1 (fair representation clustering).
Given a set of points $X$ that come from $\ell$ different groups $P_1,\dots,P_\ell$, a $k$-clustering $C_1,\dots,C_k$ of $X$ is fair with respect to the fairness requirement specified by $(\alpha_j,\beta_j)_{j\in[\ell]}$ if
$\beta_j\,|C_i|\;\le\;|C_i\cap P_j|\;\le\;\alpha_j\,|C_i|$ for all $i\in[k]$ and $j\in[\ell]$.   (1)
In fair $k$-median with fairness requirement $(\alpha_j,\beta_j)_{j\in[\ell]}$, the goal is to find $k$ clusters $C_1,\dots,C_k$ and $k$ centers $c_1,\dots,c_k$ (one center for each cluster) so that the clustering is fair with respect to the fairness requirement and the $\ell_1$-objective $\sum_{i=1}^{k}\sum_{x\in C_i} d(x,c_i)$ is minimized. We will say that points in $C_i$ are assigned to center $c_i$. We let $\phi$ be the assignment function that maps each point $x$ to the center it is assigned to. To specify a solution, it is sufficient to provide the set of centers $\{c_1,\dots,c_k\}$ and $\phi$.
Bera et al. (2019) and Bercea et al. (2019) independently introduced this notion of fairness, which generalizes the notions studied by Chierichetti et al. (2017); Schmidt et al. (2019); Backurs et al. (2019); Ahmadian et al. (2019). Bera et al. presented a constant factor approximation algorithm for fair representation clustering with the general $\ell_p$-objective. However, their algorithm returns a clustering that satisfies the fairness requirements with some additive error. When the maximum number of groups/classes to which a point may belong is $\Delta$, the additive error/violation is at most $4\Delta+3$; in the most common case of $\Delta=1$, the additive violation is at most $7$. Bercea et al. also gave constant factor approximation algorithms for a variety of clustering objectives (including $k$-median and $k$-means) that violate the fairness requirement only by a small additive value. More recently, Bandyapadhyay et al. (2021) designed algorithms that compute a constant factor approximation for fair $k$-median and $k$-means and do not violate the fairness constraints; however, they run in time $f(k,\ell)\cdot\mathrm{poly}(n)$ for a function $f$ that is exponential in both $k$ and $\ell$. However, one of the main questions in the area of fair clustering still remains open:
What is the best polynomial-time approximation algorithm for representation fair clustering?
We also study exact fair representation clustering, which is an important special case of the problem.
Definition 1.2 (exact fairness).
Assume that we are given a set of points $X$ that come from $\ell$ disjoint groups: $P_1,\dots,P_\ell$. A $k$-clustering $C_1,\dots,C_k$ of $X$ is exactly fair if
$\dfrac{|C_i\cap P_j|}{|C_i|}\;=\;\dfrac{|P_j|}{|X|}$ for all $i\in[k]$ and $j\in[\ell]$.   (2)
We define a fairlet as a minimal-size non-empty set of points that is exactly fair. Note that all fairlets have the same size (when $X$ is fixed). Denote this size by $f$. Further, for every $j\in[\ell]$, let $f_j$ denote the number of points from group $j$ in any fairlet.
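To make Definition 1.2 concrete, here is a minimal sketch (in Python, with illustrative identifiers that are ours, not the paper's) that computes the fairlet composition $(f_1,\dots,f_\ell)$ and the fairlet size $f$ from the group sizes: each $f_j$ is $|P_j|$ divided by the greatest common divisor of all group sizes.

```python
# Minimal sketch of Definition 1.2 (identifiers are ours): compute the fairlet
# composition (f_1, ..., f_ell) and the fairlet size f from the group sizes.
from math import gcd
from functools import reduce

def fairlet_shape(group_sizes):
    g = reduce(gcd, group_sizes)
    f_per_group = [s // g for s in group_sizes]
    return f_per_group, sum(f_per_group)

# Example: groups of sizes 60, 30 and 90 give fairlets with 2, 1 and 3 points
# from the respective groups, so the fairlet size is f = 6.
print(fairlet_shape([60, 30, 90]))  # ([2, 1, 3], 6)
```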
This notion was previously studied for (1) $k$-center (Rösner and Schmidt, 2018; Bercea et al., 2019) and (2) $k$-clustering with the $\ell_p$-objective on balanced instances, i.e., instances with $|P_j|=|P_{j'}|$ for all $j$ and $j'$ (Böhm et al., 2020). In all these special cases of exact fairness, the fair clustering problem admits a constant factor approximation.
Other Related Work.
Fair clustering is an active domain of research and by now it has been studied under various standard notions including both group fairness and individual fairness, e.g., (Chierichetti et al., 2017; Bera et al., 2019; Huang et al., 2019; Kleindessner et al., 2019; Jones et al., 2020; Chen et al., 2019; Micha and Shah, 2020; Jung et al., 2020; Mahabadi and Vakilian, 2020; Chakrabarty and Negahbani, 2021; Vakilian and Yalçıner, 2021; Brubach et al., 2020; Kleindessner et al., 2020; Ghadiri et al., 2021; Abbasi et al., 2021; Makarychev and Vakilian, 2021; Chlamtáč et al., 2022; Esmaeili et al., 2020, 2021).
1.1 Our Results
In this paper, we study the fair representation $k$-median problem and give an $O(\log k)$-approximation for it that runs in time $n^{O(\ell)}$. Importantly, we get a polynomial-time algorithm for every fixed $\ell$ – this is the case in most practical settings of interest for fairness applications. In fact, we design an algorithm that can handle arbitrary fairness profiles (see below for the definitions). Further, we design a much faster algorithm for fair $k$-median with exact fairness constraints and small $f$, where $f$ is the size of a fairlet. It runs in time $(kf)^{O(\ell)}\,\mathrm{poly}(n)$. We emphasize that even in the case when $\ell$ is a small constant, all previous results on fair representation clustering with more than two protected groups, notably Bera et al. (2019); Bercea et al. (2019); Bandyapadhyay et al. (2021), either violate the fairness constraints by additive terms or run in time that is exponential in $k$. In this paper, we present the first polynomial-time approximation algorithms for fair representation clustering with multiple protected classes that satisfy the fairness requirement with no additive violations (see Theorem 1.3 and Theorem 1.6).
General Representation Fairness.
Our main algorithm has three steps: location consolidation, approximating the metric by a distribution of tree metrics, and finally solving fair clustering on a tree.
In the first step, we run an existing constant factor approximation algorithm for $k$-median to find a set of $k$ centers. Next, we move each point to its closest center in the constructed solution. In other words, we reduce the initial instance of size $n$ (with possibly $n$ different locations) to an instance of size $n$ with exactly $k$ locations (i.e., we may have multiple data points mapped/moved to the same location).
In the second step, we use the metric embedding technique by Fakcharoenphol et al. (2004) to approximate the reduced instance by a tree metric with expected distortion $O(\log k)$. Like the reduced instance, this instance also has at most $k$ different locations. Additionally, the metric on these locations is a tree metric.
Finally, in the third step, we use dynamic programming (DP) to find a fair assignment of the points located at these $k$ locations to centers. The DP runs in time $n^{O(\ell)}$.
We note that our first step is very similar to the one used by Bera et al. (2019) and Bercea et al. (2019); they first find a not-necessarily-fair clustering of the data points and then reassign the points so as to ensure that the clustering is approximately fair. In the context of $k$-median, the idea of approximating the input metric with a distribution of dominating trees was introduced by Bartal (1998); this approach was recently used by Backurs et al. (2019) in their approximation algorithm for a different variant of fair representation clustering with 2 groups. Our dynamic programming algorithm is novel and very different from the DP algorithms previously used for solving $k$-median on trees (see e.g., (Kariv and Hakimi, 1979; Tamir, 1996; Angelidakis et al., 2017)).
Theorem 1.3.
There exists a randomized $O(\log k)$-approximation algorithm for fair representation $k$-median that runs in $n^{O(\ell)}$ time.
Note that in practice the number of classes $\ell$ is usually a small constant; then our algorithm runs in polynomial time. The problem is interesting even when $\ell$ is a fixed single-digit number.
Remark 1.4.
Our approach works in a more general setting with a set of fair profiles $\mathcal{F}\subseteq\mathbb{Z}_{\ge0}^{\ell}$, where each profile is a vector of length $\ell$. A cluster is fair w.r.t. $\mathcal{F}$ if its profile (the vector of the numbers of points from each group in it) belongs to $\mathcal{F}$. For example, this general notion captures the setting in which we only need to guarantee that in each cluster, a sufficiently large fraction of members belong to one of the disadvantaged groups specified by a set $D\subseteq[\ell]$; in this case, $\mathcal{F}=\{v\in\mathbb{Z}_{\ge0}^{\ell}:\sum_{j\in D}v_j\ge\rho\sum_{j\in[\ell]}v_j\}$ for a threshold $\rho$.
To the best of our knowledge, none of the previous results on fair representation clustering implies an approximation bound for this general “fairness profile” notion.
Given a clustering instance and a set $\mathcal{F}$, our algorithm finds a clustering such that all clusters are fair w.r.t. $\mathcal{F}$. Assuming the existence of a membership oracle for $\mathcal{F}$ that runs in time $T$, the running time and the approximation factor are $n^{O(\ell)}\cdot T$ and $O(\log k)$, respectively, as in Theorem 1.3. In most natural scenarios the membership oracle can be implemented efficiently and the asymptotic runtime of our algorithm remains $n^{O(\ell)}$. For instance, for the aforementioned fairness requirement, which guarantees the presence of disadvantaged groups in each cluster, the membership oracle runs in time $O(\ell)$; hence, the total runtime of our algorithm in this setting is still $n^{O(\ell)}$.
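As an illustration of such a membership oracle, the following hedged sketch (the names and the parameters `disadvantaged` and `rho` are ours, not from the paper) implements the example profile set described above and answers each query in $O(\ell)$ time.

```python
# Hedged example of a profile membership oracle: a cluster profile is "fair"
# if at least a rho-fraction of its points come from the disadvantaged groups.
def make_profile_oracle(disadvantaged, rho):
    def is_fair(profile):              # profile: per-group counts of one cluster
        total = sum(profile)
        favored = sum(profile[j] for j in disadvantaged)
        return total > 0 and favored >= rho * total
    return is_fair

oracle = make_profile_oracle(disadvantaged={0, 2}, rho=0.3)
print(oracle((4, 10, 2)))  # True: 6 of the 16 points are from disadvantaged groups
```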
Remark 1.5.
We describe and analyze our algorithm for the case when each point belongs to exactly one group . However, with a minor modification,our algorithm can also handle the case when a point may belong to multiple groups or do not belong to any group. We introduce a “virtual” group for every such that there exists a point that belongs to groups with and only to them. Note that new virtual groups are disjoint and cover . We define the fairness constraints as follows: for every , . By Remark 1.4, our algorithm can handle these constraints; they are equivalent to the original constraints, since . Note that now the algorithm runs in time where is the number of virtual groups (clearly but may be much smaller than ).
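The construction of virtual groups can be sketched as follows; this is an illustrative implementation under the assumption that each point's memberships are given as a set of group indices (all names are ours).

```python
# Hedged sketch of Remark 1.5: map each distinct membership set to one virtual group.
def virtual_groups(memberships):
    """memberships[p] is the set of original groups that point p belongs to."""
    ids, labels = {}, []
    for groups in memberships:
        key = frozenset(groups)
        if key not in ids:
            ids[key] = len(ids)
        labels.append(ids[key])
    return labels, ids          # per-point virtual group and the dictionary of groups

labels, table = virtual_groups([{0, 1}, {1}, set(), {0, 1}])
print(labels, len(table))       # [0, 1, 2, 0] 3
```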
Exact Representation Fairness.
We significantly improve the running time of our algorithm when the fairness constraints are exact (see Definition 1.2) and each point belongs to exactly one group. We first run the algorithm by Bera et al. (2019) that returns a set of $k$ centers and an assignment of points to these centers that “nearly” satisfies the fairness requirement. Next, we move each point to its assigned center. We prove that there exists an $O(1)$-approximately optimal fair assignment that only moves a set of points $U'$. Lastly, we show that we can find such a set of size $\mathrm{poly}(k,f)$ in polynomial time. Then, loosely speaking, we run our main algorithm on the set $U'$. Since $U'$ has only $\mathrm{poly}(k,f)$ points, the algorithm runs in time $(kf)^{O(\ell)}\,\mathrm{poly}(n)$.
Theorem 1.6.
There exists a randomized $O(\log k)$-approximation algorithm for exactly fair $k$-median that runs in time $(kf)^{O(\ell)}\,\mathrm{poly}(n)$, where $f$ is the fairlet size.
2 Preliminaries
Embedding into a distribution of dominating trees.
Assume that we are given a metric space $(X,d)$. We will consider trees $T$ on $X$ (whose edges have positive lengths) and the shortest-path metrics $d_T$ that they define on $X$. We say that $(X,d)$ is approximated by a probabilistic distribution $\mathcal{T}$ of dominating trees with distortion $D$ if
• every tree metric $d_T$ in the support of $\mathcal{T}$ dominates metric $d$; i.e., $d_T(x,y)\ge d(x,y)$ for all $x,y\in X$;
• $\mathbb{E}_{T\sim\mathcal{T}}[d_T(x,y)]\le D\cdot d(x,y)$ for all $x,y\in X$.
A key component in our algorithms is the following result by Fakcharoenphol et al. (2004) (see also Bartal (1996)).
Theorem 2.1 (Fakcharoenphol et al. (2004)).
Every metric space $(X,d)$ can be approximated by a distribution of dominating trees with distortion at most $O(\log|X|)$. Moreover, we can sample a tree from this distribution in polynomial time.
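The following is a small verification sketch, not the sampling procedure of Theorem 2.1 itself: assuming a sampled tree is given as parent pointers with edge lengths over the same point set, it checks the domination property and reports the stretch of that particular tree (the $O(\log|X|)$ bound of Theorem 2.1 holds only in expectation over the sampled trees). All identifiers are ours.

```python
# Hedged sketch: verify that a given tree metric dominates d and measure its stretch.
import itertools

def tree_distance(parent, length, u, v):
    """Distance between u and v in a rooted tree; parent[root] is None and
    length[x] is the length of the edge from x to its parent."""
    def path_to_root(x):
        dists, acc = {}, 0.0
        while x is not None:
            dists[x] = acc
            acc += length.get(x, 0.0)
            x = parent[x]
        return dists
    du, dv = path_to_root(u), path_to_root(v)
    return min(du[w] + dv[w] for w in du if w in dv)   # meet at the lowest common ancestor

def check_domination(points, d, parent, length):
    worst = 1.0
    for u, v in itertools.combinations(points, 2):
        dt = tree_distance(parent, length, u, v)
        assert dt >= d(u, v) - 1e-9, "a dominating tree must not shrink distances"
        worst = max(worst, dt / d(u, v))
    return worst
```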
3 Algorithms
Now we present an $O(\log k)$-approximation algorithm for fair representation $k$-median that runs in time $n^{O(\ell)}$, where $n$ is the number of points and $\ell$ is the number of groups.
In our algorithms, we are going to transform instances to “simpler” instances by moving points to new locations. It will be helpful to distinguish between data points and their locations. Our algorithms will not change data points and their memberships in groups, but will change their locations. Formally, we think of abstract data points that are mapped to points/locations in a metric space; our algorithms will modify this mapping between data points and their locations. We call this process a reassignment. We denote the set of data points by $U$ and the set of locations by $L$. Initially, $L=U$ and every data point is assigned to its own location, but this will change when we transform the instance. We denote the location of data point $x$ w.r.t. instance $I$ by $\mathrm{loc}_I(x)$.
Now, if two data points at the same location belong to the same group, then they are interchangeable for our algorithm. Thus, instead of storing the actual data points, the algorithm will only store the number of points from each group at every location.
Denote $n=|U|$ and $m=|L|$. For each location $u\in L$ and each group $j\in[\ell]$, let $n_j(u)$ be the number of data points from group $j$ at location $u$. We call the vector $n(u)=(n_1(u),\dots,n_\ell(u))$ the profile of location $u$. We define the profile of a set of data points as a vector in $\mathbb{Z}_{\ge0}^{\ell}$ whose $j$-th coordinate equals the number of data points from group $j$ in the set.
Consider a set of $k$ clusters with centers $c_1,\dots,c_k$. To describe a clustering, we introduce an assignment function $\mu$. For each center $c_i$ and location $u$, $\mu(u,c_i)\in\mathbb{Z}_{\ge0}^{\ell}$ denotes the numbers of points from each group at location $u$ that we assign to $c_i$. Note that the vector $\mathrm{pr}(c_i)=\sum_{u\in L}\mu(u,c_i)$ specifies the number of points from each group that are assigned to center $c_i$. We call $\mathrm{pr}(c_i)$ the profile of center $c_i$. We require that $\sum_{i=1}^{k}\mu(u,c_i)=n(u)$ for every $u\in L$, meaning that each data point at location $u$ belongs to exactly one cluster.
The fairness constraints are defined by a set of fair profiles $\mathcal{F}\subseteq\mathbb{Z}_{\ge0}^{\ell}$. We say that the cluster assigned to center $c_i$ fairly represents all groups if $\mathrm{pr}(c_i)\in\mathcal{F}$. In fair representation clustering with fairness requirement $(\alpha_j,\beta_j)_{j\in[\ell]}$, a profile $v$ belongs to $\mathcal{F}$
if and only if $\beta_j\,\|v\|_1\le v_j\le\alpha_j\,\|v\|_1$ for all $j\in[\ell]$, where $\|v\|_1=\sum_j v_j$ is the size of the cluster.
We restate the objective of fair representation $k$-median as follows:
$\min\;\sum_{i=1}^{k}\sum_{u\in L}\|\mu(u,c_i)\|_1\,d(u,c_i).$   (3)
Note that $\|\mu(u,c_i)\|_1$ is the total number of points at location $u$ assigned to center $c_i$.
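The bookkeeping above is easy to implement; the following hedged sketch (identifiers such as `mu` and `clustering_cost` are ours) shows how location profiles, center profiles, the fairness check and the objective (3) fit together.

```python
# Hedged sketch of the profile bookkeeping used in this section.
def location_profiles(points, ell):
    """points: list of (location, group). Returns {location: per-group count vector}."""
    prof = {}
    for loc, grp in points:
        prof.setdefault(loc, [0] * ell)[grp] += 1
    return prof

def center_profiles(mu, ell):
    """mu[(location, center)]: per-group counts sent from a location to a center."""
    pr = {}
    for (loc, c), vec in mu.items():
        acc = pr.setdefault(c, [0] * ell)
        for j in range(ell):
            acc[j] += vec[j]
    return pr

def clustering_cost(mu, d):
    """Objective (3): every point pays the distance from its location to its center."""
    return sum(sum(vec) * d(loc, c) for (loc, c), vec in mu.items())

def fairly_represents(center_profile, alpha, beta):
    """Definition 1.1 as a predicate on a single center profile."""
    size = sum(center_profile)
    return all(beta[j] * size <= center_profile[j] <= alpha[j] * size
               for j in range(len(center_profile)))
```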
3.1 From a Clustering to a New Instance
Consider an instance $I_0$ and a clustering $\mathcal{A}$ for $I_0$ (which is not necessarily fair). As we discussed above, we can define the clustering by specifying (i) a set of $k$ centers and (ii) either a mapping $\mathcal{A}$ from the set of data points to the centers or, equivalently, an assignment function $\mu$. The cost of the clustering is $\mathrm{cost}_{I_0}(\mathcal{A})=\sum_{x\in U} d(\mathrm{loc}_{I_0}(x),\mathcal{A}(x))$.
Let us move every data point $x$ from its original location in $I_0$ to $\mathcal{A}(x)$. We get a new instance $I_1$. Note that $I_0$ and $I_1$ share the same set of data points. The profile of a center location in $I_1$ is the profile of its cluster (the profile of every other location is the zero vector). Instance $I_1$ has the same set of fairness profiles $\mathcal{F}$ as the original instance $I_0$.
Claim 3.1.
Consider a clustering $\mathcal{B}$ of the data points. Denote its cost w.r.t. instances $I_0$ and $I_1$ by $\mathrm{cost}_{I_0}(\mathcal{B})$ and $\mathrm{cost}_{I_1}(\mathcal{B})$, respectively. Then $|\mathrm{cost}_{I_0}(\mathcal{B})-\mathrm{cost}_{I_1}(\mathcal{B})|\le\mathrm{cost}_{I_0}(\mathcal{A})$.
Proof.
Consider a data point $x$ and the center $c=\mathcal{B}(x)$ that it is assigned to by clustering $\mathcal{B}$. Then, by the triangle inequality, $|d(\mathrm{loc}_{I_0}(x),c)-d(\mathrm{loc}_{I_1}(x),c)|\le d(\mathrm{loc}_{I_0}(x),\mathrm{loc}_{I_1}(x))$. Adding up this inequality over all data points $x$ and noting that $\sum_x d(\mathrm{loc}_{I_0}(x),\mathrm{loc}_{I_1}(x))=\mathrm{cost}_{I_0}(\mathcal{A})$, we get the desired inequality. ∎
3.2 Location Consolidation Step
We run a constant factor approximation algorithm for the standard $k$-median problem on our set of data points, e.g., the algorithm by Charikar et al. (2002) or by Li and Svensson (2016). We get $k$ centers $c_1,\dots,c_k$ and a Voronoi assignment $\mathcal{A}$ of points to the centers.
As described above, we move each data point $x$ to the center $\mathcal{A}(x)$. We denote the original instance by $I_0$ and the obtained instance by $I_1$. We will refer to $I_1$ as the reduced instance.
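For concreteness, here is a hedged sketch of the consolidation step; a simple swap-based local search stands in for the constant-factor algorithms of Charikar et al. (2002) and Li and Svensson (2016) (it is a heuristic placeholder, not those algorithms), and `consolidate` moves every point to its closest center.

```python
# Hedged sketch of Section 3.2: (heuristic) k-median centers + location consolidation.
import random

def kmedian_local_search(points, d, k, iters=200, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    def total_cost(cs):
        return sum(min(d(p, c) for c in cs) for p in points)
    best = total_cost(centers)
    for _ in range(iters):
        i, q = rng.randrange(k), rng.choice(points)   # try swapping one center
        trial = centers[:i] + [q] + centers[i + 1:]
        c = total_cost(trial)
        if c < best:
            centers, best = trial, c
    return centers

def consolidate(points, d, centers):
    """Move every data point to its closest center; this defines the reduced instance."""
    return [min(centers, key=lambda c: d(p, c)) for p in points]
```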
Claim 3.2.
Let $\mathrm{OPT}_0$ and $\mathrm{OPT}_1$ be the costs of optimal fair clusterings for $I_0$ and $I_1$, respectively. Assume that we used a $c$-approximation algorithm for $k$-median in the consolidation step. Further assume that there is a $\gamma$-approximation algorithm for fair representation $k$-median on $I_1$. Then, there is a $((c+1)\gamma+c)$-approximation algorithm for fair representation $k$-median on $I_0$.
Proof.
Observe that the cost of an optimal (not necessarily fair) $k$-median clustering of $I_0$ is at most the cost $\mathrm{OPT}_0$ of an optimal fair $k$-median clustering of $I_0$. Thus the cost of the clustering $\mathcal{A}$ that we use in the reduction is at most $c\cdot\mathrm{OPT}_0$.
Now consider the optimal fair clustering for $I_0$. By Claim 3.1, the cost of this clustering as a clustering of $I_1$ is at most $\mathrm{OPT}_0+c\cdot\mathrm{OPT}_0$. Therefore, $\mathrm{OPT}_1\le(c+1)\,\mathrm{OPT}_0$.
Our approximation algorithm for $I_0$ is very straightforward. We simply run the $\gamma$-approximation algorithm on instance $I_1$ and output the obtained clustering as a clustering of $I_0$. Now we upper bound the cost of this clustering, first w.r.t. instance $I_1$ and then w.r.t. $I_0$. The cost w.r.t. $I_1$ is at most $\gamma\,\mathrm{OPT}_1\le(c+1)\gamma\,\mathrm{OPT}_0$. Consequently, by Claim 3.1, the cost w.r.t. $I_0$ is at most $(c+1)\gamma\,\mathrm{OPT}_0+c\cdot\mathrm{OPT}_0$, as required. ∎
3.3 Embedding into Tree Metrics
Consider the restriction of the metric $d$ to the set of $k$ locations of the reduced instance. We will reduce the problem to the case when the metric is a tree metric by paying a factor of $O(\log k)$ in the approximation guarantee. To this end, we will approximate this metric by a distribution $\mathcal{T}$ of dominating trees with distortion $O(\log k)$ and then solve fair representation $k$-median for a number of trees randomly sampled from $\mathcal{T}$.
Claim 3.3.
Suppose that there exists an $\alpha$-approximation algorithm for fair representation $k$-median on tree metrics. Then, there is an $O(\alpha\log k)$-approximation algorithm for the reduced problem that succeeds with high probability (i.e., with probability at least $1-1/n^{c}$ for any desired constant $c$).
Proof.
Let $\mu^*$ be the optimal fair assignment for the reduced instance and OPT be its cost. Consider an approximation of the metric on the $k$ locations by a distribution $\mathcal{T}$ of dominating trees with distortion $D=O(\log k)$, which exists by Theorem 2.1.
For a tree metric $d_T$, let $\mu_T$ denote the $\alpha$-approximate assignment of points to centers that the algorithm finds on the instance with metric $d_T$. Then,
$\mathbb{E}_T[\mathrm{cost}_d(\mu_T)]\;\le\;\mathbb{E}_T[\mathrm{cost}_{d_T}(\mu_T)]\;\le\;\alpha\,\mathbb{E}_T[\mathrm{cost}_{d_T}(\mu^*)]\;\le\;\alpha D\cdot\mathrm{OPT}.$
The three inequalities above hold, since (i) $d\le d_T$ (always); (ii) $\mu_T$ is an $\alpha$-approximate assignment/solution w.r.t. $d_T$; and (iii) $\mathbb{E}_T[d_T(x,y)]\le D\,d(x,y)$ for every pair of locations $x,y$ (recall that we embed the restriction of $d$ to the $k$ locations).
By Markov’s inequality, the obtained solution approximates the optimal assignment within a factor of $2\alpha D=O(\alpha\log k)$ with probability at least $1/2$. By running the proposed algorithm $O(\log n)$ times and returning the cheapest solution found, we get an $O(\alpha\log k)$-approximate solution w.h.p. ∎
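The amplification step at the end of the proof is simple to spell out; in the sketch below, `sample_tree`, `solve_on_tree` and `cost_in_original_metric` are assumed stand-ins for Theorem 2.1, the DP of Section 3.4 and the objective (3), respectively.

```python
# Hedged sketch of the repetition argument in Claim 3.3: O(log n) independent
# trees, keep the solution that is cheapest in the ORIGINAL metric.
import math

def best_over_trees(instance, trials, sample_tree, solve_on_tree, cost_in_original_metric):
    best_sol, best_cost = None, math.inf
    for _ in range(trials):
        tree = sample_tree(instance)
        sol = solve_on_tree(instance, tree)
        c = cost_in_original_metric(instance, sol)
        if c < best_cost:
            best_sol, best_cost = sol, c
    return best_sol
```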
3.4 Reduced Assignment Problem on Trees
In this section, we assume that $d_T$ is a tree metric on the set of $k$ locations, with a profile vector $n(u)$ for every location $u$. We open a center at every location and our goal now is to find a fair assignment of data points to centers. Recall that our notion of fairness is more general than that in Definition 1.1 (we discussed it in Remark 1.4).
3.4.1 Conversion to Binary Tree
We choose an arbitrary root in the tree. It is convenient now to convert the tree to a binary tree in which every non-leaf vertex has exactly two children. We do that by adding Steiner locations. Namely, we replace each vertex $u$ with children $v_1,\dots,v_t$ (we only do this for vertices with $t\ge3$ children) with a path $u_1,u_2,\dots,u_{t-1}$. The first vertex on the path is $u_1=u$; the other vertices are new Steiner locations. We keep the parent of $u$ unchanged; we let the parent of $u_{s+1}$ be $u_s$ (for $s=1,\dots,t-2$). For $s\le t-2$, we reassign the $s$-th child $v_s$ of $u$ to $u_s$; we reassign the last two children $v_{t-1}$ and $v_t$ to $u_{t-1}$. We set the length of all edges on the path to 0; we let the length of each edge $(u_s,v_s)$ be that of $(u,v_s)$ in the original tree, and the length of $(u_{t-1},v_t)$ be that of $(u,v_t)$ in the original tree. Additionally, we add a Steiner child to vertices that have exactly one child. For a concrete example, consider a tree rooted at vertex 1, in which 1 has four children 2, 3, 4, 5. We transform the tree by adding Steiner nodes 6 and 7 as follows: vertex 1 keeps child 2 and gets the new child 6; vertex 6 gets children 3 and 7; vertex 7 gets children 4 and 5.
Let $N$ be the set of non-Steiner locations and $S$ be the set of Steiner locations. We open a center at every location in $N$; we do not open centers at any Steiner location. For every $u\in S$, let $n(u)=\mathbf{0}$.
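The binarization can be sketched as follows (an illustrative implementation under the assumption that the tree is given as a children map with integer node ids; all names are ours). On the example above it inserts the Steiner nodes 6 and 7.

```python
# Hedged sketch of Section 3.4.1: make every internal node binary by adding Steiner nodes.
def binarize(children, length, root):
    """children: {node: [children]}; length[(parent, child)] = edge length."""
    all_nodes = set(children) | {v for kids in children.values() for v in kids} | {root}
    bchildren, blength, steiner = {}, {}, set()
    fresh = [max(all_nodes) + 1]                 # next unused (Steiner) node id

    def new_steiner():
        s = fresh[0]; fresh[0] += 1; steiner.add(s)
        return s

    def build(u):
        kids = list(children.get(u, []))
        cur = u
        while len(kids) > 2:                     # peel children off onto a 0-length path
            s, v = new_steiner(), kids.pop(0)
            bchildren[cur] = [v, s]
            blength[(cur, v)] = length[(u, v)]
            blength[(cur, s)] = 0.0
            cur = s
        if len(kids) == 1:                       # pad single children with a Steiner leaf
            kids.append(new_steiner())
        bchildren[cur] = kids
        for v in kids:
            blength[(cur, v)] = length.get((u, v), 0.0)
        for v in children.get(u, []):
            build(v)

    build(root)
    return bchildren, blength, steiner

# Example from the text: root 1 with children 2, 3, 4, 5 and unit edge lengths.
bt, bl, st = binarize({1: [2, 3, 4, 5]}, {(1, v): 1.0 for v in (2, 3, 4, 5)}, 1)
print(bt)   # {1: [2, 6], 6: [3, 7], 7: [4, 5], 2: [], 3: [], 4: [], 5: []}
print(st)   # {6, 7}
```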
3.4.2 Dynamic Program
Now we write a dynamic program (DP). We first recall the setup. We are given a binary tree $T$, which contains Steiner and non-Steiner nodes/locations. Each non-Steiner node contains some data points that we want to cluster and each Steiner node contains no data points. Our goal is to move the data points around (assign data points to the non-Steiner nodes) such that the resulting clustering is fair. The objective is to minimize the “assignment cost” $\sum_{u,c}\|\mu(u,c)\|_1\,d_T(u,c)$, where $\|\mu(u,c)\|_1$ is the number of data points at location $u$ that are assigned to $c$.
Let $T_u$ be the subtree of $T$ rooted at $u$. Let $N_u$ and $S_u$ be the sets of non-Steiner and Steiner locations in $T_u$, respectively. For a fixed assignment, data points located at nodes/locations in $T_u$ can be classified into two types: local points and out-points. A local point is assigned to a node/location in $T_u$ and an out-point is assigned to a node/location outside $T_u$. In addition, there are some points outside $T_u$ that are assigned to nodes/locations in $T_u$. We refer to these points as in-points. Now we define two functions $\mathrm{out}$ and $\mathrm{in}$, which specify the out-points and in-points, respectively. Namely, for each location $w\in T_u$, $\mathrm{out}(w)$ is the profile of the out-points located at $w$, and, for each center $c\in N_u$, $\mathrm{in}(c)$ is the profile of the in-points assigned to $c$. Let $it(u)=\sum_{c\in N_u}\mathrm{in}(c)-\sum_{w\in T_u}\mathrm{out}(w)$ be the “net imports” of $T_u$. Clearly, $it(u)\in\{-n,\dots,n\}^{\ell}$.
Now, we create a DP table with entries $v(u,it)$, where $u$ ranges over the nodes of the tree and $it\in\{-n,\dots,n\}^{\ell}$. Loosely speaking, $v(u,it)$ is the cost of the minimum-cost partial solution such that (a) the clusters of all centers in $T_u$ satisfy the fairness constraints (note that each such cluster consists of the points at locations in $T_u$ assigned to its center and the in-points assigned to it) and (b) the difference between the number of in-points and out-points from group $j$ equals $it_j$ for every $j$. The cost of a partial solution comprises (i) the assignment costs for points in $T_u$ assigned to centers in $T_u$, (ii) for each out-point located at $w\in T_u$, the portion of its assignment cost that “lies” inside $T_u$, namely $d_T(w,u)$, and (iii) for each in-point assigned to a center $c\in N_u$, the portion of its assignment cost that “lies” inside $T_u$, namely $d_T(u,c)$.
Formally, $v(u,it)$ is the cost of the optimal solution for the following problem.
Find
• a function $\mu_u:T_u\times N_u\to\mathbb{Z}_{\ge0}^{\ell}$; the map $\mu_u$ specifies the assignment of data points located in subtree $T_u$ to centers in $T_u$. Namely, $\mu_u(w,c)$ is the profile of the set of data points located at $w$ that are assigned to center $c$. Let $\mathrm{pr}_u(c)=\sum_{w\in T_u}\mu_u(w,c)$. Note that $\mathrm{pr}_u(c)$ is the profile of the set of data points at locations in $T_u$ that are assigned to center $c$.
• a function $\mathrm{out}:T_u\to\mathbb{Z}_{\ge0}^{\ell}$; the map $\mathrm{out}$ specifies the set of data points in subtree $T_u$ that are not assigned to any center in $T_u$; these data points will have to be assigned to centers outside of $T_u$ later. Namely, $\mathrm{out}(w)$ is the profile of the currently unassigned data points at location $w$.
• a function $\mathrm{in}:N_u\to\mathbb{Z}_{\ge0}^{\ell}$; the map $\mathrm{in}$ specifies how many data points located outside of the subtree $T_u$ are assigned to centers in $T_u$. Namely, $\mathrm{in}(c)$ is the profile of the data points that lie outside of $T_u$ but are assigned to center $c$.
Such that
• for every $w\in T_u$, $\sum_{c\in N_u}\mu_u(w,c)+\mathrm{out}(w)=n(w)$. This condition says that every data point at location $w$ is assigned to some center in $T_u$ or is currently unassigned.
• $\mathrm{pr}_u(c)+\mathrm{in}(c)\in\mathcal{F}$ for all $c\in N_u$. This condition says that the profile of the cluster centered at $c$ is in $\mathcal{F}$ (that is, the cluster satisfies the fairness constraints). Here, $\mathrm{pr}_u(c)$ counts data points in $T_u$ that are assigned to $c$ and $\mathrm{in}(c)$ counts data points outside of $T_u$ that are assigned to $c$.
• $\sum_{c\in N_u}\mathrm{in}(c)-\sum_{w\in T_u}\mathrm{out}(w)=it$.
Cost
The objective is to minimize the following cost:
$\sum_{w\in T_u}\sum_{c\in N_u}\|\mu_u(w,c)\|_1\,d_T(w,c)\;+\;\sum_{w\in T_u}\|\mathrm{out}(w)\|_1\,d_T(w,u)\;+\;\sum_{c\in N_u}\|\mathrm{in}(c)\|_1\,d_T(u,c).$
If there is no feasible solution, the cost is $+\infty$.
3.4.3 Solving the DP
We fill out the DP table starting from the leaves and going up to the root (bottom-up).
Computing $v(u,it)$ for leaves is straightforward using the definition of $v$. In this case, the subtree rooted at $u$ is a single node. Thus, for any given $it$, the profile of the cluster at $u$ after the assignment is $n(u)+it$.
If $u$ is not a Steiner node, then $v(u,it)=0$ if $n(u)+it\in\mathcal{F}$ and $v(u,it)=+\infty$ otherwise. If $u$ is a Steiner node, then $v(u,it)=0$ if $it=\mathbf{0}$ and $v(u,it)=+\infty$ otherwise.
Now assume that $u$ is not a leaf; since the tree is a binary tree, $u$ has two children $u_1$ and $u_2$. Note that the subtrees $T_{u_1}$ and $T_{u_2}$ will have net imports $it_1$ and $it_2$ for some vectors $it_1,it_2\in\{-n,\dots,n\}^{\ell}$. Then $v(u,it)$ is the minimum over all $it_1,it_2$ satisfying
• if $u$ is not a Steiner node, then $n(u)+it-it_1-it_2\in\mathcal{F}$; that is, the cluster centered at $u$ satisfies the fairness constraints;
• if $u$ is a Steiner node, then $it=it_1+it_2$, since $n(u)=\mathbf{0}$ and no point will be assigned to $u$;
of the following cost
$v(u_1,it_1)+v(u_2,it_2)+\|it_1\|_1\cdot d_T(u,u_1)+\|it_2\|_1\cdot d_T(u,u_2),$   (4)
where $\|it_i\|_1=\sum_{j\in[\ell]}|(it_i)_j|$ accounts for the points that cross the edge $(u,u_i)$.
The pseudo-code of the DP algorithm is provided in Algorithm 1.
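As a complement to Algorithm 1, the following is a heavily simplified sketch of the recurrence for a single group ($\ell=1$), so that the net-import vector $it$ becomes one integer and the fairness constraint becomes a predicate on the cluster size; the actual algorithm keeps an $\ell$-dimensional vector per node and checks membership in $\mathcal{F}$. All identifiers are ours.

```python
# Hedged, simplified sketch of the DP of Section 3.4 for ell = 1.
import math
from functools import lru_cache

def fair_assignment_cost(bchildren, blength, root, count, is_center, fair, n):
    """count[u]: points initially at u; fair(c): is a cluster of size c allowed;
    n: total number of points.  Returns the minimum movement cost (math.inf if infeasible)."""

    @lru_cache(maxsize=None)
    def v(u, it):                           # it = (#in-points) - (#out-points) of subtree T_u
        kids = bchildren.get(u, [])
        if not kids:                        # leaf: no movement cost inside the subtree
            size = count.get(u, 0) + it
            if is_center(u):
                return 0.0 if size >= 0 and fair(size) else math.inf
            return 0.0 if it == 0 else math.inf
        a, b = kids
        best = math.inf
        for it_a in range(-n, n + 1):
            for it_b in range(-n, n + 1):
                here = count.get(u, 0) + it - it_a - it_b   # cluster size at u itself
                if is_center(u):
                    if here < 0 or not fair(here):
                        continue
                elif here != 0:                             # Steiner nodes keep no points
                    continue
                cost = (v(a, it_a) + v(b, it_b)
                        + abs(it_a) * blength[(u, a)]       # points crossing edge (u, a)
                        + abs(it_b) * blength[(u, b)])
                best = min(best, cost)
        return best

    return v(root, 0)
```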
Lemma 3.4.
The described dynamic programming algorithm runs in time $n^{O(\ell)}$ and finds an optimal fair assignment of the data points to the centers at the $k$ locations.
Proof.
It is straightforward to verify that the DP algorithm correctly computes the DP entries. The cost of the optimal fair assignment for our problem is given by $v(r,\mathbf{0})$, where $r$ is the root of the tree and $\mathbf{0}$ is the all-zero vector.
The update rule, which is specified by Eq. (4) and the constraints on $it_1$ and $it_2$, computes $v(u,it)$ correctly in time $(2n+1)^{2\ell}=n^{O(\ell)}$, which accounts for enumerating over all possible values of $it_1$ and $it_2$. Since the DP table contains $O(k)\cdot(2n+1)^{\ell}=n^{O(\ell)}$ cells, the total time to fully compute the table is $n^{O(\ell)}$. Finally, once the whole table is computed, we can recover the assignment itself by traversing the table from $v(r,\mathbf{0})$ in the reverse direction, which takes $n^{O(\ell)}$ time. The total running time is $n^{O(\ell)}$. ∎
Proof of Theorem 1.3.
We first apply the consolidation step and thus reduce the problem to a set of locations of size $k$. Then we approximate the metric on these locations by a distribution of dominating tree metrics, as explained in Section 3.3. We obtain a logarithmic number of instances with tree metrics. We exactly solve each of them using the DP algorithm described above. Finally, we return the best of the clusterings we find. It follows from Claims 3.2 and 3.3 and Lemma 3.4 that the algorithm finds an $O(\log k)$-approximation with high probability in time $n^{O(\ell)}$. ∎
4 Special Case of Exact Fairness
In this section, we design an $O(\log k)$-approximation algorithm for $k$-median with exact fairness constraints. The algorithm runs in time $(kf)^{O(\ell)}\,\mathrm{poly}(n)$ and succeeds with high probability.
To recall, we say that a cluster centered at $c_i$ is exactly fair if its profile $v$ satisfies $v_j=p_j\,\|v\|_1$ for every $j\in[\ell]$, where $p_j=|P_j|/|X|$ is the proportion of data points that belong to group $j$. We say that a clustering is exactly fair if all of its clusters are exactly fair. More generally, a set of data points with profile $v$ is exactly fair if $v_j=p_j\,\|v\|_1$ for all $j\in[\ell]$.
Recall that we defined a fairlet in Definition 1.2 as a minimal-size non-empty set that is exactly fair. Note that all fairlets have the same size and we use $f$ to denote their size. Further, for every $j\in[\ell]$, we use $f_j$ to denote the number of points from group $j$ in any fairlet.
4.1 Reassignment Method
We start with the reduced instance $I_1$ on the set of locations $c_1,\dots,c_k$, which we constructed in the location consolidation step.
Given any subset of data points $S$ with profile vector $v$, we say that $S$ is $\lambda$-approximately fair if
$p_j\,\|v\|_1-\lambda\;\le\;v_j\;\le\;p_j\,\|v\|_1+\lambda$ for all $j\in[\ell]$.   (5)
Given an assignment $\mu$, we say that $\mu$ is a $\lambda$-approximately fair assignment if the cluster of every center $c_i$ is $\lambda$-approximately fair, i.e., its profile $\mathrm{pr}(c_i)$ satisfies Eq. (5).
We use the following result by Bera et al. (2019).
Theorem 4.1.
For any metric space and any set of $k$ centers, there exists a $\lambda$-approximately fair assignment, where $\lambda$ is an absolute constant, whose cost is no more than the cost of the optimal exactly fair assignment to these centers. Moreover, there is an algorithm which finds this assignment in polynomial time.
We start by running the assignment algorithm from Theorem 4.1 on instance $I_1$. Let $\mu'$ denote this new assignment and $\mathrm{pr}'(c_1),\dots,\mathrm{pr}'(c_k)$ be the center profiles corresponding to $\mu'$. As discussed in Section 3.1, every assignment defines a new instance. Let $I_2$ be the instance defined by assignment $\mu'$. Now our goal is to modify $\mu'$ by reassigning only a small number of points so that the obtained assignment is exactly fair. We describe this reassignment procedure below.
Lemma 4.2.
Assume that there is a $\gamma$-approximation algorithm for the exactly fair assignment problem on $I_2$. Then, there is an $O(\gamma)$-approximation algorithm for the problem on $I_1$.
Proof.
The proof is similar to that of Claim 3.2. We run the $\gamma$-approximation algorithm on instance $I_2$ and output the obtained assignment as an assignment for $I_1$. We now upper bound its cost. By Theorem 4.1 and Claim 3.1, $\mathrm{OPT}(I_2)\le\mathrm{OPT}(I_1)+\mathrm{cost}_{I_1}(\mu')\le 2\,\mathrm{OPT}(I_1)$. Hence the cost of the obtained assignment w.r.t. $I_2$ is at most $2\gamma\,\mathrm{OPT}(I_1)$ and, by Claim 3.1, its cost w.r.t. $I_1$ is at most $(2\gamma+1)\,\mathrm{OPT}(I_1)$. ∎
Algorithm for the reassignment problem.
We present an $O(\log k)$-approximation algorithm for the reassignment problem that runs in time $(kf)^{O(\ell)}\,\mathrm{poly}(n)$.
Suppose we are given the instance $I_2$ with locations $c_1,\dots,c_k$, metric $d$ and profile vectors $\mathrm{pr}'(c_1),\dots,\mathrm{pr}'(c_k)$ that satisfy the $\lambda$-approximate fairness requirement of Eq. (5). We will apply our reassignment algorithm to $I_2$. For every $i\in[k]$, let $B_i$ be the set of points assigned to $c_i$ in $I_2$.
Definition 4.3.
For every $i\in[k]$, let $A_i$ be a maximal exactly fair subset of $B_i$. Decompose each $A_i$ into fairlets $F_{i,1},F_{i,2},\dots,F_{i,s_i}$. Let $R=\bigcup_i(B_i\setminus A_i)$ be the set of remaining points. We say that $\{F_{i,t}\}_{i,t}$ is a fairlet decomposition for clustering $B_1,\dots,B_k$. We call the points in $R$ problematic points.
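A fairlet decomposition of a single cluster can be computed greedily, as in the hedged sketch below (identifiers are ours): repeatedly carve out a fairlet with $f_j$ points from each group $j$; whatever remains is the set of problematic points of that cluster.

```python
# Hedged sketch of Definition 4.3 for one cluster.
def fairlet_decomposition(cluster_by_group, fairlet_counts):
    """cluster_by_group[j]: points of group j in the cluster; fairlet_counts[j] = f_j.
    Returns (list of fairlets, problematic points)."""
    num_fairlets = min(len(cluster_by_group[j]) // fairlet_counts[j]
                       for j in range(len(fairlet_counts)))
    fairlets, used = [], [0] * len(fairlet_counts)
    for _ in range(num_fairlets):
        fairlet = []
        for j, f_j in enumerate(fairlet_counts):
            fairlet.extend(cluster_by_group[j][used[j]:used[j] + f_j])
            used[j] += f_j
        fairlets.append(fairlet)
    problematic = [p for j, pts in enumerate(cluster_by_group) for p in pts[used[j]:]]
    return fairlets, problematic
```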
Lemma 4.4.
Consider a fairlet decomposition for $B_1,\dots,B_k$. Then for every $i\in[k]$, $|B_i\setminus A_i|\le(\lambda+1)f$.
Proof.
Consider an arbitrary $i$. We first assume that $A_i=\emptyset$; in other words, the maximal exactly fair subset of $B_i$ is empty. Let $v$ be the profile vector of $B_i$. Since $B_i$ contains no fairlet, there exists a group $j$ such that $v_j\le f_j-1$. Further, since $B_i$ is $\lambda$-approximately fair, $p_j\,\|v\|_1\le v_j+\lambda$, which implies that $p_j\,\|v\|_1\le f_j-1+\lambda$. Dividing both sides of this inequality by $p_j=f_j/f$, we get $\|v\|_1\le\frac{(f_j-1+\lambda)f}{f_j}\le(\lambda+1)f$, since $f_j\ge1$. Thus, $|B_i|\le(\lambda+1)f$.
Now, we drop the condition $A_i=\emptyset$. Let $v'$ be the profile vector of $B_i\setminus A_i$. We show that not only $B_i$ but also $B_i\setminus A_i$ satisfies Eq. (5) with the same $\lambda$. Indeed, for each $j$, the profile of $A_i$ satisfies exact fairness, so $v'_j-p_j\,\|v'\|_1=v_j-p_j\,\|v\|_1$. Thus, for each $j$, $|v'_j-p_j\,\|v'\|_1|\le\lambda$. Moreover, since $A_i$ is maximal, $B_i\setminus A_i$ contains no fairlet. Hence the same argument as for the case $A_i=\emptyset$ shows that $|B_i\setminus A_i|\le(\lambda+1)f$. ∎
Next, we give an outline of our approach.
1. First, we show that there exists an exactly fair assignment for $I_2$ of cost at most $2\,\mathrm{OPT}$, where $\mathrm{OPT}$ is the cost of an optimal exactly fair assignment of $I_2$, that only reassigns $\mathrm{poly}(k,f)$ points.
2. Next, we show that it is possible to find a subset $U_i'\subseteq B_i$ for each $i\in[k]$ such that $|U_i'|=\mathrm{poly}(k,f)$, and there exists a solution of cost at most $2\,\mathrm{OPT}$ that only reassigns points in $U'=\bigcup_i U_i'$.
3. Finally, we apply our dynamic programming approach to reassign the points in $U'$. Since $|U'|=\mathrm{poly}(k,f)$, the DP algorithm runs in $(kf)^{O(\ell)}$ time.
To show (1), we impose an additional assumption on the structure of the desired (re)assignment $\sigma$. Let $B'_1,\dots,B'_k$ be the clustering resulting from an exactly fair assignment $\sigma$ of $I_2$, where $B'_i$ is the cluster of center $c_i$ after the reassignment.
Let $B_1,\dots,B_k$ be the clustering before the (re)assignment. Now, we construct fairlet decompositions for $B_1,\dots,B_k$ and $B'_1,\dots,B'_k$. Let $\mathrm{Fix}$ be the set of data points that are assigned to the same center before and after the (re)assignment $\sigma$. For each $i$, let $\mathrm{Fix}_i=B_i\cap\mathrm{Fix}$ be the set of data points in $B_i$ that are fixed by $\sigma$.
For every $i$, let $A^{\mathrm{fix}}_i$ be a maximal exactly fair subset of $\mathrm{Fix}_i$. We decompose each $A^{\mathrm{fix}}_i$ into fairlets. We call this decomposition $D_0$. Now, we greedily extend $D_0$ to a fairlet decomposition for $B_1,\dots,B_k$. Let $R$ be the set of problematic points in the obtained decomposition. Note that Lemma 4.4 applies to this decomposition and therefore $|R|\le(\lambda+1)kf$.
Similarly, we greedily extend the decomposition $D_0$ to a fairlet decomposition for $B'_1,\dots,B'_k$. We refer to the fairlets that we add to $D_0$ in this step as new fairlets. Let $\mathcal{N}$ be the set of all new fairlets.
Definition 4.5 (restricted assignment).
In the restricted assignment problem, we need to find a minimum-cost assignment $\sigma$ of $I_2$ such that
• $\sigma$ is a valid exactly fair assignment; formally, every cluster $B'_i$ defined by $\sigma$ is exactly fair;
• every new fairlet in a cluster $B'_i$ contains at least one point from the original cluster $B_i$.
Lemma 4.6.
Let $\mathrm{OPT}_{\mathrm{res}}$ and $\mathrm{OPT}$ denote the optimal values of the restricted and vanilla exactly fair assignment problems with input $I_2$, respectively. Then, $\mathrm{OPT}_{\mathrm{res}}\le2\,\mathrm{OPT}$.
Proof.
Let $\sigma^*$ be an optimal exactly fair assignment of $I_2$. Let $B'_1,\dots,B'_k$ be the resulting clusters; the set $B'_i$ consists of the points assigned to $c_i$ by $\sigma^*$. Now, suppose that there exists a new fairlet $F$ in some $B'_i$ ($i\in[k]$) such that $F\cap B_i=\emptyset$. Without loss of generality, assume that the points in $F$ are from clusters $B_{i_1},\dots,B_{i_m}$ for some $m\le f$ with $i_1,\dots,i_m\ne i$. Let $a_t=|F\cap B_{i_t}|$ for every $t\in[m]$ and $d_t=d(c_{i_t},c_i)$. Then, the total cost of forming $F$ is $\sum_{t=1}^m a_t d_t$. Now, for every $t$, compute the cost of moving all points in fairlet $F$ to cluster $B'_{i_t}$ (i.e., reassign all data points in $F$ to center $c_{i_t}$) and denote it by $\mathrm{cost}_t$.
Next, consider the convex combination $\sum_{t=1}^m\frac{a_t}{f}\,\mathrm{cost}_t$. By the triangle inequality, $\mathrm{cost}_t\le\sum_{s=1}^m a_s(d_s+d_t)=\sum_s a_s d_s+f\,d_t$, and hence
$\sum_{t=1}^m\frac{a_t}{f}\,\mathrm{cost}_t\;\le\;\sum_s a_s d_s+\sum_t a_t d_t\;=\;2\sum_t a_t d_t.$
This implies that there exists $t$ with $a_t\ge1$ such that the cost of reassigning $F$ to $c_{i_t}$ is at most twice the current assignment cost of $F$; after this reassignment, $F$ contains a point from $B_{i_t}$, the original cluster of its new center.
We perform this fairlet reassignment procedure on all new fairlets that do not satisfy the restricted assignment property and obtain a new restricted assignment $\sigma'$. Our argument shows that the assignment cost of $\sigma'$ is at most twice the assignment cost of $\sigma^*$; hence, $\mathrm{OPT}_{\mathrm{res}}\le2\,\mathrm{OPT}$.
∎
Next, we show that the optimal restricted assignment is even more structured.
Lemma 4.7.
There exists an optimal solution $\sigma$ to the restricted assignment problem on $I_2$ such that
(a) for any $i$ and any new fairlet $F\subseteq B'_i$, there exists a problematic point $x\in F\cap B_i$;
(b) if $\sigma$ moves a point of some fairlet $F_{i,t}$ of the decomposition of $B_i$ to another cluster, then $\sigma$ moves all points of $F_{i,t}$ to other clusters.
Proof.
Let $\sigma$ be an optimal solution for the restricted assignment problem. We apply two transformations to it.
Transformation I: Simplifying the reassignment multigraph.
For every group $j\in[\ell]$, define the reassignment multigraph $G_j$ on the set of centers $c_1,\dots,c_k$. There is a directed edge from $c$ to $c'$ in $G_j$ for every point $x$ of group $j$ that is reassigned from $c$ to $c'$ by $\sigma$. We label such an edge with $x$. Note that there may be parallel edges from $c$ to $c'$ in $G_j$.
Assume that there is a vertex $c$ that has both incoming and outgoing edges in $G_j$. Let $(c',c)$ be an incoming edge and $(c,c'')$ be an outgoing edge (it is possible that $c'=c''$). Denote their labels by $p$ and $q$. By the definition of the graph, $\sigma$ reassigns point $p$ from $c'$ to $c$ and point $q$ from $c$ to $c''$. We modify $\sigma$ as follows (see Figure 1).
 | original assignment | new assignment
---|---|---
assignment | $p\colon c'\to c$, $\;q\colon c\to c''$ | $p\colon c'\to c''$, $\;q$ stays at $c$
cost | $d(c',c)+d(c,c'')$ | $d(c',c'')$
We only change the assignment of points $p$ and $q$: we reassign $p$ to $c''$ and $q$ to $c$ (that is, $q$ stays in its original cluster). Denote the obtained assignment by $\sigma'$. Since $p$ and $q$ belong to the same group $j$, we do not change the profiles of any clusters and thus $\sigma'$ is exactly fair. Further, if $\sigma$ assigns a point from $B_i$ to $c_i$ then so does $\sigma'$. Therefore, $\sigma'$ is also a solution to the restricted assignment problem. Finally, its cost is at most that of $\sigma$: indeed, the assignment costs of all points other than $p$ and $q$ do not change; the total assignment cost of $p$ and $q$ was equal to $d(c',c)+d(c,c'')$ for $\sigma$ and is equal to $d(c',c'')$ for $\sigma'$; and we have $d(c',c'')\le d(c',c)+d(c,c'')$ by the triangle inequality.
We perform this step over and over until there are no vertices that simultaneously have incoming and outgoing edges. (Note that each time we perform this step, the number of edges in the graph $G_j$ decreases by at least 1 and does not change in the other graphs $G_{j'}$, $j'\ne j$. So we will necessarily stop at some point.)
Denote the obtained assignment by $\sigma''$. It is an exactly fair assignment and its cost is at most that of $\sigma$. Consider the reassignment multigraphs $G_j$ for $\sigma''$. Each vertex in each $G_j$ is either a source, a sink, or an isolated vertex.
Transformation II: Specifying which points $\sigma''$ moves.
We go over all $i\in[k]$ and $j\in[\ell]$ such that $c_i$ is a source-vertex in $G_j$. Assignment $\sigma''$ moves some number $m_{ij}$ of points of group $j$ from $B_i$ to other clusters. Since all points of group $j$ in $B_i$ are interchangeable (they are located at the same point $c_i$), we are free to choose which $m_{ij}$ of them to move. We choose this set in a special way. Let $F_{i,1},\dots,F_{i,s_i}$ be the fairlets in the fairlet decomposition we constructed for $B_i$. We order all points of group $j$ in $B_i$ as follows: first the problematic points, then the points from $F_{i,1}$, then from $F_{i,2}$, then from $F_{i,3}$, etc. (we order points inside each of these sets arbitrarily). We choose the subset of the first $m_{ij}$ points w.r.t. this order and assume that $\sigma''$ moves them. Note that $\sigma''$ moves non-problematic points of group $j$ from $B_i$ to another cluster only if it moves all problematic points of group $j$ from $B_i$.
We have thus defined $\sigma''$. Now we prove that it satisfies properties (a) and (b). To this end, we consider a cluster $B_i$ and analyze the following two cases.
Case 1: Assignment $\sigma''$ does not move any non-problematic points from cluster $B_i$ to other clusters. All points (if any) it moves from $B_i$ to other clusters are problematic.
Then all fairlets in the decomposition of $B_i$ belong to $D_0$, and thus all fairlets present in the decomposition for $B_i$ are also present in that for $B'_i$. Hence, every non-problematic point of $B_i$ belongs to the same fairlet in $B_i$ and $B'_i$. In particular, only problematic points of $B_i$ may belong to new fairlets. On the other hand, since $\sigma''$ is a solution to the restricted assignment problem, at least one point in each new fairlet in $B'_i$ must be from $B_i$. It follows that this point must be problematic. We proved that property (a) holds in this case. Since $\sigma''$ does not move any points from fairlets of $B_i$ to other clusters, property (b) trivially holds.
Case 2: Assignment $\sigma''$ moves at least one non-problematic point from $B_i$ to another cluster.
Let $t^*$ be the maximal index such that $\sigma''$ moves points from fairlet $F_{i,t^*}$ to other clusters. Since we are guaranteed that $\sigma''$ moves at least one non-problematic point, $t^*$ is well defined.
By our choice of $t^*$, $\sigma''$ does not move any points from fairlets $F_{i,t^*+1},\dots,F_{i,s_i}$, and therefore these fairlets are still present in $B'_i$. Now we show that there are no other fairlets in $B'_i$; in particular, there are no new fairlets.
Consider a point $x$ in $F_{i,t^*}$ that $\sigma''$ moves to some other cluster $B'_{i'}$. Assume that $x$ belongs to group $j$. Then the edge $(c_i,c_{i'})$ – an edge outgoing from $c_i$ – is present in $G_j$. Hence, $c_i$ cannot be a sink or an isolated vertex and must be a source-vertex in $G_j$. Therefore, $\sigma''$ does not move any point of group $j$ from another cluster to $B'_i$. Also, all points of group $j$ in $B_i$ that precede $x$ w.r.t. the order we considered in the Transformation II step are moved to other clusters by $\sigma''$; in particular, all points of group $j$ in $F_{i,1},\dots,F_{i,t^*-1}$ as well as $x$ itself are moved to other clusters by $\sigma''$. Therefore, the set of group-$j$ points of $B'_i$ consists of the group-$j$ points of $F_{i,t^*+1},\dots,F_{i,s_i}$ and, hypothetically, some group-$j$ points of $F_{i,t^*}$. However, point $x$ of group $j$ from $F_{i,t^*}$ is assigned to another cluster and thus there are not enough points from group $j$ left in $B'_i$ to form another fairlet. We conclude that $F_{i,t^*+1},\dots,F_{i,s_i}$ are the only fairlets in $B'_i$. Since $B'_i$ satisfies the exact fairness constraints, each point in $B'_i$ lies in a fairlet, that is, in one of these fairlets. Now (a) trivially holds since there are no new fairlets in $B'_i$; (b) holds since $\sigma''$ moves all points from fairlets $F_{i,1},\dots,F_{i,t^*}$ and no points from fairlets $F_{i,t^*+1},\dots,F_{i,s_i}$. ∎
Corollary 4.8.
Suppose that $I_2$ is a $\lambda$-approximately fair instance. There exists an assignment of cost at most $2\,\mathrm{OPT}$ that only moves $(\lambda+1)kf^2$ points. Furthermore, we can choose a subset of data points $U'$ such that $|U'|=O(\lambda k^2f^2)$ and the solution only moves points in $U'$.
Proof.
Let $\sigma$ be an optimal restricted assignment that satisfies both conditions in Lemma 4.7. In the clustering defined by $\sigma$, every new fairlet contains a problematic point from the original cluster of the center it is assigned to by $\sigma$. Since by Lemma 4.4 there are only $(\lambda+1)kf$ problematic points in total, there are at most $(\lambda+1)kf$ new fairlets. Thus, $\sigma$ moves at most $(\lambda+1)kf^2$ points (every moved point lies in a new fairlet, and each new fairlet has $f$ points). Moreover, by Lemma 4.6, the cost of $\sigma$ is at most $2\,\mathrm{OPT}$.
Since we know that $\sigma$ moves at most $(\lambda+1)kf^2$ points and satisfies condition (b) in Lemma 4.7, $\sigma$ moves at most $(\lambda+1)kf$ fairlets in each cluster. Recall that each $B_i$ contains $s_i$ fairlets. From each cluster $B_i$, we pick $\min(s_i,(\lambda+1)kf)$ fairlets and add them to the set $U'$. Then, we add all the problematic points to $U'$. Then, $|U'|\le(\lambda+1)k^2f^2+(\lambda+1)kf=O(\lambda k^2f^2)$ and, since points of the same group in a cluster are interchangeable, we can assume that $\sigma$ only moves the points in $U'$. ∎
Now, we can solve the assignment problem by passing the set $U'$ as input to our dynamic programming approach described in Section 3.4.
Proof of Theorem 1.6.
Let $I_0$ be an instance of the exactly fair $k$-median problem. First, we perform location consolidation on $I_0$ to obtain the instance $I_1$ as described in Section 3.2. Then, we run the assignment algorithm by Bera et al. (2019) on $I_1$. Let $I_2$ be the resulting instance. By Theorem 4.1, $I_2$ is $\lambda$-approximately fair. By Corollary 4.8, we can pick a subset of data points $U'$ such that $|U'|=O(\lambda k^2f^2)$ and there exists a solution of cost at most $2\,\mathrm{OPT}$ that only moves points in $U'$. Now, we apply the algorithm from Theorem 1.3 (the tree embedding and the DP) to the points in $U'$ to obtain a solution of cost at most $O(\log k)\cdot\mathrm{OPT}$ with high probability; this step runs in $(kf)^{O(\ell)}\,\mathrm{poly}(n)$ time. Since all the previous steps take $\mathrm{poly}(n)$ time, the total running time of this algorithm is $(kf)^{O(\ell)}\,\mathrm{poly}(n)$. ∎
5 Conclusions
In this paper, we study the fair representation $k$-median problem. We present an $O(\log k)$-approximation algorithm that runs in time $n^{O(\ell)}$. We further show that our algorithm works in a more general setting where the fairness requirements are specified as an arbitrary set of fair profiles. This notion of “profile-based fairness” captures a richer class of fairness requirements that cannot be handled by the previously known approaches for fair representation clustering.
In addition, in the special case of exact fairness, we present an $O(\log k)$-approximation algorithm that runs in $(kf)^{O(\ell)}\,\mathrm{poly}(n)$ time, where $f$ is the size of a fairlet.
Our paper shows that there exist approximation algorithms for fair representation clustering with multiple protected classes that run in polynomial time for every fixed number of groups $\ell$. It remains an exciting open question whether a polynomial-time $O(\log k)$- (or even $O(1)$-) approximation can be achieved for the problem when $\ell$ is super-constant. An approach that can potentially help resolve this question is to extend the “reassignment” method used in Section 4 to the more general setting of representation fairness (as in Section 3).
References
- Abbasi et al. [2021] M. Abbasi, A. Bhaskara, and S. Venkatasubramanian. Fair clustering via equitable group representations. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, page 504–514, 2021.
- Ahmadian et al. [2019] S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. Clustering without over-representation. In Proceedings of the SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 267–275, 2019.
- Angelidakis et al. [2017] H. Angelidakis, K. Makarychev, and Y. Makarychev. Algorithms for stable and perturbation-resilient problems. In Proceedings of the Symposium on Theory of Computing, pages 438–451, 2017.
- Backurs et al. [2019] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner. Scalable fair clustering. In Proceedings of the International Conference on Machine Learning, pages 405–413, 2019.
- Bandyapadhyay et al. [2021] S. Bandyapadhyay, F. V. Fomin, and K. Simonov. On coresets for fair clustering in metric and Euclidean spaces and their applications. In Proceedings of the International Colloquium on Automata, Languages, and Programming, pages 23:1–23:15, 2021.
- Bartal [1996] Y. Bartal. Probabilistic approximation of metric spaces and its algorithmic applications. In Proceedings of the Conference on Foundations of Computer Science, pages 184–193, 1996.
- Bartal [1998] Y. Bartal. On approximating arbitrary metrices by tree metrics. In Proceedings of the Symposium on Theory of Computing, pages 161–168, 1998.
- Bera et al. [2019] S. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In Advances in Neural Information Processing Systems, pages 4955–4966, 2019.
- Bercea et al. [2019] I. O. Bercea, M. Groß, S. Khuller, A. Kumar, C. Rösner, D. R. Schmidt, and M. Schmidt. On the cost of essentially fair clusterings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 2019.
- Böhm et al. [2020] M. Böhm, A. Fazzone, S. Leonardi, and C. Schwiegelshohn. Fair clustering with multiple colors. arXiv preprint arXiv:2002.07892, 2020.
- Brubach et al. [2020] B. Brubach, D. Chakrabarti, J. Dickerson, S. Khuller, A. Srinivasan, and L. Tsepenekas. A pairwise fair and community-preserving approach to k-center clustering. In Proceedings of the International Conference on Machine Learning, pages 1178–1189, 2020.
- Chakrabarty and Negahbani [2021] D. Chakrabarty and M. Negahbani. Better algorithms for individually fair k-clustering. arXiv preprint arXiv:2106.12150, 2021.
- Charikar et al. [2002] M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65(1):129–149, 2002.
- Chen et al. [2019] X. Chen, B. Fain, L. Lyu, and K. Munagala. Proportionally fair clustering. In International Conference on Machine Learning, pages 1032–1041. PMLR, 2019.
- Chierichetti et al. [2017] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Advances in Neural Information Processing Systems, pages 5036–5044, 2017.
- Chlamtáč et al. [2022] E. Chlamtáč, Y. Makarychev, and A. Vakilian. Approximating fair clustering with cascaded norm objectives. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2664–2683, 2022.
- Chouldechova [2017] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
- Chouldechova and Roth [2020] A. Chouldechova and A. Roth. A snapshot of the frontiers of fairness in machine learning. Communications of the ACM, 63(5):82–89, 2020.
- Esmaeili et al. [2020] S. Esmaeili, B. Brubach, L. Tsepenekas, and J. Dickerson. Probabilistic fair clustering. Advances in Neural Information Processing Systems, 33, 2020.
- Esmaeili et al. [2021] S. A. Esmaeili, B. Brubach, A. Srinivasan, and J. P. Dickerson. Fair clustering under a bounded cost. Advances in Neural Information Processing Systems, 2021.
- Fakcharoenphol et al. [2004] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences, 69(3):485–497, 2004.
- Feldman et al. [2015] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian. Certifying and removing disparate impact. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2015.
- Ghadiri et al. [2021] M. Ghadiri, S. Samadi, and S. Vempala. Socially fair k-means clustering. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 438–448, 2021.
- Huang et al. [2019] L. Huang, S. Jiang, and N. Vishnoi. Coresets for clustering with fairness constraints. In Proceedings of the Conference on Neural Information Processing Systems, 2019.
- Jones et al. [2020] M. Jones, H. Nguyen, and T. Nguyen. Fair k-centers via maximum matching. In Proceedings of the International Conference on Machine Learning, pages 4940–4949, 2020.
- Jung et al. [2020] C. Jung, S. Kannan, and N. Lutz. A center in your neighborhood: Fairness in facility location. In Proceedings of the Symposium on Foundations of Responsible Computing, page 5:1–5:15, 2020.
- Kariv and Hakimi [1979] O. Kariv and S. L. Hakimi. An algorithmic approach to network location problems. II: The p-medians. SIAM Journal on Applied Mathematics, 37(3):539–560, 1979.
- Kearns and Roth [2019] M. Kearns and A. Roth. The ethical algorithm: The science of socially aware algorithm design. Oxford University Press, 2019.
- Kleinberg et al. [2017] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In Proceedings of the Innovations in Theoretical Computer Science, 2017.
- Kleinberg et al. [2018] J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2018.
- Kleindessner et al. [2019] M. Kleindessner, P. Awasthi, and J. Morgenstern. Fair k-center clustering for data summarization. In Proceedings of the International Conference on Machine Learning, pages 3448–3457, 2019.
- Kleindessner et al. [2020] M. Kleindessner, P. Awasthi, and J. Morgenstern. A notion of individual fairness for clustering. arXiv preprint arXiv:2006.04960, 2020.
- Li and Svensson [2016] S. Li and O. Svensson. Approximating k-median via pseudo-approximation. SIAM Journal on Computing, 45(2):530–547, 2016.
- Mahabadi and Vakilian [2020] S. Mahabadi and A. Vakilian. Individual fairness for k-clustering. In Proceedings of the International Conference on Machine Learning, pages 6586–6596, 2020.
- Makarychev and Vakilian [2021] Y. Makarychev and A. Vakilian. Approximation algorithms for socially fair clustering. In Proceedings of the Conference on Learning Theory, pages 3246–3264. PMLR, 2021.
- Marcinkowski et al. [2020] F. Marcinkowski, K. Kieslich, C. Starke, and M. Lünich. Implications of ai (un-) fairness in higher education admissions: the effects of perceived ai (un-) fairness on exit, voice and organizational reputation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 122–130, 2020.
- Micha and Shah [2020] E. Micha and N. Shah. Proportionally fair clustering revisited. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2020.
- Rösner and Schmidt [2018] C. Rösner and M. Schmidt. Privacy preserving clustering with constraints. In Proceedings of the International Colloquium on Automata, Languages, and Programming, pages 96:1–96:14, 2018.
- Schmidt et al. [2019] M. Schmidt, C. Schwiegelshohn, and C. Sohler. Fair coresets and streaming algorithms for fair k-means. In Proceedings of the International Workshop on Approximation and Online Algorithms, pages 232–251, 2019.
- Tamir [1996] A. Tamir. An O(pn²) algorithm for the p-median and related problems on tree graphs. Operations Research Letters, 19(2):59–64, 1996.
- Vakilian and Yalçıner [2021] A. Vakilian and M. Yalçıner. Improved approximation algorithms for individually fair clustering. arXiv preprint arXiv:2106.14043, 2021.