New Approximation Algorithms for Fair -median Problem
Abstract
The fair -median problem is one of the important clustering problems. The current best approximation ratio is 4.675 for this problem with 1-fair violation, which was proposed by Bercea et al. [APPROX-RANDOM’2019]. As far as we know, there is no available approximation algorithm for the problem without any fair violation. In this paper, we consider the fair -median problem in bounded doubling metrics and general metrics. We provide the first QPTAS for fair -median problem in doubling metrics. Based on the split-tree decomposition of doubling metrics, we present a dynamic programming process to find the candidate centers, and apply min-cost max-flow method to deal with the assignment of clients. Especially, for overcoming the difficulties caused by the fair constraints, we construct an auxiliary graph and use minimum weighted perfect matching to get part of the cost. For the fair -median problem in general metrics, we present an approximation algorithm with ratio , which is based on the embedding of given space into tree metrics, and the dynamic programming method. Our two approximation algorithms for the fair -median problem are the first results for the corresponding problems without any fair violation, respectively.
1 Introduction
As one of the most commonly studied problems, clustering aims to give a ”good” partition of a set of points of a metric space into different groups so that points with similar attributes should be in the same group. The clustering algorithm may be biased if the data set involves some sensitive attributes (gender, race, or age). To solve this deviation caused by some sensitive attributes of the data set, the fair clustering problem was proposed. Over the past few years, lots of attention have been paid on the fair clustering problem (Abbasi et al. (2021); Ahmadian et al. (2020); Backurs et al. (2019); Chen et al. (2019); Chierichetti et al. (2017); Ghadiri et al. (2021); Bera et al. (2019); Esmaeili et al. (2020)).
Chierichetti et al. (2017) first proposed the definition of redistricting fairness for clustering, which refers to the balance of disparate impact and fair representation of each protected class in every cluster. Bercea et al. (2019) first proposed a generalized and tunable notion of redistricting fairness where asserts that protected classes should maintain an upper bound and a lower bound proportion of the cluster size in clustering. This kind of fairness model is called -proportional fair model. The input for fair -median problem is a set of points , a set of candidate centers in a metric space together with an integer , and two vectors and , the fair -median problem asks for a set of points, called , together with an assignment , where each client in the data set is entitled a color and is divided into disjoint groups , the goal is to minimize the -median objective function satisfying that the number of each type clients within in each cluster conforms to a certain proportion between and .
Bercea et al. (2019) first proposed a bicriteria approximation scheme with -approximation and -fairness violation in polynomial time for the fair -median problem in metric space, which can be extended to a -approximation for the fair -center/facility location/-means problem within a -fairness violation, respectively. For the special case of ( is a fixed number and ), Ahmadian et al. (2019) gave a -approximation with -fairness violation algorithm for fair -center problem in metric space. As a generalization for the above -proportional fairness model, Bera et al. (2019) proposed a more generalized notion of fairness such that each client can belong to multiple types. They gave an approximation algorithm with -approximation and -fairness violation in polynomial time for the fair -median, fair -center and fair -means problems, where is the approximation ratio of the known -clustering algorithm, and denotes the maximum number of types that each client belongs to. Harb and Lam (2020) gave a -approximation and -fairness violation for the fair -center problem in polynomial time. As far as we know, for the fair -median problem, no approximation algorithm without fairness violation is available for any non-trivial metrics.
2 Results and Techniques
We propose the first qusi-polynomial time approximation scheme(QPTAS) for the fair -median problem in doubling metrics, the first -approximation algorithm for the fair -median problem in general metrics, and these two results have no fairness violation.
Theorem 2.1.
Given an instance of size of the fair -median problem in metrics of fixed doubling dimension , there exists a randomized -approximation algorithm with running time .
Theorem 2.2.
Given an instance of size of the fair -median problem in general metric space, there exists a randomized -approximation algorithm with running time .
Our method solving the fair -median problem in doubling metrics is based on the dynamic programming process on split-tree decomposition. Firstly, the split-tree decomposition is to partition the metrics into a number of subsets randomly, called blocks, and keeps dividing these blocks recursively until each block only contains one point. For each block, we construct a subset of points, called portals, as the bridges to get the distance of points in split-tree. Our goal is to prove that there exists a near-optimal solution such that different blocks ”interact” only through portals, which is one of the major challenges to solve the fair -median problem in doubling metrics. The dynamic programming algorithm proceeds on the split-tree from the leaves to the root to fill in the DP table. For a table entry of a specific block, we save the information that how many clients of each type located outside (respectively inside) is assigned to facilities located inside (respectively outside) this block through each portal, and the number of facilities opened in this block. For a subproblem of a specific block, the value of the table entry of this block is calculated by considering the following two parts: the total value of table entries of the children of this block which can be extracted from the DP table; the cost between the portals of this block and the ones in its children. How to calculate the cost between the portals of a block and the ones in its children is another challenge to solve the fair -median problem in doubling metrics. To solve this problem, we construct an auxiliary graph and use minimum weighted perfect matching to get the cost between the portals of a block and the ones in its children. The dynamic programming procedure returns a set of facilities that need to be opened, and the number of clients assigned to each open facility satisfying the fairness constraints. Finally, we apply the min-cost max-flow method to assign the clients to the facilities.
The algorithm solving the fair -median problem in general metric space is based on the embedding of given space to tree metrics such that a Hierarchically Separated Tree (HST) can be obtained. Moreover, a dynamic programming process is presented on the Hierarchically Separated Tree to get the candidate opened facilities.
3 Preliminaries
Definition 3.1 (metric space).
Given a space where is a set of points associated with a distance function , space is called a metric space if for any point in , the following properties are satisfied: (1) iff ; (2) ; (3) .
Definition 3.2 (doubling dimension).
Given a metric space and a non-negative real number , for any point , let denote the ball around with radius . The doubling dimension of the metric space is the smallest integer such that any ball can be covered by at most ball . A metric space with doubling dimension is called a doubling metric space.
Given a metric space , let be the ratio between the largest and the smallest distance in , which is called the aspect ratio of . For any subset , the aspect ratio of is defined as . For a positive integer , let .
Definition 3.3 (the fair -median problem).
Given a metric space with a set of clients and a set of facilities , a non-negative integer and two vectors , assume that is partitioned into disjoint groups, i.e., and . For any , if , is called an -th type client. The fair -median problem is to find a subset of size at most and an assignment function such that: (1) for any , , ; (2) is minimized.
For an instance of the fair -median problem, given a solution of instance , the facilities in are called candidate facilities. For a facility and a client , if , then it is called that serves client . Let , and assume that the size of is .
4 A QPTAS for Fair -median Problem in Doubling Metrics
For the fair -median problem in doubling metrics, we first construct a split-tree, and design a dynamic programming procedure to get the solution.
4.1 Split-tree Decomposition
Given an instance of the fair -median problem, following the assumptions in Behsaz et al. (2019), the aspect-ratio of the input metrics is at most (where is the size of , and is a constant). In order to satisfy the aspect-ratio assumption, we first deal with the points in by the following preprocessing steps and we use to denote an optimal solution of fair -median problem. Find an -approximation feasible solution of the fair -median problem (Bercea et al. (2019)), and let be the cost of the feasible solution obtained. Then, if there exists a pair of points with distance less than , remove , copy a point of , and add at (note that if there are two clients at , the two clients may not necessarily be assigned to the same facility in the solution). In the new instance obtained, a point is within distance at most from its original location. Thus, the cost of any solution for this new instance compared with the original instance is increased by at most .
For any set of points, a subset is called a - of if for any point , there is a point such that . A subset is called a - of if for any , . A subset is a - in if it is both a - of and a - of . The size of a net in metrics with fixed doubling dimension is bounded by the following lemma.
Lemma 4.1.
Let be a metric space with doubling dimension and aspect-ratio , and let be a -. Then .
A decomposition of the metric is a partition of into subsets, which are called blocks. A hierarchical decomposition is a sequence of decompositions such that every block of is the union of blocks in and and . The blocks of are called the level blocks. A split-tree of the metric space is a complete hierarchical decomposition: the root node is the level block and the leaves are the singletons such that . Given a metric space , let be the diameter of the metrics. We first construct a sequence of sets such that is a -net of . Starting from , we now discuss the relation between the blocks in ()-th level and -th level, where . Let where , and let be a random ordering of the points in . For each point , let be the order of in . For the set , let be the set of points in with order from left to right in . For the point , a new block can be obtained at level such that . For any point , a new block can be obtained at level by the following way: let . For each point in , if is not contained in any blocks , then is contained in .
Lemma 4.2 (Talwar (2004)).
Given a split-tree of metric space with a sequence of decompositions , the split-tree decomposition has the following properties:
-
1.
The total number of levels is (since ).
-
2.
Each block of level has diameter at most , namely the maximum distance between any pair of points in a block of level is at most .
-
3.
Each level block is the union of at most level blocks.
-
4.
For any pair of points , the probability that they are in different sets corresponding to blocks at level of split-tree is at most .
Given an instance of the fair -median problem, let be the split-tree obtained by using the methods in Talwar (2004). For each block of level in , we compute a -net of block . Each point in is called a portal, and is also called a portal set. By Lemma 4.1, it follows that the number of portals at a given block is . Moreover, the split-tree within portal set can be found in time (Bartal and Gottlieb (2013); Cohen-Addad et al. (2019)). For a portal set of a block of level in , there is an important property as follows.
Lemma 4.3 (Bartal et al. (2016)).
For a portal set of a block of level in , assume that the children of at level is and each block has a portal set . is a subset of the portal sets computed for the descendant blocks of . That is, .
Given a split-tree and two blocks and at level of , and for two points , and , the distance between and on the split-tree is defined as the length of the path which is constructed by the subpath from to a portal of , the subpath from to a portal of , and the subpath from to . Given a block at level of split-tree and a portal set , if there is a client outside of that is assigned to a facility inside of crossing at a portal , then we say that client enters through portal . Similarly, if there is a client inside of that is assigned to a facility outside of crossing at a portal , then we say that client leaves through portal . In the following, all the distances considered are the distances on the split-tree .
Lemma 4.4.
For any metric with doubling dimension and any , given a randomized split-tree , a pair of blocks and of level , a -net of block , a -net of block , and any two points and , for the distance of and on the split-tree and the distance in the metric space, we have: .
Proof.
By the triangle inequality, we have
Then, we have
Since is a -net of block and is a -net of block , we have that , where is any point in . Similarly, we have that . Thus, we have
∎
4.2 Algorithm for Fair -median Problem in Doubling Metrics
In this section, we show how to design the dynamic programming process based on the split-tree.
4.2.1 Dynamic Programming State
Given a split-tree , the dynamic programming algorithm proceeds on split-tree from the leaves to the root. For any block of split-tree , we define the subtree rooted at block as , and each subtree is a subproblem in our dynamic programming process, which includes the information that how many clients of each type entering or leaving the subtree through the portals of the subtree.
A table entry in the dynamic programming process is a tuple where the parameters are defined as follows:
-
1.
is the size of portal set in and .
-
2.
is the root node of the subtree .
-
3.
() is the number of open facilities in .
-
4.
() is a vector with values and for , () in denotes the number of type clients that enter through the -th portal.
-
5.
() is a vector with values and for , () in denotes the number of type clients that leave through the -th portal.
The cost of table entry consists of the following three parts: (1) The cost of assigning clients inside of to facilities inside of ; (2) For the clients inside of assigned to facilities outside of through portals of , the cost from the clients inside of to portals of ; (3) For the clients outside of assigned to facilities inside of through portals of , the cost from portals of to the assigned facilities inside of .
The solutions at the root of are in the table entry , where is a vector with zero components. Among all these solutions in , the algorithm outputs the one with the minimum cost.
The base case of the dynamic programming is located at the leaves (which are singletons) of split-tree. For a leaf of split-tree, the corresponding block has only one portal and we only need to set and , while the remaining ’s and ’s can be assigned . Since there is only one facility or client in each leaf of split-tree, we consider the following three cases for the table entry of leaf node :
-
1.
If there is only one facility in the block and this facility is opened, then the table value of this block is . In addition, the -th value of is denoted as , i.e., is the number of type clients outside of that are served by the facility inside . Thus, in order to satisfy the fair constraints, the number of each type clients should satisfy the following constraint: for each , .
-
2.
If there is only one facility in the block but this facility is not opened, then the table value of this block is .
-
3.
If there is only one client in the block, then the table value of this block is . If the client is a type () client, then the -th value in is -1 and the other values in is 0.
4.2.2 Computing Table Entries
Now we consider the subproblem on the subtree where block is at level of split-tree . For the block , let be the children of block where is at most . Let be the configuration of , and let be the configuration of children (), where () denotes the -th child of , is the number of facilities that are opened in , () is a vector with values and each value () in denotes the number of the -th type clients entering through the -th portal of , is a vector with values and each value () in denotes the number of the -th type clients leaving through the -th portal of . The values of , and have the following constraints:
-
(1)
;
-
(2)
=.
Lemma 4.5.
Given a subproblem at level of split-tree () and its children where is at most , for the configuration of , and the configuration of each children (), we have
where is the cost between portals of and the ones in , and can be calculated in polynomial time.
Proof.
Since there exists consistence relation between the configurations of and its children, that is, the number of clients entering or leaving in the configuration is equal to the number of clients entering or leaving in the configurations , we have
For a particular type (), the number of type clients entering or leaving in the configuration is equal to the number of type clients entering or leaving in the configurations . Thus, we have
Since , , we have:
(1) |
Then, we have:
(2) |
Based on the constraints of , , , to find the minimum cost of the solutions in subproblem with table entry , we need to enumerate all consistent subproblems from its children . Moreover, the cost is also closely related to the cost between portals of and the ones of , denoted by .
We now give the general idea to calculate . For each type , a weighted bipartite graph is constructed. The value of is obtained based on perfect matching on each bipartite graph constructed. If we find the minimum cost perfect matching in , it gives the optimal case for the type clients entering or leaving and through portals of and the ones in .
For each type , a weighted bipartite graph can be constructed by the following way. Let , and . For the block , and for each portal in , if there exist clients entering through , then add new vertices to . It is easy to see that . Similarly, for each portal in , if there exist clients leaving through , then add new vertices to . Then, we have . For each child of , and for each portal of , if there exist clients entering through , then add new vertices to . Then, we have . Similarly, for each child of , and for each portal of , if there exist clients leaving through , then add new vertices to . Then, we have .
For each vertex in , let be the portal in or such that vertex is added by the clients entering or leaving block through the portal.
The set of edges in can be constructed by the following methods.
-
1.
Add all the edges in to .
-
2.
Add all the edges in to .
-
3.
Add all edges in , where and and to .
-
4.
For each edge in , the weight of edge is denoted as .
Claim 4.6.
Given a bipartite graph as described above, there must exist a perfect matching in .
Proof.
Assume that there does not exist a perfect matching in . Let be one of the maximum matchings in . Since is not a perfect matching, there exists at least two unmatched vertices in . By equation 1, we have . Thus, the number of unmatched vertices must be even. Assume that are two unmatched vertices in , we have the following cases for the unmatched vertices and : Case 1, or . According to the construction process of , there must exist an edge between and . Thus, based on , a matching with one more edge can be obtained, contradicting the fact that is a maximum matching in .
Case 2, . We have the following two subcases: 1) There is no edge between vertices in and vertices in . Since , there must exist two unmatched vertices and such that an augmenting path from to and an augmenting path from to can be found. Thus, based on , a matching with larger size can be obtained, contradicting the fact that is a maximum matching in . 2) There exist two vertices such that edge is in . According to the construction process of , edges , are contained in . Thus, an augmenting path from to can be found, and a matching with larger size can be obtained, which is a contradiction.
Case 3, . We consider the following four cases. 1) There is no edge from to the vertices in , and there is no edge from to the vertices in . Since , there must exist two unmatched vertices and such that an augmenting path from to and an augmenting path from to can be found. Thus, based on , a matching with larger size can be obtained, contradicting tha fact that is a maximum matching in . 2) There exists a vertex in such that edge is in . According to the construction process of , there exist a vertex in such that edges , are in . Thus, an augmenting path from to can be found, and a matching with larger size can be obtained, which is a contradiction. 3) There exists a vertex in such that edge is in . According to the construction process of , there exist a vertex in such that edges , are in . Thus, an augmenting path from to can be found, and a matching with larger size can be obtained, which is a contradiction. 4) There exists a vertex in such that edge is in and there exists a vertex in such that edge is in . Based on , it is easy to find an augmenting path from to to get a matching with larger size, which is a contradiction.
Therefore, there must exist a perfect matching in bipartite graph . ∎
Let be the sum weights of the minimum weighted perfect matching in .
Claim 4.7.
The value of is .
Proof.
By equation 1, we have
(3) |
We denote and as the remaining number of clients entering or leaving , respectively. Then, the remaining number of clients entering is equal to the remaining number of clients leaving .
For each type , a weighted bipartite graph is constructed. Let be the number of edges in perfect matching of , and let . Since the number of clients entering or leaving in the configuration is equal to the number of clients entering or leaving in the configurations , we can get that the number of pair portals to get value is equal to .
In each graph , the perfect matching in gives the relation between the portals of and the ones in , where the clients of type entering or leaving through those portals. The minimum cost of the solutions in subproblem with table entry is based on the value of , which is the cost between portals of and the ones in . Thus, we can get that . We now prove that the value of cannot be less than . Assume that . The value of is the cost between portals of and the ones in , and by the fair constraints, there are types of clients to deal with. Thus, the value of can be divided into values, i.e., . By assumption, . Thus, there is a type such that . Let be the perfect matching in to get the value , and let be the set of pair portals used to get the value . By the above discussion, the number of pair portals in is equal to the number of edges in . For each pair portal in , by the construction process of edge set , there exists a corresponding edge with weight in graph . Moreover, let be the set of corresponding edges of the pair portals in . It is easy to get that is a perfect matching in with weight smaller than , contradicting that is a minimum weighted perfect matching in . Thus, . ∎
4.2.3 Assigning Clients to Facilities
Given an instance of the fair -median problem, the above dynamic programming procedure obtains a set of size at most which is a set of open facilities for the fair -median problem and the number of clients assigned to each open facility satisfying the fairness constraints. However, for each client , the dynamic programming procedure only ensures that the number of clients assigned to each open facility satisfies fair constraints, and does not obtain the assignment which maps each client to a facility . We apply the min-cost max-flow technique to obtain the assignment which maps each client to a facility for fair -median problem. Consider the solution obtained by the above dynamic programming, we obtain a set of size at most which is a set of open facilities for fair -median problem and each is given some numbers to denote the number of each type clients assigned to it.
Given an open facilities set (where , ) and a vector with values for each open facility in , where () is a vector within values, and each value () in denotes the number of the -th type clients that are assigned to , we construct a bipartite graph with both capacities and weights on the edges and values on the vertices as follows. Let , , where denote the source point, denotes the sink point, denotes the -th client of the type clients (), denotes the -th facility and we label this facility as type ().
The edges in can be constructed as follows. For each fixed , and for each vertex and (where ), add a directed edge from to in . The capacity of the edge is 1 and the cost of the edge is . Note that parallel edges may exit (parallel edges are kept in ). Furthermore, for each vertex , add a directed edge from to with capacity 0 and cost 0. For each vertex , add a directed edge from to with capacity 0 and cost 0.
Finally, we set for each vertex in bipartite graph . For each vertex , let denote the value of each vertex. Then, positive means that the vertex supplies units of flow, and negative means that the vertex demands units of flow. For each vertex , set . For each vertex , set . Since and , we have . Then, we run a min-cost max-flow algorithm on the bipartite graph . For the min-cost max-flow found in graph , and for each edge in , if edge has a flow of 1, then we assign the client to facility . Thus, by applying the min-cost max-flow method in graph , we obtain the assignment which maps each client to a facility in polynomial time.
4.2.4 Analysis and Running Time
We show that our algorithm outputs a solution of cost at most times the cost of the optimal solution and it gives a QPTAS for the fair -median problem in doubling metrics. Let denote one of the optimal solution of the fair -median problem in doubling metrics, and let denote the cost of this optimal solution.
Theorem 4.8.
Given an instance of the fair -median problem in doubling metrics, there exists an algorithm such that a solution of cost at most can be obtained, and the running time is at most .
Proof.
Let be the solution obtained by the dynamic programming and clients assignment procedure. For each client , we say that serves a client if . By lemma 4.2, and for any pair of points and in the metrics, the probability that and are divided into different blocks at level is at most . By lemma 4.4, for any two points and at level , the existence of portal set in each block of split-tree incurs an additional cost of , and this is incurred with probability at most . Then, the total additional expected cost for the path between and is at most
(4) |
Let , the total additional expected cost for the path between and is at most . Consider a pair of points and ( denotes the facility that serves ), where is assigned to in the optimal solution . By equation 4, we have that total additional expected cost for the path between and is at most . Namely, the total additional expected cost of solution is . Let be the total cost of solution based on split-tree . Then, we have that . Then, we have that . Thus, the total cost of the solution obtained is no more than .
The running time of our algorithm contains the following three parts: constructing the split-tree , executing the dynamic programming procedure, and executing the min-cost max-flow algorithm. Firstly, the split-tree decomposition within portal set can be found in time (where ).
We now analyze the running time of the dynamic programming process. The running time of the dynamic programming process is related to the number of entries in the dynamic programming table, and the time to calculate the value of each table entry. For a table entry in the dynamic programming table, we analyze the values of the parameters in table entry. For the given split-tree , the total number of levels in the tree is , and there are leaves in . The number of distinct value of is equal to the number of nodes in the split-tree , which is at most . Moreover, the value of is a non-negative integer bounded by , and the parameters of and () in and (), which are non-negative integers bounded by . Thus, the total number of entries in our dynamic programming table is at most .
We now analyze the time to calculate the value of each table entry. For the table entry , we consider all possibilities based on the values of variables (). Since each of these variables are non-negative integers bounded by , there are cases to try all possible values of those parameters to satisfy certain constraints. In the process of dynamic programming procedure, the value of is calculated, and it is closely related to the minimum weighted perfect matching in bipartite graphs, which can be solved in polynomial time (Kuhn (1955); Munkres (1957)). Specifically, for any bipartite graph , the number of vertices in (where ) is at most . Thus, the time to calculate the minimum weighted perfect matching of is bounded by , and the time to calculate the value of is .
Therefore, the time to calculate the value of each table entry in the dynamic programming process is at most (which can be simplified to . As the total number of entries in the table is bounded by , and the number of portals in a portal set -net of a block at level is , the running time of the dynamic programming process is bounded by . Moreover, for the set of opened facilities obtained by dynamic programming process, the clients can be assigned to by applying the min-cost max-flow method in time.
For the instance of the fair -median problem in doubling metrics, a solution of cost at most can be obtained in time . ∎
5 An -Approximation for Fair -median Problem General Metrics
In this section, to solve the fair -median problem in general metrics, we first deal with the points in by a general preprocessing step, then construct a Hierarchically Separated Trees (HST) to embed metric space into tree metrics, finally apply a dynamic programming procedure to get the solution.
5.1 Hierarchically Separated Trees
Given an instance of the fair -median problem, in order to reduce the size of input, we deal with the points in by the following preprocessing steps and we use to denote an optimal solution of fair -median problem. Find an -approximation feasible solution of the fair -median problem (Bercea et al. (2019)), and let be the centers of the feasible solution obtained. Then, for each point , remove , copy a point of and add at where is the nearest center in to . In the new instance obtained, we reduce the size of input locations from to with constant factor distortion. Thus, the cost of any solution for this new instance compared with the original instance is increased by at most .
A decomposition of the metric space is a partitioning of into subsets, which are called blocks. A hierarchically decomposition of the metric is a sequence of decompositions such that every block of is the union of blocks in and and . The blocks of are called the level blocks. A hierarchically separated tree (HST) of the metric space is a complete hierarchically decomposition: the root node is the level block and the leaves are the singletons such that . Given a metric space , let be the diameter of the metrics. Starting from , we now discuss the relation between the blocks in -th level and -th level, where . Let where , and let be a random ordering of the points in . For each point , let be the order of in . For each block at , let be the set of points in with order from left to right in . For the point , a new block can be obtained at level such that . For any point , a new block can be obtained at level by the following way: let . For each point in , if is not contained in any blocks , then is contained in .
Lemma 5.1 (Fakcharoenphol et al. (2004)).
Given a HST of metric space with a sequence of decompositions , the HST decomposition has the following properties:
-
1.
The number of level is (since the aspect ratio of the input metrics is ).
-
2.
Each block of level has diameter at most , namely the maximum distance between any pair of points in a block of level is at most .
-
3.
For each block of level , there are some edges between the block and its children at level , the length from a block to each of its children at level is equal to (which is also the upper bound of the radius of ).
-
4.
For any pair of points must be separated at level , where .
A general metrics can be probabilistically embedded into HST with a distortion of where is the size of input locations and equals to from the above preprocessing step and this embedding can be constructed in polynomial time (Fakcharoenphol et al. (2004)). We now define the distance function in HST. For any two blocks in HST , and for any two points , , the distance between and in is denoted as the length of the shortest path distance in between block and block .
5.2 Algorithm for Fair -median Problem in General Metrics
In this section, we give the algorithm to solve the fair -median problem in general metrics.
5.2.1 Dynamic Programming States
Given a HST , the dynamic programming proceeds on from the leaves to the root. For any block of HST , let be the subtree rooted at block , and let be the number of children of block , and assume that the set of children of block is . For , let denote the subtree induced by the blocks in , and each subtree is a subproblem in our dynamic programming process, which includes the information that how many clients of each type entering or leaving .
A table entry in the dynamic programming process is a tuple , where the parameters are defined as follows:
-
1.
is a block of HST .
-
2.
is the number of facilities that are opened in the subtree .
-
3.
is an integer indicating how many blocks in children are used to get subtree .
-
4.
is a vector with values, and () in denotes the number of type clients inside the subtree that are served by facilities located in .
-
5.
is a vector with values, and () in denotes the number of type clients inside the subtree that are served by facilities located in .
-
6.
is a vector with values, and () in denotes the number of type clients in the subtree that are served by facilities located inside the subtree .
-
7.
is a vector with values, and () in denotes the number of type clients in the subtree that are served by facilities located inside the subtree .
The solutions at the root of HST are in the table entry , where is a vector with zero components. Among all these solutions in , the algorithm outputs the one with the minimum cost.
The base case of the dynamic programming consist of leaves of the HST where at most one facility can be opened at a given location. Let denote the simple tree within the singleton . Since there is only one facility or client in each leaf of HST, we consider the following three cases for the table entry of leaf node :
-
1.
For a block at level 0 of HST , there is only a facility in the block and this facility is opened. The table entry of the subproblem at is . In addition, the -th value in , denoted as , is the number of type clients inside the subtree that are assigned to the facility located in . Thus, in order to satisfy the fair constraints, for each , .
-
2.
For a block at level 0 of HST , there is only a facility in the block but this facility is not opened. The table entry of the subproblem at is .
-
3.
For a block at level 0 of HST , there is only a client in the block. The table entry of the subproblem at is . Since this client must be a client of type (), the -th value in will be -1.
5.2.2 Computing Table Entries
Assume that we consider the subproblem on the subtree where is a block at level of HST . Let be the configuration of with table entry , where . To find the minimum cost of , we enumerate all consistent subproblems corresponding to and (if ). Moreover, the value of the table entries corresponding to and can be extracted from the dynamic programming table. The recursive expression of the dynamic programming procedure has the following two cases.
Case 1. . In this case, we consider the subtree of induced by the blocks in . As is the first node in the total order of the children of , the assignments of the subproblem is the same as the assignments of , and the facilities and clients in any feasible solution to the subproblem must come from . Thus, we have
where is the number of open facilities in the subtree satisfying the constraint . denotes the number of each type clients inside the subtree that are served by facilities inside and in denotes the number of type () clients inside the subtree that are served by facilities inside , denotes the number of each type clients inside that are served by facilities inside the subtree and in denotes the number of type () clients inside that are served by facilities inside the subtree . denotes the length of the edge . Moreover, the values of and satisfy the following constraints.
-
1.
(1) =+. From the definition of above, the set of clients to get value are either in the subtree or in the subtree . From the table entry, the number of clients in the subtree that are served by the facilities in is in , and the number of clients in the subtree that are served by the facilities in is in .
-
2.
(2) =+. From the definition of , the set of facilities to get value are either in the subtree or in the subtree . From the table entry, the number of clients in that are served by the facilities in the subtree is in , and the number of clients in that are served by the facilities in the subtree is in .
Case 2. . In this case, the value of table entry on the subtree is obtained from the subproblems on the subtrees and , by considering all various possible values of the parameters . Thus, we have
where denotes the number of each type clients inside the subtree that are served by facilities inside and denotes the number of each type clients inside the subtree that are served by facilities inside , denotes the number of each type clients inside the subtree that are assigned to facilities inside the subtree and denotes the number of each type clients inside the subtree that are assigned to facilities inside the subtree . Moreover, denotes the number of each type clients inside the subtree that served by facilities inside , in denotes the number of type () clients inside the subtree that are served by facilities inside , denotes the number of each type clients inside that are served by facilities inside the subtree , in denotes the number of type () clients inside that are served by facilities inside the subtree , and denotes the length of the edge . The values of , , , and , should satisfy the following consistent constraints:
-
1.
(1) . is the number of open facilities in the subtree and is the number of open facilities in the subtree satisfying the constraint .
-
2.
(2) . The subproblem on the subtree corresponding to is consistent with subproblems on the subtrees and corresponding to and .
Finally, based on the solution obtained by the dynamic programming, we can apply min-cost max-flow technique same as the one in doubling metrics to guarantee the assignment function for fair -median problem in general metrics.
5.2.3 Analysis and Running time
In this section, we show that our algorithm outputs a solution of cost at most times the cost of the optimal solution in polynomial time. Let denote a optimal solution of the fair -median problem in general metrics, and let denote the cost of this optimal solution.
Theorem 5.2.
Given an instance of the fair -median problem in general metrics, there exists an algorithm such that a solution of cost at most can be obtained, and the running time is at most .
Proof.
Let be the solution obtained by the dynamic programming process and clients assignment procedure. For each client , we say that serves the client if . By lemma 5.1, for any pair of points , we have that and must be separated at level where . Then, the total expected cost for the path between and on the HST is at most
(5) |
Consider a pair of points and where is assigned to ( denotes the facility that serves ) in the optimal solution . By equation 5, we have that the total expected cost for the path between and is at most . Namely, the total cost of solution on the tree metrics is at most . Then, we have that . Moreover, we have that a metric can be probabilistically embedded into HST with a distortion of . Thus, the total cost of the solution obtained by the above dynamic programming process is no more than
The running time of our algorithm is the sum of three main parts: constructing the HST , executing the dynamic programming procedure and executing the min-cost max-flow algorithm. Firstly, as described above, we have that a HST can be constructed in polynomial time.
For a table entry in the dynamic programming table, we analyze the the value of the parameters in this table entry. For the given HST , the total number of levels is and there are leaves in . The number of distinct value of is equal to the number of nodes in , which from Fakcharoenphol et al. (2004) is at most . Moreover, the value of parameters is non-negative integer bounded above by . The parameters in are also non-negative integers bounded above by . For a fixed , each denotes the first children of from the total ordering of the children of in . Since has exactly leaves, can have no more than children. Thus, is bounded above by . From the above arguments, the total number of table entries in our dynamic programming table is at most .
We now analyze the time to calculate the value of each table entry. For a table entry , we consider all possibilities based on the parameters . Since each of these parameters are non-negative integers bounded above by , there are cases to try all possible values of these parameters to satisfy certain constraints.
Thus, the running time of our dynamic programming algorithm is bounded by . Moreover, for the set of opened facilities obtained by dynamic programming process, the clients can be assigned to by applying the min-cost max-flow method in time.
For the instance of the fair -median problem in general metrics, a solution of cost at most can be obtained in time .
∎
6 Conclusions
In this paper, we study the fair -median problem in doubling metrics and general metrics. We present a -approximation algorithm that runs in QPTAS. We further show that our algorithm works in a more general metrics and we present an -approximation algorithm that runs in time . Our two approximation algorithms for the fair -median problem are the first results for the corresponding problems without any fair violation, respectively.
References
- Abbasi et al. [2021] M. Abbasi, A. Bhaskara, and S. Venkatasubramanian. Fair clustering via equitable group representations. In Proc. 4th ACM Conference on Fairness, Accountability, and Transparency, pages 504–514, 2021.
- Ahmadian et al. [2019] S. Ahmadian, A. Epasto, R. Kumar, and M. Mahdian. Clustering without over-representation. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 267–275, 2019.
- Ahmadian et al. [2020] S. Ahmadian, A. Epasto, M. Knittel, R. Kumar, M. Mahdian, B. Moseley, P. Pham, S. Vassilvitskii, and Y. Wang. Fair hierarchical clustering. In Proc. 34th International Conference on Neural Information Processing Systems, 2020.
- Backurs et al. [2019] A. Backurs, P. Indyk, K. Onak, B. Schieber, A. Vakilian, and T. Wagner. Scalable fair clustering. In Proc. 36th International Conference on Machine Learning, pages 405–413, 2019.
- Bartal and Gottlieb [2013] Y. Bartal and L.-A. Gottlieb. A linear time approximation scheme for euclidean tsp. In Proc. 54th Annual Symposium on Foundations of Computer Science, pages 698–706, 2013.
- Bartal et al. [2016] Y. Bartal, L.-A. Gottlieb, and R. Krauthgamer. The traveling salesman problem: low-dimensionality implies a polynomial time approximation scheme. SIAM Journal on Computing, 45(4):1563–1581, 2016.
- Behsaz et al. [2019] B. Behsaz, Z. Friggstad, M. R. Salavatipour, and R. Sivakumar. Approximation algorithms for min-sum -clustering and balanced -median. Algorithmica, 81(3):1006–1030, 2019.
- Bera et al. [2019] S. K. Bera, D. Chakrabarty, N. Flores, and M. Negahbani. Fair algorithms for clustering. In Proc. 33rd International Conference on Neural Information Processing Systems, pages 4955–4966, 2019.
- Bercea et al. [2019] I. O. Bercea, M. Groß, S. Khuller, A. Kumar, C. Rösner, D. R. Schmidt, and M. Schmidt. On the cost of essentially fair clusterings. In Proc. 22nd International Conference on Approximation Algorithms for Combinatorial Optimization Problems and 23rd International Conference on Randomization and Computation, pages 18:1–18:22, 2019.
- Chen et al. [2019] X. Chen, B. Fain, L. Lyu, and K. Munagala. Proportionally fair clustering. In Proc. 36th International Conference on Machine Learning, pages 1032–1041, 2019.
- Chierichetti et al. [2017] F. Chierichetti, R. Kumar, S. Lattanzi, and S. Vassilvitskii. Fair clustering through fairlets. In Proc. 31st International Conference on Neural Information Processing Systems, pages 5029–5037, 2017.
- Cohen-Addad et al. [2019] V. Cohen-Addad, A. E. Feldmann, and D. Saulpic. Near-linear time approximations schemes for clustering in doubling metrics. In Proc. 60th Annual Symposium on Foundations of Computer Science, pages 540–559, 2019.
- Esmaeili et al. [2020] S. Esmaeili, B. Brubach, L. Tsepenekas, and J. Dickerson. Probabilistic fair clustering. Proc. 34th International Conference on Neural Information Processing Systems, 33, 2020.
- Fakcharoenphol et al. [2004] J. Fakcharoenphol, S. Rao, and K. Talwar. A tight bound on approximating arbitrary metrics by tree metrics. Journal of Computer and System Sciences, 69(3):485–497, 2004.
- Ghadiri et al. [2021] M. Ghadiri, S. Samadi, and S. Vempala. Socially fair -means clustering. In Proc. 4th ACM Conference on Fairness, Accountability, and Transparency, pages 438–448, 2021.
- Harb and Lam [2020] E. Harb and H. S. Lam. KFC: A scalable approximation algorithm for -center fair clustering. In Proc. 34th International Conference on Neural Information Processing Systems, 2020.
- Kuhn [1955] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
- Munkres [1957] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the society for industrial and applied mathematics, 5(1):32–38, 1957.
- Talwar [2004] K. Talwar. Bypassing the embedding: algorithms for low dimensional metrics. In Proc. 36th Annual ACM symposium on Theory of computing, pages 281–290, 2004.