Fair Hierarchical Clustering
Abstract
As machine learning has become more prevalent, researchers have begun to recognize the necessity of ensuring that machine learning systems are fair. Recently, there has been interest in defining notions of fairness that mitigate over-representation in traditional clustering.
In this paper we extend this notion to hierarchical clustering, where the goal is to recursively partition the data to optimize a specific objective. For various natural objectives, we obtain simple, efficient algorithms to find a provably good fair hierarchical clustering. Empirically, we show that our algorithms can find a fair hierarchical clustering, with only a negligible loss in the objective.
1 Introduction
Algorithms and machine-learned models are increasingly used to assist in decision making on a wide range of issues, from mortgage approval to court sentencing recommendations [26]. It is clearly undesirable, and in many cases illegal, for models to be biased against particular groups, for instance to discriminate on the basis of race or religion. Ensuring that there is no bias is not as easy as removing these protected categories from the data. Even without them being explicitly listed, the correlation between sensitive features and the rest of the training data may still cause the algorithm to be biased. This has led to an emergent literature on computing provably fair outcomes (see the book [6]).
The prominence of clustering in data analysis, combined with its use for data segmentation, feature engineering, and visualization, makes it critical that efficient fair clustering methods are developed. There has been a flurry of recent results in the ML research community proposing algorithms for fair flat clustering, i.e., partitioning a dataset into a set of disjoint clusters, as captured by the k-center, k-median, k-means, and correlation clustering objectives [2, 3, 5, 7, 8, 13, 17, 23, 24, 27, 28]. However, the same issues affect hierarchical clustering, which is the problem we study.
The input to the hierarchical clustering problem is a set of data points, with pairwise similarity or dissimilarity scores. A hierarchical clustering is a tree, whose leaves correspond to the individual datapoints. Each internal node represents a cluster containing all the points in the leaves of its subtree. Naturally, the cluster at an internal node is the union of the clusters given by its children. Hierarchical clustering is widely used in data analysis [20], social network analysis [29, 31], and image/text organization [25].
Hierarchical clustering is frequently used for flat clustering when the number of clusters is unknown ahead of time. A hierarchical clustering yields a set of clusterings at different granularities that are consistent with each other. Therefore, in all clustering problems where fairness is desired but the number of clusters is unknown, fair hierarchical clustering is useful. As concrete examples, consider a collection of news articles organized by a topic hierarchy, where we wish to ensure that no single source or viewpoint is over-represented in a cluster; or a hierarchical division of a geographic area, where the sensitive attribute is gender or race, and we wish to ensure balance in every level of the hierarchy. There are many such problems that benefit from fair hierarchical clustering, motivating the study of the problem area.
Our contributions. We initiate an algorithmic study of fair hierarchical clustering. We build on Dasgupta’s seminal formal treatment of hierarchical clustering [19] and prove our results for the revenue [30], value [18], and cost [19] objectives in his framework.
To achieve fairness, we show how to extend the fairlets machinery, introduced by [15] and extended by [2], to this problem. We then investigate the complexity of finding a good fairlet decomposition, giving both strong computational lower bounds and polynomial time approximation algorithms.
Finally, we conclude with an empirical evaluation of our approach. We show that ignoring protected attributes when performing hierarchical clustering can lead to unfair clusters. On the other hand, adopting the fairlet framework in conjunction with the approximation algorithms we propose yields fair clusters with a negligible objective degradation.
Related work. Hierarchical clustering has received increased attention over the past few years. Dasgupta [19] developed a cost function objective for data sets with similarity scores, where similar points are encouraged to be clustered together lower in the tree. Cohen-Addad et al. [18] generalized these results into a class of optimization functions that possess other desirable properties and introduced their own value objective in the dissimilarity score context. In addition to validating their objective on inputs with known ground truth, they gave a theoretical justification for the average-linkage algorithm, one of the most popular algorithms used in practice, as a constant-factor approximation for value. Contemporaneously, Moseley and Wang [30] designed a revenue objective function based on the work of Dasgupta for point sets with similarity scores and showed the average-linkage algorithm is a constant approximation for this objective as well. This work was further improved by Charikar [12] who gave a tighter analysis of average-linkage for Euclidean data for this objective and [1, 4] who improved the approximation ratio in the general case.
In parallel to the new developments in algorithms for hierarchical clustering, there has been tremendous development in the area of fair machine learning. We refer the reader to a recent textbook [6] for a rich overview, and focus here on progress for fair clustering. Chierichetti et al. [15] first defined fairness for k-median and k-center clustering, and introduced the notion of fairlets to design efficient algorithms. Extensive research has focused on two topics: adapting the definition of fairness to broader contexts, and designing efficient algorithms for finding good fairlet decompositions. For the first topic, the fairness definition was extended to multiple values for the protected feature [2, 8, 32]. For the second topic, Backurs et al. [5] proposed a near-linear constant approximation algorithm for finding fairlets for k-median, Kleindessner et al. [27] designed a linear time constant approximation algorithm for k-center, Bercea et al. [8] developed methods for fair k-means, while Ahmadian et al. [3] defined approximation algorithms for fair correlation clustering. Concurrently with our work, Chhabra et al. [14] introduced a possible approach to ensuring fairness in hierarchical clustering. However, their fairness definition differs from ours (in particular, they do not ensure that all levels of the tree are fair), and the methods they introduce are heuristic, without formal fairness or quality guarantees.
2 Formulation
2.1 Objectives for hierarchical clustering
Let $G = (V, s)$ be an input instance, where $V$ is a set of $n$ data points, and $s : V \times V \to \mathbb{R}_{\ge 0}$ is a similarity function over vertex pairs. For two sets $A, B \subseteq V$, we let $s(A, B) = \sum_{a \in A, b \in B} s(a, b)$. For problems where the input is $G = (V, d)$, with $d$ a distance function, we define $d(A, B)$ similarly. We also consider the vertex-weighted versions of the problem, i.e., $G = (V, s, m)$ (or $G = (V, d, m)$), where $m : V \to \mathbb{Z}_{+}$ is a weight function on the vertices. The vertex-unweighted version can be interpreted as setting $m(v) = 1$ for all $v \in V$. For $U \subseteq V$, we use the notation $m(U) = \sum_{u \in U} m(u)$.
A hierarchical clustering of $G$ is a tree whose leaves correspond to $V$ and whose internal vertices represent the merging of vertices (or clusters) into larger clusters until all data merges at the root. The goal of hierarchical clustering is to build a tree to optimize some objective.
To define these objectives formally, we need some notation. Let $T$ be a hierarchical clustering tree of $G$. For two leaves $i$ and $j$, we write $i \vee j$ for their least common ancestor. For an internal vertex $u$ in $T$, let $T[u]$ be the subtree of $T$ rooted at $u$. Let $\mathrm{leaves}(T[u])$ be the leaves of $T[u]$.
We consider three different objectives—revenue, value, and cost—based on the seminal framework of [19], and generalize them to the vertex-weighted case.
Revenue. Moseley and Wang [30] introduced the revenue objective for hierarchical clustering. Here the input instance is of the form $(V, s)$, where $s : V \times V \to \mathbb{R}_{\ge 0}$ is a similarity function.
Definition 1 (Revenue).
The revenue $\mathrm{rev}_G(T)$ of a tree $T$ for an instance $G = (V, s, m)$, where $s(i, j)$ denotes the similarity between data points $i$ and $j$, is:
$$\mathrm{rev}_G(T) = \sum_{\{i, j\} \subseteq V} m\big(V \setminus \mathrm{leaves}(T[i \vee j])\big) \cdot s(i, j).$$
Note that in this definition, each similarity $s(i, j)$ is scaled by the total vertex weight of the non-leaves of $T[i \vee j]$. The goal is to find a tree of maximum revenue. It is known that average-linkage is a $\frac{1}{3}$-approximation for vertex-unweighted revenue [30]; the state-of-the-art is a $0.585$-approximation [4].
As part of the analysis, there is an upper bound for the revenue objective [18, 30], which is easily extended to the vertex-weighted setting:
$$\mathrm{rev}_G(T) \le \sum_{\{i, j\} \subseteq V} \big(m(V) - m(i) - m(j)\big) \cdot s(i, j). \qquad (1)$$
Note that in the vertex-unweighted case, the upper bound is just $(n - 2) \sum_{\{i, j\} \subseteq V} s(i, j)$.
Value. A different objective was proposed by Cohen-Addad et al. [18], using distances instead of similarities. Let $G = (V, d, m)$, where $d$ is a distance (or dissimilarity) function.
Definition 2 (Value).
The value $\mathrm{val}_G(T)$ of a tree $T$ for an instance $G = (V, d, m)$, where $d(i, j)$ denotes the distance between data points $i$ and $j$, is:
$$\mathrm{val}_G(T) = \sum_{\{i, j\} \subseteq V} m\big(\mathrm{leaves}(T[i \vee j])\big) \cdot d(i, j).$$
As with revenue, we aim to find a hierarchical clustering that maximizes value. Cohen-Addad et al. [18] showed that average-linkage and the $\epsilon$-locally-densest cut algorithm achieve $\frac{2}{3}$- and $(\frac{2}{3} - \epsilon)$-approximations, respectively, for vertex-unweighted value. They also provided an upper bound for value, much like that in (1), which in the vertex-weighted context is:
$$\mathrm{val}_G(T) \le m(V) \sum_{\{i, j\} \subseteq V} d(i, j). \qquad (2)$$
Cost. The original objective, introduced by Dasgupta [19] for analyzing hierarchical clustering algorithms, is cost.
Definition 3 (Cost).
The cost $\mathrm{cost}_G(T)$ of a tree $T$ for an instance $G = (V, s)$, where $s(i, j)$ denotes the similarity between data points $i$ and $j$, is:
$$\mathrm{cost}_G(T) = \sum_{\{i, j\} \subseteq V} \big|\mathrm{leaves}(T[i \vee j])\big| \cdot s(i, j).$$
The objective is to find a tree of minimum cost. From a complexity point of view, cost is a harder objective to optimize. Charikar and Chatziafratis [11] showed that cost is not constant-factor approximable under the Small Set Expansion hypothesis, and the current best approximations are $O(\sqrt{\log n})$ and require solving SDPs.
Convention. Throughout the paper we adopt the following convention: $s$ will always denote similarities and $d$ will always denote distances. Thus, the inputs for the cost and revenue objectives will be instances of the form $(V, s)$, and inputs for the value objective will be instances of the form $(V, d)$. All the missing proofs can be found in the Supplementary Material.
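To make the three objectives concrete, the following minimal sketch (our own helper names and tree representation, not code from the paper) evaluates them on an unweighted instance given as a nested-tuple binary tree with similarity/distance dictionaries.

```python
# Sketch: evaluating the three objectives on an unweighted instance. A tree is a
# nested tuple of point indices, e.g. T = ((0, 1), (2, 3)); `sim` and `dist` map
# frozenset({i, j}) -> score.

def leaves(tree):
    """Set of leaf indices under a (sub)tree."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def pair_terms(tree):
    """Yield (i, j, k) with k = |leaves(T[i v j])| for every pair whose LCA is this node."""
    if isinstance(tree, int):
        return
    left, right = tree
    L, R = leaves(left), leaves(right)
    k = len(L) + len(R)
    for i in L:
        for j in R:
            yield i, j, k
    yield from pair_terms(left)
    yield from pair_terms(right)

def revenue(tree, sim, n):
    # Moseley-Wang: sum of s(i, j) * (n - |leaves(T[i v j])|) over pairs.
    return sum(sim[frozenset((i, j))] * (n - k) for i, j, k in pair_terms(tree))

def value(tree, dist):
    # Cohen-Addad et al.: sum of d(i, j) * |leaves(T[i v j])| over pairs.
    return sum(dist[frozenset((i, j))] * k for i, j, k in pair_terms(tree))

def cost(tree, sim):
    # Dasgupta: sum of s(i, j) * |leaves(T[i v j])| over pairs.
    return sum(sim[frozenset((i, j))] * k for i, j, k in pair_terms(tree))
```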
2.2 Notions of fairness
Many definitions have been proposed for fairness in clustering. We consider the setting in which each data point in $V$ has a color; the color corresponds to the protected attribute.
Disparate impact. This notion is used to capture the fact that decisions (i.e., clusterings) should not be overly favorable to one group versus another. This notion was formalized by Chierichetti et al. [15] for clustering when the protected attribute can take on one of two values, i.e., points have one of two colors. In their setup, the balance of a cluster is the ratio of the minimum to the maximum number of points of any color in the cluster. Given a balance requirement $t$, a clustering is fair if and only if each cluster has a balance of at least $t$.
Bounded representation. A generalization of disparate impact, bounded representation focuses on mitigating the imbalance of the representation of protected classes (i.e., colors) in clusters and was defined by Ahmadian et al. [2]. Given an over-representation parameter $\alpha$, a cluster is fair if the fractional representation of each color in the cluster is at most $\alpha$, and a clustering is fair if each cluster has this property. An interesting special case of this notion is when there are $c$ total colors and $\alpha = 1/c$. In this case, we require that every color is equally represented in every cluster. We will refer to this as equal representation. These notions enjoy the following useful property:
Definition 4 (Union-closed).
A fairness constraint is union-closed if for any pair of fair clusters $A$ and $B$, the cluster $A \cup B$ is also fair.
This property is useful in hierarchical clustering: given a tree $T$ and an internal node $u$, if each child cluster of $u$ is fair, then the cluster at $u$ must also be fair.
Definition 5 (Fair hierarchical clustering).
For any fairness constraint, a hierarchical clustering is fair if all of its clusters (besides the leaves) are fair.
Thus, under any union-closed fairness constraint, this definition is equivalent to restricting the bottom-most clustering (besides the leaves) to be fair. Then, given an objective (e.g., revenue), the goal is to find a fair hierarchical clustering that optimizes the objective. We focus on the bounded representation fairness notion with $c$ colors and an over-representation cap $\alpha$. However, the main ideas for the revenue and value objectives work under any notion of fairness that is union-closed.
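As a small illustration (a sketch with assumed helper names, not taken from the paper), union-closedness means that verifying fairness of a hierarchy reduces to checking its bottom-most clusters under the bounded representation constraint:

```python
# Sketch: checking the alpha-capped (bounded representation) constraint.
from collections import Counter

def is_fair(cluster, alpha):
    """A cluster (list of colors) is fair if no color exceeds an alpha fraction."""
    if not cluster:
        return True
    return max(Counter(cluster).values()) <= alpha * len(cluster)

def clustering_is_fair(clusters, alpha):
    # Under a union-closed constraint it suffices to check the bottom-most clustering:
    # every internal node of the tree is a union of these clusters, hence also fair.
    return all(is_fair(c, alpha) for c in clusters)

# With alpha = 1/2 and two colors, a fair cluster needs at least as many minority
# points as majority points.
assert is_fair(["red", "blue", "red", "blue"], 0.5)
assert not is_fair(["red", "red", "blue"], 0.5)
```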
3 Fairlet decomposition
Definition 6 (Fairlet [15]).
A fairlet $Y$ is a fair set of points such that there is no partition of $Y$ into $Y_1$ and $Y_2$ with both $Y_1$ and $Y_2$ being fair.
In the bounded representation fairness setting, a set of points is fair if at most an $\alpha$ fraction of the points have the same color. We call a fairlet under this constraint an $\alpha$-capped fairlet. For $\alpha = 1/t$ with $t$ an integer, the fairlet size will always be at most $2t - 1$. We will denote the maximum size of a fairlet by $m_f$.
Recall that given a union-closed fairness constraint, if the bottom clustering in the tree is a layer of fairlets (which we call a fairlet decomposition of the original dataset) the hierarchical clustering tree is also fair. This observation gives an immediate algorithm for finding fair hierarchical clustering trees in a two-phase manner. (i) Find a fairlet decomposition, i.e., partition the input set into clusters that are all fairlets. (ii) Build a tree on top of all the fairlets. Our goal is to complete both phases in such a way that we optimize the given objective (i.e., revenue or value).
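The sketch below outlines this two-phase pipeline; the pluggable decomposition routine, the use of fairlet centroids as representatives, and the reliance on SciPy's average-linkage are simplifying assumptions on our part (the analysis in later sections works with vertex-weighted instances rather than centroids).

```python
# Two-phase sketch: (i) decompose into fairlets, (ii) cluster fairlet representatives.
import numpy as np
from scipy.cluster.hierarchy import linkage

def fair_hierarchical_clustering(points, colors, alpha, decompose_fn):
    # Phase (i): partition the points into alpha-capped fairlets.
    fairlets = decompose_fn(points, colors, alpha)      # list of lists of indices

    # Phase (ii): build a tree on one representative per fairlet, then hang each
    # fairlet's points as leaves under its representative.
    reps = np.array([points[f].mean(axis=0) for f in fairlets])
    tree_on_fairlets = linkage(reps, method="average")  # (k-1) x 4 merge matrix
    return fairlets, tree_on_fairlets
```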
In Section 4, we will see that to optimize for the revenue objective, all we need is a fairlet decomposition with bounded fairlet size. However, the fairlet decomposition required for the value objective is more nuanced. We describe this next.
Fairlet decomposition for the value objective
For the value objective, we need the total distance between pairs of points inside each fairlet to be small. Formally, suppose $V$ is partitioned into fairlets $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ such that each $Y_i$ is an $\alpha$-capped fairlet. The cost of this decomposition is defined as:
$$\phi(\mathcal{Y}) = \sum_{Y \in \mathcal{Y}} \; \sum_{\{u, v\} \subseteq Y} d(u, v). \qquad (3)$$
Unfortunately, the problem of finding a fairlet decomposition that minimizes $\phi(\mathcal{Y})$ does not admit any constant-factor approximation unless P = NP.
Theorem 7.
Let $t \ge 3$ be an integer. Then there is no polynomial-time algorithm with a bounded approximation factor for finding $\alpha$-capped fairlets optimizing $\phi(\mathcal{Y})$, unless P = NP.
The proof proceeds by a reduction from the Triangle Partition problem, which asks whether a graph on $3n$ vertices can be partitioned into three-element sets, with each set forming a triangle in the graph. Fortunately, for the purpose of optimizing the value objective, it is not necessary to find an approximately optimal decomposition.
4 Optimizing revenue with fairness
This section considers the revenue objective. We will obtain an approximation algorithm for this objective in three steps: (i) obtain a fairlet decomposition in which the maximum fairlet size is small, (ii) show that any $\beta$-approximation algorithm to (1), combined with this fairlet decomposition, can be used to obtain a (roughly) $\beta$-approximation for fair hierarchical clustering under the revenue objective, and (iii) use average-linkage, which is known to be a $\frac{1}{3}$-approximation to (1). (We note that the recent work [1, 4] on improved approximation algorithms compares to a bound on the optimal solution that differs from (1) and therefore does not fit into our framework.)
First, we address step (ii). Due to space, this proof can be found in Appendix B .
Theorem 8.
Given an algorithm that obtains a $\beta$-approximation to (1) where $\beta \le 1$, and a fairlet decomposition with maximum fairlet size $m_f$, there is a $\beta\,\frac{n - 2 m_f}{n - 2}$-approximation for fair hierarchical clustering under the revenue objective.
Prior work showed that average-linkage is a $\frac{1}{3}$-approximation to (1) in the vertex-unweighted case; this proof can be easily modified to show that it remains a $\frac{1}{3}$-approximation even with vertex weights. This accounts for step (iii) in our process.
Combined with the fairlet decomposition methods for the two-color case [15] and for the multi-color case (Supplementary Material) to address step (i), we have the following.
Corollary 9.
There is a polynomial-time algorithm that constructs a fair tree that is a $\frac{1}{3}\,\frac{n - 2 m_f}{n - 2}$-approximation for the revenue objective, where $m_f$ is the maximum size of a fairlet.
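For step (ii), the reduction collapses each fairlet into a single weighted point. A sketch of that transformation (our own helper, assuming a plain numpy similarity matrix) is below.

```python
# Sketch: collapse a fairlet decomposition into the weighted instance used in the
# reduction. Each fairlet becomes one point whose weight is the fairlet size and
# whose similarity to another fairlet is the total cross-similarity.
import numpy as np

def collapse_fairlets(sim, fairlets):
    k = len(fairlets)
    weights = np.array([len(f) for f in fairlets])
    merged = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            merged[a, b] = merged[b, a] = sim[np.ix_(fairlets[a], fairlets[b])].sum()
    return merged, weights
```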
5 Optimizing value with fairness
In this section we consider the value objective. As in the revenue objective, we prove that we can reduce fair hierarchical clustering to the problem of finding a good fairlet decomposition for the proposed fairlet objective (3), and then use any approximation algorithm for weighted hierarchical clustering with the decomposition as the input.
Theorem 10.
Given an algorithm that obtains a $\beta$-approximation to the upper bound (2) on vertex-weighted instances, and a fairlet decomposition $\mathcal{Y}$ with $\phi(\mathcal{Y}) \le \epsilon \, d(V, V)$, there is a $\beta(1 - \epsilon)$-approximation for fair hierarchical clustering under the value objective.
We complement this result with an algorithm that finds a good fairlet decomposition in polynomial time under the bounded representation fairness constraint with cap $\alpha$.
Let $\{1, \ldots, c\}$ be the set of colors and let $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ be the fairlet decomposition. Let $n_\ell$ be the number of points colored $\ell$ in $V$. Let $n_{\ell, i}$ denote the number of points colored $\ell$ in the $i$th fairlet.
Theorem 11.
There exists a local search algorithm that finds a fairlet decomposition $\mathcal{Y}$ with $\phi(\mathcal{Y}) \le (1 + \epsilon) \sum_{\ell} \frac{\max_i n_{\ell, i}}{n_\ell} \, d(V, V)$ in time $\tilde{O}(n^3 / \epsilon)$.
We can now use the fact that average-linkage and the $\epsilon'$-locally-densest cut algorithm give a $\frac{2}{3}$- and a $(\frac{2}{3} - \epsilon')$-approximation, respectively, for vertex-weighted hierarchical clustering under the value objective. Finally, recall that fairlets are intended to be minimal, and their size depends only on the parameter $\alpha$, not on the size of the original input. Therefore, as long as the number of points of each color grows with the input size $n$, the ratio $\sum_\ell \max_i n_{\ell, i} / n_\ell$ goes to $0$. These results, combined with Theorem 10 and Theorem 11, yield Corollary 12.
Corollary 12.
Given bounded-size fairlets, the fairlet decomposition computed by local search combined with average-linkage constructs a fair hierarchical clustering that is a $\frac{2}{3}(1 - o(1))$-approximation for the value objective. For the $\epsilon'$-locally-densest cut algorithm in [18], we get a polynomial-time algorithm for fair hierarchical clustering that is a $(\frac{2}{3} - \epsilon)(1 - o(1))$-approximation under the value objective for any $\epsilon > 0$.
Provided that any single fairlet contains only a small fraction of the points of each color, Corollary 12 states that we can extend the state-of-the-art results for value to the $\alpha$-capped, multi-colored constraint. Note that the preconditions are always satisfied, and the extension holds, in the two-color fairness setting and in the multi-colored equal representation fairness setting.
Fairlet decompositions via local search
In this section, we give a local search algorithm to construct a fairlet decomposition, which proves Theorem 11. This is inspired by the $\epsilon$-densest cut algorithm of [18]. To start, recall that for a pair of sets $A$ and $B$ we denote by $d(A, B)$ the sum of interpoint distances, $d(A, B) = \sum_{u \in A, v \in B} d(u, v)$. A fairlet decomposition $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ is a partition of the input $V$ such that each color composes at most an $\alpha$ fraction of each $Y_i$.
Our algorithm will recursively subdivide the cluster of all data to construct a hierarchy by finding cuts. To search for a cut, we will use a swap method.
Definition 13 (Local optimality).
Consider any fairlet decomposition $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$ and $\epsilon > 0$. Define a swap of $u \in Y_i$ and $v \in Y_j$ for $i \neq j$ as updating $Y_i$ to $(Y_i \setminus \{u\}) \cup \{v\}$ and $Y_j$ to $(Y_j \setminus \{v\}) \cup \{u\}$. We say $\mathcal{Y}$ is $\epsilon$-locally-optimal if any swap with $u, v$ of the same color reduces the objective value by less than a $(1 + \epsilon / n)$ factor.
The algorithm constructs an $\epsilon$-locally-optimal fairlet decomposition in polynomial time. Consider any given instance $(V, d)$. Let $d_{\max}$ denote the maximum pairwise distance and $m_f$ the maximum fairlet size. The algorithm begins with an arbitrary (fair) decomposition. Then it swaps pairs of same-colored points until it terminates with a locally optimal solution. By construction we have the following.
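A compact sketch of the swap search is below (the data representation, helper names, and the exact improvement threshold are our assumptions; Algorithm 1 additionally caches per-fairlet distances to achieve the running time stated in Lemma 16).

```python
# Sketch of the epsilon-locally-optimal swap search. `dist` is an n x n numpy array,
# `fairlets` a list of index lists forming a valid (fair) decomposition, `colors`
# an array of color labels.
import numpy as np

def fairlet_objective(dist, fairlets):
    # phi(Y): total intra-fairlet pairwise distance, i.e., objective (3).
    return sum(dist[np.ix_(f, f)].sum() / 2.0 for f in fairlets)

def find_improving_swap(dist, fairlets, colors, obj, eps, n):
    for a in range(len(fairlets)):
        for b in range(a + 1, len(fairlets)):
            for i in range(len(fairlets[a])):
                for j in range(len(fairlets[b])):
                    u, v = fairlets[a][i], fairlets[b][j]
                    if colors[u] != colors[v]:
                        continue                        # same-color swaps keep every fairlet fair
                    fairlets[a][i], fairlets[b][j] = v, u
                    new_obj = fairlet_objective(dist, fairlets)
                    if new_obj * (1 + eps / n) < obj:   # assumed improvement threshold
                        return new_obj                  # keep the swap
                    fairlets[a][i], fairlets[b][j] = u, v  # undo
    return None

def local_search(dist, fairlets, colors, eps):
    n = dist.shape[0]
    obj = fairlet_objective(dist, fairlets)
    while obj > 0:
        new_obj = find_improving_swap(dist, fairlets, colors, obj, eps, n)
        if new_obj is None:                             # epsilon-locally optimal
            break
        obj = new_obj
    return fairlets, obj
```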
Claim 14.
Algorithm 1 finds a valid fairlet decomposition.
We prove two things: that Algorithm 1 approximately optimizes the objective (3), and that it has a small running time. The following lemma bounds the value of (3) for the decomposition $\mathcal{Y}$ found by Algorithm 1.
Lemma 15.
Let $\mathcal{Y}$ be the fairlet decomposition computed by Algorithm 1. Then $\phi(\mathcal{Y}) \le (1 + \epsilon) \sum_{\ell} \frac{\max_i n_{\ell, i}}{n_\ell} \, d(V, V)$.
Finally we bound the running time. The algorithm has much better performance in practice than its worst-case analysis would indicate. We will show this later in Section 7.
Lemma 16.
The running time of Algorithm 1 is $O\!\left(\frac{n^3 \log n}{\epsilon}\right)$.
6 Optimizing cost with fairness
This section considers the cost objective of [19]. Even without our fairness constraint, cost is difficult to approximate: it has strong approximation hardness, and all known solutions require an LP or SDP solver. We obtain the result in Theorem 17; extending this result to other fairness constraints, improving its bound, or making the algorithm practical are open questions.
Theorem 17.
Consider the two-color case. Given a $\beta_{nf}$-approximation for cost and a $\beta_{bis}$-approximation for minimum weighted bisection (the problem of partitioning the nodes into two equal-sized subsets so that the sum of the weights of the edges crossing the partition is minimized) on inputs of size $n$, then for parameters $t$ and $\ell$ there is a fair $O\!\left(\beta_{nf}\left(\frac{n}{t} + \delta n + \ell t\right)\right)$-approximation for $\mathrm{cost}_G$, where $\delta$ is controlled by $\ell$ and $\beta_{bis}$ (Lemma 31).
With proper parameterization, we achieve an $O\big(n^{5/6} \log^{5/4} n\big)$-approximation. We defer our algorithm description, pseudocode, and proofs to the Supplementary Material. While our algorithm is not simple, it is an important (and non-obvious) step to show the existence of such an approximation, which we hope will spur future work in this area.
7 Experiments
This section validates our algorithms from Sections 4 and 5 empirically. We adopt the disparate impact fairness constraint [15]; thus each point is either blue or red. In particular, we would like to:
- Show that running the standard average-linkage algorithm results in highly unfair solutions.
- Demonstrate that demanding fairness in hierarchical clustering incurs only a small loss in the hierarchical clustering objective.
- Show that our algorithms, including fairlet decomposition, are practical on real data.
In Appendix G we consider multiple colors; the same trends as in the two-color case occur.
Table 1: Datasets used in the experiments.

| Name | Sample size | # features | Protected feature | Color (blue, red) |
|---|---|---|---|---|
| CensusGender | 30162 | 6 | gender | (female, male) |
| CensusRace | 30162 | 6 | race | (non-white, white) |
| BankMarriage | 45211 | 7 | marital status | (not married, married) |
| BankAge | 45211 | 7 | age | (, ) |
Table 2: $ratio_{value}$ for each dataset at initialization (initial) and after running the local search algorithm (final), for sample sizes 400, 800, 1600, 3200, 6400, and 12800.
Datasets. We use two datasets from the UCI data repository (archive.ics.uci.edu/ml/index.php; Census: archive.ics.uci.edu/ml/datasets/census+income, Bank: archive.ics.uci.edu/ml/datasets/Bank+Marketing). In each dataset, we use the features with numerical values and leave out samples with empty entries. For value, we use the Euclidean distance as the dissimilarity measure. For revenue, we set the similarity of a pair of points to be $s(i, j) = \frac{1}{1 + d(i, j)}$, where $d(i, j)$ is their Euclidean distance. We pick two different protected features for each dataset, resulting in four datasets in total (see Table 1 for details).
- Census dataset: We choose gender and race to be the protected features and call the resulting datasets CensusGender and CensusRace.
- Bank dataset: We choose marital status and age to be the protected features and call the resulting datasets BankMarriage and BankAge.
In this section, unless otherwise specified, we report results only for the value objective. Results for the revenue objective are qualitatively similar and are omitted here. We do not evaluate our algorithm for the cost objective since it is currently only of theoretical interest.
We sub-sample points of two colors from the original data set proportionally, while approximately retaining the original color balance. The sample sizes used range from 100 to 12800, doubling each time. On each subsample, we run the experiment several times and report average results. We use the same setting of $\epsilon$ in Algorithm 1 in all of the experiments.
Implementation. The code is available in the Supplementary Material. In the experiments, we use Algorithm 1 for the fairlet decomposition phase, where the fairlet decomposition is initialized by randomly assigning red and blue points to each fairlet. We apply the average-linkage algorithm to create a tree on the fairlets. We further use average-linkage to create subtrees inside of each fairlet.
The algorithm selects a random pair of blue or red points in different fairlets to swap, and checks if the swap sufficiently improves the objective. We do not run the algorithm until all the pairs are checked; rather, the algorithm stops once it has made a certain number of failed attempts to swap a random pair. As we observe empirically, this does not have a material effect on the quality of the overall solution.
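A sketch of this sampled variant is below (our paraphrase; the failure budget and helper names are assumptions, and `fairlet_objective` is the helper from the sketch in Section 5).

```python
# Sampled swap search: draw random candidate pairs and stop after a budget of
# consecutive failed swap attempts.
import random

def sampled_local_search(dist, fairlets, colors, eps, max_failures=1000):
    n = dist.shape[0]
    obj = fairlet_objective(dist, fairlets)   # intra-fairlet distance, objective (3)
    failures = 0
    while failures < max_failures and obj > 0:
        a, b = random.sample(range(len(fairlets)), 2)
        i, j = random.randrange(len(fairlets[a])), random.randrange(len(fairlets[b]))
        u, v = fairlets[a][i], fairlets[b][j]
        if colors[u] != colors[v]:
            failures += 1
            continue
        fairlets[a][i], fairlets[b][j] = v, u
        new_obj = fairlet_objective(dist, fairlets)
        if new_obj * (1 + eps / n) < obj:
            obj, failures = new_obj, 0        # successful swap: reset the budget
        else:
            fairlets[a][i], fairlets[b][j] = u, v
            failures += 1
    return fairlets, obj
```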
Table 3: $ratio_{fairlets}$ for the initial random fairlet decomposition (initial) and the final decomposition found by local search (final).

| Samples | 100 | 200 | 400 | 800 | 1600 | 3200 | 6400 | 12800 |
|---|---|---|---|---|---|---|---|---|
| CensusGender, initial | 2.5e-2 | 1.2e-2 | 6.2e-3 | 3.0e-3 | 1.5e-3 | 7.5e-4 | 3.8e-4 | 1.9e-4 |
| CensusGender, final | 4.9e-3 | 1.4e-3 | 6.9e-4 | 2.5e-4 | 8.5e-5 | 3.6e-5 | 1.8e-5 | 8.0e-6 |
| CensusRace, initial | 6.6e-2 | 3.4e-2 | 1.7e-2 | 8.4e-3 | 4.2e-3 | 2.1e-3 | 1.1e-3 | 5.3e-4 |
| CensusRace, final | 2.5e-2 | 1.2e-2 | 6.2e-3 | 3.0e-3 | 1.5e-3 | 7.5e-4 | 3.8e-4 | 1.9e-5 |
| BankMarriage, initial | 1.7e-2 | 8.2e-3 | 4.0e-3 | 2.0e-3 | 1.0e-3 | 5.0e-4 | 2.5e-4 | 1.3e-4 |
| BankMarriage, final | 5.9e-3 | 2.1e-3 | 9.3e-4 | 4.1e-4 | 1.3e-4 | 7.1e-5 | 3.3e-5 | 1.4e-5 |
| BankAge, initial | 1.3e-2 | 7.4e-3 | 3.5e-3 | 1.9e-3 | 9.3e-4 | 4.7e-4 | 2.3e-4 | 1.2e-4 |
| BankAge, final | 5.0e-3 | 2.2e-3 | 7.0e-4 | 3.7e-4 | 1.3e-4 | 5.7e-5 | 3.0e-5 | 1.4e-5 |
Figure 1: (i) the fairlet objective and (ii) the value ratio over the course of the local search swaps on a Census subsample; (iii) running time of the fair and standard average-linkage algorithms as the sample size grows.
Metrics. We present results for value here; the results for revenue are qualitatively similar. In our experiments, we track the following quantities. Let $G = (V, d)$ be the given input instance and let $T$ be the output of our fair hierarchical clustering algorithm. We consider the ratio $ratio_{value} = \mathrm{val}_G(T) / \mathrm{val}_G(T_{AL})$, where $T_{AL}$ is the tree obtained by the standard (unfair) average-linkage algorithm. We also consider the fairlet objective $\phi(\mathcal{Y})$, where $\mathcal{Y}$ is a fairlet decomposition, and track $ratio_{fairlets} = \phi(\mathcal{Y}) / d(V, V)$.
Results. The average-linkage algorithm always constructs unfair trees. On each of the datasets, the algorithm produces monochromatic clusters at some level, strengthening the case for fair algorithms.
In Table 2, we show for each dataset the $ratio_{value}$ both at the time of initialization (initial) and after running the local search algorithm (final). We see the change in the ratio as the local search algorithm performs swaps. Fairness leads to almost no degradation in the objective value as the swaps increase. Table 3 shows the $ratio_{fairlets}$ for the initial random fairlets and the final output fairlets. As we see, Algorithm 1 significantly improves the fairlet objective of the initial random fairlet decomposition.
Table 4: Ratio of the revenue and value objectives achieved by our fair algorithm to the corresponding upper bounds on the optimum, for each dataset (CensusGender, CensusRace, BankMarriage, BankAge).
The more the locally-optimal algorithm improves the objective value of (3), the better the fairlet-based tree performs. Figures 1(i) and 1(ii) show $ratio_{fairlets}$ and $ratio_{value}$, respectively, over the course of the swaps performed by Algorithm 1 on a subsample of the Census data set. The plots show that as the fairlet objective value decreases, the value objective of the resulting fair tree increases. Such correlations are found on subsamples of all sizes.
Now we compare the objective value of the algorithm with the upper bound on the optimum. We report the results for both the revenue and value objectives, using the fairlets obtained by local search, in Table 4. On all datasets, we obtain ratios significantly better than the theoretical worst-case guarantee. In Figure 1(iii), we show the average running time on Census data for both the original average-linkage and the fair average-linkage algorithms. As the sample size grows, the running time scales almost as well as that of current implementations of the average-linkage algorithm. Thus, with a modest increase in time, we can obtain a fair hierarchical clustering under the value objective.
8 Conclusions
In this paper we extended the notion of fairness to the classical problem of hierarchical clustering under three different objectives (revenue, value, and cost). Our results show that the revenue and value objectives can be optimized with fairness at a negligible loss, while optimizing cost appears to be more challenging.
Our work raises several questions and research directions. Can the approximations be improved? Can we find better upper and lower bounds for fair cost? Are there other important fairness criteria?
References
- [1] Sara Ahmadian, Vaggos Chatziafratis, Alessandro Epasto, Euiwoong Lee, Mohammad Mahdian, Konstantin Makarychev, and Grigory Yaroslavtsev. Bisect and conquer: Hierarchical clustering via max-uncut bisection. In AISTATS, 2020.
- [2] Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Clustering without over-representation. In KDD, pages 267–275, 2019.
- [3] Sara Ahmadian, Alessandro Epasto, Ravi Kumar, and Mohammad Mahdian. Fair correlation clustering. In AISTATS, 2020.
- [4] Noga Alon, Yossi Azar, and Danny Vainstein. Hierarchical clustering: a 0.585 revenue approximation. In COLT, 2020.
- [5] Arturs Backurs, Piotr Indyk, Krzysztof Onak, Baruch Schieber, Ali Vakilian, and Tal Wagner. Scalable fair clustering. In ICML, pages 405–413, 2019.
- [6] Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning. www.fairmlbook.org, 2019.
- [7] Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani. Fair algorithms for clustering. In NeurIPS, pages 4955–4966, 2019.
- [8] Ioana O Bercea, Martin Groß, Samir Khuller, Aounon Kumar, Clemens Rösner, Daniel R Schmidt, and Melanie Schmidt. On the cost of essentially fair clusterings. In APPROX-RANDOM, pages 18:1–18:22, 2019.
- [9] L. Elisa Celis, Lingxiao Huang, and Nisheeth K. Vishnoi. Multiwinner voting with fairness constraints. In IJCAI, pages 144–151, 2018.
- [10] L. Elisa Celis, Damian Straszak, and Nisheeth K. Vishnoi. Ranking with fairness constraints. In ICALP, pages 28:1–28:15, 2018.
- [11] Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In SODA, pages 841–854, 2017.
- [12] Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In SODA, pages 2291–2304, 2019.
- [13] Xingyu Chen, Brandon Fain, Charles Lyu, and Kamesh Munagala. Proportionally fair clustering. In ICML, pages 1032–1041, 2019.
- [14] Anshuman Chhabra and Prasant Mohapatra. Fair algorithms for hierarchical agglomerative clustering. arXiv:2005.03197, 2020.
- [15] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029–5037, 2017.
- [16] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Matroids, matchings, and fairness. In AISTATS, pages 2212–2220, 2019.
- [17] Ashish Chiplunkar, Sagar Kale, and Sivaramakrishnan Natarajan Ramamoorthy. How to solve fair k-center in massive data models. In ICML, 2020.
- [18] Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. In SODA, pages 378–397, 2018.
- [19] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In STOC, pages 118–127, 2016.
- [20] Richard Dubes and Anil K. Jain. Clustering methodologies in exploratory data analysis. In Advances in Computers, volume 19, pages 113–228. Academic Press, 1980.
- [21] Uriel Feige and Robert Krauthgamer. A polylogarithmic approximation of the minimum bisection. In FOCS, pages 105–115, 2000.
- [22] Michael R Garey and David S Johnson. Computers and Intractability. WH Freeman New York, 2002.
- [23] Lingxiao Huang, Shaofeng H. C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019.
- [24] Matthew Jones, Thy Nguyen, and Huy Nguyen. Fair k-centers via maximum matching. In ICML, 2020.
- [25] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. In Text Mining Workshop at KDD, 2000.
- [26] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2017.
- [27] Matthäus Kleindessner, Pranjal Awasthi, and Jamie Morgenstern. Fair k-center clustering for data summarization. In ICML, pages 3448–3457, 2019.
- [28] Matthäus Kleindessner, Samira Samadi, Pranjal Awasthi, and Jamie Morgenstern. Guarantees for spectral clustering with fairness constraints. In ICML, pages 3448–3457, 2019.
- [29] Charles F Mann, David W Matula, and Eli V Olinick. The use of sparsest cuts to reveal the hierarchical community structure of social networks. Social Networks, 30(3):223–234, 2008.
- [30] Benjamin Moseley and Joshua Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In NIPS, pages 3094–3103, 2017.
- [31] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2011.
- [32] Clemens Rösner and Melanie Schmidt. Privacy preserving clustering with constraints. In ICALP, pages 96:1–96:14, 2018.
Appendix
Appendix A Approximation algorithms for weighted hierarchical clustering
In this section we first prove that running a constant-factor approximation algorithm on the fairlets gives a good solution for the value objective, and then give constant-factor approximation algorithms for both the revenue and value objectives in the weighted hierarchical clustering problem, as mentioned in Corollaries 9 and 12: a weighted version of average-linkage (for both the weighted revenue and value objectives), and a weighted $\epsilon$-locally-densest cut algorithm (for the weighted value objective). Both proofs are easily adapted from the original proofs in [18] and [30].
A.1 Running constant-approximation algorithms on fairlets
In this section, we prove Theorem 10, which says that if we run any $\beta$-approximation algorithm for the upper bound (2) on weighted value on the fairlet decomposition, we get a fair tree with minimal loss in the approximation ratio. For the remainder of this section, fix any hierarchical clustering algorithm $\mathcal{A}$ that is guaranteed, on any weighted input, to construct a hierarchical clustering whose value is at least a $\beta$ fraction of the upper bound (2). Recall that we extended the value objective to a weighted variant in Section 2. Our aim is to show that we can combine $\mathcal{A}$ with the fairlet decomposition introduced in the prior section to get a fair hierarchical clustering that is a $\beta(1 - \epsilon)$-approximation for the value objective if $\phi(\mathcal{Y}) \le \epsilon \, d(V, V)$.
In the following definition, we transform the point set to a new set of points that are weighted. We will analyze $\mathcal{A}$ on this new set of points. We then show how we can relate this to the objective value of the optimal tree on the original set of points.
Definition 18.
Let $\mathcal{Y} = \{Y_1, \ldots, Y_k\}$ be the fairlet decomposition for $V$ that is produced by the local search algorithm. Define the weighted instance $(V', d, m)$ as follows:
- Each set $Y_i$ has a corresponding point $y_i$ in $V'$.
- The weight $m(y_i)$ of $y_i$ is set to $|Y_i|$.
- For each pair $y_i, y_j$ with $i \neq j$, $d(y_i, y_j) = d(Y_i, Y_j)$.
We begin by observing that the objective value $\mathcal{A}$ achieves on the instance $(V', d, m)$ is large compared to the total distance of the original instance.
Theorem 19.
On the instance $(V', d, m)$, the algorithm $\mathcal{A}$ achieves a total weighted objective of at least $\beta (1 - \epsilon)$ times the upper bound (2) for the original instance.
Proof.
Notice that the total distance in $(V', d)$ equals the total distance in $(V, d)$ minus the distance inside the fairlets, i.e., minus the contribution of $\phi(\mathcal{Y})$. The upper bound (2) on the optimal solution for $(V', d, m)$ is $m(V')$ times this total distance. Since $\phi(\mathcal{Y}) \le \epsilon \, d(V, V)$ and $m(V') = m(V)$, this upper bound is at least $(1 - \epsilon)$ times the upper bound (2) for the original instance. Theorem 10 follows from the fact that the algorithm achieves a weighted value of at least a $\beta$ factor of this quantity. ∎
A.2 Weighted hierarchical clustering: Constant-factor approximation
For weighted hierarchical clustering with positive integral weights, we define the weighted average-linkage algorithm for inputs $(V, d, m)$ and $(V, s, m)$. Define the average distance between clusters $A$ and $B$ to be $\frac{\sum_{a \in A, b \in B} m(a)\, m(b)\, d(a, b)}{m(A)\, m(B)}$ for dissimilarity-based input, and analogously with $s$ in place of $d$ for similarity-based input. In each iteration, weighted average-linkage merges the pair of clusters that minimizes this value if the input is dissimilarity-based, and maximizes it if the input is similarity-based.
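A sketch of this merge rule (our reading of the definition via the duplicate-points reduction used below; the cluster representation is an assumption):

```python
# Weighted average-linkage sketch: clusters are lists of point indices, `m` a numpy
# array of positive integer vertex weights, `dist` a dissimilarity matrix.
import numpy as np

def weighted_avg_dist(A, B, dist, m):
    # Average distance between the duplicate sets of A and B:
    # sum_{a in A, b in B} m(a) m(b) d(a, b) / (m(A) m(B)).
    num = (np.outer(m[A], m[B]) * dist[np.ix_(A, B)]).sum()
    return num / (m[A].sum() * m[B].sum())

def weighted_average_linkage(dist, m):
    clusters = [[i] for i in range(dist.shape[0])]
    merges = []
    while len(clusters) > 1:
        # Dissimilarity-based input: merge the pair with minimum average distance.
        a, b = min(((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
                   key=lambda p: weighted_avg_dist(clusters[p[0]], clusters[p[1]], dist, m))
        merges.append((clusters[a], clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [clusters[a] + clusters[b]]
    return merges
```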
Lemma 20.
Weighted average-linkage is a $\frac{2}{3}$-approximation (resp., $\frac{1}{3}$-approximation) for the upper bound on the weighted value (resp., revenue) objective with positive, integral weights.
Proof.
We prove it for weighted value first. This is directly implied by the fact that average-linkage is a $\frac{2}{3}$-approximation for the unweighted value objective, as proved in [18]. We have already seen in the last subsection that an unweighted input can be converted into a weighted input. Vice versa, we can convert a weighted input into an unweighted input with the same upper bound for the value objective.
In weighted hierarchical clustering we treat each point $v$ with integral weight $m(v)$ as $m(v)$ duplicates of $v$ with distance $0$ among themselves; call this set $D_v$. For two weighted points $u \neq v$, every pair of duplicates $u' \in D_u$ and $v' \in D_v$ has distance $d(u', v') = d(u, v)$. This unweighted instance, composed of the duplicates, has the same upper bound as the weighted instance. Notice that running average-linkage on the unweighted instance will always choose to put all the duplicates in $D_v$ together first for each $v$, and then do hierarchical clustering on top of the duplicate sets. Thus running average-linkage on the unweighted input gives a valid hierarchical clustering tree for the weighted input. Since the unweighted value upper bound equals the weighted value upper bound, the approximation ratio is the same.
Now we prove it for weighted revenue. In [30], the fact that average-linkage is a $\frac{1}{3}$-approximation for unweighted revenue is proved as follows. Given any clustering, if average-linkage chooses to merge clusters $A$ and $B$, we define a local revenue for this merge:
And correspondingly, a local cost:
Summing up the local revenue and cost over all merges gives the upper bound. [30] used the properties of average-linkage to prove that at every merge the local revenue is at least half the local cost, which guarantees that the total revenue, i.e., the sum of the local revenues over all merges, is at least $\frac{1}{3}$ of the upper bound. For the weighted case, we define
And
The rest of the proof works in the same way as in [30], proving weighted average-linkage to be a $\frac{1}{3}$-approximation for weighted revenue. ∎
Next we define the weighted $\epsilon$-locally-densest cut algorithm. The original algorithm, introduced in [18], defines the density of a cut $(A, B)$ to be $\frac{d(A, B)}{|A| \, |B|}$. It starts with the original set as one cluster and, at every step, seeks a partition of the current set that locally maximizes this value, thus constructing a tree from top to bottom. For the weighted input $(V, d, m)$, we define the density of a cut $(A, B)$ to be $\frac{\sum_{a \in A, b \in B} m(a)\, m(b)\, d(a, b)}{m(A)\, m(B)}$. For a fuller description of the algorithm, see Algorithm 4 in Section 6.2 of [18].
Lemma 21.
The weighted $\frac{\epsilon}{n}$-locally-densest cut algorithm is a $\left(\frac{2}{3} - \epsilon\right)$-approximation for the weighted value objective.
Proof.
Just as in the average-linkage proof, we convert each weighted point $v$ into a set $D_v$ of $m(v)$ duplicates of $v$. Notice that the converted unweighted hierarchical clustering input has the same upper bound as the weighted hierarchical clustering input, and the $\frac{\epsilon}{n}$-locally-densest cut algorithm moves whole duplicate sets around in the unweighted input, instead of single points as in the original algorithm in [18].
Focus on a split of cluster into . Let be a duplicate set. , where is a set of duplicates, we must have
Pick up a point ,
Rearrange the terms and we get the following inequality holds for any point :
The rest of the proof goes exactly the same as the proof in [18, Theorem 6.5]. ∎
Appendix B Proof of Theorem 8
Proof.
Let $\mathcal{A}$ be the $\beta$-approximation algorithm to (1). For a given instance $(V, s)$, let $\mathcal{Y} = \{Y_1, \ldots, Y_k\}$ be a fairlet decomposition of $V$; let $n_i = |Y_i|$. Recall that $m_f = \max_i n_i$.
We use $\mathcal{Y}$ to create a weighted instance $(V', s, m)$. For each $Y_i$ there is a point $y_i \in V'$; we define $m(y_i) = n_i$ and we define $s(y_i, y_j) = s(Y_i, Y_j)$.
We run $\mathcal{A}$ on $(V', s, m)$ and let $T'$ be the hierarchical clustering obtained by $\mathcal{A}$. To extend this to a tree $T$ on $V$, we simply place all the points in each fairlet as leaves under the corresponding vertex in $T'$.
We argue that $T$ is a $\beta\,\frac{n - 2 m_f}{n - 2}$-approximation to (1).
Since $\mathcal{A}$ obtains a $\beta$-approximation to hierarchical clustering on $(V', s, m)$, we have
$$\mathrm{rev}(T') \ge \beta \sum_{i < j} \big(n - n_i - n_j\big)\, s(Y_i, Y_j) \ge \beta\,(n - 2 m_f) \sum_{i < j} s(Y_i, Y_j).$$
Notice that, for any pair of points $u, v$ in the same fairlet $Y_i$, the revenue they get in the tree $T$ is at least $(n - m_f)\, s(u, v)$, since the subtree rooted at their least common ancestor contains at most $m_f$ leaves. Then, using $n - m_f \ge \beta (n - 2 m_f)$,
$$\mathrm{rev}_G(T) \ge \beta\,(n - 2 m_f) \sum_{\{u, v\} \subseteq V} s(u, v).$$
Thus the resulting tree is a $\beta\,\frac{n - 2 m_f}{n - 2}$-approximation of the upper bound (1). ∎
Appendix C Proofs for -locally-optimal local search algorithm
In this section, we prove that Algorithm 1 gives a good fairlet decomposition with respect to the fairlet decomposition objective (3), and that it has polynomial running time.
C.1 Proof for a simplified version of Lemma 15
In Subsection C.2, we will prove Lemma 15. For now, we consider a simpler version of Lemma 15 in the context of the disparate impact problem of [15], where we have red and blue points and strive to preserve their ratio in all clusters. Chierichetti et al. [15] provided a valid fairlet decomposition in this context, where each fairlet has at most $b'$ blue points and $r'$ red points, with $b'/r'$ the balance of the input in lowest terms. Before going deeper into the analysis, we state the following useful proposition.
Proposition 22.
Let be the total number of red points and the number of blue points. We have that, .
Proof.
Recall that , and wlog . Since the fractions are positive and we know that . Since we conclude that . Similarly, we conclude that . Therefore .
Thus, . However, since and , , . ∎
Using this, we can define and prove the following lemma, which is a simplified version of Lemma 15.
Lemma 23.
Proof.
Let denote a mapping from a point in to the fairlet it belongs to. Let , and . Naturally, for any set . For a fairlet , let and denote the number of red and blue points in .
We first bound the total number of intra-fairlet pairs. Let , we know that and . The number of intra-fairlet pairs is at most .
The While loop can end in two cases: 1) if is -locally-optimal; 2) if . Case 2 immediately implies the lemma, thus we focus on case 1. By definition of the algorithm, we know that for any pair and where have the same color and the swap does not increase objective value by a large amount. (The same trivially holds if the pair are in the same cluster.)
After moving terms and some simplification, we get the following inequality:
(4) |
Then we sum up (4), , over every pair of points in (even if they are in the same partition).
Divide both sides by and use the fact that for all :
(5) |
For pairs of points in we sum (4) to similarly obtain:
(6) |
Now we sum up (5) and (6). The LHS becomes:
The other terms give:
The last inequality follows from Proposition 22. All together, this proves that
Then, . The final step follows from the fact that . This proves the lemma. ∎
C.2 Proof for the generalized Lemma 15
Next, we prove Lemma 15 for the more generalized definition of fairness, which is -capped fairness.
Proof of Lemma 15. The proof follows the same logic as in the two-color case: we first use the $\epsilon$-local optimality of the solution, and then sum up the resulting inequality over all pairs of points with the same color.
Let denote a mapping from a point in to the fairlet it belongs to. Let be the set of colored points in a set . Let . Naturally, for any set since the weight for every pair of points is repeated twice.
The While loop can end in two cases: 1) if is -locally-optimal; 2) if . Case 2 immediately implies the lemma, thus we focus on case 1.
By definition of the algorithm, we know that for any pair and where have the same color and the swap does not increase objective value by a large amount. (The same trivially holds if the pair are in the same cluster.) We get the following inequality as in the two color case:
(7) |
For any color , we sum it over every pair of points in (even if they are in the same partition).
Divide both sides by and we get:
(8) |
Now we sum up this inequality over all colors . The LHS becomes:
For the RHS, the last term sums up to . Using the fact that , the other terms sum up to :
Therefore, putting LHS and RHS together, we get
Then, . The final step follows from the fact that .
∎
In the two-color case, the ratio becomes $\frac{\max_i r_i}{r} + \frac{\max_i b_i}{b}$, which can be further bounded using Proposition 22. If there exists a caplet decomposition in which every caplet has bounded size, Lemma 15 implies we can build a fair hierarchical clustering tree with vanishing loss in the approximation ratio for the value objective.
Assuming that every color class grows with $n$, we now give a possible caplet decomposition for $\alpha = 1/t$, for positive integer $t$, in which every set has size at most $2t - 1$.
Lemma 24.
For any set $X$ of size at least $t$ that satisfies the fairness constraint with $\alpha = 1/t$, there exists a partition of $X$ into sets of size at least $t$ and at most $2t - 1$, where each set satisfies the fairness constraint.
Proof.
Let $|X| = n'$ with $n' \ge t$; then the fairness constraint ensures that there are at most $\lfloor n'/t \rfloor$ elements of each color. Consider the partition obtained through the following process: take an ordering of the elements in which points of the same color are in consecutive places, and assign points to $\lfloor n'/t \rfloor$ sets in a round-robin fashion. Each set gets at least $t$ elements and at most $2t - 1$ elements. Since there are at most $\lfloor n'/t \rfloor$ elements of each color, each set gets at most one point of any color, and hence all sets satisfy the fairness constraint, as each has size at least $t$. ∎
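A sketch of this round-robin construction (representation assumed: a list of (index, color) pairs):

```python
# Sketch of the round-robin partition from the proof of Lemma 24.
def round_robin_caplets(points, t):
    # Order points so that equal colors are consecutive, then deal them out to
    # floor(n / t) sets round-robin. A 1/t-capped input has at most floor(n / t)
    # points of any color, so each set gets at most one point per color and
    # between t and 2t - 1 points overall.
    ordered = sorted(points, key=lambda p: p[1])
    num_sets = max(1, len(ordered) // t)
    sets = [[] for _ in range(num_sets)]
    for k, p in enumerate(ordered):
        sets[k % num_sets].append(p)
    return sets
```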
C.3 Proof for the running time of -locally-optimal fairlet decomposition algorithm
Proof of Lemma 16. Notice that finding the maximum pairwise distance takes $O(n^2)$ time. Thus, we focus on analyzing the time spent in the While loop.
Let $T$ be the total number of swaps. We argue that $T = O\!\left(\frac{n \log n}{\epsilon}\right)$. If the loop ends immediately, the conclusion trivially holds. Otherwise, consider the decomposition before the last swap. Since the While loop does not terminate there, its objective value is still larger than $d_{\max}$. However, at the beginning, the objective value is at most $n^2 d_{\max}$, and every swap decreases it by a factor of at least $1 + \epsilon/n$. Therefore, it takes at most $O\!\left(\frac{n \log n}{\epsilon}\right)$ iterations to finish the While loop.
It remains to discuss the running time of each iteration. We argue that there is a way to finish each iteration in $O(n^2)$ time. Before the While loop, keep a record of $d(x, Y)$ for each point $x$ and each fairlet $Y$; this takes $O(n^2)$ time. If we know these records and the objective value from the last iteration, then in the current iteration it takes $O(1)$ time to calculate the new objective value after each candidate swap, and there are at most $O(n^2)$ such calculations before the algorithm either finds a pair to swap or determines that no such pair is left. After a swap, updating all the records takes $O(n)$ time. In total, every iteration takes $O(n^2)$ time.
Therefore, Algorithm 1 takes $O\!\left(\frac{n^3 \log n}{\epsilon}\right)$ time. ∎
Appendix D Hardness of optimal fairlet decomposition
Before proving Theorem 7, we recall that the PARTITION INTO TRIANGLES (PIT) problem is known to be NP-complete [22]. In what follows, we call a clique a $k$-clique if it has $k$ nodes; a triangle is a 3-clique.
Definition 25.
PARTITION INTO TRIANGLES
(PIT). Given a graph $G = (V, E)$, where $|V| = 3n$, determine whether $V$ can be partitioned into three-element sets $S_1, \ldots, S_n$, such that each $S_i$ forms a triangle in $G$.
The NP-hardness of PIT problem gives us a more general statement.
Definition 26.
PARTITION INTO -CLIQUES
(PIKC).
For a fixed number $k$ treated as a constant, given a graph $G = (V, E)$, where $|V| = kn$, determine whether $V$ can be partitioned into $k$-element sets $S_1, \ldots, S_n$, such that each $S_i$ forms a $k$-clique in $G$.
Lemma 27.
For any fixed constant $k \ge 3$, the PIKC problem is NP-hard.
Proof.
We reduce the PIKC problem from the PIT problem. For any graph $G = (V, E)$ given to the PIT problem, where $|V| = 3n$, construct another graph $G' = (V', E')$. Let $V' = V \cup C_1 \cup \cdots \cup C_n$, where each $C_i$ is a $(k - 3)$-clique, and there is no edge between any two cliques $C_i$ and $C_j$ for $i \neq j$. For every $i$, let all points in $C_i$ be connected to all nodes in $V$.
Now let $G'$ be the input to the PIKC problem. We prove that $V$ can be partitioned into triangles if and only if $V'$ can be partitioned into $k$-cliques. If $V$ has a triangle partition $S_1, \ldots, S_n$, then $S_1 \cup C_1, \ldots, S_n \cup C_n$ is a $k$-clique partition of $V'$. On the other hand, if $V'$ has a $k$-clique partition, then nodes from different cliques $C_i$ must belong to different $k$-cliques of the partition, since they are not connected to each other; by counting, each $k$-clique of the partition must therefore contain exactly one full $C_i$. Without loss of generality, assume $C_i$ is contained in the $i$th clique of the partition; then removing $C_1, \ldots, C_n$ from the respective cliques yields a triangle partition of $V$. ∎
We are ready to prove the theorem.
Proof of Theorem 7. We prove Theorem 7 by showing that, for a given $t$, if there exists a polynomial-time algorithm $\mathcal{A}$ for (3) with any bounded approximation factor, then it can be used to solve the corresponding PIKC problem as well. This holds for any finite approximation factor.
Given any graph $G = (V, E)$ that is an input to the PIKC problem, where $|V| = (t - 1) n$, let a set of colored points with distances be constructed in the following way:
1. $U = V \cup \{c_1, \ldots, c_n\}$, where each $c_i$ is a new singleton point.
2. Color the points in $V$ red, and color all the $c_i$'s blue.
3. For a pair $u, v \in U$, let $d(u, v) = 0$ if it satisfies one of the three conditions: 1) $(u, v) \in E$; 2) both $u$ and $v$ are among the $c_i$'s; 3) one of $u, v$ is in $V$, while the other is one of the $c_i$'s.
4. All other pairs have distance $1$.
The blue points make up a $\frac{1}{t}$ fraction of the input, so each fairlet should have exactly one blue point and $t - 1$ red points.
We claim that $V$ has a partition into $(t - 1)$-cliques if and only if algorithm $\mathcal{A}$ gives a solution of value $0$ for (3). The same argument as in the proof of Lemma 27 shows that $V$ has such a partition if and only if the optimal solution to (3) is $0$. This is equivalent to algorithm $\mathcal{A}$ returning a solution of value $0$, since otherwise the approximation ratio would be unbounded. ∎
Appendix E Optimizing cost with fairness
In this section, we present our fair hierarchical clustering algorithm that approximates Dasgupta's cost function and satisfies Theorem 17. Most of the proofs can be found in Section E.1. We consider the problem of equal representation, where vertices are red or blue and $\alpha = 1/2$. From now on, whenever we use the word “fair”, we are referring to this fairness constraint. Our algorithm uses parameters $t$ and $\ell$, whose values are fixed later, and leverages a $\beta_{nf}$-approximation for cost and a $\beta_{bis}$-approximation for minimum weighted bisection. We will assume these are fixed and use them throughout the section.
We will ultimately show that we can find a fair solution whose cost is a sublinear (in $n$) factor away from the unfair optimum, which is a lower bound on the fair optimum. Our main result is Theorem 17, which is stated in the body of the paper.
The current best approximations described in Theorem 17 are $\beta_{bis} = O(\log^{1.5} n)$ by [21] and $\beta_{nf} = O(\sqrt{\log n})$ by both [19] and [11]. Setting $t$ and $\ell$ appropriately, we get Corollary 28.
Corollary 28.
Consider the equal representation problem with two colors. There is an $O\big(n^{5/6} \log^{5/4} n\big)$-approximate fair clustering under the cost objective.
The algorithm will be centered around a single clustering, which we call $\mathcal{C}$, that is extracted from an unfair hierarchy. We will then adapt this to become a similar, fair clustering $\mathcal{C}'$. To formalize what $\mathcal{C}'$ must satisfy to be sufficiently “similar” to $\mathcal{C}$, we introduce the notion of a $(\delta, \ell)$-good clustering. Note that this is not an intuitive set of properties; it is simply what $\mathcal{C}'$ must satisfy in order for our analysis to go through.
Definition 29 (Good clustering).
Fix a clustering $\mathcal{C}$ whose cluster sizes are at most $t$. A fair clustering $\mathcal{C}'$ is $(\delta, \ell)$-good (with respect to $\mathcal{C}$) if it satisfies the following two properties:
1. For any cluster $C \in \mathcal{C}$, there is a cluster $C' \in \mathcal{C}'$ such that all but (at most) a $\delta$-fraction of the weight of edges in $C$ is also in $C'$.
2. No cluster of $\mathcal{C}'$ is too much bigger: $|C'| = O(\ell t)$ for every $C' \in \mathcal{C}'$.
The hierarchy will consist of a $(\delta, \ell)$-good clustering (for specifically chosen $\delta$ and $\ell$) as its only nontrivial layer.
Lemma 30.
Let $T'$ be a $\beta_{nf}$-approximation for cost and let $\mathcal{C}$ be a maximal clustering in $T'$ under the condition that all cluster sizes are at most $t$. Then, a fair two-tiered hierarchy whose first level consists of a $(\delta, \ell)$-good clustering achieves an $O\!\left(\beta_{nf}\left(\frac{n}{t} + \delta n + \ell t\right)\right)$-approximation for cost.
Proof.
Since $T'$ is a $\beta_{nf}$-approximation, we know that $\mathrm{cost}(T') \le \beta_{nf} \cdot \mathrm{cost}(T^*)$, where $T^*$ is the optimal (unfair) tree.
We will then utilize a scheme to account for the cost contributed by each edge relative to its cost in $T'$, in the hope of extending it to the fair tree $T$. There are three different types of edges:
1. An edge $e$ that is merged into a cluster of size $t$ or greater in $T'$, thus contributing at least $t \cdot w(e)$ to the cost. At worst, this edge is merged at the top cluster in $T$ and contributes $n \cdot w(e)$. Thus, the factor increase in the cost contributed by $e$ is at most $n/t$. Since the total contribution of all such edges in $T'$ is at most $\mathrm{cost}(T')$, the total contribution of all such edges in $T$ is at most $\frac{n}{t} \cdot \mathrm{cost}(T')$.
2. An edge that started in some cluster $C \in \mathcal{C}$ but does not remain in the corresponding cluster $C' \in \mathcal{C}'$. We are given that the total weight removed from any such $C$ is at most a $\delta$-fraction of the weight contained in $C$. Summed over all clusters in $\mathcal{C}$, the weight contained in clusters contributes at least twice its total to $\mathrm{cost}(T')$ (the smallest possible cluster size is two), so the total amount of weight moved is at most $\frac{\delta}{2} \, \mathrm{cost}(T')$. In $T$, these edges may be merged at the top of the tree, for a maximum cost contribution of $n$ times their weight. Therefore, the total cost across all such edges in $T$ is at most $\frac{\delta n}{2} \cdot \mathrm{cost}(T')$.
3. An edge that starts in some cluster $C \in \mathcal{C}$ and remains in the corresponding $C' \in \mathcal{C}'$. Similarly, this edge must have contributed at least twice its weight in $T'$, but now we know that it is merged within $C'$ in $T$, and that the size of $C'$ is $O(\ell t)$. Thus its contribution increases by at most a factor of $O(\ell t)$. By the same reasoning as for the first edge type, all these edges together contribute at most $O(\ell t) \cdot \mathrm{cost}(T')$.
Putting this all together gives a conservative bound of $O\!\left(\frac{n}{t} + \delta n + \ell t\right) \cdot \mathrm{cost}(T')$ on $\mathrm{cost}(T)$.
Finally, we use the fact that $T'$ is a $\beta_{nf}$-approximation for $\mathrm{cost}(T^*)$. ∎
With this proof in hand, the only thing left to do is find a $(\delta, \ell)$-good clustering (Definition 29). Specifically, using the clustering $\mathcal{C}$ mentioned in Lemma 30, we would like to find a $(\delta, \ell)$-good clustering using the following lemma.
Lemma 31.
There is an algorithm that, given a clustering $\mathcal{C}$ with maximum cluster size $t$, creates a $(\delta, \ell)$-good clustering, where $\delta$ depends on $\ell$ and $\beta_{bis}$.
Proof of Theorem 17.
Consider our graph $G$. We first obtain a $\beta_{nf}$-approximation for unfair cost, which yields a hierarchy tree $T'$. Let $\mathcal{C}$ be the maximal clustering in $T'$ under the constraint that the cluster sizes must not exceed $t$. We then apply the algorithm from Lemma 31 to get a $(\delta, \ell)$-good clustering $\mathcal{C}'$. Construct $T$ such that it has one nontrivial layer that is $\mathcal{C}'$. Then we can apply Lemma 30 to get the desired approximation. ∎
From here, we will only provide a high-level description of the algorithm for Lemma 31. For precise details and proofs, see Section E.1. To start, we need some terminology.
Definition 32 (Red-blue matching).
A red-blue matching on a graph $G = (V, E)$ is a matching $M$ such that $(u, v) \in M$ implies $u$ and $v$ have different colors.
Red-blue matchings are interesting because they help us ensure fairness. For instance, suppose $M$ is a red-blue matching that is also perfect (i.e., it touches all nodes). If the lowest level of a hierarchy consists of a clustering such that $u$ and $v$ are in the same cluster for every $(u, v) \in M$, then that level of the hierarchy is fair, since there is a bijection between red and blue vertices within each cluster. When these clusters are merged up in the hierarchy, fairness is preserved.
Our algorithm will modify an unfair clustering to be fair by combining clusters and moving a small number of vertices. To do this, we will use the following notion.
Definition 33 (Red-blue clustering graph).
Given a graph $G = (V, E)$ and a clustering $\mathcal{C}$, we can construct a red-blue clustering graph $G_{\mathcal{C}}$ associated with some red-blue matching $M$ on $G$. Here $G_{\mathcal{C}}$ is a graph whose vertices are the clusters of $\mathcal{C}$, and two clusters $C_1, C_2$ are joined by an edge if and only if there is a pair $(u, v) \in M$ with $u \in C_1$ and $v \in C_2$.
Basically, we create a graph of clusters, and there is an edge between two clusters if and only if there is at least one vertex in one cluster that is matched to some vertex in the other cluster. We now show that the red-blue clustering graph can be used to construct a fair clustering based on an unfair clustering.
Proposition 34.
Let $G_{\mathcal{C}}$ be a red-blue clustering graph on a clustering $\mathcal{C}$ with a perfect red-blue matching $M$. Let $\mathcal{C}'$ be constructed by merging all the clusters in each connected component of $G_{\mathcal{C}}$. Then $\mathcal{C}'$ is fair.
Proof.
Consider some $C' \in \mathcal{C}'$. By construction, this corresponds to a connected component of $G_{\mathcal{C}}$. By definition of $G_{\mathcal{C}}$, for any vertex $u \in C'$, its match under $M$ is also in $C'$. That means $M$, restricted to $C'$, defines a bijection between the red and blue nodes in $C'$. Therefore, $C'$ has an equal number of red and blue vertices and hence is fair. ∎
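A sketch of this merging step (networkx is used only for brevity; inputs are assumed to be a list of clusters of vertex indices and a perfect red-blue matching given as vertex pairs):

```python
# Sketch of Proposition 34: merge the clusters lying in each connected component
# of the red-blue clustering graph induced by the matching.
import networkx as nx

def merge_by_matching(clusters, matching):
    where = {v: k for k, c in enumerate(clusters) for v in c}   # vertex -> cluster id
    H = nx.Graph()
    H.add_nodes_from(range(len(clusters)))
    for u, v in matching:
        H.add_edge(where[u], where[v])   # edge between the clusters of matched vertices
    return [sorted(v for k in comp for v in clusters[k])
            for comp in nx.connected_components(H)]
```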
We will start by extracting a clustering $\mathcal{C}$ from an unfair hierarchy that approximates cost. Then, we will construct a red-blue clustering graph with a perfect red-blue matching $M$. Then we can use the components of this graph to define our first version of the clustering $\mathcal{C}'$. However, this requires a non-trivial way of moving vertices between clusters in $\mathcal{C}$.
We now give an overview of our algorithm in Steps (A)–(G). For a full description, see our pseudocode in Section H.
(A) Get an unfair approximation $T'$. We start by running a $\beta_{nf}$-approximation for cost in the unfair setting. This gives us a tree $T'$ such that $\mathrm{cost}(T') \le \beta_{nf} \cdot \mathrm{cost}(T^*)$.
(B) Extract a $t$-maximal clustering. Given $T'$, we find the maximal clustering $\mathcal{C}$ such that (i) every cluster in the clustering is of size at most $t$, and (ii) any cluster above these clusters in $T'$ is of size more than $t$.
(C) Combine clusters. We will now slowly change $\mathcal{C}$ into $\mathcal{C}'$ over a number of steps. In the first step, we simply define $\mathcal{C}_1$ by merging small clusters until each merged size is at least $t$; as shown in Lemma 35, all resulting cluster sizes lie between $t$ and $3t$. Thus clusters in $\mathcal{C}$ are contained within clusters in $\mathcal{C}_1$.
(D) Find cluster excesses. Next, we strive to make our clustering more fair. We do this by trying to find an underlying matching between red and blue vertices that agrees with $\mathcal{C}_1$ (i.e., matched vertices are in the same cluster). If the matching were perfect, then the clusters in $\mathcal{C}_1$ would have equal red and blue representation. However, this is not guaranteed initially. We start by conceptually matching as many red and blue vertices within clusters as we can. Note we do not actually create this matching; we just want to reserve the space for this matching to ensure fairness, but really some of these vertices may be moved later on. Then the remaining unmatched vertices in each cluster are either entirely red or entirely blue. We call this number the excess and the color the excess color. We label each cluster with both of these.
(E) Construct the red-blue clustering graph. Next, we construct $G_{\mathcal{C}}$, our red-blue clustering graph on the current clustering. Its node set consists of the clusters, and it initially has no edges. In addition, the within-cluster matches from Step (D) are placed into the matching $\mu$. With this start, we run a matching process that simultaneously constructs the edges of $G_{\mathcal{C}}$ and the rest of $\mu$. Note that the unmatched vertices are exactly the excess vertices in each cluster. We match them with the following iterative process, governed by an excess threshold:
1. Select a cluster (a node of $G_{\mathcal{C}}$) with excess at least the threshold to start a new connected component in $G_{\mathcal{C}}$. Without loss of generality, say its excess color is red.
2. Find a cluster whose excess color is blue and whose excess is at least the threshold, and add an edge between it and the current reference cluster in $G_{\mathcal{C}}$.
3. Say without loss of generality that the excess of the first cluster is less than that of the second. Then match all the excess in the first cluster to vertices in the excess of the second; the second cluster now has a smaller excess.
4. If the remaining cluster's excess drops below the threshold, or the component has reached its maximum allowed size, end this component and start over at (1) with a new cluster.
5. Otherwise, use the cluster that still has excess as our reference and continue constructing this component at (2).
6. Complete when every cluster with excess above the threshold belongs to a component (or all remaining such clusters have the same excess color).
We would like to construct the fair clustering by merging all clusters in each component. This would be fair if $\mu$ were a perfect matching; however, that is not yet true. The next step handles this.
(F) Fix unmatched vertices. We now want to match the excess vertices that remain unmatched. We do this by bringing vertices from other clusters into the clusters that have unmatched excess, starting with all of the small unmatched excess. Note that some clusters were never used in Step (E) because they had small excess to start with; such clusters contain many internal red-blue matches. We remove a few of these matched pairs and place them into clusters in need. For the remaining unmatched vertices, we later describe a process in which a constant fraction of the clusters contribute vertices to account for the unmatched excess. In total, each cluster loses only a small number of vertices, and all unmatched vertices are accounted for. Call the resulting clustering the fixed clustering; now $\mu$ is perfect and $G_{\mathcal{C}}$ is unchanged.
(G) Define the final clustering. Finally, we create the output clustering by merging the clusters in each component of $G_{\mathcal{C}}$. Note that Proposition 34 assures that this clustering is fair. In addition, we will show that its cluster sizes satisfy the desired upper bound, and that at most a small number of vertices was removed from each cluster. This is the desired good clustering.
Further details, and the proofs that the above sequence of steps achieves the desired approximation, can be found in the next section. While the approximation factor obtained is not as strong as those for the revenue and value objectives with fairness, we believe cost is a much harder objective under fairness constraints.
E.1 Proof of Theorem 17
This algorithm contains a number of components; we discuss the claims made in the description step by step. In Step (A), we simply utilize any approximation algorithm for cost in the unfair setting. Step (B) is also quite simple. At this point, all that is left is to show how to find the desired fair clustering, i.e., to prove Lemma 31 (introduced in Section 6). This happens in the steps following Step (B). In Step (C), we apply our first changes to the clustering extracted from $T'$. We now prove that the cluster sizes can be forced into the desired range.
Lemma 35.
Given a clustering in which every cluster has size at most the size threshold, we can construct a clustering in which each cluster is a union of the original clusters and every cluster's size lies between the threshold and a small constant multiple of it.
Proof.
We iterate over the clusters whose size is below the threshold and repeatedly merge them until the merged cluster reaches the threshold. Since the last two clusters merged were each below the threshold, the merged cluster has size at most twice the threshold. We then close this cluster and continue merging the remaining small clusters. If at the end we are left with a single cluster below the threshold, we simply merge it with any other cluster, which then still has size bounded by a small constant multiple of the threshold. ∎
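The greedy merging in the proof of Lemma 35 can be sketched as follows; `threshold` stands in for the size bound in the lemma, and clusters are represented as sets of vertices.

```python
def merge_small_clusters(clusters, threshold):
    """Greedily merge clusters smaller than `threshold` until each merged
    cluster reaches the threshold (Lemma 35)."""
    merged = [c for c in clusters if len(c) >= threshold]
    current = set()
    for cluster in (c for c in clusters if len(c) < threshold):
        current |= cluster
        if len(current) >= threshold:
            merged.append(current)
            current = set()
    if current:
        # A leftover small cluster is absorbed into an arbitrary cluster.
        if merged:
            merged[-1] = merged[-1] | current
        else:
            merged.append(current)
    return merged
```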
Step (D) describes a rather simple process: for each cluster, we count the number of vertices of each color, determine which color is more frequent, and compute the difference. No claims are made here.
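A sketch of this computation, assuming a `color_of` map from vertices to the strings 'red' and 'blue' (an illustrative representation of ours):

```python
def compute_excesses(clusters, color_of):
    """For each cluster, compute its excess |#red - #blue| and its excess
    color (Step (D))."""
    excesses = []
    for cluster in clusters:
        reds = sum(1 for v in cluster if color_of[v] == 'red')
        blues = len(cluster) - reds
        color = 'red' if reds >= blues else 'blue'
        excesses.append((abs(reds - blues), color))
    return excesses
```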
Step (E) defines a more careful process. We describe this process and its results here.
Lemma 36.
There is an algorithm that, given a clustering whose cluster sizes lie in the range guaranteed by Lemma 35, can construct a red-blue clustering graph $G_{\mathcal{C}}$ on it with an underlying matching $\mu$ such that:
1. $G_{\mathcal{C}}$ is a forest, and its maximum component size is bounded.
2. For every edge of $G_{\mathcal{C}}$, there are at least a threshold number of matches in $\mu$ between the two clusters it joins.
3. For most clusters, only a bounded number of their vertices are unmatched in $\mu$. The only exceptions to this rule are (1) exactly one cluster in every maximum-size component of $G_{\mathcal{C}}$, and (2) a small number of additional clusters.
Proof.
We use precisely the process from Step (E). Initially, $G_{\mathcal{C}}$ has one node per cluster and no edges; it will end up looking like a bipartite graph with some entirely isolated nodes. We then construct components of $G_{\mathcal{C}}$ one by one such that (1) the maximum component size is bounded, and (2) every edge represents at least a threshold number of matches in $\mu$.
Let us show that this satisfies the three conditions of the lemma. For condition 1, note that we always halt component construction once it reaches the maximum allowed size, so no component can exceed that size. In addition, for every edge added to the graph, at least one of its endpoints is left with small excess and will not be considered again later in the process. Thus no cycles can be created, so $G_{\mathcal{C}}$ is a forest.
For condition 2, consider the construction of any edge $(C_i, C_j)$. At that point, $C_i$ and $C_j$ are clusters with different-color excess, each of at least the threshold. In the next part of the algorithm, we match as much excess as we can between the two clusters. Therefore, there must be at least the threshold number of underlying matches.
Finally, condition 3 is achieved by the completion condition: when the process ends, no isolated node (besides possibly leftovers of the same excess color) has excess above the threshold. Whenever we add a cluster to a component, either that cluster matches all of its excess, or the cluster it becomes adjacent to matches all of its excess. Therefore, at any point, any component has at most one cluster with any excess at all. If a component is smaller than the maximum size (and is not the final component), this can only happen if, in its final addition, both clusters end up with excess below the threshold; hence no cluster in such a component retains excess above the threshold. For a maximum-size component, by the rule mentioned before, only one cluster can remain with excess. When the algorithm completes, we are left with a number of large-excess clusters that all have the same excess color, say red without loss of generality. Assume for contradiction that there are too many such clusters. Since we started with half red and half blue vertices, the remaining excess in the rest of the clusters must match up with this large red excess, so the remaining clusters must collectively hold a correspondingly large blue excess; but that is only achievable if some of them still have large excess left, a contradiction. Thus we satisfy condition 3. ∎
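For concreteness, here is a simplified sketch of the component-building loop of Step (E); `excess_threshold` and `max_component_size` stand in for the paper's parameters, the excess bookkeeping is taken from the Step (D) sketch above, and only the cluster-graph edges are returned.

```python
def build_clustering_graph(excesses, excess_threshold, max_component_size):
    """Pair up clusters with large opposite-color excess (Step (E)).

    excesses : list of (excess, color) per cluster, e.g. from compute_excesses.
    Returns the edges of the red-blue clustering graph as (i, j) index pairs.
    """
    remaining = [e for e, _ in excesses]
    color = [c for _, c in excesses]
    in_component = [False] * len(excesses)
    edges = []

    def pick(pred):
        for i in range(len(excesses)):
            if not in_component[i] and pred(i):
                return i
        return None

    while True:
        # 1. Start a new component from a cluster with large excess.
        ref = pick(lambda i: remaining[i] >= excess_threshold)
        if ref is None:
            break
        in_component[ref] = True
        size = 1
        while size < max_component_size:
            # 2. Find an unused cluster with large excess of the opposite color.
            other = pick(lambda i: color[i] != color[ref]
                         and remaining[i] >= excess_threshold)
            if other is None:
                break
            in_component[other] = True
            edges.append((ref, other))
            size += 1
            # 3. Match the smaller excess fully against the larger one.
            matched = min(remaining[ref], remaining[other])
            remaining[ref] -= matched
            remaining[other] -= matched
            # 4./5. Continue from whichever cluster still has excess, if any.
            if remaining[other] > remaining[ref]:
                ref = other
            if remaining[ref] < excess_threshold:
                break
    return edges
```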
This concludes Step (E). In Step (F), we transform the underlying clustering so that we can achieve a perfect matching $\mu$. This requires removing a small number of vertices from some clusters and placing them into clusters that have unmatched vertices. The process at most doubles any cluster's size.
Lemma 37.
There is an algorithm that, given a clustering whose cluster sizes lie in the range guaranteed by Lemma 35, finds a modified clustering and an underlying matching $\mu$ such that:
1. There is a bijection between the clusters of the input clustering and those of the output clustering.
2. Each output cluster differs from its corresponding input cluster in only a bounded number of vertices; in particular, at most a small number of vertices are removed from any cluster in the construction.
3. Every output cluster has size at most twice the original size bound.
4. $\mu$ is a perfect red-blue matching.
5. $G_{\mathcal{C}}$ is a red-blue clustering graph of the output clustering with matching $\mu$, perhaps with additional edges.
Proof.
Use Lemma 36 to find the red-blue clustering graph $G_{\mathcal{C}}$ and its underlying matching $\mu$. We then know that only one cluster in every maximum-size component, plus at most one other cluster, can have excess larger than the threshold. Since cluster sizes are bounded below, the number of such components, and hence of such clusters, is bounded; and since a cluster's excess is at most its size, which is bounded above, the total number of large-excess vertices that still need matches is bounded as well.
We start by removing all small-excess vertices from their clusters; this removes only a small number of vertices from any cluster. These vertices are then placed in clusters with large excess of the right color. If we run out of large excess of the right color that needs matches, then, since the total numbers of red and blue vertices are balanced, we can instead transfer the unmatched small-excess red vertices to clusters with a small number of unmatched blue vertices. In either case, this accounts for all the small unmatched excess. Now all that remains is a bounded number of unmatched vertices in large-excess clusters. At this point, note that the large excess must be balanced between red and blue. From now on, we will break matches within and between clusters so that the freed vertices can cover this excess. Since breaking a match always frees the same number of red and blue vertices, we do not have to worry about the balance of colors. We next describe how to distribute these contributions across a large number of clusters.
First, consider the nodes of $G_{\mathcal{C}}$ that correspond to clusters which (ignoring the matching $\mu$) started out with only small excess. The non-excess portion of such a cluster, which accounts for almost all of it, is matched entirely within the cluster. We simply break a few of these internal matches to contribute freed vertices.
Otherwise, we consider nodes that started out with large excess. Here we must break matches without breaking too many incident to any single cluster. For every tree in $G_{\mathcal{C}}$ (recall that $G_{\mathcal{C}}$ is a forest by Lemma 36), start at the root and do a breadth-first search over all internal nodes. At any node we visit, break matches between it and one of its children (recall from Lemma 36 that each edge of $G_{\mathcal{C}}$ represents at least a threshold number of inter-cluster matches); each such break frees an equal number of red and blue vertices. We do this for every internal node. Since an edge represents at least a threshold number of matches and the maximum cluster size is bounded, any node can have only a bounded number of children, so a constant fraction of the nodes of $G_{\mathcal{C}}$ contribute freed vertices in this way.
Clearly, the worst case is when all nodes in $G_{\mathcal{C}}$ have large excess, as this means fewer clusters are guaranteed to be able to contribute. By Lemma 36, most of these nodes belong to completed connected components (i.e., components of maximum size, or components in which every cluster has small remaining excess), so consider this case. One pass of the process then frees a number of vertices proportional to the number of contributing nodes, so to cover all of the unmatched excess we run a bounded number of passes. If an edge no longer represents enough matches because of an earlier pass, we treat it as a non-edge for the rest of the process. The only remaining concern is a cluster becoming isolated in $G_{\mathcal{C}}$ during the process. Such a cluster began with sufficiently many vertices, only a few of which were removed when handling small excess, so we can take the remaining needed vertices from its non-excess portion (the rest must be non-excess), just as for clusters that were isolated in $G_{\mathcal{C}}$ to begin with. Thus, we can account for the entire set of unmatched vertices without removing more than a small number of vertices from any given cluster.
Now we consider the conditions. Condition 1 is clearly satisfied because we only modify clusters, never remove them. Condition 2 holds because of the accounting scheme above, which removes only a small number of vertices per cluster; the same argument gives the lower bound in condition 3. When we add vertices to clusters, since we only add a vertex to match an unmatched vertex, we at most double the cluster size, which gives the upper bound in condition 3.
For the fourth condition, note that we explicitly ran this process until all unmatched vertices became matched, and any endpoint of a match we broke was used to create a new match. Thus the new matching $\mu$ is perfect, and it is still red-blue. Finally, note that we did not create any new matches between clusters; therefore, no match in $\mu$ can violate $G_{\mathcal{C}}$, so condition 5 is met. ∎
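The match-breaking pass over the forest can be sketched as follows. The bookkeeping structures `matches_on_edge` and the number `breaks_per_edge` are hypothetical stand-ins for quantities that are parameters of the paper's analysis; the sketch only illustrates the breadth-first traversal and the freeing of balanced red-blue pairs.

```python
from collections import deque

def break_matches_over_forest(roots, children, matches_on_edge, breaks_per_edge):
    """BFS over each tree of the clustering-graph forest; at every internal
    node, break a fixed number of inter-cluster matches on one child edge,
    freeing equal numbers of red and blue vertices (cf. Lemma 37)."""
    freed = []  # (red, blue) pairs released by breaking matches
    for root in roots:
        queue = deque([root])
        while queue:
            node = queue.popleft()
            kids = children.get(node, [])
            if kids:
                # Break matches between this internal node and one child,
                # provided the edge still carries enough matches.
                edge = (node, kids[0])
                if len(matches_on_edge.get(edge, [])) >= breaks_per_edge:
                    for _ in range(breaks_per_edge):
                        freed.append(matches_on_edge[edge].pop())
            queue.extend(kids)
    return freed
```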
Finally, we construct our final clustering in Step (G). However, to satisfy the requirements of Lemma 30, we must first bound the weight lost from each cluster.
Lemma 38.
Consider any clustering with cluster sizes in a bounded range. Say each cluster has a specified number of red vertices and a specified number of blue vertices to remove, where both numbers are a small fraction of the cluster size, and the red (resp. blue) removal count is nonzero only if the cluster contains sufficiently many red (resp. blue) vertices. Then we can remove the desired number of vertices of each color while removing at most a proportionally small fraction of the weight originally contained within the cluster.
Proof.
Consider some cluster $C$ with prescribed numbers of red and blue vertices to remove. We focus first on removing red vertices; let $R$ denote the set of red vertices in $C$. We create an auxiliary graph corresponding to this cluster as follows. Let $b$ be a vertex representing all blue vertices of $C$, let $\bar{b}$ be a “complement” vertex to $b$, and add one vertex for each red vertex in $C$. We also add a set of dummy vertices, sized so that a bisection of the auxiliary graph is forced to place exactly the prescribed number of red vertices on one side. Some of the dummy vertices are connected to $b$ with infinite edge weight, and the others are connected to $\bar{b}$ with infinite edge weight; this ensures that $b$ and $\bar{b}$ end up in the same parts as their corresponding dummies. Let $w$ and $w'$ denote the similarity functions in the original graph and the new graph, respectively.
The blue vertex $b$ is also connected to each red vertex $v \in R$ with the following weight (where $B$ is the set of blue vertices in $C$):
$$w'(b, v) \;=\; \sum_{u \in B} w(u, v) \;+\; \frac{1}{2}\sum_{u \in R \setminus \{v\}} w(u, v).$$
This edge represents the cumulative edge weight between $v$ and all blue vertices. The additional summation term, which contains half the edge weight between $v$ and all other red vertices, is necessary to ensure that our bisection cut also accounts for the edge weights between pairs of removed red vertices.
Next, the edge weights between red vertices must contain the other half of the corresponding edge weight in the original graph: for distinct $u, v \in R$, we set $w'(u, v) = \frac{1}{2} w(u, v)$.
Now, note the total number of vertices in the auxiliary graph: a bisection partitions it into two vertex sets of equal size. Clearly, in any reasonable solution, $b$ must be grouped with all of its dummies and $\bar{b}$ with all of its dummies, since otherwise the cut has infinite weight. By the choice of dummy counts, the side containing $\bar{b}$ must then contain exactly the prescribed number of red vertices, and the side containing $b$ contains the rest. The red vertices on the $\bar{b}$ side are the ones we select to move.
Consider any set $S$ of red vertices of the prescribed size; together with $\bar{b}$ and its dummies, it forms a valid bisection. We now show that the weight of the cut for this bisection is exactly the edge weight lost by removing $S$ from $C$. We can do this algebraically, starting by breaking the weight of the cut into the weight between the red vertices in $S$ and $b$, and the weight between the red vertices in $S$ and the red vertices not in $S$:
$$\mathrm{cut}(S) \;=\; \sum_{v \in S} w'(b, v) \;+\; \sum_{v \in S} \sum_{u \in R \setminus S} w'(u, v).$$
Expanding $w'(b, v)$, notice that the two last summations have an overlap: both contribute half the edge weight between $S$ and the vertices in $R \setminus S$, so these edges contribute their entire edge weight. The remaining red neighbors, namely those inside $S$, contribute only half their edge weight in each term. We can then distribute the summation:
$$\mathrm{cut}(S) \;=\; \sum_{v \in S}\sum_{u \in B} w(u, v) \;+\; \frac{1}{2}\sum_{v \in S}\sum_{u \in S \setminus \{v\}} w(u, v) \;+\; \sum_{v \in S}\sum_{u \in R \setminus S} w(u, v).$$
In the middle summation, note that every edge with both endpoints in $S$ is counted twice, once from each endpoint. We can then rewrite the cut weight as:
$$\mathrm{cut}(S) \;=\; \sum_{v \in S}\sum_{u \in B} w(u, v) \;+\; \sum_{\{u, v\} \subseteq S} w(u, v) \;+\; \sum_{v \in S}\sum_{u \in R \setminus S} w(u, v).$$
When we remove $S$, we remove the connections between $S$ and the blue vertices, the connections within $S$, and the connections between $S$ and the red vertices not in $S$; this is precisely what the expression above accounts for. Therefore, any bisection of the auxiliary graph directly corresponds to removing a set of red vertices of the prescribed size from $C$. Hence, an approximation algorithm for minimum weighted bisection yields, with the same factor, an approximation for the smallest weight loss achievable when removing the prescribed number of red vertices.
Now we must compare the optimal way to remove vertices to the total weight in the cluster. Let $r$ be the number of red vertices in the cluster; the possible cuts correspond to the subsets of $R$ of the prescribed size. If we sum the weight over all such cuts (where the weight of a cut is the weight between the removed vertices and all other vertices, including each other), each red-red and blue-red edge is counted multiple times: a red-red edge is counted whenever either of its endpoints is selected, a blue-red edge whenever its red endpoint is selected, and no blue-blue edge is ever counted. Each such edge is therefore counted in only a small fraction of the cuts, proportional to the fraction of red vertices being removed, so summing over all cuts yields at most a correspondingly small multiple of the total edge weight within the cluster times the number of cuts.
Let $\mathrm{OPT}$ denote the weight of the minimum such cut. Since every one of the cuts weighs at least $\mathrm{OPT}$, the sum over all cuts is at least the number of cuts times $\mathrm{OPT}$. Combining this with the upper bound above and simplifying shows that $\mathrm{OPT}$ is at most a small fraction, proportional to the fraction of red vertices removed, of the total weight contained within the cluster.
Recall that we are given that the number of vertices to remove is only a small fraction of the cluster. So if we have an approximation algorithm for the minimum weighted bisection problem, we can find a set of red vertices to remove whose removed weight is at most the approximation factor times this small fraction of the cluster's weight. We can do the same to bound the weight removed with the blue vertices, which yields the claimed bound on the total weight removed. ∎
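To make the quantity being optimized in Lemma 38 concrete, here is a brute-force reference implementation of the removal loss, usable only on tiny examples as a sanity check; the paper's algorithm instead obtains an approximately optimal removal via minimum weighted bisection, as described above.

```python
import itertools

def removal_loss(cluster, w, removed):
    """Total similarity weight lost when the vertex set `removed` leaves the
    cluster: every edge with at least one endpoint in `removed`, counted once."""
    loss = 0.0
    vertices = sorted(cluster)
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            if u in removed or v in removed:
                loss += w(u, v)
    return loss

def brute_force_red_removal(cluster, reds, w, k):
    """Exhaustively find k red vertices whose removal loses the least weight."""
    best_loss, best_set = float("inf"), None
    for removed in itertools.combinations(sorted(reds), k):
        loss = removal_loss(cluster, w, set(removed))
        if loss < best_loss:
            best_loss, best_set = loss, removed
    return best_loss, best_set
```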
Proof.
Start by running the construction of Lemma 35 on the extracted clustering. Then apply Lemma 37 to obtain a clustering with red-blue clustering graph $G_{\mathcal{C}}$ and underlying perfect red-blue matching $\mu$. We create the final clustering by merging the components of $G_{\mathcal{C}}$ into clusters. Since the maximum component size is bounded and the maximum cluster size after Lemma 37 is bounded, the maximum cluster size in the final clustering satisfies the required upper bound. This satisfies condition 2 of being good; in addition, the clustering is fair by Proposition 34.
Finally, we use the fact that we moved only a small number of vertices out of any cluster, and that we only move vertices of a given color if the cluster contains sufficiently many vertices of that color. Then, by Lemma 38, we lose at most a small fraction of the weight from any cluster. This satisfies the remaining condition, and therefore the clustering is good as desired. ∎
Appendix F Additional experimental results for revenue
We have conducted experiments on the four datasets for the revenue objective as well. Table 5 shows the ratio achieved by the fair tree built by running average-linkage on different fairlet decompositions. We run Algorithm 1 on the subsamples with Euclidean distances and then convert the distances into similarity scores. We test the performance of the initial random fairlet decomposition and of the final fairlet decomposition found by Algorithm 1 for the revenue objective, using the converted similarity scores.
Samples | 100 | 200 | 400 | 800 | 1600 |
---|---|---|---|---|---|
CensusGender, initial | |||||
final | |||||
CensusRace, initial | |||||
final | |||||
BankMarriage, initial | |||||
final | |||||
BankAge, initial | |||||
final |
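For reference, the distance-to-similarity conversion can be sketched as below; the specific $1/(1+d(i,j))$ form is an illustrative assumption of ours, as the exact transformation is the one specified in the main text.

```python
import numpy as np

def similarity_matrix(points):
    """Convert pairwise Euclidean distances into similarity scores.
    The 1/(1 + d) transformation is assumed here for illustration only."""
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    return 1.0 / (1.0 + distances)
```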
Appendix G Additional experimental results for multiple colors
We ran experiments with multiple colors; the results are analogous to those in the main body. We tested both the Census and Bank datasets, with age as the protected feature. For both datasets we set 4 age ranges to get 4 colors and set the fairness cap accordingly. We ran the fairlet decomposition of [2] and compared the fair hierarchical clustering's performance to that of average-linkage. The age ranges and the number of data points belonging to each color are reported in Table 6; colors are numbered in decreasing order of the number of points of that color. Vanilla average-linkage was found to be unfair: if we take the layer of clusters in the tree that is only one level above the leaves, there is always a cluster that violates the cap in the definition of α-capped fairness, showing the tree to be unfair.
Dataset | Color 1 | Color 2 | Color 3 | Color 4 |
---|---|---|---|---|
CensusMultiColor | ||||
BankMultiColor |
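To make the fairness check above concrete, the following sketch tests α-capped fairness of a clustering, under the assumption that a cluster violates the cap when more than an α fraction of its points share a single color.

```python
from collections import Counter

def is_alpha_capped_fair(clusters, color_of, alpha):
    """Check that no cluster has more than an alpha fraction of any color."""
    for cluster in clusters:
        counts = Counter(color_of[v] for v in cluster)
        if max(counts.values()) > alpha * len(cluster):
            return False
    return True
```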
As in the main body, in Table 7 we show, for each dataset, the ratio both at the time of initialization (Initial) and after using the local search algorithm (Final), where the ratio compares the performance of the tree built on top of the fairlets with that of the tree built directly by average-linkage.
Samples | 200 | 400 | 800 | 1600 | 3200 | 6400 |
---|---|---|---|---|---|---|
CensusMultiColor, initial | ||||||
final | ||||||
BankMultiColor, initial | ||||||
final |
Table 8 shows the performance of trees built by average-linkage based on different fairlets, for the revenue objective. As in the main body, the similarity score between any two points is obtained from their distance by the same transformation as before. The entries in the table are the mean and standard deviation of the ratio between the fair tree's performance and the vanilla average-linkage tree's performance. This ratio was calculated both at the time of initialization (Initial), when the fairlets were found randomly, and after Algorithm 1 terminated (Final).
Samples | 200 | 400 | 800 | 1600 | 3200 |
---|---|---|---|---|---|
CensusMultiColor, initial | |||||
final | |||||
BankMultiColor, initial | |||||
final |
Samples | 200 | 400 | 800 | 1600 | 3200 | 6400 |
---|---|---|---|---|---|---|
CensusMultiColor | 0.43 | 1.76 | 7.34 | 35.22 | 152.71 | 803.59 |
BankMultiColor | 0.43 | 1.45 | 6.77 | 29.64 | 127.29 | 586.08 |