Proportionally Representative Clustering
Abstract
In recent years, there has been a surge in effort to formalize notions of fairness in machine learning. We focus on centroid clustering—one of the fundamental tasks in unsupervised machine learning. We propose a new axiom, “proportionally representative fairness” (PRF), designed for clustering problems, which requires that the selection of centroids reflect both the distribution of the data points and how tightly they are clustered together. Our fairness concept is not satisfied by existing fair clustering algorithms. We design efficient algorithms to achieve PRF both for unconstrained and discrete clustering problems. Our algorithm for the unconstrained setting is also the first known polynomial-time approximation algorithm for the well-studied Proportional Fairness (PF) axiom [14]. Our algorithm for the discrete setting also matches the best known approximation factor for PF.
1 Introduction
As artificial intelligence and machine learning serve as the ‘new electricity’ of systems, it is becoming increasingly important to ensure that AI systems embody societal values such as privacy, transparency, and fairness. Fairness is a growing concern, as AI systems are used to make decisions that can critically affect our daily lives, finances, and careers [17]. The impulse to tackle the issue of fairness in AI is prominently reflected in several governmental policy documents [18, 3, 24], the emergence of dedicated conferences on Ethical AI, and the formation of AI ethics boards by various big tech companies. Over the past years, several developments have taken place in the theory and application of fairness in supervised learning. One prominent approach is to use information derived from labelled data in supervised learning to formalize some notions of equity within and across (pre-specified and) protected attributes [21]. In contrast, in several models of unsupervised learning, protected attributes or their labelling are unknown. Fairness in unsupervised machine learning is becoming important with the growth of data collection both in public (e.g., via IoT devices) and private sectors (e.g., map services and social media). Unfettered and irresponsible use of such data has the danger of increasing inequity [37].
We focus on fairness in clustering, which is one of the most widely applied tasks in unsupervised machine learning. A (centroid selection) clustering problem has a metric space $\mathcal{X}$ with a distance measure $d$, a multiset $N$ of $n$ data points (agents), a set $M \subseteq \mathcal{X}$ of candidate centers, and a positive integer $k$. In many clustering problems, each data point corresponds to an individual’s attribute profile; thus, we refer to the data points as agents. The goal is to find a size-$k$ subset of centers, $W \subseteq M$. Such centroid selection problems can be widely applied to various applications including recommender systems and data compression. The problem also captures social planning scenarios where facility locations need to be chosen to serve a set of agents.
In such a (centroid selection) clustering problem, there are two versions of the problem. In the discrete version, the set of candidate centers $M$ is a finite subset of $\mathcal{X}$. In the unconstrained or continuous setting, $M = \mathcal{X}$. We first focus on the unconstrained setting as done by Micha and Shah [36]. Later we discuss how our concepts and ideas extend to the discrete setting. For a location $x$ and set of locations $W$, we will denote $\min_{w \in W} d(x, w)$ as $d(x, W)$. Standard clustering solutions include $k$-center, $k$-medians, and $k$-means. (Footnote 1: The $k$-center solution outputs a size-$k$ subset $W \subseteq M$ minimizing $\max_{i \in N} d(i, W)$. The $k$-medians solution outputs a size-$k$ subset minimizing $\sum_{i \in N} d(i, W)$. The $k$-means solution outputs a size-$k$ subset minimizing $\sum_{i \in N} d(i, W)^2$.) These standard objectives can be viewed as achieving some form of global welfare but are not well-suited for proportional representation and fairness concerns. The problems of computing outcomes optimising these objectives are also NP-hard [34, 4].
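To make the three objectives concrete, the following sketch computes each of them for a given set of centers. The function names are our own, and `dist` is assumed to be any metric (here, Euclidean).

```python
import math

def dist(a, b):
    # Euclidean distance between two points given as coordinate tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def d_to_set(x, W):
    # d(x, W): distance from location x to the nearest center in W.
    return min(dist(x, w) for w in W)

def k_center_cost(N, W):
    # k-center objective: maximum agent-to-nearest-center distance.
    return max(d_to_set(i, W) for i in N)

def k_medians_cost(N, W):
    # k-medians objective: sum of agent-to-nearest-center distances.
    return sum(d_to_set(i, W) for i in N)

def k_means_cost(N, W):
    # k-means objective: sum of squared agent-to-nearest-center distances.
    return sum(d_to_set(i, W) ** 2 for i in N)
```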
We seek a suitable concept for clustering that captures general fairness principles such as non-discrimination, equality of opportunity, and equality of outcome. In many applications, the data points correspond to real individuals or their positions or opinions. Such individuals may expect the clustering outcome to capture natural fairness requirements. Previous axiomatic studies of clustering have measured fair representation by the distance from each data point to the nearest centroid. We initiate a proportional representation perspective on fairness in clustering. In capturing these desiderata axiomatically, we require a property in which representation guarantees depend on the number (or proportion) of data points and how “tightly” they are clustered together.
Our perspective is motivated by applications in which groups of individuals (to which the data points correspond) may expect to receive centroid representation proportional to their size. For instance, the centroids may be citizens selected to make representative decisions on behalf of the public, as is done in a jury or a sortition panel. Similarly, computing a proportionally fair clustering is also meaningful in facility location, where the location of the facility should depend on the density and the number of people it is serving. More generally, our axioms and algorithms are well-motivated in any setting in which one may want to take a representative sample from datapoints between which a distance metric is well-defined. In this paper, we consider the following fundamental research question. What is a meaningful concept of proportional fairness in centroid clustering? Is such a concept guaranteed to be achievable?
Contributions.
In contrast to traditional optimization objectives in clustering and centroid selection, we take an axiomatic fairness perspective. We first identify a potential shortcoming of some recently introduced concepts that aim to capture proportional fairness for clustering; in particular, we show how the concepts do not capture an intuitive and minimal requirement of proportional representation called unanimous proportionality. We then propose a new axiom called proportional representative fairness (PRF), which is our central conceptual contribution.
We propose PRF for both the unconstrained and discrete clustering setting. We discuss how PRF overcomes some of the drawbacks of the previously introduced concepts. In particular, it implies the minimal requirement of unanimous proportionality. PRF has several other desirable features: a PRF outcome is guaranteed to exist, the axiom does not require a preset specification of protected groups and it is robust to outliers and multiplicative scaling. We show that none of the traditional clustering algorithms or previously introduced algorithms for fair clustering satisfy PRF.
We design polynomial-time algorithms to achieve PRF both in the unconstrained and discrete settings. (Footnote 2: The results for unconstrained and discrete settings are incomparable in the following sense. The guarantee of existence of a fair outcome in the discrete setting guarantees the existence of a fair outcome in the unconstrained setting; however, a polynomial-time algorithm for computing a fair outcome in the discrete setting does not imply the same for the unconstrained setting.) We call our algorithms SEAR (Spatial Expanding Approval Rule) as they are inspired by using core ideas from the ordinal multi-winner voting Expanding Approvals Rule (EAR) of Aziz and Lee [7] but carefully use spatial information to make synchronized decisions. We prove that our algorithm for the unconstrained setting also gives 3-approximate Proportional Fairness (PF) [14], and hence is the first known polynomial-time approximation algorithm for PF on general metric spaces. We also prove that our algorithm for the discrete setting matches the best known approximation factor of $1+\sqrt{2}$ for PF, thereby providing no compromise on PF approximation but additionally satisfying PRF. We summarize our theoretical contributions in Table 1 via a comparison of our algorithms with the best guarantees previously known. All missing proofs can be found in Appendix A.
|  | Unconstrained |  |  | Discrete |  |  |
|---|---|---|---|---|---|---|
|  | $\alpha$-PF | PRF | Complex. | $\alpha$-PF | PRF | Complex. |
| Our Algorithms (SEAR) | $3$ | ✓ | in P | $1+\sqrt{2}$ | ✓ | in P |
| Greedy Capture | $1+\sqrt{2}$ | – | NP-hard | $1+\sqrt{2}$ | – | in P |
2 Related Work
2.1 Fair clustering
Fairness in machine learning and computer systems is a vast topic [27]—for an overview, see Chouldechova and Roth [17] or the survey by Mehrabi et al. [35]. We focus on the fundamental learning problem of clustering, which is a well-studied problem in computer science, statistics, and operations research [25]. Our approach is axiomatic, focused on the concept of proportional fairness (or representation), and does not rely on the use of a pre-specified or protected set of attributes. Thus, the most closely related papers are those of Chen et al. [14], Micha and Shah [36], Li et al. [33], and Jung et al. [28], which we review below.
Chen et al. [14] proposed the concept of proportional fairness (PF) for the discrete setting, which is based on the entitlement of groups of agents of large enough size. They reasoned that PF is a desirable fairness concept but it is not guaranteed to be achievable for every instance. On the other hand, they showed that outcomes satisfying reasonable approximations of PF are guaranteed to exist and can be computed in polynomial time via the ‘Greedy Capture’ algorithm.
Micha and Shah [36] analyzed the proportional fairness axiom in the unconstrained setting and presented similar results to those obtained by Chen et al. [14]. They showed that the natural adaptation of the Greedy Capture algorithm finds an approximately PF outcome in Euclidean space. On the other hand, they showed that checking whether a PF outcome exists is NP-hard and that the clustering returned by the ‘greedy capture’ algorithm cannot be computed in polynomial time unless P = NP. Li et al. [33] consider the same discrete clustering setting as Chen et al. [14] and propose a stronger version of proportional fairness called core fairness. They present results on lower and upper approximation bounds for various restrictions of the discrete setting.
Jung et al. [28] consider the discrete clustering problem in which the set of agents and candidate centers coincide. For this setting, they propose a fairness concept that has a similar goal as that of Chen et al. [14]. In subsequent work, it has been referred to as individual fairness [42, 12]. Individual fairness requires that, for each agent $i$, there is a selected center that is at most a distance $r(i)$ from $i$, where $r(i)$ is the smallest radius around $i$ that includes $\lceil n/k \rceil$ agents.
In a paper inspired by our work, Kellerhals and Peters [30] analyze the key algorithms and axioms in our paper in further detail and show that the axioms that we propose also imply approximations of various fairness concepts introduced in the literature. Drawing an analogy to multiwinner voting axioms, they refer to our proposed axiom PRF for the discrete setting as mPJR and also define a weaker axiom which they call mJR. They take a deep dive into the logical relations between the axioms and approximate notions of proportional fairness, individual fairness, core fairness, and another core-inspired generalization of proportional fairness [22]. They prove that any approximation to PF is also an approximation to individual fairness and vice versa. In addition to the algorithms we formalize in this paper, they also consider Greedy Capture [14] and a randomized version of this algorithm known as Fair Greedy Capture [22]. A key takeaway from our paper is reaffirmed by their work, namely that our algorithms are the best-known algorithms for satisfying proportional representation axioms while also giving non-trivial approximation guarantees with respect to other fairness notions.
There are also several concepts and algorithms in clustering where the protected groups are pre-specified based on protected attributes [1, 2, 10, 9, 31, 15, 16, 23]. In contrast, we are seeking fairness properties that apply to arbitrarily defined groups of all possible sizes. In our proposed axiom, these groups are endogenously determined from data points; in some real-world settings, this is a legal requirement since it can be forbidden by law to have decision-making processes use certain attributes as inputs. Clustering can also be viewed as a facility location problem [26], where multiple facilities (centers) are to be located in a metric space; for a recent overview, see the survey by Chan et al. [13].
2.2 Comparison with multi-winner voting
The clustering problem has especially natural connections with multi-winner voting: agents and centers in the clustering context correspond to voters and candidates in multi-winner voting.
The literature on proportionally representative multi-winner voting includes axioms such as JR, PJR, EJR, and FJR for dichotomous preferences [6, 32], axioms such as PSC for rankings [20], and axioms such as Generalised PSC [6, 7] or Rank-PJR+ [11] for weak orders. Naively applying these axioms to our spatial problem results in a reduction from our unconstrained problem to a multiwinner voting problem with an infinite number of candidates. Conversely, our fairness (or proportional representation) axioms take into account the additional distance-based cardinal information available when the candidates belong to a general metric space. Hence, our axioms formalize a notion of proportional representation for $d$-dimensional Euclidean space for any $d \geq 1$. In contrast to ordinal algorithms for multi-winner voting (such as the Expanding Approvals Rule (EAR) [7] or Single Transferable Vote), the algorithms in this paper carefully account for additional spatial information to make synchronized decisions.
Kalayci et al. [29] consider a similar setting to ours but explore to what extent algorithms that are agnostic to cardinal spatial information, and only use an ordinal ranking of pairwise distances, can achieve approximate fairness. They define approximate proportional representation, a powerful strengthening of approximate core fairness [33]. Their notion captures (and strengthens) the ideal of core stability in this setting: no coalition of agents large enough to “deserve” $\ell$ centroids should be able to propose an alternative set of $\ell$ centroids which they all prefer to their individually best size-$\ell$ subset of the selected centroids. They then parameterize this idea by the required size of a coalition which deserves $\ell$ centroids and by the degree to which the alternative centroid selection better represents the coalition. Similar to ours, their property effectively addresses the shortfalls of existing concepts with respect to proportional representation (see Definition 2). In contrast to theirs, the properties we study do not require parameterization and are guaranteed to exist. Kalayci et al. [29] show that the (ordinal multi-winner) Expanding Approvals Rule (EAR) of Aziz and Lee [7] achieves constant-factor guarantees for their representation notion and also a constant-factor approximation to proportional fairness. They are able to improve on these bounds with a modified version of Greedy Capture when cardinal information is known.
Proportional representation in multi-winner voting (and more generally participatory budgeting) has also been studied when agents have cardinal utilities. Based on prior work for approvals [6], Peters et al. [38] study desirable axioms such as FJR and EJR for positive, additive utilities, and rules such as the Method of Equal Shares (MES). We make no assumption of additive utilities, nor do we ever consider total or average distances. Under additive cardinal utilities, a group of agents of proportion large enough to “deserve” multiple candidates can (in some instances) be satisfied with respect to EJR by a single candidate which gives some agent very high utility. Because our model and axioms take an agnostic approach to cardinal utility values, we instead place requirements on the selection of a proportional number of candidates. One possible approach to our problem is to take distances as a proxy for utilities and to apply a rule like MES. However, translating to cardinal utilities necessitates the loss of important spatial information, as different cardinal utilities consistent with the distances give different MES outcomes. One possibility is to set utility to be the negation of the distance, but the setting and axioms of Peters et al. assume positive, additive utilities.
3 Towards an Axiom Capturing Proportional Representation
Proportional fairness for clustering has been considered in a series of recent papers. Chen et al. [14] first proposed a proportional fairness axiom, which requires that there be no set of agents of size at least $\lceil n/k \rceil$ that can find an unselected center that is “better” (i.e., located closer than the closest selected center) for each of them.
Definition 1 (Proportional Fairness [14]).
$W \subseteq M$ with $|W| = k$ satisfies proportional fairness if, for all $S \subseteq N$ with $|S| \geq \lceil n/k \rceil$ and for all $y \in M$, there exists $i \in S$ with $d(i, W) \leq d(i, y)$.
The idea of proportional fairness was further strengthened to core fairness, which requires that there is no set of agents of size at least $\lceil n/k \rceil$ that can demand a more preferred location for a center [33]. One rationale for proportional fairness is explained in the following example, which was first presented by Micha and Shah [36] and then reused by Li et al. [33]. We reuse the same example. Consider a set of $n = 11{,}000$ data points/agents and $k = 11$. The agents are divided into 11 clusters, each of which is densely packed. One cluster of agents has size 10,000. The other 10 clusters have sizes 100 each. The big cluster is very far from the smaller clusters. The small clusters are relatively near to each other. Micha and Shah [36] and Li et al. [33] are of the view that any set of $k = 11$ centers that satisfies a reasonable notion of proportional fairness would place 10 centers for the big cluster and 1 center serving the small clusters. Next, we point out that proportional fairness and core fairness do not necessarily satisfy this minimal requirement, which we formalize via the unanimous proportionality axiom below.
Definition 2 (Unanimous Proportionality (UP)).
$W$ with $|W| = k$ satisfies unanimous proportionality if, whenever $S \subseteq N$ with $|S| \geq \ell\lceil n/k \rceil$ and each agent in $S$ has the same location $p$, the $\ell$ possible (candidate) locations closest to $p$ are selected.
UP captures a principle which Chen et al. [14] highlighted as desirable — in their words, one which requires that “any subset of at least $\ell\lceil n/k \rceil$ individuals is entitled to choose $\ell$ centers.” (Footnote 3: In the term unanimous proportionality, unanimous refers to the condition where a set of agents have the same location, and proportionality refers to the requirement that such agents get an appropriate number of centers in proportion to the number of agents.) The following example shows that proportional fairness [14] does not imply Unanimous Proportionality. Similarly, it can be shown that the fairness concept proposed by Jung et al. [28] does not imply unanimous proportionality.
Example 1 (Proportional fairness and core fairness).
Suppose $n = 11{,}000$ agents are located in the interval $[0, 1]$ and $k = 11$ (so $\lceil n/k \rceil = 1000$). 10,000 agents are at point 0 and 1000 agents are at point 1. UP requires that 10 centers are at or just around point 0 and 1 center is at point 1. However, placing 1 center at point 0 and 10 centers at point 1 satisfies the proportional fairness concept of Chen et al. [14] and also the core fairness concept of Li et al. [33].
The example above shows that there is a need to formalize new concepts in order to capture proportionally representative fairness. The property should imply unanimous proportionality but should also be meaningful if no two agents/data points completely coincide in their location. One reason why the existing fairness concepts do not capture UP is that they make an implicit assumption that an agent only cares about the distance to the nearest center and not about how many centers are nearby. As discussed in the introduction, this assumption is not necessarily accurate in contexts such as data abstraction, sortition, and political representation.
In addition to capturing proportional representation in a robust sense, we would also like our axiom (and algorithms) to guarantee the existing proportional fairness concepts, at least in an approximate fashion. Since a PF outcome may not exist, Chen et al. [14] and Micha and Shah [36] studied an approximate notion of proportional fairness in the discrete and unconstrained settings, respectively.
Definition 3 ($\alpha$-approximate Proportional Fairness).
$W \subseteq M$ with $|W| = k$ satisfies $\alpha$-approximate PF (or $\alpha$-PF) if, for all $S \subseteq N$ with $|S| \geq \lceil n/k \rceil$ and for all $y \in M$, there exists $i \in S$ with $d(i, W) \leq \alpha \cdot d(i, y)$.
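As an illustration of Definition 3, the following sketch (with our own function naming) checks whether a particular group $S$ and deviating center $y$ witness a violation of $\alpha$-PF; an outcome is $\alpha$-PF exactly when no sufficiently large group has such a witness.

```python
import math

def violates_alpha_pf(S, y, W, n, k, alpha, dist):
    # S: agent locations in the group; y: a candidate deviation; W: selected
    # centers; dist: the metric. The pair (S, y) witnesses a violation of
    # alpha-PF if the group is large enough and *every* agent in S is more
    # than a factor alpha farther from all of W than from y.
    if len(S) < math.ceil(n / k):
        return False
    return all(min(dist(i, w) for w in W) > alpha * dist(i, y) for i in S)
```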
We seek to devise an axiom which is compatible with approximate PF, and subsequently to design algorithms which satisfy our axiom while also providing good approximate guarantees with respect to PF. We close the section with an example presented by Micha and Shah [36] that shows that a PF outcome may not exist. We will use the same example to illustrate our new fairness concept in the next section.
Example 2.

Consider an instance of unconstrained clustering with $n = 6$ and $k = 3$, so that $\lceil n/k \rceil = 2$. Agents 1, 2, and 3 are positioned close together (say, at the vertices of a small equilateral triangle), and agents 4, 5, and 6 are positioned analogously, far away from the first group (see Figure 1). Micha and Shah [36] show that, for a suitable choice of distances, no outcome of $k = 3$ centers satisfies PF in this instance: one of the two groups receives at most one center, and then some pair of agents in that group can propose an alternative location that is strictly closer to both of them.
4 Fairness for the Unconstrained Setting
We propose a new concept, Proportionally Representative Fairness (PRF), that captures proportional representation fairness concerns. The intuition behind the concept is based on two natural principles: (1) a set of agents that is large enough deserves to have a proportional number of centers ‘near’ it, and (2) the requirement of proximity of the nearby centers to a subset of agents should also depend on how densely the agents in that subset are clustered together. We use these two principles to propose a new fairness concept.
Definition 4 (Proportionally Representative Fairness (PRF) for Unconstrained Clustering).
An outcome $W$ with $|W| = k$ satisfies Proportionally Representative Fairness (PRF) if the following holds. For each positive integer $\ell$ and any set of agents $S \subseteq N$ of size at least $\ell\lceil n/k \rceil$ such that the maximum distance between pairs of agents in $S$ is $y$, the following holds: $|\{w \in W : d(i, w) \leq y \text{ for some } i \in S\}| \geq \ell$.
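To unpack the definition, the following sketch (our own helper, not part of the paper's formal machinery) checks the PRF condition for one particular group $S$: the group's entitlement $\ell$ is the largest integer with $|S| \geq \ell\lceil n/k \rceil$, its neighborhood distance $y$ is its diameter, and we count the selected centers within $y$ of some member of $S$. Checking PRF in full would quantify over all subsets of agents.

```python
import math

def prf_ok_for_group(S, W, n, k, dist):
    # Entitlement of the group: the largest l with |S| >= l * ceil(n/k).
    ell = len(S) // math.ceil(n / k)
    if ell == 0:
        return True  # the group is too small to demand any center
    # y: maximum distance between pairs of agents in S (the group's diameter).
    y = max(dist(a, b) for a in S for b in S)
    # Count the selected centers within y of *some* agent in S.
    near = sum(1 for w in W if any(dist(i, w) <= y for i in S))
    return near >= ell
```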
PRF has the following features: it implies unanimous proportionality; it is oblivious to and does not depend on the specification of protected groups or sensitive attributes; it is robust to outliers in the data, since the fairness requirements are for groups of points that are sizable enough; and it is scale invariant to multiplication of distances (i.e., PRF outcomes are preserved if all the distances are multiplied by a constant).
Next, let us illustrate an example where proportional fairness is not achievable but one of the most natural clustering outcomes satisfies PRF.
Example 3 (Requirements of PRF).
We revisit Example 2 to illustrate the requirements that PRF imposes. For each set $S$ of size at least $\ell\lceil n/k \rceil = 2\ell$, PRF requires $\ell$ centers in the relevant neighborhood of agents in $S$ (see Figure 2). Therefore, for agents 1 and 2, PRF requires that one selected candidate center should be in one of the circles of radius $d(1,2)$ around 1 or 2. For the set of agents $\{1, 2, 3\}$, PRF also requires that one candidate center should be in one of the (larger) circles around 1, 2, or 3. A natural solution is to have one center located between agents 1, 2, and 3, another center located between agents 4, 5, and 6, and the final center located somewhere not too far from all agents (e.g., between the two groups of agents). Such solutions are intuitively reasonable and fair and, furthermore, satisfy PRF; however, as mentioned in Example 2, this instance does not admit a proportionally fair solution.
In addition to capturing natural outcomes when PF outcomes may not exist, PRF actually guarantees a $\frac{3+\sqrt{17}}{2}$-PF outcome ($\approx 3.56$-PF) in the unconstrained setting on general metric spaces.
Proposition 1.
For metric spaces, if an outcome $W$ with $|W| = k$ satisfies PRF for Unconstrained Clustering, then $W$ is $\frac{3+\sqrt{17}}{2}$-approximate PF, and there exists an instance for which this bound is tight.
Proof.
Let $W$ be an outcome which satisfies PRF for Unconstrained Clustering. Fix $S \subseteq N$ with $|S| \geq \lceil n/k \rceil$ and a location $c$.

Let $y$ be the maximum distance between pairs of agents in $S$, i.e., $y = \max_{i', j' \in S} d(i', j')$, and let $i^*$ and $j^*$ be two agents in $S$ that maximize this distance. We assume that $y > 0$; otherwise, if $y = 0$, then every agent in $S$ is located at the same point, and by PRF there exists $w \in W$ such that $d(i', w) = 0$ for all $i' \in S$. Denote by $j$ the agent in $S$ for which the distance to $c$ is maximized. We have that $d(j, c) \geq \max\{d(i^*, c), d(j^*, c)\} \geq y/2$.

Next, since $|S| \geq \lceil n/k \rceil$ and every pair of agents in $S$ are a distance of at most $y$ from each other, it follows from $W$ satisfying Unconstrained PRF that there exist $w \in W$ and $i \in S$ such that $d(i, w) \leq y$. Using this along with the triangle inequality, we can establish the following upper bound: $d(j, W) \leq d(j, w) \leq d(j, c) + d(c, i) + d(i, w) \leq d(j, c) + d(i, c) + y$. From these observations, the minimum PF approximation ratio obtained by $i$ or $j$ becomes
$$\min\left\{\frac{d(i, W)}{d(i, c)},\ \frac{d(j, W)}{d(j, c)}\right\} \leq \min\left\{\frac{y}{d(i,c)},\ 1 + \frac{d(i,c) + y}{d(j,c)}\right\} \leq \min\left\{\frac{y}{d(i,c)},\ 1 + \frac{d(i,c) + y}{\max\{y/2,\ d(i,c)\}}\right\} \leq \frac{3+\sqrt{17}}{2}.$$
Note that the second inequality above holds since replacing $d(j, c)$ by the non-negative number $\max\{y/2, d(i, c)\}$, which it weakly exceeds, can only weakly increase the expression. It can be easily verified that this expression is maximized when $d(i, c) = \frac{\sqrt{17}-3}{4} \cdot y$, hence the final transition. Thus, for any location $c$, there exists $i' \in S$ such that $d(i', W) \leq \frac{3+\sqrt{17}}{2} \cdot d(i', c)$. We defer the proof that this bound is tight to Appendix B. ∎
Next, we highlight that the well-known $k$-means or $k$-median solutions do not necessarily satisfy PRF. Example 4 is adapted from an example by Chen et al. [14].
Example 4 ($k$-means does not satisfy PRF).

Consider Figure 3, in which there are $n/3$ agents uniformly distributed on the perimeter of each of the three circles (two small circles and one much larger circle). If $k = 3$, a $k$-means or $k$-median solution will place one center by the small circles and two centers in the big circle. In contrast, PRF requires that each of the circles gets its own centroid. It can also be shown that $k$-center does not satisfy PRF.
At face value, it is not clear whether an outcome satisfying PRF exists, as similar concepts such as PF cannot always be guaranteed [36]. Secondly, even if a PRF outcome is guaranteed to exist, the complex constraints seem computationally challenging: PRF imposes conditions on an exponential number of subsets of agents, and for each subset of agents, it enforces representational constraints pertaining to an infinite number of neighborhood distances. Perhaps surprisingly, we show that not only is a PRF outcome guaranteed to exist, it can be computed in polynomial time.
The algorithm intuitively works as follows. Firstly, instead of considering infinitely many possible centers, we restrict our attention to the candidate centers that coincide with the agents’ locations. Each agent is given an initial weight of 1. This weight dynamically decreases over the course of the algorithm. The algorithm can be viewed as gradually expanding the neighborhoods of each of the agents. (Footnote 4: The idea of expanding neighborhoods is used in other algorithms in the literature, such as the Greedy Capture algorithm for the unconstrained setting studied by Micha and Shah [36]. Unlike our algorithm, it does not involve reweighting of agents. More importantly, Greedy Capture does not satisfy PRF and is not polynomial time.) Our algorithm uses reweighting of agents as is done by EAR for multi-winner voting [7]. Our algorithm can be viewed as using key ideas from the EAR for committee voting while taking into account additional spatial information to make synchronized decisions. Instead of continuously expanding the neighborhood, we only consider at most $O(n^2)$ possible values of the neighborhood radius—the pairwise distances between agents. If there exists some location in the candidate set that is in the intersection of the current neighborhoods of a group of agents with total weight of at least $n/k$, we select one such candidate and reduce the group's collective weight by $n/k$. (Footnote 5: We point out that the reweighting process can be done in any manner that results in the group's total weight reducing by $n/k$. Two natural possibilities would be (i) reducing weights as equally as possible, or (ii) reducing weights proportionally to agents' distances from the selected candidate.) The process is continued until $k$ centers are chosen. The algorithm is formally specified as Algorithm 1; a compact sketch is given below. The subsequent theorem is the main result of the present section and summarizes the theoretical guarantees of Algorithm 1.
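The following is a minimal sketch of the procedure just described (our own implementation and naming, using equal-as-possible reweighting, one of the options mentioned in the footnote); `candidates` is the set of candidate locations, which in the unconstrained setting is simply the agents' own locations.

```python
def reduce_equally(weight, supp, quota):
    # Lower supporters' weights as equally as possible so that their
    # collective weight drops by exactly `quota` (water-filling: find a
    # common deduction delta with sum_i min(weight[i], delta) = quota).
    lo, hi = 0.0, max(weight[i] for i in supp)
    for _ in range(60):  # bisection on delta is ample for a sketch
        mid = (lo + hi) / 2.0
        if sum(min(weight[i], mid) for i in supp) < quota:
            lo = mid
        else:
            hi = mid
    for i in supp:
        weight[i] -= min(weight[i], hi)

def sear(points, k, dist, candidates=None):
    # Spatial Expanding Approval Rule (sketch). In the unconstrained
    # setting, the candidate centers coincide with the agents' locations.
    n = len(points)
    if candidates is None:
        candidates = list(points)
    quota = n / k                      # weight consumed per selected center
    weight = [1.0] * n                 # every agent starts with weight 1
    # Only the agent-candidate distances matter as neighborhood radii.
    radii = sorted({dist(p, c) for p in points for c in candidates})
    selected, remaining = [], list(candidates)
    for r in radii:
        made_progress = True
        while made_progress and len(selected) < k:
            made_progress = False
            for c in list(remaining):
                # Supporters: agents whose radius-r neighborhood contains c.
                supp = [i for i in range(n) if dist(points[i], c) <= r]
                if sum(weight[i] for i in supp) >= quota - 1e-9:
                    selected.append(c)
                    remaining.remove(c)
                    reduce_equally(weight, supp, quota)
                    made_progress = True
                    break
        if len(selected) == k:
            break
    return selected
```

For instance, `sear(points, k, dist)` with `dist` the Euclidean metric from the earlier sketch returns $k$ centers drawn from the agents' own locations.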
Theorem 1.
Algorithm 1 terminates in polynomial time and returns a set of $k$ centers that satisfies PRF and 3-approximate PF.
We will now formally prove two key lemmas used to prove Theorem 1.
Lemma 1.
Algorithm 1 returns an outcome that satisfies PRF.
Proof.
Suppose for contradiction that the algorithm finds a set of centers $W$ that violates PRF. In that case, there exists a set of agents $S$ of size at least $\ell\lceil n/k \rceil$ such that the maximum distance between pairs of agents in $S$ is $y$ but the number of locations in $W$ that are within $y$ of some agent in $S$ is at most $\ell - 1$. In that case, consider the snapshot of the algorithm when neighborhood distance $y$ is considered (note that $y$ is a distance between two agents, so it is one of the radii considered). At this point, at most $\ell - 1$ selected locations are within $y$ of some agent in $S$. Since agents in $S$ have only used their weight towards selecting locations within $y$ of some agent in $S$ up to this point, it follows that the collective weight of $S$ is at least $\ell\lceil n/k \rceil - (\ell - 1) \cdot n/k \geq n/k$. Therefore, the agents in $S$ still have total weight at least $n/k$ to select one more location that is a distance of at most $y$ from some agent in $S$ (in particular, any candidate location coinciding with an agent in $S$ is within $y$ of all agents in $S$). Hence, at this stage, when the neighborhood distance is $y$, the algorithm would have selected at least one more center within distance $y$ than in the outcome $W$, a contradiction. ∎
Lemma 1 implies guaranteed existence of an outcome satisfying our axiom, which was one of our goals in devising PRF. In conjunction with Proposition 1, the lemma also gives us that the outcome returned by Algorithm 1 is $\frac{3+\sqrt{17}}{2}$-PF. This is noteworthy as it establishes Algorithm 1 as the first known polynomial-time approximation algorithm for PF on general metric spaces in the unconstrained setting. As we will show next, Algorithm 1 actually provides the stronger guarantee of 3-PF.
Lemma 2.
Algorithm 1 finds an outcome that satisfies 3-approximate PF.
Proof.
For an instance of unconstrained clustering, let $W$ be the outcome returned by Algorithm 1. Fix $S \subseteq N$ with $|S| \geq \lceil n/k \rceil$ and a location $c$. We will show that there exists an agent $i \in S$ and center $w \in W$ such that $d(i, w) \leq 3 \cdot d(i, c)$. We assume that the agents in $S$ do not all share a single location, since otherwise the algorithm selects a center at that common location (as in the proof of Lemma 1), so $d(i, w) = 0$ for some agent and center, and we are done.

At the start of the algorithm, the collective weight of agents in $S$ is $|S| \geq n/k$. When the algorithm terminates, the collective weight of $S$ is zero. Thus, in some iteration, the collective weight of $S$ decreases for the first time. We denote the centroid selected in this iteration by $w^*$ and denote an arbitrary agent of $S$ whose weight is decreased in this iteration by $i$. Let $p$ be the maximum distance between $i$ and any other agent in $S$.

We first point out that $d(i, w^*) \leq p$. To see this, suppose for contradiction that $d(i, w^*) > p$. Let $r$ be the neighborhood distance at which $w^*$ was selected; since $i$'s weight decreased, $d(i, w^*) \leq r$, and so $r > p$. Since $r$ and $p$ are both distances between pairs of agents, both belong to the sorted list of radii, and $p$ precedes $r$ in this list, since the list is in increasing order and by the assumption that $r > p$. Thus, when the algorithm considers neighborhoods of radius $p$, the collective weight of $S$ has not yet decreased, and the candidate location coinciding with agent $i$ is within distance $p$ of every agent in $S$, whose collective weight is at least $|S| \geq n/k$. Therefore, before transitioning to the next radius, the algorithm will either add this candidate to $W$ (in which case $d(i, W) = 0$ and we are done) or the weight of some agent in $S$ will decrease, contradicting the assumption that the collective weight of $S$ first decreases at radius $r > p$.

We divide the remainder of the argument into cases. First, consider the case in which $d(i, c) \geq p/3$. It follows that $d(i, w^*) \leq p \leq 3 \cdot d(i, c)$, and we are done. If, instead, $d(i, c) < p/3$, then consider the agent $j \in S$ such that $d(i, j) = p$. By the triangle inequality, we have that
$$d(j, w^*) \leq d(j, i) + d(i, w^*) \leq 2p < 3\,(p - d(i, c)) = 3\,(d(i, j) - d(i, c)) \leq 3 \cdot d(j, c).$$
∎
5 Fairness for the Discrete Setting
We presented PRF as a desirable concept for the unconstrained setting. It is not immediate how to adapt the concept to the discrete setting. Applying the unconstrained PRF definition verbatim to the discrete setting leads to issues of non-existence. On the other hand, some natural attempts to account for the coarseness of the candidate space $M$ can lead to a concept that is too weak. For example, one alternative is to require that an outcome $W$ be such that, for any set of agents $S$ of size at least $\ell\lceil n/k \rceil$ such that the maximum distance between pairs of agents in $S$ is $y$, the following holds: $|W \cap \bigcup_{i \in S} B(i, y)| \geq \min\{\ell,\ |M \cap \bigcup_{i \in S} B(i, y)|\}$, where $B(i, y)$ denotes a ball of radius $y$ around agent $i$. Alternatively, we could replace the right-hand side of the final inequality with $\ell$ and impose the requirement only when $|M \cap \bigcup_{i \in S} B(i, y)| \geq \ell$. Both of these versions are too weak. To see this, suppose $k = 1$, all agents are located at a single point $p$, and $p \notin M$; then neither version places any restriction on the facility location, and so neither even implies UP.
To resolve these issues, we need to take into account that the nearest candidate locations may be very far from a subset of agents. We adapt our PRF concept for the discrete setting via a careful adaptation of the PRF concept for the unconstrained setting.
Definition 5 (Proportionally Representative Fairness (PRF) for Discrete Clustering).
An outcome $W \subseteq M$ with $|W| = k$ satisfies PRF for Discrete Clustering if the following holds. For each positive integer $\ell$ and any set of agents $S \subseteq N$ of size at least $\ell\lceil n/k \rceil$: if there are $\ell$ candidates from $M$ that are at distance at most $y$ from all agents in $S$, then $|\{w \in W : d(i, w) \leq y \text{ for some } i \in S\}| \geq \ell$.
We begin by highlighting the fact that, in the discrete setting, PRF implies $(1+\sqrt{2})$-approximate PF, the closest PF approximation known to exist in the discrete setting [14]. Our proof follows similarly to that of Theorem 1 of Chen et al. [14].
Proposition 2.
For metric spaces, if an outcome $W$ with $|W| = k$ satisfies PRF for Discrete Clustering, then $W$ is $(1+\sqrt{2})$-approximate PF.
For the discrete setting, we propose a new algorithm (Algorithm 2). Algorithm 2 is similar to Algorithm 1, which was designed for the unconstrained setting. In fact, Algorithm 1 can be viewed as first setting $M$ to the multiset of candidate locations corresponding to the agents in $N$ and then running Algorithm 2. Similar to Algorithm 1, Algorithm 2 terminates in polynomial time, returns $k$ centers, and its output satisfies PRF.
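In code terms, the two algorithms differ only in the candidate set passed in. Continuing the sketch from Section 4 (with the hypothetical `sear` helper, and assuming `points`, `k`, `dist`, and a finite candidate set `M` are given):

```python
# Unconstrained setting (Algorithm 1): candidates are the agent locations.
centers_unconstrained = sear(points, k, dist)

# Discrete setting (Algorithm 2): candidates come from the given finite set M.
centers_discrete = sear(points, k, dist, candidates=M)
```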
Theorem 2.
Algorithm 2 terminates in polynomial time and returns a set of $k$ centers that satisfies PRF and $(1+\sqrt{2})$-approximate PF.
Notably, by satisfying PRF, Algorithm 2 matches the best known PF approximation factor in the discrete setting. As the following remark highlights, however, the existing algorithms do not capture PRF.
Remark 1.
Chen et al. [14] present Greedy Capture. An equivalent algorithm, called ALGg, is also presented by Li et al. [33]. However, these algorithms do not satisfy PRF. In fact, for given $k$, these algorithms may fail to output $k$ candidate locations. (Footnote 6: To see this, let $k = 3$ and let the agents be located at points 0 and 1. Chen et al.'s Greedy Capture and Li et al.'s ALGg will select candidate locations 0 and 1, but will not output a third location. Of course, this issue could be rectified by arbitrarily choosing a third candidate location—but then there is no guarantee that the set of locations would satisfy PRF.)
In Appendix F, we experimentally compare Algorithm 2 to ALGg and a variant of $k$-means. We show that it performs especially well when the mean squared distance of the agents is calculated based on their distance from not only the closest center but also the $q$ closest centers (for larger values of $q$). This provides another perspective indicating that, in situations where a “fair” clustering should account for an agent’s distance from more than just her closest center, our algorithm is better suited than the existing alternatives. In Appendix C, we consider two natural and stronger notions of PRF in the discrete setting (that also apply to the continuous setting). We show that the versions that we consider do not guarantee the existence of a fair outcome.
6 Discussion
We revisited a classical AI problem with a novel solution concept that builds on ideas from social choice and fair division. We proposed a natural fairness concept called PRF for clustering. PRF is guaranteed to exist and has desirable features: it implies unanimous proportionality, does not require the specification of protected groups, and is robust to outliers and multiplicative scaling. Even if our only focus is on the PF axiom, we make significant progress on efficient algorithms for approximate PF for general metric spaces. We did not focus on strategic aspects. Unfortunately, PRF and PF are incompatible in general with the incentive of an agent to report their location truthfully (see Appendix E). A natural next step is to identify stronger versions of PRF that still guarantee existence. It will be interesting to consider refinements of our algorithms which make more clever choices when selecting candidates with sufficient weighted support. Another direction for future research is to understand how much impact the constraint of PRF imposes on standard optimization objectives such as $k$-means, $k$-median, or $k$-center.
Acknowledgments
Haris Aziz is supported by the NSF-CSIRO project on “Fair Sequential Collective Decision-Making”. We thank John Dickerson and Bo Li for helpful comments.
References
- Abbasi et al. [2021] Abbasi, M., Bhaskara, A., Venkatasubramanian, S.: Fair clustering via equitable group representations. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 504–514, FAccT ’21, Association for Computing Machinery, New York, NY, USA (2021)
- Ahmadian et al. [2019] Ahmadian, S., Epasto, A., Kumar, R., Mahdian, M.: Clustering without over-representation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 267–275, ACM (2019)
- AI-Committee [2019] AI-Committee: The national artificial intelligence research and development strategic plan: 2019 update. Tech. rep., National Science and Technology Council, Washington, D.C. (2019)
- Aloise et al. [2009] Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75(2), 245–248 (2009)
- Arthur and Vassilvitskii [2007] Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pp. 1027–1035, SIAM (2007)
- Aziz et al. [2017] Aziz, H., Brill, M., Conitzer, V., Elkind, E., Freeman, R., Walsh, T.: Justified representation in approval-based committee voting. Social Choice and Welfare 48(2), 461–485 (2017)
- Aziz and Lee [2020] Aziz, H., Lee, B.: The expanding approvals rule: Improving proportional representation and monotonicity. Social Choice and Welfare 54(1), 1–45 (2020)
- Aziz et al. [2023] Aziz, H., Lee, B.E., Chu, S.M., Vollen, J.: Proportionally representative clustering. arXiv preprint arXiv:2304.13917 (2023)
- Backurs et al. [2019] Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., Wagner, T.: Scalable fair clustering. In: Proceedings of the 36th International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research, vol. 97, pp. 405–413, PMLR (2019)
- Bera et al. [2019] Bera, S.K., Chakrabarty, D., Flores, N., Negahbani, M.: Fair algorithms for clustering. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pp. 4955–4966 (2019)
- Brill and Peters [2023] Brill, M., Peters, J.: Robust and verifiable proportionality axioms for multiwinner voting. In: EC '23: The 24th ACM Conference on Economics and Computation, ACM (2023)
- Chakrabarty and Negahbani [2021] Chakrabarty, D., Negahbani, M.: Better algorithms for individually fair k-clustering. CoRR abs/2106.12150 (2021)
- Chan et al. [2021] Chan, H., Filos-Ratsikas, A., Li, B., Li, M., Wang, C.: Mechanism design for facility location problems: A survey. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI) (2021)
- Chen et al. [2019] Chen, X., Fain, B., Lyu, L., Munagala, K.: Proportionally fair clustering. In: Proceedings of the 36th International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research, vol. 97, pp. 1032–1041, PMLR (2019)
- Chierichetti et al. [2017] Chierichetti, F., Kumar, R., Lattanzi, S., Vassilvitskii, S.: Fair clustering through fairlets. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 5029–5037 (2017)
- Chiplunkar et al. [2020] Chiplunkar, A., Kale, S., Ramamoorthy, S.N.: How to solve fair k-center in massive data models. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, vol. 119, pp. 1877–1886, PMLR (2020)
- Chouldechova and Roth [2020] Chouldechova, A., Roth, A.: A snapshot of the frontiers of fairness in machine learning. Communications of the ACM 63(5), 82–89 (2020)
- Congressional-Research-Service [2021] Congressional-Research-Service: Artificial intelligence: Background, selected issues, and policy considerations. Tech. rep., Congressional Research Service Reports (2021)
- Dua and Graff [2017] Dua, D., Graff, C.: UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences (2017), URL http://archive.ics.uci.edu/ml
- Dummett [1984] Dummett, M.: Voting Procedures. Oxford University Press (1984)
- Dwork et al. [2012] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012, pp. 214–226, ACM (2012)
- Ebadian and Micha [2024] Ebadian, S., Micha, E.: Boosting sortition via proportional representation. arXiv preprint arXiv:2406.00913 (2024)
- Esmaeili et al. [2021] Esmaeili, S.A., Brubach, B., Srinivasan, A., Dickerson, J.P.: Fair clustering under a bounded cost. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
- European Commission [2020] European Commission: White paper on artificial intelligence: a European approach to excellence and trust. White Paper COM(2020) 65 final, European Commission, Brussels (Feb 2020)
- Everitt et al. [2009] Everitt, B.S., Landau, S., Leese, M.: Cluster Analysis. Wiley Publishing, 4th edn. (2009)
- Farahani and Hekmatfar [2009] Farahani, R.Z., Hekmatfar, M.: Facility location: concepts, models, algorithms and case studies. Springer Science & Business Media (2009)
- Friedman and Nissenbaum [1996] Friedman, B., Nissenbaum, H.: Bias in computer systems. ACM Trans. Inf. Syst. 14(3), 330–347 (Jul 1996)
- Jung et al. [2020] Jung, C., Kannan, S., Lutz, N.: Service in your neighborhood: Fairness in center location. In: 1st Symposium on Foundations of Responsible Computing, FORC 2020, June 1-3, 2020, Harvard University, Cambridge, MA, USA (virtual conference), LIPIcs, vol. 156, pp. 5:1–5:15 (2020)
- Kalayci et al. [2024] Kalayci, Y.H., Kempe, D., Kher, V.: Proportional representation in metric spaces and low-distortion committee selection. In: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, pp. 9815–9823, AAAI Press (2024)
- Kellerhals and Peters [2023] Kellerhals, L., Peters, J.: Proportional fairness in clustering: A social choice perspective. CoRR abs/2310.18162 (2023)
- Kleindessner et al. [2019] Kleindessner, M., Samadi, S., Awasthi, P., Morgenstern, J.: Guarantees for spectral clustering with fairness constraints. In: Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97, pp. 3458–3467, PMLR (2019)
- Lackner and Skowron [2023] Lackner, M., Skowron, P.: Multi-Winner Voting with Approval Preferences. Springer Briefs in Intelligent Systems, Springer (2023)
- Li et al. [2021] Li, B., Li, L., Sun, A., Wang, C., Wang, Y.: Approximate group fairness for clustering. In: Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 139, pp. 6381–6391, PMLR (2021)
- Megiddo and Supowit [1984] Megiddo, N., Supowit, K.J.: On the complexity of some common geometric location problems. SIAM Journal on Computing 13(1), 182–196 (1984)
- Mehrabi et al. [2021] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Computing Surveys 54(6) (2021)
- Micha and Shah [2020] Micha, E., Shah, N.: Proportionally fair clustering revisited. In: 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, vol. 168, pp. 85:1–85:16 (2020)
- O’Neil [2016] O’Neil, C.: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, USA (2016)
- Peters et al. [2021] Peters, D., Pierczyński, G., Skowron, P.: Proportional participatory budgeting with additive utilities. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), pp. 12726–12737 (2021)
- Peters et al. [1992] Peters, H., van der Stel, H., Storcken, T.: Pareto optimality, anonymity, and strategy-proofness in location problems. International Journal of Game Theory 21(3), 221–235 (1992)
- Procaccia [2008] Procaccia, A.D.: Towards a theory of incentives in machine learning. SIGecom Exchanges 7(2) (2008)
- Procaccia and Tennenholtz [2013] Procaccia, A.D., Tennenholtz, M.: Approximate mechanism design without money. ACM Transactions on Economics and Computation 1(4) (2013)
- Vakilian and Yalçiner [2021] Vakilian, A., Yalçiner, M.: Improved approximation algorithms for individually fair clustering. CoRR abs/2106.14043 (2021)
Appendix A Omitted proofs
Proof of Theorem 1.
We prove by induction that if the number of candidates selected is less than $k$, then there is a way to select at least one more center. If the number of candidates selected is less than $k$, there is still aggregate weight of at least $n/k$ on the set of all agents. These agents can get a candidate selected from their neighborhoods if the neighborhood radius is large enough. In particular, once the radius reaches $\max_{i, j \in N} d(i, j)$, some unselected candidate (every candidate location coincides with an agent location) can be selected.

Note that for a given radius $r$, at most $k$ candidates can be selected. If no more candidates can be selected for a given $r$, the next radius is considered. Next, we bound the running time.

There are at most $\binom{n}{2} = O(n^2)$ different distances that are to be considered. These distances are sorted in $O(n^2 \log n)$ time. For each distance that we consider, we check for each of the $n$ candidate locations whether it gets sufficient support of $n/k$. It takes $O(n)$ time to check whether a given location has support $n/k$ for the current neighborhood distance, so a full pass over the locations takes $O(n^2)$ time. If no location has enough support, we move to the next distance. If some location has enough support, we need to select one less location. Therefore, we need to perform a full pass at most $O(n^2) + k$ times. Therefore the running time is $O(n^4)$. ∎
Proof of Proposition 2.
Let $W$ be an outcome which satisfies PRF for Discrete Clustering. Fix $S \subseteq N$ with $|S| \geq \lceil n/k \rceil$ and a candidate $c \in M$. Let $y = \max_{i' \in S} d(i', c)$, the maximum distance any agent in $S$ is from $c$, and denote this agent by $j$. We assume that $y > 0$; otherwise, if $y = 0$, then every agent in $S$ is located at $c$, $c \in W$ (by PRF), and $d(i', W) = 0$ for all $i' \in S$.

Since $|S| \geq \lceil n/k \rceil$ and every agent in $S$ is a distance of at most $y$ from $c$, it follows from $W$ satisfying PRF that there exist $w \in W$ and $i \in S$ such that $d(i, w) \leq y$. By the triangle inequality, $d(j, i) \leq d(j, c) + d(c, i) \leq y + d(i, c)$, and thus $d(j, W) \leq d(j, i) + d(i, w) \leq 2y + d(i, c)$. From these observations, the minimum PF approximation ratio obtained by $i$ or $j$ becomes
$$\min\left\{\frac{d(i, W)}{d(i, c)},\ \frac{d(j, W)}{d(j, c)}\right\} \leq \min\left\{\frac{y}{d(i,c)},\ \frac{2y + d(i,c)}{y}\right\} \leq 1+\sqrt{2}.$$
Note that the final inequality above holds since the expression is maximized when $d(i, c) = (\sqrt{2}-1)\cdot y$, at which point both terms equal $1+\sqrt{2}$. Thus, for any $c \in M$, there exists $i' \in S$ such that $d(i', W) \leq (1+\sqrt{2}) \cdot d(i', c)$. ∎
Proof of Theorem 2.
Output size. We prove by induction that if the number of candidates selected is less than $k$, then there is a way to select at least one more center. If the number of candidates selected is less than $k$, there is still aggregate weight of at least $n/k$ on the set of all agents. These agents can get a candidate selected from their neighborhood if the neighborhood distance is large enough. In particular, once the distance reaches $\max_{i \in N, c \in M} d(i, c)$, some unselected candidate from $M$ can be selected.

Running time. Note that for a given radius $r$, at most $k$ candidates can be selected. If no more candidates can be selected for a given $r$, the next distance is considered. Thus, there are at most $nm$ different distances to be considered, where $m = |M|$. These distances are sorted in $O(nm \log(nm))$ time. For each distance that we consider, we check for each of the $m$ locations whether it gets sufficient support of $n/k$. It takes $O(n)$ time to check whether a given location has support $n/k$ for the current neighborhood distance, so a full pass over the locations takes $O(nm)$ time. If no location has enough support, we move to the next distance. If some location has enough support, we need to select one less location. Therefore, we need to perform a full pass at most $nm + k$ times. Therefore the running time is $O((nm + k) \cdot nm) = O(n^2 m^2)$.

PRF. Suppose for contradiction that the algorithm finds a set of centers $W$ that violates PRF. In that case, there exists a set of agents $S$ of size at least $\ell\lceil n/k \rceil$ such that there are at least $\ell$ locations in $M$ that are within distance $y$ of all agents in $S$, but the number of locations in $W$ that are within $y$ of at least some agent in $S$ is at most $\ell - 1$. In that case, consider the snapshot of the algorithm when neighborhood distance $y$ is considered (we may take $y$ to be the largest realized agent–candidate distance among $S$ and these $\ell$ candidates, so that $y$ is one of the distances considered). At this point, the number of selected locations that are within $y$ of at least some agent in $S$ is at most $\ell - 1$. Since agents in $S$ have only used their weight towards selecting locations within $y$ up till this point, it follows that the collective weight of $S$ is at least $\ell\lceil n/k \rceil - (\ell - 1)\cdot n/k \geq n/k$. It follows that the agents in $S$ still have total weight at least $n/k$ to select one more location that is at distance at most $y$ from all agents in $S$ (at least one of the $\ell$ such candidates is unselected). Hence, at this stage, when the neighborhood distance is $y$, the algorithm would have selected at least one more center within distance $y$ than in the outcome $W$. Thus $W$ is not the correct output of the algorithm, a contradiction. ∎
Appendix B Tight Analysis of PF Approximations
The following example shows the analysis in Proposition 1 is tight. That is, there is an instance for which an outcome satisfying PRF for Unconstrained Clustering obtains no better than $\frac{3+\sqrt{17}}{2}$-PF.
Example 5.
Consider an instance of unconstrained clustering with $N = \{i, j, l\}$ and $k = 1$, so that $\lceil n/k \rceil = 3$. Furthermore, let $c$ and $x$ be two candidate locations such that the distance between agents and candidates is as given in Table 2, where $z = \frac{\sqrt{17}-3}{4}$.
|  | $i$ | $j$ | $l$ |
|---|---|---|---|
| $i$ | $0$ | $z+\tfrac{1}{2}$ | $z+\tfrac{1}{2}$ |
| $j$ | $z+\tfrac{1}{2}$ | $0$ | $1$ |
| $l$ | $z+\tfrac{1}{2}$ | $1$ | $0$ |
| $c$ | $z$ | $\tfrac{1}{2}$ | $\tfrac{1}{2}$ |
| $x$ | $1$ | $z+\tfrac{3}{2}$ | $z+\tfrac{3}{2}$ |
It can be verified that all distances in Table 2 satisfy the triangle inequality. Note that, since $1$ is the maximum distance between any pair of agents in $N$ (call this $y$), and $d(i, x) = 1 = y$, it holds that $W = \{x\}$ satisfies PRF for Unconstrained Clustering. Through straightforward arithmetic, it can then be shown that $\min_{a \in N} d(a, x)/d(a, c) = \frac{3+\sqrt{17}}{2}$, so a deviation to $c$ witnesses that $W$ is no better than $\frac{3+\sqrt{17}}{2}$-PF.
Appendix C Strengthening of PRF
We consider two natural and stronger notions of PRF in the discrete setting (that also apply to the continuous setting). We will show that the versions that we consider do not guarantee the existence of a fair outcome in all settings.
The PRF concept can be increasingly strengthened in the following ways by making the requirements on the outcome stronger.
Definition 6 (Proportionally Representative Fairness (PRF)-II for Discrete Clustering).
For each positive integer $\ell$ and any set of agents $S \subseteq N$ of size at least $\ell\lceil n/k \rceil$: if there are $\ell$ candidates from $M$ that are at distance at most $y$ from all agents in $S$, then there exists some agent $i \in S$ such that $|\{w \in W : d(i, w) \leq y\}| \geq \ell$.
Definition 7 (Proportionally Representative Fairness (PRF)-III for Discrete Clustering).
For each positive integer $\ell$ and any set of agents $S \subseteq N$ of size at least $\ell\lceil n/k \rceil$: if there are $\ell$ candidates from $M$ that are at distance at most $y$ from all agents in $S$, then, for every agent $i \in S$, $|\{w \in W : d(i, w) \leq y\}| \geq \ell$.
The following example shows that an outcome satisfying PRF-II (and hence one satisfying PRF-III) may not exist.
Example 6.
Take $n = 4$ and $k = 2$ (so $\lceil n/k \rceil = 2$) with the following location profile on a unit interval $[0,1]$: two agents at $0$ and two agents at $1$. There are 4 candidate locations: one at $0$, one at $1$, and two coinciding at $0.5$. It is immediate that the 2 groups of agents at $0$ and $1$ must have a facility at $0$ and $1$, respectively (apply PRF-II to each group with $\ell = 1$ and $y = 0$). But this violates PRF-II. The set of agents $S = N$ (size 4) has 2 candidate locations within a distance of $0.5$ of all agents; however, no agent in $S$ has 2 facility locations within a distance of $0.5$ of them.
Appendix D $k$-means++
- Step 1. The first centroid is selected randomly.
- Step 2. Calculate the Euclidean distance between the centroid and every other data point in the dataset. The point farthest away will become our next centroid.
- Step 3. Create clusters around these centroids by associating every point with its nearest centroid.
- Step 4. The point which has the farthest distance from its centroid will be our next centroid.
- Step 5. Repeat Steps 3 and 4 until $k$ centroids are located.
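A minimal sketch of the seeding procedure described above (our own code and naming; note that, unlike standard randomized $k$-means++ seeding, which samples new centroids with probability proportional to squared distance, the variant described here picks the farthest point deterministically after the random first choice):

```python
import random

def farthest_first_seeding(points, k, dist):
    # Step 1: the first centroid is selected randomly.
    centroids = [random.choice(points)]
    while len(centroids) < k:
        # Steps 2-4: associate every point with its nearest centroid, and
        # take the point farthest from its nearest centroid as the next one.
        next_c = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(next_c)
    return centroids
```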
Appendix E Strategyproofness and Fairness
Most of the work on clustering assumes that the data points are truthfully reported. If data points correspond to individual agents who prefer to have a nearby center, then the issue of incentives also arises (see, e.g., Procaccia [40]). It is easy to see that Algorithm 1 for the unconstrained setting coincides with the mid-point mechanism if there are two agents on a line. Since the mid-point (egalitarian) mechanism is known to be manipulable (see, e.g., Procaccia and Tennenholtz [41]), Algorithm 1 is not strategyproof even for $k = 1$ (under which an agent cares about the sole facility being as close as possible). A discretized version of the same example shows that Algorithm 2 is also not strategyproof.
Instead of understanding the strategic aspects of a particular algorithm, we next examine the tradeoff between fairness and strategyproofness. We focus on the case of $k = 1$, for which the preference relation of an agent is clear: an agent wants the location to be as close as possible.
Proposition 3.
Even for $k = 1$ and the unconstrained setting, there exists no anonymous, PF, and strategyproof algorithm.
Proof.
For $k = 1$ and Euclidean space, an outcome is Pareto optimal if and only if it is in the convex hull of the agent locations [39]. Next, we show that weak Pareto optimality is equivalent to Pareto optimality in this context. We show that if a location is not Pareto optimal, then it is not weakly Pareto optimal. Let $x$ be a location that is not Pareto optimal; then $x$ is strictly outside the convex hull $C$ of the agent locations. Let $y$ be the point in $C$ closest to $x$. Note that $y$ will either be a corner of $C$ or the perpendicular projection of $x$ onto an edge of $C$. Consider the line $L$ going through $y$ orthogonal to the line through $x$ and $y$. Assume without loss of generality that $L$ is vertical with $x$ to the right of $L$. Then all points of $C$ lie on or to the left of $L$. Let $p$ be an arbitrary agent location (a point in $C$). If $p$ is the same point as $y$, then $d(p, y) = 0 < d(p, x)$. Now suppose $p$ is not at the same location as $y$. Consider the triangle $pyx$. Since the segment $y$–$x$ is orthogonal to $L$, it follows that the angle at point $y$ of the triangle $pyx$ is at least $90°$. The maximum angle of the other two corners of the triangle is strictly less than $90°$, as the sum of the angles of a triangle is $180°$. There is a well-known triangle inequality law that “if two angles of a triangle are unequal, then the measures of the sides opposite these angles are also unequal, and the longer side is opposite the greater angle.” Hence, it follows that $d(p, x) > d(p, y)$. We have shown that $d(p, y) < d(p, x)$ for all agent locations $p$. Hence $x$ is not weakly Pareto optimal.
Peters et al. [39] proved that there is no anonymous, strategyproof, and Pareto optimal mechanism for the Euclidean space with 3 dimensions. From our argument above, it follows that there is no anonymous, strategyproof, and weakly Pareto optimal mechanism for the Euclidean space with 3 dimensions. Since PF is equivalent to weak Pareto optimality for $k = 1$, it follows that for $k = 1$, there is no anonymous, PF, and strategyproof algorithm. ∎
Corollary 1.
Even for $k = 1$, there exists no anonymous, core fair, and strategyproof algorithm.
Proposition 4.
Even for $k = 2$ and the discrete setting, there exists no PRF and strategyproof algorithm, even in one dimension.
Proof.
Consider a one-dimensional discrete instance with $k = 2$ in which exactly two candidates, $x_1$ and $x_2$, lie within some distance $y$ of all agents' locations, while every other candidate is farther than $y$ from every agent. PRF (for the discrete setting) requires that $W = \{x_1, x_2\}$. This follows because the group of all agents can demand 2 centers, and there exist exactly 2 centers at distance at most $y$ from all agents' locations—namely, $x_1$ and $x_2$. The other groups of agents do not add any further demands (i.e., no demands that are not already satisfied by $\{x_1, x_2\}$). With this outcome, agent 3 has access to 2 facilities, at distances $d(3, x_1)$ and $d(3, x_2)$ from her true location.

Now consider a deviation by agent 3 to a reported location adjacent to a third candidate $x_3$ that is closer to her true location. With this report, the group consisting of agents 2 and 3 can demand that the candidate center at $x_3$ is chosen—this is because it is now the only center that is within the relevant distance of both agents 2 and 3. There are 2 possible outcomes, $\{x_3, x_1\}$ and $\{x_3, x_2\}$. In the first case, agent 3 (with her true location) is at least as close to each of the two facilities as under the truthful outcome, and strictly closer to one of them; the same holds in the second case. In either case, the new outcome component-wise dominates the original outcome (under sincere voting by the other agents). Hence, PRF is incompatible with SP in the discrete setting. ∎
Appendix F Experimental results
| Dataset | MSD to closest center: Avg. ($k$-means++) | Diff. in MSD, Alg 1 rel. to ++ | Diff. in MSD, ALGg rel. to ++ | MSD to $\lceil k/2 \rceil$ closest: Avg. ($k$-means++) | Diff. in MSD, Alg 1 rel. to ++ | Diff. in MSD, ALGg rel. to ++ | MSD to $k$ closest: Avg. ($k$-means++) | Diff. in MSD, Alg 1 rel. to ++ | Diff. in MSD, ALGg rel. to ++ |
|---|---|---|---|---|---|---|---|---|---|
| Wholesale | 34,977,392.00 | 664% | 112% | 11,668,107,198.00 | -10% | -12% | 100,307,739,343.00 | -63% | -55% |
| HCV | 811.60 | 519% | 429% | 582,360.00 | -4% | -6% | 4,858,449.00 | -70% | -62% |
| Buddy | 782.20 | 8% | 5% | 181,700.00 | -8% | -7% | 784,360.00 | -13% | -9% |
| Seeds | 0.54 | 23% | 8% | 228.68 | -1% | -1% | 1,449.94 | -4% | -1% |
We now apply Algorithm 1 to four real-world datasets. We consider the discrete domain in which $M = N$, as most clustering datasets only contain the data points. Algorithm 1 is guaranteed to satisfy PRF; therefore, we are interested in analyzing the performance of the algorithm with respect to other objectives. A prominent objective for clustering algorithms is to minimize the Mean Squared Distance (MSD) to the closest center: the average squared distance between each data point and its closest center. We consider this performance measure over a range of values of the number of centers $k$.
In addition to MSD to the closest center, we analyze two other related measures: MSD to the $\lceil k/2 \rceil$ closest centers and MSD to the $k$ closest centers. Intuitively, MSD to the $q$ closest centers is the average squared distance between each data point and its $q$ closest centers. These measures capture—to varying extents—the idea that it may be desirable to have data points located close to more than just one center; this idea is at the core of the proportional representation concept. To benchmark Algorithm 1's performance, we also implement two other algorithms: the well-known $k$-means++ algorithm (Footnote 7: I.e., Lloyd's algorithm for the $k$-means minimization objective with a particular initialization. Given our focus on the discrete domain, we use a $k$-means++ algorithm that only chooses centers from among the data points.) of Arthur and Vassilvitskii [5] and ALGg of Li et al. [33]. For all of our MSD measures, a smaller value is more desirable.
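Concretely, one natural reading of this measure can be sketched as follows (our own helper names): for each data point, average the squared distances to its $q$ closest selected centers, then average over all points; $q = 1$ recovers the usual MSD to the closest center.

```python
def msd_to_q_closest(points, W, q, dist):
    # Mean squared distance from each data point to its q closest centers.
    total = 0.0
    for p in points:
        nearest = sorted(dist(p, w) for w in W)[:q]
        total += sum(d * d for d in nearest) / len(nearest)
    return total / len(points)
```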
Datasets. The four datasets that we analyze are from the UCI Machine Learning Repository [19]. We summarize these below. (Footnote 8: The datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0); for details, please refer to https://archive-beta.ics.uci.edu/ml/donation-policy.) HCV dataset (contains information about patients with Hepatitis C and blood donors); Seeds dataset (describes measurements of geometrical properties of kernels belonging to three different varieties of wheat); Buddy-move dataset (consists of user information from user reviews published on www.holidayiq.com about points of interest in South India); and Wholesale dataset (consists of client information for a wholesale distributor, including annual spending on various product categories).
Results. Table 3 summarizes our results. We illustrate our analysis in Figures 4–9. Across all the datasets, Algorithm 1's performance relative to $k$-means improves when moving from MSD to the closest center to MSD to the $\lceil k/2 \rceil$ closest centers and MSD to the $k$ closest centers. A similar pattern is observed for ALGg, which may be expected and is reassuring since—like Algorithm 1—ALGg is motivated by the idea of proportional fairness. For two of the datasets (Wholesale and HCV), Algorithm 1 offers substantially better performance compared to $k$-means with respect to MSD to the $k$ closest centers. In these cases, Algorithm 1 also outperforms ALGg. For MSD to the $\lceil k/2 \rceil$ closest centers, ALGg and Algorithm 1 either outperform or perform similarly to $k$-means. In contrast, and as expected, $k$-means outperforms both algorithms with respect to the MSD to the closest center. With this measure, ALGg typically outperforms Algorithm 1. The experiments illustrate that other well-known algorithms produce different outcomes compared to our Algorithm 1, and this difference can be quantified via an intuitive PRF-related metric, i.e., minimizing the MSD to multiple centers.
[Figures 4–9: MSD to the closest center, to the $\lceil k/2 \rceil$ closest centers, and to the $k$ closest centers, as a function of $k$, for Algorithm 1, ALGg, and $k$-means++ on the Wholesale, HCV, Buddy, and Seeds datasets.]