
Proportionally Representative Clustering

Haris Aziz, UNSW Sydney, Sydney, Australia
Barton E. Lee, ETH Zürich, Switzerland
Sean Morota Chu, UNSW Sydney, Sydney, Australia
Jeremy Vollen, UNSW Sydney, Sydney, Australia
Abstract

In recent years, there has been a surge of effort to formalize notions of fairness in machine learning. We focus on centroid clustering, one of the fundamental tasks in unsupervised machine learning. We propose a new axiom, "proportionally representative fairness" (PRF), designed for clustering problems in which the selection of centroids should reflect the distribution of data points and how tightly they are clustered together. Our fairness concept is not satisfied by existing fair clustering algorithms. We design efficient algorithms to achieve PRF both for unconstrained and discrete clustering problems. Our algorithm for the unconstrained setting is also the first known polynomial-time approximation algorithm for the well-studied Proportional Fairness (PF) axiom [14]. Our algorithm for the discrete setting also matches the best known approximation factor for PF.

1 Introduction

As artificial intelligence and machine learning serve as the ‘new electricity’ of systems, it is becoming increasingly important to ensure that AI systems embody societal values such as privacy, transparency, and fairness. Fairness is a growing concern, as AI systems are used to make decisions that can critically affect our daily lives, finances, and careers [17]. The impulse to tackle the issue of fairness in AI is prominently reflected in several governmental policy documents [18, 3, 24], the emergence of dedicated conferences on Ethical AI, and the formation of AI ethics boards by various big tech companies. Over the past years, several developments have taken place in the theory and application of fairness in supervised learning. One prominent approach is to use information derived from labelled data in supervised learning to formalize some notions of equity within and across (pre-specified and) protected attributes [21]. In contrast, in several models of unsupervised learning, protected attributes or their labelling are unknown. Fairness in unsupervised machine learning is becoming important with the growth of data collection both in public (e.g., via IoT devices) and private sectors (e.g., map services and social media). Unfettered and irresponsible use of such data has the danger of increasing inequity [37].

We focus on fairness in clustering, which is one of the most widely applied tasks in unsupervised machine learning. A (centroid selection) clustering problem has a metric space $\mathscr{X}$ with a distance measure $d:\mathscr{X}\times\mathscr{X}\rightarrow\mathbb{R}^{+}\cup\{0\}$, a multiset $\mathscr{N}\subseteq\mathscr{X}$ of $n$ data points (agents), a set $\mathscr{M}$ of candidate centers, and a positive integer $k\leq n$. In many clustering problems, each data point corresponds to an individual's attribute profile; thus, we refer to the data points as agents. The goal is to find a size-$k$ subset of centers $Y\subseteq\mathscr{M}$, $|Y|=k$. Such centroid selection problems arise in a wide range of applications, including recommender systems and data compression. The problem also captures social planning scenarios where $k$ facility locations need to be chosen to serve a set of agents.

There are two versions of the (centroid selection) clustering problem. In the discrete version, the set of candidate centers $\mathscr{M}$ is a finite subset of $\mathscr{X}$. In the unconstrained or continuous setting, $\mathscr{M}=\mathscr{X}$. We first focus on the unconstrained setting, as done by Micha and Shah [36]; later, we discuss how our concepts and ideas extend to the discrete setting. For a location $i$ and a set of locations $S\subseteq\mathscr{M}$, we write $d(i,S)$ for $\min_{c\in S}d(i,c)$. Standard clustering solutions include $k$-center, which outputs a subset $Y\in\arg\min_{W\subseteq\mathscr{M},|W|=k}\max_{i\in\mathscr{N}}d(i,W)$; $k$-medians, which outputs a subset $Y\in\arg\min_{W\subseteq\mathscr{M},|W|=k}\sum_{i\in\mathscr{N}}d(i,W)$; and $k$-means, which outputs a subset $Y\in\arg\min_{W\subseteq\mathscr{M},|W|=k}\sum_{i\in\mathscr{N}}d(i,W)^{2}$. These standard objectives can be viewed as achieving some form of global welfare but are not well-suited to proportional representation and fairness concerns. The problems of computing outcomes optimizing these objectives are also NP-hard [34, 4].
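For concreteness, the three objectives are straightforward to evaluate for a given center set; the sketch below (Python, assuming Euclidean distance, which is only one possible metric $d$) computes all three for a candidate set $W$. Finding an optimal $W$ is the NP-hard part [34, 4].

```python
import math

def objectives(points, W):
    """Evaluate the k-center, k-medians, and k-means objectives of a
    given center set W under Euclidean distance (an assumption; the
    paper allows any metric d)."""
    # distance from each data point to its nearest center, i.e., d(i, W)
    nearest = [min(math.dist(p, c) for c in W) for p in points]
    return {
        "k-center":  max(nearest),                 # worst-case distance
        "k-medians": sum(nearest),                 # total distance
        "k-means":   sum(t * t for t in nearest),  # total squared distance
    }
```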

We seek a suitable concept for clustering that captures general fairness principles such as non-discrimination, equality of opportunity, and equality of outcome. In many applications, the data points correspond to real individuals or their positions or opinions. Such individuals may expect the clustering outcome to satisfy natural fairness requirements. Previous axiomatic studies of clustering have measured fair representation by the distance from each data point to the nearest centroid. We initiate a proportional representation perspective on fairness in clustering. To capture this desideratum axiomatically, we require a property in which representation guarantees depend on the number (or proportion) of data points and how tightly they are clustered together.

Our perspective is motivated by applications in which groups of individuals (to which the data points correspond) may expect to receive centroid representation proportional to their size. For instance, the centroids may be citizens selected to make representative decisions on behalf of the public, as is done in a jury or a sortition panel. Similarly, computing a proportionally fair clustering is also meaningful in facility location, where the location of the facility should depend on the density and the number of people it is serving. More generally, our axioms and algorithms are well-motivated in any setting in which one may want to take a representative sample from data points between which a distance metric is well-defined. In this paper, we consider the following fundamental research questions: What is a meaningful concept of proportional fairness in centroid clustering? Is such a concept guaranteed to be achievable?

Contributions.

In contrast to traditional optimization objectives in clustering and centroid selection, we take an axiomatic fairness perspective. We first identify a potential shortcoming of some recently introduced concepts that aim to capture proportional fairness for clustering; in particular, we show that these concepts do not capture an intuitive and minimal requirement of proportional representation called unanimous proportionality. We then propose a new axiom called proportionally representative fairness (PRF), which is our central conceptual contribution.

We propose PRF for both the unconstrained and discrete clustering settings. We discuss how PRF overcomes some of the drawbacks of the previously introduced concepts; in particular, it implies the minimal requirement of unanimous proportionality. PRF has several other desirable features: a PRF outcome is guaranteed to exist, the axiom does not require a pre-specified set of protected groups, and it is robust to outliers and multiplicative scaling. We show that none of the traditional clustering algorithms or previously introduced algorithms for fair clustering satisfy PRF.

We design polynomial-time algorithms to achieve PRF in both the unconstrained and discrete settings. (The results for the unconstrained and discrete settings are incomparable in the following sense: the guaranteed existence of a fair outcome in the discrete setting implies the existence of a fair outcome in the unconstrained setting; however, a polynomial-time algorithm for computing a fair outcome in the discrete setting does not imply one for the unconstrained setting.) We call our algorithms SEAR (Spatial Expanding Approval Rule) as they are inspired by core ideas from the ordinal multi-winner voting Expanding Approvals Rule (EAR) of Aziz and Lee [7] but carefully use spatial information to make synchronized decisions. We prove that our algorithm for the unconstrained setting also gives 3-approximate Proportional Fairness (PF) [14], and hence is the first known polynomial-time approximation algorithm for PF on general metric spaces. We also prove that our algorithm for the discrete setting matches the best known approximation of PF, thereby compromising nothing on PF approximation while additionally satisfying PRF. We summarize our theoretical contributions in Table 1 via a comparison of our algorithms with the best guarantees previously known. All missing proofs can be found in Appendix A.

                          Unconstrained                   Discrete
                          ρ-PF     PRF    Complexity     ρ-PF     PRF    Complexity
Our Algorithms (SEAR)     3        ✓      in P           ≈2.4     ✓      in P
Greedy Capture            ≈2.4     –      NP-hard        ≈2.4     –      in P

Table 1: Theoretical comparison of our algorithms (Algorithm 1 for the unconstrained setting and Algorithm 2 for the discrete setting) with Greedy Capture (GC), which gives the best known approximate PF guarantees. Unconstrained and discrete GC results come from Micha and Shah [36] and Chen et al. [14], respectively.

2 Related Work

2.1 Fair clustering

Fairness in machine learning and computer systems is a vast topic [27]—for an overview, see Chouldechova and Roth [17] or the survey by Mehrabi et al. [35]. We focus on the fundamental learning problem of clustering, which is a well-studied problem in computer science, statistics, and operations research [25]. Our approach is axiomatic, focused on the concept of proportional fairness (or representation), and does not rely on the use of a pre-specified or protected set of attributes. Thus, the most closely related papers are those of Chen et al. [14], Micha and Shah [36], Li et al. [33], and Jung et al. [28], which we review below.

Chen et al. [14] proposed the concept of proportional fairness (PF) for the discrete setting, which is based on the entitlement of groups of agents of large enough size. They reasoned that PF is a desirable fairness concept but it is not guaranteed to be achievable for every instance. On the other hand, they showed that outcomes satisfying reasonable approximations of PF are guaranteed to exist and can be computed in polynomial time via the ‘Greedy Capture’ algorithm.

Micha and Shah [36] analyzed the proportional fairness axiom in the unconstrained setting and presented similar results to those obtained by Chen et al. [14]. They showed that the natural adaptation of the Greedy Capture algorithm finds an approximately PF outcome in Euclidean space. On the other hand, they showed that checking whether a PF outcome exists is NP-hard and that the clustering returned by the Greedy Capture algorithm cannot be computed in polynomial time unless P = NP. Li et al. [33] consider the same discrete clustering setting as Chen et al. [14] and propose a stronger version of proportional fairness called core fairness. They present lower and upper approximation bounds for various restrictions of the discrete setting.

Jung et al. [28] consider the discrete clustering problem in which the set of agents and candidate centers coincide. For this setting, they propose a fairness concept that has a similar goal as that of Chen et al. [14]. In subsequent work, it has been referred to as individual fairness [42, 12]. Individual fairness requires that, for each agent $i$, there is a selected center at distance at most $r(i)$ from $i$, where $r(i)$ is the smallest radius around $i$ that includes $n/k$ agents.

In a paper inspired by our work, Kellerhals and Peters [30] analyze the key algorithms and axioms in our paper in further detail and show that the axioms that we propose also imply approximations of various fairness concepts introduced in the literature. Drawing an analogy to multiwinner voting axioms, they refer to our proposed axiom PRF for the discrete setting as mPJR and also define a weaker axiom which they call mJR. They take a deep dive into the logical relations between the axioms and approximate notions of proportional fairness, individual fairness, core fairness, and another core-inspired generalization of proportional fairness [22]. They prove that any approximation to PF is also an approximation to individual fairness and vice versa. In addition to the algorithms we formalize in this paper, they also consider Greedy Capture [14] and a randomized version of this algorithm known as Fair Greedy Capture [22]. A key takeaway from our paper is reaffirmed by their work, namely that our algorithms are the best-known algorithms for satisfying proportional representation axioms while also giving non-trivial approximation guarantees with respect to other fairness notions.

There are also several concepts and algorithms in clustering where the protected groups are pre-specified based on protected attributes [1, 2, 10, 9, 31, 15, 16, 23]. In contrast, we are seeking fairness properties that apply to arbitrarily defined groups of all possible sizes. In our proposed axiom, these groups are endogenously determined from data points; in some real-world settings, this is a legal requirement since it can be forbidden by law to have decision-making processes use certain attributes as inputs. Clustering can also be viewed as a facility location problem [26], where multiple facilities (centers) are to be located in a metric space; for a recent overview, see the survey by Chan et al. [13].

2.2 Comparison with multi-winner voting

The clustering problem has especially natural connections with multi-winner voting: agents and centers in the clustering context correspond to agents and candidates in multi-winner voting.

The literature on proportionally representative multi-winner voting includes axioms such as JR, PJR, EJR, and FJR for dichotomous preferences [6, 32], axioms such as PSC for rankings [20], and axioms such as Generalised PSC [6, 7] or Rank-PJR+ [11] for weak orders. Naively applying these axioms to our spatial problem results in a reduction from our unconstrained problem to a multi-winner voting problem with an infinite number of candidates. Conversely, our fairness (or proportional representation) axioms take into account the additional distance-based cardinal information available when the candidates belong to a general metric space. Hence, our axioms formalize a notion of proportional representation for $m$-dimensional Euclidean space with $m>2$. In contrast to ordinal algorithms for multi-winner voting (such as the Expanding Approvals Rule (EAR) [7] or Single Transferable Vote), the algorithms in this paper carefully account for additional spatial information to make synchronized decisions.

Kalayci et al. [29] consider a similar setting to ours but explore to what extent algorithms that are agnostic to cardinal spatial information and only use ordinal rankings of pairwise distances can achieve approximate fairness. They define $(\alpha,\gamma)$-proportional representation, a powerful strengthening of approximate core fairness [33]. Their notion captures (and strengthens) the ideal of core stability in this setting: no coalition of agents $S$ large enough to "deserve" $t$ centroids should be able to propose an alternative set of $t$ centroids which they all prefer to their individually best size-$t$ subset of the selected centroids. They then parameterize this idea by $\alpha$, the required size of a coalition which deserves $t$ centroids, and $\gamma$, the degree to which the alternative centroid selection better represents the coalition. Like ours, their property effectively addresses the shortfalls of existing concepts with respect to proportional representation (see Definition 2). In contrast to theirs, the properties we study do not require parameterization and are guaranteed to exist. Kalayci et al. [29] show that the (ordinal multi-winner) Expanding Approvals Rule (EAR) of Aziz and Lee [7] is $(\alpha,\gamma)$-representative with $\gamma\leq 1+6.71\alpha/(\alpha-1)$ and also satisfies a $5.71$-approximation to proportional fairness. They are able to improve on these bounds with a modified version of Greedy Capture when cardinal information is known.

Proportional representation in multi-winner voting (and, more generally, participatory budgeting) has also been studied when agents have cardinal utilities. Based on prior work for approvals [6], Peters et al. [38] study desirable axioms such as FJR and EJR for positive, additive utilities, and rules such as the Method of Equal Shares (MES). We make no assumption of additive utilities, nor do we ever consider total or average distances. Under additive cardinal utilities, a group of agents of proportion large enough to "deserve" multiple candidates can (in some instances) be satisfied with respect to EJR by a single candidate which gives some agent very high utility. Because our model and axioms take an agnostic approach to cardinal utility values, we instead place requirements on the selection of a proportional number of candidates. One possible approach to our problem is to take distances as a proxy for utilities (e.g., setting each agent's utility to the negation of distance) and apply a rule like MES. However, translating to cardinal utilities necessitates the loss of important spatial information, as different cardinal utilities consistent with the distances yield different MES outcomes; moreover, the setting and axioms of Peters et al. assume positive, additive utilities.

3 Towards an Axiom Capturing Proportional Representation

Proportional fairness for clustering has been considered in a series of recent papers. Chen et al. [14] first proposed a proportional fairness axiom requiring that there is no set of agents of size at least $n/k$ that can find an unselected center that is "better" (i.e., located closer than the closest selected center) for each of them.

Definition 1 (Proportional Fairness [14]).

$X\subseteq\mathscr{M}$ with $|X|=k$ satisfies proportional fairness if for all $S\subseteq\mathscr{N}$ with $|S|\geq\lceil\frac{n}{k}\rceil$ and for all $c\in\mathscr{M}$, there exists $i\in S$ with $d(i,c)\geq d(i,X)$.

The idea of proportional fairness was further strengthened to core fairness, which requires that there is no set of agents of size at least $n/k$ that can demand a more preferred location for a center [33]. One rationale for proportional fairness is explained in the following example, which was first presented by Micha and Shah [36] and then reused by Li et al. [33]; we reuse the same example. Consider a set of data points/agents divided into 11 densely packed clusters. One cluster has size 10,000; the other 10 clusters have size 100 each. The big cluster is very far from the smaller clusters, while the small clusters are relatively near to each other. Micha and Shah [36] and Li et al. [33] are of the view that any set of centers that satisfies a reasonable notion of proportional fairness would place 10 centers for the big cluster and 1 center serving the small clusters. Next, we point out that proportional fairness and core fairness do not necessarily satisfy this minimal requirement, which we formalize via the unanimous proportionality axiom below.

Definition 2 (Unanimous Proportionality (UP)).

$X\subseteq\mathscr{M}$ with $|X|=k$ satisfies unanimous proportionality if, for all $S\subseteq\mathscr{N}$ with $|S|\geq\ell\lceil\frac{n}{k}\rceil$ such that each agent in $S$ has the same location $x\in\mathscr{X}$, the $\ell$ possible locations closest to $x$ are selected.

UP captures a principle which Chen et al. [14] highlighted as desirable: in their words, one which requires that "any subset of at least $r\lceil n/k\rceil$ individuals is entitled to choose $r$ centers." (In the term unanimous proportionality, unanimous refers to the condition where a set of agents have the same location, and proportionality refers to the requirement that such agents get a number of centers in proportion to the number of agents.) The following example shows that proportional fairness [14] does not imply unanimous proportionality. Similarly, it can be shown that the fairness concept proposed by Jung et al. [28] does not imply unanimous proportionality.

Example 1 (Proportional fairness and core fairness).

Suppose agents are located in the $[0,1]$ interval and $k=11$. 10,000 agents are at point 0 and 1,000 agents are at point 1. UP requires that 10 centers are at (or just around) point 0 and 1 center is at point 1. However, placing 1 center at point 0 and 10 centers at point 1 satisfies the proportional fairness concept of Chen et al. [14] and also the core fairness concept of Li et al. [33].

The example above shows that there is a need to formalize new concepts in order to capture proportionally representative fairness. The property should imply unanimous proportionality but should also be meaningful when no two agents/data points completely coincide in their location. One reason why the existing fairness concepts do not capture UP is that they make an implicit assumption that an agent only cares about the distance to the nearest center and not about how many centers are nearby. As discussed in the introduction, this assumption is not necessarily accurate in contexts such as data abstraction, sortition, and political representation.

In addition to capturing proportional representation in a robust sense, we would also like our axiom (and algorithms) to guarantee the existing proportional fairness concepts, at least in an approximate fashion. Since a PF outcome may not exist, Chen et al. [14] and Micha and Shah [36] studied an approximate notion of proportional fairness in the discrete and unconstrained settings, respectively.

Definition 3 (ρ\rho-approximate Proportional Fairness).

$X\subseteq\mathscr{M}$ with $|X|=k$ satisfies $\rho$-approximate PF (or $\rho$-PF) if for all $S\subseteq\mathscr{N}$ with $|S|\geq\lceil\frac{n}{k}\rceil$ and for all $c\in\mathscr{M}$, there exists $i\in S$ with $\rho\cdot d(i,c)\geq d(i,X)$.
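When the candidate set is finite, Definitions 1 and 3 can be checked directly: $X$ violates $\rho$-PF exactly when some candidate $c$ would be strictly better, by a factor of $\rho$, for at least $\lceil n/k\rceil$ agents simultaneously. The following is a minimal sketch of this check, assuming Euclidean distances; $\rho=1$ gives exact PF.

```python
import math

def is_rho_pf(points, candidates, X, k, rho=1.0):
    """Check rho-approximate PF (Definition 3; rho=1 is Definition 1)
    of outcome X over a finite candidate set, assuming Euclidean d."""
    n = len(points)
    threshold = math.ceil(n / k)
    for c in candidates:
        # agents for whom c beats every selected center, even after
        # scaling the distance to c by rho
        deviators = sum(1 for i in points
                        if rho * math.dist(i, c) < min(math.dist(i, x) for x in X))
        if deviators >= threshold:
            return False  # these agents and c witness a violation
    return True
```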

We seek to devise an axiom which is compatible with approximate PF, and subsequently to design algorithms which satisfy our axiom while also providing good approximate guarantees with respect to PF. We close the section with an example presented by Micha and Shah [36] that shows that a PF outcome may not exist. We will use the same example to illustrate our new fairness concept in the next section.

Example 2.

Consider Figure 1. Micha and Shah [36] showed that for these 6 agents and $k=3$, no PF outcome exists.

Figure 1: An example instance with 6 agents and $k=3$ for which a PF outcome does not exist.

4 Fairness for the Unconstrained Setting

We propose a new concept, Proportionally Representative Fairness (PRF), that captures proportional representation fairness concerns. The intuition behind the concept is based on two natural principles: (1) a set of agents that is large enough deserves to have a proportional number of centers 'near' it, and (2) the required proximity of the nearby centers to a subset of agents should depend on how densely clustered that subset is. We use these two principles to propose a new fairness concept.

Definition 4 (Proportionally Representative Fairness (PRF) for Unconstrained Clustering).

An outcome $X\subseteq\mathscr{M}$ with $|X|=k$ satisfies Proportionally Representative Fairness (PRF) if the following holds: for any set of agents $S\subseteq\mathscr{N}$ of size $|S|\geq\ell n/k$ such that the maximum distance between pairs of agents in $S$ is $y$, we have $|\{c\in X : \exists i\in S\text{ s.t. }d(i,c)\leq y\}|\geq\ell$.

PRF has the following features: it implies unanimous proportionality; it is oblivious to, and does not depend on, the specification of protected groups or sensitive attributes; it is robust to outliers in the data, since the fairness requirements apply only to groups of points that are sizable enough; and it is invariant to multiplicative scaling of distances (i.e., PRF outcomes are preserved if all distances are multiplied by a constant).
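Since PRF quantifies over every sufficiently large subset of agents, checking it directly is exponential. The brute-force sketch below, assuming Euclidean distances, is intended only to make Definition 4 concrete on small instances; note that for each subset $S$ it suffices to check the largest applicable $\ell$.

```python
import math
from itertools import combinations

def satisfies_prf(points, X, k):
    """Brute-force check of PRF (Definition 4), assuming Euclidean d.
    Exponential in n; for small illustrative instances only."""
    n = len(points)
    for size in range(1, n + 1):
        ell = size * k // n  # largest ell with size >= ell * n / k
        if ell == 0:
            continue
        for S in combinations(points, size):
            # y: maximum pairwise distance (diameter) of S
            y = max(math.dist(i, j) for i in S for j in S)
            near = sum(1 for c in X if any(math.dist(i, c) <= y for i in S))
            if near < ell:
                return False  # S deserves ell centers within y but lacks them
    return True
```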

Next, let us illustrate an example where proportional fairness is not achievable but one of the most natural clustering outcomes satisfies PRF.

Example 3 (Requirements of PRF).

We revisit Example 2 to illustrate the requirements that PRF imposes. For each set $S$ of size at least $\ell\lceil\frac{n}{k}\rceil=2\ell$, PRF requires $\ell$ centers in the relevant neighborhood of the agents in $S$ (see Figure 2). Therefore, for agents 1 and 2, PRF requires that one selected center lie in one of the circles around 1 or 2. For the set of agents $\{1,2,3\}$, PRF also requires that one center lie in one of the circles around 1, 2, or 3. A natural solution is to have one center located between agents 1, 2, and 3, another center located between agents 4, 5, and 6, and the final center located somewhere not too far from all agents (e.g., between the two groups of agents). Such solutions are intuitively reasonable and fair and, furthermore, satisfy PRF; however, as mentioned in Example 2, this instance does not admit a proportionally fair solution.

Figure 2: Some of the requirements of PRF for the instance in Example 2.

In addition to capturing natural outcomes when PF outcomes may not exist, PRF actually guarantees a $\rho$-PF outcome for $\rho\approx 3.6$ in the unconstrained setting on general metric spaces.

Proposition 1.

For metric spaces, if an outcome $X\subseteq\mathscr{M}$ with $|X|=k$ satisfies PRF for Unconstrained Clustering, then $X$ is $\frac{3+\sqrt{17}}{2}$-approximate PF, and there exists an instance for which this bound is tight.

Proof.

Let $X$ be an outcome which satisfies PRF for Unconstrained Clustering. Fix $S\subseteq\mathscr{N}$ and $c\in\mathscr{M}$ with $|S|\geq\lceil\frac{n}{k}\rceil$.

Let $y$ be the maximum distance between pairs of agents in $S$, i.e., $y=\max_{i,j\in S}d(i,j)$, and let $i_1$ and $i_2$ be two agents in $S$ that maximize this distance. We assume that $y>0$; otherwise, if $y=0$, then every agent in $S$ is located at the same point, and there exists $x\in X$ such that $d(i,x)=0\leq d(i,c)$ for all $i\in S$. Denote by $i^*$ the agent in $S$ for which the distance to $c$ is maximized. By the triangle inequality, $d(i_1,c)+d(i_2,c)\geq d(i_1,i_2)=y$, so we have that $d(i^*,c)\geq\max(d(i_1,c),d(i_2,c))\geq\frac{1}{2}(d(i_1,c)+d(i_2,c))\geq\frac{y}{2}$.

Next, since $|S|\geq\frac{n}{k}$ and every pair of agents in $S$ is at distance at most $y$ from each other, it follows from $X$ satisfying Unconstrained PRF that there exist $x\in X$ and $i\in S$ such that $d(i,x)\leq y$. Using this along with the triangle inequality, we can establish the following upper bound: $d(i^*,x)\leq d(i^*,c)+d(c,x)\leq d(i^*,c)+d(i,c)+d(i,x)\leq d(i^*,c)+d(i,c)+y$. From these observations, the minimum PF approximation ratio obtained by $i$ or $i^*$ becomes

$$\min\left(\frac{d(i,x)}{d(i,c)},\frac{d(i^*,x)}{d(i^*,c)}\right)\leq\min\left(\frac{y}{d(i,c)},\frac{d(i^*,c)+d(i,c)+y}{d(i^*,c)}\right)\leq\min\left(\frac{y}{d(i,c)},3+\frac{2d(i,c)}{y}\right)\leq\max_{z\geq 0}\min\left(z,3+\frac{2}{z}\right)=\frac{3+\sqrt{17}}{2}.$$

Note that the final inequality above holds since replacing $\frac{y}{d(i,c)}$ by the non-negative number $z$ that maximizes the expression can only weakly increase it. The expression $\min(z,3+2/z)$ is maximized where $z=3+2/z$, i.e., where $z^2-3z-2=0$, giving $z=\frac{3+\sqrt{17}}{2}$; hence the final transition. Thus, for any $c\in\mathscr{M}$, there exists $i\in S$ such that $d(i,X)\leq\frac{3+\sqrt{17}}{2}d(i,c)$. We defer the proof that this bound is tight to Appendix B. ∎

Next, we highlight that the well-known $k$-means and $k$-median solutions do not necessarily satisfy PRF. Example 4 is adapted from an example by Chen et al. [14].

Example 4 ($k$-means does not satisfy PRF).

Consider Figure 3, in which there are $n/3$ agents uniformly distributed on the perimeter of each of the three circles. If $k=3$, a $k$-means or $k$-median solution will place one center by the small circles and two centers in the big circle. In contrast, PRF requires that each of the circles gets its own centroid. It can also be shown that $k$-center does not satisfy PRF.

Figure 3: The $k$-means solution may not satisfy PRF. Each of the three circles contains $n/3$ points.

At face value, it is not clear whether an outcome satisfying PRF always exists, as similar concepts such as PF cannot always be guaranteed [36]. Secondly, even if a PRF outcome is guaranteed to exist, its complex constraints seem computationally challenging: PRF requires conditions on an exponential number of subsets of agents and, for each subset, enforces representational constraints pertaining to an infinite number of neighborhood distances. Perhaps surprisingly, we show that not only is a PRF outcome guaranteed to exist, it can also be computed in polynomial time.

The algorithm intuitively works as follows. Firstly, instead of considering infinitely many possible centers, we restrict our attention to the $n$ candidate centers that coincide with the agents' locations. Each agent is given an initial weight of 1, which dynamically decreases over the course of the algorithm. The algorithm can be viewed as gradually expanding the neighborhoods of each of the agents. (The idea of expanding neighborhoods is used in other algorithms in the literature, such as the Greedy Capture algorithm for the unconstrained setting studied by Micha and Shah [36]; unlike our algorithm, Greedy Capture does not involve reweighting of agents and, more importantly, it does not satisfy PRF and is not polynomial time. Our algorithm uses reweighting of agents as is done by EAR for multi-winner voting [7], and can be viewed as using key ideas from EAR for committee voting while taking into account additional spatial information to make synchronized decisions.) Instead of continuously expanding the neighborhoods, we only consider at most $n^2$ possible values of the neighborhood radius. If there exists some location in the candidate set that lies in the intersection of the neighborhoods of a group of agents with total weight at least $n/k$, we select one such candidate and reduce the agents' collective weight by $n/k$. (The reweighting can be done in any manner that reduces the group's total weight by $n/k$; two natural possibilities are (i) reducing weights as equally as possible, or (ii) reducing weights in proportion to agents' distances from the selected candidate.) The process continues until $k$ centers are chosen. The algorithm is formally specified as Algorithm 1. The following is the main result of the present section and summarizes the theoretical guarantees of Algorithm 1.

Theorem 1.

Algorithm 1 terminates in polynomial time $O(n^4k^2)$ and returns a set of $k$ centers which satisfies PRF and $3$-approximate PF.

We will now formally prove two key lemmas used to prove Theorem 1.

Algorithm 1 SEAR (Spatial Expanding Approval Rule) for Unconstrained Clustering
  Input: metric space $\mathscr{X}$ with a distance measure $d:\mathscr{X}\times\mathscr{X}\rightarrow\mathbb{R}^{+}\cup\{0\}$, a finite multiset $\mathscr{N}\subseteq\mathscr{X}$ of $n$ data points (agents), and a positive integer $k$.
  Output: A multiset of $k$ centers
  $w_i\leftarrow 1$ for each $i\in\mathscr{N}$
  Consider the set $D=\{d(i,i')\mid i,i'\in\mathscr{N}\}$. Order the entries in $D$ as $d_1\leq d_2\leq\cdots\leq d_{|D|}$.
  $\mathscr{M}\leftarrow\mathscr{N}$
  $j\leftarrow 1$
  $W\leftarrow\emptyset$
  while $|W|<k$ do
     $C^*=\{c\in\mathscr{M}\mid\sum_{i\in\mathscr{N}:d(i,c)\leq d_j}w_i\geq n/k\}$
     if $C^*=\emptyset$ then
        $j\leftarrow j+1$
     else
        Select a candidate $c^*\in\arg\max_{c'\in C^*}\sum_{i\in\mathscr{N}:d(i,c')\leq d_j}w_i$; $W\leftarrow W\cup\{c^*\}$; $\mathscr{M}\leftarrow\mathscr{M}\setminus\{c^*\}$
        $N'\leftarrow\{i\in\mathscr{N}:d(i,c^*)\leq d_j\}$
        Modify the weights of agents in $N'$ so that their total weight $\sum_{i\in N'}w_i$ decreases by exactly $n/k$.
  Return $W$
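To make the pseudocode concrete, the following is a minimal Python sketch of SEAR, assuming Euclidean distances and $k$ no larger than the candidate pool; the reweighting scales the supporters' weights proportionally, one of the admissible choices noted above. It is an illustration, not a reference implementation; with candidates=None it instantiates Algorithm 1, and passing a finite candidate list instantiates Algorithm 2 below.

```python
import math

def sear(points, k, candidates=None):
    """Sketch of SEAR, assuming Euclidean distance. candidates=None runs
    the unconstrained variant (Algorithm 1), with the agents' own
    locations as the candidate pool; a finite candidate list gives the
    discrete variant (Algorithm 2)."""
    n = len(points)
    M = list(candidates) if candidates is not None else list(points)
    w = [1.0] * n  # each agent starts with weight 1
    # radii to sweep: agent-agent distances (Alg. 1) or agent-candidate
    # distances (Alg. 2); with M = points these coincide
    D = sorted({math.dist(i, c) for i in points for c in M})
    W, j = [], 0
    while len(W) < k:
        radius = D[j]
        # weighted support each remaining candidate attracts at this radius
        support = [sum(w[a] for a, i in enumerate(points)
                       if math.dist(i, c) <= radius) for c in M]
        winners = [ci for ci, s in enumerate(support) if s >= n / k - 1e-9]
        if not winners:
            j += 1  # no candidate has enough support: expand neighborhoods
            continue
        best = max(winners, key=lambda ci: support[ci])  # max weighted support
        c_star = M.pop(best)
        W.append(c_star)
        # charge the supporters: scale weights so their total drops by n/k
        supp = [a for a, i in enumerate(points) if math.dist(i, c_star) <= radius]
        total = sum(w[a] for a in supp)
        for a in supp:
            w[a] *= 1.0 - (n / k) / total
    return W
```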
Lemma 1.

Algorithm 1 returns an outcome that satisfies PRF.

Proof.

Suppose for contradiction that the algorithm finds a set of $k$ centers $W$ that violates PRF. Then there exists a set of agents $S\subseteq\mathscr{N}$ of size $|S|\geq\ell n/k$ such that the maximum distance between pairs of agents in $S$ is $y$, but the number of selected locations within $y$ of some agent in $S$ is at most $\ell-1$. Consider the snapshot of the algorithm when neighborhood distance $y=\max_{i,j\in S}d(i,j)$ is considered. At this point, $|\{c\in W : \exists i\in S\text{ s.t. }d(i,c)\leq y\}|\leq\ell-1$. Since agents in $S$ have only used their weight towards selecting locations within $y$ up to this point, and each selection absorbs total weight exactly $n/k$, it follows that $\sum_{i\in S}w_i\geq\ell\frac{n}{k}-(\ell-1)\frac{n}{k}=\frac{n}{k}$. Therefore, the agents in $S$ still have total weight at least $\frac{n}{k}$ to select one more location at distance at most $y$ from some agent in $S$. Hence, at this stage, when the neighborhood distance is $y$, the algorithm would have selected at least one more center within distance $y$ than in the outcome $W$, a contradiction. ∎

Lemma 1 implies the guaranteed existence of an outcome satisfying our axiom, which was one of our goals in devising PRF. In conjunction with Proposition 1, the lemma also gives us that the outcome returned by Algorithm 1 is $\frac{3+\sqrt{17}}{2}$-PF. This is noteworthy as it establishes Algorithm 1 as the first known polynomial-time approximation algorithm for PF on general metric spaces in the unconstrained setting. As we will show next, Algorithm 1 actually provides the stronger guarantee of $3$-PF.

Lemma 2.

Algorithm 1 finds an outcome that satisfies $3$-approximate PF.

Proof.

For an instance of unconstrained clustering, let $X$ be the outcome returned by Algorithm 1. Fix $S\subseteq\mathscr{N}$ and $c\in\mathscr{M}$ with $|S|\geq\lceil\frac{n}{k}\rceil$. We will show that there exist an agent $i\in S$ and a center $x\in X$ such that $d(i,x)\leq 3d(i,c)$. We assume that $S\cap X=\emptyset$, since otherwise $d(i,x)=0$ for some agent and center, and we are done.

At the start of the algorithm, the collective weight of agents in $S$ is $|S|$. When the algorithm terminates, the collective weight of $S$ is zero (each of the $k$ selections removes total weight exactly $n/k$, so all $n$ units of weight are exhausted). Thus, in some iteration, the collective weight of $S$ decreases for the first time. We denote the centroid selected in this iteration by $x^*$ and denote an arbitrary agent whose weight is decreased in this iteration by $i^*$. Let $y$ be the maximum distance between $i^*$ and any other agent in $S$.

We first point out that $y\geq d(i^*,x^*)$. To see this, suppose for contradiction that $y<d(i^*,x^*)$. Since $y$ and $d(i^*,x^*)$ are both distances between pairs of points in $\mathscr{N}$, we have $y=d_j$ and $d(i^*,x^*)=d_{j'}$ for some $j$ and $j'$. Furthermore, $j<j'$, since $D$ is in increasing order and $y<d(i^*,x^*)$ by assumption. Thus, when the algorithm considers neighborhoods of radius $d_j=y$, the collective weight of $S$ has not yet decreased, and we have that

$$\sum_{i\in\mathscr{N}:d(i,i^*)\leq y}w_i\geq|S|\geq\left\lceil\frac{n}{k}\right\rceil\geq\frac{n}{k}.$$

Therefore, before transitioning to the next element of $D$, the algorithm will either add $i^*$ to $X$ or decrease the weight of some agent in $S$. The first possibility contradicts $S\cap X=\emptyset$. For the second possibility, since this iteration is by assumption the first in which the collective weight of $S$ decreases, it must be that $x^*$ was added to $X$. But since $d(i^*,x^*)>y=d_j$, we have $i^*\notin N'$ and $w_{i^*}$ does not decrease, which gives us a contradiction.

We divide the remainder of the argument into cases. First, consider the case in which $d(i^*,c)\geq\frac{y}{3}$. It follows that $d(i^*,x^*)\leq y\leq 3d(i^*,c)$. If, instead, $d(i^*,c)<\frac{y}{3}$, then consider the agent $j\in S\setminus\{i^*\}$ such that $d(i^*,j)=y$. By the triangle inequality, we have that

$$d(j,c)\geq y-d(i^*,c)\geq\frac{2y}{3}\geq\frac{1}{3}\left(d(j,i^*)+d(i^*,x^*)\right)\geq\frac{1}{3}d(j,x^*).$$

Thus $d(j,x^*)\leq 3d(j,c)$, so in either case some agent in $S$ has a selected center within a factor of $3$ of its distance to $c$. ∎

5 Fairness for the Discrete Setting

We presented PRF as a desirable concept for the unconstrained setting. It is not immediate how to adapt the concept to the discrete setting: applying the unconstrained PRF definition directly leads to issues of non-existence. On the other hand, some natural attempts to account for the coarseness of the candidate space $\mathscr{M}$ lead to concepts that are too weak. For example, one alternative is to require that an outcome $X$ be such that, for any set of agents $S\subseteq\mathscr{N}$ of size $|S|\geq\ell n/k$ whose maximum pairwise distance is $y$, we have $|\{c\in X : \exists i\in S\text{ s.t. }d(i,c)\leq y\}|\geq\min\{\ell,|\bigcup_{j\in S}B_y(j)\cap\mathscr{M}|\}$, where $B_r(i)$ denotes a ball of radius $r$ around agent $i$. Alternatively, we could replace the right-hand side of the final inequality with $\min\{\ell,|\bigcap_{i\in S}B_y(i)\cap\mathscr{M}|\}$. Both of these versions are too weak. To see this, suppose $k=1$, all agents are located at $0$, and $\mathscr{M}=\{\ldots,-2,-1,1,2,3,\ldots\}$; then $y=0$, so neither version places any restriction on the facility location, and neither even implies UP.

To resolve these issues, we need to take into account that the nearest candidate locations may be very far from a given subset of agents. We obtain a PRF concept for the discrete setting via a careful adaptation of the unconstrained version.

Definition 5 (Proportionally Representative Fairness (PRF) for Discrete Clustering).

An outcome $X\subseteq\mathscr{M}$ with $|X|=k$ satisfies PRF for Discrete Clustering if the following holds: for any set of agents $S\subseteq\mathscr{N}$ of size $|S|\geq\ell n/k$, if there are $\ell'\leq\ell$ candidates from $\mathscr{M}$ that are at distance at most $y$ from all agents in $S$, then $|\{c\in X : \exists i\in S\text{ s.t. }d(i,c)\leq y\}|\geq\ell'$.

We begin by highlighting that, in the discrete setting, PRF implies $(1+\sqrt{2})$-approximate PF, the best PF approximation known to be achievable in the discrete setting [14]. Our proof follows similarly to Theorem 1 of Chen et al. [14].

Proposition 2.

For metric spaces, if an outcome $X\subseteq\mathscr{M}$ with $|X|=k$ satisfies PRF for Discrete Clustering, then $X$ is $(1+\sqrt{2})$-approximate PF.

For the discrete setting, we propose a new algorithm (Algorithm 2). Algorithm 2 is similar to Algorithm 1, which was designed for the unconstrained setting; in fact, Algorithm 1 can be viewed as first setting $\mathscr{M}$ to the multiset of candidate locations corresponding to the agents in $\mathscr{N}$ and then running Algorithm 2. Like Algorithm 1, Algorithm 2 terminates in polynomial time, returns $k$ centers, and its output satisfies PRF.

Algorithm 2 SEAR (Spatial Expanding Approval Rule) for Discrete Clustering
  Input: metric space $\mathscr{X}$ with a distance measure $d:\mathscr{X}\times\mathscr{X}\rightarrow\mathbb{R}^{+}\cup\{0\}$, a finite multiset $\mathscr{N}\subseteq\mathscr{X}$ of $n$ data points (agents), a finite set of candidate locations $\mathscr{M}$, and a positive integer $k$.
  Output: A multiset of $k$ centers
  $w_i\leftarrow 1$ for each $i\in\mathscr{N}$
  Consider the set $D=\{d(i,c)\mid i\in\mathscr{N},c\in\mathscr{M}\}$. Order the entries in $D$ as $d_1\leq d_2\leq\cdots\leq d_{|D|}$.
  $j\leftarrow 1$
  $W\leftarrow\emptyset$
  while $|W|<k$ do
     $C^*=\{c\in\mathscr{M}\mid\sum_{i\in\mathscr{N}:d(i,c)\leq d_j}w_i\geq n/k\}$
     if $C^*=\emptyset$ then
        $j\leftarrow j+1$
     else
        Select a candidate $c^*\in\arg\max_{c'\in C^*}\sum_{i\in\mathscr{N}:d(i,c')\leq d_j}w_i$; $W\leftarrow W\cup\{c^*\}$; $\mathscr{M}\leftarrow\mathscr{M}\setminus\{c^*\}$
        $N'\leftarrow\{i\in\mathscr{N}:d(i,c^*)\leq d_j\}$
        Modify the weights of agents in $N'$ so that their total weight $\sum_{i\in N'}w_i$ decreases by exactly $n/k$.
  Return $W$
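Assuming the sear sketch given after Algorithm 1, the two variants differ only in the candidate pool. On hypothetical toy data:

```python
agents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

# Algorithm 1 (unconstrained): candidate pool = the agents' own locations.
print(sear(agents, k=2))  # selects one location per dense group

# Algorithm 2 (discrete): centers restricted to a finite set of sites.
sites = [(0.0, 0.0), (2.5, 2.5), (5.0, 5.0)]
print(sear(agents, k=2, candidates=sites))  # selects (0,0) and (5,5)
```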
Theorem 2.

Algorithm 2 terminates in polynomial time $O(|\mathscr{M}|^2n^2)$ and returns a set of $k$ centers which satisfies PRF and $(1+\sqrt{2})$-approximate PF.

Notably, by satisfying PRF, Algorithm 2 matches the best known PF approximation factor in the discrete setting. As the following remark highlights, however, the existing algorithms do not capture PRF.

Remark 1.

Chen et al. [14] present Greedy Capture; an equivalent algorithm, called ALGg, is also presented by Li et al. [33]. However, these algorithms do not satisfy PRF. In fact, for a given $k$, these algorithms may fail to output $k$ candidate locations. (To see this, let $k=3$ and let the agent locations be $\boldsymbol{x}=(0,0,1)$. Chen et al.'s Greedy Capture and Li et al.'s ALGg will select candidate locations $0$ and $1$, but will not output a third location. Of course, this issue could be rectified by arbitrarily choosing a third candidate location, but then there is no guarantee that the set of locations would satisfy PRF.)

In Appendix F, we experimentally compare Algorithm 2 to ALGg and a variant of $k$-means. We show that it performs especially well when the mean squared distance of the agents is calculated based on their distance not only from the closest center but also from the next $j$ closest centers (for large $j$). This provides another perspective indicating that, in situations where a "fair" clustering should account for an agent's distance from more than just her closest center, our algorithm is better suited than the existing alternatives. In Appendix C, we consider two natural, stronger notions of PRF in the discrete setting (that also apply to the continuous setting); we show that the versions we consider do not guarantee the existence of a fair outcome.

6 Discussion

We revisited a classical AI problem with a novel solution concept that builds on ideas from social choice and fair division. We proposed a natural fairness concept called PRF for clustering. PRF is guaranteed to exist and has desirable features: it implies unanimous proportionality, does not require the specification of protected groups, and is robust to outliers and multiplicative scaling. Even if one's only focus is the PF axiom, we make significant progress on efficient algorithms for approximate PF on general metric spaces. We did not focus on strategic aspects; unfortunately, PRF and PF are in general incompatible with agents' incentives to report their locations truthfully (see Appendix E). A natural next step is to identify stronger versions of PRF that still guarantee existence. It will be interesting to consider refinements of our algorithms that make cleverer choices when selecting candidates with sufficient weighted support. Another direction for future research is to understand how much the PRF constraint impacts standard optimization objectives such as $k$-means, $k$-median, or $k$-center.

Acknowledgments

Haris Aziz is supported by the NSF-CSIRO project on “Fair Sequential Collective Decision-Making”. We thank John Dickerson and Bo Li for helpful comments.

References

  • Abbasi et al. [2021] Abbasi, M., Bhaskara, A., Venkatasubramanian, S.: Fair clustering via equitable group representations. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 504–514, FAccT ’21, Association for Computing Machinery, New York, NY, USA (2021)
  • Ahmadian et al. [2019] Ahmadian, S., Epasto, A., Kumar, R., Mahdian, M.: Clustering without over-representation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 267–275, ACM (2019)
  • AI-Committee [2019] AI-Committee: The national artificial intelligence research and development strategic plan: 2019 update. Tech. rep., National Science and Technology Council, Washington, D.C. (2019)
  • Aloise et al. [2009] Aloise, D., Deshpande, A., Hansen, P., Popat, P.: Np-hardness of euclidean sum-of-squares clustering. Machine Learning 75(2), 245–248 (2009)
  • Arthur and Vassilvitskii [2007] Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pp. 1027–1035, SIAM (2007)
  • Aziz et al. [2017] Aziz, H., Brill, M., Conitzer, V., Elkind, E., Freeman, R., Walsh, T.: Justified representation in approval-based committee voting. Social Choice and Welfare 48(2), 461–485 (2017)
  • Aziz and Lee [2020] Aziz, H., Lee, B.: The expanding approvals rule: Improving proportional representation and monotonicity. Social Choice and Welfare 54(1), 1–45 (2020)
  • Aziz et al. [2023] Aziz, H., Lee, B.E., Chu, S.M., Vollen, J.: Proportionally representative clustering. arXiv preprint arXiv:2304.13917 (2023)
  • Backurs et al. [2019] Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., Wagner, T.: Scalable fair clustering. In: Proceedings of the 36th International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research, vol. 97, pp. 405–413, PMLR (2019)
  • Bera et al. [2019] Bera, S.K., Chakrabarty, D., Flores, N., Negahbani, M.: Fair algorithms for clustering. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pp. 4955–4966 (2019)
  • Brill and Peters [2023] Brill, M., Peters, J.: Robust and verifiable proportionality axioms for multiwinner voting. In: EC ’23: The 24th ACM Conference on Economics and Computation, Boulder, CO, USA, July 11 - 15, 2022, ACM (2023)
  • Chakrabarty and Negahbani [2021] Chakrabarty, D., Negahbani, M.: Better algorithms for individually fair k-clustering. CoRR abs/2106.12150 (2021)
  • Chan et al. [2021] Chan, H., Filos-Ratsikas, A., Li, B., Li, M., Wang, C.: Mechanism design for facility location problems: A survey. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI) (2021)
  • Chen et al. [2019] Chen, X., Fain, B., Lyu, L., Munagala, K.: Proportionally fair clustering. In: Proceedings of the 36th International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research, vol. 97, pp. 1032–1041, PMLR (2019)
  • Chierichetti et al. [2017] Chierichetti, F., Kumar, R., Lattanzi, S., Vassilvitskii, S.: Fair clustering through fairlets. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 5029–5037 (2017)
  • Chiplunkar et al. [2020] Chiplunkar, A., Kale, S., Ramamoorthy, S.N.: How to solve fair k-center in massive data models. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, vol. 119, pp. 1877–1886, PMLR (2020)
  • Chouldechova and Roth [2020] Chouldechova, A., Roth, A.: A snapshot of the frontiers of fairness in machine learning. Communications of the ACM 63(5), 82–89 (2020)
  • Congressional-Research-Service [2021] Congressional-Research-Service: Artificial intelligence: Background, selected issues, and policy considerations. Tech. rep., Congressional Research Service Reports (2021)
  • Dua and Graff [2017] Dua, D., Graff, C.: UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences (2017), URL http://archive.ics.uci.edu/ml
  • Dummett [1984] Dummett, M.: Voting Procedures. Oxford University Press (1984)
  • Dwork et al. [2012] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012, pp. 214–226, ACM (2012)
  • Ebadian and Micha [2024] Ebadian, S., Micha, E.: Boosting sortition via proportional representation. arXiv preprint arXiv:2406.00913 (2024)
  • Esmaeili et al. [2021] Esmaeili, S.A., Brubach, B., Srinivasan, A., Dickerson, J.P.: Fair clustering under a bounded cost. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
  • European Commission [2020] European Commission: White paper on artificial intelligence: a european approach to excellence and trust. White Paper COM(2020) 65 final, European Commission, Brussels (Feb 2020)
  • Everitt et al. [2009] Everitt, B.S., Landau, S., Leese, M.: Cluster Analysis. Wiley Publishing, 4th edn. (2009)
  • Farahani and Hekmatfar [2009] Farahani, R.Z., Hekmatfar, M.: Facility location: concepts, models, algorithms and case studies. Springer Science & Business Media (2009)
  • Friedman and Nissenbaum [1996] Friedman, B., Nissenbaum, H.: Bias in computer systems. ACM Trans. Inf. Syst. 14(3), 330–347 (Jul 1996)
  • Jung et al. [2020] Jung, C., Kannan, S., Lutz, N.: Service in your neighborhood: Fairness in center location. In: 1st Symposium on Foundations of Responsible Computing, FORC 2020, June 1-3, 2020, Harvard University, Cambridge, MA, USA (virtual conference), LIPIcs, vol. 156, pp. 5:1–5:15 (2020)
  • Kalayci et al. [2024] Kalayci, Y.H., Kempe, D., Kher, V.: Proportional representation in metric spaces and low-distortion committee selection. In: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, pp. 9815–9823, AAAI Press (2024)
  • Kellerhals and Peters [2023] Kellerhals, L., Peters, J.: Proportional fairness in clustering: A social choice perspective. CoRR abs/2310.18162 (2023)
  • Kleindessner et al. [2019] Kleindessner, M., Samadi, S., Awasthi, P., Morgenstern, J.: Guarantees for spectral clustering with fairness constraints. In: Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97, pp. 3458–3467, PMLR (2019)
  • Lackner and Skowron [2023] Lackner, M., Skowron, P.: Multi-Winner Voting with Approval Preferences. Springer Briefs in Intelligent Systems, Springer (2023)
  • Li et al. [2021] Li, B., Li, L., Sun, A., Wang, C., Wang, Y.: Approximate group fairness for clustering. In: Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 139, pp. 6381–6391, PMLR (2021)
  • Megiddo and Supowit [1984] Megiddo, N., Supowit, K.J.: On the complexity of some common geometric location problems. SIAM Journal on Computing 13(1), 182–196 (1984)
  • Mehrabi et al. [2021] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Computing Surveys 54(6) (2021)
  • Micha and Shah [2020] Micha, E., Shah, N.: Proportionally fair clustering revisited. In: 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, vol. 168, pp. 85:1–85:16 (2020)
  • O’Neil [2016] O’Neil, C.: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, USA (2016)
  • Peters et al. [2021] Peters, D., Pierczyński, G., Skowron, P.: Proportional participatory budgeting with additive utilities. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), pp. 12726–12737 (2021)
  • Peters et al. [1992] Peters, H., van der Stel, H., Storcken, T.: Pareto optimality, anonymity, and strategy-proofness in location problems. International Journal of Game Theory 21(3), 221–235 (1992)
  • Procaccia [2008] Procaccia, A.D.: Towards a theory of incentives in machine learning. SIGecom Exchanges 7(2) (2008)
  • Procaccia and Tennenholtz [2013] Procaccia, A.D., Tennenholtz, M.: Approximate mechanism design without money. ACM Transactions on Economics and Computation 1(4) (2013)
  • Vakilian and Yalçiner [2021] Vakilian, A., Yalçiner, M.: Improved approximation algorithms for individually fair clustering. CoRR abs/2106.14043 (2021)

Appendix A Omitted proofs

Proof of Theorem 1.

We prove by induction that if the number of candidates selected is less than $k$, then at least one more center can be selected. If the number of candidates selected is less than $k$, there is still aggregate weight of at least $n/k$ on the set of all agents. These agents can get a candidate selected from their neighborhoods once the neighborhood radius is large enough; in particular, for $d_{|D|}=\max_{i,j\in\mathscr{N}}d(i,j)$, some unselected candidate from $\mathscr{M}$ can be selected.

Note that for a given radius $d_j$, at most $k$ candidates can be selected; if no more candidates can be selected for a given $d_j$, the next radius $d_{j+1}$ is considered. There are $O(n^2)$ different distances to be considered. Next, we bound the running time. The $O(n^2)$ distances are sorted in $O(n^2\log(n^2))$ time. For each distance that we consider, we check for each of the $kn$ locations whether it gets sufficient support of $n/k$; it takes time $kn^2$ to check whether some location has support $n/k$ for the current neighborhood distance. If no location has enough support, we move to the next distance. If some location has enough support, it is selected and one fewer center remains to be chosen. Therefore, we need to check whether some location has enough support at most $\max(k,n^2k)$ times. Therefore the running time is $O(n^2\log(n^2))+O((kn^2)(n^2k))=O(n^4k^2)$.

The fact that the algorithm gives PRF and 3-PF follows immediately from Lemma 1 and Lemma 2. ∎

Proof of Proposition 2.

Let $X$ be an outcome which satisfies PRF for Discrete Clustering. Fix $S\subseteq\mathscr{N}$ and $c\in\mathscr{M}$ with $|S|\geq\lceil\frac{n}{k}\rceil$. Let $y=\max_{i\in S}d(i,c)$ be the maximum distance of any agent in $S$ from $c$, and denote an agent attaining this maximum by $i^*$. We assume that $y>0$; otherwise, if $y=0$, then every agent in $S$ is located at $c$, $c\in X$ (by PRF), and $d(i,c)=d(i,X)$ for all $i\in S$.

Since $|S|\geq\frac{n}{k}$ and every agent in $S$ is at distance at most $y$ from $c$, it follows from $X$ satisfying PRF that there exist $x\in X$ and $i\in S$ such that $d(i,x)\leq y$. By the triangle inequality, $d(c,x)\leq d(i,c)+d(i,x)$ and thus $d(i^*,x)\leq y+d(i,c)+d(i,x)$. From these observations, the minimum PF approximation ratio obtained by $i$ or $i^*$ becomes

$$\min\left(\frac{d(i,x)}{d(i,c)},\frac{d(i^*,x)}{d(i^*,c)}\right)\leq\min\left(\frac{y}{d(i,c)},\frac{y+d(i,c)+d(i,x)}{y}\right)\leq\min\left(\frac{y}{d(i,c)},2+\frac{d(i,c)}{y}\right)\leq\max_{z\geq 0}\min\left(z,2+\frac{1}{z}\right)=1+\sqrt{2}.$$

Note that the final inequality above holds since replacing $\frac{y}{d(i,c)}$ by the non-negative number $z$ that maximizes the expression can only weakly increase it. The expression $\min(z,2+1/z)$ is maximized where $z=2+1/z$, i.e., where $z^2-2z-1=0$, giving $z=1+\sqrt{2}$; hence the final transition. Thus, for any $c\in\mathscr{M}$, there exists $i\in S$ such that $d(i,X)\leq(1+\sqrt{2})d(i,c)$. ∎

Proof of Theorem 2.

Output size. We prove by induction that if the number of candidates selected is less than $k$, then at least one more center can be selected. If the number of candidates selected is less than $k$, there is still aggregate weight of at least $n/k$ on the set of all agents. These agents can get a candidate selected from their neighborhood once the neighborhood distance is large enough; in particular, for $d_{|D|}=\max_{i\in\mathscr{N},c\in\mathscr{M}}d(i,c)$, some unselected candidate from $\mathscr{M}$ can be selected.

Running time. Note that for a given radius d_{j}, at most k candidates can be selected. If no more candidates can be selected for a given d_{j}, the next distance d_{j+1} is considered. Thus, there are at most n\cdot|{\mathscr{M}}| distinct distances to be considered, one for each agent–candidate pair. These distances are sorted in O(n|{\mathscr{M}}|\log(n|{\mathscr{M}}|)) time. For each distance under consideration, we check each of the |{\mathscr{M}}| locations for sufficient support of n/k, which takes O(|{\mathscr{M}}|\cdot n) time. If no location has enough support, we move to the next distance; if some location has enough support, it is selected, and one fewer location remains to be selected. Each check therefore either advances to the next distance (at most n|{\mathscr{M}}| times) or selects a location (at most k times), so at most k+n|{\mathscr{M}}| checks are performed. The total running time is O(n|{\mathscr{M}}|\log(n|{\mathscr{M}}|))+O((|{\mathscr{M}}|n)(k+n|{\mathscr{M}}|))=O(n^{2}|{\mathscr{M}}|^{2}).

PRF. Suppose for contradiction that the algorithm finds a set of k centers W that violates PRF. Then there exists a set of agents S\subseteq{\mathscr{N}} of size |S|\geq\ell n/k such that there are at least \ell'\leq\ell locations in {\mathscr{M}} within distance y of some agent in S, but the number of locations in W within distance y of some agent in S is at most \ell-1. Consider the snapshot of the algorithm when neighborhood distance y is considered. At this point, the number of selected locations within distance y of some agent in S is at most \ell-1. Since, up to this point, the agents in S have only used their weight towards selecting locations within distance y, it follows that \sum_{i\in S}w_{i}\geq\ell\frac{n}{k}-(\ell-1)\frac{n}{k}=\frac{n}{k}. Hence the agents in S still have total weight at least \frac{n}{k} to select one more location at distance at most y from some agent in S. Thus, when the neighborhood distance is y, the algorithm would have selected at least one more center within distance y than in the outcome W. Thus W is not the correct output of the algorithm, a contradiction.

PF approximation. Since, as we just showed, the outcome returned by Algorithm 2 satisfies PRF, it follows from Proposition 2 that the outcome returned by Algorithm 2 satisfies (1+\sqrt{2})-PF. ∎
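To make the selection rule analyzed above concrete, the following is a minimal Python sketch of the expanding-radius procedure as described in this proof. The function name prf_select and the proportional weight-deduction rule are our own choices for illustration; the counting argument above only requires that each selection consume exactly n/k total weight. The unconstrained variant is analogous, sweeping the agent–agent distances over its kn candidate locations.

def prf_select(agents, candidates, k, dist):
    """Sketch of the expanding-radius selection rule.

    agents / candidates: lists of points; dist: the metric d.
    Every agent starts with weight 1; each selection consumes a
    total of n/k weight from the supporting agents.
    """
    n = len(agents)
    weight = [1.0] * n
    selected = []  # indices into candidates
    # Sweep the distinct agent-candidate distances in increasing order.
    for y in sorted({dist(a, c) for a in agents for c in candidates}):
        progress = True
        while progress and len(selected) < k:
            progress = False
            for j, c in enumerate(candidates):
                if j in selected:
                    continue
                support = [i for i, a in enumerate(agents) if dist(a, c) <= y]
                total = sum(weight[i] for i in support)
                if total >= n / k:
                    selected.append(j)
                    # Deduct n/k weight, here proportionally to current weights.
                    for i in support:
                        weight[i] -= (n / k) * (weight[i] / total)
                    progress = True
                    break
        if len(selected) == k:
            break
    return [candidates[j] for j in selected]

For instance, on the one-dimensional instance used in the proof of Proposition 4 below (agents (-1, 1, 1.2), candidates (0, 0.9, 1.3), k=2, and dist the absolute difference), this sketch returns [0.9, 0], i.e., the PRF outcome X=(0,0.9) discussed there.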

Appendix B Tight Analysis of PF Approximations

The following example shows the analysis in Proposition 1 is tight. That is, there is an instance for which an outcome satisfying PRF for Unconstrained Clustering obtains no better than \frac{3+\sqrt{17}}{2}-PF.

Example 5.

Consider an instance of unconstrained clustering with k=1 and {\mathscr{N}}=\{i,j,l\}. Furthermore, let c and x be two candidate locations in {\mathscr{M}} such that the distances between agents and candidates are as given in Table 2.

     i              j              l
i    0              6+2\sqrt{17}   7+\sqrt{17}
j    6+2\sqrt{17}   0              7+\sqrt{17}
l    7+\sqrt{17}    7+\sqrt{17}    0
c    3+\sqrt{17}    3+\sqrt{17}    4
x    13+3\sqrt{17}  13+3\sqrt{17}  6+2\sqrt{17}
Table 2: Distances between agents \{i,j,l\} and candidates \{c,x\}.

It can be verified that all distances in Table 2 satisfy the triangle inequality. Note that, since 6+2\sqrt{17} is the maximum distance between any pair of agents in {\mathscr{N}} (call this y), and d(l,x)=y, it holds that \{x\} satisfies PRF for Unconstrained Clustering. Through straightforward arithmetic, it can then be shown that \frac{d(i,x)}{d(i,c)}=\frac{d(j,x)}{d(j,c)}=\frac{d(l,x)}{d(l,c)}=\frac{3+\sqrt{17}}{2}.
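For the reader's convenience, the arithmetic for agent i can be checked by rationalizing the denominator (agent j is symmetric, and agent l is immediate):

\[
\frac{d(i,x)}{d(i,c)}=\frac{13+3\sqrt{17}}{3+\sqrt{17}}
=\frac{(13+3\sqrt{17})(\sqrt{17}-3)}{(\sqrt{17}+3)(\sqrt{17}-3)}
=\frac{13\sqrt{17}-39+51-9\sqrt{17}}{17-9}
=\frac{12+4\sqrt{17}}{8}=\frac{3+\sqrt{17}}{2},
\]

while d(l,x)/d(l,c)=(6+2\sqrt{17})/4=(3+\sqrt{17})/2.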

Appendix C Strengthening of PRF

We consider two natural, stronger notions of PRF in the discrete setting (they apply to the continuous setting as well). We show that neither version guarantees the existence of a fair outcome in all instances.

The PRF concept can be strengthened by making the requirements on the outcome increasingly demanding, as follows.

Definition 6 (Proportionally Representative Fairness (PRF)-II for Discrete Clustering).

For any set of agents S\subseteq{\mathscr{N}} of size |S|\geq\ell n/k, if there are \ell'\leq\ell candidates from {\mathscr{M}} that are at distance at most y from all agents in S, the following holds: there exists some i\in S such that |\{c\in X : d(i,c)\leq y\}|\geq\ell'.

Definition 7 (Proportionally Representative Fairness (PRF)-III for Discrete Clustering).

For any set of agents S\subseteq{\mathscr{N}} of size |S|\geq\ell n/k, if there are \ell'\leq\ell candidates from {\mathscr{M}} that are at distance at most y from all agents in S, the following holds: |\{c\in X : d(i,c)\leq y\text{ for all }i\in S\}|\geq\ell'.

The following example shows that an outcome satisfying PRF-II, and hence an outcome satisfying PRF-III, may not exist.

Example 6.

Take n=4 and k=2 with the following location profile on the unit interval [0,1]: \boldsymbol{x}=(0,0,1,1). There are 4 candidate locations, {\mathscr{M}}=(0,0.5,0.5,1). It is immediate that the two groups of agents at 0 and at 1 must each have a facility at their own location (apply PRF-II to each group with y=0), so X=\{0,1\}. But this violates PRF-II: the set of all agents N (of size 4) has 2 candidate locations within distance 0.5 of all agents, yet no agent in N has 2 facility locations within distance 0.5.
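This impossibility can also be confirmed by brute force. The following sketch (our own, purely illustrative) checks every size-2 outcome against the PRF-II condition and finds that none satisfies it:

from itertools import combinations

agents = [0.0, 0.0, 1.0, 1.0]
candidates = [0.0, 0.5, 0.5, 1.0]
n, k = len(agents), 2

def satisfies_prf2(X):
    for size in range(1, n + 1):
        ell = size * k // n  # largest ell with size >= ell * n / k
        if ell == 0:
            continue
        for S in combinations(range(n), size):
            # Only radii at which candidate neighborhoods change matter.
            for y in sorted({abs(agents[i] - c) for i in S for c in candidates}):
                near_all = [c for c in candidates
                            if all(abs(agents[i] - c) <= y for i in S)]
                ell_prime = min(ell, len(near_all))
                if ell_prime == 0:
                    continue
                # PRF-II: some agent in S must have ell_prime chosen
                # centers within distance y.
                if not any(sum(abs(agents[i] - c) <= y for c in X) >= ell_prime
                           for i in S):
                    return False
    return True

print(any(satisfies_prf2(X) for X in combinations(candidates, 2)))  # False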

Appendix D k-means++

Step 1.

The first centroid is chosen uniformly at random from the data points.

Step 2.

Calculate the Euclidean distance D(x) between every data point x and its nearest already-chosen centroid.

Step 3.

Select the next centroid at random from the data points, choosing x with probability proportional to D(x)^{2}; points far from the existing centroids are thus more likely to be chosen.

Step 4.

Repeat Steps 2 and 3 until k centroids have been selected.

Step 5.

Starting from these centroids, run Lloyd's iterations: associate every point with its nearest centroid, recompute each centroid, and repeat until convergence.
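A minimal sketch of the seeding procedure (Steps 1–4) in Python, assuming the data is given as a NumPy array; this is an illustration, not the implementation used in the experiments reported below:

import numpy as np

def kmeanspp_seed(points, k, seed=None):
    """Return k initial centroids chosen from `points` by k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 1: the first centroid is chosen uniformly at random.
    centroids = [points[rng.integers(n)]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest centroid.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids],
                    axis=0)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        centroids.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)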

Appendix E Strategyproofness and Fairness

Most of the work on clustering assumes that the data points are truthfully reported. If data points correspond to individual agents who prefer to have a nearby center, then the issue of incentives also arises (see, e.g., Procaccia [40]). It is easy to see that Algorithm 1 for the unconstrained setting coincides with the mid-point mechanism when there are two agents on a line. Since the mid-point (egalitarian) mechanism is known to be manipulable (see, e.g., Procaccia and Tennenholtz [41]), Algorithm 1 is not strategyproof even for k=1 (in which case an agent simply wants the sole facility to be as close as possible). A discretized version of the same example shows that Algorithm 2 is also not strategyproof.

Rather than analyzing the strategic aspects of a particular algorithm, we next examine the tradeoff between fairness and strategyproofness. We focus on the case k=1, for which the preference relation of an agent is clear: an agent wants the location to be as close as possible.

Proposition 3.

Even for k=1 and the unconstrained setting, there exists no anonymous, PF, and strategyproof algorithm.

Proof.

For k=1 and Euclidean space, an outcome is Pareto optimal if and only if it is in the convex hull of the agent locations [39]. Next, we show that weak Pareto optimality is equivalent to Pareto optimality in this context; that is, if a location x is not Pareto optimal, then it is not weakly Pareto optimal. Since x is not Pareto optimal, it lies strictly outside the convex hull H of the agent locations. Let y be the point in H closest to x; note that y is either a vertex of H or the orthogonal projection of x onto a face of H. Consider the hyperplane L through y orthogonal to the segment between x and y. All points of N lie on L or on the opposite side of L from x. Let i be an arbitrary point in N. If i coincides with y, then d(i,y)=0<d(i,x). Now suppose i is not at the same location as y, and consider the triangle (i,y,x). Since the segment between x and y is orthogonal to L, the angle of this triangle at y is at least 90^{\circ}. Each of the other two angles is then strictly less than 90^{\circ}, as the angles of a triangle sum to 180^{\circ}. By the well-known triangle inequality law that "if two angles of a triangle are unequal, then the measures of the sides opposite these angles are also unequal, and the longer side is opposite the greater angle", it follows that d(i,y)<d(i,x). We have shown that d(i,y)<d(i,x) for all i\in N. Hence x is not weakly Pareto optimal.

Peters et al. [39] proved that there is no anonymous, strategyproof, and Pareto optimal mechanism for Euclidean space with 3 dimensions. From our argument above, it follows that there is no anonymous, strategyproof, and weakly Pareto optimal mechanism for Euclidean space with 3 dimensions. Since PF is equivalent to weak Pareto optimality for k=1, it follows that for k=1, there is no anonymous, PF, and strategyproof algorithm. ∎

Corollary 1.

Even for k=1, there exists no anonymous, core fair, and strategyproof algorithm.

Proposition 4.

Even for k=2 and the discrete setting, there exists no PRF and strategyproof algorithm, even in one dimension.

Proof.

Let \boldsymbol{x}=(-1,1,1.2) and {\mathscr{M}}=(0,0.9,1.3), with k=2. PRF (for the discrete setting) requires that X=(0,0.9): the group of agents S=N can demand 2 centers, and there exist 2 centers within distance y=2.2 of all agent locations—namely, 0 and 0.9. The other groups of agents add no further demands (i.e., no demands that are not already satisfied by considering S=N). Under this outcome, agent 3 has access to 2 facilities at distances 0.3 and 1.2, respectively.

Now consider a deviation by agent 3 to x_{3}'=1.4. The group of agents S=\{2,3\} can now demand that the candidate center at 1.3 be chosen, since it is the only center within y=0.4 of both agents 2 and 3. There are then 2 possible outcomes: X'=(0.9,1.3) and X''=(0,1.3). In the first case, agent 3 (with true location 1.2) is at distances 0.1 and 0.3 from the two centers; in the second case, at distances 0.1 and 1.2. In either case, the new outcome component-wise dominates the original outcome for agent 3 (under sincere reporting by the other agents), with at least one strict improvement. Hence, PRF is incompatible with strategyproofness in the discrete setting. ∎
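These cost comparisons can be checked mechanically; a throwaway sketch of our own, with distances rounded to suppress floating-point noise:

def sorted_costs(location, centers):
    # Distances from `location` to the chosen centers, sorted ascending.
    return sorted(round(abs(location - c), 10) for c in centers)

truth = 1.2  # agent 3's true location
print(sorted_costs(truth, (0, 0.9)))    # [0.3, 1.2]  (sincere outcome)
print(sorted_costs(truth, (0.9, 1.3)))  # [0.1, 0.3]  (dominates [0.3, 1.2])
print(sorted_costs(truth, (0, 1.3)))    # [0.1, 1.2]  (dominates [0.3, 1.2])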

Appendix F Experimental results

            MSD to closest 1                  MSD to closest k/2                 MSD to closest k
Dataset     k++ Avg. MSD    Alg 1    ALG_g    k++ Avg. MSD       Alg 1   ALG_g   k++ Avg. MSD        Alg 1   ALG_g
Wholesale   34,977,392.00   664%     112%     11,668,107,198.00  -10%    -12%    100,307,739,343.00  -63%    -55%
HCV         811.60          519%     429%     582,360.00         -4%     -6%     4,858,449.00        -70%    -62%
Buddy       782.20          8%       5%       181,700.00         -8%     -7%     784,360.00          -13%    -9%
Seeds       0.54            23%      8%       228.68             -1%     -1%     1,449.94            -4%     -1%
Table 3: Performance of the clustering algorithms with respect to MSD to the closest 1, k/2, and k centroids. For each measure, "k++ Avg. MSD" is the average MSD achieved by k-means++, and the "Alg 1" and "ALG_g" columns report each algorithm's difference in MSD relative to k-means++.

We now apply Algorithm 1 to four real-world datasets. We consider the discrete domain in which {\mathscr{M}}={\mathscr{N}}, since most clustering datasets contain only the data points. Algorithm 1 is guaranteed to satisfy PRF; we are therefore interested in analyzing its performance with respect to other objectives. A prominent objective for clustering algorithms is to minimize the Mean Squared Distance (MSD) to the closest center: the average squared distance between each datapoint and its closest center. We consider this performance measure as the number of centers ranges from k=1 to 100.

In addition to MSD to the closest center, we analyze two related measures: MSD to the closest k/2 centers and MSD to the closest k centers. Intuitively, MSD to the closest k/2 centers is the average squared distance between each datapoint and its k/2 closest centers (MSD to the closest k centers is analogous). These measures capture—to varying extents—the idea that it may be desirable to have datapoints located close to more than just one center; this idea is at the core of the proportional representation concept. To benchmark Algorithm 1's performance, we also implement two other algorithms: the well-known k-means++ algorithm of Arthur and Vassilvitskii [5] (i.e., Lloyd's algorithm for the k-means minimization objective with a particular initialization; given our focus on the discrete domain, we use a variant that only chooses centers from among the data points) and ALG_{g} of Li et al. [33]. For all of our MSD measures, a smaller value is more desirable.
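Concretely, the measure can be computed as in the following sketch (our own formulation of the definition above, assuming squared Euclidean distances as the name of the measure suggests):

import numpy as np

def msd_to_t_closest(points, centers, t):
    """Mean squared distance from each point to its t closest centers.

    points: (n, d) array; centers: (m, d) array; t: how many of the
    nearest centers to average over (t = 1, k/2, or k above).
    """
    # Pairwise squared Euclidean distances, shape (n, m).
    d2 = np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    # For each point, average over its t smallest squared distances.
    return np.sort(d2, axis=1)[:, :t].mean()

With t=1 this is the usual MSD objective; t=k/2 and t=k give the other two measures reported in Table 3.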

Datasets. The four datasets that we analyze are from the UCI Machine Learning Repository [19]; we summarize them below. The datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0); for details, please refer to https://archive-beta.ics.uci.edu/ml/donation-policy. HCV dataset: contains information about patients with Hepatitis C and blood donors. Seeds dataset: describes measurements of geometrical properties of kernels belonging to three different varieties of wheat. Buddy-move dataset: consists of user information from user reviews published on www.holidayiq.com about points of interest in South India. Wholesale dataset: consists of client information for a wholesale distributor, including annual spending on various product categories.

Results. Table 3 summarizes our results; we illustrate the analysis in Figures 4–9. Across all datasets, Algorithm 1's performance relative to k-means++ improves when moving from MSD to the closest center to MSD to the closest k/2 centers and MSD to the closest k centers. A similar pattern is observed for ALG_{g}, which is to be expected and reassuring since—like Algorithm 1—ALG_{g} is motivated by the idea of proportional fairness. For two of the datasets (Wholesale and HCV), Algorithm 1 performs substantially better than k-means++ with respect to MSD to the closest k centers; in these cases, Algorithm 1 also outperforms ALG_{g}. For MSD to the closest k/2 centers, ALG_{g} and Algorithm 1 either outperform or perform similarly to k-means++. In contrast, and as expected, k-means++ outperforms both algorithms with respect to MSD to the closest center; on this measure, ALG_{g} typically outperforms Algorithm 1. The experiments illustrate that other well-known algorithms produce outcomes that differ from those of Algorithm 1, and that this difference can be quantified via an intuitive PRF-related metric, namely the MSD to multiple centers.

Figure 4: Buddy dataset: MSD to closest 1, k/2, and k centroids.
Figure 5: Seeds dataset: MSD to closest 1, k/2, and k centroids.
Figure 6: Wholesale dataset: MSD to closest 1, k/2, and k centroids.
Figure 7: Buddy dataset: MSD to closest 1, k/2, and k centroids.
Figure 8: Seeds dataset: MSD to closest 1, k/2, and k centroids.
Figure 9: Wholesale dataset: MSD to closest 1, k/2, and k centroids.