Proportionally Representative Clustering

Haris Aziz [email protected] UNSW Sydney
Sydney, Australia Barton E. Lee [email protected] ETH Zürich, Switzerland Sean Morota Chu [email protected] UNSW Sydney
Sydney, Australia Jeremy Vollen [email protected] UNSW Sydney
Sydney, Australia

Abstract

In recent years, there has been a surge in effort to formalize notions of fairness in machine learning. We focus on centroid clustering—one of the fundamental tasks in unsupervised machine learning. We propose a new axiom “proportionally representative fairness” (PRF) that is designed for clustering problems where the selection of centroids reflects the distribution of data points and how tightly they are clustered together. Our fairness concept is not satisfied by existing fair clustering algorithms. We design efficient algorithms to achieve PRF both for unconstrained and discrete clustering problems. Our algorithm for the unconstrained setting is also the first known polynomial-time approximation algorithm for the well-studied Proportional Fairness (PF) axiom [14]. Our algorithm for the discrete setting also matches the best known approximation factor for PF.

1 Introduction

As artificial intelligence and machine learning serve as the ‘new electricity’ of systems, it is becoming increasingly important to ensure that AI systems embody societal values such as privacy, transparency, and fairness. Fairness is a growing concern, as AI systems are used to make decisions that can critically affect our daily lives, finances, and careers [17]. The impulse to tackle the issue of fairness in AI is prominently reflected in several governmental policy documents [18, 3, 24], the emergence of dedicated conferences on Ethical AI, and the formation of AI ethics boards by various big tech companies. Over the past years, several developments have taken place in the theory and application of fairness in supervised learning. One prominent approach is to use information derived from labelled data in supervised learning to formalize some notions of equity within and across (pre-specified and) protected attributes [21]. In contrast, in several models of unsupervised learning, protected attributes or their labelling are unknown. Fairness in unsupervised machine learning is becoming important with the growth of data collection both in public (e.g., via IoT devices) and private sectors (e.g., map services and social media). Unfettered and irresponsible use of such data has the danger of increasing inequity [37].

We focus on fairness in clustering, which is one of the most widely applied tasks in unsupervised machine learning. A (centroid selection) clustering problem has a metric space $\mathscr{X}$ with a distance measure $d:\mathscr{X}\times\mathscr{X}\rightarrow\mathbb{R}^{+}\cup\{0\}$ , a multiset ${\mathscr{N}}\subseteq\mathscr{X}$ of $n$ data points (agents), a set ${\mathscr{M}}$ of candidate centers, and positive integer $k\leq n$ . In many clustering problems, each data point corresponds to an individual’s attribute profile; thus, we refer to the data points as agents. The goal is to find a size- $k$ subset of centers, $Y\subseteq{\mathscr{M}}\ :\ |Y|=k$ . Such centroid selection problems can be widely applied to various applications including recommender systems and data-compression. The problem also captures social planning scenarios where $k$ facility locations need to be chosen to serve a set of agents.

In such a (centroid selection) clustering problem, there are two versions of the problem. In the discrete version, the set of candidate centers ${\mathscr{M}}$ is a finite subset of $\mathscr{X}$ . In the unconstrained or continuous setting, ${\mathscr{M}}=\mathscr{X}$ . We first focus on the unconstrained setting as done by Micha and Shah [36]. Later we discuss how our concepts and ideas extend to the discrete setting. For a location $i$ and set of locations $S{\subseteq{\mathscr{M}}}$ , we will denote $\min_{c\in S}d(i,c)$ as $d(i,S)$ . Standard clustering solutions include $k$ -center, $k$ -medians, and $k$ -means.¹¹1 The $k$ -center solution outputs a subset $Y\in\arg\min_{W\subseteq{\mathscr{M}},|W|=k}\max_{i\in{\mathscr{N}}}d(i,W)$ . The $k$ -medians solution outputs a subset $Y\in\arg\min_{W\subseteq{\mathscr{M}},|W|=k}\sum_{i\in{\mathscr{N}}}d(i,W)$ . The $k$ -means solution outputs a subset $Y\in\arg\min_{W\subseteq{\mathscr{M}},|W|=k}\sum_{i\in{\mathscr{N}}}d(i,W)^{2}$ . These standard objectives can be viewed as achieving some form of global welfare but are not well-suited for proportional representational and fairness concerns. The problems of computing outcomes optimising these objectives are also NP-hard [34, 4].

We seek a suitable concept for clustering that captures general fairness principles such as non-discrimination, equality of opportunity, and equality of outcome. In many applications, the data points correspond to real individuals or their positions or opinions. Such individuals may expect the clustering outcome to capture natural fairness requirements. Previous axiomatic studies of clustering have measured fair representation by the distance from each data point to the nearest centroid. We initiate a proportional representation perspective on fairness in clustering. In capturing this desiderata axiomatically, we require a property in which representation guarantees depend on the number (or proportion) of data points and how “tightly” they are clustered together.

Our perspective is motivated by applications in which groups of individuals (to which the data points correspond) may expect to receive centroid representation proportional to their size. For instance, the centroids may be citizens selected to make representative decisions on behalf of the public, as is done in a jury or a sortition panel. Similarly, computing a proportionally fair clustering is also meaningful in facility location, where the location of the facility should depend on the density and the number of people it is serving. More generally, our axioms and algorithms are well-motivated in any setting in which one may want to take a representative sample from datapoints between which a distance metric is well-defined. In this paper, we consider the following fundamental research question. What is a meaningful concept of proportional fairness in centroid clustering? Is such a concept guaranteed to be achievable?

Contributions.

In contrast to traditional optimization objectives in clustering and centroid selection, we take an axiomatic fairness perspective. We first identify a potential shortcoming of some recently introduced concepts that aim to capture proportional fairness for clustering; in particular, we show how the concepts do not capture an intuitive and minimal requirement of proportional representation called unanimous proportionality. We then propose a new axiom called proportional representative fairness (PRF), which is our central conceptual contribution.

We propose PRF for both the unconstrained and discrete clustering setting. We discuss how PRF overcomes some of the drawbacks of the previously introduced concepts. In particular, it implies the minimal requirement of unanimous proportionality. PRF has several other desirable features: a PRF outcome is guaranteed to exist, the axiom does not require a preset specification of protected groups and it is robust to outliers and multiplicative scaling. We show that none of the traditional clustering algorithms or previously introduced algorithms for fair clustering satisfy PRF.

We design polynomial-time algorithms to achieve PRF both in the unconstrained and discrete settings.²²2The results for unconstrained and discrete settings are incomparable in the following sense. The guarantee of existence of a fair outcome in the discrete setting guarantees the existence of a fair outcome in the unconstrained setting; however, a polynomial-time algorithm for computing a fair outcome in the discrete setting does not imply the same for the unconstrained setting. We call our algorithms SEAR (Spatial Expanding Approval Rule) as they are inspired by using core ideas from the ordinal multi-winner voting Expanding Approvals Rule (EAR) of Aziz and Lee [7] but carefully use spatial information to make synchronized decisions. We prove that our algorithm for the unconstrained setting also gives $3$ -approximate Proportional Fairness (PF) [14], and hence is the first known polynomial-time approximation algorithm for PF on general metric spaces. We also prove that our algorithm for the discrete setting matches the best known approximation of Proportional Fairness (PF), thereby providing no compromise on PF approximation but additionally satisfying PRF. We summarize our theoretical contributions in Table 1 via a comparison of our algorithms with the best guarantees previously known. All missing proofs can be found in A.

	Unconstrained			Discrete
	$\rho$ -PF	PRF	Complex.	$\rho$ -PF	PRF	Complex.
Our Algorithms (SEAR)	$\mathbf{3}$	✓	in P	$\mathbf{\approx 2.4}$	✓	in P
Greedy Capture	$\approx 2.4$	-	NP-hard	$\approx 2.4$	-	in P

Table 1: Theoretical comparison of our algorithms (Algorithm 1 for the unconstrained setting and Algorithm 2 for the discrete setting) with Greedy Capture (GC), which gives the best known approximate PF guarantees. Unconstrained and discrete GC results come from Micha and Shah [36] and Chen et al. [14], respectively.

2 Related Work

2.1 Fair clustering

Fairness in machine learning and computer systems is a vast topic [27]—for an overview, see Chouldechova and Roth [17] or the survey by Mehrabi et al. [35]. We focus on the fundamental learning problem of clustering, which is a well-studied problem in computer science, statistics, and operations research [25]. Our approach is axiomatic, focused on the concept of proportional fairness (or representation), and does not rely on the use of a pre-specified or protected set of attributes. Thus, the most closely related papers are those of Chen et al. [14], Micha and Shah [36], Li et al. [33], and Jung et al. [28], which we review below.

Chen et al. [14] proposed the concept of proportional fairness (PF) for the discrete setting, which is based on the entitlement of groups of agents of large enough size. They reasoned that PF is a desirable fairness concept but it is not guaranteed to be achievable for every instance. On the other hand, they showed that outcomes satisfying reasonable approximations of PF are guaranteed to exist and can be computed in polynomial time via the ‘Greedy Capture’ algorithm.

Micha and Shah [36] analyzed the proportional fairness axiom in the unconstrained setting and presented similar results to those obtained by Chen et al. [14]. They showed that the natural adaptation of the Greedy Capture algorithm finds an approximately PF outcome in Euclidean space. On the other hand, they showed that checking whether a PF outcome exists is NP-hard and that the clustering returned by the ‘greedy capture’ algorithm cannot be computed in polynomial time unless P = NP. Li et al. [33] consider the same discrete clustering setting as Chen et al. [14] and propose a stronger version of proportional fairness called core fairness. They present results on lower and upper approximation bounds for various restrictions of the discrete setting.

Jung et al. [28] consider the discrete clustering problem in which the set of agents and candidate centers coincide. For this setting, they propose a fairness concept that has a similar goal as that of Chen et al. [14]. In subsequent work, it has been referred to as individual fairness [42, 12]. Individual fairness requires that, for each agent $i$ , there is a selected center that is at most $r(i)$ from $i$ , where $r(i)$ is the smallest radius around $i$ that includes $n/k$ agents.

In a paper inspired by our work, Kellerhals and Peters [30] analyze the key algorithms and axioms in our paper in further detail and show that the axioms that we propose also imply approximations of various fairness concepts introduced in the literature. Drawing an analogy to multiwinner voting axioms, they refer to our proposed axiom PRF for the discrete setting as mPJR and also define a weaker axiom which they call mJR. They take a deep dive into the logical relations between the axioms and approximate notions of proportional fairness, individual fairness, core fairness, and another core-inspired generalization of proportional fairness [22]. They prove that any approximation to PF is also an approximation to individual fairness and vice versa. In addition to the algorithms we formalize in this paper, they also consider Greedy Capture [14] and a randomized version of this algorithm known as Fair Greedy Capture [22]. A key takeaway from our paper is reaffirmed by their work, namely that our algorithms are the best-known algorithms for satisfying proportional representation axioms while also giving non-trivial approximation guarantees with respect to other fairness notions.

There are also several concepts and algorithms in clustering where the protected groups are pre-specified based on protected attributes [1, 2, 10, 9, 31, 15, 16, 23]. In contrast, we are seeking fairness properties that apply to arbitrarily defined groups of all possible sizes. In our proposed axiom, these groups are endogenously determined from data points; in some real-world settings, this is a legal requirement since it can be forbidden by law to have decision-making processes use certain attributes as inputs. Clustering can also be viewed as a facility location problem [26], where multiple facilities (centers) are to be located in a metric space; for a recent overview, see the survey by Chan et al. [13].

2.2 Comparison with multi-winner voting

The clustering problem especially has natural connections with multi-winner voting in which agents and centers in the clustering context correspond to agents and candidates in multi-winner voting.

The literature on proportionally representative multi-winner voting includes axioms such as JR, PJR, EJR, and FJR for dichotomous preferences [6, 32], axioms such as PSC for rankings [20], and axioms such as Generalised PSC [6, 7] or Rank-PJR+[11] for weak orders. Naively applying these axioms to our spatial problem results in a reduction from our unconstrained problem to a multiwinner voting problem with an infinite number of candidates. Conversely, our fairness (or proportional representation) axioms take into account the additional distance-based cardinal information available when the candidates belong to a general metric space. Hence, our axioms formalize a notion of proportional representation for $m$ -dimensional Euclidean space for $m>2$ . In contrast to ordinal algorithms for multi-winner voting (such as the Expanding Approvals Rule (EAR) [7] or Single Transferable Vote), the algorithms in this paper carefully account for additional spatial information to make synchronized decisions.

Kalayci et al. [29] consider a similar setting to ours but explore to what extent algorithms that are agnostic to cardinal spatial information and only use ordinal ranking of pairwise distances can achieve approximate fairness. They define $(\alpha,\gamma)$ -proportional representation, a powerful strengthening of approximate core fairness [33]. Their notion captures (and strengthens) the ideal of core stability in this setting: no coalition of agents $S$ large enough to “deserve” $t$ centroids should be able to propose an alternative set of $t$ centroids which they all prefer to their individually best size- $t$ subset of the selected centroids. They then parameterize this idea by $\alpha$ , the required size of a coalition which deserves $t$ centroids, and $\gamma$ , the degree to which the alternative centroid selection better represents the coalition. Similar to ours, their property effectively addresses the shortfalls of existing concepts with respect to proportional representation (see Definition 2). In contrast to theirs, the properties we study do not require parameterization and are guaranteed to exist. Kalayci et al. [29] show that the (ordinal multi-winner) Expanding Approvals Rule (EAR) of Aziz and Lee [7] is $(\alpha,\gamma)$ -representative with $\gamma\leq 1+6.71\alpha/(\alpha-1)$ and also satisfies $5.71$ -approximation to proportional fairness. They are able to improve on these bounds with a modified version of Greedy Capture when cardinal information is known.

Proportionally representation in multi-winner voting (and more generally participatory budgeting) has also been studied when agents have cardinal utilities. Based on prior work for approvals [6], Peters et al. [38] study desirable axioms such as FJR and EJR for positive, additive utilities, and rules such as the Method of Equal Shares (MES). We make no assumption of additive utilities, nor do we ever consider total or average distances. Under additive cardinal utilities, a group of agents of proportion large enough to “deserve” multiple candidates can (in some instances) be satisfied with respect to EJR by a single candidate which gives some agent very high utility. Because our model and axioms take an agnostic approach to cardinal utility values, we instead place requirements on the selection of a proportional number of candidates. One possible approach to our problem is to take distances as proxy for utilities and to apply a rule like MES. However, translating to cardinal utilities necessitates the loss of important spatial information as different cardinal utilities consistent with the distances give different MES outcomes. One possibility is to set utility to be negation of the distance. But the setting and axioms of Peters et al. assume positive, additive utilities.

3 Towards an Axiom Capturing Proportional Representation

Proportional fairness for clustering has been considered in a series of recent papers. Chen et al. [14] first proposed a proportional fairness axiom that requires that there is no set of agents of size at least $n/k$ that can find an unselected center that is “better” (i.e., located closer than the closest selected center) for each of them.

Definition 1 (Proportional Fairness [14]).

$X\subseteq{\mathscr{M}}$ with $|X|=k$ satisfies proportional fairness if $\forall S\subseteq{\mathscr{N}}$ with $|S|\geq\lceil\frac{n}{k}\rceil$ and for all $c\in{\mathscr{M}}$ , there exists $i\in S$ with $d(i,c)\geq d(i,X)$ .

The idea of proportional fairness was further strengthened to core fairness that requires that there is no set of agents of size at least $n/k$ that can demand a more preferred location for a center [33]. One rationale for proportional fairness is explained in the following example, which was first presented by Micha and Shah [36] and then reused by Li et al. [33]. We reuse the same example. Consider a set of data points/agents. The agents are divided into 11 subsets of clusters each of which is densely clustered. One cluster of agents has size 10,000. The other 10 clusters have sizes 100 each. The big cluster is very far from the smaller clusters. The small clusters are relatively near to each other. Micha and Shah [36] and Li et al. [33] are of the view that any set of centers that satisfies a reasonable notion of proportional fairness would place 10 centers for the big cluster and 1 center serving the small clusters. Next, we point out that proportional fairness and core fairness do not necessarily satisfy this minimal requirement, which we formalize via the unanimous proportionality axiom below.

Definition 2 (Unanimous Proportionality (UP)).

$X\subseteq{\mathscr{M}}$ with $|X|=k$ satisfies unanimous proportionality if $\forall S\subseteq{\mathscr{N}}$ with $|S|\geq\ell\lceil\frac{n}{k}\rceil$ and each agent in $S$ has the same location $x\in\mathscr{X}$ , then $\ell$ possible locations closest to $x$ are selected.

UP captures a principle which Chen et al. [14] highlighted as desirable — in their words, one which requires that “any subset of at least $r\lceil n/k\rceil$ individuals is entitled to choose $r$ centers.”³³3In the term unanimous proportionality, unanimous refers to the condition where a set of agents have the same location, and proportionality refers to the requirements that such agents get an appropriate number of centers in proportion to the number of agents. The following example shows that proportional fairness [14] does not imply Unanimous Proportionality. Similarly, it can be shown that the fairness concept proposed by Jung et al. [28] does not imply unanimous fairness.

Example 1 (Proportional fairness and core fairness).

Suppose agents are located in the $[0,1]$ interval and $k=11$ . 10,000 agents are at point 0 and 1000 agents are at point 1. UP requires that 10 centers are at or just around point 0 and 1 center is at point 1. However, placing 1 center at point 0 and 10 centers at point 1 satisfies the proportional fairness concept of Chen et al. [14] and also the core fairness concept of Li et al. [33].

The example above shows that there is a need to formalize new concepts in order to capture proportional representative fairness. The property should imply unanimous proportionality but should also be meaningful if no two agents/data points completely coincide in their location. One reason why the existing fairness concepts do not capture UP is that they make an implicit assumption that an agent only cares about the distance from the nearest center and not on how many centers are nearby. As discussed in the introduction, this assumption is not necessarily accurate in contexts such as data abstraction, sortition, and political representation.

In addition to capturing proportional representation in a robust sense, we would also like our axiom (and algorithms) to guarantee the existing proportional fairness concepts, at least in an approximate fashion. Since a PF outcome may not exist, Chen et al. [14] and Micha and Shah [36] studied an approximate notion of proportional fairness in the discrete and unconstrained settings, respectively.

Definition 3 ( $\rho$ -approximate Proportional Fairness).

$X\subseteq{\mathscr{M}}$ with $|X|=k$ satisfies $\rho$ -approximate PF (or $\rho$ -PF) if $\forall S\subseteq N$ with $|S|\geq\lceil\frac{n}{k}\rceil$ and for all $c\in{\mathscr{M}}$ , there exists $i\in S$ with $\rho\cdot d(i,c)\geq d(i,X)$ .

We seek to devise an axiom which is compatible with approximate PF, and subsequently to design algorithms which satisfy our axiom while also providing good approximate guarantees with respect to PF. We close the section with an example presented by Micha and Shah [36] that shows that a PF outcome may not exist. We will use the same example to illustrate our new fairness concept in the next section.

Example 2.

Consider Figure 1. For the 6 agents and $k=3$ , it has been shown by Micha and Shah [36] that a PF outcome may not exist.

Figure 1: An example instance with 6 agents and

k=3

for which a PF outcome does not exist.

4 Fairness for the Unconstrained Setting

We propose a new concept Proportionally Representative Fairness (PRF) that captures proportional representation fairness concerns. The intuition behind the concept is based on two natural principles: (1) a set of agents that is large enough deserves to have a proportional number of centers ‘near’ it, and (2) the requirement of proximity of the nearby centers to a subset of agents should also depend on how densely together the particular subset of agents are. We use these two principles to propose a new fairness concept.

Definition 4 (Proportionally Representative Fairness (PRF) for Unconstrained Clustering).

An outcome $X\subseteq{\mathscr{M}}$ with $|X|=k$ satisfies Proportionally Representative Fairness (PRF) if the following holds. For any set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ such that the maximum distance between pairs of agents in $S$ is $y$ , the following holds: $|c\in X\mathbin{:}\exists i\in S\text{ s.t }d(i,c)\leq y|\geq\ell$ .

PRF has the following features: it implies unanimous proportionality; it is oblivious to and does not depend on the specification of protected groups or sensitive attributes; it is robust to outliers in the data, since the fairness requirements are for groups of points that are sizable enough; and it is scale invariant to multiplication of distances (i.e., PRF outcomes are preserved if all the distances are multiplied by a constant).

Next, let us illustrate an example where proportional fairness is not achievable but one of the most natural clustering outcomes satisfies PRF.

Example 3 (Requirements of PRF).

We revisit Example 2 to illustrate the requirements that PRF imposes. For each set $S$ of size at least $\ell\lceil\frac{n}{k}\rceil=2\ell$ , PRF requires $\ell$ centers in the relevant neighborhood of agents in $S$ (see Figure 2). Therefore, for agent 1 and 2, PRF requires that one selected candidate center should be in one of the circles around 1 or 2. For the set of agents $\{1,2,3\}$ , PRF also requires that one candidate center should be in one of the circles around 1, 2 or 3. A natural solution is to have one center located between agents 1, 2, and 3, another center located between agents 4, 5, and 6, and the final center located somewhere not too far from all agents (e.g., between the two groups of agents). Such solutions are intuitively reasonable and fair and, furthermore, satisfy PRF; however, as mentioned in Example 2, this instance does not admit a proportionally fair solution.

Figure 2: Some of the requirements of PRF for instance in Example 2.

In addition to capturing natural outcomes when PF outcomes may not exist, PRF actually guarantees a $\rho$ -PF outcome for $\rho\approx 3.6$ in the unconstrained setting on general metric spaces.

Proposition 1.

For metric spaces, if an outcome $X\subseteq{\mathscr{M}}$ with $|X|=k$ satisfies PRF for Unconstrained Clustering, then $X$ is $\frac{3+\sqrt{17}}{2}$ -approximate PF, and there exists an instance for which this bound is tight.

Proof.

Let $X$ be an outcome which satisfies PRF for Unconstrained Clustering. Fix $S\subseteq N,c\in{\mathscr{M}}$ with $|S|\geq\lceil\frac{n}{k}\rceil$ .

Let $y$ be the maximum distance between pairs of agents in $S$ , i.e., $y=\max_{i,j\in S}d(i,j)$ , and let $i_{1}$ and $i_{2}$ be two agents in $S$ that maximize this distance. We assume that $y>0$ ; otherwise, if $y=0$ , then every agent in $S$ is located at the same point, and there exists $x\in X$ such that $d(i,x)=0\leq d(i,c)$ for all $i\in S$ . Denote by $i^{*}$ the agent in $S$ for which the distance to $c$ is maximized. We have that $d(i^{*},c)\geq\max(d(i_{1},c),d(i_{2},c))\geq\frac{1}{2}(d(i_{1},c)+d(i_{2},c))\geq\frac{y}{2}$ .

Next, since $|S|\geq\frac{n}{k}$ and every pair of agents in $S$ are a distance of at most $y$ from each other, it follows from $X$ satisfying Unconstrained PRF that there exist $x\in X,i\in S$ such that $d(i,x)\leq y$ . Using this along with the triangle inequality, we can establish the following upper bound: $d(i^{*},x)\leq d(i^{*},c)+d(c,x)\leq d(i^{*},c)+d(i,c)+y$ . From these observations, the minimum PF approximation ratio obtained by $i$ or $i^{*}$ becomes

	$\displaystyle\min\left(\frac{d(i,x)}{d(i,c)},\frac{d(i^{},x)}{d(i^{},c)}\right)$
	$\displaystyle\leq\min\left(\frac{y}{d(i,c)},\frac{d(i^{},c)+d(i,c)+y}{d(i^{},c)}\right)$
	$\displaystyle\leq\min\left(\frac{y}{d(i,c)},3+\frac{2\cdot d(i,c)}{y}\right)$
	$\displaystyle\leq\max_{z\geq 0}(\min(z,3+2/z))=\frac{3+\sqrt{17}}{2}.$

Note that the final inequality above holds since replacing $\frac{y}{d(i,c)}$ by the non-negative number $z$ , which maximizes the expression, can only weakly increase the expression. It can be easily verified that this expression is maximized when $z=\frac{3+\sqrt{17}}{2}$ , hence the final transition. Thus, for any $c\in{\mathscr{M}}$ , there exists $i\in S$ such that $d(i,X)\leq\frac{3+\sqrt{17}}{2}d(i,c)$ . We defer the proof that this bound is tight to B. ∎

Next, we highlight that the well-known $k$ -means or $k$ -median solutions do not necessarily satisfy PRF. Example 4 is adapted from an example by Chen et al. [14].

Example 4 ( $k$ -means does not satisfy PRF).

Consider Figure 3 in which there are $n/3$ agents uniformly distributed on the perimeter of each of the three circles. If $k=3$ , a $k$ -means or $k$ -median solution will place one center by the small circles and two centers in the big circle. In contrast, PRF requires that each of the circles gets its own centroid. It can also be shown that $k$ -center does not satisfy PRF.

Figure 3: The k-means solution may not satisfy PRF.

On face value, it is not clear whether an outcome satisfying PRF exists as similar concepts such as PF cannot always be guaranteed [36]. Secondly, even if a PRF outcome is guaranteed to exist, the complex constraints seem computationally challenging. PRF is a property that requires conditions on an exponential number of subsets of agents. For each subset of agents, it enforces representational constraints pertaining to an infinite number of neighborhood distances. Perhaps surprisingly, we show that not only is a PRF outcome guaranteed to exist, it can be computed in polynomial time.

The algorithm intuitively works as follows. Firstly, instead of considering infinite possible centers, we restrict our attention to the $n$ candidates centers that coincide with each agent’s location. Each agent is given an initial weight of 1. This weight dynamically decreases over the course of the algorithm. The algorithm can be viewed as gradually expanding the neighborhoods of each of the agents.⁴⁴4The idea of expanding neighborhoods is used in other algorithms in the literature, such as the Greedy Capture algorithm for the unconstrained setting studied by Micha and Shah [36]. Unlike our algorithm, it does not involve reweighting of agents. More importantly, Greedy Capture does not satisfy PRF and is not polynomial time. Our algorithm uses reweighting of agents as is done by EAR for multi-winner voting [7]. Our algorithm can be viewed as using key ideas from the EAR for committee voting while taking into account additional spatial information to make synchronized decisions. Instead of continuously expanding the neighborhood, we only consider at most $n^{2}$ possible values of neighborhood radii. If there exists some location in the candidate set such that it is in the intersection of the neighborhoods of a group of agents with total weight of at least $n/k$ , we select one such candidate, and reduce the agents’ collective weight by $n/k$ .⁵⁵5We point out that the reweighting process can be done in any manner that results in the group’s total weight reducing by $n/k$ . Two natural possibilities would be (i) reducing weights as equally as possible, or (ii) reducing weights proportional to agent distances from the selected candidate. The process is continued until $k$ centers are chosen. The algorithm is formally specified as Algorithm 1. The following is the main result of the present section and summarizes the theoretical guarantees of Algorithm 1.

Theorem 1.

Algorithm 1 terminates in polynomial time $O(n^{4}k^{2})$ and returns a set of $k$ centers which satisfy PRF and 3-approximate PF.

We will now formally prove two key lemmas used to prove Theorem 1.

Algorithm 1 SEAR (Spatial Expanding Approval Rule) for Unconstrained Clustering

Input: metric space

\mathscr{X}

with a distance measure

d:\mathscr{X}\times\mathscr{X}\rightarrow\mathbb{R}^{+}\cup\{0\}

, a finite multiset

{\mathscr{N}}\subseteq\mathscr{X}

n

data points (agents), and positive integer

k

Output: A multiset of

k

centers

w_{i}\longleftarrow 1

for each

i\in{\mathscr{N}}

Consider the set

D=\{d(i,i^{\prime})\mid i,i^{\prime}\in N\}

. Order the entries in

D

d_{1}\leq d_{2}\leq\cdots d_{|D|}

{\mathscr{M}}\longleftarrow{\mathscr{N}}

j\longleftarrow 1

W\longleftarrow\emptyset

while

|W|<k

C^{*}=\{c\in{\mathscr{M}}\mid\sum_{i\in{\mathscr{N}}\mid d(i,c)\leq d_{j}}w_{i}\geq n/k\}

C^{*}=\emptyset

then

j\longleftarrow j+1

else

Select some candidate

c^{*}

from

C^{*}

such that

c^{*}=\arg\max_{c^{\prime}\in C^{*}}\sum_{i\in{\mathscr{N}}\mid d(i,c^{\prime})\leq d_{j}}w_{i}

W\longleftarrow W\cup\{c^{*}\}

;

{\mathscr{M}}\longleftarrow{\mathscr{M}}\setminus\{c^{*}\}

N^{\prime}\longleftarrow\{i\in{\mathscr{N}}\,:\,d(i,c^{*})\leq d_{j}\}

Modify the weights of agents in

N^{\prime}

so the total weight of agents in

N^{\prime}

, i.e.,

\sum_{i\in N^{\prime}}w_{i}

, decreases by exactly

n/k

Return

W

Lemma 1.

Algorithm 1 returns an outcome that satisfies PRF.

Proof.

Suppose for contradiction that the algorithm finds a set of $k$ centers $W$ that violates PRF. In that case, there exist a set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ such that the maximum distance between pairs of agents in $S$ is $y$ but the number of locations that are within $y$ of some agent in $S$ is at most $\ell-1$ . In that case, consider the snapshot of the algorithm when neighborhood distance $y=\max_{i,j\in S}d(i,j)$ is considered. At this point, $|c\in X\mathbin{:}\exists i\in S\text{ s.t }d(i,c)\leq y|\leq\ell-1$ . Since agents in ${\mathscr{N}}$ have only used their weight towards selecting locations within $y$ up to this point, it follows that $\sum_{i\in S}w_{i}\geq\ell\frac{n}{k}-(\ell-1)\frac{n}{k}=\frac{n}{k}$ . Therefore, the agents in $S$ still have total weight at least $\frac{n}{k}$ to select one more location that is a distance of at most $y$ from some agent in $S$ . Hence, at this stage, when the neighborhood distance is $y$ , the algorithm would have selected at least one more center within distance $y$ than in the outcome $W$ . ∎

Lemma 1 implies guaranteed existence of an outcome satisfying our axiom, which was one of our goals in devising PRF. In conjunction with Proposition 1, the lemma also gives us that the outcome returned by Algorithm 1 is $\frac{1}{2}(3+\sqrt{17})$ -PF. This is noteworthy as it establishes Algorithm 1 as the first known polynomial-time approximation algorithm for PF on general metric spaces in the unconstrained setting. As we will show next, Algorithm 1 actually provides the stronger guarantee of $3$ -PF.

Lemma 2.

Algorithm 1 finds an outcome that satisfies $3$ -approximate PF.

Proof.

For an instance of unconstrained clustering, let $X$ be the outcome returned by Algorithm 1. Fix $S\subseteq N,c\in{\mathscr{M}}$ with $|S|\geq\lceil\frac{n}{k}\rceil$ . We will show that there exists an agent $i\in S$ and center $x\in X$ such that $d(i,x)\leq 3d(i,c)$ . We assume that $S\cap X=\emptyset$ , since otherwise $d(i,x)=0$ for some agent and center, and we are done.

At the start of the algorithm, the collective weight of agents in $S$ is $|S|$ . When the algorithm terminates, the collective weight of $S$ is zero. Thus, in some iteration, the collective weight of $S$ decreases for the first time. We denote the centroid selected in this iteration by $x^{*}$ and denote an arbitrary agent whose weight is decreased in this iteration by $i^{*}$ . Let $y$ be the maximum distance between $i^{*}$ and any other agent in $S$ .

We first point out that $y\geq d(i^{*},x^{*})$ . To see this, suppose for contradiction $y<d(i^{*},x^{*})$ . Since $y$ and $d(i^{*},x^{*})$ are both distances between pairs of agents in $N$ , $y=d_{j}$ and $d(i^{*},x^{*})=d_{j^{\prime}}$ for some $j$ and $j^{\prime}$ . Furthermore, $j<j^{\prime}$ , since $D$ is in increasing order and by assumption that $y<d(i^{*},x^{*})$ . Thus, when the algorithm considers neighborhoods of $d_{j}=y$ , the collective weight of $S$ has not yet decreased, and we have that

\sum_{i\in{\mathscr{N}}|d(i,i^{*})\leq y}w_{i}\geq|S|\geq\lceil\frac{n}{k}\rceil\geq\frac{n}{k}.

Therefore, before transitioning to the next element in $D$ , the algorithm will either add $i^{*}$ to $X$ or the weight of some agent in $S$ will decrease. The first possibility contradicts $S\cap X=\emptyset$ . For the second possibility, by assumption, it must be that $x^{*}$ was added to $X$ . But, since $d(i^{*},x^{*})>y$ , $i^{*}\notin N^{\prime}$ and $w_{i^{*}}$ does not decrease, which gives us a contradiction.

We divide the remainder of the argument into cases. First, consider the case in which $d(i^{*},c)\geq\frac{y}{3}.$ It follows that $d(i^{*},x^{*})\leq y\leq 3d(i^{*},c).$ If, instead, $d(i^{*},c)<\frac{y}{3}$ , then consider the agent $j\in S\setminus\{i\}$ such that $d(i^{*},j)=y$ . By the triangle inequality, we have that

	$\displaystyle d(j,c)$	$\displaystyle\geq y-d(i^{*},c)\geq\frac{2y}{3}$
		$\displaystyle\geq\frac{1}{3}(d(j,i^{})+d(i^{},x^{}))\geq\frac{1}{3}d(j,x^{}).$

∎

5 Fairness for the Discrete Setting

We presented PRF as a desirable concept for the unconstrained setting. It is not immediate how to adapt the concept to the discrete setting. Applying the unconstrained PRF definition to the discrete setting leads to the issues of non-existence. On the other hand, some natural attempts to account for the coarseness of the candidate space, ${\mathscr{M}}$ , can lead to a concept that is too weak. For example, one alternative is to require that an outcome $X$ be such that if, for any set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ such that the maximum distance between pairs of agents in $S$ is $y$ , the following holds: $|c\in X\mathbin{:}\exists i\in S\text{ s.t }d(i,c)\leq y|\geq\min\{\ell,|\cup_{j\in S}B_{y}(j)\cap{\mathscr{M}}|\}$ , where $B_{r}(i)$ denote a ball of radius $r$ around agent $i$ . Alternatively, we could replace the right-hand side of the final inequality with $\min\{\ell,|\cap_{i\in S}B_{y}(i)\cap{\mathscr{M}}|\}$ . Both of these versions are too weak. To see this, suppose $k=1$ , all agents are located at $0$ , and ${\mathscr{M}}=(\ldots,-2,-1,1,2,3,\ldots)$ , then neither version places any restriction on the facility location and do not even imply UP.

To resolve these issues, we need to take into account that the nearest candidate locations may be very far from a subset of agents. We adapt our PRF concept for the discrete setting via a careful adaptation of the PRF concept for the unconstrained setting.

Definition 5 (Proportionally Representative Fairness (PRF) for Discrete Clustering).

For any set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ , if there are $\ell^{\prime}\leq\ell$ candidates from ${\mathscr{M}}$ that are at distance at most $y$ from all agents in $S$ , the following holds: $|c\in X\mathbin{:}\exists i\in S\text{ s.t }d(i,c)\leq y|\geq\ell^{\prime}$ .

We begin by highlighting the fact that, in the discrete setting, PRF implies $(1+\sqrt{2})$ -approximate PF, the closest PF approximation known to exist in the discrete setting [14]. Our proof follows similarly to Theorem 1 from Chen et al. [14].

Proposition 2.

For metric spaces, if an outcome $X\subseteq{\mathscr{M}}$ with $|X|=k$ satisfies PRF for Discrete Clustering, then $X$ is $(1+\sqrt{2})$ -approximate PF.

For the discrete setting, we propose a new algorithm (Algorithm 2). Algorithm 2 is similar to Algorithm 1, which was designed for the unconstrained setting. In fact, Algorithm 1 can be viewed as first setting ${\mathscr{M}}$ to the multiset of candidate locations corresponding to agents in ${\mathscr{N}}$ and then running Algorithm 2. Similar to Algorithm 1, Algorithm 2 terminates in polynomial time and returns $k$ centers and its output satisfies PRF.

Algorithm 2 SEAR (Spatial Expanding Approval Rule) for Discrete Clustering

Input: metric space

\mathscr{X}

with a distance measure

d:\mathscr{X}\times\mathscr{X}\rightarrow\mathbb{R}^{+}\cup\{0\}

, a finite multiset

{\mathscr{N}}\subseteq\mathscr{X}

n

data points (agents), a finite set of candidate locations

{\mathscr{M}}

, and positive integer

k

Output: A multiset of

k

centers

w_{i}\longleftarrow 1

for each

i\in{\mathscr{N}}

Consider the set

D=\{d(i,c)\mid i\in{\mathscr{N}},c\in{\mathscr{M}}\}

. Order the entries in

D

d_{1}\leq d_{2}\leq\cdots\leq d_{|D|}

j\longleftarrow 1

W\longleftarrow\emptyset

while

|W|<k

C^{*}=\{c\in{\mathscr{M}}\mid\sum_{i\in{\mathscr{N}}\mid d(i,c)\leq d_{j}}w_{i}\geq n/k\}

C^{*}=\emptyset

then

j\longleftarrow j+1

else

Select some candidate

c^{*}

from

C^{*}

such that

c^{*}=\arg\max_{c^{\prime}\in C^{*}}\sum_{i\in{\mathscr{N}}\mid d(i,c^{\prime})\leq d_{j}}w_{i}

W\longleftarrow W\cup\{c^{*}\}

;

{\mathscr{M}}\longleftarrow{\mathscr{M}}\setminus\{c^{*}\}

N^{\prime}\longleftarrow\{i\in{\mathscr{N}}\,:\,d(i,c^{*})\leq d_{j}\}

Modify the weights of agents in

N^{\prime}

so the total weight of agents in

N^{\prime}

, i.e.,

\sum_{i\in N^{\prime}}w_{i}

, decreases by exactly

n/k

Return

W

Theorem 2.

Algorithm 1 terminates in polynomial time $O({|{\mathscr{M}}|}^{2}n^{2})$ and returns a set of $k$ centers which satisfy PRF and $(1+\sqrt{2})$ -approximate PF.

Notably, by satisfying PRF, Algorithm 2 matches the best known PF approximation factor in the discrete setting. As the following remark highlights, however, the existing algorithms do not capture PRF.

Remark 1.

Chen et al. [14] present Greedy Capture. An equivalent algorithm, called ALG_g, is also presented by Li et al. [33]. However, these algorithms do not satisfy PRF. In fact, for given $k$ , these algorithms may fail to output $k$ candidate locations.⁶⁶6To see this, let $k=3$ and let $\boldsymbol{x}=(0,0,1)$ . Chen et al.’s Greedy Capture and Li et al.’s ALG_g will select candidate locations 0 and 1, but will not output a third location. Of course, this issue could be rectified by arbitrarily choosing a third candidate location—but then there is no guarantee that the set of locations would satisfy PRF.

In F, we experimentally compare Algorithm 2 to ALG_g and a variant of k-means. We show that it performs especially well when the mean squared distance of the agents is calculated based on their distance from not only the closest center but also the next $j$ -closest centers (for large $j$ ). This provides another perspective indicating that, in situations where a “fair” clustering should account for an agent’s distance from more than just her closest center, our algorithm is better suited than the existing alternatives. In C, we consider two natural and stronger notions of PRF in the discrete setting (that also apply to the continuous setting). We show that the versions that we consider do not guarantee the existence of a fair outcome.

6 Discussion

We revisited a classical AI problem with a novel solution concept that builds on ideas from social choice and fair division. We proposed a natural fairness concept called PRF for clustering. PRF is guaranteed to exist and has desirable features: it implies unanimous proportionality, does not require the specification of protected groups, and is robust to outliers and multiplicative scaling. Even if our only focus is on the PF axiom, we make significant progress on efficient algorithms for approximate PF for general metric spaces. We did not focus on strategic aspects. Unfortunately, PRF and PF are incompatible in general with the incentive of an agent to report their location truthfully (see E). A natural next step is to identify stronger versions of PRF that still guarantee existence. It will be interesting to consider refinements of our algorithms which make more clever choices when selecting candidates with sufficient weighted support. Another direction for future research is to understand how much impact the constraint of PRF imposes on standard optimization objectives such as for $k$ -means, $k$ -median, or $k$ -center.

Acknowledgments

Haris Aziz is supported by the NSF-CSIRO project on “Fair Sequential Collective Decision-Making”. We thank John Dickerson and Bo Li for helpful comments.

References

Abbasi et al. [2021] Abbasi, M., Bhaskara, A., Venkatasubramanian, S.: Fair clustering via equitable group representations. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 504–514, FAccT ’21, Association for Computing Machinery, New York, NY, USA (2021)
Ahmadian et al. [2019] Ahmadian, S., Epasto, A., Kumar, R., Mahdian, M.: Clustering without over-representation. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 267–275, ACM (2019)
AI-Committee [2019] AI-Committee: The national artificial intelligence research and development strategic plan: 2019 update. Tech. rep., National Science and Technology Council, Washington, D.C. (2019)
Aloise et al. [2009] Aloise, D., Deshpande, A., Hansen, P., Popat, P.: Np-hardness of euclidean sum-of-squares clustering. Machine Learning 75(2), 245–248 (2009)
Arthur and Vassilvitskii [2007] Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pp. 1027–1035, SIAM (2007)
Aziz et al. [2017] Aziz, H., Brill, M., Conitzer, V., Elkind, E., Freeman, R., Walsh, T.: Justified representation in approval-based committee voting. Social Choice and Welfare 48(2), 461–485 (2017)
Aziz and Lee [2020] Aziz, H., Lee, B.: The expanding approvals rule: Improving proportional representation and monotonicity. Social Choice and Welfare 54(1), 1–45 (2020)
Aziz et al. [2023] Aziz, H., Lee, B.E., Chu, S.M., Vollen, J.: Proportionally representative clustering. arXiv preprint arXiv:2304.13917 (2023)
Backurs et al. [2019] Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., Wagner, T.: Scalable fair clustering. In: Proceedings of the 36th International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research, vol. 97, pp. 405–413, PMLR (2019)
Bera et al. [2019] Bera, S.K., Chakrabarty, D., Flores, N., Negahbani, M.: Fair algorithms for clustering. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019,, pp. 4955–4966 (2019)
Brill and Peters [2023] Brill, M., Peters, J.: Robust and verifiable proportionality axioms for multiwinner voting. In: EC ’23: The 24th ACM Conference on Economics and Computation, Boulder, CO, USA, July 11 - 15, 2022, ACM (2023)
Chakrabarty and Negahbani [2021] Chakrabarty, D., Negahbani, M.: Better algorithms for individually fair k-clustering. CoRR abs/2106.12150 (2021)
Chan et al. [2021] Chan, H., Filos-Ratsikas, A., Li, B., Li, M., Wang, C.: Mechanism design for facility location problems: A survey. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI) (2021)
Chen et al. [2019] Chen, X., Fain, B., Lyu, L., Munagala, K.: Proportionally fair clustering. In: Proceedings of the 36th International Conference on Machine Learning, ICML, Proceedings of Machine Learning Research, vol. 97, pp. 1032–1041, PMLR (2019)
Chierichetti et al. [2017] Chierichetti, F., Kumar, R., Lattanzi, S., Vassilvitskii, S.: Fair clustering through fairlets. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 5029–5037 (2017)
Chiplunkar et al. [2020] Chiplunkar, A., Kale, S., Ramamoorthy, S.N.: How to solve fair k-center in massive data models. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, vol. 119, pp. 1877–1886, PMLR (2020)
Chouldechova and Roth [2020] Chouldechova, A., Roth, A.: A snapshot of the frontiers of fairness in machine learning. Communications of the ACM 63(5), 82–89 (2020)
Congressional-Research-Service [2021] Congressional-Research-Service: Artificial intelligence: Background, selected issues, and policy considerations. Tech. rep., Congressional Research Service Reports (2021)
Dua and Graff [2017] Dua, D., Graff, C.: UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences (2017), URL http://archive.ics.uci.edu/ml
Dummett [1984] Dummett, M.: Voting Procedures. Oxford University Press (1984)
Dwork et al. [2012] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012, pp. 214–226, ACM (2012)
Ebadian and Micha [2024] Ebadian, S., Micha, E.: Boosting sortition via proportional representation. arXiv preprint arXiv:2406.00913 (2024)
Esmaeili et al. [2021] Esmaeili, S.A., Brubach, B., Srinivasan, A., Dickerson, J.P.: Fair clustering under a bounded cost. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
European Commission [2020] European Commission: White paper on artificial intelligence: a european approach to excellence and trust. White Paper COM(2020) 65 final, European Commission, Brussels (Feb 2020)
Everitt et al. [2009] Everitt, B.S., Landau, S., Leese, M.: Cluster Analysis. Wiley Publishing, 4th edn. (2009)
Farahani and Hekmatfar [2009] Farahani, R.Z., Hekmatfar, M.: Facility location: concepts, models, algorithms and case studies. Springer Science & Business Media (2009)
Friedman and Nissenbaum [1996] Friedman, B., Nissenbaum, H.: Bias in computer systems. ACM Trans. Inf. Syst. 14(3), 330–347 (Jul 1996)
Jung et al. [2020] Jung, C., Kannan, S., Lutz, N.: Service in your neighborhood: Fairness in center location. In: 1st Symposium on Foundations of Responsible Computing, FORC 2020, June 1-3, 2020, Harvard University, Cambridge, MA, USA (virtual conference), LIPIcs, vol. 156, pp. 5:1–5:15 (2020)
Kalayci et al. [2024] Kalayci, Y.H., Kempe, D., Kher, V.: Proportional representation in metric spaces and low-distortion committee selection. In: Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, pp. 9815–9823, AAAI Press (2024)
Kellerhals and Peters [2023] Kellerhals, L., Peters, J.: Proportional fairness in clustering: A social choice perspective. CoRR abs/2310.18162 (2023)
Kleindessner et al. [2019] Kleindessner, M., Samadi, S., Awasthi, P., Morgenstern, J.: Guarantees for spectral clustering with fairness constraints. In: Proceedings of the 36th International Conference on Machine Learning (ICML), vol. 97, pp. 3458–3467, PMLR (2019)
Lackner and Skowron [2023] Lackner, M., Skowron, P.: Multi-Winner Voting with Approval Preferences. Springer Briefs in Intelligent Systems, Springer (2023)
Li et al. [2021] Li, B., Li, L., Sun, A., Wang, C., Wang, Y.: Approximate group fairness for clustering. In: Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 139, pp. 6381–6391, PMLR (2021)
Megiddo and Supowit [1984] Megiddo, N., Supowit, K.J.: On the complexity of some common geometric location problems. SIAM Journal on Computing 13(1), 182–196 (1984)
Mehrabi et al. [2021] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Computing Surveys 54(6) (2021)
Micha and Shah [2020] Micha, E., Shah, N.: Proportionally fair clustering revisited. In: 47th International Colloquium on Automata, Languages, and Programming, ICALP 2020, vol. 168, pp. 85:1–85:16 (2020)
O’Neil [2016] O’Neil, C.: Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, USA (2016)
Peters et al. [2021] Peters, D., Pierczyński, G., Skowron, P.: Proportional participatory budgeting with additive utilities. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), pp. 12726–12737 (2021)
Peters et al. [1992] Peters, H., van der Stel, H., Storcken, T.: Pareto optimality, anonymity, and strategy-proofness in location problems. International Journal of Game Theory 21(3), 221–235 (1992)
Procaccia [2008] Procaccia, A.D.: Towards a theory of incentives in machine learning. SIGecom Exchanges 7(2) (2008)
Procaccia and Tennenholtz [2013] Procaccia, A.D., Tennenholtz, M.: Approximate mechanism design without money. ACM Transactions on Economics and Computation 1(4) (2013)
Vakilian and Yalçiner [2021] Vakilian, A., Yalçiner, M.: Improved approximation algorithms for individually fair clustering. CoRR abs/2106.14043 (2021)

Appendix A Omitted proofs

Proof of Theorem 1.

We prove by induction that if the number of candidates selected is less than $k$ , then there is a way to select at least one more center. If the number of candidates selected is less than $k$ , there is still aggregate weight of at least $n/k$ on the set of all agents. These agents can get a candidate selected from their neighborhoods if the neighborhood radius is large enough. In particular, for $d_{|D|}=\max_{i,j\in N}d(i,j)$ , some unselected candidate from $M$ can be selected.

Note that for a given radius $d_{j}$ , at most $k$ candidates can be selected. If no more candidates can be selected for a given $d_{j}$ , the next radius $d_{j+1}$ is considered. There are $O(n^{2})$ different distances that are to be considered. Next, we bound the running time.

There are $(n^{2})$ different distances that are to be considered. These distances are sorted in $O(n^{2}\log(n^{2}))$ time. For each distance that we consider, we check for each of the $kn$ locations whether they get sufficient support of $n/k$ . It takes time $kn^{2}$ to check if some location has support $n/k$ for the current neighborhood distance. If no location has enough support, we move to the next distance. If some location has enough support, we need to select one less location. Therefore, we need to check whether some location has enough support at most $\max(k,n^{2}k)$ times. Therefore the running time is $O(n^{2}\log(n^{2}))+O((kn^{2})(n^{2}k))=O(n^{4}k^{2})$ .

The fact that the algorithm gives PRF and 3-PF follows immediately from Lemma 1 and Lemma 2. ∎

Proof of Proposition 2.

Let $X$ be an outcome which satisfies PRF for Discrete Clustering. Fix $S\subseteq N,c\in{\mathscr{M}}$ with $|S|\geq\lceil\frac{n}{k}\rceil$ . Let $y=\max_{i\in S}d(i,c)$ , the maximum distance any agent in $S$ is from $c$ , and denote this agent $i^{*}$ . We assume that $y>0$ ; otherwise, if $y=0$ , then every agent in $S$ is located at $c$ , $c\in X$ (by PRF), and $d(i,c)=d(i,X)$ for all $i\in S$ .

Since $|S|\geq\frac{n}{k}$ and every agent in $S$ is a distance of at most $y$ from $c$ , it follows from $X$ satisfying PRF that there exist $x\in X,i\in S$ such that $d(i,x)\leq y$ . By the triangle inequality, $d(c,x)\leq d(i,c)+d(i,x)$ and thus $d(i^{*},x)\leq y+d(i,c)+d(i,x)$ . From these observations, the minimum PF approximation ratio obtained by $i$ or $i^{*}$ becomes

	$\displaystyle\min\left(\frac{d(i,x)}{d(i,c)},\frac{d(i^{},x)}{d(i^{},c)}\right)$
	$\displaystyle\leq\min\left(\frac{y}{d(i,c)},\frac{y+d(i,c)+d(i,x)}{y}\right)$
	$\displaystyle\leq\min\left(\frac{y}{d(i,c)},2+\frac{d(i,c)}{y}\right)$
	$\displaystyle\leq\max_{z\geq 0}(\min(z,2+1/z))=1+\sqrt{2}.$

Note that the final inequality above holds since replacing $\frac{y}{d(i,c)}$ by the non-negative number $z$ , which maximizes the expression, can only weakly increase the expression. It can be easily verified that this expression is maximized when $z=1+\sqrt{2}$ , hence the final transition. Thus, for any $c\in{\mathscr{M}}$ , there exists $i\in S$ such that $d(i,X)\leq(1+\sqrt{2})d(i,c)$ . ∎

Proof of Theorem 2.

Output size. We prove by induction that if the number of candidates selected is less than $k$ , then then there is a way to select at least one one more center. If the number of candidates selected is less than $k$ , there is still aggregate weight of at least $n/k$ on the set of all agents. These agents can get a candidate selected from their neighborhood if the neighborhood distance is large enough. In particular, for $d_{|D|}=\max_{i\in N,c\in{\mathscr{M}}}d(i,c)$ , some unselected candidate from $M$ can be selected.

Running time. Note that for a given radius $d_{j}$ , at most $k$ candidates can be selected. If no more candidates can be selected for a given $d_{j}$ , the next distance $d_{j+1}$ is considered. Thus,here are $(n\cdot|{\mathscr{M}}|)$ different distances to be considered. These distances are sorted in $O(n\cdot|{\mathscr{M}}|\log(n\cdot|{\mathscr{M}}|))$ time. For each distance that we consider, we check for each of the $|{\mathscr{M}}|$ locations whether they get sufficient support of $n/k$ . It takes time $|{\mathscr{M}}|\cdot n$ to check if some location has support $n/k$ for the current neighborhood distance. If no location has enough support, we move to the next distance. If some location has enough support, we need to select one less location. Therefore, we need to check whether some location has enough support at most $\max(k,n\cdot|{\mathscr{M}}|)$ times. Therefore the running time is $O(n\cdot|{\mathscr{M}}|\log(n\cdot|{\mathscr{M}}|))+O((|{\mathscr{M}}|\cdot n)(k+n\cdot|{\mathscr{M}}|))=O({|{\mathscr{M}}|}^{2}n)$ .

PRF. Suppose for contradiction that the algorithm finds a set of $k$ centers $W$ that violates PRF. In that case, there exist a set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ such that there are at least $\ell^{\prime}\leq\ell$ locations in ${\mathscr{M}}$ that are within distance $y$ of some agent in $S$ , but the number of locations in $W$ that are within $y$ of at least some agent in $S$ is at most $\ell-1$ . In that case, consider the snapshot of the algorithm when neighborhood distance $y$ is considered. At this point, the the number of locations that are within $y$ of at least some agent in $S$ is at most $\ell-1$ . Since agents in ${\mathscr{N}}$ have only used their weight towards selecting locations within $y$ up till this point, it follows that $\sum_{i\in S}w_{i}\geq\ell\frac{n}{k}-(\ell-1)\frac{n}{k}=\frac{n}{k}$ . It follows that the agents in $S$ still have total weight at least $\frac{n}{k}$ to select one more location that is at distance at most $y$ from some agent in $S$ . Hence, at this stage, when the neighborhood distance is $y$ , the algorithm would have selected at least one more center within distance $y$ than in the outcome $W$ . Thus $W$ is not the correct output of the algorithm, a contradiction.

PF approximation. Since, as we just showed, the outcome returned by Algorithm 2 satisfies PRF, it follows from Proposition 2 that the outcome returned by Algorithm 2 satisfies $(1+\sqrt{2})$ -PF. ∎

Appendix B Tight Analysis of PF Approximations

The following example shows the analysis in Proposition 1 is tight. That is, there is an instance for which an outcome satisfying PRF for Unconstrained Clustering obtains no better than $\frac{3+\sqrt{17}}{2}$ -PF.

Example 5.

Consider an instance of unconstrained clustering with $k=1$ and ${\mathscr{N}}=\{i_{1},i_{2},i_{3}\}$ . Furthermore, let $c$ and $x$ be two candidate locations in ${\mathscr{M}}$ such that the distance between agents and candidates is as given in Table 2.

	i	j	l
i	0	$6+2\sqrt{17}$	$7+\sqrt{17}$
j	$6+2\sqrt{17}$	$0$	$7+\sqrt{17}$
l	$7+\sqrt{17}$	$7+\sqrt{17}$	$0$
c	$3+\sqrt{17}$	$3+\sqrt{17}$	$4$
x	$13+3\sqrt{17}$	$13+3\sqrt{17}$	$6+2\sqrt{17}$

Table 2: Distances between agents

\{i,j,l\}

and candidates

\{c,x\}

It can be verified that all distances in Table 2 satisfy the triangle inequality. Note that, since $6+2\sqrt{17}$ is the maximum distance between any pair of agents in ${\mathscr{N}}$ (call this $y$ ), and $d(l,x)=y$ , it holds that $\{x\}$ satisfies PRF for Unconstrained Clustering. Through straightforward arithmetic, it can then be shown that $\frac{d(i,x)}{d(i,c)}=\frac{d(j,x)}{d(j,c)}=\frac{d(l,x)}{d(l,c)}=\frac{3+\sqrt{17}}{2}$ .

Appendix C Strengthening of PRF

We consider two natural and stronger notions of PRF in the discrete setting (that also apply to the continuous setting). We will show that the versions that we consider do not guarantee the existence of a fair outcome in all settings.

The PRF concept can be increasingly strengthened in the following ways by making the requirements on the outcome stronger.

Definition 6 (Proportionally Representative Fairness (PRF)-II for Discrete Clustering).

For any set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ , if there are $\ell^{\prime}\leq\ell$ candidates from ${\mathscr{M}}$ that are at distance at most $y$ from all agents in $S$ , the following holds: there exists some $i\in S$ such that $|c\in X\mathbin{:}d(i,c)\leq y|\geq\ell^{\prime}$ .

Definition 7 (Proportionally Representative Fairness (PRF)-III for Discrete Clustering).

For any set of agents $S\subseteq{\mathscr{N}}$ of size at least $|S|\geq\ell n/k$ , if there are $\ell^{\prime}\leq\ell$ candidates from ${\mathscr{M}}$ that are at distance at most $y$ from all agents in $S$ , the following holds: $|c\in X\mathbin{:}d(i,c)\leq y\text{ for all }i\in S|\geq\ell^{\prime}$ .

The following example shows that an outcome satisfying PRFII and hence PRFIII may not exist.

Example 6.

Take $n=4,k=2$ with the following location profile on a unit interval $[0,1]$ : $\boldsymbol{x}=(0,0,1,1).$ There are 4 candidate locations at ${\mathscr{M}}=(0,0.5,0.5,1)$ . It is immediate that the 2 groups of agents at $0$ and $1$ must have a facility at $0$ and $1$ . But this violates PRF-II. The set of agents $N$ (size 4) have 2 candidate locations with a distance of $0.5$ of all agents; however, no agent in $N$ has 2 facility locations within a distance of 0.5 of them.

Appendix D $k$ -means++

Step 1.: The first centroid is selected randomly.
Step 2.: Calculate the Euclidean distance between the centroid and every other data point in the dataset. The point farthest away will become our next centroid.
Step 3.: Create clusters around these centroids by associating every point with its nearest centroid.
Step 4.: The point which has the farthest distance from its centroid will be our next centroid.
Step 5.: Repeat Steps 3 and 4 until $k$ number of centroids are located.

Appendix E Strategyproofness and Fairness

Most of the work on clustering assumes that the data points are truthfully reported. If data points correspond to individual agents who prefer to have a nearby center, then the issue of incentives also arises (see, e.g., Procaccia [40]). It is easy to see that Algorithm 1 for the unconstrained setting coincides with the mid-point mechanism if there are two agents on a line. Since the mid-point (egalitarian mechanism) is known to be manipulable (see, e.g., Procaccia and Tennenholtz [41]), Algorithm 1 is not strategyproof even for $k=1$ (under which an agent cares about the sole facility to be as close as possible). A discretized version of the same example shows that Algorithm 2 is also not strategyproof.

Instead of understanding the strategic aspects of a particular algorithm, we next examine the tradeoff between fairness and strategyproofness. We focus on the case of $k=1$ for which the preference relation of an agent is clear: an agent wants the location to be as close as possible.

Proposition 3.

Even for $k=1$ and the unconstrained setting, there exists no anonymous, PF, and strategyproof algorithm.

Proof.

For $k=1$ and Euclidean space, an outcome is Pareto optimal if and only is it is in the convex hull of the agent locations [39]. Next, we show that weak Pareto optimality is equivalent to Pareto optimality in this context. We show that if a location $x$ is not Pareto optimal, then it is not weakly Pareto optimal. Since $x$ is not Pareto optimal, it is strictly outside the convex hull $H$ of the agent locations. Let $y$ be the point in $H$ closest to $x$ . Note that $y$ will either be a corner of $H$ or the perpendicular projection of $x$ onto an edge of $H$ . Consider the line $L$ going through $y$ orthogonal to the line through $x$ and $y$ . Assume without loss of generality that $L$ is vertical with $x$ to the right of $L$ . Then all points of $N$ lie on $L$ or to the left of $L$ . Let $i$ be an arbitrary point in $N$ . If $i$ is the same point as $y$ , then $d(i,y)<d(i,y)$ . Now suppose $i$ is not at the same loation as $y$ . Consider the triangle $(i,y,x)$ . Since $x$ - $y$ is orthogonal to $L$ , it follows that the angle at point $y$ for triangle $(i,y,x)$ is at least $90^{\circ}$ . The maximum angle of the other two corners of the triangle $(i,y,x)$ is strictly less than $90^{\circ}$ as the sum of the angles of a triangle is $180^{\circ}$ . There is a well-known triangle inequality law that “if two angles of a triangle are unequal, then the measures of the sides opposite these angles are also unequal, and the longer side is opposite the greater angle.” Hence, it follows that $d(i,y)<d(i,x)$ . We have shown that $d(i,y)<d(i,y)$ for all $i\in N$ . Hence $x$ is not weakly Pareto optimal.

Peters et al. [39] proved that there is no anonymous, strategyproof, and Pareto optimal mechanism for the Euclidean space with 3 dimensions. From our argument above, it follows that there is no anonymous, strategyproof, and weakly Pareto optimal mechanism for the Euclidean space with 3 dimensions. Since PF is equivalent to weak Pareto optimality for $k=1$ , it follows that for $k=1$ , there is no anonymous, PF, and strategyproof algorithm. ∎

Corollary 1.

Even for $k=1$ , there exists no anonymous, core fair, and strategyproof algorithm.

Proposition 4.

Even for $k=2$ and the discrete setting, there exists no PRF strategyproof algorithm even in one dimension.

Proof.

Let $\boldsymbol{x}=(-1,1,1.2)$ and ${\mathscr{M}}=(0,0.9,1.3)$ . Let $k=2$ . PRF (for discrete setting) requires that $X=(0,0.9)$ . This follows because the group of agents $S=N$ can demand 2 centers and there exists 2 centers at most $y=2.2$ for all agents locations—namely, $0$ and $0.9$ . The other groups of agents do not add any further demands (i.e., no more demands that are not already satisfied by considering $S=N$ ). With this outcome, agent $3$ has access to 2 facilities with distances $0.3$ and $1.2$ , respectively.

Now consider a deviation by agent 3 to $x_{3}^{\prime}=1.4$ . Now the group of agents $S=\{2,3\}$ can demand that the candidate center at $1.3$ is chosen—this is because this is the only center that is within $y=0.4$ of both agents 2 and 3. There are 2 possible outcomes $X^{\prime}=(0.9,1.3)$ and $X^{\prime\prime}=(0,1.3)$ . In the first case, agent $3$ (with true location $1.2$ ) has cost $0.1$ and $0.3$ . In the second case, agent $3$ (with true location $1.2$ ) has cost $0.1$ and $1.2$ . In either case, the new outcome component-wise dominants the original outcome (under sincere voting). Hence, PRF is incompatible with SP in discrete setting. ∎

Appendix F Experimental results

	MSD to $1$			MSD to $k/2$			MSD to $k$
	$k$ ++	Alg 1	$ALG_{g}$	$k$ ++	Alg 1	$ALG_{g}$	$k$ ++	Alg 1	$ALG_{g}$
		Difference in MSD	Difference in MSD		Difference in MSD	Difference in MSD		Difference in MSD	Difference in MSD
Dataset	Average MSD	relative to $k$ ++	relative to $k$ ++	Average MSD	relative to $k$ ++	relative to $k$ ++	Average MSD	relative to $k$ ++	relative to $k$ ++
Wholesale	34,977,392.00	664%	112%	11,668,107,198.00	-10%	-12%	100,307,739,343.00	-63%	-55%
HCV	811.60	519%	429%	582,360.00	-4%	-6%	4,858,449.00	-70%	-62%
Buddy	782.20	8%	5%	181,700.00	-8%	-7%	784,360.00	-13%	-9%
Seeds	0.54	23%	8%	228.68	-1%	-1%	1,449.94	-4%	-1%

Table 3: Performance of clustering algorithms with respect to MSD to 1, k/2, and k centroids.

We now apply Algorithm 1 to four real-world datasets. We consider the discrete domain in which ${\mathscr{M}}={\mathscr{N}}$ as most clustering data sets only have the data points. Algorithm 1 is guaranteed to satisfy PRF; therefore, we are interested in analyzing the performance of the algorithm with respect to other objectives. A prominent objective for clustering algorithms is to minimize the Mean Squared Distance (MSD) to the closest center: the average distance between each datapoint and its closest center. We consider this performance measure as the number of centers ranges from $k=1$ to $100$ .

In addition to MSD to the closest center, we analyze two other related measures. MSD to the closest $k/2$ centers and MSD to the closest $k$ centers. Intuitively, MSD to the closest $k/2$ centers is the average distance between each datapoint and its $k/2$ -closest centers (MSD to the closest $k$ centers is similar). These measures capture—to varying extents—the idea that it may be desirable to have datapoints located close to more than just one center; this idea is at the core of the proportional representation concept. To benchmark Algorithm 1’s performance, we also implement two other algorithms: the well-known $k$ -means++ algorithm⁷⁷7I.e., Lloyd’s algorithm for $k$ -means minimization objective with a particular initialization. Given our focus on the discrete domain, we use a $k$ -means++ algorithms that only chooses centers from among the data points. of Arthur and Vassilvitskii [5] and ALG_g of Li et al. [33]. For all of our MSD measures, a smaller value is more desirable.

Datasets. The four datasets that we analyze are from the UCI Machine Learning Repository [19]. We summarize these below.⁸⁸8 The datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0); for details, please refer to https://archive-beta.ics.uci.edu/ml/donation-policy. HCV dataset (contains information about patients with Hepatitis C and blood donors.); Seeds dataset (describes measurements of geometrical properties of kernels belonging to three different varieties of wheat.); Buddy-move dataset (consists of user information from user reviews published in www.holidayiq.com about point of interests in South India); and Wholesale dataset (consists of client information for a wholesale distributor. It includes annual spending on various product categories.)

Results. Table 3 summarizes our results. We illustrate our analysis in Figures 4-9. Across all the datasets, Algorithm 1’s performance relative to $k$ -means improves when moving from MSD to closest center to MSD to closest $k/2$ centers and MSD to closest $k$ centers. A similar pattern is observed for $ALG_{g}$ , which may be expected and is reassuring since—like Algorithm 1— $ALG_{g}$ is motivated by the idea of proportional fairness. For two of the datasets (Wholesale and HCV), Algorithm 1 offers substantially better performance compared to $k$ -means with respect to MSD to closest $k$ centers. In these cases, Algorithm 1 also outperforms $ALG_{g}$ . For MSD to the closest $k/2$ centers, $ALG_{g}$ and Algorithm 1 either outperform or perform similar to $k$ -means. In contrast, and as expected, the $k$ -means outperforms both algorithms with respect to the MSD to closest center. With this measure, $ALG_{g}$ typically outperforms Algorithm 1. The experiments illustrate that other well-known algorithms produce different outcomes compared to our Algorithm 1, and this difference can be quantified via an intuitive PRF-related metric, i.e., minimizing the MSD to multiple centers.

Refer to caption — Figure 4: Buddy dataset: MSD to closest 1, k/2, and k centroids.

Proportionally Representative Clustering

Abstract

1 Introduction

Contributions.

2 Related Work

2.1 Fair clustering

2.2 Comparison with multi-winner voting

3 Towards an Axiom Capturing Proportional Representation

Definition 1 (Proportional Fairness [14]).

Definition 2 (Unanimous Proportionality (UP)).

Example 1 (Proportional fairness and core fairness).

Definition 3 (ρ\rho-approximate Proportional Fairness).

Example 2.

4 Fairness for the Unconstrained Setting

Definition 4 (Proportionally Representative Fairness (PRF) for Unconstrained Clustering).

Example 3 (Requirements of PRF).

Proposition 1.

Proof.

Example 4 (kk-means does not satisfy PRF).

Theorem 1.

Lemma 1.

Proof.

Lemma 2.

Proof.

5 Fairness for the Discrete Setting

Definition 5 (Proportionally Representative Fairness (PRF) for Discrete Clustering).

Proposition 2.

Theorem 2.

Remark 1.

6 Discussion

Acknowledgments

References

Appendix A Omitted proofs

Proof of Theorem 1.

Proof of Proposition 2.

Proof of Theorem 2.

Appendix B Tight Analysis of PF Approximations

Example 5.

Appendix C Strengthening of PRF

Definition 6 (Proportionally Representative Fairness (PRF)-II for Discrete Clustering).

Definition 7 (Proportionally Representative Fairness (PRF)-III for Discrete Clustering).

Example 6.

Appendix D kk-means++

Appendix E Strategyproofness and Fairness

Proposition 3.

Proof.

Corollary 1.

Proposition 4.

Proof.

Appendix F Experimental results

Definition 3 ( $\rho$ -approximate Proportional Fairness).

Example 4 ( $k$ -means does not satisfy PRF).

Appendix D $k$ -means++