Matchings with Group Fairness Constraints: Online and Offline Algorithms
Abstract
We consider the problem of assigning items to platforms in the presence of group fairness constraints. In the input, each item belongs to certain categories, called classes in this paper. Each platform specifies the group fairness constraints through an upper bound on the number of items it can serve from each class. Additionally, each platform also has an upper bound on the total number of items it can serve. The goal is to assign items to platforms so as to maximize the number of items assigned while satisfying the upper bounds of each class. In some cases, there is a revenue associated with matching an item to a platform, then the goal is to maximize the revenue generated.
This problem models several important real-world problems like ad-auctions, scheduling, resource allocations, school choice etc.We also show an interesting connection to computing a generalized maximum independent set on hypergraphs and ranking items under group fairness constraints.
We show that if the classes are arbitrary, then the problem is NP-hard and has a strong inapproximability. We consider the problem in both online and offline settings under natural restrictions on the classes. Under these restrictions, the problem continues to remain NP-hard but admits approximation algorithms with small approximation factors. We also implement some of the algorithms. Our experiments show that the algorithms work well in practice both in terms of efficiency and the number of items that get assigned to some platform.
1 Introduction
Graph matching is a fundamental problem in graph theory and theoretical computer science that has been studied extensively over the years. Computing the maximum matching in bipartite graphs, both in the online and the offline setting is an important building block in many applications in allocation problems such as ad-auctions [Meh13, MSVV07], scheduling [MAW99], resource allocation [HLL12], school choice [AS03] etc. Since the notation used in these various problems differ, we use the general terms items and platforms to refer to the two parts of the bipartite graph. In practice, items may be classified based on different properties and hence may belong to certain groups or classes. Modeling the allocation problems as a vanilla matching problem seeks to optimize the cost of the solution alone and may inadvertently be “unfair” to some classes of items. Necessitated by the need to be fair and unbiased towards any group of items in the input, there has been a lot of recent work studying algorithms for various problems augmented with fairness constraints, such as [CSV18, KMM15, CHRG16, BCZ+16]).
In this paper, we enforce group fairness through constraints that place an upper bound on the number of items that can be matched from a particular class to a platform. We note that group fairness constraints usually involve both upper and lower bounds. This is incompatible with the practical applications that we have in mind, namely ad-allocation and course allocation problems. For this reason, we focus only on upper bounds. We formally define the problem as follows.
Classified maximum matching (CMM)
We have a set of items and a set of platforms, and these sets form the two parts of a bipartite graph. The presence of an edge indicates that item can be matched to platform . Let denote the neighborhood of . Each platform has a collection of classes , i.e., each item in may belong to some of the classes in . Each class has an associated quota denoting the maximum number of items from that can be assigned to . In addition, each platform has a quota , which is an upper bound on the total number of items it can serve. Our goal is to compute an assignment of items to platforms so as to maximize the number of items assigned, while satisfying all the quotas.
Classes allow platforms to specify group fairness constraints – for instance the classes can be seen as properties or categories and the quotas impose the constraints that not too many items from one category are assigned to a platform. These types of fairness constraints have been studied in many practical applications. In [AS03], the authors address the school choice problem where fairness constraints are imposed to achieve racial, ethnic, and gender balance. In the assignment of medical residents to hospitals in Japanese Residency Matching Program (JRMP), regional quotas are introduced to ensure fairness amongst urban and rural hospitals [KK12, KK15]. Huang [Hua10] motivates classifications from the perspective of enabling diversity in academic hiring. Apart from matchings, group fairness constraints have also been studied for many other problems like the knapsack problem [PKL21], and clustering problems [BCFN19], to name a few.
Some recent pre-prints have discussed fairness in matching problems. Soriano and Bonchi [GSB20] study a different notion of individual fairness that they call maxmin-fairness. Their goal is to output a solution such that the satisfaction of one agent cannot be improved without making another agent worse-off. Ma and Xu [MX20] measure fairness by the ratio of expected number of agents matched from a particular group to the expected number of agents from that group and their goal is to maximize the minimum of this ratio over all groups. Basu et al. [BDLK20] also measure fairness based on metrics involving the ratio of agents across groups and the utility they provide. While qualitatively similar, our constraints can be seen as being orthogonal to such notions of fairness.
In different applications, the fairness constraints can have a different structure. In [KK12, KK15], each hospital belongs to exactly one region, hence fairness constraints partition the set of hospitals. On the other hand, in the school choice problem in [AS03], a student belongs to multiple constraints. The structure of the constraints crucially affects the computational complexity of finding a fair allocation. This has been illustrated in the context of bipartite matchings where one or both sides of the bipartition express preferences over the other side. Huang [Hua10] and Fleiner and Kamiyama [FK12], address the CMM problem when both sides of the bipartition have preferences over each other and the notion of optimality is stability, whereas [NNP19] study the CMM problem under the optimality criteria of rank-maximality and popularity. In all these cases, it has been shown that the CMM problem can be efficiently solved if the constraints have a laminar222 A family of subsets of a set is laminar if, for every pair of sets , either or or . structure, and is NP-hard in general [NNP19]. Such a restriction has been considered before in the literature, such as in the hospital-resident problem [KK12] or the college admissions problem [BFIM10, GIK+16]. However, a finer relation between the structure of the class constraints and the computational efficiency has not received much attention in literature. In this paper, we address this issue by focusing on a quantification of non-laminarity in the classes and its effect on computational efficiency. We strengthen the hardness results in [NNP19] and obtain new approximation algorithms for the CMM problem in the offline setting.
Next, we turn our attention to the online version of the problem. Online matching problems have numerous practical applications, such as in ad-allocations [MSVV07], resource allocations [DJSW11], etc. See [Meh13] for a survey on online bipartite matchings. Fairness has also been studied in online settings such as online learning [GJKR18] or ride-hailing platforms [SBZ+19]. ‘Fairness’ in another form has been considered previously in the online literature. For example, the ‘frequency caps’ mentioned in [FMMM09] places a restriction on the total number of ads that are shown to the same user, or users from a particular demographic. We study some natural online versions of the CMM problem. We first show that one of our approximation algorithms for the offline non-laminar case also works as an online algorithm, regardless of the input model. For the setting where we restrict classes to be laminar, we show that existing algorithms for online bipartite matching carry over to our setting.
1.1 Models
We study CMM problem in various settings. In practice, assigning an item to a platform may generate a revenue, which can be modelled as the weight of the edge . In such a case the goal of the weighted CMM problem is to compute an assignment of items to platforms so as to maximize the total weight of edges in the matching, while satisfying the quotas of all the classes. We now formally define our models.
Model 1 (Many-to-one).
This is the setting described earlier. In this setting, items can match to at most one platform.
Model 2 (Many-to-many).
This is a more general setting in which items may be matched to multiple platforms. In addition to the classes of platforms, each item also has a collection of classes , i.e., each platform in belongs to some of the classes in , and the items also have quotas for their classes. This model arises in scenarios like course allocation, where students may be allotted multiple courses subject to various restrictions. Courses may have restrictions over the number of students allotted to it from each department or from each batch.
In the setting of online matchings, the platforms are available offline and the items arrive online. When an item arrives online, its neighbours in , and the classes that it participates in are revealed. It must be immediately decided if we match to some platform and any edges matched cannot be unmatched later. In the literature, the order in which the items arrive has been studied in various models. In the adversarial model, the items can arrive in an arbitrary order. We study a natural online arrival model for the items, called the Random input model. See [Meh13] for a survey of other work on such models.
Model 3 (Random input [GM08]).
In this setting, there is an underlying graph . The vertices of arrive according to a permutation chosen uniformly at random.
1.2 Our results
In most applications, an item typically belongs to small number of classes, hence we first study this setting. For example, [AS03] discusses the Boston student assignment mechanism which divides students into two categories based on whether they already have a sibling in the school and whether they are within walking distance to the school. Similarly, in a faculty hiring scenario, the number of classes (which would correspond to specializations) is independent of the number of applicants. For the scenario when there are constant number of classes we show the following result.
Theorem 1 (Informal version of Theorem 6).
The CMM problem can be solved in polynomial time if there is a constant number of platforms, each with a constant number of classes. This leads to a -approximation algorithm for an arbitrary number of platforms, each with a constant number of classes.
Now we turn to a more general setting where the number of classes is arbitrary and exploit the structural relation amongst the classes. We know from [NNP19] that when the classes of every platform form exactly one laminar family then the CMM problem is solvable in polynomial time. We prove the following theorem.
Theorem 2.
There is a polynomial-time algorithm achieving an approximation ratio of for the many-to-one setting (Model 1) where each item belongs to at most laminar families of classes per platform. This generalizes to the weighted many-to-many setting (Model 2) where for each edge , the classes of and that contain can be partitioned into laminar families.
The above result is applicable in scenarios like ad-allocation where the number of classes can be arbitrary, but any ad belongs to a few of them. Complementing this, we also obtain hardness results for computing the optimal CMM.
Proposition 1.
-
(i)
CMM cannot be approximated to a factor of for any unless , where , even when there is a single platform and all edge weights are one.
-
(ii)
When there is a single platform, and additionally, each item appears in at most classes, the problem is NP-hard to approximate within a factor .
The proof of Proposition 1 follows from a reduction from the maximum independent set problem.
Online algorithms.
We first remark that our algorithm from Theorem 2 works as an online algorithm in the unweighted case, even when the input is adversarially chosen.
Theorem 3.
There is a polynomial-time online algorithm achieving, in any input model, a competitive ratio of for the many-to-one setting (Model 1) where each item belongs to at most laminar families of classes per platform. The algorithm extends to the many-to-many setting (Model 2) where the classes of and containing each edge can be partitioned into laminar families.
Having achieved a competitive ratio that is close to the lower bound (from Proposition 1 (ii)), we consider the case where classes are restricted to be laminar. We consider the random order input model (Model 3) and show that a simple greedy algorithm from the literature also works for CMM and that it achieves the same competitive ratio. We use the technique of randomized dual fitting which has been used to analyse competitive ratios in works such as [DJK13, HKT+18].
Theorem 4.
There is a polynomial-time algorithm achieving a competitive ratio of for CMM with laminar classes in the random input model.
2 Implications for other problems
Although the CMM problem is modelled as a matching of items to platforms, we show that the classes capture problems which are well studied and are of independent interest.
Maximum Independent set on hypergraphs
Given a hypergraph , and a function , compute the largest set of vertices such that for every , . We note that when for each edge , this is the problem of computing the strong maximum independent set and when , this is the weak maximum independent set problem. These problems are well-studied for bounded-degree hypergraphs; [HL09] describe algorithms achieving factors of and for the strong and weak cases respectively, where denotes the maximum degree of a vertex in . For the weak independent set, this was further improved to in [AHL13]. However, to the best of our knowledge, there is no known approximation algorithm for the case when is an arbitrary value – we call this the generalized maximum independent set on hypergraphs. We state our approximation result for independent sets on hypergraphs below, which is a consequence of Theorem 2.
Proposition 2.
There is a polynomial time approximation algorithm for the problem of computing a generalized maximum independent set on hypergraphs with maximum degree .
For the case when average degree of the vertices is , we get the following:
Theorem 5.
There is a approximation algorithm for the generalized independent set where and denotes the average degree of a vertex.
For the CMM problem, this implies an approximation algorithm when we only have an upper bound on the average number of laminar families of classes an item belongs to, and there is only one platform.
Ranking and group fairness
In an apparently different model, Celis et al. [CSV18] consider ranking items from a universe of items, where . Items are assigned properties, and upper quotas for the number of items from any property in the top ranks. When items have at most properties each, they give a approximation while allowing constraints to be violated by a factor of . This problem can be reduced to the CMM problem and our algorithm from Theorem 2 achieves the same approximation factor without violating any class constraints.
We now show a reduction from their problem to the CMM problem. The fair rankings problem is defined as follows: Given a set of items, our objective is to choose among them and rank them. Every item has a value when ranked in the position. Every item also has some properties, and every property can be represented as a subset of the items that share that property. The objective is to rank them such that this the total value is maximized, subject to some fairness constraints on the properties. A fairness constraint corresponding to the property is set of values such that is the maximum number of items with property that can be ranked in the top positions.
In the reduction, we create one platform . For every item in the ranking instance we have items in the CMM instance. The item being matched to will be equivalent to ranking the item in the position. Then our classes are
-
•
One class for every item to ensure that an item is ranked at most once. That is, , we have the class
-
•
One class for every rank to ensure that at most one item is ranked in one position. That is, , we have the class
-
•
One laminar family of classes for every property with constraints . There are constraints, one for each position in the ranking. Let be the items with property . Then we have
where . It is easily seen that this is a laminar family.
Thus, for an item in the ranking instance that belongs to properties, for any , in the constructed CMM instance belongs to laminar families of classes. Using the algorithm from Theorem 2, we get a approximation without any violation in quotas. However, it has to be noted that in [CSV18], they insist on finding a ranking with elements whereas we may output a possibly smaller ranking. We also note that this reduction immediately gives a better hardness of approximation of for the setting where every item lies in laminar families of classes per platform.
Simultaneous matchings
Kutz et al. [KEKM08] study the problem called simultaneous matchings which is defined as follows: given a bipartite graph and a collection , find the largest solution such that is a matching. This problem can be reduced to the CMM problem where every vertex in has constraints (excluding vertices to which has no edge), and each class has quota . The approximation factor in [KEKM08] is better but the constraints are significantly more restricted than ours.
3 Offline Approximation algorithms
We start by showing the hardness of the CMM problem.
Proof of Proposition 1.
Consider an arbitrary instance of the maximum independent set (MIS) problem, which is an undirected graph . From , we will create an instance of the CMM problem denoted as a graph , and a set of platform classes , where is the set of items and is a single platform.
Thus, there is an item for each , and every item in has an edge to . There is a class with quota for every edge . We claim that the instance so constructed has a feasible matching of size if and only if has an independent set of size .
Since an independent set consists of at most one end-point of an edge , the corresponding set of items respects quota of each class . Thus, given an independent set of size , there is a feasible matching of size in the CMM instance. Similarly, given a feasible matching of size in the CMM instance, the set of vertices corresponding to the matched items form an independent set in . There cannot be two matched items such that because of the quota of the class . Part (i) of Proposition 1 follows from the hardness of approximation of MIS [Zuc06].
The MIS problem is known to be NP-hard even when the input instance consists of a graph where the maximum degree of any vertex is at most , where [GJ90]. It is also known to be NP-Hard to approximate below a factor of when [BK19]. Part (ii) of Proposition 1 follows from this, by the same reduction, since the degree of a vertex in the MIS instance is the number of classes containing the corresponding item . ∎
3.1 Proof of Theorem 2
In this section, we consider the case when, for each platform and item , the classes containing can be partitioned into at most laminar families. We first present a -approximation algorithm for the case when there is only one platform. This algorithm also generalizes the maximum independent set in hypergraphs (Proposition 2). We extend this algorithm to a -approximation algorithm for the case with multiple platforms and even to the many-to-many setting.
Single platform case
Let be an instance of the CMM problem with a single platform and a family of classes with the above restriction.
Reduction to the generalized maximum independent set problem: We construct from an instance of the generalized maximum independent set problem by setting and , and . We call a set feasible if for every , . We call a set maximal if is feasible and is not feasible for every . Our algorithm is a simple greedy algorithm: output a maximal set of vertices. To prove the approximation, we use the following lemma.
Lemma 1.
Consider a set and a set such that is a maximal set of vertices. Then for every feasible set such that and , we have .
Proof.
We prove this by induction on . For any pair of sets that satisfy the conditions of the lemma, the base case of is trivially true by maximality of . To prove the induction step, assume that the lemma is true for all where is maximal and . We will show that the lemma holds for all that satisfies the lemma conditions and where . Suppose, for contradiction, that we have a feasible set such that and .
Now consider the set , for some . If is feasible, then we let , else we construct a feasible set as below.
Consider the set of edges that contain . We partition the set into laminar families and call these sets . Since each is laminar and every edge in contains the vertex , we can arrange the edges of as such that for all . Since is feasible and is not feasible, for every edge we have , that is, the violation is by at most one. For each , we find the smallest (if any) such that has a violation of 1. We remove a vertex such that . Note that the set is feasible for all the edges in since where . Further, note that exists because we assumed that and hence is feasible. We repeat this process for each laminar family and hence may have removed at most vertices from obtaining a which is feasible. Thus,
The second inequality follows from the assumption that . Now consider the induction hypothesis for , . Since , and is a feasible set containing and is disjoint from we have
from the induction hypothesis. This is a contradiction. Thus, induction hypothesis is true and the lemma follows.
∎
Let denote any maximal independent set of and be the optimal independent set. In the above lemma, set . The lemma implies . This proves Proposition 2. We note that this also gives us a -approximation for the CMM problem in the single platform case when every item belongs to at most laminar families of the platform. It is also easy to see that this algorithm runs in time . For every vertex, we add it if it does not exceed the quota of any edge it belongs to.
Multiple platforms
We can use the previous result to obtain a approximation for the multiple platforms case via a simple -time reduction to the single platform case: For every edge , make a new item . Replace all the platforms by a single dummy platform. Since classes are subsets of the neighbourhood sets of items or platforms, they can also be seen as subsets of edges of the graph. These naturally form classes over the items .This combined with the result for the single platform case gives an algorithm for the multiple platform case where is the total number of classes, establishing Theorem 2.
Remark 1 (Weights on items).
We remark that the same analysis goes through even if items have weights and the goal is to compute a maximum weight matching. The algorithm simply keeps matching the highest weight item that can feasibly match to the platform.
3.2 Constant number of classes
We can also prove Theorem 2 via the following general statement combined with Proposition 2. We will also need this to prove Theorem 1.
Lemma 2.
Given a polynomial time -approximation algorithm for the many-to-one matching problem with a single platform, we can obtain a polynomial time -approximation for the matching problem with multiple platforms.
Proof.
Suppose we have a set of items and a set of platforms. There is a hypergraph and function for each platform for the associated instance of generalized maximum independent set. Let be the set of items chosen in graph in OPT. Clearly, the set is a feasible set in . Note that the goal is to maximize the global number of items selected. Thus, the optimum does not necessarily pick the optimal sets in each hypergraph as there may be vertices which lie in multiple hypergraphs.
Let our algorithm apply the -approximation algorithm to . Let the selected independent set be . Now apply the -approximation to the graph induced on by the vertex set . Then on the graph induced on by the vertex set and so on. For define the sets,
Then, we have
(1) |
(2) |
because the s form a partition over s and s. The latter may not completely be covered by the partition, and thus we have a lower bound. Note that is a feasible set in because it is a subset of the which is a feasible set. Now consider the graph induced on by the vertex set . Every vertex in is present in this graph by the definition of and since is a feasible set in , it is feasible in any subgraph of . Thus, using our -approximation algorithm, we have
(3) |
Adding equation 3 over all and adding equation 2 multiplied by over all we have
Using equation 1,
because and are partitioned perfectly into the item set for each platform, as no item can be chosen by multiple platforms. ∎
Theorem 6 (Formal version of Theorem 1).
The CMM problem can be represented as an IP with variables if there is only one platform with classes. This can be solved in time , and also gives rise to a -approximation in time for multiple platforms, each with classes of items.
Proof.
We first reduce our instance of CMM with platform to an instance of generalized maximum independent set. Let the hypergraph corresponding to the platform be . Let the set of hyperedges of be . Define to be the power set of the edge set. Let the quota of a hyperedge be . For , let be the number of vertices we pick from the set
where . Then we have our ILP as
subject to
The above ILP has variables, and constraints. It is known that an ILP with variables and constraints can be solved in time exponential in and polynomial in [Len83]. Thus, when , where , the number of items, we can find the optimum solution in time polynomial in the input size. If we have that the number of posts is greater than 1 but still a constant, then we can keep one variable for every type-post pair. We can similarly proceed and get a polynomial-time algorithm. ∎
3.3 Bounded Average degree
We extend the result from the previous section for a single platform to the case when the average number of laminar families of classes an item belongs to is bounded by . We state it in terms of generalized maximum independent sethere. Now consider the case where the hypergraph constructed above has only bounded average degree of .
Proof of Theorem 5.
Since the average degree is , for any , there cannot be more than vertices of degree more than . Suppose we estimate and set . We call a vertex low degree if its degree is at most , otherwise the vertex is high degree. Then the number of low degree vertices is . In the graph induced by the low degree vertices, the size of the optimal independent set is at least , since at most vertices of high degree. We use our approximation algorithm on the graph induced by the low degree vertices. Since this graph has maximum degree , the size of the independent set has size .
Thus, our approximation ratio is at least . We finally need to estimate . We guess a value of from 1 to and run the above procedure for each of the guesses. Amongst all the solutions that we obtain, we pick the one with the highest cardinality. This is guaranteed to do at least as well as the case when we picked the correct value of . ∎
4 Online algorithms
The online algorithm for Theorem 3 is essentially the same as the one in Proposition 2 and works even for an arbitrary input model. Whenever an item arrives online, we match it to a platform such that the matching remains feasible. If there is no such platform, we leave it unmatched. The output is a maximal set, which by Proposition 2, gives us the required competitive ratio. However, we point out that this only works for the unweighted version.
Since the CMM problem is NP-Hard in general, we also give online algorithms for the case where the classes form a laminar family. This version is known to have an efficient offline algorithm, via the construction of a simple flow network. A similar construction is used to compute the rank-maximal matching in [NNP19].
In this setting, we study the many-to-many CMM model (Model 2), when the classes are laminar. We assume an input model where the item set arrives in a uniformly random permutation (Model 3). For the sake of the analysis, we assume that a random variable picked uniformly at random from for every item , and the items arrive in the increasing order of . Therefore the random vector fully describes the order of arrival of the items. We use to represent the vector after removing from . We use the following greedy algorithm, and analyze its competitive ratio (in expectation): Keep an arbitrary, fixed ranking of all the platforms in . When an item arrives online, match it to as many platforms as possible, picking the highest ranked ones.
We use a linear programming relaxation of our problem to analyze our algorithm. We set the primal values according to the output of our algorithm, thereby ensuring the feasibility of the primal solution. Now we need to construct an appropriate dual solution. We use the following folklore fact about the well-known method of dual fitting in designing algorithms. This technique is used in [DJK13, HKT+18] among others.
In the primal LP, we have a variable item is matched to platform . We also have constraints for both the item and platform classes. In the dual LP, we have variables corresponding to constraints in the primal LP. Let the dual variables corresponding to the item and platform classes and be respectively.
Primal
Subject to
Dual
Subject to
Fact 1 (Folklore).
Suppose we can set the dual variables such that the primal objective is equal to the dual objective, and the dual constraints of the form instead satisfy . Then, our algorithm has a competitive ratio in expectation.
Proof.
Consider the dual solution obtained by setting . Clearly this is a feasible solution. Further, we have from weak duality
Taking the expectation over the randomness of the input on both sides | ||||
Thus, the dual solution will certify a competitive ratio of in expectation. ∎
Now, we need to set the dual variables so that the dual constraints have a lower bound of . Let be a monotonically increasing function to be fixed later. Initially, all dual variables are zero. When a new item arrives, we update the primal solution based on whether we matched the item or now, and update the dual solution as follows.
-
•
If an edge is chosen by the algorithm (i.e. ) then set , where is the dual variable corresponding to the class .
-
•
If any class of platform is tight then for all items in the class set if we had that before the new item arrived. Fix for all such that . Let it remain unchanged henceforth. Also set
-
•
If any class of item is tight then set
Fix for all such that . Let it remain unchanged henceforth.
It can be easily seen that the dual objective function is equal to the primal objective function. It remains to be shown that the dual constraints are always greater than some in expectation.
The following is the key lemma in analyzing how the algorithm behaves depending on . Although it is inspired by [DJK13, HKT+18, GM08], our many-to-many model (Model 2) is more complicated in that moving one vertex up the ranking can cause more changes to the matching because an item can match to multiple platforms. Even apart from the platform classes, we must take care of item classes as well. To that end, we show the following lemma.
Lemma 3.
For any such that , if an item is matched to some platforms at , then it cannot be unmatched from any platform at due to an item class.
Proof.
Suppose the platforms are ranked and the items arrive in the order so that . We first fix an . We will prove this on induction for . That is, we first show that the lemma is true for item (ie ) and then inductively argue that it has to be true for every .
For the base case, suppose is matched to when . Then if , item is unaffected and the statement is true. Now let . We consider the changes that brings. The only way can get unmatched from some platform due to an item class is if the following chain of events ocurred. matches to some platform , forcing to unmatch from due to a platform class. matches to a platform , which it couldn’t do before because belonged to a tight item class . is forced to unmatch from some platform because belong to a tight item class . Note that both contain and by laminarity, one class must contain the other. We can argue that in either case, we get a contradiction.
Suppose the induction hypothesis is true up to (but not including) some . Then we will show that it is true for as well. Suppose not. Then for , is unmatched from some due to an item class . Then, there is a platform such that was not matched to at , but replaced at . This replacement can happen only if .
Why was not matched at ? Through a similar argument as in the base case, it can be seen that it is not possible due to an item class of . Thus, could not match to at due to a platform class (let it be ). Since it could match at , there must be some item that was matched to at but not at . Then, it must be that .
Why was unmatched from at ? By the induction hypothesis, it must be due to a platform class of . Let it be . Then there is an item that is matched to at but not at , such that . Now consider . Since , one class must contain the other by laminarity. We can argue that in both cases, we get a contradiction. ∎
Once we have the lemma, we show Theorem 4 the following way. Let be such a dual constraint. We want to show that or equivalently, . We look at the inner expectation. We fix and vary from 0 to 1, and show using dual-fitting arguments and Lemma 3 that . From Fact 1, this completes the proof of Theorem 4.
Proof of Theorem 4.
Consider a fixed platform . Suppose that some item class that is in is tight until . Thus, we have
until . Afterwards, we have two cases,
-
1.
Case 1: At , matches to . Note that remains true as long as is matched to .
-
(a)
Case 1(a): remains matched to until . Then the previous inequality continues to hold and we have
-
(b)
Case 1(b): becomes unmatched from at some point when due to a class of . In this case some platform class (say ) must have been tight. We claim that in this case, at least one class of that belongs to is tight regardless of the value of .
Clearly, increasing the value of beyond does not affect the tightness of . Thus, the class is tight even at . Now at any other value of , if the class is not tight then some item must’ve been replaced. By lemma 3, it must have been due to a platform constraint (say ). In this case, we have either
-
•
, in which case also belongs to this tight class.
-
•
, in which case the replacing element does not reduce the number of elements matched in (and cannot relax the constraint).
Since the highest rank of the items matched to in the tight class (either or ) is at most , and remained true until we have
-
•
-
(c)
Case 1(c): becomes unmatched from at some point when due to a class of . We will show that this case is not possible.
Suppose that at , item got matched to , which prevented from matching to . Thus, for some class which was tight. Since could not match to at , it must be that it was prevented from doing so owing to a item class. If it was due to a platform class, then it would not be possible for it to match at since it now has a worse rank than before.
Thus, there is some such that was matched to at (but not at ) and for another class which was tight. Again, due to laminarity and the non-empty intersection of and we have
-
•
If then it should not have been possible to match at because was tight.
-
•
If , then is not the replacing element since matching and unmatching cannot relax the constraint.
-
•
-
(a)
-
2.
Case 2: never matches to . We assumed that some item class was tight till . After that, since it was not possible for to match to , then some class of that belonged to was tight. By a similar logic as before, it can be seen that a class is tight regardless of the value of and the highest rank of any item in the class is . Thus, we have
Note that the case where is when some class of always prevented from matching to . In this case, like before we have .
Thus, we have that
as required. ∎
5 Experiments
In this section, we present the experimental evaluation of our offline algorithms from Theorem 1 and Theorem 2. We use a total of seven datasets which we categorize as real-world and synthetic datasets. The three real-world datasets are sourced from an elective allocation process at an educational institution. The four synthetic datasets are generated as described below. All experiments were run on a laptop running on a 64-bit Windows 10 Home edition, and equipped with an Intel Core i7-7500U CPU @2.7GHz and 12GB of RAM. For solving integer programs, we used IBM ILOG CPLEX Optimization Studio 20.1 through its Python API. All code was written to run on Python 3.8.
Dataset | -approx | -approx | OPT |
---|---|---|---|
Real-1 | 1871.5 (0.92) | 1899.8 (0.93) | 2035 (1) |
Real-2 | 1988.6 (0.92) | 2014.0 (0.93) | 2170 (1) |
Real-3 | 1938.6 (0.92) | 1936.7 (0.92) | 2107 (1) |
Dataset | -approx | -approx | OPT |
---|---|---|---|
Real-1 | 0.39 (1.23) | 0.11 (4.29) | 0.48 (1) |
Real-2 | 0.43 (1.03) | 0.11 (3.89) | 0.44 (1) |
Real-3 | 0.33 (1.23) | 0.10 (3.90) | 0.40 (1) |
Dataset | -approx | -approx | OPT |
---|---|---|---|
large-dense | 239552 (0.97) | 239566.4 (0.97) | 247537 (1) |
large-sparse | 212600.1 (0.97) | 211885.1 (0.97) | 218622 (1) |
small-sparse | 72676.4 (0.93) | 72821.5 (0.93) | 78279 (1) |
small-dense | 75887.7 (0.95) | 76133.4 (0.95) | 79827 (1) |
Dataset | -approx | -approx | OPT |
---|---|---|---|
large-dense | 5.68 (14.41) | 2.90 (28.21) | 81.99 (1) |
large-sparse | 4.67 (15.14) | 2.19 (32.19) | 70.73 (1) |
small-sparse | 1.55 (3.00) | 0.46 (10.07) | 4.68 (1) |
small-dense | 1.73 (5.39) | 0.58 (16.14) | 9.37 (1) |
Real-world datasets:
We use data from three course-registration periods at an educational institution. Each dataset has around 100 courses and 2000 students. The students and the courses correspond to items and platforms respectively in our model. The edges represent the courses that a student is interested in. The students are partitioned into 13 departments (majors) as well as 5 batches (1st year–5th year). Each course has an overall quota denoting the maximum number of students that can be allotted to it. For each course, we introduce a quota for each department and a quota for each batch. Each course belongs to one of two categories, and each student can be matched to at most one course of each category. The goal is to maximize the number of edges selected subject to these constraints. This can be immediately viewed as an instance of CMM.
Synthetic Datasets:
Modelled on the real-world datasets, we synthetically generate large instances and compare the performance of our algorithms to the optimal algorithm implemented using a matching Integer Linear Program. The synthetic datasets are generated as follows. Datasets labelled ‘large’ have 500 courses, and 20 departments with 10,000 students in each department. The datasets labelled ‘small’ have 300 courses, and 20 departments with 2,000 students in each department. The students have a degree that is chosen uniformly at random between 3 and 10 in the ‘dense’ datasets and between 3 and 5 in the ‘sparse’ datasets. Students choose their courses randomly based on a non-uniform probability distribution. This distribution is defined by assigning a random ‘popularity’ value to each course. We observe this feature in the real-world dataset, where all courses are not equally popular. We also experiment without this feature, and obtain similar results.
We compare our performance and running-time with the optimal solution obtained by solving the standard Matching ILP augmented with the constraints for each class. All running-times include the time taken for file I/O. The solution values and running-times are averaged over 10 runs. Though our algorithms are deterministic, these implementations utilize some randomness because of the use of hash-tables. Observe that since we have two laminar families of classes, Theorem 1 and Theorem 2 provide theoretical guarantees of only and respectively. However, the performance of the algorithms on both real-world and random data are close to optimal. All our tables provide absolute values of the solution value and running-time of the algorithm from Theorem 1 (column -approx) and algorithm from Theorem 2 (column -approx), as well as the relative value or relative speedup in comparison to that of the Matching ILP (column OPT).
5.1 Observations
Table 1 and Table 2 provide the solution values and running times for real-world instances whereas Table 3 and Table 4 provide the same for the synthetically datasets. In both the real-world and synthetic datasets, both of our algorithms output solutions with value at least 90% of the optimum value. This seems to suggest that both real-world or random settings are ‘easier’ than the worst-case instances for our algorithms. Furthermore, we believe that the significantly improved running-time more than makes up for loss of 10% in the output value. The biggest speedups are observed in the ‘large’ datasets, where our algorithms achieve speedups of and respectively. This is expected because the ILP takes time exponential in the size of the graph.
5.2 Additional Synthetic Datasets
In addition to the previous results, we evaluate the performance of our algorithms on synthetically generated datasets where the adjacency for each vertex was generated uniformly at random. That is, we did not use the ‘popularity’ measure of the courses when selecting adjacent courses for a student. Like before, datasets labelled ‘large’ have 500 courses, and 20 departments with 10,000 students in each department. The datasets labelled ‘small’ have 300 courses, and 20 departments with 2,000 students each in each department. In the datasets labelled ‘dense’, students have a degree that is chosen uniformly at random between 3 and 10, and in datasets labelled ‘sparse’, students have a degree that is chosen uniformly at random between 3 and 5. Students choose their courses randomly based on a uniform probability distribution over the courses, while ensuring that they apply to at least one course from each category. Table 5 and Table 6 provide the solution values and running times for the synthetically generated datasets described here. We observe a similar pattern here. Our algorithms achieve at least 95% of the optimal solution in each case, while being faster than the Matching ILP (OPT).
Dataset | -approx | -approx | OPT |
---|---|---|---|
large-dense | 245650.0 (1) | 245650.0 (1) | 245650 (1) |
large-sparse | 251420.9 (0.99) | 251304.1 (0.99) | 251840 (1) |
small-sparse | 74850.7 (0.95) | 74914.4 (0.95) | 78692 (1) |
small-dense | 77839.0 (0.97) | 77848.1 (0.97) | 79849 (1) |
Dataset | -approx | -approx | OPT |
---|---|---|---|
large-dense | 5.91 (24.97) | 3.05 (48.41) | 147.72 (1) |
large-sparse | 4.74 (17.52) | 2.30 (36.09) | 83.07 (1) |
small-sparse | 1.60 (2.40) | 0.50 (7.62) | 3.83 (1) |
small-dense | 1.71 (4.53) | 0.58 (13.29) | 7.78 (1) |
6 Conclusion
In this paper we gave approximation algorithms for the CMM problem in various offline and online settings. Improving these approximation factors or showing matching lower bounds are natural open questions. There are existing algorithms that break the barrier for online bipartite matching (without group fairness constraints) [FMMM09, KMT11]; obtaining similar bounds for online CMM is also an interesting open problem.
Acknowledgements
We acknowledge some initial discussions with Ajay Saju Jacob. We are grateful to the anonymous reviewers for their comments. AL was supported in part by SERB Award ECR/2017/003296 and a Pratiksha Trust Young Investigator Award. MN and PN are supported in part by SERB Award CRG/2019/004757.
References
- [AHL13] Geir Agnarsson, Magnús M. Halldórsson, and Elena Losievskaja, Sdp-based algorithms for maximum independent set problems on hypergraphs, Theoretical Computer Science 470 (2013), 1–9.
- [AS03] Atila Abdulkadiroglu and T. Sönmez, School choice: A mechanism design approach, The American Economic Review 93 (2003), no. 3, 729–747.
- [BCFN19] Suman Bera, Deeparnab Chakrabarty, Nicolas Flores, and Maryam Negahbani, Fair algorithms for clustering, NeurIPS, 2019, pp. 4955–4966.
- [BCZ+16] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai, Man is to computer programmer as woman is to homemaker? debiasing word embeddings, NIPS, 2016, p. 4356–4364.
- [BDLK20] Kinjal Basu, Cyrus DiCiccio, Heloise Logan, and Noureddine El Karoui, A framework for fairness in two-sided marketplaces, arXiv preprint arXiv:2006.12756 (2020).
- [BFIM10] Péter Biró, Tamás Fleiner, Robert W. Irving, and David F. Manlove, The college admissions problem with lower and common quotas, Theoretical Computer Science 411 (2010), no. 34-36, 3136–3153.
- [BK19] Amey Bhangale and Subhash Khot, Ug-hardness to np-hardness by losing half, pp. 3:1–3:20, 2019.
- [CHRG16] Matthew Costello, James Hawdon, Thomas Ratliff, and Tyler Grantham, Who views online extremism? individual attributes leading to exposure, Computers in Human Behavior 63 (2016), 311–320.
- [CSV18] L. Elisa Celis, Damian Straszak, and Nisheeth K. Vishnoi, Ranking with fairness constraints, ICALP, vol. 107, 2018, pp. 28:1–28:15.
- [DJK13] Nikhil R. Devanur, Kamal Jain, and Robert D. Kleinberg, Randomized primal-dual analysis of ranking for online bipartite matching, SODA, 2013, pp. 101–107.
- [DJSW11] Nikhil R. Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A. Wilkens, Near optimal online algorithms and fast approximation algorithms for resource allocation problems, Proc. of ACM Conf. on Electronic commerce, 2011, pp. 29–38.
- [FK12] Tamás Fleiner and Naoyuki Kamiyama, A matroid approach to stable matchings with lower quotas, SODA, 2012, pp. 135–142.
- [FMMM09] Jon Feldman, Aranyak Mehta, Vahab S. Mirrokni, and S. Muthukrishnan, Online stochastic matching: Beating 1-1/e, FOCS, 2009, pp. 117–126.
- [GIK+16] Masahiro Goto, Atsushi Iwasaki, Yujiro Kawasaki, Ryoji Kurata, Yosuke Yasuda, and Makoto Yokoo, Strategyproof matching with regional minimum and maximum quotas, Artificial intelligence 235 (2016), 40–57.
- [GJ90] Michael R. Garey and David S. Johnson, Computers and intractability; a guide to the theory of np-completeness, W. H. Freeman & Co., 1990.
- [GJKR18] Stephen Gillen, Christopher Jung, Michael Kearns, and Aaron Roth, Online learning with an unknown fairness metric, NIPS, 2018, pp. 2605–2614.
- [GM08] Gagan Goel and Aranyak Mehta, Online budgeted matching in random input models with applications to adwords, SODA, vol. 8, 2008, pp. 982–991.
- [GSB20] David García-Soriano and Francesco Bonchi, Fair-by-design matching, Data Mining and Knowledge Discovery (2020), 1–45.
- [HKT+18] Zhiyi Huang, Ning Kang, Zhihao Gavin Tang, Xiaowei Wu, Yuhao Zhang, and Xue Zhu, How to match when all vertices arrive online, STOC, 2018, pp. 17–29.
- [HL09] Magnús M. Halldórsson and Elena Losievskaja, Independent sets in bounded-degree hypergraphs, Discrete applied mathematics 157 (2009), no. 8, 1773–1786.
- [HLL12] Hassan Halabian, Ioannis Lambadaris, and Chung-Horng Lung, Optimal server assignment in multi-server parallel queueing systems with random connectivities and random service failures, IEEE International Conference on Communications, 2012, pp. 1219–1224.
- [Hua10] Chien-Chung Huang, Classified stable matching, SODA, 2010, pp. 1235–1253.
- [KEKM08] Martin Kutz, Khaled Elbassioni, Irit Katriel, and Meena Mahajan, Simultaneous matchings: Hardness and approximation, Journal of Computer and System Sciences 74 (2008), no. 5, 884–897.
- [KK12] Yuichiro Kamada and Fuhito Kojima, Stability and strategy-proofness for matching with constraints: A problem in the japanese medical match and its solution, American Economic Review 102 (2012), no. 3, 366–70.
- [KK15] , Efficient matching under distributional constraints: Theory and applications, American Economic Review 105 (2015), no. 1, 67–99.
- [KMM15] Matthew Kay, Cynthia Matuszek, and Sean A. Munson, Unequal representation and gender stereotypes in image search results for occupations, Proc. of ACM Conf. on Human Factors in Computing Systems, 2015, pp. 3819–3828.
- [KMT11] Chinmay Karande, Aranyak Mehta, and Pushkar Tripathi, Online bipartite matching with unknown distributions, STOC, 2011, pp. 587–596.
- [Len83] H. W. Lenstra, Integer programming with a fixed number of variables, Mathematics of operations research 8 (1983), no. 4, 538–548.
- [MAW99] Nick McKeown, Venkat Anantharam, and Jean Walrand, Achieving 100% throughput in an input-queued switch, vol. 47, 1999, pp. 1260–1267.
- [Meh13] Aranyak Mehta, Online matching and ad allocation, Foundations and Trends® in Theoretical Computer Science 8 (2013), no. 4, 265–368.
- [MSVV07] Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani, Adwords and generalized online matching, J. ACM 54 (2007), no. 5, 22–es.
- [MX20] Will Ma and Pan Xu, Group-level fairness maximization in online bipartite matching, arXiv preprint arXiv:2011.13908 (2020).
- [NNP19] Meghana Nasre, Prajakta Nimbhorkar, and Nada Pulath, Classified rank-maximal matchings and popular matchings – algorithms and hardness, WG, 2019, pp. 244–257.
- [PKL21] Deval Patel, Arindam Khan, and Anand Louis, Group fairness for knapsack problems, AAMAS, 2021, p. 1001–1009.
- [SBZ+19] Tom Sühr, Asia J. Biega, Meike Zehlike, Krishna P. Gummadi, and Abhijnan Chakraborty, Two-sided fairness for repeated matchings in two-sided markets: A case study of a ride-hailing platform, KDD, 2019, pp. 3082–3092.
- [Zuc06] David Zuckerman, Linear degree extractors and the inapproximability of max clique and chromatic number, STOC, 2006, pp. 681–690.