Online Fair Division with Contextual Bandits
Abstract
This paper considers a novel online fair division problem involving multiple agents in which a learner observes an indivisible item that has to be irrevocably allocated to one of the agents while satisfying a fairness and efficiency constraint. Existing algorithms assume a small number of items with a sufficiently large number of copies, which ensures a good utility estimation for all item-agent pairs. However, such an assumption may not hold in many real-life applications, e.g., an online platform that has a large number of users (items) who only use the platform’s service providers (agents) a few times (a few copies of items), which makes it difficult to estimate the utility for all item-agent pairs. To overcome this challenge, we model the online fair division problem using contextual bandits, assuming the utility is an unknown function of the item-agent features. We then propose algorithms for online fair division with sub-linear regret guarantees. Our experimental results also verify the different performance aspects of the proposed algorithms.
1 Introduction
Growing economic, environmental, and social pressures require us to be efficient with limited resources (Aleksandrov and Walsh, 2020). Therefore, the fair division (Steinhaus, 1948) of limited resources among multiple parties/agents is needed to efficiently balance their competing interests in many real-life applications, e.g., Fisher markets (Codenotti and Varadarajan, 2007; Vazirani, 2007), housing allocation (Benabbou et al., 2019), rent division (Edward Su, 1999; Gal et al., 2016), and many more (Demko and Hill, 1988). The fair division problem has been extensively studied in algorithmic game theory (Caragiannis et al., 2019; Codenotti and Varadarajan, 2007; Eisenberg and Gale, 1959; Vazirani, 2007), but this line of work focuses on the static setting where all information (items, agents, and their utilities) is known in advance. However, fair division problems are often online (Aleksandrov and Walsh, 2020; Gao et al., 2021; Yamada et al., 2024), referred to as online fair division, where indivisible items arrive sequentially and each item needs to be irrevocably allocated to an agent. Existing algorithms for online fair division assume a small number of items with a sufficiently large number of copies (Yamada et al., 2024), which ensures a good utility estimation for all item-agent pairs using the utilities observed in previous allocations. The estimated utilities of item-agent pairs are used to allocate each item to an agent in a way that maintains a desired balance between fairness (i.e., keeping the desired level of utilities across the agents) and efficiency (i.e., maximizing the total utility) (Sinclair et al., 2022).
However, many real-life applications have a large number of items with only a few copies of each item. For example, consider an online food delivery platform that wants to recommend restaurants (agents) to its users (items) while balancing fairly recommending restaurants to accommodate their competing interests (fairness) and maximizing its profit (efficiency). Similar scenarios also arise when ride-hailing platforms recommend cabs to users, e-commerce platforms match top sellers to buyers, and in network resource allocation (Lan et al., 2010), online advertisement (Li et al., 2019), allocating tasks among crowd-sourced workers (Patil et al., 2021; Yamada et al., 2024), providing humanitarian aid after natural disasters (Yamada et al., 2024), among many others (Aleksandrov and Walsh, 2020; Mehta et al., 2013; Walsh, 2011). The closest works to ours are Yamada et al. (2024) and Procaccia et al. (2024), but both works assume a finite number of items with a large number of copies of each item. We consider an online fair division setting where the number of items can be large with only a few copies of each item, thus considering a more general problem setting than in existing works.
Having a large number of items with only a few copies of each item makes it impossible to estimate the utility for all item-agent pairs, making existing algorithms (Procaccia et al., 2024; Yamada et al., 2024) ineffective in our setting. Therefore, this problem raises a natural question: How do we design an algorithm for online fair division problems having a large number of items with only a few copies of each item? The first challenge we face in designing such an algorithm is how to estimate the utility for any item-agent pair using the stochastic utilities observed in previous allocations. A natural solution is to assume that a correlation structure exists between the item-agent features and their utility, which can be realized via parameterization of the expected utility such that the utility is an unknown function of the item-agent features. In many real-life applications, the agents may also be concerned about fairness in each round of the allocation process (Benadè et al., 2023; Sim et al., 2021), e.g., in an online platform, if service providers (agents) do not get an adequate number of users (items), they may leave the platform and, even worse, may join a competing platform. Being fair in each round leads to the next challenge of measuring how well an item's allocation to an agent will maintain the desired balance between fairness and efficiency. To address this challenge, we introduce the notion of a goodness function (e.g., the weighted Gini social-evaluation function (Weymark, 1981); more details in Section 3 and Section 4.3) that measures how well an item allocation to an agent maintains the desired balance between fairness and efficiency (a larger value of the goodness function implies a better allocation). The learner's goal is to allocate each item to the agent that maximizes the goodness function value.
After having the goodness function and the estimate of the utility function, the learner must decide which agent should be assigned each item. Since the utility is only observed for the agent to whom the item is allocated (i.e., the selected item-agent pair), the learner has to deal with the exploration-exploitation trade-off (Auer et al., 2002; Lattimore and Szepesvári, 2020; Slivkins, 2019), i.e., choosing the best agent based on the current utility estimate (exploitation) or trying a new agent that may lead to better utility estimates in the future (exploration). To deal with the exploration-exploitation trade-off, we adapt contextual bandit algorithms (Agrawal and Goyal, 2013; Chu et al., 2011; Krause and Ong, 2011; Li et al., 2010; Zhang et al., 2021; Zhou et al., 2020) to get an optimistic utility estimate for each item-agent pair (in Section 4), where the reward in contextual bandits corresponds to the utility, contexts to items, and arms to agents. The goodness function uses these optimistic utility estimates to allocate the given item to the best possible agent that maintains the desired balance between fairness and efficiency. We prove that our proposed algorithms have finite-sample sub-linear regret upper bound guarantees (Theorems 1, 2, and 3) for a class of goodness functions that are locally monotonically non-decreasing and locally Lipschitz continuous (in Section 4.3). Our experimental results in different settings also corroborate the theoretical results (in Section 5).
1.1 Related work
Fair division.
Fair division is a fundamental problem (Dubins and Spanier, 1961; Steinhaus, 1948) in which a designer allocates resources to agents with diverse preferences. There are two popular notions of 'fairness' used in fair division – envy-freeness (EF) (Varian, 1974) (every agent prefers his or her own allocation the most) and proportional fairness (Steinhaus, 1948) (each agent is guaranteed her proportional share of the total utility independent of other agents). Many classical studies on fair division focus on divisible resources (Dubins and Spanier, 1961; Steinhaus, 1948; Varian, 1974), which were later extended to indivisible items (Brams and Taylor, 1996; Budish, 2011; Moulin, 2004) due to their many practical applications. However, an envy-free allocation is not guaranteed to exist, e.g., in the case of two agents and one item. Therefore, appropriate relaxations of EF and proportionality, such as envy-freeness up to one good (EF1) (Moulin, 2004), envy-freeness up to any good (EFX) (Caragiannis et al., 2019), and maximin share fairness (MMS) (Budish, 2011), are used in practice. We refer the readers to (Amanatidis et al., 2022) for a detailed survey on fair division.
Online fair division.
Online fair division involves problems where items arrive sequentially and must be immediately and irrevocably allocated to an agent. The nature of the items can differ, e.g., they can be divisible (Walsh, 2011) or indivisible (Aleksandrov and Walsh, 2017; Procaccia et al., 2024; Yamada et al., 2024), homogeneous or heterogeneous (Kash et al., 2014; Walsh, 2011), there can be multiple copies of items (Procaccia et al., 2024; Yamada et al., 2024), and agents may receive more than one item (Aleksandrov et al., 2015). Finding a fair allocation is also related to computing market or competitive equilibria, which are known to be computationally challenging tasks (Codenotti and Varadarajan, 2007; Vazirani, 2007). In the static setting, this problem can be reduced to the Eisenberg-Gale (EG) convex program (Eisenberg and Gale, 1959; Jain and Vazirani, 2010), but envy-freeness is not compatible with Pareto optimality in online settings even if items are divisible. Some works have taken a relaxed approach to envy-freeness, while others considered approximate solutions, e.g., relaxing envy-freeness (Kash et al., 2014), utilizing a stochastic approximation scheme to obtain an approximate solution to the EG convex program (Bateni et al., 2022), or achieving approximate fairness (Yamada et al., 2024). Recent works (Benadè et al., 2023; Sinclair et al., 2022) examine the trade-off between fairness and efficiency. However, existing methods cannot be used in our setting, as it is impossible to estimate the utility for all item-agent pairs due to a large number of items with only a few copies of each item. Prior work (Benadè et al., 2023) assumes that the utilities of all agents for an item are known upon its arrival (noiseless utility). In contrast, we consider the utilities unknown, and only a noisy utility is observed for each item allocation. This introduces a learning problem, i.e., estimating the unknown utility, and connects online fair division problems to multi-armed bandits (Procaccia et al., 2024; Yamada et al., 2024).
Fair multi-armed bandits (MAB).
Fair MAB is a framework in which an arm with a higher expected reward is never selected with a lower probability than an arm with a lower expected reward (Joseph et al., 2016). A fair policy should sample arms with probability proportional to the value of a merit function of their mean rewards (Wang and Joachims, 2021). Many works also assume that preserving fairness means the probability of selecting each arm should be similar if the two arms have similar quality distributions (Chen et al., 2021; Liu et al., 2017). Fairness has also been considered in the linear contextual bandit setting (Gillen et al., 2018; Grazzi et al., 2022; Schumann et al., 2019; Wang et al., 2021; Wu et al., 2023), where an unknown similarity metric imposes individual fairness constraints. Some fair MAB variants seek to optimize the cumulative reward while also ensuring that, at any round, each arm is pulled at least a specified fraction of times (Chen et al., 2020; Claure et al., 2020; Patil et al., 2021; Li et al., 2019). The work of Hossain et al. (2021) considers a multi-agent setting where the goal is to find a fair distribution over the arms for the agents. However, these works on fair MAB consider only fairness, whereas maintaining a desired balance between fairness and efficiency is needed in many practical applications.
2 Problem setting
Online fair division.
We consider an online fair division problem involving $n$ agents in which a learner (or central planner) only observes an indivisible item each round that needs to be allocated to one of the agents while satisfying a given fairness constraint to the greatest extent possible. We denote the set of agents by $[n] = \{1, \ldots, n\}$ and the set of indivisible items by $\mathcal{X}$. In our problem, we can have a large number of items with only a few copies of each item, and the learner has no prior knowledge about future items. At the start of round $t$, the environment selects an item $x_t \in \mathcal{X}$, and then the learner observes $x_t$ and allocates the item to an agent $a_t \in [n]$. After that allocation, the learner observes a stochastic utility collected by the agent, denoted by $y_t = f(x_t, a_t) + \varepsilon_t$, where $f$ is an unknown utility function and $\varepsilon_t$ is an $R$-sub-Gaussian noise, i.e., $\mathbb{E}\left[e^{s\varepsilon_t}\right] \le \exp\!\left(s^2 R^2/2\right)$ for all $s \in \mathbb{R}$.
Allocation quality measure.
Let $v_{t,k}$ be the cumulative total utility collected by agent $k$ at the beginning of round $t$, where $v_{t,k} = \sum_{s=1}^{t-1} y_s \mathbb{1}(a_s = k)$ and $\mathbb{1}(\cdot)$ denotes the indicator function. We denote the vector of all agents' total utilities by $v_t = (v_{t,1}, \ldots, v_{t,n})$. We assume a goodness function G exists that incorporates the desired balance between fairness and efficiency. This goodness function measures how well the item allocation to an agent will maintain the desired balance between fairness and efficiency (a larger value implies a better allocation). We discuss different choices of the goodness function in Section B.1. For the given total utilities of agents, the value of the goodness function G after allocating the item $x_t$ to an agent $k$ is denoted by $G(v_t^k)$. Here, $v_t^k$ is the same as $v_t$ except that the total utility of the $k$-th agent is $v_{t,k} + f(x_t, k)$. The learner's goal is to allocate the item to an agent that maximizes the value of the goodness function or satisfies the given fairness constraint to the greatest extent possible.
Performance measure of allocation policy.
Let $a_t^\star$ denote the optimal agent for item $x_t$ having the maximum goodness function value, i.e., $a_t^\star = \arg\max_{k \in [n]} G(v_t^k)$. Since the optimal agent is unknown, we sequentially estimate the utility function using the historical information of the stochastic utilities observed for the selected item-agent pairs and then use the estimated utility function to allocate an agent for the item $x_t$. After allocating item $x_t$ to an agent $a_t$, the learner incurs a penalty (or instantaneous regret) $r_t = G(v_t^{a_t^\star}) - G(v_t^{a_t})$. Our aim is to learn a sequential policy that selects agents for items such that the learner's total penalty for not assigning each item to the optimal agent (or cumulative regret) is as small as possible. Specifically, the cumulative regret (regret in short for brevity) of a sequential policy that selects agent $a_t$ for allocating item $x_t$ in round $t$ over $T$ rounds is given by
$$\mathfrak{R}_T \;=\; \sum_{t=1}^{T} \left[\, G(v_t^{a_t^\star}) - G(v_t^{a_t}) \,\right]. \qquad (1)$$
A policy is a good policy if it has sub-linear regret, i.e., $\lim_{T \to \infty} \mathfrak{R}_T / T = 0$. This implies that the policy will eventually start allocating items to the best possible agent.
Remark 1.
Our regret definition differs from that used in standard contextual bandit algorithms, as the value of the goodness function depends on the item in the current round (as in contextual bandits) and also on the utilities collected by each agent in the past (history-dependent). This difference makes the regret analysis challenging for an arbitrary goodness function.
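To make this history dependence concrete, the following minimal Python sketch (our own illustration, not code from the paper) computes the cumulative regret of Eq. 1 for a generic goodness function; note how the goodness values at round $t$ depend on the cumulative utilities produced by the policy's own past choices. For simplicity, the history is updated with expected utilities rather than noisy observations.

```python
import numpy as np

def cumulative_regret(goodness, true_utils, chosen_agents):
    """Regret of Eq. 1 over T rounds.

    goodness      : callable mapping a utility vector (n,) to a scalar G(v).
    true_utils    : (T, n) array of true expected utilities per round and agent.
    chosen_agents : length-T sequence of agents picked by the policy.
    """
    T, n = true_utils.shape
    v = np.zeros(n)            # cumulative utilities collected so far (history)
    regret = 0.0
    for t in range(T):
        # goodness value of every hypothetical allocation in round t
        scores = []
        for k in range(n):
            v_k = v.copy()
            v_k[k] += true_utils[t, k]
            scores.append(goodness(v_k))
        best = max(scores)                 # optimal agent's goodness value
        a = chosen_agents[t]
        v_a = v.copy()
        v_a[a] += true_utils[t, a]
        regret += best - goodness(v_a)     # instantaneous regret r_t
        v = v_a                            # history follows the policy's choice
    return regret
```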
3 Goodness function incorporating fairness constraint
In this section, we first define the various performance measures for fairness and efficiency commonly used in the algorithmic game theory literature and economics. For brevity, let $v_k$ be the utility of agent $k$; then:
- Utilitarian Social Welfare (USW) (Feldman and Serrano, 2006). USW is defined as the sum of the utilities of all agents, i.e., $\mathrm{USW}(v) = \sum_{k=1}^{n} v_k$. Maximizing USW indicates that the system's total utility is maximum. We refer to this as an efficiency measure.
- Egalitarian Social Welfare (ESW) (Feldman and Serrano, 2006). ESW is defined as the minimum utility across all agents, i.e., $\mathrm{ESW}(v) = \min_{k \in [n]} v_k$. It is a notion of fairness where the designer aims to maximize the utility of the least happy agent to obtain a fair outcome.
- Nash Social Welfare (NSW) (Nash et al., 1950). NSW is defined as the product of the utilities of all agents, i.e., $\mathrm{NSW}(v) = \prod_{k=1}^{n} v_k$. By maximizing NSW, we get an approximately fair and efficient allocation.
Goodness function.
Since the goodness function is a performance indicator of the item allocation to an agent, it is used to allocate the items to agents in a way that maintains the desired balance between fairness and efficiency. It is well known that there is a trade-off between these two performance measures. For instance, when a learner aims to maximize social welfare, each item is allocated to an agent with the highest utility. Such an allocation scheme achieves efficiency but sacrifices fairness. Intuitively, one way to obtain fairness is to apply a smaller weight factor to agents with higher cumulative utility and a larger weight factor to those with lower cumulative utility. Therefore, we must consider an appropriate goodness function G such that optimizing G corresponds to a) maximizing efficiency, i.e., maximizing the individual agents' cumulative utility, and b) reducing the utility disparity among the agents, which ensures fairness. Our regret, as defined in Eq. 1, is formulated for a general goodness function G. However, we first consider the goodness function to be the weighted Gini social-evaluation function (Weymark, 1981), which measures the weighted fairness of the utility distribution while balancing the trade-off between fairness and efficiency. To define this function, we introduce a positive and non-increasing weight vector $w = (w_1, \ldots, w_n)$, where $w_1 \ge w_2 \ge \cdots \ge w_n > 0$, and a sorting operator $\sigma$ that arranges the cumulative utilities in increasing order, i.e., $v_{\sigma(1)} \le \cdots \le v_{\sigma(n)}$. The weighted Gini social-evaluation function is then given as follows:
$$G(v) \;=\; \sum_{i=1}^{n} w_i\, v_{\sigma(i)}. \qquad (2)$$
We now consider the following cases that coincide with the well-known social welfare functions.
1. If $w_1 = 1$ and $w_i = 0$ for $i = 2, \ldots, n$, then it is evident from Eq. 2 that the weighted Gini social-evaluation function coincides with ESW, i.e., maximizing the minimum utility among agents.
2. If $w_i = 1$ for all $i \in [n]$, then the weighted Gini social-evaluation function aligns with USW, i.e., maximizing the goodness function leads towards efficiency.
3. If $w_i > w_{i+1} > 0$ for all $i \in [n-1]$, then by appropriately choosing the weights, we can effectively control the trade-off between efficiency and fairness.
Remark 2.
In the above first two cases, we observe that larger variability in the weights leads to fairness, and smaller variability leads to efficiency. We carry this idea forward and aim to control the trade-off with a single control parameter instead of $n$ parameters, by defining each weight $w_i$ as a function of that single parameter. We experimentally investigate this behavior in detail in Section 5.
We primarily focus on the weighted Gini social-evaluation function because it allows us to interpolate between fairness (ESW) and efficiency (USW) by adjusting a single parameter, as discussed in Remark 2. We also explore other goodness functions, such as the NSW and log-NSW; further details are provided in Section B.1.
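To illustrate Eq. 2 and the single-parameter control discussed in Remark 2, here is a minimal Python sketch. The geometric weighting in `geometric_weights` is our own assumption about how one parameter could generate the non-increasing weights; it is not necessarily the exact choice used in the paper.

```python
import numpy as np

def gini_goodness(v, weights):
    """Weighted Gini social-evaluation function of Eq. 2: cumulative utilities
    are sorted in increasing order and paired with non-increasing weights."""
    v_sorted = np.sort(np.asarray(v, dtype=float))   # increasing order
    return float(np.dot(weights, v_sorted))

def geometric_weights(n, rho):
    """Hypothetical single-parameter weights w_i = rho**(i-1), rho in (0, 1].
    rho -> 0 puts almost all weight on the worst-off agent (ESW-like);
    rho = 1 gives equal weights and recovers USW."""
    return rho ** np.arange(n)

v = np.array([4.0, 1.0, 3.0])
print(gini_goodness(v, geometric_weights(3, 1.0)))    # USW: 8.0
print(gini_goodness(v, geometric_weights(3, 1e-6)))   # ~ESW: ~1.0
```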
4 Online fair division with contextual bandits
The online fair division setting that we consider can have a large number of items with only a few copies of each (even only one); hence, it is impossible to get a good utility estimate for all item-agent pairs. To overcome this challenge, we assume that a correlation structure exists between the utility and item-agent features, which can be realized via parameterization of the expected utility such that the utility is an unknown function of the item-agent features. Such an assumption is common in the contextual bandit literature (Agrawal and Goyal, 2013; Chu et al., 2011; Krause and Ong, 2011; Li et al., 2010; Zhang et al., 2021; Zhou et al., 2020), where the reward (utility) is assumed to be an unknown function of the context-arm (item-agent) features. Utility function estimation is the main component of our problem for achieving good performance. To bring out the key ideas and results, we first restrict our setting to linear utility functions and then extend our results to non-linear utility functions in Section 4.2.
4.1 Linear utility function
For brevity, we denote the set of all item-agent feature vectors in round $t$ by $X_t = \{x_{t,1}, \ldots, x_{t,n}\}$ and the item-agent features for item $x_t$ and an agent $k$ by $x_{t,k} \in \mathbb{R}^d$. After allocating the item to an agent $a_t$, the learner observes the stochastic utility $y_t = \theta_\star^\top x_{t,a_t} + \varepsilon_t$, where $\theta_\star \in \mathbb{R}^d$ is the unknown parameter and $\varepsilon_t$ is $R$-sub-Gaussian. Let $V_t = \lambda I_d + \sum_{s=1}^{t-1} x_{s,a_s} x_{s,a_s}^\top$, where $\lambda > 0$ is the regularization parameter that ensures $V_t$ is a positive definite matrix and $I_d$ is the identity matrix. The weighted $\ell_2$-norm of a vector $x$ with respect to a positive definite matrix $V$ is denoted by $\|x\|_V = \sqrt{x^\top V x}$.
At the beginning of round $t$, the estimate of the unknown parameter is given by $\hat\theta_t = V_t^{-1} \sum_{s=1}^{t-1} y_s x_{s,a_s}$. After having this utility function estimator, the learner has to decide to which agent to allocate the given item. Since the utility is only observed for the selected item-agent pair, the learner needs to deal with the exploration-exploitation trade-off (Auer et al., 2002; Lattimore and Szepesvári, 2020), i.e., choosing the best agent based on the current utility estimate (exploitation) or trying a new agent that may lead to a better utility estimator in the future (exploration). The upper confidence bound (UCB) (Li et al., 2010; Chu et al., 2011; Zhou et al., 2020) and Thompson sampling (TS) (Krause and Ong, 2011; Agrawal and Goyal, 2013; Zhang et al., 2021) are widely used techniques for dealing with the exploration-exploitation trade-off.
UCB-based algorithm.
We first propose a UCB-based algorithm named OFD-UCB that works as follows: for the first $n$ rounds, the learner allocates items to agents in a round-robin fashion to ensure each agent has a positive total utility, i.e., $v_{n+1,k} > 0$ for all $k \in [n]$. In round $t > n$, the environment reveals an item $x_t$ to the learner. Before selecting the agent for that item, the learner updates the utility function estimate $\hat\theta_t$ using the available historical information before round $t$ (i.e., $\{(x_{s,a_s}, y_s)\}_{s=1}^{t-1}$). Then, the utility UCB value of allocating the item $x_t$ to an agent $k$, denoted by $U_{t,k}$, is computed as follows:
$$U_{t,k} \;=\; \hat\theta_t^\top x_{t,k} + \alpha_t \|x_{t,k}\|_{V_t^{-1}}, \qquad (3)$$
where $\hat\theta_t^\top x_{t,k}$ is the estimated utility for allocating item $x_t$ to agent $k$ and $\alpha_t \|x_{t,k}\|_{V_t^{-1}}$ is the confidence bonus, in which $\alpha_t$ is a slowly increasing function of $t$ and the value of $\|x_{t,k}\|_{V_t^{-1}}$ goes to zero as $t$ increases.
Using these optimistic utility estimates for each agent, the algorithm selects an agent for allocating the item $x_t$ by maximizing the optimistic value of the goodness function as follows:
$$a_t \;=\; \arg\max_{k \in [n]}\; G\big(v_{t,1}, \ldots, v_{t,k} + U_{t,k}, \ldots, v_{t,n}\big), \qquad (4)$$
where $v_{t,k}$ is the total utility collected by agent $k$ before round $t$. If there are multiple agents for whom allocating the item gives the same maximum value of the goodness function, then one of these agents is chosen uniformly at random. After allocating item $x_t$ to agent $a_t$, the environment generates a stochastic utility $y_t$. The learner observes the utility and then updates $V_{t+1}$ and re-estimates $\hat\theta_{t+1}$. The same process is repeated to select the agent for allocating the subsequent items.
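The sketch below illustrates one round of OFD-UCB for the linear utility case, combining the ridge estimate, the UCB values of Eq. 3, and the goodness-maximizing selection of Eq. 4. The variable names and the fixed bonus scale `alpha` are simplifications on our part (the paper's $\alpha_t$ grows slowly with $t$).

```python
import numpy as np

def ofd_ucb_round(x, V, b, v, goodness, alpha=1.0):
    """One round of OFD-UCB (sketch).

    x : (n, d) item-agent feature vectors for the current item.
    V : (d, d) regularized Gram matrix, initialized to lam * I.
    b : (d,)  sum of y_s * x_{s,a_s} over past rounds.
    v : (n,)  cumulative utilities collected by the agents so far.
    """
    n, d = x.shape
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                     # ridge estimate
    bonus = alpha * np.sqrt(np.einsum('ij,jk,ik->i', x, V_inv, x))
    ucb = x @ theta_hat + bonus                               # Eq. 3
    # allocate to the agent whose optimistic utility maximizes the goodness (Eq. 4)
    scores = []
    for k in range(n):
        v_k = v.copy()
        v_k[k] += ucb[k]
        scores.append(goodness(v_k))
    a = int(np.argmax(scores))
    return a, ucb[a]

def ofd_ucb_update(V, b, v, x_a, y, a):
    """Update statistics after observing utility y for the chosen pair."""
    V += np.outer(x_a, x_a)
    b += y * x_a
    v[a] += y
    return V, b, v
```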
Our next result gives the regret upper bound of OFD-UCB when the utility function is assumed to be linear.
Theorem 1.
Let the item-agent feature vectors and the unknown parameter $\theta_\star$ have bounded $\ell_2$-norms, the noise in the utility be $R$-sub-Gaussian, and the goodness function be the same as defined in Eq. 2. Then, with probability at least $1-\delta$, the regret of OFD-UCB in $T$ rounds is
where , , and .
The cumulative regret depends on the instantaneous regret incurred in each round. We can upper bound the instantaneous regret incurred in round $t$ (as shown in Appendix A). As the instantaneous regret depends on the estimation error of $\theta_\star$ from the observed item-agent pairs up to round $t$, we adapt results from linear contextual bandits (Abbasi-Yadkori et al., 2011; Chu et al., 2011) to our setting to get an upper bound on this estimation error. With this result, the regret upper bound given in Theorem 1 follows by upper-bounding the sum of all instantaneous regrets. The detailed proof of Theorem 1 and the proofs of the other results below are given in Appendix A.
TS-based algorithm.
Due to the empirical superiority of TS-based algorithms over UCB-based bandit algorithms (Chapelle and Li, 2011; Agrawal and Goyal, 2013), we also propose a TS-based algorithm named OFD-TS, which works similarly to OFD-UCB except for the agent selection part. To get a TS-based utility estimate, the algorithm first samples a parameter $\tilde\theta_t$ from the distribution $\mathcal{N}(\hat\theta_t, \nu^2 V_t^{-1})$, where $\mathcal{N}$ denotes the multivariate normal distribution and $\nu$ is a scaling parameter (Agrawal and Goyal, 2013). Using the sampled parameter $\tilde\theta_t$, the TS-based utility estimate $\tilde\theta_t^\top x_{t,k}$ replaces $U_{t,k}$ for computing the value of the goodness function.
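A minimal sketch of the corresponding TS-based estimates, assuming the usual Gaussian sampling around the ridge estimate with an assumed scaling parameter `nu`:

```python
import numpy as np

def ofd_ts_estimates(x, V, b, rng, nu=1.0):
    """TS-based utility estimates for all agents (sketch).

    x : (n, d) item-agent features; V, b as in the UCB sketch above.
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    theta_tilde = rng.multivariate_normal(theta_hat, nu**2 * V_inv)  # sampled model
    return x @ theta_tilde   # replaces the UCB values inside Eq. 4

# usage: rng = np.random.default_rng(0); est = ofd_ts_estimates(x, V, b, rng)
```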
4.2 Non-linear utility function
We now consider the setting in which the utility function can be non-linear. As shown in Section 4.1, a linear contextual bandit algorithm can be used as a sub-routine to get the optimistic utility estimates in our online fair division setting. We then use these estimates to compute the value of the goodness function for allocating the received item to the agent that gives the best desired balance between fairness and efficiency. We generalize this observation to the online fair division problem with a non-linear utility function by using a suitable non-linear contextual bandit algorithm and introduce the notion of an Online Fair Division (OFD) Compatible contextual bandit algorithm.
Definition 1 (OFD Compatible Contextual Bandit Algorithm).
Let $\mathcal{D}_{t-1} = \{(x_{s,a_s}, y_s)\}_{s=1}^{t-1}$ denote the observations of item-agent pairs at the beginning of round $t$. Then, a contextual bandit algorithm $\mathcal{A}$ is OFD compatible if its estimated utility function $\hat f_t$, computed from $\mathcal{D}_{t-1}$, satisfies, with probability at least $1-\delta$, $\big|\hat f_t(x_{t,k}) - f(x_{t,k})\big| \le \epsilon_t(x_{t,k})$ for all $k \in [n]$ and all $t \ge 1$.
Many contextual bandit algorithms like Lin-UCB (Chu et al., 2011), UCB-GLM (Li et al., 2017), IGP-UCB (Chowdhury and Gopalan, 2017), GP-TS (Chowdhury and Gopalan, 2017), Neural-UCB (Zhou et al., 2020), and Neural-TS (Zhang et al., 2021) are OFD compatible. The value of $\epsilon_t(\cdot)$ gives the upper bound on the deviation of the estimated utility from the true utility function. The form of the function $\epsilon_t$ depends on the problem and the choice of contextual bandit algorithm and its associated hyperparameters, as shown in Table 1 at the end of Appendix A. Depending on the problem setting, we can use any suitable OFD compatible contextual bandit algorithm as a sub-routine to get optimistic estimates of the utility for all item-agent pairs, which are then used to compute the goodness function value and select an agent for allocating the item.
Let $\mathcal{A}$ be an OFD compatible contextual bandit algorithm with estimation error bound $\epsilon_t(\cdot)$. Then, the agent for allocating item $x_t$ is selected by maximizing the following optimistic value of the goodness function:
$$a_t \;=\; \arg\max_{k \in [n]}\; G\big(v_{t,1}, \ldots, v_{t,k} + \hat f_t(x_{t,k}) + \epsilon_t(x_{t,k}), \ldots, v_{t,n}\big), \qquad (5)$$
where $v_{t,k}$ is the total utility collected by agent $k$ thus far and $\hat f_t(x_{t,k}) + \epsilon_t(x_{t,k})$ is the optimistic estimate of $f(x_{t,k})$.
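The role of an OFD compatible algorithm can be captured by a small interface: the allocation rule of Eq. 5 only needs an optimistic utility estimate for each item-agent pair. The sketch below is our own abstraction; the method names are illustrative and not from the paper.

```python
from typing import Protocol
import numpy as np

class OFDCompatibleBandit(Protocol):
    """Any estimator exposing an optimistic utility per item-agent pair."""
    def optimistic_estimate(self, x: np.ndarray) -> np.ndarray: ...
    def update(self, x_chosen: np.ndarray, y: float) -> None: ...

def allocate(bandit: OFDCompatibleBandit, x: np.ndarray,
             v: np.ndarray, goodness) -> int:
    """Eq. 5: pick the agent whose optimistic utility maximizes the goodness."""
    opt = bandit.optimistic_estimate(x)   # (n,), e.g. from Lin-UCB, IGP-UCB, Neural-UCB
    scores = []
    for k in range(len(v)):
        v_k = v.copy()
        v_k[k] += opt[k]
        scores.append(goodness(v_k))
    return int(np.argmax(scores))
```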
Next, we give a regret upper bound when any OFD compatible contextual bandit algorithm is used to get the optimistic utility estimates.
Theorem 2.
Let $\mathcal{A}$ be an OFD compatible contextual bandit algorithm and the goodness function be the same as defined in Eq. 2. If the assumptions used in Theorem 1 hold, then, with probability at least $1-\delta$, the regret of the corresponding online fair division algorithm OFD-$\mathcal{A}$ in $T$ rounds is
The proof follows by upper-bounding the sum of instantaneous regrets. To do this, we first show that instantaneous regret incurred in the round is upper bounded by (as shown in Appendix A), which uses the fact that the estimation error is for an item-agent pair when using an OFD compatible contextual bandit algorithm .
4.3 Regret bound for other goodness functions
The regret upper bounds given in Theorems 1 and 2 hold for a specific goodness function (defined in Eq. 2). We also derive regret upper bounds for goodness functions that satisfy the following properties.
Definition 2 (Locally monotonically non-decreasing and Lipschitz continuous properties).
A goodness function G is
(i) locally monotonically non-decreasing if, for any utility vector $v$, any agent $k$, and any $\delta \ge 0$, $G(v_1, \ldots, v_k + \delta, \ldots, v_n) \ge G(v_1, \ldots, v_k, \ldots, v_n)$;
(ii) locally Lipschitz continuous if, for a constant $L_k$ corresponding to the $k$-th element, $\big|G(v_1, \ldots, v_k, \ldots, v_n) - G(v_1, \ldots, v_k', \ldots, v_n)\big| \le L_k\, |v_k - v_k'|$.
Next, we provide a regret upper bound for goodness functions that are locally monotonically non-decreasing and locally Lipschitz continuous, e.g., NSW and log-NSW.
Theorem 3.
Let $\mathcal{A}$ be an OFD compatible contextual bandit algorithm and the goodness function G be locally monotonically non-decreasing and locally Lipschitz continuous. If the assumptions used in Theorem 2 hold, then, with probability at least $1-\delta$, the regret of the corresponding online fair division algorithm OFD-$\mathcal{A}$ in $T$ rounds is
The instantaneous regret is upper bounded by (as shown in Appendix A). It uses the following three facts: (i) locally monotonically non-decreasing property, (ii) locally Lipschitz continuous property of goodness function, and (iii) the estimation error for contextual bandit algorithm .
5 Experiments
Our goal is to corroborate our theoretical results and empirically demonstrate the performance of our proposed algorithms in different online fair allocation problems. We repeat all our experiments 20 times and show the regret (as defined in Eq. 1) with a 95% confidence interval (the vertical line on each curve shows the confidence interval). To demonstrate the different performance aspects of our proposed algorithms, we have used different synthetic problem instances (commonly used experiment choices in bandit literature) whose details are as follows.
Experiment setting.
We use a $d_x$-dimensional space to generate the sample features of each item, where each item is represented by a feature vector. Similarly, we use a $d_a$-dimensional space to generate the sample features of each agent, where agent $k$ is represented by a feature vector $z_k$. The value of each feature is sampled uniformly at random. Note that the agents remain the same across the rounds, whereas the item in each round is randomly sampled from the $d_x$-dimensional space. To get the item-agent feature vectors for the item in round $t$, we concatenate the item features with each agent's feature vector. For item $x_t$ and agent $k$, the concatenated feature vector is denoted by $x_{t,k}$, which is a $d$-dimensional vector with $d = d_x + d_a$. We then select a $d$-dimensional parameter vector $\theta_\star$ by sampling it uniformly at random and normalizing the sampled vector to ensure $\theta_\star$ has unit $\ell_2$-norm. The remaining parameters are kept fixed across all our experiments.
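The synthetic instance generation described above can be sketched as follows, assuming uniformly sampled features and a unit-norm linear parameter; the dimensions and counts below are placeholder values, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_item, d_agent, n_agents, n_items = 3, 3, 5, 1000   # placeholder sizes
d = d_item + d_agent

agents = rng.uniform(size=(n_agents, d_agent))        # fixed across rounds
items = rng.uniform(size=(n_items, d_item))           # one sampled item per round

theta_star = rng.uniform(size=d)
theta_star /= np.linalg.norm(theta_star)              # unit l2-norm parameter

def item_agent_features(item):
    """Concatenate an item's features with every agent's features -> (n_agents, d)."""
    return np.hstack([np.tile(item, (n_agents, 1)), agents])

x = item_agent_features(items[0])
expected_utility = x @ theta_star                      # linear utility model
```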
Regret comparison with baselines.
We compare the regret of the proposed algorithms with two baselines: OFD-Uniform and OFD-Greedy. OFD-Uniform selects an agent uniformly at random to allocate the item. In contrast, OFD-Greedy uses an $\epsilon$-greedy contextual algorithm, which behaves the same as OFD-UCB except that it has no confidence term (i.e., the confidence bonus is set to zero) and, with probability $\epsilon$ in each round, selects an agent uniformly at random; otherwise, it selects the agent greedily. For experiments with the linear utility function, we fix the number of items, the number of agents, and the control parameter of the goodness function, and construct three problem instances that differ only in the dimension of the item and agent features. As expected, our algorithms based on UCB- and TS-based contextual linear bandit algorithms outperform both baselines, as shown in Fig. 1 on the different problem instances with linear utility (only the dimension varies while the remaining parameters are unchanged). Note that we set a limit on the y-axis to show the sub-linear regret of our algorithms. Further, we also observe that the TS-based algorithm performs better than the UCB-based algorithm. For the experiment with a non-linear utility, we use the square of the linear function as the utility. As shown in Fig. 1d, the linear contextual bandit algorithms perform poorly compared to their non-linear counterparts, which use a Gaussian process to estimate the non-linear function (Srinivas et al., 2010; Williams and Rasmussen, 2006).
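For reference, the OFD-Greedy baseline can be sketched as OFD-UCB with the confidence bonus removed and an $\epsilon$-random agent choice; `eps` is an assumed hyperparameter value, not the one used in the paper.

```python
import numpy as np

def ofd_greedy_round(x, V, b, v, goodness, rng, eps=0.1):
    """Epsilon-greedy baseline: greedy w.r.t. the ridge estimate (no bonus),
    with probability eps pick an agent uniformly at random."""
    n = x.shape[0]
    if rng.random() < eps:
        return int(rng.integers(n))
    theta_hat = np.linalg.solve(V, b)
    est = x @ theta_hat                      # plain estimates, no confidence term
    scores = []
    for k in range(n):
        v_k = v.copy()
        v_k[k] += est[k]
        scores.append(goodness(v_k))
    return int(np.argmax(scores))
```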
Regret vs. number of agents and dimension of the item-agent feature vector.
The number of agents and the dimension of the item-agent feature vector control the difficulty of the online fair division problem. As their values increase, the problem becomes more difficult, making it harder to allocate each item to the best agent. We verify this by observing how the regret of our proposed algorithms changes as we vary the number of agents and the dimension. In these experiments, we use the linear utility function and keep all other parameters fixed while varying either the number of agents or the dimension. As shown in Fig. 2a and Fig. 2b, the regret of our UCB- and TS-based algorithms increases as we increase the number of agents. We observe the same trend when we increase the dimension of the item-agent feature vector, as shown in Fig. 2c and Fig. 2d. In all experiments, we also observe that the TS-based algorithm performs better than its UCB-based counterpart (as seen in Fig. 2 by comparing the regret of both algorithms). However, for a different value of the control parameter, we observe the same trend for a small number of agents, but the trend reverses as the number of agents increases, as shown in Fig. 3; this happens because the value of the goodness function (defined in Eq. 2) decreases with an increasing number of agents in that regime.
Total utility, Gini coefficient, and ratio of minimum utility to total utility vs. the control parameter.
We measure the fairness of the item allocation among agents using the Gini coefficient (Dorfman, 1979) (lower is better) and the ratio of minimum utility to total utility (increasing this ratio is the same as maximizing the max-min utility, hence a higher value is better). We use the total utility (USW) as a measure of efficiency (a higher total utility is better). Since the control parameter of the goodness function governs the balance between fairness and efficiency, we want to see how varying it changes the total utility, the Gini coefficient, and the ratio of minimum utility to total utility. To do so, we use the linear utility function, keep the number of items and agents fixed, and vary the control parameter over a set of values. For each value, we record the total utility, the Gini coefficient, and the ratio of minimum utility to total utility after allocating all items. We repeat our experiments and report the average values of the observed results in Fig. 4. As we increase the control parameter, efficiency is preferred over fairness: as expected, the total utility increases (Fig. 4c), the Gini coefficient increases (Fig. 4b), and the ratio of minimum utility to total utility decreases (Fig. 4a). As expected, OFD-Uniform performs worst. Though OFD-Greedy behaves similarly to its UCB- and TS-based counterparts here, it suffers from higher regret, as shown in Fig. 1.
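The fairness and efficiency metrics used in this experiment can be computed as follows; the Gini coefficient uses the standard closed-form expression for sorted values.

```python
import numpy as np

def gini_coefficient(v):
    """Gini coefficient of the agents' cumulative utilities (lower = fairer)."""
    v = np.sort(np.asarray(v, dtype=float))
    n = v.size
    cum = np.cumsum(v)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

def min_to_total_ratio(v):
    """Ratio of the worst-off agent's utility to the total utility (higher = fairer)."""
    v = np.asarray(v, dtype=float)
    return v.min() / v.sum()

def total_utility(v):
    """Utilitarian social welfare (higher = more efficient)."""
    return float(np.sum(v))
```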
6 Conclusion
This paper considers a novel online fair division problem involving multiple agents in which a learner must irrevocably allocate an indivisible item to one of the agents while satisfying a given fairness and efficiency constraint. Since the number of items can be very large in our setting, with only a few copies of each item in many real-life applications, it is difficult to estimate the utility for all item-agent pairs. To address this challenge, we use contextual bandits to get optimistic utility estimates for each item-agent pair. We then propose algorithms for online fair division that use these optimistic utility estimates for allocating items to agents and show that they enjoy sub-linear regret guarantees. Our experimental results further validate the performance of the proposed algorithms. For future research, one can explore other fairness constraints, such as envy-freeness and proportionality. Another promising direction is investigating fairness across multiple types of items in the online fair division setting with unknown utility functions.
References
- Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Proc. NeurIPS, pages 2312–2320, 2011.
- Agrawal and Goyal (2013) S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In Proc. ICML, pages 127–135, 2013.
- Aleksandrov and Walsh (2017) M. Aleksandrov and T. Walsh. Pure nash equilibria in online fair division. In Proc. IJCAI, pages 42–48, 2017.
- Aleksandrov and Walsh (2020) M. Aleksandrov and T. Walsh. Online fair division: A survey. In Proc. AAAI, pages 13557–13562, 2020.
- Aleksandrov et al. (2015) M. Aleksandrov, H. Aziz, S. Gaspers, and T. Walsh. Online fair division: Analysing a food bank problem. arXiv:1502.07571, 2015.
- Amanatidis et al. (2022) G. Amanatidis, G. Birmpas, A. Filos-Ratsikas, and A. A. Voudouris. Fair division of indivisible goods: A survey. arXiv:2202.07551, 2022.
- Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, pages 235–256, 2002.
- Bateni et al. (2022) M. Bateni, Y. Chen, D. F. Ciocan, and V. Mirrokni. Fair resource allocation in a volatile marketplace. Operations Research, pages 288–308, 2022.
- Benabbou et al. (2019) N. Benabbou, M. Chakraborty, and Y. Zick. Fairness and diversity in public resource allocation problems. Bulletin of the Technical Committee on Data Engineering, 2019.
- Benadè et al. (2023) G. Benadè, A. M. Kazachkov, A. D. Procaccia, A. Psomas, and D. Zeng. Fair and efficient online allocations. Operations Research, 2023.
- Brams and Taylor (1996) S. J. Brams and A. D. Taylor. Fair Division: From cake-cutting to dispute resolution. Cambridge University Press, 1996.
- Budish (2011) E. Budish. The combinatorial assignment problem: Approximate competitive equilibrium from equal incomes. Journal of Political Economy, pages 1061–1103, 2011.
- Caragiannis et al. (2019) I. Caragiannis, D. Kurokawa, H. Moulin, A. D. Procaccia, N. Shah, and J. Wang. The unreasonable fairness of maximum nash welfare. ACM Transactions on Economics and Computation, pages 1–32, 2019.
- Chapelle and Li (2011) O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Proc. NeurIPS, pages 2249–2257, 2011.
- Chen et al. (2021) G. Chen, X. Li, and Y. Ye. Fairer lp-based online allocation via analytic center. arXiv:2110.14621, 2021.
- Chen et al. (2020) Y. Chen, A. Cuellar, H. Luo, J. Modi, H. Nemlekar, and S. Nikolaidis. Fair contextual multi-armed bandits: Theory and experiments. In Proc. UAI, pages 181–190, 2020.
- Chowdhury and Gopalan (2017) S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In Proc. ICML, pages 844–853, 2017.
- Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In Proc. AISTATS, pages 208–214, 2011.
- Claure et al. (2020) H. Claure, Y. Chen, J. Modi, M. Jung, and S. Nikolaidis. Multi-armed bandits with fairness constraints for distributing resources to human teammates. In Proc. HRI, pages 299–308, 2020.
- Codenotti and Varadarajan (2007) B. Codenotti and K. Varadarajan. Computation of market equilibria by convex programming. Algorithmic Game Theory, pages 135–158, 2007.
- Cole and Gkatzelis (2018) R. Cole and V. Gkatzelis. Approximating the nash social welfare with indivisible items. SIAM Journal on Computing, pages 1211–1236, 2018.
- Demko and Hill (1988) S. Demko and T. P. Hill. Equitable distribution of indivisible objects. Mathematical Social Sciences, pages 145–158, 1988.
- Dorfman (1979) R. Dorfman. A formula for the gini coefficient. The review of economics and statistics, pages 146–149, 1979.
- Dubins and Spanier (1961) L. E. Dubins and E. H. Spanier. How to cut a cake fairly. The American Mathematical Monthly, pages 1–17, 1961.
- Edward Su (1999) F. Edward Su. Rental harmony: Sperner’s lemma in fair division. The American mathematical monthly, pages 930–942, 1999.
- Eisenberg and Gale (1959) E. Eisenberg and D. Gale. Consensus of subjective probabilities: The pari-mutuel method. The Annals of Mathematical Statistics, pages 165–168, 1959.
- Feldman and Serrano (2006) A. M. Feldman and R. Serrano. Welfare economics and social choice theory. Springer Science & Business Media, 2006.
- Gal et al. (2016) Y. Gal, M. Mash, A. D. Procaccia, and Y. Zick. Which is the fairest (rent division) of them all? In Proc. EC, pages 67–84, 2016.
- Gao et al. (2021) Y. Gao, A. Peysakhovich, and C. Kroer. Online market equilibrium with application to fair division. In Proc. NeurIPS, pages 27305–27318, 2021.
- Gillen et al. (2018) S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In Proc. NeurIPS, pages 2605–2614, 2018.
- Grazzi et al. (2022) R. Grazzi, A. Akhavan, J. I. Falk, L. Cella, and M. Pontil. Group meritocratic fairness in linear contextual bandits. In Proc. NeurIPS, pages 24392–24404, 2022.
- Hossain et al. (2021) S. Hossain, E. Micha, and N. Shah. Fair algorithms for multi-agent multi-armed bandits. In Proc. NeurIPS, pages 24005–24017, 2021.
- Jain and Vazirani (2010) K. Jain and V. V. Vazirani. Eisenberg–gale markets: algorithms and game-theoretic properties. Games and Economic Behavior, pages 84–106, 2010.
- Joseph et al. (2016) M. Joseph, M. Kearns, J. Morgenstern, S. Neel, and A. Roth. Fair algorithms for infinite and contextual bandits. arXiv:1610.09559, 2016.
- Kash et al. (2014) I. Kash, A. D. Procaccia, and N. Shah. No agent left behind: Dynamic fair division of multiple resources. Journal of Artificial Intelligence Research, pages 579–603, 2014.
- Krause and Ong (2011) A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Proc. NeurIPS, pages 2447–2455, 2011.
- Lan et al. (2010) T. Lan, D. Kao, M. Chiang, and A. Sabharwal. An axiomatic theory of fairness in network resource allocation. In Proc. INFOCOM, pages 1–9, 2010.
- Lattimore and Szepesvári (2020) T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- Li et al. (2019) F. Li, J. Liu, and B. Ji. Combinatorial sleeping bandits with fairness constraints. IEEE Transactions on Network Science and Engineering, pages 1799–1813, 2019.
- Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. WWW, pages 661–670, 2010.
- Li et al. (2017) L. Li, Y. Lu, and D. Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proc. ICML, pages 2071–2080, 2017.
- Liu et al. (2017) Y. Liu, G. Radanovic, C. Dimitrakakis, D. Mandal, and D. C. Parkes. Calibrated fairness in bandits. arXiv:1707.01875, 2017.
- Mehta et al. (2013) A. Mehta et al. Online matching and ad allocation. Foundations and Trends® in Theoretical Computer Science, pages 265–368, 2013.
- Moulin (2004) H. Moulin. Fair division and collective welfare. MIT press, 2004.
- Nash et al. (1950) J. F. Nash et al. The bargaining problem. Econometrica, pages 155–162, 1950.
- Patil et al. (2021) V. Patil, G. Ghalme, V. Nair, and Y. Narahari. Achieving fairness in the stochastic multi-armed bandit problem. Journal of Machine Learning Research, pages 1–31, 2021.
- Procaccia et al. (2024) A. D. Procaccia, B. Schiffer, and S. Zhang. Honor among bandits: No-regret learning for online fair division. arXiv:2407.01795, 2024.
- Schumann et al. (2019) C. Schumann, Z. Lang, N. Mattei, and J. P. Dickerson. Group fairness in bandit arm selection. arXiv:1912.03802, 2019.
- Sim et al. (2021) R. H. L. Sim, Y. Zhang, B. K. H. Low, and P. Jaillet. Collaborative Bayesian optimization with fair regret. In Proc. ICML, pages 9691–9701, 2021.
- Sinclair et al. (2022) S. R. Sinclair, S. Banerjee, and C. L. Yu. Sequential fair allocation: Achieving the optimal envy-efficiency tradeoff curve. ACM SIGMETRICS Performance Evaluation Review, pages 95–96, 2022.
- Slivkins (2019) A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 2019.
- Srinivas et al. (2010) N. Srinivas, A. Krause, S. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, page 1015–1022, 2010.
- Steinhaus (1948) H. Steinhaus. The problem of fair division. Econometrica, pages 101–104, 1948.
- Talebi and Proutiere (2018) M. S. Talebi and A. Proutiere. Learning proportionally fair allocations with low regret. Proceedings of the ACM on Measurement and Analysis of Computing Systems, pages 1–31, 2018.
- Varian (1974) H. R. Varian. Equity, envy, and efficiency. Journal of Economic Theory, pages 63–91, 1974.
- Vazirani (2007) V. V. Vazirani. Combinatorial algorithms for market equilibria. Algorithmic game theory, pages 103–134, 2007.
- Walsh (2011) T. Walsh. Online cake cutting. In Proc. ADT, pages 292–305, 2011.
- Wang and Joachims (2021) L. Wang and T. Joachims. User fairness, item fairness, and diversity for rankings in two-sided markets. In Proc. SIGIR, pages 23–41, 2021.
- Wang et al. (2021) L. Wang, Y. Bai, W. Sun, and T. Joachims. Fairness of exposure in stochastic bandits. In Proc. ICML, pages 10686–10696, 2021.
- Weymark (1981) J. A. Weymark. Generalized gini inequality indices. Mathematical Social Sciences, pages 409–430, 1981.
- Williams and Rasmussen (2006) C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning. MIT press, 2006.
- Wu et al. (2023) Y. Wu, Z. Zheng, and T. Zhu. Best arm identification with fairness constraints on subpopulations. In Proc. WSC, pages 540–551, 2023.
- Yamada et al. (2024) H. Yamada, J. Komiyama, K. Abe, and A. Iwasaki. Learning fair division from bandit feedback. In Proc. AISTATS, pages 3106–3114, 2024.
- Zhang et al. (2021) W. Zhang, D. Zhou, L. Li, and Q. Gu. Neural Thompson sampling. In Proc. ICLR, 2021.
- Zhou et al. (2020) D. Zhou, L. Li, and Q. Gu. Neural contextual bandits with UCB-based exploration. In Proc. ICML, pages 11492–11502, 2020.
Appendix A Regret Analysis.
A.1 Proof of Theorem 1
We first state the following result that we will use in the proof of Theorem 1.
Lemma 1 (Theorem 2 of Abbasi-Yadkori et al. (2011)).
Let $\|x_{s,a_s}\|_2 \le L$ for all $s$, $\|\theta_\star\|_2 \le S$, $V_t = \lambda I_d + \sum_{s=1}^{t-1} x_{s,a_s} x_{s,a_s}^\top$, $\hat\theta_t = V_t^{-1} \sum_{s=1}^{t-1} y_s x_{s,a_s}$, and $\varepsilon_t$ be $R$-sub-Gaussian. Then, with probability at least $1-\delta$, for all $t \ge 1$,
$$\big\|\hat\theta_t - \theta_\star\big\|_{V_t} \;\le\; R\sqrt{d \log\!\left(\frac{1 + (t-1)L^2/\lambda}{\delta}\right)} + \sqrt{\lambda}\, S.$$
See Theorem 1.
A.2 Proof of Theorem 2
We first state the following results that we will use to prove our regret upper bound for Theorem 2.
Lemma 2 (Lemma 1 of Weymark (1981)).
Let $a, b \in \mathbb{R}^n$ be oppositely ordered vectors. If $\Pi$ is the set of all permutations of $[n]$, then $\sum_{i=1}^{n} a_i b_i \le \sum_{i=1}^{n} a_i b_{\pi(i)}$ for every $\pi \in \Pi$.
Lemma 3 (Lemma 1 of Sim et al. (2021)).
Consider positive, non-increasing weights $w_1 \ge w_2 \ge \cdots \ge w_n > 0$ and let $G(v) = \sum_{i=1}^{n} w_i v_{\sigma(i)}$, where $v$ is an $n$-dimensional utility vector and $v_{\sigma(i)}$ returns the $i$-th smallest element of $v$. For any two utility vectors $v$ and $v'$, if there exists an agent $k$ such that $v_k \le v'_k$ and $v_j = v'_j$ for all $j \ne k$, then $G(v) \le G(v')$.
The regret analysis of our proposed algorithms depends on bounding the instantaneous regret in each round. The following result gives an upper bound on the instantaneous regret when using an OFD compatible contextual bandit algorithm $\mathcal{A}$.
Lemma 4.
Let G be the goodness function defined in Eq. 2 and $\mathcal{A}$ be an OFD compatible contextual bandit algorithm with estimation error bound $\epsilon_t(\cdot)$. Then, the instantaneous regret incurred by OFD-$\mathcal{A}$ for selecting agent $a_t$ for item $x_t$ is
Proof.
Recall the definition of the goodness function, i.e., $G(v) = \sum_{i=1}^{n} w_i v_{\sigma(i)}$ (Eq. 2).
The returns the utilities in the same order as , while returns the utilities in the same order as . The first inequality is from applying with and then using the triangle inequality to get . The second inequality follows from Lemma 3, as an additional utility of is added to the total collected utility of agent . The third inequality follows from the fact that maximizes the goodness function. The fourth inequality follows from Lemma 2. The last inequality is due to . This completes the proof, showing that . ∎
See Theorem 2.
Proof.
Let $r_t$ denote the instantaneous regret incurred in round $t$ when using contextual bandit algorithm $\mathcal{A}$. After applying Lemma 4, the regret with contextual bandit algorithm $\mathcal{A}$ after $T$ rounds is given as follows:
A.3 Proof of Theorem 3
See Theorem 3.
Proof.
We first obtain an upper bound on the instantaneous regret:
The first inequality follows from the monotonicity property of the goodness function after applying with and then using the triangle inequality to get . The second inequality follows from the fact that maximizes the goodness function. The third inequality is due to the locally Lipschitz property of the goodness function. Let . Then, after $T$ rounds, the regret with contextual bandit algorithm $\mathcal{A}$ is given as follows:
Appendix B Auxiliary Observations
We first mention some observations that will be useful to prove the monotonicity and Lipschitz continuity properties for our various choices of the goodness function.
Fact 1.
Suppose $f$ and $g$ are Lipschitz continuous functions with non-negative bounded domains. Then the following statements are true:
1. $f + g$ is a Lipschitz continuous function.
2. $f \cdot g$ is a Lipschitz continuous function.
Proof.
Let $f$ and $g$ be Lipschitz continuous functions on a bounded domain with Lipschitz constants $L_f$ and $L_g$, respectively.
1. We first prove that $f + g$ is a Lipschitz continuous function. Consider the following difference:
where . The first inequality follows from the triangle inequality, while the second inequality uses Lipschitz continuity.
2. We now prove that $f \cdot g$ is a Lipschitz continuous function. Consider the following difference:
The last inequality is due to the boundedness of the functions, with the Lipschitz constant chosen appropriately. ∎
Fact 2.
Suppose $f_1, \ldots, f_m$ with $m \ge 2$ are Lipschitz continuous functions with non-negative bounded domains. Then the following statements are true:
1. $\sum_{i=1}^{m} c_i f_i$ is a Lipschitz continuous function, where all $c_i$ are constants.
2. $\prod_{i=1}^{m} f_i$ is a Lipschitz continuous function.
Proof.
1. We first prove that $\sum_{i=1}^{m} c_i f_i$ is a Lipschitz continuous function. Consider the following difference:
The first inequality is due to the triangle inequality, while the second inequality is obtained by using the Lipschitz continuity of each function.
2. We now prove that $\prod_{i=1}^{m} f_i$ is a Lipschitz continuous function. We will use induction for this proof. For the base case, when $m = 2$, the statement is true due to Fact 1. For the induction hypothesis, assume the statement is true for $m - 1$ functions. Now, we consider the inductive step (for $m$ functions) as follows:
The first inequality follows from the triangle inequality. The last inequality is due to the boundedness of the functions combined with the induction hypothesis. ∎
Fact 3.
The function $\log(x)$ is Lipschitz continuous when $x$ is in a non-negative domain bounded away from zero, i.e., $x \ge c$ for some constant $c > 0$.
Proof.
Suppose $x, y \ge c > 0$. Now consider the following difference.
By the mean value theorem, $|\log(x) - \log(y)| \le \tfrac{1}{c}\,|x - y|$ for all $x, y \ge c$. Therefore, it proves our claim. ∎
Fact 4.
Suppose $f$ is a Lipschitz continuous function that is bounded and positive (bounded away from zero); then $\log(f)$ is a Lipschitz continuous function.
Proof.
B.1 Goodness functions that are locally monotonically non-decreasing and Lipschitz continuous
We will briefly discuss the different choices of the goodness function for which we can theoretically guarantee a sub-linear regret upper bound.
Weighted Gini social-evaluation function.
The weighted Gini social-evaluation function is given as follows: $G(v) = \sum_{i=1}^{n} w_i\, v_{\sigma(i)}$, where the weights are positive and non-increasing and $\sigma$ sorts the cumulative utilities in increasing order.
Since all $w_i$'s are positive constants and the cumulative utilities are bounded, positive, and continuous, the weighted Gini social-evaluation function satisfies the locally Lipschitz condition. Also, if we change the $k$-th component while keeping the rest of the components fixed, the locally monotonically non-decreasing condition holds (this also follows from Lemma 3). Therefore, Theorem 3 also holds for the weighted Gini social-evaluation function.
Targeted weights.
Let us define a fixed target weight vector in advance, with the learner’s goal being to achieve these targeted weight distributions after each allocation. These target weights represent the desired fraction of the total cumulative utility each agent should receive. Let be the sum of total utility after allocating item to agent . Suppose the target weighted fraction of the utility vector is , which is given. The learner’s goal is to obtain the targeted weight vector at the end of the allocation process. The goodness function for this fairness constraint is defined as follows:
where and for [i.e., ESW]. The utility in is defined as , where is the proportional ratio for the -th agent. Let be the vector of targeted ratio of agents’ cumulative utility to total utilities collected by all agents (i.e., system’s total utility), where . Then, , e.g., if and , then .
Nash Social Welfare.
The Nash Social Welfare (NSW) is defined as the product of the utilities of all agents, i.e., $\mathrm{NSW}(v) = \prod_{k=1}^{n} v_k$. Since we are maximizing the goodness function, we can ignore any constant factor here. First, observe that the utility function is positive, so NSW satisfies the monotonically non-decreasing property. Also, we assume that the utility of each individual agent is positive and bounded, which implies that NSW is locally Lipschitz continuous. As a result, the properties given in Definition 2 hold for NSW.
Log Nash Social Welfare.
This is defined as the logarithm of the Nash social welfare function and is commonly considered in the fair division literature (Cole and Gkatzelis, 2018; Talebi and Proutiere, 2018).
We assume that the utility function is bounded and positive. Using Fact 1–Fact 4, it is easy to observe that the log-NSW is a Lipschitz continuous function. Also, log-NSW satisfies the monotone property. By virtue of these observations, we can guarantee sub-linear regret, i.e., Theorem 3 holds.
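For completeness, minimal sketches of the NSW and log-NSW goodness functions discussed above (assuming strictly positive cumulative utilities, as required):

```python
import numpy as np

def nsw_goodness(v):
    """Nash social welfare: product of (positive) agent utilities."""
    return float(np.prod(v))

def log_nsw_goodness(v):
    """Log Nash social welfare: sum of log utilities; requires v > 0."""
    return float(np.sum(np.log(v)))
```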