Policy Aggregation
Abstract
We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods — including approval voting, Borda count, the proportional veto core, and quantile fairness — can be practically applied to policy aggregation.
1 Introduction
Early discussions of AI value alignment often focused on learning desirable behavior from an individual teacher, for example, through inverse reinforcement learning [27, 1]. In recent years, however, the conversation has shifted towards aligning AI models with large groups of people or even entire societies. This shift is exemplified at a policy level by OpenAI’s “democratic inputs to AI” program [41] and Meta’s citizens’ assembly on AI governance [8], and at a technical level by the ubiquity of reinforcement learning from human feedback [30] as a method for fine-tuning large language models.
We formalize the challenge of value alignment with multiple individuals as a problem that we view as fundamental — policy aggregation. Our starting point is the common assumption that the environment can be represented as a Markov decision process (MDP). While the states, actions and transition functions are shared by all agents, their reward functions — which incorporate values, priorities or subjective beliefs — may be different. In particular, each agent has its own optimal policy in the underlying MDP. Our question is this: How should we aggregate the individual policies into a desirable collective policy?
A naïve answer is to define a new reward function that is the sum of the agents’ reward functions (for each state-action pair separately) and compute an optimal policy for this aggregate reward function; such a policy would guarantee maximum utilitarian social welfare. This approach has a major shortcoming, however, in that it is sensitive to affine transformations of rewards: if we doubled one of the reward functions, for example, the aggregate optimal policy may change. This is an issue because each agent’s individual optimal policy is invariant to (positive) affine transformations of rewards, so while it is possible to recover a reward function that induces an agent’s optimal policy by observing their actions over time (and we assume this is done accurately, in order to focus on the essence of the policy aggregation problem), it is impossible to distinguish between reward functions that are affine transformations of each other. More broadly, economists and moral philosophers have long been skeptical about interpersonal comparisons of utility [19] due to the lack of a universal scale — an issue that is especially pertinent in our context. Therefore, aggregation methods that are invariant to affine transformations are strongly preferred.
Our approach. To develop such aggregation methods, we look to social choice theory, which typically deals with the aggregation of ordinal preferences. To take a canonical example, suppose agents report rankings over $m$ alternatives. Under the Borda count rule, each voter gives $m - j$ points to the alternative they rank in the $j$-th position, and the alternative with the most points overall is selected.
The voting approach can be directly applied to our setting. For each agent, it is (in theory) possible to compute the value of every possible (deterministic) policy, and rank them all by value. Then, any standard voting rule, such as Borda count, can be used to aggregate the rankings over policies and single out a desirable policy. The caveat, of course, is that this method is patently impractical, because the number of policies is exponential in the number of states of the MDP.
The main insight underlying our approach is that ordinal preferences over policies have a much more practical volumetric interpretation in the state-action occupancy polytope $\mathcal{O}$. Roughly speaking, a point in the state-action occupancy polytope represents a (stochastic) policy through the frequencies with which it is expected to visit different state-action pairs. If a policy is preferred by an agent to a subset of policies $D \subseteq \mathcal{O}$, its “rank” is the volume of $D$ as a fraction of the volume of $\mathcal{O}$. The “score” of a policy under Borda count, for example, can be interpreted as the sum of these “ranks” over all agents.
Our results. We investigate two classes of rules from social choice theory, those that guarantee a notion of fairness and voting rules. By mapping ordinal preferences to the state-action occupancy polytope, we adapt the different rules to the policy aggregation problem.
The former class is examined in Section 5. As a warm-up, we start from the notion of the proportional veto core; it follows from recent work by Chaudhury et al. [7] that the volumetric interpretation of this notion is nonempty and that a policy in it can be computed efficiently. We then turn to quantile fairness, which was recently introduced by Babichenko et al. [4]; we prove that the volumetric interpretation of this notion yields guarantees that are far better than those known for the original, discrete setting, and we design a computationally efficient algorithm to optimize those guarantees.
The latter class is examined in Section 6; we focus on volumetric interpretations of $\alpha$-approval (with the ubiquitous plurality rule as a special case) and the aforementioned Borda count. In contrast to the rules studied in Section 5, existence is a nonissue for these voting rules, but computation is a challenge, and indeed we establish several computational hardness results. To overcome this obstacle, we implement voting rules for policy aggregation through mixed integer linear programming, which leads to practical solutions.
Finally, our experiments in Section 7 evaluate the policies returned by different rules based on their fairness; the results identify quantile fairness as especially appealing. The experiments also illustrate the advantage of our approach over rules that optimize measures of social welfare (which are sensitive to affine transformations of the rewards).
2 Related Work
Noothigattu et al. [28] consider a setting related to ours, in that different agents have different reward functions and different policies that must be aggregated. However, they assume that the agents’ reward functions are noisy perturbations of a ground-truth reward function, and the goal is to learn an optimal policy according to the ground-truth rewards. In social choice terms, our work is akin to the typical setting where subjective preferences must be aggregated, whereas the work of Noothigattu et al. [28] is conceptually similar to the setting where votes are seen as noisy estimates of a ground-truth ranking [39, 9, 6].
Chaudhury et al. [7] study a problem completely different from ours: fairness in federated learning. However, their technical approach served as an inspiration for ours. Specifically, they consider the proportional veto core and transfer it to the federated learning setting using volumetric arguments, by considering volumes of subsets in the space of models. Their proof that the proportional veto core is nonempty carries over to our setting, as we explain in Section 5.
There is a body of work on multi-objective reinforcement learning (MORL) and planning that uses a scalarization function to reduce the problem to a single-objective one [32, 21]. Other solutions to MORL focus on developing algorithms to identify a set of policies approximating the problem’s Pareto frontier [38]. A line of work more closely related to ours focuses on fairness in sequential decision making, often taking the scalarization approach to aggregate agents’ preferences by maximizing a (cardinal) social welfare function, which maps the vector of agent utilities to a single value. Ogryczak et al. [29] and Siddique et al. [34] investigate generalized Gini social welfare, and Mandal and Gan [25], Fan et al. [13] and Ju et al. [23] focus on Nash and egalitarian social welfare. Alamdari et al. [3] study this problem in a non-Markovian setting, where fairness depends on the history of actions over time, and introduce concepts to assess different fairness criteria at varying time intervals. A shortcoming of these solutions is that they are not invariant to affine transformations of rewards — a property that is crucial in our setting, as discussed earlier.
Our work is closely related to the pluralistic alignment literature, aiming to develop AI systems that reflect the values and preferences of diverse individuals [35, 10]. Alamdari et al. [2] propose a framework in reinforcement learning in which an agent learns to act in a way that is considerate of the values and perspectives of humans within a particular environment. Concurrent work explores reinforcement learning from human feedback (RLHF) from a social choice perspective, where the reward model is based on pairwise human preferences, often constructed using the Bradley-Terry model [5]. Zhong et al. [42] consider the maximum Nash and egalitarian welfare solutions, and Swamy et al. [36] propose a method based on maximal lotteries due to Fishburn [14].
3 Preliminaries
For $t \in \mathbb{N}$, let $[t] = \{1, \ldots, t\}$. For a closed set $X$, let $\Delta(X)$ denote the probability simplex over the set $X$. We denote the dot product of two vectors $x, y \in \mathbb{R}^d$ by $x \cdot y$. A halfspace in $\mathbb{R}^d$ determined by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ is the set of points $x$ satisfying $w \cdot x \leq b$. A polytope is the intersection of a finite number of halfspaces, i.e., a convex subset of $d$-dimensional space determined by a set of linear constraints $\{x \in \mathbb{R}^d : Ax \leq b\}$, where $A$ is a matrix of coefficients of linear inequalities and $b$ is a vector.
3.1 Multi-Objective Markov Decision Processes
A multi-objective Markov decision process (MOMDP) is a tuple defined as $(S, A, P, \{r_i\}_{i \in [n]}, \mu)$ for the average-reward case and $(S, A, P, \{r_i\}_{i \in [n]}, \mu, \gamma)$ for the discounted-reward case, where $S$ is a finite set of states, $A$ is a finite set of actions, and $P : S \times A \to \Delta(S)$ is the transition probability distribution; $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ by taking action $a$ in $s$. For $i \in [n]$, $r_i : S \times A \to \mathbb{R}$ is the reward function of the $i$-th agent, the initial state is sampled from $\mu \in \Delta(S)$, and $\gamma \in (0, 1)$ is the discount factor.
A (Markovian) policy $\pi$ is a probability distribution $\pi(\cdot \mid s)$ over the actions given the state $s$. A policy is deterministic if at each state one action is selected with probability $1$, and otherwise it is stochastic. The expected average return of agent $i$ for a policy $\pi$ and the expected discounted return of agent $i$ for a policy $\pi$ are defined over an infinite time horizon as
$$V_i^{\mathrm{avg}}(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} r_i(s_t, a_t)\right], \qquad V_i^{\gamma}(\pi) = (1 - \gamma)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, a_t)\right],$$
where the expectation is over the state-action pairs $(s_t, a_t)$ at time $t$ based on both the policy $\pi$ and the transition function $P$.
Definition 1 (state-action occupancy measure).
Let $p^{\pi}_t \in \Delta(S)$ be the probability measure over states at time $t$ under policy $\pi$. The state-action occupancy measure for state $s$ and action $a$ is defined as
$$d^{\pi}(s, a) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p^{\pi}_t(s)\, \pi(a \mid s)$$
in the average-reward case, and as
$$d^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, p^{\pi}_t(s)\, \pi(a \mid s)$$
in the discounted-reward case.
For both the average and discounted cases, we can rewrite the expected return as the dot product of the state-action occupancy measure and the rewards, that is, $V_i(\pi) = d^{\pi} \cdot r_i = \sum_{s, a} d^{\pi}(s, a)\, r_i(s, a)$. In addition, the policy can be calculated given the occupancy measure as $\pi(a \mid s) = d(s, a) / \sum_{a'} d(s, a')$ if $\sum_{a'} d(s, a') > 0$, and arbitrarily (say, uniformly at random) otherwise.
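To make the correspondence concrete, here is a minimal sketch (our own illustration, not code from the paper) that computes the discounted occupancy measure of a fixed policy on a small synthetic MDP, checks that the expected return equals the dot product with the reward, and recovers the policy from the occupancy measure. The $(1-\gamma)$ normalization is an assumption chosen so that occupancies sum to one, matching Definition 2.

```python
import numpy as np

# Minimal sketch: discounted state-action occupancy measure of a fixed stochastic
# policy, and the expected return as a dot product. Occupancies are normalized by
# (1 - gamma) so that they sum to 1 (an assumed convention matching Definition 2).

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.95
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
r = rng.random((nS, nA))                                           # one agent's reward
mu = np.full(nS, 1.0 / nS)                                         # initial distribution
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)     # pi[s, a] = pi(a|s)

# State transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum("sa,sat->st", pi, P)
# Discounted state occupancy: rho = (1 - gamma) * (I - gamma * P_pi^T)^{-1} mu.
rho = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
d = rho[:, None] * pi                                              # d[s, a] = rho(s) * pi(a|s)

print("occupancies sum to", d.sum())            # ~1 under this normalization
print("expected (normalized) return:", (d * r).sum())

# Recovering the policy from an occupancy measure: pi(a|s) = d(s, a) / sum_a' d(s, a').
pi_recovered = d / d.sum(axis=1, keepdims=True)
assert np.allclose(pi_recovered, pi)
```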
Definition 2 (state-action occupancy polytope [31, 40]).
For a MOMDP $\mathcal{M}$, in the average-reward case, the space of valid state-action occupancies is the polytope
$$\mathcal{O} = \Big\{ d \in \mathbb{R}^{S \times A}_{\geq 0} \;:\; \sum_{s, a} d(s, a) = 1, \;\; \sum_{a} d(s', a) = \sum_{s, a} P(s' \mid s, a)\, d(s, a) \;\; \forall s' \in S \Big\}.$$
Similarly, in the discounted-reward case, the space of state-action occupancies is the polytope
$$\mathcal{O} = \Big\{ d \in \mathbb{R}^{S \times A}_{\geq 0} \;:\; \sum_{a} d(s', a) = (1 - \gamma)\, \mu(s') + \gamma \sum_{s, a} P(s' \mid s, a)\, d(s, a) \;\; \forall s' \in S \Big\}.$$
A mechanism receives a MOMDP and aggregates the agents’ preferences into a policy. A standard economic efficiency axiom in the social choice literature is Pareto optimality.
Definition 3 (Pareto optimality).
For a MOMDP $\mathcal{M}$ and $\varepsilon \geq 0$, a policy $\pi$ is $\varepsilon$-Pareto optimal if there does not exist another policy $\pi'$ such that $V_i(\pi') \geq V_i(\pi) + \varepsilon$ for all $i \in [n]$, with strict inequality for at least one agent. For $\varepsilon = 0$, we simply call such policies Pareto optimal.
We call a mechanism Pareto optimal if it always returns a Pareto optimal policy. In special cases where all agents unanimously agree on an optimal policy, Pareto optimality implies that the mechanism will return one such policy. We discuss Pareto optimal implementations of all mechanisms in this work.
3.2 Voting and Social Choice Functions
In the classical social choice setting, we have a set of $n$ agents and a set of $m$ alternatives. The preferences of voter $i$ are represented as a strict ordering $\sigma_i$ over the alternatives, where $a \succ_i b$ denotes that agent $i$ prefers $a$ over $b$. A (possibly randomized) voting rule aggregates agents’ preferences and returns an alternative or a distribution over the alternatives as the collective decision.
Positional Scoring Rules. A positional scoring rule with scoring vector $(s_1, \ldots, s_m)$ such that $s_1 \geq s_2 \geq \cdots \geq s_m$ works as follows. Each agent gives a score of $s_1$ to their top choice, a score of $s_2$ to their second choice, and so on. The votes are tallied and an alternative with the maximum total score is selected. A few of the well-known positional scoring rules are: Plurality: $(1, 0, \ldots, 0)$; Borda: $(m-1, m-2, \ldots, 0)$; $k$-approval: $(1, \ldots, 1, 0, \ldots, 0)$ with $k$ ones.
4 Occupancy Polytope as the Space of Alternatives
In a MOMDP, each agent $i$ incorporates their values and preferences into their respective reward function $r_i$. Agent $i$ prefers $\pi$ over $\pi'$ if and only if $\pi$ achieves higher expected return, $V_i(\pi) > V_i(\pi')$, and is indifferent between two policies $\pi$ and $\pi'$ if and only if $V_i(\pi) = V_i(\pi')$. As discussed before, given a state-action occupancy measure $d$ in the state-action occupancy polytope $\mathcal{O}$, we can recover the corresponding policy $\pi^d$. Therefore, we can interpret $\mathcal{O}$ as the domain of all possible alternatives over which the agents have heterogeneous weak preferences (with ties). Agent $i$ prefers $d$ to $d'$ in $\mathcal{O}$ if and only if they prefer $\pi^d$ to $\pi^{d'}$. We study the policy aggregation problem through this lens; specifically, we design or adapt voting mechanisms where the (continuous) space of alternatives is $\mathcal{O}$ and agents have weak preferences over it determined by their reward functions $r_i$.
Affine transformation and reward normalization. A particular benefit of this interpretation, as mentioned before, is that all positive affine transformations of the reward functions, i.e., $r'_i = c \cdot r_i + b$ for $c > 0$ and $b \in \mathbb{R}$, yield the same weak ordering over the polytope. Hence, we can assume without loss of generality that rewards are normalized. Specifically, we can ignore agents that are indifferent between all policies, i.e., those whose expected return is constant over $\mathcal{O}$, and normalize the remaining reward functions such that $\min_{d \in \mathcal{O}} r_i \cdot d = 0$ and $\max_{d \in \mathcal{O}} r_i \cdot d = 1$. The relative ordering of the policies does not change since for all points $d \in \mathcal{O}$ we have $\sum_{s,a} d(s,a) = 1$.
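This normalization can be carried out by solving two linear programs over the occupancy polytope, one for the worst return and one for the best. The sketch below is our own illustration on a random average-reward instance (the polytope of Definition 2), using scipy; the instance data is made up.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: normalize one agent's reward so the worst policy has return 0 and the best
# has return 1, by minimizing/maximizing r . d over the average-reward occupancy polytope.

rng = np.random.default_rng(1)
nS, nA = 4, 3
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA)).flatten()          # reward vector indexed by (s, a)

# Polytope constraints: flow conservation for every state, total mass 1, d >= 0.
A_eq = np.zeros((nS + 1, nS * nA))
for s2 in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[s2, s * nA + a] = P[s, a, s2] - (1.0 if s == s2 else 0.0)
A_eq[nS, :] = 1.0
b_eq = np.zeros(nS + 1); b_eq[nS] = 1.0

v_min = linprog(r, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs").fun
v_max = -linprog(-r, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs").fun

# Shifting the reward by a constant shifts every return by that constant (occupancies
# sum to 1), so this rescaling changes no agent's ordering over policies.
r_norm = (r - v_min) / (v_max - v_min)
print("return range before:", (v_min, v_max))
```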
Volumetric definitions. A major difference between voting over a continuous space of alternatives and the classical voting setting is that the domain of alternatives is infinite, so not all voting mechanisms can be directly applied to the policy aggregation problem. In particular, various voting rules require reasoning about the rank of an alternative or the size of some subset of alternatives. For instance, the Borda count of an alternative $a$ over a finite set of alternatives is defined as the number (or fraction) of candidates ranked below $a$. In the continuous setting, for almost all of the mechanisms that we discuss later, we use the measure or volume of a subset of alternatives to refer to its size. For a measurable subset $D \subseteq \mathcal{O}$, let $\mathrm{vol}(D)$ denote its measure. The ratio $\mathrm{vol}(D)/\mathrm{vol}(\mathcal{O})$ is the fraction of alternatives that lie in $D$; a probabilistic interpretation is that, for the uniform distribution over the polytope $\mathcal{O}$, it is the probability that a policy uniformly sampled from $\mathcal{O}$ lies in $D$. We can also define the expected return distribution of an agent over $\mathcal{O}$ as a random variable that maps a policy to its expected return, i.e., one that maps $d \in \mathcal{O}$ to $r_i \cdot d$. The pdf and cdf of this random variable are defined below.
Definition 4 (expected return distribution).
For a MOMDP $\mathcal{M}$ and $i \in [n]$, the expected return distribution of agent $i$ is defined through
$$F_i(v) = \frac{\mathrm{vol}(\{d \in \mathcal{O} : r_i \cdot d \leq v\})}{\mathrm{vol}(\mathcal{O})}, \qquad f_i(v) = \frac{1}{\mathrm{vol}(\mathcal{O})} \int_{\mathcal{O}} \delta(r_i \cdot d - v)\, \mathrm{d}d,$$
where $F_i$ and $f_i$ are the cdf and pdf of the expected return distribution and $\delta$ is the Dirac delta function indicating $r_i \cdot d = v$.
A useful observation about $f_i$, the pdf, is that it is unimodal, i.e., increasing up to its mode $m_i$ and decreasing afterwards, which follows from the Brunn–Minkowski inequality [17]. Since $f_i$ (pdf) is the derivative of $F_i$ (cdf), the unimodality of $f_i$ implies that $F_i$ is a convex function on $(-\infty, m_i]$ and concave on $[m_i, \infty)$.
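Whenever uniform sampling from the polytope is available, the cdf can also be estimated by simple Monte Carlo; for a single-state MOMDP the polytope is just the action simplex, where uniform sampling is trivial. The sketch below is our own illustration with made-up rewards, standing in for the exact volume computations used by our algorithms.

```python
import numpy as np

# Monte Carlo sketch of the expected return distribution (Definition 4) for a
# single-state MOMDP, whose occupancy polytope is the probability simplex over
# actions (uniform samples come from Dirichlet(1, ..., 1)).

rng = np.random.default_rng(2)
nA, n_samples = 5, 200_000
r = rng.random(nA)                                  # one agent's reward per action
d = rng.dirichlet(np.ones(nA), size=n_samples)      # uniform samples from the polytope
returns = d @ r                                     # expected return of each sampled policy

def cdf(v):
    # Empirical F_i(v): fraction of sampled policies with return at most v.
    return np.mean(returns <= v)

for v in np.linspace(returns.min(), returns.max(), 5):
    print(f"F({v:.3f}) ~ {cdf(v):.3f}")
```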
In our algorithms, we use a subroutine that measures the volume of a polytope. Dyer et al. [12] designed a fully polynomial-time randomized approximation scheme (FPRAS) for computing the volume of polytopes. We report the running time of algorithms in terms of the number of calls to this volume-estimation oracle.
5 Fairness in Policy Aggregation
In this section, we utilize the volumetric interpretation of the state-action occupancy polytope to extend fairness notions from social choice to policy aggregation, and we develop algorithms to compute stochastic policies provably satisfying these notions.
5.1 Proportional Veto Core
The proportional veto core was first proposed by Moulin [26] in the classical social choice setting with a finite set of alternatives, where agents have full (strict) rankings over the alternatives. For simplicity, suppose the number of alternatives is a multiple of the number of agents $n$. The idea of the proportional veto core is that a coalition comprising a $t/n$ fraction of the agents should be able to veto a $t/n$ fraction of the alternatives. More precisely, for an alternative $a$ to be in the proportional veto core, there should not exist a coalition $S$ that can “block” $a$ using its proportional veto power of $|S|/n$: $S$ blocks $a$ if its members can unanimously suggest a set of candidates, larger than the fraction of alternatives left to $S$'s complement, that they all prefer to $a$. For instance, if $a$ is in the proportional veto core, it cannot be the case that a coalition of half of the agents unanimously prefers more than half of the alternatives to $a$.
Chaudhury et al. [7] extended this notion to a continuous domain of alternatives in the federated learning setting. We show that such an extension also applies to policy aggregation.
Definition 5 (proportional veto core).
Let $\varepsilon > 0$. For a coalition of agents $S \subseteq [n]$, let $v(S) = |S|/n$ be their veto power. A point $x \in \mathcal{O}$ is blocked by a coalition $S$ if there exists a subset $B \subseteq \mathcal{O}$ of measure $\mathrm{vol}(B) \geq (1 - v(S) + \varepsilon)\, \mathrm{vol}(\mathcal{O})$ such that all agents in $S$ prefer all points in $B$ to $x$, i.e., $r_i \cdot y > r_i \cdot x$ for all $i \in S$ and $y \in B$. A point is in the $\varepsilon$-proportional veto core if it is not blocked by any coalition.
A candidate in the proportional veto core satisfies desirable properties that are extensively discussed in prior work [26, 7, 24, 22]. It is worth mentioning that any candidate in the $\varepsilon$-proportional veto core, besides the fairness aspect, is also economically efficient, as it is approximately Pareto optimal; this holds because the grand coalition, whose veto power is $1$, can block any alternative that is Pareto dominated by a set of policies of measure at least $\varepsilon\, \mathrm{vol}(\mathcal{O})$.
Moulin [26] proved that the proportional veto core is nonempty in the discrete setting and Chaudhury et al. [7] proved it for the continuous setting. The following result is a corollary of Theorem 1 of Chaudhury et al. [7].
Theorem 1.
Let $\varepsilon > 0$. For any policy aggregation problem, the $\varepsilon$-proportional veto core is nonempty. Furthermore, a policy in it can be found in polynomial time using polynomially many calls per agent to the volume-estimation oracle.
We refer the reader to the paper of Chaudhury et al. [7] for the complete proof, and provide a high-level description of Algorithm 1, which finds a point in the proportional veto core. Algorithm 1 iterates over the agents and lets the $i$-th agent “eliminate” roughly a $1/n$ fraction of the remaining space of alternatives. That is, agent $i$ finds a threshold $b_i$ such that the intersection of the halfspace $\{x : r_i \cdot x \geq b_i\}$ with $\mathcal{O}_i$ (the remaining part of $\mathcal{O}$ at the $i$-th iteration) has a volume of approximately $(1 - 1/n)\, \mathrm{vol}(\mathcal{O}_i)$. This value can be found, for each agent, by a binary search over $b_i$ using logarithmically many calls to the volume-estimation subroutine. (It is important to check the non-emptiness of the search space, as an over-estimation of the threshold could lead to an empty feasible region.)
Pareto optimality. We briefly discuss why Algorithm 1 is Pareto optimal. During the $i$-th iteration, the space of policies is contracted by adding a linear constraint of the form $r_i \cdot x \geq b_i$. If the returned welfare-maximizing policy $d^*$ (derived from the final region) were Pareto dominated by another policy $d'$, then $d'$ would satisfy all these linear constraints, since $r_i \cdot d' \geq r_i \cdot d^* \geq b_i$ for every agent $i$, with the first inequality strict for at least one agent. Therefore, $d'$ lies in the final region and achieves a higher social welfare, which is a contradiction. The same argument can be used to establish the Pareto optimality of other mechanisms discussed later; each of these mechanisms searches for a policy that satisfies certain lower bounds on agents’ utilities, from which a welfare-maximizing, Pareto optimal policy can be selected.
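The following toy rendering of this procedure is our own illustration rather than the paper's Algorithm 1: it uses a single-state MOMDP, so the occupancy polytope is the action simplex, and it replaces the FPRAS volume oracle with Monte Carlo estimates over a fixed pool of samples.

```python
import numpy as np

# Sketch of Algorithm 1's high-level idea: each agent, in turn, keeps only the top
# ~(1 - 1/n) fraction (by its own return) of the region that remains. Volumes are
# estimated by rejection over uniform samples of the action simplex; all data is
# illustrative.

rng = np.random.default_rng(3)
nA, n_agents, n_samples = 5, 3, 200_000
R = rng.random((n_agents, nA))                        # agents' reward vectors
samples = rng.dirichlet(np.ones(nA), size=n_samples)  # uniform over the simplex
alive = np.ones(n_samples, dtype=bool)                # points still in the remaining region
thresholds = []

for i in range(n_agents):
    returns = samples @ R[i]
    lo, hi = returns[alive].min(), returns[alive].max()
    target = 1.0 - 1.0 / n_agents
    # Binary search for b_i so that ~(1 - 1/n) of the remaining region has r_i . x >= b_i.
    for _ in range(40):
        mid = (lo + hi) / 2
        if np.mean(returns[alive] >= mid) > target:
            lo = mid
        else:
            hi = mid
    thresholds.append(lo)
    alive &= returns >= lo

# Any surviving point approximately lies in the proportional veto core; among them we
# pick the one maximizing utilitarian welfare, mirroring the Pareto-optimal completion.
welfare = samples @ R.sum(axis=0)
best = np.argmax(np.where(alive, welfare, -np.inf))
print("candidate policy:", np.round(samples[best], 3))
print("per-agent thresholds:", np.round(thresholds, 3))
```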
5.2 Quantile Fairness
Next, we consider an egalitarian type of fairness based on the occupancy polytope, building on very recent work by Babichenko et al. [4] in the discrete setting; surprisingly, we show that it is possible to obtain stronger guarantees in the continuous setting.
Babichenko et al. [4] focus on the fair allocation of $m$ indivisible items among $n$ agents, where each item must be fully allocated to a single agent. They quantify the extent to which an allocation $A$ is fair to an agent $i$ by the fraction of allocations over which $i$ prefers $A$ (note that the number of all discrete allocations is $n^m$). In other words, if one randomly samples an allocation, fairness is measured by the probability that $A$ is preferred to the random allocation. An allocation is $q$-quantile fair, for $q \in [0, 1]$, if every agent weakly prefers it to at least a $q$ fraction of all allocations. Babichenko et al. [4] aim to find a universal value of $q$ such that for any fair division instance, a $q$-quantile fair allocation exists. They make an interesting connection between $q$-quantile fair allocations and the Erdős Matching Conjecture, and show that, under the assumption that the conjecture holds, $\frac{1}{2e}$-quantile fair allocations exist for all instances. (This result is for arbitrary monotone valuations; for additive valuations, they show that such allocations always exist without any assumptions.)
We ask the same question for policy aggregation, and again, a key difference is that our domain of alternatives is continuous. The notion of quantile fairness extends well to our setting: agents assess the fairness of a policy based on the fraction of the occupancy polytope (i.e., of the set of all policies) to which they prefer it, or equivalently, the probability that they prefer the chosen policy to a randomly sampled policy.
Definition 6 ($\eta$-quantile fairness).
For a MOMDP $\mathcal{M}$ and $\eta \in [0, 1]$, a policy $\pi$ is $\eta$-quantile fair if every agent $i \in [n]$ weakly prefers $\pi$ to at least an $\eta$ fraction of the occupancy polytope, i.e.,
$$\mathrm{vol}(\{d \in \mathcal{O} : r_i \cdot d \leq V_i(\pi)\}) \geq \eta \cdot \mathrm{vol}(\mathcal{O}) \quad \text{for all } i \in [n].$$
We show that a $1/e$-quantile fair policy always exists and that this ratio is tight; note that this bound is twice as good as that of Babichenko et al. [4], and it is unconditional. We prove this by making a connection to an inequality due to Grünbaum [17]. The centroid of a polytope $\mathcal{O}$ is defined as $c(\mathcal{O}) = \int_{\mathcal{O}} x \,\mathrm{d}x / \mathrm{vol}(\mathcal{O})$.
Lemma 1 (Grünbaum’s Inequality).
Let $K \subseteq \mathbb{R}^d$ be a $d$-dimensional polytope with centroid $c$ and let $w$ be a direction in $\mathbb{R}^d$. For the halfspace $H = \{x : w \cdot x \geq w \cdot c\}$, it holds that
$$\mathrm{vol}(K \cap H) \geq \left(\frac{d}{d+1}\right)^{d} \mathrm{vol}(K).$$
Furthermore, this is tight for the $d$-dimensional simplex.
Theorem 2.
For every MOMDP $\mathcal{M}$, there always exists an $\eta$-quantile fair policy, where $\eta = \left(\frac{n_0}{n_0 + 1}\right)^{n_0}$ and $n_0$ is the dimension of the occupancy polytope $\mathcal{O}$. Note that $\eta \geq 1/e$. Furthermore, this bound is tight: for any number of actions $k$, there is an instance with a single state and $k$ actions where no $(\eta + \varepsilon)$-quantile fair policy exists for any $\varepsilon > 0$.
Proof.
First, we show that the centroid $c$ of the occupancy polytope $\mathcal{O}$ is an $\eta$-quantile fair policy for the aforementioned $\eta$. Since $\mathcal{O}$ is a subset of the simplex $\Delta(S \times A)$ (see Definition 2), $\mathcal{O}$ has nonzero volume only within an affine subspace of some dimension $n_0$. By invoking Grünbaum’s inequality (Lemma 1) with $w$ equal to $-r_i$ projected onto this $n_0$-dimensional subspace, for every agent $i$, we have that $\mathrm{vol}(\{d \in \mathcal{O} : r_i \cdot d \leq r_i \cdot c\}) \geq \eta \cdot \mathrm{vol}(\mathcal{O})$, where $\eta = \left(\frac{n_0}{n_0+1}\right)^{n_0}$. Since $\left(\frac{t}{t+1}\right)^{t} \geq 1/e$ for every $t \geq 1$, we have $\eta \geq 1/e$, which completes the first part of the proof.

To show tightness, take a MOMDP with a single state — hence a constant transition function — and $k$ actions and $n = k$ agents. The reward function of agent $i$ is $0$ for action $a_i$ and $1$ otherwise. The state-action occupancy polytope of this MOMDP is the $(k-1)$-dimensional simplex $\Delta([k])$. Take any point $x$ in $\Delta([k])$. There exists at least one agent $i$ with $x_i \geq 1/k$. Take the halfspace $H_i = \{y : y_i \geq x_i\}$, which is exactly the set of policies to which agent $i$ weakly prefers $x$. Observe that $\Delta([k]) \cap H_i$ is equivalent to a smaller $(k-1)$-dimensional simplex, which has a volume of $(1 - x_i)^{k-1}\, \mathrm{vol}(\Delta([k]))$. Therefore, the fraction of the polytope to which agent $i$ weakly prefers $x$ is at most $(1 - 1/k)^{k-1} = \eta$, so $x$ is not $(\eta + \varepsilon)$-quantile fair for any $\varepsilon > 0$. ∎
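The following Monte Carlo check, our own illustration, makes Grünbaum's bound tangible in the simplex case: for the centroid of the $(k-1)$-dimensional action simplex and any reward direction, the fraction of the polytope that is weakly worse than the centroid is at least $((k-1)/k)^{k-1} \geq 1/e$, and directions that penalize a single action are essentially tight.

```python
import numpy as np

# Monte Carlo illustration of Grünbaum's inequality as used in Theorem 2, on the
# (k-1)-dimensional simplex with its centroid (1/k, ..., 1/k). All numbers illustrative.

rng = np.random.default_rng(8)
k, n_samples = 6, 200_000
samples = rng.dirichlet(np.ones(k), size=n_samples)
centroid = np.full(k, 1.0 / k)

directions = list(rng.normal(size=(10, k))) + list(np.eye(k)) + list(-np.eye(k))
worst = 1.0
for r in directions:
    frac_worse = np.mean(samples @ r <= centroid @ r)   # fraction weakly worse than centroid
    worst = min(worst, frac_worse)

print(f"smallest fraction observed: {worst:.3f}")
print(f"Grünbaum bound ((k-1)/k)^(k-1) = {((k - 1) / k) ** (k - 1):.3f}, 1/e = {1 / np.e:.3f}")
```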
The centroid of the occupancy polytope, as per Theorem 2, attains a worst-case guarantee of quantile fairness. However, the centroid policy can be highly suboptimal, as it disregards the preferences of the agents involved; for instance, there could exist an $\eta'$-quantile fair policy for some $\eta'$ much larger than $\eta$. To this end, we take an egalitarian approach and aim to find an $\eta$-quantile fair policy with the maximum $\eta$.
Max quantile fair algorithm. Algorithm 2 searches for the optimal value $\eta^*$, the largest $\eta$ for which an $\eta$-quantile fair policy exists, and gets close to $\eta^*$ by a binary search. To perform the search, we need a subroutine that checks, for a given value of $\eta$, whether an $\eta$-quantile fair policy exists. Suppose we have the $\eta$-quantile expected return $q_i(\eta)$ for all $i \in [n]$, that is, the expected return value such that $F_i(q_i(\eta)) = \eta$. Then, the existence of an $\eta$-quantile fair policy is equivalent to the feasibility of the linear program $\{d \in \mathcal{O} : r_i \cdot d \geq q_i(\eta) \;\; \forall i \in [n]\}$, which can be solved in polynomial time. Importantly, after finding a good approximation of $\eta^*$, there can be infinitely many policies that are $\eta^*$-quantile fair, and there are various ways to select the final policy after finding the value $\eta^*$. As mentioned earlier, a desirable efficiency property is Pareto optimality; to satisfy it, one can return the policy that maximizes the sum of agents’ expected returns among those that are $\eta^*$-quantile fair. Finally, to calculate $q_i(\eta)$, we can again use binary search to get close to the value $v$ for which $F_i(v) = \eta$, using logarithmically many calls to the volume-estimation oracle. The discussion above is summarized below.
Proposition 3.
Assuming an exact oracle for the quantile values $q_i(\eta)$, Algorithm 2 finds an $\eta$-quantile fair policy with $\eta$ arbitrarily close to the optimal value $\eta^*$ in polynomial time, using logarithmically many oracle calls per agent. An approximation of $q_i(\eta)$ itself can be computed using logarithmically many calls to the volume-estimation oracle.
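Below is a compact sketch of the max quantile fair procedure on a single-state instance, where the occupancy polytope is the action simplex. It is our own illustration under simplifying assumptions: the quantile thresholds $q_i(\eta)$ are Monte Carlo estimates rather than outputs of the volume oracle, feasibility is checked with an LP, and an outer binary search runs over $\eta$.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch in the spirit of Algorithm 2 (maximum quantile fairness) on a single-state
# MOMDP; all instance data is illustrative.

rng = np.random.default_rng(4)
nA, n_agents, n_samples = 5, 3, 200_000
R = rng.random((n_agents, nA))                                   # agents' reward vectors
sample_returns = rng.dirichlet(np.ones(nA), size=n_samples) @ R.T

def quantile_fair_policy(eta):
    # q[i]: estimated return level such that an eta fraction of policies lies at or below it.
    q = np.quantile(sample_returns, eta, axis=0)
    # Feasibility LP over the simplex: find d >= 0 with sum d = 1 and R @ d >= q.
    res = linprog(np.zeros(nA), A_ub=-R, b_ub=-q,
                  A_eq=np.ones((1, nA)), b_eq=[1.0], bounds=(0, None), method="highs")
    return res.x if res.success else None

lo, hi, best = 0.0, 1.0, None                     # binary search for the largest feasible eta
for _ in range(30):
    mid = (lo + hi) / 2
    d = quantile_fair_policy(mid)
    if d is not None:
        lo, best = mid, d
    else:
        hi = mid
print(f"max eta ~ {lo:.3f}, policy ~", np.round(best, 3))
```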
6 Policy Aggregation with Voting Rules
In this section, we adapt existing voting rules from the discrete setting to policy aggregation and discuss their computational complexity.
Plurality. The plurality winner is the policy that receives the maximum number of plurality votes or “approvals,” where agent $i$ approves a policy $d \in \mathcal{O}$ if it achieves their maximum expected return, i.e., $r_i \cdot d = \max_{d' \in \mathcal{O}} r_i \cdot d'$. Hence the plurality winner is a policy in $\arg\max_{d \in \mathcal{O}} |\{i \in [n] : r_i \cdot d = \max_{d' \in \mathcal{O}} r_i \cdot d'\}|$. This formulation does not require the volumetric interpretation. However, in contrast to the discrete setting, where one can easily count the approvals of all candidates, we show that solving this problem in the context of policy aggregation is not only NP-hard, but hard to approximate up to a factor of $n^{1-\epsilon}$. We establish the hardness of approximation by a reduction from the maximum independent set problem [20]; we defer the proof to Appendix A.
Theorem 4.
For any fixed $\epsilon > 0$, there is no polynomial-time $n^{1/2-\epsilon}$-approximation algorithm for the maximum plurality score unless P = NP, and no $n^{1-\epsilon}$-approximation algorithm unless ZPP = NP.
Nevertheless, we can compute plurality in practice, as we discuss below.
$\alpha$-approval. We extend the $\alpha$-approval rule using the volumetric interpretation of the occupancy polytope, similarly to the $\eta$-quantile fairness definition. For some $\alpha \in (0, 1]$, agent $i$ approves a policy $d$ if they weakly prefer it to at least an $\alpha$ fraction of $\mathcal{O}$, i.e., $F_i(r_i \cdot d) \geq \alpha$, or equivalently $r_i \cdot d \geq q_i(\alpha)$. The $\alpha$-approval winner is a policy that has the highest number of $\alpha$-approvals, so it is in $\arg\max_{d \in \mathcal{O}} |\{i \in [n] : F_i(r_i \cdot d) \geq \alpha\}|$. Note that plurality is equivalent to $1$-approval. It is worth mentioning that there can be infinitely many policies that have the maximum approval score and, to avoid a suboptimal decision, one can return a Pareto optimal solution among the set of $\alpha$-approval winner policies.
Theorem 2 shows that for $\alpha \leq 1/e$, there always exists a policy that all agents approve, and by Proposition 3 such policies can be found in polynomial time, assuming access to an oracle for volumetric computations. Therefore, the problem of finding an $\alpha$-approval winner is “easy” for $\alpha \leq 1/e$. In sharp contrast, for $\alpha = 1$ — namely, plurality — Theorem 4 gives a hardness of approximation. The next theorem shows the hardness of computing $\alpha$-approval for intermediate values of $\alpha$ via a reduction from the MAX-2SAT problem. We defer the proof to Appendix A.
Theorem 5.
There are values of $\alpha \in (0, 1)$ for which computing a policy with the highest $\alpha$-approval score is NP-hard. This holds even for binary reward vectors and when every $F_i$ has a closed form.
Given the above hardness result, to compute the $\alpha$-approval rule we turn to mixed-integer linear programming (MILP). Algorithm 3 simply creates a binary variable $z_i$ for each agent $i$, indicating whether $i$ $\alpha$-approves the policy, i.e., whether $r_i \cdot d \geq q_i(\alpha)$. To encode the expected-return requirement for agent $i$ to approve a policy as a linear constraint, we precompute $q_i(\alpha)$; this can be done by a binary search similar to Algorithm 2. Importantly, Algorithm 3 has one binary variable per agent and only $O(n)$ constraints, which is key to its practicability.
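A sketch of such a MILP is given below, in the spirit of Algorithm 3 but not taken from the paper's code: we use a single-state MOMDP so the polytope is the action simplex, estimate the thresholds $q_i(\alpha)$ by Monte Carlo, and model the approvals with one binary variable per agent in Gurobi (the big-M constant of 1 is valid here because returns lie in $[0, 1]$).

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

# Sketch of an alpha-approval MILP on a single-state MOMDP. Thresholds q[i] are
# Monte Carlo estimates of each agent's alpha-quantile return; all data illustrative.

rng = np.random.default_rng(5)
nA, n_agents, alpha = 5, 4, 0.8
R = rng.random((n_agents, nA))
samples = rng.dirichlet(np.ones(nA), size=100_000)
q = np.quantile(samples @ R.T, alpha, axis=0)         # q[i]: alpha-quantile return of agent i

m = gp.Model("alpha_approval")
d = m.addVars(nA, lb=0.0, name="d")                   # occupancy measure (here: action probabilities)
z = m.addVars(n_agents, vtype=GRB.BINARY, name="z")   # z[i] = 1 iff agent i approves
m.addConstr(gp.quicksum(d[a] for a in range(nA)) == 1.0)
for i in range(n_agents):
    # If z[i] = 1 then R[i] . d >= q[i]; M = 1 suffices since returns lie in [0, 1].
    m.addConstr(gp.quicksum(R[i, a] * d[a] for a in range(nA)) >= q[i] - (1 - z[i]))
m.setObjective(gp.quicksum(z[i] for i in range(n_agents)), GRB.MAXIMIZE)
m.optimize()
print("approvals:", int(m.objVal), "policy:", [round(d[a].X, 3) for a in range(nA)])
```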
Borda count. The Borda count rule also has a natural definition in the continuous setting. In the discrete setting, the Borda score of agent $i$ for alternative $a$ is the number of alternatives $b$ such that $a \succ_i b$. In the continuous setting, $F_i(r_i \cdot d)$ indicates the (normalized) volume of the occupancy polytope over which agent $i$ weakly prefers $d$. The Borda count rule then selects a policy in $\arg\max_{d \in \mathcal{O}} \sum_{i \in [n]} F_i(r_i \cdot d)$.
The computational complexity of the Borda count rule remains an interesting open question, though we make progress on two fronts. (We suspect the problem to be NP-hard, since the objective resembles a summation of “sigmoidal” functions over a convex domain, which is known to be NP-hard to maximize [37].) First, we identify a sufficient condition under which we can find an approximate max Borda count policy using convex optimization in polynomial time. Second, similar to Algorithm 3, we present a MILP to approximate the Borda count rule in Algorithm 4.
The first is based on the observation in Section 4 that $F_i$ is concave on the range $[m_i, \infty)$, where $m_i$ is the mode of $f_i$. We assume that the max Borda count policy lies in the concave portion for all agents, i.e., $r_i \cdot d^* \geq m_i$ for all $i \in [n]$. Then, the problem becomes the maximization of the concave objective $\sum_{i \in [n]} F_i(r_i \cdot d)$ over the convex domain $\{d \in \mathcal{O} : r_i \cdot d \geq m_i \;\; \forall i \in [n]\}$.
Second, Algorithm 4 is a MILP that finds an approximate max Borda count policy. As a pre-processing step, we estimate $F_i$ for each agent separately: we measure $F_i$ at the fixed expected return values $\epsilon, 2\epsilon, \ldots, 1$, which accounts for $O(1/\epsilon)$ oracle calls per agent. Then, for the MILP, we introduce binary variables $z_{i,j}$ for each agent $i$, indicating their $\epsilon$-rounded return levels, i.e., $z_{i,j} = 1$ iff $r_i \cdot d \geq j\epsilon$, for $j \in [1/\epsilon]$. The MILP then searches for an occupancy measure with the maximum total Borda score among the $\epsilon$-rounded expected return vectors, where the $\epsilon$-rounded value of $v$ is defined as $\epsilon \lfloor v / \epsilon \rfloor$. Therefore, the MILP finds a policy whose Borda score is at least as high as the maximum Borda score among $\epsilon$-rounded return vectors; note that this is not necessarily equivalent to an additive approximation of the maximum Borda score.
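The sketch below illustrates the same idea, again on a single-state instance with Monte Carlo estimates of $F_i$ on an $\epsilon$-grid; the binary variables mark which return levels are reached, and the objective collects the corresponding increments of $F_i$. It is our own illustrative rendering, not the paper's Algorithm 4.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

# Sketch of a Borda-count MILP with eps-rounded return levels; all data illustrative.

rng = np.random.default_rng(6)
nA, n_agents, eps = 5, 4, 0.1
levels = np.arange(eps, 1.0 + 1e-9, eps)              # grid of return levels eps, 2*eps, ..., 1
R = rng.random((n_agents, nA))
samples = rng.dirichlet(np.ones(nA), size=100_000)
# F[i, j] ~ fraction of policies whose return for agent i is at most levels[j].
F = np.array([[np.mean(samples @ R[i] <= v) for v in levels] for i in range(n_agents)])
# gains[i, j] = F_i(j*eps) - F_i((j-1)*eps): the Borda-score increment for reaching level j.
gains = np.diff(np.hstack([np.zeros((n_agents, 1)), F]), axis=1)

m = gp.Model("borda_milp")
d = m.addVars(nA, lb=0.0)
z = m.addVars(n_agents, len(levels), vtype=GRB.BINARY)     # z[i, j] = 1 iff return_i >= levels[j]
m.addConstr(gp.quicksum(d[a] for a in range(nA)) == 1.0)
for i in range(n_agents):
    ret_i = gp.quicksum(R[i, a] * d[a] for a in range(nA))
    for j, v in enumerate(levels):
        m.addConstr(ret_i >= v - (1 - z[i, j]))            # returns lie in [0, 1], so M = 1 works
m.setObjective(gp.quicksum(gains[i, j] * z[i, j]
                           for i in range(n_agents) for j in range(len(levels))), GRB.MAXIMIZE)
m.optimize()
print("approx. max Borda score (eps-rounded):", round(m.objVal, 3))
```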
Finally, we make a novel connection between $\eta$-quantile fairness and Borda count in Theorem 6.
Theorem 6.
An $\eta$-quantile fair policy is an $\eta$-approximation of the maximum Borda score.
Proof.
Recall that the Borda score of a policy $d \in \mathcal{O}$ is defined as $B(d) = \sum_{i \in [n]} F_i(r_i \cdot d)$. Since $F_i(\cdot) \leq 1$ for all $i \in [n]$, we have $B(d) \leq n$ for every $d \in \mathcal{O}$.
By the definition of an $\eta$-quantile fair policy $d^*$, we have $F_i(r_i \cdot d^*) \geq \eta$ for all $i \in [n]$. Therefore,
$$B(d^*) = \sum_{i \in [n]} F_i(r_i \cdot d^*) \geq \eta \cdot n,$$
and thus we obtain $B(d^*) \geq \eta \cdot \max_{d \in \mathcal{O}} B(d)$. ∎
A corollary of Theorems 6 and 2 is that the policy returned by the max quantile fair algorithm (Algorithm 2) achieves a $1/e$ multiplicative approximation of the maximum Borda score.
7 Experiments
Environment. We adapt the dynamic attention allocation environment introduced by D’Amour et al. [11]. We aim to monitor several sites and prevent potential incidents, but limited resources prevent us from monitoring all sites at all times; this is inspired by applications such as food inspection and pest control. (The code for the experiments is available at https://github.com/praal/policy-aggregation.) There are $k$ warehouses and each can be in 3 different stages: normal ($N$), risky ($R$) and incident ($I$). There are $3^k$ states, containing all possible stages of all warehouses. In each step, we can monitor at most one site, so there are $k + 1$ actions, where action $0$ is no operation and action $j$ is monitoring warehouse $j$. There are $n$ agents; each agent $i$ has a list of warehouses $W_i$ that they consider valuable and a reward function $r_i$. In each step, agent $i$ incurs a penalty for every valuable warehouse $w \in W_i$ that is in the incident stage, scaled by an agent-specific factor $\lambda_i$ sampled at random, as well as a cost for monitoring. In each step, if we monitor warehouse $j$, its stage becomes normal; if not, it changes from $N$ to $R$ and from $R$ to $I$ with warehouse-specific probabilities, and stays the same otherwise. These probabilities are sampled i.i.d. uniformly at random. The state transition function is the product of the warehouses’ stage transitions.
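A minimal simulator sketch of these dynamics is given below. The stage encoding, the sampling ranges of the transition probabilities, and all variable names are our own assumptions made to match the description above.

```python
import numpy as np

# Sketch of the attention-allocation dynamics: k warehouses with stages
# normal=0 / risky=1 / incident=2; action 0 is no-op, action j monitors warehouse j.

rng = np.random.default_rng(7)
k = 5                                          # number of warehouses (illustrative)
p_risk = rng.uniform(0.1, 0.5, size=k)         # normal -> risky probabilities (assumed range)
p_incident = rng.uniform(0.1, 0.5, size=k)     # risky -> incident probabilities (assumed range)

def step(stages, action):
    """stages: array of length k with values in {0, 1, 2}; action in {0, ..., k}."""
    nxt = stages.copy()
    for w in range(k):
        if action == w + 1:
            nxt[w] = 0                                         # monitored warehouse resets to normal
        elif stages[w] == 0 and rng.random() < p_risk[w]:
            nxt[w] = 1
        elif stages[w] == 1 and rng.random() < p_incident[w]:
            nxt[w] = 2
    return nxt

stages = np.zeros(k, dtype=int)
for t in range(5):
    stages = step(stages, action=rng.integers(0, k + 1))
    print(t, stages)
```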
Rules. We compare the outcomes of policy aggregation under different rules: max-quantile, Borda, $\alpha$-approval ($\alpha \in \{0.8, 0.9\}$), egalitarian (maximize the minimum return) and utilitarian (maximize the sum of returns). We sample random policies, based on which we fit a generalized logistic function to estimate the cdf of the expected return distribution (Definition 4) for every agent. Ties among the $\alpha$-approval winners are broken in favor of maximum utilitarian welfare. The egalitarian rule finds a policy that maximizes the expected return of the worst-off agent, then optimizes for the second worst-off agent, and so on. The implementation details of Borda count are in Appendix B.
Results. In Figure 1, we report the normalized expected returns of the agents (sorted from lowest to highest), averaged over different environment and agent instances. We observe that the utilitarian and egalitarian rules are sensitive to the agents’ different reward scales and tend to produce unfair outcomes. The utilitarian rule achieves the highest utilitarian welfare by almost ignoring one agent. The egalitarian rule achieves a higher return for the worst-off agents compared to the utilitarian rule, but still yields an inequitable outcome. The max-quantile rule tends to return the fairest outcomes, with similar normalized returns across agents. The Borda rule, while not a fair rule by design, tends to find fair outcomes that are slightly worse than those of the max-quantile rule. The $\alpha$-approval rule with max utilitarian completion tends to the utilitarian rule as $\alpha \to 0$ and to plurality as $\alpha \to 1$. Importantly, although not shown in the plots, the plurality rule ignores almost all agents and performs optimally for what is effectively a randomly selected agent.
In addition to the fine-grained utility distributions, in Table 1 we report two aggregate measures based on agents’ utilities: (i) the Gini index, a statistical measure of dispersion defined as $\sum_{i, j \in [n]} |u_i - u_j| \,/\, \big(2 n \sum_{i \in [n]} u_i\big)$, where a lower Gini index indicates a more equitable distribution, and (ii) the Nash welfare, defined as the geometric mean of agents’ utilities, where a higher Nash welfare is preferable. We observe a similar trend as above, where the utilitarian and egalitarian rules perform worse across both metrics. For the other four rules, the Nash welfare scores are comparable in both scenarios, with Borda showing slightly better performance. The Gini index, however, highlights a clearer distinction among the rules, with max-quantile performing better.
Figure 1: Normalized expected returns of the agents (sorted from lowest to highest) under each aggregation rule.
Table 1: Gini index and Nash welfare of the agents’ utilities under each rule, for two agent configurations.
| Rules | Gini index (5 symmetric agents, one per warehouse) | Nash welfare (5 symmetric agents) | Gini index (10 agents, random subsets of warehouses) | Nash welfare (10 agents) |
|---|---|---|---|---|
| egalitarian | | | | |
| utilitarian | | | | |
| 80%-approval | | | | |
| 90%-approval | | | | |
| Borda | | | | |
| max-quantile | | | | |
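For completeness, the two aggregate measures reported in Table 1 can be computed from a utility vector as follows; the utilities in this snippet are made up.

```python
import numpy as np

# Gini index and Nash welfare of a utility vector, matching the definitions above.

def gini(u):
    u = np.asarray(u, dtype=float)
    n = len(u)
    return np.abs(u[:, None] - u[None, :]).sum() / (2 * n * u.sum())

def nash_welfare(u):
    return np.exp(np.mean(np.log(np.asarray(u, dtype=float))))   # geometric mean

u = np.array([0.6, 0.8, 0.5, 0.9, 0.7])
print("Gini:", round(gini(u), 3), "Nash welfare:", round(nash_welfare(u), 3))
```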
8 Discussion
We conclude by discussing some of the limitations of our approach. A first potential limitation is computation. When we started our investigation of the policy aggregation problem, we were skeptical that ordinal solutions from social choice could be practically applied; we believe that our results successfully lay this initial concern to rest. However, additional algorithmic advances are needed to scale our approach beyond thousands of agents, states, and actions. Additionally, an interesting future direction is to apply these rules within continuous state or action spaces, as well as in online reinforcement learning settings where the environment is unknown.
A second limitation is the possibility of strategic behavior. The Gibbard-Satterthwaite Theorem [16, 33] precludes the existence of “reasonable” voting rules that are strategyproof, in the sense that agents cannot gain from misreporting their ordinal preferences; we conjecture that a similar result holds for policy aggregation in our framework. However, if reward functions are obtained through inverse reinforcement learning, successful manipulation would be difficult: an agent would have to act in a way that the learned reward function induces ordinal (volumetric) preferences leading to a higher-return aggregate stochastic policy. This separation between the actions taken by an agent and the preferences they induce would likely alleviate the theoretical susceptibility of our methods to strategic behavior.
Acknowledgments
We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chairs Program (Vector Institute). This work was also partially supported by the Cooperative AI Foundation; by the National Science Foundation under grants IIS-2147187, IIS-2229881 and CCF-2007080; and by the Office of Naval Research under grants N00014-20-1-2488 and N00014-24-1-2704.
References
- Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 1–8, 2004.
- Alamdari et al. [2022] P. A. Alamdari, T. Q. Klassen, R. Toro Icarte, and S. A. McIlraith. Be considerate: Avoiding negative side effects in reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 18–26, 2022.
- Alamdari et al. [2024] P. A. Alamdari, T. Q. Klassen, E. Creager, and S. A. Mcilraith. Remembering to be fair: Non-Markovian fairness in sequential decision making. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 906–920, 2024.
- Babichenko et al. [2024] Y. Babichenko, M. Feldman, R. Holzman, and V. V. Narayan. Fair division via quantile shares. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1235–1246, 2024.
- Bradley and Terry [1952] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Caragiannis et al. [2016] I. Caragiannis, A. D. Procaccia, and N. Shah. When do noisy votes reveal the truth? ACM Transactions on Economics and Computation, 4(3): article 15, 2016.
- Chaudhury et al. [2024] B. R. Chaudhury, A. Murhekar, Z. Yuan, B. Li, R. Mehta, and A. D. Procaccia. Fair federated learning via the proportional veto core. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- Clegg [2023] N. Clegg. Bringing people together to inform decision-making on generative AI. Blog post, 2023. URL https://about.fb.com/news/2023/06/generative-ai-community-forum.
- Conitzer and Sandholm [2005] V. Conitzer and T. Sandholm. Common voting rules as maximum likelihood estimators. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 145–152, 2005.
- Conitzer et al. [2024] V. Conitzer, R. Freedman, J. Heitzig, W. H. Holliday, B. M. Jacobs, N. Lambert, M. Mosse, E. Pacuit, S. Russell, H. Schoelkopf, E. Tewolde, and W. S. Zwicker. Position: Social choice should guide AI alignment in dealing with diverse human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 9346–9360, 2024.
- D’Amour et al. [2020] A. D’Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, and Y. Halpern. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 3rd Conference on Fairness, Accountability, and Transparency (FAccT), pages 525–534, 2020.
- Dyer et al. [1991] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1–17, 1991.
- Fan et al. [2023] Z. Fan, N. Peng, M. Tian, and B. Fain. Welfare and fairness in multi-objective reinforcement learning. In Proceedings of the 22nd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1991–1999, 2023.
- Fishburn [1984] P. C. Fishburn. Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692, 1984.
- Garey et al. [1976] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976. ISSN 0304-3975.
- Gibbard [1973] A. Gibbard. Manipulation of voting schemes. Econometrica, 41:587–602, 1973.
- Grünbaum [1960] B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257–1261, 1960.
- Gurobi Optimization, LLC [2023] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
- Hammond [1991] P. J. Hammond. Interpersonal comparisons of utility: Why and how they are and should be made. In J. Elster and J. E. Roemer, editors, Interpersonal Comparisons of Well-Being, chapter 7, pages 200–254. Cambridge University Press, 1991.
- Håstad [1999] J. Håstad. Clique is hard to approximate within $n^{1-\epsilon}$. Acta Mathematica, 182:105–142, 1999.
- Hayes et al. [2022] C. F. Hayes, R. Rădulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M Zintgraf, R. Dazeley, F. Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022.
- Ianovski and Kondratev [2021] E. Ianovski and A. Y. Kondratev. Computing the proportional veto core. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
- Ju et al. [2023] P. Ju, A. Ghosh, and N. Shroff. Achieving fairness in multi-agent MDP using reinforcement learning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023.
- Kondratev and Ianovski [2024] A. Y. Kondratev and E. Ianovski. Veto core consistent preference aggregation. In Proceedings of the 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1020–1028, 2024.
- Mandal and Gan [2022] D. Mandal and J. Gan. Socially fair reinforcement learning. CoRR, abs/2208.12584, 2022.
- Moulin [1981] H. Moulin. The proportional veto principle. The Review of Economic Studies, 48(3):407–416, 1981.
- Ng and Russell [2000] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, 2000.
- Noothigattu et al. [2021] R. Noothigattu, T. Yan, and A. D. Procaccia. Inverse reinforcement learning from like-minded teachers. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), pages 9197–9204, 2021.
- Ogryczak et al. [2013] W. Ogryczak, P. Perny, and P. Weng. A compromise programming approach to multiobjective markov decision processes. International Journal of Information Technology & Decision Making, 12(05):1021–1053, 2013.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
- Puterman [2014] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- Roijers et al. [2013] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
- Satterthwaite [1975] M. Satterthwaite. Strategy-proofness and Arrow’s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory, 10:187–217, 1975.
- Siddique et al. [2020] U. Siddique, P. Weng, and M. Zimmer. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 8905–8915, 2020.
- Sorensen et al. [2024] T. Sorensen, J. Moore, J. Fisher, M. L. Gordon, N. Mireshghallah, C. Michael Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, T. Althoff, and Y. Choi. Position: A roadmap to pluralistic alignment. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 46280–46302, 2024.
- Swamy et al. [2024] G. Swamy, C. Dann, R. Kidambi, S. Wu, and A. Agarwal. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 47345–47377, 2024.
- Udell and Boyd [2013] M. Udell and S. Boyd. Maximizing a sum of sigmoids. Optimization and Engineering, pages 1–25, 2013.
- Yang et al. [2019] R. Yang, X. Sun, and K. Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Young [1988] H. P. Young. Condorcet’s theory of voting. The American Political Science Review, 82(4):1231–1244, 1988.
- Zahavy et al. [2021] T. Zahavy, B. O’Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex MDPs. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 25746–25759, 2021.
- Zaremba et al. [2023] W. Zaremba, A. Dhar, L. Ahmad, T. Eloundou, S. Santurkar, S. Agarwal, and J. Leung. Democratic inputs to AI. Blog post, 2023. URL https://openai.com/blog/democratic-inputs-to-ai.
- Zhong et al. [2024] H. Zhong, Z. Deng, W. J. Su, Z. S. Wu, and L. Zhang. Provable multi-party reinforcement learning with diverse human feedback. arXiv preprint arXiv:2403.05006, 2024.
Appendix A Proofs of Section 6
We define a fully connected MOMDP as a MOMDP where for every pair of states $s, s'$ and any action $a$, $P(s' \mid s, a) = 1/|S|$. Throughout the hardness proofs, we construct fully connected MOMDPs with a uniform initial distribution. For such MOMDPs, it is not difficult to observe that, for both the average-reward and discounted-reward cases, the state-action occupancy polytope consists of exactly the occupancy measures of the form $d(s, a) = \pi(a \mid s)/|S|$, i.e., it is a scaled product of action simplices. Furthermore,
$$V_i(\pi) = \frac{1}{|S|} \sum_{s \in S} \sum_{a \in A} \pi(a \mid s)\, r_i(s, a). \tag{1}$$
A.1 Proof of Theorem 4
Proof of Theorem 4.
We show hardness of approximation by an approximation-preserving reduction from the maximum independent set (MIS) problem.
Definition 7 (maximum independent set (MIS)).
For a graph $G = (V, E)$ with vertex set $V$ and edge set $E$, a maximum independent set (MIS) of $G$ is a maximum-cardinality subset of vertices $U \subseteq V$ such that there is no edge between any pair of vertices in $U$.
Theorem 7 (Håstad [20]).
For any fixed $\epsilon > 0$, there is no polynomial-time $n^{1/2-\epsilon}$-approximation algorithm for the MIS problem unless P = NP, and no $n^{1-\epsilon}$-approximation algorithm unless ZPP = NP.
Construction of the MOMDP. Let $G = (V, E)$ be a graph for which we want to find a maximum independent set. We create a fully connected MOMDP with $|E|$ states, each corresponding to an edge $e = (u, v) \in E$. There are only two actions. In the state of edge $e = (u, v)$, the two actions correspond to $u$ and $v$, respectively. We create $|V|$ agents, where agent $i$ corresponds to vertex $v_i$. The reward function of agent $i$ for the state of edge $e$ and the action corresponding to vertex $u$ is $1$ if $u = v_i$, and $0$ otherwise.
In other words, the reward function of agent $i$ encodes the set of edges incident to vertex $v_i$.
Correctness of the reduction. A policy is optimal for agent $i$ if and only if, for all the edges incident to $v_i$, the action corresponding to $v_i$ is selected with probability $1$. If a policy is considered optimal by two agents $i$ and $j$, then $(v_i, v_j) \notin E$, since at the state of an edge $(v_i, v_j)$ the policy cannot select both actions with probability $1$. Therefore, the set of agents that consider a policy optimal corresponds to an independent set in $G$. Furthermore, take any independent set $U$. Let $\pi_U$ be the policy that, for each edge, selects the action corresponding to a vertex in $U$ and, if no such vertex exists, selects one arbitrarily. This policy is well defined, as $U$ is an independent set with no edges between its vertices. Policy $\pi_U$ is optimal for the agents of $U$, since at each state their favourite action is selected. Thus, we have an equivalence between the maximum independent set of $G$ and the plurality winner policy of the constructed MOMDP, and the hardness of approximation follows from Theorem 7. ∎
A.2 Proof of Theorem 5
Proof of Theorem 5.
We reduce the MAX-2SAT problem to the problem of computing an $\alpha$-approval winning policy for a suitable fixed value of $\alpha$.
For two Boolean variables $x$ and $y$, we denote their disjunction (i.e., logical or) by $x \vee y$ and the negation of $x$ by $\bar{x}$.
Definition 8 (maximum 2-satisfiability (MAX-2SAT)).
Given a set of Boolean variables $x_1, \ldots, x_p$ and a set of $2$-literal disjunction clauses $C_1, \ldots, C_m$, the goal of the maximum 2-satisfiability problem (MAX-2SAT) is to find the maximum number of clauses that can be satisfied simultaneously by a truth assignment to the variables.
Garey et al. [15] showed that the MAX-2SAT problem is NP-hard.
Construction of the MOMDP. For an instance of the MAX-2SAT problem, let $C_1, \ldots, C_m$ be a set of $2$-literal disjunction clauses over variables $x_1, \ldots, x_p$. We create a fully connected MOMDP with $p$ states — state $s_j$ representing variable $x_j$. There are only two actions, $a_T$ and $a_F$, which at state $s_j$ correspond to setting variable $x_j$ to true and false, respectively. This way, a policy's probability $\pi(a_T \mid s_j)$ can be interpreted as the probability of setting the variable $x_j$ to true, subject to $\pi(a_T \mid s_j) + \pi(a_F \mid s_j) = 1$.
Agents and reward functions. We introduce some notation before defining the agents and their reward functions. Take a clause $C = \ell_1 \vee \ell_2$, where $\ell_1$ and $\ell_2$ are literals over the variables $x_j$ and $x_{j'}$, respectively. Each assignment of a truth value to a literal maps to a state-action pair: setting $x_j$ to true (false) corresponds to the pair $(s_j, a_T)$ (respectively, $(s_j, a_F)$), and a literal evaluates to true when its corresponding pair is selected with probability one. Now we are ready to introduce the agents. For each clause $C = \ell_1 \vee \ell_2$, we create three agents, one for each assignment of the two literals that satisfies $C$:

- An agent whose reward is $1$ on the two state-action pairs under which $\ell_1$ and $\ell_2$ are both true, and zero otherwise. An optimal policy of this agent chooses these two state-action pairs exclusively, which can be interpreted as $\ell_1 = \text{true}$ and $\ell_2 = \text{true}$, an assignment that satisfies $C$.
- An agent whose reward is $1$ on the two state-action pairs under which $\ell_1$ is true and $\ell_2$ is false — note the negation. This is another assignment of the variables, $\ell_1 = \text{true}$ and $\ell_2 = \text{false}$, that satisfies $C$.
- Similarly, an agent whose reward is $1$ on the two state-action pairs under which $\ell_1$ is false and $\ell_2$ is true (which again satisfies $C$).
For ease of exposition, and by a slight abuse of notation, in the rest of the proof we use $\pi$ to refer to the occupancy measure as well. From Equation (1), the expected return of every agent is a scaled sum of the probabilities assigned to its two rewarded state-action pairs, and the $\alpha$-approval rule is invariant to affine transformations of the rewards; therefore, we may drop the $1/|S|$ factor and work with rescaled returns.
Expected return distribution. With this rescaling, the expected return of an agent for a policy $\pi$ is the sum of the probabilities that $\pi$ assigns to the agent's two rewarded state-action pairs. The cdf of this return over the occupancy polytope has a closed form; see Figure 2 for a visualization. The cdfs of all the other agents have the same form, and each agent's maximum expected return is attained by selecting its two rewarded state-action pairs with probability one.
Rounding a policy. For an agent to $\alpha$-approve a policy, its expected return must reach the agent's $\alpha$-quantile threshold, which for the chosen value of $\alpha$ forces the policy to select each of the agent's two rewarded state-action pairs with probability greater than $1/2$. Further, observe that this condition can hold for at most one of the three agents of a clause, as every pair of these agents disagrees on one literal. We round such a policy $\pi$ to an assignment $\sigma^{\pi}$ by setting $x_j$ to true if $\pi(a_T \mid s_j) > 1/2$ and to false otherwise. If an agent of clause $C$ $\alpha$-approves $\pi$, then $C$ is satisfied by the assignment $\sigma^{\pi}$. Therefore, we have $\mathrm{OPT}_{\text{2SAT}} \geq \mathrm{OPT}_{\text{approval}}$, where $\mathrm{OPT}_{\text{approval}}$ is the maximum number of $\alpha$-approvals among all policies and $\mathrm{OPT}_{\text{2SAT}}$ is the maximum number of clauses that can be satisfied among all assignments.
Next, we show that $\mathrm{OPT}_{\text{approval}} \geq \mathrm{OPT}_{\text{2SAT}}$ by deriving a policy that gets $\mathrm{OPT}_{\text{2SAT}}$ many $\alpha$-approvals from an optimal assignment $\sigma^*$: simply let $\pi(a_T \mid s_j) = 1$ if and only if $\sigma^*$ sets $x_j$ to true. If $\sigma^*$ satisfies a clause $C$, then the agent of $C$ that matches the literal assignments of $\sigma^*$ attains its maximum expected return, so $\pi$ is an optimal — and in particular $\alpha$-approved — policy for that agent. Each of the other two agents of $C$ disagrees with $\sigma^*$ on one literal and therefore receives a return below its approval threshold. Therefore, $\mathrm{OPT}_{\text{approval}} \geq \mathrm{OPT}_{\text{2SAT}}$, and the hardness of computation follows from the reduction. ∎
Appendix B Experimental Details
Experiments are all done on an AMD EPYC 7502 32-Core Processor with 258GiB system memory. We use Gurobi [18] to solve LPs and MILPs.
Running time. All of our voting rules have a running time of less than ten minutes on the constructed MOMDPs described in Section 7. The most resource-intensive task in our experiments was sampling random policies and computing the expected return of every agent, which is a standard task; for each instance, we did this in parallel across multiple processes, with a total running time of a few hours per instance.
Implementation details of the Borda count rule. After fitting a generalized logistic function for $F_i$ based on the expected returns of the sampled policies, we find, for each agent, the return level $m_i$ above which the fitted $F_i$ is concave, and check the existence of a policy that achieves an expected return of at least $m_i$ for all agents by solving a linear program. Next, to utilize Gurobi's LP solver, we approximate the concave portion of $F_i$ by a set of linear constraints that form a piecewise linear concave approximation of $F_i$. Therefore, our final program for an approximate Borda count rule is simply an LP.
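The piecewise linear step can be illustrated as follows: the concave portion of a fitted logistic cdf is upper-bounded by a handful of tangent lines, so a term of the form $F_i(r_i \cdot d)$ can be represented inside an LP by linear constraints. The logistic parameters and tangent points below are made up for illustration.

```python
import numpy as np

# Sketch of the piecewise-linear concave approximation: a concave function F is
# upper-bounded by its tangent lines, so "t <= F(v)" becomes the linear constraints
# "t <= slope_j * v + intercept_j" for a few tangent points.

def F(v):                                   # logistic-style cdf, concave for v >= 0.5
    return 1.0 / (1.0 + np.exp(-8.0 * (v - 0.5)))

def dF(v):
    f = F(v)
    return 8.0 * f * (1.0 - f)

knots = np.linspace(0.5, 1.0, 6)            # tangent points inside the concave region
slopes = dF(knots)
intercepts = F(knots) - slopes * knots

def F_approx(v):                            # pointwise minimum of the tangent lines
    return np.min(slopes[:, None] * v + intercepts[:, None], axis=0)

vs = np.linspace(0.5, 1.0, 5)
print(np.round(F(vs), 3))
print(np.round(F_approx(vs), 3))            # close upper bound on the concave region
```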