Policy Aggregation
Abstract
We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods — including approval voting, Borda count, the proportional veto core, and quantile fairness — can be practically applied to policy aggregation.
1 Introduction
Early discussions of AI value alignment often focused on learning desirable behavior from an individual teacher, for example, through inverse reinforcement learning [27, 1]. In recent years, however, the conversation has shifted towards aligning AI models with large groups of people or even entire societies. This shift is exemplified at a policy level by OpenAI’s “democratic inputs to AI” program [41] and Meta’s citizens’ assembly on AI governance [8], and at a technical level by the ubiquity of reinforcement learning from human feedback [30] as a method for fine-tuning large language models.
We formalize the challenge of value alignment with multiple individuals as a problem that we view as fundamental — policy aggregation. Our starting point is the common assumption that the environment can be represented as a Markov decision process (MDP). While the states, actions and transition functions are shared by all agents, their reward functions — which incorporate values, priorities or subjective beliefs — may be different. In particular, each agent has its own optimal policy in the underlying MDP. Our question is this: How should we aggregate the individual policies into a desirable collective policy?
A naïve answer is to define a new reward function that is the sum of the agents’ reward functions (for each state-action pair separately) and compute an optimal policy for this aggregate reward function; such a policy would guarantee maximum utilitarian social welfare. This approach has a major shortcoming, however, in that it is sensitive to affine transformations of rewards: if we doubled one of the reward functions, for example, the aggregate optimal policy may change. This is an issue because each agent’s individual optimal policy is invariant to (positive) affine transformations of rewards, so while it is possible to recover a reward function that induces an agent’s optimal policy by observing their actions over time (and we assume this is done accurately, in order to focus on the essence of the policy aggregation problem), it is impossible to distinguish between reward functions that are affine transformations of each other. More broadly, economists and moral philosophers have long been skeptical about interpersonal comparisons of utility [19] due to the lack of a universal scale — an issue that is especially pertinent in our context. Therefore, aggregation methods that are invariant to affine transformations are strongly preferred.
Our approach. To develop such aggregation methods, we look to social choice theory, which typically deals with the aggregation of ordinal preferences. To take a canonical example, suppose agents report rankings over $m$ alternatives. Under the Borda count rule, each voter gives $m - j$ points to the alternative they rank in the $j$-th position, and the alternative with the most points overall is selected.
The voting approach can be directly applied to our setting. For each agent, it is (in theory) possible to compute the value of every possible (deterministic) policy, and rank them all by value. Then, any standard voting rule, such as Borda count, can be used to aggregate the rankings over policies and single out a desirable policy. The caveat, of course, is that this method is patently impractical, because the number of policies is exponential in the number of states of the MDP.
The main insight underlying our approach is that ordinal preferences over policies have a much more practical volumetric interpretation in the state-action occupancy polytope $\mathcal{O}$. Roughly speaking, a point in the state-action occupancy polytope represents a (stochastic) policy through the frequencies with which it is expected to visit different state-action pairs. If a policy is preferred by an agent to a subset of policies $D \subseteq \mathcal{O}$, its “rank” is the volume of $D$ as a fraction of the volume of $\mathcal{O}$. The “score” of a policy under Borda count, for example, can be interpreted as the sum of these “ranks” over all agents.
Our results. We investigate two classes of rules from social choice theory, those that guarantee a notion of fairness and voting rules. By mapping ordinal preferences to the state-action occupancy polytope, we adapt the different rules to the policy aggregation problem.
The former class is examined in Section 5. As a warm-up, we start from the notion of the proportional veto core; it follows from recent work by Chaudhury et al. [7] that the volumetric interpretation of this notion is nonempty and that a policy in it can be computed efficiently. We then turn to quantile fairness, which was recently introduced by Babichenko et al. [4]; we prove that the volumetric interpretation of this notion yields guarantees that are far better than those known for the original, discrete setting, and we design a computationally efficient algorithm to optimize those guarantees.
The latter class is examined in Section 6; we focus on volumetric interpretations of $\alpha$-approval (with the ubiquitous plurality rule as a special case) and the aforementioned Borda count. In contrast to the rules studied in Section 5, existence is a nonissue for these voting rules, but computation is a challenge, and indeed we establish several computational hardness results. To overcome this obstacle, we implement voting rules for policy aggregation through mixed integer linear programming, which leads to practical solutions.
Finally, our experiments in Section 7 evaluate the policies returned by different rules based on their fairness; the results identify quantile fairness as especially appealing. The experiments also illustrate the advantage of our approach over rules that optimize measures of social welfare (which are sensitive to affine transformations of the rewards).
2 Related Work
Noothigattu et al. [28] consider a setting related to ours, in that different agents have different reward functions and different policies that must be aggregated. However, they assume that the agents’ reward functions are noisy perturbations of a ground-truth reward function, and the goal is to learn an optimal policy according to the ground-truth rewards. In social choice terms, our work is akin to the typical setting where subjective preferences must be aggregated, whereas the work of Noothigattu et al. [28] is conceptually similar to the setting where votes are seen as noisy estimates of a ground-truth ranking [39, 9, 6].
Chaudhury et al. [7] study a problem completely different from ours: fairness in federated learning. However, their technical approach served as an inspiration for ours. Specifically, they consider the proportional veto core and transfer it to the federated learning setting using volumetric arguments, by considering volumes of subsets in the space of models. Their proof that the proportional veto core is nonempty carries over to our setting, as we explain in Section 5.
There is a body of work on multi-objective reinforcement learning (MORL) and planning that uses a scalarization function to reduce the problem to a single-objective one [32, 21]. Other solutions to MORL focus on developing algorithms to identify a set of policies approximating the problem’s Pareto frontier [38]. A line of work more closely related to ours focuses on fairness in sequential decision making, often taking the scalarization approach to aggregate agents’ preferences by maximizing a (cardinal) social welfare function, which maps the vector of agent utilities to a single value. Ogryczak et al. [29] and Siddique et al. [34] investigate generalized Gini social welfare, and Mandal and Gan [25], Fan et al. [13] and Ju et al. [23] focus on Nash and egalitarian social welfare. Alamdari et al. [3] study this problem in a non-Markovian setting, where fairness depends on the history of actions over time, and introduce concepts to assess different fairness criteria at varying time intervals. A shortcoming of these solutions is that they are not invariant to affine transformations of rewards — a property that is crucial in our setting, as discussed earlier.
Our work is closely related to the pluralistic alignment literature, aiming to develop AI systems that reflect the values and preferences of diverse individuals [35, 10]. Alamdari et al. [2] propose a framework in reinforcement learning in which an agent learns to act in a way that is considerate of the values and perspectives of humans within a particular environment. Concurrent work explores reinforcement learning from human feedback (RLHF) from a social choice perspective, where the reward model is based on pairwise human preferences, often constructed using the Bradley-Terry model [5]. Zhong et al. [42] consider the maximum Nash and egalitarian welfare solutions, and Swamy et al. [36] propose a method based on maximal lotteries due to Fishburn [14].
3 Preliminaries
For $t \in \mathbb{N}$, let $[t] = \{1, \ldots, t\}$. For a closed set $X$, let $\Delta(X)$ denote the probability simplex over the set $X$. We denote the dot product of two vectors $x, y \in \mathbb{R}^d$ by $x \cdot y$. A halfspace in $\mathbb{R}^d$ determined by $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ is the set of points $x$ satisfying $w \cdot x \leq b$. A polytope is the intersection of a finite number of halfspaces, i.e., a convex subset of $d$-dimensional space determined by a set of linear constraints $\{x \in \mathbb{R}^d : Ax \leq b\}$, where $A$ is a matrix of coefficients of linear inequalities and $b$ is a vector.
3.1 Multi-Objective Markov Decision Processes
A multi-objective Markov decision process (MOMDP) is a tuple defined as $(S, A, P, \{r_i\}_{i \in [n]}, \mu)$ for the average-reward case and $(S, A, P, \{r_i\}_{i \in [n]}, \mu, \gamma)$ for the discounted-reward case, where $S$ is a finite set of states, $A$ is a finite set of actions, and $P : S \times A \to \Delta(S)$ is the transition probability distribution; $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ by taking action $a$ in $s$. For $i \in [n]$, $r_i : S \times A \to \mathbb{R}$ is the reward function of the $i$-th agent, the initial state is sampled from $\mu \in \Delta(S)$, and $\gamma \in (0, 1)$ is the discount factor.
A (Markovian) policy $\pi$ is a probability distribution $\pi(\cdot \mid s)$ over the actions given the state $s$. A policy is deterministic if at each state one action is selected with probability $1$, and otherwise it is stochastic. The expected average return of agent $i$ for a policy $\pi$ and the expected discounted return of agent $i$ for a policy $\pi$ are defined over an infinite time horizon as
$$V_i^{\mathrm{avg}}(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} r_i(s_t, a_t)\right], \qquad V_i^{\gamma}(\pi) = (1 - \gamma)\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, a_t)\right],$$
where the expectation is over the state-action pairs $(s_t, a_t)$ at time $t$ based on both the policy $\pi$ and the transition function $P$.
Definition 1 (state-action occupancy measure).
Let $p^{\pi}_t \in \Delta(S)$ be the probability measure over states at time $t$ under policy $\pi$. The state-action occupancy measure for state $s$ and action $a$ is defined as
$$d^{\pi}(s, a) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p^{\pi}_t(s)\, \pi(a \mid s)$$
in the average-reward case, and as
$$d^{\pi}(s, a) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, p^{\pi}_t(s)\, \pi(a \mid s)$$
in the discounted-reward case.
For both the average and discounted cases, we can rewrite the expected return as the dot product of the state-action occupancy measure and the rewards, that is, $V_i(\pi) = d^{\pi} \cdot r_i = \sum_{s, a} d^{\pi}(s, a)\, r_i(s, a)$. In addition, the policy can be calculated given the occupancy measure as $\pi(a \mid s) = d(s, a) / \sum_{a'} d(s, a')$ if $\sum_{a'} d(s, a') > 0$, and arbitrarily (say, uniformly at random) otherwise.
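To make the correspondence concrete, here is a minimal sketch (our own illustration, not code from the paper) that computes the discounted occupancy measure of a fixed policy on a small synthetic MDP, checks that the expected return equals the dot product with the reward, and recovers the policy from the occupancy measure. The $(1-\gamma)$ normalization is an assumption chosen so that occupancies sum to one, matching Definition 2.

```python
import numpy as np

# Minimal sketch: discounted state-action occupancy measure of a fixed stochastic
# policy, and the expected return as a dot product. Occupancies are normalized by
# (1 - gamma) so that they sum to 1 (an assumed convention matching Definition 2).

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.95
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)   # P[s, a, s']
r = rng.random((nS, nA))                                           # one agent's reward
mu = np.full(nS, 1.0 / nS)                                         # initial distribution
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)     # pi[s, a] = pi(a|s)

# State transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum("sa,sat->st", pi, P)
# Discounted state occupancy: rho = (1 - gamma) * (I - gamma * P_pi^T)^{-1} mu.
rho = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
d = rho[:, None] * pi                                              # d[s, a] = rho(s) * pi(a|s)

print("occupancies sum to", d.sum())            # ~1 under this normalization
print("expected (normalized) return:", (d * r).sum())

# Recovering the policy from an occupancy measure: pi(a|s) = d(s, a) / sum_a' d(s, a').
pi_recovered = d / d.sum(axis=1, keepdims=True)
assert np.allclose(pi_recovered, pi)
```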
Definition 2 (state-action occupancy polytope [31, 40]).
For a MOMDP $\mathcal{M}$, in the average-reward case, the space of valid state-action occupancies is the polytope
$$\mathcal{O} = \Big\{ d \in \mathbb{R}^{S \times A}_{\geq 0} \;:\; \sum_{s, a} d(s, a) = 1, \;\; \sum_{a} d(s', a) = \sum_{s, a} P(s' \mid s, a)\, d(s, a) \;\; \forall s' \in S \Big\}.$$
Similarly, in the discounted-reward case, the space of state-action occupancies is the polytope
$$\mathcal{O} = \Big\{ d \in \mathbb{R}^{S \times A}_{\geq 0} \;:\; \sum_{a} d(s', a) = (1 - \gamma)\, \mu(s') + \gamma \sum_{s, a} P(s' \mid s, a)\, d(s, a) \;\; \forall s' \in S \Big\}.$$
A mechanism receives a MOMDP and aggregates the agents’ preferences into a policy. A standard economic efficiency axiom in the social choice literature is Pareto optimality.
Definition 3 (Pareto optimality).
For a MOMDP $\mathcal{M}$ and $\varepsilon \geq 0$, a policy $\pi$ is $\varepsilon$-Pareto optimal if there does not exist another policy $\pi'$ such that $V_i(\pi') \geq V_i(\pi) + \varepsilon$ for all $i \in [n]$, with strict inequality for at least one agent. For $\varepsilon = 0$, we simply call such policies Pareto optimal.
We call a mechanism Pareto optimal if it always returns a Pareto optimal policy. In special cases where all agents unanimously agree on an optimal policy, Pareto optimality implies that the mechanism will return one such policy. We discuss Pareto optimal implementations of all mechanisms in this work.
3.2 Voting and Social Choice Functions
In the classical social choice setting, we have a set of $n$ agents and a set of $m$ alternatives. The preferences of voter $i$ are represented as a strict ordering $\sigma_i$ over the alternatives, where $a \succ_i b$ denotes that agent $i$ prefers $a$ over $b$. A (possibly randomized) voting rule aggregates agents’ preferences and returns an alternative or a distribution over the alternatives as the collective decision.
Positional Scoring Rules. A positional scoring rule with scoring vector $(s_1, \ldots, s_m)$ such that $s_1 \geq s_2 \geq \cdots \geq s_m$ works as follows. Each agent gives a score of $s_1$ to their top choice, a score of $s_2$ to their second choice, and so on. The votes are tallied and an alternative with the maximum total score is selected. A few of the well-known positional scoring rules are: Plurality: $(1, 0, \ldots, 0)$; Borda: $(m-1, m-2, \ldots, 0)$; $k$-approval: $(1, \ldots, 1, 0, \ldots, 0)$ with $k$ ones.
4 Occupancy Polytope as the Space of Alternatives
In a MOMDP, each agent $i$ incorporates their values and preferences into their respective reward function $r_i$. Agent $i$ prefers $\pi$ over $\pi'$ if and only if $\pi$ achieves higher expected return, $V_i(\pi) > V_i(\pi')$, and is indifferent between two policies $\pi$ and $\pi'$ if and only if $V_i(\pi) = V_i(\pi')$. As discussed before, given a state-action occupancy measure $d$ in the state-action occupancy polytope $\mathcal{O}$, we can recover the corresponding policy $\pi^d$. Therefore, we can interpret $\mathcal{O}$ as the domain of all possible alternatives over which the agents have heterogeneous weak preferences (with ties). Agent $i$ prefers $d$ to $d'$ in $\mathcal{O}$ if and only if they prefer $\pi^d$ to $\pi^{d'}$. We study the policy aggregation problem through this lens; specifically, we design or adapt voting mechanisms where the (continuous) space of alternatives is $\mathcal{O}$ and agents have weak preferences over it determined by their reward functions $r_i$.
Affine transformation and reward normalization. A particular benefit of this interpretation, as mentioned before, is that all positive affine transformations of the reward functions, i.e., $r'_i = c \cdot r_i + b$ for $c > 0$ and $b \in \mathbb{R}$, yield the same weak ordering over the polytope. Hence, we can assume without loss of generality that rewards are normalized. Specifically, we can ignore agents that are indifferent between all policies, i.e., those whose expected return is constant over $\mathcal{O}$, and normalize the remaining reward functions such that $\min_{d \in \mathcal{O}} r_i \cdot d = 0$ and $\max_{d \in \mathcal{O}} r_i \cdot d = 1$. The relative ordering of the policies does not change since for all points $d \in \mathcal{O}$ we have $\sum_{s,a} d(s,a) = 1$.
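This normalization can be carried out by solving two linear programs over the occupancy polytope, one for the worst return and one for the best. The sketch below is our own illustration on a random average-reward instance (the polytope of Definition 2), using scipy; the instance data is made up.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: normalize one agent's reward so the worst policy has return 0 and the best
# has return 1, by minimizing/maximizing r . d over the average-reward occupancy polytope.

rng = np.random.default_rng(1)
nS, nA = 4, 3
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA)).flatten()          # reward vector indexed by (s, a)

# Polytope constraints: flow conservation for every state, total mass 1, d >= 0.
A_eq = np.zeros((nS + 1, nS * nA))
for s2 in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[s2, s * nA + a] = P[s, a, s2] - (1.0 if s == s2 else 0.0)
A_eq[nS, :] = 1.0
b_eq = np.zeros(nS + 1); b_eq[nS] = 1.0

v_min = linprog(r, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs").fun
v_max = -linprog(-r, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs").fun

# Shifting the reward by a constant shifts every return by that constant (occupancies
# sum to 1), so this rescaling changes no agent's ordering over policies.
r_norm = (r - v_min) / (v_max - v_min)
print("return range before:", (v_min, v_max))
```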
Volumetric definitions. A major difference between voting over a continuous space of alternatives and the classical voting setting is that the domain of alternatives is infinite, so not all voting mechanisms can be directly applied to the policy aggregation problem. In particular, various voting rules require reasoning about the rank of an alternative or the size of some subset of alternatives. For instance, the Borda count of an alternative $a$ over a finite set of alternatives is defined as the number (or fraction) of candidates ranked below $a$. In the continuous setting, for almost all of the mechanisms that we discuss later, we use the measure or volume of a subset of alternatives to refer to its size. For a measurable subset $D \subseteq \mathcal{O}$, let $\mathrm{vol}(D)$ denote its measure. The ratio $\mathrm{vol}(D)/\mathrm{vol}(\mathcal{O})$ is the fraction of alternatives that lie in $D$; a probabilistic interpretation is that, for the uniform distribution over the polytope $\mathcal{O}$, it is the probability that a policy uniformly sampled from $\mathcal{O}$ lies in $D$. We can also define the expected return distribution of an agent over $\mathcal{O}$ as a random variable that maps a policy to its expected return, i.e., one that maps $d \in \mathcal{O}$ to $r_i \cdot d$. The pdf and cdf of this random variable are defined below.
Definition 4 (expected return distribution).
For a MOMDP $\mathcal{M}$ and $i \in [n]$, the expected return distribution of agent $i$ is defined through
$$F_i(v) = \frac{\mathrm{vol}(\{d \in \mathcal{O} : r_i \cdot d \leq v\})}{\mathrm{vol}(\mathcal{O})}, \qquad f_i(v) = \frac{1}{\mathrm{vol}(\mathcal{O})} \int_{\mathcal{O}} \delta(r_i \cdot d - v)\, \mathrm{d}d,$$
where $F_i$ and $f_i$ are the cdf and pdf of the expected return distribution and $\delta$ is the Dirac delta function indicating $r_i \cdot d = v$.
A useful observation about $f_i$, the pdf, is that it is unimodal, i.e., increasing up to its mode $m_i$ and decreasing afterwards, which follows from the Brunn–Minkowski inequality [17]. Since $f_i$ (pdf) is the derivative of $F_i$ (cdf), the unimodality of $f_i$ implies that $F_i$ is a convex function on $(-\infty, m_i]$ and concave on $[m_i, \infty)$.
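Whenever uniform sampling from the polytope is available, the cdf can also be estimated by simple Monte Carlo; for a single-state MOMDP the polytope is just the action simplex, where uniform sampling is trivial. The sketch below is our own illustration with made-up rewards, standing in for the exact volume computations used by our algorithms.

```python
import numpy as np

# Monte Carlo sketch of the expected return distribution (Definition 4) for a
# single-state MOMDP, whose occupancy polytope is the probability simplex over
# actions (uniform samples come from Dirichlet(1, ..., 1)).

rng = np.random.default_rng(2)
nA, n_samples = 5, 200_000
r = rng.random(nA)                                  # one agent's reward per action
d = rng.dirichlet(np.ones(nA), size=n_samples)      # uniform samples from the polytope
returns = d @ r                                     # expected return of each sampled policy

def cdf(v):
    # Empirical F_i(v): fraction of sampled policies with return at most v.
    return np.mean(returns <= v)

for v in np.linspace(returns.min(), returns.max(), 5):
    print(f"F({v:.3f}) ~ {cdf(v):.3f}")
```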
In our algorithms, we use a subroutine that measures the volume of a polytope. Dyer et al. [12] designed a fully polynomial-time randomized approximation scheme (FPRAS) for computing the volume of polytopes. We report the running time of algorithms in terms of the number of calls to this volume-estimation oracle.
5 Fairness in Policy Aggregation
In this section, we utilize the volumetric interpretation of the state-action occupancy polytope to extend fairness notions from social choice to policy aggregation, and we develop algorithms to compute stochastic policies provably satisfying these notions.
5.1 Proportional Veto Core
The proportional veto core was first proposed by Moulin [26] in the classical social choice setting with a finite set of alternatives, where agents have full (strict) rankings over the alternatives. For simplicity, suppose the number of alternatives is a multiple of the number of agents $n$. The idea of the proportional veto core is that a coalition comprising a $t/n$ fraction of the agents should be able to veto a $t/n$ fraction of the alternatives. More precisely, for an alternative $a$ to be in the proportional veto core, there should not exist a coalition $S$ that can “block” $a$ using its proportional veto power of $|S|/n$: $S$ blocks $a$ if its members can unanimously suggest a set of candidates, larger than the fraction of alternatives left to $S$'s complement, that they all prefer to $a$. For instance, if $a$ is in the proportional veto core, it cannot be the case that a coalition of half of the agents unanimously prefers more than half of the alternatives to $a$.
Chaudhury et al. [7] extended this notion to a continuous domain of alternatives in the federated learning setting. We show that such an extension also applies to policy aggregation.
Definition 5 (proportional veto core).
Let $\varepsilon > 0$. For a coalition of agents $S \subseteq [n]$, let $v(S) = |S|/n$ be their veto power. A point $x \in \mathcal{O}$ is blocked by a coalition $S$ if there exists a subset $B \subseteq \mathcal{O}$ of measure $\mathrm{vol}(B) \geq (1 - v(S) + \varepsilon)\, \mathrm{vol}(\mathcal{O})$ such that all agents in $S$ prefer all points in $B$ to $x$, i.e., $r_i \cdot y > r_i \cdot x$ for all $i \in S$ and $y \in B$. A point is in the $\varepsilon$-proportional veto core if it is not blocked by any coalition.
A candidate in the proportional veto core satisfies desirable properties that are extensively discussed in prior work [26, 7, 24, 22]. It is worth mentioning that any candidate in the $\varepsilon$-proportional veto core, besides the fairness aspect, is also economically efficient, as it is approximately Pareto optimal; this holds because the grand coalition, whose veto power is $1$, can block any alternative that is Pareto dominated by a set of policies of measure at least $\varepsilon\, \mathrm{vol}(\mathcal{O})$.
Moulin [26] proved that the proportional veto core is nonempty in the discrete setting and Chaudhury et al. [7] proved it for the continuous setting. The following result is a corollary of Theorem 1 of Chaudhury et al. [7].
Theorem 1.
Let $\varepsilon > 0$. For any policy aggregation problem, the $\varepsilon$-proportional veto core is nonempty. Furthermore, a policy in it can be found in polynomial time using polynomially many calls per agent to the volume-estimation oracle.
We refer the reader to the paper of Chaudhury et al. [7] for the complete proof, and provide a high-level description of Algorithm 1, which finds a point in the proportional veto core. Algorithm 1 iterates over the agents and lets the $i$-th agent “eliminate” roughly a $1/n$ fraction of the remaining space of alternatives. That is, agent $i$ finds a threshold $b_i$ such that the intersection of the halfspace $\{x : r_i \cdot x \geq b_i\}$ with $\mathcal{O}_i$ (the remaining part of $\mathcal{O}$ at the $i$-th iteration) has a volume of approximately $(1 - 1/n)\, \mathrm{vol}(\mathcal{O}_i)$. This value can be found, for each agent, by a binary search over $b_i$ using logarithmically many calls to the volume-estimation subroutine. (It is important to check the non-emptiness of the search space, as an over-estimation of the threshold could lead to an empty feasible region.)
Pareto optimality. We briefly discuss why Algorithm 1 is Pareto optimal. During the $i$-th iteration, the space of policies is contracted by adding a linear constraint of the form $r_i \cdot x \geq b_i$. If the returned welfare-maximizing policy $d^*$ (derived from the final region) were Pareto dominated by another policy $d'$, then $d'$ would satisfy all these linear constraints, since $r_i \cdot d' \geq r_i \cdot d^* \geq b_i$ for every agent $i$, with the first inequality strict for at least one agent. Therefore, $d'$ lies in the final region and achieves a higher social welfare, which is a contradiction. The same argument can be used to establish the Pareto optimality of other mechanisms discussed later; each of these mechanisms searches for a policy that satisfies certain lower bounds on agents’ utilities, from which a welfare-maximizing, Pareto optimal policy can be selected.
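The following toy rendering of this procedure is our own illustration rather than the paper's Algorithm 1: it uses a single-state MOMDP, so the occupancy polytope is the action simplex, and it replaces the FPRAS volume oracle with Monte Carlo estimates over a fixed pool of samples.

```python
import numpy as np

# Sketch of Algorithm 1's high-level idea: each agent, in turn, keeps only the top
# ~(1 - 1/n) fraction (by its own return) of the region that remains. Volumes are
# estimated by rejection over uniform samples of the action simplex; all data is
# illustrative.

rng = np.random.default_rng(3)
nA, n_agents, n_samples = 5, 3, 200_000
R = rng.random((n_agents, nA))                        # agents' reward vectors
samples = rng.dirichlet(np.ones(nA), size=n_samples)  # uniform over the simplex
alive = np.ones(n_samples, dtype=bool)                # points still in the remaining region
thresholds = []

for i in range(n_agents):
    returns = samples @ R[i]
    lo, hi = returns[alive].min(), returns[alive].max()
    target = 1.0 - 1.0 / n_agents
    # Binary search for b_i so that ~(1 - 1/n) of the remaining region has r_i . x >= b_i.
    for _ in range(40):
        mid = (lo + hi) / 2
        if np.mean(returns[alive] >= mid) > target:
            lo = mid
        else:
            hi = mid
    thresholds.append(lo)
    alive &= returns >= lo

# Any surviving point approximately lies in the proportional veto core; among them we
# pick the one maximizing utilitarian welfare, mirroring the Pareto-optimal completion.
welfare = samples @ R.sum(axis=0)
best = np.argmax(np.where(alive, welfare, -np.inf))
print("candidate policy:", np.round(samples[best], 3))
print("per-agent thresholds:", np.round(thresholds, 3))
```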
5.2 Quantile Fairness
Next, we consider an egalitarian type of fairness based on the occupancy polytope, building on very recent work by Babichenko et al. [4] in the discrete setting; surprisingly, we show that it is possible to obtain stronger guarantees in the continuous setting.
Babichenko et al. [4] focus on the fair allocation of $m$ indivisible items among $n$ agents, where each item must be fully allocated to a single agent. They quantify the extent to which an allocation $A$ is fair to an agent $i$ by the fraction of allocations over which $i$ prefers $A$ (note that the number of all discrete allocations is $n^m$). In other words, if one randomly samples an allocation, fairness is measured by the probability that $A$ is preferred to the random allocation. An allocation is $q$-quantile fair, for $q \in [0, 1]$, if every agent weakly prefers it to at least a $q$ fraction of all allocations. Babichenko et al. [4] aim to find a universal value of $q$ such that for any fair division instance, a $q$-quantile fair allocation exists. They make an interesting connection between $q$-quantile fair allocations and the Erdős Matching Conjecture, and show that, under the assumption that the conjecture holds, $\frac{1}{2e}$-quantile fair allocations exist for all instances. (This result is for arbitrary monotone valuations; for additive valuations, they show that such allocations always exist without any assumptions.)
We ask the same question for policy aggregation, and again, a key difference is that our domain of alternatives is continuous. The notion of quantile fairness extends well to our setting: agents assess the fairness of a policy based on the fraction of the occupancy polytope (i.e., of the set of all policies) to which they prefer it, or equivalently, the probability that they prefer the chosen policy to a randomly sampled policy.
Definition 6 ($\eta$-quantile fairness).
For a MOMDP $\mathcal{M}$ and $\eta \in [0, 1]$, a policy $\pi$ is $\eta$-quantile fair if every agent $i \in [n]$ weakly prefers $\pi$ to at least an $\eta$ fraction of the occupancy polytope, i.e.,
$$\mathrm{vol}(\{d \in \mathcal{O} : r_i \cdot d \leq V_i(\pi)\}) \geq \eta \cdot \mathrm{vol}(\mathcal{O}) \quad \text{for all } i \in [n].$$
We show that a $1/e$-quantile fair policy always exists and that this ratio is tight; note that this bound is twice as good as that of Babichenko et al. [4], and it is unconditional. We prove this by making a connection to an inequality due to Grünbaum [17]. The centroid of a polytope $\mathcal{O}$ is defined as $c(\mathcal{O}) = \int_{\mathcal{O}} x \,\mathrm{d}x / \mathrm{vol}(\mathcal{O})$.
Lemma 1 (Grünbaum’s Inequality).
Let $K \subseteq \mathbb{R}^d$ be a $d$-dimensional polytope with centroid $c$ and let $w$ be a direction in $\mathbb{R}^d$. For the halfspace $H = \{x : w \cdot x \geq w \cdot c\}$, it holds that
$$\mathrm{vol}(K \cap H) \geq \left(\frac{d}{d+1}\right)^{d} \mathrm{vol}(K).$$
Furthermore, this is tight for the $d$-dimensional simplex.
Theorem 2.
For every MOMDP $\mathcal{M}$, there always exists an $\eta$-quantile fair policy, where $\eta = \left(\frac{n_0}{n_0 + 1}\right)^{n_0}$ and $n_0$ is the dimension of the occupancy polytope $\mathcal{O}$. Note that $\eta \geq 1/e$. Furthermore, this bound is tight: for any number of actions $k$, there is an instance with a single state and $k$ actions where no $(\eta + \varepsilon)$-quantile fair policy exists for any $\varepsilon > 0$.
Proof.
First, we show that the centroid $c$ of the occupancy polytope $\mathcal{O}$ is an $\eta$-quantile fair policy for the aforementioned $\eta$. Since $\mathcal{O}$ is a subset of the simplex $\Delta(S \times A)$ (see Definition 2), $\mathcal{O}$ has nonzero volume only within an affine subspace of some dimension $n_0$. By invoking Grünbaum’s inequality (Lemma 1) with $w$ equal to $-r_i$ projected onto this $n_0$-dimensional subspace, for every agent $i$, we have that $\mathrm{vol}(\{d \in \mathcal{O} : r_i \cdot d \leq r_i \cdot c\}) \geq \eta \cdot \mathrm{vol}(\mathcal{O})$, where $\eta = \left(\frac{n_0}{n_0+1}\right)^{n_0}$. Since $\left(\frac{t}{t+1}\right)^{t} \geq 1/e$ for every $t \geq 1$, we have $\eta \geq 1/e$, which completes the first part of the proof.

To show tightness, take a MOMDP with a single state — hence a constant transition function — and $k$ actions and $n = k$ agents. The reward function of agent $i$ is $0$ for action $a_i$ and $1$ otherwise. The state-action occupancy polytope of this MOMDP is the $(k-1)$-dimensional simplex $\Delta([k])$. Take any point $x$ in $\Delta([k])$. There exists at least one agent $i$ with $x_i \geq 1/k$. Take the halfspace $H_i = \{y : y_i \geq x_i\}$, which is exactly the set of policies to which agent $i$ weakly prefers $x$. Observe that $\Delta([k]) \cap H_i$ is equivalent to a smaller $(k-1)$-dimensional simplex, which has a volume of $(1 - x_i)^{k-1}\, \mathrm{vol}(\Delta([k]))$. Therefore, the fraction of the polytope to which agent $i$ weakly prefers $x$ is at most $(1 - 1/k)^{k-1} = \eta$, so $x$ is not $(\eta + \varepsilon)$-quantile fair for any $\varepsilon > 0$. ∎
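The following Monte Carlo check, our own illustration, makes Grünbaum's bound tangible in the simplex case: for the centroid of the $(k-1)$-dimensional action simplex and any reward direction, the fraction of the polytope that is weakly worse than the centroid is at least $((k-1)/k)^{k-1} \geq 1/e$, and directions that penalize a single action are essentially tight.

```python
import numpy as np

# Monte Carlo illustration of Grünbaum's inequality as used in Theorem 2, on the
# (k-1)-dimensional simplex with its centroid (1/k, ..., 1/k). All numbers illustrative.

rng = np.random.default_rng(8)
k, n_samples = 6, 200_000
samples = rng.dirichlet(np.ones(k), size=n_samples)
centroid = np.full(k, 1.0 / k)

directions = list(rng.normal(size=(10, k))) + list(np.eye(k)) + list(-np.eye(k))
worst = 1.0
for r in directions:
    frac_worse = np.mean(samples @ r <= centroid @ r)   # fraction weakly worse than centroid
    worst = min(worst, frac_worse)

print(f"smallest fraction observed: {worst:.3f}")
print(f"Grünbaum bound ((k-1)/k)^(k-1) = {((k - 1) / k) ** (k - 1):.3f}, 1/e = {1 / np.e:.3f}")
```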
The centroid of the occupancy polytope, as per Theorem 2, attains a worst-case guarantee of quantile fairness. However, the centroid policy can be highly suboptimal, as it disregards the preferences of the agents involved; for instance, there could exist an $\eta'$-quantile fair policy for some $\eta'$ much larger than $\eta$. To this end, we take an egalitarian approach and aim to find an $\eta$-quantile fair policy with the maximum $\eta$.
Max quantile fair algorithm. Algorithm 2 searches for the optimal value $\eta^*$, the largest $\eta$ for which an $\eta$-quantile fair policy exists, and gets close to $\eta^*$ by a binary search. To perform the search, we need a subroutine that checks, for a given value of $\eta$, whether an $\eta$-quantile fair policy exists. Suppose we have the $\eta$-quantile expected return $q_i(\eta)$ for all $i \in [n]$, that is, the expected return value such that $F_i(q_i(\eta)) = \eta$. Then, the existence of an $\eta$-quantile fair policy is equivalent to the feasibility of the linear program $\{d \in \mathcal{O} : r_i \cdot d \geq q_i(\eta) \;\; \forall i \in [n]\}$, which can be solved in polynomial time. Importantly, after finding a good approximation of $\eta^*$, there can be infinitely many policies that are $\eta^*$-quantile fair, and there are various ways to select the final policy after finding the value $\eta^*$. As mentioned earlier, a desirable efficiency property is Pareto optimality; to satisfy it, one can return the policy that maximizes the sum of agents’ expected returns among those that are $\eta^*$-quantile fair. Finally, to calculate $q_i(\eta)$, we can again use binary search to get close to the value $v$ for which $F_i(v) = \eta$, using logarithmically many calls to the volume-estimation oracle. The discussion above is summarized below.
Proposition 3.
Assuming an exact oracle for the quantile values $q_i(\eta)$, Algorithm 2 finds an $\eta$-quantile fair policy with $\eta$ arbitrarily close to the optimal value $\eta^*$ in polynomial time, using logarithmically many oracle calls per agent. An approximation of $q_i(\eta)$ itself can be computed using logarithmically many calls to the volume-estimation oracle.
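Below is a compact sketch of the max quantile fair procedure on a single-state instance, where the occupancy polytope is the action simplex. It is our own illustration under simplifying assumptions: the quantile thresholds $q_i(\eta)$ are Monte Carlo estimates rather than outputs of the volume oracle, feasibility is checked with an LP, and an outer binary search runs over $\eta$.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch in the spirit of Algorithm 2 (maximum quantile fairness) on a single-state
# MOMDP; all instance data is illustrative.

rng = np.random.default_rng(4)
nA, n_agents, n_samples = 5, 3, 200_000
R = rng.random((n_agents, nA))                                   # agents' reward vectors
sample_returns = rng.dirichlet(np.ones(nA), size=n_samples) @ R.T

def quantile_fair_policy(eta):
    # q[i]: estimated return level such that an eta fraction of policies lies at or below it.
    q = np.quantile(sample_returns, eta, axis=0)
    # Feasibility LP over the simplex: find d >= 0 with sum d = 1 and R @ d >= q.
    res = linprog(np.zeros(nA), A_ub=-R, b_ub=-q,
                  A_eq=np.ones((1, nA)), b_eq=[1.0], bounds=(0, None), method="highs")
    return res.x if res.success else None

lo, hi, best = 0.0, 1.0, None                     # binary search for the largest feasible eta
for _ in range(30):
    mid = (lo + hi) / 2
    d = quantile_fair_policy(mid)
    if d is not None:
        lo, best = mid, d
    else:
        hi = mid
print(f"max eta ~ {lo:.3f}, policy ~", np.round(best, 3))
```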
6 Policy Aggregation with Voting Rules
In this section, we adapt existing voting rules from the discrete setting to policy aggregation and discuss their computational complexity.
Plurality. The plurality winner is the policy that receives the maximum number of plurality votes or “approvals,” where agent $i$ approves a policy $d \in \mathcal{O}$ if it achieves their maximum expected return, i.e., $r_i \cdot d = \max_{d' \in \mathcal{O}} r_i \cdot d'$. Hence the plurality winner is a policy in $\arg\max_{d \in \mathcal{O}} |\{i \in [n] : r_i \cdot d = \max_{d' \in \mathcal{O}} r_i \cdot d'\}|$. This formulation does not require the volumetric interpretation. However, in contrast to the discrete setting, where one can easily count the approvals of all candidates, we show that solving this problem in the context of policy aggregation is not only NP-hard, but hard to approximate up to a factor of $n^{1-\epsilon}$. We establish the hardness of approximation by a reduction from the maximum independent set problem [20]; we defer the proof to Appendix A.
Theorem 4.
For any fixed $\epsilon > 0$, there is no polynomial-time $n^{1/2-\epsilon}$-approximation algorithm for the maximum plurality score unless P = NP, and no $n^{1-\epsilon}$-approximation algorithm unless ZPP = NP.
Nevertheless, we can compute plurality in practice, as we discuss below.
$\alpha$-approval. We extend the $\alpha$-approval rule using the volumetric interpretation of the occupancy polytope, similarly to the $\eta$-quantile fairness definition. For some $\alpha \in (0, 1]$, agent $i$ approves a policy $d$ if they weakly prefer it to at least an $\alpha$ fraction of $\mathcal{O}$, i.e., $F_i(r_i \cdot d) \geq \alpha$, or equivalently $r_i \cdot d \geq q_i(\alpha)$. The $\alpha$-approval winner is a policy that has the highest number of $\alpha$-approvals, so it is in $\arg\max_{d \in \mathcal{O}} |\{i \in [n] : F_i(r_i \cdot d) \geq \alpha\}|$. Note that plurality is equivalent to $1$-approval. It is worth mentioning that there can be infinitely many policies that have the maximum approval score and, to avoid a suboptimal decision, one can return a Pareto optimal solution among the set of $\alpha$-approval winner policies.
Theorem 2 shows that for $\alpha \leq 1/e$, there always exists a policy that all agents approve, and by Proposition 3 such policies can be found in polynomial time, assuming access to an oracle for volumetric computations. Therefore, the problem of finding an $\alpha$-approval winner is “easy” for $\alpha \leq 1/e$. In sharp contrast, for $\alpha = 1$ — namely, plurality — Theorem 4 gives a hardness of approximation. The next theorem shows the hardness of computing $\alpha$-approval for intermediate values of $\alpha$ via a reduction from the MAX-2SAT problem. We defer the proof to Appendix A.
Theorem 5.
There are values of $\alpha \in (0, 1)$ for which computing a policy with the highest $\alpha$-approval score is NP-hard. This holds even for binary reward vectors and when every $F_i$ has a closed form.
Given the above hardness result, to compute the $\alpha$-approval rule we turn to mixed-integer linear programming (MILP). Algorithm 3 simply creates a binary variable $z_i$ for each agent $i$, indicating whether $i$ $\alpha$-approves the policy, i.e., whether $r_i \cdot d \geq q_i(\alpha)$. To encode the expected-return requirement for agent $i$ to approve a policy as a linear constraint, we precompute $q_i(\alpha)$; this can be done by a binary search similar to Algorithm 2. Importantly, Algorithm 3 has one binary variable per agent and only $O(n)$ constraints, which is key to its practicability.
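A sketch of such a MILP is given below, in the spirit of Algorithm 3 but not taken from the paper's code: we use a single-state MOMDP so the polytope is the action simplex, estimate the thresholds $q_i(\alpha)$ by Monte Carlo, and model the approvals with one binary variable per agent in Gurobi (the big-M constant of 1 is valid here because returns lie in $[0, 1]$).

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

# Sketch of an alpha-approval MILP on a single-state MOMDP. Thresholds q[i] are
# Monte Carlo estimates of each agent's alpha-quantile return; all data illustrative.

rng = np.random.default_rng(5)
nA, n_agents, alpha = 5, 4, 0.8
R = rng.random((n_agents, nA))
samples = rng.dirichlet(np.ones(nA), size=100_000)
q = np.quantile(samples @ R.T, alpha, axis=0)         # q[i]: alpha-quantile return of agent i

m = gp.Model("alpha_approval")
d = m.addVars(nA, lb=0.0, name="d")                   # occupancy measure (here: action probabilities)
z = m.addVars(n_agents, vtype=GRB.BINARY, name="z")   # z[i] = 1 iff agent i approves
m.addConstr(gp.quicksum(d[a] for a in range(nA)) == 1.0)
for i in range(n_agents):
    # If z[i] = 1 then R[i] . d >= q[i]; M = 1 suffices since returns lie in [0, 1].
    m.addConstr(gp.quicksum(R[i, a] * d[a] for a in range(nA)) >= q[i] - (1 - z[i]))
m.setObjective(gp.quicksum(z[i] for i in range(n_agents)), GRB.MAXIMIZE)
m.optimize()
print("approvals:", int(m.objVal), "policy:", [round(d[a].X, 3) for a in range(nA)])
```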
Borda count. The Borda count rule also has a natural definition in the continuous setting. In the discrete setting, the Borda score of agent $i$ for alternative $a$ is the number of alternatives $b$ such that $a \succ_i b$. In the continuous setting, $F_i(r_i \cdot d)$ indicates the (normalized) volume of the occupancy polytope over which agent $i$ weakly prefers $d$. The Borda count rule then selects a policy in $\arg\max_{d \in \mathcal{O}} \sum_{i \in [n]} F_i(r_i \cdot d)$.
The computational complexity of the Borda count rule remains an interesting open question, though we make progress on two fronts. (We suspect the problem to be NP-hard, since the objective resembles a summation of “sigmoidal” functions over a convex domain, which is known to be NP-hard to maximize [37].) First, we identify a sufficient condition under which we can find an approximate max Borda count policy using convex optimization in polynomial time. Second, similar to Algorithm 3, we present a MILP to approximate the Borda count rule in Algorithm 4.
The first is based on the observation in Section 4 that $F_i$ is concave on the range $[m_i, \infty)$, where $m_i$ is the mode of $f_i$. We assume that the max Borda count policy lies in the concave portion for all agents, i.e., $r_i \cdot d^* \geq m_i$ for all $i \in [n]$. Then, the problem becomes the maximization of the concave objective $\sum_{i \in [n]} F_i(r_i \cdot d)$ over the convex domain $\{d \in \mathcal{O} : r_i \cdot d \geq m_i \;\; \forall i \in [n]\}$.
Second, Algorithm 4 is a MILP that finds an approximate max Borda count policy. As a pre-processing step, we estimate $F_i$ for each agent separately: we measure $F_i$ at the fixed expected return values $\epsilon, 2\epsilon, \ldots, 1$, which accounts for $O(1/\epsilon)$ oracle calls per agent. Then, for the MILP, we introduce binary variables $z_{i,j}$ for each agent $i$, indicating their $\epsilon$-rounded return levels, i.e., $z_{i,j} = 1$ iff $r_i \cdot d \geq j\epsilon$, for $j \in [1/\epsilon]$. The MILP then searches for an occupancy measure with the maximum total Borda score among the $\epsilon$-rounded expected return vectors, where the $\epsilon$-rounded value of $v$ is defined as $\epsilon \lfloor v / \epsilon \rfloor$. Therefore, the MILP finds a policy whose Borda score is at least as high as the maximum Borda score among $\epsilon$-rounded return vectors; note that this is not necessarily equivalent to an additive approximation of the maximum Borda score.
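The sketch below illustrates the same idea, again on a single-state instance with Monte Carlo estimates of $F_i$ on an $\epsilon$-grid; the binary variables mark which return levels are reached, and the objective collects the corresponding increments of $F_i$. It is our own illustrative rendering, not the paper's Algorithm 4.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

# Sketch of a Borda-count MILP with eps-rounded return levels; all data illustrative.

rng = np.random.default_rng(6)
nA, n_agents, eps = 5, 4, 0.1
levels = np.arange(eps, 1.0 + 1e-9, eps)              # grid of return levels eps, 2*eps, ..., 1
R = rng.random((n_agents, nA))
samples = rng.dirichlet(np.ones(nA), size=100_000)
# F[i, j] ~ fraction of policies whose return for agent i is at most levels[j].
F = np.array([[np.mean(samples @ R[i] <= v) for v in levels] for i in range(n_agents)])
# gains[i, j] = F_i(j*eps) - F_i((j-1)*eps): the Borda-score increment for reaching level j.
gains = np.diff(np.hstack([np.zeros((n_agents, 1)), F]), axis=1)

m = gp.Model("borda_milp")
d = m.addVars(nA, lb=0.0)
z = m.addVars(n_agents, len(levels), vtype=GRB.BINARY)     # z[i, j] = 1 iff return_i >= levels[j]
m.addConstr(gp.quicksum(d[a] for a in range(nA)) == 1.0)
for i in range(n_agents):
    ret_i = gp.quicksum(R[i, a] * d[a] for a in range(nA))
    for j, v in enumerate(levels):
        m.addConstr(ret_i >= v - (1 - z[i, j]))            # returns lie in [0, 1], so M = 1 works
m.setObjective(gp.quicksum(gains[i, j] * z[i, j]
                           for i in range(n_agents) for j in range(len(levels))), GRB.MAXIMIZE)
m.optimize()
print("approx. max Borda score (eps-rounded):", round(m.objVal, 3))
```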
Finally, we make a novel connection between $\eta$-quantile fairness and Borda count in Theorem 6.
Theorem 6.
An $\eta$-quantile fair policy is an $\eta$-approximation of the maximum Borda score.
Proof.
Recall that the Borda score of a policy $d \in \mathcal{O}$ is defined as $B(d) = \sum_{i \in [n]} F_i(r_i \cdot d)$. Since $F_i(\cdot) \leq 1$ for all $i \in [n]$, we have $B(d) \leq n$ for every $d \in \mathcal{O}$.
By the definition of an $\eta$-quantile fair policy $d^*$, we have $F_i(r_i \cdot d^*) \geq \eta$ for all $i \in [n]$. Therefore,
$$B(d^*) = \sum_{i \in [n]} F_i(r_i \cdot d^*) \geq \eta \cdot n,$$
and thus we obtain $B(d^*) \geq \eta \cdot \max_{d \in \mathcal{O}} B(d)$. ∎
A corollary of Theorems 6 and 2 is that the policy returned by the max quantile fair algorithm (Algorithm 2) achieves a $1/e$ multiplicative approximation of the maximum Borda score.
7 Experiments
Environment. We adapt the dynamic attention allocation environment introduced by D’Amour et al. [11]. We aim to monitor several sites and prevent potential incidents, but limited resources prevent us from monitoring all sites at all times; this is inspired by applications such as food inspection and pest control. (The code for the experiments is available at https://github.com/praal/policy-aggregation.) There are $k$ warehouses and each can be in 3 different stages: normal ($N$), risky ($R$) and incident ($I$). There are $3^k$ states, containing all possible stages of all warehouses. In each step, we can monitor at most one site, so there are $k + 1$ actions, where action $0$ is no operation and action $j$ is monitoring warehouse $j$. There are $n$ agents; each agent $i$ has a list of warehouses $W_i$ that they consider valuable and a reward function $r_i$. In each step, agent $i$ incurs a penalty for every valuable warehouse $w \in W_i$ that is in the incident stage, scaled by an agent-specific factor $\lambda_i$ sampled at random, as well as a cost for monitoring. In each step, if we monitor warehouse $j$, its stage becomes normal; if not, it changes from $N$ to $R$ and from $R$ to $I$ with warehouse-specific probabilities, and stays the same otherwise. These probabilities are sampled i.i.d. uniformly at random. The state transition function is the product of the warehouses’ stage transitions.
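A minimal simulator sketch of these dynamics is given below. The stage encoding, the sampling ranges of the transition probabilities, and all variable names are our own assumptions made to match the description above.

```python
import numpy as np

# Sketch of the attention-allocation dynamics: k warehouses with stages
# normal=0 / risky=1 / incident=2; action 0 is no-op, action j monitors warehouse j.

rng = np.random.default_rng(7)
k = 5                                          # number of warehouses (illustrative)
p_risk = rng.uniform(0.1, 0.5, size=k)         # normal -> risky probabilities (assumed range)
p_incident = rng.uniform(0.1, 0.5, size=k)     # risky -> incident probabilities (assumed range)

def step(stages, action):
    """stages: array of length k with values in {0, 1, 2}; action in {0, ..., k}."""
    nxt = stages.copy()
    for w in range(k):
        if action == w + 1:
            nxt[w] = 0                                         # monitored warehouse resets to normal
        elif stages[w] == 0 and rng.random() < p_risk[w]:
            nxt[w] = 1
        elif stages[w] == 1 and rng.random() < p_incident[w]:
            nxt[w] = 2
    return nxt

stages = np.zeros(k, dtype=int)
for t in range(5):
    stages = step(stages, action=rng.integers(0, k + 1))
    print(t, stages)
```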
Rules. We compare the outcomes of policy aggregation under different rules: max-quantile, Borda, $\alpha$-approval ($\alpha \in \{0.8, 0.9\}$), egalitarian (maximize the minimum return) and utilitarian (maximize the sum of returns). We sample random policies, based on which we fit a generalized logistic function to estimate the cdf of the expected return distribution (Definition 4) for every agent. Ties among the $\alpha$-approval winners are broken in favor of maximum utilitarian welfare. The egalitarian rule finds a policy that maximizes the expected return of the worst-off agent, then optimizes for the second worst-off agent, and so on. The implementation details of Borda count are in Appendix B.
Results. In Figure 1, we report the normalized expected returns of the agents (sorted from lowest to highest), averaged over different environment and agent instances. We observe that the utilitarian and egalitarian rules are sensitive to the agents’ different reward scales and tend to produce unfair outcomes. The utilitarian rule achieves the highest utilitarian welfare by almost ignoring one agent. The egalitarian rule achieves a higher return for the worst-off agents compared to the utilitarian rule, but still yields an inequitable outcome. The max-quantile rule tends to return the fairest outcomes, with similar normalized returns across agents. The Borda rule, while not a fair rule by design, tends to find fair outcomes that are slightly worse than those of the max-quantile rule. The $\alpha$-approval rule with max utilitarian completion tends to the utilitarian rule as $\alpha \to 0$ and to plurality as $\alpha \to 1$. Importantly, although not shown in the plots, the plurality rule ignores almost all agents and performs optimally for what is effectively a randomly selected agent.
In addition to the fine-grained utility distributions, in Table 1 we report two aggregate measures based on agents’ utilities: (i) the Gini index, a statistical measure of dispersion defined as $\sum_{i, j \in [n]} |u_i - u_j| \,/\, \big(2 n \sum_{i \in [n]} u_i\big)$, where a lower Gini index indicates a more equitable distribution, and (ii) the Nash welfare, defined as the geometric mean of agents’ utilities, where a higher Nash welfare is preferable. We observe a similar trend as above, where the utilitarian and egalitarian rules perform worse across both metrics. For the other four rules, the Nash welfare scores are comparable in both scenarios, with Borda showing slightly better performance. The Gini index, however, highlights a clearer distinction among the rules, with max-quantile performing better.
Figure 1: Normalized expected returns of the agents (sorted from lowest to highest) under each aggregation rule.
Table 1: Gini index and Nash welfare of the agents’ utilities under each rule, for two agent configurations.
| Rules | Gini index (5 symmetric agents, one per warehouse) | Nash welfare (5 symmetric agents) | Gini index (10 agents, random subsets of warehouses) | Nash welfare (10 agents) |
|---|---|---|---|---|
| egalitarian | | | | |
| utilitarian | | | | |
| 80%-approval | | | | |
| 90%-approval | | | | |
| Borda | | | | |
| max-quantile | | | | |
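For completeness, the two aggregate measures reported in Table 1 can be computed from a utility vector as follows; the utilities in this snippet are made up.

```python
import numpy as np

# Gini index and Nash welfare of a utility vector, matching the definitions above.

def gini(u):
    u = np.asarray(u, dtype=float)
    n = len(u)
    return np.abs(u[:, None] - u[None, :]).sum() / (2 * n * u.sum())

def nash_welfare(u):
    return np.exp(np.mean(np.log(np.asarray(u, dtype=float))))   # geometric mean

u = np.array([0.6, 0.8, 0.5, 0.9, 0.7])
print("Gini:", round(gini(u), 3), "Nash welfare:", round(nash_welfare(u), 3))
```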
8 Discussion
We conclude by discussing some of the limitations of our approach. A first potential limitation is computation. When we started our investigation of the policy aggregation problem, we were skeptical that ordinal solutions from social choice could be practically applied; we believe that our results successfully lay this initial concern to rest. However, additional algorithmic advances are needed to scale our approach beyond thousands of agents, states, and actions. Additionally, an interesting future direction is to apply these rules within continuous state or action spaces, as well as in online reinforcement learning settings where the environment is unknown.
A second limitation is the possibility of strategic behavior. The Gibbard-Satterthwaite Theorem [16, 33] precludes the existence of “reasonable” voting rules that are strategyproof, in the sense that agents cannot gain from misreporting their ordinal preferences; we conjecture that a similar result holds for policy aggregation in our framework. However, if reward functions are obtained through inverse reinforcement learning, successful manipulation would be difficult: an agent would have to act in a way that the learned reward function induces ordinal (volumetric) preferences leading to a higher-return aggregate stochastic policy. This separation between the actions taken by an agent and the preferences they induce would likely alleviate the theoretical susceptibility of our methods to strategic behavior.
Acknowledgments
We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chairs Program (Vector Institute). This work was also partially supported by the Cooperative AI Foundation; by the National Science Foundation under grants IIS-2147187, IIS-2229881 and CCF-2007080; and by the Office of Naval Research under grants N00014-20-1-2488 and N00014-24-1-2704.
References
- Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 1–8, 2004.
- Alamdari et al. [2022] P. A. Alamdari, T. Q. Klassen, R. Toro Icarte, and S. A. McIlraith. Be considerate: Avoiding negative side effects in reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 18–26, 2022.
- Alamdari et al. [2024] P. A. Alamdari, T. Q. Klassen, E. Creager, and S. A. Mcilraith. Remembering to be fair: Non-Markovian fairness in sequential decision making. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 906–920, 2024.
- Babichenko et al. [2024] Y. Babichenko, M. Feldman, R. Holzman, and V. V. Narayan. Fair division via quantile shares. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1235–1246, 2024.
- Bradley and Terry [1952] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Caragiannis et al. [2016] I. Caragiannis, A. D. Procaccia, and N. Shah. When do noisy votes reveal the truth? ACM Transactions on Economics and Computation, 4(3): article 15, 2016.
- Chaudhury et al. [2024] B. R. Chaudhury, A. Murhekar, Z. Yuan, B. Li, R. Mehta, and A. D. Procaccia. Fair federated learning via the proportional veto core. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
- Clegg [2023] N. Clegg. Bringing people together to inform decision-making on generative AI. Blog post, 2023. URL https://about.fb.com/news/2023/06/generative-ai-community-forum.
- Conitzer and Sandholm [2005] V. Conitzer and T. Sandholm. Common voting rules as maximum likelihood estimators. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 145–152, 2005.
- Conitzer et al. [2024] V. Conitzer, R. Freedman, J. Heitzig, W. H. Holliday, B. M. Jacobs, N. Lambert, M. Mosse, E. Pacuit, S. Russell, H. Schoelkopf, E. Tewolde, and W. S. Zwicker. Position: Social choice should guide AI alignment in dealing with diverse human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 9346–9360, 2024.
- D’Amour et al. [2020] A. D’Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, and Y. Halpern. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 3rd Conference on Fairness, Accountability, and Transparency (FAccT), pages 525–534, 2020.
- Dyer et al. [1991] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1–17, 1991.
- Fan et al. [2023] Z. Fan, N. Peng, M. Tian, and B. Fain. Welfare and fairness in multi-objective reinforcement learning. In Proceedings of the 22nd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1991–1999, 2023.
- Fishburn [1984] P. C. Fishburn. Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692, 1984.
- Garey et al. [1976] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976. ISSN 0304-3975.
- Gibbard [1973] A. Gibbard. Manipulation of voting schemes. Econometrica, 41:587–602, 1973.
- Grünbaum [1960] B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257–1261, 1960.
- Gurobi Optimization, LLC [2023] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
- Hammond [1991] P. J. Hammond. Interpersonal comparisons of utility: Why and how they are and should be made. In J. Elster and J. E. Roemer, editors, Interpersonal Comparisons of Well-Being, chapter 7, pages 200–254. Cambridge University Press, 1991.
- Håstad [1999] J. Håstad. Clique is hard to approximate within $n^{1-\epsilon}$. Acta Mathematica, 182:105–142, 1999.
- Hayes et al. [2022] C. F. Hayes, R. Rădulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M Zintgraf, R. Dazeley, F. Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022.
- Ianovski and Kondratev [2021] E. Ianovski and A. Y. Kondratev. Computing the proportional veto core. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
- Ju et al. [2023] P. Ju, A. Ghosh, and N. Shroff. Achieving fairness in multi-agent MDP using reinforcement learning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023.
- Kondratev and Ianovski [2024] A. Y. Kondratev and E. Ianovski. Veto core consistent preference aggregation. In Proceedings of the 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1020–1028, 2024.
- Mandal and Gan [2022] D. Mandal and J. Gan. Socially fair reinforcement learning. CoRR, abs/2208.12584, 2022.
- Moulin [1981] H. Moulin. The proportional veto principle. The Review of Economic Studies, 48(3):407–416, 1981.
- Ng and Russell [2000] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, 2000.
- Noothigattu et al. [2021] R. Noothigattu, T. Yan, and A. D. Procaccia. Inverse reinforcement learning from like-minded teachers. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), pages 9197–9204, 2021.
- Ogryczak et al. [2013] W. Ogryczak, P. Perny, and P. Weng. A compromise programming approach to multiobjective markov decision processes. International Journal of Information Technology & Decision Making, 12(05):1021–1053, 2013.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
- Puterman [2014] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- Roijers et al. [2013] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
- Satterthwaite [1975] M. Satterthwaite. Strategy-proofness and Arrow’s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory, 10:187–217, 1975.
- Siddique et al. [2020] U. Siddique, P. Weng, and M. Zimmer. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 8905–8915, 2020.
- Sorensen et al. [2024] T. Sorensen, J. Moore, J. Fisher, M. L. Gordon, N. Mireshghallah, C. Michael Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, T. Althoff, and Y. Choi. Position: A roadmap to pluralistic alignment. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 46280–46302, 2024.
- Swamy et al. [2024] G. Swamy, C. Dann, R. Kidambi, S. Wu, and A. Agarwal. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 47345–47377, 2024.
- Udell and Boyd [2013] M. Udell and S. Boyd. Maximizing a sum of sigmoids. Optimization and Engineering, pages 1–25, 2013.
- Yang et al. [2019] R. Yang, X. Sun, and K. Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), 2019.
- Young [1988] H. P. Young. Condorcet’s theory of voting. The American Political Science Review, 82(4):1231–1244, 1988.
- Zahavy et al. [2021] T. Zahavy, B. O’Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex MDPs. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 25746–25759, 2021.
- Zaremba et al. [2023] W. Zaremba, A. Dhar, L. Ahmad, T. Eloundou, S. Santurkar, S. Agarwal, and J. Leung. Democratic inputs to AI. Blog post, 2023. URL https://openai.com/blog/democratic-inputs-to-ai.
- Zhong et al. [2024] H. Zhong, Z. Deng, W. J. Su, Z. S. Wu, and L. Zhang. Provable multi-party reinforcement learning with diverse human feedback. arXiv preprint arXiv:2403.05006, 2024.
Appendix A Proofs of Section 6
We define a fully connected MOMDP as a MOMDP where for every pair of states $s, s'$ and any action $a$, $P(s' \mid s, a) = 1/|S|$. Throughout the hardness proofs, we construct fully connected MOMDPs with a uniform initial distribution. For such MOMDPs, it is not difficult to observe that, for both the average-reward and discounted-reward cases, the state-action occupancy polytope consists of exactly the occupancy measures of the form $d(s, a) = \pi(a \mid s)/|S|$, i.e., it is a scaled product of action simplices. Furthermore,
$$V_i(\pi) = \frac{1}{|S|} \sum_{s \in S} \sum_{a \in A} \pi(a \mid s)\, r_i(s, a). \tag{1}$$
A.1 Proof of Theorem 4
Proof of Theorem 4.
We show hardness of approximation by an approximation-preserving reduction from the maximum independent set (MIS) problem.
Definition 7 (maximum independent set (MIS)).
For a graph $G = (V, E)$ with vertex set $V$ and edge set $E$, a maximum independent set (MIS) of $G$ is a maximum-cardinality subset of vertices $U \subseteq V$ such that there is no edge between any pair of vertices in $U$.
Theorem 7 (Håstad [20]).
For any fixed $\epsilon > 0$, there is no polynomial-time $n^{1/2-\epsilon}$-approximation algorithm for the MIS problem unless P = NP, and no $n^{1-\epsilon}$-approximation algorithm unless ZPP = NP.
Construction of the MOMDP. Let $G = (V, E)$ be a graph for which we want to find a maximum independent set. We create a fully connected MOMDP with $|E|$ states, each corresponding to an edge $e = (u, v) \in E$. There are only two actions. In the state of edge $e = (u, v)$, the two actions correspond to $u$ and $v$, respectively. We create $|V|$ agents, where agent $i$ corresponds to vertex $v_i$. The reward function of agent $i$ for the state of edge $e$ and the action corresponding to vertex $u$ is $1$ if $u = v_i$, and $0$ otherwise.
In other words, the reward function of agent $i$ encodes the set of edges incident to vertex $v_i$.
Correctness of the reduction. A policy is optimal for agent $i$ if and only if, for all the edges incident to $v_i$, the action corresponding to $v_i$ is selected with probability $1$. If a policy is considered optimal by two agents $i$ and $j$, then $(v_i, v_j) \notin E$, since at the state of an edge $(v_i, v_j)$ the policy cannot select both actions with probability $1$. Therefore, the set of agents that consider a policy optimal corresponds to an independent set in $G$. Furthermore, take any independent set $U$. Let $\pi_U$ be the policy that, for each edge, selects the action corresponding to a vertex in $U$ and, if no such vertex exists, selects one arbitrarily. This policy is well defined, as $U$ is an independent set with no edges between its vertices. Policy $\pi_U$ is optimal for the agents of $U$, since at each state their favourite action is selected. Thus, we have an equivalence between the maximum independent set of $G$ and the plurality winner policy of the constructed MOMDP, and the hardness of approximation follows from Theorem 7. ∎
A.2 Proof of Theorem 5
Proof of Theorem 5.
We reduce the MAX-2SAT problem to the problem of computing an $\alpha$-approval winning policy for a suitable fixed value of $\alpha$.
For two Boolean variables $x$ and $y$, we denote their disjunction (i.e., logical or) by $x \vee y$ and the negation of $x$ by $\bar{x}$.
Definition 8 (maximum 2-satisfiability (MAX-2SAT)).
Given a set of Boolean variables $x_1, \ldots, x_p$ and a set of $2$-literal disjunction clauses $C_1, \ldots, C_m$, the goal of the maximum 2-satisfiability problem (MAX-2SAT) is to find the maximum number of clauses that can be satisfied simultaneously by a truth assignment to the variables.
Garey et al. [15] showed that the MAX-2SAT problem is NP-hard.
Construction of the MOMDP. For an instance of the MAX-2SAT problem, let $C_1, \ldots, C_m$ be a set of $2$-literal disjunction clauses over variables $x_1, \ldots, x_p$. We create a fully connected MOMDP with $p$ states — state $s_j$ representing variable $x_j$. There are only two actions, $a_T$ and $a_F$, which at state $s_j$ correspond to setting variable $x_j$ to true and false, respectively. This way, a policy's probability $\pi(a_T \mid s_j)$ can be interpreted as the probability of setting the variable $x_j$ to true, subject to $\pi(a_T \mid s_j) + \pi(a_F \mid s_j) = 1$.
Agents and reward functions. We introduce some notation before defining the agents and their reward functions. Take a clause $C = \ell_1 \vee \ell_2$, where $\ell_1$ and $\ell_2$ are literals over the variables $x_j$ and $x_{j'}$, respectively. Each assignment of a truth value to a literal maps to a state-action pair: setting $x_j$ to true (false) corresponds to the pair $(s_j, a_T)$ (respectively, $(s_j, a_F)$), and a literal evaluates to true when its corresponding pair is selected with probability one. Now we are ready to introduce the agents. For each clause $C = \ell_1 \vee \ell_2$, we create three agents, one for each assignment of the two literals that satisfies $C$:

- An agent whose reward is $1$ on the two state-action pairs under which $\ell_1$ and $\ell_2$ are both true, and zero otherwise. An optimal policy of this agent chooses these two state-action pairs exclusively, which can be interpreted as $\ell_1 = \text{true}$ and $\ell_2 = \text{true}$, an assignment that satisfies $C$.
- An agent whose reward is $1$ on the two state-action pairs under which $\ell_1$ is true and $\ell_2$ is false — note the negation. This is another assignment of the variables, $\ell_1 = \text{true}$ and $\ell_2 = \text{false}$, that satisfies $C$.
- Similarly, an agent whose reward is $1$ on the two state-action pairs under which $\ell_1$ is false and $\ell_2$ is true (which again satisfies $C$).
For ease of exposition, and by a slight abuse of notation, in the rest of the proof we use $\pi$ to refer to the occupancy measure as well. From Equation (1), the expected return of every agent is a scaled sum of the probabilities assigned to its two rewarded state-action pairs, and the $\alpha$-approval rule is invariant to affine transformations of the rewards; therefore, we may drop the $1/|S|$ factor and work with rescaled returns.
Expected return distribution. With this rescaling, the expected return of an agent for a policy $\pi$ is the sum of the probabilities that $\pi$ assigns to the agent's two rewarded state-action pairs. The cdf of this return over the occupancy polytope has a closed form; see Figure 2 for a visualization. The cdfs of all the other agents have the same form, and each agent's maximum expected return is attained by selecting its two rewarded state-action pairs with probability one.
Rounding a policy. For an agent to $\alpha$-approve a policy, its expected return must reach the agent's $\alpha$-quantile threshold, which for the chosen value of $\alpha$ forces the policy to select each of the agent's two rewarded state-action pairs with probability greater than $1/2$. Further, observe that this condition can hold for at most one of the three agents of a clause, as every pair of these agents disagrees on one literal. We round such a policy $\pi$ to an assignment $\sigma^{\pi}$ by setting $x_j$ to true if $\pi(a_T \mid s_j) > 1/2$ and to false otherwise. If an agent of clause $C$ $\alpha$-approves $\pi$, then $C$ is satisfied by the assignment $\sigma^{\pi}$. Therefore, we have $\mathrm{OPT}_{\text{2SAT}} \geq \mathrm{OPT}_{\text{approval}}$, where $\mathrm{OPT}_{\text{approval}}$ is the maximum number of $\alpha$-approvals among all policies and $\mathrm{OPT}_{\text{2SAT}}$ is the maximum number of clauses that can be satisfied among all assignments.
Next, we show that $\mathrm{OPT}_{\text{approval}} \geq \mathrm{OPT}_{\text{2SAT}}$ by deriving a policy that gets $\mathrm{OPT}_{\text{2SAT}}$ many $\alpha$-approvals from an optimal assignment $\sigma^*$: simply let $\pi(a_T \mid s_j) = 1$ if and only if $\sigma^*$ sets $x_j$ to true. If $\sigma^*$ satisfies a clause $C$, then the agent of $C$ that matches the literal assignments of $\sigma^*$ attains its maximum expected return, so $\pi$ is an optimal — and in particular $\alpha$-approved — policy for that agent. Each of the other two agents of $C$ disagrees with $\sigma^*$ on one literal and therefore receives a return below its approval threshold. Therefore, $\mathrm{OPT}_{\text{approval}} \geq \mathrm{OPT}_{\text{2SAT}}$, and the hardness of computation follows from the reduction. ∎
Appendix B Experimental Details
Experiments are all done on an AMD EPYC 7502 32-Core Processor with 258GiB system memory. We use Gurobi [18] to solve LPs and MILPs.
Running time. All of our voting rules have a running time of less than ten minutes on the constructed MOMDPs described in Section 7. The most resource-intensive task in our experiments was sampling random policies and computing the expected return of every agent, which is a standard task; for each instance, we did this in parallel across multiple processes, with a total running time of a few hours per instance.
Implementation details of the Borda count rule. After fitting a generalized logistic function for $F_i$ based on the expected returns of the sampled policies, we find, for each agent, the return level $m_i$ above which the fitted $F_i$ is concave, and check the existence of a policy that achieves an expected return of at least $m_i$ for all agents by solving a linear program. Next, to utilize Gurobi's LP solver, we approximate the concave portion of $F_i$ by a set of linear constraints that form a piecewise linear concave approximation of $F_i$. Therefore, our final program for an approximate Borda count rule is simply an LP.
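The piecewise linear step can be illustrated as follows: the concave portion of a fitted logistic cdf is upper-bounded by a handful of tangent lines, so a term of the form $F_i(r_i \cdot d)$ can be represented inside an LP by linear constraints. The logistic parameters and tangent points below are made up for illustration.

```python
import numpy as np

# Sketch of the piecewise-linear concave approximation: a concave function F is
# upper-bounded by its tangent lines, so "t <= F(v)" becomes the linear constraints
# "t <= slope_j * v + intercept_j" for a few tangent points.

def F(v):                                   # logistic-style cdf, concave for v >= 0.5
    return 1.0 / (1.0 + np.exp(-8.0 * (v - 0.5)))

def dF(v):
    f = F(v)
    return 8.0 * f * (1.0 - f)

knots = np.linspace(0.5, 1.0, 6)            # tangent points inside the concave region
slopes = dF(knots)
intercepts = F(knots) - slopes * knots

def F_approx(v):                            # pointwise minimum of the tangent lines
    return np.min(slopes[:, None] * v + intercepts[:, None], axis=0)

vs = np.linspace(0.5, 1.0, 5)
print(np.round(F(vs), 3))
print(np.round(F_approx(vs), 3))            # close upper bound on the concave region
```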