
Policy Aggregation

Parand A. Alamdari
University of Toronto & Vector Institute
[email protected]

Soroush Ebadian*
University of Toronto
[email protected]

Ariel D. Procaccia
Harvard University
[email protected]

* Equal contribution. Authors are listed alphabetically.
Abstract

We consider the challenge of AI value alignment with multiple individuals that have different reward functions and optimal policies in an underlying Markov decision process. We formalize this problem as one of policy aggregation, where the goal is to identify a desirable collective policy. We argue that an approach informed by social choice theory is especially suitable. Our key insight is that social choice methods can be reinterpreted by identifying ordinal preferences with volumes of subsets of the state-action occupancy polytope. Building on this insight, we demonstrate that a variety of methods — including approval voting, Borda count, the proportional veto core, and quantile fairness — can be practically applied to policy aggregation.

1 Introduction

Early discussion of AI value alignment had often focused on learning desirable behavior from an individual teacher, for example, through inverse reinforcement learning [27, 1]. But, in recent years, the conversation has shifted towards aligning AI models with large groups of people or even entire societies. This shift is exemplified at a policy level by OpenAI’s “democratic inputs to AI” program [41] and Meta’s citizens’ assembly on AI governance [8], and at a technical level by the ubiquity of reinforcement learning from human feedback [30] as a method for fine-tuning large language models.

We formalize the challenge of value alignment with multiple individuals as a problem that we view as fundamental — policy aggregation. Our starting point is the common assumption that the environment can be represented as a Markov decision process (MDP). While the states, actions and transition functions are shared by all agents, their reward functions — which incorporate values, priorities or subjective beliefs — may be different. In particular, each agent has its own optimal policy in the underlying MDP. Our question is this: How should we aggregate the individual policies into a desirable collective policy?

A naïve answer is to define a new reward function that is the sum of the agents' reward functions (for each state-action pair separately) and compute an optimal policy for this aggregate reward function; such a policy would guarantee maximum utilitarian social welfare. This approach has a major shortcoming, however, in that it is sensitive to affine transformations of rewards, so, for example, if we doubled one of the reward functions, the aggregate optimal policy may change. This is an issue because each agent's individual optimal policy is invariant to (positive) affine transformations of rewards, so while it is possible to recover a reward function that induces an agent's optimal policy by observing their actions over time (and we assume this is done accurately, in order to focus on the essence of the policy aggregation problem), it is impossible to distinguish between reward functions that are affine transformations of each other. More broadly, economists and moral philosophers have long been skeptical about interpersonal comparisons of utility [19] due to the lack of a universal scale, an issue that is especially pertinent in our context. Therefore, aggregation methods that are invariant to affine transformations are strongly preferred.

Our approach. To develop such aggregation methods, we look to social choice theory, which typically deals with the aggregation of ordinal preferences. To take a canonical example, suppose agents report rankings over $m$ alternatives. Under the Borda count rule, each voter gives $m-k$ points to the alternative they rank in the $k$th position, and the alternative with the most points overall is selected.

The voting approach can be directly applied to our setting. For each agent, it is (in theory) possible to compute the value of every possible (deterministic) policy, and rank them all by value. Then, any standard voting rule, such as Borda count, can be used to aggregate the rankings over policies and single out a desirable policy. The caveat, of course, is that this method is patently impractical, because the number of policies is exponential in the number of states of the MDP.

The main insight underlying our approach is that ordinal preferences over policies have a much more practical volumetric interpretation in the state-action occupancy polytope $\mathcal{O}$. Roughly speaking, a point in the state-action occupancy polytope represents a (stochastic) policy through the frequency with which it is expected to visit different state-action pairs. If a policy is preferred by an agent to a subset of policies $\mathcal{O}'$, its "rank" is the volume of $\mathcal{O}'$ as a fraction of the volume of $\mathcal{O}$. The "score" of a policy under Borda count, for example, can be interpreted as the sum of these "ranks" over all agents.

Our results. We investigate two classes of rules from social choice theory: rules that guarantee a notion of fairness, and voting rules. By mapping ordinal preferences to the state-action occupancy polytope, we adapt the different rules to the policy aggregation problem.

The former class is examined in Section 5. As a warm-up we start from the notion of proportional veto core; it follows from recent work by Chaudhury et al. [7] that a volumetric interpretation of this notion is nonempty and can be computed efficiently. We then turn to quantile fairness, which was recently introduced by Babichenko et al. [4]; we prove that the volumetric interpretation of this notion yields guarantees that are far better than those known for the original, discrete setting, and we design a computationally efficient algorithm to optimize those guarantees.

The latter class is examined in Section 6; we focus on volumetric interpretations of $\alpha$-approval (including the ubiquitous plurality rule, which is the special case of $\alpha=1$) and the aforementioned Borda count. In contrast to the rules studied in Section 5, existence is a nonissue for these voting rules, but computation is a challenge, and indeed we establish several computational hardness results. To overcome this obstacle, we implement voting rules for policy aggregation through mixed integer linear programming, which leads to practical solutions.

Finally, our experiments in Section 7 evaluate the policies returned by different rules based on their fairness; the results identify quantile fairness as especially appealing. The experiments also illustrate the advantage of our approach over rules that optimize measures of social welfare (which are sensitive to affine transformations of the rewards).

2 Related Work

Noothigattu et al. [28] consider a setting related to ours, in that different agents have different reward functions and different policies that must be aggregated. However, they assume that the agents’ reward functions are noisy perturbations of a ground-truth reward function, and the goal is to learn an optimal policy according to the ground-truth rewards. In social choice terms, our work is akin to the typical setting where subjective preferences must be aggregated, whereas the work of Noothigattu et al. [28] is conceptually similar to the setting where votes are seen as noisy estimates of a ground-truth ranking [39, 9, 6].

Chaudhury et al. [7] study a problem completely different from ours: fairness in federated learning. However, their technical approach served as an inspiration for ours. Specifically, they consider the proportional veto core and transfer it to the federated learning setting using volumetric arguments, by considering volumes of subsets in the space of models. Their proof that the proportional veto core is nonempty carries over to our setting, as we explain in Section 5.

There is a body of work on multi-objective reinforcement learning (MORL) and planning that uses a scalarization function to reduce the problem to a single-objective one [32, 21]. Other solutions to MORL focus on developing algorithms to identify a set of policies approximating the problem’s Pareto frontier [38]. A line of work more closely related to ours focuses on fairness in sequential decision making, often taking the scalarization approach to aggregate agents’ preferences by maximizing a (cardinal) social welfare function, which maps the vector of agent utilities to a single value. Ogryczak et al. [29] and Siddique et al. [34] investigate generalized Gini social welfare, and Mandal and Gan [25], Fan et al. [13] and Ju et al. [23] focus on Nash and egalitarian social welfare. Alamdari et al. [3] study this problem in a non-Markovian setting, where fairness depends on the history of actions over time, and introduce concepts to assess different fairness criteria at varying time intervals. A shortcoming of these solutions is that they are not invariant to affine transformations of rewards — a property that is crucial in our setting, as discussed earlier.

Our work is closely related to the pluralistic alignment literature, aiming to develop AI systems that reflect the values and preferences of diverse individuals [35, 10]. Alamdari et al. [2] propose a framework in reinforcement learning in which an agent learns to act in a way that is considerate of the values and perspectives of humans within a particular environment. Concurrent work explores reinforcement learning from human feedback (RLHF) from a social choice perspective, where the reward model is based on pairwise human preferences, often constructed using the Bradley-Terry model [5]. Zhong et al. [42] consider the maximum Nash and egalitarian welfare solutions, and Swamy et al. [36] propose a method based on maximal lotteries due to Fishburn [14].

3 Preliminaries

For $t\in\mathbb{N}$, let $[t]=\{1,2,\ldots,t\}$. For a closed set $S$, let $\Delta(S)$ denote the probability simplex over $S$. We denote the dot product of two vectors $x,y\in\mathbb{R}^d$ by $\langle x,y\rangle=\sum_{i=1}^{d}x_i y_i$. A halfspace in $\mathbb{R}^d$ determined by $w\in\mathbb{R}^d$ and $b\in\mathbb{R}$ is the set of points $\{x\in\mathbb{R}^d\mid\langle x,w\rangle\leqslant b\}$. A polytope $\mathcal{O}\subseteq\mathbb{R}^d$ is the intersection of a finite number of halfspaces, i.e., a convex subset of the $d$-dimensional space $\mathbb{R}^d$ determined by a set of linear constraints $\{x\mid Ax\leqslant b\}$, where $A\in\mathbb{R}^{k\times d}$ is the coefficient matrix of $k$ linear inequalities and $b\in\mathbb{R}^{k}$.

3.1 Multi-Objective Markov Decision Processes

A multi-objective Markov decision process (MOMDP) is a tuple defined as $M=\langle\mathcal{S},\mathcal{A},\mathcal{P},R_1,\dots,R_n\rangle$ in the average-reward case and $M=\langle\mathcal{S},d_{\mathrm{init}},\mathcal{A},\mathcal{P},R_1,\dots,R_n,\gamma\rangle$ in the discounted-reward case, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, and $\mathcal{P}:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ is the transition probability distribution; $\mathcal{P}(s_t,a_t,s_{t+1})$ is the probability of transitioning to state $s_{t+1}$ by taking action $a_t$ in $s_t$. For $i\in[n]$, $R_i:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function of the $i$th agent, the initial state is sampled from $d_{\mathrm{init}}\in\Delta(\mathcal{S})$, and $\gamma\in(0,1]$ is the discount factor.

A (Markovian) policy $\pi(a|s)$ is a probability distribution over the actions $a\in\mathcal{A}$ given the state $s\in\mathcal{S}$. A policy is deterministic if at each state $s$ one action is selected with probability $1$, and stochastic otherwise. The expected average return and the expected discounted return of agent $i$ for a policy $\pi$ are defined over an infinite time horizon as

$$J^{\operatorname{avg}}_{i}(\pi)=\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}_{\pi,\mathcal{P}}\!\left[\sum_{t=1}^{T}R_{i}(s_{t},a_{t})\right],\qquad J^{\gamma}_{i}(\pi)=(1-\gamma)\,\mathbb{E}_{\pi,\mathcal{P}}\!\left[\sum_{t=1}^{\infty}\gamma^{t}R_{i}(s_{t},a_{t})\;\Big|\;s_{1}\sim d_{\mathrm{init}}\right],$$

where the expectation is over the state-action pairs at time $t$ based on both the policy $\pi$ and the transition function $\mathcal{P}$.

Definition 1 (state-action occupancy measure).

Let $\mathcal{P}^{t}_{\pi}$ be the probability measure over states at time $t$ under policy $\pi$. The state-action occupancy measure for state $s$ and action $a$ is defined as

$$d_{\pi}^{\operatorname{avg}}(s,a)=\lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T}\mathcal{P}^{t}_{\pi}(s)\,\pi(a|s)\right],\qquad d_{\pi}^{\gamma}(s,a)=(1-\gamma)\,\mathbb{E}\!\left[\sum_{t=1}^{\infty}\gamma^{t}\,\mathcal{P}^{t}_{\pi}(s)\,\pi(a|s)\right].$$

For both the average and discounted cases, we can rewrite the expected return as the dot product of the state-action occupancy measure and the rewards, that is, $J_{i}(\pi)=\sum_{(s,a)}d_{\pi}(s,a)\,R_{i}(s,a)=\langle d_{\pi},R_{i}\rangle$. In addition, the policy can be recovered from the occupancy measure as $\pi(a|s)=d_{\pi}(s,a)/\sum_{a'}d_{\pi}(s,a')$ if $\sum_{a'}d_{\pi}(s,a')>0$, and $\pi(a|s)=1/|\mathcal{A}|$ otherwise.
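To make these identities concrete, the following minimal NumPy sketch computes the discounted occupancy measure of a fixed policy by solving the linear flow equations that define the occupancy polytope (Definition 2 below), evaluates $J_i(\pi)=\langle d_\pi,R_i\rangle$, and recovers the policy from the occupancy measure. It is purely illustrative; the array layout and helper names are our own choices rather than part of any released implementation.

```python
import numpy as np

def discounted_occupancy(P, pi, d_init, gamma):
    """d_pi(s, a) for the discounted case, obtained by solving the flow
    equations rho = (1 - gamma) * d_init + gamma * P_pi^T rho.

    P: transition tensor of shape (S, A, S); pi: policy of shape (S, A);
    d_init: initial state distribution of shape (S,)."""
    S = P.shape[0]
    P_pi = np.einsum("sap,sa->sp", P, pi)      # state-to-state kernel under pi
    rho = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d_init)
    return rho[:, None] * pi                   # d_pi(s, a) = rho(s) * pi(a|s)

def expected_return(d_pi, R_i):
    """J_i(pi) = <d_pi, R_i> for a reward table R_i of shape (S, A)."""
    return float(np.sum(d_pi * R_i))

def policy_from_occupancy(d_pi):
    """pi(a|s) = d_pi(s, a) / sum_a d_pi(s, a); uniform where the state
    is never visited."""
    totals = d_pi.sum(axis=1, keepdims=True)
    A = d_pi.shape[1]
    return np.where(totals > 0, d_pi / np.where(totals > 0, totals, 1.0), 1.0 / A)
```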

Definition 2 (state-action occupancy polytope [31, 40]).

For a MOMDP $M$, in the average-reward case, the space of valid state-action occupancies is the polytope

$$\mathcal{O}^{\operatorname{avg}}=\bigg\{d^{\operatorname{avg}}_{\pi}\;\bigg|\;d^{\operatorname{avg}}_{\pi}\geqslant 0,\;\sum_{s,a}d^{\operatorname{avg}}_{\pi}(s,a)=1,\;\sum_{a}d^{\operatorname{avg}}_{\pi}(s,a)=\sum_{s',a'}\mathcal{P}(s',a',s)\,d^{\operatorname{avg}}_{\pi}(s',a')\;\;\forall s\in\mathcal{S}\bigg\}.$$

Similarly, in the discounted-reward case, the space of state-action occupancies is the polytope

$$\mathcal{O}^{\gamma}=\bigg\{d^{\gamma}_{\pi}\;\bigg|\;d^{\gamma}_{\pi}\geqslant 0,\;\sum_{a}d^{\gamma}_{\pi}(s,a)=(1-\gamma)\,d_{\mathrm{init}}(s)+\gamma\sum_{s',a'}\mathcal{P}(s',a',s)\,d^{\gamma}_{\pi}(s',a')\;\;\forall s\in\mathcal{S}\bigg\}.$$

A mechanism receives a MOMDP and aggregates the agents' preferences into a policy. An economic efficiency axiom in the social choice literature is that of Pareto optimality.

Definition 3 (Pareto optimality).

For a MOMDP $M$ and $\epsilon\geqslant 0$, a policy $\pi$ is $\epsilon$-Pareto optimal if there does not exist another policy $\pi'$ such that $J_{i}(\pi')\geqslant J_{i}(\pi)+\epsilon$ for all $i\in[n]$, with strict inequality for at least one agent. For $\epsilon=0$, we simply call such policies Pareto optimal.

We call a mechanism Pareto optimal if it always returns a Pareto optimal policy. In special cases where all agents unanimously agree on an optimal policy, Pareto optimality implies that the mechanism will return one such policy. We discuss Pareto optimal implementations of all mechanisms in this work.

3.2 Voting and Social Choice Functions

In the classical social choice setting, we have a set of $n$ agents and a set $C$ of $m$ alternatives. The preferences of voter $i\in[n]$ are represented as a strict ordering over the alternatives, $\sigma_{i}:[m]\to C$ with $\sigma_{i}(1)\succ_{i}\sigma_{i}(2)\succ_{i}\ldots\succ_{i}\sigma_{i}(m)$, where $c_{1}\succ_{i}c_{2}$ denotes that agent $i$ prefers $c_{1}$ over $c_{2}$ for $c_{1},c_{2}\in C$. A (possibly randomized) voting rule aggregates the agents' preferences and returns an alternative or a distribution over the alternatives as the collective decision.

Positional Scoring Rules. A positional scoring rule with scoring vector $\vec{s}=(s_{1},\ldots,s_{m})$ such that $s_{1}\geqslant s_{2}\geqslant\ldots\geqslant s_{m}$ works as follows. Each agent gives a score of $s_{1}$ to their top choice, a score of $s_{2}$ to their second choice, and so on. The votes are tallied and an alternative with the maximum total score is selected. A few well-known positional scoring rules are: plurality, $(1,0,\ldots,0)$; Borda, $(m-1,m-2,\ldots,1,0)$; and $k$-approval, $(1,\ldots,1,0,\ldots,0)$ with $k$ ones.

4 Occupancy Polytope as the Space of Alternatives

In a MOMDP $M$, each agent $i$ incorporates their values and preferences into their respective reward function $R_{i}$. Agent $i$ prefers $\pi$ over $\pi'$ if and only if $\pi$ achieves a higher expected return, $J_{i}(\pi)>J_{i}(\pi')$, and is indifferent between two policies $\pi$ and $\pi'$ if and only if $J_{i}(\pi)=J_{i}(\pi')$. As discussed before, given a state-action occupancy measure $d_{\pi}$ in the state-action occupancy polytope $\mathcal{O}$, we can recover the corresponding policy $\pi$. Therefore, we can interpret $\mathcal{O}$ as the domain of all possible alternatives over which the $n$ agents have heterogeneous weak preferences (with ties). Agent $i$ prefers $d_{\pi}$ to $d_{\pi'}$ in $\mathcal{O}$ if and only if they prefer $\pi$ to $\pi'$. We study the policy aggregation problem through this lens; specifically, we design or adapt voting mechanisms where the (continuous) space of alternatives is $\mathcal{O}$ and agents have weak preferences over it determined by their reward functions $R_{i}$.

Affine transformation and reward normalization. A particular benefit of this interpretation, as mentioned before, is that all positive affine transformations of the reward functions, i.e., $aR_{i}+b$ for $a\in\mathbb{R}_{>0}$ and $b\in\mathbb{R}$, yield the same weak ordering over the polytope. Hence, we can assume without loss of generality that $J_{i}(\pi)\in[0,1]$. Further, we can ignore agents that are indifferent between all policies, i.e., those with $\min_{\pi}J_{i}(\pi)=\max_{\pi}J_{i}(\pi)$, and normalize the reward functions as $R_{i}\leftarrow\frac{R_{i}-\min_{\pi}J_{i}(\pi)}{\max_{\pi}J_{i}(\pi)-\min_{\pi}J_{i}(\pi)}$ so that $\min_{\pi}J_{i}(\pi)=0$ and $\max_{\pi}J_{i}(\pi)=1$. The relative ordering of the policies does not change, since for all points $d_{\pi}\in\mathcal{O}$ we have $\sum_{s,a}d_{\pi}(s,a)=1$.
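As a concrete illustration of this normalization step, the extreme returns $\min_\pi J_i(\pi)$ and $\max_\pi J_i(\pi)$ can be computed by linear programming over the occupancy polytope. The sketch below assumes the polytope constraints of Definition 2 are given as equality constraints `A_eq`, `b_eq` over the flattened occupancy vector; it is an illustration, not the exact implementation.

```python
import numpy as np
from scipy.optimize import linprog

def return_range(R_i, A_eq, b_eq):
    """Min and max expected return of agent i over O = {d >= 0, A_eq d = b_eq},
    with d and R_i flattened to vectors of length S*A."""
    r = np.asarray(R_i).ravel()
    lo = linprog(c=r, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
    hi = -linprog(c=-r, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun
    return lo, hi

def normalize_reward(R_i, A_eq, b_eq):
    """Rescale R_i so that min_pi J_i(pi) = 0 and max_pi J_i(pi) = 1; agents
    that are indifferent between all policies (lo == hi) are dropped upstream."""
    lo, hi = return_range(R_i, A_eq, b_eq)
    return (np.asarray(R_i) - lo) / (hi - lo)
```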

Volumetric definitions. A major difference between voting over a continuous space of alternatives and the classical voting setting is that the domain of alternatives is infinite, so not all voting mechanisms can be directly applied to the policy aggregation problem. In particular, various voting rules require reasoning about the rank of an alternative or the size of some subset of alternatives. For instance, the Borda count of an alternative $c$ over a finite set of alternatives is defined as the number (or fraction) of candidates ranked below $c$. In the continuous setting, for almost all of the mechanisms that we discuss later, we use the measure or volume of a subset of alternatives to refer to its size. For a measurable subset $\mathcal{O}'\subseteq\mathcal{O}$, let $\mathrm{vol}(\mathcal{O}')$ denote its measure. The ratio $\mathrm{vol}(\mathcal{O}')/\mathrm{vol}(\mathcal{O})$ is the fraction of alternatives that lie in $\mathcal{O}'$. A probabilistic interpretation is that, for the uniform distribution over the polytope $\mathcal{O}$, $\mathrm{vol}(\mathcal{O}')/\mathrm{vol}(\mathcal{O})$ is the probability that a policy sampled uniformly from $\mathcal{O}$ lies in $\mathcal{O}'$. We can also define the expected return distribution of an agent over $\mathcal{O}$ as a random variable that maps a policy to its expected return, i.e., one that maps $d_{\pi}\in\mathcal{O}$ to $\langle d_{\pi},R_{i}\rangle$. The pdf and cdf of this random variable are defined below.

Definition 4 (expected return distribution).

For a MOMDP $M$ and $v\in\mathbb{R}$, the expected return distribution of agent $i\in[n]$ is defined as

$$f_{i}(v)=\frac{1}{\mathrm{vol}(\mathcal{O})}\int_{x\in\mathcal{O}}\delta(v-\langle x,R_{i}\rangle)\,dx,\qquad F_{i}(v)=\int_{-\infty}^{v}f_{i}(x)\,dx=\frac{\mathrm{vol}(\mathcal{O}\cap\{\langle x,R_{i}\rangle\leqslant v\})}{\mathrm{vol}(\mathcal{O})},$$

where $f_{i}$ and $F_{i}$ are the pdf and cdf of the expected return distribution and $\delta(\cdot)$ is the Dirac delta function indicating $v=\langle x,R_{i}\rangle$.

A useful observation about the pdf $f_{i}$ is that it is unimodal, i.e., increasing up to its mode $\mathrm{mode}(f_{i})\in\arg\max_{v\in\mathbb{R}}f_{i}(v)$ and decreasing afterwards, which follows from the Brunn–Minkowski inequality [17]. Since $f_{i}$ (the pdf) is the derivative of $F_{i}$ (the cdf), the unimodality of $f_{i}$ implies that $F_{i}$ is convex on $(-\infty,\mathrm{mode}(f_{i})]$ and concave on $[\mathrm{mode}(f_{i}),\infty)$.

In our algorithms, we use a subroutine that measures the volume of a polytope, which we denote by $\mathrm{vol\text{-}comp}(\{Ax\leqslant b\})$. Dyer et al. [12] designed a fully polynomial-time randomized approximation scheme (FPRAS) for computing the volume of polytopes. We report the running time of our algorithms in terms of the number of calls to this oracle.
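When an exact volume oracle is unavailable, the quantities $F_i(v)$ can also be estimated by simple Monte Carlo, in the spirit of the sampling-based estimation used in the experiments of Section 7. The sketch below reuses the hypothetical `discounted_occupancy` helper from Section 3.1; note that random stochastic policies are not uniformly distributed over $\mathcal{O}$, so this is a heuristic stand-in rather than an unbiased estimate of the volume ratio.

```python
import numpy as np

def sample_returns(P, R_i, d_init, gamma, n_samples=10_000, seed=0):
    """Sample random stochastic policies, map each to its occupancy measure,
    and record the expected return J_i = <d_pi, R_i>."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    returns = np.empty(n_samples)
    for k in range(n_samples):
        pi = rng.dirichlet(np.ones(A), size=S)          # random policy, shape (S, A)
        d_pi = discounted_occupancy(P, pi, d_init, gamma)
        returns[k] = np.sum(d_pi * R_i)
    return np.sort(returns)

def F_hat(v, sorted_returns):
    """Empirical estimate of F_i(v): the fraction of sampled policies whose
    expected return is at most v."""
    return np.searchsorted(sorted_returns, v, side="right") / len(sorted_returns)
```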

5 Fairness in Policy Aggregation

In this section, we utilize the volumetric interpretation of the state-action occupancy polytope to extend fairness notions from social choice to policy aggregation, and we develop algorithms to compute stochastic policies provably satisfying these notions.

5.1 Proportional Veto Core

The proportional veto core was first proposed by Moulin [26] in the classical social choice setting with a finite set of alternatives where agents have full (strict) rankings over the alternatives. For simplicity, suppose the number of alternatives $m$ is a multiple of $n$. The idea of the proportional veto core is that $x\%$ of the agents should be able to veto $x\%$ of the alternatives. More precisely, for an alternative $c$ to be in the proportional veto core, there should not exist a coalition $S$ that can "block" $c$ using their proportional veto power of $|S|/n$. The coalition $S$ blocks $c$ if they can unanimously suggest $m(1-|S|/n)$ candidates that they prefer to $c$. For instance, if $c$ is in the proportional veto core, it cannot be the case that a coalition of $60\%$ of the agents unanimously prefers $40\%$ of the alternatives to $c$.

Chaudhury et al. [7] extended this notion to a continuous domain of alternatives in the federated learning setting. We show that such an extension also applies to policy aggregation.

Definition 5 (proportional veto core).

Let $\epsilon\in(0,1/n)$. For a coalition of agents $S\subseteq[n]$, let $\mathrm{veto}(S)=|S|/n$ be its veto power. A point $d_{\pi}\in\mathcal{O}$ is blocked by a coalition $S$ if there exists a subset $\mathcal{O}'\subseteq\mathcal{O}$ of measure $\mathrm{vol}(\mathcal{O}')/\mathrm{vol}(\mathcal{O})\geqslant 1-\mathrm{veto}(S)+\epsilon$ such that all agents in $S$ prefer all points in $\mathcal{O}'$ to $d_{\pi}$, i.e., $d_{\pi'}\succ_{i}d_{\pi}$ for all $d_{\pi'}\in\mathcal{O}'$ and $i\in S$. A point $d_{\pi}$ is in the $\epsilon$-proportional veto core if it is not blocked by any coalition.

A candidate in the proportional veto core satisfies desirable properties that are extensively discussed in prior work [26, 7, 24, 22]. It is worth mentioning that any candidate in the $\epsilon$-proportional veto core, besides the fairness aspect, is also economically efficient, as it satisfies $\epsilon$-Pareto optimality. This holds since the grand coalition $S=[n]$ can veto any $\epsilon$-Pareto dominated alternative.

Moulin [26] proved that the proportional veto core is nonempty in the discrete setting and Chaudhury et al. [7] proved it for the continuous setting. The following result is a corollary of Theorem 1 of Chaudhury et al. [7].

Theorem 1.

Let $\epsilon\in(0,1/n)$. For a policy aggregation problem, the $\epsilon$-proportional veto core is nonempty. Furthermore, such policies can be found in polynomial time using $O(\log(1/\epsilon))$ calls per agent to $\mathrm{vol\text{-}comp}$.

We refer the reader to the paper of Chaudhury et al. [7] for the complete proof, and provide a high-level description of Algorithm 1, which finds a point in the proportional veto core. Algorithm 1 iterates over the agents and lets the $i$th agent "eliminate" roughly $\frac{1}{n}\cdot\mathrm{vol}(\mathcal{O})$ of the remaining space of alternatives. That is, agent $i$ finds the halfspace $H_{i}=\{\langle x,R_{i}\rangle\leqslant v^{*}_{i}\}$ such that its intersection with $\mathcal{O}_{i-1}$ (the remaining part of $\mathcal{O}$ at the $i$th iteration) has a volume of approximately $\mathrm{vol}(\mathcal{O})/n$. This value can be found for each agent by binary search over $v^{*}_{i}$, to a precision of $[v^{*}_{i}-\epsilon,v^{*}_{i}]$, using $O(\log(1/\epsilon))$ calls to the volume-estimating subroutine $\mathrm{vol\text{-}comp}$ (it is important to check the non-emptiness of the search space $\mathcal{O}_{i}$, as an over-estimate of $v^{*}_{i}$ could lead to an empty feasible region).

Pareto optimality. We briefly discuss why Algorithm 1 is Pareto optimal. During the $i$th iteration, the space of policies is contracted by adding a linear constraint of the form $J_{i}(\pi)\geqslant v^{*}_{i}$. If the returned welfare-maximizing policy $\pi$ (derived from $d_{\pi}$) were Pareto dominated by another policy $\pi'$, then $\pi'$ would satisfy all these linear constraints, as $J_{i}(\pi')\geqslant J_{i}(\pi)\geqslant v^{*}_{i}$ with the first inequality strict for at least one agent. Therefore, $d_{\pi'}\in\mathcal{O}_{n}$ and $\pi'$ achieves higher social welfare, which is a contradiction. The same argument can be used to establish the Pareto optimality of the other mechanisms discussed later; each of these mechanisms searches for a policy that satisfies certain lower bounds on agents' utilities, from which a welfare-maximizing, Pareto optimal policy can be selected.

$\mathcal{O}_{0}\leftarrow\mathcal{O}$, $\quad\delta\leftarrow\frac{1}{n}-\frac{\epsilon}{n+1}$
for $i=1$ to $n$ do
       using binary search, find $v_{i}^{*}\in[0,1]$ s.t. $\mathrm{vol}(\mathcal{O}_{i-1}\cap\{\langle x,R_{i}\rangle\leqslant v_{i}^{*}\})\approx\delta\cdot\mathrm{vol}(\mathcal{O})$
       $\mathcal{O}_{i}\leftarrow\mathcal{O}_{i-1}\cap\{\langle x,R_{i}\rangle\geqslant v_{i}^{*}\}$
return a welfare-maximizing point $d_{\pi}\in\mathcal{O}_{n}$
ALGORITHM 1: Sequential $\epsilon$-Proportional Veto Core [7]
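A compact Python sketch of Algorithm 1 is given below. It assumes a volume-estimation subroutine `vol_comp(cuts)` (e.g., an FPRAS or a Monte Carlo stand-in) that returns the volume of $\mathcal{O}$ intersected with a list of extra halfspaces, each encoded as `(w, b, sense)` for $\langle w,x\rangle\leqslant b$ or $\geqslant b$; the oracle interface and names are our own.

```python
import numpy as np
from scipy.optimize import linprog

def sequential_veto_core(rewards, A_eq, b_eq, vol_comp, eps=1e-2):
    """Sequential epsilon-proportional veto core (Algorithm 1).

    rewards: list of flattened, normalized reward vectors R_i;
    A_eq, b_eq: occupancy-polytope constraints (Definition 2);
    vol_comp(cuts): assumed oracle for the volume of O intersected with
      the halfspaces in `cuts`."""
    n = len(rewards)
    total = vol_comp([])                       # vol(O)
    delta = 1.0 / n - eps / (n + 1)
    cuts = []                                  # accumulated constraints <R_i, x> >= v_i*
    for R in rewards:
        lo, hi = 0.0, 1.0                      # returns are normalized to [0, 1]
        while hi - lo > eps:                   # binary search for v_i*
            mid = 0.5 * (lo + hi)
            if vol_comp(cuts + [(R, mid, "<=")]) / total <= delta:
                lo = mid                       # can still push the cut higher
            else:
                hi = mid
        cuts.append((R, lo, ">="))             # agent i eliminates ~ vol(O)/n
    # welfare-maximizing point of O_n (which also ensures Pareto optimality)
    A_ub = np.array([-R for (R, v, _) in cuts])
    b_ub = np.array([-v for (_, v, _) in cuts])
    res = linprog(c=-sum(rewards), A_eq=A_eq, b_eq=b_eq,
                  A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x                               # occupancy measure d_pi in O_n
```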
Procedure $q$-quantile-feasible($q$):
       $\mathcal{O}_{q}\leftarrow\{x\in\mathcal{O}\mid\langle x,R_{i}\rangle\geqslant F_{i}^{-1}(q),\;i\in[n]\}$
       if $\mathcal{O}_{q}$ is a feasible linear program then return True else return False
using binary search, find the maximum $q\in[0,1]$ s.t. $q$-quantile-feasible($q$) returns True
return a welfare-maximizing point $d_{\pi}\in\mathcal{O}_{q}$
ALGORITHM 2: $\epsilon$-Max Quantile Fairness

5.2 Quantile Fairness

Next, we consider an egalitarian type of fairness based on the occupancy polytope, building on very recent work by Babichenko et al. [4] in the discrete setting; surprisingly, we show that it is possible to obtain stronger guarantees in the continuous setting.

Babichenko et al. [4] focus on the fair allocation of a set of $m$ indivisible items among $n$ agents, where each item must be fully allocated to a single agent. They quantify the extent to which an allocation $A$ is fair to an agent $i$ by the fraction of allocations over which $i$ prefers $A$ (note that the number of all discrete allocations is $n^{m}$). In other words, if one samples an allocation at random, fairness is measured by the probability that $A$ is preferred to the random allocation. An allocation is $q$-quantile fair for $q\in[0,1]$ if all agents consider it to be among their top $q$-quantile of allocations. Babichenko et al. [4] aim to find a universal value of $q$ such that for any fair division instance, a $q$-quantile fair allocation exists. They make an interesting connection between $q$-quantile fair allocations and the Erdős Matching Conjecture, and show that, under the assumption that the conjecture holds, $(1/2e)$-quantile fair allocations exist for all instances (this result is for arbitrary monotone valuations; for additive valuations, they show that $\frac{0.14}{e}$-quantile fair allocations always exist without any assumptions).

We ask the same question for policy aggregation, and again, a key difference is that our domain of alternatives is continuous. The notion of $q$-quantile fairness extends well to our setting. Agents assess the fairness of a policy $\pi$ based on the fraction of the occupancy polytope (i.e., the set of all policies) to which they prefer $\pi$, or, equivalently, the probability that they prefer the chosen policy to a randomly sampled policy.

Definition 6 ($q$-quantile fairness).

For a MOMDP $M$ and $q\in[0,1]$, a policy $\pi$ is $q$-quantile fair if for every agent $i\in[n]$, $\pi$ is among $i$'s top $q$-fraction of policies, that is,

$$\mathrm{vol}\left(\mathcal{O}\cap\{\langle x,R_{i}\rangle\leqslant J_{i}(\pi)\}\right)\;\geqslant\;q\cdot\mathrm{vol}(\mathcal{O}).$$

We show that a $(1/e)$-quantile fair policy always exists and that this ratio is tight; note that this bound is twice as good as that of Babichenko et al. [4], and it is unconditional. We prove this by making a connection to an inequality due to Grünbaum [17]. The centroid of a polytope $P$ is defined as $\mathrm{centroid}(P)=\frac{\int_{x\in P}x\,dx}{\int_{x\in P}1\,dx}$.

Lemma 1 (Grünbaum's Inequality).

Let $P$ be a polytope and $w$ a direction in $\mathbb{R}^{d}$. For the halfspace $H=\{x\mid\langle w,x-\mathrm{centroid}(P)\rangle\geqslant 0\}$, it holds that

$$\frac{1}{e}\leqslant\left(\frac{d}{d+1}\right)^{d}\leqslant\frac{\mathrm{vol}(P\cap H)}{\mathrm{vol}(P)}\leqslant 1-\left(\frac{d}{d+1}\right)^{d}\leqslant 1-\frac{1}{e}.$$

Furthermore, this is tight for the $d$-dimensional simplex.

Theorem 2.

For every MOMDP $M$, there always exists a $q$-quantile fair policy with $q=(\frac{\ell-1}{\ell})^{\ell-1}$, where $\ell=|\mathcal{S}|\cdot|\mathcal{A}|$. Note that $q\geqslant\frac{1}{e}\approx 36.7\%$. Furthermore, this bound is tight: for any $\ell$, there is an instance with a single state and $\ell$ actions where no $q$-quantile fair policy exists for any $q>(\frac{\ell-1}{\ell})^{\ell-1}$.

Proof.

First, we show that the centroid of the occupancy polytope, $c=\mathrm{centroid}(\mathcal{O})$, is a $q$-quantile fair policy for the aforementioned $q$. Since $\mathcal{O}$ is a subset of the $(\ell-1)$-simplex (see Definition 2), $\mathcal{O}$ has nonzero volume in some lower-dimensional space of dimension $\ell'\leqslant|\mathcal{S}|\cdot|\mathcal{A}|-1$. By invoking Grünbaum's inequality (Lemma 1) with $w_{i}$ equal to $R_{i}$ projected onto the $\ell'$-dimensional subspace, for all agents $i\in[n]$ we have that $\mathrm{vol}(\mathcal{O}\cap H_{i})\geqslant(\frac{\ell'}{\ell'+1})^{\ell'}\cdot\mathrm{vol}(\mathcal{O})$, where $H_{i}=\{\langle x-c,w_{i}\rangle\geqslant 0\}=\{\langle x,R_{i}\rangle\geqslant J_{i}(c)\}$. Since $\ell'\leqslant\ell-1$, we have $(\frac{\ell'}{\ell'+1})^{\ell'}\geqslant(\frac{\ell-1}{\ell})^{\ell-1}$, which completes the proof.

To show tightness, take a MOMDP with a single state $\mathcal{S}=\{s\}$ (hence a constant transition function $\mathcal{P}(\cdot)=s$), $\ell$ actions $\mathcal{A}=\{a_{1},\ldots,a_{\ell}\}$, and $\ell$ agents. The reward function of agent $i$ is $R_{i}(s,a_{i})=1$ and $0$ otherwise. The state-action occupancy polytope of this MOMDP is the $(\ell-1)$-dimensional simplex $\mathcal{O}=\{\sum_{a}d_{\pi}(s,a)=1,\;d_{\pi}\in\mathbb{R}^{1\times\ell}_{\geqslant 0}\}$. Take any point $d_{\pi}\in\mathcal{O}$. There exists at least one agent $i$ with $J_{i}(d_{\pi})=d_{\pi}(s,a_{i})\leqslant\frac{1}{\ell}$. Take the halfspace $H_{i}=\{\langle x,R_{i}\rangle\geqslant\frac{1}{\ell}\}$. Observe that $H_{i}\cap\mathcal{O}$ is equivalent to a smaller $(\ell-1)$-dimensional simplex $\{\sum_{a}d_{\pi}(s,a)=1-\frac{1}{\ell}\}$, which has volume $\mathrm{vol}((1-\frac{1}{\ell})\mathcal{O})=(\frac{\ell-1}{\ell})^{\ell-1}\mathrm{vol}(\mathcal{O})$. Therefore, $\frac{\mathrm{vol}(\mathcal{O}\cap H_{i})}{\mathrm{vol}(\mathcal{O})}=(\frac{\ell-1}{\ell})^{\ell-1}$. ∎

The centroid of the occupancy polytope, as per Theorem 2, attains the worst-case quantile fairness guarantee. However, the centroid policy can be highly suboptimal, as it disregards the preferences of the agents involved; for instance, there could exist a $99\%$-quantile fair policy. To this end, we take an egalitarian approach and aim to find a $q$-quantile fair policy with the maximum $q$.

Max quantile fair algorithm. Algorithm 2 searches for the optimal value $q^{*}$ for which a $q^{*}$-quantile fair policy exists, and gets close to $q^{*}$ by binary search. To perform the search, we need a subroutine that checks, for a given value of $q$, whether a $q$-quantile fair policy exists. Suppose we have the $q$-quantile expected return $F^{-1}_{i}(q)$ for all $i$, that is, the expected return $v_{i}$ such that $F_{i}(v_{i})=q$. Then the existence of a $q$-quantile fair policy is equivalent to the feasibility of the linear program $\{x\in\mathcal{O}\mid\langle x,R_{i}\rangle\geqslant F_{i}^{-1}(q),\;i\in[n]\}$, which can be checked in polynomial time. Importantly, after finding a good approximation of $q$, there can be infinitely many policies that are $q$-quantile fair, and there are various ways to select the final policy. As mentioned earlier, a desirable efficiency property is Pareto optimality; to satisfy it, one can return the policy that maximizes the sum of agents' expected returns among those that are $q$-quantile fair. Finally, to calculate $F_{i}^{-1}(q)$, we can again use binary search to get within $\delta$ of the value $v_{i}$ for which $F_{i}(v_{i})=q$, using $O(\log(1/\delta))$ calls to $\mathrm{vol\text{-}comp}$. The discussion above is summarized below.
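The following sketch of Algorithm 2 assumes a helper `F_inv(i, q)` returning (an approximation of) $F_i^{-1}(q)$, e.g., obtained by binary search with $\mathrm{vol\text{-}comp}$ as described above, and expresses each feasibility check as a linear program over the occupancy polytope.

```python
import numpy as np
from scipy.optimize import linprog

def max_quantile_fair(rewards, A_eq, b_eq, F_inv, eps=1e-3):
    """Binary search for (approximately) the largest q such that a q-quantile
    fair policy exists (Algorithm 2).  F_inv(i, q) is an assumed helper."""
    n, dim = len(rewards), len(rewards[0])

    def solve(q, maximize_welfare=False):
        # O_q = {x in O : <x, R_i> >= F_i^{-1}(q) for all i}
        A_ub = np.array([-np.asarray(R) for R in rewards])
        b_ub = np.array([-F_inv(i, q) for i in range(n)])
        c = -sum(np.asarray(R) for R in rewards) if maximize_welfare else np.zeros(dim)
        return linprog(c=c, A_eq=A_eq, b_eq=b_eq,
                       A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if solve(mid).status == 0 else (lo, mid)

    # among (approximately) max-quantile fair policies, return a
    # welfare-maximizing one, which is also Pareto optimal
    return lo, solve(lo, maximize_welfare=True).x
```

By Theorem 2, the feasibility check succeeds for any $q\leqslant(\frac{\ell-1}{\ell})^{\ell-1}$, so the returned quantile is at least $1/e-\epsilon$ up to the accuracy of `F_inv`.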

Proposition 3.

Assuming an exact oracle for $F^{-1}_{i}$, Algorithm 2 finds a $q$-quantile fair policy with $q$ within $\epsilon$ of the optimal value in polynomial time, using $O(\log(1/\epsilon))$ calls to the oracle per agent. A $\delta$-approximation of $F^{-1}_{i}(q)$ can be computed using $O(\log(1/\delta))$ calls to $\mathrm{vol\text{-}comp}$.

6 Policy Aggregation with Voting Rules

In this section, we adapt existing voting rules from the discrete setting to policy aggregation and discuss their computational complexity.

Plurality. The plurality winner is the policy that achieves the maximum number of plurality votes or "approvals," where agent $i$ approves a policy $\pi$ if it achieves their maximum expected return, $J_{i}(\pi)=\max_{\pi'}J_{i}(\pi')=1$. Hence the plurality winner is a policy in $\arg\max_{\pi}\sum_{i\in[n]}\mathbb{I}[J_{i}(\pi)=1]$. This formulation does not require the volumetric interpretation. However, in contrast to the discrete setting, where one can easily count the approvals for all candidates, we show that solving this problem in the context of policy aggregation is not only $\mathbf{NP}$-hard, but hard to approximate up to a factor of $1/n^{1/2-\epsilon}$. We establish the hardness of approximation by a reduction from the maximum independent set problem [20]; we defer the proof to Appendix A.

Theorem 4.

For any fixed $\epsilon\in(0,1)$, there is no polynomial-time $\frac{1}{n^{1/2-\epsilon}}$-approximation algorithm for the maximum plurality score unless $\mathbf{P}=\mathbf{NP}$.

Nevertheless, we can compute plurality in practice, as we discuss below.

$\alpha$-approval. We extend the $k$-approval rule using the volumetric interpretation of the occupancy polytope, similarly to the $q$-quantile fairness definition. For some $\alpha\in[0,1]$, agent $i$ approves a policy $\pi$ if its return is among their top $\alpha$ fraction of $\mathcal{O}$, i.e., $F_{i}(J_{i}(\pi))\geqslant\alpha$. The $\alpha$-approval winner is a policy with the highest number of $\alpha$-approvals, i.e., a policy in $\arg\max_{\pi}\sum_{i\in[n]}\mathbb{I}[F_{i}(J_{i}(\pi))\geqslant\alpha]$. Note that plurality is equivalent to $1$-approval. It is worth mentioning that there can be infinitely many policies with the maximum approval score; to avoid a suboptimal decision, one can return a Pareto optimal policy among the set of $\alpha$-approval winners.

Theorem 2 shows that for $\alpha\leqslant 1/e$, there always exists a policy that all agents approve, and by Proposition 3 such policies can be found in polynomial time, assuming access to an oracle for volumetric computations. Therefore, the problem of finding an $\alpha$-approval winner is "easy" for $\alpha\in(0,1/e)$. In sharp contrast, for $\alpha=1$, namely plurality, Theorem 4 gives a hardness of approximation. The next theorem shows the hardness of computing $\alpha$-approval for $\alpha\in(7/8,1]$ via a reduction from the MAX-2SAT problem. We defer the proof to Appendix A.

Theorem 5.

For $\alpha\in(7/8,1]$, computing a policy with the highest $\alpha$-approval score is $\mathbf{NP}$-hard. This holds even for binary reward vectors and when every $F_{i}$ has a closed form.

Given the above hardness result, to compute the $\alpha$-approval rule we turn to mixed-integer linear programming (MILP). Algorithm 3 simply creates $n$ binary variables, one for each agent $i$, indicating whether $i$ $\alpha$-approves the policy, i.e., $F_{i}(J_{i}(\pi))\geqslant\alpha$, which is equivalent to $J_{i}(\pi)\geqslant F^{-1}_{i}(\alpha)$. To encode the expected return requirement for agent $i$ to approve a policy as a linear constraint, we precompute $F^{-1}_{i}(\alpha)$; this can be done by a binary search similar to the one in Algorithm 2. Importantly, Algorithm 3 has one binary variable per agent and only $n$ additional constraints, which is key to its practicability.

compute $F^{-1}_{i}(\alpha)$ for all $i\in[n]$
solve the mixed-integer linear program
$$\max\;\sum_{i\in[n]}a_{i}\quad\text{s.t.}\quad a_{i}\cdot F_{i}^{-1}(\alpha)\leqslant\langle d_{\pi},R_{i}\rangle\;\;\forall i\in[n],\qquad d_{\pi}\in\mathcal{O},\;\;a_{i}\in\{0,1\}\;\;\forall i\in[n]$$
return a Pareto optimal policy subject to the maximum $\alpha$-approval set $\{i\mid a_{i}=1\}$
ALGORITHM 3: $\alpha$-Approvals MILP
compute $F_{i}(k\epsilon)$ for all $i\in[n]$ and $k\in[1/\epsilon]$
solve the mixed-integer linear program
$$\max\;\sum_{i\in[n]}\sum_{k\in[1/\epsilon]}a_{i,k}\cdot\big(F_{i}(k\epsilon)-F_{i}((k-1)\epsilon)\big)\quad\text{s.t.}\quad a_{i,k}\cdot k\epsilon\leqslant\langle d_{\pi},R_{i}\rangle\;\;\forall i\in[n],\,k\in[1/\epsilon],\qquad d_{\pi}\in\mathcal{O},\;\;a_{i,k}\in\{0,1\}\;\;\forall i\in[n],\,k\in[1/\epsilon]$$
return a Pareto optimal policy subject to the maximum Borda score
ALGORITHM 4: $\epsilon$-Borda Count MILP
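Below is a sketch of the $\alpha$-approval MILP (Algorithm 3) using the gurobipy matrix interface [18]; the precomputed thresholds `tau[i]` stand for $F_i^{-1}(\alpha)$, and the occupancy polytope is passed as equality constraints `A_eq`, `b_eq`. This is an illustration of the encoding rather than the exact code used in our experiments, and Algorithm 4 can be encoded analogously with one binary variable per agent and return level.

```python
import gurobipy as gp
from gurobipy import GRB

def alpha_approval_milp(rewards, tau, A_eq, b_eq):
    """Maximize the number of agents i with J_i(pi) = <d_pi, R_i> >= tau[i].

    rewards: list of flattened reward vectors; tau[i] = F_i^{-1}(alpha);
    A_eq, b_eq: occupancy-polytope constraints (Definition 2)."""
    n, dim = len(rewards), len(rewards[0])
    model = gp.Model("alpha_approval")
    d = model.addMVar(dim, lb=0.0)                  # occupancy measure d_pi >= 0
    a = model.addMVar(n, vtype=GRB.BINARY)          # a_i = 1 iff agent i approves
    model.addConstr(A_eq @ d == b_eq)               # d_pi in O
    for i, R in enumerate(rewards):
        model.addConstr(tau[i] * a[i] <= R @ d)     # approval threshold for agent i
    model.setObjective(a.sum(), GRB.MAXIMIZE)
    model.optimize()
    return d.X, a.X
```

As in the pseudocode, a second solve that maximizes $\sum_i\langle d_\pi,R_i\rangle$ while fixing the binary variables at their optimum yields a Pareto optimal policy, by the argument in Section 5.1.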

Borda count. The Borda count rule also has a natural definition in the continuous setting. In the discrete setting, the Borda score of agent $i$ for alternative $c$ is the number of alternatives $c'$ such that $c\succ_{i}c'$. In the continuous setting, $F_{i}(J_{i}(\pi))$ is the fraction of the occupancy polytope to which agent $i$ prefers $\pi$. The Borda count rule then selects a policy in $\arg\max_{\pi}\sum_{i\in[n]}F_{i}(J_{i}(\pi))$.

The computational complexity of the Borda count rule remains an interesting open question (we suspect the problem is $\mathbf{NP}$-hard, since the objective resembles a sum of "sigmoidal" functions over a convex domain, which is known to be $\mathbf{NP}$-hard to maximize [37]), though we make progress on two fronts. First, we identify a sufficient condition under which we can find an approximate max Borda count policy using convex optimization in polynomial time. Second, similarly to Algorithm 3, we present a MILP that approximates the Borda count rule in Algorithm 4.

The first is based on the observation in Section 4 that $F_{i}$ is concave on $[\mathrm{mode}(f_{i}),\infty)$. We assume that the max Borda count policy $\pi$ lies in the concave region for all agents, i.e., $J_{i}(\pi)\geqslant\mathrm{mode}(f_{i})$ for all $i\in[n]$. Then the problem becomes the maximization of the concave objective $\max\sum_{i}F_{i}(\langle d_{\pi},R_{i}\rangle)$ over the convex domain $\{d_{\pi}\in\mathcal{O}\mid\langle d_{\pi},R_{i}\rangle\geqslant\mathrm{mode}(f_{i}),\;\forall i\in[n]\}$.

Second, Algorithm 4 is a MILP that finds an approximate max Borda count policy. As a pre-processing step, we estimate $F_{i}$ for each agent $i$ separately: we measure $F_{i}$ at the fixed expected return values $\{\epsilon,2\epsilon,\ldots,1-\epsilon,1\}$, which accounts for $1/\epsilon$ calls to $\mathrm{vol\text{-}comp}$ per agent. Then, for the MILP, we introduce $1/\epsilon$ binary variables for each agent indicating their $\epsilon$-rounded return levels, i.e., $a_{i,k}=1$ iff $\langle d_{\pi},R_{i}\rangle\geqslant k\epsilon$ for $k\in[1/\epsilon]$. The MILP then searches for an occupancy measure $d_{\pi}\in\mathcal{O}$ with the maximum total Borda score among the $\epsilon$-rounded expected return vectors, where the $\epsilon$-rounded value of $x\in\mathbb{R}_{\geqslant 0}$ is defined as $\lfloor x/\epsilon\rfloor\cdot\epsilon$. Therefore, the MILP finds a policy whose Borda score is at least as high as the maximum Borda score among $\epsilon$-rounded return vectors; this is not necessarily equivalent to a $(1-\epsilon)$-approximation of the max Borda score.

Finally, we make a novel connection between $q$-quantile fairness and Borda count in Theorem 6.

Theorem 6.

A $q$-quantile fair policy is a $q$-approximation of the maximum Borda score.

Proof.

Recall that the Borda score of a policy $\pi$ is defined as $\operatorname{Borda}(\pi)=\sum_{i\in[n]}F_{i}(J_{i}(\pi))$. Since $F_{i}(v)\in[0,1]$ for all $v\in\mathbb{R}$, we have $\max_{\pi}\sum_{i\in[n]}F_{i}(J_{i}(\pi))\leqslant n$.

By the definition of a $q$-quantile fair policy $\pi_{q}$, we have $F_{i}(J_{i}(\pi_{q}))\geqslant q$ for all $i\in[n]$. Therefore,

$$\operatorname{Borda}(\pi_{q})=\sum_{i\in[n]}F_{i}(J_{i}(\pi_{q}))\geqslant qn,$$

and thus we obtain

$$\frac{\operatorname{Borda}(\pi_{q})}{\max_{\pi}\operatorname{Borda}(\pi)}\geqslant\frac{qn}{n}=q.\qquad\qed$$

A corollary of Theorems 6 and 2 is that the policy returned by the $\epsilon$-max quantile fair algorithm (Algorithm 2) achieves a $(1/e-\epsilon)$ multiplicative approximation of the maximum Borda score.

7 Experiments

Environment. We adapt the dynamic attention allocation environment introduced by D'Amour et al. [11]. We aim to monitor several sites and prevent potential incidents, but limited resources prevent us from monitoring all sites at all times; this is inspired by applications such as food inspection and pest control. (The code for the experiments is available at https://github.com/praal/policy-aggregation.) There are $m=5$ warehouses, and each can be in 3 different stages: normal ($\mathrm{norm}$), risky ($\mathrm{risk}$), and incident ($\mathrm{inc}$). There are $|\mathcal{S}|=3^{m}$ states, containing all possible stages of all warehouses. In each step, we can monitor at most one site, so there are $m+1$ actions, where action $m+1$ is no operation and action $i\leqslant m$ is monitoring warehouse $i$. There are $n$ agents; each agent $i$ has a list $\ell_{i}$ of warehouses that they consider valuable and a reward function $R_{i}$. In each step $t$, $R_{i}(s_{t},a_{t})=-\mathbb{I}[a_{t}\leqslant m]-\sum_{j\in\ell_{i}}\rho_{i}w_{j}\cdot\mathbb{I}[s_{t,j}=\mathrm{inc}\wedge a_{t}\neq j]$, where $w_{j}\in\{100,150,\ldots,250\}$ denotes the penalty of an incident occurring in warehouse $j$, $\rho_{i}$ is the scale of penalties for agent $i$, sampled from $\{0.25,0.5,\ldots,n\}$, and $-1$ is the cost of monitoring. In each step, if we monitor warehouse $j$, its stage becomes normal; if not, it changes from $\mathrm{norm}$ to $\mathrm{risk}$ and from $\mathrm{risk}$ to $\mathrm{inc}$ with probabilities $p_{j,\mathrm{risk}}$ and $p_{j,\mathrm{inc}}$, respectively, and stays the same otherwise. These probabilities are sampled i.i.d. uniformly from $[0.5,0.8]$. The state transition function $\mathcal{P}$ is the product of the warehouses' stage transitions.
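For concreteness, the per-step reward of agent $i$ in this environment can be sketched as follows; the encoding of states as tuples of warehouse stages and the variable names are our own illustrative choices.

```python
NORM, RISK, INC = 0, 1, 2     # warehouse stages

def reward_i(state, action, valuable, rho_i, w, m=5):
    """R_i(s, a) = -1[a <= m] - sum_{j in l_i} rho_i * w_j * 1[s_j = inc and a != j].

    state: tuple of length m with the stage of each warehouse;
    action: 1..m monitors that warehouse, m + 1 is no operation;
    valuable: the list l_i of warehouses agent i considers valuable (1-indexed);
    rho_i: agent i's penalty scale; w: dict mapping warehouse j to penalty w_j."""
    r = -1.0 if action <= m else 0.0                 # monitoring cost
    for j in valuable:
        if state[j - 1] == INC and action != j:      # unattended incident
            r -= rho_i * w[j]
    return r
```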

Rules. We compare the outcomes of policy aggregation with different rules: max-quantile, Borda, $\alpha$-approval ($\alpha=0.9,0.8$), egalitarian (maximize the minimum return), and utilitarian (maximize the sum of returns). We sample $5\cdot 10^{5}$ random policies, based on which we fit a generalized logistic function to estimate the cdf of the expected return distribution $F_{i}$ (Definition 4) for every agent. The policies for the $\alpha$-approval voting rules are optimized with respect to maximum utilitarian welfare. The egalitarian rule finds a policy that maximizes the expected return of the worst-off agent, then optimizes for the second worst-off agent, and so on. The implementation details of Borda count are in Appendix B.
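The cdf-estimation step can be sketched as follows: fit a generalized logistic curve to the empirical cdf of the sampled expected returns using scipy. The particular three-parameter form below is an illustrative choice of parametrization, not the one fixed by our implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def generalized_logistic(v, k, v0, nu):
    """A common generalized-logistic cdf shape (illustrative parametrization)."""
    return (1.0 + np.exp(-k * (v - v0))) ** (-nu)

def fit_cdf(return_samples):
    """Fit an estimate of F_i to the empirical cdf of sampled expected returns."""
    xs = np.sort(np.asarray(return_samples))
    ys = np.arange(1, len(xs) + 1) / len(xs)         # empirical cdf
    params, _ = curve_fit(generalized_logistic, xs, ys,
                          p0=[10.0, float(np.median(xs)), 1.0], maxfev=10000)
    return lambda v: generalized_logistic(v, *params)
```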

Results. In Figure 1, we report the normalized expected returns of the agents, $\frac{J_{i}(\pi)-\min_{\pi'}J_{i}(\pi')}{\max_{\pi'}J_{i}(\pi')-\min_{\pi''}J_{i}(\pi'')}$ (sorted from lowest to highest), averaged over $10$ different instances of the environment and agents. We observe that the utilitarian and egalitarian rules are sensitive to the agents' different reward scales and tend to produce unfair outcomes. The utilitarian rule achieves the highest utilitarian welfare by almost ignoring one agent. The egalitarian rule achieves a higher return for the worst-off agents compared to the utilitarian rule, but still yields an inequitable outcome. The max-quantile rule tends to return the fairest outcomes, with similar normalized returns for the agents. The Borda rule, while not designed as a fair rule, tends to find fair outcomes that are slightly worse than those of the max-quantile rule. The $\alpha$-approval rule with max utilitarian completion tends to the utilitarian rule as $\alpha\to 0$ and to plurality as $\alpha\to 1$. Importantly, although not shown in the plots, the plurality rule ignores almost all agents and performs optimally for a randomly selected agent.

In addition to the fine-grained utility distributions, in Table 1 we report two aggregate measures based on agents' utilities: (i) the Gini index, a statistical measure of dispersion defined as $\frac{\sum_{i\in N}\sum_{j\in N}|J_{i}(\pi)-J_{j}(\pi)|}{2n\sum_{i\in N}J_{i}(\pi)}$, where a lower Gini index indicates a more equitable distribution, and (ii) the Nash welfare, defined as the geometric mean of agents' utilities, $\left(\prod_{i\in N}J_{i}(\pi)\right)^{1/n}$, where a higher Nash welfare is preferable. We observe a similar trend as above, where the utilitarian and egalitarian rules perform worse across both metrics. For the other four rules, the Nash welfare scores are comparable in both scenarios, with Borda showing slightly better performance. The Gini index, however, highlights a clearer distinction among the rules, with max-quantile performing better.

Figure 1: Comparison of policies optimized by different rules in two scenarios, based on the normalized expected return for agents: (a) 10 agents that consider a random subset of warehouses valuable; (b) 5 symmetric agents, one per warehouse. The bars, grouped by rule, correspond to agents sorted by their normalized expected return. The error bars show the standard error of the mean.
               | 5 symmetric agents, one per warehouse | 10 agents, random subsets of warehouses
Rules          | Gini index      | Nash welfare    | Gini index      | Nash welfare
egalitarian    | 0.2864 ± 0.0295 | 0.2208 ± 0.0717 | 0.2126 ± 0.0209 | 0.4655 ± 0.0581
utilitarian    | 0.4392 ± 0.0094 | 0.0502 ± 0.0174 | 0.2020 ± 0.0182 | 0.5736 ± 0.0471
80%-approvals  | 0.1233 ± 0.0047 | 0.5186 ± 0.0051 | 0.1352 ± 0.0037 | 0.6741 ± 0.0200
90%-approvals  | 0.0793 ± 0.0056 | 0.5286 ± 0.0053 | 0.1257 ± 0.0034 | 0.6746 ± 0.0211
Borda          | 0.0225 ± 0.0024 | 0.5356 ± 0.0062 | 0.1029 ± 0.0083 | 0.6801 ± 0.0261
max-quantile   | 0.0188 ± 0.0022 | 0.5355 ± 0.0062 | 0.0625 ± 0.0067 | 0.6474 ± 0.0232
Table 1: Comparison of policies optimized by different rules in two scenarios, based on the Gini index and the Nash welfare of the agents' normalized expected returns. We report the mean and the standard error.

8 Discussion

We conclude by discussing some of the limitations of our approach. A first potential limitation is computation. When we started our investigation of the policy aggregation problem, we were skeptical that ordinal solutions from social choice could be practically applied. We believe that our results successfully lay this initial concern to rest. However, additional algorithmic advances are needed to scale our approach beyond thousands of agents, states, and actions. Additionally, an interesting future direction is to apply these rules to continuous state or action spaces, as well as to the online reinforcement learning setting where the environment remains unknown.

A second limitation is the possibility of strategic behavior. The Gibbard-Satterthwaite Theorem [16, 33] precludes the existence of “reasonable” voting rules that are strategyproof, in the sense that agents cannot gain from misreporting their ordinal preferences; we conjecture that a similar result holds for policy aggregation in our framework. However, if reward functions are obtained through inverse reinforcement learning, successful manipulation would be difficult: an agent would have to act in a way that the learned reward function induces ordinal (volumetric) preferences leading to a higher-return aggregate stochastic policy. This separation between the actions taken by an agent and the preferences they induce would likely alleviate the theoretical susceptibility of our methods to strategic behavior.

Acknowledgments

We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chairs Program (Vector Institute). This work was also partially supported by the Cooperative AI Foundation; by the National Science Foundation under grants IIS-2147187, IIS-2229881 and CCF-2007080; and by the Office of Naval Research under grants N00014-20-1-2488 and N00014-24-1-2704.

References

  • Abbeel and Ng [2004] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), pages 1–8, 2004.
  • Alamdari et al. [2022] P. A. Alamdari, T. Q. Klassen, R. Toro Icarte, and S. A. McIlraith. Be considerate: Avoiding negative side effects in reinforcement learning. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 18–26, 2022.
  • Alamdari et al. [2024] P. A. Alamdari, T. Q. Klassen, E. Creager, and S. A. Mcilraith. Remembering to be fair: Non-Markovian fairness in sequential decision making. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 906–920, 2024.
  • Babichenko et al. [2024] Y. Babichenko, M. Feldman, R. Holzman, and V. V. Narayan. Fair division via quantile shares. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (STOC), pages 1235–1246, 2024.
  • Bradley and Terry [1952] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Caragiannis et al. [2016] I. Caragiannis, A. D. Procaccia, and N. Shah. When do noisy votes reveal the truth? ACM Transactions on Economics and Computation, 4(3): article 15, 2016.
  • Chaudhury et al. [2024] B. R. Chaudhury, A. Murhekar, Z. Yuan, B. Li, R. Mehta, and A. D. Procaccia. Fair federated learning via the proportional veto core. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024.
  • Clegg [2023] N. Clegg. Bringing people together to inform decision-making on generative AI. Blog post, 2023. URL https://about.fb.com/news/2023/06/generative-ai-community-forum.
  • Conitzer and Sandholm [2005] V. Conitzer and T. Sandholm. Common voting rules as maximum likelihood estimators. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pages 145–152, 2005.
  • Conitzer et al. [2024] V. Conitzer, R. Freedman, J. Heitzig, W. H. Holliday, B. M. Jacobs, N. Lambert, M. Mosse, E. Pacuit, S. Russell, H. Schoelkopf, E. Tewolde, and W. S. Zwicker. Position: Social choice should guide AI alignment in dealing with diverse human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 9346–9360, 2024.
  • D’Amour et al. [2020] A. D’Amour, H. Srinivasan, J. Atwood, P. Baljekar, D. Sculley, and Y. Halpern. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 3rd Conference on Fairness, Accountability, and Transparency (FAccT), pages 525–534, 2020.
  • Dyer et al. [1991] M. Dyer, A. Frieze, and R. Kannan. A random polynomial-time algorithm for approximating the volume of convex bodies. Journal of the ACM, 38(1):1–17, 1991.
  • Fan et al. [2023] Z. Fan, N. Peng, M. Tian, and B. Fain. Welfare and fairness in multi-objective reinforcement learning. In Proceedings of the 22nd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1991–1999, 2023.
  • Fishburn [1984] P. C. Fishburn. Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692, 1984.
  • Garey et al. [1976] M. R. Garey, D. S. Johnson, and L. Stockmeyer. Some simplified NP-complete graph problems. Theoretical Computer Science, 1(3):237–267, 1976.
  • Gibbard [1973] A. Gibbard. Manipulation of voting schemes. Econometrica, 41:587–602, 1973.
  • Grünbaum [1960] B. Grünbaum. Partitions of mass-distributions and of convex bodies by hyperplanes. Pacific Journal of Mathematics, 10(4):1257–1261, 1960.
  • Gurobi Optimization, LLC [2023] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
  • Hammond [1991] P. J. Hammond. Interpersonal comparisons of utility: Why and how they are and should be made. In J. Elster and J. E. Roemer, editors, Interpersonal Comparisons of Well-Being, chapter 7, pages 200–254. Cambridge University Press, 1991.
  • Håstad [1999] J. Håstad. Clique is hard to approximate within $n^{1-\varepsilon}$. Acta Mathematica, 182:105–142, 1999.
  • Hayes et al. [2022] C. F. Hayes, R. Rădulescu, E. Bargiacchi, J. Källström, M. Macfarlane, M. Reymond, T. Verstraeten, L. M. Zintgraf, R. Dazeley, F. Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022.
  • Ianovski and Kondratev [2021] E. Ianovski and A. Y. Kondratev. Computing the proportional veto core. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.
  • Ju et al. [2023] P. Ju, A. Ghosh, and N. Shroff. Achieving fairness in multi-agent MDP using reinforcement learning. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 2023.
  • Kondratev and Ianovski [2024] A. Y. Kondratev and E. Ianovski. Veto core consistent preference aggregation. In Proceedings of the 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), pages 1020–1028, 2024.
  • Mandal and Gan [2022] D. Mandal and J. Gan. Socially fair reinforcement learning. CoRR, abs/2208.12584, 2022.
  • Moulin [1981] H. Moulin. The proportional veto principle. The Review of Economic Studies, 48(3):407–416, 1981.
  • Ng and Russell [2000] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), pages 663–670, 2000.
  • Noothigattu et al. [2021] R. Noothigattu, T. Yan, and A. D. Procaccia. Inverse reinforcement learning from like-minded teachers. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), pages 9197–9204, 2021.
  • Ogryczak et al. [2013] W. Ogryczak, P. Perny, and P. Weng. A compromise programming approach to multiobjective Markov decision processes. International Journal of Information Technology & Decision Making, 12(05):1021–1053, 2013.
  • Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • Puterman [2014] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • Roijers et al. [2013] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
  • Satterthwaite [1975] M. Satterthwaite. Strategy-proofness and Arrow’s conditions: Existence and correspondence theorems for voting procedures and social welfare functions. Journal of Economic Theory, 10:187–217, 1975.
  • Siddique et al. [2020] U. Siddique, P. Weng, and M. Zimmer. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In Proceedings of the 37th International Conference on Machine Learning (ICML), pages 8905–8915, 2020.
  • Sorensen et al. [2024] T. Sorensen, J. Moore, J. Fisher, M. L. Gordon, N. Mireshghallah, C. Michael Rytting, A. Ye, L. Jiang, X. Lu, N. Dziri, T. Althoff, and Y. Choi. Position: A roadmap to pluralistic alignment. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 46280–46302, 2024.
  • Swamy et al. [2024] G. Swamy, C. Dann, R. Kidambi, S. Wu, and A. Agarwal. A minimaximalist approach to reinforcement learning from human feedback. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 47345–47377, 2024.
  • Udell and Boyd [2013] M. Udell and S. Boyd. Maximizing a sum of sigmoids. Optimization and Engineering, pages 1–25, 2013.
  • Yang et al. [2019] R. Yang, X. Sun, and K. Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS), 2019.
  • Young [1988] H. P. Young. Condorcet’s theory of voting. The American Political Science Review, 82(4):1231–1244, 1988.
  • Zahavy et al. [2021] T. Zahavy, B. O’Donoghue, G. Desjardins, and S. Singh. Reward is enough for convex MDPs. In Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS), pages 25746–25759, 2021.
  • Zaremba et al. [2023] W. Zaremba, A. Dhar, L. Ahmad, T. Eloundou, S. Santurkar, S. Agarwal, and J. Leung. Democratic inputs to AI. Blog post, 2023. URL https://openai.com/blog/democratic-inputs-to-ai.
  • Zhong et al. [2024] H. Zhong, Z. Deng, W. J. Su, Z. S. Wu, and L. Zhang. Provable multi-party reinforcement learning with diverse human feedback. arXiv preprint arXiv:2403.05006, 2024.

Appendix A Proofs of Section 6

We define a fully connected MOMDP as a MOMDP $M$ where, for every pair of states $s,s^{\prime}\in\mathcal{S}$ and every action $a\in\mathcal{A}$, $\mathcal{P}(s,a,s^{\prime})=\frac{1}{|\mathcal{S}|}$. Throughout the hardness proofs, we construct fully connected MOMDPs. For such MOMDPs, it is not difficult to observe that the state-action occupancy polytope, in both the discounted and the average reward case, is equivalent to $\{d_{\pi}\mid\sum_{a}d_{\pi}(s,a)=\frac{1}{|\mathcal{S}|}\ \text{for all}\ s\in\mathcal{S}\}$. Furthermore,

\pi(a|s)=\frac{d_{\pi}(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}d_{\pi}(s,a^{\prime})}=|\mathcal{S}|\cdot d_{\pi}(s,a). \qquad (1)
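For intuition, Equation (1) can be checked numerically. The following minimal Python sketch (sizes and names are ours, chosen only for illustration) verifies that recovering the policy by normalizing $d_{\pi}$ per state coincides with scaling by $|\mathcal{S}|$ on the fully connected occupancy polytope.

import numpy as np

# Minimal numeric check of Equation (1) for a fully connected MOMDP.
# Any d_pi with sum_a d_pi(s, a) = 1/|S| for every state s works.
S, A = 4, 3
rng = np.random.default_rng(0)
d_pi = rng.dirichlet(np.ones(A), size=S) / S          # each row sums to 1/|S|
policy_by_normalizing = d_pi / d_pi.sum(axis=1, keepdims=True)
policy_by_scaling = S * d_pi
assert np.allclose(policy_by_normalizing, policy_by_scaling)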

A.1 Proof of Theorem 4

Proof of Theorem 4.

We show hardness of approximation by an approximation-preserving reduction from the maximum independent set (MIS) problem.

Definition 7 (maximum independent set (MIS)).

For a graph $G=(V,E)$ with vertex set $V$ and edge set $E\subseteq\binom{V}{2}$, a maximum independent set (MIS) of $G$ is a maximum-cardinality subset of vertices $V^{\prime}\subseteq V$ such that there is no edge between any pair of vertices $v_1,v_2\in V^{\prime}$.

Theorem 7 (Håstad [20]).

For any fixed $\epsilon\in(0,1)$, there is no polynomial-time $\frac{1}{n^{1/2-\epsilon}}$-approximation algorithm for the MIS problem unless $\mathbf{P}=\mathbf{NP}$, and no polynomial-time $\frac{1}{n^{1-\epsilon}}$-approximation algorithm unless $\mathbf{ZPP}=\mathbf{NP}$.

Construction of MOMDP. Let $G=([n],E)$ be a graph for which we want to find a maximum independent set. We create a fully connected MOMDP $M$ with $|E|$ states $\{s_e\}$, one for each edge $e\in E$. There are only two actions, $\mathcal{A}=\{a_1,a_2\}$. In state $s_e$ of edge $e=(e_1,e_2)$, actions $a_1$ and $a_2$ correspond to the endpoints $e_1$ and $e_2$, respectively. We create $n$ agents, where agent $i$ corresponds to vertex $i$. The reward function of agent $i$ for state $s_e$ and action $a_k$, $k\in\{1,2\}$, is defined as

R_{i}(s_{e},a_{k})=\begin{cases}1,&\text{if }e_{k}=i,\\ 0,&\text{otherwise.}\end{cases}

In other words, the reward function of agent $i$ encodes the set of edges incident to vertex $i$.
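To make the construction concrete, here is a small Python sketch of the reward functions it produces (the function name and array layout are ours, not from the paper).

import numpy as np

def mis_reduction_rewards(n_vertices, edges):
    """Reward tensor R[i, e, k]: agent i gets reward 1 at state s_e under
    action a_{k+1} exactly when the k-th endpoint of edge e is vertex i."""
    R = np.zeros((n_vertices, len(edges), 2))
    for e_idx, (e1, e2) in enumerate(edges):
        R[e1, e_idx, 0] = 1.0   # action a_1 selects endpoint e1
        R[e2, e_idx, 1] = 1.0   # action a_2 selects endpoint e2
    return R

# e.g., a triangle: no policy can be optimal for more than one of its agents
R = mis_reduction_rewards(3, [(0, 1), (1, 2), (0, 2)])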

Correctness of reduction. A policy $\pi$ is optimal for agent $i$ if and only if, for every edge incident to $i$, the action corresponding to $i$ is selected with probability $1$. If a policy $\pi$ is optimal for two agents $i$ and $i^{\prime}$, then $e=(i,i^{\prime})\notin E$, since at state $s_e$ either $\pi(a_1|s_e)=1$ or $\pi(a_2|s_e)=1$, but not both. Therefore, the set of agents that consider a policy optimal corresponds to an independent set in $G$. Conversely, take any independent set $V^{\prime}\subseteq[n]$. Let $\pi$ be the policy that, for each edge $e$, selects the action corresponding to the endpoint in $V^{\prime}$ and, if neither endpoint is in $V^{\prime}$, selects one arbitrarily. This policy is well defined, as $V^{\prime}$ is an independent set with no edges between its vertices. Policy $\pi$ is optimal for the agents of $V^{\prime}$, since at every state of an edge incident to them their favourite action is selected. Thus, we have an equivalence between the maximum independent set of $G$ and the plurality winner policy of $M$. Therefore, the hardness of approximation follows from Theorem 7. ∎

A.2 Proof of Theorem 5

Proof of Theorem 5.

We reduce the MAX-2SAT problem to the problem of finding an $\alpha$-approval policy for $\alpha\in(7/8,1]$.

For two Boolean variables $x_1$ and $x_2$, we denote their disjunction (i.e., logical or) by $x_1\lor x_2$ and the negation of $x_1$ by $\neg x_1$.

Definition 8 (maximum 2-satisfiability (MAX-2SAT)).

Given a set of $m$ Boolean variables $\{x_1,\ldots,x_m\}$ and a set of $n$ $2$-literal disjunction clauses $\{C_1,\ldots,C_n\}$, the goal of the maximum 2-satisfiability problem (MAX-2SAT) is to find an assignment $\phi:\{x_j\}_{j\in[m]}\to\{\mathrm{True},\mathrm{False}\}$ that maximizes the number of satisfied clauses.

Garey et al. [15] showed that the MAX-2SAT problem is NP-hard.

Construction of MOMDP. For an instance of the MAX-2SAT problem, let $\{C_1,\ldots,C_n\}$ be a set of $n$ $2$-literal disjunction clauses over $m$ variables $\{x_1,\ldots,x_m\}$. We create a fully connected MOMDP $M$ with $m$ states, where state $s_j$ represents variable $x_j$. There are only two actions, $a_{\mathrm{True}}$ and $a_{\mathrm{False}}$, which at state $s_j$ correspond to setting variable $x_j$ to $\mathrm{True}$ and $\mathrm{False}$, respectively. This way, $\pi(a_{\mathrm{True}}|s_j)$ can be interpreted as the probability of setting variable $x_j$ to $\mathrm{True}$, with $\pi(a_{\mathrm{True}}|s_j)=1-\pi(a_{\mathrm{False}}|s_j)$.

Agents and reward functions. We introduce some notation before defining the agents and their reward functions. Take a clause $C_i=c_{i,1}\lor c_{i,2}$, where $c_{i,k}\in\{x_j,\neg x_j\}_{j\in[m]}$ for $k\in[2]$. Mapping $x_j$ and $\neg x_j$ to the state-action pairs $(s_j,a_{\mathrm{True}})$ and $(s_j,a_{\mathrm{False}})$, respectively, we define $\lambda:\{c_{i,1},c_{i,2}\}\to\mathcal{S}\times\mathcal{A}$: for every literal $c_{i,k}$, $\lambda(c_{i,k})$ is the state-action pair that evaluates $c_{i,k}$ to $\mathrm{True}$ when selected with probability one. Now we are ready to introduce the agents. For each clause $C_i$, we create three agents, defined as follows (a code sketch of this construction appears after the list):

  • Agent $\mathrm{ag}_{i,1}$ has a reward function that is $1$ for $\lambda(c_{i,1})$ and $\lambda(c_{i,2})$ and zero otherwise. An optimal policy of this agent chooses the two corresponding state-action pairs exclusively, which can be interpreted as $c_{i,1}\leftarrow\mathrm{True}$ and $c_{i,2}\leftarrow\mathrm{True}$, implying $C_i=\mathrm{True}$.

  • Agent $\mathrm{ag}_{i,2}$ has a reward function that is $1$ for $\lambda(c_{i,1})$ and $\lambda(\neg c_{i,2})$ (note the negation) and zero otherwise. This corresponds to another assignment of the variables, $c_{i,1}\leftarrow\mathrm{True}$ and $c_{i,2}\leftarrow\mathrm{False}$, that also implies $C_i=\mathrm{True}$.

  • Similarly, the reward function of $\mathrm{ag}_{i,3}$ is $1$ for $\lambda(\neg c_{i,1})$ and $\lambda(c_{i,2})$ and zero otherwise (which again implies $C_i=\mathrm{True}$).
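The following Python sketch spells out this construction (the literal encoding and names are ours): each clause contributes three agents whose rewards are supported on exactly two state-action pairs.

import numpy as np

def clause_agents(clause, n_vars):
    """clause = ((j1, sign1), (j2, sign2)), where sign True stands for x_j and
    False for ~x_j.  Column 0 is action a_True, column 1 is action a_False."""
    (j1, s1), (j2, s2) = clause

    def reward(sign1, sign2):
        R = np.zeros((n_vars, 2))
        R[j1, 0 if sign1 else 1] = 1.0   # lambda of the first literal
        R[j2, 0 if sign2 else 1] = 1.0   # lambda of the second literal
        return R

    # ag_{i,1}: (c1, c2),  ag_{i,2}: (c1, ~c2),  ag_{i,3}: (~c1, c2)
    return [reward(s1, s2), reward(s1, not s2), reward(not s1, s2)]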

For ease of exposition, and with a slight abuse of notation, in the rest of the proof we use $\pi$ to refer to the occupancy measure. By Equation (1) we have $|\mathcal{S}|\cdot d_{\pi}(s,a)=\pi(a|s)$, and the $\alpha$-approval rule is invariant to positive affine transformations of the rewards. Therefore, we let $J_i(\pi)=\langle\pi,R_i\rangle$.

Expected return distribution. The expected return of agent $\mathrm{ag}_{i,1}$ for a policy $\pi$ is $\pi(\lambda(c_{i,1}))+\pi(\lambda(c_{i,2}))$. For conciseness, let $v_1=\pi(\lambda(c_{i,1}))$ and $v_2=\pi(\lambda(c_{i,2}))$. Then, the cdf is

F_{\mathrm{ag}_{i,1}}(v)=\int_{0}^{1}\int_{0}^{1}\operatorname{\mathbb{I}}[v_{1}+v_{2}\leqslant v]\,dv_{1}\,dv_{2}=\begin{cases}\frac{v^{2}}{2}&\text{for }v\in[0,1],\\ 1-\frac{(2-v)^{2}}{2}&\text{for }v\in[1,2].\end{cases}

See Figure 2 for a visualization. The cdfs of all other agents have the same form. For each of the $3n$ agents above, the maximum expected return is $2$.
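As a quick sanity check of the cdf (our own verification, not part of the proof), one can draw $v_1,v_2$ uniformly from $[0,1]$ and estimate $F(3/2)$, which should be close to $7/8$:

import numpy as np

rng = np.random.default_rng(0)
v = rng.uniform(size=(1_000_000, 2)).sum(axis=1)   # v_1 + v_2 with v_1, v_2 ~ U[0, 1]
print(np.mean(v <= 1.5))                           # approximately 0.875 = 7/8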

Figure 2: The effective state-action occupancy polytope of agents and their expected return distribution. (a) The scaled occupancy polytope, with axes $\pi(\lambda(c_{i,1}))$ and $\pi(\lambda(c_{i,2}))$. (b) The expected return cdf $F_i(v)$.

Rounding a policy. Observe that $F_i(3/2)=7/8$. For agent $\mathrm{ag}_{i,1}$ to approve a policy under $\alpha\in(7/8,1]$, its expected return must exceed $3/2=F^{-1}_{\mathrm{ag}_{i,1}}(7/8)$. Since $v_1,v_2\in[0,1]$ and $v_1+v_2>3/2$, we have $v_1>1/2$ and $v_2>1/2$. Therefore, if a policy $\pi$ is $\alpha$-approved by agent $\mathrm{ag}_{i,1}$, then $\pi(\lambda(c_{i,1}))>1/2$ and $\pi(\lambda(c_{i,2}))>1/2$. Further, observe that the condition $J_{\mathrm{ag}_{i,k}}(\pi)>3/2$ can hold for at most one of the three agents $\{\mathrm{ag}_{i,k}\}_{k\in[3]}$, as every pair of these agents disagrees on one literal.

We round such a policy $\pi$ to an assignment $\phi$ by letting $\phi(x_j)\leftarrow\mathrm{True}$ if $\pi(a_{\mathrm{True}}|s_j)>\frac{1}{2}$ and $\phi(x_j)\leftarrow\mathrm{False}$ otherwise. If an agent in $\{\mathrm{ag}_{i,k}\}_{k\in[3]}$ $\alpha$-approves $\pi$, then $C_i$ is satisfied by the assignment $\phi$. Therefore, $\operatorname{OPT}_{\alpha}\leqslant\operatorname{OPT}_{\mathrm{MAX\text{-}2SAT}}$, where $\operatorname{OPT}_{\alpha}$ is the maximum number of $\alpha$-approvals achievable by any policy and $\operatorname{OPT}_{\mathrm{MAX\text{-}2SAT}}$ is the maximum number of clauses that can be satisfied by any assignment.
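A minimal sketch of this rounding step in Python (the data layout is ours): round each variable by the probability of $a_{\mathrm{True}}$ and count the clauses certified by $\alpha$-approving agents.

def round_policy(pi_true, clauses):
    """pi_true[j] = pi(a_True | s_j); clauses are ((j1, sign1), (j2, sign2))."""
    phi = [p > 0.5 for p in pi_true]
    satisfied = sum(
        1 for (j1, s1), (j2, s2) in clauses
        if phi[j1] == s1 or phi[j2] == s2   # clause C_i holds under phi
    )
    return phi, satisfied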

Next, we show $\operatorname{OPT}_{\mathrm{MAX\text{-}2SAT}}\leqslant\operatorname{OPT}_{\alpha}$ by deriving, from an optimal assignment $\phi$, a policy $\pi$ that receives $\operatorname{OPT}_{\mathrm{MAX\text{-}2SAT}}$ $\alpha$-approvals: simply let $\pi(a_{\mathrm{True}}|s_j)=1$ if and only if $\phi(x_j)=\mathrm{True}$. If $\phi$ satisfies a clause $C_i$, then for the agent $\mathrm{ag}\in\{\mathrm{ag}_{i,k}\}_{k\in[3]}$ whose reward function matches the literal assignments of $\phi$ for $C_i$, we have $J_{\mathrm{ag}}(\pi)=2$, so $\pi$ is optimal (i.e., $1$-approved) for $\mathrm{ag}$. Each of the other two agents of $C_i$ receives $J_{\mathrm{ag}}(\pi)\leqslant 1$, which is below the threshold of $3/2$ required for an $\alpha$-approval. Therefore, $\operatorname{OPT}_{\mathrm{MAX\text{-}2SAT}}=\operatorname{OPT}_{\alpha}$, and the hardness of computation follows from the reduction. ∎
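As a supplementary check of the argument above (ours, not part of the proof), the following sketch computes the three agents' returns per clause under the deterministic policy induced by $\phi$; exactly one of them reaches the optimal return of $2$ when the clause is satisfied, and none exceeds $1$ otherwise.

def agent_returns(phi, clause):
    """Returns (J_{ag_{i,1}}, J_{ag_{i,2}}, J_{ag_{i,3}}) under the policy with
    pi(a_True | s_j) = 1 iff phi[j], using the scaled occupancy utilities."""
    (j1, s1), (j2, s2) = clause
    lit1, lit2 = phi[j1] == s1, phi[j2] == s2
    return (int(lit1) + int(lit2),        # ag_{i,1} rewards (c1, c2)
            int(lit1) + int(not lit2),    # ag_{i,2} rewards (c1, ~c2)
            int(not lit1) + int(lit2))    # ag_{i,3} rewards (~c1, c2)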

Appendix B Experimental Details

All experiments were run on an AMD EPYC 7502 32-core processor with 258GiB of system memory. We use Gurobi [18] to solve LPs and MILPs.

Running time. All our voting rules have a running time of less than ten minutes on the constructed MOMDPs with $3^5=243$ states, $6$ actions, and $10$ agents. The most resource-intensive task of our experiments was sampling $5\cdot 10^5$ random policies and computing the expected return of every agent, which is a standard task. For each instance, we did this in parallel with $20$ processes, with a total running time of less than $2$ hours per instance.
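For concreteness, a sampling procedure along these lines could look as follows in Python; this is our own sketch for the discounted-reward case (the transition tensor P, initial distribution mu0, discount gamma, and reward tensor R are hypothetical inputs), not the exact code used in the experiments.

import numpy as np

def occupancy_measure(policy, P, mu0, gamma):
    """State-action occupancy d_pi for a stochastic policy (S x A), transition
    tensor P (S x A x S), initial distribution mu0 (S,), and discount gamma."""
    S, A, _ = P.shape
    P_pi = np.einsum("sa,sat->st", policy, P)                    # state transitions under pi
    d_state = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d_state[:, None] * policy                             # d_pi(s, a)

def sample_returns(P, mu0, gamma, R, n_samples, seed=0):
    """Expected return of every agent for n_samples random stochastic policies."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    out = np.empty((n_samples, R.shape[0]))
    for t in range(n_samples):
        pi = rng.dirichlet(np.ones(A), size=S)                   # random policy
        d = occupancy_measure(pi, P, mu0, gamma)
        out[t] = (R * d).sum(axis=(1, 2))                        # <d_pi, R_i> per agent
    return out                                                   # used to estimate each F_i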

Implementation details of the Borda count rule. After fitting a generalized logistic function for $F_i$ based on the expected returns of the sampled policies, we find the value $\mathrm{mode}(f_i)$ and check, by solving a linear program, whether there exists a policy that achieves an expected return of $\mathrm{mode}(f_i)$ for all agents. Next, to utilize Gurobi's LP solvers, we approximate the concave function $F_i$ by a set of linear constraints that form a piecewise-linear concave approximation of $F_i$. Therefore, our final program for an approximate Borda count rule is simply an LP; a sketch of this LP follows.
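A minimal sketch of such an LP in gurobipy is given below, under our own naming: pieces[i] is a list of (slope, intercept) pairs whose lower envelope is the piecewise-linear concave approximation of $F_i$, and, for concreteness, the occupancy polytope is that of a fully connected MOMDP (in general, the Bellman flow constraints would take its place). The sketch illustrates the idea rather than reproducing our exact implementation.

import gurobipy as gp
from gurobipy import GRB

def approximate_borda_lp(R, pieces, n_states, n_actions):
    """R[i][s][a]: reward of agent i; pieces[i]: (slope, intercept) pairs whose
    minimum approximates the concave cdf F_i.  Maximizes sum_i F_i(J_i)."""
    m = gp.Model("approx_borda")
    d = m.addVars(n_states, n_actions, lb=0.0, name="d")        # occupancy measure
    t = m.addVars(len(R), lb=-GRB.INFINITY, name="t")           # t_i <= F_i(J_i(d))
    for s in range(n_states):                                   # fully connected polytope
        m.addConstr(gp.quicksum(d[s, a] for a in range(n_actions)) == 1.0 / n_states)
    for i in range(len(R)):
        J_i = gp.quicksum(R[i][s][a] * d[s, a]
                          for s in range(n_states) for a in range(n_actions))
        for slope, intercept in pieces[i]:
            m.addConstr(t[i] <= slope * J_i + intercept)        # concave lower envelope
    m.setObjective(t.sum(), GRB.MAXIMIZE)
    m.optimize()
    return d, t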