Provable Multi-Party Reinforcement Learning
with Diverse Human Feedback
Abstract
Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work initiates the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual’s preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.
1 Introduction
Recent advancements in AI alignment focus on training AI systems to align with user preferences. A prevalent strategy involves integrating human feedback into the learning process, a technique that has significantly influenced the development of language models and reinforcement learning among other areas (Ouyang et al.,, 2022; Ziegler et al.,, 2019). Established RLHF methods typically fit a reward model over users’ preference data (in the form of pairwise comparisons) and then train policies to optimize the learned reward. The implicit assumption behind this approach is that different users’ preferences can be modeled via a single reward function.
However, this type of single-party reward-based assumption is at odds with the fact that users may possess heterogeneous viewpoints that conflict with each other. For example, the nature of human opinion in open-ended social decision-making scenarios is inherently diverse and dynamic, making traditional reinforcement learning from human feedback (RLHF) models (Ouyang et al.,, 2022; Christiano et al.,, 2017) insufficient. To see how traditional single-party RLHF can fail to balance multiple individuals’ preferences, consider the following toy example.
Example (2 parties and 3 alternatives).
Consider a scenario where users have preferences over three options . Half of the population prefers (with underlying rewards of for some ), while the other half prefers (with underlying rewards of ). Then pooling both sub-populations’ preferences together, we cannot differentiate between via pairwise comparison. Thus, if we learn a single reward function over the pooled pairwise comparison data, the resulting will have . This leads to a policy that selects the three options uniformly at random, which has an average reward less than . However, if we learn each sub-population’s reward separately, we can potentially arrive at a better policy that only recommends option and achieves an average reward of .
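For concreteness, the following minimal Python sketch illustrates this phenomenon. The reward values are placeholders of our own choosing (the example's exact numbers are not reproduced above); the point is qualitative: pooled pairwise comparisons are nearly uninformative about the options, so a single reward fit is nearly flat, whereas per-party rewards reveal a good compromise option once aggregated (here with the Nash welfare function introduced later).

```python
# Illustrative sketch of the 2-party, 3-option example. All numbers are
# hypothetical placeholders, not the paper's values.
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-party rewards over options (a1, a2, a3).
r = np.array([[1.0, 0.6, 0.0],    # party 1 prefers a1
              [0.0, 0.6, 1.0]])   # party 2 prefers a3

pairs = [(0, 1), (0, 2), (1, 2)]
pooled = {p: 0.5 * sum(sigmoid(r[i, p[0]] - r[i, p[1]]) for i in range(2))
          for p in pairs}
print("pooled P(first beats second):", {p: round(v, 3) for p, v in pooled.items()})
# All close to 0.5, so a single BTL reward fit to the pooled data is nearly
# flat and induces a near-uniform policy.

# With per-party rewards estimated separately, aggregate them with the Nash
# welfare after subtracting each party's minimal reward (its "zero").
zero = r.min(axis=1, keepdims=True)
nash = np.prod(r - zero + 1e-9, axis=0)   # product of normalized rewards per option
print("Nash welfare per option:", nash.round(3))
print("Nash-optimal option:", int(nash.argmax()))  # the compromise option a2
```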
This tension between the traditional single-party (or single-reward) RLHF approach and diverse users’ preferences motivates the following question:
How can models be trained to align with the preferences of heterogeneous individuals?
Our work takes a step towards addressing this question by initiating a theoretical exploration of multi-party RLHF, which aims to explicitly model and balance diverse heterogeneous preferences from different individuals. We take inspiration from social choice theory (Noothigattu et al.,, 2020; Jin et al.,, 2020), which offers an extensive array of tools for aggregating human preferences. At a high level, we extend the framework of RLHF to directly accommodate multiple parties and heterogeneous preferences by utilizing social welfare functions. We focus on the popular offline setting, where the dataset is pre-collected and the goal is to learn a policy from offline preference data. We start from the contextual bandit (CB) setting with pairwise comparisons and later generalize to the Markov Decision Process (MDP). The insights of our approaches can be generalized to a wider range of scenarios. Notably, recent works by Zhu et al., (2023) and Zhan et al., (2023) provided an initial theoretical analysis of offline RLHF. Our paper takes a step further by considering cases with multiple parties and heterogeneous preferences.
Our main contributions are summarized as follows:
• We propose a general framework for alignment with multiple heterogeneous parties. We utilize a meta-learning technique to learn individual rewards, and aggregate them using Nash’s social welfare function based on the confidence bounds on reward estimation. We further extend our analyses for the Utilitarian and Leximin welfare functions.
• We provide sample complexity bounds for obtaining an approximately optimal policy through our proposed framework. We further introduce efficiency and fairness definitions and demonstrate that the learned policies satisfy approximate Pareto efficiency and the Pigou-Dalton principle, thus ensuring collective welfare and avoiding inequality or dominance.
• We extend our analysis to a reward-free setting, where individual preferences are no longer consistent with a certain data generative model. We provide pessimistic variants of the von Neumann winner (Fishburn,, 1984), determined by the confidence bounds on preference distributions. Theoretical guarantees of sample complexity and ex-post efficiency are also provided in this generalized setting.
The novelty of our work lies in two aspects. First, we employ a meta-learning technique for learning multiple reward functions from limited observations, by utilizing a common feature representation among various parties. This strategy enhances learning efficiency in environments where data are scarce, drawing on the foundational work in few-shot learning (Du et al.,, 2020; Tripuraneni et al.,, 2020). Second, we integrate a pessimistic approach within social welfare functions to guarantee model invariance with the inclusion of zero rewards in the presence of multiple parties. This aspect of our work extends the principles established by Zhu et al., (2023), where we establish a sub-optimality bound for the (Nash’s) social welfare function. This bound is derived by aggregating heterogeneous individual rewards, necessitating a more stringent data coverage condition than what is typically required in RLHF scenarios involving a single entity.
2 Related Work
Meta-learning
Meta-learning, or learning to learn, seeks to design a learner to quickly learn new tasks based on a few prior tasks. Some of the dominant approaches learn a common representation among multiple tasks, referred to as representation learning (Baxter,, 2000; Maurer et al.,, 2016; Finn et al.,, 2017; Ji et al.,, 2023). Theoretical guarantees for multi-task linear regression with low-dimensional linear representation have been established in the literature (Tripuraneni et al.,, 2020; Du et al.,, 2020; Li and Zhang,, 2023). Our work generalizes this line of work by investigating learning-to-learn abilities in human preference-based models with a more complicated structure.
Reinforcement Learning with Human Feedback
Human preference is widely used in RL since preference comparison is easier to elicit than numerical reward (Ziegler et al.,, 2019; Ouyang et al.,, 2022; Chen et al.,, 2022). Particularly, the most related works to ours are Zhu et al., (2023) and Zhan et al., (2023), both of which study offline RLHF with a reward-based preference model. Zhu et al., (2023) focused on a Bradley-Terry-Luce (BTL) model under a linear reward function, while Zhan et al., (2023) further considered general function classes. Lastly, we highlight the latest work by Wang et al., (2023), which established two preference models akin to ours, but in the context of traditional single-party RLHF in online learning. In contrast, we consider a broader setting with heterogeneous individuals. Particularly, offline RL is more challenging than online RL due to limited data availability. Thus, the pessimistic technique has been widely studied in recent years, as witnessed in contextual bandits (CBs) (Li et al.,, 2022), MDPs (Xie et al.,, 2021), and Markov games (Cui and Du,, 2022). In this paper, we also utilize pessimism to address the coverage challenges.
Social Choice Theory
Social choice theory is the field that studies the aggregation of individual preferences towards collective decisions (Moulin,, 2004). Different solution concepts in game theory and voting theory have been applied to achieve social decisions, including Nash bargaining (Nash,, 1953), von Neumann winner (Fishburn,, 1984; Rivest and Shen,, 2010; Brandl et al.,, 2016; Dudík et al.,, 2015), etc. Notably, the concurrent work by Fish et al., (2023) presented pioneering attempts to combine AI systems with social choice theory given access to oracle LLM queries. However, their work focused on binary approval, while our methods consider the degree of individual preferences. Our work also extends several standard efficiency notions in social choice theory, including Pareto efficiency (Barron,, 2013; Nguyen and Rothe,, 2020; Barman et al.,, 2018; Aumann and Dombb,, 2010) and ex-post efficiency (Fishburn,, 1984; Immorlica et al.,, 2017; Zeng and Psomas,, 2020). Particularly, Barman et al., (2018) considered a Nash welfare function in the context of fair allocation and established approximate efficiency results based on agents’ valuation errors. Both their efficiency results and ours are derived from the learning errors.
3 Preliminaries
We begin with the notation. Let . We use or to denote the norm of a vector or the spectral norm of a matrix. We use to denote the Frobenius norm of a matrix. Let be the Euclidean inner product between vectors or matrices. For a matrix , let be its -th largest singular value. Our use of follows the standard notation. We use to denote the matrix-induced norm for a positive-semidefinite matrix . We write if is positive semidefinite. We use where to denote the set of orthonormal matrices (i.e., the columns are orthonormal).
3.1 Offline Data Collecting and Comparison Model
We start with the contextual bandit (CB) setting. Consider the environment , where is a state space, is an action space, for are reward functions from individuals (or parties), and is an initial state distribution. A deterministic policy is a function that maps a state to an action, while a randomized policy is a function that maps a state to a distribution over the action space.
As in Zhu et al., (2023), we consider an offline setting with a pre-collected dataset . We are provided with pairwise comparisons from individuals (or parties), where for . For the -th sample in , a state is first sampled from the initial distribution . Given the state , an action pair is sampled from a distribution . Thus is generated by a generation distribution . We then observe a binary outcome from a Bernoulli distribution . Here indicates the preferred one in . Primarily, we assume that is reward-based for .
Reward-based Model
Suppose that individual preferences rely on a reward function:
P_i(y = 1 \mid s, a^0, a^1) = \sigma\big(r_i(s, a^1) - r_i(s, a^0)\big)     (1)
where \sigma(x) = 1/(1+e^{-x}) is the sigmoid function, which leads to the well-known Bradley-Terry-Luce (BTL) model in pairwise comparison (Bradley and Terry,, 1952). We now provide several assumptions for our analysis.
Assumption 1 (Linear reward).
The rewards lie in the family of linear models , where is a known feature mapping with . Let be the true parameters, suppose for all .
Assumption 2 (Shared representation).
Suppose , where , , and the underlying parameters satisfy for all .
Assumption 3 (Feature design).
Define , here the randomness is induced by the generation distribution . Suppose has bounded nonnegative eigenvalues .
Assumption 4 (Diversity design).
Let , , and . Suppose that and .
Assumption 1 can be applied to cases where is derived by removing the last layer of a pre-trained model. Assumption 2 further says that individual rewards share a common low-dimensional representation . Assumptions 3 and 4 are standard well-conditioned designs for the covariance matrix and the underlying reward parameters (or ) (Tripuraneni et al.,, 2020; Du et al.,, 2020).
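As a minimal illustration of Assumptions 1 and 2, the sketch below builds several linear rewards that share a low-dimensional orthonormal representation of the feature map. All dimensions, names, and the placeholder featurizer are our own illustrative choices, not quantities from the paper.

```python
# Sketch of the shared-representation linear reward model (Assumptions 1-2).
import numpy as np

d, k, N = 16, 3, 5   # feature dimension, representation dimension, number of parties

rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.normal(size=(d, k)))          # shared orthonormal representation
theta = rng.normal(size=(N, k))                       # party-specific parameters
theta /= np.maximum(np.linalg.norm(theta, axis=1, keepdims=True), 1.0)  # bounded norms

def phi(state, action):
    """Placeholder feature map phi(s, a) in R^d with unit norm (deterministic per pair)."""
    local = np.random.default_rng(abs(hash((state, action))) % (2 ** 32))
    v = local.normal(size=d)
    return v / np.linalg.norm(v)

def reward(i, state, action):
    """Party i's linear reward <B theta_i, phi(s, a)> under the shared representation."""
    return float(phi(state, action) @ (B @ theta[i]))

print(reward(0, "some state", "some action"))
```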
3.2 Social Welfare Function
As different parties have their own reward functions, in this section we introduce social welfare functions to aggregate them so that the trained model aligns with heterogeneous preferences and reflects social welfare. They map individual utilities to collective welfare, where is the utility of the -th individual for and is the possible set of outcomes. A reasonable welfare function should satisfy six axioms: monotonicity, symmetry, continuity, independence of unconcerned agents, independence of common scale, and the Pigou-Dalton transfer principle (see, Chapter 3.2, Moulin,, 2004). It has been proved that all functions satisfying these six desirable properties lie in a one-parameter family of isoelastic social welfare functions, defined as follows:
W_p(u_1, \dots, u_N) = \begin{cases} \sum_{i=1}^{N} u_i^{p}, & 0 < p \le 1, \\ \sum_{i=1}^{N} \log u_i, & p = 0, \\ -\sum_{i=1}^{N} u_i^{p}, & p < 0. \end{cases}     (2)
Among these functions, we elaborate on three prominent examples: the Utilitarian welfare function, which calculates the mean expected agreement across parties; the Leximin welfare function, which protects the utility of the worst-off individual; and the Nash welfare function, which maximizes the product of utilities to attain a proportional-fair (Kelly et al.,, 1998; Zhang et al.,, 2022) solution. These concepts of welfarism are widely used in social choice theory; see more details in Appendix A. We primarily consider the Nash welfare function . In the context of welfarism with bargaining considerations, an objective definition of the zero of individual utility is introduced. This zero is regarded as the worst outcome or minimal utility acceptable to a specific individual, so the arbitrator must treat it as a strict lower bound. Thus, Nash bargaining (Nash,, 1953) considers a maximization problem over a feasible set under an additional normalization by the individual zeros :
By optimizing the welfare function with such normalizations, the Nash bargaining solution is invariant to affine transformations of ’s and eliminates the dominance of specific individuals, making it a standard solution concept in cooperative games. We begin with the Nash welfare function to aggregate heterogeneous reward functions in the next section and optimize the aggregated function. Additional results of other social welfare functions (e.g., Utilitarian and Leximin ) are shown in Section 4.4.
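The sketch below illustrates numerically why the Nash bargaining objective, after subtracting the individual zeros, is invariant to positive affine transformations of a party's utility. The candidate outcomes, utilities, and zero points are illustrative placeholders of our own choosing.

```python
# Sketch: invariance of the Nash bargaining objective to positive affine
# transformations of individual utilities (with matching transformed zeros).
import numpy as np

def nash_objective(u, u_zero):
    gains = np.asarray(u) - np.asarray(u_zero)
    return np.sum(np.log(gains)) if np.all(gains > 0) else -np.inf

# Two parties, three candidate outcomes; rows give (party 1, party 2) utilities.
outcomes = np.array([[0.9, 0.2], [0.6, 0.6], [0.3, 0.8]])
zeros = np.array([0.1, 0.1])
best = int(np.argmax([nash_objective(u, zeros) for u in outcomes]))

# Rescale and shift party 2's utility; its zero transforms the same way,
# so the selected outcome is unchanged.
a, b = 5.0, 2.0
outcomes2 = outcomes.copy(); outcomes2[:, 1] = a * outcomes2[:, 1] + b
zeros2 = zeros.copy(); zeros2[1] = a * zeros2[1] + b
best2 = int(np.argmax([nash_objective(u, zeros2) for u in outcomes2]))

print(best, best2)   # same index before and after the affine transformation
```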
4 Reward-Based Model
In this section, we propose a general framework for alignment with multiple parties using social welfare functions in a reward-based model. We begin with the framework of Nash bargaining to learn a deterministic policy and further extend our analysis for the Utilitarian and Leximin welfare functions.
4.1 Pessimistic Nash Bargaining Algorithm
The construction of the Nash Bargaining estimator consists of four steps. Step 1 is to estimate the parameters from the observations. Step 2 is to construct the confidence sets for these parameters simultaneously. Step 3 involves introducing a pessimistic expected value function using Nash bargaining. Finally, Step 4 is to obtain the pessimistic policy through a maximization problem. We summarize this procedure in Algorithm 1, and discuss each step in detail below.
Input: The datasets , a failure probability , and the initial distribution .
Output: The pessimistic policy .
First, we learn the individual rewards for through the maximum likelihood estimator (MLE). Denote as the low-dimensional parameters, and as the original parameters. Define the feature difference as and for . Using the dataset with , we minimize the negative log-likelihood:
\ell_{\mathcal{D}}(\Theta) = -\sum_{i=1}^{N}\sum_{j=1}^{n} \Big[ y^{i}_{j}\,\log\sigma\big(\langle \theta_i, x^{i}_{j}\rangle\big) + (1-y^{i}_{j})\,\log\sigma\big(-\langle \theta_i, x^{i}_{j}\rangle\big) \Big]     (3)
where . When the minimizer is not unique, we take any solution that achieves the minimum. As a result, we obtain the estimators for . We then define the data covariance matrices for as
(4)
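A minimal sketch of Steps 1 and 2 follows. The simple gradient-based solver, variable names, and dimensions are our own illustrative choices; the section only specifies the pooled negative log-likelihood objective and the covariance matrices, not a particular optimization routine.

```python
# Sketch of the pooled MLE over a shared representation B and per-party
# parameters Theta[i] (Step 1), plus the per-party covariances (Step 2).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_lik(B, Theta, X, Y):
    """Pooled BTL negative log-likelihood.
    X[i]: (n_i, d) feature differences for party i; Y[i]: (n_i,) binary labels."""
    total = 0.0
    for i in range(len(X)):
        p = np.clip(sigmoid(X[i] @ (B @ Theta[i])), 1e-12, 1 - 1e-12)
        total -= np.sum(Y[i] * np.log(p) + (1 - Y[i]) * np.log(1 - p))
    return total

def fit_mle(X, Y, d, k, steps=2000, lr=0.05, seed=0):
    """Plain gradient descent on the pooled objective; an illustrative solver."""
    rng = np.random.default_rng(seed)
    B = np.linalg.qr(rng.normal(size=(d, k)))[0]
    Theta = [0.1 * rng.normal(size=k) for _ in X]
    for _ in range(steps):
        grad_B = np.zeros_like(B)
        for i in range(len(X)):
            resid = sigmoid(X[i] @ (B @ Theta[i])) - Y[i]       # d(loss)/d(logits)
            grad_B += np.outer(X[i].T @ resid, Theta[i])
            Theta[i] -= lr * (B.T @ (X[i].T @ resid)) / len(X[i])
        B -= lr * grad_B / sum(len(x) for x in X)
        B = np.linalg.qr(B)[0]         # keep the shared representation orthonormal
    return B, Theta

def covariances(X):
    """Per-party data covariance matrices of the feature differences (Step 2)."""
    return [x.T @ x / len(x) for x in X]
```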
Subsequently, we regard the multi-party alignment problem as a Nash bargaining game involving reward functions . We set the zero utility as and consider the following Nash welfare function to guarantee individual satisfaction,
We emphasize that the introduction of minimal rewards, , ensures solution invariance and mitigates dominance concerns, as discussed in Section 3.2. Moreover, our specific choice for is easy to compute since solving is standard in bandit and RL problems (Sutton and Barto,, 2018). Alternatively, we can also consider the reward under any reference policy. This distinction separates multi-party alignment from single-party alignment.
Next, we introduce the pessimistic technique to learn a policy. We first define the true Nash expected value of a policy and its sub-optimality as:
where is the optimal policy. This sub-optimality notion measures a performance gap compared to the optimal policy. Given a failure probability , we consider the pessimistic estimator obtained from the lower confidence bound of the set of parameters: , where is defined in the next section. We then construct a pessimistic Nash expected value function:
(5)
Finally, we obtain as the pessimistic estimator. The pessimism technique penalizes responses that are less represented in the dataset and thus contributes to finding a more conservative and accurate policy (Xie et al.,, 2021; Zhan et al.,, 2023; Zhu et al.,, 2023). Later, we will demonstrate that can approach in the sense of the true Nash expected value .
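Steps 3 and 4 can be sketched as follows. The elliptical penalty scaled by a radius is a standard lower-confidence-bound form and only a simplification of the paper's exact confidence-set construction; the zero rewards are supplied as inputs, and policy search is reduced to per-state action enumeration. All names are illustrative.

```python
# Sketch of Steps 3-4: per-party pessimistic rewards, Nash aggregation of the
# normalized gains, and greedy policy selection over a finite action set.
import numpy as np

def pessimistic_reward(x, theta_hat, Sigma_inv, radius):
    """Lower confidence bound on <theta_hat, x> with penalty radius * ||x||_{Sigma^{-1}}."""
    return float(x @ theta_hat - radius * np.sqrt(x @ Sigma_inv @ x))

def nash_value(action, state_features, estimators, zeros):
    """Pessimistic Nash value of one action at one state.
    estimators: list of (theta_hat, Sigma_inv, radius) per party; zeros: per-party zero rewards."""
    log_welfare = 0.0
    for i, (theta_hat, Sigma_inv, radius) in enumerate(estimators):
        gain = pessimistic_reward(state_features[action], theta_hat, Sigma_inv, radius) - zeros[i]
        if gain <= 0:
            return -np.inf                 # outside the bargaining feasible set
        log_welfare += np.log(gain)
    return log_welfare

def greedy_nash_policy(state_features, estimators, zeros):
    """Pick the action maximizing the pessimistic Nash value at this state."""
    values = [nash_value(a, state_features, estimators, zeros)
              for a in range(len(state_features))]
    return int(np.argmax(values))
```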
4.2 Sample Complexity
In this section, we introduce the sub-optimality bound and the corresponding sample complexity. We begin with the following estimation error between and .
Theorem 1.
Now we let in Algorithm 1 be , where is a constant. Theorem 1 implies that with probability at least , for any . We provide a proof outline here, and defer the details to Appendix C.1. First, we establish an overall bound on by leveraging the convexity of , based on second-order Taylor expansion and a careful use of concentration inequalities. Next, we derive a bound on in terms of and using the Davis-Kahan theorem and Bernstein-type inequality. Theorem 1 is a generalization of the upper bound of Lemma 3.1 in Zhu et al., (2023), which improves the previous bound of when . It shows that it is possible to use samples to learn individual rewards via learning a shared representation with all observations pooled together.
Remark 1.
We then consider the induced policy. Denote as the achieved pessimistic parameter in the lower confidence bound for . Denote and as the baseline policies that attain and , respectively. Define the concentratability coefficient as
(6)
We have the following guarantee for the pessimistic policy, whose proof is given in Appendix C.2.
Theorem 2.
Under the same condition as Theorem 1, with probability at least ,
The proof sketch is as follows. The sub-optimality can be decomposed as , and the key term to control is the first one, since the latter two terms are no greater than zero with high probability. We address the multiplicative Nash welfare function by decomposing it into an aggregation of individual errors. Given the normalization inherent in Nash bargaining, we then separately handle the gaps between the true and empirical values, as well as those between the true and empirical values induced by stochasticity, using the estimation error bound.
We point out that is a concentratability coefficient assumed to be bounded (Zhu et al.,, 2023; Zhan et al.,, 2023). It provides a coverage guarantee of the dataset: the ratio of the state-action occupancy induced by the optimal and baseline policies, to the data distribution, is bounded. This assumption ensures good coverage of the target vector from the dataset in the feature space, and therefore yields an accurate estimator.
For the problems with bounded concentratability coefficients (defined in (6)), we obtain a lower bound result in the following theorem, whose proof is deferred to Appendix C.3.
Theorem 3 (Lower Bound).
Consider the family of instances , where is defined in (6). For any bandit instance , is defined as the sub-optimality under instance . Suppose , then there exists a feature mapping such that the following lower bound holds.
The lower bound highlights the complexity challenge in multi-party RLHF. We face the aggregation of individual learning errors and a potentially larger concentratability coefficient defined in (6), as it is related to not only the optimal policy but also the baseline policies . This leads to a larger sample complexity than traditional RLHF, where the concentratability coefficient is solely based on a single-party optimal policy as depicted in Theorem 3.10 in Zhu et al., (2023). Obtaining tight dependence of our sub-optimality bound on feature dimensions is challenging. The gap between Theorem 2 and Theorem 3 relates to an open problem in classical estimations, see a more detailed discussion on page 8 in Tripuraneni et al., (2020).
4.3 Efficiency and Welfare Guarantees
We demonstrate the efficiency and fairness guarantees of our method using a Nash welfare function. Theorem 2 implies the above Nash solution is within a small distance (sub-optimality) to the optimal solution ( is also Pareto efficient, see Lemma 3). This observation motivates us to consider the concept of -approximate Pareto efficiency concerning each normalized reward function.
Definition 1 (-approximate Pareto Efficiency).
A solution is -approximate Pareto efficient at state if no other action at the state provides all individuals with at least the same rewards , and one individual with a reward times higher.
This efficiency notion guarantees an agreement and balance among multiple parties. It is an adaptation of the definitions by Aumann and Dombb, (2010); Barman et al., (2018) to the CB setting. While previous work considered an efficiency notion that allows a improvement for all individuals, we admit a improvement for only one individual while keeping the others unchanged. Define the state-wise concentratability coefficient as
We formulate the following theorem.
Theorem 4.
Under the same condition as Theorem 1, with probability at least , is -approximate Pareto efficient at state , where .
This result is established for every state, and the proof is given in Appendix C.4. It confirms that we cannot improve one party significantly without making another party worse off, thus leading to a proportional-fair policy. Moreover, the pessimistic policy follows the approximate Pigou-Dalton principle, ensuring more equitable outcomes (Moulin,, 2004); see Appendix C.4 for details.
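To make Definition 1 concrete, the following check (illustrative names; finite action set and known normalized per-party rewards) tests whether some alternative action improves one party by more than a (1 + ε) factor while leaving every other party at least as well off.

```python
# Sketch of checking epsilon-approximate Pareto efficiency at a single state.
import numpy as np

def is_approx_pareto_efficient(chosen, rewards, eps):
    """rewards[a] is the vector of normalized per-party rewards of action a."""
    base = np.asarray(rewards[chosen], dtype=float)
    for a, r in enumerate(rewards):
        if a == chosen:
            continue
        r = np.asarray(r, dtype=float)
        no_one_worse = np.all(r >= base)
        someone_much_better = np.any(r > (1.0 + eps) * base)
        if no_one_worse and someone_much_better:
            return False
    return True

# Example with three actions and two parties (illustrative numbers).
rewards = [(0.5, 0.5), (0.9, 0.1), (0.52, 0.5)]
print(is_approx_pareto_efficient(0, rewards, eps=0.1))   # True: no 10% free gain exists
print(is_approx_pareto_efficient(0, rewards, eps=0.01))  # False: action 2 offers a >1% free gain
```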
4.4 Comparison with Various Social Welfare Functions
We further provide a concise introduction to other social welfare functions, including the Utilitarian and Leximin (Egalitarian) welfare functions, and compare their results. We adjust the pessimistic value function in Step 3 of Algorithm 1 to accommodate these alternative welfare functions.
Utilitarian welfare function ()
Given its independence of individual zeros of utilities (refer to Appendix A), the introduction of a zero reward is unnecessary in this context. Similarly, the pessimistic expected value function is defined as follows,
Denote , . We present the following sub-optimality bound.
Theorem 5.
Leximin welfare function ()
In this context, the maximization is conducted over the minimum reward among all parties, denoted as . Thus, we reintroduce the concept of zero reward as . Similarly, the pessimistic expected value function is defined as follows,
Denote , . Additionally, let , be defined identically to the Nash case in Section 4.2. We present the following sub-optimality bound.
Theorem 6.
The proof is deferred to Appendix C.5. Theorem 6 implies that is approximately optimal for the worst-off individual, thus characterizing a relatively fair outcome. To achieve a sub-optimal policy, Theorem 2, Theorem 5, and Theorem 6 once again highlight the complexity challenge in a multi-party scenario that arises from the aggregation of individual learning errors as well as the potential increase in the concentratability coefficient .
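Concretely, the only change across these variants is the aggregator applied in Step 3 of Algorithm 1. The sketch below shows the three aggregators side by side; the inputs are per-party pessimistic rewards of a candidate action, and the zero rewards are used only by the Nash and Leximin variants, matching the discussion above. Names and the interface are illustrative.

```python
# Sketch of Step 3 of Algorithm 1 under the three welfare functions.
import numpy as np

def pessimistic_value(pess_rewards, zeros, welfare="nash"):
    r = np.asarray(pess_rewards, dtype=float)
    z = np.asarray(zeros, dtype=float)
    if welfare == "utilitarian":        # mean reward, no zero normalization
        return float(np.mean(r))
    if welfare == "leximin":            # worst-off party after normalization
        return float(np.min(r - z))
    if welfare == "nash":               # log-product of normalized gains
        gains = r - z
        return float(np.sum(np.log(gains))) if np.all(gains > 0) else -np.inf
    raise ValueError(welfare)

print(pessimistic_value([0.8, 0.3], zeros=[0.1, 0.1], welfare="leximin"))
```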
We now briefly compare the results for the various social welfare functions used for aggregation.
• The Nash welfare function exhibits resilience against affine transformations, which helps avoid dominance. Also, it strikes a middle ground between Utilitarian and Leximin fairness, thus attaining a favorable balance between the averaged and worst-case performances (Zhang et al.,, 2022). Moreover, under additional convexity assumptions, Nash bargaining guarantees that each individual receives their minimum reward plus a certain fraction of the maximum feasible gain, expressed as (see, Chapter 3.6, Moulin,, 2004). These properties distinguish the Nash solution.
• On the other hand, both the Nash and Leximin solutions incorporate the concept of zero rewards , while the Utilitarian solution remains independent of zero rewards (refer to Appendix A). Consequently, the Nash and Leximin welfare functions necessitate a more nuanced coverage assumption (or a higher concentratability coefficient ).
• Bakker et al., (2022) empirically showed that the Utilitarian welfare function successfully considers both minority and majority views of various individuals. When compared to other welfare functions such as Leximin and Nash, their results showed similar average ratings among participants and comparable outcomes for the most dissenting participants. We interpret these findings as specific to a particular data distribution. However, their analysis does not account for the importance of zero rewards, which can significantly influence outcomes by shaping what constitutes an acceptable agreement. Nevertheless, it is crucial to note that these social welfare functions do exhibit disparities: they measure distinct aspects of social welfare and may suit different practical settings.
4.5 Generalization to MDPs
We now generalize our results to MDPs where the preference is based on a whole trajectory instead of a single action. Examples of such settings include Atari games and robot locomotion (Christiano et al.,, 2017), where the preference depends on historical behaviors. We consider the environment , where is a state space, is an action space, is the horizon length, for are known transition kernels, for are reward functions from individuals (or parties), and is an initial state distribution. Define the occupancy measure as , which becomes when . We are provided with pairwise comparisons , where for . For the -th sample in , an initial state is sampled from and two trajectories are sampled from a distribution . Thus is generated by a generation distribution . We then observe a binary outcome from a Bernoulli distribution as follows,
where Assumptions 1-4 hold with the feature difference in changed to the cumulative difference over the trajectory, i.e., .
The generalized algorithm for MDPs is similar to Algorithm 1. We first construct the MLE , aggregate the reward functions, and then seek a policy to optimize the pessimistic Nash value function . However, here we adopt a trajectory-based preference model and thus optimize a trajectory-wise value function. We discuss the differences in detail below.
Denote now as the difference between the cumulative feature in two trajectories, . Define the data covariance matrices as for all . We again minimize the negative log-likelihood:
where .
Subsequently, we use Nash bargaining to aggregate individual preferences. Here we point out that the key difference between MDPs and CBs is the inclusion of the trajectory-based measure , which stems from the observation (Zhu et al.,, 2023) that:
where , and this expectation is taken over the trajectory generated according to the transition kernel. Since the transition distribution is known, we can directly calculate for any given . Define the trajectory-wise true and pessimistic Nash expected value function and as follows:
where , is a fixed parameter. Define following the same structure as in Section 4.2, . We have the following result, whose proof is deferred to Appendix C.6.
Theorem 7.
Under the same condition as Theorem 1 (with changed accordingly), with probability at least , for any ,
We can also attain efficiency and fairness guarantees similar to Section 4.3 accordingly.
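As a small illustration of the trajectory-level quantities used in this extension, the sketch below forms the cumulative feature of a trajectory and the difference between two compared trajectories, which replaces the single-step feature difference of the CB setting in both the MLE and the covariance matrices. The feature map, horizon, and trajectory format are illustrative placeholders.

```python
# Sketch of trajectory-level features in the MDP extension.
import numpy as np

d, H = 8, 4   # feature dimension and horizon (illustrative)

def phi(state, action):
    """Placeholder per-step feature map phi(s, a) in R^d (deterministic per pair)."""
    local = np.random.default_rng(abs(hash((state, action))) % (2 ** 32))
    v = local.normal(size=d)
    return v / np.linalg.norm(v)

def cumulative_feature(trajectory):
    """Sum of per-step features over a trajectory [(s_1, a_1), ..., (s_H, a_H)]."""
    return sum(phi(s, a) for s, a in trajectory)

def feature_difference(traj0, traj1):
    """Cumulative feature of the second trajectory minus that of the first; this
    quantity plays the role of the single-step feature difference in the CB case."""
    return cumulative_feature(traj1) - cumulative_feature(traj0)

traj0 = [("s%d" % h, "a") for h in range(H)]
traj1 = [("s%d" % h, "b") for h in range(H)]
print(feature_difference(traj0, traj1).shape)
```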
5 Extension to Reward-Free Models
The reward-based model is limited to situations in which human preferences are transitive. However, the multi-dimensional nature of human decision-making may result in intransitive preferences (Tversky,, 1969). We now introduce a more general model where each individual’s preference may not be consistent with any reward model in the contextual bandits setting.
General Model
We consider a reward-free model without additional assumptions on human preferences. To ensure symmetry, we define the following preference matrix (Dudík et al.,, 2015):
Throughout this section, we focus on tabular cases with finite state and action spaces , . Recall that is generated by distribution in Section 3.1. Define the occupancy measure as the probability of pair appearing in the data generated by for a given state ; define as the probability of appearing in the data generated by policies respectively, given state . These definitions are akin to those in Markov games (Cui and Du,, 2022).
5.1 Pessimism Algorithm with Von Neumann Winner
For the general model, we consider an alternative solution concept, the von Neumann winner (Dudík et al.,, 2015), which corresponds to a randomized policy. Here, a von Neumann winner of an matrix is the max-min probabilistic strategy of the matrix game, and the achieved value is defined as the value of this matrix game. It is known that when is skew-symmetric, the value is zero (Owen,, 2013). The construction of the von Neumann winner for our problem consists of four steps. Step 1 is to calculate the aggregated preference matrices for each state. Step 2 is to construct the confidence bounds for each entry in these matrices. Step 3 involves solving a max-min optimization problem for each state. Finally, Step 4 is to transform the probabilistic strategy of each state into a randomized policy in the policy space. We summarize our procedure in Algorithm 2 first, and discuss each step in detail below.
Input: The datasets , a failure probability , and the initial distribution .
Output: The pessimistic policy .
Define the visiting number of an action pair at one state as , where denotes the number of observations in of preferring over at state . The standard preference matrix at state between actions and is then defined as:
(7)
where if the denominator ; otherwise, it is defined as 0.
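A sketch of Step 1 at a single state follows. The skew-symmetric convention M(a, a') = P(a beats a') − 1/2 used below is one common choice and may differ in constants from the exact definition in (7); the counts are illustrative.

```python
# Sketch: per-party empirical preference matrices from pairwise-comparison counts
# at one state, averaged across parties into an aggregated matrix.
import numpy as np

def empirical_preference_matrix(wins):
    """wins[a, b] = number of comparisons at this state won by a against b."""
    A = wins.shape[0]
    M = np.zeros((A, A))
    for a in range(A):
        for b in range(A):
            total = wins[a, b] + wins[b, a]
            M[a, b] = wins[a, b] / total - 0.5 if total > 0 else 0.0
    return M                        # skew-symmetric: M[a, b] = -M[b, a]

def aggregated_matrix(wins_per_party):
    """Average the parties' matrices into the aggregated preference matrix."""
    return np.mean([empirical_preference_matrix(w) for w in wins_per_party], axis=0)

wins_party1 = np.array([[0, 7, 9], [3, 0, 6], [1, 4, 0]])
wins_party2 = np.array([[0, 2, 1], [8, 0, 5], [9, 5, 0]])
print(aggregated_matrix([wins_party1, wins_party2]).round(2))
```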
Subsequently, we consider the policy space which encompasses all deterministic policy permutations. We define the total preference matrix between policies as follows:
We further define the population-wise matrices and as
In our general model, we refrain from directly specifying reward functions to model human preferences. Consequently, obtaining an optimal policy based on an aggregated function becomes infeasible. Instead, we consider the von Neumann winner of , defined in an average-wise manner. This winner represents a probabilistic strategy that “beats” every other strategy in the sense that . In other words, it has an average beating probability greater than (Dudík et al.,, 2015). Additionally, if the reward-based preference model (1) holds, the von Neumann winner of corresponds to the solution of the Utilitarian welfare function when or . For further details, refer to Appendix B.
For a given failure probability , define the confidence bonus as
(8)
Here . We claim that, with probability at least , for ; see Lemma 6 for details. We then construct the following pessimistic estimators:
(9)
Here is a mixed strategy for each , and is a mixed strategy over the policy space .
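Step 3's max-min problem at each state can be solved as a standard matrix-game linear program. The sketch below uses scipy's linprog and subtracts a confidence bonus entrywise before solving, which is the pessimistic modification; the bonus formula is an illustrative count-based proxy for (8), and the matrix entries are made up.

```python
# Sketch of Step 3: solve the max-min matrix game at one state by an LP,
# after entrywise pessimism.
import numpy as np
from scipy.optimize import linprog

def confidence_bonus(counts, delta=0.1):
    """Illustrative bonus ~ sqrt(log(1/delta) / n) per entry; a stand-in for (8)."""
    return np.sqrt(np.log(1.0 / delta) / np.maximum(counts, 1))

def maxmin_strategy(M):
    """Return (p, value) with p maximizing min_q p^T M q, via a linear program."""
    A = M.shape[0]
    c = np.concatenate([np.zeros(A), [-1.0]])          # variables: p_1..p_A and value v
    A_ub = np.hstack([-M.T, np.ones((A, 1))])          # for each column j: v - (M^T p)_j <= 0
    b_ub = np.zeros(A)
    A_eq = np.concatenate([np.ones(A), [0.0]]).reshape(1, -1)   # sum_a p_a = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A], res.x[A]

M_hat = np.array([[0.0, 0.1, -0.2], [-0.1, 0.0, 0.3], [0.2, -0.3, 0.0]])
counts = np.array([[1, 40, 25], [40, 1, 10], [25, 10, 1]])
M_pess = M_hat - confidence_bonus(counts)              # entrywise pessimism
p, value = maxmin_strategy(M_pess)
print(p.round(3), round(value, 3))
```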
5.2 Main Results
Define , and , where is a concentratability coefficient assumed to be bounded (Cui and Du,, 2022). We have the following result.
Theorem 8.
With probability at least , and are -approximate von Neumann winners, i.e.,
where .
The proof is deferred to Appendix D.1. It is known that the value of the skew-symmetric matrix game is zero (Owen,, 2013), thus Theorem 8 shows that the von Neumann winner estimated from the observations is an approximation of the true von Neumann winner with only an gap.
Next, we give a weak Pareto efficiency result to ensure consensus among multiple individuals, adapted from the definition of ex-post efficiency (Fishburn,, 1984).
Definition 2 (-Approximate Ex-post Efficiency).
A probabilistic strategy is -approximate ex-post efficient if any Pareto-dominated alternative receives probability . Here is Pareto-dominated if there exists an alternative s.t. for .
Our definition differs from that of Zeng and Psomas, (2020), who considered a strategy that always outputs a approximate Pareto efficient alternative, whereas we allow only a negligible probability of outputting a Pareto-dominated alternative. We establish the following efficiency result.
Theorem 9.
If the preference of each individual is transitive, i.e. for , then with probability at least , is -approximate ex-post efficient for , where , and is a constant.
The proof is given in Appendix D.2. Here the transitivity assumption is necessary, demonstrating the distinction between the general and reward-based models. Further details are provided in Appendix D.2. This result aligns with Theorem 4, emphasizing the efficiency and welfare assurance of multi-party alignment, especially in scenarios where individual preferences deviate from a specific data generative model.
References
- Aumann and Dombb, (2010) Aumann, Y. and Dombb, Y. (2010). Pareto efficiency and approximate pareto efficiency in routing and load balancing games. In International Symposium on Algorithmic Game Theory, pages 66–77. Springer.
- Bakker et al., (2022) Bakker, M., Chadwick, M., Sheahan, H., Tessler, M., Campbell-Gillingham, L., Balaguer, J., McAleese, N., Glaese, A., Aslanides, J., Botvinick, M., et al. (2022). Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189.
- Barman et al., (2018) Barman, S., Krishnamurthy, S. K., and Vaish, R. (2018). Finding fair and efficient allocations. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 557–574.
- Barron, (2013) Barron, E. N. (2013). Game theory: an introduction. John Wiley & Sons.
- Baxter, (2000) Baxter, J. (2000). A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198.
- Bradley and Terry, (1952) Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
- Brandl et al., (2016) Brandl, F., Brandt, F., and Seedig, H. G. (2016). Consistent probabilistic social choice. Econometrica, 84(5):1839–1880.
- Chen et al., (2022) Chen, X., Zhong, H., Yang, Z., Wang, Z., and Wang, L. (2022). Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, volume 162, pages 3773–3793. PMLR.
- Christiano et al., (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Cui and Du, (2022) Cui, Q. and Du, S. S. (2022). When is offline two-player zero-sum markov game solvable? arXiv preprint arXiv:2201.03522.
- Du et al., (2020) Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. (2020). Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434.
- Dudík et al., (2015) Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. (2015). Contextual dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR.
- Finn et al., (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR.
- Fish et al., (2023) Fish, S., Gölz, P., Parkes, D. C., Procaccia, A. D., Rusak, G., Shapira, I., and Wüthrich, M. (2023). Generative social choice. arXiv preprint arXiv:2309.01291.
- Fishburn, (1984) Fishburn, P. C. (1984). Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692.
- Hsu et al., (2012) Hsu, D., Kakade, S., and Zhang, T. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17:1 – 6.
- Immorlica et al., (2017) Immorlica, N., Lucier, B., Weyl, G., and Mollner, J. (2017). Approximate efficiency in matching markets. In International Conference on Web and Internet Economics, pages 252–265. Springer.
- Ji et al., (2023) Ji, W., Deng, Z., Nakada, R., Zou, J., and Zhang, L. (2023). The power of contrast for feature learning: A theoretical analysis. Journal of Machine Learning Research, 24(330):1–78.
- Jin et al., (2020) Jin, T., Xu, P., Gu, Q., and Farnoud, F. (2020). Rank aggregation via heterogeneous thurstone preference models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4353–4360.
- Kelly et al., (1998) Kelly, F. P., Maulloo, A. K., and Tan, D. K. H. (1998). Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research society, 49:237–252.
- Li et al., (2022) Li, G., Ma, C., and Srebro, N. (2022). Pessimism for offline linear contextual bandits using confidence sets. Advances in Neural Information Processing Systems, 35:20974–20987.
- Li and Zhang, (2023) Li, S. and Zhang, L. (2023). Multi-dimensional domain generalization with low-rank structures. arXiv preprint arXiv:2309.09555.
- Maurer et al., (2016) Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32.
- Moulin, (2004) Moulin, H. (2004). Fair division and collective welfare. MIT press.
- Nash, (1953) Nash, J. (1953). Two-person cooperative games. Econometrica: Journal of the Econometric Society, pages 128–140.
- Nguyen and Rothe, (2020) Nguyen, T. T. and Rothe, J. (2020). Approximate pareto set for fair and efficient allocation: Few agent types or few resource types. In IJCAI, pages 290–296.
- Noothigattu et al., (2020) Noothigattu, R., Peters, D., and Procaccia, A. D. (2020). Axioms for learning from pairwise comparisons. Advances in Neural Information Processing Systems, 33:17745–17754.
- Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Owen, (2013) Owen, G. (2013). Game theory. Emerald Group Publishing.
- Patty and Penn, (2019) Patty, J. W. and Penn, E. M. (2019). Measuring fairness, inequality, and big data: Social choice since arrow. Annual Review of Political Science, 22(1):435–460.
- Rivest and Shen, (2010) Rivest, R. L. and Shen, E. (2010). An optimal single-winner preferential voting system based on game theory. In International Workshop on Computational Social Choice, pages 399–410. Citeseer.
- Santurkar et al., (2023) Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., and Hashimoto, T. (2023). Whose opinions do language models reflect? arXiv preprint arXiv:2303.17548.
- Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Tripuraneni et al., (2020) Tripuraneni, N., Jin, C., and Jordan, M. I. (2020). Provable meta-learning of linear representations. In International Conference on Machine Learning.
- Tversky, (1969) Tversky, A. (1969). Intransitivity of preferences. Psychological review, 76(1):31.
- Urvoy et al., (2013) Urvoy, T., Clerot, F., Féraud, R., and Naamane, S. (2013). Generic exploration and K-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 91–99. PMLR.
- Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
- Wang et al., (2023) Wang, Y., Liu, Q., and Jin, C. (2023). Is rlhf more difficult than standard rl? arXiv preprint arXiv:2306.14111.
- Xie et al., (2021) Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. (2021). Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in Neural Information Processing Systems, 34:27395–27407.
- Yu et al., (2014) Yu, Y., Wang, T., and Samworth, R. J. (2014). A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102:315–323.
- Zeng and Psomas, (2020) Zeng, D. and Psomas, A. (2020). Fairness-efficiency tradeoffs in dynamic fair division. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 911–912.
- Zhan et al., (2023) Zhan, W., Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2023). Provable offline preference-based reinforcement learning. arXiv preprint arXiv:2305.14816.
- Zhang et al., (2022) Zhang, G., Malekmohammadi, S., Chen, X., and Yu, Y. (2022). Proportional fairness in federated learning. arXiv preprint arXiv:2202.01666.
- Zhu et al., (2023) Zhu, B., Jiao, J., and Jordan, M. I. (2023). Principled reinforcement learning with human feedback from pairwise or K-wise comparisons. arXiv preprint arXiv:2301.11270.
- Ziegler et al., (2019) Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Appendix A Social Welfare Function Details
Here we provide more details of social welfare functions to complement the discussion in Section 3.2. In welfare economics, a social welfare function aggregates individual utilities towards collective welfare and thus guides social decisions to align with the group’s preferences. A reasonable social welfare function that satisfies six desirable axioms is proved to lie in a one-parameter family of isoelastic social welfare functions defined in (2).
Among these functions, the Nash welfare function stands out as the unique one that is additionally independent of individual scales of utilities (IIS); and the Utilitarian welfare function is the unique one that is additionally independent of individual zeros of utilities (IIZ). See their definitions below.
Definition 3 (Independence of Individual Scales of Utilities (IIS)).
A solution is independent of individual scales of utilities when the following holds: If we transform the -th utility function to with for , the solution remains the same.
Definition 4 (Independence of Individual Zeros of Utilities (IIZ)).
A solution is independent of individual zeros of utilities when the following holds: If we transform the -th utility function to with for , the solution remains the same.
Additionally, letting gives another important function, namely, the Leximin (egalitarian) welfare function, which can be reduced to consider for simplicity.
In the main article, we introduce the Nash welfare function as the primary illustrative example. The Nash welfare function is advantageous due to its IIS property, which eliminates the potential dominance issues arising from norm differences. Furthermore, the Nash bargaining version introduces an objective definition of individual “zero utility”. Consequently, it considers the normalization of individual zeros:
Thus, the Nash bargaining solution satisfies the IIZ property. This independence makes it resilient to various distortions, such as affine transformations of individual utilities. Given these desirable properties, the Nash bargaining solution has become a standard concept for negotiating agreements among individuals.
Appendix B Connections Between Two Models
The average-wise definition of von Neumann winner in Section 5.1 is based on the Utilitarian welfare function, and here we offer further justification for this choice. In the context of a voting election, where each voter has an absolute preference ranking, then the values of for . Consequently, reflects the difference between the count of individuals preferring over and the count of those preferring over . Therefore, a von Neumann winner represents the outcome preferred by a larger number of people. Here the normalization for is essential to prevent dominance.
We have claimed in Section 5.1 that, under the additional assumption of a reward-based preference model (1), the von Neumann winner in the general reward-free model consistently corresponds with the solution of the Utilitarian welfare function when or . We give a brief illustration as follows.
• Case : In this scenario, there exists a Condorcet winner (Urvoy et al.,, 2013) determined by the optimal policy. Consequently, it serves as a direct von Neumann winner.
• Case : It is easy to demonstrate that for any
Now, if we set , then we obtain the following equivalence:
which provides the desired result we need.
Appendix C Proofs in Section 4
C.1 Proof of Theorem 1
Denote the low-dimensional parameters and the original parameters as
Throughout this section, we define as the vectorization of .
Lemma 1 (Covariance concentration).
Suppose Assumption 3 holds. If the number of samples satisfies for , then with probability at least , we have: for any ,
(10)
Proof.
Firstly, we fix and recall . Thus , and . Since is bounded from Assumption 1, are independent sub-gaussian random vectors. We have, with probability at least ,
See Theorem 4.6.1 in Vershynin, (2018) or Lemma 7 in Tripuraneni et al., (2020) for reference. Thus, since , we have, with probability at least ,
Secondly, we take the union bound for all . Substitute with , we obtain that, with probability at least , for any
Thus holds for any if . ∎
Theorem 10 (Guarantee on source data).
Proof.
Step 1: Optimality of . Consider the following loss function, along with its first and second derivatives,
Thus we have,
Concerning the second derivative, since , we have where . Therefore,
Thus, let , from the convexity we have
Here the matrix-induced norm is defined as . In the following, we will use to denote the -norm of ’s vectorization when is a matrix. From the optimality of , we have , thus
(12)
Using the Cauchy-Schwarz inequality immediately implies that
(13)
Concerning the first derivative, notice that , we define a random vector with components as follows,
(14)
Here only depends on if conditioned on . Thus . So conditioned on , is a random vector with independent components that satisfy and thus sub-gaussian. Denote , we can establish the following bound (see, Appendix B.1, Zhu et al.,, 2023)
Therefore, Bernstein's inequality for sub-Gaussian random variables in the quadratic form for (see, Theorem 2.1, Hsu et al.,, 2012) directly implies that, with probability at least ,
(15)
Combine (15) with (13), with probability at least ,
(16)
This implies that
Thus, with probability at least ,
(17)
Step 2: Combining with the low-dimensional assumption. Notice that , we can rewrite it as where . Thus the vectorization can be rewritten as . For each , we further rewrite where . Then we have
(18)
The second term can be calculated as .
Step 3: Applying -net argument. Next, we are going to give a high-probability upper bound on using the randomness of . Since depends on which depends on , we use an -net argument to cover all possible . First, for any fixed , we let where . The defined in this way are independent of :
From the orthonormality of , we can establish the following bound
Again, using Bernstein's inequality for sub-Gaussian random variables in the quadratic form as we did in (15), we have, with probability at least ,
Combining this with (C.1), we obtain that
Lemma A.5 in Du et al., (2020) implies that: there exists an -net of in Frobenius norm such that . Applying a union bound over , we obtain that, with probability at least ,
(19)
Choosing , we obtain that (19) holds with probability at least .
Then we apply -net. Let such that , then we have
(20)
Theorem 11 (Guarantee on target data).
Proof.
Step 1: Finding a rotation of near . Theorem 10 implies that
Thus, Lemma 1 further shows that
Therefore, combining this with Assumption 3 implies that
Then, we bound the singular values of . Note that is a matrix, by definition , and the worst-case condition number as . Note that when the is well-conditioned in the sense that and , assumption implies that (Tripuraneni et al.,, 2020). Thus, we have
Since and are orthonormal, applying the general Davis-Kahan theorem (see, Theorem 4, Yu et al.,, 2014) for matrices and implies that there exists an orthonormal matrix such that
(21)
Step 2: Bounding . For simplicity, suppose . We can rotate to achieve this. Then is obtained by minimizing the following loss function:
Its first and second derivatives are given by,
Similar to (13) in the proof of Theorem 10, we have
(22)
We have , where , is defined in (C.1), and is defined as follows
We aim to bound by considering the gradient of :
Notice that and , therefore is bounded. Thus
(23)
Therefore,
Concerning the first term, similar to (15) in the proof of Theorem 10, with probability at least , it is bounded by
Step 3: Finishing the proof. Therefore, we have the following inequalities,
C.2 Proof of Theorem 2
Proof.
Decompose the sub-optimality into three terms:
The second term from the definition of . From Theorem 1, with probability at least , , then the third term
Therefore with probability at least ,
where holds due to the equation , holds from recursion.
Separately, we have
The last inequality holds because and with probability at least for .
where holds due to the equation . The last inequality holds because and with probability at least for . Therefore
Now we conclude the proof. ∎
C.3 Proof of Theorem 3
To begin, we show a lower bound result for single-party cases. The proof was given in Zhu et al., (2023), but for completeness, we present their example instance that achieves the lower bound.
Lemma 2 (Theorem 3.10, Zhu et al., (2023)).
Consider the family of instances
Suppose that , then there exists a feature such that the following lower bound holds.
Here the single-party sub-optimality is defined as , where and .
Proof.
Here we consider for some , and we can construct some directly from . Without loss of generality, assume that is some integer. We set , . Define the feature functions and the number of observations (for each state-action pair) as follows:
Let where . We construct instances, indexed by , where each . One can verify that and . By applying Assouad’s lemma, Zhu et al., (2023) showed that there exists a which achieves the lower bound. ∎
Now we present the detailed proof of Theorem 3.
Proof.
Suppose we have contextual bandit instances, each constructed identically from an individual with the same underlying reward functions, i.e., the same . We set the instances as , respectively, for , which are designed to meet the lower bound for single-party cases in Lemma 2.
One can verify that the instance presented in Lemma 2 satisfies that and . Thus, we construct an instance where such that
where
for . Therefore, utilizing Taylor expansion implies that
The last inequality holds because is a constant. Here we disregard the scale of the reward function and thus conclude the proof. ∎
C.4 Proof of Theorem 4
We first show an original result for the Pareto efficiency of .
Lemma 3.
is Pareto efficient concerning the true Nash value function .
Proof.
Since
We can regard as being separate concerning each state. Therefore
Considering for each , we conclude that is a Pareto efficient solution. ∎
We then introduce a result akin to Theorem 2, without taking expectation over states. Define
Lemma 4.
Proof.
The result is obvious from Theorem 2 if we directly set either . ∎
Now we present the detailed proof of Theorem 4.
Proof.
Define , where is a fixed constant. Denote . From Lemma 4, we have, with probability at least ,
(25)
Consider another policy which satisfies that for , then
Thus, the -th individual outcome can increase by a ratio of at most , where is a constant that depends on , the normalized rewards under . Thus we conclude our proof. ∎
We further show that satisfies the approximate Pigou-Dalton principle.
Definition 5 (-approximate Pigou-Dalton Principle).
A solution satisfies the -approximate Pigou-Dalton principle at one state if the collective welfare decreases by at most under any move that (1) reduces the gap (or difference) between two (normalized) individual rewards , and (2) retains their total reward .
C.5 Proofs in Section 4.4
We begin with the Utilitarian welfare function. The proof of Theorem 5 follows directly by adding up the errors of each individual, and we mainly demonstrate the most important part of the proof.
Proof.
Then, we move to the Leximin welfare function. For Theorem 6, we introduce a lemma first.
Lemma 5.
Suppose and are two groups of real values. They are sorted as and , respectively. Thus we can establish the following inequality
Proof.
Let , where . Without loss of generality, we assume that . Therefore,
We complete the case for . In general, for any , we assume that , without loss of generality. If , we have
Otherwise , thus there exists such that (because there are only values ). Thus,
We conclude the proof. ∎
C.6 Proof of Theorem 7
We begin with an estimation error bound similar to Theorem 1.
Theorem 12.
Proof.
This proof is the same as that of Theorem 1. In the MDP setting, we just consider with instead. ∎
Then, we give the proof of Theorem 7.
Proof.
Decompose the sub-optimality into three terms:
The second term from the definition of . From Theorem 12, with probability at least , , then the third term
Therefore with probability at least ,
where holds due to the equation , holds from recursion.
Separately, we have
The last inequality holds because and with probability at least for .
where holds due to the equation . The last inequality holds because and with probability at least for . Therefore
Now we conclude the proof. ∎
Appendix D Proofs in Section 5
D.1 Proof of Theorem 8
First, we introduce two concentration inequalities and prove a lemma.
Proposition 1 (McDiarmid's inequality).
Let be independent random variables in , let be a function of s.t. , . Then for all , with probability at least , we have
Proposition 2 (Binomial concentration inequality, Xie et al., (2021)).
Suppose , which is a Bernoulli distribution with parameter and . Then with probability at least , we have
Lemma 6.
With probability at least ,
(1) , for ;
(2) , for .
Proof.
(1) For fixed , define . The case where is trivial. For , let for represent all the binary responses on pair in . Let . Since , we use Proposition 1 and substitute with ; then with probability at least , we have
So for fixed , with probability at least ,
Then with probability at least ,
Substituting with , we complete the first part.
(2) With probability at least , for ,
Thus we conclude the proof. ∎
Now we present the detailed proof of Theorem 8.
Proof.
First, from Lemma 6, we have with probability at least ,
From Proposition 2, we have with probability at least ,
(26)
Thus the above inequalities hold simultaneously with probability at least . Next, recall that and , we have:
Let and denote , we have
We conclude that which satisfies , equivalently, we have
(27)
Our goal is to prove that the probability over policy space satisfies:
Since , it’s easy to show that . Besides, notice that
(28)
We then separate the terms for different states. For simplicity, denote and for . Thus, for each state , we have
D.2 Proof of Theorem 9
Proof.
Suppose for and the inequality holds for some ; by transitivity we have for . Therefore and . Let be equal to except on , where and . Then
(29)
This proof follows a similar result as presented in Fishburn, (1984).
We then proceed to demonstrate the necessity of transitivity. A simple example comes from a three-alternative case: Suppose that for , i.e., everyone believes with probability . Therefore,
As everyone has absolute preferences, there is no need to introduce a confidence bound. Consequently, it is straightforward to demonstrate that the von Neumann winner directly yields . Given that item is Pareto dominated by item , the result stated in the previous theorem no longer holds. Therefore, transitivity is indeed necessary.
Appendix E Additional Results
In this section, we provide additional impossibility results for our two preference models, which demonstrate the challenge of multi-party alignment.
Reward-Based Model
In multi-party settings with heterogeneous preferences, there is a risk that specific individuals may obtain poor outcomes:
Proposition 3.
Consider the family of instances , where for . Suppose that , , then there exists a feature mapping such that the following lower bound holds.
Here the single-party sub-optimality is defined as , where and for .
Proof.
Here we consider for some , and we can construct some directly from . Without loss of generality, we assume that . Let , . Define the feature functions and the number of observations (for each state-action pair) as follows:
Here, we have ensured that . Let , ensuring that . With these settings, the optimal policies become:
It’s easy to verify that . Therefore, for ,
(30)
Now we conclude the proof. ∎
Actually, (30) is independent of the observation data, i.e., the number of observations for each state-action pair. This proposition is consistent with the empirical results in Santurkar et al., (2023): applying a traditional single-party model fails since it cannot represent the multiple views of all individuals. Thus it is reasonable to consider collective welfare functions as an alternative metric for group alignment instead of deploying the single-party approach directly.
Reward-Free Model
Additionally, for the reward-free model, we establish the following lower bound.
Proposition 4.
There exists a preference profile for such that,
Proof.
Suppose . For the -th individual, define , where denotes all the actions excluding . Thus, all the elements of the -th column of except the diagonal entry equal to .
For any strategy that satisfies , we consider comparing two strategies: (1) , and (2) the deterministic action , i.e., , for the -th individual. We have,
Then we have
Therefore, we conclude the proof. ∎