
Provable Multi-Party Reinforcement Learning
with Diverse Human Feedback

Huiying Zhong (Peking University)    Zhun Deng (Columbia University)    Weijie J. Su (University of Pennsylvania)    Zhiwei Steven Wu (Carnegie Mellon University)    Linjun Zhang (Rutgers University)
Abstract

Reinforcement learning with human feedback (RLHF) is an emerging paradigm to align models with human preferences. Typically, RLHF aggregates preferences from multiple individuals who have diverse viewpoints that may conflict with each other. Our work initiates the theoretical study of multi-party RLHF that explicitly models the diverse preferences of multiple individuals. We show how traditional RLHF approaches can fail since learning a single reward function cannot capture and balance the preferences of multiple individuals. To overcome such limitations, we incorporate meta-learning to learn multiple preferences and adopt different social welfare functions to aggregate the preferences across multiple parties. We focus on the offline learning setting and establish sample complexity bounds, along with efficiency and fairness guarantees, for optimizing diverse social welfare functions such as Nash, Utilitarian, and Leximin welfare functions. Our results show a separation between the sample complexities of multi-party RLHF and traditional single-party RLHF. Furthermore, we consider a reward-free setting, where each individual’s preference is no longer consistent with a reward model, and give pessimistic variants of the von Neumann Winner based on offline preference data. Taken together, our work showcases the advantage of multi-party RLHF but also highlights its more demanding statistical complexity.

1 Introduction

Recent advancements in AI alignment focus on training AI systems to align with user preferences. A prevalent strategy involves integrating human feedback into the learning process, a technique that has significantly influenced the development of language models and reinforcement learning, among other areas (Ouyang et al.,, 2022; Ziegler et al.,, 2019). Established RLHF methods typically fit a reward model over users' preference data (in the form of pairwise comparisons) and then train policies to optimize the learned reward. The implicit assumption behind this approach is that different users' preferences can be modeled via a single reward function.

However, this type of single-party reward-based assumption is at odds with the fact that users may possess heterogeneous viewpoints that conflict with each other. For example, the nature of human opinion in open-ended social decision-making scenarios is inherently diverse and dynamic, making traditional reinforcement learning from human feedback (RLHF) models (Ouyang et al.,, 2022; Christiano et al.,, 2017) insufficient. To see how traditional single-party RLHF can fail to balance multiple individuals’ preferences, consider the following toy example.

Example (2 parties and 3 alternatives).

Consider a scenario where users have preferences over three options A, B, C. Half of the population prefers A\succ C\succ B (with underlying rewards R_{1}(A)=1, R_{1}(C)=1-\epsilon, R_{1}(B)=0 for some \epsilon\in(0,1)), while the other half prefers B\succ C\succ A (with underlying rewards R_{2}(B)=1, R_{2}(C)=1-\epsilon, R_{2}(A)=0). Pooling both sub-populations' preferences together, we cannot differentiate between A, B, C via pairwise comparison. Thus, if we learn a single reward function R over the pooled pairwise comparison data, the resulting R will have R(A)=R(B)=R(C). This leads to a policy that selects the three options uniformly at random, which has an average reward less than 2/3. However, if we learn each sub-population's reward separately, we can potentially arrive at a better policy that only recommends option C and achieves an average reward of 1-\epsilon.
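To make the example concrete, the following minimal numeric check (assuming the illustrative choice \epsilon=0.1) compares the average reward of the uniform policy induced by a single pooled reward model with that of the policy obtained from learning the two sub-populations' rewards separately; all numbers are purely illustrative.

```python
import numpy as np

# Rewards of the two sub-populations over options A, B, C (eps in (0, 1)).
eps = 0.1  # illustrative choice
R1 = {"A": 1.0, "B": 0.0, "C": 1.0 - eps}   # sub-population 1: A > C > B
R2 = {"A": 0.0, "B": 1.0, "C": 1.0 - eps}   # sub-population 2: B > C > A
options = ["A", "B", "C"]

# A single pooled reward model cannot separate the options,
# so the induced policy picks uniformly at random.
uniform_avg = np.mean([(R1[o] + R2[o]) / 2 for o in options])

# Learning the two reward functions separately reveals the compromise option.
best = max(options, key=lambda o: (R1[o] + R2[o]) / 2)
compromise_avg = (R1[best] + R2[best]) / 2

print(f"uniform policy average reward  : {uniform_avg:.3f} (< 2/3)")
print(f"always-{best} policy average reward : {compromise_avg:.3f} (= 1 - eps)")
```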

This tension between the traditional single-party (or single-reward) RLHF approach and diverse user preferences motivates the following question:

How can models be trained to align with the preferences of heterogeneous individuals?

Our work takes a step towards addressing this question by initiating a theoretical exploration of multi-party RLHF, which aims to explicitly model and balance diverse, heterogeneous preferences from different individuals. We take inspiration from social choice theory (Noothigattu et al.,, 2020; Jin et al.,, 2020), which offers an extensive array of tools for aggregating human preferences. At a high level, we extend the framework of RLHF to directly accommodate multiple parties and heterogeneous preferences by utilizing social welfare functions. We focus on the popular offline setting, where the goal is to learn a policy from a pre-collected preference dataset. We start from the contextual bandit (CB) setting with pairwise comparisons and later generalize to Markov decision processes (MDPs). The insights of our approach can be generalized to a wider range of scenarios. Notably, recent works by Zhu et al., (2023) and Zhan et al., (2023) provided an initial theoretical analysis of offline RLHF. Our paper takes a step further by considering cases with multiple parties and heterogeneous preferences.

Our main contributions are summarized as follows:

  • We propose a general framework for alignment with multiple heterogeneous parties. We utilize a meta-learning technique to learn individual rewards, and aggregate them using Nash’s social welfare function based on the confidence bounds on reward estimation. We further extend our analyses for the Utilitarian and Leximin welfare functions.

  • We provide sample complexity bounds for obtaining an approximately optimal policy through our proposed framework. We further introduce efficiency and fairness definitions and demonstrate that the learned policies satisfy approximate Pareto efficiency and the Pigou-Dalton principle, thus ensuring collective welfare and avoiding inequality or dominance.

  • We extend our analysis to a reward-free setting, where individual preferences are no longer consistent with a certain data generative model. We provide pessimistic variants of the von Neumann winner (Fishburn,, 1984), determined by the confidence bounds on preference distributions. Theoretical guarantees of sample complexity and ex-post efficiency are also provided in this generalized setting.

The novelty of our work lies in two aspects. First, we employ a meta-learning technique for learning multiple reward functions from limited observations by utilizing a common feature representation among various parties. This strategy enhances learning efficiency in environments where data are scarce, drawing on foundational work in few-shot learning (Du et al.,, 2020; Tripuraneni et al.,, 2020). Second, we integrate a pessimistic approach within social welfare functions, where the inclusion of zero rewards guarantees the invariance of the solution in the presence of multiple parties. This aspect of our work extends the principles established by Zhu et al., (2023): we establish a sub-optimality bound for the (Nash's) social welfare function. This bound is derived by aggregating heterogeneous individual rewards, necessitating a more stringent data coverage condition than what is typically required in RLHF scenarios involving a single party.

2 Related Work

Meta-learning

Meta-learning, or learning to learn, seeks to design a learner to quickly learn new tasks based on a few prior tasks. Some of the dominant approaches learn a common representation among multiple tasks, referred to as representation learning (Baxter,, 2000; Maurer et al.,, 2016; Finn et al.,, 2017; Ji et al.,, 2023). Theoretical guarantees for multi-task linear regression with low-dimensional linear representation have been established in the literature (Tripuraneni et al.,, 2020; Du et al.,, 2020; Li and Zhang,, 2023). Our work contributes to generalizing this line of work, by investigating the learning-to-learn abilities in human preference-based models with a more complicated structure.

Reinforcement Learning with Human Feedback

Human preference is widely used in RL since preference comparison is easier to elicit than numerical reward (Ziegler et al.,, 2019; Ouyang et al.,, 2022; Chen et al.,, 2022). Particularly, the most related works to ours are Zhu et al., (2023) and Zhan et al., (2023), both of which study offline RLHF with a reward-based preference model. Zhu et al., (2023) focused on a Bradley-Terry-Luce (BTL) model under a linear reward function, while Zhan et al., (2023) further considered general function classes. Lastly, we highlight the latest work by Wang et al., (2023), which established two preference models akin to ours, but in the context of traditional single-party RLHF in online learning. In contrast, we consider a broader setting with heterogeneous individuals. Particularly, offline RL is more challenging than online RL due to limited data availability. Thus, the pessimistic technique has been widely studied in recent years, as witnessed in CBs (Li et al.,, 2022), MDPs (Xie et al.,, 2021), and Markov games (Cui and Du,, 2022). In this paper, we also utilize pessimism to address the coverage challenges.

Social Choice Theory

Social choice theory is the field that studies the aggregation of individual preferences towards collective decisions (Moulin,, 2004). Different solution concepts in game theory and voting theory have been applied to achieve social decisions, including Nash bargaining (Nash,, 1953), von Neumann winner (Fishburn,, 1984; Rivest and Shen,, 2010; Brandl et al.,, 2016; Dudík et al.,, 2015), etc. Notably, the concurrent work by Fish et al., (2023) presented pioneering attempts to combine AI systems with social choice theory given access to oracle LLM queries. However, their work focused on binary approval, while our methods consider the degree of individual preferences. Our work also extends several standard efficiency notions in social choice theory, including Pareto efficiency (Barron,, 2013; Nguyen and Rothe,, 2020; Barman et al.,, 2018; Aumann and Dombb,, 2010) and ex-post efficiency (Fishburn,, 1984; Immorlica et al.,, 2017; Zeng and Psomas,, 2020). Particularly, Barman et al., (2018) considered a Nash welfare function in the context of fair allocation and established approximate efficiency results based on agents’ valuation errors. Both their efficiency results and ours are derived from the learning errors.

3 Preliminaries

We begin with the notation. Let [n]=\{1,2,\cdots,n\}. We use \|\cdot\|_{2} or \|\cdot\| to denote the \ell_{2} norm of a vector or the spectral norm of a matrix. We use \|\cdot\|_{F} to denote the Frobenius norm of a matrix. Let \langle\cdot,\cdot\rangle be the Euclidean inner product between vectors or matrices. For a matrix A\in\mathbb{R}^{m\times n}, let \sigma_{i}(A) be its i-th largest singular value. Our use of O(\cdot),\Omega(\cdot),\Theta(\cdot) follows the standard notation. We use \|x\|_{\Sigma}=\sqrt{x^{\top}\Sigma x} to denote the matrix-induced norm for a positive-semidefinite matrix \Sigma. We write \Sigma\succeq\Sigma^{\prime} if \Sigma-\Sigma^{\prime} is positive semidefinite. We use \mathcal{O}_{d_{1}\times d_{2}}, where d_{1}>d_{2}, to denote the set of d_{1}\times d_{2} orthonormal matrices (i.e., the columns are orthonormal).

3.1 Offline Data Collecting and Comparison Model

We start with the contextual bandit (CB) setting. Consider the environment E=(\mathcal{S},\mathcal{A},\{R_{m}\}_{m=1}^{M},\rho), where \mathcal{S} is a state space, \mathcal{A} is an action space, R_{m}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R} for m\in[M] are reward functions from M individuals (or parties), and \rho is an initial state distribution. A deterministic policy \pi:\mathcal{S}\rightarrow\mathcal{A} is a function that maps a state to an action, while a randomized policy \pi:\mathcal{S}\rightarrow\Delta(\mathcal{A}) is a function that maps a state to a distribution over the action space.

As in Zhu et al., (2023), we consider an offline setting with a pre-collected dataset \mathcal{D}. We are provided with pairwise comparisons \mathcal{D}=\bigcup_{m=1}^{M}\mathcal{D}_{m} from M individuals (or parties), where \mathcal{D}_{m}=\{(s_{m}^{i},a_{m,1}^{i},a_{m,0}^{i},y_{m}^{i})\}_{i=1}^{n} for m\in[M]. For the i-th sample in \mathcal{D}_{m}, a state s_{m}^{i} is first sampled from the initial distribution \rho. Given the state s_{m}^{i}, an action pair (a_{m,1}^{i},a_{m,0}^{i}) is sampled from a distribution g_{a}(a_{1},a_{0}|s_{m}^{i}). Thus (s_{m}^{i},a_{m,1}^{i},a_{m,0}^{i}) is generated by a generation distribution g=g_{a}(\cdot|s)\rho(s). We then observe a binary outcome y_{m}^{i} from a Bernoulli distribution \mathbb{P}_{m}(y_{m}^{i}=1|a_{m,1}^{i},a_{m,0}^{i},s_{m}^{i}). Here y_{m}^{i}\in\{0,1\} indicates the preferred one in (a_{m,1}^{i},a_{m,0}^{i}). Primarily, we assume that \mathbb{P}_{m} is reward-based for m\in[M].

Reward-based Model

Suppose that individual preferences rely on a reward function:

\mathbb{P}_{m}(y=1|a_{1},a_{0},s)=\Phi(R_{m}(s,a_{1})-R_{m}(s,a_{0})), (1)

where \Phi is the sigmoid function \Phi(x)=\frac{1}{1+\exp(-x)}, which leads to the well-known Bradley-Terry-Luce (BTL) model in pairwise comparison (Bradley and Terry,, 1952). We now provide several assumptions for our analysis.

Assumption 1 (Linear reward).

The rewards R_{m} lie in the family of linear models R_{\theta}(s,a)=\theta^{\top}\phi(s,a), where \phi(s,a):\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d} is a known feature mapping with \|\phi(s,a)\|_{2}\leq L. Let \theta_{m}^{*}\in\mathbb{R}^{d} be the true parameters, and suppose \theta_{m}^{*}\in\Theta_{B}:=\{\theta:\|\theta\|_{2}\leq B\} for all m\in[M].

Assumption 2 (Shared representation).

Suppose \theta_{m}^{*}=U^{*}\alpha_{m}^{*}, where U^{*}\in\mathcal{O}_{d\times r} (d\gg r), \alpha_{m}^{*}\in\mathbb{R}^{r}, and the M underlying parameters satisfy \|\alpha_{m}^{*}\|_{2}=\Theta(1) for all m\in[M].

Assumption 3 (Feature design).

Define \Sigma_{*}=\mathbb{E}_{g}(\phi(s,a_{1})-\phi(s,a_{0}))(\phi(s,a_{1})-\phi(s,a_{0}))^{\top}, where the randomness is induced by the generation distribution g. Suppose \Sigma_{*} has bounded positive eigenvalues C_{\text{max}}\geq\lambda_{1}\geq\cdots\geq\lambda_{d}\geq C_{\text{min}}>0.

Assumption 4 (Diversity design).

Let \Theta^{*}=(\theta_{1}^{*},\cdots,\theta_{M}^{*})\in\mathbb{R}^{d\times M}, \nu=\sigma_{r}(\frac{\Theta^{*\top}\Theta^{*}}{M}), and \kappa=\sigma_{1}(\frac{\Theta^{*\top}\Theta^{*}}{M})/\nu. Suppose that \nu>0 and \kappa\leq O(1).

Assumption 1 can be applied to cases where \phi is derived by removing the last layer of a pre-trained model. Assumption 2 further says that individual rewards share a common low-dimensional representation U^{*}. Assumptions 3 and 4 are standard well-conditioned designs for the covariance matrix \Sigma_{*} and the underlying reward parameters \{\theta_{m}^{*}\}_{m=1}^{M} (or \{\alpha_{m}^{*}\}_{m=1}^{M}) (Tripuraneni et al.,, 2020; Du et al.,, 2020).
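As an illustration of the comparison model (1) under Assumption 1, the following minimal sketch draws a single preference label under a linear BTL reward; the one-hot feature map and the parameter values are hypothetical placeholders, not part of our construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_comparison(phi, theta_m, s, a1, a0):
    """Draw y ~ Bernoulli(Phi(R_m(s, a1) - R_m(s, a0))) under the BTL model (1)
    with a linear reward R_m(s, a) = theta_m^T phi(s, a) (Assumption 1)."""
    diff = theta_m @ (phi(s, a1) - phi(s, a0))
    return rng.binomial(1, sigmoid(diff))

# A toy one-hot feature map over state-action pairs -- purely illustrative.
S, A = 3, 2
d = S * A
def phi(s, a):
    v = np.zeros(d)
    v[s * A + a] = 1.0
    return v

theta_m = rng.normal(size=d)   # hypothetical true parameter for party m
print(sample_comparison(phi, theta_m, s=1, a1=0, a0=1))
```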

3.2 Social Welfare Function

As different parties have their own reward functions, in this section we introduce social welfare functions to aggregate them so that the trained model aligns with heterogeneous preferences and reflects social welfare. These functions map individual utilities u_{1},\cdots,u_{M} to collective welfare, where u_{m}:\mathcal{X}\rightarrow\mathbb{R} is the utility of the m-th individual for m\in[M] and \mathcal{X} is the set of possible outcomes. A reasonable welfare function should satisfy six axioms: monotonicity, symmetry, continuity, independence of unconcerned agents, independence of common scale, and the Pigou-Dalton transfer principle (see Chapter 3.2, Moulin,, 2004). It has been proved that all functions satisfying these six desirable properties lie in a one-parameter family of isoelastic social welfare functions W_{\alpha}, defined as follows:

W_{\alpha}(u_{1},\cdots,u_{M})=\left\{\begin{aligned}&\sum_{m=1}^{M}u_{m}^{\alpha}&&0<\alpha\leq 1;\\&\prod_{m=1}^{M}u_{m}&&\alpha=0;\\&\sum_{m=1}^{M}-u_{m}^{\alpha}&&\alpha<0.\end{aligned}\right. (2)

Among these functions, we elaborate on three prominent examples: the Utilitarian welfare function (\alpha=1), which measures the average utility across parties; the Leximin welfare function (\alpha\rightarrow-\infty), which safeguards the utility of the worst-off individual; and the Nash welfare function (\alpha=0), which maximizes the product of utilities to attain a proportional-fair (Kelly et al.,, 1998; Zhang et al.,, 2022) solution. These concepts of welfarism are widely used in social choice theory; see more details in Appendix A. We primarily consider the Nash welfare function (\alpha=0). In the context of welfarism with bargaining considerations, an objective definition of the zero of individual utility is introduced. This zero is regarded as the worst outcome, or minimal utility, acceptable to a specific individual, so the arbitrator must treat it as a strict lower bound. Thus, Nash bargaining (Nash,, 1953) considers a maximization problem over a feasible set \{(u_{1}(x),\cdots,u_{M}(x))|x\in\mathcal{X}\} under an additional normalization by the individual zeros (u_{1}^{0},\cdots,u_{M}^{0}):

\arg\max_{u}\prod_{m=1}^{M}(u_{m}-u_{m}^{0}),\quad\text{ s.t. }u_{m}>u_{m}^{0}\text{ for }m\in[M].

By optimizing the welfare function with such normalizations, the Nash bargaining solution is invariant to affine transformations of the u_{m}'s and eliminates the dominance of specific individuals, making it a standard solution concept in cooperative games. We begin with the Nash welfare function to aggregate heterogeneous reward functions in the next section and optimize the aggregated function. Additional results for other social welfare functions (e.g., Utilitarian (\alpha=1) and Leximin (\alpha\rightarrow-\infty)) are shown in Section 4.4.
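For concreteness, the following sketch evaluates the three aggregation rules on hypothetical utility vectors; the Nash and Leximin variants are computed on the gains over the individual zeros, while the Utilitarian value ignores the zeros.

```python
import numpy as np

def utilitarian(u):
    """W_1: sum of utilities (alpha = 1); independent of the individual zeros."""
    return float(np.sum(u))

def nash(u, u0):
    """Nash welfare (alpha = 0): product of utility gains over the individual zeros."""
    gains = np.asarray(u) - np.asarray(u0)
    return float(np.prod(gains)) if np.all(gains > 0) else -np.inf  # infeasible below the zeros

def leximin(u, u0):
    """Egalitarian limit (alpha -> -inf): utility gain of the worst-off party."""
    return float(np.min(np.asarray(u) - np.asarray(u0)))

u  = np.array([0.9, 0.6, 0.7])   # hypothetical individual utilities of an outcome
u0 = np.array([0.2, 0.1, 0.3])   # hypothetical individual zeros (minimal utilities)
print(utilitarian(u), nash(u, u0), leximin(u, u0))
```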

4 Reward-Based Model

In this section, we propose a general framework for alignment with multiple parties using social welfare functions in a reward-based model. We begin with the framework of Nash bargaining to learn a deterministic policy and further extend our analysis for the Utilitarian and Leximin welfare functions.

4.1 Pessimistic Nash Bargaining Algorithm

The construction of the Nash Bargaining estimator consists of four steps. Step 1 is to estimate the parameters θm\theta_{m}^{*} from the observations. Step 2 is to construct the confidence sets for these parameters simultaneously. Step 3 involves introducing a pessimistic expected value function using Nash bargaining. Finally, Step 4 is to obtain the pessimistic policy through a maximization problem. We summarize this procedure in Algorithm 1, and discuss each step in detail below.

Algorithm 1 Pessimistic Nash Bargaining

Input: The datasets \mathcal{D}=\bigcup_{m=1}^{M}\mathcal{D}_{m}, a failure probability \delta, and the initial distribution \rho.

Output: The pessimistic policy \hat{\pi}.

1:  Obtain the MLE \hat{\theta}_{m}=\hat{U}\hat{\alpha}_{m} in (3) and calculate the data covariance matrices \Sigma_{m} in (4).
2:  Construct the confidence sets \Theta_{B}(\hat{\theta}_{m})=\{\theta\in\Theta_{B}:\|\theta-\hat{\theta}_{m}\|_{\Sigma_{m}}\leq\Gamma(\delta,n,M,d,r)\} according to the bound \Gamma on the estimation error in Theorem 1.
3:  Establish the pessimistic Nash expected value function \hat{J}(\pi)=\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,\theta_{m}}(s,\pi(s))-R_{m,\theta_{m}}^{0}(s)).
4:  Calculate \hat{\pi}=\arg\max_{\pi}\hat{J}(\pi).

First, we learn the individual rewards R_{m,\theta_{m}}(s,a)=\theta_{m}^{\top}\phi(s,a) for m\in[M] through the maximum likelihood estimator (MLE). Denote \alpha=(\alpha_{1},\cdots,\alpha_{M})\in\mathbb{R}^{r\times M} as the low-dimensional parameters, and \Theta=(\theta_{1},\cdots,\theta_{M})\in\mathbb{R}^{d\times M} as the original parameters. Define the feature difference as X_{m,i}=\phi(s_{m}^{i},a_{m,1}^{i})-\phi(s_{m}^{i},a_{m,0}^{i}) and X_{m}=(X_{m,1},\cdots,X_{m,n})^{\top} for m\in[M]. Using the dataset \mathcal{D}=\bigcup_{m=1}^{M}\mathcal{D}_{m} with \mathcal{D}_{m}=\{(s_{m}^{i},a_{m,1}^{i},a_{m,0}^{i},y_{m}^{i})\}_{i=1}^{n}, we minimize the negative log-likelihood:

\hat{U},\hat{\alpha}\in\arg\min_{U\in\mathcal{O}_{d\times r},\|\alpha_{m}\|_{2}\leq B}l_{\mathcal{D}}(U,\alpha), (3)

where l_{\mathcal{D}}(U,\alpha)=-\dfrac{1}{Mn}\sum_{m=1}^{M}\sum_{i=1}^{n}\{\mathbf{1}(y_{m}^{i}=1)\log\left(\Phi(\langle U\alpha_{m},X_{m,i}\rangle)\right)+\mathbf{1}(y_{m}^{i}=0)\log\left(\Phi(-\langle U\alpha_{m},X_{m,i}\rangle)\right)\}. When the minimizer is not unique, we take any of the solutions (\hat{U},\hat{\alpha}_{m}) that achieves the minimum. As a result, we obtain the estimators \hat{\theta}_{m}=\hat{U}\hat{\alpha}_{m} for m\in[M]. Then define the d\times d data covariance matrices for m\in[M] as

\Sigma_{m}=\frac{1}{n}X_{m}^{\top}X_{m}=\frac{1}{n}\sum_{i=1}^{n}(\phi(s_{m}^{i},a_{m,1}^{i})-\phi(s_{m}^{i},a_{m,0}^{i}))(\phi(s_{m}^{i},a_{m,1}^{i})-\phi(s_{m}^{i},a_{m,0}^{i}))^{\top}. (4)
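A rough projected-gradient sketch of the constrained MLE (3) is given below. It is not the estimator analyzed in the paper (any minimizer of (3) suffices); the step size, iteration count, and initialization are arbitrary illustrative choices.

```python
import numpy as np

def fit_shared_representation(X, Y, r, B=5.0, lr=0.5, iters=2000, seed=0):
    """Projected-gradient sketch of the MLE (3).
    X: (M, n, d) feature differences X_{m,i}; Y: (M, n) binary labels y_m^i.
    Returns U_hat (d, r) with orthonormal columns and alpha_hat (M, r)."""
    rng = np.random.default_rng(seed)
    M, n, d = X.shape
    U, _ = np.linalg.qr(rng.normal(size=(d, r)))
    alpha = rng.normal(scale=0.1, size=(M, r))
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(iters):
        theta = alpha @ U.T                       # theta_m = U alpha_m, shape (M, d)
        Z = np.einsum("mnd,md->mn", X, theta)     # inner products <U alpha_m, X_{m,i}>
        G = -(Y - sigmoid(Z)) / (M * n)           # d l_D / d Z
        G_theta = np.einsum("mn,mnd->md", G, X)   # gradient w.r.t. each theta_m
        U_new = U - lr * (G_theta.T @ alpha)      # gradient w.r.t. U is sum_m g_m alpha_m^T
        alpha = alpha - lr * (G_theta @ U)        # gradient w.r.t. alpha_m is U^T g_m
        U, R = np.linalg.qr(U_new)                # re-orthonormalize the columns of U
        alpha = alpha @ R.T                       # keeps theta_m = U alpha_m unchanged
        norms = np.linalg.norm(alpha, axis=1, keepdims=True)
        alpha = alpha * np.minimum(1.0, B / np.maximum(norms, 1e-12))  # ||alpha_m|| <= B
    return U, alpha

# The data covariance matrices (4) can then be formed as
# Sigma = np.einsum("mnd,mne->mde", X, X) / X.shape[1]
```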

Subsequently, we regard the multi-party alignment problem as a Nash bargaining game involving M reward functions \{R_{m,\theta_{m}}\}_{m=1}^{M}. We set the zero utility as R_{m,\theta_{m}}^{0}(s)=\min_{a}R_{m,\theta_{m}}(s,a) and consider the following Nash welfare function to guarantee individual satisfaction,

W(R_{1},\cdots,R_{M})=\prod_{m=1}^{M}(R_{m,\theta_{m}}(s,a)-R_{m,\theta_{m}}^{0}(s)).

We emphasize that the introduction of the minimal rewards R_{m,\theta_{m}}^{0} ensures solution invariance and mitigates dominance concerns, as discussed in Section 3.2. Moreover, our specific choice of R_{m,\theta_{m}}^{0} is easy to compute since solving R_{m,\theta_{m}}^{0}(s) is standard in bandit and RL problems (Sutton and Barto,, 2018). Alternatively, we can also consider the reward R_{m,\theta_{m}}^{\text{ref}}(s) under any reference policy. This distinction separates multi-party alignment from single-party alignment.

Next, we introduce the pessimistic technique to learn a policy. We first define the true Nash expected value J of a policy \pi and its sub-optimality as:

J(\pi):=\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}\left(R_{m,\theta_{m}^{*}}(s,\pi(s))-R_{m,\theta_{m}^{*}}^{0}(s)\right),\quad\texttt{SubOpt}(\hat{\pi}):=J(\pi^{*})-J(\hat{\pi}),

where \pi^{*}=\arg\max_{\pi}J(\pi) is the optimal policy. This sub-optimality notion measures the performance gap compared to the optimal policy. Given a failure probability \delta, we consider the pessimistic estimator obtained from the lower confidence bound of the set of parameters: \Theta_{B}(\hat{\theta}_{m})=\{\theta\in\Theta_{B}:\|\theta-\hat{\theta}_{m}\|_{\Sigma_{m}}\leq\Gamma(\delta,n,M,d,r)\}, where \Gamma is defined in the next section. We then construct a pessimistic Nash expected value function:

\hat{J}(\pi)=\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,\theta_{m}}(s,\pi(s))-R_{m,\theta_{m}}^{0}(s)). (5)

Finally, we obtain \hat{\pi}=\arg\max_{\pi}\hat{J}(\pi) as the pessimistic estimator. The pessimism technique penalizes responses that are less represented in the dataset and thus contributes to finding a more conservative and accurate policy (Xie et al.,, 2021; Zhan et al.,, 2023; Zhu et al.,, 2023). Later, we will demonstrate that \hat{\pi} can approach \pi^{*} in the sense of the true Nash expected value J.
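The joint minimization over the confidence sets in (5) couples all parties through the product. As a simple illustration, the tabular sketch below replaces it with per-term lower and upper confidence bounds, which gives a further (more pessimistic) lower bound and lets the policy be chosen state by state. It is a conservative surrogate of Algorithm 1, not the exact procedure, and all inputs are assumed given.

```python
import numpy as np

def pessimistic_nash_policy(phi, theta_hat, Sigma, Gamma, states, actions):
    """Tabular surrogate of Steps 3-4 of Algorithm 1 using per-term confidence bounds.
    phi(s, a) -> feature in R^d; theta_hat: (M, d) MLEs; Sigma: (M, d, d); Gamma: radius."""
    M = theta_hat.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)

    def lcb(m, s, a):   # lower confidence bound on R_m(s, a)
        f = phi(s, a)
        return theta_hat[m] @ f - Gamma * np.sqrt(f @ Sigma_inv[m] @ f)

    def ucb(m, s, a):   # upper confidence bound on R_m(s, a)
        f = phi(s, a)
        return theta_hat[m] @ f + Gamma * np.sqrt(f @ Sigma_inv[m] @ f)

    policy = {}
    for s in states:
        best_a, best_val = None, -np.inf
        for a in actions:
            # pessimistic normalized gain of each party: LCB of R_m(s, a) minus
            # an upper bound on the zero reward R_m^0(s) = min_b R_m(s, b)
            gains = [max(lcb(m, s, a) - min(ucb(m, s, b) for b in actions), 0.0)
                     for m in range(M)]
            val = float(np.prod(gains))
            if val > best_val:
                best_a, best_val = a, val
        policy[s] = best_a
    return policy
```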

4.2 Sample Complexity

In this section, we introduce the sub-optimality bound and the corresponding sample complexity. We begin with the following estimation error between \hat{\theta}_{m}=\hat{U}\hat{\alpha}_{m} and \theta_{m}^{*}=U^{*}\alpha_{m}^{*}.

Theorem 1.

Suppose Assumptions 1-4 hold. If the number of samples satisfies n\gg d+\log(M/\delta) for \delta\in(0,1), then with probability at least 1-\delta, for any m\in[M],

\|\hat{\theta}_{m}-\theta_{m}^{*}\|_{\Sigma_{m}}\lesssim\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}.

Now we let \Gamma in Algorithm 1 be \Gamma(\delta,n,M,d,r)=K\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}, where K is a constant. Theorem 1 implies that with probability at least 1-\delta, \theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) for any m\in[M]. We provide a proof outline here and defer the details to Appendix C.1. First, we establish an overall bound on \sum_{m=1}^{M}\|X_{m}\hat{\theta}_{m}-X_{m}\theta_{m}^{*}\|^{2} by leveraging the convexity of l_{\mathcal{D}}(U,\alpha), based on a second-order Taylor expansion and a careful use of concentration inequalities. Next, we derive a bound on \|\hat{\theta}_{m}-\theta_{m}^{*}\|_{\Sigma_{m}} in terms of U and \alpha_{m} using the Davis-Kahan \sin\theta theorem and a Bernstein-type inequality. Theorem 1 is a generalization of the upper bound of Lemma 3.1 in Zhu et al., (2023), and improves the previous bound of \sqrt{\frac{d}{n}} when r^{2}\ll\min\{M,d\}. It shows that it is possible to use O(r^{2}) samples to learn individual rewards by learning a shared representation with all observations pooled together.

Remark 1.

\Sigma_{m} is positive-definite with high probability for all m\in[M]; see Lemma 1. For simplicity, we assume that \Sigma_{m} is positive-definite throughout the paper. The results can be slightly modified to accommodate the situation when \Sigma_{m} is not invertible by considering \Sigma_{m}+\lambda I for any \lambda\in\mathbb{R}^{+} (Li et al.,, 2022; Zhu et al.,, 2023).

We then consider the induced policy. Denote \tilde{\theta}_{m}=\arg\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,\theta_{m}}(s,\pi^{*}(s))-R_{m,\theta_{m}}^{0}(s)) as the pessimistic parameter achieving the lower confidence bound for \pi^{*}. Denote \underline{\pi}_{*m}(s)=\arg\min_{a}R_{m,\theta_{m}^{*}}(s,a) and \underline{\pi}_{m}(s)=\arg\min_{a}R_{m,\tilde{\theta}_{m}}(s,a) as the baseline policies attaining R_{m,\theta_{m}^{*}}^{0}(s) and R_{m,\tilde{\theta}_{m}}^{0}(s), respectively. Define the concentratability coefficient as

C^{*}=\max_{m}\max_{\pi\in\{\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m}\}}\|(\Sigma_{m}^{-1/2}\mathbb{E}_{s\sim\rho}\phi(s,\pi(s)))\|_{2}. (6)

We have the following guarantee for the pessimistic policy, whose proof is given in Appendix C.2.

Theorem 2.

Under the same condition as Theorem 1, with probability at least 1-\delta,

\texttt{SubOpt}(\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\,C^{*}.

The proof sketch is as follows. The sub-optimality can be decomposed as J(\pi^{*})-J(\hat{\pi})=(J(\pi^{*})-\hat{J}(\pi^{*}))+(\hat{J}(\pi^{*})-\hat{J}(\hat{\pi}))+(\hat{J}(\hat{\pi})-J(\hat{\pi})), and what we truly need to focus on is the first term J(\pi^{*})-\hat{J}(\pi^{*}), since the latter two terms are nonpositive with high probability. We address the multiplicative Nash welfare function by decomposing it into an aggregation of individual errors. Given the normalization inherent in Nash bargaining, we then separately handle the gaps between the true and empirical R(s,\pi^{*}) values, as well as those between the true and empirical R^{0}(s) values induced by stochasticity, using the estimation error.

We point out that C^{*} is a concentratability coefficient assumed to be bounded (Zhu et al.,, 2023; Zhan et al.,, 2023). It provides a coverage guarantee for the dataset: the ratio of the state-action occupancy induced by the optimal and baseline policies to the data distribution is bounded. This assumption ensures good coverage of the target vectors \mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s)) and \mathbb{E}_{s\sim\rho}\phi(s,\underline{\pi}(s)) by the dataset in the feature space, and therefore yields an accurate estimator.
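The coefficient C^{*} in (6) can be evaluated directly once the expected features of the relevant policies are available; a minimal sketch, with the policies' expected features supplied as inputs, is:

```python
import numpy as np

def concentratability(Sigma, feature_means):
    """Evaluate C* in (6). Sigma: (M, d, d) data covariance matrices;
    feature_means: (M, P, d), where feature_means[m, p] = E_{s~rho} phi(s, pi_p(s))
    for the P candidate policies {pi*, underline-pi_{*m}, underline-pi_m}."""
    vals = []
    for m in range(Sigma.shape[0]):
        inv = np.linalg.inv(Sigma[m])
        # ||Sigma_m^{-1/2} v||_2 = sqrt(v^T Sigma_m^{-1} v)
        vals.extend(float(np.sqrt(v @ inv @ v)) for v in feature_means[m])
    return max(vals)
```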

For the problems with bounded concentratability coefficients (defined in (6)), we obtain a lower bound result in the following theorem, whose proof is deferred to Appendix C.3.

Theorem 3 (Lower Bound).

Consider the family of instances \text{CB}(M,\mathcal{C})=\{\rho,\{s_{m}^{i},a_{m,1}^{i},a_{m,0}^{i}\}_{i=1}^{n},\theta_{m}^{*}=U^{*}\alpha_{m}^{*},m\in[M]\,|\,C^{*}\leq\mathcal{C}\}, where C^{*} is defined in (6). For any bandit instance \mathcal{Q}(M)\in\text{CB}(M,\mathcal{C}), \texttt{SubOpt}_{\mathcal{Q}(M)} is defined as the sub-optimality under instance \mathcal{Q}(M). Suppose r>6, n\gtrsim r\mathcal{C}^{2}, and \mathcal{C}\geq 2; then there exists a feature mapping \phi such that the following lower bound holds:

\inf_{\hat{\pi}}\sup_{\mathcal{Q}(M)\in\text{CB}(M,\mathcal{C})}\texttt{SubOpt}_{\mathcal{Q}(M)}(\hat{\pi})\gtrsim\mathcal{C}M\sqrt{\dfrac{r}{n}}.

The lower bound highlights the complexity challenge in multi-party RLHF. We face the aggregation of individual learning errors and a potentially larger concentratability coefficient C^{*} defined in (6), as it is related not only to the optimal policy \pi^{*} but also to the baseline policies \underline{\pi}_{m}. This leads to a larger sample complexity than traditional RLHF, where the concentratability coefficient depends solely on the single-party optimal policy, as depicted in Theorem 3.10 of Zhu et al., (2023). Obtaining a tight dependence of our sub-optimality bound on the feature dimensions is challenging. The \sqrt{r} gap between Theorem 2 and Theorem 3 relates to an open problem in classical estimation; see a more detailed discussion on page 8 of Tripuraneni et al., (2020).

4.3 Efficiency and Welfare Guarantees

We demonstrate the efficiency and fairness guarantees of our method using the Nash welfare function. Theorem 2 implies that the above Nash solution is within a small distance (sub-optimality) of the optimal solution \pi^{*} (\pi^{*} is also Pareto efficient; see Lemma 3). This observation motivates us to consider the concept of \tau-approximate Pareto efficiency with respect to each normalized reward function.

Definition 1 (τ\tau-approximate Pareto Efficiency).

A solution \pi is \tau-approximate Pareto efficient at state s if no other action at the state provides all individuals with at least the same rewards R_{m}-R_{m}^{0}, and one individual with a reward (1+\tau) times higher.

This efficiency notion guarantees an agreement and balance among multiple parties. It is an adaptation of the definitions by Aumann and Dombb, (2010) and Barman et al., (2018) to the CB setting. While previous work considered an efficiency notion that allows a \tau improvement for all individuals, we admit a \tau improvement for only one individual while keeping the others unchanged. Define the state-wise concentratability coefficient as

C^{*}(s)=\max_{m}\max_{\pi\in\{\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m}\}}\|(\Sigma_{m}^{-1/2}\phi(s,\pi(s)))\|_{2}.

We formulate the following theorem.

Theorem 4.

Under the same condition as Theorem 1, with probability at least 1-\delta, \hat{\pi} is \tau(s)-approximate Pareto efficient at state s, where \tau(s)\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\,C^{*}(s), \forall s.

This result is established for every state, and the proof is given in Appendix C.4. It confirms that we cannot improve one party significantly without making another party worse, thus leading to a proportional-fair policy. Besides, the pessimistic policy \hat{\pi} follows the approximate Pigou-Dalton principle, ensuring more equitable outcomes (Moulin,, 2004); see Appendix C.4 for details.
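As an illustration of Definition 1, the following sketch checks \tau-approximate Pareto efficiency of a chosen action at a single state, given a hypothetical table of normalized rewards R_{m}(s,a)-R_{m}^{0}(s):

```python
import numpy as np

def is_tau_pareto_efficient(a_pi, gains, tau):
    """Check Definition 1 at a single state.
    gains: (M, A) array of normalized rewards R_m(s, a) - R_m^0(s);
    a_pi: the action chosen by the policy at this state; tau >= 0."""
    base = gains[:, a_pi]
    for a in range(gains.shape[1]):
        if a == a_pi:
            continue
        # does action a give everyone at least as much, and someone (1+tau)x more?
        dominates = np.all(gains[:, a] >= base) and np.any(gains[:, a] >= (1 + tau) * base)
        if dominates:
            return False
    return True

# hypothetical normalized rewards for M = 2 parties and A = 3 actions
gains = np.array([[0.8, 0.6, 0.5],
                  [0.3, 0.7, 0.4]])
print(is_tau_pareto_efficient(a_pi=1, gains=gains, tau=0.1))
```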

4.4 Comparison with Various Social Welfare Functions

Further, we provide a concise introduction to other social welfare functions, including the Utilitarian and Leximin (Egalitarian) welfare functions, and compare their results. We adjust the pessimistic value function \hat{J}(\pi) in Step 3 of Algorithm 1 to accommodate these alternative welfare functions.

Utilitarian welfare function (\alpha=1)

Given its independence of the individual zeros of utilities (refer to Appendix A), the introduction of a zero reward is unnecessary in this context. The true and pessimistic expected value functions are defined as follows:

J(\pi):=\mathbb{E}_{s\sim\rho}\sum_{m=1}^{M}R_{m,\theta_{m}^{*}}(s,\pi(s));
\hat{J}(\pi)=\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim\rho}\sum_{m=1}^{M}R_{m,\theta_{m}}(s,\pi(s)).

Denote \hat{\pi}=\arg\max_{\pi}\hat{J}(\pi) and \pi^{*}=\arg\max_{\pi}J(\pi). We present the following sub-optimality bound.

Theorem 5.

Under the same condition as Theorem 1, we have, with probability at least 1-\delta,

J(\pi^{*})-J(\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\,C^{*},

where C^{*}=\max_{m}\|(\Sigma_{m}^{-1/2}\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s)))\|_{2}.

The proof is deferred to Appendix C.5. The advantage lies in the exclusion of zero rewards. As a result, our coverage assumption (or concentratability coefficient C^{*}) relies solely on the optimal policy \pi^{*}, compared to (6), making it a more relaxed condition.
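Because the Utilitarian objective is linear in each \theta_{m}, the inner minimization decouples across parties and admits a closed form over each confidence ellipsoid (ignoring the additional constraint \|\theta\|_{2}\leq B); a minimal sketch:

```python
import numpy as np

def pessimistic_utilitarian_value(v_pi, theta_hat, Sigma, Gamma):
    """Closed-form sketch of the pessimistic Utilitarian value.
    Minimizing a linear objective over {theta : ||theta - theta_hat_m||_{Sigma_m} <= Gamma}
    gives theta_hat_m^T v_pi - Gamma * ||v_pi||_{Sigma_m^{-1}} for each party.
    v_pi = E_{s~rho} phi(s, pi(s)) is the expected feature under policy pi."""
    total = 0.0
    for m in range(theta_hat.shape[0]):
        penalty = np.sqrt(v_pi @ np.linalg.solve(Sigma[m], v_pi))
        total += theta_hat[m] @ v_pi - Gamma * penalty
    return total
```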

Leximin welfare function (\alpha\rightarrow-\infty)

In this context, the maximization is conducted over the minimum reward among all parties, denoted as \min_{m}R_{m}. Thus, we reintroduce the concept of zero reward as R_{m,\theta_{m}}^{0}(s)=\min_{a}R_{m,\theta_{m}}(s,a). The true and pessimistic expected value functions are defined as follows:

J(\pi):=\mathbb{E}_{s\sim\rho}\min_{m}(R_{m,\theta_{m}^{*}}(s,\pi(s))-R_{m,\theta_{m}^{*}}^{0}(s));
\hat{J}(\pi)=\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim\rho}\min_{m}(R_{m,\theta_{m}}(s,\pi(s))-R_{m,\theta_{m}}^{0}(s)).

Denote \hat{\pi}=\arg\max_{\pi}\hat{J}(\pi) and \pi^{*}=\arg\max_{\pi}J(\pi). Additionally, let \underline{\pi}_{*m} and \underline{\pi}_{m} be defined identically to the Nash case in Section 4.2. We present the following sub-optimality bound.

Theorem 6.

Under the same condition as Theorem 1, we have, with probability at least 1-\delta,

J(\pi^{*})-J(\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\,C^{*},

where C^{*}=\max_{m}\max_{\pi\in\{\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m}\}}\|(\Sigma_{m}^{-1/2}\mathbb{E}_{s\sim\rho}\phi(s,\pi(s)))\|_{2}.

The proof is deferred to Appendix C.5. Theorem 6 implies that \hat{\pi} is approximately optimal for the worst-off individual, thus characterizing a relatively fair outcome. To achieve a sub-optimal policy, Theorems 2, 5, and 6 once again highlight the complexity challenge in the multi-party scenario, which arises from the aggregation of individual learning errors as well as the potential increase in the concentratability coefficient C^{*}.

We now concisely compare the various social welfare functions used for aggregation.

  • The Nash welfare function exhibits resilience against affine transformations, which significantly avoids dominance. Also, it strikes a middle ground between Utilitarian and Leximin fairness, thus attaining a favorable balance between the average-case and worst-case performances (Zhang et al.,, 2022). Moreover, under additional convexity assumptions, Nash bargaining guarantees that each individual receives their minimum reward plus a certain fraction of the maximum feasible gain, expressed as \frac{1}{M}(\max_{a}R_{m}(s,a)-R_{m}^{0}(s)) (see Chapter 3.6, Moulin,, 2004). These properties distinguish the Nash solution.

  • On the other hand, both the Nash and Leximin solutions incorporate the concept of zero rewards R_{m}^{0}, while the Utilitarian solution remains independent of zero rewards (refer to Appendix A). Consequently, the Nash and Leximin welfare functions necessitate a more nuanced coverage assumption (or a higher concentratability coefficient C^{*}).

  • Bakker et al., (2022) empirically showed that the Utilitarian welfare function successfully considers both minority and majority views of various individuals. When compared to other welfare functions such as Leximin and Nash, their results showed similar average ratings among participants and comparable outcomes for the most dissenting participants. We interpret these findings as specific to a particular data distribution. Moreover, they do not account for the importance of zero rewards, which can significantly influence outcomes by shaping what constitutes an acceptable agreement. Nevertheless, it is crucial to note that these social welfare functions do exhibit disparities: they measure distinct aspects of social welfare and may be suitable for different practical settings.

4.5 Generalization to MDPs

We now generalize our results to MDPs, where the preference is based on a whole trajectory instead of a single action. Examples of such settings include Atari games and robot locomotion (Christiano et al.,, 2017), where the preference depends on historical behaviors. We consider the environment E=(\mathcal{S},\mathcal{A},H,\{P_{h}\}_{h=1}^{H-1},\{R_{m}\}_{m=1}^{M},\rho), where \mathcal{S} is a state space, \mathcal{A} is an action space, H is the horizon length, P_{h}:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S}) for h\in[H-1] are known transition kernels, R_{m}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R} for m\in[M] are reward functions from M individuals (or parties), and \rho is an initial state distribution. Define the occupancy measure d^{\pi}:\mathcal{S}\rightarrow\mathbb{R} as d^{\pi}(s)=\sum_{h=1}^{H}\mathbb{P}_{h}(s_{h}=s|\pi), which becomes \rho(s) when H=1. We are provided with pairwise comparisons \mathcal{D}=\bigcup_{m=1}^{M}\mathcal{D}_{m}, where \mathcal{D}_{m}=\{(s_{m}^{i},\tau_{m,1}^{i},\tau_{m,0}^{i},y_{m}^{i})\}_{i=1}^{n} for m\in[M]. For the i-th sample in \mathcal{D}_{m}, an initial state s_{m}^{i} is sampled from \rho, and two trajectories \tau_{m,1}^{i}=(a_{m,1}^{i},\cdots,s_{m,H}^{i},a_{m,H}^{i}) and \tau_{m,0}^{i}=(a_{m,1}^{i^{\prime}},\cdots,s_{m,H}^{i^{\prime}},a_{m,H}^{i^{\prime}}) are sampled from a distribution g_{\tau}(\tau_{m,1}^{i};\tau_{m,0}^{i}|s_{m}^{i}). Thus (s_{m}^{i},\tau_{m,1}^{i},\tau_{m,0}^{i}) is generated by a generation distribution g=g_{\tau}(\cdot|s)\rho(s). We then observe a binary outcome y_{m}^{i} from a Bernoulli distribution \mathbb{P}_{m}(y_{m}^{i}=1|\tau_{m,1}^{i},\tau_{m,0}^{i}) as follows,

\mathbb{P}_{m}(y=1|\tau_{1},\tau_{0},s)=\Phi\Big(\sum_{h=1}^{H}R_{m,\theta_{m}^{*}}(s_{h},a_{h})-\sum_{h=1}^{H}R_{m,\theta_{m}^{*}}(s_{h}^{\prime},a_{h}^{\prime})\Big),

where Assumptions 1-4 hold with the feature difference in \Sigma_{*} changed to the cumulative difference over the trajectory, i.e., \Sigma_{*}=\mathbb{E}_{g}\big(\sum_{h=1}^{H}(\phi(s_{h},a_{h})-\phi(s_{h}^{\prime},a_{h}^{\prime}))\big)\big(\sum_{h=1}^{H}(\phi(s_{h},a_{h})-\phi(s_{h}^{\prime},a_{h}^{\prime}))\big)^{\top}.

The generalized algorithm for MDPs is similar to Algorithm 1. We first construct the MLE \hat{\theta}_{m}=\hat{U}\hat{\alpha}_{m}, aggregate the reward functions, and then seek a policy to optimize the pessimistic Nash value function \hat{J}. However, here we adopt a trajectory-based preference model and thus optimize a trajectory-wise value function. We discuss the differences in detail below.

Denote X_{m,i}=\sum_{h=1}^{H}\left(\phi(s_{m,h}^{i},a_{m,h}^{i})-\phi(s_{m,h}^{i^{\prime}},a_{m,h}^{i^{\prime}})\right) as the difference between the cumulative features of the two trajectories, and X_{m}=(X_{m,1},\cdots,X_{m,n})^{\top}. Define the data covariance matrices as \Sigma_{m}=\dfrac{1}{n}X_{m}^{\top}X_{m} for all m\in[M]. We again minimize the negative log-likelihood:

\hat{U},\hat{\alpha}\in\arg\min_{U\in\mathcal{O}_{d\times r},\|\alpha_{m}\|_{2}\leq B}l_{\mathcal{D}}(U,\alpha),

where l_{\mathcal{D}}(U,\alpha)=-\dfrac{1}{Mn}\sum_{m=1}^{M}\sum_{i=1}^{n}\{\mathbf{1}(y_{m}^{i}=1)\log\left(\Phi(\langle U\alpha_{m},X_{m,i}\rangle)\right)+\mathbf{1}(y_{m}^{i}=0)\log\left(\Phi(-\langle U\alpha_{m},X_{m,i}\rangle)\right)\}.

Subsequently, we use Nash bargaining to aggregate individual preferences. Here we point out that the key difference between MDPs and CBs is the inclusion of the trajectory-based measure d^{\pi}, which stems from the observation (Zhu et al.,, 2023) that:

\mathbb{E}_{s\sim\rho}[V^{\pi}(s)]=\mathbb{E}_{s\sim d^{\pi}}\prod_{m=1}^{M}(R_{m,\theta_{m}^{*}}(s,\pi(s))-R_{m,\theta_{m}^{*}}^{0}(s)),

where V^{\pi}(s)=\mathbb{E}[\sum_{h=1}^{H}\prod_{m=1}^{M}(R_{m}(s_{h},a_{h})-R_{m}^{0}(s_{h}))|s_{1}=s,\pi], and the expectation \mathbb{E} is taken over the trajectory generated according to the transition kernels. Since the transition distribution P is known, we can directly calculate d^{\pi} for any given \pi. Define the trajectory-wise true and pessimistic Nash expected value functions J and \hat{J} as follows:

J(\pi)=\mathbb{E}_{s\sim d^{\pi}}\prod_{m=1}^{M}\left(R_{m,\theta_{m}^{*}}(s,\pi(s))-R_{m,\theta_{m}^{*}}^{0}(s)\right);
\hat{J}(\pi)=\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim d^{\pi}}\prod_{m=1}^{M}(R_{m,\theta_{m}}(s,\pi(s))-R_{m,\theta_{m}}^{0}(s)),

where \Theta_{B}(\hat{\theta}_{m})=\{\theta\in\Theta_{B}:\|\theta-\hat{\theta}_{m}\|_{\Sigma_{m}}\leq K^{\prime}\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\} and K^{\prime} is a fixed constant. Define \hat{\pi},\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m},\tilde{\theta}_{m} following the same structure as in Section 4.2, and C^{*}=\max_{m}\max_{\pi\in\{\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m}\}}\|(\Sigma_{m}^{-1/2}\mathbb{E}_{s\sim d^{\pi^{*}}}\phi(s,\pi(s)))\|_{2}. We have the following result, whose proof is deferred to Appendix C.6.
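Since the transition kernels are known, the occupancy measure d^{\pi} can be computed by propagating the state distribution forward over the horizon; a minimal tabular sketch (with hypothetical toy numbers in the usage example) is given below.

```python
import numpy as np

def occupancy_measure(rho, P, pi, H):
    """Compute d^pi(s) = sum_{h=1}^H P(s_h = s | pi) for a deterministic policy pi
    with known transition kernels. rho: (S,) initial distribution;
    P: (H-1, S, A, S) transition kernels; pi: (S,) action indices."""
    S = rho.shape[0]
    d = np.zeros(S)
    mu = rho.copy()                              # distribution of s_1
    for h in range(H):
        d += mu
        if h < H - 1:
            T = P[h][np.arange(S), pi]           # (S, S) transition matrix under a = pi(s)
            mu = mu @ T                          # push the state distribution forward
    return d

# minimal usage on a toy 2-state, 2-action chain (hypothetical numbers)
rho = np.array([1.0, 0.0])
P = np.full((2, 2, 2, 2), 0.5)                   # uniform transitions at every step
pi = np.array([0, 1])
print(occupancy_measure(rho, P, pi, H=3))        # entries sum to H
```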

Theorem 7.

Under the same condition as Theorem 1 (with \Sigma_{*} changed as above), with probability at least 1-\delta,

J(\pi^{*})-J(\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\,C^{*}.

We can also attain efficiency and fairness guarantees similar to Section 4.3 accordingly.

5 Extension to Reward-Free Models

The reward-based model is limited to situations in which human preferences are transitive. However, the multi-dimensional nature of human decision-making may result in intransitive preferences (Tversky,, 1969). We now introduce a more general model where each individual’s preference may not be consistent with any reward model in the contextual bandits setting.

General Model

Suppose a reward-free model without additional assumptions about human preferences. To ensure symmetry, we define the following preference matrix (Dudík et al.,, 2015):

N_{s,m}^{*}(a_{1},a_{0})=2\mathbb{P}_{m}(y=1|a_{1},a_{0},s)-1,\quad\forall s\in\mathcal{S}.

Throughout this section, we focus on tabular cases with finite state and action spaces, |\mathcal{S}|=S and |\mathcal{A}|=A. Recall that \mathcal{D}_{m} is generated by the distribution g in Section 3.1. Define the occupancy measure d_{s}^{g}(a,a^{\prime}) as the probability of the pair (a,a^{\prime}) appearing in the data generated by g for a given state s; define d_{s}^{\mu,\nu}(a,a^{\prime}) as the probability of a,a^{\prime} appearing in the data generated by policies \mu,\nu respectively, given state s. These definitions are akin to those in Markov games (Cui and Du,, 2022).

5.1 Pessimism Algorithm with Von Neumann Winner

For the general model, we consider an alternative solution concept, the von Neumann winner (Dudík et al.,, 2015), which corresponds to a randomized policy. Here, a von Neumann winner of an A\times A matrix N is the max-min probabilistic strategy p of the matrix game \max_{p}\min_{q}p^{\top}Nq=\max_{p}\min_{q}\sum_{i,j}p_{i}q_{j}N_{ij}, and the achieved value is defined as the value of this matrix game. It is known that when N is skew-symmetric, the value is zero (Owen,, 2013). The construction of the von Neumann winner for our problem consists of four steps. Step 1 is to calculate the aggregated preference matrix for each state. Step 2 is to construct confidence bounds for each entry of these matrices. Step 3 involves solving a max-min optimization problem for each state. Finally, Step 4 is to transform the per-state probabilistic strategies into a randomized policy in the policy space. We summarize our procedure in Algorithm 2 first, and discuss each step in detail below.

Algorithm 2 Pessimistic Von Neumann Winner

Input: The datasets \mathcal{D}=\bigcup_{m=1}^{M}\mathcal{D}_{m}, a failure probability \delta, and the initial distribution \rho.

Output: The pessimistic policy \hat{\pi}.

1:  Calculate the preference matrix N_{s} in (7) for all s\in\mathcal{S}.
2:  Construct the pessimistic lower confidence bound N_{s}-B_{s} through (8).
3:  Calculate the state-wise strategy \hat{p}_{s} by solving the max-min problem for the matrix N_{s}-B_{s}.
4:  Obtain the final randomized policy \hat{\pi} through (9).

Define the visiting number of an action pair at a state as n_{0}(m,a,a^{\prime},s):=\#_{m}(a\succ a^{\prime};s)+\#_{m}(a^{\prime}\succ a;s), where \#_{m}(a\succ a^{\prime};s) denotes the number of observations in \mathcal{D}_{m} preferring a over a^{\prime} at state s. The standard preference matrix at state s between actions a and a^{\prime} is then defined as:

N_{s}(a,a^{\prime})=\dfrac{1}{M}\sum_{m=1}^{M}N_{s,m}(a,a^{\prime}), (7)

where N_{s,m}(a,a^{\prime})=\frac{\#_{m}(a\succ a^{\prime};s)-\#_{m}(a^{\prime}\succ a;s)}{\#_{m}(a\succ a^{\prime};s)+\#_{m}(a^{\prime}\succ a;s)} if the denominator is nonzero; otherwise, it is defined as 0.

Subsequently, we consider the policy space \Pi, which encompasses all deterministic policies. We define the total preference matrix between policies as follows:

T(\pi,\pi^{\prime})=\mathbb{E}_{s\sim\rho}N_{s}(\pi(s),\pi^{\prime}(s)).

We further define the population-wise matrices N_{s}^{*} and T^{*} as

N_{s}^{*}(a,a^{\prime})=\dfrac{1}{M}\sum_{m=1}^{M}N_{s,m}^{*}(a,a^{\prime}),\quad T^{*}(\pi,\pi^{\prime})=\mathbb{E}_{s\sim\rho}N_{s}^{*}(\pi(s),\pi^{\prime}(s)).

In our general model, we refrain from directly specifying reward functions to model human preferences. Consequently, obtaining an optimal policy based on an aggregated function becomes infeasible. Instead, we consider the von Neumann winner p_{s}^{*} of N_{s}, s\in\mathcal{S}, defined in an average-wise manner. This winner represents a probabilistic strategy that "beats" every other strategy p_{s}^{\prime} in the sense that p_{s}^{*\top}N_{s}p_{s}^{\prime}\geq 0; in other words, it has an average beating probability of at least 1/2 (Dudík et al.,, 2015). Additionally, if the reward-based preference model (1) holds, the von Neumann winner of N_{s}^{*} corresponds to the solution of the Utilitarian welfare function when M=1 or M=2. For further details, refer to Appendix B.

For a given failure probability \delta, define the confidence bonus as

B_{s}(a,a^{\prime})=\sqrt{\dfrac{2\log(4MSA^{2}/\delta)}{\min_{m}n_{0}(m,a,a^{\prime},s)\lor 1}}. (8)

Here a\lor b:=\max\{a,b\}. We claim that, with probability at least 1-\delta, |N_{s}(a,a^{\prime})-N_{s}^{*}(a,a^{\prime})|\leq B_{s}(a,a^{\prime}) for all s,a,a^{\prime}; see Lemma 6 for details. Construct the following pessimistic estimators,

\hat{p}_{s},\hat{q}_{s}=\arg\max_{p}\min_{q}p^{\top}(N_{s}-B_{s})q,\,\forall s\in\mathcal{S},\quad\hat{p}(\pi)=\prod_{s\in\mathcal{S}}\hat{p}_{s}(\pi(s)). (9)

Here \hat{p}_{s} is a mixed strategy for each s\in\mathcal{S}, and \hat{p}(\pi) is a mixed strategy over the policy space \Pi.
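Step 3 of Algorithm 2 is a standard matrix-game computation: the max-min strategy of N_{s}-B_{s} can be obtained from a small linear program. The sketch below uses scipy.optimize.linprog, with hypothetical numbers standing in for the matrices built from the counts in (7) and the bonus in (8).

```python
import numpy as np
from scipy.optimize import linprog

def von_neumann_winner(M):
    """Max-min strategy of the matrix game max_p min_q p^T M q via a linear program."""
    A = M.shape[0]
    # variables x = (p_1, ..., p_A, t); maximize t  <=>  minimize -t
    c = np.zeros(A + 1); c[-1] = -1.0
    # constraints: t <= (M^T p)_j for every pure column strategy j  <=>  -M^T p + t <= 0
    A_ub = np.hstack([-M.T, np.ones((M.shape[1], 1))])
    b_ub = np.zeros(M.shape[1])
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])   # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A], res.x[-1]        # strategy p_hat_s and the game value

# hypothetical aggregated preference matrix N_s and confidence bonus B_s
N_s = np.array([[0.0, 0.4, -0.2],
                [-0.4, 0.0, 0.3],
                [0.2, -0.3, 0.0]])
B_s = np.full((3, 3), 0.05)
p_hat_s, value = von_neumann_winner(N_s - B_s)
print(p_hat_s, value)
```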

5.2 Main Results

Define p_{s}^{*},q_{s}^{*}=\arg\max_{p}\min_{q}p^{\top}N_{s}^{*}q, and C^{*}=\max_{\nu,s,a,a^{\prime}}\dfrac{d_{s}^{p_{s}^{*},\nu}(a,a^{\prime})}{d_{s}^{g}(a,a^{\prime})}, where C^{*} is a concentratability coefficient assumed to be bounded (Cui and Du,, 2022). We have the following result.

Theorem 8.

With probability at least 1-\delta, \hat{p}_{s} and \hat{p} are \epsilon-approximate von Neumann winners, i.e.,

\min_{q}\hat{p}_{s}^{\top}N_{s}^{*}q\geq-\epsilon,\;\forall s\in\mathcal{S},
\min_{q}\hat{p}^{\top}T^{*}q\geq-\epsilon,

where \epsilon=8A\log(\frac{4MSA^{2}}{\delta})\sqrt{\frac{C^{*}}{n}}.

The proof is deferred to Appendix D.1. It is known that the value of a skew-symmetric matrix game is zero (Owen,, 2013); thus, Theorem 8 shows that the von Neumann winner estimated from the observations approximates the true von Neumann winner with only an \epsilon gap.

Next, we give a weak Pareto efficiency result to ensure consensus among multiple individuals, which is adapted from the definition of ex-post efficiency (Fishburn,, 1984).

Definition 2 (τ\tau-Approximate Ex-post Efficiency).

A probabilistic strategy p_{s} is \tau-approximate ex-post efficient if any Pareto-dominated alternative b receives probability p_{s}(b)\leq\tau. Here b is Pareto-dominated if there exists an alternative a such that N_{s,m}^{*}(a,b)\geq 0 for all m\in[M].

Our definition differs from that of Zeng and Psomas, (2020), who considered a strategy that always outputs a \tau-approximate Pareto efficient alternative, whereas we allow only a negligible probability of outputting a Pareto-dominated alternative. We establish the following efficiency result.

Theorem 9.

If the preference of each individual is transitive, i.e., N_{s,m}^{*}(x,y)\geq 0\Rightarrow N_{s,m}^{*}(x,z)\geq N_{s,m}^{*}(y,z) for all z, then with probability at least 1-\delta, \hat{p}_{s} is \tau-approximate ex-post efficient for every s, where \tau=K\left(A\log(2MSA^{2}/\delta)\sqrt{\frac{C^{*}}{n}}\right)^{1/2} and K is a constant.

The proof is given in Appendix D.2. Here the transitivity assumption is necessary, demonstrating the distinction between the general and reward-based models. Further details are provided in Appendix D.2. This result aligns with Theorem 4, emphasizing the efficiency and welfare assurance of multi-party alignment, especially in scenarios where individual preferences deviate from a specific data generative model.

References

  • Aumann and Dombb, (2010) Aumann, Y. and Dombb, Y. (2010). Pareto efficiency and approximate pareto efficiency in routing and load balancing games. In International Symposium on Algorithmic Game Theory, pages 66–77. Springer.
  • Bakker et al., (2022) Bakker, M., Chadwick, M., Sheahan, H., Tessler, M., Campbell-Gillingham, L., Balaguer, J., McAleese, N., Glaese, A., Aslanides, J., Botvinick, M., et al. (2022). Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189.
  • Barman et al., (2018) Barman, S., Krishnamurthy, S. K., and Vaish, R. (2018). Finding fair and efficient allocations. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 557–574.
  • Barron, (2013) Barron, E. N. (2013). Game theory: an introduction. John Wiley & Sons.
  • Baxter, (2000) Baxter, J. (2000). A model of inductive bias learning. Journal of artificial intelligence research, 12:149–198.
  • Bradley and Terry, (1952) Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345.
  • Brandl et al., (2016) Brandl, F., Brandt, F., and Seedig, H. G. (2016). Consistent probabilistic social choice. Econometrica, 84(5):1839–1880.
  • Chen et al., (2022) Chen, X., Zhong, H., Yang, Z., Wang, Z., and Wang, L. (2022). Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, volume 162, pages 3773–3793. PMLR.
  • Christiano et al., (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Cui and Du, (2022) Cui, Q. and Du, S. S. (2022). When is offline two-player zero-sum markov game solvable? arXiv preprint arXiv:2201.03522.
  • Du et al., (2020) Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. (2020). Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434.
  • Dudík et al., (2015) Dudík, M., Hofmann, K., Schapire, R. E., Slivkins, A., and Zoghi, M. (2015). Contextual dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR.
  • Finn et al., (2017) Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR.
  • Fish et al., (2023) Fish, S., Gölz, P., Parkes, D. C., Procaccia, A. D., Rusak, G., Shapira, I., and Wüthrich, M. (2023). Generative social choice. arXiv preprint arXiv:2309.01291.
  • Fishburn, (1984) Fishburn, P. C. (1984). Probabilistic social choice based on simple voting comparisons. The Review of Economic Studies, 51(4):683–692.
  • Hsu et al., (2012) Hsu, D., Kakade, S., and Zhang, T. (2012). A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17:1 – 6.
  • Immorlica et al., (2017) Immorlica, N., Lucier, B., Weyl, G., and Mollner, J. (2017). Approximate efficiency in matching markets. In International Conference on Web and Internet Economics, pages 252–265. Springer.
  • Ji et al., (2023) Ji, W., Deng, Z., Nakada, R., Zou, J., and Zhang, L. (2023). The power of contrast for feature learning: A theoretical analysis. Journal of Machine Learning Research, 24(330):1–78.
  • Jin et al., (2020) Jin, T., Xu, P., Gu, Q., and Farnoud, F. (2020). Rank aggregation via heterogeneous Thurstone preference models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4353–4360.
  • Kelly et al., (1998) Kelly, F. P., Maulloo, A. K., and Tan, D. K. H. (1998). Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research society, 49:237–252.
  • Li et al., (2022) Li, G., Ma, C., and Srebro, N. (2022). Pessimism for offline linear contextual bandits using \ell_{p} confidence sets. Advances in Neural Information Processing Systems, 35:20974–20987.
  • Li and Zhang, (2023) Li, S. and Zhang, L. (2023). Multi-dimensional domain generalization with low-rank structures. arXiv preprint arXiv:2309.09555.
  • Maurer et al., (2016) Maurer, A., Pontil, M., and Romera-Paredes, B. (2016). The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32.
  • Moulin, (2004) Moulin, H. (2004). Fair division and collective welfare. MIT press.
  • Nash, (1953) Nash, J. (1953). Two-person cooperative games. Econometrica: Journal of the Econometric Society, pages 128–140.
  • Nguyen and Rothe, (2020) Nguyen, T. T. and Rothe, J. (2020). Approximate pareto set for fair and efficient allocation: Few agent types or few resource types. In IJCAI, pages 290–296.
  • Noothigattu et al., (2020) Noothigattu, R., Peters, D., and Procaccia, A. D. (2020). Axioms for learning from pairwise comparisons. Advances in Neural Information Processing Systems, 33:17745–17754.
  • Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Owen, (2013) Owen, G. (2013). Game theory. Emerald Group Publishing.
  • Patty and Penn, (2019) Patty, J. W. and Penn, E. M. (2019). Measuring fairness, inequality, and big data: Social choice since arrow. Annual Review of Political Science, 22(1):435–460.
  • Rivest and Shen, (2010) Rivest, R. L. and Shen, E. (2010). An optimal single-winner preferential voting system based on game theory. In International Workshop on Computational Social Choice, pages 399–410. Citeseer.
  • Santurkar et al., (2023) Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., and Hashimoto, T. (2023). Whose opinions do language models reflect? arXiv preprint arXiv:2303.17548.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
  • Tripuraneni et al., (2020) Tripuraneni, N., Jin, C., and Jordan, M. I. (2020). Provable meta-learning of linear representations. In International Conference on Machine Learning.
  • Tversky, (1969) Tversky, A. (1969). Intransitivity of preferences. Psychological review, 76(1):31.
  • Urvoy et al., (2013) Urvoy, T., Clerot, F., Féraud, R., and Naamane, S. (2013). Generic exploration and K-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 91–99. PMLR.
  • Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Wang et al., (2023) Wang, Y., Liu, Q., and Jin, C. (2023). Is RLHF more difficult than standard RL? arXiv preprint arXiv:2306.14111.
  • Xie et al., (2021) Xie, T., Jiang, N., Wang, H., Xiong, C., and Bai, Y. (2021). Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in Neural Information Processing Systems, 34:27395–27407.
  • Yu et al., (2014) Yu, Y., Wang, T., and Samworth, R. J. (2014). A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102:315–323.
  • Zeng and Psomas, (2020) Zeng, D. and Psomas, A. (2020). Fairness-efficiency tradeoffs in dynamic fair division. In Proceedings of the 21st ACM Conference on Economics and Computation, pages 911–912.
  • Zhan et al., (2023) Zhan, W., Uehara, M., Kallus, N., Lee, J. D., and Sun, W. (2023). Provable offline preference-based reinforcement learning. arXiv preprint arXiv:2305.14816.
  • Zhang et al., (2022) Zhang, G., Malekmohammadi, S., Chen, X., and Yu, Y. (2022). Proportional fairness in federated learning. arXiv preprint arXiv:2202.01666.
  • Zhu et al., (2023) Zhu, B., Jiao, J., and Jordan, M. I. (2023). Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270.
  • Ziegler et al., (2019) Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.

Appendix A Social Welfare Function Details

Here we provide more details on social welfare functions to complement the discussion in Section 3.2. In welfare economics, a social welfare function aggregates individual utilities into a measure of collective welfare and thus guides social decisions to align with the group's preferences. Any reasonable social welfare function satisfying six desirable axioms is known to lie in the one-parameter family of isoelastic social welfare functions W_{\alpha} defined in (2).

Among these functions, the Nash welfare function (α=0)(\alpha=0) stands out as the unique one that is additionally independent of individual scales of utilities (IIS); and the Utilitarian welfare function (α=1)(\alpha=1) is the unique one that is additionally independent of individual zeros of utilities (IIZ). See their definitions below.

Definition 3 (Independence of Individual Scales of Utilities (IIS)).

A solution is independent of individual scales of utilities when the following holds: If we transform the mm-th utility function umu_{m} to cmumc_{m}u_{m} with cm+c_{m}\in\mathbb{R}^{+} for m[M]m\in[M], the solution remains the same.

Definition 4 (Independence of Individual Zeros of Utilities (IIZ)).

A solution is independent of individual zeros of utilities when the following holds: If we transform the mm-th utility function umu_{m} to cm+umc_{m}+u_{m} with cmc_{m}\in\mathbb{R} for m[M]m\in[M], the solution remains the same.

Additionally, letting \alpha\rightarrow-\infty yields another important function, the Leximin (egalitarian) welfare function, which for simplicity can be reduced to maximizing \min_{m}u_{m}.

In the main article, we use the Nash welfare function as the primary illustrative example. The Nash welfare function is advantageous due to its IIS property, which eliminates potential dominance issues arising from differences in utility scales. Furthermore, the Nash bargaining version introduces an objective definition of individual “zero utility” and accordingly normalizes individual zeros:

m=1M(umum0), s.t. um>um0.\prod_{m=1}^{M}(u_{m}-u_{m}^{0}),\text{ s.t. }u_{m}>u_{m}^{0}.

Thus, the Nash bargaining solution also satisfies the IIZ property. This independence makes it resilient to distortions such as affine transformations of individual utilities. Given these desirable properties, the Nash bargaining solution has become a standard concept for negotiating agreements among individuals.
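As a toy illustration of Definitions 3 and 4 (a minimal sketch with randomly drawn utilities over a finite set of outcomes; these modeling choices are illustrative only), the following snippet checks that the Nash welfare maximizer is unchanged under individual rescalings (IIS) and that the Nash bargaining maximizer is unchanged under individual shifts (IIZ), since the bargaining objective depends only on u_{m}-u_{m}^{0}:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 3, 5                                   # individuals and candidate outcomes
U = rng.uniform(1.0, 2.0, size=(M, K))        # utility of each outcome to each individual
u0 = rng.uniform(0.0, 0.5, size=M)            # disagreement ("zero utility") points

nash = lambda u: np.prod(u, axis=0)                       # Nash welfare
bargain = lambda u, z: np.prod(u - z[:, None], axis=0)    # Nash bargaining objective

# IIS: rescaling each individual's utility leaves the Nash-welfare maximizer unchanged.
c = rng.uniform(0.5, 3.0, size=M)
print(np.argmax(nash(U)) == np.argmax(nash(c[:, None] * U)))                    # True

# IIZ: shifting each individual's utility (and its zero point) leaves the Nash
# bargaining maximizer unchanged, since the objective depends only on u_m - u_m^0.
b = rng.uniform(-1.0, 1.0, size=M)
print(np.argmax(bargain(U, u0)) == np.argmax(bargain(U + b[:, None], u0 + b)))  # True
```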

Appendix B Connections Between Two Models

The average-wise definition of the von Neumann winner in Section 5.1 is based on the Utilitarian welfare function, and here we offer further justification for this choice. In a voting election where each voter has a strict preference ranking, N_{s,m}(a,a^{\prime})\in\{1,-1\} for a\neq a^{\prime}. Consequently, N_{s}(a,a^{\prime}) reflects the difference between the number of individuals preferring a over a^{\prime} and the number preferring a^{\prime} over a. Therefore, a von Neumann winner represents the outcome preferred by a larger number of people. Here the normalization of N_{s,m} is essential to prevent dominance.

We have claimed in Section 5.1 that, under the additional assumption of a reward-based preference model (1), the von Neumann winner in the general reward-free model consistently corresponds with the solution of the Utilitarian welfare function when M=1M=1 or M=2M=2. We give a brief illustration as follows.

  • Case M=1M=1: In this scenario, there exists a Condorcet winner (Urvoy et al.,, 2013) determined by the optimal policy. Consequently, it serves as a direct von Neumann winner.

  • Case M=2M=2: It is easy to demonstrate that for any x,yx,y\in\mathbb{R}

    12(11+ex+11+ey)12x+y0.\dfrac{1}{2}(\dfrac{1}{1+e^{-x}}+\dfrac{1}{1+e^{-y}})\geq\dfrac{1}{2}\Leftrightarrow x+y\geq 0.

    Now, if we set x=R1(s,a1)R1(s,a0),y=R2(s,a1)R2(s,a0)x=R_{1}(s,a_{1})-R_{1}(s,a_{0}),y=R_{2}(s,a_{1})-R_{2}(s,a_{0}), then we obtain the following equivalence:

    12[1(a1a0|s)+2(a1a0|s)]12R1(s,a1)+R2(s,a1)R1(s,a0)+R2(s,a0),\dfrac{1}{2}\left[\mathbb{P}_{1}(a_{1}\succ a_{0}|s)+\mathbb{P}_{2}(a_{1}\succ a_{0}|s)\right]\geq\dfrac{1}{2}\Leftrightarrow R_{1}(s,a_{1})+R_{2}(s,a_{1})\geq R_{1}(s,a_{0})+R_{2}(s,a_{0}),

    which gives the desired equivalence; a quick numerical check is sketched after this list.
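The following snippet (illustrative Python; the sigmoid is the Bradley–Terry link assumed in (1)) verifies the equivalence above on randomly drawn reward gaps:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(2)
x, y = rng.normal(scale=3.0, size=(2, 100_000))   # reward gaps of the two parties

lhs = 0.5 * (sigmoid(x) + sigmoid(y)) >= 0.5      # averaged preference favors a_1
rhs = x + y >= 0.0                                # Utilitarian welfare favors a_1
print(np.all(lhs == rhs))                         # True
```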

Appendix C Proofs in Section 4

C.1 Proof of Theorem 1

Denote the low-dimensional parameters and the original parameters as

α=(α1,,αM),α=(α1,,αM),α^=(α^1,,α^M)r×M\displaystyle\alpha=(\alpha_{1},\cdots,\alpha_{M}),\alpha^{*}=(\alpha_{1}^{*},\cdots,\alpha_{M}^{*}),\hat{\alpha}=(\hat{\alpha}_{1},\cdots,\hat{\alpha}_{M})\in\mathbb{R}^{r\times M}
Θ=(θ1,,θM),Θ=(θ1,,θM),Θ^=(θ^1,,θ^M)=(U^α^1,,U^α^M)d×M.\displaystyle\Theta=(\theta_{1},\cdots,\theta_{M}),\Theta^{*}=(\theta_{1}^{*},\cdots,\theta_{M}^{*}),\hat{\Theta}=(\hat{\theta}_{1},\cdots,\hat{\theta}_{M})=(\hat{U}\hat{\alpha}_{1},\cdots,\hat{U}\hat{\alpha}_{M})\in\mathbb{R}^{d\times M}.

Throughout this section, we define \theta,\hat{\theta},\theta^{*}\in\mathbb{R}^{dM} as the vectorizations of \Theta,\hat{\Theta},\Theta^{*}, respectively.

Lemma 1 (Covariance concentration).

Suppose Assumption 3 holds. If the number of samples satisfies nd+log(M/δ)n\gg d+\log(M/\delta) for δ(0,1)\delta\in(0,1), then with probability at least 1δ/101-\delta/10, we have: for any m[M]m\in[M],

0.9ΣΣm1.1Σ0.9\Sigma_{*}\preceq\Sigma_{m}\preceq 1.1\Sigma_{*} (10)
Proof.

Firstly, we fix m[M]m\in[M] and recall Xm,i:=ϕ(smi,am,1i)ϕ(smi,am,0i)X_{m,i}:=\phi(s_{m}^{i},a_{m,1}^{i})-\phi(s_{m}^{i},a_{m,0}^{i}). Thus Σm=1ni=1nXm,iXm,i\Sigma_{m}=\dfrac{1}{n}\sum_{i=1}^{n}X_{m,i}X_{m,i}^{\top}, and 𝔼gΣm=Σ\mathbb{E}_{g}\Sigma_{m}=\Sigma_{*}. Since ϕ\phi is bounded from Assumption 1, Xm,iX_{m,i} (i=1,,n)(i=1,\cdots,n) are independent sub-gaussian random vectors. We have, with probability at least 12exp(t2)1-2\exp(-t^{2}),

ΣmΣ\displaystyle\|\Sigma_{m}-\Sigma_{*}\| =1ni=1nXm,iXm,iΣ\displaystyle=\|\dfrac{1}{n}\sum_{i=1}^{n}X_{m,i}X_{m,i}^{\top}-\Sigma_{*}\|
Σmax(η,η2),η=dn+tn.\displaystyle\lesssim\|\Sigma_{*}\|\max(\eta,\eta^{2}),\quad\eta=\sqrt{\dfrac{d}{n}}+\dfrac{t}{\sqrt{n}}.

See Theorem 4.6.1 in Vershynin, (2018) or Lemma 7 in Tripuraneni et al., (2020) for reference. Thus, since \|\Sigma_{*}\|\leq C_{\text{max}}, we have, with probability at least 1-\delta,

ΣmΣd+log(1/δ)n.\|\Sigma_{m}-\Sigma_{*}\|\lesssim\sqrt{\dfrac{d+\log(1/\delta)}{n}}.

Secondly, we take a union bound over all m\in[M]. Substituting \delta with \delta/M, we obtain that, with probability at least 1-\delta, for any m\in[M],

ΣmΣd+log(M/δ)n.\|\Sigma_{m}-\Sigma_{*}\|\lesssim\sqrt{\dfrac{d+\log(M/\delta)}{n}}.

Thus \|\Sigma_{m}-\Sigma_{*}\|\leq 0.1\lambda_{\min}(\Sigma_{*}) holds for every m\in[M] once n\gg d+\log(M/\delta) (recall \lambda_{\min}(\Sigma_{*})\geq C_{\text{min}} from Assumption 3), which yields (10); replacing \delta with \delta/10 gives the stated probability. ∎
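As a numerical sanity check of Lemma 1 (a sketch with a Gaussian, hence sub-Gaussian, distribution for the feature differences and an arbitrary positive-definite \Sigma_{*}; both choices are illustrative), the empirical covariance concentrates at the stated rate:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 20_000                                # feature dimension and sample size

# Population covariance Sigma_* of the feature differences X = phi(s, a_1) - phi(s, a_0).
Sigma_star = np.diag(rng.uniform(0.5, 2.0, size=d))

# Draw n sub-Gaussian feature differences and form the empirical covariance Sigma_m.
X = rng.normal(size=(n, d)) @ np.linalg.cholesky(Sigma_star).T
Sigma_m = X.T @ X / n

# Relative spectral-norm deviation is of order sqrt(d/n), hence far below 0.1 when n >> d.
dev = np.linalg.norm(Sigma_m - Sigma_star, 2) / np.linalg.norm(Sigma_star, 2)
print(f"relative deviation: {dev:.3f}")
```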

Theorem 10 (Guarantee on source data).

Suppose Assumption 1-3 hold. If the number of samples satisfies nd+log(M/δ)n\gg d+\log(M/\delta) for δ(0,1)\delta\in(0,1), then with probability at least 1δ/51-\delta/5,

m=1MXmU^α^mXmUαm2Mr+rdlogd+log(1/δ).\sum_{m=1}^{M}\|X_{m}\hat{U}\hat{\alpha}_{m}-X_{m}U^{*}\alpha_{m}^{*}\|^{2}\lesssim Mr+rd\log{d}+\log{(1/\delta)}. (11)
Proof.

We assume that (10) is true, which happens with probability at least 1δ/101-\delta/10 according to Lemma 1.

Step 1: Optimality of θ^\hat{\theta}. Consider the following loss function, along with its first and second derivatives,

l𝒟(θ)=1Mnm=1Mi=1nlog[𝟏(ymi=1)1exp(θm,Xm,i)+1+𝟏(ymi=0)exp(θm,Xm,i)exp(θm,Xm,i)+1]\displaystyle l_{\mathcal{D}}(\theta)=-\dfrac{1}{Mn}\sum_{m=1}^{M}\sum_{i=1}^{n}\log[\mathbf{1}(y_{m}^{i}=1)\dfrac{1}{\exp(-\langle\theta_{m},X_{m,i}\rangle)+1}+\mathbf{1}(y_{m}^{i}=0)\dfrac{\exp(-\langle\theta_{m},X_{m,i}\rangle)}{\exp(-\langle\theta_{m},X_{m,i}\rangle)+1}]
θml𝒟(θ)=1Mni=1n[𝟏(ymi=1)exp(θm,Xm,i)exp(θm,Xm,i)+1𝟏(ymi=0)1exp(θm,Xm,i)+1]Xm,i\displaystyle\nabla_{\theta_{m}}l_{\mathcal{D}}(\theta)=-\dfrac{1}{Mn}\sum_{i=1}^{n}\left[\mathbf{1}(y_{m}^{i}=1)\dfrac{\exp(-\langle\theta_{m},X_{m,i}\rangle)}{\exp(-\langle\theta_{m},X_{m,i}\rangle)+1}-\mathbf{1}(y_{m}^{i}=0)\dfrac{1}{\exp(-\langle\theta_{m},X_{m,i}\rangle)+1}\right]X_{m,i}
θm2l𝒟(θ)=1Mni=1nexp(θm,Xm,i)(exp(θm,Xm,i)+1)2Xm,iXm,i\displaystyle\nabla_{\theta_{m}}^{2}l_{\mathcal{D}}(\theta)=\dfrac{1}{Mn}\sum_{i=1}^{n}\dfrac{\exp(-\langle\theta_{m},X_{m,i}\rangle)}{(\exp(-\langle\theta_{m},X_{m,i}\rangle)+1)^{2}}\cdot X_{m,i}X_{m,i}^{\top}

Thus we have,

θl𝒟(θ)=(θ1l𝒟(θ),,θMl𝒟(θ));\displaystyle\nabla_{\theta}l_{\mathcal{D}}(\theta)=(\nabla_{\theta_{1}}l_{\mathcal{D}}(\theta)^{\top},\cdots,\nabla_{\theta_{M}}l_{\mathcal{D}}(\theta)^{\top})^{\top};
θ2l𝒟(θ)=diag(θ12l𝒟(θ),,θM2l𝒟(θ)).\displaystyle\nabla_{\theta}^{2}l_{\mathcal{D}}(\theta)=\text{diag}(\nabla_{\theta_{1}}^{2}l_{\mathcal{D}}(\theta),\cdots,\nabla_{\theta_{M}}^{2}l_{\mathcal{D}}(\theta)).

Concerning the second derivative, since |θm,Xm,i|2BL|\langle\theta_{m},X_{m,i}\rangle|\leq 2BL, we have exp(θm,Xm,i)(exp(θm,Xm,i)+1)2γ\dfrac{\exp(-\langle\theta_{m},X_{m,i}\rangle)}{(\exp(-\langle\theta_{m},X_{m,i}\rangle)+1)^{2}}\geq\gamma where γ=12+exp(2BL)+exp(2BL)\gamma=\dfrac{1}{2+\exp(-2BL)+\exp(2BL)}. Therefore,

θm2l𝒟(θ)γMni=1nXm,iXm,i=γMΣm.\nabla_{\theta_{m}}^{2}l_{\mathcal{D}}(\theta)\succeq\dfrac{\gamma}{Mn}\sum_{i=1}^{n}X_{m,i}X_{m,i}^{\top}=\dfrac{\gamma}{M}\Sigma_{m}.

Thus, letting \Sigma_{\text{sum}}=\text{diag}(\Sigma_{1},\cdots,\Sigma_{M})\in\mathbb{R}^{Md\times Md}, from the convexity we have

l𝒟(θ^)l𝒟(θ)l𝒟(θ),θ^θγ2M(θ^θ)Σsum(θ^θ)=γ2Mθ^θΣsum2.l_{\mathcal{D}}(\hat{\theta})-l_{\mathcal{D}}(\theta^{*})-\langle\nabla l_{\mathcal{D}}(\theta^{*}),\hat{\theta}-\theta^{*}\rangle\geq\dfrac{\gamma}{2M}(\hat{\theta}-\theta^{*})^{\top}\Sigma_{\text{sum}}(\hat{\theta}-\theta^{*})=\dfrac{\gamma}{2M}\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}^{2}.

Here the matrix-induced norm is defined as xΣ=xΣx\|x\|_{\Sigma}=\sqrt{x^{\top}\Sigma x}. In the following, we will use ZΣsum\|Z\|_{\Sigma_{\text{sum}}} to denote the Σsum\Sigma_{\text{sum}}-norm of ZZ’s vectorization when ZZ is a matrix. From the optimality of θ^\hat{\theta}, we have l𝒟(θ^)l𝒟(θ)0l_{\mathcal{D}}(\hat{\theta})-l_{\mathcal{D}}(\theta^{*})\leq 0, thus

l𝒟(θ),θ^θγ2Mθ^θΣsum2.-\langle\nabla l_{\mathcal{D}}(\theta^{*}),\hat{\theta}-\theta^{*}\rangle\geq\dfrac{\gamma}{2M}\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}^{2}. (12)

Applying the Cauchy–Schwarz inequality immediately yields

\|\nabla l_{\mathcal{D}}(\theta^{*})\|_{\Sigma_{\text{sum}}^{-1}}\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}\geq\dfrac{\gamma}{2M}\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}^{2} (13)

Concerning the first derivative, notice that ymi|Xm,iBer(1exp(θm,Xm,i)+1)y_{m}^{i}|X_{m,i}\sim\texttt{Ber}\left(\dfrac{1}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}\right), we define a random vector VmnV_{m}\in\mathbb{R}^{n} with components Vm,iV_{m,i} as follows,

\displaystyle V_{m,i}=\mathbf{1}(y_{m}^{i}=1)\dfrac{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}-\mathbf{1}(y_{m}^{i}=0)\dfrac{1}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}
\displaystyle=\left\{\begin{aligned} &\dfrac{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}\quad\text{w.p. }\dfrac{1}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}\\ &\dfrac{-1}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}\quad\text{w.p. }\dfrac{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)}{\exp(-\langle\theta_{m}^{*},X_{m,i}\rangle)+1}\end{aligned}\right. (14)

Here V_{m} depends only on y_{m} when conditioned on X_{m}. Thus \nabla_{\theta_{m}}l_{\mathcal{D}}(\theta^{*})=-\dfrac{1}{Mn}X_{m}^{\top}V_{m}. So conditioned on X_{m}, V_{m} is a random vector with independent components satisfying \mathbb{E}V_{m,i}=0 and |V_{m,i}|\leq 1, and hence sub-Gaussian. Denote A_{m}=\dfrac{1}{n^{2}}X_{m}\Sigma_{m}^{-1}X_{m}^{\top}; we can establish the following bound (see Appendix B.1 of Zhu et al., (2023)):

tr(Am)dn,tr(Am2)dn2,Am1n.\text{tr}(A_{m})\leq\dfrac{d}{n},\quad\text{tr}(A_{m}^{2})\leq\dfrac{d}{n^{2}},\quad\|A_{m}\|\leq\dfrac{1}{n}.

Therefore, Bernstein's inequality for quadratic forms of sub-Gaussian random variables, applied with A=\text{diag}(A_{1},\cdots,A_{M}) (see Theorem 2.1 of Hsu et al., (2012)), directly implies that, with probability at least 1-\delta/20,

M^{2}\|\nabla l_{\mathcal{D}}(\theta^{*})\|_{\Sigma_{\text{sum}}^{-1}}^{2}=\sum_{m=1}^{M}\dfrac{1}{n^{2}}V_{m}^{\top}X_{m}\Sigma_{m}^{-1}X_{m}^{\top}V_{m}\lesssim\dfrac{Md+\log(1/\delta)}{n}. (15)

Combine (15) with (13), with probability at least 1δ/201-\delta/20,

θ^θΣsumMd+log(1/δ)n.\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}\lesssim\sqrt{\dfrac{Md+\log(1/\delta)}{n}}. (16)

This implies that

Md+log(1/δ)\displaystyle Md+\log(1/\delta) m=1M(θ^mθm)XmXm(θ^mθm)\displaystyle\gtrsim\sum_{m=1}^{M}(\hat{\theta}_{m}-\theta_{m}^{*})^{\top}X_{m}^{\top}X_{m}(\hat{\theta}_{m}-\theta_{m}^{*})
nm=1M(θ^mθm)0.9Σ(θ^mθm)(using (10))\displaystyle\geq n\sum_{m=1}^{M}(\hat{\theta}_{m}-\theta_{m}^{*})^{\top}0.9\Sigma_{*}(\hat{\theta}_{m}-\theta_{m}^{*})\qquad(\text{using }\eqref{eq:boundsigma})
0.9nCminθ^θ2(using Assumption 3)\displaystyle\geq 0.9nC_{\text{min}}\|\hat{\theta}-\theta^{*}\|^{2}\qquad\qquad\qquad(\text{using Assumption }\ref{as:feature})
nθ^θ2.\displaystyle\gtrsim n\|\hat{\theta}-\theta^{*}\|^{2}.

Thus, with probability at least 1δ/201-\delta/20,

\|\hat{\Theta}-\Theta^{*}\|_{F}^{2}=\|\hat{\theta}-\theta^{*}\|^{2}\lesssim\dfrac{Md+\log(1/\delta)}{n}. (17)
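For intuition, the sketch below implements the single-party, unconstrained analogue of the estimator analyzed in Step 1: a maximum-likelihood fit of the logistic loss l_{\mathcal{D}} on pairwise-comparison data via plain gradient descent (the data-generating choices are illustrative, and the shared low-rank structure \hat{U} as well as the pessimism used later are omitted):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 8, 5_000
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# One party's true parameter and comparison features X_i = phi(s_i, a_1^i) - phi(s_i, a_0^i).
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
X = rng.normal(size=(n, d))
y = rng.binomial(1, sigmoid(X @ theta_star))      # y_i = 1 iff a_1^i is preferred

# Maximum-likelihood estimate: gradient descent on the negative log-likelihood.
theta = np.zeros(d)
for _ in range(2_000):
    grad = X.T @ (sigmoid(X @ theta) - y) / n     # gradient of the logistic loss
    theta -= 1.0 * grad

print(np.linalg.norm(theta - theta_star))         # shrinks at roughly the sqrt(d/n) rate
```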

Step 2: Combining with low-dimensional assumption. Noticing that \text{rank}(\hat{\Theta}-\Theta^{*})\leq 2r, we can write \hat{\Theta}-\Theta^{*}=QR=(Q\mathbf{r}_{1},\cdots,Q\mathbf{r}_{M}), where Q\in\mathcal{O}_{d\times 2r} and R=(\mathbf{r}_{1},\cdots,\mathbf{r}_{M})\in\mathbb{R}^{2r\times M}. Thus the vectorization can be rewritten as \hat{\theta}-\theta^{*}=[(Q\mathbf{r}_{1})^{\top},\cdots,(Q\mathbf{r}_{M})^{\top}]^{\top}. For each m\in[M], we further write X_{m}Q=Z_{m}W_{m}, where Z_{m}\in\mathcal{O}_{n\times 2r} and W_{m}\in\mathbb{R}^{2r\times 2r}. Then we have

l𝒟(θ),θ^θ\displaystyle-\langle\nabla l_{\mathcal{D}}(\theta^{*}),\hat{\theta}-\theta^{*}\rangle =m=1M1MnVmXmQ𝐫m\displaystyle=\sum_{m=1}^{M}\dfrac{1}{Mn}V_{m}^{\top}X_{m}Q\mathbf{r}_{m}
=1Mnm=1MVmZmWm𝐫m\displaystyle=\dfrac{1}{Mn}\sum_{m=1}^{M}V_{m}^{\top}Z_{m}W_{m}\mathbf{r}_{m}
1Mnm=1MZmVmWm𝐫m\displaystyle\leq\dfrac{1}{Mn}\sum_{m=1}^{M}\|Z_{m}^{\top}V_{m}\|\|W_{m}\mathbf{r}_{m}\|
1Mnm=1MZmVm2m=1MWm𝐫m2\displaystyle\leq\dfrac{1}{Mn}\sqrt{\sum_{m=1}^{M}\|Z_{m}^{\top}V_{m}\|^{2}}\sqrt{\sum_{m=1}^{M}\|W_{m}\mathbf{r}_{m}\|^{2}}
=1Mnm=1MZmVm2m=1MZmWm𝐫m2\displaystyle=\dfrac{1}{Mn}\sqrt{\sum_{m=1}^{M}\|Z_{m}^{\top}V_{m}\|^{2}}\sqrt{\sum_{m=1}^{M}\|Z_{m}W_{m}\mathbf{r}_{m}\|^{2}}
=1Mnm=1MZmVm2m=1MXmQ𝐫m2\displaystyle=\dfrac{1}{Mn}\sqrt{\sum_{m=1}^{M}\|Z_{m}^{\top}V_{m}\|^{2}}\sqrt{\sum_{m=1}^{M}\|X_{m}Q\mathbf{r}_{m}\|^{2}} (18)

The second term can be computed as \sum_{m=1}^{M}\|X_{m}Q\mathbf{r}_{m}\|^{2}=\sum_{m=1}^{M}\|X_{m}(\hat{\theta}_{m}-\theta_{m}^{*})\|^{2}=n\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}^{2}.

Step 3: Applying the \epsilon-net argument. Next, we give a high-probability upper bound on \sum_{m=1}^{M}\|Z_{m}^{\top}V_{m}\|^{2} using the randomness of V_{m}. Since Z_{m} depends on Q, which itself depends on the data (and in particular on V), we use an \epsilon-net argument to cover all possible Q\in\mathcal{O}_{d\times 2r}. First, for any fixed \bar{Q}\in\mathcal{O}_{d\times 2r}, we let X_{m}\bar{Q}=\bar{Z}_{m}\bar{W}_{m} where \bar{Z}_{m}\in\mathcal{O}_{n\times 2r}. The \bar{Z}_{m} defined in this way are independent of V:

m=1MZ¯mVm2=m=1MVmZ¯mZ¯mVm.\sum_{m=1}^{M}\|\bar{Z}_{m}^{\top}V_{m}\|^{2}=\sum_{m=1}^{M}V_{m}\bar{Z}_{m}\bar{Z}_{m}^{\top}V_{m}.

From the orthonormality of Z¯m\bar{Z}_{m}, we can establish the following bound

\text{tr}(\bar{Z}_{m}\bar{Z}_{m}^{\top})=2r,\quad\text{tr}\big((\bar{Z}_{m}\bar{Z}_{m}^{\top})^{2}\big)=2r,\quad\|\bar{Z}_{m}\bar{Z}_{m}^{\top}\|=1.

Again, use Bernstein’s inequality for sub-Gaussian random variables in the quadratic form as we did in (15), we have, with probability at least 1δ1-\delta^{\prime},

m=1MZ¯mVm2Mr+log(1/δ).\sum_{m=1}^{M}\|\bar{Z}_{m}^{\top}V_{m}\|^{2}\lesssim Mr+\log(1/\delta^{\prime}).

Combine this with (C.1), we obtain that

l𝒟(θ),Q¯RMr+log(1/δ)M2nQ¯RΣsum.-\langle\nabla l_{\mathcal{D}}(\theta^{*}),\bar{Q}R\rangle\leq\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|\bar{Q}R\|_{\Sigma_{\text{sum}}}.

Lemma A.5 in Du et al., (2020) implies that: there exists an ϵ\epsilon-net 𝒩\mathcal{N} of 𝒪d×2r\mathcal{O}_{d\times 2r} in Frobenius norm such that |𝒩|(62rϵ)2rd|\mathcal{N}|\leq(\dfrac{6\sqrt{2r}}{\epsilon})^{2rd}. Applying a union bound over 𝒩\mathcal{N}, we obtain that, with probability at least 1|𝒩|δ1-|\mathcal{N}|\delta^{\prime},

l𝒟(θ),Q¯RMr+log(1/δ)M2nQ¯RΣsum,Q¯𝒩.-\langle\nabla l_{\mathcal{D}}(\theta^{*}),\bar{Q}R\rangle\leq\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|\bar{Q}R\|_{\Sigma_{\text{sum}}},\quad\forall\bar{Q}\in\mathcal{N}. (19)

Choosing δ=δ20(62rϵ)2rd\delta^{\prime}=\dfrac{\delta}{20(\frac{6\sqrt{2r}}{\epsilon})^{2rd}}, we obtain that (19) holds with probability at least 1δ/201-\delta/20.

Then we apply ϵ\epsilon-net. Let Q¯𝒩\bar{Q}\in\mathcal{N} such that Q¯QFϵ\|\bar{Q}-Q\|_{F}\leq\epsilon, then we have

(Q¯Q)RΣsum2\displaystyle\|(\bar{Q}-Q)R\|_{\Sigma_{\text{sum}}}^{2}
=\displaystyle= 1nm=1MXm(Q¯Q)𝐫m2\displaystyle\dfrac{1}{n}\sum_{m=1}^{M}\|X_{m}(\bar{Q}-Q)\mathbf{r}_{m}\|^{2}
\displaystyle\leq 1nm=1MXm2Q¯Q2𝐫m2\displaystyle\dfrac{1}{n}\sum_{m=1}^{M}\|X_{m}\|^{2}\|\bar{Q}-Q\|^{2}\|\mathbf{r}_{m}\|^{2}
\displaystyle\leq Σϵ2m=1M𝐫m2\displaystyle\|\Sigma_{*}\|\epsilon^{2}\sum_{m=1}^{M}\|\mathbf{r}_{m}\|^{2}
\displaystyle\leq Cmaxϵ2RF2\displaystyle C_{\text{max}}\epsilon^{2}\|R\|_{F}^{2}
=\displaystyle= Cmaxϵ2QRF2\displaystyle C_{\text{max}}\epsilon^{2}\|QR\|_{F}^{2}
\displaystyle\lesssim Md+log(1/δ)nϵ2.(using (17))\displaystyle\dfrac{Md+\log(1/\delta)}{n}\epsilon^{2}.\qquad\qquad(\text{using }\eqref{eq:bound_theta_F}) (20)

Step 4: Finishing the proof. We have the following inequalities,

γ2Mθ^θΣsum2\displaystyle\dfrac{\gamma}{2M}\|\hat{\theta}-\theta^{*}\|_{\Sigma_{\text{sum}}}^{2}
\displaystyle\leq l𝒟(θ),QR( using (12))\displaystyle-\langle\nabla l_{\mathcal{D}}(\theta^{*}),QR\rangle\qquad\qquad\qquad\qquad\qquad\qquad\qquad(\text{ using }\eqref{eq:convexl})
=\displaystyle= l𝒟(θ),Q¯Rl𝒟(θ),(Q¯Q)R\displaystyle-\langle\nabla l_{\mathcal{D}}(\theta^{*}),\bar{Q}R\rangle-\langle\nabla l_{\mathcal{D}}(\theta^{*}),(\bar{Q}-Q)R\rangle
\displaystyle\lesssim\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|\bar{Q}R\|_{\Sigma_{\text{sum}}}+\|\nabla l_{\mathcal{D}}(\theta^{*})\|_{\Sigma_{\text{sum}}^{-1}}\|(\bar{Q}-Q)R\|_{\Sigma_{\text{sum}}}\qquad(\text{using }\eqref{eq:unionz})
\displaystyle\lesssim Mr+log(1/δ)M2n(QRΣsum+(Q¯Q)RΣsum)\displaystyle\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}(\|QR\|_{\Sigma_{\text{sum}}}+\|(\bar{Q}-Q)R\|_{\Sigma_{\text{sum}}})
+\displaystyle+ Md+log(1/δ)M2n(Q¯Q)RΣsum(using (15))\displaystyle\sqrt{\dfrac{Md+\log(1/\delta)}{M^{2}n}}\|(\bar{Q}-Q)R\|_{\Sigma_{\text{sum}}}\qquad\qquad\qquad\qquad(\text{using }\eqref{eq:gradi})
\displaystyle\leq Mr+log(1/δ)M2nQRΣsum+Md+log(1/δ)M2n(Q¯Q)RΣsum(using r<d,δδ)\displaystyle\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|QR\|_{\Sigma_{\text{sum}}}+\sqrt{\dfrac{Md+\log(1/\delta^{\prime})}{M^{2}n}}\|(\bar{Q}-Q)R\|_{\Sigma_{\text{sum}}}\quad(\text{using }r<d,\delta^{\prime}\leq\delta)
\displaystyle\lesssim Mr+log(1/δ)M2nQRΣsum+Md+log(1/δ)M2nMd+log(1/δ)nϵ(using (C.1))\displaystyle\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|QR\|_{\Sigma_{\text{sum}}}+\sqrt{\dfrac{Md+\log(1/\delta^{\prime})}{M^{2}n}}\sqrt{\dfrac{Md+\log(1/\delta)}{n}}\epsilon\qquad(\text{using }\eqref{eq:q-qr})
\displaystyle\leq Mr+log(1/δ)M2nQRΣsum+Md+log(1/δ)Mnϵ(using δδ)\displaystyle\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|QR\|_{\Sigma_{\text{sum}}}+\dfrac{Md+\log(1/\delta)}{Mn}\epsilon\qquad(\text{using }\delta^{\prime}\leq\delta)

Thus, we have

QRΣsum2M(Mr+log(1/δ)M2nQRΣsum+Md+log(1/δ)Mnϵ)\|QR\|_{\Sigma_{\text{sum}}}^{2}\lesssim M\left(\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{M^{2}n}}\|QR\|_{\Sigma_{\text{sum}}}+\dfrac{Md+\log(1/\delta)}{Mn}\epsilon\right)

Therefore, we let ϵ=r/d\epsilon=r/d and recall that δ=δ20(62rϵ)2rd\delta^{\prime}=\dfrac{\delta}{20(\frac{6\sqrt{2r}}{\epsilon})^{2rd}}, we have

QRΣsum\displaystyle\|QR\|_{\Sigma_{\text{sum}}} max(Mr+log(1/δ)n,Md+log(1/δ)nϵ)\displaystyle\lesssim\max\left(\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{n}},\sqrt{\dfrac{Md+\log(1/\delta)}{n}\epsilon}\right)
Mr+log(1/δ)n\displaystyle\leq\sqrt{\dfrac{Mr+\log(1/\delta^{\prime})}{n}}
=Mr+rdlog(r/ϵ)+log(1/δ)n\displaystyle=\sqrt{\dfrac{Mr+rd\log{(r/\epsilon)}+\log(1/\delta)}{n}}
Mr+rdlogd+log(1/δ)n\displaystyle\leq\sqrt{\dfrac{Mr+rd\log{d}+\log(1/\delta)}{n}}

Notice that

QRΣsum2=m=1M(θ^mθm)Σm(θ^mθm)=1nm=1MXmU^α^mXmUαm2\|QR\|_{\Sigma_{\text{sum}}}^{2}=\sum_{m=1}^{M}(\hat{\theta}_{m}-\theta_{m}^{*})^{\top}\Sigma_{m}(\hat{\theta}_{m}-\theta_{m}^{*})=\dfrac{1}{n}\sum_{m=1}^{M}\|X_{m}\hat{U}\hat{\alpha}_{m}-X_{m}U^{*}\alpha_{m}^{*}\|^{2}

and the high-probability results are used in (10)(17)(19), thus we obtain the needed inequality with a failure probability at most δ10+δ20+δ20=δ5\frac{\delta}{10}+\frac{\delta}{20}+\frac{\delta}{20}=\frac{\delta}{5}. ∎

Theorem 11 (Guarantee on target data).

Suppose Assumption 1-4 hold. If the number of samples satisfies nd+log(M/δ)n\gg d+\log(M/\delta) for δ(0,1)\delta\in(0,1), then with probability at least 1δ1-\delta, for any m[M]m\in[M]

θ^mθmΣm=U^α^mUαmΣmr2n+dr2logd+rlog(M/δ)Mn\|\hat{\theta}_{m}-\theta_{m}^{*}\|_{\Sigma_{m}}=\|\hat{U}\hat{\alpha}_{m}-U^{*}\alpha_{m}^{*}\|_{\Sigma_{m}}\lesssim\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}
Proof.

We assume Lemma 1 and Theorem 10 are true, which happens with probability at least 12δ/51-2\delta/5.

Step 1: Finding a rotation of U^\hat{U} near UU^{*}. Theorem 10 implies that

m=1M(θ^mθm)Σm(θ^mθm)Mr+rdlogd+log(1/δ)n\sum_{m=1}^{M}(\hat{\theta}_{m}-\theta_{m}^{*})^{\top}\Sigma_{m}(\hat{\theta}_{m}-\theta_{m}^{*})\lesssim\dfrac{Mr+rd\log{d}+\log{(1/\delta)}}{n}

Thus, Lemma 1 further shows that

m=1M(θ^mθm)Σ(θ^mθm)Mr+rdlogd+log(1/δ)n\sum_{m=1}^{M}(\hat{\theta}_{m}-\theta_{m}^{*})^{\top}\Sigma_{*}(\hat{\theta}_{m}-\theta_{m}^{*})\lesssim\dfrac{Mr+rd\log{d}+\log{(1/\delta)}}{n}

Therefore, combining this with Assumption 3 implies that

Θ^ΘF2Mr+rdlogd+log(1/δ)n.\|\hat{\Theta}-\Theta^{*}\|_{F}^{2}\leq\dfrac{Mr+rd\log{d}+\log{(1/\delta)}}{n}.

Then, we bound the singular values of \Theta^{*}=U^{*}\alpha^{*}. Note that \alpha^{*} is an r\times M matrix; by definition, \nu=\sigma_{r}(\frac{\Theta^{*\top}\Theta^{*}}{M})=\sigma_{r}\left(\frac{\alpha^{*\top}\alpha^{*}}{M}\right), and the worst-case condition number is \kappa=\sigma_{1}(\frac{\Theta^{*\top}\Theta^{*}}{M})/\nu=\sigma_{1}\left(\frac{\alpha^{*\top}\alpha^{*}}{M}\right)/\nu. When \Theta^{*} is well-conditioned in the sense that \nu>0 and \kappa\leq O(1), the assumption \|\alpha_{m}^{*}\|=\Theta(1) implies that \nu\geq\Omega(\frac{1}{r}) (Tripuraneni et al., 2020). Thus, we have

σ1(ΘΘ)σr2(ΘΘ)=MνκMνrM.\sqrt{\dfrac{\sigma_{1}(\Theta^{*\top}\Theta^{*})}{\sigma_{r}^{2}(\Theta^{*\top}\Theta^{*})}}=\dfrac{\sqrt{M\nu\kappa}}{M\nu}\leq\sqrt{\dfrac{r}{M}}.

Since \hat{U} and U^{*} are orthonormal, applying the general Davis–Kahan \sin\theta theorem (see Theorem 4 of Yu et al., 2014) to the matrices \hat{\Theta} and \Theta^{*} implies that there exists an orthonormal matrix \hat{O}\in\mathbb{R}^{r\times r} such that

U^O^UF\displaystyle\|\hat{U}\hat{O}-U^{*}\|_{F} σ1(ΘΘ)σr2(ΘΘ)min(r1/2Θ^Θ,Θ^ΘF)\displaystyle\lesssim\sqrt{\dfrac{\sigma_{1}(\Theta^{*\top}\Theta^{*})}{\sigma_{r}^{2}(\Theta^{*\top}\Theta^{*})}}\min\left(r^{1/2}\|\hat{\Theta}-\Theta^{*}\|,\|\hat{\Theta}-\Theta^{*}\|_{F}\right)
rMMr+rdlogd+log(1/δ)n\displaystyle\lesssim\sqrt{\dfrac{r}{M}}\sqrt{\dfrac{Mr+rd\log{d}+\log{(1/\delta)}}{n}}
=r2n+dr2logd+rlog(1/δ)Mn.\displaystyle=\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(1/\delta)}{Mn}}. (21)
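Step 1 can be mimicked numerically: given a perturbed estimate of \Theta^{*}, take its top-r left singular vectors as \hat{U} and align them with U^{*} by the best rotation \hat{O}, as the Davis–Kahan argument guarantees. The sketch below uses illustrative dimensions and an additive-noise stand-in for the estimation error of Theorem 10:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r, M = 30, 3, 50

# Ground-truth shared representation U* and per-party coefficients alpha_m*.
U_star, _ = np.linalg.qr(rng.normal(size=(d, r)))
alpha_star = rng.normal(size=(r, M))
Theta_star = U_star @ alpha_star

# A noisy estimate, standing in for the pooled-MLE output analyzed in Theorem 10.
Theta_hat = Theta_star + 0.02 * rng.normal(size=(d, M))

# Recover the representation as the top-r left singular vectors of Theta_hat.
U_hat = np.linalg.svd(Theta_hat, full_matrices=False)[0][:, :r]

# Align U_hat with U* by the best rotation O_hat (orthogonal Procrustes), as in the
# Davis-Kahan sin(theta) step, and measure the residual.
W, _, Vt = np.linalg.svd(U_hat.T @ U_star)
O_hat = W @ Vt
print(np.linalg.norm(U_hat @ O_hat - U_star))     # much smaller than ||U*||_F = sqrt(r)
```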

Step 2: Bounding α^m\hat{\alpha}_{m}. For simplicity, suppose U^UFr2n+dr2logd+rlog(1/δ)Mn\|\hat{U}-U^{*}\|_{F}\lesssim\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(1/\delta)}{Mn}}. We can rotate U^\hat{U} to achieve this. Then α^m\hat{\alpha}_{m} is obtained by minimizing the following loss function:

l𝒟m(αm)=1ni=1nlog[𝟏(ymi=1)1exp(U^αm,Xm,i)+1+𝟏(ymi=0)exp(U^αm,Xm,i)exp(U^αm,Xm,i)+1]l_{\mathcal{D}_{m}}(\alpha_{m})=-\dfrac{1}{n}\sum_{i=1}^{n}\log[\mathbf{1}(y_{m}^{i}=1)\dfrac{1}{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)+1}+\mathbf{1}(y_{m}^{i}=0)\dfrac{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)}{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)+1}]

Its first and second derivatives are given by,

\displaystyle\nabla l_{\mathcal{D}_{m}}(\alpha_{m})=-\dfrac{1}{n}\sum_{i=1}^{n}\left[\mathbf{1}(y_{m}^{i}=1)\dfrac{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)}{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)+1}-\mathbf{1}(y_{m}^{i}=0)\dfrac{1}{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)+1}\right]\hat{U}^{\top}X_{m,i}
2l𝒟m(αm)=1ni=1nexp(U^αm,Xm,i)(exp(U^αm,Xm,i)+1)2U^Xm,iXm,iU^\displaystyle\nabla^{2}l_{\mathcal{D}_{m}}(\alpha_{m})=\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)}{(\exp(-\langle\hat{U}\alpha_{m},X_{m,i}\rangle)+1)^{2}}\cdot\hat{U}^{\top}X_{m,i}X_{m,i}^{\top}\hat{U}

Similar to (13) in the proof of Theorem 10, we have

α^mαmU^ΣmU^l𝒟m(αm)(U^ΣmU^)1\|\hat{\alpha}_{m}-\alpha_{m}^{*}\|_{\hat{U}^{\top}\Sigma_{m}\hat{U}}\lesssim\|\nabla l_{\mathcal{D}_{m}}(\alpha_{m}^{*})\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}} (22)

We have \nabla l_{\mathcal{D}_{m}}(\alpha_{m}^{*})=-\dfrac{1}{n}\hat{U}^{\top}X_{m}^{\top}\hat{V}_{m}, where \hat{V}_{m,i}=V_{m,i}+\Delta_{i}, V_{m,i} is defined in (14), and \Delta_{i} is defined as follows

Δi=fi(U^)fi(U),fi(U)=exp(Uαm,Xm,i)exp(Uαm,Xm,i)+1\Delta_{i}=f_{i}(\hat{U})-f_{i}(U^{*}),\quad f_{i}(U)=\dfrac{\exp(-\langle U\alpha_{m}^{*},X_{m,i}\rangle)}{\exp(-\langle U\alpha_{m}^{*},X_{m,i}\rangle)+1}

We aim to bound Δi\Delta_{i} by considering the gradient of fi(U)f_{i}(U):

\nabla f_{i}(U)=-\dfrac{\exp(-\langle U\alpha_{m}^{*},X_{m,i}\rangle)}{(\exp(-\langle U\alpha_{m}^{*},X_{m,i}\rangle)+1)^{2}}\cdot X_{m,i}\alpha_{m}^{*\top}

Notice that both \|X_{m,i}\|_{2} and \|\alpha_{m}^{*}\|_{2}=\Theta(1) are bounded; therefore \|X_{m,i}\alpha_{m}^{*\top}\|_{F}=\|X_{m,i}\|_{2}\|\alpha_{m}^{*}\|_{2} is bounded. Thus

|Δi|=|fi(U^)fi(U)|U^UFXm,iαmFU^UF|\Delta_{i}|=|f_{i}(\hat{U})-f_{i}(U^{*})|\leq\|\hat{U}-U^{*}\|_{F}\|X_{m,i}\alpha_{m}^{\top}\|_{F}\lesssim\|\hat{U}-U^{*}\|_{F} (23)

Therefore,

l𝒟m(αm)(U^ΣmU^)1\displaystyle\|\nabla l_{\mathcal{D}_{m}}(\alpha_{m}^{*})\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}} =1nU^XmV^m(U^ΣmU^)1\displaystyle=\|-\dfrac{1}{n}\hat{U}^{\top}X_{m}^{\top}\hat{V}_{m}\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}}
\displaystyle\leq\left\|\dfrac{1}{n}\hat{U}^{\top}X_{m}^{\top}V_{m}\right\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}}+\left\|\dfrac{1}{n}\hat{U}^{\top}X_{m}^{\top}\Delta\right\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}}

Concerning the first term, similar to (15) in the proof of Theorem 10, with probability at least 1δ/201-\delta/20, it is bounded by

1nU^XmVm(U^ΣmU^)1r+log(1/δ)n\|-\dfrac{1}{n}\hat{U}^{\top}X_{m}^{\top}V_{m}\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}}\lesssim\sqrt{\dfrac{r+\log(1/\delta)}{n}}

Concerning the second term, denote X~m=XmU^\tilde{X}_{m}=X_{m}\hat{U},

\displaystyle\left\|\dfrac{1}{n}\hat{U}^{\top}X_{m}^{\top}\Delta\right\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}}=\sqrt{\dfrac{1}{n}\Delta^{\top}\tilde{X}_{m}(\tilde{X}_{m}^{\top}\tilde{X}_{m})^{-1}\tilde{X}_{m}^{\top}\Delta}
\displaystyle\leq\dfrac{1}{\sqrt{n}}\|\Delta\|_{2}
\displaystyle\lesssim\|\hat{U}-U^{*}\|_{F}\qquad(\text{using }\eqref{eq:deltai})

Thus, (22) implies that

\|\hat{\alpha}_{m}-\alpha_{m}^{*}\|_{\hat{U}^{\top}\Sigma_{m}\hat{U}}\lesssim\|\nabla l_{\mathcal{D}_{m}}(\alpha_{m}^{*})\|_{(\hat{U}^{\top}\Sigma_{m}\hat{U})^{-1}}\lesssim\sqrt{\dfrac{r+\log(1/\delta)}{n}}+\|\hat{U}-U^{*}\|_{F} (24)

Step 3: Finishing the proof. Therefore, we have the following inequalities,

U^α^mUαmΣm\displaystyle\|\hat{U}\hat{\alpha}_{m}-U^{*}\alpha_{m}^{*}\|_{\Sigma_{m}} U^α^mU^αmΣm+U^αmUαmΣm\displaystyle\leq\|\hat{U}\hat{\alpha}_{m}-\hat{U}\alpha_{m}^{*}\|_{\Sigma_{m}}+\|\hat{U}\alpha_{m}^{*}-U^{*}\alpha_{m}^{*}\|_{\Sigma_{m}}
=α^mαmU^ΣmU^+αm(U^U)Σm(U^U)αm\displaystyle=\|\hat{\alpha}_{m}-\alpha_{m}^{*}\|_{\hat{U}^{\top}\Sigma_{m}\hat{U}}+\sqrt{\alpha_{m}^{*\top}(\hat{U}-U^{*})^{\top}\Sigma_{m}(\hat{U}-U^{*})\alpha_{m}^{*}}
α^mαmU^ΣU^+αm2(U^U)Σ(U^U)1/2(using (10))\displaystyle\lesssim\|\hat{\alpha}_{m}-\alpha_{m}^{*}\|_{\hat{U}^{\top}\Sigma_{*}\hat{U}}+\|\alpha_{m}^{*}\|_{2}\|(\hat{U}-U^{*})^{\top}\Sigma_{*}(\hat{U}-U^{*})\|^{1/2}\quad(\text{using }\eqref{eq:boundsigma})
α^mαmU^ΣU^+U^UF(using Assumption 2,3)\displaystyle\lesssim\|\hat{\alpha}_{m}-\alpha_{m}^{*}\|_{\hat{U}^{\top}\Sigma_{*}\hat{U}}+\|\hat{U}-U^{*}\|_{F}\qquad(\text{using Assumption }\ref{as:share},\ref{as:feature})
r+log(1/δ)n+U^UF(using (24))\displaystyle\lesssim\sqrt{\dfrac{r+\log(1/\delta)}{n}}+\|\hat{U}-U^{*}\|_{F}\qquad(\text{using }\eqref{eq:furfea})
r+log(1/δ)n+r2n+dr2logd+rlog(1/δ)Mn(using (C.1))\displaystyle\lesssim\sqrt{\dfrac{r+\log(1/\delta)}{n}}+\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(1/\delta)}{Mn}}\qquad(\text{using }\eqref{eq:rotate})
r2n+dr2logd+rlog(1/δ)Mn.\displaystyle\lesssim\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(1/\delta)}{Mn}}.

Notice that the high-probability results are used in Lemma 1, Theorem 10, and (24), thus we obtain the needed inequality with a failure probability at most 2δ5+δ203δ5\frac{2\delta}{5}+\frac{\delta}{20}\leq\frac{3\delta}{5} for a fixed mm. Finally, applying a union bound for m[M]m\in[M] and substituting δ\delta with δ/M\delta/M finishes the proof. ∎

C.2 Proof of Theorem 2

Proof.

Decompose the sub-optimality into three terms:

J(π)J(π^)=(J(π)J^(π))+(J^(π)J^(π^))+(J^(π^)J(π^)).J(\pi^{*})-J(\hat{\pi})=\left(J(\pi^{*})-\hat{J}(\pi^{*})\right)+\left(\hat{J}(\pi^{*})-\hat{J}(\hat{\pi})\right)+\left(\hat{J}(\hat{\pi})-J(\hat{\pi})\right).

The second term satisfies \hat{J}(\pi^{*})-\hat{J}(\hat{\pi})\leq 0 by the definition of \hat{\pi}. From Theorem 1, with probability at least 1-\delta, \theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) for all m; then the third term satisfies

J^(π^)J(π^)\displaystyle\hat{J}(\hat{\pi})-J(\hat{\pi})
=\displaystyle= minθmΘB(θ^m)𝔼sρm=1M(Rm,θm(s,π^(s))Rm,θm0(s))𝔼sρm=1M(Rm,θm(s,π^(s))Rm,θm0(s))0\displaystyle\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,{\theta}_{m}}(s,\hat{\pi}(s))-R_{m,\theta_{m}}^{0}(s))-\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,\theta_{m}^{*}}(s,\hat{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))\leq 0

Therefore with probability at least 1δ1-\delta,

J(π)J(π^)J(π)J^(π)\displaystyle J(\pi^{*})-J(\hat{\pi})\leq J(\pi^{*})-\hat{J}(\pi^{*})
=\displaystyle= maxθmΘB(θ^m){𝔼sρm=1M(Rm,θm(s,π(s))Rm,θm0(s))𝔼sρm=1M(Rm,θm(s,π(s))Rm,θm0(s))}\displaystyle\max_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\left\{\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\theta_{m}}^{0}(s))\right\}
=(a)\displaystyle\overset{(a)}{=} 𝔼sρ{[R1,θ1(s,π(s))R1,θ10(s)][m=2M(Rm,θm(s,π(s))Rm,θm0(s))m=2M(Rm,θ~m(s,π(s))Rm,θ~m0(s))]}\displaystyle\mathbb{E}_{s\sim\rho}\{[R_{1,{\theta}_{1}^{*}}(s,\pi^{*}(s))-R_{1,\theta_{1}^{*}}^{0}(s)][\prod_{m=2}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\prod_{m=2}^{M}(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))]\}
+\displaystyle+ 𝔼sρ{[(R1,θ1(s,π(s))R1,θ10(s))(R1,θ1(s,π(s))R1,θ10(s))][m=2M(Rm,θ~m(s,π(s))Rm,θ~m0(s))]}\displaystyle\mathbb{E}_{s\sim\rho}\{[(R_{1,{\theta}_{1}^{*}}(s,\pi^{*}(s))-R_{1,\theta_{1}^{*}}^{0}(s))-(R_{1,{\theta}_{1}}(s,\pi^{*}(s))-R_{1,\theta_{1}}^{0}(s))][\prod_{m=2}^{M}(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))]\}
\displaystyle\leq 2BL|𝔼sρm=2M(Rm,θm(s,π(s))Rm,θm0(s))m=2M(Rm,θ~m(s,π(s))Rm,θ~m0(s))|\displaystyle 2BL\left|\mathbb{E}_{s\sim\rho}\prod_{m=2}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\prod_{m=2}^{M}(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))\right|
+\displaystyle+ (2BL)M1|𝔼sρ((R1,θ1(s,π(s))R1,θ10(s))(R1,θ1(s,π(s))R1,θ10(s)))|\displaystyle(2BL)^{M-1}\left|\mathbb{E}_{s\sim\rho}\left((R_{1,{\theta}_{1}^{*}}(s,\pi^{*}(s))-R_{1,\theta_{1}^{*}}^{0}(s))-(R_{1,{\theta}_{1}}(s,\pi^{*}(s))-R_{1,\theta_{1}}^{0}(s))\right)\right|
(b)\displaystyle\overset{(b)}{\leq} m=1M(2BL)M1|𝔼sρ((Rm,θm(s,π(s))Rm,θm0(s))(Rm,θ~m(s,π(s))Rm,θ~m0(s)))|\displaystyle\sum_{m=1}^{M}(2BL)^{M-1}\left|\mathbb{E}_{s\sim\rho}\left((R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))\right)\right|
\displaystyle\leq (2BL)M1m=1M|𝔼sρ(Rm,θm(s,π(s))Rm,θ~m(s,π(s)))|+|𝔼sρ(Rm,θm0(s)Rm,θ~m0(s))|\displaystyle(2BL)^{M-1}\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim\rho}\left(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))\right)\right|+\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|
=\displaystyle= (2BL)M1m=1M|𝔼sρ(θmθ~m)ϕ(s,π(s))|+|𝔼sρ(Rm,θm0(s)Rm,θ~m0(s))|\displaystyle(2BL)^{M-1}\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim\rho}({\theta}_{m}^{*}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|+\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|

where \tilde{\theta}_{m} denotes the maximizing parameter, (a) holds due to the identity a_{1}a_{2}-b_{1}b_{2}=a_{1}(a_{2}-b_{2})+(a_{1}-b_{1})b_{2}, and (b) follows by applying the same decomposition recursively.

Separately, we have

|𝔼sρ(θmθ~m)ϕ(s,π(s))|\displaystyle\left|\mathbb{E}_{s\sim\rho}({\theta}_{m}^{*}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|
=\displaystyle= |𝔼sρ(θmθ^m+θ^mθ~m)ϕ(s,π(s))|\displaystyle\left|\mathbb{E}_{s\sim\rho}({\theta}_{m}^{*}-\hat{\theta}_{m}+\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|
\displaystyle\leq |𝔼sρ(θmθ^m)ϕ(s,π(s))|+|𝔼sρ(θ^mθ~m)ϕ(s,π(s))|\displaystyle\left|\mathbb{E}_{s\sim\rho}({\theta}_{m}^{*}-\hat{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|+\left|\mathbb{E}_{s\sim\rho}(\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|
\displaystyle\leq\|{\theta}_{m}^{*}-\hat{\theta}_{m}\|_{\Sigma_{m}}\|\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s))\|_{\Sigma_{m}^{-1}}+\|\hat{\theta}_{m}-\tilde{\theta}_{m}\|_{\Sigma_{m}}\|\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s))\|_{\Sigma_{m}^{-1}}\quad\text{(Cauchy–Schwarz inequality)}
\displaystyle\lesssim r2n+dr2logd+rlog(M/δ)Mn(Σm1/2𝔼sρϕ(s,π(s)))2\displaystyle\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\|(\Sigma_{m}^{-1/2}\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s)))\|_{2}

The last inequality holds because θ~mΘB(θ^m)\tilde{\theta}_{m}\in\Theta_{B}(\hat{\theta}_{m}) and θmΘB(θ^m)\theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) with probability at least 1δ1-\delta for m\forall m.

|𝔼sρ(Rm,θm0(s)Rm,θ~m0(s))|=|𝔼sρ(Rm,θm(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|=\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{*m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{m}(s))\right)\right|
(c)\displaystyle\overset{(c)}{\leq} |𝔼sρ(Rm,θm(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|+|𝔼sρ(Rm,θm(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{m}(s))\right)\right|+\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{*m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{*m}(s))\right)\right|
\displaystyle\leq |𝔼sρ(Rm,θm(s,π¯m(s))Rm,θ^m(s,π¯m(s))+Rm,θ^m(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{m}(s))-R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{m}(s))+R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{m}(s))\right)\right|
+\displaystyle+ |𝔼sρ(Rm,θm(s,π¯m(s))Rm,θ^m(s,π¯m(s))+Rm,θ^m(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{*m}(s))-R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{*m}(s))+R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{*m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{*m}(s))\right)\right|
\displaystyle\leq |𝔼sρ(θmθ^m)ϕ(s,π¯m(s))|+|𝔼sρ(θ^mθ~m)ϕ(s,π¯m(s))|\displaystyle\left|\mathbb{E}_{s\sim\rho}(\theta_{m}^{*}-\hat{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{m}(s))\right|+\left|\mathbb{E}_{s\sim\rho}(\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{m}(s))\right|
+\displaystyle+ |𝔼sρ(θmθ^m)ϕ(s,π¯m(s))|+|𝔼sρ(θ^mθ~m)ϕ(s,π¯m(s))|\displaystyle\left|\mathbb{E}_{s\sim\rho}(\theta_{m}^{*}-\hat{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{*m}(s))\right|+\left|\mathbb{E}_{s\sim\rho}(\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{*m}(s))\right|
\displaystyle\lesssim r2n+dr2logd+rlog(M/δ)Mn{Σm12𝔼sρϕ(s,π¯m(s))2+Σm12𝔼sρϕ(s,π¯m(s))2}\displaystyle\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\left\{\|\Sigma_{m}^{-\frac{1}{2}}\mathbb{E}_{s\sim\rho}\phi(s,\underline{\pi}_{m}(s))\|_{2}+\|\Sigma_{m}^{-\frac{1}{2}}\mathbb{E}_{s\sim\rho}\phi(s,\underline{\pi}_{*m}(s))\|_{2}\right\}

where (c)(c) holds due to the equation |min{a1,a2}min{b1,b2}||a1b1|+|a2b2||\min\{a_{1},a_{2}\}-\min\{b_{1},b_{2}\}|\leq|a_{1}-b_{1}|+|a_{2}-b_{2}|. The last inequality holds because θ~mΘB(θ^m)\tilde{\theta}_{m}\in\Theta_{B}(\hat{\theta}_{m}) and θmΘB(θ^m)\theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) with probability at least 1δ1-\delta for m\forall m. Therefore

J(π)J(π^)Mr2n+dr2logd+rlog(M/δ)MnCJ(\pi^{*})-J(\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}C^{*}

Now we conclude the proof. ∎
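The Cauchy–Schwarz steps above are the familiar mechanism behind pessimism for linear rewards: the minimum of \theta^{\top}\phi over an ellipsoidal confidence set has a closed form involving the \Sigma_{m}^{-1}-norm of \phi. The sketch below verifies that closed form numerically in a simplified single-party setting (a purely ellipsoidal set, ignoring the additional norm constraint in \Theta_{B}(\hat{\theta}_{m}); all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 6
theta_hat = rng.normal(size=d)
Sigma = np.cov(rng.normal(size=(d, 500)))        # a positive-definite design covariance
phi = rng.normal(size=d)
rad = 0.3                                        # radius of the confidence ellipsoid

# Closed form: min of theta . phi over {theta : ||theta - theta_hat||_Sigma <= rad}.
pess = theta_hat @ phi - rad * np.sqrt(phi @ np.linalg.solve(Sigma, phi))

# Brute-force check over random points on the ellipsoid boundary.
L = np.linalg.cholesky(np.linalg.inv(Sigma))     # L L^T = Sigma^{-1}
dirs = rng.normal(size=(d, 20_000))
dirs /= np.linalg.norm(dirs, axis=0)
candidates = theta_hat[:, None] + rad * (L @ dirs)
sampled_min = (candidates * phi[:, None]).sum(axis=0).min()
print(pess, sampled_min)                         # sampled minimum approaches pess from above
```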

C.3 Proof of Theorem 3

To begin with, we state a lower bound for the single-party case. The proof appears in Zhu et al., (2023); for completeness, we present their example instance that achieves the lower bound.

Lemma 2 (Theorem 3.10, Zhu et al., (2023)).

Consider the family of instances

CB(𝒞)={ρ,{si,a1i,a0i}i=1n,θ|Σ12𝔼sρϕ(s,π(s))2𝒞}.\text{CB}(\mathcal{C})=\{\rho,\{s^{i},a_{1}^{i},a_{0}^{i}\}_{i=1}^{n},\theta^{*}|\|\Sigma^{-\frac{1}{2}}\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s))\|_{2}\leq\mathcal{C}\}.

Suppose that r>6,nr𝒞2,𝒞2r>6,n\gtrsim r\mathcal{C}^{2},\mathcal{C}\geq 2, then there exists a feature ϕ\phi such that the following lower bound holds.

infπ^sup𝒬CB(𝒞)SinSubOpt𝒬(π^)𝒞rn.\inf_{\hat{\pi}}\sup_{\mathcal{Q}\in\text{CB}(\mathcal{C})}\texttt{SinSubOpt}_{\mathcal{Q}}(\hat{\pi})\gtrsim\mathcal{C}\sqrt{\dfrac{r}{n}}.

Here the single-party sub-optimality is defined as \texttt{SinSubOpt}_{\mathcal{Q}}(\pi)=J(\pi^{*})-J(\pi), where J(\pi)=\mathbb{E}_{s\sim\rho}R_{\theta^{*}}(s,\pi(s)) and \pi^{*}=\arg\max_{\pi}J(\pi).

Proof.

Here we consider ϕ~=Uϕr\tilde{\phi}=U^{*\top}\phi\in\mathbb{R}^{r} for some U𝒪d×rU^{*}\in\mathcal{O}_{d\times r}, and we can construct some ϕ\phi directly from ϕ~\tilde{\phi}. Without loss of generality, assume that r/3r/3 is some integer. We set 𝒮=[r/3],𝒜={a1,a2,a3,a4}\mathcal{S}=[r/3],\mathcal{A}=\{a_{1},a_{2},a_{3},a_{4}\}, ρ=Unif([1,2,,|𝒮|])\rho=\texttt{Unif}([1,2,\cdots,|\mathcal{S}|]). Define the feature functions and the number of observations (for each state-action pair) as follows:

ϕ~(s,a1)=e3s+1e3s+2,ϕ~(s,a2)=e3s+1,ϕ~(s,a3)=0,ϕ~(s,a4)=e3s+2,\displaystyle\tilde{\phi}(s,a_{1})=e_{3s+1}-e_{3s+2},\tilde{\phi}(s,a_{2})=e_{3s+1},\tilde{\phi}(s,a_{3})=0,\tilde{\phi}(s,a_{4})=e_{3s+2},
n(s,a1,a2)=(n/|𝒮|)(12/𝒞2),n(s,a2,a3)=(n/|𝒮|)(2/𝒞2).\displaystyle n(s,a_{1},a_{2})=(n/|\mathcal{S}|)(1-2/\mathcal{C}^{2}),n(s,a_{2},a_{3})=(n/|\mathcal{S}|)(2/\mathcal{C}^{2}).

Let v1=(1/r,1/r+Δ,2/rΔ),v1=(1/r+2Δ,1/r+Δ,2/r3Δ)v_{-1}=(1/r,1/r+\Delta,-2/r-\Delta),v_{1}=(1/r+2\Delta,1/r+\Delta,-2/r-3\Delta) where Δ=𝒞|𝒮|/n\Delta=\mathcal{C}\sqrt{|\mathcal{S}|/n}. We construct 2|𝒮|2^{|\mathcal{S}|} instances, indexed by τ{1,1}|𝒮|\tau\in\{-1,1\}^{|\mathcal{S}|}, where each ατ=(vτ1,,vτ|𝒮|)\alpha_{\tau}^{*}=(v_{\tau_{1}},\cdots,v_{\tau_{|\mathcal{S}|}})^{\top}. One can verify that Σ12𝔼sρϕ(s,π(s))2𝒞\|\Sigma^{-\frac{1}{2}}\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s))\|_{2}\leq\mathcal{C} and θτ=ατ1(B=1)\|\theta_{\tau}^{*}\|=\|\alpha_{\tau}^{*}\|\leq 1(B=1). By applying Assouad’s lemma, Zhu et al., (2023) showed that there exists a ατ,τ{1,1}|𝒮|\alpha_{\tau},\tau\in\{-1,1\}^{|\mathcal{S}|} which achieves the lower bound. ∎

Now we present the detailed proof of Theorem 3.

Proof.

Suppose we have MM contextual bandit instances, each constructed identically from an individual with the same underlying reward functions, i.e., the same θm\theta_{m}^{*}. We set the instances as 𝒬m\mathcal{Q}_{m}, respectively, for m[M]m\in[M], which are designed to meet the lower bound for single-party cases in Lemma 2.

One can verify that the instance presented in Lemma 2 satisfies that C𝒞C^{*}\leq\mathcal{C} and θm=αm1(B=1)\|\theta_{m}^{*}\|=\|\alpha_{m}^{*}\|\leq 1(B=1). Thus, we construct an instance 𝒬(M)CB(M,𝒞)\mathcal{Q}(M)\in\text{CB}(M,\mathcal{C}) where 𝒬(M)=𝒬1××𝒬M\mathcal{Q}(M)=\mathcal{Q}_{1}\times\cdots\times\mathcal{Q}_{M} such that

SubOpt𝒬(M)(π^)=𝔼s(R(s,π)M(R(s,π)Δ(s))M),\texttt{SubOpt}_{\mathcal{Q}(M)}(\hat{\pi})=\mathbb{E}_{s}\left(R(s,\pi^{*})^{M}-(R(s,\pi^{*})-\Delta(s))^{M}\right),

where

\mathbb{E}_{s}\Delta(s)=\mathbb{E}_{s}(R(s,\pi^{*})-R(s,\hat{\pi}))\gtrsim\mathcal{C}\sqrt{\dfrac{r}{n}}

for any \hat{\pi}. Therefore, a first-order Taylor expansion implies that

SubOpt𝒬(M)(π^)\displaystyle\texttt{SubOpt}_{\mathcal{Q}(M)}(\hat{\pi}) 𝔼s(R(s,π)M1MΔ(s))\displaystyle\gtrsim\mathbb{E}_{s}\left(R(s,\pi^{*})^{M-1}M\Delta(s)\right)
𝒞Mrn.\displaystyle\gtrsim\mathcal{C}M\sqrt{\dfrac{r}{n}}.

The last inequality holds because R(s,π)R(s,\pi^{*}) is a constant. Here we disregard the scale of the reward function and thus conclude the proof. ∎
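The first-order step can be checked numerically: for small \Delta, R^{M}-(R-\Delta)^{M} behaves like MR^{M-1}\Delta, so the single-party gap \Delta is amplified by a factor of M (a toy computation with illustrative values):

```python
R, M = 0.8, 5                             # illustrative reward level and number of parties
for Delta in (0.05, 0.01, 0.002):
    exact = R**M - (R - Delta)**M         # multi-party sub-optimality on one instance
    linear = M * R**(M - 1) * Delta       # first-order approximation
    print(f"Delta={Delta:.3f}: exact={exact:.6f}, M*R^(M-1)*Delta={linear:.6f}")
# The two agree to first order, exhibiting the extra factor of M in the lower bound.
```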

C.4 Proof of Proposition 4

We first show an original result for the Pareto efficiency of π\pi^{*}.

Lemma 3.

π\pi^{*} is Pareto efficient concerning the true Nash value function J(π)J(\pi).

Proof.

Since

π\displaystyle\pi^{*} =argmax𝔼sρm=1M(Rm,θm(s,π(s))Rm,θm0(s))\displaystyle=\arg\max\mathbb{E}_{s\sim\rho}\prod_{m=1}^{M}(R_{m,\theta_{m}^{*}}(s,\pi(s))-R_{m,\theta_{m}^{*}}^{0}(s))

we can optimize \pi separately at each state. Therefore,

π(s)=argmaxam=1M(Rm,θm(s,a)Rm,θm0(s))\pi^{*}(s)=\arg\max_{a}\prod_{m=1}^{M}(R_{m,\theta_{m}^{*}}(s,a)-R_{m,\theta_{m}^{*}}^{0}(s))

Since the Nash product is strictly increasing in each individual's (positive) normalized reward R_{m,\theta_{m}^{*}}(s,a)-R_{m,\theta_{m}^{*}}^{0}(s), any policy that Pareto-dominates \pi^{*} would attain a strictly larger product, contradicting the optimality above. Hence \pi^{*} is a Pareto efficient solution. ∎

We then introduce a result akin to Theorem 2, without expectation on states. Define

V(s,π):=m=1M(Rm,θm(s,π(s))Rm,θm0(s))\displaystyle V(s,{\pi}):=\prod_{m=1}^{M}(R_{m,\theta_{m}^{*}}(s,{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))
V^(s,π)=minθmΘB(θ^m)m=1M(Rm,θm(s,π(s))Rm,θm0(s))\displaystyle\hat{V}(s,\pi)=\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\prod_{m=1}^{M}(R_{m,{\theta}_{m}}(s,\pi(s))-R_{m,\theta_{m}}^{0}(s))
Lemma 4.

Suppose Assumption 1-4 hold. If the number of samples satisfies nd+log(M/δ)n\gg d+\log(M/\delta) for δ(0,1)\delta\in(0,1), then with probability at least 1δ1-\delta,

V(s,\pi^{*})-V(s,\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}C^{*}(s)

where C^{*}(s)=\max_{m}\max_{\pi\in\{\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m}\}}\|\Sigma_{m}^{-1/2}\phi(s,\pi(s))\|_{2}.

Proof.

The result follows directly from the proof of Theorem 2 by setting \mathcal{S}=\{s\}, i.e., dropping the expectation over states. ∎

Now we present the detailed proof of Theorem 4.

Proof.

Define ξ=KMr2n+dr2logd+rlog(M/δ)MnC(s)\xi=KM\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}C^{*}(s), where KK is a fixed constant. Denote vm(s,π)=Rm,θm(s,π(s))Rm,θm0(s)v_{m}(s,\pi)=R_{m,\theta_{m}^{*}}(s,{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s). From Lemma 4, we have, with probability at least 1δ1-\delta,

V(s,π)V(s,π^)\displaystyle V(s,\pi^{*})-V(s,\hat{\pi}) =m=1M(Rm,θm(s,π(s))Rm,θm0(s))m=1M(Rm,θm(s,π^(s))Rm,θm0(s))\displaystyle=\prod_{m=1}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\prod_{m=1}^{M}(R_{m,{\theta}_{m}^{*}}(s,\hat{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))
=(m=1Mvm(s,π)m=1Mvm(s,π^))ξ\displaystyle=(\prod_{m=1}^{M}v_{m}(s,{\pi^{*}})-\prod_{m=1}^{M}v_{m}(s,\hat{\pi}))\leq\xi (25)

Consider another policy \tilde{\pi} satisfying v_{m}(s,\tilde{\pi})\geq v_{m}(s,\hat{\pi}) for all m<M; then

vM(s,π~)\displaystyle v_{M}(s,\tilde{\pi}) =m=1Mvm(s,π~)m=1M1vm(s,π~)\displaystyle=\dfrac{\prod_{m=1}^{M}v_{m}(s,\tilde{\pi})}{\prod_{m=1}^{M-1}v_{m}(s,\tilde{\pi})}
m=1Mvm(s,π)m=1M1vm(s,π^)(using the optimality of π)\displaystyle\leq\dfrac{\prod_{m=1}^{M}v_{m}(s,{\pi^{*}})}{{\prod_{m=1}^{M-1}v_{m}(s,\hat{\pi})}}\qquad\qquad(\text{using the optimality of $\pi^{*}$})
ξ+m=1Mvm(s,π^)m=1M1vm(s,π^)(using (25))\displaystyle\leq\dfrac{\xi+\prod_{m=1}^{M}v_{m}(s,\hat{\pi})}{\prod_{m=1}^{M-1}v_{m}(s,\hat{\pi})}\qquad(\text{using }\eqref{eq:app-par})
vM(s,π^)+ξm=1M1vm(s,π^)\displaystyle\leq v_{M}(s,\hat{\pi})+\dfrac{\xi}{\prod_{m=1}^{M-1}v_{m}(s,\hat{\pi})}
=vM(s,π^)(1+ξm=1Mvm(s,π^))\displaystyle=v_{M}(s,\hat{\pi})(1+\dfrac{\xi}{\prod_{m=1}^{M}v_{m}(s,\hat{\pi})})
vM(s,π^)(1+ξm=1Mvm(s,π)ξ)\displaystyle\leq v_{M}(s,\hat{\pi})(1+\dfrac{\xi}{\prod_{m=1}^{M}v_{m}(s,{\pi^{*}})-\xi})

Thus, compared with \hat{\pi}, the M-th individual's outcome v_{M}(s,\pi) can increase by a factor of at most 1+c\xi, where c is a constant depending on \prod_{m=1}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s)), the product of normalized rewards under \pi^{*}. Thus we conclude our proof. ∎

We further show that π^\hat{\pi} satisfies the approximate Pigou-Dalton principle.

Definition 5 (τ\tau-approximate Pigou-Dalton Principle).

A solution \pi satisfies the \tau-approximate Pigou–Dalton principle at a state if the collective welfare \prod_{m}(R_{m}-R_{m}^{0}) decreases by at most \tau under any move that (1) reduces the gap |(R_{i}-R_{i}^{0})-(R_{j}-R_{j}^{0})| between two (normalized) individual rewards R_{i}-R_{i}^{0} and R_{j}-R_{j}^{0}, and (2) preserves their total reward R_{i}+R_{j}.

This notion implies a preference for outcomes that are more equitable. In other words, it deems a transfer of utility from the rich to the poor as desirable (Patty and Penn,, 2019). From equation (25), we immediately derive the following corollary, which provides an approximate Pigou-Dalton result.

Corollary 1.

Suppose Assumptions 1-4 hold. If the number of samples satisfies n\gg d+\log(M/\delta) for \delta\in(0,1), then with probability at least 1-\delta, \hat{\pi} satisfies the \tau-approximate Pigou–Dalton principle at state s, where \tau\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}C^{*}(s), and C^{*}(s)=\max_{m}\max_{\pi\in\{\pi^{*},\underline{\pi}_{*m},\underline{\pi}_{m}\}}\|\Sigma_{m}^{-1/2}\phi(s,\pi(s))\|_{2}.
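The exact (\tau=0) version of this principle is what makes the Nash product attractive: any equalizing transfer that preserves the pairwise total strictly increases the product. A toy computation (with illustrative numbers) makes this concrete:

```python
import numpy as np

# Normalized rewards v_m = R_m - R_m^0 of three individuals under some policy (illustrative).
v = np.array([0.9, 0.2, 0.5])

# A Pigou-Dalton transfer: move utility from the best-off to the worst-off individual,
# reducing their gap while keeping their total fixed.
t = 0.2
w = v.copy()
w[0] -= t
w[1] += t

print(np.isclose(v.sum(), w.sum()))          # total reward preserved
print(abs(w[0] - w[1]) < abs(v[0] - v[1]))   # gap reduced
print(np.prod(w) > np.prod(v))               # collective Nash welfare increases
```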

C.5 Proofs in Section 4.4

We begin with the Utilitarian welfare function. The proof of Theorem 5 simply adds up the estimation errors across individuals, so we only present the key step.

Proof.

Again, it suffices to consider the first term in J(\pi^{*})-J(\hat{\pi})=(J(\pi^{*})-\hat{J}(\pi^{*}))+(\hat{J}(\pi^{*})-\hat{J}(\hat{\pi}))+(\hat{J}(\hat{\pi})-J(\hat{\pi})), as in the proof of Theorem 2.

J(π)J(π^)J(π)J^(π)\displaystyle J(\pi^{*})-J(\hat{\pi})\leq J(\pi^{*})-\hat{J}(\pi^{*})
=\displaystyle= maxθmΘB(θ^m){𝔼sρm=1MRm,θm(s,π(s))𝔼sρm=1MRm,θm(s,π(s))}\displaystyle\max_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\left\{\mathbb{E}_{s\sim\rho}\sum_{m=1}^{M}R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-\mathbb{E}_{s\sim\rho}\sum_{m=1}^{M}R_{m,{\theta}_{m}}(s,\pi^{*}(s))\right\}
=\displaystyle= Esρm=1MRm,θm(s,π(s))𝔼sρm=1MRm,θ~m(s,π(s))\displaystyle E_{s\sim\rho}\sum_{m=1}^{M}R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-\mathbb{E}_{s\sim\rho}\sum_{m=1}^{M}R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))
\displaystyle\leq m=1M|𝔼sρ(Rm,θm(s,π(s))Rm,θ~m(s,π(s)))|\displaystyle\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim\rho}\left(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))\right)\right|
=\displaystyle= m=1M|𝔼sρ(θmθ~m)ϕ(s,π(s))|\displaystyle\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim\rho}({\theta}_{m}^{*}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|

Then, the following remains the same as the proof of Theorem 2. ∎

Then, we move to the Leximin welfare function. For Theorem 6, we introduce a lemma first.

Lemma 5.

Suppose a_{1},\cdots,a_{n} and b_{1},\cdots,b_{n} are two groups of real numbers, sorted as a_{(1)}\leq\cdots\leq a_{(n)} and b_{(1)}\leq\cdots\leq b_{(n)}, respectively. Then, for every i\in[n], the following inequality holds:

|a(i)b(i)|max1kn|akbk|.|a_{(i)}-b_{(i)}|\leq\max_{1\leq k\leq n}|a_{k}-b_{k}|.
Proof.

Let a_{(i)}=a_{s_{i}} and b_{(i)}=b_{t_{i}}, where s_{i},t_{i}\in[n]. Without loss of generality, we assume that a_{(1)}\geq b_{(1)}. Therefore,

0|a(1)b(1)|=as1bt1=(as1at1)+(at1bt1)0+maxk|akbk|=maxk|akbk|0\leq|a_{(1)}-b_{(1)}|=a_{s_{1}}-b_{t_{1}}=(a_{s_{1}}-a_{t_{1}})+(a_{t_{1}}-b_{t_{1}})\leq 0+\max_{k}|a_{k}-b_{k}|=\max_{k}|a_{k}-b_{k}|

where the first inequality uses a_{s_{1}}=a_{(1)}\leq a_{t_{1}}. This completes the case i=1. In general, for any i\in[n], assume without loss of generality that a_{s_{i}}=a_{(i)}\geq b_{(i)}=b_{t_{i}}. If b_{s_{i}}\leq b_{t_{i}}, we have

0|a(i)b(i)|=asibti=(asibsi)+(bsibti)maxk|akbk|+0=maxk|akbk|0\leq|a_{(i)}-b_{(i)}|=a_{s_{i}}-b_{t_{i}}=(a_{s_{i}}-b_{s_{i}})+(b_{s_{i}}-b_{t_{i}})\leq\max_{k}|a_{k}-b_{k}|+0=\max_{k}|a_{k}-b_{k}|

Otherwise b_{s_{i}}>b_{t_{i}}, and there exists l\leq i such that a_{t_{l}}\geq a_{s_{i}}: otherwise the i values a_{t_{1}},\cdots,a_{t_{i}} would all be strictly smaller than a_{(i)}=a_{s_{i}}, whereas at most i-1 of the a's can lie strictly below a_{(i)}. Since b_{t_{l}}\leq b_{t_{i}}, we obtain

0|a(i)b(i)|=asibtiatlbtlmaxk|akbk|0\leq|a_{(i)}-b_{(i)}|=a_{s_{i}}-b_{t_{i}}\leq a_{t_{l}}-b_{t_{l}}\leq\max_{k}|a_{k}-b_{k}|

We conclude the proof. ∎
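A quick numerical sanity check of Lemma 5 (a Python sketch assuming NumPy; the instance sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    n = rng.integers(1, 20)
    a, b = rng.normal(size=n), rng.normal(size=n)
    lhs = np.abs(np.sort(a) - np.sort(b))   # |a_(i) - b_(i)| for every i
    rhs = np.max(np.abs(a - b))             # max_k |a_k - b_k|
    assert np.all(lhs <= rhs + 1e-12)
print("Lemma 5 holds on all random instances tested.")
```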

Equipped with this lemma, the proof of Theorem 6 is similar to that of Theorem 2; we only present the key step.

Proof.

Again, it suffices to bound the first term in J(\pi^{*})-J(\hat{\pi})=(J(\pi^{*})-\hat{J}(\pi^{*}))+(\hat{J}(\pi^{*})-\hat{J}(\hat{\pi}))+(\hat{J}(\hat{\pi})-J(\hat{\pi})), as in the proof of Theorem 2.

J(π)J(π^)J(π)J^(π)\displaystyle J(\pi^{*})-J(\hat{\pi})\leq J(\pi^{*})-\hat{J}(\pi^{*})
=\displaystyle= maxθmΘB(θ^m){𝔼sρminm(Rm,θm(s,π(s))Rm,θm0(s))𝔼sρminm(Rm,θm(s,π(s))Rm,θm0(s))}\displaystyle\max_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\left\{\mathbb{E}_{s\sim\rho}\min_{m}(R_{m,\theta_{m}^{*}}(s,{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\mathbb{E}_{s\sim\rho}\min_{m}(R_{m,\theta_{m}}(s,{\pi}(s))-R_{m,\theta_{m}}^{0}(s))\right\}
=\displaystyle= 𝔼sρminm(Rm,θm(s,π(s))Rm,θm0(s))𝔼sρminm(Rm,θ~m(s,π(s))Rm,θ~m0(s))\displaystyle\mathbb{E}_{s\sim\rho}\min_{m}(R_{m,\theta_{m}^{*}}(s,{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\mathbb{E}_{s\sim\rho}\min_{m}(R_{m,\tilde{\theta}_{m}}(s,{\pi}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))
\displaystyle\leq 𝔼sρmaxm|(Rm,θm(s,π(s))Rm,θm0(s))(Rm,θ~m(s,π(s))Rm,θ~m0(s))|(using Lemma 5)\displaystyle\mathbb{E}_{s\sim\rho}\max_{m}\left|(R_{m,\theta_{m}^{*}}(s,{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-(R_{m,\tilde{\theta}_{m}}(s,{\pi}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))\right|\quad(\text{using Lemma \ref{le:leximin}})
\displaystyle\leq m=1M𝔼sρ|(Rm,θm(s,π(s))Rm,θm0(s))(Rm,θ~m(s,π(s))Rm,θ~m0(s))|\displaystyle\sum_{m=1}^{M}\mathbb{E}_{s\sim\rho}\left|(R_{m,\theta_{m}^{*}}(s,{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-(R_{m,\tilde{\theta}_{m}}(s,{\pi}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))\right|
\displaystyle\leq m=1M|𝔼sρ(Rm,θm(s,π(s))Rm,θ~m(s,π(s)))|+|𝔼sρ(Rm,θm0(s)Rm,θ~m0(s))|\displaystyle\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim\rho}\left(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))\right)\right|+\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|
=\displaystyle= m=1M|𝔼sρ(θmθ~m)ϕ(s,π(s))|+|𝔼sρ(Rm,θm0(s)Rm,θ~m0(s))|\displaystyle\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim\rho}({\theta}_{m}^{*}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|+\left|\mathbb{E}_{s\sim\rho}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|

The remaining steps are identical to the proof of Theorem 2. ∎

C.6 Proof of Theorem 7

We begin with an estimation error bound similar to Theorem 1.

Theorem 12.

Suppose Assumptions 1-4 hold, with the feature difference in \Sigma_{*} of Assumption 3 replaced by the cumulative feature difference over the trajectory. If the number of samples satisfies n\gg d+\log(M/\delta) for \delta\in(0,1), then with probability at least 1-\delta, for any m\in[M],

θmθ^mΣmMr2n+dr2logd+rlog(M/δ)Mn.\|\theta_{m}^{*}-\hat{\theta}_{m}\|_{\Sigma_{m}}\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}.
Proof.

This proof is identical to that of Theorem 1; in the MDP setting, we simply take X_{m,i}=\sum_{h=1}^{H}\left(\phi(s_{m,h}^{i},a_{m,h}^{i})-\phi(s_{m,h}^{i^{\prime}},a_{m,h}^{i^{\prime}})\right), which satisfies \|X_{m,i}\|_{2}\leq 2LH. ∎
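For concreteness, a small Python sketch (assuming NumPy; the tabular feature map, the number of states and actions, and the horizon are all hypothetical) of the covariate used in this MDP extension, together with the norm bound \|X_{m,i}\|_{2}\leq 2LH invoked above:

```python
import numpy as np

rng = np.random.default_rng(2)
d, H, L = 3, 5, 1.0
# Hypothetical tabular feature map with ||phi(s, a)||_2 <= L (4 states, 2 actions).
features = rng.uniform(-L / np.sqrt(d), L / np.sqrt(d), size=(4, 2, d))
phi = lambda s, a: features[s, a]

def trajectory_covariate(traj_a, traj_b):
    """X_{m,i} = sum_h [phi(s_h, a_h) - phi(s'_h, a'_h)] for a pair of length-H trajectories."""
    return sum(phi(s, a) for s, a in traj_a) - sum(phi(s, a) for s, a in traj_b)

traj_a = [(rng.integers(4), rng.integers(2)) for _ in range(H)]
traj_b = [(rng.integers(4), rng.integers(2)) for _ in range(H)]
X = trajectory_covariate(traj_a, traj_b)
# Triangle inequality gives ||X||_2 <= 2 * L * H, the bound used in the proof above.
assert np.linalg.norm(X) <= 2 * L * H + 1e-9
print(X)
```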

Then, we give the proof of Theorem 7.

Proof.

Decompose the sub-optimality into three terms:

J(π)J(π^)=(J(π)J^(π))+(J^(π)J^(π^))+(J^(π^)J(π^))\displaystyle J(\pi^{*})-J(\hat{\pi})=\left(J(\pi^{*})-\hat{J}(\pi^{*})\right)+\left(\hat{J}(\pi^{*})-\hat{J}(\hat{\pi})\right)+\left(\hat{J}(\hat{\pi})-J(\hat{\pi})\right)

The second term satisfies \hat{J}(\pi^{*})-\hat{J}(\hat{\pi})\leq 0 by the definition of \hat{\pi}. By Theorem 12, with probability at least 1-\delta, \theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) for all m, and hence the third term satisfies

J^(π^)J(π^)\displaystyle\hat{J}(\hat{\pi})-J(\hat{\pi})
=\displaystyle= minθmΘB(θ^m)𝔼sdπ^m=1M(Rm,θm(s,π^(s))Rm,θm0(s))𝔼sdπ^m=1M(Rm,θm(s,π^(s))Rm,θm0(s))0\displaystyle\min_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\mathbb{E}_{s\sim d^{\hat{\pi}}}\prod_{m=1}^{M}(R_{m,{\theta}_{m}}(s,\hat{\pi}(s))-R_{m,\theta_{m}}^{0}(s))-\mathbb{E}_{s\sim d^{\hat{\pi}}}\prod_{m=1}^{M}(R_{m,{\theta}_{m}^{*}}(s,\hat{\pi}(s))-R_{m,\theta_{m}^{*}}^{0}(s))\leq 0

Therefore with probability at least 1δ1-\delta,

J(π)J(π^)J(π)J^(π)\displaystyle J(\pi^{*})-J(\hat{\pi})\leq J(\pi^{*})-\hat{J}(\pi^{*})
=\displaystyle= maxθmΘB(θ^m){𝔼sdπm=1M(Rm,θm(s,π(s))Rm,θm0(s))𝔼sdπm=1M(Rm,θm(s,π(s))Rm,θm0(s))}\displaystyle\max_{\theta_{m}\in\Theta_{B}(\hat{\theta}_{m})}\left\{\mathbb{E}_{s\sim d^{\pi^{*}}}\prod_{m=1}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\mathbb{E}_{s\sim d^{\pi^{*}}}\prod_{m=1}^{M}(R_{m,{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\theta_{m}}^{0}(s))\right\}
=(a)\displaystyle\overset{(a)}{=} 𝔼sdπ{[R1,θ1(s,π(s))R1,θ10(s)][m=2M(Rm,θm(s,π(s))Rm,θm0(s))m=2M(Rm,θ~m(s,π(s))Rm,θ~m0(s))]}\displaystyle\mathbb{E}_{s\sim d^{\pi^{*}}}\{[R_{1,{\theta}_{1}^{*}}(s,\pi^{*}(s))-R_{1,\theta_{1}^{*}}^{0}(s)][\prod_{m=2}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\prod_{m=2}^{M}(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))]\}
+\displaystyle+ 𝔼sdπ{[(R1,θ1(s,π(s))R1,θ10(s))(R1,θ1(s,π(s))R1,θ10(s))][m=2M(Rm,θ~m(s,π(s))Rm,θ~m0(s))]}\displaystyle\mathbb{E}_{s\sim d^{\pi^{*}}}\{[(R_{1,{\theta}_{1}^{*}}(s,\pi^{*}(s))-R_{1,\theta_{1}^{*}}^{0}(s))-(R_{1,{\theta}_{1}}(s,\pi^{*}(s))-R_{1,\theta_{1}}^{0}(s))][\prod_{m=2}^{M}(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))]\}
\displaystyle\leq 2BL|𝔼sdπm=2M(Rm,θm(s,π(s))Rm,θm0(s))m=2M(Rm,θ~m(s,π(s))Rm,θ~m0(s))|\displaystyle 2BL\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\prod_{m=2}^{M}(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-\prod_{m=2}^{M}(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))\right|
+\displaystyle+ (2BL)^{M-1}\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left((R_{1,{\theta}_{1}^{*}}(s,\pi^{*}(s))-R_{1,\theta_{1}^{*}}^{0}(s))-(R_{1,\tilde{\theta}_{1}}(s,\pi^{*}(s))-R_{1,\tilde{\theta}_{1}}^{0}(s))\right)\right|
(b)\displaystyle\overset{(b)}{\leq} m=1M(2BL)M1|𝔼sdπ((Rm,θm(s,π(s))Rm,θm0(s))(Rm,θ~m(s,π(s))Rm,θ~m0(s)))|\displaystyle\sum_{m=1}^{M}(2BL)^{M-1}\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left((R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\theta_{m}^{*}}^{0}(s))-(R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}^{0}(s))\right)\right|
\displaystyle\leq (2BL)M1m=1M|𝔼sdπ(Rm,θm(s,π(s))Rm,θ~m(s,π(s)))|+|𝔼sdπ(Rm,θm0(s)Rm,θ~m0(s))|\displaystyle(2BL)^{M-1}\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,{\theta}_{m}^{*}}(s,\pi^{*}(s))-R_{m,\tilde{\theta}_{m}}(s,\pi^{*}(s))\right)\right|+\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|
=\displaystyle= (2BL)M1m=1M|𝔼sdπ(θmθ~m)ϕ(s,π(s))|+|𝔼sdπ(Rm,θm0(s)Rm,θ~m0(s))|\displaystyle(2BL)^{M-1}\sum_{m=1}^{M}\left|\mathbb{E}_{s\sim d^{\pi^{*}}}({\theta}_{m}^{*}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|+\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|

where (a) holds due to the identity a_{1}a_{2}-b_{1}b_{2}=a_{1}(a_{2}-b_{2})+(a_{1}-b_{1})b_{2}, and (b) follows by applying the same decomposition recursively to the remaining product over m\geq 2.

Separately, we have

|𝔼sdπ(θmθ~m)ϕ(s,π(s))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}({\theta}_{m}^{*}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|
=\displaystyle= |𝔼sdπ(θmθ^m+θ^mθ~m)ϕ(s,π(s))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}({\theta}_{m}^{*}-\hat{\theta}_{m}+\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|
\displaystyle\leq |𝔼sdπ(θmθ^m)ϕ(s,π(s))|+|𝔼sdπ(θ^mθ~m)ϕ(s,π(s))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}({\theta}_{m}^{*}-\hat{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|+\left|\mathbb{E}_{s\sim d^{\pi^{*}}}(\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\pi^{*}(s))\right|
\displaystyle\leq θmθ^mΣm𝔼sdπϕ(s,a)Σm1+θ^mθ~mΣm𝔼sdπϕ(s,a)Σm1 (Cauchy-Schwartz inequality)\displaystyle\|{\theta}_{m}^{*}-\hat{\theta}_{m}\|_{\Sigma_{m}}\|\mathbb{E}_{s\sim d^{\pi^{*}}}\phi(s,a)\|_{\Sigma_{m}^{-1}}+\|\hat{\theta}_{m}-\tilde{\theta}_{m}\|_{\Sigma_{m}}\|\mathbb{E}_{s\sim d^{\pi^{*}}}\phi(s,a)\|_{\Sigma_{m}^{-1}}\text{ (Cauchy-Schwartz inequality)}
\displaystyle\lesssim r2n+dr2logd+rlog(M/δ)Mn(Σm1/2𝔼sdπϕ(s,π(s)))2\displaystyle\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\|(\Sigma_{m}^{-1/2}\mathbb{E}_{s\sim d^{\pi^{*}}}\phi(s,\pi^{*}(s)))\|_{2}

The last inequality holds because \tilde{\theta}_{m}\in\Theta_{B}(\hat{\theta}_{m}) and \theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) with probability at least 1-\delta for all m.

|𝔼sdπ(Rm,θm0(s)Rm,θ~m0(s))|=|𝔼sdπ(Rm,θm(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}^{0}(s)-R_{m,\tilde{\theta}_{m}}^{0}(s)\right)\right|=\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{*m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{m}(s))\right)\right|
(c)\displaystyle\overset{(c)}{\leq} |𝔼sdπ(Rm,θm(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|+|𝔼sdπ(Rm,θm(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{m}(s))\right)\right|+\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{*m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{*m}(s))\right)\right|
\displaystyle\leq |𝔼sdπ(Rm,θm(s,π¯m(s))Rm,θ^m(s,π¯m(s))+Rm,θ^m(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{m}(s))-R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{m}(s))+R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{m}(s))\right)\right|
+\displaystyle+ |𝔼sdπ(Rm,θm(s,π¯m(s))Rm,θ^m(s,π¯m(s))+Rm,θ^m(s,π¯m(s))Rm,θ~m(s,π¯m(s)))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}\left(R_{m,\theta_{m}^{*}}(s,\underline{\pi}_{*m}(s))-R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{*m}(s))+R_{m,\hat{\theta}_{m}}(s,\underline{\pi}_{*m}(s))-R_{m,\tilde{\theta}_{m}}(s,\underline{\pi}_{*m}(s))\right)\right|
\displaystyle\leq |𝔼sdπ(θmθ^m)ϕ(s,π¯m(s))|+|𝔼sdπ(θ^mθ~m)ϕ(s,π¯m(s))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}(\theta_{m}^{*}-\hat{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{m}(s))\right|+\left|\mathbb{E}_{s\sim d^{\pi^{*}}}(\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{m}(s))\right|
+\displaystyle+ |𝔼sdπ(θmθ^m)ϕ(s,π¯m(s))|+|𝔼sdπ(θ^mθ~m)ϕ(s,π¯m(s))|\displaystyle\left|\mathbb{E}_{s\sim d^{\pi^{*}}}(\theta_{m}^{*}-\hat{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{*m}(s))\right|+\left|\mathbb{E}_{s\sim d^{\pi^{*}}}(\hat{\theta}_{m}-\tilde{\theta}_{m})^{\top}\phi(s,\underline{\pi}_{*m}(s))\right|
\displaystyle\lesssim r2n+dr2logd+rlog(M/δ)Mn{Σm12𝔼sdπϕ(s,π¯m(s))2+Σm12𝔼sdπϕ(s,π¯m(s))2}\displaystyle\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}\left\{\|\Sigma_{m}^{-\frac{1}{2}}\mathbb{E}_{s\sim d^{\pi^{*}}}\phi(s,\underline{\pi}_{m}(s))\|_{2}+\|\Sigma_{m}^{-\frac{1}{2}}\mathbb{E}_{s\sim d^{\pi^{*}}}\phi(s,\underline{\pi}_{*m}(s))\|_{2}\right\}

where (c) holds due to the inequality |\min\{a_{1},a_{2}\}-\min\{b_{1},b_{2}\}|\leq|a_{1}-b_{1}|+|a_{2}-b_{2}|. The last inequality holds because \tilde{\theta}_{m}\in\Theta_{B}(\hat{\theta}_{m}) and \theta_{m}^{*}\in\Theta_{B}(\hat{\theta}_{m}) with probability at least 1-\delta for all m. Therefore,

J(π)J(π^)Mr2n+dr2logd+rlog(M/δ)MnCJ(\pi^{*})-J(\hat{\pi})\lesssim M\sqrt{\dfrac{r^{2}}{n}+\dfrac{dr^{2}\log{d}+r\log(M/\delta)}{Mn}}C^{*}

Now we conclude the proof. ∎
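To make the max-min structure of this objective concrete, here is a crude Python sketch (assuming NumPy) that enumerates deterministic policies and approximates the inner minimization over the confidence sets by sampling; the dimensions, the l2-ball stand-in for \Theta_{B}(\hat{\theta}_{m}), the choice R_{m}^{0}(s)=\min_{a}R_{m}(s,a), and the pairing of sampled parameters across parties (instead of a full Cartesian product) are all simplifying assumptions made here for illustration only.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
S, A, d, M = 3, 2, 4, 2                       # states, actions, feature dim, parties
phi = rng.uniform(-0.5, 0.5, size=(S, A, d))  # hypothetical feature map phi(s, a)
rho = np.ones(S) / S                          # state distribution
theta_hat = rng.normal(size=(M, d))           # estimated reward parameters
radius = 0.3                                  # stand-in radius for the confidence set Theta_B

def sample_ball(center, k=200):
    """Crude stand-in for Theta_B(theta_hat_m): k random points in an l2 ball."""
    u = rng.normal(size=(k, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return center + radius * rng.uniform(0, 1, size=(k, 1)) ** (1 / d) * u

# One fixed batch of sampled parameters per party, zipped across parties
# (a further shortcut that only illustrates the max-min form).
theta_samples = [sample_ball(theta_hat[m]) for m in range(M)]

def pessimistic_nash_welfare(policy):
    """Approximate min over sampled theta of E_rho prod_m [R_m(s, pi(s)) - R_m^0(s)]."""
    worst = np.inf
    for thetas in zip(*theta_samples):
        vals = np.ones(S)
        for th in thetas:
            r = phi[np.arange(S), policy] @ th   # R_m(s, pi(s))
            r0 = (phi @ th).min(axis=1)          # baseline R_m^0(s) = min_a R_m(s, a) (our assumption)
            vals *= r - r0
        worst = min(worst, rho @ vals)
    return worst

policies = [np.array(p) for p in itertools.product(range(A), repeat=S)]
best = max(policies, key=pessimistic_nash_welfare)
print("pessimistic Nash-welfare policy:", best)
```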

Appendix D Proofs in Section 5

D.1 Proof of Theorem 8

First, we introduce two concentration inequalities and prove a lemma.

Proposition 1 (McDiarmid inequality).

Let X_{1},\cdots,X_{n} be independent random variables taking values in \mathcal{X}, and let h:\mathcal{X}^{n}\rightarrow\mathbb{R} be a function such that for all x_{1},\cdots,x_{n},x_{i}^{\prime}\in\mathcal{X}, |h(x_{1},\cdots,x_{i},\cdots,x_{n})-h(x_{1},\cdots,x_{i}^{\prime},\cdots,x_{n})|\leq c. Then for all \delta>0, with probability at least 1-\delta, we have

|h(X_{1},\cdots,X_{n})-\mathbb{E}(h(X_{1},\cdots,X_{n}))|\leq c\sqrt{\dfrac{n\log{(2/\delta)}}{2}}.
Proposition 2 (Binomial concentration inequality, Xie et al., (2021)).

Suppose N\sim\texttt{Bin}(n,p), i.e., N follows a binomial distribution with n\geq 1 trials and success probability p\in[0,1]. Then with probability at least 1-\delta, we have

pN18log(1/δ)n.\dfrac{p}{N\lor 1}\leq\dfrac{8\log(1/\delta)}{n}.
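A quick Monte Carlo check of Proposition 2 (Python, assuming NumPy; the parameters n, p, \delta are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, delta, trials = 200, 0.3, 0.1, 20_000
N = rng.binomial(n, p, size=trials)                      # i.i.d. draws of N
violation_rate = np.mean(p / np.maximum(N, 1) > 8 * np.log(1 / delta) / n)
print(f"empirical violation rate {violation_rate:.4f} (should be at most delta = {delta})")
```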
Lemma 6.

With probability at least 1δ1-\delta,

(1) |N_{s,m}(a,a^{\prime})-N_{s,m}^{*}(a,a^{\prime})|\leq\sqrt{\dfrac{2\log{(2MSA^{2}/\delta)}}{n_{0}(m,a,a^{\prime},s)\lor 1}} for all m,s,a,a^{\prime};

(2) |N_{s}(a,a^{\prime})-N_{s}^{*}(a,a^{\prime})|\leq\sqrt{\dfrac{2\log{(2MSA^{2}/\delta)}}{\min_{m}n_{0}(m,a,a^{\prime},s)\lor 1}} for all s,a,a^{\prime}.

Proof.

(1) For fixed m,s,a,a^{\prime}, define n^{\prime}=n_{0}(m,a,a^{\prime},s). The case n^{\prime}=0 is trivial. For n^{\prime}\geq 1, let X_{i} for i\in[n^{\prime}] denote the binary responses on the pair (s,a,a^{\prime}) in \mathcal{D}_{m}, and let h=\sum_{i=1}^{n^{\prime}}X_{i}/n^{\prime}. Since changing any single X_{i} changes h by at most 1/n^{\prime}, we apply Proposition 1 with c=1/n^{\prime}: with probability at least 1-\delta, we have

|#m(aa;s)#m(aa;s)+#m(aa;s)Ns,m(a,a)+12|\displaystyle\left|\dfrac{\#_{m}(a\succ a^{\prime};s)}{\#_{m}(a\succ a^{\prime};s)+\#_{m}(a^{\prime}\succ a;s)}-\dfrac{N_{s,m}^{*}(a,a^{\prime})+1}{2}\right|
=\displaystyle= |#m(aa;s)nNs,m(a,a)+12|\displaystyle\left|\dfrac{\#_{m}(a\succ a^{\prime};s)}{n}-\dfrac{N_{s,m}^{*}(a,a^{\prime})+1}{2}\right|
=\displaystyle= |#m(aa;s)nm(y=1|s,a,a)|\displaystyle\left|\dfrac{\#_{m}(a\succ a^{\prime};s)}{n}-\mathbb{P}_{m}(y=1|s,a,a^{\prime})\right|
\displaystyle\leq 1nnlog(2/δ)2\displaystyle\dfrac{1}{n^{\prime}}\sqrt{\dfrac{n^{\prime}\log{(2/\delta)}}{2}}

So for fixed m,s,a,am,s,a,a^{\prime}, with probability at least 1δ1-\delta,

|Ns,m(a,a)Ns,m(a,a)|2|Ns,m(a,a)+12Ns,m(a,a)+12|2log(2/δ)n0(m,a,a,s)|N_{s,m}(a,a^{\prime})-N_{s,m}^{*}(a,a^{\prime})|\leq 2\left|\dfrac{N_{s,m}(a,a^{\prime})+1}{2}-\dfrac{N_{s,m}^{*}(a,a^{\prime})+1}{2}\right|\leq\sqrt{\dfrac{2\log{(2/\delta)}}{n_{0}(m,a,a^{\prime},s)}}

Then, taking a union bound over all m,s,a,a^{\prime}, with probability at least 1-MSA^{2}\delta,

|Ns,m(a,a)Ns,m(a,a)|2log(2/δ)n0(m,a,a,s),s,m,a,a|N_{s,m}(a,a^{\prime})-N_{s,m}^{*}(a,a^{\prime})|\leq\sqrt{\dfrac{2\log{(2/\delta)}}{n_{0}(m,a,a^{\prime},s)}},\forall s,m,a,a^{\prime}

Substituting \delta with \delta/(MSA^{2}) completes the first part.

(2) With probability at least 1-\delta, for all a,a^{\prime},s,

|Ns(a,a)Ns(a,a)|\displaystyle|N_{s}(a,a^{\prime})-N_{s}^{*}(a,a^{\prime})| =|1Mm=1MNs,m(a,a)1Mm=1MNs,m(a,a)|\displaystyle=|\dfrac{1}{M}\sum_{m=1}^{M}N_{s,m}(a,a^{\prime})-\dfrac{1}{M}\sum_{m=1}^{M}N_{s,m}^{*}(a,a^{\prime})|
1Mm=1M|Ns,m(a,a)Ns,m(a,a)|\displaystyle\leq\dfrac{1}{M}\sum_{m=1}^{M}|N_{s,m}(a,a^{\prime})-N_{s,m}^{*}(a,a^{\prime})|
2log(2MSA2/δ)minmn0(m,a,a,s)\displaystyle\leq\sqrt{\dfrac{2\log{(2MSA^{2}/\delta)}}{\min_{m}n_{0}(m,a,a^{\prime},s)}}

Thus we conclude the proof. ∎
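To make the quantities in Lemma 6 concrete, the following Python sketch (assuming NumPy; the comparison counts are hypothetical, and the normalization N_{s,m}(a,a^{\prime})=2\#_{m}(a\succ a^{\prime};s)/n_{0}-1 is our reading of the proof) assembles the per-party matrices N_{s,m}, the averaged matrix N_{s}, and the bonus B_{s} used below in Theorem 8.

```python
import numpy as np

rng = np.random.default_rng(5)
M, S, A, delta = 3, 2, 3, 0.1
# wins[m, s, a, a'] = #_m(a > a'; s): hypothetical pairwise-comparison counts.
wins = rng.integers(0, 20, size=(M, S, A, A))
wins[:, :, np.arange(A), np.arange(A)] = 0               # no self-comparisons

n0 = wins + wins.transpose(0, 1, 3, 2)                   # n_0(m, a, a', s): total comparisons of the pair
N_sm = np.where(n0 > 0, 2 * wins / np.maximum(n0, 1) - 1, 0.0)   # empirical N_{s,m}(a, a')
N_s = N_sm.mean(axis=0)                                  # averaged matrix N_s

# Bonus from Lemma 6 / Theorem 8: B_s(a, a') = sqrt(2 log(4 M S A^2 / delta) / (min_m n_0 v 1)).
B_s = np.sqrt(2 * np.log(4 * M * S * A ** 2 / delta) / np.maximum(n0.min(axis=0), 1))
print(N_s.shape, B_s.shape)                              # each of shape (S, A, A)
```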

Now we present the detailed proof of Theorem 8.

Proof.

First, from Lemma 6, we have with probability at least 1δ/21-\delta/2,

|Ns(a,a)Ns(a,a)|Bs(a,a)=2log(4MSA2/δ)minmn0(m,a,a,s)1,s,a,a|N_{s}(a,a^{\prime})-N_{s}^{*}(a,a^{\prime})|\leq B_{s}(a,a^{\prime})=\sqrt{\dfrac{2\log(4MSA^{2}/\delta)}{\min_{m}n_{0}(m,a,a^{\prime},s)\lor 1}},\quad\forall s,a,a^{\prime}

From Proposition 2, we have with probability at least 1δ/21-\delta/2,

dsg(a,a)n0(m,a,a,s)18log(2MSA2/δ)n,m,s,a,a\dfrac{d_{s}^{g}(a,a^{\prime})}{n_{0}(m,a,a^{\prime},s)\lor 1}\leq\dfrac{8\log(2MSA^{2}/\delta)}{n},\quad\forall m,s,a,a^{\prime} (26)

Thus the above inequalities hold simultaneously with probability at least 1-\delta. Next, recall that (\hat{p}_{s},\hat{q}_{s})=\arg\max_{p}\min_{q}p^{\top}(N_{s}-B_{s})q and (p_{s}^{*},q_{s}^{*})=\arg\max_{p}\min_{q}p^{\top}N_{s}^{*}q. We then have:

minqp^sNsq\displaystyle\min_{q}\hat{p}_{s}^{\top}N_{s}^{*}q minqp^s(NsBs)q(using NsNsBs)\displaystyle\geq\min_{q}\hat{p}_{s}^{\top}(N_{s}-B_{s})q\qquad\qquad\qquad\qquad(\text{using }N_{s}^{*}\geq N_{s}-B_{s})
=p^s(NsBs)q^s\displaystyle=\hat{p}_{s}^{\top}(N_{s}-B_{s})\hat{q}_{s}
ps(NsBs)q^s(using the definition of p^s)\displaystyle\geq p_{s}^{*\top}(N_{s}-B_{s})\hat{q}_{s}\qquad\qquad\qquad\qquad(\text{using the definition of }\hat{p}_{s})
=psNsq^s+ps(NsNs)q^spsBsq^s\displaystyle=p_{s}^{*\top}N_{s}^{*}\hat{q}_{s}+p_{s}^{*\top}(N_{s}-N_{s}^{*})\hat{q}_{s}-p_{s}^{*\top}B_{s}\hat{q}_{s}
minqpsNsq2psBsq^s(using NsNsBs)\displaystyle\geq\min_{q}p_{s}^{*\top}N_{s}^{*}q-2p_{s}^{*\top}B_{s}\hat{q}_{s}\qquad\qquad\qquad(\text{using }N_{s}^{*}\geq N_{s}-B_{s})
=2psBsq^s(the value of the matrix game Ns=0)\displaystyle=-2p_{s}^{*\top}B_{s}\hat{q}_{s}\qquad\qquad\qquad(\text{the value of the matrix game }N_{s}^{*}=0)
=2𝔼apsaq^sBs(a,a)\displaystyle=-2\mathbb{E}_{\begin{subarray}{c}a\sim p_{s}^{*}\\ a^{\prime}\sim\hat{q}_{s}\end{subarray}}B_{s}(a,a^{\prime})
=2𝔼apsaq^s2log(4MSA2/δ)minmn0(m,a,a,s)1\displaystyle=-2\mathbb{E}_{\begin{subarray}{c}a\sim p_{s}^{*}\\ a^{\prime}\sim\hat{q}_{s}\end{subarray}}\sqrt{\dfrac{2\log(4MSA^{2}/\delta)}{\min_{m}n_{0}(m,a,a^{\prime},s)\lor 1}}
2𝔼apsaq^s4log2(4MSA2/δ)ndsg(a,a)(using (26))\displaystyle\geq-2\mathbb{E}_{\begin{subarray}{c}a\sim p_{s}^{*}\\ a^{\prime}\sim\hat{q}_{s}\end{subarray}}4\sqrt{\dfrac{\log^{2}(4MSA^{2}/\delta)}{nd_{s}^{g}(a,a^{\prime})}}\qquad\qquad(\text{using \eqref{ber_dg}})

Let η=log(4MSA2/δ),\eta=\log(4MSA^{2}/\delta), and denote dsπ=dsps,q^sd_{s}^{\pi}=d_{s}^{p_{s}^{*},\hat{q}_{s}}, we have

minqp^sNsq\displaystyle\min_{q}\hat{p}_{s}^{\top}N_{s}^{*}q 8a,aηdsπ(a,a)1ndsg(a,a)\displaystyle\geq-8\sum_{a,a^{\prime}}\eta d_{s}^{\pi}(a,a^{\prime})\sqrt{\dfrac{1}{nd_{s}^{g}(a,a^{\prime})}}
=8a,adsπ(a,a)ηdsπ(a,a)ndsg(a,a)\displaystyle=-8\sum_{a,a^{\prime}}\sqrt{d_{s}^{\pi}(a,a^{\prime})}\eta\sqrt{\dfrac{d_{s}^{\pi}(a,a^{\prime})}{nd_{s}^{g}(a,a^{\prime})}}
\displaystyle\geq -8\sqrt{\sum_{a,a^{\prime}}d_{s}^{\pi}(a,a^{\prime})}\,\eta\sqrt{\sum_{a,a^{\prime}}\dfrac{d_{s}^{\pi}(a,a^{\prime})}{nd_{s}^{g}(a,a^{\prime})}}\qquad\qquad(\text{Cauchy-Schwarz inequality})
\displaystyle\geq -8A\eta\sqrt{\max_{a,a^{\prime}}\dfrac{d_{s}^{\pi}(a,a^{\prime})}{nd_{s}^{g}(a,a^{\prime})}}\qquad\qquad(\text{since the sum has at most }A^{2}\text{ terms})
8AηCn:=ϵ\displaystyle\geq-8A\eta\sqrt{\dfrac{C^{*}}{n}}:=-\epsilon

We conclude that \hat{p}_{s} satisfies \min_{q}\hat{p}_{s}^{\top}N_{s}^{*}q\geq-\epsilon; equivalently, we have

a𝒜p^s(a)Ns(a,a)ϵ,a,s\displaystyle\sum_{a\in\mathcal{A}}\hat{p}_{s}(a)N_{s}^{*}(a,a^{\prime})\geq-\epsilon,\quad\forall a^{\prime},s (27)
a𝒜p^s(a)=1,s\displaystyle\sum_{a\in\mathcal{A}}\hat{p}_{s}(a)=1,\quad\forall s

Our goal is to prove that the distribution \hat{p} over the policy space \Pi satisfies:

πΠp^(π)T(π,π)ϵ,π\displaystyle\sum_{\pi\in\Pi}\hat{p}(\pi)T^{*}(\pi,\pi^{\prime})\geq-\epsilon,\quad\forall\pi^{\prime}
πΠp^(π)=1\displaystyle\sum_{\pi\in\Pi}\hat{p}(\pi)=1

Since \hat{p}(\pi)=\prod_{s\in\mathcal{S}}\hat{p}_{s}(\pi(s)), it is easy to verify that \sum_{\pi\in\Pi}\hat{p}(\pi)=1. Moreover, notice that

\displaystyle\sum_{\pi\in\Pi}\hat{p}(\pi)T^{*}(\pi,\pi^{\prime}) =\sum_{\pi\in\Pi}\hat{p}(\pi)\sum_{s\in\mathcal{S}}\rho_{s}N_{s}^{*}(\pi(s),\pi^{\prime}(s))
=s𝒮ρsπΠp^(π)Ns(π(s),π(s))\displaystyle=\sum_{s\in\mathcal{S}}\rho_{s}\sum_{\pi\in\Pi}\hat{p}(\pi)N_{s}^{*}(\pi(s),\pi^{\prime}(s)) (28)

We then separate the terms for different states. For simplicity, denote 𝒮={si}i=1S\mathcal{S}=\{s_{i}\}_{i=1}^{S} and p^si=p^i\hat{p}_{s_{i}}=\hat{p}_{i} for i[S]i\in[S]. Thus, for each state sis_{i}, we have

πΠp^(π)Nsi(π(si),π(si))\displaystyle\sum_{\pi\in\Pi}\hat{p}(\pi)N_{s_{i}}^{*}(\pi(s_{i}),\pi^{\prime}(s_{i}))
=\displaystyle= πΠi=1Sp^i(π(si))Nsi(π(si),π(si))\displaystyle\sum_{\pi\in\Pi}\prod_{i=1}^{S}\hat{p}_{i}(\pi(s_{i}))N_{s_{i}}^{*}(\pi(s_{i}),\pi^{\prime}(s_{i}))
=\displaystyle= πΠ[jip^j(π(sj))]p^i(π(si))Nsi(π(si),π(si))\displaystyle\sum_{\pi\in\Pi}[\prod_{j\neq i}\hat{p}_{j}(\pi(s_{j}))]\hat{p}_{i}(\pi(s_{i}))N_{s_{i}}^{*}(\pi(s_{i}),\pi^{\prime}(s_{i}))
=\displaystyle= (π(sj)𝒜jijip^j(π(sj)))(π(si)𝒜p^i(π(si))Nsi(π(si),π(si)))ϵ\displaystyle\left(\sum_{\begin{subarray}{c}\pi(s_{j})\in\mathcal{A}\\ \forall j\neq i\end{subarray}}\prod_{j\neq i}\hat{p}_{j}(\pi(s_{j}))\right)\left(\sum_{\pi(s_{i})\in\mathcal{A}}\hat{p}_{i}(\pi(s_{i}))N_{s_{i}}^{*}(\pi(s_{i}),\pi^{\prime}(s_{i}))\right)\geq-\epsilon

The last inequality holds because the first term equals 1 and the second term is greater than or equal to -\epsilon for all a^{\prime}=\pi^{\prime}(s_{i}), using (27). Therefore, combining with (28) implies that

\sum_{\pi\in\Pi}\hat{p}(\pi)T^{*}(\pi,\pi^{\prime})\geq-\sum_{s}\rho_{s}\epsilon=-\epsilon

Now we conclude the proof. ∎
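The maximin problem \hat{p}_{s}=\arg\max_{p}\min_{q}p^{\top}(N_{s}-B_{s})q used above is a standard matrix game and can be solved as a linear program. Below is a minimal Python sketch (assuming NumPy and SciPy; the matrices are random stand-ins rather than matrices estimated from data).

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(G):
    """Solve max_p min_q p^T G q over probability vectors p, q; return p and the game value."""
    A_dim = G.shape[0]
    # Variables: [p_1, ..., p_A, v]; maximize v  <=>  minimize -v.
    c = np.zeros(A_dim + 1)
    c[-1] = -1.0
    # Constraints: for every pure column a', sum_a p_a G[a, a'] >= v  <=>  -G^T p + v <= 0.
    A_ub = np.hstack([-G.T, np.ones((G.shape[1], 1))])
    b_ub = np.zeros(G.shape[1])
    A_eq = np.hstack([np.ones((1, A_dim)), np.zeros((1, 1))])   # sum_a p_a = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A_dim + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A_dim], res.x[-1]

# Hypothetical pessimistic matrix N_s - B_s for one state.
rng = np.random.default_rng(6)
A_mat = rng.uniform(-1, 1, size=(4, 4))
N_s = A_mat - A_mat.T                        # antisymmetric empirical preference matrix
B_s = rng.uniform(0, 0.2, size=(4, 4))       # confidence bonus
p_hat, value = maximin_strategy(N_s - B_s)
print("pessimistic von Neumann winner p_hat:", np.round(p_hat, 3), "value:", round(value, 3))
```

The same routine applied with B_{s}=0 recovers the plain (non-pessimistic) von Neumann winner.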

D.2 Proof of Theorem 9

Proof.

Suppose N_{s,m}^{*}(a,b)\geq 0 for all m, with strict inequality for some m. By transitivity, we have N_{s,m}^{*}(a,c)\geq N_{s,m}^{*}(b,c) for all c. Therefore N_{s}^{*}(a,b)>0 and N_{s}^{*}(a,c)\geq N_{s}^{*}(b,c). Let \hat{p}_{s}^{\prime} be equal to \hat{p}_{s} except on \{a,b\}, where \hat{p}_{s}^{\prime}(a)=\hat{p}_{s}(a)+\hat{p}_{s}(b) and \hat{p}_{s}^{\prime}(b)=0. Then

p^sNsp^s=\displaystyle\hat{p}_{s}^{\prime}{}^{\top}N_{s}^{*}\hat{p}_{s}= Ns(a,b)[(p^s(a)+p^s(b))p^s(b)0p^s(a)]\displaystyle N_{s}^{*}(a,b)\left[(\hat{p}_{s}(a)+\hat{p}_{s}(b))\hat{p}_{s}(b)-0\cdot\hat{p}_{s}(a)\right]
+\displaystyle+ c𝒮\{a,b}Ns(a,c)[(p^s(a)+p^s(b))p^s(c)p^s(c)p^s(a)]\displaystyle\sum_{c\in\mathcal{S}\backslash\{a,b\}}N_{s}^{*}(a,c)\left[(\hat{p}_{s}(a)+\hat{p}_{s}(b))\hat{p}_{s}(c)-\hat{p}_{s}(c)\hat{p}_{s}(a)\right]
+\displaystyle+ c𝒮\{a,b}Ns(b,c)[0p^s(c)p^s(c)p^s(b)]\displaystyle\sum_{c\in\mathcal{S}\backslash\{a,b\}}N_{s}^{*}(b,c)\left[0\cdot\hat{p}_{s}(c)-\hat{p}_{s}(c)\hat{p}_{s}(b)\right]
=\displaystyle= p^s(b)[Ns(a,b)(p^s(a)+p^s(b))+c𝒮\{a,b}p^s(c)(Ns(a,c)Ns(b,c))]\displaystyle\hat{p}_{s}(b)\left[N_{s}^{*}(a,b)(\hat{p}_{s}(a)+\hat{p}_{s}(b))+\sum_{c\in\mathcal{S}\backslash\{a,b\}}\hat{p}_{s}(c)(N_{s}^{*}(a,c)-N_{s}^{*}(b,c))\right]
\displaystyle\geq p^s(b)2Ns(a,b)\displaystyle\hat{p}_{s}(b)^{2}N_{s}^{*}(a,b) (29)

Notice that with probability at least 1-\delta, we have \hat{p}_{s}^{\prime}{}^{\top}N_{s}^{*}\hat{p}_{s}=-\hat{p}_{s}^{\top}N_{s}^{*}\hat{p}_{s}^{\prime}\leq\epsilon (using \hat{p}_{s}^{\top}N_{s}^{*}\hat{p}_{s}^{\prime}\geq\min_{q}\hat{p}_{s}^{\top}N_{s}^{*}q\geq-\epsilon), where \epsilon is defined in Theorem 8. Therefore, (29) implies that \epsilon\geq\hat{p}_{s}(b)^{2}N_{s}^{*}(a,b), and consequently

\displaystyle\hat{p}_{s}(b)\leq\sqrt{\epsilon/N_{s}^{*}(a,b)} =\left(8A\log(4MSA^{2}/\delta)\sqrt{\dfrac{C^{*}}{n}}/N_{s}^{*}(a,b)\right)^{\frac{1}{2}}
=\displaystyle= K\left(A\log(4MSA^{2}/\delta)\sqrt{\frac{C^{*}}{n}}\right)^{1/2}

where KK is a constant that depends on Ns(a,b)N_{s}^{*}(a,b). ∎

This proof follows an argument similar to a result presented in Fishburn, (1984).

We then proceed to demonstrate the necessity of transitivity. A simple example arises in the three-alternative case: suppose that N_{s,m}^{*}(a,b)=1, N_{s,m}^{*}(b,c)=1, N_{s,m}^{*}(c,a)=1 for all m, i.e., every individual prefers a\succ b, b\succ c, and c\succ a with probability 1. Therefore,

Ns=Ns=[011101110]N_{s}=N_{s}^{*}=\begin{bmatrix}0&1&-1\\ -1&0&1\\ 1&-1&0\\ \end{bmatrix}

Since everyone holds these preferences with probability 1, there is no need to introduce a confidence bound, and it is straightforward to verify that the von Neumann winner is (1/3,1/3,1/3). However, item b is Pareto dominated by item a, so the conclusion of the previous theorem no longer holds. Therefore, transitivity is indeed necessary.
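As a quick check of this counterexample (a short Python snippet assuming NumPy), the uniform distribution makes every column of N_{s} average to zero, so it attains the game value 0 and is a von Neumann winner despite b being Pareto dominated:

```python
import numpy as np

N_s = np.array([[0, 1, -1],
                [-1, 0, 1],
                [1, -1, 0]])
u = np.ones(3) / 3
print(u @ N_s)   # [0. 0. 0.]: the uniform strategy guarantees value 0 against every pure response.
```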

Appendix E Additional Results

In this section, we provide additional impossibility results for our two preference models, which demonstrate the challenge of multi-party alignment.

Reward-Based Model

In multi-party settings with heterogeneous preferences, there is a risk that specific individuals may obtain poor outcomes:

Proposition 3.

Consider the family of instances \text{CB}_{m}(\mathcal{C})=\{\rho,\{s_{m}^{i},a_{m,1}^{i},a_{m,0}^{i}\}_{i=1}^{n},\theta_{m}^{*}=U^{*}\alpha_{m}^{*}\mid C_{m}^{*}\leq\mathcal{C}\}, where C_{m}^{*}=\|\Sigma_{m}^{-\frac{1}{2}}\mathbb{E}_{s\sim\rho}\phi(s,\pi^{*}(s))\|_{2} for m\in[M]. Suppose that M\geq 2 and \mathcal{C}\geq 2. Then there exists a feature mapping \phi such that the following lower bound holds:

infπ^supm[M]sup𝒬mCBm(𝒞)SinSubOpt𝒬m(π^)BL/2\inf_{\hat{\pi}}\sup_{m\in[M]}\sup_{\mathcal{Q}_{m}\in\text{CB}_{m}(\mathcal{C})}\texttt{SinSubOpt}_{\mathcal{Q}_{m}}(\hat{\pi})\geq BL/2

Here the single-party sub-optimality is defined as SinSubOptm(π)=Jm(πm)Jm(π)\texttt{SinSubOpt}_{m}(\pi)=J_{m}(\pi_{m}^{*})-J_{m}(\pi), where Jm(π)=𝔼sρRm,θm(s,π(s))J_{m}(\pi)=\mathbb{E}_{s\sim\rho}R_{m,\theta_{m}^{*}}(s,\pi(s)) and πm=argmaxJm(π)\pi_{m}^{*}=\arg\max J_{m}(\pi) for m[M]m\in[M].

Proof.

Here we consider \tilde{\phi}=U^{*\top}\phi\in\mathbb{R}^{r} for some U^{*}\in\mathcal{O}_{d\times r}, and a corresponding \phi can be constructed directly from \tilde{\phi}. Without loss of generality, we assume that B=L=1. Let \mathcal{S}=\{s_{1},s_{2}\}, \mathcal{A}=\{a_{1},a_{2},a_{3}\}, and \rho=\texttt{Unif}([s_{1},s_{2}]). Define the feature functions and the number of observations for each compared action pair as follows:

ϕ~(s1,a1)=e1,ϕ~(s1,a2)=e2,ϕ~(s1,a3)=0\displaystyle\tilde{\phi}(s_{1},a_{1})=e_{1},\tilde{\phi}(s_{1},a_{2})=e_{2},\tilde{\phi}(s_{1},a_{3})=0
ϕ~(s2,a1)=e2,ϕ~(s2,a2)=e1,ϕ~(s2,a3)=0\displaystyle\tilde{\phi}(s_{2},a_{1})=e_{2},\tilde{\phi}(s_{2},a_{2})=e_{1},\tilde{\phi}(s_{2},a_{3})=0
n(sj,a1,a2)=(n/2)(12/𝒞2)\displaystyle n(s_{j},a_{1},a_{2})=(n/2)(1-2/\mathcal{C}^{2})
n(sj,a2,a3)=(n/2)(2/𝒞2)(j=1,2)\displaystyle n(s_{j},a_{2},a_{3})=(n/2)(2/\mathcal{C}^{2})\qquad(j=1,2)

Here, we have ensured that maxs,aϕ(s,a)=maxs,aϕ~(s,a)1(L=1)\max_{s,a}\|\phi(s,a)\|=\max_{s,a}\|\tilde{\phi}(s,a)\|\leq 1(L=1). Let α1=e1,α2=e2\alpha_{1}^{*}=e_{1},\alpha_{2}^{*}=e_{2}, ensuring that θi=αi=1(B=1)\|\theta_{i}^{*}\|=\|\alpha_{i}^{*}\|=1(B=1). With these settings, the optimal policies πi(s)\pi_{i}^{*}(s) become:

π1(s1)=a1,π1(s2)=a2\displaystyle\pi_{1}^{*}(s_{1})=a_{1},\pi_{1}^{*}(s_{2})=a_{2}
π2(s1)=a2,π2(s2)=a1\displaystyle\pi_{2}^{*}(s_{1})=a_{2},\pi_{2}^{*}(s_{2})=a_{1}

It is easy to verify that \|\Sigma_{i}^{-\frac{1}{2}}\mathbb{E}_{s\sim\rho}\phi(s,\pi_{i}^{*}(s))\|_{2}\leq\mathcal{C}. Therefore, for any \hat{\pi},

maxi=1,2SinSubOpt𝒬i(π^)\displaystyle\max_{i=1,2}\texttt{SinSubOpt}_{\mathcal{Q}_{i}}(\hat{\pi}) =maxi=1,212[(1θiTϕ(s1,π^(s1)))+(1θiTϕ(s2,π^(s2)))]\displaystyle=\max_{i=1,2}\dfrac{1}{2}[(1-\theta_{i}^{*T}\phi(s_{1},\hat{\pi}(s_{1})))+(1-\theta_{i}^{*T}\phi(s_{2},\hat{\pi}(s_{2})))]
=maxi=1,212[(1αiTϕ~(s1,π^(s1)))+(1αiTϕ~(s2,π^(s2)))]12\displaystyle=\max_{i=1,2}\dfrac{1}{2}[(1-\alpha_{i}^{*T}\tilde{\phi}(s_{1},\hat{\pi}(s_{1})))+(1-\alpha_{i}^{*T}\tilde{\phi}(s_{2},\hat{\pi}(s_{2})))]\geq\dfrac{1}{2} (30)

Now we conclude the proof. ∎
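For completeness, a small enumeration (Python, assuming NumPy; written directly in the reduced feature space \tilde{\phi}) confirming that every deterministic policy suffers single-party sub-optimality at least 1/2 for some party; the bound extends to randomized policies by averaging.

```python
import numpy as np
from itertools import product

e1, e2, zero = np.eye(2)[0], np.eye(2)[1], np.zeros(2)
phi_tilde = {('s1', 'a1'): e1, ('s1', 'a2'): e2, ('s1', 'a3'): zero,
             ('s2', 'a1'): e2, ('s2', 'a2'): e1, ('s2', 'a3'): zero}
alphas = [e1, e2]                      # alpha_1^*, alpha_2^* for the two parties

worst_case = []
for a_s1, a_s2 in product(['a1', 'a2', 'a3'], repeat=2):   # all deterministic policies
    subopt = [0.5 * ((1 - alpha @ phi_tilde[('s1', a_s1)]) +
                     (1 - alpha @ phi_tilde[('s2', a_s2)])) for alpha in alphas]
    worst_case.append(max(subopt))
print(min(worst_case))   # 0.5: no policy drives both parties' sub-optimality below 1/2
```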

Actually, (30) is independent of the observed data, i.e., of the number of observations for each compared pair. This proposition is consistent with the empirical results in Santurkar et al., (2023): applying a traditional single-party model fails since a single reward function cannot represent the diverse views of all individuals. Thus it is reasonable to consider collective welfare functions as an alternative objective for group alignment instead of deploying the single-party approach directly.

Reward-Free Model

Additionally, for the reward-free model, we establish the following lower bound.

Proposition 4.

There exists a preference profile \{\mathbb{P}_{m}\}_{m\in[M]} such that,

maxpsminmminqspsNs,mqsM1M,s𝒮.\max_{p_{s}}\min_{m}\min_{q_{s}}p_{s}^{\top}N_{s,m}^{*}q_{s}\leq-\dfrac{M-1}{M},\quad\forall s\in\mathcal{S}.
Proof.

Suppose \mathcal{S}=\{s\} and \mathcal{A}=\{a_{1},\cdots,a_{M}\}. For the m-th individual, define N_{s,m}^{*}(a_{m},a^{\prime})=1 for every a^{\prime}\neq a_{m}, i.e., the m-th individual prefers a_{m} over every other action with probability 1. Thus, all entries of the m-th column of N_{s,m}^{*} except the diagonal entry are equal to -1.

For any strategy p_{s}=(\alpha_{1},\cdots,\alpha_{M})^{\top} with \sum_{j}\alpha_{j}=1 and \alpha_{j}\geq 0, consider, for the m-th individual, the response q_{s} that deterministically selects a_{m}, i.e., q_{s}(a_{m})=1. We have

p_{s}^{\top}N_{s,m}^{*}q_{s}=-\sum_{j\neq m}\alpha_{j}

Then we have

minm(jmαj)1Mm(jmαj)=M1M\min_{m}(-\sum_{j\neq m}\alpha_{j})\leq\dfrac{1}{M}\sum_{m}(-\sum_{j\neq m}\alpha_{j})=-\dfrac{M-1}{M}

Therefore, we conclude the proof. ∎
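A small numerical check of this construction (Python, assuming NumPy; M is hypothetical), comparing a few random mixed strategies against the deterministic responses used in the proof:

```python
import numpy as np

rng = np.random.default_rng(7)
M = 4
# N_{s,m}: a_m beats every other action for party m; all remaining comparisons are ties (0).
N = [np.zeros((M, M)) for _ in range(M)]
for m in range(M):
    N[m][:, m] = -1.0
    N[m][m, :] = 1.0
    N[m][m, m] = 0.0

for _ in range(5):
    p = rng.dirichlet(np.ones(M))                       # a random mixed strategy p_s
    vals = [p @ N[m][:, m] for m in range(M)]           # p^T N_{s,m} e_{a_m} = -sum_{j != m} p_j
    assert min(vals) <= -(M - 1) / M + 1e-12
print("Proposition 4 bound verified on random strategies.")
```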

Proposition 4, along with Proposition 3, again shows the disparity between multi-party and single-party alignment and highlights the challenge of aligning with heterogeneous preferences: it becomes difficult to ensure the welfare of the worst-off individual when preferences are highly diverse.