
PAD: Personalized Alignment at Decoding-time

Antiquus S. Hippocampus, Natalia Cerebro & Amelie P. Amygdale
Department of Computer Science
Cranberry-Lemon University
Pittsburgh, PA 15213, USA
{hippo,brain,jen}@cs.cranberry-lemon.edu
&Ji Q. Ren & Yevgeny LeNet
Department of Computational Neuroscience
University of the Witwatersrand
Joburg, South Africa
{robot,net}@wits.ac.za

1 Introduction

Recent advances have demonstrated success in aligning language models with human preferences and values. Representative methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) typically optimize a policy model with training signals from an explicit or implicit reward model. This reward model is singular, capturing general human preferences or values from aggregated human feedback.

However, in this pluralistic world, people's preferences can diverge significantly depending on their cultures, educational backgrounds, religions, and political stances. Furthermore, even for the same person, the value of a particular LLM response can vary as the application scenario changes. Hence, there always exists a portion of human preferences that cannot be unified and may even be contradictory, known as personalized preferences, which general alignment frameworks struggle to accommodate due to their need for high-quality datasets and substantial computational costs.

How can we align with personalized preferences without additional data collection and training? Inspired by work on controlled decoding, we introduce Personalized Alignment at Decoding-time (PAD), which controls model outputs during the decoding phase to align with diverse personalized preferences. Specifically, PAD employs a personalized reward model that directly guides the text generation process of a language model. By formulating text generation as a per-token Markov Decision Process (MDP), the personalized reward model can score each decoding step according to personalized preferences. This score is then combined with standard maximum-likelihood decoding to adjust the model's probabilistic prediction. The advantages of PAD are as follows: (1) it requires only a single policy model aligned with general preferences (General Policy), eliminating the need to train additional policies (Training-free); (2) it utilizes only one reward model (Single Reward); and (3) it does not require pre-defined personalized preferences and can generalize to preferences unseen during training (Generalizability). A checklist of PAD's advantages over previous methods is presented in Table 1.

Table 1: Comparison of Different Methods
Method | General Policy | Training-free | Single Reward | Generalizability
Vanilla | Yes | - | - | -
MORLHF | Yes | No | Yes | No
MODPO \citep{zhou2023beyond} | Yes | No | - | No
Personalized soups \citep{jang2023personalized} | No | No | No | Yes
Rewarded soups \citep{rame2024rewarded} | Yes | Yes | No | No
RiC \citep{yang2024rewards} | Yes | No | Yes | No
DPA \citep{wang2024arithmetic} | Yes | No | Yes | No
CD \citep{mudgal2023controlled} | Yes | Yes | Yes | No
ARGS \citep{khanov2024args} | Yes | Yes | Yes | No
MOD \citep{shi2024decoding} | No | Yes | - | No
MetaAligner \citep{yang_metaaligner_2024} | Yes | Yes | - | Yes
PAD | Yes | Yes | Yes | Yes
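To make the decoding-time guidance described above concrete, the following is a minimal sketch (not the paper's exact algorithm) of how a personalized reward score could be combined with maximum-likelihood decoding at each step. The names `lm` and `personalized_reward`, the top-k candidate restriction, and the weighting coefficient `beta` are illustrative assumptions, not components prescribed by the paper.

```python
# Minimal sketch of decoding-time guidance. Assumptions: `lm` is any Hugging
# Face-style causal LM whose forward pass returns `.logits`, and
# `personalized_reward` is a hypothetical scorer that rates a partial sequence
# under a free-text preference description.
import torch

@torch.no_grad()
def guided_decode_step(lm, personalized_reward, input_ids, preference, beta=1.0, top_k=20):
    """Pick the next token by combining LM log-probs with a per-token reward score."""
    logits = lm(input_ids).logits[:, -1, :]          # next-token logits from the base policy
    log_probs = torch.log_softmax(logits, dim=-1)

    # Only score the top-k candidates with the reward model to keep decoding cheap.
    topk_logp, topk_ids = log_probs.topk(top_k, dim=-1)
    scores = []
    for cand in topk_ids[0]:
        cand_seq = torch.cat([input_ids, cand.view(1, 1)], dim=-1)
        scores.append(float(personalized_reward(cand_seq, preference)))  # scalar reward for (s_t, a_t)
    rewards = torch.tensor(scores).unsqueeze(0)

    # Adjust the maximum-likelihood prediction with the personalized reward score.
    combined = topk_logp + beta * rewards
    next_token = topk_ids[0, combined.argmax(dim=-1)]
    return next_token
```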

Our contributions are:

  • Theory: We formulate Personalized Alignment at Decoding-time (PAD) within a per-token Markov Decision Process, providing theoretical insights that PAD aligns with the objectives of RLHF.

  • Algorithm: We propose a practical implementation of PAD that requires no additional training of the policy model and only a single reward model to align with a scalable set of user preferences.

  • Experiments:

2 Related Works

Large language model alignment

Large language model alignment aims to align LLMs with human preferences. The mainstream methods are training-based. A common approach uses the Reinforcement Learning from Human Feedback (RLHF) framework [XXX], in which a reward model is trained from human feedback and Proximal Policy Optimization (PPO) \citep{schulman2017proximal} is employed to derive the aligned policy model [XXX]. Recent efforts explore alternatives that enhance stability and reduce resource requirements. Notably, DPO \citep{rafailov2024direct} leverages the Bradley-Terry assumption \citep{bradley1952rank} to directly optimize the preference-based objective. Decoding-time alignment offers an alignment paradigm that does not require expensive RL training [han2024value] [XXX]. Controlled Decoding (CD) \citep{mudgal2023controlled} utilizes a prefix scorer module trained to estimate value functions for rewards, allowing controlled generation from a frozen base model. ARGS \citep{khanov2024args} proposes using a reward signal to adjust probabilistic predictions, thereby generating semantically aligned texts. DeAL \citep{huang2024deal} focuses on heuristic-guided search to better meet diverse alignment objectives.

Personalized alignment

As humans exhibit diverse preferences and values for a single task, it is essential to align large language models (LLMs) with users' personalized preferences \citep{kirk-etal-2023-past, sorensen2023value, pmlr-v235-sorensen24a, yao2023from, kirk2024benefits, zhong2024panacea, han2024value}. One line of work achieves joint optimization for different personalized preferences by defining a reward function with multiple objective dimensions \citep{zhou2023beyond, wang2024arithmetic, wang2024interpretable, guo2024controllable, yang2024rewards, chakraborty2024maxmin, sun2024salmon, li2024personalized}. Additionally, some approaches merge model parameters or predictions trained for each dimension to accommodate the diverse combinations expressed by those dimensions \citep{jang2023personalized, rame2024rewarded, park2024principled, shi2024decoding}. Lastly, prompt-based methods align with personalized user preferences by designing diverse prompts \citep{yang_metaaligner_2024, lee_aligning_2024, hwang_aligning_2023, jafari_morl-prompt_2024}. The work most similar to ours is MOD \citep{shi2024decoding}, which achieves flexible trade-offs and optimization across multiple objectives by linearly combining predictions from different base models at decoding time to output the next token. However, MOD requires training base models for different preferences, making it difficult to scale to a large number of personalized preferences.

3 Method

3.1 Preliminaries

In this section, we first define the per-token Markov Decision Process (MDP) for large language models (LLMs) and then describe its relationship to classic Reinforcement Learning from Human Feedback (RLHF) approaches.

Text Generation as Token-level Markov Decision Process

We define the standard text generation process of large language models (LLMs) with prompt $\mathbf{x}$ and response $\mathbf{y}$ as a token-level Markov Decision Process (MDP). The MDP is denoted as a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho)$, where the state space $\mathcal{S}$ consists of the prompt and all tokens generated so far (i.e., $\mathbf{s}_{t}=(\mathbf{x},\mathbf{y}_{1:t-1})$). The action space $\mathcal{A}$ is the token vocabulary (i.e., $\mathbf{a}_{t}=\mathbf{y}_{t}$). $\mathcal{P}$ is the transition kernel, which is deterministic: given state $\mathbf{s}_{t}=(\mathbf{x},\mathbf{y}_{1:t-1})$ and action $\mathbf{a}_{t}=\mathbf{y}_{t}$, the next state is $\mathbf{s}_{t+1}=(\mathbf{s}_{t},\mathbf{a}_{t})=(\mathbf{x},\mathbf{y}_{1:t})$. $R:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the token-wise reward. $\rho$ is the initial state distribution of $\mathbf{s}_{0}$, i.e., the distribution over prompts $\mathbf{x}$. A (Markov) policy in this MDP, $\pi:\mathcal{S}\to\Delta(\mathcal{A})$, is a mapping from states to distributions over actions.
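For illustration, here is a minimal sketch of this deterministic token-level MDP; the class and function names are ours, not part of the paper's notation.

```python
# State = prompt plus tokens generated so far; an action is the next token; the
# transition simply appends it. Token ids below are arbitrary placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    prompt: tuple[int, ...]       # x
    generated: tuple[int, ...]    # y_{1:t-1}

def transition(state: State, action: int) -> State:
    """P(s_{t+1} | s_t, a_t): deterministic concatenation of the chosen token."""
    return State(state.prompt, state.generated + (action,))

# s_0 is drawn from rho, i.e. it contains only the prompt x.
s0 = State(prompt=(101, 2023, 2003), generated=())
s1 = transition(s0, 1037)   # append token a_0 = y_1
```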

The RLHF Pipeline

Traditional RLHF approaches model the policy as a parametric distribution $\pi_{\theta}(\mathbf{a}|\mathbf{s})$ defined by an LLM over a finite set of actions. These approaches first learn a reward function from human feedback on prompt-response pairs $(\mathbf{x},\mathbf{y}^{w},\mathbf{y}^{l})$. The reward function is modeled as a contextual bandit using the Bradley-Terry preference model:

$$p^{*}(\mathbf{y}^{w}\succeq\mathbf{y}^{l})=\frac{\exp R(\mathbf{x},\mathbf{y}^{w})}{\exp R(\mathbf{x},\mathbf{y}^{w})+\exp R(\mathbf{x},\mathbf{y}^{l})}, \qquad (1)$$

where $\mathbf{y}^{w}$ and $\mathbf{y}^{l}$ denote the preferred and dispreferred completions for the prompt $\mathbf{x}$, and $p^{*}(\mathbf{y}^{w}\succeq\mathbf{y}^{l})$ denotes the probability that $\mathbf{y}^{w}$ is preferred over $\mathbf{y}^{l}$. Assuming access to a static dataset of comparisons $D=\{\mathbf{x}^{(i)},\mathbf{y}_{w}^{(i)},\mathbf{y}_{l}^{(i)}\}_{i=1}^{N}$ sampled from $p^{*}$, we can parametrize a reward model $R(\mathbf{x},\mathbf{y})$ and estimate its parameters via maximum likelihood. Framing the problem as binary classification, we obtain the negative log-likelihood loss:

$$L_{R}(R,D)=-\mathbb{E}_{(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})\sim D}\left[\log\sigma\left(R(\mathbf{x},\mathbf{y}_{w})-R(\mathbf{x},\mathbf{y}_{l})\right)\right] \qquad (2)$$
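For concreteness, here is a minimal PyTorch sketch of the loss in Eq. 2, assuming a hypothetical `reward_model(x, y)` that returns a scalar reward per prompt-response pair in a batch; the interface is an illustrative assumption.

```python
# Bradley-Terry reward-model loss (Eq. 2).
import torch
import torch.nn.functional as F

def bt_loss(reward_model, x, y_w, y_l):
    """Negative log-likelihood of preferring y_w over y_l under the BT model."""
    r_w = reward_model(x, y_w)          # R(x, y_w), shape (batch,)
    r_l = reward_model(x, y_l)          # R(x, y_l), shape (batch,)
    # -log sigma(R(x, y_w) - R(x, y_l)), averaged over the dataset D
    return -F.logsigmoid(r_w - r_l).mean()
```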

Then, the policy model (i.e., the LLM) $\pi_{\theta}$ is optimized with a gradient-based method such as PPO using the following KL-constrained RL objective:

$$\max_{\pi_{\theta}}\ \mathbb{E}_{\mathbf{a}_{t}\sim\pi_{\theta}(\cdot|\mathbf{s}_{t})}\left[\sum_{t=0}^{T}\Big(R(\mathbf{s}_{t},\mathbf{a}_{t})-\beta\,\mathcal{D}_{KL}\big(\pi_{\theta}(\cdot|\mathbf{s}_{t}),\pi_{\text{ref}}(\cdot|\mathbf{s}_{t})\big)\Big)\right] \qquad (3)$$

where $\pi_{\text{ref}}$ is a reference policy, typically the language model resulting from supervised fine-tuning, from which the learned policy should not deviate significantly. In this optimization, only the generation of the EOS token carries a reward output by the reward model (combined with the KL penalty); for all other tokens in the vocabulary, only the KL component is non-zero. Additionally, we omit the supervised fine-tuning (SFT) stage, which is often considered part of the RLHF pipeline, because it is not directly relevant to the focus of this paper.
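The per-token reward structure described above (the reward model's score appears only at the EOS token, while every step incurs a KL penalty) can be sketched as follows; the tensor names and shapes are assumptions for illustration, not the paper's implementation.

```python
# Per-token shaped rewards for the KL-constrained objective in Eq. 3.
import torch

def kl_shaped_rewards(seq_reward, logp_policy, logp_ref, beta=0.1):
    """Return r_t = -beta * (log pi - log pi_ref) per token, plus R(x, y) at EOS.

    `logp_policy` / `logp_ref` are per-token log-probs of the sampled tokens, shape (T,).
    """
    kl_penalty = beta * (logp_policy - logp_ref)      # per-token KL estimate
    rewards = -kl_penalty
    rewards[-1] = rewards[-1] + seq_reward            # only the final (EOS) step gets R(x, y)
    return rewards
```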

Direct Preference Optimization

Unlike classical RLHF, DPO, as derived in [rafailov2024direct], stays entirely within the contextual-bandit setting and also uses the bandit-based preference model. To circumvent the need for an RL algorithm, DPO uses the well-known closed-form solution to the KL-constrained bandit version of the RL problem posed in Eq. 3 (Ziebart et al., 2008; Levine, 2018):

$$\pi^{*}(\mathbf{y}|\mathbf{x})=\frac{1}{Z(\mathbf{x})}\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\,e^{R(\mathbf{x},\mathbf{y})/\beta}, \qquad (4)$$

where $\pi^{*}$ is the optimal policy and $Z(\mathbf{x})$ is the partition function that normalizes it. DPO rearranges this equation to solve for the reward:

$$R(\mathbf{x},\mathbf{y})=\beta\log\frac{\pi^{*}(\mathbf{y}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})}+\beta\log Z(\mathbf{x}). \qquad (5)$$

Substituting this relationship into the standard binary cross-entropy loss used for reward modeling (Eq. 2) yields the DPO loss, as the partition function $Z(\mathbf{x})$ cancels in the Bradley-Terry model:

$$L_{\text{DPO}}(\pi_{\theta},D)=-\mathbb{E}_{(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})\sim D}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(\mathbf{y}_{w}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{w}|\mathbf{x})}-\beta\log\frac{\pi_{\theta}(\mathbf{y}_{l}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{l}|\mathbf{x})}\right)\right]. \qquad (6)$$
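A minimal PyTorch sketch of Eq. 6, assuming the per-sequence log-probabilities $\log\pi_{\theta}(\mathbf{y}|\mathbf{x})$ and $\log\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})$ have already been computed for the chosen (w) and rejected (l) responses; names and shapes are illustrative.

```python
# DPO loss (Eq. 6) over precomputed sequence log-probabilities, shape (batch,).
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy over implicit rewards beta * log(pi/pi_ref); Z(x) has cancelled."""
    chosen_margin = beta * (logp_w - ref_logp_w)      # beta * log pi(y_w|x)/pi_ref(y_w|x)
    rejected_margin = beta * (logp_l - ref_logp_l)    # beta * log pi(y_l|x)/pi_ref(y_l|x)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```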

DPO as a Q-function

We first rewrite the RL objective in Eq. 3 in its entropy-regularized form:

$$\max_{\pi_{\theta}}\ \mathbb{E}_{\mathbf{a}_{t}\sim\pi_{\theta}(\cdot|\mathbf{s}_{t})}\left[\sum_{t=0}^{T}\Big(R(\mathbf{s}_{t},\mathbf{a}_{t})+\beta\log\pi_{\text{ref}}(\mathbf{a}_{t}|\mathbf{s}_{t})\Big)+\beta\mathcal{H}(\pi_{\theta})\,\Big|\,\mathbf{s}_{0}\sim\rho(\mathbf{s}_{0})\right] \qquad (7)$$

The relationship between future returns and the current timestep is captured by the Bellman equation, which is satisfied by any valid Q-function. We write it below for the optimal policy $\pi^{*}$ under the reward $R$ with a KL-divergence penalty:

$$Q^{*}(\mathbf{s}_{t},\mathbf{a}_{t})=R(\mathbf{s}_{t},\mathbf{a}_{t})+\beta\log\pi_{\text{ref}}(\mathbf{a}_{t}|\mathbf{s}_{t})+V^{*}(\mathbf{s}_{t+1}), \qquad (8)$$

where the optimal value function $V^{*}$ is a function of $Q^{*}$:

$$V^{*}(\mathbf{s}_{t})=\beta\log\int_{\mathcal{A}}e^{Q^{*}(\mathbf{s}_{t},\mathbf{a})/\beta}\,d\mathbf{a}. \qquad (9)$$

Following [rafailov2024r], the main idea is to first invert the Bellman equation so that the sum of rewards is expressed in terms of the optimal policy:

$$\sum_{t=0}^{T-1}R(\mathbf{s}_{t},\mathbf{a}_{t})=V^{*}(\mathbf{s}_{0})+\sum_{t=0}^{T-1}\beta\log\frac{\pi^{*}(\mathbf{a}_{t}|\mathbf{s}_{t})}{\pi_{\text{ref}}(\mathbf{a}_{t}|\mathbf{s}_{t})}. \qquad (10)$$

This sum of rewards in terms of the optimal policy can then be substituted directly into the preference model in Eq. 1 (the $V^{*}(\mathbf{s}_{0})$ terms cancel because both responses start from the same prompt):

$$p_{\pi^{*}}(\mathbf{y}^{w}\succeq\mathbf{y}^{l})=\sigma\!\left(\sum_{t=0}^{N-1}\beta\log\frac{\pi^{*}(\mathbf{a}_{t}^{w}|\mathbf{s}_{t}^{w})}{\pi_{\text{ref}}(\mathbf{a}_{t}^{w}|\mathbf{s}_{t}^{w})}-\sum_{t=0}^{M-1}\beta\log\frac{\pi^{*}(\mathbf{a}_{t}^{l}|\mathbf{s}_{t}^{l})}{\pi_{\text{ref}}(\mathbf{a}_{t}^{l}|\mathbf{s}_{t}^{l})}\right). \qquad (11)$$
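To illustrate Eqs. 10-11, the per-token implicit rewards are the scaled log-ratios of the tokens actually taken, and summing them over each response recovers the preference probability. The sketch below assumes per-token log-probabilities of the sampled tokens are available; the interface is illustrative.

```python
# Token-level preference probability from summed per-token log-ratios (Eq. 11).
import torch

def token_level_preference_prob(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """p(y_w >= y_l) where each input has shape (batch, T) over the sampled tokens."""
    sum_w = beta * (logp_w - ref_logp_w).sum(dim=-1)   # sum_t beta * log pi(a_t|s_t)/pi_ref(a_t|s_t)
    sum_l = beta * (logp_l - ref_logp_l).sum(dim=-1)
    return torch.sigmoid(sum_w - sum_l)
```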

4 Soft Q-Learning

4.1 Value Functions

We are obliged to alter our definitions of value functions to include the new KL penalty terms. We shall define the state-value function as the expected return:

$$V_{\pi}(\mathbf{s}_{t})=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}\Big(r(\mathbf{s}_{t+k})-\beta\,\mathcal{D}_{KL}\big(\pi_{\theta}(\cdot|\mathbf{s}_{t+k}),\pi_{\text{ref}}(\cdot|\mathbf{s}_{t+k})\big)\Big)\right]$$

and we shall define the Q-function as

$$Q_{\pi}(\mathbf{s}_{t},\mathbf{a}_{t})=\mathbb{E}\left[r(\mathbf{s}_{t},\mathbf{a}_{t})+\sum_{k=1}^{\infty}\gamma^{k}\Big(r(\mathbf{s}_{t+k},\mathbf{a}_{t+k})-\beta\,\mathcal{D}_{KL}\big(\pi_{\theta}(\cdot|\mathbf{s}_{t+k}),\pi_{\text{ref}}(\cdot|\mathbf{s}_{t+k})\big)\Big)\right]$$

Note that this Q-function does not include the first KL-penalty term, since that term does not depend on the action $\mathbf{a}_{t}$.

4.2 Boltzmann Policy

In standard reinforcement learning, the "greedy policy" for $Q$ is defined as $[GQ](s)=\arg\max_{a}Q(s,a)$. With entropy regularization, we need to alter our notion of a greedy policy, since the optimal policy is stochastic. Because $Q_{\pi}$ omits the first KL term, it is natural to define the following stochastic policy, called the Boltzmann policy, which is analogous to the greedy policy (here the temperature $\tau$ plays the role of the KL coefficient $\beta$, and $\bar{\pi}$ plays the role of $\pi_{\text{ref}}$):

$$\pi_{Q}^{B}(\cdot\mid s)=\arg\max_{\pi}\left\{\mathbb{E}_{a\sim\pi}[Q(s,a)]-\tau D_{\text{KL}}[\pi\parallel\bar{\pi}](s)\right\}$$
$$\pi_{Q}^{B}(a\mid s)=\bar{\pi}(a\mid s)\exp(Q(s,a)/\tau)\,/\,\mathbb{E}_{a^{\prime}\sim\bar{\pi}}[\exp(Q(s,a^{\prime})/\tau)],$$

where the second line is analogous to the closed-form bandit solution in Eq. 4.

Analogously to the bandit setting, it is natural to define $V_{Q}$ (a function of $Q$) as

$$V_{Q}(s)=\tau\log\mathbb{E}_{a^{\prime}\sim\bar{\pi}}[\exp(Q(s,a^{\prime})/\tau)]$$

so that

$$\pi_{Q}^{B}(a\mid s)=\bar{\pi}(a\mid s)\exp\big((Q(s,a)-V_{Q}(s))/\tau\big).$$
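A minimal sketch of the Boltzmann policy and the induced $V_{Q}$, assuming per-state Q-values and reference-model logits over the action set; the function and argument names are illustrative.

```python
# Boltzmann policy: reweight pi_bar by exp(Q/tau) and renormalize, which equals
# pi_bar(a|s) * exp((Q(s,a) - V_Q(s)) / tau).
import torch

def boltzmann_policy(ref_logits, q_values, tau=1.0):
    """Return (pi_Q^B over actions, V_Q(s)) for each state in the batch."""
    log_ref = torch.log_softmax(ref_logits, dim=-1)               # log pi_bar(a|s)
    logits = log_ref + q_values / tau
    v_q = tau * torch.logsumexp(logits, dim=-1, keepdim=True)     # V_Q(s) = tau * log E_{a~pi_bar}[exp(Q/tau)]
    return torch.softmax(logits, dim=-1), v_q
```

The reweighting-and-renormalizing form is numerically convenient because the normalizer is exactly $\exp(V_{Q}(s)/\tau)$, computed here with a log-sum-exp.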

4.3 Soft Q-Learning

The Boltzmann policy and the value function $V_{Q}$ defined in the preceding section can be used to define practical variants of Q-learning that work with nonlinear function approximation. These methods, which optimize the entropy-augmented return, will be called soft Q-learning. Following Mnih et al. [2015], modern implementations of Q-learning and n-step Q-learning (see Mnih et al. [2016]) update the Q-function incrementally, computing the backup against a fixed target Q-function, which we denote $\bar{Q}$. In the interval between target-network updates, the algorithm approximately performs the backup operation $Q\leftarrow\mathcal{T}\bar{Q}$ (1-step) or $Q\leftarrow\mathcal{T}_{\pi_{\bar{Q}},n}\bar{Q}$ (n-step), where $\mathcal{T}$ denotes the corresponding Boltzmann backup operator. To perform this backup approximately, the algorithms minimize the least-squares loss

$$L(Q)=\mathbb{E}_{t,s_{t},a_{t}}\left[\tfrac{1}{2}\big(Q(s_{t},a_{t})-y_{t}\big)^{2}\right],$$

where

$$y_{t}=r_{t}+\gamma V_{\bar{Q}}(s_{t+1})\qquad\text{(1-step Q-learning)} \qquad (12)$$
$$y_{t}=\tau\,\mathrm{KL}_{t}+\sum_{d=0}^{n-1}\gamma^{d}\big(r_{t+d}-\tau\,\mathrm{KL}_{t+d}\big)+\gamma^{n}V_{\bar{Q}}(s_{t+n})\qquad\text{(n-step Q-learning)} \qquad (13)$$
$$\phantom{y_{t}}=\tau\,\mathrm{KL}_{t}+V_{\bar{Q}}(s_{t})+\sum_{d=0}^{n-1}\gamma^{d}\delta_{t+d}$$

where

$$\delta_{t}=r_{t}-\tau\,\mathrm{KL}_{t}+\gamma V_{\bar{Q}}(s_{t+1})-V_{\bar{Q}}(s_{t}) \qquad (14)$$

In one-step Q-learning (Eq. 12), $y_{t}$ is an unbiased estimator of $[\mathcal{T}\bar{Q}](s_{t},a_{t})$, regardless of which behavior policy was used to collect the data. In n-step Q-learning (Eq. 13), for $n>1$, $y_{t}$ is an unbiased estimator of $[\mathcal{T}_{\pi_{\bar{Q}},n}\bar{Q}](s_{t},a_{t})$ only if the actions $a_{t+1},\dots,a_{t+n-1}$ are sampled using $\pi_{\bar{Q}}^{B}$.
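A minimal sketch of the 1-step soft Q-learning update (Eq. 12 together with the least-squares loss above), under assumed tensor shapes (per-state Q rows over the action set and reference log-probs); the target value bootstraps through $V_{\bar{Q}}$.

```python
# 1-step soft Q-learning: y_t = r_t + gamma * V_Qbar(s_{t+1}), with V_Qbar computed
# as tau * log E_{a~pi_bar}[exp(Qbar(s', a) / tau)].
import torch
import torch.nn.functional as F

def soft_v(q_bar_next, ref_logp_next, tau=1.0):
    """V_Qbar(s_{t+1}) under the fixed target network Qbar; inputs shape (batch, num_actions)."""
    return tau * torch.logsumexp(ref_logp_next + q_bar_next / tau, dim=-1)

def one_step_soft_q_loss(q, a_t, r_t, q_bar_next, ref_logp_next, gamma=0.99, tau=1.0):
    """L(Q) = 1/2 * (Q(s_t, a_t) - y_t)^2, averaged over the batch."""
    y_t = r_t + gamma * soft_v(q_bar_next, ref_logp_next, tau)
    q_sa = q.gather(-1, a_t.unsqueeze(-1)).squeeze(-1)      # Q(s_t, a_t)
    return 0.5 * F.mse_loss(q_sa, y_t.detach())
```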
