
PAD: Personalized Alignment at Decoding-time

Antiquus S. Hippocampus, Natalia Cerebro & Amelie P. Amygdale
Department of Computer Science
Cranberry-Lemon University
Pittsburgh, PA 15213, USA
{hippo,brain,jen}@cs.cranberry-lemon.edu
&Ji Q. Ren & Yevgeny LeNet
Department of Computational Neuroscience
University of the Witwatersrand
Joburg, South Africa
{robot,net}@wits.ac.za

1 Introduction

Recent advances have demonstrated success in aligning language models with human preferences and values. Representative methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) typically optimize a policy model with training signals from an explicit or implicit reward model. This reward model is singular, capturing general human preferences or values from aggregated human feedback.

However, in this pluralistic world, people's preferences can diverge significantly depending on their cultures, educational backgrounds, religions, and political stances. Furthermore, even for the same person, the value of a particular LLM response can vary as the application scenario changes. Hence, there always exists a portion of human preferences that cannot be unified and may even be contradictory, known as personalized preferences, which general alignment frameworks struggle to accommodate due to their need for high-quality datasets and substantial computational costs.

How can we align with personalized preferences without additional data collection and training? Inspired by work on controlled decoding, we introduce Personalized Alignment at Decoding-time (PAD), which controls model outputs during the decoding phase to align with diverse personalized preferences. Specifically, PAD employs a personalized reward model that directly guides the text generation process of a language model. By formulating text generation as a per-token Markov Decision Process (MDP), the personalized reward model can score each decoding step according to personalized preferences. This score is then combined with standard maximum-likelihood decoding to adjust the model's probabilistic prediction. The advantages of PAD are as follows: (1) it requires only a single policy model aligned with general preferences (General Policy), eliminating the need to train additional policies (Training-free); (2) it utilizes only one reward model (Single Reward); and (3) it does not require pre-defined personalized preferences and can generalize to preferences unseen during training (Generalizability). A checklist of PAD's advantages over previous methods is presented in Table 1.

Table 1: Comparison of Different Methods
Method | General Policy | Training-free | Single Reward | Generalizability
Vanilla | Yes | - | - | -
MORLHF | Yes | No | Yes | No
MODPO \citep{zhou2023beyond} | Yes | No | - | No
Personalized soups \citep{jang2023personalized} | No | No | No | Yes
Rewarded soups \citep{rame2024rewarded} | Yes | Yes | No | No
RiC \citep{yang2024rewards} | Yes | No | Yes | No
DPA \citep{wang2024arithmetic} | Yes | No | Yes | No
CD \citep{mudgal2023controlled} | Yes | Yes | Yes | No
ARGS \citep{khanov2024args} | Yes | Yes | Yes | No
MOD \citep{shi2024decoding} | No | Yes | - | No
MetaAligner \citep{yang_metaaligner_2024} | Yes | Yes | - | Yes
PAD | Yes | Yes | Yes | Yes
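To make the decoding-time guidance described above concrete, the following is a minimal sketch (not the paper's exact algorithm) of how a personalized reward score could be combined with maximum-likelihood decoding at each step. The names `lm` and `personalized_reward`, the top-k candidate restriction, and the weighting coefficient `beta` are illustrative assumptions, not components prescribed by the paper.

```python
# Minimal sketch of decoding-time guidance. Assumptions: `lm` is any Hugging
# Face-style causal LM whose forward pass returns `.logits`, and
# `personalized_reward` is a hypothetical scorer that rates a partial sequence
# under a free-text preference description.
import torch

@torch.no_grad()
def guided_decode_step(lm, personalized_reward, input_ids, preference, beta=1.0, top_k=20):
    """Pick the next token by combining LM log-probs with a per-token reward score."""
    logits = lm(input_ids).logits[:, -1, :]          # next-token logits from the base policy
    log_probs = torch.log_softmax(logits, dim=-1)

    # Only score the top-k candidates with the reward model to keep decoding cheap.
    topk_logp, topk_ids = log_probs.topk(top_k, dim=-1)
    scores = []
    for cand in topk_ids[0]:
        cand_seq = torch.cat([input_ids, cand.view(1, 1)], dim=-1)
        scores.append(float(personalized_reward(cand_seq, preference)))  # scalar reward for (s_t, a_t)
    rewards = torch.tensor(scores).unsqueeze(0)

    # Adjust the maximum-likelihood prediction with the personalized reward score.
    combined = topk_logp + beta * rewards
    next_token = topk_ids[0, combined.argmax(dim=-1)]
    return next_token
```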

Our contributions are:

  • Theory: We formulate Personalized Alignment at Decoding-time (PAD) within a per-token Markov Decision Process, providing theoretical insights that PAD aligns with the objectives of RLHF.

  • Algorithm: We propose a practical implementation of PAD that requires no additional training of the policy model and only a single reward model to align with a scalable set of user preferences.

  • Experiments:

2 Related Works

Large language model alignment

Large language model alignment aims to align LLMs with human preferences. The mainstream methods are training-based. A common approach uses the Reinforcement Learning from Human Feedback (RLHF) framework [XXX], in which a reward model is trained from human feedback and Proximal Policy Optimization (PPO) \citep{schulman2017proximal} is employed to derive the aligned policy model [XXX]. Recent efforts explore alternatives that enhance stability and reduce resource requirements. Notably, DPO \citep{rafailov2024direct} leverages the Bradley-Terry assumption \citep{bradley1952rank} to directly optimize the preference-based objective. Decoding-time alignment offers an alignment paradigm that does not require expensive RL training [han2024value] [XXX]. Controlled Decoding (CD) \citep{mudgal2023controlled} utilizes a prefix scorer module trained to estimate value functions for rewards, allowing controlled generation from a frozen base model. ARGS \citep{khanov2024args} proposes using a reward signal to adjust probabilistic predictions, thereby generating semantically aligned texts. DeAL \citep{huang2024deal} focuses on heuristic-guided search to better meet diverse alignment objectives.

Personalized alignment

As humans exhibit diverse preferences and values for a single task, it is essential to align large language models (LLMs) with users' personalized preferences \citep{kirk-etal-2023-past, sorensen2023value, pmlr-v235-sorensen24a, yao2023from, kirk2024benefits, zhong2024panacea, han2024value}. One line of work achieves joint optimization for different personalized preferences by defining a reward function with multiple objective dimensions \citep{zhou2023beyond, wang2024arithmetic, wang2024interpretable, guo2024controllable, yang2024rewards, chakraborty2024maxmin, sun2024salmon, li2024personalized}. Additionally, some approaches merge model parameters or predictions trained for each dimension to accommodate the diverse combinations expressed by those dimensions \citep{jang2023personalized, rame2024rewarded, park2024principled, shi2024decoding}. Lastly, prompt-based methods align with personalized user preferences by designing diverse prompts \citep{yang_metaaligner_2024, lee_aligning_2024, hwang_aligning_2023, jafari_morl-prompt_2024}. The work most similar to ours is MOD \citep{shi2024decoding}, which achieves flexible trade-offs and optimization across multiple objectives by linearly combining predictions from different base models at decoding time to output the next token. However, MOD requires training base models for different preferences, making it difficult to scale to a large number of personalized preferences.

3 Method

3.1 Preliminaries

In this section, we first define the per-token Markov Decision Process (MDP) for large language models (LLMs) and then describe its relationship to classic Reinforcement Learning from Human Feedback (RLHF) approaches.

Text Generation as Token-level Markov Decision Process

We define the standard text generation process of large language models (LLMs) with prompt $\mathbf{x}$ and response $\mathbf{y}$ as a token-level Markov Decision Process (MDP). The MDP is denoted as a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho)$, where the state space $\mathcal{S}$ consists of the prompt and all tokens generated so far (i.e., $\mathbf{s}_{t}=(\mathbf{x},\mathbf{y}_{1:t-1})$). The action space $\mathcal{A}$ is the token vocabulary (i.e., $\mathbf{a}_{t}=\mathbf{y}_{t}$). $\mathcal{P}$ is the transition kernel, which is deterministic: given state $\mathbf{s}_{t}=(\mathbf{x},\mathbf{y}_{1:t-1})$ and action $\mathbf{a}_{t}=\mathbf{y}_{t}$, the next state is $\mathbf{s}_{t+1}=(\mathbf{s}_{t},\mathbf{a}_{t})=(\mathbf{x},\mathbf{y}_{1:t})$. $R:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the token-wise reward. $\rho$ is the initial state distribution of $\mathbf{s}_{0}$, i.e., the distribution over prompts $\mathbf{x}$. A (Markov) policy in this MDP, $\pi:\mathcal{S}\to\Delta(\mathcal{A})$, is a mapping from states to distributions over actions.
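For illustration, here is a minimal sketch of this deterministic token-level MDP; the class and function names are ours, not part of the paper's notation.

```python
# State = prompt plus tokens generated so far; an action is the next token; the
# transition simply appends it. Token ids below are arbitrary placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    prompt: tuple[int, ...]       # x
    generated: tuple[int, ...]    # y_{1:t-1}

def transition(state: State, action: int) -> State:
    """P(s_{t+1} | s_t, a_t): deterministic concatenation of the chosen token."""
    return State(state.prompt, state.generated + (action,))

# s_0 is drawn from rho, i.e. it contains only the prompt x.
s0 = State(prompt=(101, 2023, 2003), generated=())
s1 = transition(s0, 1037)   # append token a_0 = y_1
```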

The RLHF Pipeline

Traditional RLHF approaches model the policy as a parametric distribution $\pi_{\theta}(\mathbf{a}|\mathbf{s})$ defined by an LLM over a finite set of actions. These approaches first learn a reward function from human feedback on prompt-response pairs $(\mathbf{x},\mathbf{y}^{w},\mathbf{y}^{l})$. The reward function is modeled as a contextual bandit using the Bradley-Terry preference model:

$$p^{*}(\mathbf{y}^{w}\succeq\mathbf{y}^{l})=\frac{\exp R(\mathbf{x},\mathbf{y}^{w})}{\exp R(\mathbf{x},\mathbf{y}^{w})+\exp R(\mathbf{x},\mathbf{y}^{l})}, \qquad (1)$$

where $\mathbf{y}^{w}$ and $\mathbf{y}^{l}$ denote the preferred and dispreferred completions for the prompt $\mathbf{x}$, and $p^{*}(\mathbf{y}^{w}\succeq\mathbf{y}^{l})$ denotes the probability that $\mathbf{y}^{w}$ is preferred over $\mathbf{y}^{l}$. Assuming access to a static dataset of comparisons $D=\{\mathbf{x}^{(i)},\mathbf{y}_{w}^{(i)},\mathbf{y}_{l}^{(i)}\}_{i=1}^{N}$ sampled from $p^{*}$, we can parametrize a reward model $R(\mathbf{x},\mathbf{y})$ and estimate its parameters via maximum likelihood. Framing the problem as binary classification, we obtain the negative log-likelihood loss:

$$L_{R}(R,D)=-\mathbb{E}_{(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})\sim D}\left[\log\sigma\left(R(\mathbf{x},\mathbf{y}_{w})-R(\mathbf{x},\mathbf{y}_{l})\right)\right] \qquad (2)$$
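For concreteness, here is a minimal PyTorch sketch of the loss in Eq. 2, assuming a hypothetical `reward_model(x, y)` that returns a scalar reward per prompt-response pair in a batch; the interface is an illustrative assumption.

```python
# Bradley-Terry reward-model loss (Eq. 2).
import torch
import torch.nn.functional as F

def bt_loss(reward_model, x, y_w, y_l):
    """Negative log-likelihood of preferring y_w over y_l under the BT model."""
    r_w = reward_model(x, y_w)          # R(x, y_w), shape (batch,)
    r_l = reward_model(x, y_l)          # R(x, y_l), shape (batch,)
    # -log sigma(R(x, y_w) - R(x, y_l)), averaged over the dataset D
    return -F.logsigmoid(r_w - r_l).mean()
```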

Then, the policy model (i.e., the LLM) $\pi_{\theta}$ is optimized with a gradient-based method such as PPO using the following KL-constrained RL objective:

$$\max_{\pi_{\theta}}\ \mathbb{E}_{\mathbf{a}_{t}\sim\pi_{\theta}(\cdot|\mathbf{s}_{t})}\left[\sum_{t=0}^{T}\Big(R(\mathbf{s}_{t},\mathbf{a}_{t})-\beta\,\mathcal{D}_{KL}\big(\pi_{\theta}(\cdot|\mathbf{s}_{t}),\pi_{\text{ref}}(\cdot|\mathbf{s}_{t})\big)\Big)\right] \qquad (3)$$

where $\pi_{\text{ref}}$ is a reference policy, typically the language model resulting from supervised fine-tuning, from which the learned policy should not deviate significantly. In this optimization, only the generation of the EOS token carries a reward output by the reward model (combined with the KL penalty); for all other tokens in the vocabulary, only the KL component is non-zero. Additionally, we omit the supervised fine-tuning (SFT) stage, which is often considered part of the RLHF pipeline, because it is not directly relevant to the focus of this paper.
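The per-token reward structure described above (the reward model's score appears only at the EOS token, while every step incurs a KL penalty) can be sketched as follows; the tensor names and shapes are assumptions for illustration, not the paper's implementation.

```python
# Per-token shaped rewards for the KL-constrained objective in Eq. 3.
import torch

def kl_shaped_rewards(seq_reward, logp_policy, logp_ref, beta=0.1):
    """Return r_t = -beta * (log pi - log pi_ref) per token, plus R(x, y) at EOS.

    `logp_policy` / `logp_ref` are per-token log-probs of the sampled tokens, shape (T,).
    """
    kl_penalty = beta * (logp_policy - logp_ref)      # per-token KL estimate
    rewards = -kl_penalty
    rewards[-1] = rewards[-1] + seq_reward            # only the final (EOS) step gets R(x, y)
    return rewards
```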

Direct Preference Optimization

Unlike classical RLHF, DPO, as derived in [rafailov2024direct], stays entirely within the contextual-bandit setting and also uses the bandit-based preference model. To circumvent the need for an RL algorithm, DPO uses the well-known closed-form solution to the KL-constrained bandit version of the RL problem posed in Eq. 3 (Ziebart et al., 2008; Levine, 2018):

$$\pi^{*}(\mathbf{y}|\mathbf{x})=\frac{1}{Z(\mathbf{x})}\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})\,e^{R(\mathbf{x},\mathbf{y})/\beta}, \qquad (4)$$

where $\pi^{*}$ is the optimal policy and $Z(\mathbf{x})$ is the partition function that normalizes it. DPO rearranges this equation to solve for the reward:

$$R(\mathbf{x},\mathbf{y})=\beta\log\frac{\pi^{*}(\mathbf{y}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})}+\beta\log Z(\mathbf{x}). \qquad (5)$$

Substituting this relationship into the standard binary cross-entropy loss used for reward modeling (Eq. 2) yields the DPO loss, as the partition function $Z(\mathbf{x})$ cancels in the Bradley-Terry model:

$$L_{\text{DPO}}(\pi_{\theta},D)=-\mathbb{E}_{(\mathbf{x},\mathbf{y}_{w},\mathbf{y}_{l})\sim D}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(\mathbf{y}_{w}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{w}|\mathbf{x})}-\beta\log\frac{\pi_{\theta}(\mathbf{y}_{l}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{l}|\mathbf{x})}\right)\right]. \qquad (6)$$
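A minimal PyTorch sketch of Eq. 6, assuming the per-sequence log-probabilities $\log\pi_{\theta}(\mathbf{y}|\mathbf{x})$ and $\log\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})$ have already been computed for the chosen (w) and rejected (l) responses; names and shapes are illustrative.

```python
# DPO loss (Eq. 6) over precomputed sequence log-probabilities, shape (batch,).
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy over implicit rewards beta * log(pi/pi_ref); Z(x) has cancelled."""
    chosen_margin = beta * (logp_w - ref_logp_w)      # beta * log pi(y_w|x)/pi_ref(y_w|x)
    rejected_margin = beta * (logp_l - ref_logp_l)    # beta * log pi(y_l|x)/pi_ref(y_l|x)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```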

DPO as a Q-function

We first rewrite the RL objective in Eq. 3 in its entropy-regularized form:

$$\max_{\pi_{\theta}}\ \mathbb{E}_{\mathbf{a}_{t}\sim\pi_{\theta}(\cdot|\mathbf{s}_{t})}\left[\sum_{t=0}^{T}\Big(R(\mathbf{s}_{t},\mathbf{a}_{t})+\beta\log\pi_{\text{ref}}(\mathbf{a}_{t}|\mathbf{s}_{t})\Big)+\beta\mathcal{H}(\pi_{\theta})\,\Big|\,\mathbf{s}_{0}\sim\rho(\mathbf{s}_{0})\right] \qquad (7)$$

The relationship between future returns and the current timestep is captured by the Bellman equation, which is satisfied by any valid Q-function. We write it below for the optimal policy $\pi^{*}$ under the reward $R$ with a KL-divergence penalty:

$$Q^{*}(\mathbf{s}_{t},\mathbf{a}_{t})=R(\mathbf{s}_{t},\mathbf{a}_{t})+\beta\log\pi_{\text{ref}}(\mathbf{a}_{t}|\mathbf{s}_{t})+V^{*}(\mathbf{s}_{t+1}), \qquad (8)$$

where the optimal value function $V^{*}$ is a function of $Q^{*}$:

$$V^{*}(\mathbf{s}_{t})=\beta\log\int_{\mathcal{A}}e^{Q^{*}(\mathbf{s}_{t},\mathbf{a})/\beta}\,d\mathbf{a}. \qquad (9)$$

Following [rafailov2024r], the main idea is to first invert the Bellman equation so that the sum of rewards is expressed in terms of the optimal policy:

$$\sum_{t=0}^{T-1}R(\mathbf{s}_{t},\mathbf{a}_{t})=V^{*}(\mathbf{s}_{0})+\sum_{t=0}^{T-1}\beta\log\frac{\pi^{*}(\mathbf{a}_{t}|\mathbf{s}_{t})}{\pi_{\text{ref}}(\mathbf{a}_{t}|\mathbf{s}_{t})}. \qquad (10)$$

This sum of rewards in terms of the optimal policy can then be substituted directly into the preference model in Eq. 1 (the $V^{*}(\mathbf{s}_{0})$ terms cancel because both responses start from the same prompt):

$$p_{\pi^{*}}(\mathbf{y}^{w}\succeq\mathbf{y}^{l})=\sigma\!\left(\sum_{t=0}^{N-1}\beta\log\frac{\pi^{*}(\mathbf{a}_{t}^{w}|\mathbf{s}_{t}^{w})}{\pi_{\text{ref}}(\mathbf{a}_{t}^{w}|\mathbf{s}_{t}^{w})}-\sum_{t=0}^{M-1}\beta\log\frac{\pi^{*}(\mathbf{a}_{t}^{l}|\mathbf{s}_{t}^{l})}{\pi_{\text{ref}}(\mathbf{a}_{t}^{l}|\mathbf{s}_{t}^{l})}\right). \qquad (11)$$
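To illustrate Eqs. 10-11, the per-token implicit rewards are the scaled log-ratios of the tokens actually taken, and summing them over each response recovers the preference probability. The sketch below assumes per-token log-probabilities of the sampled tokens are available; the interface is illustrative.

```python
# Token-level preference probability from summed per-token log-ratios (Eq. 11).
import torch

def token_level_preference_prob(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """p(y_w >= y_l) where each input has shape (batch, T) over the sampled tokens."""
    sum_w = beta * (logp_w - ref_logp_w).sum(dim=-1)   # sum_t beta * log pi(a_t|s_t)/pi_ref(a_t|s_t)
    sum_l = beta * (logp_l - ref_logp_l).sum(dim=-1)
    return torch.sigmoid(sum_w - sum_l)
```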

4 Soft Q-Learning

4.1 Value Functions

We are obliged to alter our definitions of value functions to include the new KL penalty terms. We shall define the state-value function as the expected return:

$$V_{\pi}(\mathbf{s}_{t})=\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k}\Big(r(\mathbf{s}_{t+k})-\beta\,\mathcal{D}_{KL}\big(\pi_{\theta}(\cdot|\mathbf{s}_{t+k}),\pi_{\text{ref}}(\cdot|\mathbf{s}_{t+k})\big)\Big)\right]$$

and we shall define the Q-function as

$$Q_{\pi}(\mathbf{s}_{t},\mathbf{a}_{t})=\mathbb{E}\left[r(\mathbf{s}_{t},\mathbf{a}_{t})+\sum_{k=1}^{\infty}\gamma^{k}\Big(r(\mathbf{s}_{t+k},\mathbf{a}_{t+k})-\beta\,\mathcal{D}_{KL}\big(\pi_{\theta}(\cdot|\mathbf{s}_{t+k}),\pi_{\text{ref}}(\cdot|\mathbf{s}_{t+k})\big)\Big)\right]$$

Note that this Q-function does not include the first KL-penalty term, since that term does not depend on the action $\mathbf{a}_{t}$.

4.2 Boltzmann Policy

In standard reinforcement learning, the "greedy policy" for $Q$ is defined as $[GQ](s)=\arg\max_{a}Q(s,a)$. With entropy regularization, we need to alter our notion of a greedy policy, since the optimal policy is stochastic. Because $Q_{\pi}$ omits the first KL term, it is natural to define the following stochastic policy, called the Boltzmann policy, which is analogous to the greedy policy (here the temperature $\tau$ plays the role of the KL coefficient $\beta$, and $\bar{\pi}$ plays the role of $\pi_{\text{ref}}$):

$$\pi_{Q}^{B}(\cdot\mid s)=\arg\max_{\pi}\left\{\mathbb{E}_{a\sim\pi}[Q(s,a)]-\tau D_{\text{KL}}[\pi\parallel\bar{\pi}](s)\right\}$$
$$\pi_{Q}^{B}(a\mid s)=\bar{\pi}(a\mid s)\exp(Q(s,a)/\tau)\,/\,\mathbb{E}_{a^{\prime}\sim\bar{\pi}}[\exp(Q(s,a^{\prime})/\tau)],$$

where the second line is analogous to the closed-form bandit solution in Eq. 4.

Analogously to the bandit setting, it is natural to define $V_{Q}$ (a function of $Q$) as

$$V_{Q}(s)=\tau\log\mathbb{E}_{a^{\prime}\sim\bar{\pi}}[\exp(Q(s,a^{\prime})/\tau)]$$

so that

$$\pi_{Q}^{B}(a\mid s)=\bar{\pi}(a\mid s)\exp\big((Q(s,a)-V_{Q}(s))/\tau\big).$$
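A minimal sketch of the Boltzmann policy and the induced $V_{Q}$, assuming per-state Q-values and reference-model logits over the action set; the function and argument names are illustrative.

```python
# Boltzmann policy: reweight pi_bar by exp(Q/tau) and renormalize, which equals
# pi_bar(a|s) * exp((Q(s,a) - V_Q(s)) / tau).
import torch

def boltzmann_policy(ref_logits, q_values, tau=1.0):
    """Return (pi_Q^B over actions, V_Q(s)) for each state in the batch."""
    log_ref = torch.log_softmax(ref_logits, dim=-1)               # log pi_bar(a|s)
    logits = log_ref + q_values / tau
    v_q = tau * torch.logsumexp(logits, dim=-1, keepdim=True)     # V_Q(s) = tau * log E_{a~pi_bar}[exp(Q/tau)]
    return torch.softmax(logits, dim=-1), v_q
```

The reweighting-and-renormalizing form is numerically convenient because the normalizer is exactly $\exp(V_{Q}(s)/\tau)$, computed here with a log-sum-exp.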

4.3 Soft Q-Learning

The Boltzmann policy and the value function $V_{Q}$ defined in the preceding section can be used to define practical variants of Q-learning that work with nonlinear function approximation. These methods, which optimize the entropy-augmented return, will be called soft Q-learning. Following Mnih et al. [2015], modern implementations of Q-learning and n-step Q-learning (see Mnih et al. [2016]) update the Q-function incrementally, computing the backup against a fixed target Q-function, which we denote $\bar{Q}$. In the interval between target-network updates, the algorithm approximately performs the backup operation $Q\leftarrow\mathcal{T}\bar{Q}$ (1-step) or $Q\leftarrow\mathcal{T}_{\pi_{\bar{Q}},n}\bar{Q}$ (n-step), where $\mathcal{T}$ denotes the corresponding Boltzmann backup operator. To perform this backup approximately, the algorithms minimize the least-squares loss

$$L(Q)=\mathbb{E}_{t,s_{t},a_{t}}\left[\tfrac{1}{2}\big(Q(s_{t},a_{t})-y_{t}\big)^{2}\right],$$

where

$$y_{t}=r_{t}+\gamma V_{\bar{Q}}(s_{t+1})\qquad\text{(1-step Q-learning)} \qquad (12)$$
$$y_{t}=\tau\,\mathrm{KL}_{t}+\sum_{d=0}^{n-1}\gamma^{d}\big(r_{t+d}-\tau\,\mathrm{KL}_{t+d}\big)+\gamma^{n}V_{\bar{Q}}(s_{t+n})\qquad\text{(n-step Q-learning)} \qquad (13)$$
$$\phantom{y_{t}}=\tau\,\mathrm{KL}_{t}+V_{\bar{Q}}(s_{t})+\sum_{d=0}^{n-1}\gamma^{d}\delta_{t+d}$$

where

$$\delta_{t}=r_{t}-\tau\,\mathrm{KL}_{t}+\gamma V_{\bar{Q}}(s_{t+1})-V_{\bar{Q}}(s_{t}) \qquad (14)$$

In one-step Q-learning (Eq. 12), $y_{t}$ is an unbiased estimator of $[\mathcal{T}\bar{Q}](s_{t},a_{t})$, regardless of which behavior policy was used to collect the data. In n-step Q-learning (Eq. 13), for $n>1$, $y_{t}$ is an unbiased estimator of $[\mathcal{T}_{\pi_{\bar{Q}},n}\bar{Q}](s_{t},a_{t})$ only if the actions $a_{t+1},\dots,a_{t+n-1}$ are sampled using $\pi_{\bar{Q}}^{B}$.
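A minimal sketch of the 1-step soft Q-learning update (Eq. 12 together with the least-squares loss above), under assumed tensor shapes (per-state Q rows over the action set and reference log-probs); the target value bootstraps through $V_{\bar{Q}}$.

```python
# 1-step soft Q-learning: y_t = r_t + gamma * V_Qbar(s_{t+1}), with V_Qbar computed
# as tau * log E_{a~pi_bar}[exp(Qbar(s', a) / tau)].
import torch
import torch.nn.functional as F

def soft_v(q_bar_next, ref_logp_next, tau=1.0):
    """V_Qbar(s_{t+1}) under the fixed target network Qbar; inputs shape (batch, num_actions)."""
    return tau * torch.logsumexp(ref_logp_next + q_bar_next / tau, dim=-1)

def one_step_soft_q_loss(q, a_t, r_t, q_bar_next, ref_logp_next, gamma=0.99, tau=1.0):
    """L(Q) = 1/2 * (Q(s_t, a_t) - y_t)^2, averaged over the batch."""
    y_t = r_t + gamma * soft_v(q_bar_next, ref_logp_next, tau)
    q_sa = q.gather(-1, a_t.unsqueeze(-1)).squeeze(-1)      # Q(s_t, a_t)
    return 0.5 * F.mse_loss(q_sa, y_t.detach())
```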
