PAD: Personalized Alignment at Decoding-time
Abstract
Aligning large language models (LLMs) with general human preferences has seen substantial progress, but people's preferences diverge across cultures, backgrounds, and application scenarios, and such personalized preferences are difficult to serve with standard alignment pipelines that require dedicated data collection and training. We introduce Personalized Alignment at Decoding-time (PAD), which steers the text generation process of a language model during decoding using a personalized reward model. By casting generation as a per-token Markov Decision Process, PAD scores each decoding step against a stated personalized preference and combines this score with standard maximum-likelihood decoding. PAD requires no additional training of the policy model, relies on a single reward model, and generalizes to preferences unseen during training.
1 Introduction
Recent advancements have demonstrated success in aligning language models with human preferences and values. Representative methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) typically optimize a policy model with training signals from an explicit or implicit reward model. This reward model is singular, capturing general human preferences and values from human feedback.
However, in this pluralistic world, people's preferences can diverge significantly based on their cultures, educational backgrounds, religions, and political stances. Furthermore, even for the same person, the value of a particular LLM response can vary as the application scenario changes. Hence, a portion of human preferences cannot be unified and may even be contradictory; we refer to these as personalized preferences. General alignment frameworks struggle to accommodate them because doing so requires high-quality datasets and substantial computational cost for each preference.
How can we align with personalized preferences without additional data collection and training? Inspired by work on controlled decoding, we introduce Personalized Alignment at Decoding-time (PAD), which steers model outputs during the decoding phase toward diverse personalized preferences. Specifically, PAD employs a personalized reward model that directly guides the text generation process of a language model. By framing text generation as a per-token Markov Decision Process (MDP), the personalized reward model can score each decoding step against the stated personalized preference; this score is then combined with standard maximum-likelihood decoding to adjust the model's probabilistic prediction. The advantages of PAD are as follows: (1) it requires only a single policy model aligned with general preferences (General Policy), eliminating the need to train additional policies (Training-free); (2) it uses only one reward model (Single Reward); and (3) it does not rely on pre-defined personalized preferences and generalizes to preferences unseen during training (Generalizability). A checklist of PAD's advantages over previous methods is presented in Table 1.
Table 1: Comparison of PAD with previous preference alignment methods.

Method | General Policy | Training-free | Single Reward | Generalizability |
---|---|---|---|---|
Vanilla | Yes | - | - | - |
MORLHF | Yes | No | Yes | No |
MODPO \citepzhou2023beyond | Yes | No | - | No |
Personalized soups \citepjang2023personalized | No | No | No | Yes |
Rewarded soups \citeprame2024rewarded | Yes | Yes | No | No |
RiC \citepyang2024rewards | Yes | No | Yes | No |
DPA \citepwang2024arithmetic | Yes | No | Yes | No |
CD | Yes | Yes | Yes | No |
ARGS | Yes | Yes | Yes | No |
MOD \citepshi2024decoding | No | Yes | - | No |
MetaAligner \citepyang_metaaligner_2024 | Yes | Yes | - | Yes |
PAD | Yes | Yes | Yes | Yes |
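To make the decoding rule concrete, the following is a minimal sketch of reward-guided token selection, not the exact PAD implementation; the `personalized_rewards` tensor stands in for per-candidate scores from a (hypothetical) personalized reward model, and the top-k restriction is an assumption made to keep scoring cheap.

```python
import torch
import torch.nn.functional as F

def guided_decode_step(base_logits, personalized_rewards, beta=1.0, top_k=10):
    """One decoding step that combines maximum-likelihood decoding with a
    personalized reward signal (illustrative sketch).

    base_logits:          [vocab] next-token logits from the general policy
    personalized_rewards: [vocab] personalized reward for appending each token
    beta:                 trade-off between base likelihood and personalization
    """
    log_probs = F.log_softmax(base_logits, dim=-1)
    # Only the base model's top-k candidates are re-scored (an assumption).
    topk_logp, topk_ids = log_probs.topk(top_k)
    scores = topk_logp + beta * personalized_rewards[topk_ids]
    return topk_ids[scores.argmax()]

# Toy usage with random tensors standing in for real model outputs.
vocab_size = 32000
next_token = guided_decode_step(torch.randn(vocab_size), torch.randn(vocab_size))
```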
Our contributions are:
• Theory: We model Personalized Alignment at Decoding-time (PAD) within a per-token Markov Decision Process, providing theoretical insight that PAD aligns with the objectives of RLHF.
• Algorithm: We propose a practical implementation of PAD that requires no additional training of the policy model and only a single reward model to align with a scalable set of user preferences.
• Experiments:
2 Related Works
Large language model alignment
Large language model alignment aims to align LLMs with human preferences. Mainstream methods are training-based. A common approach uses the RLHF (Reinforcement Learning from Human Feedback) framework [XXX], in which a reward model is trained on human feedback and Proximal Policy Optimization (PPO) \citepschulman2017proximal is employed to derive the aligned policy model [XXX]. Recent efforts explore alternatives that enhance stability and reduce resource requirements. Notably, DPO \citeprafailov2024direct leverages the Bradley-Terry assumption \citepbradley1952rank to directly optimize the preference-based objective. Decoding-time alignment offers a paradigm that avoids expensive RL training \citephan2024value [XXX]. Controlled Decoding (CD) \citepmudgal2023controlled trains a prefix scorer that estimates value functions for rewards, enabling controlled generation from a frozen base model. ARGS \citepkhanov2024args uses a reward signal to adjust probabilistic predictions, thereby generating semantically aligned texts. DeAL \citephuang2024deal focuses on heuristic-guided search to meet diverse alignment objectives.
Personalized alignment
As humans exhibit diverse preferences and values for a single task, it is essential to align large language models (LLMs) to users’ personalized preferences \citepkirk-etal-2023-past, sorensen2023value, pmlr-v235-sorensen24a, yao2023from, kirk2024benefits, zhong2024panacea, han2024value. One line of work achieves joint optimization for different personalized preferences by defining a reward function with multiple objective dimensions \citepzhou2023beyond, wang2024arithmetic, wang2024interpretable, guo2024controllable, yang2024rewards, chakraborty2024maxmin, sun2024salmon, li2024personalized. Additionally, some approaches involve merging model parameters or predictions trained for each dimension to accommodate the diverse combinations expressed by those dimensions \citepjang2023personalized, rame2024rewarded, park2024principled, shi2024decoding. Lastly, prompt-based methods align personalized user preferences by designing diverse prompts \citepyang_metaaligner_2024,lee_aligning_2024,hwang_aligning_2023, jafari_morl-prompt_2024. The work most similar to ours is MOD \citepshi2024decoding, which achieves flexible trade-offs and optimization across multiple objectives by linearly combining predictions from different base models at decoding time to output the next token. However, MOD requires training base models for different preferences, making it difficult to scale to a large number of personalized preferences.
3 Method
3.1 Preliminaries
In this section, we first define the per-token Markov Decision Process (MDP) for large language models (LLMs) and then describe its relationship to classic Reinforcement Learning from Human Feedback (RLHF) approaches.
Text Generation as Token-level Markov Decision Process
We define the standard text generation process of a large language model (LLM), which maps a prompt $\mathbf{x}$ to a response $\mathbf{y}$, as a token-level Markov Decision Process (MDP), denoted by the tuple $(\mathcal{S}, \mathcal{A}, f, r, \rho)$. The state space $\mathcal{S}$ consists of the prompt and all tokens generated so far (i.e., $\mathbf{s}_t = (\mathbf{x}, y_1, \ldots, y_t)$). The action space $\mathcal{A}$ is the token vocabulary (i.e., $a_t \in \mathcal{V}$). $f$ is the transition kernel, which is deterministic: given state $\mathbf{s}_t$ and action $a_t$, the next state is $\mathbf{s}_{t+1} = (\mathbf{s}_t, a_t)$, i.e., the state with the chosen token appended. $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ represents the token-wise reward. $\rho$ is the initial state distribution over $\mathbf{s}_0$, i.e., states consisting only of the prompt $\mathbf{x}$. A (Markov) policy $\pi(a_t \mid \mathbf{s}_t)$ in this MDP is a mapping from a state to a distribution over actions.
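For illustration, here is a minimal sketch of this token-level MDP (token ids stand in for vocabulary entries; the class and function names are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass(frozen=True)
class State:
    prompt: Tuple[int, ...]     # the prompt x
    generated: Tuple[int, ...]  # tokens generated so far

def initial_state(prompt_tokens: Sequence[int]) -> State:
    """rho places all probability mass on states containing only the prompt."""
    return State(tuple(prompt_tokens), ())

def transition(state: State, action: int) -> State:
    """Deterministic kernel: the next state appends the chosen token."""
    return State(state.prompt, state.generated + (action,))
```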
The RLHF Pipeline
Traditional RLHF approaches model the policy by a parametric distribution $\pi_\theta(\mathbf{y} \mid \mathbf{x})$ defined by an LLM over a finite set of actions. These approaches first learn a reward function from human feedback on prompt–response pairs $(\mathbf{x}, \mathbf{y})$. The reward function is modeled as a contextual bandit using the Bradley-Terry preference model:

$$p^*(\mathbf{y}_w \succ \mathbf{y}_l \mid \mathbf{x}) = \frac{\exp\left(r(\mathbf{x}, \mathbf{y}_w)\right)}{\exp\left(r(\mathbf{x}, \mathbf{y}_w)\right) + \exp\left(r(\mathbf{x}, \mathbf{y}_l)\right)}, \tag{1}$$

where $\mathbf{y}_w$ and $\mathbf{y}_l$ denote the preferred and dispreferred completions for the prompt $\mathbf{x}$, and $p^*(\mathbf{y}_w \succ \mathbf{y}_l \mid \mathbf{x})$ denotes the probability that $\mathbf{y}_w$ is preferred to $\mathbf{y}_l$. Assuming access to a static dataset of comparisons $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}_w^{(i)}, \mathbf{y}_l^{(i)})\}_{i=1}^{N}$ sampled from $p^*$, we can parametrize a reward model $r_\phi(\mathbf{x}, \mathbf{y})$ and estimate its parameters via maximum likelihood. Framing the problem as binary classification yields the negative log-likelihood loss:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(\mathbf{x}, \mathbf{y}_w) - r_\phi(\mathbf{x}, \mathbf{y}_l)\right)\right], \tag{2}$$

where $\sigma$ denotes the logistic function.
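As a sketch, Eq. 2 amounts to a binary cross-entropy on reward differences; the snippet below assumes batched scalar rewards already computed by some reward model $r_\phi$:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. 2: -E[log sigma(r(x, y_w) - r(x, y_l))].

    r_chosen / r_rejected: [batch] scalar rewards for preferred / dispreferred responses.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```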
Then, the policy model (i.e., the LLM) is optimized with a gradient-based method like PPO using the following KL-constrained RL objective:
$$\max_{\pi_\theta}\; \mathbb{E}_{\mathbf{x} \sim \mathcal{D},\, \mathbf{y} \sim \pi_\theta(\cdot \mid \mathbf{x})}\left[r_\phi(\mathbf{x}, \mathbf{y})\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(\mathbf{y} \mid \mathbf{x}) \,\|\, \pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x})\right], \tag{3}$$
where $\beta$ controls the deviation and $\pi_{\mathrm{ref}}$ is a reference policy, often the language model resulting from supervised fine-tuning, from which the learned policy $\pi_\theta$ should not deviate significantly. During optimization of the policy, only the EOS token carries the reward output by the reward model, combined with the KL penalty, while for all other tokens in the vocabulary only the KL component is non-zero. Additionally, we omit the supervised fine-tuning (SFT) stage, which is often considered part of the RLHF pipeline, because it is not directly relevant to the focus of this paper.
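To illustrate the reward structure described above, here is a sketch of how the per-token reward is commonly assembled in KL-constrained PPO pipelines (a common construction consistent with the description, not necessarily the exact implementation used here):

```python
import torch

def per_token_rewards(logp_policy, logp_ref, reward_model_score, beta=0.1):
    """Token-wise reward for a sampled response of length T (illustrative).

    logp_policy, logp_ref: [T] per-token log-probs under the policy / reference
    reward_model_score:    scalar score from the reward model, credited at EOS
    """
    rewards = -beta * (logp_policy - logp_ref)      # KL penalty on every token
    rewards[-1] = rewards[-1] + reward_model_score  # reward only at the EOS token
    return rewards
```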
Direct Preference Optimization
Unlike classical RLHF, DPO, as derived in \citeprafailov2024direct, stays entirely within the contextual bandit setting and also uses the bandit-based preference model. To circumvent the need for an RL algorithm, DPO uses the well-known closed-form solution to the KL-regularized contextual bandit version of the RL problem posed in Eq. 3 (Ziebart et al., 2008; Levine, 2018):

$$\pi^*(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}\, \pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x}) \exp\left(\frac{1}{\beta}\, r(\mathbf{x}, \mathbf{y})\right), \tag{4}$$
where $\pi^*$ is the optimal policy and $Z(\mathbf{x}) = \sum_{\mathbf{y}} \pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x}) \exp\left(\frac{1}{\beta} r(\mathbf{x}, \mathbf{y})\right)$ is the partition function that normalizes it. DPO rearranges this equation to solve for the reward:

$$r(\mathbf{x}, \mathbf{y}) = \beta \log \frac{\pi^*(\mathbf{y} \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y} \mid \mathbf{x})} + \beta \log Z(\mathbf{x}). \tag{5}$$
Substituting this relationship into the standard binary cross-entropy loss used for reward modeling (Eq. 2) yields the DPO loss, as the partition function cancels in the Bradley-Terry model:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(\mathbf{x}, \mathbf{y}_w, \mathbf{y}_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(\mathbf{y}_w \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}_w \mid \mathbf{x})} - \beta \log \frac{\pi_\theta(\mathbf{y}_l \mid \mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}_l \mid \mathbf{x})}\right)\right]. \tag{6}$$
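A compact sketch of the DPO loss in Eq. 6, assuming the sequence-level log-probabilities $\log \pi(\mathbf{y} \mid \mathbf{x})$ have already been computed for the policy and the reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_w_ref, logp_l, logp_l_ref, beta=0.1):
    """Eq. 6: -E[log sigma(beta*(log pi/pi_ref)(y_w) - beta*(log pi/pi_ref)(y_l))]."""
    margin = beta * (logp_w - logp_w_ref) - beta * (logp_l - logp_l_ref)
    return -F.logsigmoid(margin).mean()
```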
DPO as a Q-function
We first rewrite the KL-constrained objective of Eq. 3 in its entropy-regularized, token-level form:

$$\max_{\pi_\theta}\; \mathbb{E}_{\mathbf{s}_0 \sim \rho,\, a_t \sim \pi_\theta(\cdot \mid \mathbf{s}_t)}\left[\sum_{t=0}^{T} \left(r(\mathbf{s}_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid \mathbf{s}_t)\right) + \beta\, \mathcal{H}(\pi_\theta)\right]. \tag{7}$$
The relationship between future returns and the current timestep is captured by the Bellman equations, which are satisfied by any valid Q-function. We write them below for the optimal policy under the reward with a KL-divergence penalty:

$$Q^*(\mathbf{s}_t, a_t) = r(\mathbf{s}_t, a_t) + \beta \log \pi_{\mathrm{ref}}(a_t \mid \mathbf{s}_t) + V^*(\mathbf{s}_{t+1}), \tag{8}$$
where the optimal value function $V^*$ is a function of $Q^*$:

$$V^*(\mathbf{s}_t) = \beta \log \sum_{a \in \mathcal{A}} \exp\left(Q^*(\mathbf{s}_t, a)/\beta\right), \tag{9}$$

with $V^*(\mathbf{s}_{t+1}) = 0$ when $\mathbf{s}_{t+1}$ is terminal.
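As a small numerical sketch of Eqs. 8–9 (function names and tensor shapes are our own assumptions, not the paper's), the soft Bellman backup in this token-level MDP can be written as:

```python
import torch

def soft_bellman_backup(r_t, logp_ref_t, q_star_next, beta=0.1, terminal=False):
    """Q*(s_t, a_t) = r(s_t, a_t) + beta * log pi_ref(a_t | s_t) + V*(s_{t+1}),
    with V*(s) = beta * logsumexp(Q*(s, .) / beta) and V* = 0 at terminal states.

    q_star_next: [vocab] optimal Q-values at the next state s_{t+1}.
    """
    v_next = 0.0 if terminal else beta * torch.logsumexp(q_star_next / beta, dim=-1)
    return r_t + beta * logp_ref_t + v_next
```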
Following \citeprafailov2024r, the main idea is to first invert the Bellman equation so that the sum of rewards is expressed through the optimal policy. Using $\pi^*(a_t \mid \mathbf{s}_t) = \exp\left(\left(Q^*(\mathbf{s}_t, a_t) - V^*(\mathbf{s}_t)\right)/\beta\right)$ and telescoping Eq. 8 over a trajectory yields

$$\sum_{t=0}^{T-1} r(\mathbf{s}_t, a_t) = V^*(\mathbf{s}_0) + \sum_{t=0}^{T-1} \beta \log \frac{\pi^*(a_t \mid \mathbf{s}_t)}{\pi_{\mathrm{ref}}(a_t \mid \mathbf{s}_t)}. \tag{10}$$
Then, the sum of rewards in terms of the optimal policy can be directly substituted into the preference model in Eq. 1; the $V^*(\mathbf{s}_0)$ terms cancel because both completions start from the same prompt:

$$p^*(\mathbf{y}_w \succ \mathbf{y}_l \mid \mathbf{x}) = \sigma\left(\sum_{t=0}^{T_w-1} \beta \log \frac{\pi^*(a_t^w \mid \mathbf{s}_t^w)}{\pi_{\mathrm{ref}}(a_t^w \mid \mathbf{s}_t^w)} - \sum_{t=0}^{T_l-1} \beta \log \frac{\pi^*(a_t^l \mid \mathbf{s}_t^l)}{\pi_{\mathrm{ref}}(a_t^l \mid \mathbf{s}_t^l)}\right). \tag{11}$$
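Eq. 11 can be read as a per-token credit assignment: the preference probability depends only on the summed log-ratios $\beta \log(\pi^*/\pi_{\mathrm{ref}})$, since $V^*(\mathbf{s}_0)$ cancels for completions sharing a prompt. A sketch, assuming per-token log-probabilities are available:

```python
import torch

def preference_probability(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """Eq. 11: sigma of the difference of summed per-token credits beta*log(pi*/pi_ref).

    Each argument is a [T] tensor of per-token log-probabilities for one completion.
    """
    credit_w = beta * (logp_w - logp_ref_w).sum()
    credit_l = beta * (logp_l - logp_ref_l).sum()
    return torch.sigmoid(credit_w - credit_l)
```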
4 Soft Q-Learning
4.1 Value Functions
We are obliged to alter our definitions of the value functions to include the new KL penalty terms. We shall define the state-value function as the expected KL-penalized return:

$$V_\pi(\mathbf{s}_t) = \mathbb{E}_\pi\left[\sum_{l=0}^{\infty} \gamma^l \left(r_{t+l} - \beta\, \mathrm{KL}\!\left[\pi(\cdot \mid \mathbf{s}_{t+l}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid \mathbf{s}_{t+l})\right]\right)\right],$$

and we shall define the Q-function as

$$Q_\pi(\mathbf{s}_t, a_t) = \mathbb{E}_\pi\left[r_t + \sum_{l=1}^{\infty} \gamma^l \left(r_{t+l} - \beta\, \mathrm{KL}\!\left[\pi(\cdot \mid \mathbf{s}_{t+l}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid \mathbf{s}_{t+l})\right]\right)\right].$$
Note that this Q-function does not include the first KL penalty term, which does not depend on the action $a_t$.
4.2 Boltzmann Policy
In standard reinforcement learning, the "greedy policy" for $Q$ is defined as $\arg\max_a Q(\mathbf{s}, a)$. With entropy regularization, we need to alter our notion of a greedy policy, as the optimal policy is stochastic. Since $Q_\pi$ omits the first KL penalty term, it is natural to define the following stochastic policy, which is called the Boltzmann policy and is analogous to the greedy policy:

$$\pi^{\mathcal{B}}_Q(a \mid \mathbf{s}) = \frac{\pi_{\mathrm{ref}}(a \mid \mathbf{s}) \exp\left(Q(\mathbf{s}, a)/\beta\right)}{\mathbb{E}_{a' \sim \pi_{\mathrm{ref}}(\cdot \mid \mathbf{s})}\left[\exp\left(Q(\mathbf{s}, a')/\beta\right)\right]},$$

which is analogous to the closed-form bandit solution in Eq. 4.
Also analogously to the bandit setting, it is natural to define $V_Q$ (a function of $Q$) as

$$V_Q(\mathbf{s}) = \beta \log \mathbb{E}_{a \sim \pi_{\mathrm{ref}}(\cdot \mid \mathbf{s})}\left[\exp\left(Q(\mathbf{s}, a)/\beta\right)\right],$$

so that

$$\pi^{\mathcal{B}}_Q(a \mid \mathbf{s}) = \pi_{\mathrm{ref}}(a \mid \mathbf{s}) \exp\left(\left(Q(\mathbf{s}, a) - V_Q(\mathbf{s})\right)/\beta\right).$$
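A numerically stable sketch of the soft value $V_Q$ and the Boltzmann policy over a discrete action set (function names are ours, for illustration only):

```python
import torch

def soft_value(q, logp_ref, beta=0.1):
    """V_Q(s) = beta * log E_{a ~ pi_ref}[exp(Q(s, a) / beta)], via logsumexp."""
    return beta * torch.logsumexp(logp_ref + q / beta, dim=-1)

def boltzmann_policy(q, logp_ref, beta=0.1):
    """pi_B(a|s) = pi_ref(a|s) * exp((Q(s, a) - V_Q(s)) / beta)."""
    v = soft_value(q, logp_ref, beta)
    return torch.exp(logp_ref + (q - v) / beta)

# Toy usage: the resulting probabilities over a 5-token action set sum to one.
q = torch.randn(5)
logp_ref = torch.log_softmax(torch.randn(5), dim=-1)
assert torch.isclose(boltzmann_policy(q, logp_ref).sum(), torch.tensor(1.0))
```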
4.3 Soft Q-Learning
Boltzmann-style backups based on the soft value function defined in the preceding section can be used to construct practical variants of Q-learning that work with nonlinear function approximation. These methods, which optimize the entropy-augmented return, will be called soft Q-learning. Following Mnih et al. [2015], modern implementations of Q-learning and of n-step Q-learning (see Mnih et al. [2016]) update the Q-function incrementally, computing backups against a fixed target Q-function, which we will call $Q_{\mathrm{targ}}$. In the interval between target-network updates, the algorithm approximately performs the backup operation $Q \leftarrow \mathcal{T} Q_{\mathrm{targ}}$ (1-step) or $Q \leftarrow \mathcal{T}^n Q_{\mathrm{targ}}$ (n-step), where $\mathcal{T}$ denotes the soft backup operator $[\mathcal{T}Q](\mathbf{s}_t, a_t) = \mathbb{E}\left[r_t + \gamma V_Q(\mathbf{s}_{t+1})\right]$. To perform this approximate minimization, the algorithm minimizes the least-squares loss

$$L(\theta) = \mathbb{E}_{(\mathbf{s}_t, a_t)}\left[\tfrac{1}{2}\left(Q_\theta(\mathbf{s}_t, a_t) - y_t\right)^2\right],$$

where, for 1-step Q-learning,

$$y_t = r_t + \gamma V_{Q_{\mathrm{targ}}}(\mathbf{s}_{t+1}),$$

and, for n-step Q-learning,

$$y_t = r_t + \sum_{d=1}^{n-1} \gamma^d \left(r_{t+d} - \beta\, \mathrm{KL}\!\left[\pi(\cdot \mid \mathbf{s}_{t+d}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid \mathbf{s}_{t+d})\right]\right) + \gamma^n V_{Q_{\mathrm{targ}}}(\mathbf{s}_{t+n}),$$

with $V_{Q_{\mathrm{targ}}}$ defined as in the preceding section.
In one-step Q-learning, $y_t$ is an unbiased estimator of $[\mathcal{T} Q_{\mathrm{targ}}](\mathbf{s}_t, a_t)$, regardless of what behavior policy was used to collect the data. In n-step Q-learning with $n > 1$, $y_t$ is an unbiased estimator of $[\mathcal{T}^n Q_{\mathrm{targ}}](\mathbf{s}_t, a_t)$ only if the intermediate actions are sampled using $\pi^{\mathcal{B}}_{Q_{\mathrm{targ}}}$.
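A sketch of the one-step soft Q-learning target under these definitions (names are ours; the soft value is computed with the reference-weighted logsumexp from Section 4.2):

```python
import torch

def one_step_target(r_t, logp_ref_next, q_targ_next, beta=0.1, gamma=1.0):
    """y_t = r_t + gamma * V_{Q_targ}(s_{t+1}), where
    V_{Q_targ}(s) = beta * log E_{a ~ pi_ref}[exp(Q_targ(s, a) / beta)].

    logp_ref_next, q_targ_next: [vocab] tensors evaluated at state s_{t+1}.
    """
    v_next = beta * torch.logsumexp(logp_ref_next + q_targ_next / beta, dim=-1)
    return r_t + gamma * v_next
```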