Faster WIND:
Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment
Google DeepMind
(October 2024; revised February 2025)
Abstract
Recent advances in aligning large language models with human preferences have corroborated the growing importance of best-of-$N$ distillation (BOND). However, the iterative BOND algorithm is prohibitively expensive in practice due to its sample and computational inefficiency. This paper addresses the problem by revealing a unified game-theoretic connection between iterative BOND and self-play alignment, which unifies seemingly disparate algorithmic paradigms. Based on this connection, we establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization that approximates iterative BOND in the parameter space. We provide a provable sample-efficiency guarantee for one of the WIND variants with the squared loss objective. The experimental results confirm that our algorithm not only accelerates the computation, but also achieves superior sample efficiency compared to existing methods.
Keywords: Reinforcement learning from human feedback (RLHF), preference optimization, matrix game, sample efficiency
1 Introduction
Fine-tuning large language models (LLMs) to align with human preferences has become a critical challenge in artificial intelligence to ensure the safety of their deployment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach, significantly improving LLM performance as demonstrated by InstructGPT (Ouyang et al., 2022) and subsequent works. RLHF combines reward modeling to quantify human preferences and RL fine-tuning to adjust the LLM’s output distribution, enhancing desired responses while suppressing unfavorable ones. While RLHF has shown promising results, it comes with significant extra post-training cost, and the aligned LLM may exhibit performance degeneration due to the alignment tax (Askell et al., 2021; OpenAI, 2023).
Alternatively, best-of-$N$ (BoN) sampling has emerged as a simple and surprisingly effective technique to obtain high-quality outputs from an LLM (Stiennon et al., 2020). In BoN sampling, multiple samples are drawn from an LLM, ranked according to a specific attribute, and the best one is selected. This simple approach can improve model outputs without the need for extensive fine-tuning, offering a potentially more efficient path to alignment. Building upon the success of BoN sampling, a few works explored iterative variants of this approach (Dong et al., 2023; Sessa et al., 2024). Iterative BoN applies the BoN sampling and selection process repeatedly, potentially leading to even better alignment with human preferences.
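As a minimal illustration, the sketch below implements BoN sampling; `sample` and `reward` are hypothetical helpers standing in for the policy's decoder and the reward model, respectively.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n i.i.d. responses from the policy and return the highest-reward one."""
    candidates: List[str] = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```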
However, BoN incurs significant computational overhead, since multiple inference calls are needed to generate one output, especially when $N$ is large. To mitigate the high inference cost of (iterative) BoN, Sessa et al. (2024) proposed a distillation algorithm, best-of-$N$ distillation (BOND), to train a new model that emulates the output of iterative BoN. However, this approach also has a high training cost, due to the need to collect multiple samples in each round of distillation, which is a major bottleneck for wider adoption. Given the growing importance of the iterative BoN approach, new questions arise about its theoretical properties, practical implementation, and relationship to established methods like RLHF. In this paper, we delve into the theoretical foundations and practical applications of iterative BoN sampling for LLM alignment. We address the following question:
What are the limiting points of iterative BoN, and can we design faster algorithms to find them?
1.1 Contributions
We provide comprehensive answers to these questions through the following key contributions.
• We introduce a general algorithmic framework for iterative BoN distillation, possibly with a slow-moving anchor, and uncover that its limiting point corresponds to the Nash equilibrium of a (regularized) two-player min-max game optimizing the logarithm of the expected win rate. This offers a fresh game-theoretic interpretation that was unavailable before.
• We propose a novel algorithmic framework, WIND, to find the win rate dominance policy with flexible loss configurations, and demonstrate that it exhibits improved sample and computational efficiency compared to prior work, while maintaining provable convergence guarantees.
• We conduct extensive experiments to evaluate the performance of WIND, demonstrating competitive or better performance against state-of-the-art alignment methods such as J-BOND across various benchmarks, and highlighting its efficiency in the sampling process and training cost.
1.2 Related work
RLHF and LLM alignment.
Reinforcement learning from human feedback (RLHF) is an effective approach to train AI models to produce outputs that align with human values and preferences (Christiano et al., 2017; Stiennon et al., 2020; Nakano et al., 2021). Recently, RLHF has become the most effective approach to align language models (Ouyang et al., 2022; Bai et al., 2022); the famous InstructGPT (Ouyang et al., 2022) approach eventually led to the groundbreaking ChatGPT and GPT-4 (OpenAI, 2023). A variety of RLHF methods have been proposed, including direct preference optimization (Rafailov et al., 2024) and many other variants (Zhao et al., 2023; Yuan et al., 2023b; Azar et al., 2024; Meng et al., 2024; Xu et al., 2024; Ethayarajh et al., 2024; Tang et al., 2024), to name a few, which directly learn from preference data without RL fine-tuning. Furthermore, value-incentivized preference optimization (Cen et al., 2024) has been proposed to implement the optimistic (resp. pessimistic) principle for online (resp. offline) RLHF in a practical way with theoretical guarantees.
RLHF via self-play.
Another line of RLHF methods investigates self-play optimization for unregularized and regularized two-player win rate games, respectively (Swamy et al., 2024; Munos et al., 2023). Wu et al. (2024b) introduced a scalable self-play algorithm for win rate games, enabling efficient fine-tuning of LLMs; see also Rosset et al. (2024); Zhang et al. (2024); Gao et al. (2024); Huang et al. (2024), among others.
Best-of-$N$ and BOND.
BoN has empirically shown an impressive reward-KL trade-off (Nakano et al., 2021; Gao et al., 2023), which has been theoretically investigated by Gui et al. (2024) from the perspective of win rate maximization. Beirami et al. (2024) analyzed the KL divergence between the BoN policy and the base policy, and Yang et al. (2024a) studied the asymptotic properties of the BoN policy. The recent work Gui et al. (2024) also proposed a method that uses both best-of-$N$ and worst-of-$N$ samples to train language models. Sessa et al. (2024) introduced BOND and J-BOND to train language models to learn BoN policies, and Amini et al. (2024) proposed vBoN, which is equivalent to BOND. However, no existing work characterizes the properties of iterative BoN.
Notation.
We let $[n]$ denote the index set $\{1, \ldots, n\}$, $I$ denote the identity matrix, and $\langle \cdot, \cdot \rangle$ denote the inner product in Euclidean space. We let $\mathrm{supp}(\pi)$ denote the support set of the distribution $\pi$, and $\mathrm{relint}(\mathcal{S})$ the relative interior of a set $\mathcal{S}$. We defer all the proofs to the appendix.
2 Preliminaries
2.1 RLHF: reward versus win rate
We consider the language model as a policy $\pi_\theta$, where $\theta$ denotes its parameters and $\Theta$ the compact parameter space. Given a prompt $x \in \mathcal{X}$, the policy generates an answer $y \in \mathcal{Y}$ according to the conditional distribution $\pi_\theta(\cdot\,|\,x)$. For notational simplicity, we drop the subscript $\theta$ when it is clear from the context. We let $\Delta(\mathcal{Y})$ denote the simplex over $\mathcal{Y}$, and denote the space of policies as follows:
$$\Pi := \big\{ \pi: \ \pi(\cdot\,|\,x) \in \Delta(\mathcal{Y}) \ \text{for all } x \in \mathcal{X} \big\}.$$
In practice, RLHF optimizes the policy model against the reward model while staying close to a reference model $\pi_{\mathrm{ref}}$. Two metrics are commonly considered: reward and win rate.
Reward maximization.
Suppose there is a reward model $r(x, y)$, which produces a scalar reward given a prompt $x$ and a response $y$. RLHF aims to maximize the KL-regularized value function, given a reference model $\pi_{\mathrm{ref}}$:
$$\max_{\pi}\ \ \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot|x)}\big[ r(x, y) \big] \;-\; \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}), \qquad (1)$$
where
$$\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) := \mathbb{E}_{x \sim \rho}\Big[ \mathrm{KL}\big( \pi(\cdot\,|\,x) \,\|\, \pi_{\mathrm{ref}}(\cdot\,|\,x) \big) \Big]$$
is the KL divergence between policies $\pi$ and $\pi_{\mathrm{ref}}$, with $\rho$ being the distribution of prompts. Here, $\beta > 0$ is a hyperparameter that balances the reward and the KL divergence. Without loss of generality, we assume is throughout the paper.
Win rate maximization.
Another scheme of RLHF aims to maximize the KL-regularized win rate against the reference model (Gui et al., 2024). Given a reward model $r$, a preference model $\mathbb{P}(y \succ y' \,|\, x)$ can be defined as:
(2) |
Given a policy pair $(\pi, \pi')$, the win rate of $\pi$ over $\pi'$ is given as (Swamy et al., 2024)
$$\mathbb{P}(\pi \succ \pi') := \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot|x),\; y' \sim \pi'(\cdot|x)}\big[ \mathbb{P}(y \succ y' \,|\, x) \big]. \qquad (3)$$
The KL-regularized win rate maximization objective is defined as (Gui et al., 2024):
$$\max_{\pi}\ \ \mathbb{P}(\pi \succ \pi_{\mathrm{ref}}) \;-\; \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}). \qquad (4)$$
Win rate maximization is better aligned with the evaluation metrics adopted in common benchmarks; moreover, it can be carried out without an explicit reward model as long as the preference model is well-defined.
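As a sketch, the win rate in (3) can be estimated by Monte Carlo; here `sample_pi`, `sample_pi_ref`, and `preference` are hypothetical stand-ins for the two policies' samplers and the preference model $\mathbb{P}(y \succ y' \,|\, x)$.

```python
from typing import Callable, Sequence

def estimate_win_rate(prompts: Sequence[str],
                      sample_pi: Callable[[str], str],
                      sample_pi_ref: Callable[[str], str],
                      preference: Callable[[str, str, str], float],
                      num_pairs: int = 4) -> float:
    """Monte Carlo estimate of the win rate (3): average the pairwise
    preference over prompts and sampled response pairs."""
    total, count = 0.0, 0
    for x in prompts:
        for _ in range(num_pairs):
            y, y_ref = sample_pi(x), sample_pi_ref(x)
            total += preference(x, y, y_ref)
            count += 1
    return total / max(count, 1)
```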
2.2 Best-of- distillation
Best-of-$N$ (BoN) is a simple yet strong baseline in RLHF. Given a reward model $r$ and a prompt $x$, BoN samples $N$ i.i.d. responses from the policy $\pi_{\mathrm{ref}}$ and selects the response with the highest reward. We call the resulting policy the BoN policy, which selects the sample with the highest reward among $N$ samples drawn i.i.d. from $\pi_{\mathrm{ref}}$. Gui et al. (2024) show that for any fixed small regularization strength, the BoN policy (approximately) maximizes the KL-regularized win rate (4) for a properly chosen $N$. While BoN is widely used in practice (Beirami et al., 2024; Gao et al., 2023; Wang et al., 2024), it can be quite expensive in terms of the inference cost of drawing $N$ samples. Hence, BoN distillation (BOND) (Sessa et al., 2024) was developed to approximate the BoN policy through fine-tuning from some reference policy, which can be updated iteratively via an exponential moving average (Sessa et al., 2024).
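For intuition, when no two distinct responses share the same reward, the BoN policy admits a well-known closed form (see, e.g., Beirami et al., 2024): with $F(y\,|\,x) := \mathbb{P}_{y' \sim \pi_{\mathrm{ref}}(\cdot|x)}\big[ r(x, y') \le r(x, y) \big]$,
$$\pi_{\mathrm{BoN}}(y\,|\,x) \;=\; F(y\,|\,x)^{N} - \big( F(y\,|\,x) - \pi_{\mathrm{ref}}(y\,|\,x) \big)^{N}.$$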
3 A Unified Game-Theoretic View
In this section, we present a game-theoretic understanding of iterative BoN, which allows us to connect it to existing game-theoretic RLHF approaches under a win rate maximization framework.
3.1 Iterative BoN as game solving
Iterative BoN.
Due to the success of BoN sampling, its iterative version has also been studied (Dong et al., 2023; Sessa et al., 2024), where BoN is performed iteratively with a moving anchor as the reference policy. To understand its properties in generality, we introduce the iterative BoN method in Algorithm 1, which encapsulates iterative BoN methods with or without a moving reference model; we call these the mixing and no-mixing cases, respectively.
Algorithm 1 covers these two cases. In the mixing case, we obtain the new policy at each iteration by mixing the BoN policy with the anchor according to the mixing rates; in the no-mixing case, the algorithm simply returns the BoN policy as the policy for the next iteration. We provide theoretical guarantees for both cases using the following game-theoretic framework.
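To make the no-mixing case concrete, the following toy tabular sketch repeatedly applies the exact BoN distribution for a single prompt, using the closed form above and assuming distinct rewards; the mixing case additionally blends in the anchor and is omitted here.

```python
import numpy as np

def bon_pmf(pi: np.ndarray, rewards: np.ndarray, n: int) -> np.ndarray:
    """Exact BoN distribution over a finite response set for one prompt,
    assuming all rewards are distinct (no ties)."""
    order = np.argsort(rewards)                   # ascending reward
    pi_sorted = pi[order]
    cdf = np.cumsum(pi_sorted)                    # P(r(y') <= r(y)) under pi
    pmf_sorted = cdf ** n - (cdf - pi_sorted) ** n
    out = np.empty_like(pi)
    out[order] = pmf_sorted
    return out

def iterative_bon_no_mixing(pi0: np.ndarray, rewards: np.ndarray, n: int, rounds: int) -> np.ndarray:
    """No-mixing iterative BoN: pi_{t+1} = BoN_n(pi_t)."""
    pi = pi0.copy()
    for _ in range(rounds):
        pi = bon_pmf(pi, rewards, n)
    return pi
```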
Game-theoretic perspective.
We show that iterative BoN is implicitly solving the following game. Define a preference matrix $P(x)$ of size $|\mathcal{Y}| \times |\mathcal{Y}|$ at prompt $x$ by
$$[P(x)]_{y, y'} := \mathbb{P}(y \succ y' \,|\, x). \qquad (5)$$
Define further as
(6) |
We introduce the following symmetric two-player log-win-rate game:
$$\max_{\pi \in \Pi}\ \min_{\pi' \in \Pi}\ \ \mathbb{E}_{x \sim \rho}\Big[ \log\big( \pi(\cdot|x)^\top P(x)\, \pi'(\cdot|x) \big) \Big] \;-\; \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) \;+\; \beta\, \mathrm{KL}(\pi' \,\|\, \pi_{\mathrm{ref}}). \qquad (7)$$
A Nash equilibrium of the above log-win-rate game (7) satisfies the following fixed-point characterization:
(8) |
Now we present our Theorem 1, which guarantees the convergence of Algorithm 1 to solutions of the above game.
Theorem 1 (Iterative BoN solves the log-win-rate game (7)).
In the no-mixing case, we show that the policy obtained by Algorithm 1 converges to an equilibrium of the unregularized log-win-rate game. In the mixing case, we show that with a proper choice of the mixing rates, Algorithm 1 solves the regularized log-win-rate game. To the best of our knowledge, this is the first game-theoretic understanding of iterative BoN under a general preference model.
3.2 Self-play and win rate dominance
The log-win-rate game (7) is a non-zero-sum game that may be challenging to optimize: the objective is not convex-concave, the Nash equilibria may not be unique, and the logarithm introduces nonlinearity, which makes estimation difficult. Therefore, we seek a good alternative to the log-win-rate game that maintains its core properties while being more amenable to optimization.
Specifically, we now consider the following alternative two-player win-rate game:
$$\max_{\pi \in \Pi}\ \min_{\pi' \in \Pi}\ \ \mathbb{P}(\pi \succ \pi') \;-\; \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) \;+\; \beta\, \mathrm{KL}(\pi' \,\|\, \pi_{\mathrm{ref}}), \qquad (9)$$
which eliminates the logarithmic nonlinearity in the objective, and has been recently studied by Swamy et al. (2024); Wu et al. (2024b) for $\beta = 0$ and by Munos et al. (2023) for $\beta > 0$.
The following proposition guarantees that the game (9) is well-defined and is equivalent to the following fixed-point problem:
$$\pi^\star = \arg\max_{\pi \in \Pi}\ \mathbb{P}(\pi \succ \pi^\star) \;-\; \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}). \qquad (10)$$
Proposition 1 (existence of $\pi^\star$).
$\pi^\star$ exists and is a Nash equilibrium of the min-max game (9). Moreover, when $\beta > 0$, $\pi^\star$ is the unique Nash equilibrium.
Win rate dominance.
The fixed-point equation (10) identifies a policy with a higher winning probability against any other policy. For $\beta = 0$, $\pi^\star$ satisfies $\mathbb{P}(\pi^\star \succ \pi) \ge 1/2$ for any $\pi \in \Pi$, ensuring it outperforms (or at least ties with) other policies. When $\beta > 0$, the KL divergence term encourages $\pi^\star$ to remain close to $\pi_{\mathrm{ref}}$ while maintaining a high win rate. We term (10) the Win rate Dominance (WIND) optimization problem.
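As a simple illustration (assuming ties are counted as half wins, so the diagonal of the preference matrix is $1/2$), consider a single prompt with two responses $y_1, y_2$, $\beta = 0$, and $\mathbb{P}(y_1 \succ y_2 \,|\, x) = p > 1/2$. Then
$$P(x) = \begin{pmatrix} 1/2 & p \\ 1 - p & 1/2 \end{pmatrix}, \qquad \pi^\star(\cdot\,|\,x) = (1, 0),$$
since $\mathbb{P}(\pi^\star \succ \pi) = \tfrac{1}{2}\pi(y_1\,|\,x) + p\,\pi(y_2\,|\,x) \ge 1/2$ for every $\pi$, with equality only at $\pi = \pi^\star$.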
3.3 Connecting iterative BoN with WIND
Due to the monotonicity of the logarithm, it is natural to postulate that the win-rate game and the log-win-rate game underpinning iterative BoN are connected. We establish this relationship rigorously, which allows a unifying game-theoretic view of many existing algorithms. We define a constant as:
(11) |
where, for each prompt $x$, the associated set is the set of responses that maximize the reward. We now characterize the relationship between the equilibrium sets of the log-win-rate game (7) and the win-rate game (9).
Theorem 2 (relationship between two games (informal)).
Theorem 2 shows that when $\beta = 0$, both games share the solution $\pi^\star$. For small positive $\beta$, the distance between the solutions of the two games is bounded by a term that decreases exponentially. We verify Theorem 2 empirically on contextual bandits in Section 5.1. This result provides theoretical justification for using iterative BoN as an approximation to WIND, or vice versa, especially when $\beta$ is small. More importantly, it paves the way for efficient algorithms for WIND, bypassing the logarithmic nonlinearity of the log-win-rate game.
4 Faster WIND
Based on the connection between the log-win-rate game and the win-rate game, in this section we propose new sample-efficient algorithms for finding the WIND solution in (9). Our approach includes two ingredients: (i) identifying a memory-efficient, exact policy optimization algorithm with linear last-iterate convergence (Sokota et al., 2023), and (ii) developing a series of sample-efficient algorithms with flexible loss functions and finite-time convergence guarantees. With a slight abuse of terminology, we shall refer to our algorithmic framework also as WIND.
4.1 Exact policy optimization with last-iterate linear convergence
Recognizing that (9) is a KL-regularized matrix game, many existing algorithms can be applied to find $\pi^\star$. Nonetheless, it is desirable to achieve fast last-iterate convergence with a small memory footprint, which is especially important when optimizing LLMs. For example, extragradient algorithms (e.g., Korpelevich (1976); Popov (1980); Cen et al. (2021)), although fast-convergent, are expected to be expensive in terms of memory usage due to the need to store an additional extrapolation point (i.e., another copy of the LLM) at each iteration.
It turns out that the magnetic mirror descent algorithm in Sokota et al. (2023), which was proposed to solve a variational inequality formulation equivalent to (9), meets our requirements. We present a tailored version of this algorithm in Algorithm 2, and state its linear last-iterate convergence in Theorem 3.
$$\pi_{t+1}(y\,|\,x) \;\propto\; \Big( \pi_t(y\,|\,x)\, \exp\!\big( \eta\, [P(x)\, \pi_t(\cdot|x)]_y \big)\, \pi_{\mathrm{ref}}(y\,|\,x)^{\eta \beta} \Big)^{\frac{1}{1 + \eta \beta}}. \qquad (13)$$
Theorem 3 (Linear last-iterate convergence of Algorithm 2, Sokota et al. (2023)).
Under suitable conditions on the problem and the learning rate $\eta$, the iterate $\pi_t$ in Algorithm 2 satisfies:
(14) |
Remark 1.
We note that when $\beta = 0$, the update rule (13) recovers (Swamy et al., 2024, Algorithm 1). When $\beta > 0$, the update rule in (13) is different from that of Munos et al. (2023), which is
where the mixed policy in that update is defined as
As such, it requires extra memory to store the mixed policy. Moreover, Munos et al. (2023) establishes a slower convergence rate, whereas Algorithm 2 admits linear convergence.
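For a single prompt with a known preference matrix, the exact update (13) takes only a few lines; the tabular sketch below (with an illustrative toy matrix) is meant only to convey the update, not an LLM-scale implementation.

```python
import numpy as np

def wind_exact_update(pi_t: np.ndarray, pi_ref: np.ndarray, P: np.ndarray,
                      eta: float, beta: float) -> np.ndarray:
    """One step of (13) for a single prompt:
    pi_{t+1} proportional to (pi_t * exp(eta * P @ pi_t) * pi_ref**(eta*beta))**(1/(1+eta*beta))."""
    win_vs_pi_t = P @ pi_t                       # expected win rate of each response against pi_t
    log_pi = (np.log(pi_t) + eta * win_vs_pi_t + eta * beta * np.log(pi_ref)) / (1.0 + eta * beta)
    log_pi -= log_pi.max()                       # for numerical stability
    pi_next = np.exp(log_pi)
    return pi_next / pi_next.sum()

# Toy example: three responses, preference matrix with P + P.T = all-ones.
P = np.array([[0.5, 0.8, 0.9],
              [0.2, 0.5, 0.7],
              [0.1, 0.3, 0.5]])
pi = pi_ref = np.ones(3) / 3
for _ in range(200):
    pi = wind_exact_update(pi, pi_ref, P, eta=1.0, beta=0.1)
```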
4.2 Sample-efficient algorithm
We now derive practical sample-efficient methods for approximating the exact update (13) of WIND in the parameter space of the policy $\pi_\theta$, $\theta \in \Theta$. For exposition, we use $f_\theta$ to denote the logits before softmax, i.e., $\pi_\theta(\cdot\,|\,x) = \mathrm{softmax}\big( f_\theta(x, \cdot) \big)$, where $\mathrm{softmax}(\cdot)$ is the softmax function.
We allow for reward model approximation error, i.e., we may use an inaccurate judge $\widehat{\mathbb{P}}$, which is an approximation of the true preference $\mathbb{P}$. For example, instead of training a reward model, we could use an LLM as a judge to directly decide whether response $y$ is better than response $y'$ given a prompt $x$, and use the resulting judgment as an approximation of $\mathbb{P}(y \succ y' \,|\, x)$.
Algorithm derivation with the squared risk.
Let $\theta_t$ and $\theta_{t+1}$ denote the parameters of $\pi_t$ and $\pi_{t+1}$ in Algorithm 2, respectively. We rewrite the update rule (13) as
$$\log \pi_{t+1}(y\,|\,x) \;=\; \frac{1}{1 + \eta \beta}\Big( \log \pi_t(y\,|\,x) + \eta\, [P(x)\, \pi_t(\cdot|x)]_y + \eta \beta \log \pi_{\mathrm{ref}}(y\,|\,x) \Big) + c(x) \qquad (15)$$
for some function $c(x)$ that does not depend on $y$. We define a proxy for the win-rate term using the empirical win rate as
(16) |
Our main observation is that the update (15) is approximating the conditional expectation of the proxy (16); that is, for all $(x, y)$,
Furthermore, this conditional expectation has the lowest risk, due to the following lemma:
Lemma 1 (Conditional mean minimizes the square loss).
For any two random variables $X$ and $Y$, we have
$$\mathbb{E}\big[ (Y - f(X))^2 \big] \;\ge\; \mathbb{E}\big[ (Y - \mathbb{E}[Y \,|\, X])^2 \big] \qquad (17)$$
for any function $f$. In particular, the equality holds if and only if $f(X) = \mathbb{E}[Y \,|\, X]$ almost everywhere on the support of the distribution of $X$.
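As a quick numerical sanity check (illustrative only), the conditional mean incurs a smaller squared risk than an arbitrary competitor:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=200_000)       # discrete X taking values in {0, 1, 2}
y = x + rng.normal(size=x.size)            # Y = X + noise, so E[Y | X] = X

risk_cond_mean = np.mean((y - x) ** 2)     # f(X) = E[Y | X]
risk_other = np.mean((y - 0.5 * x) ** 2)   # some other f(X)
assert risk_cond_mean <= risk_other        # Lemma 1
```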
To invoke Lemma 1, we assume the LLM is expressive enough that the right-hand side of (15) can be represented by some $\theta \in \Theta$:
Assumption 1 (expressive power).
For any iteration $t$, there exists $\theta \in \Theta$ such that
(18) |
Note that the proxy (16) is conditionally unbiased for all $(x, y)$. Thus, under Assumption 1, by Lemma 1 we know that for all $t$, $\theta$ satisfies (18) if and only if
(19) |
where the squared expected risk at the $t$-th iteration is defined accordingly.
In implementation, at each iteration $t$, we approximate the update by minimizing the empirical risk: we sample prompts $x$ and pairs of responses $y, y' \sim \pi_t(\cdot\,|\,x)$, and minimize the empirical risk defined as
(SQ) |
We summarize the update procedure in Algorithm 3.
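The per-iteration regression can be sketched as follows, assuming the log-probabilities under $\pi_\theta$, $\pi_t$, $\pi_{\mathrm{ref}}$ and the judged win rate for each sampled triple $(x, y, y')$ are precomputed. The target mirrors the unnormalized form of (15), with the prompt-dependent constant $c(x)$ dropped for simplicity, so this is an illustrative sketch rather than the exact objective (SQ).

```python
import torch

def wind_sq_loss(logp_theta: torch.Tensor,   # log pi_theta(y|x) for each sampled (x, y)
                 logp_t: torch.Tensor,       # log pi_t(y|x) under the current iterate
                 logp_ref: torch.Tensor,     # log pi_ref(y|x) under the reference model
                 win_hat: torch.Tensor,      # judged win rate of y over y' given x, cf. (16)
                 eta: float, beta: float) -> torch.Tensor:
    """Schematic squared-loss empirical risk for a minibatch of (x, y, y') triples."""
    target = (logp_t + eta * win_hat + eta * beta * logp_ref) / (1.0 + eta * beta)
    return torch.mean((logp_theta - target.detach()) ** 2)
```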
Alternative risk functions.
By utilizing different variational forms, we can derive objectives different from (SQ). For illustration, we provide two alternatives, based on the KL divergence and the NCE loss, respectively (see Appendix A for derivations):
(KL) |
and
(NCE) |
where , , is a hyperparameter, is defined as
and $\mathbf{1}\{\cdot\}$ is the indicator function that equals 1 if its argument is true and 0 otherwise.
When we use the regression objective (SQ), our WIND algorithm shares a similar form with SPPO (Wu et al., 2024b). However, WIND differs from SPPO in the following aspects: (i) we solve the regularized game with the KL regularization term, which is crucial in practice and is also considered in other iterative BOND methods (Dong et al., 2023; Sessa et al., 2024); (ii) our sampling scheme is more sample-efficient: SPPO samples multiple responses per prompt to estimate the win rate of each response against the current policy, whereas Lemma 1 implies that estimating the conditional mean with multiple samples is unnecessary, and for each prompt, sampling two responses $y$ and $y'$ is enough; (iii) we allow different risk functions beyond the squared loss.
4.3 Convergence analysis
We provide a finite-sample complexity guarantee for Algorithm 3 when the risk is the squared loss; the results extend readily to other risks. We define the reward model approximation error as
(20) |
We require the following assumptions to prove the convergence of Algorithm 3. The first assumes that $f_\theta$ is differentiable and that the quantity in (15) is bounded.
Assumption 2 (differentiability and boundedness).
The parameter space $\Theta$ is compact, $f_\theta$ is differentiable w.r.t. $\theta$ for any $(x, y)$, and the quantity in (15) is uniformly bounded over all $t$ and $(x, y)$.
Assumption 2 guarantees the (uniform) boundedness of the relevant quantities. In particular, there exists a constant such that for any $t$ and $(x, y)$, we have
(21) |
The next assumption controls the concentrability coefficient, which is commonly used in the RL literature, see Yuan et al. (2023a); Munos (2003, 2005); Munos and Szepesvári (2008); Yang et al. (2024b) for example.
Assumption 3 (concentrability coefficient).
For Algorithm 3, there exists a finite constant such that for all $t$ and $(x, y)$, we have
We define
(24) |
We also assume that, for every $t$, the expected risk and the empirical risk both satisfy the Polyak-Łojasiewicz (PL) condition, which has been shown to hold for over-parameterized neural networks including transformers (Liu et al., 2022; Wu et al., 2024a; Yang et al., 2024c).
Assumption 4 (Polyak-Łojasiewicz condition).
For all $t$, the expected risk and the empirical risk both satisfy the Polyak-Łojasiewicz condition with some parameter $\mu > 0$, uniformly over $\theta \in \Theta$ and $t$.
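In its standard form (see, e.g., Karimi et al., 2016), the PL condition with parameter $\mu > 0$ for a differentiable risk $\mathcal{L}$ over $\Theta$ reads
$$\frac{1}{2}\big\| \nabla_\theta \mathcal{L}(\theta) \big\|_2^2 \;\ge\; \mu\, \Big( \mathcal{L}(\theta) - \min_{\theta' \in \Theta} \mathcal{L}(\theta') \Big) \qquad \text{for all } \theta \in \Theta,$$
applied here to both the expected risk and its empirical counterpart at every iteration $t$.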
Remark 2 (Assumption 4 is satisfied with linear function approximation).
The following theorem gives the convergence of Algorithm 3.
Theorem 4 (Convergence of Algorithm 3).
Theorem 4 indicates that, assuming no model approximation error, the total sample complexity for Algorithm 3 to reach $\varepsilon$-accuracy is
In contrast with SPPO (Wu et al., 2024b), which only ensures average-iterate convergence without quantifying sample efficiency, our method has stronger theoretical guarantees, offering last-iterate convergence and explicit sample complexity bounds.
5 Experiments
We report our experimental results in this section.
5.1 Contextual bandits
In this section we conduct contextual bandit experiments to validate Theorem 2.
Experiment setup.
We set the problem sizes and initialize the rewards with entries drawn from the standard Gaussian distribution. We set the prompt distribution and the reference policy to be uniform, and randomly initialize the policy in Algorithm 2 using the Dirichlet distribution with all parameters set to 1. For the distance metric, we use the average distance defined as
(26) |
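One natural instantiation of such an average distance (used here only as an illustration; the exact metric is given in (26)) is the mean $\ell_1$ distance across contexts:

```python
import numpy as np

def average_distance(pi_1: np.ndarray, pi_2: np.ndarray) -> float:
    """Average l1 distance between two tabular policies of shape (num_contexts, num_actions)."""
    return float(np.mean(np.abs(pi_1 - pi_2).sum(axis=1)))
```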
We conduct the following two experiments:
• For the no-mixing case, we verify that iterative BoN (Algorithm 1) and WIND (Algorithm 2) converge to the same limiting policy by plotting the average distance between each algorithm's iterate and the limiting policy against the iteration count.
• For the mixing case with $\beta > 0$, we verify that the final outputs of the two algorithms are very close to each other: we fix the iteration number for both Algorithm 1 and Algorithm 2, and increase $\beta$ from 0.01 to 0.1 to plot the change of the average distance between the final outputs of the two algorithms with respect to $\beta$, with the remaining parameters held fixed.
Figure 1: Results on contextual bandits: (a) convergence of both algorithms in the no-mixing case; (b) average distance between the final outputs of Algorithms 1 and 2 as $\beta$ varies in the mixing case.
Results.
Our results are presented in Figure 1. Figure 1(a) indicates that for the no-mixing case, both algorithms converge to the same limiting policy, with WIND slightly faster than iterative BoN. From Figure 1(b), we can see that the final outputs of the two algorithms are very close to each other when $\beta$ is small, and their distance approaches 0 very quickly as $\beta$ decreases, which corroborates (12).
Model | GSM8k | HellaSwag | MMLU | MT-Bench (1st Turn) | MT-Bench (2nd Turn) | MT-Bench (Avg) |
Llama-3-8B-SPPO Iter1 | 75.44 | 79.80 | 65.65 | 8.3813 | 7.6709 | 8.0283 |
Llama-3-8B-SPPO Iter2 | 75.13 | 80.39 | 65.67 | 8.3875 | 7.4875 | 7.9375 |
Llama-3-8B-SPPO Iter3 | 74.91 | 80.86 | 65.60 | 8.0500 | 7.7625 | 7.9063 |
Llama-3-8B-JBOND Iter1 | 76.12 | 77.70 | 65.73 | 8.2875 | 7.4281 | 7.8578 |
Llama-3-8B-JBOND Iter2 | 75.74 | 77.47 | 65.85 | 8.2563 | 7.4403 | 7.8495 |
Llama-3-8B-JBOND Iter3 | 76.12 | 77.36 | 65.83 | 8.2750 | 7.2767 | 7.7774 |
Llama-3-8B-WIND Iter1 (Ours) | 75.82 | 78.73 | 65.77 | 8.2875 | 7.6875 | 7.9875 |
Llama-3-8B-WIND Iter2 (Ours) | 76.19 | 79.05 | 65.77 | 8.3625 | 7.7500 | 8.0563 |
Llama-3-8B-WIND Iter3 (Ours) | 77.18 | 79.31 | 65.87 | 8.5625 | 7.8354 | 8.2013 |
5.2 LLM alignment
We follow the experimental setup of Wu et al. (2024b) and Snorkel (https://huggingface.co/snorkelai/Snorkel-Mistral-PairRM-DPO). We use Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base pretrained model for baseline comparisons. For a fair comparison, we choose the same prompt dataset UltraFeedback (Cui et al., 2023) and round splits, and the same Pair-RM framework (Jiang et al., 2023) for the preference model as in Wu et al. (2024b) and Snorkel. The learning rate is set to be . In each iteration, we generate answers from prompts in the UltraFeedback dataset to train the model. The global training batch size is ( per device GPUs). Our experiments are run on A GPUs, where each has GB memory. We modify the per-device batch size and gradient accumulation steps in the SPPO GitHub repository (https://github.com/uclaml/SPPO) while keeping the effective training batch size unchanged, to avoid out-of-memory errors.
Baselines and Benchmarks.
We consider two baselines: SPPO (Wu et al., 2024b) and a variant of J-BOND (Sessa et al., 2024). We follow the exact setting in the repository of the SPPO paper to reproduce the SPPO results, with the only change being that we use different computation devices.
We consider four major evaluation benchmarks: GSM8k, HellaSwag, MMLU, and MT-Bench, which evaluate the following capabilities:
• GSM8k (Cobbe et al., 2021) evaluates mathematical reasoning at a grade-school level.
• HellaSwag (Zellers et al., 2019) measures commonsense reasoning by asking language models to select a choice that finishes a half-complete sentence.
• MMLU (Hendrycks et al., 2020) is a large-scale benchmark that encompasses a variety of tasks to measure the language models' knowledge.
• MT-Bench (Zheng et al., 2023) is an LLM-as-a-judge benchmark that evaluates an LLM's multi-round chat capability. The scores given by GPT-4 are reported.
Results.
For traditional benchmarks (GSM8k, HellaSwag, and MMLU), which do not involve using LLMs as judges, the results are shown in Table 1. Our model, Llama-3-8B-WIND, achieves the best last-iteration performance on GSM8k and MMLU, while performing better than the J-BOND variant and slightly worse than SPPO on HellaSwag. In fact, our method shows consistent improvement over iterations: on all three benchmarks, it continues to improve with more iterations of training, while both SPPO and the J-BOND variant show performance regressions as the number of iterations increases. On MT-Bench, Llama-3-8B-WIND achieves results comparable to SPPO and outperforms J-BOND.
Figure 2: Comparison of running time across methods.
Runtime.
We also report the running time of different methods in our setting. Since we base our implementation on the SPPO GitHub repository, we only modify the objectives and the sampling process to reflect the differences between the algorithms. Figure 2 shows that our method achieves much better sample efficiency during data generation. In sum, the proposed WIND achieves superior performance with lower computational cost, making iterative BOND practically applicable.
6 Conclusion
This work establishes a unified game-theoretic framework that connects iterative BoN with existing game-theoretic RLHF approaches. We present WIND, a sample-efficient algorithm for win rate dominance optimization with finite-sample guarantees, which provides an accelerated alternative to iterative BOND. Empirical validation on multiple public benchmarks demonstrates the effectiveness and efficiency of our approach compared to several state-of-the-art methods. In the future, it would be interesting to explore schemes that incorporate exploration into the game-theoretic framework when the payoff matrix is accessed through bandit feedback (Yang et al., 2025).
Acknowledgement
The work of T. Yang, Z. Wen, S. Cen and Y. Chi is supported in part by the grants NSF CIF-2106778, DMS-2134080 and ONR N00014-19-1-2404. In addition, the authors thank Xingyu Xu for helpful discussions on the proof details.
References
- Amini et al. [2024] A. Amini, T. Vieira, and R. Cotterell. Variational best-of-n alignment. arXiv preprint arXiv:2407.06057, 2024.
- Askell et al. [2021] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
- Azar et al. [2024] M. G. Azar, Z. D. Guo, B. Piot, R. Munos, M. Rowland, M. Valko, and D. Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024.
- Bai et al. [2022] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- Bauschke et al. [2003] H. H. Bauschke, J. M. Borwein, and P. L. Combettes. Bregman monotone optimization algorithms. SIAM Journal on control and optimization, 42(2):596–636, 2003.
- Beck [2017] A. Beck. First-order methods in optimization. SIAM, 2017.
- Beirami et al. [2024] A. Beirami, A. Agarwal, J. Berant, A. D’Amour, J. Eisenstein, C. Nagpal, and A. T. Suresh. Theoretical guarantees on the best-of-n alignment policy. arXiv preprint arXiv:2401.01879, 2024.
- Bousquet and Elisseeff [2002] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
- Cen et al. [2021] S. Cen, Y. Wei, and Y. Chi. Fast policy extragradient methods for competitive games with entropy regularization. Advances in Neural Information Processing Systems, 34:27952–27964, 2021.
- Cen et al. [2024] S. Cen, J. Mei, K. Goshvadi, H. Dai, T. Yang, S. Yang, D. Schuurmans, Y. Chi, and B. Dai. Value-incentivized preference optimization: A unified approach to online and offline RLHF. arXiv preprint arXiv:2405.19320, 2024.
- Charles and Papailiopoulos [2018] Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International conference on machine learning, pages 745–754. PMLR, 2018.
- Christiano et al. [2017] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
- Cobbe et al. [2021] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Cui et al. [2023] G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.
- Dong et al. [2023] H. Dong, W. Xiong, D. Goyal, Y. Zhang, W. Chow, R. Pan, S. Diao, J. Zhang, K. Shum, and T. Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.
- Ethayarajh et al. [2024] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, and D. Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
- Gao et al. [2023] L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.
- Gao et al. [2024] Z. Gao, J. D. Chang, W. Zhan, O. Oertell, G. Swamy, K. Brantley, T. Joachims, J. A. Bagnell, J. D. Lee, and W. Sun. Rebel: Reinforcement learning via regressing relative rewards. arXiv preprint arXiv:2404.16767, 2024.
- Gui et al. [2024] L. Gui, C. Gârbacea, and V. Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024.
- Gutmann and Hyvärinen [2010] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 297–304. JMLR Workshop and Conference Proceedings, 2010.
- Hendrycks et al. [2020] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Huang et al. [2024] A. Huang, W. Zhan, T. Xie, J. D. Lee, W. Sun, A. Krishnamurthy, and D. J. Foster. Correcting the mythos of kl-regularization: Direct alignment without overparameterization via chi-squared preference optimization. arXiv preprint arXiv:2407.13399, 2024.
- Jiang et al. [2023] D. Jiang, X. Ren, and B. Y. Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561, 2023.
- Kang et al. [2022] Y. Kang, Y. Liu, J. Li, and W. Wang. Sharper utility bounds for differentially private models: Smooth and non-smooth. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 951–961, 2022.
- Karimi et al. [2016] H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pages 795–811. Springer, 2016.
- Klochkov and Zhivotovskiy [2021] Y. Klochkov and N. Zhivotovskiy. Stability and deviation optimal risk bounds with convergence rate . Advances in Neural Information Processing Systems, 34:5065–5076, 2021.
- Korpelevich [1976] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
- Liu et al. [2022] C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022.
- Meng et al. [2024] Y. Meng, M. Xia, and D. Chen. SimPO: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.
- Munos [2003] R. Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567. Citeseer, 2003.
- Munos [2005] R. Munos. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 1006. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005.
- Munos and Szepesvári [2008] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
- Munos et al. [2023] R. Munos, M. Valko, D. Calandriello, M. G. Azar, M. Rowland, Z. D. Guo, Y. Tang, M. Geist, T. Mesnard, A. Michi, et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886, 2023.
- Nakano et al. [2021] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Nash [1951] J. Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
- OpenAI [2023] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
- Pattathil et al. [2023] S. Pattathil, K. Zhang, and A. Ozdaglar. Symmetric (optimistic) natural policy gradient for multi-agent learning with parameter convergence. In International Conference on Artificial Intelligence and Statistics, pages 5641–5685. PMLR, 2023.
- Popov [1980] L. D. Popov. A modification of the arrow–hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845–848, 1980.
- Rafailov et al. [2024] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Rosset et al. [2024] C. Rosset, C.-A. Cheng, A. Mitra, M. Santacroce, A. Awadallah, and T. Xie. Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715, 2024.
- Sessa et al. [2024] P. G. Sessa, R. Dadashi, L. Hussenot, J. Ferret, N. Vieillard, A. Ramé, B. Shahriari, S. Perrin, A. Friesen, G. Cideron, et al. Bond: Aligning llms with best-of-n distillation. arXiv preprint arXiv:2407.14622, 2024.
- Sokota et al. [2023] S. Sokota, R. D’Orazio, J. Z. Kolter, N. Loizou, M. Lanctot, I. Mitliagkas, N. Brown, and C. Kroer. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In The Eleventh International Conference on Learning Representations, 2023.
- Stiennon et al. [2020] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
- Swamy et al. [2024] G. Swamy, C. Dann, R. Kidambi, Z. S. Wu, and A. Agarwal. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056, 2024.
- Tang et al. [2024] Y. Tang, Z. D. Guo, Z. Zheng, D. Calandriello, R. Munos, M. Rowland, P. H. Richemond, M. Valko, B. Á. Pires, and B. Piot. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
- Wang et al. [2024] Z. Wang, C. Nagpal, J. Berant, J. Eisenstein, A. D’Amour, S. Koyejo, and V. Veitch. Transforming and combining rewards for aligning large language models. arXiv preprint arXiv:2402.00742, 2024.
- Wu et al. [2024a] Y. Wu, F. Liu, G. Chrysos, and V. Cevher. On the convergence of encoder-only shallow transformers. Advances in Neural Information Processing Systems, 36, 2024a.
- Wu et al. [2024b] Y. Wu, Z. Sun, H. Yuan, K. Ji, Y. Yang, and Q. Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024b.
- Xu et al. [2024] H. Xu, A. Sharaf, Y. Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y. J. Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417, 2024.
- Yang et al. [2024a] J. Q. Yang, S. Salamatian, Z. Sun, A. T. Suresh, and A. Beirami. Asymptotics of language model alignment. arXiv preprint arXiv:2404.01730, 2024a.
- Yang et al. [2024b] T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi. Federated natural policy gradient and actor critic methods for multi-task reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b.
- Yang et al. [2024c] T. Yang, Y. Huang, Y. Liang, and Y. Chi. In-context learning with representations: Contextual generalization of trained transformers. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024c.
- Yang et al. [2025] T. Yang, B. Dai, L. Xiao, and Y. Chi. Incentivize without bonus: Provably efficient model-based online multi-agent RL for Markov games. arXiv preprint arXiv:2502.09780, 2025.
- Yuan et al. [2023a] R. Yuan, S. S. Du, R. M. Gower, A. Lazaric, and L. Xiao. Linear convergence of natural policy gradient methods with log-linear policies. In International Conference on Learning Representations, 2023a.
- Yuan et al. [2023b] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023b.
- Zellers et al. [2019] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Zhang et al. [2024] Y. Zhang, D. Yu, B. Peng, L. Song, Y. Tian, M. Huo, N. Jiang, H. Mi, and D. Yu. Iterative nash policy optimization: Aligning llms with general preferences via no-regret learning. arXiv preprint arXiv:2407.00617, 2024.
- Zhao et al. [2023] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and P. J. Liu. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
- Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.
Appendix A Other Objectives for Algorithm 3
In this section we give some possible alternative objectives for Algorithm 3 by utilizing different variational forms.
$f$-divergence objectives:
We could use an $f$-divergence as the objective function. For a convex function $f$, the $f$-divergence is defined as
(27) |
Let
Then is a function of and . Further define
(28) |
We let , then by solving
(29) |
we could approximate the update rule (15).
In particular, when , we have , and (29) becomes
(30) |
We could approximate the above objective by sampling () and minimizing the empirical risk:
(31) |
where
(32) |
For other -divergence objectives, we may not be able to get rid of the unknown on the RHS of (29), but (29) could provide a gradient estimator for the objective that allows us to optimize by stochastic gradient descent.
Noise contrastive estimation (NCE) objectives.
We could use NCE (Gutmann and Hyvärinen, 2010) as objectives. NCE is a method to estimate the likelihood of a data point by contrasting it with noise samples. Let be the discriminator (parameterized by ) that distinguishes the true data from noise samples. The NCE objective is
(33) |
where the optimal solution of (33) is
In our case, we let and where is a hyperparameter. We also let
where is defined in (28). Then we could approximate the update rule (15) by solving
(34) |
Appendix B Proofs
For notational simplicity, for any policy , we denote throughout the proof.
B.1 Proof of Proposition 1
We first prove the case when $\beta = 0$. This part of the proof is inspired by Swamy et al. (2024, Lemma 2.1). Let
i.e., is a Nash equilibrium of (9) (which is guaranteed to exist since the policy space is compact). Then and , we have:
which is equivalent to
Note that
(36) |
where is the matrix of all ones. This gives
i.e.,
This implies is also a Nash equilibrium of (9). Then by the interchangeability of Nash equilibrium strategies for two-player zero-sum games (Nash, 1951), and are both the Nash equilibria of (9), which indicates that are both the solutions of (10).
Next we prove the case when $\beta > 0$. When $\beta > 0$, due to the strong concavity-convexity of the min-max problem (9), it has a unique Nash equilibrium. Moreover, it is straightforward to compute that the equilibrium satisfies the following relation:
(37) |
where we use to denote the element-wise product of two vectors.
B.2 Proofs of Theorem 1 and Theorem 2
Theorem 5 (solution to iterative BoN (formal)).
Remark 3.
Now we give the proof of Theorem 5.
Step 1: show convergence for the no-mixing case.
We first prove the convergence result for the no-mixing case. Note that for any policy , has the expression
(41) |
where is defined in (5).
When , Algorithm 1 can be simplified as
Then the update is equivalent to applying the best-of-$N$ operator to the previous policy. For any $x$, define the set of responses that maximize the reward function $r(x, \cdot)$, i.e.,
Then for any and , we have . In addition,
where the latter holds since . By the BoN expression (B.2), we deduce that
(42) |
Step 2: show convergence of the mixing case.
Recall that in the mixing case we set . We take logarithm on both sides of the iteration in line 5 of Algorithm 1 and unroll it as follows:
(45) |
where the constants depend on the parameters of Algorithm 1, the second equality makes use of (B.2), and the last equality follows from our choice of the mixing rates.
Note that for all , we have
For each , we let such that . For notational simplicity, when it does not cause confusion, we simply write as . (B.2) indicates that for all , we have
(46) | |||
(47) |
In particular, (47) indicates that the ratio is decreasing with $t$ for all responses. Since it is bounded below by 0, the ratio converges as $t \to \infty$. Therefore, (46) together with (47) implies that the limit exists. To see that the limit is a solution to (38), we make use of the following lemma.
Lemma 2.
For any sequence in where for all and exists ( can be ), for any , we have
(48) |
Proof of Lemma 2.
If , then
If , we have
thus we only need to verify that .
For any , there exists such that
where , and because converges to 0. We fix and choose such that for all , we have
Then for all , we have
This completes the proof. ∎
Step 3: bound the distance between and .
We let in Algorithm 2 and unroll the iteration (13) similar to (B.2). We have
(50) |
where is the policy at the -th round of Algorithm 2. Furthermore, similar to (46) and (47), we have
(51) | |||
(52) |
For any , we have
(53) |
we know that is increasing with for all , and thus is decreasing with for all . Moreover, by a similar argument as in Step 1, we have that exists and is the solution to (10) (even when ).
Note that (52) is equivalent to
(54) |
Also note that by (53) and the decreasing property of for all , we have
From the above expression and our choice of we know that
(55) |
Then by (B.2) we know that
which indicates that for all ,
(56) |
which gives
Combining the above relation with (51), we obtain
and
Recall that we write the expression of in (42). Therefore, we have
(57) |
For the iteration in Algorithm 1, similar to (B.2) we have
(58) |
Note that for all , we have
where in the second inequality we use the fact that is decreasing with for all . Then by a similar argument as in (55), we have
(59) |
Therefore, analogous to (57), we have
(60) |
Combining (57) and (60), we have
from which we can see that (40) holds.
B.3 Proof of Theorem 3
To start with, we reformulate problem (10) as a monotone variational inequality (VI) problem. We first define the operator for all as
(61) |
We also let
(62) |
denote the negative entropy. The following lemma gives the VI form of WIND.
Lemma 3.
Assume and . Then (10) is equivalent to the following monotone VI problem:
(63) |
where for all , is monotone and 1-Lipschitz continuous w.r.t. the -norm.
Proof of Lemma 3.
By the proof of Proposition 1 we know that when and , we have . By the optimality condition, satisfies (10) if and only if
(64) |
where
By (61) and (62), we have (64) equivalent to (63). To see the monotonicity of , we have
(65) |
where is the matrix of all ones.
Furthermore, we have
(66) |
where the second inequality follows from the fact that each entry of only take its value in . We finish up by noting (66) indicates that is 1-Lipschitz with respect to -norm. ∎
We use the following proximal mirror descent ascent rule (Sokota et al., 2023; Pattathil et al., 2023) to solve the monotone VI problem (63):
(67) |
where is the learning rate, and the Bregman distance is generated from the negative entropy :
It’s straightforward to verify that the analytical solution of (67) is (13) in Algorithm 2. Note that the negative entropy (c.f. (62)) is 1-strongly convex on with respect to the -norm (Beck, 2017, Example 5.27). Furthermore, Lemma 3 shows is 1-Lipschitz with respect to -norm. With these facts, the theorem follows directly from Sokota et al. (2023, Theorem 3.4):
The relation (14) can be deduced from the above relation by taking the expectation over on both sides.
B.4 Proof of Lemma 1
To start with, we have
(68) |
We use , and to denote the distribution of , the joint distribution of and the distribution of conditioned on , resp. Then the cross term
(69) |
where the last relation follows from the fact that
B.5 Proof of Theorem 4
We first introduce the three-point property of the Bregman divergence (Sokota et al., 2023, Proposition D.1), (Bauschke et al., 2003, Proposition 2.3):
Lemma 4 (three-point property of the Bregman divergence).
Let be a function that’s differentiable on . Let and . Then the following equality holds:
(70) | ||||
(71) | ||||
(72) |
To start with, we rewrite the update rule in Algorithm 3
(73) |
to a similar form as the update rule (67).
Step 1: reformulate the update rule (73).
We let denote the statistical error and the model approximation error at the -th round, respectively:
(74) | ||||
(75) |
We write as the shorthand of , respectively.
Step 2: bound .
By the first-order optimality condition we know that (77) is equivalent to
(80) |
Note that for any and , we have
(83) |
Combining the above two expressions (B.5) and (B.5), we have
(84) |
where, in the third relation we use the 1-strong convexity of w.r.t. the -norm (see the proof of Theorem 3) to obtain that
Note that
(85) |
To bound the error term, below we separately bound (i)-(iii).
To bound (i), we first unroll (76) similar as in (B.2) and obtain
(86) |
for some related to . The above expression gives
for any . This relation yields
where we use (21) and (74). The above expression indicates
Summing over on both sides, we get
(87) |
Therefore, we have
(88) |
where the second line follows from the Cauchy-Schwarz inequality, and the last line uses Assumption 3.
By the same argument, we could also bound (ii):
(89) |
For term (iii), note that where is defined in (75), we have
(90) |
Taking expectation w.r.t. on both sides of (B.5) and making use of (91), we have
(92) |
where the second inequality follows from Jensen’s inequality and .
The above expression implies we need to bound . If for all , could be bounded by some finite , then by (94) we have
(95) |
In the following, we bound by bounding the excess risk .
Step 3: bound the excess risk.
To bound the excess risk, we first introduce the concept of uniform stability (Bousquet and Elisseeff, 2002). Suppose we have a training dataset where each is sampled i.i.d. from some unknown distribution defined on some abstract set . Given , a learning algorithm produces the decision rule , where is the set of all decision rules and is assumed to be a closed subset of a separable Hilbert space. We use to refer to both the algorithm and the decision rule. For the loss function , we define the risk and the empirical risk of respectively as
(96) |
Definition 1.
An algorithm is uniformly -stable, if for any and any , it holds that
(97) |
We will also use the generalized Bernstein condition defined as follows:
Definition 2 (Assumption 1.1 in Klochkov and Zhivotovskiy (2021)).
Define where is a closed set. We say that satisfies the generalized Bernstein condition if there exists some constant such that for any , there exists that satisfies
(98) |
With the above two lemmas, we now introduce the following important lemma that bounds the generalization error for uniformly stable algorithms:
Lemma 5 (Theorem 1.1 in Klochkov and Zhivotovskiy (2021)).
To proceed, we analyze the generalization error at the -th iterate of Algorithm 3 for a fixed arbitrary . We’ll let denote and drop superscript/subscript when this causes no confusion. For example, we’ll simply write the update rule (73) as
For notational simplicity, we also let , , , and let denote for all . Let . Then in our case, the loss function has the following form:
(100) |
where , and similar to (96), our risk and empirical risk at the $t$-th iterate satisfy:
(101) |
where we let denote the distribution of (), and we have
(102) |
Note that (22) implies that is -Lipschitz over for any . Then by Assumption 4 and Remark 3 in Kang et al. (2022), we have that satisfies the generalized Bernstein condition with
(103) |
Furthermore, Corollary 4 in Charles and Papailiopoulos (2018) gives that, when Assumptions 2 and 4 hold, the empirical risk minimization is uniformly stable (cf. Definition 1) with
(104) |
Substituting (103) and (104) into (99), we obtain that for any fixed , with probability at least , we have
(105) |
By the independence of the samples in different rounds we know that with probability at least , we have
(106) |