
Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective

Pietro Bernardelle (ORCID: 0009-0003-3657-9229), The University of Queensland, Brisbane, Australia, [email protected]  and  Gianluca Demartini (ORCID: 0000-0002-7311-3693), The University of Queensland, Brisbane, Australia, [email protected]
(2024)
Abstract.

Aligning the output of Large Language Models (LLMs) with human preferences (e.g., by means of reinforcement learning with human feedback, or RLHF) is essential for ensuring their effectiveness in real-world scenarios. Despite significant advancements in LLM alignment techniques, the impact of different types of preference data on model performance has yet to be systematically explored. In this study, we investigate the scalability, data efficiency, and effectiveness of Direct Preference Optimization (DPO) in fine-tuning pre-trained LLMs, aiming to reduce their dependency on extensive amounts of preference data, which is expensive to collect. We (1) systematically compare the performance of models fine-tuned with varying percentages of a combined preference judgement dataset to define the improvement curve of DPO and assess its effectiveness in data-constrained environments; and (2) provide insights for the development of an optimal approach for selective preference data usage. Our study reveals that increasing the amount of data used for training generally enhances and stabilizes model performance. Moreover, using a combination of diverse datasets significantly improves model effectiveness. Furthermore, when models are trained separately on different types of prompts, those trained with conversational prompts outperform those trained with question-answering prompts.

LLMs, Direct Preference Optimization, Data Selection
journalyear: 2024; copyright: acmlicensed; doi: xxxxxx; conference: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, December 9–12, 2024, Tokyo, Japan; booktitle: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’24), December 9–12, 2024, Tokyo, Japan; isbn: 979-8-4007-0724-7/24/12; ccs: Information systems → Language models

1. INTRODUCTION

The rise of LLMs, such as OpenAI’s GPT (Brown et al., 2020) and Google’s BERT (Devlin et al., 2019) families, has marked a revolutionary advancement in natural language processing. These models excel in syntactic and semantic understanding, yet aligning them with human preferences remains challenging.

Over the past few years, significant work has attempted to address the misalignment challenge. This has led to the adoption of Reinforcement Learning (RL) techniques that incorporate human preferences to guide LLM optimization. Rafailov et al. (2023) introduced one of the most innovative approaches in this field: Direct Preference Optimization (DPO). Despite its promise, to date, only a limited number of studies have explored its applications and potential benefits, suggesting that much remains to be discovered and understood about its effectiveness. This gap in research highlights the need to further investigate how DPO can enhance the alignment of language models with human preferences efficiently and effectively.

Our research conducts an experimental exploration of the DPO method, examining its nuances and evaluating its efficacy in real-world applications. The objective is to improve our understanding of the impact of DPO, setting the stage for future development of efficient methods for utilizing human preference data and streamlining the LLM training process. To achieve our goal, we aim to address the following research questions:

RQ1: How does the performance of LLMs fine-tuned with DPO evolve as increasingly larger subsets of preference data are used for training?

RQ2: How does the nature of the training data, specifically conversational versus question answering datasets, impact the model performance under DPO?

Table 1. Overview of the datasets used in the analysis. The table details dataset size, partitioning into training, evaluation, and testing sets, and the types of prompts included.
Dataset | Size | Training (80%) | Evaluation (10%) | Testing (10%) | Prompt Type
Dataset A | 7,560 | 6,048 | 756 | 756 | Conversational
Dataset B | 12,900 | 10,320 | 1,290 | 1,290 | Question-Answering
Dataset C | 63,600 | 50,880 | 6,360 | 6,360 | Question-Answering
Combination | 84,060 | 67,248 | 8,406 | 8,406 | Conversational & Question-Answering

2. RELATED WORK

In recent years, numerous influential studies have tackled the challenge of misalignment in language models. Christiano et al. (2017) are the pioneers in merging the concepts of Reinforcement Learning (RL) with Human Feedback (RLHF), showcasing the potential of this method in guiding the training of models with direct human preference judgements. Their work laid the foundation for future enhancements in model behavior through human evaluation. Building on these foundational insights, Ouyang et al. (2022) and Bai et al. (2022) extended the application of RLHF to a broader range of language tasks, further developing the methodology for LLMs. Their contributions have been crucial in detailing the operational framework of RLHF, which involves a three-phase alignment process: (1) pretraining and supervised fine-tuning of the language model; (2) training of a reward model using explicit human preferences; and (3) alignment of the language model via reinforcement learning through the reward model.

While effective in some scenarios, questions arise about the efficiency, scalability, and extensive data requirements of this approach during the training process. Recent developments in the field have introduced a significant paradigm shift, aiming to mitigate these challenges. Rafailov et al. (2023) introduced a methodology known as DPO, which seeks to optimize language model alignment directly based on human preferences, thus eliminating the need for an explicit reward model (having it embedded in a loss function) and for reinforcement learning. By eliminating the reliance on reinforcement learning, this approach streamlines the training process and challenges the conventional wisdom of reinforcement learning-based alignment frameworks. Although there have been considerable advancements in language model alignment techniques, focusing primarily on refining architectures for improved efficiency, the crucial aspect of human preference data utilization remains largely under-explored.
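For reference, the DPO objective proposed by Rafailov et al. (2023) folds the reward model into a single maximum-likelihood loss over preference triples (x, y_w, y_l), where y_w is the preferred response, y_l the rejected one, \pi_\theta the policy being optimized, \pi_{\text{ref}} the frozen reference model, \sigma the logistic function, and \beta a scaling hyperparameter:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]

Minimizing this loss directly widens the margin between preferred and rejected responses under the policy, which is what removes the need for an explicit reward model and a reinforcement learning loop.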

3. METHODOLOGY

Data

To address the research questions outlined above, we conducted our experiments using three open-source preference judgement datasets. Table 1 provides an aggregated view of their characteristics. All selected datasets are sourced from Hugging Face and provided by Argilla (https://huggingface.co/argilla). Two of the three datasets employed in our study originate from established datasets: OpenOrca (Lian et al., 2023) and UltraFeedback (Cui et al., 2023). The third dataset is based on a newly curated collection featuring conversational style prompts: Capybara (https://huggingface.co/datasets/LDJnr/Capybara). The selected datasets provide a broad spectrum of scenarios and data characteristics, enabling a thorough assessment of the DPO-aligned models’ behavior when exposed to different volumes and types of preference data.
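As a concrete illustration of how these preference datasets can be loaded, combined, and split, the following Python sketch uses the Hugging Face datasets library. The dataset identifiers are placeholders standing in for the Argilla preference datasets described above, not the exact repository names, and the prompt/chosen/rejected column layout is assumed as the usual DPO convention.

from datasets import load_dataset, concatenate_datasets

# Placeholder identifiers (assumptions): substitute the actual Argilla preference
# datasets built on Capybara (conversational), OpenOrca, and UltraFeedback.
DATASET_IDS = [
    "argilla/capybara-preferences",       # Dataset A (conversational) -- assumed name
    "argilla/openorca-preferences",       # Dataset B (question answering) -- assumed name
    "argilla/ultrafeedback-preferences",  # Dataset C (question answering) -- assumed name
]

# Each dataset is expected to expose DPO-style columns: "prompt", "chosen", "rejected".
parts = [load_dataset(ds_id, split="train") for ds_id in DATASET_IDS]
pool = concatenate_datasets(parts).shuffle(seed=42)

# 80% / 10% / 10% split into training, evaluation, and testing (cf. Table 1).
n = len(pool)
train_pool = pool.select(range(0, int(0.8 * n)))
eval_pool = pool.select(range(int(0.8 * n), int(0.9 * n)))
test_pool = pool.select(range(int(0.9 * n), n))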

Experimental setup

Our study is structured into two experiments. The first experiment, addressing RQ1, is designed to determine the performance of DPO-aligned models as the amount of data increases. We combined the individual datasets into a single pool to control for content variability (the pool, formed by combining the entire individual datasets, was then split into training, evaluation, and testing segments, allocated as 80%, 10%, and 10%, respectively). From this combined dataset, we randomly sampled five subsets (20%, 40%, 60%, 80%, and 100%) of the training split. Each subset was used to train a separate instance of the base model. OpenHermes-2.5-Mistral-7B (https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) was adopted as the base model, primarily to ensure our experiments remain reproducible and grounded in a high-performing, state-of-the-art open-source model. The process was repeated three times with different random seeds, resulting in a total of 15 different DPO-aligned models. By incrementally increasing the data volume and repeating the training multiple times, we accounted for variability and aimed to smooth the results by averaging them.
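A minimal sketch of this subset-training loop, assuming trl's DPOTrainer, is shown below; train_pool and eval_pool refer to the splits assembled in the previous sketch, the hyperparameters are illustrative rather than the values used in our experiments, and some argument names (e.g., processing_class vs. tokenizer) differ across trl versions.

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

BASE_MODEL = "teknium/OpenHermes-2.5-Mistral-7B"
FRACTIONS = [0.2, 0.4, 0.6, 0.8, 1.0]
SEEDS = [0, 1, 2]

def train_subset(train_pool: Dataset, eval_pool: Dataset, fraction: float, seed: int) -> None:
    """Align one copy of the base model with DPO on a random fraction of the training pool."""
    subset = train_pool.shuffle(seed=seed).select(range(int(fraction * len(train_pool))))

    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

    # Illustrative hyperparameters (assumptions), not the exact experimental settings.
    args = DPOConfig(
        output_dir=f"dpo-combined-{int(fraction * 100)}pct-seed{seed}",
        beta=0.1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-7,
        num_train_epochs=1,
    )
    trainer = DPOTrainer(
        model=model,                 # the reference model is cloned internally when omitted
        args=args,
        train_dataset=subset,        # columns: "prompt", "chosen", "rejected"
        eval_dataset=eval_pool,
        processing_class=tokenizer,  # older trl versions call this argument "tokenizer"
    )
    trainer.train()
    trainer.save_model(args.output_dir)

# 5 fractions x 3 seeds = 15 DPO-aligned models for the combined pool, e.g.:
# for seed in SEEDS:
#     for fraction in FRACTIONS:
#         train_subset(train_pool, eval_pool, fraction, seed)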

The second experiment, addressing RQ2, focuses on the individual characteristics of the three datasets. It is designed to discern whether specific types of data, particularly those with conversational versus question-answering style prompts, have a more pronounced effect on the efficiency and effectiveness of training under DPO. We mirrored the methodology used in the first experiment but applied it to each individual dataset separately, which yielded 45 distinct DPO-aligned models.
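The 45-model count follows from crossing the three datasets with the same fraction-and-seed grid used in the first experiment; a minimal sketch of this experiment matrix, reusing the illustrative train_subset helper from the previous sketch and placeholder dataset names, is:

from itertools import product

DATASETS = ["dataset_a_conversational", "dataset_b_qa", "dataset_c_qa"]  # placeholder names
FRACTIONS = [0.2, 0.4, 0.6, 0.8, 1.0]
SEEDS = [0, 1, 2]

# 3 datasets x 5 fractions x 3 seeds = 45 DPO-aligned models.
runs = list(product(DATASETS, FRACTIONS, SEEDS))
assert len(runs) == 45
# for name, fraction, seed in runs:
#     train_subset(train_splits[name], eval_splits[name], fraction, seed)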

Performance Evaluation

Each DPO-aligned model was evaluated relative to the base model using MT-Bench (Zheng et al., 2024). MT-Bench presents a series of questions to both the base model and the DPO-aligned model and, for each question, declares a ‘win’ for the model that provides the better answer or records a ‘tie’ if no clear winner can be identified. Based on this structured evaluation framework, we calculated the improvement over the base model by subtracting the percentage of wins of the base model from the percentage of wins of the DPO-aligned model, tracking how this difference changes as the amount of data used for alignment increases. Additionally, we defined and calculated the tie rate between the DPO-aligned model and the base model at each data percentage subset as follows:

\text{Tie Rate} = 1 - (\text{Win Rate} + \text{Lose Rate})

where the win rate and lose rate are derived from the combined results of the three runs for each data percentage subset.
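The following helper illustrates how these quantities can be derived from aggregated pairwise judgement counts; the counts in the usage comment are hypothetical and only serve to show the calculation.

def improvement_and_tie_rate(dpo_wins: int, base_wins: int, ties: int) -> tuple[float, float]:
    """Compute improvement over the base model and the tie rate from
    pairwise MT-Bench judgements aggregated over the three runs."""
    total = dpo_wins + base_wins + ties
    win_rate = dpo_wins / total          # fraction of questions won by the DPO-aligned model
    lose_rate = base_wins / total        # fraction of questions won by the base model
    improvement = win_rate - lose_rate   # win percentage of DPO model minus that of base model
    tie_rate = 1.0 - (win_rate + lose_rate)
    return improvement, tie_rate

# Hypothetical counts, for illustration only:
# improvement, tie_rate = improvement_and_tie_rate(dpo_wins=95, base_wins=64, ties=81)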

Computational Resources

For our experiments, we utilized a single H100 GPU card and approximately 80GB of RAM to accommodate the datasets. The training duration varied depending on the dataset size. For models relying on the largest datasets, the training process took up to one day, while models trained on the smallest datasets completed in approximately two hours.

4. RESULTS

4.1. Defining DPO’s Improvement Curve

Figure 1. The top plot illustrates the improvement curves of DPO across three different experimental runs using the combined dataset. The bottom plot presents the averaged improvement curve of DPO, aggregating the results from the three experimental runs. Error bars indicate the standard deviation across the runs.

Figure 1 illustrates the percentage improvement in model performance across the three random seeds for the first experiment. Based on the results summarized in Figure 1, there is noticeable variability in performance improvement across the three experimental runs, indicating that the models’ performance may vary significantly depending on the specific data subsets used for alignment. Runs 2 and 3 exhibit similar patterns (Pearson R = 0.94, p = .006, α = .05), diverging noticeably from run 1, especially in the middle data usage percentages (40%, 60%, and 80%). The similarity between runs 2 and 3 compared to run 1, coupled with the marked dip at 60% followed by a significant rebound, highlights the need for a more detailed analysis of how different data samples may distinctly affect the training process. Which preference judgements we use does matter. A comprehensive examination of these data subsets could reveal critical insights into optimizing the training process for more consistent and predictable improvements. Despite these fluctuations, a consistent trend emerges in which increased data usage generally correlates with enhanced performance improvements. This pattern supports the hypothesis that greater data volumes positively influence the model’s alignment with human preferences. However, the overall trend is not as smooth as initially hypothesized, which may be attributed to the variability across the three runs. Additional runs could potentially smooth out these irregularities, providing a clearer depiction of the trend.

Additionally, it is interesting to point out that, as the data volume increases, the model aligned with DPO becomes distinctly more accurate in its responses compared to the base model. The tie rate between models decreases as the percentage of data used for DPO alignment increases, suggesting that the judging mechanism finds it easier to discern and prefer the responses of the DPO-aligned model. This indicates clearer differentiation and better alignment with desired outputs.

Figure 2. Relationship between the percentage of data used and the tie rate. The plot demonstrates how the tie rate tends to decrease as the percentage of data increases.

Figure 2 reveals that the tie rate consistently diminishes as the data volume increases, with the exception of the 100% data usage mark. This pattern provides two significant insights:

  (1) The judging model is confident about the performance dip observed around 60% of data usage in Figure 1: at this stage, the DPO-aligned model’s responses are less clearly preferable to the base model’s responses.

  (2) The increased tie rate at 100% data usage suggests that the judging model’s confidence in distinguishing between the DPO-aligned and base models decreases when the improvement curve plateaus. This phenomenon implies that while additional data contributes to performance gains, there might be a threshold beyond which the added data does not significantly enhance model differentiation.

4.2. Impact of Data Nature

Figure 3. Comparison of the percentage of improvement across three datasets (Dataset A, Dataset B, Dataset C) relative to the percentage of data used for alignment.

When we delve into the unique characteristics of the three individual datasets, we can identify the distinct contributions and impacts of different data types on model performance. As Figure 3 shows:

  • Dataset A, the smallest dataset in our study, shows a positive trend in model performance with increased data usage, achieving improvements comparable to those of models trained on much larger datasets. This highlights the significant role that conversational prompts play in providing dynamic and context-rich interactions.

  • Dataset B shows the potential for significant performance improvements through DPO, albeit with some variability and non-linear trends. The unexpected dip in performance at intermediate data volumes underscores the need for careful curation of training data to maximize the benefits of additional data.

  • Dataset C’s large size allows for more robust learning, reducing the impact of any individual data points that might introduce noise or anomalies. The improvement curve suggests that the model effectively leverages the additional data, steadily refining its alignment with human preferences as more data is introduced. This contrasts with the more variable performance observed on the smaller datasets, where the model improvements were less consistent and more sensitive to specific subsets of data.

4.3. Implications and Observations

Table 2. Performance improvements of the DPO-aligned models across the four datasets. The table lists both peak and average percentage improvements compared to the base model without alignment.
Dataset | Peak Improvement (%) | Avg Improvement (%)
Dataset A | 7.5 | 4.93
Dataset B | 8.54 | 3.925
Dataset C | 8.325 | 7.6
Combination | 19.126 | 15.35

The results from the two experiments provide several important implications for the application of DPO in language model training:

  • Significance of Data Volume: The trend across all datasets highlights that increased data volume for alignment generally enhances model performance and stability.

  • Combined Datasets for Alignment: Combining multiple datasets for alignment consistently results in superior performance compared to using individual datasets, as shown in Table 2. This finding underscores the importance of diversity and comprehensiveness in training data, suggesting that a holistic approach to dataset selection can significantly enhance model effectiveness.

  • Impact of Data Type: The type of data significantly influences model improvement. Conversational prompts (Dataset A) lead to steady enhancements despite the dataset’s smaller size, likely because the model can develop a better understanding of context from this type of data. This indicates that conversational prompts are particularly effective in improving model performance and could be prioritized in future data selection approaches.

  • Performance Dips and Training Dynamics: The observed dips in performance across different data volumes, though not entirely explainable through our current analysis, point to the inherent complexities in training dynamics. These fluctuations suggest that there are underlying factors affecting performance that warrant further investigation. The random sampling of data likely contributed to the dispersion of noise across the performance curves, indicating a need for more controlled and systematic data sampling methods in future studies.

Taken together, these findings underscore the value of using a more extensive and varied dataset for alignment, as it can capture a broader range of linguistic patterns and nuances, thereby enhancing the overall performance of the model.

5. CONCLUSION AND FUTURE WORK

In this study, we have explored the effectiveness and efficiency of DPO in aligning LLMs with human preference judgements. Our findings indicate that the relationship between data volume and model performance improvement is not strictly linear, as might be expected; rather, it shows notable fluctuations influenced heavily by the specific data subsets being used. Thus, prioritizing the quality of preference judgements over their quantity might be the way forward for LLM alignment. Overall, this research enriches our knowledge of the DPO method, highlighting the need for data selection strategies that can systematically and consistently yield favorable results, even with limited amounts of data.

There is considerable potential in developing targeted strategies for optimizing preference data selection, which could streamline LLM training processes, enhancing efficiency and reducing costs. By pursuing this direction, future research can continue to refine and enhance the alignment of LLMs with human preferences, ultimately advancing natural language processing technologies.

Acknowledgements.
This work is partially supported by the Australian Research Council (ARC) Training Centre for Information Resilience (Grant No. IC200100022).

References

  • Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862 [cs.CL]
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Christiano et al. (2017) Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems 30 (2017).
  • Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377 (2023).
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. NAACL-HLT 2019. Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186.
  • Lian et al. (2023) Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". 2023. OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. huggingface.co (2023).
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  • Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems 36 (2023), 53728–53741.
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36 (2024).