System Message Generation for User Preferences
using Open-Source Models
Abstract
System messages play a crucial role in interactions with large language models (LLMs), often serving as prompts that initiate conversations. Through system messages, users can assign specific roles, direct intended tasks, incorporate background information, and specify output formats and communication styles. Despite such versatility, publicly available data often lack system messages and are subject to strict license constraints in industrial settings. Manually labeling publicly available data with system messages that align with user instructions demands significant resources. In view of these challenges, we introduce SysGen, a pipeline for generating system messages with better aligned assistant responses from supervised fine-tuning datasets that lack system messages. Training on SysGen data yields substantial improvements in the alignment of model responses with system messages and user instructions across various open-source models on the Multifacet benchmark, while causing minimal degradation on unseen benchmarks such as Open LLM Leaderboard 2. Our qualitative analysis highlights the importance of diverse system messages for ensuring better adaptability across different contexts.
Minbyul Jeong (corresponding author), Jungho Cho, Minsoo Khang, Dawoon Jung, Teakgyu Hong
Upstage AI
1 Introduction

A system message, also known as an initial prompt, serves as the initial input that starts a conversation with an LLM (Openai, 2024; Cohere, 2024; PromptHub, 2025). System messages have been shown to greatly affect the model’s assistant responses by providing context, guidance, and direction to the LLM (Qin et al., 2024; Lee et al., 2024). For example, given a system message, we can steer the LLM’s behavior to set roles, provide additional background information, maintain consistency of generated responses, customize an output format, align to user preferences, and ensure safety and ethical considerations (AlKhamissi et al., 2024; Yang et al., 2024; Dubey et al., 2024). System messages have also proven capable of setting constraints such as a knowledge cut-off or the current date, and of tailoring model behaviors for optimal overall performance (Lin et al., 2024; Abdin et al., 2024).
While LLMs’ ability to utilize system messages has been widely investigated, how to acquire these system messages remains underexplored. Our preliminary analysis reveals the following limitations regarding system messages in existing datasets. Most publicly available datasets carry license constraints when used in industry, limiting their use in post-training for target tasks (Xie et al., 2020; Ouyang et al., 2022; Zhou et al., 2023; Cui et al., 2023). Additionally, most datasets either lack system messages or contain only generic ones such as “You are a helpful AI assistant.” (Xu et al., 2023; Pareja et al., 2024). Lastly, labeling system messages to fit various user instruction scenarios requires substantial resources (Abdin et al., 2024; Qin et al., 2024; Lee et al., 2024).
In this study, we propose SysGen, a data construction pipeline that uses open-source models to generate system messages, together with well-aligned assistant responses, from existing SFT datasets that lack system messages. Our SysGen pipeline addresses the above limitations by automatically generating diverse system messages with open-source models that are not only well aligned with user instructions but also free of license constraints. Specifically, SysGen generates system messages at the phrase level, organized by key functionality and tailored to various user instructions (AlKhamissi et al., 2024; Jiang et al., 2024; Qian et al., 2024; Lee et al., 2024). Figure 1 illustrates the key concept of our SysGen pipeline.
We generate system messages by annotating these key functionalities at the phrase level, making it easy to track which functionalities are missing and which work effectively (Sec 3.1). Erroneous special tokens are then filtered out before reorganizing the generated system message into a consistent order (Sec 3.2). By verifying each functionality of the system messages with the LLM-as-a-judge approach (Zheng et al., 2023) as self-model feedback, we softly remove abnormal functionality phrases (Sec 3.3). Finally, we generate new assistant responses that are better aligned with the refined system message and the user instruction. The new responses also exhibit higher lexical overlap, semantic similarity, and verbosity than the original assistant responses (Sec 3.4).
After training various open-source models on SysGen data, we evaluate them on the Multifacet (Lee et al., 2024) dataset to measure how well the assistant responses align with system messages and user instructions. Our experiments show consistent improvements across various models, notably LLaMA-3.1-8B-instruct (Meta, 2024) and Phi-4 (Abdin et al., 2024), which achieve +0.09 and +0.13 absolute improvements, respectively. For models that do not support system roles, such as Gemma-2-9b-it (Team et al., 2024), or have not been trained on system roles, such as Solar-10.7B-instruct (Kim et al., 2024), knowledge distillation (Hinton, 2015) using SysGen data generated by the Phi-4 model results in absolute improvements of +0.18 and +0.57, respectively. In addition, our experiments reveal that training on SysGen data effectively reduces performance degradation on unseen benchmarks such as Open LLM Leaderboard 2 (Myrzakhan et al., 2024).
Our analysis highlights that training open-source models with system messages tailored to diverse contexts is significantly more beneficial for aligning responses with user instructions than using a common system message (e.g., "You are a helpful AI assistant") or providing no system message at all. We also demonstrate that distinguishing the system and user roles in the chat template is crucial for assistant responses to align with user instructions. We further provide LLM-as-a-judge results to verify that the new assistant responses are truly aligned with the generated system messages.
2 Related Works
System message: utilization and evaluation.
A system message is a unique component of LLMs used to initiate a conversation with them. It is utilized by many proprietary models (e.g., ChatGPT (OpenAI, 2023) and Claude (Anthropic, 2024)) as well as open-source models (e.g., Mistral (AlKhamissi et al., 2024), LLaMA (Meta, 2024), Qwen (Yang et al., 2025), and DeepSeek (Guo et al., 2025)). System messages serve the purpose of steering the LLM’s generation behavior and are widely used for various functions, including imprinting the model’s identity, recording the knowledge cut-off date of the training data, and providing guidelines for tool usage (Openai, 2024; Cohere, 2024; PromptHub, 2025). Additionally, system messages are used to guide the model in generating safe and harmless responses (Touvron et al., 2023; Lu et al., 2024; Wallace et al., 2024).
Despite the usefulness of system messages, there is a significant lack of data that includes system messages reflecting diverse user instructions without license constraints. Furthermore, manually labeling such data requires substantial human resources, and even among publicly available datasets, it is challenging to find data that includes varied system messages (Lin et al., 2024; Xu et al., 2024). Lee et al. (2024) propose a data augmentation method that reflects hierarchical dimensions of system-role data, together with a multi-aspect evaluation benchmark. Furthermore, Qin et al. (2024) provide a multi-turn benchmark to evaluate system message alignment. In line with these works, our SysGen pipeline ensures high-quality system messages and assistant responses by supplementing data using only open-source models without licensing concerns. Furthermore, it demonstrates that data augmentation is possible on existing SFT datasets without requiring extensive human labeling effort.

3 SysGen: Pipeline of System and Assistant Response Generation
Our SysGen pipeline consists of four phases: (1) generating system messages with eight key functionalities (Sec 3.1), (2) filtering mis-specified system tags and reorganizing them (Sec 3.2), (3) verifying the key functionalities at the phrase level (Sec 3.3), and (4) generating new assistant responses using the refined system messages and original user instructions (Sec 3.4). Figure 2 depicts the overall architecture of the SysGen pipeline.
3.1 Phase 1: System Message Generation
The primary goal of our SysGen pipeline is to enhance existing SFT datasets by adding system messages that were not originally included. As system messages can steer an LLM’s behavior, we focus on them during model development and release. However, license constraints and the substantial cost of manually labeling system messages make it difficult to utilize most publicly available datasets. Thus, we aim to generate system messages by leveraging open-source models and data without license issues.
Phrase level Annotation to System Messages
We manually classify eight functionalities that are widely used in system messages, following previous works (Openai, 2024; Cohere, 2024; AlKhamissi et al., 2024; Lee et al., 2024): (1) specifies the role, profession, or identity to be played (Role); (2) specifies content that needs to be included in the response, such as the identity of a company (Content); (3) identifies what to perform (Task); (4) specifies the behavior to perform (Action); (5) specifies the preferred communication style of the responses (Style); (6) provides additional information needed to serve as an assistant (Background); (7) provides built-in methods to use (Tool); (8) specifies what the output should look like (Format).
As shown in Figure 2 (top left), all functionalities are annotated at the phrase level with matching pre-/post-fix tags. Given a pair of a user instruction $Q$ and an assistant response $A$, we generate a system message $S$ using an open-source LLM $\mathcal{M}$ prompted with few-shot demonstrations $\mathcal{P}_{\text{few-shot}}$:

$$S = \mathcal{M}(\mathcal{P}_{\text{few-shot}}, Q, A) \tag{1}$$
We provide details about the few-shot demonstrations in Appendix D.
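As a rough sketch of Phase 1 (the model name, prompt wording, and generation parameters below are illustrative assumptions, not the exact setup used in this paper), the generation step can be implemented with a standard Hugging Face chat pipeline:

```python
# Minimal sketch of Phase 1, assuming a Hugging Face text-generation pipeline.
# The few-shot demonstrations and exact prompt wording are placeholders.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/phi-4",   # illustrative; any open-source generator works
    torch_dtype="auto",
    device_map="auto",
)

FEWSHOT_DEMOS = "..."  # three-shot demonstrations (see Appendix D)

def generate_system_message(question: str, answer: str) -> str:
    """Generate a phrase-level tagged system message S from a (Q, A) pair."""
    prompt = (
        "Write a system message for the conversation below. Annotate each phrase "
        "with one of the tags <<Role>>, <<Content>>, <<Task>>, <<Action>>, "
        "<<Style>>, <<Background>>, <<Tool>>, <<Format>>, repeating the same tag "
        "before and after the phrase.\n\n"
        f"{FEWSHOT_DEMOS}\n\n"
        f"### Conversational History\nUser: {question}\nAssistant: {answer}\n\n"
        "### System Message:"
    )
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the new text.
    return out[0]["generated_text"][len(prompt):].strip()
```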
3.2 Phase 2: Filtering Process
After generating the system messages, we filter out abnormal ones to ensure a consistent text format. As shown in Figure 2 (top right), we first identify and remove mis-tagged phrases. For example, we can guarantee the correctness of a phrase only if its start and end tokens are identical (e.g., <<Task>>). In addition, we remove invalid tags such as <<Example>> or <<System>>, which may be generated in phase 1. To ensure a consistent structure of system messages, we reorder the tags and phrases into a manually defined order.
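A minimal sketch of this filtering and reordering logic follows; the canonical tag order and the example message are our illustrative assumptions:

```python
import re

# Canonical tag order used for reorganization (illustrative ordering).
TAG_ORDER = ["Role", "Content", "Task", "Action", "Style",
             "Background", "Tool", "Format"]
VALID_TAGS = set(TAG_ORDER)

def filter_and_reorder(system_message: str) -> str:
    """Keep only phrases wrapped by matching valid tags, then reorder them."""
    phrases = {}
    # A phrase is kept only if its opening and closing tags are identical.
    for tag, phrase in re.findall(r"<<(\w+)>>\s*(.*?)\s*<<\1>>",
                                  system_message, flags=re.DOTALL):
        if tag in VALID_TAGS:          # drops invalid tags like <<Example>>
            phrases.setdefault(tag, phrase)
    return " ".join(f"<<{t}>> {phrases[t]} <<{t}>>"
                    for t in TAG_ORDER if t in phrases)

raw = ("<<Role>> You are a friendly math tutor. <<Role>> "
       "<<Example>> 2 + 2 = 4 <<Example>> "
       "<<Task>> Solve grade-school word problems step by step. <<Task>>")
print(filter_and_reorder(raw))
# <<Role>> You are a friendly math tutor. <<Role>> <<Task>> Solve ... <<Task>>
```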
Models | R1 | R2 | RL | BERTScore | BLEURT | GLEU | Len.
LLaMA-3.1-8B-instruct | 33.3 | 15.6 | 23.1 | 81.3 | 33.6 | 28.2 | 1.35 |
Qwen2.5-14b-instruct | 44.9 | 23.2 | 30.7 | 85.9 | 39.9 | 39.2 | 1.55 |
Phi-4 | 51.9 | 32.3 | 41.1 | 86.1 | 40.1 | 37.2 | 1.89 |
3.3 Phase 3: Verification of Eight Key Functionalities
In this phase, we verify whether each generated phrase is appropriate for its assigned tag. Using the LLM-as-a-judge (Zheng et al., 2023) approach with self-model feedback, we assign one of three labels for each tag: Good if the tagging is appropriate, Bad if the tagging is inappropriate, and None if the tag or phrases are missing. Phrases labeled as Bad or None are then removed from the system message to ensure accuracy and consistency. We observe that most of the data instances (up to 99%) are preserved after applying phase 3.
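A minimal sketch of this check, assuming the same open-source model is reused as its own judge (the prompt wording and helper names are illustrative):

```python
# Sketch of Phase 3: self-model feedback per tag, reusing the `generator`
# object from the Phase 1 sketch as the judge.
VERIFY_PROMPT = (
    "You are verifying a system message. For the phrase tagged <<{tag}>> below, "
    "answer with exactly one word: Good (appropriate for the tag), "
    "Bad (inappropriate), or None (tag or phrase missing).\n\n"
    "Conversation:\nUser: {question}\nAssistant: {answer}\n\n"
    "Phrase: {phrase}\nAnswer:"
)

def verify_tags(question, answer, tagged_phrases, generator):
    """tagged_phrases: dict tag -> phrase. Keep only phrases judged Good."""
    kept = {}
    for tag, phrase in tagged_phrases.items():
        prompt = VERIFY_PROMPT.format(tag=tag, question=question,
                                      answer=answer, phrase=phrase)
        out = generator(prompt, max_new_tokens=4, do_sample=False)
        text = out[0]["generated_text"][len(prompt):].strip()
        label = (text or "None").split()[0]
        if label == "Good":
            kept[tag] = phrase      # Bad / None phrases are softly removed
    return kept
```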
3.4 Phase 4: Assistant Response Generation
After filtering and verifying the generated system messages, they could in principle be used alongside the existing QA pairs. However, we hypothesize that there may be misalignment between the human-curated QA pairs and the model-generated system messages, so a follow-up data alignment phase is necessary. Therefore, we generate a new assistant response $A'$ based on the refined system message $S'$ and the user instruction $Q$, ensuring better alignment with the given instructions.
To achieve this, we first remove the annotated tags from the system messages so that the refined messages read naturally. We provide a detailed example in Figure 2 (bottom right). Then, we use the same open-source LLM $\mathcal{M}$ employed in phase 1 to generate the new response:

$$A' = \mathcal{M}(S', Q) \tag{2}$$
In Table 1, the new responses preserve similar content to the original responses, with high n-gram overlap, while exhibiting more diverse formats, higher semantic similarity, and greater verbosity. We provide example cases in Appendix C.
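The similarity statistics in Table 1 can be approximated with off-the-shelf metric packages; the sketch below (library choices and the per-pair averaging scheme are our assumptions, and BLEURT is omitted) covers ROUGE, BERTScore, GLEU, and the length ratio:

```python
# Sketch of the Table 1 comparison between an original answer A and a new answer A'.
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from nltk.translate.gleu_score import sentence_gleu

def compare(original: str, new: str) -> dict:
    rs = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {k: v.fmeasure for k, v in rs.score(original, new).items()}
    _, _, f1 = bert_score([new], [original], lang="en")
    return {
        **rouge,
        "bertscore_f1": f1.item(),
        "gleu": sentence_gleu([original.split()], new.split()),
        "length_ratio": len(new.split()) / max(len(original.split()), 1),
    }

print(compare("The answer is 42.",
              "After simplifying the expression, the answer is 42."))
```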

We also use LLM-as-a-judge with GPT-4o to assess whether the new responses $A'$ are better aligned with the user instructions than the original responses $A$. Figure 3 illustrates the proportion of cases where the new responses are judged to be better aligned than the original responses given the user instructions. For simplicity, we evaluate 1K randomly sampled instances from the generated datasets. Overall, our findings suggest that generating responses conditioned on the system messages leads to better alignment with user instructions.
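A minimal sketch of this pairwise check, assuming the OpenAI chat completions API (the prompt wording and random response ordering are our assumptions; the exact judging prompt is given in Table 13):

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_alignment(question: str, original: str, new: str) -> str:
    """Ask GPT-4o which response better follows the user instruction."""
    pair = [("original", original), ("new", new)]
    random.shuffle(pair)  # shuffle to reduce position bias
    prompt = (
        "Which response better follows the user instruction? "
        "Answer with 'A', 'B', or 'Tie'.\n\n"
        f"Instruction: {question}\n\n"
        f"Response A: {pair[0][1]}\n\nResponse B: {pair[1][1]}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()
    if reply.startswith("A"):
        return pair[0][0]
    if reply.startswith("B"):
        return pair[1][0]
    return "tie"
```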
4 Experimental Settings
Models | # Instances: original → after Phase 2 (filtering) → after Phase 3 (verification)
LLaMA-3.1-8B-instruct | 806,796 → 602,750 (74.7%) → 586,831 (72.7%) | ||
Qwen2.5-14b-instruct | 806,796 → 806,602 (99.9%) → 775,830 (96.2%) | ||
Phi-4 | 806,796 → 774,613 (96.0%) → 773,878 (95.9%) |
Model | Parameter Scale | AlpacaEval | FLASK | Koala | MT-Bench | Self-Instruct | Average (Multifacet)
Proprietary Models | |||||||
GPT-3.5-Turbo-0125 | ✗ | 4.05 | 3.86 | 4.15 | 3.87 | 3.85 | 3.91 |
GPT-4-0613 | ✗ | 4.25 | 4.00 | 4.18 | 4.16 | 4.13 | 4.10 |
GPT-4-Turbo-0125 | ✗ | 4.45 | 4.27 | 4.61 | 4.45 | 4.27 | 4.35 |
Open-Source Models | |||||||
LLaMA-3.1-8B-instruct | 8B | 4.26 | 3.82 | 4.29 | 4.15 | 4.06 | 4.12 |
Qwen2.5-14B-instruct | 14B | 4.37 | 4.07 | 4.37 | 4.27 | 4.21 | 4.26 |
Phi-4 | 14B | 4.53 | 4.24 | 4.51 | 4.39 | 4.40 | 4.41 |
Open-Source Models (Fine-tuning on SysGen dataset) | |||||||
LLaMA-3.1-8B-instruct | 8B | 4.38 | 3.95 | 4.41 | 4.22 | 4.11 | 4.21 |
Qwen2.5-14B-instruct | 14B | 4.40 | 4.11 | 4.42 | 4.22 | 4.25 | 4.28 |
Phi-4 | 14B | 4.62 | 4.63 | 4.52 | 4.44 | 4.49 | 4.54 |
4.1 Training Dataset
In Table 2, we report the number of instances remaining after each phase of our pipeline. We target datasets that meet three conditions: (1) they are widely used as SFT datasets; (2) they do not contain system messages; and (3) they cover diverse domains. The selected datasets are as follows: (1) Capybara (Daniele and Suphavadeeprasit, 2023) focuses on information diversity across a wide range of domains; (2) Airoboros (Jondurbin, 2024) is composed of multi-step instructions with diverse structured formats; (3) OrcaMath (Mitra et al., 2024) provides various mathematical problem-solving tasks; (4) MetaMathQA (Yu et al., 2023) is an augmented version of several math instruction sets; and (5) Magicoder (Luo et al., 2023) provides various code generation problems. We provide detailed statistics in Appendix A.
4.2 Evaluation Benchmarks
We evaluate performance on Multifacet (Lee et al., 2024), which requires both a system message and a user instruction to generate the assistant response. The Multifacet benchmark consists of approximately 921 samples drawn from AlpacaEval (Dubois et al., 2024), FLASK (Ye et al., 2023), MT-Bench (Bai et al., 2024), Koala (Geng et al., 2023), and Self-Instruct (Wang et al., 2022). Lee et al. (2024) evaluate each response along four dimensions: style, background information, harmlessness, and informativeness. We follow these evaluation settings in our experiments.
Additionally, we investigate the impact of SysGen data on unseen benchmarks by using the Open LLM Leaderboard 2 (Myrzakhan et al., 2024) as a test set. The test set is composed of MMLU (Hendrycks et al., 2020), MMLU-pro (Wang et al., 2024), Arc-challenge (Clark et al., 2018), GPQA (Rein et al., 2023), HellaSwag (Zellers et al., 2019), IFEVAL (Zhou et al., 2023), MATHQA (Amini et al., 2019), and BBH (Suzgun et al., 2023). We use the publicly available lm-evaluation harness (Gao et al., 2024) as the evaluation tool for a fair comparison.
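For reference, a minimal way to run such benchmarks with the lm-evaluation harness is sketched below; the task identifiers and model arguments are illustrative and may differ across harness versions and leaderboard configurations:

```python
# Sketch of the unseen-benchmark evaluation with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["mmlu", "mmlu_pro", "arc_challenge", "gpqa",
           "hellaswag", "ifeval", "mathqa", "bbh"],
    batch_size="auto",
)
for task, metrics in results["results"].items():
    print(task, metrics)
```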
4.3 Open-source Models
Our baselines are instruction-tuned open-source models, further trained on supervised fine-tuning datasets without system messages. We select one model from each widely used open-source model family: (1) Solar-10.7B-instruct (Kim et al., 2024); (2) Gemma-2-9B-instruct (Team et al., 2024); (3) LLaMA-3.1-8B-instruct (Meta, 2024); (4) Qwen2.5-14B-instruct (Yang et al., 2025); and (5) Phi-4 (Abdin et al., 2024).
5 Experiments
Model | Parameter Scale | AlpacaEval (AE) | FLASK (FL) | Koala (Ko) | MT-Bench (MT) | Self-Instruct (SI) | Average (Multifacet)
Open-Source Models | |||||||
Solar-10.7B-instruct | 10.7B | 3.30 | 3.31 | 3.09 | 3.19 | 3.08 | 3.19 |
Gemma-2-9b-it | 9B | 4.10 | 3.80 | 4.26 | 4.15 | 3.92 | 4.05 |
Open-source Models KD (Fine-tuning on SysGen dataset) | |||||||
Solar-10.7B-instruct | 10.7B | 3.97 | 3.73 | 3.64 | 3.98 | 3.52 | 3.76 (+0.57) |
Gemma-2-9b-it | 9B | 4.40 | 4.04 | 4.30 | 4.23 | 4.18 | 4.23 (+0.18) |
Model | Parameter Scale | MMLU | MMLU-Pro | ARC-c | GPQA | HellaSwag | IFEVAL | MATHQA | BBH | Average
Open-Source Models | ||||||||||
Solar-10.7B-instruct | 10.7B | 63.28 | 30.20 | 63.99 | 30.36 | 86.35 | 38.59 | 36.38 | 37.28 | 48.31 |
Gemma-2-9b-it | 9B | 73.27 | 32.78 | 67.89 | 31.05 | 81.92 | 74.78 | 38.87 | 41.98 | 55.31 |
LLaMA-3.1-8B-instruct | 8B | 67.95 | 40.87 | 54.95 | 34.60 | 79.18 | 50.71 | 39.53 | 70.85 | 54.83 |
Qwen2.5-14B-instruct | 14B | 79.73 | 51.22 | 67.39 | 45.51 | 82.31 | 79.83 | 42.12 | 78.25 | 65.79 |
Phi-4 | 14B | 84.56 | 70.12 | 68.26 | 55.93 | 84.42 | 62.98 | 48.87 | 79.87 | 69.37 |
Open-Source Models (Fine-tuning on original SFT Dataset) | ||||||||||
Solar-10.7B-instruct | 10.7B | 62.38 | 29.12 | 58.87 | 29.17 | 81.58 | 31.27 | 37.21 | 32.85 | 45.30 (-3.01) |
Gemma-2-9b-it | 9B | 71.85 | 31.67 | 62.57 | 30.51 | 77.54 | 69.25 | 39.12 | 37.25 | 52.47 (-2.84) |
LLaMA-3.1-8B-instruct | 8B | 65.34 | 36.85 | 54.18 | 33.93 | 77.98 | 35.64 | 40.03 | 62.83 | 50.85 (-3.98) |
Qwen2.5-14B-instruct | 14B | 75.87 | 49.85 | 66.89 | 43.98 | 80.99 | 62.57 | 43.28 | 71.17 | 61.82 (-3.97) |
Phi-4 | 14B | 80.27 | 66.58 | 66.27 | 52.89 | 83.39 | 55.83 | 49.98 | 75.49 | 66.33 (-6.04) |
Open-Source Models (Fine-tuning on SysGen dataset) | ||||||||||
LLaMA-3.1-8B-instruct | 8B | 66.89 | 39.77 | 54.55 | 34.21 | 78.89 | 46.75 | 42.11 | 68.98 | 54.02 (-0.81) |
Qwen2.5-14B-instruct | 14B | 78.92 | 43.38 | 66.82 | 44.46 | 80.98 | 74.59 | 43.23 | 76.28 | 63.58 (-2.20) |
Phi-4 | 14B | 83.27 | 68.77 | 67.89 | 55.18 | 84.31 | 57.87 | 50.23 | 77.12 | 68.08 (-1.29) |
Open-Source Models Knowledge Distillation (Fine-tuning on SysGen dataset) | | | | | | | | | |
Solar-10.7B-instruct | 10.7B | 59.98 | 29.26 | 62.81 | 30.25 | 85.91 | 34.58 | 38.25 | 35.97 | 47.12 (-1.19) |
Gemma-2-9b-it | 9B | 72.19 | 31.56 | 66.75 | 30.89 | 81.53 | 71.37 | 40.27 | 40.38 | 54.37 (-0.94) |
The primary goal of the SysGen pipeline is to enhance the utilization of the system role while minimizing performance degradation on unseen benchmarks, thereby improving the effectiveness of supervised fine-tuning (SFT). To validate this, we evaluate how well models trained on SysGen data generate appropriate assistant responses given both system messages and user instructions, using the Multifacet (Lee et al., 2024) dataset. For models that cannot generate data independently, we apply knowledge distillation to assess the effectiveness of SysGen data for them. Additionally, we use the widely adopted Open LLM Leaderboard 2 (Myrzakhan et al., 2024) as an unseen benchmark to determine whether our approach can be effectively integrated into existing SFT workflows.
SysGen provides better system messages and assistant responses that align with user instructions.
Given the system messages and user instructions, the assistant’s response is evaluated across four dimensions: style, background knowledge, harmlessness, and informativeness. Each aspect is scored on a scale of 1 to 5 using a rubric, and the average is reported as the final score for the given instruction. As shown in Table 3, recent open-source models achieve scores comparable to the proprietary models, indicating that open-source models have already undergone training related to system roles (Meta, 2024; Yang et al., 2024; Abdin et al., 2024).
When trained on SysGen data, both LLaMA (4.12 → 4.21) and Phi (4.41 → 4.54) show score improvements. Among the four dimensions, LLaMA exhibits score increases in style (4.15 → 4.32) and harmlessness (4.23 → 4.29). Similarly, Phi shows improvements in style (4.42 → 4.61) and informativeness (4.37 → 4.49). As a result, even open-source models that have already been trained on system roles benefit from SysGen data in terms of style, informativeness, and harmlessness.
Knowledge distillation through SysGen data.
If an open-source model does not support system roles, it may not generate system messages properly with the SysGen pipeline. However, the effectiveness of knowledge distillation, using data generated by another open-source model without this limitation, remains uncertain. To explore this, we train Gemma (Team et al., 2024) and Solar (Kim et al., 2024) using data generated by Phi-4 (Abdin et al., 2024). We use the Phi-4 data because it preserves most of the instances and provides high-quality assistant responses, as shown in Tables 1 and 2.
As shown in Table 4, even for models that do not inherently support system roles, modifying the chat template to incorporate a system role and training on the knowledge-distilled dataset leads to an improvement in Multifacet performance, as observed for Gemma (4.05 → 4.23). We describe the details in Appendix B. Additionally, for the Solar model, which had not been trained on system roles, we observe a dramatic performance improvement (3.19 → 3.76); we speculate that Solar had not properly learned the system role, given its low initial Multifacet score. This demonstrates that the data generated by the SysGen pipeline effectively supports system roles.
SysGen data minimizes performance degradation on unseen benchmarks.
When incorporating system messages that were not present in the original SFT datasets and modifying the corresponding assistant responses, it is crucial that the model’s existing capabilities do not degrade; maintaining the model’s original performance is a key consideration in post-training. To assess this, we measure the performance difference on unseen benchmarks after supervised fine-tuning. As shown in Table 5, we use the Open LLM Leaderboard 2 dataset as an unseen benchmark, with performance categorized into four groups:
- Performance of existing open-source models (rows 1-6)
- Performance of fine-tuning open-source models on the original SFT datasets (rows 7-12)
- Performance of fine-tuning on SysGen data (rows 13-16)
- Performance after knowledge distillation using Phi-4 SysGen data (rows 17-19)
The average performance degradation is reported relative to each open-source model’s original performance (rows 1-6).
When fine-tuning on data independently generated with SysGen, the performance degradation is significantly lower than when fine-tuning on the original SFT datasets selected under the same conditions. Additionally, even for models that cannot generate data independently (e.g., those that do not support system roles), knowledge distillation considerably mitigates performance drops.
6 Analysis
6.1 What makes SysGen pipeline useful?
Models | Multifacet Avg. | Unseen Benchmarks Avg.
No System Message | ||||||
LLaMA-3.1-8B-instruct | 3.98 | 50.85 | ||||
Phi-4 | 4.26 | 66.33 | ||||
Common System Message | ||||||
LLaMA-3.1-8B-instruct | 3.89 | 51.23 | ||||
Phi-4 | 4.23 | 66.52 | ||||
SysGen without A’ | ||||||
LLaMA-3.1-8B-instruct | 4.09 | 51.89 | ||||
Phi-4 | 4.38 | 66.12 | ||||
SysGen | ||||||
LLaMA-3.1-8B-instruct | 4.21 | 54.02 | ||||
Phi-4 | 4.54 | 68.08 |
To assess the impact of the system messages generated by SysGen during training, we conduct ablation studies on four model variations:
- No System Message: the original SFT dataset, which contains no system message, i.e., $(Q, A)$ pairs.
- Common System Message: a $(S_{\text{common}}, Q, A)$ triplet in which a common system message such as "You are a helpful AI assistant" is inserted.
- SysGen without $A'$: a $(S', Q, A)$ triplet that includes the system message generated by our SysGen pipeline but keeps the original answer.
- SysGen: a $(S', Q, A')$ triplet in which both the SysGen-generated system message and the newly generated answer are incorporated.
We measure the effectiveness of these models by analyzing score variations on the Multifacet and unseen benchmarks in Table 6.
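A minimal sketch of how these four training variants can be assembled from a single SysGen record (the field names are illustrative, not the released schema):

```python
COMMON_SYSTEM = "You are a helpful AI assistant."

def build_variants(record: dict) -> dict:
    """record: {'system': S', 'question': Q, 'answer': A, 'new_answer': A'}."""
    q, a = record["question"], record["answer"]
    s, a_new = record["system"], record["new_answer"]
    return {
        "no_system": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}],
        "common_system": [
            {"role": "system", "content": COMMON_SYSTEM},
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}],
        "sysgen_without_a_prime": [
            {"role": "system", "content": s},
            {"role": "user", "content": q},
            {"role": "assistant", "content": a}],
        "sysgen": [
            {"role": "system", "content": s},
            {"role": "user", "content": q},
            {"role": "assistant", "content": a_new}],
    }
```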
Training with data that includes common system messages does not result in a significant performance difference compared to training without system messages. This led us to ask: would it be sufficient to include only the most suitable system messages? To explore this, we train models on data that contains only the system messages generated by the SysGen pipeline. As a result, we observe an improvement in Multifacet performance for both models, while the scores on the unseen benchmarks remain similar. Furthermore, when both the system messages and the assistant responses generated by SysGen are used for fine-tuning, we observe performance improvements on both the Multifacet evaluation and the unseen benchmarks.
6.2 System message vs. User instruction
Models | Multifacet Avg. (system role → merged into user instruction)
Open-source Models | |||
Solar-10.7B-instruct | 3.19 → 2.98 | ||
LLaMA-3.1-8B-instruct | 4.12 → 4.09 | ||
Qwen2.5-14b-instruct | 4.26 → 4.13 | ||
Phi-4 | 4.41 → 4.26 | ||
Open-source Models (with SysGen) | |||
LLaMA-3.1-8B-instruct | 4.21 → 4.13 | ||
Qwen2.5-14B-instruct | 4.28 → 4.16 | ||
Phi-4 | 4.54 → 4.38 | ||
Open-source Models KD (with SysGen) | |||
Solar-10.7b-instruct | 3.76 → 3.64 |
A key question arises: what happens if we place a message intended for the system role at the beginning of the user instruction? Could it serve as a replacement for the system role? To explore this, we conduct an experiment on the Multifacet benchmark. Specifically, during inference we include messages that would typically occupy the system role within the user instruction.
As shown in Table 7, open-source models tend to experience score degradation when system role messages are incorporated into the user instruction. This trend suggests that adding such content can make the query itself more ambiguous to answer. Furthermore, even in models trained with our SysGen data, this trend persists, in line with previous work (Lee et al., 2024). Despite additional fine-tuning on system roles, scores remain lower when system messages are placed in the user instruction. This highlights the importance of properly placing these messages in the system role to maintain performance.
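A minimal sketch of the two inference-time message layouts compared in this experiment (message construction only; the scoring itself follows the Multifacet protocol):

```python
def with_system_role(system_msg: str, user_msg: str) -> list:
    """Standard setup: the system message occupies the system role."""
    return [{"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg}]

def merged_into_user(system_msg: str, user_msg: str) -> list:
    """Ablation: the system message is prepended to the user instruction."""
    return [{"role": "user", "content": f"{system_msg}\n\n{user_msg}"}]
```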
6.3 New assistant responses align to the system messages

In Table 1, we showed that the new assistant responses exhibit similar n-gram overlap, high semantic similarity, and greater verbosity. It is therefore necessary to verify whether the generated assistant responses align with the system messages. Figure 4 illustrates the GPT-4o results obtained with the LLM-as-a-judge approach. Across the three SysGen datasets generated by the Phi-4, LLaMA, and Qwen models, we find that the assistant responses are highly aligned with the system messages. Overall, the experiments and analyses show that SysGen produces assistant responses that follow the generated system messages and respond to various user instructions more effectively than the original assistant responses.
7 Conclusion
In our study, we introduce SysGen, a novel pipeline that generates system messages with better aligned assistant responses from existing SFT datasets that lack system messages. In SysGen data, the new assistant responses maintain lexical and semantic consistency with the original responses while aligning more closely with user instructions. Our experiments reveal that various open-source models trained on SysGen data perform better on the Multifacet dataset while showing minimal performance degradation on the unseen Open LLM Leaderboard 2 benchmarks. Our analysis demonstrates that diverse system messages improve LLMs’ ability to adapt to different user instructions. Additionally, we emphasize the importance of clearly distinguishing between the system and user roles.
Limitations
Our SysGen pipeline demonstrates promising results in aligning system messages with user instructions on the Multifacet dataset. However, our data construction pipeline only considers single-turn conversations and does not handle multi-turn conversations (Qin et al., 2024). We acknowledge that it is important for system messages to remain effective throughout multi-turn conversations, but our study focuses on single-turn evaluation and inference.
Additionally, our experimental results show that training with SysGen data incurs a small performance degradation on the unseen benchmark, Open LLM Leaderboard 2. We suspect that this drop may be due to the natural-text format of the SFT datasets we selected, which differs from the multiple-choice formats commonly found in the unseen benchmark. It would therefore be worth examining how well system messages can be generated for other formats, such as True/False or multiple-choice questions, and verifying their effectiveness.
Finally, in Table 8, we report the frequency of the special tag tokens annotated on the publicly available data. The <<Tool>> tag appears far less often than the other tags. Our initial intention was to use this tag for generating data involving search functionality or function calls. However, the selected public data deviated from this purpose, resulting in a very low proportion of generated <<Tool>> tags. It would therefore be beneficial to gather and generate data that matches each tag’s intended use.
References
- Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. 2024. Phi-4 technical report. arXiv preprint arXiv:2412.08905.
- AlKhamissi et al. (2024) Badr AlKhamissi, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. 2024. Investigating cultural alignment of large language models. arXiv preprint arXiv:2402.13231.
- Amini et al. (2019) Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Anthropic (2024) Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku.
- Bai et al. (2024) Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. 2024. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv preprint arXiv:2402.14762.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Cohere (2024) Cohere. 2024. Cohere tool use documentation.
- Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback.
- Daniele and Suphavadeeprasit (2023) Luigi Daniele and Suphavadeeprasit. 2023. Amplify-instruct: Synthetically generated diverse multi-turn conversations for efficient llm training. arXiv preprint arXiv:(coming soon).
- Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Dubois et al. (2024) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
- Gao et al. (2024) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2024. A framework for few-shot language model evaluation.
- Geng et al. (2023) Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023. Koala: A dialogue model for academic research.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Hinton (2015) Geoffrey Hinton. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Jiang et al. (2024) Guangyuan Jiang, Manjie Xu, Song-Chun Zhu, Wenjuan Han, Chi Zhang, and Yixin Zhu. 2024. Evaluating and inducing personality in pre-trained language models. Advances in Neural Information Processing Systems.
- Jondurbin (2024) Jondurbin. 2024. Airoboros version 3.1 datasets.
- Kim et al. (2024) Sanghoon Kim, Dahyun Kim, Chanjun Park, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. 2024. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track). Association for Computational Linguistics.
- Lee et al. (2024) Seongyun Lee, Sue Hyun Park, Seungone Kim, and Minjoon Seo. 2024. Aligning to thousands of preferences via system message generalization. arXiv preprint arXiv:2405.17977.
- Lin et al. (2024) Mingan Lin, Fan Yang, Yanjun Shen, Haoze Sun, Tianpeng Li, Tao Zhang, Chenzheng Zhu, Miao Zheng, Xu Li, Yijie Zhou, et al. 2024. Baichuan alignment technical report. arXiv preprint arXiv:2410.14940.
- Lu et al. (2024) Xinyu Lu, Bowen Yu, Yaojie Lu, Hongyu Lin, Haiyang Yu, Le Sun, Xianpei Han, and Yongbin Li. 2024. Sofa: Shielded on-the-fly alignment via priority rule following. arXiv preprint arXiv:2402.17358.
- Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. Wizardcoder: Empowering code large language models with evol-instruct.
- Meta (2024) AI Meta. 2024. Introducing llama 3.1: Our most capable models to date. Meta AI Blog, 12.
- Mitra et al. (2024) Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. 2024. Orca-math: Unlocking the potential of slms in grade school math. Preprint, arXiv:2402.14830.
- Myrzakhan et al. (2024) Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. 2024. Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545.
- OpenAI (2023) OpenAI. 2023. Openai gpt-4 technical report.
- Openai (2024) Openai. 2024. Openai function calling.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems.
- Pareja et al. (2024) Aldo Pareja, Nikhil Shivakumar Nayak, Hao Wang, Krishnateja Killamsetty, Shivchander Sudalairaj, Wenlong Zhao, Seungwook Han, Abhishek Bhandwaldar, Guangxuan Xu, Kai Xu, et al. 2024. Unveiling the secret recipe: A guide for supervised fine-tuning small llms. arXiv preprint arXiv:2412.13337.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
- PromptHub (2025) PromptHub. 2025. System messages: Best practices, real-world experiments & prompt injections.
- Qian et al. (2024) Cheng Qian, Bingxiang He, Zhong Zhuang, Jia Deng, Yujia Qin, Xin Cong, Zhong Zhang, Jie Zhou, Yankai Lin, Zhiyuan Liu, et al. 2024. Tell me more! towards implicit user intention understanding of language model driven agents. arXiv preprint arXiv:2402.09205.
- Qin et al. (2024) Yanzhao Qin, Tao Zhang, Yanjun Shen, Wenjing Luo, Haoze Sun, Yan Zhang, Yujing Qiao, Weipeng Chen, Zenan Zhou, Wentao Zhang, et al. 2024. Sysbench: Can large language models follow system messages? arXiv preprint arXiv:2408.10943.
- Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
- Rein et al. (2023) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2023. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022.
- Suzgun et al. (2023) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. 2023. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023.
- Team et al. (2024) Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wallace et al. (2024) Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. 2024. The instruction hierarchy: Training llms to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.
- Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
- Wang et al. (2024) Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574.
- Wolf (2019) T Wolf. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. Advances in neural information processing systems.
- Xu et al. (2023) Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148.
- Xu et al. (2024) Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, and Bill Yuchen Lin. 2024. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464.
- Yang et al. (2024) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- Yang et al. (2025) An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. 2025. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383.
- Ye et al. (2023) Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2023. Flask: Fine-grained language model evaluation based on alignment skill sets. arXiv preprint arXiv:2307.10928.
- Yu et al. (2023) Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. 2023. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems.
- Zhou et al. (2023) Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
Appendix A Data Statistics
Statistics of generated tags.
As stated in the Limitations section, we provide the statistics of the generated special tag tokens in Table 8. We find that the <<Role>>, <<Content>>, and <<Task>> tokens are annotated in most instances. Compared to those tokens, the presence of <<Action>>, <<Style>>, <<Background>>, and <<Format>> depends on the user instructions. However, <<Tool>> tokens appear in only a very small portion of instances. We therefore suggest that choosing public or in-house datasets suited to the <<Tool>> tag, such as data involving search protocols or function calls, would ensure better coverage of its intended usage.
Tags | LLaMA-3.1-8B-instruct | Qwen2.5-14b-instruct | Phi-4 |
Role | 576,341 | 753,579 | 745,751 |
Content | 580,231 | 739,892 | 743,311 |
Task | 579,558 | 765,331 | 735,298 |
Action | 495,301 | 382,358 | 662,589 |
Style | 283,579 | 598,553 | 603,918 |
Background | 293,791 | 539,757 | 553,791 |
Tool | 10,238 | 132,038 | 90,989 |
Format | 327,909 | 401,593 | 538,973 |
Dataset | # of instances | Avg. Query Length | Avg. Answer Length | Containing System Message | Covering Domains |
Capybara | 41,301 | 300.24 | 1423.28 | ✗ | reasoning, logic, subjects, conversations, pop-culture, STEM |
Airoboros | 59,277 | 507.26 | 1110.62 | simple system message | mathematics, MATHJSON, character’s descriptions |
OrcaMath | 200,035 | 238.87 | 878.43 | ✗ | school mathematics, math word problems |
Magicoder | 111,183 | 652.53 | 1552.41 | ✗ | code solution |
MetaMath | 395,000 | 213.53 | 498.24 | ✗ | mathematics |
Statistics of original SFT datasets.
In Table 9, we observe that most widely used public datasets either lack a system message entirely or include only a simple one, such as "You are a helpful AI assistant.". The publicly available data mostly cover mathematics and code problems, along with some reasoning and logic tasks.
Appendix B Experimental Details
Computing Resources
We use 4x8 NVIDIA H100 Tensor Core GPUs with 80GB memory to train the open-source models. We use DeepSpeed stage 3 (Rajbhandari et al., 2020) for multi-GPU training and FlashAttention (Dao et al., 2022) for efficient training. Our code is written in PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf, 2019).
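As an illustration of this setup (hyperparameters, the config file name, and the model choice below are placeholders, not the exact values used), a Hugging Face Trainer run with DeepSpeed ZeRO-3 and FlashAttention can be configured as follows:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "microsoft/phi-4"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # FlashAttention kernels
)

args = TrainingArguments(
    output_dir="sysgen-sft",
    per_device_train_batch_size=2,    # placeholder hyperparameters
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="ds_zero3_config.json",  # DeepSpeed ZeRO stage-3 config file
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=None,  # plug in the tokenized SysGen training set here
    tokenizer=tokenizer,
)
# trainer.train()
```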
Integrating system roles in models that do not support them.
Through our experiments, we find that the Gemma-2-9b-it (Team et al., 2024) model does not natively support the system role. To address this limitation during data generation and training, we modify the chat template in the tokenizer configuration to remove restrictions on the system role. Interestingly, despite the lack of native support, our findings show that SysGen data can still be used effectively to incorporate a system role into these models.
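A minimal sketch of such a modification follows; the replacement template is an illustrative assumption of one way to expose a system turn, not the exact template we used:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

# Replacement chat template that renders a system turn instead of raising
# an error for the unsupported 'system' role (illustrative).
tok.chat_template = (
    "{{ bos_token }}"
    "{% for message in messages %}"
    "{% set role = 'model' if message['role'] == 'assistant' else message['role'] %}"
    "{{ '<start_of_turn>' + role + '\n' + (message['content'] | trim) + '<end_of_turn>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<start_of_turn>model\n' }}{% endif %}"
)

messages = [
    {"role": "system", "content": "<<Role>> You are a concise math tutor. <<Role>>"},
    {"role": "user", "content": "What is 17 * 24?"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```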
Appendix C Qualitative analysis of generated instances
In Table 10, we present examples of SysGen data, showing the system messages, user instructions, and new assistant responses. We observe that specifying a particular format, such as answering in paragraph form, steers the LLM toward step-by-step reasoning within a paragraph. Also, when a conversational example is provided, the phrase in the style tag leads the model to generate a friendlier assistant response. Furthermore, if the system message grants a specific role, such as a knowledgeable assistant, the new assistant responses tend to be more verbose.
Appendix D Prompts
To enhance reproducibility and facilitate understanding of the SysGen pipeline, we provide the prompts we used. In Table 11, we use three-shot demonstrations, collected from real-world scenarios, to generate useful system messages. The conversational history in the prompt is composed of the user instructions and the original assistant responses. Thus, given the user instructions and assistant responses, we generate system messages at the phrase level covering the eight functionalities with special tokens such as <<Role>>, <<Content>>, and <<Style>>.
After generating the system messages, in Table 12, we verify the quality of each tag with three classes: Good, Bad, and None. Note that the annotated system messages, composed of phrases and tags, are used to verify the filtered system messages. By using the LLM-as-a-judge approach with self-model feedback rather than proprietary models (i.e., API calls), we save substantial cost. In our preliminary experiment, we observe that current open-source models such as Phi-4 or Qwen2.5-14b-instruct preserve most of the phrases after applying phase 3.
Table 13 shows the prompt used to verify the quality of the new assistant responses, as reported in Figure 3. After prompting with 1K randomly sampled instances, we observe that the new assistant responses are better aligned with the user instructions.