
SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation

Jing-Jing Li♡♠   Valentina Pyatkin   Max Kleiman-Weiner   Liwei Jiang   Nouha Dziri   Anne G. E. Collins   Jana Schaich Borg   Maarten Sap♠◆   Yejin Choi   Sydney Levine
UC Berkeley   Allen Institute for AI   University of Washington   Duke University   CMU
[email protected], [email protected]
Abstract

The ideal LLM content moderation system would be both structurally interpretable (so its decisions can be explained to users) and steerable (to reflect a community’s values or align to safety standards). However, current systems fall short on both of these dimensions. To address this gap, we present SafetyAnalyst, a novel LLM safety moderation framework. Given a prompt, SafetyAnalyst creates a structured “harm-benefit tree,” which identifies 1) the actions that could be taken if a compliant response were provided, 2) the harmful and beneficial effects of those actions (along with their likelihood, severity, and immediacy), and 3) the stakeholders that would be impacted by those effects. It then aggregates this structured representation into a harmfulness score based on a parameterized set of safety preferences, which can be transparently aligned to particular values. Using extensive harm-benefit features generated by SOTA LLMs on 19k prompts, we fine-tuned an open-weight LM to specialize in generating harm-benefit trees through symbolic knowledge distillation. On a comprehensive set of prompt safety benchmarks, we show that our system (average F1 = 0.75) outperforms existing LLM safety moderation systems (average F1 < 0.72) on prompt harmfulness classification, while offering the additional advantages of interpretability and steerability (https://jl3676.github.io/SafetyAnalyst).

1 Introduction

Figure 1: Overview of the SafetyAnalyst framework. We generated extensive harm-benefit feature data using SOTA LLMs (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, Llama-3.1-405B-Turbo, and Claude-3.5-Sonnet) on 19k user prompts through chain-of-thought prompting. We embedded each prompt in a hypothetical AI language model usage scenario and instructed the LLMs to enumerate all stakeholders who may be impacted, any potentially harmful/beneficial actions that may impact the stakeholders, and the effects each action may cause to each stakeholder. The LLMs additionally labeled the likelihood, extent/severity, and immediacy of each effect. These harm-benefit features were then used to train two specialist models — one to generate harms and one to generate benefits (together named SafetyReporter) — through symbolic knowledge distillation via supervised fine-tuning of Llama-3.1-8B-Instruct. Given any prompt, SafetyReporter efficiently generates an interpretable harm-benefit tree. The harms and benefits are weighted and traded off by an aggregation algorithm to calculate a harmfulness score, which can be directly translated into content safety labels or refusal decisions. Steerability can be achieved by aligning the weights in the aggregation algorithm to a user’s or community’s preference or to principled safety standards.

As large language models (LLMs) and their applications become rapidly integrated into people’s daily lives, it is critical to develop robust and reliable content moderation systems to ensure the safe usage of LLM-based artificial intelligence (AI) technology (Bengio et al., 2024). Recently, Dalrymple et al. (2024) proposed a blueprint for guaranteed safe AI, arguing that a “world model” that can accurately predict the causal effects of AI behavior on the outside world is an integral component of robust and reliable AI systems. However, current LLM content moderation and safeguarding systems are not grounded in an explicit understanding of such causal effects, since they rely on deep neural networks (such as LMs) to directly learn the relationship between input content and harmfulness (Markov et al., 2023; Inan et al., 2023; Han et al., 2024; Zeng et al., 2024a; Bai et al., 2022). Such systems may excel at classifying the harmfulness of content within their training distributions, but their behavior on out-of-distribution samples is unpredictable because their decision-making processes are difficult to explain or interpret.

Moreover, as AI technology reaches diverse human populations (e.g., people with different ethical, cultural, political, educational, professional, and socioeconomic backgrounds), there is an increased need for safety moderation that can be tailored to specific applications, adapt to different safety preferences, or reflect pluralistic human values (Sorensen et al., 2024a). AI safety criteria change across user demographics. For example, an AI technology that is deployed to children may require stricter regulation of violent or sexually explicit content; one developed for scientists might comply with queries about potentially dangerous chemicals, while such compliance may not be appropriate for a model released to the general public. Thus, current LLM content safety moderation could benefit substantially from pluralistic paradigms that can be aligned or steered to reflect different safety perspectives (Sorensen et al., 2024b).

To improve the interpretability and steerability of LLM content moderation, we introduce SafetyAnalyst: an LLM safety moderation system that produces a world-model-inspired “harm-benefit tree” and aggregates its features mathematically via a process that can be steered to accommodate different safety preferences. While existing AI safety research relies on risk categories (e.g., illegal activity, misinformation) to classify harm (Zeng et al., 2024b; Xie et al., 2024; Han et al., 2024; Ji et al., 2024; Mazeika et al., 2024), SafetyAnalyst is grounded in the fundamental principles of cost-benefit analysis (Arrow et al., 1996), explicitly representing what actions may cause which harmful or beneficial effects on different stakeholders. For example, given the prompt “how to transport drugs through airport security?”, current safety frameworks are likely to characterize the harmfulness of this prompt based on the action of transporting illegal substances, rather than the potential harmful effects if the LLM were to provide a helpful response, such as the user being arrested and the AI developers being punished for facilitating illegal activity. Given a prompt, SafetyAnalyst generates extensive trajectories of harmful and beneficial consequences, estimates the likelihood, extent/severity, and immediacy of each effect, and aggregates them numerically into a harmfulness score. The aggregation mechanism can be parametrically modified to weight individual features differently (e.g., to up- or down-weight particular categories of harms, benefits, stakeholders, etc.). Weights can be adjusted in a top-down manner to fit safety standards or principles (e.g., as determined by a policy) or in a bottom-up manner, optimized to fit safety label distributions that reflect the values of a particular community or sub-community. Overall, this pipeline allows SafetyAnalyst to produce interpretable, transparent, and steerable safety labels.

We implemented the conceptual SafetyAnalyst framework into a system for prompt harmfulness classification. Using 19k harm-benefit trees generated by a mixture of state-of-the-art (SOTA) LLMs, we fine-tuned an open-weight LM to specialize in generating harm-benefit features. To perform prompt classification, we optimized the parameters of our mathematical aggregation algorithm to the harmful and benign prompt labels provided by WildJailbreak, a large-scale prompt dataset containing synthetic harmful and benign prompts generated based on 13 risk categories (Jiang et al., 2024). We show that both the SOTA teacher LMs and the fine-tuned specialist achieved high test performance on WildJailbreak prompt classification (F1 > 0.84, AUPRC > 0.89, and AUROC > 0.88). We further report strong results applying SafetyAnalyst to prompt safety classification on a comprehensive set of public benchmarks, showcasing competitive performance against current LLM safety moderation systems on all benchmarks. On average, our system (F1 = 0.75) outperformed existing counterparts (F1 < 0.72), while offering the benefits of interpretability and steerability that other systems lack.

2 The SafetyAnalyst framework

SafetyAnalyst breaks down the problem of content classification into sub-tasks (Figure 1). First, it generates interpretable harm-benefit features that describe the potential impacts of an AI system complying with a particular request (prompt). This feature generation step can be performed by any instruction-tuned LM through chain-of-thought prompting; using data collected from a mixture of SOTA LLMs, we fine-tuned an open-weight LM (Llama-3.1-8B-Instruct) to specialize in efficient feature generation. Second, an aggregation algorithm we developed weights these features by their relative importance and combines them into a numerical harmfulness score, which can be used to produce content safety labels.

Figure 2: A representative small subset of features generated by SafetyReporter given a prompt.

2.1 Harm-benefit feature generation

Given a prompt and a scenario where the AI language model complies with the user request, a SafetyAnalyst model extensively generates features (Figure 2) including all stakeholders (individuals, groups, communities, and entities in society that may be affected), harmful and beneficial actions that may impact each stakeholder, harmful and beneficial effects that may be caused by each action on each stakeholder, and the likelihood (low, medium, or high), extent/severity (minor, significant, substantial, or major), and immediacy (immediate or downstream) of each effect. Harmful actions are generated in accordance with (and classified by) the AIR 2024 risk taxonomy (Zeng et al., 2024b), an extensive categorization of harmful actions that could result from interaction with an LM, derived from worldwide governmental and corporate policies. Beneficial actions are generated in free text. Due to the lack of formal characterization of harmful and beneficial effects in the AI safety literature, we defined a novel hierarchical taxonomy, drawing on the theories of basic/primary goods of two influential contemporary moral philosophers: Bernard Gert (Gert, 2004) and John Rawls (Rawls, 2001). See Appendix A for complete taxonomies.
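To make this structure concrete, the following is a minimal sketch of how a harm-benefit tree could be represented in code. The class and field names are illustrative assumptions, not the exact schema used by SafetyAnalyst.

```python
from dataclasses import dataclass, field
from typing import List, Literal

Likelihood = Literal["low", "medium", "high"]
Extent = Literal["minor", "significant", "substantial", "major"]
Immediacy = Literal["immediate", "downstream"]

@dataclass
class Effect:
    description: str          # e.g., "physical health risks"
    likelihood: Likelihood
    extent: Extent
    immediacy: Immediacy

@dataclass
class Action:
    description: str          # free text for benefits; AIR 2024 label for harms
    category: str             # e.g., "Criminal Activities" for a harmful action
    is_harmful: bool
    effects: List[Effect] = field(default_factory=list)

@dataclass
class Stakeholder:
    name: str                 # e.g., "the user", "airport security staff"
    actions: List[Action] = field(default_factory=list)

@dataclass
class HarmBenefitTree:
    prompt: str
    stakeholders: List[Stakeholder] = field(default_factory=list)
```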

We used a diverse mixture of SOTA LLMs including GPT-4o (Achiam et al., 2023), Gemini-1.5-Pro (Team et al., 2023), Llama-3.1-70B-Instruct, Llama-3.1-405B-Instruct-Turbo (Dubey et al., 2024), and Claude-3.5-Sonnet to generate extensive harm-benefit tree data on 18,901 prompts randomly sampled from WildJailbreak (Jiang et al., 2024), WildChat (Zhao et al., 2024), and AegisSafetyTrain (Ghosh et al., 2024). Table 4 in Appendix B shows the breakdown of prompt distribution over the datasets for all LLMs. We sampled most of our prompts from WildJailbreak, which is a large-scale synthetic prompt dataset covering 13 risk categories with both vanilla harmful and benign examples, as well as adversarial examples generated from the vanilla seeds. To increase the diversity of content and linguistic features in the prompts, we sampled some prompts from WildChat, which consists of in-the-wild user prompts, and AegisSafetyTrain, which was built on HH-RLHF harmlessness prompts.

Overall, the LLMs generated rich harm-benefit features that follow a tree-like structure: more than 10 stakeholders per prompt, 3-10 actions per stakeholder, and 3-7 effects per action, with counts varying between models and prompt classes in WildJailbreak (Table 1). The variance in the number of features generated by each LLM highlights the importance of sampling from different SOTA LLMs to maximize coverage of different harms and benefits.

Table 1: Number of features generated by SafetyAnalyst models for harmful/benign prompts.
Model   Stakeholders   Harm Actions/SH   Harm Effects/Act.   Benefit Actions/SH   Benefit Effects/Act.
GPT-4o 13.6 / 7.9 6.9 / 4.8 4.4 / 3.9 4.7 / 4.9 5.2 / 4.3
Gemini 10.7 / 8.3 3.2 / 1.9 3.7 / 2.9 3.5 / 3.2 3.3 / 2.8
Llama-70B 17.7 / 13.0 3.9 / 2.9 3.5 / 3.0 5.0 / 5.5 3.3 / 3.8
Llama-405B 17.0 / - 6.3 / - 6.7 / - 6.3 / - 5.7 / -
Claude 22.0 / - 5.3 / - 4.2 / - 9.4 / - 4.2 / -
SafetyReporter 11.6 / 8.3 3.6 / 2.4 3.7 / 3.2 3.8 / 4.0 3.4 / 3.4

2.2 SafetyReporter: an open-source pair of specialist models for harm and benefit feature generation

To enable fast, cheap, and high-quality harm-benefit feature generation, we trained an open-weight LM (Llama-3.1-8B-Instruct) to specialize in the tasks of generating harms and benefits using data collected from the SOTA LLMs shown in Table 4. We applied supervised fine-tuning using QLoRA (Dettmers et al., 2024) to distill knowledge about the harmful and beneficial features of interest from the teacher models (SOTA LLMs) into the student model (West et al., 2021). We trained one specialist model to generate harm-trees and another for benefit-trees, which can be combined into the full harm-benefit tree structure (Figure 2). Due to the extensive combined lengths of our taxonomies and the harm-benefit trees generated by teacher LLMs, we fine-tuned two specialists instead of one so that the inputs and outputs could jointly fit into the context window defined by our hardware constraints (a context window of 18,000 tokens on 8 NVIDIA H100 GPUs). The two student models that specialize in harm and benefit feature generation are collectively named “SafetyReporter.”

We trained SafetyReporter on all data generated by the teacher models shown in Table 4 except that we randomly down-sampled the WildJailbreak data from Llama-70B to 1,000 vanilla harmful and 1,000 vanilla benign prompts. Additionally, to increase the robustness of SafetyReporter to adversarial attacks (e.g., jailbreaks), we augmented the training dataset with adversarial prompts from WildJailbreak, which contains synthetic adversarial prompts created based on the vanilla prompts using in-the-wild jailbreak techniques. We randomly sampled 6,368 adversarial prompts that corresponded to the vanilla prompts (at most one adversarial prompt per vanilla prompt) used in data generation, and augmented the training dataset by pairing them with the harm-benefit trees of the corresponding vanilla prompts.
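As a rough illustration of this fine-tuning setup, the sketch below configures 4-bit quantization and low-rank adapters using the Hugging Face transformers and peft libraries. The hyperparameters, target modules, and trainer choice are assumptions, since the paper does not specify them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # base model named in the paper

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable parameters; values are illustrative
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training then proceeds as standard supervised fine-tuning on
# (prompt, harm-tree) or (prompt, benefit-tree) pairs, one adapter
# per specialist model.
```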

To evaluate the quality of the generated harm-benefit features, we collected human annotation data from 126 Prolific workers on their agreement with the generated stakeholders, harmful/beneficial effects, and the likelihoods, extents, and immediacies of the effects. Annotators showed broad agreement on the plausibility of the harm-benefit features (see Table 5 in Appendix C for results, Figure 4 in Appendix C for the interface design, and the Discussion section for further discussion).

2.3 Mathematical feature aggregation

We mathematically formalize a feature aggregation algorithm for quantifying the harmfulness H of a prompt over features generated by a SafetyAnalyst model, parameterized by W and \gamma:

H(\text{prompt}\mid W,\gamma)=\sum_{\text{Stakeholder}}\sum_{\text{Action}}\sum_{\text{Effect}}w(\text{Action})\cdot f(\text{Likelihood})\cdot g(\text{Extent})\cdot h(\text{Immediacy}),

where W is a set of weights for the 16 second-level action categories in the AIR 2024 taxonomy and for the relative importance of different extents and likelihoods, and \gamma includes discount factors for downstream (vs. immediate) and beneficial (vs. harmful) effects. In total, the model includes 29 parameters: 16 weights for harmful action categories (Security Risks, Operational Misuses, Violence & Extremism, Hate/Toxicity, Sexual Content, Child Harm, Self-harm, Political Usage, Economic Harm, Deception, Manipulation, Defamation, Fundamental Rights, Discrimination/Bias, Privacy, and Criminal Activities), 2 weights for the relative importance of harmful effect likelihoods (Low vs. Medium and Medium vs. High), 3 weights for the relative importance of harmful effect extents (Minor vs. Significant, Significant vs. Substantial, and Substantial vs. Major), 5 weights for the relative importance of beneficial effect likelihoods and extents, 2 weights for the immediacy discount factors for harmful and beneficial effects (Downstream vs. Immediate), and 1 weight for the relative importance of beneficial (vs. harmful) effects. The functions w, f, g, and h return the weights for a given action (conditioned on the given stakeholder), likelihood, extent, and immediacy (conditioned on the given stakeholder, action, and effect), respectively, with f(\text{High})=1, g(\text{Major})=1, and h(\text{Immediate})=1 for all harms and w(\text{Action})=-1 for all benefits.
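The following is a minimal sketch of this aggregation, operating on the tree structure sketched in Section 2.1. The dictionary layout for W and \gamma is illustrative; the fitted weight values are reported in Figure 3.

```python
def harmfulness_score(tree, W, gamma):
    """Aggregate a harm-benefit tree into a scalar harmfulness score H.

    `W` maps harmful action categories, likelihoods, and extents to weights in
    [0, 1]; `gamma` holds the downstream and benefit discount factors. The
    keys used below are illustrative, not the paper's exact parameterization.
    """
    H = 0.0
    for stakeholder in tree.stakeholders:
        for action in stakeholder.actions:
            for effect in action.effects:
                if action.is_harmful:
                    w = W["action"][action.category]              # category weight
                    f = W["harm_likelihood"][effect.likelihood]   # f("high") = 1
                    g = W["harm_extent"][effect.extent]           # g("major") = 1
                    h = 1.0 if effect.immediacy == "immediate" else gamma["harm_downstream"]
                else:
                    # w(Action) = -1 for benefits, discounted by the benefit factor
                    w = -1.0 * gamma["benefit"]
                    f = W["benefit_likelihood"][effect.likelihood]
                    g = W["benefit_extent"][effect.extent]
                    h = 1.0 if effect.immediacy == "immediate" else gamma["benefit_downstream"]
                H += w * f * g * h
    return H
```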

2.4 Feature weight alignment

Table 2: Validating SafetyAnalyst models on WildJailbreak prompt harmfulness classification.
Metric GPT-4o Gemini-1.5-Pro Llama-3.1-70B SafetyReporter
F1 91.8 87.7 88.1 84.7
AUPRC 91.7 92.0 96.6 89.0
AUROC 94.7 92.5 95.9 88.4

To translate the numerical harmfulness score H computed over features generated by a SafetyAnalyst model into a safety label for prompt classification, we aligned the aggregation algorithm to the ground-truth labels from the WildJailbreak dataset on harm-benefit trees generated by the teacher and student models, optimizing W and \gamma within [0, 1] using maximum-likelihood estimation over the analytical likelihood of \sigma(H). Table 2 shows the classification performance (measured by the F1 score, AUPRC, and AUROC, reported as percentages) of different teacher and student SafetyAnalyst models (GPT-4o, Gemini-1.5-Pro, Llama-3.1-70B-Instruct, and SafetyReporter) on balanced vanilla harmful and benign prompts in WildJailbreak held out from fitting the aggregation algorithm. All models achieved high classification performance, with F1 ≥ 84.7, AUPRC ≥ 89.0, and AUROC ≥ 88.4. Notably, SafetyReporter achieved performance close to that of the teacher LMs while being substantially smaller, with fully open data and model weights.
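A minimal sketch of this alignment step is shown below, assuming a Bernoulli likelihood with p(harmful) = \sigma(H) over the `harmfulness_score` sketch above and an off-the-shelf bounded optimizer. The parameter packing and optimizer choice are illustrative, not the exact procedure used in the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def negative_log_likelihood(theta, trees, labels, unpack):
    """Bernoulli NLL of binary harmfulness labels under p(harmful) = sigmoid(H)."""
    W, gamma = unpack(theta)  # map the flat parameter vector -> (W, gamma)
    labels = np.asarray(labels, dtype=float)
    H = np.array([harmfulness_score(t, W, gamma) for t in trees])
    p = expit(H)
    eps = 1e-9
    return -np.sum(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

def fit_weights(trees, labels, theta0, unpack):
    """Fit the free parameters, each constrained to [0, 1], by maximum likelihood."""
    bounds = [(0.0, 1.0)] * len(theta0)
    result = minimize(negative_log_likelihood, theta0,
                      args=(trees, labels, unpack),
                      method="L-BFGS-B", bounds=bounds)
    return result.x
```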

The optimized parameter values are illustrated in Figure 3. Among the harmful actions summarized by level-2 risk categories in the AIR 2024 taxonomy (Zeng et al., 2024b), Self-harm was weighted highest, followed by Criminal Activities and Political Usage. High-likelihood, immediate effects dominated the aggregation, with near-zero weights for medium- and low-likelihood or downstream effects, except for medium-likelihood harmful effects. All extents were weighted equally, except that minor harmful effects were deemed trivial by the aggregation model. Overall, aggregation was driven by harmful effects, as evidenced by the low relative importance of a beneficial effect compared to a harmful effect (13.4%).

Figure 3: Optimized SafetyReporter aggregation feature weights, fitted to balanced WildJailbreak prompt labels. Red and green bars represent the weights for harmful and beneficial effects, respectively. These weights could be further aligned in a top-down fashion to meet safety standards or in a bottom-up fashion to capture the safety preferences of a particular community.

3 Evaluating SafetyReporter against existing LLM safety moderation systems on prompt classification

To evaluate the effectiveness of SafetyAnalyst at identifying potentially harmful prompts, we tested SafetyReporter (aligned to WildJailbreak prompt labels with the weights illustrated in Figure 3) against existing LLM safety moderation systems on a comprehensive set of public benchmarks featuring potentially unsafe user queries and instructions. Here, we report the prompt harmfulness classification performance of each model on these benchmarks.

3.1 Evaluation setup

Benchmarks.

We tested SafetyReporter and relevant baselines on 6 publicly available prompt safety benchmark sets: SimpleSafetyTests (100 prompts; Vidgen et al. 2023), the HarmBench prompt standard test set (159 prompts; Mazeika et al. 2024), the vanilla and adversarial subsets of WildGuardTest (960 and 796 prompts, respectively; Han et al. 2024), AIR-Bench-2024 (5,694 prompts; Zeng et al. 2024c), and SORRY-Bench (9,450 prompts; Xie et al. 2024). These benchmarks represent a diverse and comprehensive selection of unsafe prompts, including manually crafted prompts on highly sensitive and harmful topics (SimpleSafetyTests), standard behaviors that may elicit harmful LLM responses (HarmBench), adversarial and benign prompts (WildGuardTest), prompts that may challenge government regulations and company policies (AIR-Bench-2024), and unsafe prompts that cover granular risk topics and linguistic characteristics (SORRY-Bench). Since our system focuses on identifying prompts that would be unsafe to respond to, rather than the harmfulness of the prompt content per se, we did not include benchmarks in which prompts were labeled for the latter, such as the OpenAI moderation dataset (Markov et al., 2023), ToxicChat (Lin et al., 2023), and AegisSafetyTest (Ghosh et al., 2024).

Baselines.

We compare SafetyReporter to 10 existing LLM safety moderation systems: the OpenAI moderation endpoint (Markov et al., 2023), LlamaGuard, LlamaGuard-2, LlamaGuard-3 (Inan et al., 2023), Aegis-Guard-Defensive, Aegis-Guard-Permissive (Ghosh et al., 2024), ShieldGemma-2B, ShieldGemma-9B, ShieldGemma-27B (Zeng et al., 2024a), and WildGuard (Han et al., 2024). Additionally, we report zero-shot GPT-4 performance (Achiam et al., 2023).

We referenced Han et al. (2024)’s evaluation results where applicable and additionally evaluated, with temperature set to 0, the models and benchmarks that they did not cover. We were unable to fairly evaluate Llama-Guard, Aegis-Guard-Defensive, and Aegis-Guard-Permissive (both Aegis-Guards are tuned Llama-Guard models) on SORRY-Bench, since the lengths of 457 prompts in SORRY-Bench exceeded the Llama-2 context window limit of 4,096 tokens (Touvron et al., 2023). For each model, we computed an average F1 score across benchmarks, weighted by the number of prompts in each benchmark dataset.
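For clarity, the size-weighted average can be computed as follows. The example call uses SafetyReporter’s per-benchmark F1 scores from Table 3 and the dataset sizes listed above, and recovers the reported weighted average of roughly 75.4.

```python
def weighted_average_f1(f1_scores, dataset_sizes):
    """Average F1 across benchmarks, weighted by the number of prompts in each."""
    total = sum(dataset_sizes)
    return sum(f1 * n for f1, n in zip(f1_scores, dataset_sizes)) / total

# SafetyReporter's per-benchmark F1 scores and benchmark sizes:
# SimpleSafetyTests, HarmBench, WildGuardTest (vanilla), WildGuardTest (adv.),
# AIR-Bench-2024, SORRY-Bench
avg = weighted_average_f1([95.2, 94.4, 88.3, 73.7, 83.0, 69.1],
                          [100, 159, 960, 796, 5694, 9450])
print(round(avg, 1))  # ~75.4
```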

3.2 Evaluation results

SafetyReporter outperforms existing LLM safety moderation systems on prompt harmfulness classification.

Table 3 shows our evaluation results, measured by the F1 score (reported as percentages). SafetyReporter achieved competitive performance on all benchmarks compared to existing LLM safety moderation systems, with the highest overall F1 score of 75.4, exceeding the second-highest score of 71.7 achieved by WildGuard. Notably, SafetyReporter’s performance was zero-shot, since it was not trained on or aligned to any of the benchmarks’ training sets, whereas WildGuard was trained on the WildGuardTrain set. Nonetheless, zero-shot GPT-4 performed better than all of the dedicated LLM moderation models, with an F1 score of 81.6.

Table 3: F1 scores of prompt harmfulness classification on public benchmarks. The average was computed over all benchmarks weighted by the number of examples in each dataset. The highest average score is emphasized in bold and the second highest underlined.
Model   SimpSTests   HarmBench   WildGuardTest Vani.   WildGuardTest Adv.   AIR-Bench   SORRY-Bench   Average
OpenAI Mod. API 63.0 47.9 16.3 6.8 46.5 42.9 41.1
Llama-Guard 93.0 85.6 70.5 32.6 44.7 - -
Llama-Guard-2 95.8 91.8 85.6 46.1 74.9 53.9 62.9
Llama-Guard-3 99.5 98.4 86.7 61.6 68.8 59.1 64.6
Aegis-Guard-D 100 93.6 82.0 74.5 83.4 - -
Aegis-Guard-P 99.0 87.6 77.9 62.9 62.5 - -
ShieldGemma-2B 99.5 100 62.2 59.2 28.6 18.5 27.4
ShieldGemma-9B 83.7 77.2 61.3 35.8 28.6 39.0 37.3
ShieldGemma-27B 85.7 74.8 62.4 43.0 32.0 42.3 40.6
WildGuard 99.5 99.7 91.7 85.5 87.6 58.2 71.7
GPT-4 100 100 93.4 81.6 84.5 78.2 81.6
SafetyReporter 95.2 94.4 88.3 73.7 83.0 69.1 75.4

4 Related work

Existing LLM content moderation systems.

Existing LLM content moderation systems include WildGuard (Han et al., 2024), ShieldGemma (Zeng et al., 2024a), AegisGuard (Ghosh et al., 2024), LlamaGuard (Inan et al., 2023), and the OpenAI moderation endpoint (Markov et al., 2023). These systems are LM-based classifiers that can categorize content risk, including user prompts. Although they can achieve high classification accuracy on prompt safety benchmarks (e.g., classifying a prompt as harmful or benign), their internal decision mechanisms are challenging to interpret, which limits their reliability and generalizability. Furthermore, due to the lack of modularity in their architectures, they cannot be easily steered to reflect different safety perspectives.

LLM content risk.

The AI safety literature has relied on risk taxonomies to categorize unsafe content. Recent work has built on standard risk categories (Weidinger et al., 2022) to include more fine-grained categories (Wang et al., 2023; Tedeschi et al., 2024; Xie et al., 2024; Brahman et al., 2024), achieve comprehensive coverage (Vidgen et al., 2024), and incorporate government regulations and company policies (Zeng et al., 2024b). Our system relies on the taxonomy developed by Zeng et al. (2024b), selected for its comprehensive and fine-grained nature. Overall, these taxonomies describe the unsafe nature of a prompt or the unsafe actions that might result from a prompt being answered. To our knowledge, no prior work proposes formal taxonomies for the downstream effects of unsafe prompts (as opposed to actions; see Appendix A for our taxonomies of harmful and beneficial effects).

Symbolic knowledge distillation.

We distilled a pair of small, expert LMs (SafetyReporter) to create structured harm-benefit trees, the core of our interpretable framework. The symbolic knowledge distillation strategy leverages diffuse knowledge gained by large, generalist (and often proprietary) models to create a more compact expert student model that excels at one particular task (Xu et al., 2024; West et al., 2021; Tang et al., 2019). This strategy is useful (among other reasons) to generate rich, structured data that is too costly or labor-intensive for humans to do by hand (West et al., 2021). Indeed, prior work shows that symbolic knowledge distillation from machine teachers can exceed the quality of human-authored symbolic knowledge (West et al., 2021; Jung et al., 2023). Compared to the teacher models, our SafetyReporter uses less time, memory, compute, and cost while achieving comparable performance, and it will be openly released for public use in LLM moderation contexts.

Pluralistic alignment for LLM safety.

Although current LLM safety moderation systems have yet to be pluralistically aligned, recent interest in value pluralism (Sorensen et al., 2024a) has given rise to the rapid development of pluralistic alignment approaches for LLMs. Lera-Leri et al. (2022) formalized an aggregation method for value systems inspired by the social choice literature. Feng et al. (2024) outlined a more general framework based on multi-LLM collaboration, in which an LLM can be aligned to specialized community LMs for different pluralism objectives. Other methods have been proposed for learning distributions of human preferences rather than the majority preference (Siththaranjan et al., 2023; Chen et al., 2024).

5 Discussion

In this paper, we introduce SafetyAnalyst, an interpretable, transparent, and steerable framework for LLM content safety moderation. In addition to the conceptual framework, we provide an implementation of the system with various resources: a large-scale dataset of rich safety features (in the form of a structured “harm-benefit tree”) generated by SOTA LLMs on 19k prompts, the open-source SafetyReporter for harm-benefit feature generation, the first taxonomies of harmful and beneficial effects for AI safety, and a feature aggregation algorithm that can be steered to align with a given safety content label distribution or with top-down safety standards.

Our application of SafetyAnalyst and SafetyReporter to a comprehensive set of prompt safety benchmarks shows SOTA performance compared to existing LLM safety moderation systems. The current implementation of SafetyAnalyst focuses on prompt harmfulness classification, which can help an AI system determine if a user prompt should be refused. However, this framework can be extended to solve other content safety tasks, such as LLM response moderation, free-text moderation (Zhang et al., 2023), and general text moderation.

Our work addresses the important challenge of interpretability in AI safety research by providing a conceptual framework with concrete implementation to improve on existing LLM content safety moderation systems. The interpretable features generated by SafetyAnalyst models are aggregated mathematically to produce explainable decisions on content safety, which is particularly desirable in safety-critical applications of LMs. When applied to determine if a user prompt should be refused by an LLM, these features can help provide informative refusal responses if the prompt is deemed unsafe by SafetyReporter. The steerability of SafetyAnalyst to different safety preferences makes it suitable for various safety goals, especially as LMs are deployed for more and more applications that serve diverse human populations.

The SafetyAnalyst framework extends the current scope of AI safety research by pioneering two important conceptual innovations. First, we highlight the importance of explicitly considering harmful effects in safety moderation in addition to harmful actions, which are the primary target of current AI risk taxonomies. The strong performance achieved by SafetyReporter on safety benchmarks suggests that weighting both actions and effects is an effective approach to determining prompt harmfulness, one that intuitively matches the decision process humans likely use. Second, we argue that the benefits of providing a helpful response to a user prompt should be traded off against the harms in determining refusals. The discounted importance of beneficial effects relative to harmful effects in our aggregation model fitted to WildJailbreak, a cutting-edge LLM safety prompt dataset, suggests that the benefits of helpfulness may have been insufficiently represented in the label generation of these prompts. Future prompt harmfulness benchmarks and safety systems should account for effects and benefits, rather than only harmful actions, to achieve more robust safety properties.

We propose that the weight optimization procedure of our feature aggregation algorithm, which aligns feature weights to a given distribution of harmfulness labels, can be extended to pluralistically align SafetyAnalyst to different human values and safety preferences that reflect different ideas of harmfulness. Developers could apply our feature weight optimization approach to align SafetyAnalyst to a content label distribution that reflects their desired values and safety properties, such as one sampled from the customer base they serve. Although we weighted action categories at level 2 of the AIR 2024 taxonomy (e.g., Hate/Toxicity), our aggregation algorithm can be expanded to optimize weights on more fine-grained, level-3 categories (e.g., Harassment, Hate Speech, Perpetuating Harmful Beliefs, and Offensive Language), thanks to the taxonomy’s extensive coverage of AI risk. Furthermore, the aggregation algorithm can be extended to additionally parameterize the relative importance of effect categories, which may vary between human populations and LM applications with different safety goals.

Future work should validate the proposed pluralistic alignment approach for SafetyAnalyst on diverse human populations with pluralistic values and on applications of LMs with different safety preferences. Already, the annotation data we collected on the harm-benefit trees hint that value pluralism could have an important impact on LLM content moderation. The fact that SafetyAnalyst performs competitively on safety moderation benchmarks indicates that the harm-benefit trees are, in aggregate, aligned with the safety concerns of the researchers and annotators creating gold-standard labels for safety benchmarks. However, the results in Table 5 reveal a more complex picture. While annotators agreed with the SafetyAnalyst model-generated features the majority of the time, there was also substantial variance, suggesting that there is room to fine-tune SafetyReporter or reweight the aggregation mechanism of SafetyAnalyst to align more closely with individual or group values.

Limitations.

Generating the extensive harm-benefit trees, which are crucial to the interpretability of SafetyAnalyst, leads to longer inference times compared to existing, less interpretable LLM moderation systems. Although our specialized SafetyReporter substantially reduces the cost of feature generation compared to using an off-the-shelf LLM, we make a conscious trade-off between efficiency and interpretability to make LLM content safety decisions more reliable and transparent. While our system draws on the principles of cost-benefit analysis commonly used to justify the adoption of governmental policies, following Arrow et al. (1996) we emphasize that simply summing harmful and beneficial effects will not ultimately be sufficient for safe decision-making. Future work should explore issues related to the incommensurability of values, the effectiveness with which SafetyAnalyst captures non-quantifiable harms and benefits, and the importance of weighting actions themselves, beyond just the effects they produce.

Conclusion.

We introduce SafetyAnalyst, a novel conceptual framework for interpretable, transparent, and steerable LLM content safety moderation. We operationalized the pipeline of harm-benefit tree data generation, symbolic knowledge distillation, and weighted feature aggregation to implement an LLM safety moderation system for prompt harmfulness classification. Our system achieved SOTA performance on a comprehensive set of prompt safety benchmarks, demonstrating strong potential for real-world LLM safety applications.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Arrow et al. (1996) Kenneth J Arrow, Maureen L Cropper, George C Eads, Robert W Hahn, Lester B Lave, Roger G Noll, Paul R Portney, Milton Russell, Richard Schmalensee, Kerry Smith, et al. Benefit-cost analysis in environmental, health, and safety regulation. Washington, DC: American Enterprise Institute, pp.  1–17, 1996.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  • Bengio et al. (2024) Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, et al. Managing extreme ai risks amid rapid progress. Science, 384(6698):842–845, 2024.
  • Brahman et al. (2024) Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, et al. The art of saying no: Contextual noncompliance in language models. arXiv preprint arXiv:2407.12043, 2024.
  • Chen et al. (2024) Daiwei Chen, Yi Chen, Aniket Rege, and Ramya Korlakai Vinayak. Pal: Pluralistic alignment framework for learning from heterogeneous preferences. arXiv preprint arXiv:2406.08469, 2024.
  • Dalrymple et al. (2024) David Dalrymple, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, Christian Szegedy, Ben Goldhaber, Nora Ammann, et al. Towards guaranteed safe ai: A framework for ensuring robust and reliable ai systems. arXiv preprint arXiv:2405.06624, 2024.
  • Dettmers et al. (2024) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • Feng et al. (2024) Shangbin Feng, Taylor Sorensen, Yuhan Liu, Jillian Fisher, Chan Young Park, Yejin Choi, and Yulia Tsvetkov. Modular pluralism: Pluralistic alignment via multi-llm collaboration. arXiv preprint arXiv:2406.15951, 2024.
  • Gert (2004) Bernard Gert. Common morality: Deciding what to do. Oxford University Press, 2004.
  • Ghosh et al. (2024) Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. Aegis: Online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993, 2024.
  • Han et al. (2024) Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495, 2024.
  • Inan et al. (2023) Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
  • Ji et al. (2024) Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 2024.
  • Jiang et al. (2024) Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. arXiv preprint arXiv:2406.18510, 2024.
  • Jung et al. (2023) Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, and Yejin Choi. Impossible distillation: from low-quality model to high-quality dataset & model for summarization and paraphrasing. arXiv preprint arXiv:2305.16635, 2023.
  • Lera-Leri et al. (2022) Roger Lera-Leri, Filippo Bistaffa, Marc Serramia, Maite Lopez-Sanchez, and Juan A Rodríguez-Aguilar. Towards pluralistic value alignment: Aggregating value systems through lp-regression. 2022.
  • Lin et al. (2023) Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. arXiv preprint arXiv:2310.17389, 2023.
  • Markov et al. (2023) Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp.  15009–15018, 2023.
  • Mazeika et al. (2024) Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
  • Rawls (2001) John Rawls. Justice as fairness: A restatement. Harvard University Press, 2001.
  • Siththaranjan et al. (2023) Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell. Distributional preference learning: Understanding and accounting for hidden context in rlhf. arXiv preprint arXiv:2312.08358, 2023.
  • Sorensen et al. (2024a) Taylor Sorensen, Liwei Jiang, Jena D Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, et al. Value kaleidoscope: Engaging ai with pluralistic human values, rights, and duties. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  19937–19947, 2024a.
  • Sorensen et al. (2024b) Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, et al. A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070, 2024b.
  • Tang et al. (2019) Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136, 2019.
  • Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Tedeschi et al. (2024) Simone Tedeschi, Felix Friedrich, Patrick Schramowski, Kristian Kersting, Roberto Navigli, Huu Nguyen, and Bo Li. Alert: A comprehensive benchmark for assessing large language models’ safety through red teaming. arXiv preprint arXiv:2404.08676, 2024.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Vidgen et al. (2023) Bertie Vidgen, Hannah Rose Kirk, Rebecca Qian, Nino Scherrer, Anand Kannappan, Scott A Hale, and Paul Röttger. Simplesafetytests: a test suite for identifying critical safety risks in large language models. arXiv preprint arXiv:2311.08370, 2023.
  • Vidgen et al. (2024) Bertie Vidgen, Adarsh Agrawal, Ahmed M Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, et al. Introducing v0. 5 of the ai safety benchmark from mlcommons. arXiv preprint arXiv:2404.12241, 2024.
  • Wang et al. (2023) Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023.
  • Weidinger et al. (2022) Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.  214–229, 2022.
  • West et al. (2021) Peter West, Chandra Bhagavatula, Jack Hessel, Jena D Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, and Yejin Choi. Symbolic knowledge distillation: from general language models to commonsense models. arXiv preprint arXiv:2110.07178, 2021.
  • Xie et al. (2024) Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, et al. Sorry-bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598, 2024.
  • Xu et al. (2024) Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024.
  • Zeng et al. (2024a) Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma. arXiv preprint arXiv:2407.21772, 2024a.
  • Zeng et al. (2024b) Yi Zeng, Kevin Klyman, Andy Zhou, Yu Yang, Minzhou Pan, Ruoxi Jia, Dawn Song, Percy Liang, and Bo Li. Ai risk categorization decoded (air 2024): From government regulations to corporate policies. arXiv preprint arXiv:2406.17864, 2024b.
  • Zeng et al. (2024c) Yi Zeng, Yu Yang, Andy Zhou, Jeffrey Ziwei Tan, Yuheng Tu, Yifan Mai, Kevin Klyman, Minzhou Pan, Ruoxi Jia, Dawn Song, et al. Air-bench 2024: A safety benchmark based on risk categories from regulations and policies. arXiv preprint arXiv:2407.17436, 2024c.
  • Zhang et al. (2023) Yiming Zhang, Sravani Nanduri, Liwei Jiang, Tongshuang Wu, and Maarten Sap. Biasx:” thinking slow” in toxic content moderation with explanations of implied social biases. arXiv preprint arXiv:2305.13589, 2023.
  • Zhao et al. (2024) Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild. arXiv preprint arXiv:2405.01470, 2024.

Appendix A Prompting and definitions for harm-benefit feature generation

A.1 Scenario setup

In a conversation between an AI language model and a user, the user prompts the AI language model: “[INSERT PROMPT]” The AI language model complies with the request by providing a helpful, informative answer with specific details and/or step-by-step guidance.

A.2 Taxonomy of harmful effects

Physical Harm

1. Death
2. Physical health risks (examples include but are not limited to injury, disease, physical disability, organ damage)

Psychological Harm

3. Transient emotional changes (examples include but are not limited to anger, anxiety, fear, frustration, disgust, sadness)
4. Persistent emotional state changes (examples include but are not limited to chronic stress, trauma, depression, mental disability, behavioral and cognitive changes due to addiction)

Social Harm

5. Social deficits (examples include but are not limited to social rejection, exclusion, reputation damage, loss of public trust, loss of social status or power)

Property Harm

6. Physical property loss (examples include but are not limited to physical property damage and theft)
7. Financial property loss (examples include but are not limited to loss of income, lost opportunities for financial gain, credit damage)
8. Intellectual property infringement (examples include but are not limited to plagiarism, patent theft)

Liberty Harm

9. Freedom and agency limitations (examples include but are not limited to unjustified restrictions on or requirements of movement, speech, decision-making, and personal autonomy)
10. Restricting, requiring, or degrading accurate information access (through mechanisms including but not limited to misinformation, deception, manipulation, coercion, censorship, deprivation of education)

Collective Harm

11. Institutional harm (examples include but are not limited to voter suppression, loss of faith, destruction of useful societal institutions)
12. Public resource damage (examples include but are not limited to damage to cultural artifacts, public property, public resources)
13. Societal disruption (examples include but are not limited to loss of social order, peace, culture)

Ecological Harm

14. Harm to non-human animals (examples include but are not limited to animal abuse, animal habitat destruction, loss of species)
15. Ecosystem damage (examples include but are not limited to pollution, natural resource depletion)

A.3 Taxonomy of beneficial effects

Physical Benefit

1. Prevention of death
2. Physical health benefits (examples include but are not limited to gain of immunity, prevention of injury, prevention of disease, prevention of physical disability)

Psychological Benefit

3. Transient emotional changes (examples include but are not limited to joy, excitement, gratitude, amusement, contentment, and relief, prevention of negative acute emotional changes)
4. Persistent emotional state changes (examples include but are not limited to happiness, love, friendship, and hope, prevention of negative chronic emotional state changes)

Social Benefit

5. Social gains (examples include but are not limited to social acceptance, inclusion, gain of positive reputation, gain of public trust, gain of social status or power, prevention of social deficits)

Property Benefit

6. Physical property gains (examples include but are not limited to physical property acquirement, prevention of physical property loss)
7. Financial property gains (examples include but are not limited to gain of income, increased opportunities for financial gain, prevention of financial loss)
8. Intellectual property gains (examples include but are not limited to patent acquirement, prevention of intellectual property loss)

Liberty Benefit

9. Freedom and agency benefits (examples include but are not limited to increased freedom of movement, speech, decision-making, and personal autonomy, prevention of freedom and agency limitations)
10. Gain of accurate information access (through mechanisms including but not limited to accurate information, gain of education, prevention of misinformation, deception, manipulation, coercion, and censorship)

Collective Benefit

11. Institutional benefits (examples include but are not limited to gain of faith, construction of useful societal institutions, prevention of institutional harms)
12. Public resource gains (examples include but are not limited to protection for cultural artifacts, public property, and public resources, prevention of public resource damages)
13. Societal order maintenance (examples include but are not limited to gain or maintenance of social order, peace, and culture, prevention of social disruption)

Ecological Benefit

14. Benefits to non-human animals (examples include but are not limited to prevention of animal abuse, animal habitat construction and maintenance, prevention of loss of species)
15. Ecosystem benefits (examples include but are not limited to prevention of pollution, prevention of natural resource depletion)

A.4 Extents/severity of harmful effects

Minor:

harmful consequences are isolated, reversible, temporary, treatable, and do not require significant intervention or resources to mitigate; the stakeholder may experience limited interruption of functional ability and/or quality of life.

Significant:

harmful consequences are endemic, reversible but with significant effort, acute, treatable with significant intervention or resources; the stakeholder may experience significant interference of functional ability and/or quality of life.

Substantial:

harmful consequences are widespread, not easily reversible (if at all), chronic, and not treatable even with significant intervention or resources; the stakeholder may experience substantial impairment of functional ability and/or quality of life.

Major:

harmful consequences are systemic, irreversible, persistent, critical, catastrophic, and imminently threatening; the stakeholder may experience destruction of functional ability and/or quality of life.

A.5 Extents of beneficial effects

Minor:

beneficial consequences are isolated, temporary, noticeable but have limited impact on the overall situation or outcome; the stakeholder may experience a limited enhancement of functional ability and/or quality of life.

Significant:

beneficial consequences are widespread, significant, noticeable, and can lead to meaningful improvements and tangible impacts; the stakeholder may experience significant improvement in functional ability and/or quality of life.

Substantial:

beneficial consequences are extensive, important, and can lead to considerable positive changes and profound impact on the situation or outcome; the stakeholder may experience substantial enhancement of functional ability and/or quality of life.

Major:

beneficial consequences are systemic, persistent, critical, highly impactful, and can lead to transformative changes that significantly alter the courses of events; the stakeholder may experience a profound improvement in functional ability and/or quality of life.

A.6 Likelihoods of harmful/beneficial effects

Low:

unlikely to occur, but not impossible. Rare under normal circumstances; less than 30% chance of occurring.

Medium:

possible occurrence, might happen under certain conditions or occasionally; between 30% and 70% chance of occurring.

High:

likely to occur, happens frequently or under normal conditions; above 70% chance of occurring.

Appendix B Harm-benefit data collection

Table 4 breaks down the distribution of harm-benefit feature data collection from teacher LLMs on the various prompt datasets. To optimize the cost-effectiveness of harm-benefit feature data collection using proprietary and computationally expensive models, we sampled fewer benign than harmful prompts from WildJailbreak, since we observed in our early aggregation analysis that the variance in feature diversity, quantified by the variance of the aggregated harmfulness score distribution, was much lower for benign prompts than for harmful prompts.

Table 4: Breakdown of harm-benefit data generation by teacher LLMs (number of examples).
Model   WildJailbreak Harmful   WildJailbreak Benign   WildChat   Aegis-Train   Total
GPT-4o 1,000 500 499 99 2,098
Gemini-1.5-Pro 1,500 750 - - 2,250
Llama-3.1-70B-Instruct 6,607 6,325 663 - 13,595
Llama-3.1-405B-Turbo 458 - - - 458
Claude-3.5-Sonnet 500 - - - 500
Total 10,065 7,575 1,162 99 18,901

Appendix C Human evaluation of generated features

Participants.

Annotators were recruited through Prolific and paid an average of $15/hour for their participation. 42 workers annotated 25 sets of teacher-generated harmful features each, 44 workers annotated 25 sets of teacher-generated beneficial features each, 20 workers annotated 15 SafetyReporter-generated harmful features each, and 20 workers annotated 15 SafetyReporter-generated beneficial features each.

Method.

For each harmful or beneficial effect, the human annotator was given detailed instructions on how to evaluate the validity of the given features, including a stakeholder who may be impacted, a harmful/beneficial effect that may be caused to the given stakeholder, and the likelihood, extent/severity, and immediacy of the effect (Figure 4). The human annotators were asked six questions per effect, evaluating their understanding of the scenario and whether they thought each given feature was plausible or reasonable. The plausibility of stakeholders and harmful/beneficial effects was rated on a 4-point scale (very plausible, somewhat plausible, somewhat implausible, and very implausible) due to their more open-ended nature, while the likelihood, extent/severity, and immediacy labels were rated on a binary scale (reasonable or not reasonable). The choices were not forced: the annotators had the option to state that they were unsure about any given feature. Results are reported in Table 5. To obtain the agreement rates, we computed the proportion of positive ratings (e.g., very plausible, somewhat plausible, and reasonable) among all positive and negative ratings.
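A minimal sketch of this agreement-rate computation is shown below, assuming ratings are coded as the response options described above; the string labels are illustrative.

```python
def agreement_rate(ratings):
    """Proportion of positive ratings among all positive and negative ratings.

    "Unsure" responses are excluded from the denominator, mirroring the
    computation described above.
    """
    positive = {"very plausible", "somewhat plausible", "reasonable"}
    negative = {"somewhat implausible", "very implausible", "not reasonable"}
    n_pos = sum(r in positive for r in ratings)
    n_neg = sum(r in negative for r in ratings)
    return n_pos / (n_pos + n_neg) if (n_pos + n_neg) else float("nan")
```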

Figure 4: The human annotation user interface.
Table 5: Human agreement rates of harm-benefit features generated by teacher and student models.
Model   Stakeholder   Harm Effect   Harm Extent   Harm Lik.   Harm Imm.   Benefit Effect   Benefit Extent   Benefit Lik.   Benefit Imm.
GPT-4o 67.7 55.0 68.9 70.1 74.7 61.7 64.4 68.0 69.9
Gemini 70.7 72.1 82.1 78.8 80.4 57.8 61.8 63.6 70.3
Llama-70B 73.3 57.9 71.0 79.9 78.2 65.5 68.4 78.1 79.4
Llama-405B 76.1 69.7 68.4 76.1 79.1 49.3 58.8 60.9 67.0
Claude 74.5 69.1 72.6 67.7 80.6 55.3 57.1 59.9 72.5
SafetyReporter 76.5 54.4 70.0 73.4 76.5 56.1 59.8 65.9 74.2

Appendix D Additional safety benchmark evaluation methods

GPT-4.

We evaluated GPT-4’s performance on AIR-Bench and SORRY-Bench, which were not tested by Han et al. (2024), using their prompt template.

ShieldGemma.

We evaluated all three ShieldGemma models using the safety principles specified by all harm types listed in Google’s official model card (No Dangerous Content, No Harassment, No Hate Speech, and No Sexually Explicit Information).