Perspectives on the Social Impacts of
Reinforcement Learning with Human Feedback

Gabrielle Kaili-May Liu
Department of Mathematics
Department of Brain and Cognitive Sciences
Social and Ethical Responsibilities of Computing (SERC)
Massachusetts Institute of Technology
Cambridge, MA 02139
[email protected]
Abstract

Is it possible for machines to think like humans? And if it is, how should we go about teaching them to do so? As early as 1950, Alan Turing stated that we ought to teach machines as one would teach a child. Recently, reinforcement learning with human feedback (RLHF) has emerged as a strong candidate for allowing agents to learn from human feedback in a naturalistic manner. RLHF is distinct from traditional reinforcement learning in that it provides feedback from a human teacher in addition to a reward signal. It has been catapulted into public view by multiple high-profile AI applications, including OpenAI’s ChatGPT, DeepMind’s Sparrow, and Anthropic’s Claude. These highly capable chatbots are already overturning our understanding of how AI interacts with humanity. The wide applicability and burgeoning success of RLHF strongly motivate the need to evaluate its social impacts. In light of recent developments, this paper considers an important question: can RLHF be developed and used without negatively affecting human societies? Our objectives are threefold: to provide a systematic study of the social effects of RLHF; to identify key social and ethical issues of RLHF; and to discuss social impacts for stakeholders. Although text-based applications of RLHF have received much attention, it is crucial, when evaluating its social implications, to consider the diverse range of areas in which it may be deployed. We describe seven primary ways in which RLHF-based technologies will affect society by positively transforming human experiences with AI. This paper ultimately proposes that RLHF has the potential to net positively impact the areas of misinformation, AI value-alignment, bias, AI access, cross-cultural dialogue, industry, and the workforce. As RLHF raises concerns that echo those of existing AI technologies for governance, industry, safety, ethics, and the future of global power relations, it will be important for all to be aware and intentional in the adoption of RLHF.

1 Introduction

Since long before modern computing, we have sought to teach machines through natural, humanistic interactions GML5 . As early as 1950, Alan Turing stated in his seminal paper on artificial intelligence (AI) that we ought to “provide the machine with the best sense organs that money can buy, and then teach it… That process could follow the normal teaching of a child. Things would be pointed out and named, etc.” TURING . John McCarthy proposed one of the earliest iterations of such a system in 1959, describing an “advice taker” that could learn via common sense reasoning by drawing logical conclusions from any set of premises issued to the system as imperative statements MCCARTHY . In the 1980s, this work was extended by Hayes-Roth et al. to develop a generalized framework for machines to learn from external (human) advice, involving steps for receiving, interpreting, and integrating advice into a machine’s learning HAYESROTH1 ; HAYESROTH2 . Since then, the rapid development of AI and machine learning (ML) has led to significant progress in giving artificial agents the ability to interact with humans and learn from their feedback in a naturalistic manner GML5 ; ME2 .

A technique of particular import which has arisen in the past few years is reinforcement learning with human feedback (RLHF). Reinforcement learning (RL) refers to the field of ML in which an agent learns through interactions with the environment to select the best course of action (a policy) in a given state ME2 ; GML1 . Each state-action pair is associated with a reward, which serves as feedback for the agent to tune its policy. As the agent learns over training episodes, it ultimately arrives at an optimized policy that maximizes cumulative reward. RL has garnered high-profile success in various applications including board and video games, autonomous driving, text summarization, online personalization, finance, and healthcare. As such, it is thought to be a critical component in the development of truly generalized autonomous AI ME2 .
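
To make the RL loop just described concrete, the following is a minimal sketch of tabular Q-learning on a hypothetical two-state environment. It is an illustration only, not a method from this paper: the transition table, rewards, and hyperparameters are placeholders chosen to show how reward feedback tunes a policy.

    import numpy as np

    n_states, n_actions = 2, 2
    P = np.array([[0, 1], [1, 0]])          # toy transition table: next state for each (state, action)
    R = np.array([[0.0, 1.0], [0.0, 2.0]])  # toy rewards for each (state, action)

    Q = np.zeros((n_states, n_actions))     # state-action value estimates
    alpha, gamma, eps = 0.1, 0.9, 0.1       # learning rate, discount factor, exploration rate
    rng = np.random.default_rng(0)

    s = 0
    for _ in range(5000):
        # Epsilon-greedy selection: mostly exploit current estimates, occasionally explore.
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = int(P[s, a]), R[s, a]
        # Temporal-difference update toward the observed reward plus discounted future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

    print(Q)  # acting greedily with respect to Q maximizes cumulative discounted reward

In RLHF, the scalar reward driving this kind of update need not come solely from the environment; a reward signal can instead be learned from human feedback, as sketched below.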

RLHF is an extension of RL that incorporates human feedback into the training process GML1 ; OPENAI . In addition to the reward signal, an RLHF agent receives feedback from a human teacher that permits it to learn with broader perspective and greater efficiency in a similar fashion to humans learning from the expertise of another human GML1 . By providing a bridge between an agent and a human teacher, RLHF allows humans to directly guide machine learning and machines to grasp elements of decision-making distinctly embedded in human experience RLHF8 . The ability to provide and incorporate human feedback in RLHF is further a critical step toward achieving improved alignment between ML models and human values GML1 ; RLHF1 .
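
In practice, a widely used instantiation of this idea (popularized by work on deep RL from human preferences and employed in systems such as InstructGPT) trains a separate reward model from pairwise human preference comparisons, and that learned reward then guides policy optimization. The sketch below, written with placeholder data and dimensions rather than anything from this paper, shows the core of that preference-fitting step.

    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        """Maps an observation (or trajectory/response embedding) to a scalar reward."""
        def __init__(self, obs_dim: int):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

        def forward(self, x):
            return self.net(x).squeeze(-1)

    obs_dim = 8                      # placeholder feature dimension
    rm = RewardModel(obs_dim)
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

    # Hypothetical labeled batch: segments A and B, with 1.0 meaning the human preferred A.
    seg_a = torch.randn(32, obs_dim)
    seg_b = torch.randn(32, obs_dim)
    prefers_a = torch.ones(32)

    for _ in range(200):
        r_a, r_b = rm(seg_a), rm(seg_b)
        # Bradley-Terry-style objective: the preferred segment should receive higher reward.
        p_a = torch.sigmoid(r_a - r_b)
        loss = nn.functional.binary_cross_entropy(p_a, prefers_a)
        opt.zero_grad()
        loss.backward()
        opt.step()

Once fit, the reward model stands in for (or augments) the environment reward while the policy is optimized with a standard RL algorithm, which is how human judgments end up shaping the agent’s behavior.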

Although RLHF is a relatively young technology, it has been catapulted into public view by multiple high-profile AI applications including OpenAI’s ChatGPT, DeepMind’s Sparrow, and Anthropic’s Claude. Uses of these chatbots include constructing context-appropriate email responses, solving math problems, and generating code RLHF6 . Presently, RLHF is finding widespread application in business, education, healthcare, and entertainment RLHF8 .

RLHF creates a host of benefits over traditional RL methods. Its key advantages lie in better alignment with human intentions, as well as planning conditional on future feedback, fluid learning from various types of feedback, and curation of feedback according to necessity, all of which are indispensable for creating truly intelligent agents GML1 ; RLHF2 . It also permits machines to learn by abstracting from what humans value as opposed to simply imitating human behavior, thereby equipping agents with greater adaptability, enhanced interpretability, and more reliable decision-making RLHF2 .

Despite these advances, there is vast potential for RLHF to be improved RLHF1 ; FORBES . RLHF models are potentially prone to inaccurate or harmful behavior (e.g., issuing racist statements) GML1 . This limitation reflects a longer-term challenge and motivation for improving RLHF OPENAI ; RLHF1 ; FORBES . Additionally, gathering human preference data as feedback is costly, and disagreement between human annotators adds variance to training data which can create confusion in situations in which the ground truth is obscure (e.g., ethical dilemmas) RLHF1 . Moreover, human feedback in RLHF is often constrained to be in the form of preference orderings which provide limited information and thereby restrict applicability GML1 ; RLHF2 . It is desirable to achieve a broader formalism that considers multiple types of feedback, dependent on task context and similar to the diversity of responses utilized in human learning. Work in this area could facilitate identification of which types of feedback lead to better generalization for applications.

1.1 Context for the Present Work

As RLHF is gaining rapid traction, now is the ideal time to consider its potential impacts on society ME2 . The transformational potential of RLHF makes it critical to consider how broadened application of RLHF-based technologies may impact various stakeholders, what ethical concerns might arise as a result, how it may affect social and ethical challenges, and how governance may be utilized to mitigate risks. This analysis is timely and justified considering the current state of AI safety research. According to a 2022 report by the Center for Security and Emerging Technology, research in the top three areas of AI safety—robustness, interpretability, and reward learning—has seen explosive growth in the past decade CSET4 . Reward learning is critically concerned with reducing the risk of disparity between intended and observed outcomes, yet work in the area is less developed relative to robustness and interpretability research. This discrepancy extends to the study of related social and ethical implications CSET4 . This report therefore seeks to fill this gap and take a step toward expanding responsible discussion of reward learning methods like RLHF.

1.2 Objectives and Metrics

The objective of this report is threefold: first, to provide a systematic study of the social effects of RLHF; second, to identify key social and ethical issues of RLHF; and third, to discuss social impacts for stakeholders. We propose that continued development of RLHF has a net positive social impact and is thus worth continued pursuit. We define an impact to be any direct or indirect effect that “includes both positive and negative results,” and a benefit to be “a positive impact that produces a good result” HELP1 . We assume social impact to be the net effect on relevant stakeholders, which may include individuals, families, communities, organizations, nations, or global societies as a whole HELP1 . Social impacts identified in this report are evaluated relative to the social baseline—the environment in the absence of the technology in question HELP1 ; SIA .

2 Impacts of RLHF

We describe seven primary ways in which RLHF positively transforms human experiences with AI. Our analysis is guided by the following questions:

  • How might RLHF affect the integrity of information to which people have access?

  • How might RLHF reflect values and preferences of target populations?

  • How might RLHF temper or intensify different axes of social inequality?

  • How might RLHF alter the access different social groups have to AI technologies?

  • How might RLHF impact cultural and international relations?

  • How might RLHF enhance industries?

  • How might RLHF transform workforces and the organization of labor?

2.1 Combating Misinformation

As an effective alignment technique, RLHF has significant potential to assist in mitigating harmful content generation that results from large language models (LLMs) and improve information integrity (information integrity refers to the dependability and trustworthiness of information FOOTNOTE ). LLM deficiencies are well-documented and range from biased outputs to leaked private data to misinformation and adversarial attack GML1 . Current approaches to moderating LLMs are cumbersome, data-intensive, or overly complex GML1 . RLHF is a method that promises improved truthfulness and reduced toxicity of LLMs without significantly compromising performance or creating issues such as reduced representation of minorities in textual output GML1 . For instance, InstructGPT—trained with RLHF—exhibits enhanced ability versus GPT-3 to generate truthful and informative responses and follow unfamiliar instructions (Figure 1) GML1 . Combined with the development of limits on harmful output production, RLHF has great potential toward generation of positive content for assistive technologies, information sharing, and recommender/advice systems.


Figure 1: RLHF methods are significantly better than state-of-the-art LLMs at mitigating toxic, false statements and generating truthful, appropriate content, as indicated by the performance of InstructGPT versus GPT-3 GML1 ; OPENAI .

Even so, work remains to be done in order to improve the reliability of RLHF-based models. RLHF technologies like ChatGPT can still suffer from inappropriate and harmful outputs upon user request. The creators of ChatGPT and InstructGPT themselves described these technologies as perhaps being too obedient to user instruction GML1 . Such issues may be approached by combining RLHF with steerability methods or by modifying sampling procedures during training. It may also be useful to incorporate aspects of AI explainability, giving RLHF agents the ability to decline to comply with harmful requests and explain why they have done so GML2 ; morals . Overall, what renders an output harmful often depends on context, and this can complicate model design GML1 .

Even when acting in broad alignment with human values, RLHF can still be misused for misinformation, oppression, or perpetuation of societal prejudice if guardrails are not established ME1 . RLHF-based content generation—whether visual, textual, auditory, or other forms of media—has application in disinformation and automated trolling, which can compromise election integrity, erode public trust in media, and undermine the very fabric of organized governance in society. One can see that the power of RLHF in the disinformation context is threefold. First, human-machine teaming with RLHF models can accelerate message iteration, effectively increasing the productive power of disinformation campaigns CSET1 . Second, greater understanding of human behavior, preferences, and value systems could aid reconnaissance, leading to better mimicking of human activity, viewpoint manipulation, targeted messaging, conspiracy narrative generation, advancement of political narratives, and identification of exploitable social fissures CSET5 . Third, knowledge that AI holds such potential may further erode trust and accelerate descent into the cynicism that advances disinformation CSET5 .

“A people that no longer can believe anything cannot make up its mind. It is deprived not only of its capacity to act but also of its capacity to think and to judge. And with such a people you can then do what you please.” —Hannah Arendt HA

There are a number of steps we can take to counter misuse of RLHF. To start, disinformation campaigns bear an innate limitation on their scale and scope. What makes disinformation effective? Beyond content generation, a successful effort at disinformation is heavily dependent on administration. While automation may free more humans to work on these tasks, propagation of content requires financial and technical infrastructure CSET1 . Perhaps the best mitigation for misuse of RLHF is to address governance of such infrastructure CSET1 . Notably, those who control such infrastructure may wield disproportionate power over the direction of RLHF applications.

More broadly, methods to counter AI disinformation will likely be equally useful for RLHF. Cooperation and intelligence sharing between governments and industry parties will be key to developing early warning systems for disinformation campaigns, enabling rapid response, threat information sharing, and cross-platform defense CSET1 . Since openly released research always carries the potential for misuse, AI researchers must develop more formalized guidelines for guarding against misuse and recommending mitigations. There must be a process by which media outlets can report on disinformation without amplifying its effects. Finally, public resistance to ML-enabled disinformation must be boosted by improving media literacy and increasing the accuracy of public conceptions of AI CSET1 ; CSET5 ; ME6 . At present, “A good deal of fear and concern about uncontrollable AI is now being displayed in public discourse,” which has led to confusion regarding autonomy and the critical role of humans throughout the AI development and deployment pipeline ME6 . Researchers must be transparent and understandable in how their work is communicated to the public, and media outlets must avoid misleading or over-sensationalized journalism about AI. At the same time, the risk of manipulation can be addressed by increasing digital literacy to enhance personal autonomy and awareness among the general public.

2.2 Strengthening Value Alignment

A core goal of AI research is to produce systems that behave in ways consistent with human values and intentions. There exist varying definitions of AI alignment with human intentions. Askell et al. (2021) define a well-aligned AI to be one that is “helpful, honest, and harmless,” with the understanding that “these are subtle and ambiguous criteria, and the best AI behavior will involve a compromise between them” ME1 . Current ML approaches tend to suffer from misalignment between the objectives of resulting systems and human values RLHF9 . Even if we were to observe model behavior fully consistent with human preferences (outer alignment), it is difficult to guarantee true inner alignment without ulterior motives (in the context of AI safety, an inner alignment failure refers to any situation in which an AI agent optimizes for goals or objectives different from those we have asked of it INTERNALALIGNMENT ) RLHF2 ; INTERNALALIGNMENT .

RLHF is an important step forward in aligning AI systems with human values as it provides more nuanced guidance than traditional ML and RL, which struggle to capture the full extent of human preference GML1 . Specifically, RL algorithms learn the highest-reward path toward a stated objective, sometimes involving actions that lead to economic or physical harm ME2 . In contrast, there is evidence that RLHF trains models to act in accordance with both explicit (following instructions) and implicit (staying truthful, unbiased, and unhurtful) intentions GML1 . More broadly, RLHF is an important vantage point for exploring improved systems of value alignment. Insights gained through RLHF are likely transferable to other alignment methods RLHF7 . Even if RLHF does not completely resolve concerns over inner alignment, the failures it identifies and the knowledge it confers to reward and policy modeling are applicable to enhancing the safety, reliability, and trustworthiness of AI in social and collaborative situations RLHF7 .

As AI becomes democratized, how do we construct systems that are sensitive to a diversity of perspectives and value systems and properly aligned in such contexts? Can we design a unified values framework, or should value alignment be restricted to culture-specific contexts, much like law enforcement differences between nations? Could RLHF make inter- and intra-regional differences in conceptions of morality and ethics more salient? A significant factor in the net impact of RLHF models lies in whom such models are aligned to GML1 . Challenges exist in designing an alignment process that is fair, unbiased, and transparent while also bearing suitable accountability mechanisms GML1 . This is relevant in light of unresolved questions over how fundamentally conflicting feedback, values, and preferences should be reconciled, and the fact that there is no consensus across society on any single unified moral theory. Gabriel (2020) suggests pursuing a principle-based approach, whereby models are built to reflect fair principles endorsed by all despite variation in moral beliefs GABRIEL . Another path forward concerns training models that are aligned with general principles and preferences, with subsequent fine-tuning used to condition models to the preferences of specific groups. Additional issues are raised by the fact that developer choices can unintentionally impact the behavior of RLHF methods ME4 . It is perhaps more useful to develop RLHF under assumptions of moral uncertainty, which supposes for any decision that one’s motives are driven by several plausible ethical theories ME5 . Further consideration must be given to the broad question: should AI agents be able to exhibit the myriad moral and ethical convictions espoused by humans?

2.3 Mitigating Bias

With proper deployment, RLHF can reduce bias at multiple levels in the AI production pipeline. Broadly speaking, AI is affected at multiple levels of development by historical bias which affects data generation, representation bias which affects sampling and population studies, measurement bias due to inaccurate data stemming from structural discrimination against groups, aggregation bias due to over-reliance on one-size-fits-all models, learning and evaluation bias during model training, and deployment bias due to disparity between intended and observed application CS1 . Preliminary analysis of RLHF results suggests it can be leveraged to mitigate long-standing effects of historical, representation, and measurement bias by balancing human feedback with representation and expertise across a diverse range of human annotators morals ; ME4 . RLHF is not immune to bias or misuse, but it leverages human feedback to counter algorithmic bias more directly and efficiently than existing approaches GML1 ; morals . In this light, RLHF is an important tool not only for its potential to transform AI capabilities, but also for combating systemic inequality perpetuated by algorithmic development.

2.4 Improving Equitable Access and Privacy in AI

By reducing computational cost, RLHF can open the door to democratization of AI technologies across all levels of society, regardless of development status. In particular, RLHF yields smaller models requiring less compute to achieve state-of-the-art performance GML1 , which is critical for building practical AI technologies that are deployable across the world and especially in lower-income areas and developing nations. The reduced need for training data can mitigate concerns around data scraping, privacy, security, and surveillance, all of which are issues involved in traditional ML GML1 . Data collection often disproportionately impacts vulnerable groups in negative ways: data may be misused by technology companies and governments to, for example, track immigrants, and instances of surveillance used to solidify systemic discrimination against subpopulations are well-documented ME2 . RLHF thus makes it easier to achieve better outcomes without significantly compromising privacy.

2.5 Bridging Cultures

RLHF has potential to transform how we reconcile cross-cultural perspectives and approach peaceful dialogue. Cross-cultural feedback is critical to ensuring technology is deployable in contexts beyond domestic production. By soliciting human feedback that encompasses a diversity of viewpoints and cultural norms, RLHF technologies can be culturally aware and usable beyond narrow, culture-specific settings. Even minor cultural awareness can facilitate communication in a number of contexts. A salient example is in education. Mitigating stress associated with feedback interactions in learning is critical to supporting student education BIAS1 . Yet studies have shown that cross-cultural feedback conversations between teachers and students can compound stress and lead to reduced learning (e.g., via decreased ability to ask questions, “absorb information, and develop professional and mentoring relationships”), worsened long-term education outcomes, and increased cognitive load for teachers if approached incorrectly BIAS1 . This was further exacerbated for interactions between teachers of well-represented identities and students from underrepresented groups BIAS1 . In this context, RLHF technologies can help overcome such difficulties, whether by moderating conversation or suggesting appropriate ways to approach cross-cultural communication. This benefit extends beyond education into sectors such as customer service and entertainment.

2.6 Boosting Industries

By allowing AI agents to learn from human expertise, RLHF can facilitate development of more adaptable AI systems for use in various industries RLHF8 . Potential applications of RLHF include enhanced resource management, customer service, online education, eldercare, and clinical decision support EXAMPLES . Adaptive recommendations could better account for personal and cultural preferences and human intentions; value-aligned technologies could better accommodate individual preferences regarding communication, mobility, and living habits; human-guided diagnostics could improve clarity in decision-making. RLHF can better foster trust with users in order to boost business outcomes across industries and accelerate technology adoption to improve efficiency and economic output.

Concurrently, RLHF can possibly heighten big-tech’s advantage and hasten progress towards dangerous AI capabilities. Notable RLHF advances have been achieved by well-financed research laboratories and big-tech companies such as OpenAI and DeepMind, which can afford to spend enormous amounts of money on creating large datasets for RLHF algorithms. Smaller organizations lack access to such resources RLHF5 . A related concern regards who should have access to powerful RLHF models produced by organizations. If RLHF models are open-sourced, it may be difficult to check harmful applications and enforce regulation RLHF5 . Yet restricting access via closed-source models could limit access to select groups, reducing equity. Likewise concerning is the use of RLHF for weapons development—e.g., better missile systems and more lethal drones. This is a concern for most AI technologies, and global regulatory action must be taken to mitigate possible harm. Lastly, it must be noted that RLHF methods are still susceptible to generic ML vulnerabilities such as adversarial attack, which may affect their ability to enhance industry applications CSET3 . Awareness of all of these possibilities is critical as RLHF continues to develop.

2.7 Transforming Work

RLHF will impact the degree to which different jobs are susceptible to automation. Although many uses of RLHF are still nascent, these developing applications can provide insight into key implications of RLHF for the workforce. With better models that can be more efficiently used, RLHF advances the narrative that RL-based technologies will quickly close the gap between automation and the dexterity and mobility required for low-wage jobs ME2 . This holds especially for domains in which robotic manipulation and navigation are becoming more dominant. Even so, RLHF is unlikely to lead to full automation of jobs ME2 . Importantly, RLHF methods can automate tedious or high-risk portions of manual labor ME2 , especially for tasks which are dangerous or difficult for humans to complete even if they have the correct intuitions RLHF2 . Humans may guide AI systems in such contexts by providing feedback on how to best complete such tasks RLHF2 . This can enhance workforce safety and morale and does not fully remove humans from the equation, instead shifting human expertise to different areas of production.

RLHF may further affect the spatial distribution of jobs in the workforce. Resulting automation may move jobs, depending on factors such as required expertise and closeness to service providers ME7 . Job relocation in such contexts is not necessarily constrained by national boundaries, as exemplified by techniques involving offshoring of automated operations, which, while cost effective, may introduce regulatory challenges, reduce domestic jobs, and impact transparency ME7 . Future regulations on AI technologies will likely impact the extent to which such impacts are realized.

3 Further Considerations

What role should AI play in our daily lives? Critical to answering this question is the related query: “is AI augmenting human decision, informing it, or supplanting it?” CSET7 . RLHF simplifies this evaluation. While most AI applications embody a variation of the centaur’s dilemma—the fundamental opposition between human control and optimized AI functionality CSET7 —RLHF directly embeds human feedback as an informative source, leading to greater clarity regarding the locus of human control while simultaneously enhancing functional results. This suggests RLHF is a significant step toward resolving the dilemma, allowing us to reap the full benefits of AI’s capacity and inform rather than undermine human decision-making. Relatedly, many of the positive impacts of RLHF are dependent on the ability to arrive at well-designed human feedback systems. There will inevitably be new ways invented for humans to meaningfully provide feedback to robots and AI agents, as well as new insights as to how human behavior inherently and subtly reveals information signals at any given point RLHF4 . How will agents extract and make sense of various sources of information? Choose between multiple forms of feedback (for example: comparisons, demonstrations, corrections, improvement, proxy rewards, punishments, credit assignment, linguistic instructions RLHF4 )? Distinguish purposeful from meaningless feedback? These considerations will become increasingly important as RLHF advances. Ultimately, the potential for RLHF to positively impact society should not be ignored, and the dependence of its benefits on well-designed feedback systems is a further call for investment into RLHF.

4 Concluding Remarks

In this paper, we analyzed the social benefits and harms of RLHF, which is presently one of the foremost and most promising AI methods. Specifically, we described how RLHF may net positively impact areas of misinformation, AI value-alignment, bias, equitable access, cross-cultural dialogue, industry, and workforce. This analysis is timely and necessary, as progress on RLHF can impact all levels and sectors of society. Overall, the application of RLHF is important from both safety and capability perspectives. The benefits RLHF projects to provide over the status quo suggest we will see more resources invested in its development. As RLHF raises concerns that echo those of existing AI technologies for governance, industry, safety, ethics, and the future of global power relations, it will be important for all to be aware and intentional in the adoption of RLHF.

Acknowledgments

The author would like to recognize and thank Dr. Marion Boulicault for valuable discussion and feedback in the preparation of this paper. This work was funded by the Social and Ethical Responsibilities of Computing at the MIT Schwarzman College of Computing.

References

  • [1] Anis Najar and Mohamed Chetouani. Reinforcement learning with human advice: A survey. Frontiers in Robotics and AI, 8, 2021.
  • [2] Alan M Turing. Computing machinery and intelligence. Springer, 2009.
  • [3] John McCarthy. Programs with common sense. 1960.
  • [4] Frederick Hayes-Roth et al. Knowledge Acquisition, Knowledge Programming, and Knowledge Refinement. ERIC, 1980.
  • [5] Frederick Hayes-Roth, Philip Klahr, and David J Mostow. Advice-taking and knowledge refinement: An iterative view of skill acquisition. Cognitive skills and their acquisition, pages 231–253, 1981.
  • [6] Jess Whittlestone, Kai Arulkumaran, and Matthew Crosby. The societal implications of deep reinforcement learning. Journal of Artificial Intelligence Research, 70:1003–1030, 2021.
  • [7] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  • [8] OpenAI. Aligning language models to follow instructions, 2022.
  • [9] Sthanikam Santhosh. Reinforcement learning from human feedback (RLHF) - ChatGPT, 2023.
  • [10] Nazneen Rajani. Illustrating reinforcement learning from human feedback (RLHF), 2023.
  • [11] Edwin Chen. Introduction to reinforcement learning with human feedback, 2023.
  • [12] Ansh Radhakrishnan. RLHF, 2022.
  • [13] Rob Toews. The next generation of large language models, 2023.
  • [14] Helen Toner and Ashwin Acharya. Exploring clusters of research in three areas of AI safety. Center for Security and Emerging Technology, February 2022. This work is licensed by the Center for Security and Emerging Technology under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.
  • [15] Social impact evaluation guide: Business case development framework, release 3, June 2021. This work is licensed by the State of Queensland Department of State Development, Infrastructure, Local Government and Planning under a Creative Commons Attribution (CC BY) 4.0 Australia licence. To view a copy of this license, visit creativecommons.org.au.
  • [16] International Association for Impact Assessment. Social impact assessment. https://www.iaia.org/wiki-details.php?ID=23, 2023.
  • [17] E. Geisler, P. Prabhaker, and M. Nayar. Information integrity: an emerging field and the state of knowledge. In PICMET ’03: Portland International Conference on Management of Engineering and Technology Technology Management for Reshaping the World, 2003., pages 217–221, 2003.
  • [18] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
  • [19] Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I. Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, Dawn Drain, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jackson Kernion, Jamie Kerr, Jared Mueller, Joshua Landau, Kamal Ndousse, Karina Nguyen, Liane Lovitt, Michael Sellitto, Nelson Elhage, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robert Lasenby, Robin Larson, Sam Ringer, Sandipan Kundu, Saurav Kadavath, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Christopher Olah, Jack Clark, Samuel R. Bowman, and Jared Kaplan. The capacity for moral self-correction in large language models, 2023.
  • [20] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  • [21] Ben Buchanan, Micah Musser, Andrew Lohn, and Katerina Sedova. Truth, lies, and automation. Center for Security and Emerging Technology, May 2021. This work is licensed by the Center for Security and Emerging Technology under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.
  • [22] Katerina Sedova, Christine McNeill, Aurora Johnson, Aditi Joshi, and Ido Wulkan. AI and the future of disinformation campaigns, part 2: A threat model. Center for Security and Emerging Technology, December 2021. This work is licensed by the Center for Security and Emerging Technology under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.
  • [23] Hannah Arendt. Hannah arendt: From an interview, 1978.
  • [24] Anastasia Chan. GPT-3 and InstructGPT: Technological dystopianism, utopianism, and “contextual” perspectives in AI ethics and industry. AI and Ethics, pages 1–12, 2022.
  • [25] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
  • [26] Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents, 2021.
  • [27] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • [28] Iason Gabriel. Artificial intelligence, values and alignment. CoRR, abs/2001.09768, 2020.
  • [29] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.
  • [30] Adrien Ecoffet and Joel Lehman. Reinforcement learning under moral uncertainty. In International conference on machine learning, pages 2926–2936. PMLR, 2021.
  • [31] Harini Suresh and John Guttag. Understanding Potential Sources of Harm throughout the Machine Learning Life Cycle. MIT Case Studies in Social and Ethical Responsibilities of Computing, (Summer 2021), aug 10 2021. https://mit-serc.pubpub.org/pub/potential-sources-of-harm-throughout-the-machine-learning-life-cycle.
  • [32] Anne D Gordon. Better than our biases: Using psychological research to inform our approach to inclusive, effective feedback. Clinical L. Rev., 27:195, 2020.
  • [33] Oluwafemi Smith. Reinforcement learning from human feedback, 2023.
  • [34] Ben Dickson. What is reinforcement learning from human feedback (RLHF)?, 2023.
  • [35] Andrew J. Lohn and Wyatt Hoffman. Securing AI: How traditional vulnerability disclosure must adapt. Center for Security and Emerging Technology, March 2022. This work is licensed by the Center for Security and Emerging Technology under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.
  • [36] David Bissell, Thomas Birtchnell, Anthony Elliott, and Eric L Hsu. Autonomous automobilities: The social impacts of driverless vehicles. Current Sociology, 68(1):116–134, 2020.
  • [37] James E. Baker, Laurie N. Hobart, and Matthew G. Mittelsteadt. AI for judges: A framework. Center for Security and Emerging Technology, December 2021. This work is licensed by the Center for Security and Emerging Technology under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.
  • [38] Hong Jun Jeon, Smitha Milli, and Anca Dragan. Reward-rational (implicit) choice: A unifying formalism for reward learning. Advances in Neural Information Processing Systems, 33:4415–4426, 2020.
  • [39] Sally Kah and Temidayo Akenroye. Evaluation of social impact measurement tools and techniques: a systematic review of the literature. Social Enterprise Journal, 16(4):381–402, 2020.
  • [40] Aistė Balžekienė, Eglė Butkevičienė, and Audronė Telešienė. Methodological framework for analyzing social impact of technological innovations. Social Sciences (1392-0758), 59(1), 2008.
  • [41] Emelia S. Probasco. A common language for responsible AI: Evolving and defining DoD terms for implementation. Center for Security and Emerging Technology, October 2022. This work is licensed by the Center for Security and Emerging Technology under a Creative Commons Attribution-NonCommercial 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc/4.0/.
  • [42] Andrea Lockerd Thomaz, Cynthia Breazeal, et al. Reinforcement learning with human teachers: Evidence of feedback and guidance with implications for learning performance. In AAAI, volume 6, pages 1000–1005. Boston, MA, 2006.
  • [43] Josh Abramson, Arun Ahuja, Federico Carnevale, Petko Georgiev, Alex Goldin, Alden Hung, Jessica Landon, Jirka Lhotka, Timothy Lillicrap, Alistair Muldal, et al. Improving multimodal interactive agents with reinforcement learning from human feedback. arXiv preprint arXiv:2211.11602, 2022.
  • [44] Oliver Daniels-Koch and Rachel Freedman. The expertise problem: Learning from specialized feedback. arXiv preprint arXiv:2211.06519, 2022.
  • [45] Ayush Thakur. An introduction to training LLMs using reinforcement learning from human feedback (RLHF), 2023.
  • [46] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.