
Managing extreme AI risks amid rapid progress

Yoshua Bengio Mila - Quebec AI Institute, Université de Montréal
Geoffrey Hinton University of Toronto, Vector Institute
Andrew Yao Tsinghua University
Dawn Song UC Berkeley
Pieter Abbeel UC Berkeley
Trevor Darrell UC Berkeley
Yuval Noah Harari The Hebrew University of Jerusalem
Ya-Qin Zhang Tsinghua University
Lan Xue Institute for AI International Governance, Tsinghua University
Shai Shalev-Shwartz The Hebrew University of Jerusalem
Gillian Hadfield University of Toronto, Schwartz Reisman Inst. for Technology and Society, Vector Inst.
Jeff Clune University of British Columbia, Vector Institute
Tegan Maharaj University of Toronto, Schwartz Reisman Inst. for Technology and Society, Vector Inst.
Frank Hutter ELLIS Institute Tübingen, University of Freiburg
Atılım Güneş Baydin University of Oxford
Sheila McIlraith University of Toronto, Schwartz Reisman Inst. for Technology and Society, Vector Inst.
Qiqi Gao East China University of Political Science and Law
Ashwin Acharya RAND Corporation
David Krueger University of Cambridge
Anca Dragan UC Berkeley
Philip Torr University of Oxford
Stuart Russell UC Berkeley
Daniel Kahneman School of Public and International Affairs, Princeton University
Jan Brauner* University of Oxford, RAND Corporation
Sören Mindermann* University of Oxford, Mila - Quebec AI Institute, Université de Montréal
Abstract

Artificial Intelligence (AI) is progressing rapidly, and companies are shifting their focus to developing generalist AI systems that can autonomously act and pursue goals. Increases in capabilities and autonomy may soon massively amplify AI’s impact, with risks that include large-scale social harms, malicious uses, and an irreversible loss of human control over autonomous AI systems. Although researchers have warned of extreme risks from AI [1], there is a lack of consensus about how exactly such risks arise, and how to manage them. Society’s response, despite promising first steps, is incommensurate with the possibility of rapid, transformative progress that is expected by many experts. AI safety research is lagging. Present governance initiatives lack the mechanisms and institutions to prevent misuse and recklessness, and barely address autonomous systems. In this short consensus paper, we describe extreme risks from upcoming, advanced AI systems. Drawing on lessons learned from other safety-critical technologies, we then outline a comprehensive plan combining technical research and development (R&D) with proactive, adaptive governance mechanisms for a more commensurate preparation.

Rapid progress

Current deep learning systems still lack important capabilities and we do not know how long it will take to develop them. However, companies are engaged in a race to create generalist AI systems that match or exceed human abilities in most cognitive work [2, 3]. They are rapidly deploying more resources and developing new techniques to increase AI capabilities, with investment in training state-of-the-art models tripling annually [4].

There is much room for further advances, as tech companies have the cash reserves needed to scale the latest training runs by multiples of 100 to 1000 [5]. Hardware and algorithms will also improve: AI computing chips have been getting 1.4 times more cost-effective, and AI training algorithms 2.5 times more efficient, each year [6, 7]. Progress in AI also enables faster AI progress [8]: AI assistants are increasingly used to automate programming [9], data collection [10, 11], and chip design [12].
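To make the compounding concrete, a rough back-of-the-envelope sketch combining only the growth figures cited above (treating the three factors as independent and multiplicative is an illustrative assumption, not a forecast):

```python
# Illustrative arithmetic using the growth figures cited above: training investment
# roughly tripling per year [4], chips ~1.4x more cost-effective per year [6], and
# training algorithms ~2.5x more efficient per year [7]. Multiplying these gives a
# rough sense of annual growth in "effective" training compute.

investment_growth = 3.0   # annual growth in training spend
hardware_growth = 1.4     # annual gain in compute per dollar
algorithm_growth = 2.5    # annual gain in training efficiency

effective_growth = investment_growth * hardware_growth * algorithm_growth
print(f"Approximate annual growth in effective training compute: {effective_growth:.1f}x")

# Compounded over a few years, modest annual factors multiply quickly.
for years in (1, 3, 5):
    print(f"After {years} year(s): ~{effective_growth ** years:,.0f}x")
```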

There is no fundamental reason for AI progress to slow or halt at human-level abilities. Indeed, AI has already surpassed human abilities in narrow domains like playing strategy games and predicting how proteins fold [13, 14, 15]. Compared to humans, AI systems can act faster, absorb more knowledge, and communicate at higher bandwidth. Additionally, they can be scaled to use immense computational resources and can be replicated by the millions.

We don’t know for certain how the future of AI will unfold. However, we must take seriously the possibility that highly powerful generalist AI systems—outperforming human abilities across many critical domains—will be developed within the current decade or the next. What happens then?

More capable AI systems have larger impacts. Especially as AI matches and surpasses human workers in capabilities and cost-effectiveness, we expect a massive increase in AI deployment, opportunities, and risks. If managed carefully and distributed fairly, AI could help humanity cure diseases, elevate living standards, and protect ecosystems. The opportunities are immense.

But alongside advanced AI capabilities come large-scale risks that we are not on track to handle well. Humanity is pouring vast resources into making AI systems more powerful but far less into their safety and mitigating their harms. Only an estimated 1-3% of AI publications are on safety [16, 17]. For AI to be a boon, we must reorient; pushing AI capabilities alone is not enough.

We are already behind schedule for this reorientation. The scale of the risks means that we need to be proactive, as the costs of being unprepared far outweigh those of premature preparation. We must anticipate the amplification of ongoing harms, as well as novel risks, and prepare for the largest risks well before they materialize. Climate change has taken decades to be acknowledged and confronted; for AI, decades could be too long.

Societal-scale risks

If not carefully designed and deployed, increasingly advanced AI systems threaten to amplify social injustice, erode social stability, and weaken our shared understanding of reality that is foundational to society. They could also enable large-scale criminal or terrorist activities. Especially in the hands of a few powerful actors, AI could cement or exacerbate global inequities, or facilitate automated warfare, customized mass manipulation, and pervasive surveillance [18, 19, 20, 21, 22, 23].

Many of these risks could soon be amplified, and new risks created, as companies are working to develop autonomous AI: systems that can pursue goals and act in the world. While current AI systems have limited autonomy, work is underway to change this [24]. For example, the non-autonomous GPT-4 model was quickly adapted to browse the web, design and execute chemistry experiments, and utilize software tools, including other AI models [25, 26, 27, 28].

If we build highly advanced autonomous AI, we risk creating systems that pursue undesirable goals. Malicious actors could deliberately embed undesirable goals. Without R&D breakthroughs (see below), even well-meaning developers may inadvertently create AI systems pursuing unintended goals: The reward signal used to train AI systems usually fails to fully capture the intended objectives, leading to AI systems that pursue the literal specification rather than the intended outcome [29]. Additionally, the training data never captures all relevant situations, leading to AI systems that pursue undesirable goals in novel situations encountered after training.
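A toy illustration of the reward-misspecification point above (a minimal sketch; the candidate answers and scoring functions are invented for illustration and do not represent any real training setup):

```python
# Toy illustration of reward misspecification: an optimizer that maximizes a proxy
# reward (how confident an answer sounds) rather than the intended objective
# (whether the answer is actually correct) ends up selecting a confidently wrong answer.

candidates = [
    {"answer": "I am not sure, but possibly 42.", "correct": True,  "sounds_confident": 0.3},
    {"answer": "It is definitely 37.",            "correct": False, "sounds_confident": 0.9},
]

def proxy_reward(c):
    # What the training signal actually measures (a stand-in for approval of the output).
    return c["sounds_confident"]

def intended_objective(c):
    # What we actually wanted: correctness.
    return 1.0 if c["correct"] else 0.0

chosen = max(candidates, key=proxy_reward)
print("Chosen by proxy reward:", chosen["answer"])
print("Intended objective score of chosen answer:", intended_objective(chosen))
```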

Once autonomous AI systems pursue undesirable goals, we may be unable to keep them in check. Control of software is an old and unsolved problem: computer worms have long been able to proliferate and avoid detection [30]. However, AI is making progress in critical domains such as hacking, social manipulation, and strategic planning [24, 31], and may soon pose unprecedented control challenges.

To advance undesirable goals, future autonomous AI systems could use undesirable strategies—learned from humans or developed independently—as a means to an end [32, 33, 34, 35]. AI systems could gain human trust, acquire financial resources, influence key decision-makers, and form coalitions with human actors and other AI systems. To avoid human intervention [35], they might copy their algorithms across global server networks [36], as computer worms do. AI assistants are already co-writing a substantial share of computer code worldwide [37]; future AI systems could insert and then exploit security vulnerabilities to control the computer systems behind our communication, media, banking, supply-chains, militaries, and governments. In open conflict, AI systems could autonomously deploy a variety of weapons, including biological ones. AI systems having access to such technology would merely continue existing trends to automate military activity and biological research. If AI systems pursued such strategies with sufficient skill, it would be difficult for humans to intervene.

Finally, AI systems will not need to plot for influence if it is freely handed over. As autonomous AI systems increasingly become faster and more cost-effective than human workers, a dilemma emerges. Companies, governments, and militaries might be forced to deploy AI systems widely and cut back on expensive human verification of AI decisions, or risk being outcompeted [19, 38]. As a result, autonomous AI systems could increasingly assume critical societal roles.

Without sufficient caution, we may irreversibly lose control of autonomous AI systems, rendering human intervention ineffective. Large-scale cybercrime, social manipulation, and other harms could escalate rapidly. This unchecked AI advancement could culminate in a large-scale loss of life and the biosphere, and the marginalization or extinction of humanity.

Harms such as misinformation and discrimination from algorithms are already evident today; other harms show signs of emerging. It is vital to both address ongoing harms and anticipate emerging risks. This is not a question of either/or. Present and emerging risks often share similar mechanisms, patterns, and solutions [39]; investing in governance frameworks and AI safety will bear fruit on multiple fronts [40].

Reorient Technical R&D

There are many open technical challenges in ensuring the safety and ethical use of generalist, autonomous AI systems. Unlike advancing AI capabilities, these challenges cannot be addressed by simply using more computing power to train bigger models. They are unlikely to resolve automatically as AI systems get more capable [41, 42, 43, 44, 33, 45], and require dedicated research and engineering efforts. In some cases, leaps of progress may be needed; we thus do not know if technical work can fundamentally solve these challenges in time. However, there has been comparatively little work on many of these challenges. More R&D may thus make progress and reduce risks.

A first set of R&D areas needs breakthroughs to enable reliably safe AI. Without this progress, developers risk either creating unsafe systems or falling behind competitors who are willing to take more risks. If ensuring safety remains too difficult, extreme governance measures would be needed to prevent corner-cutting driven by competition and overconfidence. These R&D challenges include:

Oversight and honesty:

More capable AI systems can better exploit weaknesses in technical oversight and testing [42, 46, 47]—for example, by producing false but compelling output [48, 49, 43].

Robustness:

AI systems behave unpredictably in new situations. While some aspects of robustness improve with model scale [50], others do not improve, or even get worse [51, 52, 44, 53].

Interpretability and transparency:

AI decision-making is opaque, with larger, more capable models being harder to interpret. So far, we can only test large models via trial and error. We need to learn to understand their inner workings [54].

Inclusive AI development:

AI advancement will need methods to mitigate biases and integrate the values of the many populations it will affect [20, 55].

Addressing emerging challenges:

Future AI systems may exhibit failure modes we have so far seen only in theory or lab experiments, such as AI systems taking control over the training reward-provision channels or exploiting weaknesses in our safety objectives and shutdown mechanisms to advance a particular goal [35, 56, 57, 58].

A second set of R&D challenges needs progress to enable effective, risk-adjusted governance, or to reduce harms when safety and governance fail:

Evaluation for dangerous capabilities:

As AI developers scale their systems, unforeseen capabilities appear spontaneously, without explicit programming [59]. They are often only discovered after deployment [60, 61, 62]. We need rigorous methods to elicit and assess AI capabilities, and to predict them before training. This includes both generic capabilities to achieve ambitious goals in the world (e.g., long-term planning and execution), as well as specific dangerous capabilities based on threat models (e.g. social manipulation or hacking). Current evaluations of frontier AI models for dangerous capabilities [63]—key to various AI policy frameworks—are limited to spot-checks and attempted demonstrations in specific settings [36, 64, 65]. These evaluations can sometimes demonstrate dangerous capabilities but cannot reliably rule them out: AI systems that lacked certain capabilities in the tests may well demonstrate them in slightly different settings or with post-training enhancements. Decisions that depend on AI systems not crossing any red lines thus need large safety margins. Improved evaluation tools decrease the chance of missing dangerous capabilities, allowing for smaller margins.
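The logic of "spot-checks cannot rule capabilities out, so decisions need safety margins" can be sketched as follows (a hypothetical illustration; the threshold values and the elicitation-gap factor are invented for illustration, not drawn from any evaluation framework):

```python
# Hypothetical sketch of why spot-check evaluations need safety margins.
# An evaluation measures a capability score under one elicitation setup, but other
# settings or post-training enhancements may reveal more capability. A decision rule
# should therefore compare the measured score, inflated by a margin, against the
# red-line threshold rather than using the raw score alone.

def crosses_red_line(measured_score: float,
                     red_line: float,
                     elicitation_gap: float = 2.0) -> bool:
    """Return True if the capability *might* cross the red line.

    measured_score: best score elicited during spot-check evaluations.
    red_line: score above which the capability is considered dangerous.
    elicitation_gap: assumed factor by which better elicitation or
        post-training enhancements could raise the score (illustrative).
    """
    return measured_score * elicitation_gap >= red_line

# Example: a raw score of 0.3 looks safely below a red line of 0.5,
# but with a 2x elicitation gap the system might well cross it.
print(crosses_red_line(measured_score=0.3, red_line=0.5))   # True -> treat as potentially dangerous
print(crosses_red_line(measured_score=0.2, red_line=0.5))   # False -> within the margin
```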

Evaluating AI alignment:

If AI progress continues, AI systems will eventually possess highly dangerous capabilities. Before training and deploying such systems, we need methods to assess their propensity to use these capabilities. Purely behavioral evaluations may fail for advanced AI systems: like humans, they might behave differently under evaluation, faking alignment [57, 56, 58].

Risk assessment:

We must learn to assess not just dangerous capabilities, but risk in a societal context, with complex interactions and vulnerabilities. Rigorous risk assessment for frontier AI systems remains an open challenge due to their broad capabilities and pervasive deployment across diverse application areas [66].

Resilience:

Inevitably, some will misuse or act recklessly with AI. We need tools to detect and defend against AI-enabled threats such as large-scale influence operations, biological risks, and cyber-attacks. However, as AI systems become more capable, they will eventually be able to circumvent human-made defenses. To enable more powerful AI-based defenses, we first need to learn how to make AI systems safe and aligned.

Given the stakes, we call on major tech companies and public funders to allocate at least one-third of their AI R&D budget—comparable to their funding for AI capabilities—towards addressing the above R&D challenges and ensuring AI safety and ethical use [44]. Beyond traditional research grants, government support could include prizes, advance market commitments [67], and other incentives. Addressing these challenges, with an eye toward powerful future systems, must become central to our field.

Governance measures

We urgently need national institutions and international governance to enforce standards preventing recklessness and misuse. Many areas of technology, from pharmaceuticals to financial systems and nuclear energy, show that society requires and effectively uses government oversight to reduce risks. However, governance frameworks for AI are far less developed, lagging behind rapid technological progress. We can take inspiration from the governance of other safety-critical technologies, while keeping the uniqueness of advanced AI in mind: that it far outstrips other technologies in its potential to act and develop ideas autonomously, progress explosively, behave adversarially, and cause irreversible damage.

Governments worldwide have taken positive steps on frontier AI, with key players including China, the US, the EU, and the UK engaging in discussions [68, 69] and introducing initial guidelines or regulations [70, 71, 72, 73]. Despite their limitations—often voluntary adherence, limited geographic scope, and exclusion of high-risk areas like military and R&D-stage systems—they are important initial steps towards, amongst others, developer accountability, third-party audits, and industry standards.

Yet, these governance plans fall critically short in view of the rapid progress in AI capabilities. We need governance measures that prepare us for sudden AI breakthroughs, while being politically feasible despite disagreement and uncertainty about AI timelines. The key is policies that automatically trigger when AI hits certain capability milestones. If AI advances rapidly, strict requirements automatically take effect, but if progress slows, the requirements relax accordingly. Rapid, unpredictable progress also means that risk reduction efforts must be proactive—identifying risks from next-generation systems and requiring developers to address them before taking high-risk actions. We need fast-acting, tech-savvy institutions for AI oversight, mandatory and much more rigorous risk assessments with enforceable consequences (including assessments that put the burden of proof on AI developers), and mitigation standards commensurate to powerful autonomous AI.

Without these, companies, militaries, and governments may seek a competitive edge by pushing AI capabilities to new heights while cutting corners on safety, or by delegating key societal roles to autonomous AI systems with insufficient human oversight; reaping the rewards of AI development while leaving society to deal with the consequences.

Institutions to govern the rapidly moving frontier of AI.

To keep up with rapid progress and avoid quickly outdated, inflexible laws [74, 75, 76], national institutions need strong technical expertise and the authority to act swiftly. To facilitate technically demanding risk assessments and mitigations, they will require far greater funding and talent than they are due to receive under almost any current policy plan. To address international race dynamics, they need the affordance to facilitate international agreements and partnerships [77, 78]. Institutions should protect low-risk use and low-risk academic research, by avoiding undue bureaucratic hurdles for small, predictable AI models. The most pressing scrutiny should be on AI systems at the frontier: the few most powerful systems – trained on billion-dollar supercomputers – which will have the most hazardous and unpredictable capabilities [79, 80].

Government insight.

To identify risks, governments urgently need comprehensive insight into AI development. Regulators should mandate whistleblower protections, incident reporting, registration of key information on frontier AI systems and their data sets throughout their life cycle, and monitoring of model development and supercomputer usage [81]. Recent policy developments should not stop at requiring that companies report the results of voluntary or underspecified model evaluations shortly before deployment [70, 72]. Regulators can and should require that frontier AI developers grant external auditors on-site, comprehensive (“white-box”), and fine-tuning access from the start of model development [82]. This is needed to identify dangerous model capabilities such as autonomous self-replication, large-scale persuasion, breaking into computer systems, developing (autonomous) weapons, or making pandemic pathogens widely accessible [83, 63, 84, 64, 65, 36].

Safety cases.

Despite evaluations, we cannot consider coming powerful frontier AI systems “safe unless proven unsafe”. With current testing methodologies, issues can easily be missed. Additionally, it is unclear if governments can quickly build the immense expertise needed for reliable technical evaluations of AI capabilities and societal-scale risks. Given this, developers of frontier AI should carry the burden of proof to demonstrate that their plans keep risks within acceptable limits. Doing so, they would follow best practices for risk management from industries such as aviation [85], medical devices [86], and defense software [87], where companies make safety cases [88, 89, 90, 91, 92]: structured arguments with falsifiable claims supported by evidence, which identify potential hazards, describe mitigations, show that systems will not cross certain red lines, and model possible outcomes to assess risk. Safety cases could leverage developers’ in-depth experience with their own systems. Safety cases are politically viable even when people disagree on how advanced AI will become, since it is easier to demonstrate a system is safe when its capabilities are limited. Governments are not passive recipients of safety cases: they set risk thresholds, codify best practices, employ experts and third-party auditors to assess safety cases and conduct independent model evaluations, and hold developers liable if their safety claims are later falsified.
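One way to picture the structure of a safety case (a minimal sketch; the fields and example content are hypothetical and far simpler than real safety-case frameworks [88, 89]):

```python
# Minimal sketch of a safety case as structured data: falsifiable claims backed by
# evidence, identified hazards with mitigations, and red lines the system must not cross.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    statement: str        # falsifiable claim, e.g. about a capability limit
    evidence: List[str]   # evaluations, audits, or analyses supporting it

@dataclass
class Hazard:
    description: str
    mitigations: List[str]

@dataclass
class SafetyCase:
    system: str
    red_lines: List[str]
    claims: List[Claim] = field(default_factory=list)
    hazards: List[Hazard] = field(default_factory=list)

    def unsupported_claims(self) -> List[str]:
        # Claims without evidence cannot carry the case and should be flagged.
        return [c.statement for c in self.claims if not c.evidence]

case = SafetyCase(
    system="frontier-model-x",   # hypothetical system name
    red_lines=["autonomous self-replication", "meaningful uplift for bioweapon development"],
    claims=[Claim("Model cannot self-replicate across servers",
                  ["autonomy evaluation report"])],
    hazards=[Hazard("Model weights are stolen and fine-tuned for misuse",
                    ["state-level information security", "access controls"])],
)
print(case.unsupported_claims())   # [] -> every claim is backed by some evidence
```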

Mitigation.

To keep AI risks within acceptable limits, we need governance mechanisms matched to the magnitude of the risks [79, 93, 94, 95]. Regulators should clarify legal responsibilities arising from existing liability frameworks and hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented—including harms that foreseeably arise from deploying powerful AI systems whose behavior they cannot predict. Liability, together with consequential evaluations and safety cases, can prevent harm and create much-needed incentives to invest in safety.

Commensurate mitigations are needed for exceptionally capable future AI systems, like autonomous systems that could circumvent human control. Governments must be prepared to license their development, restrict their autonomy in key societal roles, halt their development and deployment in response to worrying capabilities, mandate access controls, and require information security measures robust to state-level hackers, until adequate protections are ready. Governments should build these capacities now.

To bridge the time until regulations are complete, major AI companies should promptly lay out “if-then” commitments: specific safety measures they will take if specific red-line capabilities [63] are found in their AI systems. These commitments should be detailed and independently scrutinized. Regulators should encourage a race-to-the-top among companies by using the best-in-class commitments, together with other inputs, to inform standards that apply to all players.
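Such "if-then" commitments can be pictured as a simple mapping from red-line capability findings to pre-committed safety measures (a hypothetical sketch; the capability names and measures below are invented for illustration and do not reflect any company's actual commitments):

```python
# Hypothetical sketch of "if-then" commitments: a mapping from red-line capability
# findings to pre-committed safety measures that take effect automatically.

IF_THEN_COMMITMENTS = {
    "autonomous self-replication": [
        "pause further scaling",
        "restrict deployment to monitored environments",
    ],
    "meaningful bioweapon uplift": [
        "remove the capability or halt deployment",
        "notify regulators and independent auditors",
    ],
}

def triggered_measures(findings: set) -> list:
    """Return all committed measures triggered by the evaluation findings."""
    measures = []
    for capability, actions in IF_THEN_COMMITMENTS.items():
        if capability in findings:
            measures.extend(actions)
    return measures

# Example: an evaluation finds evidence of self-replication ability.
print(triggered_measures({"autonomous self-replication"}))
```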

To steer AI toward positive outcomes and away from catastrophe, we need to reorient. There is a responsible path, if we have the wisdom to take it.

Acknowledgments

Yoshua Bengio, Jeff Clune, Gillian Hadfield, and Sheila McIlraith hold the position of CIFAR AI Chair. Jeff Clune is a Senior Research Advisor to Google DeepMind. Ashwin Acharya reports acting as an advisor to the Civic AI Security Program. Ashwin Acharya was affiliated with the Institute for AI Policy and Strategy at the time of the first submission. Anca Dragan now holds an appointment at Google DeepMind, but joined the company after the manuscript was written. Dawn Song is the president of Oasis Labs. Trevor Darrell is a cofounder of Prompt AI. Pieter Abbeel is a cofounder at covariant.ai and Investment Partner at AIX Ventures. Shai Shalev-Shwartz is the CTO at Mobileye. David Krueger served as a Research Director for the UK Foundation Model Task Force in 2023, and joined the board of the non-profit Center for AI Policy in 2024. Gillian Hadfield reports the following activities: 2018-2023: Senior Policy Advisor, OpenAI; 2023-present: Member, RAND Technology Advisory Group; 2022-present: Member, Partnership on AI, Safety Critical AI Steering Committee. In gratitude and remembrance of Daniel Kahneman, our co-author, whose remarkable contributions to this paper and to humanity’s cumulative knowledge and wisdom will never be forgotten.

References and notes

  • [1] “Statement on AI Risk” Accessed: 2024-5-1, https://www.safe.ai/work/statement-on-ai-risk, 2023
  • [2] DeepMind “About” Accessed: 2023-9-15, https://www.deepmind.com/about
  • [3] OpenAI “About” Accessed: 2023-9-15, https://openai.com/about
  • [4] Ben Cottier “Trends in the Dollar Training Cost of Machine Learning Systems”, 2023
  • [5] Alphabet “Alphabet annual report, page 33 (page 71 in the pdf): ‘As of December 31, 2022, we had USD113.8 billion in cash, cash equivalents, and short-term marketable securities’. [For comparison, the cost of training GPT-4 has been estimated as USD50 million (https://epochai.org/trends), and Sam Altman, the CEO of OpenAI, has stated that the cost for the whole process was more than USD100 million (https://www.wired.com/story/openai-ceo-sam-altman-the-age-of-giant-ai-models-is-already-over/).]”, https://abc.xyz/assets/d4/4f/a48b94d548d0b2fdc029a95e8c63/2022-alphabet-annual-report.pdf, 2022
  • [6] Marius Hobbhahn, Lennart Heim and Gökçe Aydos “Trends in Machine Learning Hardware”, 2023
  • [7] Ege Erdil and Tamay Besiroglu “Algorithmic progress in computer vision”, 2022 arXiv:2212.05153 [cs.CV]
  • [8] “Examples of AI Improving AI” Accessed: 2023-9-15, https://ai-improving-ai.safe.ai/
  • [9] Maxim Tabachnyk “ML-Enhanced Code Completion Improves Developer Productivity” Accessed: 2023-9-15, https://blog.research.google/2022/07/ml-enhanced-code-completion-improves.html
  • [10] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown and Jared Kaplan “Constitutional AI: Harmlessness from AI Feedback”, 2022 arXiv:2212.08073 [cs.CL]
  • [11] OpenAI “GPT-4 Technical Report”, 2023 arXiv:2303.08774 [cs.CL]
  • [12] Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Quoc V Le, James Laudon, Richard Ho, Roger Carpenter and Jeff Dean “A graph placement methodology for fast chip design” In Nature 594.7862, 2021, pp. 207–212
  • [13] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andrew J Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W Senior, Koray Kavukcuoglu, Pushmeet Kohli and Demis Hassabis “Highly accurate protein structure prediction with AlphaFold” In Nature 596.7873, 2021, pp. 583–589
  • [14] Noam Brown and Tuomas Sandholm “Superhuman AI for multiplayer poker” In Science 365.6456, 2019, pp. 885–890
  • [15] Murray Campbell, A Joseph Hoane and Feng-Hsiung Hsu “Deep Blue” In Artif. Intell. 134.1, 2002, pp. 57–83
  • [16] Helen Toner and Ashwin Acharya “Exploring Clusters of Research in Three Areas of AI Safety”, Center for Security and Emerging Technology, 2022
  • [17] Emerging Technology Observatory “AI safety – ETO Research Almanac” Accessed: 2024-2-12, https://almanac.eto.tech/topics/ai-safety/
  • [18] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, Courtney Biles, Sasha Brown, Zac Kenton, Will Hawkins, Tom Stepleton, Abeba Birhane, Lisa Anne Hendricks, Laura Rimell, William Isaac, Julia Haas, Sean Legassick, Geoffrey Irving and Iason Gabriel “Taxonomy of Risks posed by Language Models” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22 Seoul, Republic of Korea: Association for Computing Machinery, 2022, pp. 214–229
  • [19] Alan Chan, Rebecca Salganik, Alva Markelius, Chris Pang, Nitarshan Rajkumar, Dmitrii Krasheninnikov, Lauro Langosco, Zhonghao He, Yawen Duan, Micah Carroll, Michelle Lin, Alex Mayhew, Katherine Collins, Maryam Molamohammadi, John Burden, Wanru Zhao, Shalaleh Rismani, Konstantinos Voudouris, Umang Bhatt, Adrian Weller, David Krueger and Tegan Maharaj “Harms from Increasingly Agentic Algorithmic Systems” In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’23 Chicago, IL, USA: Association for Computing Machinery, 2023, pp. 651–666
  • [20] Virginia Eubanks “Automating Inequality: How High-Tech Tools Profile, Police and Punish the Poor” St Martin’s Press, 2018
  • [21] Dan Hendrycks, Mantas Mazeika and Thomas Woodside “An Overview of Catastrophic AI Risks”, 2023 arXiv:2306.12001 [cs.CY]
  • [22] Rishi Bommasani et al. “On the Opportunities and Risks of Foundation Models”, 2021 arXiv:2108.07258 [cs.LG]
  • [23] Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé, Jesse Dodge, Ellie Evans, Sara Hooker, Yacine Jernite, Alexandra Sasha Luccioni, Alberto Lusoli, Margaret Mitchell, Jessica Newman, Marie-Therese Png, Andrew Strait and Apostol Vassilev “Evaluating the Social Impact of Generative AI Systems in Systems and Society”, 2023 arXiv:2306.05949 [cs.CY]
  • [24] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei and Ji-Rong Wen “A Survey on Large Language Model based Autonomous Agents”, 2023 arXiv:2308.11432 [cs.AI]
  • [25] “ChatGPT plugins” Accessed: 2023-9-15, https://openai.com/blog/chatgpt-plugins
  • [26] Andres M Bran, Sam Cox, Andrew D White and Philippe Schwaller “ChemCrow: Augmenting large-language models with chemistry tools”, 2023 arXiv:2304.05376 [physics.chem-ph]
  • [27] Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun and Thomas Scialom “Augmented Language Models: a Survey”, 2023 arXiv:2302.07842 [cs.CL]
  • [28] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu and Yueting Zhuang “HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face”, 2023 arXiv:2303.17580 [cs.CL]
  • [29] Dylan Hadfield-Menell and Gillian K Hadfield “Incomplete contracting and AI alignment” In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, 2019, pp. 417–422
  • [30] Peter J Denning “The Science of Computing: The Internet Worm” In Am. Sci. 77.2 Sigma Xi, The Scientific Research Society, 1989, pp. 126–128
  • [31] Peter S Park, Simon Goldstein, Aidan O’Gara, Michael Chen and Dan Hendrycks “AI Deception: A Survey of Examples, Risks, and Potential Solutions”, 2023 arXiv:2308.14752 [cs.CY]
  • [32] A M Turner, L Smith, R Shah and A Critch “Optimal policies tend to seek power” In Thirty-Fifth Conference on Neural Information Processing Systems arxiv.org, 2019
  • [33] Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer and Jared Kaplan “Discovering Language Model Behaviors with Model-Written Evaluations”, 2022 arXiv:2212.09251 [cs.CL]
  • [34] Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons and Dan Hendrycks “Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark” In International Conference on Machine Learning, 2023
  • [35] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel and Stuart Russell “The Off-Switch Game” In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 220–227
  • [36] Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan, Luke Harold Miles, Tao R Lin, Hjalmar Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes and Paul Christiano “Evaluating Language-Model Agents on Realistic Autonomous Tasks”, 2023 arXiv:2312.11671 [cs.CL]
  • [37] Thomas Dohmke “GitHub Copilot” Accessed: 2023-9-15, https://github.blog/2023-02-14-github-copilot-for-business-is-now-available/
  • [38] Andrew Critch and Stuart Russell “TASRA: a Taxonomy and Analysis of Societal-Scale Risks from AI”, 2023 arXiv:2306.06924 [cs.AI]
  • [39] Jan Brauner and Alan Chan “AI Poses Doomsday Risks—But That Doesn’t Mean We Shouldn’t Talk About Present Harms Too” In Time, 2023
  • [40] Center for AI Safety “Existing Policy Proposals Targeting Present and Future Harms” Accessed: 2023-9-15, https://assets-global.website-files.com/63fe96aeda6bea77ac7d3000/647d5368c2368cc32b359f88_Policy%20Agreement%20Statement.pdf, 2023
  • [41] Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Recchia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zhengping Zhou, Najoung Kim, Samuel R Bowman and Ethan Perez “Inverse Scaling: When Bigger Isn’t Better” In Transactions on Machine Learning Research, 2023
  • [42] Alexander Pan, Kush Bhatia and Jacob Steinhardt “The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models” In International Conference on Learning Representations, 2022
  • [43] Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh and Dylan Hadfield-Menell “Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback”, 2023 arXiv:2307.15217 [cs.AI]
  • [44] Dan Hendrycks, Nicholas Carlini, John Schulman and Jacob Steinhardt “Unsolved Problems in ML Safety”, 2021 arXiv:2109.13916 [cs.LG]
  • [45] Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou and Quoc V Le “Simple synthetic data reduces sycophancy in large language models”, 2023 arXiv:2308.03958 [cs.CL]
  • [46] Simon Zhuang and Dylan Hadfield-Menell “Consequences of misaligned AI” In Adv. Neural Inf. Process. Syst. 33 proceedings.neurips.cc, 2020, pp. 15763–15773
  • [47] Leo Gao, John Schulman and Jacob Hilton “Scaling Laws for Reward Model Overoptimization” In Proceedings of the 40th International Conference on Machine Learning 202, Proceedings of Machine Learning Research PMLR, 2023, pp. 10835–10866
  • [48] Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang and Ethan Perez “Towards Understanding Sycophancy in Language Models”, 2023 arXiv:2310.13548 [cs.CL]
  • [49] Dario Amodei, Paul Christiano and Alex Ray “Learning from human preferences” Accessed: 2023-9-15, https://openai.com/research/learning-from-human-preferences
  • [50] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt and Justin Gilmer “The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization”, 2020 arXiv:2006.16241 [cs.CV]
  • [51] Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau and David Krueger “Goal Misgeneralization in Deep Reinforcement Learning” 162, Proceedings of Machine Learning Research PMLR, 2022, pp. 12004–12019
  • [52] Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato and Zac Kenton “Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals”, 2022 arXiv:2210.01790 [cs.LG]
  • [53] Tony T Wang, Adam Gleave, Tom Tseng, Kellin Pelrine, Nora Belrose, Joseph Miller, Michael D Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine and Stuart Russell “Adversarial Policies Beat Superhuman Go AIs”, 2022 arXiv:2211.00241 [cs.LG]
  • [54] Tilman Räuker, Anson Ho, Stephen Casper and Dylan Hadfield-Menell “Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks” In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2023, pp. 464–483
  • [55] Amartya Sen “Social Choice Theory” In Handbook of Mathematical Economics, Vol. III Amsterdam: North Holland, 1986
  • [56] Richard Ngo, Lawrence Chan and Sören Mindermann “The alignment problem from a deep learning perspective” In International Conference on Learning Representations 2024, 2024
  • [57] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer and Ethan Perez “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, 2024 arXiv:2401.05566 [cs.CR]
  • [58] Michael K Cohen, Noam Kolt, Yoshua Bengio, Gillian K Hadfield and Stuart Russell “Regulating advanced artificial agents” In Science 384.6691, 2024, pp. 36–38
  • [59] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean and William Fedus “Emergent Abilities of Large Language Models” In Transactions on Machine Learning Research, 2022
  • [60] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le and Denny Zhou “Chain-of-thought prompting elicits reasoning in large language models” In Adv. Neural Inf. Process. Syst. 35 Curran Associates, Inc., 2022, pp. 24824–24837
  • [61] Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V Le, Ed H Chi, Denny Zhou, Swaroop Mishra and Huaixiu Steven Zheng “Self-Discover: Large Language Models Self-Compose Reasoning Structures”, 2024 arXiv:2402.03620 [cs.AI]
  • [62] Tom Davidson, Jean-Stanislas Denain, Pablo Villalobos and Guillem Bas “AI capabilities can be significantly improved without expensive retraining”, 2023 arXiv:2312.07413 [cs.AI]
  • [63] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano and Allan Dafoe “Model evaluation for extreme risks”, 2023 arXiv:2305.15324 [cs.AI]
  • [64] Christopher A Mouton, Caleb Lucas and Ella Guest “The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study” Santa Monica, CA: RAND Corporation, 2024
  • [65] Jérémy Scheurer, Mikita Balesni and Marius Hobbhahn “Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure”, 2023 arXiv:2311.07590 [cs.CL]
  • [66] Leonie Koessler and Jonas Schuett “Risk assessment at AGI companies: A review of popular risk assessment techniques from other safety-critical industries”, 2023 arXiv:2307.08823 [cs.CY]
  • [67] Alan Ho and Jake Taylor “Using Advance Market Commitments for Public Purpose Technology Development”, Policy Brief, 2021
  • [68] AI Safety Summit “The Bletchley Declaration by Countries Attending the AI Safety Summit, 1-2 November 2023”, 2023
  • [69] OECD “G7 Hiroshima Process on Generative Artificial Intelligence (AI)” OECD Publishing, 2023, pp. 37
  • [70] The White House (US) “Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence”, 2023
  • [71] Cyberspace Administration of China “Interim Measures for Generative Artificial Intelligence Service Management” Accessed: 2024-2-12, http://www.cac.gov.cn/2023-07/13/c_1690898327029107.htm, 2023
  • [72] European Union “EU AI Act” Accessed: 2024, https://artificialintelligenceact.eu/the-act/, 2024
  • [73] Department of State for Science, Innovation and Technology (UK) “A pro-innovation approach to AI regulation” Accessed: 2024-2-12 In GOV.UK, https://www.gov.uk/government/publications/ai-regulation-a-pro-innovation-approach/white-paper, 2023
  • [74] Lan Xue, Kai Jia and Jing Zhao “Agile Governance Practices in Artificial Intelligence: Categorizing Regulatory Approaches and Constructing a Policy Toolbox” In Chinese Public Administration
  • [75] Matthijs M Maas “Aligning AI Regulation to Sociotechnical Change” In The Oxford Handbook of AI Governance Oxford University Press
  • [76] Lan Xue and Jing Zhao “Toward Agile Governance: The Pattern of Emerging Industry Development and Regulation” In Chinese Public Administration 410.4, 2019, pp. 28–34
  • [77] Lewis Ho, Joslyn Barnhart, Robert Trager, Yoshua Bengio, Miles Brundage, Allison Carnegie, Rumman Chowdhury, Allan Dafoe, Gillian Hadfield, Margaret Levi and Duncan Snidal “International Institutions for Advanced AI”, 2023 arXiv:2307.04699 [cs.CY]
  • [78] Robert F Trager, Ben Harack, Anka Reuel, Allison Carnegie, Lennart Heim, Lewis Ho, Sarah Kreps, Ranjit Lall, Owen Larter, Seán Ó hÉigeartaigh, Simon Staffell and José Jaime Villalobos “International Governance of Civilian AI: A Jurisdictional Certification Approach”, https://cdn.governance.ai/International_Governance_of_Civilian_AI_OMS.pdf, 2023
  • [79] Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O’Keefe, Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, Ben Chang, Tantum Collins, Tim Fist, Gillian Hadfield, Alan Hayes, Lewis Ho, Sara Hooker, Eric Horvitz, Noam Kolt, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert Trager and Kevin Wolf “Frontier AI Regulation: Managing Emerging Risks to Public Safety”, 2023 arXiv:2307.03718 [cs.CY]
  • [80] Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Scott Johnston, Andy Jones, Nicholas Joseph, Jackson Kernian, Shauna Kravec, Ben Mann, Neel Nanda, Kamal Ndousse, Catherine Olsson, Daniela Amodei, Tom Brown, Jared Kaplan, Sam McCandlish, Christopher Olah, Dario Amodei and Jack Clark “Predictability and Surprise in Large Generative Models” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22 Seoul, Republic of Korea: Association for Computing Machinery, 2022, pp. 1747–1764
  • [81] Noam Kolt, Markus Anderljung, Joslyn Barnhart, Asher Brass, Kevin Esvelt, Gillian K Hadfield, Lennart Heim, Mikel Rodriguez, Jonas B Sandbrink and Thomas Woodside “Responsible Reporting for Frontier AI Development”, 2024 arXiv:2404.02675 [cs.CY]
  • [82] Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger and Dylan Hadfield-Menell “Black-Box Access is Insufficient for Rigorous AI Audits”, 2024 arXiv:2401.14446 [cs.CY]
  • [83] Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe and Toby Shevlane “Evaluating Frontier Models for Dangerous Capabilities”, 2024 arXiv:2403.13793 [cs.LG]
  • [84] Jakob Mökander, Jonas Schuett, Hannah Rose Kirk and Luciano Floridi “Auditing large language models: a three-layered approach” In AI and Ethics, 2023
  • [85] European Organisation for the Safety of Air Navigation “EAD Safety Case Guidance”, 2010
  • [86] Food and Drug Administration “Infusion Pumps Total Product Life Cycle - Guidance for Industry and FDA Staff”, 2014
  • [87] “SMP12. Safety Case and Safety Case Report” Accessed: 2024-2-12, https://www.asems.mod.uk/guidance/posms/smp12, 2023
  • [88] Joshua Clymer, Nick Gabrieli, David Krueger and Thomas Larsen “Safety Cases: How to Justify the Safety of Advanced AI Systems”, 2024 arXiv:2403.10462 [cs.CY]
  • [89] Tim Kelly “A Systematic Approach to Safety Case Management” In SAE Trans. J. Mater. Manuf. 113 SAE International, 2004, pp. 257–266
  • [90] J Mcdermid and Yan Jia “Safety of artificial intelligence: A collaborative model”, 2020
  • [91] ISO/IEC “ISO/IEC 23894:2023 Standard on Information technology — Artificial intelligence — Guidance on risk management”, 2023
  • [92] Tzvi Raz and David Hillson “A Comparative Review of Risk Management Standards” In Risk Manage.: Int. J. 7.4, 2005, pp. 53–66
  • [93] AI Now Institute “General Purpose AI Poses Serious Risks, Should Not Be Excluded From the EU’s AI Act — Policy Brief” Accessed: 2023-9-15, https://ainowinstitute.org/publication/gpai-is-high-risk-should-not-be-excluded-from-eu-ai-act
  • [94] Jonas Schuett, Noemi Dreksler, Markus Anderljung, David McCaffary, Lennart Heim, Emma Bluemke and Ben Garfinkel “Towards best practices in AGI safety and governance: A survey of expert opinion”, 2023 arXiv:2305.07153 [cs.CY]
  • [95] Gillian K Hadfield and Jack Clark “Regulatory Markets: The Future of AI Governance”, 2023 arXiv:2304.04914 [cs.AI]