Towards Regulatable AI Systems:
Technical Gaps and Policy Opportunities
Abstract
There is increasing attention being given to how to regulate AI systems. As governing bodies grapple with what values to encapsulate into regulation, we consider the technical half of the question: To what extent can AI experts vet an AI system for adherence to regulatory requirements? We investigate this question through the lens of two public sector procurement checklists, identifying what we can do now, what should be possible with technical innovation, and what requirements need a more interdisciplinary approach.
1 Introduction
Regulations represent a fundamental mechanism through which governments fulfill their duty to protect citizens from a variety of threats including health risks, financial fraud, and discriminatory practices. As AI systems become more advanced and integrated into our lives, there has been a corresponding urgency to ensure they align with social values and norms, and that their benefits significantly outweigh any potential harm. In response to this imperative, legal and regulatory bodies globally are engaged in a concerted effort to develop comprehensive AI regulations [177, 58, 47].
However, the increasing size, generality, opaqueness, and closed nature of present-day AI systems pose significant challenges to effective regulation [186, 12]. Even when requirements can be articulated, it remains uncertain if and how we can verify an AI system’s adherence to these standards. A requirement that cannot be checked will not effectively provide protection. If we believe that AI systems should be regulated, then AI systems must be designed to be regulatable.
In this article, we consider the following questions: What innovations in AI systems are needed for them to be effectively regulated? And in what areas will innovations in AI methods alone be insufficient, and more interdisciplinary approaches required?
We explore the answer through the lens of public sector AI procurement checklists, which offer a pragmatic perspective on the broader challenges of regulatable AI systems. These checklists do more than outline criteria for government procurement; they represent concerted efforts to codify regulatory requirements for the adoption of AI systems in the public sector. This positions them uniquely at the intersection of policy-making and practical implementation. The desiderata distilled in these checklists are comprehensive and are reflected in still nascent efforts to regulate AI in the private sector [58, 22]. Furthermore, public sector procurement checklists are among the more developed AI regulations, some having gone through several rounds of refinement [175]. We emphasize that we focus on public sector procurement checklists to make our discussion concrete; the technical innovations needed to satisfy those checklists are relevant to the wider discourse of creating regulatable AI systems.
Specifically, we closely examine the technical criteria from two existing procurement checklists: the World Economic Forum’s AI Procurement in a Box (WEF) [179] and the Canadian Directive on Automated Decision-Making (CDADM) [71]. As illustrated in Figure 1, we first group the technical criteria contained in these two checklists into categories that will be familiar to AI researchers and engineers: (pre-training) data checks, (post-hoc) system monitoring, global explanation, local explanation, objective design, privacy, and human + AI systems. For each category, we briefly summarize existing technical approaches that could be used to construct AI systems that meet those criteria. Next, we identify areas where relevant technical approaches may exist, but additional technical innovation is needed to be able to vet increasingly complex AI systems being used in increasingly varied contexts. For example, the proliferation of large language models brings significant evaluation difficulties, due to factors such as their open-ended nature and data leakage. While innovative approaches such as the Holistic Evaluation of Language Models (HELM) [103] and Elo ratings [185] have been proposed, accurately evaluating these models remains an unresolved issue that requires further technical advancements for effective regulation and oversight. Finally, we briefly outline aspects of these criteria that may seem technical but actually require interdisciplinary approaches to vet.
Throughout this exercise, we assume no concerns about expertise; that is, there are sufficiently qualified AI and domain experts to review whether the AI system meets the checklist criteria. Our concern is to identify to what extent experts can currently vet AI systems against these criteria, and to provide a (non-comprehensive, but concrete) list of directions for technical innovation to bridge the gap towards regulatable AI systems. If AI systems can be verified against these checklists, then we will have made significant progress towards creating regulatable AI systems in general.
2 Public Sector AI Procurement Checklists

The public sector uses procurement checklists to ensure its purchased products align with its organizational needs and values. In the context of AI procurement checklists, most pertinent to us are the items related to technical criteria, such as ensuring data quality, privacy, fairness, and appropriate monitoring and oversight. These technical criteria are also contained in broader but more nascent regulatory efforts [58, 22].
A reasonable question may be why we focus on these particular checklists—after all, not all regulations are of the same quality. We note that both checklists have gone through extensive review. The Directive on Automated Decision-Making marks the Canadian government’s first step toward managing the risks of AI in public administration. Initiated in 2016, this comprehensive effort involved a white paper, workshops, consultation sessions, and the formulation of working groups, drawing upon the expertise of scholars, civil society advocates, and governmental officials [34]. Officially enacted in April 2019, the directive has undergone three rounds of review [24] and was amended in 2021 and 2023 in response to these reviews. It established the groundwork for the broader 2022 Artificial Intelligence and Data Act (AIDA) [32] and a recent guide on the use of generative AI [33].
The “AI Procurement in a Box” by the World Economic Forum is a set of guidelines, with detailed examples, to assist governments in the creation of public sector procurement regulation. It is the result of a 2018 “Unlocking Public Sector AI” initiative which included over 200 stakeholders across government, academia, and industry; the efficacy and applicability of the guidelines have been validated through two pilot studies conducted in the UK [180] and Brazil [174].
As noted above, both of the checklists we consider have gone through extensive review. We also spoke to government officials who were using these checklists and carefully examined them with a red-teaming mindset. While no regulatory effort is perfect, we found that these checklists are fairly comprehensive, with perhaps the only technical gap being insufficient attention to the HCI elements of how the AI would be integrated into its intended environment. We found that the technical criteria they include are relevant for broader efforts to create more regulatable AI systems.
For this paper, we aligned and grouped parts of the two checklists with existing AI research topics (outlined in Figure 1). This process required combining information from different checklist sections, highlighting the differences in how AI researchers and policymakers approach the same problems. For example, fairness is a large AI research topic, but interwoven throughout many points in CDADM and WEF, so it does not have its own section here. This paper discusses all major sections of the CDADM but excludes some elements of security and non-expert training in the WEF due to space constraints.
3 Inputs of the Model: (Pre-training) Data Checks
The characteristics of the training data have a large influence on the behavior of an AI system. What checks must be done on these data before they are used to train models? Motivations for the regulatory requirements in this section include data consent, data privacy (discussed in more detail in Section 8), and downstream impacts of data quality (e.g. on model performance, generalization and bias). Examples of checklist criteria include:
- CDADM 6.3.1: Before launching into production, developing processes so that the data and information used by the Automated Decision Systems are tested for unintended data biases and other factors that may unfairly impact the outcomes.
- CDADM 6.3.3: Validating that the data collected for, and used by, the Automated Decision System is relevant, accurate, up-to-date, and in accordance with the Policy on Service and Digital and the Privacy Act.
- CDADM 6.3.4: Establishing measures to ensure that data used and generated by the automated decision system are traceable [fingerprinting], protected and accessed appropriately, and lawfully collected, used, retained, and disposed.
- WEF: Assess whether relevant data will be available for the project […] Data is crucial for modern-day AI tools. You should determine, at a high level, data availability before starting your procurement process. This entails developing an understanding of what data might be required for the project.
- WEF: Select data that fits criteria of fairness. For example, the data should be representative of the population that the AI solution will address, as well as being reasonably recent.
The technical questions underlying these criteria have to do with data documentation procedures and checks that can expose potential risks in areas such as fairness, generalization, and privacy.
3.1 What we know how to do
We have proxies for checking many properties in these criteria (data privacy, label quality, feature selection, fairness, etc.) using exploratory data analysis [167]. For example, we can inspect the annotation process and check inter-annotator agreement to get an idea of label quality [17, 125]. We can also measure (and correct for) imbalance in data if we are given group labels that segment the dataset [99, 29]. We have techniques for identifying influential points [18], outliers [19], and mislabelled points [26, 129] which may cause models to exhibit poor performance or bias [61].
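The sketch below illustrates two of the checks mentioned above on a hypothetical annotation file (the file name and column names are assumptions, not a standard): inter-annotator agreement as a proxy for label quality, and class/group proportions that a reviewer can compare against the population the system will serve.

```python
# A minimal sketch (not a complete audit) of two pre-training data checks:
# label quality via inter-annotator agreement, and representation via
# class and group proportions. File and column names are hypothetical.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("annotations.csv")  # hypothetical annotation export

# 1) Label quality: agreement between two annotators on the same items.
kappa = cohen_kappa_score(df["annotator_1_label"], df["annotator_2_label"])
print(f"Cohen's kappa between annotators: {kappa:.2f}")  # closer to 1.0 = stronger agreement

# 2) Representation: class and group proportions for a reviewer to compare
#    against the population the AI solution will address.
print(df["label"].value_counts(normalize=True))
print(df.groupby("group")["label"].value_counts(normalize=True))
```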
Further, there exist several standards for reporting dataset information [66, 81, 20, 110, 130], including on the data curation process, that are designed to help expose potential biases and limitations on how the data may be used. Sufficiently comprehensive data documentation facilitates investigation by both experts and the public. In the realm of consent, non-consenting (opt-out) data checks can give individuals control over how their data are used; for example, artists can opt out their work with Spawning (https://spawning.ai), which provides opt-out data checks as a service to AI system developers.
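As one concrete illustration, documentation in the spirit of the standards cited above can be as lightweight as a structured record that travels with the data; the fields below are a hypothetical, minimal example rather than a prescribed schema.

```python
# A hypothetical, minimal datasheet-style record (in the spirit of the
# documentation standards cited above); real standards include many more fields.
dataset_card = {
    "name": "city-traffic-signs-v2",          # hypothetical dataset
    "intended_use": "traffic-sign detection for urban driving",
    "collection": {"period": "2021-2023", "sensors": "vehicle-mounted RGB cameras"},
    "annotation": {"labelers": 12, "inter_annotator_kappa": 0.81},
    "known_limitations": ["few night-time images", "single geographic region"],
    "consent_and_licensing": "collected under municipal data-sharing agreement",
}
```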
3.2 Directions requiring additional AI innovation
• Metrics and Generalizability. More work is needed to connect data metrics with their impact on outcomes. For example, we have reasonable tools to connect the uncertainty or measurement error in a distance sensor to effects on motion planning [57]. However, if a traffic image dataset has a certain annotator disagreement score, what does that imply for an autonomous vehicle whose vision system is trained on those data? The question of generalizability also arises for data without explicit human annotation, such as internet-crawled language and vision datasets [131, 145]. In this case, what data checks can we perform to ensure that the data will be appropriate for the domains in which the model is deployed? Data checks might lose their validity if the data are used outside their envisioned context.
A particularly important category is metrics that capture similarities between different applications and thus cover scenarios in which a dataset collected for one purpose may be used for another. While this inquiry has received considerable attention in domain adaptation research [45, 2], the data-centric perspective remains relatively unexplored [5]. For example, a dataset collected for autonomous vehicles in one city might be suitable in similar cities. But what statistics or meta-data would we need to be confident? To ensure reliable utilization of datasets, additional metrics are necessary to precisely determine the range of applications for which a dataset can be safely used.
• Data Quality Checks in the Context of Pretrained Models. Given the prevalence of large pre-trained models [75] and (currently) limited transparency about their training data [35], can we develop data checks that rely solely on accessing the model [112], or do certain types of checks require disclosure of specific information about the training data? Do checks for fine-tuning data—e.g., the traffic images used to tune an autonomous vehicle’s vision system on top of an existing image classifier—differ from checks for pre-training data?
• Unstructured Data. For structured data, it is relatively easy to report statistics across features. For unstructured data like images or social media messages, existing standards focus on reporting the statistics of the meta-data [66, 81, 20, 110, 130]. However, is providing transparency about the meta-data sufficient? For example, in the above scenario with the traffic images, is it sufficient to provide information, e.g., about where the images were collected and what kinds of cameras were used? Or might it be important to report certain information derived from the pixel values as well? Similarly, if one had a collection of social media posts, would it be important to report certain information derived from the actual content, in addition to meta-data about the site and scraping procedure?
3.3 Areas that require interdisciplinary engagement
The specific metrics that would enable meaningful inference about the quality of the data will depend on the application [78, 96]. Questions around bias and fairness are also inherently multi-faceted and will depend on the use-case [15]. Determining how data collection for AI systems respects copyright, obtains proper consent (opt-in versus opt-out), and avoids misrepresentation or detriment to the owner’s benefits necessitates input from disciplines such as law, policy, and social science [95, 49, 88, 76]. Privacy tensions—what data is retained, what statistics are made public, what kind of access is granted to trusted auditors—must also be resolved within the broader socio-technical context [50].
Furthermore, there is danger in living exclusively inside the data; cross-talk between data-centric and external perspectives is necessary to detect many normative pitfalls. For example, bias can be introduced via the choice of labels (e.g. are non-binary labels included when labeling gender?) and the labeling process (e.g. whose perspective was being taken when an input was labeled as acceptable or problematic content?). Healthcare algorithms that demonstrate unbiased predictions of healthcare costs, but then use that prediction as a proxy for illness severity, may introduce bias because unequal access to care leads to lower healthcare spending by minority groups [119]. Detecting and addressing such issues in data necessitates active dialogue between the data realm and external perspectives. Section 7 delves deeper into the discussion of label choice concerns.
4 Outputs of the Model: (Post-hoc) System Monitoring
Once a system is deployed, it is essential to monitor its operations. These criteria have to do with monitoring for adverse outcomes and identifying unintended consequences, making that information available for scrutiny, and establishing contingencies if the system is behaving poorly. Metrics to monitor the operations of a system also relate to methods for checking an AI system’s performance after it has been trained. Examples of checklist criteria include:
- CDADM 6.3.2: Developing processes to monitor the outcomes of Automated Decision Systems to safeguard against unintentional outcomes and to verify compliance with institutional and program legislation, as well as this Directive, on a scheduled basis.
- CDADM 6.3.6: Establishing contingency systems and/or processes as per Appendix C. (Which says: Ensure that contingency plans and/or backup systems are available should the Automated Decision System be unavailable.)
- CDADM 6.5.1: Publishing information on the effectiveness and efficiency of the Automated Decision Systems in meeting program objectives on a website or service designated by the Treasury Board of Canada.
- WEF: [T]here should be systematic and continuous risk monitoring during every stage of the AI solution’s life cycle, from design to post-implementation maintenance.
- WEF: Testing the model on an ongoing basis is necessary to maintain its accuracy. An inaccurate model can result in erroneous decisions and affect users of public services.
- WEF: Enable end-to-end auditability with a process log that gathers the data across the modelling, training, testing, verifying and implementation phases of the project life cycle. Such a log will allow for the variable accessibility and presentation of information with different users in mind to achieve interpretable and justifiable AI.
The technical questions associated with these criteria have to do with how to monitor performance and identify various kinds of drift and unusual results that warrant attention.
4.1 What we know how to do
Given a specific metric, it is relatively easy to put monitoring into place. We can easily check that the outputs of an AI do not exceed threshold values. Methods exist that establish distributions for “normal operation” and flag anomalous values during actual operation [64]. These techniques can be employed to detect shifts in inputs and outputs, in model confidences and calibrations [14], in derived quantities such as the top features used to make a prediction (allowing a person to check whether a shift is sensible), and in fairness metrics [70, 14, 147]. We can also learn the trend in how a particular quantity changes over time and check whether that trend continues to hold or whether an external shock has occurred. In RL settings, we can monitor differences between expected and actual reward distributions. If the causal structure of the environment is known, monitoring checks can specifically identify new confounders and mediators. That said, all anomaly detection methods require some specification of what kinds of behavior represent a change or anomaly. They may not capture every unintended consequence, and a given set of monitoring metrics may be gamed by an adversary.
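The sketch below shows one simple instance of this kind of monitoring: comparing a live window of some monitored quantity against a reference sample with a two-sample test. The data are synthetic stand-ins, and the window size and threshold are illustrative choices rather than recommendations.

```python
# A minimal sketch of post-deployment drift monitoring: compare recent
# production values of a monitored quantity (an input feature, a model
# confidence, etc.) against a validation-time reference sample.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)    # stand-in for validation-time values
live_window = rng.normal(loc=0.3, scale=1.0, size=500)   # stand-in for recent production values

stat, p_value = ks_2samp(reference, live_window)         # two-sample Kolmogorov-Smirnov test
if p_value < 0.01:                                       # illustrative alert threshold
    print(f"Possible distribution shift (KS={stat:.3f}, p={p_value:.1e}); flag for review.")
```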
More generally, we already have a set of norms around what kinds of tests should be run prior to an AI system being deployed (e.g. [97, 152]). AI developers should strive to test their systems with multiple independent, external datasets to ensure that their results are replicable (and be transparent if this kind of generalization has not been tested). These datasets should include sufficient numbers of hard cases in their test sets, and results should be presented stratified by difficulty. Similarly, one should provide stratified results on performance of cases similar and dissimilar to the training set. Performance measures should be reported with respect to the real population proportions of each class, stratified by class, or be independent of base rates so that they can be correctly applied to the intended use-case and not the proportions present in the training set.
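As a small illustration of the stratified reporting described above, the sketch below computes an overall metric alongside per-stratum results; the toy data and the "stratum" column are hypothetical, and real reporting should use the metrics and strata appropriate to the deployment context.

```python
# A minimal sketch of stratified performance reporting on a held-out test set.
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "stratum": ["easy", "easy", "hard", "hard", "easy", "hard", "hard", "easy"],
})

print(f"overall accuracy: {accuracy_score(results['y_true'], results['y_pred']):.2f}")
for stratum, group in results.groupby("stratum"):        # per-stratum results, as recommended above
    print(f"{stratum}: {accuracy_score(group['y_true'], group['y_pred']):.2f}")
```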
4.2 Directions requiring additional AI innovation
• Monitoring Many Metrics. Monitoring multiple metrics increases the risk of false positives and false negatives, which can overwhelm engineers. How can we monitor many metrics efficiently while not incorrectly flagging too many cases for review and not missing important deviations? Relatedly, once in operation, what data should be gathered so that we can check additional metrics in the future? For example, while we can monitor fairness for known minority groups, what data should be logged during operation so that we can audit fairness when a previously unconsidered demographic group (e.g., an intersection of legally protected attributes) contests unfair outcomes [91]? The question of what logs to retain only becomes more difficult when there are multiple AI systems interacting at fast rates, such as the many AI components operating within an autonomous vehicle. These questions remain despite advances in MLOps [97].
• Certification of Use Cases. Across the very broad range of AI systems and contexts, can we certify the settings in which an AI system is supposed to work well? Can we assign a label to an AI model so that it is restricted to or from being applied to specific use cases? Consider, for example, the need to establish safeguards that prevent an open-access drug discovery model from being utilized for de novo design of biochemical weapons [169]. Similarly, image generative models should be restricted from generating pornographic content. Relatedly, can we provide confidence about the post-hoc performance of a deployed system on certified tasks while preventing a deployed system from being misused?
In formal verification, one mathematically checks that the formal model of a given system satisfies a desired property. Formal verification is widely used in safety-critical systems. As AI systems enter safety-critical settings—such as autonomous driving or robot-assisted surgeries—it is essential that strong safety guarantees can be maintained. Certifying neural networks for safety-critical systems is an active research area [155, 150, 13, 90, 67, 182]. There are also early proposals to define standards for levels of AI system certification [184] (analogous to security standards [62]) that have yet to be refined and adopted.
• Correcting Models after Deployment. There exists some work on correcting deployed models in a way that does not require re-training end-to-end (e.g. unlearning [73, 162, 98, 41], fine-tuning [82], and in-context learning [176, 39]). But more work remains to be done, especially for AI systems with many interacting parts.
• Identifying Relevant Distribution Shift. There are many possible types of shift: in input distributions, in the relationship between inputs and outputs, in the rewards (objective)—and these shifts can take many forms and occur in many ways. For example, the acceleration of newer cars may be different, as may which colors are popular; the former likely matters for a driving policy, while the latter may not. Can we distinguish between relevant and irrelevant shifts (e.g., along the lines of [42])? If the shifts happen in some uninterpretable embedding space, how can we explain them?
• Monitoring Agents that are Learning Online. We can monitor for major adverse effects. However, can we identify more subtle issues, such as initial signs of catastrophic forgetting, cheating, and other harms that occur while the agent continues to perform well on its reward metric? For instance, it would be advantageous to detect early signs of reckless or inappropriate driving behavior—such as reducing distances between the vehicle and pedestrians, or increased use of residential streets where children may be playing—in autonomous driving agents before any traffic accidents occur. Our understanding of unintended consequences continues to grow [160, 30] but the problem remains unsolved.
4.3 Areas that require interdisciplinary engagement
At a high level, there will always need to be some kind of decision made about what needs to be monitored or prioritized in a given setting. There will also need to be decisions made about what kinds of safety promises or guarantees are needed, e.g., how much shift is considered safe and acceptable and how much is not, for example in healthcare [60]. It is crucial to translate the monitored metrics into meaningful implications that enable people to make informed decisions within the broader socio-technical system [122]. For instance, in autonomous driving, comparing monitored metrics against human performance can inform decisions regarding human intervention. Finally, the task of contingency planning for back-ups when models express unexpected or unwanted behaviors also requires an understanding of the broader socio-technical system.
5 Inspecting the Model: Global Explanations for Model Validation
Global explanations describe a model as a whole and are often useful for inspection or oversight. The goal is to expose information about the model that would allow a domain expert to infer the existence of some kind of unobserved confounder, something about the model that is non-causal, and other limits on the scope of the model’s applicability. Criteria related to global explanations include:
- CDADM App. C: Plain language notice through all service delivery channels in use (Internet, in person, mail or telephone). In addition, publish documentation on relevant websites about the automated decision system, in plain language, describing: How the components work;
- WEF: Public institutions cannot rely on black-box algorithms to justify decisions that affect individual and collective citizens’ rights, especially with the increased understanding about algorithmic bias and its discriminatory effects on access to public resources. There will be different considerations depending on the use case and application of AI that you are aiming to acquire, and you should plan to work with the supplier to explain the application for external scrutiny, ensuring your approach can be held to account. These considerations should link to the risk and impact assessment described in Guideline 2. Under certain scenarios, you could consider making it a requirement for providers to allow independent audit(s) of their solutions. This can help prevent or mitigate unintended outcomes.
- WEF: Ensure that AI decision-making is as transparent as possible. – Encourage transparency of AI decision-making (i.e. the decisions and/or insights generated by AI). One way to do this is to encourage the use of explainable AI. You can also make it a requirement for the bidder to provide the required training and knowledge transfer to your team, even making your team part of the AI-implementation journey. Finally, you can ask for documentation that provides information about the algorithm (e.g. data used for training, whether the model is based on supervised, unsupervised or reinforcement learning, or any known biases).
Technical approaches associated with these criteria include the creation of small, inherently interpretable models with high performance, sharing certain parts or properties of a large model, and open-sourcing the model’s code.
5.1 What we know how to do
We can build inherently interpretable models (e.g. generalized additive models, decision trees, rule-based models, etc.) for tabular and other simple, relatively structured data [139]. We have some tools for interpreting neural networks in terms of human-understandable components [133, 118, 120], such as circuits [172] or even natural language [23]. When possible, these tools provide a systematic approach to explain how tasks are performed in ML models in a human understandable way. Finally, we can partially explain neural networks and other complex models via methods such as distillation [161], feature importance [104], or computing concept activation vectors [142].
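The sketch below illustrates, on synthetic data, two of the approaches just mentioned: fitting a small, inherently interpretable model that can be read directly, and computing a post-hoc global explanation (permutation feature importance) for a more complex model. It is purely illustrative of the techniques, not of any particular deployed system.

```python
# A minimal sketch: an inherently interpretable model vs. a post-hoc global
# explanation of a black-box model. Data is synthetic and illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Inherently interpretable model: a shallow decision tree we can print and read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))

# Post-hoc global explanation of a more complex model: permutation importance.
forest = RandomForestClassifier(random_state=0).fit(X, y)
importances = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(importances.importances_mean)  # higher = feature matters more globally
```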
5.2 Directions requiring additional AI innovation
• Interactive “Openboxing” of Large Models. Can we build interactive, hierarchical, and semantically-aligned views of large models such that these models are (to some extent) inherently interpretable? For example, a traffic image classifier that recognizes objects by multiplying object templates with transformation matrices [183] would be more inherently explainable than another model without this hierarchical structure. Further, can we allow users to explore such explanations at different levels of fidelity for different contexts? As noted above, methods to extract information from larger models such as large language models exist (e.g., [142, 114]), but they offer limited ways for people to effectively explore and understand larger models. More work along the lines of [16, 158] is needed.
• Checking Value Alignment. Whether it is criminal justice, benefits allocation, or autonomous driving, AI systems are increasingly used in situations that require value judgments. How do we elicit and encode societal and individual values in diverse situations? What metrics can effectively measure value alignment? How do we make this mapping transparent so that others (e.g., the drivers of other cars next to the autonomous vehicle) can understand the value choices made? Advancing existing work (e.g., [27, 56]) is needed for our increasing range of use cases.
5.3 Areas that require interdisciplinary engagement
There is a question of what to offer and to whom. For example, releasing the code and environment may allow some people to directly answer their questions [127]. Providing an explanation broadens who can inspect the model, including users and domain experts; however, what information to release, how it should be extracted, and how often during the life cycle of the model that information should be updated will depend on the use context. We will also need mechanisms for people to request more information about a model as new concerns become apparent [84]. Finally, all information release must be balanced with concerns about privacy and trade secrets.
6 Inspecting the Model: Local Explanations about Individual Decisions
These criteria have to do with providing information to a user about a specific decision that is made, such as benefits denial. In some cases, it may be sufficient to simply provide the information and logic that led to the decision (a meaningful explanation). In other cases, it may be preferable to provide actionable ways to change the decision (recourse) [171, 89]. In the following, we use the term local explanation to refer to explanations that are meant to provide insight about a particular decision, rather than about the model overall [113]. We use the term recourse to refer to a modification of the input that results in the output changing to the desired value.
- CDADM 6.2.3: Providing a meaningful explanation to affected individuals of how and why the decision was made as prescribed in Appendix C.
- CDADM 6.4.1: Providing clients with any applicable recourse options that are available to them to challenge the administrative decision.
- CDADM App. C: In addition to any applicable legal requirement, ensuring that a meaningful explanation is provided with any decision that resulted in the denial of a benefit, a service, or other regulatory action.
- WEF: Explore mechanisms to enable interpretability of the algorithms internally and externally as a means of establishing accountability and contestability. – With AI solutions that make decisions affecting people’s rights and benefits, it is less important to know exactly how a machine-learning model has arrived at a result if we can show logical steps to achieving the outcome. In other words, the ability to know how and why a model performed in the way it did is a more appropriate means of evaluating transparency in the context of AI. For example, this might include what training data was used, which variables have contributed most to a result, and the types of audit and assurance the model went through in relation to systemic issues such as discrimination and fairness. This should be set out as documentation needed by your supplier. – It is also important to consider the potential tension between explainability and accuracy of AI when acquiring AI solutions. Classic statistical techniques such as decision-tree models are easier to explain but might have less predictive power, whereas more complex models, such as neural networks, have high predictive power but are considered to be black boxes.
Approaches for creating local explanations rely heavily on a notion of local region, and thus some notion of distance. Some inputs are more easily explained than others, and any explanation can introduce privacy risks.
6.1 What we know how to do
There are many techniques for providing local explanations for a model [52, 100, 135, 136, 105, 149]. Specifically, given a definition of distance, we can find a counterfactual: the closest point such that the model’s output is a desired class [72, 171]. This can be used to help an individual determine what features set them apart from a nearby alternative, and it also sets the foundation for recourse (if those features can be changed) [89].
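The sketch below illustrates the counterfactual idea in its simplest form: given a distance metric (here plain Euclidean distance) and a set of candidate points, return the nearest candidate for which the model predicts the desired class. It is a toy stand-in for dedicated counterfactual methods, and the choice of distance is exactly the open issue discussed in the next subsection.

```python
# A minimal sketch of a nearest-counterfactual search on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

def nearest_counterfactual(x, candidates, model, desired_class):
    """Return the candidate closest to x (Euclidean) that the model assigns desired_class."""
    preds = model.predict(candidates)
    eligible = candidates[preds == desired_class]
    distances = np.linalg.norm(eligible - x, axis=1)
    return eligible[np.argmin(distances)]

x = X[0]                                   # an input whose decision we want to change
cf = nearest_counterfactual(x, X, model, desired_class=1 - model.predict([x])[0])
print("feature changes needed:", cf - x)   # the basis for a recourse suggestion
```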
6.2 Directions requiring additional AI innovation
• Defining Distance Metrics. As noted above, local explanations rely heavily on notions of nearby data. It can be difficult to adjudicate what correlations in the data should be preserved and what should not. For example, if there are correlations between the kind of sign and the geographic location in a traffic image data set, should those correlations be retained in the distance metric? What about for race and postal codes or sex and hormone levels? Some work exists on using human input to define the appropriate distance metric for the purposes of explanation and recourse [89], but more is needed.
• Data without Interpretable Dimensions. The challenges associated with choosing distance metrics are exacerbated when the individual dimensions of the data are not interpretable. For example, suppose we have a medical imaging task in which the AI system claims that certain cells represent a certain type of cancer, or a face recognition task in which the AI system claims that the face in a security video matches a face in a government database. What is a meaningful explanation [111] in this case? Does it take the form of other images in the dataset (which may create privacy issues)? Should it involve first summarizing the input into interpretable concepts [92, 69]? Similar issues arise with text [181] and time-series data [10].
• Provenance Adjudication. We may want to know whether a particular training datum was used in a particular way to generate a given output. For example, we may want to know whether a traffic sign misclassification can be attributed to a specific mislabeled example, or we may need to resolve copyright issues arising from AI-generated text and images. This is possible for small models, but in very nascent stages for large models (e.g., LLMs [170] and diffusion-based image generation models [48]).
• Handling Out of Distribution Data. The idea behind recourse is that it gives a person a path toward getting the outcome they desire. For example, if a loan applicant is told that paying off their debts would make them eligible for the loan, then they would expect to get the loan once the debts are paid. However, if the applicant’s data is very far from the training data, then the AI-produced recourse may indeed change the model’s output, but would not be accepted by the loan officer in a real context.
• Tradeoffs between Explainability and Privacy/Security. Releasing information for auditing or recourse may give bad actors access to private information [148] or allow them to game the system [126]. For example, explanations in the form of training samples, like the traffic images above, may allow actors not only to learn how to trick the autonomous vehicle, but also to learn about other elements of those images (that are not road signs). Advancing existing work (e.g., [166]) is necessary to understand the resulting dynamics.
6.3 Areas that require interdisciplinary engagement
The biggest question raised by these guidelines is what constitutes a “meaningful explanation” [156]. This definition will depend on the socio-technical context of the task—contesting a loan denial, a medical error, or a benefits denial may require different kinds of explanations. Different kinds of users may also require different explanations.
Relatedly, the purpose of the information provided for recourse will vary across contexts. For one task, it may be enough to provide only one recourse, while for others it may be necessary to provide multiple options [163, 154]. In other contexts, the user might benefit from an interactive system to explore different options. For example, they might wish to explore possible changes themselves and see whether each would result in a favorable loan decision.
Finally, a recourse generated from a local explanation may not be the appropriate way to assist a user unhappy with a decision [6, 3]. For example, suppose someone is convinced that a voice-based COVID-19 test is in error about their disease status. Rather than providing an explanation of the voice features used to make the decision, the appropriate recourse may be to allow that person to take a traditional COVID-19 test instead. We also note that certain situations may require a justification (a rationale for why a decision is right with respect to laws, norms, and other aspects of the context) rather than an explanation (what features the AI used to generate the output).
7 Designing the Model: Objective Design
All AI systems require formulating goals in precise, mathematical terms. Objective design converts general goals (e.g. drive safely) into precise mathematical terms [21, 80]. This distillation process is fraught with potential pitfalls; an incorrect conversion will result in the AI behaving in unintended ways. For example, encoding safe driving as always ceding the right of way may result in an autonomous vehicle that never makes a turn at a busy intersection. Collaboration with stakeholders during the objective design process can help ensure the true goals are addressed, rather than a proxy that may not result in the desired behavior. Documentation of the objective design process must be sufficiently transparent to ensure calibrated trust from stakeholders. Examples of criteria include:
- WEF: Focus on developing a clear problem statement, rather than on detailing the specifications of a solution. - AI technologies are developing rapidly, with new technologies and products constantly being introduced to the market. By focusing on describing the challenges and/or opportunities that you want to address and drawing on the expertise of technology partners, you can better decipher what technology is most appropriate for the issue at hand. By focusing on the challenge and/or opportunity, you might also discover a higher-priority issue, or realize you were focusing on a symptom rather than the root cause.
The criteria above encourage public servants to identify their actual goals and then allow the engineers to deliver. To be able to deliver, however, the AI engineers must be able to convert the problem statement into precise terms.
7.1 What we know how to do
In some cases, it is possible to decompose a complex task into simpler components. For example, in the context of an autonomous vehicle, we might evaluate a perception system for its ability to identify and forecast the trajectories of other objects in its environment, and the ability of a planner to make safe decisions given this information. Algorithms for multi-objective optimization can find a Pareto front of options corresponding to different trade-offs between desiderata [143, 153]. There is also recent work in inferring what objectives are truly desired given observed reward functions [74].
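The sketch below illustrates the multi-objective idea on made-up scores: given candidate configurations evaluated on two objectives (both to be maximized, e.g. a safety proxy and an efficiency proxy), it keeps only the Pareto-optimal trade-offs for stakeholders to choose among. The scores and the brute-force dominance check are illustrative, not a production algorithm.

```python
# A minimal sketch of extracting a Pareto front from candidate scores.
import numpy as np

scores = np.array([   # columns: [objective_1, objective_2], one row per candidate
    [0.9, 0.2],
    [0.7, 0.6],
    [0.6, 0.5],       # dominated by [0.7, 0.6]
    [0.3, 0.9],
])

def pareto_front(points):
    """Return the points not dominated by any other point (maximization)."""
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q >= p) and np.any(q > p) for j, q in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(p)
    return np.array(keep)

print(pareto_front(scores))  # the trade-off options to present to stakeholders
```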
7.2 Directions requiring additional AI innovation
• Metrics for Metrics: Measuring Match to Goals. What are the measures that can be used to determine whether some technical objective matches our policy goals? Objective and reward design are relatively well-studied in some domains, such as reinforcement learning [151, 74], but unsolved for the many more situations—from autonomous vehicles to email text completion—in which we see AI systems used today. Further, our goals may be multi-faceted; the objective must not only be faithful to our goal but also transparent in how it is faithful.
• Properties of Popular Objective Functions. There are many objective functions used for their computational convenience and statistical properties (squared loss, log likelihood, etc.). Because they are so popular, their statistical properties under various conditions are often well-understood [146]. For example, we may know that L1 losses are more robust than L2 (see the sketch at the end of this list); we may know that decreased model capacity (e.g. fitting a line) can make a model more prone to being swayed by influential points. However, how do these very technical understandings of statistical properties relate to more complex goals, including reward hacking and other short-cut risks? A better understanding of these properties could enable better matching between popular losses and broader policy goals.
• Robustness to a Variety of Objectives. Further research is necessary to create agents that excel across a range of objectives. In RL, this research can strengthen the robustness of learned policies when objectives are not perfectly specified [116, 128]. It also applies to language models, which are trained with the next token prediction task but asked to perform agentic tasks with various constraints and objectives [7], and in different worlds [101].
• Computational Constraints for More Robust Objectives. Related to the above, there are a variety of computational constraints and regularizers that often make objectives more robust to imperfect specifications. These include encouraging smoothness (e.g. Lipschitzness), sparsity, and robustness to certain types of uncertainties (e.g., [56] and distributionally robust optimization [132]). However, work remains to be done to more strongly connect what these computational tools do to the goal of aligning the technical formulation with the true objective.
Furthermore, some constraints and regularizations are difficult to express and/or operationalize in analytical forms; instead, they are incorporated directly into the training procedure, such as adversarial training [164]. Relatedly, additional work is needed to effectively optimize objectives with multiple criteria—whether those are constraints, regularizers, or competing terms: Simply writing down an objective does not make it easy to optimize. As additional terms are added to the objective, the question of how to weigh them to achieve the desired behavior also becomes more complex.
• Understanding Connections between Objectives and Learnt Model Behavior. Can we efficiently explain how changes in the objective function impact model behavior? Conversely, can we explain policies in terms of compatible reward functions? Can we efficiently identify where two reward functions may result in different policies in human-understandable terms? Some prior works try to answer this [63]; however, more analysis will help refine the reward function to better match the intended objectives. A further question is disentangling which model behavior is the result of an objective and which is the result of training data. For example, the mix of possibly conflicting beliefs in a text corpus will influence how language models trained on it behave, even though all are trained with the same objective [7].
• Inferring Goals from Observed Behavior. In some cases, we may have examples of decisions or outputs that we know align with the true goal (e.g. safe driving trajectories). However, the inverse problem of inferring rewards from behavior is not identifiable. Advancing techniques [9] that help disambiguate important elements of the reward function can help ensure that the learned policy aligns with the desired objectives, leading to improved performance and generalization.
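As a small illustration of the kind of well-understood statistical property mentioned in the item on popular objective functions above, the sketch below contrasts a squared-error (L2) fit with an absolute-error (L1, median regression) fit on synthetic data containing a single corrupted record; the data and the specific estimator choices are illustrative only.

```python
# Synthetic regression data with one gross outlier at a high-leverage point;
# compare an L2 (squared-error) fit against an L1-type (median regression) fit.
import numpy as np
from sklearn.linear_model import LinearRegression, QuantileRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.05, size=100)   # true slope is 2.0
X[0, 0], y[0] = 1.0, 50.0                                # a single grossly corrupted record

l2 = LinearRegression().fit(X, y)                           # minimizes squared error
l1 = QuantileRegressor(quantile=0.5, alpha=0.0).fit(X, y)   # median regression (L1-type loss)

print(f"true slope 2.0 | L2 slope {l2.coef_[0]:.2f} | L1 slope {l1.coef_[0]:.2f}")
# The L2 fit is pulled substantially away from 2.0 by the single outlier,
# while the L1 fit stays close, illustrating the robustness claim above.
```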
7.3 Areas that require interdisciplinary engagement
Creating goals at a policy level requires considering factors such as contextual relevance, attainability, and alignment with overarching desiderata [83]. Ethical concerns associated with the power and impact of AI systems may also be taken into account. Moreover, sometimes the objective remains unclear even at the policy level, making it more difficult to design proper objectives for the AI systems, much less validate and explain them. For example, while AI systems are commonly utilized in criminal justice [187, 93], there is often a lack of clarity regarding how they define and measure crime [87], and the data may not accurately reflect the true objectives of judges [44, 137].
8 Designing the Model: Privacy
Bad actors may use transparency about the data, code, and model to identify private information about individuals. There are a number of examples of regulatory criteria relating to privacy concerns, including:
- CDADM 6.2.6: Releasing custom source code owned by the Government of Canada as per the requirements specified in section A.2.3.8 of the Directive on…
- CDADM App. C: Plain language notice through all service delivery channels in use (Internet, in person, mail, or telephone). In addition, publish documentation on relevant websites about the automated decision system, in plain language, describing: A description of the training data, or a link to the anonymized training data if this data is publicly available.
- WEF: There are many anonymization techniques to help safeguard data privacy, including data aggregation, masking, and synthetic data. Keep in mind, however, that you must manage anonymized data as carefully as the original data, since it may inadvertently expose important insights. RFPs should encourage innovative technological approaches, such as those mentioned above, that make less intrusive use of data or that achieve the same or similar outcomes with less sensitive datasets.
- WEF: As important as data protection is, not all data is sensitive (e.g. open-government data is freely accessible online). All data, sensitive or not, must have its integrity safeguarded, but it is not necessary to keep non-sensitive data behind closed doors. It is important to assess the privacy needs of different datasets to determine the right level of protection. Normally, personally identifiable information (PII), such as financial and health data, is considered extremely sensitive. The RFP needs to reflect data governance requirements for both the procurement process and the project that are in accordance with the nature of the data.
However, the language in these regulations leaves a number of issues unspecified, including a standardized, meaningful definition of privacy, and it assumes that we can properly assess the privacy of a dataset and anonymize data, both of which remain open research questions.
8.1 What we know how to do
Differential privacy is a widely-accepted theoretical notion of privacy [178, 53]. In settings where this notion of privacy is appropriate, we have differentially private algorithms that can calculate statistical properties of data [54], train machine learning models [1, 124], and generate synthetic data [31]. Many other privacy notions exist [141, 107, 102]. Choosing which privacy notion to use in a particular setting remains an open question.
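The sketch below shows one of the simplest differentially private calculations: releasing the mean of a bounded attribute via the Laplace mechanism. The epsilon value, clipping bounds, and synthetic data are illustrative; production systems should rely on a vetted DP library rather than hand-rolled code.

```python
# A minimal sketch of an epsilon-DP statistic using the Laplace mechanism.
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP mean of values clipped to [lower, upper] (Laplace mechanism)."""
    values = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)   # max change from altering one record
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

rng = np.random.default_rng(0)
incomes = rng.normal(50_000, 10_000, size=10_000)        # synthetic sensitive attribute
print(dp_mean(incomes, lower=0, upper=200_000, epsilon=1.0, rng=rng))
```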
8.2 Directions requiring additional AI innovation
• Better Tradeoffs between (differential) Privacy and (predictive) Performance. In general, differentially-private models have lower predictive performance than models without privacy guarantees [11]. How can that gap be closed? Related questions include: Can we ensure models are private even with many queries and in conjunction with public data? What can we maximally expose about a model and training data statistics in a way that is still private? Can we precisely state what cannot be exposed, e.g. a long tail has been left out [59]? (Note: if we can make this precise, then certain information could be made public as it poses no privacy risk, and other information may be available only to a trusted auditor.)
• Creating and Assessing Privacy Definitions. How can we define privacy appropriately and meaningfully for different types of data (e.g., trajectories, text [28])? What do current definitions of privacy actually achieve on these data?
• Privacy via Minimal Data Collection. Can we collect only the input information needed for each decision, which may involve collecting different inputs for different people [165]? What privacy risks are mitigated by this approach? Are new risks introduced because what inputs are measured is new information?
• Private Generative Models. The main focus of existing privacy work has been on classification, so there are many open questions when it comes to the privacy of generative models [37, 35, 36, 86]. For example: How can we prevent a generative model from replicating training data? How does a private generative model differ from simply adding noise to the data, and is there a benefit to the former over the latter? Are empirical methods to ensure privacy, e.g., via reinforcement learning from human feedback [186], sufficient?
• Effective Machine Unlearning. In some cases, people may be allowed to elect to have the influence of their data removed after the model has been trained. Methods have been created to remove the influence of specific training examples from a model, but these are still maturing [98, 41], especially for generative models [144, 65, 79].
8.3 Areas that require interdisciplinary engagement
Current private models still allow third parties to infer private information via access to additional, publicly available data. We need to develop new notions of privacy for this setting [28]. Broader discussion is also needed regarding what to do if privacy guarantees sacrifice predictive performance, especially if the cost falls primarily on underrepresented groups [11]. More generally, the appropriate definition of privacy, and how strict the privacy guarantee must be (e.g., via hyperparameter settings), will depend on the setting [55] and must be made transparent. For example, claiming a model is differentially private when it has a very large epsilon may be misleading.
Finally, while this section has focused on privacy, we note that many security concerns must also be considered in a holistic manner. There are clear limitations to what can be achieved with respect to adversarial actors. If training data are available, a state actor or a large industry actor could (re)create a model. Once a model or training technique is released, we cannot realistically control its use. Unlimited public access to a model (via queries) intrinsically allows an adversary to learn about the model and the training data.
9 Interacting with the Model: Human + AI Systems
AI regulations frequently emphasize the involvement of humans in various stages of the decision-making process. Often the intent is for the human decision-maker to vet an AI recommendation, take responsibility for the final decision, and intervene in case of emergency situations and system failures. We also consider the case of learning from human input. Examples of related criteria include:
- CDADM App. C: Decisions cannot be made without having specific human intervention points during the decision-making process; and the final decision must be made by a human.
- CDADM 6.3.6: Establishing contingency systems and/or processes as per Appendix C. (Which says: Ensure that contingency plans and/or backup systems are available should the Automated Decision System be unavailable.)
Technical approaches associated with these criteria include combining information from multiple experts, as well as ways to ensure that humans are fully engaged in the decisions.
9.1 What we know how to do
There has been significant work on learning from humans. We can apply methods such as imitation learning [85, 109] and reinforcement learning from human feedback [106] to orient the model based on expert control or to learn human intentions and preferences [123]. Active learning techniques can be used to proactively ask humans for information to improve a model [173, 134]. Finally, we also have methods for humans to take the initiative to correct an agent (e.g., [138, 108, 117]). While methods for uncertainty quantification are always being improved, our current methods are reasonable for the purpose of flagging uncertain inputs for human inspection [68].
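The sketch below illustrates the last point: routing a model's low-confidence predictions to a human reviewer. The 0.8 threshold and the synthetic data are illustrative; in practice the threshold should be calibrated and validated for the deployment context.

```python
# A minimal sketch of deferring low-confidence predictions to a human.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])

probs = model.predict_proba(X[800:])          # confidence scores on new inputs
confidence = probs.max(axis=1)
needs_review = confidence < 0.8               # defer these cases to a human
print(f"{needs_review.mean():.0%} of cases flagged for human review")
```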
9.2 Directions requiring additional AI innovation
• HCI Methods for Avoiding Cognitive Biases. Humans have many cognitive biases and limitations. If a system behaves well most of the time, people may start to over-rely on it. Confirmation bias can accompany backward reasoning (people finding ways to justify a given decision) but can be mitigated if a person performs forward reasoning first (looking at the evidence) [25]. Bias can also come from imperfect information fusion: for example, if a clinician forms an opinion from patient data and then sees an AI prediction based on the same data, they may falsely treat the AI prediction as a new, independent piece of evidence. Appropriate human+AI interaction can help mitigate these biases.
• Shared Mental Models and Semantic Alignment. Shared mental models—the human’s model of the AI system and the AI system’s model of the human—are essential for effective human+AI interaction [8]. While there exists work in which agents use or create models of humans (e.g., [38]) to facilitate interaction, including modeling a person’s latent states such as cognitive workload and emotions (e.g., [121]), it remains an open question how to develop and validate these methods for the increasing number of human+AI use cases.
One particularly important area is semantic alignment between the way humans organize concepts and the way modern AI systems encode representations. Grounding terms has a long history in AI [77] and innovation is needed for our modern settings.
• Humans-in-the-Loop in Time-Constrained Settings. How can we include humans in the loop when decisions have to be made quickly, e.g., by industrial robots in emergency scenarios involving human workers? It is crucial that automated systems can fail gracefully and hand over control to humans, even in time-constrained settings [140].
• Human+AI for Test-time Validation of Models with Large Output Surfaces. Models with large output surfaces (e.g. LLMs) will be difficult to evaluate via prospective metrics; we need methods for people to assist in their validation at task time (e.g. as proposed in [51]).
• Evaluation and Design of Realistic Human-in-the-Loop Systems. Most current testing is for lay-user and consumer applications, where risks and costs are minimal. However, evaluation in other settings is more challenging: integrating a new interactive system into an existing workflow may require not only significant software effort, but also training of users. In high-stakes settings such as healthcare, criminal justice, and major financial decisions, there is a risk of real harm to people. How can we evaluate and design for these cases? Building more general knowledge about human-in-the-loop systems and developing smarter experimental designs may help reduce these burdens, as might validated methods for piloting systems in offline or otherwise de-risked ways that still inform the target application. Relatedly, standard procedures are needed for evaluating and monitoring human-in-the-loop systems.
9.3 Areas that require interdisciplinary engagement
Shared human+AI decision-making is an interdisciplinary area involving social science, psychology, cognitive science, etc. [43]. Fortunately, HCI research already has connections to these fields [46, 157, 159]. Furthermore, the design and adoption of new tools in workplaces are well studied in design, human factors research, and management and operations science [115, 4, 168], and require interdisciplinary teams with appropriate expertise. These interdisciplinary efforts will help inform decisions about whether, how, and which humans to include in the loop, as well as how a system that is expecting human input should respond to inappropriate, slow, or absent input from the human.
10 Conclusion
In this document, we examined the technical criteria in two real regulatory frameworks—the Canadian Directive on Automated Decision-Making and the World Economic Forum’s AI Procurement in a Box. We find that we only have some of the tools needed to ascertain whether an AI system meets the stated requirements. We list several concrete directions for AI innovation that, if addressed, would improve our ability to create regulatable AI systems.
Acknowledgements.
The authors thank Andrew Ross, Siddharth Swaroop, Rishav Chourasia, Himabindu Lakkaraju, and Brian Lim, as well as all participants of the NUS Responsible, Regulatable AI Working Group 2022-2023, including Limsoon Wong, Angela Yao, Suparna Ghanvatkar, and Davin Choo.
References
- Abadi et al. [2016] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
- Achille et al. [2019] A. Achille, M. Lam, R. Tewari, A. Ravichandran, S. Maji, C. C. Fowlkes, S. Soatto, and P. Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439, 2019.
- Adebayo et al. [2018] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. Advances in neural information processing systems, 31, 2018.
- Agrawal et al. [2019] A. Agrawal, J. Gans, and A. Goldfarb. The economics of artificial intelligence: an agenda. University of Chicago Press, 2019.
- Alvarez-Melis and Fusi [2020] D. Alvarez-Melis and N. Fusi. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33:21428–21439, 2020.
- Alvarez-Melis and Jaakkola [2018] D. Alvarez-Melis and T. S. Jaakkola. On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049, 2018.
- Andreas [2022] J. Andreas. Language models as agent models. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5769–5779, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.423. URL https://aclanthology.org/2022.findings-emnlp.423.
- Andrews et al. [2023] R. W. Andrews, J. M. Lilly, D. Srivastava, and K. M. Feigh. The role of shared mental models in human-ai teams: a theoretical review. Theoretical Issues in Ergonomics Science, 24(2):129–175, 2023. doi: 10.1080/1463922X.2022.2061080.
- Arora and Doshi [2021] S. Arora and P. Doshi. A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500, 2021.
- Ates et al. [2021] E. Ates, B. Aksar, V. J. Leung, and A. K. Coskun. Counterfactual explanations for multivariate time series. In 2021 International Conference on Applied Artificial Intelligence (ICAPAI), pages 1–8. IEEE, 2021.
- Bagdasaryan et al. [2019] E. Bagdasaryan, O. Poursaeed, and V. Shmatikov. Differential privacy has disparate impact on model accuracy. Advances in neural information processing systems, 32, 2019.
- Bai et al. [2022] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
- Baluta et al. [2021] T. Baluta, Z. L. Chua, K. S. Meel, and P. Saxena. Scalable quantitative verification for deep neural networks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 312–323. IEEE, 2021.
- Barda et al. [2020] N. Barda, G. Yona, G. N. Rothblum, P. Greenland, M. Leibowitz, R. Balicer, E. Bachmat, and N. Dagan. Addressing bias in prediction models by improving subpopulation calibration. Journal of the American Medical Informatics Association, 28(3):549–558, 11 2020. ISSN 1527-974X. doi: 10.1093/jamia/ocaa283. URL https://doi.org/10.1093/jamia/ocaa283.
- Barocas et al. [2023] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning: Limitations and Opportunities. Adaptive Computation and Machine Learning Series. MIT Press, 2023. ISBN 978-0-262-37652-5.
- Bau et al. [2020] D. Bau, J.-Y. Zhu, H. Strobelt, A. Lapedriza, B. Zhou, and A. Torralba. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences, 117(48):30071–30078, 2020.
- Bayerl and Paul [2011] P. S. Bayerl and K. I. Paul. What determines inter-coder agreement in manual annotations? a meta-analytic investigation. Computational Linguistics, 37(4):699–725, 2011.
- Belsley et al. [2005] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons, 2005.
- Ben-Gal [2005] I. Ben-Gal. Outlier detection. Data mining and knowledge discovery handbook, pages 131–146, 2005.
- Bender and Friedman [2018] E. M. Bender and B. Friedman. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604, 2018.
- Bernardi et al. [2019] L. Bernardi, T. Mavridis, and P. Estevez. 150 successful machine learning models: 6 lessons learned at booking.com. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1743–1751, 2019.
- Biden [2023] J. R. Biden. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence, 2023.
- Bills et al. [2023] S. Bills, N. Cammarata, D. Mossing, H. Tillman, L. Gao, G. Goh, I. Sutskever, J. Leike, J. Wu, and W. Saunders. Language models can explain neurons in language models, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
- Bitar et al. [2022] O. Bitar, B. Deshaies, and D. Hall. 3rd review of the treasury board directive on automated decision-making. Available at SSRN 4087546, 2022.
- Bondi et al. [2022] E. Bondi, R. Koster, H. Sheahan, M. Chadwick, Y. Bachrach, T. Cemgil, U. Paquet, and K. D. Dvijotham. Role of human-ai interaction in selective prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
- Brodley and Friedl [1999] C. E. Brodley and M. A. Friedl. Identifying mislabeled training data. Journal of artificial intelligence research, 11:131–167, 1999.
- Brown et al. [2021] D. S. Brown, J. Schneider, A. Dragan, and S. Niekum. Value alignment verification. In International Conference on Machine Learning, pages 1105–1115. PMLR, 2021.
- Brown et al. [2022] H. Brown, K. Lee, F. Mireshghallah, R. Shokri, and F. Tramèr. What does it mean for a language model to preserve privacy? In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 2280–2292, 2022.
- Buolamwini and Gebru [2018] J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018.
- Cabitza et al. [2017] F. Cabitza, R. Rasoini, and G. F. Gensini. Unintended consequences of machine learning in medicine. Jama, 318(6):517–518, 2017.
- Cai et al. [2023] K. Cai, X. Xiao, and G. Cormode. Privlava: synthesizing relational data with foreign keys under differential privacy. arXiv preprint arXiv:2304.04545, 2023.
- Canada Government [2022] Canada Government. Artificial Intelligence and Data Act. https://ised-isde.canada.ca/site/innovation-better-canada/en/artificial-intelligence-and-data-act, 6 2022. [Online; accessed 2024-02-27].
- Canada Government [2024a] Canada Government. Guide on the use of generative AI. https://www.canada.ca/en/government/system/digital-government/digital-government-innovations/responsible-use-ai/guide-use-generative-ai.html, feb 20 2024a. [Online; accessed 2024-02-27].
- Canada Government [2024b] Canada Government. Responsible use of artificial intelligence (AI). https://www.canada.ca/en/government/system/digital-government/digital-government-innovations/responsible-use-ai.html, feb 20 2024b. [Online; accessed 2024-02-27].
- Carlini et al. [2021] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson, et al. Extracting training data from large language models. In USENIX Security Symposium, volume 6, 2021.
- Carlini et al. [2022] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.
- Carlini et al. [2023] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramer, B. Balle, D. Ippolito, and E. Wallace. Extracting training data from diffusion models. arXiv preprint arXiv:2301.13188, 2023.
- Carroll et al. [2019] M. Carroll, R. Shah, M. K. Ho, T. L. Griffiths, S. A. Seshia, P. Abbeel, and A. Dragan. On the utility of learning about humans for human-ai coordination. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
- Chan et al. [2022] S. Chan, A. Santoro, A. Lampinen, J. Wang, A. Singh, P. Richemond, J. McClelland, and F. Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.
- Chen et al. [2019] C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su. This looks like that: deep learning for interpretable image recognition. Advances in neural information processing systems, 32, 2019.
- Chourasia and Shah [2023] R. Chourasia and N. Shah. Forget unlearning: Towards true data-deletion in machine learning, 2023.
- Chuang et al. [2020] C.-Y. Chuang, A. Torralba, and S. Jegelka. Estimating generalization under distribution shifts via domain-invariant representations. In International Conference on Machine Learning, pages 1984–1994. PMLR, 2020.
- Cila [2022] N. Cila. Designing human-agent collaborations: Commitment, responsiveness, and support. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–18, 2022.
- Cilevics [2020] B. Cilevics. Justice by algorithm–the role of artificial intelligence in policing and criminal justice systems. Report of Committee on Legal Affairs and Human Rights, Council of Europe, 2020.
- Cortes and Mohri [2011] C. Cortes and M. Mohri. Domain adaptation in regression. In Algorithmic Learning Theory: 22nd International Conference, ALT 2011, Espoo, Finland, October 5-7, 2011. Proceedings 22, pages 308–323. Springer, 2011.
- Cross and Ramsey [2021] E. S. Cross and R. Ramsey. Mind meets machine: Towards a cognitive science of human–machine interactions. Trends in cognitive sciences, 25(3):200–212, 2021.
- Cyberspace Administration of China [2023] Cyberspace Administration of China. Generative AI Guideline (draft). Technical report, Cyberspace Administration of China, 2023. URL http://www.cac.gov.cn/2023-04/11/c_1682854275475410.htm.
- Dai and Gifford [2023] Z. Dai and D. K. Gifford. Training data attribution for diffusion models. arXiv preprint arXiv:2306.02174, 2023.
- de Man et al. [2023] Y. de Man, Y. Wieland-Jorna, B. Torensma, K. de Wit, A. L. Francke, M. G. Oosterveld-Vlug, and R. A. Verheij. Opt-in and opt-out consent procedures for the reuse of routinely recorded health data in scientific research and their consequences for consent rate and consent bias: Systematic review. Journal of Medical Internet Research, 25:e42131, 2023.
- Dick et al. [2023] T. Dick, C. Dwork, M. Kearns, T. Liu, A. Roth, G. Vietri, and Z. S. Wu. Confidence-ranked reconstruction of census microdata from published statistics. Proceedings of the National Academy of Sciences, 120(8):e2218605120, 2023.
- Doshi-Velez and Glassman [2023] F. Doshi-Velez and E. Glassman. Contextual evaluation of ai: A new gold standard. Working Paper, 2023.
- Doshi-Velez and Kim [2017] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Dwork [2006] C. Dwork. Differential privacy. In Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Venice, Italy, July 10-14, 2006, Proceedings, Part II 33, pages 1–12. Springer, 2006.
- Dwork et al. [2014] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- Dwork et al. [2019] C. Dwork, N. Kohli, and D. Mulligan. Differential privacy in practice: Expose your epsilons! Journal of Privacy and Confidentiality, 9(2), 2019.
- Eckersley [2018] P. Eckersley. Impossibility and uncertainty theorems in ai value alignment (or why your agi should not have a utility function). arXiv preprint arXiv:1901.00064, 2018.
- Elbanhawi and Simic [2014] M. Elbanhawi and M. Simic. Sampling-based robot motion planning: A review. IEEE Access, 2:56–77, 2014.
- European Union Commission [2022] European Union Commission. Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts - General Approach. 14954/22, 2022.
- Feldman [2020] V. Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954–959, 2020.
- Feng et al. [2022] J. Feng, R. V. Phillips, I. Malenica, A. Bishara, A. E. Hubbard, L. A. Celi, and R. Pirracchio. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of ai algorithms in healthcare. npj Digital Medicine, 5(1):66, 2022.
- Field et al. [2012] Z. Field, J. Miles, and A. Field. Discovering Statistics Using R, pages 1–992, 2012.
- International Organization for Standardization [2022] Information security, cybersecurity and privacy protection — Evaluation criteria for IT security. Standard, International Organization for Standardization, Geneva, CH, 2022.
- Gajcin et al. [2022] J. Gajcin, R. Nair, T. Pedapati, R. Marinescu, E. Daly, and I. Dusparic. Contrastive explanations for comparing preferences of reinforcement learning. In AAAI Conference on Artificial Intelligence, 2022.
- Gama et al. [2014] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
- Gandikota et al. [2023] R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau. Erasing concepts from diffusion models. arXiv preprint arXiv:2303.07345, 2023.
- Gebru et al. [2021] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. Iii, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021.
- Gehr et al. [2018] T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. T. Vechev. Ai2: Safety and robustness certification of neural networks with abstract interpretation. 2018 IEEE Symposium on Security and Privacy (SP), pages 3–18, 2018.
- Geifman and El-Yaniv [2017] Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. Advances in neural information processing systems, 30, 2017.
- Ghorbani et al. [2019] A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim. Towards automatic concept-based explanations. Advances in Neural Information Processing Systems, 32, 2019.
- Ghosh et al. [2022] B. Ghosh, D. Basu, and K. S. Meel. Algorithmic fairness verification with graphical models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9539–9548, 2022.
- Government of Canada [2019] Government of Canada. Directive on Automated Decision-Making. Technical report, Government of Canada, 2019. URL https://www.tbs-sct.canada.ca/pol/doc-eng.aspx?id=32592.
- Guidotti [2022] R. Guidotti. Counterfactual explanations and how to find them: literature review and benchmarking. Data Mining and Knowledge Discovery, pages 1–55, 2022.
- Gupta et al. [2021] V. Gupta, C. Jung, S. Neel, A. Roth, S. Sharifi-Malvajerdi, and C. Waites. Adaptive machine unlearning. Advances in Neural Information Processing Systems, 34:16319–16330, 2021.
- Hadfield-Menell et al. [2017] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan. Inverse reward design. Advances in neural information processing systems, 30, 2017.
- Han et al. [2021] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, et al. Pre-trained models: Past, present and future. AI Open, 2:225–250, 2021.
- Hao [2022] K. Hao. A new vision of artificial intelligence for the people. https://www.technologyreview.com/2022/04/22/1050394/artificial-intelligence-for-the-people/, may 11 2022.
- Harnad [1990] S. Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1-3):335–346, 1990.
- Heinrich et al. [2018] B. Heinrich, D. Hristova, M. Klier, A. Schiller, and M. Szubartowicz. Requirements for data quality metrics. Journal of Data and Information Quality (JDIQ), 9(2):1–32, 2018.
- Heng and Soh [2023] A. Heng and H. Soh. Selective amnesia: A continual learning approach to forgetting in deep generative models. arXiv preprint arXiv:2305.10120, 2023.
- Hennig and Kutlukaya [2007] C. Hennig and M. Kutlukaya. Some thoughts about the design of loss functions. REVSTAT-Statistical Journal, 5(1):19–39, 2007.
- Holland et al. [2020] S. Holland, A. Hosny, S. Newman, J. Joseph, and K. Chmielinski. The dataset nutrition label. Data Protection and Privacy, 12(12):1, 2020.
- Hu et al. [2022] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- Hulten [2018] G. Hulten. Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress, 2018.
- Huq [2019] A. Z. Huq. A right to a human decision. Legal Anthropology: Laws & Constitutions eJournal, 2019. URL https://api.semanticscholar.org/CorpusID:181828093.
- Hussein et al. [2017] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):1–35, 2017.
- Ippolito et al. [2022] D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.
- Isaac [2017] W. S. Isaac. Hope, hype, and fear: the promise and potential pitfalls of artificial intelligence in criminal justice. Ohio St. J. Crim. L., 15:543, 2017.
- Jiang et al. [2023] H. H. Jiang, L. Brown, J. Cheng, M. Khan, A. Gupta, D. Workman, A. Hanna, J. Flowers, and T. Gebru. Ai art and its impact on artists. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 363–374, 2023.
- Karimi et al. [2021] A.-H. Karimi, B. Schölkopf, and I. Valera. Algorithmic recourse: from counterfactual explanations to interventions. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 353–362, 2021.
- Katz et al. [2017] G. Katz, C. W. Barrett, D. L. Dill, K. D. Julian, and M. J. Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, 2017.
- Kearns et al. [2018] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2564–2572. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/kearns18a.html.
- Kim et al. [2018] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, pages 2668–2677. PMLR, 2018.
- Kleinberg et al. [2018] J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. Human decisions and machine predictions. The quarterly journal of economics, 133(1):237–293, 2018.
- Koh et al. [2020] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang. Concept bottleneck models. In International Conference on Machine Learning, pages 5338–5348. PMLR, 2020.
- Kosinski et al. [2015] M. Kosinski, S. C. Matz, S. D. Gosling, V. Popov, and D. Stillwell. Facebook as a research tool for the social sciences: Opportunities, challenges, ethical considerations, and practical guidelines. American psychologist, 70(6):543, 2015.
- Kosmala et al. [2016] M. Kosmala, A. Wiggins, A. Swanson, and B. Simmons. Assessing data quality in citizen science. Frontiers in Ecology and the Environment, 14(10):551–560, 2016.
- Kreuzberger et al. [2023] D. Kreuzberger, N. Kühl, and S. Hirschl. Machine learning operations (mlops): Overview, definition, and architecture. IEEE Access, 2023.
- Krishna et al. [2023] S. Krishna, J. Ma, and H. Lakkaraju. Towards bridging the gaps between the right to explanation and the right to be forgotten. arXiv preprint arXiv:2302.04288, 2023.
- Kärkkäinen and Joo [2021] K. Kärkkäinen and J. Joo. Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1547–1557, 2021. doi: 10.1109/WACV48630.2021.00159.
- Lakkaraju et al. [2016] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1675–1684, 2016.
- LeCun [2022] Y. LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62, 2022.
- Li et al. [2006] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd international conference on data engineering, pages 106–115. IEEE, 2006.
- Liang et al. [2022] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
- Lundberg and Lee [2017a] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc., 2017a. URL http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
- Lundberg and Lee [2017b] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017b.
- MacGlashan et al. [2017] J. MacGlashan, M. K. Ho, R. Loftin, B. Peng, G. Wang, D. L. Roberts, M. E. Taylor, and M. L. Littman. Interactive learning from policy-dependent human feedback. In International Conference on Machine Learning, pages 2285–2294. PMLR, 2017.
- Machanavajjhala et al. [2007] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3–es, 2007.
- Madras et al. [2018] D. Madras, T. Pitassi, and R. Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in Neural Information Processing Systems, 31, 2018.
- Mandlekar et al. [2020] A. Mandlekar, D. Xu, R. Martín-Martín, Y. Zhu, L. Fei-Fei, and S. Savarese. Human-in-the-loop imitation learning using remote teleoperation. arXiv preprint arXiv:2012.06733, 2020.
- McMillan-Major et al. [2023] A. McMillan-Major, E. M. Bender, and B. Friedman. Data statements: From technical concept to community practice. ACM Journal on Responsible Computing, 2023.
- Mertes et al. [2022] S. Mertes, T. Huber, K. Weitz, A. Heimerl, and E. André. Ganterfactual—counterfactual explanations for medical non-experts using generative adversarial learning. Frontiers in artificial intelligence, 5, 2022.
- Mitchell et al. [2019] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pages 220–229, 2019.
- Mittelstadt et al. [2019] B. Mittelstadt, C. Russell, and S. Wachter. Explaining explanations in ai. In Proceedings of the conference on fairness, accountability, and transparency, pages 279–288, 2019.
- Molnar [2020] C. Molnar. Interpretable machine learning. Lulu.com, 2020.
- Moore [2019] P. V. Moore. Osh and the future of work: benefits and risks of artificial intelligence tools in workplaces. In Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. Human Body and Motion: 10th International Conference, DHM 2019, Held as Part of the 21st HCI International Conference, HCII 2019, Orlando, FL, USA, July 26–31, 2019, Proceedings, Part I 21, pages 292–315. Springer, 2019.
- Moos et al. [2022] J. Moos, K. Hansel, H. Abdulsamad, S. Stark, D. Clever, and J. Peters. Robust reinforcement learning: A review of foundations and recent advances. Machine Learning and Knowledge Extraction, 4(1):276–315, 2022.
- Mozannar and Sontag [2020] H. Mozannar and D. Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning, pages 7076–7087. PMLR, 2020.
- Nanda et al. [2023] N. Nanda, L. Chan, T. Liberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.
- Obermeyer et al. [2019] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.
- Olsson et al. [2022] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- Ong et al. [2019] D. C. Ong, J. Zaki, and N. D. Goodman. Computational models of emotion inference in theory of mind: A review and roadmap. Topics in cognitive science, 11(2):338–357, 2019.
- Ortiz-Barrios et al. [2023] M. Ortiz-Barrios, S. Arias-Fonseca, A. Ishizaka, M. Barbati, B. Avendaño-Collante, and E. Navarro-Jiménez. Artificial intelligence and discrete-event simulation for capacity management of intensive care units during the covid-19 pandemic: A case study. Journal of Business Research, 160:113806, 2023.
- Palan et al. [2019] M. Palan, G. Shevchuk, N. Charles Landolfi, and D. Sadigh. Learning reward functions by integrating human demonstrations and preferences. In Robotics: Science and Systems, 2019.
- Papernot et al. [2016] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar. Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755, 2016.
- Paun et al. [2022] S. Paun, R. Artstein, and M. Poesio. Statistical methods for annotation analysis. Synthesis Lectures on Human Language Technologies, 15(1):1–217, 2022.
- Pawelczyk et al. [2023] M. Pawelczyk, H. Lakkaraju, and S. Neel. On the privacy risks of algorithmic recourse. In International Conference on Artificial Intelligence and Statistics, pages 9680–9696. PMLR, 2023.
- Pineau et al. [2021] J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d’Alché Buc, E. Fox, and H. Larochelle. Improving reproducibility in machine learning research (a report from the neurips 2019 reproducibility program). The Journal of Machine Learning Research, 22(1):7459–7478, 2021.
- Pinto et al. [2017] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826. PMLR, 2017.
- Pleiss et al. [2020] G. Pleiss, T. Zhang, E. Elenberg, and K. Q. Weinberger. Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems, 33:17044–17056, 2020.
- Pushkarna et al. [2022] M. Pushkarna, A. Zaldivar, and O. Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible ai. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1776–1826, 2022.
- Raffel et al. [2019] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019.
- Rahimian and Mehrotra [2019] H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
- Räukur et al. [2022] T. Räukur, A. Ho, S. Casper, and D. Hadfield-Menell. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. arXiv preprint arXiv:2207.13243, 2022.
- Ren et al. [2021] P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, B. B. Gupta, X. Chen, and X. Wang. A survey of deep active learning. ACM computing surveys (CSUR), 54(9):1–40, 2021.
- Ribeiro et al. [2016] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
- Ribeiro et al. [2018] M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Richardson [2021] R. Richardson. Racial segregation and the data-driven society: How our failure to reckon with root causes perpetuates separate and unequal realities. Berkeley Tech. LJ, 36:1051, 2021.
- Ross et al. [2011] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011.
- Rudin [2019] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1(5):206–215, 2019.
- Russell et al. [2016] H. E. Russell, L. K. Harbott, I. Nisky, S. Pan, A. M. Okamura, and J. C. Gerdes. Motor learning affects car-to-driver handover in automated vehicles. Science Robotics, 1(1):eaah5682, 2016.
- Samarati and Sweeney [1998] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, technical report, SRI International, 1998.
- Samek and Müller [2019] W. Samek and K.-R. Müller. Towards explainable artificial intelligence. Explainable AI: interpreting, explaining and visualizing deep learning, pages 5–22, 2019.
- Sawaragi et al. [1985] Y. Sawaragi, H. Nakayama, and T. Tanino. Theory of multiobjective optimization. Elsevier, 1985.
- Schramowski et al. [2022] P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. arXiv preprint arXiv:2211.05105, 2022.
- Schuhmann et al. [2022] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Shen et al. [2023] X. Shen, Y. Wong, and M. Kankanhalli. Fair representation: Guaranteeing approximate multiple group fairness for unknown tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):525–538, 2023. doi: 10.1109/TPAMI.2022.3148905.
- Shokri et al. [2021] R. Shokri, M. Strobel, and Y. Zick. On the privacy risks of model explanations. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 231–241, 2021.
- Simonyan et al. [2013] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Singh et al. [2019] G. Singh, T. Gehr, M. Püschel, and M. Vechev. An abstract domain for certifying neural networks. Proceedings of the ACM on Programming Languages, 3(POPL):1–30, 2019.
- Singh et al. [2009] S. Singh, R. L. Lewis, and A. G. Barto. Where do rewards come from. In Proceedings of the annual conference of the cognitive science society, pages 2601–2606. Cognitive Science Society, 2009.
- Smith et al. [2022] A. Smith, R. Black, J. Davenport, J. Olszewska, J. Rößler, and J. Wright. Artificial Intelligence and Software Testing. BCS, The Chartered Institute for IT, 2022.
- Soh and Demiris [2011] H. Soh and Y. Demiris. Evolving policies for multi-reward partially observable markov decision processes (mr-pomdps). In Proceedings of the 13th annual conference on Genetic and evolutionary computation, pages 713–720, 2011.
- Sokol and Flach [2020] K. Sokol and P. Flach. One explanation does not fit all: The promise of interactive explanations for machine learning transparency. KI-Künstliche Intelligenz, 34(2):235–250, 2020.
- Sommer et al. [2020] D. M. Sommer, L. Song, S. Wagh, and P. Mittal. Towards probabilistic verification of machine unlearning. arXiv preprint arXiv:2003.04247, 2020.
- Sosa [1997] D. Sosa. Meaningful explanation. Philosophical Issues, 8:351–356, 1997.
- Srinivasan and Chander [2021] R. Srinivasan and A. Chander. Explanation perspectives from the cognitive sciences—a survey. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 4812–4818, 2021.
- Strobelt et al. [2017] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE transactions on visualization and computer graphics, 24(1):667–676, 2017.
- Sundar [2020] S. S. Sundar. Rise of machine agency: A framework for studying the psychology of human–ai interaction (haii). Journal of Computer-Mediated Communication, 25(1):74–88, 2020.
- Suresh and Guttag [2019] H. Suresh and J. V. Guttag. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002, 2(8), 2019.
- Tan et al. [2018] S. Tan, R. Caruana, G. Hooker, and Y. Lou. Distill-and-compare: Auditing black-box models using transparent model distillation. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 303–310, 2018.
- Tarun et al. [2023] A. K. Tarun, V. S. Chundawat, M. Mandal, and M. Kankanhalli. Fast yet effective machine unlearning. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- Teso and Kersting [2019] S. Teso and K. Kersting. Explanatory interactive machine learning. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 239–245, 2019.
- Tramèr et al. [2018] F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
- Tran and Fioretto [2023] C. Tran and F. Fioretto. Data minimization at inference time, 2023.
- Tsirtsis and Gomez Rodriguez [2020] S. Tsirtsis and M. Gomez Rodriguez. Decisions, counterfactual explanations and strategic behavior. Advances in Neural Information Processing Systems, 33:16749–16760, 2020.
- Tukey et al. [1977] J. W. Tukey et al. Exploratory data analysis, volume 2. Reading, MA, 1977.
- Tyler et al. [2023] C. Tyler, K. Akerlof, A. Allegra, Z. Arnold, H. Canino, M. A. Doornenbal, J. A. Goldstein, D. Budtz Pedersen, and W. J. Sutherland. Ai tools as science policy advisers? the potential and the pitfalls. Nature, 622(7981):27–30, 2023.
- Urbina et al. [2022] F. Urbina, F. Lentzos, C. Invernizzi, and S. Ekins. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189–191, 2022.
- Vyas et al. [2023] N. Vyas, S. Kakade, and B. Barak. Provable copyright protection for generative models. arXiv preprint arXiv:2302.10870, 2023.
- Wachter et al. [2017] S. Wachter, B. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the gdpr. Harv. JL & Tech., 31:841, 2017.
- Wang et al. [2023] K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. In International Conference on Learning Representation, 2023.
- Ware et al. [2001] M. Ware, E. Frank, G. Holmes, M. Hall, and I. H. Witten. Interactive machine learning: letting users build classifiers. International Journal of Human-Computer Studies, 55(3):281–292, 2001.
- WEF [a] WEF. Unpacking AI Procurement in a Box: Insights from Implementation. https://www.weforum.org/publications/unpacking-ai-procurement-in-a-box-insights-from-implementation/, May 9 2022a.
- WEF [b] WEF. AI procurement in the public sector: Lessons from Brazil. https://www.weforum.org/agenda/2022/05/the-brazilian-case-for-ai-procurement-in-a-box/, Oct 6 2023b.
- Wei et al. [2022] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.
- White House Office of Science and Technology Policy [2022] White House Office of Science and Technology Policy. Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People. White House, 2022. URL https://www.whitehouse.gov/ostp/ai-bill-of-rights.
- Wood et al. [2018] A. Wood, M. Altman, A. Bembenek, M. Bun, M. Gaboardi, J. Honaker, K. Nissim, D. R. O’Brien, T. Steinke, and S. Vadhan. Differential privacy: A primer for a non-technical audience. Vand. J. Ent. & Tech. L., 21:209, 2018.
- World Economic Forum [2020a] World Economic Forum. AI Procurement in a Box. Technical report, World Economic Forum, 2020a. URL https://www.weforum.org/reports/ai-procurement-in-a-box/.
- World Economic Forum [2020b] World Economic Forum. AI Procurement in a Box: Pilot case studies from the United Kingdom. https://www.weforum.org/publications/ai-procurement-in-a-box/pilot-uk/, 6 2020b. [Online; accessed 2024-02-26].
- Wu et al. [2021] T. Wu, M. T. Ribeiro, J. Heer, and D. S. Weld. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6707–6723, 2021.
- Xiao et al. [2022] Y. Xiao, I. Beschastnikh, Y. Lin, R. S. Hundal, X. Xie, D. S. Rosenblum, and J. S. Dong. Self-checking deep neural networks for anomalies and adversaries in deployment. IEEE Transactions on Dependable and Secure Computing, 2022.
- Xu et al. [2021] Z. Xu, X. Shen, Y. Wong, and M. S. Kankanhalli. Unsupervised motion representation learning with capsule autoencoders. Advances in Neural Information Processing Systems, 34:3205–3217, 2021.
- Yap [2021] R. H. Yap. Towards certifying trustworthy machine learning systems. In Trustworthy AI-Integrating Learning, Optimization and Reasoning: First International Workshop, TAILOR 2020, Virtual Event, September 4–5, 2020, Revised Selected Papers 1, pages 77–82. Springer, 2021.
- Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Ziegler et al. [2019] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
- Zilka et al. [2022] M. Zilka, H. Sargeant, and A. Weller. Transparency, governance and regulation of algorithmic tools deployed in the criminal justice system: a uk case study. In Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society, pages 880–889, 2022.