Taxonomy of Machine Learning Safety: A Survey and Primer
Abstract.
The open-world deployment of Machine Learning (ML) algorithms in safety-critical applications such as autonomous vehicles needs to address a variety of ML vulnerabilities, including limited interpretability, verifiability, and performance. Research explores different approaches to improve ML dependability by proposing new models and training techniques to reduce generalization error, achieve domain adaptation, and detect outlier examples and adversarial attacks. However, there is a missing connection between ongoing ML research and well-established safety principles. In this paper, we present a structured and comprehensive review of ML techniques to improve the dependability of ML algorithms in uncontrolled open-world settings. From this review, we propose the Taxonomy of ML Safety that maps state-of-the-art ML techniques to key engineering safety strategies. Our taxonomy of ML safety presents a safety-oriented categorization of ML techniques to provide guidance for improving the dependability of ML design and development. The proposed taxonomy can serve as a safety checklist to aid designers in improving the coverage and diversity of safety strategies employed in any given ML system.
1. Introduction
Advancements in machine learning (ML) have been one of the most significant innovations of the last decade. Among different ML models, Deep Neural Networks (DNNs) (LeCun et al., 2015) are well known and widely used for their powerful representation learning from high-dimensional data such as images, texts, and speech. However, as ML algorithms enter sensitive real-world domains with trustworthiness, safety, and fairness prerequisites, the need for corresponding techniques and metrics for high-stakes domains is more noticeable than before. Hence, researchers in different fields propose guidelines for Trustworthy AI (Shneiderman, 2020), Safe AI (Amodei et al., 2016), and Explainable AI (Mohseni et al., 2018) as stepping stones for the next generation of Responsible AI (Arrieta et al., 2020). Furthermore, government reports and regulations on AI accountability (Goodman and Flaxman, 2017), trustworthiness (Smuha, 2019), and safety (Cluzeau et al., 2020) are gradually creating binding laws to protect citizens’ data privacy rights, ensure fair data processing, and uphold safety for AI-based products.
The development and deployment of ML algorithms for open-world tasks come with reliability and dependability challenges rooted in model performance, robustness, and uncertainty limitations (Mohseni et al., 2019). Unlike traditional code-based software, ML models have fundamental safety drawbacks, including performance limitations on their training set and run-time robustness constraints in their operational domain. For example, ML models are fragile to unprecedented domain shifts (Ganin and Lempitsky, 2014) that can easily occur in open-world scenarios. Data corruptions and natural perturbations (Hendrycks and Dietterich, 2019) are other factors affecting ML models. Moreover, from the security perspective, it has been shown that DNNs are susceptible to adversarial attacks that make small perturbations to the input sample (indistinguishable by the human eye) but can fool a DNN (Goodfellow et al., 2015). Due to the lack of verification techniques for DNNs, validation of ML models is often limited to performance measures on standardized test sets and end-to-end simulations in the operational design domain. Realizing that dependable ML models are required to achieve safety, we observe the need to investigate gaps and opportunities between conventional engineering safety standards and ML safety-related techniques.

1.1. Scope, Organization, and Survey Method
ML safety includes diverse hardware and software techniques for the safe execution of algorithms in open-world applications (Koopman and Wagner, 2017). In this paper, we limit our scope to the design of ML algorithms rather than the execution of those algorithms on hardware platforms. Moreover, we mainly focus on “in situ” techniques that improve run-time dependability rather than techniques for the efficiency of the network or its training.
We used a structured and iterative methodology to find ML safety-related papers and categorize this research as summarized in Table 1. In our iterative paper selection process, we started by reviewing key research papers from the AI and ML safety literature (e.g., (Amodei et al., 2016; Leike et al., 2017; Varshney, 2016)) and software safety literature and standards (e.g., (for Standardization, 2011, 2019; Salay et al., 2017)) to identify mutual safety attributes between engineering safety and ML techniques. Next, we conducted an upward and downward literature investigation using top computer science conference proceedings, journal publications, and the Google Scholar search engine to maintain reasonable literature coverage and balance the number of papers on each ML safety attribute.
Figure 1 presents the overall organization of this paper. We first review the background on common safety terminologies and situate ML safety limitations with reference to conventional engineering safety requirements in Section 2. In Section 3, we discuss a unified “big picture” of different ML error types for real-world applications and common benchmark datasets used to evaluate models for these errors. Next, we propose an ML safety taxonomy in Section 4 to organize ML techniques into safety strategies, with Table 1 illustrating the taxonomy and summarizing representative papers for each subcategory. Sections 5, 6, and 7 form the main body of the review, organized into ML solutions and techniques for each safety strategy. Finally, Section 8 presents a summary of key takeaways and a discussion of open problems and research directions for ML safety.
1.2. Objectives and Contributions
In this paper, we review challenges and opportunities to achieve ML safety for open-world safety-critical applications. We first review dependability limitations and challenges for ML algorithms in comparison to engineering safety standard requirements. Then, we decompose ML dependability needs into three safety strategies: (1) achieving inherently safe ML design, (2) improving model performance and robustness, and (3) building run-time error detection solutions for ML. Following our categorization of safety strategies, we present a structured and comprehensive review of 300 papers from a broad spectrum of state-of-the-art ML research and safety literature. We propose a unifying taxonomy (Table 1) that serves ML researchers and designers as a collection of best practices and allows checking the coverage and diversity of safety strategies employed in any given ML system. Additionally, the taxonomy of ML safety lays out a road map of safety needs in ML and aids in assessing technology readiness for each safety strategy. We review open challenges and opportunities for each strategy and present a summary of key takeaways at the end.
2. Background
In order to introduce and categorize ML safety techniques, we start by reviewing background on engineering safety strategies and investigating safety gaps between the design and development of code-based software and ML algorithms.
2.1. Related Surveys
Related survey papers dive into ML and AI safety topics to analyze the problem domain, review existing solutions, and make suggestions on future directions (Amodei et al., 2016; Leike et al., 2017). Survey papers cover diverse topics including safety-relevant characteristics in reinforcement learning (Hernández-Orallo et al., 2019), verification of ML components (Huang et al., 2017; Wang et al., 2018b), adversarial robustness (Huang et al., 2020a), anomaly detection (Salehi et al., 2021), and ML uncertainty (McAllister et al., 2017), and aim to relate well-established engineering safety principles to ML safety limitations (Varshney, 2016; Mohseni et al., 2019). Hendrycks et al. (2021) introduce four major research problems to improve ML safety, namely robustness, monitoring, alignment, and external safety for ML models. Shneiderman (2020) presents high-level guidelines for teams, organizations, and industries to increase the reliability, safety, and trustworthiness of next-generation Human-Centered AI systems.
More recently, multiple surveys present a holistic review of ML promises and pitfalls for safety-critical autonomous systems. For instance, Ashmore et al. (2021) present a systematic view of the four-stage ML lifecycle, including data management, model training, model verification, and deployment. The authors present itemized safety assurance requirements for each stage and review methods that support each requirement. In a later work, Hawkins et al. (2021) add ML safety assurance scoping and safety requirements elicitation stages to the ML lifecycle to establish the fundamental link between system-level hazard and risk analysis and unit-level safety requirements. In a broader context, Lu et al. (2021) study challenges and limitations of existing ML system development tools and platforms (MLOps) in achieving Responsible AI principles such as data privacy, transparency, and safety. The authors report their findings as a list of operationalized Responsible AI principles with their benefits and drawbacks.
Although prior work has targeted different aspects and characteristics of ML safety and dependability, in this paper, we elaborate on the ML safety concept by situating open-world safety challenges within ongoing ML research. In particular, we bridge ML safety concerns between the engineering and research communities to uncover mutual goals and accelerate safety developments.
2.2. Related Terminologies
We introduce terminologies related to ML safety by clarifying the relationship between ML Safety, Security, and Dependability, which are often used interchangeably in the literature. Safety is a System-Level concept comprising a set of processes and strategies to minimize the risk of hazards due to malfunctioning of system components. Safety standards such as IEC 61508 (Commission, 2000) and ISO 26262 (for Standardization, 2011) mandate complete analysis of hazards and risks, documentation of system architecture and design, a detailed development process, and thorough verification strategies for each component, the integration of components, and final system-level testing. Dependability is a Unit-Level concept to ensure the performance and robustness of the software in its operational domain. We define ML dependability as the model’s ability to minimize test-time prediction error. Therefore, a highly dependable ML algorithm is expected to be robust to natural distribution shifts within its intended operational design domain. Security is both a System-Level and a Unit-Level concept to protect the system from harm or other undesirable outcomes (e.g., data theft, privacy violation) caused by adversaries. Note that engineering guidelines distinguish safety hazards (e.g., due to natural perturbations) from security hazards (e.g., due to adversarial perturbations), as the latter intentionally exploit system vulnerabilities to cause harm. However, the term safety is often loosely used in the ML literature to refer to the dependability of algorithms against adversaries (Huang et al., 2020a).
In this paper, we focus on unit-level strategies to maintain the dependability of ML algorithms in an intelligent system rather than the safety of a complex AI-based system as a whole. We also cover adversarial training and detection techniques as a part of unit-level safety strategies regardless of the role of the adversary in generating the attack.
2.3. Engineering Safety Limitations in ML
Engineering safety broadly refers to the management of operations and events in a system in order to protect its users by minimizing hazards, risks, and accidents. Given the importance of the dependability of the system’s internal components (hardware and software), various engineering safety standards have been developed to ensure the system’s functional safety based on the two fundamental principles of the safety life cycle and failure analysis. Built on a collection of best practices, engineering safety processes discover and eliminate design errors, followed by a probabilistic analysis of the safety impact of possible system failures (i.e., failure analysis). Several efforts have attempted to extend engineering safety standards to ML algorithms (Salay et al., 2017; Cluzeau et al., 2020). For example, the European Union Aviation Safety Agency released a report on concepts of design assurance for neural networks (Cluzeau et al., 2020) that introduces safety assurance and assessment for learning algorithms in safety-critical applications. In another work, Siebert et al. (Siebert et al., 2020) present a guideline to assess ML system quality from aspects specific to ML algorithms, including data, model, environment, system, and infrastructure, in an industrial use case. However, the main body of engineering standards does not account for the statistical nature of ML algorithms and errors occurring due to the inability of the components to comprehend the environment. In a recent review of automotive functional safety for ML-based software, Salay et al. (Salay et al., 2017) present an analysis showing that about 40% of software safety methods do not apply to ML models.
Given the dependability limitations of ML algorithms and the lack of adaptability of traditional software development standards, we identify five open safety challenges for ML and briefly review active research topics for closing these safety gaps in the following. We extensively review the techniques for each challenge later in Sections 5, 6, and 7.
2.3.1. Design Specification
Documenting and reviewing the software specification is a crucial step in engineering safety; however, formal design specification of ML models is generally not feasible, as the models learn patterns from large training sets to discriminate (or generate) their distributions for new unseen inputs. Therefore, ML algorithms learn the target classes through their training data (and regularization constraints) rather than formal specification. The lack of specifiability could cause a mismatch between “designer objectives” and “what the model actually learned”, which could result in unintended functionality of the system. The data-driven optimization of model variables in ML training makes it challenging to define and pose specific safety constraints. Seshia et al. (Seshia et al., 2018) survey the landscape of formal specification for DNNs to lay an initial foundation for formalizing and reasoning about properties of DNNs. To fill this gap, a common practice is to achieve partial design specification through training data specification and coverage. Another practical way to overcome the design specification problem is to break ML components into smaller algorithms (with smaller tasks) that work in a hierarchical structure. In the case of intelligent agents, safety-enforcing regularization terms (Amodei et al., 2016) and simulation environments (Brockman et al., 2016) are suggested to specify and verify training goals for the agent.
2.3.2. Implementation Transparency
Implementation transparency is an important requirement in engineering safety, as it gives the ability to trace design requirements back from the implementation. However, advanced ML models trained on high-dimensional data are not transparent. The very large number of variables in these models makes them incomprehensible, or so-called black boxes, for design review and inspection. In order to achieve traceability, significant research has been performed on interpretability methods for DNNs that provide instance explanations of model predictions and of intermediate feature layers (Zeiler and Fergus, 2014). In the autonomous vehicle application, Bojarski et al. (2018) propose the VisualBackProp technique and show that a DNN trained to control a steering wheel does in fact learn patterns of lanes, road edges, and parked vehicles to execute the targeted task. However, the completeness of interpretability methods to grant traceability has not been proven yet (Adebayo et al., 2018), and in practice, interpretability techniques are mainly used by designers to improve the network structure and training process rather than to support a safety assessment.
2.3.3. Testing and Verification.
Design and implementation verification is another demanding requirement for unit testing to meet engineering safety standards. For example, coding guidelines for software safety enforce the elimination of dead or unreachable functions. Depending on the safety integrity level, complete statement coverage, branch coverage, or modified condition/decision coverage is required to confirm the adequacy of the unit tests. For DNNs, formally verifying correctness is challenging and in fact provably NP-hard (Seshia et al., 2016) due to the high dimensionality of the data. Therefore, reaching complete testing and verification of the operational design domain is not feasible for domains like image and video. As a result, researchers have proposed new techniques such as searching for unknown-unknowns (Bansal and Weld, 2018), predictor-verifier training (Dvijotham et al., 2018), and simulation-based toolkits (Dreossi et al., 2019b) guided by formal models and specifications. Other techniques, such as neuron coverage and fuzz testing for neural networks (Wang et al., 2018b), complement these approaches. Note that formal verification of shallow and linear models for low-dimensional sensor data does not carry the verification challenges of the image domain.
2.3.4. Performance and Robustness
Engineering safety standards treat ML models as black boxes and suggest using methods to improve model performance and robustness. However, improving model performance and robustness is still an open problem and a vast research topic. Unlike code-based algorithms, statistical learning algorithms typically retain a residual error rate (due to false positive and false negative predictions) on the test set. In addition to the error rate on the test set, the model’s error rate in open-world deployment is referred to as operational error. Section 6 reviews various approaches, such as larger networks, training regularization, active learning and data collection, and domain generalization techniques, to increase the model’s ability to learn generalizable representations for open-world applications.
2.3.5. Run-time Monitoring
Engineering safety standards suggest run-time monitoring functions as preventive solutions for various system errors, including less frequent transient errors. Monitoring functions in code-based algorithms are based on a rule set to detect hardware errors and software crashes in the target operational domain. However, designing monitoring functions to predict ML errors (e.g., false positive and false negative predictions) is different in nature. ML models generate prediction probabilities that could be used to estimate uncertainty for run-time validation of predictions. However, research shows that the prediction probability of complex models like DNNs does not fully represent uncertainty and hence cannot guarantee failure prediction (Hein et al., 2019). Section 7 reviews different approaches for run-time uncertainty estimation and detection of outlier samples and adversarial attacks.
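As a minimal illustration of this idea (and of its limitation), the sketch below wraps a classifier with a confidence-threshold monitor that abstains on low-confidence predictions. The model, inputs, and threshold are placeholders, and, as noted above, raw softmax confidence is often poorly calibrated, so this should be read as an illustrative baseline rather than a safety mechanism.

```python
import torch
import torch.nn.functional as F

def monitored_predict(model, x, threshold=0.9):
    """Simple run-time monitor: return the predicted class only when the
    softmax confidence exceeds a threshold, otherwise abstain (None).
    Softmax confidence is known to be unreliable for DNNs, so this is
    only an illustrative baseline, not a failure-prediction guarantee."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)          # class probabilities
        confidence, prediction = probs.max(dim=1)   # top-1 confidence and label
    # Abstain (defer to a fallback system or human) on low-confidence inputs.
    return [p.item() if c >= threshold else None
            for p, c in zip(prediction, confidence)]
```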
3. ML Dependability
We define ML dependability as the model’s ability to minimize prediction risk on a given test set. Unlike code-based algorithms, the dependability of ML algorithms is bounded by the model’s learning capacity and statistical assumptions such as the independent and identically distributed (i.i.d.) relation between source and target domains. However, maintaining data distribution assumptions when deployed in the open world is challenging and results in different types of prediction errors.
In this section, we decompose model dependability limitations into three prediction error types, (i) Generalization Error, (ii) Distributional Error, and (iii) Adversarial Error, forming a unified “big picture” for dependable and robust ML models in the open world. Additionally, we review benchmark datasets commonly used for evaluating model dependability.
3.1. Generalization Error
The first and foremost goal of machine learning is to minimize the Generalization Error. Given a hypothesis $h$ (e.g., a model with learned parameters), the generalization error (also known as the true error and denoted as $R(h)$) is defined as the expected error of $h$ on the data distribution $\mathcal{D}$ (Mohri et al., 2018): $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathbb{1}\left(h(x) \neq y\right)\right]$, where $(x, y)$ is a pair of data and label sampled from $\mathcal{D}$, and $\mathbb{1}(\cdot)$ is the indicator function. However, the generalization error is not directly computable since $\mathcal{D}$ is usually unknown. The de facto practical solution is to learn $h$ by empirical risk minimization (ERM) on the training set and then estimate its generalization error by the empirical error on the holdout test set. Formally, the empirical error is defined as the mean error on a finite set $S$ of $n$ data points (Mohri et al., 2018): $\hat{R}_{S}(h) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\left(h(x_i) \neq y_i\right)$, where $(x_i, y_i) \in S$. The training and test sets are both sampled from the same distribution $\mathcal{D}$ but are disjoint.
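For concreteness, the following minimal sketch computes the holdout estimate $\hat{R}_{S}(h)$ as the mean 0/1 loss on a test set; the `model.predict` interface and array inputs are hypothetical placeholders.

```python
import numpy as np

def empirical_error(model, x_test, y_test):
    """Estimate the generalization error R(h) by the empirical error on a
    holdout test set: the mean of the 0/1 loss over its n samples."""
    y_pred = model.predict(x_test)      # hypothetical predict() interface
    return np.mean(y_pred != y_test)    # fraction of mispredictions
```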
Recent years have witnessed the successful application of this holdout evaluation methodology to monitoring the progress of many ML fields, especially where large-scale labeled datasets are available. The generalization error can be affected by many factors, such as training set quality (e.g., imbalanced class distribution (Haixiang et al., 2017), noisy labels (Bouguelia et al., 2018)), model learning capacity, and training method (e.g., using pre-training (Mahajan et al., 2018) or regularization (Bansal et al., 2018)).
Benchmark Datasets:
Model generalization is commonly evaluated on a separate i.i.d. test set provided with the dataset. However, recent research has found limitations of this evaluation strategy. For example, Wang et al. (2020b) showed that the fixed ImageNet (Deng et al., 2009) test set is not sufficient to reliably evaluate the generalization ability of state-of-the-art image classifiers due to its insufficiency in representing the rich visual open world. In another work, Tsipras et al. (2020) observed that a noisy data collection pipeline could lead to a systematic misalignment between the training sets and the real-world tasks.
3.2. Distributional Error
We define Distributional Error as the increase in model generalization error when the i.i.d. assumption between the source training set and target test set is violated. Eliminating distributional error is particularly important for real-world applications because the i.i.d. assumption is frequently violated in uncontrolled settings. In other words, we will have $p_{train}(x, y) \neq p_{test}(x, y)$, where $p_{train}(x, y)$ and $p_{test}(x, y)$ are the joint probability density distributions of data and label on the training and test distributions, respectively. Such a mismatch between training and test data distributions is known as Distributional Shift (also termed Dataset Shift (Quiñonero-Candela et al., 2009) or Domain Shift). In the following, we review the three most common roots of distribution shifts and their benchmark datasets.
Covariate Shift
refers to a change in the test distribution of the input covariates compared to the training distribution, i.e., $p_{train}(x) \neq p_{test}(x)$, while the labeling function remains the same, i.e., $p_{train}(y|x) = p_{test}(y|x)$. Covariate shift may occur due to natural perturbations (e.g., weather and lighting changes), data changes over time (e.g., seasonal variations of data), and even more subtle digital corruptions on images (e.g., JPEG compression and low saturation).
Label Distribution Shift
is the scenario when the marginal distribution of the label $y$ changes while the class-conditional distribution $p(x|y)$ remains the same. Label distribution shift is also known as prior probability shift and is formally defined as $p_{train}(y) \neq p_{test}(y)$ with $p_{train}(x|y) = p_{test}(x|y)$. Label distribution shift is typically a concern in applications where the label is the causal variable for the observed features (Lipton et al., 2018). For example, a model trained to predict pneumonia (i.e., label $y$) from chest X-ray data (i.e., features $x$) collected during summer time (when $p(y)$ is low) is still required to be accurate on patients (i.e., new inputs) visiting in winter time (when $p(y)$ is high), regardless of the label distribution shift. Long-tailed distribution (Van Horn and Perona, 2017) is a special case of label distribution shift where the training set follows a long-tailed distribution but the test set is balanced (i.e., roughly follows a uniform distribution).
Out-of-Distribution Samples
are test-time inputs that are outliers to the training set without any semantic content shared with the training distribution, which places them beyond reasonably foreseeable domain shifts. For example, given a model trained to recognize handwritten characters in English, a Roman character with a completely disjoint label space is an Out-of-Distribution (OOD) test sample. OOD detection (Hendrycks and Gimpel, 2016a) is a common approach to detect such outlier samples so that their predictions can be abstained (see Section 7.2 for details).
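A widely used baseline in this line of work is the maximum softmax probability (MSP) score of Hendrycks and Gimpel (2016a). The sketch below scores in-distribution and OOD samples with MSP and measures their separability with AUROC, the standard OOD-detection metric; the model and data loaders are placeholders.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def msp_scores(model, loader):
    """Maximum softmax probability (MSP) per input; lower values suggest
    the model is less familiar with the sample."""
    model.eval()
    scores = []
    for x, _ in loader:                     # loader yields (inputs, labels)
        probs = F.softmax(model(x), dim=1)
        scores.append(probs.max(dim=1).values)
    return torch.cat(scores)

def ood_auroc(model, id_loader, ood_loader):
    """AUROC of MSP for separating in-distribution (label 1) from OOD (label 0)."""
    id_s = msp_scores(model, id_loader)
    ood_s = msp_scores(model, ood_loader)
    labels = torch.cat([torch.ones_like(id_s), torch.zeros_like(ood_s)])
    scores = torch.cat([id_s, ood_s])
    return roc_auc_score(labels.cpu().numpy(), scores.cpu().numpy())
```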
Benchmark Datasets for Covariate Shift:
Several variants of the ImageNet dataset have been introduced to benchmark distributional error (i.e., to evaluate robustness against distributional shifts) when the model is trained on the original ImageNet dataset. Hendrycks and Dietterich (2019) introduce two variants of the original ImageNet validation set: the ImageNet-C benchmark for input corruption robustness and the ImageNet-P dataset for input perturbation robustness. ImageNet-A (Hendrycks et al., 2019e) sorts out unmodified samples from the ImageNet test set that falsify state-of-the-art image classifiers. Hendrycks et al. (2020) present a series of benchmarks for measuring model robustness to variations in image renditions (ImageNet-R benchmark), imaging time or geographic location (StreetView benchmark), and object size, occlusion, camera viewpoint, and zoom (DeepFashion Remixed benchmark). Recht et al. (2019) collected ImageNet-V2 using the same data source and collection pipeline as the original ImageNet paper (Deng et al., 2009). This new benchmark leads to the observation that the prediction accuracy of even the best image classifiers is still highly sensitive to minutiae of the test set distribution and extensive hyperparameter tuning. Shifts (Malinin et al., 2021) is another recent benchmark dataset for distributional shifts beyond computer vision tasks.
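ImageNet-C-style evaluation aggregates the classifier's error over corruption types and severities, normalized by a baseline model's errors (AlexNet in the original protocol of Hendrycks and Dietterich, 2019). The sketch below is a hedged illustration of that mean corruption error (mCE) aggregation over precomputed error rates; the exact normalization and severity levels should be checked against the original benchmark.

```python
def mean_corruption_error(errors, baseline_errors):
    """Compute mCE from per-corruption, per-severity top-1 error rates.

    `errors` and `baseline_errors` map a corruption name (e.g.,
    'gaussian_noise') to a list of error rates over severities 1-5.
    Each corruption error is normalized by the baseline model's error
    (AlexNet in the original ImageNet-C protocol) before averaging.
    """
    ces = []
    for corruption, errs in errors.items():
        base = baseline_errors[corruption]
        ces.append(sum(errs) / sum(base))   # normalized corruption error CE_c
    return sum(ces) / len(ces)              # mean over corruption types
```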
Benchmark Datasets for Label Distribution Shift:
Synthetic label distribution shift is a common benchmarking method in which the test set is manually sampled according to a predefined target label distribution that is different from the source label distribution (Lipton et al., 2018). Wu et al. (2021) provide an example of a real-world label distribution shift benchmark for the text domain.
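As a concrete example of such synthetic benchmarking, the sketch below resamples a labeled test set so that its label marginal matches a predefined target distribution while leaving the class-conditional data unchanged; the array names and sampling choices are illustrative, not a specific benchmark's protocol.

```python
import numpy as np

def resample_label_shift(x_test, y_test, target_prior, size, seed=0):
    """Build a label-shifted test set: draw roughly `size` points so that the
    label marginal p(y) follows `target_prior` while p(x|y) is unchanged.
    `target_prior[c]` is the desired probability of class c."""
    rng = np.random.default_rng(seed)
    idx = []
    for c, p_c in enumerate(target_prior):
        pool = np.flatnonzero(y_test == c)               # samples of class c
        n_c = int(round(p_c * size))                     # per-class count (approximate)
        idx.append(rng.choice(pool, n_c, replace=True))  # keep p(x|y) fixed
    idx = rng.permutation(np.concatenate(idx))
    return x_test[idx], y_test[idx]
```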
Benchmark Datasets for OOD Detection:
Test sets of natural images with a disjoint label space are typically used for benchmarking OOD detection. For example, a model trained on the CIFAR10 dataset may use ImageNet samples that do not overlap with CIFAR10 labels as OOD test samples. ImageNet-O (Hendrycks et al., 2019e), containing 2000 images from 200 classes within ImageNet-22k and outside ImageNet, is an example OOD test set for models trained on the ImageNet-1k dataset. Hendrycks et al. (Hendrycks et al., 2019a) present three large-scale and high-resolution OOD detection benchmarks for multi-class and multi-label image classification, object detection, and semantic segmentation, respectively. Yang et al. (2021) present semantically coherent OOD (SC-OOD) benchmarks for the CIFAR10 and CIFAR100 datasets. In another work, Chan et al. (2021) present benchmarks for anomalous object segmentation and road obstacle segmentation.
3.3. Adversarial Error
Adversarial Error is model misprediction due to synthetic perturbations (termed adversarial perturbations) added to the original clean sample. An adversarial attack is the act of generating adversarial perturbations to cause intentional model mispredictions while keeping the semantic meaning of the clean sample identical. Different forms of adversarial attacks have been studied on different types of data. On image data, typical forms include norm-constrained additive perturbations (Goodfellow et al., 2015), spatial perturbations (Xiao et al., 2018b), and semantically meaningful perturbations (Qiu et al., 2020). Beyond image data, adversarial attacks can also be designed by altering the shape of 3D surfaces (Xiao et al., 2019a), by replacing words with synonyms (Ren et al., 2019b) or rephrasing the sentence (Iyyer et al., 2018) in natural language data, or by applying adversarial printable 2D patches to real-world physical objects (Brown et al., 2017).
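For concreteness, the Fast Gradient Sign Method of Goodfellow et al. (2015) is the canonical example of a norm-constrained additive perturbation; a minimal PyTorch sketch is shown below, with the model, inputs, and perturbation budget as placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """Fast Gradient Sign Method (Goodfellow et al., 2015): take one step of
    size epsilon in the direction of the sign of the loss gradient w.r.t.
    the input, producing an L-infinity bounded adversarial example."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()  # additive perturbation
        x_adv = x_adv.clamp(0.0, 1.0)                # keep a valid image range
    return x_adv.detach()
```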
Benchmark Datasets:
Evaluating adversarial error (also known as adversarial robustness) is usually done by measuring empirical performance (e.g., accuracy in image classification tasks) on a set of adversarial samples. However, the key requirement for a faithful evaluation is to use strong and diverse unseen attacks to break the model. There are two commonly used strategies to achieve this goal. First, an ensemble of multiple strong and diverse adversarial attacks should be simultaneously used for evaluation. For instance, AutoAttack (Croce and Hein, 2020) consists of four state-of-the-art white-box attacks and two state-of-the-art black-box attacks. Kang et al. (2019) present another setting by creating evaluation benchmarks of ImageNet-UA and CIFAR-10-UA, which contain both constrained adversarial attacks and real-world adversarial attacks such as worst-case image fogging. Second, the attacks should be carefully designed to prevent the “gradient obfuscation” effect (Athalye et al., 2018; Carlini and Wagner, 2017). Since the success of traditional white-box attacks depends on the accurate calculation of model gradients, they may fail if the model gradients are not easily accessible (e.g., the model has non-differential operations). As a result, evaluating model robustness on such attacks may provide a false sense of robustness. Athalye et al. (2018) proposed three enhancements for traditional white-box attacks as solutions for common causes of gradient obfuscation. Other solutions include designing adaptive attacks for each specific defense strategy (Tramer et al., 2020) (see Section 7.3 for details).
4. ML Safety Taxonomy
Looking at the fundamental limitations of code-based software safety for machine learning on one hand, and the research debt in AI safety on the other, we review and organize practical ML solutions into a taxonomy for ML safety. The proposed taxonomy unifies the ML dependability objective with engineering safety strategies for safe execution in open-world scenarios. Our ML safety taxonomy is followed by a systematic and broad review of relevant ML techniques to serve as a way to check the coverage and diversity of safety strategies employed in any given ML system.
As illustrated in Figure 1 and Table 1, we propose categorizing ML techniques into the following three safety strategies:
• (1) Inherently Safe Model: refers to techniques for designing ML models that are intrinsically error-free or verifiable to be error-free in their intended target domain. We review model transparency and formal methods such as model specification, verification, and formal testing as the main pillars to achieve inherently safe design. However, there are many open challenges for these solutions to guarantee ML safety.
• (2) Enhancing Performance and Robustness: refers to techniques that increase model performance (on the source domain) and robustness against distributional shifts. Perhaps the most commonly used in practice, these techniques contribute to safety by improving the operational performance of ML algorithms. We review key approaches and techniques such as training regularization, domain generalization, and adversarial training.
• (3) Run-time Error Detection: refers to strategies to detect model mispredictions at run-time (or test-time) to prevent model errors from becoming system failures. This strategy can help mitigate hazards related to ML performance limitations in the operational domain. We review key approaches and techniques for model uncertainty estimation, out-of-distribution detection, and adversarial attack detection.
Additionally, we emphasize safety-oriented Human-AI Interaction design as a type of Procedural Safeguard to prioritize end-user awareness, trust, and misuse prevention for non-expert end-users of ML-based products. We also differentiate ML safety from Security, because security threats stem from external factors (i.e., attackers) that intentionally exploit system vulnerabilities rather than from design limitations. Table 1 presents a summary of the reviewed techniques and papers for each safety strategy. Reviewed papers are organized into different solutions (middle column) that group papers into individual research approaches. We go through the details of these techniques and describe how they complement each other in the following sections.
| Safety Strategy | ML Solutions | ML Techniques |
| Inherently Safe Design | Model Transparency | Visualization Tools (Maaten and Hinton, 2008; Hohman et al., 2019; Strobelt et al., 2018; Wongsuphasawat et al., 2017; Kahng et al., 2018) |
| | | Global Explanations (Lage et al., 2018; Lakkaraju et al., 2016; Guidotti et al., 2018; Lakkaraju et al., 2019, 2020; Wu et al., 2018; Kim et al., 2018; Ghorbani et al., 2019; Yeh et al., 2020) |
| | | Local Explanations (Ribeiro et al., 2016, 2018; Lundberg and Lee, 2017; Shrikumar et al., 2017; Zeiler and Fergus, 2014; Simonyan et al., 2013; Smilkov et al., 2017; Springenberg et al., 2014; Bach et al., 2015; Selvaraju et al., 2017) |
| | Design Specification | Model Specification (Pei et al., 2017b; Seshia et al., 2016, 2018; Bartocci et al., 2018; Dreossi et al., 2019c; Sadigh et al., 2016b) |
| | | Environment Specification (Sadigh et al., 2016a) |
| | Model Verification and Testing | Formal Verification (Wang et al., 2018b; Narodytska et al., 2018; Katz et al., 2017; Dutta et al.; Huang et al., 2017) |
| | | Semi-Formal Verification (Dvijotham et al., 2018; Dreossi et al., 2019a; Chakraborty et al., 2014; Dreossi et al., 2019b) |
| | | Formal Testing (Zhang et al., 2020b; Lakkaraju et al., 2017; Qin et al., 2018; Bansal and Weld, 2018; Sun et al., 2018; Pei et al., 2017a) |
| | | End-to-End Testing (Yamaguchi et al., 2016; Fremont et al., 2020; Dreossi et al., 2018; Kim et al., 2020) |
| Enhancing Performance and Robustness | Robust Network Architecture | Model Capacity (Lin et al., 2019; Nakkiran, 2019; Djolonga et al., 2020; Madry et al., 2017; Gui et al., 2019; Hu et al., 2020; Wang et al., 2020a; Ye et al., 2019; Sehwag et al., 2020) |
| | | Model Structure and Operations (Guo et al., 2020; Chen et al., 2020b; Ning et al., 2020; Xie et al., 2020b; Tavakoli et al., 2021; Vasconcelos et al., 2020; Zhang, 2019) |
| | Robust Training | Training Regularization (Zheng et al., 2016; Zhang and LeCun, 2017; Yuan et al., 2020; Müller et al., 2019; Pan et al., 2018; Wang et al., 2019b, a; Huang et al., 2020b) |
| | | Pretraining and Transfer Learning (Yosinski et al., 2014; You et al., 2020; Hendrycks et al., 2019b; Chen et al., 2020c; Jiang et al., 2020; Yue et al., 2019) |
| | Data Sampling and Augmentation | Active Learning (Gal et al., 2017b; Beluch et al., 2018; Siddiqui et al., 2020; Haussmann et al., 2020) |
| | | Hardness Weighted Sampling (Zhang et al., 2020a; Fidon et al., 2020) |
| | | Data Cleansing (Beyer et al., 2020; Yun et al., 2021; Han et al., 2019) |
| | | Data Augmentation (Cubuk et al., 2018; Zhong et al., 2020; Hendrycks et al., 2019d; Geirhos et al., 2019; Yun et al., 2019; Hendrycks et al., 2020; Volpi et al., 2018; Zhao et al., 2020) |
| Run-time Error Detection | Prediction Uncertainty | Model Calibration (Guo et al., 2017; Szegedy et al., 2016; Li and Hoiem, 2020; Kumar et al., 2018; Zhang et al., 2019e) |
| | | Uncertainty Estimation (Der Kiureghian and Ditlevsen, 2009; Lakshminarayanan et al., 2017; Gal and Ghahramani, 2016; Gal et al., 2017a; Van Amersfoort et al., 2020; Meinke and Hein, 2019; Ovadia et al., 2019; Mukhoti et al., 2021; Liu et al., 2020a; Chen et al., 2020a) |
| | Out-of-distribution Detection | Distance-based Detection (Lee et al., 2018; Techapanurak et al., 2020; Ruff et al., 2018, 2019; Bergman and Hoshen, 2020; Sastry and Oore, 2020; Tack et al., 2020; Sohn et al., 2021; Vyas et al., 2018) |
| | | Classification-based Detection (Hendrycks et al., 2018, 2019c; Lee et al., 2017; Hsu et al., 2020; Yu and Aizawa, 2019; Goyal et al., 2020; Golan and El-Yaniv, 2018; Mohseni et al., 2020) |
| | | Density-based Detection (Schlegl et al., 2017; Zong et al., 2018; Choi and Chung, 2020; Serrà et al., 2019; Wang et al., 2020d; Du and Mordatch, 2019; Grathwohl et al., 2019; Liu et al., 2020b) |
| | Adversarial Attack Detection and Guard | Adversarial Detection (Feinman et al., 2017; Li and Li, 2017; Xu et al., 2017; Grosse et al., 2017; Meng and Chen, 2017; Ma et al., 2018; Hendrycks and Gimpel, 2016b; Gong et al., 2017) |
| | | Adversarial Guard (Guo et al., 2018; Das et al., 2017; Samangouei et al., 2018; Shaham et al., 2018; Xie et al., 2018) |
5. Inherently Safe Design
Achieving inherently safe ML algorithms that are provably error-free w.r.t. their design specifications is still an open problem (NP-hard in high-dimensional domains), despite being trivial for code-based algorithms. In this section, we review the three main requirements to achieve safe ML algorithms: (1) model transparency, (2) formal specification, and (3) formal verification and testing. These three requirements aim to translate high-level system design specifications into low-level task specifications, leading to transparent system design and formal verification or testing of model specifications.
5.1. Model Transparency
Transparency and interpretability of the ML model is an essential requirement for trustworthy, fair, and safe ML-based systems in real-world applications (Mohseni et al., 2018). However, advanced ML models with high performance on high-dimensional domains usually have a very large parameter space, making them hard for humans to interpret. In fact, the interpretability of an ML model is inversely proportional to its size and complexity. For example, a shallow interpretable model like a decision tree becomes uninterpretable when a large number of trees are ensembled to create a random forest model. The inevitable trade-off between model interpretability and performance limits the transparency of deep models to “explaining the black-box” in a human-understandable way w.r.t. explanation complexity and length (Doshi-Velez and Kim, 2017). Regularizing the training for interpretability is a way to improve model transparency in low-dimensional domains. For example, Lage et al. (Lage et al., 2018) present a regularization to improve human interpretability by incorporating user feedback into model training, measuring users’ mean response time to predict the label assigned to each data point at inference time. Lakkaraju et al. (Lakkaraju et al., 2016) build predictive models with sets of independent interpretable if-then rules. In the following, we review techniques for model transparency in three parts: model explanations (or global explanations), prediction explanations (or instance explanations), and evaluating the truthfulness of explanations.
5.1.1. Model Explanations
Model explanations are techniques that approximate ML models to explain the representation space or what the model has learned. Additionally, ML visualization and analytic systems provide various monitoring and inspection tools for training data (Wexler, 2017), data flow graphs (Wongsuphasawat et al., 2017), the training process (Strobelt et al., 2018), and inspecting a trained model (Hohman et al., 2019).
Model Estimations.
One way to explain ML models is through estimation and approximation of deep models to generate simple and human-understandable representations. A descriptive decision rule set is a common way to generate interpretable model explanations. For example, Guidotti et al. (Guidotti et al., 2018) present a technique that trains a local decision tree (i.e., an interpretable estimate) to explain any given black-box model. Their explanation consists of a decision rule for the prediction and a set of counterfactual rules for the reversed decision. Lakkaraju et al. (Lakkaraju et al., 2019) propose to explain deep models with a small number of compact decision sets through subspace explanations with user-selected features of interest. Ribeiro et al. (Ribeiro et al., 2018) introduce Anchors, a model-agnostic estimator that can explain the behavior of deep models with high-precision rules applicable to different domains and tasks. To address the susceptibility of such explanations to adversarial manipulation, Lakkaraju et al. (Lakkaraju et al., 2020) present a framework that optimizes a minimax objective for constructing high-fidelity explanations in the presence of adversarial perturbations. In a different direction, Wu et al. (Wu et al., 2018) present a tree regularization technique to estimate a complex model by learning tree-like decision boundaries. Their implementations on time-series deep models show that users could understand and trust decision trees trained to mimic deep model predictions.
Visual Concepts.
Exploring a trained network through its learned semantic concepts is another way to inspect the model’s rationale and efficiency in recognizing patterns. Kim et al. (Kim et al., 2018) introduce concept activation vectors as a solution to translate a model’s internal representation vectors into human-understandable concepts. They create high-level user-defined concepts (e.g., texture patterns) by training auxiliary linear “concept classifiers” with samples from the training set to identify concepts that are important in model prediction. Ghorbani et al. (Ghorbani et al., 2019) take another direction by clustering the super-pixel segmentation of image saliency maps to discover visual concepts. Along the same line, Zhou et al. (Zhou et al., 2018) propose a framework for generating visual concepts by decomposing the neural activations of the input image into semantically interpretable components pre-trained from a large concept corpus. Their technique is able to disentangle the features encoded in the activation feature vectors and quantify the contribution of different features to the final prediction. In another work, Yeh et al. (Yeh et al., 2020) study the completeness of visual concepts with a completeness score that quantifies the sufficiency of a particular set of concepts in explaining the model’s prediction.
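The concept-activation-vector idea of Kim et al. (2018) can be sketched roughly as follows: a linear classifier is trained to separate intermediate-layer activations of concept examples from random examples, and the fraction of class inputs whose class score increases along the resulting concept direction serves as a concept importance score. The code below is a simplified sketch under these assumptions, not the authors' implementation; activation and gradient arrays are assumed to be precomputed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(concept_acts, random_acts):
    """Train a linear 'concept classifier' on intermediate-layer activations
    and use its (normalized) weight vector as the concept activation vector."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def concept_importance(grads_wrt_activations, cav):
    """Simplified TCAV-style score: fraction of class examples whose class
    score increases in the CAV direction (positive directional derivative
    of the class logit w.r.t. the layer activations)."""
    directional = grads_wrt_activations @ cav
    return float((directional > 0).mean())
```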
5.1.2. Instance Explanations
Instance or local explanations explain the model prediction for a specific input instance regardless of overall model behavior. This type of explanation carries less holistic information about the model but informs about model behavior near the example input, which is suitable for investigating edge cases during model debugging.
Local Approximations.
Training shallow interpretable models to locally approximate the deep model’s behavior can provide model explanations. A significant benefit of local approximation techniques is their model-agnostic application and the clarity of the resulting saliency feature map. However, the faithfulness of explanations is greatly limited by factors like the heuristic technique, the input example, and training set quality. For instance, Ribeiro et al. (Ribeiro et al., 2016) proposed LIME, which trains a linear model to locally mimic the deep model’s prediction. The linear model is trained on a small, binary perturbed training set located near the input sample, for which the labels are generated by the deep model. Lundberg and Lee (Lundberg and Lee, 2017) present a model-agnostic prediction explanation technique that uses the Shapley values of a conditional expectation function of the deep model as the measure of feature importance. DeepLIFT (Shrikumar et al., 2017) is another local approximation technique that decomposes the output prediction for the input by backpropagating the neuron contributions of the network w.r.t. each input feature. It compares activations for the specific input to its “reference activation” to assign feature contribution scores.
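The local-approximation recipe described above (LIME-style) can be sketched as follows: perturb the input by switching interpretable components on and off, query the black-box model for labels, and fit a proximity-weighted linear surrogate whose coefficients serve as feature importances. The sketch below assumes a generic `predict_proba` callback and a `mask_fn` perturbation helper; it is an illustration of the idea, not the authors' library.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(x, predict_proba, mask_fn, n_features,
                             target_class, n_samples=500, seed=0):
    """Fit a sparse-input linear surrogate around input x.

    predict_proba: black-box model returning class probabilities.
    mask_fn(x, z): perturbed copy of x with interpretable components kept (1)
                   or removed/grayed out (0) according to binary vector z.
    """
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(n_samples, n_features))   # binary perturbations
    preds = np.array([predict_proba(mask_fn(x, z))[target_class] for z in Z])
    # Weight perturbed samples by proximity to the original input (all ones).
    distances = 1.0 - Z.mean(axis=1)                        # fraction of removed parts
    weights = np.exp(-(distances ** 2) / 0.25)              # exponential kernel
    surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return surrogate.coef_                                  # local feature importances
```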
Saliency Map for DNNs.
Various heuristic gradient-based, deconvolution-based, and perturbation-based techniques have been proposed to generate saliency maps for DNNs. Gradient-based methods use backpropagation to compute the partial derivative of the class prediction score w.r.t. the input image (Simonyan et al., 2013). Later, Smilkov et al. (Smilkov et al., 2017) proposed to improve the noisy visualization of saliency maps by adding noise to the input. The Grad-CAM (Selvaraju et al., 2017) technique combines feature maps from a DNN’s intermediate layers to generate saliency maps for the target class. Zeiler and Fergus (Zeiler and Fergus, 2014) propose a deconvolution-based saliency map by attaching a deconvnet to each layer, which provides a continuous path from the prediction back to the image. Similarly, Springenberg et al. (Springenberg et al., 2014) propose a guided backpropagation technique that modifies ReLU function gradients and uses class-dependent constraints in the backpropagation process. For real-time applications, Bojarski et al. (Bojarski et al., 2018) present a variant of layer-wise relevance propagation for fast execution of saliency maps. Perturbation-based or sensitivity-based techniques measure the sensitivity of the model output w.r.t. the input features. For example, Zeiler and Fergus (Zeiler and Fergus, 2014) calculate the saliency map by sliding fixed-size patches to occlude the input image and measuring the change in prediction probability.
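A minimal PyTorch sketch of the vanilla gradient-based saliency map described above (gradient of the class score w.r.t. the input image, reduced over color channels) is shown below; the model and image tensor are placeholders.

```python
import torch

def gradient_saliency(model, image, target_class):
    """Vanilla gradient saliency (Simonyan et al., 2013): backpropagate the
    class score to the input and take the per-pixel gradient magnitude."""
    model.eval()
    image = image.clone().detach().requires_grad_(True)      # shape (1, C, H, W)
    score = model(image)[0, target_class]                    # class prediction score
    score.backward()
    saliency = image.grad.detach().abs().max(dim=1).values   # max over channels
    return saliency.squeeze(0)                               # (H, W) saliency map
```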
5.1.3. Explanation Truthfulness
Since model explanations are always incomplete estimations of the black-box model, there need to be mechanisms to evaluate both the correctness and completeness of model explanations w.r.t. the main model. In particular, the fidelity of the post-hoc explanation technique should be evaluated against the black-box model itself. Aside from qualitative reviews of model explanations and their consistency compared to similar techniques (Olah et al., 2018; Selvaraju et al., 2017; Lundberg and Lee, 2017), we look into sanity-check tests and human-grounded evaluations in the following.
Sanity Checks and Proxy Tasks
Examining model explanations with different heuristic tests has been shown to be an effective way to evaluate explanation truthfulness for specific scenarios. For example, Samek et al. (Samek et al., 2017) proposed a framework for evaluating saliency explanations based on the correlation between saliency map quality and network performance under input perturbation. In a similar work, Kindermans et al. (Kindermans et al., 2019) demonstrate inconsistencies in saliency maps due to simple image transformations. Adebayo et al. (Adebayo et al., 2018) propose three tests to measure the fidelity of any interpretability technique in tasks that are either data sensitive or model sensitive. Additionally, demonstrating the usefulness of explanations on a proxy task has been explored in the literature. For instance, Zeiler and Fergus (Zeiler and Fergus, 2014) propose using visualization of features in different layers to improve network architecture and training by adjusting network layers. In another example, Zhang et al. (Zhang et al., 2018) present cases of evaluating explanations’ usefulness in finding representation learning flaws caused by biases in the training dataset as a proxy task.
Human Evaluation and Ground-truth
Evaluating model explanations with human input is based on the assumption that good model explanations should be consistent with human reasoning and understanding of the data. Multiple works (Ribeiro et al., 2016, 2018; Lundberg and Lee, 2017) made use of human-subject studies to evaluate model explanations. However, there are multiple human factors in user feedback on ML explanations, such as average user understanding, the task dependency and usefulness of explanations, and user trust in explanations. Therefore, more concise evaluation metrics are required to reveal model behavior to users, justify predictions, and help humans investigate uncertain predictions (Lertvittayakumjorn and Toni, 2019). Another challenge in human-subject evaluations is the time and cost of running studies and collecting user feedback. To eliminate the need for repeated user studies, human-annotated benchmarks have been proposed that contain annotations of important features w.r.t. the target class (Mohseni et al., 2021a; Das et al., 2016).
5.2. Formal Methods
Formal methods require rigorous mathematical specification and verification of a system to obtain “guarantees” on system behavior. A design cycle in formal methods involves two main steps: (1) listing design specifications to meet the system requirements and (2) verifying the system to prove the delivery of those requirements in the target environment. Specifically, formal verification certifies that the system $S$ exhibits the specified property $\varphi$ when operating in the environment $E$. Therefore, system specification and verification are two complementary components of formal methods. Unlike the common practice in data-driven algorithms, which rely on available data samples to model the environment, formal methods require exact specification of the algorithm’s properties. In contrast to model validation in ML, formal methods require verification of the system over the entire environment space. Related to ML specification and verification, Huang et al. (Huang et al., 2020a) review ML verification methods by the type of guarantees they can provide, such as deterministic guarantees, approximate bounds, and converging bounds. However, due to challenges in model specification and verification in high-dimensional domains, many works such as Seshia et al. (Seshia et al., 2016) and Yamaguchi et al. (Yamaguchi et al., 2016) suggest end-to-end simulation-based validation of AI-based systems as a semi-formal verification of complex systems, in which a realistic simulation of the environment and events is used to find counterexamples for system failures. In the following, we review different research on formal methods for ML algorithms.
5.2.1. Formal Specification
Specification is a necessary step prior to software development and the basis for system verification. Examples of common formal specification methods in software design are temporal logic and regular expressions. However, in the training of ML algorithms, the training set specifies the model’s task in the target distribution rather than a list of rules and requirements. Here we review techniques and experiments in model and environment specification.
Model Specification:
Specifying the desired model behavior is a design requirement prior to any system development. However, formal specification of ML algorithms is very challenging for real-world tasks involving high-dimensional data like images. Seshia et al. (Seshia et al., 2016, 2018) review open challenges for ML specification and survey the landscape of formal specification approaches. For example, ML for semantic feature learning can be specified on multiple levels (e.g., system level, input distribution level, etc.) to simplify the overall specification. Bartocci et al. (Bartocci et al., 2018) review tools for specifying ML systems in complex dynamic environments. They propose specification-based monitoring algorithms that provide qualitative and quantitative satisfaction scores for the model using either simulated (online) or existing (offline) inputs. Additionally, as ML algorithms carry uncertainty in their outputs, the benefit of including prediction uncertainty in the overall system specification is an open topic under investigation (McAllister et al., 2017). Focusing on invariance specifications, Pei et al. (Pei et al., 2017b) decompose safety properties for common real-world image distortions into 12 transformation invariance properties that an ML algorithm should maintain. Based on these specifications, they verify safety properties of the trained model using samples from the target domain. In the domain of adversarial perturbations, Dreossi et al. (Dreossi et al., 2019c) propose a unifying formalization to specify adversarial perturbations from the formal methods perspective.
Environment Modeling:
Modeling the operational or target environment is a requirement in formal system specification. Robust data collection techniques are needed for full coverage of the problem space environment. Several techniques such as active learning, semi-supervised learning, and knowledge transfer have been proposed to improve the training data by following design specifications and encouraging the model to learn more generalizable features of the target environment. As a result, a training set with enough coverage can better close the generalization gap between the source training and target operational domains. Similarly, in AI-based systems with multiple ML components, a robust specification of the dynamic environment with its active components (e.g., human actions) enables better system design (Sadigh et al., 2016a).
5.2.2. Model Verification
Formal verification in software development is an assurance process for design and implementation validity. There are several approaches in software engineering, such as constraint solving and exhaustive search, to perform formal verification. For instance, a constraint solver like Boolean Satisfiability (SAT) provides deterministic guarantees on different verification constraints. However, verification of ML algorithms with conventional methods is challenging for high-dimensional data domains. In this section, we review different approaches for system-level and algorithm-level verification of ML systems.
Formal Verification
A line of research adapts conventional verification methods for ML algorithms. For example, Narodytska et al. (Narodytska et al., 2018) present a SAT solver for Boolean-encoded neural networks in which all weights and activations are binary functions. Their solution verifies various properties of binary networks, such as robustness to adversarial examples; however, such solvers perform well only when problems can be represented as a Boolean combination of constraints. Recent SMT solvers for neural networks present use cases of efficient DNN verification in airborne collision avoidance and vehicle collision prediction applications (Katz et al., 2017). However, this line of solutions is limited to small models with a limited number of parameters, commonly used on low-dimensional data. Additionally, given the complexity of DNNs, the efficiency and truthfulness of these verification techniques require sanity checks and comparisons against similar techniques (Dutta et al.). Huang et al. (Huang et al., 2017) present an automated verification framework based on SMT theory that applies image manipulations and perturbations such as changing camera angle and lighting conditions. Their technique employs region-based exhaustive search and benefits from layer-wise analysis of perturbation propagation. Wang et al. (Wang et al., 2018b) combine symbolic intervals and linear relaxation, scaling to larger networks of up to 10,000 nodes. Their experiments on small image datasets verify trained networks for perturbations such as brightness and contrast.
Quantitative and Semi-formal Verification
Given the complexity of ML algorithms, quantitative verification assigns quality values to the ML system rather than a Boolean output. For example, Dvijotham et al. (Dvijotham et al., 2018) present a jointly predictor-verifier training framework that simultaneously trains and verifies certain properties of the network. Specifically, the predictor and verifier networks are jointly trained on a combination of the main task loss and the upper bound on the worst-case violation of the specification from the verifier. Another work presents random sample generators within the model and environment specification constraints for quantitative verification (Chakraborty et al., 2014).
Further, system-level simulation tools provide the environment to generate new scenarios that illustrate various safety properties of the intelligent system. For example, Leike et al. (Leike et al., 2017) present a simulation environment that can decompose safety problems into robustness and specification limitations. In more complex AI-based systems, Dreossi et al. (Dreossi et al., 2019a) present a framework to analyze and identify misclassifications leading to system-level property violations. Their framework creates an approximation of the model and feature space to provide sets of misclassified feature vectors that can falsify the system. Another example is VerifAI (Dreossi et al., 2019b), a simulation-based verification and synthesis toolkit guided by formal models and specifications. VerifAI consists of four main modules that model the environment as an abstract feature space, search the feature space to find scenarios that violate specifications, monitor the properties and objective functions, and analyze the counterexamples found during simulations.
5.2.3. Formal Testing
Testing is the process of evaluating the model or system against an unseen set of samples or scenarios. Unlike model verification, testing does not require a formal specification of the system or environment but instead focuses only on the set of test samples. A new and unseen test set is needed because model errors often result from systematic biases in the training data, which lead to learning incorrect or incomplete representations of the environment and task. Structural coverage metrics such as statement coverage and modified condition/decision coverage (MC/DC) have been used in code-based algorithms to measure and ensure the adequacy of testing in safety-critical applications. However, testing in high-dimensional spaces is expensive, as it requires a very large number of test scenarios to ensure adequate coverage and an oracle to identify failures (Zhang et al., 2020b). We review these two aspects in the following.
Test Coverage
The coverage of test scenarios or samples is a particularly important factor for testing quality. Inspired by the MC/DC test coverage criterion, Sun et al. (Sun et al., 2018) propose DNN-specific test coverage criteria that balance computation cost against finding erroneous samples. They developed a search algorithm based on gradient descent which looks for satisfiable test cases in an adaptive manner. Pei et al. (Pei et al., 2017a) introduce the neuron coverage metric as the number of unique neurons activated by the entire test set over the total number of neurons in the DNN. They present the DeepXplore framework to systematically test DNNs via neuron coverage and cross-referencing oracles. Similar to adversarial training setups, their experiments demonstrate that jointly optimizing for samples that both trigger diverse outputs and achieve high neuron coverage can improve model prediction accuracy.
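To make the neuron coverage criterion concrete, the following minimal PyTorch sketch estimates coverage as the fraction of post-ReLU units that fire above a fixed threshold at least once over a test loader. The function name, the choice of hooking only nn.ReLU modules, and the fixed threshold are illustrative simplifications and assume the network exposes its activations as ReLU modules; the original DeepXplore procedure additionally scales activations per layer.

```python
import torch
import torch.nn as nn

def neuron_coverage(model: nn.Module, data_loader, threshold: float = 0.0):
    """Fraction of post-ReLU units exceeding `threshold` for at least one test input."""
    activated = {}   # layer name -> boolean mask of neurons observed "on"
    hooks = []

    def make_hook(name):
        def hook(_module, _inputs, output):
            acts = output.detach().flatten(start_dim=1)   # one column per neuron
            fired = (acts > threshold).any(dim=0)
            activated[name] = fired if name not in activated else activated[name] | fired
        return hook

    # Treat each ReLU output as a "neuron" (a simplification of DeepXplore).
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for x, _ in data_loader:   # loader assumed to yield (image, label) batches
            model(x)
    for h in hooks:
        h.remove()

    total = sum(mask.numel() for mask in activated.values())
    covered = sum(int(mask.sum()) for mask in activated.values())
    return covered / max(total, 1)
```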
In guided search for testing, Lakkaraju et al. (Lakkaraju et al., 2017) present an explore-exploit strategy for discovering unknown-unknown false positives in unseen data. Later, Bansal and Weld (Bansal and Weld, 2018) formulate the search as an optimization problem that selects samples to maximize a utility model subject to a budget on the maximum number of oracle calls. Differential testing techniques use multiple trained copies of the target algorithm to serve as a correctness oracle for cross-referencing (Qin et al., 2018).
End-to-end Simulations
End-to-end simulation tools enable testing of complex AI-based systems by generating diverse test samples and evaluating system components together. For instance, Yamaguchi et al. (Yamaguchi et al., 2016) present a series of simulations that combine requirement mining and model checking in simulation-based testing of end-to-end systems. Fremont et al. (Fremont et al., 2020) present a scenario-based testing tool which generates test cases by combining specifications of possible scenarios and safety properties. In many cases, samples and scenarios from the simulation tests are used to improve the training set. For example, Dreossi et al. (Dreossi et al., 2018) propose a framework for generating counterexamples or edge cases for ML to improve both the training set and test coverage. Their experimental results for simulation-based augmentation show that edge cases have important properties for retraining and improving the model. In the application of object detection, Kim et al. (Kim et al., 2020) present a framework to identify and characterize misprediction scenarios using a high-level semantic representation of the environment. Their framework consists of an environment simulator and a rule extractor that generates compact rules which help the scene generator debug and improve the training set.
5.3. Challenges and Opportunities
The first main challenge of designing inherently safe ML models lies in the computational complexity and scalability of solutions (Huang et al., 2020a; Zhang et al., 2020b). As ML models become exponentially more complex, it becomes extremely difficult to impose specifications and perform verification mechanisms that are well-adapted to large ML models. A practical solution could be the modular approach presented by Dreossi et al. (2019a) for scaling up formal methods to large ML systems, even when some components (such as perception) do not themselves have precise formal specifications.
On the other hand, recent advancements in 3D rendering and simulation have introduced promising solutions for end-to-end testing and semi-formal verification in simulated environments. However, it is challenging to close the gap between simulation and real-world situations, which makes the transfer of simulated verification and testing results questionable. Recent work has started exploring how formal simulation can aid in designing real-world tests (Fremont et al., 2020). Additionally, thorough and scenario-based simulations enable system verification in broader terms, such as monitoring interactions between ML modules in a complex system.
6. Enhancing Model Performance and Robustness
Enhancing model performance and robustness is the most common strategy to improve product quality and reduce the safety risk of ML models in the open world. Specifically, techniques to enhance model performance and robustness reduce different model error types to gain dependability w.r.t. the criteria reviewed in Section 3. In the following, we review and organize ML solutions to improve performance and robustness into three parts focusing on (1) robust network architecture, (2) robust training, and (3) data sampling and manipulation.
6.1. Robust Network Architecture
Model robustness can be influenced by model capacity and architecture.
Model Capacity
Djolonga et al. (Djolonga et al., 2020) showed that with enough training data, increasing model capacity (both width and depth) consistently helps model robustness against distribution shifts. Madry et al. (Madry et al., 2017) showed that increasing model capacity (width alone) could increase model robustness against adversarial attacks. Xie et al. (Xie and Yuille, 2019) observed that increasing network depth for adversarial training could largely boost adversarial robustness, while the corresponding clean accuracy quickly saturates as the network goes deeper. The above empirical findings that larger models lead to better robustness are also consistent with theoretical results (Nakkiran, 2019; Gao et al., 2019). In contrast, Wu et al. (Wu et al., 2020) conducted a thorough study on the impact of model width on adversarial robustness and concluded that wider neural networks may suffer from worse perturbation stability and thus worse overall model robustness. To accommodate computational resource constraints while maintaining model robustness, a surge of work has studied robustness-aware model compression (Lin et al., 2019; Gui et al., 2019; Hu et al., 2020; Wang et al., 2020a; Ye et al., 2019; Sehwag et al., 2020).
Network Structure and Operator
Activation functions may play an important role in model robustness (Xie et al., 2020b; Tavakoli et al., 2021). For example, Xie et al. (Xie et al., 2020b) observed that using smoother activation functions in the backward pass of training improves both model accuracy and robustness. Tavakoli et al. (Tavakoli et al., 2021) proposed a set of learnable activation functions, termed SPLASH, which simultaneously improves model accuracy and robustness. Zhang (2019) and Vasconcelos et al. (2020) pointed out that the down-sampling methods used in modern DNNs (e.g., pooling, strided convolutions) lead to aliasing issues and poor invariance under image shifting. The authors proposed to use traditional anti-aliasing methods (e.g., adding low-pass filters after sampling operations) to increase model invariance under image shifting and model robustness against distribution shifts. Neural Architecture Search (NAS) has also been used to search for robust network structures (Guo et al., 2020; Chen et al., 2020b; Ning et al., 2020; Mok et al., 2021). For example, Guo et al. (Guo et al., 2020) first leveraged NAS to discover a family of robust architectures (RobNets) that are resilient to adversarial attacks. They empirically observed that using densely connected patterns and adding convolution operations to direct connection edges improve model robustness. With the recent success of Vision Transformers (ViTs), some works (Bhojanapalli et al., 2021; Mahmood et al., 2021) benchmarked the robustness of ViTs and observed that they have better general robustness than traditional CNNs. However, a recent paper overturned this conclusion by showing that CNNs can be as robust as ViTs if trained properly (Bai et al., 2021).
6.2. Robust Training
Various robust training methods have been proposed to improve model robustness.
6.2.1. Training Regularization
The most representative regularization for robust training encourages model smoothness: similar inputs should yield similar outputs (e.g., two versions of the same image with slightly different rotation angles should lead to similar model predictions) (Zheng et al., 2016; Hendrycks et al., 2019d; Miyato et al., 2019). A typical form of such regularization is the pairwise distance $d\big(f(x), f(x')\big)$, where $x$ and $x'$ are two different versions of the same image, $f$ is the model, and $d(\cdot,\cdot)$ is some distance measure (e.g., the $\ell_2$ norm). Zheng et al. (2016) stabilized deep networks against small input distortions by regularizing the feature distance between the original image and its corrupted version with additive random Gaussian noise. Yuan et al. (2020) showed that model distillation can be viewed as a learned label smoothing (Müller et al., 2019) regularization which helps in-distribution generalization. Hendrycks et al. (2019d) use the Jensen-Shannon divergence (JSD) among the original image and its augmented versions as a consistency regularizer to improve model robustness against common corruptions. Virtual Adversarial Training (VAT) (Miyato et al., 2019) penalizes the worst-case pairwise distance on unlabeled data, achieving improved robustness in semi-supervised learning. Besides the above pairwise distance regularization, model smoothness can also be directly regularized by adding Lipschitz continuity constraints on model weights (Qian and Wegman, 2018; Singla and Feizi, ). For example, Singla and Feizi proposed a differentiable upper bound on the Lipschitz constant of convolutional layers, which can be directly optimized to improve model generalization and robustness.
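As an illustration of the pairwise-distance regularizer described above, the minimal sketch below (with a hypothetical helper name, symmetric KL divergence chosen as the distance measure, and lam as a free weighting hyper-parameter) computes a consistency term between two views of the same batch; in practice it would be added to the standard task loss in the training loop.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model: torch.nn.Module, x: torch.Tensor, x_aug: torch.Tensor,
                     lam: float = 1.0) -> torch.Tensor:
    """Pairwise smoothness term d(f(x), f(x')) between two views of the same batch."""
    log_p = F.log_softmax(model(x), dim=1)
    log_q = F.log_softmax(model(x_aug), dim=1)
    # Symmetric KL divergence used as the distance measure d(., .)
    sym_kl = 0.5 * (F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
                    + F.kl_div(log_q, log_p, log_target=True, reduction="batchmean"))
    return lam * sym_kl  # add to the standard task loss during training
```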
6.2.2. Pre-training and Transfer Learning
Transfer learning consists of a range of techniques to transfer useful features from an auxiliary domain or task to the target domain and task for improved model generalization and robustness (Yosinski et al., 2014). Pre-training is a common transfer learning approach in which the model is first trained on a large-scale dataset and then fine-tuned on the target downstream domain and task (e.g., CIFAR10). For example, Hendrycks et al. (2019b) show that pre-training on the ImageNet dataset can greatly improve model robustness and uncertainty estimates on smaller target datasets like CIFAR10. To benefit from unlabeled data sources, transfer learning can also be done by first pre-training the model on a self-supervised task (e.g., rotation angle prediction) and then fine-tuning on the target supervised task (e.g., image classification) (Hendrycks et al., 2019c). Recently, Chen et al. (2020c) introduced adversarial training into self-supervised pre-training to provide general-purpose robust pre-trained models. In another work, Jiang et al. (2020) leveraged contrastive learning for pre-training, which further boosted adversarial robustness. Since robust training is usually data-hungry (Tsipras et al., 2018) and time-consuming (Zhang et al., 2019a), it would be economically beneficial if the robustness learned on one task or data distribution could be efficiently transferred to others. Awais et al. (2021) showed that model robustness can be transferred from adversarially pre-trained ImageNet models to different unlabeled target domains by aligning the features of different models on the target domain. Motivated by the practical constraint in federated learning that only some resource-rich devices can support robust training, Hong et al. (2021) proposed the first method to transfer robustness among different devices in federated learning while preserving the data privacy of each participant.
6.2.3. Adversarial Training
Adversarial training (AT) incorporates adversarial examples into the training data to increase model robustness against adversarial attacks at test time. State-of-the-art AT methods are arguably the top performers (Zhang et al., 2019c; Madry et al., 2017) for enhancing deep network robustness against adversarial attacks. A typical AT algorithm optimizes a hybrid loss consisting of a standard classification loss $\mathcal{L}_{\text{clean}}$ and an adversarial robustness loss term $\mathcal{L}_{\text{adv}}$:

(1)  $\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathcal{L}_{\text{clean}}(\theta; x, y) + \lambda \cdot \max_{\delta \in \Delta} \mathcal{L}_{\text{adv}}(\theta; x+\delta, y) \Big]$

where $\Delta = \{\delta : \lVert\delta\rVert_{p} \le \epsilon\}$ is the allowed perturbation set that keeps samples visually unchanged, $\epsilon$ is the radius of the $\ell_{p}$ ball, and $\lambda$ is a fixed training weight hyper-parameter. Among common AT methods, both the Fast Gradient Sign Method (FGSM-AT) (Goodfellow et al., 2015) and Projected Gradient Descent (PGD-AT) (Madry et al., 2017) use an $\ell_{\infty}$ ball for the allowed perturbation set, which formalizes the manipulative power of the adversary. Variations of PGD include TRADES (Zhang et al., 2019c), which uses the same clean loss as PGD-AT but replaces the adversarial loss from cross-entropy with a soft logits-pairing term. In MMA training (Ding et al., 2020), the adversarial loss maximizes the margins of correctly classified images. As a novel solution to the trade-off between model accuracy and adversarial robustness, Once-for-all AT (OAT) trains a single model that can be adjusted in-situ at run-time to achieve different desired trade-off levels between accuracy and robustness (Wang et al., 2020a). Hong et al. (2022) extended OAT to federated learning (FL), achieving run-time adaptation for robustness in practical FL systems where different participant devices have different levels of safety requirements.
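The PyTorch-style sketch below illustrates the min-max objective above with a PGD inner maximization under an $\ell_{\infty}$ ball; the hyper-parameters (eps, alpha, steps, lam) and function names are illustrative defaults rather than the exact settings of any cited method.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: projected gradient ascent within an l_inf ball of radius eps."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = (x + delta).clamp(0, 1) - x          # keep the perturbed image valid
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y, lam=1.0):
    """One hybrid-loss update: clean cross-entropy plus a weighted adversarial term."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + lam * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```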
Fast Adversarial Training
Despite the effectiveness of AT methods against attacks, they suffer from very high computation costs due to the multiple extra backward propagations needed to generate adversarial examples. The high training cost makes AT impractical in certain domains and on large-scale datasets (Xie et al., 2019). Therefore, a line of work tries to accelerate AT. Zhang et al. (2019a) restricted most adversarial updates to the first layer to effectively reduce the total number of forward and backward passes and improve training efficiency. Shafahi et al. (Shafahi et al., 2019) proposed a "free" AT algorithm that updates the network parameters and the adversarial perturbation simultaneously in a single backward pass. Wong et al. (Wong et al., 2020) showed that, with proper random initialization, single-step PGD-AT can be as effective as multi-step variants while being much more efficient.
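A minimal sketch of the single-step idea, random initialization followed by one FGSM step, is shown below; the step size and the assumption that inputs lie in [0, 1] are illustrative choices and not the exact recipe of Wong et al. (2020).

```python
import torch
import torch.nn.functional as F

def fgsm_with_random_init(model, x, y, eps=8/255, alpha=10/255):
    """Single-step adversarial example with a random start (inputs assumed in [0, 1])."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad, = torch.autograd.grad(loss, delta)
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
    return (x + delta).clamp(0, 1).detach()
```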
Certified Adversarial Training
Certified AT aims to obtain networks with provable guarantees on robustness under certain assumptions and conditions. Certified AT uses a verification method to find an upper bound on the inner maximization and then updates the parameters based on this upper bound of the robust loss; minimizing an upper bound of the inner maximization guarantees that the robust loss is minimized. Linear relaxations of neural networks (Wong and Kolter, 2018) use the dual of linear programming (or other similar approaches (Wang et al., 2018a)) to provide a linear relaxation of the network (referred to as a "convex adversarial polytope"), and the resulting bounds are tractable for robust optimization. However, these methods are both computationally and memory intensive and can increase model training time by a factor of hundreds. Interval Bound Propagation (IBP) (Gowal et al., 2018) is a simple and efficient method for training verifiable neural networks which achieved state-of-the-art verified error on many datasets; however, the training procedure of IBP is unstable and sensitive to hyperparameters. To address this, Zhang et al. proposed CROWN-IBP (Zhang et al., 2019b) to combine the efficiency of IBP with the tightness of a linear-relaxation-based verification bound. Other certified adversarial training techniques include ReLU stability regularization (Xiao et al., 2019b), distributionally robust optimization (Sinha et al., 2018), semi-definite relaxations (Raghunathan et al., 2018; Dvijotham et al., ), and randomized smoothing (Cohen et al., 2019).
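To illustrate the bound-propagation idea behind IBP, the sketch below pushes an $\ell_{\infty}$ interval around the input through a small fully-connected network given as an nn.Sequential of Linear and ReLU layers; it covers only these two layer types and omits the training loop that would minimize the resulting worst-case loss.

```python
import torch
import torch.nn as nn

def ibp_linear(layer: nn.Linear, lower, upper):
    """Propagate an axis-aligned box [lower, upper] through a linear layer."""
    center = (upper + lower) / 2
    radius = (upper - lower) / 2
    new_center = layer(center)                       # W @ c + b
    new_radius = radius @ layer.weight.abs().t()     # |W| @ r
    return new_center - new_radius, new_center + new_radius

def ibp_bounds(model: nn.Sequential, x, eps):
    """Interval bounds on the logits for all inputs within an l_inf ball of radius eps."""
    lower, upper = x - eps, x + eps
    for layer in model:
        if isinstance(layer, nn.Linear):
            lower, upper = ibp_linear(layer, lower, upper)
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
        else:
            raise NotImplementedError(f"unsupported layer: {type(layer)}")
    # Certified training would minimize a worst-case loss built from these bounds.
    return lower, upper
```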
6.2.4. Domain Generalization
Domain generalization presents an important indicator of “in-the-wild” model robustness for open-world applications such as autonomous vehicles and robotics where deployment in different domains is common.
Domain Randomization
Utilizing randomized variations of the source training set can improve generalization to the unseen target domain. Domain randomization with random data augmentation has been a popular baseline in both reinforcement learning (Tobin et al., 2017) and general scene understanding (Tremblay et al., 2018), where the goal is to introduce randomized variations to the input to improve sim-to-real generalization. Yue et al. (Yue et al., 2019) use category-specific ImageNet images to randomly stylize the synthetic training images. YOLOv4 (Bochkovskiy et al., 2020) benefits from a new random data augmentation method that diversifies training samples by mixing images for detection of objects outside their normal context. Data augmentation can also be achieved through network randomization (Lee et al., 2020; Xu et al., 2021), which introduces randomized convolutional neural networks for more robust representation. Data augmentation for general visual recognition tasks is discussed in Section 6.3.
Robust Representation Learning
The ability to generalize across domains also hinges greatly on the quality of learned representations. As a result, introducing inductive biases and regularization is a crucial tool for promoting robust representations. For instance, Pan et al. (Pan et al., 2018) show that better-designed normalization leads to improved generalization. In addition, neural networks are prone to overfitting to superficial (domain-specific) representations such as textures and high-frequency components. Therefore, preventing such overfitting by better capturing the global image gestalt can considerably improve generalization to unseen domains (Wang et al., 2019b, a, 2020c; Huang et al., 2020b). For example, Wang et al. (2019a) prevented the early layers from learning low-level features, such as color and texture, forcing the network to instead focus on the global structure of the image. Representation Self-Challenging (RSC) (Huang et al., 2020b) iteratively discards the dominant features during training and forces the network to utilize the remaining features that correlate with labels.
Multi-Source Training
Another stream of methods assumes multiple source domains during training and targets generalization to held-out test domains. They use information from multiple domains during training to learn domain-agnostic biases and common knowledge that also apply to unseen target domains (Li et al., 2017, 2018; Balaji et al., 2018; Tang et al., 2019; Wu et al., 2019; Gong et al., 2019; Lambert et al., 2020). For example, Gong et al. (2019) aimed to bridge multiple source domains by introducing a continuous sequence of intermediate domains, thereby capturing any test domain that lies between the source domains. More recently, Lambert et al. (2020) constructed a composite semantic segmentation dataset from multiple sources to improve the zero-shot generalization ability of semantic segmentation models to unseen domains.
6.3. Data Sampling and Augmentation
The quality of training data is an important factor for machine learning with big data (L’heureux et al., 2017). In this section, we review a collection of algorithmic techniques for data sampling, cleansing, and augmentation for improved and robust training.
Active Learning
Active learning is a framework for improving training set quality and minimizing data labeling costs by selectively labeling valuable samples (i.e., "edge cases") from an unlabeled pool. The sample acquisition function in active learning frameworks often relies on prediction uncertainty, such as Bayesian (Gal et al., 2017b) and ensemble-based (Beluch et al., 2018) uncertainty estimates. Differently, ViewAL (Siddiqui et al., 2020) actively samples hard cases by measuring prediction inconsistency across different viewpoints in multi-view semantic segmentation tasks. Haussmann et al. (2020) present a scalable active learning framework for open-world applications in which both the target and the environment can greatly affect model predictions. Active learning techniques can also improve training data quality by identifying noisy labels with minimal crowdsourcing overhead (Bouguelia et al., 2018).
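A minimal uncertainty-based acquisition function is sketched below: it scores an unlabeled pool by predictive entropy and returns the indices of the most uncertain samples for annotation. The entropy criterion is a simple stand-in for the Bayesian and ensemble-based acquisition functions cited above, and the loader is assumed to yield image batches in a fixed order (no shuffling).

```python
import torch
import torch.nn.functional as F

def select_for_labeling(model, unlabeled_loader, budget: int):
    """Return indices (in loader order) of the `budget` most uncertain unlabeled samples."""
    model.eval()
    entropies = []
    with torch.no_grad():
        for x in unlabeled_loader:                    # batches of unlabeled images
            probs = F.softmax(model(x), dim=1)
            ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            entropies.append(ent)
    entropies = torch.cat(entropies)
    return torch.topk(entropies, k=budget).indices    # indices into the unlabeled pool
```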
Hardness Weighted Sampling
A typical data sampling strategy to improve model robustness is to assign larger weights (in the training loss function) to harder training samples (Katharopoulos and Fleuret, 2018; Zhang et al., 2020a; Fidon et al., 2020). For instance, Katharopoulos and Fleuret (2018) achieved this by estimating sample hardness with an efficient upper bound on the gradient norm. Zhang et al. (2020a) estimated sample hardness by the distance to the classification boundary, assigning larger weights to hard samples. Fidon et al. (2020) performed weighted sampling by modeling sample uncertainty through minimizing the worst-case expected loss over an uncertainty set of training data distributions.
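The sketch below shows a generic loss-based re-weighting scheme in the spirit of hardness-weighted sampling: per-sample losses are turned into (detached) softmax weights so that harder samples contribute more to the gradient. It is a simplified proxy, not the gradient-norm or boundary-distance estimators used in the cited works.

```python
import torch
import torch.nn.functional as F

def hardness_weighted_loss(logits, targets, temperature: float = 1.0):
    """Re-weight per-sample losses so harder examples (larger loss) dominate the gradient."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Detach the weights so they act as constants during backpropagation.
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)
    return (weights * per_sample).sum()
```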
Data Cleansing
Label errors on large-scale annotated training sets induce noisy or even incorrect supervision during model training and evaluation (Yun et al., 2021; Beyer et al., 2020; Wang et al., 2020b). A robust learning strategy against label noise is to select training samples with small loss values to update the model, since samples with label noise usually have larger loss values than clean samples (Han et al., 2018; Yu et al., 2019; Wei et al., 2020). ImageNet is commonly used as a representative case study for label noise. Specifically, many images in ImageNet contain multiple objects, while only one of them is labeled as the ground-truth classification target. Efforts have been made to correct such label noise on both the training (Yun et al., 2021) and validation (Beyer et al., 2020) sets of ImageNet, by providing localized annotations for each object in the image using machine or human annotators. Training on the new cleansed annotations improved both in-distribution accuracy and robustness (Yun et al., 2021).
Data Augmentation
Data augmentation is widely used to improve model generalizability and robustness due to its effectiveness and simplicity. The most popular strategy of data augmentation is to increase the diversity of training samples, for example, by random rotation and scaling, random color jittering, random patch erasing (Zhong et al., 2020), and many others (Cubuk et al., 2018, 2020; Yun et al., 2019; Hendrycks et al., 2019d, 2020). AugMix (Hendrycks et al., 2019d) utilizes diverse random augmentations by mixing multiple augmented images, significantly improving model robustness against natural visual corruptions. CutMix (Yun et al., 2019) removes image patches and replaces the removed regions with a patch from another image, where the new ground-truth labels are also mixed proportionally to the number of pixels contributed by each image. DeepAugment (Hendrycks et al., 2020) increases robustness to cross-domain shifts by employing image-to-image translation networks for data augmentation rather than conventional data-independent pixel-domain augmentation. The second type of data augmentation strategy is adversarial data augmentation (Volpi et al., 2018; Zhao et al., 2020; Xie et al., 2020a; Zhang et al., 2019d), where fictitious target distributions are generated adversarially to resemble "worst-case" unforeseen data shifts throughout training. For example, Xie et al. (2020a) observed that adversarial samples could be used as data augmentation to improve both accuracy and robustness with the help of a novel batch normalization layer. Recently, Gong et al. (2021) and Wang et al. (2021) proposed MaxUp and AugMax, respectively, to combine diversity and adversity in data augmentation, further boosting model generalizability and robustness.
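For concreteness, a minimal CutMix-style augmentation is sketched below: a random patch from a shuffled copy of the batch is pasted into each image, and the label-mixing coefficient is set to the surviving area fraction. Patch sampling details are simplified relative to the original paper.

```python
import torch

def cutmix(x, y, alpha: float = 1.0):
    """Paste a random patch from a shuffled copy of the batch; mix labels by patch area.
    x: images (N, C, H, W); y: integer class labels (N,)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    h, w = x.shape[2:]
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x_mixed = x.clone()
    x_mixed[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]
    lam_adj = 1 - ((y2 - y1) * (x2 - x1) / (h * w))   # fraction of original pixels kept
    # Training loss: lam_adj * CE(pred, y) + (1 - lam_adj) * CE(pred, y[perm])
    return x_mixed, y, y[perm], lam_adj
```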
6.4. Challenges and Opportunities
Despite advances in defending against different types of naturally occurring distribution shifts and synthetic adversarial attacks, there is a lack of systematic effort to tackle robustness limitations in a unified framework covering the "in-between" cases within this spectrum. In fact, in many cases, techniques proposed to enhance one type of robustness do not translate into benefits for other types of robustness. For example, Li et al. (2020) showed that top-performing robust training methods for one type of distribution shift may even harm robustness against other distribution shifts.
Another less explored direction for ML robustness is to benefit from multi-domain and multi-source training data for improved representation learning. The rich context captured from sensor sets with diverse orientations and data modalities may improve prediction robustness compared to a single input source (e.g., a single camera). For example, a recent paper (Fort et al., 2021) showed that large models trained on multi-modal data, such as CLIP (Radford et al., 2021), can significantly improve representation learning for detecting domain shift. Based on this finding, a promising direction for future research is to design multi-modal training methods which explicitly encourage model robustness. Another under-exploited approach for model robustness is run-time self-checking based on the temporal and semantic coherence of the data.
Faithful and effective evaluation of model robustness is another open challenge in real-world applications. Traditional evaluation approaches are designed based on the availability of labeled test sets on the target domain. However, in a real-world setting, the target domain may be constantly shifting, making test data collection inefficient and inaccurate. To address this issue, recent work proposes more practical settings to evaluate model robustness with only unlabeled test data (Garg et al., 2022) or selective data labeling (Wang et al., 2020b).
Unlike the training datasets and evaluation benchmarks commonly used in research, a safety-aware training set requires extensive data capturing, cleaning, and labeling to increase the coverage of unknown edge cases by collecting them directly from the open world. Techniques like active learning (Meng et al., 2021), object re-sampling (Chang et al., 2021), and self-labeling allow for efficient and targeted dataset improvements which directly translate to model performance improvements. Generative Adversarial Networks (GANs) are an emerging approach for generating effective large-scale vision datasets. For example, Zhang et al. (2021b) propose DatasetGAN, an automatic procedure to generate realistic image datasets with semantic segmentation labels.
7. Run-time Error Detection
The third strategy for ML safety is to detect model errors at run-time. Although the robust training methods discussed in Section 6.2 can significantly improve model robustness, they cannot entirely prevent run-time errors. As a result, run-time monitoring to detect any potential prediction errors is necessary from the safety standpoint. Selective prediction, also known as prediction with a reject option, is the main approach for run-time error detection (Geifman and El-Yaniv, 2017, 2019). Specifically, it requires the model to cautiously provide predictions only when it has high confidence in the test samples. Otherwise, when the model detects potential anomalies, it triggers fail-safe plans to prevent system failure. Selective prediction can significantly improve model robustness at the cost of test set coverage. In this section, we first review methods for model calibration and uncertainty quantification (Sec. 7.1) and then go over techniques to adapt such methods to specific application scenarios: out-of-distribution detection (Sec. 7.2) and adversarial attack detection (Sec. 7.3).
7.1. Prediction Uncertainty
Prediction uncertainty is the key to enabling selective prediction. The most intuitive uncertainty measure in DNN models is the softmax probability of the predicted class (also known as the maximum softmax probability, or MSP), as used in (Hendrycks and Gimpel, 2016a; Guo et al., 2017). However, DNNs are widely known to be susceptible to overconfidence, in which the MSP of misclassified samples can be as high as that of correct predictions, making MSP a poor measure of prediction confidence (Guo et al., 2017; Hein et al., 2019). Here we review two commonly used families of solutions for models' overconfident predictions.
7.1.1. Model Calibration
Model calibration aims to design robust training methods so that the MSP of the resulting model aligns with the likelihood of a correct prediction. Temperature scaling (Guo et al., 2017) is arguably the simplest model calibration method, as it does not require retraining the existing poorly calibrated model. Specifically, it softens the softmax probability by scaling down the logits by a temperature factor $T$ at test time. This technique was later found to also be very effective in alleviating overconfidence on test samples from a different distribution (Li and Hoiem, 2020). Considering that temperature scaling may undesirably clamp down legitimate high-confidence predictions, Kumar et al. (2018) proposed the maximum mean calibration error (MMCE), a trainable calibration measure based on a reproducing kernel Hilbert space (RKHS) that is minimized alongside the NLL loss during training. Label smoothing (Szegedy et al., 2016) is another simple and effective technique for model calibration that trains the model with a more uniform target distribution in the cross-entropy loss instead of the traditional one-hot labels; it not only gives more calibrated outputs but also leads to improved network generalization. Training with MixUp data augmentation (Zhang et al., 2017) is also found to benefit model calibration (Thulasidasan et al., 2019). Zhang et al. (2019e) used structured dropout to promote model diversity and improve model calibration. Unlike previous implicit regularization methods, Moon et al. (2020) proposed a correctness ranking loss to explicitly encourage model calibration in training. A concurrent work (Krishnan and Tickoo, 2020) proposed another explicit model calibration loss function, termed accuracy versus uncertainty calibration (AvUC) loss, for Bayesian neural networks. Recently, Karandikar et al. (2021) proposed a softened version of AvUC, termed S-AvUC, together with another soft calibration loss function termed SB-ECE, which are applicable in the more general non-Bayesian setting and outperform previous methods.
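A minimal temperature-scaling routine is sketched below: a single scalar $T$ is fit on held-out validation logits by minimizing the negative log-likelihood, and the trained model itself is left unchanged. The gradient-based optimizer and the step count are illustrative choices rather than the exact fitting procedure of the original paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, lr: float = 0.01, steps: int = 200):
    """Learn a temperature T on validation logits by minimizing NLL; model stays frozen."""
    log_t = torch.zeros(1, requires_grad=True)       # optimize log T so that T > 0
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# At test time: calibrated_probs = F.softmax(test_logits / T, dim=1)
```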
7.1.2. Uncertainty Quantification (UQ)
Uncertainty quantification aims to design accurate uncertainty or prediction confidence measures for ML models. The sources of uncertainty can be categorized into two types: aleatoric uncertainty and epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009). Gal (2016) first considered modeling aleatoric uncertainties in deep neural networks following a Bayesian modeling framework in which one assumes a prior distribution over the space of parameters. The author derived a practical approximate UQ measure, which essentially equals the prediction variance of an ensemble of models generated by statistical regularization techniques such as dropout (Srivastava et al., 2014) and its more advanced variants (Gal and Ghahramani, 2016; Gal et al., 2017a). Kendall and Gal (2017) further proposed a Bayesian framework that jointly models both aleatoric and epistemic uncertainties. Deep ensembles (Lakshminarayanan et al., 2017; Ovadia et al., 2019) have also been a popular approach for UQ. The common disadvantage of the above methods is additional computation at run-time. To overcome this high computation cost, UQ approaches based on a single deterministic model have been proposed (Chen et al., 2020a; Liu et al., 2020a; Van Amersfoort et al., 2020; Mukhoti et al., 2021). For instance, Chen et al. (Chen et al., 2020a) propose an angular visual hardness (AVH) distance-based measure which correlates well with human perception of visual hardness. AVH can be computed using regular training with the softmax cross-entropy loss, making it a convenient drop-in replacement for MSP as the uncertainty measure.
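As an example of dropout-based ensemble UQ, the sketch below keeps dropout layers active at test time and uses the variance across stochastic forward passes as a simple per-sample uncertainty score; the number of passes and the variance-based score are illustrative choices.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x, n_samples: int = 20):
    """Predictive mean and a variance-based uncertainty from stochastic forward passes."""
    model.eval()
    for m in model.modules():                 # re-enable only the dropout layers
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)                  # approximate predictive distribution
    uncertainty = probs.var(dim=0).sum(dim=1) # per-sample uncertainty score
    return mean, uncertainty
```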
7.2. Out-of-distribution Detection
Out-of-distribution (OOD) samples, or outliers, are samples that are disjoint from the source training distribution. OOD detection is a binary classification task to distinguish OOD samples from in-distribution (ID) samples at test time. Unlike model calibration (Section 7.1.1), OOD detection does not require the prediction confidence to align well with the likelihood of correct prediction on ID data. In the following, we categorize and review OOD detection techniques in three groups.
7.2.1. Distance-Based Detection
Distance-based methods measure the distance between the input sample and the source training set in the representation space. These techniques involve pre-processing or test-time sampling of the source domain distribution and measuring its averaged distance to the test sample. Various distance measures, including Mahalanobis distance (Lee et al., 2018), cosine similarity (Techapanurak et al., 2020), and Euclidean distance (Goyal et al., 2020), have been employed. For example, Ruff et al. (Ruff et al., 2018) present a deep one-class classification approach that minimizes a representation hypersphere around the normal distribution and calculates the detection score as the distance of the outlier sample to the center of the hypersphere. They later extended this work (Ruff et al., 2019) to use samples labeled as OOD in a semi-supervised manner. Bergman and Hoshen (Bergman and Hoshen, 2020) presented a technique that learns a feature space in which the inter-class distance is larger than the intra-class distance. Sohn et al. (Sohn et al., 2021) presented a two-stage one-class classification framework that leverages self-supervision and a shallow one-class classifier. The OOD detection performance of distance-based methods can be improved by ensembling measurements over multiple input augmentations (Tack et al., 2020) and network layers (Lee et al., 2018; Sastry and Oore, 2020). Sastry and Oore (Sastry and Oore, 2020) used Gram matrices to compute pairwise feature correlations between channels of each layer and identify anomalies by comparing input values with their respective ranges observed over the training data.
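The distance-based idea can be illustrated with a Mahalanobis-style score on penultimate-layer features: class-conditional means with a shared covariance are fit on the training set, and the negative distance to the closest class mean serves as the detection score. The sketch below omits the input pre-processing and layer ensembling used in Lee et al. (2018).

```python
import torch

def fit_gaussians(features, labels, num_classes):
    """Class-conditional means with a shared (tied) covariance on training features."""
    means, centered = [], []
    for c in range(num_classes):
        fc = features[labels == c]
        means.append(fc.mean(dim=0))
        centered.append(fc - means[-1])
    centered = torch.cat(centered)
    cov = centered.t() @ centered / features.size(0)
    precision = torch.linalg.pinv(cov)
    return torch.stack(means), precision

def mahalanobis_score(feature, means, precision):
    """Detection score: negative Mahalanobis distance to the closest class mean
    (higher score means more in-distribution)."""
    diffs = feature.unsqueeze(0) - means                       # [C, D]
    dists = torch.einsum("cd,de,ce->c", diffs, precision, diffs)
    return -dists.min()
```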
7.2.2. Classification-Based Detection
Classification-based detection techniques seek effective representation learning to encode normality together with OOD detection scores. Various OOD detection scores have been proposed, including the maximum softmax probability (Hendrycks and Gimpel, 2016a), prediction entropy (Hendrycks et al., 2018), and KL or Jensen-Shannon divergence from the uniform distribution (Hendrycks et al., 2019c). Further, Lee et al. (Lee et al., 2017) and Hsu et al. (Hsu et al., 2020) proposed a combination of temperature scaling and adversarial input perturbations to calibrate the model for better OOD detection. Another line of research proposes using a disjoint unlabeled OOD training set to learn normality and hence improve OOD detection. Hendrycks et al. (Hendrycks et al., 2018) present a case for jointly training on a natural outlier set (from any auxiliary disjoint training set) together with the normal training set, resulting in fast and memory-efficient OOD detection with minimal architectural changes. Later, Mohseni et al. (Mohseni et al., 2020) show this type of training can be further improved by using additional reject classes in the last layer. Other classification-based techniques include revising the network architecture to learn better prediction confidence during training (DeVries and Taylor, 2018; Yu and Aizawa, 2019). From a different perspective, Vyas et al. (Vyas et al., 2018) present a framework employing an ensemble of classifiers, each of which leaves out a subset of the training set as OOD examples and uses the rest as the normal in-distribution training set. Recent work shows that self-supervised learning can further improve OOD detection and surpass prior techniques (Golan and El-Yaniv, 2018; Hendrycks et al., 2019c; Winkens et al., 2020; Tack et al., 2020; Sehwag et al., 2021; Mohseni et al., 2021b). For instance, Tack et al. (Tack et al., 2020) propose using geometric transformations like rotation to shift different samples further away and thereby improve OOD detection performance.
7.2.3. Density-based Detection
Using density estimates from Deep Generative Models (DGMs) is another line of work for detecting OOD samples by fitting a probability density function to the source distribution. Different likelihood-based scores have been proposed for OOD detection using GANs (Ren et al., 2019a) and reconstruction error using VAEs (Zong et al., 2018). However, some recent studies present counterintuitive results that challenge the validity of VAE and DGM likelihood ratios for semantic OOD detection in high-dimensional data (Wang et al., 2020d) and propose ways to improve likelihood-based scores, such as using natural perturbations (Choi and Chung, 2020). For instance, Serrà et al. (Serrà et al., 2019) connect the limitations of generative models' likelihood scores for OOD detection with input complexity and use an estimate of input complexity to derive a new, efficient detection score. Energy-based models (EBMs) are another family of DGMs that have shown higher performance in OOD detection. Du and Mordatch (2019) present EBMs for OOD detection on high-dimensional data and investigate the limitations of reconstruction error in VAEs compared to their proposed energy score. Liu et al. (2020b) propose another energy-based framework for model training with a new OOD detection score computed from discriminative models.
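A minimal version of the energy-based detection score in the spirit of Liu et al. (2020b) is sketched below; it is computed directly from classifier logits, and a threshold chosen on validation data turns the score into a detector.

```python
import torch

def energy_ood_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score E(x) = -T * logsumexp(f(x)/T); higher energy suggests OOD."""
    return -temperature * torch.logsumexp(logits / temperature, dim=1)

# Usage sketch: flag a sample as OOD if energy_ood_score(logits) > tau,
# where tau is a threshold selected on held-out in-distribution data.
```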
7.3. Adversarial Detection and Guards
Adversarial detection and adversarial guards are test-time methods to mitigate the risk of adversarial attacks. Adversarial detection refers to designing a detector that identifies adversarially perturbed inputs, while adversarial guarding removes the effects of adversarial perturbations from a given input sample. Note that neither of these approaches manipulates the model parameters or the model training process; therefore, these solutions are complementary to the adversarial training solutions reviewed in Section 6.2.3.
7.3.1. Adversarial Attack Detection
The most straightforward way to detect adversarial examples is to train a secondary model as the adversarial sample detector. For example, Grosse et al. (2017) and Gong et al. (2017) both trained a binary classifier on clean and adversarial training samples as the adversarial sample detector. Besides raw pixel values, adversarial and clean examples have different intrinsic properties which can be used to detect adversarial examples. For example, adversarial images have greater variance in low-ranked principal components (Hendrycks and Gimpel, 2016b) and larger local intrinsic dimensionality (LID) (Ma et al., 2018) than clean images.
Statistical Testing utilizes the difference in distribution between adversarial examples and natural clean examples for adversarial detection. For example, Grosse et al. (2017) use the Maximum Mean Discrepancy (MMD) test to evaluate whether the clean images and adversarial examples are from the same distribution. They observe that adversarial examples are located in different output surface regions compared to clean inputs. Similarly, Feinman et al. (2017) leverage kernel density estimates from the last layer and Bayesian uncertainty estimates from the dropout layer to measure the statistical difference between adversarial examples and normal ones.
Applying Transformation and Randomness is another approach to detect adversarial examples based on the observation that natural images could be resistant to the transformation or random perturbations while the adversarial ones are not. Therefore, one can detect adversarial examples with high accuracy based on the model prediction discrepancy due to applying simple transformations and randomness (Li and Li, 2017; Xu et al., 2017). For instance, Li and Li (2017) apply a small mean blur filter to the input image before feeding it into the network. They show that natural images are resistant to such transformations while adversarial images are not.
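The transformation-based detection idea can be sketched as follows: the prediction on an input is compared with the prediction on a mean-blurred copy, and a large discrepancy flags a potential adversarial example. The blur kernel size and the L1 discrepancy measure are illustrative choices, not the exact detector of Li and Li (2017).

```python
import torch
import torch.nn.functional as F

def prediction_discrepancy(model, x, blur_kernel: int = 3):
    """Per-sample L1 gap between predictions on the input and a mean-blurred copy."""
    model.eval()
    with torch.no_grad():
        p_orig = F.softmax(model(x), dim=1)
        # Depthwise mean-blur filter applied to every channel.
        weight = torch.ones(x.size(1), 1, blur_kernel, blur_kernel,
                            device=x.device) / (blur_kernel ** 2)
        x_blur = F.conv2d(x, weight, padding=blur_kernel // 2, groups=x.size(1))
        p_blur = F.softmax(model(x_blur), dim=1)
    # Large discrepancy suggests a potentially adversarial input (threshold on validation data).
    return (p_orig - p_blur).abs().sum(dim=1)
```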
7.3.2. Test-Time Adversarial Guard
Test-time adversarial guards aim to accurately classify both adversarial and natural inputs without changing the model parameters. Various test-time transformations have been proposed to diminish the effect of adversarial perturbations by pre-processing inputs before feeding them to the model. Research has investigated the efficiency of applying different basis transformations to input images, including JPEG compression (Das et al., 2017; Guo et al., 2018; Shaham et al., 2018), bit-depth reduction, image quilting, total variance minimization (Guo et al., 2018), low-pass filters, PCA, low-resolution wavelet approximations, and soft-thresholding (Shaham et al., 2018). Most of these defense strategies prevent adversarial attacks by obfuscating gradient propagation with non-differentiable or random operations. As a result, they have been shown to be ineffective under stronger adversarial attacks that bypass such gradient obfuscation (Athalye et al., 2018).
7.4. Challenges and Opportunities
An open challenge in OOD detection is to improve performance on near-OOD samples that are visually similar to ID samples yet are outliers with respect to semantic meaning. This scenario is very common in fine-grained image classification and analysis domains, where target ID samples can be highly similar to OOD samples. Recent papers have made attempts at this more challenging scenario (Zhang et al., 2021a; Fort et al., 2021); however, OOD detection performance on near-OOD samples is still much worse than that on far-OOD samples (i.e., visually more distinct samples).
Another open research direction is to propose techniques for efficient OOD sample selection and training. In a recent work, Chen et al. (2021) present ATOM as an empirically verified technique for mining informative auxiliary OOD training data. However, this direction remains under-explored, and many useful measures such as gradient norms (Katharopoulos and Fleuret, 2018) could be investigated for OOD training efficiency and performance.
Detecting adversarial examples will remain an open research problem as new attacks are introduced to challenge and defeat detection methods (Carlini and Wagner, 2017). Given the computational overhead of both generating and detecting adversarial samples, an efficient way to nullify attacks could be to benefit from multi-domain inputs, temporal data characteristics, and domain knowledge from a known clean training set. A related example is the work by Xiao et al. (2018a), which studies the spatial consistency property in the semantic segmentation task by randomly selecting image patches and cross-checking model predictions in the overlapping areas.
8. Discussion and Conclusions
In our survey, we presented a review of fundamental ML limitations from the perspective of engineering safety methods, followed by a taxonomy of safety-related techniques in ML. The impetus of this work was to leverage both engineering safety strategies and state-of-the-art ML techniques to enhance the dependability of ML components in autonomous systems. Here we summarize the key takeaways from our survey, with recommendations for future research on each item.
T1: Engineering Standards Can Support ML Product Safety
Safety needs for the design, development, and deployment of ML algorithms have subtle distinctions from those of code-based software. Our analysis is aligned with prior work and indicates that conventional engineering safety standards are not directly applicable to ML algorithm design. Consequently, relevant industrial safety standards suggest enforcing limitations on the operational domain of critical ML functions to minimize potential hazards due to ML malfunction. These limitations stem from the lack of ML technology readiness and are intended to reduce the risk of hazards to an acceptable level. Additionally, recent regulations mandate data privacy and algorithmic transparency, which in turn could encourage new principles in responsible ML development and deployment pipelines. In line with safety standards, we recommend performing thorough risk and hazard assessments for ML components and limiting their functionality to minimize the risk of failure.
T2: The Value of ML Safety Taxonomy
The main contribution of this paper is to establish a meaningful ML Safety Taxonomy based on ML characteristics and limitations, so that ML development can directly benefit from engineering safety practices. Specifically, our taxonomy of ML safety techniques maps key engineering safety principles to relevant ML safety strategies to understand and emphasize the impact of each ML solution on model reliability. The proposed taxonomy is supported by a comprehensive review of related literature and a hierarchical table of representative papers (Table 1) that categorizes ML techniques into three major safety strategies and their subsequent solutions.
The benefit of the ML safety taxonomy is to break down the problem space into smaller components, help lay down a road map for safety needs in ML, and identify plausible future research directions. We highlight existing challenges and plausible directions as a way to gauge technology readiness for each safety strategy within the main body of the literature review in Sections 5, 6, and 7. However, given the fast pace of developments in the field of ML, a thorough assessment of technology readiness may not be one-size-fits-all across ML systems. On the other hand, the proposed ML safety taxonomy can benefit from emerging concepts such as Responsible AI (Arrieta et al., 2020; Lu et al., 2021) to take the social and legal aspects of safety into account.
T3: Recommendations for Choosing ML Safety Strategies
A practical way to improve the safety of complex ML products is to benefit from diversification of ML safety strategies and hence minimize the risk of hazards associated with ML malfunction. We recognize multiple reasons to benefit from such diversification. To start with, since no ML solution guarantees error-free performance in open-world environments, a design based on a collection of diverse solutions can learn a more complete data representation and hence achieve higher performance and robustness. In other words, a design based on a collection of diverse solutions is more likely to maintain robustness in the face of unforeseen distribution shifts, also known as edge cases.
Additionally, the overlaps and interactions between ML solutions boost overall performance and reduce development costs. For instance, scenario-based testing for model validation can directly impact data collection and training set quality in design and development cycles. Another example is the positive effect of transfer learning and domain generalization on uncertainty quantification and OOD detection. Lastly, diverse strategies can be applied at different stages of the design, development, and deployment of the ML lifecycle, which benefits from continuous monitoring of ML safety across all ML product teams.
T4: Recommendations for Safe AI Development Frameworks
ML system development tools and platforms (MLOps) aim to automate and unify the design, development, and deployment of ML systems with a collection of best practices. Prior work has emphasized MLOps tools to minimize the development and maintenance costs of large-scale ML systems (Amershi et al., 2019). We propose that existing and emerging MLOps platforms support and prioritize the adoption and monitoring of ML safety strategies at both the system and software levels. A safety-oriented ML lifecycle incorporates all aspects of ML development, from constructing the safety scope and requirements, to data management, model training and evaluation, and open-world deployment and monitoring (Ashmore et al., 2021; Hawkins et al., 2021). Industry-oriented efforts in safety-aware MLOps can unify the necessary tools and metrics and increase accessibility for all AI development teams.
Recent emerging concepts such as Responsible AI (Arrieta et al., 2020) and Explainable AI (Mohseni et al., 2018) aim at building safe AI systems that ensure data privacy, fairness, and human-centered values in AI development. These emerging AI concepts can target aspects beyond the functional safety of the intelligent system and help prevent end-users (e.g., the driver of an autonomous vehicle) from unintentionally misusing the system due to over-trust and user unawareness (Mohseni et al., 2019).
References
- Adebayo et al. [2018] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In NIPS, 2018.
- Amershi et al. [2019] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann. Software engineering for machine learning: A case study. In ICSE-SEIP. IEEE, 2019.
- Amodei et al. [2016] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- Arrieta et al. [2020] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, et al. Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115, 2020.
- Ashmore et al. [2021] R. Ashmore, R. Calinescu, and C. Paterson. Assuring the machine learning lifecycle: Desiderata, methods, and challenges. ACM Computing Surveys (CSUR), 2021.
- Athalye et al. [2018] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283. PMLR, 2018.
- Awais et al. [2021] M. Awais, F. Zhou, H. Xu, L. Hong, P. Luo, S.-H. Bae, and Z. Li. Adversarial robustness for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8568–8577, 2021.
- Bach et al. [2015] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140, 2015.
- Bai et al. [2021] Y. Bai, J. Mei, A. L. Yuille, and C. Xie. Are transformers more robust than CNNs? NeurIPS, 2021.
- Balaji et al. [2018] Y. Balaji, S. Sankaranarayanan, and R. Chellappa. Metareg: Towards domain generalization using meta-regularization. In NeurIPS, pages 1006–1016, 2018.
- Bansal and Weld [2018] G. Bansal and D. S. Weld. A coverage-based utility model for identifying unknown unknowns. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Bansal et al. [2018] N. Bansal, X. Chen, and Z. Wang. Can we gain more from orthogonality regularizations in training deep cnns. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
- Bartocci et al. [2018] E. Bartocci, J. Deshmukh, A. Donzé, G. Fainekos, O. Maler, D. Ničković, and S. Sankaranarayanan. Specification-based monitoring of cyber-physical systems: a survey on theory, tools and applications. In Lectures on Runtime Verification, pages 135–175. Springer, 2018.
- Beluch et al. [2018] W. H. Beluch, T. Genewein, A. Nürnberger, and J. M. Köhler. The power of ensembles for active learning in image classification. In CVPR, 2018.
- Bergman and Hoshen [2020] L. Bergman and Y. Hoshen. Classification-based anomaly detection for general data. arXiv:2005.02359, 2020.
- Beyer et al. [2020] L. Beyer, O. J. Hénaff, A. Kolesnikov, X. Zhai, and A. v. d. Oord. Are we done with imagenet? arXiv preprint arXiv:2006.07159, 2020.
- Bhojanapalli et al. [2021] S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit. Understanding robustness of transformers for image classification. arXiv preprint arXiv:2103.14586, 2021.
- Bochkovskiy et al. [2020] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- Bojarski et al. [2018] M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. J. Ackel, U. Muller, P. Yeres, and K. Zieba. Visualbackprop: Efficient visualization of cnns for autonomous driving. In ICRA, 2018.
- Bouguelia et al. [2018] M.-R. Bouguelia, S. Nowaczyk, K. Santosh, and A. Verikas. Agreeing to disagree: active learning with noisy labels without crowdsourcing. International Journal of Machine Learning and Cybernetics, 9(8):1307–1319, 2018.
- Brockman et al. [2016] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
- Brown et al. [2017] T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.
- Carlini and Wagner [2017] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In 10th ACM Workshop on Artificial Intelligence and Security, 2017.
- Chakraborty et al. [2014] S. Chakraborty, D. Fremont, K. Meel, S. Seshia, and M. Vardi. Distribution-aware sampling and weighted model counting for sat. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
- Chan et al. [2021] R. Chan, K. Lis, S. Uhlemeyer, H. Blum, S. Honari, R. Siegwart, P. Fua, M. Salzmann, and M. Rottmann. Segmentmeifyoucan: A benchmark for anomaly segmentation. In Advances in Neural Information Processing Systems, 2021.
- Chang et al. [2021] N. Chang, Z. Yu, Y.-X. Wang, A. Anandkumar, S. Fidler, and J. M. Alvarez. Image-level or object-level? a tale of two resampling strategies for long-tailed detection. In International Conference on Machine Learning, pages 1463–1472. PMLR, 2021.
- Chen et al. [2020a] B. Chen, W. Liu, Z. Yu, J. Kautz, A. Shrivastava, A. Garg, and A. Anandkumar. Angular visual hardness. In ICML, 2020a.
- Chen et al. [2020b] H. Chen, B. Zhang, S. Xue, X. Gong, H. Liu, R. Ji, and D. Doermann. Anti-bandit neural architecture search for model defense. In ECCV, pages 70–85, 2020b.
- Chen et al. [2021] J. Chen, Y. Li, X. Wu, Y. Liang, and S. Jha. ATOM: Robustifying out-of-distribution detection using outlier mining. In ECML, pages 430–445, 2021.
- Chen et al. [2020c] T. Chen, S. Liu, S. Chang, Y. Cheng, L. Amini, and Z. Wang. Adversarial robustness: From self-supervised pre-training to fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020c.
- Choi and Chung [2020] S. Choi and S.-Y. Chung. Novelty detection via blurring. In International Conference on Learning Representations, 2020.
- Cluzeau et al. [2020] J. Cluzeau, X. Henriquel, G. Rebender, G. Soudain, L. van Dijk, A. Gronskiy, D. Haber, C. Perret-Gentil, and R. Polak. Concepts of design assurance for neural networks (codann). Public Report Extract Version 1.0, 2020.
- Cohen et al. [2019] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
- International Electrotechnical Commission [2000] International Electrotechnical Commission. Functional safety of electrical/electronic/programmable electronic safety-related systems. Technical report, 2000.
- Croce and Hein [2020] F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. arXiv preprint arXiv:2003.01690, 2020.
- Cubuk et al. [2018] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- Cubuk et al. [2020] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020.
- Das et al. [2016] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human attention in visual question answering: Do humans and deep networks look at the same regions? In EMNLP, 2016.
- Das et al. [2017] N. Das, M. Shanbhogue, S.-T. Chen, F. Hohman, L. Chen, M. E. Kounavis, and D. H. Chau. Keeping the bad guys out: Protecting and vaccinating deep learning with jpeg compression. arXiv preprint arXiv:1705.02900, 2017.
- Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- Der Kiureghian and Ditlevsen [2009] A. Der Kiureghian and O. Ditlevsen. Aleatory or epistemic? does it matter? Structural safety, 2009.
- DeVries and Taylor [2018] T. DeVries and G. W. Taylor. Learning confidence for out-of-distribution detection in neural networks. ICLR, 2018.
- Ding et al. [2020] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang. MMA training: Direct input space margin maximization through adversarial training. In ICLR, 2020.
- Djolonga et al. [2020] J. Djolonga, J. Yung, M. Tschannen, R. Romijnders, L. Beyer, A. Kolesnikov, J. Puigcerver, M. Minderer, A. D’Amour, D. Moldovan, et al. On robustness and transferability of convolutional neural networks. arXiv:2007.08558, 2020.
- Doshi-Velez and Kim [2017] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608, 2017.
- Dreossi et al. [2018] T. Dreossi, S. Ghosh, X. Yue, K. Keutzer, A. Sangiovanni-Vincentelli, and S. A. Seshia. Counterexample-guided data augmentation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018.
- Dreossi et al. [2019a] T. Dreossi, A. Donzé, and S. A. Seshia. Compositional falsification of cyber-physical systems with machine learning components. Journal of Automated Reasoning, 63(4):1031–1053, 2019a.
- Dreossi et al. [2019b] T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, H. Ravanbakhsh, M. Vazquez-Chanlatte, and S. A. Seshia. Verifai: A toolkit for the formal design and analysis of artificial intelligence-based systems. In International Conference on Computer Aided Verification, pages 432–442. Springer, 2019b.
- Dreossi et al. [2019c] T. Dreossi, S. Ghosh, A. Sangiovanni-Vincentelli, and S. A. Seshia. A formalization of robustness for deep neural networks. arXiv preprint arXiv:1903.10033, 2019c.
- Du and Mordatch [2019] Y. Du and I. Mordatch. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, pages 3608–3618, 2019.
- Dutta et al. [2018] S. Dutta, S. Jha, S. Sankaranarayanan, and A. Tiwari. Output range analysis for deep feedforward neural networks. In NASA Formal Methods, LNCS 10811. Springer, 2018.
- Dvijotham et al. [2018] K. Dvijotham, S. Gowal, R. Stanforth, R. Arandjelovic, B. O’Donoghue, J. Uesato, and P. Kohli. Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265, 2018.
- Dvijotham et al. [2019] K. D. Dvijotham, R. Stanforth, S. Gowal, C. Qin, S. De, and P. Kohli. Efficient neural network verification with exactness characterization. In Uncertainty in Artificial Intelligence (UAI), 2019.
- Feinman et al. [2017] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
- Fidon et al. [2020] L. Fidon, S. Ourselin, and T. Vercauteren. Distributionally robust deep learning using hardness weighted sampling. arXiv preprint arXiv:2001.02658, 2020.
- International Organization for Standardization [2011]. ISO 26262: Road vehicles – Functional safety. Technical report, 2011.
- International Organization for Standardization [2019]. ISO/PAS 21448: Road vehicles — Safety of the intended functionality. Technical report, 2019.
- Fort et al. [2021] S. Fort, J. Ren, and B. Lakshminarayanan. Exploring the limits of out-of-distribution detection. In NeurIPS, 2021.
- Fremont et al. [2020] D. J. Fremont, E. Kim, Y. V. Pant, S. A. Seshia, A. Acharya, X. Bruso, P. Wells, S. Lemke, Q. Lu, and S. Mehta. Formal scenario-based testing of autonomous vehicles: From simulation to the real world. In 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pages 1–8. IEEE, 2020.
- Gal [2016] Y. Gal. Uncertainty in deep learning. PhD thesis, University of Cambridge, 2016.
- Gal and Ghahramani [2016] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
- Gal et al. [2017a] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In NeurIPS, 2017a.
- Gal et al. [2017b] Y. Gal, R. Islam, and Z. Ghahramani. Deep bayesian active learning with image data. In International Conference on Machine Learning, pages 1183–1192. PMLR, 2017b.
- Ganin and Lempitsky [2014] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv:1409.7495, 2014.
- Gao et al. [2019] R. Gao, T. Cai, H. Li, C.-J. Hsieh, L. Wang, and J. D. Lee. Convergence of adversarial training in overparametrized neural networks. Advances in Neural Information Processing Systems, 32:13029–13040, 2019.
- Garg et al. [2022] S. Garg, S. Balakrishnan, Z. C. Lipton, B. Neyshabur, and H. Sedghi. Leveraging unlabeled data to predict out-of-distribution performance. In ICLR, 2022.
- Geifman and El-Yaniv [2017] Y. Geifman and R. El-Yaniv. Selective classification for deep neural networks. In NIPS, pages 4878–4887, 2017.
- Geifman and El-Yaniv [2019] Y. Geifman and R. El-Yaniv. Selectivenet: A deep neural network with an integrated reject option. arXiv preprint arXiv:1901.09192, 2019.
- Geirhos et al. [2019] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR, 2019.
- Ghorbani et al. [2019] A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim. Towards automatic concept-based explanations. In Advances in Neural Information Processing Systems, volume 32, pages 9277–9286, 2019.
- Golan and El-Yaniv [2018] I. Golan and R. El-Yaniv. Deep anomaly detection using geometric transformations. In NIPS, pages 9758–9769, 2018.
- Gong et al. [2021] C. Gong, T. Ren, M. Ye, and Q. Liu. MaxUp: A simple way to improve generalization of neural network training. In CVPR, 2021.
- Gong et al. [2019] R. Gong, W. Li, Y. Chen, and L. V. Gool. DLOW: Domain flow for adaptation and generalization. In CVPR, pages 2477–2486, 2019.
- Gong et al. [2017] Z. Gong, W. Wang, and W.-S. Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960, 2017.
- Goodfellow et al. [2015] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ICLR, 2015.
- Goodman and Flaxman [2017] B. Goodman and S. Flaxman. European union regulations on algorithmic decision-making and a “right to explanation”. AI magazine, 38(3):50–57, 2017.
- Gowal et al. [2018] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, T. Mann, and P. Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
- Goyal et al. [2020] S. Goyal, A. Raghunathan, M. Jain, H. V. Simhadri, and P. Jain. DROCC: Deep robust one-class classification. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- Grathwohl et al. [2019] W. Grathwohl, K.-C. Wang, J.-H. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2019.
- Grosse et al. [2017] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
- Gui et al. [2019] S. Gui, H. Wang, H. Yang, C. Yu, Z. Wang, and J. Liu. Model compression with adversarial robustness: A unified optimization framework. In Advances in Neural Information Processing Systems, 2019.
- Guidotti et al. [2018] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and F. Giannotti. Local rule-based explanations of black box decision systems. arXiv preprint arXiv:1805.10820, 2018.
- Guo et al. [2017] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
- Guo et al. [2018] C. Guo, M. Rana, M. Cisse, and L. van der Maaten. Countering adversarial images using input transformations. In ICLR, 2018.
- Guo et al. [2020] M. Guo, Y. Yang, R. Xu, Z. Liu, and D. Lin. When NAS meets robustness: In search of robust architectures against adversarial attacks. In CVPR, 2020.
- Haixiang et al. [2017] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: Review of methods and applications. Expert systems with applications, 73:220–239, 2017.
- Han et al. [2018] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018.
- Han et al. [2019] J. Han, P. Luo, and X. Wang. Deep self-learning from noisy labels. In ICCV, pages 5138–5147, 2019.
- Haussmann et al. [2020] E. Haussmann, M. Fenzi, K. Chitta, J. Ivanecky, H. Xu, D. Roy, A. Mittel, N. Koumchatzky, C. Farabet, and J. M. Alvarez. Scalable active learning for object detection. In IEEE Intelligent Vehicles Symposium (IV), 2020.
- Hawkins et al. [2021] R. Hawkins, C. Paterson, C. Picardi, Y. Jia, R. Calinescu, and I. Habli. Guidance on the assurance of machine learning in autonomous systems (amlas). arXiv preprint arXiv:2102.01564, 2021.
- Hein et al. [2019] M. Hein, M. Andriushchenko, and J. Bitterwolf. Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In CVPR, 2019.
- Hendrycks and Dietterich [2019] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. ICLR, 2019.
- Hendrycks and Gimpel [2016a] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016a.
- Hendrycks and Gimpel [2016b] D. Hendrycks and K. Gimpel. Early methods for detecting adversarial images. arXiv preprint arXiv:1608.00530, 2016b.
- Hendrycks et al. [2018] D. Hendrycks, M. Mazeika, and T. Dietterich. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2018.
- Hendrycks et al. [2019a] D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song. Scaling out-of-distribution detection for real-world settings. arXiv preprint arXiv:1911.11132, 2019a.
- Hendrycks et al. [2019b] D. Hendrycks, K. Lee, and M. Mazeika. Using pre-training can improve model robustness and uncertainty. arXiv preprint arXiv:1901.09960, 2019b.
- Hendrycks et al. [2019c] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song. Using self-supervised learning can improve model robustness and uncertainty. In NeurIPS, pages 15663–15674, 2019c.
- Hendrycks et al. [2019d] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019d.
- Hendrycks et al. [2019e] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019e.
- Hendrycks et al. [2020] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv:2006.16241, 2020.
- Hendrycks et al. [2021] D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt. Unsolved problems in ml safety. arXiv preprint arXiv:2109.13916, 2021.
- Hernández-Orallo et al. [2019] J. Hernández-Orallo, F. Martínez-Plumed, S. Avin, and S. Ó. hÉigeartaigh. Surveying safety-relevant ai characteristics. In SafeAI@ AAAI, 2019.
- Hohman et al. [2019] F. Hohman, H. Park, C. Robinson, and D. H. P. Chau. Summit: scaling deep learning interpretability by visualizing activation and attribution summarizations. IEEE Transactions on Visualization and Computer Graphics, 2019.
- Hong et al. [2021] J. Hong, H. Wang, Z. Wang, and J. Zhou. Federated robustness propagation: Sharing adversarial robustness in federated learning. arXiv preprint arXiv:2106.10196, 2021.
- Hong et al. [2022] J. Hong, H. Wang, Z. Wang, and J. Zhou. Efficient split-mix federated learning for on-demand and in-situ customization. In International Conference on Learning Representations, 2022.
- Hsu et al. [2020] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira. Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- Hu et al. [2020] T.-K. Hu, T. Chen, H. Wang, and Z. Wang. Triple wins: Boosting accuracy, robustness and efficiency together by enabling input-adaptive inference. arXiv preprint arXiv:2002.10025, 2020.
- Huang et al. [2017] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks. In International conference on computer aided verification, pages 3–29. Springer, 2017.
- Huang et al. [2020a] X. Huang, D. Kroening, W. Ruan, J. Sharp, Y. Sun, E. Thamo, M. Wu, and X. Yi. A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability. Computer Science Review, 37:100270, 2020a.
- Huang et al. [2020b] Z. Huang, H. Wang, E. P. Xing, and D. Huang. Self-challenging improves cross-domain generalization. In ECCV, 2020b.
- Iyyer et al. [2018] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer. Adversarial example generation with syntactically controlled paraphrase networks. In NAACL-HLT, pages 1875–1885, 2018.
- Jiang et al. [2020] Z. Jiang, T. Chen, T. Chen, and Z. Wang. Robust pre-training by adversarial contrastive learning. arXiv preprint arXiv:2010.13337, 2020.
- Kahng et al. [2018] M. Kahng, P. Y. Andrews, A. Kalro, and D. H. P. Chau. Activis: visual exploration of industry-scale deep neural network models. IEEE Transactions on Visualization and Computer Graphics, 24(1):88–97, 2018.
- Kang et al. [2019] D. Kang, Y. Sun, D. Hendrycks, T. Brown, and J. Steinhardt. Testing robustness against unforeseen adversaries. arXiv preprint arXiv:1908.08016, 2019.
- Karandikar et al. [2021] A. Karandikar, N. Cain, D. Tran, B. Lakshminarayanan, J. Shlens, M. C. Mozer, and R. Roelofs. Soft calibration objectives for neural networks. In NeurIPS, 2021.
- Katharopoulos and Fleuret [2018] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling. In ICML, pages 2525–2534, 2018.
- Katz et al. [2017] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117. Springer, 2017.
- Kendall and Gal [2017] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NIPS, pages 5574–5584, 2017.
- Kim et al. [2018] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, 2018.
- Kim et al. [2020] E. Kim, D. Gopinath, C. Pasareanu, and S. A. Seshia. A programmatic and semantic approach to explaining and debugging neural network based object detectors. In CVPR, 2020.
- Kindermans et al. [2019] P.-J. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim. The (un) reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, 2019.
- Koopman and Wagner [2017] P. Koopman and M. Wagner. Autonomous vehicle safety: An interdisciplinary challenge. IEEE Intelligent Transportation Systems Magazine, 9(1):90–96, 2017.
- Krishnan and Tickoo [2020] R. Krishnan and O. Tickoo. Improving model calibration with accuracy versus uncertainty optimization. In NeurIPS, pages 18237–18248, 2020.
- Kumar et al. [2018] A. Kumar, S. Sarawagi, and U. Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In ICML, 2018.
- Lage et al. [2018] I. Lage, A. S. Ross, S. J. Gershman, B. Kim, and F. Doshi-Velez. Human-in-the-loop interpretability prior. In NeurIPS, pages 10180–10189, 2018.
- Lakkaraju et al. [2016] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- Lakkaraju et al. [2017] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In AAAI, 2017.
- Lakkaraju et al. [2019] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec. Faithful and customizable explanations of black box models. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 131–138, 2019.
- Lakkaraju et al. [2020] H. Lakkaraju, N. Arsov, and O. Bastani. Robust and stable black box explanations. In International Conference on Machine Learning, pages 5628–5638. PMLR, 2020.
- Lakshminarayanan et al. [2017] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
- Lambert et al. [2020] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun. MSeg: A composite dataset for multi-domain semantic segmentation. In CVPR, 2020.
- LeCun et al. [2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
- Lee et al. [2017] K. Lee, H. Lee, K. Lee, and J. Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
- Lee et al. [2018] K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
- Lee et al. [2020] K. Lee, K. Lee, J. Shin, and H. Lee. Network randomization: A simple technique for generalization in deep reinforcement learning. In ICLR, 2020.
- Leike et al. [2017] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
- Lertvittayakumjorn and Toni [2019] P. Lertvittayakumjorn and F. Toni. Human-grounded evaluations of explanation methods for text classification. In EMNLP-IJCNLP, pages 5198–5208, 2019.
- Li et al. [2017] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542–5550, 2017.
- Li et al. [2018] D. Li, Y. Yang, Y.-Z. Song, and T. Hospedales. Learning to generalize: Meta-learning for domain generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Li and Li [2017] X. Li and F. Li. Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pages 5764–5772, 2017.
- Li et al. [2020] Y. Li, Q. Yu, M. Tan, J. Mei, P. Tang, W. Shen, A. Yuille, et al. Shape-texture debiased neural network training. In ICLR, 2020.
- Li and Hoiem [2020] Z. Li and D. Hoiem. Improving confidence estimates for unfamiliar examples. In CVPR, 2020.
- Lin et al. [2019] J. Lin, C. Gan, and S. Han. Defensive quantization: When efficiency meets robustness. In ICLR, 2019.
- Lipton et al. [2018] Z. Lipton, Y.-X. Wang, and A. Smola. Detecting and correcting for label shift with black box predictors. In International conference on machine learning, pages 3122–3130, 2018.
- Liu et al. [2020a] J. Z. Liu, Z. Lin, S. Padhy, D. Tran, T. Bedrax-Weiss, and B. Lakshminarayanan. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. arXiv preprint arXiv:2006.10108, 2020a.
- Liu et al. [2020b] W. Liu, X. Wang, J. D. Owens, and Y. Li. Energy-based out-of-distribution detection. arXiv:2010.03759, 2020b.
- Lu et al. [2021] Q. Lu, L. Zhu, X. Xu, J. Whittle, D. Douglas, and C. Sanderson. Software engineering for responsible ai: An empirical study and operationalised patterns. arXiv preprint arXiv:2111.09478, 2021.
- Lundberg and Lee [2017] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
- L’heureux et al. [2017] A. L’heureux, K. Grolinger, H. F. Elyamany, and M. A. Capretz. Machine learning with big data: Challenges and approaches. IEEE Access, 5:7776–7797, 2017.
- Ma et al. [2018] X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.
- Maaten and Hinton [2008] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 2008.
- Madry et al. [2017] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Mahajan et al. [2018] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018.
- Mahmood et al. [2021] K. Mahmood, R. Mahmood, and M. Van Dijk. On the robustness of vision transformers to adversarial examples. arXiv preprint arXiv:2104.02610, 2021.
- Malinin et al. [2021] A. Malinin, N. Band, Y. Gal, M. Gales, A. Ganshin, G. Chesnokov, A. Noskov, A. Ploskonosov, L. Prokhorenkova, I. Provilkov, et al. Shifts: A dataset of real distributional shift across multiple large-scale tasks. In Advances in Neural Information Processing Systems, 2021.
- McAllister et al. [2017] R. McAllister, Y. Gal, A. Kendall, M. Van Der Wilk, A. Shah, R. Cipolla, and A. V. Weller. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. IJCAI, 2017.
- Meinke and Hein [2019] A. Meinke and M. Hein. Towards neural networks that provably know when they don’t know. In International Conference on Learning Representations, 2019.
- Meng and Chen [2017] D. Meng and H. Chen. Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 135–147. ACM, 2017.
- Meng et al. [2021] Q. Meng, W. Wang, T. Zhou, J. Shen, Y. Jia, and L. Van Gool. Towards a weakly supervised framework for 3d point cloud object detection and annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- Miyato et al. [2019] T. Miyato, S.-I. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2019.
- Mohri et al. [2018] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
- Mohseni et al. [2018] S. Mohseni, N. Zarei, and E. D. Ragan. A multidisciplinary survey and framework for design and evaluation of explainable ai systems. arXiv preprint, 2018.
- Mohseni et al. [2019] S. Mohseni, M. Pitale, V. Singh, and Z. Wang. Practical solutions for machine learning safety in autonomous vehicles. arXiv preprint arXiv:1912.09630, 2019.
- Mohseni et al. [2020] S. Mohseni, M. Pitale, J. Yadawa, and Z. Wang. Self-supervised learning for generalizable out-of-distribution detection. In AAAI Conference on Artificial Intelligence, 2020.
- Mohseni et al. [2021a] S. Mohseni, J. E. Block, and E. Ragan. Quantitative evaluation of machine learning explanations: A human-grounded benchmark. In 26th International Conference on Intelligent User Interfaces, 2021a. doi: 10.1145/3397481.3450689.
- Mohseni et al. [2021b] S. Mohseni, A. Vahdat, and J. Yadawa. Shifting transformation learning for out-of-distribution detection. arXiv preprint arXiv:2106.03899, 2021b.
- Mok et al. [2021] J. Mok, B. Na, H. Choe, and S. Yoon. AdvRush: Searching for adversarially robust neural architectures. In ICCV, pages 12322–12332, 2021.
- Moon et al. [2020] J. Moon, J. Kim, Y. Shin, and S. Hwang. Confidence-aware learning for deep neural networks. In ICML, pages 7034–7044, 2020.
- Mukhoti et al. [2021] J. Mukhoti, A. Kirsch, J. van Amersfoort, P. H. Torr, and Y. Gal. Deterministic neural networks with appropriate inductive biases capture epistemic and aleatoric uncertainty. arXiv preprint arXiv:2102.11582, 2021.
- Müller et al. [2019] R. Müller, S. Kornblith, and G. Hinton. When does label smoothing help? arXiv preprint arXiv:1906.02629, 2019.
- Nakkiran [2019] P. Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv preprint arXiv:1901.00532, 2019.
- Narodytska et al. [2018] N. Narodytska, S. Kasiviswanathan, L. Ryzhyk, M. Sagiv, and T. Walsh. Verifying properties of binarized deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Ning et al. [2020] X. Ning, J. Zhao, W. Li, T. Zhao, H. Yang, and Y. Wang. Multi-shot NAS for discovering adversarially robust convolutional neural architectures at targeted capacities. arXiv preprint arXiv:2012.11835, 2020.
- Olah et al. [2018] C. Olah, A. Satyanarayan, I. Johnson, S. Carter, L. Schubert, K. Ye, and A. Mordvintsev. The building blocks of interpretability. Distill, 2018. doi: 10.23915/distill.00010.
- Ovadia et al. [2019] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. V. Dillon, B. Lakshminarayanan, and J. Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. arXiv:1906.02530, 2019.
- Pan et al. [2018] X. Pan, P. Luo, J. Shi, and X. Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
- Pei et al. [2017a] K. Pei, Y. Cao, J. Yang, and S. Jana. Deepxplore: Automated whitebox testing of deep learning systems. In proceedings of the 26th Symposium on Operating Systems Principles, pages 1–18, 2017a.
- Pei et al. [2017b] K. Pei, Y. Cao, J. Yang, and S. Jana. Towards practical verification of machine learning: The case of computer vision systems. arXiv preprint arXiv:1712.01785, 2017b.
- Qian and Wegman [2018] H. Qian and M. N. Wegman. L2-nonexpansive neural networks, 2018.
- Qin et al. [2018] Y. Qin, H. Wang, C. Xu, X. Ma, and J. Lu. Syneva: Evaluating ml programs by mirror program synthesis. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS), pages 171–182. IEEE, 2018.
- Qiu et al. [2020] H. Qiu, C. Xiao, L. Yang, X. Yan, H. Lee, and B. Li. Semanticadv: Generating adversarial examples via attribute-conditioned image editing. In European Conference on Computer Vision, pages 19–37. Springer, 2020.
- Quiñonero-Candela et al. [2009] J. Quiñonero-Candela, M. Sugiyama, N. D. Lawrence, and A. Schwaighofer. Dataset shift in machine learning. Mit Press, 2009.
- Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
- Raghunathan et al. [2018] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.
- Recht et al. [2019] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pages 5389–5400, 2019.
- Ren et al. [2019a] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pages 14707–14718, 2019a.
- Ren et al. [2019b] S. Ren, Y. Deng, K. He, and W. Che. Generating natural language adversarial examples through probability weighted word saliency. In ACL, pages 1085–1097, 2019b.
- Ribeiro et al. [2016] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you? explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
- Ribeiro et al. [2018] M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence, 2018.
- Ruff et al. [2018] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep one-class classification. In ICML, pages 4393–4402, 2018.
- Ruff et al. [2019] L. Ruff, R. A. Vandermeulen, N. Görnitz, A. Binder, E. Müller, K.-R. Müller, and M. Kloft. Deep semi-supervised anomaly detection. In International Conference on Learning Representations, 2019.
- Sadigh et al. [2016a] D. Sadigh, S. Sastry, S. A. Seshia, and A. D. Dragan. Planning for autonomous cars that leverage effects on human actions. In Robotics: Science and Systems, volume 2. Ann Arbor, MI, USA, 2016a.
- Sadigh et al. [2016b] D. Sadigh, S. S. Sastry, S. A. Seshia, and A. Dragan. Information gathering actions over human internal state. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 66–73. IEEE, 2016b.
- Salay et al. [2017] R. Salay, R. Queiroz, and K. Czarnecki. An analysis of iso 26262: Using machine learning safely in automotive software. arXiv preprint arXiv:1709.02435, 2017.
- Salehi et al. [2021] M. Salehi, H. Mirzaei, D. Hendrycks, Y. Li, M. H. Rohban, and M. Sabokrou. A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges. arXiv preprint arXiv:2110.14051, 2021.
- Samangouei et al. [2018] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
- Samek et al. [2017] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673, 2017.
- Sastry and Oore [2020] C. S. Sastry and S. Oore. Detecting out-of-distribution examples with in-distribution examples and Gram matrices. In ICML, pages 8491–8501, 2020.
- Schlegl et al. [2017] T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, and G. Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In IPMI. Springer, 2017.
- Sehwag et al. [2020] V. Sehwag, S. Wang, P. Mittal, and S. Jana. On pruning adversarially robust neural networks. arXiv preprint arXiv:2002.10509, 2020.
- Sehwag et al. [2021] V. Sehwag, M. Chiang, and P. Mittal. Ssd: A unified framework for self-supervised outlier detection. In International Conference on Learning Representations, 2021.
- Selvaraju et al. [2017] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 2017.
- Serrà et al. [2019] J. Serrà, D. Álvarez, V. Gómez, O. Slizovskaia, J. F. Núñez, and J. Luque. Input complexity and out-of-distribution detection with likelihood-based generative models. arXiv preprint arXiv:1909.11480, 2019.
- Seshia et al. [2016] S. A. Seshia, D. Sadigh, and S. S. Sastry. Towards verified artificial intelligence. arXiv preprint arXiv:1606.08514, 2016.
- Seshia et al. [2018] S. A. Seshia, A. Desai, T. Dreossi, D. J. Fremont, S. Ghosh, E. Kim, S. Shivakumar, M. Vazquez-Chanlatte, and X. Yue. Formal specification for deep neural networks. In ATVA. Springer, 2018.
- Shafahi et al. [2019] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.
- Shaham et al. [2018] U. Shaham, J. Garritano, Y. Yamada, E. Weinberger, A. Cloninger, X. Cheng, K. Stanton, and Y. Kluger. Defending against adversarial images using basis functions transformations. arXiv preprint arXiv:1803.10840, 2018.
- Shneiderman [2020] B. Shneiderman. Bridging the gap between ethics and practice: Guidelines for reliable, safe, and trustworthy human-centered ai systems. ACM Transactions on Interactive Intelligent Systems (TiiS), 10(4):1–31, 2020.
- Shrikumar et al. [2017] A. Shrikumar, P. Greenside, and A. Kundaje. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, pages 3145–3153, 2017.
- Siddiqui et al. [2020] Y. Siddiqui, J. Valentin, and M. Nießner. Viewal: Active learning with viewpoint entropy for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9433–9443, 2020.
- Siebert et al. [2020] J. Siebert, L. Joeckel, J. Heidrich, K. Nakamichi, K. Ohashi, I. Namba, R. Yamamoto, and M. Aoyama. Towards guidelines for assessing qualities of machine learning systems. In QUATIC, pages 17–31. Springer, 2020.
- Simonyan et al. [2013] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Singla and Feizi [2021] S. Singla and S. Feizi. Fantastic four: Differentiable bounds on singular values of convolution layers. In International Conference on Learning Representations, 2021.
- Sinha et al. [2018] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.
- Smilkov et al. [2017] D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
- Smuha [2019] N. A. Smuha. The eu approach to ethics guidelines for trustworthy artificial intelligence. CRi-Computer Law Review International, 2019.
- Sohn et al. [2021] K. Sohn, C.-L. Li, J. Yoon, M. Jin, and T. Pfister. Learning and evaluating representations for deep one-class classification. In International Conference on Learning Representations, 2021.
- Springenberg et al. [2014] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
- Srivastava et al. [2014] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Strobelt et al. [2018] H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush. Lstmvis: A tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):667–676, 2018.
- Sun et al. [2018] Y. Sun, X. Huang, D. Kroening, J. Sharp, M. Hill, and R. Ashmore. Testing deep neural networks. arXiv preprint arXiv:1803.04792, 2018.
- Szegedy et al. [2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- Tack et al. [2020] J. Tack, S. Mo, J. Jeong, and J. Shin. CSI: Novelty detection via contrastive learning on distributionally shifted instances. In NeurIPS, 2020.
- Tang et al. [2019] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, and X. Yang. Pamtri: Pose-aware multi-task learning for vehicle re-identification using highly randomized synthetic data. In ICCV, 2019.
- Tavakoli et al. [2021] M. Tavakoli, F. Agostinelli, and P. Baldi. SPLASH: Learnable activation functions for improving accuracy and adversarial robustness. Neural Networks, 140:1–12, 2021.
- Techapanurak et al. [2020] E. Techapanurak, M. Suganuma, and T. Okatani. Hyperparameter-free out-of-distribution detection using cosine similarity. In ACCV, 2020.
- Thulasidasan et al. [2019] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and S. Michalak. On MixUp training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
- Tobin et al. [2017] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017.
- Tramer et al. [2020] F. Tramer, N. Carlini, W. Brendel, and A. Madry. On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems, 33:1633–1645, 2020.
- Tremblay et al. [2018] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 969–977, 2018.
- Tsipras et al. [2018] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
- Tsipras et al. [2020] D. Tsipras, S. Santurkar, L. Engstrom, A. Ilyas, and A. Madry. From imagenet to image classification: Contextualizing progress on benchmarks. arXiv preprint arXiv:2005.11295, 2020.
- Van Amersfoort et al. [2020] J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal. Uncertainty estimation using a single deep deterministic neural network. In International Conference on Machine Learning, pages 9690–9700. PMLR, 2020.
- Van Horn and Perona [2017] G. Van Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450, 2017.
- Varshney [2016] K. R. Varshney. Engineering safety in machine learning. In Information Theory and Applications Workshop, 2016.
- Vasconcelos et al. [2020] C. Vasconcelos, H. Larochelle, V. Dumoulin, N. L. Roux, and R. Goroshin. An effective anti-aliasing approach for residual networks. arXiv preprint arXiv:2011.10675, 2020.
- Volpi et al. [2018] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, 2018.
- Vyas et al. [2018] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke. Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In ECCV, pages 550–564, 2018.
- Wang et al. [2019a] H. Wang, S. Ge, E. P. Xing, and Z. C. Lipton. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, 2019a.
- Wang et al. [2019b] H. Wang, Z. He, Z. C. Lipton, and E. P. Xing. Learning robust representations by projecting superficial statistics out. In ICLR, 2019b.
- Wang et al. [2020a] H. Wang, T. Chen, S. Gui, T.-K. Hu, J. Liu, and Z. Wang. Once-for-all adversarial training: In-situ tradeoff between robustness and accuracy for free. arXiv preprint arXiv:2010.11828, 2020a.
- Wang et al. [2020b] H. Wang, T. Chen, Z. Wang, and K. Ma. I am going mad: Maximum discrepancy competition for comparing classifiers adaptively. In International Conference on Learning Representations, 2020b.
- Wang et al. [2020c] H. Wang, X. Wu, Z. Huang, and E. P. Xing. High-frequency component helps explain the generalization of convolutional neural networks. In CVPR, 2020c.
- Wang et al. [2021] H. Wang, C. Xiao, J. Kossaifi, Z. Yu, A. Anandkumar, and Z. Wang. AugMax: Adversarial composition of random augmentations for robust training. In NeurIPS, 2021.
- Wang et al. [2018a] S. Wang, Y. Chen, A. Abdou, and S. Jana. Mixtrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625, 2018a.
- Wang et al. [2018b] S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana. Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems, pages 6369–6379, 2018b.
- Wang et al. [2020d] Z. Wang, B. Dai, D. Wipf, and J. Zhu. Further analysis of outlier detection with deep generative models. arXiv preprint arXiv:2010.13064, 2020d.
- Wei et al. [2020] H. Wei, L. Feng, X. Chen, and B. An. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR, pages 13726–13735, 2020.
- Wexler [2017] J. Wexler. Facets: An open source visualization tool for machine learning training data. Google Open Source Blog, 2017.
- Winkens et al. [2020] J. Winkens, R. Bunel, A. G. Roy, R. Stanforth, V. Natarajan, J. R. Ledsam, P. MacWilliams, P. Kohli, A. Karthikesalingam, S. Kohl, et al. Contrastive training for improved out-of-distribution detection. arXiv preprint arXiv:2007.05566, 2020.
- Wong and Kolter [2018] E. Wong and Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283–5292, 2018.
- Wong et al. [2020] E. Wong, L. Rice, and J. Z. Kolter. Fast is better than free: Revisiting adversarial training. arXiv:2001.03994, 2020.
- Wongsuphasawat et al. [2017] K. Wongsuphasawat, D. Smilkov, J. Wexler, J. Wilson, D. Mane, D. Fritz, D. Krishnan, F. B. Viégas, and M. Wattenberg. Visualizing dataflow graphs of deep learning models in tensorflow. TVCG, 2017.
- Wu et al. [2020] B. Wu, J. Chen, D. Cai, X. He, and Q. Gu. Does network width really help adversarial robustness? arXiv preprint arXiv:2010.01279, 2020.
- Wu et al. [2018] M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Wu et al. [2021] R. Wu, C. Guo, Y. Su, and K. Q. Weinberger. Online adaptation to label distribution shift. In Advances in Neural Information Processing Systems, 2021.
- Wu et al. [2019] Z. Wu, K. Suresh, P. Narayanan, H. Xu, H. Kwon, and Z. Wang. Delving into robust object detection from unmanned aerial vehicles: A deep nuisance disentanglement approach. In ICCV, 2019.
- Xiao et al. [2018a] C. Xiao, R. Deng, B. Li, F. Yu, M. Liu, and D. Song. Characterizing adversarial examples based on spatial consistency information for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018a.
- Xiao et al. [2018b] C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial examples. arXiv preprint arXiv:1801.02612, 2018b.
- Xiao et al. [2019a] C. Xiao, D. Yang, B. Li, J. Deng, and M. Liu. Meshadv: Adversarial meshes for visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6898–6907, 2019a.
- Xiao et al. [2019b] K. Y. Xiao, V. Tjeng, N. M. Shafiullah, and A. Madry. Training for faster adversarial robustness verification via inducing ReLU stability. ICLR, 2019b.
- Xie and Yuille [2019] C. Xie and A. Yuille. Intriguing properties of adversarial training at scale. arXiv preprint arXiv:1906.03787, 2019.
- Xie et al. [2018] C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. Mitigating adversarial effects through randomization. In ICLR, 2018.
- Xie et al. [2019] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 501–509, 2019.
- Xie et al. [2020a] C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le. Adversarial examples improve image recognition. In CVPR, pages 819–828, 2020a.
- Xie et al. [2020b] C. Xie, M. Tan, B. Gong, A. Yuille, and Q. V. Le. Smooth adversarial training. arXiv preprint arXiv:2006.14536, 2020b.
- Xu et al. [2017] W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
- Xu et al. [2021] Z. Xu, D. Liu, J. Yang, and M. Niethammer. Robust and generalizable visual representation learning via random convolutions. In International Conference on Learning Representations, 2021.
- Yamaguchi et al. [2016] T. Yamaguchi, T. Kaga, A. Donzé, and S. A. Seshia. Combining requirement mining, software model checking and simulation-based verification for industrial automotive systems. In FMCAD, pages 201–204. IEEE, 2016.
- Yang et al. [2021] J. Yang, H. Wang, L. Feng, X. Yan, H. Zheng, W. Zhang, and Z. Liu. Semantically coherent out-of-distribution detection. In ICCV, pages 8301–8309, 2021.
- Ye et al. [2019] S. Ye, K. Xu, S. Liu, H. Cheng, J.-H. Lambrechts, H. Zhang, A. Zhou, K. Ma, Y. Wang, and X. Lin. Adversarial robustness vs. model compression, or both? In ICCV, 2019.
- Yeh et al. [2020] C.-K. Yeh, B. Kim, S. Arik, C.-L. Li, T. Pfister, and P. Ravikumar. On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems, 33, 2020.
- Yosinski et al. [2014] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
- You et al. [2020] Y. You, T. Chen, Z. Wang, and Y. Shen. When does self-supervision help graph convolutional networks? In International Conference on Machine Learning, pages 10871–10880. PMLR, 2020.
- Yu and Aizawa [2019] Q. Yu and K. Aizawa. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In ICCV, pages 9518–9526, 2019.
- Yu et al. [2019] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama. How does disagreement help generalization against label corruption? In ICML, pages 7164–7173, 2019.
- Yuan et al. [2020] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911, 2020.
- Yue et al. [2019] X. Yue, Y. Zhang, S. Zhao, A. Sangiovanni-Vincentelli, K. Keutzer, and B. Gong. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In ICCV, pages 2100–2110, 2019.
- Yun et al. [2019] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019.
- Yun et al. [2021] S. Yun, S. J. Oh, B. Heo, D. Han, J. Choe, and S. Chun. Re-labeling ImageNet: From single to multi-labels, from global to localized labels. In CVPR, 2021.
- Zeiler and Fergus [2014] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
- Zhang et al. [2019a] D. Zhang, T. Zhang, Y. Lu, Z. Zhu, and B. Dong. You only propagate once: Accelerating adversarial training via maximal principle. arXiv preprint arXiv:1905.00877, 2019a.
- Zhang et al. [2017] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. MixUp: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- Zhang et al. [2019b] H. Zhang, H. Chen, C. Xiao, S. Gowal, R. Stanforth, B. Li, D. Boning, and C.-J. Hsieh. Towards stable and efficient training of verifiably robust neural networks. arXiv preprint arXiv:1906.06316, 2019b.
- Zhang et al. [2019c] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. E. Ghaoui, and M. Jordan. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, 2019c.
- Zhang et al. [2020a] J. Zhang, J. Zhu, G. Niu, B. Han, M. Sugiyama, and M. Kankanhalli. Geometry-aware instance-reweighted adversarial training. arXiv preprint arXiv:2010.01736, 2020a.
- Zhang et al. [2021a] J. Zhang, N. Inkawhich, Y. Chen, and H. Li. Fine-grained out-of-distribution detection with mixup outlier exposure. arXiv preprint arXiv:2106.03917, 2021a.
- Zhang et al. [2020b] J. M. Zhang, M. Harman, L. Ma, and Y. Liu. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering, 2020b.
- Zhang et al. [2018] Q. Zhang, W. Wang, and S.-C. Zhu. Examining cnn representations with respect to dataset bias. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Zhang [2019] R. Zhang. Making convolutional networks shift-invariant again. In International Conference on Machine Learning, pages 7324–7334. PMLR, 2019.
- Zhang and LeCun [2017] X. Zhang and Y. LeCun. Universum prescription: Regularization using unlabeled data. In AAAI, 2017.
- Zhang et al. [2019d] X. Zhang, Z. Wang, D. Liu, and Q. Ling. Dada: Deep adversarial data augmentation for extremely low data regime classification. In ICASSP. IEEE, 2019d.
- Zhang et al. [2021b] Y. Zhang, H. Ling, J. Gao, K. Yin, J.-F. Lafleche, A. Barriuso, A. Torralba, and S. Fidler. Datasetgan: Efficient labeled data factory with minimal human effort. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10145–10155, 2021b.
- Zhang et al. [2019e] Z. Zhang, A. V. Dalca, and M. R. Sabuncu. Confidence calibration for convolutional neural networks using structured dropout. arXiv preprint arXiv:1906.09551, 2019e.
- Zhao et al. [2020] L. Zhao, T. Liu, X. Peng, and D. Metaxas. Maximum-entropy adversarial data augmentation for improved generalization and robustness. In Advances in Neural Information Processing Systems, volume 33, 2020.
- Zheng et al. [2016] S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In CVPR, 2016.
- Zhong et al. [2020] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, 2020.
- Zhou et al. [2018] B. Zhou, Y. Sun, D. Bau, and A. Torralba. Interpretable basis decomposition for visual explanation. In ECCV, 2018.
- Zong et al. [2018] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In International Conference on Learning Representations, 2018.